RAL Tier1 weekly operations Grid 20100426
From GridPP Wiki
Revision as of 17:41, 27 April 2010 by Andrew lahiff (Talk | contribs)
Operational Issues
Description | Start | End | Affected VO(s) | Severity | Status |
---|---|---|---|---|---|
Job status monitoring from CREAMCE | 2-Feb-2010 | | CMS | medium | [10-Feb-2010] WMS patch available soon; CREAMCE new version available soon. [07-Apr-2010] CMS tests have shown that WMS patches resolve the problem; still waiting for patch to be installed on the production WMSs in Italy. |
APEL publishing problem | | | all | low | APEL publishing isn't working |
Downtimes
Description | Hosts | Type | Start | End | Affected VO(s) |
---|---|---|---|---|---|
Blocking Issues
Description | Requested Date | Required By Date | Priority | Status |
---|---|---|---|---|
Hardware for Testbed | | | Medium | Required for change validation, load testing, etc. Also for phased rollout (which replaces PPS). Have initial hardware. [2010-02-22] More hardware expected by end of March. |
HW for SL5 CMS PhEDEx VOBOX | 19-Mar-2010 | | High | Required to replace the existing SL4 machine. [2010-04-21] PhEDEx is no longer supported on SL4. |
Developments/Plans
Highlights for Tier-1 Ops Meeting
Highlights for Tier-1 VO Liaison Meeting
Detailed Individual Reports
Alastair
- Working on ATLAS software server upgrade (testing with Jonathan starting tomorrow)
- Working on setting up and testing ATLASGROUP disk at RAL.
- Working with B-Physics Group on group analysis requirements (TAG based analysis).
- Looking into ATLAS PFC (Pool File Catalogue) problems.
Andrew
- Fixes to CMS tape pool & high-time-resolution CPU efficiency Ganglia plots [Done]
- Changed PhEDEx custom CASTOR stager agent to generic stager agent; limited number of files staged to 150 per 10 minutes. [Done]
- CMS data ops
- Dealing with last processing & merge jobs from remaining workflows
- Re-started backfill at CNAF for testing changes to their storage system
- Attended CMS UK computing meeting in Bristol on Wednesday
- Writing script to compare checksums of random files from specific files in CASTOR with PhEDEx [Ongoing]
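A minimal sketch of the kind of checksum spot check mentioned in the last item, not the actual script. It assumes Adler-32 checksums, files readable through a local path, and a catalogue supplied as a plain dictionary of path to hex checksum; the function names `adler32_of_file` and `sample_mismatches` are hypothetical, and no CASTOR or PhEDEx interface is called.

```python
import random
import zlib


def adler32_of_file(path, chunk_size=1024 * 1024):
    """Return the Adler-32 checksum of a file as 8 lowercase hex digits."""
    value = 1  # Adler-32 starting value
    with open(path, "rb") as handle:
        while True:
            chunk = handle.read(chunk_size)
            if not chunk:
                break
            # Feed the running value back in so large files are read in chunks.
            value = zlib.adler32(chunk, value)
    return format(value & 0xFFFFFFFF, "08x")


def sample_mismatches(catalogue, sample_size, read_checksum=adler32_of_file, seed=None):
    """Check a random sample of catalogue entries against the stored files.

    catalogue: dict mapping file path -> expected hex checksum (assumed format).
    read_checksum: callable returning the checksum actually found for a path.
    Returns a list of (path, expected, actual) tuples for entries that differ.
    """
    rng = random.Random(seed)
    names = sorted(catalogue)
    chosen = rng.sample(names, min(sample_size, len(names)))
    mismatches = []
    for name in chosen:
        expected = catalogue[name].lower()
        actual = read_checksum(name)
        if actual != expected:
            mismatches.append((name, expected, actual))
    return mismatches
```

Passing a different `read_checksum` callable would let the same sampling logic query a storage system for its recorded checksum instead of re-reading the file.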
Catalin
- ALICE VOBOXes gLite updates [done]
- various OS updates [done]
- Self Service Tools training [done]
- APR [ongoing]
- ATLAS, ALICE phone calls
- Install and configure Squid on LHCb VOBOX [ongoing]
Derek
- Investigating scheduler avoidance of new WNs [Ongoing]
- Evaluating cloud technology for Grid Services testbed use [Ongoing]
- APR [Ongoing]
- SSC Training [Done]
- Requested renewal of 3 CE certificates
Matt
- Arrange meeting to discuss 09 disk deployment [Done]
- Add FTS check for jobs stuck in Preparing state [Done]
- Review batch system change controls [Done]
- Look at draft User Board allocations; update CPU/disk capacity profiles [ongoing]
Richard
- 1/2 days Oracle/SSC Training (Thu)
- Drafted a Change Control request to move some of the BDII servers to the Atlas building for greater resilience
- Working on proposal on intra-/inter-team communication to meet an action from the team awayday
- Reviewing G/S process documentation
- Further Nagios items from the to-do list (https://wiki.e-science.cclrc.ac.uk/web1/bin/view/EScienceInternal/NagiosTasksToDo)
- CASTOR items:
- Continuing p/p stress testing
Mayo
- Implement feedback into TSBN web interface
- Set up scripts that update TSBN interface to run as scheduled jobs on a Windows machine
- Writing and configuring Nagios NRPE plugins [Done]
- Certificate viewer for NGS cert wizard
- Write PDU power controller query script [Done]
- Write a script to turn PDU ports off
VO Reports
ALICE
Would like CREAM-CE v1.6 to be installed asap
ATLAS
- Not heard anything about LHC technical stop. (Maybe I missed something/it will be announced tomorrow)
- LHC continuing to slowly increase luminosity.
- ATLAS Software and computing week. (More chaotic than usual due to the volcano)
- Fast 'fast re-processing' could start this week.
- ATLAS announced that it would like all disk space in production by June 1st.
CMS
- PhEDEx on SL4 is no longer supported. 3.3.1 has just been released for SL5 only.
LHCb
OnCall/AoD Cover
- Primary OnCall:
- Grid OnCall: Derek
- AoD: