RAL Tier1 weekly operations Grid 20100621
Operational Issues
Description | Start | End | Affected VO(s) | Severity | Status |
---|---|---|---|---|---|
Job status monitoring from CREAMCE | 2-Feb-2010 | | CMS | medium | [10-Feb-2010] WMS patch available soon; CREAMCE new version available soon [07-Apr-2010] CMS tests have shown that WMS patches resolve the problem; still waiting for patch to be installed on the production WMSs in Italy |
Downtimes
Description | Hosts | Type | Start | End | Affected VO(s) |
---|---|---|---|---|---|
Blocking Issues
Description | Requested Date | Required By Date | Priority | Status |
---|---|---|---|---|
HW needed to test Dataguard technology for LFC/FTS | 19 May 2010 | 15 June 2010 | Medium | [24-05-2010] HW available; needs to be deployed by Fabric and then handed over to Dataservices |
Developments/Plans
Highlights for Tier-1 Ops Meeting
- FTS upgraded: since the update, no blocking transfers have been seen, and no resulting high load on the SRMs.
- Progressing: SL4 batch shutdown, deployment of second ALICE CREAM CE.
- CPU accounting pages switched from KSI2K to HEP-SPEC06.
Highlights for Tier-1 VO Liaison Meeting
- FTS upgraded: since the update, no blocking transfers have been seen, and no resulting high load on the SRMs.
- Progressing: SL4 batch shutdown, deployment of second ALICE CREAM CE.
- CPU accounting pages switched from KSI2K to HEP-SPEC06.
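For readers comparing figures from before and after the switch, the unit change can be sketched as below. This is a minimal illustration only: the factor of 4 HEP-SPEC06 per kSI2K is the commonly quoted WLCG conversion and is an assumption here, since this report does not state the factor used on the accounting pages. The names `HS06_PER_KSI2K` and `ksi2k_to_hs06` are hypothetical.

```python
# Assumed conversion factor (not stated in this report):
# 1 kSI2K is taken to equal 4 HEP-SPEC06.
HS06_PER_KSI2K = 4.0


def ksi2k_to_hs06(ksi2k: float) -> float:
    """Convert a figure in kSI2K (or kSI2K-hours) to HEP-SPEC06 (or HS06-hours)."""
    return ksi2k * HS06_PER_KSI2K


def hs06_to_ksi2k(hs06: float) -> float:
    """Inverse conversion, for comparing new figures against older kSI2K reports."""
    return hs06 / HS06_PER_KSI2K
```

The same factor applies to capacity (kSI2K) and to usage (kSI2K-hours), because only the benchmark unit changes.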
Detailed Individual Reports
Alastair
- Working on ATLAS software server on /afs
- Group production work at RAL.
- Working to improve pbsjobs database to allow easier monitoring of production work.
- Work on ATLAS Frontier service, monitoring and backup.
Andrew
- Accounting change from KSI2K to HEP-SPEC06
- Ganglia capacity monitoring updated & accounting pages (eff-stats.pl) [Done]
- UB schedule scripts updated [Done]
- FTS
- Downgraded lcgfts01 from 2.2.4 to 2.2.3 [Done]
- Added MICE to test endpoint [Done]
- Updated services.xml & added file limits for MICE to production endpoint [Done]
- Various file limit & timeout changes [Done]
- Updated RGMA ACL [Done]
- CMS data ops
- Running rereco & skimming at RAL, IN2P3, FNAL
- Running MC rereco at RAL, CNAF, IN2P3
Catalin
- test LFC deployment using quattor [ongoing]
- configure squid on LHCb VOBOX [ongoing]
- job plans into Oracle [ongoing]
Derek
- Testbed Strategy [ongoing]
- E-mailing experiment contacts about SL4 shutdown [done]
- Setting up NGS UEE on worker nodes
- Change control for deploying lcgce03 [ongoing]
- Testing glexec update
- Configuring pool accounts in quattor [ongoing]
- Fixed corrupt ICE database on lcgwms02
Matt
- Produce FTS training material
- Talk on ongoing SVN work for OnCall meeting
- Upgrade FTS to 2.2.4 [Done]
- Change Control workflow [Done]
Richard
- Further work on the "team status page" being developed as an action from team awayday
- Reviewing G/S process documentation
- Developed a tool to help with automating the wiki page on grid middleware versions
- Adding a Nagios check to look for the error that gave rise to the weekend's BDII problems
- CASTOR items:
- Carried out latest phase in pre-prod upgrade
- Next Week
- Finishing off 2.1.7 metrics documentation
- Run functional tests on pre-prod
- Run stress tests on pre-prod
- 1 day Tier1 AwayDay
Mayo
- Implement David Meredith's feedback into Certificate viewer [Done]
- integrate certificate viewer module with existing NGS certificate wizard code
- Write script to control ports on multiple PDUs
- Create Handover Documentation for finished projects [ongoing]
- Enter job plan into ssc
VO Reports
ALICE
- waiting for CREAM-CE 1.6 deployment at RAL
- cannot roll out new xrootd version (20100510-1509_dbg) on Castor 2.1.7
ATLAS
CMS
- Major disruption to data and MC reprocessing at all T1s due to central WMS problems (CMS normally only use CNAF WMSs for production jobs). Started to use some CERN and RAL WMSs in addition to CNAF.
LHCb
OnCall/AoD Cover
- Primary OnCall:
- Grid OnCall: Derek (Mon-Sun)
- AoD: