RAL Tier1 weekly operations Grid 20100614
Operational Issues
Description | Start | End | Affected VO(s) | Severity | Status |
---|---|---|---|---|---|
Job status monitoring from CREAMCE | 2-Feb-2010 | | CMS | medium | [10-Feb-2010] WMS patch available soon; CREAMCE new version available soon. [07-Apr-2010] CMS tests have shown that WMS patches resolve the problem; still waiting for patch to be installed on the production WMSs in Italy |
Downtimes
Description | Hosts | Type | Start | End | Affected VO(s) |
---|---|---|---|---|---|
Blocking Issues
Description | Requested Date | Required By Date | Priority | Status |
---|---|---|---|---|
HW needed to test Dataguard technology for LFC/FTS | 19 May 2010 | 15 June 2010 | Medium | [24-05-2010] HW available; needs to be deployed by Fabric and then handed over to Dataservices |
Firewall change for lcgce03 | 17 May 2010 | 15 June 2010 | Medium | Required to deploy lcgce03 as Production CREAM CE for ALICE. [11/06/10] Request made to networking on 9/6/10 |
Developments/Plans
Highlights for Tier-1 Ops Meeting
Highlights for Tier-1 VO Liaison Meeting
- FTS2.2.4 upgrade done; MICE support added.
- BDII: mitigation in place for the problem seen at the weekend; Tuesday's problem did not have the same cause.
Detailed Individual Reports
Alastair
- Working on ATLAS software server on /afs
- Writing scripts/checks to allow faster identification of causes of high transfer rate problem seen last week.
- Group production work at RAL.
- Working to improve pbsjobs database to allow easier monitoring of production work.
Andrew
- Putting job plan into Oracle [Ongoing]
- Accounting
- May accounting [Done]
- Wrote script to add CASTOR tape usage to UB schedule MySQL database without requiring tape spreadsheet [Done]
- Updated published CPU capacities, scaling factors for SL09 WNs, Ganglia capacity scripts, wrote documentation [Done]
- Added some ops users to lcgvo-02-21 so that SAM tests will work [Done]
- FTS
- Updated services.xml due to missing endpoint [Done]
- Changed STAR-UKISOUTHGRIDRALPP from srmcopy to urlcopy; increased transfer marker timeout [Done]
- Updated RGMA ACL [Done]
- CMS data ops
- Running skims at FNAL [Done]
- Running Run2010A rereco at RAL, IN2P3, FNAL
- Attended Oracle finance & OTL training [Done]
Catalin
- Test LFC deployment using quattor [ongoing]
- Configure squid on LHCb VOBOX [ongoing]
- Job plans into Oracle [ongoing]
Derek
- Testbed Strategy [ongoing]
- E-mailing experiment contacts about SL4 shutdown
- Setting up NGS UEE on worker nodes
- Change control for deploying lcgce03 [ongoing]
- Testing glexec update
- Configuring pool accounts in quattor
Matt
- Produce FTS training material
- Talk on ongoing SVN work for OnCall meeting
- Test upgrade path to FTS2.2.4 [Done]
- Submit Change Control request for FTS2.2.4 upgrade [Done]
- Construct end-to-end timeline for 08 and 09 disk deployments [Done]
Richard
- Added extra logic into the CIP->site BDII "bridging" script to check for existence of particular items rather than just non-zero volume of output [Done]
- Built LCG0630 as a top-level BDII to test quattor configuration of the "cachesize" directive in the glue-slapd.conf
- Further work on the "team status page" being developed as an action from team awayday
- Reviewing G/S process documentation
- Adding a Nagios check to look for the error that gave rise to the weekend's BDII problems
- CASTOR items:
  - Upgraded central name server in pre-prod
  - Ran functional tests on pre-prod
  - Finishing adding metrics to pre-prod benchmark results wiki page
- Next Week
  - Complete running of the pre-prod stress tests
  - Take the logic developed for the CIP->site BDII script and create a Nagios check to see how often the condition arises
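The content check described above (testing for particular items rather than just non-empty output) and the planned Nagios check could share one routine. The following is a minimal sketch only, not the actual bridging script: the marker strings in `REQUIRED_MARKERS` and the function names are assumptions chosen for illustration, since the report does not say which items the real script looks for. Only the exit codes follow the standard Nagios plugin convention (0 = OK, 2 = CRITICAL).

```python
import sys

# Standard Nagios plugin exit codes.
OK, CRITICAL = 0, 2

# ASSUMPTION: illustrative markers only. The items checked by the real
# CIP->site BDII bridging script are not listed in this report.
REQUIRED_MARKERS = (
    "objectClass: GlueSE",
    "objectClass: GlueSA",
)


def check_ldif(text, required=REQUIRED_MARKERS):
    """Return (status, message) for a block of LDIF text.

    Non-empty output is not enough: every required marker must
    appear as a complete line somewhere in the text.
    """
    if not text.strip():
        return CRITICAL, "CRITICAL: output is empty"
    lines = {line.strip() for line in text.splitlines()}
    missing = [marker for marker in required if marker not in lines]
    if missing:
        return CRITICAL, "CRITICAL: missing " + ", ".join(missing)
    return OK, "OK: all %d required items present" % len(required)


def main(stream=sys.stdin):
    """Read LDIF from a stream, print a one-line status, return the exit code."""
    status, message = check_ldif(stream.read())
    print(message)
    return status
```

To use this as a Nagios plugin, one would pipe the information provider's output into a script that ends with `sys.exit(main())`. Counting how often the CRITICAL branch fires would then show how frequently the condition arises.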
Mayo
- Implement David Meredith's feedback into Certificate viewer [Done]
- Integrate certificate viewer module with existing NGS certificate wizard code
- Write script to control ports on multiple PDUs
- Create Handover Documentation for finished projects [ongoing]
- Enter job plan into ssc
VO Reports
ALICE
- Waiting for CREAM-CE 1.6 deployment at RAL
- Cannot roll out new xrootd version (20100510-1509_dbg) on CASTOR 2.1.7
ATLAS
CMS
- Very large MC reprocessing will begin soon
LHCb
OnCall/AoD Cover
- Primary OnCall:
- Grid OnCall: Derek (Mon-Sun)
- AoD: