RAL Tier1 weekly operations Grid 20100607
From GridPP Wiki
Revision as of 12:10, 9 June 2010 by Matt hodges (Talk | contribs)
Operational Issues
Description | Start | End | Affected VO(s) | Severity | Status |
---|---|---|---|---|---|
LFC/FTS unscheduled downtime | Wed 2 June 14:00 | Wed 2 June 17:30 | All | Critical | Following the Oracle updates to DB systems, one Oracle RAC node became unstable. The patch had to be rolled back and services restarted. |
Job status monitoring from CREAMCE | 2-Feb-2010 | | CMS | Medium | [10-Feb-2010] WMS patch available soon; CREAMCE new version available soon. [07-Apr-2010] CMS tests have shown that WMS patches resolve the problem; still waiting for patch to be installed on the production WMSs in Italy. |
Downtimes
Description | Hosts | Type | Start | End | Affected VO(s) |
---|---|---|---|---|---|
Blocking Issues
Description | Requested Date | Required By Date | Priority | Status |
---|---|---|---|---|
HW needed to test Dataguard technology for LFC/FTS | 19 May 2010 | 15 June 2010 | Medium | [24-05-2010] HW available; needs to be deployed by Fabric and then handed over to Dataservices |
Firewall change for lcgce03 | 17 May 2010 | 15 June 2010 | Medium | Required to deploy lcgce03 as Production CREAM CE for Alice |
Developments/Plans
Highlights for Tier-1 Ops Meeting
- Construct end-to-end timeline for 08 and 09 disk deployments
- Site BDII checks to detect null output from CIP (but not resource BDII)
Highlights for Tier-1 VO Liaison Meeting
- FTS2.2.4 testing
- Disk deployments from nonProd to Prod for ATLAS and CMS
Detailed Individual Reports
Alastair
- Away end of last week, working from home beginning of this week.
- Assisting with debugging new ATLAS release. Ongoing problems with missing libraries.
- Testing FTS and checksumming at RAL to try to reduce backlog problems.
- Deploying 22 disk servers into production.
Andrew
- Putting job plan into Oracle [Ongoing]
- Working on analysis of VO support survey responses [Ongoing]
- Completed sorting out files from bad tape CS6000 [Done]
- May accounting, including more APEL fixing; updates to scripts for T2K; wrote script to automate checking of CESGA; investigating why ATLAS has split into 2 distinct VOs on CESGA webpage
- RGMA ACL update [Done]
- Investigated new CMS VOBOX proxy renewal daemon failure; added cron to do additional checks [Done]
- Added additional checking for PhEDEx for detecting problems staging files [Done]
- CMS data ops
- Running skims at FNAL
- Next week: deploy CMS V09 disk servers to cmsFarmRead
Catalin
- gLite updates on WMS [done]
- kernel upgrades [done]
- configure squid on LHCb VOBOX [ongoing]
- LFC/FTS replication (w/ Carmine) [ongoing]
- job plans [ongoing]
- test LFC deployment using quattor
Derek
- A/L all week [Done]
- CPU Capacity publishing update [Done]
- Testbed Strategy
- E-mailing experiment contacts about SL4 shutdown
- Setting up NGS UEE on worker nodes
- Change control for deploying lcgce03
- Quattor helpdesk queue [Done]
- Testing glexec update
- Attending HEPSYSMAN (Thu-Fri)
Matt
- Test upgrade path to FTS2.2.4 and agree schedule with production team
- Construct end-to-end timeline for 08 and 09 disk deployments
- Extra checking of CIP output on site BDIIs
Richard
- Added extra logic into the CIP->site BDII "bridging" script to check for existence of particular items rather than just non-zero volume of output
- Built LCG0630 as a top-level BDII to test quattor configuration of the "cachesize" directive in the glue-slapd.conf
- Further work on the "team status page" being developed as an action from team awayday
- Reviewing G/S process documentation
- CASTOR items:
  - Ran pre-prod stress tests
- Next Week
  - Complete running of the pre-prod stress tests
  - Take the logic developed for the CIP->site BDII script and create a Nagios check to see how often the condition arises
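The idea behind Richard's CIP->site BDII items (check that particular items exist, rather than only that the output is non-empty) could be sketched roughly as below. This is a minimal illustration, not the actual bridging script: the function name `check_cip_output` and the patterns in `REQUIRED_PATTERNS` are assumptions chosen for the example, and the real script may test different items. The return codes follow the usual Nagios plugin convention (0 = OK, 2 = CRITICAL).

```python
import re

# Standard Nagios plugin exit codes.
OK, CRITICAL = 0, 2

# Hypothetical list of required items. These patterns are illustrative
# assumptions only, not the items the real bridging script checks.
REQUIRED_PATTERNS = [
    r"^dn:\s*GlueSEUniqueID=",
    r"^GlueSATotalOnlineSize:\s*\d+",
]


def check_cip_output(ldif_text, required=REQUIRED_PATTERNS):
    """Return (exit_code, message) for a block of CIP output.

    Non-empty output is not enough: every required pattern must
    match at least one line of the output.
    """
    if not ldif_text.strip():
        return CRITICAL, "CRITICAL: CIP output is empty"
    missing = [p for p in required
               if not re.search(p, ldif_text, re.MULTILINE)]
    if missing:
        return CRITICAL, "CRITICAL: missing items: " + ", ".join(missing)
    return OK, "OK: all required items present"
```

A Nagios wrapper would print the message and exit with the returned code; logging each CRITICAL result would show how often the condition arises.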
Mayo
- Implement David Meredith's feedback into Certificate viewer [Done]
- integrate certificate viewer module with existing NGS certificate wizard code
- Write script to control ports on multiple PDUs
- Create handover documentation for finished projects [ongoing]
- Enter job plan into ssc
VO Reports
ALICE
- waiting for CREAM-CE 1.6 deployment at RAL
ATLAS
CMS
- Reprocessing requests have started appearing for ICHEP
- Mostly skimming has been running at T1s over the past week or so
LHCb
OnCall/AoD Cover
- Primary OnCall: Catalin (Tue-Sun)
- Grid OnCall: Derek (Mon)
- AoD: Catalin (Tue)