RAL Tier1 weekly operations Grid 20100524
From GridPP Wiki
Operational Issues
Description | Start | End | Affected VO(s) | Severity | Status |
---|---|---|---|---|---|
Job status monitoring from CREAMCE | 2-Feb-2010 | | CMS | medium | [10-Feb-2010] WMS patch available soon; new CREAMCE version available soon. [07-Apr-2010] CMS tests have shown that the WMS patches resolve the problem; still waiting for the patches to be installed on the production WMSs in Italy |
Downtimes
Description | Hosts | Type | Start | End | Affected VO(s) |
---|---|---|---|---|---|
Blocking Issues
Description | Requested Date | Required By Date | Priority | Status |
---|---|---|---|---|
HW needed to test Dataguard technology for LFC/FTS | 19 May 2010 | 15 June 2010 | Low | [24-05-2010] HW available; needs to be deployed by Fabric and then handed over to Dataservices |
Developments/Plans
Highlights for Tier-1 Ops Meeting
- Installed FTS2.2.4 pre-release on test endpoint for functional testing by ATLAS
- Testing FTS Group configuration (to replace cloud configuration)
- Added service owner info to oncall alarm response document
- VOBOXs: squid service for LHCb; migration of PhEDEx for CMS
Highlights for Tier-1 VO Liaison Meeting
- Fixed FTS configuration to impose per-VO file limits; default changed in FTS2.2.3.
- Expect all disk servers needed to meet WLCG pledges to be in NonProd by end of week.
Detailed Individual Reports
Alastair
- Working on ATLAS software server upgrade [ongoing]
- Looking into ATLAS PFC (Pool File Catalogue) problems.
- Testing FTS and checksumming at RAL.
- Deploying 22 disk servers into NonProd.
Andrew
- Job plan
- APEL consistency checking [Done]
- Installing & setting up PhEDEx on SL5 VOBOX [Ongoing]
- Migration to use of FTS groups in FTS "cloud" channels [Ongoing]
- Started V09 disk server deployment into cmsNonProd [Ongoing; delay due to SL5 LSF issues]
- A few FTS channel adjustments for Bristol & Estonia [Done]
- CMS data ops
- Backfill at RAL & PIC [Ongoing]
- Started MC production workflow at RAL, PIC, CNAF (52472 jobs)
Catalin
- ATLAS Frontier server updates [done]
- ATLAS Frontier documentation in SVN [done]
- work on CMS PhEDEx and blparser Nagios monitoring [ongoing]
- configure squid on LHCb VOBOX [ongoing]
- gLite updates on LHCb VOBOX [done]
- LFC/FTS replication (w/ Carmine) [ongoing]
- job plans [ongoing]
Derek
- Intervention on lcgce06 for glexec [Done]
- Intervention on lcgce07 for glexec
- Sync of templates with QWG for gLite 3.1 and 3.2 [done]
- Testing CREAM CE 1.6
Matt
- Job Plans
- Adjust FTS channel config policies that lead to opportunistic use of empty slots by other VOs
- Investigate problem with FTS file limits being exceeded [Done]
- Install FTS2.2.4 pre-release on test endpoint [Done]
- Add service owner info to oncall alarm response document [Done]
- Team development talk
Richard
- APR-Signoff [Done]
- Entered Job Plan info into SSC [Done]
- Worked with Jonathan to bring NIS netgroups up to date (partly for the convenience of having ~ mounted when logging into machines, but also to reduce the number of messages that the Production Team needs to wade through)
- Worked on the "missing CIP" problem
- Built an additional top-level BDII server on a testbed machine (lcg0628) to test the behaviour when the "schemacheck off" directive is removed from /opt/bdii/etc/bdii-slapd.conf
- Looking at the site-bdii timeout problem
- Working on a proposal on intra-/inter-team communication to meet an action from the team awayday
- Reviewing G/S process documentation
- Further Nagios items from the to-do list (https://wiki.e-science.cclrc.ac.uk/web1/bin/view/EScienceInternal/NagiosTasksToDo)
- CASTOR items:
- Wrote up results from p/p stress tests [Done]
- Ran functional test suite on p/p [Done]
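The BDII test in Richard's list involves removing the "schemacheck off" directive from a slapd configuration file. A minimal sketch of how such an edit could be scripted is shown here. The function name `strip_directive` and the sample configuration text are hypothetical and are not taken from the actual /opt/bdii/etc/bdii-slapd.conf; the sketch works on a string only and does not touch any file.

```python
def strip_directive(conf_text: str, directive: str = "schemacheck") -> str:
    """Return conf_text without active lines whose first token is `directive`.

    Commented-out lines (first token starting with '#') are kept unchanged.
    """
    kept = []
    for line in conf_text.splitlines(keepends=True):
        tokens = line.split()
        # Drop only lines where the directive is the first token.
        if tokens and tokens[0].lower() == directive.lower():
            continue
        kept.append(line)
    return "".join(kept)


# Hypothetical configuration fragment, for illustration only.
sample = (
    "loglevel 0\n"
    "schemacheck off\n"
    "# schemacheck off\n"
    "sizelimit 80000\n"
)

cleaned = strip_directive(sample)
print(cleaned, end="")
```

Working on a copy of the file contents and comparing it with the original before restarting the service keeps the change easy to revert on a testbed machine.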
Mayo
- Implement David Meredith's feedback into Certificate viewer [Done]
- Integrate certificate viewer module with existing NGS certificate wizard code
- Write script to control ports on multiple PDUs
- Create handover documentation for finished projects [ongoing]
- Enter job plan into SSC
VO Reports
ALICE
- waiting for CREAM-CE 1.6 deployment at RAL
- asked about Castor@RAL status and plans
ATLAS
CMS
- Now using 8 primary datasets. Every CMS T1 site now receives one primary dataset for custodial storage.
LHCb
OnCall/AoD Cover
- Primary OnCall: Catalin (Mon-Fri)
- Grid OnCall: Derek (Fri-Sun)
- AoD: Catalin (Wed)