RAL Tier1 weekly operations Grid 20100531
From GridPP Wiki
Operational Issues
Description | Start | End | Affected VO(s) | Severity | Status |
---|---|---|---|---|---|
Job status monitoring from CREAMCE | 2-Feb-2010 | | CMS | medium | [10-Feb-2010] WMS patch available soon; CREAMCE new version available soon. [07-Apr-2010] CMS tests have shown that WMS patches resolve the problem; still waiting for patch to be installed on the production WMSs in Italy |
Downtimes
Description | Hosts | Type | Start | End | Affected VO(s) |
---|---|---|---|---|---|
Blocking Issues
Description | Requested Date | Required By Date | Priority | Status |
---|---|---|---|---|
HW needed to test Dataguard technology for LFC/FTS | 19 May 2010 | 15 June 2010 | Low | [24-05-2010] HW available; needs to be deployed by Fabric and then handed over to Dataservices |
Developments/Plans
Highlights for Tier-1 Ops Meeting
Highlights for Tier-1 VO Liaison Meeting
Detailed Individual Reports
Alastair
- Working on ATLAS software server upgrade [ongoing]
- Looking into ATLAS PFC (Pool File Catalogue) problems.
- Testing FTS and checksumming at RAL.
- Deploying 22 disk servers into NonProd.
Andrew
- Installing & setting up PhEDEx on SL5 VOBOX, updating documentation, monitoring [Done]
- Migration to use of FTS groups in FTS "cloud" channels [Ongoing]
- V09 disk server deployment into cmsNonProd [Done]
- A few FTS channel adjustments for ATLAS [Done]
- Updates to accounting scripts for T2K [Done]
- Archived old APEL records (+ wrote documentation); cleaned up tables [Done]
- Added two new checks to fts-checks.pl [Done]
- Recovering CMS file from bad tape (CS6000) [Ongoing]
- CMS data ops:
  - Running MC production workflow at RAL, PIC, CNAF [Done]
  - Running MC rereco preproduction at CNAF
Catalin
- work on CMS PhEDEx and blparser Nagios monitoring [ongoing]
- configure squid on LHCb VOBOX [ongoing]
- LFC/FTS replication (w/ Carmine) [ongoing]
- job plans [ongoing]
Derek
- Intervention on lcgce08 for glexec [Done]
- Beta test of new APEL CE parser [In progress]
- CIP incident review [Done]
- Adding new hosts to testbed [Done]
- Extended time limit on grid2000M queue [Done]
- Enabled ngs.ac.uk VO on grid2000M queue [Done]
- Announced SL4 Farm closure on GridPP-Users [Done]
- Sick Monday [Done]
- A/L all week
Matt
- Job Plans [Ongoing]
- Adjust FTS channel config policies that lead to opportunistic use of empty slots by other VOs [Done]
- Team development talk [Done]
Richard
- APR-Signoff [Done]
- Entered Job Plan info SSC [Done]
- Worked with Jonathan to get NIS netgroups up to date (partly for convenience of having ~ mounted when logging into machines but also for the sake of reducing the number of messages that Production Team need to wade through)
- Worked on the "missing CIP" problem
- Built an additional top-level BDII server on testbed machine (lcg0628) to test behaviour on removing "schemacheck off" directive from /opt/bdii/etc/bdii-slapd.conf
- Looking at the site-bdii timeout problem
- Working on a proposal on intra-/inter-team communication to meet an action from the team awayday
- Reviewing G/S process documentation
- Further Nagios items from the to-do list (https://wiki.e-science.cclrc.ac.uk/web1/bin/view/EScienceInternal/NagiosTasksToDo)
- CASTOR items:
  - Wrote up results from p/p stress tests [Done]
  - Ran functional test suite on p/p [Done]
Mayo
- Implement David Meredith's feedback into Certificate viewer [Done]
- Integrate certificate viewer module with existing NGS certificate wizard code
- Write script to control ports on multiple PDUs
- Create handover documentation for finished projects [ongoing]
- Enter job plan into SSC
VO Reports
ALICE
- waiting for CREAM-CE 1.6 deployment at RAL
- asked about Castor@RAL status and plans
ATLAS
CMS
- Started using test CERN FTS endpoint (latest version of FTS) for the PhEDEx debug instance for CERN - RAL transfers.
- All PhEDEx instances (prod, debug, dev) now running on new SL5 VOBOX (lcgvo-02-21)
LHCb
OnCall/AoD Cover
- Primary OnCall:
- Grid OnCall:
- AoD: