RAL Tier1 weekly operations Grid 20100628
Revision as of 12:41, 30 June 2010 by Matt hodges
Operational Issues
Description | Start | End | Affected VO(s) | Severity | Status |
---|---|---|---|---|---|
Job status monitoring from CREAMCE | 2-Feb-2010 | | CMS | medium | [10-Feb-2010] WMS patch available soon; CREAMCE new version available soon. [07-Apr-2010] CMS tests have shown that WMS patches resolve the problem; still waiting for patch to be installed on the production WMSs in Italy |
Downtimes
Description | Hosts | Type | Start | End | Affected VO(s) |
---|---|---|---|---|---|
Blocking Issues
Description | Requested Date | Required By Date | Priority | Status |
---|---|---|---|---|
HW needed to test Dataguard technology for LFC/FTS | 19 May 2010 | 15 June 2010 | Medium | [24-05-2010] HW available; needs to be deployed by Fabric and then handed over to Dataservices |
Developments/Plans
Highlights for Tier-1 Ops Meeting
- Testing updates to WN/glexec (to meet baseline requirements).
- Finalising Change Control for second ALICE CE (CE03).
- Ongoing work to finalise close of SL4 batch service.
- Looking at SARA-RAL transfer problems.
Highlights for Tier-1 VO Liaison Meeting
- Lots of errors on SARA-RAL channel due to missing source files.
- SL4 decommissioning deadline approaching (August 1).
- Approved change to rollout second CREAM CE for ALICE; need to schedule.
- Testing to further investigate CMS issues with WMS bulk job submission.
Detailed Individual Reports
Alastair
- Working on ATLAS software server on /afs [ongoing]
- Written script to identify unavailable files when a disk server is taken out of production. [testing]
- Looking into slow LHCb transfers between SARA and RAL. (fix with James T now)
- Working to improve pbsjobs database to allow easier monitoring of production work.
- Working on ATLAS Frontier service, monitoring and backup.
Andrew
- e-Science "away" day
- CMS data ops
  - Running data rereco & skimming at FNAL, PIC, KIT
  - Running MC rereco at RAL, CNAF, IN2P3
- Next week: adjust TFC for cmsTemp & do some testing
Catalin
- various WMS issues [ongoing]
- test LFC deployment using quattor [ongoing]
- LFC talk for NGS
- Frontier monitoring
- mysql pbsjobs DB issue
Derek
- Testbed Strategy [ongoing]
- E-mailing experiment contacts about SL4 shutdown [done]
- Setting up NGS UEE on worker nodes
- Change control for deploying lcgce03 [done]
- Testing glexec update [ongoing]
- Configuring pool accounts in quattor [ongoing]
- Implementing updated change control process on dev helpdesk
- Added plugin to sync between blog and twitter
Matt
- Produce FTS training material
- Talk on ongoing SVN work for OnCall meeting
Richard
- Planning updates to RAL top-level BDIIs
- Further work on the "team status page" being developed as an action from team awayday
- Reviewing G/S process documentation
- Developed a tool to help with automating the wiki page on grid middleware versions
- Wrote a gmetric script to monitor the # of entries in RAL BDII servers
- Writing a Nagios plugin to check the "deltas" in # of entries in RAL BDII servers
- CASTOR items:
  - Carried out latest phase in pre-prod upgrade
  - Ran 2.1.8 functional tests on latest pre-prod s/w
- Next Week
  - Finishing off 2.1.7 metrics documentation
  - Run functional tests on pre-prod
  - Run stress tests on pre-prod
  - 2 days A/L
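As an illustration of the Nagios plugin mentioned above (checking the "deltas" in the number of BDII entries), the core comparison could look roughly like the sketch below. This is not the plugin being written: the function name `check_delta` and the percentage thresholds are assumptions made for the example. Only the exit-code convention (0 OK, 1 WARNING, 2 CRITICAL, 3 UNKNOWN) is the standard Nagios one.

```python
# Standard Nagios plugin exit codes.
OK, WARNING, CRITICAL, UNKNOWN = 0, 1, 2, 3


def check_delta(previous, current, warn_pct=10.0, crit_pct=25.0):
    """Compare two BDII entry counts and return (exit_code, message).

    The thresholds are hypothetical defaults, not values used at RAL.
    """
    if previous <= 0:
        # Without a valid earlier count no relative change can be computed.
        return UNKNOWN, "UNKNOWN - no valid previous entry count"

    change_pct = abs(current - previous) / previous * 100.0
    detail = "entry count changed %.1f%% (%d -> %d)" % (change_pct, previous, current)

    if change_pct >= crit_pct:
        return CRITICAL, "CRITICAL - " + detail
    if change_pct >= warn_pct:
        return WARNING, "WARNING - " + detail
    return OK, "OK - " + detail
```

A real plugin would obtain `current` by querying the BDII, keep `previous` from the last run, print the message and exit with the returned code.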
Mayo
- Implement David Meredith's feedback into Certificate viewer [Done]
- integrate certificate viewer module with existing NGS certificate wizard code
- Write script to control ports on multiple PDUs
- Create handover documentation for finished projects [ongoing]
- Enter job plan into ssc
VO Reports
ALICE
- waiting for CREAM-CE 1.6 deployment at RAL
- cannot roll out new xrootd version (20100510-1509_dbg) on Castor 2.1.7
ATLAS
CMS
- Splitting cmsFarmRead into cmsFarmRead & a D1T0 service class called cmsTemp. Everything in /store/unmerged will go into cmsTemp rather than cmsFarmRead, as these are only temporary files that don't need to go to tape.
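The routing rule described above can be sketched as a simple path check. This is only an illustration of the intended behaviour, not the actual TFC or CASTOR configuration: the function name `service_class_for` is hypothetical, and treating `cmsFarmRead` as the fallback for every other path is an assumption made for the example.

```python
UNMERGED_PREFIX = "/store/unmerged/"


def service_class_for(lfn, default="cmsFarmRead"):
    """Return the service class a CMS logical file name would map to.

    Hypothetical helper: files under /store/unmerged are temporary and go
    to the D1T0 class cmsTemp; anything else falls back to `default`
    (an assumption, since other paths are not covered in this report).
    """
    # Add a trailing slash so that "/store/unmerged" itself matches while
    # a sibling such as "/store/unmergedX" does not.
    normalized = lfn if lfn.endswith("/") else lfn + "/"
    if normalized.startswith(UNMERGED_PREFIX):
        return "cmsTemp"
    return default
```

For example, `service_class_for("/store/unmerged/run1/file.root")` returns `"cmsTemp"`, while a path outside `/store/unmerged` returns the default class.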
LHCb
OnCall/AoD Cover
- Primary OnCall: Catalin (Mon-Sun)
- Grid OnCall:
- AoD: