RAL Tier1 weekly operations castor 29/11/2010
From GridPP Wiki
Latest revision as of 15:16, 29 November 2010 by Matt viljoen
Operations News
- On 25/11/10 all ATLAS and CMS SL08 disk servers were put into read-only mode via LSF, to prevent further file loss in the event of another catastrophic crash.
Operations Issues
- On 22/11/10 CMS experienced slowness transferring files from cmsWanOut. Three disk servers were running very hot. Putting them into draining mode to redistribute the hot files helped.
- On 24/11/10 at 00:34 and again on 27/11/10 at 22:59 the CMS jobmanager stopped processing requests for approx. 30 minutes (on both occasions) for unknown reasons. Afterwards it resumed operating normally. During these periods, transfers to/from RAL failed. We have enabled a second jobmanager instance on CMS to protect against a future recurrence.
- Very slow connectivity is affecting a number of disk servers across CMS, ATLAS and LHCb. Indications are that there may be a common underlying networking problem.
Blocking issues
- Lack of production-class hardware running Oracle 10g needs to be resolved before CASTOR for Facilities can go into full production.
Planned, Scheduled and Cancelled Interventions
Entries in/planned to go to GOCDB
Description | Start | End | Type | Affected VO(s)
---|---|---|---|---
Update ATLAS to 2.1.9-6 | 06/12/2010 08:00 | 08/12/2010 18:00 | Downtime | ATLAS
Advanced Planning
- Deploy new puppetmaster
- Upgrade ATLAS, CMS and Gen disk servers to a 64-bit OS
- CASTOR upgrade to 2.1.9-10 and SRM upgrade to 2.10, to fix the "unavailable" status being reported to FTS while disk servers are draining
- CASTOR upgrade to 2.1.9-10, which incorporates the gridftp-internal fix for supporting multiple service classes, enabling checksums for Gen
- CASTOR for Facilities instance in production by end of 2010
Staffing
- CASTOR on-call person: Matthew
- Staff absence/out of the office:
- Matthew on annual leave Friday PM