RAL Tier1 weekly operations castor 02/08/2010
From GridPP Wiki
Work this week
- Operations: Two disk servers in LHCbUser showed heavy load because the other two were full.
- Operations: Database problem on Neptune; resolved with Ian and Keir
- Operations: Need to schedule firmware update for SL08 disk servers
- PreProd: Some problems with the grid-map file. Resolved by changing the order in which the information is fetched.
- PreProd: Problems on the Vulcan DB
- PreProd: Problems with gdss154 as source for disk-2-disk copy
- PreProd: CMS and ATLAS started testing. No response yet from ALICE and LHCb
- gdss419 (AtlasSimStrip-d1t0) - has 2 drive failures. One of the drives was replaced on 29/07/2010; the second should be replaced next week. The machine could be out of production until 06/08/2010
- gdss187 (AtlasFarm) - h/w has been fixed; checksums need to be verified because of fsprobe errors
- CMS has reported some files not migrating to tape after a couple of weeks. They were all in "tapecopy_failed" status. Manually resetting the status to "tapecopy_tobemigrated" moved them to tape. The cause of this problem is still unknown.
- The ATLAS jobManager crashed silently without producing any errors. It was restarted, which fixed the problem (30/07/2010)
- ATLAS stager DB was corrupted on Sunday due to a known bug. Recovered on the same day.
- Vulcan DB went down on 29/07/2010 at 9:00pm due to a disk array controller lockup
- gdss125 and gdss154 didn't come back after RAID controller reconfiguration
Planned, Scheduled and Cancelled Interventions
Entries in/planned to go to GOCDB: None
- Upgrade to 2.1.8/2.1.9 2010
- Castor on Call person: ..
- Staff absences:
- Chris on leave until 16th August 2010