RAL Tier1 weekly operations castor 21/01/2011
From GridPP Wiki
- Last disk servers (Gen) quattorized and upgraded to SL5 64-bit
- WAN tuning rolled out to all remaining CMS disk servers
- srm0662 (ATLAS) repartitioned to give more space to logs. Two more to go.
- atlasSimStrip was successfully merged into atlasStripInput
- Around 10k FTS transfers failed for ATLAS on Monday after the switch to a new robot certificate. The certificate was not correctly pushed out to the grid-mapfiles because of a misconfiguration introduced when upgrading to the new puppetmaster02.
- After the disk pool merge, ATLAS continued to use SIMSTRIP because their pilot jobs had not been modified to use DATADISK.
- On 17/2 the xrootd redirector crashed, causing functional tests to fail. The crash was noticed and the redirector was restarted after 2 hours. An automatic restarter has been written and installed, and will kick in if it happens again.
- The lack of production-class hardware running Oracle 10g needs to be resolved before CASTOR for Facilities goes into full production. The hardware has been ordered: servers arrive this week, the RAID device in mid-March.
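The automatic restarter for the xrootd redirector mentioned above is not reproduced in this report. Purely as an illustration, a watchdog of that kind could be sketched in Python as below. The process name `xrootd`, the use of `pgrep` to detect it, the restart command in the usage comment, and the 60-second polling interval are all assumptions, not details of the script actually installed.

```python
import subprocess
import time
from typing import Callable, Optional


def is_running(process_name: str) -> bool:
    """Return True if pgrep finds a process with exactly this name.

    Assumption: the daemon can be identified by its process name alone.
    """
    result = subprocess.run(
        ["pgrep", "-x", process_name],
        stdout=subprocess.DEVNULL,
        stderr=subprocess.DEVNULL,
    )
    return result.returncode == 0


def check_and_restart(is_alive: Callable[[], bool],
                      restart: Callable[[], None]) -> bool:
    """Run one watchdog pass; return True if a restart was triggered."""
    if is_alive():
        return False
    restart()
    return True


def watch(is_alive: Callable[[], bool],
          restart: Callable[[], None],
          interval_s: float = 60.0,
          max_checks: Optional[int] = None,
          sleep: Callable[[float], None] = time.sleep) -> int:
    """Poll repeatedly and restart the service whenever it is found dead.

    max_checks=None polls forever; a finite value is handy for testing.
    Returns the number of restarts that were triggered.
    """
    restarts = 0
    checks = 0
    while max_checks is None or checks < max_checks:
        if check_and_restart(is_alive, restart):
            restarts += 1
        checks += 1
        sleep(interval_s)
    return restarts


# Hypothetical usage (the restart command is an assumption):
# watch(lambda: is_running("xrootd"),
#       lambda: subprocess.run(["service", "xrootd", "restart"]))
```

Passing the liveness check and the restart action in as callables keeps the polling loop independent of how the daemon is actually detected and restarted on a given host.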
Planned, Scheduled and Cancelled Interventions
Entries in/planned to go to GOCDB
|Description||Start||End||Type||Affected VO(s)|
|Roll out WAN tuning changes to all remaining disk servers (STC)||22/02/2011 09:00||22/02/2011 16:00||At-Risk||ATLAS,LHCb,Gen|
|Upgrade NS to 2.1.10 (STC)||mid March||mid March||Downtime||ALL|
- CASTOR certification and upgrade to 2.1.10, and upgrade of SRM to 2.10, which incorporates:
  - a fix for gridftp-internal to support multiple service classes, enabling checksums for Gen
  - a fix so that files on draining disk servers accessed by FTS are reported as NEARLINE rather than UNAVAILABLE
- Move Tier1 instances to the new database infrastructure, with a Dataguard backup instance in R26
- Move Facilities instance to the new database hardware running Oracle 10g
- Start migrating from T10KA to T10KC media later this year
- Castor on Call person: Chris
- Staff absence/out of the office:
- Shaun, Richard (all week)