RAL Tier1 weekly operations castor 02/08/2010

From GridPP Wiki

Work this week

  • Matthew:
    • A/L
  • Shaun:
    • Operations: Two disk servers in LHCbUser showed heavy load because the other two were full.
    • Operations: Database problem on Neptune; resolved with Ian and Keir
    • Operations: Need to schedule firmware update for SL08 disk servers
    • PreProd: Some problems with the grid-map file. Resolved by changing the order in which the information is fetched.
    • PreProd: Problems on VULCAN DB
    • PreProd: Problems with gdss154 as source for disk-2-disk copy
    • PreProd: CMS and ATLAS started testing. No response yet from ALICE and LHCb
  • Chris:
    • A/L
  • Richard:
    • ..
  • Brian:
    • ..
  • Jens:
    • ..

Operations Issues

  • gdss419 (AtlasSimStrip-d1t0) - has 2 drive failures. One drive was replaced on 29/07/2010; the second should be replaced next week. The machine could be out of production until 06/08/2010.
  • gdss187 (AtlasFarm) - h/w has been fixed; checksums need to be verified due to fsprobe errors.
  • CMS reported that some files had not migrated to tape after a couple of weeks. They were all in "tapecopy_failed" status. Manually resetting the status to "tapecopy_tobemigrated" moved them to tape. The cause of this problem is still unknown.
  • The ATLAS jobManager crashed silently without producing any errors. Restarting it fixed the problem (30/07/2010).
  • The ATLAS stager DB was corrupted on Sunday due to a known bug. It was recovered the same day.

PreProd


  • Vulcan DB went down on 29/07/2010 at 9:00pm due to a disk array controller lockup
  • gdss125 and gdss154 didn't come back after RAID controller reconfiguration

Blocking issues

  • None

Planned, Scheduled and Cancelled Interventions

Entries in/planned to go to GOCDB: None

Advanced Planning

  • Upgrade to 2.1.8/2.1.9 in 2010

Staffing

  • Castor on Call person: ..
  • Staff absences:
    • Chris on leave until 16th August 2010