RAL Tier1 weekly operations CASTOR 22/06/2015


List of CASTOR meetings

Operations News

  • Checksum issues - Shaun's tool is being used to clear them up
  • Facilities CASTOR - the time before files are written to tape has been reduced from 30 minutes to 5 minutes
  • Change made to improve file-open times on CASTOR (central DB, subRequestToDo procedure)
  • CASTOR rebalancing underway


  • DiRAC small VO being completed and tested
  • SL6 - three tape servers in Facilities completed, 9 to go
  • New LHCb tape pools created
  • MICE (CASTOR Gen) will be operating overnight and able to call the primary on-call
  • All tape servers have now been upgraded to SL6 and are running smoothly
  • On a related note, the disk server SL6 configuration is ready but waiting for the Oracle updates to be completed.
    • We are examining options for running this in a slow-and-steady fashion with CASTOR up.
  • Testing the CASTOR rebalancer on preproduction and developing associated tools. We hope to use the rebalancer to prevent future hotspotting issues (see the hotspot-check sketch after this list).
  • 13-generation disk servers are being prepared for deployment into CASTOR production
  • The move of the standby DB racks to R26 has been successfully completed. Some issues remained with the hardware following the move, resulting in an unplanned at-risk, but these were resolved.
  • We are examining options for the upgrade of the CASTOR DBs to Oracle version 11.2.0.4. The experiments are keen to avoid downtime early in run 2, so some careful scheduling will be necessary.
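
A minimal sketch of the kind of hotspot check the rebalancer tooling involves, in Python. The input format (one "diskserver used-bytes" pair per line) and the 1.5x-pool-average threshold are assumptions for illustration, not the actual rebalancer tool.

  #!/usr/bin/env python
  """Hedged sketch: flag disk servers holding a disproportionate share
  of a pool's data. Input format and threshold are assumptions."""
  import sys

  def find_hotspots(occupancy, threshold=1.5):
      """Return servers whose usage exceeds threshold * pool average."""
      if not occupancy:
          return []
      average = sum(occupancy.values()) / float(len(occupancy))
      return sorted(name for name, used in occupancy.items()
                    if used > threshold * average)

  if __name__ == "__main__":
      occupancy = {}
      for line in sys.stdin:  # e.g. "gdss711 73014444032"
          name, used = line.split()
          occupancy[name] = int(used)
      for server in find_hotspots(occupancy):
          print("possible hotspot: %s" % server)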


Operations Problems

  • A few SRM DB duplicate entries (ATLAS, two occurrences)
  • ATLAS running out of space (read-only servers being included in the free-space accounting is not helping)
  • CLF VO issues - a configuration error was corrected
  • GDSS711 (cmsDisk) - issue around startup of xroot needs to be considered further
  • GDSS763 (atlasStripInput) - failed one day after deployment; looking to return it read-only and drain it


  • atlasStripInput is filling up - new servers need to be deployed
  • CMS file-open issue - unscheduled xroot reads have been allowed; this needs to be documented for on-call
  • ATLAS files without name-server checksums are causing file copies to fail (may affect other VOs)
  • Possible problem identified when creating the new service class for DiRAC; the CASTOR external list has been emailed
  • The standby DBs for CASTOR occasionally fall 10-15 minutes behind before returning to sync (see the lag-check sketch after this list)
  • xroot redirection - works with our redirector but not with others; this started during a past upgrade - Shaun is debugging.
    • We have determined that the most serious incidence of this problem is due to a number of hot datasets located almost entirely on one node. Shaun has implemented a process to redistribute this data across the rest of the cmsDisk pool.
  • Retrieval errors from Facilities CASTOR - a number of similar incidents seen in the last few weeks.
  • GDSS757 (cmsDisk) / GDSS763 (atlasStripInput) - new motherboards fitted; Fabric acceptance testing complete, ready for deployment.
  • An increased number of checksum errors (ALICE) - found to be due to VO actions; this is being cleared up.
  • ATLAS are putting files (sonar.test files) into un-routable paths - this looks like an issue with the space token used. Brian is working on this (see the gfal2 sketch after this list).
    • This is related to a problem with the latest gfal libraries - not a new problem, but ATLAS are starting to exercise this functionality and are identifying these issues
  • CASTOR functional test on lcgccvm02 causing problems - Gareth is reviewing.
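
For the standby-DB lag item above, a minimal monitoring sketch run against the standby, using the standard Oracle v$dataguard_stats view. The connection details are placeholders and the 15-minute threshold simply mirrors the lag observed; this is not the actual RAL monitoring.

  #!/usr/bin/env python
  """Hedged sketch: warn when a Data Guard standby falls behind.
  Connection details are placeholders."""
  import cx_Oracle

  WARN_MINUTES = 15

  def lag_minutes(value):
      """Parse a v$dataguard_stats interval such as '+00 00:12:34'."""
      days, clock = value.lstrip("+").split()
      hours, minutes = clock.split(":")[:2]
      return int(days) * 1440 + int(hours) * 60 + int(minutes)

  conn = cx_Oracle.connect("monitor", "secret", "standby-db:1521/CASTORSTB")
  cursor = conn.cursor()
  cursor.execute("SELECT name, value FROM v$dataguard_stats "
                 "WHERE name IN ('apply lag', 'transport lag')")
  for name, value in cursor:
      minutes = lag_minutes(value)
      status = "WARN" if minutes >= WARN_MINUTES else "ok"
      print("%s: %s = %s (%d min behind)" % (status, name, value, minutes))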
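
And for the sonar.test routing item, a minimal gfal2 sketch showing where the destination space token enters a copy. The endpoints and token name are placeholders, not ATLAS's actual configuration.

  #!/usr/bin/env python
  """Hedged sketch: gfal2 copy with an explicit destination space token.
  Endpoints and token are placeholders."""
  import gfal2

  src = "file:///tmp/sonar.test"  # placeholder source
  dst = ("srm://srm-atlas.gridpp.rl.ac.uk/castor/ads.rl.ac.uk/"
         "prod/atlas/sonar.test")  # placeholder destination

  ctx = gfal2.creat_context()
  params = ctx.transfer_parameters()
  params.dst_spacetoken = "ATLASDATADISK"  # placeholder space token
  params.overwrite = True
  ctx.filecopy(params, src, dst)  # raises gfal2.GError on failure
  print("copy complete")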


Blocking Issues

  • GridFTP bug in SL6 - stops any Globus copy if a client is using a particular library. This is a show-stopper for SL6 on disk servers (see the smoke-test sketch below).
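
A minimal smoke-test sketch for the transfer path affected by this bug; the source URL is a placeholder and a valid grid proxy is assumed.

  #!/usr/bin/env python
  """Hedged sketch: exercise a plain GridFTP copy from a disk server.
  Source URL is a placeholder; a valid grid proxy is assumed."""
  import subprocess

  SRC = "gsiftp://gdss999.gridpp.rl.ac.uk/castor/test/1MB"  # placeholder
  DST = "file:///tmp/gridftp-smoke-test"

  rc = subprocess.call(["globus-url-copy", "-vb", SRC, DST])
  print("transfer %s" % ("ok" if rc == 0 else "FAILED (rc=%d)" % rc))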


Planned, Scheduled and Cancelled Interventions

  • Proposed to upgrade the Juno DB to Oracle 11.2.0.4 ... 30th June - CASTOR Facilities downtime


Advanced Planning

Tasks

  • Disk deployments - SL6 disk deployments: development mostly complete, deployment planning needed
  • SRM 2.1.14 - functional testing positive / stress testing required
  • Switch admin machines: lcgccvm02 to lcgcadm05
  • Correct the partition-alignment issue (3rd CASTOR partition) on new CASTOR disk servers (see the alignment-check sketch after this list)
  • Intervention to upgrade CASTOR DBs to Oracle 11.2.0.4
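
For the partition-alignment task, a minimal check sketch reading partition start sectors from sysfs. The 1 MiB (2048 x 512-byte sectors) alignment rule is the common convention, assumed here rather than taken from the actual deployment procedure.

  #!/usr/bin/env python
  """Hedged sketch: flag partitions whose start sector is not 1 MiB
  aligned. The alignment rule is an assumed convention."""
  import glob
  import os

  ALIGN_SECTORS = 2048  # 1 MiB in 512-byte sectors

  for path in sorted(glob.glob("/sys/block/sd*/sd*[0-9]/start")):
      with open(path) as handle:
          start = int(handle.read())
      part = os.path.basename(os.path.dirname(path))
      state = "ok" if start % ALIGN_SECTORS == 0 else "MISALIGNED"
      print("%s: %s starts at sector %d" % (state, part, start))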

Interventions


Staffing

  • CASTOR on-call person next week
    • Shaun
  • Staff absence/out of the office:
    • Rob - Out this week
    • Chris – Out Tuesday TBC


Actions

  • Rob to look into procedural issues with CMS disk server interventions
  • Bruno to document processes to control services previously controlled by puppet
  • Gareth to arrange a CASTOR/Fabric/Production meeting to discuss the decommissioning procedures
  • Gareth to investigate providing checks for /etc/noquattor on production nodes & checks for fetch-crl (see the check sketch after this list) - ONGOING
  • Rob to remove Facilities disk servers from cedaRetrieve so they can go back to Fabric for acceptance testing.
  • Rob to get hold of the jobs thought to cause the CMS pileup
  • Bruno to put SL6 on preprod disk
  • Bruno / Rob to write change control doc for SL6 disk
  • Shaun testing and working on the gfal-copy RPMs
  • Someone to find out which access protocol MICE use
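
For the /etc/noquattor and fetch-crl action above, a minimal Nagios-style check sketch. The CRL location and the 24-hour freshness threshold are the usual conventions, assumed rather than taken from the actual RAL checks.

  #!/usr/bin/env python
  """Hedged sketch: warn if quattor runs are disabled or CRLs are stale.
  Paths and threshold are assumed conventions."""
  import glob
  import os
  import sys
  import time

  MAX_AGE_HOURS = 24
  problems = []

  if os.path.exists("/etc/noquattor"):
      problems.append("/etc/noquattor present - quattor runs disabled")

  crls = glob.glob("/etc/grid-security/certificates/*.r0")
  if not crls:
      problems.append("no CRL files found - has fetch-crl ever run?")
  else:
      age_hours = (time.time() - max(map(os.path.getmtime, crls))) / 3600.0
      if age_hours > MAX_AGE_HOURS:
          problems.append("newest CRL is %.1f hours old" % age_hours)

  if problems:
      print("WARNING: " + "; ".join(problems))
      sys.exit(1)
  print("OK: quattor enabled, CRLs fresh")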

Completed actions

  • Rob/Gareth to write some new docs to cover on-call procedures for CMS with the introduction of unscheduled xroot reads
  • Rob/Alastair to clarify what we are doing with 'broken' disk servers
  • Gareth to ensure that there is a ping test etc to the atlas building