Tier1 Operations Report 2011-06-15

From GridPP Wiki

RAL Tier1 Operations Report for 15th June 2011

Review of Issues during the week from 8th to 15th June 2011.

  • Since the start of June we have seen problems on the CMS Castor instance, manifesting as transfer failures and intermittent time-outs on the CMS SAM test of srm-cms. They appear to be triggered by a particular CMS workflow, which has been throttled back, but the detailed cause is not understood as the workflow does not appear to have generated excessive load.
  • On Wednesday 8th June we investigated a problem on the LHCb Castor instance. This initially looked like a bad file, but was finally found to be a user requesting a TURL but not making use of it to start transferring the file within the time-out. This was being repeated for the same file. The errors seen by the Castor team were the jobs created by the TURL request timing out.
  • On Wednesday 15th June problems were seen with the scheduler for the CMS Castor instance. An unscheduled outage of 90 minutes was declared.
  • Changes made this last week:
    • lcgce08 (LHC CREAM CE) updated to glite3.2 CREAM v3.2.10-0.
    • lcgce05 (non-LHC CREAM CE) has been Quattorized and updated to glite3.2 CREAM v3.2.10-0.

Current operational status and issues.

  • Since CE07 was taken out of service for re-installation as a CREAM CE, neither Alice nor LHCb is including any CEs in its availability calculations. As a result our LHCb availability is no longer affected by intermittent problems with the LHCb CVMFS test, although we know this test does still fail from time to time.
  • We are still seeing some intermittent problems with the site BDIIs. Until this is further understood the daemons are being restarted regularly.
  • Atlas reported slow data transfers into the RAL Tier1 from other Tier1s and CERN (i.e. asymmetrical performance). CMS seems to experience this as well, but between RAL and foreign Tier2s. The pattern of asymmetrical flows appears complex and is being actively investigated.

Declared in the GOC DB

  • Thursday 16th June: Castor (All SRMs) At Risk during OS and Oracle parameter changes needed to follow up a problem with gathering Oracle statistics. (Being rolled out to all nodes in the RACs following tests last week.)

Advanced warning:

The following items are being discussed and are still to be formally scheduled:

  • Updates to Site Routers (the Site Access Router and the UKLight router) are required.
  • Upgrade Castor clients on the Worker Nodes to version 2.1.10.
  • Address permissions problem regarding Atlas User access to all Atlas data.
  • Minor Castor update to enable access to T10KC tapes.
  • Networking upgrade to provide sufficient bandwidth for T10KC tapes.
  • Microcode updates for the tape libraries are due.
  • Switch Castor and LFC/FTS/3D to new Database Infrastructure.
  • Further updates to CEs: (Convert CE06 to a CREAM CE; Glite updates on CE09). Priority & order to be decided.

Entries in GOC DB starting between 8th and 15th June 2011.

There has been one unscheduled entry in the GOC DB for this period, for today's problem with the CMS Castor instance.

Service | Scheduled? | Outage/At Risk | Start | End | Duration | Reason
srm-cms | UNSCHEDULED | OUTAGE | 15/06/2011 11:55 | 15/06/2011 13:25 | 1 hour and 30 minutes | We are investigating a problem with the scheduler within Castor for the CMS instance.
lcgce05 | SCHEDULED | OUTAGE | 10/06/2011 15:30 | 14/06/2011 11:15 | 3 days, 19 hours and 45 minutes | Update of glite3.2 CREAM release.
srm-atlas, srm-cms | SCHEDULED | WARNING | 09/06/2011 10:00 | 09/06/2011 11:00 | 1 hour | At Risk during OS and Oracle parameter changes needed to follow up a problem with gathering Oracle statistics.
lcgce08 | SCHEDULED | OUTAGE | 07/06/2011 16:00 | 10/06/2011 10:15 | 2 days, 18 hours and 15 minutes | Update of glite3.2 CREAM release.