Tier1 Operations Report 2011-07-20

RAL Tier1 Operations Report for 20th July 2011

Review of Issues during the week from 13th July to 20th July 2011.

  • Wednesday (13th July 2011) High load seen on Atlas Castor instance. We halved the FTS channels and stopped active draining of gdss208.
  • Thursday (14th July 2011) There was an outage of the Atlas Castor instance from approx 15:00 until 17:00. This was due to a database problem (There was a subrequest with no id2type entry).
  • Tuesday (19th July 2011) WMS02 (LHC) had problems caused by the configured limit on the number of gridFTP connections. The limit was increased by 200.
  • Tuesday (19th July 2011) Commenced draining the 2008 Streamline worker nodes so that they can be re-numbered and re-installed.
  • Wednesday (20th July 2011) Several batch workers were knocked offline by an Atlas user leaving files in /home/pool and filling it.
  • Disk Server Issues:
    • Thursday (14th July 23:13) gdss190 (AtlasScratchDisk D1T0) was removed from service after failing a read-only file system check. It was checked over and returned to service on Friday.
    • Monday morning (18th July) gdss195 (AtlasScratchDisk D1T0) was removed from service due to a double drive failure. It was returned to service on Tuesday morning (19th July).
    • Wednesday (20th July) gdss96 (cmsWanIn D0T1) suffered a kernel panic and was removed from service.
  • Changes made this last week:
    • Thursday (14th) Switched all Castor instances to use the LRU garbage collection policies (illustrated by the sketch after this list).
    • Thursday (14th) Set default ACLs for CMS.
    • Friday (15th) Kernel/errata updates were applied to the CMS software server lcg0616.
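
The change above to LRU garbage collection can be summarised as: when a disk pool needs space, the files that have gone longest without being accessed are removed first. The sketch below is a minimal Python illustration of that idea only; it is not Castor's garbage collector, and the pool path and space target in it are invented for the example.

  import heapq
  import os

  def lru_candidates(pool_dir, bytes_to_free):
      """Pick files to evict, least recently accessed first,
      until at least bytes_to_free would be reclaimed."""
      files = []
      for root, _dirs, names in os.walk(pool_dir):
          for name in names:
              path = os.path.join(root, name)
              st = os.stat(path)
              # Smallest access time = least recently used.
              heapq.heappush(files, (st.st_atime, st.st_size, path))
      victims, freed = [], 0
      while files and freed < bytes_to_free:
          _atime, size, path = heapq.heappop(files)
          victims.append(path)
          freed += size
      return victims

  if __name__ == "__main__":
      # Hypothetical pool path and 10 GB target; a real garbage collector
      # is driven by the stager database rather than a filesystem walk.
      for path in lru_candidates("/castor/pool", 10 * 1024**3):
          print("would evict", path)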

Current operational status and issues.

  • The following points are unchanged from previous reports:
    • Atlas reported slow data transfers into the RAL Tier1 from other Tier1s and CERN (i.e. asymmetrical performance). CMS appears to experience this as well (but between RAL and foreign Tier2s). The pattern of asymmetrical flows appears complex and is being actively investigated.
    • We have observed some packet loss on the main network link from the RAL site (not the route used by our data). An intervention was made on the 5th July but this does not appear to have fixed it; a simple packet-loss check is sketched below.
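
A very simple way to quantify loss on a link is to run a ping sample and read off the reported loss percentage, as in the minimal sketch below. The target hostname and ping count are placeholders; this is only an illustrative check, not the method being used in the ongoing investigation.

  import re
  import subprocess

  def packet_loss_percent(host, count=50):
      """Run ping and return the packet-loss percentage it reports."""
      result = subprocess.run(
          ["ping", "-c", str(count), "-q", host],
          capture_output=True, text=True, check=False,
      )
      match = re.search(r"(\d+(?:\.\d+)?)% packet loss", result.stdout)
      return float(match.group(1)) if match else None

  if __name__ == "__main__":
      # Placeholder target; substitute a host on the link under test.
      print(packet_loss_percent("remote-host.example.org"))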

Declared in the GOC DB

  • None

Advance warning:

The following items are being discussed and are still to be formally scheduled and announced:

  • Address permissions problem regarding Atlas User access to all Atlas data.
  • Networking upgrade to provide sufficient bandwidth for T10KC tapes.
  • Microcode updates for the tape libraries are due.
  • Switch Castor and LFC/FTS/3D to new Database Infrastructure.
  • Further updates to CEs (CE06 de-commissioning; Glite updates on CE09 outstanding).
  • There is a requirement to re-number 4 clusters of worker nodes.
  • There is a need to reboot all the VO software servers for kernel updates and errata.

Entries in GOC DB starting between 13th July and 20th July 2011.

There was an unscheduled outage due to the Castor/database problems on the 14th.

Service                   | Scheduled?  | Outage/At Risk | Start            | End              | Duration               | Reason
srm-atlas.gridpp.rl.ac.uk | UNSCHEDULED | OUTAGE         | 14/07/2011 15:00 | 14/07/2011 17:18 | 2 hours and 18 minutes | Downtime while we investigate Castor problems