RAL Tier1 Operations Report for 10th July 2013

Review of Issues during the week 3rd to 10th July 2013.
  • There have been some ongoing CVMFS problems - notably affecting CMS during this last week. Work is ongoing to investigate this, including upgrading to a version of CVMFS (2.1.12) that better handles fail-over.
  • On Friday (5th July) a problem arose with the Atlas Castor instance during the afternoon. The instance was declared down (an outage in the GOC DB) for 2.5 hours and At Risk for the rest of the weekend. This was traced to a recurrence of the problem seen a week ago, caused by a bug that is fixed in version 2.1.13.
  • The problem reported two weeks ago, in which the RAL site firewall's logging of a large number of connection requests caused high load, has also been seen elsewhere by ALICE. Although the issue is not yet understood and fixed, a workaround is in place. We have raised the maximum number of ALICE jobs from 500 to 1000.
  • We have had a number of occurrences of connections to the batch server failing; this also manifests as the pbs_server process using a lot of memory. We trap and call out on this memory problem (a sketch of such a check is given below), but only this last week have we seen connection problems linked to it. The problem was compounded over the weekend by our Nagios server not running this memory check for a long period (from Saturday evening until Monday). The cause of this is not understood: other Nagios tests were run, but somehow this one was not scheduled. A restart of the Nagios process on Monday resolved the issue. We failed some CE SUM tests over the weekend as a result of this problem.
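As an illustration of the memory check mentioned above, the following is a minimal sketch of a Nagios-style plugin that raises a warning or critical alert when the resident memory of the pbs_server process crosses a threshold. The thresholds, script layout and use of Python are assumptions for illustration; this is not the production check.

    #!/usr/bin/env python
    # Illustrative sketch only: report OK/WARNING/CRITICAL (Nagios exit-code
    # convention) based on the resident memory of the pbs_server process.
    # Thresholds below are assumed values, not the production settings.
    import subprocess
    import sys

    WARN_MB = 4096   # assumed warning threshold (MB)
    CRIT_MB = 8192   # assumed critical threshold (MB)

    def pbs_server_rss_mb():
        """Total resident set size (MB) of all pbs_server processes."""
        out = subprocess.check_output(["ps", "-C", "pbs_server", "-o", "rss="])
        return sum(int(v) for v in out.decode().split()) / 1024.0

    def main():
        try:
            rss = pbs_server_rss_mb()
        except subprocess.CalledProcessError:
            print("UNKNOWN: pbs_server process not found")
            sys.exit(3)
        if rss >= CRIT_MB:
            print("CRITICAL: pbs_server RSS %.0f MB" % rss)
            sys.exit(2)
        if rss >= WARN_MB:
            print("WARNING: pbs_server RSS %.0f MB" % rss)
            sys.exit(1)
        print("OK: pbs_server RSS %.0f MB" % rss)
        sys.exit(0)

    if __name__ == "__main__":
        main()

A check of this kind only helps if Nagios actually schedules it; as noted above, the check was not run between Saturday evening and Monday, which is why the memory growth and the resulting connection failures went unnoticed over the weekend.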
Resolved Disk Server Issues
  • On Thursday (4th July) GDSS662 (AtlasDataDisk) became unresponsive and was taken out of production for investigation. Following hardware checks it was returned to service the following day, initially in 'passive draining' mode. This was changed to full production later that afternoon as there was a suggestion that servers in passive draining were implicated in exposing the bug seen in version 2.1.12.
Current operational status and issues
  • The uplink to the UKLight Router is running on a single 10Gbit link, rather than a pair of such links.
  • The problem of LHCb jobs failing due to long job set-up times remains and investigations continue. Recent updates to the CVMFS clients have improved the situation for Atlas.
  • The testing of FTS3 is continuing and the service is being put on a more 'production' footing. (This runs in parallel with our existing FTS2 service).
  • We are participating in xrootd federated access tests for Atlas.
  • Testing is ongoing with the proposed new batch system (ARC-CEs, Condor, SL6). Atlas and CMS are running work through this; ALICE & LHCb are being brought on board with the testing.
Ongoing Disk Server Issues
  • None
Notable Changes made this last week
  • The Atlas Castor instance (stager) is being upgraded to version 2.1.13-9 today.
  • Some of the farm nodes have been upgraded to CVMFS version 2.1.12, which resolves a fail-over problem seen in 2.1.11.
  • Some Atlas Tier2s (Manchester, ECDF, RAL PPD) have moved transfers to use FTS3 (an illustrative submission sketch is given below).
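As background to the FTS3 items above, the following is a minimal sketch of submitting a single transfer through an FTS3 endpoint by wrapping the fts-transfer-submit command-line client. The endpoint URL and SURLs are hypothetical placeholders, not the actual RAL or Tier2 configuration.

    #!/usr/bin/env python
    # Illustrative sketch only: submit one file transfer to an FTS3 endpoint by
    # wrapping the fts-transfer-submit client. The endpoint and SURLs below are
    # hypothetical placeholders.
    import subprocess

    FTS3_ENDPOINT = "https://fts3.example.ac.uk:8446"                     # hypothetical
    SOURCE = "srm://source-se.example.ac.uk/dpm/example/file.root"        # hypothetical
    DESTINATION = "srm://dest-se.example.ac.uk/castor/example/file.root"  # hypothetical

    def submit_transfer(src, dst):
        """Submit one transfer and return the job ID printed by the client."""
        out = subprocess.check_output(
            ["fts-transfer-submit", "-s", FTS3_ENDPOINT, src, dst])
        return out.decode().strip()

    if __name__ == "__main__":
        print("Submitted FTS3 job %s" % submit_transfer(SOURCE, DESTINATION))

The returned job ID can then be followed with the matching status client (for example fts-transfer-status) against the same endpoint; this sketch deliberately ignores the retry, checksum and credential options a real submission would use.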
Declared in the GOC DB
  • None.
Advanced warning for other interventions
The following items are being discussed and are still to be formally scheduled and announced.
  • The upgrade of the Atlas Castor stager is taking place today. We plan to update the remaining stagers on the following dates: CMS & GEN: Tuesday 23rd July; LHCb: Tuesday 30th July.
  • Re-establishing the paired (2*10Gbit) link to the UKLight router.

Listing by category:

  • Databases:
    • Switch LFC/FTS/3D to new Database Infrastructure.
  • Castor:
    • Upgrade to version 2.1.13 (ongoing)
  • Networking:
    • Single link to UKLight Router to be restored as paired (2*10Gbit) link.
    • Update core Tier1 network and change connection to site and OPN including:
      • Install new Routing layer for Tier1
      • Change the way the Tier1 connects to the RAL network.
      • These changes will lead to the removal of the UKLight Router.
  • Grid Services:
    • Testing of alternative batch systems (SLURM, Condor) along with ARC-CEs and SL6 Worker Nodes.
  • Fabric:
    • One of the disk arrays hosting the FTS, LFC & Atlas 3D databases is showing a fault and an intervention is required.
  • Infrastructure:
    • A 2-day maintenance is being planned sometime in October or November for the following. This is expected to require around a half-day outage of power to the UPS room, with Castor & batch down for the remaining 1.5 days as equipment is switched off in rotation for the tests.
      • Intervention required on the "Essential Power Board" & Remedial work on three (out of four) transformers.
      • Remedial work on the BMS (Building Management System) due to one of its three modules being faulty.
      • Electrical safety check. This will require significant (most likely 2 days) downtime during which time the above infrastructure issues will also be addressed.
Entries in GOC DB starting between 3rd and 10th July 2013.

There were three unscheduled entries in the GOC DB. Two were for the problem with the Atlas Castor instance on Friday (5th July) - for the outage and then a 'warning' over the weekend. The other was for an outage of the (test) ARC-CEs.

Service Scheduled? Outage/At Risk Start End Duration Reason
srm-atlas SCHEDULED OUTAGE 10/07/2013 09:00 10/07/2013 14:00 5 hours Upgrade of Atlas Castor Stager to version 2.1.13-9.
arc-ce01, arc-ce02 UNSCHEDULED OUTAGE 09/07/2013 12:00 09/07/2013 15:15 3 hours and 15 minutes Problem on Hypervisor hosting these ARC-CEs.
srm-atlas UNSCHEDULED WARNING 05/07/2013 18:30 08/07/2013 09:30 2 days, 15 hours Following problems with Atlas Castor instance
srm-atlas UNSCHEDULED OUTAGE 05/07/2013 15:45 05/07/2013 18:17 2 hours and 32 minutes Problem with Atlas Castor instance. Investigations ongoing.
All Castor & batch SCHEDULED OUTAGE 03/07/2013 09:00 03/07/2013 12:30 3 hours and 30 minutes Castor nameserver Upgrade to version 2.1.13-9. Castor and batch services unavailable.
Open GGUS Tickets (Snapshot at time of meeting)
GGUS ID Level Urgency State Creation Last Update VO Subject
95435 Yellow Urgent In Progress 2013-07-04 2013-07-04 LHCb CVMFS problem at RAL-LCG2
91658 Red Less Urgent In Progress 2013-02-20 2013-07-02 LFC webdav support
86152 Red Less Urgent On Hold 2012-09-17 2013-06-17 correlated packet-loss on perfsonar host
Availability Report
Day OPS Alice Atlas CMS LHCb Comment (availability figures are percentages)
03/07/13 85.4 80.9 76.5 82.8 85.4 Scheduled Outage for Castor nameserver 2.1.13 upgrade. Also failed a few Atlas SRM tests (timeouts) and a few CE tests (CVMFS on particular nodes).
04/07/13 100 100 92.9 100 100 Mainly losses from CE tests failing because CVMFS was not functional on some nodes.
05/07/13 100 100 92.5 100 97.2 Atlas: Problem with Castor instance during afternoon. LHCb: Problem connecting to batch server
06/07/13 100 89.6 100 100 95.8 Problem connecting to batch server
07/07/13 100 89.8 94.1 100 84.3 Problem connecting to batch server
08/07/13 100 100 100 100 100
09/07/13 100 100 99.2 100 100 Single SRM test failure (timeout).