SAM availability: July-September 2007

SAM Test Results - 3rd analysis to the end of September 2007

The plots below map the SAM availability results for each GridPP site against the average across all the sites. The SAM summary data is available from: http://hepwww.ph.qmul.ac.uk/~lloyd/gridpp/sam.html. The figure for each day is based on the hourly tests run that day from midnight to midnight.
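As a rough illustration of how a daily figure could be derived from hourly tests, the sketch below assumes the figure is simply the fraction of hourly tests that passed, and that the GridPP average is the plain mean over sites with data. The source does not state the exact formula, so both are assumptions, and the function names are hypothetical.

```python
def daily_availability(hourly_results):
    """Percentage of hourly tests passed in one day (midnight to midnight).

    hourly_results: list of booleans, one per test run that day.
    Returns None when no tests ran (assumption: no data, not 0%).
    """
    if not hourly_results:
        return None
    return 100.0 * sum(hourly_results) / len(hourly_results)


def average_across_sites(site_values):
    """Plain mean of the daily figures, skipping sites with no data (assumption)."""
    values = [v for v in site_values.values() if v is not None]
    if not values:
        return None
    return sum(values) / len(values)
```

A site that passes 18 of 24 hourly tests would score 75% for that day under this assumption.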

The title of each graph has the following format: Site name (% site availability for last month: % average GridPP site availability for last month: amount site is above or below average). The title font colour depends on whether the site monthly average is above (green) or below (red) the target for July (85%). The target for September was 90%.
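The title convention can be sketched as a small helper. This is an illustration only: the function name, the rounding to whole percentages and the choice to show a site exactly on target in green are assumptions, and the numbers in the usage note are invented.

```python
def graph_title(site, site_avg, gridpp_avg, target):
    """Build a graph title and pick its font colour.

    site_avg and gridpp_avg are monthly availability percentages;
    target is the monthly target (85 for July, 90 for September).
    """
    diff = site_avg - gridpp_avg  # positive when the site is above average
    # Assumption: a site exactly on target counts as meeting it.
    colour = "green" if site_avg >= target else "red"
    title = f"{site} ({site_avg:.0f}%: {gridpp_avg:.0f}%: {diff:+.0f}%)"
    return title, colour
```

For a hypothetical site at 88% against a GridPP average of 84% and a target of 85%, this returns the title "EXAMPLE-SITE (88%: 84%: +4%)" in green.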

The number following the main title indicates the KSI2K hrs of CPU time recorded in APEL for the site in the period 1st September to 30th September. The number is in a red font if there appears to be a problem with the site APEL publishing during the period.

London Tier2

UKI-LT2-Brunel

Site has generally good availability apart from three major periods of down-time explained below and a period of unreliability 9-19th September:

July 17th to 19th -- DPM upgrade.
Aug 12th -- dpm.ftpd out of memory error affected both our SE and the nfs software area exported from one of the disk pools
Sept 9th - dpm pool node hang
Sept 18th-19th - power cut and associated fallout
Sept 21st-24th - scheduled downtime for electrical maintenance
Oct 13-14 - DPM pool node hang

File:071012-1.JPG

UKI-LT2-IC-HEP

We had several downtimes in June due to SRM upgrade and ICT site problems.

June 11th to 12th -- SRM-dCache upgraded to solve ongoing CMS problem.
June 12th to 13th -- ICT upgraded the images on the cluster and the new images were missing libraries. The site was failing SAM replica tests.
June 13th to 15th -- Problem with ICT site firewall.
June 18th -- The last dCache upgrade had memory issues, so a new upgrade was scheduled.
Sept 1st-3rd -- Problem with dCache services.

File:071012-2.JPG

UKI-LT2-IC-LESC

The site default SE is the same as IC-HEP.

June 11th to 12th -- Site SRM-dCache upgraded.
June 18th -- The last dCache upgrade had memory issues, so a new upgrade was scheduled.
June 25th -- SFT test failures due to an expired job proxy.
July 3rd -- Brief disruption on the cluster (SGE) queuing system.
Sept 1st-3rd -- Problem with dCache services.

File:071012-3.JPG

UKI-LT2-QMUL

QMUL experienced the following problems:

June 05 -- our CE was hanging, and one of our storage nodes had a degraded disk
June 22 -- our SE had a problem with eth1
June 27 -- I did not have enough time to manage everything
July 15 -- QMUL was not available on the BDII at RAL using the lcg-info command, although I could see our data using jxplorer. Matt Hodges cleared out all the cached data and that resolved the problem.
At the beginning of Sept (the first two weeks): I went on holiday
The week after: I needed some time to understand what the problem was

File:071012-4.JPG

UKI-LT2-RHUL

All the problems not correlated with dips in the average are understood:

15 June - problem unknown. SFT-job failures.
On 6 July there were problems connecting to the site due to a DNS problem. The RHUL DNS is operated by our computer centre and therefore out of our control. It was quickly reported and fixed (~3 hours) but then took up to a day for some caches around the world to be updated, notably at CERN.
Site went down due to air conditioning failure around midnight 7 July (Sat/Sun), which was fixed late afternoon on Sun 8th July. The opportunity was then taken to perform general maintenance (o/s and kernel upgrades, fsck the DPM pool & VO s/w file systems, check a possible disk fault, make more space in /var for logs and make more swap space on some front-end nodes). These were completed by Wed 11 July. This extra time for maintenance was booked in as down time in the GOCDB once the site was already offline for the aircon failure.
16 July - CE-sft-lcg-rm failures due to a badly set up pool node (it was running the wrong kernel by mistake and crashed in certain circumstances, several times that day until fixed).

4 Aug - campus network outage
16-23 Aug - One of the DPM file systems was full but the SAM tests were still using it, trying to copy files there and failing.
1-7 Sept - undiagnosed nfs related problems with cluster head node (unable to download tar file from RB) and problems with DPM pool nodes.
8-12 Sept - DPM pool node (gridraid2) hung. Also possibly affected by DPM.ftpd out of memory problem.
4-5 Oct - UPS shut down all three DPM pool nodes.
10 Oct - DPM upgrade
13-14 Oct - DPM pool node hang (gridraid2)
15-24 Oct - DPM pool node gridraid2 taken off-line - reliability improves
24 Oct - DPM pool node gridraid2 put back on-line
24-25 Oct - gridraid2 fails - DPM.ftpd??
2-4 Nov - SE BDII problems
10-16 Nov - SRMv1 daemon quietly died

File:071012-5.JPG

UKI-LT2-UCL-CENTRAL

Early June: change of ops VOMS roles affecting SAM tests at DPM sites. Fixed by ACL change.
Outage 13-15 June due to work on power supply to machine room. Queues drained beforehand.
Various problems with file servers, including problems that occurred while expanding the available space on the SE from 2 to 20 TB.

File:071012-6.JPG

UKI-LT2-UCL-HEP

1-11 June: ops VOMS roles changed, with no notice to sites. A fix for ACLs on the DPM was needed. Further downtime was due to misconfiguration of the opssgm account at the site. Lack of manpower meant the problem was slow to resolve.
27-29 June / 2-3 July: The CE hung and needed to be rebooted. This triggered an auto-update that left the gLite installation in an unknown state (since auto-update had been reluctantly turned off a few weeks earlier due to poor quality control on the rolling gLite releases). The site BDII was dead. Thus an upgrade to gLite r27 was forced and the site re-configuration proved a bit laborious.
The last 3 dips are not really correlated with real downtime. We had a 2-hour downtime on 16 July due to an unscheduled power cut affecting the batch farm. We do experience random failures of the SAM lcg-rm tests; these are never correlated with real site (or SE) problems, throw rather obscure messages and go away by themselves. It is possible that these dips relate to such failures, although I do not remember enough failures to justify such dips.

File:071012-7.JPG

NorthGrid

UKI-NORTHGRID-LANCS-HEP

June-July instability was a dCache issue with the pnfs mount options; this only affected SAM tests where files were created and immediately removed.
Mid-August problems came from the SL4 upgrade and were caused by a few blackhole WNs. This was tracked to the jpackage repository being down, which broke the auto-install of some WNs.
Mid-September problems were caused by adding a new dCache pool, which is not being brought online until the issue is understood.

File:071012-8.JPG

UKI-NORTHGRID-LIV-HEP

The big drop between July 13th and 17th was due to host certificates going out of date on some servers, coupled with an unexpected departmental power outage.

File:071012-9.JPG

UKI-NORTHGRID-MAN-HEP

15/09/2007

The accounting for Manchester has been fixed and the CPU hours are 457,341, not ~60k.

The figure is not great anyway because, despite 9 VOs running, the absence of LHCb is noticeable. LHCb blacklisted Manchester for software area problems. They never told us we were blacklisted and they didn't reinstate us even when the problem disappeared, because that step is not automatic. They did react promptly when contacted, though.

18/10/2007

http://northgrid-tech.blogspot.com/2007/10/manchester-dcache-fixed.html

File:071012-10.JPG

UKI-NORTHGRID-SHEF-HEP

June 1st - 15th: Host CA was out of date, and the DN had to be corrected after the update
June 19th - 25th: Host CA again; the DN in the update was different and needed correcting
July 2nd - 10th: BDII problems resulting from the Glue-CORE.schema issue with the gLite 3.0 upgrade
July 29th: DNS crash here at Sheffield

As with the CA problem, the majority of the time was spent trying to assess the problem and find the solution.

Aug - Oct: This long term down time was a result of server updates going wrong and compounding the problems from the one before. A lot of the time was spent with the help of others trying to pinpoint each problem and resolve it.

File:071012-11.JPG

ScotGrid

UKI-SCOTGRID-DURHAM

As with many sites, the large dip 03/06 was due to the broken SAM tests issue affecting all DPM sites. The dip around 25/06 was due to an unexpected change in the DNS servers from the Durham ITS department. On the 13/07 the CE froze due to excessive swap and had to be rebooted (cause unknown). The dip around the 19/07 was due to a powerdown of the building to bring a backup generator online. Dip around 2007-08-12 was caused by CE locking up while staff were on holiday and unable to respond (no remote intervention possible, we tried).

File:071012-12.JPG

ScotGrid-Edinburgh

June/July

The one uncorrelated reduction in service (which still failed to take our availability below target) was due to a scheduled power cut at our host facility, which was undergoing building upgrades. Transient effects persisted over the weekend post-powercut, and were resolved on the Monday (the 25th).

October

Dips mainly due to instability in one of the storage pools, resulting in filesystems going read-only, and hence failure of CE rm tests (despite the CE being fine), and of SRM tests. Also, at about the same time, there was a brief issue with the information system, due to the site BDII accidentally being only partly reconfigured (resulting in issues where some of it understood the new GLUE schema, and some of it didn't) - this only lasted a couple of hours, though.

File:071012-13.JPG

UKI-SCOTGRID-GLASGOW

June

The large dip around 2007-06-03 was the screwup with SAM tests[1] [2], which affected all DPM sites.
The dip on 2007-06-21 I just don't understand. Looks like there was a period when SAM tests weren't running properly on the cluster and things were timing out, but I can't see enough detail to work this out now.

July

The wobble around 2007-07-07 was caused by a pheno user's VOMS proxy not renewing properly and the CE almost collapsing [3].

August

Large dip around 2007-08-12 was CE taken down by Pheno user (again) [4], [5].

September

Drop at 2007-09-20 was site upgrade to SL4 [6].
Drop around 2007-09-28 was, I think, our DPM having troubles [7].

File:071012-14.JPG

SouthGrid

UKI-SOUTHGRID-BHAM-HEP

The first two large drops in availability are consequences of update 25. The Birmingham site has been consistently upgrading as soon as new updates are released. Applying upgrades early has a big impact on the stability of the site, as the recent updates have all come with numerous problems. The introduction of pools of software manager and production accounts has caused havoc (see the extensive discussion of this on ROLLOUT). Conflicting instructions in the release notes and subsequent broadcasts or messages on ROLLOUT do not help site stability either. For example, the release notes advise sites to use sgm pool accounts, while some VOs say they want to keep static accounts. Outside the turbulence of the updates, the site availability remains constant and above the target figure. This shows that the site efficiency would be much higher if it were to change its strategy and wait until a critical number of sites have applied the updates and all problems have been resolved.

The drop starting in early July and extending until the 15th of July was related to DPM permission problems which only affected the sgm accounts on our SE and not the rest of our grid services. Initially, Judith was mapped either to a regular pool account or to an sgm account. We failed the RM test whenever she was mapped to an sgm pool account. From the 11th of July she was consistently mapped to an sgm account and the availability dropped to zero. The drop is disproportionate to user experience, as regular pool accounts were not affected. This period coincides with various DPM fixes for the vulnerability discovered by Kostas and with measures taken by the site to limit the impact of the security hole.

In considering site availability, the age of the cluster should be considered (and the care and feeding it requires to keep it in good shape). Our BaBar cluster is now rather old and has suffered from a lot of disk failures, which are sometimes not predictable (with smartd, for example) and impact stability.

File:071012-15.JPG

UKI-SOUTHGRID-BRIS-HEP

The large dip in early June was the screwup with SAM tests[8] [9], which affected all DPM sites.
Dip 13-14 June - unsure, still digging up old info (unfortunately workload meant the reasons were not researched and logged in the weekly site reports that week)
Dip 1-4 July - unsure, the CIC pre-report archives have no (or have lost the) weekly reports for that period.
Dip 20-22 July - Site was in scheduled downtime (entire building power outage), after which SAM had problems for 4 hours (couldn't download testjob from CERN RB)

File:071012-16.JPG

UKI-SOUTHGRID-CAM-HEP

As far as the site configuration is concerned, we didn't and don't have any such flaws. The actual reasons for most of the failures are mostly "unknown" because they are mainly caused by Condor-incompatible or broken middleware.
One of the biggest issues was the infamous "Condor and Maradona" error, which apparently causes jobs to fail. There was a hidden Condor-related bug in the middleware (until I discovered it last week), which I have now fixed, and I have started to believe that it was the main reason for the "Condor and Maradona" error at our site.

File:071012-17.JPG

UKI-SOUTHGRID-OX-HEP

The first drop in availability in early June was caused by an accidental upgrade of the site: an update was applied which did not support static pool accounts. yaim generated a buggy edg-mkgridmap.conf file (see EFDA-JET failing SAM CE tests thread on TB-SUPPORT on 7 June for more info and fix applied). Between the 13th and 17th of July, hundreds of biomed jobs brought the site down. The gatekeeper lost track of its processes and kept spawning new ones; the fix consisted of temporarily shutting down the gatekeeper while cleaning up its files and processes.

September 10th - 21st: New SL4 cluster was being set up and this caused instability in the BDII; separating this function off the CE helped stabilise the system.
Oct 3rd - 9th: Replica Management failures started when the DN of the SAM test submitter changed; this exposed a typo in the site-info.def file which had propagated around the service nodes.

File:071012-18.JPG

UKI-SOUTHGRID-RALPP

The big drops on June 3rd to 5th and July 21st to 23rd were due to the host running the Site BDII service crashing and an unscheduled air conditioning outage caused by a water leak respectively. The low background level of failures seems to be due to random Replica Management failures which do not seem to have any pattern and fix themselves.

File:071012-19.JPG

Tier1

RAL-LCG2-Tier1

June 1st - High load on CE caused it to drop out of the information system; the cause of the high load is not known.
June 6th - A BDII timeout.
June 7th - 1 hour scheduled downtime for CE reconfiguration + 1 failure due to sending monitoring jobs to accounts that couldn't run jobs following the reconfiguration, quickly rectified.
June 8th - Network outage due to a problem on Tier 1 switches, requiring a reset.
June 8th - 11th - Sporadic failures due to overwriting of a customised lcgpbs jobmanager with the original in the reconfiguration on the 7th, leading to jobs in a Waiting state being removed; the customisation was reintroduced on the 11th (Monday).
June 14th - After some work to fix the problem, the previously backed-out reconfiguration was reintroduced, causing a monitoring job's success to not be noticed by the CE as the account mappings had changed.
June 15th - A further job getting lost (the difference in times published is due to proxy expiration) + an odd SRM failure, where the delete apparently failed but appeared to have deleted the file on our SE and not on the other SE it was stored on.
June 20th - 22nd - OPN connection to CERN lost, causing all SEs to be unable to contact a section of CERN's network.
June 23rd - A number of BDII timeouts, believed to be caused by a high number of queries from the LHCb VO Box.
June 24th - 2 SRM failures which did not leave any error messages in the logs of the Storage Element, making them difficult to debug.

File:071012-20.JPG