SAM availability: May-August 2007


SAM Test Results - 2nd analysis August 2007

The plots below map the SAM availability results for each GridPP site against the average across all the sites. The SAM summary data is available from: http://hepwww.ph.qmul.ac.uk/~lloyd/gridpp/sam.html. The figure for each day is based on the hourly tests run that day, from midnight to midnight.

The title of each graph has the following format: site name (% site availability for the last month : % average GridPP site availability for the last month : amount the site is above or below the average). The title font colour depends on whether the site monthly average is above (green) or below (red) the target for July (85%). The target for August remains at 85%; the target for September is 90%.

The number following the main title is the KSI2K hours of CPU time recorded in APEL for the site in the period 1st June to 25th July. The number is in a red font if there appears to be a problem with the site's APEL publishing during that period.
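For reference, the daily figure is simply the fraction of that day's hourly tests that passed, and the monthly figure is the average of the daily figures, compared against the target. The sketch below illustrates that arithmetic only; the hourly results, daily values and helper names are invented for illustration and are not taken from the SAM tooling.

```python
# Minimal sketch of how a daily/monthly availability figure can be derived
# from hourly SAM test results. All data here are invented for illustration;
# the real figures come from the SAM summary pages linked above.

def daily_availability(hourly_results):
    """Fraction of the day's hourly tests (midnight to midnight) that passed."""
    return sum(hourly_results) / len(hourly_results)

def monthly_availability(daily_figures):
    """Mean of the daily availability figures over the month."""
    return sum(daily_figures) / len(daily_figures)

# Example: one day with 24 hourly tests, two of which failed.
day = [True] * 22 + [False] * 2
print(f"daily availability: {daily_availability(day):.1%}")   # 91.7%

# Example month of daily figures, compared against the 85% target for July/August.
month = [0.92, 1.00, 0.75, 0.88, 0.95]   # invented daily values
target = 0.85
avg = monthly_availability(month)
print(f"monthly availability: {avg:.1%} "
      f"({'above' if avg >= target else 'below'} the {target:.0%} target)")
```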

UKI-LT2-Brunel

  • Sept 18th-19th - power cut and associated fallout
  • Sept 9th - DPM pool node hang
  • Aug 12th -- dpm.ftpd out-of-memory error affected both our SE and the NFS software area exported from one of the disk pools
  • July 17th to 19th -- DPM upgrade.

File:070822-1.JPG

UKI-LT2-IC-HEP

We had several downtimes in June due to an SRM upgrade and ICT site problems.

  • June 11th to 12th -- SRM-dCache upgraded to solve an ongoing CMS problem.
  • June 12th to 13th -- The ICT site upgraded the images on the cluster and libraries were missing; the site was failing SAM replica tests.
  • June 13th to 15th -- Problem with the ICT site firewall.
  • June 18th -- The last dCache upgrade had memory issues, so a new upgrade was scheduled.

File:070822-2.JPG

UKI-LT2-IC-LESC

The site's default SE is the same as IC-HEP's.

  • June 11th to 12th -- Site SRM-dCache upgraded.
  • June 18th -- The last dCache upgrade had memory issues, so a new upgrade was scheduled.
  • June 25th -- SFT test failures due to an expired job proxy.
  • July 3rd -- Brief disruption of the cluster (SGE) queuing system.

File:070822-3.JPG

UKI-LT2-QMUL

QMUL had the following issues:

  • June 5th -- our CE was hanging, and one of our storage nodes had a degraded disk.
  • June 22nd -- our SE had a problem with eth1.
  • June 27th -- I did not have enough time to manage everything.
  • July 15th -- QMUL was not visible in the BDII at RAL using the lcg-info command, although I could see our data using jxplorer. Matt Hodges cleared out all the cached data and that resolved the problem (a way of checking this directly is sketched after this list).
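Since the BDII is just an LDAP server (normally on port 2170 with base o=grid and the GLUE 1.x schema), the visibility problem described in the last bullet can be checked directly with an LDAP query, which is effectively what jxplorer was used for. The sketch below is an illustration only: the BDII host name is a placeholder, the filter pattern is an assumption, and it uses the ldap3 Python library rather than lcg-info.

```python
# Illustrative check of whether a site's CEs are visible in a given BDII.
# 'top-bdii.example.org' is a placeholder host; a real BDII listens on
# port 2170 and publishes GLUE 1.x objects under the base 'o=grid'.
from ldap3 import ALL, Connection, Server

server = Server("top-bdii.example.org", port=2170, get_info=ALL)
conn = Connection(server, auto_bind=True)   # anonymous bind

# Look for GlueCE entries whose unique ID mentions the site (here: QMUL).
conn.search(
    search_base="o=grid",
    search_filter="(&(objectClass=GlueCE)(GlueCEUniqueID=*qmul*))",
    attributes=["GlueCEUniqueID", "GlueCEStateStatus"],
)

if conn.entries:
    for entry in conn.entries:
        print(entry.GlueCEUniqueID, entry.GlueCEStateStatus)
else:
    print("No matching CE entries - site not visible in this BDII")
```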


File:070822-4.JPG

UKI-LT2-RHUL

  • 4-5 Oct - UPS shut down all three DPM pool nodes.
  • 8-12 Sept - DPM pool node (gridraid2) hung. Also possibly affected by the dpm.ftpd out-of-memory problem.
  • 1-7 Sept - undiagnosed nfs related problems with cluster head node and problems with DPM pool nodes - rebooted.
  • 16-23 Aug - One of the DPM file systems was full but the SAM tests were still using it, trying to copy files there and failing.

The problems not correlated with dips in the average are as follows:

  • 15 June - problem unknown. SFT-job failures.
  • On 6 July there were problems connecting to the site due to a DNS problem. The RHUL DNS is operated by our computer centre and therefore out of our control. It was quickly reported and fixed (~3 hours) but then took up to a day for some caches around the world to be updated, notably at CERN.
  • Site went down due to air conditioning failure around midnight 7 July (Sat/Sun), which was fixed late afternoon on Sun 8th July. The opportunity was then taken to perform general maintenance (o/s and kernel upgrades, fsck the DPM pool & VO s/w file systems, check a possible disk fault, make more space in /var for logs and make more swap space on some front-end nodes). These were completed by Wed 11 July. This extra time for maintenance was booked in as down time in the GOCDB once the site was already offline for the aircon failure.
  • 16 July - CE-sft-lcg-rm failures due to a badly set up pool node (it was running the wrong kernel by mistake and crashed under certain circumstances, several times that day, until fixed).

File:070822-5.JPG

UKI-LT2-UCL-CENTRAL

  • Early June: change of ops VOMS roles affecting SAM tests at DPM sites. Fixed by ACL change.
  • Outage 13-15 June due to work on power supply to machine room. Queues drained beforehand.
  • Various problems with file servers, including problems that occurred while expanding the available space on the SE from 2 to 20 TB.

File:070822-6.JPG

UKI-LT2-UCL-HEP

  • 1-11 June: ops VOMS roles were changed with no notice to sites; a fix for the ACLs on the DPM was needed. Further downtime was due to misconfiguration of the opssgm account at the site. Lack of manpower meant the problem was solved slowly.
  • 27-29 June / 2-3 July: The CE hung and needed to be rebooted. This triggered an auto-update that left the gLite installation in an unknown state (auto-update had been reluctantly turned off a few weeks earlier due to poor quality control on the rolling gLite releases). The site BDII was dead, so an upgrade to gLite r27 was forced and the site reconfiguration proved a bit laborious.
  • The last three dips are not really correlated with real downtime. We had a 2-hour downtime on 16 July due to an unscheduled power cut affecting the batch farm. We do experience random failures of the SAM lcg-rm tests; these are never correlated with real site (or SE) problems, throw rather obscure messages and go away by themselves. It is possible that these dips relate to such failures, although I do not remember enough failures to justify dips of this size.

File:070822-7.JPG

UKI-NORTHGRID-LANCS-HEP: These failures are due to an ongoing dCache problem; we have tickets and a discussion open with the developers. Update: this issue has now been resolved and we expect availability to remain at 100%, as seen in the latest (week-average) result.

File:070822-8.JPG

UKI-NORTHGRID-LIV-HEP: The big drop between July 13th and 17th was due to host certificates going out of date on some servers, coupled with an unexpected departmental power outage.

File:070822-9.JPG

UKI-NORTHGRID-MAN-HEP: The accounting for Manchester has been fixed and the CPU hours are 457,341, not ~60k.

The figure is not great anyway because, despite 9 VOs running, the absence of LHCb is noticeable. LHCb blacklisted Manchester for software area problems. They never told us we were blacklisted, and they did not reinstate us even when the problem disappeared, because that step is not automatic. They did react promptly when contacted, though.

File:070822-10.JPG

UKI-NORTHGRID-SHEF-HEP:

  • June 1st - 15th: Host CA out of date, requiring the DN to be corrected after the update
  • June 19th - 25th: Host CA again; the DN in the update was different and needed correcting
  • July 2nd - 10th: BDII problems resulting from the Glue-CORE.schema issue with the gLite 3.0 upgrade
  • July 29th: DNS crash here at Sheffield

As with the CA problem, the majority of the time was spent assessing the problem and finding the solution.

File:070822-11.JPG

UKI-SCOTGRID-DURHAM: As with many sites, the large dip on 03/06 was due to the broken SAM tests issue affecting all DPM sites. The dip around 25/06 was due to an unexpected change in the DNS servers by the Durham ITS department. On 13/07 the CE froze due to excessive swap and had to be rebooted (cause unknown). The dip around 19/07 was due to a power-down of the building to bring a backup generator online. The dip around 2007-08-12 was caused by the CE locking up while staff were on holiday and unable to respond (no remote intervention was possible; we tried).

File:070822-12.JPG

ScotGrid-Edinburgh: The one uncorrelated reduction in service (which still failed to take our availability below target) was due to a scheduled power cut at our host facility, which was undergoing building upgrades. Transient effects persisted over the weekend after the power cut and were resolved on the Monday (the 25th).

File:070822-13.JPG

UKI-SCOTGRID-GLASGOW: The large dip around 2007-06-03 was the screwup with SAM tests[1] [2], which affected all DPM sites. The dip on 2007-06-21 I just don't understand; it looks like there was a period when SAM tests weren't running properly on the cluster and things were timing out, but I can't see enough detail to work this out now. The wobble around 2007-07-07 was caused by a pheno user's VOMS proxy not renewing properly and the CE almost collapsing [3]. The large dip around 2007-08-12 was the CE being taken down by a pheno user (again) [4], [5].

File:070822-14.JPG File:070822-15.JPG

The first two large drops in availability are consequences of update 25. The Birmingham site has been consistently upgrading as soon as new updates are released. Applying upgrades early has a big impact on the stability of the site, as the recent updates have all come with numerous problems. The introduction of pools of software manager and production accounts has caused havoc (see lots of discussion regarding this on ROLLOUT). Conflicting instructions in the release notes and subsequent broadcasts or messages on ROLLOUT do not help site stability either: on the one hand sites are advised in the release notes to use sgm pool accounts, and on the other hand some VOs say they want to keep static accounts. Outside the turbulence of the updates, the site availability remains constant and above the target figure. This shows that the site efficiency would be much higher if it were to change its strategy and wait until a critical number of sites had applied the updates and all problems had been resolved.

The drop starting in early July and extending until the 15th of July was related to DPM permission problems which only affected the sgm accounts on our SE and not the rest of our grid services. Initially, Judith was mapped either to a regular pool account or to an sgm account; we failed the RM test whenever she was mapped to an sgm pool account. From the 11th of July she was consistently mapped to an sgm account and the availability dropped to zero. The drop is disproportionate to the user experience, as regular pool accounts were not affected. This period coincides with various DPM fixes for the vulnerability discovered by Kostas and with measures the site took to limit the impact of the security hole.

In considering site availability, the age of the cluster should also be considered (along with the care and feeding it requires to keep it in good shape). Our BaBar cluster is now rather old and has suffered from a lot of disk failures, which are sometimes not predictable (with smartd, for example) and which impact stability.

File:070822-16.JPG

UKI-SOUTHGRID-BRIS-HEP

  • The large dip in early June was the screwup with SAM tests[6] [7], which affected all DPM sites.
  • Dip 13-14 June - unsure, still digging up old info (unfortunately the workload that week meant we missed researching the reasons and logging them in the weekly site reports)
  • Dip 1-4 July - unsure; the CIC pre-report archives contain no weekly reports (or have lost them) for that period.
  • Dip 20-22 July: Site was in scheduled downtime (power outage for the entire building); afterwards SAM had problems for 4 hours (it could not download the test job from the CERN RB).

File:070822-17.JPG File:070822-18.JPG

The first drop in availability in early June was caused by an accidental upgrade of the site: an update was applied which did not support static pool accounts, and yaim generated a buggy edg-mkgridmap.conf file (see the "EFDA-JET failing SAM CE tests" thread on TB-SUPPORT on 7 June for more information and the fix applied). Between the 13th and 17th of July, hundreds of biomed jobs brought the site down: the gatekeeper lost track of its processes and kept spawning new ones. The fix consisted of temporarily shutting down the gatekeeper while cleaning up its files and processes.

File:070822-19.JPG

The big drops on June 3rd to 5th and July 21st to 23rd were due, respectively, to the host running the site BDII service crashing and to an unscheduled air conditioning outage caused by a water leak. The low background level of failures seems to be due to random replica management failures, which do not seem to follow any pattern and fix themselves.

File:070822-20.JPG

  • June 1st - High load on the CE caused it to drop out of the information system; the cause of the high load is not known.
  • 6th - A BDII timeout.
  • 7th - A 1-hour scheduled downtime for CE reconfiguration, plus one failure due to monitoring jobs being sent to accounts that could not run jobs following the reconfiguration; quickly rectified.
  • 8th - Network outage due to a problem on Tier 1 switches, requiring a reset.
  • 8th - 11th - Sporadic failures due to a customised lcgpbs jobmanager being overwritten with the original during the reconfiguration on the 7th, leading to jobs in a Waiting state being removed; the customisation was reintroduced on the 11th (Monday).
  • 14th - After some work to fix the problem, the previously backed-out reconfiguration was reintroduced, causing a monitoring job's success not to be noticed by the CE because the account mappings had changed.
  • 15th - A further job was lost (the difference in the times published is due to proxy expiration), plus an odd SRM failure where the delete apparently failed but appeared to have deleted the file on our SE and not on the other SE it was stored on.
  • 20th - 22nd - OPN connection to CERN lost, leaving all SEs unable to contact a section of CERN's network.
  • 23rd - A number of BDII timeouts, believed to be caused by a high number of queries from the LHCb VO Box.
  • 24th - Two SRM failures which did not leave any error messages in the logs of the Storage Element, making them difficult to debug.