Operations Bulletin 040313

Bulletin archive

Week commencing 25th February 2013

Task Areas

General updates

Monday 25th February

A bug in the Linux kernel affects ATLAS jobs: the load gradually ramps up until the host becomes unresponsive and has to be rebooted. The bug seems to have been fixed on or about kernel-2.6.18-214.
Suggestion to use CertWizard
An agenda for Tuesday's OMB meeting
Wahid created a TCP tuning page.
Minutes of the 15th February GridPP Cloud meeting.

Tier-1 - Status Page

Tuesday 26th February

Ongoing problems with the batch farm not starting enough jobs over the past weeks
Problem with the ATLAS Castor instance for a few hours on Saturday evening. Fixed by the on-call team.
Planned network intervention this morning. FTS was drained ahead of this. The Tier-1 was disconnected for approximately 30 minutes.
Last week we declared 68 files lost to T2K following a disk server failure.
A small number of nodes now form an SL6 batch queue behind its own CE (lcgce12).
Ongoing testing of FTS version 3.

Storage & Data Management - Agendas/Minutes

Wednesday 13 Feb 2013

EGI Community Forum Preparations
- "Small" VOs - the user/community perspective
- More grids and clouds stuff -
DPM upgrades to 1.8.6.
The UK is in good shape on this, with many sites in the ATLAS "FAX" federation and the big CMS sites in the CMS equivalent.

Accounting - UK Grid Metrics HEPSPEC06 Atlas Dashboard HS06

Tuesday 12th February

SL HS06 page shows some odd ratios. Steve says he now takes "HS06 cpu numbers direct from ATLAS" and his page does get stuck every now and then.
An update of the metrics page has been requested.

Tuesday 30th October

Storage availability in the SL pages has been affected by a number of sites being asked by ATLAS to retire the ATLASGROUPDISK space token while the SUM tests were still testing it as critical. The availability will be corrected manually once the month ends. The sites affected, to different degrees, are RHUL, CAM, BHAM, SHEF and MAN.

Check publishing via: http://gstat2.grid.sinica.edu.tw/gstat/summary/Country/UK/

Documentation - KeyDocs

See the worst KeyDocs list for documents needing review now and the names of the responsible people.

Tuesday 26th February

KeyDocs monitoring status: Grid Storage(7/0) Documentation(3/0) On-duty coordination(3/0) Staged rollout(3/0) Ticket follow-up(3/0) Regional tools(3/0) Security(3/0) Monitoring(3/0) Accounting(3/0) Core Grid services(3/0) Wider VO issues(3/0) Grid interoperation(3/0) Cluster Management(1/0) (brackets show total/missing)

Thursday, 29th November

The Approved VOs document has been updated to automatically contain a table that lays out the resource requirements for each VO, as well as the maximum. We need to discuss whether this is useful - it seems that the majority of WN software requirements are passed around by word of mouth etc. Should this be formalized? Please see

   https://www.gridpp.ac.uk/wiki/GridPP_approved_VOs#VO_Resource_Requirements

This table will be kept up to date with a regular process that syncs it with the CIC Portal, should it prove to be useful.

Interoperation - EGI ops agendas

Monday 11th February

There was an EGI ops meeting today.
There is a list-match problem with EMI2 WMS (GGUS 90240)

Monday 28th January

There was an EGI ops meeting (https://wiki.egi.eu/wiki/Agenda-28-01-2013) yesterday.
EMI release: DPM 1.8.6 (Small update with security fixes) and VOMS 2.0.10-1 (Small update with security fixes)
EMI3 due in April. Looking for SR sites.
For UMD-2, CREAM is released to production, WMS had problems found.
CA update 1.52-1 under SR, release expected 30-01-2013 (so get ready to update the CA certs...), and SAM-Update 20

Rollup of 'known issues', gathered from tickets, ones that will affect UK Tier-2 sites pulled out here:
lcg-gt problems with dCache https://ggus.eu/tech/ticket_show.php?ticket=90807
proxy-renewal problems on EMI-1 WMS https://ggus.eu/tech/ticket_show.php?ticket=89801. (And, although fixed in EMI-2, EMI-2 WMS is broken for other reasons...)
EMI-2 WN - yaim bug with cleanup-grid-accounts https://ggus.eu/tech/ticket_show.php?ticket=90486
Use of Configuration Management tools (survey for sites): sites are asked to return the survey via the web form by 1st March. For a list of the questions in advance, see this document: https://documents.egi.eu/secure/RetrieveFile?docid=1557&version=1&filename=configuration-tools-survey-v1.pdf

gLite support calendar.

Monitoring - Links MyWLCG

Tuesday 5th February

Task will focus on probes and sharing of useful tools. Suggestions and comments welcome.

Monday 2nd July

DC has almost finished an initial ranking. This will be reviewed by AF/JC and discussed at 10th July ops meeting

Wednesday 6th June

Ranking continues. Plan to have a meeting in July to discuss good approaches to the plethora of monitoring available.

Current priority is ranking the tools available.

Glasgow dashboard now packaged and can be downloaded here.

On-duty - Dashboard ROD rota

Tuesday 12th February

Need all ROD members to complete availability survey for the rota.

Monday 21st January

Good week, with only a few downtimes and long lived alarms. All outstanding alarms are covered by tickets as of now.
As summarised in Daniela's handover from last week, several sites have red COD-level status because the tickets are more than a month old. This is because the WNs have not been upgraded, as the new tarballs are not yet available, and that in turn raises security alarms. Some details in this ticket: https://ggus.eu/ws/ticket_info.php?ticket=90184

Tuesday 15th January

Main issue relates to COD tickets as mentioned last week.

Monday 7th January

Several sites are in the red due to the middleware tickets being older than 30 days. We got a COD ticket for this. Although the ticket was filed as top priority, COD did not answer, so I reset the priority to something sensible. We aren't the only ones hit by this problem.
At the moment the security alerts don't seem to update on the dashboard - at least ceprod08 has not cleared all day.

Rollout Status WLCG Baseline

Tuesday 5th February

National overview page updated on 4th February. Please check your site information!

References

The Staged Rollout pages (now separated into EMI1 & 2) and the page listing the deployed versions are built from information extracted from the BDII, so they should all be reasonably up to date:
http://www.hep.ph.ic.ac.uk/~dbauer/grid/staged_rollout.html
http://www.hep.ph.ic.ac.uk/~dbauer/grid/staged_rollout_emi2.html
http://www.hep.ph.ic.ac.uk/~dbauer/grid/state_of_the_nation.html

Security - Incident Procedure Policies Rota

Tuesday 26th February

Another local privilege escalation vulnerability, affecting Linux kernels 3.3-3.8. Vendor-supplied kernels in RHEL/SL are not vulnerable (https://bugzilla.redhat.com/show_bug.cgi?id=915052), but Fedora 18 (fc18) and Ubuntu 12.10 need upgrading. Code to exploit this vulnerability is widely available.

Tuesday 19th February

Local privilege escalation kernel vulnerability (CVE-2013-0871) posted to oss-security. Fixed upstream in January 2013. Red Hat kernel confirmed vulnerable (https://bugzilla.redhat.com/show_bug.cgi?id=911937).

Services - PerfSonar dashboard | GridPP VOMS

Monday 18th February

PerfSonar tests to BNL reveal poor rates for several sites since upgrade

Tuesday 5th February

NGS VOMS to be switched off this week

Tickets

Monday 25th February 2013 15.00 GMT 30 open tickets for the UK this week. As usual everyone's doing a good job of keeping things tidy.

TIER 1 https://ggus.eu/ws/ticket_info.php?ticket=91687 (21/2) Of interest- the epic.vo.gridpp.ac.uk request for access to the RAL WMS. The RAL chaps are working on it. In progress (21/2)

https://ggus.eu/ws/ticket_info.php?ticket=91658 (20/2) Chris W has asked for webdav access to the RAL LFC (essentially a request for an upgrade). Wahid has commented that this might want to wait a couple of weeks for the next LFC release- which the Tier 1 will likely do. In progress (22/2)

https://ggus.eu/ws/ticket_info.php?ticket=91029 (30/1) FTS problem when dealing with atlas robot certificates. A fix on the srm side isn't going to be coming anytime soon, and unless atlas want to switch to colon-less DNs for their robot names this ticket will probably need to be put on hold - but it might be worth asking atlas if they're willing to change the robot DNs. In progress (25/2)

https://ggus.eu/ws/ticket_info.php?ticket=90528 (17/1) SNO+ jobs submitted to the RAL WMSes aren't being sent to Sheffield. It works for lcgwms03, but not for lcgwms02. After further investigation revealed no earthly explanation for this behaviour, Catalin proposed limiting SNO+ to the working WMS. Waiting for reply (19/2)

GLASGOW https://ggus.eu/ws/ticket_info.php?ticket=91439 (12/2) The atlas transfer errors have come back, although atlas forgot to switch the ticket back from "Waiting for reply" after they replied. Waiting for reply (24/2)

https://ggus.eu/ws/ticket_info.php?ticket=90362 (13/1) Switching the ngs VO over to the GridPP VOMS server. The last Glasgow CE should be switched over now, Gareth has asked for a test. Waiting for reply (25/2)

RALPP https://ggus.eu/ws/ticket_info.php?ticket=91377 (11/2) This atlas transfer failure ticket is looking quite neglected, last reply was from atlas a while ago confirming that the errors still existed - it's likely worth asking again if the problem persists. In progress (13/2)

SOLVED? Chris brought this t2k ticket back to my attention last week: https://ggus.eu/ws/ticket_info.php?ticket=90235 (solved by the SYSTEM on 30/1, should it have been? Edit: reading up the ticket I see that it was solved by Catalin, my eyeballs failed me) and its "parent" ticket: https://ggus.eu/ws/ticket_info.php?ticket=89105. It concerns WMSs failing to renew proxies. I can't say I have a clue what's going on, but ticket 89105 has been reassigned to "Operations" and hasn't been picked up by anyone; it has looked very neglected for the last month. If the original problem is still going then we will need to make some noise about this.

Tools - MyEGI Nagios

Tuesday 13th November

Noticed two issues during the Tier-1 power cut. The SRM and direct CREAM submission tests use the top BDII defined in the Nagios configuration to query for the resource. These tests started to fail because the RAL top BDII was not accessible. The configuration doesn't use BDII_LIST, so I cannot define more than one BDII. I am looking into how to make this more robust.
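As an illustration of the kind of robustness being looked for, here is a minimal sketch of falling back across several top BDIIs instead of depending on one. It is not how SAM-Nagios works and not a patch for it. The comma-separated `host:port` list format, the function names and the example hostnames are all assumptions made for the example; 2170 is the usual BDII LDAP port.

```python
import socket

DEFAULT_BDII_PORT = 2170  # usual BDII LDAP port


def parse_bdii_list(value, default_port=DEFAULT_BDII_PORT):
    """Split an (assumed) comma-separated 'host[:port]' string into (host, port) pairs."""
    endpoints = []
    for item in value.split(","):
        item = item.strip()
        if not item:
            continue
        host, sep, port = item.partition(":")
        endpoints.append((host, int(port) if sep else default_port))
    return endpoints


def first_reachable(endpoints, timeout=5.0, connect=socket.create_connection):
    """Return the first (host, port) that accepts a TCP connection, or None.

    'connect' is injectable so the logic can be exercised without a network.
    """
    for host, port in endpoints:
        try:
            conn = connect((host, port), timeout)
        except OSError:
            # Unreachable, refused, timed out or unresolvable: try the next one.
            continue
        conn.close()
        return (host, port)
    return None


if __name__ == "__main__":
    # Hypothetical hostnames, for illustration only.
    candidates = parse_bdii_list("bdii-a.example.org:2170,bdii-b.example.org")
    print(first_reachable(candidates, timeout=2.0))
```

A TCP connect only shows that something is listening; a real probe would presumably also want to check that the BDII answers an LDAP query with fresh data before relying on it.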

The Nagios web interface was not accessible to a few users because GOCDB was down. This is a bug in SAM-Nagios and I have opened a ticket.

Site availability has not been affected by this issue, because Nagios only sends a warning alert when it cannot find a resource through the BDII.

I have applied a patch to fix Nagios jobs segfaulting on SL6 WNs. https://tomtools.cern.ch/jira/browse/SAM-2999

VOs - GridPP VOMS VO IDs Approved VO table

Monday 26th February 2013

NGS VOMS server. Durham fixed. Last site is Glasgow, and I'm running tests now. Hopefully this should now be fixed https://ggus.eu/ws/ticket_info.php?ticket=90356 - note that this has taken 3 months to complete.
SNO+ reports lcg-cp timeouts for large files. I suspect this is a problem with the UI.
Issues with Proxy renewal.
- Certificate for RAL myproxy server doesn't match advertised hostname (how does this work at all?).
- Other myproxy issues as well. GGUS#99105 GGUS#9172

Monday 18 February 2013

VomsSnooper http://www.gridpp.ac.uk/news/?p=2695 - worth doing - found 4 errors.
NGS VOMS server. 2 Sites remaining: Glasgow and Durham. Progress on both. https://ggus.eu/ws/ticket_info.php?ticket=90356

Monday 12 February 2013

3 sites remaining to enable the GridPP VOMS server for VOs previously supported by the NGS VOMS. https://ggus.eu/ws/ticket_info.php?ticket=90356 is the parent ticket for this.

SNO+ Questions

Jobs appear to fail, but have uploaded output and it is in LFC

MC production
- Want 2-3 people managing this
- Shifters monitoring sites and filing tickets
- How best to manage certificates - currently upload two proxies to myproxy - one for jobs to renew and one for the UI to renew.
- How best to do this - should they use a robot cert?

Monday 14 January 2013

NGS VOMS server: Please enable GridPP VOMS server
- Some sites have enabled the GridPP VOMS server, 7 sites have issues. https://ggus.eu/ws/ticket_info.php?ticket=90356 is a parent ticket for this

Neiss.org.uk
- Now have VO-ID card in operations-portal (previously CIC portal)
- GridPP/NGS VOMSs server issues
- NGS WMS hadn't enabled current CEs at QMUL and Lancs, so I've requested the GridPP WMSs enable it - as the VO is supported on GridPP sites.
- Would be a good use case for SARONGS - but they don't have the time to debug this.

T2K.org - lots of issues
- Have started a round of MC production
- Nagios now available (Thanks Kashif)- can sites fix issues: https://t2wlcgnagios.physics.ox.ac.uk/nagios/cgi-bin/status.cgi?servicegroup=VO_t2k.org&style=detail
- Lots of job failures for various reasons - including "Cannot move ISB" - seen at a number of sites.
- Reporting that proxies don't renew (CJW has tried to reproduce this and failed - proxies seem to be renewing)

Site Updates

Actions

Meeting Summaries

Project Management Board - Members Minutes Quarterly Reports

Monday 1st October

ELC work

Tuesday 25th September

Reviewing pledges.
Q2 2012 review
Clouds and DIRAC

GridPP ops meeting - Agendas Actions Core Tasks

Tuesday 21st August - link Agenda Minutes

TBC

RAL Tier-1 Experiment Liaison Meeting (Wednesday 13:30) Agenda EVO meeting

Wednesday 27th February

Operations report
The existing (very small) test SL6 batch queue is accessible by Atlas & CMS for tests. This is being expanded using part of the new purchase of CPU nodes (around 450 job slots) and will be offered to other VOs for use.
The meeting continues to use Vidyo as part of a four-week test.

WLCG Grid Deployment Board - Agendas MB agendas

October meeting Wednesday 10th October

NGI UK - Homepage CA

Wednesday 22nd August

Operationally few changes - VOMS and Nagios changes on hold due to holidays
Upcoming meetings Digital Research 2012 and the EGI Technical Forum. UK NGI presence at both.
The NGS is rebranding to NES (National e-Infrastructure Service)
EGI is looking at options to become a European Research Infrastructure Consortium (ERIC). (Background document.)
Next meeting is on Friday 14th September at 13:00.

Events

WLCG workshop - 19th-20th May (NY) Information

CHEP 2012 - 21st-25th May (NY) Agenda

GridPP29 - 26th-27th September (Oxford)

UK ATLAS - Shifter view News & Links

Thursday 21st June

Over the last few months ATLAS have been testing their job recovery mechanism at RAL and a few other sites. This is something that was 'implemented' before but never really worked properly. It now appears to be working well, allowing jobs to finish even if the SE is down or unstable when the job finishes.

Job recovery works by writing the output of the job to a directory on the WN should it fail when writing the output to the SE. Subsequent pilots will check this directory and try again for a period of 3 hours. If you would like to have job recovery activated at your site you need to create a directory which (atlas) jobs can write to. I would also suggest that this directory has some form of tmpwatch enabled on it which clears up files and directories older than 48 hours. Evidence from RAL suggests that it's normally only 1 or 2 jobs that are ever written to the space at a time and the space used is normally less than a GB. I have not observed more than 10GB being used. Once you have created this space, please email atlas-support-cloud-uk at cern.ch with the directory (and your site!) and we can add it to the ATLAS configurations. We can switch off job recovery at any time if it does cause a problem at your site. Job recovery would only be used for production jobs as users complain if they have to wait a few hours for things to retry (even if it would save them time overall...)
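For sites that would rather script the clean-up than use tmpwatch, the 48-hour rule could be sketched as below. This is only an illustration under assumptions: the function name and the example path are made up, age is judged by modification time, and nothing here is part of the ATLAS tooling.

```python
import os
import time

MAX_AGE_SECONDS = 48 * 3600  # the 48 hours suggested above


def clean_recovery_dir(root, max_age=MAX_AGE_SECONDS, now=None):
    """Delete files under 'root' older than max_age, then old empty subdirectories.

    Returns the list of paths that were removed. 'root' itself is never removed.
    """
    now = time.time() if now is None else now
    cutoff = now - max_age
    removed = []
    old_dirs = []
    for dirpath, dirnames, filenames in os.walk(root):
        try:
            # Note the directory's age before deleting anything inside it,
            # because deleting entries updates its modification time.
            if dirpath != root and os.lstat(dirpath).st_mtime < cutoff:
                old_dirs.append(dirpath)
        except OSError:
            continue
        for name in filenames:
            path = os.path.join(dirpath, name)
            try:
                if os.lstat(path).st_mtime < cutoff:
                    os.remove(path)
                    removed.append(path)
            except OSError:
                # The file vanished or is not ours to delete: skip it.
                continue
    # Deepest paths first, so children go before their parents.
    for dirpath in sorted(old_dirs, key=len, reverse=True):
        try:
            os.rmdir(dirpath)  # only succeeds if the directory is now empty
            removed.append(dirpath)
        except OSError:
            pass
    return removed


if __name__ == "__main__":
    # Hypothetical location; use whatever directory was reported to ATLAS.
    print(clean_recovery_dir("/scratch/atlas_job_recovery"))
```

Run from cron every hour or so, this keeps the space bounded while leaving recent output in place for later pilots to retry.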

UK CMS

Tuesday 24th April

Brunel will be trialling CVMFS this week, will be interesting. RALPP doing OK with it.

UK LHCb

Tuesday 24th April

Things are running smoothly. We are going to run a few small scale tests of new codes. This will also run at T2, one UK T2 involved. Then we will soon launch new reprocessing of all data from this year. CVMFS update from last week; fixes cache corruption on WNs.

UK OTHER

N/A

To note

N/A
