Operations Bulletin 110213


Bulletin archive


Week commencing 4th February 2013
Task Areas
General updates

Tuesday 5th February

  • UK upgrades in very healthy state - great job. Thank you to everyone for pulling together.
  • The monitoring links page has been updated.
  • There is a WLCG Operations Coordination meeting this Thursday at 14:30.
  • GlueValidator – will become a Nagios probe (March/April 2013)

Tuesday 29th January

  • A reminder that this week represents the standard cut-off for migration away from gLite.
  • Also a reminder that the daily WLCG operations meeting notes are available online. Useful to get an up-to-date and broad picture of WLCG activity.
  • ATLAS have a timeout issue with lcg_util and/or GFAL commands. This is fixed in EMI-2 Update 8, released on 28th January. Sites supporting ATLAS are requested to update as soon as possible.
  • We will in coming weeks revisit information publishing. Please check your site data in gstat2.
  • A proposal has been discussed by the WLCG MB to remove OPS VO testing from monthly WLCG reporting and to replace with separate reports for each LHC VO. (Ops tests will continue but not form part of the reporting).
  • Other MB news: Helge Meinhard claimed that for most people the migration to SL6 will be seen to happen when the alias for LXPLUS moves from SLC5 to SLC6. The proposed timetable is to move the default login for LXPLUS to SLC6 at the end of April 2013, but SLC5 will continue to be available after that date.
Tier-1 - Status Page

Tuesday 5th February

  • There were some problems with the Atlas Castor instance during the morning of Thursday 31st Jan - traced to an unresponsive disk server.
  • Roll-out of the new version of the Top BDII is now complete. There are now three EMI-2/SL6 BDIIs on newer hardware behind the alias.
  • Participating in Atlas FAX (Federated Atlas Xrootd) tests.
  • A small number of nodes now form a SL6 batch queue behind its own CE (lcgce12).
  • Ongoing testing of FTS version 3.
  • We are preparing to remove AFS clients from the worker nodes.
Storage & Data Management - Agendas/Minutes

Wednesday 23rd January

  • Storage Federations
    • See the slides on the storage meeting agenda for details.
    • ATLAS and CMS both pushing ahead with xrootd based requirements.
    • Whether there are real benefits remains to be seen.
    • The UK is in good shape on this, with many sites in the ATLAS "FAX" federation and the big CMS sites in the CMS equivalent.


Accounting - UK Grid Metrics HEPSPEC06 Atlas Dashboard HS06

Tuesday 30th October

  • Storage availability in SL pages has been affected by a number of sites being asked by ATLAS to retire the ATLASGROUPDISK space token while the SUM tests were still testing it as critical. The availability will be corrected manually once the month ends. Sites affected in different degrees are RHUL, CAM, BHAM, SHEF and MAN.

Friday 28th September

  • Tier-2 pledges to WLCG will be made shortly. The situation is fine unless there are significant equipment retirements coming up.
  • See Steve Lloyd's GridPP29 talk for the latest on the GridPP accounting.

Wednesday 6th September

  • Sites should check the ATLAS page reporting the HS06 coefficients because, according to the latest statement from Steve, that is what is going to be used. Note that the Atlas Dashboard coefficients are averages over time.

I am going to suggest using the ATLAS production and analysis numbers given in HS06 directly rather than using CPU seconds and trying to convert them ourselves as we have been doing. There no longer seems to be any robust way of doing the conversion, so we may as well use the ATLAS numbers, which are the ones checked against pledges etc. anyway. If the conversion factors are wrong then we should get them fixed in our BDIIs. No doubt there will be a lively debate at GridPP29!
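To make the suggestion concrete, the conversion we have been doing requires picking a per-core HS06 coefficient, and a wrong coefficient shifts the whole result. A minimal sketch of the arithmetic, with invented numbers purely for illustration (not real site or ATLAS values):

   # Illustrative arithmetic only: the CPU time and coefficients are made up.
   cpu_seconds = 3600000        # raw CPU time reported for a batch of jobs
   hs06_per_core = 9.5          # per-core coefficient we would have to choose ourselves

   # HS06-hours = CPU-hours x per-core HS06 power
   print("Coefficient 9.5:  %.0f HS06-hours" % (cpu_seconds / 3600.0 * hs06_per_core))

   # If the true coefficient were 11.0 the same work is worth roughly 16% more,
   # which is exactly the kind of discrepancy choosing our own factor introduces.
   print("Coefficient 11.0: %.0f HS06-hours" % (cpu_seconds / 3600.0 * 11.0))

Taking the production and analysis numbers that ATLAS already publish in HS06 sidesteps the choice of coefficient entirely.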

Documentation - KeyDocs

See the worst KeyDocs list for documents needing review now and the names of the responsible people.

Tuesday 29th January

KeyDocs monitoring status: Grid Storage(7/0) Documentation(3/0) On-duty coordination(3/0) Staged rollout(3/0) Ticket follow-up(3/0) Regional tools(3/0) Security(3/0) Monitoring(3/0) Accounting(3/0) Core Grid services(3/0) Wider VO issues(3/0) Grid interoperation(3/0) Cluster Management(1/0) (brackets show total/missing)

Thursday, 29th November

The Approved VOs document has been updated to automatically contain a table that lays out the resource requirements for each VO, as well as the maximum. We need to discuss whether this is useful - it seems that the majority of WN software requirements are passed around by word of mouth etc. Should this be formalized? Please see

   https://www.gridpp.ac.uk/wiki/GridPP_approved_VOs#VO_Resource_Requirements

This table will be kept up to date with a regular process that syncs it with the CIC Portal, should it prove to be useful.
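A minimal sketch of what such a sync could look like, assuming (purely hypothetically) that the portal exposes the VO ID cards as a single XML feed; the URL, element and attribute names below are placeholders rather than the real operations-portal interface, and the output is a MediaWiki table ready to paste into the page:

   # Hypothetical sync sketch - feed URL and XML field names are assumptions.
   import urllib.request
   import xml.etree.ElementTree as ET

   VOID_CARD_URL = "https://operations-portal.example.org/voIDCard/public/all"  # placeholder

   def fetch_vo_cards(url=VOID_CARD_URL):
       """Download the VO ID card feed and return the parsed XML root."""
       with urllib.request.urlopen(url, timeout=30) as resp:
           return ET.fromstring(resp.read())

   def wiki_table(root):
       """Render a MediaWiki table of per-VO resource requirements."""
       rows = ["{| class=\"wikitable\"", "! VO !! Max CPU time !! Scratch space"]
       for vo in root.findall("VO"):                 # assumed element name
           name = vo.get("Name", "unknown")          # assumed attribute
           req = vo.find("ResourceRequirements")     # assumed element
           cpu = req.get("MaxCPUTime", "?") if req is not None else "?"
           scratch = req.get("ScratchSpace", "?") if req is not None else "?"
           rows.append("|-\n| %s || %s || %s" % (name, cpu, scratch))
       rows.append("|}")
       return "\n".join(rows)

   if __name__ == "__main__":
       print(wiki_table(fetch_vo_cards()))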


Tuesday 6th November

  • Do we need the Approved VOs document to set out the software needs for the VOs?

Tuesday 23rd October

KeyDocs monitoring status: Grid Storage(7/0) Documentation(3/0) On-duty coordination(3/0) Staged rollout(3/0) Ticket follow-up(3/0) Regional tools(3/0) Security(3/0) Monitoring(3/0) Accounting(3/0) Core Grid services(3/0) Wider VO issues(3/0) Grid interoperation(3/0) Cluster Management(1/0) (brackets show total/missing)

Thursday 26th July

All the "site update pages" have been reconfigured from a topic oriented structure into a site oriented structure. This is available to view at https://www.gridpp.ac.uk/wiki/Separate_Site_Status_Pages#Site_Specific_Pages

Please do not edit these pages yet - any changes would be lost when we refine the template. Any comments gratefully received, contact: sjones@hep.ph.liv.ac.uk

Interoperation - EGI ops agendas

Monday 28th January

  • There was an EGI ops meeting yesterday (agenda: https://wiki.egi.eu/wiki/Agenda-28-01-2013).
  • EMI release: DPM 1.8.6 and VOMS 2.0.10-1 (both small updates with security fixes).
  • EMI3 due in April. Looking for SR sites.
  • For UMD-2, CREAM has been released to production; problems were found with the WMS.
  • CA update 1.52-1 under SR, release expected 30-01-2013 (so get ready to update the CA certs...), and SAM-Update 20


Monitoring - Links MyWLCG

Tuesday 5th February

  • Task will focus on probes and sharing of useful tools - suggestions and comments welcome.

Monday 2nd July

  • DC has almost finished an initial ranking. This will be reviewed by AF/JC and discussed at 10th July ops meeting

Wednesday 6th June

  • Ranking continues. Plan to have a meeting in July to discuss good approaches to the plethora of monitoring available.
  • Glasgow dashboard now packaged and can be downloaded here.
On-duty - Dashboard ROD rota

Monday 21st January

  • Good week, with only a few downtimes and long lived alarms. All outstanding alarms are covered by tickets as of now.
  • As summarised in Daniela's handover from last week, several sites have a red COD-level status because the tickets are more than a month old. This is because the WNs have not been upgraded (the new tarballs are not yet available), which results in security alarms being raised. Some details in this ticket: https://ggus.eu/ws/ticket_info.php?ticket=90184

Tuesday 15th January

  • Main issue relates to COD tickets as mentioned last week.

Monday 7th January

  • Several sites are in the red due to the middleware tickets being older than 30 days. We got a COD ticket for this; despite the ticket being filed as top priority, COD did not answer, so I reset the priority to something sensible. We aren't the only ones hit by this problem.
  • At the moment the security alerts don't seem to update on the dashboard - at least ceprod08 has not cleared all day.
Rollout Status WLCG Baseline

Tuesday 5th February

References


Security - Incident Procedure Policies Rota

Tuesday 5th February

  • Large Java update released by Red Hat [2] and [3]; SL5/SL6 versions expected imminently.
  • UPnP issue. Metasploit exploit released for Supermicro Onboard IPMI (X9SCL/X9SCM).

Monday 21st January

  • SHA-2 and glexec updates given at pre-GDB.
  • EUGridPMA agreed on timeline where SHA-2 becomes the default in August 2013. SHA-1 will still be requestable.

Tuesday 15th January

  • There has been a recent Java patch. Sites should consider rolling out updates to client machines (especially those containing user certificates). The most current version is "Version 7 Update 11".


Services - PerfSonar dashboard | GridPP VOMS

Tuesday 5th February

  • NGS VOMS to be switched off this week

Tuesday 22nd January

  • VOMS changes
  • There was an LHCONE P2P workshop in December (summary). Do we have coverage?

Tuesday 20th November

  • Reminder for sites to add perfSONAR services in GOCDB.
  • VOMS upgraded at Manchester. No reported problems. Next step to do the replication to Oxford/Imperial.
Tickets

Monday 4th February 2013, 14:30 GMT</br> We're hitting February with 43 open tickets - slowly whittling away at them! It's the first Monday of the month, so let's dive into them all.

NGI</br> https://ggus.eu/ws/ticket_info.php?ticket=90451 (15/1)</br> Grouping of the core services. Progress is being made, although no deadline for this work has been given. In Progress (4/2)

https://ggus.eu/ws/ticket_info.php?ticket=91081 (1/2)</br> Our ROD team has been ticketed in order to make sure they're keeping track of out of date services after the 1st February deadline. We'll discuss this in the meeting. In progress (4/2)</br>

TIER 1</br> https://ggus.eu/ws/ticket_info.php?ticket=86152 (17/9/2012)</br> "Correlated packet-loss on perfsonar host". This ticket is being considered within the scope of wider scale networking issues at RAL, but other aspects of the investigation are coming first. On hold (16/1)

https://ggus.eu/ws/ticket_info.php?ticket=91029 (30/1)</br> atlas were having a problem querying the FTS jobs, which if I'm reading the ticket right might have been caused by some transfers between castor and QMUL's storm going awry. Chris has offered to upgrade his storm to EMI2 if it's thought that would help, and has asked atlas what they'd like him to do. In Progress (4/2)

https://ggus.eu/ws/ticket_info.php?ticket=90151 (8/1)</br> neiss have been enabled on the RAL WMS, but some problems still need to be ironed out. In Progress (4/2)

https://ggus.eu/ws/ticket_info.php?ticket=90528 (17/1)</br> The RAL WMS isn't assigning SNO+ jobs to Sheffield. Still being investigated. In Progress (4/2)

https://ggus.eu/ws/ticket_info.php?ticket=91060 (31/1)</br> A CMS ticket (although it's not got CMS as its "Concerned VO") about glexec problems on a few workers. There were a few days when identity switching didn't work. More pool accounts have been requested, and when that's done the issue should be solved. In progress (4/2)

https://ggus.eu/ws/ticket_info.php?ticket=89733 (17/12/2012)</br> Chris' uncovering of a dodgy top-BDII node at RAL. A new BDII trinity went live today; hopefully that'll have solved the problems. We're now at the wait-and-see-if-it's-fixed stage. In progress (4/2)

RALPP</br> https://ggus.eu/ws/ticket_info.php?ticket=90244 (10/1)</br> Atlas migration from groupdisk. Waiting on atlas to finish moving data. With Brian on the other side of the planet it might need someone else to keep an eye on this and similar tickets. Waiting for reply (29/1)

https://ggus.eu/ws/ticket_info.php?ticket=90863 (27/1)</br> Atlas FTS errors on intra-site transfers/deletions. Looked to be a load related problem, possibly caused by the deletions. Did it come back? In progress (28/1)

OXFORD</br> https://ggus.eu/ws/ticket_info.php?ticket=86106 (14/9/2012)</br> Ye olde "low atlas sonar rates to BNL" ticket for Oxford. Has there been any further investigation of this issue? Does the problem still exist? We don't want to leave these tickets to rot. On hold (30/11)

https://ggus.eu/ws/ticket_info.php?ticket=90245 (10/1)</br> Oxford's atlas group disk migration ticket. Oxford seem to be mostly drained. Waiting for reply (28/1)

https://ggus.eu/ws/ticket_info.php?ticket=91117 (3/2)</br> atlas FTS failures; the problem seemed to be caused by high load on a DPM disk pool. Things look to have calmed down (did you set the server read-only?), so this ticket looks good for closing. In progress (4/2)

BRISTOL</br> https://ggus.eu/ws/ticket_info.php?ticket=90275 (10/1)</br> cms (I think it's cms) have ticketed sites about their cvmfs status. Winnie is working on this, but has time constraints. On hold (29/1)

https://ggus.eu/ws/ticket_info.php?ticket=90328 (11/1)</br> Stephen ticketed Bristol over some strange values published by their SE. Waiting to track down how a similar problem was fixed. In progress (31/1)

https://ggus.eu/ws/ticket_info.php?ticket=90361 (13/1)</br> Enabling the GridPP VOMS server ticket for the ngs VO - the Bristol edition. Winnie's put the ticket on hold. On Hold (29/1)

BIRMINGHAM</br> https://ggus.eu/ws/ticket_info.php?ticket=86105 (14/9/2012)</br> Birmingham's "low atlas sonar rate to BNL" ticket. The same comments as for the Oxford version apply to this one. Maybe we're lucky and the problem's evaporated! On hold (30/11/12)

GLASGOW</br> https://ggus.eu/ws/ticket_info.php?ticket=90862 (27/1)</br> Glasgow have a discrepancy between the used space advertised by the SRM and by their BDII. Under investigation; Stephen has asked that any findings get passed along to DPM support. In Progress (28/1)

https://ggus.eu/ws/ticket_info.php?ticket=89804 (18/12/12)</br> The Glaswegian atlas group disk migration ticket. After the initial changes this seems quiet, maybe too quiet. On hold (10/1)

https://ggus.eu/ws/ticket_info.php?ticket=91106 (2/2)</br> Atlas shifters noticed the Glasgow SE down. Things are settled now, so this ticket can probably be closed (remember that it's usually best NOT to leave it to a VO to close a ticket). In progress (4/2)

https://ggus.eu/ws/ticket_info.php?ticket=90966 (28/1)</br> The Glasgow WMS doesn't seem to be working for the londongrid VO. In progress (29/1)

https://ggus.eu/ws/ticket_info.php?ticket=90386 (14/1)</br> enmr.eu report that they can't run jobs when they use proxies containing VOMS group information. Hopefully this will be fixed when Glasgow roll out their new argus server. In progress (21/1)

https://ggus.eu/ws/ticket_info.php?ticket=90362 (13/1)</br> Enabling the GridPP VOMS server ticket for the ngs VO - Glasgow style. Hopefully this will be fixed with their new argus server. In progress (21/1)

https://ggus.eu/ws/ticket_info.php?ticket=89753 (17/12/2012)</br> Path MTU discovery problems from QMUL to Glasgow. Discovered to be a problem within Clydenet, held until it's fixed. On hold (23/1)

ECDF</br> https://ggus.eu/ws/ticket_info.php?ticket=90878 (27/1)</br> lhcb report cvmfs problems. Turned out to be a missing nfs mount on some workers causing jobs to have problems, things have been fixed and the bad jobs removed. Andy asks if LHCB jobs are doing better at their site. Waiting for reply (29/1)

https://ggus.eu/ws/ticket_info.php?ticket=86334 (24/9/2012)</br> Low atlas sonar rates to BNL ticket - Edinburgh edition. See my comments for Birmingham and Oxford. Wahid gave a brief update, things have been proceeding offline. On hold (16/1)

https://ggus.eu/ws/ticket_info.php?ticket=89356 (10/12/2012)</br> Wahid has given a statement about the need for the tarball to undergo more testing, and the ticket has been extended. On hold (31/1)

DURHAM</br> https://ggus.eu/ws/ticket_info.php?ticket=91072 (1/2)</br> Durham are having CREAM Nagios test failures - "teething troubles" for their updated services. In progress (1/2)

https://ggus.eu/ws/ticket_info.php?ticket=89825 (19/12/2012)</br> enmr.eu having trouble installing software on the Durham cluster. Ticket "On hold" but there seems to be some progress going on as Durham get their reinstalled services back up and running. On hold (2/2)

https://ggus.eu/ws/ticket_info.php?ticket=75488 (19/10/2011)</br> Ancient Compchem ticket. Mike reports that the new CE is up but needs the VO software reinstalling. On hold (1/2)

https://ggus.eu/ws/ticket_info.php?ticket=90358 (13/1)</br> Durham's enabling the gridpp voms for the ngs VO ticket. On hold until the current batch of work is complete. On hold (30/1)

https://ggus.eu/ws/ticket_info.php?ticket=90340 (12/1)</br> lhcb pilots aborting at Durham. Let's see how the reinstalled services work for them, we might want to ask the VOs in these tickets directly how things are going. On hold (1/2)

https://ggus.eu/ws/ticket_info.php?ticket=90393 (14/1)</br> Helloworld dteam jobs failing at Durham. All that has been written previously for the Durham tickets probably applies here! On hold (1/2)

LIVERPOOL</br> https://ggus.eu/ws/ticket_info.php?ticket=90243 (10/1)</br> The scouser atlas groupdisk migration ticket. John has stated that they stand ready to move space on atlas' word, which has yet to come. Waiting for reply (11/1)

LANCASTER</br> https://ggus.eu/ws/ticket_info.php?ticket=90242 (10/1)</br> The red-rose version of the atlas groupdisk migration ticket. The migration seems to have stalled atlas-side. On hold (4/2)

https://ggus.eu/ws/ticket_info.php?ticket=90395 (14/1)</br> dteam helloworld jobs fail at Lancaster. Tracked down to a CE being rubbish rather than a configuration error; the offending CE is due for downtime this week to correct its poor behaviour. On hold (4/2)

https://ggus.eu/ws/ticket_info.php?ticket=84461 (23/7)</br> t2k transfer failures to Lancaster. The problem has been greatly reduced, and the FTS channels have had their number of concurrent transfers turned down. Waiting to see how this goes. Waiting for reply (24/1)

https://ggus.eu/ws/ticket_info.php?ticket=85367 (20/8)</br> Pilot jobs for ilc failing at Lancaster, due to the same performance issues seen above. Hopefully it'll be no more after the reinstall. On hold (4/2)

https://ggus.eu/ws/ticket_info.php?ticket=88772 (22/11/2012)</br> One of Lancaster's clusters is giving out bad GlueCEPolicyMaxCPUTime, tracked to a bug in the dynamic publishing (https://ggus.eu/ws/ticket_info.php?ticket=88904). Waiting on a fix, which I don't think made it out in the last update. On hold (3/12)
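While the fix is awaited, a quick way to check what a CE is actually publishing is to query a top BDII for the attribute directly; the host and CE names below are placeholders rather than the real Lancaster or RAL endpoints - a sketch, not part of the ticket:

   # Check the published GlueCEPolicyMaxCPUTime for a CE via a top BDII.
   # Host and CE patterns are placeholders - substitute the real ones.
   import subprocess

   BDII = "top-bdii.example.org"        # any working top BDII
   CE_PATTERN = "*ce.example.ac.uk*"    # match the CE queues of interest

   cmd = ["ldapsearch", "-x", "-LLL",
          "-H", "ldap://%s:2170" % BDII, "-b", "o=grid",
          "(&(objectClass=GlueCE)(GlueCEUniqueID=%s))" % CE_PATTERN,
          "GlueCEUniqueID", "GlueCEPolicyMaxCPUTime"]
   print(subprocess.run(cmd, capture_output=True, text=True, timeout=30).stdout)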

RHUL</br> https://ggus.eu/ws/ticket_info.php?ticket=89751 (17/12/2012)</br> Path MTU discovery problems to RHUL. The RHUL networking team are following up with Janet. On hold (28/1)

IMPERIAL</br> https://ggus.eu/ws/ticket_info.php?ticket=89750 (17/12/2012)</br> IC's Path MTU discovery ticket. Again the ball is in Janet's court. On hold (16/1)

BRUNEL</br> https://ggus.eu/ws/ticket_info.php?ticket=90359 (13/1)</br> Brunel's ticket to enable the GridPP VOMS server for the ngs VO. Raul had a go at fixing it but no joy. In progress (21/1)

EFDA-JET</br> https://ggus.eu/ws/ticket_info.php?ticket=88227 (6/11/2012)</br> No dynamic publishing at EFDA-JET for biomed. Ideas appear to have been exhausted. In progress (23/1)

Tools - MyEGI Nagios

Tuesday 13th November

  • Two issues were noticed during the Tier-1 power cut. The SRM and direct CREAM submission tests use the top BDII defined in the Nagios configuration to query for resource information, and these tests started to fail because the RAL top BDII was not accessible. The configuration does not use BDII_LIST, so more than one BDII cannot be defined; I am looking into how to make this more robust.
  • The Nagios web interface was not accessible to a few users because GOCDB was down. This is a bug in SAM-Nagios and I have opened a ticket.

Site availability has not been affected by this issue because Nagios only raises a warning alert when it cannot find a resource through the BDII.
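As an illustration of the kind of fallback that would help here, the sketch below tries a hand-maintained list of top BDIIs in turn using the standard ldapsearch client and uses the first one that answers. The hostnames are placeholders and this is not how SAM-Nagios is currently configured, just one possible approach:

   # Sketch only: query a list of top BDIIs in turn and use the first that answers.
   import subprocess

   TOP_BDIIS = ["top-bdii-1.example.org", "top-bdii-2.example.org"]  # placeholder hosts

   def query_bdii(filterstr, attrs=()):
       """Run an ldapsearch against each top BDII until one responds."""
       for host in TOP_BDIIS:
           cmd = ["ldapsearch", "-x", "-LLL",
                  "-H", "ldap://%s:2170" % host,   # 2170 is the standard top-BDII port
                  "-b", "o=grid", filterstr] + list(attrs)
           try:
               out = subprocess.run(cmd, capture_output=True, text=True, timeout=30)
               if out.returncode == 0 and out.stdout.strip():
                   return host, out.stdout
           except (subprocess.TimeoutExpired, OSError):
               pass                                 # BDII unreachable - try the next one
       raise RuntimeError("no top BDII responded")

   if __name__ == "__main__":
       host, result = query_bdii("(objectClass=GlueService)", ["GlueServiceEndpoint"])
       print("answered by %s" % host)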


VOs - GridPP VOMS VO IDs Approved VO table

Monday 14th January 2013

  • Neiss.org.uk
    • Now have VO-ID card in operations-portal (previously CIC portal)
    • GridPP/NGS VOMS server issues
    • NGS WMS hadn't enabled current CEs at QMUL and Lancs, so I've requested the GridPP WMSs enable it - as the VO is supported on GridPP sites.
    • Would be a good use case for SARONGS - but they don't have the time to debug this.


Tuesday 9 January 2013

  • Please can VOs report publications (there's now a section on the Quarterly report for them)
  • Spring Cleaning of VO support
    • GridPP VOMS server support - affects ngs.ac.uk and neiss.org.uk - expect tickets soon.
    • Removal of VOs no longer in use - totalep and some others.
  • Neiss.org.uk: NGS WMS hasn't enabled QMUL and Lancs CEs. Should we support it on the GridPP WMSs?
  • T2k FTS transfers slowed due to copying files that already exist - T2k script now more robust.

Mon 17th December

Tue 4th December

Thursday 29 November

Tuesday 27 November

  • VOs supported at sites page updated
    • now lists number of sites supporting a VO, and number of VOs supported by a site.
    • Linked to by Steve Lloyd's pages


Tuesday 23 October

  • A local user wants to get onto the grid and set up his own UI. Do we have instructions?


Site Updates

Tuesday 22nd January

  • Site updates at Tuesday's meeting.


Meeting Summaries
Project Management Board - MembersMinutes Quarterly Reports

Monday 1st October

  • ELC work


Tuesday 25th September

  • Reviewing pledges.
  • Q2 2012 review
  • Clouds and DIRAC
GridPP ops meeting - Agendas Actions Core Tasks

Tuesday 21st August - link Agenda Minutes

  • TBC


RAL Tier-1 Experiment Liaison Meeting (Wednesday 13:30) Agenda EVO meeting

Wednesday 23rd January

  • Operations report
  • Roll-out of the SL6/EMI-2 Top-BDII nodes is now complete, with three nodes on newer hardware behind the alias, all running the SL6/EMI-2 version of the software.
  • H1 have been added to the small-VOs CVMFS system.
  • This meeting was the first of a series of four to use Vidyo. After that, a decision will be made on whether to stay with Vidyo or revert to EVO/SeeVogh.
WLCG Grid Deployment Board - Agendas MB agendas

October meeting Wednesday 10th October




NGI UK - Homepage CA

Wednesday 22nd August

  • Operationally few changes - VOMS and Nagios changes on hold due to holidays
  • Upcoming meetings Digital Research 2012 and the EGI Technical Forum. UK NGI presence at both.
  • The NGS is rebranding to NES (National e-Infrastructure Service)
  • EGI is looking at options to become a European Research Infrastructure Consortium (ERIC). (Background document.)
  • Next meeting is on Friday 14th September at 13:00.
Events

WLCG workshop - 19th-20th May (NY) Information

CHEP 2012 - 21st-25th May (NY) Agenda

GridPP29 - 26th-27th September (Oxford)

UK ATLAS - Shifter view News & Links

Thursday 21st June

  • Over the last few months ATLAS have been testing their job recovery mechanism at RAL and a few other sites. This is something that was 'implemented' before but never really worked properly. It now appears to be working well, allowing jobs to finish even if the SE is down or unstable when the job finishes.
  • Job recovery works by writing the output of the job to a directory on the WN should it fail when writing the output to the SE. Subsequent pilots will check this directory and try again for a period of 3 hours. If you would like to have job recovery activated at your site you need to create a directory which (atlas) jobs can write to. I would also suggest that this directory has some form of tmpwatch enabled on it which clears up files and directories older than 48 hours. Evidence from RAL suggests that it's normally only 1 or 2 jobs that are ever written to the space at a time and the space used is normally less than a GB; I have not observed more than 10GB being used. Once you have created this space, please email atlas-support-cloud-uk at cern.ch with the directory (and your site!) and we can add it to the ATLAS configurations. We can switch off job recovery at any time if it does cause a problem at your site. Job recovery will only be used for production jobs, as users complain if they have to wait a few hours for things to retry (even if it would save them time overall...)
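If tmpwatch isn't convenient, a small cron job along the following lines would do the same clean-up; the directory path is a placeholder and the 48-hour window matches the suggestion above - this is a sketch, not an ATLAS-provided tool:

   # tmpwatch-style clean-up for the ATLAS job-recovery directory (run from cron).
   import os
   import shutil
   import time

   RECOVERY_DIR = "/data/atlas/jobrecovery"   # placeholder path, writable by atlas jobs
   MAX_AGE = 48 * 3600                        # 48 hours, as suggested above

   def clean(root=RECOVERY_DIR, max_age=MAX_AGE):
       cutoff = time.time() - max_age
       for entry in os.listdir(root):
           path = os.path.join(root, entry)
           try:
               if os.path.getmtime(path) < cutoff:
                   if os.path.isdir(path):
                       shutil.rmtree(path)    # remove a stale sub-directory
                   else:
                       os.remove(path)        # remove a stale file
           except OSError:
               pass   # a pilot may have removed it already; ignore

   if __name__ == "__main__":
       clean()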
UK CMS

Tuesday 24th April

  • Brunel will be trialling CVMFS this week, which will be interesting. RALPP are doing OK with it.
UK LHCb

Tuesday 24th April

  • Things are running smoothly. We are going to run a few small-scale tests of new code. These will also run at Tier-2s, with one UK Tier-2 involved. Then we will soon launch a new reprocessing of all of this year's data. The CVMFS update from last week fixes cache corruption on WNs.
UK OTHER
  • N/A
To note

  • N/A