Operations Bulletin 050813


Bulletin archive


Week commencing 29th July 2013
Task Areas
General updates

Tuesday 30th July

  • GridPP news: Tomorrow is the last day that Stuart will be with GridPP ... thanks Stuart for your contributions! Coincidentally, Neasan will also be moving on after tomorrow so again thanks are due. Good luck to both!
  • GLUE2 BDII output at Liverpool: a restart fixed the problem. Midmon had been reporting CRITICAL (errors 1068, warnings 1071, info 1299). A quick way to re-check the published output is sketched after this list.
  • All UK ATLAS sites are now managed by the RAL FTS3, though not all site transfers use it at the moment. QMUL has an issue due to Storm.
  • A reminder that (EGI/NGI) operations procedures for certain key tasks can be found linked from here in the EGI wiki. In particular PROC13 lists the steps we are expected to follow when decommissioning a VO (something we will be doing shortly).
  • EGI is setting up a task force to explore CVMFS as a service for all EGI VOs. This follows Ian's talk at the Manchester Community Forum in April. Catalin will be leading the task force.
  • Early bird registration for the EGI Technical Forum will now end at midnight on Sunday 4th August 2013.
  • The Tier-1 will cease operation of the RAL AFS cell on the 31st October this year.
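For reference, here is a minimal sketch of re-checking a site's GLUE2 publication after a BDII restart, along the lines of the Liverpool fix above. The hostname is a placeholder and the actual Midmon check (glue-validator) is more thorough.

  # Restart the BDII on the affected service node.
  service bdii restart

  # Query the GLUE2 branch directly to confirm the node is publishing again.
  # Replace site-bdii.example.ac.uk with the real host; 2170 is the usual BDII port.
  ldapsearch -x -LLL -h site-bdii.example.ac.uk -p 2170 -b 'o=glue' \
      '(objectClass=GLUE2Service)' GLUE2ServiceID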



Tuesday 23rd July

  • There is a small workshop today on clouds and virtualisation (agenda).
  • Tomorrow there is a High Performance Networking Special Interest Group meeting (programme) taking place at UCL.
WLCG Operations Coordination - Agendas

Tuesday 16th July

  • SL6
    • EMI-3 voms-proxy-info: a third problem, with Java eating away memory. You can follow the story in tickets GGUS 94878 and GGUS 95574.
      • A fix is in the testing repositories and has been tested at Liverpool and Oxford.
    • UK status: 4 sites online, 3 testing, 7 with a plan, 3 without a plan (UCL, Durham, RALPP).
    • Presentation today at Atlas ADC weekly
    • Now checking with sites how LHCb is doing on SL6; it seems it is not running everywhere.
  • Monitoring
  • Next Coord meeting Thursday 18/7/2013

Tuesday 9th July

  • SL6
    • The scalability problem in the new Atlas software validation system has been solved.
    • VOMS is now in the EMI-3 repository; no testing or product-team (PT) repositories are needed.
    • UK status: 3&1/2 sites online, 3 testing, 7 with a plan, 4 without a plan (UCL, Durham, RALPP, SUSX).
    • HS06: T0 tests on the compilers didn't give significant differences. HEPiX has started an SL6 HS06 page where sites are welcome to post their results: SL6 HS06 benchmark results.
  • Monitoring
    • A WLCG Monitoring consolidation group has been formed to consolidate the WLCG monitoring. It doesn't include all monitoring (a portion developed by the experiments is not included), but it does cover the well-known dashboards.
      • WLCG monitoring Initial status.
      • First meeting last week. The experiments have already given a first evaluation; sites will be represented via WLCG Ops Coordination. To get feedback from sites, a group has been set up to collect sites' opinions (see Maria's slide). Anyone interested should contact Pepe Flix (jflix@NOSPAMpic.es). David Crooks and Kashif might want to be part of it as this touches on the GridPP core tasks.
    • Among the things worth discussing:
      • myWLCG vs SUM tests: both get their information from the same source, i.e. Nagios.
      • The personalised dashboard looks interesting but was never publicised much.
      • Site monitoring requirements: for example, SUM tests not representing the real experiment status.
Tier-1 - Status Page

Tuesday 30th July

  • Castor 2.1.13 upgrades for the CMS & LHCb instances completed successfully last Tuesday. Castor 'GEN' instance upgrade (ALICE plus non-LHC VOs) today.
  • Problems with batch server (pbs_server) resolved.
  • Testing of the alternative batch system (Condor/ARC CEs/SL6) is proceeding. We are working on opening this up to all supported VOs as soon as possible.
Storage & Data Management - Agendas/Minutes

Tuesday 15th July

  • Three sites now run all their UK FTS traffic via the 'FTS3' service as a test. Mostly successful; a small issue with a few US sites needs to be resolved before taking the tests further.

Tuesday 28th May

  • The 'Big Data' agenda is being compiled here. There is also now a suggestion for a cross-disciplinary clouds and virtualisation workshop in July; the idea is 'in progress' but no more detail is yet available.


Accounting - UK Grid Metrics HEPSPEC06 Atlas Dashboard HS06

Tuesday 23rd July

  • Sites moving to SL6 are reminded of the need to re-benchmark their WNs. Some sites have updated the wiki already and provide an idea of the performance change.
  • There is an ongoing PMB discussion about the timeline for the next Tier-2 hardware tranche. Please let Pete or Jeremy know if your site will benefit from a spend this financial year.

Tuesday 30th April

  • A discussion is starting about how to account for and reward disk that is reallocated to LHCb. By way of background, LHCb is changing its computing model to make greater use of Tier-2 sites. They plan to start with a small number of big/good T2 sites in the first instance, and commission them as T2-Ds with disk. Ideally such sites will provide >300TB, but for now may allocate 100TB and build it up over time. Andrew McNab is coordinating the activity for LHCb. (Note the PMB is already aware that funding was not previously allocated for LHCb disk at T2s.)

Tuesday 12th March

  • APEL publishing stopped for Lancaster, QMUL and ECDF

Tuesday 12th February

  • SL HS06 page shows some odd ratios. Steve says he now takes "HS06 cpu numbers direct from ATLAS" and his page does get stuck every now and then.
  • An update of the metrics page has been requested.
Documentation - KeyDocs

See the worst KeyDocs list for documents needing review now and the names of the responsible people.

Tuesday 23rd July

  • Only minor updates to the keydocs mentioned last week as in need of attention/review. Please could everyone review the documents for which they are responsible.

Tuesday 16th July

  • Many key docs have reached their validity limit and need reviewing.


Tuesday 30th April

Tuesday 9th April

  • Please could those responsible for key documents start addressing the completeness of documents for which they are responsible? Thank you.

Tuesday 26th February

KeyDocs monitoring status: Grid Storage(7/0) Documentation(3/0) On-duty coordination(3/0) Staged rollout(3/0) Ticket follow-up(3/0) Regional tools(3/0) Security(3/0) Monitoring(3/0) Accounting(3/0) Core Grid services(3/0) Wider VO issues(3/0) Grid interoperation(3/0) Cluster Management(1/0) (brackets show total/missing)

Interoperation - EGI ops agendas

Monday 22nd July

  • Problems with recent versions of VOMS, WMS, UI and Storm. New release of dCache that supports SHA-2 proxies.
  • S/w releases: dCache 2.6.5, which has support for SHA-2 certificates. A backport of SHA-2 support to the 2.2.* series is expected by the end of the month.
  • S/w issues: the VOMS server doesn't (always) start the resource BDII automatically. This makes the SHA-2 probe fail, because in many cases there is no entry in the information system for the VOMS. There was some discussion of how to handle this, as it gives a false positive. The probes will be removed for the moment, with the expectation of a VOMS update to fix the BDII problem in short order; once that's done the alarms will be re-enabled. If no quick fix is forthcoming this will be looked at again (and put in abeyance). Therefore RoD will be able to close the SHA-2 probe on VOMS servers if they think it's the right way to handle it; or, if there is a ticket already open, the ticket can be left open until resolved.
  • Gridsite problem: this affects the UI, WMS and LB. The latest version of Gridsite breaks on proxies with a '-' in them, which shows up as intermittent failures when attempting to delegate proxies. A workaround is to yum downgrade gridsite gridsite-libs on the WMS or yum downgrade gridsite-commands gridsite-libs on the UI (see the sketch after this list).
  • Storm: performance problems with the current version of Storm mean it isn't certified; hence the SHA-2 tests for Storm are off, because there is no certified version of Storm that supports SHA-2.
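A short sketch of the Gridsite downgrade workaround described above; the package names are as given in the bullet, and the exclude is optional, to stop a routine update pulling the broken version back in:

  # On a WMS affected by the intermittent delegation failures:
  yum downgrade gridsite gridsite-libs

  # On a UI:
  yum downgrade gridsite-commands gridsite-libs

  # When running further updates, skip gridsite until a fixed release appears:
  yum --exclude='gridsite*' update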
gLite support calendar.


Monitoring - Links MyWLCG

Tuesday 23rd July

Tuesday 18th June

  • David C is taking feedback on the Graphite implementation presented at the HEPSYSMAN meeting. Also considering integrating Site Nagios.
  • Glasgow dashboard now packaged and can be downloaded here.
On-duty - Dashboard ROD rota

Tuesday 23rd July

  • Looking at options to supplement ROD team. Tier-1 may provide some effort.
  • The Operations Dashboard is full of SHA-2 critical alarms. As most of the sites are failing one or more SHA-2 tests, tickets are being created against most sites. These alarms are generated by the Midmon Nagios, which checks whether the service endpoint is SHA-2 compliant. A list of SHA-2 ready middleware has been produced, as has a summary of the related SAM tests. Most of the alarms relate to the CREAM CE. The easiest way to solve this is to upgrade to the latest EMI-2 or EMI-3 release (a rough sketch is below); the baseline release for the CREAM CE is update 10, released in EMI on 2nd April 2013.
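As a rough illustration only (details vary by site and are assumptions here), bringing a CREAM CE up to a SHA-2 capable release is essentially a standard update from the EMI repositories already configured on the node, followed by a YAIM reconfiguration:

  # Check which CREAM CE packages are currently installed.
  rpm -qa | grep -i cream

  # Update from the EMI-2/EMI-3 repositories configured on the node.
  yum clean all
  yum update

  # Re-run YAIM for the CREAM CE node type (path and node type assume a
  # standard EMI/gLite layout; add your other node types as appropriate).
  /opt/glite/yaim/bin/yaim -c -s /opt/glite/yaim/etc/site-info.def -n creamCE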
Rollout Status WLCG Baseline

Tuesday 9th July

  • New EMI2 and EMI3 release yesterday. No staged rollout requests yet. Imperial upgraded their WMS and they have been somewhat shaky ever since.

Tuesday 18th June

  • New EMI3 CE coming into SR. Liverpool will test.
  • A lot of EMI3 testing done at Brunel.
  • The EMI-3 testing page contains all the issues I am aware of. It's a wiki though, so if you find an issue, please put it in the appropriate category.

Tuesday 14th May

  • A reminder. Please could sites fill out the EMI-3 testing contributions page. This is for all testing not just SR sites as we want to know which sites have experience with each component.



Security - Incident Procedure Policies Rota

Tuesday 30th July

  • One UK site recently appeared on the pakiti critical list.

Monday 22nd July

  • A summary of the SSC6 findings was circulated last week. Questions?


Services - PerfSonar dashboard | GridPP VOMS

Tuesday 23rd July

  • PerfSONAR: the issues with the WLCG mesh appear to be understood and a new minor release (e.g. 3.3.1) is likely. In the meantime, please could sites upgrade by following the instructions here but leave the WLCG mesh URL (tests-wlcg-all.json) commented out (a rough sketch follows this list). Please also update the site progress page.
  • Where are we with the VOMS rollout?
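A hedged sketch of keeping the WLCG mesh URL commented out after the upgrade; the file path below assumes the standard perfSONAR-PS mesh-configuration agent layout of that era, so adjust it to your installation:

  # Assumed location of the mesh-configuration agent file; check your install.
  CONF=/opt/perfsonar_ps/mesh_config/etc/agent_configuration.conf

  # Comment out any line pulling the WLCG-wide mesh until the fixed release is out.
  sed -i.bak '/tests-wlcg-all\.json/ s/^/#/' "$CONF"

  # Then re-run or restart the mesh-configuration agent as per the upgrade instructions.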

Monday 10th June

  • Issue with neurogrid.incf.org ownership. Is more guidance needed?
  • Where are we with the perfsonar mesh?
  • Are we ready for full rollout of the VOMS backups?


Tickets

Monday 29th July 2013 14.30 BST
There are 52 open UK tickets this week. It's business as usual, but with so many tickets the risk of me missing something is greater than usual, so let me know if I've skimmed over an important issue for your site or of interest to the UK.

Why so many tickets? We've been hit by several groups of tickets at once: 13 gLExec tickets, 10 Decommissioning NGS sites, 7 SHA-2, 5 Unresponsive VOs (the makings of a terrible Christmas Carol). That leaves only 17 tickets outside these categories.

SHA-2 tickets
7 tickets are left for these, affecting Glasgow, Manchester, ECDF, Durham, Lancaster, Imperial and the Tier 1. Depending on when you plan to look at this, those who haven't set their tickets to On hold (Glasgow, IC, Tier 1) might want to do so if they aren't going to be tackling this soon. Kashif has extended the tickets.

VOMS
https://ggus.eu/ws/ticket_info.php?ticket=95792 (16/7)
The HyperK VO has been rolled out and just needs testing now. In Progress (29/7) Update: tests passed, ticket closed. Tickets have gone out asking WMS sites to support the new VO.
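As an illustration of the kind of test involved, a member of the new VO can check that the VOMS server issues attributes correctly (VO name taken from this bulletin; the output depends on the local vomses/LSC configuration):

  # Request a proxy with hyperk.org attributes from the new VOMS server.
  voms-proxy-init --voms hyperk.org

  # Confirm the VO attributes (FQANs) are present in the proxy.
  voms-proxy-info --all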

Unresponsive VOs (5/7)
https://ggus.eu/ws/ticket_info.php?ticket=95442 - Master ticket.
https://ggus.eu/ws/ticket_info.php?ticket=95474 - camont. Waiting for reply, can be closed (22/7)
https://ggus.eu/ws/ticket_info.php?ticket=95473 - gridpp. Jeremy waiting on new e-mail lists. In progress (29/7)
https://ggus.eu/ws/ticket_info.php?ticket=95472 - minos. The VO is probably dead, Jeremy is checking. Assigned (26/7)
https://ggus.eu/ws/ticket_info.php?ticket=95469 - supernemo. Probably also dead, Jeremy is checking here too. In progress (22/7)

GLEXEC (1/7)
Birmingham, Cambridge, Bristol, Sussex, ECDF, Durham, Sheffield, Manchester, Lancaster, UCL, RHUL, Queen Mary and EFDA-JET all have open gLExec tickets. Durham and IC are In Progress, ironing out a few bugs (although both tickets could do with a soothing update to keep us in the loop, particularly the Durham one). The rest are on hold, but Sussex also appear to be close to a solution. Others are quoting late summer before tackling gLExec deployment.

NGS Decommissioning
Not much to see here: 10 sites are on the "chopping block" but no progress is expected until late August, so they're all on hold.

TIER-1
https://ggus.eu/ws/ticket_info.php?ticket=96079 (23/7)
Atlas are seeing slow deletion rates, caused by occasional timeouts for some deletions. Shaun is investigating and can't see anything in the SRM layer causing this. Waiting for Atlas to get back with some examples from their logs to cross-reference. Waiting for reply (24/7)

https://ggus.eu/ws/ticket_info.php?ticket=91658 (20/2)
WebDAV for the RAL LFC. Catalin has asked for some advice on setting up the WebDAV interface. Is anyone able to help him? Waiting for reply (17/7) Update: Catalin has rolled out the WebDAV interface, and Chris W will now crusade to get more sites WebDAV-enabled. Catalin points out that RAL's Castor doesn't support WebDAV.
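For anyone wanting to try the interface, a hedged sketch of listing an LFC directory over WebDAV with the davix tools; the hostname, port and namespace path are placeholders and a valid grid proxy is assumed:

  # Create a proxy first (voms-proxy-init), then list a directory over WebDAV.
  # Host, port and LFC namespace path below are illustrative only.
  davix-ls -E /tmp/x509up_u$(id -u) --capath /etc/grid-security/certificates \
      https://lfc.example.ac.uk:443/grid/somevo/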

OXFORD
https://ggus.eu/ws/ticket_info.php?ticket=96090 (23/7)
This issue was brought up in the Storage meeting, but is worth mentioning here. Ewan had a problem where his one server with space on it ended up being given a weighting of zero by DPM gremlins. After fixing this, things are better, but the one empty disk server is under a lot of stress. As our sites fill up, issues like this could become more common. In progress (26/7) (Also see https://ggus.eu/ws/ticket_info.php?ticket=96071, which is the same issue for Sno+. It looks like the issue has been solved for them though.)
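For anyone hitting the same thing, a rough sketch of checking and restoring a filesystem weight on a DPM head node; the server and filesystem names are placeholders, and the --weight option assumes a reasonably recent DPM release:

  # Show pools and filesystems, including their current status and weight.
  dpm-qryconf

  # Restore a sensible (non-zero) weight on the affected filesystem.
  dpm-modifyfs --server disk01.example.ac.uk --fs /gridstore01 --weight 1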

DURHAM
https://ggus.eu/ws/ticket_info.php?ticket=96024 (22/7)
It looks like Durham have a black-hole node sucking jobs into oblivion: the culprit appears to be n36.dur.scotgrid.ac.uk from the Atlas monitoring. It could be a red herring, but things are quiet on the ticket from the site end. In progress (29/7) Update: Mike offlined the bad node, but now all their nodes are failing jobs! It never rains but it pours...
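For reference, a minimal sketch of taking a suspected black-hole worker node out of production, assuming the site runs Torque (node name taken from the ticket):

  # Mark the node offline so no new jobs are scheduled onto it, with a note for later.
  pbsnodes -o -N "suspected black hole - investigating" n36.dur.scotgrid.ac.uk

  # Bring it back once it has been fixed and re-tested.
  pbsnodes -c n36.dur.scotgrid.ac.uk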

SUSSEX
https://ggus.eu/ws/ticket_info.php?ticket=95165 (28/6)
Duncan poked Sussex to check their perfsonar installation. Not much word on this ticket for a while since Duncan updated with some information. In progress (1/7)

100IT
https://ggus.eu/ws/ticket_info.php?ticket=94780 (11/6)
The Cloud Site ticket. JK wants to hammer out a few things with NGI members before finalising this as it is the first Industrial Partner, but he's on leave (hopefully somewhere without this weekend's rain!), so it might be a while before this is finalised.

Nothing exciting in the solved case pile, but my eyes are about to fall out of my head after sifting through the active tickets so I may well have missed something.

Tools - MyEGI Nagios

Tuesday 23rd July

  • In a campaign to update VO ID card details it turns out that a few of our supported VOs are obsolete: babar, possibly supernemo and ngs.ac.uk. The first of these can be safely removed but we need to confirm our announcement process.

Tuesday 11th June

  • Installation of DIRAC instance at IC ready for 'another' test user.

Tuesday 13th November

  • Two issues were noticed during the Tier-1 power cut. The SRM and direct CREAM submission tests use the top BDII defined in the Nagios configuration to query the resource; these tests started to fail because the RAL top BDII was not accessible. The configuration doesn't use BDII_LIST, so I cannot define more than one BDII. I am looking into how to make this more robust (see the sketch at the end of this section).
  • The Nagios web interface was not accessible to a few users because GOCDB was down. This is a bug in SAM-Nagios and I have opened a ticket.

Site availability has not been affected by this issue because Nagios sends a warning alert when it is unable to find a resource through the BDII.
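By way of illustration, the usual client-side way to tolerate a top BDII outage is to list more than one endpoint in LCG_GFAL_INFOSYS, as sketched below; whether the SAM-Nagios probes can be made to honour something similar is exactly the open question above. Hostnames are examples only.

  # Recent lcg-utils/gfal clients accept a comma-separated list of top BDIIs
  # and fall back to the next one if the first is unreachable.
  export LCG_GFAL_INFOSYS="top-bdii.example.ac.uk:2170,lcg-bdii.cern.ch:2170"

  # Quick check that a given top BDII answers at all.
  ldapsearch -x -LLL -h top-bdii.example.ac.uk -p 2170 -b 'o=grid' -s base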


VOs - GridPP VOMS VO IDs Approved VO table

Friday 2 August 2013

  • SNO+ would like to streamline their submission
    • Is DIRAC possible? (See the sketch after this list.)
  • WebDAV support at RAL LFC
    • Firewall seems to be in the way.
  • HyperK.org
    • Waiting on WMS support from somebody.
    • One month so far since starting this off; can we do this quicker next time?
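For context, a rough sketch of what DIRAC-style submission looks like from the user side; the group name and JDL below are illustrative assumptions, not an actual SNO+ configuration:

  # Get a DIRAC proxy for the VO (group name is an assumption).
  dirac-proxy-init -g snoplus_user

  # A trivial JDL describing the job (written to hello.jdl).
  printf '%s\n' \
    'Executable = "/bin/echo";' \
    'Arguments = "hello from DIRAC";' \
    'StdOutput = "std.out";' \
    'StdError = "std.err";' \
    'OutputSandbox = {"std.out","std.err"};' > hello.jdl

  # Submit, then later retrieve the output with dirac-wms-job-get-output <JobID>.
  dirac-wms-job-submit hello.jdl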


Mon 29 July 2013

  • SNO+ status (Matthew Mottram): we're running a round of production currently.
    • (and will probably need to send out requests to Tier-2 sites to increase our storage quotas; I think there are a number of sites where we have of order 1TB currently, and we probably need to ramp up to 20TB at each eventually).
  • T2K.org (Jon Perkin): we're going to be running a small scale MC production over the next couple of weeks, nothing too drastic.
  • HyperK.org - New VO. VOMS server setup (https://ggus.eu/ws/ticket_info.php?ticket=95792).
    • WMS and LFC needed.
    • The Operations Portal has problems with the VOMS servers at Oxford and Imperial.
Site Updates

Actions


Meeting Summaries
Project Management Board - Members Minutes Quarterly Reports

Empty

GridPP ops meeting - Agendas Actions Core Tasks

Empty


RAL Tier-1 Experiment Liaison Meeting (Wednesday 13:30) Agenda EVO meeting

Wednesday 24th July

  • Operations report
  • The CMS & LHCb Castor instances were upgraded to version 2.1.13 yesterday (23rd).
  • The upgrade to CVMFS version 2.1.12 across the farm has largely removed the batch job set-up problem for LHCb and this problem will now be regarded as solved.
  • There was a hardware problem on an Atlas Disk Server (GDSS664) that was causing the RAID array not to rebuild. Atlas were warned of potential data loss. However, the fabric team managed to recover the server and all the data has now been successfully copied off.
  • There have been ongoing problems with the batch server (pbs_server) that are still being investigated.
  • Note: No meeting next week (31st July) owing to many staff attending an internal event.
WLCG Grid Deployment Board - Agendas MB agendas

Empty



NGI UK - Homepage CA

Empty

Events

Empty

UK ATLAS - Shifter view News & Links

Empty

UK CMS

Empty

UK LHCb

Empty

UK OTHER
  • N/A
To note

  • N/A