Operations Bulletin 180313

From GridPP Wiki

Bulletin archive


Week commencing 11th March 2013
Task Areas
General updates

Monday 11th March

  • Minutes from the GridPP cloud meeting of 1st March.
  • The agenda for the LHCONE/LHCOPN meeting on 17th and 18th March is now available. Well the link is available!
  • The EGI applications database has been relaunched.
  • Reminder for EMI-2: ARGUS needs a BDII entry now, because it has an LDAP server itself.
  • For those intending to submit an abstract to CHEP 2013, please note the deadline of 25th March (http://www.chep2013.org/bulletins/2).

Monday 4th March

  • EMI 1 software will reach end of security updates and support on 30-04-2013. An upgrade campaign has started.
  • How much space is needed in the SVN repository?
  • Security training (3rd/4th) and next HEPSYSMAN (4th/5th) taking place in June. Please see Pete's email.
  • Pre-GDB on clouds agenda for 12th March and GDB agenda for 13th March.
  • A gentle reminder to please contribute to the Blogs where you can. Thank you!
  • Last days to complete the EGI Configuration Tool Survey
  • Please let Jeremy know if you wish to seek GridPP funding to attend the HEPiX Spring workshop.
  • The agenda for this Thursday's WLCG Ops Coordination Team meeting.
Tier-1 - Status Page

Tuesday 12th March

  • There is an intervention on the main core switch in the Tier1 network ongoing today.
  • On Wednesday/Thursday (6/7 Mar) there were problems on the main RAL network that caused intermittent connectivity problems.
  • Planned roll out of change to the batch system to nice the batch jobs as part of investigations into failures at job set-up.
Storage & Data Management - Agendas/Minutes

Tue 5th March 2013

  • EGI Community Forum - Storage session Wed pm.
  • DPM upgrades: all sites are now at 1.8.6.
  • Rfio -> xrootd
    • CMS need xrootd (instead of rfio at DPM sites) - does this apply to T3?
    • General support to transition UK sites, also for ATLAS.
    • Working OK at ECDF/Glasgow test queues.
    • Planning to move a couple of production sites (ox and ed and ...) slowly.
  • Pointing some real users (eg. Small VOs) to WebDav interfaces
  • Cloud storage testing discussed at both storage and cloud meetings - see post.



Accounting - UK Grid Metrics HEPSPEC06 Atlas Dashboard HS06

Tuesday 12th March

  • APEL publishing stopped for Lancaster, QMUL and ECDF

Tuesday 12th February

  • SL HS06 page shows some odd ratios. Steve says he now takes "HS06 cpu numbers direct from ATLAS" and his page does get stuck every now and then.
  • An update of the metrics page has been requested.
Documentation - KeyDocs

See the worst KeyDocs list for documents needing review now and the names of the responsible people.

Tuesday 26th February

KeyDocs monitoring status: Grid Storage(7/0) Documentation(3/0) On-duty coordination(3/0) Staged rollout(3/0) Ticket follow-up(3/0) Regional tools(3/0) Security(3/0) Monitoring(3/0) Accounting(3/0) Core Grid services(3/0) Wider VO issues(3/0) Grid interoperation(3/0) Cluster Management(1/0) (brackets show total/missing)

Monday, 4th March

  • New draft document for putting a CE in downtime. It discusses the pros and cons of 3 approaches. Needs discussion to finalise.
  • EPIC integrated into Approved VOs, pending full acceptance.
  • Process commissioned to delete stale documents.

Thursday, 29th November

The Approved VOs document has been updated to automatically contain a table that lays out the resource requirements for each VO, as well as the maximum. We need to discuss whether this is useful - it seems that the majority of WN software requirements are passed around by word of mouth etc. Should this be formalized? Please see

   https://www.gridpp.ac.uk/wiki/GridPP_approved_VOs#VO_Resource_Requirements

This table will be kept up to date with a regular process that syncs it with the CIC Portal, should it prove to be useful.

Interoperation - EGI ops agendas

Monday 4th March

  • An EGI operations meeting agenda for today's meeting is now available.
  • SR: Large number of updates in UMD2 and UMD1. In particular, in the UMD1 release, the DPM, LFC and L&B are security updates. The UMD2 WMS is _not_ backwards compatible without a workaround, as described in the release notes: https://wiki.egi.eu/wiki/UMD-2:UMD-2.4.0
  • EMI-3 release expected 7th March, UMD-3 prioritisation underway
  • Argus should be in the Site-BDII; it has had the information provider since the EMI-2 release, so it's probably a good plan to update EMI-1 Argus servers. (As should VOMS servers; they've had the information providers in all EMI releases.)

Monday 11th February

  • There was an EGI ops meeting today.
  • There is a list-match problem with EMI2 WMS (GGUS 90240)


Monitoring - Links MyWLCG

Tuesday 5th February

  • Task will focus on probes and sharing of useful tools - suggestions and comments welcome.

Monday 2nd July

  • DC has almost finished an initial ranking. This will be reviewed by AF/JC and discussed at 10th July ops meeting

Wednesday 6th June

  • Ranking continues. Plan to have a meeting in July to discuss good approaches to the plethora of monitoring available.
  • Glasgow dashboard now packaged and can be downloaded here.
On-duty - Dashboard ROD rota

Tuesday 5th March

  • Handling tickets related to EMI-1 probes - what to expect.
  • Recommendation with respect to upgrading CE (drain first)

Tuesday 12th February

  • Need all ROD members to complete availability survey for the rota.

Monday 21st January

  • Good week, with only a few downtimes and long lived alarms. All outstanding alarms are covered by tickets as of now.
  • As summarised in Daniela's handover from last week, several sites have red COD-level status because the tickets are more than a month old. This is due to the WNs not having been upgraded for lack of the new tar-balls, which results in security alarms being raised. Some details in this ticket: https://ggus.eu/ws/ticket_info.php?ticket=90184
Rollout Status WLCG Baseline

Monday 4th March

  • EMI early adopters list by component.
  • Do we have a Staged Rollout list for EMI3?

Tuesday 5th February

References


Security - Incident Procedure Policies Rota

Tuesday 5th March

  • Two OpenAFS vulnerabilities announced (CVE-2013-1794 and CVE-2013-1795). Further details available at http://www.openafs.org/security. Updated RPMs for SL5/6 available.

Tuesday 26th February

  • Another local privilege escalation vulnerability affecting Linux kernels 3.3-3.8. Vendor-supplied kernels in RHEL/SL are not vulnerable (https://bugzilla.redhat.com/show_bug.cgi?id=915052) but Fedora 18 and Ubuntu 12.10 need upgrading. Code to exploit this vulnerability is widely available.

Tuesday 19th February



Services - PerfSonar dashboard | GridPP VOMS

Monday 18th February

  • PerfSonar tests to BNL reveal poor rates for several sites since upgrade

Tuesday 5th February

  • NGS VOMS to be switched off this week
Tickets

Monday 11th March 2013 14.30 GMT.
34 Open UK tickets this week. A quarter of them are EMI1 upgrade tickets, which are largely in hand.

NGI/UCL
https://ggus.eu/ws/ticket_info.php?ticket=92412 (11/3)
UCL is being threatened with suspension unless the NGI intervene within 10 days. I've assigned this to the ops team mailing list as NGI tickets can sneak under the radar. In Progress (11/3)

VOMS
https://ggus.eu/ws/ticket_info.php?ticket=92306 (7/3)
vo.earthsci.ac.uk would like to be added to the VOMS server. Robert has noticed that the domain earthsci.ac.uk hasn't been registered, so the name can't be used, and asks if the VO plans to register it. Waiting for reply (8/3)

EMI 1 RETIREMENT TICKETS
I won't go over them all; here are the two that stand out, as everyone else has laid out a plan.
https://ggus.eu/ws/ticket_info.php?ticket=92111 (RHUL) - Still just assigned.
https://ggus.eu/ws/ticket_info.php?ticket=91995 (Bristol) - I'm not sure how much Bristol need to upgrade (I think it's just their BDII), but no plans from Winnie yet.

For the sites needing to upgrade WMSii Daniela reported that it went quite smoothly for her using the instructions from http://www.eu-emi.eu/products/-/asset_publisher/1gkD/content/wms-1.

TIER 1
https://ggus.eu/ws/ticket_info.php?ticket=92266 (6/3)
Chris has ticketed the Tier 1 over their myproxy server's certificate. Stephen B references https://ggus.eu/ws/ticket_info.php?ticket=92065 (from Daniela), and the Tier 1 chaps aim to replace the host certificate on or about the 18th of March. In Progress (8/3)

https://ggus.eu/ws/ticket_info.php?ticket=91687 (21/2)
EPIC VO support on the RAL WMS. The VO has been enabled, but Tom is having problems. Could someone who admins the Scotgrid UI have a check that the EPIC VO gubbins are set up correctly please? In progress (7/3)

https://ggus.eu/ws/ticket_info.php?ticket=91658 (20/2)
Request for WebDAV support on the RAL LFC. Ricardo reports that the new version being waited on is in epel-testing and awaiting validation - if you're feeling brave it can be tested. In progress (5/3)

https://ggus.eu/ws/ticket_info.php?ticket=90528 (17/1)
Sno+ jobs not being assigned to Sheffield by one of the RAL WMSes. James from Sno+ has got back to us; they're okay with just using the working WMS. If this can be set up (if it hasn't already) the ticket can probably be closed. In progress (6/3)

MANCHESTER
https://ggus.eu/ws/ticket_info.php?ticket=92190 (6/3)
LHCB saw cvmfs-related job failures. Andrew and Alessandra identified the problem as cvmfs outgrowing its cache on several nodes. The caches have been increased and a savannah ticket filed with cvmfs. Looks like the ticket can be closed, but others might want to watch out for this. In progress (9/3)
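For sites wanting to watch out for the same thing, a check along these lines could be run on a worker node. This is only a minimal sketch and not a cvmfs tool: the function names, the cache directory and the limit are all hypothetical and would need to be set to whatever your nodes actually use.

```python
import os


def directory_size_bytes(path):
    """Sum the sizes of all files under path (symlinks are not followed)."""
    total = 0
    for root, _dirs, files in os.walk(path):
        for name in files:
            full = os.path.join(root, name)
            try:
                total += os.lstat(full).st_size
            except OSError:
                # The file vanished between listing and stat; skip it.
                continue
    return total


def cache_over_limit(cache_dir, limit_mb, threshold=0.9):
    """Return True if cache_dir holds more than threshold * limit_mb megabytes.

    cache_dir and limit_mb are supplied by the caller (hypothetical values,
    not read from any cvmfs configuration).
    """
    used_mb = directory_size_bytes(cache_dir) / (1024 * 1024)
    return used_mb > threshold * limit_mb
```

A call such as `cache_over_limit("/path/to/cache", 10000)` could then feed a local alarm; the path and the limit here are placeholders.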

SHEFFIELD
https://ggus.eu/ws/ticket_info.php?ticket=92299 (7/3)
Biomed are seeing invalid publishing for their VO at Sheffield. It could be that you're seeing the same problems that they saw at JET (https://ggus.eu/ws/ticket_info.php?ticket=88227) and fixed with an update. In progress (8/3)

Of interest to WMS admins:
Daniela brought this ticket to my attention:
https://ggus.eu/tech/ticket_show.php?ticket=92288
This problem sounds like it could be a right pain in the bum. From Daniela: "I would consider this a rather major bug, it wipes all done jobs from the LB every night, as a bonus leaving all the crud (sandboxes) on the WMS lying around without the users being able to retrieve them. (Though the fix is simple.)"

Tools - MyEGI Nagios

Tuesday 13th November

  • Noticed two issues during the Tier1 power cut. The SRM and direct CREAM submission tests use the top BDII defined in the Nagios configuration to query the resource. These tests started to fail because the RAL top BDII was not accessible. The configuration doesn't use BDII_LIST, so I cannot define more than one BDII. I am looking into how to make it more robust.
  • The Nagios web interface was not accessible to a few users because GOCDB was down. It is a bug in SAM-nagios and I have opened a ticket.

Site availability has not been affected by this issue, because Nagios sends a warning alert when it cannot find a resource through the BDII.
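One way to think about making the BDII lookup more robust is to try several top BDIIs in order and use the first that answers. The sketch below illustrates that idea only; it is not how the SAM-nagios probes work, the function names are hypothetical, and port 2170 is assumed to be the BDII port at the hosts you list.

```python
import socket


def is_reachable(host, port=2170, timeout=5.0):
    """Return True if a TCP connection to host:port succeeds.

    The default port 2170 is an assumption; change it if your BDIIs differ.
    """
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False


def first_available(hosts, check=is_reachable):
    """Return the first host for which check(host) is true, else None."""
    for host in hosts:
        if check(host):
            return host
    return None
```

For example, `first_available(["bdii-a.example.org", "bdii-b.example.org"])` would fall back to the second (placeholder) host if the first cannot be contacted. A TCP connect only shows that something is listening; a real probe would also want to run an LDAP query.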


VOs - GridPP VOMS VO IDs Approved VO table

Monday 4th March 2013

Monday 26th February 2013

  • NGS VOMS server. Durham fixed. Last site is Glasgow, and I'm running tests now. Hopefully this should now be fixed https://ggus.eu/ws/ticket_info.php?ticket=90356 - note that this has taken 3 months to complete.
  • SNO+ reports lcg-cp timeouts for large files. I suspect this is a problem with the UI.
  • Issues with Proxy renewal.
    • Certificate for RAL myproxy server doesn't match advertised hostname (how does this work at all?).
    • Other myproxy issues as well. GGUS#99105 GGUS#9172

SNO+ Questions

  • Jobs appear to fail, but have uploaded output and it is in LFC
  • MC production
    • Want 2-3 people managing this
    • Shifters monitoring sites and filing tickets
    • How best to manage certificates - currently upload two proxies to myproxy - one for jobs to renew and one for the UI to renew.
    • How best to do this - should they use a robot cert?


Site Updates

Actions


Meeting Summaries
Project Management Board - Members Minutes Quarterly Reports

Monday 1st October

  • ELC work


Tuesday 25th September

  • Reviewing pledges.
  • Q2 2012 review
  • Clouds and DIRAC
GridPP ops meeting - Agendas Actions Core Tasks

Tuesday 21st August - link Agenda Minutes

  • TBC


RAL Tier-1 Experiment Liaison Meeting (Wednesday 13:30) Agenda EVO meeting

Wednesday 13th March

  • Operations report
  • There were networking problems at RAL on Wed/Thu 6/7 March that caused a number of breaks in connectivity to the Tier1.
  • The planned intervention on Tues 12th overran significantly. The main core switch was replaced as planned and an internal bottleneck to one network stack relieved. However, the overrun was caused by problems re-establishing the paired (2*10Gbit) uplink, and at the moment this is running with a single 10Gbit connection.
  • The meeting continues to use Vidyo.
WLCG Grid Deployment Board - Agendas MB agendas

October meeting Wednesday 10th October




NGI UK - Homepage CA

Wednesday 22nd August

  • Operationally few changes - VOMS and Nagios changes on hold due to holidays
  • Upcoming meetings Digital Research 2012 and the EGI Technical Forum. UK NGI presence at both.
  • The NGS is rebranding to NES (National e-Infrastructure Service)
  • EGI is looking at options to become a European Research Infrastructure Consortium (ERIC). (Background document.)
  • Next meeting is on Friday 14th September at 13:00.
Events

WLCG workshop - 19th-20th May (NY) Information

CHEP 2012 - 21st-25th May (NY) Agenda

GridPP29 - 26th-27th September (Oxford)

UK ATLAS - Shifter view News & Links

Thursday 21st June

  • Over the last few months ATLAS have been testing their job recovery mechanism at RAL and a few other sites. This is something that was 'implemented' before but never really worked properly. It now appears to be working well, allowing jobs to finish even if the SE is down or unstable when the job finishes.
  • Job recovery works by writing the output of the job to a directory on the WN should it fail when writing the output to the SE. Subsequent pilots will check this directory and try again for a period of 3 hours. If you would like to have job recovery activated at your site you need to create a directory which (atlas) jobs can write to. I would also suggest that this directory has some form of tmpwatch enabled on it which clears up files and directories older than 48 hours. Evidence from RAL suggests that it's normally only 1 or 2 jobs that are ever written to the space at a time and the space used is normally less than a GB. I have not observed more than 10GB being used. Once you have created this space, please email atlas-support-cloud-uk at cern.ch with the directory (and your site!) so we can add it to the ATLAS configurations. We can switch off job recovery at any time if it does cause a problem at your site. Job recovery would only be used for production jobs, as users complain if they have to wait a few hours for things to retry (even if it would save them time overall...)
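If tmpwatch is not to hand, the 48-hour clean-up suggested above could be approximated with a small script run from cron. This is a rough sketch under assumptions: the function name and the example path are hypothetical, age is judged by modification time, and for simplicity only files are removed (directories are left in place).

```python
import os
import time


def remove_stale_files(directory, max_age_hours=48, now=None):
    """Delete files under directory last modified more than max_age_hours ago.

    Returns the list of paths that were removed. Directories are not removed.
    """
    now = time.time() if now is None else now
    cutoff = now - max_age_hours * 3600
    removed = []
    for root, _dirs, files in os.walk(directory):
        for name in files:
            path = os.path.join(root, name)
            try:
                if os.lstat(path).st_mtime < cutoff:
                    os.remove(path)
                    removed.append(path)
            except OSError:
                # File disappeared or is not removable; leave it for next time.
                continue
    return removed
```

A site might call `remove_stale_files("/path/to/atlas/recovery")` once an hour; the path is a placeholder for whichever directory was reported to ATLAS.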
UK CMS

Tuesday 24th April

  • Brunel will be trialling CVMFS this week, will be interesting. RALPP doing OK with it.
UK LHCb

Tuesday 24th April

  • Things are running smoothly. We are going to run a few small scale tests of new codes. This will also run at T2, one UK T2 involved. Then we will soon launch new reprocessing of all data from this year. CVMFS update from last week; fixes cache corruption on WNs.
UK OTHER
  • N/A
To note

  • N/A