Operations Bulletin 110313

From GridPP Wiki

Bulletin archive


Week commencing 4th March 2013
Task Areas
General updates

Monday 4th March

  • EMI 1 software will reach end of security updates and support on 30-04-2013. An upgrade campaign has started.
  • How much space is needed in the SVN repository?
  • Security training (3rd/4th) and next HEPSYSMAN (4th/5th) taking place in June. Please see Pete's email.
  • Pre-GDB on clouds agenda for 12th March and GDB agenda for 13th March.
  • A gentle reminder to please contribute to the Blogs where you can. Thank you!
  • Last days to complete the EGI Configuration Tool Survey
  • Please let Jeremy know if you wish to seek GridPP funding to attend the HEPiX Spring workshop.
  • The agenda for this Thursday's WLCG Ops Coordination Team meeting.

Monday 25th February

  • A bug in the linux kernel affects ATLAS jobs, where the load gradually ramps up and the host eventually becomes unresponsive and has to be rebooted. The bug seems to have been fixed on or about kernel-2.6.18-214.
  • Suggestion to use CertWizard
  • An agenda for Tuesday's OMB meeting
  • Wahid created a TCP tuning page.
  • Minutes of the 15th February GridPP Cloud meeting.
Tier-1 - Status Page

Tuesday 5th March

  • Ongoing problems with the batch farm not starting enough jobs over the past few weeks are still being tracked. Also investigating a problem causing some batch job failures owing to the time taken to set up jobs.
  • Finalising plans for intervention next Tuesday (12th) to replace core switch in Tier1 network. Outage for around 6 hours (including contingency).
Storage & Data Management - Agendas/Minutes

Tue 5th March 2013

  • EGI Community Forum - Storage session Wed pm.
  • DPM upgrades: all sites are now at 1.8.6.
  • Rfio -> xrootd
    • CMS need xrootd (instead of rfio at DPM sites) - does this apply to T3?
    • General support for transitioning UK sites for ATLAS as well
    • Working OK at ECDF/Glasgow test queues.
    • Planning to move a couple of production sites (Oxford, Edinburgh and ...) over slowly.
  • Pointing some real users (eg. Small VOs) to WebDav interfaces
  • Cloud storage testing discussed at both storage and cloud meetings - see post.
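A quick way for a site to sanity-check its new xrootd door before moving production queues over is a copy through the door with the standard xrootd client. A minimal sketch; the SE hostname and file path are placeholders, not real site values:

```shell
# Hedged sketch: smoke-test a DPM site's xrootd door during the
# rfio -> xrootd transition. SE hostname and file path below are
# placeholders, not real site values.
test_xrootd_door() {
    se="$1"      # e.g. se01.example.ac.uk
    path="$2"    # e.g. /dpm/example.ac.uk/home/atlas/user/testfile
    # -f: overwrite any previous local copy of the test file
    xrdcp -f "root://${se}/${path}" /tmp/xrootd-smoke-test
}
```

If the copy succeeds, pointing a test queue at the xrootd endpoint, as ECDF and Glasgow have done, is the natural next step.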



Accounting - UK Grid Metrics HEPSPEC06 Atlas Dashboard HS06

Tuesday 12th February

  • SL HS06 page shows some odd ratios. Steve says he now takes "HS06 cpu numbers direct from ATLAS" and his page does get stuck every now and then.
  • An update of the metrics page has been requested.

Tuesday 30th October

  • Storage availability in SL pages has been affected by a number of sites being asked by ATLAS to retire the ATLASGROUPDISK space token while the SUM tests were still testing it as critical. The availability will be corrected manually once the month ends. Sites affected in different degrees are RHUL, CAM, BHAM, SHEF and MAN.
Documentation - KeyDocs

See the worst KeyDocs list for documents needing review now and the names of the responsible people.

Tuesday 26th February

KeyDocs monitoring status: Grid Storage(7/0) Documentation(3/0) On-duty coordination(3/0) Staged rollout(3/0) Ticket follow-up(3/0) Regional tools(3/0) Security(3/0) Monitoring(3/0) Accounting(3/0) Core Grid services(3/0) Wider VO issues(3/0) Grid interoperation(3/0) Cluster Management(1/0) (brackets show total/missing)

Monday, 4th March

  • New draft document for putting a CE in downtime. It discusses the pros and cons of three approaches. Needs discussion to finalise.
  • EPIC integrated into Approved VOs, pending full acceptance.
  • Process commissioned to delete stale documents.

Thursday, 29th November

The Approved VOs document has been updated to automatically contain a table that lays out the resource requirements for each VO, as well as the maximum. We need to discuss whether this is useful - it seems that the majority of WN software requirements are passed around by word of mouth etc. Should this be formalized? Please see

   https://www.gridpp.ac.uk/wiki/GridPP_approved_VOs#VO_Resource_Requirements

This table will be kept up to date with a regular process that syncs it with the CIC Portal, should it prove to be useful.

Interoperation - EGI ops agendas

Monday 4th March

  • An EGI operations meeting agenda for today's meeting is now available.
  • SR: Large number of updates in UMD2 and UMD1. In particular, in the UMD1 release, the DPM, LFC and L&B updates are security updates. The UMD2 WMS is _not_ backwards compatible without a workaround, as described in the release notes: https://wiki.egi.eu/wiki/UMD-2:UMD-2.4.0
  • EMI-3 release expected 7th March, UMD-3 prioritisation underway
  • Argus should be in the site BDII; it has shipped an information provider since the EMI-2 release, so there is probably a plan to update EMI-1 Argus instances. (The same applies to VOMS servers, which have had information providers in all EMI releases.)

Monday 11th February

  • There was an EGI ops meeting today.
  • There is a list-match problem with EMI2 WMS (GGUS 90240)


Monitoring - Links MyWLCG

Tuesday 5th February

  • Task will focus on probes and sharing of useful tools - suggestions and comment welcome

Monday 2nd July

  • DC has almost finished an initial ranking. This will be reviewed by AF/JC and discussed at 10th July ops meeting

Wednesday 6th June

  • Ranking continues. Plan to have a meeting in July to discuss good approaches to the plethora of monitoring available.
  • Glasgow dashboard now packaged and can be downloaded here.
On-duty - Dashboard ROD rota

Tuesday 5th March

  • Handling tickets related to EMI-1 probes - what to expect.
  • Recommendation with respect to upgrading CE (drain first)
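The drain-first recommendation above can be sketched as a shell procedure. This assumes a CREAM CE with a Torque batch system behind it; the hostname is a placeholder, and the glite-ce-{disable,enable}-submission admin commands require a suitably authorised proxy:

```shell
# Hedged sketch: drain a CREAM CE before upgrading, per the
# recommendation above. Hostname is a placeholder; assumes a Torque
# batch system, where qstat prints nothing once the queue is empty.
drain_and_upgrade_ce() {
    ce="$1"                                    # e.g. ce01.example.ac.uk
    glite-ce-disable-submission "${ce}:8443"   # stop accepting jobs
    # Wait for running/queued jobs to finish before touching the node.
    while [ "$(qstat 2>/dev/null | wc -l)" -gt 0 ]; do
        sleep 600
    done
    # ... perform the EMI upgrade here ...
    glite-ce-enable-submission "${ce}:8443"    # reopen for business
}
```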

Tuesday 12th February

  • Need all ROD members to complete availability survey for the rota.

Monday 21st January

  • Good week, with only a few downtimes and long lived alarms. All outstanding alarms are covered by tickets as of now.
  • As summarised in Daniela's handover from last week, several sites have red COD-level status because the tickets are more than a month old. This is due to WNs not being upgraded (the new tar-balls are not yet available), which results in security alarms being raised. Some details in this ticket: https://ggus.eu/ws/ticket_info.php?ticket=90184
Rollout Status WLCG Baseline

Monday 4th March

  • EMI early adopters list by component.
  • Do we have a Staged Rollout list for EMI3?

Tuesday 5th February

References


Security - Incident Procedure Policies Rota

Tuesday 5th March

  • Two openafs vulnerabilities announced (CVE-2013-1794 and CVE-2013-1795). Further details available at http://www.openafs.org/security. Updated RPMS for SL5/6 available.

Tuesday 26th February

  • Another local privilege escalation vulnerability affecting linux kernels 3.3-3.8. Vendor supplied kernels in RHEL/SL not vulnerable (https://bugzilla.redhat.com/show_bug.cgi?id=915052) but fc18 and ubuntu 12.10 need upgrading. Code to exploit this vulnerability is widely available.

Tuesday 19th February



Services - PerfSonar dashboard | GridPP VOMS

Monday 18th February

  • PerfSonar tests to BNL reveal poor rates for several sites since the upgrade

Tuesday 5th February

  • NGS VOMS to be switched off this week
Tickets

Monday 4th March 2013 14.45 GMT</br> 38 Open UK tickets today. All was going smoothly until the EMI1 tickets hit us, still the reply to them was swift from sites. It's the start of the month, so I need to take a break from Spring cleaning my desk (the horrors that I have seen) and take a look at all the tickets.


EMI 1 Tickets:</br> (I won't go into much detail as they're likely to be talked about elsewhere and they only came out this morning.)

RALPP https://ggus.eu/ws/ticket_info.php?ticket=91997 (In progress) - Plan in place

OXFORD https://ggus.eu/ws/ticket_info.php?ticket=91996 (In Progress) - Is the deadline to upgrade the end of April, or do we need to be sorted before then?

BRISTOL https://ggus.eu/ws/ticket_info.php?ticket=91995 (In progress) - Winnie has asked for clarification for what's going on.

BIRMINGHAM https://ggus.eu/ws/ticket_info.php?ticket=91994 (In progress) - Mark will get onto this as soon as Birmingham's AC starts behaving.

GLASGOW https://ggus.eu/ws/ticket_info.php?ticket=91992 (In progress) - There are some red herrings at Glasgow due to hanging CE bdiis. Just the WMSes and LB to go, these are being handled.

SHEFFIELD https://ggus.eu/ws/ticket_info.php?ticket=91990 (In progress) - Elena plans to upgrade this month.

RHUL https://ggus.eu/ws/ticket_info.php?ticket=91987 (Assigned)</br> https://ggus.eu/ws/ticket_info.php?ticket=91982 (Assigned)</br> https://ggus.eu/ws/ticket_info.php?ticket=91981 (Assigned)</br> (Poor RHUL getting 3 tickets - I assume this is the ROD dashboard being silly as Daniela mentioned)</br> The real ticket: https://ggus.eu/ws/ticket_info.php?ticket=92111

LIVERPOOL https://ggus.eu/ws/ticket_info.php?ticket=91984 (In progress) - The Liver lads are working on it.

QMUL https://ggus.eu/ws/ticket_info.php?ticket=91980 (In Progress) - Chris has updated his BDII, so hopefully things will be sorted.

IC https://ggus.eu/ws/ticket_info.php?ticket=91978 (In Progress) - wms updated, last CE has a scheduled downtime, um, scheduled.

BRUNEL https://ggus.eu/ws/ticket_info.php?ticket=91975 (In Progress) - Raul plans to upgrade things at the end of the month. He asks about dangers upgrading the CE from EMI1 to 2 - Daniela replies that the DB change means that it's recommended to drain your CE first.

TIER 1 https://ggus.eu/ws/ticket_info.php?ticket=91974 (In Progress) - The team plan to have all services updated by the end of March.


Atlas data moving tickets:</br> https://ggus.eu/ws/ticket_info.php?ticket=90242 (Lancaster)</br> https://ggus.eu/ws/ticket_info.php?ticket=90243 (Liverpool)</br> https://ggus.eu/ws/ticket_info.php?ticket=90244 (RALPP)</br> https://ggus.eu/ws/ticket_info.php?ticket=90245 (Oxford)</br> https://ggus.eu/ws/ticket_info.php?ticket=89804 (Glasgow)</br>

Nearing the end of these. Lancaster and Oxford are down to their last few files (which might need to be manually fixed at the site end - the one left at Lancaster is lost for good). RALPP similarly have dark data files that might need to be cleaned up locally. Liverpool are waiting on atlas after giving them a new list of files. Glasgow have been asked for a fresh file dump.


The Rest:

TIER 1</br> https://ggus.eu/ws/ticket_info.php?ticket=91687 (21/2)</br> Support for the epic VO on the RAL WMS. Request for pool accounts went out but no word since. In progress (21/2)

https://ggus.eu/ws/ticket_info.php?ticket=91658 (20/2)</br> Request from Chris W for webdav redirection support on the RAL LFC. As reported last week waiting on the next release which has better, stronger, faster webdav support in it. In Progress (22/2)

https://ggus.eu/ws/ticket_info.php?ticket=91146 (4/2)</br> atlas tracking RAL bandwidth issues. The ticket was waiting on last week's downtime to hopefully sort things out. Did the picture improve? In progress (12/2)

https://ggus.eu/ws/ticket_info.php?ticket=91029 (30/1)</br> Again from atlas, this is the FTS queries failing for some jobs involving users with odd characters in the name ticket. A fix either needs to be implemented by the srm developers or atlas need to workaround by changing their robot DNs. On hold (27/2)

https://ggus.eu/ws/ticket_info.php?ticket=90528 (17/1)</br> Sno+ jobs weren't making their way to Sheffield, tracked to a problem with one WMS. As the cause of the problem is unknown and completely unobvious, it was suggested to restrict Sno+ jobs to the working WMS, but there has still been no reply from Sno+. Waiting for reply (19/2)

https://ggus.eu/ws/ticket_info.php?ticket=86152 (17/9/2012)</br> Correlated packet loss on the RAL Perfsonar host. Did last week's network intervention fix things? Or maybe the problem evaporated (I'm ever the optimist)? On hold (16/1)

IMPERIAL</br> https://ggus.eu/ws/ticket_info.php?ticket=91866 (28/2)</br> It looks like atlas jobs were running afoul of some cvmfs problems on some nodes. They've been given a kick, it's worth seeing if the problem has gone away. In progress (28/2)

GLASGOW</br> https://ggus.eu/ws/ticket_info.php?ticket=91792 (26/2)</br> Atlas thought that they had lost some files, but it turns out that they just had bad permissions on a pool node (root.root) - the problem's been fixed and Sam is investigating with his DPM hat on, whilst checking the filesystems for more possible bad files. In progress (4/3)

https://ggus.eu/ws/ticket_info.php?ticket=90362 (13/1)</br> All Glasgow's CEs have been switched over to use the GridPP voms server for ngs.ac.uk, they just need some testing. Solved (4/3).

SHEFFIELD https://ggus.eu/ws/ticket_info.php?ticket=91770 (25/2)</br> lhcb complaining about the default value being published for Max CPU time. No news from Sheffield beyond the acknowledgement of the ticket. In Progress (25/2)

DURHAM</br> https://ggus.eu/ws/ticket_info.php?ticket=91745 (24/2)</br> enmr.eu having trouble with lcg-tagging things at Durham. Mike gave this a kick, and asked if the problem has gone away. Waiting for reply (25/2)

RHUL</br> https://ggus.eu/ws/ticket_info.php?ticket=91711 (21/2)</br> atlas having trouble copying files into RHUL. It's being looked at but PRODDISK and ANALY_RHUL have been put offline. In Progress (28/2)

https://ggus.eu/ws/ticket_info.php?ticket=89751 (17/12/12)</br> Path MTU discovery problems to RHUL. On hold since being handed over to the Network guys, who were following it up with Janet. On hold (28/1)
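Path MTU problems like this one can usually be demonstrated from either end with standard tools. A minimal sketch, with a placeholder hostname; the 1472-byte payload plus 28 bytes of ICMP/IP headers makes a full 1500-byte Ethernet frame, so a hang here while smaller probes succeed points at a PMTU black hole:

```shell
# Hedged sketch: probe for a path-MTU black hole toward a site.
# The hostname is a placeholder.
check_pmtu() {
    host="$1"
    # tracepath reports the discovered path MTU hop by hop:
    tracepath "$host"
    # 1472-byte payload + 28 bytes of headers = full 1500-byte frame;
    # -M do forbids fragmentation, so a broken PMTU path hangs here:
    ping -M do -s 1472 -c 3 "$host"
}
```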

LANCASTER</br> https://ggus.eu/ws/ticket_info.php?ticket=91304 (8/2)</br> LHCB having trouble on one of Lancaster's clusters as they like to run their jobs in the home directory rather than $TMPDIR. Forcing this behaviour is harder than it should be in LSF, so it looks like we're going to have to relocate the lhcb home directories. In Progress (1/3)

https://ggus.eu/ws/ticket_info.php?ticket=90395 (14/1)</br> dteam jobs failed at Lancaster, due to our old CE being rubbish. It's since been reborn with new disks, but embarrassingly I haven't found the time to set a UI up for dteam and test it myself (which I intend to do as part of testing the UI tarball, but that's a whole other story). In progress (18/2)

ECDF</br> https://ggus.eu/ws/ticket_info.php?ticket=90878 (27/1)</br> lhcb were having problem with cvmfs at Edinburgh, but the fixes attempted can't be checked due to dirac problems at the site. In progress (could be knocked back to waiting for reply) (28/2)

BRISTOL</br> https://ggus.eu/ws/ticket_info.php?ticket=90328 (11/1)</br> The Bristol SE is publishing some odd values - zero used space. Waiting on another, similar ticket (90325) to be resolved. On hold (11/2)

https://ggus.eu/ws/ticket_info.php?ticket=90275 (10/1)</br> The CVMFS taskforce have asked for Bristol's CVMFS plans. One Bristol CE is migrated to using it, with one left to go. On hold (5/2)

EFDA-JET</br> https://ggus.eu/ws/ticket_info.php?ticket=88227 (6/11/2012)</br> Jet have exhausted all options trying to fix this biomed job publishing problem. They're looking at reinstalling the CE to fix it, which seems like using a sledgehammer to crack a walnut (but I don't have any better ideas). On hold (25/2) Daniela suggests assigning the issue to the developers.

Tools - MyEGI Nagios

Tuesday 13th November

  • Noticed two issues during the Tier-1 power cut. The SRM and direct CREAM submission tests use the top BDII defined in the Nagios configuration to query the resource, and these tests started to fail because the RAL top BDII was not accessible. Nagios doesn't use BDII_LIST, so I cannot define more than one BDII; I am looking into how to make this more robust.
  • The Nagios web interface was not accessible to a few users because GOCDB was down. This is a bug in SAM-Nagios and I have opened a ticket.

Site availability has not been affected by this issue because Nagios sends a warning alert when it is unable to find a resource through the BDII.
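One possible shape for the more robust behaviour is to try several top BDIIs in turn, which is roughly what BDII_LIST support would give. A minimal sketch of the idea, with placeholder hostnames; port 2170 and base o=grid are the standard top-BDII LDAP settings:

```shell
# Hedged sketch: query a list of top BDIIs in turn instead of relying
# on a single hardcoded one. Hostnames below are placeholders.
BDII_LIST="lcg-bdii.example.ac.uk lcg-bdii.example.ch"

query_bdii() {
    for host in $BDII_LIST; do
        # Return the first successful answer; fall through on failure.
        if ldapsearch -x -LLL -H "ldap://${host}:2170" -b o=grid \
               "(objectClass=GlueService)" GlueServiceEndpoint; then
            return 0
        fi
    done
    return 1    # every BDII in the list was unreachable
}
```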


VOs - GridPP VOMS VO IDs Approved VO table

Monday 4th March 2013

Monday 26th February 2013

  • NGS VOMS server. Durham fixed. Last site is Glasgow, and I'm running tests now. Hopefully this should now be fixed https://ggus.eu/ws/ticket_info.php?ticket=90356 - note that this has taken 3 months to complete.
  • SNO+ reports lcg-cp timeouts for large files. I suspect this is a problem with the UI.
  • Issues with Proxy renewal.
    • Certificate for RAL myproxy server doesn't match advertised hostname (how does this work at all?).
    • Other myproxy issues as well. GGUS#99105 GGUS#9172
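If the SNO+ lcg-cp failures are client-side timeouts, one thing worth trying is raising the relevant lcg_util timeouts explicitly. A sketch, with placeholder source and destination; the four timeout options are standard lcg-cp flags (values in seconds):

```shell
# Hedged sketch: lcg-cp with client-side timeouts raised, in case the
# large-file failures are simple timeouts. Source and destination
# below are placeholders.
copy_with_timeouts() {
    src="$1"
    dst="$2"
    lcg-cp -v \
        --connect-timeout 300 \
        --sendreceive-timeout 3600 \
        --srm-timeout 600 \
        --bdii-timeout 60 \
        "$src" "$dst"
}
```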

Monday 18 February 2013

Monday 12 February 2013

SNO+ Questions

  • Jobs appear to fail, but have uploaded output and it is in LFC
  • MC production
    • Want 2-3 people managing this
    • Shifters monitoring sites and filing tickets
    • How best to manage certificates - currently upload two proxies to myproxy - one for jobs to renew and one for the UI to renew.
    • How best to do this - should they use a robot cert?
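The two-proxy arrangement described above maps onto two named credentials on the MyProxy server. A minimal sketch using standard myproxy-init options; the credential names and lifetimes are illustrative, not SNO+'s actual settings:

```shell
# Hedged sketch of the two-credential pattern described above.
# Credential names ("jobs", "ui") and lifetimes are illustrative;
# MYPROXY_SERVER is assumed to point at the real MyProxy host.
upload_two_proxies() {
    # Credential renewable by jobs: stored without a passphrase (-n)
    # so the renewal service can use it; -c is the stored-credential
    # lifetime in hours, -t the lifetime of retrieved proxies.
    myproxy-init -d -n -c 720 -t 24 -k jobs
    # Credential for the production UI, retrieved with a passphrase:
    myproxy-init -d -c 720 -t 24 -k ui
}
```

A robot certificate, as the question above suggests, would sidestep tying production to one person's credential; that is a policy question rather than a technical one.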


Monday 14 January 2013

  • Neiss.org.uk
    • Now have VO-ID card in operations-portal (previously CIC portal)
    • GridPP/NGS VOMSs server issues
    • NGS WMS hadn't enabled current CEs at QMUL and Lancs, so I've requested the GridPP WMSs enable it - as the VO is supported on GridPP sites.
    • Would be a good use case for SARONGS - but they don't have the time to debug this.


Site Updates

Actions


Meeting Summaries
Project Management Board - MembersMinutes Quarterly Reports

Monday 1st October

  • ELC work


Tuesday 25th September

  • Reviewing pledges.
  • Q2 2012 review
  • Clouds and DIRAC
GridPP ops meeting - Agendas Actions Core Tasks

Tuesday 21st August - link Agenda Minutes

  • TBC


RAL Tier-1 Experiment Liaison Meeting (Wednesday 13:30) Agenda EVO meeting

Wednesday 27th February

  • Operations report
  • The existing (very small) test SL6 batch queue is accessible by Atlas & CMS for tests. This is being expanded using part of the new purchase of CPU nodes (around 450 job slots) and will be offered to other VOs for use.
  • The meeting continues to use Vidyo as part of a four-week test.
WLCG Grid Deployment Board - Agendas MB agendas

October meeting Wednesday 10th October




NGI UK - Homepage CA

Wednesday 22nd August

  • Operationally few changes - VOMS and Nagios changes on hold due to holidays
  • Upcoming meetings Digital Research 2012 and the EGI Technical Forum. UK NGI presence at both.
  • The NGS is rebranding to NES (National e-Infrastructure Service)
  • EGI is looking at options to become a European Research Infrastructure Consortium (ERIC). (Background document.)
  • Next meeting is on Friday 14th September at 13:00.
Events

WLCG workshop - 19th-20th May (NY) Information

CHEP 2012 - 21st-25th May (NY) Agenda

GridPP29 - 26th-27th September (Oxford)

UK ATLAS - Shifter view News & Links

Thursday 21st June

  • Over the last few months ATLAS have been testing their job recovery mechanism at RAL and a few other sites. This is something that was 'implemented' before but never really worked properly. It now appears to be working well, allowing jobs to finish even if the SE is down or unstable when the job finishes.
  • Job recovery works by writing the output of the job to a directory on the WN should it fail when writing the output to the SE. Subsequent pilots will check this directory and try again for a period of 3 hours. If you would like to have job recovery activated at your site you need to create a directory which (atlas) jobs can write to. I would also suggest that this directory has some form of tmpwatch enabled on it which clears up files and directories older than 48 hours. Evidence from RAL suggests that it's normally only 1 or 2 jobs that are ever written to the space at a time, and the space used is normally less than a GB; I have not observed more than 10GB being used. Once you have created this space, please email atlas-support-cloud-uk at cern.ch with the directory (and your site!) and we can add it to the ATLAS configurations. We can switch off job recovery at any time if it does cause a problem at your site. Job recovery would only be used for production jobs, as users complain if they have to wait a few hours for things to retry (even if it would save them time overall...)
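The tmpwatch suggestion above could be implemented as a one-line daily cron script; the recovery directory path below is a placeholder, and 48 is hours since last modification, matching the 48-hour window suggested in the text:

```shell
#!/bin/sh
# Hedged sketch: a /etc/cron.daily script implementing the tmpwatch
# suggestion above. The recovery directory path is a placeholder;
# --mtime 48 removes anything not modified in the last 48 hours.
tmpwatch --mtime 48 /gridstore/atlas/jobrecovery
```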
UK CMS

Tuesday 24th April

  • Brunel will be trialling CVMFS this week, will be interesting. RALPP doing OK with it.
UK LHCb

Tuesday 24th April

  • Things are running smoothly. We are going to run a few small-scale tests of new codes. These will also run at T2s, with one UK T2 involved. Then we will soon launch a new reprocessing of all data from this year. The CVMFS update from last week fixes cache corruption on WNs.
UK OTHER
  • N/A
To note

  • N/A