Operations Bulletin 091213

From GridPP Wiki
Revision as of 09:07, 10 December 2013 by Jeremy coles


Bulletin archive


Week commencing 2nd December 2013
Task Areas
General updates

Tuesday 3rd December

  • Although ready, the UK CA will wait to move to default SHA-2 certificates in January (WLCG overall has not confirmed readiness).
  • There is an EGI push for ARGUS deployment - a central server is being configured at RAL.
  • Minutes from Monday's regular WLCG ops call are available. Generally quiet.
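
For sites wanting to check which of their certificates are already SHA-2 signed ahead of the January switch, a minimal sketch follows. It assumes the text output of `openssl x509 -noout -text` is passed in as a string. The function names are illustrative only, and RSASSA-PSS signatures (where the hash is printed on a separate line) are not handled.

```python
import re

# Hash names that mark a SHA-2 family signature in OpenSSL's algorithm strings.
SHA2_NAMES = ("sha224", "sha256", "sha384", "sha512")


def signature_algorithm(openssl_text):
    """Return the first 'Signature Algorithm' value found in the text, or None."""
    match = re.search(r"Signature Algorithm:\s*(\S+)", openssl_text)
    return match.group(1) if match else None


def is_sha2_signed(openssl_text):
    """True if the reported signature algorithm names a SHA-2 hash."""
    algorithm = signature_algorithm(openssl_text)
    if algorithm is None:
        return False
    lowered = algorithm.lower()
    return any(name in lowered for name in SHA2_NAMES)
```

The input could come from something like `openssl x509 -in usercert.pem -noout -text`, where `usercert.pem` is only an example path.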


Monday 25th November

  • There is a pre-GDB on Identity Federation in WLCG (agenda). The next GDB is on 11th December.
  • EMI-3 WN tarball status (and glexec)?
  • There is an LFC outage today (see the downtime announcement).
  • The middleware readiness group are setting a time for their meeting. More site admins are needed! Discussions will surround the items in the twiki.
  • There was an email thread last week on ATLAS plans to move jobs/data away from a site going into downtime. The focus seemed to be on the execution side rather than the storage side of things.
  • A new SAM interface is available for checking.
  • Glue2 information validation is ongoing. Look to the monitoring summary page for more information.


WLCG Operations Coordination - Agendas

Tuesday 3rd December

Tuesday 26th November

  • CMS - CRAB users warned that gLite-WMS submission is in decreasing support
  • LHCb - LHCb will only build slc6 binaries as of January 2014
  • SHA-2 - the experiments have tested a lot and look ready. By Dec 1 the WLCG infrastructure is expected to be mostly ready.
  • WMS decommissioning - Some progress for CMS.
  • glexec - EMI gLExec probe (in use since SAM Update 22) crashes on sites that use the tar ball WN and do not have the Perl module Time/HiRes.pm installed. Status is tracked here.
  • xrootd deployment - UDP collector (a.k.a. GLED) for detailed monitoring. An additional instance of the collectors has been enabled at CERN for FAX.
      • Site monitoring requirements: for example, SUM tests do not represent the real experiment status.
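
Tarball WN sites can check in advance whether the Perl module needed by the gLExec probe is present. The sketch below assumes nothing beyond `perl` being on the PATH of the node where it runs; the helper names are made up for illustration.

```python
import shutil
import subprocess


def perl_module_command(module):
    """Build the command that asks Perl to load `module` and then do nothing."""
    return ["perl", "-M" + module, "-e", "1"]


def perl_module_available(module):
    """Return True if Perl can load `module`, False if not, None if perl is missing."""
    if shutil.which("perl") is None:
        return None
    result = subprocess.run(
        perl_module_command(module),
        stdout=subprocess.DEVNULL,
        stderr=subprocess.DEVNULL,
    )
    # Perl exits non-zero when the -M module cannot be located.
    return result.returncode == 0
```

Calling `perl_module_available("Time::HiRes")` on a worker node gives a quick yes/no before the probe runs there.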
Tier-1 - Status Page

Tuesday 3rd December

  • The planned upgrade of firmware in a disk array, which took the LFC, ATLAS 3D and FTS2 services down for a few hours last Tuesday, was completed successfully.
Storage & Data Management - Agendas/Minutes

Tuesday 8th October

  • The DPM workshop agenda and registration page will appear here.

Monday 30th September

  • A DPM workshop is being organised in Edinburgh for 13th December. GridPP PMB anticipated covering travel for of the order of 10 UK sysadmins for this event. Interest should be indicated during the storage group meeting.

Tuesday 17th September

  • Perhaps someone could summarise the "Dark Data identification tools" thread on TB-Support?



Accounting - UK Grid Metrics HEPSPEC06 Atlas Dashboard HS06

Tuesday 26th November

Tuesday 5th November

  • A reminder to keep an eye on the SL HS06 page for odd ratios. Steve takes HS06 cpu numbers direct from ATLAS and the page does get stuck every now and then.
  • The metrics page has been updated.

Tuesday 13th August

Documentation - KeyDocs

See the worst KeyDocs list for documents needing review now and the names of the responsible people.


Monday 11 November

  • The plan for adoption of backup servers continues to evolve. Please see the latest version here. The new version contains details of tests and concluding operations for site and VO admins.
  • The approved VOs page continues to be updated with the newest data from the operations portal.

Note: T2K now requires liblockfile-devel.

Tuesday 5th November

  • Document states will be reviewed at the core ops meeting this coming Thursday.

Tuesday 1st October

  • The approved VOs page has been updated with the newest data from the operations portal. Note that the VOMS records for LondonGrid now contain some alternative VOMS servers. The migration plan for use of these backup servers is now documented here.
Interoperation - EGI ops agendas

Tuesday 3rd December

  • Additional notes:
    • The 2.6.16 version of dCache mentioned has a serious bug in the migration module; 2.6.17 has this fixed so should be used in preference. The possibility of skipping 2.6.16 in the overall release of EMI-3 is being discussed.
    • Note that the CREAM updates mentioned in this meeting contain security updates and so are recommended.
    • Looking for CREAM/LSF plugin staged rollout, but don't believe there are any such sites in the UK.
    • SHA-2: 17 sites remain in EGI that are publishing SHA-2 and alarming; I believe the couple of such sites in the UK are all accounted for and previously documented.
    • It was asked when CAs would start issuing SHA-2 certs only (UK noting that it's planning to from January).
  • Next meeting: (last for 2013) 6th December

Tuesday 26th November

  • Notes for this meeting mainly superseded by more recent ones.
gLite support calendar.


Monitoring - Links MyWLCG

Tuesday 26th November

  • As noted by Alessandra, if possible we'd like site feedback on the consolidated monitoring prototype before the next meeting a week on Friday to report back to the group (with thanks to everyone who has already contributed)
  • Some notes to form a wiki on Graphite can be found here: https://www.gridpp.ac.uk/wiki/MonitoringTools. These are still under development; if there are areas people would find useful that could be expanded, please let David know.
  • Glasgow dashboard now packaged and can be downloaded here.
On-duty - Dashboard ROD rota

Tuesday 3rd December

  • Both Nagios servers upgraded to SAM Update 22 and the active instance has moved back to Oxford again.
  • Some critical alarms from the old instance had to be dealt with directly.
  • The Bristol and RAL PPD ARC CEs have had a few issues since the upgrade. Luke opened a ticket with the ARC developers and it is ongoing.
  • EFDA Jet and Sussex had gLExec issues after the test became critical. A new admin is starting in Sussex but gLExec may take a while to be sorted. Jet has opened a ticket to solve their gLExec problem.
  • Brunel has an apel ticket open.
  • UCL SL6 upgrade on-going and may have issues.
  • RALPP dcache mid mon ticket still open

Monday 25th November

  • Nothing unusual. A steady trickle of transient problems.
  • The RAL Tier-1 SHA-2 ticket was finally closed as the relevant machines were decommissioned.
Rollout Status WLCG Baseline

Tuesday 29th Oct

  • Yesterday the first staged rollout request (for the CREAMCE) in months came through. I've updated the Stage of the Nation page.


Tuesday 8th Oct

  • There were updates to EMI2 and 3 yesterday, but no new request for Staged Rollout. There is a problem with dcap-libs: [GGUS 97805]


Security - Incident Procedure Policies Rota

Tuesday 19th November

  • There was a team meeting last Friday 15th November. Next meeting on 29th.
  • Just a couple of site issues showing up in Pakiti.
  • Looking at ARGUS server for UK NGI.

Tuesday 29th October

  • There was a team meeting on Friday 25th.
  • A couple of critical warnings are appearing in Pakiti and being followed up.


Services - PerfSonar dashboard | GridPP VOMS

Tuesday 26th November

  • The main perfSONAR issues this week affect Manchester and Sussex.

Tuesday 19th November

  • There is a new dashboard. Feedback is welcome.
  • Manchester, Durham, Glasgow and Sussex show problems across the board.

Tuesday 1st October

  • PerfSONAR latency hosts configured to use the WLCG meshes should now have a traceroute measurement archive (MA) accessible from the GUI under 'Service Graphs' --> 'Traceroute'. Here is an example.

Tuesday 17th September

  • Upgrading/re-installing hosts to v3.3.1/mesh is only making slow progress.
  • There is a new view of the status between sites.
  • An outage at Manchester due to central switch maintenance means that VOMS is not going to be contactable for a period this morning. It is clear that we need the backup VOMS instances fully available to VOs - please can someone take a lead?
Tickets

Monday 2nd December, 14.30 GMT
39 Open UK Tickets this week. Site by site, in no particular order, we have:

SUSSEX
https://ggus.eu/ws/ticket_info.php?ticket=99198 (26/11)
Sussex have received a glexec.WN nagios ticket. It could be that just glexec is broken. Pete G acknowledged the ticket on behalf of the site (is Jeremy M still having trouble with GGUS?). In progress (28/11)

https://ggus.eu/ws/ticket_info.php?ticket=95165 (28/6)
Sussex needing a refreshed Perfsonar. Emyr reports that it will be the new guy's first job, whenever he or she lands. On hold (20/11)

RALPP
https://ggus.eu/ws/ticket_info.php?ticket=98923 (15/11)
RALPP are being Nagiosed about their dcache not being a SHA-2 compliant version - but it is. It's just their publishing that's broken. Chris is scratching his head over it. In progress (2/12)
Update - Chris took care of business, tracked a bug in the dcache publishing, fixed it and submitted a ticket to the dcache devs. Solved.

OXFORD
https://ggus.eu/ws/ticket_info.php?ticket=99362 (2/12)
Oxford have been asked to remove ngs.ac.uk from their backup voms server. Kashif is away, so it may not happen this week. In progress (2/12)

BRISTOL
https://ggus.eu/ws/ticket_info.php?ticket=99377 (2/12)
Bristol's ARC CE is producing errors for the Ops nagios tests. They're aware of the problem, and will put the CE in downtime if it keeps giving them gyp. In progress (2/12)

GLASGOW
https://ggus.eu/ws/ticket_info.php?ticket=97068 (5/9)
Duncan ticketed Glasgow about their perfsonar. Gareth reported that they'd get to it after the SL6 migration (which was the mantra for a lot of us over the last few months), but this was back in October. Please can you show the ticket (and more importantly the issue at hand) some love! On hold (15/10)
Update - Dave showed the necessary love to the ticket.

https://ggus.eu/ws/ticket_info.php?ticket=96234 (29/7)
Support for HyperK on the Glasgow WMS. After t'other weekend's power shenanigans Dave thinks he got it, could Chris (or another HyperK member) please test. Waiting for reply (2/12)

https://ggus.eu/ws/ticket_info.php?ticket=98253 (21/10)
A CMS ticket that has morphed into "getting CMS glideins to work at Glasgow" (if I'm reading it right). Things are progressing (in spite of power cuts and top bdii's playing up); at last word Daniela spotted something off in the site xml file. In progress (26/11)

ECDF
https://ggus.eu/ws/ticket_info.php?ticket=95303 (1/7)
Edinburgh's GlExEC ticket. No progress, although Wahid has made his opinions on the matter quite clear. It's still on me (with my tarball hat on) I'm afraid. On hold (29/11)

https://ggus.eu/ws/ticket_info.php?ticket=99179 (25/11)
ECDF got a ticket over using a "buggy" version of the BDII (at Lancaster we had to update the site-BDII to fix this). Wahid correctly corrected the ticket to a "Change Request" not an "Incident". Still probably best to try to do this soon if you can, as it preempts a new set of nagios tests. On hold (26/11)

https://ggus.eu/ws/ticket_info.php?ticket=99180 (25/11)
Along the same vein, this ticket is about the publishing of default values. On hold (25/11)
Update - Andy has asked some questions to clarify which CEs are publishing bad values.

DURHAM
https://ggus.eu/ws/ticket_info.php?ticket=95302 (1/7)
Durham's gLeXeC ticket. Ewan S reports that it's still being worked on. In progress (26/11)

SHEFFIELD
https://ggus.eu/ws/ticket_info.php?ticket=95301 (1/7)
Sheffield's GlexeC ticket. There was a hope to get this done mid-November, what's the new timeframe? Does anyone know what the current deadline for this is? On hold (29/10)

https://ggus.eu/ws/ticket_info.php?ticket=98594 (4/11)
LHCB transfer problems from Sheffield. This is being worked on, but I don't think it's understood yet. In progress (27/11)

https://ggus.eu/ws/ticket_info.php?ticket=97039 (4/9)
Biomed complaining about lack of dynamic publishing at Sheffield. The site's had a few bashes at it, but nothing works. As a note, I had some problems recently after an update that needed me to set the ldap user up as an operator (in qmgr) for our torque server. Elena referenced similar problems in ticket 98748. On hold (13/11)

MANCHESTER
https://ggus.eu/ws/ticket_info.php?ticket=97066 (5/9)
A perfsonar ticket. Alessandra's revised date for getting this done seems to be the 9/12. On hold (25/11)

https://ggus.eu/ws/ticket_info.php?ticket=99334 (29/11)
The kinda parent ticket to the other two voms tickets, requesting the purging of ngs.ac.uk from the voms servers. Waiting on the other two tickets to be solved. On hold (2/12)

LANCASTER
https://ggus.eu/ws/ticket_info.php?ticket=95299 (1/7)
Lancaster's GlExEc ticket. If I don't sort out the tarball glexec soon I'm going to have to commit Seppuku to atone. Any volunteers to be my second? On hold (17/7)

UCL
https://ggus.eu/ws/ticket_info.php?ticket=98792 (11/11)
Nagios JobSubmit failures. The site's in downtime, and Ben is working on the SL6 upgrade. On hold (25/11)

https://ggus.eu/ws/ticket_info.php?ticket=98542 (1/11)
SL6 upgrade plan ticket. Ben's in the middle of upgrading now, he set the reminder date for today. On hold (25/11)

https://ggus.eu/ws/ticket_info.php?ticket=98719 (7/11)
Brian submitted a request from atlas to bring the UCL dpm up to a "minimum level" and enable WebDAV. Still in the middle of the SL6 upgrade, so this work has been delayed slightly. On hold (25/11)

https://ggus.eu/ws/ticket_info.php?ticket=99174 (25/11)
Obsolete Glue2 entries ticket. Ben will fix this before putting the site back in production. On hold (25/11)

https://ggus.eu/ws/ticket_info.php?ticket=99176 (25/11)
"Publishing default values" ticket, twin to 99174. As above. On hold (25/11)

https://ggus.eu/ws/ticket_info.php?ticket=98125 (17/10)
Atlas file transfer problems that I believe preempted the current downtime. On hold (25/11)

https://ggus.eu/ws/ticket_info.php?ticket=95298 (1/7)
UCL's glExEc ticket. Will be worked on after the SL6 upgrade. On hold (25/11)

QMUL
https://ggus.eu/ws/ticket_info.php?ticket=94746 (10/6)
That one where QM's SE publishes that it supports Biomed. After going around the Storm developers (whose response was a politely worded "not our problem") it's back in Chris' hands. On hold (25/11)

https://ggus.eu/ws/ticket_info.php?ticket=99294 (28/11)
Brian has asked for some space juggling. This ticket seems to have flown under the radar. Assigned (28/11)
Update - Solved by Chris.

BRUNEL
https://ggus.eu/ws/ticket_info.php?ticket=99316 (29/11)
Nagios "Apel-Pub" ticket. Raul cites a ticket he has open with the APEL devs (99320). In progress (29/11)

EFDA-JET
https://ggus.eu/ws/ticket_info.php?ticket=97485 (21/9)
LHCB job failures at JET. Krishan is fighting the good fight; it may be a missing package or two causing the trouble according to Vladimir. In progress (28/11)

https://ggus.eu/ws/ticket_info.php?ticket=95295 (1/7)
Jet's gLExEc ticket. It's installed, but not quite working right. A ticket has been submitted to the Argus devs (98609). In progress (2/12)

https://ggus.eu/ws/ticket_info.php?ticket=99197 (26/11)
Nagios gLexEc ticket. Working on it alongside t'other issues. On hold (2/12)

TIER 1
https://ggus.eu/ws/ticket_info.php?ticket=86152 (7/9/2012)
The most venerable of our tickets, "correlated packet-loss on perfsonar host". Did a new latency host get installed? On hold (18/10)

https://ggus.eu/ws/ticket_info.php?ticket=97868 (8/10)
T2K's cvmfs request ticket. Ben asked to test with a ROOT tarball, but no news since. In progress (18/11)

https://ggus.eu/ws/ticket_info.php?ticket=98249 (21/10)
SNO+ cvmfs request. Catalin asked some questions, no reply from SNO+ yet. The SYSTEM has sent its second warning/reminder. They have 7 days to comply! Waiting for reply (18/11)

https://ggus.eu/ws/ticket_info.php?ticket=97385 (17/9)
HyperK's cvmfs ticket. Chugging along nicely, but no news for a while. In progress (18/11)

https://ggus.eu/ws/ticket_info.php?ticket=98122 (17/10)
cern@school's cvmfs request, also on its second reminder. Waiting for reply (18/11)

https://ggus.eu/ws/ticket_info.php?ticket=97025 (3/9)
A ticket left open as a reminder about the RAL myproxy server's idiosyncrasies. Last word was that they hoped to have it replaced soon. On hold (5/11)

https://ggus.eu/ws/ticket_info.php?ticket=99162 (25/11)
A "publishing default values" ticket. Looked to be fixed, but then got reopened on the RAL guys today with default values being published for "GLUE2ComputingShareEstimatedAverageWaitingTime" and "GLUE2ComputingShareEstimatedWorstWaitingTime". Reopened (2/12)
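
For anyone chasing their own "publishing default values" ticket, here is a rough sketch of scanning LDIF output (for example, saved from an ldapsearch against the site BDII) for the two waiting-time attributes named above. The placeholder numbers in the code are assumptions rather than values taken from the ticket, so swap in whatever the validation report quotes. Folded LDIF continuation lines are skipped rather than joined.

```python
WAITING_TIME_ATTRIBUTES = (
    "GLUE2ComputingShareEstimatedAverageWaitingTime",
    "GLUE2ComputingShareEstimatedWorstWaitingTime",
)

# Assumed placeholder values; replace with the ones quoted in your ticket.
ASSUMED_PLACEHOLDERS = frozenset({"999999999", "2146660842"})


def find_default_values(ldif_text, attributes=WAITING_TIME_ATTRIBUTES,
                        placeholders=ASSUMED_PLACEHOLDERS):
    """Return (dn, attribute, value) tuples where a placeholder is published."""
    wanted = {name.lower() for name in attributes}
    hits = []
    current_dn = None
    for line in ldif_text.splitlines():
        # Skip blank lines, comments and folded continuation lines.
        if ":" not in line or line.startswith((" ", "#")):
            continue
        name, _, value = line.partition(":")
        name, value = name.strip(), value.strip()
        if name.lower() == "dn":
            current_dn = value
        elif name.lower() in wanted and value in placeholders:
            hits.append((current_dn, name, value))
    return hits
```

Each hit carries the DN of the entry it was found under, which should help to pin down which CE or share is at fault.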

https://ggus.eu/ws/ticket_info.php?ticket=91658 (20/2)
LFC webdav support ticket. Chris reports that his first tests worked well, and that this ticket can be closed. Top stuff. In progress (27/11)
Update - Solved.

Tools - MyEGI Nagios

Tuesday 26th November

  • Regional Nagios updated to release 22. It is a glite to UMD update and it required a fresh installation.
  • There have been some internal changes in SAM-Nagios. Test probes are now the responsibility of the product teams. Some test names have been changed as a result of this reorganization. For example, the org.sam.CREAMCE-DirectJobSubmit test has become emi.cream.CREAMCE-DirectJobSubmit. This does not affect the operational activities.
  • Please could all site admins look at the services associated with their site and mail Kashif if anything odd is noticed. Site admins can reschedule tests for their sites and it would be helpful if most functionalities are tested.
  • Also, look at myegi which can be useful with links to the Dashboard, GSTAT, Accounting Portal and GGUS.
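
Local scripts or notes that still refer to the old SAM test names can be translated with a small lookup. This is only a sketch: the single mapping below is the one rename mentioned above, and the function name is made up for illustration. Other renames should be added once confirmed.

```python
# Old SAM test name -> new name. Only the first pair comes from the bulletin;
# extend this as further renames are confirmed.
RENAMED_TESTS = {
    "org.sam.CREAMCE-DirectJobSubmit": "emi.cream.CREAMCE-DirectJobSubmit",
}


def current_test_name(name, renames=RENAMED_TESTS):
    """Return the post-reorganization name of a test, or the name unchanged."""
    return renames.get(name, name)
```

Names that are not in the table pass through untouched, so the helper is safe to apply to a mixed list.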
VOs - GridPP VOMS VO IDs Approved VO table

Monday 2nd December 2013


Monday 25th November 2013

  • CVMFS progress - but not quite there yet.
  • 6 VOs (cern@school, gridpp, na62, pheno, sno+, t2k.org) have updated their VOID card entries and updated the wiki.
  • Storage
    • Gfal2 - GGUS 99043, 99044, 99055, 99067 - not performant, but very interesting functionality
    • WebDAV now enabled on LFC@RAL and ports free from firewall - needs testing

Tuesday 19 November 2013

Site Updates

Actions


Meeting Summaries
Project Management Board - Members Minutes Quarterly Reports

Empty

GridPP ops meeting - Agendas Actions Core Tasks

Empty


RAL Tier-1 Experiment Liaison Meeting (Wednesday 13:30) Agenda EVO meeting

Wednesday 4th December

  • Operations report
  • WebDAV access to the LFC is now available (in read-only mode only).
WLCG Grid Deployment Board - Agendas MB agendas

Empty



NGI UK - Homepage CA

Empty

Events

Empty

UK ATLAS - Shifter view News & Links

Empty

UK CMS

Empty

UK LHCb

Empty

UK OTHER
  • N/A
To note

  • N/A