Operations Bulletin 021213

From GridPP Wiki
Revision as of 03:10, 2 December 2013 by Jeremy coles (Talk | contribs)


Bulletin archive


Week commencing 25th November 2013
Task Areas
General updates

Monday 25th November

  • There is a pre-GDB on Identity Federation in WLCG (agenda). The next GDB is on 11th December.
  • EMI-3 WN tarball status (and glexec)?
  • There is an LFC outage today (see the downtime announcement).
  • The middleware readiness group are setting a time for their meeting. More site admins are needed! Discussions will surround the items in the twiki.
  • There was an email thread last week on ATLAS plans to move jobs/data away from a site going into downtime. The focus seemed to be on the execution rather than the storage side of things.
  • A new SAM interface is available for checking.
  • Glue2 information validation is ongoing. Look to the monitoring summary page for more information.


Tuesday 19th November

  • There is a workshop on clouds on 28th & 29th November.
  • There is an update of the GridPP pledge spreadsheet.
  • A summary of the WLCG workshop is available. The agenda is here.
  • The final WLCG T2 October ops availability/reliability report is now available.


WLCG Operations Coordination - Agendas

Tuesday 26th November

  • CMS - CRAB users warned that gLite-WMS submission is in decreasing support
  • LHCb - LHCb will only build slc6 binaries as of January 2014
  • SHA-2 - the experiments have tested a lot and look ready. By Dec 1 the WLCG infrastructure is expected to be mostly ready.
  • WMS decommissioning - Some progress for CMS.
  • glexec - EMI gLExec probe (in use since SAM Update 22) crashes on sites that use the tarball WN and do not have the Perl module Time/HiRes.pm installed. Status is tracked here.
  • xrootd deployment - UDP collector (a.k.a. GLED) for detailed monitoring. An additional instance of the collectors has been enabled at CERN for FAX.
      • Site monitoring requirements: for example, SUM tests do not represent the real experiment status.
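
For the glexec item above, a site running the tarball WN could check whether the Perl module is visible before the probe runs. This is a minimal sketch, assuming a `perl` interpreter is reachable from the worker node environment; the function name `perl_module_available` is illustrative and is not part of any SAM or EMI tooling.

```python
import subprocess


def perl_module_available(module: str, perl: str = "perl") -> bool:
    """Return True if `perl -M<module> -e 1` exits cleanly.

    `perl` is the interpreter to test. Passing the tarball's own
    interpreter path here is an assumption about the local setup.
    """
    try:
        result = subprocess.run(
            [perl, f"-M{module}", "-e", "1"],
            stdout=subprocess.DEVNULL,
            stderr=subprocess.DEVNULL,
            check=False,
        )
    except OSError:
        # The interpreter itself could not be started.
        return False
    return result.returncode == 0


if __name__ == "__main__":
    name = "Time::HiRes"
    state = "found" if perl_module_available(name) else "MISSING"
    print(f"{name}: {state}")
```

Running the same check with the interpreter and environment that the probe actually uses is what matters; a module present for the system Perl may still be invisible to a relocated tarball install.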
Tier-1 - Status Page

Tuesday 26th November

  • There was a failure of the Primary OPN link to CERN yesterday morning. The automatic failover didn't work as the router at the RAL end did not 'see' the break. Fixed by manually dropping that connection. (Primary OPN link to CERN now fixed - switched back to this just before the meeting today).
  • Planned upgrade of firmware in a disk array ongoing this morning. Currently LFC, Atlas 3D and FTS2 services down for a few hours. (FTS3 unaffected).
Storage & Data Management - Agendas/Minutes

Tuesday 8th October

  • The DPM workshop agenda and registration page will appear here.

Monday 30th September

  • A DPM workshop is being organised in Edinburgh for 13th December. GridPP PMB anticipated covering travel for on the order of 10 UK sysadmins for this event. Interest should be indicated during the storage group meeting.

Tuesday 17th September

  • Perhaps someone could summarise the "Dark Data identification tools" thread on TB-Support?



Accounting - UK Grid Metrics HEPSPEC06 Atlas Dashboard HS06

Tuesday 26th November

Tuesday 5th November

  • A reminder to keep an eye on the SL HS06 page for odd ratios. Steve takes HS06 cpu numbers direct from ATLAS and the page does get stuck every now and then.
  • The metrics page has been updated.

Tuesday 13th August

Documentation - KeyDocs

See the worst KeyDocs list for documents needing review now and the names of the responsible people.


Monday 11 November

  • The plan for adoption of backup servers continues to evolve. Please see the latest version here. The new version contains details of tests and concluding operations for site and VO admins.
  • The approved VOs page continues to be updated with the newest data from the operations portal.

Note: T2K now requires liblockfile-devel.

Tuesday 5th November

  • Document states will be reviewed at the core ops meeting this coming Thursday.

Tuesday 1st October

  • The approved VOs page has been updated with the newest data from the operations portal. Note that the VOMS records for LondonGrid now contain some alternative VOMS servers. The migration plan for use of these backup servers is now documented here.
Interoperation - EGI ops agendas

Tuesday 26th November

  • David will upload notes soon; apologies for the delay in getting them posted. One item to draw out is that the GFAL lcg-utils product team is proposing to phase out GFAL/lcg_utils in favour of GFAL2/gfal-utils (https://svnweb.cern.ch/trac/lcgutil/wiki/MediumTermProposal). Feedback is solicited on this, and it was stressed that this is only a proposal.

Monday 28th October

  • UMD-2 (no news really) - support/users dwindling - security support to end by the end of Apr/2014 - bug with BDII; fix coming soon.
  • ARC - Major release coming in November.
  • UMD-3 Cream in test - Slurm plugin (becoming mainstream?) - also Torque, Blah plugin - Storm and VOMS server and client bug fixes
  • DMSU bug - affecting retrieval of output file from Cream (EMI-2 and EMI-3 UI affected)
  • xroot issue for dCache - J. Pina (SA1.3 /LIP): "dcache 2.2.17 does not support xrootd-backport, which is required for running a CMS site on dcache 2.2."
  • a new probe for Glue Validator alarms - sites failing it now in this view. See also this document - not clear if list is complete or accurate as status of the probe was not clarified - complaints from sites about tight schedule due to current effort dedicated to SHA-2 and SL6 - to be decided in November
  • Next meeting: Nov 1 - changes to timeline? start Jan and possible deadline in 2 months. Next meeting: Nov 11.
gLite support calendar.


Monitoring - Links MyWLCG

Tuesday 26th November

  • As noted by Alessandra, if possible we'd like site feedback on the consolidated monitoring prototype before the next meeting a week on Friday to report back to the group (with thanks to everyone who has already contributed)
  • Some notes to form a wiki on Graphite can be found here: https://www.gridpp.ac.uk/wiki/MonitoringTools. These are still under development; if there are areas people would find useful that could be expanded, please let David know.

Friday 15th November

  • The next Monitoring Consolidation meeting, taking place on Friday 22nd November, is currently planned to discuss site implications, after which we can report back, as well as noting any updates to the draft planning report. It has been noted in the meetings, however, that API provision in the new framework is part of the restructure process.

Monday 30th September

  • David summarised the UK sites' position on Nagios in an email last week as:
  • There is a desire for a monitoring solution that gives automatic notifications and links to further information, and doesn't require additional webpages (which describes Nagios). We noted that Nagios could be used to import central Nagios tests and repurpose them for local testing.
  • In addition, it would be useful if the further details could include details of the testing execution commands (even including the test itself) for local diagnosis.
  • We wondered whether (and where) there might be common ground with the WLCG Nagios project - while this may have been discussed, it would be useful to clarify this.
  • It's important to have a clear and documented messaging/transport layer for any solution that's decided on, for integration with future monitoring solutions.

Tuesday 23rd July

Tuesday 18th June

  • David C is taking feedback on the Graphite implementation presented at the HEPSYSMAN meeting. Also considering integrating Site Nagios.
  • Glasgow dashboard now packaged and can be downloaded here.
On-duty - Dashboard ROD rota

Monday 25th November

  • Nothing unusual. A steady trickle of transient problems.
  • The RAL Tier-1 SHA-2 ticket was finally closed as the relevant machines were decommissioned.

Monday 18th November

  • Fairly busy week.
  • QMUL had intermittent failures because of high load on the VM hosting server.
  • UCL CE is in downtime from 15th Nov to 27th Nov awaiting upgrade of WN. Open Ticket
  • RALPP dcache server is failing MidMon SHA2 test again. Maybe because it is publishing ProductVersion as UNDEFINEDVALUE. Open Ticket
  • Sussex Storage is broken. Open Ticket.
  • Tier 1 CE decommissioning ticket is still open. Should be closed. Open Ticket
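
For the RALPP item above, a site could scan the LDIF its information provider emits for attributes still carrying the placeholder value. This is a minimal sketch: the attribute and DN names in `SAMPLE` are invented for illustration and do not reproduce the real GLUE schema, and continuation lines are simply ignored.

```python
from typing import Iterable, List, Tuple

PLACEHOLDER = "UNDEFINEDVALUE"


def find_undefined(ldif_lines: Iterable[str]) -> List[Tuple[str, str]]:
    """Return (dn, attribute) pairs whose value is the placeholder."""
    hits = []
    current_dn = ""
    for raw in ldif_lines:
        line = raw.rstrip("\n")
        if not line or line.startswith("#"):
            continue  # blank separator or comment
        attr, sep, value = line.partition(":")
        if not sep:
            continue  # no "attr: value" shape; skipped in this sketch
        value = value.strip()
        if attr.lower() == "dn":
            current_dn = value
        elif value == PLACEHOLDER:
            hits.append((current_dn, attr))
    return hits


# Hypothetical LDIF fragment; names are illustrative only.
SAMPLE = """\
dn: ExampleServiceId=se.example.org,o=example
ExampleName: storage
ExampleProductVersion: UNDEFINEDVALUE
"""

if __name__ == "__main__":
    for dn, attr in find_undefined(SAMPLE.splitlines()):
        print(f"{dn}: {attr} is {PLACEHOLDER}")
```

Feeding the function the output of the site's own information provider would list every entry that still needs a real value published.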
Rollout Status WLCG Baseline

Tuesday 29th Oct

  • Yesterday the first staged rollout request (for the CREAMCE) in months came through. I've updated the Stage of the Nation page.


Tuesday 8th Oct

  • There have been updates to EMI2 and 3 yesterday, but no new request for Staged Rollout. There is a problem with dcap-libs: [GGUS 97805] References


Security - Incident Procedure Policies Rota

Tuesday 19th November

  • There was a team meeting last Friday 15th November. Next meeting on 29th.
  • Just a couple of site issues showing up in Pakiti.
  • Looking at ARGUS server for UK NGI.

Tuesday 29th October

  • There was a team meeting on Friday 25th.
  • A couple of critical warnings are appearing in Pakiti and being followed up.

Tuesday 8th October

  • ARGUS setup for UK
  • ARGUS configuration (see Chris's email)

Tuesday 17th September

  • More information on the EGI/PRACE/EUDAT Joint Security Training event mentioned last week is now available.


Services - PerfSonar dashboard | GridPP VOMS

Tuesday 26th November

  • The main perfSONAR issues this week affect Manchester and Sussex.

Tuesday 19th November

  • There is a new dashboard. Feedback is welcome.
  • Manchester, Durham, Glasgow and Sussex show problems across the board.

Tuesday 1st October

  • PerfSONAR latency hosts configured to use the WLCG meshes should now have a traceroute measurement archive (MA) accessible from the GUI under 'Service Graphs' --> 'Traceroute'. Here is an example.

Tuesday 17th September

  • Upgrading/re-installing hosts to v3.3.1/mesh is only making slow progress.
  • There is a new view of the status between sites.
  • An outage at Manchester due to central switch maintenance means that VOMS is not going to be contactable for a period this morning. It is clear that we need the backup VOMS instances fully available to VOs - please can someone take a lead?
Tickets

Monday 25th November 2013, 15.30 GMT
41 Open UK tickets today.

Information System Tickets:
RALPP, ECDF, Lancaster, Liverpool, UCL, Brunel, RHUL and the Tier 1 all got tickets about their information system (this is a prelude to information system probes going into the SAM tests).
I asked for some clarification in the Lancaster ticket, as our resource BDIIs are up to date and recently reconfigured, but as these tickets are super-fresh don't panic about them.

RALPP
https://ggus.eu/ws/ticket_info.php?ticket=99186 (25/11)
Not a reflection on the site (the ticket is 10 minutes old at time of writing), but the subject interested me "NAGIOS *emi.cream.glexec.CREAMCE-JobSubmit-/ops/Role=pilot* failed on heplnv146.pp.rl.ac.uk@UKI-SOUTHGRID-RALPP". Are glexec failures becoming critical? Assigned (25/11)

Which reminds me, I'll be taking a look at all your (and my own...*whimper*) glexec tickets next week.

https://ggus.eu/ws/ticket_info.php?ticket=98923 (15/11)
Picking on RALPP again, this other (SHA2) nagios ticket got reopened. Looks like you're just not publishing your dcache version. To the ldifs! Reopened (25/11)

SUSSEX
https://ggus.eu/ws/ticket_info.php?ticket=98882 (14/11)
Emyr fixed Sussex's STORM (hang on, I thought Emyr had escaped?). The site's been whitelisted for testing since the 21st; if things are looking good I suggest closing this ticket. In progress (21/11)

SHEFFIELD
https://ggus.eu/ws/ticket_info.php?ticket=98594 (4/11)
This LHCB ticket, regarding file uploading troubles running at Sheffield post SL6 upgrade, is looking a bit neglected. Does anyone else know of any post-SL6 tweaks that they needed to apply (say a cheeky undocumented rpm) to get LHCB to work after their move to SL6? In Progress (13/11)

cvmfs@RAL tickets
https://ggus.eu/ws/ticket_info.php?ticket=98249 (SNO+)
https://ggus.eu/ws/ticket_info.php?ticket=98122 (cern@school)
Both of these tickets have received their first warning for being in the "waiting for reply" state for too long.

https://ggus.eu/ws/ticket_info.php?ticket=97868 (t2k)
T2K don't have software to put into their stratum 0 yet, but would like to test with a ROOT tarball. No word from Catalin over this modest testing plan (at least on the ticket, you might be beavering away offline on this). In progress (18/11)

https://ggus.eu/ws/ticket_info.php?ticket=97385 (hyperK)
A similar story here (I think work is just progressing offline, hopefully we haven't entered a nightmarish universe where anything not documented in GGUS tickets doesn't happen - yet). In progress (18/11)

GLASGOW
https://ggus.eu/ws/ticket_info.php?ticket=96234 (29/7)
WMS support for HyperK at Glasgow. Chris spotted a problem, Dave said he'd get on it on Monday (which unless Dave had a 9 day weekend was a week ago). Any luck? In progress (15/11)

EFDA-JET
https://ggus.eu/ws/ticket_info.php?ticket=97485 (21/9)
I mentioned this LHCB ticket last week, as this recurring problem has stumped everyone involved. The JET guys have asked LHCB for some information to try to help them debug the problem. Waiting for reply (18/11)

I've no doubt missed something, having rushed this out in half the time I usually take, so I'll cover my shoddiness with my usual line that if I've missed any tickets of interest, please bring them up at the meeting or online.

Tools - MyEGI Nagios

Tuesday 26th November

  • Regional Nagios updated to release 22. It is a glite to UMD update and it required a fresh installation.
  • There have been some internal changes in SAM-Nagios. Test probes are now the responsibility of product team. Some test names have been changed as a result of this reorganization. For example the org.sam.CREAMCE-DirectJobSubmit test has become emi.cream.CREAMCE-DirectJobSubmit. This does not affect the operational activities.
  • Please could all site admins look at the services associated with their site and mail Kashif if anything odd is noticed. Site admins can reschedule tests for their sites, and it would be helpful if most of the functionality is tested.
  • Also, look at myegi which can be useful with links to the Dashboard, GSTAT, Accounting Portal and GGUS.
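
Local scripts that refer to probes by their old names may need a translation step after the rename described above. This is a minimal sketch: the single pair in the table comes from this bulletin, while the table name `RENAMED_PROBES` and the helper `current_probe_name` are illustrative and not part of SAM-Nagios.

```python
# Old-to-new probe names. The single pair below is the example given in
# the bulletin; any further entries should come from the SAM release
# notes rather than being guessed.
RENAMED_PROBES = {
    "org.sam.CREAMCE-DirectJobSubmit": "emi.cream.CREAMCE-DirectJobSubmit",
}


def current_probe_name(name: str) -> str:
    """Map a pre-rename probe name to its current one; pass others through."""
    return RENAMED_PROBES.get(name, name)


if __name__ == "__main__":
    print(current_probe_name("org.sam.CREAMCE-DirectJobSubmit"))
```

Names not in the table are returned unchanged, so the helper is safe to apply to probe names that were never renamed.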


Tuesday 19th Nov

  • Backup Nagios at Lancaster has been upgraded. The names of some tests have been changed and a few new tests have been added. Please have a look at https://gridppnagios.lancs.ac.uk/nagios and report any problems.

Tuesday 12th Nov

  • Planning to update Backup Nagios at Lancaster. The new release is a glite to UMD release, so it requires re-installation of the Nagios box. I will put the Lancaster Nagios box in downtime for 3 days from 13 Nov. It will not affect any monitoring, but there will be no backup if the main Nagios box at Oxford fails during this period.

VOs - GridPP VOMS VO IDs Approved VO table

Monday 25th November 2013

Tuesday 19 November 2013

Monday 21st October 2013


Monday 7th October 2013

  • CVMFS server for hyperk.org still outstanding
  • LFC Webdav still awaiting port opening
  • HyperK - progress - expect to run significant number of jobs soon.


Site Updates

Actions


Meeting Summaries
Project Management Board - Members Minutes Quarterly Reports

Empty

GridPP ops meeting - Agendas Actions Core Tasks

Empty


RAL Tier-1 Experiment Liaison Meeting (Wednesday 13:30) Agenda EVO meeting

Wednesday 27th November

WLCG Grid Deployment Board - Agendas MB agendas

Empty



NGI UK - Homepage CA

Empty

Events

Empty

UK ATLAS - Shifter view News & Links

Empty

UK CMS

Empty

UK LHCb

Empty

UK OTHER
  • N/A
To note

  • N/A