Operations Bulletin 251113

From GridPP Wiki

Bulletin archive


Week commencing 18th November 2013
Task Areas
General updates

Tuesday 19th November

  • There is a workshop on clouds on 28th & 29th November.
  • There is an update of the GridPP pledge spreadsheet.
  • A summary of the WLCG workshop is available. The agenda is here.
  • The final WLCG T2 October ops availability/reliability report is now available.


Tuesday 5th November

  • The end-of-October deadline for SL6 WN migration has now passed.
  • The report from Monday's WLCG ops meeting is available. ALICE has lower job efficiency on SLC6 than SLC5.
  • The October WLCG Tier-2 availability/reliability figures are now published (OPS pdf). Results are available for the VOs (see here). Individual reports: ALICE ATLAS CMS LHCB.
  • The UK eScience CA infrastructure services are at risk on 5th November and possibly on 6th November due to electrical work.
  • The WLCG workshop takes place next week. The agenda is available.
  • A reminder that HEPiX talks from last week are available (here).
WLCG Operations Coordination - Agendas

Tuesday 19th November

  • The next meeting will be 'virtual' and on 21st November (see the agenda).
  • There was a meeting on 7th November.

...

Tuesday 2nd September

  • Middleware
    • New BDII release in the latest EMI-2/3 update, including better GLUE-2 support and security fixes. Sites should update all their BDII instances
    • New CVMFS version released for a security fix. Sites should upgrade or at least apply the hot fix in the above twiki
    • perfSONAR: sites should upgrade to the latest version, fixing many deployment problems
    • The end of support for dCache 1.9.12 has been postponed to September 30 due to a delay in releasing the SHA-2 compliant version in the dCache 2.2 series.
    • Consult https://twiki.cern.ch/twiki/bin/view/LCG/WLCGBaselineVersions
  • SHA-2
    • Discussion mostly dedicated to the experiments testing status. Atlas and LHCb have tested the services but not job submission yet. All experiments have been encouraged to test this.
  • SL6
    • T2 Done: 49/129 (Alice 11/39, Atlas 28/89, CMS 22/65, LHCb 13/45) -> 80/129 still to be done.
    • HS06: Reminder that sites are requested to run HS06 benchmark and update the value in the BDII. Increased values might be discussed at the WLCG MB.
    • EMI-3: voms-clients have been fixed and the latest version is in the PT repository but not in EMI-3 yet. Both CMS and Atlas work on DPM/dcache sites with this patch. (QMUL might want to give an update on Storm when they upgrade)
    • UK status: Liverpool to be finished soon, Bham in downtime to upgrade this week, Bristol and Sussex should be done by the 15/9/2013, RALPP 20/09/2013 and QMUL, Lancaster, UCL 30/09/2013
  • glexec
    • 55 sites have still to respond; they have tied the glexec installation to their SL6 upgrade.
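
The SL6 tallies quoted above can be cross-checked with a few lines of Python. This is only a sketch: the figures are copied from the SL6 bullet, and the names `sl6_done` and `remaining` are illustrative. Note that the per-experiment counts add up to more than the overall total, presumably because many sites support more than one experiment.

```python
# SL6 migration tallies quoted in the bulletin: (sites done, sites total).
sl6_done = {
    "All T2": (49, 129),
    "ALICE": (11, 39),
    "ATLAS": (28, 89),
    "CMS": (22, 65),
    "LHCb": (13, 45),
}


def remaining(done, total):
    """Return how many sites still have to migrate."""
    return total - done


# Print the completed fraction and the outstanding count for each row.
for name, (done, total) in sl6_done.items():
    pct = 100.0 * done / total
    print(f"{name:7s} {done:3d}/{total:<3d} done ({pct:4.1f}%), "
          f"{remaining(done, total)} to go")
```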

Monday 12th August

  • There have been no recent meetings. The next is on 29th August.
      • Sites monitoring requirements: SUM tests not representing the real experiment status for example.
Tier-1 - Status Page

Tuesday 19th November

  • The generator load test last Wed (13th Nov) was successful.
  • Condor batch farm running OK. We do have a problem with one batch of worker nodes - but this is not related to Condor.
  • Next Tuesday (26th Nov) we will be upgrading the firmware in a disk array. This will cause an interruption to the LFC, ATLAS 3D and FTS2 services for a few hours (FTS3 is unaffected).
Storage & Data Management - Agendas/Minutes

Tuesday 8th October

  • The DPM workshop agenda and registration page will appear here.

Monday 30th September

  • A DPM workshop is being organised in Edinburgh for 13th December. The GridPP PMB anticipates covering travel for around 10 UK sysadmins for this event. Interest should be indicated during the storage group meeting.

Tuesday 17th September

  • Perhaps someone could summarise the "Dark Data identification tools" thread on TB-Support?



Accounting - UK Grid Metrics HEPSPEC06 Atlas Dashboard HS06

Tuesday 5th November

  • A reminder to keep an eye on the SL HS06 page for odd ratios. Steve takes HS06 cpu numbers direct from ATLAS and the page does get stuck every now and then.
  • The metrics page has been updated.

Tuesday 13th August

Tuesday 23rd July

  • Sites moving to SL6 are reminded of the need to re-benchmark their WNs. Some sites have updated the wiki already and provide an idea of the performance change.
  • There is an ongoing PMB discussion about the timeline for the next Tier-2 hardware tranche. Please let Pete or Jeremy know if your site will benefit from a spend this financial year.


Documentation - KeyDocs

See the worst KeyDocs list for documents needing review now and the names of the responsible people.


Monday 11 November

  • The plan for the adoption of backup servers continues to evolve. Please see the latest version here. The new version contains details of tests and concluding operations for site and VO admins.
  • The approved VOs page continues to be updated with the newest data from the operations portal.

Note: T2K now requires liblockfile-devel.

Tuesday 5th November

  • Document states will be reviewed at the core ops meeting this coming Thursday.

Tuesday 1st October

  • The approved VOs page has been updated with the newest data from the operations portal. Note that the VOMS records for LondonGrid now contain some alternative VOMS servers. The migration plan for use of these backup servers is now documented here.
Interoperation - EGI ops agendas

Monday 28th October

  • UMD-2 - no news really - support/users dwindling - security support to end by the end of April 2014 - bug with BDII; fix coming soon.
  • ARC - Major release coming in November.
  • UMD-3 Cream in test - Slurm plugin (becoming mainstream?) - also Torque, Blah plugin - Storm and VOMS server and client bug fixes
  • DMSU bug - affecting retrieval of output file from Cream (EMI-2 and EMI-3 UI affected)
  • xroot issue for dCache - J. Pina (SA1.3 /LIP): "dcache 2.2.17 does not support xrootd-backport, which is required for running a CMS site on dcache 2.2."
  • a new probe for Glue Validator alarms - sites failing it now in this view. See also this document - not clear if list is complete or accurate as status of the probe was not clarified - complaints from sites about tight schedule due to current effort dedicated to SHA-2 and SL6 - to be decided in November
  • Next meeting: Nov 1 - changes to timeline? Start in January and possible deadline in 2 months. Next meeting: Nov 11.
gLite support calendar.


Monitoring - Links MyWLCG

Friday 15th November

  • The next Monitoring Consolidation meeting, taking place on Friday 22nd November, is currently planned to discuss site implications, after which we can report back, as well as noting any updates to the draft planning report. It has been noted in the meetings, however, that API provision in the new framework is part of the restructure process.

Monday 30th September

  • David summarised the UK sites' position on Nagios in an email last week as:
  • There is a desire for a monitoring solution that gives automatic notifications and links to further information, and doesn't require additional webpages (which describes Nagios). We noted that Nagios could be used to import central Nagios tests and repurpose them for local testing.
  • In addition, it would be useful if the further details could include details of the testing execution commands (even including the test itself) for local diagnosis.
  • We wondered whether (and where) there might be common ground with the WLCG Nagios project - while this may have been discussed, it would be useful to clarify this.
  • It's important to have a clear and documented messaging/transport layer for any solution that's decided on, for integration with future monitoring solutions.

Tuesday 23rd July

Tuesday 18th June

  • David C is taking feedback on the Graphite implementation presented at the HEPSYSMAN meeting. Also considering integrating Site Nagios.
  • Glasgow dashboard now packaged and can be downloaded here.
On-duty - Dashboard ROD rota

Monday 18th November

  • Fairly busy week.
  • QMUL had intermittent failure because of high load on VM Hosting server
  • UCL CE is in downtime from 15th Nov to 27th Nov awaiting upgrade of WN. Open Ticket
  • RALPP dcache server is failing MidMon SHA2 test again. Maybe because it is publishing ProductVersion as UNDEFINEDVALUE. Open Ticket
  • Sussex Storage is broken. Open Ticket .
  • Tier 1 CE decommissioning ticket is still open. Should be closed. Open Ticket
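
As an illustration of the RALPP item above, a site could scan a dump of its published information for attributes still set to UNDEFINEDVALUE. The sketch below assumes the dump is simple LDIF-style `attribute: value` text without folded lines; the function name, the sample record and the attribute names in it are hypothetical and are not taken from any site's real output or from the MidMon test itself.

```python
def undefined_attributes(ldif_text):
    """Return (dn, attribute) pairs whose published value is UNDEFINEDVALUE."""
    hits = []
    current_dn = None
    for raw in ldif_text.splitlines():
        line = raw.strip()
        # Skip blank separator lines and comments.
        if not line or line.startswith("#"):
            continue
        attr, sep, value = line.partition(":")
        if not sep:
            continue
        value = value.strip()
        if attr.lower() == "dn":
            # Remember which entry the following attributes belong to.
            current_dn = value
        elif value == "UNDEFINEDVALUE":
            hits.append((current_dn, attr))
    return hits


# Hypothetical sample record, not real output from any site.
sample = """\
dn: ServiceID=example-se,o=example
ProductName: dCache
ProductVersion: UNDEFINEDVALUE
"""

print(undefined_attributes(sample))
```

Any pair reported this way points at a value the information provider has not filled in, which is the kind of gap suspected in the ticket.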

Monday 4th November

  • Quiet week. SHA-2 tickets still at the Tier-1 and ECDF. RAL PP has an EMI-1 dCache ticket.

Tuesday 22nd October

  • There seem to be a number of sites struggling to publish, but there already seem to be quite a number of GGUS tickets out there.
Rollout Status WLCG Baseline

Tuesday 29th Oct

  • Yesterday the first staged rollout request (for the CREAMCE) in months came through. I've updated the Stage of the Nation page.


Tuesday 8th Oct

  • There were updates to EMI2 and 3 yesterday, but no new request for Staged Rollout. There is a problem with dcap-libs: [GGUS 97805] References


Security - Incident Procedure Policies Rota

Tuesday 19th November

  • There was a team meeting last Friday 15th November. Next meeting on 29th.
  • Just a couple of site issues showing up in Pakiti.
  • Looking at ARGUS server for UK NGI.

Tuesday 29th October

  • There was a team meeting on Friday 25th.
  • A couple of critical warnings are appearing in Pakiti and being followed up.

Tuesday 8th October

  • ARGUS setup for UK
  • ARGUS configuration (see Chris's email)

Tuesday 17th September

  • More information on the EGI/PRACE/EUDAT Joint Security Training event mentioned last week is now available.


Services - PerfSonar dashboard | GridPP VOMS

Tuesday 19th November

  • There is a new dashboard. Feedback is welcome.
  • Manchester, Durham, Glasgow and Sussex show problems across the board.

Tuesday 1st October

  • PerfSONAR latency hosts configured to use the WLCG meshes should now have a traceroute measurement archive (MA) accessible from the GUI under 'Service Graphs' --> 'Traceroute'. Here is an example.

Tuesday 17th September

  • Upgrading/re-installing hosts to v3.3.1/mesh is only making slow progress.
  • There is a new view of the status between sites.
  • An outage at Manchester due to central switch maintenance means that VOMS is not going to be contactable for a period this morning. It is clear that we need the backup VOMS instances fully available to VOs - please can someone take a lead?
Tickets

Monday 18th November 15.00 GMT

39 Open UK tickets this week. None of them are really exciting, a lot of "business as usual" this week, so I'm not going to go over them all. If you have been poked by Mohit, Sam or Guenter on your tickets please can you address their concerns.

Scraping the barrel of interestingness:

https://ggus.eu/ws/ticket_info.php?ticket=97485 (21/9)
Jet are still losing LHCB jobs to these weird (seemingly) cert-based errors, even after a reinstall of their nodes. Has anyone seen anything like it before? Waiting for reply (11/11)

My ticket-sense tingles over the state of SUSSEX, now that Emyr has headed off to greener pastures. Especially as their Storm is in distress.

That's about it really. Of course I could be wrong, or missed something in my state of GGUS jadedness. So feel free to mention any tickets you want to talk about (particularly any you've submitted yourself).

http://tinyurl.com/cblj3ab


Tools - MyEGI Nagios

Tuesday 19th Nov

Backup Nagios at Lancaster has been upgraded. The names of some tests have changed and a few new tests have been added. Please have a look at https://gridppnagios.lancs.ac.uk/nagios and report any problems.

Tuesday 12th Nov

Planning to update the backup Nagios at Lancaster. The new release is a gLite-to-UMD release, so it requires re-installation of the Nagios box. I will put the Lancaster Nagios box in downtime for 3 days from 13 Nov. This will not affect any monitoring, but there will be no backup if the main Nagios box at Oxford fails during this period.

Monday 30th September

  • Ewan has put together a slightly modified WLCG VO box, but the effect is of a UI that takes gsi ssh logins from people in one particular VO, but then can be used as a UI for other VOs once you're logged in. The idea is that anyone who would need access to a central UI machine (so, mostly not people in PP depts.) would join a special-purpose VO. See Ewan's TB-SUPPORT email on 23rd September for more details.

Monday 2nd September

  • Intermittent Nagios errors -> Imperial WMS and all the jobs going through it were failing with ‘no compatible error’. Some reports of ongoing issues. What is the direct impact?
  • MyEGI and gstat were also down last week.
  • Jens is testing SHA-2 compliance of components. The version of gridsite on the GridPP website is not compliant but SHA-2 will be supported with a move to a new server (when?).
VOs - GridPP VOMS VO IDs Approved VO table

Tuesday 19 November 2013

Monday 21st October 2013


Monday 7th October 2013

  • CVMFS server for hyperk.org still outstanding
  • LFC Webdav still awaiting port opening
  • HyperK - progress - expect to run significant number of jobs soon.


Site Updates

Actions


Meeting Summaries
Project Management Board - Members Minutes Quarterly Reports

Empty

GridPP ops meeting - Agendas Actions Core Tasks

Empty


RAL Tier-1 Experiment Liaison Meeting (Wednesday 13:30) Agenda EVO meeting

Wednesday 13th November

  • Operations report
  • Systems have worked OK since the intervention on the UPS a week ago. This morning there was a successful load test of the UPS/generator.
  • The worker nodes that were in the Torque/Maui batch farm have been moved into the Condor farm. The CREAM CEs that served this farm (lcgce01, lcgce02, lcgce04, lcgce10, lcgce11) are set as being not in production in the GOC DB.
WLCG Grid Deployment Board - Agendas MB agendas

Empty



NGI UK - Homepage CA

Empty

Events

Empty

UK ATLAS - Shifter view News & Links

Empty

UK CMS

Empty

UK LHCb

Empty

UK OTHER
  • N/A
To note

  • N/A