Operations Bulletin 230614

From GridPP Wiki

Bulletin archive

Week commencing 16th June 2014
Task Areas
General updates

Monday 16th June

  • There was a GDB last week and pre-GDB on IPv6. Of particular interest is Helge's trip report from HEPIX. The official GDB summary notes are now available and the actions updated.
  • GGUS ran into problems over the weekend and this affected ticket processing. The system did not recover from a DB connection problem and required a server restart.
  • The CVMFS installation servers hosted at CERN are planned to be upgraded from v2.0 to v2.1, starting on 5th August. Before the server upgrade (by the end of July), sites are asked to make sure that the CVMFS clients deployed on the WLCG infrastructure are upgraded to version 2.1.19, released on 28th May.
  • Well done to Brunel for being the first UK site to offer EGI resources via the eGrant system!
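
The CVMFS client request above (version 2.1.19 before the server upgrade) lends itself to a quick site-wide check. The following is a minimal Python sketch, assuming the installed client versions have already been collected as plain dotted strings. The host names and the `clients` inventory are hypothetical, and treating anything newer than 2.1.19 as acceptable is an assumption rather than something stated in the announcement.

```python
# Minimal sketch: flag worker nodes whose CVMFS client is older than 2.1.19.
# The inventory below is hypothetical; how versions are gathered is site-specific.

def parse_version(text):
    """Turn a dotted version string such as '2.1.19' into a tuple of ints."""
    return tuple(int(part) for part in text.strip().split("."))


# Version requested ahead of the CERN server upgrade.
REQUIRED = parse_version("2.1.19")


def needs_upgrade(installed):
    """Return True if the installed client version is older than REQUIRED."""
    # Tuples compare numerically field by field, so 2.1.9 < 2.1.19,
    # which a plain string comparison would get wrong.
    return parse_version(installed) < REQUIRED


# Hypothetical inventory: host name -> installed CVMFS client version.
clients = {
    "wn001.example.org": "2.0.19",
    "wn002.example.org": "2.1.19",
    "wn003.example.org": "2.1.15",
}

outdated = sorted(host for host, version in clients.items() if needs_upgrade(version))
print("Hosts needing a CVMFS client upgrade:", outdated)
```

Parsing into integer tuples rather than comparing strings is the main design choice here, since the third field has already reached two digits.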

Monday 9th June

  • Note that the EMI-2 deadline set by EGI has now passed (31st May) and all remaining EMI-2 services/endpoints must be put into downtime.
  • The UK CA moved to issuing SHA-2 certificates on 28th May.
  • Reminder that the WLCG workshop registration should have been completed by now (and the accompanying GridPP travel request).
  • There is a GDB at CERN this week with pre-GDB on IPv6. The first few overview talks (setting up and comparing IPv4 and IPv6) are good background and it is recommended to review them. You can test your connectivity via ipv6-test.com.
  • The agenda from the HEPSYSMAN meeting at RAL last week is available here.
  • The GridPP DIRAC service can be accessed via this link.
  • ATLAS datasets on LocalGroupDisk more than 2 years old are being deleted starting from June 1st 2014.
  • EGI is now informing sites on a biweekly basis of VOs seeking additional resources. A process has been created for sites to register their resources into a ‘pool’ via the eGrant system. More information is available.
  • The May WLCG availability/reliability figures have been released. A reminder, if you want to request a re-computation you need to submit a GGUS ticket. Specific follow-ups have been requested in an email to TB-SUPPORT on 2nd June.

Monday 27th May

  • There is a DIRAC workshop at CERN this week. If you, or VOs you work with, have any thoughts on specific requirements for future DIRAC development please let Janusz/Dan or Jeremy know.
  • An EGI Operations Management Board takes place on Tuesday morning. The topics: CSIRT update; resource allocations; gfal2 replacing lcg_utils; central SAM update; EMI-2 decommissioning and update on GPGPUs.
  • The latest ops portal update integrated the possibility to register and update VO ID cards according to a new discipline classification. VO managers are being encouraged to check/update their VO ID card via this link.
  • There was a GridPP technical meeting last Friday.
  • The final WLCG Tier-2 reliability & availability reports for April are now uploaded.
  • There has been some TB-Support discussion on GitHub vs BitBucket. Any conclusions?
  • A reminder that David C has put together a blog on monitoring.

WLCG Operations Coordination - Agendas

Monday 16th June

  • The next WLCG ops meeting is on Thursday 19th June. The meeting structure is changing to have a dedicated section for T1s and T2s to comment, respond or raise new concerns.

Tuesday 10th June

  • There was a WLCG ops coordination meeting on Thursday 5th June.
  • Middleware: CVMFS updated; FTS3 added; fix for DPM 1.8.8
  • CERN: Grid submissions to the remaining SLC5 resources stop on the 19th of June. LFC decommissioning for Atlas: the daemons have been stopped, and the data is frozen.
  • DM: Update to DPM's gridftp server released, to fix issues encountered with FTS2 transfers.
  • All sites need to upgrade their CVMFS client to version 2.1.19 by August 5th ahead of CERN repository migration to 2.1.X.
  • ALICE: KIT seeing high network load due to continued use of old ROOT versions by users
  • ATLAS: MonteCarlo production and analysis: stable load in the past week
  • CMS: will now ramp up scale of Tier-0 tests on AI and HLT clouds. ARGUS problem - affects glexec. DPM fix for FTS2 issue.
  • LHCb: CVMFS switching over to new stratum infrastructure.
  • Tracking: Next GGUS release 19th July - The automatic creation of tickets through mail will be stopped.
  • FTS3: Discussion on new feature request: multi-destination transfer with automated rerouting.
  • glexec: Down to 10 open tickets. See the tracking page.
  • Machine/job features: SGE implementation now at Imperial.
  • M/w readiness: See the task overview.
  • Multicore: Nothing to report (NTR).
  • SHA-2: CERN VOMS - EMI fix now available. Quick check with RFC proxies failed for ATLAS.
  • WMS: NTR
  • IPv6: NTR
  • HTTP proxy discovery: NTR
  • Network and transfer metrics: Planning to organize a kick-off meeting in July - membership being agreed (so get involved now).

Tier-1 - Status Page

Tuesday 17th June

  • The Castor Nameserver upgrade (to version 2.1.14) was successful last week. CMS Stager update this morning. Remaining stager dates: GEN - Tue 24th June; LHCb - Thu 26th June; Atlas - Wed 2nd July.
  • We are looking at how to end the FTS2 service, now FTS3 is becoming widely used.
  • The software server used by the small VOs will be withdrawn from service. Its use as a software server is very limited (possibly only SNO+) although a few VOs use it for uploading files to the CVMFS repository.

Storage & Data Management - Agendas/Minutes

Tuesday 17th June

  • Advances with CEPH at RAL will be reported to the Storage Meeting. It is hoped to set up a regular update contribution.

Tuesday 10th June

  • The DPM Collaboration agreement has been updated.

Wed 28 May 2014

  • FTS capabilities - with and without Web interface - interest in more tests
  • Impact of deprecation of lcg-utils - particularly for non-LHC VOs that use LFC. Conversely, started playing with GFAL2 (Sam).
  • Interest in DIRAC tutorial either at hepsysman or next GridPP.

Accounting - UK Grid Metrics HEPSPEC06 Atlas Dashboard HS06

Tuesday 10th June

  • APEL not up-to-date for: Brunel, Sheffield, QMUL, Durham and Sussex? EMI-2 service downtime related in some cases?

Tuesday 20th May

  • Sites with APEL 'delays': IC, Liverpool, Sheffield, Durham, ECDF and Glasgow.

Tuesday 13th May

  • Will review GridPP metrics soon. Trying to get table up-to-date first.
  • No HEPSPEC06 wiki updates showing SL6 results for UCL or RALPP.
  • ATLAS HS06 coefficient for Lancaster 13.9?
  • APEL publishing 'stopped' for Liverpool, ECDF and Glasgow.

Documentation - KeyDocs

See the worst KeyDocs list for documents needing review now and the names of the responsible people.

Monday 16th June

  • A review is starting of old and obsolete pages within the GridPP website - there are many! Please review sections that you have created and update them if necessary.

Tuesday 6th April

  • KeyDocs are going to be reviewed (in next 4 weeks) as the system is not working (or not adding anything) in some areas.

Interoperation - EGI ops agendas

Tuesday 10th June

  • Next meeting June 16th.

Monitoring - Links MyWLCG

Tuesday 10th June

On-duty - Dashboard ROD rota

Tuesday 17th June

  • Quiet week
  • Sussex: The EMI-3 ticket can probably be resolved. There are no outstanding alarms associated with the ticket now.
  • QMUL: Availabilities: These should start to climb back after the weekend.

Monday 9th June

  • No issues to report.

Rollout Status WLCG Baseline

Tuesday 18th March

Tuesday 11th February

  • 31st May has been set as the deadline for EMI-2 decommissioning. There may be an issue for dCache (related to 3rd party/enstore component).


Security - Incident Procedure Policies Rota

Tuesday 10th June

  • Comments from the workshop last week

Monday 26th May

  • NGI security communications were tested today.

Services - PerfSonar dashboard | GridPP VOMS

- This includes notifying of (inter)national services that will have an outage in the coming weeks or will be impacted by work elsewhere. (Cross-check the Tier-1 update).

Tuesday 17th June

  • The GridPP VOMS server was updated on 11/06/2014 - no issues reported.


Monday 16th June 2014, 15.00 BST
28 Open UK Tickets today.
Please can everyone check to make sure they don't have tickets going stale.

https://ggus.eu/index.php?mode=ticket_info&ticket_id=106057 (9/6)
The creation of a new UK Cloud site UKI-LT2-IC-HEP-Cloud. Jeremy has created the site and I see Adam has signed himself up as an Admin to it. Does anything else need doing? In progress (11/6)

EMI 2 APEL tickets for RHUL and MANCHESTER:
https://ggus.eu/index.php?mode=ticket_info&ticket_id=105923
https://ggus.eu/index.php?mode=ticket_info&ticket_id=105922
Not much noise from either of these tickets.

https://ggus.eu/index.php?mode=ticket_info&ticket_id=106243 (16/6)
Sno+ ran into a spot of bother with some of the UK vomses, Robert replied with a good explanation of what was going on at Manchester. Something for the other voms sites to watch out for (although you probably know all about it). In progress (is it solved?) (16/6)

https://ggus.eu/index.php?mode=ticket_info&ticket_id=106060 (9/6)
Matt RB fixed one ATLAS problem with the Sussex Storm SE, but another has come along - looking like bad checksums. Wahid suggests asking Chris Walker (the Storm Whisperer) for his advice. In progress (16/6)

https://ggus.eu/index.php?mode=ticket_info&ticket_id=102810 (28/3)
Sussex's EMI ticket - almost done now, I believe the alarms are disappearing and now the problems are with services not working (but as we discussed last week, upgraded and broken is still upgraded!). In progress (13/6)

https://ggus.eu/index.php?mode=ticket_info&ticket_id=105618 (21/5)
Sorry to be picking on Sussex. I suspect this Sno+ cvmfs ticket has been put on the back burner - can it be put on hold until you get round to it? In progress (9/6)

Same for https://ggus.eu/index.php?mode=ticket_info&ticket_id=105937

Tools - MyEGI Nagios

Tuesday 20th May

Between May 1st and May 12th, SAM-CENTRAL and the Message Broker Network experienced a set of chained failures that resulted in the loss of a large portion of the metric results published by the SAM NGI instances. The loss of these messages will result in an unusually high number of UNKNOWNs in the May A/R reports, but the actual A/R numbers will not be affected, as UNKNOWNs are not taken into account. No other services have been affected.
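
To see why a spike in UNKNOWN results leaves the A/R figures unchanged, here is a minimal Python sketch. It assumes the commonly quoted definitions: availability = up time / (total time - unknown time), and reliability = up time / (total time - scheduled downtime - unknown time). The function names and the hour counts are hypothetical and are not taken from the May reports.

```python
# Minimal sketch of A/R arithmetic with UNKNOWN periods excluded.
# Assumed definitions (not taken from the bulletin itself):
#   availability = up / (total - unknown)
#   reliability  = up / (total - scheduled - unknown)

def availability(up, down, scheduled, unknown):
    """Fraction of the known period during which the site was up."""
    total = up + down + scheduled + unknown
    known = total - unknown
    return up / known if known > 0 else None


def reliability(up, down, scheduled, unknown):
    """As availability, but scheduled downtime is also excluded."""
    total = up + down + scheduled + unknown
    counted = total - scheduled - unknown
    return up / counted if counted > 0 else None


# Hypothetical 744-hour month (31 days) with every test result received.
full_month = dict(up=700, down=24, scheduled=20, unknown=0)

# The same site, but half of the results were lost and show up as UNKNOWN.
lossy_month = dict(up=350, down=12, scheduled=10, unknown=372)

for label, hours in (("complete", full_month), ("with losses", lossy_month)):
    print(label, availability(**hours), reliability(**hours))
```

Because the UNKNOWN hours are removed from the denominator, both months give the same figures as long as the lost results were spread evenly over up and down periods; the reports simply rest on fewer measurements.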

Tuesday 13th May

  • From last week's discussion, DIRAC now supports: NA62, vo.landslides.mossaic.org, t2k.org, snoplus, gridpp, CERN@school and northgrid. NA62 are moving from LFC to DFC and plan to use DIRAC in place of the WMS.

VOs - GridPP VOMS VO IDs Approved VO table

Monday 16 June 2014

  • Snoplus almost ready to move to CVMFS - waiting on two sites. Will use symlinks in existing software.
  • VOMS server: Snoplus has problems with some of the VOMS servers - see GGUS 106243 - may be related to update.

Tuesday 15th April

  • Is there interest in an FTS3 web front end? (more details)

Monday 17 February 2014

  • Proxy renewal
    • All RAL WMSs now renew proxies with 1024 bits. This looks like the end of this (at last).

Tuesday 11 February 2014

  • Proxy renewal
    • lcgwms06 at RAL has been upgraded and works
    • Both Imperial's WMSs work
    • Glasgow's will still need to be upgraded (unless they have been since Friday).

Site Updates

Tuesday 20th May

  • Various sites, notably Oxford, have ARGUS problems, with hundreds of requests seen per minute. Performance issues have been noted after initial installation at RAL, QMUL and others.

Meeting Summaries
Project Management Board - Members Minutes Quarterly Reports


GridPP ops meeting - Agendas Actions Core Tasks


RAL Tier-1 Experiment Liaison Meeting (Wednesday 13:30) - Agenda. Meeting takes place on Vidyo.

Wednesday 18th June 2014

  • Operations report
  • Castor CMS Stager was updated to 2.1.14-13 yesterday (17th June), although there were some problems. Remaining stager dates are as follows: GEN - Tue 24th June; LHCb - Thu 26th June; Atlas - Tue 1st July.

WLCG Grid Deployment Board - Agendas MB agendas


NGI UK - Homepage CA


UK ATLAS - Shifter view News & Links

  • N/A

To note

  • N/A