Operations Bulletin 160614

From GridPP Wiki

Bulletin archive


Week commencing 9th June 2014
Task Areas
General updates

Monday 9th June

  • Note that the EMI-2 deadline set by EGI has now passed (31st May) and all remaining EMI-2 services/endpoints must be put into downtime.
  • The UK CA moved to issuing SHA-2 certificates on 28th May.
  • Reminder that the WLCG workshop registration should have been completed by now (and the accompanying GridPP travel request).
  • There is a GDB at CERN this week with pre-GDB on IPv6. The first few overview talks (setting up and comparing IPv4 and IPv6) are good background and it is recommended to review them. You can test your connectivity via ipv6-test.com.
  • The agenda from the HEPSYSMAN meeting at RAL last week is available here.
  • The GridPP DIRAC service can be accessed via this link.
  • ATLAS datasets on LocalGroupDisk more than 2 years old are being deleted starting from June 1st 2014.
  • EGI is now informing sites on a biweekly basis of VOs seeking additional resources. A process has been created for sites to register their resources into a ‘pool’ via the eGrant system. More information is available.
  • The May WLCG availability/reliability figures have been released. A reminder, if you want to request a re-computation you need to submit a GGUS ticket. Specific follow-ups have been requested in an email to TB-SUPPORT on 2nd June.
WLCG Operations Coordination - Agendas

Monday 16th June

  • The next WLCG ops meeting is on Thursday 19th June. The meeting structure is changing to have a dedicated section for T1 and T2s to comment, respond or raise new concerns.

Tuesday 10th June

  • There was a WLCG ops coordination meeting on Thursday 5th June.
  • Middleware: CVMFS updated; FTS3 added; fix for DPM 1.8.8
  • CERN: Grid submissions to the remaining SLC5 resources stop on the 19th of June. LFC decommissioning for Atlas: the daemons have been stopped, and the data is frozen.
  • DM: Update to DPM's gridftp server released, to fix issues encountered with FTS2 transfers.
  • All sites need to upgrade their CVMFS client to version 2.1.19 by August 5th ahead of CERN repository migration to 2.1.X.
  • ALICE: KIT seeing high network load due to continued use of old ROOT versions by users
  • ATLAS: MonteCarlo production and analysis: stable load in the past week
  • CMS: will now ramp up scale of Tier-0 tests on AI and HLT clouds. ARGUS problem - affects glexec. DPM fix for FTS2 issue.
  • LHCb: CVMFS switching over to new stratum infrastructure.
  • Tracking: Next GGUS release 19th July - The automatic creation of tickets through mail will be stopped.
  • FTS3: Discussion on new feature request: multi-destination transfer with automated rerouting.
  • glexec: Down to 10 open tickets. See the tracking page.
  • Machine/job features: SGE implementation now at Imperial.
  • M/w readiness: See the task overview.
  • Multicore: Nothing to report (NTR).
  • SHA-2: CERN VOMS - EMI fix now available. Quick check with RFC proxies failed for ATLAS.
  • WMS: NTR
  • IPv6: NTR
  • HTTP proxy discovery: NTR
  • Network and transfer metrics: Planning to organize a kick-off meeting in July - membership being agreed (so get involved now).
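
For the CVMFS client requirement above (version 2.1.19 by August 5th), a site could check a reported client version with something like the following minimal Python sketch. The function names and the sample package string are hypothetical, and how the version string is obtained on a worker node is left to the site. The point of the sketch is that versions are compared numerically, component by component, because a plain string comparison would rank 2.1.9 above 2.1.19.

```python
import re

# Minimum CVMFS client version quoted in the bulletin.
REQUIRED = (2, 1, 19)


def parse_version(text):
    """Extract the first MAJOR.MINOR.PATCH triple from text as a tuple of ints."""
    match = re.search(r"(\d+)\.(\d+)\.(\d+)", text)
    if match is None:
        raise ValueError("no version number found in %r" % text)
    return tuple(int(part) for part in match.groups())


def meets_requirement(version_text, required=REQUIRED):
    """Return True if the version found in version_text is at least `required`."""
    # Tuples of ints compare element by element, so (2, 1, 9) < (2, 1, 19).
    return parse_version(version_text) >= required


if __name__ == "__main__":
    # Hypothetical inputs: a bare version and an RPM-style package name.
    for sample in ("2.1.9", "2.1.19", "cvmfs-2.1.20-1.el6.x86_64"):
        status = "ok" if meets_requirement(sample) else "needs upgrade"
        print(sample, "->", status)
```

The same helper accepts either a bare version or a longer package string, since it only looks for the first dotted triple.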


Tier-1 - Status Page

Tuesday 10th June

  • Castor and batch services currently down for Castor Nameserver upgrade (to version 2.1.14). If all goes well, the plan is to upgrade the stagers on: CMS - Tue 17th June; LHCb - Thu 19th June; GEN - Tue 24th June; Atlas - Thu 26th June.
  • There was a short break in network connectivity (around 10 minutes) this morning while core site switches were upgraded.
  • We are looking at how to end the FTS2 service, now FTS3 is becoming widely used.
  • The software server used by the small VOs will be withdrawn from service.
Storage & Data Management - Agendas/Minutes

Tuesday 10th June

  • The DPM Collaboration agreement has been updated.

Wed 28 May 2014

  • FTS capabilities - with and without Web interface - interest in more tests
  • Impact of deprecation of lcg-utils - particularly for non-LHC VOs that use LFC. Conversely, started playing with GFAL2 (Sam).
  • Interest in DIRAC tutorial either at hepsysman or next GridPP.


Accounting - UK Grid Metrics HEPSPEC06 Atlas Dashboard HS06

Tuesday 10th June

  • APEL not up-to-date for: Brunel, Sheffield, QMUL, Durham and Sussex? EMI-2 service downtime related in some cases?

Tuesday 20th May

  • Sites with APEL 'delays': IC, Liverpool, Sheffield, Durham, ECDF and Glasgow.

Tuesday 13th May

  • Will review GridPP metrics soon. Trying to get table up-to-date first.
  • No HEPSPEC06 wiki updates showing SL6 results for UCL or RALPP.
  • ATLAS HS06 coefficient for Lancaster 13.9?
  • APEL publishing 'stopped' for Liverpool, ECDF and Glasgow.


Documentation - KeyDocs

See the worst KeyDocs list for documents needing review now and the names of the responsible people.

Tuesday 6th April

  • KeyDocs are going to be reviewed (in next 4 weeks) as the system is not working (or not adding anything) in some areas.

Tuesday 15th April

Tuesday 1st April

  • Keydocs action needed by Jens J; Rob H/Security T; Alessandra F; Wahid B; David C and Matt D.
  • We need to reassign Mark M's documents on Core Grid Services


Tuesday 18th March

  • Keydocs action needed by: Mark M; Jens J; Rob H/Security T; Alessandra F; Wahid B; David C and Matt D.
Interoperation - EGI ops agendas

Tuesday 10th June

  • Next meeting June 16th.


Monitoring - Links MyWLCG

Tuesday 10th June

On-duty - Dashboard ROD rota

Monday 9th June

  • No issues to report.

Tuesday 20th May

  • Quiet week. Created tickets to cover two low availability alarms just now. No UK-wide problems.

  • EMI-3 upgrades still ongoing. EGI following up on status.


Rollout Status WLCG Baseline

Tuesday 18th March

Tuesday 11th February

  • 31st May has been set as the deadline for EMI-2 decommissioning. There may be an issue for dCache (related to 3rd party/enstore component).

References


Security - Incident Procedure Policies Rota

Tuesday 10th June

  • Comments from the workshop last week

Monday 26th May

  • NGI security communications were tested today.



Services - PerfSonar dashboard | GridPP VOMS

- This includes notifying of (inter)national services that will have an outage in the coming weeks or will be impacted by work elsewhere. (Cross-check the Tier-1 update).

  • There will be an outage of the GridPP VOMS server on 11/06/2014 between 10am and 12am BST. The issuing of VOMS proxies will not be affected, but changes to VOs with the VOMS admin web interface "might get lost" during that time.


Tickets

Monday 9th June 2014, 15.00 BST
26 Open Tickets this week.

NGI
https://ggus.eu/index.php?mode=ticket_info&ticket_id=101502 (24/2)
The ILC ticket. Things got a bit muddled but ILC would like to know the state of Durham's CE. My impression is that they're submitting to a now defunct one - could you please let us know what's up? In progress (9/6)

https://ggus.eu/index.php?mode=ticket_info&ticket_id=105989 (4/6)
Technically I think this is a Glasgow ticket - I was going to give this a home there but noticed that the ticket looked solved (it concerned enabling the cern@school cvmfs at Glasgow - which the Glasgow lads had done alongside the other gridpp repos). In progress (can be solved) (6/6)

STOP PRESS
https://ggus.eu/index.php?mode=ticket_info&ticket_id=106057 (9/6)
A ticket from Adam concerning the creation of a new UK Cloud site (UKI-LT2-IC-HEP-Cloud). I'm not sure who this needs to be bounced to (NGI-OPS, Imperial?), it could be that it's all in hand. Assigned (9/6)

SUSSEX
https://ggus.eu/index.php?mode=ticket_info&ticket_id=105937 (2/6)
Sussex got a low availability nagios ticket - Matt RB replied that the trouble is with the EMI3 upgrade and hopes to have dug his way out of that pit shortly. In progress (9/6)

https://ggus.eu/index.php?mode=ticket_info&ticket_id=102810 (28/3)
Sussex's EMI3 upgrade ticket. The deadline has passed, and anything not upgraded is in downtime. How are things going? In progress (2/6)

https://ggus.eu/index.php?mode=ticket_info&ticket_id=105618 (21/5)
Sno+ were/are having cvmfs problems at Sussex. Related to 105989 above, has /cvmfs/snoplus.snolab.ca been replaced by /cvmfs/snoplus.gridpp.ac.uk? (The latter of which I can see at my site). In progress (29/5).

BIRMINGHAM
https://ggus.eu/index.php?mode=ticket_info&ticket_id=106020 (6/6)
Some little lost cern@school jobs at Birmingham, sitting in an odd state. Matt W is having a look, suspecting argus. In progress (6/6)

GLASGOW
https://ggus.eu/index.php?mode=ticket_info&ticket_id=106011 (5/6)
Atlas deletion errors at Glasgow. Sam and the lads suspect a dodgy disk pool, and are working on it. In progress (6/6)

EDINBURGH
https://ggus.eu/index.php?mode=ticket_info&ticket_id=105996 (5/6)
Duncan spotted that the ECDF perfsonar box had fallen over. Andy and Wahid are prodding it with their remote stick. In progress (9/6)

https://ggus.eu/index.php?mode=ticket_info&ticket_id=105839 (28/5)
Glue Validator failures at ECDF. Andy reckons that the CEs are misconfigured, and is digging into the guts of the matter. In progress (3/6)

https://ggus.eu/index.php?mode=ticket_info&ticket_id=95303 (1/7/2013)
My shame, the tarball glexec tickets. Sorry to say nothing to see here again. On hold (27/1)

SHEFFIELD
https://ggus.eu/index.php?mode=ticket_info&ticket_id=105617 (21/5)
A Sno+ cvmfs ticket, similar to the Sussex one (105618). Not much news on it. In progress (21/5)

MANCHESTER
https://ggus.eu/index.php?mode=ticket_info&ticket_id=105922 (2/6)
Manchester are still publishing using the EMI2 apel. The work is scheduled to be done next (this) week. In the meantime, has publishing been turned off? On hold (2/6)

LANCASTER
https://ggus.eu/index.php?mode=ticket_info&ticket_id=105939 (2/6)
Biomed ticketed Lancaster over gridftp not being open on our dpm headnode. After advice from Sam we decided that opening up the firewall ports would be okay, but also told biomed that restricting gfal to just one protocol was a bit silly. Waiting to hear if all's well for them. Waiting for reply (9/6)

https://ggus.eu/index.php?mode=ticket_info&ticket_id=100566 (27/1/2014)
Poor perfsonar bandwidth performance at Lancaster. Following Duncan's advice a downtime has been declared to try a reinstall of the node on Wednesday. In progress (9/6)

https://ggus.eu/index.php?mode=ticket_info&ticket_id=95299 (1/7/2013)
glexec tarball ticket. On hold (4/4)

UCL
https://ggus.eu/index.php?mode=ticket_info&ticket_id=101285 (16/2)
UCL's perfsonar hit a spot of hardware trouble. Disks and RAID controller have been replaced, last word was that the OS was hoped to be reinstalled at the end of April. I suspect then the EMI3 upgrade storm hit. Any news since? On hold (28/4)

https://ggus.eu/index.php?mode=ticket_info&ticket_id=95298 (1/7/2013)
UCL's glexec ticket. At last word waiting on a new staff member to take the reins. On hold (16/4)

EFDA-JET
https://ggus.eu/index.php?mode=ticket_info&ticket_id=97485 (21/9/13)
LHCb problems at JET. The last update was from me in May, saying that I'd ask for help on JET's behalf (which I did... but failed to push on it. Sorry JET). On hold (12/5)

TIER 1
https://ggus.eu/index.php?mode=ticket_info&ticket_id=98249 (21/10/2013)
Sno+ CVMFS ticket. After looking like it was almost done this ticket has become a bit more murky in recent weeks, with talk of desire for an OSG "mirror" which Catalin points out breaks the cvmfs model. I think some more planning in Sno+ and discussion with the experts is needed. Waiting for reply (2/6)

https://ggus.eu/index.php?mode=ticket_info&ticket_id=105405 (14/5)
A Vidyo router firewall ticket. Not really sure it's that interesting to anyone outside the Tier 1 - although there are a lot of Vidyo documentation links that might be useful. Not much news on the ticket for a while. In progress (27/5)

https://ggus.eu/index.php?mode=ticket_info&ticket_id=105571 (20/5)
Mismatch between bdii and srm storage numbers - which has happened before (101310). In progress but no news. In progress (3/6)

https://ggus.eu/index.php?mode=ticket_info&ticket_id=105100 (2/5)
CMS are doing a round of their Storage Consistency Checks. There's been some back and forth between CMS and RAL with clean up being done. Not entirely sure what's the next step for this ticket - it doesn't seem to be a problem yet though. In progress (6/6)


Tools - MyEGI Nagios

Tuesday 20th May

Between May 1st and May 12th, SAM-CENTRAL and the Message Broker Network experienced a set of chained failures that resulted in the loss of a large portion of the metric results published by the SAM NGI instances. The loss of these messages will result in an unusually high number of UNKNOWNs in the May A/R reports, but the actual A/R numbers will not be affected, as UNKNOWNs are not taken into account. No other services have been affected.
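
To illustrate why periods of UNKNOWN status do not feed into the A/R figures, here is a minimal Python sketch. It assumes the commonly quoted WLCG-style definitions (availability is up time over all known time; reliability also leaves out scheduled downtime). The class, field names and hour counts are hypothetical and are not taken from the May reports.

```python
from dataclasses import dataclass


@dataclass
class MonthlyHours:
    """Hours a service spent in each state during one reporting month."""
    up: float
    down: float            # unscheduled downtime
    scheduled_down: float  # declared (scheduled) downtime
    unknown: float         # no test result recorded


def availability(hours):
    """Up time divided by known time; UNKNOWN hours are left out entirely."""
    known = hours.up + hours.down + hours.scheduled_down
    return hours.up / known if known > 0 else None


def reliability(hours):
    """Like availability, but scheduled downtime is also left out."""
    counted = hours.up + hours.down
    return hours.up / counted if counted > 0 else None


if __name__ == "__main__":
    # Same recorded results, with and without a block of UNKNOWN hours.
    normal = MonthlyHours(up=600.0, down=20.0, scheduled_down=24.0, unknown=0.0)
    lossy = MonthlyHours(up=600.0, down=20.0, scheduled_down=24.0, unknown=100.0)
    print(availability(normal) == availability(lossy))
    print(reliability(normal) == reliability(lossy))
```

Under these assumed definitions the UNKNOWN hours never enter either ratio, so the figures depend only on the results that were actually recorded.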

Tuesday 13th May

  • From last week's discussion, DIRAC now supports: NA62, vo.landslides.mossaic.org, t2k.org, snoplus, gridpp, CERN@school and northgrid. NA62 are moving from LFC to DFC and plan to use DIRAC in place of the WMS.


VOs - GridPP VOMS VO IDs Approved VO table

Tuesday 15th April

  • Is there interest in an FTS3 web front end? (more details)

Monday 17 February 2014

  • Proxy renewal
    • All RAL WMSs now renew proxies with 1024 bits. This looks like the end of this (at last).


Tuesday 11 February 2014

  • Proxy renewal
    • lcgwms06 at RAL has been upgraded and works
    • Both Imperial's WMSs work
    • Glasgow's will still need to be upgraded (unless they have been since Friday).
Site Updates

Tuesday 20th May

  • Various sites but notably Oxford have ARGUS problems. 100s of requests seen per minute. Performance issues have been noted after initial installation at RAL, QMUL and others.


Meeting Summaries
Project Management Board - Members Minutes Quarterly Reports

Empty

GridPP ops meeting - Agendas Actions Core Tasks

Empty


RAL Tier-1 Experiment Liaison Meeting (Wednesday 13:30) Agenda Meeting takes place on Vidyo.

Wednesday 11th June 2014

  • Operations report
  • Castor Nameserver 2.1.14-13 updated successfully yesterday (10th June). Stager dates as follows: CMS - Tue 17th June; LHCb - Thu 19th June; GEN - Tue 24th June; Atlas - Thu 26th June.
  • The rollout of CVMFS Client version 2.1.19 has been completed.
  • The Tier1 availabilities for May 2014 were very good. (All 100%).
  • The first SHA-2 host certificate has been successfully installed on arc-ce02.
WLCG Grid Deployment Board - Agendas MB agendas

Empty



NGI UK - Homepage CA

Empty

Events
UK ATLAS - Shifter view News & Links

Empty

UK CMS

Empty

UK LHCb

Empty

UK OTHER
  • N/A
To note

  • N/A