Operations Bulletin 210412

From GridPP Wiki

Bulletin archive


Week commencing 14th April 2014
Task Areas
General updates

Tuesday 15th April

  • Summary notes from April's GDB are available. The actions have also been updated.
  • A new GOCDB role has been requested for the use case where a user's DN is to be associated with the site, allowing other systems (nagios for example) to read the list of user DNs that are linked to that Site and take subsequent authorisation decisions.
  • Advance notice has been given of an extended GOCDB service OUTAGE from 07:00 to 14:00 (BST) on 29th April.
  • A reminder: The WLCG T2 March availability/reliability figures were made available two weeks ago (ALICE, ATLAS, CMS, and LHCb). Please could sites below the 90% targets write with details of issues encountered. The EGI availability/reliability figures for March are available.


Tuesday 8th April

  • The April GDB agenda is now available. Duncan is the T2 rep this month.
  • A reminder that the next WLCG workshop will be 7th-9th July in Barcelona. If you would like to present at the event please inform Jeremy.
  • There is a HEPiX IPv6 F2F at CERN this Thursday (agenda). Duncan and Chris are registered.
  • The next LHCONE-LHCOPN meeting is on 28th and 29th April. (agenda)
  • EGI is planning a conference on solutions and challenges for big data processing to take place 24th-26th September.
  • EGI is developing a cloud-related H2020 project proposal aiming at delivering a new generation intercloud testbed (36 month duration).
  • A reminder that the old EMI-2 MyProxy server at RAL (lcgrbp01.gridpp.rl.ac.uk) was decommissioned last week.
  • The WLCG T2 March availability/reliability figures were made available last week (ALICE, ATLAS, CMS, and LHCb). Please could sites below the 90% targets write with details of issues encountered.

Tuesday 1st April

WLCG Operations Coordination - Agendas

Tuesday 15th April

Tuesday 8th April

  • Registration for the next WLCG workshop opens this week.
  • WLCG baselines (https://twiki.cern.ch/twiki/bin/view/LCG/WLCGBaselineVersions) have been updated. gLiteWMS to be checked.
  • Various Tier-0/1 storage updates - see table in minutes
  • Various Oracle updates have been completed at CERN.
  • Job efficiency report Meyrin vs Wigner being compiled.
  • Some delays in the use of VOMS-admin, due to some bugs to be fixed and some features that need further understanding/changing (because their behaviour differs from that of VOMRS).
  • CERN batch capacity migrated to SLC6 was at 65% last week.
  • ALICE: Steady activities in preparation for Quark Matter 2014 (May 19-24, GSI Darmstadt)
  • ATLAS: Rucio commissioning: commissioning of the various Rucio services started in the last few days. Data transfer issues: a few links with "slow transfers" (order of 0.5MB/s) were observed, including 3 UK sites. An issue was observed with the CVMFS cache: an ATLAS file is 2.2GB and the default shared cache was set to 2GB.
  • CMS: DBS2 will be switched off April 7th. CVMFS switch at CERN: Monday, April 14th.
  • LHCb: Incremental stripping campaign almost finished. Future VOMS2 server added to the VO card.
  • Tools: GGUS new version released on 26 March: multiple site notification, CMS specific SU and forms.
  • FTS3: New version deployed as pilot - in production in 2-3 weeks if no issues.
  • glexec: 79 tickets closed and verified, 16 still open (no change)
  • Machine/JF: detailed plan for bare metal, cloud, client and bi-directional developments has been discussed and agreed within the TF
  • Middleware readiness: process agreed. Volunteer sites to be agreed by 15th April.
  • Multi-core: Various reviews done. Next: review the experience at sites shared by CMS and ATLAS when handling multicore jobs from both VOs.
  • perfSONAR: Deadline for perfSONAR installation has passed (April 1st). 9 sites missing out of 111. No UK sites listed - thank you! But some firewall issues to resolve.
  • SHA-2: EGI Operations Portal VO cards for the experiments have been updated with the details of the future VOMS servers
  • WMS decommissioning: CERN WMS instances for experiments are being drained as of 13:53 CEST on April 1
  • xrootd: no update
  • IPv6: Some new test sites. Panda Dev instances are being made dual stack.
  • http proxy discovery: no update.
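The ATLAS CVMFS cache issue reported above comes down to a size comparison: a single 2.2GB file cannot fit in a 2GB shared cache. A minimal sketch of that check follows, assuming sizes are expressed in megabytes (the unit used by the CVMFS client's CVMFS_QUOTA_LIMIT setting) and using decimal units. The function name and the constants are illustrative, not part of any CVMFS tooling.

```python
def file_fits_in_cache(file_size_mb: float, quota_limit_mb: float) -> bool:
    """Return True if a single file of the given size fits within the cache quota."""
    return file_size_mb <= quota_limit_mb


# Figures from the report, converted to MB with decimal units (an assumption).
ATLAS_FILE_MB = 2200      # the 2.2GB ATLAS file
DEFAULT_QUOTA_MB = 2000   # the 2GB default shared cache

if not file_fits_in_cache(ATLAS_FILE_MB, DEFAULT_QUOTA_MB):
    print("File is larger than the cache quota; the quota would need raising.")
```

A site could run the same comparison against whatever quota it has configured locally before deciding whether the limit needs to be increased.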

Tuesday 1st April

Tier-1 - Status Page

Tuesday 15th April

  • The alias for the RAL CVMFS Stratum 1 was updated this morning (09:30) to point to the new CVMFS Stratum 1 server running version 2.1. No change is needed by sysadmins - just flagging this up.
  • The software server used by the small VOs will be withdrawn from service (aiming for June).
  • Tier1 Outage announced for Tuesday 29th April for the upgrade of the Tier1 network's link into the RAL site core network.
Storage & Data Management - Agendas/Minutes

Wednesday 2nd April 2014

  • All metrics green for the past quarter!
  • Performance issues being pursued - Brian is testing/coordinating
  • Report from GridPP32: "big" VOs, "small" VOs. See blog.
  • Report from ISGC2014: dCache, DIRAC, new countries. See blog.

Tuesday 18th March

  • Chris noticed some of Steve's tests failing. At IC this related to a full spacetoken. Bristol is not working as there is no SCRATCHDISK spacetoken. Durham fails with a "no space left on device" error message.

March 2014

  • How would we move data between DiRAC and GridPP?


Accounting - UK Grid Metrics HEPSPEC06 Atlas Dashboard HS06

Tuesday 15th April

  • The APEL accounting system has been undergoing database maintenance to improve performance and reliability. Networking problems at the RAL site have delayed completion of the operation. Sites may see nagios alerts warning them that they have not published accounting data for 7 days - these will stop after the maintenance work completes.

Tuesday 1st April

Tuesday 18th March

  • A review of the HEPSPEC page reveals that most sites now have SL6 entries, the exceptions being UCL-HEP, Manchester and RALPP. We will be checking HS06 figures in the 2014 Q1 quarterly reports so the deadline for action is 31st March 2014.
  • APEL publishing appears up-to-date for all sites.
Documentation - KeyDocs

See the worst KeyDocs list for documents needing review now and the names of the responsible people.

Tuesday 15th April

Tuesday 1st April

  • Keydocs action needed by Jens J; Rob H/Security T; Alessandra F; Wahid B; David C and Matt D.
  • We need to reassign Mark M's documents on Core Grid Services


Tuesday 18th March

  • Keydocs action needed by: Mark M; Jens J; Rob H/Security T; Alessandra F; Wahid B; David C and Matt D.
Interoperation - EGI ops agendas

Tuesday 15th April

Tuesday 8th April

  • Meeting yesterday (Agenda: https://wiki.egi.eu/wiki/Agenda-07-04-2014)
    • URT
      • ARC - 13.11u1 version 4.1.0, UMD-3 dcache-server 2.6.23, BDII core - new glue-validator, DPM/LFC - v. 1.8.8, GFAL/lcg_utils - v. 2.5.5, FTS3 - v. 3.1.74, GridSite - v. 2.2.3, CANL - v. 2.1.4, WMS v. 3.6.4
    • UMD release:
      • lcg-CA 1.56 out on April 2, alarms imminent (though on checking UK sites most are updated)
      • UMD 3.6.0 ready for release: wms v. 3.6.3, cream_torque v. 2.1.3, dpm-yaim v. 1.8.7, gridsite v. 2.2.2, glexec-wn v. 1.2.2
      • New in UMD: gfal2 v. 2.4.8, slurm-wn v. 1.0.0, cream-slurm v. 1.0.1, gridsafe v. 1.3.1
    • SR:
      • wms v. 3.6.4
      • UMD-3 campaign, checking for updates on early adopter sites listed for UMD-3 https://www.egi.eu/earlyAdopters/table
      • Some UK sites listed as not replying and still listed under UMD1/2
    • EMI-2 decommissioning



Monitoring - Links MyWLCG

Tuesday 15th April

On-duty - Dashboard ROD rota

Tuesday 15th April

  • A new availability poll for ROD members has been circulated - please complete it this week!
  • Gareth has been closing GLUE2 validation warnings. They seem to come back....
  • Gareth submitted notepad entries for the GLUE2 validation warnings. (Tickets for criticals).
  • There are several ongoing issues with the new dashboard.


Monday 7th April

  • New dashboard in use but has some problems - such as with handover functionality.
  • There are a lot of glue2 validator warnings (not errors) on the dashboard that should not be appearing as they are not to be pursued with tickets.
Rollout Status WLCG Baseline

Tuesday 18th March

Tuesday 11th February

  • 31st May has been set as the deadline for EMI-2 decommissioning. There may be an issue for dCache (related to 3rd party/enstore component).

References


Security - Incident Procedure Policies Rota

Tuesday 15th April

  • Update on the OpenSSL status.
  • The discussion list members have been updated. Anyone missing?

Monday 7th April

  • There was an EGI SVG Advisory 'High' RISK - Vulnerability announced
  • The security team meeting last week did not take place.
  • Linda has produced a new rota - only 2 people had responded to the poll so please check if you are in the team!



Services - PerfSonar dashboard | GridPP VOMS

Tuesday 15th April

  • New LiveCD and LiveUSB images are now available containing the latest openssl packages (see email of 11th April).

Tuesday 8th April

  • Some discrepancies found in VOMS ports and listings between VOMSsnooper and the dashboard for ops (15009 vs 15002).
  • Also noted WLCG VOMS changes. New VOMS servers are being introduced as notified in this broadcast.
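A discrepancy like the ops port mismatch above can be found by comparing two VO-to-port listings. The sketch below is only illustrative: the dictionaries stand in for data that would come from VOMSsnooper and the dashboard, the function name is hypothetical, and which source reported which port is an arbitrary assumption, since the bulletin does not say.

```python
def find_port_mismatches(listing_a: dict, listing_b: dict) -> dict:
    """Return {vo: (port_a, port_b)} for VOs present in both listings with differing ports."""
    mismatches = {}
    for vo in sorted(set(listing_a) & set(listing_b)):
        if listing_a[vo] != listing_b[vo]:
            mismatches[vo] = (listing_a[vo], listing_b[vo])
    return mismatches


# Hypothetical inputs; the assignment of ports to sources is assumed.
snooper_ports = {"ops": 15009}
dashboard_ports = {"ops": 15002}

for vo, (port_a, port_b) in find_port_mismatches(snooper_ports, dashboard_ports).items():
    print(f"{vo}: {port_a} vs {port_b}")
```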
Tickets

Monday 14th April 2014, 15.30 BST
No ticket update from Matt next week.

33 Open UK tickets today.

NGI (No Geezers In-particular in this case)
https://ggus.eu/index.php?mode=ticket_info&ticket_id=101502 (24/2)
ILC cvmfs ticket. No change since last week really; after tomorrow's meeting I'll put this ticket on hold until I'm back next week. In progress (3/4)

https://ggus.eu/index.php?mode=ticket_info&ticket_id=103043 (7/4)
Tom's ticket requesting cern@school access to the IC Dirac server. It's all done, the ticket just needs closing (and whilst I'm happy to stick my nose into tickets I won't close or reopen them). Assigned(!) (7/4)

https://ggus.eu/index.php?mode=ticket_info&ticket_id=103197 (9/4)
Chris W has spotted several instances where the old myproxy server shows up in the online documentation. Andrew has tried to edit https://www.gridpp.ac.uk/deployment/users/myproxy.html but can't get access - Daniela suggested asking the hosting site but maybe Tom has access? Waiting for Reply (9/4)

https://ggus.eu/index.php?mode=ticket_info&ticket_id=98249 (21/10/2013)
The Sno+ CVMFS ticket. Could some of the progress mentioned last week please be put into the ticket? In progress (26/3)

QMUL
https://ggus.eu/index.php?mode=ticket_info&ticket_id=103028 (6/4)
Chris ran these atlas job failures down and discovered they were due to the jobs going over their memory quotas. What I didn't like the look of was that it was the jobs themselves requesting these amounts of memory. Atlas says it can be solved, but it's something to watch out for. In progress (11/4)

GLASGOW
https://ggus.eu/index.php?mode=ticket_info&ticket_id=101565 (26/2)
As mentioned last week, LHCB have got back to Glasgow deciding that MaxCPUTime needs to be set to something; Sam respectfully maintains his stance. Steve B links an interesting ticket to the cream devs: https://ggus.eu/index.php?mode=ticket_info&ticket_id=97721 On Hold (8/4)

"EMI UPGRADE" tickets.

TIER 1
https://ggus.eu/index.php?mode=ticket_info&ticket_id=102611
Kashif points out that the NGI argus isn't in the site bdii, which is the probable cause of the test failures. The other two problem servers are due to be decommissioned, so all good here. In progress (14/4)

DURHAM
https://ggus.eu/index.php?mode=ticket_info&ticket_id=103722 (14/4)
A very fresh alarm ticket for Durham's CE and SE. Sorry you guys have to do this dance again! Assigned (14/4)

EDINBURGH
https://ggus.eu/index.php?mode=ticket_info&ticket_id=102201 (14/3)
Andy notes that the links to the alarms given in the ticket appear to be broken. How goes the upgrade in general? On Hold (7/4)

RHUL
https://ggus.eu/index.php?mode=ticket_info&ticket_id=102189 (14/3)
I think RHUL just has some CEs to upgrade, have you done the site BDII? The list of services that need to be upgraded isn't exhaustive. On hold (21/3)

SUSSEX
https://ggus.eu/index.php?mode=ticket_info&ticket_id=102810 (28/3)
You guys put in a good plan, did it survive contact with the enemy? In progress (1/4)

GLASGOW
https://ggus.eu/index.php?mode=ticket_info&ticket_id=102202 (14/3)
The Glasgow list of services to upgrade was long, but that's just a reflection of how much stuff they run. Gareth gave a good update last week, so there's naught to worry about here (hopefully I didn't just curse you...). In Progress (8/4)

BRISTOL
https://ggus.eu/index.php?mode=ticket_info&ticket_id=102205 (14/3)
Winnie sounded confident that the upgrade will be done by the end of April (and we aren't halfway through the month yet). In progress (4/4)

UCL
https://ggus.eu/index.php?mode=ticket_info&ticket_id=102193 (14/3)
Ben set a reminder date for the 31st of March, no news since then. On hold (14/3)

EFDA-JET
https://ggus.eu/index.php?mode=ticket_info&ticket_id=102166 (14/3)
It's just the Jet DPM that looks like it needs upgrading. If they've kept it up to date then this upgrade is trivial. Hope to be done by the end of April. On hold (24/3)
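Each ticket above is referenced by a GGUS URL carrying a ticket_id query parameter. As a rough sketch, the numeric ID can be pulled out with the Python standard library; the helper name is hypothetical and the code assumes the URL form used in this bulletin.

```python
from urllib.parse import parse_qs, urlparse


def ggus_ticket_id(url: str) -> int:
    """Extract the numeric ticket_id from a GGUS ticket URL."""
    query = parse_qs(urlparse(url).query)
    # parse_qs returns a list of values per key; take the first ticket_id.
    return int(query["ticket_id"][0])


print(ggus_ticket_id("https://ggus.eu/index.php?mode=ticket_info&ticket_id=101502"))
```

A URL without a ticket_id parameter raises KeyError, which is probably the right behaviour for a list that should only contain ticket links.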


Tools - MyEGI Nagios

Monday 17th March

Tuesday 26th November

  • Regional Nagios updated to release 22. It is a glite to UMD update and it required a fresh installation.
  • There have been some internal changes in SAM-Nagios. Test probes are now the responsibility of the product teams. Some test names have been changed as a result of this reorganization. For example the org.sam.CREAMCE-DirectJobSubmit test has become emi.cream.CREAMCE-DirectJobSubmit. This does not affect the operational activities.
  • Please could all site admins look at services associated to their site and please mail Kashif if anything odd is noticed. Site admins can reschedule tests for their sites and it would be helpful if most functionalities are tested.
  • Also, look at myegi which can be useful with links to the Dashboard, GSTAT, Accounting Portal and GGUS.
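Anyone with scripts or notes that refer to the old SAM test names could translate them with a small lookup table. The sketch below contains only the one rename given in this bulletin; the table and function names are hypothetical, and other renamed tests would need to be added from the SAM-Nagios release notes.

```python
# Old test name -> new test name. Only the example from the bulletin is listed.
RENAMED_TESTS = {
    "org.sam.CREAMCE-DirectJobSubmit": "emi.cream.CREAMCE-DirectJobSubmit",
}


def current_test_name(name: str) -> str:
    """Return the new name for a renamed test, or the name unchanged if it is not listed."""
    return RENAMED_TESTS.get(name, name)


print(current_test_name("org.sam.CREAMCE-DirectJobSubmit"))
```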
VOs - GridPP VOMS VO IDs Approved VO table

Tuesday 15th April

  • Is there interest in an FTS3 web front end? (more details)

Monday 17 February 2014

  • Proxy renewal
    • All RAL WMSs now renew proxies with 1024 bits. This looks like the end of this (at last).


Tuesday 11 February 2014

  • Proxy renewal
    • lcgwms06 at RAL has been upgraded and works
    • Both Imperial's WMSs work
    • Glasgow's will still need to be upgraded (unless they have been since Friday).
Site Updates

Tuesday 8th April

  • Steve noted that Liverpool are having a problem with the CVMFS clients on their worker nodes. "...in short, VO/CVMFS admin for na62 and mice are publishing stale .cvmfswhitelist and repos cannot be mounted on new systems. I expect this to spread to other systems and VOs as local cache dates expire."


Meeting Summaries
Project Management Board - Members Minutes Quarterly Reports

Empty

GridPP ops meeting - Agendas Actions Core Tasks

Empty


RAL Tier-1 Experiment Liaison Meeting (Wednesday 13:30) Agenda Meeting takes place on Vidyo.

Wednesday 16th April 2014

  • Operations report
  • The intervention to update the Tier1's network connection into the RAL site network has been announced (in the GOC DB) for Tuesday 29th April.
  • There will be a UPS/Generator load test on Wednesday 30th April. ('Warning' announced in GOC DB).
  • The automatic failover of the OPN Primary link to the backup has not worked during recent breaks. This is being followed up.
  • Reminder: The software server used by the small VOs will be withdrawn from service (aiming for June).
WLCG Grid Deployment Board - Agendas MB agendas

Empty



NGI UK - Homepage CA

Empty

Events

4th February 2014 With reference to the OMB on 30th January.

- UMD-2 will be decommissioned in the coming months. 30th April end of security support. 31st May all services to have been removed or upgraded.

- UK sites failing Glue2:

RAL-LCG2 UKI-LT2-IC-HEP UKI-NORTHGRID-MAN-HEP UKI-NORTHGRID-SHEF-HEP UKI-SCOTGRID-GLASGOW UKI-SOUTHGRID-RALPP

Check with the glue2 validator to see errors.

- A new Operations Dashboard will be in pre-production during February and moved into production in March.

- Availability/reliability targets for EGI are moving to 80%/85%.

- Some new ARC SAM tests are being introduced: org.nordugrid.ARC-CE-LFC-result; org.nordugrid.ARC-CE-LFC-submit; org.nordugrid.ARC-CE-SRM-result; org.nordugrid.ARC-CE-SRM-submit; org.nordugrid.ARC-CE-submit

- There is a summary page on how to publish from various middleware types: https://wiki.egi.eu/wiki/MAN09

- There was an overview of the French NGI adoption of iRODS.

- FedCloud to production: Management (GOCDB - ok); Monitoring (in progress); Accounting (in progress); Documentation (ok); Support (ok); Dashboard (in progress) and Security (in progress). The SAM tests are: org.nagios.CloudBDII-Check; eu.egi.cloud.OCCI-VM and org.nagios.OCCI-TCP. Security checks not yet available.

UK ATLAS - Shifter view News & Links

Empty

UK CMS

Empty

UK LHCb

Empty

UK OTHER
  • N/A
To note

  • N/A