Operations Bulletin 170314

Bulletin archive

Week commencing 10th March 2014
Task Areas
General updates

Monday 10th March

  • EGI recently took part in an EC workshop on "Open Access in H2020: services and support for projects". EGI is starting a pilot distributed open data repository (interoperable with OpenAIRE) that may be of interest.
  • A Vidyo test meeting room is still available for testing linked from this agenda. Headsets are recommended.
  • Cristina Aiftimiei has joined the EGI.eu Operations team. She will continue to be employed by INFN but will work full time in support of EGI Operations.
  • The WLCG Tier-2 availability/reliability reports for February are now available. Please check your site for: ALICE; ATLAS; CMS and LHCb. (Reminder of Alessandra's suggestion to undertake a cloud view as in this FR cloud report.)
  • There was an outage of the GOCDB on 5th March, for which an incident review has taken place. An initial database disk failure, with no automated DNS switch from goc.egi.eu to goc.dl.ac.uk, was complicated by a scheduled downtime for an OS update, after which the primary DNS failed to resolve external addresses and a manual switch back to goc.egi.eu failed.
  • EGI are moving their federated cloud sites to a production service (follow progress).
  • This week there is a pre-GDB meeting on batch systems and a GDB.

Tuesday 4th March

  • A Vidyo test meeting room is available for testing linked from this agenda. Headsets are recommended.
  • No problems uncovered with the GridPP test website.
  • From Monday's WLCG ops: intermittent problems with RAL's virtualization cluster, affecting many services (including FTS3).
  • GGUS update - is there feedback we want to collate?
  • For information: APARSEN-EGI Community Workshop on Managing, Computing and Preserving Big Data for Research will take place next week, 4-6 March.
  • There will be a pre-GDB on batch systems next Tuesday, and a GDB covering various update areas.
  • There was an EGI OMB meeting last Thursday.
  • Steve produced an overview of getting ARGUS working at Liverpool.
  • GOCDB v5.2 was released last week. This release adds an extensibility mechanism which allows Services, ServiceGroups and Sites to be extended using custom key-value pairs (following the GLUE2 extensibility mechanism).
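
As a rough illustration of the key-value extensibility idea mentioned for GOCDB v5.2, the sketch below models an entity carrying custom properties and a simple lookup over them. It is not the GOCDB data model or API: the `Entity` class, `find_by_extension` and all names and values are hypothetical, chosen only to show the shape of the idea.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class Entity:
    """Hypothetical stand-in for a GOCDB Service, ServiceGroup or Site."""
    name: str
    kind: str  # "Service", "ServiceGroup" or "Site"
    extensions: Dict[str, str] = field(default_factory=dict)

    def set_extension(self, key: str, value: str) -> None:
        # In this sketch a custom property is a plain string key-value pair.
        self.extensions[key] = value


def find_by_extension(entities: List[Entity], key: str, value: str) -> List[Entity]:
    """Return the entities whose extension `key` equals `value`."""
    return [e for e in entities if e.extensions.get(key) == value]


# Placeholder names and values, for illustration only.
site = Entity("EXAMPLE-SITE", "Site")
site.set_extension("example_key", "example_value")
service = Entity("example-service", "Service")

matches = find_by_extension([site, service], "example_key", "example_value")
print([e.name for e in matches])
```

The point of such a mechanism is that new attributes can be attached and queried without changing the schema of the entity itself.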
WLCG Operations Coordination - Agendas

Thursday 6th March

  • There was a meeting today (agenda, minutes).
  • Simone Campana was nominated ATLAS Distributed Computing coordinator and will step down as chair of WLCG Operations Coordination.
  • Baselines: WMS fix for 512-bit keys, already applied at CERN.
  • CERN would like to propose a deadline to switch FTS 2 off on the 1st of August.
  • Review of baselines - main update is that fix for the 512-bit keys on WMSes is being applied.
  • dCache is going to extend the security support for 2.2 until Enstore and dCache 2.6 are properly integrated. This will happen by summer.
  • Tests ongoing with Oracle12
  • ALICE: Tier-1/2 workshop in Japan.
  • ATLAS: in the middle of a disk crisis; many of the Tier-1s are almost full of primary data. JEDI is now under test. JEM (Job Evolution Monitor) has been activated for all the production resources. Rucio migration (Rucio as file catalog instead of LFC) is in progress.
  • CMS: Soon starting 13 TeV MC DIGI/RECO. Looking at access to high memory resources and multi-core jobs.
  • LHCb: 2014 spring incremental stripping in full swing, 1/4 of the data has been processed (statistics).
  • FTS3 deployment: Discussed with experiment DM developers how to integrate multiple FTS3 servers with experiment frameworks.
  • glexec deployment: 79 tickets closed and verified, 16 still open
  • Machine job features: no update
  • Middleware readiness: Next meeting Tuesday 2014/03/18 @ 14:30h CET. See twiki.
  • Multicore: Several meetings. Conclusions so far: the systems reviewed are capable of supporting multicore jobs; however, each system needs tuning (draining/reservation of resources) to be able to absorb them when running together with single-core jobs.
  • perfSONAR: perfSONAR 3.3.2 is now baseline. Deadline April 1, 2014: all WLCG sites should have instances deployed, using the mesh configuration, and registered in OIM/GOCDB. Instructions in slides.
  • SHA2: Many new users already registered OK with SHA-2 certificates. Host certs of CERN future VOMS servers are from the new SHA-2 CERN CA.
  • Tracking tools: no update
  • WMS decommissioning: no update
  • xrootd deployment: no update
Tier-1 - Status Page

Tuesday 11th March

  • The Virtual Machine infrastructure is now stable after the recent problems. As requested, ATLAS have temporarily moved their file transfers (for everything except UK transfers) off our FTS3 server.
  • The software server used by the small VOs will be withdrawn from service (aiming for June).
  • A replacement MyProxy server has been put into production (to resolve the MyProxy issues raised in GGUS ticket 97025). This new service is called myproxy.gridpp.rl.ac.uk. Sites and VOs need to make appropriate reconfigurations to use this. We plan to turn the old one (lcgrbp01.gridpp.rl.ac.uk) off at the end of March.
  • ATLAS disk space at the RAL Tier1 is full. One factor in this is a slow deletion rate that is being investigated.
  • We are currently rolling out EMI-3 WN to one batch of WNs.
  • The Tier1 will move to use the new site firewall on Monday 17th March. We will drain and stop FTS (2 & 3) transfers while the change takes place (07:00-08:00) and stop new batch work from starting. Otherwise services will be up for the day but will be At Risk.
Storage & Data Management - Agendas/Minutes

Tuesday 28th January

  • Trying to liaise with DiRAC colleagues to setup a technical meeting.

Monday 9th December

  • Spacetokens for non-LHC VOs - recommendations.

Accounting - UK Grid Metrics HEPSPEC06 Atlas Dashboard HS06

Tuesday 11th February

  • Another review of the HEPSPEC page shows no SL6 (or equivalent) entry for: UCL; Lancaster; Liverpool; Durham; ECDF; Glasgow; Birmingham; RALPP and RAL Tier-1. So... basically no change since last week. Tickets needed?
  • The accounting pages show the following sites as not up-to-date with publishing accounting data: Glasgow (minor); RALPP and Sussex.
  • Under publishing... sites are encouraged to run the Glue2 validator. There were problems observed for RAL-LCG2; UKI-LT2-IC-HEP; UKI-NORTHGRID-MAN-HEP; UKI-NORTHGRID-SHEF-HEP; UKI-SCOTGRID-GLASGOW and UKI-SOUTHGRID-RALPP.

Tuesday 4th February

  • A review of the HEPSPEC page shows no SL6 (or equivalent) entry for: UCL; Lancaster; Liverpool; Durham; ECDF; Glasgow; Birmingham; RALPP and RAL Tier-1.
  • The accounting pages show the following sites as not up-to-date with publishing accounting data: Lancaster (minor); RALPP and Sussex.

Documentation - KeyDocs

See the worst KeyDocs list for documents needing review now and the names of the responsible people.

Tuesday 25th February

  • Keydocs owners need to take some action!

Tuesday 11th February

  • Documents still need attention....

Tuesday 28th January

  • Document assignments were reviewed at a core ops tasks meeting last week... so the dashboard for keydocs should improve soon!
  • It has been noted that blog activity has dropped to a very low level. The blogs have been very useful to disseminate details of what is happening at sites, to share technical solutions and to make GridPP's work more visible. It also helps avoid duplication... so please give some thought to reviving your blogs!
Interoperation - EGI ops agendas

Monday 10th March

  • An EGI operations meeting took place today (agenda).
  • URT recent or future planned releases
    • GridSite 2.2.2 (bug fix) and dCache 2.6.20 for UMD-3
  • SR updates
    • WMS 3.6.3 (today) - EMI3 WN tarball SR flagged
  • New Nagios probes
    • emi-cream-nagios v. 1.1.1 - released with EMI 3 Update 14, released soon in SAM framework
    • org.sam.WN-SoftVer - new probes check the $EMI_TARBALL_BASE/etc/emi-version file
    • WN replication tests in emi-nagios are now distributed by the SAM-team, nagios-plugins-wn-rep
    • The --wn-se-rep option, as well as all the other previous --wn-* options, will no longer be supported by the technology provider (see #88835 and #91683).
    • NGIs are requested to give feedback on how they feel about this option no longer being supported.
  • EMI-2 decommissioning
    • dCache extended the support for the 2.2.x versions until July 2014.
    • List of services failing given Services
    • Alarms to begin on Wednesday, so please check this list for errors ASAP.
  • Cloud probes start raising alarms
    • Four cloud sites have been certified in recent weeks. The sites are currently monitored by cloudmon, but the errors are not raising alarms.
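
The org.sam.WN-SoftVer item above says the new probes check the $EMI_TARBALL_BASE/etc/emi-version file. The following is a minimal sketch of that kind of check, not the actual SAM probe: the function name, the messages and the decision to treat a missing or empty file as critical are all assumptions, and the version string in the demonstration is a placeholder. Only the file path and the standard Nagios plugin exit codes are taken as given.

```python
import os
import tempfile

# Standard Nagios plugin exit codes.
OK, CRITICAL, UNKNOWN = 0, 2, 3


def check_emi_version(env=None):
    """Return (status, message) for a tarball worker node version check.

    Hypothetical logic: the real probe may apply different rules.
    """
    env = os.environ if env is None else env
    base = env.get("EMI_TARBALL_BASE")
    if not base:
        return UNKNOWN, "EMI_TARBALL_BASE is not set"
    path = os.path.join(base, "etc", "emi-version")
    try:
        with open(path) as handle:
            version = handle.read().strip()
    except OSError as exc:
        return CRITICAL, f"cannot read {path}: {exc}"
    if not version:
        return CRITICAL, f"{path} is empty"
    return OK, f"emi-version: {version}"


# Demonstration against a throwaway directory tree with a placeholder version.
with tempfile.TemporaryDirectory() as base:
    os.makedirs(os.path.join(base, "etc"))
    with open(os.path.join(base, "etc", "emi-version"), "w") as handle:
        handle.write("3.0.0\n")
    demo_status, demo_message = check_emi_version({"EMI_TARBALL_BASE": base})

print(demo_status, demo_message)
```

Sites running the tarball WN could use something along these lines as a quick local sanity check before the real probe results appear in Nagios.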

Monitoring - Links MyWLCG

Tuesday 4th March

  • Summary on HC functional tests
  • Overview of feedback

Monday 24th February

  • Next meeting this Friday, agenda looking at HammerCloud Functional tests

Tuesday 21 January

  • Update from meeting on Friday (the 17th); the main item under discussion was the nagios probes (in particular, the Condorg and CREAM-CE).
On-duty - Dashboard ROD rota

Tuesday 10th March

  • Last week was "uneventful". Expect EMI-2 tickets this week.

Monday 3rd March

  • A new dashboard is available for testing.
  • The ROD rota has been extended to April
  • Brunel sub-sites caused a problem leading to EMI-3 APEL alarms which is now fixed.

Rollout Status WLCG Baseline

Tuesday 11th February

  • 31st May has been set as the deadline for EMI-2 decommissioning. There may be an issue for dCache (related to 3rd party/enstore component).


Security - Incident Procedure Policies Rota

Tuesday 5th March

  • Ready for more ARGUS testing
  • SHA-2 looks ready for UK CA switch
  • Looking at technologies

Monday 24th February

  • Let Orlin know if you wish to try connections with the NGI ARGUS server. Tested working for EM including national banning. There is some setup documentation.

Services - PerfSonar dashboard | GridPP VOMS

Monday 10th March

  • A reminder that the perfSONAR documentation is available here.
  • Deadline for 3.3.2 is 1st April.

Tuesday 4th March

  • The full UK perfSONAR view is given on this dashboard.
  • When perfSONAR is performing in a stable fashion the site will appear on the main monitoring page.

Monday 17th February

  • Note that WLCG see perfSONAR as a production service (see page 5 in Ian Bird's talk). The UK dashboard shows work still to be done at: ECDF, RHUL, Sheffield, Brunel and RALPPD.

Tuesday 4th February


Monday 10th March, 13.00 GMT
Only 28 Open UK tickets this week.

NGI
https://ggus.eu/index.php?mode=ticket_info&ticket_id=101502 (24/2)
ILC moving to cvmfs for their software area. As Jeremy mentioned after tomorrow we're going to start chasing sites that support ILC but haven't rolled out these changes. 4 sites have implemented the move and passed muster. A tip from me is to remember to update the software area entry in your CE's info system for ILC as well as on the nodes. In progress (10/3)

https://ggus.eu/index.php?mode=ticket_info&ticket_id=101820 (5/3)
This goc db ticket ended up assigned to the UK. I've punted it in the direction of the GOC DB support unit. Assigned (10/3)

EDINBURGH
https://ggus.eu/index.php?mode=ticket_info&ticket_id=100569 (28/1)
Wahid has got stuck trying to reinstall his perfsonar box, if I'm reading it right the reinstall from the netimage isn't "taking". Has anyone seen this before or have any tips? Waiting for reply (10/3)

GLASGOW
https://ggus.eu/index.php?mode=ticket_info&ticket_id=101565 (26/2)
LHCB wanting MaxCPUTime to be published. Sam has eloquently explained his point about why he doesn't want to set this, I fear that some kind of impasse has been reached, and I'm not sure where to go on this issue. In progress (4/3)

PERFSONAR
https://ggus.eu/index.php?mode=ticket_info&ticket_id=101136 (RALPP)
https://ggus.eu/index.php?mode=ticket_info&ticket_id=100037 (SHEFFIELD)
Any news on upgrading the perfsonar instances at RALPP or SHEFFIELD? Reminder dates on these tickets have passed by a week now.

That's all my addled brain can process I'm afraid, can sites please check the link below (oh, and yippie for GGUS search bringing back ordering by site again):
http://tinyurl.com/p37ey64

Tools - MyEGI Nagios

Tuesday 26th November

  • Regional Nagios has been updated to release 22. This is a gLite to UMD update and it required a fresh installation.
  • There have been some internal changes in SAM-Nagios. Test probes are now the responsibility of the product teams, and some test names have changed as a result of this reorganization. For example, the org.sam.CREAMCE-DirectJobSubmit test has become emi.cream.CREAMCE-DirectJobSubmit. This does not affect operational activities.
  • Could all site admins please look at the services associated with their site and mail Kashif if anything odd is noticed. Site admins can reschedule tests for their sites, and it would be helpful if most functionalities are tested.
  • Also, look at myegi which can be useful with links to the Dashboard, GSTAT, Accounting Portal and GGUS.
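
For anyone with local scripts or reports keyed on the old test names, the rename mentioned above can be handled with a small lookup table. This is only a sketch: the table holds the single rename given in this bulletin, and the helper name `current_test_name` is hypothetical. Any further renames would need to be taken from the SAM-Nagios release information rather than guessed.

```python
# Only this one rename is taken from the bulletin; other entries are
# deliberately not invented here.
RENAMED_TESTS = {
    "org.sam.CREAMCE-DirectJobSubmit": "emi.cream.CREAMCE-DirectJobSubmit",
}


def current_test_name(name: str) -> str:
    """Map an old test name to its new name, leaving unknown names unchanged."""
    return RENAMED_TESTS.get(name, name)


print(current_test_name("org.sam.CREAMCE-DirectJobSubmit"))
```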
VOs - GridPP VOMS VO IDs Approved VO table

Monday 17 February 2014

  • Proxy renewal
    • All RAL WMSs now renew proxies with 1024 bits. This looks like the end of this (at last).

Tuesday 11 February 2014

  • Proxy renewal
    • lcgwms06 at RAL has been upgraded and works
    • Both Imperial's WMSs work
    • Glasgow's will still need to be upgraded (unless they have been since Friday).
Site Updates


Meeting Summaries
Project Management Board - Members Minutes Quarterly Reports


GridPP ops meeting - Agendas Actions Core Tasks


RAL Tier-1 Experiment Liaison Meeting (Wednesday 13:30) Agenda Meeting takes place on Vidyo.

Wednesday 12th March

  • Operations report
  • All LHCb workflows now go through the ARC CEs OK.
  • Network Changes: The move of the Tier1 to the new site firewall will be on Monday 17th March. There will be some interruption to services as seen from outside RAL. Internally, services are expected to continue uninterrupted. Details in GOC DB.
  • The Tier1 will be At Risk for an hour or so during the morning of Wednesday 19th March during a UPS/Generator load test.
  • A new MyProxy server is in production (myproxy.gridpp.rl.ac.uk).
  • Reminder: The software server used by the small VOs will be withdrawn from service (aiming for June).
WLCG Grid Deployment Board - Agendas MB agendas


NGI UK - Homepage CA



4th February 2014 With reference to the OMB on 30th January.

- UMD-2 will be decommissioned in the coming months. 30th April end of security support. 31st May all services to have been removed or upgraded.

- UK sites failing Glue2:


Check with the glue2 validator to see errors.

- A new Operations Dashboard will be in pre-production during February and moved into production in March.

- Availability/reliability targets for EGI are moving to 80%/85%.

- Some new ARC SAM tests are being introduced: org.nordugrid.ARC-CE-LFC-result; org.nordugrid.ARC-CE-LFC-submit; org.nordugrid.ARC-CE-SRM-result; org.nordugrid.ARC-CE-SRM-submit; org.nordugrid.ARC-CE-submit

- There is a summary page on how to publish from various middleware types: https://wiki.egi.eu/wiki/MAN09

- There was an overview of the French NGI adoption of iRODS.

- FedCloud to production: Management (GOCDB - ok); Monitoring (in progress); Accounting (in progress); Documentation (ok); Support (ok); Dashboard (in progress) and Security (in progress).

- The SAM tests are: org.nagios.CloudBDII-Check; eu.egi.cloud.OCCI-VM and org.nagios.OCCI-TCP. Security checks not yet available.
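
To see how a site's monthly figures compare with the 80%/85% availability/reliability targets noted above, the arithmetic can be sketched as below. The formulas are an assumption based on the commonly used definitions (availability excludes time in an unknown state; reliability additionally excludes scheduled downtime) and should be checked against the official EGI documentation. The function names and the example hours are illustrative only.

```python
def availability(up_hours, total_hours, unknown_hours=0.0):
    """Fraction of the known time for which the site was up (assumed definition)."""
    known = total_hours - unknown_hours
    if known <= 0:
        raise ValueError("no known time in the period")
    return up_hours / known


def reliability(up_hours, total_hours, scheduled_down_hours=0.0, unknown_hours=0.0):
    """As availability, but scheduled downtime is not counted against the site."""
    expected = total_hours - scheduled_down_hours - unknown_hours
    if expected <= 0:
        raise ValueError("no expected uptime in the period")
    return up_hours / expected


# Targets quoted in the bulletin.
AVAILABILITY_TARGET = 0.80
RELIABILITY_TARGET = 0.85

# Illustrative figures only: a 720-hour month, 600 hours up,
# 48 hours of scheduled downtime, no unknown time.
avail = availability(600, 720)
rel = reliability(600, 720, scheduled_down_hours=48)
meets_targets = avail >= AVAILABILITY_TARGET and rel >= RELIABILITY_TARGET
print(f"availability={avail:.3f} reliability={rel:.3f} meets_targets={meets_targets}")
```

Under these assumed definitions, declaring downtime in advance protects the reliability figure but not the availability figure.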

UK ATLAS - Shifter view News & Links






  • N/A
To note

  • N/A