Operations Bulletin 140414

Bulletin archive

Week commencing 7th April 2014
Task Areas
General updates

Tuesday 8th April

  • The April GDB agenda is now available. Duncan is the T2 rep this month.
  • A reminder that the next WLCG workshop will be 7th-9th July in Barcelona. If you would like to present at the event please inform Jeremy.
  • There is a HEPiX IPv6 F2F at CERN this Thursday (agenda). Duncan and Chris are registered.
  • The next LHCONE-LHCOPN meeting is on 28th and 29th April. (agenda)
  • EGI is planning a conference on solutions and challenges for big data processing to take place 24th-26th September.
  • EGI is developing a cloud-related H2020 project proposal aiming at delivering a new generation intercloud testbed (36 month duration).
  • A reminder that the old EMI-2 MyProxy server at RAL (lcgrbp01.gridpp.rl.ac.uk) was decommissioned last week.
  • The WLCG T2 March availability/reliability figures were made available last week. Please could sites below the 90% targets write with details of issues encountered. Reports: ALICE, ATLAS, CMS, and LHCb.

Tuesday 1st April

Monday 24th March

  • There is no ops meeting this week due to GridPP32 in Pitlochry.

Tuesday 18th March

  • The GridPP website was upgraded over the weekend (it is now SHA-2 ready). Please inform Andrew if you encounter any problems with it.
  • Last week there was a pre-GDB on batch systems and a GDB. The March GDB meeting summary is now available. The GDB actions list has been updated.
  • There was a GridPP Strategic Review yesterday. Recommendations will be shared with the PMB and CB in due course.
  • The CERN VOMS service will move to new hosts whose host certificates are signed by the new (SHA-2) CERN CA during 2014. First, VOMS-aware services in WLCG need to be made aware of these hosts. See the EGI broadcast message, which indicates a timeline of 6th May for services to be updated.
  • The GridPP IPv6 site status table has been updated to provide an 'allocation' column. This follows discussion of allocation strategies across sites. This IETF document may be of interest.
  • EGI invites entries to win a funded trip to the community forum in May.
  • The next WLCG middleware readiness WG meeting takes place this afternoon at 13:30 UK time. There are pre-meeting updates in the wiki.
  • Alarms (and then tickets) are now being raised against EMI-2 services at sites. Please respond to the tickets quickly and indicate your plans for removing the EMI-2 based node - replies are expected within 10 working days. If you believe alarms/tickets result from false-positives then please indicate this in the ticket... we are aware of some examples already.
  • France is deploying an EGI-wide Dirac instance for 'other communities' and EGI is considering including this as part of an H2020 proposal.
  • CVMFS v2.1.17 was recently released. Ian reported that RAL T1 had been using it stably for several weeks.
  • The EGI availability/reliability figures for February have been added to the reports wiki page. The UK services show 100%. Ops for NGI_UK is 98%:98%. No sites were below the EGI 70% targets.
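For reference, availability:reliability pairs such as the 98%:98% quoted above are usually derived from the time a service spends in each monitored state. A minimal sketch, assuming the commonly used definitions (availability is up time over known time; reliability additionally excludes scheduled downtime). The function names and the sample hours are hypothetical and are not taken from the EGI or WLCG reports:

```python
def availability(up_hours, total_hours, unknown_hours=0.0):
    """Fraction of known time the service was up."""
    known = total_hours - unknown_hours
    return up_hours / known if known > 0 else 0.0


def reliability(up_hours, total_hours, scheduled_down_hours=0.0, unknown_hours=0.0):
    """As availability, but scheduled downtime is not counted against the site."""
    denom = total_hours - scheduled_down_hours - unknown_hours
    return up_hours / denom if denom > 0 else 0.0


def below_target(value, target):
    """True if a figure misses its target (e.g. 0.90 for WLCG T2, 0.70 for EGI)."""
    return value < target


# Hypothetical month: 672 h in total, 650 h up, 12 h of scheduled downtime.
avail = availability(650, 672)
rel = reliability(650, 672, scheduled_down_hours=12)
print(f"availability {avail:.1%}, reliability {rel:.1%}")
print("below 90% target:", below_target(avail, 0.90))
```

Because scheduled downtime is removed from the denominator, reliability is never lower than availability for the same period.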

WLCG Operations Coordination - Agendas

Tuesday 8th April

  • Registration for the next WLCG workshop opens this week.
  • WLCG baselines (https://twiki.cern.ch/twiki/bin/view/LCG/WLCGBaselineVersions) have been updated. gLiteWMS to be checked.
  • Various Tier-0/1 storage updates - see table in minutes
  • Various Oracle updates have been completed at CERN.
  • Job efficiency report Meyrin vs Wigner being compiled.
  • Some delays in the use of VOMS-admin, due to some bugs to be fixed and some features that need further understanding/changing (because their behaviour differs from VOMRS).
  • CERN batch capacity migrated to SLC6 was at 65% last week.
  • ALICE: Steady activities in preparation for Quark Matter 2014 (May 19-24, GSI Darmstadt)
  • ATLAS: Rucio commissioning: commissioning of the various Rucio services started in the last few days. Data transfer issues: a few links observed with "slow transfers" (of order 0.5MB/s), including 3 UK sites. Observed issue with CVMFS cache: an ATLAS file is 2.2GB and the default shared cache was set to 2GB.
  • CMS: DBS2 will be switched off April 7th. CVMFS switch at CERN: Monday, April 14th.
  • LHCb: Incremental stripping campaign almost finished. Future VOMS2 server added to the VO card.
  • Tools: GGUS new version released on 26 March: multiple site notification, CMS specific SU and forms.
  • FTS3: New version deployed as pilot - in production in 2-3 weeks if no issues.
  • glexec: 79 tickets closed and verified, 16 still open (no change)
  • Machine/JF: detailed plan for bare metal, cloud, client and bi-directional developments has been discussed and agreed within the TF
  • Middleware readiness: process agreed. Volunteer sites to be agreed by 15th April.
  • Multi-core: Various reviews done. Next: review experience at sites shared by CMS and ATLAS when handling multicore jobs from both VOs.
  • perfSONAR: Deadline for perfSONAR installation has passed (April 1st). 9 sites missing out of 111. No UK sites listed - thank you! But some firewall issues to resolve.
  • SHA-2: EGI Operations Portal VO cards for the experiments have been updated with the details of the future VOMS servers
  • WMS decommissioning: CERN WMS instances for experiments are being drained as of 13:53 CEST on April 1
  • xrootd: no update
  • IPv6: Some new test sites. Panda Dev instances are being made dual stack.
  • http proxy discovery: no update.
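The CVMFS cache issue reported by ATLAS above comes down to simple arithmetic: a single 2.2GB file cannot fit in a 2GB cache. A minimal sketch of such a check, assuming the client quota is set through `CVMFS_QUOTA_LIMIT` (in megabytes) in a configuration file such as `/etc/cvmfs/default.local`. The helper names and the sample configuration text are hypothetical, and the GB-to-MB conversion uses decimal units as an approximation:

```python
import re


def parse_quota_mb(config_text):
    """Return the CVMFS_QUOTA_LIMIT value (in MB) found in config text, or None."""
    match = re.search(
        r"^\s*CVMFS_QUOTA_LIMIT\s*=\s*['\"]?(\d+)['\"]?",
        config_text,
        re.MULTILINE,
    )
    return int(match.group(1)) if match else None


def fits_in_cache(file_size_gb, quota_mb):
    """True if one file of the given size fits within the cache quota."""
    return file_size_gb * 1000 <= quota_mb


# Hypothetical configuration text; a real site would read its own file.
sample_config = "CVMFS_REPOSITORIES=atlas.cern.ch\nCVMFS_QUOTA_LIMIT=2000\n"
quota = parse_quota_mb(sample_config)
print("quota (MB):", quota)
print("2.2GB file fits:", fits_in_cache(2.2, quota))
```

In practice the quota needs headroom well beyond the largest single file, since the cache is shared by all jobs on the node.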

Tuesday 1st April

Tier-1 - Status Page

Tuesday 8th April

  • We will move the alias for the RAL CVMFS Stratum 1 to point to the new CVMFS Stratum 1 server running version 2.1 next week. We will send a formal announcement of the time shortly. No change is needed by sysadmins - just flagging this up.
  • The software server used by the small VOs will be withdrawn from service (aiming for June).
  • Old MyProxy server (lcgrbp01.gridpp.rl.ac.uk) turned off last week (2nd April). Replaced by myproxy.gridpp.rl.ac.uk.
  • EMI-3 WN roll out completed and all systems now using the EMI-3 Argus server.
  • Deployments of new disk servers have been continuing.
  • Tier1 Outage announced for Tuesday 29th April for the upgrade of the Tier1 network's link into the RAL site core network.
Storage & Data Management - Agendas/Minutes

Wednesday 2nd April

  • All metrics green for the past quarter!
  • Performance issues being pursued - Brian is testing/coordinating
  • Report from GridPP32: "big" VOs, "small" VOs. See blog.
  • Report from ISGC2014: dCache, DIRAC, new countries. See blog.

Tuesday 18th March

  • Chris noticed some of Steve's tests failing. At IC this related to a full spacetoken. Bristol is not working as there is no SCRATCHDISK spacetoken. Durham fails with a no space left on device error message.

March 2014

  • How would we move data between DiRAC and GridPP?

Accounting - UK Grid Metrics HEPSPEC06 Atlas Dashboard HS06

Tuesday 1st April

Tuesday 18th March

  • A review of the HEPSPEC page reveals that most sites now have SL6 entries with the exceptions being: UCL-HEP; Manchester and RALPP. We will be checking HS06 figures in the 2014 Q1 quarterly reports so the deadline for action is 31st March 2014.
  • APEL publishing appears up-to-date for all sites.

Tuesday 11th February

  • Another review of the HEPSPEC page shows no SL6 (or equivalent) entry for: UCL; Lancaster; Liverpool; Durham; ECDF; Glasgow; Birmingham; RALPP and RAL Tier-1. So... basically no change since last week. Tickets needed?
  • The accounting pages show the following sites as not up-to-date with publishing accounting data: Glasgow (minor); RALPP and Sussex.
  • Under publishing... sites are encouraged to run the Glue2 validator. There were problems observed for RAL-LCG2; UKI-LT2-IC-HEP; UKI-NORTHGRID-MAN-HEP; UKI-NORTHGRID-SHEF-HEP; UKI-SCOTGRID-GLASGOW and UKI-SOUTHGRID-RALPP.

Tuesday 4th February

  • A review of the HEPSPEC page shows no SL6 (or equivalent) entry for: UCL; Lancaster; Liverpool; Durham; ECDF; Glasgow; Birmingham; RALPP and RAL Tier-1.
  • The accounting pages show the following sites as not up-to-date with publishing accounting data: Lancaster (minor); RALPP and Sussex.

Documentation - KeyDocs

See the worst KeyDocs list for documents needing review now and the names of the responsible people.

Tuesday 1st April

  • Keydocs action needed by Jens J; Rob H/Security T; Alessandra F; Wahid B; David C and Matt D.
  • We need to reassign Mark M's documents on Core Grid Services

Tuesday 18th March

  • Keydocs action needed by: Mark M; Jens J; Rob H/Security T; Alessandra F; Wahid B; David C and Matt D.
Interoperation - EGI ops agendas

Tuesday 8th April

  • Meeting yesterday (Agenda: https://wiki.egi.eu/wiki/Agenda-07-04-2014)
    • URT
      • ARC - 13.11u1 version 4.1.0, UMD-3 dcache-server 2.6.23, BDII core - new glue-validator, DPM/LFC - v. 1.8.8, GFAL/lcg_utils - v. 2.5.5, FTS3 - v. 3.1.74, GridSite - v. 2.2.3, CANL - v. 2.1.4, WMS v. 3.6.4
    • UMD release:
      • lcg-CA 1.56 out on April 2, alarms imminent (though on checking UK sites most are updated)
      • UMD 3.6.0 ready for release: wms v. 3.6.3, cream_torque v. 2.1.3, dpm-yaim v. 1.8.7, gridsite v. 2.2.2, glexec-wn v. 1.2.2
      • New in UMD: gfal2 v. 2.4.8, slurm-wn v. 1.0.0, cream-slurm v. 1.0.1, gridsafe v. 1.3.1
    • SR:
      • wms v. 3.6.4
      • UMD-3 campaign, checking for updates on early adopter sites listed for UMD-3 https://www.egi.eu/earlyAdopters/table
      • Some UK sites listed as not replying and still listed under UMD1/2
    • EMI-2 decommissioning

Monitoring - Links MyWLCG

Tuesday 8th April

On-duty - Dashboard ROD rota

Monday 7th April

  • New dashboard in use but has some problems - such as with handover functionality.
  • There are a lot of glue2 validator warnings (not errors) on the dashboard that should not be appearing as they are not to be pursued with tickets.

Monday 31st March

  • Routine alarms and a lot of EMI-2 tickets. Most of the EMI-2 tickets until next week.
  • Thanks to Daniela for cleaning up the dashboard early in the week.

Tuesday 18th March

  • Many EMI-2 tickets created during the week. Some false positives add to confusion!
  • Tickets outstanding (as of 15th March) for Brunel (2); Oxford (although the system was in downtime until yesterday) and Sheffield.

Rollout Status WLCG Baseline

Tuesday 18th March

Tuesday 11th February

  • 31st May has been set as the deadline for EMI-2 decommissioning. There may be an issue for dCache (related to 3rd party/enstore component).


Security - Incident Procedure Policies Rota

Monday 7th April

  • There was an EGI SVG Advisory 'High' RISK - Vulnerability announced
  • The security team meeting last week did not take place.
  • Linda has produced a new rota - only 2 people had responded to the poll so please check if you are in the team!

Tuesday 5th March

  • Ready for more ARGUS testing
  • SHA-2 looks ready for UK CA switch
  • Looking at technologies

Monday 24th February

  • Let Orlin know if you wish to try connections with the NGI ARGUS server. Tested working for EM including national banning. There is some setup documentation.

Services - PerfSonar dashboard | GridPP VOMS

Tuesday 8th April

  • Some discrepancies found in VOMS ports and listings between VOMSsnooper and the dashboard for ops (15009 vs 15002).
  • Also noted WLCG VOMS changes. New VOMS servers are being introduced as notified in this broadcast.
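A discrepancy like the one above (15009 vs 15002) can be found mechanically by comparing the two listings. A minimal sketch, assuming each source can be reduced to a mapping of VO name to (host, port). The function name, the host name `voms.example.org`, and which source reported which port are placeholders, not taken from VOMSsnooper or the dashboard:

```python
def find_mismatches(source_a, source_b):
    """Return {vo: (entry_a, entry_b)} for every VO whose entries differ.

    A VO missing from one source is reported with None on that side.
    """
    mismatches = {}
    for vo in sorted(set(source_a) | set(source_b)):
        entry_a = source_a.get(vo)
        entry_b = source_b.get(vo)
        if entry_a != entry_b:
            mismatches[vo] = (entry_a, entry_b)
    return mismatches


# Placeholder data: the assignment of ports to sources is illustrative only.
snooper = {"ops": ("voms.example.org", 15009)}
dashboard = {"ops": ("voms.example.org", 15002)}

for vo, (entry_a, entry_b) in find_mismatches(snooper, dashboard).items():
    print(f"{vo}: {entry_a} vs {entry_b}")
```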

Tuesday 18th March

  • A reminder of this perfSONAR overview for UK sites. Username is WLCGps. Currently shows a number of problems that need to be addressed at various sites.

Monday 10th March

  • A reminder that the perfSONAR documentation is available here.
  • Deadline for 3.3.2 is 1st April.

Monday 7th April 2014, 13.30 BST

32 Open UK tickets this week.

No site in particular.
https://ggus.eu/index.php?mode=ticket_info&ticket_id=101502 (24/2)
The ILC cvmfs rollout ticket. Glasgow, Oxford, Durham and Bristol were missing at last head count - although Glasgow are mid-rollout and should be fully deployed any day now (if not already). I think Oxford are in a similar boat? As JK points out, we've got to the point where we probably need to put the ticket on hold whilst I harass the last few stragglers. In progress (3/4)

https://ggus.eu/index.php?mode=ticket_info&ticket_id=103043 (7/4)
Squire Whyntie has asked for cern@school registration on the Imperial Dirac. Janusz has done so and Tom confirmed it works and can be solved. If only all things were solved so quickly! Assigned (7/4)

https://ggus.eu/index.php?mode=ticket_info&ticket_id=102810 (28/3)
The new Sussex EMI2 upgrade ticket. Matt RB copied the Sussex plan over from the original ticket. Daniela cleared up the mystery of what happened to the original ticket (dashboard shenanigans) and posted some useful instructions for the BDII upgrade. In progress (1/4)

https://ggus.eu/index.php?mode=ticket_info&ticket_id=102990 (3/4)
Duncan's unending perfsonar vigilance discovered a problem with the RALPP latency box. Ian reports firewall problems that have been solved, so it looks like this one can be closed (if all is well). In progress (can be closed) (4/4) Not quite out of the woods yet after all: Ian spotted and fixed a few more problems, and Duncan has spotted something else.

https://ggus.eu/index.php?mode=ticket_info&ticket_id=102953 (24/3)
CMS glidein hammercloud jobs not running at the site (specifically their defunct cream CEs) - Chris points out another ticket (https://ggus.eu/index.php?mode=ticket_info&ticket_id=102915) essentially detailing the same problem (just for different job types). Probably worth putting this one on hold whilst waiting on the other, as it looks like the problems are CMS side. In progress (2/4)

https://ggus.eu/index.php?mode=ticket_info&ticket_id=103027 (5/4)
LHCB pilots aborting, Kashif asks if the problem persists, the ticket fairy set the ticket to Waiting for Reply (5/4) SOLVED

https://ggus.eu/index.php?mode=ticket_info&ticket_id=102469 (19/3)
cvmfs for t2k. I think this has fallen through some cracks, no word for a while. In progress (21/3)

https://ggus.eu/index.php?mode=ticket_info&ticket_id=102205 (14/3)
Bristol's EMI2 upgrade ticket. Not much news, although there was a positive update from Winnie saying that it looks like the April deadline will be made. In progress (4/4)

https://ggus.eu/index.php?mode=ticket_info&ticket_id=102914 (1/4)
An atlas ticket, detailing some odd transfer behaviour for some files, likely attributed to some off tcp window settings on a disk server. There was a similar looking (although possibly not identical) problem at RHUL (https://ggus.eu/index.php?mode=ticket_info&ticket_id=102311). Some interesting stuff. In progress (4/4) Sam updated the ticket, with no more "sub-optimally tuned" disk pools. I think it should be set to waiting for reply though

https://ggus.eu/index.php?mode=ticket_info&ticket_id=102202 (14/3)
Not as interesting, Glasgow's EMI upgrade ticket. Chugging along, last word was from David a little while back about watching some atlas canary jobs running on the EMI3 worker nodes. How did these pan out? In progress (27/3) Gareth updates that progress is slow but steady, draining nodes is taking a while.

https://ggus.eu/index.php?mode=ticket_info&ticket_id=101565 (26/2)
LHCB asked Glasgow to publish their max CPU time. Not wanting to be made liars of, Sam pointed out why they didn't (shouldn't) do this. This has seemed to send LHCB back to the drawing board, so the ticket is on hold. On Hold (12/3)

https://ggus.eu/index.php?mode=ticket_info&ticket_id=102201 (14/3)
The ECDF EMI upgrade ticket. Not much to report here, although the apel box threw a wobbly as well, Andy's on it. In progress (2/4)

https://ggus.eu/index.php?mode=ticket_info&ticket_id=95303 (1/7/13)
glexec ticket. Word on that later. On Hold (27/1)

https://ggus.eu/index.php?mode=ticket_info&ticket_id=102199 (14/3)
Another EMI upgrade deadline ticket. A plan is in place and the work is underway. On Hold (24/3)

https://ggus.eu/index.php?mode=ticket_info&ticket_id=100037 (3/1)
Sheffield's perfsonar having trouble. Elena upgraded and got the Sheffield IT guys to open port 8086 - it looks like she's nailed the problem and has asked for confirmation. Waiting for reply (7/4) (And before I even finished the review, the ticket was solved).

https://ggus.eu/index.php?mode=ticket_info&ticket_id=95299 (1/7)
GLEXEC ticket. The tarball glexec isn't going well (no thanks to EMI3 taking up the last 6 weeks of tarball time). I might have to admit defeat (but will ask the devs for help before I do). On hold (4/4)

https://ggus.eu/index.php?mode=ticket_info&ticket_id=100566 (27/1/13)
Lancaster's Poo Perfsonar Performance (I said I wouldn't use that alliteration again, I lied). Using "normal" iperf to probe the boxes I see no 1Gb bottlenecks in my network, could the problem be software? On hold (7/4)

https://ggus.eu/index.php?mode=ticket_info&ticket_id=101285 (16/2)
UCL's perfsonar also having difficulty, although their difficulty is caused by the hardware going kaput on them. Ben is chasing up Dell for new bits. In progress (3/4)

https://ggus.eu/index.php?mode=ticket_info&ticket_id=102193 (14/3)
EMI upgrade ticket. Ben put in a brief plan, but the reminder date has passed. How goes it? The bdii and DPM are fairly straightforward to upgrade. On hold (14/3)

https://ggus.eu/index.php?mode=ticket_info&ticket_id=95298 (1/7/13)
glexec ticket. No news for a while, is this work to be rolled into the EMI3 upgrade? On hold (27/1)

https://ggus.eu/index.php?mode=ticket_info&ticket_id=102189 (14/3)
RHUL's EMI upgrade ticket. Not much news here. On hold (21/3)

https://ggus.eu/index.php?mode=ticket_info&ticket_id=103028 (6/4)
Atlas seeing production jobs failing due to pilot errors. Chris asked if production job options have changed recently? The ticket fairy struck again, setting the ticket to Waiting for reply (although he's less sure if that was the intention of Chris' reply). In progress (7/4) Atlas replied saying that they don't think there have been any job changes. Full prod disk is making things even cloudier, but Dan has asked for clarification on what an error message actually means - "!!FAILED!!1999!! Job killed by signal 24: Signal handler has set job result to FAILED, ec = 1204"
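One part of that message can be decoded without ATLAS-specific knowledge: the signal number. On most Linux platforms signal 24 is SIGXCPU (CPU time limit exceeded). A minimal sketch that pulls the numbers out of the message and looks up the signal name on the machine where it runs; the regular expression and function names are hypothetical, and the meaning of the 1999 and 1204 codes is left to ATLAS to explain:

```python
import re
import signal

# Hypothetical pattern matching the message format quoted in the ticket.
PATTERN = re.compile(
    r"!!FAILED!!(\d+)!!\s*Job killed by signal (\d+):.*?ec\s*=\s*(\d+)"
)


def parse_pilot_error(message):
    """Extract the numeric fields from the pilot error message, or None."""
    match = PATTERN.search(message)
    if match is None:
        return None
    code, signum, ec = (int(group) for group in match.groups())
    return {"code": code, "signal": signum, "ec": ec}


def signal_name(signum):
    """Name of the signal on the local platform, or 'unknown'."""
    try:
        return signal.Signals(signum).name
    except ValueError:
        return "unknown"


msg = ("!!FAILED!!1999!! Job killed by signal 24: "
       "Signal handler has set job result to FAILED, ec = 1204")
info = parse_pilot_error(msg)
print(info, signal_name(info["signal"]))
```

If the signal is indeed SIGXCPU, that would point towards a CPU time limit on the batch system rather than a change in the job options, though the ticket does not confirm this.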

https://ggus.eu/index.php?mode=ticket_info&ticket_id=101639 (26/2)
RFC3820 proxy problems at QM (and elsewhere). JK has asked the submitter for his ticket intentions. Set to Waiting for reply by our friend, the ticket fairy. (1/4)

(Please remember to set your tickets to Waiting for Reply after asking a question to the submitter. Don't make me spend yet another Monday afternoon referring to myself as the ticket fairy.)

https://ggus.eu/index.php?mode=ticket_info&ticket_id=102888 (1/4)
Biomed asked for access to their cvmfs repo to be rolled out at IC. Daniela has said fine but asked that they completely migrate to it within 3 months (nfs or cvmfs). Daniela has completed the rollout and asked biomed to test. Waiting for reply (7/4) Biomed have got back saying that they've launched some test jobs, but expect it might take a while for them to run. I think they also were kinda asking if Imperial would give them some leeway on moving wholly to cvmfs.

https://ggus.eu/index.php?mode=ticket_info&ticket_id=102166 (14/3)
The JET EMI upgrade ticket. There was a hope to upgrade before the end of April. On hold (24/3)

https://ggus.eu/index.php?mode=ticket_info&ticket_id=97485 (21/9/13)
SSL type errors for LHCB at JET. No progress on this for a while, the problem somehow survived the move to SL6/EMI3. On hold (11/2)

https://ggus.eu/index.php?mode=ticket_info&ticket_id=102611 (24/3)
The Tier 1 EMI upgrade ticket. There seem to be some false positives on the list, and it would help to clarify which these are (especially given the dashboard noise on the ticket). In progress (27/3)

https://ggus.eu/index.php?mode=ticket_info&ticket_id=98249 (21/10/13)
CVMFS for SNO+. Matt reported that the collaboration has given permission to have their software on cvmfs, and hoped to have tarballs ready for last week. Has there been any progress offline? In progress (26/3) Update - Squire Whyntie informed me that this is being actively worked on offline, with Tom kindly providing assistance.

https://ggus.eu/index.php?mode=ticket_info&ticket_id=101079 (9/2)
ARC CEs publishing the wrong DefaultSE. Andrew has hacking this on his todo list, but bumped this issue down the list (which is fine, as it's low priority). In progress (1/4)

https://ggus.eu/index.php?mode=ticket_info&ticket_id=99556 (6/12/13)
The NGI argus ticket. I'm pretty sure that this can be closed, as argusngi.gridpp.rl.ac.uk is set up and has been tested by several sites, so all looks well here. On Hold (21/3)

https://ggus.eu/index.php?mode=ticket_info&ticket_id=101968 (11/3)
Atlas deletion errors at the Tier 1. The problem is known, but not well understood, and sadly persists (last set of errors reported on the 4th). Alastair has put in a good explanation of the symptoms. On hold (4/4)

Tools - MyEGI Nagios

Monday 17th March

Tuesday 26th November

  • Regional Nagios updated to release 22. It is a glite to UMD update and it required a fresh installation.
  • There have been some internal changes in SAM-Nagios. Test probes are now the responsibility of product team. Some test names have been changed as a result of this reorganization. For example the org.sam.CREAMCE-DirectJobSubmit test has become emi.cream.CREAMCE-DirectJobSubmit. This does not affect the operational activities.
  • Please could all site admins look at services associated to their site and please mail Kashif if anything odd is noticed. Site admins can reschedule tests for their sites and it would be helpful if most functionalities are tested.
  • Also, look at myegi which can be useful with links to the Dashboard, GSTAT, Accounting Portal and GGUS.
VOs - GridPP VOMS VO IDs Approved VO table

Monday 17 February 2014

  • Proxy renewal
    • All RAL WMSs now renew proxies with 1024 bits. This looks like the end of this (at last).

Tuesday 11 February 2014

  • Proxy renewal
    • lcgwms06 at RAL has been upgraded and works
    • Both Imperial's WMSs work
    • Glasgow's will still need to be upgraded (unless they have been since Friday).
Site Updates

Tuesday 8th April

  • Steve noted that Liverpool are having a problem with the CVMFS clients on their worker nodes. "...in short, VO/CVMFS admin for na62 and mice are publishing stale .cvmfswhitelist and repos cannot be mounted on new systems. I expect this to spread to other systems and VOs as local cache dates expire."

Meeting Summaries
Project Management Board - Members Minutes Quarterly Reports


GridPP ops meeting - Agendas Actions Core Tasks


RAL Tier-1 Experiment Liaison Meeting (Wednesday 13:30) Agenda Meeting takes place on Vidyo.

Wednesday 9th April 2014

  • Operations report
  • The intervention to update the Tier1's network connection into the RAL site network has been announced (in the GOC DB) for Tuesday 29th April.
  • Rollout of worker nodes to EMI-3 and to use the EMI-3 Argus server are complete.
  • A pair of new Perfsonar nodes are now in use.
  • Reminder: The software server used by the small VOs will be withdrawn from service (aiming for June).
WLCG Grid Deployment Board - Agendas MB agendas


NGI UK - Homepage CA



4th February 2014 With reference to the OMB on 30th January.

- UMD-2 will be decommissioned in the coming months. 30th April end of security support. 31st May all services to have been removed or upgraded.

- UK sites failing Glue2:


Check with the glue2 validator to see errors.

- A new Operations Dashboard will be in pre-production during February and moved into production in March.

- Availability/reliability targets for EGI are moving to 80%/85%.

- Some new ARC SAM tests are being introduced: org.nordugrid.ARC-CE-LFC-result; org.nordugrid.ARC-CE-LFC-submit; org.nordugrid.ARC-CE-SRM-result; org.nordugrid.ARC-CE-SRM-submit; org.nordugrid.ARC-CE-submit

- There is a summary page on how to publish from various middleware types: https://wiki.egi.eu/wiki/MAN09

- There was an overview of the French NGI adoption of iRODs.

- FedCloud to production: Management (GOCDB - ok); Monitoring (in progress); Accounting (in progress); Documentation (ok); Support (ok); Dashboard (in progress) and Security (in progress).

- The SAM tests are: org.nagios.CloudBDII-Check; eu.egi.cloud.OCCI-VM and org.nagios.OCCI-TCP. Security checks not yet available.

UK ATLAS - Shifter view News & Links






  • N/A
To note

  • N/A