Monday 30th June 2014, 14.30 BST
Full Review this week, a little earlier than usual.
28 Open UK Tickets
SUSSEX
https://ggus.eu/index.php?mode=ticket_info&ticket_id=105937 (2/6)
Low availability ticket, due to EMI3 upgrade woes. Most issues have been solved, but APEL publishing problems have been rolled into the ticket. Matt RB seems to be digging his way out in the right direction though. In progress (30/6)
https://ggus.eu/index.php?mode=ticket_info&ticket_id=105618 (21/5)
SNO+ CVMFS unavailable at Sussex. On Hold whilst the other issues are dealt with. On Hold (23/6)
https://ggus.eu/index.php?mode=ticket_info&ticket_id=106492 (25/6)
A request from ATLAS to resize space tokens. Matt also asked if atlashotdisk and atlasgroupdisk could be deleted - Brian gave the nod yes. Probably all done with here? In Progress (27/6)
BRISTOL
https://ggus.eu/index.php?mode=ticket_info&ticket_id=106438 (23/6)
CMS having some trouble running jobs at Bristol (especially having lots of "held" jobs - but reading the ticket this means held on the CMS queue, not in the local batch system). Winnie notes that for at least one of their queues they have over a hundred waiting CMS jobs on a 72-slot shared queue. But it looks like the problem may have evaporated. At last word the CMS submitter said he'd close the ticket if things stayed clear - but that was last Thursday. In Progress (26/6)
https://ggus.eu/index.php?mode=ticket_info&ticket_id=106325 (1/6)
A different CMS ticket, about pilot jobs losing connection to their submission hosts. After another round of nomenclature confusion, it was found that the problem seems to lie between Bristol and the hosts cmssrv119.fnal.gov and vocms97.cern.ch. Lukasz suggests using perfsonar to investigate. Also the dates on this ticket are well off (creation date 1/6, but first update 18/6). In progress (27/6)
https://ggus.eu/index.php?mode=ticket_info&ticket_id=106554 (1/6)
Again the dates on this ticket are very off (creation date was 1/6, but the first update is 29/6) - so the issue may have disappeared. This is another CMS ticket, about a heavy transfer backlog between Bristol and FNAL - if it's still a problem it's possibly linked to the above issue. Waiting on Lukasz to get back. In progress (30/6)
https://ggus.eu/index.php?mode=ticket_info&ticket_id=106058 (9/6)
CMS xrootd problems at Bristol. Also waiting on Lukasz's return (which I think has happened). On Hold (16/6)
EDINBURGH
https://ggus.eu/index.php?mode=ticket_info&ticket_id=95303 (1/7/2013)
glexec ticket. No news; the early review meant I couldn't soothe my shame on this matter. On Hold (27/1)
MANCHESTER
https://ggus.eu/index.php?mode=ticket_info&ticket_id=105922 (2/6)
Manchester publishing to EMI2 APEL. It's being worked on, but one piece is missing - on hold until this detail is sorted. On Hold (25/6)
LANCASTER
https://ggus.eu/index.php?mode=ticket_info&ticket_id=106406 (23/6)
LHCb having trouble on Lancaster's older cluster. The first issue was CVMFS timeouts, linked to the older WNs being overloaded. The second issue is the CREAM CE losing track of jobs in the batch system. Being worked on, but, as with a case of old age, tuning can only fix so much. In progress (26/6)
https://ggus.eu/index.php?mode=ticket_info&ticket_id=95299 (1/7/2013)
glexec ticket. As with ECDF. On Hold (4/4)
https://ggus.eu/index.php?mode=ticket_info&ticket_id=100566 (27/1)
Persistent Poor Perfsonar Performance Problems Plaguing Plymouth-born Postdoc... nope, that's as many Ps as I can get (and I'm not sure I still count as a Postdoc). A reinstall of the box hasn't helped. If anyone has a normal 10G iperf endpoint I could test against, that would be great. Other than that, waiting on some networking rejigging at Lancaster to shake things up and give the network engineers another chance to go over things. On Hold (23/6)
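For anyone willing to volunteer an endpoint, the sort of test I have in mind is a plain iperf3 run, something like the sketch below (hostnames and the parallel-stream count are placeholders, not a real offer of a server):

```shell
# On the volunteering 10G endpoint, start a plain iperf3 server:
iperf3 -s -p 5201

# From the Lancaster perfsonar box, run a 30-second test towards it,
# with 4 parallel streams to help fill a 10G pipe
# (far-end.example.ac.uk is a placeholder hostname):
iperf3 -c far-end.example.ac.uk -p 5201 -t 30 -P 4
```

Testing against a vanilla endpoint like this would help separate a perfsonar toolkit problem from a genuine network path problem.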
UCL
https://ggus.eu/index.php?mode=ticket_info&ticket_id=106425 (23/6)
UCL failing ops tests that use their SE. Ben noticed a problem with one of their pools, but fixing it didn't seem to solve the problem. Gareth has asked for an update before being forced to escalate. In progress (30/6)
https://ggus.eu/index.php?mode=ticket_info&ticket_id=95298 (1/7/2013)
UCL's glexec ticket. Last word was this would be the first job of a newer staff member, who was due to start within a few months (so about nowish?). On Hold (16/4)
https://ggus.eu/index.php?mode=ticket_info&ticket_id=101285 (16/2)
UCL's perfsonar not working after suffering a hardware failure. Bits have been replaced and the machine was due a reinstall a while ago. On Hold (28/4)
RHUL
https://ggus.eu/index.php?mode=ticket_info&ticket_id=106437 (23/6)
Atlas have inaccessible file(s) at RHUL due to a pool node in distress. Govind hopes to install a new motherboard tomorrow and will update after. Good luck with the repair! In progress (30/6)
https://ggus.eu/index.php?mode=ticket_info&ticket_id=105943 (2/6)
Biomed asking for gsiftp access on the RHUL headnode so that they can read the namespace. Govind tried to enable this, but biomed report that it didn't work. Not much word since - but I expect Govind's been busy. In progress (23/6)
https://ggus.eu/index.php?mode=ticket_info&ticket_id=105923 (2/6)
RHUL still publishing to EMI2 APEL too. On Govind's to do list, but low priority. No word for a while. On Hold (17/6)
https://ggus.eu/index.php?mode=ticket_info&ticket_id=106495 (25/6)
Inconsistent storage capacity publishing at RHUL. Govind reckons (quite rightly) that this is due to having a pool node out of commission, and will look at it once that's fixed. In Progress (26/6)
QMUL
https://ggus.eu/index.php?mode=ticket_info&ticket_id=105771 (27/5)
Biomed having problems accessing files via https at QM. Chris explains that they've had to switch off https access and are waiting for ticket 105361 to be fixed and StoRM to be updated. On Hold (12/6)
IMPERIAL
https://ggus.eu/index.php?mode=ticket_info&ticket_id=106369 (20/6)
Biomed ticket, similar to 105943 for RHUL, but with some added history. Biomed are being a little insistent, and have asked a question that I don't fully understand about path publishing. In Progress (30/6)
IMPERIAL CLOUD
https://ggus.eu/index.php?mode=ticket_info&ticket_id=106347 (19/6)
The new cloud site needed some tuning, as VMs weren't using proxies but were hitting the CERN stratum 0 directly. Adam is working on how to get around this - Ewan has mentioned that Oxford have shoal running and have seen accesses from the Imperial Cloud machines - so the problem may have a no-work-required workaround (the best kind!). In Progress (29/6)
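For context, pointing CVMFS clients at a proxy rather than straight at CERN is a client-config matter - a minimal sketch, assuming a squid is available (the squid hostname below is a placeholder, and shoal-discovered squids would replace the static entry):

```shell
# /etc/cvmfs/default.local on the cloud VMs -- illustrative values only.
# Send all CVMFS HTTP traffic via a local squid; DIRECT as last resort.
CVMFS_HTTP_PROXY="http://squid.example.ac.uk:3128;DIRECT"
# Fetch from a stratum 1 mirror, never the stratum 0 itself.
CVMFS_SERVER_URL="http://cvmfs-stratum-one.cern.ch/cvmfs/@fqrn@"
```

With a bare `DIRECT` (or no proxy set at all) every VM hammers CERN individually, which is presumably what was being seen.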
EFDA-JET
https://ggus.eu/index.php?mode=ticket_info&ticket_id=97485 (21/9/2013)
LHCb jobs having openssl-like problems at Jet. No progress on this for a while, but none was expected - the problem survived the move to EMI3, and the Jet admins are stuck. On Hold (12/5)
TIER 1
https://ggus.eu/index.php?mode=ticket_info&ticket_id=105405 (14/5)
Vidyo router firewall ticket. I suspect this ticket can be closed, as other issues are being followed up elsewhere - or at the least it needs an update/being set on hold. In Progress (10/6)
https://ggus.eu/index.php?mode=ticket_info&ticket_id=105571 (20/5)
Inconsistent BDII and SRM storage numbers for LHCb. This has been worked on, and seems almost fixed. There's some debate over the tape figures; Brian points out that the 'online' values are correct. In progress (30/6)
https://ggus.eu/index.php?mode=ticket_info&ticket_id=106324 (18/6)
CMS pilots losing connection to their submission hosts at RAL. It looks like this has been going on silently for a while; the RAL team are taking it up with their networking chaps to see if it's a firewall issue.
https://ggus.eu/index.php?mode=ticket_info&ticket_id=106480 (25/6)
The information publishing police have pointed out that the RAL Castor isn't publishing a sane version number. Brian suspects a rogue ":" is causing the problems.
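If it is a stray colon, the fix in the info provider is just sanitising the string before publishing it - a hedged sketch, with a made-up version value rather than the actual published string:

```shell
# Illustrative only: strip any rogue colons from a version string
# and check the result looks like a sane x.y.z-n version.
raw="2.1.14-15:"          # example of a bad published value (made up)
clean="${raw//:/}"        # bash parameter expansion: drop all ":"
echo "$clean"
if [[ "$clean" =~ ^[0-9]+(\.[0-9]+)*(-[0-9]+)?$ ]]; then
  echo "sane version"
fi
```

Something along these lines in the GIP plugin would stop the validators complaining, whatever the colon's origin.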