Operations Bulletin 280714

Bulletin archive

Week commencing 21st July 2014

Task Areas

General updates

Tuesday 22nd July

The RAL FTS2 is due to be switched off on 2nd September.
HyperK enablement request still stands.
The biweekly WLCG status report from yesterday's ops meeting is available here.
There was a general reminder last week about putting too much detail into messages that go to our public email lists. Please remember the Traffic Light Protocol!
There will be an IPv6 session at GridPP33. JANET will participate. Pete C and G are taking ideas for talks.
There was a GridPP technical meeting on Friday.
The final WLCG T2 availability and reliability figures for June 2014 are now available.
There was a first meeting last Wednesday of the HEP Software Foundation. The first step is to prepare a "call for volunteers" who can devote time in the coming months to lead the work that has to be done.

Monday 14th July

Workshop - CVMFS monitoring feedback
ATLAS DC14 13TeV simulation starting - note Alessandra's recommendation regarding Nikhef scripts and multicore running (for torque/maui sites).
topBDII caching and errors
ILC VOMS changes

Sites with ARC CEs who want to support LHCb need to make a few configuration changes. This is to ensure that there is an environment variable available to jobs which specifies the name of the queue.
EGI A/R report for June
Did anyone else see kernel problems like Liverpool's (see blog)?
Large numbers of biomed jobs have been impacting various sites. Is setting MaxTotalJobs the answer? Do we need follow-up with the VO?
HyperK can now make use of additional resources and a general request for enablement was circulated. It has been confirmed that they only need disk at QMUL.

WLCG Operations Coordination - Agendas

Tuesday 21st July

The next coordination meeting takes place this Thursday at 14:30 UK time. There is now a standing item for sites to raise issues of concern. Is there anything we would like to mention this week? We are invited to update the twiki up until 1 hour before the meeting.

Tuesday 14th July

The regular meeting would have taken place last week - an update was presented at the workshop. Next meeting is on 24th July.

Tuesday 1st July

There will be a multi-core TF meeting this afternoon at 13:30 UK time.
A reminder that the next MW readiness WG meeting is this Wednesday (2nd July) at 15:00 UK time.

Monday 23rd June

Minutes from last Thursday's meeting. Highlights....
A page is available listing current known middleware issues affecting WLCG.
Baselines: Storm 1.11.4 released in EMI containing several bug fixes. Baseline update with UMD release.
3 issues affected some sites after the latest EMI update of Cream and LB. The problems are under investigation by the PTs.
CVMFS: Starting from July, sites not compliant with the 2.1.19 version will be notified with a GGUS ticket (noted that the upgrade just requires an update of the RPM and a restart of CVMFS).
T0: The OPS VO now runs in voms-admin instead of VOMRS, after the migration done on June 17th.
Tier-1/Tier-2 feedback: NTR!
ALICE: successful campaign for users to move away from old ROOT versions. T0 job efficiency issues ongoing.
ATLAS: DC14 expected to start in approximately 2 weeks from now. Panda/Jedi is now fully ready for user analysis.
CMS: Started to remove individual release tags from CEs. After the introduction of disk/tape separation at the T1 sites, CMS now must site readiness measures for T1 sites
LHCb: Recommend CVMFS 2.1.19. General request: ensure that downtimes, including unscheduled outages, accurately reflect the specific services which are unavailable.
FTS3: Monitoring the auto-tuning algorithm closely and adjusting various monitoring tools of FTS3.
glexec: 10 sites have yet to enable it. ARGUS instabilities being investigated.
Machine/job features: PBS/torque and LSF implemented. SLURM pending. SGE and HTCondor in progress.
MW readiness: ATLAS and CMS DPM setups in progress. Monitoring prototype being deployed at test sites.
Multicore: CMS stable flow. Gathering reports for July workshop. ATLAS MC jobs on-hold pending new software release.
SHA-2: New VOMS fix for CERN instances requires sites to update ARGUS, UI, CREAM and WN instances.
WMS decommissioning: Progress with SAM Condor validation. ARC-CE WN tests failing for some CMS sites (incl. Imperial).
IPv6: NTR
HTTP proxy discovery: Task overview table updated.
Network and transfers metrics: Mesh leaders developed. Kick off in July.
AOB: OSG plan to migrate to HTCondor CEs by October.
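
The CVMFS item in the highlights above amounts to a version comparison against 2.1.19. A minimal sketch of such a check follows; it is not the tooling WLCG uses, and the site names and installed versions are made up for illustration. Versions are compared numerically, so that 2.1.9 counts as older than 2.1.19.

```python
def parse_version(text):
    """Turn a dotted version string such as '2.1.19' into a tuple of ints."""
    return tuple(int(part) for part in text.strip().split("."))


def is_compliant(installed, baseline="2.1.19"):
    """Return True if the installed version is at least the baseline."""
    return parse_version(installed) >= parse_version(baseline)


# Hypothetical site inventory; the site names and versions are made up.
sites = {"SITE-A": "2.1.19", "SITE-B": "2.1.17", "SITE-C": "2.1.20"}
non_compliant = sorted(name for name, ver in sites.items() if not is_compliant(ver))
print(non_compliant)
```

A plain string comparison would rank "2.1.9" above "2.1.19", which is why the versions are split into integers first.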

Tier-1 - Status Page

Tuesday 22nd July

All Castor instances have been upgraded to version 2.1.14. The upgrade is complete apart from turning off a compatibility mode on the nameserver component, which will be done shortly.
We have announced that we will shut down the FTS2 service on the 2nd September.
The software server used by the small VOs will be withdrawn from service. We are planning to do this on the 2nd September. Moving VOs to use CVMFS has been progressing well.
There was a 'warning' (At Risk) on the Tier1 for a network routing change in the core RAL network.

Storage & Data Management - Agendas/Minutes

Wednesday 23 July 2014

We really should try to document our VO policy: the stuff in people's heads, experiences with "small" VOs (that tend to grow bigger), best practices. Also, much of the wiki needs reviewing. "Boring" old documentation - hey ho!

Tuesday 22nd July

Alert today advising to update DPM, mainly to get the new (bug-fixed) DPM-dsi.

Wednesday 2 July

Guidance and policies for "small" VOs: how to get them started with stuff, without preventing them later growing bigger.

Tuesday 1st July

Spacetokens for smaller VOs ... most want them but what happens post SRM. Chris's summary on Spacetokens needs updating and a consensus! Could the SEs implement some reservation system internally? Is there merit in the suggestion to make use of RAL genscratch and its Least Used Policy?

Accounting - UK Grid Metrics HEPSPEC06 Atlas Dashboard HS06

Tuesday 22nd July

EGI accounting portal does not show any significant outages in publishing in recent days.

Tuesday 1st July

There are no SL6 HS06 entries in our wiki for UCL and EFDA.
Are there any observations from the latest GridPP metrics tables? (Does anything need addressing or correcting?).
APEL is not up-to-date for: RHUL, Manchester and Durham.

Check publishing via: http://gstat2.grid.sinica.edu.tw/gstat/summary/Country/UK/

Documentation - KeyDocs

See the worst KeyDocs list for documents needing review now and the names of the responsible people.

Tuesday 22nd July

Starting on revisions this week.
Is the alert system now working?

Monday 16th June

A review is starting of old and obsolete pages within the GridPP website - there are many! Please review sections that you have created and update them if necessary.

Tuesday 6th April

KeyDocs are going to be reviewed (in next 4 weeks) as the system is not working (or not adding anything) in some areas.

Interoperation - EGI ops agendas

Tuesday 14th July

Last meeting yesterday.

Agenda: https://wiki.egi.eu/wiki/Agenda-14-07-2014
Minutes:

URT: see agenda for details
SR: In verification: gfal2 v. 2.5.5; active: globus-info-provider-service v. 0.2.1, cream v. 1.16.3; ready to be released: storm v. 1.11.4, lb v. 11.1, wms v. 3.6.5, dcache v. 2.6.28
DMSU report: CREAM CLI/GridSite SegFaults at Long-Lived Proxies solved
Migration of Central SAM services: Note to make sure that if being reinstalled that patches are applied
EMI-2/APEL-2 - Looks like UCL is still publishing with APEL-2 publisher
Hoped that gr.net issues resolved on Monday. Summary of discussion to be in minutes.
Next meeting placeholder 28th July, but may not happen (OMD depending)
Please fill out this UMD customer satisfaction survey in the next couple of weeks if you have a moment: https://www.surveymonkey.com/s/MQ6G8BZ

Tuesday 1st July

Today's ops meeting cancelled - partly due to forthcoming 4th EGI annual review.
EMI-2 decommissioning: The situation is followed by COD (GGUS 106354). "Please remember that we passed the decommissioning deadline and after today - Sites still deploying unsupported service end-points risk suspension, unless documented technical reasons prevent a Site Admin from updating these end-points" (source PROC16).
There is STILL use of UMD2/EMI2 APEL clients to send accounting data. As of today there are 20 sites (see latest list) still using UMD2/EMI2 APEL clients.

Updates requested for the early adopters table.

Monitoring - Links MyWLCG

Tuesday 22nd July

Chris noted yesterday that gstat reports most sites as critical. SB thought the underlying problem is that a value is supposed to be "Production" in GLUE 1 and "production" in GLUE 2; at some point they changed both to lower case.
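
If the cause is only the letter case, as SB suggests, a consumer of the published value could compare it case-insensitively. The sketch below illustrates that idea; it is not how gstat is implemented, and the function name and sample values are hypothetical.

```python
def is_production(state):
    """Case-insensitive check, so 'Production' and 'production' both match."""
    return state is not None and state.strip().lower() == "production"


# Hypothetical published values, for illustration only.
published = ["Production", "production", "Closed", None]
print([is_production(value) for value in published])
```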

Tuesday 15th July

Meeting on 4th July https://indico.cern.ch/event/326288/
- I discussed my Status Board talk
- Looking at recomputations in SAM3 and how to visualise them: https://indico.cern.ch/event/326288/contribution/5/material/slides/1.pdf

On-duty - Dashboard ROD rota

Tuesday 22nd July

Another quiet week. Bham availability alarm ticket created.

Tuesday 1st July

Quiet week. Sussex emi2 ticket is still open. UCL also has an open ticket regarding a problem with storage.

Tuesday 24th June

Very quiet shift. Dashboard downtime on Tuesday seemed to go ok.

Rollout Status WLCG Baseline

Tuesday 18th March

The EMI-2 decommissioning task has started.
The next WLCG middleware readiness WG meeting takes place this afternoon at 13:30 UK time.

Tuesday 11th February

31st May has been set as the deadline for EMI-2 decommissioning. There may be an issue for dCache (related to 3rd party/enstore component).

References

Staged Rollout pages (now separated into EMI1 & 2), and the page listing the deployed versions is extractable from the bdii, so they should all be reasonably up-to-date:
http://www.hep.ph.ic.ac.uk/~dbauer/grid/staged_rollout.html
http://www.hep.ph.ic.ac.uk/~dbauer/grid/staged_rollout_emi2.html
http://www.hep.ph.ic.ac.uk/~dbauer/grid/state_of_the_nation.html

Security - Incident Procedure Policies Rota

Tuesday 22nd July

The EGI security dashboard is coming back to life.

Monday 14th July

EGI CSIRT ADVISORY [EGI-ADV-20140625]

Tuesday 1st July

There was a very useful security challenge debrief last week. Thanks to Heiko.
There may be a site contacts challenge in the coming months. Please could every site review their site security contact details and ensure that the GOCDB entry is up-to-date and working.
EGI indicates that site ARGUS instances can now be hooked up with the regional instances.
There was one EGI amber final report last week.
Next team meeting 16th July.

Monday 23rd June

CVE-2014-3153 - but no public exploit.
- This kernel vulnerability has been patched in errata released last week.
PerfSonar/Cacti updates.
New IGTF CA release 1.58 - the EGI release is due on 30th June.

Services - PerfSonar dashboard | GridPP VOMS

- This includes notifying of (inter)national services that will have an outage in the coming weeks or will be impacted by work elsewhere. (Cross-check the Tier-1 update).

Tuesday 22nd July

There was a problem with VOMS admin that was noticed last Thursday: VO page requests resulted in the WEB-INF directory of the jetty app being displayed. A server restart fixed the problem.

Tuesday 17th June

The GridPP VOMS server was updated on 11/06/2014 - no issues reported.

Tickets

Tools - MyEGI Nagios

Monday 14th July

Winnie reported on Saturday 12th July that most UK sites were failing Nagios tests. The problem started with an unscheduled power cut at a Greek site hosting the EGI message broker (mq.afroditi.hellasgrid.gr) at around 2 PM on 11th July. The message broker was put in downtime, but top BDIIs continued to publish it for quite a long time. Stephen Burke mentioned in the TB support thread that the default caching time is now 4 days. When I checked on Monday morning, only Manchester was still publishing mq.afroditi, and it went away after Alessandra manually restarted the top BDII. It seems that Imperial is configured with a much shorter cache time. Oxford and Imperial were almost unaffected; for Oxford the reason may be that its WNs have the Imperial top BDII as the first option in BDII_LIST. Other NGIs have reported the same problem, and this outage is likely to be considered when calculating availability/reliability. All Nagios tests have now returned to normal.

Emir reported this on the tools-admin mailing list: "We were planning to raise this issue at the next Operations meeting. In these extreme cases 24h cache rule in Top BDII has to be somehow circumvented."
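
The effect of the cache time in the report above can be illustrated with a simplified model: an endpoint that has vanished stays published until the cache time has elapsed since it was last seen. This is only a sketch, not the actual top BDII logic. The 09:00 time of the Monday check is an assumption; the 4-day and 24-hour values are the two figures quoted above.

```python
from datetime import datetime, timedelta


def still_published(last_seen, now, cache_time):
    """Return True while a vanished endpoint is still served from the cache."""
    return now - last_seen < cache_time


# Broker lost power at around 14:00 on 11th July 2014 (from the report above).
last_seen = datetime(2014, 7, 11, 14, 0)
monday_morning = datetime(2014, 7, 14, 9, 0)  # assumed time of the Monday check

for label, cache_time in [("4 days", timedelta(days=4)), ("24 hours", timedelta(hours=24))]:
    print(label, still_published(last_seen, monday_morning, cache_time))
```

Under this model a 4-day cache would still be publishing the broker on Monday morning, while a 24-hour cache would have dropped it by Saturday afternoon.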

Tuesday 1st July

There was a monitoring problem on 26th June. All ARC CEs were using storage-monit.phyics.ox.ac.uk for replicating files as part of the Nagios testing. storage-monit was updated but not re-yaimed until later, so it was broken for the morning, leading to all ARC SRM tests failing.

Tuesday 24th June

An update from Janusz on DIRAC:
We had a stupid bug in Dirac which affected the gridpp VO and storage. Now it is fixed and I was able to successfully upload a test file to Liverpool and register the file with the DFC
The async FTS is still under study, there are some issues with this.
I have a link to software to sync user database from a VOMS server, haven’t looked into this in detail yet.

VOs - GridPP VOMS VO IDs Approved VO table

Monday 14th July 2014

HyperK.org will initially use remote storage (irods at QMUL) - so CPU resources would be appreciated.

Monday 30 June 2014

HyperK.org request for support from other sites
- 2TB storage requested.
- CVMFS required

Cernatschool.org
- WebDAV access to storage - world read works at QMUL.
- ideally will configure federated access with DFC as LFC allows.

Monday 16 June 2014

CVMFS
- Snoplus almost ready to move to CVMFS - waiting on two sites. Will use symlinks in existing software

VOMS server: Snoplus has problems with some of the VOMS servers - see ggus 106243 - may be related to update.

Tuesday 15th April

Is there interest in an FTS3 web front end? (more details)

Impact
- Citation policy (https://www.gridpp.ac.uk/acknowledging.html)

Site Updates

Tuesday 20th May

Various sites but notably Oxford have ARGUS problems. 100s of requests seen per minute. Performance issues have been noted after initial installation at RAL, QMUL and others.

Meeting Summaries

Project Management Board - Members Minutes Quarterly Reports

Empty

GridPP ops meeting - Agendas Actions Core Tasks

Empty

RAL Tier-1 Experiment Liaison Meeting (Wednesday 13:30) Agenda Meeting takes place on Vidyo.

Wednesday 23rd July 2014

Operations report
The termination of the FTS2 service has been announced for the 2nd September.
The software server used by the smaller VOs will be turned off - also on 2nd September.
We are planning to turn off access to the cream CEs (possibly keeping them open for Alice) - although no date yet decided for this.

WLCG Grid Deployment Board - Agendas MB agendas

Empty

NGI UK - Homepage CA

Empty

Events

UK ATLAS - Shifter view News & Links

Empty

UK CMS

Empty

UK LHCb

Empty

UK OTHER

N/A

To note

N/A
