Operations Bulletin 020215

Bulletin archive

Week commencing 26th January 2015

Task Areas

General updates

Tuesday 27th January

A reminder there is a new EGI document for cloud/grid: Introducing new cloud/grid components to EGI.
There was a WLCG middleware readiness review meeting last Wednesday.
CHEP travel bids have been considered and individuals have been notified of the outcome.
Message from Tom: Thanks to the DIRAC team and regional VOs for their help with the new UCLan users.

Tuesday 20th January

A GridPP technical meeting took place on Friday. Minutes are available.
There has been agreement on CVMFS repository naming for the regional VOs e.g. /cvmfs/londongrid.gridpp.ac.uk
Glasgow WNs and gcc-gfortran issue (masking user setup problems?)
EGI-Engage project evaluation was very positive (15/15).
Minutes of last Wednesday's GDB are now available. The GDB action list has been updated.
- Still looking for feedback on ginfo
- PerfSonar to be reinstalled/enabled at sites
- Squid registration in GOCDB
- Test Machine Job Features approach
- Participate in MW readiness working group
- Looking for more contributors for ARGUS testing/support.

Minutes of the pre-GDB on data management have also been produced.
Matt D had a question about the gfal2 xrootd plugin for use with his tarball...
Birmingham job numbers - lower than available slots despite queues.
HEP SW foundation workshop (http://hepsoftwarefoundation.org/workshop-slac-jan-2015/) this week 20th-22nd.
There was a CHEP 2015 bulletin on 15th December indicating that notification of abstract acceptance was postponed until 15th January.

Thursday 8th January

The draft December Tier-2 reliability/availability figures are available.
If you might attend the pre-GDB on clouds in March please complete the WLCG doodle poll.
There is a pre-GDB on Tuesday 13th on Data Management and preservation.
There is a GDB on 14th January.
Steve has written up issues around operational aspects of ATLAS availability/reliability processes. Comments requested.
At the December EGI Operations Management Board (OMB) it was decided to ask all sites using the APEL client to configure it to publish the number of cores used by jobs. To do this, set parallel = true in /etc/apel/parser.cfg.
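A quick way to confirm the setting on a node is to read the file back. The following is a minimal sketch only: the helper name parallel_enabled and the [batch] section in the sample text are assumptions for illustration, so the function simply looks for a parallel option in whichever section carries it.

```python
import configparser


def parallel_enabled(cfg_text):
    """Return True if any section of the config text sets 'parallel' to a true value."""
    # interpolation=None so that literal '%' characters in the file are left alone.
    parser = configparser.ConfigParser(interpolation=None)
    parser.read_string(cfg_text)
    for section in parser.sections():
        if parser.has_option(section, "parallel"):
            return parser.getboolean(section, "parallel")
    return False


# Hypothetical sample; the section name is an assumption, not taken from the bulletin.
sample = """
[batch]
enabled = true
parallel = true
"""

print(parallel_enabled(sample))
```

On a real node the text would come from open("/etc/apel/parser.cfg").read() instead of the sample string.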

Tuesday 30th December

Issues with Tier-1 switch 24th/25th December. GOCDB switchover.

WLCG Operations Coordination - Agendas

Tuesday 27th January

There was a WLCG ops coordination meeting last Thursday (Agenda:Minutes). In summary:

News: 101 sites responded to survey - thank you. There is a draft agenda available for the WLCG workshop in Okinawa. Nicolò Magini (secretary) is moving.
Baselines: FTS 3.2.31 released; Gridsite (2.2.5) in UMD 3; StoRM 1.11.5/1.11.6 released by the PT. dCache 2.6.x end of support is June 2015 (move to 2.10.x/2.11.x soon).
Tier-0: The deadline to decommission VOMRS at CERN has moved to 16th February so that issues with the VOMS-Admin replacement can be resolved. The tentative date for AFS-UI decommissioning remains 2nd February.
Tier-1: NTR
Tier-2: NTR
ALICE: High activity except 15th Jan due to cert issue. Large SARA (NLT1) data loss. Offline repo split: AliRoot Core (dependable) vs. AliPhysics (agile).
ATLAS: Prodsys-2 has been fully validated. Rucio 'fairly' stable. Production has some hiccups. Most prod multicore. SARA: 0.5M files lost. Data lifetime policy now applied. MC15 not yet ready.
CMS: Moderate load. Some T1 disk full. Doing tape exercises. 50% T1 resources to be multicore enabled. Moving CRAB and central production into a single global Condor pool. Tier-2s will stop receiving pilot jobs with VOMS role production.
LHCb: Run1 Legacy Stripping almost done. Pre-staging with FTS3 worked well. SARA-MATRIX file loss. HTTP/WEBDAV access unstable.
WLCG critical services review: Done for T0. Fully updated tables available.
glexec: Focus still on PanDA integration.
SHA-2: a new VOMS-Admin version is expected now. VOMRS lives a bit longer.
Machine/Job features: Still need volunteer sites to deploy machine/job features on their batch / cloud infrastructure.
MW readiness: Met on 21st Jan. Minutes up. Reasonable participation. The new version of the Package Reporter is ready - need more sites testing it.
Multicore: CMS focus on T1s. T2s to restart soon (pending factory update).
IPv6: Meeting last week. Tier-1s to go dual-stack on their perfSonar instances by April. perfSONAR dashboard showing IPv6 status proposed - IPv6 test to be added to Nagios.
Squid: NTR
Network & Transfer metrics: NTR.
Next meeting 5th February.

Tuesday 20th January

There is a WLCG ops coordination meeting this Thursday. Minutes will appear here. Are there any T1 or T2 issues we want raised?
There is also a WLCG middleware readiness meeting on Wednesday UK time 3pm-4pm (minutes appear here). Note ECDF and QMUL are on the site lists, but more participation is needed.

Thursday 8th January

There was a 'virtual only' ops meeting today. Please review the information submitted. Andrea uploaded some slides on WLCG critical services.


Tier-1 - Status Page

Tuesday 27th January

Safety testing of electrical circuits in the R89 machine room has been completed. Some problems were encountered but no critical services affected.
The delayed upgrade of the Castor Nameservers to SL6 took place successfully yesterday, Monday (19th Jan).
The problems with our primary network router are still being followed up. A test was made last Thursday (20th) during which the unit quickly failed. We are following up with the vendor.
Kernel and errata updates (requiring a reboot) are being applied to Castor disk servers this week.

Storage & Data Management - Agendas/Minutes

Wedn 28 Jan

Ready for run 2!
Towards the exascale with GridPP?
Should we spring clean our defunct VOs?

Wedn 21 Jan

Wahid's report on "protocol zoo" - can the set of protocols be simplified (and how long would it take)
Update from RAL's CEPH team.

Wedn 07 Jan

HNY. Most sites ticked over with relatively few glitches.
Next week's pre-GDB on preservation and protocols: harmonising for LHC VOs(?) but we also need to support non-LHC VOs
Apparently no major outstanding issues to sort out prior to run 2.

Wedn 17 Dec

Update from T1 CEPH team.
Last meeting of year. No updating and configuring or touching! Merry and Happy!

Accounting - UK Grid Metrics HEPSPEC06 Atlas Dashboard HS06

Tuesday 20th January

Some APEL delays seen for ECDF and Sheffield.

Tuesday 13th January

Some glitches over the holiday period when the RAL network was lost, but site publishing now looks fine across all sites.

APEL status: An issue at Sheffield?

Documentation - KeyDocs

See the worst KeyDocs list for documents needing review now and the names of the responsible people.

Tuesday 27th January

Reviewed briefly at the core ops meeting last week. Doc owners are to be requested to review the usefulness of their keydocs and report any they believe should be removed from keydoc status.

Monday 8th December

Chris Walker is handing over several 'other VO' documents. Some aspects of the role are being taken on by a combination of Duncan and Daniela... but the documents need reviewing in a core-ops meeting (next Thursday @ 11am being likely).

Tuesday 4th Nov

New section in Wiki called "Project Management Pages".

The idea is to cluster all Self-Edited Site Tracking Tables in here. Sites should keep entries in Current Activities up to date. Once a Self-Edited Site Tracking Table has served its purpose, the PM is to move it to the Historical Archive or otherwise dispose of the table.

Interoperation - EGI ops agendas

Monday 12th January

There was an EGI ops meeting on Monday 12th January. Minutes to follow.
- STORM 1.11.5
- SR: Another reminder to check your contacts
- Call for sites to test FTS3 and SQUID (Glasgow is testing the new squid)
- Multicore accounting: As per broadcast, sites have been asked to publish the number of cores used by jobs by updating the APEL config where appropriate; instructions were given to test publishing in accounting-devel.
- EGI forum in May: slightly delayed because they were waiting for approval to start fine-grained definition of programme. They will circulate the programme soon (end of the month).

Monitoring - Links MyWLCG

Monday 7th December

Meeting last Friday - agenda: https://indico.cern.ch/event/356853/ minutes: https://indico.cern.ch/event/356853/material/minutes/1.pdf
This was the wrap-up meeting of the consolidation TF; the mailing list will remain extant for a while yet.

On-duty - Dashboard ROD rota

Tuesday 27th January

Rota to be updated this week based on previous input.

Tuesday 20th January

Last week was quiet.
Kashif + Gordon (shadowing) next week.
The ROD work is to be included in the GridPP5 D/O/S activity.

Monday 13th January

A couple of tickets for systems with low availability.
There are a couple of tests for QMUL that keep being flagged up although the service is in a "not production but monitored" state.
Tier-1 aware of long standing ticket.

Rollout Status WLCG Baseline

Tuesday 20th January

From Cristina's GDB talk last week note that EMI repositories will be frozen and the product team releases will become UMD-preview (repository content and webpages similar but now managed).

Tuesday 11th November

UMD v.3.9.0 was released Monday 10th November. It supports Scientific Linux 5 and 6 and also Debian 6 (Squeeze).
As proposed during the October OMB, production EGI resource centres will be notified later in the month with a summary broadcast, together with other communications, to reduce the number of broadcasts sent to sites.

References

The Staged Rollout pages (now separated into EMI1 & 2) and the page listing the deployed versions are generated from the BDII, so they should all be reasonably up to date:
http://www.hep.ph.ic.ac.uk/~dbauer/grid/staged_rollout.html
http://www.hep.ph.ic.ac.uk/~dbauer/grid/staged_rollout_emi2.html
http://www.hep.ph.ic.ac.uk/~dbauer/grid/state_of_the_nation.html

Security - Incident Procedure Policies Rota

Tuesday 27th January

Keep watching Pakiti - WNs are popping up (seemingly) unpatched at some sites!

Tuesday 20th January

DPM configuration instructions.
Status of CVE-2014-9322.
Steve's old perf package false positive.

Tuesday 13th January

EGI CSIRT alert 'High' Risk - CVE-2014-9295 - Remote code execution in NTP.
Any issues over holiday period?

Tuesday 16th December

Any update on the FTS3/GFAL bug?

The EGI security dashboard.

Services - PerfSonar dashboard | GridPP VOMS

- This includes notifying of (inter)national services that will have an outage in the coming weeks or will be impacted by work elsewhere. (Cross-check the Tier-1 update).

Tuesday 20th January

The next LHCOPN/LHCONE meeting is in Cambridge 9-10 February.
Perfsonar is expected to be available at sites - where are we with the reinstall?

Tuesday 16th December

For the PS dashboard for the time being use this link.
As of today the following have no data: RHUL; Sheffield; ECDF; Bristol and Sussex.

Monday 8th December

Today was the soft deadline for moving to perfSONAR 3.4. The following sites now appear in the dashboard as live: ... well the dashboard did not load! The hard deadline is 8th January.

Tuesday 2nd December

perfSONAR 3.4 available (63%)
- YES: Imperial; QMUL; RHUL; Lancaster; Liverpool; Manchester; Durham; Glasgow; Bristol; Cambridge; Oxford; RALPP (12)
- NO: RAL T1; Brunel; UCL; Sheffield; ECDF; Birmingham; Sussex (7)

Tickets

Monday 26th January 2015, 14.15 GMT
Back after being forgotten about by me:
Other VO Nagios Status:

At the time of writing I see:
Imperial: gridpp VO job submission errors (but only 34 minutes old so probably naught to worry about).
Brunel: gridpp VO jobs aborted (one of these is 94 days old, so might be something to worry about).
Lancaster: pheno failures (I can't see what's wrong, but this CE only has 10 days left to live).
Sussex: snoplus failures (but I think Sussex is in downtime).
RALPP: A number of failures across a number of CEs, all a few hours old. An SE problem?
Sheffield: gridpp VO job submission failure, but only 6 hours old.
Tier 1: the usual srm-$VONAME failures, which AIUI are caused by an incompatibility between the tests and Castor.
Things are generally looking good.

22 Open UK Tickets this week.
NGI/100IT
111333(22/1)
The NGI has been asked to upgrade the cloud accounting probe, and then notify our (only at the moment) cloud site to republish their accounting. I'm not entirely sure what this entails or who this falls on, so I assigned it to NGI-OPERATIONS (and also noticed that 100IT isn't on the "notify site" list - odd). Assigned (22/1)

TIER 1
108944(1/10/14)
CMS AAA test failures. Andrew Lahiff reported last week that the Tier 1 is building a replacement xrootd box which is currently being prepared. If that will take a while can the ticket be put on hold? In progress (19/1)

QMUL
110353(25/11/14)
An atlas ticket, asking for httpd access at QMUL. The QM chaps were waiting on a production ready Storm that could handle this, and are preparing to test one out. This is another ticket that looks like it might need to be put On Hold (will leave that up to you chaps - there's a big difference between "slow and steady" progress and "no progress for a while"). In progress (21/1)

RHUL
111355(23/1)
A dteam ticket - concerning http access to RHUL's SE. Although the initial observation about the SE certificate being expired was incorrect (the expiry date was reported as 5/1/15, which to be fair I would read as the 5th of January and not the 1st of May!) there still is some underlying problem here with intermittent test failures. Also this ticket raises the question of under what context are these tests being conducted? Anyone know, or shall we ask the submitter? In progress (26/1)
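The date confusion in this ticket is easy to reproduce. As a small illustration (not taken from the ticket itself), the same string parses to two different dates depending on whether the day or the month is assumed to come first:

```python
from datetime import datetime

raw = "5/1/15"  # expiry date as quoted in the ticket discussion

# Day-first reading (UK convention): 5th of January 2015.
day_first = datetime.strptime(raw, "%d/%m/%y").date()

# Month-first reading (US convention): 1st of May 2015.
month_first = datetime.strptime(raw, "%m/%d/%y").date()

print(day_first.isoformat())    # 2015-01-05
print(month_first.isoformat())  # 2015-05-01
```

Reporting expiry dates in ISO 8601 form (YYYY-MM-DD) avoids the ambiguity altogether.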

BIOMED PROBLEMS:
Manchester: 111356(23/1)
Imperial: 111357(23/1)
Biomed are having job problems, which look to be caused by using crusty old WMSes to communicate with these sites' shiny up-to-date CEs. According to ticket 110635 a CREAM-side fix should be out by the end of January (CREAM 1.16.5), although Alessandra suggests that Biomed should try to use newer, working WMSes - or Dirac instead!

Tools - MyEGI Nagios

Tuesday 27th January

Unscheduled outage of the EGI message broker (GRNET) caused a short-lived disruption to GridPP site monitoring (jobs failed) last Thursday 22nd January. Suspect BDII caching meant no immediate failover to stomp://mq.cro-ngi.hr:6163/ from stomp://mq.afroditi.hellasgrid.gr:6163/
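For illustration only, client-side failover between the two broker endpoints named above could look like the sketch below. It is not how the Nagios tooling selects its broker; the function name first_reachable and the injectable connect argument are assumptions made so the logic can be exercised without touching the network.

```python
import socket
from urllib.parse import urlparse

# Endpoints as quoted in the bulletin entry, primary first.
BROKERS = [
    "stomp://mq.afroditi.hellasgrid.gr:6163/",
    "stomp://mq.cro-ngi.hr:6163/",
]


def first_reachable(broker_urls, connect=socket.create_connection, timeout=5.0):
    """Return the first broker URL that accepts a TCP connection, or None."""
    for url in broker_urls:
        parsed = urlparse(url)
        try:
            # A bare TCP connect only shows the port is open, not that STOMP works.
            conn = connect((parsed.hostname, parsed.port), timeout)
        except OSError:
            continue  # unreachable or refused: try the next broker in the list
        conn.close()
        return url
    return None
```

Calling first_reachable(BROKERS) probes the live endpoints in order. A cached broker list, as suspected here, would defeat this kind of check because the client never re-evaluates which endpoint to use.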

Tuesday 25th November

The backup SAM Nagios at Lancaster was upgraded to update-23 as part of the staged rollout process. It is a major upgrade, as some tests were removed from the CE and probes have moved from the SAM repository to the UMD3 repository.

Tests added:

   ch.cern.FTS3-Service
   ch.cern.FTS3-StalledTransfers
   org.bdii.GLUE2-Validate

Tests removed:

   org.nordugrid.ARC-CE-LFC-result
   org.nordugrid.ARC-CE-lfc
   org.nordugrid.ARC-CE-LFC-submit
   org.sam.WN-RepDel
   org.sam.WN-RepISenv
   org.sam.WN-RepFree
   org.sam.WN-RepCr
   org.sam.WN-RepGet
   org.sam.WN-RepRep
   org.sam.WN-Rep

The release note is available here: https://wiki.egi.eu/wiki/SAMUpdate23

Tuesday 21st October

VOMS servers for OPS based at CERN were down on Saturday 18th October for around 12 hours. Nagios tests started failing after the existing proxy expired. Availability figures will be slightly affected but the outage will be considered as unknown.

Blog about VO Nagios

http://southgrid.blogspot.co.uk/2014/10/nagios-monitoring-for-non-lhc-vos.html

Tuesday 16th Sep

Multi VO nagios maintained at Oxford has been upgraded to add ARC CE tests.
https://vo-nagios.physics.ox.ac.uk/nagios/
It is currently monitoring gridpp, pheno, t2k.org, snoplus.snolab.ca, vo.southgrid.ac.uk
Should we start monitoring it more actively and open tickets for sites failing tests?

VOs - GridPP VOMS VO IDs Approved VO table

Tuesday 6th January 2015

GEANT4 has now updated their VOMS records in the ops portal to suit the new servers, lcg-voms2.cern.ch voms2.cern.ch. The Approved VOs document has been updated to match:
https://www.gridpp.ac.uk/wiki/GridPP_approved_VOs

Tuesday 16th December 2014

Discussion of setting up CVMFS for 'other VOs'.
LondonGrid VO now established in CVMFS - decision on top level needed together with who has write access.

Monday 8th December 2014

Some changes from the Ops Portal to these VOs: ALICE, ATLAS, CMS, GEANT4, LHCB, OPS, VO_SIXT.
For each VO, any certificate with a CA_DN field that was: /DC=ch/DC=cern/CN=CERN Trusted Certification Authority replace it with /DC=ch/DC=cern/CN=CERN Grid Certification Authority
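The substitution can be expressed as a simple mapping over VO certificate records. This is a sketch under assumptions: the record layout (a list of dicts with a ca_dn key), the placeholder host names, and the helper name update_ca_dn are hypothetical, and only the two DN strings come from the note above.

```python
OLD_CA_DN = "/DC=ch/DC=cern/CN=CERN Trusted Certification Authority"
NEW_CA_DN = "/DC=ch/DC=cern/CN=CERN Grid Certification Authority"


def update_ca_dn(records):
    """Return a copy of the records with the old CERN CA DN replaced by the new one."""
    return [
        {**rec, "ca_dn": NEW_CA_DN} if rec.get("ca_dn") == OLD_CA_DN else rec
        for rec in records
    ]


# Hypothetical records for illustration; real entries would come from the VO data.
records = [
    {"vo": "atlas", "host": "voms.example.org", "ca_dn": OLD_CA_DN},
    {"vo": "gridpp", "host": "voms.example.ac.uk", "ca_dn": "/C=UK/O=Example/CN=Some Other CA"},
]

for rec in update_ca_dn(records):
    print(rec["vo"], rec["ca_dn"])
```

Records whose CA DN is anything else are passed through unchanged, which matches the "any certificate with a CA_DN field that was ..." wording.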

Monday 24th November 2014

Proxy renewal isn't done by WMS with ARC CEs - https://ggus.eu/?mode=ticket_info&ticket_id=109915

Tuesday 11th November 2014

Status of CERN@School data

Impact
- Citation policy (https://www.gridpp.ac.uk/acknowledging.html)

Site Updates

Tuesday 27th January

Squids not in GOCDB for: UCL; ECDF; Birmingham; Durham; RHUL; IC; Sussex; Lancaster
Squids in GOCDB for: EFDA-JET; Manchester; Liverpool; Cambridge; Sheffield; Bristol; Brunel; QMUL; T1; Oxford; Glasgow; RALPPD.

Tuesday 2nd December

Multicore status. Queues available (63%)
- YES: RAL T1; Brunel; Imperial; QMUL; Lancaster; Liverpool; Manchester; Glasgow; Cambridge; Oxford; RALPP; Sussex (12)
- NO: RHUL (testing); UCL; Sheffield (testing); Durham; ECDF (testing); Birmingham; Bristol (7)

According to our table for cloud/VMs (26%)
- YES: RAL T1; Brunel; Imperial; Manchester; Oxford (5)
- NO: QMUL; RHUL; UCL; Lancaster; Liverpool; Sheffield; Durham; ECDF; Glasgow; Birmingham; Bristol; Cambridge; RALPP; Sussex (14)

GridPP DIRAC jobs successful (58%)
- YES: Bristol; Glasgow; Lancaster; Liverpool; Manchester; Oxford; Sheffield; Brunel; IC; QMUL; RHUL (11)
- NO: Cambridge; Durham; RALPP; RAL T1 (4) + ECDF; Sussex; UCL; Birmingham (4)

IPv6 status
- Allocation - 42%
- YES: RAL T1; Brunel; IC; QMUL; Manchester; Sheffield; Cambridge; Oxford (8)
- NO: RHUL; UCL; Lancaster; Liverpool; Durham; ECDF; Glasgow; Birmingham; Bristol; RALPP; Sussex (11)

Dual stack nodes - 21%
- YES: Brunel; IC; QMUL; Oxford (4)
- NO: RHUL; UCL; Lancaster; Glasgow; Liverpool; Manchester; Sheffield; Durham; ECDF; Birmingham; Bristol; Cambridge; RALPP; Sussex; RAL T1 (15)
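The percentages quoted in these status lists are the YES count over the 19 sites surveyed, rounded to the nearest whole number. A minimal check using the dual-stack lists above (the helper name yes_percentage is just for illustration):

```python
def yes_percentage(yes_sites, no_sites):
    """Percentage of sites in the YES list, rounded to the nearest integer."""
    total = len(yes_sites) + len(no_sites)
    return round(100 * len(yes_sites) / total)


dual_stack_yes = ["Brunel", "IC", "QMUL", "Oxford"]
dual_stack_no = [
    "RHUL", "UCL", "Lancaster", "Glasgow", "Liverpool", "Manchester",
    "Sheffield", "Durham", "ECDF", "Birmingham", "Bristol", "Cambridge",
    "RALPP", "Sussex", "RAL T1",
]

print(yes_percentage(dual_stack_yes, dual_stack_no))  # 21
```

The same calculation gives 63% for 12 of 19, 58% for 11 of 19, 42% for 8 of 19 and 26% for 5 of 19.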

Tuesday 21st October

High loads seen in xroot by several sites: Liverpool and RALT1... and also Bristol (see Luke's TB-S email on 16/10 for questions about changes to help).

Tuesday 9th September

Intel announced the new generation of Xeon based on Haswell.

Meeting Summaries

Project Management Board - Members Minutes Quarterly Reports

Empty

GridPP ops meeting - Agendas Actions Core Tasks

Empty

RAL Tier-1 Experiment Liaison Meeting (Wednesday 13:30) Agenda Meeting takes place on Vidyo.

Wednesday 28th January 2015

Operations report
Only ALICE left using CREAM CEs now and they should be finished with them by 16th February. Will decommission then.
Electrical power circuit testing completed.
The migration of all data off T10000A & B media has been completed.
Ongoing discussions with vendor to investigate problems on Primary Tier1 router.

WLCG Grid Deployment Board - Agendas MB agendas

Empty

NGI UK - Homepage CA

Empty

Events

UK ATLAS - Shifter view News & Links

Empty

UK CMS

Empty

UK LHCb

Empty

UK OTHER

N/A

To note

N/A
