Operations Bulletin 090215

Bulletin archive

Week commencing 2nd February 2015

Task Areas

General updates

Tuesday 3rd February

HEP S/W foundation workshop - agenda link. Notes from meeting.
There was an EGI OMB on Thursday 29th January. A snapshot of areas/updates discussed:

Operations: RC/RP OLA: User SLA Biovel, DRIHM, CompChem.
- QoS: (response) GGUS report monitor: https://ggus.eu/?mode=report_view
- ROD: evolution. Some small. Propose stop monitoring ROD.
- MC accounting: 33% of sites not reporting.
- New global VOs: eli-np.eu (lasers); vo.compass.cern.ch (SPS); fermilab.fnal.gov (Fermi wide)
- Cloud Accounting campaign: FedCloud sites to upgrade probe (OpenStack & OpenNebula)
- Documentation: Upcoming new procedures/manuals (PROC19 - adding resources; PROC20 - CVMFS replication).
- Guide for ops tool developers.
- EGI 2015 conference: Open for registration. April is the early bird deadline. Co-located: OGF44 & Globus meeting.
- Next OMB meeting: 26th February

Security
- Current: CVE-2014-9322 (linux kernel) ongoing. Pakiti issues. Some sites suspended.
- CVE-2015-0235 (glibc) – under consideration.
GOCDB:
- v5.3.1 out for testing.
- Login via x509 or EGI SSO; get_user PI update.
Fedcloud F2F:
- EGI-Engage (standards & capabilities) & INDIGO Datacloud (services) will be funded.
- Use cases: 26 communities with 50 use cases. 43 NatSci; 3 Medical; 3 Humanities.
- Many sites experimenting with Big Data services also for IaaS provisioning
- Cloud accounting working, but improvements still needed.
- Cloud security: response; policies; assessment (to happen)
- VM Management: Using OCCI 1.1. Supports OpenStack (OS), OpenNebula (ON) and Synnefo.
- Data Management: CDMI uses Synnefo. Lack of support.
- Cloud Blueprint docs: Cover GOCDB, monitoring & accounting.
- AAI – AARC and EGI-Engage will push this forward.
- Monitoring: for FedCloud interfaces/services done or in dev.

Earth science e-infrastructures workshop
- Outreach included ES communities (VERCE, CLIPC, EIDA, EPOS…) + EGI.eu+Geant+EUDAT+PRACE.
- Sessions: AAI; Data management; Cloud. Use cases.
- Virtual Earthquake & Seismology Research Community in Europe (VERCE) – iRODS
- Climate4Impact: Data from Earth System Grid Federation (ESGF) 6PB. Data transfer bottlenecks.
- ... + many others...

EGI participation in the PRACE data storage pilot call - brief discussion. Survey for responses.

Evolution of UMD repositories
- Will form managed UMD-preview. Preserve EMI repos characteristics.
- Web pages http://repository.egi.eu/

UMD 3.11 update
- FTS 3
- UMD u11: CREAM 1.16.4 (no VOMS client dependencies); BLAH 1.20.7; Torque 2.1.4; VOMS-admin 2.0.12; voms-clients 3.0.6; DPM 1.8.9; DMLite 0.7.2; CERN DM clients (e.g. gfal2…).

As of yesterday AFS UI at CERN closed. The top directory of the AFS UI installation is /afs/cern.ch/project/gd/LCG-share

WLCG Operations Coordination - Agendas

--

News: An updated version of the agenda for the Okinawa workshop will be produced soon.
MW Baselines: frontier/squid 2.7.STABLE9-22.
MW Issues: Ghost vulnerability. ARGUS failures with latest Java (disable SSLv3 by default) – workaround in place.
MW Services: Various upgrades mentioned for T0 and T1s.
T0 news: Latest VOMS-admin in test. If no issues VOMRS off from 16th Feb. AFS UI closed 2nd Feb.
T1: NTR
T2: Asked about CERN accounts for T2 sysadmins. Should be possible!
ALICE:
ATLAS:
CMS: Bigger run 2 MC started. Doing staging/consistency checks. >50% T1 capacity MC – partitionable slot model so slots will be used either way.
LHCb: Run-1 legacy stripping almost done. Problems due to SARA disk failure. Some issues CERN simulation jobs. APEL issues 2 sites. VOMS tests.
Glexec: With panda - 55 sites covered. T1 and biggest T2s. Issue resolved at BNL. Further ramping ATLAS.
SHA-2: old VOMS retirement – will keep special firewall/router configs until VOMRS gone. VOMS config reminder broadcast soon.
MJF: Site testing – documentation question.
MW readiness: Some steady progress.
IPv6: TBC
Squid mon + HTTP proxy dis TF: Progress. Not all yet registered so will broadcast again. Generate ATLAS/CMS pages for monitoring. Contributors busy. Additional CVMFS repos now available. Some sites put restriction.

Net+Transfer metrics:

Tuesday 3rd February

Next meeting this Thursday 5th February.
- Tier-2 feedback so far: Has there been any progress on CERN accounts or material access?

Tier-1 - Status Page

Tuesday 3rd February

Safety testing of electrical circuits in the R89 machine room has been completed. Some problems were encountered but no critical services affected.
Backup link to CERN moved to a new route this morning. Currently running over that link as a test.
The problems with our primary network router are still being followed up.
Outage for reboot of Castor systems yesterday (to pick up latest OS patches).

Storage & Data Management - Agendas/Minutes

Wedn 28 Jan

Ready for run 2!
Towards the exascale with GridPP?
Should we spring clean our defunct VOs?

Wedn 21 Jan

Wahid's report on "protocol zoo" - can the set of protocols be simplified (and how long would it take)
Update from RAL's CEPH team.

Wedn 07 Jan

HNY. Most sites ticked over with relatively few glitches.
Next week's pre-GDB on preservation and protocols: harmonising for LHC VOs(?) but we also need to support non-LHC VOs
Apparently no major outstanding issues to sort out prior to run 2.

Wedn 17 Dec

Update from T1 CEPH team.
Last meeting of year. No updating and configuring or touching! Merry and Happy!

Accounting - UK Grid Metrics HEPSPEC06 Atlas Dashboard HS06

Tuesday 3rd February

A reminder to keep updating the HEPSPEC06 tables.

Tuesday 20th January

Some APEL delays seen for ECDF and Sheffield.

Tuesday 13th January

Some glitches over holiday period when RAL network lost, but site publishing now looks fine across all sites.

APEL status: An issue at Sheffield?

Documentation - KeyDocs

See the worst KeyDocs list for documents needing review now and the names of the responsible people.

Survey of KeyDocs
https://www.gridpp.ac.uk/wiki/Current_Activities

Tuesday 3 Feb

New VO, LSST, in Approved VOs: https://www.gridpp.ac.uk/w/index.php?title=GridPP_approved_VOs

Tuesday 27th January

Reviewed briefly at the core ops meeting last week. Doc owners are to be requested to review the usefulness of their keydocs and report any they believe should be removed from keydoc status.

Monday 8th December

Chris Walker is handing over several 'other VO' documents. Some aspects of the role are being taken on by a combination of Duncan and Daniela... but the documents need reviewing in a core-ops meeting (next Thursday @ 11am being likely).

Tuesday 4th Nov

New section in Wiki called "Project Management Pages".

The idea is to cluster all Self-Edited Site Tracking Tables in here. Sites should keep entries in Current Activities up to date. Once a Self-Edited Site Tracking Table has served its purpose, the PM is to move it to the Historical Archive or otherwise dispose of the table.

Interoperation - EGI ops agendas

Monday 12th January

There was an EGI ops meeting on Monday 12th January. Minutes to follow.
- STORM 1.11.5
- SR: Another reminder to check your contacts
- Call for sites to test FTS3 and SQUID (Glasgow is testing the new squid)
- Multicore accounting: As per broadcast, sites have been asked to publish the number of cores used by jobs by updating the APEL config where appropriate; instructions were given for testing publishing in accounting-devel.
- EGI forum in May: slightly delayed because they were waiting for approval to start fine-grained definition of programme. They will circulate the programme soon (end of the month).

Monitoring - Links MyWLCG

Monday 7th December

Meeting last Friday - agenda: https://indico.cern.ch/event/356853/ minutes: https://indico.cern.ch/event/356853/material/minutes/1.pdf
This was the wrap-up meeting of the consolidation TF; the mailing list will remain extant for a while yet.

On-duty - Dashboard ROD rota

Tuesday 27th January

Rota to be updated this week based on previous input.

Tuesday 20th January

Last week was quiet.
Kashif + Gordon (shadowing) next week.
The ROD work is to be included in the GridPP5 D/O/S activity.

Monday 13th January

A couple of tickets for systems with low availability.
There are a couple of tests for QMUL that keep being flagged up although the service is in a "not production but monitored" state.
Tier-1 aware of long standing ticket.

Rollout Status WLCG Baseline

Tuesday 20th January

From Cristina's GDB talk last week note that EMI repositories will be frozen and the product team releases will become UMD-preview (repository content and webpages similar but now managed).

Tuesday 11th November

UMD v.3.9.0 was released Monday 10th November. It supports Scientific Linux 5 and 6 and also Debian 6 (Squeeze).
As proposed during the October OMB, production EGI resource centres will be notified later in the month with a summary broadcast together with other communications, to reduce the number of broadcasts sent to sites.

References

Staged Rollout pages (now separated into EMI1 & 2), and the page listing the deployed versions is extractable from the bdii, so they should all be reasonably up-to-date:
http://www.hep.ph.ic.ac.uk/~dbauer/grid/staged_rollout.html
http://www.hep.ph.ic.ac.uk/~dbauer/grid/staged_rollout_emi2.html
http://www.hep.ph.ic.ac.uk/~dbauer/grid/state_of_the_nation.html

Security - Incident Procedure Policies Rota

Tuesday 27th January

Keep watching Pakiti - WNs are popping up (seemingly) unpatched at some sites!

Tuesday 20th January

DPM configuration instructions.
Status of CVE-2014-9322.
Steve's old perf package false positive.

Tuesday 13th January

EGI CSIRT alert 'High' Risk - CVE-2014-9295 - Remote code execution in NTP.
Any issues over holiday period?

Tuesday 16th December

Any update on the FTS3/GFAL bug?

The EGI security dashboard.

Services - PerfSonar dashboard | GridPP VOMS

- This includes notifying of (inter)national services that will have an outage in the coming weeks or will be impacted by work elsewhere. (Cross-check the Tier-1 update).

Tuesday 3rd February

There was a network and transfer metrics WG meeting last week. Any feedback?

Tuesday 20th January

The next LHCOPN/LHCONE meeting is in Cambridge 9-10 February.
Perfsonar is expected to be available at sites - where are we with the reinstall?

Tickets

Monday 2nd February 2015, 14.00 GMT
22 Open UK tickets this month.

SUSSEX
110389 (26/11/14)
A perfsonar ticket for Sussex. Their perfsonar has been reinstalled, but needs soothing. Matt has informed us that this might have to wait a few weeks due to other issues. On Hold (21/1)

RALPP
110536 (2/12/14)
MICE job failures at RALPP - it looked like they were dying due to running out of memory. The queues have been tweaked to give MICE more, but no word from MICE on whether this has solved the problem. Waiting for reply (12/1)

BRISTOL
110365 (25/11/14)
Another perfsonar ticket. Again the node is reinstalled, just not quite working right. Winnie is waiting for news from the other sites in a similar boat. In progress (maybe On Hold it?) (20/1)

EDINBURGH
111118 (12/1)
ECDF "low availability" ticket - just waiting for the silly alarm to clear. Daniela submitted a ticket about this foolish alarm a while ago - 107689. On Hold (19/1)

95303 (1/7/13)
glexec tarball ticket. With my tarball hat on - still no positive news on this front - it's beginning to look like this can't be done but we're having one last go. Sorry! On Hold (19/12)

MANCHESTER
110225 (18/11/14)
Change of VO Manager for helios-vo.eu. It looks like this ticket is being held up at the user end a lot. I'm not sure there's anything we can do as it involves outside CAs. On Hold (20/1)

111356 (23/1)
One of Manchester's CEs not working for biomed, due to problems with the new CREAM/old WMS communication. Alessandra gave biomed some sage advice, but I suspect this ticket will need to be prodded soon to get a response from biomed (who I agree should use a newer WMS and close it). On Hold (26/1)

LANCASTER
111547 (2/2)
I'm reporting on a ticket that I submitted to myself today. I'm not sure what that says about the world. Anyway - a ticket to track the decommissioning of one of Lancaster's CEs, as we try to do it all proper like. On Hold (2/2)

100566 (21/1/14)
Lancaster's perfsonar ticket, which I sadly let reach its first birthday. I've been prodding this offline. Does anyone have the address for a regular, open iperf endpoint I could borrow? On Hold (9/1)

95299 (1/7/13)
Lancaster's tarball glexec ticket, as the ECDF one. On hold (26/1)

UCL
95299 (1/7/13)
UCL's glexec ticket. They've been having trouble getting it to behave, and at last check Ben was off ill - probably due to dealing with glexec :-) On Hold (20/1)

QMUL
110353 (25/11/14)
Atlas asking for QM's storage to be made available via https. Waiting on a production ready STORM that can provide this - Dan is trying it out on his testbed se02.esc.qmul.ac.uk, which still needs tweaking. In progress (28/1)

IMPERIAL
111357 (23/1)
One of the IC CEs not working for biomed. Similar to the Manchester ticket, Daniela points to ticket 110635 and is waiting on an EMI release to fix it (due out imminently AIUI). On Hold (28/1)

EFDA-JET
97485 (21/9/13)
Jet's LHCb job failure tickets. I'm afraid I haven't been able to chase this up (partly due to only ever remembering on the first Monday of the month) - there's been no news for a while. On Hold (1/10/14)

100IT
111333 (22/1)
A ticket to 100IT and the NGI to get the cloud accounting probe upgraded. I notified 100IT, but forgot to reassign the ticket - thanks to Jeremy for doing it. Assigned (2/2)

108356 (10/9/14)
Getting VMcatcher working at 100IT. David from 100IT has asked for some answers on which "glancepush" to use, but no reply for a while. Waiting for reply (19/1)

TIER 1
111477 (29/1)
CMS would like to run some staging tests to warm up for Run2. The Tier 1 warned CMS of today's outage and they're happy to proceed tomorrow (the 3rd) - I think they'd like a response. In progress (30/1)

107935 (27/8/14)
A ticket regarding inconsistent BDII and SRM storage numbers. Waiting on a fix from the developers regarding read-only disk accounting (I think), Brian is still on the case. Stephen B let us know that Maria the ticket submitter is on maternity leave, and asks in her stead if the numbers are expected to align now. On hold (28/1)

111120 (12/1)
An atlas ticket about a large number of data transfer errors seen between RAL and BNL. Brian reckoned that this was due to shallow checksums on the old data being transferred, but had trouble looking at the BNL FTS. Regardless, the ADCoS shifter hadn't seen any errors for a week and suggests the ticket can be closed. Waiting for reply (29/1)

108944 (1/10/14)
CMS AAA test problems at RAL. After setting up a new xrootd box the test failures have changed in nature, but sadly they're still failures. In progress (29/1)

111347 (22/1)
CMS Consistency Check for RAL, January 2015 edition. Filelists were generated, orphan files were identified, then purged. Just need to know what CMS want to do next. Waiting for reply (26/1)

109694 (28/10/14)
Sno+ ticket concerning gfal tool problems, waiting on the new release to come out (middle of this month I believe). If you don't want to wait that long then I believe the 2.8 gfal2 tools can be found in the fts3 repo at last check. On hold (20/1)

Tools - MyEGI Nagios

Tuesday 27th January

Unscheduled outage of the EGI message broker (GRNET) caused a short-lived disruption to GridPP site monitoring (jobs failed) last Thursday 22nd January. Suspect BDII caching meant no immediate failover to stomp://mq.cro-ngi.hr:6163/ from stomp://mq.afroditi.hellasgrid.gr:6163/
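
The failover described above (try the primary broker, fall back to the secondary) can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function names and the ordering of the list are hypothetical, it is not how the SAM Nagios box selects its broker, and a successful TCP connect only shows the port is open, not that the STOMP service is healthy.

```python
import socket
from typing import Callable, Iterable, Optional
from urllib.parse import urlsplit


def tcp_probe(host: str, port: int, timeout: float = 3.0) -> bool:
    """Return True if a TCP connection to host:port can be opened."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False


def first_reachable(
    brokers: Iterable[str],
    probe: Callable[[str, int], bool] = tcp_probe,
) -> Optional[str]:
    """Return the first broker URL whose host:port answers the probe."""
    for url in brokers:
        parts = urlsplit(url)
        if parts.hostname is None or parts.port is None:
            continue  # skip entries without an explicit host and port
        if probe(parts.hostname, parts.port):
            return url
    return None


# Broker endpoints quoted in the note above; the order is an assumption.
BROKERS = [
    "stomp://mq.afroditi.hellasgrid.gr:6163/",
    "stomp://mq.cro-ngi.hr:6163/",
]
```

Calling `first_reachable(BROKERS)` on each run, rather than relying on a cached endpoint, is one way to avoid waiting for a cache to expire; whether that suits the real setup is untested here.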

Tuesday 25th November

Backup SAM Nagios at Lancaster was upgraded to update-23 as part of the staged rollout process. It is a major upgrade, as some tests were removed from the CE and the probes have moved from the SAM repository to the UMD3 repository.

Tests added:

   ch.cern.FTS3-Service
   ch.cern.FTS3-StalledTransfers
   org.bdii.GLUE2-Validate

Tests removed:

   org.nordugrid.ARC-CE-LFC-result
   org.nordugrid.ARC-CE-lfc
   org.nordugrid.ARC-CE-LFC-submit
   org.sam.WN-RepDel
   org.sam.WN-RepISenv
   org.sam.WN-RepFree
   org.sam.WN-RepCr
   org.sam.WN-RepGet
   org.sam.WN-RepRep
   org.sam.WN-Rep

The release notes are available here: https://wiki.egi.eu/wiki/SAMUpdate23

Tuesday 21st October

VOMS servers for OPS based at CERN were down on Saturday 18th October for around 12 hours. Nagios tests started failing after the existing proxy expired. Availability figures will be slightly affected but the outage will be considered as unknown.

Blog about VO Nagios

http://southgrid.blogspot.co.uk/2014/10/nagios-monitoring-for-non-lhc-vos.html

Tuesday 16th Sep

Multi VO nagios maintained at Oxford has been upgraded to add ARC CE tests.
https://vo-nagios.physics.ox.ac.uk/nagios/
It is currently monitoring gridpp, pheno, t2k.org, snoplus.snolab.ca, vo.southgrid.ac.uk
Should we start monitoring it more actively and open tickets for sites failing tests?

VOs - GridPP VOMS VO IDs Approved VO table

Tuesday 6th January 2015

GEANT4 has now updated their VOMS records in the ops portal to suit the new servers, lcg-voms2.cern.ch and voms2.cern.ch. The Approved VOs document has been updated to match:
https://www.gridpp.ac.uk/wiki/GridPP_approved_VOs

Tuesday 16th December 2014

Discussion of setting up CVMFS for 'other VOs'.
LondonGrid VO now established in CVMFS - decision on top level needed together with who has write access.

Impact
- Citation policy (https://www.gridpp.ac.uk/acknowledging.html)

Site Updates

Tuesday 27th January

Squids not in GOCDB for: UCL; ECDF; Birmingham; Durham; RHUL; IC; Sussex; Lancaster
Squids in GOCDB for: EFDA-JET; Manchester; Liverpool; Cambridge; Sheffield; Bristol; Brunel; QMUL; T1; Oxford; Glasgow; RALPPD.

Tuesday 2nd December

Multicore status. Queues available (63%)
- YES: RAL T1; Brunel; Imperial; QMUL; Lancaster; Liverpool; Manchester; Glasgow; Cambridge; Oxford; RALPP; Sussex (12)
- NO: RHUL (testing); UCL; Sheffield (testing); Durham; ECDF (testing); Birmingham; Bristol (7)

According to our table for cloud/VMs (26%)
- YES: RAL T1; Brunel; Imperial; Manchester; Oxford (5)
- NO: QMUL; RHUL; UCL; Lancaster; Liverpool; Sheffield; Durham; ECDF; Glasgow; Birmingham; Bristol; Cambridge; RALPP; Sussex (14)

GridPP DIRAC jobs successful (58%)
- YES: Bristol; Glasgow; Lancaster; Liverpool; Manchester; Oxford; Sheffield; Brunel; IC; QMUL; RHUL (11)
- NO: Cambridge; Durham; RALPP; RAL T1 (4) + ECDF; Sussex; UCL; Birmingham (4)

IPv6 status
- Allocation - 42%
- YES: RAL T1; Brunel; IC; QMUL; Manchester; Sheffield; Cambridge; Oxford (8)
- NO: RHUL; UCL; Lancaster; Liverpool; Durham; ECDF; Glasgow; Birmingham; Bristol; RALPP; Sussex (11)

Dual stack nodes - 21%
- YES: Brunel; IC; QMUL; Oxford (4)
- NO: RHUL; UCL; Lancaster; Glasgow; Liverpool; Manchester; Sheffield; Durham; ECDF; Birmingham; Bristol; Cambridge; RALPP; Sussex; RAL T1 (15)
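
The percentages quoted in the status lists above are the YES count divided by the total number of sites (19). A minimal Python sketch of that arithmetic follows; the helper name is hypothetical, the counts are taken from the lists, the DIRAC NO figure is the two groups of 4 added together, and the IPv6 allocation NO figure (11) is counted from the names listed.

```python
def percent_yes(yes: int, no: int) -> int:
    """Return the share of YES sites as a whole-number percentage."""
    total = yes + no
    if total == 0:
        raise ValueError("no sites counted")
    return round(100 * yes / total)


# (YES, NO) counts per category, copied from the lists above.
status = {
    "Multicore queues": (12, 7),
    "Cloud/VMs": (5, 14),
    "GridPP DIRAC jobs": (11, 8),
    "IPv6 allocation": (8, 11),
    "Dual stack nodes": (4, 15),
}

summary = {name: percent_yes(yes, no) for name, (yes, no) in status.items()}

for name, pct in summary.items():
    print(f"{name}: {pct}%")
```

Keeping the counts in one table like this makes it easy to re-derive the headline figures when a site moves from NO to YES.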

Tuesday 21st October

High loads seen in xroot by several sites: Liverpool and RALT1... and also Bristol (see Luke's TB-S email on 16/10 for questions about changes to help).

Tuesday 9th September

Intel announced the new generation of Xeon based on Haswell.

Meeting Summaries

Project Management Board - Members Minutes Quarterly Reports

Empty

GridPP ops meeting - Agendas Actions Core Tasks

Empty

RAL Tier-1 Experiment Liaison Meeting (Wednesday 13:30) Agenda Meeting takes place on Vidyo.

Wednesday 28th January 2015

Operations report
Only ALICE left using CREAM CEs now and they should be finished with them by 16th February. Will decommission then.
Electrical power circuit testing completed.
The migration of all data off T10000A & B media has been completed.
Ongoing discussions with vendor to investigate problems on Primary Tier1 router.

WLCG Grid Deployment Board - Agendas MB agendas

Empty

NGI UK - Homepage CA

Empty

Events

UK ATLAS - Shifter view News & Links

Empty

UK CMS

Empty

UK LHCb

Empty

UK OTHER

N/A

To note

N/A
