Operations Bulletin 090215

From GridPP Wiki

Bulletin archive


Week commencing 2nd February 2015
Task Areas
General updates


Tuesday 3rd February

  • Operations: RC/RP OLA: User SLA Biovel, DRIHM, CompChem.
  • Security
    • Current: CVE-2014-9322 (linux kernel) ongoing. Pakiti issues. Some sites suspended.
    • CVE-2015-0235 (glibc) – under consideration.
  • GOCDB:
  • Fedcloud F2F:
    • EGI-Engage (standards & capabilities) & INDIGO Datacloud (services) will be funded.
    • Use cases: 26 communities with 50 use cases. 43 NatSci; 3 Medical; 3 Humanities.
    • Many sites experimenting with Big Data services also for IaaS provisioning
    • Cloud accounting working but to be improved.
    • Cloud security: response; policies; assessment (to happen)
    • VM Management: Using OCCI 1.1. Supports OpenStack (OS), OpenNebula (ON) and Synnefo.
    • Data Management: CDMI uses Synnefo. Lack of support.
    • Cloud Blueprint docs: Cover GOCDB, monitoring & accounting.
    • AAI – AARC and EGI-Engage will push this forward.
    • Monitoring: for FedCloud interfaces/services done or in dev.
  • Earth science e-infrastructures workshop
    • Outreach included ES communities (VERCE, CLIPC, EIDA, EPOS…) + EGI.eu+Geant+EUDAT+PRACE.
    • Sessions: AAI; Data management; Cloud. Use cases.
    • Virtual Earthquake & Seismology Research Community in Europe (VERCE) – iRODS
    • Climate4Impact: Data from Earth System Grid Federation (ESGF) 6PB. Data transfer bottlenecks.
    • ... + many others...
  • EGI participation in PRACE data storage pilot call - brief discussion. Survey for responses.
  • UMD 3.11 update
    • FTS 3
    • UMD u11: CREAM 1.16.4 (no VOMS client dependencies); BLAH 1.20.7; Torque 2.1.4; VOMS-admin 2.0.12; voms-clients 3.0.6; DPM 1.8.9; DMLite 0.7.2; CERN DM clients (e.g. gfal2…).
  • As of yesterday AFS UI at CERN closed. The top directory of the AFS UI installation is /afs/cern.ch/project/gd/LCG-share


WLCG Operations Coordination - Agendas

--

  • News: An updated version of the agenda for the Okinawa workshop will be produced soon.
  • MW Baselines: frontier/squid 2.7.STABLE9-22.
  • MW Issues: Ghost vulnerability. ARGUS failures with latest Java (disable SSLv3 by default) – workaround in place.
  • MW Services: Various upgrades mentioned for T0 and T1s.
  • T0 news: Latest VOMS-admin in test. If no issues VOMRS off from 16th Feb. AFS UI closed 2nd Feb.
  • T1: NTR
  • T2: Asked about CERN accounts for T2 sysadmins. Should be possible!
  • ALICE:
  • ATLAS:
  • CMS: Bigger run 2 MC started. Doing staging/consistency checks. >50% T1 capacity MC – partitionable slot model so slots will be used either way.
  • LHCb: Run-1 legacy stripping almost done. Problems due to SARA disk failure. Some issues with CERN simulation jobs. APEL issues at 2 sites. VOMS tests.
  • Glexec: With panda - 55 sites covered. T1 and biggest T2s. Issue resolved at BNL. Further ramping ATLAS.
  • SHA-2: old VOMS retirement – will keep special firewall/router configs until VOMRS gone. VOMS config reminder broadcast soon.
  • MJF: Site testing – documentation question.
  • MW readiness: Some steady progress.
  • IPv6: TBC
  • Squid mon + HTTP proxy dis TF: Progress. Not all yet registered so will broadcast again. Generate ATLAS/CMS pages for monitoring. Contributors busy. Additional CVMFS repos now available. Some sites have put restrictions in place.

Net+Transfer metrics:


Tuesday 3rd February

  • Next meeting this Thursday 5th February.
    • Tier-2 feedback so far: Has there been any progress on CERN accounts or material access?
Tier-1 - Status Page

Tuesday 3rd February

  • Safety testing of electrical circuits in the R89 machine room has been completed. Some problems were encountered but no critical services affected.
  • Backup link to CERN moved to a new route this morning. Currently running over that link as a test.
  • The problems with our primary network router are still being followed up.
  • Outage for reboot of Castor systems yesterday (to pick up latest OS patches).
Storage & Data Management - Agendas/Minutes

Wedn 28 Jan

  • Ready for run 2!
  • Towards the exascale with GridPP?
  • Should we spring clean our defunct VOs?

Wedn 21 Jan

  • Wahid's report on "protocol zoo" - can the set of protocols be simplified (and how long would it take)
  • Update from RAL's CEPH team.

Wedn 07 Jan

  • HNY. Most sites ticked over with relatively few glitches.
  • Next week's pre-GDB on preservation and protocols: harmonising for LHC VOs(?) but we also need to support non-LHC VOs
  • Apparently no major outstanding issues to sort out prior to run 2.

Wedn 17 Dec

  • Update from T1 CEPH team.
  • Last meeting of year. No updating and configuring or touching! Merry and Happy!



Accounting - UK Grid Metrics HEPSPEC06 Atlas Dashboard HS06

Tuesday 3rd February

Tuesday 20th January

  • Some APEL delays seen for ECDF and Sheffield.

Tuesday 13th January

  • Some glitches over holiday period when RAL network lost, but site publishing now looks fine across all sites.
Documentation - KeyDocs

See the worst KeyDocs list for documents needing review now and the names of the responsible people.

Tuesday 3 Feb

New VO, LSST, in Approved VOs: https://www.gridpp.ac.uk/w/index.php?title=GridPP_approved_VOs


Tuesday 27th January

  • Reviewed briefly at the core ops meeting last week. Doc owners are to be requested to review the usefulness of their keydocs and report any they believe should be removed from keydoc status.

Monday 8th December

  • Chris Walker is handing over several 'other VO' documents. Some aspects of the role are being taken on by a combination of Duncan and Daniela... but the documents need reviewing in a core-ops meeting (next Thursday @ 11am being likely).

Tuesday 4th Nov

  • New section in Wiki called "Project Management Pages". The idea is to cluster all Self-Edited Site Tracking Tables in here. Sites should keep entries in Current Activities up to date. Once a Self-Edited Site Tracking Table has served its purpose, the PM is to move it to the Historical Archive or otherwise dispose of the table.
Interoperation - EGI ops agendas

Monday 12th January

  • There was an EGI ops meeting on Monday 12th January. Minutes to follow.
    • STORM 1.11.5
    • SR: Another reminder to check your contacts
    • Call for sites to test FTS3 and SQUID (Glasgow is testing the new squid)
    • Multicore accounting: As per broadcast, sites have been asked to publish the number of cores used by jobs by updating the APEL config where appropriate: instructions given to test publishing in accounting-devel.
    • EGI forum in May: slightly delayed because they were waiting for approval to start fine-grained definition of programme. They will circulate the programme soon (end of the month).


Monitoring - Links MyWLCG

Monday 7th December

On-duty - Dashboard ROD rota

Tuesday 27th January

  • Rota to be updated this week based on previous input.

Tuesday 20th January

  • Last week was quiet.
  • Kashif + Gordon (shadowing) next week.
  • The ROD work is to be included in the GridPP5 D/O/S activity.

Monday 13th January

  • A couple of tickets for systems with low availability.
  • There are a couple of tests for QMUL that keep being flagged up although the service is in a "not production but monitored" state.
  • Tier-1 aware of long standing ticket.
Rollout Status WLCG Baseline

Tuesday 20th January

  • From Cristina's GDB talk last week note that EMI repositories will be frozen and the product team releases will become UMD-preview (repository content and webpages similar but now managed).

Tuesday 11th November

  • UMD v.3.9.0 was released Monday 10th November. It supports Scientific Linux 5 and 6 and also Debian 6 (Squeeze).
  • As proposed during the October OMB, production EGI resource centres will be notified later in the month with a summary broadcast together with other communications, to reduce the number of broadcasts sent to sites.


References


Security - Incident Procedure Policies Rota

Tuesday 27th January

  • Keep watching Pakiti - WNs are popping up (seemingly) unpatched at some sites!


Tuesday 20th January

  • DPM configuration instructions.
  • Status of CVE-2014-9322.
  • Steve's old perf package false positive.

Tuesday 13th January

  • EGI CSIRT alert 'High' Risk - CVE-2014-9295 - Remote code execution in NTP.
  • Any issues over holiday period?

Tuesday 16th December

  • Any update on the FTS3/GFAL bug?


Services - PerfSonar dashboard | GridPP VOMS

- This includes notifying of (inter)national services that will have an outage in the coming weeks or will be impacted by work elsewhere. (Cross-check the Tier-1 update).

Tuesday 3rd February

Tuesday 20th January

Tickets

Monday 2nd February 2015, 14.00 GMT
22 Open UK tickets this month.

SUSSEX
110389 (26/11/14)
A perfsonar ticket for Sussex. Their perfsonar has been reinstalled, but needs soothing. Matt has informed us that this might have to wait a few weeks due to other issues. On Hold (21/1)

RALPP
110536 (2/12/14)
MICE job failures at RALPP - it looked like they were dying due to running out of memory. The queues have been tweaked to give MICE more, but no word from MICE on whether this has solved the problem. Waiting for reply (12/1)

BRISTOL
110365 (25/11/14)
Another perfsonar ticket. Again the node is reinstalled, just not quite working right. Winnie is waiting for news from the other sites in a similar boat. In progress (maybe On Hold it?) (20/1)

EDINBURGH
111118 (12/1)
ECDF "low availability" ticket - just waiting for the silly alarm to clear. Daniela submitted a ticket about this foolish alarm a while ago - 107689. On Hold (19/1)

95303 (1/7/13)
glexec tarball ticket. With my tarball hat on - still no positive news on this front - it's beginning to look like this can't be done but we're having one last go. Sorry! On Hold (19/12)

MANCHESTER
110225 (18/11/14)
Change of VO Manager for helios-vo.eu. It looks like this ticket is being held up at the user end a lot. I'm not sure there's anything we can do as it involves outside CAs. On Hold (20/1)

111356 (23/1)
One of Manchester's CEs not working for biomed, due to problems with the new CREAM/old WMS communication. Alessandra gave biomed some sage advice, but I suspect this ticket will need to be prodded soon to get a response from biomed (who I agree should use a newer WMS and close it). On Hold (26/1)

LANCASTER
111547 (2/2)
I'm reporting on a ticket that I submitted to myself today. I'm not sure what that says about the world. Anyway - a ticket to track the decommissioning of one of Lancaster's CEs, as we try to do it all proper like. On Hold (2/2)

100566 (21/1/14)
Lancaster's perfsonar ticket, which I sadly let reach its first birthday. I've been prodding this offline, does anyone have the address for a regular, open iperf endpoint I could borrow? On Hold (9/1)

95299 (1/7/13)
Lancaster's tarball glexec ticket, as the ECDF one. On hold (26/1)

UCL
95299 (1/7/13)
UCL's glexec ticket. They've been having trouble getting it to behave, and at last check Ben was off ill - probably due to dealing with glexec :-) On Hold (20/1)

QMUL
110353 (25/11/14)
Atlas asking for QM's storage to be made available via https. Waiting on a production ready STORM that can provide this - Dan is trying it out on his testbed se02.esc.qmul.ac.uk, which still needs tweaking. In progress (28/1)

IMPERIAL
111357 (23/1)
One of the IC CEs not working for biomed. Similar to the Manchester ticket, Daniela points to ticket 110635 and is waiting on an EMI release to fix it (due out imminently AIUI). On Hold (28/1)

EFDA-JET
97485 (21/9/13)
Jet's LHCb job failure tickets. I'm afraid I haven't been able to chase this up (partly due to only ever remembering on the first Monday of the month) - there's been no news for a while. On Hold (1/10/14)

100IT
111333 (22/1)
A ticket to 100IT and the NGI to get the cloud accounting probe upgraded. I notified 100IT, but forgot to reassign the ticket - thanks to Jeremy for doing it. Assigned (2/2)

108356 (10/9/14)
Getting VMcatcher working at 100IT. David from 100IT has asked for some answers on which "glancepush" to use, but no reply for a while. Waiting for reply (19/1)

TIER 1
111477 (29/1)
CMS would like to run some staging tests to warm up for Run2. The Tier 1 warned CMS of today's outage and they're happy to proceed tomorrow (the 3rd) - I think they'd like a response. In progress (30/1)

107935 (27/8/14)
A ticket regarding inconsistent BDII and SRM storage numbers. Waiting on a fix from the developers regarding read-only disk accounting (I think), Brian is still on the case. Stephen B let us know that Maria the ticket submitter is on maternity leave, and asks in her stead if the numbers are expected to align now. On hold (28/1)

111120 (12/1)
An atlas ticket about a large number of data transfer errors seen between RAL and BNL. Brian reckoned that this was due to shallow checksums on the old data being transferred, but had trouble looking at the BNL FTS. Regardless, the ADCoS shifter hadn't seen any errors for a week and suggests the ticket can be closed. Waiting for reply (29/1)

108944 (1/10/14)
CMS AAA test problems at RAL. After setting up a new xrootd box the test failures have changed in nature, but sadly they're still failures. In progress (29/1)

111347 (22/1)
CMS Consistency Check for RAL, January 2015 edition. Filelists were generated, orphan files were identified, then purged. Just need to know what CMS want to do next. Waiting for reply (26/1)

109694 (28/10/14)
Sno+ ticket concerning gfal tool problems, waiting on the new release to come out (middle of this month I believe). If you don't want to wait that long then I believe the 2.8 gfal2 tools can be found in the fts3 repo at last check. On hold (20/1)


Tools - MyEGI Nagios

Tuesday 27th January

  • Unscheduled outage of the EGI message broker (GRNET) caused a short-lived disruption to GridPP site monitoring (jobs failed) last Thursday 22nd January. Suspect BDII caching meant no immediate failover to stomp://mq.cro-ngi.hr:6163/ from stomp://mq.afroditi.hellasgrid.gr:6163/
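
The failover that did not happen promptly can be pictured as an ordered walk over the two broker endpoints quoted above. The following is only a minimal sketch under stated assumptions: the helper names `tcp_reachable` and `pick_broker` are hypothetical, a plain TCP connect stands in for a real STOMP handshake, and this is not how the SAM Nagios probes or the BDII actually select a broker.

```python
import socket
from urllib.parse import urlparse

# Broker URLs as quoted in the bulletin, primary first.
BROKERS = [
    "stomp://mq.afroditi.hellasgrid.gr:6163/",
    "stomp://mq.cro-ngi.hr:6163/",
]


def tcp_reachable(host, port, timeout=5.0):
    """Return True if a TCP connection to host:port opens within timeout.

    Assumption: an open TCP port is treated as "broker is up"; a real
    client would also need a successful STOMP CONNECT exchange.
    """
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False


def pick_broker(broker_urls, is_reachable=tcp_reachable):
    """Return the first broker URL whose endpoint is reachable, else None."""
    for url in broker_urls:
        parsed = urlparse(url)
        if is_reachable(parsed.hostname, parsed.port):
            return url
    return None
```

Calling `pick_broker(BROKERS)` probes the live endpoints in order; passing a custom `is_reachable` callable allows the selection logic to be exercised without network access. A cached broker list, as suspected in the incident, would correspond to skipping this check and reusing a stale answer.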

Tuesday 25th November

Backup SAM Nagios at Lancaster was upgraded to update-23 as part of the staged rollout process. It is a major upgrade: some tests were removed from the CE, and the probes have moved from the SAM repository to the UMD3 repository.

Tests added:

   ch.cern.FTS3-Service
   ch.cern.FTS3-StalledTransfers
   org.bdii.GLUE2-Validate 

Tests removed:

   org.nordugrid.ARC-CE-LFC-result
   org.nordugrid.ARC-CE-lfc
   org.nordugrid.ARC-CE-LFC-submit
   org.sam.WN-RepDel
   org.sam.WN-RepISenv
   org.sam.WN-RepFree
   org.sam.WN-RepCr
   org.sam.WN-RepGet
   org.sam.WN-RepRep
   org.sam.WN-Rep 

The release note is available here: https://wiki.egi.eu/wiki/SAMUpdate23


Tuesday 21st October

VOMS servers for OPS based at CERN were down on Saturday 18th October for around 12 hours. Nagios tests started failing after the existing proxy expired. Availability figures will be slightly affected, but the outage will be considered as unknown.

Blog about VO Nagios

http://southgrid.blogspot.co.uk/2014/10/nagios-monitoring-for-non-lhc-vos.html


Tuesday 16th Sep

  • Multi VO nagios maintained at Oxford has been upgraded to add ARC CE tests.
  • https://vo-nagios.physics.ox.ac.uk/nagios/
  • It is currently monitoring gridpp, pheno, t2k.org, snoplus.snolab.ca, vo.southgrid.ac.uk
  • Should we start monitoring it more actively and open tickets for sites failing tests?


VOs - GridPP VOMS VO IDs Approved VO table

Tuesday 6th January 2015


Tuesday 16th December 2014

  • Discussion of setting up CVMFS for 'other VOs'.
  • LondonGrid VO now established in CVMFS - decision on top level needed together with who has write access.


Site Updates

Tuesday 27th January

  • Squids not in GOCDB for: UCL; ECDF; Birmingham; Durham; RHUL; IC; Sussex; Lancaster
  • Squids in GOCDB for: EFDA-JET; Manchester; Liverpool; Cambridge; Sheffield; Bristol; Brunel; QMUL; T1; Oxford; Glasgow; RALPPD.

Tuesday 2nd December

  • Multicore status. Queues available (63%)
    • YES: RAL T1; Brunel; Imperial; QMUL; Lancaster; Liverpool; Manchester; Glasgow; Cambridge; Oxford; RALPP; Sussex (12)
    • NO: RHUL (testing); UCL; Sheffield (testing); Durham; ECDF (testing); Birmingham; Bristol (7)
  • According to our table for cloud/VMs (26%)
    • YES: RAL T1; Brunel; Imperial; Manchester; Oxford (5)
    • NO: QMUL; RHUL; UCL; Lancaster; Liverpool; Sheffield; Durham; ECDF; Glasgow; Birmingham; Bristol; Cambridge; RALPP; Sussex (14)
  • GridPP DIRAC jobs successful (58%)
    • YES: Bristol; Glasgow; Lancaster; Liverpool; Manchester; Oxford; Sheffield; Brunel; IC; QMUL; RHUL (11)
    • NO: Cambridge; Durham; RALPP; RAL T1 (4) + ECDF; Sussex; UCL; Birmingham (4)
  • IPv6 status
    • Allocation - 42%
    • YES: RAL T1; Brunel; IC; QMUL; Manchester; Sheffield; Cambridge; Oxford (8)
    • NO: RHUL; UCL; Lancaster; Liverpool; Durham; ECDF; Glasgow; Birmingham; Bristol; RALPP; Sussex (11)
  • Dual stack nodes - 21%
    • YES: Brunel; IC; QMUL; Oxford (4)
    • NO: RHUL; UCL; Lancaster; Glasgow; Liverpool; Manchester; Sheffield; Durham; ECDF; Birmingham; Bristol; Cambridge; RALPP; Sussex; RAL T1 (15)
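
The percentages quoted in this status list follow from the YES/NO site counts, each out of 19 sites. As a minimal sketch of that arithmetic (the function name `yes_percentage` and the dictionary labels are illustrative only, and rounding to the nearest whole percent is an assumption that happens to match the quoted figures):

```python
def yes_percentage(yes_count, no_count):
    """Share of sites answering YES, rounded to the nearest whole percent."""
    total = yes_count + no_count
    return round(100 * yes_count / total)


# (YES, NO) counts taken from the list above.
status_counts = {
    "Multicore queues": (12, 7),
    "Cloud/VMs": (5, 14),
    "GridPP DIRAC jobs": (11, 8),
    "IPv6 allocation": (8, 11),
    "Dual stack nodes": (4, 15),
}

for name, (yes, no) in status_counts.items():
    print(f"{name}: {yes_percentage(yes, no)}%")
```

For the DIRAC entry the two NO groups (4 + 4) are summed to 8 here, so that every row is measured against the same 19 sites.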


Tuesday 21st October

  • High loads seen in xroot by several sites: Liverpool and RALT1... and also Bristol (see Luke's TB-S email on 16/10 for questions about changes to help).

Tuesday 9th September

  • Intel announced the new generation of Xeon based on Haswell.



Meeting Summaries
Project Management Board - Members Minutes Quarterly Reports

Empty

GridPP ops meeting - Agendas Actions Core Tasks

Empty


RAL Tier-1 Experiment Liaison Meeting (Wednesday 13:30) Agenda Meeting takes place on Vidyo.

Wednesday 28th January 2015

  • Operations report
  • Only ALICE left using CREAM CEs now and they should be finished with them by 16th February. Will decommission then.
  • Electrical power circuit testing completed.
  • The migration of all data off T10000A & B media has been completed.
  • Ongoing discussions with vendor to investigate problems on Primary Tier1 router.
WLCG Grid Deployment Board - Agendas MB agendas

Empty



NGI UK - Homepage CA

Empty

Events
UK ATLAS - Shifter view News & Links

Empty

UK CMS

Empty

UK LHCb

Empty

UK OTHER
  • N/A
To note

  • N/A