Difference between revisions of "Operations Bulletin Latest"

From GridPP Wiki
<!-- *********************************************************** ----->
<!-- ***********************Start T1 text*********************** ----->
'''Tuesday 3rd February'''
'''Tuesday 10th February'''
* Safety testing of electrical circuits in the R89 machine room has been completed. Some problems were encountered but no critical services affected.
* Patching of Oracle databases behind Castor scheduled for Feb 11th.
* Backup link to CERN moved to a new route this morning. Currently running over that link as a test.
* Some racks in the datacentre have to be moved to make way for latest procurement. This will affect one generation of production diskservers. Details are still being finalised.
* The problems with our primary network router are still being followed up.  
* Outage for reboot of Castor systems yesterday (to pick up latest OS patches).
<!-- **********************End T1 text************************** ----->
<!-- *********************************************************** ----->

Revision as of 09:54, 10 February 2015

Bulletin archive

Week commencing 9th February 2015
Task Areas
General updates

Tuesday 10th February

Tuesday 3rd February

  • Operations: RC/RP OLA: User SLA Biovel, DRIHM, CompChem.
  • Security
    • Current: CVE-2014-9322 (linux kernel) ongoing. Pakiti issues. Some sites suspended.
    • CVE-2015-0235 (glibc) – under consideration.
  • GOCDB:
  • Fedcloud F2F:
    • EGI-Engage (standards & capabilities) & INDIGO Datacloud (services) will be funded.
    • Use cases: 26 communities with 50 use cases. 43 NatSci; 3 Medical; 3 Humanities.
    • Many sites experimenting with Big Data services also for IaaS provisioning
    • Cloud accounting working but to be improved.
    • Cloud security: response; policies; assessment (to happen)
    • VM Management: Using OCCI 1.1. Supports OpenStack (OS), OpenNebula (ON) and Synnefo.
    • Data Management: CDMI uses Synnefo. Lack of support.
    • Cloud Blueprint docs: Cover GOCDB, monitoring & accounting.
    • AAI – AARC and EGI-Engage will push this forward.
    • Monitoring: for FedCloud interfaces/services done or in dev.
  • Earth science e-infrastructures workshop
    • Outreach included ES communities (VERCE, CLIPC, EIDA, EPOS…) + EGI.eu+Geant+EUDAT+PRACE.
    • Sessions: AAI; Data management; Cloud. Use cases.
    • Virtual Earthquake & Seismology Research Community in Europe (VERCE) – iRODS
    • Climate4Impact: Data from Earth System Grid Federation (ESGF) 6PB. Data transfer bottlenecks.
    • ... + many others...
  • EGI participation in PRACE data storage pilot call - brief discussion. Survey for responses.
  • UMD 3.11 update
    • FTS 3
    • UMD u11: CREAM 1.16.4 (no VOMS client dependencies); BLAH 1.20.7; Torque 2.1.4; VOMS-admin 2.0.12; voms-clients 3.0.6; DPM 1.8.9; DMLite 0.7.2; CERN DM clients (e.g. gfal2…).
  • As of yesterday AFS UI at CERN closed. The top directory of the AFS UI installation is /afs/cern.ch/project/gd/LCG-share

WLCG Operations Coordination - Agendas


  • News: An updated version of the agenda for the Okinawa workshop will be produced soon.
  • MW Baselines: frontier/squid 2.7.STABLE9-22.
  • MW Issues: Ghost vulnerability. ARGUS failures with latest Java (disable SSLv3 by default) – workaround in place.
  • MW Services: Various upgrades mentioned for T0 and T1s.
  • T0 news: Latest VOMS-admin in test. If no issues VOMRS off from 16th Feb. AFS UI closed 2nd Feb.
  • T1: NTR
  • T2: Asked about CERN accounts for T2 sysadmins. Should be possible!
  • ALICE:
  • ATLAS:
  • CMS: Bigger run 2 MC started. Doing staging/consistency checks. >50% T1 capacity MC – partitionable slot model so slots will be used either way.
  • LHCb: Run-1 legacy stripping almost done. Problems due to SARA disk failure. Some issues with CERN simulation jobs. APEL issues at 2 sites. VOMS tests.
  • Glexec: With panda - 55 sites covered. T1 and biggest T2s. Issue resolved at BNL. Further ramping ATLAS.
  • SHA-2: old VOMS retirement – will keep special firewall/router configs until VOMRS gone. VOMS config reminder broadcast soon.
  • MJF: Site testing – documentation question.
  • MW readiness: Some steady progress.
  • IPv6: TBC
  • Squid mon + HTTP proxy dis TF: Progress. Not all squids are registered yet so will broadcast again. Generate ATLAS/CMS pages for monitoring. Contributors busy. Additional CVMFS repos now available. Some sites have put restrictions in place.

Net+Transfer metrics:

Tuesday 3rd February

  • Next meeting this Thursday 5th February.
    • Tier-2 feedback so far: Has there been any progress on CERN accounts or material access?
Tier-1 - Status Page

Tuesday 10th February

  • Patching of Oracle databases behind Castor scheduled for Feb 11th.
  • Some racks in the datacentre have to be moved to make way for latest procurement. This will affect one generation of production diskservers. Details are still being finalised.
  • The problems with our primary network router are still being followed up.
Storage & Data Management - Agendas/Minutes

Tuesday 10th February

Wedn 28 Jan

  • Ready for run 2!
  • Towards the exascale with GridPP?
  • Should we spring clean our defunct VOs?

Wedn 21 Jan

  • Wahid's report on "protocol zoo" - can the set of protocols be simplified (and how long would it take)
  • Update from RAL's CEPH team.

Wedn 07 Jan

  • HNY. Most sites ticked over with relatively few glitches.
  • Next week's pre-GDB on preservation and protocols: harmonising for LHC VOs(?) but we also need to support non-LHC VOs
  • Apparently no major outstanding issues to sort out prior to run 2.

Wedn 17 Dec

  • Update from T1 CEPH team.
  • Last meeting of year. No updating and configuring or touching! Merry and Happy!

Accounting - UK Grid Metrics HEPSPEC06 Atlas Dashboard HS06

Tuesday 3rd February

Tuesday 20th January

  • Some APEL delays seen for ECDF and Sheffield.

Tuesday 13th January

  • Some glitches over holiday period when RAL network lost, but site publishing now looks fine across all sites.
Documentation - KeyDocs

See the worst KeyDocs list for documents needing review now and the names of the responsible people.

Tuesday 3 Feb

New VO, LSST, in Approved VOs: https://www.gridpp.ac.uk/w/index.php?title=GridPP_approved_VOs

Tuesday 27th January

  • Reviewed briefly at the core ops meeting last week. Doc owners are to be requested to review the usefulness of their keydocs and report any they believe should be removed from keydoc status.

Monday 8th December

  • Chris Walker is handing over several 'other VO' documents. Some aspects of the role are being taken on by a combination of Duncan and Daniela... but the documents need reviewing in a core-ops meeting (next Thursday @ 11am being likely).

Tuesday 4th Nov

  • New section in Wiki called "Project Management Pages". The idea is to cluster all Self-Edited Site Tracking Tables in here. Sites should keep entries in Current Activities up to date. Once a Self-Edited Site Tracking Table has served its purpose, PM to move it to Historical Archive or otherwise dispose of the table.
Interoperation - EGI ops agendas

Monday 9th February

  • The agenda for February's EGI ops meeting is here.
    • Argus PAP update (1.6.2) out imminently. This should fix the SSL problems for sites affected after a java update. Hopefully this will solve Raul's problems in his ticket 111505.
    • Other update notifications include APEL 1.4.0 (with a bunch of new features), Storm 1.11.6, UNICORE v7.2.0 and a new version of Globus in EPEL testing. UMD release 3.11.0 being prepared.
    • Multicore republishing was discussed - if sites need to republish multicore data previously parsed as single core then they need to reparse the logs. Any site wanting to republish is asked to submit a ticket to apel support first.
    • Operational issues were discussed whereby sites saw problems running ARGUS and CREAM on the same node. Special care is needed if you want to take this mad path. See tickets 111464 and 111373.
    • The decommissioning of the EMI2 APEL service and the UMD repositories was mentioned.
    • Sites are advised to avoid using MySQL v.5.0 where possible.
    • A comment was passed that the EPEL5 repository was becoming harder to keep up to date, bringing the effective EOL of SL5 closer for the community
      • SL7 is a topic that still needs to be discussed in earnest in this forum.
    • Next meeting is March 9th - effort to keep the number of meetings down.

Monday 12th January

  • There was an EGI ops meeting on Monday 12th January. Minutes to follow.
    • STORM 1.11.5
    • SR: Another reminder to check your contacts
    • Call for sites to test FTS3 and SQUID (Glasgow is testing the new squid)
    • Multicore accounting: As per broadcast, sites have been asked to publish the number of cores used by jobs by updating the APEL config where appropriate: instructions given to test publishing in accounting-devel.
    • EGI forum in May: slightly delayed because they were waiting for approval to start fine-grained definition of programme. They will circulate the programme soon (end of the month).

Monitoring - Links MyWLCG

Monday 7th December

On-duty - Dashboard ROD rota

Tuesday 27th January

  • Rota to be updated this week based on previous input.

Tuesday 20th January

  • Last week was quiet.
  • Kashif + Gordon (shadowing) next week.
  • The ROD work is to be included in the GridPP5 D/O/S activity.

Monday 13th January

  • A couple of tickets for systems with low availability.
  • There are a couple of tests for QMUL that keep being flagged up although the service is in a "not production but monitored" state.
  • Tier-1 aware of long standing ticket.
Rollout Status WLCG Baseline

Tuesday 20th January

  • From Cristina's GDB talk last week note that EMI repositories will be frozen and the product team releases will become UMD-preview (repository content and webpages similar but now managed).

Tuesday 11th November

  • UMD v.3.9.0 was released Monday 10th November. It supports Scientific Linux 5 and 6 and also Debian 6 (Squeeze).
  • As proposed during October OMB production EGI resource centres will be notified later in the month with a summary broadcast together with other communications, to reduce the number of broadcasts sent to sites.


Security - Incident Procedure Policies Rota

Tuesday 27th January

  • Keep watching Pakiti - WNs are popping up (seemingly) unpatched at some sites!

Tuesday 20th January

  • DPM configuration instructions.
  • Status of CVE-2014-9322.
  • Steve's old perf package false positive.

Tuesday 13th January

  • EGI CSIRT alert 'High' Risk - CVE-2014-9295 - Remote code execution in NTP.
  • Any issues over holiday period?

Tuesday 16th December

  • Any update on the FTS3/GFAL bug?

Services - PerfSonar dashboard | GridPP VOMS

- This includes notifying of (inter)national services that will have an outage in the coming weeks or will be impacted by work elsewhere. (Cross-check the Tier-1 update).

Tuesday 3rd February

Tuesday 20th January


Monday 9th February 2015, 15.00 GMT

Other VO Nagios Results
At the time of writing the only site showing red that wasn't suffering from an understood problem was RALPP, with org.nordugrid.ARC-CE-submit and SRM-submit test failures for gridpp, pheno, t2k and southgrid for both its CEs and its SE. The failures are between 1 and 12 hours old, so it doesn't seem to be a persistent failure, but it seems to be quite consistent. They all seem to be failing with "Job submission failed... arcsub exited with code 256: ...ERROR: Failed to connect to XXXX(IPv4):443 .... Job submission failed, no more possible targets". Has anyone seen something like this before?

Only 20 Open UK tickets this week.

Biomed tickets:
111356 (Manchester)
111357 (Imperial)
Biomed have linked both these tickets as children of 110636, being worked on by the CREAM BLAH team. AFAICS no sign of CREAM 1.16.5 just yet.

111347 (22/1)
CMS consistency checks for January 2015. It looks like everything that was asked of RAL has been done by RAL, so hopefully this can be successfully closed. In progress (3/2)

111120 (12/1)
Another ticket, this time concerning a period of Atlas transfer failures between RAL and BNL, that looks like it can be closed as the failures seem to have stopped (and might well have been at the BNL end). Waiting for reply (22/1)

108944 (1/10/14)
CMS AAA test failures at RAL. Federica can't connect to the new xrootd service according to the error messages. No news for a while. In progress (29/1)

100IT 108356
Both of these 100IT tickets are looking a bit crusty - the first is waiting for advice, the second was just put "In progress".

110353 (25/11/14) Dan has set up se02.esc.qmul.ac.uk to test out the latest https-accessible version of storm for dteam and atlas. As a cherry on top this node is also IPv6 enabled. I'm not sure if Dan wants others in the UK to "give it a go"? In progress (6/2)

100566 (27/1/14)
(Blatantly scrounging for advice) Trying to figure out why Lancaster's perfsonar is under-performing. Ewan kindly gave us access to an iperf endpoint and it's been very useful in characterising some of the weirdness - although I'm still confused. Ewan also gave us a bunch of suggestions for testing that have been useful - next stop, window sizes. If anyone else wants to throw advice at me, all wisdom donations are thankfully accepted. My advice for others is to be careful trying to connect to the default iperf port on a working DPM pool node.... In Progress (9/2)
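
For anyone following the window-size line of testing, the usual starting point is the bandwidth-delay product: the amount of data that has to be in flight to fill a path. The sketch below is only an illustration. The 1 Gbit/s link speed and 10 ms round-trip time are made-up numbers, not measurements from the Lancaster ticket, and bdp_bytes is a hypothetical helper name.

```python
def bdp_bytes(bandwidth_bps, rtt_ms):
    """Bandwidth-delay product in bytes: bits/s * seconds / 8.

    Integer arithmetic: (bits/s * ms) / 1000 gives bits, / 8 gives bytes.
    """
    return bandwidth_bps * rtt_ms // 8000


# Hypothetical path: 1 Gbit/s link with a 10 ms round-trip time.
print(bdp_bytes(10**9, 10))  # 1250000
```

A TCP window much smaller than this figure would cap throughput below the link speed on such a path, which is one reason to vary the window setting between iperf runs.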

Tools - MyEGI Nagios

Tuesday 27th January

  • Unscheduled outage of the EGI message broker (GRNET) caused a short-lived disruption to GridPP site monitoring (jobs failed) last Thursday 22nd January. Suspect BDII caching meant no immediate failover to stomp://mq.cro-ngi.hr:6163/ from stomp://mq.afroditi.hellasgrid.gr:6163/
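
As an illustration of the failover that did not happen, here is a minimal sketch of a client that walks an ordered broker list and takes the first endpoint that accepts a connection. It is not how the SAM/Nagios tooling selects brokers. The function name first_working_broker, the try_connect callable and the preference order are all assumptions; only the two broker URLs come from the note above.

```python
from urllib.parse import urlsplit

# Broker URLs as quoted in the bulletin; the preference order is an assumption.
BROKERS = [
    "stomp://mq.afroditi.hellasgrid.gr:6163/",
    "stomp://mq.cro-ngi.hr:6163/",
]


def first_working_broker(urls, try_connect):
    """Return the (host, port) of the first broker that try_connect accepts.

    try_connect is a hypothetical callable taking (host, port) and raising
    OSError on failure, e.g. a thin wrapper around socket.create_connection.
    """
    for url in urls:
        parts = urlsplit(url)
        endpoint = (parts.hostname, parts.port)
        try:
            try_connect(*endpoint)
        except OSError:
            continue  # this broker is unreachable, move on to the next one
        return endpoint
    raise ConnectionError("no message broker reachable")
```

With a real network check, try_connect could open and immediately close a TCP connection using socket.create_connection((host, port), timeout=5). Probing the endpoints directly like this would avoid relying on cached BDII information, which is the suspected cause above.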

Tuesday 25th November

Backup SAM Nagios at Lancaster was upgraded to update-23 as part of the staged rollout process. It is a major upgrade as some tests were removed from the CE and the probes have moved from the SAM repository to the UMD3 repository.

Tests added:


Tests removed:


Release notes are available here: https://wiki.egi.eu/wiki/SAMUpdate23

Tuesday 21st October

VOMS servers for OPS based at CERN were down on Saturday 18th October for around 12 hours. Nagios tests started failing after the existing proxy expired. Availability figures will be slightly affected but the outage will be considered as unknown.

Blog about VO Nagios


Tuesday 16th Sep

  • Multi VO nagios maintained at Oxford has been upgraded to add ARC CE tests.
  • https://vo-nagios.physics.ox.ac.uk/nagios/
  • It is currently monitoring gridpp, pheno, t2k.org, snoplus.snolab.ca, vo.southgrid.ac.uk
  • Should we start monitoring it more actively and open tickets for sites failing tests?

VOs - GridPP VOMS VO IDs Approved VO table

Tuesday 6th January 2015

Tuesday 16th December 2014

  • Discussion of setting up CVMFS for 'other VOs'.
  • LondonGrid VO now established in CVMFS - decision on top level needed together with who has write access.

Site Updates

Tuesday 27th January

  • Squids not in GOCDB for: UCL; ECDF; Birmingham; Durham; RHUL; IC; Sussex; Lancaster
  • Squids in GOCDB for: EFDA-JET; Manchester; Liverpool; Cambridge; Sheffield; Bristol; Brunel; QMUL; T1; Oxford; Glasgow; RALPPD.

Tuesday 2nd December

  • Multicore status. Queues available (63%)
    • YES: RAL T1; Brunel; Imperial; QMUL; Lancaster; Liverpool; Manchester; Glasgow; Cambridge; Oxford; RALPP; Sussex (12)
    • NO: RHUL (testing); UCL; Sheffield (testing); Durham; ECDF (testing); Birmingham; Bristol (7)
  • According to our table for cloud/VMs (26%)
    • YES: RAL T1; Brunel; Imperial; Manchester; Oxford (5)
    • NO: QMUL; RHUL; UCL; Lancaster; Liverpool; Sheffield; Durham; ECDF; Glasgow; Birmingham; Bristol; Cambridge; RALPP; Sussex (14)
  • GridPP DIRAC jobs successful (58%)
    • YES: Bristol; Glasgow; Lancaster; Liverpool; Manchester; Oxford; Sheffield; Brunel; IC; QMUL; RHUL (11)
    • NO: Cambridge; Durham; RALPP; RAL T1 (4) + ECDF; Sussex; UCL; Birmingham (4)
  • IPv6 status
    • Allocation - 42%
      • YES: RAL T1; Brunel; IC; QMUL; Manchester; Sheffield; Cambridge; Oxford (8)
      • NO: RHUL; UCL; Lancaster; Liverpool; Durham; ECDF; Glasgow; Birmingham; Bristol; RALPP; Sussex (11)
    • Dual stack nodes - 21%
      • YES: Brunel; IC; QMUL; Oxford (4)
      • NO: RHUL; UCL; Lancaster; Glasgow; Liverpool; Manchester; Sheffield; Durham; ECDF; Birmingham; Bristol; Cambridge; RALPP; Sussex; RAL T1 (15)
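
The percentages quoted in the lists above are the share of the 19 sites answering YES in each category. A minimal sketch of that arithmetic follows. The YES/NO counts are taken from the lists (the IPv6 allocation NO count is obtained by counting the 11 names), while the dictionary labels and the function name percent_yes are made up for illustration.

```python
# YES/NO site counts taken from the lists above.
status = {
    "Multicore queues": (12, 7),
    "Cloud/VMs": (5, 14),
    "GridPP DIRAC jobs": (11, 8),  # NO is listed as 4 + 4
    "IPv6 allocation": (8, 11),    # NO count obtained by counting the names
    "Dual stack nodes": (4, 15),
}


def percent_yes(yes, no):
    """Share of sites answering YES, rounded to the nearest whole percent."""
    return round(100 * yes / (yes + no))


for name, (yes, no) in status.items():
    print(f"{name}: {percent_yes(yes, no)}% ({yes} of {yes + no} sites)")
```

Each category sums to 19 sites, and the rounded values agree with the figures given in the bulletin.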

Tuesday 21st October

  • High loads seen in xroot by several sites: Liverpool and RALT1... and also Bristol (see Luke's TB-S email on 16/10 for questions about changes to help).

Tuesday 9th September

  • Intel announced the new generation of Xeon based on Haswell.

Meeting Summaries
Project Management Board - Members Minutes Quarterly Reports


GridPP ops meeting - Agendas Actions Core Tasks


RAL Tier-1 Experiment Liaison Meeting (Wednesday 13:30) Agenda Meeting takes place on Vidyo.

Wednesday 28th January 2015

  • Operations report
  • Only ALICE left using CREAM CEs now and they should be finished with them by 16th February. Will decommission then.
  • Electrical power circuit testing completed.
  • The migration of all data off T10000A & B media has been completed.
  • Ongoing discussions with vendor to investigate problems on Primary Tier1 router.
WLCG Grid Deployment Board - Agendas MB agendas


NGI UK - Homepage CA


UK ATLAS - Shifter view News & Links






  • N/A
To note

  • N/A