Operations Bulletin 020215

From GridPP Wiki
Revision as of 09:48, 2 February 2015 by Jeremy Coles


Bulletin archive


Week commencing 26th January 2015
Task Areas
General updates

Tuesday 27th January

Tuesday 20th January

  • A GridPP technical meeting took place on Friday. Minutes are available.
  • There has been agreement on CVMFS repository naming for the regional VOs e.g. /cvmfs/londongrid.gridpp.ac.uk
  • Glasgow WNs and gcc-gfortran issue (masking user setup problems?)
  • EGI-Engage project evaluation was very positive (15/15).
  • Minutes of last Wednesday's GDB are now available. The GDB action list has been updated.
    • Still looking for feedback on ginfo
    • PerfSonar to be reinstalled/enabled at sites
    • Squid registration in GOCDB
    • Test Machine Job Features approach
    • Participate in MW readiness working group
    • Looking for more contributors for ARGUS testing/support.


Thursday 8th January


Tuesday 30th December

  • Issues with Tier-1 switch 24th/25th December. GOCDB switchover.
WLCG Operations Coordination - Agendas

Tuesday 27th January

  • There was a WLCG ops coordination meeting last Thursday (Agenda:Minutes). In summary:
  • News: 101 sites responded to survey - thank you. There is a draft agenda available for the WLCG workshop in Okinawa. Nicolò Magini (secretary) is moving.
  • Baselines: FTS 3.2.31 released; Gridsite (2.2.5) in UMD 3; StoRM 1.11.5/1.11.6 released by the PT. dCache 2.6.x end of support is June 2015 (move to 2.10.x/2.11.x soon).
  • Tier-0: The deadline to decommission VOMRS at CERN has moved to 16th February so that issues with the VOMS-Admin replacement can be resolved. The tentative date for AFS-UI decommissioning remains 2nd February.
  • Tier-1: NTR
  • Tier-2: NTR
  • ALICE: High activity except 15th Jan due to cert issue. Large SARA (NLT1) data loss. Offline repo split: AliRoot Core (dependable) vs. AliPhysics (agile).
  • ATLAS: Prodsys-2 has been fully validated. Rucio 'fairly' stable. Production has some hiccups. Most prod multicore. SARA: 0.5M files lost. Data lifetime policy now applied. MC15 not yet ready.
  • CMS: Moderate load. Some T1 disk full. Doing tape exercises. 50% T1 resources to be multicore enabled. Moving CRAB and central production into a single global Condor pool. Tier-2s will stop receiving pilot jobs with VOMS role production.
  • LHCb: Run1 Legacy Stripping almost done. Pre-staging with FTS3 worked well. SARA-MATRIX file loss. HTTP/WEBDAV access unstable.
  • WLCG critical services review: Done for T0. Fully updated tables available.
  • glexec: Focus still on PanDA integration.
  • SHA-2: a new VOMS-Admin version is expected now. VOMRS lives a bit longer.
  • Machine/Job features: Still need volunteer sites to deploy machine/job features on their batch / cloud infrastructure.
  • MW readiness: Met on 21st Jan. Minutes up. Reasonable participation. The new version of the Package Reporter is ready - need more sites testing it.
  • Multicore: CMS focus on T1s. T2s to restart soon (pending factory update).
  • IPv6: Meeting last week. Tier-1s to go dual-stack on their perfSonar instances by April. perfSONAR dashboard showing IPv6 status proposed - IPv6 test to be added to Nagios.
  • Squid: NTR
  • Network & Transfer metrics: NTR.
  • Next meeting 5th February.

Tuesday 20th January

Thursday 8th January


Tier-1 - Status Page

Tuesday 27th January

  • Safety testing of electrical circuits in the R89 machine room has been completed. Some problems were encountered but no critical services affected.
  • The delayed upgrade of the Castor Nameservers to SL6 took place successfully yesterday, Monday (19th Jan).
  • The problems with our primary network router are still being followed up. A test was made last Thursday (20th) during which the unit quickly failed. We are following up with the vendor.
  • Kernel and errata updates (requiring a reboot) are being applied to Castor disk servers this week.
Storage & Data Management - Agendas/Minutes

Wedn 28 Jan

  • Ready for run 2!
  • Towards the exascale with GridPP?
  • Should we spring clean our defunct VOs?

Wedn 21 Jan

  • Wahid's report on the "protocol zoo": can the set of protocols be simplified (and how long would it take)?
  • Update from RAL's CEPH team.

Wedn 07 Jan

  • HNY. Most sites ticked over with relatively few glitches.
  • Next week's pre-GDB on preservation and protocols: harmonising for LHC VOs(?) but we also need to support non-LHC VOs
  • Apparently no major outstanding issues to sort out prior to run 2.

Wedn 17 Dec

  • Update from T1 CEPH team.
  • Last meeting of year. No updating and configuring or touching! Merry and Happy!



Accounting - UK Grid Metrics HEPSPEC06 Atlas Dashboard HS06

Tuesday 20th January

  • Some APEL delays seen for ECDF and Sheffield.

Tuesday 13th January

  • Some glitches over the holiday period when the RAL network was lost, but site publishing now looks fine across all sites.
Documentation - KeyDocs

See the worst KeyDocs list for documents needing review now and the names of the responsible people.

Tuesday 27th January

  • Reviewed briefly at the core ops meeting last week. Doc owners are to be requested to review the usefulness of their keydocs and report any they believe should be removed from keydoc status.

Monday 8th December

  • Chris Walker is handing over several 'other VO' documents. Some aspects of the role are being taken on by a combination of Duncan and Daniela... but the documents need reviewing in a core-ops meeting (next Thursday @ 11am being likely).

Tuesday 4th Nov

  • New section in Wiki called "Project Management Pages". The idea is to cluster all Self-Edited Site Tracking Tables in here. Sites should keep entries in Current Activities up to date. Once a Self-Edited Site Tracking Table has served its purpose, PM to move it to Historical Archive or otherwise dispose of the table.
Interoperation - EGI ops agendas

Monday 12th January

  • There was an EGI ops meeting on Monday 12th January. Minutes to follow.
    • STORM 1.11.5
    • SR: Another reminder to check your contacts
    • Call for sites to test FTS3 and SQUID (Glasgow is testing the new squid)
    • Multicore accounting: As per broadcast, sites have been asked to publish the number of cores used by jobs by updating their APEL config where appropriate; instructions were given to test publishing in accounting-devel.
    • EGI forum in May: slightly delayed because they were waiting for approval to start fine-grained definition of programme. They will circulate the programme soon (end of the month).


Monitoring - Links MyWLCG

Monday 7th December

On-duty - Dashboard ROD rota

Tuesday 27th January

  • Rota to be updated this week based on previous input.

Tuesday 20th January

  • Last week was quiet.
  • Kashif + Gordon (shadowing) next week.
  • The ROD work is to be included in the GridPP5 D/O/S activity.

Monday 13th January

  • A couple of tickets for systems with low availability.
  • There are a couple of tests for QMUL that keep being flagged up although the service is in a "not production but monitored" state.
  • Tier-1 aware of long standing ticket.
Rollout Status WLCG Baseline

Tuesday 20th January

  • From Cristina's GDB talk last week: note that EMI repositories will be frozen and that product team releases will become UMD-preview (repository content and web pages similar, but now managed).

Tuesday 11th November

  • UMD v.3.9.0 was released Monday 10th November. It supports Scientific Linux 5 and 6 and also Debian 6 (Squeeze).
  • As proposed during the October OMB, production EGI resource centres will be notified later in the month with a summary broadcast together with other communications, to reduce the number of broadcasts sent to sites.


References


Security - Incident Procedure Policies Rota

Tuesday 27th January

  • Keep watching Pakiti - WNs are popping up (seemingly) unpatched at some sites!


Tuesday 20th January

  • DPM configuration instructions.
  • Status of CVE-2014-9322.
  • Steve's old perf package false positive.

Tuesday 13th January

  • EGI CSIRT alert 'High' Risk - CVE-2014-9295 - Remote code execution in NTP.
  • Any issues over holiday period?

Tuesday 16th December

  • Any update on the FTS3/GFAL bug?


Services - PerfSonar dashboard | GridPP VOMS

- This includes notifying of (inter)national services that will have an outage in the coming weeks or will be impacted by work elsewhere. (Cross-check the Tier-1 update).

Tuesday 20th January

Tuesday 16th December

  • For the PS dashboard for the time being use this link.
  • As of today the following have no data: RHUL; Sheffield; ECDF; Bristol and Sussex.

Monday 8th December

  • Today was the soft deadline for moving to perfSONAR 3.4. The following sites now appear in the dashboard as live: ... well the dashboard did not load! The hard deadline is 8th January.

Tuesday 2nd December

  • perfSONAR 3.4 available (63%)
    • YES: Imperial; QMUL; RHUL; Lancaster; Liverpool; Manchester; Durham; Glasgow; Bristol; Cambridge; Oxford; RALPP (12)
    • NO: RAL T1; Brunel; UCL; Sheffield; ECDF; Birmingham; Sussex (7)
Tickets

Monday 26th January 2015, 14.15 GMT
Back after being forgotten about by me:
Other VO Nagios Status:

At the time of writing I see:
Imperial: gridpp VO job submission errors (but only 34 minutes old so probably naught to worry about).
Brunel: gridpp VO jobs aborted (one of these is 94 days old, so might be something to worry about).
Lancaster: pheno failures (I can't see what's wrong, but this CE only has 10 days left to live).
Sussex: snoplus failures (but I think Sussex is in downtime).
RALPP: A number of failures across a number of CEs, all a few hours old. An SE problem?
Sheffield: gridpp VO job submission failure, but only 6 hours old. And of course the srm-$VONAME failures at the Tier 1, which are caused by incompatibility between the tests and Castor AIUI. Things are generally looking good.

22 Open UK Tickets this week.
NGI/100IT
111333(22/1)
The NGI has been asked to upgrade the cloud accounting probe, and then notify our (only at the moment) cloud site to republish their accounting. Not entirely sure what this entails or who this falls on, I assigned it to NGI-OPERATIONS (and also noticed that 100IT isn't on the "notify site" list - odd). Assigned (22/1)

TIER 1
108944(1/10/14)
CMS AAA test failures. Andrew Lahiff reported last week that a replacement xrootd box is being prepared at the Tier 1. If that will take a while can the ticket be put on hold? In progress (19/1)

QMUL
110353(25/11/14)
An atlas ticket, asking for httpd access at QMUL. The QM chaps were waiting on a production ready Storm that could handle this, and are preparing to test one out. This is another ticket that looks like it might need to be put On Hold (will leave that up to you chaps - there's a big difference between "slow and steady" progress and "no progress for a while"). In progress (21/1)

RHUL
111355(23/1)
A dteam ticket - concerning http access to RHUL's SE. Although the initial observation about the SE certificate being expired was incorrect (the expiry date was reported as 5/1/15, which to be fair I would read as the 5th of January and not the 1st of May!) there still is some underlying problem here with intermittent test failures. Also this ticket raises the question of under what context are these tests being conducted? Anyone know, or shall we ask the submitter? In progress (26/1)
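
The date ambiguity mentioned in this ticket is easy to reproduce. The short sketch below is purely illustrative (it is not taken from the ticket or from the test code) and parses the reported string "5/1/15" under both conventions:

```python
from datetime import datetime

reported = "5/1/15"  # expiry date string as quoted in the ticket

# Day-first reading (UK convention): 5th of January 2015.
day_first = datetime.strptime(reported, "%d/%m/%y").date()

# Month-first reading (US convention): 1st of May 2015.
month_first = datetime.strptime(reported, "%m/%d/%y").date()

print(day_first.isoformat())    # 2015-01-05
print(month_first.isoformat())  # 2015-05-01
```

Reporting dates in ISO 8601 form (YYYY-MM-DD) avoids this kind of misreading.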

BIOMED PROBLEMS:
Manchester: 111356(23/1)
Imperial: 111357(23/1)
Biomed are having job problems, looking to be caused by using crusty old WMSes to communicate with these sites' shiny up-to-date CEs. According to ticket 110635 a cream side fix should be out by the end of January (CREAM 1.16.5), although Alessandra suggests that Biomed should try to use newer, working WMSes - or Dirac instead!

Tools - MyEGI Nagios

Tuesday 27th January

  • Unscheduled outage of the EGI message broker (GRNET) caused a short-lived disruption to GridPP site monitoring (jobs failed) last Thursday 22nd January. Suspect BDII caching meant no immediate failover to stomp://mq.cro-ngi.hr:6163/ from stomp://mq.afroditi.hellasgrid.gr:6163/
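
The failover that was expected here can be sketched as below. This is a minimal illustration under stated assumptions, not the actual SAM/Nagios logic: the function name `first_reachable` and the injected `can_connect` probe are hypothetical, and only the two broker URLs come from the bulletin. A client that walks the broker list on each attempt moves to the second broker as soon as the first stops answering, whereas a client holding a single cached endpoint does not.

```python
from typing import Callable, Iterable, Optional

# Broker endpoints as quoted in the bulletin.
BROKERS = [
    "stomp://mq.afroditi.hellasgrid.gr:6163/",
    "stomp://mq.cro-ngi.hr:6163/",
]


def first_reachable(brokers: Iterable[str],
                    can_connect: Callable[[str], bool]) -> Optional[str]:
    """Return the first broker for which can_connect() is true, else None.

    can_connect is a hypothetical probe supplied by the caller (for example
    a TCP connect with a short timeout). It is injected so that the
    selection logic can be exercised without touching the network.
    """
    for url in brokers:
        if can_connect(url):
            return url
    return None


# Simulate the outage: the first broker is down, the second one answers.
down = {"stomp://mq.afroditi.hellasgrid.gr:6163/"}
chosen = first_reachable(BROKERS, lambda url: url not in down)
print(chosen)
```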

Tuesday 25th November

Backup SAM Nagios at Lancaster was upgraded to update-23 as part of the staged rollout process. It is a major upgrade, as some tests were removed from the CE and the probes have moved from the SAM repository to the UMD3 repository.

Tests added:

   ch.cern.FTS3-Service
   ch.cern.FTS3-StalledTransfers
   org.bdii.GLUE2-Validate 

Tests removed:

   org.nordugrid.ARC-CE-LFC-result
   org.nordugrid.ARC-CE-lfc
   org.nordugrid.ARC-CE-LFC-submit
   org.sam.WN-RepDel
   org.sam.WN-RepISenv
   org.sam.WN-RepFree
   org.sam.WN-RepCr
   org.sam.WN-RepGet
   org.sam.WN-RepRep
   org.sam.WN-Rep 

The release notes are available here: https://wiki.egi.eu/wiki/SAMUpdate23


Tuesday 21st October

VOMS servers for OPS based at CERN were down on Saturday 18th October for around 12 hours. Nagios tests started failing after the existing proxy expired. Availability figures will be slightly affected but the outage will be considered as unknown.

Blog about VO Nagios

http://southgrid.blogspot.co.uk/2014/10/nagios-monitoring-for-non-lhc-vos.html


Tuesday 16th Sep

  • Multi VO nagios maintained at Oxford has been upgraded to add ARC CE tests.
  • https://vo-nagios.physics.ox.ac.uk/nagios/
  • It is currently monitoring gridpp, pheno, t2k.org, snoplus.snolab.ca, vo.southgrid.ac.uk
  • Should we start monitoring it more actively and open tickets for sites failing tests?


VOs - GridPP VOMS VO IDs Approved VO table

Tuesday 6th January 2015


Tuesday 16th December 2014

  • Discussion of setting up CVMFS for 'other VOs'.
  • LondonGrid VO now established in CVMFS - decision on top level needed together with who has write access.

Monday 8th December 2014

  • Some changes from the Ops Portal to these VOs: ALICE, ATLAS, CMS, GEANT4, LHCB, OPS, VO_SIXT.
  • For each VO, any certificate with a CA_DN field that was: /DC=ch/DC=cern/CN=CERN Trusted Certification Authority replace it with /DC=ch/DC=cern/CN=CERN Grid Certification Authority
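
The CA_DN substitution described above could be applied to locally held VO records along these lines. This is a sketch only: the record layout (a list of dictionaries with a `CA_DN` key) and the host names are assumptions for illustration and do not reflect the Ops Portal's own format. Only the two DN strings are taken from the bulletin.

```python
# DN strings as given in the bulletin.
OLD_CA_DN = "/DC=ch/DC=cern/CN=CERN Trusted Certification Authority"
NEW_CA_DN = "/DC=ch/DC=cern/CN=CERN Grid Certification Authority"


def update_ca_dn(records):
    """Return copies of the records with the old CA_DN replaced by the new one.

    Records whose CA_DN is anything else are returned unchanged.
    """
    updated = []
    for record in records:
        record = dict(record)  # copy, so the caller's data is not modified
        if record.get("CA_DN") == OLD_CA_DN:
            record["CA_DN"] = NEW_CA_DN
        updated.append(record)
    return updated


# Hypothetical records; host names and the second DN are placeholders.
example = [
    {"host": "voms.example.org", "CA_DN": OLD_CA_DN},
    {"host": "voms2.example.org", "CA_DN": "/DC=org/DC=example/CN=Some Other CA"},
]
result = update_ca_dn(example)
for record in result:
    print(record["host"], record["CA_DN"])
```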

Monday 24th November 2014

Tuesday 11th November 2014

  • Status of CERN@School data
Site Updates

Tuesday 27th January

  • Squids not in GOCDB for: UCL; ECDF; Birmingham; Durham; RHUL; IC; Sussex; Lancaster
  • Squids in GOCDB for: EFDA-JET; Manchester; Liverpool; Cambridge; Sheffield; Bristol; Brunel; QMUL; T1; Oxford; Glasgow; RALPPD.

Tuesday 2nd December

  • Multicore status. Queues available (63%)
    • YES: RAL T1; Brunel; Imperial; QMUL; Lancaster; Liverpool; Manchester; Glasgow; Cambridge; Oxford; RALPP; Sussex (12)
    • NO: RHUL (testing); UCL; Sheffield (testing); Durham; ECDF (testing); Birmingham; Bristol (7)
  • According to our table for cloud/VMs (26%)
    • YES: RAL T1; Brunel; Imperial; Manchester; Oxford (5)
    • NO: QMUL; RHUL; UCL; Lancaster; Liverpool; Sheffield; Durham; ECDF; Glasgow; Birmingham; Bristol; Cambridge; RALPP; Sussex (14)
  • GridPP DIRAC jobs successful (58%)
    • YES: Bristol; Glasgow; Lancaster; Liverpool; Manchester; Oxford; Sheffield; Brunel; IC; QMUL; RHUL (11)
    • NO: Cambridge; Durham; RALPP; RAL T1 (4) + ECDF; Sussex; UCL; Birmingham (4)
  • IPv6 status
    • Allocation - 42%
      • YES: RAL T1; Brunel; IC; QMUL; Manchester; Sheffield; Cambridge; Oxford (8)
      • NO: RHUL; UCL; Lancaster; Liverpool; Durham; ECDF; Glasgow; Birmingham; Bristol; RALPP; Sussex (11)
    • Dual stack nodes - 21%
      • YES: Brunel; IC; QMUL; Oxford (4)
      • NO: RHUL; UCL; Lancaster; Glasgow; Liverpool; Manchester; Sheffield; Durham; ECDF; Birmingham; Bristol; Cambridge; RALPP; Sussex; RAL T1 (15)
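
The percentages quoted in this status list are the share of sites in the YES group out of all 19 sites listed. As a quick check, the sketch below recomputes the multicore figure from the site lists given above; the helper name `share_yes` is introduced here for illustration only.

```python
def share_yes(yes_sites, no_sites):
    """Percentage of sites in the YES group, rounded to the nearest integer."""
    total = len(yes_sites) + len(no_sites)
    return round(100 * len(yes_sites) / total)


# Multicore queue status, copied from the list above.
multicore_yes = ["RAL T1", "Brunel", "Imperial", "QMUL", "Lancaster",
                 "Liverpool", "Manchester", "Glasgow", "Cambridge",
                 "Oxford", "RALPP", "Sussex"]
multicore_no = ["RHUL", "UCL", "Sheffield", "Durham", "ECDF",
                "Birmingham", "Bristol"]

print(share_yes(multicore_yes, multicore_no))  # 63
```

The same calculation gives 26% for cloud/VMs (5 of 19), 42% for IPv6 allocation (8 of 19) and 21% for dual stack nodes (4 of 19).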


Tuesday 21st October

  • High loads seen in xroot by several sites: Liverpool and RALT1... and also Bristol (see Luke's TB-S email on 16/10 for questions about changes to help).

Tuesday 9th September

  • Intel announced the new generation of Xeon based on Haswell.



Meeting Summaries
Project Management Board - Members Minutes Quarterly Reports

Empty

GridPP ops meeting - Agendas Actions Core Tasks

Empty


RAL Tier-1 Experiment Liaison Meeting (Wednesday 13:30) Agenda Meeting takes place on Vidyo.

Wednesday 28th January 2015

  • Operations report
  • Only ALICE left using CREAM CEs now and they should be finished with them by 16th February. Will decommission then.
  • Electrical power circuit testing completed.
  • The migration of all data off T10000A & B media has been completed.
  • Ongoing discussions with vendor to investigate problems on Primary Tier1 router.
WLCG Grid Deployment Board - Agendas MB agendas

Empty



NGI UK - Homepage CA

Empty

Events
UK ATLAS - Shifter view News & Links

Empty

UK CMS

Empty

UK LHCb

Empty

UK OTHER
  • N/A
To note

  • N/A