Operations Bulletin 011214

From GridPP Wiki
Revision as of 10:02, 1 December 2014 by Jeremy Coles


Bulletin archive


Week commencing 24th November 2014
Task Areas
General updates


EGI OMB - Thursday 27th November

  • Agenda
    • Actions: Do we want to pilot ARGUS instances?
    • OLA/SLA framework - any comment on the framework/docs?
    • Check the metrics
    • New VO requests: vo.chain-project.eu (FP7 project to encourage cross e-Infrastructure computing) and lagoproject.org (Astro; Space weather and radiation).
    • Recheck mon=N & prod=y status and tickets
    • EGI conference 18-22 May 2015 in Lisbon.
    • New docs: CVMFS replication from OSG; introducing clouds to EGI; CSIRT Certification procedure (check it).
    • Optimising ops communications - procedures/manuals to be updated
    • Early adopters needed for: FTS3 (now have CREAM-LSF, SQUID and CVMFS covered).
    • There will be a cloud-init webinar on 15th December.
    • Next OMB 18th December - any topics to propose?
  • PerfSONAR status: 214 endpoints. Support unit in place. Testing central configuration service (for tests). 3 areas: network path; bandwidth & latency. Useful ESNET usage examples. Version 3.4 uses iptables, and a recent review gives guidance on additional ports that can be closed.
  • SAM/ARGO update: SAM update-23 progressing (probes to UMD; SAM-GridMon removed...) and in staged rollout. ops VOMS config changes: old ops VOMS decommissioned Wednesday 26th. (Still SL5).
  • EGI core activities: 17 services critical. OLAs in place for them. Talk reviewed the problems encountered with each service during year.
  • Update on longtail of science:

Tuesday 25th November

  • There was a WLCG ops meeting last Thursday. The agenda and minutes are linked here. Highlights for easier digestion:
  • News: ARGUS future workshop in December. WLCG survey pending web form update.
  • Middleware: WLCG repository is now signed.
  • Baselines: UMD 3.9.0 - gfal2, BDII, WN and UI updates.
  • MW issues: RHEL6.6 kernel fuse bug affecting CVMFS NFS installations. All sites with this type of installation are recommended not to upgrade to SL6.6. GridFTP logging is too verbose on DPM 1.8.9 - wait to upgrade, as it also publishes WebDAV while the EGI ops probes test SRM.
  • T0 & T1 updates: Some dCache upgrades to 2.10.10. CERN move to FTS v3.2.30.
  • Oracle: CERN upgrades ongoing.
  • T0 news: myproxy.cern.ch will be upgraded to 6.0-2 on Tuesday, November 25th. VO feedback sought on voms-admin test instance - move in January? voms.cern.ch and lcg-voms.cern.ch to be replaced by voms2.cern.ch and lcg-voms2.cern.ch on Wednesday November 26th.
  • Tier-1 news: NDGF-T1: 2 new tape systems are being deployed (Oslo and Copenhagen).
  • Tier-2 news: NTR
  • ALICE: High activity. RAL ARC CEs direct submission work in progress.
  • ATLAS: ATLAS Central Service status (migration to AI) ongoing. ProdSys2 and Rucio migration timeline agreed - ramping up, but ProdSys1 is being stopped, so expect a low number of jobs for the next ~2 weeks.
  • CMS: Various tests ongoing (VOMS; data transfer; tape...). Moving CRAB and central production into a single global Condor pool. Reminder of site config requests.
  • LHCb: MC and user jobs in the last 2 weeks. Stripping 21 validation revealed a problem and delayed the start of the campaign.
  • glexec: in PanDA testing ongoing - some issues.
  • Machine/job features: NTR
  • Middleware readiness: Following technical discussions between the MW Package Reporter and Pakiti developers and those responsible for WLCG and EGI security, a commonly agreed technical solution was adopted: each site will be given the option to enable Pakiti only, the Package Reporter only, or both. This addresses the security concerns and respects site independence. A release along these lines is expected during the first quarter of 2015.
  • Multicore: Passing parameters to batch systems reviewed at GDB by Alessandra. Capabilities recorded in this table. A report on accounting given to MB by Alessandra. See recommendations.
  • SHA-2: the old servers can be used until November 26th, 14:00 UTC. The maximum proxy lifetime for the old servers will be as low as 2 days and by then the old VOMS ports will refuse connections.
  • IPv6: NTR
  • Squid Mon & HTTP proxy discovery: Alistair working on updates.
  • Network & transfer metrics: 107 instances updated to 3.4.1 following the WLCG and EGI broadcasts. Starting validation of 3.4.1 instances.
  • Next meeting 4th December.


Tier-1 - Status Page

Monday 24th November

  • There was a planned reboot of the site firewall this morning (24th). (There is a pair of firewalls, and service will fail over and back as each is rebooted.) This is expected to be transparent.
Storage & Data Management - Agendas/Minutes

Wedn 26 Nov

  • DIRAC: we probably need to understand DIRAC storage and data management better than "tried the tutorial and got the T shirt" - more next week - but then we need access to DIRAC resources!
  • Learning from non-LHC VOs: not just their data problems, but also success stories
  • WebDAV getting more widely supported - need to start testing more widely
  • Deletion rates revisited: old target no longer sufficient, needs revisiting

Wedn 19 Nov

  • Logs: chatty DPM 1.8.9, and elasticsearching logs.
  • Reports and other interesting things from workshops: cloud data transfer and sync, HEPiX, hepsysman.
  • RAID controllers for 36 bay nodes

Wedn 12 Nov

  • Update on CEPH with xroot. It works...



Accounting - UK Grid Metrics HEPSPEC06 Atlas Dashboard HS06

Tuesday 25th November

  • All sites approximately up-to-date.

Tuesday 11th November

  • A reminder to please update HEPSPEC06 figures with new equipment benchmarks.
  • Please check your GridPP metrics lines in Steve's tables and report any issues.

Tuesday 4th November

  • Sussex and Sheffield publishing issues still apparent.

Monday 27th October

  • Sites considering moving to HTCondor should be aware that prototype APEL parsers for HTCondor are in use, so if you continue using CREAM as your CE you can continue to use APEL accounting. The previous Condor parser for APEL was retired in EMI3 as there was no demand.


Documentation - KeyDocs

See the worst KeyDocs list for documents needing review now and the names of the responsible people.

Tuesday 4th Nov

  • New section in Wiki called "Project Management Pages". The idea is to cluster all Self-Edited Site Tracking Tables in here. Sites should keep entries in Current Activities up to date. Once a Self-Edited Site Tracking Table has served its purpose, PM to move it to Historical Archive or otherwise dispose of the table.


Tuesday 28th October

Tuesday 7th October

  • Keydocs were reviewed at the core-ops meeting last week. The situation with updates is improving.
  • Main GridPP website expected to use Wordpress with a plug-in to cover the gridsite aspects.


Interoperation - EGI ops agendas

Tuesday 21st October

    • URT:
    • dCache server v. 2.6.35 verified by WLCG as baseline
    • DPM 1.8.9 in EPEL-testing
    • SR: If sites have been using/testing EMI-WN 3.1.0 please get in touch to help with verification. They seem keen for people to test this.
    • New VOMS servers rollout: NGI SAMs being notified for reconfiguration as of yesterday.
    • MySQL 5.0 EOL campaign: note progress in agenda.


Monday 6th October

There was a meeting today - link: https://wiki.egi.eu/wiki/Agenda-06-10-2014

  • EMI-WN 3.1.0 in SR: if anyone is running this in production please get in touch to help get this past rollout
  • MySQL 5.0 noted to be under Oracle Lifetime Sustaining Support (for some time now).
    • See agenda for guidance on middleware consequences
  • classads "retired" from EPEL repos
  • SL/SLC/CentOS 5 Support Lifetime
    • This was highlighted, though not suggested to be urgent
Monitoring - Links MyWLCG

Tuesday 18/11

On-duty - Dashboard ROD rota

Tuesday 11th November

  • Some minor issues with ROD Dashboard - quickly fixed.
  • Two unavailability tickets still open - issues dealt with.

Tuesday 28th October

  • AM reports a quiet shift. Dashboard not catching up earlier in the week but ok later on.


Rollout Status WLCG Baseline

Tuesday 11th November

  • UMD v.3.9.0 was released Monday 10th November. It supports Scientific Linux 5 and 6 and also Debian 6 (Squeeze).
  • As proposed during the October OMB, production EGI resource centres will be notified later in the month with a summary broadcast together with other communications, to reduce the number of broadcasts sent to sites.


References


Security - Incident Procedure Policies Rota

Tuesday 4th November

Tuesday 28th October

  • Note EGI-ADV-2014-10-28.

Tuesday 21st October

  • The IGTF has an update which introduced rather unexpected changes in the trust anchors used by Comodo for the TCS. There is now a set of SHA-2 intermediate CAs in addition to the old ones.


Services - PerfSonar dashboard | GridPP VOMS

- This includes notifying of (inter)national services that will have an outage in the coming weeks or will be impacted by work elsewhere. (Cross-check the Tier-1 update).

Tuesday 25th November

  • Check on perfSONAR instances upgraded to 3.4...
  • The next LHCOPN and LHCONE joint meeting will take place on Monday 9th and Tuesday 10th of February 2015 in Cambridge (UK), kindly hosted by Dante.

Tuesday 11th November

  • Target date for perfSONAR 3.4 upgrades is 8th December.

Tuesday 4th November

  • perfSONAR 3.4+ install/update instructions are ready. More details will be included in the WLCG broadcast to all sites planned for later today.

Tuesday 28th October

  • Have the perfSONAR 3.4 instructions/documentation been updated yet? Last week volunteers were sought at the...
  • perfSONAR operations meeting took place on 22nd October.
    • There is a recommendation for sites supporting IPv6 to deploy perfSONAR dual-stack.
    • Concerned about Tier-3s requesting to join Tier-2 meshes.
    • A network transfer metrics wiki page is available.


Tickets

Monday 24th November 2014, 15.00 GMT
22 Open UK tickets this week: 11 On Hold, 3 Waiting for Reply, 8 In Progress.

Ticket with No Home
107880(26/8)
It's that srmcp ticket that has been assigned to QMUL after being assigned to RAL. Chris has suggested that the ticket be assigned to the srmcp devs (if there are any left...). Not a bad suggestion (although I would suggest closing this ticket and opening a fresher one for clarity, as the initial problems are solved AIUI), let's make a decision on this one in the meeting. Waiting for reply (20/11)

100IT
108356(10/9)
Much like when I was learning to drive around my hilly home town, this vmcatcher ticket seems to keep stalling. Owen has updated with some good information. In progress (13/11) Update - David replied to Owen, with positive news.

BRUNEL
110059(11/11)
This ticket (Brunel's DPM being shut down by spider attacks!) was being kept open for fear of the issue showing up again (as this is the second incarnation of the issue) - however Henry has had a chance to reyaim his DPM this time and all seems alright, so maybe it can be closed? On Hold (17/11) Update - Henry closed this ticket.

TIER 1
109712(29/10)
CMS glexec error at the tier 1. Andrew L said he'd look into this again after he's back from a well-deserved break, but that was a while ago. Any news? On Hold (10/11)

107935(27/8)
BDII/SRM storage capacity mismatch. At last word Brian had submitted a request to Castor to find out how it reports read-only volumes. Any news? On Hold (3/11)

(I realise that both these tickets are On Hold and therefore no update should necessarily be expected, but both seemed as though they might not be held up for long).

MANCHESTER
109272(11/10)
Atlas having transfer problems, related to a filesystem loss at Manchester. The files are *still* going through recovery (http://bourricot.cern.ch/dq2/recovery/ - thanks Wahid, I had forgotten about this page). They're very nearly done though, I was going to suggest On Holding this ticket but I doubt it will be worth it now. In progress (18/11)


Tools - MyEGI Nagios

Tuesday 25th November

Backup SAM Nagios at Lancaster was upgraded to update-23 as part of the staged rollout process. It is a major upgrade, as some tests were removed from the CE and probes have moved from the SAM repository to the UMD3 repository.

Tests added:

   ch.cern.FTS3-Service
   ch.cern.FTS3-StalledTransfers
   org.bdii.GLUE2-Validate 

Tests removed:

   org.nordugrid.ARC-CE-LFC-result
   org.nordugrid.ARC-CE-lfc
   org.nordugrid.ARC-CE-LFC-submit
   org.sam.WN-RepDel
   org.sam.WN-RepISenv
   org.sam.WN-RepFree
   org.sam.WN-RepCr
   org.sam.WN-RepGet
   org.sam.WN-RepRep
   org.sam.WN-Rep 
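
A site that keeps its own list of probe names (for alarms or dashboards) may want to check it against the lists above. The following is a minimal Python sketch, not part of SAM itself: the function names `stale_references` and `apply_update` and the probe name `example.site.Custom-Check` are hypothetical, and only the added/removed names are taken from this bulletin.

```python
from typing import Iterable, List, Set

# Probe names copied from the update-23 lists in this bulletin.
ADDED: Set[str] = {
    "ch.cern.FTS3-Service",
    "ch.cern.FTS3-StalledTransfers",
    "org.bdii.GLUE2-Validate",
}
REMOVED: Set[str] = {
    "org.nordugrid.ARC-CE-LFC-result",
    "org.nordugrid.ARC-CE-lfc",
    "org.nordugrid.ARC-CE-LFC-submit",
    "org.sam.WN-RepDel",
    "org.sam.WN-RepISenv",
    "org.sam.WN-RepFree",
    "org.sam.WN-RepCr",
    "org.sam.WN-RepGet",
    "org.sam.WN-RepRep",
    "org.sam.WN-Rep",
}


def stale_references(configured: Iterable[str]) -> List[str]:
    """Return the locally configured probe names that update-23 removes."""
    return sorted(set(configured) & REMOVED)


def apply_update(configured: Iterable[str]) -> Set[str]:
    """Return the probe set after dropping removed tests and adding new ones."""
    return (set(configured) - REMOVED) | ADDED


# Hypothetical local list; "example.site.Custom-Check" is a made-up name.
local = ["org.sam.WN-Rep", "org.bdii.GLUE2-Validate", "example.site.Custom-Check"]
print(stale_references(local))
```

Any name reported by `stale_references` would need to be dropped from local alarm rules once a Nagios instance moves to update-23.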

The release note is available here: https://wiki.egi.eu/wiki/SAMUpdate23


Tuesday 21st October

VOMS servers for OPS based at CERN were down on Saturday 18th October for around 12 hours. Nagios tests started failing after the existing proxy expired. Availability figures will be slightly affected but the outage will be considered as unknown.

Blog about VO Nagios

http://southgrid.blogspot.co.uk/2014/10/nagios-monitoring-for-non-lhc-vos.html


Tuesday 16th Sep

  • Multi VO nagios maintained at Oxford has been upgraded to add ARC CE tests.
  • https://vo-nagios.physics.ox.ac.uk/nagios/
  • It is currently monitoring gridpp, pheno, t2k.org, snoplus.snolab.ca, vo.southgrid.ac.uk
  • Should we start monitoring it more actively and open tickets for sites failing tests?


VOs - GridPP VOMS VO IDs Approved VO table

Monday 24th November 2014

Tuesday 11th November 2014

  • Status of CERN@School data

Monday 3rd November 2014

  • Please update cvmfs-keys and VO_<VONAME>_SW_DIR
  • Working with SNO+ and Mark Slater on ganga job submission rate
  • Northgrid support on RAL WMS being checked
  • Gfal-copy and castor issue


Thursday 23 October 2014

  • CVMFS keys - new cvmfs-keys package cvmfs-keys-1.5
    • Part of decoupling of CVMFS from CERN - support for keys from various repositories
    • <voname>.gridpp.ac.uk -> <voname>.egi.eu
    • Please update and change VO_<VONAME>_SW_DIR to point to new directory
  • Impact
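
The rename described above (<voname>.gridpp.ac.uk -> <voname>.egi.eu) can be sketched in Python as follows. This is an illustration only, under two assumptions that are not stated in this bulletin: that repositories are mounted under /cvmfs/<repository>, and that the variable name is the VO name upper-cased with dots and dashes replaced by underscores. The helper names and the VO name `myvo` are hypothetical; check your site configuration for the real values.

```python
OLD_DOMAIN = ".gridpp.ac.uk"
NEW_DOMAIN = ".egi.eu"


def new_repository(old_repo: str) -> str:
    """Map <voname>.gridpp.ac.uk to <voname>.egi.eu."""
    if not old_repo.endswith(OLD_DOMAIN):
        raise ValueError(f"not a gridpp.ac.uk repository: {old_repo}")
    return old_repo[: -len(OLD_DOMAIN)] + NEW_DOMAIN


def sw_dir_setting(vo_name: str, old_repo: str) -> str:
    """Build a VO_<VONAME>_SW_DIR line pointing at the renamed repository."""
    # Assumption: variable name is the VO name upper-cased, '.' and '-' -> '_'.
    var = "VO_" + vo_name.upper().replace(".", "_").replace("-", "_") + "_SW_DIR"
    # Assumption: repositories are mounted under /cvmfs/<repository>.
    return f"{var}=/cvmfs/{new_repository(old_repo)}"


# "myvo" is a made-up VO name used only for illustration.
print(sw_dir_setting("myvo", "myvo.gridpp.ac.uk"))
```

Raising an error for names outside gridpp.ac.uk keeps the helper from silently rewriting repositories that the rename does not cover.
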
Site Updates

Tuesday 21st October

  • High loads seen in xroot by several sites: Liverpool and RALT1... and also Bristol (see Luke's TB-S email on 16/10 for questions about changes to help).

Tuesday 9th September

  • Intel announced the new generation of Xeon based on Haswell.

Tuesday 20th May

  • Various sites but notably Oxford have ARGUS problems. 100s of requests seen per minute. Performance issues have been noted after initial installation at RAL, QMUL and others.


Meeting Summaries
Project Management Board - Members Minutes Quarterly Reports

Empty

GridPP ops meeting - Agendas Actions Core Tasks

Empty


RAL Tier-1 Experiment Liaison Meeting (Wednesday 13:30) Agenda Meeting takes place on Vidyo.

Wednesday 26th November 2014

  • Operations report
  • Some problems on Atlas Castor instance. At various times in the last couple of weeks the Atlas workload has led to differing groups of disk servers spending a lot of time in a "wait i/o" state. This is triggered by the numbers of reads using xroot and has led to some SAM test failures.
  • Provisional dates for safety testing of circuits in the machine room are Tues-Thu in the weeks of 13-15 & 20-22 January '15. Services will be 'at risk' during this time.
  • Provisional dates announced for upgrades of Castor headnodes to SL6, starting with LHCb next Tuesday (2nd Dec).
WLCG Grid Deployment Board - Agendas MB agendas

Empty



NGI UK - Homepage CA

Empty

Events
UK ATLAS - Shifter view News & Links

Empty

UK CMS

Empty

UK LHCb

Empty

UK OTHER
  • N/A
To note

  • N/A