Operations Bulletin Latest

From GridPP Wiki

Bulletin archive


Week commencing 24th November 2014
Task Areas
General updates

Tuesday 25th November

Tuesday 18th November

  • HEPSYSMAN took place on Monday and covered: IPv6; multicore; cgroups; DIRAC (CERN@school); HS06 from Haswell; Elasticsearch (log analytics); HEPiX review; snmp log match; and blanking disks.
  • At HEPSYSMAN reference was made to the WLCG agreement with Oracle. This does include Tier-2 sites for WLCG related work.
  • The notes from the November GDB last Wednesday (agenda) are now available together with those of the pre-GDB on volunteer computing. The GDB actions list has been updated.
WLCG Operations Coordination - Agendas

Tuesday 25th November

  • There was a WLCG ops meeting last Thursday. The agenda and minutes are linked here. Highlights for easier digestion:
  • News: ARGUS future workshop in December. WLCG survey pending web form update.
  • Middleware: WLCG repository is now signed.
  • Baselines: UMD 3.9.0 - gfal2, BDII, WN and UI updates.
  • MW issues: An RHEL6.6 kernel fuse bug affects CVMFS NFS installations; sites with this type of installation are advised not to upgrade to SL6.6. gridftp logging is too verbose on DPM 1.8.9; hold off upgrading, as this version also publishes webDAV while the EGI ops probes test SRM.
  • T0 & T1 updates: Some dCache upgrades to 2.10.10. CERN move to FTS v3.2.30.
  • Oracle: CERN upgrades ongoing.
  • T0 news: myproxy.cern.ch will be upgraded to 6.0-2 on Tuesday, November 25th. VO feedback sought on voms-admin test instance - move in January? voms.cern.ch and lcg-voms.cern.ch to be replaced by voms2.cern.ch and lcg-voms2.cern.ch on Wednesday November 26th.
  • Tier-1 news: Two new tape systems are being deployed at NDGF-T1 (Oslo and Copenhagen).
  • Tier-2 news: NTR
  • ALICE: High activity. RAL ARC CEs direct submission work in progress.
  • ATLAS: ATLAS Central Service status (migration to AI) ongoing. ProdSys2 and Rucio migration timeline agreed: ProdSys2 is ramping up while ProdSys1 is being stopped, so expect a low number of jobs for the next ~2 weeks.
  • CMS: Various tests ongoing (VOMS; data transfer; tape...). Moving CRAB and central production into a single global Condor pool. Reminder of site config requests.
  • LHCb: MC and user jobs in the last 2 weeks. Stripping 21 validation revealed a problem and delayed the start of the campaign.
  • glexec: in PanDA testing ongoing - some issues.
  • Machine/job features: NTR
  • Middleware readiness: Following technical discussions between the MW Package Reporter and Pakiti developers and those responsible for WLCG and EGI security, a technical solution was agreed under which each site will have the option to enable Pakiti only, the Package Reporter only, or both. This addresses the security concerns and respects site independence. A release along these lines is expected during the first quarter of 2015.
  • Multicore: Passing parameters to batch systems reviewed at GDB by Alessandra. Capabilities recorded in this table. A report on accounting given to MB by Alessandra. See recommendations.
  • SHA-2: the old servers can be used until November 26th, 14:00 UTC. The maximum proxy lifetime for the old servers will be as low as 2 days and by then the old VOMS ports will refuse connections.
  • IPv6: NTR
  • Squid Mon & HTTP proxy discovery: Alistair working on updates.
  • Network & transfer metrics: 107 instances updated to 3.4.1 following the WLCG and EGI broadcasts. Starting validation of 3.4.1 instances.
  • Next meeting 4th December.


Tier-1 - Status Page

Monday 24th November

  • There was a planned reboot of the site firewall this morning (24th). There is a pair of firewalls, and service fails over and back as each is rebooted, so this was expected to be transparent.
Storage & Data Management - Agendas/Minutes

Wednesday 19th November

  • Logs: chatty DPM 1.8.9, and elasticsearching logs.
  • Reports and other interesting things from workshops: cloud data transfer and sync, HEPiX, hepsysman.
  • RAID controllers for 36 bay nodes

Wednesday 12th November

  • Update on CEPH with xroot. It works...



Accounting - UK Grid Metrics HEPSPEC06 Atlas Dashboard HS06

Tuesday 25th November

  • All sites approximately up-to-date.

Tuesday 11th November

  • A reminder to please update HEPSPEC06 figures with new equipment benchmarks.
  • Please check your GridPP metrics lines in Steve's tables and report any issues.

Tuesday 4th November

  • Sussex and Sheffield publishing issues still apparent.

Monday 27th October

  • Sites considering moving to HTCondor should be aware that prototype APEL parsers for HTCondor are in use, so if you keep CREAM as your CE you can continue to use APEL accounting. The previous Condor parser for APEL was retired in EMI3 as there was no demand.


Documentation - KeyDocs

See the worst KeyDocs list for documents needing review now and the names of the responsible people.

Tuesday 4th Nov

  • New section in the Wiki called "Project Management Pages". The idea is to cluster all Self-Edited Site Tracking Tables in here. Sites should keep entries in Current Activities up to date. Once a Self-Edited Site Tracking Table has served its purpose, the PM will move it to the Historical Archive or otherwise dispose of the table.


Tuesday 28th October

Tuesday 7th October

  • Keydocs were reviewed at the core-ops meeting last week. The situation with updates is improving.
  • Main GridPP website expected to use Wordpress with a plug-in to cover the gridsite aspects.


Interoperation - EGI ops agendas

Tuesday 21st October

    • URT:
    • dCache server v. 2.6.35 verified by WLCG as baseline
    • DPM 1.8.9 in EPEL-testing
    • SR: If sites have been using/testing EMI-WN 3.1.0 please get in touch to help with verification. They seem keen for people to test this.
    • New VOMS servers rollout: NGI SAMs being notified for reconfiguration as of yesterday.
    • MySQL 5.0 EOL campaign: note progress in agenda.


Monday 6th October

There was a meeting today - link: https://wiki.egi.eu/wiki/Agenda-06-10-2014

  • EMI-WN 3.1.0 in SR: if anyone is running this in production please get in touch to help get this past rollout
  • MySQL 5.0 noted to be under Oracle Lifetime Sustaining Support (for some time now).
    • See agenda for guidance on middleware consequences
  • classads "retired" from EPEL repos
  • SL/SLC/CentOS 5 Support Lifetime
    • This was highlighted, though not suggested to be urgent
Monitoring - Links MyWLCG

Tuesday 18/11

On-duty - Dashboard ROD rota

Tuesday 11th November

  • Some minor issues with ROD Dashboard - quickly fixed.
  • Two unavailability tickets still open - issues dealt with.

Tuesday 28th October

  • AM reports a quiet shift. Dashboard not catching up earlier in the week but ok later on.


Rollout Status WLCG Baseline

Tuesday 11th November

  • UMD v.3.9.0 was released Monday 10th November. It supports Scientific Linux 5 and 6 and also Debian 6 (Squeeze).
  • As proposed during the October OMB, production EGI resource centres will be notified later in the month in a summary broadcast, together with other communications, to reduce the number of broadcasts sent to sites.


References


Security - Incident Procedure Policies Rota

Tuesday 4th November

Tuesday 28th October

  • Note EGI-ADV-2014-10-28.

Tuesday 21st October

  • The IGTF has an update which introduced rather unexpected changes in the trust anchors used by Comodo for the TCS. There is now an additional set of SHA-2 intermediate CAs in addition to the old ones.


Services - PerfSonar dashboard | GridPP VOMS

- This includes notifying of (inter)national services that will have an outage in the coming weeks or will be impacted by work elsewhere. (Cross-check the Tier-1 update).

Tuesday 25th November

  • Check on perfSONAR instances upgraded to 3.4...
  • The next LHCOPN and LHCONE joint meeting will take place on Monday 9th and Tuesday 10th of February 2015 in Cambridge (UK), kindly hosted by Dante.

Tuesday 11th November

  • Target date for perfSONAR 3.4 upgrades is 8th December.

Tuesday 4th November

  • perfSONAR 3.4+ install/update instructions are ready. More details will be included in the WLCG broadcast to all sites planned for later today.

Tuesday 28th October

  • Have the perfSONAR 3.4 instructions/documentation been updated yet? Last week volunteers were sought at the...
  • perfSONAR operations meeting took place on 22nd October.
    • There is a recommendation for sites supporting IPv6 to deploy perfSONAR dual-stack.
    • Concerned about Tier-3s requesting to join Tier-2 meshes.
    • A network transfer metrics wiki page is available.


Tickets

Monday 24th November 2014, 15.00 GMT
22 Open UK tickets this week: 11 On Hold, 3 Waiting for Reply, 8 In Progress.

Ticket with No Home
107880(26/8)
It's that srmcp ticket that has been assigned to QMUL after being assigned to RAL. Chris has suggested that the ticket be assigned to the srmcp devs (if there are any left...). Not a bad suggestion (although I would suggest closing this ticket and opening a fresher one for clarity, as the initial problems are solved AIUI); let's make a decision on this one in the meeting. Waiting for reply (20/11)

100IT
108356(10/9)
Much like when I was learning to drive around my hilly home town, this vmcatcher ticket seems to keep stalling. Owen has updated with some good information. In progress (13/11) Update - David replied to Owen, with positive news.

BRUNEL
110059(11/11)
This ticket (Brunel's DPM being shut down by spider attacks!) was being kept open for fear of the issue showing up again (as this is the second incarnation of the issue) - however Henry has had a chance to reyaim his DPM this time and all seems alright, so maybe it can be closed? On Hold (17/11) Update - Henry closed this ticket.

TIER 1
109712(29/10)
CMS glexec error at the tier 1. Andrew L said he'd look into this again after he's back from a well-deserved break, but that was a while ago. Any news? On Hold (10/11)

107935(27/8)
BDII/SRM storage capacity mismatch. At last word Brian had submitted a request to Castor to find out how it reports read-only volumes. Any news? On Hold (3/11)

(I realise that both these tickets are On Hold and therefore no update should necessarily be expected, but both seemed as though they might not be held up for long).

MANCHESTER
109272(11/10)
Atlas having transfer problems, related to a filesystem loss at Manchester. The files are *still* going through recovery (http://bourricot.cern.ch/dq2/recovery/ - thanks Wahid, I had forgotten about this page). They're very nearly done though, I was going to suggest On Holding this ticket but I doubt it will be worth it now. In progress (18/11)


Tools - MyEGI Nagios

Tuesday 21st October

The VOMS servers for OPS based at CERN were down on Saturday 18th October for around 12 hours. Nagios tests started failing after the existing proxy expired. Availability figures will be slightly affected, but the outage will be considered as unknown.

Blog about VO Nagios

http://southgrid.blogspot.co.uk/2014/10/nagios-monitoring-for-non-lhc-vos.html


Tuesday 16th Sep

  • Multi VO nagios maintained at Oxford has been upgraded to add ARC CE tests.
  • https://vo-nagios.physics.ox.ac.uk/nagios/
  • It is currently monitoring gridpp, pheno, t2k.org, snoplus.snolab.ca, vo.southgrid.ac.uk
  • Should we start monitoring it more actively and open tickets for sites failing tests?


VOs - GridPP VOMS VO IDs Approved VO table

Monday 24th November 2014

Tuesday 11th November 2014

  • Status of CERN@School data

Monday 3rd November 2014

  • Please update cvmfs-keys and VO_<VONAME>_SW_DIR
  • Working with SNO+ and Mark Slater on ganga job submission rate
  • Northgrid support on RAL WMS being checked
  • Gfal-copy and castor issue


Thursday 23 October 2014

  • CVMFS keys - new cvmfs-keys package cvmfs-keys-1.5
    • Part of decoupling of CVMFS from CERN - support for keys from various repositories
    • <voname>.gridpp.ac.uk -> <voname>.egi.eu
    • Please update and change VO_<VONAME>_SW_DIR to point to new directory
  • Impact
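
The repository rename noted above (<voname>.gridpp.ac.uk -> <voname>.egi.eu) can be sketched in a few lines of Python. This is only an illustration, not a site configuration tool: the VO name `myvo` is hypothetical, and both the `/cvmfs` mount root and the YAIM-style upper-casing of the VO name in the variable are assumptions that should be checked against your own site setup.

```python
import re

OLD_SUFFIX = ".gridpp.ac.uk"
NEW_SUFFIX = ".egi.eu"
CVMFS_MOUNT_ROOT = "/cvmfs"  # assumption: the usual CVMFS mount point


def new_repository(old_repo: str) -> str:
    """Map <voname>.gridpp.ac.uk to <voname>.egi.eu; leave other names unchanged."""
    if old_repo.endswith(OLD_SUFFIX):
        return old_repo[: -len(OLD_SUFFIX)] + NEW_SUFFIX
    return old_repo


def sw_dir_variable(voname: str) -> str:
    """Build the VO_<VONAME>_SW_DIR variable name.

    Assumption: the VO name is upper-cased and '.' and '-' become '_'.
    """
    return "VO_" + re.sub(r"[.-]", "_", voname).upper() + "_SW_DIR"


def sw_dir_line(voname: str, old_repo: str) -> str:
    """Return a 'NAME=value' line pointing the SW_DIR at the renamed repository."""
    return f"{sw_dir_variable(voname)}={CVMFS_MOUNT_ROOT}/{new_repository(old_repo)}"


if __name__ == "__main__":
    # 'myvo' is a hypothetical VO used purely for illustration.
    print(sw_dir_line("myvo", "myvo.gridpp.ac.uk"))
```

Repository names that do not end in .gridpp.ac.uk are passed through untouched, so the helper is safe to run over a mixed list of repositories.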
Site Updates

Tuesday 21st October

  • High loads seen in xroot by several sites: Liverpool and RALT1... and also Bristol (see Luke's TB-S email on 16/10 for questions about changes to help).

Tuesday 9th September

  • Intel announced the new generation of Xeon based on Haswell.

Tuesday 20th May

  • Various sites but notably Oxford have ARGUS problems. 100s of requests seen per minute. Performance issues have been noted after initial installation at RAL, QMUL and others.


Meeting Summaries
Project Management Board - Members Minutes Quarterly Reports

Empty

GridPP ops meeting - Agendas Actions Core Tasks

Empty


RAL Tier-1 Experiment Liaison Meeting (Wednesday 13:30) Agenda Meeting takes place on Vidyo.

Wednesday 12th November 2014

  • Operations report
  • Some problems on Atlas Castor instance. At various times in the last couple of weeks the Atlas workload has led to differing groups of disk servers spending a lot of time in a "wait i/o" state. This is triggered by the numbers of reads using xroot and has led to some SAM test failures.
  • The provisional dates for safety testing of circuits in the machine room are the week of 12-16 January '15. Services will be 'at risk' during this time.
WLCG Grid Deployment Board - Agendas MB agendas

Empty



NGI UK - Homepage CA

Empty

Events
UK ATLAS - Shifter view News & Links

Empty

UK CMS

Empty

UK LHCb

Empty

UK OTHER
  • N/A
To note

  • N/A