Difference between revisions of "Operations Bulletin Latest"

From GridPP Wiki

Revision as of 10:47, 2 December 2014

Bulletin archive

Week commencing 1st December 2014
Task Areas
General updates

Tuesday 2nd December

  • WLCG Overview Board (OB) met on Friday 28th November. Ian Bird's status report gives a summary of resource usage, projections and current project directions (data preservation, RUN-2 preparations etc.).
  • There is an ATLAS jamboree 3rd-5th December 2014.
  • Certificate renewal email reminders were not working 3rd November - 1st December. Nagios may have reminded you... but if not contact John Kewley for the Nagios scripts.
  • The old VOMS servers 'voms.cern.ch' and 'lcg-voms.cern.ch' _were?_ switched off for good and replaced by 'voms2.cern.ch' and 'lcg-voms2.cern.ch' on Wednesday 26th November at 15:00 CET.

EGI OMB - Thursday 27th November

  • Agenda
    • Actions: Do we want to pilot ARGUS instances?
    • OLA/SLA framework - any comment on the framework/docs?
    • Check the metrics
    • New VO requests: vo.chain-project.eu (FP7 project to encourage cross e-Infrastructure computing) and lagoproject.org (Astro; Space weather and radiation).
    • Recheck mon=N & prod=y status and tickets
    • EGI conference 18-22 May 2015 in Lisbon.
    • New docs: CVMFS replication from OSG; introducing clouds to EGI; CSIRT Certification procedure (check it).
    • Optimising ops communications - procedures/manuals to be updated
    • Early adopters needed for: FTS3 (now have CREAM-LSF, SQUID and CVMFS covered).
    • There will be a cloud-init webinar on 15th December.
    • Next OMB 18th December - any topics to propose?
  • PerfSONAR status: 214 endpoints. Support unit in place. Testing central configuration service (for tests). 3 areas: network path; bandwidth & latency. Useful ESNET usage examples. 3.4 uses iptables and in a recent review guidance is given on additional ports that can be closed.
  • SAM/ARGO update: SAM update-23 progressing (probes to UMD; SAM-GridMon removed..) and in staged rollout. ops VOMS config changes: old ops voms decommissioned Wednesday 26th. (Still SL5).
  • EGI core activities: 17 services critical. OLAs in place for them. Talk reviewed the problems encountered with each service during year.
  • Update on longtail of science:

Monday 1st December

  • WLCG ops coordination team have launched a survey. Please could all GridPP sites respond to it by 19th December.

Tuesday 25th November

  • There was a WLCG ops meeting last Thursday. The agenda and minutes are linked here. Highlights for easier digestion:
  • News: ARGUS future workshop in December. WLCG survey pending web form update.
  • Middleware: WLCG repository is now signed.
  • Baselines: UMD 3.9.0 - gfal2, BDII, WN and UI updates.
  • MW issues: RHEL6.6 kernel fuse bug affecting CVMFS NFS installations. All sites with this type of installation are recommended not to upgrade to SL6.6. gridftp logging is too verbose on DPM 1.8.9 - wait to upgrade, as it also publishes webDAV while EGI ops probes SRM.
  • T0 & T1 updates: Some dCache upgrades to 2.10.10. CERN move to FTS v3.2.30.
  • Oracle: CERN upgrades ongoing.
  • T0 news: myproxy.cern.ch will be upgraded to 6.0-2 on Tuesday, November 25th. VO feedback sought on voms-admin test instance - move in January? voms.cern.ch and lcg-voms.cern.ch to be replaced by voms2.cern.ch and lcg-voms2.cern.ch on Wednesday November 26th.
  • Tier-1 news: NDGF-T1 2 new tape systems are getting deployed (Oslo and Copenhagen).
  • Tier-2 news: NTR
  • ALICE: High activity. RAL ARC CEs direct submission work in progress.
  • ATLAS: ATLAS Central Service status (migration to AI) ongoing. ProdSys2 and Rucio migration timeline agreed - ramping up but stopping ProdSys1, so a low number of jobs is expected for the next ~2 weeks.
  • CMS: Various tests ongoing (VOMS; data transfer; tape...). Moving CRAB and central production into a single global Condor pool. Reminder of site config requests.
  • LHCb: MC and user jobs in the last 2 weeks. Stripping 21 validation revealed a problem and delayed the start of the campaign.
  • glexec: in PanDA testing ongoing - some issues.
  • Machine/job features: NTR
  • Middleware readiness: Following technical discussions between the MW Package Reporter and Pakiti developers and the WLCG and EGI Security responsibles, a technical solution of common agreement was adopted by which each site will be given the option to enable pakiti only, the Package Reporter or both. Thus security concerns are addressed and the site independence is respected. A release along these lines is expected during the 1st quarter 2015.
  • Multicore: Passing parameters to batch systems reviewed at GDB by Alessandra. Capabilities recorded in this table. A report on accounting given to MB by Alessandra. See recommendations.
  • SHA-2: the old servers can be used until November 26th, 14:00 UTC. The maximum proxy lifetime for the old servers will be as low as 2 days and by then the old VOMS ports will refuse connections.
  • IPv6: NTR
  • Squid Mon & HTTP proxy discovery: Alistair working on updates.
  • Network & transfer metrics: 107 instances updated to 3.4.1 following the WLCG and EGI broadcasts. Starting validation of 3.4.1 instances.
  • Next meeting 4th December.

Tier-1 - Status Page

Tuesday 2nd December

  • The headnodes of the LHCb Castor instance are being upgraded to SL6 today 10:00-14:00.
  • Investigating problems on the CMS Castor instance.
Storage & Data Management - Agendas/Minutes

Wedn 26 Nov

  • DIRAC: we probably need to understand DIRAC storage and data management better than "tried the tutorial and got the T shirt" - more next week - but then we need access to DIRAC resources!
  • Learning from non-LHC VOs: not just their data problems, but also success stories
  • WebDAV getting more widely supported - need to start testing more widely
  • Deletion rates revisited: old target no longer sufficient, needs revisiting

Wedn 19 Nov

  • Logs: chatty DPM 1.8.9, and elasticsearching logs.
  • Reports and other interesting things from workshops: cloud data transfer and sync, HEPiX, hepsysman.
  • RAID controllers for 36 bay nodes

Wedn 12 Nov

  • Update on CEPH with xroot. It works...

Accounting - UK Grid Metrics HEPSPEC06 Atlas Dashboard HS06

Tuesday 25th November

  • All sites approximately up-to-date.

Tuesday 11th November

  • A reminder to please update HEPSPEC06 figures with new equipment benchmarks.
  • Please check your GridPP metrics lines in Steve's tables and report any issues.

Tuesday 4th November

  • Sussex and Sheffield publishing issues still apparent.

Monday 27th October

  • Sites considering moving to HTCondor should be aware there are prototype APEL parsers in use for HTCondor so if you continue using CREAM as your CE then you can continue to use APEL accounting. The previous Condor parser for APEL was retired in EMI3 as there was no demand.

Documentation - KeyDocs

See the worst KeyDocs list for documents needing review now and the names of the responsible people.

Tuesday 4th Nov

  • New section in Wiki called "Project Management Pages". The idea is to cluster all Self-Edited Site Tracking Tables in here. Sites should keep entries in Current Activities up to date. Once a Self-Edited Site Tracking Table has served its purpose, PM to move it to Historical Archive or otherwise dispose of the table.

Tuesday 28th October

Tuesday 7th October

  • Keydocs were reviewed at the core-ops meeting last week. The situation with updates is improving.
  • Main GridPP website expected to use Wordpress with a plug-in to cover the gridsite aspects.

Interoperation - EGI ops agendas

Tuesday 21st October

    • URT:
    • dCache server v. 2.6.35 verified by WLCG as baseline
    • DPM 1.8.9 in EPEL-testing
    • SR: If sites have been using/testing EMI-WN 3.1.0 please get in touch to help with verification. They seem keen for people to test this.
    • New VOMS servers rollout: NGI SAMs being notified for reconfiguration as of yesterday.
    • MySQL 5.0 EOL campaign: note progress in agenda.

Monday 6th October

There was a meeting today - link: https://wiki.egi.eu/wiki/Agenda-06-10-2014

  • EMI-WN 3.1.0 in SR: if anyone is running this in production please get in touch to help get this past rollout
  • MySQL 5.0 noted to be under Oracle Lifetime Sustaining Support (for some time now).
    • See agenda for guidance on middleware consequences
  • classads "retired" from EPEL repos
  • SL/SLC/CentOS 5 Support Lifetime
    • This was highlighted, though not suggested to be urgent
Monitoring - Links MyWLCG

Tuesday 18/11

On-duty - Dashboard ROD rota

Tuesday 11th November

  • Some minor issues with ROD Dashboard - quickly fixed.
  • Two unavailability tickets still open - issues dealt with.

Tuesday 28th October

  • AM reports a quiet shift. Dashboard not catching up earlier in the week but ok later on.

Rollout Status WLCG Baseline

Tuesday 11th November

  • UMD v.3.9.0 was released Monday 10th November. It supports Scientific Linux 5 and 6 and also Debian 6 (Squeeze).
  • As proposed during the October OMB, production EGI resource centres will be notified later in the month with a summary broadcast together with other communications, to reduce the number of broadcasts sent to sites.


Security - Incident Procedure Policies Rota

Tuesday 4th November

Tuesday 28th October

  • Note EGI-ADV-2014-10-28.

Tuesday 21st October

  • The IGTF has an update which introduced rather unexpected changes in the trust anchors used by Comodo for the TCS. There is now an additional set of SHA-2 intermediate CAs in addition to the old ones.

Services - PerfSonar dashboard | GridPP VOMS

- This includes notifying of (inter)national services that will have an outage in the coming weeks or will be impacted by work elsewhere. (Cross-check the Tier-1 update).

Tuesday 2nd December

  • perfSONAR 3.4 available (63%)
    • YES: Imperial; QMUL; RHUL; Lancaster; Liverpool; Manchester; Durham; Glasgow; Bristol; Cambridge; Oxford; RALPP (12)
    • NO: RAL T1; Brunel; UCL; Sheffield; ECDF; Birmingham; Sussex (7)

Tuesday 25th November

  • Check on perfSONAR instances upgraded to 3.4...
  • The next LHCOPN and LHCONE joint meeting will take place on Monday 9th and Tuesday 10th of February 2015 in Cambridge (UK), kindly hosted by Dante.

Tuesday 11th November

  • Target date for perfSONAR 3.4 upgrades is 8th December.

Monday 1st December 2014, 14.30 GMT
34 Open UK Tickets this month. Quite a few of them are from Duncan, asking sites to please reinstall their perfsonar hosts.

Simon F ticketed the CA concerning a possible problem with the ticket reminder system. JK has responded with a reply, and asked that similar tickets in the future use the helpdesk at support@grid-support.ac.uk rather than GGUS (and definitely don't use both!). He's looking into it at his end, and has asked Simon to check the spam filters. Assigned (should be In Progress?) (1/12) Update - in progress now, and Jens has been roped in to the ticket as well - there was a problem after all (see JK's email).

Duncan has reminded Matt RB to reinstall his Perfsonar with the latest release. Matt reckons he'll get to this the first half of this week. Nothing more to say. In Progress (26/11)

Another perfsonar ticket, Bristol's perfsonar seems ill, but Duncan gave the URL for the Sheffield perfsonar. Probably just a copy and paste error when he wrote the ticket though. In progress (26/11) Update - Winnie confirms that the perfsonar has been reinstalled, poked and prodded. Things are still off with the box, and the site firewall admins are being consulted - but if it isn't a firewall problem Winnie would appreciate assistance debugging the problem.

CMS pilots losing connection at Bristol. The Bristol admins are still looking at this, and the problems are still happening. They've asked some questions (which likely will need a ticket status switch), and have tried disabling IPv6 on their workers for the time being to cross another factor off the list. On Hold (27/11)

Duncan has also asked Birmingham to update their perfsonar boxen- no reply from Matt or Mark yet. Maybe they missed the ticket. Assigned (26/11)

Another request to upgrade perfsonar boxes. Gareth has replied, hopefully it'll get done this week. In progress (26/11)

The Edinburgh "please upgrade your Perfsonar" ticket. Wahid has replied with the ECDF stance on perfsonar, and put the ticket On Hold. On Hold (26/11)

ECDF's glexec tarball ticket. Same position as last month I'm afraid. On Hold (29/8)

Durham's perfsonar results going just plain weird. The Durham chaps have reinstalled their perfsonar, but as expected things are still odd. Hope to test a new routing arrangement later this week. Is that still on course? On hold (12/11)

Atlas have ticketed Manchester about the same issue again (see 110366), which boils down to lost files not being able to be declared lost due to the rucio migration. Not much that can be done Manchester side until the file deletion service is back up at full swing- On Hold the ticket? In progress (1/12)

A ticket for the voms service host at Manchester, detailing the change in VO manager for vo.helios-vo.eu. Bit of confusion with the new VO manager's certificate to be used for this, this ticket might need some shepherding, perhaps even On Holding if it gets too close to Christmas. In Progress (21/11)

Atlas have noticed that the Liverpool DPM has some kind of webdav access problem: browsing works but downloads don't. This was on purpose as a security measure, but John has enabled http access offsite from the disk nodes. There was some discussion in the ticket about http/https access within DPM, but I suspect this ticket is done unless these points need to be thrashed out a bit. In progress (26/11)

I upgraded my DPM to 1.8.9, and all I got was this ticket! Lancaster's failing the second half of the getTURL test due to what I believe is an incompatibility with the latest DPM version and the SAM tests (and I wasn't rolling back to pass nagios tests!). Waiting on a new set of tests to be rolled out. On Hold (1/12)

Lancaster's bad perfsonar performance ticket. No win after upgrading to the latest perfsonar, hope to run some other tests in the pre-Christmas quiet period.

Lancaster's glexec tarball ticket. No news - my hope is to work on this in the two-week pre-Christmas quiet period, same as our perfsonar problem. On Hold (14/11)

Atlas have noticed transfer problems to UCL. Ben is trying to investigate, and Wahid is lending a hand. In Progress (28/11)

UCL's "please reinstall your perfsonar" ticket. In progress (26/11)

Nagios ticket for UCL, concerning glexec test failures. Ben has replied that he is trying to debug their glexec installation. In progress (28/11)

UCL's glexec ticket. Ben's working on it, but the site got hit by problems last week. In progress (24/11)

Another atlas httpd access ticket, although this one is quite different from the Liverpool one as it appears they are trying from within a job. I don't think this has been noticed by the QM chaps yet. Assigned (25/11) Update - In Progress now, Dan's checking if https should be working. Elena has involved uk cloud support.

The not-really-a-QM problem snoplus/suse/srmcp ticket. We discussed how to handle this last week, but no news - it seems we're waiting for Matt M to re-engage? Waiting for reply (20/11)

Brunel's "please reinstall your perfsonar" ticket. Raul is on it. In progress (26/11)

The Jet LHCB job failure ticket. If ever there was a candidate for setting a ticket to unsolved, this is it. On Hold (1/10)

Our commercial cloud site's vmcatcher ticket. After Owen's help it looks like things are on the up, but the images still aren't being published. An interesting link was posted with the instructions how to do that. In progress (28/11)

CMS Pilots losing connectivity at RAL, sister to the Bristol ticket. Not much news, but Andrew L has a plan to discuss the problem with the HTCondor devs at CERN when he's there. On Hold (27/11)

Sno+ not being able to copy files out of RAL with the gfal tools. It appears to be a non-snoplus specific gfal problem. Perhaps an install problem with wrong versions of gfal2-utils? Andrew L is going to contact the gfal2 devs for help. On hold (26/11)

Inconsistent published BDII/SRM storage numbers. Has been discussed recently in the Ops meeting, a conversation is ongoing with the Castor devs about this, but there wasn't much noise from them at last check. The ticket could do with a mini-update, even if it's "nothing to see here, move along". On Hold (3/11)

Some CMS users having trouble with the RAL FTS REST web interface. Everything seems to be fixed now, so it looks like this ticket can be closed. In progress (27/11)

Duncan has ticketed the Tier 1 regarding not being able to access the LFC via his browser. Catalin confirmed that the problem was occurring for him for his non-dteam identities. Things seem to be working for Chris though. How goes it? In progress (27/11)

CMS glexec errors at the Tier 1. Andrew is back on the case, but needs to test things out first before rolling them out. In progress (27/11)

Another CMS ticket, this time AAA tests failing at RAL. Andrew L asked for the testing scripts so that RAL can test themselves - Duncan provided a link that will help point the way. In progress (26/11)

And the last ticket, the Tier 1's "please upgrade your perfsonar" ticket. In progress (26/11)

Tools - MyEGI Nagios

Tuesday 25th November

Backup SAM Nagios at Lancaster was upgraded to update-23 as part of the staged rollout process. It is a major upgrade, as some tests were removed from the CE and probes have moved from the SAM repository to the UMD3 repository.

Tests added:


Tests removed:


The release notes are available here: https://wiki.egi.eu/wiki/SAMUpdate23

Tuesday 21st October

VOMS servers for OPS based at CERN were down on Saturday 18th October for around 12 hours. Nagios tests started failing after the existing proxy expired. Availability figures will be slightly affected but the outage will be considered as unknown.

Blog about VO Nagios


Tuesday 16th Sep

  • Multi VO nagios maintained at Oxford has been upgraded to add ARC CE tests.
  • https://vo-nagios.physics.ox.ac.uk/nagios/
  • It is currently monitoring gridpp, pheno, t2k.org, snoplus.snolab.ca, vo.southgrid.ac.uk
  • Should we start monitoring it more actively and open tickets for sites failing tests?

VOs - GridPP VOMS VO IDs Approved VO table

Monday 24th November 2014

Tuesday 11th November 2014

  • Status of CERN@School data

Monday 3rd November 2014

  • Please update cvmfs-keys and VO_<VONAME>_SW_DIR
  • Working with SNO+ and Mark Slater on ganga job submission rate
  • Northgrid support on RAL WMS being checked
  • Gfal-copy and castor issue

Thursday 23 October 2014

  • CVMFS keys - new cvmfs-keys package cvmfs-keys-1.5
    • Part of decoupling of CVMFS from CERN - support for keys from various repositories
    • <voname>.gridpp.ac.uk -> <voname>.egi.eu
    • Please update and change VO_<VONAME>_SW_DIR to point to new directory
  • Impact
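
The repository renaming described above can be sketched in Python. This is an illustration only, not an official tool: the helper names and the VO name `myvo` are hypothetical, the `/cvmfs/<repository>` mount point is an assumption (the usual CVMFS default), and the rule that builds the variable name (upper-case, non-alphanumeric characters replaced by `_`) is also an assumption. Check your site's own configuration before relying on any of it.

```python
OLD_DOMAIN = "gridpp.ac.uk"
NEW_DOMAIN = "egi.eu"


def new_repository(old_repo: str) -> str:
    """Map <voname>.gridpp.ac.uk to <voname>.egi.eu; leave other names unchanged."""
    suffix = "." + OLD_DOMAIN
    if old_repo.endswith(suffix):
        return old_repo[: -len(suffix)] + "." + NEW_DOMAIN
    return old_repo


def sw_dir_line(voname: str, repo: str, mount_root: str = "/cvmfs") -> str:
    """Build a VO_<VONAME>_SW_DIR assignment (naming rule and mount root are assumptions)."""
    tag = "".join(c if c.isalnum() else "_" for c in voname.upper())
    return f"VO_{tag}_SW_DIR={mount_root}/{repo}"


# "myvo" is a placeholder VO name, not a real GridPP VO.
line = sw_dir_line("myvo", new_repository("myvo.gridpp.ac.uk"))
print(line)
```

Repositories outside the old GridPP domain pass through `new_repository` untouched, so the same helper can be run over a mixed list of repository names.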
Site Updates

Tuesday 2nd December

  • Multicore status. Queues available (63%)
    • YES: RAL T1; Brunel; Imperial; QMUL; Lancaster; Liverpool; Manchester; Glasgow; Cambridge; Oxford; RALPP; Sussex (12)
    • NO: RHUL (testing); UCL; Sheffield (testing); Durham; ECDF (testing); Birmingham; Bristol (7)
  • According to our table for cloud/VMs (26%)
    • YES: RAL T1; Brunel; Imperial; Manchester; Oxford (5)
    • NO: QMUL; RHUL; UCL; Lancaster; Liverpool; Sheffield; Durham; ECDF; Glasgow; Birmingham; Bristol; Cambridge; RALPP; Sussex (14)
  • GridPP DIRAC jobs successful (58%)
    • YES: Bristol; Glasgow; Lancaster; Liverpool; Manchester; Oxford; Sheffield; Brunel; IC; QMUL; RHUL (11)
    • NO: Cambridge; Durham; RALPP; RAL T1 (4) + ECDF; Sussex; UCL; Birmingham (4)
  • IPv6 status
    • Allocation - 42%
    • YES: RAL T1; Brunel; IC; QMUL; Manchester; Sheffield; Cambridge; Oxford (8)
    • NO: RHUL; UCL; Lancaster; Liverpool; Durham; ECDF; Glasgow; Birmingham; Bristol; RALPP; Sussex (11)
  • Dual stack nodes - 21%
    • YES: Brunel; IC; QMUL; Oxford (4)
    • NO: RHUL; UCL; Lancaster; Glasgow; Liverpool; Manchester; Sheffield; Durham; ECDF; Birmingham; Bristol; Cambridge; RALPP; Sussex; RAL T1 (15)
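
The percentages quoted above appear to be the YES count divided by all 19 sites in each list. A minimal Python sketch of that arithmetic follows, assuming rounding to the nearest whole percent. The helper name `share` is ours and not part of any GridPP tool, and the counts are copied by hand from the lists above (the DIRAC NO count is taken as 4 + 4 = 8).

```python
def share(yes: int, no: int) -> int:
    """Return the YES share of all sites as a whole-number percentage."""
    return round(100 * yes / (yes + no))


# (YES, NO) counts copied from the site lists above.
tallies = {
    "multicore": (12, 7),
    "cloud/VMs": (5, 14),
    "DIRAC jobs": (11, 8),
    "IPv6 allocation": (8, 11),
    "dual stack": (4, 15),
}

percentages = {name: share(yes, no) for name, (yes, no) in tallies.items()}
for name, pct in percentages.items():
    print(f"{name}: {pct}%")
```

Keeping the counts in one place like this makes it easy to spot a list whose entries no longer add up to the full set of sites.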

Tuesday 21st October

  • High loads seen in xroot by several sites: Liverpool and RALT1... and also Bristol (see Luke's TB-S email on 16/10 for questions about changes to help).

Tuesday 9th September

  • Intel announced the new generation of Xeon based on Haswell.

Meeting Summaries
Project Management Board - Members Minutes Quarterly Reports


GridPP ops meeting - Agendas Actions Core Tasks


RAL Tier-1 Experiment Liaison Meeting (Wednesday 13:30) Agenda Meeting takes place on Vidyo.

Wednesday 26th November 2014

  • Operations report
  • Some problems on Atlas Castor instance. At various times in the last couple of weeks the Atlas workload has led to differing groups of disk servers spending a lot of time in a "wait i/o" state. This is triggered by the numbers of reads using xroot and has led to some SAM test failures.
  • Provisional dates for safety testing of circuits in the machine room are Tues-Thu weeks 13-15 & 20-22 January '15. Services will be 'at risk' during this time.
  • Provisional dates announced for upgrades of Castor headnodes to SL6, starting with LHCb next Tuesday (2nd Dec).
WLCG Grid Deployment Board - Agendas MB agendas


NGI UK - Homepage CA


UK ATLAS - Shifter view News & Links






  • N/A
To note

  • N/A