Difference between revisions of "Operations Bulletin Latest"

From GridPP Wiki

Revision as of 09:55, 2 December 2014

Bulletin archive


Week commencing 1st December 2014
Task Areas
General updates

Tuesday 2nd December

  • WLCG Overview Board (OB) met on Friday 28th November. Ian Bird's status report gives a summary of resource usage, projections and current project directions (data preservation, RUN-2 preparations etc.).
  • There is an ATLAS jamboree 3rd-5th December 2014.
  • Certificate renewal email reminders were not working between 3rd November and 1st December. Nagios may have reminded you... but if not, contact John Kewley for the Nagios scripts.

EGI OMB - Thursday 27th November

  • Agenda
    • Actions: Do we want to pilot ARGUS instances?
    • OLA/SLA framework - any comment on the framework/docs?
    • Check the metrics
    • New VO requests: vo.chain-project.eu (FP7 project to encourage cross e-Infrastructure computing) and lagoproject.org (Astro; Space weather and radiation).
    • Recheck mon=N & prod=y status and tickets
    • EGI conference 18-22 May 2015 in Lisbon.
    • New docs: CVMFS replication from OSG; introducing clouds to EGI; CSIRT Certification procedure (check it).
    • Optimising ops communications - procedures/manuals to be updated
    • Early adopters needed for: FTS3 (now have CREAM-LSF, SQUID and CVMFS covered).
    • There will be a cloud-init webinar on 15th December.
    • Next OMB 18th December - any topics to propose?
  • perfSONAR status: 214 endpoints. Support unit in place. Testing central configuration service (for tests). 3 areas: network path, bandwidth and latency. Useful ESNET usage examples. 3.4 uses iptables, and a recent review gives guidance on additional ports that can be closed.
  • SAM/ARGO update: SAM update-23 progressing (probes to UMD; SAM-GridMon removed..) and in staged rollout. ops VOMS config changes: old ops voms decommissioned Wednesday 26th. (Still SL5).
  • EGI core activities: 17 services critical. OLAs in place for them. Talk reviewed the problems encountered with each service during year.
  • Update on longtail of science:

Monday 1st December

  • WLCG ops coordination team have launched a survey. Please could all GridPP sites respond to it by 19th December.

Tuesday 25th November

  • There was a WLCG ops meeting last Thursday. The agenda and minutes are linked here. Highlights for easier digestion:
  • News: ARGUS future workshop in December. WLCG survey pending web form update.
  • Middleware: WLCG repository is now signed.
  • Baselines: UMD 3.9.0 - gfal2, BDII, WN and UI updates.
  • MW issues: RHEL6.6 kernel fuse bug affecting CVMFS NFS installations. Sites with this type of installation are recommended not to upgrade to SL6.6. gridftp logging is too verbose on DPM 1.8.9 - wait before upgrading, as it also publishes webDAV while EGI ops probes SRM.
  • T0 & T1 updates: Some dCache upgrades to 2.10.10. CERN move to FTS v3.2.30.
  • Oracle: CERN upgrades ongoing.
  • T0 news: myproxy.cern.ch will be upgraded to 6.0-2 on Tuesday, November 25th. VO feedback sought on voms-admin test instance - move in January? voms.cern.ch and lcg-voms.cern.ch to be replaced by voms2.cern.ch and lcg-voms2.cern.ch on Wednesday November 26th.
  • Tier-1 news: NDGF-T1 2 new tape systems are getting deployed (Oslo and Copenhagen).
  • Tier-2 news: NTR
  • ALICE: High activity. RAL ARC CEs direct submission work in progress.
  • ATLAS: ATLAS Central Service status (migration to AI) ongoing. ProdSys2 and Rucio migration timeline agreed - ramping up but stopping ProdSys1, so a low number of jobs for the next ~2 weeks.
  • CMS: Various tests ongoing (VOMS; data transfer; tape...). Moving CRAB and central production into a single global Condor pool. Reminder of site config requests.
  • LHCb: MC and user jobs in the last 2 weeks. Stripping 21 validation revealed a problem and delayed the start of the campaign.
  • glexec: in PanDA testing ongoing - some issues.
  • Machine/job features: NTR
  • Middleware readiness: Following technical discussions between the MW Package Reporter and Pakiti developers and the WLCG and EGI Security responsibles, a technical solution of common agreement was adopted by which each site will be given the option to enable pakiti only, the Package Reporter or both. Thus security concerns are addressed and the site independence is respected. A release along these lines is expected during the 1st quarter 2015.
  • Multicore: Passing parameters to batch systems reviewed at GDB by Alessandra. Capabilities recorded in this table. A report on accounting given to MB by Alessandra. See recommendations.
  • SHA-2: the old servers can be used until November 26th, 14:00 UTC. The maximum proxy lifetime for the old servers will be as low as 2 days and by then the old VOMS ports will refuse connections.
  • IPv6: NTR
  • Squid Mon & HTTP proxy discovery: Alistair working on updates.
  • Network & transfer metrics: 107 instances updated to 3.4.1 following the WLCG and EGI broadcasts. Starting validation of 3.4.1 instances.
  • Next meeting 4th December.


Tier-1 - Status Page

Tuesday 24th November

  • There was a planned reboot of the site firewall this morning (24th). (There is a pair of firewalls, and it will fail over and back as each is rebooted.) This is expected to be transparent.

Storage & Data Management - Agendas/Minutes

Wedn 26 Nov

  • DIRAC: we probably need to understand DIRAC storage and data management better than "tried the tutorial and got the T shirt" - more next week - but then we need access to DIRAC resources!
  • Learning from non-LHC VOs: not just their data problems, but also success stories
  • WebDAV getting more widely supported - need to start testing more widely
  • Deletion rates revisited: old target no longer sufficient, needs revisiting

Wedn 19 Nov

  • Logs: chatty DPM 1.8.9, and elasticsearching logs.
  • Reports and other interesting things from workshops: cloud data transfer and sync, HEPiX, hepsysman.
  • RAID controllers for 36 bay nodes

Wedn 12 Nov

  • Update on CEPH with xroot. It works...



Accounting - UK Grid Metrics HEPSPEC06 Atlas Dashboard HS06

Tuesday 25th November

  • All sites approximately up-to-date.

Tuesday 11th November

  • A reminder to please update HEPSPEC06 figures with new equipment benchmarks.
  • Please check your GridPP metrics lines in Steve's tables and report any issues.

Tuesday 4th November

  • Sussex and Sheffield publishing issues still apparent.

Monday 27th October

  • Sites considering moving to HTCondor should be aware there are prototype APEL parsers in use for HTCondor so if you continue using CREAM as your CE then you can continue to use APEL accounting. The previous Condor parser for APEL was retired in EMI3 as there was no demand.


Documentation - KeyDocs

See the worst KeyDocs list for documents needing review now and the names of the responsible people.

Tuesday 4th Nov

  • New section in Wiki called "Project Management Pages". The idea is to cluster all Self-Edited Site Tracking Tables in here. Sites should keep entries in Current Activities up to date. Once a Self-Edited Site Tracking Table has served its purpose, PM to move it to Historical Archive or otherwise dispose of the table.


Tuesday 28th October

Tuesday 7th October

  • Keydocs were reviewed at the core-ops meeting last week. The situation with updates is improving.
  • Main GridPP website expected to use Wordpress with a plug-in to cover the gridsite aspects.


Interoperation - EGI ops agendas

Tuesday 21st October

    • URT:
    • dCache server v. 2.6.35 verified by WLCG as baseline
    • DPM 1.8.9 in EPEL-testing
    • SR: If sites have been using/testing EMI-WN 3.1.0 please get in touch to help with verification. They seem keen for people to test this.
    • New VOMS servers rollout: NGI SAMs being notified for reconfiguration as of yesterday.
    • MySQL 5.0 EOL campaign: note progress in agenda.


Monday 6th October

There was a meeting today - link: https://wiki.egi.eu/wiki/Agenda-06-10-2014

  • EMI-WN 3.1.0 in SR: if anyone is running this in production please get in touch to help get this past rollout
  • MySQL 5.0 noted to be under Oracle Lifetime Sustaining Support (for some time now).
    • See agenda for guidance on middleware consequences
  • classads "retired" from EPEL repos
  • SL/SLC/CentOS 5 Support Lifetime
    • This was highlighted, though not suggested to be urgent

Monitoring - Links MyWLCG

Tuesday 18/11

On-duty - Dashboard ROD rota

Tuesday 11th November

  • Some minor issues with ROD Dashboard - quickly fixed.
  • Two unavailability tickets still open - issues dealt with.

Tuesday 28th October

  • AM reports a quiet shift. Dashboard not catching up earlier in the week but ok later on.


Rollout Status WLCG Baseline

Tuesday 11th November

  • UMD v.3.9.0 was released Monday 10th November. It supports Scientific Linux 5 and 6 and also Debian 6 (Squeeze).
  • As proposed during the October OMB, production EGI resource centres will be notified later in the month with a summary broadcast, together with other communications, to reduce the number of broadcasts sent to sites.


References


Security - Incident Procedure Policies Rota

Tuesday 4th November

Tuesday 28th October

  • Note EGI-ADV-2014-10-28.

Tuesday 21st October

  • The IGTF has an update which introduced rather unexpected changes in the trust anchors used by Comodo for the TCS. There is now an additional set of SHA-2 intermediate CAs in addition to the old ones.


Services - PerfSonar dashboard | GridPP VOMS

- This includes notifying of (inter)national services that will have an outage in the coming weeks or will be impacted by work elsewhere. (Cross-check the Tier-1 update).

Tuesday 25th November

  • Check on perfSONAR instances upgraded to 3.4...
  • The next LHCOPN and LHCONE joint meeting will take place on Monday 9th and Tuesday 10th of February 2015 in Cambridge (UK), kindly hosted by Dante.

Tuesday 11th November

  • Target date for perfSONAR 3.4 upgrades is 8th December.

Tuesday 4th November

  • perfSONAR 3.4+ install/update instructions are ready. More details will be included in the WLCG broadcast to all sites planned for later today.

Tuesday 28th October

  • Have the perfSONAR 3.4 instructions/documentation been updated yet? Last week volunteers were sought at the...
  • perfSONAR operations meeting took place on 22nd October.
    • There is a recommendation for sites supporting IPv6 to deploy perfSONAR dual-stack.
    • Concerned about Tier-3s requesting to join Tier-2 meshes.
    • A network transfer metrics wiki page is available.


Tickets

Monday 1st December 2014, 14.30 GMT
34 Open UK Tickets this month. Quite a few of them are from Duncan, asking sites to please reinstall their perfsonar hosts.

THE CA
110484(1/12)
Simon F ticketed the CA concerning a possible problem with the ticket reminder system. JK has responded with a reply, and asked that similar tickets in the future use the helpdesk at support@grid-support.ac.uk rather than GGUS (and definitely don't use both!). He's looking into it at his end, and has asked Simon to check the spam filters. Assigned (should be In Progress?) (1/12)

SUSSEX
110389(26/11)
Duncan has reminded Matt RB to reinstall his Perfsonar with the latest release. Matt reckons he'll get to this the first half of this week. Nothing more to say. In Progress (26/11)

BRISTOL
110365(25/11)
Another perfsonar ticket, Bristol's perfsonar seems ill, but Duncan gave the URL for the Sheffield perfsonar. Probably just a copy and paste error when he wrote the ticket though. In progress (26/11)

106325(18/6)
CMS pilots losing connection at Bristol. The Bristol admins are still looking at this, and the problems are still happening. They've asked some questions (which likely will need a ticket status switch), and have tried disabling IPv6 on their workers for the time being to cross another factor off the list. On Hold (27/11)

BIRMINGHAM
110388(26/11)
Duncan has also asked Birmingham to update their perfsonar boxen- no reply from Matt or Mark yet. Maybe they missed the ticket. Assigned (26/11)

GLASGOW
110387(26/11)
Another request to upgrade perfsonar boxes. Gareth has replied, hopefully it'll get done this week. In progress (26/11)

ECDF
110386(26/11)
The Edinburgh "please upgrade your Perfsonar" ticket. Wahid has replied with the ECDF stance on perfsonar, and put the ticket On Hold. On Hold (26/11)

95303(1/7/13)
ECDF's glexec tarball ticket. Same position as last month I'm afraid. On Hold (29/8)

DURHAM
108273(5/9)
Durham's perfsonar results going just plain weird. The Durham chaps have reinstalled their perfsonar, but as expected things are still odd. Hope to test a new routing arrangement later this week. Is that still on course? On hold (12/11)

MANCHESTER
110457(29/11)
Atlas have ticketed Manchester about the same issue again (see 110366), which boils down to lost files not being able to be declared lost due to the rucio migration. Not much that can be done Manchester side until the file deletion service is back up at full swing- On Hold the ticket? In progress (1/12)

110225(18/11)
A ticket for the voms service host at Manchester, detailing the change in VO manager for vo.helios-vo.eu. Bit of confusion with the new VO manager's certificate to be used for this, this ticket might need some shepherding, perhaps even On Holding if it gets too close to Christmas. In Progress (21/11)

LIVERPOOL
110391(26/11)
Atlas have noticed that the Liverpool DPM has some kind of webdav access problem: browsing worked but downloads didn't. This was on purpose as a security measure, but John has enabled http access offsite from the disk nodes. There was some discussion in the ticket about http/https access within DPM, but I suspect this ticket is done unless these points need to be thrashed out a bit. In progress (26/11)

LANCASTER
110482(1/12)
I upgraded my DPM to 1.8.9, and all I got was this ticket! Lancaster's failing the second half of the getTURL test due to what I believe is an incompatibility with the latest DPM version and the SAM tests (and I wasn't rolling back to pass nagios tests!). Waiting on a new set of tests to be rolled out. On Hold (1/12)

100566(27/1)
Lancaster's bad perfsonar performance ticket. No win after upgrading to the latest perfsonar, hope to run some other tests in the pre-Christmas quiet period.

95299(1/7/2013)
Lancaster's glexec tarball ticket. No news - my hope is to work on this in the two week pre-Christmas quiet period, same as our perfsonar problem. On Hold (14/11)

UCL
110442(28/11)
Atlas have noticed transfer problems to UCL. Ben is trying to investigate, and Wahid is lending a hand. In Progress (28/11)

110384(26/11)
UCL's "please reinstall your perfsonar" ticket. In progress (26/11)

110358(25/11)
Nagios ticket for UCL, concerning glexec test failures. Ben has replied that he is trying to debug their glexec installation. In progress (28/11)

95298(1/7/13)
UCL's glexec ticket. Ben's working on it, but the site got hit by problems last week. In progress (24/11)

QMUL
110353(25/11)
Another atlas httpd access ticket, although this one is quite different from the Liverpool one as it appears they are trying from within a job. I don't think this has been noticed by the QM chaps yet. Assigned (25/11)

107880(26/8)
The not-really-a-QM problem snoplus/suse/srmcp ticket. We discussed how to handle this last week, but no news - it seems we're waiting for Matt M to re-engage? Waiting for reply (20/11)

BRUNEL
110383(26/11)
Brunel's "please reinstall your perfsonar" ticket. Raul is on it. In progress (26/11)

EFDA-JET
97485(21/9/13)
The Jet LHCB job failure ticket. If ever there was a candidate for setting a ticket to unsolved, this is it. On Hold (1/10)

100IT
108356(10/9)
Our commercial cloud site's vmcatcher ticket. After Owen's help it looks like things are on the up, but the images still aren't being published. An interesting link was posted with the instructions how to do that. In progress (28/11)

THE TIER 1
106324(18/6)
CMS Pilots losing connectivity at RAL, sister to the Bristol ticket. Not much news, but Andrew L has a plan to discuss the problem with the HTCondor devs at CERN when he's there. On Hold (27/11)

109694(28/10)
Sno+ not being able to copy files out of RAL with the gfal tools. It appears to be a non-snoplus specific gfal problem. Perhaps an install problem with wrong versions of gfal2-utils? Andrew L is going to contact the gfal2 devs for help. On hold (26/11)

107935(27/8)
Inconsistent published BDII/SRM storage numbers. Has been discussed recently in the Ops meeting, a conversation is ongoing with the Castor devs about this, but there wasn't much noise from them at last check. The ticket could do with a mini-update, even if it's "nothing to see here, move along". On Hold (3/11)

109276(11/10)
Some CMS users having trouble with the RAL FTS REST web interface. Everything seems to be fixed now, so it looks like this ticket can be closed. In progress (27/11)

110397(26/11)
Duncan has ticketed the Tier 1 regarding not being able to access the LFC via his browser. Catalin confirmed that the problem was occurring for him for his non-dteam identities. Things seem to be working for Chris though. How goes it? In progress (27/11)

109712(29/10)
CMS glexec errors at the Tier 1. Andrew is back on the case, but needs to test things out first before rolling them out. In progress (27/11)

108944(1/10)
Another CMS ticket, this time AAA tests failing at RAL. Andrew L asked for the testing scripts so that RAL can test themselves - Duncan provided a link that will help point the way. In progress (26/11)

110382(26/11)
And the last ticket, the Tier 1's "please upgrade your perfsonar" ticket. In progress (26/11)


Tools - MyEGI Nagios

Tuesday 25th November

Backup SAM Nagios at Lancaster was upgraded to update-23 as part of the staged rollout process. It is a major upgrade, as some tests were removed from the CE and probes have moved from the SAM repository to the UMD3 repository.

Tests added:

   ch.cern.FTS3-Service
   ch.cern.FTS3-StalledTransfers
   org.bdii.GLUE2-Validate 

Tests removed:

   org.nordugrid.ARC-CE-LFC-result
   org.nordugrid.ARC-CE-lfc
   org.nordugrid.ARC-CE-LFC-submit
   org.sam.WN-RepDel
   org.sam.WN-RepISenv
   org.sam.WN-RepFree
   org.sam.WN-RepCr
   org.sam.WN-RepGet
   org.sam.WN-RepRep
   org.sam.WN-Rep 

The release note is available here: https://wiki.egi.eu/wiki/SAMUpdate23


Tuesday 21st October

VOMS servers for OPS based at CERN were down on Saturday 18th October for around 12 hours. Nagios tests started failing after the existing proxy expired. Availability figures will be slightly affected, but the outage will be considered as unknown.

Blog about VO Nagios

http://southgrid.blogspot.co.uk/2014/10/nagios-monitoring-for-non-lhc-vos.html


Tuesday 16th Sep

  • Multi VO nagios maintained at Oxford has been upgraded to add ARC CE tests.
  • https://vo-nagios.physics.ox.ac.uk/nagios/
  • It is currently monitoring gridpp, pheno, t2k.org, snoplus.snolab.ca, vo.southgrid.ac.uk
  • Should we start monitoring it more actively and open tickets for sites failing tests?


VOs - GridPP VOMS VO IDs Approved VO table

Monday 24th November 2014

Tuesday 11th November 2014

  • Status of CERN@School data

Monday 3rd November 2014

  • Please update cvmfs-keys and VO_<VONAME>_SW_DIR
  • Working with SNO+ and Mark Slater on ganga job submission rate
  • Northgrid support on RAL WMS being checked
  • Gfal-copy and castor issue


Thursday 23 October 2014

  • CVMFS keys - new cvmfs-keys package cvmfs-keys-1.5
    • Part of decoupling of CVMFS from CERN - support for keys from various repositories
    • <voname>.gridpp.ac.uk -> <voname>.egi.eu
    • Please update and change VO_<VONAME>_SW_DIR to point to new directory
  • Impact
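
As a rough illustration of the repository rename noted on 23 October, the sketch below rewrites a repository name from the gridpp.ac.uk domain to egi.eu and builds a matching software directory. The example repository name, the helper names, and the assumption that VO_<VONAME>_SW_DIR should point at /cvmfs/<repository> are illustrative only and are not taken from the bulletin; check your own site configuration.

```python
OLD_SUFFIX = ".gridpp.ac.uk"
NEW_SUFFIX = ".egi.eu"


def migrate_repo(repo: str) -> str:
    """Return the egi.eu name for a gridpp.ac.uk repository (hypothetical helper)."""
    if repo.endswith(OLD_SUFFIX):
        return repo[: -len(OLD_SUFFIX)] + NEW_SUFFIX
    # Names that are not under gridpp.ac.uk are left unchanged.
    return repo


def sw_dir(repo: str) -> str:
    """Assumed software directory for a repository: /cvmfs/<new repository name>."""
    return "/cvmfs/" + migrate_repo(repo)


# "myvo" is a placeholder VO name, not a real repository.
print(sw_dir("myvo.gridpp.ac.uk"))
```

A site could run something like this over its list of supported VOs to see which VO_<VONAME>_SW_DIR values need changing.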
Site Updates

Tuesday 2nd December

  • perfSONAR 3.4 available (63%)
    • YES: Imperial; QMUL; RHUL; Lancaster; Liverpool; Manchester; Durham; Glasgow; Bristol; Cambridge; Oxford; RALPP (12)
    • NO: RAL T1; Brunel; UCL; Sheffield; ECDF; Birmingham; Sussex (7)
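
The 63% figure is the YES count over all sites listed. A minimal Python sketch of that tally, using only the site names from the two lists (the variable names are illustrative):

```python
# Sites taken from the YES / NO lists in this update.
upgraded = [
    "Imperial", "QMUL", "RHUL", "Lancaster", "Liverpool", "Manchester",
    "Durham", "Glasgow", "Bristol", "Cambridge", "Oxford", "RALPP",
]
pending = [
    "RAL T1", "Brunel", "UCL", "Sheffield", "ECDF", "Birmingham", "Sussex",
]

total = len(upgraded) + len(pending)
percent = round(100 * len(upgraded) / total)

print(f"perfSONAR 3.4 available: {len(upgraded)}/{total} ({percent}%)")
```

That is 12 of 19 sites, which rounds to 63%. Moving a site from one list to the other as tickets close keeps the headline figure in step with the lists.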


Tuesday 21st October

  • High loads seen in xroot by several sites: Liverpool and RALT1... and also Bristol (see Luke's TB-S email on 16/10 for questions about changes to help).

Tuesday 9th September

  • Intel announced the new generation of Xeon based on Haswell.



Meeting Summaries
Project Management Board - Members Minutes Quarterly Reports

Empty

GridPP ops meeting - Agendas Actions Core Tasks

Empty


RAL Tier-1 Experiment Liaison Meeting (Wednesday 13:30) Agenda Meeting takes place on Vidyo.

Wednesday 26th November 2014

  • Operations report
  • Some problems on Atlas Castor instance. At various times in the last couple of weeks the Atlas workload has led to differing groups of disk servers spending a lot of time in a "wait i/o" state. This is triggered by the numbers of reads using xroot and has led to some SAM test failures.
  • Provisional dates for safety testing of circuits in the machine room are Tues-Thu in two weeks: 13-15 & 20-22 January '15. Services will be 'at risk' during this time.
  • Provisional dates announced for upgrades of Castor headnodes to SL6, starting with LHCb next Tuesday (2nd Dec).

WLCG Grid Deployment Board - Agendas MB agendas

Empty



NGI UK - Homepage CA

Empty

Events
UK ATLAS - Shifter view News & Links

Empty

UK CMS

Empty

UK LHCb

Empty

UK OTHER
  • N/A

To note

  • N/A