Difference between revisions of "Operations Bulletin Latest"

From GridPP Wiki
 
===== =====
 
<!-- ******************Edit start********************* ----->
 
'''Monday 6th October 2014, 14.30 BST'''<br />
 
On top of the lovely tickets, there was a discussion in the Ops team last week where it was mentioned that it would be handy to look at how sites were doing on the VO nagios, so I thought I'd go over that here.
 
 
[https://vo-nagios.physics.ox.ac.uk/nagios/cgi-bin/status.cgi?host=all&servicestatustypes=16&hoststatustypes=15 VO Nagios]
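For anyone wondering what the numbers in that link do: they look like the usual Nagios status.cgi filter bitmasks. The sketch below decodes them, assuming the standard Nagios Core flag values (the two tables are that assumption and are worth checking against the installed version; the function names are just for illustration).

```python
from urllib.parse import urlsplit, parse_qs

# The VO Nagios link from this bulletin.
URL = ("https://vo-nagios.physics.ox.ac.uk/nagios/cgi-bin/status.cgi"
       "?host=all&servicestatustypes=16&hoststatustypes=15")

# Assumed flag values (standard Nagios Core CGI conventions).
SERVICE_FLAGS = {1: "PENDING", 2: "OK", 4: "WARNING", 8: "UNKNOWN", 16: "CRITICAL"}
HOST_FLAGS = {1: "PENDING", 2: "UP", 4: "DOWN", 8: "UNREACHABLE"}


def decode(mask, table):
    """Return the names of all flags set in the bitmask."""
    return [name for bit, name in table.items() if mask & bit]


def describe(url):
    """Summarise which hosts and service states a status.cgi link selects."""
    query = parse_qs(urlsplit(url).query)
    return {
        "host": query["host"][0],
        "services": decode(int(query["servicestatustypes"][0]), SERVICE_FLAGS),
        "hosts": decode(int(query["hoststatustypes"][0]), HOST_FLAGS),
    }


if __name__ == "__main__":
    print(describe(URL))
```

If those values hold, the link shows only services in a CRITICAL state, on hosts in any state.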
 
 
Sites that seem to be having trouble on one or more of their nodes at the time of writing are:<br />
 
Durham: pheno and gridpp<br />
 
Lancaster: pheno and gridpp<br />
 
Sussex: snoplus<br />
 
EFDA-JET: gridpp, pheno, southgrid<br />
 
Liverpool: gridpp, snoplus<br />
 
Sheffield: gridpp, snoplus<br />
 
QMUL: t2k.org<br />
 
TIER 1: snoplus and t2k<br />
 
Although only Lancaster, Sheffield and the Tier 1 seem to be having really long term problems.
 
 
(I'm still trying to think how best to parse this information, so my apologies that it's poorly presented).
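One possible way to tabulate the list above, as a minimal sketch rather than anything agreed: read the "Site: VOs" lines into a dictionary and flip it round so the failures are grouped by VO. The function names are made up for illustration, the input is just the lines above with the markup dropped, and VO names are kept exactly as written (so t2k and t2k.org stay separate).

```python
import re
from collections import defaultdict

# The site list from this bulletin, one "Site: VOs" entry per line.
RAW = """\
Durham: pheno and gridpp
Lancaster: pheno and gridpp
Sussex: snoplus
EFDA-JET: gridpp, pheno, southgrid
Liverpool: gridpp, snoplus
Sheffield: gridpp, snoplus
QMUL: t2k.org
TIER 1: snoplus and t2k
"""


def parse_site_lines(text):
    """Return {site: [vo, ...]} from 'Site: vo1, vo2 and vo3' lines."""
    sites = {}
    for line in text.splitlines():
        if ":" not in line:
            continue  # skip anything that is not a 'Site: ...' entry
        site, vos = line.split(":", 1)
        # VOs are separated by commas and/or the word "and".
        names = [v.strip() for v in re.split(r",|\band\b", vos) if v.strip()]
        sites[site.strip()] = names
    return sites


def by_vo(sites):
    """Invert {site: [vo]} into {vo: [site]}, keeping the site order."""
    grouped = defaultdict(list)
    for site, vos in sites.items():
        for vo in vos:
            grouped[vo].append(site)
    return dict(grouped)


if __name__ == "__main__":
    for vo, names in sorted(by_vo(parse_site_lines(RAW)).items()):
        print(f"{vo}: {', '.join(names)}")
```

Grouping by VO makes it easier to spot whether a failure is likely down to the site or to the VO itself.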
 
 
On to the tickets.
 
 
Only 24 open UK tickets this month (organised by site).
 
 
'''SUSSEX'''<br />
 
[https://ggus.eu/index.php?mode=ticket_info&ticket_id=108765 108765](24/9)
 
Sussex have a ROD ticket, originating from a glue validation error (although it's just picked up some SHA-2 failures). Matt RB was away though, so not much progress - Matt can you get to it this week? In progress (3/10)
 
 
'''RALPP'''<br />
 
[https://ggus.eu/index.php?mode=ticket_info&ticket_id=109115 109115](6/10)<br />
 
A fresh ticket from cms, complaining that RALPP don't have any backup squids listed in their site xml file. Assigned (6/10) and closed on (7/10) as the site name was old (the old one being too long!).
 
 
'''BRISTOL'''<br />
 
[https://ggus.eu/index.php?mode=ticket_info&ticket_id=106325 106325](18/6)<br />
 
CMS pilots losing network connectivity. CMS have confirmed that it is only a subset of the Bristol clusters seeing pilots dropping connections. Winnie has continued to poke and prod this, and between her and CMS they've (more or less) ruled out natting as the cause of the problem. Bristol are still quite stuck, and kind of hoping some unrelated network tweaks might sweep this issue away. On Hold (2/10)
 
 
'''ECDF'''<br />
 
[https://ggus.eu/index.php?mode=ticket_info&ticket_id=95303 95303](1/7/13)<br />
 
tarball glexec deployment - see Lancaster entry on the same issue. On hold (29/8)
 
 
'''DURHAM'''<br />
 
[https://ggus.eu/index.php?mode=ticket_info&ticket_id=108273 108273](5/9)<br />
 
Durham experienced a sudden, odd change in their perfsonar results (outbound bandwidth went up, inbound dropped). The Durham chaps were looking into this but were interrupted by this shellshock business. Oliver has included some long term plans in the ticket and will update it again when they have their perfsonar back. On hold (6/10)
 
 
'''SHEFFIELD'''<br />
 
[https://ggus.eu/index.php?mode=ticket_info&ticket_id=108716 108716](23/9)<br />
 
Snoplus jobs not running at Sheffield. Elena had to bash one of her CEs into shape, but it should be fixed now and she has asked Matt M if he still sees a problem. Waiting for reply (6/10)
 
 
'''MANCHESTER'''<br />
 
[https://ggus.eu/index.php?mode=ticket_info&ticket_id=109001 109001](2/10)<br />
 
Not quite a site problem, but David M was having trouble committing to the SVN hosted at Manchester (and a reminder that I believe the "official" way of reporting problems with these services is to ticket the site). It looks like this has been solved and the ticket can probably be closed. In progress (3/10)
 
 
[https://ggus.eu/index.php?mode=ticket_info&ticket_id=109049 109049](4/10)<br />
 
Atlas transfer problems - the underlying issue being a downed (and dead) disk server. Alessandra is doing the lost file declaration stuff and offered to provide lists of these files to the users directly. Not much more that Manchester can do. In progress (6/10)
 
 
'''LANCASTER'''<br />
 
[https://ggus.eu/index.php?mode=ticket_info&ticket_id=100566 100566](27/1)<br />
 
Poor, unexplained perfsonar performance. Although some ideas have been put forward on how to tackle this, holidays and then shellshock have got in the way of implementing them. On hold (1/10)
 
 
[https://ggus.eu/index.php?mode=ticket_info&ticket_id=108715 108715](23/9)<br />
 
Sno+ jobs not running at Lancaster. Hopefully after a tweak to the information system on our CEs I fixed this - as Duncan pointed out things are looking okay on the VO nagios. I've asked Matt M how things are looking for "real" Sno+ work. Waiting for reply (1/10)
 
 
[https://ggus.eu/index.php?mode=ticket_info&ticket_id=95299 95299](1/7/13)<br />
 
tarball glexec ticket. As mentioned in last week's Ops meeting, due to holidays there has been no progress over the last month but things look hopeful. On hold (9/9)
 
 
'''UCL'''<br />
 
[https://ggus.eu/index.php?mode=ticket_info&ticket_id=95298 95298](1/7/13)<br />
 
Non-tarball glexec ticket. Ben's been trying to install this, but having dependency troubles - did anyone who uses rpms notice this when they last tried to install the glexec WN? In progress (29/9)
 
 
[https://ggus.eu/index.php?mode=ticket_info&ticket_id=109039 109039](3/10)<br />
 
Another Glue2 validation ROD ticket. In progress (3/10)
 
 
'''IMPERIAL'''<br />
 
[https://ggus.eu/index.php?mode=ticket_info&ticket_id=108723 108723](23/9)<br />
 
Chris W has ticketed Imperial with a few dirac file catalogue queries. Duncan responded with some documentation that others might also find useful and some other information. I believe the ticket is now waiting for feedback from Chris (who may in turn be waiting for feedback from the other VO user groups). Waiting for reply (1/10)
 
 
'''EFDA-JET'''<br />
 
[https://ggus.eu/index.php?mode=ticket_info&ticket_id=108735 108735](23/9)<br />
 
biomed have asked that JET activate the biomed cvmfs repo at their site. Ticket seen but no news or action. In progress (23/9)
 
 
[https://ggus.eu/index.php?mode=ticket_info&ticket_id=97485 97485](21/9/13)<br />
 
One of the ancient tickets. LHCB having authentication errors at Jet. No change. On hold (1/10)
 
 
[https://ggus.eu/index.php?mode=ticket_info&ticket_id=109080 109080](6/10)<br />
 
A fresh ROD ticket about a number of alarms - at first glance I would say a certificate has expired. In progress (6/10)
 
 
'''100IT'''<br />
 
[https://ggus.eu/index.php?mode=ticket_info&ticket_id=108356 108356](10/9)<br />
 
VM images from fedcloud.egi.eu not available at 100IT. This ticket showed up an issue with creating an AppDB profile, but that has since been solved. No news on the state of this ticket other than that the issue persists. In progress (1/10)
 
 
'''THE TIER 1'''<br />
 
[https://ggus.eu/index.php?mode=ticket_info&ticket_id=107935 107935](27/8)<br />
 
"BDII vs SRM inconsistent storage capacity numbers". No news on this for a long time. This ticket really could do with some love (or at least on holding!). In progress (3/9)
 
 
[https://ggus.eu/index.php?mode=ticket_info&ticket_id=106324 106324](18/6)<br />
 
CMS pilots losing connection, similar to the Bristol ticket. The issue has been tracked to being *something* in the Tier 1's internal network after comparing firewall rules to RALPP. CMS have updated the ticket with some more information and some nice plots, but the long and the short of it is the problem persists. In progress (1/10)
 
 
[https://ggus.eu/index.php?mode=ticket_info&ticket_id=108546 108546](16/9)<br />
 
atlas seeing failures on the RAL-LCG2_HIMEM_SL6 queue. Ticket in an odd state - the atlas shifters seem to think the problem was transient but Gareth and co are seeing a lot of load on diskservers despite nothing on BigPanDA. The RAL team is keeping an eye, but this ticket could do with some updates/on holding in the meantime. In progress (22/9)
 
 
[https://ggus.eu/index.php?mode=ticket_info&ticket_id=107880 107880](26/8)<br />
 
Sno+ asking RAL for help/alternatives with srmcping for a small group of seemingly awkward SUSE-using users. Some input from others but not much word from Sno+ or the Tier 1 - Chris, could you please take a peek with your small VO hat on? In progress (30/9)
 
 
[https://ggus.eu/index.php?mode=ticket_info&ticket_id=108944 108944](1/10)<br />
 
CMS running into a lot of "file not found" errors when running an AAA check at RAL, and asking if things are alright. When looking over the whole Castor namespace it appears that all files are present and correct, which doesn't explain why CMS had trouble finding them. In progress (1/10)
 
 
[https://ggus.eu/index.php?mode=ticket_info&ticket_id=108845 108845](27/9)<br />
 
Atlas seeing gridftp timeouts. This looks to be a hotspot problem (at this point in the review I'm just skim reading tickets). Atlas also report seeing deletion errors, and have included links. I'm not sure if this ticket will be impacted by this afternoon's Castor intervention. Still very much In Progress (5/10)
 
  
  

Revision as of 13:30, 13 October 2014

Bulletin archive


Week commencing 13th October 2014
Task Areas
General updates

Monday 13th October

Tuesday 7th October

  • A reminder to please share your work with everyone via blog posts. In the core-ops meeting it was suggested that there be an incentive... we'll consider that!
  • Ewan will take a closer look at the middleware package reporter (the Pakiti contender... or ally).
  • Matt will be (trialing) following up on VO Nagios errors from GridPP Nagios.
  • There is an IPv6 quarterly meeting this week.
  • There is a GDB tomorrow 8th October.
  • GridPP collaboration meeting now scheduled for April 28th to 30th 2015

Tuesday 30th September

WLCG Operations Coordination - Agendas

Tuesday 14th October

Tuesday 7th October

  • There was a WLCG ops coordination meeting last Thursday. (Agenda: Minutes). Some notes follow...
  • News: HEP_OSlibs-7.0.0-0.el7.cern.x86_64.rpm for CentOS7 has been released; CHEP 2015 15th October [s://indico.cern.ch/event/304944/call-for-abstracts/ abstract deadline] approaching; comments on Shellshock.
  • MW baselines: New version of the UI and WN estimated for next UMD end October; dCache 2.2.x decommissioning deadline is 31-10-2014
  • MW issues: xroot package deployed with ROOT 6 breaks access to dCache storage, affecting LHCb. Fix coming. CREAM, WMS, L&B, UI, WN cannot be installed at the moment because the classads package ( dependency for all of them ) was declared an orphan in EPEL!
  • T0 & T1 updates: Mainly SE upgrades
  • Oracle: Upgrade plans updated.
  • T0 news: WMSes decommissioned 1st October. Lxplus5 will be stopped in October; AFS UI (removal) discussion ongoing.
  • T1 feedback: NTR
  • T2 feedback: NTR
  • ALICE: Investigation of job failure rates and inefficiencies; HLT farm running as an ALICE site since Sep 24.
  • ATLAS: DC14 ongoing. Multi-core recommendation: 16GB physical memory per job. Serial production tasks in future will be limited. ARC-CE tests in ATLAS-CRITICAL from 1st October.
  • CMS: Scale testing of HTCondor and GlideinWMS by OSG - various issues. Reminder: Participate in space monitoring; Update xrootd fallback configuration.
  • LHCb: dCache storage sites broken when accessed by ROOT6/xrootd; new stripping campaign is currently being prepared; testing new VOMS.
  • glexec: NTR
  • Machine job features: NTR
  • MW readiness: Meeting on 1st October. DPM, CREAM and BDII verification exercises. MW package reporter development. Next meeting 19th November.
  • Multi-core: 50M events/daily for ATLAS. Continue deployment.
  • SHA-2: Testing new VOMS for each experiment.
  • WMS decommissioning: with the deployment of the Condor SAM probes nothing is using WMS anymore. Machines off. WG will end.
  • IPv6: LHCbDIRAC tested and working
  • Squid monitoring/HTTP proxy: NTR
  • Network & Tmetrics WG: Shellshock & perfSONAR news. PS 3.4 coming.


Tier-1 - Status Page

Tuesday 7th October

  • Access for all VOs to our CREAM CEs has been stopped (apart from ALICE and SNO+).
  • We are currently experiencing a problem with a disk array that holds the Castor databases. Castor performance may be degraded and we await an engineer to fix the faulty array.

Storage & Data Management - Agendas/Minutes

Wedn 01 Oct

  • Summary of all the exciting events in Amsterdam last week - EUDAT, EGI big data, RDA
  • DPM 1.8.9 early testing, and (separately) xroot4 early-ish testing. Supporting multiple VOs in one xroot server.

Wedn 17 Sept

  • iRODS - what it is and why it should choose to collapse on Betelgeuse 7.
  • Technical problems with Vidyo

Wedn 10 Sept.

  • High load at L'pool causing low throughput - how to throttle xroot transfers (and is the load necessary or a bug?)
  • Still testing WebFTS
  • Prep for DPM workshop


Accounting - UK Grid Metrics HEPSPEC06 Atlas Dashboard HS06

Tuesday 7th October

  • GridPP metrics need updating for CMS. Any comments on the metrics page at the moment?
  • APEL issues for Birmingham and Sussex, and the portal appears to stop at 1st October (being followed-up).

Tuesday 30th September

  • Slight delay for Birmingham and Sussex.

Tuesday 23rd September

  • Slight APEL delay for Birmingham.


Documentation - KeyDocs

See the worst KeyDocs list for documents needing review now and the names of the responsible people.

Tuesday 7th October

  • Keydocs were reviewed at the core-ops meeting last week. The situation with updates is improving.
  • Main GridPP website expected to use Wordpress with a plug-in to cover the gridsite aspects.


Interoperation - EGI ops agendas

Monday 6th October

There was a meeting today - link: https://wiki.egi.eu/wiki/Agenda-06-10-2014

  • EMI-WN 3.1.0 in SR: if anyone is running this in production please get in touch to help get this past rollout
  • MySQL 5.0 noted to be under Oracle Lifetime Sustaining Support (for some time now).
    • See agenda for guidance on middleware consequences
  • classads "retired" from EPEL repos
  • SL/SLC/CentOS 5 Support Lifetime
    • This was highlighted, though not suggested to be urgent

Monitoring - Links MyWLCG

Tuesday 30th September

  • Monitoring meeting last Friday, link : https://indico.cern.ch/event/341748/ , minutes: https://indico.cern.ch/event/341748/material/minutes/minutes.html
    • Of note, we had identified a couple of differences between SAM2/3 where entries were appearing in SAM3 which had not been in SAM2. This is because they were being picked up from the vofeed - from the minutes, "Pablo and Maarten confirm that the VOfeed is the only authoritative source of topology; agreed that all services in the VOfeed will be tested and their availability will be calculated; agreed to add a new attribute to VOfeed to flag which services should be excluded from the official reports."

Tuesday 23rd September

  • Next (replacement) meeting this Friday to continue discussions.

On-duty - Dashboard ROD rota

Monday 29th September

  • Rota being updated.

Monday 22nd September

  • Quiet week - little to report.
  • EGI is looking for people to join the ops portal review and testing TF.

Tuesday 2nd September

  • Sussex is back in business - kept closing their low availability alarm wrt the GGUS ticket.
  • The UCL ticket is now finally receiving some attention.
  • Ongoing problems at RAL.

Tuesday 26th August

  • RAL : Nagios jobs staying in queue for long time - to be investigated.
  • Sussex : Matt needs help probably from some SGE experts.
  • UCL : No acknowledgement from the site (ticket escalated to second level).
  • 100IT : There is an alarm from EGI federated cloud - this needs discussion.
  • Durham : Availability alarms - require constant closing with some comments. Ticket with devs is open.

Tuesday 12th August

  • Last week was quiet.
  • Still one or two responses needed for next rota allocations.


Rollout Status WLCG Baseline

Tuesday 26th August

Monday 28th July


References


Security - Incident Procedure Policies Rota

Tuesday 14th October

  • Shellshock updates and follow-up
  • Banning challenge status

Tuesday 30th September

  • Shellshock - advice and follow-up.
  • Note particularly the advisories from WLCG/EGI.

Tuesday 23rd September

  • High priority vs critical tests in pakiti.
  • FAX update



Services - PerfSonar dashboard | GridPP VOMS

- This includes notifying of (inter)national services that will have an outage in the coming weeks or will be impacted by work elsewhere. (Cross-check the Tier-1 update).

Monday 13th October

  • perfSONAR 3.4 has been released. Clear documentation on what to do (clean reinstall) coming this week together with information on mesh updates. See the GDB presentation slides 13 and 14.
  • RIPE have sent a reminder to connect probes that have been handed out (some weeks ago now). Please could the following sites check their status: Lancaster; Brunel; Sussex; and ECDF. 20599 at RAL has never properly connected (DHCP issue?).

Tuesday 7th October

Tuesday 23rd September

Tickets
Tools - MyEGI Nagios

Tuesday 16th Sep

  • Multi VO nagios maintained at Oxford has been upgraded to add ARC CE tests.
  • https://vo-nagios.physics.ox.ac.uk/nagios/
  • It is currently monitoring gridpp, pheno, t2k.org, snoplus.snolab.ca, vo.southgrid.ac.uk
  • Should we start monitoring it more actively and open tickets for sites failing tests?


VOs - GridPP VOMS VO IDs Approved VO table

Monday 11th August

  • Steve J sent an email to hyperk on 7th regarding "software directory for Hyperk (CVMFS)" and entries in the VO ID card.

Monday 14th July 2014

  • HyperK.org will initially use remote storage (irods at QMUL) - so CPU resources would be appreciated.

Monday 30th June 2014

  • HyperK.org request for support from other sites
    • 2TB storage requested.
    • CVMFS required
  • Cernatschool.org
    • WebDAV access to storage - world read works at QMUL.
    • ideally will configure federated access with DFC as LFC allows.


Site Updates

Tuesday 9th September

  • Intel announced the new generation of Xeon based on Haswell.

Tuesday 20th May

  • Various sites but notably Oxford have ARGUS problems. 100s of requests seen per minute. Performance issues have been noted after initial installation at RAL, QMUL and others.


Meeting Summaries
Project Management Board - Members Minutes Quarterly Reports

Empty

GridPP ops meeting - Agendas Actions Core Tasks

Empty


RAL Tier-1 Experiment Liaison Meeting (Wednesday 13:30) Agenda Meeting takes place on Vidyo.

Wednesday 8th October 2014

  • Operations report
  • On Monday (6th October) access to the two cream CEs (cream-ce01, cream-ce02) was modified to only accept ALICE, dteam and ops jobs plus snoplus.snolab.ca.
  • Initial testing by ALICE of running jobs through the ARC CEs is encouraging.
  • There has been a problem with a disk array used to host the main Castor databases. A battery fault has stopped the cache working, greatly reducing the array's performance. The databases have been partly reconfigured to work around this and an engineer is awaited to fix the underlying problem.

WLCG Grid Deployment Board - Agendas MB agendas

Empty



NGI UK - Homepage CA

Empty

Events
UK ATLAS - Shifter view News & Links

Empty

UK CMS

Empty

UK LHCb

Empty

UK OTHER
  • N/A
To note

  • N/A