Operations Bulletin 100314

Bulletin archive


Week commencing 3rd March 2014
Task Areas
General updates

Tuesday 4th March

  • A Vidyo test meeting room is available, linked from this agenda. Headsets are recommended.
  • No problems uncovered with the GridPP test website.
  • From Monday's WLCG ops: intermittent problems with RAL's virtualization cluster, affecting many services (including FTS3).
  • GGUS update - is there feedback we want to collate?
  • For information: APARSEN-EGI Community Workshop on Managing, Computing and Preserving Big Data for Research will take place next week, 4-6 March.
  • There will be a pre-GDB on batch systems next Tuesday, and a GDB covering various update areas.
  • There was an EGI OMB meeting last Thursday.
  • Steve produced an overview of getting ARGUS working at Liverpool.
  • GOCDBv5.2 was released last week. This release adds an extensibility mechanism which allows Services, ServiceGroups and Sites to be extended using custom key-value pairs (following the GLUE2 extensibility mechanism).
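As a rough illustration of the extensibility idea in the GOCDB item above, custom key-value pairs can be pictured as a small dictionary attached to each entity. This is only a sketch: the class, field and function names below are hypothetical and do not represent GOCDB's schema or API.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class Entity:
    """Hypothetical stand-in for a Service, ServiceGroup or Site record."""
    name: str
    kind: str  # "Service", "ServiceGroup" or "Site"
    extensions: Dict[str, str] = field(default_factory=dict)

    def set_extension(self, key: str, value: str) -> None:
        # Store a custom key-value pair alongside the fixed fields.
        if not key:
            raise ValueError("extension key must be non-empty")
        self.extensions[key] = value


def find_by_extension(entities: List[Entity], key: str, value: str) -> List[Entity]:
    # Return the entities that carry the given key-value pair.
    return [e for e in entities if e.extensions.get(key) == value]
```

The point of such a mechanism is that new attributes can be attached and queried without changing the fixed fields of the record.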

Monday 24th February

  • There is a test GridPP website for SHA-2.
  • The final WLCG Tier-2 availability/reliability reports for January 2014 are available.
  • Alessandra noted a FR cloud report on January's VO test results. The suggestion made was to do something similar for UK sites.
  • We need to revisit our plans for RIPE ATLAS probes.
  • Janet is moving away from SeeVogh/EVO. Support ends in August. Our meetings will migrate to Vidyo.
WLCG Operations Coordination - Agendas

Tuesday 11th February

  • A (pre-GDB) F2F is taking place today. We will review next week. There will also be a summary at tomorrow's GDB.

Tuesday 4th February

  • There is a multi-core TF meeting this afternoon. The focus is on CMS and PIC.
  • The second middleware readiness meeting takes place this Thursday 6th February.
  • There was an ops coordination meeting last Thursday. The minutes are available. In summary:
  • BASELINES: WMS baseline downgraded to 3.6.1 for issues; APEL baselines added after meeting
  • OpenSSL: WMS needs new version of glite-px-proxyrenewal. ETA this week.
  • SAM: plan to split SAM services for WLCG (at CERN) and EGI (at consortium). Code will fork.
  • ALICE: gearing up for Quark Matter 2014 (May 19-24, GSI Darmstadt)
  • ATLAS: Rucio renaming campaign almost over. Rucio commissioning has started. DC14 simulation started on 1st of January.
  • CMS: DBS migration has been postponed. gLexec test (not yet critical) is a bit difficult for Tier-1s.
  • LHCb: Issues with ARC CEs.
  • FTS3: Experiments re-started increasing the load on the RAL FTS3 instance. Deployment discussion in February meeting.
  • gLexec: 22 tickets remain open. EMI gLexec probe (in use since SAM Update 22) crashes on sites that use the tarball WN.
  • IPv6: Report at next meeting.
  • MW readiness: Next meeting on Feb 6 at 15:30 CET: agenda in particular about how to involve experiments and sites. Need site input on table.
  • MULTICORE: October 2014 proposed by TF coordinators as a target date for a functional system to be deployed.
  • perfSONAR: New release this week. Lots of minor fixes and improvements. All sites should update to this.
  • SHA-2: EOS SRM for LHCb not yet OK. voms-proxy-init on lxplus crashes on creating SHA-2 RFC proxies.
  • TRACKING: No update
  • WMS decom: Deadline - end of April to decommission CMS and shared instances


Tier-1 - Status Page

Tuesday 4th March

  • There have been problems with part of the Virtual Machine infrastructure on the Tier1. This has caused problems for a number of services, including both FTS2 & 3. These are largely worked around now. However, we have asked Atlas to temporarily move their file transfers (for everything except UK transfers) off our FTS3 server.
  • The software server used by the small VOs will be withdrawn from service (aiming for June).
  • A replacement MyProxy server has been put into production (to resolve the MyProxy issues raised in GGUS ticket 97025). This new service is called myproxy.gridpp.rl.ac.uk. Sites and VOs need to make appropriate reconfigurations to use this. We plan to turn the old one (lcgrbp01.gridpp.rl.ac.uk) off at the end of March.
  • Atlas disk space at the RAL Tier1 is full. One factor in this is a slow deletion rate that is being investigated.
  • The Tier1 will move to use the new site firewall on Monday 17th March. We will stop FTS transfers while the change takes place (07:00-08:00) and stop new batch work from starting. Otherwise services will be up for the day but will be At Risk.
Storage & Data Management - Agendas/Minutes

Tuesday 28th January

  • Trying to liaise with DiRAC colleagues to set up a technical meeting.

Monday 9th December

  • Spacetokens for non-LHC VOs - recommendations.



Accounting - UK Grid Metrics HEPSPEC06 Atlas Dashboard HS06

Tuesday 11th February

  • Another review of the HEPSPEC page shows no SL6 (or equivalent) entry for: UCL; Lancaster; Liverpool; Durham; ECDF; Glasgow; Birmingham; RALPP and RAL Tier-1. So... basically no change since last week. Tickets needed?
  • The accounting pages show the following sites as not up-to-date with publishing accounting data: Glasgow (minor); RALPP and Sussex.
  • Under publishing... sites are encouraged to run the Glue2 validator. There were problems observed for RAL-LCG2; UKI-LT2-IC-HEP; UKI-NORTHGRID-MAN-HEP; UKI-NORTHGRID-SHEF-HEP; UKI-SCOTGRID-GLASGOW and UKI-SOUTHGRID-RALPP.

Tuesday 4th February

  • A review of the HEPSPEC page shows no SL6 (or equivalent) entry for: UCL; Lancaster; Liverpool; Durham; ECDF; Glasgow; Birmingham; RALPP and RAL Tier-1.
  • The accounting pages show the following sites as not up-to-date with publishing accounting data: Lancaster (minor); RALPP and Sussex.


Documentation - KeyDocs

See the worst KeyDocs list for documents needing review now and the names of the responsible people.

Tuesday 25th February

  • Keydocs owners need to take some action!

Tuesday 11th February

  • Documents still need attention....

Tuesday 28th January

  • Document assignments were reviewed at a core ops tasks meeting last week... so the dashboard for keydocs should improve soon!
  • It has been noted that blog activity has dropped to a very low level. The blogs have been very useful to disseminate details of what is happening at sites, to share technical solutions and to make GridPP's work more visible. It also helps avoid duplication... so please give some thought to reviving your blogs!
Interoperation - EGI ops agendas

Monday 24th February

    • URT News: ARC, WMS, SAM probes
    • UMD 3.5 released last week. Storm 1.11.3, other updates for openssl
    • SR: IGE.globus-rls v. 5.2.5 no EA
    • GLUE2 Validation: Possible timeline: broadcast to ROD and sites on March 3rd; probe set OPERATIONAL on March 10th; sites then have a further two weeks to fix the Site-BDII before receiving alarms.
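Read literally, the proposed GLUE2 validation timeline leaves one week between the broadcast and the probe going operational, then two further weeks before alarms. A small sketch of that date arithmetic follows, assuming the year is 2014 (taken from the bulletin date) and that the two weeks are counted from 10 March; the timeline itself is only described as possible.

```python
from datetime import date, timedelta

# Dates from the proposed timeline; the year 2014 is an assumption based on the bulletin date.
BROADCAST = date(2014, 3, 3)
PROBE_OPERATIONAL = date(2014, 3, 10)
GRACE_PERIOD = timedelta(weeks=2)  # time sites have to fix the Site-BDII


def first_alarm_date(operational: date = PROBE_OPERATIONAL,
                     grace: timedelta = GRACE_PERIOD) -> date:
    # Alarms would start once the grace period after the probe goes operational has elapsed.
    return operational + grace
```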


Monitoring - Links MyWLCG

Tuesday 4th March

  • Summary on HC functional tests
  • Overview of feedback


Monday 24th February

  • Next meeting this Friday, agenda looking at HammerCloud Functional tests

Tuesday 21 January

  • Update from meeting on Friday (the 17th); the main item under discussion was the nagios probes (in particular, the Condorg and CREAM-CE).
On-duty - Dashboard ROD rota

Monday 3rd March

  • A new dashboard is available for testing.
  • The ROD rota has been extended to April
  • Brunel sub-sites caused a problem leading to EMI-3 APEL alarms which is now fixed.


Tuesday 25th February

  • No issues to discuss.
  • The rota needs updating this week.


Rollout Status WLCG Baseline

Tuesday 11th February

  • 31st May has been set as the deadline for EMI-2 decommissioning. There may be an issue for dCache (related to 3rd party/enstore component).

References


Security - Incident Procedure Policies Rota

Tuesday 4th March

  • Questionnaires have been produced for EGI federated cloud sites.
  • The next security team meeting is Wednesday 5th March at 11am.

Monday 24th February

  • Let Orlin know if you wish to try connections with the NGI ARGUS server. Tested working for EM including national banning. There is some setup documentation.

Tuesday 18th February

  • Progress on NGI ARGUS and testing. More tinkering needed. No wider deployment just yet.



Services - PerfSonar dashboard | GridPP VOMS

Tuesday 4th March

  • The full UK perfSONAR view is given on this dashboard.
  • When perfSONAR is performing in a stable fashion the site will appear on the main monitoring page.


Monday 17th February

  • Note that WLCG see perfSONAR as a production service (see page 5 in Ian Bird's talk). The UK dashboard shows work still to be done at: ECDF, RHUL, Sheffield, Brunel and RALPPD.

Tuesday 4th February

Tickets

Monday 3rd March 2014, 14.30 GMT
44 Open UK NGI tickets this week.

NGI
https://ggus.eu/index.php?mode=ticket_info&ticket_id=101502 (24/2)
ILC moving to cvmfs, so those of us seeking to continue support will need to enable it. IC and Cambridge have already moved and been confirmed working. It might be easier if we collate any other sites who have moved into a single list to give to ILC. The working plan is to open tickets against sites who haven't moved after giving them a suitable grace period. In progress (26/2)

TIER 1
https://ggus.eu/index.php?mode=ticket_info&ticket_id=99556 (6/12/13)
The NGI Argus ticket. There's been great progress on this, can we reflect some of this in the ticket? Or perhaps close it if we're satisfied. In progress (13/2)

https://ggus.eu/index.php?mode=ticket_info&ticket_id=101491 (23/2)
The RAL perfsonar latency box is being troublesome. It crashed and was brought back up again, but has crashed again so Duncan has reopened the ticket. Reopened (3/3)

https://ggus.eu/index.php?mode=ticket_info&ticket_id=101716 (28/2)
This cms transfer ticket has INFN as the "notified site", surely it should be RAL-LCG2 instead? I didn't change it myself in case I missed some nuance. Transfer problems appear to be linked to the virtualisation problems RAL have been experiencing affecting FTS3. In progress (3/3)

https://ggus.eu/index.php?mode=ticket_info&ticket_id=101729 (1/3)
LHCB pilots failing on a RAL CE. Being looked into. In Progress (3/3)

https://ggus.eu/index.php?mode=ticket_info&ticket_id=101701 (28/2)
ILC having troubles with the RAL ARC CEs. Looks to be a user group for ilc (production) missing. In progress (28/2)

https://ggus.eu/index.php?mode=ticket_info&ticket_id=101052 (6/2)
Biomed having trouble retrieving results from RAL cream CEs. Tracked down to the RAL EMI2 argus not handling RFC proxies. An update to EMI3 is hoped to fix this, although Dan reports that this isn't the case at QM (see 101639). In progress (27/2)

https://ggus.eu/index.php?mode=ticket_info&ticket_id=101532 (25/2)
LHCB noting that RAL is publishing the default MaxCPUtime. Fixed but Orlin notes some caching behaviour. Maria AP chimed in that you might have a buggy bdii version in the chain. In progress (26/2)

https://ggus.eu/index.php?mode=ticket_info&ticket_id=100114 (8/1)
Chris W's ticket concerning jobs failing to get from RAL to Imperial. Catalin asked for some testing, but Chris has been busy. The ticket hit its second reminder though. Waiting for reply (11/2)

https://ggus.eu/index.php?mode=ticket_info&ticket_id=97025 (3/9/13)
Longstanding myproxy issue. Andrew reports that the new myproxy service is up and running, so I assume this ticket can be closed soon? Or at least put back in progress. On hold (25/2)

https://ggus.eu/index.php?mode=ticket_info&ticket_id=101079 (9/2)
ARC CEs having a default SE of 0 and not being able to tune this per VO. Andrew is figuring out a fix to this. In progress (25/2)

https://ggus.eu/index.php?mode=ticket_info&ticket_id=98249 (21/10/13)
cvmfs for Sno+. Ticket on hold whilst tarballs are created. Been that way for a while. On hold (29/1).

EDINBURGH
https://ggus.eu/index.php?mode=ticket_info&ticket_id=100569 (28/1)
ECDF's perfsonar box refusing MA connections. Wahid has rebooted the box but no joy, Duncan linked some instructions as requested. In progress (3/3)

https://ggus.eu/index.php?mode=ticket_info&ticket_id=99794 (16/12/13)
Access to the ECDF perfsonar pages. There's a big ACL overhaul going on at the moment, Andy apologises and will chase the central IT chaps about it. On hold (28/2)

https://ggus.eu/index.php?mode=ticket_info&ticket_id=101659 (27/2)
44444 jobs publishing on some ECDF CEs (as part of information system cleanup campaign). These CEs are due for retirement (replicant style) today, so this and the related tickets will be done with soon. In progress (3/3)

https://ggus.eu/index.php?mode=ticket_info&ticket_id=100840 (29/1)
Apel-Pub nagios test failures at ECDF. The guys are working on it, but sadly the ticket is escalating. Daniela posted a note that if you have a support ticket with APEL open (which I think is advisable) to link that into this ticket. In progress (3/3)

https://ggus.eu/index.php?mode=ticket_info&ticket_id=95303 (1/7/13)
glexec deployment ticket. The ECDF lads are waiting on the tarball (i.e. me). Still. On hold (27/1)

RALPP
https://ggus.eu/index.php?mode=ticket_info&ticket_id=101726 (1/3)
LHCB ticket about the default CPU time (999999) being published at RALPP. I thought that RALPP had solved something like this recently, but maybe I dreamt it? Assigned (1/3) Update - Solved, something was being published that shouldn't be any more.

https://ggus.eu/index.php?mode=ticket_info&ticket_id=101727 (1/3)
Info system cleanup campaign, 4444444 job at RALPP. Assigned (1/3)

https://ggus.eu/index.php?mode=ticket_info&ticket_id=101398 (19/2)
LHCB would like xrootd holes poked in the RALPP firewall. As mentioned last week I believe this requires holes poked in the RAL firewall, which is undergoing an overhaul. This ticket could do with some attention mentioning these problems, and possibly putting on hold. In progress (19/2)

https://ggus.eu/index.php?mode=ticket_info&ticket_id=101136 (11/2)
Request to upgrade the RALPP perfsonar to the latest version. Due to a lack of hands on deck Chris postponed this work, with a reminder date of today. On hold (21/2)

IMPERIAL
https://ggus.eu/index.php?mode=ticket_info&ticket_id=101367 (18/2)
A cms user having trouble srmcping in his jobs at IC. Looks to be a java 1.7 mismatch problem. Simon has asked some questions, no answer yet (user has set notify to "on solution" so might not have got the update). Waiting for reply (24/2)

DURHAM
https://ggus.eu/index.php?mode=ticket_info&ticket_id=101752 (3/3)
LHCB jobs having problems at Durham. Ewan S. has asked if the problems persist. Waiting for reply (3/3)

https://ggus.eu/index.php?mode=ticket_info&ticket_id=101763 (3/3)
Part of the campaign to clean up the information system, Durham have been asked to update their BDIIs (site and resource) to not-buggy versions. Assigned (3/3)

https://ggus.eu/index.php?mode=ticket_info&ticket_id=101177 (12/2)
Durham trying to wash the biomed out of their SE's information system. No joy yet. I advise asking at the storage meeting if stuck. In progress (26/2)

https://ggus.eu/index.php?mode=ticket_info&ticket_id=99621 (10/12/13)
enmr noticed a bad WN, which was promptly quarantined. It hasn't been fixed, but I maintain that the problem itself is contained and solved if you want to close the ticket... On hold (28/1)

GLASGOW
https://ggus.eu/index.php?mode=ticket_info&ticket_id=101710 (28/2)
Nagios SRM-Put test failures. The problem is known (it's DPM being odd with its space reporting whilst a pool is readonly; Sam describes it better). In progress (28/2)

https://ggus.eu/index.php?mode=ticket_info&ticket_id=101565 (26/2)
LHCB sees that Glasgow is also publishing default max CPU time for some (all? one?) of their queues. Sam points out that this is on purpose (due in part to multicore jobs, jobs are limited by Wall time only), and asks if LHCB can't make educated guesses. Stefen replies with a point about the difference in "MaxCPUTime" and "MaxTotalCPUTime", but I'm not sure that covers the Glasgow concerns. Worth discussing to get a UK stance on this. In progress (3/3)

BRUNEL
https://ggus.eu/index.php?mode=ticket_info&ticket_id=100568 (28/1)
Perfsonar MA problem. Raul has been working steadily at this and it looks to be progressing nicely. In progress (28/2)

QMUL
https://ggus.eu/index.php?mode=ticket_info&ticket_id=101676 (27/2)
One of QM's perfsonar boxes is having problems, missing services. Likely to be caused by running a bleeding edge version of perfsonar. In progress (27/2)

https://ggus.eu/index.php?mode=ticket_info&ticket_id=101682 (27/2)
Brian has asked for a SE dump of QM atlas files. Assigned (27/2)

https://ggus.eu/index.php?mode=ticket_info&ticket_id=101557 (25/2)
Matt from SNO+ having trouble on a QM UI, delegating proxies to the FTS. The same works on lxplus though. This ticket needs a home, but there's an argument that it isn't a site problem (as a UI isn't necessarily part of a site). Assigned (26/2)

https://ggus.eu/index.php?mode=ticket_info&ticket_id=94746 (10/6/13)
Biomed haunting the QM SE's info system. I believe Chris is waiting on his changes to seep into the Storm release (100290). On hold (14/1)

BRISTOL
https://ggus.eu/index.php?mode=ticket_info&ticket_id=101669 (27/2)
lhcb ticketed Bristol, but the CE in question is in scheduled downtime. Possibly worth keeping this open whilst downtime is on to avoid a duplicate. In progress (27/2)

https://ggus.eu/index.php?mode=ticket_info&ticket_id=101516 (24/2)
Bristol's perfsonar ticket. Bristol upgraded which seems to have solved some of their problems, but their other server is having trouble now. Maybe the same again will fix it? In progress (25/2)

UCL
https://ggus.eu/index.php?mode=ticket_info&ticket_id=95298 (1/7/13)
glexec at UCL. No news for a while from Ben. Daniela reminds him that the EMI3 upgrade is also imminent. On hold (26/2)

https://ggus.eu/index.php?mode=ticket_info&ticket_id=101285 (16/2)
A perfsonar ticket for UCL. A power outage looks to have brutalised their box. No word yet on if Ben has been able to save it. On hold (22/2)

SHEFFIELD
https://ggus.eu/index.php?mode=ticket_info&ticket_id=101374 (19/2)
Sheffield's LHCB maxcputime ticket. Elena has set in progress but no news. In progress (25/2)

https://ggus.eu/index.php?mode=ticket_info&ticket_id=100037 (3/1)
A perfsonar ticket for Sheffield, whose perfsonar needs updating. No news for a while. On hold (3/2)

LANCASTER
https://ggus.eu/index.php?mode=ticket_info&ticket_id=95299 (1/7/13)
Lancaster's glexec ticket. Whilst there's been some progress in the glexec tarball (not as much as there should be, as tarball time keeps being redirected, particularly with EMI3), no movement on the ticket. On hold (31/1)

https://ggus.eu/index.php?mode=ticket_info&ticket_id=100566 (27/1)
Lancaster suffering Poo Perfsonar Performance (I couldn't resist the childish alliteration). It doesn't seem to be an artificial cap (the rate has peeped over the 1Gb/s mark now and again). Looking for bottlenecks, but not had any time to investigate. On hold (17/2)

EFDA-JET
https://ggus.eu/index.php?mode=ticket_info&ticket_id=97485 (21/9/13)
LHCB jobs failing at JET due to openssl problems. No progress for a while, after the JET guys exhausted everything. On hold (11/2)
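All ticket links in the list above follow the same URL pattern, so when collating tickets (for example, into a single list per VO) it can be handy to build links from ticket numbers and to recover the numbers from links. A minimal sketch, assuming only the URL pattern seen in this bulletin; the function names are hypothetical.

```python
from urllib.parse import parse_qs, urlencode, urlparse

# Base URL as it appears in the ticket links of this bulletin.
GGUS_BASE = "https://ggus.eu/index.php"


def ticket_url(ticket_id: int) -> str:
    # Build a ticket link in the same form used in the list above.
    query = urlencode({"mode": "ticket_info", "ticket_id": ticket_id})
    return GGUS_BASE + "?" + query


def ticket_id_from_url(url: str) -> int:
    # Extract the numeric ticket id from a ticket link.
    query = parse_qs(urlparse(url).query)
    return int(query["ticket_id"][0])
```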

Tools - MyEGI Nagios

Tuesday 26th November

  • Regional Nagios updated to release 22. It is a glite to UMD update and it required a fresh installation.
  • There have been some internal changes in SAM-Nagios. Test probes are now the responsibility of product team. Some test names have been changed as a result of this reorganization. For example the org.sam.CREAMCE-DirectJobSubmit test has become emi.cream.CREAMCE-DirectJobSubmit. This does not affect the operational activities.
  • Please could all site admins look at services associated to their site and please mail Kashif if anything odd is noticed. Site admins can reschedule tests for their sites and it would be helpful if most functionalities are tested.
  • Also, look at myegi which can be useful with links to the Dashboard, GSTAT, Accounting Portal and GGUS.
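Anyone with scripts or filters keyed on the old SAM test names may need a translation step after the renaming described above. The sketch below maps old names to new ones; it contains only the single rename quoted in this bulletin, and any further entries would have to be taken from the SAM release notes rather than guessed.

```python
# Old-to-new SAM test names. Only the one example given in the bulletin is listed;
# other renames are deliberately not assumed.
RENAMED_TESTS = {
    "org.sam.CREAMCE-DirectJobSubmit": "emi.cream.CREAMCE-DirectJobSubmit",
}


def current_test_name(name: str) -> str:
    # Return the new name if the test was renamed, otherwise the name unchanged.
    return RENAMED_TESTS.get(name, name)
```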
VOs - GridPP VOMS VO IDs Approved VO table

Monday 17 February 2014

  • Proxy renewal
    • All RAL WMSs now renew proxies with 1024 bits. This looks like the end of this (at last).


Tuesday 11 February 2014

  • Proxy renewal
    • lcgwms06 at RAL has been upgraded and works
    • Both Imperial's WMSs work
    • Glasgow's will still need to be upgraded (unless they have been since Friday).

Tuesday 4 February 2014

  • Proxy renewal
    • Imperial have a workaround for proxy renewal
    • EMI released an update yesterday - should fix things, but needs to be deployed.

Tuesday 28 January 2014

  • hyperk having problems with proxy renewal.
    • This may be related to openssl

Tuesday 21 January 2014

  • Backup VOMS servers now configured at sites. Some small problems remain. Sites and VOs recommended to update their UIs to use all three servers.
  • Dirac setup (from Janusz):
    • Mice
    • NA62
    • Londongrid
    • SNO+
    • Landslides (resources need to be configured)


Site Updates

Actions


Meeting Summaries
Project Management Board - Members Minutes Quarterly Reports

Empty

GridPP ops meeting - Agendas Actions Core Tasks

Empty


RAL Tier-1 Experiment Liaison Meeting (Wednesday 13:30) Agenda Meeting takes place on Vidyo.

Wednesday 5th March

  • Operations report
  • Network Changes: Move of Tier1 to use new site firewall will be on Monday 17th March. There will be some interruption to services as seen from outside RAL. Internally services are expected to continue uninterrupted.
  • Significant problems with part of the Tier1 Hyper-V infrastructure. This started on Friday (28th). The more important VMs have been moved elsewhere while the underlying problem is investigated. Services impacted included FTS3 which was running a large scale test for Atlas & CMS. At our request, on Tuesday (4th) Atlas moved the bulk of their file transfers (all except for the UK) to other FTS3 servers.
  • A new MyProxy server is in production (myproxy.gridpp.rl.ac.uk).
  • Reminder: The software server used by the small VOs will be withdrawn from service (aiming for June).
WLCG Grid Deployment Board - Agendas MB agendas

Empty



NGI UK - Homepage CA

Empty

Events

4th February 2014 With reference to the OMB on 30th January.

- UMD-2 will be decommissioned in the coming months. 30th April end of security support. 31st May all services to have been removed or upgraded.

- UK sites failing Glue2:

RAL-LCG2 UKI-LT2-IC-HEP UKI-NORTHGRID-MAN-HEP UKI-NORTHGRID-SHEF-HEP UKI-SCOTGRID-GLASGOW UKI-SOUTHGRID-RALPP

Check with the glue2 validator to see errors.

- A new Operations Dashboard will be in pre-production during February and moved into production in March.

- Availability/reliability targets for EGI are moving to 80%/85%.

- Some new ARC SAM tests are being introduced: org.nordugrid.ARC-CE-LFC-result; org.nordugrid.ARC-CE-LFC-submit; org.nordugrid.ARC-CE-SRM-result; org.nordugrid.ARC-CE-SRM-submit; org.nordugrid.ARC-CE-submit

- There is a summary page on how to publish from various middleware types: https://wiki.egi.eu/wiki/MAN09

- There was an overview of the French NGI adoption of iRODs.

- FedCloud to production: Management (GOCDB - ok); Monitoring (in progress); Accounting (in progress); Documentation (ok); Support (ok); Dashboard (in progress) and Security (in progress). The SAM tests are: org.nagios.CloudBDII-Check; eu.egi.cloud.OCCI-VM and org.nagios.OCCI-TCP. Security checks not yet available.
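For the availability/reliability targets mentioned in the list above (80%/85%), a site can sanity-check its own monthly figures with a few lines. The sketch below assumes the commonly used definitions (availability = up time / (total time - unknown time); reliability = up time / (total time - scheduled downtime - unknown time)); the official reports may apply further rules, so treat this as an approximation. Function names are hypothetical.

```python
def availability(up_hours: float, total_hours: float, unknown_hours: float = 0.0) -> float:
    # Fraction of the known period during which the site was up.
    known = total_hours - unknown_hours
    if known <= 0:
        raise ValueError("no known time in the period")
    return up_hours / known


def reliability(up_hours: float, total_hours: float,
                scheduled_down_hours: float = 0.0, unknown_hours: float = 0.0) -> float:
    # As availability, but scheduled downtime is not counted against the site.
    expected = total_hours - scheduled_down_hours - unknown_hours
    if expected <= 0:
        raise ValueError("no expected up time in the period")
    return up_hours / expected


def meets_targets(avail: float, rel: float,
                  avail_target: float = 0.80, rel_target: float = 0.85) -> bool:
    # Default targets are the 80%/85% figures quoted in the bulletin.
    return avail >= avail_target and rel >= rel_target
```

For example, a 720-hour month with 600 hours up and 48 hours of scheduled downtime would be checked with `meets_targets(availability(600, 720), reliability(600, 720, scheduled_down_hours=48))`.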

UK ATLAS - Shifter view News & Links

Empty

UK CMS

Empty

UK LHCb

Empty

UK OTHER
  • N/A
To note

  • N/A