Operations Bulletin 100214

From GridPP Wiki

Bulletin archive


Week commencing 3rd February 2014
Task Areas
General updates

Tuesday 4th February

  • The agenda for the February GDB is available.
  • The March pre-GDB will be on batch systems.
  • Discussion on RIPE ATLAS probes has continued off list. The PMB agree that there is an opportunity here and prefer to link this with outreach and dissemination. For those interested, a discussion of what to propose will take place this Friday 7th February (email Jeremy).

Tuesday 28th January

  • There are suggestions for a WLCG pre-GDB on batch systems in March.
  • openssl status update
  • There was an IPv6 working group meeting at CERN last week (agenda).

Tuesday 21st January

  • ipv6.hepix.org VO request
  • WLCG T2 A/R report available. Feedback required from 5 sites.
  • Please check the VO derived A/R figures.
  • The January GDB took place last week.
  • In-kind contribution follow-up still needed for Imperial, QMUL (Steve will do this), Liverpool, Manchester .. and also UCL and Durham.
WLCG Operations Coordination - Agendas

Tuesday 4th February

  • There is a multi-core TF meeting this afternoon. The focus is on CMS and PIC.
  • The second middleware readiness meeting takes place this Thursday 6th February.
  • There was an ops coordination meeting last Thursday. The minutes are available. In summary:
  • BASELINES: WMS baseline downgraded to 3.6.1 for issues; APEL baselines added after meeting
  • OpenSSL: WMS needs new version of glite-px-proxyrenewal. ETA this week.
  • SAM: plan to split SAM services for WLCG (at CERN) and EGI (at consortium). Code will fork.
  • ALICE: gearing up for Quark Matter 2014 (May 19-24, GSI Darmstadt)
  • ATLAS: Rucio renaming campaign almost over. Rucio commissioning has started. DC14 simulation started on 1st of January.
  • CMS: DBS migration has been postponed. gLexec test (not yet critical) is a bit difficult for Tier-1s.
  • LHCb: Issues with ARC CEs.
  • FTS3: Experiments re-started increasing the load on the RAL FTS3 instance. Deployment discussion in February meeting.
  • gLexec: 22 tickets remain open. EMI gLExec probe (in use since SAM Update 22) crashes on sites that use the tarball WN.
  • IPv6: Report at next meeting.
  • MW readiness: Next meeting on Feb 6 at 15:30 CET: agenda in particular about how to involve experiments and sites. Need site input on table.
  • MULTICORE: October 2014 proposed by TF coordinators as a target date for a functional system to be deployed.
  • perfSONAR: New release this week. Lots of minor fixes and improvements. All sites should update to this.
  • SHA-2: EOS SRM for LHCb not yet OK. voms-proxy-init on lxplus crashes on creating SHA-2 RFC proxies.
  • TRACKING: No update
  • WMS decom: Deadline - end of April to decommission CMS and shared instances


Tuesday 28th January

...

Tuesday 10th December

  • Confirmation of the multi-core task force with this mandate. Some concerns about overlaps with the machine/job features TF.
  • Discussion of experiment Christmas plans
  • Update of the baseline versions (https://twiki.cern.ch/twiki/bin/view/LCG/WLCGBaselineVersions). BDII update important for SAM BDII nodes at CERN.
  • Tier-1 WNs on OPN is now being tracked here.
  • ALICE - MC will continue over break. Best efforts approach appreciated.
  • ATLAS - plans for ramp up of MC production. Repro and analysis ramp up also expected in coming weeks.
  • CMS - Run2 MC samples prep starting. "Appreciate all support from the sites we can get, but don’t expect normal levels of support, especially for T2 sites"
  • LHCb: Usage of distributed grid resources mainly for Monte Carlo productions. Surveillance by the operations team on a best effort basis. Also note a new CVMFS dashboard for LHCb.
  • Christmas plans summary: "All experiments will run activities over christmas at non negligible scale. They do not require special effort from sites or WLCG in general, while best effort support is highly appreciated"
  • WMS decommissioning: looks like WMS usage by CMS is decreasing but it is variable.
  • glexec: 31 tickets remain open. Status tracked here.
  • FTS3: testing ongoing
  • Tracking tools: An engineer will be on-call for GGUS over the vacation period.
  • perfSONAR: Code maintenance an issue with BNL funding cuts. Looking at OSG and ESNet options. 3.3.2 out soon. See Status & Plans update. Asking sites to make accessible the perfSONAR main page (https://<hostname>/toolkit) for the central operations activity. Plans are for OSG to host perfSONAR-PS central service, BNL dashboard not all correct.
  • IPv6: request from CMS to have IPV6 supported on SLC5 at CERN. Alistair D taking on ATLAS role for IPv6 testing.
  • Middleware readiness: Meeting planned for 12th December.
  • Machine/job features: Discussion between current implementation and proposed route minimizing draining waste (MDW) cpu time for multi-core pilots.
  • SHA-2: still some updates at sites ongoing (>10 sites). "by mid January the WLCG infrastructure is expected to be essentially ready". OSG plans to move in mid-January.
  • VOMRS: VOMS-Admin still in testing.
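
On the perfSONAR item above: sites were asked to make the toolkit main page (https://<hostname>/toolkit) accessible. A minimal Python sketch of a check that could be run from outside the site firewall follows. The function names and the 10 second timeout are illustrative, and it assumes toolkit hosts may present self-signed certificates, so certificate verification is switched off for this reachability test only.

```python
import ssl
import urllib.error
import urllib.request


def toolkit_url(hostname):
    """Build the toolkit main page URL in the form quoted in the bulletin."""
    return f"https://{hostname}/toolkit"


def is_accessible(url, timeout=10):
    """Return True if the page answers with HTTP 200, False otherwise."""
    # Assumption: the host may use a self-signed certificate, so this
    # reachability check does not verify it.
    context = ssl.create_default_context()
    context.check_hostname = False
    context.verify_mode = ssl.CERT_NONE
    try:
        with urllib.request.urlopen(url, timeout=timeout, context=context) as response:
            return response.status == 200
    except (urllib.error.URLError, OSError):
        # Covers DNS failures, refused connections, timeouts and HTTP errors.
        return False
```

Usage would be along the lines of is_accessible(toolkit_url("ps-latency.example.ac.uk")), where the hostname is a placeholder for a site's own perfSONAR box.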

Tuesday 3rd December

Tier-1 - Status Page

Tuesday 4th February

  • CVMFS Client version 2.1.17 has been rolled out on one batch of worker nodes (around 10% of the farm). So far so good.
  • The software server used by the small VOs will be withdrawn from service (aiming for June).
  • Work is restarting to resolve the MyProxy issues raised in GGUS ticket 97025. There will be a new MyProxy server and it will be necessary to make appropriate reconfigurations to use this.
  • Work is progressing on Tier1 Network changes. On Tuesday 11th March it is planned to move the Tier1 to use the new site firewall. The plan is to install the new Routing layer for the Tier1 & change the way the Tier1 connects to the RAL network before this date.
Storage & Data Management - Agendas/Minutes

Tuesday 28th January

  • Trying to liaise with DiRAC colleagues to set up a technical meeting.

Monday 9th December

  • Spacetokens for non-LHC VOs - recommendations.



Accounting - UK Grid Metrics HEPSPEC06 Atlas Dashboard HS06

Tuesday 4th February

  • A review of the HEPSPEC page shows no SL6 (or equivalent) entry for: UCL; Lancaster; Liverpool; Durham; ECDF; Glasgow; Birmingham; RALPP and RAL Tier-1.
  • The accounting pages show the following sites as not up-to-date with publishing accounting data: Lancaster (minor); RALPP and Sussex.

Tuesday 28th January

  • It has been noted that many sites are not publishing revised benchmark HS06 figures following their upgrade to SL6. Please check your site ASAP. Tickets will be raised shortly against sites observed to be unchanged. Please include updates in the HS06 wiki page.

Tuesday 26th November

Tuesday 5th November

  • A reminder to keep an eye on the SL HS06 page for odd ratios. Steve takes HS06 cpu numbers direct from ATLAS and the page does get stuck every now and then.
  • The metrics page has been updated.

Tuesday 13th August

Documentation - KeyDocs

See the worst KeyDocs list for documents needing review now and the names of the responsible people.

Tuesday 28th January

  • Document assignments were reviewed at a core ops tasks meeting last week... so the dashboard for keydocs should improve soon!
  • It has been noted that blog activity has dropped to a very low level. The blogs have been very useful to disseminate details of what is happening at sites, to share technical solutions and to make GridPP's work more visible. It also helps avoid duplication... so please give some thought to reviving your blogs!
Interoperation - EGI ops agendas

Tuesday 4th February

  • SR: mpi v. 1.5.3
    • lb v. 4.0.12
    • apel-parser v. 2.2.1 and apel-ssm v. 2.1.1
    • Globus 5.2.5:
    • gridftp v. 5.2.5
    • gram5 v. 5.2.5
  • Still open WMS issues
  • EMI-2 decommissioning deadlines: 30/04/14 end of support, 31/05/14 deadline for upgrades
  • Affected:
    • ARC v2.*
    • ARGUS v1.5.*
    • BDII Site older than v1.2.0
    • BDII Top older than v1.1.0
    • CREAM v1.14.*
    • dCache v2.2.*
    • DPM older than v1.8.6
    • EMI-UI v2.*
    • EMI-WN v2.*
    • FTS v.2.2.8
    • StoRM older than v.1.11.0
    • VOMS v.2.*
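
The affected list above can be turned into a quick check of installed versions. The following is a minimal Python sketch, not an official tool. It assumes that "v2.*" means any release in that series, that "older than" is a strict comparison, and that version strings are dotted numbers. The AFFECTED table is transcribed from the list, but the rule encoding and function names are illustrative.

```python
import re

# Rules transcribed from the affected list. The encoding is an assumption:
#   "series": any version starting with these numbers (e.g. v2.*)
#   "older":  strictly older than this version
#   "exact":  exactly this version
AFFECTED = {
    "ARC": ("series", (2,)),
    "ARGUS": ("series", (1, 5)),
    "BDII Site": ("older", (1, 2, 0)),
    "BDII Top": ("older", (1, 1, 0)),
    "CREAM": ("series", (1, 14)),
    "dCache": ("series", (2, 2)),
    "DPM": ("older", (1, 8, 6)),
    "EMI-UI": ("series", (2,)),
    "EMI-WN": ("series", (2,)),
    "FTS": ("exact", (2, 2, 8)),
    "StoRM": ("older", (1, 11, 0)),
    "VOMS": ("series", (2,)),
}


def parse_version(text):
    """Turn '1.8.6' or 'v1.8.6' into the tuple (1, 8, 6)."""
    return tuple(int(part) for part in re.findall(r"\d+", text))


def is_affected(product, version):
    """Return True if this product version falls under the decommissioning list."""
    kind, reference = AFFECTED[product]
    parsed = parse_version(version)
    if kind == "series":
        return parsed[:len(reference)] == reference
    if kind == "older":
        # Tuples compare numerically, so 1.10.0 is correctly newer than 1.8.6.
        return parsed < reference
    return parsed == reference
```

For example, is_affected("DPM", "1.8.5") is True while is_affected("DPM", "1.8.6") is False. Comparing tuples of integers rather than strings avoids treating 1.10.0 as older than 1.8.6.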

Tuesday 21 January

  • Notes from the last meeting (Tuesday 14th). Agenda: https://wiki.egi.eu/wiki/Agenda-14-01-2014 . Summary:
    • Updates for Gridsite and Globus gatekeeper as part of openssl, Gridsite in staged rollout
    • Flagged the webdav StoRM update 1.11.3 from December
    • Releases: only major release ARC 4.0.0
    • SHA-2 update: 99% of services completed. No discussion of NGIs moving to SHA-2 certs
    • Flagged that the clients are not monitored, but for dcache-srm-client the first version supporting SHA-2 certificates is v2.2.22.
    • Decommission of CERN WMSes : Planned for April, it was noted by service managers that there was a high usage for ops tests.
    • We mentioned the UK's trial of RPMs for VOMS data; this was favourably received and I said we'd update in a few weeks with more experience.

Tuesday 17th December

  • The next meeting will be on Thursday combined with the EGI OMB.

Tuesday 3rd December

  • Additional notes:
    • The 2.6.16 version of dCache mentioned has a serious bug in the migration module; 2.6.17 has this fixed so should be used in preference. The possibility of skipping 2.6.16 in the overall release of EMI-3 is being discussed.
    • Note that the CREAM updates mentioned in this meeting contain security updates and so are recommended.
    • Looking for CREAM/LSF plugin staged rollout, but don't believe there are any such sites in the UK
    • SHA-2 : 17 sites remaining in the EGI that are publishing SHA-2 and alarming; I don't think that any such sites in the UK (just a couple) are unaccounted for/previously documented.
    • It was asked when CAs would start issuing SHA-2 certs only (UK noting that it's planning to from January)
  • Next meeting: (last for 2013) 16th December
gLite support calendar.


Monitoring - Links MyWLCG

Tuesday 21 January

  • Update from meeting on Friday (the 17th); the main item under discussion was the nagios probes (in particular, the Condorg and CREAM-CE).

Tuesday 10th December

  • Feedback transmitted and discussed by consolidation group; next meeting is now in January.

Tuesday 26th November

  • As noted by Alessandra, if possible we'd like site feedback on the consolidated monitoring prototype before the next meeting a week on Friday to report back to the group (with thanks to everyone who has already contributed)
  • Some notes to form a wiki on Graphite are to be found here: https://www.gridpp.ac.uk/wiki/MonitoringTools . These are still under development; if there are areas people would find useful that could be expanded, please let David know.
  • Glasgow dashboard now packaged and can be downloaded here.
On-duty - Dashboard ROD rota

Monday 3rd February

  • Good week. APEL ticket about Brunel alarms still open (although was passing this afternoon)

Monday 20th January

  • Sussex have put their CE into a downtime. Cleared up some tickets after they did this.
  • The APEL alarms at Brunel reappeared from time to time. Closed these referring to a ticket Daniela raised. See: https://ggus.eu/ws/ticket_info.php?ticket=100287.

Monday 13th January

  • Quiet week (after APEL problems resolved). Brunel still seems to get dubious APEL alarms occasionally.
  • Glexec at Sussex unresolved - it may be best to put this into downtime whilst the issue is addressed?


Rollout Status WLCG Baseline

Tuesday 29th Oct

  • Yesterday the first staged rollout request (for the CREAMCE) in months came through. I've updated the Stage of the Nation page.


Tuesday 8th Oct

  • There have been updates to EMI2 and 3 yesterday, but no new request for Staged Rollout. There is a problem with dcap-libs: [GGUS 97805] References


Security - Incident Procedure Policies Rota

Tuesday 14th January

  • nmap test results show 4 UK sites yet to take action on perfSONAR
  • openssl status


Services - PerfSonar dashboard | GridPP VOMS

Tuesday 4th February

Tuesday 7th January

  • A perfSONAR dashboard has been established in London based on maDDash.

Tuesday 26th November

  • The main perfSONAR issues this week affect Manchester and Sussex.

Tuesday 19th November

  • There is a new dashboard. Feedback is welcome.
  • Manchester, Durham, Glasgow and Sussex show problems across the board.
Tickets

Monday 3rd February 2014, 14.30 GMT
Only 29 open tickets in the UK at the moment. To split it further, only 4 of these are "green", three are "yellow", the rest are "red". 7 are perfsonar related tickets, the only really big group of tickets we have.

RALPP
https://ggus.eu/ws/ticket_info.php?ticket=100480 (23/1)
Some obsolete entries were being published at RALPP, Chris thinks he has fixed it though (a problem on the cluster BDII), awaiting confirmation. Waiting for reply (31/1) Update-Solved

https://ggus.eu/ws/ticket_info.php?ticket=100849 (29/1)
Duncan has ticketed RALPP over their perfsonar latency box, he reckons a full log partition. Looks like this ticket hasn't been noticed yet though. Assigned (30/1)

OXFORD
https://ggus.eu/ws/ticket_info.php?ticket=99642 (10/12)
Backup Voms server testing for GridPP and Southgrid VOs at Oxford. On hold (30/1)

BRISTOL
https://ggus.eu/ws/ticket_info.php?ticket=99910 (20/12/2013)
LHCB having problems with the environment at Bristol, tracked to ARC being an odd duck. The problem has been forwarded to the ARC devs. On hold (21/1)

GLASGOW
https://ggus.eu/ws/ticket_info.php?ticket=98253 (21/10/2013)
Getting CMS working at Glasgow - the ticket. Gareth has updated a magic CMS xml file using one given to him by Daniela and notes that they're still failing CMS xrootd tests. Gareth asks if the tests are critical, and if they are he pleads for help. The lack of CMS credentials is really nobbling their efforts to getting this sorted, or even digging up docs. Waiting for reply (3/2) Update- Daniela provided an update containing what I can only assume is an invocation of dark forces, Gareth has risked his immortal soul and applied it.

EDINBURGH
I'll probably be better off coming back to these in a few weeks' time!

https://ggus.eu/ws/ticket_info.php?ticket=100840 (29/1)
ECDF have an APEL-Pub nagios error going on. Looks like this has flown under the radar, probably due to both Andy and Wahid having more important things on their mind right now. Assigned (29/1)

https://ggus.eu/ws/ticket_info.php?ticket=99179 (25/11/2013)
Glue2 obsolete entries. Plans to retire the CEs have been slowed down due to waiting on networking changes. Andy reported that he'll fix the publishing if they're not in a position to decommission soon. On hold (24/1)

https://ggus.eu/ws/ticket_info.php?ticket=99180 (25/11/2013)
Similar to above, but publishing default values. It's the same CEs at fault, so this ticket is in the same boat. On hold (4/12/2013)

https://ggus.eu/ws/ticket_info.php?ticket=99794 (16/12/2013)
ECDF's perfsonar boxen blocking access to their webpages. Was held up by Christmas, but no news since - probably won't be for a few weeks. On hold (16/12/2013)

https://ggus.eu/ws/ticket_info.php?ticket=100569 (28/1)
The perfsonar latency box has started refusing connections. On hold whilst Andy's off. On hold (28/1)

https://ggus.eu/ws/ticket_info.php?ticket=95303 (1/7/2013)
glexec ticket. Sadly the same story as last time (or the last times).

DURHAM
https://ggus.eu/ws/ticket_info.php?ticket=99621 (10/12/2013)
Durham have a bad worker node, spotted by enmr.eu. Whilst the guys haven't had a chance to fix it, one could argue that an offlined problem is a solved problem, as it can't hurt the jobs anymore. On hold (28/1)

SHEFFIELD
https://ggus.eu/ws/ticket_info.php?ticket=100037 (3/1)
Sheffield's perfsonar box needed some site firewall holes poking for it. On the to do list is an upgrade and assimilation into the mesh due to only testing against 6 sites currently. On hold (27/1)

MANCHESTER
https://ggus.eu/ws/ticket_info.php?ticket=100867 (30/1)
Teething problems for Manchester's new perfsonar boxes. Alessandra asks Duncan if it can be closed. In progress (3/2) Update- Solved, and wasn't a site problem to begin with.

LANCASTER
https://ggus.eu/ws/ticket_info.php?ticket=100566 (27/1)
Lancaster isn't getting 10G performance out of its perfsonar boxen. My suspicion is that the NICs themselves are running slow, not the switches. Maybe I'm using the wrong drivers? In progress (3/2)

https://ggus.eu/ws/ticket_info.php?ticket=95299 (1/7/2013)
Lancaster's GLEXEC ticket, waiting on me getting a tarball one working. I'm currently trying out another tarball one on my test bed, but it's early days yet (it's more an exercise in documenting the errors at the mo). On hold (31/1)

https://ggus.eu/ws/ticket_info.php?ticket=100011 (31/12/2013)
Biomed stopped working for one of the Lancaster CEs. The ticket suffered from lack of priority (sorry biomed!). On hold (24/1)

UCL
https://ggus.eu/ws/ticket_info.php?ticket=95298 (1/7/2013)
The UCL glexec ticket. SL6 and DPM upgrades are done, Ben is just getting things settled before he starts tackling this. On hold (27/1)

QMUL
https://ggus.eu/ws/ticket_info.php?ticket=94746 (10/6/2013)
QM having trouble scrubbing the biomed out of their SE's information system. Chris submitted https://ggus.eu/ws/ticket_info.php?ticket=100290 and has put a lot of hours into this. On hold (14/1)

BRUNEL
https://ggus.eu/ws/ticket_info.php?ticket=100568 (28/1)
Brunel's perfsonar has problems. Raul plans to upgrade, and has made known his distaste that an upgrade requires a reinstall. In progress (29/1)

EFDA-JET
https://ggus.eu/ws/ticket_info.php?ticket=97485 (21/9/2013)
LHCB job problems still haunting JET. I think this ticket should be in "Waiting for reply", but I also think that I know the answer to the question (that the error message they're seeing is a red herring). In progress, should be in some other status (29/1)

TIER 1
https://ggus.eu/ws/ticket_info.php?ticket=100114 (8/1)
Chris has spotted jobs failing to get from RAL WMS to Imperial. Looked to be SSL problems. On hold awaiting RAL upgrade to the next WMS release. On hold (30/1)

https://ggus.eu/ws/ticket_info.php?ticket=100343 (16/1)
RAL WMS producing 512-bit proxies (occasionally). Waiting on the same release. Waiting for reply (?) (27/1)

https://ggus.eu/ws/ticket_info.php?ticket=100887 (31/1/2014)
Due to the same underlying issue as the above tickets, Chris asks for the gridsite package on the webdav LFC to be updated. In progress (31/1)

https://ggus.eu/ws/ticket_info.php?ticket=100507 (23/1)
CMS transfers failed between Caltech and RAL. The problem has eased itself, so the ticket only needs to be kept open if further investigation is warranted (as Brian pointed out). In progress (3/2)

https://ggus.eu/ws/ticket_info.php?ticket=98249 (21/10/2013)
CVMFS for SNO+. Almost there, creating the SNO+ tarballs to test with is taking longer than expected. On hold (29/1)

https://ggus.eu/ws/ticket_info.php?ticket=99556 (6/12/2013)
The new NGI Argus server (argusngi.gridpp.rl.ac.uk) has been set up in the gocdb and is online. In progress (30/1)

https://ggus.eu/ws/ticket_info.php?ticket=97025 (3/9/2013)
Ye olde RAL myproxy server name confusion issue. No news on this for a while, the hope is having this dealt with soon. But then the last update was nearly a month ago, so soon isn't as soon as we'd like it to be! On hold (6/1)

That's all folks. I noticed a few longstanding tickets have been solved over the course of January, so thanks for that!

Tools - MyEGI Nagios

Tuesday 26th November

  • Regional Nagios updated to release 22. It is a glite to UMD update and it required a fresh installation.
  • There have been some internal changes in SAM-Nagios. Test probes are now the responsibility of product team. Some test names have been changed as a result of this reorganization. For example the org.sam.CREAMCE-DirectJobSubmit test has become emi.cream.CREAMCE-DirectJobSubmit. This does not affect the operational activities.
  • Please could all site admins look at services associated to their site and please mail Kashif if anything odd is noticed. Site admins can reschedule tests for their sites and it would be helpful if most functionalities are tested.
  • Also, look at myegi which can be useful with links to the Dashboard, GSTAT, Accounting Portal and GGUS.
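
For sites with local scripts or dashboards that still refer to the old SAM test names, a small lookup table is one way to handle the renaming described above. This is a minimal Python sketch. Only the CREAMCE-DirectJobSubmit pair comes from this bulletin; other renamed tests would need to be added by hand, and the names RENAMED_TESTS and current_test_name are illustrative.

```python
# Old test name -> new test name. Only this pair is taken from the bulletin;
# further entries would have to be filled in from the SAM-Nagios release notes.
RENAMED_TESTS = {
    "org.sam.CREAMCE-DirectJobSubmit": "emi.cream.CREAMCE-DirectJobSubmit",
}


def current_test_name(test_name):
    """Return the new name for a renamed test, or the name unchanged."""
    return RENAMED_TESTS.get(test_name, test_name)
```

Names that are not in the table pass through untouched, so the helper is safe to apply to every test name a script handles.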
VOs - GridPP VOMS VO IDs Approved VO table

Tuesday 4 February 2014

  • Proxy renewal
  • Imperial have a workaround for proxy renewal
  • EMI released an update yesterday - should fix things, but needs to be deployed.

Tuesday 28 January 2014

  • hyperk having problems with proxy renewal.
    • This may be related to openssl

Tuesday 21 January 2014

  • Backup VOMS servers now configured at sites. Some small problems remain. Sites and VOs recommended to update their UIs to use all three servers.
  • Dirac setup (from Janusz):
    • Mice
    • NA62
    • Londongrid
    • SNO+
    • Landslides (resources need to be configured)


Tuesday 9 December 2013

  • Backup VOMS server
    • VO managers still need to check sites - Scotgrid, northgrid, southgrid, londongrid, gridpp VOs were going first, but have not yet updated their status.

Monday 2nd December 2013


Monday 25th November 2013

  • CVMFS progress - but not quite there yet.
  • 6 VOs (cern@school, gridpp, na62, pheno, sno+, t2k.org) have updated their VOID card entries and updated the wiki.
  • Storage
    • Gfal2 - GGUS 99043,99044,99055,99067 - not performant, but very interesting functionality
    • Webdav now enabled on LFC@RAL and ports free from firewall - needs testing

Tuesday 19 November 2013

Site Updates

Actions


Meeting Summaries
Project Management Board - Members Minutes Quarterly Reports

Empty

GridPP ops meeting - Agendas Actions Core Tasks

Empty


RAL Tier-1 Experiment Liaison Meeting (Wednesday 13:30) Agenda Meeting takes place on Vidyo.

Wednesday 5th February

  • Operations report
  • Work is progressing on Tier1 Network changes. On Tuesday 11th March it is planned to move the Tier1 to use the new site firewall. The plan is to install the new Routing layer for the Tier1 & change the way the Tier1 connects to the RAL network before this date.
  • It was announced that the software server used by the small VOs will be withdrawn from service (aiming for June).
  • CVMFS client version 2.1.17 is being tested on one batch of worker nodes (approx 10% of the batch farm).
  • The same batch of worker nodes has also been configured to access the new CernVM-FS Stratum-1 service at RAL (cvmfs-wlcg.gridpp.rl.ac.uk).
WLCG Grid Deployment Board - Agendas MB agendas

Empty



NGI UK - Homepage CA

Empty

Events

4th February 2014 With reference to the OMB on 30th January.

- UMD-2 will be decommissioned in the coming months. 30th April end of security support. 31st May all services to have been removed or upgraded.

- UK sites failing Glue2:

RAL-LCG2 UKI-LT2-IC-HEP UKI-NORTHGRID-MAN-HEP UKI-NORTHGRID-SHEF-HEP UKI-SCOTGRID-GLASGOW UKI-SOUTHGRID-RALPP

Check with the glue2 validator to see errors.

- A new Operations Dashboard will be in pre-production during February and moved into production in March.

- Availability/reliability targets for EGI are moving to 80%/85%.

- Some new ARC SAM tests are being introduced: org.nordugrid.ARC-CE-LFC-result; org.nordugrid.ARC-CE-LFC-submit; org.nordugrid.ARC-CE-SRM-result; org.nordugrid.ARC-CE-SRM-submit; org.nordugrid.ARC-CE-submit

- There is a summary page on how to publish from various middleware types: https://wiki.egi.eu/wiki/MAN09

- There was an overview of the French NGI adoption of iRODS.

- FedCloud to production: Management (GOCDB - ok); Monitoring (in progress); Accounting (in progress); Documentation (ok); Support (ok); Dashboard (in progress) and Security (in progress).

- The SAM tests are: org.nagios.CloudBDII-Check; eu.egi.cloud.OCCI-VM and org.nagios.OCCI-TCP. Security checks not yet available.

UK ATLAS - Shifter view News & Links

Empty

UK CMS

Empty

UK LHCb

Empty

UK OTHER
  • N/A
To note

  • N/A