Operations Bulletin 290713

Bulletin archive

Week commencing 22nd July 2013

Task Areas

General updates

Tuesday 23rd July

There is a small workshop today on clouds and virtualisation (agenda).
Tomorrow there is a High Performance Networking Special Interest Group meeting (programme) taking place at UCL.

Tuesday 16th July

A reminder of the puppet discussion meeting last week.
There was a pre-GDB on clouds last week and a GDB.
An EGI Operations Management Board (agenda) takes place today. Topics include UMD release updates, the future of torque support, EGI CSIRT procedure for compromised certificates and central security emergency suspension, ARGUS (for use with central suspension), the GLUE validator and requirements for the VO security contacts list.
There is a SHA-2 deadline of 1st October.
The WLCG June availability/reliability report is now final.
A WLCG workshop is being proposed for 11th-12th November (meeting information).
GridPP31 is scheduled to take place at Imperial College on 24th and 25th September.

WLCG Operations Coordination - Agendas

Tuesday 16th July

SL6
- EMI-3 voms-proxy-info: a third problem, with Java eating away memory. You can follow the story in tickets GGUS 94878 and GGUS 95574.
  - A fix is in the testing repositories and has been tested at Liverpool and Oxford.
- UK status: 4 sites online, 3 testing, 7 with a plan, 3 without a plan (UCL, Durham, RALPP).
- Presentation today at Atlas ADC weekly
- Checking now with sites how LHCb is doing. Not running everywhere it seems.
Monitoring
- A WLCG Monitoring consolidation group has been set up to consolidate the WLCG monitoring. It does not cover all monitoring (a portion developed by the experiments is not included), but it does concern the well-known dashboards.
  - WLCG monitoring Initial status.
  - Application Usage with experiments response.
  - First meeting last week. Next meeting Friday 19/7/2013
  - The wlcg-ops-coord-mon egroup is for sites to give feedback. Things already under discussion on the mailing list:
    - myWLCG vs SUM tests: they both get the information from the same source, i.e. Nagios.
    - Personalised dashboard looks interesting but was never publicized much.
    - How to change experiments nagios/SUM tests to make them more representative
Next Coord meeting Thursday 18/7/2013

Tuesday 9th July

SL6
- Atlas new sw validation system scalability problem has been solved.
- voms are now in the EMI-3 repository. No testing or prod PT repositories are necessary.
- UK status: 3½ sites online, 3 testing, 7 with a plan, 4 without a plan (UCL, Durham, RALPP, SUSX).
- HS06: T0 tests on the compilers didn't give significant differences. HEPiX has started an SL6 HS06 benchmark results page where sites are welcome to post their results.
Monitoring
- A WLCG Monitoring consolidation group has been set up to consolidate the WLCG monitoring. It does not cover all monitoring (a portion developed by the experiments is not included), but it does concern the well-known dashboards.
  - WLCG monitoring Initial status.
  - First meeting last week. The experiments have already given a first evaluation; sites will be represented via WLCG Ops Coordination. To get feedback from sites, a group has been set up to collect site opinions (see Maria's slide). Anyone interested should contact Pepe Flix (jflix@NOSPAMpic.es). David Crooks and Kashif might want to be part of it as this touches on the GridPP core tasks.
- Among the things of interest to discuss:
  - myWLCG vs SUM tests: they both get the information from the same source, i.e. Nagios.
  - Personalised dashboard looks interesting but was never publicized much.
  - Sites monitoring requirements: SUM tests not representing the real experiment status for example.

Tier-1 - Status Page

Tuesday 23rd July

Castor 2.1.13 upgrades for the CMS & LHCb instances today.
We have had ongoing problems with the batch server (pbs_server becoming unresponsive). Investigations ongoing.
There was a failure of the CERN Primary link yesterday (Mon 22nd) for around 7 hours. Traffic failed over to the backup link.
We had a GOC DB warning of short breaks in the network for firewall reboots this morning. Only one reboot done. (There were problems with the part of the network that hosts the GOC DB & APEL this morning. However that did not affect the Tier1 itself.)
Testing of alternative batch system proceeding.

Storage & Data Management - Agendas/Minutes

Tuesday 15th July

Three sites now run all UK FTS traffic via the 'FTS3' service as a test. Mostly successful (a small issue with a few US sites is to be resolved before taking tests further).

Tuesday 28th May

The 'Big Data' agenda is being compiled here. There is also now a suggestion for a cross disciplinary clouds and virtualisation workshop in July - the idea is 'in progress' but no more detail is yet available.

Accounting - UK Grid Metrics HEPSPEC06 Atlas Dashboard HS06

Tuesday 23rd July

Sites moving to SL6 are reminded of the need to re-benchmark their WNs. Some sites have updated the wiki already and provide an idea of the performance change.
There is an ongoing PMB discussion about the timeline for the next Tier-2 hardware tranche. Please let Pete or Jeremy know if your site will benefit from a spend this financial year.
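For sites re-benchmarking their WNs after the move to SL6, the performance change can be expressed as a percentage of the previous HS06 score. The following is a minimal sketch only: the function name and the example scores are hypothetical and are not taken from any site's published results.

```python
def hs06_change_percent(before, after):
    """Percentage change in an HS06 score from `before` (e.g. SL5) to `after` (e.g. SL6)."""
    if before <= 0:
        raise ValueError("the previous HS06 score must be positive")
    return 100.0 * (after - before) / before


# Made-up illustration values, not real benchmark results.
example_change = hs06_change_percent(8.0, 10.0)
print(f"HS06 change: {example_change:+.1f}%")
```

A positive value means the re-benchmarked score is higher than the previous one.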

Tuesday 30th April

A discussion is starting about how to account/reward disk that is reallocated to LHCb. By way of background, LHCb is changing its computing model to make more use of Tier-2 sites. They plan to start with a small number of big/good T2 sites in the first instance, and commission them as T2-Ds with disk. Ideally such sites will provide >300TB but for now may allocate 100TB and build it up over time. Andrew McNab is coordinating the activity for LHCb. (Note the PMB is already aware that funding was not previously allocated for LHCb disk at T2s.)

Tuesday 12th March

APEL publishing stopped for Lancaster, QMUL and ECDF

Tuesday 12th February

SL HS06 page shows some odd ratios. Steve says he now takes "HS06 cpu numbers direct from ATLAS" and his page does get stuck every now and then.
An update of the metrics page has been requested.

Check publishing via: http://gstat2.grid.sinica.edu.tw/gstat/summary/Country/UK/

Documentation - KeyDocs

See the worst KeyDocs list for documents needing review now and the names of the responsible people.

Tuesday 23rd July

Only minor updates to the keydocs mentioned last week as in need of attention/review. Please could everyone review the documents for which they are responsible.

Tuesday 16th July

Many key docs have reached their validity limit and need reviewing.

Tuesday 30th April

Working on improvements to VOMS related documents. User Expiry information and notifications.

Tuesday 9th April

Please could those responsible for key documents start addressing the completeness of documents for which they are responsible? Thank you.

Tuesday 26th February

KeyDocs monitoring status: Grid Storage(7/0) Documentation(3/0) On-duty coordination(3/0) Staged rollout(3/0) Ticket follow-up(3/0) Regional tools(3/0) Security(3/0) Monitoring(3/0) Accounting(3/0) Core Grid services(3/0) Wider VO issues(3/0) Grid interoperation(3/0) Cluster Management(1/0) (brackets show total/missing)
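The status line above packs each task area as Name(total/missing). As a minimal sketch, assuming that format holds for every entry, it can be parsed to pick out the areas with missing documents; the function name and the regular expression are illustrative and not part of any KeyDocs tooling.

```python
import re

# One entry looks like "Grid Storage(7/0)": task-area name, then (total/missing).
ENTRY = re.compile(r"\s*([^()]+?)\((\d+)/(\d+)\)")


def parse_keydocs_status(line):
    """Return {task_area: (total, missing)} parsed from a KeyDocs status line."""
    return {
        name.strip(): (int(total), int(missing))
        for name, total, missing in ENTRY.findall(line)
    }


status = parse_keydocs_status(
    "Grid Storage(7/0) Documentation(3/0) On-duty coordination(3/0) "
    "Cluster Management(1/0) (brackets show total/missing)"
)
areas_with_missing = [area for area, (_, missing) in status.items() if missing > 0]
```

The trailing explanatory note in brackets is ignored because it does not contain a total/missing pair of numbers.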

Interoperation - EGI ops agendas

Monday 22nd July

Problems with recent versions of VOMS, WMS, UI and Storm. New release of dCache that supports SHA-2 proxies.
S/w releases: dCache 2.6.5, which has support for SHA-2 certs. Backport of SHA-2 support to the 2.2.* series is expected at the end of the month.
S/w issues: The VOMS server doesn't (always) start the resource BDII automatically. This makes the SHA-2 probe fail, because in many cases there's no entry in the info system for the VOMS. There was some discussion on how to handle this, as it gives a false positive. The probes will be removed for the moment, with the expectation of a VOMS update to fix the BDII problem in short order. Once that's done, the alarms will be re-enabled. If no quick fix is forthcoming, then this will be looked at again (and put in abeyance). Therefore: ROD will be able to close the SHA-2 probe on VOMS servers if they think it's the right way to handle it; or, if there is a ticket already open, the ticket can be left open until resolved.
Gridsite problem: This affects UI, WMS and LB. The latest version of Gridsite breaks on proxies with '-' in them, which is seen as intermittent failures when attempting to delegate proxies. A workaround is to run 'yum downgrade gridsite gridsite-libs' on the WMS or 'yum downgrade gridsite-commands gridsite-libs' on the UI.
Storm: Performance problems with current version of Storm. This means that the current version isn't certified - hence the SHA-2 tests for Storm are off, because there is not a certified version of Storm that supports SHA-2.

gLite support calendar.

Monitoring - Links MyWLCG

Tuesday 23rd July

Reminder that there is a WLCG monitoring consolidation group. Site admins can still get involved. There is an overview page showing areas of consideration.

Tuesday 18th June

David C is taking feedback on the Graphite implementation presented at the HEPSYSMAN meeting. Also considering integrating Site Nagios.

Glasgow dashboard now packaged and can be downloaded here.

On-duty - Dashboard ROD rota

Tuesday 23rd July

Looking at options to supplement ROD team. Tier-1 may provide some effort.

The Operations Dashboard is full of SHA-2 critical alarms. As most of the sites are failing one or more SHA-2 tests, tickets are being created against most sites. These alarms are generated by Midmon Nagios, which checks whether the service endpoint is SHA-2 compliant. A list of SHA-2 ready middleware has been produced, as has a summary of the related SAM tests. Most of the alarms are related to the CREAM CE. The easiest way to solve this issue is to upgrade to the latest EMI2 or EMI3 release. The baseline release for the CREAM CE is update 10, released in EMI on 2nd April 2013.
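For context, "SHA-2" here refers to certificates and proxies signed with a SHA-2 family digest rather than SHA-1. The sketch below is not how the Midmon Nagios probes work; it only illustrates telling the two apart by reading the "Signature Algorithm" line from the text output of openssl x509 -noout -text. The function names are hypothetical.

```python
import re

# SHA-2 family digests as they appear in algorithm names such as
# "sha256WithRSAEncryption" or "ecdsa-with-SHA384".
SHA2_DIGEST = re.compile(r"sha(224|256|384|512)", re.IGNORECASE)


def signature_algorithm(openssl_text):
    """Return the first 'Signature Algorithm' value in the text, or None."""
    match = re.search(r"Signature Algorithm:\s*(\S+)", openssl_text)
    return match.group(1) if match else None


def is_sha2_signed(openssl_text):
    """True if the certificate text reports a SHA-2 family signature."""
    algorithm = signature_algorithm(openssl_text)
    return bool(algorithm and SHA2_DIGEST.search(algorithm))
```

Whether a service accepts such credentials is a separate question that only the service-level tests can answer.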

Rollout Status WLCG Baseline

Tuesday 9th July

New EMI2 and EMI3 release yesterday. No staged rollout requests yet. Imperial upgraded their WMS and they have been somewhat shaky ever since.

Tuesday 18th June

New EMI3 CE coming into SR. Liverpool will test.
A lot of EMI3 testing done at Brunel.
The EMI-3 testing page contains all issues I am aware of. It's a Wiki though, so if you find an issue, please put it in the appropriate category.

Tuesday 14th May

A reminder. Please could sites fill out the EMI-3 testing contributions page. This is for all testing not just SR sites as we want to know which sites have experience with each component.

References

The Staged Rollout pages (now separated into EMI1 and EMI2) and the page listing the deployed versions are extracted from the BDII, so they should all be reasonably up to date:
http://www.hep.ph.ic.ac.uk/~dbauer/grid/staged_rollout.html
http://www.hep.ph.ic.ac.uk/~dbauer/grid/staged_rollout_emi2.html
http://www.hep.ph.ic.ac.uk/~dbauer/grid/state_of_the_nation.html

Security - Incident Procedure Policies Rota

Monday 22nd July

A summary of the SSC6 findings was circulated last week. Questions?

Tuesday 11th June

We would like to collect immediate feedback on the security training held last week in conjunction with HEPSYSMAN.
Suggestions on future training and on the content last week would be useful.
John added a wiki page on forensics.

Tuesday 21st May

SL6 vulnerability. Need to track progress. (See private thread).

Services - PerfSonar dashboard | GridPP VOMS

Tuesday 23rd July

PerfSONAR: the issues with the WLCG mesh appear to be understood and a new minor release (e.g. 3.3.1) is likely to be released. In the meantime please could sites upgrade by following instructions here but leave the WLCG mesh URL (tests-wlcg-all.json) commented out. Please also update the site progress page.
Where are we with the VOMS rollout?

Monday 10th June

Issue with neurogrid.incf.org ownership. Is more guidance needed?
Where are we with the perfsonar mesh?
Are we ready for full rollout of the VOMS backups?

Tickets

Monday 22nd July 2013. 45 open tickets for the NGI this week. http://tinyurl.com/cblj3ab

With so many open tickets and my day starting with a trip to the vets and then ending up with me working from home with one eye on a poorly cat, I'm afraid this is the second week in a row with a half-cocked ticket review. Sorry about that. The connection at home is a bit slow, making checking individual tickets a pain, hence the different format. There's no excuse for any puns though.

New VO (https://ggus.eu/ws/ticket_info.php?ticket=95792 (16/7) ) Chris has submitted a request for the creation of the HyperK VO on the UK voms server. The request is chugging along. In progress (18/7).

NGS wind down. There's a handful of tickets tracking the closure of some ngs resource centres (Keele-NGS, NGS-Leeds and ral-ngs2). I don't think this affects anyone in GridPP, but I like to report anything out of the ordinary.

SHA-2 hitting the fan... As mentioned by Kashif, a number of sites have been handed tickets after failing SHA-2 tests. Liverpool, Lancaster, RALPP, Bristol, ECDF and Durham have all received tickets for one or two of their CREAM CEs; IC received one for their WMS (which Daniela has already expressed her righteous displeasure about). Most are In Progress already.

UK Cloud Site (https://ggus.eu/ws/ticket_info.php?ticket=94780) There's a request from Malgorzata if we can move forward with the cloud site, everything that's needed to be set up has been set up.

Unresponsive VOs still being Unresponsive (https://ggus.eu/ws/ticket_info.php?ticket=95442). Some movement here, for one from https://ggus.eu/ws/ticket_info.php?ticket=95470 babar is now deleted (just in case you still support babar somewhere). supernemo is likely to go the same way (https://ggus.eu/ws/ticket_info.php?ticket=95469). The camont (https://ggus.eu/ws/ticket_info.php?ticket=95474) and minos (https://ggus.eu/ws/ticket_info.php?ticket=95472) tickets still haven't been acknowledged.

gLEXEC-utive Decision. Not much movement on the gLExec deployment front - but from most sites' initial replies progress wasn't expected until after July was over. The list of gLExec-less sites (or sites with broken gLExec) is Sussex, Cambridge, Bristol, Birmingham, ECDF, Durham, Sheffield, Manchester, Lancaster, UCL, RHUL, QMUL and EFDA-JET. There is no shame in being on that list (not yet anyway!).

LFC Webdav (https://ggus.eu/ws/ticket_info.php?ticket=91658) Catalin has had a go at installing the LFC webdav, but would like a hand in implementing the webdav interface.

Tools - MyEGI Nagios

Tuesday 23rd July

In a campaign to update VO ID card details it turns out that a few of our supported VOs are obsolete: babar, possibly supernemo and ngs.ac.uk. The first of these can be safely removed but we need to confirm our announcement process.

Tuesday 11th June

Installation of DIRAC instance at IC ready for 'another' test user.

Tuesday 13th November

Noticed two issues during the Tier-1 power cut. The SRM and direct CREAM submission tests use the top BDII defined in the Nagios configuration to query for the resource. These tests started to fail because the RAL top BDII was not accessible. The configuration doesn't use BDII_LIST, so I cannot define more than one BDII. I am looking into how to make it more robust.
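One way to make a single-BDII dependency more robust is to try a list of top BDIIs in order and use the first that responds. The sketch below illustrates that idea only and is not the SAM-Nagios implementation: the hostnames are hypothetical placeholders, and port 2170 is assumed as the BDII LDAP port.

```python
import socket


def tcp_reachable(endpoint, timeout=5.0):
    """Return True if a TCP connection to (host, port) succeeds within the timeout."""
    host, port = endpoint
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False


def first_reachable(endpoints, is_reachable=tcp_reachable):
    """Return the first endpoint that passes the reachability check, else None."""
    for endpoint in endpoints:
        if is_reachable(endpoint):
            return endpoint
    return None


# Hypothetical top BDII endpoints, listed in order of preference.
CANDIDATE_BDIIS = [
    ("top-bdii-primary.example.org", 2170),
    ("top-bdii-backup.example.org", 2170),
]
```

A caller would query whichever endpoint first_reachable(CANDIDATE_BDIIS) returns and treat None as "no BDII available". Passing the reachability check as a parameter keeps the ordering logic testable without network access.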

The Nagios web interface was not accessible to a few users because GOCDB was down. It is a bug in SAM-nagios and I have opened a ticket.

The availability of sites has not been affected by this issue, because Nagios sends a warning alert when it cannot find a resource through the BDII.

I have applied a patch to fix nagios jobs segfaulting on SL6 WN's. https://tomtools.cern.ch/jira/browse/SAM-2999

VOs - GridPP VOMS VO IDs Approved VO table

Mon 17th June

SNO+ request for Ubuntu UI. Do we have one?
Short Dirac update from Janucz
cernatschool.org VO WMS enabled at Glasgow - waiting for testing. Operations portal entry to be created.

Thurs 6th June

SNO+ jobs now work through the glasgow WMS

Mon 20 May

RAL wms02 and wms03 seem to have been taken out of commission but were still in the information system.
Glasgow WMS doesn't accept SNO+ jobs (https://ggus.eu/ws/ticket_info.php?ticket=94213)
SNO+ filling with water and expect to be taking test data Aug/Sept - expect more grid use after that.
Epic doing serious testing - running at Glasgow, Liverpool and Lancs.

Thurs 16 May

SL6 - likely to be deployed for LHC VOs, non LHC should be aware - see mail to vo-admins list.

Site Updates

Actions

Meeting Summaries

Project Management Board - Members Minutes Quarterly Reports

Empty

GridPP ops meeting - Agendas Actions Core Tasks

Empty

RAL Tier-1 Experiment Liaison Meeting (Wednesday 13:30) Agenda EVO meeting

Wednesday 24th July

Operations report
The CMS & LHCb Castor instances were upgraded to version 2.1.13 yesterday (23rd).
The upgrade to CVMFS version 2.1.12 across the farm has largely removed the batch job set-up problem for LHCb and this problem will now be regarded as solved.
There was a hardware problem on an Atlas Disk Server (GDSS664) that was causing the RAID array not to rebuild. Atlas were warned of potential data loss. However, the fabric team managed to recover the server and all the data has now been successfully copied off.
There have been ongoing problems with the batch server (pbs_server) that are still being investigated.
Note: No meeting next week (31st July) owing to many staff attending an internal event.

WLCG Grid Deployment Board - Agendas MB agendas

Empty

NGI UK - Homepage CA

Empty

Events

Empty

UK ATLAS - Shifter view News & Links

Empty

UK CMS

Empty

UK LHCb

Empty

UK OTHER

N/A

To note

N/A
