Operations Bulletin 200513

Bulletin archive

Week commencing 13th May 2013

Task Areas

General updates

Tuesday 14th May

There was a positive PMB review of the Tier-1 last Friday. The talks on the agenda may be of interest.
We held a core ops tasks update meeting on Thursday (agenda).
There was a GDB at CERN last Wednesday (agenda; notes)
Plans are being drawn up for the operations agenda at the EGI Technical Forum in September. Let Jeremy know if you would like specific topics put forward for inclusion.
EGI has funded a mini-project to support a task force on future developments of availability computation. They are seeking more contributors for the work.
An informal feeback page on GridPP site findings on EMI2/3 and SL5/6 compatibility is now in the wiki.

Monday 6th May

The April WLCG Tier-2 availability and reliability figures are now available as a pdf.
camont.gridpp.ac.uk is to be decommisioned (see GGUS 93828).
There was a presentation on Ceph on Friday.
JANET is setting up a Special Interest Group on the topic of High Throughput Networking (see here for more details.
The GDB agenda for this week has evolved very slightly.
Please help Alessandra ensure this SL6 status page is up-to-date.
WLCG is now ready for the experiments to test their services (and clients) for SHA-2 readiness. ([ https://twiki.cern.ch/twiki/bin/view/LCG/SHA2readinessTesting more information]). There is a related egroup.

Tuesday 30th April

There is a GDB next week (agenda).
WLCG squid monitoring is in place. How are we using it?
The EarthSci VO (now under earthsci.vo.gridpp.ac.uk) has been approved by the PMB.

WLCG Operations Coordination - Agendas

Tuesday 14th May

Next Meeting 16th May 2013

Tuesday 30th April

Extracts form the 25th April 2013 meeting minutes

EGI operations
- User suspension policy under discussion the emergency procedure for central suspension has been extended.
- GGUS new workflows have been defined. The most important ones is that every supporter will have rw access to all the tickets and no best effort from the support units or team products will be accepted. Tickets should not be left without an answer.
- Continuation of support for several products still needs to be clarified, including: WMS, EMI-Common (UI, WN, YAIM, Torque config, emi-nagios), EGI will liaise directly with PTs to get information about release and software support plans;
Experiments
- Atlas
  - SL6: see SL6 TF
  - cvmfs: waiting for a stable cvmfs 2.1 for sites that needs NFS export. 2.1.9 is promising but needs more testing.
- CMS
  - Submission of HammerCloud trough GlideinWMS: comparison to gLite done - o.k. Will switch beginning of May
  - Updates of Squid configuration to WLCG monitoring: about a third of the sites has done it. Followed in CMS Computing Operations.
- LHCb
  - glexec: solved problems with software, deployment is manpower intensive though and experiment doesn't want to do it. WLCG needs to take care of it.
  - SL6: see TF
cvmfs
- Atlas and LHCb deadline is the 30 April [note: UK is ok]
- CMS deadline is 1 April 2014 but will stop but already from September 30 no software installation jobs will be sent
- CVMFS 2.1.9 is about to be released; the update is recommended but sites using the NFS export or at the 2.0 version should test it carefully for a few weeks.
- Finally, the testing and deployment process is described; in particular, sites should upgrade their nodes in stages. Interested sites are invited to join the "pre-production" effort.
glexec
- Maarten asks the MB reiterates that all sites need to install glexec
perfSONAR
- Release candidate v3.3 is out. Currently been tested in US and at CERN not yet suggested to sites but if in 2 weeks no problems are discovered sites will be encouraged to install this version.
SL6
- Created a deployment page to track sites status: https://twiki.cern.ch/twiki/bin/view/LCG/SL6DeploymentSites
- Increased information in procedures https://twiki.cern.ch/twiki/bin/view/LCG/SL6Migration#Procedures_and_how_to_contact_ex
  - None of the experiments want mixed queues they ask for one queue per architecture
  - Best way to go is to reuse an existing queue if possible
  - WLCG repository has been created and should be enabled by all sites. It already contains the latest version of HEP_OSlibs [tested by Brunel and Oxford plus two Turkish sites]
  - LHCb requires the CE/queue information to be published in production in the BDII otherwise they don't see them automatically and would like to avoid manual steps for 150 sites. [RAL SL6 testing queue wasn't published and this is why it wasn't used. It is now]
  - Atlas has found a problem with the excessive number of file descriptors (similar to those observed by Brunel on their SL6 CE). Problem has been passed to the TF.
Frontier/squid
- Squid upgrade is now being followed by CMS and Atlas computing respectively
- Dave Dykstra is not part of squid support anymore
  - Representatives of CMS and Atlas Frontier/squid groups have joined WLCG Coord to replace him and help with future squid requests.

Tuesday 23rd April

The next meeting takes place this Thursday (agenda)

Tier-1 - Status Page

Tuesday 14th May

Overall a quiet week operationally.
On Wednesday last week (8th May) we successfully swapped the production & standby databases behind the castor service.
Testing of alternative batch system (slurm & Condor) proceeding.
Investigations are still ongoing into problems at batch job set-up.

Storage & Data Management - Agendas/Minutes

Wednesday 1 May 2013

Puppet report from March Puppet camp in London
Technical suggestions for hepsysman, or otherwise

Tuesday 30th April

The DPM collaboration formally starts on 1st May 2013. For those interested in the collaboration agreement see this page.

Friday 17th April

Good buzz at EGI CF last week: excellent GridPP presence, loads of useful people to talk to. We spent today's meeting comparing notes.

Tuesday 9th April

There is an update to the DPM Collaboration Agreement.
Wahid is leading the EGI CF storage thread on Wednesday from 14:00.

Accounting - UK Grid Metrics HEPSPEC06 Atlas Dashboard HS06

Tuesday 30th April

A discussion is starting about how to account/reward disk that is reallocated to LHCb. By way of background, LHCb is changing its computing model to use more of Tier-2 sites. They plan to start with a small number of big/good T2 sites in the first instance, and commission them as T2-Ds with disk. Ideally such sites will provide >300TB but for now may allocate 100TB and build it up over time. Andrew McNab is coordinating the activity for LHCb. (Note the PMB is already aware that funding was not previously allocated for LHCb disk at T2s).

Tuesday 12th March

APEL publishing stopped for Lancaster, QMUL and ECDF

Tuesday 12th February

SL HS06 page shows some odd ratios. Steve says he now takes "HS06 cpu numbers direct from ATLAS" and his page does get stuck every now and then.
An update of the metrics page has been requested.

Check publishing via: http://gstat2.grid.sinica.edu.tw/gstat/summary/Country/UK/

Documentation - KeyDocs

See the worst KeyDocs list for documents needing review now and the names of the responsible people.

Tuesday 30th April

Working on improvements to VOMS related documents. User Expiry information and notifications.

Tuesday 9th April

Please could those responsible for key documents start addressing the completeness of documents for which they are responsible? Thank you.

Tuesday 26th February

KeyDocs monitoring status: Grid Storage(7/0) Documentation(3/0) On-duty coordination(3/0) Staged rollout(3/0) Ticket follow-up(3/0) Regional tools(3/0) Security(3/0) Monitoring(3/0) Accounting(3/0) Core Grid services(3/0) Wider VO issues(3/0) Grid interoperation(3/0) Cluster Management(1/0) (brackets show total/missing)

Thu, 25th April

Process for de-emailing the host certificates of a site. This allows an admin to replace certificates with email address components in the DN - they cause havoc in the security system.

Monday, 4th March

New draft document for putting a CE in downtime. It discusses the pros and cons of 3 approached. Needs discussion to finalise.

New draft document of Policies_for_GridPP_approved_VOs. Main points: 1) PMB/Griddpp approval; 2) put it in ops portal; 3) Lifecycle issues.

Changes to VomsSnooper RPM to obey IsVomsAdminServer tag implemented and changes rolled into future Approved VOs documents.

EPIC integrated into Approved VOs, pending full acceptance.

Process commissioned to delete stale documents.

Thursday, 29th November

The Approved VOs document has been updated to automatically contain a table that lays out the resource requirements for each VO, as well as the maximum. We need to discuss whether this is useful - it seems that the majority of WN software requirements are passed around by word of mouth etc. Should this be formalized? Please see

   https://www.gridpp.ac.uk/wiki/GridPP_approved_VOs#VO_Resource_Requirements

This table will be kept up to date with a regular process that syncs it with the CIC Portal, should it prove to be useful.

Interoperation - EGI ops agendas

Monday 6th May

There was an update of the package log4c (from v1.2.1 to v1.2.3) in EPEL, which is not compatible, and breaks, with the LB version currently available in the UMD-2 repositories (LB v3.2.8). For the moment, I would not recommend to update the log4c package. This issue affects both SL5 and SL6 installations.

Developers provided a workaround which is now described in the LB known issues.

Tuesday 30th April

There was an EGI ops meeting on 24th April. (agenda).
Note there is a change in the recommended hardware for running a top-BDII.

gLite support calendar.

Monitoring - Links MyWLCG

Tuesday 13th May

David C will present the monitoring work at Glasgow (based on Graphite) at the HEPSYSMAN meeting in June.

Glasgow dashboard now packaged and can be downloaded here.

On-duty - Dashboard ROD rota

Monday 13th May

QMUL received an EMI-1 ticket for Storm. Otherwise a quiet week.

Monday 29th April

Not much to report. This is the week the EMI1 tickets will have to be dealt with one way or another.
RALPP has upgraded their CEs, but the EMI1 alarm keeps coming back intermittently, I guess there's a bad cache somewhere. The other sites with open EMI1 tickets are Glasgow (for the WMS) and Durham.

Rollout Status WLCG Baseline

Tuesday 14th May

A reminder. Please could sites fill out the EMI-3 testing contributions page. This is for all testing not just SR sites as we want to know which sites have experience with each component.

References

Staged Rollout pages (now separated into EMI1 & 2), and the page listing the deployed versions is extractable from the bdii, so they should all be reasonably up-to-date:
http://www.hep.ph.ic.ac.uk/~dbauer/grid/staged_rollout.html
http://www.hep.ph.ic.ac.uk/~dbauer/grid/staged_rollout_emi2.html
http://www.hep.ph.ic.ac.uk/~dbauer/grid/state_of_the_nation.html

Security - Incident Procedure Policies Rota

Tuesday 30th April

Sites are continuing to upgrade their kernels to rectify CVE-2013-0871. Security dashboard shows 3 sites with outstanding upgrades. This vulnerability is still considered HIGH risk by EGI-CSIRT.

Monday 8th April

We have a number of site notifications from Pakiti. Please check your site summary.

Services - PerfSonar dashboard | GridPP VOMS

Tuesday 14th May

perfSonar support team is asking for statements from the projects using it to help securing funding for their team. Below the email they've sent. The WLCG TF is looking for the WLCG MB and Computing coordinators statements but it was agreed that statements from the sites would also help. Below is the email sent to the users mailing list.

From: Jason Zurawski Date: Mon, May 6, 2013 at 11:07 AM Subject: [perf-node-users] Community Assistance Request

All;

One of the great cruelties of open source software is that while it is free to use, it is not free to develop and maintain. From time to time it becomes necessary to tap the 'tip jar' that sustains the perfSONAR-PS project.

In years past we have asked for help in developing, testing, and even evangelizing the software to make things stronger and more reliable. We have a different request this go-round: we simply need statements from the community (U.S. Based, as well as International) that perfSONAR is a useful (perhaps even vital) part of your campus/lab/regional/backbone infrastructure, and life would be more difficult if this software were to fade away. Perhaps additional comments if you have used the tool to fix a hard problem, enable a complex activity, or if it is critical to day to day operations.

This request does come with a caveat - while the words/praise of a network/software engineer or researcher are important, the words/praise from a Director/Executive/Chief matter a lot more to the parties that control the purse strings. Since the majority of subscribers to this list may fall into the former category, we would ask that you please talk with your supervisors (perhaps even draft them some text they could lend their name to) about making a statement regarding the usefulness of perfSONAR. Naturally, we need as many statements as we can, so if you can't get the ear of your management, your own personal account would be most helpful to this request.

This simple act will go a long way to ensuring that the project is able to keep up with the demand of software updates, bug fixes, and feature enhancements in a down economy; to do this it is vital that we can convince our own management chain that there are passionate and active users out there. By our count we know of nearly ~600 deployed instances of the software spanning numerous domains and countries. These letters of support can be addressed and sent to both Steve Wolff (swolff@internet2.edu) and Wendy Huntoon (huntoon@internet2.edu), who will aggregate them into a cohesive report. If possible these notes should come in the next 1-2 weeks.

Thanks again for all of your support for this project, and we do appreciate all users and want to ensure that we can continue to delivery for the needs of the networking community into the future;

Tuesday 23rd April

Ewan created a wiki page tracking the plan for deploying VOMS at Oxford and IC into production.

Tuesday 9th April

It is now getting urgent to configure and have enabled the backup VOMS instances at Oxford and Imperial. Please can we arrange a follow-up meeting (postponed last week as Daniela was out).

Tickets

Monday 13th May 14.45 BST 16 Open UK tickets this afternoon.

EMI Upgrade QMUL: https://ggus.eu/ws/ticket_info.php?ticket=93981 (10/5) We thought we had the last of these, but sadly Storm has started triggering an EMI1 alert. As Chris is in the unenviable position of having no where to upgrade to (a working EMI3 Storm has an ETA of the end of May) upgrading isn't an option. Chris has held this ticket, but it might be that we need to counter-ticket the nagios team (citing the test as "unsuitable") or get testimony from the Storm devs to back our case *if* someone gets shirty about this. Will keep an eye on this one. On Hold (13/10) Update - Daniela has strongly recommended that we launch a pre-emptive counter-ticket at storm and/or nagios.

TIER 1 https://ggus.eu/ws/ticket_info.php?ticket=93149 (5/4) Atlas jobs were failing on nodes testing a new version of cvmfs. A new new version was installed on Friday, and appears to be working, which is good news. Testing is still ongoing though. On hold (13/5)

https://ggus.eu/ws/ticket_info.php?ticket=92266 (6/3) The new myproxy server is up and running, but no feedback has been given on the ticket. Has feedback been given elsewhere? It's likely that we just want to close this as I'm not sure feedback will be forthcoming, any problems with the new service could be handled in a new, fresh ticket. Waiting for reply (22/4)

QMUL https://ggus.eu/ws/ticket_info.php?ticket=93791 (2/5) LHCB jobs were just staying idle at QM, Chris tracked it due to a bug in the CREAM/SGE interactions - lhcb set a memory requirement which cream wasn't passing on correctly. Chris patched his scripts and submitted a ticket (https://ggus.eu/ws/ticket_info.php?ticket=93956). Probably needs to be Waiting for Reply-ed to get the thumbs up from lhcb. In progress (9/5) UPDATE- This ticket has been solved and verified

RALPP https://ggus.eu/ws/ticket_info.php?ticket=93905 (7/5) Chris B shuffled this CMS ticket over to GGUS shortly before heading on leave - and its been untouched ever since. Can anyone else in RALPP comment on it? It could be that the problem is no more (I'm ever the optimist!). In progress (7/5) Update - Rob is on it, wonkiness is still seen.

DURHAM https://ggus.eu/ws/ticket_info.php?ticket=92590 (18/3) This ticket is mega-crusty now (I can think of no other phrase to describe it). GGUS ticket monitoring have involved Claire. Let's not have things escalate over such a benign issue. On hold (18/3)

SOLVED CASES (both freshly conquered) https://ggus.eu/ws/ticket_info.php?ticket=93833 (3/5) Imperial were called out over both of their load-balanced site-BDIIs not being in the GOCDB. Daniela solved the ticket, although I admit to being a little confused by the end. It wasn't in the gocdb, but in the general site information that both BDIIs were expected to be published. Things got mixed up in my head.

https://ggus.eu/ws/ticket_info.php?ticket=89804 (18/12/2012) After no word from Stephane or anyone else from atlas, Sam closed this ticket after deleting Glasgow's groupdisk space token. He was certainly in the right, but it would have been nice for Atlas to sanction the deletion. As Sam correctly pointed out, the problem was communication with the Atlas "central" and not the UK Atlas support, who have been nothing but communicative, helpful and generally great.

Tools - MyEGI Nagios

Tuesday 16th April

Installation of DIRAC instances at IC pending return of Janusz.

Tuesday 13th November

Noticed two issues during tier1 powercut. SRM and direct cream submission uses top bdii defined in Nagios configuration to query about the resource. These tests started to fail because of RAL top BDII being not accessible. It doesn't use BDII_LIST so I can not define more than one BDII. I am looking into that how to make it more robust.

Nagios web interface was not accessible to few users because of GOCDB being down. It is a bug in SAM-nagios and I have opened a ticket.

Availability of sites have not been affected due to this issue because Nagios sends a warning alert in case of not being able to find resource through BDII.

I have applied a patch to fix nagios jobs segfaulting on SL6 WN's. https://tomtools.cern.ch/jira/browse/SAM-2999

VOs - GridPP VOMS VO IDs Approved VO table

Thurs 16 May

SL6 - likely to be deployed for LHC VOs, non LHC should be aware - see mail to vo-admins list.

Monday 8th April

Please note Chris W is away this week.
Information is being gathered for the Q1 2013 quarterly report.

Tuesday 2 April 2013

Instructions now available for hardware x509 key https://www.gridpp.ac.uk/wiki/KeyTokens

Monday 4th March 2013

NGS VOMS server. https://ggus.eu/ws/ticket_info.php?ticket=90356 - Glasgow now checked. All sites now migrated.
WebDAV access to LFC requested.
Progress on myproxy issues.

Site Updates

Actions

Meeting Summaries

Project Management Board - Members Minutes Quarterly Reports

Empty

GridPP ops meeting - Agendas Actions Core Tasks

Empty

RAL Tier-1 Experiment Liaison Meeting (Wednesday 13:30) Agenda EVO meeting

Wednesday 15th May

Operations report
A quiet week operationally.
Seven new disk servers (630TB) were added to AtlasDataDisk and a further six (540TB) added to LHCbDst.
Castor outage being scheduled during next Tuesday morning (21st May) for network intervention.
Dates being suggested for Castor 2.1.13 upgrade starting with the nameserver plus Atlas stager.

WLCG Grid Deployment Board - Agendas MB agendas

Empty

NGI UK - Homepage CA

Empty

Events

Empty

UK ATLAS - Shifter view News & Links

Empty

UK CMS

Empty

UK LHCb

Empty

UK OTHER

N/A

To note

N/A

Operations Bulletin 200513

Navigation menu

Personal tools

Namespaces

Variants

Views

Actions

Search

Main GridPP website

Navigation

Tools