Operations Bulletin 280113

Bulletin archive

Week commencing 21st January 2013

Task Areas

General updates

Monday 21st January

A permanent Nagios probe working group will be created under the coordination of COD. The objectives are to evaluate existing Nagios probes, starting from known problem areas; to evaluate new probes before integration within SAM; and to propose to the EGI OMB which probes should be integrated or rejected. If you would like to be involved, speak to Jeremy.
Moving tickets to 'in-progress' and the status of re-opened tickets.
Posters and demonstrations submissions/proposals for the EGI CF: 27 January 2013
EGI request for an NGI core services 'Service Group'.
Last Tuesday there was a pre-GDB on operations topics. See the talks on the agenda for more information. Some areas will be covered under the relevant bulletin updates.
The WLCG T2 availability/reliability report is now final.
EGI now hosts a middleware support calendar. For EMI-1, probes will warn from January and go critical in March. The deadline is the end of April.
There is now an inter-NGI resource consumption report
The next WLCG Operations Coordination meeting takes place on 24th (agenda)
Middleware configuration post-EMI will be the subject of a survey to be circulated soon.
EGI calls for more participation in its Resource Allocation task force. The idea is to make more resources available (for a testbed in the first instance) to help smaller VOs get access to resources.

Tuesday 15th January

Failing dteam jobs - request for sites to check config with vomsSnooper
EGI-Inspire have funds for mini-projects for new work in operations, communication, coordination... if you have ideas please share them.
The January GDB agenda is available.
There is a pre-GDB on Operations. See the agenda. The focus is Operations (e.g. SL6 and security).
ATLAS and LHCb have a 30th April 2013 target date for their sites to have CVMFS.

Tier-1 - Status Page

Tuesday 22nd January

Problem over weekend with CRLs at CERN caused some operational problems for the RAL Tier1 on Saturday.
New version of Top BDII being prepared for roll-out (EMI2 on SL6).
Participating in Atlas FAX (Federated Atlas Xrootd) tests.
Other items:
- Ongoing investigation into asymmetric data rates seen to remote sites.
- Test instance of FTS version 3 being tested by Atlas & CMS.
Although not part of the Tier1 directly, there have been problems with one of the FS servers (now resolved).

Storage & Data Management - Agendas/Minutes

Wednesday 5 Dec

DPM EMI upgrades:
- Future DPM support now better understood (DMLite)
- Brunel still to try dCache migration
ATLAS Jamboree next week, ATLAS want to change all their filenames... (by 2014)
How we are doing Big Data(tm)

Accounting - UK Grid Metrics HEPSPEC06 Atlas Dashboard HS06

Tuesday 30th October

Storage availability in SL pages has been affected by a number of sites being asked by ATLAS to retire the ATLASGROUPDISK space token while the SUM tests were still testing it as critical. The availability will be corrected manually once the month ends. Sites affected in different degrees are RHUL, CAM, BHAM, SHEF and MAN.

Friday 28th September

Tier-2 pledges to WLCG will be made shortly. The situation is fine unless there are significant equipment retirements coming up.
See Steve Lloyd's GridPP29 talk for the latest on the GridPP accounting.

Wednesday 6th September

Sites should check the ATLAS page reporting the HS06 coefficient because, according to the latest statement from Steve, that is what is going to be used. The Atlas Dashboard coefficients are averages over time.

I am going to suggest using the ATLAS production and analysis numbers given in hs06 directly rather than use cpu secs and try and convert them ourselves as we have been doing. There doesn't seem to be any robust way of doing it any more and so we may as well use ATLAS numbers which are the ones they are checking against pledges etc anyway. If the conversion factors are wrong then we should get them fixed in our BDIIs. No doubt there will be a lively debate at GridPP29!
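The conversion mentioned above (CPU seconds to HS06) can be illustrated with a minimal sketch. It assumes a single per-core HS06 coefficient for the site; the function name and the example figure of 10 HS06 per core are illustrative assumptions, not values taken from any dashboard or BDII.

```python
def hs06_hours(cpu_seconds, hs06_per_core):
    """Convert CPU time in seconds to HS06-hours.

    hs06_per_core is the site's benchmark coefficient (assumed here to be
    a single number per core; real sites may have mixed hardware).
    """
    return cpu_seconds / 3600.0 * hs06_per_core


if __name__ == "__main__":
    # Hypothetical example: 7200 CPU seconds on cores rated at 10 HS06 each.
    print(hs06_hours(7200, 10.0))
```

If the coefficient published for a site is wrong, every figure derived this way is wrong by the same factor, which is why the published values matter.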

Check publishing via: http://gstat2.grid.sinica.edu.tw/gstat/summary/Country/UK/

Documentation - KeyDocs

Tuesday 4th December

KeyDocs monitoring status: Grid Storage(7/0) Documentation(3/0) On-duty coordination(3/0) Staged rollout(3/0) Ticket follow-up(3/0) Regional tools(3/0) Security(3/0) Monitoring(3/0) Accounting(3/0) Core Grid services(3/0) Wider VO issues(3/0) Grid interoperation(3/0) Cluster Management(1/0) (brackets show total/missing)

Thursday, 29th November

The Approved VOs document has been updated to automatically contain a table that lays out the resource requirements for each VO, as well as the maximum. We need to discuss whether this is useful - it seems that the majority of WN software requirements are passed around by word of mouth etc. Should this be formalized? Please see

   https://www.gridpp.ac.uk/wiki/GridPP_approved_VOs#VO_Resource_Requirements

This table will be kept up to date with a regular process that syncs it with the CIC Portal, should it prove to be useful.

Tuesday 6th November

Do we need the Approved VOs document to set out the software needs for the VOs?

Tuesday 23rd October

KeyDocs monitoring status: Grid Storage(7/0) Documentation(3/0) On-duty coordination(3/0) Staged rollout(3/0) Ticket follow-up(3/0) Regional tools(3/0) Security(3/0) Monitoring(3/0) Accounting(3/0) Core Grid services(3/0) Wider VO issues(3/0) Grid interoperation(3/0) Cluster Management(1/0) (brackets show total/missing)

Thursday 26th July

All the "site update pages" have been reconfigured from a topic oriented structure into a site oriented structure. This is available to view at https://www.gridpp.ac.uk/wiki/Separate_Site_Status_Pages#Site_Specific_Pages

Please do not edit these pages yet - any changes would be lost when we refine the template. Any comments gratefully received, contact: sjones@hep.ph.liv.ac.uk

Interoperation - EGI ops agendas

Tuesday 18th December

Update coming from today's meeting....!

Monday 3rd December

There was an EGI ops meeting https://wiki.egi.eu/wiki/Agenda-03-12-2012.

Monday 5th November

There was an EGI ops meeting today.
UMD 2.3.0 in preparation. Release due 19 November, freeze date 12 November.
EMI-2 updates: DPM/LFC and VOMS - bugfixes, and glue 2.0 in DPM.
EGI have a list of sites considered unresponsive or having insufficient plans for the middleware migration. The one UK site mentioned has today updated their ticket again with further information.
In general an upgrade plan cannot extend beyond the end of 2012.
A dCache probe was rolled into production yesterday; alarms should appear in the next 24 hours on the security dashboard.
CSIRT is taking over from COD on migration ticketing. By next Monday the NGIs with problematic sites will be asked to contact the sites, asking them to register a downtime for their unsupported services.

Problems with WMS in EMI-2 (update 4) - WMS version 3.4.0. Basically, it can get proxy interaction with MyProxy a bit wrong. The detail is at GGUS 87802, and there exist a couple of workarounds.

gLite support calendar.

Monitoring - Links MyWLCG

Monday 2nd July

DC has almost finished an initial ranking. This will be reviewed by AF/JC and discussed at 10th July ops meeting

Wednesday 6th June

Ranking continues. Plan to have a meeting in July to discuss good approaches to the plethora of monitoring available.

Current priority is ranking the tools available.

Glasgow dashboard now packaged and can be downloaded here.

On-duty - Dashboard ROD rota

Monday 21st January

Good week, with only a few downtimes and long lived alarms. All outstanding alarms are covered by tickets as of now.
As summarised in Daniela's handover from last week, several sites have red COD-level status because the tickets are more than a month old. This is due to the WNs not being upgraded for lack of the new tar-balls, which results in sec alarms being raised. Some details are in this ticket: https://ggus.eu/ws/ticket_info.php?ticket=90184

Tuesday 15th January

Main issue relates to COD tickets as mentioned last week.

Monday 7th January

Several sites are in the red due to the middleware tickets being older than 30 days. We got a COD ticket for this. Despite the ticket being filed as top priority, COD did not answer, so I reset the priority to something sensible. We aren't the only ones hit by this problem.
At the moment the security alerts don't seem to update on the dashboard - at least ceprod08 has not cleared all day.

Rollout Status WLCG Baseline

Tuesday 21st January

National overview page updated for WNs (15th January). Please check your site information!

References

The Staged Rollout pages (now separated into EMI1 & 2) and the page listing the deployed versions are extracted from the BDII, so they should all be reasonably up to date:
http://www.hep.ph.ic.ac.uk/~dbauer/grid/staged_rollout.html
http://www.hep.ph.ic.ac.uk/~dbauer/grid/staged_rollout_emi2.html
http://www.hep.ph.ic.ac.uk/~dbauer/grid/state_of_the_nation.html

Security - Incident Procedure Policies Rota

Monday 21st January

SHA-2 and glexec updates given at pre-GDB.
EUGridPMA agreed on timeline where SHA-2 becomes the default in August 2013. SHA-1 will still be requestable.

Tuesday 15th January

There has been a recent Java patch. Sites should consider rolling out updates to client machines (especially those containing user certificates). The most current version is "Version 7 Update 11".

Services - PerfSonar dashboard | GridPP VOMS

Tuesday 22nd January

VOMS changes
There was an LHCONE P2P workshop in December (summary). Do we have coverage?

Tuesday 20th November

Reminder for sites to add perfSONAR services in GOCDB.
VOMS upgraded at Manchester. No reported problems. Next step to do the replication to Oxford/Imperial.

Tickets

Monday 21st January, 14:45 GMT

53 tickets this week. Despite being back to my usual perky self I might have turned down the resolution of my scan over these tickets a bit too much and missed something. Let me know if that's the case.

Ticket Bunches:

DTEAM tickets: Lancaster, Durham and ECDF have tickets concerning dteam working at their site, although all are in progress.

NGS tickets: Glasgow, Bristol, Brunel, Durham and Manchester have tickets about the move of the ngs VO to our VOMS server.

Atlas Space Juggling: Oxford, RalPP, Liverpool, Glasgow and Lancaster have open tickets for the atlas move from groupdisk to datadisk. Ewan has posted some legitimate concerns about not being kept in the dark regarding dark data cleanup.

WN-Sec test: Durham, Glasgow, Imperial, ECDF & Lancaster have tickets for their out-of-date worker nodes

NGI

https://ggus.eu/ws/ticket_info.php?ticket=90451 (15/1) Core services grouping ticket. We should tend to this ticket. Assigned (15/1)

RALPP

https://ggus.eu/ws/ticket_info.php?ticket=90575 (18/1) Nagios failures on one of your CEs, maybe slipped under the radar? Still Assigned (21/1)

https://ggus.eu/ws/ticket_info.php?ticket=90040 (2/1) A CMS user hasn't replied to this ticket in a while, I'd be tempted to close it soon. Waiting for reply (4/1)

TIER 1

https://ggus.eu/ws/ticket_info.php?ticket=90528 (17/1) Sno+ jobs aren't being sent to Sheffield for some reason, Catalin has asked if they see the same with another WMS. Waiting for reply (17/1)

https://ggus.eu/ws/ticket_info.php?ticket=90151 (8/1) A request for WMS enablement by Neiss might have been mistaken for a request for resources at the Tier 1; Chris has tried to clear things up though. In progress (21/1)

SUSSEX

https://ggus.eu/ws/ticket_info.php?ticket=90518 (17/1) Nagios failures on a Sussex CE. Emyr has called on the help of Ewan and Chris today to conquer these glitches. In progress (17/1)

https://ggus.eu/ws/ticket_info.php?ticket=90239 (10/1) Similar for their SE. Would be nice to hear how today went. In Progress (15/1)

https://ggus.eu/ws/ticket_info.php?ticket=90236 (10/1) A ticket from atlas regarding problems at Sussex. In progress (21/1)

Bristol:

https://ggus.eu/ws/ticket_info.php?ticket=90328 (11/1) Bristol are publishing zero used space. Winnie's hard pressed to investigate with her time constraints. In progress (15/1)

RHUL:

https://ggus.eu/ws/ticket_info.php?ticket=90219 (9/1) RHUL are publishing negative space. As nothing seems out of place it's been suggested that this ticket be reassigned to the DPM chaps for support. In progress (11/1)

IC:

https://ggus.eu/ws/ticket_info.php?ticket=89468 (11/12/12) A fusion user was having proxy problems, but the un-reproducibility of the error and the user silence suggests that this ticket can be put to bed. In progress (8/1)

Tools - MyEGI Nagios

Tuesday 13th November

Noticed two issues during the Tier1 power cut. The SRM and direct CREAM submission tests use the top BDII defined in the Nagios configuration to query about the resource. These tests started to fail because the RAL top BDII was not accessible. The configuration doesn't use BDII_LIST, so I cannot define more than one BDII. I am looking into how to make it more robust.
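One general way to make such a lookup more robust is to try a list of top BDIIs in order and use the first that answers. The sketch below shows only that idea; it is not how SAM-Nagios is implemented. The function name, the injected `query` callable and the hostnames in the usage example are all hypothetical.

```python
def query_with_fallback(endpoints, query):
    """Try each endpoint in order and return (endpoint, result) from the
    first one that answers.

    `query` is any callable taking an endpoint and raising OSError when
    the endpoint is unreachable (an assumption made for this sketch).
    """
    errors = {}
    for endpoint in endpoints:
        try:
            return endpoint, query(endpoint)
        except OSError as exc:
            # Remember the failure and move on to the next endpoint.
            errors[endpoint] = exc
    raise ConnectionError("no BDII reachable: %r" % errors)


if __name__ == "__main__":
    # Hypothetical hosts; the first is treated as down.
    def fake_query(endpoint):
        if endpoint == "bdii-a.example.org":
            raise OSError("unreachable")
        return ["resource-1"]

    print(query_with_fallback(
        ["bdii-a.example.org", "bdii-b.example.org"], fake_query))
```

Passing the query in as a callable keeps the fallback logic separate from whatever client actually talks to the BDII.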

The Nagios web interface was not accessible to a few users because GOCDB was down. It is a bug in SAM-nagios and I have opened a ticket.

Site availability has not been affected by this issue, because Nagios sends a warning alert when it cannot find a resource through the BDII.

I have applied a patch to fix nagios jobs segfaulting on SL6 WN's. https://tomtools.cern.ch/jira/browse/SAM-2999

VOs - GridPP VOMS VO IDs Approved VO table

Monday 14 January 2013

NGS VOMS server: Please enable GridPP VOMS server
- Some sites have enabled the GridPP VOMS server, 7 sites have issues. https://ggus.eu/ws/ticket_info.php?ticket=90356 is a parent ticket for this

Neiss.org.uk
- Now have VO-ID card in operations-portal (previously CIC portal)
- GridPP/NGS VOMSs server issues
- NGS WMS hadn't enabled current CEs at QMUL and Lancs, so I've requested the GridPP WMSs enable it - as the VO is supported on GridPP sites.
- Would be a good use case for SARONGS - but they don't have the time to debug this.

T2K.org - lots of issues
- Have started a round of MC production
- Nagios now available (Thanks Kashif)- can sites fix issues: https://t2wlcgnagios.physics.ox.ac.uk/nagios/cgi-bin/status.cgi?servicegroup=VO_t2k.org&style=detail
- Lots of job failures for various reasons - including "Cannot move ISB" - seen at a number of sites.
- Reporting that proxies don't renew (CJW has tried to reproduce this and failed - proxies seem to be renewing)

Tuesday 9 January 2013

Please can VOs report publications (there's now a section on the Quarterly report for them)
Spring Cleaning of VO support
- GridPP VOMS server support - affects ngs.ac.uk and neiss.org.uk - expect tickets soon.
- Removal of VOs no longer in use - totalep and some others.

Neiss.org.uk: NGS WMS hasn't enabled QMUL and Lancs CEs. Should we support it on the GridPP WMSs
T2k FTS transfers slowed due to copying files that already exist - T2k script now more robust.

Mon 17th December

Sites are reminded that VOs supported by the NGS VOMS server are moving to the GridPP VOMS server. https://www.gridpp.ac.uk/wiki/GridPP_approved_VOs should contain details of the new yaim config.
T2K.org now monitored by VO Nagios - https://t2wlcgnagios.physics.ox.ac.uk/nagios/cgi-bin/status.cgi?servicegroup=VO_t2k.org&style=detail - still experimental, but thanks Kashif

Tue 4th December

https://www.gridpp.ac.uk/wiki/WebDAV - please fill in your site status

Thursday 29 November

https://ggus.eu/ws/ticket_info.php?ticket=89035 filed - request that the information system reports space used by VOs

Tuesday 27 November

VOs supported at sites page updated
- now lists number of sites supporting a VO, and number of VOs supported by a site.
- Linked to by Steve Lloyd's pages

Tuesday 23 October

A local user is wanting to get on the grid and wants to set up his own UI. Do we have instructions?

Site Updates

Tuesday 22nd January

Site updates at Tuesday's meeting.

Meeting Summaries

Project Management Board - Members Minutes Quarterly Reports

Monday 1st October

ELC work

Tuesday 25th September

Reviewing pledges.
Q2 2012 review
Clouds and DIRAC

GridPP ops meeting - Agendas Actions Core Tasks

Tuesday 21st August - link Agenda Minutes

TBC

RAL Tier-1 Experiment Liaison Meeting (Wednesday 13:30) Agenda EVO meeting

Wednesday 23rd January

Operations report
Problem of CRL updates at CERN on Saturday (19 Jan) caused problems for our services.
There was a presentation about Castor 2.1.13 - both what is in this upgrade and initial plans for roll-out.
The first SL6/EMI-2 Top-BDII node is now in production. This is the start of a transparent rolling upgrade to the Top-BDII.
Tests with FTS3 are ongoing.
The Tier1 team is checking whether any VO requires the AFS client on the worker nodes.

WLCG Grid Deployment Board - Agendas MB agendas

October meeting Wednesday 10th October

NGI UK - Homepage CA

Wednesday 22nd August

Operationally few changes - VOMS and Nagios changes on hold due to holidays
Upcoming meetings Digital Research 2012 and the EGI Technical Forum. UK NGI presence at both.
The NGS is rebranding to NES (National e-Infrastructure Service)
EGI is looking at options to become a European Research Infrastructure Consortium (ERIC). (Background document.)
Next meeting is on Friday 14th September at 13:00.

Events

WLCG workshop - 19th-20th May (NY) Information

CHEP 2012 - 21st-25th May (NY) Agenda

GridPP29 - 26th-27th September (Oxford)

UK ATLAS - Shifter view News & Links

Thursday 21st June

Over the last few months ATLAS have been testing their job recovery mechanism at RAL and a few other sites. This is something that was 'implemented' before but never really worked properly. It now appears to be working well, allowing jobs to finish even if the SE is down or unstable when the job finishes.

Job recovery works by writing the output of the job to a directory on the WN should it fail when writing the output to the SE. Subsequent pilots will check this directory and try again for a period of 3 hours. If you would like to have job recovery activated at your site, you need to create a directory which (ATLAS) jobs can write to. I would also suggest that this directory has some form of tmpwatch enabled on it which clears up files and directories older than 48 hours. Evidence from RAL suggests that it's normally only 1 or 2 jobs that are ever written to the space at a time, and the space used is normally less than a GB. I have not observed more than 10GB being used. Once you have created this space, please email atlas-support-cloud-uk at cern.ch with the directory (and your site!) and we can add it to the ATLAS configurations. We can switch off job recovery at any time if it does cause a problem at your site. Job recovery would only be used for production jobs, as users complain if they have to wait a few hours for things to retry (even if it would save them time overall...)
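The 48-hour cleanup suggested above could be done with a tmpwatch-style tool or a small script. The following is a minimal sketch of such a script, assuming age is judged by modification time; the function names are illustrative and the directory is whatever path the site chooses.

```python
import os
import time

MAX_AGE_SECONDS = 48 * 3600  # the 48-hour retention suggested in the text


def stale_entries(root, now=None, max_age=MAX_AGE_SECONDS):
    """Return paths under root whose modification time is older than max_age.

    Walks bottom-up so that files are listed before their parent directory.
    """
    now = time.time() if now is None else now
    stale = []
    for dirpath, dirnames, filenames in os.walk(root, topdown=False):
        for name in filenames + dirnames:
            path = os.path.join(dirpath, name)
            try:
                if now - os.lstat(path).st_mtime > max_age:
                    stale.append(path)
            except OSError:
                continue  # entry vanished while we were scanning
    return stale


def clean(root, now=None, max_age=MAX_AGE_SECONDS):
    """Remove stale files and stale empty directories; return what was removed."""
    removed = []
    for path in stale_entries(root, now, max_age):
        try:
            if os.path.isdir(path) and not os.path.islink(path):
                os.rmdir(path)  # succeeds only if the directory is empty
            else:
                os.remove(path)
            removed.append(path)
        except OSError:
            pass  # e.g. directory still holds fresh files; leave it alone
    return removed
```

Run periodically (for example from cron) against the job recovery directory, this leaves anything written in the last 48 hours in place, so output still within the 3-hour retry window is never touched.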

UK CMS

Tuesday 24th April

Brunel will be trialling CVMFS this week, will be interesting. RALPP doing OK with it.

UK LHCb

Tuesday 24th April

Things are running smoothly. We are going to run a few small scale tests of new codes. This will also run at T2, one UK T2 involved. Then we will soon launch new reprocessing of all data from this year. CVMFS update from last week; fixes cache corruption on WNs.

UK OTHER

N/A

To note

N/A
