Operations Bulletin 180612

Bulletin archive

Week commencing 11th June 2012

Task Areas

General updates

Tuesday 12th June

There is a WLCG pre-GDB meeting on WN security - the agenda is here. Vidyo connection available.

The next GDB is this Wednesday. The agenda is here. Vidyo connection available.

Monday 11th June

T2 reliability and availability results for May 2012.
EGI querying whether NGIs/institutes host local UMD repositories.

Friday 8th June

Mingchao leaves his role as security officer today. Thanks to him for many years of contributions. His role will be covered by members of the ops team - recruitment of a new security officer is underway.

Thursday 31st May

Please note that the Technical Evolution Strategy documents have been moved within the WLCG Document Repository. The folder is now under this area.

Check site gstat values

Tier-1 - Status Page

Tuesday 12th June

A new version of the Castor Information Provider (CIP) was rolled out on Wednesday 6th June.
A problem starting the diesel generator, discovered during a test on 6th June, was resolved the next day.
Atlas Castor instance switched to use Transfer Manager on Thursday 7th June.
Failure of a power distribution unit caused problems for a network switch and some of the BDII nodes on the afternoon of Thursday 7th June.
A problem with one of the FTS agent nodes during the morning of Friday 8th June disrupted file transfers, particularly to/from the RAL Tier1.
Problem with DNS lookups for nodes at Fermilab investigated and fixed during the morning of Friday 8th June.
There was a short break in our network connectivity, lasting about 15 minutes, at around 17:30 on Friday 8th June.
Castor 2.1.11-9 update and LFC/FTS database Oracle 11 updates scheduled for Wednesday 13th June.
WMS01 unavailable Thu 14 to Wed 20 June for drain and re-installation.
Site unavailable for networking upgrade on Tuesday morning 19th June.

Storage & Data Management - Agendas/Minutes

Wednesday 6 June 2012 - we are still digesting CHEP information (see also the blog), plus some of the usual operational and upgrade work. Hoping to find a few spare clock cycles for some slightly more experimental stuff.

Wednesday 23 May 2012 - lots of exciting stuff at CHEP, we have about five things in, some posters, some oral.

Accounting - UK Grid Metrics HEPSPEC06

Wednesday 6th June - Core-ops

Request sites to publish HS06 figures from new kit to this page.

Please would all sites check the HS06 numbers they publish. Will review in detail on 26th June.

Friday 11th May - HEPSYSMAN

Discussion on HS06 reminding sites to publish using results from 32-bit mode benchmarking. A reminder for new kit results to be posted to the HS06 wiki page. See also the blog article by Pete Gronbech. The HEPiX guidelines for running the benchmark tests are at this link.

Friday 4th May - TB-SUPPORT

There is a long-standing problem that gstat overcounts things like RunningJobs and TotalJobs if there are multiple CE nodes feeding the same set of queues. The gstat developers have just commented on the ticket for that, indicating that they have done something to remove the duplicates.

Check using: http://gstat2.grid.sinica.edu.tw/gstat/summary/Country/UK/
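The overcounting described above can be illustrated with a minimal sketch. It assumes hypothetical records of the form (CE host, batch cluster, queue, RunningJobs, TotalJobs); the record layout, the host names and the idea of keying on the (cluster, queue) pair are assumptions for illustration only, not gstat's data model or the developers' actual fix.

```python
from collections import namedtuple

# Hypothetical record: one entry per (CE, queue), as each CE might publish it.
QueueRecord = namedtuple("QueueRecord", "ce cluster queue running total")


def naive_totals(records):
    """Sum every published record; queues shared by several CEs are counted repeatedly."""
    return (sum(r.running for r in records), sum(r.total for r in records))


def deduplicated_totals(records):
    """Count each (cluster, queue) pair once, however many CEs publish it."""
    seen = {}
    for r in records:
        # Keep the first record seen for each shared queue (an arbitrary choice).
        seen.setdefault((r.cluster, r.queue), r)
    return (sum(r.running for r in seen.values()),
            sum(r.total for r in seen.values()))


# Two CEs feeding the same "long" queue, plus one "short" queue.
records = [
    QueueRecord("ce01.example.ac.uk", "batch.example.ac.uk", "long", 100, 150),
    QueueRecord("ce02.example.ac.uk", "batch.example.ac.uk", "long", 100, 150),
    QueueRecord("ce01.example.ac.uk", "batch.example.ac.uk", "short", 20, 25),
]

print(naive_totals(records))         # (220, 325)
print(deduplicated_totals(records))  # (120, 175)
```

If the totals shown for a site with several CEs look closer to the first figure than the second, the duplicates are probably still being counted.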

Documentation - KeyDocs

Wednesday, 6th June

Released a document, hep.ph.liv.ac.uk/~sjones/VomsSnooper.odt, that describes how to

Maintain site VOMS info document for the approved VOs
Check a site's VOMS records correspond exactly with CIC portal
Create new site VOMS records direct from CIC portal, without manual transcription

Note: I'm accepting tips from GridPP core task members etc. about other use cases for these processes.

This will be converted to wiki format and made available in the normal way. Next jobs:

Review the logic/sequence of the VOMS admin process, document it if it works, fix it if it doesn't.
Create a standard baseline for the proxy renewal process, and write it up in the wiki.

Note: I'm accepting tips from other GridPP core team members etc. for document priorities. Please think about where the problems lie (i.e. what costs us yet is easy to fix) and get back to me.

Tuesday, 29th May

VOMS records in the GridPP Approved VO list are now up to date with the CIC Portal XML. Site admins can use this list to ensure their site-info.def/vo.d directories are up to date. A tool, SidFormatter, will be released this week to facilitate comparison with the benchmark. A process has been devised to ensure that the GridPP Approved VO list is kept up to date to within a week of CIC Portal changes. There will be a consultation about further fields that we may wish to advertise in this manner.
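The kind of comparison a site admin might make against the benchmark can be sketched as below. This is not SidFormatter or VomsSnooper: it assumes the site's records and the reference records have already been parsed into dictionaries mapping a VO name to a set of (host, port, DN) tuples, and all VO names, hosts and DNs shown are made up for illustration.

```python
def compare_voms_records(site, reference):
    """Compare two {vo_name: set of (host, port, dn)} mappings.

    Returns three sorted lists: VOs missing at the site, VOs the site has
    that the reference lacks, and VOs present in both with differing records.
    """
    missing = sorted(set(reference) - set(site))
    extra = sorted(set(site) - set(reference))
    differing = sorted(
        vo for vo in set(site) & set(reference)
        if set(site[vo]) != set(reference[vo])
    )
    return missing, extra, differing


# Hypothetical reference data (standing in for the approved VO list).
reference = {
    "vo.alpha.example": {("voms1.example.org", 15001, "/DC=org/DC=example/CN=voms1.example.org")},
    "vo.beta.example": {("voms2.example.org", 15002, "/DC=org/DC=example/CN=voms2.example.org")},
    "vo.gamma.example": {("voms3.example.org", 15003, "/DC=org/DC=example/CN=voms3.example.org")},
}

# Hypothetical site data: one stale DN, one VO missing, one unexpected VO.
site = {
    "vo.alpha.example": {("voms1.example.org", 15001, "/DC=org/DC=example/CN=voms1.example.org")},
    "vo.beta.example": {("voms2.example.org", 15002, "/DC=org/DC=example/CN=old-voms2.example.org")},
    "vo.delta.example": {("voms4.example.org", 15004, "/DC=org/DC=example/CN=voms4.example.org")},
}

missing, extra, differing = compare_voms_records(site, reference)
print("Missing at site:", missing)
print("Not in reference:", extra)
print("Records differ:", differing)
```

Using sets means the order in which servers are listed does not matter, only whether the same (host, port, DN) entries appear on both sides.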

Friday 27th April

Appeal for a volunteer to enhance "Grid User Crash Course" (https://www.gridpp.ac.uk/wiki/Grid_user_crash_course) with a simple use case for dependable proxy renewal for long jobs, as this is a recurrent requirement that has caused multiple queries on TB-SUPPORT.

Interoperation - EGI ops agendas

Friday 8th June - EGI ops agenda

Sites not publishing UserDNs. There are a number of reasons: site policy; wrong configuration; and a bug that the APEL team will contact affected sites about (plus another that doesn't apply here, affecting sites that publish via an aggregator). A list of sites publishing has been compiled:

UK sites not publishing UserDNs: UKI-LT2-Brunel; UKI-NORTHGRID-MAN-HEP; EFDA-JET; UKI-LT2-UCL-HEP; UKI-SCOTGRID-ECDF; UKI-SOUTHGRID-BRIS-HEP; UKI-SOUTHGRID-CAM-HEP

RHUL and RAL are counting slightly below 100% - I would expect that's probably because it's recently turned on, although there could be other issues lurking behind that stat (i.e. could also be recently turned off). EGI want a view on the various reasons for this. Suspect that this will be down to one of three reasons: Policy; Plan to but not had time yet; or some technical problem stopping it. Let SP know.

Monitoring - Links MyWLCG

Wednesday 6th June

Ranking continues. Plan to have a meeting in July to discuss good approaches to the plethora of monitoring available.

Current priority is ranking the tools available.

Glasgow dashboard now packaged and can be downloaded here.

On-duty - Dashboard ROD Rota

Monday 11th June - KM

Very busy week. A lot of sites were failing CAdist tests but all of them updated after being ticketed. Durham suffered another power cut and cooling failure and is still failing tests; a ticket is already open. Manchester is failing SE and CE tests. There are two open tickets against Manchester, with no update from the site.

Friday 25th May - AM

Several sites have downtimes, and a few raised non-trivial alarms during the week. All are ok again now or still in downtime, with Durham having the ce01 and se02 tickets although it's in an unplanned downtime now.

Glasgow had problems with MPI alarms even though MPI shouldn't have been advertised, and these alarms were eventually closed still at critical on the Dashboard along with the corresponding ticket (the Nagios pages for the nodes themselves were back to all green due to the removal of the MPI endpoints.)

Rollout Status WLCG Baseline

Monday 11th June

EMI2 is released but not in Staged Rollout yet. Buyers beware.

Thursday 10th May

The CREAM CE and the WMS, which were released at the end of April, have finally gone into Staged Rollout.
Call for more sites to take part in EMI-2 rollout tests.
The overall SR contributions are in this table.

Friday 27th April

Updated version information on rollout page
WN scan indicates some sites not keen on OS updates to those nodes.

Security - Incident Procedure Policies

Wednesday 6th June

The GridPP security team have begun operating a weekly "on duty" rota to provide cover until a replacement for Mingchao is appointed.

UKNGI will not be contributing to the EGI security officer duty rota for the time being.

Tuesday 15th May

The next NGI/GridPP security team meeting is on Wednesday 16th.

Main discussion is on the rota and handover tasks.

Wednesday 2nd May

Setting up a security on-duty rota - to improve the team and to cover the period after Mingchao leaves. Alessandra, Linda, Rob and Ewan are involved.

Currently reviewing the overall security task.

SSC5 preparations to start soon.

Services - PerfSonar dashboard

Wednesday 6th June

Plan to have 4 more sites perfsonar enabled by end of June.
Testing matrix to be confirmed.
Will survey sites looking for DRI kit deployment issues.

23rd April: requested network utilisation figures for March and April.
LHCONE meeting last week in Stockholm.
Agreed to focus on perfsonar.

Tickets

Monday 11th of June, 13:00 BST. 29 open UK tickets this week.

Over the last week there have been a few queries, at the Ops-team meeting and in tickets themselves, concerning the correct procedure for involving other sites/support units. I hope to have something solid on this for next week.

QMUL/NGI https://ggus.eu/ws/ticket_info.php?ticket=83020 Reliability/availability report ticket for QMUL. Jeremy's already notified the site but worth mentioning here too.

DURHAM/NGI https://ggus.eu/ws/ticket_info.php?ticket=83006 Same for Durham. update - Durham replied, citing the infrastructure issues that they've had.

NGI https://ggus.eu/ws/ticket_info.php?ticket=82491 Jeremy's ticket to the VOMS chaps. Looks like it can be closed.

https://ggus.eu/ws/ticket_info.php?ticket=82492 Chris' ticket on the subject, can be closed or reassigned to the voms developers as an RFE.

SNO+ https://ggus.eu/ws/ticket_info.php?ticket=82671 https://ggus.eu/ws/ticket_info.php?ticket=82670 Have been seeing problems retrieving output from desdemona.zih.tu-dresden.de on both the Glasgow and Imperial WMSs. Likely a problem at the other end, but these tickets threaten to bounce around.

IC https://ggus.eu/ws/ticket_info.php?ticket=82946 Interesting ticket - despite having cvmfs installed IC seem to be missing a release. This ticket is tracking the investigation.

QMUL https://ggus.eu/ws/ticket_info.php?ticket=82842 https://ggus.eu/ws/ticket_info.php?ticket=82891 Chris has been seeing jobs dying with "Cancelled by CE admin" errors. This prompted his mail to TB-SUPPORT today (where he saw lots of files in /opt/glite/tmp that needed clearing out). Affecting biomed & hone jobs. Daniela suggests a full service restart and cites a recently discovered problem in ticket 82891. update - Chris has moved non-lhc VOs off of the affected CE which seems to have calmed the problem.

DURHAM https://ggus.eu/ws/ticket_info.php?ticket=82818 lhcb pilots seem to be dying due to cvmfs problems - although fixing it may be a low priority due to ongoing power troubles at the site. update - Durham are currently having a bad time of it, this will get looked at in due course.

RALPP https://ggus.eu/ws/ticket_info.php?ticket=82739 heplnx206.pp.rl.ac.uk not working for biomed, looks like information system errors (or the SE isn't actually for biomed's use). Ticket is looking a little neglected. update - Chris has found the error to be caused by the "fix" to another ticket - (75960), removing the fix cured the problem

CAMBRIDGE https://ggus.eu/ws/ticket_info.php?ticket=82296 Still no word from atlas if the problems have gone away. I suggest closing the ticket after leaving a quick description of your solution.

SOLVED CASE PILE https://ggus.eu/ws/ticket_info.php?ticket=82749 A ticket from the UK (to NGI_GRNET) concerning the problems seen during the SUSSEX certification (81784); all issues have been solved.

FROM THE UK Due to the glacial movement of many of the tickets I'm cataloging these tickets elsewhere (https://www.gridpp.ac.uk/wiki/Tickets_From_The_UK), only documenting significant changes or new problems here.

https://ggus.eu/ws/ticket_info.php?ticket=83133 The na62 FTS service at CERN doesn't appear to be switched on...

Update: Daniela has two interesting tickets: https://ggus.eu/ws/ticket_info.php?ticket=82746 Daniela spotted an error in the certificate handling for LB on SL6. The LB developers look like they've found the cause.

https://ggus.eu/tech/ticket_show.php?ticket=82448 The EMI-1 LB seems to have a habit of filling up /var/tmp with notifications when things aren't working as intended, tracked down to the glite-lb-notif-interlogd crashing. Investigation continues.

Tools - MyEGI Nagios

Wednesday 6th June

By end of June plan is to have backup Nagios available.

Smaller VO automated testing still waiting for a bug fix.

Sunday 8th April

Lancaster Nagios backup hardware has arrived

Nagios based VO testing setup for vo.southgrid.ac.uk. Nagios view is here. Tests run every 8 hours. The results are also fed back into the dashboard and alarms triggered.

There is a bug which prevents multiple VO Nagios tests with one Nagios instance. Developers informed.

VOs - GridPP VOMS VO IDs Approved

Wednesday 6th June

Cross-checking VOs enabled vs VO table.
Surveying VO-admins for problems faced in their VOs.

Site Updates

Tuesday 12th June

Sussex still in certification.

Meeting Summaries

Project Management Board - Members Minutes Quarterly Reports

Monday 11th June

NGS future planning
Options to make VOMS more resilient
Plans for Tier-1 review (to be held 20th June)

GridPP ops meeting - Agendas Actions Core Tasks

Tuesday 8th May - link Agenda Minutes

ATLAS DATADISK now being used for production input files
Deploy perfsonar-ps (for LHC compatibility), but run a Perfsonar-MDM portal to collate the information
Target date for perfsonar-ps at sites is the end of July
KeyDocs still not in place for all areas

Tuesday 1st May 2012 - Agenda Minutes

Trying the present bulletin as a conduit for information
7 sites below 90% in ATLAS monitoring!
Instructions for small VO space-token setup/usage requested
EMI WN tarball now available - comment in ticket
Not every site running latest SL5 OS

RAL Tier-1 Experiment Liaison Meeting (Wednesday 13:30) Agenda EVO meeting

Wednesday 6th June

All Castor instances switched to use the Transfer Manager (which replaces LSF) except Atlas which is scheduled for 7th June.
The following Tier1 interventions are announced (not yet in GOC DB):
- Wednesday 13th June: Castor 2.1.11-9 update. (Expect Castor down from morning to early afternoon).
- Wednesday 13th June: Oracle 11 update for LFC & FTS databases. (Expect these services down all working day).
- Tuesday 19th June: Replacement of RAL Site Access Router. (Expect external connectivity broken for up to 3 hours in morning.)
- Wednesday 27th June: Update Castor databases to Oracle 11. (Expect Castor down all working day).
Test of backup diesel generator unsuccessful (today, Wed 6th June) as it failed to start. Experts expected tomorrow to investigate. Site at an additional risk if power cut.
Useful discussion with NA62 about issues they are having getting going with using the site.
Operations report

WLCG Grid Deployment Board - Agendas MB agendas

Next meeting Wednesday 13th June

NGI UK - Homepage CA

Monday 14th May

CA: Retiring 2007 CA cert: it stopped signing last October and now *must* vanish from IGTF at the end of Sep release.
CA: We won't be able to meet IGTF targets of 1st October 2012 for IPv6 support for our CRLs, Networks suggest even March 2013 might be optimistic.
Services: Leeds to deploy a clean WLCG Nagios development platform with help from Kashif
UKIROC decommissioning in progress
NGS Ops team to reconsider resilience of core services following recent network outages at RAL

Friday 9th March NGS-CA-TAG meeting

Priorities discussion for the CA. Plans to be clarified for future meeting.
Mathew Dovey has taken over from Neil Geddes as EGI Council and EB chair.

Email addresses now removed from certificates.

Events

WLCG workshop - 19th-20th May (NY) Information

CHEP 2012 - 21st-25th May (NY) Agenda

UK ATLAS - Shifter view News & Links

Wednesday 2nd May

Testing glideinWMS but some problems spotted

Tuesday 24th April

ATLAS started to run reco jobs at T2s more extensively. These require bigger input data sets, which should be copied to the DATADISK space token. Some are being copied to PRODDISK. If you see PRODDISK filling up you should take action.

Thursday 3rd April

ECDF has a full data disk. Every time this happens the site is automatically blacklisted and cleaned up by ATLAS; however, this automatic process results in a few thousand failed transfers. There is also a concern that these failing transfers are potentially creating dark data. Wahid Bhimji is contacting ATLAS DDM operations to find out if these problems can be fixed. We are concerned that this may be a first for ATLAS as ECDF has a small amount of disk compared to its available CPU.

A disk server at Oxford has failed with the loss of a significant fraction of data. Users are noticing their jobs are failing at Oxford. We are going to proceed with the lost file recovery as soon as possible although are waiting for confirmation from the site. (https://ggus.eu/ws/ticket_info.php?ticket=81788)

The next meeting will be 17th May because of HEP Sysman being held at RAL on the 10/11th May.

UK CMS

Tuesday 24th April

Brunel will be trialling CVMFS this week, will be interesting. RALPP doing OK with it.

UK LHCb

Tuesday 24th April

Things are running smoothly. We are going to run a few small scale tests of new codes. This will also run at T2, one UK T2 involved. Then we will soon launch new reprocessing of all data from this year. CVMFS update from last week; fixes cache corruption on WNs.

UK OTHER

Tuesday 24th April

T2K are having problems with WMS proxy renewal, and some WMSes are advertising support but don't actually work. Could be user error, but will need investigation.

Requests

More sites needed to test EMI-2
