Operations Bulletin 121112

From GridPP Wiki
Jump to: navigation, search

Bulletin archive


Week commencing 5th November 2012
Task Areas
General updates

Tuesday 6th November

  • The VOMRS validator script led to confusing emails to LHC VO members (VOMRS checks the user and CA DNs, and some old references had not been removed)
  • Some progress is being made with the WN tarballs
  • HEPSYSMAN is on Friday.
  • Please take a look at and comment on the WLCG Software Life Cycle Process beyond EMI document.
  • The current status of site upgrade plans is on Alessandra's summary page.


Monday 29th October

Tuesday 23rd October

  • Site contact lists have been added to the list wlcg-operations at cern.ch to improve the transmission of WLCG ops information. There will be an attempt to minimise 'noise'. Please let Jeremy know of any concerns to be fed back to the WLCG ops team.
  • The status of EMI releases was last given in this update.
  • When decommissioning services please remember to check the EGI procedures for steps.
  • Alessandra created this useful table to collate plans. Thank you to everyone for keeping it updated! Just 3 sites have yet to share their plans....
  • The ROD team have been given extended access to the Security Dashboard reports to help chase sites on middleware issues spotted by EGI.
  • It has been proposed that from 1st November, lcg-ce tests are removed from the three SAM profiles: ROC [1], ROC_OPERATORS [2] (the list of tests generating alarms into the operations portal), and ROC_CRITICAL [3] (the list of tests generating results that are taken into account for Availability/Reliability reports). This should not affect any UK sites as none are among the 29 services found by EGI as still active on gLite 3.1.
  • The Nagios tests need updating for SL6 as some issues have been found.
  • The final WLCG Tier-2 availability/reliability report for September 2012 is now available.


Tier-1 - Status Page

Tuesday 6th November

  • Castor GEN instance upgraded to version 2.1.12 last Tuesday (30th). This completes the Castor 2.1.12 upgrade.
  • On Sunday (4th Nov) there was a ~six hour outage of the Atlas Castor instance caused by a problem with the database behind the SRM.
  • There has been a problem on the OPN link: from Sunday morning to Monday we were running with packets going one way over the primary link and the other way over the backup link. The problem was partly resolved on Monday (by using the backup link in both directions) but is not yet fully fixed.
  • There will be an outage of the site next Tuesday (13th Nov) so that a failing board in a router can be replaced. We will make use of this time to do some other work. Expect batch & Castor to be down for around an hour (TBC).
  • We have seen repeated SUM test failures from the experiment VOs (failure to connect to SRMs). These correlate with SUM test failures across the whole of the UK and wider.
  • Investigating some asymmetric data rates seen to remote sites.
  • Planning to imminently roll out over-commit of batch jobs to make use of hyperthreading.
  • Upgrade to EMI-2 SL5 WNs in final testing / preparation.
  • A test instance of FTS version 3 is now available. Non-LHC VOs that use the existing service have been enabled on it, and we are looking for one of these VOs to test it.
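The over-commit mentioned above is conventionally done in Torque by declaring more virtual processors (np) than physical cores in the server's nodes file; a hypothetical fragment (hostnames and counts are illustrative, not the Tier-1's actual configuration):

```text
# /var/spool/torque/server_priv/nodes -- illustrative only.
# A 16-core node with hyperthreading declared with np greater than
# the physical core count, e.g. 1.5x over-commit:
wn001.example.ac.uk np=24
wn002.example.ac.uk np=24
```

The right over-commit ratio depends on the job mix; memory per job slot usually becomes the limiting factor before CPU does.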

Storage & Data Management - Agendas/Minutes

Wednesday 10th October

  • DPM EMI upgrades:
    • 9 sites need to upgrade from gLite 3.2
  • QMUL asking for FTS settings to be increased to fully test Network link.
  • Initial discussion on how Brunel might upgrade its SE and decommission its old SE
  • Classic SE support, both for new SEs and the plan to remove the current publishing of the classic SE endpoint


Accounting - UK Grid Metrics HEPSPEC06 Atlas Dashboard HS06

Tuesday 30th October

  • Storage availability in the SL pages has been affected by a number of sites being asked by ATLAS to retire the ATLASGROUPDISK space token while the SUM tests were still testing it as critical. The availability will be corrected manually once the month ends. Sites affected to different degrees are RHUL, CAM, BHAM, SHEF and MAN.

Friday 28th September

  • Tier-2 pledges to WLCG will be made shortly. The situation is fine unless there are significant equipment retirements coming up.
  • See Steve Lloyd's GridPP29 talk for the latest on the GridPP accounting.

Wednesday 6th September

  • Sites should check the atlas page reporting the HS06 coefficients because, according to the latest statement from Steve, that is what is going to be used. Atlas Dashboard coefficients are averages over time.

I am going to suggest using the ATLAS production and analysis numbers given in hs06 directly rather than use cpu secs and try and convert them ourselves as we have been doing. There doesn't seem to be any robust way of doing it any more and so we may as well use ATLAS numbers which are the ones they are checking against pledges etc anyway. If the conversion factors are wrong then we should get them fixed in our BDIIs. No doubt there will be a lively debate at GridPP29!

Documentation - KeyDocs

Tuesday 6th November

  • Do we need the Approved VOs document that sets out the software needs for the VOs?

Tuesday 23rd October

KeyDocs monitoring status: Grid Storage(7/0) Documentation(3/0) On-duty coordination(3/0) Staged rollout(3/0) Ticket follow-up(3/0) Regional tools(3/0) Security(3/0) Monitoring(3/0) Accounting(3/0) Core Grid services(3/0) Wider VO issues(3/0) Grid interoperation(3/0) Cluster Management(1/0) (brackets show total/missing)

Thursday 26th July

All the "site update pages" have been reconfigured from a topic oriented structure into a site oriented structure. This is available to view at https://www.gridpp.ac.uk/wiki/Separate_Site_Status_Pages#Site_Specific_Pages

Please do not edit these pages yet - any changes would be lost when we refine the template. Any comments gratefully received, contact: sjones@hep.ph.liv.ac.uk

Interoperation - EGI ops agendas

Monday 5th November

  • There was an EGI ops meeting today.
  • UMD 2.3.0 in preparation. Release due 19 November, freeze date 12 November.
  • EMI-2 updates: DPM/LFC and VOMS - bugfixes, and glue 2.0 in DPM.
  • EGI have a list of sites considered unresponsive or having insufficient plans for the middleware migration. The one UK site mentioned has today updated their ticket again with further information.
  • In general an upgrade plan cannot extend after the end of 2012.
  • A dCache probe was rolled into production yesterday; alarms should appear on the security dashboard within the next 24 hours.
  • CSIRT is taking over from COD on migration ticketing. By next Monday the NGIs with problematic sites will be asked to contact the sites, asking them to register a downtime for their unsupported services.
  • Problems with WMS in EMI-2 (update 4) - WMS version 3.4.0. Basically, it can get proxy interaction with MyProxy a bit wrong. The detail is at GGUS 87802, and there exist a couple of workarounds.


Monday 8th October

  • COD are about to launch monitoring tickets for 'out of support' services (i.e. gLite 3.2), for removal by the end of the month. (They seem to have missed some gLite 3.2 CREAM CEs, however - we need to make sure we don't.)
  • EMI updates. EMI-2, expected today or so.

DPM 1.8.4 (yay! but let it filter through staged rollout a bit...); LB and WMS 3.4 (both with security updates); UI and WN (including 32-bit libs and a few other dependencies).

  • Tarballs were raised: Tiziana noted the need for an EMI-2 tarball before the gLite 3.2 one is retired.
  • Staged Rollout. The ARC 2.0.0 clients are in the production repositories due to the emi-ui being in production. (Don't think that affects anyone in the UK).
  • Released today: BDII Core and GFAL/lcgUtils.
  • Products in staged rollout: WMS 3.3.8; CREAM 1.13.4 (due to a mismatch between EMI and UMD versions, this is 1.13.5 in UMD)
  • It's been noted that there are a number of products without early adopters (EAs) in EMI-2: EMIR, Pseudonymity, WNoDeS, GridSAM and OGSA-DAI. These will not be included in UMD unless there's an EA and demand from NGIs. There are also a few with no EA in EMI-2 but with one in EMI-1, and these are expected to move to EMI-2 at some point: CLUSTER, CREAM-LSF. (VOMS was listed, but its EA was present and pointed out that they are on EMI-2.)
  • Unsupported services on 8th October EGI list.


Monitoring - Links MyWLCG

Monday 2nd July

  • DC has almost finished an initial ranking. This will be reviewed by AF/JC and discussed at the 10th July ops meeting.

Wednesday 6th June

  • Ranking continues. Plan to have a meeting in July to discuss good approaches to the plethora of monitoring available.
  • Glasgow dashboard now packaged and can be downloaded here.

On-duty - Dashboard ROD rota

Monday 5th November

Friday 19th October

  • Many sites continue to have planned downtimes for EMI upgrades, with knock-on effects on other local services (e.g. SRM -> WN Rep alarms). Changeover back to the Oxford GridPP Nagios happened midweek, with some caching weirdness (reloading the dashboard produced one of two different sets of results!) which quickly went away.

Friday 12th October

  • There is a new ROD newsletter available from EGI.
  • As of this week, John Walsh will no longer be contributing to the ROD work. Many thanks to John for his input over the years!


Rollout Status WLCG Baseline

Tuesday 6th November

References


Security - Incident Procedure Policies Rota

Monday 22nd October

  • Last week's UK security activity was very much business as usual; there are a lot of alarms in the dashboard for UK sites, but for most of the week they only related to the gLite 3.2 retirement.

Friday 12th October

  • The main activity over the last week has been due to new Nagios tests for obsoleted glite middleware and classic SE instances. Most UK sites have alerts against them in the security dashboard and the COD has ticketed sites as appropriate. Several problems have been fixed already, though it seems that the dashboard is slow to notice the fixes.

Tuesday 25th September


Services - PerfSonar dashboard

Monday 5th November

  • perfSONAR service types are now defined in GOCDB.
  • Reminder that the gridpp VOMS will be upgraded next Wednesday.

Thursday 18th October

  • VOMS sub-group meeting on Thursday with David Wallom to discuss the NGS VOs. Approximately 20 will be supported on the GridPP VOMS. The intention is to go live with the combined (upgraded) VOMS on 14th November.
  • The Manchester-Oxford replication has been successfully tested. Imperial to test shortly.


Tickets

Monday 5th November 14:00 GMT

32 Open UK Tickets this week. It's the first Monday of the month, so we get to look at all of them. Have all the GGUS access problems experienced by atlas team members last week soothed themselves?

It's worth noting that a quarter of the open tickets concern networking/transfer-type problems.

  • UNSUPPORTED GLITE SOFTWARE TICKETS

Congratulations to those sites that closed their tickets. I suspect these will be gone over in greater detail, so again I'll just summarise them; we can look at each in the meeting if needed. All seem to be in hand, but my rule of thumb is: the more recent the update, the lesser the worry.

BRISTOL: https://ggus.eu/ws/ticket_info.php?ticket=87472 (17/10) In Progress (25/10)
CAMBRIDGE: https://ggus.eu/ws/ticket_info.php?ticket=87470 (17/10) In Progress (30/10)
BRUNEL: https://ggus.eu/ws/ticket_info.php?ticket=87469 (17/10) In Progress (30/10)
UCL: https://ggus.eu/ws/ticket_info.php?ticket=87468 (17/10) In Progress (1/11)
MANCHESTER: https://ggus.eu/ws/ticket_info.php?ticket=87467 (17/10) On Hold (24/10) In Progress (5/11)
SHEFFIELD: https://ggus.eu/ws/ticket_info.php?ticket=87466 (17/10) On Hold (31/10)
ECDF: https://ggus.eu/ws/ticket_info.php?ticket=87171 (10/10) In Progress (30/10) (5/11)
EFDA-JET: https://ggus.eu/ws/ticket_info.php?ticket=87169 (10/10) In Progress (31/10)

  • NGI/VOMS

https://ggus.eu/ws/ticket_info.php?ticket=87813 (25/10)
Migration of vo.helios-vo.eu to Manchester. The transfer was completed manually, and users were asked if things are okay. In Progress; I set it to "waiting for reply" today. (30/10) David indicates it works and will now test with WMS/CE (5/11)

  • TIER 1

https://ggus.eu/ws/ticket_info.php?ticket=88112 (3/11)
Slow atlas transfers, found to be caused by database problems. The problems have been fixed, the atlas instance restarted and data is flowing once more. Waiting for the thumbs up from atlas. Waiting for reply (5/11)

https://ggus.eu/ws/ticket_info.php?ticket=86690 (3/10)
t2k are missing JPKEKCRC02 FTS ganglia metrics. There were some problems with the rrd files which meant they had to be deleted; hopefully that will fix the plots. Things look better to my eyes; In Progress, can be set to waiting for reply/solved (31/10) t2k give the thumbs up, it seems okay to them now.

https://ggus.eu/ws/ticket_info.php?ticket=86152 (17/9)
Packet loss on the RAL perfsonar. This is being taken under the wing of wider network investigations at RAL. On hold (31/10)

https://ggus.eu/ws/ticket_info.php?ticket=68853 (22/3/11)
DPM SL4 retirement ticket. The only reason this is open is the possible SL4 disk servers at Durham, right? Are they still there? In progress (30/10)

  • RALPP

https://ggus.eu/ws/ticket_info.php?ticket=88099 (3/11)
atlas seeing transfer errors into RALPP with "No transfer markers received" errors, although the problem seems to be slowly abating on its own. Still just "Assigned" (4/11) ATLAS still see the problem (5/11). Still just assigned.

  • BRUNEL

https://ggus.eu/ws/ticket_info.php?ticket=88019 (1/11)
lhcb seeing failures on some nodes, blaming cvmfs. Raul has put the CE in downtime. In Progress (1/11)

  • BIRMINGHAM

https://ggus.eu/ws/ticket_info.php?ticket=88009 (1/11)
Hone, with one of their usual politely worded requests to get their jobs moving. Mark tweaked the batch system and hone are happy again. In progress, can be closed (2/11) Solved

https://ggus.eu/ws/ticket_info.php?ticket=86105 (14/9)
Poor sonar rates between Birmingham & BNL. Investigation was made difficult by EMI2 problems with the DPM; Brian has tried doubling the number of streams to see if that would help. Did it? On hold (16/10)

  • DURHAM

https://ggus.eu/ws/ticket_info.php?ticket=88151 (5/11)
apel nagios test problems. Assigned (5/11)

https://ggus.eu/ws/ticket_info.php?ticket=86242 (20/9)
Biomed not cleaning out their cream sandbox. Mike pulled them up about this a while ago but got no reply. We should close this ticket and/or re-ticket the VO if they're causing a mess. Waiting for reply (4/10)

https://ggus.eu/ws/ticket_info.php?ticket=84123 (11/7)
atlas production job failures at Durham, which has become a bit of a catch-all ticket for atlas problems at Durham. On hold (3/9)

https://ggus.eu/ws/ticket_info.php?ticket=75488 (19/10/11)
Compchem authentication ticket. On hold, but is it still relevant? (8/10)

  • ECDF

https://ggus.eu/ws/ticket_info.php?ticket=88119 (4/11)
Atlas transfers are failing due to a sickly pool node. In Progress (5/11)

https://ggus.eu/ws/ticket_info.php?ticket=87958 (31/10)
atlas transfers between Edinburgh & FZK having problems, likely due to their firewall. FZK had been ticketed (no ticket number given though). In Progress (1/11)

https://ggus.eu/ws/ticket_info.php?ticket=86334 (24/9)
Poor atlas sonar rates between ECDF & BNL. Wahid has "harmonised" his tcp tunings, and is waiting on some further WAN upgrades. On hold (25/10)

  • GLASGOW

https://ggus.eu/ws/ticket_info.php?ticket=87879 (29/10)
na62 mapping problems, traced to a pool node not making its grid map. Seems things are fixed now, despite the user's initial protests to the contrary. Turns out they were just being impatient! In progress, can be closed (30/10) SOLVED

  • SUSSEX

https://ggus.eu/ws/ticket_info.php?ticket=86996 (8/10)
Sussex's APEL problems. Things look better now after a lot of work. In progress, can be closed (5/11)

https://ggus.eu/ws/ticket_info.php?ticket=81784 (1/5)
The Sussex Certification Chronicle. Surely the Grid Overlords are satisfied that Sussex is worthy of certification, after paying so much tribute in tears and sanity? :-) In progress (bit quiet though) (23/10) SOLVED! SUSSEX IS ONE OF US NOW...

  • QMUL

https://ggus.eu/ws/ticket_info.php?ticket=86306 (22/9)
Hard-to-kill lhcb jobs at QMUL. Chris is still getting regular hit-lists. Chris's corresponding ticket to the cream developers (https://ggus.eu/tech/ticket_show.php?ticket=87891) has problems, as lhcb can't reply to it! He has, however, written the information in this ticket. In progress (1/11)

  • CAMBRIDGE

https://ggus.eu/ws/ticket_info.php?ticket=86108 (14/9)
Perfsonar WAN bandwidth asymmetry. Been on hold for a while; the classic question must be asked - has the problem gone away all by itself? On hold (2/10)

  • OXFORD

https://ggus.eu/ws/ticket_info.php?ticket=86106 (14/9)
Low atlas sonar rates between BNL and Oxford. Tweaking the FTS settings hasn't made any difference. The next step was to tweak tcp tuning parameters. Duncan observed similar transfer rates between Oxford & TRIUMF. In progress (19/10) Tuning tcp didn't help; what to do next...
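The tcp tuning in tickets like this one usually means the kernel's socket-buffer limits; a minimal sysctl sketch of the sort commonly tried for high-bandwidth, high-latency transatlantic paths (the values are illustrative assumptions, not a vetted recommendation for any particular host):

```ini
# /etc/sysctl.conf fragment -- illustrative values only.
# Raise the maximum socket buffer sizes so a single TCP stream
# can fill a long-round-trip path (e.g. UK <-> BNL).
net.core.rmem_max = 16777216
net.core.wmem_max = 16777216
# TCP autotuning bounds: min / default / max (bytes).
net.ipv4.tcp_rmem = 4096 87380 16777216
net.ipv4.tcp_wmem = 4096 65536 16777216
```

Apply with `sysctl -p` and re-run the sonar tests afterwards to see whether rates actually move.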

  • LANCASTER

https://ggus.eu/ws/ticket_info.php?ticket=85367 (20/8)
ilc jobs were aborting on one of Lancaster's CEs. This CE has poor performance, which for some reason was affecting ilc jobs more than most. The only fix is a reinstall (and reconfigure), but other priorities keep getting in the way (the latest being the use of this CE to test EMI2 tarballs). On hold (5/11)

t2k.org transfer timeout failures between RAL and Lancaster. Traffic is being re-routed over SJ5 instead of the lightpath to see if that helps. Beyond that, there is the possibility that this is a taking-too-long-to-stage-from-tape problem - but no reason why that would only be a problem for us. In progress (1/11)

Tools - MyEGI Nagios

Wednesday 17th October

Monday 17th September

  • Current state of Nagios is now on this page.

Monday 10th September

  • Discussion needed on which Nagios instance is reporting for the WLCG (metrics) view



VOs - GridPP VOMS VO IDs Approved VO table

Tuesday 23 October

  • A local user wants to get onto the grid and to set up his own UI. Do we have instructions?

Monday 15th October

  • Sno+ jobs now work at Dresden https://ggus.eu/ws/ticket_info.php?ticket=86741, but there has got to be a better way.
  • Discussion with SNO+ about their requirements - discussions started on the following topics:
  • Robot certificates and hardware keys
  • FCR
  • Managing storage - how to avoid users filling up the space

Monday 8th October

  • Sno+ had problems with EMI-2 WN and ganga - formatting changes in EMI-2 command output.
  • Now fixed by Mark Slater (8 hours to install the EMI2 WN and 20 mins to fix ganga).
  • Snoplus jobs don't work at Dresden https://ggus.eu/ws/ticket_info.php?ticket=86741
  • Draft e-mail warning "non LHC VOs" about upcoming updates sent to the ops list. Comments please.


Site Updates

Monday 5th November

  • SUSSEX: Site working on enabling ATLAS jobs.


Meeting Summaries
Project Management Board - MembersMinutes Quarterly Reports

Monday 1st October

  • ELC work


Tuesday 25th September

  • Reviewing pledges.
  • Q2 2012 review
  • Clouds and DIRAC

GridPP ops meeting - Agendas Actions Core Tasks

Tuesday 21st August - link Agenda Minutes

  • TBC


RAL Tier-1 Experiment Liaison Meeting (Wednesday 13:30) Agenda EVO meeting

Wednesday 31st October

  • Operations report
  • Castor 2.1.12 upgrade for GEN instance successful yesterday (30th Oct). This completes Castor 2.1.12 updates.
  • New EMI CREAM CEs bedding in. Now stopping any remaining job submission via the old glite CEs.
  • The routing of network packets back from North American Tier1s (BNL, Fermilab, TRIUMF) has been corrected (by the remote sites) to use the OPN rather than other production networks.
  • Investigating asymmetric data flows, notably poor outbound data rates.

WLCG Grid Deployment Board - Agendas MB agendas

October meeting Wednesday 10th October




NGI UK - Homepage CA

Wednesday 22nd August

  • Operationally few changes - VOMS and Nagios changes on hold due to holidays
  • Upcoming meetings Digital Research 2012 and the EGI Technical Forum. UK NGI presence at both.
  • The NGS is rebranding to NES (National e-Infrastructure Service)
  • EGI is looking at options to become a European Research Infrastructure Consortium (ERIC). (Background document.)
  • Next meeting is on Friday 14th September at 13:00.

Events

WLCG workshop - 19th-20th May (NY) Information

CHEP 2012 - 21st-25th May (NY) Agenda

GridPP29 - 26th-27th September (Oxford)

UK ATLAS - Shifter view News & Links

Thursday 21st June

  • Over the last few months ATLAS have been testing their job recovery mechanism at RAL and a few other sites. This is something that was 'implemented' before but never really worked properly. It now appears to be working well, allowing jobs to finish even if the SE is down or unstable when the job finishes.
  • Job recovery works by writing the output of the job to a directory on the WN should writing the output to the SE fail. Subsequent pilots will check this directory and try again for a period of 3 hours. If you would like job recovery activated at your site, you need to create a directory which (atlas) jobs can write to. I would also suggest that this directory has some form of tmpwatch enabled on it which clears up files and directories older than 48 hours. Evidence from RAL suggests that it's normally only 1 or 2 jobs that are ever written to the space at a time, and the space used is normally less than a GB; I have not observed more than 10GB being used. Once you have created this space, please email atlas-support-cloud-uk at cern.ch with the directory (and your site!) and we can add it to the ATLAS configurations. We can switch off job recovery at any time if it does cause a problem at your site. Job recovery is only used for production jobs, as users complain if they have to wait a few hours for things to retry (even if it would save them time overall...)
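The 48-hour cleanup suggested above can be handled by a small daily cron script; a minimal sketch, assuming the recovery area is /data/atlas-recovery (a placeholder path — substitute whatever directory you register with atlas-support-cloud-uk):

```shell
#!/bin/sh
# Daily cleanup of the ATLAS job-recovery area (run from /etc/cron.daily).
# RECOVERY_DIR is a placeholder; use the directory you registered.
RECOVERY_DIR=/data/atlas-recovery

# -mtime +1 matches files modified more than 48 hours ago
# (find rounds file age down to whole days before comparing).
find "$RECOVERY_DIR" -mindepth 1 -type f -mtime +1 -delete

# Remove any directories left empty by the pass above.
find "$RECOVERY_DIR" -mindepth 1 -type d -empty -delete
```

On SL hosts the same effect can be had with tmpwatch, e.g. `tmpwatch 48 /data/atlas-recovery`.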

UK CMS

Tuesday 24th April

  • Brunel will be trialling CVMFS this week, will be interesting. RALPP doing OK with it.

UK LHCb

Tuesday 24th April

  • Things are running smoothly. We are going to run a few small scale tests of new codes. This will also run at T2, one UK T2 involved. Then we will soon launch new reprocessing of all data from this year. CVMFS update from last week; fixes cache corruption on WNs.

UK OTHER

Thursday 21st June - JANET6

  • JANET6 meeting in London (agenda)
  • Spend of order £24M for strategic rather than operational needs.
  • Recommendations to BIS shortly
  • Requirements: bandwidth, flexibility, agility, cost, service delivery - reliability & resilience
  • Core presently 100Gb/s backbone. Looking to 400Gb/s and later 1Tb/s.
  • Reliability limited by funding not ops so need smart provisioning to reduce costs
  • Expecting a 'data deluge' (ITER; EBI; EVLBI; JASMIN)
  • Goal of dynamic provisioning
  • Looking at ubiquitous connectivity via ISPs
  • Contracts are 10 yrs for connection and 5 yrs for transmission equipment.
  • Current native capacity 80 channels of 100Gb/s per channel
  • Fibre procurement for next phase underway (standard players) - 6400km fibre
  • Transmission equipment also at tender stage
  • Industry engagement - Glaxo case study.
  • Extra requirements: software coding, security, domain knowledge.
  • Expect genome data usage to explode in 3-5yrs.
  • Licensing is a clear issue

To note

Tuesday 26th June