Operations Bulletin 230712

Bulletin archive

Week commencing 16th July 2012

Task Areas

General updates

Monday 16th July

Site coordinates in the GOCDB - these only need to be approximate to locate the city/town.
For those concerned with VOMS (admins and VO admins), the recent VOMS tutorials arranged by EGI are now available as slides and a webinar from the meeting agenda page.
From 13th July: Set up a CVMFS stratum-0 repository framework for small VOs at RAL Tier-1. Progress has been made and we are now in a position to ask for volunteers.
From 12th July: UMD v2.0.0 was released today. This is the second major release of UMD (Unified Middleware Distribution) made available for the EGI production infrastructure. This release introduces support for Scientific Linux 6 and Debian 6.

Monday 2nd July

Tier-2 quarterly reports requested by Wednesday 18th July
There is now an LHCb CVMFS sites map

Tier-1 - Status Page

Wednesday 18th July

CVMFS available for testing by non-LHC VOs (including "stratum 0" facilities).
Following hyperthreading tests, the number of jobs on 2011 Dell Worker Nodes will be increased from 12 to 14.

Storage & Data Management - Agendas/Minutes

Considering input to a community support model for DPM and possible alternatives.

Wednesday 18th July

SNO+ needs/plans on how (on grid resources) to "blind" data for analysis.
Plans for sites to finally upgrade/decommission gLite 3.1 storage services ahead of the Oct 1st deadline.
gLite to EMUI upgrade path.
Issues regarding PXE booting from 10G cards.
Plan to discuss UK role in community support of DPM.

Accounting - UK Grid Metrics HEPSPEC06

Wednesday 6th June - Core-ops

Request sites to publish HS06 figures from new kit to this page.

Please would all sites check the HS06 numbers they publish. Will review in detail on 26th June.

Friday 11th May - HEPSYSMAN

Discussion on HS06 reminding sites to publish using results from 32-bit mode benchmarking. A reminder for new kit results to be posted to the HS06 wiki page. See also the blog article by Pete Gronbech. The HEPiX guidelines for running the benchmark tests are at this link.

Check publishing via: http://gstat2.grid.sinica.edu.tw/gstat/summary/Country/UK/

Documentation - KeyDocs

Tuesday 3rd July

Started a page on stale documents. Please update this page if you find documents or pages that need attention.

Wednesday, 6th June

Released a document, hep.ph.liv.ac.uk/~sjones/VomsSnooper.odt, that describes how to

Maintain site VOMS info document for the approved VOs
Check a site's VOMS records correspond exactly with CIC portal
Create new site VOMS records direct from CIC portal, without manual transcription
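The second item, checking that a site's VOMS records correspond exactly with a reference list, can be illustrated with a simple set comparison. This is only a minimal sketch and not how VomsSnooper itself works: the record layout (VO name, VOMS host, port) and the example hosts are assumptions made for illustration.

```python
def compare_records(site_records, reference_records):
    """Compare two collections of (vo, host, port) tuples.

    Returns a dict listing records present in the reference but absent
    at the site, and records at the site that the reference lacks.
    The tuple layout is a hypothetical simplification of a VOMS record.
    """
    site_set = set(site_records)
    ref_set = set(reference_records)
    return {
        "missing_at_site": sorted(ref_set - site_set),
        "not_in_reference": sorted(site_set - ref_set),
    }


if __name__ == "__main__":
    # Hypothetical example data; the hosts are placeholders.
    site = [("vo.a", "voms1.example.org", 15001),
            ("vo.b", "voms2.example.org", 15002)]
    reference = [("vo.a", "voms1.example.org", 15001),
                 ("vo.b", "voms2.example.org", 15010)]
    for kind, records in compare_records(site, reference).items():
        print(kind, records)
```

An exact match gives two empty lists; anything else points at a record to add, remove or correct.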

Note: I'm accepting tips from GridPP core task members etc. about other use cases for these processes. This will be converted to wiki format and made available in the normal way. Next jobs:

Review the logic/sequence of the VOMS admin process; document it if it works, fix it if it doesn't.
Create a standard baseline for the proxy renewal process, and write it up in the wiki.

Note: I'm accepting tips from other GridPP core team members etc. for document priorities. Please think about where the problems lie (i.e. what costs us yet is easy to fix) and get back to me.

Tuesday, 29th May

VOMS Records in the GridPP Approved VO list are now up to date with the CIC Portal XML. This can be used by Site Admins to ensure their site-info.def/vo.d directories are up to date. A tool, SidFormatter, will be released this week to facilitate comparison with the benchmark. A process has been devised to ensure that the GridPP Approved VO list is kept up to date to within a week of CIC Portal changes. Consultation to be made about further fields that we may wish to advertise in this manner.

Friday 27th April

Appeal for a volunteer to enhance "Grid User Crash Course" (https://www.gridpp.ac.uk/wiki/Grid_user_crash_course) with simple use case for dependable proxy renewal for long jobs, as this is a recurrent requirement that has caused multiple queries on TB_SUPPORT.

Interoperation - EGI ops agendas

Monday 16th July

EMI updates: Updates expected on the 19th: Top BDII, BLAH and gLite-gsoap/gss. Also StoRM for EMI-2 (and WNoDES, if anyone uses that...). A new procedure is in place so that, in the event of a Globus lib update, any problems ought to be caught by processes in future. Some discussion on compliance with RFC proxies and SHA2 algorithm.

Staged Rollout: There's some UMD1 and UMD2, and priority is going to be given to UMD1, in particular the gLite-gsoap/gss. There's a need to update the SAM tests - they fail on the EMI-2 WN at the moment. There was significant discussion on this: the test itself has not changed, and it _ought_ to work, so the problem appears to be related to the WN package rather than the test. This might be SL6 specific. For the moment, if you update to EMI-2 WN, you should know about the workaround in https://ggus.org/ws/ticket_info.php?ticket=82899 to get the SAM test passing.

Upgrade of gLite 3.1/3.2 components to UMD: All gLite 3.1 components are unsupported, and should be upgraded to UMD stuff by Sept 30th. Only a few gLite 3.2 components are still supported, until Oct 2012 (and should be upgraded by then): FTS, gLExec, LFC, DPM, UI, WN. (and VOBox until Apr 2013).

There are lists available of services running at sites that are gLite 3.1 (current as of 17/7/2012 ... somehow!): https://indico.egi.eu/indico/getFile.py/access?resId=0&materialId=2&confId=1124 and for gLite 3.2: https://indico.egi.eu/indico/getFile.py/access?resId=0&materialId=3&confId=1124

Monitoring - Links MyWLCG

Monday 2nd July

DC has almost finished an initial ranking. This will be reviewed by AF/JC and discussed at 10th July ops meeting

Wednesday 6th June

Ranking continues. Plan to have a meeting in July to discuss good approaches to the plethora of monitoring available.

Current priority is ranking the tools available.

Glasgow dashboard now packaged and can be downloaded here.

On-duty - Dashboard ROD Rota

Monday 16th July - DB

Quiet week. Sheffield is showing an alarm for a machine that's not in the GOCDB, which might be worth following up. QMUL have an alarm for a machine not in production; they are aware of it, but short of closing the alarm every couple of days I haven't got any better suggestion on how to deal with it.

Monday 9th July - KM

Quiet week. Nothing to report. Just a reminder that the Dashboard is now getting results from gridppnagios at Lancaster. So if you want to see raw results for something, go to https://gridppnagios.lancs.ac.uk/nagios

Rollout Status WLCG Baseline

Monday 11th June

EMI2 is released but not in Staged Rollout yet. Buyers beware.

Thursday 10th May

The CREAM CE and the WMS, which were released at the end of April, have finally gone into Staged Rollout.
Call for more sites to take part in EMI-2 rollout tests.
The overall SR contributions are in this table.

Friday 27th April

Updated version information on rollout page
WN scan indicates some sites not keen on OS updates to those nodes.

Security - Incident Procedure Policies

Monday 25th June

Rota availability responses slow
Is anyone following up on SSC5/6?
Stratuslab VM (ex UK)
gridftp

Services - PerfSonar dashboard

Monday 16th July

As part of GGUS 84008, can UK T2s who have deployed their perfSONAR-PS box ensure they are testing bandwidth against the German Tier1?

Monday 9th July

The perfsonar link has changed to a new production instance (see link above)
A couple more sites have been added in the last week.
The GridMon boxes are to be returned to Daresbury!

Tickets

Monday 16th of July, 13:30 BST

26 open UK tickets this week, although not many jump out at me.

NEW TIER 1 https://ggus.eu/ws/ticket_info.php?ticket=84270 lcgbdii.gridpp.rl.ac.uk has been selected as a WLCG "recommended top bdii", with all the honour and glory that brings.

DURHAM https://ggus.eu/ws/ticket_info.php?ticket=84123 Seeing atlas production job failures, "send2nsd: NS002 - send error : Could not load a security plugin" error. No word since Wednesday, but Mike might be busy.

https://ggus.eu/ws/ticket_info.php?ticket=84066 Availability/Reliability update for June, needs to be done by next Tuesday.

GLASGOW https://ggus.eu/ws/ticket_info.php?ticket=83943 Despite being in a well organised and well advertised scheduled downtime involving most of their resources Glasgow are being picked on for failed atlas transfers.

QMUL https://ggus.eu/ws/ticket_info.php?ticket=83773 The atlas cvmfs related problems seem to have been fixed by Chris upgrading to cvmfs-2.0.18-0.3.3574svn. It looks like this ticket can be closed (unless Chris has something else he wants to investigate).

SUSSEX https://ggus.eu/ws/ticket_info.php?ticket=81784 Emyr has fixed the certificates issues from last week, he's hoping that the last hurdle has been jumped in the quest to certify Sussex.

VOMS https://ggus.eu/ws/ticket_info.php?ticket=81784 Chris's request for multiple membership expiry reminders is technically possible (https://ggus.eu/tech/ticket_show.php?ticket=84020), but requires EMI-2 VOMS servers.

Nothing exciting going on on the "Solved Case" or "Tickets from the UK" fronts.

Tools - MyEGI Nagios

Monday 2nd July

Switched on backup Nagios at Lancaster and stopped Nagios instance at Oxford. Stopping Nagios instance at Oxford means that it is not sending results to the dashboard and central DB. Keeping a close eye on it and will revert it back to original position if any problems encountered.

VOs - GridPP VOMS VO IDs Approved

Monday 16th July

EGI UCST Representatives have changed VO neurogrid.incf.org status from New to Production.

Wednesday 6th July

Cross-checking VOs enabled vs VO table.
Surveying VO-admins for problems faced in their VOs.
SNO+ have asked about using git to deploy software - what are the options?

Site Updates

Monday 25th June

N/A

Meeting Summaries

Project Management Board - Members Minutes Quarterly Reports

Monday 2nd July

No meeting. Next PMB on Monday 9th.

GridPP ops meeting - Agendas Actions Core Tasks

Tuesday 8th May - link Agenda Minutes

ATLAS DATADISK now being used for production input files
Deploy perfsonar-ps (for LHC compatibility), but run a Perfsonar-MDM portal to collate the information
Target date for perfsonar-ps at sites is the end of July
KeyDocs still not in place for all areas

Tuesday 1st May 2012 - Agenda Minutes

Trying the present bulletin as a conduit for information
7 sites below 90% in ATLAS monitoring!
Instructions for small VO space-token setup/usage requested
EMI WN tarball now available - comment in ticket
Not every site running latest SL5 OS

RAL Tier-1 Experiment Liaison Meeting (Wednesday 13:30) Agenda EVO meeting

Wednesday 18th July

Operations report
CVMFS available for testing by non-LHC VOs (including "stratum 0" facilities).
Following hyperthreading tests, the number of jobs on 2011 Dell Worker Nodes will be increased from 12 to 14.

WLCG Grid Deployment Board - Agendas MB agendas

July meeting Wednesday 11th July

Welcome (Michel Jouvin)
• September meeting to include IPv6, LS1 and extended run plans
• EMI-2 WN testing also in September

CVMFS deployment status (Ian Collier)
• Recap; 78/104 sites for ATLAS – the UK looks good thanks to Alessandra
• Using two repos for ATLAS. Local shared area will be dropped in future.
• 36/86 for LHCb. Two WN mounts. Pref for CVMFS – extra work
• 5 T2s for CMS. Info for sites https://twiki.cern.ch/twiki/bin/view/CMSPublic/CompOpsCVMFS
• Client with shared cache in testing
• Looking at NFS client and MAC OS X

Pre-GDB on CE Extensions (Davide Salomoni)
• https://indico.cern.ch/conferenceDisplay.py?confId=196743
• Goal – review proposed extensions + focus on whole-node/multi-core set
• Also agree development plan + timeline for CEs.
• Fixed cores and variable number of cores + mem requirements. May impact expt. frameworks.
• Some extra attributes added in Glue 2.0 – e.g. MaxSlotsPerJob
• JDL. Devel. Interest. Queue level or site level.
• How. CE implementations. Plan. Actions.

Initial Meeting with EMI, EGI and OSG (Michel Jouvin)
• ID issues related to end of supporting projects (e.g. EMI)
• Globus (community?); EMI MW (WLCG); OSG; validation
• Discussion has not included all stakeholders.

How to identify the best top level BDIIs (Maria Alandes Pradillo)
• Only 11% are “properly” configured (LCG_GFAL_INFOSYS 1,2,3)
• UK BDIIs appear in top 20 of ‘most configured’.

MUPJ – gLExec update (Maarten Litmaath)
• ‘glexec’ flag in GOCDB for each supporting CE
• http://cern.ch/go/PX7p (so far… T1, Brunel, IC-HEP, Liv, Man, Glasgow, Ox, RALPP)
• Improved instructions: https://twiki.cern.ch/twiki/bin/view/LCG/GlexecDeployment
• CMS ticketing sites. Working on GlideinWMS.

WG on Storage Federations (Fabrizio Furano)
• Federated access to data – clarify what needs supporting
• ‘fail over’ for jobs; ‘repair mechanisms’; access control
• So far XROOTD clustering through WAN = natural solution
• Setting up group.

DPM Collaboration – Motivation and proposal (Oliver Keeble)
• Context. Why. Who…
• UK is 3rd largest user (by region/country)
• Section on myths: DPM has had investment. Not only for small sites…
• New features: HTTP/WebDAV, NFSv4.1, Perfsuite…
• Improvements with xrootd plugin
• Looking for stakeholders to express interest … expect proposal shortly
• Possible model: 3-5 MoU or ‘maintain’

Update on SHA-2 and RFC proxy support
• IGTF wish CAs -> SHA-2 signatures ASAP. For WLCG means use RFC in place of current Globus legacy proxies.
• dCache & BeStMan may look at EMI Common Authentication Library (CANL) – supports SHA-2 with legacy proxies.
• IGTF aim for Jan 2013 (then takes 395 days for SHA-1 to disappear)
• Concern about timeline (LHC run now extended)
• Status: https://twiki.cern.ch/twiki/bin/view/LCG/RFCproxySHA2support
• Plan deployed SW support RFC proxies (Summer 2013) and SHA-2 (except dCache/BeStMan – Summer 2013). Introduce SHA-2 CAs Jan 2014.
• Plan B – short-lived WLCG catch-all CA

ARGUS Authorization Service (Valery Tschopp)
• Authorisation examples & ARGUS motivation (many services, global banning, policies static). Can user X perform action Y on resource Z?
• ARGUS built on top of a XACML policy engine
• PAP = Policy Administration Point. Tool to author policies.
• PDP = Policy Decision Point (evaluates requests)
• PEP = Policy Enforcement Point (reformats requests)
• Hide XACML with Simplified Policy Language (SPL)
• Central banning = Hierarchical policy distribution
• Pilot job authorization – gLExec executes payload on WN https://twiki.cern.ch/twiki/bin/view/EGEE/AuthorizationFramework
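The question "can user X perform action Y on resource Z" can be illustrated with a toy decision function. This is a sketch of the general idea only; it does not use the ARGUS API, XACML or SPL. The rule layout, the first-match-wins ordering, the default of "deny" when nothing matches, and the example subject names are all assumptions made for illustration.

```python
from typing import NamedTuple


class Rule(NamedTuple):
    effect: str    # "permit" or "deny"
    subject: str   # user identity, or "*" for any
    action: str    # action name, or "*" for any
    resource: str  # resource name, or "*" for any


def matches(pattern, value):
    """A pattern matches when it is the wildcard or equals the value."""
    return pattern == "*" or pattern == value


def decide(rules, subject, action, resource):
    """Return the effect of the first rule matching the request.

    Falling back to "deny" when no rule matches is an assumption of
    this sketch.
    """
    for rule in rules:
        if (matches(rule.subject, subject)
                and matches(rule.action, action)
                and matches(rule.resource, resource)):
            return rule.effect
    return "deny"


if __name__ == "__main__":
    # Hypothetical policy: a ban listed first overrides later permits,
    # which mimics the effect of a centrally distributed ban.
    policy = [
        Rule("deny", "banned-user", "*", "*"),
        Rule("permit", "*", "submit-job", "site-ce"),
    ]
    print(decide(policy, "alice", "submit-job", "site-ce"))
    print(decide(policy, "banned-user", "submit-job", "site-ce"))
```

Placing ban rules ahead of permits is one simple way to make a ban take precedence; a real policy engine has its own combining rules.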

Operations Coordination Team (Maria Girone)
• Mandate – addresses needs in WLCG service coordination recommendations & commissioning of OPS and Tools.
• Establish core teams of experts to validate, commission and troubleshoot services.
• Team goals: understand services needed; monitor health; negotiate configs; commission new services; help with transitions.
• Team roles: core members (sites, regions, expt., services) + targeted experts
• Tasks: CVMFS, Perfsonar, gLExec

Jobs with High Memory Profiles
• See expt reports.

NGI UK - Homepage CA

Monday 2nd July

Next meeting is on 9th July.

Events

WLCG workshop - 19th-20th May (NY) Information

CHEP 2012 - 21st-25th May (NY) Agenda

UK ATLAS - Shifter view News & Links

Thursday 21st June

Over the last few months ATLAS have been testing their job recovery mechanism at RAL and a few other sites. This is something that was 'implemented' before but never really worked properly. It now appears to be working well, allowing jobs to finish even if the SE is down or unstable when the job finishes.

Job recovery works by writing the output of the job to a directory on the WN should it fail when writing the output to the SE. Subsequent pilots will check this directory and try again for a period of 3 hours. If you would like to have job recovery activated at your site you need to create a directory which (atlas) jobs can write to. I would also suggest that this directory has some form of tmpwatch enabled on it which clears up files and directories older than 48 hours. Evidence from RAL suggests that it is normally only 1 or 2 jobs that are ever written to the space at a time and the space used is normally less than a GB. I have not observed more than 10GB being used. Once you have created this space, please email atlas-support-cloud-uk at cern.ch with the directory (and your site!) and we can add it to the ATLAS configurations. We can switch off job recovery at any time if it does cause a problem at your site. Job recovery would only be used for production jobs as users complain if they have to wait a few hours for things to retry (even if it would save them time overall...)
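For sites without a tmpwatch-style tool to hand, the 48-hour cleanup suggested above could be approximated with a short script run from cron. This is a minimal sketch, not an ATLAS-provided tool: the function name is made up for illustration, and the choice to judge age by modification time is an assumption.

```python
import os
import time

MAX_AGE_SECONDS = 48 * 3600  # 48-hour retention, as suggested above


def purge_old_entries(root, max_age=MAX_AGE_SECONDS, now=None):
    """Delete old files and old empty directories below root.

    Age is judged by modification time. The root directory itself is
    kept. Returns the list of removed paths.
    """
    now = time.time() if now is None else now
    removed = []
    # Walk bottom-up so a directory's contents are visited before it.
    for dirpath, dirnames, filenames in os.walk(root, topdown=False):
        for name in filenames:
            path = os.path.join(dirpath, name)
            if now - os.lstat(path).st_mtime > max_age:
                os.remove(path)
                removed.append(path)
        for name in dirnames:
            path = os.path.join(dirpath, name)
            if os.path.islink(path):
                continue  # leave symlinked directories alone
            # Deleting files refreshes the parent's mtime, so a directory
            # emptied in this run is only removed by a later run.
            if now - os.lstat(path).st_mtime > max_age and not os.listdir(path):
                os.rmdir(path)
                removed.append(path)
    return removed
```

Pointing this at the recovery directory once a day would keep it small while still leaving far longer than the 3-hour retry window for pilots to pick up stranded output.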

UK CMS

Tuesday 24th April

Brunel will be trialling CVMFS this week, will be interesting. RALPP doing OK with it.

UK LHCb

Tuesday 24th April

Things are running smoothly. We are going to run a few small scale tests of new codes. This will also run at T2, one UK T2 involved. Then we will soon launch new reprocessing of all data from this year. CVMFS update from last week; fixes cache corruption on WNs.

UK OTHER

Thursday 21st June - JANET6

JANET6 meeting in London (agenda)
Spend of order £24M for strategic rather than operational needs.
Recommendations to BIS shortly
Requirements: bandwidth, flexibility, agility, cost, service delivery - reliability & resilience
Core presently 100Gb/s backbone. Looking to 400Gb/s and later 1Tb/s.
Reliability limited by funding not ops so need smart provisioning to reduce costs
Expecting a 'data deluge' (ITER; EBI; EVLBI; JASMIN)
Goal of dynamic provisioning
Looking at ubiquitous connectivity via ISPs
Contracts were 10yrs wrt connection and 5yrs transmission equipment.
Current native capacity 80 channels of 100Gb/s per channel
Fibre procurement for next phase underway (standard players) - 6400km fibre
Transmission equipment also at tender stage

Industry engagement - Glaxo case study.
Extra requirements: software coding, security, domain knowledge.
Expect genome data usage to explode in 3-5yrs.
Licensing is a clear issue

To note

Tuesday 26th June

On Tuesday 31st July 2012 the GOCDB read-write portal at https://gocdb4.esc.rl.ac.uk/portal will be decommissioned and replaced by a single read-write version at https://goc.egi.eu/portal. This will consolidate the service (including the PI) under the same URL. All GOCDB client maintainers are requested to ensure their PI configuration URLs point to https://goc.egi.eu/portal.
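Client maintainers who want to check their configuration for leftover references to the old portal could use something like the following. It is a sketch under stated assumptions: the configuration is treated as plain text, the key name `gocdb_pi_url` in the example is hypothetical, and only the two base URLs quoted above are taken from the announcement.

```python
OLD_PORTAL = "https://gocdb4.esc.rl.ac.uk/portal"
NEW_PORTAL = "https://goc.egi.eu/portal"


def find_stale_urls(config_text):
    """Return (line_number, line) pairs that still mention the old portal."""
    return [
        (number, line.strip())
        for number, line in enumerate(config_text.splitlines(), start=1)
        if OLD_PORTAL in line
    ]


def migrate(config_text):
    """Return config_text with the old base URL replaced by the new one."""
    return config_text.replace(OLD_PORTAL, NEW_PORTAL)


if __name__ == "__main__":
    # Hypothetical configuration fragment for illustration only.
    example = (
        "timeout = 30\n"
        "gocdb_pi_url = https://gocdb4.esc.rl.ac.uk/portal\n"
    )
    print(find_stale_urls(example))
    print(migrate(example))
```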
