Operations Bulletin 300712

From GridPP Wiki

Bulletin archive


Week commencing 23rd July 2012
Task Areas
General updates

Monday 23rd July

  • WLCG has published its list of the top 20 top-level BDIIs.
  • EGI has released a dashboard view (with history) of site availabilities and reliabilities.
  • Tier-2 quarterly reports were due last week.


Monday 16th July

  • Site coordinates in the GOCDB - these only need to be approximate to locate the city/town.
  • For those concerned with VOMS (admins and VO admins), the recent VOMS tutorials arranged by EGI are now available as slides and a webinar from the meeting agenda page.
  • From 13th July: Setting up a CVMFS stratum-0 repository framework for small VOs at RAL Tier-1. Progress has been made and we are now in a position to ask for volunteers.
  • From 12th July: UMD v2.0.0 was released today. This is the second major release of UMD (Unified Middleware Distribution) made available for the EGI production infrastructure. It introduces support for Scientific Linux 6 and Debian 6.

Monday 2nd July

  • Tier-2 quarterly reports requested by Wednesday 18th July
  • There is now an LHCb CVMFS sites map
Tier-1 - Status Page

Tuesday 24th July

  • CVMFS available for testing by non-LHC VOs (including "stratum 0" facilities).
  • CMS switched to use CVMFS
  • Following the enabling of hyperthreading, one batch of worker nodes has had its job count increased from 12 to 14.
  • We continue to test a 'whole node' job queue (ten nodes available).
  • We have a new test monitoring the memory usage of the pbs_server. Rises in this usage correlate with the occurrence of a communication problem between the CEs and the batch server ('Batch Protocol Error' in the logs).
  • In the early hours of Sunday 22nd July a PSU on a disk server (gdss102) failed. This caused the PDU to trip, which in turn took out a network switch. Staff attended on site and the problem was resolved around 06:00.
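The pbs_server memory check above could be as simple as a periodic sample of the daemon's resident set size, logged for later correlation with 'Batch Protocol Error' entries. A minimal sketch follows; the process name matches the Torque daemon, but the log path is an illustrative assumption, not RAL's actual setup.

```shell
#!/bin/sh
# Hedged sketch: record pbs_server resident memory (kB) with a timestamp,
# so growth can later be correlated with 'Batch Protocol Error' log entries.
# LOG path is a hypothetical example.
LOG=${LOG:-/tmp/pbs_server_rss.log}
pid=$(pgrep -x pbs_server | head -n 1)
if [ -n "$pid" ]; then
    rss_kb=$(ps -o rss= -p "$pid" | tr -d ' ')
    echo "$(date '+%Y-%m-%dT%H:%M:%S') rss_kb=$rss_kb" >> "$LOG"
fi
```

Run from cron every few minutes; a steadily climbing rss_kb column ahead of the CE errors would support the correlation.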
Storage & Data Management - Agendas/Minutes

Considering input to a community support model for DPM and possible alternatives.

Tuesday 24th July

  • Sam testing xrootd redirection on Glasgow test cluster - going well.

Wednesday 18th July

  • SNO+ needs/plans for how to "blind" data for analysis on grid resources.
  • Plans for sites to finally upgrade/decommission gLite 3.1 storage services ahead of the 1st October deadline.
  • gLite to EMI upgrade path.
  • Issues regarding PXE booting from 10G cards.
  • Plan to discuss the UK role in community support of DPM.


Accounting - UK Grid Metrics HEPSPEC06

Wednesday 18th July - Core-ops

  • Still need definitive statement on disk situation and SL ATLAS accounting conclusions.
  • Sites should again check Steve's HS06 page.


Wednesday 6th June - Core-ops

  • Request sites to publish HS06 figures from new kit to this page.
  • Please would all sites check the HS06 numbers they publish. Will review in detail on 26th June.

Friday 11th May - HEPSYSMAN

  • Discussion on HS06 reminding sites to publish using results from 32-bit mode benchmarking. A reminder for new kit results to be posted to the HS06 wiki page. See also the blog article by Pete Gronbech. The HEPiX guidelines for running the benchmark tests are at this link.
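As a reminder of what "32-bit mode" means in practice: the HEPiX prescription builds the SPEC CPU2006 all_cpp benchmark set with gcc in 32-bit mode. The excerpt below is in the style of the HEPiX linux32-gcc_cern.cfg configuration; treat it as a memory aid and check the linked guidelines for the authoritative config before publishing results.

```
# In the style of the HEPiX linux32-gcc_cern.cfg SPEC CPU2006 config
# (the linked guidelines are authoritative; this is a reminder only):
COPTIMIZE   = -O2 -fPIC -pthread -m32
CXXOPTIMIZE = -O2 -fPIC -pthread -m32
FOPTIMIZE   = -O2 -fPIC -pthread -m32
```

The benchmark is then driven with runspec against that config, and the per-node HS06 figure posted to the wiki page.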


Documentation - KeyDocs

Thursday 26th July

All the "site update pages" have been reconfigured from a topic oriented structure into a site oriented structure. This is available to view at https://www.gridpp.ac.uk/wiki/Separate_Site_Status_Pages#Site_Specific_Pages

Please do not edit these pages yet - any changes would be lost when we refine the template. Any comments gratefully received, contact: sjones@hep.ph.liv.ac.uk

Wednesday 18th July - Core ops

  • Review of stale documents in coming weeks
  • Plan to setup a site template page to cover 'special' site update pages that are no longer updated

Tuesday 3rd July

  • Started a page on stale documents. Please update this page if you find documents or pages that need attention.


Wednesday, 6th June

Released a document, hep.ph.liv.ac.uk/~sjones/VomsSnooper.odt, that describes how to

  • Maintain site VOMS info document for the approved VOs
  • Check a site's VOMS records correspond exactly with CIC portal
  • Create new site VOMS records direct from CIC portal, without manual transcription
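The record-checking step above amounts to a normalised diff between what the site publishes and what the CIC portal says. The sketch below illustrates the idea in plain shell; the file locations and the VOMS_SERVERS key are illustrative examples, and VomsSnooper/SidFormatter do this properly rather than with grep.

```shell
#!/bin/sh
# Illustrative only: compare the VOMS_SERVERS lines in a site's vo.d
# directory against a reference file derived from CIC portal data.
# Both paths are hypothetical examples, not a real deployment layout.
SITE_VOD=${SITE_VOD:-/etc/vo.d}
REFERENCE=${REFERENCE:-/tmp/cic_reference.txt}

grep -h "^VOMS_SERVERS" "$SITE_VOD"/* 2>/dev/null | sort > /tmp/site_voms.txt
sort "$REFERENCE" 2>/dev/null > /tmp/ref_voms.txt
# Any output here is a record that differs between site and portal.
diff -u /tmp/ref_voms.txt /tmp/site_voms.txt
```

An empty diff means the site's records correspond exactly with the portal; anything else is a candidate for manual follow-up.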

Note: I'm accepting tips from GridPP core task members etc. about other use cases for these processes. This will be converted to wiki format and made available in the normal way. Next jobs:

  • Review the logic/sequence of the VOMS admin process; document it if it works, fix it if it doesn't.
  • Create a standard baseline for the proxy renewal process, and write it up in the wiki.

Note: I'm accepting tips from other GridPP core team members etc. for document priorities. Please think about where the problems lie (i.e. what costs us yet is easy to fix) and get back to me.


Tuesday, 29th May

  • VOMS records in the GridPP Approved VO list are now up to date with the CIC Portal XML. This can be used by site admins to ensure their site-info.def/vo.d directories are up to date. A tool, SidFormatter, will be released this week to facilitate comparison with the benchmark. A process has been devised to ensure that the GridPP Approved VO list is kept up to date to within a week of CIC Portal changes. Consultation to be made about further fields that we may wish to advertise in this manner.

Friday 27th April

  • Appeal for a volunteer to enhance "Grid User Crash Course" (https://www.gridpp.ac.uk/wiki/Grid_user_crash_course) with simple use case for dependable proxy renewal for long jobs, as this is a recurrent requirement that has caused multiple queries on TB_SUPPORT.
Interoperation - EGI ops agendas

Monday 16th July

  • EMI updates: updates expected on the 19th: Top BDII, BLAH and gLite-gsoap/gss. Also StoRM for EMI-2 (and WNoDES, if anyone uses that...). A new procedure is in place for Globus library updates, so any problems ought to be caught by the process in future. Some discussion on compliance with RFC proxies and the SHA-2 algorithm.
  • Staged Rollout: there's some UMD1 and UMD2, and priority is going to be given to UMD1, in particular gLite-gsoap/gss. There's a need to update the SAM tests - they fail on the EMI-2 WN at the moment. Significant discussion on this; there was no change, and it _ought_ to work, so the problem appears to be related to the WN package rather than the test itself. This might be SL6 specific. For the moment, if you update to the EMI-2 WN, you should know about the workaround in https://ggus.org/ws/ticket_info.php?ticket=82899 to get the SAM test passing.
  • Upgrade of gLite 3.1/3.2 components to UMD: all gLite 3.1 components are unsupported and should be upgraded to UMD components by Sept 30th. Only a few gLite 3.2 components are still supported, until Oct 2012 (and should be upgraded by then): FTS, gLExec, LFC, DPM, UI, WN (and VOBox until Apr 2013).


Monitoring - Links MyWLCG

Monday 2nd July

  • DC has almost finished an initial ranking. This will be reviewed by AF/JC and discussed at 10th July ops meeting

Wednesday 6th June

  • Ranking continues. Plan to have a meeting in July to discuss good approaches to the plethora of monitoring available.
  • Glasgow dashboard now packaged and can be downloaded here.
On-duty - Dashboard ROD Rota

Monday 23rd July - KM

  • Quiet week. Nothing to report.


Monday 16th July - DB

  • Quiet week. Sheffield is showing an alarm for a machine that's not in the GOCDB, that might be worth following up. QMUL have an alarm for a machine not in production, they are aware of it, but short of closing the alarm every couple of days I haven't got any better suggestion on how to deal with it.


Monday 9th July - KM

  • Quiet week. Nothing to report. Just a reminder that the Dashboard is now getting results from gridppnagios at Lancaster. So if you want to see the raw results for something, go to https://gridppnagios.lancs.ac.uk/nagios


Rollout Status WLCG Baseline

Wednesday 18th July - Core ops

  • Sites (that needed a tarball install) will need to work on own glexec installs
  • Reminder that gLite 3.1 no longer supported. 3.2 support is also decreasing. Need to push for EMI.

Monday 11th June

  • EMI2 is released but not in Staged Rollout yet. Buyers beware.

Thursday 10th May

  • The CREAM CE and the WMS which were released at the end of April have finally gone into Staged Rollout.
  • Call for more sites to take part in EMI-2 rollout tests.
  • The overall SR contributions are in this table.

Friday 27th April

  • Updated version information on rollout page
  • WN scan indicates some sites not keen on OS updates to those nodes.
Security - Incident Procedure Policies

Monday 23rd July

  • WMS vulnerabilities identified. Sites will have been contacted. Please respond to tickets ASAP.


Monday 25th June

  • Rota availability responses slow
  • Is anyone following up on SSC5/6?
  • Stratuslab VM (ex UK)
  • gridftp



Services - PerfSonar dashboard

Monday 23rd July

  • Sheffield added to perfsonar matrix. Glasgow almost ready.
  • QMUL would like to test with sites using MTU >1500.

Monday 16th July

  • As part of GGUS 84008, can UK T2s who have deployed their perfSONAR-PS box ensure they are testing bandwidth against the German Tier-1.

Monday 9th July

  • The perfsonar link has changed to a new production instance (see link above)
  • A couple more sites have been added in the last week.
  • The GridMon boxes are to be returned to Daresbury!
Tickets

Monday 23rd July, 23:00 BST by Jeremy

19 tickets in the open state.

LANCASTER
https://ggus.eu/ws/ticket_info.php?ticket=84461 Failing transfers for t2k.org. (23/07)

RAL TIER-1
https://ggus.eu/ws/ticket_info.php?ticket=84408 Enable neurogrid.incf.org on WMS and LFC (in progress 20/07)
https://ggus.eu/ws/ticket_info.php?ticket=84270 To confirm addition of lcgbdii.gridpp.rl.ac.uk to the WLCG recommended Top BDII list: https://tomtools.cern.ch/confluence/display/IS/WLCG_Support_Proposal
https://ggus.eu/ws/ticket_info.php?ticket=83927 snoplus glite-transfer permissions issue. Looks like FTS channels were not configured. Suggestions sent back for endpoints (19/07).
https://ggus.eu/ws/ticket_info.php?ticket=68853 (22/03/2011) Retirement of SL4 and 32-bit head nodes and servers. On hold but still valid 17/07.

RHUL
https://ggus.eu/ws/ticket_info.php?ticket=83627 (27/06) Biomed – SE reporting invalid used space. Work in progress! (20/07)

GLASGOW
https://ggus.eu/ws/ticket_info.php?ticket=83283 (14/06) LHCb job failures. Related to CVMFS timeouts? (https://savannah.cern.ch/bugs/index.php?95420) (09/07). Put on hold?

NGS
https://ggus.eu/ws/ticket_info.php?ticket=83213 (12/06) Decommissioning of CE03. Ticket to ngs.ac.uk VO. Close?

IMPERIAL
https://ggus.eu/ws/ticket_info.php?ticket=82946 (07/06) Possible CVMFS issue for ATLAS. Cache problem? (19/07)

MANCHESTER
https://ggus.eu/ws/ticket_info.php?ticket=82492 (24/05) VOMS server AUP re-signing requests. Reopened 11/07 – multiple reminders should be possible. Put on hold?

NGI_UK
https://ggus.eu/ws/ticket_info.php?ticket=84381 New VO for the COMET experiment (proposed name comet.j-parc.jp)
https://ggus.eu/ws/ticket_info.php?ticket=81784 Certification of UKI-SOUTHGRID-SUSX (01/05). Jobs stay running. (23/07)
https://ggus.eu/ws/ticket_info.php?ticket=80259 (14/03) Creation of neuroscience VO. Waiting on WMS enablement (20/07). Also adding to GGUS.

BRISTOL
https://ggus.eu/ws/ticket_info.php?ticket=80155 (12/03) Timeline for SE upgrade/decommissioning. On hold. Retire v1.3: “Ideally before end of August we hope” (09/07). Brian to comment…

ECDF
https://ggus.eu/ws/ticket_info.php?ticket=80152 (12/03) Timeline for SE upgrade/decommissioning. On hold. Waiting for release? (09/07)

DURHAM
https://ggus.eu/ws/ticket_info.php?ticket=84123 Job failures at site (open 11/07). WNs put offline. Site forced into test mode.
https://ggus.eu/ws/ticket_info.php?ticket=83950 CVMFS problem. Squid server had fallen over. Waiting for reply.
https://ggus.eu/ws/ticket_info.php?ticket=75488 (19/10/11) Authentication problems in some CEs (compchem VO). Lots of chasing on this ticket! Mark M checking as of 18/07.
https://ggus.eu/ws/ticket_info.php?ticket=68859 (22/03/11) Retirement of SL4 and 32-bit DPM head node and servers. Lots of chasing on this ticket. Still valid 17/07.


Solved cases not reviewed.

Tools - MyEGI Nagios

Tuesday 25th July

Gridppnagios at Lancaster will remain the main Nagios instance until further announcement. KM is writing down the procedure for switching over to the backup Nagios in case of emergency: https://www.gridpp.ac.uk/wiki/Backup_Regional_Nagios . KM is now away on a one-month holiday and may not be able to reply to emails. New email address for Nagios matters: gridppnagios-admin at physics.ox.ac.uk for any questions or information regarding the regional Nagios. Currently this mail goes to Ewan and Kashif.

Monday 2nd July

  • Switched on backup Nagios at Lancaster and stopped Nagios instance at Oxford. Stopping Nagios instance at Oxford means that it is not sending results to the dashboard and central DB. Keeping a close eye on it and will revert it back to original position if any problems encountered.



VOs - GridPP VOMS VO IDs Approved

Monday 23rd July

  • CW requested feedback from non-LHC VOs on issues
  • Proxy renewal issues thought to be resolved on all services except FTS - a new version may solve problems on that service.
  • Steve's VOMS snooper application is picking up many site VO config problems. We should review all sites in turn.

Monday 16th July

  • EGI UCST representatives have changed the neurogrid.incf.org VO status from New to Production.

Wednesday 6th July

  • Cross-checking VOs enabled vs VO table.
  • Surveying VO-admins for problems faced in their VOs.
  • SNO+ have asked about using git to deploy software - what are the options?


Site Updates

Tuesday 24th July

  • GLASGOW: In recovery mode at the moment; will be bringing equipment back up to check it after the air conditioning failure at the weekend.



Meeting Summaries
Project Management Board - MembersMinutes Quarterly Reports

Monday 2nd July

  • No meeting. Next PMB on Monday 9th.
GridPP ops meeting - Agendas Actions Core Tasks

Tuesday 8th May - link Agenda Minutes

  • ATLAS DATADISK now being used for production input files
  • Deploy perfsonar-ps (for LHC compatibility), but run a Perfsonar-MDM portal to collate the information
  • Target date for perfsonar-ps at sites is the end of July
  • KeyDocs still not in place for all areas

Tuesday 1st May 2012 - Agenda Minutes

  • Trying the present bulletin as a conduit for information
  • 7 sites below 90% in ATLAS monitoring!
  • Instructions for small VO space-token setup/usage requested
  • EMI WN tarball now available - comment in ticket
  • Not every site running latest SL5 OS
RAL Tier-1 Experiment Liaison Meeting (Wednesday 13:30) Agenda EVO meeting

Wednesday 25th July

  • Operations report
  • CVMFS available for testing by non-LHC VOs (including "stratum 0" facilities).
  • Continue with hyperthreading tests. One batch (2011 Dell worker nodes) has had its job count increased from 12 to 14.
WLCG Grid Deployment Board - Agendas MB agendas

July meeting Wednesday 11th July

Welcome (Michel Jouvin)
  • September meeting to include IPv6, LS1 and extended run plans
  • EMI-2 WN testing also in September

CVMFS deployment status (Ian Collier)
  • Recap: 78/104 sites for ATLAS – the UK looks good thanks to Alessandra
  • Using two repos for ATLAS. Local shared area will be dropped in future.
  • 36/86 for LHCb. Two WN mounts. Preference for CVMFS – extra work
  • 5 T2s for CMS. Info for sites: https://twiki.cern.ch/twiki/bin/view/CMSPublic/CompOpsCVMFS
  • Client with shared cache in testing
  • Looking at NFS client and Mac OS X

Pre-GDB on CE Extensions (Davide Salomoni)
  • https://indico.cern.ch/conferenceDisplay.py?confId=196743
  • Goal – review proposed extensions + focus on whole-node/multi-core set
  • Also agree development plan + timeline for CEs.
  • Fixed cores and variable number of cores + mem requirements. May impact expt. frameworks.
  • Some extra attributes added in Glue 2.0 – e.g. MaxSlotsPerJob
  • JDL. Devel. Interest. Queue level or site level.
  • How. CE implementations. Plan. Actions.

Initial Meeting with EMI, EGI and OSG (Michel Jouvin)
  • ID issues related to end of supporting projects (e.g. EMI)
  • Globus (community?); EMI MW (WLCG); OSG; validation
  • Discussion has not included all stakeholders.

How to identify the best top level BDIIs (Maria Alandes Pradillo)
  • Only 11% are “properly” configured (LCG_GFAL_INFOSYS 1,2,3)
  • UK BDIIs appear in top 20 of ‘most configured’.

MUPJ – gLExec update (Maarten Litmaath)
  • ‘glexec’ flag in GOCDB for each supporting CE
  • http://cern.ch/go/PX7p (so far… T1, Brunel, IC-HEP, Liv, Man, Glasgow, Ox, RALPP)
  • Improved instructions: https://twiki.cern.ch/twiki/bin/view/LCG/GlexecDeployment
  • CMS ticketing sites. Working on GlideinWMS.

WG on Storage Federations (Fabrizio Furano)
  • Federated access to data – clarify what needs supporting
  • ‘fail over’ for jobs; ‘repair mechanisms’; access control
  • So far XROOTD clustering through WAN = natural solution
  • Setting up group.

DPM Collaboration – Motivation and proposal (Oliver Keeble)
  • Context. Why. Who…
  • UK is 3rd largest user (by region/country)
  • Section on myths: DPM has had investment. Not only for small sites…
  • New features: HTTP/WebDAV, NFSv4.1, Perfsuite…
  • Improvements with xrootd plugin
  • Looking for stakeholders to express interest… expect proposal shortly
  • Possible model: 3-5 MoU or ‘maintain’

Update on SHA-2 and RFC proxy support
  • IGTF wish CAs -> SHA-2 signatures ASAP. For WLCG this means using RFC proxies in place of the current Globus legacy proxies.
  • dCache & BeStMan may look at the EMI Common Authentication Library (CANL) – supports SHA-2 with legacy proxies.
  • IGTF aim for Jan 2013 (then takes 395 days for SHA-1 to disappear)
  • Concern about timeline (LHC run now extended)
  • Status: https://twiki.cern.ch/twiki/bin/view/LCG/RFCproxySHA2support
  • Plan: deployed SW supports RFC proxies (Summer 2013) and SHA-2 (except dCache/BeStMan – Summer 2013). Introduce SHA-2 CAs Jan 2014.
  • Plan B – short-lived WLCG catch-all CA

ARGUS Authorization Service (Valery Tschopp)
  • Authorisation examples & ARGUS motivation (many services, global banning, static policies). Can user X perform action Y on resource Z?
  • ARGUS is built on top of a XACML policy engine
  • PAP = Policy Administration Point (tool to author policies)
  • PDP = Policy Decision Point (evaluates requests)
  • PEP = Policy Execution Point (reformats requests)
  • XACML hidden behind a Simplified Policy Language (SPL)
  • Central banning = hierarchical policy distribution
  • Pilot job authorization – gLExec executes payload on WN: https://twiki.cern.ch/twiki/bin/view/EGEE/AuthorizationFramework

Operations Coordination Team (Maria Girone)
  • Mandate – addresses needs in WLCG service coordination recommendations & commissioning of OPS and Tools.
  • Establish core teams of experts to validate, commission and troubleshoot services.
  • Team goals: understand services needed; monitor health; negotiate configs; commission new services; help with transitions.
  • Team roles: core members (sites, regions, expt., services) + targeted experts
  • Tasks: CVMFS, Perfsonar, gLExec

Jobs with High Memory Profiles
  • See expt reports.



NGI UK - Homepage CA

Monday 2nd July

  • Next meeting is on 9th July.
Events

WLCG workshop - 19th-20th May (NY) Information

CHEP 2012 - 21st-25th May (NY) Agenda

UK ATLAS - Shifter view News & Links

Thursday 21st June

  • Over the last few months ATLAS have been testing their job recovery mechanism at RAL and a few other sites. This is something that was 'implemented' before but never really worked properly. It now appears to be working well, allowing jobs to finish even if the SE is down/unstable when the job finishes.
  • Job recovery works by writing the output of the job to a directory on the WN should it fail when writing the output to the SE. Subsequent pilots will check this directory and try again for a period of 3 hours. If you would like to have job recovery activated at your site you need to create a directory to which (atlas) jobs can write. I would also suggest that this directory has some form of tmpwatch enabled on it which clears up files and directories older than 48 hours. Evidence from RAL suggests that it's normally only 1 or 2 jobs that are ever written to the space at a time, and the space used is normally less than a GB; I have not observed more than 10GB being used. Once you have created this space, email atlas-support-cloud-uk at cern.ch with the directory (and your site!) and we can add it to the ATLAS configurations. We can switch off job recovery at any time if it causes a problem at your site. Job recovery is only used for production jobs, as users complain if they have to wait a few hours for things to retry (even if it would save them time overall...)
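The setup described above is just a world-writable scratch directory plus a periodic clean-up of anything older than 48 hours. A minimal sketch, assuming a hypothetical path (your site will pick its own, and tmpwatch can replace the find where available):

```shell
#!/bin/sh
# Hedged sketch of the job-recovery area described above.
# The path is illustrative, not an agreed convention.
RECOVERY_DIR=${RECOVERY_DIR:-/tmp/atlas-job-recovery}
mkdir -p "$RECOVERY_DIR"
chmod 1777 "$RECOVERY_DIR"   # writable by jobs, sticky bit set
# Run daily from cron: remove anything older than 48 hours (2880 min).
find "$RECOVERY_DIR" -mindepth 1 -mmin +2880 -delete
```

The directory path (and your site name) is what you would then mail to atlas-support-cloud-uk for inclusion in the ATLAS configurations.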
UK CMS

Tuesday 24th April

  • Brunel will be trialling CVMFS this week, will be interesting. RALPP doing OK with it.
UK LHCb

Tuesday 24th April

  • Things are running smoothly. We are going to run a few small scale tests of new codes. This will also run at T2, one UK T2 involved. Then we will soon launch new reprocessing of all data from this year. CVMFS update from last week; fixes cache corruption on WNs.
UK OTHER

Thursday 21st June - JANET6

  • JANET6 meeting in London (agenda)
  • Spend of order £24M for strategic rather than operational needs.
  • Recommendations to BIS shortly
  • Requirements: bandwidth, flexibility, agility, cost, service delivery - reliability & resilience
  • Core presently 100Gb/s backbone. Looking to 400Gb/s and later 1Tb/s.
  • Reliability limited by funding not ops so need smart provisioning to reduce costs
  • Expecting a 'data deluge' (ITER; EBI; EVLBI; JASMIN)
  • Goal of dynamic provisioning
  • Looking at ubiquitous connectivity via ISPs
  • Contracts were 10yrs wrt connection and 5yrs transmission equipment.
  • Current native capacity 80 channels of 100Gb/s per channel
  • Fibre procurement for next phase underway (standard players) - 6400km fibre
  • Transmission equipment also at tender stage
  • Industry engagement - Glaxo case study.
  • Extra requirements: software coding, security, domain knowledge.
  • Expect genome data usage to explode in 3-5yrs.
  • Licensing is a clear issue
To note

Tuesday 26th June