Operations Bulletin 230712

From GridPP Wiki

Bulletin archive


Week commencing 16th July 2012
Task Areas
General updates

Monday 16th July

  • Site coordinates in the GOCDB - these only need to be approximate to locate the city/town.
  • For those concerned with VOMS (admins and VO admins), the recent VOMS tutorials arranged by EGI are now available as slides and a webinar from the meeting agenda page.
  • From 13th July: Set up a CVMFS stratum-0 repository framework for small VOs at RAL Tier-1. Progress has been made and we are now in a position to ask for volunteers.
  • From 12th July: UMD v2.0.0 was released today. This is the second major release of UMD (Unified Middleware Distribution) made available for the EGI production infrastructure, and it introduces support for Scientific Linux 6 and Debian 6.

Monday 2nd July

  • Tier-2 quarterly reports requested by Wednesday 18th July
  • There is now an LHCb CVMFS sites map
Tier-1 - Status Page

Wednesday 18th July

  • CVMFS available for testing by non-LHC VOs (including "stratum 0" facilities).
  • Following hyperthreading tests, the 2011 Dell worker nodes will have their number of jobs increased from 12 to 14.
Storage & Data Management - Agendas/Minutes

Considering input to a community support model for DPM and possible alternatives.

Wednesday 18th July

  • snoplus needs/plans on how (on grid resources) to "blind" data for analysis.
  • Plans for sites to finally upgrade/decommission gLite 3.1 storage services ahead of the Oct 1st deadline.
  • gLite to EMUI upgrade path.
  • Issues regarding PXE booting from 10G cards.
  • Plan to discuss the UK role in community support of DPM.


Accounting - UK Grid Metrics HEPSPEC06

Wednesday 6th June - Core-ops

  • Request sites to publish HS06 figures from new kit to this page.
  • Please would all sites check the HS06 numbers they publish. Will review in detail on 26th June.

Friday 11th May - HEPSYSMAN

  • Discussion on HS06 reminding sites to publish using results from 32-bit mode benchmarking. A reminder for new kit results to be posted to the HS06 wiki page. See also the blog article by Pete Gronbech. The HEPiX guidelines for running the benchmark tests are at this link.


Documentation - KeyDocs

Tuesday 3rd July

  • Started a page on stale documents. Please update this page if you find documents or pages that need attention.


Wednesday, 6th June

Released a document, hep.ph.liv.ac.uk/~sjones/VomsSnooper.odt, that describes how to

  • Maintain site VOMS info document for the approved VOs
  • Check a site's VOMS records correspond exactly with CIC portal
  • Create new site VOMS records direct from CIC portal, without manual transcription
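The second use case (checking that a site's VOMS records match the CIC portal) amounts to a set comparison. The following is only a minimal sketch of that idea, not how VomsSnooper itself works: the function name `compare_records`, the dictionary layout and the example records are all hypothetical.

```python
def compare_records(site, portal):
    """Compare two {vo_name: iterable_of_record_strings} mappings.

    Hypothetical layout: each record string stands for one VOMS server
    entry of a VO. Returns the VOs that are missing at the site, the VOs
    the portal does not know about, and the VOs whose records differ.
    """
    site_vos, portal_vos = set(site), set(portal)
    differing = sorted(
        vo for vo in site_vos & portal_vos
        if set(site[vo]) != set(portal[vo])
    )
    return {
        "missing": sorted(portal_vos - site_vos),
        "extra": sorted(site_vos - portal_vos),
        "differing": differing,
    }


if __name__ == "__main__":
    # Made-up records, purely for illustration.
    portal = {"vo.a": ["voms1.example.org:15001"], "vo.b": ["voms2.example.org:15002"]}
    site = {"vo.a": ["voms1.example.org:15009"], "vo.c": ["voms3.example.org:15003"]}
    print(compare_records(site, portal))
```

An empty result in all three lists would mean the site records correspond exactly to the portal under this simplified model.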

Note: I'm accepting tips from GridPP core task members etc. about other use cases for these processes. This will be converted to wiki format and made available in the normal way. Next jobs:

  • Review the logic/sequence of the VOMS admin process; document it if it works, fix it if it doesn't.
  • Create a standard baseline for the proxy renewal process, and write it up in the wiki.

Note: I'm accepting tips from other GridPP core team members etc. for document priorities. Please think about where the problems lie (i.e. what costs us yet is easy to fix) and get back to me.


Tuesday, 29th May

  • VOMS records in the GridPP Approved VO list are now up to date with the CIC Portal XML. This can be used by site admins to ensure their site-info.def/vo.d directories are up to date. A tool, SidFormatter, will be released this week to facilitate comparison with the benchmark. A process has been devised to ensure that the GridPP Approved VO list is kept up to date to within a week of CIC Portal changes. Consultation to be made about further fields that we may wish to advertise in this manner.

Friday 27th April

  • Appeal for a volunteer to enhance "Grid User Crash Course" (https://www.gridpp.ac.uk/wiki/Grid_user_crash_course) with simple use case for dependable proxy renewal for long jobs, as this is a recurrent requirement that has caused multiple queries on TB_SUPPORT.
Interoperation - EGI ops agendas

Monday 16th July

  • EMI updates: Updates expected on the 19th: Top BDII, BLAH and gLite-gsoap/gss. Also StoRM for EMI-2 (and WNoDES, if anyone uses that...). A new procedure is in place, so in the event of a Globus lib update any problems ought to be caught by processes in future. Some discussion on compliance with RFC proxies and the SHA2 algorithm.
  • Staged Rollout: There's some UMD1 and UMD2, and priority is going to be given to UMD1, in particular the gLite-gsoap/gss. There's a need to update the SAM tests - they fail on the EMI-2 WN at the moment. Significant discussion on this: there was no change, and it _ought_ to work, so the problem appears to be related to the WN package rather than the test itself. This might be SL6 specific. For the moment, if you update to the EMI-2 WN, you should know about the workaround in https://ggus.org/ws/ticket_info.php?ticket=82899 to get the SAM test passing.
  • Upgrade of gLite 3.1/3.2 components to UMD: All gLite 3.1 components are unsupported, and should be upgraded to UMD stuff by Sept 30th. Only a few gLite 3.2 components are still supported, until Oct 2012 (and should be upgraded by then): FTS, gLExec, LFC, DPM, UI, WN (and VOBox until Apr 2013).


Monitoring - Links MyWLCG

Monday 2nd July

  • DC has almost finished an initial ranking. This will be reviewed by AF/JC and discussed at 10th July ops meeting

Wednesday 6th June

  • Ranking continues. Plan to have a meeting in July to discuss good approaches to the plethora of monitoring available.
  • Glasgow dashboard now packaged and can be downloaded here.
On-duty - Dashboard ROD Rota

Monday 16th July - DB

  • Quiet week. Sheffield is showing an alarm for a machine that's not in the GOCDB, that might be worth following up. QMUL have an alarm for a machine not in production, they are aware of it, but short of closing the alarm every couple of days I haven't got any better suggestion on how to deal with it.


Monday 9th July - KM

  • Quiet week. Nothing to report. Just a reminder that the Dashboard is now getting results from gridppnagios at Lancaster. So if you want to see the raw result for something, go to https://gridppnagios.lancs.ac.uk/nagios


Rollout Status WLCG Baseline

Monday 11th June

  • EMI2 is released but not in Staged Rollout yet. Buyers beware.

Thursday 10th May

  • The CREAM CE and the WMS which were released at the end of April have finally gone into Staged Rollout
  • Call for more sites to take part in EMI-2 rollout tests.
  • The overall SR contributions are in this table.

Friday 27th April

  • Updated version information on rollout page
  • WN scan indicates some sites not keen on OS updates to those nodes.
Security - Incident Procedure Policies

Monday 25th June

  • Rota availability responses slow
  • Is anyone following up on SSC5/6?
  • Stratuslab VM (ex UK)
  • gridftp



Services - PerfSonar dashboard

Monday 16th July

  • As part of GGUS 84008, can UK T2s who have deployed their perfSONAR-PS box ensure they are testing bandwidth against the German Tier-1?

Monday 9th July

  • The perfsonar link has changed to a new production instance (see link above)
  • A couple more sites have been added in the last week.
  • The GridMon boxes are to be returned to Daresbury!
Tickets

Monday 16th of July, 13:30 BST

26 open UK tickets this week, although not many jump out at me.

NEW TIER 1
https://ggus.eu/ws/ticket_info.php?ticket=84270
lcgbdii.gridpp.rl.ac.uk has been selected as a WLCG "recommended top bdii", with all the honour and glory that brings.

DURHAM
https://ggus.eu/ws/ticket_info.php?ticket=84123
Seeing atlas production job failures with a "send2nsd: NS002 - send error : Could not load a security plugin" error. No word since Wednesday, but Mike might be busy.

https://ggus.eu/ws/ticket_info.php?ticket=84066
Availability/Reliability update for June, needs to be done by next Tuesday.

GLASGOW
https://ggus.eu/ws/ticket_info.php?ticket=83943
Despite being in a well organised and well advertised scheduled downtime involving most of their resources, Glasgow are being picked on for failed atlas transfers.

QMUL
https://ggus.eu/ws/ticket_info.php?ticket=83773
The atlas cvmfs related problems seem to have been fixed by Chris upgrading to cvmfs-2.0.18-0.3.3574svn. It looks like this ticket can be closed (unless Chris has something else he wants to investigate).

SUSSEX
https://ggus.eu/ws/ticket_info.php?ticket=81784
Emyr has fixed the certificate issues from last week; he's hoping that the last hurdle has been jumped in the quest to certify Sussex.

VOMS
https://ggus.eu/ws/ticket_info.php?ticket=81784
Chris's request for multiple membership expiry reminders is technically possible (https://ggus.eu/tech/ticket_show.php?ticket=84020), but requires EMI-2 VOMS servers.

Nothing exciting going on on the "Solved Case" or "Tickets from the UK" fronts.

Tools - MyEGI Nagios

Monday 2nd July

  • Switched on the backup Nagios at Lancaster and stopped the Nagios instance at Oxford. Stopping the Nagios instance at Oxford means that it is not sending results to the dashboard and central DB. Keeping a close eye on it and will revert to the original arrangement if any problems are encountered.



VOs - GridPP VOMS VO IDs Approved

Monday 16th July

  • EGI UCST representatives have changed the status of VO neurogrid.incf.org from New to Production.

Wednesday 6th July

  • Cross-checking VOs enabled vs VO table.
  • Surveying VO-admins for problems faced in their VOs.
  • SNO+ have asked about using git to deploy software - what are the options?


Site Updates

Monday 25th June

  • N/A



Meeting Summaries
Project Management Board - Members Minutes Quarterly Reports

Monday 2nd July

  • No meeting. Next PMB on Monday 9th.
GridPP ops meeting - Agendas Actions Core Tasks

Tuesday 8th May - link Agenda Minutes

  • ATLAS DATADISK now being used for production input files
  • Deploy perfsonar-ps (for LHC compatibility), but run a Perfsonar-MDM portal to collate the information
  • Target date for perfsonar-ps at sites is the end of July
  • KeyDocs still not in place for all areas

Tuesday 1st May 2012 - Agenda Minutes

  • Trying the present bulletin as a conduit for information
  • 7 sites below 90% in ATLAS monitoring!
  • Instructions for small VO space-token setup/usage requested
  • EMI WN tarball now available - comment in ticket
  • Not every site running latest SL5 OS
RAL Tier-1 Experiment Liaison Meeting (Wednesday 13:30) Agenda EVO meeting

Wednesday 18th July

  • Operations report
  • CVMFS available for testing by non-LHC VOs (including "stratum 0" facilities).
  • Following hyperthreading tests, the 2011 Dell worker nodes will have their number of jobs increased from 12 to 14.
WLCG Grid Deployment Board - Agendas MB agendas

July meeting Wednesday 11th July

Welcome (Michel Jouvin)

  • September meeting to include IPv6, LS1 and extended run plans
  • EMI-2 WN testing also in September

CVMFS deployment status (Ian Collier)

  • Recap: 78/104 sites for ATLAS – the UK looks good thanks to Alessandra
  • Using two repos for ATLAS. Local shared area will be dropped in future.
  • 36/86 for LHCb. Two WN mounts. Pref for CVMFS – extra work
  • 5 T2s for CMS. Info for sites https://twiki.cern.ch/twiki/bin/view/CMSPublic/CompOpsCVMFS
  • Client with shared cache in testing
  • Looking at NFS client and Mac OS X

Pre-GDB on CE Extensions (Davide Salomoni)

  • https://indico.cern.ch/conferenceDisplay.py?confId=196743
  • Goal – review proposed extensions + focus on whole-node/multi-core set
  • Also agree development plan + timeline for CEs.
  • Fixed cores and variable number of cores + mem requirements. May impact expt. frameworks.
  • Some extra attributes added in Glue 2.0 – e.g. MaxSlotsPerJob
  • JDL. Devel. Interest. Queue level or site level.
  • How. CE implementations. Plan. Actions.

Initial Meeting with EMI, EGI and OSG (Michel Jouvin)

  • ID issues related to end of supporting projects (e.g. EMI)
  • Globus (community?); EMI MW (WLCG); OSG; validation
  • Discussion has not included all stakeholders.

How to identify the best top level BDIIs (Maria Alandes Pradillo)

  • Only 11% are “properly” configured (LCG_GFAL_INFOSYS 1,2,3)
  • UK BDIIs appear in top 20 of ‘most configured’.
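As an illustration of the configuration point above, here is a minimal sketch that parses an LCG_GFAL_INFOSYS-style value and checks whether it lists more than one top-level BDII. It assumes that "properly configured" refers to listing several comma-separated endpoints, which is one reading of the "1,2,3" in the summary; the default port 2170 and the host topbdii.example.org are also assumptions made for the example.

```python
DEFAULT_BDII_PORT = 2170  # assumed default when no port is given


def parse_infosys(value):
    """Split a comma-separated host[:port] list into (host, port) pairs."""
    endpoints = []
    for item in value.split(","):
        item = item.strip()
        if not item:
            continue  # tolerate stray commas
        host, _, port = item.partition(":")
        endpoints.append((host, int(port) if port else DEFAULT_BDII_PORT))
    return endpoints


def has_failover(value):
    """True if the value names at least two distinct BDII hosts."""
    return len({host for host, _ in parse_infosys(value)}) >= 2


if __name__ == "__main__":
    # topbdii.example.org is a placeholder, not a real service.
    value = "lcgbdii.gridpp.rl.ac.uk:2170,topbdii.example.org:2170"
    print(parse_infosys(value), has_failover(value))
```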

MUPJ – gLExec update (Maarten Litmaath)

  • ‘glexec’ flag in GOCDB for each supporting CE
  • http://cern.ch/go/PX7p (so far… T1, Brunel, IC-HEP, Liv, Man, Glasgow, Ox, RALPP)
  • Improved instructions: https://twiki.cern.ch/twiki/bin/view/LCG/GlexecDeployment
  • CMS ticketing sites. Working on GlideinWMS.

WG on Storage Federations (Fabrizio Furano)

  • Federated access to data – clarify what needs supporting
  • ‘fail over’ for jobs; ‘repair mechanisms’; access control
  • So far XROOTD clustering through WAN = natural solution
  • Setting up group.

DPM Collaboration – Motivation and proposal (Oliver Keeble)

  • Context. Why. Who…
  • UK is 3rd largest user (by region/country)
  • Section on myths: DPM has had investment. Not only for small sites…
  • New features: HTTP/WebDAV, NFSv4.1, Perfsuite…
  • Improvements with xrootd plugin
  • Looking for stakeholders to express interest … expect proposal shortly
  • Possible model: 3-5 MoU or ‘maintain’

Update on SHA-2 and RFC proxy support

  • IGTF wish CAs to move to SHA-2 signatures ASAP. For WLCG this means using RFC proxies in place of the current Globus legacy proxies.
  • dCache & BeStMan may look at the EMI Common Authentication Library (CANL), which supports SHA-2 with legacy proxies.
  • IGTF aim for Jan 2013 (it then takes 395 days for SHA-1 to disappear)
  • Concern about timeline (LHC run now extended)
  • Status: https://twiki.cern.ch/twiki/bin/view/LCG/RFCproxySHA2support
  • Plan: deployed software to support RFC proxies (Summer 2013) and SHA-2 (except dCache/BeStMan – Summer 2013). Introduce SHA-2 CAs Jan 2014.
  • Plan B – short-lived WLCG catch-all CA
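To make the "395 days" figure above concrete, the short calculation below adds 395 days to a switch-over date. It assumes the 395 days is the lifetime of the last SHA-1-signed certificate issued before the switch, and it uses 1st January 2013 as the switch date purely for illustration; the summary only says "Jan 2013".

```python
from datetime import date, timedelta

SHA1_LIFETIME_DAYS = 395  # figure quoted in the summary above


def last_sha1_expiry(switch_date, lifetime_days=SHA1_LIFETIME_DAYS):
    """Date on which a certificate issued on switch_date would expire."""
    return switch_date + timedelta(days=lifetime_days)


if __name__ == "__main__":
    # 1 Jan 2013 is an assumed switch date, not one stated in the summary.
    print(last_sha1_expiry(date(2013, 1, 1)))  # 2014-01-31
```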

ARGUS Authorization Service (Valery Tschopp)

  • Authorisation examples & ARGUS motivation (many services, global banning, policies static). Can user X perform action Y on resource Z?
  • ARGUS built on top of a XACML policy engine
  • PAP = Policy Administration Point. Tool to author policies.
  • PDP = Policy Decision Point (evaluates requests)
  • PEP = Policy Enforcement Point (reformats requests)
  • Hide XACML with Simplified Policy Language (SPL)
  • Central banning = Hierarchical policy distribution
  • Pilot job authorization – gLExec executes payload on WN https://twiki.cern.ch/twiki/bin/view/EGEE/AuthorizationFramework
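The question "can user X perform action Y on resource Z" can be illustrated with a toy decision function. This is a sketch of the general idea only and does not use ARGUS, XACML or SPL: the `Rule` class, the first-match ordering, the default deny and the example DNs and host name are all assumptions made for the example.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Rule:
    effect: str    # "permit" or "deny"
    subject: str   # user DN, or "*" for anyone
    action: str    # action name, or "*"
    resource: str  # resource name, or "*"


def _matches(pattern, value):
    return pattern == "*" or pattern == value


def decide(rules, subject, action, resource):
    """Toy decision point: the first applicable rule wins, otherwise deny."""
    for rule in rules:
        if (_matches(rule.subject, subject)
                and _matches(rule.action, action)
                and _matches(rule.resource, resource)):
            return rule.effect
    return "deny"


if __name__ == "__main__":
    # A ban listed first overrides the later, more general permit.
    rules = [
        Rule("deny", "/DC=org/CN=banned user", "*", "*"),
        Rule("permit", "*", "run-payload", "wn.example.org"),
    ]
    print(decide(rules, "/DC=org/CN=banned user", "run-payload", "wn.example.org"))
    print(decide(rules, "/DC=org/CN=alice", "run-payload", "wn.example.org"))
```

Placing the ban rule ahead of site-level rules mirrors, in a very simplified way, the idea of central banning being distributed down a policy hierarchy.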

Operations Coordination Team (Maria Girone)

  • Mandate – addresses needs in WLCG service coordination recommendations & commissioning of OPS and Tools.
  • Establish core teams of experts to validate, commission and troubleshoot services.
  • Team goals: understand services needed; monitor health; negotiate configs; commission new services; help with transitions.
  • Team roles: core members (sites, regions, expt., services) + targeted experts
  • Tasks: CVMFS, Perfsonar, gLExec

Jobs with High Memory Profiles

  • See expt reports.



NGI UK - Homepage CA

Monday 2nd July

  • Next meeting is on 9th July.
Events

WLCG workshop - 19th-20th May (NY) Information

CHEP 2012 - 21st-25th May (NY) Agenda

UK ATLAS - Shifter view News & Links

Thursday 21st June

  • Over the last few months ATLAS have been testing their job recovery mechanism at RAL and a few other sites. This is something that was 'implemented' before but never really worked properly. It now appears to be working well, allowing jobs to finish even if the SE is down or unstable when the job finishes.
  • Job recovery works by writing the output of the job to a directory on the WN should it fail when writing the output to the SE. Subsequent pilots will check this directory and try again for a period of 3 hours. If you would like to have job recovery activated at your site you need to create a directory which (atlas) jobs can write to. I would also suggest that this directory has some form of tmpwatch enabled on it which clears up files and directories older than 48 hours. Evidence from RAL suggests that normally only 1 or 2 jobs are ever written to the space at a time and the space used is normally less than a GB. I have not observed more than 10GB being used. Once you have created this space, please email atlas-support-cloud-uk at cern.ch with the directory (and your site!) and we can add it to the ATLAS configurations. We can switch off job recovery at any time if it does cause a problem at your site. Job recovery would only be used for production jobs, as users complain if they have to wait a few hours for things to retry (even if it would save them time overall...)
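For sites that prefer a script to tmpwatch, the 48-hour clean-up suggested above could be sketched roughly as follows. This is an illustrative sketch and not part of the ATLAS tooling: the function names are made up, the recovery directory path is whatever the site chooses, and age is judged by modification time, which is an assumption.

```python
import os
import time

MAX_AGE_SECONDS = 48 * 3600  # 48 hours, as suggested above


def find_stale(root, max_age=MAX_AGE_SECONDS, now=None):
    """Return sorted paths of files under root not modified within max_age."""
    now = time.time() if now is None else now
    stale = []
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            path = os.path.join(dirpath, name)
            if now - os.path.getmtime(path) > max_age:
                stale.append(path)
    return sorted(stale)


def clean(root, max_age=MAX_AGE_SECONDS, now=None):
    """Delete stale files, then remove old sub-directories left empty."""
    now = time.time() if now is None else now
    # Record old directories first: deleting a file updates its parent's mtime.
    # Bottom-up order means nested directories are handled before their parents.
    old_dirs = [
        d for d, _subdirs, _files in os.walk(root, topdown=False)
        if d != root and now - os.path.getmtime(d) > max_age
    ]
    removed = find_stale(root, max_age, now)
    for path in removed:
        os.remove(path)
    for d in old_dirs:
        if not os.listdir(d):
            os.rmdir(d)
    return removed
```

Running `clean()` on the recovery directory from cron would approximate the tmpwatch behaviour; the top-level directory itself is never removed, so jobs can keep writing to it.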
UK CMS

Tuesday 24th April

  • Brunel will be trialling CVMFS this week, will be interesting. RALPP doing OK with it.
UK LHCb

Tuesday 24th April

  • Things are running smoothly. We are going to run a few small scale tests of new codes. This will also run at T2, one UK T2 involved. Then we will soon launch new reprocessing of all data from this year. CVMFS update from last week; fixes cache corruption on WNs.
UK OTHER

Thursday 21st June - JANET6

  • JANET6 meeting in London (agenda)
  • Spend of order £24M for strategic rather than operational needs.
  • Recommendations to BIS shortly
  • Requirements: bandwidth, flexibility, agility, cost, service delivery - reliability & resilience
  • Core presently 100Gb/s backbone. Looking to 400Gb/s and later 1Tb/s.
  • Reliability limited by funding not ops so need smart provisioning to reduce costs
  • Expecting a 'data deluge' (ITER; EBI; EVLBI; JASMIN)
  • Goal of dynamic provisioning
  • Looking at ubiquitous connectivity via ISPs
  • Contracts were 10yrs wrt connection and 5yrs transmission equipment.
  • Current native capacity 80 channels of 100Gb/s per channel
  • Fibre procurement for next phase underway (standard players) - 6400km fibre
  • Transmission equipment also at tender stage
  • Industry engagement - Glaxo case study.
  • Extra requirements: software coding, security, domain knowledge.
  • Expect genome data usage to explode in 3-5yrs.
  • Licensing is a clear issue
To note

Tuesday 26th June