Operations Bulletin 300712

From GridPP Wiki

Bulletin archive


Week commencing 23rd July 2012
Task Areas
General updates

Monday 23rd July

  • WLCG has published its list of the top 20 top-level BDIIs.
  • EGI has released a dashboard view (with history) of site availabilities and reliabilities.
  • Tier-2 quarterly reports were due last week.


Monday 16th July

  • Site coordinates in the GOCDB - these only need to be approximate to locate the city/town.
  • For those concerned with VOMS (admins and VO admins), the recent VOMS tutorials arranged by EGI are now available as slides and a webinar from the meeting agenda page.
  • From 13th July: Setting up a CVMFS stratum-0 repository framework for small VOs at RAL Tier-1. Progress has been made and we are now in a position to ask for volunteers.
  • From 12th July: UMD v2.0.0 was released today. This is the second major release of UMD (Unified Middleware Distribution) made available for the EGI production infrastructure. It introduces support for Scientific Linux 6 and Debian 6.

Monday 2nd July

  • Tier-2 quarterly reports requested by Wednesday 18th July
  • There is now an LHCb CVMFS sites map
Tier-1 - Status Page

Tuesday 24th July

  • CVMFS available for testing by non-LHC VOs (including "stratum 0" facilities).
  • CMS switched to use CVMFS
  • Following the enabling of hyperthreading, one batch of worker nodes has had its job count increased from 12 to 14.
  • We continue to test a 'whole node' job queue (ten nodes available).
  • We have a new test monitoring the memory usage of the pbs_server. Rises in this usage correlate with the occurrence of a communication problem between the CEs and the batch server ('Batch Protocol Error' in the logs).
  • In the early hours of Sunday 22nd July a PSU on a disk server (gdss102) failed. This caused the PDU to trip, which in turn took out a network switch. Staff attended on site and the problem was resolved around 06:00.
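The pbs_server memory check above could be as simple as a periodic sample of the daemon's resident set size, logged for later correlation with 'Batch Protocol Error' entries. A minimal sketch follows; the process name matches the Torque daemon, but the log path is an illustrative assumption, not RAL's actual setup.

```shell
#!/bin/sh
# Hedged sketch: record pbs_server resident memory (kB) with a timestamp,
# so growth can later be correlated with 'Batch Protocol Error' log entries.
# LOG path is a hypothetical example.
LOG=${LOG:-/tmp/pbs_server_rss.log}
pid=$(pgrep -x pbs_server | head -n 1)
if [ -n "$pid" ]; then
    rss_kb=$(ps -o rss= -p "$pid" | tr -d ' ')
    echo "$(date '+%Y-%m-%dT%H:%M:%S') rss_kb=$rss_kb" >> "$LOG"
fi
```

Run from cron every few minutes; a steadily climbing rss_kb column ahead of the CE errors would support the correlation.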
Storage & Data Management - Agendas/Minutes

Considering input to a community support model for DPM and possible alternatives.

Tuesday 24th July

  • Sam testing xrootd redirection on Glasgow test cluster - going well.

Wednesday 18th July

  • SNO+ needs/plans for how to "blind" data for analysis on grid resources.
  • Plans for sites to finally upgrade/decommission gLite 3.1 storage services ahead of the 1st October deadline.
  • gLite to EMI upgrade path.
  • Issues regarding PXE booting from 10G cards.
  • Plan to discuss the UK role in community support of DPM.


Accounting - UK Grid Metrics HEPSPEC06

Wednesday 18th July - Core-ops

  • Still need definitive statement on disk situation and SL ATLAS accounting conclusions.
  • Sites should again check Steve's HS06 page.


Wednesday 6th June - Core-ops

  • Request sites to publish HS06 figures from new kit to this page.
  • Please would all sites check the HS06 numbers they publish. Will review in detail on 26th June.

Friday 11th May - HEPSYSMAN

  • Discussion on HS06 reminding sites to publish using results from 32-bit mode benchmarking. A reminder for new kit results to be posted to the HS06 wiki page. See also the blog article by Pete Gronbech. The HEPiX guidelines for running the benchmark tests are at this link.
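As a reminder of what "32-bit mode" means in practice: the HEPiX prescription builds the SPEC CPU2006 all_cpp benchmark set with gcc in 32-bit mode. The excerpt below is in the style of the HEPiX linux32-gcc_cern.cfg configuration; treat it as a memory aid and check the linked guidelines for the authoritative config before publishing results.

```
# In the style of the HEPiX linux32-gcc_cern.cfg SPEC CPU2006 config
# (the linked guidelines are authoritative; this is a reminder only):
COPTIMIZE   = -O2 -fPIC -pthread -m32
CXXOPTIMIZE = -O2 -fPIC -pthread -m32
FOPTIMIZE   = -O2 -fPIC -pthread -m32
```

The benchmark is then driven with runspec against that config, and the per-node HS06 figure posted to the wiki page.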


Documentation - KeyDocs

Thursday 26th July

All the "site update pages" have been reconfigured from a topic oriented structure into a site oriented structure. This is available to view at https://www.gridpp.ac.uk/wiki/Separate_Site_Status_Pages#Site_Specific_Pages

Please do not edit these pages yet - any changes would be lost when we refine the template. Any comments gratefully received, contact: sjones@hep.ph.liv.ac.uk

Wednesday 18th July - Core ops

  • Review of stale documents in coming weeks
  • Plan to setup a site template page to cover 'special' site update pages that are no longer updated

Tuesday 3rd July

  • Started a page on stale documents. Please update this page if you find documents or pages that need attention.


Wednesday, 6th June

Released a document, hep.ph.liv.ac.uk/~sjones/VomsSnooper.odt, that describes how to

  • Maintain site VOMS info document for the approved VOs
  • Check a site's VOMS records correspond exactly with CIC portal
  • Create new site VOMS records direct from CIC portal, without manual transcription
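The record-checking step above amounts to a normalised diff between what the site publishes and what the CIC portal says. The sketch below illustrates the idea in plain shell; the file locations and the VOMS_SERVERS key are illustrative examples, and VomsSnooper/SidFormatter do this properly rather than with grep.

```shell
#!/bin/sh
# Illustrative only: compare the VOMS_SERVERS lines in a site's vo.d
# directory against a reference file derived from CIC portal data.
# Both paths are hypothetical examples, not a real deployment layout.
SITE_VOD=${SITE_VOD:-/etc/vo.d}
REFERENCE=${REFERENCE:-/tmp/cic_reference.txt}

grep -h "^VOMS_SERVERS" "$SITE_VOD"/* 2>/dev/null | sort > /tmp/site_voms.txt
sort "$REFERENCE" 2>/dev/null > /tmp/ref_voms.txt
# Any output here is a record that differs between site and portal.
diff -u /tmp/ref_voms.txt /tmp/site_voms.txt
```

An empty diff means the site's records correspond exactly with the portal; anything else is a candidate for manual follow-up.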

Note: I'm accepting tips from GridPP core task members etc. about other use cases for these processes. This will be converted to wiki format and made available in the normal way. Next jobs:

  • Review the logic/sequence of the VOMS admin process; document it if it works, fix it if it doesn't.
  • Create a standard baseline for the proxy renewal process, and write it up in the wiki.

Note: I'm accepting tips from other GridPP core team members etc. for document priorities. Please think about where the problems lie (i.e. what costs us yet is easy to fix) and get back to me.


Tuesday, 29th May

  • VOMS records in the GridPP Approved VO list are now up to date with the CIC Portal XML. This can be used by site admins to ensure their site-info.def/vo.d directories are up to date. A tool, SidFormatter, will be released this week to facilitate comparison with the benchmark. A process has been devised to ensure that the GridPP Approved VO list is kept up to date to within a week of CIC Portal changes. Consultation to be made about further fields that we may wish to advertise in this manner.

Friday 27th April

  • Appeal for a volunteer to enhance "Grid User Crash Course" (https://www.gridpp.ac.uk/wiki/Grid_user_crash_course) with simple use case for dependable proxy renewal for long jobs, as this is a recurrent requirement that has caused multiple queries on TB_SUPPORT.
Interoperation - EGI ops agendas

Monday 16th July

  • EMI updates: updates expected on the 19th: Top BDII, BLAH and gLite-gsoap/gss. Also StoRM for EMI-2 (and WNoDES, if anyone uses that...). A new procedure is in place for Globus library updates, so any problems ought to be caught by the process in future. Some discussion on compliance with RFC proxies and the SHA-2 algorithm.
  • Staged Rollout: there's some UMD1 and UMD2, and priority is going to be given to UMD1, in particular gLite-gsoap/gss. There's a need to update the SAM tests - they fail on the EMI-2 WN at the moment. Significant discussion on this; there was no change, and it _ought_ to work, so the problem appears to be related to the WN package rather than the test itself. This might be SL6 specific. For the moment, if you update to the EMI-2 WN, you should know about the workaround in https://ggus.org/ws/ticket_info.php?ticket=82899 to get the SAM test passing.
  • Upgrade of gLite 3.1/3.2 components to UMD: all gLite 3.1 components are unsupported and should be upgraded to UMD components by Sept 30th. Only a few gLite 3.2 components are still supported, until Oct 2012 (and should be upgraded by then): FTS, gLExec, LFC, DPM, UI, WN (and VOBox until Apr 2013).


Monitoring - Links MyWLCG

Monday 2nd July

  • DC has almost finished an initial ranking. This will be reviewed by AF/JC and discussed at 10th July ops meeting

Wednesday 6th June

  • Ranking continues. Plan to have a meeting in July to discuss good approaches to the plethora of monitoring available.
  • Glasgow dashboard now packaged and can be downloaded here.
On-duty - Dashboard ROD Rota

Monday 23rd July - KM

  • Quiet week. Nothing to report.


Monday 16th July - DB

  • Quiet week. Sheffield is showing an alarm for a machine that's not in the GOCDB, that might be worth following up. QMUL have an alarm for a machine not in production, they are aware of it, but short of closing the alarm every couple of days I haven't got any better suggestion on how to deal with it.


Monday 9th July - KM

  • Quiet week. Nothing to report. Just a reminder that the Dashboard is now getting results from gridppnagios at Lancaster. So if you want to see the raw results for something, go to https://gridppnagios.lancs.ac.uk/nagios


Rollout Status WLCG Baseline

Wednesday 18th July - Core ops

  • Sites (that needed a tarball install) will need to work on own glexec installs
  • Reminder that gLite 3.1 no longer supported. 3.2 support is also decreasing. Need to push for EMI.

Monday 11th June

  • EMI2 is released but not in Staged Rollout yet. Buyers beware.

Thursday 10th May

  • The CREAM CE and the WMS which were released at the end of April have finally gone into Staged Rollout.
  • Call for more sites to take part in EMI-2 rollout tests.
  • The overall SR contributions are in this table.

Friday 27th April

  • Updated version information on rollout page
  • WN scan indicates some sites not keen on OS updates to those nodes.
Security - Incident Procedure Policies

Monday 23rd July

  • WMS vulnerabilities identified. Sites will have been contacted. Please respond to tickets ASAP.


Monday 25th June

  • Rota availability responses slow
  • Is anyone following up on SSC5/6?
  • Stratuslab VM (ex UK)
  • gridftp



Services - PerfSonar dashboard

Monday 23rd July

  • Sheffield added to perfsonar matrix. Glasgow almost ready.
  • QMUL would like to test with sites using MTU >1500.

Monday 16th July

  • As part of GGUS 84008, can UK T2s who have deployed their perfSONAR-PS box ensure they are testing bandwidth against the German Tier-1.

Monday 9th July

  • The perfsonar link has changed to a new production instance (see link above)
  • A couple more sites have been added in the last week.
  • The GridMon boxes are to be returned to Daresbury!
Tickets

Monday 23rd July, 23:00 BST by Jeremy

19 tickets in the open state.

LANCASTER
https://ggus.eu/ws/ticket_info.php?ticket=84461 Failing transfers for t2k.org. (23/07)

RAL TIER-1
https://ggus.eu/ws/ticket_info.php?ticket=84408 Enable neurogrid.incf.org on WMS and LFC (in progress 20/07)
https://ggus.eu/ws/ticket_info.php?ticket=84270 To confirm addition of lcgbdii.gridpp.rl.ac.uk to the WLCG recommended Top BDII list: https://tomtools.cern.ch/confluence/display/IS/WLCG_Support_Proposal
https://ggus.eu/ws/ticket_info.php?ticket=83927 snoplus glite-transfer permissions issue. Looks like FTS channels were not configured. Suggestions sent back for endpoints (19/07).
https://ggus.eu/ws/ticket_info.php?ticket=68853 (22/03/2011) Retirement of SL4 and 32-bit head nodes and servers. On hold but still valid 17/07.

RHUL
https://ggus.eu/ws/ticket_info.php?ticket=83627 (27/06) Biomed – SE reporting invalid used space. Work in progress! (20/07)

GLASGOW
https://ggus.eu/ws/ticket_info.php?ticket=83283 (14/06) LHCb job failures. Related to CVMFS timeouts? (https://savannah.cern.ch/bugs/index.php?95420) (09/07). Put on hold?

NGS
https://ggus.eu/ws/ticket_info.php?ticket=83213 (12/06) Decommissioning of CE03. Ticket to ngs.ac.uk VO. Close?

IMPERIAL
https://ggus.eu/ws/ticket_info.php?ticket=82946 (07/06) Possible CVMFS issue for ATLAS. Cache problem? (19/07)

MANCHESTER
https://ggus.eu/ws/ticket_info.php?ticket=82492 (24/05) VOMS server AUP re-signing requests. Reopened 11/07 – multiple reminders should be possible. Put on hold?

NGI_UK
https://ggus.eu/ws/ticket_info.php?ticket=84381 New VO for the COMET experiment (proposed name comet.j-parc.jp)
https://ggus.eu/ws/ticket_info.php?ticket=81784 Certification of UKI-SOUTHGRID-SUSX (01/05). Jobs stay running. (23/07)
https://ggus.eu/ws/ticket_info.php?ticket=80259 (14/03) Creation of neuroscience VO. Waiting on WMS enablement (20/07). Also adding to GGUS.

BRISTOL
https://ggus.eu/ws/ticket_info.php?ticket=80155 (12/03) Timeline for SE upgrade/decommissioning. On hold. Retire v1.3: “Ideally before end of August we hope” (09/07). Brian to comment…

ECDF
https://ggus.eu/ws/ticket_info.php?ticket=80152 (12/03) Timeline for SE upgrade/decommissioning. On hold. Waiting for release? (09/07)

DURHAM
https://ggus.eu/ws/ticket_info.php?ticket=84123 Job failures at site (open 11/07). WNs put offline. Site forced into test mode.
https://ggus.eu/ws/ticket_info.php?ticket=83950 CVMFS problem. Squid server had fallen over. Waiting for reply.
https://ggus.eu/ws/ticket_info.php?ticket=75488 (19/10/11) Authentication problems in some CEs (compchem VO). Lots of chasing on this ticket! Mark M checking as of 18/07.
https://ggus.eu/ws/ticket_info.php?ticket=68859 (22/03/11) Retirement of SL4 and 32-bit DPM head node and servers. Lots of chasing on this ticket. Still valid 17/07.


Solved cases not reviewed.

Tools - MyEGI Nagios

Tuesday 25th July

Gridppnagios at Lancaster will remain the main Nagios instance until further announcement. KM is writing down the procedure for switching over to the backup Nagios in case of emergency: https://www.gridpp.ac.uk/wiki/Backup_Regional_Nagios . KM is now away on a one-month holiday and may not be able to reply to emails. New email address for Nagios matters: gridppnagios-admin at physics.ox.ac.uk for any questions or information regarding the regional Nagios. Currently this mail goes to Ewan and Kashif.

Monday 2nd July

  • Switched on backup Nagios at Lancaster and stopped Nagios instance at Oxford. Stopping Nagios instance at Oxford means that it is not sending results to the dashboard and central DB. Keeping a close eye on it and will revert it back to original position if any problems encountered.



VOs - GridPP VOMS VO IDs Approved

Monday 23rd July

  • CW requested feedback from non-LHC VOs on issues
  • Proxy renewal issues thought to be resolved on all services except FTS - a new version may solve problems on that service.
  • Steve's VOMS snooper application is picking up many site VO config problems. We should review all sites in turn.

Monday 16th July

  • EGI UCST representatives have changed the neurogrid.incf.org VO status from New to Production.

Wednesday 6th July

  • Cross-checking VOs enabled vs VO table.
  • Surveying VO-admins for problems faced in their VOs.
  • SNO+ have asked about using git to deploy software - what are the options?


Site Updates

Tuesday 24th July

  • GLASGOW: In recovery mode at the moment; will be bringing equipment back up to check it after the air conditioning failure at the weekend.



Meeting Summaries
Project Management Board - MembersMinutes Quarterly Reports

Monday 2nd July

  • No meeting. Next PMB on Monday 9th.
GridPP ops meeting - Agendas Actions Core Tasks

Tuesday 8th May - link Agenda Minutes

  • ATLAS DATADISK now being used for production input files
  • Deploy perfsonar-ps (for LHC compatibility), but run a Perfsonar-MDM portal to collate the information
  • Target date for perfsonar-ps at sites is the end of July
  • KeyDocs still not in place for all areas

Tuesday 1st May 2012 - Agenda Minutes

  • Trying the present bulletin as a conduit for information
  • 7 sites below 90% in ATLAS monitoring!
  • Instructions for small VO space-token setup/usage requested
  • EMI WN tarball now available - comment in ticket
  • Not every site running latest SL5 OS
RAL Tier-1 Experiment Liaison Meeting (Wednesday 13:30) Agenda EVO meeting

Wednesday 25th July

  • Operations report
  • CVMFS available for testing by non-LHC VOs (including "stratum 0" facilities).
  • Continue with hyperthreading tests. One batch (2011 Dell worker nodes) has had its job count increased from 12 to 14.
WLCG Grid Deployment Board - Agendas MB agendas

July meeting Wednesday 11th July

Welcome (Michel Jouvin)
  • September meeting to include IPv6, LS1 and extended run plans
  • EMI-2 WN testing also in September

CVMFS deployment status (Ian Collier)
  • Recap: 78/104 sites for ATLAS – the UK looks good thanks to Alessandra
  • Using two repos for ATLAS. Local shared area will be dropped in future.
  • 36/86 for LHCb. Two WN mounts. Preference for CVMFS – extra work
  • 5 T2s for CMS. Info for sites: https://twiki.cern.ch/twiki/bin/view/CMSPublic/CompOpsCVMFS
  • Client with shared cache in testing
  • Looking at NFS client and Mac OS X

Pre-GDB on CE Extensions (Davide Salomoni)
  • https://indico.cern.ch/conferenceDisplay.py?confId=196743
  • Goal – review proposed extensions + focus on whole-node/multi-core set
  • Also agree development plan + timeline for CEs.
  • Fixed cores and variable number of cores + mem requirements. May impact expt. frameworks.
  • Some extra attributes added in Glue 2.0 – e.g. MaxSlotsPerJob
  • JDL. Devel. Interest. Queue level or site level.
  • How. CE implementations. Plan. Actions.

Initial Meeting with EMI, EGI and OSG (Michel Jouvin)
  • ID issues related to end of supporting projects (e.g. EMI)
  • Globus (community?); EMI MW (WLCG); OSG; validation
  • Discussion has not included all stakeholders.

How to identify the best top level BDIIs (Maria Alandes Pradillo)
  • Only 11% are “properly” configured (LCG_GFAL_INFOSYS 1,2,3)
  • UK BDIIs appear in top 20 of ‘most configured’.

MUPJ – gLExec update (Maarten Litmaath)
  • ‘glexec’ flag in GOCDB for each supporting CE
  • http://cern.ch/go/PX7p (so far… T1, Brunel, IC-HEP, Liv, Man, Glasgow, Ox, RALPP)
  • Improved instructions: https://twiki.cern.ch/twiki/bin/view/LCG/GlexecDeployment
  • CMS ticketing sites. Working on GlideinWMS.

WG on Storage Federations (Fabrizio Furano)
  • Federated access to data – clarify what needs supporting
  • ‘fail over’ for jobs; ‘repair mechanisms’; access control
  • So far XROOTD clustering through WAN = natural solution
  • Setting up group.

DPM Collaboration – Motivation and proposal (Oliver Keeble)
  • Context. Why. Who…
  • UK is 3rd largest user (by region/country)
  • Section on myths: DPM has had investment. Not only for small sites…
  • New features: HTTP/WebDAV, NFSv4.1, Perfsuite…
  • Improvements with xrootd plugin
  • Looking for stakeholders to express interest… expect proposal shortly
  • Possible model: 3-5 MoU or ‘maintain’

Update on SHA-2 and RFC proxy support
  • IGTF wish CAs -> SHA-2 signatures ASAP. For WLCG this means using RFC proxies in place of the current Globus legacy proxies.
  • dCache & BeStMan may look at the EMI Common Authentication Library (CANL) – supports SHA-2 with legacy proxies.
  • IGTF aim for Jan 2013 (then takes 395 days for SHA-1 to disappear)
  • Concern about timeline (LHC run now extended)
  • Status: https://twiki.cern.ch/twiki/bin/view/LCG/RFCproxySHA2support
  • Plan: deployed SW supports RFC proxies (Summer 2013) and SHA-2 (except dCache/BeStMan – Summer 2013). Introduce SHA-2 CAs Jan 2014.
  • Plan B – short-lived WLCG catch-all CA

ARGUS Authorization Service (Valery Tschopp)
  • Authorisation examples & ARGUS motivation (many services, global banning, static policies). Can user X perform action Y on resource Z?
  • ARGUS is built on top of a XACML policy engine
  • PAP = Policy Administration Point (tool to author policies)
  • PDP = Policy Decision Point (evaluates requests)
  • PEP = Policy Execution Point (reformats requests)
  • XACML hidden behind a Simplified Policy Language (SPL)
  • Central banning = hierarchical policy distribution
  • Pilot job authorization – gLExec executes payload on WN: https://twiki.cern.ch/twiki/bin/view/EGEE/AuthorizationFramework

Operations Coordination Team (Maria Girone)
  • Mandate – addresses needs in WLCG service coordination recommendations & commissioning of OPS and Tools.
  • Establish core teams of experts to validate, commission and troubleshoot services.
  • Team goals: understand services needed; monitor health; negotiate configs; commission new services; help with transitions.
  • Team roles: core members (sites, regions, expt., services) + targeted experts
  • Tasks: CVMFS, Perfsonar, gLExec

Jobs with High Memory Profiles
  • See expt reports.



NGI UK - Homepage CA

Monday 2nd July

  • Next meeting is on 9th July.
Events

WLCG workshop - 19th-20th May (NY) Information

CHEP 2012 - 21st-25th May (NY) Agenda

UK ATLAS - Shifter view News & Links

Thursday 21st June

  • Over the last few months ATLAS have been testing their job recovery mechanism at RAL and a few other sites. This is something that was 'implemented' before but never really worked properly. It now appears to be working well, allowing jobs to finish even if the SE is down/unstable when the job finishes.
  • Job recovery works by writing the output of the job to a directory on the WN should it fail when writing the output to the SE. Subsequent pilots will check this directory and try again for a period of 3 hours. If you would like to have job recovery activated at your site you need to create a directory to which (atlas) jobs can write. I would also suggest that this directory has some form of tmpwatch enabled on it which clears up files and directories older than 48 hours. Evidence from RAL suggests that it's normally only 1 or 2 jobs that are ever written to the space at a time, and the space used is normally less than a GB; I have not observed more than 10GB being used. Once you have created this space, email atlas-support-cloud-uk at cern.ch with the directory (and your site!) and we can add it to the ATLAS configurations. We can switch off job recovery at any time if it causes a problem at your site. Job recovery is only used for production jobs, as users complain if they have to wait a few hours for things to retry (even if it would save them time overall...)
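The setup described above is just a world-writable scratch directory plus a periodic clean-up of anything older than 48 hours. A minimal sketch, assuming a hypothetical path (your site will pick its own, and tmpwatch can replace the find where available):

```shell
#!/bin/sh
# Hedged sketch of the job-recovery area described above.
# The path is illustrative, not an agreed convention.
RECOVERY_DIR=${RECOVERY_DIR:-/tmp/atlas-job-recovery}
mkdir -p "$RECOVERY_DIR"
chmod 1777 "$RECOVERY_DIR"   # writable by jobs, sticky bit set
# Run daily from cron: remove anything older than 48 hours (2880 min).
find "$RECOVERY_DIR" -mindepth 1 -mmin +2880 -delete
```

The directory path (and your site name) is what you would then mail to atlas-support-cloud-uk for inclusion in the ATLAS configurations.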
UK CMS

Tuesday 24th April

  • Brunel will be trialling CVMFS this week, will be interesting. RALPP doing OK with it.
UK LHCb

Tuesday 24th April

  • Things are running smoothly. We are going to run a few small scale tests of new codes. This will also run at T2, one UK T2 involved. Then we will soon launch new reprocessing of all data from this year. CVMFS update from last week; fixes cache corruption on WNs.
UK OTHER

Thursday 21st June - JANET6

  • JANET6 meeting in London (agenda)
  • Spend of order £24M for strategic rather than operational needs.
  • Recommendations to BIS shortly
  • Requirements: bandwidth, flexibility, agility, cost, service delivery - reliability & resilience
  • Core presently 100Gb/s backbone. Looking to 400Gb/s and later 1Tb/s.
  • Reliability limited by funding not ops so need smart provisioning to reduce costs
  • Expecting a 'data deluge' (ITER; EBI; EVLBI; JASMIN)
  • Goal of dynamic provisioning
  • Looking at ubiquitous connectivity via ISPs
  • Contracts were 10yrs wrt connection and 5yrs transmission equipment.
  • Current native capacity 80 channels of 100Gb/s per channel
  • Fibre procurement for next phase underway (standard players) - 6400km fibre
  • Transmission equipment also at tender stage
  • Industry engagement - Glaxo case study.
  • Extra requirements: software coding, security, domain knowledge.
  • Expect genome data usage to explode in 3-5yrs.
  • Licensing is a clear issue
To note

Tuesday 26th June