Operations Bulletin 170912

Bulletin archive

Week commencing 10th September 2012

Task Areas

General updates

Tuesday 11th September

CVMFS client testing (See Ian's message from today)

Monday 10th September

The WLCG availability report for August is online.
There is a GDB this week. Please take a look at the agenda.
A reminder that there is now a security team behind the UKNGI-SECURITY email (See Linda's email from Monday)
SSC6 participation was good. Thank you to the sites that were involved and to Alessandra Forti for coordinating for the security team.
The WLCG Operations Team kick-off meeting will now be on Monday 24th September 13:00-16:00 BST (possible team topics: CVMFS deployment, glexec deployment completion, perfsonar deployment, squid monitoring, operational aspects of middleware deployment, FTS3 deployment, tracking tools evolution, SHA-2 migration and WMS decommissioning). More site admins are encouraged to participate.

Friday 7th September

Discussion required on identifying the active Nagios instance.

Tuesday 4th September

Request to update networking information in the GridPP wiki
Non-LHC VOs and testing of EMI WNs (GridPP test clusters seem to prefer EMI-2 SL5)
gLite support calendar has been updated (see interoperation section) - extensions to 30th November.
GridPP list of Technology Development contributions
Upcoming GDB (agenda). There is still a call for participation in the WLCG Operation Coordination Team and a kick-off meeting will be held on 20th September.
Reminder of EGI GPGPU questionnaire. The deadline is 13th September.

Tier-1 - Status Page

Tuesday 11th September

Problem overnight Wed/Thu (5/6 Sep). One of the pair of uplinks to switch stack was failing intermittently. Caused problems accessing one batch of disk servers and some worker nodes.
LHCb moved to use T10KC tapes - all going OK.
Continuing test of hyperthreading.
Continue with ten EMI-2 SL-5 worker nodes in normal production.
Getting close to Castor 2.1.12 update. Plan to upgrade LHCb today backed out as not all issues resolved in time.

Storage & Data Management - Agendas/Minutes

Friday 7th September

The GridPP response to the DPM community proposal will be discussed again at the PMB this coming Friday (14th). Those involved please take one final look at the document being circulated. If you do not have the document to view but would like to see it please let Jeremy or Jens know.

Wednesday 29th August

Update on DPM support plan - aim to define tasks, then look for "volunteers"
Planning ahead for coming events - particularly GridPP29
Volunteers for ATLAS job recovery?

Accounting - UK Grid Metrics HEPSPEC06

Wednesday 18th July - Core-ops

Still need definitive statement on disk situation and SL ATLAS accounting conclusions.
Sites should again check Steve's HS06 page.

Wednesday 6th June - Core-ops

Request sites to publish HS06 figures from new kit to this page.

Please would all sites check the HS06 numbers they publish. Will review in detail on 26th June.

Friday 11th May - HEPSYSMAN

Discussion on HS06 reminding sites to publish using results from 32-bit mode benchmarking. A reminder for new kit results to be posted to the HS06 wiki page. See also the blog article by Pete Gronbech. The HEPiX guidelines for running the benchmark tests are at this link.

Check publishing via: http://gstat2.grid.sinica.edu.tw/gstat/summary/Country/UK/

Documentation - KeyDocs

Tuesday 11th September

KeyDocs monitoring status: Grid Storage(7/0) Documentation(3/0) On-duty coordination(3/0) Staged rollout(3/0) Ticket follow-up(3/0) Regional tools(3/0) Security(3/0) Accounting(3/0) Core Grid services(3/0) Wider VO issues(3/0) Grid interoperation(3/0) Monitoring(2/0) Cluster Management(1/0) (brackets show total/missing)
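The status line above follows a regular Name(total/missing) pattern, so it can be checked mechanically. The following is a minimal Python sketch, assuming the line is available as plain text; the function names are illustrative only and are not part of any existing KeyDocs tooling.

```python
import re

# Matches entries such as "Grid Storage(7/0)": a name, then total/missing.
ENTRY = re.compile(r"\s*([^()]+?)\((\d+)/(\d+)\)")


def parse_keydocs_status(line):
    """Return {area: (total, missing)} parsed from a KeyDocs status line."""
    # Drop an optional "KeyDocs monitoring status:" style prefix.
    body = line.split(":", 1)[-1]
    return {
        name.strip(): (int(total), int(missing))
        for name, total, missing in ENTRY.findall(body)
    }


def areas_with_missing_docs(status):
    """List the areas that report at least one missing document."""
    return sorted(area for area, (_, missing) in status.items() if missing > 0)


if __name__ == "__main__":
    sample = ("KeyDocs monitoring status: Grid Storage(7/0) Documentation(3/0) "
              "Monitoring(2/0) Cluster Management(1/0) "
              "(brackets show total/missing)")
    status = parse_keydocs_status(sample)
    print(status)
    print(areas_with_missing_docs(status))
```

The trailing "(brackets show total/missing)" note is skipped because it contains no digits in the total/missing position.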

Thursday 26th July

All the "site update pages" have been reconfigured from a topic oriented structure into a site oriented structure. This is available to view at https://www.gridpp.ac.uk/wiki/Separate_Site_Status_Pages#Site_Specific_Pages

Please do not edit these pages yet - any changes would be lost when we refine the template. Any comments gratefully received, contact: sjones@hep.ph.liv.ac.uk

Interoperation - EGI ops agendas

Monday 10th September

Tuesday 4th September

The end of security support of the following products:

- glite 3.2 glite-UI
- glite 3.2 glite-WN
- glite 3.2 glite-GLEXEC_wn
- glite 3.2 glite-LFC_mysql/glite-LFC_oracle
- glite 3.2 glite-SE_dpm_disk/glite-SE_dpm_mysql

was extended to 30/11/2012 (http://glite.cern.ch/support_calendar/).

Monitoring - Links MyWLCG

Monday 2nd July

DC has almost finished an initial ranking. This will be reviewed by AF/JC and discussed at 10th July ops meeting

Wednesday 6th June

Ranking continues. Plan to have a meeting in July to discuss good approaches to the plethora of monitoring available.

Current priority is ranking the tools available.

Glasgow dashboard now packaged and can be downloaded here.

On-duty - Dashboard ROD rota

Monday 10th September

No major incidents during the week.
7 sites currently have one alarm or ticket set. Many of these alarms are 12 hours old.
Only 2 sites have alarms which are approaching 24 hours old (Oxford, Glasgow).
Kashif on-duty this week
Intend to hold ROD meeting this week (those involved please respond to availability email!)

Monday 3rd September

Some issues with ROD handover last week. We need to agree a handshake (in conjunction with report perhaps - AM).
John W is on-duty this week.
A new rota needs to be created for beyond September.

Rollout Status WLCG Baseline

Thursday 13th September

Updated all SR pages.

Monday 3rd September

Test queues for EMI WNs: RAL T1, Oxford, Liverpool?, Brunel

Tuesday 31st July

The Staged Rollout pages are now separated into EMI1 & 2, and the page listing the deployed versions is extractable from the BDII, so they should all be reasonably up-to-date:
http://www.hep.ph.ic.ac.uk/~dbauer/grid/staged_rollout.html
http://www.hep.ph.ic.ac.uk/~dbauer/grid/staged_rollout_emi2.html
http://www.hep.ph.ic.ac.uk/~dbauer/grid/state_of_the_nation.html

Brunel has a test EMI2/SL6 cluster almost ready for testing. Who (smaller VOs?) will test it with their software?

Wednesday 18th July - Core ops

Sites (that needed a tarball install) will need to work on own glexec installs
Reminder that gLite 3.1 no longer supported. 3.2 support is also decreasing. Need to push for EMI.

Security - Incident Procedure Policies Rota

Monday 10th September

Lessons from SSC6 (ops meeting feedback TBC)

Monday 30th July

WMSes patched/configured correctly.

Monday 23rd July

WMS vulnerabilities identified. Sites will have been contacted. Please respond to tickets ASAP.

Services - PerfSonar dashboard

Tuesday 11th September

Still some sites needing to deploy perfsonar
Meeting date/time for follow-up VOMS discussion needs to be agreed for later this week

Tuesday 4th September

There was a CA TAG meeting last Tuesday
There is an issue with UK certificates and CERN SSO that is being investigated.
To prevent CRLs failing to validate and causing alarms/errors, the old CA certificate lifetime has been extended until March 2013, but no new certificates will be generated with it. This update will be in a September IGTF release. "No person who has a certificate under the old CA will need to do anything special or unusual. No site will need to do anything special or unusual. The purpose of the rollover is to move away from the 2007 key that is hosted in an old signing module for which support will end."

Tickets

Monday 10th September 15:00 BST. 34 Open Tickets this week. The number is slowly shrinking.

Still no sign of ticket reminders here at Lancaster; did anyone else get round to testing it to make sure it's not just me? Also, is anyone still getting the automated weekly reminder e-mails for tickets at your site from GGUS? I haven't seen one of those in a while either.

Sno+ (after some nudging from Jeremy) have started answering their tickets again. Thanks to Jeremy for some epic ticket wrangling last week in general.

Glite 3.1 Retirement Tickets (14/8)

https://ggus.eu/ws/ticket_info.php?ticket=85189 (UCL) Daniela has offered to help, but needs the bare installs set up first. In Progress (6/9)

https://ggus.eu/ws/ticket_info.php?ticket=85185 (Cambridge) Plan to switch off the lcg-CEs, but to delay as long as possible as this will mean the loss of 128 job slots. Probably don't want to leave it until the 11th hour though! In progress, Jeremy upped to "Very Urgent" with the rest (29/8)

https://ggus.eu/ws/ticket_info.php?ticket=85183 (Glasgow) Not much news from Glasgow, at last check they were trying to debug problems they were seeing with the EMI-1 WMS/LB. Jeremy asks how this is going (and has knocked the status to "In Progress" from "On Hold"). (14/8)

https://ggus.eu/ws/ticket_info.php?ticket=85181 (Durham) The cunning plan here is to simply switch off the offending CEs nearer the deadline. In Progress (6/9)

https://ggus.eu/ws/ticket_info.php?ticket=80155 (Bristol) Bristol are confident and committed to upgrading by the deadline (6/9)

Brunel, ECDF & RHUL have purged glite 3.2 from their sites, their tickets are nicely closed.

TIER 1 https://ggus.eu/ws/ticket_info.php?ticket=85077 (13/8) Biomed nagios errors at RAL, registering files on srm-biomed.gridpp.rl.ac.uk. In an uncertain state since 3/9, Jeremy queried what was going on (6/9).

https://ggus.eu/ws/ticket_info.php?ticket=84492 (24/7) SNO+ were having job matching problems submitting to RAL. Looks like these have been solved (although uncovered new problems at Glasgow). (6/9)

https://ggus.eu/ws/ticket_info.php?ticket=85023 (9/8) SNO+ WMS problems at RAL. After a very long wait for a reply and a gentle nudge from Jeremy, James from SNO+ has provided some output to help continue the investigation. In Progress (6/9)

GLASGOW https://ggus.eu/ws/ticket_info.php?ticket=85025 (9/8) Sister ticket to 85023. SNO+ have provided some of the requested information. In Progress (6/9)

NGI/RALPP https://ggus.eu/ws/ticket_info.php?ticket=85793 (5/9) As seen on TB-SUPPORT, RALPP request a recalculation of August's availability due to problems with jobs sent by the Lancaster Nagios, not at the site. It looks like it's being worked on. In progress (7/9)

NGI https://ggus.eu/ws/ticket_info.php?ticket=84381 (19/7) COMET VO creation ticket. After some confusion with Imperial's mail system the question was raised concerning how best to handle tickets to the VOMS team in the future. On hold (6/9) The related ticket for the VO's validation (https://ggus.eu/ws/ticket_info.php?ticket=85736) has stalled because the AUP is not suitable for use. A new AUP has been requested (7/9)

https://ggus.eu/ws/ticket_info.php?ticket=68853 (22/3/11) Brian's SL4 SE tracking ticket. With the glite 3.1 deadline approaching, should this be set In Progress (to be closed soon)?

UCL https://ggus.eu/ws/ticket_info.php?ticket=85549 (28/8) The last UserDn publishing ticket (85547). Still no movement on it. Has a fix been attempted? (28/8)

BRUNEL https://ggus.eu/ws/ticket_info.php?ticket=85011 (9/8) Pheno have given their blessing for shutting down the older SE dgc-grid-50.brunel.ac.uk after Jeremy explained the situation. The ticket looks like it can be closed. In Progress (10/9).

DURHAM https://ggus.eu/ws/ticket_info.php?ticket=68859 (22/3/11) Durham SE tracking ticket. As Durham's DPM is now EMI1 can this ticket be closed? In progress (6/9)

Tickets from the UK https://ggus.eu/ws/ticket_info.php?ticket=84015 Tracking the LSF publishing problems seen at Lancaster. An updated .jar has been received and is undergoing testing.

Tools - MyEGI Nagios

Monday 10th September

Discussion needed on which Nagios instance is reporting for the WLCG (metrics) view

Tuesday 25th July

Gridppnagios at Lancaster will remain the main Nagios instance until further announcement. KM is writing down the procedure for switching over to the backup Nagios in case of emergency (https://www.gridpp.ac.uk/wiki/Backup_Regional_Nagios). KM is now away on holiday for one month and may not be able to reply to emails. New email address for Nagios: gridppnagios-admin at physics.ox.ac.uk for any question or information regarding the regional Nagios. Currently this mail goes to Ewan and Kashif.

Monday 2nd July

Switched on backup Nagios at Lancaster and stopped Nagios instance at Oxford. Stopping Nagios instance at Oxford means that it is not sending results to the dashboard and central DB. Keeping a close eye on it and will revert it back to original position if any problems encountered.

VOs - GridPP VOMS VO IDs Approved VO table

Monday 27th August

We are required to encourage our smaller VOs to try running on EMI WNs to inform the upcoming transition. Experiences should be added to https://wiki.egi.eu/wiki/NGI-VO_WN_tests.

Monday 23rd July

CW requested feedback from non-LHC VOs on issues
Proxy renewal issues thought to be resolved on all services except FTS - a new version may solve problems on that service.
Steve's VOMS snooper application is picking up many site VO config problems. We should review all sites in turn.

Site Updates

Friday 7th September

SUSSEX: Still cluster has been upgraded. Intention is to return to grid components now(ish).
Is there sufficient support effort from GridPP?

Meeting Summaries

Project Management Board - Members Minutes Quarterly Reports

Monday 2nd July

No meeting. Next PMB on Monday 3rd September.

GridPP ops meeting - Agendas Actions Core Tasks

Tuesday 21st August - link Agenda Minutes

TBC

RAL Tier-1 Experiment Liaison Meeting (Wednesday 13:30) Agenda EVO meeting

Wednesday 12th September

Operations report
A Castor upgrade was announced for the LHCb instance on Tuesday (11th). This was cancelled on Monday afternoon after a problem was found during final testing.
Problems have been seen on the nodes testing hyperthreading. When there are many cpu-bound jobs (Atlas monte-carlo) on the same node these have taken longer to run on these nodes and exceeded maximum wall time. In response the overcommit of jobs was reduced on Tuesday.
Visitors to the computer room have been impressed by signage indicating which server hosts one of the candidate Atlas Higgs events. We are looking for other "interesting" data to label in a similar way.

WLCG Grid Deployment Board - Agendas MB agendas

July meeting Wednesday 11th July

Welcome (Michel Jouvin)
• September meeting to include IPv6, LS1 and extended run plans
• EMI-2 WN testing also in September

CVMFS deployment status (Ian Collier)
• Recap; 78/104 sites for ATLAS – the UK looks good thanks to Alessandra
• Using two repos for ATLAS. Local shared area will be dropped in future.
• 36/86 for LHCb. Two WN mounts. Pref for CVMFS – extra work
• 5 T2s for CMS. Info for sites https://twiki.cern.ch/twiki/bin/view/CMSPublic/CompOpsCVMFS
• Client with shared cache in testing
• Looking at NFS client and Mac OS X

Pre-GDB on CE Extensions (Davide Salomoni)
• https://indico.cern.ch/conferenceDisplay.py?confId=196743
• Goal – review proposed extensions + focus on whole-node/multi-core set
• Also agree development plan + timeline for CEs.
• Fixed cores and variable number of cores + mem requirements. May impact expt. frameworks.
• Some extra attributes added in Glue 2.0 – e.g. MaxSlotsPerJob
• JDL. Devel. Interest. Queue level or site level.
• How. CE implementations. Plan. Actions.

Initial Meeting with EMI, EGI and OSG (Michel Jouvin)
• ID issues related to end of supporting projects (e.g. EMI)
• Globus (community?); EMI MW (WLCG); OSG; validation
• Discussion has not included all stakeholders.

How to identify the best top level BDIIs (Maria Alandes Pradillo)
• Only 11% are “properly” configured (LCG_GFAL_INFOSYS 1,2,3)
• UK BDIIs appear in top 20 of ‘most configured’.

MUPJ – gLExec update (Maarten Litmaath)
• ‘glexec’ flag in GOCDB for each supporting CE
• http://cern.ch/go/PX7p (so far… T1, Brunel, IC-HEP, Liv, Man, Glasgow, Ox, RALPP)
• Improved instructions: https://twiki.cern.ch/twiki/bin/view/LCG/GlexecDeployment
• CMS ticketing sites. Working on GlideinWMS.

WG on Storage Federations (Fabrizio Furano)
• Federated access to data – clarify what needs supporting
• ‘fail over’ for jobs; ‘repair mechanisms’; access control
• So far XROOTD clustering through WAN = natural solution
• Setting up group.

DPM Collaboration – Motivation and proposal (Oliver Keeble)
• Context. Why. Who…
• UK is 3rd largest user (by region/country)
• Section on myths: DPM has had investment. Not only for small sites…
• New features: HTTP/WebDAV, NFSv4.1, Perfsuite…
• Improvements with xrootd plugin
• Looking for stakeholders to express interest … expect proposal shortly
• Possible model: 3-5 MoU or ‘maintain’

Update on SHA-2 and RFC proxy support
• IGTF wish CAs -> SHA-2 signatures ASAP. For WLCG means use RFC in place of current Globus legacy proxies.
• dCache & BeStMan may look at EMI Common Authentication Library (CANL) – supports SHA-2 with legacy proxies.
• IGTF aim for Jan 2013 (then takes 395 days for SHA-1 to disappear)
• Concern about timeline (LHC run now extended)
• Status: https://twiki.cern.ch/twiki/bin/view/LCG/RFCproxySHA2support
• Plan deployed SW support RFC proxies (Summer 2013) and SHA-2 (except dCache/BeStMan – Summer 2013). Introduce SHA-2 CAs Jan 2014.
• Plan B – short-lived WLCG catch-all CA

ARGUS Authorization Service (Valery Tschopp)
• Authorisation examples & ARGUS motivation (many services, global banning, policies static). Can user X perform action Y on resource Z?
• ARGUS built on top of a XACML policy engine
• PAP = Policy Administration Point. Tool to author policies.
• PDP = Policy Decision Point (evaluates requests)
• PEP = Policy Enforcement Point (reformats requests)
• Hide XACML with Simplified Policy Language (SPL)
• Central banning = Hierarchical policy distribution
• Pilot job authorization – gLExec executes payload on WN https://twiki.cern.ch/twiki/bin/view/EGEE/AuthorizationFramework

Operations Coordination Team (Maria Girone)
• Mandate – addresses needs in WLCG service coordination recommendations & commissioning of OPS and Tools.
• Establish core teams of experts to validate, commission and troubleshoot services.
• Team goals: understand services needed; monitor health; negotiate configs; commission new services; help with transitions.
• Team roles: core members (sites, regions, expt., services) + targeted experts
• Tasks: CVMFS, Perfsonar, gLExec

Jobs with High Memory Profiles
• See expt reports.

NGI UK - Homepage CA

Wednesday 22nd August

Operationally few changes - VOMS and Nagios changes on hold due to holidays
Upcoming meetings Digital Research 2012 and the EGI Technical Forum. UK NGI presence at both.
The NGS is rebranding to NES (National e-Infrastructure Service)
EGI is looking at options to become a European Research Infrastructure Consortium (ERIC). (Background document.)
Next meeting is on Friday 14th September at 13:00.

Events

WLCG workshop - 19th-20th May (NY) Information

CHEP 2012 - 21st-25th May (NY) Agenda

GridPP29 - 26th-27th September (Oxford)

UK ATLAS - Shifter view News & Links

Thursday 21st June

Over the last few months ATLAS have been testing their job recovery mechanism at RAL and a few other sites. This is something that was 'implemented' before but never really worked properly. It now appears to be working well, allowing jobs to finish even if the SE is down or unstable when the job finishes.

Job recovery works by writing the output of the job to a directory on the WN should it fail when writing the output to the SE. Subsequent pilots will check this directory and try again for a period of 3 hours. If you would like to have job recovery activated at your site, you need to create a directory which (atlas) jobs can write to. I would also suggest that this directory has some form of tmpwatch enabled on it which clears up files and directories older than 48 hours. Evidence from RAL suggests that it's normally only 1 or 2 jobs that are ever written to the space at a time and the space used is normally less than a GB. I have not observed more than 10GB being used. Once you have created this space, please email atlas-support-cloud-uk at cern.ch with the directory (and your site!) and we can add it to the ATLAS configurations. We can switch off job recovery at any time if it does cause a problem at your site. Job recovery would only be used for production jobs, as users complain if they have to wait a few hours for things to retry (even if it would save them time overall...)
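For sites without a tmpwatch-style tool to hand, the 48-hour clean-up suggested above could be approximated with a short script run from cron. This is only a sketch under stated assumptions: the function name and the idea of passing the recovery directory as an argument are illustrative, and the directory path itself is whatever the site has agreed with ATLAS. It is not part of the ATLAS pilot or of tmpwatch.

```python
import os
import time


def purge_old_entries(root, max_age_hours=48, now=None):
    """Remove files under root older than max_age_hours (by mtime).

    Subdirectories that were already old before the purge and end up
    empty are removed as well; root itself is always kept.
    Returns the list of removed file paths.
    """
    now = time.time() if now is None else now
    cutoff = now - max_age_hours * 3600

    # Record old directories first: deleting entries inside a directory
    # updates its mtime, which would otherwise make it look recent.
    old_dirs = [d for d, _, _ in os.walk(root)
                if d != root and os.lstat(d).st_mtime < cutoff]

    removed = []
    for dirpath, _, filenames in os.walk(root):
        for name in filenames:
            path = os.path.join(dirpath, name)
            if os.lstat(path).st_mtime < cutoff:
                os.remove(path)
                removed.append(path)

    # Longest paths first, so children are handled before their parents.
    for d in sorted(old_dirs, key=len, reverse=True):
        if not os.listdir(d):
            os.rmdir(d)
    return removed
```

Directories created recently by a running job are left alone even when empty, since only directories that were already older than the cutoff are candidates for removal.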

UK CMS

Tuesday 24th April

Brunel will be trialling CVMFS this week, will be interesting. RALPP doing OK with it.

UK LHCb

Tuesday 24th April

Things are running smoothly. We are going to run a few small scale tests of new codes. This will also run at T2, one UK T2 involved. Then we will soon launch new reprocessing of all data from this year. CVMFS update from last week; fixes cache corruption on WNs.

UK OTHER

Thursday 21st June - JANET6

JANET6 meeting in London (agenda)
Spend of order £24M for strategic rather than operational needs.
Recommendations to BIS shortly
Requirements: bandwidth, flexibility, agility, cost, service delivery - reliability & resilience
Core presently 100Gb/s backbone. Looking to 400Gb/s and later 1Tb/s.
Reliability limited by funding not ops so need smart provisioning to reduce costs
Expecting a 'data deluge' (ITER; EBI; EVLBI; JASMIN)
Goal of dynamic provisioning
Looking at ubiquitous connectivity via ISPs
Contracts were 10yrs wrt connection and 5yrs transmission equipment.
Current native capacity 80 channels of 100Gb/s per channel
Fibre procurement for next phase underway (standard players) - 6400km fibre
Transmission equipment also at tender stage

Industry engagement - Glaxo case study.
Extra requirements: software coding, security, domain knowledge.
Expect genome data usage to explode in 3-5yrs.
Licensing is a clear issue

To note

Tuesday 26th June

On Tuesday 31st July 2012 the GOCDB read-write portal at https://gocdb4.esc.rl.ac.uk/portal will be decommissioned and replaced by a single read-write version at https://goc.egi.eu/portal. This will consolidate the service (including the PI) under the same URL. All GOCDB client maintainers are requested to ensure their PI configuration URLs point to https://goc.egi.eu/portal.
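Client maintainers who keep the portal address in a plain-text configuration file could check for the old URL with something like the following Python sketch. The two URLs are taken from the notice above; the function name and the example configuration key are hypothetical and do not correspond to any particular GOCDB client.

```python
OLD_PORTAL = "https://gocdb4.esc.rl.ac.uk/portal"
NEW_PORTAL = "https://goc.egi.eu/portal"


def migrate_gocdb_urls(config_text):
    """Replace the old portal URL with the new one.

    Returns (new_text, count), where count is the number of
    occurrences of the old URL that were found.
    """
    count = config_text.count(OLD_PORTAL)
    return config_text.replace(OLD_PORTAL, NEW_PORTAL), count


if __name__ == "__main__":
    # "pi_url" is a made-up key used purely for illustration.
    sample = "pi_url = https://gocdb4.esc.rl.ac.uk/portal\n"
    updated, found = migrate_gocdb_urls(sample)
    print(found, updated, end="")
```

A count of zero means the text already avoids the decommissioned address, so the same function doubles as a simple check.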
