Operations Bulletin 240912


Bulletin archive


Week commencing 17th September 2012
Task Areas
General updates

Monday 17th September

  • There was a GDB (agenda) last week (a meeting summary will be posted soon). There are also official notes online.
  • The final August Tier-2 availability/reliability figures have been published (PDF).
  • Brian's dark data clean-up page coming to a screen near you!
  • EGI is offering some EMI training this week on Wednesday (details).
  • Multicore work at NIKHEF is written up here.

Tuesday 11th September

  • CVMFS client testing (See Ian's message from today)


Tier-1 - Status Page

Tuesday 18th September

  • Continuing tests of hyperthreading. Some problems were encountered and the overcommit of nodes has been reduced.
  • Continue with ten EMI-2 SL-5 worker nodes in normal production.
  • A test instance of FTS version 3 is now available. Non-LHC VOs that use the existing service have been enabled on it, and we are looking for one of these VOs to test it.
  • Castor 2.1.12 update for Atlas stager announced for next Tuesday (25th).
Storage & Data Management - Agendas/Minutes

Friday 14th September

  • The current GridPP response on the DPM community support proposal: "GridPP acknowledges the concerns and issues raised in the DPM Community proposal. As a collaboration that has many sites with DPM endpoints we presently have a good level of engagement with the DPM development team and in providing additional tools, testing and (currently mainly local) support for DPM. We would be happy to continue this level of contribution and take part in meetings shaping the emerging DPM community. Over the coming months it would be useful to trial working with the DPM team to develop and test additional DPM components which would help us, and sites across WLCG more generally, be better placed to understand how DPM can deliver to presently unknown but evidently changing WLCG experiment requirements on the LS1 timescale."
  • Has anyone in the SG looked at the Glue 2.0 document yet?


Accounting - UK Grid Metrics HEPSPEC06 Atlas Dashboard HS06

Wednesday 6th September

  • Sites should check the ATLAS page reporting the HS06 coefficient because, according to the latest statement from Steve, that is what is going to be used. Note that the Atlas Dashboard coefficients are averages over time.

I am going to suggest using the ATLAS production and analysis numbers given in HS06 directly, rather than using CPU seconds and trying to convert them ourselves as we have been doing. There doesn't seem to be any robust way of doing the conversion any more, so we may as well use the ATLAS numbers, which are the ones they check against pledges anyway. If the conversion factors are wrong then we should get them fixed in our BDIIs. No doubt there will be a lively debate at GridPP29!
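As a rough illustration of why the home-grown conversion is fragile, here is a minimal sketch (not the APEL or ATLAS accounting code, and the 9.5 HS06-per-core coefficient is purely hypothetical) of scaling raw CPU seconds by a published benchmark figure. If that coefficient is stale or wrong, every accounted number is wrong with it, which is the argument for taking the ATLAS HS06 figures directly.

 # Minimal sketch, not the APEL/ATLAS accounting code: scale raw CPU seconds
 # by a site's published HS06-per-core coefficient to get HS06-hours.
 def cpu_seconds_to_hs06_hours(cpu_seconds, hs06_per_core):
     """Convert raw CPU time into HS06-hours using one benchmark coefficient."""
     return cpu_seconds * hs06_per_core / 3600.0

 # Hypothetical example: 36000 CPU seconds on cores benchmarked at 9.5 HS06
 # each would be accounted as 95.0 HS06-hours. A wrong coefficient in the
 # BDII shifts every such number.
 print(cpu_seconds_to_hs06_hours(36000, 9.5))  # -> 95.0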

Wednesday 18th July - Core-ops

  • Still need definitive statement on disk situation and SL ATLAS accounting conclusions.
  • Sites should check Steve's HS06 page.

Wednesday 6th June - Core-ops

  • Request sites to publish HS06 figures from new kit to this page.
  • Please would all sites check the HS06 numbers they publish. Will review in detail on 26th June.

Friday 11th May - HEPSYSMAN

  • Discussion on HS06 reminding sites to publish using results from 32-bit mode benchmarking. A reminder for new kit results to be posted to the HS06 wiki page. See also the blog article by Pete Gronbech. The HEPiX guidelines for running the benchmark tests are at this link.


Documentation - KeyDocs

Tuesday 11th September

KeyDocs monitoring status: Grid Storage(7/0) Documentation(3/0) On-duty coordination(3/0) Staged rollout(3/0) Ticket follow-up(3/0) Regional tools(3/0) Security(3/0) Accounting(3/0) Core Grid services(3/0) Wider VO issues(3/0) Grid interoperation(3/0) Monitoring(2/0) Cluster Management(1/0) (brackets show total/missing)

Thursday 26th July

All the "site update pages" have been reconfigured from a topic oriented structure into a site oriented structure. This is available to view at https://www.gridpp.ac.uk/wiki/Separate_Site_Status_Pages#Site_Specific_Pages

Please do not edit these pages yet - any changes would be lost when we refine the template. Any comments gratefully received, contact: sjones@hep.ph.liv.ac.uk

Interoperation - EGI ops agendas

Monday 10th September - EGI ops meeting minutes


Tuesday 4th September

  • Security support for the following gLite 3.2 products has been extended to 30/11/2012 (http://glite.cern.ch/support_calendar/):

- glite-UI
- glite-WN
- glite-GLEXEC_wn
- glite-LFC_mysql / glite-LFC_oracle
- glite-SE_dpm_disk / glite-SE_dpm_mysql


Monitoring - Links MyWLCG

Monday 2nd July

  • DC has almost finished an initial ranking. This will be reviewed by AF/JC and discussed at 10th July ops meeting

Wednesday 6th June

  • Ranking continues. Plan to have a meeting in July to discuss good approaches to the plethora of monitoring available.
  • Glasgow dashboard now packaged and can be downloaded here.
On-duty - Dashboard ROD rota

Friday 14th September

  • A few alarms at different sites, but all were fixed in time. Brunel, Oxford and Liverpool are testing EMI WNs on test clusters and intermittently failing a Nagios test because of a bug in lcg_utils. I have opened tickets and put them on hold. Please keep extending the expiration date of these tickets from the Dashboard until the bug is fixed. QMUL has an open ticket as they are installing the EMI APEL service. Next on duty: Daniela.


Monday 10th September

  • No major incidents during the week.
  • 7 sites currently have one alarm or ticket set. Many of these alarms are 12 hours old.
  • Only 2 sites have alarms which are approaching 24 hours old (Oxford, Glasgow).
  • Kashif on-duty this week
  • Intend to hold ROD meeting this week (those involved please respond to availability email!)


Rollout Status WLCG Baseline

Thursday 13th September

Updated all SR pages.

Monday 3rd September

  • Test queues for EMI WNs: RAL T1, Oxford, Liverpool?, Brunel

Tuesday 31st July

  • Brunel has a test EMI2/SL6 cluster almost ready for testing - who (smaller VOs?) will test it with their software?

Wednesday 18th July - Core ops

  • Sites (that needed a tarball install) will need to work on their own glexec installs
  • Reminder that gLite 3.1 no longer supported. 3.2 support is also decreasing. Need to push for EMI.
Security - Incident Procedure Policies Rota

Monday 10th September

  • Lessons from SSC6 (ops meeting feedback TBC)

Monday 30th July

  • WMSes patched/configured correctly.

Monday 23rd July

  • WMS vulnerabilities identified. Sites will have been contacted. Please respond to tickets ASAP.


Services - PerfSonar dashboard

Tuesday 18th September

  • VOMS in Manchester is now installed with both NGS/GridPP VOs. There are some political decisions to take about how to support and maintain the NGS VOs, but they have been installed. Replication tests between Manchester and Oxford can now start.
  • Meeting date/time for follow-up VOMS discussion needs to be agreed for later this week

Tuesday 11th September

  • Still some sites needing to deploy perfsonar
  • Meeting date/time for follow-up VOMS discussion needs to be agreed for later this week
Tickets

Monday 17th September 16:00 BST

36 open tickets this week, although I can't complain as more than the fair share of them are mine. No ticket update from me next week, as I'm on leave again (I need to learn to take my holidays earlier in the year!). Although with the GridPP meeting next week I'm not sure there will be a Tuesday meeting anyway.

ROD
https://ggus.eu/ws/ticket_info.php?ticket=86009 (11/9)
Our ROD team is being picked on to justify the August metrics. There has been no response from anyone since the 11th when this ticket was submitted; has this snuck under the radar? Assigned (11/9)

THE GLITE 3.1 PURGE
Just 3 tickets left, so we'll just about be finished with this in time to do the same with the glite 3.2 services!
https://ggus.eu/ws/ticket_info.php?ticket=85185 CAMBRIDGE. John has turned off the lcg-CE. I think the BDII is done too, so it looks like this is a victory.
https://ggus.eu/ws/ticket_info.php?ticket=85181 DURHAM. Daniela asks about the glite 3.1 BDII, a service that can't just be turned off. Time is running out. (13/9)
https://ggus.eu/ws/ticket_info.php?ticket=85183 GLASGOW. Worryingly quiet. I assume the Glasgow lads are working on it, but the rest of us are getting nervous!

TIER 1
https://ggus.eu/ws/ticket_info.php?ticket=85077 (13/8)
biomed SRM problems. A terse reply from biomed suggests that everything is okay; I asked for clarification and their blessing to close it. Waiting for reply (17/9)

https://ggus.eu/ws/ticket_info.php?ticket=68853 (22/3)
The ticket tracking older DPMs. It should be put back in progress with a view to closing it once the above purge is finished. On Hold. (6/9)

LIVERPOOL
https://ggus.eu/ws/ticket_info.php?ticket=86095 (14/9) Liverpool Ops test jobs falling afoul of the lcg_utils bug. I put it On Hold as there is no hope until the bug is fixed. On Hold (17/9)

BRUNEL
https://ggus.eu/ws/ticket_info.php?ticket=85973 (10/9)
Being bitten by the same bug as Liverpool. On Hold (12/9)

OXFORD
https://ggus.eu/ws/ticket_info.php?ticket=85968 (10/9)
And again! On Hold (12/9)

Ewan helpfully linked to the ticket for the likely cause: https://ggus.eu/tech/ticket_show.php?ticket=85601
Apart from Ewan's cross-referencing of the tickets, though, there's been no movement since 29/8.

QMUL
https://ggus.eu/ws/ticket_info.php?ticket=85967 (10/9)
Nagios APEL tests failing during the move from gLite to EMI. I got the wrong end of the stick here: QMUL are replacing their APEL box with a newer, EMI machine after a publishing error broke APEL at Queen Mary & RAL. John Gordon adds a reminder to be sure that you don't accidentally republish all your data! In Progress (12/9)

https://ggus.eu/ws/ticket_info.php?ticket=80052 (8/3)
A ticket from QMUL rather than to it, concerning availability calculation back in March. I think this is some kind of ticket orphan.

GLASGOW
https://ggus.eu/ws/ticket_info.php?ticket=85025 (9/8)
Sno+ WMS problems. Sno+ supplied some requested information (6/9); no word since. Is this WMS one that will go in the above-mentioned glite 3.1 purge? If so then it will be worth mentioning. In progress (6/9)

BRUNEL (again, I didn't want to break my chain above)
https://ggus.eu/ws/ticket_info.php?ticket=85011 (9/8)
Retirement of their old CE. I believe that this ticket can be closed. In progress (10/9)

SUSSEX
https://ggus.eu/ws/ticket_info.php?ticket=81784 (1/5)
Things have been quiet on the Sussex front. Jeremy gave a target of being out of downtime by the 21st. Is that looking likely? In progress (10/9)

Solved Tickets
All the UserDN accounting tickets are now closed. No other solved tickets jump out at me.

Tickets from the UK
https://ggus.eu/ws/ticket_info.php?ticket=85449
Winnie had some trouble with a deleted downtime not being taken into account. This has sparked some deliberation and some changes to the way SAM polls the GOCDB to prevent this happening again. It may be that Winnie will have to open an availability/reliability amendment ticket to take the false downtime into account.

Tools - MyEGI Nagios

Monday 17th September

  • Current state of Nagios is now on this page.

Monday 10th September

  • Discussion needed on which Nagios instance is reporting for the WLCG (metrics) view



VOs - GridPP VOMS VO IDs Approved VO table

Tuesday 18 September 2012

  • No VOs reporting issues.
  • VOs have been asked for a brief summary for the GridPP meeting.

Monday 27th August


Monday 23rd July

  • CW requested feedback from non-LHC VOs on issues
  • Proxy renewal issues thought to be resolved on all services except FTS - a new version may solve problems on that service.
  • Steve's VOMS snooper application is picking up many site VO config problems. We should review all sites in turn.


Site Updates

Friday 7th September

  • SUSSEX: The cluster has now been upgraded. The intention is to return to grid components now(ish).
  • Is there sufficient support effort from GridPP?


Meeting Summaries
Project Management Board - Members Minutes Quarterly Reports

Monday 2nd July

  • No meeting. Next PMB on Monday 3rd September.
GridPP ops meeting - Agendas Actions Core Tasks

Tuesday 21st August - link Agenda Minutes

  • TBC


RAL Tier-1 Experiment Liaison Meeting (Wednesday 13:30) Agenda EVO meeting

Wednesday 19th September

  • Operations report
  • Castor 2.1.12 upgrade announced for the Atlas instance for next Tuesday (25th).
  • Continuing test of hyperthreading on one batch of worker nodes (the Dell 2011 batch). We expect to go to a 50% over-commit on all hyper-threaded nodes at the start of October. Comments/concerns from VOs welcome.
WLCG Grid Deployment Board - Agendas MB agendas

July meeting Wednesday 11th July

Welcome (Michel Jouvin) • September meeting to include IPv6, LS1 and extended run plans • EMI-2 WN testing also in September

CVMFS deployment status (Ian Collier) • Recap; 78/104 sites for ATLAS – the UK looks good thanks to Alessandra • Using two repos for ATLAS. Local shared area will be dropped in future. • 36/86 for LHCb. Two WN mounts. Pref for CVMFS – extra work • 5 T2s for CMS. Info for sites https://twiki.cern.ch/twiki/bin/view/CMSPublic/CompOpsCVMFS • Client with shared cache in testing • Looking at NFS client and MAC OS X

Pre-GDB on CE Extensions (Davide Salomoni) • https://indico.cern.ch/conferenceDisplay.py?confId=196743 • Goal – review proposed extensions + focus on whole-node/multi-core set • Also agree development plan + timeline for CEs. • Fixed cores and variable number of cores + mem requirements. May impact expt. frameworks. • Some extra attributes added in Glue 2.0 – e.g. MaxSlotsPerJob • JDL. Devel. Interest. Queue level or site level. • How. CE implementations. Plan. Actions.

Initial Meeting with EMI, EGI and OSG (Michel Jouvin) • ID issues related to end of supporting projects (e.g. EMI) • Globus (community?); EMI MW (WLCG); OSG; validation • Discussion has not included all stakeholders.

How to identify the best top level BDIIs (Maria Alandes Pradillo) • Only 11% are “properly” configured (LCG_GFAL_INFOSYS 1,2,3) • UK BDIIs appear in top 20 of ‘most configured’.
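For illustration only (this is not an official probe): a client counts as "properly" configured when LCG_GFAL_INFOSYS lists two or three top-level BDIIs. A quick sketch that reads that variable and checks each endpoint answers on the standard BDII port 2170:

 # Rough illustration, not an official check: read LCG_GFAL_INFOSYS and make
 # sure each listed top-level BDII accepts a TCP connection on port 2170.
 import os
 import socket

 endpoints = [e for e in os.environ.get("LCG_GFAL_INFOSYS", "").split(",") if e]
 if len(endpoints) < 2:
     print("Fewer than two top-level BDIIs configured")

 for entry in endpoints:
     host, _, port = entry.partition(":")
     try:
         socket.create_connection((host, int(port or 2170)), timeout=5).close()
         print("%s reachable" % entry)
     except OSError as err:
         print("%s unreachable: %s" % (entry, err))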

MUPJ – gLExec update (Maarten Litmaath) • ‘glexec’ flag in GOCDB for each supporting CE • http://cern.ch/go/PX7p (so far… T1, Brunel, IC-HEP, Liv, Man, Glasgow, Ox, RALPP) • Improved instructions: https://twiki.cern.ch/twiki/bin/view/LCG/GlexecDeployment • CMS ticketing sites. Working on GlideinWMS.

WG on Storage Federations (Fabrizio Furano) • Federated access to data – clarify what needs supporting • ‘fail over’ for jobs; ‘repair mechanisms’; access control • So far XROOTD clustering through WAN = natural solution • Setting up group.

DPM Collaboration – Motivation and proposal (Oliver Keeble) • Context. Why. Who… • UK is 3rd largest user (by region/country) • Section on myths: DPM has had investment. Not only for small sites… • New features: HTTP/WebDAV, NFSv4.1, Perfsuite… • Improvements with xrootd plugin • Looking for stakeholders to express interest … expect proposal shortly • Possible model: 3-5 MoU or ‘maintain’

Update on SHA-2 and RFC proxy support • IGTF wish CAs -> SHA-2 signatures ASAP. For WLCG means use RFC in place of current Globus legacy proxies. • dCache & BestMan may look at EMI Common Authentication Library (CANL) – supports SHA-2 with legacy proxies. • IGTF aim for Jan 2013 (then takes 395 days for SHA-1 to disappear) • Concern about timeline (LHC run now extended) • Status: https://twiki.cern.ch/twiki/bin/view/LCG/RFCproxySHA2support • Plan deployed SW support RFC proxies (Summer 2013) and SHA-2 (except dCache/BeStMan – Summer 2013). Introduce SHA-2 CAs Jan 2014. • Plan B – short-lived WLCG catch-all CA
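As a hedged illustration of the SHA-2 readiness point (it assumes the third-party Python 'cryptography' package and is not a WLCG-provided tool), a sketch that reports the signature hash algorithm of a PEM certificate or proxy, so you can see whether it is still SHA-1 signed:

 # Illustrative sketch only; assumes the third-party 'cryptography' package.
 # Print the signature hash algorithm (e.g. 'sha1' or 'sha256') of each PEM
 # certificate or proxy given on the command line.
 import sys
 from cryptography import x509

 def signature_hash(path):
     with open(path, "rb") as f:
         cert = x509.load_pem_x509_certificate(f.read())
     return cert.signature_hash_algorithm.name

 if __name__ == "__main__":
     for pem in sys.argv[1:]:
         print("%s: %s" % (pem, signature_hash(pem)))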

ARGUS Authorization Service (Valery Tschopp) • Authorisation examples & ARGUS motivation (many services, global banning, static policies). Can user X perform action Y on resource Z? • ARGUS built on top of a XACML policy engine • PAP = Policy Administration Point. Tool to author policies. • PDP = Policy Decision Point (evaluates requests) • PEP = Policy Execution Point (reformats requests) • Hide XACML with Simplified Policy Language (SPL) • Central banning = Hierarchical policy distribution • Pilot job authorization – gLExec executes payload on WN https://twiki.cern.ch/twiki/bin/view/EGEE/AuthorizationFramework

Operations Coordination Team (Maria Girone) • Mandate – addresses needs in WLCG service coordination recommendations & commissioning of OPS and Tools. • Establish core teams of experts to validate, commission and troubleshoot services. • Team goals: understand services needed; monitor health; negotiate configs; commission new services; help with transitions. • Team roles: core members (sites, regions, expt., services) + targeted experts • Tasks: CVMFS, Perfsonar, gLExec

Jobs with High Memory Profiles • See expt reports.



NGI UK - Homepage CA

Wednesday 22nd August

  • Operationally few changes - VOMS and Nagios changes on hold due to holidays
  • Upcoming meetings Digital Research 2012 and the EGI Technical Forum. UK NGI presence at both.
  • The NGS is rebranding to NES (National e-Infrastructure Service)
  • EGI is looking at options to become a European Research Infrastructure Consortium (ERIC). (Background document.)
  • Next meeting is on Friday 14th September at 13:00.
Events

WLCG workshop - 19th-20th May (NY) Information

CHEP 2012 - 21st-25th May (NY) Agenda

GridPP29 - 26th-27th September (Oxford)

UK ATLAS - Shifter view News & Links

Thursday 21st June

  • Over the last few months ATLAS have been testing their job recovery mechanism at RAL and a few other sites. This is something that was 'implemented' before but never really worked properly. It now appears to be working well and is allowing jobs to finish even if the SE is down or unstable when the job finishes.
  • Job recovery works by writing the output of the job to a directory on the WN should writing the output to the SE fail. Subsequent pilots will check this directory and try again for a period of 3 hours. If you would like to have job recovery activated at your site you need to create a directory which (ATLAS) jobs can write to. I would also suggest that this directory has some form of tmpwatch enabled on it which clears up files and directories older than 48 hours (a minimal clean-up sketch is shown below). Evidence from RAL suggests that it is normally only 1 or 2 jobs that are ever written to the space at a time and the space used is normally less than a GB; I have not observed more than 10GB being used. Once you have created this space, please email atlas-support-cloud-uk at cern.ch with the directory (and your site!) and we can add it to the ATLAS configurations. We can switch off job recovery at any time if it does cause a problem at your site. Job recovery would only be used for production jobs, as users complain if they have to wait a few hours for things to retry (even if it would save them time overall...)
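Where tmpwatch is not available, here is a minimal clean-up sketch along the lines suggested above (the recovery path is hypothetical; use whatever directory your site registers with ATLAS):

 # Minimal sketch of the suggested 48-hour clean-up of the job recovery area.
 # The directory below is hypothetical - use the path your site registered.
 import os
 import shutil
 import time

 RECOVERY_DIR = "/pool/atlas-job-recovery"  # hypothetical location
 MAX_AGE = 48 * 3600                        # 48 hours, as suggested above

 now = time.time()
 for name in os.listdir(RECOVERY_DIR):
     path = os.path.join(RECOVERY_DIR, name)
     if now - os.path.getmtime(path) > MAX_AGE:
         if os.path.isdir(path):
             shutil.rmtree(path)   # whole per-job directory
         else:
             os.remove(path)       # stray file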
UK CMS

Tuesday 24th April

  • Brunel will be trialling CVMFS this week, will be interesting. RALPP doing OK with it.
UK LHCb

Tuesday 24th April

  • Things are running smoothly. We are going to run a few small-scale tests of new codes. These will also run at T2s, with one UK T2 involved. Then we will soon launch a new reprocessing of all data from this year. CVMFS update from last week: it fixes cache corruption on WNs.
UK OTHER

Thursday 21st June - JANET6

  • JANET6 meeting in London (agenda)
  • Spend of order £24M for strategic rather than operational needs.
  • Recommendations to BIS shortly
  • Requirements: bandwidth, flexibility, agility, cost, service delivery - reliability & resilience
  • Core presently 100Gb/s backbone. Looking to 400Gb/s and later 1Tb/s.
  • Reliability limited by funding not ops so need smart provisioning to reduce costs
  • Expecting a 'data deluge' (ITER; EBI; EVLBI; JASMIN)
  • Goal of dynamic provisioning
  • Looking at ubiquitous connectivity via ISPs
  • Contracts were 10yrs wrt connection and 5yrs transmission equipment.
  • Current native capacity 80 channels of 100Gb/s per channel
  • Fibre procurement for next phase underway (standard players) - 6400km fibre
  • Transmission equipment also at tender stage
  • Industry engagement - Glaxo case study.
  • Extra requirements: software coding, security, domain knowledge.
  • Expect genome data usage to explode in 3-5yrs.
  • Licensing is a clear issue
To note

Tuesday 26th June