Operations Bulletin 110612


Bulletin archive


Week commencing 4th June 2012
Task Areas
General updates

Friday 8th June

  • Mingchao leaves his role as security officer today. Thanks to him for many years of contributions. His role will be covered by members of the ops team - recruitment of a new security officer is underway.

Thursday 31st May

  • Please note that the Technical Evolution Strategy documents have been moved within the WLCG Document Repository. The folder is now at https://espace.cern.ch/WLCG-document-repository/Technical_Documents/.

  • Check site gstat values
Tier-1 - Status Page

Wednesday 6th June

  • Additional risk in the event of a power cut, as the backup generator failed to start after a test (6th June).
  • All Castor instances are using "Transfer Manager" except Atlas, for which the changeover is scheduled for 7th June.
  • Castor 2.1.11-9 update and LFC/FTS database Oracle 11 updates scheduled for Wednesday 13th June.
Storage & Data Management - Agendas/Minutes

Wednesday 6 June 2012 - we are still digesting CHEP information (see also the blog), plus some of the usual operational and upgrade stuff. Hoping to find a few spare clock cycles for some slightly more experimental stuff.

Wednesday 23 May 2012 - lots of exciting stuff at CHEP, we have about five things in, some posters, some oral.


Accounting - UK Grid Metrics HEPSPEC06

Friday 11th May - HEPSYSMAN

  • Discussion on HS06 reminding sites to publish using results from 32-bit mode benchmarking. A reminder for new kit results to be posted to the HS06 wiki page. See also the blog article by Pete Gronbech. The HEPiX guidelines for running the benchmark tests are at this link.


Friday 4th May - TB-SUPPORT

  • There is a long-standing problem that gstat overcounts things like RunningJobs and TotalJobs if there are multiple CE nodes feeding the same set of queues. The gstat developers have just put a comment on the ticket for that, indicating that they have done something to remove the duplicates.
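
The overcounting described in this item can be illustrated with a minimal sketch. It assumes, hypothetically, that each CE publishes one record per queue carrying a queue identifier plus RunningJobs and TotalJobs values; the record layout, host names and function names are illustrative only and are not taken from gstat or from the ticket.

```python
from typing import Dict, Iterable, Tuple

# Hypothetical record layout: (ce_host, queue_id, running_jobs, total_jobs).
Record = Tuple[str, str, int, int]


def naive_totals(records: Iterable[Record]) -> Tuple[int, int]:
    """Sum every record, so a queue shared by N CEs is counted N times."""
    rows = list(records)
    return sum(r[2] for r in rows), sum(r[3] for r in rows)


def deduplicated_totals(records: Iterable[Record]) -> Tuple[int, int]:
    """Count each queue once, however many CEs publish it."""
    per_queue: Dict[str, Tuple[int, int]] = {}
    for _ce, queue, running, total in records:
        # CEs may publish at slightly different moments, so keep the
        # largest value seen for each queue (an assumption of this sketch).
        prev = per_queue.get(queue, (0, 0))
        per_queue[queue] = (max(prev[0], running), max(prev[1], total))
    return (
        sum(v[0] for v in per_queue.values()),
        sum(v[1] for v in per_queue.values()),
    )


# Two CEs feed the same "long" queue; only ce01 publishes "short".
records = [
    ("ce01.example.org", "long", 100, 150),
    ("ce02.example.org", "long", 100, 150),
    ("ce01.example.org", "short", 20, 25),
]
print(naive_totals(records))         # (220, 325)
print(deduplicated_totals(records))  # (120, 175)
```

Keying on the queue rather than the CE is the design choice that matters here; how a real information system identifies "the same queue" across CEs is not specified in this bulletin.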
Documentation - KeyDocs

Wednesday, 6th June

Released a document, hep.ph.liv.ac.uk/~sjones/VomsSnooper.odt, that describes how to

  • Maintain site VOMS info document for the approved VOs
  • Check a site's VOMS records correspond exactly with CIC portal
  • Create new site VOMS records direct from CIC portal, without manual transcription
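
As a rough illustration of the second use case (checking that a site's records match a reference), the following sketch compares two sets of VOMS records. It is not how VomsSnooper works; the record fields, the keying by VO name and server host, and all names and values below are assumptions made for the example.

```python
from typing import Dict, NamedTuple, Set, Tuple


class VomsRecord(NamedTuple):
    """Illustrative subset of the fields a VOMS server entry might carry."""
    port: int
    server_dn: str
    ca_dn: str


Key = Tuple[str, str]  # (VO name, VOMS server host)


class Diff(NamedTuple):
    missing: Set[Key]  # in the reference but not at the site
    extra: Set[Key]    # at the site but not in the reference
    changed: Set[Key]  # present in both, but at least one field differs


def compare_records(site: Dict[Key, VomsRecord],
                    reference: Dict[Key, VomsRecord]) -> Diff:
    """Report every way the site's records deviate from the reference."""
    site_keys, ref_keys = set(site), set(reference)
    changed = {k for k in site_keys & ref_keys if site[k] != reference[k]}
    return Diff(ref_keys - site_keys, site_keys - ref_keys, changed)


# Made-up data: the site has a wrong port, a stale VO and a missing server.
CA = "/C=UK/O=Example/CN=Example CA"
reference = {
    ("vo.example.org", "voms1.example.org"):
        VomsRecord(15000, "/C=UK/O=Example/CN=voms1.example.org", CA),
    ("vo.example.org", "voms2.example.org"):
        VomsRecord(15000, "/C=UK/O=Example/CN=voms2.example.org", CA),
}
site = {
    ("vo.example.org", "voms1.example.org"):
        VomsRecord(15001, "/C=UK/O=Example/CN=voms1.example.org", CA),
    ("old.example.org", "voms1.example.org"):
        VomsRecord(15010, "/C=UK/O=Example/CN=voms1.example.org", CA),
}

diff = compare_records(site, reference)
for label, keys in zip(Diff._fields, diff):
    print(label, sorted(keys))
```

An exact match corresponds to all three sets being empty, which makes the check easy to automate.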

Note: I'm accepting tips from GridPP core task members etc. about other use cases for these processes.

This will be converted to wiki format and made available in the normal way. Next jobs:

  • review the logic/sequence of the VOMS admin process, document it if it works, fix it if it doesn't.
  • create a standard baseline for the proxy renewal process, and write it up in the wiki.

Note: I'm accepting tips from other GridPP core team members etc. for document priorities. Please think about where the problems lie (i.e. what costs us yet is easy to fix) and get back to me.


Tuesday, 29th May

  • VOMS records in the GridPP Approved VO list are now up to date with the CIC Portal XML. This can be used by site admins to ensure their site-info.def/vo.d directories are up to date. A tool, SidFormatter, will be released this week to facilitate comparison with the benchmark. A process has been devised to ensure that the GridPP Approved VO list is kept up to date to within a week of CIC Portal changes. Consultation to be made about further fields that we may wish to advertise in this manner.

Friday 27th April

  • Appeal for a volunteer to enhance the "Grid User Crash Course" (https://www.gridpp.ac.uk/wiki/Grid_user_crash_course) with a simple use case for dependable proxy renewal for long jobs, as this is a recurrent requirement that has caused multiple queries on TB-SUPPORT.
Interoperation - EGI ops agendas

Monday 7th May - EGI ops agenda

  • No EMI update, everyone in transit for EMI all hands meeting.
  • Staged Rollout: A detailed list is on the agenda - note that 'verification' is the step before SR, so anything in verification is expected in SR soon. In particular, in verification: BLAH update for CREAM, DPM, lcg-utils (for UI and WN), MyProxy and WMS.
  • TMPDIR policy. Draft policy available. Some discussion of how it relates to EDG_WN_SCRATCH, and talking to the WLCG lot about it. Any comments, pass to Stuart Purdie by Friday 18th May.
  • TopBDII Availability: UK: 100%. Awesome. Some very long discussion of the Swiss situation (they use Germany's, so who should get a ticket if it drops out, the service provider or the NGI? Not relevant to the UK).


Monitoring - Links MyWLCG
  • Glasgow dashboard now packaged and can be downloaded here.
On-duty - Dashboard ROD Rota

Friday 25th May - AM

  • Several sites have downtimes, and a few raised non-trivial alarms during the week. All are OK again now or still in downtime; Durham has the ce01 and se02 tickets open, although it is in an unplanned downtime now.
  • Glasgow had problems with MPI alarms even though MPI shouldn't have been advertised, and these alarms were eventually closed still at critical on the Dashboard along with the corresponding ticket (the Nagios pages for the nodes themselves were back to all green due to the removal of the MPI endpoints.)

Monday 14th May - JW

  • A few minor alarms at the moment on 2 sites.
  • QMUL needs to update the CA RPMs (Daniela also opened a ticket against them). They also have an SE which is not functioning correctly.
  • One 'yellow' ticket for ROD - Easter plus dashboard usability concerns


Rollout Status WLCG Baseline

Thursday 10th May

  • The CREAM CE and the WMS which were released at the end of April have finally gone into Staged Rollout.
  • Call for more sites to take part in EMI-2 rollout tests.
  • The overall SR contributions are in this table.

Friday 27th April

  • Updated version information on rollout page
  • WN scan indicates some sites not keen on OS updates to those nodes.
Security - Incident Procedure Policies

Wednesday 6th June

  • The GridPP security team have begun operating a weekly "on duty" rota to provide cover until a replacement for Mingchao is appointed.
  • UKNGI will not be contributing to the EGI security officer duty rota for the time being.

Tuesday 15th May

  • The next NGI/GridPP security team meeting is on Wednesday 16th.
  • Main discussion is on the rota and handover tasks.

Wednesday 2nd May

  • Setting up a security on-duty rota - to improve team and cover period after Mingchao leaves. Alessandra, Linda, Rob and Ewan are involved.
  • Currently reviewing the overall security task.
  • SSC5 preparations to start soon.


Services - PerfSonar dashboard
  • 23rd April requested network utilisation figures for March and April
  • LHCONE meeting last week in Stockholm
  • Agreed to focus on perfSONAR.
Tickets

Monday 4th of June, 15:00 BST
26 tickets this week, a number are from the week before. No really urgent tickets, here's a couple that seem to stand out. As there's no meeting this week I thought I'd keep it short.

OF INTEREST
https://ggus.eu/ws/ticket_info.php?ticket=81498
Ticket debating atlas cvmfs cache size.

NGI
https://ggus.eu/ws/ticket_info.php?ticket=82702
Chris W's ticket concerning the recent VOMS trouble and how it affected cern@school.

CAMBRIDGE
https://ggus.eu/ws/ticket_info.php?ticket=82296
The problems at Cambridge may have been licked, can atlas confirm?

DURHAM
https://ggus.eu/ws/ticket_info.php?ticket=82455
Nagios test failures on ce01, the ticket's getting a bit stale.

Tickets from the UK (same as last week, I'm just leaving them here as a reminder).

https://ggus.eu/tech/ticket_show.php?ticket=79529
The UK Voms team have a ticket open concerning the mechanisms used by voms servers to verify tickets.

https://ggus.eu/tech/ticket_show.php?ticket=75984
Chris W has a long standing ticket open concerning seg faults in the tarball version of lcg-cr.

https://ggus.eu/tech/ticket_show.php?ticket=76532
Andy W submitted this ticket to the CREAM developers after some problems back in November. I think that it can be closed (unless the problems resurfaced).

https://ggus.eu/tech/ticket_show.php?ticket=72506
Steve J submitted a bug to the CREAM/BLAH guys about a potential for an infinite loop in the CREAM/Torque interactions. No movement since February, when Steve upped the priority to "Urgent".

https://ggus.eu/tech/ticket_show.php?ticket=72358
Jon P from t2k saw myproxy delegation failures on the RAL to QMUL FTS channel. The ticket ended up with FTS development, where it was put on hold waiting for FTS 2.2.8. Is that release out and deployed (i.e. can this be another closed ticket?).

https://ggus.eu/ws/ticket_info.php?ticket=64388
Stephen Burke's older but still active ticket documenting his quest to have gstat reduce the overcounting of jobslots.

https://ggus.eu/tech/ticket_show.php?ticket=63614
Chris W asked for a vomses syntax checker for yaim in 2010. No movement in this or the linked Savannah ticket (https://savannah.cern.ch/bugs/?82836) for a long time. UPDATE - looks like this is set to be closed as the bug is "READY FOR REVIEW".

Tools - MyEGI Nagios

Sunday 8th April

  • Lancaster Nagios backup hardware has arrived
  • Nagios based VO testing setup for vo.southgrid.ac.uk. Nagios view is here. Tests run every 8 hours. The results are also fed back into the dashboard and alarms triggered.
  • There is a bug which prevents multiple VO Nagios tests with one Nagios instance. Developers informed.
VOs - GridPP VOMS VO IDs Approved

Friday 11th May

  • A new mailing list for VO admin contacts has been created: vo-admins at jiscmail....

Tue 15 May

  • VOs supported at sites - now have a script, just need to put it into production.
  • Grid user crash course reviewed
    • Needs information on renewing a proxy
    • Needs information on sending jobs to data


Site Updates

Tuesday 15th May

  • Sussex now in certification testing stage. The results indicate a few areas to follow up (may be on the monitoring side). A job submitted to the CE returns:

JobID=link Status = [CANCELLED] ExitCode = [] Description = [Cancelled by CE admin]



Meeting Summaries
Project Management Board - Members Minutes Quarterly Reports

Monday 14th May

  • Update from HEPSYSMAN
  • First look at quarterly reports
  • Discussion on short-term travel claims - GridPP desires to maintain some sort of approval process when universities take more control this summer
  • Brief discussion of Tier-1 network outages in recent weeks
GridPP ops meeting - Agendas Actions Core Tasks

Tuesday 8th May - link Agenda Minutes

  • ATLAS DATADISK now being used for production input files
  • Deploy perfsonar-ps (for LHC compatibility), but run a Perfsonar-MDM portal to collate the information
  • Target date for perfsonar-ps at sites is the end of July
  • KeyDocs still not in place for all areas

Tuesday 1st May 2012 - Agenda Minutes

  • Trying the present bulletin as a conduit for information
  • 7 sites below 90% in ATLAS monitoring!
  • Instructions for small VO space-token setup/usage requested
  • EMI WN tarball now available - comment in ticket
  • Not every site running latest SL5 OS
RAL Tier-1 Experiment Liaison Meeting (Wednesday 13:30) Agenda EVO meeting

Wednesday 6th June

  • All Castor instances switched to use the Transfer Manager (which replaces LSF) except Atlas which is scheduled for 7th June.
  • The following Tier1 interventions are announced (not yet in GOC DB):
    • Wednesday 13th June: Castor 2.1.11-9 update. (Expect Castor down from morning to early afternoon).
    • Wednesday 13th June: Oracle 11 update for LFC & FTS databases. (Expect these services down all working day).
    • Tuesday 19th June: Replacement of RAL Site Access Router. (Expect external connectivity broken for up to 3 hours in morning.)
    • Wednesday 27th June: Update Castor databases to Oracle 11. (Expect Castor down all working day).
  • Test of backup diesel generator unsuccessful (today, Wed 6th June) as it failed to start. Experts expected tomorrow to investigate. Site at an additional risk if power cut.
  • Useful discussion with NA62 about issues they are having getting going with using the site.
  • Operations report
WLCG Grid Deployment Board - Agendas MB agendas

Wednesday 9th May - Agenda

Introduction (Michel Jouvin)

• GDB summary needed. Plan to put notes in wiki linked to agenda.

• Encourage cross-VO T2 deployment of perfSONAR.

TEG next steps (Ian Bird)

• The TEGs as such (big working groups) have finished. Still work to be done.

• Further discussions at CHEP – including future of DPM.

• Deployment of glexec becomes a priority (tarball is not a solved problem)

CRSG report to the C-RRB (Ian Bird)

• Higher rates and parking data plus more analysis means slightly more resources needed (ALICE, ATLAS and CMS)

• LHCb have a revised charm physics program

• A lot more pile-up in 2012

• Use of high-level trigger farms to help with processing

• T2 installed capacity still an issue – REBUS will be used. Need to check the figures published for each site.

LHCOPN/ONE status & directions (John Shade)

• LHCOPN functioning well. Alarms: https://cclhcopnmon.in2p3.fr/LHCOPN/report/.

• perfSONAR-PS and MDM interoperability not tested. Jason Zurawski offering workshop on toolkit for site managers – any interest?

• L3VPN operations https://twiki.cern.ch/twiki/bin/view/LHCONE/WebHome.

• Routing policies important – symmetric paths

• OpenFlow as a protocol. TRILL/SPB for resilience (replace P2P links)

• PerfSONAR is a dormant setup with some selected core sites/points involved with regular testing.

• Encourage sites to install – guidelines in twiki. Sites need to setup alarms for themselves.

Federated Identity Management for HEP (David Kelsey)

• Remove the ID management from the service (use single sign-on). Adding an attribute authority (e.g VOMS) adds complexity

• Spans many communities not just HEP – common requirements being discussed (e.g. open standards, attribute aggregation) including operational ones like traceability.

• Research communities are to perform a risk analysis of using IdM.

Procedure to follow for proposed new T1 sites (Ian Bird)

• Policy document linked from agenda – discussed in 2011

• Requires experiment support and a balance against the high standards of existing T1 services.

• Prepare detailed plan. Follow tests and meet required service levels. Reach full status after about 1 year and Overview Board approval

HEPiX (Helge Meinhard)

• 23rd-27th April. Over-packed agenda. New track on business continuity

• Fabric management changing. Many labs moving to puppet. Quattor healthy. Some sites moving monitoring away from Nagios.

• Cloud computing on the horizon of realism (Openstack and OpenNebula)

• Working groups (virtualization; IPv6; Storage; Benchmarking)

WLCG workshop (Jamie Shiers)

• wlcg workshop mailing list

• Draft agenda now online

Virtualized WNs and Clouds

HEPiX Virtualisation WG report (Tony Cass)

• Set up to facilitate the instantiation of user-generated VM images at HEPiX sites

• Image endorsement – technical constraints and policy discussed

• Framework for endorsers to publish has been developed

Update on WNoDeS (Davide Salomoni)

• Worker nodes on demand service: http://web.infn.it/wnodes

• Integrates Grid and Cloud provisioning through virtualization

• Upcoming – dynamic VLANs

EGI Federated Clouds Task Force (Matteo Turilli)

• Why? Users keen on personalized environments

• Federation testbed now live

https://wiki.egi.eu/wiki/Fedcloud-tf:FederatedCloudsTaskForce

VM contextualization and image cataloguing in ATLAS (Fernando Megino)

• Why? Some sites support multiple VOs with different needs. To be ready for when cloud resources are offered.

• Work ongoing over 1 year. Still manual steps. Currently use CERNVM.

• Contextualization desired for Ganglia and Condor installs.

• Comment: Getting credentials in is fine, but there is concern about site settings for syslog etc. being overridden. Condor should be via CVMFS.


The next meeting is on 13th June. David is the T2 rep. There is a draft agenda in indico.



NGI UK - Homepage CA

Monday 14th May

  • CA: Retiring 2007 CA cert: it stopped signing last October and now *must* vanish from IGTF at the end of Sep release.
  • CA: We won't be able to meet IGTF targets of 1st October 2012 for IPv6 support for our CRLs, Networks suggest even March 2013 might be optimistic.
  • Services: Leeds to deploy a clean WLCG Nagios development platform with help from Kashif
  • UKIROC decommissioning in progress
  • NGS Ops team to reconsider resilience of core services following recent network outages at RAL

Friday 9th March NGS-CA-TAG meeting

  • Priorities discussion for the CA. Plans to be clarified for future meeting.
  • Mathew Dovey has taken over from Neil Geddes as EGI Council and EB chair.
  • Email addresses now removed from certificates.
Events

WLCG workshop - 19th-20th May (NY) Information

CHEP 2012 - 21st-25th May (NY) Agenda

UK ATLAS - Shifter view News & Links

Wednesday 2nd May

  • Testing glideinWMS but some problems spotted


Tuesday 24th April

  • ATLAS started to run reco jobs at T2s more extensively. These require bigger input data sets, which should be copied to the DATADISK space token. Some are being copied to PRODDISK instead. If you see PRODDISK filling up you should take action.

Thursday 3rd April

  • ECDF has a full data disk. Every time this happens the site is automatically blacklisted and cleaned up by ATLAS; however, this automatic process results in a few thousand failed transfers. There is also a concern that these failing transfers are potentially creating dark data. Wahid Bhimji is contacting ATLAS DDM operations to find out if these problems can be fixed. We are concerned that this may be a first for ATLAS, as ECDF has a small amount of disk compared to its available CPU.
  • A disk server at Oxford has failed with the loss of a significant fraction of data. Users are noticing their jobs are failing at Oxford. We are going to proceed with the lost file recovery as soon as possible although are waiting for confirmation from the site. (https://ggus.eu/ws/ticket_info.php?ticket=81788)
  • The next meeting will be 17th May because of HEP Sysman being held at RAL on the 10/11th May.
UK CMS

Tuesday 24th April

  • Brunel will be trialling CVMFS this week, will be interesting. RALPP doing OK with it.
UK LHCb

Tuesday 24th April

  • Things are running smoothly. We are going to run a few small scale tests of new codes. This will also run at T2, one UK T2 involved. Then we will soon launch new reprocessing of all data from this year. CVMFS update from last week; fixes cache corruption on WNs.
UK OTHER

Tuesday 24th April

  • T2K are having problems with WMS proxy renewal, and some WMSes are advertising support but don't actually work. Could be user error, but will need investigation.
Requests

  • More sites needed to test EMI-2