Operations Bulletin 280512

From GridPP Wiki

Bulletin archive


Week commencing 28th May 2012
Task Areas
General updates

Monday 28th May

  • GOCDB experienced an unexpected downtime between Friday 25.05.12 19.15 and Monday 28.05.12 07.45 (all times UTC). The downtime was caused by an issue with the Oracle database used to store the GOCDB data.


Tuesday 22nd May

  • Santanu Das will be moving on to a new job next week and therefore leaving the Cambridge site in the capable hands of the group system administrator, John Hill. Santanu wants to thank everyone for their support over the years. GridPP expresses its gratitude to Santanu for many years of contribution to the project. Thanks Santanu and good luck in your new job.

Monday 14th May

  • There is an issue with publishing userDNs. Which sites are using the UMD APEL version?
  • HEPSPEC06 values used for publishing should use the 32-bit mode results.
  • Involvement in HEPiX or EGI IPv6 testbeds needs to be documented.
  • Check site gstat values
  • The EMI-2 release is currently expected this Friday 18th May.
  • The WLCG workshop is on Saturday (19th) and Sunday. Check the agenda for details.


Tier-1 - Status Page

Tuesday 22nd May

  • Last week we deployed disk servers to atlasStripInput.
  • On Tuesday 15th May there was a repeat of the CERN CRL expiring problem. This was understood and CERN now seems to be deploying new CRLs in a timely manner.
  • On Friday 18th May there was an issue with gdss374. This resulted in 90 lost files for Atlas. This is in addition to the 34 files lost from this machine on Monday 14th May.
  • On Saturday 19th May, there was a high FTS transfer failure rate with SRM_ABORTs (ATLAS, CMS, LHCb). It fixed itself at approximately 17:00. The cause is not understood.
  • Monday 21st May, an errata update to the FTS machines has caused the old problem of expired proxies to re-appear. We are investigating.
  • Ongoing problems with the batch farm (jobs failing to start and a high number of queued jobs from small VOs).

Tuesday 14th May

  • There was a problem on Friday evening with file transfers to RAL. Believed to be related to CRLs but not really understood. Fixed at about midnight.
  • One disk server, gdss374 (atlasTape (d0t1)), developed fsprobe errors; it is currently being drained.
  • We attempted to deploy 13 disk servers into atlasStripInput, but the disk servers immediately caused failed file transfers. We are currently draining them and investigating.
  • Tape Gateway deployed on GEN and LHCb instances.
Storage & Data Management - Agendas/Minutes

Wednesday 23 May 2012 - lots of exciting stuff at CHEP, we have about five things in, some posters, some oral.

Wednesday 16 May 2012 - marvellous how some tasks solve themselves if you wait long enough :-)

Wednesday 9th May 2012 - no meeting. Storage discussions took place at the HEPSYSMAN meeting. T1 uses Puppet, is this the way to go for T2s as well?


Accounting - UK Grid Metrics HEPSPEC06

Friday 11th May - HEPSYSMAN

  • Discussion on HS06 reminding sites to publish using results from 32-bit mode benchmarking. A reminder for new kit results to be posted to the HS06 wiki page. See also the blog article by Pete Gronbech. The HEPiX guidelines for running the benchmark tests are at this link.


Friday 4th May - TB-SUPPORT

  • There is a long-standing problem that gstat overcounts things like RunningJobs and TotalJobs if there are multiple CE nodes feeding the same set of queues. The gstat developers have just commented on the ticket for that, indicating that they have done something to remove the duplicates.
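
To illustrate the overcounting, here is a minimal sketch of one way duplicates could be removed when several CEs publish the same queues. It is not the gstat developers' fix, which the ticket does not describe; the record layout (`QueueRecord`, `batch_system`) is hypothetical and only stands in for whatever identifies a shared batch system.

```python
from typing import Dict, Iterable, NamedTuple, Tuple


class QueueRecord(NamedTuple):
    """One queue as published by one CE (hypothetical record layout)."""
    ce_host: str        # CE node that published this record
    batch_system: str   # assumed identifier of the batch system behind the CE
    queue: str          # queue name
    running_jobs: int
    total_jobs: int


def site_totals(records: Iterable[QueueRecord]) -> Tuple[int, int]:
    """Sum running and total jobs, counting each shared queue only once."""
    unique: Dict[Tuple[str, str], QueueRecord] = {}
    for rec in records:
        # Two CEs feeding the same queue of the same batch system publish
        # the same numbers, so only the first record per key is kept.
        unique.setdefault((rec.batch_system, rec.queue), rec)
    running = sum(r.running_jobs for r in unique.values())
    total = sum(r.total_jobs for r in unique.values())
    return running, total


if __name__ == "__main__":
    records = [
        QueueRecord("ce01", "torque01", "long", 100, 150),
        QueueRecord("ce02", "torque01", "long", 100, 150),  # same queue via a second CE
        QueueRecord("ce01", "torque01", "short", 20, 25),
    ]
    print("naive running jobs:", sum(r.running_jobs for r in records))
    print("de-duplicated (running, total):", site_totals(records))
```

A naive sum over the three records counts the "long" queue twice; keying on the batch system and queue name avoids that, provided the two CEs really do share one batch system.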
Documentation - KeyDocs

Friday 27th April

  • Background effort to address the shortcomings in the GridPP Approved VO list and detail records. No RPM of LSC files is planned (after discussions with Christina), so the primary approach may be to generate the Approved VO list by querying the BDII for all VOs and omitting rare or special ones. Plans are also in train to automate the document (e.g. via VomsSnooper) whenever the XML changes. Manual transcription is far too error prone.
  • Rolled out into GridPP wiki a "Grid User Crash Course", based heavily on Ewan's cheat sheet. New users and/or VOs may wish to consult this early on, to get basic feel of grid applications. References the "Glite User Guide", which remains the best reference source for all user cases.
  • I appeal for a volunteer to enhance the "Grid User Crash Course" (https://www.gridpp.ac.uk/wiki/Grid_user_crash_course) with a simple use case for dependable proxy renewal for long jobs, as this is a recurrent requirement that has caused multiple queries on TB-SUPPORT.
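
The BDII-based approach in the first item above could look roughly like the sketch below. It assumes the (site, VO) pairs have already been harvested by a BDII query done elsewhere; the function name, the `min_sites` threshold and the exclusion list are illustrative assumptions, not an agreed definition of "rare or special".

```python
from collections import defaultdict
from typing import Dict, Iterable, List, Set, Tuple


def approved_vo_candidates(
    support: Iterable[Tuple[str, str]],
    min_sites: int = 2,
    excluded: Iterable[str] = (),
) -> List[str]:
    """Return VOs supported by at least `min_sites` sites, minus excluded ones.

    `support` holds (site, vo) pairs; how they are obtained from the BDII
    is outside the scope of this sketch.
    """
    sites_per_vo: Dict[str, Set[str]] = defaultdict(set)
    for site, vo in support:
        # Normalise case so that "ATLAS" and "atlas" count as one VO.
        sites_per_vo[vo.lower()].add(site)
    skip = {vo.lower() for vo in excluded}
    return sorted(
        vo
        for vo, sites in sites_per_vo.items()
        if len(sites) >= min_sites and vo not in skip
    )


if __name__ == "__main__":
    pairs = [
        ("SiteA", "atlas"), ("SiteB", "atlas"),
        ("SiteA", "rarevo"),                 # only one site: treated as rare
        ("SiteA", "ops"), ("SiteB", "ops"),  # excluded by hand as "special"
    ]
    print(approved_vo_candidates(pairs, min_sites=2, excluded=["ops"]))
```

Generating the list from published data in this way removes the manual transcription step, at the cost of needing someone to choose the threshold and maintain the exclusion list.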
Interoperation - EGI ops agendas

Monday 7th May - EGI ops agenda

  • No EMI update, everyone in transit for EMI all hands meeting.
  • Staged Rollout: A detailed list is on the agenda - note that 'verification' is the step before SR, so anything in verification is expected in SR soon. In particular, in verification: BLAH update for CREAM, DPM, lcg-utils (for UI and WN), MyProxy and WMS.
  • TMPDIR policy. Draft policy available. Some discussion of how it relates to EDG_WN_SCRATCH, and of talking to the WLCG lot about it. Any comments, pass to Stuart Purdie by Friday 18th May.
  • TopBDII Availability: UK: 100%. Awesome. Some very long discussion of the Swiss situation (they use Germany's, so who should get a ticket if it drops out, the service provider or the NGI? Not relevant to the UK).
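
As an illustration of the TMPDIR and EDG_WN_SCRATCH question, a job wrapper might choose its working directory along these lines. This is only a sketch: the precedence (TMPDIR first, then EDG_WN_SCRATCH, then `/tmp`) is an assumption made for the example and is not taken from the draft policy, and `pick_scratch_dir` is a hypothetical helper.

```python
import os
from typing import Mapping, Optional


def pick_scratch_dir(
    env: Optional[Mapping[str, str]] = None,
    fallback: str = "/tmp",
) -> str:
    """Return the first usable scratch directory named in the environment."""
    env = os.environ if env is None else env
    # Assumed precedence, for illustration only: TMPDIR, then EDG_WN_SCRATCH.
    for var in ("TMPDIR", "EDG_WN_SCRATCH"):
        path = env.get(var, "").strip()
        # Only accept a directory that exists and that the job can write to.
        if path and os.path.isdir(path) and os.access(path, os.W_OK):
            return path
    return fallback


if __name__ == "__main__":
    workdir = pick_scratch_dir()
    print("job would run in:", workdir)
```

A wrapper would then change into the returned directory before starting the payload, so that large temporary files do not land in a shared home area.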


Monitoring - Links MyWLCG
  • Glasgow dashboard now packaged and can be downloaded here.
On-duty - Dashboard ROD Rota

Friday 25th May - AM

  • Several sites have downtimes, and a few raised non-trivial alarms during the week. All are ok again now or still in downtime, with Durham having the ce01 and se02 tickets although it's in an unplanned downtime now.
  • Glasgow had problems with MPI alarms even though MPI shouldn't have been advertised, and these alarms were eventually closed still at critical on the Dashboard along with the corresponding ticket (the Nagios pages for the nodes themselves were back to all green due to the removal of the MPI endpoints.)

Monday 14th May - JW

  • A few minor alarms at the moment on 2 sites.
  • QMUL needs to update the CA RPMs (Daniela also opened a ticket against them). They also have an SE which is not correctly functioning.
  • One 'yellow' ticket for ROD - Easter plus dashboard usability concerns


Rollout Status WLCG Baseline

Thursday 10th May

  • The CREAM CE and the WMS, which were released at the end of April, have finally gone into Staged Rollout.
  • Call for more sites to take part in EMI-2 rollout tests.
  • The overall SR contributions are in this table.

Friday 27th April

  • Updated version information on rollout page
  • WN scan indicates some sites not keen on OS updates to those nodes.
Security - Incident Procedure Policies

Tuesday 15th May

  • The next NGI/GridPP security team meeting is on Wednesday 16th.
  • Main discussion is on the rota and handover tasks.

Wednesday 2nd May

  • Setting up a security on-duty rota - to improve team and cover period after Mingchao leaves. Alessandra, Linda, Rob and Ewan are involved.
  • Currently reviewing the overall security task.
  • SSC5 preparations to start soon.


Services - PerfSonar dashboard
  • 23rd April requested network utilisation figures for March and April
  • LHCONE meeting last week in Stockholm
  • Agreed to focus on perfSONAR.
Tickets

Monday 28th May, 13:00 BST

25 tickets this week.

NGI affecting
https://ggus.eu/ws/ticket_info.php?ticket=82492
The ticket submitted by Chris regarding last week's VOMS madness; no doubt this will be discussed elsewhere in the meeting.

https://ggus.eu/ws/ticket_info.php?ticket=82491
A similar ticket submitted by Jeremy.

https://ggus.eu/ws/ticket_info.php?ticket=80259
Mark asks if the neurogrid.incf.org ticket should remain open until the fledgling VO is fully on its feet?

BRUNEL
https://ggus.eu/ws/ticket_info.php?ticket=82486
Biomed seem confused over how much free space they have at Brunel.

UCL
https://ggus.eu/ws/ticket_info.php?ticket=82462
ATLAS jobs were dying due to not switching to /tmp before running.

TIER 1
https://ggus.eu/ws/ticket_info.php?ticket=82496
t2k have been having FTS problems (possibly related to https://ggus.eu/tech/ticket_show.php?ticket=81844) where proxy delegation isn't working for them.

https://ggus.eu/ws/ticket_info.php?ticket=82495
CMS HammerCloud jobs had problems after switching to xrootd access; access was switched back to rfio until the problem can be understood.

https://ggus.eu/ws/ticket_info.php?ticket=82402
SNO+ were having proxy renewal problems, but there appears to be a light at the end of the tunnel for these ones.

https://ggus.eu/ws/ticket_info.php?ticket=82376
t2k were having problems at the Tier 1 due to delegation failures. Catalin took some actions that appeared to fix things, but Jon P wants to know if there are ways to protect against this happening again.

CAMBRIDGE
https://ggus.eu/ws/ticket_info.php?ticket=82296
ATLAS glite-ins are failing; with Santanu off to pastures green, debugging might go a little slowly.

MANCHESTER
https://ggus.eu/ws/ticket_info.php?ticket=82265
I believe this hone ticket can be closed, but I think people were CHEPing last week.

SUSSEX
https://ggus.eu/ws/ticket_info.php?ticket=81784
The Sussex Odyssey continues. There seems to be a problem whereby the certification top-bdii drops Sussex every few days, making testing problematic. Ewan suggests a ticket to the bdii-cert.hellasgrid.gr admins.

SOLVED CASES:
https://ggus.eu/ws/ticket_info.php?ticket=82464
ECDF had a problem where ATLAS jobs were failing due to running out of space. The root problem was caused by other users on the shared nodes using too much space or not cleaning up properly, but it could be indicative of ATLAS jobs' increased hunger for disk space.

NEW THIS WEEK!
A Pick of Tickets from the UK (being my first go at this section it's a little heavy handed):

https://ggus.eu/tech/ticket_show.php?ticket=79529
The UK VOMS team have a ticket open concerning the mechanisms used by VOMS servers to verify tickets.

https://ggus.eu/tech/ticket_show.php?ticket=75984
Chris W has a long-standing ticket open concerning seg faults in the tarball version of lcg-cr.

https://ggus.eu/tech/ticket_show.php?ticket=76532
Andy W submitted this ticket to the CREAM developers after some problems back in November. I think that it can be closed (unless the problems have resurfaced).

https://ggus.eu/tech/ticket_show.php?ticket=72506
Steve J submitted a bug to the CREAM/BLAH guys about a potential for an infinite loop in the CREAM/Torque interactions. No movement since February, when Steve upped the priority to "Urgent".

https://ggus.eu/tech/ticket_show.php?ticket=72358
Jon P from t2k saw myproxy delegation failures on the RAL to QMUL FTS channel. The ticket ended up with FTS development, where it was put on hold waiting for FTS 2.2.8. Is that release out and deployed (i.e. can this be another closed ticket)?

https://ggus.eu/ws/ticket_info.php?ticket=64388
Stephen Burke's older but still active ticket documenting his quest to have gstat reduce the overcounting of job slots.

https://ggus.eu/tech/ticket_show.php?ticket=63614
Chris W asked for a vomses syntax checker for yaim in 2010. No movement in this or the linked Savannah ticket (https://savannah.cern.ch/bugs/?82836) for a long time.

Tools - MyEGI Nagios

Sunday 8th April

  • Lancaster Nagios backup hardware has arrived
  • Nagios based VO testing setup for vo.southgrid.ac.uk. Nagios view is here. Tests run every 8 hours. The results are also fed back into the dashboard and alarms triggered.
  • There is a bug which prevents multiple VO Nagios tests with one Nagios instance. Developers informed.
VOs - GridPP VOMS VO IDs Approved

Friday 11th May

  • A new mailing list for VO admin contacts has been created: vo-admins at jiscmail....

Tue 15 May

  • VOs supported at sites - now have a script, just need to put it into production.
  • Grid user crash course reviewed
    • Needs information on renewing a proxy
    • Needs information on sending jobs to data


Site Updates

Tuesday 15th May

  • Sussex now in certification testing stage. The results indicate a few areas to follow up (may be on the monitoring side). A job submitted to the CE returns:

JobID=link Status = [CANCELLED] ExitCode = [] Description = [Cancelled by CE admin]



Meeting Summaries
Project Management Board - Members Minutes Quarterly Reports

Monday 14th May

  • Update from HEPSYSMAN
  • First look at quarterly reports
  • Discussion on short-term travel claims - GridPP desires to maintain some sort of approval process when universities take more control this summer
  • Brief discussion of Tier-1 network outages in recent weeks
GridPP ops meeting - Agendas Actions Core Tasks

Tuesday 8th May - link Agenda Minutes

  • ATLAS DATADISK now being used for production input files
  • Deploy perfsonar-ps (for LHC compatibility), but run a Perfsonar-MDM portal to collate the information
  • Target date for perfsonar-ps at sites is the end of July
  • KeyDocs still not in place for all areas

Tuesday 1st May 2012 - Agenda Minutes

  • Trying the present bulletin as a conduit for information
  • 7 sites below 90% in ATLAS monitoring!
  • Instructions for small VO space-token setup/usage requested
  • EMI WN tarball now available - comment in ticket
  • Not every site running latest SL5 OS
RAL Tier-1 Experiment Liaison Meeting (Wednesday 13:30) Agenda EVO meeting

Wednesday 16th May

  • Tape gateway now deployed in all CASTOR instances
  • Confirm dates for migration to the transfer manager
  • Operations report
WLCG Grid Deployment Board - Agendas MB agendas

Wednesday 9th May - Agenda

Introduction (Michel Jouvin)

• GDB summary needed. Plan to put notes in wiki linked to agenda.

• Encourage cross-VO T2 deployment of perfSONAR.

TEG next steps (Ian Bird)

• The TEGs as such (big working groups) have finished. Still work to be done.

• Further discussions at CHEP – including future of DPM.

• Deployment of glexec becomes a priority (tarball is not a solved problem)

CRSG report to the C-RRB (Ian Bird)

• Higher rates and parking data plus more analysis means slightly more resources needed (ALICE, ATLAS and CMS)

• LHCb have a revised charm physics program

• A lot more pile-up in 2012

• Use of high-level trigger farms to help with processing

• T2 installed capacity still an issue – REBUS will be used. Need to check the figures published for each site.

LHCOPN/ONE status & directions (John Shade)

• LHCOPN functioning well. Alarms: https://cclhcopnmon.in2p3.fr/LHCOPN/report/.

• perfSONAR-PS and MDM interoperability not tested. Jason Zurawski offering workshop on toolkit for site managers – any interest?

• L3VPN operations https://twiki.cern.ch/twiki/bin/view/LHCONE/WebHome.

• Routing policies important – symmetric paths

• OpenFlow as a protocol. TRILL/SPB for resilience (replace P2P links)

• PerfSONAR is a dormant setup with some selected core sites/points involved with regular testing.

• Encourage sites to install – guidelines in twiki. Sites need to setup alarms for themselves.

Federated Identity Management for HEP (David Kelsey)

• Remove the ID management from the service (use single sign-on). Adding an attribute authority (e.g VOMS) adds complexity

• Spans many communities not just HEP – common requirements being discussed (e.g. open standards, attribute aggregation) including operational ones like traceability.

• Research communities are to perform a risk analysis of using IdM.

Procedure to follow for proposed new T1 sites (Ian Bird)

• Policy document linked from agenda – discussed in 2011

• Requires experiment support, balanced against the high standards of existing T1 services.

• Prepare detailed plan. Follow tests and meet required service levels. Reach full status after about 1 year and Overview Board approval

HEPiX (Helge Meinhard)

• 23rd-27th April. Over-packed agenda. New track on business continuity

• Fabric management changing. Many labs moving to puppet. Quattor healthy. Some sites moving monitoring away from Nagios.

• Cloud computing on the horizon of realism (Openstack and OpenNebula)

• Working groups (virtualization; IPv6; Storage; Benchmarking)

WLCG workshop (Jamie Shiers)

• wlcg workshop mailing list

• Draft agenda now online

Virtualized WNs and Clouds

HEPiX Virtualisation WG report (Tony Cass)

• Set up to facilitate the instantiation of user-generated VM images at HEPiX sites

• Image endorsement – technical constraints and policy discussed

• Framework for endorsers to publish has been developed

Update on WNoDeS (Davide Salomoni)

• Worker nodes on demand service: http://web/infn.it/wnodes

• Integrates Grid and Cloud provisioning through virtualization

• Upcoming – dynamic VLANs

EGI Federated Clouds Task Force (Matteo Turilli)

• Why? Users keen on personalized environments

• Federation testbed now live

https://wiki.egi.eu/wiki/Fedcloud-tf:FederatedCloudsTaskForce

VM contextualization and image cataloguing in ATLAS (Fernando Megino)

• Why? Some sites support multiple VOs with different needs. To be ready for when cloud resources are offered.

• Work ongoing over 1 year. Still manual steps. Currently use CERNVM.

• Contextualization desired for Ganglia and Condor installs.

• Comment: Getting credentials in is fine, but there is concern about site settings for syslog etc. being overridden. Condor should be via CVMFS.


The next meeting is on 13th June. David is the T2 rep. There is a draft agenda in indico.



NGI UK - Homepage CA

Monday 14th May

  • CA: Retiring 2007 CA cert: it stopped signing last October and now *must* vanish from IGTF at the end of Sep release.
  • CA: We won't be able to meet the IGTF target of 1st October 2012 for IPv6 support for our CRLs; Networks suggest even March 2013 might be optimistic.
  • Services: Leeds to deploy a clean WLCG Nagios development platform with help from Kashif
  • UKIROC decommissioning in progress
  • NGS Ops team to reconsider resilience of core services following recent network outages at RAL

Friday 9th March NGS-CA-TAG meeting

  • Priorities discussion for the CA. Plans to be clarified for future meeting.
  • Mathew Dovey has taken over from Neil Geddes as EGI Council and EB chair.
  • Email addresses now removed from certificates.
Events

WLCG workshop - 19th-20th May (NY) Information

CHEP 2012 - 21st-25th May (NY) Agenda

UK ATLAS - Shifter view News & Links

Wednesday 2nd April

  • Testing glideinWMS but some problems spotted


Tuesday 24th April

  • ATLAS started to run reco jobs at T2s more extensively. These require bigger input data sets, which should be copied to the DATADISK space token. Some are being copied to PRODDISK. If you see PRODDISK filling up you should take action.

Thursday 3rd April

  • ECDF has a full data disk. Every time this happens the site is automatically blacklisted and cleaned up by ATLAS; however, this automatic process results in a few thousand failed transfers. There is also a concern that these failing transfers are potentially creating dark data. Wahid Bhimji is contacting ATLAS DDM operations to find out if these problems can be fixed. We are concerned that this may be a first for ATLAS as ECDF has a small amount of disk compared to its available CPU.
  • A disk server at Oxford has failed with the loss of a significant fraction of data. Users are noticing their jobs are failing at Oxford. We are going to proceed with the lost file recovery as soon as possible, although we are waiting for confirmation from the site. (https://ggus.eu/ws/ticket_info.php?ticket=81788)
  • The next meeting will be 17th May because of HEP Sysman being held at RAL on the 10/11th May.
UK CMS

Tuesday 24th April

  • Brunel will be trialling CVMFS this week, will be interesting. RALPP doing OK with it.
UK LHCb

Tuesday 24th April

  • Things are running smoothly. We are going to run a few small scale tests of new codes. This will also run at T2, one UK T2 involved. Then we will soon launch new reprocessing of all data from this year. CVMFS update from last week; fixes cache corruption on WNs.
UK OTHER

Tuesday 24th April

  • T2K are having problems with WMS proxy renewal, and some WMSes are advertising support but don't actually work. Could be user error, but will need investigation.
Requests

  • More sites needed to test EMI-2