Operations Bulletin 280512

Bulletin archive

Week commencing 28th May 2012

Task Areas

General updates

Monday 28th May

GOCDB experienced an unexpected downtime between Friday 25.05.12 19.15 and Monday 28.05.12 07.45 (all times UTC). The downtime was caused by an issue with the Oracle database used to store the GOCDB data.
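
For reference, the length of the outage follows directly from the two timestamps above. A minimal sketch, assuming the quoted times are exact and in UTC; the variable names are illustrative only:

```python
from datetime import datetime, timezone

# Start and end of the GOCDB downtime as quoted in the bulletin (UTC).
start = datetime(2012, 5, 25, 19, 15, tzinfo=timezone.utc)
end = datetime(2012, 5, 28, 7, 45, tzinfo=timezone.utc)

# Subtracting two datetimes gives a timedelta; convert it to hours.
downtime = end - start
downtime_hours = downtime.total_seconds() / 3600

print(f"GOCDB downtime: {downtime_hours:.1f} hours")
```

This gives a downtime of 60.5 hours, i.e. roughly two and a half days.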

Tuesday 22nd May

Santanu Das will be moving on to a new job next week and therefore leaving the Cambridge site in the capable hands of the group system administrator, John Hill. Santanu wants to thank everyone for their support over the years. GridPP expresses its gratitude to Santanu for many years of contribution to the project. Thanks Santanu and good luck in your new job.

Monday 14th May

There is an issue with publishing userDNs. Which sites are using the UMD APEL version?
HEPSPEC06 values used for publishing should use the 32-bit mode results.
Involvement in HEPiX or EGI IPv6 testbeds needs to be documented.
Check site gstat values
The EMI-2 release is currently expected this Friday 18th May.
The WLCG workshop is on Saturday (19th) and Sunday. Check the agenda for details.

Tier-1 - Status Page

Tuesday 22nd May

Last week we deployed disk servers to atlasStripInput.
On Tuesday 15th May there was a repeat of the CERN CRL expiring problem. This was understood and CERN now seems to be deploying new CRLs in a timely manner.
On Friday 18th May there was an issue with gdss374. This resulted in 90 lost files for Atlas. This is in addition to the 34 files lost from this machine on Monday 14th May.
On Saturday 19th May, there was a high FTS transfer failure rate with SRM_ABORTs (ATLAS, CMS, LHCb). Fixed itself at approx 17:00. Not understood.
Monday 21st May, an errata update to the FTS machines has caused the old problem of expired proxies to re-appear. We are investigating.
Ongoing problems with the batch farm. (jobs failing to start and high number of queued jobs from small VOs)

Tuesday 14th May

There was a problem on Friday evening with file transfers to RAL. Believed to be related to CRLs but not really understood. Fixed at about midnight.
One disk server, gdss374 (atlasTape (d0t1)), developed fsprobe errors - it is currently being drained.
We attempted to deploy 13 disk servers into atlasStripInput, but the disk servers immediately caused failed file transfers. We are currently draining them and investigating.
Tape Gateway deployed on GEN and LHCb instances.

Storage & Data Management - Agendas/Minutes

Wednesday 23 May 2012 - lots of exciting stuff at CHEP, we have about five things in, some posters, some oral.

Wednesday 16 May 2012 - marvellous how some tasks solve themselves if you wait long enough :-)

Wednesday 9th May 2012 - no meeting. Storage discussions took place at the HEPSYSMAN meeting. T1 uses Puppet, is this the way to go for T2s as well?

Accounting - UK Grid Metrics HEPSPEC06

Friday 11th May - HEPSYSMAN

Discussion on HS06 reminding sites to publish using results from 32-bit mode benchmarking. A reminder for new kit results to be posted to the HS06 wiki page. See also the blog article by Pete Gronbech. The HEPiX guidelines for running the benchmark tests are at this link.

Friday 4th May - TB-SUPPORT

There is a long-standing problem that gstat overcounts things like RunningJobs and TotalJobs if there are multiple CE nodes feeding the same set of queues. The gstat developers have just commented on the ticket for this, indicating that they have done something to remove the duplicates:

Check using: http://gstat2.grid.sinica.edu.tw/gstat/summary/Country/UK/
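
To illustrate why several CEs in front of one set of queues inflate the totals, here is a minimal sketch. It is not the gstat developers' fix; the record layout, field names and host names are hypothetical, and it assumes that CEs sharing a queue publish the same job counts for it:

```python
from collections import namedtuple

# Hypothetical record layout: one entry per (CE, queue) pair as published.
# The field names are illustrative and are not gstat's data model.
Published = namedtuple(
    "Published", "ce batch_system queue running_jobs total_jobs"
)

def naive_totals(records):
    """Sum every published value, so a shared queue is counted once per CE."""
    return (sum(r.running_jobs for r in records),
            sum(r.total_jobs for r in records))

def deduplicated_totals(records):
    """Count each (batch_system, queue) pair once, however many CEs publish it."""
    seen = {}
    for r in records:
        # Assumption: CEs feeding the same queue report the same numbers,
        # so keeping the first record seen for each queue is sufficient.
        seen.setdefault((r.batch_system, r.queue), r)
    return (sum(r.running_jobs for r in seen.values()),
            sum(r.total_jobs for r in seen.values()))

# Two CEs feeding the same "long" queue, plus one "short" queue.
records = [
    Published("ce01.example.ac.uk", "batch.example.ac.uk", "long", 100, 150),
    Published("ce02.example.ac.uk", "batch.example.ac.uk", "long", 100, 150),
    Published("ce01.example.ac.uk", "batch.example.ac.uk", "short", 20, 25),
]

print(naive_totals(records))         # overcounted
print(deduplicated_totals(records))  # each queue counted once
```

With this sample data the naive sum reports 220 running and 325 total jobs, while counting each queue once gives 120 and 175.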

Documentation - KeyDocs

Friday 27th April

Background effort to address the shortcomings in the GridPP Approved VO list and detail records. No RPM of LSC files is planned (after discussions with Christina), so primary approach may be to generate VO Approved list by querying BDII of all VOs, and omitting rare or special ones. Plans also in train to automate document (e.g. via VomsSnooper) whenever XML changes. Manual transcription is far too error prone.
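
The "omit rare or special ones" step could look roughly like the sketch below. This is only an illustration under stated assumptions: the site-to-VO mapping is made-up sample data standing in for whatever a BDII query returns, and the function name, the `min_sites` threshold and the `excluded` set are hypothetical choices, not an agreed policy.

```python
from collections import Counter

def approved_vo_candidates(site_vos, min_sites=2, excluded=frozenset()):
    """Return VOs supported by at least min_sites sites, minus excluded ones."""
    # Count how many sites advertise each VO (each site counted once per VO).
    counts = Counter(vo for vos in site_vos.values() for vo in set(vos))
    return sorted(vo for vo, n in counts.items()
                  if n >= min_sites and vo not in excluded)

# Made-up sample data: site name -> VOs advertised by that site.
site_vos = {
    "SITE-A": {"atlas", "cms", "ops", "localtest"},
    "SITE-B": {"atlas", "lhcb", "ops"},
    "SITE-C": {"atlas", "cms", "ops"},
}

# Treat "ops" as a special VO and drop anything seen at only one site.
candidates = approved_vo_candidates(site_vos, min_sites=2, excluded={"ops"})
print(candidates)
```

Generating the list this way keeps it reproducible whenever the published information changes, which is the point of avoiding manual transcription.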

Rolled out into GridPP wiki a "Grid User Crash Course", based heavily on Ewan's cheat sheet. New users and/or VOs may wish to consult this early on, to get a basic feel of grid applications. References the "gLite User Guide", which remains the best reference source for all use cases.

I appeal for a volunteer to enhance "Grid User Crash Course" (https://www.gridpp.ac.uk/wiki/Grid_user_crash_course) with a simple use case for dependable proxy renewal for long jobs, as this is a recurrent requirement that has caused multiple queries on TB-SUPPORT.

Interoperation - EGI ops agendas

Monday 7th May - EGI ops agenda

No EMI update, everyone in transit for EMI all hands meeting.
Staged Rollout: A detailed list is on the agenda - note that 'verification' is the step before SR, so anything in verification is expected in SR soon. In particular, in verification: BLAH update for CREAM, DPM, lcg-utils (for UI and WN), MyProxy and WMS.
TMPDIR policy. Draft policy available. Some discussion of how it relates to EDG_WN_SCRATCH, and talking to the WLCG lot about it. Any comments, pass to Stuart Purdie by Friday 18th May.
TopBDII Availability: UK: 100%. Awesome. Some very long discussion of the Swiss situation (they use Germany's, so who should get a ticket if it drops out, the service provider or the NGI? Not relevant to the UK).

Monitoring - Links MyWLCG

Current priority is ranking the tools available.

Glasgow dashboard now packaged and can be downloaded here.

On-duty - Dashboard ROD Rota

Friday 25th May - AM

Several sites have downtimes, and a few raised non-trivial alarms during the week. All are ok again now or still in downtime, with Durham having the ce01 and se02 tickets although it's in an unplanned downtime now.

Glasgow had problems with MPI alarms even though MPI shouldn't have been advertised, and these alarms were eventually closed still at critical on the Dashboard along with the corresponding ticket (the Nagios pages for the nodes themselves were back to all green due to the removal of the MPI endpoints.)

Monday 14th May - JW

A few minor alarms at the moment on 2 sites.

QMUL needs to update the CA RPMs (Daniela also opened a ticket against them). They also have an SE which is not correctly functioning.

One 'yellow' ticket for ROD - Easter plus dashboard usability concerns

Rollout Status WLCG Baseline

Thursday 10th May

The CREAM CE and the WMS which were released at the end of April have finally gone into Staged Rollout.
Call for more sites to take part in EMI-2 rollout tests.
The overall SR contributions are in this table.

Friday 27th April

Updated version information on rollout page
WN scan indicates some sites not keen on OS updates to those nodes.

Security - Incident Procedure Policies

Tuesday 15th May

The next NGI/GridPP security team meeting is on Wednesday 16th.

Main discussion is on the rota and handover tasks.

Wednesday 2nd May

Setting up a security on-duty rota - to improve team and cover period after Mingchao leaves. Alessandra, Linda, Rob and Ewan are involved.

Currently reviewing the overall security task.

SSC5 preparations to start soon.

Services - PerfSonar dashboard

23rd April requested network utilisation figures for March and April
LHCONE meeting last week in Stockholm
Agreed to focus on perfSONAR.

Tickets

Monday 28th May, 13:00 BST

25 tickets this week.

NGI affecting https://ggus.eu/ws/ticket_info.php?ticket=82492 The ticket submitted by Chris regarding last week's VOMS madness; no doubt this will be discussed elsewhere in the meeting.

https://ggus.eu/ws/ticket_info.php?ticket=82491 A similar ticket submitted by Jeremy.

https://ggus.eu/ws/ticket_info.php?ticket=80259 Mark asks if the neurogrid.incf.org ticket should remain open until the fledgling VO is fully on its feet?

BRUNEL https://ggus.eu/ws/ticket_info.php?ticket=82486 Biomed seem confused over how much free space they have at Brunel.

UCL https://ggus.eu/ws/ticket_info.php?ticket=82462 Atlas jobs were dying due to not switching to /tmp before running.

TIER 1 https://ggus.eu/ws/ticket_info.php?ticket=82496 t2k have been having fts problems (possibly related to https://ggus.eu/tech/ticket_show.php?ticket=81844) where proxy delegation isn't working for them.

https://ggus.eu/ws/ticket_info.php?ticket=82495 CMS hammer cloud jobs had problems, caused after switching to xrootd access, switched back to rfio until the problem can be understood.

https://ggus.eu/ws/ticket_info.php?ticket=82402 SNO+ were having proxy renewal problems, but there appears to be a light at the end of the tunnel for these ones.

https://ggus.eu/ws/ticket_info.php?ticket=82376 t2k were having problems at the tier 1 due to delegation failures, Catalin took some actions that appeared to fix things but Jon P wants to know if there are ways to protect against this happening again.

CAMBRIDGE https://ggus.eu/ws/ticket_info.php?ticket=82296 Atlas glite-ins are failing, with Santanu off to pastures green debugging might go a little slowly.

MANCHESTER https://ggus.eu/ws/ticket_info.php?ticket=82265 I believe this hone ticket can be closed, but I think people were CHEPing last week.

SUSSEX https://ggus.eu/ws/ticket_info.php?ticket=81784 The Sussex Odyssey continues. There seems to be a problem whereby the certification top-bdii drops Sussex every few days, making testing problematic. Ewan suggests a ticket to the bdii-cert.hellasgrid.gr admins.

SOLVED CASES: https://ggus.eu/ws/ticket_info.php?ticket=82464 ECDF had a problem where ATLAS jobs were failing due to running out of space. The root problem was caused by other users on the shared nodes using too much space/not cleaning up properly, but could be indicative of ATLAS jobs' increased hunger for disk space.

NEW THIS WEEK! A Pick of Tickets from the UK (being my first go at this section it's a little heavy handed):

https://ggus.eu/tech/ticket_show.php?ticket=79529 The UK Voms team have a ticket open concerning the mechanisms used by voms servers to verify tickets.

https://ggus.eu/tech/ticket_show.php?ticket=75984 Chris W has a long standing ticket open concerning seg faults in the tarball version of lcg-cr.

https://ggus.eu/tech/ticket_show.php?ticket=76532 Andy W submitted this ticket to the CREAM developers after some problems back in November. I think that it can be closed (unless the problems resurfaced).

https://ggus.eu/tech/ticket_show.php?ticket=72506 Steve J submitted a bug to the CREAM/BLAH guys about a potential for an infinite loop in the CREAM/Torque interactions. No movement since February, when Steve upped the priority to "Urgent".

https://ggus.eu/tech/ticket_show.php?ticket=72358 Jon P from t2k saw myproxy delegation failures on the RAL to QMUL FTS channel. The ticket ended up with FTS development, where it was put on hold waiting for FTS 2.2.8. Is that release out and deployed (i.e. can this be another closed ticket?).

https://ggus.eu/ws/ticket_info.php?ticket=64388 Stephen Burke's older but still active ticket documenting his quest to have gstat reduce the overcounting of jobslots.

https://ggus.eu/tech/ticket_show.php?ticket=63614 Chris W asked for a vomses syntax checker for yaim in 2010. No movement in this or the linked Savannah ticket (https://savannah.cern.ch/bugs/?82836) for a long time.

Tools - MyEGI Nagios

Sunday 8th April

Lancaster Nagios backup hardware has arrived

Nagios based VO testing setup for vo.southgrid.ac.uk. Nagios view is here. Tests run every 8 hours. The results are also fed back into the dashboard and alarms triggered.

There is a bug which prevents multiple VO Nagios tests with one Nagios instance. Developers informed.

VOs - GridPP VOMS VO IDs Approved

Friday 11th May

A new mailing list for VO admin contacts has been created: vo-admins at jiscmail....

Tue 15 May

WMS at RAL now renews proxies https://ggus.eu/ws/ticket_info.php?ticket=81606 - At long long last. Unfortunately, other

VOs supported at sites - now have a script, just need to put it into production.

Grid user crash course reviewed
- Needs information on renewing a proxy
- Needs information on sending jobs to data

Site Updates

Tuesday 15th May

Sussex now in certification testing stage. The results indicate a few areas to follow up (may be on the monitoring side). A job submitted to the CE returns:

JobID=link Status = [CANCELLED] ExitCode = [] Description = [Cancelled by CE admin]
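
When following up results like this across many test jobs, it can help to split the status line into fields. A minimal sketch, assuming the line always has the `Key = [value]` shape shown above; the regular expression and function name are illustrative and not part of any CREAM or gLite tool:

```python
import re

# A field is "Key = [bracketed value]" or "Key=bareword".
FIELD = re.compile(r"(\w+)\s*=\s*(\[[^\]]*\]|\S+)")

def parse_status_line(line):
    """Split a job status line into a dict of field name -> value."""
    fields = {}
    for key, value in FIELD.findall(line):
        # Strip the surrounding brackets from bracketed values.
        if value.startswith("[") and value.endswith("]"):
            value = value[1:-1]
        fields[key] = value
    return fields

line = ("JobID=link Status = [CANCELLED] ExitCode = [] "
        "Description = [Cancelled by CE admin]")
status = parse_status_line(line)

if status["Status"] == "CANCELLED":
    print("Job cancelled:", status["Description"])
```

For the line above this yields a `Status` of `CANCELLED`, an empty `ExitCode` and the description `Cancelled by CE admin`.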

Meeting Summaries

Project Management Board - Members Minutes Quarterly Reports

Monday 14th May

Update from HEPSYSMAN
First look at quarterly reports
Discussion on short-term travel claims - GridPP desires to maintain some sort of approval process when universities take more control this summer
Brief discussion of Tier-1 network outages in recent weeks

GridPP ops meeting - Agendas Actions Core Tasks

Tuesday 8th May - link Agenda Minutes

ATLAS DATADISK now being used for production input files
Deploy perfsonar-ps (for LHC compatibility), but run a Perfsonar-MDM portal to collate the information
Target date for perfsonar-ps at sites is the end of July
KeyDocs still not in place for all areas

Tuesday 1st May 2012 - Agenda Minutes

Trying the present bulletin as a conduit for information
7 sites below 90% in ATLAS monitoring!
Instructions for small VO space-token setup/usage requested
EMI WN tarball now available - comment in ticket
Not every site running latest SL5 OS

RAL Tier-1 Experiment Liaison Meeting (Wednesday 13:30) Agenda EVO meeting

Wednesday 16th May

Tape gateway now deployed in all CASTOR instances
Confirm dates for migration to the transfer manager
Operations report

WLCG Grid Deployment Board - Agendas MB agendas

Wednesday 9th May - Agenda

Introduction (Michel Jouvin)

• GDB summary needed. Plan to put notes in wiki linked to agenda.

• Encourage cross-VO T2 deployment of perfSONAR.

TEG next steps (Ian Bird)

• The TEGs as such (big working groups) have finished. Still work to be done.

• Further discussions at CHEP – including future of DPM.

• Deployment of glexec becomes a priority (tarball is not a solved problem)

CRSG report to the C-RRB (Ian Bird)

• Higher rates and parking data plus more analysis means slightly more resources needed (ALICE, ATLAS and CMS)

• LHCb have a revised charm physics program

• A lot more pile-up in 2012

• Use of high-level trigger farms to help with processing

• T2 installed capacity still an issue – REBUS will be used. Need to check the figures published for each site.

LHCOPN/ONE status & directions (John Shade)

• LHCOPN functioning well. Alarms: https://cclhcopnmon.in2p3.fr/LHCOPN/report/.

• perfSONAR-PS and MDM interoperability not tested. Jason Zurawski offering workshop on toolkit for site managers – any interest?

• L3VPN operations https://twiki.cern.ch/twiki/bin/view/LHCONE/WebHome.

• Routing policies important – symmetric paths

• OpenFlow as a protocol. TRILL/SPB for resilience (replace P2P links)

• PerfSONAR is a dormant setup with some selected core sites/points involved with regular testing.

• Encourage sites to install – guidelines in twiki. Sites need to setup alarms for themselves.

Federated Identity Management for HEP (David Kelsey)

• Remove the ID management from the service (use single sign-on). Adding an attribute authority (e.g. VOMS) adds complexity

• Spans many communities not just HEP – common requirements being discussed (e.g. open standards, attribute aggregation) including operational ones like traceability.

• Research communities are to perform a risk analysis of using IdM.

Procedure to follow for proposed new T1 sites (Ian Bird)

• Policy document linked from agenda – discussed in 2011

• Requires experiment support, balanced against the high standards of existing T1 services.

• Prepare detailed plan. Follow tests and meet required service levels. Reach full status after about 1 year and Overview Board approval

HEPiX (Helge Meinhard)

• 23rd-27th April. Over-packed agenda. New track on business continuity

• Fabric management changing. Many labs moving to puppet. Quattor healthy. Some sites moving monitoring away from Nagios.

• Cloud computing on the horizon of realism (Openstack and OpenNebula)

• Working groups (virtualization; IPv6; Storage; Benchmarking)

WLCG workshop (Jamie Shiers)

• wlcg workshop mailing list

• Draft agenda now online

Virtualized WNs and Clouds

HEPiX Virtualisation WG report (Tony Cass)

• Set up to facilitate the instantiation of user-generated VM images at HEPiX sites

• Image endorsement – technical constraints and policy discussed

• Framework for endorsers to publish has been developed

Update on WNoDeS (Davide Salomoni)

• Worker Nodes on Demand Service: http://web.infn.it/wnodes

• Integrates Grid and Cloud provisioning through virtualization

• Upcoming – dynamic VLANs

EGI Federated Clouds Task Force (Matteo Turilli)

• Why? Users keen on personalized environments

• Federation testbed now live

• https://wiki.egi.eu/wiki/Fedcloud-tf:FederatedCloudsTaskForce

VM contextualization and image cataloguing in ATLAS (Fernando Megino)

• Why? Some sites support multiple VOs with different needs. To be ready for when cloud resources are offered.

• Work ongoing over 1 year. Still manual steps. Currently use CERNVM.

• Contextualization desired for Ganglia and Condor installs.

• Comment: Getting credentials in is fine, but there is concern about site settings for syslog etc. being overridden. Condor should be via CVMFS.

The next meeting is on 13th June. David is the T2 rep. There is a draft agenda in indico.

NGI UK - Homepage CA

Monday 14th May

CA: Retiring 2007 CA cert: it stopped signing last October and now *must* vanish from IGTF at the end of Sep release.
CA: We won't be able to meet IGTF targets of 1st October 2012 for IPv6 support for our CRLs, Networks suggest even March 2013 might be optimistic.
Services: Leeds to deploy a clean WLCG Nagios development platform with help from Kashif
UKIROC decommissioning in progress
NGS Ops team to reconsider resilience of core services following recent network outages at RAL

Friday 9th March NGS-CA-TAG meeting

Priorities discussion for the CA. Plans to be clarified for future meeting.
Mathew Dovey has taken over from Neil Geddes as EGI Council and EB chair.

Email addresses now removed from certificates.

Events

WLCG workshop - 19th-20th May (NY) Information

CHEP 2012 - 21st-25th May (NY) Agenda

UK ATLAS - Shifter view News & Links

Wednesday 2nd April

Testing glideinWMS but some problems spotted

Tuesday 24th April

ATLAS started to run reco jobs at T2s more extensively. These require bigger input data sets. They should be copied to the DATADISK space token. Some are copying to PRODDISK. If you see PRODDISK filling up you should take action.

Thursday 3rd April

ECDF has a full data disk. Every time this happens the site is automatically blacklisted and cleaned up by ATLAS; however, this automatic process results in a few thousand failed transfers. There is also a concern that these failing transfers are potentially creating dark data. Wahid Bhimji is contacting ATLAS DDM operations to find out if these problems can be fixed. We are concerned that this may be a first for ATLAS as ECDF has a small amount of disk compared to its available CPU.

A disk server at Oxford has failed with the loss of a significant fraction of data. Users are noticing their jobs are failing at Oxford. We are going to proceed with the lost file recovery as soon as possible, although we are waiting for confirmation from the site. (https://ggus.eu/ws/ticket_info.php?ticket=81788)

The next meeting will be 17th May because of HEP Sysman being held at RAL on the 10/11th May.

UK CMS

Tuesday 24th April

Brunel will be trialling CVMFS this week, will be interesting. RALPP doing OK with it.

UK LHCb

Tuesday 24th April

Things are running smoothly. We are going to run a few small scale tests of new codes. This will also run at T2, one UK T2 involved. Then we will soon launch new reprocessing of all data from this year. CVMFS update from last week; fixes cache corruption on WNs.

UK OTHER

Tuesday 24th April

T2K are having problems with WMS proxy renewal and some WMSes are advertising support but don't actually work. Could be user error, but will need investigation.

Requests

More sites needed to test EMI-2
