Operations Bulletin 210512

From GridPP Wiki

Bulletin archive


Week commencing 14th May 2012
Task Areas
General updates

Monday 14th May

  • There is an issue with publishing userDNs. Which sites are using the UMD APEL version?
  • HEPSPEC06 values used for publishing should use the 32-bit mode results.
  • Involvement in HEPiX or EGI IPv6 testbeds needs to be documented.
  • Check site gstat values
  • The EMI-2 release is currently expected this Friday 18th May.
  • The WLCG workshop is on Saturday (19th) and Sunday. Check the agenda for details.
Tier-1 - Status Page

Tuesday 15th May

  • There was a problem on Friday evening with file transfers to RAL. Believed to be related to CRLs but not really understood. Fixed at about midnight.
  • One disk server, gdss374 (atlasTape (d0t1)), developed fsprobe errors - it is currently being drained.
  • We attempted to deploy 13 disk servers into atlasStripInput, but the disk servers immediately caused failed file transfers. We are currently draining them and investigating.
  • Tape Gateway deployed on GEN and LHCb instances.

Tuesday 8th May

  • Last week we deployed five additional disk servers for Alice (these will replace five smaller ones) and ten new disk servers for LHCb.
  • Apart from a couple of short disk server problems it has been a fairly quiet week.
  • There was a local problem that did not affect the Tier1 but caused some problems for the GOC DB (also hosted here) from Thursday evening (3rd May) through Friday morning.
Storage & Data Management - Agendas/Minutes

Wednesday 9th May 2012 - no meeting. Storage discussions took place at the HEPSYSMAN meeting.

Wednesday 02 May 2012

  • Discussion about discussions at the hepsysman storage afternoon. Filesystems always "stimulate" discussion, but rarely change recommendations. Ricardo and Sam to present remotely.
  • HEPiX summary by James triggered discussion about redundancy within a filesystem (e.g. RAID) vs across nodes (RAIN) vs across sites (grid). Also discussed: the requirement for recent Linux kernels (not available in SL5/SL6) to improve XFS performance (Sam) and to support NFS4 clients.
  • AOB: NO meeting next week, as we have the storage half day on Thursday.


Accounting - UK Grid Metrics HEPSPEC06

Friday 11th May - HEPSYSMAN

  • Discussion on HS06 reminding sites to publish using results from 32-bit mode benchmarking. A reminder for new kit results to be posted to the HS06 wiki page. See also the blog article by Pete Gronbech. The HEPiX guidelines for running the benchmark tests are at this link.


Friday 4th May - TB-SUPPORT

  • Long-standing problem: gstat overcounts things like RunningJobs and TotalJobs if there are multiple CE nodes feeding the same set of queues. The gstat developers have just commented on the ticket for this, indicating that they have done something to remove the duplicates.
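
The overcounting described above can be illustrated with a minimal sketch. It assumes each CE publishes one record per queue and that two records describe the same queue when they share a batch system and queue name. The `QueueRecord` type, its field names and the host names are hypothetical, and this is not necessarily how the gstat developers removed the duplicates.

```python
from typing import List, NamedTuple, Tuple


class QueueRecord(NamedTuple):
    """One published queue record (hypothetical structure, for illustration only)."""
    ce_host: str        # CE node that published this record
    batch_system: str   # identifier of the batch system behind the CE (assumed field)
    queue: str          # queue name
    running_jobs: int
    total_jobs: int


def naive_totals(records: List[QueueRecord]) -> Tuple[int, int]:
    """Sum every record, which double counts queues published by several CEs."""
    return (sum(r.running_jobs for r in records),
            sum(r.total_jobs for r in records))


def deduplicated_totals(records: List[QueueRecord]) -> Tuple[int, int]:
    """Count each (batch system, queue) pair once, keeping the first record seen."""
    unique = {}
    for r in records:
        unique.setdefault((r.batch_system, r.queue), r)
    return (sum(r.running_jobs for r in unique.values()),
            sum(r.total_jobs for r in unique.values()))


# Two CEs feeding the same "long" queue, plus one "short" queue.
records = [
    QueueRecord("ce01.example.org", "batch.example.org", "long", 100, 150),
    QueueRecord("ce02.example.org", "batch.example.org", "long", 100, 150),
    QueueRecord("ce01.example.org", "batch.example.org", "short", 20, 25),
]
print(naive_totals(records))         # (220, 325)
print(deduplicated_totals(records))  # (120, 175)
```

Keeping the first record for each shared queue is a simplification; if CEs publish at different times, their figures for the same queue may differ slightly.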
Documentation - KeyDocs

Friday 27th April

  • Background effort to address the shortcomings in the GridPP Approved VO list and detail records. No RPM of LSC files is planned (after discussions with Christina), so the primary approach may be to generate the Approved VO list by querying the BDII for all VOs and omitting rare or special ones. Plans are also in train to regenerate the document automatically (e.g. via VomsSnooper) whenever the XML changes. Manual transcription is far too error-prone.
  • Rolled out into the GridPP wiki a "Grid User Crash Course", based heavily on Ewan's cheat sheet. New users and/or VOs may wish to consult this early on, to get a basic feel for grid applications. It references the "gLite User Guide", which remains the best reference source for all use cases.
  • I appeal for a volunteer to enhance the "Grid User Crash Course" (https://www.gridpp.ac.uk/wiki/Grid_user_crash_course) with a simple use case for dependable proxy renewal for long jobs, as this is a recurrent requirement that has caused multiple queries on TB-SUPPORT.
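
The idea of generating the Approved VO list from published data, while omitting rare or special VOs, might look roughly like the sketch below. It assumes the per-site VO lists have already been retrieved from the BDII; the function name, the `min_sites` threshold, the site labels and the exclusion set are illustrative assumptions, not part of any agreed procedure.

```python
from collections import Counter
from typing import Dict, FrozenSet, Iterable, List


def approved_vo_list(site_vos: Dict[str, Iterable[str]],
                     min_sites: int = 2,
                     excluded: FrozenSet[str] = frozenset()) -> List[str]:
    """Return VOs supported by at least `min_sites` sites, minus excluded ones.

    `site_vos` maps a site name to the VOs it publishes (assumed to have been
    fetched from the BDII beforehand). A VO is counted once per site.
    """
    counts = Counter(vo for vos in site_vos.values() for vo in set(vos))
    return sorted(vo for vo, n in counts.items()
                  if n >= min_sites and vo not in excluded)


# Hypothetical input: site labels are placeholders.
site_vos = {
    "SITE-A": ["atlas", "lhcb", "na62"],
    "SITE-B": ["atlas", "lhcb", "neurogrid.incf.org"],
    "SITE-C": ["atlas", "vo.southgrid.ac.uk", "ops"],
}
print(approved_vo_list(site_vos, min_sites=2, excluded=frozenset({"ops"})))
# ['atlas', 'lhcb']
```

A count threshold is only one possible definition of "rare"; an explicit exclusion list covers the special cases that a threshold would miss.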
Interoperation - EGI ops agendas

Monday 7th May - EGI ops agenda

  • No EMI update, everyone in transit for EMI all hands meeting.
  • Staged Rollout: A detailed list is on the agenda - note that 'verification' is the step before SR, so anything in verification is expected in SR soon. In particular, in verification: BLAH update for CREAM, DPM, lcg-utils (for UI and WN), MyProxy and WMS.
  • TMPDIR policy. Draft policy available. Some discussion of how it relates to EDG_WN_SCRATCH, and of talking to the WLCG lot about it. Any comments, pass to Stuart Purdie by Friday 18th May.
  • TopBDII Availability: UK: 100%. Awesome. Some very long discussion of the Swiss situation (they use Germany's, so who should get a ticket if it drops out, the service provider or the NGI? Not relevant to the UK).
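
To make the TMPDIR and EDG_WN_SCRATCH relationship concrete, here is a minimal sketch of how a job wrapper might choose a scratch directory. The bulletin does not state what the draft policy says, so the precedence used here (TMPDIR first, then EDG_WN_SCRATCH, then the system default) is purely an assumption, as is the function name.

```python
import os
import tempfile
from typing import Mapping, Optional


def pick_scratch_dir(env: Optional[Mapping[str, str]] = None) -> str:
    """Return the first usable scratch directory.

    Assumed precedence (not taken from the draft policy):
    TMPDIR, then EDG_WN_SCRATCH, then the system default temp directory.
    """
    env = os.environ if env is None else env
    for var in ("TMPDIR", "EDG_WN_SCRATCH"):
        path = env.get(var)
        # Only accept a value that points at an existing, writable directory.
        if path and os.path.isdir(path) and os.access(path, os.W_OK):
            return path
    return tempfile.gettempdir()


if __name__ == "__main__":
    print(pick_scratch_dir())
```

Passing the environment as a parameter keeps the function easy to test without modifying the real process environment.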


Monitoring - Links MyWLCG
  • Glasgow dashboard now packaged and can be downloaded here.
On-duty - Dashboard ROD Rota

Monday 14th May - JW

  • A few minor alarms at the moment on 2 sites.
  • QMUL needs to update the CA RPMs (Daniela also opened a ticket against them). They also have an SE which is not correctly functioning.
  • One 'yellow' ticket for ROD - Easter plus dashboard usability concerns


Monday 30th April - KM

  • Imperial and RALPP had intermittent org.sam.SRM-GetTURLs failures because of a mysterious issue with lcg_utils or the gridppnagios machine. This ticket was opened. As yet the underlying cause has not been found but the suggested workaround will be applied.
Rollout Status WLCG Baseline

Friday 18th May

  • EMI-2 (Matterhorn) was released.
  • glexec tarball built at IC, but its configuration is unclear.

Thursday 10th May

  • The CREAM CE and the WMS, which were released at the end of April, have finally gone into Staged Rollout.
  • Call for more sites to take part in EMI-2 rollout tests.
  • The overall SR contributions are in this table.

Friday 27th April

  • Updated version information on rollout page
  • WN scan indicates some sites not keen on OS updates to those nodes.
Security - Incident Procedure Policies

Wednesday 16th May

  • Update meeting held on 16th May. Team approach reviewed and roles checked.
  • Looks likely that SSC6 will take place in June followed by SSC5.


Services - PerfSonar dashboard
  • 23rd April requested network utilisation figures for March and April
  • LHCONE meeting last week in Stockholm
  • Agreed to focus on perfSONAR.
Tickets

Monday 21st May, 12:00 BST

20 Open UK tickets this week.

NGI
https://ggus.eu/ws/ticket_info.php?ticket=80259
Positive progress being made with the new neuroscience VO neurogrid.incf.org.

TIER 1
https://ggus.eu/ws/ticket_info.php?ticket=82100
SNO+ are having difficulties using srm-snoplus.gridpp.rl.ac.uk. RAL are having trouble getting the DEFAULT_SE value to publish.

Brunel
https://ggus.eu/ws/ticket_info.php?ticket=82341
Brunel is being hit by a Torque bug affecting LHCb jobs; they are implementing a workaround.

RHUL
https://ggus.eu/ws/ticket_info.php?ticket=82320
An ATLAS user's jobs are suffering a 50% failure rate. After a very good job postmortem by Duncan, it appears that the failed jobs aren't setting up properly (incorrect/incomplete paths).

BIRMINGHAM
https://ggus.eu/ws/ticket_info.php?ticket=82284
ATLAS are seeing library problems: some libraries cannot be preloaded. Seems similar to previous problems with libraries.

GLASGOW
https://ggus.eu/ws/ticket_info.php?ticket=82191
na62 transfers to Glasgow were failing due to the srmv2 interface not publishing na62 support. Sam fixed it, and it looks like the ticket can be closed.

SUSSEX
https://ggus.eu/ws/ticket_info.php?ticket=81784
The 12 Tasks of Emyr. He's currently trying to tame his CREAM CE (see his mail today to TB-SUPPORT); any help would be appreciated.

Solved Case File:
https://ggus.eu/ws/ticket_info.php?ticket=82081
QMUL got ticketed due to their "test" SE failing Ops jobs. It would be wise to prevent this from happening again (would removing it from the GOCDB allow it to avoid tests but still operate fully for testing?).

From the UK with Love:
Following Stephen Burke's suggestion, searching tickets via DN to track those submitted by UKers seems to reveal some good results, but sadly no EMI tickets. In future weeks I'll start trying to get a handle on the list of open EMI UK-submitted tickets to see if I can catch any relevant tidbits. FYI EMI tickets are at: http://tinyurl.com/cu424oa

Of Interest to the Ops team (particularly Chris W):
na62 relevant deployment tickets:
https://ggus.eu/ws/ticket_info.php?ticket=82327
Documents the validation progress.
https://ggus.eu/ws/ticket_info.php?ticket=81669
Documents the setting up of the FTS channels at CERN (bounced from RAL).

Tools - MyEGI Nagios

Sunday 8th April

  • Lancaster Nagios backup hardware has arrived
  • Nagios based VO testing setup for vo.southgrid.ac.uk. Nagios view is here. Tests run every 8 hours. The results are also fed back into the dashboard and alarms triggered.
  • There is a bug which prevents multiple VO Nagios tests with one Nagios instance. Developers informed.
VOs - GridPP VOMS VO IDs Approved

Friday 11th May

  • A new mailing list for VO admin contacts has been created: vo-admins at jiscmail....

Tue 15 May

  • VOs supported at sites - now have a script, just need to put it into production.
  • Grid user crash course reviewed
    • Needs information on renewing a proxy
    • Needs information on sending jobs to data


Site Updates

Tuesday 22nd May

  • Sussex still in certification testing stage. The results indicate ongoing job submission problems. A job submitted to the CE returns:

JobID=link Status = [CANCELLED] ExitCode = [] Description = [Cancelled by CE admin]



Meeting Summaries
Project Management Board - Members Minutes Quarterly Reports

Monday 14th May

  • Update from HEPSYSMAN
  • First look at quarterly reports
  • Discussion on short-term travel claims - GridPP desires to maintain some sort of approval process when universities take more control this summer
  • Brief discussion of Tier-1 network outages in recent weeks
GridPP ops meeting - Agendas Actions Core Tasks

Tuesday 8th May - link Agenda Minutes

  • ATLAS DATADISK now being used for production input files
  • Deploy perfsonar-ps (for LHC compatibility), but run a Perfsonar-MDM portal to collate the information
  • Target date for perfsonar-ps at sites is the end of July
  • KeyDocs still not in place for all areas

Tuesday 1st May 2012 - Agenda Minutes

  • Trying the present bulletin as a conduit for information
  • 7 sites below 90% in ATLAS monitoring!
  • Instructions for small VO space-token setup/usage requested
  • EMI WN tarball now available - comment in ticket
  • Not every site running latest SL5 OS
RAL Tier-1 Experiment Liaison Meeting (Wednesday 13:30) Agenda EVO meeting

Wednesday 16th May

  • Tape gateway now deployed in all CASTOR instances
  • Confirm dates for migration to the transfer manager
  • Operations report
WLCG Grid Deployment Board - Agendas MB agendas

Wednesday 9th May - Agenda

Introduction (Michel Jouvin)

• GDB summary needed. Plan to put notes in wiki linked to agenda.

• Encourage cross-VO T2 deployment of perfSONAR.

TEG next steps (Ian Bird)

• The TEGs as such (big working groups) have finished. Still work to be done.

• Further discussions at CHEP – including future of DPM.

• Deployment of glexec becomes a priority (tarball is not a solved problem)

CRSG report to the C-RRB (Ian Bird)

• Higher rates and parking data plus more analysis means slightly more resources needed (ALICE, ATLAS and CMS)

• LHCb have a revised charm physics program

• A lot more pile-up in 2012

• Use of high-level trigger farms to help with processing

• T2 installed capacity still an issue – REBUS will be used. Need to check the figures published for each site.

LHCOPN/ONE status & directions (John Shade)

• LHCOPN functioning well. Alarms: https://cclhcopnmon.in2p3.fr/LHCOPN/report/.

• perfSONAR-PS and MDM interoperability not tested. Jason Zurawski offering workshop on toolkit for site managers – any interest?

• L3VPN operations https://twiki.cern.ch/twiki/bin/view/LHCONE/WebHome.

• Routing policies important – symmetric paths

• OpenFlow as a protocol. TRILL/SPB for resilience (replace P2P links)

• PerfSONAR is a dormant setup with some selected core sites/points involved with regular testing.

• Encourage sites to install – guidelines in twiki. Sites need to setup alarms for themselves.

Federated Identity Management for HEP (David Kelsey)

• Remove the ID management from the service (use single sign-on). Adding an attribute authority (e.g. VOMS) adds complexity

• Spans many communities not just HEP – common requirements being discussed (e.g. open standards, attribute aggregation) including operational ones like traceability.

• Research communities are to perform a risk analysis of using IdM.

Procedure to follow for proposed new T1 sites (Ian Bird)

• Policy document linked from agenda – discussed in 2011

• Requires experiment support, balanced against the high standards of existing T1 services.

• Prepare detailed plan. Follow tests and meet required service levels. Reach full status after about 1 year and Overview Board approval

HEPiX (Helge Meinhard)

• 23rd-27th April. Over-packed agenda. New track on business continuity

• Fabric management changing. Many labs moving to puppet. Quattor healthy. Some sites moving monitoring away from Nagios.

• Cloud computing on the horizon of realism (Openstack and OpenNebula)

• Working groups (virtualization; IPv6; Storage; Benchmarking)

WLCG workshop (Jamie Shiers)

• wlcg workshop mailing list

• Draft agenda now online

Virtualized WNs and Clouds

HEPiX Virtualisation WG report (Tony Cass)

• Set up to facilitate the instantiation of user-generated VM images at HEPiX sites

• Image endorsement – technical constraints and policy discussed

• Framework for endorsers to publish has been developed

Update on WNoDeS (Davide Salomoni)

• Worker Nodes on Demand Service: http://web.infn.it/wnodes

• Integrates Grid and Cloud provisioning through virtualization

• Upcoming – dynamic VLANs

EGI Federated Clouds Task Force (Matteo Turilli)

• Why? Users keen on personalized environments

• Federation testbed now live

https://wiki.egi.eu/wiki/Fedcloud-tf:FederatedCloudsTaskForce

VM contextualization and image cataloguing in ATLAS (Fernando Megino)

• Why? Some sites support multiple VOs with different needs. To be ready for when cloud resources are offered.

• Work ongoing over 1 year. Still manual steps. Currently use CERNVM.

• Contextualization desired for Ganglia and Condor installs.

• Comment: Getting credentials in is fine, but there is concern about site settings for syslog etc. being overridden. Condor should be via CVMFS.


The next meeting is on 13th June. David is the T2 rep. There is a draft agenda in indico.



NGI UK - Homepage CA

Monday 14th May

  • CA: Retiring 2007 CA cert: it stopped signing last October and now *must* vanish from IGTF at the end of Sep release.
  • CA: We won't be able to meet the IGTF target of 1st October 2012 for IPv6 support for our CRLs; Networks suggest even March 2013 might be optimistic.
  • Services: Leeds to deploy a clean WLCG Nagios development platform with help from Kashif
  • UKIROC decommissioning in progress
  • NGS Ops team to reconsider resilience of core services following recent network outages at RAL

Friday 9th March NGS-CA-TAG meeting

  • Priorities discussion for the CA. Plans to be clarified for future meeting.
  • Mathew Dovey has taken over from Neil Geddes as EGI Council and EB chair.
  • Email addresses now removed from certificates.
Events

WLCG workshop - 19th-20th May (NY) Information

CHEP 2012 - 21st-25th May (NY) Agenda

UK ATLAS - Shifter view News & Links

Wednesday 2nd May

  • Testing glideinWMS but some problems spotted


Tuesday 24th April

  • ATLAS has started to run reco jobs at T2s more extensively. These require bigger input data sets, which should be copied to the DATADISK space token. Some are being copied to PRODDISK. If you see PRODDISK filling up you should take action.

Thursday 3rd May

  • ECDF has a full data disk. Every time this happens the site is automatically blacklisted and cleaned up by ATLAS; however, this automatic process results in a few thousand failed transfers. There is also a concern that these failing transfers are potentially creating dark data. Wahid Bhimji is contacting ATLAS DDM operations to find out if these problems can be fixed. We are concerned that this may be a first for ATLAS, as ECDF has a small amount of disk compared to its available CPU.
  • A disk server at Oxford has failed with the loss of a significant fraction of data. Users are noticing their jobs are failing at Oxford. We are going to proceed with the lost file recovery as soon as possible, although we are waiting for confirmation from the site. (https://ggus.eu/ws/ticket_info.php?ticket=81788)
  • The next meeting will be 17th May because of HEPSYSMAN being held at RAL on 10th/11th May.
UK CMS

Tuesday 24th April

  • Brunel will be trialling CVMFS this week, will be interesting. RALPP doing OK with it.
UK LHCb

Tuesday 24th April

  • Things are running smoothly. We are going to run a few small scale tests of new codes. This will also run at T2, one UK T2 involved. Then we will soon launch new reprocessing of all data from this year. CVMFS update from last week; fixes cache corruption on WNs.
UK OTHER

Tuesday 24th April

  • T2K are having problems with WMS proxy renewal, and some WMSes are advertising support but don't actually work. Could be user error, but will need investigation.
Requests

  • More sites needed to test EMI-2