Operations Bulletin 140512

From GridPP Wiki


Week commencing 7th May 2012
Task Areas
General updates

Tuesday 8th May

Tier-1 - Status Page

Tuesday 8th May

  • Last week we deployed five additional disk servers for ALICE (these will replace five smaller ones) and ten new disk servers for LHCb.
  • Apart from a couple of short disk server problems it has been a fairly quiet week.
  • There was a local problem that did not affect the Tier1 but caused some problems for the GOC DB (also hosted here) from Thursday evening (3rd May) through Friday morning.
Storage & Data Management - Agendas/Minutes

Wednesday 02 May 2012

  • Discussion about discussions at the hepsysman storage afternoon. Filesystems always "stimulate" discussion, but rarely change recommendations. Ricardo and Sam to present remotely.
  • HEPiX summary by James triggered discussion about redundancy within a filesystem (e.g. RAID) vs across nodes (RAIN) vs across sites (grid). Also discussed: the requirement for recent Linux kernels (not those in SL5/SL6) to improve XFS performance (Sam) and to support NFS4 clients.
  • AOB: NO meeting next week, as we have the storage half day on Thursday.


Accounting - UK Grid Metrics HEPSPEC06

Friday 4th May - TB-SUPPORT

  • Long-standing problem: gstat overcounts values such as RunningJobs and TotalJobs if multiple CE nodes feed the same set of queues. The gstat developers have just commented on the ticket for this, indicating that they have done something to remove the duplicates.
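
The overcounting described above can be illustrated with a minimal sketch. This is not the gstat developers' fix (the bulletin does not say what they changed); it only shows why summing per-CE records inflates the totals and how counting each underlying queue once avoids it. The record shape, the `queue_id` key and the host names are all assumptions made for illustration.

```python
from typing import Iterable, NamedTuple, Tuple


class QueueRecord(NamedTuple):
    """One job-count record as published by a single CE (hypothetical shape)."""
    ce_host: str       # CE node that published the record
    queue_id: str      # identifier of the underlying batch queue (assumed shared key)
    running_jobs: int
    total_jobs: int


def naive_totals(records: Iterable[QueueRecord]) -> Tuple[int, int]:
    """Sum every record, so a queue published by N CEs is counted N times."""
    records = list(records)
    return (sum(r.running_jobs for r in records),
            sum(r.total_jobs for r in records))


def deduplicated_totals(records: Iterable[QueueRecord]) -> Tuple[int, int]:
    """Count each underlying queue once, keeping the first record seen for it."""
    first_per_queue = {}
    for r in records:
        first_per_queue.setdefault(r.queue_id, r)
    return naive_totals(first_per_queue.values())


# Two CEs feeding the same "long" queue, plus one "short" queue (made-up numbers).
records = [
    QueueRecord("ce01.example.org", "long", 100, 150),
    QueueRecord("ce02.example.org", "long", 100, 150),   # same queue via a second CE
    QueueRecord("ce01.example.org", "short", 50, 70),
]
print(naive_totals(records))
print(deduplicated_totals(records))
```

Keeping the first record per queue is the simplest choice; if the CEs can publish slightly different values for the same queue, taking the maximum or the most recent record may be more appropriate.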

Tuesday 17th April

  • SL presented models for disk storage accounting at GridPP28
  • AF at GridPP28 presented impacts of changed ATLAS submissions
Documentation - KeyDocs

Friday 27th April

  • Background effort to address the shortcomings in the GridPP Approved VO list and detail records. No RPM of LSC files is planned (after discussions with Christina), so the primary approach may be to generate the Approved VO list by querying the BDII for all VOs and omitting rare or special ones. Plans are also in train to regenerate the document automatically (e.g. via VomsSnooper) whenever the XML changes. Manual transcription is far too error prone.
  • Rolled out into the GridPP wiki a "Grid User Crash Course", based heavily on Ewan's cheat sheet. New users and/or VOs may wish to consult this early on, to get a basic feel for grid applications. It references the "gLite User Guide", which remains the best reference source for all use cases.
  • I appeal for a volunteer to enhance the "Grid User Crash Course" (https://www.gridpp.ac.uk/wiki/Grid_user_crash_course) with a simple use case for dependable proxy renewal for long jobs, as this is a recurrent requirement that has caused multiple queries on TB-SUPPORT.
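
The BDII-based approach to the Approved VO list mentioned in the first item above could be sketched roughly as follows. The sketch assumes the per-site VO lists have already been gathered from the BDII (no query is shown), and it treats "rare" as "supported at fewer than `min_sites` sites" and "special" as membership of an explicit exclusion set. Both interpretations, along with the site and VO names, are assumptions for illustration only.

```python
from collections import Counter
from typing import Dict, FrozenSet, Iterable, List


def approved_vos(site_vos: Dict[str, Iterable[str]],
                 min_sites: int = 2,
                 excluded: FrozenSet[str] = frozenset()) -> List[str]:
    """Return VOs supported at >= min_sites sites and not explicitly excluded.

    site_vos maps a site name to the VOs it publishes (hypothetical input,
    assumed to have been extracted from the BDII beforehand).
    """
    # set() guards against a site publishing the same VO more than once.
    counts = Counter(vo for vos in site_vos.values() for vo in set(vos))
    return sorted(vo for vo, n in counts.items()
                  if n >= min_sites and vo not in excluded)


# Made-up example data.
site_vos = {
    "SITE-A": ["atlas", "lhcb", "ops"],
    "SITE-B": ["atlas", "cms", "ops"],
    "SITE-C": ["atlas", "lhcb", "localvo"],
}
print(approved_vos(site_vos, min_sites=2, excluded=frozenset({"ops"})))
```

The threshold and exclusion set would need to be agreed by whoever maintains the list; the point is only that the filtering step is mechanical once the BDII data is in hand, which is what makes automation preferable to manual transcription.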
Interoperation - EGI ops agendas

Monday 7th May - EGI ops agenda

  • No EMI update, everyone in transit for EMI all hands meeting.
  • Staged Rollout: A detailed list is on the agenda - note that 'verification' is the step before SR, so anything in verification is expected in SR soon. In particular, in verification: BLAH update for CREAM, DPM, lcg-utils (for UI and WN), MyProxy and WMS.
  • TMPDIR policy. Draft policy available. Some discussion of how it relates to EDG_WN_SCRATCH, and of talking to the WLCG lot about it. Any comments, pass to Stuart Purdie by Friday 18th May.
  • TopBDII Availability: UK: 100%. Awesome. Some very long discussion of the Swiss situation (they use Germany's, so who should get a ticket if it drops out, the service provider or the NGI? Not relevant to the UK).

Monday 16th April - EGI ops agenda

  • BDII Instability: Did we observe problems with BDII on April 12? (One for RoD?)


Monitoring - Links MyWLCG
  • Glasgow dashboard now packaged and can be downloaded here.
On-duty - Dashboard ROD Rota

Monday 30th April - KM

  • Imperial and RALPP had intermittent org.sam.SRM-GetTURLs failures because of a mysterious issue with lcg_utils or the gridppnagios machine. This ticket was opened. As yet the underlying cause has not been found but the suggested workaround will be applied.

Thursday 3rd May

Rollout Status WLCG Baseline

Friday 27th April

  • Updated version information on rollout page
  • WN scan indicates some sites not keen on OS updates to those nodes.
Security - Incident Procedure Policies

Wednesday 2nd May

  • Setting up a security on-duty rota to strengthen the team and cover the period after Mingchao leaves. Alessandra, Linda, Rob and Ewan are involved.
  • Currently reviewing the overall security task.
  • SSC5 preparations to start soon.


Services - PerfSonar dashboard
  • 23rd April requested network utilisation figures for March and April
  • LHCONE meeting last week in Stockholm
  • Agreed to focus on perfSONAR.
Tickets

Monday 7th May

STOP PRESS

UCL Availability/Reliability for April ticket: https://ggus.eu/ws/ticket_info.php?ticket=81963

19 open tickets this week.

NGI/SUSSEX https://ggus.eu/ws/ticket_info.php?ticket=81784 This ticket is tracking the certification of the Sussex site, which could turn into an interesting saga. Currently waiting on a bug in an existing ticket to get itself sorted (https://ggus.eu/ws/ticket_info.php?ticket=81792).

UCL/GOC https://ggus.eu/ws/ticket_info.php?ticket=81878 UCL are having trouble taking themselves out of downtime due to the GOCDB being awkward after an accidental deletion. This is keeping the existing ticket open (https://ggus.eu/ws/ticket_info.php?ticket=80989).

RALPP https://ggus.eu/ws/ticket_info.php?ticket=81862 One of the RALPP CREAMs is misbehaving for nagios tests - submitted on Friday so it may have been missed.

https://ggus.eu/ws/ticket_info.php?ticket=81891 Just a heads up, looks like some dCache problems have cropped up over the weekend.

EFDA-JET https://ggus.eu/ws/ticket_info.php?ticket=81886 Another heads up, LHCb are having disk-full type errors at JET.

GLASGOW https://ggus.eu/ws/ticket_info.php?ticket=81728 A third and final heads up, as I don't trust sites to be notified when tickets get reopened. Looks like some job stage-in problems for ATLAS have cropped up again.

Tools - MyEGI Nagios

Sunday 8th April

  • Lancaster Nagios backup hardware has arrived
  • Nagios-based VO testing set up for vo.southgrid.ac.uk. Nagios view is here. Tests run every 8 hours. The results are also fed back into the dashboard and alarms triggered.
  • There is a bug which prevents multiple VO Nagios tests with one Nagios instance. Developers informed.
VOs - GridPP VOMS VO IDs Approved
  • Discussion about VO information in LSC files - EMI says no.
  • Tidying up VO information and gathering addresses for VO admin email list.
  • WMS issue for SNO+ fixed with Sussex UI update
Site Updates

Tuesday 1st May

  • Brunel: Last Thursday storage usage reached 95%. We had a little crisis with lots of FTS failures and DPM services needing several restarts. Corrected when CMS started removing unneeded data.


Meeting Summaries
Project Management Board - Members Minutes Quarterly Reports

Monday 7th May - No meeting

Monday 30th April - mainly a review of actions.

Tuesday 17th April - joint with ops. Agenda

  • CPU accounting benchmarks may revert to Steve's as simul jobs are now quite variable.
  • In holding pattern with NGS (awaiting long-term funding). GridPP should continue to offer support to VOs in our area of expertise - i.e. those with HTC needs. VOs typically supported first at sites with local users.
  • Rewarding disk allocations to non-LHC VOs is of growing importance. Usage needs closer tracking and requirements from the VOs to be recorded.
  • GridMon or perfSONAR? Balance is shifting and ops team should discuss again.
  • DIRAC can be useful for 'other' VOs. Reliability/callout needs may push instance to T1.
GridPP ops meeting - Agendas Actions Core Tasks

Tuesday 1st May 2012 - Agenda Minutes

  • Trying the present bulletin as a conduit for information
  • 7 sites below 90% in ATLAS monitoring!
  • Instructions for small VO space-token setup/usage requested
  • EMI WN tarball now available - comment in ticket
  • Not every site running latest SL5 OS


Tuesday 24th April 2012 - Agenda Minutes

  • Expected this week: BLAH, DPM, Hydra, GFAL/lcg_util, StoRM and WMS
  • UI/WN tarball: There are testing releases of the tarballs, linked off this ticket.
  • At T1 two new FTS front end systems on virtual machines.
  • Network monitoring: consensus is to deploy perfSONAR (US version)
  • Reminder about 10th/11th May HEPSYSMAN
RAL Tier-1 Experiment Liaison Meeting (Wednesday 13:30) Agenda EVO meeting

Wednesday 2nd May

  • New disk servers being deployed for ALICE
  • Discussion about NA62 FTS channels (setup via CERN or RAL) to be concluded

Wednesday 25th April - Operations report

WLCG Grid Deployment Board - Agendas MB agendas

Wednesday 9th May - Agenda

Introduction (Michel Jouvin)

• GDB summary needed. Plan to put notes in wiki linked to agenda.

• Encourage cross-VO T2 deployment of perfSONAR.

TEG next steps (Ian Bird)

• The TEGs as such (big working groups) have finished. Still work to be done.

• Further discussions at CHEP – including future of DPM.

• Deployment of glexec becomes a priority (tarball is not a solved problem)

CRSG report to the C-RRB (Ian Bird)

• Higher rates and parking data plus more analysis means slightly more resources needed (ALICE, ATLAS and CMS)

• LHCb have a revised charm physics program

• A lot more pile-up in 2012

• Use of high-level trigger farms to help with processing

• T2 installed capacity still an issue – REBUS will be used. Need to check the figures published for each site.

LHCOPN/ONE status & directions (John Shade)

• LHCOPN functioning well. Alarms: https://cclhcopnmon.in2p3.fr/LHCOPN/report/.

• perfSONAR-PS and MDM interoperability not tested. Jason Zurawski offering workshop on toolkit for site managers – any interest?

• L3VPN operations https://twiki.cern.ch/twiki/bin/view/LHCONE/WebHome.

• Routing policies important – symmetric paths

• OpenFlow as a protocol. TRILL/SPB for resilience (replace P2P links)

• PerfSONAR is a dormant setup with some selected core sites/points involved with regular testing.

• Encourage sites to install – guidelines in twiki. Sites need to setup alarms for themselves.

Federated Identity Management for HEP (David Kelsey)

• Remove the ID management from the service (use single sign-on). Adding an attribute authority (e.g. VOMS) adds complexity

• Spans many communities not just HEP – common requirements being discussed (e.g. open standards, attribute aggregation) including operational ones like traceability.

• Research communities are to perform a risk analysis of using IdM.

Procedure to follow for proposed new T1 sites (Ian Bird)

• Policy document linked from agenda – discussed in 2011

• Requires experiment support, balanced against the high standards of existing T1 services.

• Prepare detailed plan. Follow tests and meet required service levels. Reach full status after about 1 year and Overview Board approval

HEPiX (Helge Meinhard)

• 23rd-27th April. Over-packed agenda. New track on business continuity

• Fabric management changing. Many labs moving to puppet. Quattor healthy. Some sites moving monitoring away from Nagios.

• Cloud computing on the horizon of realism (Openstack and OpenNebula)

• Working groups (virtualization; IPv6; Storage; Benchmarking)

WLCG workshop (Jamie Shiers)

• wlcg workshop mailing list

• Draft agenda now online

Virtualized WNs and Clouds

HEPiX Virtualisation WG report (Tony Cass)

• Set up to facilitate the instantiation of user-generated VM images at HEPiX sites

• Image endorsement – technical constraints and policy discussed

• Framework for endorsers to publish has been developed

Update on WNoDeS (Davide Salomoni)

• Worker Nodes on Demand Service: http://web.infn.it/wnodes

• Integrates Grid and Cloud provisioning through virtualization

• Upcoming – dynamic VLANs

EGI Federated Clouds Task Force (Matteo Turilli)

• Why? Users keen on personalized environments

• Federation testbed now live

https://wiki.egi.eu/wiki/Fedcloud-tf:FederatedCloudsTaskForce

VM contextualization and image cataloguing in ATLAS (Fernando Megino)

• Why? Some sites support multiple VOs with different needs. To be ready for when cloud resources are offered.

• Work ongoing over 1 year. Still manual steps. Currently use CERNVM.

• Contextualization desired for Ganglia and Condor installs.

• Comment: Getting credentials in is fine, but there is concern about site settings for syslog etc. being overridden. Condor should be via CVMFS.



NGI UK - Homepage CA

The next management meeting will be on Monday 14th May.

Friday 9th March NGS-CA-TAG meeting

  • Priorities discussion for the CA. Plans to be clarified for future meeting.
  • Email addresses now removed from certificates.
Events

HEPiX - 23rd-27th April (Prague) Agenda

HEPSYSMAN - 10th-11th May (RAL) Agenda

WLCG workshop - 19th-20th May (NY) Information

CHEP 2012 - 21st-25th May (NY) Agenda

UK ATLAS - Shifter view News & Links

Wednesday 2nd May

  • Testing glideinWMS but some problems spotted


Tuesday 24th April

  • ATLAS has started to run reco jobs at T2s more extensively. These require bigger input data sets, which should be copied to the DATADISK space token. Some are being copied to PRODDISK instead. If you see PRODDISK filling up you should take action.

Thursday 3rd May

  • ECDF has a full data disk. Every time this happens the site is automatically blacklisted and cleaned up by ATLAS; however, this automatic process results in a few thousand failed transfers. There is also a concern that these failing transfers are potentially creating dark data. Wahid Bhimji is contacting ATLAS DDM operations to find out if these problems can be fixed. We are concerned that this may be a first for ATLAS, as ECDF has a small amount of disk compared to its available CPU.
  • A disk server at Oxford has failed with the loss of a significant fraction of data. Users are noticing their jobs are failing at Oxford. We are going to proceed with the lost file recovery as soon as possible although are waiting for confirmation from the site. (https://ggus.eu/ws/ticket_info.php?ticket=81788)
  • The next meeting will be 17th May because of HEP Sysman being held at RAL on the 10/11th May.
UK CMS

Tuesday 24th April

  • Brunel will be trialling CVMFS this week, will be interesting. RALPP doing OK with it.
UK LHCb

Tuesday 24th April

  • Things are running smoothly. We are going to run a few small scale tests of new codes. This will also run at T2, one UK T2 involved. Then we will soon launch new reprocessing of all data from this year. CVMFS update from last week; fixes cache corruption on WNs.
UK OTHER

Tuesday 24th April

  • T2K are having problems with WMS proxy renewal, and some WMSes are advertising support but don't actually work. Could be user error, but will need investigation.
Requests

  • More sites needed to test EMI-2