Operations Bulletin 010512

From GridPP Wiki
1st May 2012
Project Management Board - Members Minutes Quarterly Reports

Tuesday 17th April - joint with ops. Agenda

  • Awaiting minutes
GridPP ops meeting - Agendas Actions Core Tasks

Tuesday 24th April 2012 - Agenda Minutes

  • Expected this week: BLAH, DPM, Hydra, GFAL/lcg_util, StoRM and WMS
  • UI/WN tarball: Testing releases of the tarballs are available, linked off this ticket.
  • At T1 two new FTS front end systems on virtual machines.
  • Networking monitoring: consensus is to deploy perfsonar (US version)
  • Reminder about 11th/12th May HEPSYSMAN
RAL Tier-1 Experiment Liaison Meeting (Wednesday 13:30) Agenda EVO meeting

Wednesday 25th April - Operations report

WLCG Grid Deployment Board - Agendas MB agendas

Wednesday 18th April - Agenda Summary report

Introduction

  • Suggestions for future format - TEGs/working groups in pre-GDB slot
  • Need to look at public cloud models

TEG outlines

  • DM & Storage: Look at http and webdav. gridFTP needed medium term. FTS3 plan to include http. LFC not needed medium term. Security more work.
  • Ops Tools: Awaiting task prioritization. Common monitoring (including WLCG coordination body), CVMFS, common sysadmin training, review services, endorse EPEL, apps repository. Develop GGUS and broadcasts. Expand pre-release uptake. Revisit SSB.
  • Workload M: Use glexec. Extend CE for streamed submission, whole node and multi-node jobs, job types (i/o or CPU bound). Remove WMS and simplify InfoSystem.
  • Security: Risk analysis done. Fine-grained traceability issues. Data ownership and other issues TBC. Lack of stakeholder input.
  • Databases: Use COOL. More Frontier usage. WLCG to monitor squids. Interest in NoSQL options.

PerfSonar

  • Need standard. It aids diagnosis - but alert who? Two boxes: latency and bandwidth. Configs flexible. Main issues firewalls and congested GPNs.

Middleware

  • EMI-1: Update 15 due 20th April. Fixes for BLAH, WMS, DPM/LFC, GFAL, Proxy renewal, VOMS-admin. gLite security fixes end 30th April (WN and UI covered till 30th Sept.).
  • EMI-2: SL5 and SL6 builds now >95%. Release due 7th May.
  • UMD and WLCG: EMI tests seek elimination of bugs. EGI tests seek continuous service delivery.

SHA2 & RFC proxies

  • IGTF want CAs using SHA-2 ASAP. Target Jan 2013.
  • Therefore need to move to RFC proxies (away from Globus ones). SHA-1 risks are dCache and BestMan. RFC support needed for middleware but most components ok in EMI-2.
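To make the SHA-1 versus SHA-2 distinction concrete, here is a minimal sketch using Python's standard `hashlib`. The input bytes are an arbitrary placeholder, not certificate data, and the helper name `digest_bits` is hypothetical. It only shows that the two families produce digests of different lengths, which is one reason software that assumes SHA-1 may need changes.

```python
import hashlib


def digest_bits(algorithm: str, data: bytes = b"placeholder input") -> int:
    """Return the digest length in bits for the named hash algorithm."""
    return hashlib.new(algorithm, data).digest_size * 8


# SHA-1 is the older algorithm; SHA-256 is a member of the SHA-2 family.
for name in ("sha1", "sha256"):
    print(name, digest_bits(name), "bits")
```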

Glexec deployment

  • Check regional tests. For UK click here.
  • Sites need to flag support in GOCDB
  • Experiments have plans to use

OSG software update

  • OSG3 now using RPM format (via Koji). Pushing EPEL. Best support RHEL. Tarballs may come soon.
NGI UK - Homepage CA

January Management meeting?

Friday 9th March NGS-CA-TAG meeting

  • Priorities discussion for the CA. Plans to be clarified for future meeting.
  • Email addresses now removed from certificates.
Events

HEPiX - 23rd-27th April (Prague) Agenda

HEPSYSMAN - 10th-11th May (RAL) Agenda

WLCG workshop - 19th-20th May (NY) Information

CHEP 2012 - 21st-25th May (NY) Agenda

UK ATLAS - Shifter view News & Links

Tuesday 24th April

  • ATLAS has started to run reco jobs at T2s more extensively. These require bigger input data sets, which should be copied to the DATADISK space token. Some are being copied to PRODDISK instead. If you see PRODDISK filling up you should take action.
UK CMS

Tuesday 24th April

  • Brunel will be trialling CVMFS this week, will be interesting. RALPP doing OK with it.
UK LHCb

Tuesday 24th April

  • Things are running smoothly. We are going to run a few small scale tests of new codes. This will also run at T2, one UK T2 involved. Then we will soon launch new reprocessing of all data from this year. CVMFS update from last week; fixes cache corruption on WNs.
UK OTHER

Tuesday 24th April

  • T2K are having problems with WMS proxy renewal, and some WMSes are advertising support but don't actually work. Could be user error, but will need investigation.
Requests
  • More sites needed to test EMI-2


Tier-1 Status Page

Tuesday 24th April

  • Problem on one of the Atlas Castor headnodes caused by time drift.
  • Problem with xrootd access to the AtlasStripDeg service class - traced to a configuration problem.
  • Found an unnecessary restriction on our 4GB batch queue - a limit that we have raised.
  • Added two new FTS front end systems on virtual machines.
Storage & Data Management - Agendas/Minutes

Wednesday 25th April

  • "Exploding" DPMs. Bug <1.8.3?
  • Document data model recommended for small VOs
  • HEPiX approaching


Accounting - UK Grid Metrics HEPSPEC06

Tuesday 17th April

  • SL presented models for disk storage accounting at GridPP28
  • AF at GridPP28 presented impacts of changed ATLAS submissions
Documentation - KeyDocs

Friday 27th April

a) Background effort to address the shortcomings in the GridPP Approved VO list and detail records. No RPM of LSC files is planned (after discussions with Christina), so the primary approach may be to generate the VO Approved list by querying the BDII for all VOs, and omitting rare or special ones. Plans are also in train to automate the document (e.g. via VomsSnooper) whenever the XML changes. Manual transcription is far too error prone.

b) Rolled out into the GridPP wiki a "Grid User Crash Course", based heavily on Ewan's cheat sheet. New users and/or VOs may wish to consult this early on, to get a basic feel for grid applications. It references the "gLite User Guide", which remains the best reference source for all use cases.

c) I appeal for a volunteer to enhance the "Grid User Crash Course" (https://www.gridpp.ac.uk/wiki/Grid_user_crash_course) with a simple use case for dependable proxy renewal for long jobs, as this is a recurrent requirement that has caused multiple queries on TB_SUPPORT.
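The filtering step in item a) could be sketched roughly as follows. This is only an illustration under stated assumptions: the site names, VO lists, the `min_sites` threshold and the function name `approved_vos` are all hypothetical, and the input dictionary stands in for whatever a real BDII query would return. It is not how VomsSnooper works.

```python
from collections import Counter


def approved_vos(site_vos, min_sites=2, exclude=()):
    """Return a sorted list of VOs supported by at least `min_sites` sites.

    site_vos: mapping of site name -> iterable of VO names (hypothetical
              stand-in for the result of querying the BDII).
    exclude:  VO names to omit as "special" regardless of how common they are.
    """
    # Count each VO once per site, even if a site lists it more than once.
    counts = Counter(vo for vos in site_vos.values() for vo in set(vos))
    return sorted(
        vo for vo, n in counts.items() if n >= min_sites and vo not in exclude
    )


# Hypothetical example data; real input would come from the information system.
site_vos = {
    "SITE-A": ["atlas", "cms", "t2k.org", "localtest"],
    "SITE-B": ["atlas", "lhcb", "t2k.org"],
    "SITE-C": ["atlas", "cms", "ops"],
}

print(approved_vos(site_vos, min_sites=2, exclude={"ops"}))
```

What counts as "rare" or "special" is a policy decision; here it is reduced to a site-count threshold and an explicit exclusion set purely for illustration.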

Interoperation - EGI ops agendas

Monday 16th April - EGI ops agenda

  • BDII Instability: Did we observe problems with BDII on April 12? (One for RoD?)
Monitoring - Links MyWLCG
  • Glasgow dashboard now packaged and can be downloaded here.
On-duty - Dashboard ROD Rota

Monday 10th April - JW

  • A few sites were caught out with the update to CA RPMs 1.46 just before the Easter break.
  • There were intermittent problems with the Glasgow (srv022) and a RAL WMS, for which alarms were flapping. Glasgow still has problems.


Monday 16th April - AM

  • Several sites in planned and unplanned downtimes still (following round of hardware upgrades?) but no UK-wide issues.
Rollout Status

Friday 27th April

  • Updated version information on rollout page
  • WN scan indicates some sites not keen on OS updates to those nodes.
Security - Incident Procedure Policies
  • Phone meeting planned in early May to ensure continuity when Mingchao leaves
  • SSC5 preparations to start soon.
Services - PerfSonar dashboard
  • 23rd April requested network utilisation figures for March and April
  • LHCONE meeting next week in Amsterdam
  • Agreed to focus on perfsonar.
Tickets

Monday 24th April

  • Some tickets starting to look stale.

22 open UK tickets this week.

NGI https://ggus.eu/ws/ticket_info.php?ticket=80259 The new neuro science VO nearly has a name. Nearly. The devil's in the details (as always).

MANCHESTER https://ggus.eu/ws/ticket_info.php?ticket=81449 This got sent to Liverpool by accident. John rallied it to the right place, but it may have slipped under the radar. Ticket is from lhcb, sounds like cvmfs problems causing job failures.

https://ggus.eu/ws/ticket_info.php?ticket=81343 Biomed complaining about negative space advertised by the CE.

CAMBRIDGE https://ggus.eu/ws/ticket_info.php?ticket=80732 This ticket can be put to bed, the user doesn't see the problems anymore. I'm not sure what Santanu did to fix things though.

https://ggus.eu/ws/ticket_info.php?ticket=77008 Looks like this old ticket can be closed too (with the appropriate saga recorded in the solution).

GLASGOW https://ggus.eu/ws/ticket_info.php?ticket=80752 Has the heavy load on the WMS evened itself out?

https://ggus.eu/ws/ticket_info.php?ticket=80371 If the WMS has started to behave, will you be able to look at enabling SNO+ soon?

BIRMINGHAM https://ggus.eu/ws/ticket_info.php?ticket=80527 https://ggus.eu/ws/ticket_info.php?ticket=81434 Has a couple of tickets, likely to be caused (or not helped) by the extreme transition going on at Birmingham. It might help Mark to put these On Hold if they can't be solved.

UCL https://ggus.eu/ws/ticket_info.php?ticket=80989 Is there anything anybody can do to help get your SE back up? We stand ready to assist. There could be useful information here (if your problem is similar to Lancaster's and other crashing sites): https://svnweb.cern.ch/trac/lcgdm/wiki/Dpm/Admin/Maintenance#Cleaningupinvalidreplicas Or it could be easier to upgrade (1.8.3 should be out soon; I'm not sure if the storage group have a stance on this).

From the Solved Case pile: The only one that jumps out at me is: https://ggus.eu/ws/ticket_info.php?ticket=81444 Another case where the renewal of a VO Admin's certificate under the new CA cert causes shenanigans (no other word for it). One for other UK people to watch out for over the coming months as they renew their certs.

Tools - MyEGI Nagios

Sunday 8th April

  • Lancaster Nagios backup hardware has arrived
  • Nagios based VO testing setup for vo.southgrid.ac.uk. Nagios view is here. Tests run every 8 hours. The results are also fed back into the dashboard and alarms triggered.
  • There is a bug which prevents multiple VO Nagios tests with one Nagios instance. Developers informed.
VOs - GridPP VOMS VO IDs Approved
a) Discussion about VO information in LSC files - EMI says no.

b) Tidying up VO information and gathering addresses for VO admin email list.

c) WMS issue for SNO+ fixed with Sussex UI update