Operations Bulletin 070512

From GridPP Wiki

Bulletin archive


Week commencing 30th April 2012
Task Areas
Tier-1 - Status Page

Tuesday 1st May

  • As stated in last week's report, we rolled out an updated kernel (and network driver) to two batches of disk servers with a particular 10Gbit network card, as we had seen these systems drop their network connectivity. (This had previously been done for the disk servers of this type in one of the Atlas service classes.)
  • On Friday there was an extended break of the primary OPN link to CERN. We ran using the backup link from around 10:30am to 11pm. There was no operational effect on the Tier1.
  • We did have some batch problems on Sunday morning. We need to extend our black hole detector to include another case that leaves jobs in a "waiting" state.
  • We have seen another short break on the "bypass" network route that is used by data traffic to Tier2s. This causes file transfer failures, but the operational impact is limited. Further changes have been made to try and fix this (XFP replaced).
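
As an illustration of the black hole detector extension mentioned in the batch item above, here is a minimal sketch. It is not the Tier-1's actual detector: the `Job` record, the state names and all thresholds are assumptions chosen for the example. It flags a worker node either when it fails many jobs very quickly (the usual black hole signature) or when several of its jobs sit in a "waiting" state for too long (the additional case).

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass
class Job:
    """Hypothetical job record; field names are assumptions for this sketch."""
    node: str         # worker node the job was dispatched to
    state: str        # assumed states: "done", "failed", "waiting"
    elapsed_s: float  # seconds the job ran, or has spent in its current state


# Illustrative thresholds only; real values would need tuning per batch system.
SHORT_FAIL_S = 60        # a failure quicker than this counts as "instant"
MAX_SHORT_FAILS = 5      # instant failures tolerated per node
WAITING_LIMIT_S = 1800   # a job waiting longer than this counts as stuck
MAX_STUCK_WAITING = 3    # stuck waiting jobs tolerated per node


def find_black_holes(jobs):
    """Return a sorted list of nodes that look like black holes."""
    short_fails = defaultdict(int)
    stuck_waiting = defaultdict(int)
    for job in jobs:
        if job.state == "failed" and job.elapsed_s < SHORT_FAIL_S:
            short_fails[job.node] += 1
        elif job.state == "waiting" and job.elapsed_s > WAITING_LIMIT_S:
            stuck_waiting[job.node] += 1
    suspects = {n for n, c in short_fails.items() if c >= MAX_SHORT_FAILS}
    suspects |= {n for n, c in stuck_waiting.items() if c >= MAX_STUCK_WAITING}
    return sorted(suspects)
```

Keeping the two cases as separate counters means either signature alone is enough to flag a node, and the "waiting" rule can be tuned without touching the existing fast-failure rule.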
Storage & Data Management - Agendas/Minutes

Wednesday 02 May 2012

  • Discussion about discussions at the hepsysman storage afternoon. Filesystems always "stimulate" discussion, but rarely change recommendations. Ricardo and Sam to present remotely.
  • HEPiX summary by James triggered discussion about redundancy in filesystem (eg RAID) vs across nodes (RAIN) vs across sites (grid). Also the requirement for recent kernels in Linux (but not in SL5/SL6) to improve XFS performance (Sam) and support NFS4 clients.
  • AOB: NO meeting next week, as we have the storage half day on Thursday.


Accounting - UK Grid Metrics HEPSPEC06

Tuesday 17th April

  • SL presented models for disk storage accounting at GridPP28
  • AF at GridPP28 presented impacts of changed ATLAS submissions
Documentation - KeyDocs

Friday 27th April

  • Background effort to address the shortcomings in the GridPP Approved VO list and detail records. No RPM of LSC files is planned (after discussions with Christina), so the primary approach may be to generate the VO Approved list by querying the BDII for all VOs, and omitting rare or special ones. Plans are also in train to automate the document (e.g. via VomsSnooper) whenever the XML changes. Manual transcription is far too error prone.
  • Rolled out into GridPP wiki a "Grid User Crash Course", based heavily on Ewan's cheat sheet. New users and/or VOs may wish to consult this early on, to get basic feel of grid applications. References the "Glite User Guide", which remains the best reference source for all user cases.
  • I appeal for a volunteer to enhance "Grid User Crash Course" (https://www.gridpp.ac.uk/wiki/Grid_user_crash_course) with simple use case for dependable proxy renewal for long jobs, as this is a recurrent requirement that has caused multiple queries on TB_SUPPORT.
Interoperation - EGI ops agendas

Monday 16th April - EGI ops agenda

  • BDII Instability: Did we observe problems with BDII on April 12? (One for RoD?)


Monitoring - Links MyWLCG
  • Glasgow dashboard now packaged and can be downloaded here.
On-duty - Dashboard ROD Rota

Monday 30th April - KM

  • Imperial and RALPP had intermittent org.sam.SRM-GetTURLs failures because of a mysterious issue with lcg_utils or the gridppnagios machine. This ticket was opened. As yet the underlying cause has not been found but the suggested workaround will be applied.

Thursday 3rd April

Rollout Status WLCG Baseline

Friday 27th April

  • Updated version information on rollout page
  • WN scan indicates some sites not keen on OS updates to those nodes.
Security - Incident Procedure Policies
  • Phone meeting planned in early May to ensure continuity when Mingchao leaves
  • SSC5 preparations to start soon.


Services - PerfSonar dashboard
  • 23rd April requested network utilisation figures for March and April
  • LHCONE meeting next week in Amsterdam
  • Agreed to focus on perfsonar.
Tickets

Monday 30th April

Only 20 open tickets this week.

NGI https://ggus.eu/ws/ticket_info.php?ticket=80259 Any more news on the new neuroscience VO name?

Tier-1 https://ggus.eu/ws/ticket_info.php?ticket=81724 Alice requested a VOBOX update, they got it, it seems to work for them. If only everything worked out like this. The ticket can be closed by the looks of it.

https://ggus.eu/ws/ticket_info.php?ticket=81669 na62 have requested an FTS channel. This is odd as apparently they should request one at CERN. Any comments?

https://ggus.eu/ws/ticket_info.php?ticket=81606 t2k.org proxy renewal failures using lcgwms03. Looks to be unrelated to the problem that Imperial saw (81614): RAL don't see the same errors, and Chris duplicated the error using the RAL myproxy.

DURHAM https://ggus.eu/ws/ticket_info.php?ticket=81726 Durham continues to have problems that can likely be placed at the feet of its campus firewall. (solved as I typed this).

https://ggus.eu/ws/ticket_info.php?ticket=75488 Old Compchem ticket about authentication problems at Durham. Probably linked to the problem mentioned above. Anything else to report about it? (It's On Hold, but we don't want it to get too stale).

https://ggus.eu/ws/ticket_info.php?ticket=68859 Any luck in upgrading the disk servers to a newer DPM release?

QMUL https://ggus.eu/ws/ticket_info.php?ticket=81516 SNO+ were having difficulty lcg-cr'ing on the QMUL SE. Problem seems to have been transient (user error?), it could be closed (if Chris is satisfied that nothing's still broken).

OXFORD https://ggus.eu/ws/ticket_info.php?ticket=81437 A pool node died, and took a bunch of hone data with it (always painful). Ewan's conjuring up a list of lost files, but are the smaller VOs equipped to deal with lost data?

GLASGOW https://ggus.eu/ws/ticket_info.php?ticket=80371 Now that the WMS is behaving (*touch wood with crossed fingers*, ticket 80752) just a gentle reminder about enabling SNO+ support.

MANCHESTER https://ggus.eu/ws/ticket_info.php?ticket=81343 "I hate that script." I can't help but agree with Alessandra, but it looks like the Manchester DPM is reporting non-negative results again.

UCL https://ggus.eu/ws/ticket_info.php?ticket=80989 Any movement at UCL on this?

BRISTOL https://ggus.eu/ws/ticket_info.php?ticket=80155 Winnie reports that she's almost ready to upgrade StoRM. As we only have 10% of her time on this, it would be helpful if we stand ready to lend a hand.

Good movement on most of the other "Your SE is looking Crusty" tickets: https://ggus.eu/ws/ticket_info.php?ticket=80152 https://ggus.eu/ws/ticket_info.php?ticket=80153 https://ggus.eu/ws/ticket_info.php?ticket=68858 https://ggus.eu/ws/ticket_info.php?ticket=68853

SOLVED CASE PILE https://ggus.eu/ws/ticket_info.php?ticket=81614 Imperial WMS weren't renewing proxies for t2k. Turned out to be a problem with the myproxy at CERN (which didn't have the DNs of the Imperial WMSs in it). The clues to this were found by Daniela in the syslog, not the WMS logs.

https://ggus.eu/ws/ticket_info.php?ticket=81563 The Glasgow DPM crashed and had to be restarted (a couple of times). Why is this interesting (worrying)? Sam had upgraded to DPM 1.8.3, which is supposed to contain fixes to prevent stuff like this happening. He's dutifully informed the developers.

Tools - MyEGI Nagios

Sunday 8th April

  • Lancaster Nagios backup hardware has arrived
  • Nagios based VO testing setup for vo.southgrid.ac.uk. Nagios view is here. Tests run every 8 hours. The results are also fed back into the dashboard and alarms triggered.
  • There is a bug which prevents multiple VO Nagios tests with one Nagios instance. Developers informed.

Thursday 3rd April

  • New GOCDB features: scoping of service end-points and sites, a new set of roles and related permissions, and the possibility of creating groups of service end-points distributed across multiple GOCDB sites. An overview of these new features and the related use cases is available here.


VOs - GridPP VOMS VO IDs Approved
  • Discussion about VO information in LSC files - EMI says no.
  • Tidying up VO information and gathering addresses for VO admin email list.
  • WMS issue for SNO+ fixed with Sussex UI update
Site Updates

Tuesday 1st May

  • Brunel: Last Thursday storage usage reached 95%. We had a little crisis with lots of FTS failures and DPM services needing several restarts. Corrected when CMS started removing unneeded data.


Meeting Summaries
Project Management Board - Members Minutes Quarterly Reports

Tuesday 17th April - joint with ops. Agenda

  • CPU accounting benchmarks may revert to Steve's as simul jobs are now quite variable.
  • In holding pattern with NGS (awaiting long-term funding). GridPP should continue to offer support to VOs in our area of expertise - i.e. those with HTC needs. VOs typically supported first at sites with local users.
  • Rewarding disk allocations to non-LHC VOs is of growing importance. Usage needs closer tracking and requirements from the VOs to be recorded.
  • GridMon or Perfsonar? Balance is shifting and ops team should discuss again.
  • DIRAC can be useful for 'other' VOs. Reliability/callout needs may push instance to T1.
GridPP ops meeting - Agendas Actions Core Tasks

Tuesday 24th April 2012 - Agenda Minutes

  • Expected this week: BLAH, DPM, Hydra, GFAL/lcg_util, StoRM and WMS
  • UI/WN tarball: There are testing releases of the tarballs, Linked off this ticket.
  • At T1 two new FTS front end systems on virtual machines.
  • Networking monitoring: consensus is to deploy perfsonar (US version)
  • Reminder about 10th/11th May HEPSYSMAN
RAL Tier-1 Experiment Liaison Meeting (Wednesday 13:30) Agenda EVO meeting

Wednesday 2nd May

  • New disk servers being deployed for ALICE
  • Discussion about NA62 FTS channels (setup via CERN or RAL) to be concluded

Wednesday 25th April - Operations report

WLCG Grid Deployment Board - Agendas MB agendas

Wednesday 18th April - Agenda Summary report

Introduction

  • Suggestions for future format - TEGs/working groups in pre-GDB slot
  • Need to look at public cloud models

TEG outlines

  • DM & Storage: Look at http and webdav. gridFTP needed medium term. FTS3 plan to include http. LFC not needed medium term. Security more work.
  • Ops Tools: Awaiting task prioritization. Common monitoring (including WLCG coordination body), CVMFS, common sysadmin training, review services, endorse EPEL, apps repository. Develop GGUS and broadcasts. Expand pre-release uptake. Revisit SSB.
  • Workload M: Use glexec. Extend CE for streamed submission, whole node and multi-node jobs, job types (i/o or CPU bound). Remove WMS and simplify InfoSystem.
  • Security: Risk analysis done. Fine-grained traceability issues. Data ownership and other issues TBC. Lack of stakeholder input.
  • Databases: Use COOL. More Frontier usage. WLCG to monitor squids. Interest in NoSQL options.

PerfSonar

  • Need standard. It aids diagnosis - but alert who? Two boxes: latency and bandwidth. Configs flexible. Main issues firewalls and congested GPNs.

Middleware

  • EMI-1: Update 15 due 20th April. Fixes for BLAH, WMS, DPM/LFC, GFAL, Proxy renewal, VOMS-admin. gLite security fixes end 30th April (WN and UI covered till 30th Sept.).
  • EMI-2: SL5 and SL6 builds now >95%. Release due 7th May.
  • UMD and WLCG: EMI tests seek elimination of bugs. EGI tests seek continuous service delivery.

SHA2 & RFC proxies

  • IGTF want CAs using SHA-2 ASAP. Target Jan 2013.
  • Therefore need to move to RFC proxies (away from Globus ones). SHA-1 risks are dCache and BestMan. RFC support needed for middleware but most components ok in EMI-2.

Glexec deployment

  • Check regional tests. For UK click here.
  • Sites need to flag support in GOCDB
  • Experiments have plans to use

OSG software update

  • OSG3 now using RPM format (via Koji). Pushing EPEL. Best support RHEL. Tarballs may come soon.


NGI UK - Homepage CA

January Management meeting?

Friday 9th March NGS-CA-TAG meeting

  • Priorities discussion for the CA. Plans to be clarified for future meeting.
  • Email addresses now removed from certificates.
Events

HEPiX - 23rd-27th April (Prague) Agenda

HEPSYSMAN - 10th-11th May (RAL) Agenda

WLCG workshop - 19th-20th May (NY) Information

CHEP 2012 - 21st-25th May (NY) Agenda

UK ATLAS - Shifter view News & Links

Wednesday 2nd May

  • Testing glideinWMS but some problems spotted


Tuesday 24th April

  • ATLAS started to run reco jobs at T2s more extensively. These require bigger input data sets. They should be copied to the DATADISK space token. Some are copying to PRODDISK. If you see PRODDISK filling up, you should take action.

Thursday 3rd May

  • ECDF has a full data disk. Every time this happens the site is automatically blacklisted and cleaned up by ATLAS; however, this automatic process results in a few thousand failed transfers. There is also a concern that these failing transfers are potentially creating dark data. Wahid Bhimji is contacting ATLAS DDM operations to find out if these problems can be fixed. We are concerned that this may be a first for ATLAS as ECDF has a small amount of disk compared to its available CPU.
  • A disk server at Oxford has failed with the loss of a significant fraction of data. Users are noticing their jobs are failing at Oxford. We are going to proceed with the lost file recovery as soon as possible, although we are waiting for confirmation from the site. (https://ggus.eu/ws/ticket_info.php?ticket=81788)
  • The next meeting will be 17th May because of HEP Sysman being held at RAL on the 10/11th May.
UK CMS

Tuesday 24th April

  • Brunel will be trialling CVMFS this week, will be interesting. RALPP doing OK with it.
UK LHCb

Tuesday 24th April

  • Things are running smoothly. We are going to run a few small scale tests of new codes. This will also run at T2, one UK T2 involved. Then we will soon launch new reprocessing of all data from this year. CVMFS update from last week; fixes cache corruption on WNs.
UK OTHER

Tuesday 24th April

  • T2K are having problems with WMS proxy renewal, and some WMSes are advertising support but don't actually work. Could be user error, but will need investigation.
Requests

  • More sites needed to test EMI-2