Operations Bulletin Latest


Bulletin archive


Week commencing 19th May 2014
Task Areas
General updates

Monday 19th May

  • David C has put together a blog on monitoring. Who can/will contribute content?
  • HEPiX takes place this week (19th-23rd May) and talks are available from the event page. Monday covered some site reports and OS related updates. Tuesday's focus is batch systems. Wednesday covers IPv6, security and benchmarking. Thursday covers storage, monitoring and infrastructure deployment. Friday is cloud day.
  • The EGI Community Forum takes place this week in Helsinki. There are talks/tracks covering: Helix-nebula; earth sciences; CSIRT (focus on clouds); tools updates (incl. GOCDB and APEL); Lifesciences; data preservation; vulnerability handling; sustainability; federated clouds; DiRAC; data management ... and of course H2020!
  • Jeremy's 'official' notes are in the GDB wiki. The actions have also been updated.
  • A reminder to register for the workshop if you are attending - registration closes 9th June.
  • The next pre-GDB is on 10th June covering IPv6.

Monday 12th May

  • The PMB discussed the issues raised at last week's ops meeting regarding LHCONE and reiterated that the UK position is that we do not need to join LHCONE, though the technical issue of whether a peering point is possible is being investigated by JANET. The UK position will only change if there is a demonstrable need in this area and the experiments formally request it.
  • There is a pre-GDB this week on Data Access. This includes experiment plans in the area but also a review of the recent workshop (looking at monitoring - what data needs to be kept since there is already 1TB, cost models etc.).
  • There is a GDB on Wednesday 14th May. It consists mainly of update reports in areas such as configuration management and operations coordination.
  • Still waiting on some availability/reliability 'explanations' from last week. How many sites struggle to get useful information from the SAM results?


WLCG Operations Coordination - Agendas

Thursday 15th May

  • There was a middleware readiness meeting last Thursday. Most updates are appearing in the twiki.
  • Focus has been on the volunteer sites and getting a testing process in place, starting from what currently exists/happens at each site.
  • There was also a look at how the middleware baselines information is referenced and used/applied at the T0 and T1s.
  • Some discussion of a proposal on how to monitor installed middleware packages.
  • Discussion mainly about middleware packages vs RPMs and defining what is up-to-date from the results.
  • Tests will be carried out with volunteer sites and Pakiti used as a possible way forward.


Monday 12th May

  • Alastair Dewhurst replaces Simone Campana in the IPv6 task force
  • Future support for ARGUS is being reviewed. SWITCH will support it for another 6 months.
  • There is to be a new task force (or working group) on network and transfer metrics. The proposed mandate is to identify and publish the metrics, make sure that issues can be better understood and fixed, and enable use of network-aware tools.
  • A reminder to register for the WLCG workshop.
  • Security support for EMI-2 ended on April 30th. All baseline versions were increased to EMI-3 except for dCache, for which support was extended.
  • The WLCG baselines page is up-to-date. Please check it.
  • No significant job efficiency differences between CERN Geneva and Wigner (i.e. depending on location) have been found. Still following up on several possibilities (see the MB presentation).
  • T0 WMSes (except SAM instances) now powered off.
  • DPM 1.8.8 has been released to EPEL-stable.
  • A series of storage developer meetings is taking place with the aim of ensuring consistent, complete and correct publishing of storage systems to GLUE2, in particular relating to capacity publishing.
  • ALICE: Activity ahead of Quark Matter 2014 (May 19-24, GSI Darmstadt)
  • ATLAS: MC - lower activity in the last two weeks. Rucio stress test planned to start after 20th of May. Multi core allocation - sites asked to reduce the multi-core partition in case of static single-core/multi-core allocation.
  • CMS: SAM test for glexec goes critical on May 15th. Reminder to sites to please deploy detailed xrootd monitoring. Started to send production workflows through mixture of multi-core and single-core pilots. FTS3 for Phedex Debug transfers becoming mandatory now.
  • Some discussion on strategy for availability recalculation in case of failures or timeouts in submission of SAM jobs through gliteWMS, which do not necessarily affect production jobs.
  • LHCb: Incremental stripping campaign finished, all productions closed. CASTOR->EOS migration of LHCb user data finished.
  • Tracking tools: GGUS proposal to stop ticket creation through email.
  • FTS3: CERN prod instance has been upgraded to the latest stable version 3.2.22. RAL to follow on Wednesday 14th May.
  • glexec: Only one change. See deployment tracking page. (UK= UCL, Lancs, ECDF).
  • Machine/Job features: development and soon deployment of a machine/job features service for a cloud infrastructure. Soon the mfj.py client will not need to be deployed, as the LHC VOs plan to bring it into their s/w stack.
  • M/W readiness: Checking usage of baselines page. Next meeting Thursday 15th @ 09:30 UK time.
  • Multicore: start to evaluate the compatibility of ATLAS and CMS approaches to submitting multicore jobs to shared sites.
  • SHA-2: New issue found for CERN VOMS - job submission to CREAM fails when the proxy is signed by a VOMS server with a SHA512 host certificate. Fix out soon. Sites will then need to update CEs. CMS has found no blocking issues with RFC proxies.
  • WMS decom: CERN WMS instances for experiments have been switched off on May 5
  • IPv6: HEPiX IPv6 meeting last week. Also, lxplus-ipv6.cern.ch, an lxplus instance with dual-stack connectivity now available.
  • HTTP proxy discovery: Waiting on full implementation of the SquidMonitoringTaskForce recommendations - then sites will need to register squids.


Tuesday 6th May

  • There will be a WLCG ops coordination meeting this Thursday 8th May. Pre-meeting reports can be found in the twiki.


Tier-1 - Status Page

Tuesday 20th May

  • Testing CVMFS client 2.1.19 ongoing.
  • In process of scheduling Castor 2.1.14 upgrade. (Now likely to be 10th June for nameserver with stagers in the weeks after that).
  • We are looking at how to end the FTS2 service, now FTS3 is becoming widely used.
  • The software server used by the small VOs will be withdrawn from service (aiming for June).
Storage & Data Management - Agendas/Minutes

Wednesday 21st May

  • DPM upgrade to 1.8.8 - Edinburgh has mostly-puppet-configured, Oxford has yaimed pool-only.
  • WebFTS may be a useful alternative/supplement to GlobusOnline (they both transfer files but in different ways.) Will evaluate once RAL sets it up.
    • Could be useful to support some of the tiny non-GridPP users (few tens of TB), so they can share resources or at least interfaces with GridPP users. Maybe.
  • Summaries of data access pre-GDB at CERN yesterweek. ATLAS encourage sites to support xroot to support FAX, then DAV.
  • xroot is in GOCDB, DAV isn't.

Tuesday 6th May

  • There was a DPM collaboration meeting last Wednesday.
  • The following priorities were agreed for the next year:
    • YAIM->Puppet transition (YAIM support ends this year);
    • I/O Monitoring; GridFTP redirection - available now for testing;
    • Admin interface and improved HTTP file management;
    • Nightly testing of WAN HTTP access performance, Hammercloud;
    • Removal of legacy components where possible (eg RFIO);
    • System logging via dmlite;
    • Rebalancing utilities;
    • and move of web presence and docs to an indexable Drupal site.


Accounting - UK Grid Metrics HEPSPEC06 Atlas Dashboard HS06

Tuesday 20th May

  • Sites with APEL 'delays': IC, Liverpool, Sheffield, Durham, ECDF and Glasgow.

Tuesday 13th May

  • Will review GridPP metrics soon. Trying to get table up-to-date first.
  • No HEPSPEC06 wiki updates showing SL6 results for UCL or RALPP.
  • ATLAS HS06 coefficient for Lancaster 13.9?
  • APEL publishing 'stopped' for Liverpool, ECDF and Glasgow.


Tuesday 29th April

  • Glasgow looks slightly delayed with recent accounting data publishing.

Tuesday 15th April

  • The APEL accounting system has been undergoing database maintenance to improve performance and reliability. Networking problems at the RAL site have delayed completion of the operation. Sites may see nagios alerts warning them that they have not published accounting data for 7 days - these will stop after the maintenance work completes.
Documentation - KeyDocs

See the worst KeyDocs list for documents needing review now and the names of the responsible people.

Tuesday 6th May

  • KeyDocs are going to be reviewed (in the next 4 weeks) as the system is not working (or not adding anything) in some areas.

Tuesday 15th April

Tuesday 1st April

  • Keydocs action needed by Jens J; Rob H/Security T; Alessandra F; Wahid B; David C and Matt D.
  • We need to reassign Mark M's documents on Core Grid Services.


Tuesday 18th March

  • Keydocs action needed by: Mark M; Jens J; Rob H/Security T; Alessandra F; Wahid B; David C and Matt D.
Interoperation - EGI ops agendas

Tuesday 20th May

  • Next meeting June 2


Monitoring - Links MyWLCG

Tuesday 20th May

On-duty - Dashboard ROD rota

Tuesday 20th May

  • Quiet week. Created tickets to cover two low availability alarms just now. No UK-wide problems.

  • EMI-3 upgrades still ongoing. EGI following up on status.

Monday 12th May

  • Problems with dashboard
  • Issue with UCL availability ticket
  • EGI identified EMI/UMD-2 endpoints at:
    • UCL - DPM, WNs, BDII, CE
    • Durham - CE
    • ECDF - CE, info3
    • Sussex - CE, BDII
    • Bristol - CEs


Rollout Status WLCG Baseline

Tuesday 18th March

Tuesday 11th February

  • 31st May has been set as the deadline for EMI-2 decommissioning. There may be an issue for dCache (related to 3rd party/enstore component).

References


Security - Incident Procedure Policies Rota

Tuesday 29th April

  • The changes to the regional dashboard make the on-duty task harder. Need to rely on Pakiti again.

Tuesday 15th April

  • Update on the OpenSSL status.
  • The discussion list members have been updated. Anyone missing?



Services - PerfSonar dashboard | GridPP VOMS

- This includes notifying of (inter)national services that will have an outage in the coming weeks or will be impacted by work elsewhere. (Cross-check the Tier-1 update).

Tuesday 13th May

  • Ewan's gridpp VO membership expired without warning. Does this only go to the VO admin for VOs on the GridPP VOMS?

Tuesday 29th April

  • It was mentioned several weeks ago that the perfsonar meshes were being sorted by host name and that sorting by site name would be available soon. This is now the case. You can see the familiar GridPP site sorting here and the large WLCG mesh here. Note the square of GridPP sites towards the bottom right. Red squares represent throughput of less than 500 Mb/s.
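The change of ordering and the red threshold described above can be sketched as follows. This is only an illustration: the site names, host names and throughput figures are hypothetical, and the only value taken from the text is the 500 Mb/s threshold below which a square is shown red.

```python
THRESHOLD_MBPS = 500  # from the text: red means throughput below 500 Mb/s

# Hypothetical measurements: (site name, host name, throughput in Mb/s)
results = [
    ("SiteB", "ps01.siteb.example", 820.0),
    ("SiteA", "zz-perf.sitea.example", 310.0),
    ("SiteC", "aa-perf.sitec.example", 495.5),
]

def by_host(rows):
    """Old ordering: sort on the host name."""
    return sorted(rows, key=lambda r: r[1])

def by_site(rows):
    """New ordering: sort on the site name."""
    return sorted(rows, key=lambda r: r[0])

def is_red(throughput_mbps):
    """True when the throughput falls below the red threshold."""
    return throughput_mbps < THRESHOLD_MBPS

for site, host, mbps in by_site(results):
    print(site, host, mbps, "RED" if is_red(mbps) else "not red")
```

Sorting on the site name keeps hosts from the same site adjacent regardless of how each site names its machines, which is what makes the GridPP sites appear as a block in the larger mesh.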
Tickets

Monday 19th of May 2014, 16.45 BST
30 Open tickets this week.

Big news at the moment are the EMI upgrade tickets (still). The Durham and Sussex tickets have stalled a little. UCL are working on it but have hit problems with their DPM upgrade. Edinburgh are waiting on nagios jobs to start running on their upgraded kit, but otherwise look almost done, and Bristol have vanquished their ticket. The balloon is going up on this one, we have 11 days left till the deadline.

Ye Olde ILC ticket is almost done, Durham have made the move and are just waiting on ILC to test (who in turn are waiting for Durham to come back online).

The only other ticket that really catches my eye is this atlas one:
https://ggus.eu/index.php?mode=ticket_info&ticket_id=105308

It concerns multicore atlas jobs failing at RAL. Alastair did a good bit of sleuthing and looks to have tracked the problem down to an issue with multiple multicore jobs running on the same node - which is worrying. Watch this space! (where space == ticket).

Tools - MyEGI Nagios

Tuesday 20th May

Between May 1st and May 12th, SAM-CENTRAL and the Message Broker Network experienced a set of chained failures that resulted in the loss of a large portion of the metric results that were published by the SAM NGI Instances. The loss of these messages will result in an unusually high number of UNKNOWNS in the May A/R reports, but the actual A/R numbers will not be affected as UNKNOWNS are not taken into account. No other services have been affected.
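To illustrate why extra UNKNOWNS need not change the A/R figures, here is a minimal sketch. It assumes availability is the fraction of known (non-UNKNOWN) samples that are OK; the function name and the sample status lists are hypothetical, and this is not the actual SAM calculation code.

```python
def availability(statuses):
    """Fraction of known samples that are OK; UNKNOWN samples are ignored."""
    known = [s for s in statuses if s != "UNKNOWN"]
    if not known:
        return None  # nothing known, so no figure can be computed
    return sum(1 for s in known if s == "OK") / len(known)

# Hypothetical test results for one service, with and without lost messages.
complete = ["OK", "OK", "CRITICAL", "OK"]
with_gaps = ["OK", "UNKNOWN", "OK", "UNKNOWN", "CRITICAL", "OK", "UNKNOWN"]

print(availability(complete))   # 0.75
print(availability(with_gaps))  # 0.75
```

Under this assumption the lost results only shrink the set of samples the figure is computed from; they are not counted as failures.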

Tuesday 13th May

  • From last week's discussion DiRAC now supports: NA62, vo.landslides.mossaic.org, t2k.org, snoplus, gridpp, CERN@school and northgrid. NA62 are moving from LFC to DFC and plan to use DiRAC in place of the WMS.

Monday 17th March

Tuesday 26th November

  • Regional Nagios updated to release 22. It is a glite to UMD update and it required a fresh installation.
  • There have been some internal changes in SAM-Nagios. Test probes are now the responsibility of the product teams, and some test names have changed as a result of this reorganization. For example, the org.sam.CREAMCE-DirectJobSubmit test has become emi.cream.CREAMCE-DirectJobSubmit. This does not affect operational activities.
  • Please could all site admins look at services associated to their site and please mail Kashif if anything odd is noticed. Site admins can reschedule tests for their sites and it would be helpful if most functionalities are tested.
  • Also, look at myegi which can be useful with links to the Dashboard, GSTAT, Accounting Portal and GGUS.
VOs - GridPP VOMS VO IDs Approved VO table

Tuesday 15th April

  • Is there interest in an FTS3 web front end? (more details)

Monday 17 February 2014

  • Proxy renewal
    • All RAL WMSs now renew proxies with 1024 bits. This looks like the end of this (at last).


Tuesday 11 February 2014

  • Proxy renewal
    • lcgwms06 at RAL has been upgraded and works
    • Both Imperial's WMSs work
    • Glasgow's will still need to be upgraded (unless they have been since Friday).
Site Updates

Tuesday 20th May

  • Various sites, but notably Oxford, have ARGUS problems. 100s of requests seen per minute. Performance issues have been noted after initial installation at RAL, QMUL and others.


Meeting Summaries
Project Management Board - Members Minutes Quarterly Reports

Empty

GridPP ops meeting - Agendas Actions Core Tasks

Empty


RAL Tier-1 Experiment Liaison Meeting (Wednesday 13:30) Agenda Meeting takes place on Vidyo.

Wednesday 14th May 2014

  • Operations report
  • Ongoing testing of CVMFS client 2.1.19. So far so good
  • In process of scheduling Castor 2.1.14 upgrade. Proposed date for Nameserver upgrade now changed to Tuesday 10th June.
  • As stated last week we are proposing to turn off the CREAM CEs. We are also starting to plan to end the FTS2 service.
  • Reminder: The software server used by the small VOs will be withdrawn from service (aiming for June).
WLCG Grid Deployment Board - Agendas MB agendas

Empty



NGI UK - Homepage CA

Empty

Events
UK ATLAS - Shifter view News & Links

Empty

UK CMS

Empty

UK LHCb

Empty

UK OTHER
  • N/A
To note

  • N/A