Operations Bulletin Latest

From GridPP Wiki

Bulletin archive


Week commencing 9th June 2014
Task Areas
General updates

Monday 27th May

  • There is a DIRAC workshop at CERN this week. If you, or VOs you work with, have any thoughts on specific requirements for future DIRAC development please let Janusz/Dan or Jeremy know.
  • An EGI Operations Management Board takes place on Tuesday morning. The topics: CSIRT update; resource allocations; gfal2 replacing lcg_utils; central SAM update; EMI-2 decommissioning and update on GPGPUs.
  • The latest ops portal update integrated the possibility to register and update VO ID cards according to a new discipline classification. VO managers are being encouraged to check/update their VO ID card via this link.
  • There was a GridPP technical meeting last Friday.
  • The final WLCG Tier-2 reliability & availability reports for April are now uploaded.
  • There has been some TB-Support discussion on GitHub vs BitBucket. Any conclusions?


Monday 19th May

  • David C has put together a blog on monitoring. Who can/will contribute content?
  • HEPiX takes place this week (19th-23rd May) and talks are available from the event page. Monday covered some site reports and OS related updates. Tuesday's focus is batch systems. Wednesday covers IPv6, security and benchmarking. Thursday covers storage, monitoring and infrastructure deployment. Friday is cloud day.
  • The EGI Community Forum takes place this week in Helsinki. There are talks/tracks covering: Helix-nebula; earth sciences; CSIRT (focus on clouds); tools updates (incl. GOCDB and APEL); Lifesciences; data preservation; vulnerability handling; sustainability; federated clouds; DiRAC; data management ... and of course H2020!
  • Jeremy's 'official' notes are in the GDB wiki. The actions have also been updated.
  • A reminder to register for the workshop if you are attending - registration closes 9th June.
  • The next pre-GDB is on 10th June covering IPv6.
WLCG Operations Coordination - Agendas

Monday 27th May

  • A reminder of the need to register for the WLCG workshop.
  • Andrea Manzi takes over the maintenance of the baseline versions as WLCG Middleware Officer.
  • T0: Quattor phase-out - CERN is currently migrating all centrally managed services from Quattor to a new Puppet based Configuration Management system.
  • DPM 1.8.8 was released last week including a new gridftp component, which has issues with the gridftp implementation in FTS2 (currently still used only by CMS); FTS3 transfers are OK. Since CMS is anyway pushing to switch to FTS3, the affected sites which have already upgraded are asked to change their PhEDEx configuration to use FTS3. Sites which have not yet upgraded are encouraged to wait for the DPM fix which is in testing and is probably going to be released next week.
  • ALICE: Returns to high production activities.
  • ATLAS: Rucio full chain testing starting now. All but US sites moved. LFC at CERN no longer used.
  • CMS: SAM test for glexec critical on May 19th. Test for xrootd fallback goes critical soon.
  • LHCb: Mostly MonteCarlo production and user analysis during the last 2 weeks.
  • Tracking Tools: GGUS release on the 26th. The alarms for UK and USA will be done on the 27th; the rest on the 26th.
  • FTS3: CMS opening GGUS tickets to sites to complete migration to FTS3 in PhEDEx Debug.
  • glexec: 12 tickets still open (-3 on last meeting). Related - there have been many tickets for ARGUS instabilities recently, but the experts are actively following up and hopefully a solution will be found soon.
  • Middleware readiness: an internally developed solution is going to be used to track middleware versions instead of Pakiti, since Pakiti would need to be extended anyway.
  • SHA2: ATLAS discovered that Condor still uses an old CREAM client that does not support RFC proxies, which blocks the introduction of the new CERN VOMS servers.
  • WMS decom: hard deadline is 31st of October since the SAM WMS machines are Quattor-managed.
  • IPv6: Please complete the site survey.
  • N&T (network and transport) metrics: New group with kick-off soon. Propose two sub-groups:
    • Deployment, commissioning and maintenance of the tools providing network and transfer metrics (FAX, AAA, FTS, perfSONAR, etc.) - versions and configuration tracking (parameter tuning), operational issues, etc.
    • Higher-level services that will make use of the provided metrics (PhEDEx, Panda, Rucio) - technical aspects of existing metrics (latency, measurement methodology, API access, etc.), identification of missing metrics, data analytics on archived data.

Thursday 15th May

  • There was a middleware readiness meeting last Thursday. Most updates are appearing in the twiki.
  • Focus has been on the volunteer sites, on getting a testing process in place, and on what currently exists/happens at each site.
  • There was also a look at how the middleware baselines information is referenced and used/applied at the T0 and T1s.
  • Some discussion of a proposal on how to monitor installed middleware packages.
  • Discussion mainly about middleware packages vs RPMs and defining what is up-to-date from the results.
  • Tests will be carried out with volunteer sites and Pakiti used as a possible way forward.


Tier-1 - Status Page

Tuesday 27th May

  • CVMFS client 2.1.19 rolled out to whole batch farm. Looks good.
  • Castor 2.1.14 upgrade: firming up the date of 10th June for the nameserver, with stagers following: CMS - Tue 17th June; LHCb - Thu 19th June; GEN - Tue 24th June; Atlas - Thu 26th June.
  • We are looking at how to end the FTS2 service, now that FTS3 is becoming widely used.
  • The software server used by the small VOs will be withdrawn from service.
Storage & Data Management - Agendas/Minutes

Wed 28 May 2014

  • FTS capabilities - with and without Web interface - interest in more tests
  • Impact of deprecation of lcg-utils - particularly for non-LHC VOs that use LFC. Conversely, started playing with GFAL2 (Sam).
  • Interest in DIRAC tutorial either at hepsysman or next GridPP.

Wed 21 May 2014

  • DPM upgrade to 1.8.8 - Edinburgh has mostly-puppet-configured, Oxford has yaimed pool-only.
  • WebFTS may be a useful alternative/supplement to GlobusOnline (they both transfer files but in different ways). Will evaluate once RAL sets it up.
    • Could be useful to support some of the tiny non-GridPP users (few tens of TB), so they can share resources or at least interfaces with GridPP users. Maybe.
  • Summaries of the data access pre-GDB at CERN last week. ATLAS encourage sites to support xroot (in order to support FAX), then DAV.
  • xroot is in GOCDB, DAV isn't.

Tuesday 6th May

  • There was a DPM collaboration meeting last Wednesday.
  • The following priorities were agreed for the next year:
    • YAIM->Puppet transition (YAIM support ends this year);
    • I/O Monitoring; GridFTP redirection - available now for testing;
    • Admin interface and improved HTTP file management;
    • Nightly testing of WAN HTTP access performance, Hammercloud;
    • Removal of legacy components where possible (eg RFIO);
    • System logging via dmlite;
    • Rebalancing utilities;
    • and move of web presence and docs to an indexable Drupal site.


Accounting - UK Grid Metrics HEPSPEC06 Atlas Dashboard HS06

Tuesday 20th May

  • Sites with APEL 'delays': IC, Liverpool, Sheffield, Durham, ECDF and Glasgow.

Tuesday 13th May

  • Will review GridPP metrics soon. Trying to get table up-to-date first.
  • No HEPSPEC06 wiki updates showing SL6 results for UCL or RALPP.
  • ATLAS HS06 coefficient for Lancaster 13.9?
  • APEL publishing 'stopped' for Liverpool, ECDF and Glasgow.


Tuesday 29th April

  • Glasgow looks slightly delayed with recent accounting data publishing.

Tuesday 15th April

  • The APEL accounting system has been undergoing database maintenance to improve performance and reliability. Networking problems at the RAL site have delayed completion of the operation. Sites may see nagios alerts warning them that they have not published accounting data for 7 days - these will stop after the maintenance work completes.
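The 7-day publishing warning mentioned above can be illustrated with a short sketch. This is not the actual nagios probe: the function name, the site labels and the choice to treat exactly 7 days as already late are all assumptions made for illustration.

```python
from datetime import date, timedelta

# Assumption: a site is flagged once its newest published record is 7 or more days old.
STALE_AFTER = timedelta(days=7)


def stale_sites(last_published, today):
    """Return the sites whose most recent accounting record is too old.

    last_published maps a site label to the date of its newest published record.
    """
    return sorted(
        site for site, last in last_published.items()
        if today - last >= STALE_AFTER
    )


if __name__ == "__main__":
    # Hypothetical site labels and dates, not real publishing data.
    example = {
        "SITE-A": date(2014, 4, 14),
        "SITE-B": date(2014, 4, 8),
        "SITE-C": date(2014, 4, 1),
    }
    print(stale_sites(example, today=date(2014, 4, 15)))
```

During the maintenance window, sites that are publishing normally could still show up in such a list simply because the central database has not yet ingested their records.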
Documentation - KeyDocs

See the worst KeyDocs list for documents needing review now and the names of the responsible people.

Tuesday 6th May

  • KeyDocs are going to be reviewed (in next 4 weeks) as the system is not working (or not adding anything) in some areas.

Tuesday 15th April

Tuesday 1st April

  • Keydocs action needed by Jens J; Rob H/Security T; Alessandra F; Wahid B; David C and Matt D.
  • We need to reassign Mark M's documents on Core Grid Services


Tuesday 18th March

  • Keydocs action needed by: Mark M; Jens J; Rob H/Security T; Alessandra F; Wahid B; David C and Matt D.
Interoperation - EGI ops agendas

Tuesday May 20th

  • Next meeting June 2


Monitoring - Links MyWLCG

Tuesday 20th May

On-duty - Dashboard ROD rota

Tuesday 20th May

  • Quiet week. Created tickets to cover two low availability alarms just now. No UK-wide problems.

  • EMI-3 upgrades still ongoing. EGI following up on status.

Monday 12th May

  • Problems with dashboard
  • Issue with UCL availability ticket
  • EGI identified EMI/UMD-2 endpoints at:
    • UCL - DPM, WNs, BDII, CE
    • Durham - CE
    • ECDF - CE, info3
    • Sussex - CE, BDII
    • Bristol - CEs


Rollout Status WLCG Baseline

Tuesday 18th March

Tuesday 11th February

  • 31st May has been set as the deadline for EMI-2 decommissioning. There may be an issue for dCache (related to 3rd party/enstore component).

References


Security - Incident Procedure Policies Rota

Monday 26th May

  • NGI security communications were tested today.

Tuesday 29th April

  • The changes to the regional dashboard make the on-duty task harder. Need to rely on Pakiti again.

Tuesday 15th April

  • Update on the OpenSSL status.
  • The discussion list members have been updated. Anyone missing?



Services - PerfSonar dashboard | GridPP VOMS

- This includes notifying of (inter)national services that will have an outage in the coming weeks or will be impacted by work elsewhere. (Cross-check the Tier-1 update).

Tuesday 13th May

  • Ewan's gridpp VO membership expired without warning. Does this only go to the VO admin for VOs on the GridPP VOMS?

Tuesday 29th April

  • It was mentioned several weeks ago that the perfsonar meshes were being sorted by host name and that sorting by site name would be available soon. This is now the case. You can see the familiar GridPP site sorting here and the large WLCG mesh here. Note the square of GridPP sites towards the bottom right. Red squares represent throughput of less than 500 Mb/s.
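As a rough illustration of the two ideas in this update (ordering mesh hosts by site name rather than host name, and colouring a cell red below 500 Mb/s), here is a minimal sketch. It is not the perfSONAR dashboard code; the function names, the host names and the data layout are hypothetical. Only the 500 Mb/s threshold comes from the text above.

```python
# Threshold from the bulletin: red squares mean throughput of less than 500 Mb/s.
RED_BELOW_MBPS = 500


def sort_hosts_by_site(host_to_site):
    """Order hosts by site name (case-insensitive), using host name as tie-breaker."""
    return sorted(host_to_site, key=lambda host: (host_to_site[host].lower(), host))


def red_cells(throughput_mbps):
    """Return the (source, destination) pairs that would be drawn red."""
    return sorted(
        pair for pair, rate in throughput_mbps.items()
        if rate < RED_BELOW_MBPS
    )


if __name__ == "__main__":
    # Hypothetical host names and rates, for illustration only.
    hosts = {
        "ps-b.example.org": "Glasgow",
        "ps-a.example.org": "Oxford",
        "ps-c.example.org": "Durham",
    }
    rates = {
        ("ps-a.example.org", "ps-b.example.org"): 870.0,
        ("ps-b.example.org", "ps-c.example.org"): 120.5,
        ("ps-c.example.org", "ps-a.example.org"): 500.0,
    }
    print(sort_hosts_by_site(hosts))
    print(red_cells(rates))
```

Sorting by site name keeps the hosts of one site or region adjacent, which is what makes the GridPP sites appear as a recognisable square in the larger WLCG mesh.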
Tickets

Monday 9th June 2014, 15.00 BST
26 Open Tickets this week.

NGI
https://ggus.eu/index.php?mode=ticket_info&ticket_id=101502 (24/2)
The ILC ticket. Things got a bit muddled but ILC would like to know the state of Durham's CE. My impression is that they're submitting to a now defunct one - could you please let us know what's up? In progress (9/6)

https://ggus.eu/index.php?mode=ticket_info&ticket_id=105989 (4/6)
Technically I think this is a Glasgow ticket - I was going to give this a home there but noticed that the ticket looked solved (it concerned enabling the cern@school cvmfs at Glasgow - which the Glasgow lads had done alongside the other gridpp repos). In progress (can be solved) (6/6)

SUSSEX
https://ggus.eu/index.php?mode=ticket_info&ticket_id=105937 (2/6)
Sussex got a low availability nagios ticket - Matt RB replied that the trouble is with the EMI3 upgrade and hopes to have dug his way out of that pit shortly. In progress (9/6)

https://ggus.eu/index.php?mode=ticket_info&ticket_id=102810 (28/3)
Sussex's EMI3 upgrade ticket. The deadline has passed, and anything not upgraded is in downtime. How goes things? In progress (2/6)

https://ggus.eu/index.php?mode=ticket_info&ticket_id=105618 (21/5)
Sno+ were/are having cvmfs problems at Sussex. Related to 105989 above, has /cvmfs/snoplus.snolab.ca been replaced by /cvmfs/snoplus.gridpp.ac.uk? (The latter of which I can see at my site). In progress (29/5).

BIRMINGHAM
https://ggus.eu/index.php?mode=ticket_info&ticket_id=106020 (6/6)
Some little lost cern@school jobs at Birmingham, sitting in an odd state. Matt W is having a look, suspecting argus. In progress (6/6)

GLASGOW
https://ggus.eu/index.php?mode=ticket_info&ticket_id=106011 (5/6)
Atlas deletion errors at Glasgow. Sam and the lads suspect a dodgy disk pool, and are working on it. In progress (6/6)

EDINBURGH
https://ggus.eu/index.php?mode=ticket_info&ticket_id=105996 (5/6)
Duncan spotted that the ECDF perfsonar box had fallen over. Andy and Wahid are prodding it with their remote stick. In progress (9/6)

https://ggus.eu/index.php?mode=ticket_info&ticket_id=105839 (28/5)
Glue Validator failures at ECDF. Andy's reckoning that the CEs are misconfigured, and is digging into the guts of the matter. In progress (3/6)

https://ggus.eu/index.php?mode=ticket_info&ticket_id=95303 (1/7/2013)
My shame, the tarball glexec tickets. Sorry to say nothing to see here again. On hold (27/1)

SHEFFIELD
https://ggus.eu/index.php?mode=ticket_info&ticket_id=105617 (21/5)
A Sno+ cvmfs ticket, similar to the Sussex one (105618). Not much news on it. In progress (21/5)

MANCHESTER
https://ggus.eu/index.php?mode=ticket_info&ticket_id=105922 (2/6)
Manchester are still publishing using the EMI2 apel. The work is scheduled to be done next (this) week. In the meantime, has publishing been turned off? On hold (2/6)

LANCASTER
https://ggus.eu/index.php?mode=ticket_info&ticket_id=105939 (2/6)
Biomed ticketed Lancaster over gridftp not being open on our dpm headnode. After advice from Sam we decided that opening up the firewall ports would be okay, but also told biomed that restricting gfal to just one protocol was a bit silly. Waiting to hear if all's well for them. Waiting for reply (9/6)

https://ggus.eu/index.php?mode=ticket_info&ticket_id=100566 (27/1/2014)
Poor perfsonar bandwidth performance at Lancaster. Following Duncan's advice a downtime has been declared to try a reinstall of the node on Wednesday. In progress (9/6)

https://ggus.eu/index.php?mode=ticket_info&ticket_id=95299 (1/7/2013)
glexec tarball ticket. On hold (4/4)

UCL
https://ggus.eu/index.php?mode=ticket_info&ticket_id=101285 (16/2)
UCL's perfsonar hit a spot of hardware trouble. Disks and RAID controller have been replaced, last word was that the OS was hoped to be reinstalled at the end of April. I suspect then the EMI3 upgrade storm hit. Any news since? On hold (28/4)

https://ggus.eu/index.php?mode=ticket_info&ticket_id=95298 (1/7/2013)
UCL's glexec ticket. At last word waiting on a new staff member to take the reins. On hold (16/4)

EFDA-JET
https://ggus.eu/index.php?mode=ticket_info&ticket_id=97485 (21/9/13)
LHCB problems at JET. The last update was from me in May, saying that I'd ask for help on JET's behalf (which I did...but failed to push on it. Sorry JET). On hold (12/5)

TIER 1
https://ggus.eu/index.php?mode=ticket_info&ticket_id=98249 (21/10/2013)
Sno+ CVMFS ticket. After looking like it was almost done this ticket has become a bit more murky in recent weeks, with talk of desire for an OSG "mirror" which Catalin points out breaks the cvmfs model. I think some more planning in Sno+ and discussion with the experts is needed. Waiting for reply (2/6)

https://ggus.eu/index.php?mode=ticket_info&ticket_id=105405 (14/5)
A Vidyo router firewall ticket. Not really sure it's that interesting to anyone outside the Tier 1 - although there are a lot of Vidyo documentation links that might be useful. Not much news on the ticket for a while. In progress (27/5)

https://ggus.eu/index.php?mode=ticket_info&ticket_id=105571 (20/5)
Mismatch between bdii and srm storage numbers - which has happened before (101310). No news. In progress (3/6)

https://ggus.eu/index.php?mode=ticket_info&ticket_id=105100 (2/5)
CMS are doing a round of their Storage Consistency Checks. There's been some back and forth between CMS and RAL with clean up being done. Not entirely sure what's the next step for this ticket - it doesn't seem to be a problem yet though. In progress (6/6)


Tools - MyEGI Nagios

Tuesday 20th May

Between May 1st and May 12th, SAM-CENTRAL and the Message Broker Network experienced a set of chained failures that resulted in the loss of a large portion of the metric results published by the SAM NGI instances. The loss of these messages will result in an unusually high number of UNKNOWNs in the May A/R reports, but the actual A/R numbers will not be affected as UNKNOWNs are not taken into account. No other services have been affected.
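To see why extra UNKNOWN results leave the A/R figures unchanged, here is a minimal sketch. It assumes the commonly quoted definitions, in which UNKNOWN time is removed from the denominator of both figures and scheduled downtime is additionally removed for reliability. The status labels, the function name and the equal-length time bins are illustrative assumptions, not the SAM implementation.

```python
from collections import Counter


def availability_reliability(samples):
    """Compute (availability, reliability) from per-time-bin status labels.

    Assumed definitions (a sketch, not the SAM code):
      availability = OK / (total - UNKNOWN)
      reliability  = OK / (total - UNKNOWN - SCHEDULED_DOWNTIME)
    Returns None for a figure whose denominator is zero.
    """
    counts = Counter(samples)
    ok = counts["OK"]
    known = sum(counts.values()) - counts["UNKNOWN"]
    unscheduled = known - counts["SCHEDULED_DOWNTIME"]
    availability = ok / known if known else None
    reliability = ok / unscheduled if unscheduled else None
    return availability, reliability


if __name__ == "__main__":
    # Hypothetical month: 8 good bins and 2 failed bins.
    normal = ["OK"] * 8 + ["CRITICAL"] * 2
    # Same month with 10 additional bins whose results were lost.
    with_losses = normal + ["UNKNOWN"] * 10
    print(availability_reliability(normal))
    print(availability_reliability(with_losses))
```

Under these assumptions both calls give the same pair of values: the lost results only increase the UNKNOWN count, which is subtracted before either ratio is taken.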

Tuesday 13th May

  • From last week's discussion DiRAC now supports: NA62, vo.landslides.mossaic.org, t2k.org, snoplus, gridpp, CERN@school and northgrid. NA62 are moving from LFC to DFC and plan to use DiRAC in place of the WMS.

Monday 17th March

Tuesday 26th November

  • Regional Nagios updated to release 22. It is a glite to UMD update and it required a fresh installation.
  • There have been some internal changes in SAM-Nagios. Test probes are now the responsibility of the product teams. Some test names have been changed as a result of this reorganization. For example, the org.sam.CREAMCE-DirectJobSubmit test has become emi.cream.CREAMCE-DirectJobSubmit. This does not affect the operational activities.
  • Could all site admins please look at the services associated with their site and mail Kashif if anything odd is noticed. Site admins can reschedule tests for their sites and it would be helpful if most functionalities are tested.
  • Also, look at myegi which can be useful with links to the Dashboard, GSTAT, Accounting Portal and GGUS.
VOs - GridPP VOMS VO IDs Approved VO table

Tuesday 15th April

  • Is there interest in an FTS3 web front end? (more details)

Monday 17 February 2014

  • Proxy renewal
    • All RAL WMSs now renew proxies with 1024 bits. This looks like the end of this (at last).


Tuesday 11 February 2014

  • Proxy renewal
    • lcgwms06 at RAL has been upgraded and works
    • Both Imperial's WMSs work
    • Glasgow's will still need to be upgraded (unless they have been since Friday).
Site Updates

Tuesday 20th May

  • Various sites but notably Oxford have ARGUS problems. 100s of requests seen per minute. Performance issues have been noted after initial installation at RAL, QMUL and others.


Meeting Summaries
Project Management Board - Members Minutes Quarterly Reports

Empty

GridPP ops meeting - Agendas Actions Core Tasks

Empty


RAL Tier-1 Experiment Liaison Meeting (Wednesday 13:30) Agenda Meeting takes place on Vidyo.

Wednesday 28th May 2014

  • Operations report
  • Castor Nameserver 2.1.14 update on 10th June announced in GOC DB. Stager dates to follow (CMS - Tue 17th June; LHCb - Thu 19th June; GEN - Tue 24th June; Atlas - Thu 26th June).
  • The rollout of CVMFS Client version 2.1.19 has been completed.
  • The UK EScience CA has switched to issuing SHA2 signed certificates this morning.
WLCG Grid Deployment Board - Agendas MB agendas

Empty



NGI UK - Homepage CA

Empty

Events
UK ATLAS - Shifter view News & Links

Empty

UK CMS

Empty

UK LHCb

Empty

UK OTHER
  • N/A
To note

  • N/A