Difference between revisions of "Operations Bulletin Latest"

From GridPP Wiki
Line 114:
  * It took a while to recover IPv6 connectivity after the power outage on the 20th November, but all is restored now.
  * There are problems with the new CA trust anchor release. The  Tier1 plans to update to version 1.88 on the 12th Dec.
− * Patching and updates are ongoing. The production FTS is being updated to 3.77 today. The Castor Database machines will be patched tomorrow. The Castor SRM machines will be updated Tomorrow (6th Dec).
+ * Patching and updates are ongoing. The production FTS is being updated to 3.7.7 today. The Castor Database machines will be patched tomorrow (6th Dec). The Castor SRM machines will be updated tomorrow (6th Dec).
  * The WMS service at RAL will be decommissioned at the end of December.
  <!-- **********************End T1 text************************** ----->

Revision as of 09:12, 5 December 2017

Bulletin archive


Week commencing Monday 4th December 2017
Task Areas
General updates

Monday 4th December


Monday 27th November

  • HSF CWP "Roadmap for HEP Software and Computing R&D for the 2020s"
  • CHEP 2016 proceedings have been published.
  • The next WLCG GDB will be on 13th December. Agenda.
  • perfSONAR configuration change.


Monday 20th November



WLCG Operations Coordination - Agendas Wiki Page

Tuesday 21st November

  • The next WLCG ops coordination meeting will be on 7th December. Do we have any topics we would like covered?

Tuesday 7th November

  • The WLCG ops coordination meeting last week (agenda | minutes) had the following highlights:
    • Security Operations Center WG Workshop/Hackathon
    • Review of EL7 migration plans
    • Providing reliable storage: presentation by BNL


Monday 23rd October

  • The next WLCG ops coordination meeting will be on 2nd November.

Tuesday 10th October

  • There was an ops coordination meeting last Thursday 5th October: Agenda | Minutes.
    • There is one action for us to follow up: "Collect plans from sites to move to EL7". We could make use of the Batch Systems table.


Tier-1 - Status Page

Tuesday 5th December

A reminder that there is a weekly Tier-1 experiment liaison meeting. Notes from the last meeting here.

  • It took a while to recover IPv6 connectivity after the power outage on the 20th November, but all is restored now.
  • There are problems with the new CA trust anchor release. The Tier1 plans to update to version 1.88 on the 12th Dec.
  • Patching and updates are ongoing. The production FTS is being updated to 3.7.7 today. The Castor Database machines will be patched tomorrow (6th Dec). The Castor SRM machines will be updated tomorrow (6th Dec).
  • The WMS service at RAL will be decommissioned at the end of December.
Storage & Data Management - Agendas/Minutes

Wed 15 Nov

  • More xcache progress at some sites; less at others who may be blocking on dependencies or be busy with Other Things(tm)
  • Some interest in following the non-X.509 authentication/authorisation

Wed 08 Nov

  • Progress on xcache testing that is not (very) modest at all... from Chris at RALPP

Wed 01 Nov

  • (very) modest progress on xcache testing...
  • Except that RALPP has it fully working for dCache - hope to get full report next week

Tuesday 3rd October

  • Looking for small DPM ATLAS site to possibly test xrootd caching
  • Tracking the upgrade of SEs to what will be the new baseline versions. (dCache sites updated; 6/14 DPM up-to-date)
  • Alastair D. to give T2 site storage evolution talk 4/10/17 at gridpp-storage meeting


Tier-2 Evolution - GridPP JIRA

Tuesday 14 Nov

  • Single processor LHCb stoppable Monte Carlo VMs running in a mixture with fixed length 8 processor ATLAS VMs and single processor GridPP VMs. Please get in touch if you are running Vac and want to try this.

Tuesday 3 Oct

  • Production tests of Vac 03.00pre at Manchester.


Accounting - UK Grid Metrics HEPSPEC06 Atlas Dashboard HS06

Tuesday 24th Oct


Monday 16th January

  • The discussion topic for next week will be accounting comparisons. Please note Alessandra's comments last week.

Monday 14th November

  • Alessandra has written an FAQ to extract numbers from ATLAS and APEL avoiding the SSB.

Monday 26th September

  • A problem with the APEL Pub and Sync tests developed last Tuesday and was resolved on Wednesday. This had a temporary impact on the accounting portal.
Documentation - KeyDocs

Tuesday 21st Nov 2017

TEMPORARY NOTICE

VOMS servers located at CNAF have been knocked out due to some flooding. It is not known when they might be recovered. Hence, at present, there are no VOMS servers at all for either planck or ipv6.hepix.org. A single new VOMS server has been provided for enmr. See the VOMS records in the Approved VOs document for details.

Ste Jones

Tuesday 26th Sept 2017

  • Safari on macOS autogenerated client certificates fix put into production last week.

Tuesday 19th Sept 2017

  • Bug in Safari on macOS causes random Apple generated client cert to be presented to www.gridpp.ac.uk using a CA we don't know about. The connection is then rejected by Apache. Other browsers ok since we request but do not require that they supply a certificate. Workaround produced by (a) installing the Apple CA root and intermediary CA and (b) whitelisting UK e-Science DNs (+ CERN? Others?) in the HTTPS Wiki plugin. Tested but not yet in production.

Tuesday 5th Sept 2017

New Approved VO. SKA European regional data centre, skatelescope.eu

https://www.gridpp.ac.uk/wiki/GridPP_approved_VOs

This VO is ramping up and has requested support. Andrew McNab is the contact person for GridPP.

General note

See the worst KeyDocs list for documents needing review now and the names of the responsible people.

Interoperation - EGI ops agendas Indico schedule

Monday 20th November

  • There was an EGI Operations Meeting today: Agenda.

Monday 9th October

  • MW
    • UMD 4.6 in November
      • UI, CREAM, ARGUS
      • If not before in 4.5.1
    • UMD Preview update on 22/9
      • not rooted 4.7.0 has a bug, wait for update
      • CREAM version in Preview, please test if applicable (leading to release as noted above)
  • Ops
    • cASO (OpenStack Accounting extractor) update: 100%IT done (GGUS ticket 130665)
    • A/R nothing to report
    • Computation Power
      • Sites check data by end of the month
    • EMI WMS
      • mice noted as being enabled on DIRAC
    • IPv6 Readiness
      • Would be useful for us to provide a report
    • webdav probes
      • If you're advertising a webdav endpoint, make sure it's in the GOC DB
    • Storage accounting deployment
      • Enable by 30th October; tickets to follow. Note that, per the broadcast, new data is currently not showing; please raise tickets only for errors in logs.


Monitoring - Links MyWLCG

Tuesday 4th July

  • There were a number of useful links provided in the monitoring talks at the WLCG workshop in Manchester - especially those in the Wednesday sessions.

Monday 13th February

  • This category is pretty much inactive. Are there any topics under "monitoring" that anyone wants reported at this ops meeting? If not we will remove this section from the regular updates area of the bulletin and just leave the main links.

Tuesday 1st December


Tuesday 16th June

  • F Melaccio & D Crooks decided to add a FAQs section devoted to common monitoring issues under the monitoring page.
  • Feedback welcome.


On-duty - Dashboard ROD rota

Monday 20th November

  • Generally quiet. There are three outstanding tickets: low availability at Birmingham, one at Liverpool which might just have gone green, and out-of-date IGTF CAs at Sheffield.
Rollout Status WLCG Baseline

Monday 20th November



Historical References


Security - Incident Procedure Policies Rota
  • GGUS tickets and the trust anchor. Who to ticket result?

Tuesday 28th November

  • Nothing new to report (tracking update from last week - remember you can check your site!)

Tuesday 21st November

A few sites have shown up in pakiti with high risk vulnerabilities (CVE-2017-1000364). If you haven't been prodded about this already then expect an email from someone in the security team soon.

Monday 30th October

  • EGI SVG ALERT [TLP:WHITE] Up to 'CRITICAL' risk, kernel exploit CVE-2017-7184 and others [EGI-SVG-CVE-2017-7184]
  • Upcoming meetings
    • Details and registration for SOC WG Workshop in December: Advert Indico



Services - PerfSonar production dashboard | PerfSonar development dashboard | GridPP VOMS

- This includes notifying of (inter)national services that will have an outage in the coming weeks or will be impacted by work elsewhere. (Cross-check the Tier-1 update).

Monday 20th November

  • perfSONAR: No news.
  • GridPP DIRAC: DPM & xrootd version. Upgrade due 16th November.
  • RIPE ATLAS probes: Status & tests TBC
  • The next LHCOPN/LHCONE meeting will be at RAL 6-7th March 2018. The UK position on LHCONE is being reviewed. RAL T1 will (likely) connect via the LHCOPN in the coming months. T2s watch this space!


Monday 23rd October

  • The LHCOPN/ONE meeting took place last week at KEK, co-hosted with the HEPiX Fall meeting.
    • You can find a short report of the meeting here.
    • All the slides presented are available in the agenda.

Tuesday 4th July

  • Pete C is preparing another networking forward look document. Note that figures presented in the Manchester workshop were for a Tier-2 (not always a single site).

here

  • Duncan has recreated the UK perfSONAR mesh. Link here!


Tickets

Monday 4th December 2017, 15.00 GMT
38 Open Tickets this week

IPv6 Deployment tickets.
We have 15 of these left open in the UK (I think we've only closed one?). Rather than go through them individually I thought I would just give folks a chance to chime in with any specific issues. Most of these tickets probably want putting on hold if they're not already.

The rest of the tickets, site by site.

SUSSEX
132027 (23/11)
A ROD ticket, Sussex are having trouble with their CREAM - although everything looks green now so maybe the ticket can be closed? In progress (30/11)

132233 (1/12)
Standard issue ROD low availability ticket. On hold (1/12)

122772 (11/7/16)
The atlas webdav/xroot ticket. Leo went and rebuilt things from scratch back in October but I don't think there was any joy. In progress (18/10)

132189 (29/11)
Sno+ cvmfs not working at Sussex. It looks like cvmfs has failed on a few nodes, Leo is investigating. In progress (4/12)

On another note, is Karin the latest Sno+ contact?

RALPP
131565 (2/11)
CMS stageout failures at RALPP. Chris tried a fix by upping the space available, no news from CMS. Assume it's fixed? Waiting for reply (2/11)

130264 (28/8)
Biomed ARC publishing ticket, with a patch to ARC needed. As noted in the linked Brunel ticket a fix is out and working at Brunel if you fancy giving it a whirl. On hold (10/11)

OXFORD
129931 (4/8)
Atlas http tests failing, proved tricky to fix so hoping a new install will fix - waiting for hardware to arrive at last check. On hold (13/11)

BIRMINGHAM
131670 (7/11)
ROD availability ticket. Mark asked about the nature of these tests for a VAC site, which I don't think any of us have a definitive answer for. Mark reports that he's hardwired some slots to run tests, so numbers are on the mend. On hold (1/12)

129930 (4/8)
The other atlas http test failing ticket. Mark is also going for the "reinstall it and hope that fixes it" solution. On hold (16/11)

SHEFFIELD
131835 (14/11)
Another ROD availability ticket. Numbers are on the mend. On hold (14/11)

MANCHESTER
131526 (1/11)
Storage accounting ticket. Really could do with an update. In progress (1/11)

132121 (28/11)
A ticket to the VOMS servers from pheno, after spotting some oddness using the voms interface. Robert is taking a look. In progress (1/12)

UCL
131807 (13/11)
Ben's apel certificate expired, and he's not sure what to do with the replacement. The ticket has called out to Andrew M for help. In progress (23/11)

QMUL
132232 (1/12)
Another availability ticket - came in on Friday and not likely spotted yet. The argo numbers have turned green again at least. Assigned (1/12)

130262 (28/8)
The Storm version of the Biomed publishing tickets. Quite rightly Dan has pointed out that this is an issue with Storm and suggested that Biomed ticket the devs. Waiting for reply (20/11)

BRUNEL
130263 (28/8)
The last of the Biomed publishing tickets, Raul has rolled out the patch that fixes the ARC publishing and asked if the ticket can be closed (I reckon it can). Waiting for reply (22/11)

TIER ONE
131840 (14/11)
solidexperiment.org having trouble with Castor. George asks if the new data mover certificate, which is installed, works? Also he points out that Janusz's certificate won't work as it's already associated with another VO (MICE) due to CASTOR reasons. Waiting for reply (28/11)

130207 (24/8)
MICE transfers timing out into CASTOR. Still waiting for "new" hardware to be made available to increase the MICE disk pool. On hold (13/11)

132222 (30/11)
CMS phedex transfers failing to the Tier 1 - possibly/likely due to CA certificate fun and games (and possibly one of the causes of the Tier 1 CA rollback announced today). In progress (4/12)

131815 (13/11)
T2K seeing very long download times from Castor tape. Brian has tried to help and Henry has added in his experiences with MICE work which was really helpful. Now just to see if this all worked for T2K. In progress (1/12)

127597 (7/4)
Checking networking and xroot performance for CMS. Summed up with a statement from Gareth in October concerning the site firewall. Are there solutions beyond just fixing the firewall? On hold (5/10)

124876 (7/11/16)
Ops tests not working for ECHO due to problems with the tests. Alastair chased the ticket to the test developers (125026), but there's no movement, not even a reply to Alastair's poke. On hold (13/11)

117683 (18/11/15)
Glue2 publishing for CASTOR. Background work simmers away at this slowly, but Rob in his last update cited the new storage accounting proposal, which would mean no need for glue2 if I'm reading it correctly. On hold (6/11)

Tools - MyEGI Nagios

Monday 20th November

Tuesday 18th July

  • Following our ops discussion last week, Steve will focus his tests on supporting the GridPP DIRAC area and decommission the other tests.


VOs - GridPP VOMS VO IDs Approved VO table

Monday 20th November

  • Tom Whyntie has requested (and been granted) access to the GridPP VO to get some pipelines working for large-scale processing and analysis of MRI scans associated with the UK Biobank project.
  • All VOs in the incubation page being prompted for updates by the end of November (required input for OC documents).
  • QMUL (Steve L) is following up on the biomed MoU. GridPP want to be cited in research papers for the support our resources/sites provide.


Site Updates

Date



Meeting Summaries
Project Management Board - Members Minutes Quarterly Reports

Empty

GridPP ops meeting - Agendas Actions Core Tasks

Empty


RAL Tier-1 Experiment Liaison Meeting (Wednesday 13:30) Agenda. Meeting takes place on Vidyo.

Highlights from this meeting are now included in the Tier1 report farther up this page.

WLCG Grid Deployment Board - Agendas MB agendas

Empty



NGI UK - Homepage CA

Empty

Events
UK ATLAS - Shifter view News & Links

Atlas S&C week 2-6 Feb 2015

Production

• Prodsys-2 in production since Dec 1st

• Deployment has not been transparent; many issues have been solved, and the grid is filled again

• MC15 is expected to start soon, waiting for physics validations; evgen testing is underway and close to finalised. Simulation is expected to be broadly similar to MC14, with no blockers expected.

Rucio

• Rucio in production since Dec 1st and is ready for LHC RUN-2. Some fields need improvements, including transfer and deletion agents, documentation and monitoring.

Rucio dumps available.

Dark data cleaning

files declaration. Only DDM ops can issue lost files declarations for now; cloud support needs to file a ticket.

• Webdav panda functional tests with Hammercloud are ongoing

Monitoring

Main page

DDM Accounting

space

Deletion

ASAP

• ASAP (ATLAS Site Availability Performance) in place. Every 3 months the T2 sites performing below 80% are reported to the International Computing Board.


UK CMS

Empty

UK LHCb

Empty

UK OTHER
  • N/A
To note

  • N/A