Operations Bulletin 240214

From GridPP Wiki
Revision as of 10:13, 24 February 2014 by Jeremy coles (Talk | contribs)


Bulletin archive


Week commencing 17th February 2014
Task Areas
General updates

Tuesday 18th February


Tuesday 11th February

  • There was a WLCG middleware readiness meeting last week. INFN will continue to maintain the EMI repo.
  • A WLCG ops coordination meeting (F2F) is taking place today at CERN.
  • The January NGI availability reports are now online.
  • The WLCG A/R reports are available...
  • For ALICE all fine.
  • For ATLAS (page 8-9). Below 90% are: UCL; Durham; RALPP and Sussex.
  • For CMS (page 8). Below 90%: RALPP.
  • For LHCb (pages 6-7). Below 90% are: Sheffield; Durham and RALPP.
  • Tomorrow's GDB agenda is now final.
  • EGI FedCloud sites moving to production. Do we have any sites being validated?
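The per-VO lists above can be cross-referenced to see which sites fall below 90% for more than one VO. The following is a minimal Python sketch: the site lists are copied from the bullets above, while the variable and function names are illustrative only and not part of any GridPP tooling.

```python
from collections import Counter

# Sites below 90% in the January WLCG A/R reports, as listed in the bullets above.
below_90 = {
    "ATLAS": ["UCL", "Durham", "RALPP", "Sussex"],
    "CMS": ["RALPP"],
    "LHCb": ["Sheffield", "Durham", "RALPP"],
}


def sites_flagged_by_multiple_vos(reports):
    """Return {site: sorted list of VOs} for sites flagged by more than one VO."""
    counts = Counter(site for sites in reports.values() for site in sites)
    return {
        site: sorted(vo for vo, sites in reports.items() if site in sites)
        for site, n in counts.items()
        if n > 1
    }


repeat_sites = sites_flagged_by_multiple_vos(below_90)
for site, vos in sorted(repeat_sites.items()):
    print(f"{site}: {', '.join(vos)}")
```

With the lists as given, this picks out Durham (ATLAS and LHCb) and RALPP (all three VOs) as the sites to follow up first.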


WLCG Operations Coordination - Agendas

Tuesday 11th February

  • A (pre-GDB) F2F is taking place today. We will review next week. There will also be a summary at tomorrow's GDB.

Tuesday 4th February

  • There is a multi-core TF meeting this afternoon. The focus is on CMS and PIC.
  • The second middleware readiness meeting takes place this Thursday 6th February.
  • There was an ops coordination meeting last Thursday. The minutes are available. In summary:
  • BASELINES: WMS baseline downgraded to 3.6.1 for issues; APEL baselines added after meeting
  • OpenSSL: WMS needs new version of glite-px-proxyrenewal. ETA this week.
  • SAM: plan to split SAM services for WLCG (at CERN) and EGI (at consortium). Code will fork.
  • ALICE: gearing up for Quark Matter 2014 (May 19-24, GSI Darmstadt)
  • ATLAS: Rucio renaming campaign almost over. Rucio commissioning has started. DC14 simulation started on 1st of January.
  • CMS: DBS migration has been postponed. gLexec test (not yet critical) is a bit difficult for Tier-1s.
  • LHCb: Issues with ARC CEs.
  • FTS3: Experiments re-started increasing the load on the RAL FTS3 instance. Deployment discussion in February meeting.
  • gLexec: 22 tickets remain open. The EMI gLexec probe (in use since SAM Update 22) crashes on sites that use the tarball WN.
  • IPv6: Report at next meeting.
  • MW readiness: Next meeting on Feb 6 at 15:30 CET: agenda in particular about how to involve experiments and sites. Need site input on table.
  • MULTICORE: October 2014 proposed by TF coordinators as a target date for a functional system to be deployed.
  • perfSONAR: New release this week. Lots of minor fixes and improvements. All sites should update to this.
  • SHA-2: EOS SRM for LHCb not yet OK. voms-proxy-init on lxplus crashes on creating SHA-2 RFC proxies.
  • TRACKING: No update
  • WMS decom: Deadline - end of April to decommission CMS and shared instances


Tier-1 - Status Page

Tuesday 11th February

  • Following successful testing CVMFS Client version 2.1.17 was rolled out to the rest of the batch farm at the start of this week.
  • There is an interruption to the FTS3 service today as the VMs are moved to a different infrastructure.
  • The software server used by the small VOs will be withdrawn from service (aiming for June).
  • Work is restarting to resolve the MyProxy issues raised in GGUS ticket 97025. There will be a new MyProxy server and clients will need to be reconfigured to use it.
  • Work is progressing on Tier1 Network changes. However, the plans to install the new Routing layer and change the way the Tier1 connects to the RAL network will not now happen on 25th February (as stated in last week's report). This change is delayed. It is most likely the Tier1, along with some other non-GridPP services at RAL, will move to use the new site firewall on Monday 17th March and there may be some disruption around this change.
Storage & Data Management - Agendas/Minutes

Tuesday 28th January

  • Trying to liaise with DiRAC colleagues to setup a technical meeting.

Monday 9th December

  • Spacetokens for non-LHC VOs - recommendations.



Accounting - UK Grid Metrics HEPSPEC06 Atlas Dashboard HS06

Tuesday 11th February

  • Another review of the HEPSPEC page shows no SL6 (or equivalent) entry for: UCL; Lancaster; Liverpool; Durham; ECDF; Glasgow; Birmingham; RALPP and RAL Tier-1. So... basically no change since last week. Tickets needed?
  • The accounting pages show the following sites as not up-to-date with publishing accounting data: Glasgow (minor); RALPP and Sussex.
  • Under publishing... sites are encouraged to run the Glue2 validator. There were problems observed for RAL-LCG2; UKI-LT2-IC-HEP; UKI-NORTHGRID-MAN-HEP; UKI-NORTHGRID-SHEF-HEP; UKI-SCOTGRID-GLASGOW and UKI-SOUTHGRID-RALPP.

Tuesday 4th February

  • A review of the HEPSPEC page shows no SL6 (or equivalent) entry for: UCL; Lancaster; Liverpool; Durham; ECDF; Glasgow; Birmingham; RALPP and RAL Tier-1.
  • The accounting pages show the following sites as not up-to-date with publishing accounting data: Lancaster (minor); RALPP and Sussex.


Documentation - KeyDocs

See the worst KeyDocs list for documents needing review now and the names of the responsible people.

Tuesday 11th February

  • Documents still need attention....

Tuesday 28th January

  • Document assignments were reviewed at a core ops tasks meeting last week... so the dashboard for keydocs should improve soon!
  • It has been noted that blog activity has dropped to a very low level. The blogs have been very useful to disseminate details of what is happening at sites, to share technical solutions and to make GridPP's work more visible. It also helps avoid duplication... so please give some thought to reviving your blogs!
Interoperation - EGI ops agendas

Tuesday 11th February

  • Product Team updates [see agenda]
  • EMI-2 decommissioning discussed - they asked about WN tarballs, so I checked in with Matt (he said end of the month for more, if I've read his email properly)
  • in SR [see agenda]
  • Still open WMS issues
  • EMI-2 decommissioning deadlines: 30/04/14 end of support, 31/05/14 deadline for upgrades
  • Affected [see agenda]

gLite support calendar.

  • Glue2 validation flagged - request out to follow up with local sites.


Monitoring - Links MyWLCG

Tuesday 21 January

  • Update from meeting on Friday (the 17th); the main item under discussion was the nagios probes (in particular, the Condorg and CREAM-CE).
On-duty - Dashboard ROD rota

Tuesday 18th February

  • Sussex has an APEL publishing ticket.
  • Durham has a ticket against gLexec but the tests are not running as the cluster is full. Is this "unsolvable"?
  • ECDF have several tickets needing attention - we know staff return next week.
Rollout Status WLCG Baseline

Tuesday 11th February

  • 31st May has been set as the deadline for EMI-2 decommissioning. There may be an issue for dCache (related to 3rd party/enstore component).

References


Security - Incident Procedure Policies Rota

Tuesday 18th February

  • Progress on NGI ARGUS and testing. More tinkering needed. No wider deployment just yet.

Tuesday 11th February

  • Central user suspension in place by end May 2014

Tuesday 14th January

  • nmap test results show 4 UK sites yet to take action on perfSONAR
  • openssl status


Services - PerfSonar dashboard | GridPP VOMS

Monday 17th February

  • Note that WLCG see perfSONAR as a production service (see page 5 in Ian Bird's talk). The UK dashboard shows work still to be done at: ECDF, RHUL, Sheffield, Brunel and RALPPD.

Tuesday 4th February

Tickets

Monday 17th February 14.30 GMT
35 Open UK tickets this week - the number is creeping up, I think largely due to the build up of perfsonar tickets. I plan to look at these in detail next week (or maybe bring them up in the Storage meeting if that's a more appropriate forum?).

TIER 1
https://ggus.eu/ws/ticket_info.php?ticket=99556 (6/12/2013)
The NGI Argus ticket. Ewan has helped out with some successful testing, there's a general call for others to get involved if they fancy it. In progress (13/2)

https://ggus.eu/ws/ticket_info.php?ticket=100114 (8/1)
Jobs failing on the RAL WMS, due to the gridsite/openssl/proxy size debacle. Chris successfully tested lcgwms06 after it was updated. Now lcgwms04 and 05 have been updated and Chris has once again been asked to work his testing magic (my apologies if this is already on your to do list Chris). Waiting for reply (11/2)

https://ggus.eu/ws/ticket_info.php?ticket=101052 (6/2)
Biomed having trouble with one of the RAL CEs. What really caught my eye here was that Biomed are using JSaga for their job submission - do we have any other user groups using this? (This also leads me to once again question what I find interesting!). No problems with the ticket itself. In progress (14/2)

https://ggus.eu/ws/ticket_info.php?ticket=101015 (5/2)
This CMS transfer problem (between Minnesota and RAL) ticket is looking a bit ropey. Last word on Friday was that the transfers were still failing. Of course, there are two sides to every transfer failure. In progress (14/2)

https://ggus.eu/ws/ticket_info.php?ticket=101079 (9/2)
I don't mean to pick on the Tier 1, but you keep getting thrown the interesting problems. Another "Idiosyncrasies of the ARC CE" ticket, here we see its oddness with publishing different default SEs for different VOs. Again, naught actually wrong with the ticket. In progress (17/2)

RHUL
https://ggus.eu/ws/ticket_info.php?ticket=101135 (11/2)
I lied earlier, and I am bringing up one of the perfsonar tickets. Any luck with getting your perfsonar updated Govind? In progress (11/2)

GLASGOW
https://ggus.eu/ws/ticket_info.php?ticket=98253 (21/10/2013)
The getting CMS to work at Glasgow epic (or would you prefer saga?). CMS have pointed out that the original problem is solved, so from their point of view the ticket can be closed when the Glasgow guys feel satisfied. The ticket is in "waiting for reply", but I'm not sure that anyone who you'd like to have input from is paying attention (the second reminder went out today). Waiting for reply (17/2)

DURHAM
https://ggus.eu/ws/ticket_info.php?ticket=101177 (12/2)
Durham's SE is publishing biomed support when Durham no longer support them. Here's wishing you good luck with purging biomed from your system! In progress (17/2)

"Submitted from the UK"
I've been very lax about tracking tickets submitted by us NGI_UKers (partly as I never found a good way of doing it), but Steve's submission of the dteam voms server problem ticket (101177) whilst I was writing this up has prompted me to retackle that one. Watch this space!

Tools - MyEGI Nagios

Tuesday 26th November

  • Regional Nagios updated to release 22. It is a gLite to UMD update and it required a fresh installation.
  • There have been some internal changes in SAM-Nagios. Test probes are now the responsibility of the product teams. Some test names have been changed as a result of this reorganization. For example, the org.sam.CREAMCE-DirectJobSubmit test has become emi.cream.CREAMCE-DirectJobSubmit. This does not affect operational activities.
  • Could all site admins please look at the services associated with their site and mail Kashif if anything odd is noticed. Site admins can reschedule tests for their sites and it would be helpful if most functionalities are tested.
  • Also, look at myegi which can be useful with links to the Dashboard, GSTAT, Accounting Portal and GGUS.
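Scripts or alert filters that still match on the old probe names will need a translation step after the release 22 update. A minimal Python sketch of such a lookup follows. It contains only the one rename confirmed above; the dictionary and function names are hypothetical, and no general prefix rule is assumed for other tests.

```python
# The only rename confirmed in this bulletin; add further pairs once verified.
RENAMED_TESTS = {
    "org.sam.CREAMCE-DirectJobSubmit": "emi.cream.CREAMCE-DirectJobSubmit",
}


def current_test_name(name):
    """Return the new name for a renamed probe, or the name unchanged if no rename is known."""
    return RENAMED_TESTS.get(name, name)


print(current_test_name("org.sam.CREAMCE-DirectJobSubmit"))
```

Names that are not in the table pass through untouched, so the helper is safe to apply to every test name in a report.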
VOs - GridPP VOMS VO IDs Approved VO table

Monday 17 February 2014

  • Proxy renewal
    • All RAL WMSs now renew proxies with 1024 bits. This looks like the end of this (at last).


Tuesday 11 February 2014

  • Proxy renewal
    • lcgwms06 at RAL has been upgraded and works
    • Both Imperial's WMSs work
    • Glasgow's will still need to be upgraded (unless they have been since Friday).

Tuesday 4 February 2014

  • Proxy renewal
    • Imperial have a workaround for proxy renewal
    • EMI released an update yesterday - should fix things, but needs to be deployed.

Tuesday 28 January 2014

  • hyperk having problems with proxy renewal.
    • This may be related to openssl

Tuesday 21 January 2014

  • Backup VOMS servers now configured at sites. Some small problems remain. Sites and VOs recommended to update their UIs to use all three servers.
  • Dirac setup (from Janusz):
    • Mice
    • NA62
    • Londongrid
    • SNO+
    • Landslides (resources need to be configured)


Site Updates

Actions


Meeting Summaries
Project Management Board - Members Minutes Quarterly Reports

Empty

GridPP ops meeting - Agendas Actions Core Tasks

Empty


RAL Tier-1 Experiment Liaison Meeting (Wednesday 13:30) Agenda Meeting takes place on Vidyo.

Wednesday 12th February

  • Operations report
  • Work is progressing on Tier1 Network changes. The plan is to install the new Routing layer for the Tier1 & change the way the Tier1 connects to the RAL network during an intervention on Tuesday 25th February. This is still to be confirmed but if it goes ahead we anticipate Tier1 services down most of that day. If this work does go ahead then on Tuesday 11th March it is planned to move the Tier1 to use the new site firewall, although this step is expected to have only a minor service interruption.
  • Reminder: The software server used by the small VOs will be withdrawn from service (aiming for June).
  • Testing continues with CVMFS client version 2.1.17 on one batch of worker nodes (approx 10% of the batch farm).
WLCG Grid Deployment Board - Agendas MB agendas

Empty



NGI UK - Homepage CA

Empty

Events

4th February 2014 With reference to the OMB on 30th January.

- UMD-2 will be decommissioned in the coming months. 30th April end of security support. 31st May all services to have been removed or upgraded.

- UK sites failing Glue2:

RAL-LCG2 UKI-LT2-IC-HEP UKI-NORTHGRID-MAN-HEP UKI-NORTHGRID-SHEF-HEP UKI-SCOTGRID-GLASGOW UKI-SOUTHGRID-RALPP

Check with the glue2 validator to see errors.

- A new Operations Dashboard will be in pre-production during February and moved into production in March.

- Availability/reliability targets for EGI are moving to 80%/85%.

- Some new ARC SAM tests are being introduced: org.nordugrid.ARC-CE-LFC-result; org.nordugrid.ARC-CE-LFC-submit; org.nordugrid.ARC-CE-SRM-result; org.nordugrid.ARC-CE-SRM-submit; org.nordugrid.ARC-CE-submit

- There is a summary page on how to publish from various middleware types: https://wiki.egi.eu/wiki/MAN09

- There was an overview of the French NGI adoption of iRODS.

- FedCloud to production: Management (GOCDB - ok); Monitoring (in progress); Accounting (in progress); Documentation (ok); Support (ok); Dashboard (in progress) and Security (in progress).

- The SAM tests are: org.nagios.CloudBDII-Check; eu.egi.cloud.OCCI-VM and org.nagios.OCCI-TCP. Security checks not yet available.
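For planning against the UMD-2 decommissioning dates listed above (30th April end of security support, 31st May removal or upgrade), the remaining time can be computed directly. A minimal Python sketch follows. It takes 4th February 2014, the date of this entry, as the reference day; the constant and function names are illustrative.

```python
from datetime import date

# Deadlines as stated in this bulletin.
END_OF_SECURITY_SUPPORT = date(2014, 4, 30)
REMOVE_OR_UPGRADE_BY = date(2014, 5, 31)


def days_remaining(deadline, today):
    """Number of whole days from `today` until `deadline` (negative once passed)."""
    return (deadline - today).days


report_date = date(2014, 2, 4)  # date of this bulletin entry
support_days = days_remaining(END_OF_SECURITY_SUPPORT, report_date)
upgrade_days = days_remaining(REMOVE_OR_UPGRADE_BY, report_date)
print(f"End of security support in {support_days} days")
print(f"Remove or upgrade services within {upgrade_days} days")
```

From 4th February this gives 85 days to the end of security support and 116 days to the removal deadline.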

UK ATLAS - Shifter view News & Links

Empty

UK CMS

Empty

UK LHCb

Empty

UK OTHER
  • N/A
To note

  • N/A