Operations Bulletin 170214

From GridPP Wiki

Latest revision as of 23:36, 16 February 2014

Bulletin archive


Week commencing 10th February 2014
Task Areas
General updates

Tuesday 11th February

  • There was a WLCG middleware readiness meeting last week. INFN will continue to maintain the EMI repo.
  • A WLCG ops coordination meeting (F2F) is taking place today at CERN.
  • The January NGI availability reports are now online.
  • The WLCG A/R reports are available...
  • For ALICE all fine.
  • For ATLAS (pages 8-9). Below 90% are: UCL; Durham; RALPP and Sussex.
  • For CMS (page 8). Below 90%: RALPP.
  • For LHCb (pages 6-7). Below 90% are: Sheffield; Durham and RALPP.
  • Tomorrow's GDB agenda is now final.
  • EGI FedCloud sites moving to production. Do we have any sites being validated?


Tuesday 4th February

  • The agenda for the February GDB is available.
  • The March pre-GDB will be on batch systems.
  • Discussion on RIPE ATLAS probes has continued off list. The PMB agree that there is an opportunity here and prefer to link this with outreach and dissemination. For those interested a discussion of what to propose will take place this Friday 7th February (email Jeremy).

Tuesday 28th January

  • There are suggestions for a WLCG pre-GDB on batch systems in March.
  • openssl status update
  • There was an IPv6 working group meeting at CERN last week (agenda).


WLCG Operations Coordination - Agendas

Tuesday 11th February

  • A (pre-GDB) F2F is taking place today. We will review next week. There will also be a summary at tomorrow's GDB.

Tuesday 4th February

  • There is a multi-core TF meeting this afternoon. The focus is on CMS and PIC.
  • The second middleware readiness meeting takes place this Thursday 6th February.
  • There was an ops coordination meeting last Thursday. The minutes are available. In summary:
  • BASELINES: WMS baseline downgraded to 3.6.1 for issues; APEL baselines added after meeting
  • OpenSSL: WMS needs new version of glite-px-proxyrenewal. ETA this week.
  • SAM: plan to split SAM services for WLCG (at CERN) and EGI (at consortium). Code will fork.
  • ALICE: gearing up for Quark Matter 2014 (May 19-24, GSI Darmstadt)
  • ATLAS: Rucio renaming campaign almost over. Rucio commissioning has started. DC14 simulation started on 1st of January.
  • CMS: DBS migration has been postponed. gLexec test (not yet critical) is a bit difficult for Tier-1s.
  • LHCb: Issues with ARC CEs.
  • FTS3: Experiments re-started increasing the load on the RAL FTS3 instance. Deployment discussion in February meeting.
  • gLexec: 22 tickets remain open. EMI gLExec probe (in use since SAM Update 22) crashes on sites that use the tarball WN.
  • IPv6: Report at next meeting.
  • MW readiness: Next meeting on Feb 6 at 15:30 CET: agenda in particular about how to involve experiments and sites. Need site input on table.
  • MULTICORE: October 2014 proposed by TF coordinators as a target date for a functional system to be deployed.
  • perfSONAR: New release this week. Lots of minor fixes and improvements. All sites should update to this.
  • SHA-2: EOS SRM for LHCb not yet OK. voms-proxy-init on lxplus crashes on creating SHA-2 RFC proxies.
  • TRACKING: No update
  • WMS decom: Deadline - end of April to decommission CMS and shared instances


Tier-1 - Status Page

Tuesday 11th February

  • Two 'At Risks' have been announced this week: one this morning for a small network intervention, and one tomorrow for a load test of the UPS/Generator.
  • Still testing CVMFS Client version 2.1.17 on one batch of worker nodes (around 10% of the farm). So far so good.
  • The software server used by the small VOs will be withdrawn from service (aiming for June).
  • Work is restarting to resolve the MyProxy issues raised in GGUS ticket 97025. There will be a new MyProxy server and it will be necessary to make appropriate reconfigurations to use this.
  • Work is progressing on Tier1 Network changes. We are looking at an intervention (probably requiring a day's downtime) to install the new Routing layer and change the way the Tier1 connects to the RAL network on the 25th February. Then, on Tuesday 11th March it is planned to move the Tier1 to use the new site firewall.
Storage & Data Management - Agendas/Minutes

Tuesday 28th January

  • Trying to liaise with DiRAC colleagues to set up a technical meeting.

Monday 9th December

  • Spacetokens for non-LHC VOs - recommendations.



Accounting - UK Grid Metrics HEPSPEC06 Atlas Dashboard HS06

Tuesday 11th February

  • Another review of the HEPSPEC page shows no SL6 (or equivalent) entry for: UCL; Lancaster; Liverpool; Durham; ECDF; Glasgow; Birmingham; RALPP and RAL Tier-1. So... basically no change since last week. Tickets needed?
  • The accounting pages show the following sites as not up-to-date with publishing accounting data: Glasgow (minor); RALPP and Sussex.
  • Under publishing... sites are encouraged to run the Glue2 validator. There were problems observed for RAL-LCG2; UKI-LT2-IC-HEP; UKI-NORTHGRID-MAN-HEP; UKI-NORTHGRID-SHEF-HEP; UKI-SCOTGRID-GLASGOW and UKI-SOUTHGRID-RALPP.

Tuesday 4th February

  • A review of the HEPSPEC page shows no SL6 (or equivalent) entry for: UCL; Lancaster; Liverpool; Durham; ECDF; Glasgow; Birmingham; RALPP and RAL Tier-1.
  • The accounting pages show the following sites as not up-to-date with publishing accounting data: Lancaster (minor); RALPP and Sussex.


Documentation - KeyDocs

See the worst KeyDocs list for documents needing review now and the names of the responsible people.

Tuesday 11th February

  • Documents still need attention....

Tuesday 28th January

  • Document assignments were reviewed at a core ops tasks meeting last week... so the dashboard for keydocs should improve soon!
  • It has been noted that blog activity has dropped to a very low level. The blogs have been very useful to disseminate details of what is happening at sites, to share technical solutions and to make GridPP's work more visible. It also helps avoid duplication... so please give some thought to reviving your blogs!
Interoperation - EGI ops agendas

Tuesday 11th February

  • Product Team updates
    • DPM/LFC inclusion of STAR accounting to be used off the shelf
    • FTS3 release in EPEL
    • WMS update by the end of February
  • UMD 3.4 released
    • CREAM TORQUE v. 2.1.2: This release fixes the wrong total cpu count from PBS infoprovider together with a dependency issue with lcg-info-dynamic-scheduler
    • ARC-CE v. 4.0.0: This is a major release of ARC-CE and includes updates for both server and client tools.
  • UMD 2
    • UMD-2 is considered under security support, but Gridsite, glite-px and globus-proxy-utils should be updated to make them compatible with the latest openssl updates.
  • EMI2 decommissioning discussed - they asked about WN tarballs, so I checked in with Matt (he said end of the month for more, if I've read his email properly)
  • in SR
    • mpi v. 1.5.3
    • lb v. 4.0.12
    • apel-parser v. 2.2.1 and apel-ssm v. 2.1.1
    • Globus 5.2.5:
    • gridftp v. 5.2.5
    • gram5 v. 5.2.5
    • canl v. 2.2.1
    • glite-proxyrenewal v. 2.1.3
    • gridsite v. 2.2.1
  • Still open WMS issues
  • EMI-2 decommissioning deadlines: 30/04/14 end of support, 31/05/14 deadline for upgrades
  • Affected:
    • ARC v2.*
    • ARGUS v1.5.*
    • BDII Site older than v1.2.0
    • BDII Top older than v1.1.0
    • CREAM v1.14.*
    • dCache v2.2.*
    • DPM older than v1.8.6
    • EMI-UI v2.*
    • EMI-WN v2.*
    • FTS v.2.2.8
    • StoRM older than v.1.11.0
    • VOMS v.2.*
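
For sites checking their own services against the "Affected" list above, the rules can be written down as a small lookup. The following is a minimal sketch only: the function names, the rule encoding and the assumption that versions are plain dotted numbers (no package release suffixes) are illustrative and not part of any EGI or UMD tool. The FTS entry is treated as a prefix match on 2.2.8, which is also an assumption.

```python
# Hypothetical helper; names and rule encoding are illustrative only.

def parse_version(text):
    """Turn '1.8.6' into (1, 8, 6) so that versions compare numerically."""
    return tuple(int(part) for part in text.split("."))

# Rules transcribed from the "Affected" list in this bulletin.
#   ("older_than", X): affected if the installed version is below X
#   ("series", X):     affected if the installed version starts with X
AFFECTED = {
    "ARC": ("series", "2"),
    "ARGUS": ("series", "1.5"),
    "BDII Site": ("older_than", "1.2.0"),
    "BDII Top": ("older_than", "1.1.0"),
    "CREAM": ("series", "1.14"),
    "dCache": ("series", "2.2"),
    "DPM": ("older_than", "1.8.6"),
    "EMI-UI": ("series", "2"),
    "EMI-WN": ("series", "2"),
    "FTS": ("series", "2.2.8"),
    "StoRM": ("older_than", "1.11.0"),
    "VOMS": ("series", "2"),
}

def is_affected(product, installed):
    """Return True if 'installed' falls under the EMI-2 decommissioning rule."""
    kind, reference = AFFECTED[product]
    have = parse_version(installed)
    want = parse_version(reference)
    if kind == "older_than":
        return have < want
    # "series" rule: compare only as many components as the rule specifies.
    return have[:len(want)] == want
```

Comparing tuples of integers rather than strings matters for cases such as StoRM 1.10.0, which is older than 1.11.0 numerically but would sort the wrong way against single-digit components if compared as text.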

gLite support calendar.

  • Glue2 validation flagged - request out to follow up with local sites.


Monitoring - Links MyWLCG

Tuesday 21 January

  • Update from meeting on Friday (the 17th); the main item under discussion was the nagios probes (in particular, the Condorg and CREAM-CE).
On-duty - Dashboard ROD rota

Tuesday 11th February

  • No major issues.
  • There is some progress on the longstanding issue of the fake APEL alerts for Brunel. Sussex and RALPP seem to have APEL issues as well, though these show up as N/A on the dashboard rather than as a full-blown alert.


Monday 3rd February

  • Good week. APEL ticket about Brunel alarms still open (although it was passing this afternoon).

Rollout Status WLCG Baseline

Tuesday 11th February

  • 31st May has been set as the deadline for EMI-2 decommissioning. There may be an issue for dCache (related to 3rd party/enstore component).

References


Security - Incident Procedure Policies Rota

Tuesday 11th February

  • Central user suspension in place by end May 2014

Tuesday 14th January

  • nmap test results show 4 UK sites yet to take action on perfSONAR
  • openssl status


Services - PerfSonar dashboard | GridPP VOMS

Tuesday 4th February

Tuesday 7th January

  • A perfSONAR dashboard has been established in London based on maDDash.

Tuesday 26th November

  • The main perfSONAR issues this week affect Manchester and Sussex.

Tuesday 19th November

  • There is a new dashboard. Feedback is welcome.
  • Manchester, Durham, Glasgow and Sussex show problems across the board.
Tickets

Monday 10th February 2014, 15.00 GMT
32 tickets for the UK this week.

RALPP
https://ggus.eu/ws/ticket_info.php?ticket=100849 (29/1)
This perfsonar ticket is still just in the "assigned" state; don't make Duncan feel spurned, take a look at his ticket. Assigned (29/1)

TIER 1
https://ggus.eu/ws/ticket_info.php?ticket=99556 (6/12/13)
NGI argus setup. argusngi.gridpp.rl.ac.uk is set up and in the GOCDB, but what next with the ticket? In progress (30/1)

https://ggus.eu/ws/ticket_info.php?ticket=100114 (8/1)
A ticket from Chris W concerning job failures due to the 512-bit proxy problem. Catalin asked for the update to be tested, but is this testing covered in https://ggus.eu/ws/ticket_info.php?ticket=100343? Waiting for reply (6/2)

Talking of which, can:
https://ggus.eu/ws/ticket_info.php?ticket=100343
and
https://ggus.eu/ws/ticket_info.php?ticket=100887 (gridsite version on the webdav LFC)
be closed?

And that's it really. A scan through the solved ticket pile doesn't show anything exciting. But on the second Monday of a month I tend to overcompensate for going over all the tickets the week before, so let me know if I missed ought.
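
The ticket links in this section all follow the same URL pattern, so a ticket number can be turned into a link (and back) mechanically. This is a minimal sketch: the pattern is taken from the links above, while the function names are hypothetical and nothing here queries GGUS itself.

```python
from urllib.parse import urlencode, urlparse, parse_qs

# Base address as it appears in the ticket links in this bulletin.
GGUS_TICKET_PAGE = "https://ggus.eu/ws/ticket_info.php"

def ticket_url(ticket_id):
    """Build the ticket page link for a numeric GGUS ticket ID."""
    return GGUS_TICKET_PAGE + "?" + urlencode({"ticket": int(ticket_id)})

def ticket_id_from_url(url):
    """Extract the numeric ticket ID from a ticket page link.

    Raises KeyError if the link has no 'ticket' query parameter.
    """
    query = parse_qs(urlparse(url).query)
    return int(query["ticket"][0])
```

For example, `ticket_url(100849)` gives the RALPP perfsonar link quoted above.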

Tools - MyEGI Nagios

Tuesday 26th November

  • Regional Nagios updated to release 22. It is a glite to UMD update and it required a fresh installation.
  • There have been some internal changes in SAM-Nagios. Test probes are now the responsibility of product team. Some test names have been changed as a result of this reorganization. For example the org.sam.CREAMCE-DirectJobSubmit test has become emi.cream.CREAMCE-DirectJobSubmit. This does not affect the operational activities.
  • Please could all site admins look at the services associated with their site and mail Kashif if anything odd is noticed. Site admins can reschedule tests for their sites and it would be helpful if most functionalities are tested.
  • Also, look at myegi which can be useful with links to the Dashboard, GSTAT, Accounting Portal and GGUS.
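
Local scripts that filter Nagios results by test name may need to cope with the renaming mentioned above. A minimal sketch of one way to do that is shown here; the table holds only the single rename given in this bulletin, and the function name is hypothetical. Other renames would have to be added from the SAM documentation rather than guessed from the pattern.

```python
# Old-to-new test names. Only the example quoted in this bulletin is listed;
# any further entries are for the site to fill in from authoritative sources.
RENAMED_TESTS = {
    "org.sam.CREAMCE-DirectJobSubmit": "emi.cream.CREAMCE-DirectJobSubmit",
}

def current_test_name(name):
    """Return the post-update name of a test, or the name unchanged if
    no rename is recorded for it."""
    return RENAMED_TESTS.get(name, name)
```

Falling back to the original name keeps the lookup safe for tests that were not renamed.
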
VOs - GridPP VOMS VO IDs Approved VO table

Tuesday 11 February 2014

  • Proxy renewal
    • lcgwms06 at RAL has been upgraded and works
    • Both Imperial's WMSs work
    • Glasgow's will still need to be upgraded (unless they have been since Friday).

Tuesday 4 February 2014

  • Proxy renewal
    • Imperial have a workaround for proxy renewal
    • EMI released an update yesterday - should fix things, but needs to be deployed.

Tuesday 28 January 2014

  • hyperk having problems with proxy renewal.
    • This may be related to openssl

Tuesday 21 January 2014

  • Backup VOMS servers are now configured at sites. Some small problems remain. Sites and VOs are recommended to update their UIs to use all three servers.
  • Dirac setup (from Janusz):
    • Mice
    • NA62
    • Londongrid
    • SNO+
    • Landslides (resources need to be configured)


Site Updates

Actions


Meeting Summaries
Project Management Board - Members Minutes Quarterly Reports

Empty

GridPP ops meeting - Agendas Actions Core Tasks

Empty


RAL Tier-1 Experiment Liaison Meeting (Wednesday 13:30) Agenda Meeting takes place on Vidyo.

Wednesday 12th February

  • Operations report
  • Work is progressing on Tier1 Network changes. The plan is to install the new Routing layer for the Tier1 & change the way the Tier1 connects to the RAL network during an intervention on Tuesday 25th February. This is still to be confirmed but if it goes ahead we anticipate Tier1 services down most of that day. If this work does go ahead then on Tuesday 11th March it is planned to move the Tier1 to use the new site firewall, although this step is expected to have only a minor service interruption.
  • Reminder: The software server used by the small VOs will be withdrawn from service (aiming for June).
  • Testing continues with CVMFS client version 2.1.17 on one batch of worker nodes (approx 10% of the batch farm).
WLCG Grid Deployment Board - Agendas MB agendas

Empty



NGI UK - Homepage CA

Empty

Events

4th February 2014 With reference to the OMB on 30th January.

- UMD-2 will be decommissioned in the coming months. 30th April end of security support. 31st May all services to have been removed or upgraded.

- UK sites failing Glue2:

RAL-LCG2 UKI-LT2-IC-HEP UKI-NORTHGRID-MAN-HEP UKI-NORTHGRID-SHEF-HEP UKI-SCOTGRID-GLASGOW UKI-SOUTHGRID-RALPP

Check with the glue2 validator to see errors.

- A new Operations Dashboard will be in pre-production during February and moved into production in March.

- Availability/reliability targets for EGI are moving to 80%/85%.

- Some new ARC SAM tests are being introduced: org.nordugrid.ARC-CE-LFC-result; org.nordugrid.ARC-CE-LFC-submit; org.nordugrid.ARC-CE-SRM-result; org.nordugrid.ARC-CE-SRM-submit; org.nordugrid.ARC-CE-submit

- There is a summary page on how to publish from various middleware types: https://wiki.egi.eu/wiki/MAN09

- There was an overview of the French NGI adoption of iRODS.

- FedCloud to production: Management (GOCDB - ok); Monitoring (in progress); Accounting (in progress); Documentation (ok); Support (ok); Dashboard (in progress) and Security (in progress).

- The SAM tests are: org.nagios.CloudBDII-Check; eu.egi.cloud.OCCI-VM and org.nagios.OCCI-TCP. Security checks not yet available.

UK ATLAS - Shifter view News & Links

Empty

UK CMS

Empty

UK LHCb

Empty

UK OTHER
  • N/A
To note

  • N/A