Operations Bulletin 030314

From GridPP Wiki

Latest revision as of 00:46, 3 March 2014

Bulletin archive


Week commencing 24th February 2014
Task Areas
General updates

Monday 24th February

  • There is a test GridPP website for SHA-2.
  • The final WLCG Tier-2 availability/reliability reports for January 2014 are available.
  • Alessandra noted a FR cloud report on January's VO test results. The suggestion made was to do something similar for UK sites.
  • We need to revisit our plans for RIPE ATLAS probes.
  • Janet is moving away from SeeVogh/EVO. Support ends in August. Our meetings will migrate to Vidyo.

Tuesday 18th February

WLCG Operations Coordination - Agendas

Tuesday 11th February

  • A (pre-GDB) F2F is taking place today. We will review next week. There will also be a summary at tomorrow's GDB.

Tuesday 4th February

  • There is a multi-core TF meeting this afternoon. The focus is on CMS and PIC.
  • The second middleware readiness meeting takes place this Thursday 6th February.
  • There was an ops coordination meeting last Thursday. The minutes are available. In summary:
  • BASELINES: WMS baseline downgraded to 3.6.1 for issues; APEL baselines added after meeting
  • OpenSSL: WMS needs new version of glite-px-proxyrenewal. ETA this week.
  • SAM: plan to split SAM services for WLCG (at CERN) and EGI (at consortium). Code will fork.
  • ALICE: gearing up for Quark Matter 2014 (May 19-24, GSI Darmstadt)
  • ATLAS: Rucio renaming campaign almost over. Rucio commissioning has started. DC14 simulation started on 1st of January.
  • CMS: DBS migration has been postponed. gLexec test (not yet critical) is a bit difficult for Tier-1s.
  • LHCb: Issues with ARC CEs.
  • FTS3: Experiments re-started increasing the load on the RAL FTS3 instance. Deployment discussion in February meeting.
  • gLexec: 22 tickets remain open. EMI gLexec probe (in use since SAM Update 22) crashes on sites that use the tarball WN.
  • IPv6: Report at next meeting.
  • MW readiness: Next meeting on Feb 6 at 15:30 CET: agenda in particular about how to involve experiments and sites. Need site input on table.
  • MULTICORE: October 2014 proposed by TF coordinators as a target date for a functional system to be deployed.
  • perfSONAR: New release this week. Lots of minor fixes and improvements. All sites should update to this.
  • SHA-2: EOS SRM for LHCb not yet OK. voms-proxy-init on lxplus crashes on creating SHA-2 RFC proxies.
  • TRACKING: No update
  • WMS decom: Deadline - end of April to decommission CMS and shared instances


Tier-1 - Status Page

Tuesday 25th February

  • There were problems with the FTS3 service last Tuesday when difficulties were encountered moving the VMs around. Since then the service has run successfully and is being used for an extensive test by Atlas and CMS.
  • The software server used by the small VOs will be withdrawn from service (aiming for June).
  • A replacement MyProxy server is being put into production (to resolve the MyProxy issues raised in GGUS ticket 97025). VOs will need to reconfigure appropriately to use it.
  • The Tier1, along with some other non-GridPP services at RAL, will most likely move to the new site firewall on Monday 17th March, and there may be some disruption around this change. We do not yet have a date for the other significant network change: the installation of our new Routing layer and changes to the way the Tier1 connects to the RAL network.
Storage & Data Management - Agendas/Minutes

Tuesday 28th January

  • Trying to liaise with DiRAC colleagues to setup a technical meeting.

Monday 9th December

  • Spacetokens for non-LHC VOs - recommendations.



Accounting - UK Grid Metrics HEPSPEC06 Atlas Dashboard HS06

Tuesday 11th February

  • Another review of the HEPSPEC page shows no SL6 (or equivalent) entry for: UCL; Lancaster; Liverpool; Durham; ECDF; Glasgow; Birmingham; RALPP and RAL Tier-1. So... basically no change since last week. Tickets needed?
  • The accounting pages show the following sites as not up-to-date with publishing accounting data: Glasgow (minor); RALPP and Sussex.
  • Under publishing... sites are encouraged to run the Glue2 validator. There were problems observed for RAL-LCG2; UKI-LT2-IC-HEP; UKI-NORTHGRID-MAN-HEP; UKI-NORTHGRID-SHEF-HEP; UKI-SCOTGRID-GLASGOW and UKI-SOUTHGRID-RALPP.

Tuesday 4th February

  • A review of the HEPSPEC page shows no SL6 (or equivalent) entry for: UCL; Lancaster; Liverpool; Durham; ECDF; Glasgow; Birmingham; RALPP and RAL Tier-1.
  • The accounting pages show the following sites as not up-to-date with publishing accounting data: Lancaster (minor); RALPP and Sussex.


Documentation - KeyDocs

See the worst KeyDocs list for documents needing review now and the names of the responsible people.

Tuesday 25th February

  • Keydocs owners need to take some action!

Tuesday 11th February

  • Documents still need attention....

Tuesday 28th January

  • Document assignments were reviewed at a core ops tasks meeting last week... so the dashboard for keydocs should improve soon!
  • It has been noted that blog activity has dropped to a very low level. The blogs have been very useful to disseminate details of what is happening at sites, to share technical solutions and to make GridPP's work more visible. It also helps avoid duplication... so please give some thought to reviving your blogs!
Interoperation - EGI ops agendas

Monday 24th February

    • URT News: ARC, WMS, SAM probes
    • UMD 3.5 released last week. Storm 1.11.3, other updates for openssl
    • SR: IGE.globus-rls v. 5.2.5 no EA
    • GLUE2 Validation: Possible timeline: broadcast to ROD and sites on March 3rd; probe set to OPERATIONAL on March 10th; sites then have a further two weeks to fix the Site-BDII before receiving alarms.


Monitoring - Links MyWLCG

Monday 24th February

  • Next meeting this Friday, agenda looking at HammerCloud Functional tests

Tuesday 21 January

  • Update from meeting on Friday (the 17th); the main item under discussion was the nagios probes (in particular, the Condorg and CREAM-CE).
On-duty - Dashboard ROD rota

Tuesday 25th February

  • No issues to discuss.
  • The rota needs updating this week.

Tuesday 18th February

  • Sussex has an APEL publishing ticket.
  • Durham has a ticket against glexec but the tests are not running as the cluster is full. Is this "unsolvable"?
  • ECDF have several tickets needing attention - we know staff return next week.
Rollout Status WLCG Baseline

Tuesday 11th February

  • 31st May has been set as the deadline for EMI-2 decommissioning. There may be an issue for dCache (related to 3rd party/enstore component).

References


Security - Incident Procedure Policies Rota

Monday 24th February

  • Let Orlin know if you wish to try connections with the NGI ARGUS server. Tested working for EM including national banning. There is some setup documentation.

Tuesday 18th February

  • Progress on NGI ARGUS and testing. More tinkering needed. No wider deployment just yet.



Services - PerfSonar dashboard | GridPP VOMS

Monday 17th February

  • Note that WLCG see perfSONAR as a production service (see page 5 in Ian Bird's talk). The UK dashboard shows work still to be done at: ECDF, RHUL, Sheffield, Brunel and RALPPD.

Tuesday 4th February

Tickets

Monday 24th February 2014, 15.00 GMT

36 Open UK tickets this week, but the majority are progressing nicely (only a third of them haven't had an update in the last week, and of these all of them are "On Hold").

NGI
https://ggus.eu/ws/ticket_info.php?ticket=101502 (24/2)
ILC have ticketed the UK to inform us of their move to using cvmfs for their software area. They've included extensive instructions (and updated their VO card). The best forum to ask questions of the VO seems to be this ticket. In progress (24/2)

TIER 1
https://ggus.eu/ws/ticket_info.php?ticket=99556 (6/12/13)
NGI Argus ticket. As seen on TB-Support, good progress here but the ticket could do with some love. In progress (13/2)

https://ggus.eu/ws/ticket_info.php?ticket=101015 (5/2)
This CMS phedex problem looks like it can be bounced to Minnesota. I advise being proactive with the bouncing - either reassign it yourselves or solve it with a big "not a problem in our power to fix". In progress (24/2)

RALPP
https://ggus.eu/ws/ticket_info.php?ticket=101398 (19/2)
LHCb want holes poked in the RAL firewall to allow direct xrootd access to the RALPP SE - more a heads up for everyone than a ticket nag. In progress (19/2)

EDINBURGH
https://ggus.eu/ws/ticket_info.php?ticket=100840 (29/1)
Daniela has given some tips on how to tackle this APEL nagios ticket. In progress (20/2)

PERFSONAR TICKETS:
A quick round up of these as there are a lot of them.

Lancaster: https://ggus.eu/ws/ticket_info.php?ticket=100566
RHUL: https://ggus.eu/ws/ticket_info.php?ticket=101135
ECDF: https://ggus.eu/ws/ticket_info.php?ticket=100569
RALPP: https://ggus.eu/ws/ticket_info.php?ticket=101136
Brunel: https://ggus.eu/ws/ticket_info.php?ticket=100568
UCL: https://ggus.eu/ws/ticket_info.php?ticket=101285
Sussex: https://ggus.eu/ws/ticket_info.php?ticket=101517
Durham: https://ggus.eu/ws/ticket_info.php?ticket=100968
Bristol: https://ggus.eu/ws/ticket_info.php?ticket=101516

There's a lot of them, but none are looking very neglected (yet). The one with the biggest risk of neglect is actually the Lancaster ticket! Others are soldiering on or have firm reminder dates set for their upgrade.
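
For anyone keeping a local note of these, here is a minimal sketch of pulling ticket numbers out of pasted GGUS links. It assumes only the URL pattern used in this bulletin; the function name ticket_ids and the sample string are illustrative and not part of any GGUS tooling.

```python
import re

# URL pattern as it appears in this bulletin; anything else is ignored.
TICKET_URL = re.compile(r"https://ggus\.eu/ws/ticket_info\.php\?ticket=(\d+)")


def ticket_ids(text):
    """Return ticket numbers found in text, in order, without duplicates."""
    seen = []
    for match in TICKET_URL.finditer(text):
        number = int(match.group(1))
        if number not in seen:
            seen.append(number)
    return seen


# Two of the links listed above, pasted as a single string.
sample = (
    "Lancaster: https://ggus.eu/ws/ticket_info.php?ticket=100566 "
    "RHUL: https://ggus.eu/ws/ticket_info.php?ticket=101135"
)
print(ticket_ids(sample))
```

This only reads text that has already been copied; it does not query GGUS.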

Tickets from the UK:
I had my dreams of easily searching for tickets submitted by UKers smashed: https://ggus.eu/ws/ticket_info.php?ticket=101362
So it looks like it's back to my old method of searching for "Walker", "Bauer" or "Jones" :-D

Tools - MyEGI Nagios

Tuesday 26th November

  • Regional Nagios updated to release 22. It is a glite to UMD update and it required a fresh installation.
  • There have been some internal changes in SAM-Nagios. Test probes are now the responsibility of product team. Some test names have been changed as a result of this reorganization. For example the org.sam.CREAMCE-DirectJobSubmit test has become emi.cream.CREAMCE-DirectJobSubmit. This does not affect the operational activities.
  • Please could all site admins look at services associated to their site and please mail Kashif if anything odd is noticed. Site admins can reschedule tests for their sites and it would be helpful if most functionalities are tested.
  • Also, look at myegi which can be useful with links to the Dashboard, GSTAT, Accounting Portal and GGUS.
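
If local scripts or notes still refer to the old probe names, a small lookup can translate them. This is a rough sketch that contains only the one rename quoted above; the table RENAMED_TESTS and the helper current_test_name are hypothetical, and any further renames would have to be taken from the SAM documentation rather than guessed.

```python
# Old name -> new name. Only the example given in this bulletin is listed.
RENAMED_TESTS = {
    "org.sam.CREAMCE-DirectJobSubmit": "emi.cream.CREAMCE-DirectJobSubmit",
}


def current_test_name(name):
    """Return the new name for a renamed test, or the name unchanged."""
    return RENAMED_TESTS.get(name, name)


print(current_test_name("org.sam.CREAMCE-DirectJobSubmit"))
```

Names that are not in the table pass through untouched, so the helper is safe to apply to a mixed list.
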
VOs - GridPP VOMS VO IDs Approved VO table

Monday 17 February 2014

  • Proxy renewal
    • All RAL WMSs now renew proxies with 1024 bits. This looks like the end of this (at last).


Tuesday 11 February 2014

  • Proxy renewal
    • lcgwms06 at RAL has been upgraded and works
    • Both Imperial's WMSs work
    • Glasgow's will still need to be upgraded (unless they have been since Friday).

Tuesday 4 February 2014

  • Proxy renewal
    • Imperial have a workaround for proxy renewal
    • EMI released an update yesterday - should fix things, but needs to be deployed.

Tuesday 28 January 2014

  • hyperk having problems with proxy renewal.
    • This may be related to openssl

Tuesday 21 January 2014

  • Backup VOMS servers now configured at sites. Some small problems remain. Sites and VOs recommended to update their UIs to use all three servers.
  • Dirac setup (from Janusz):
    • Mice
    • NA62
    • Londongrid
    • SNO+
    • Landslides (resources need to be configured)


Site Updates

Actions


Meeting Summaries
Project Management Board - Members Minutes Quarterly Reports

Empty

GridPP ops meeting - Agendas Actions Core Tasks

Empty


RAL Tier-1 Experiment Liaison Meeting (Wednesday 13:30) Agenda Meeting takes place on Vidyo.

Wednesday 26th February

  • Operations report
  • Network Changes: Move of Tier1 to use new site firewall is most likely on Monday 17th March (TBC). There will be some interruption to services. Preparations continue for the installation of the new Routing layer for the Tier1 & change to the way the Tier1 connects to the RAL network. This change is not yet scheduled, but is now likely to happen around April.
  • Extensive tests of FTS3 are ongoing with Atlas & CMS.
  • A new MyProxy server is in production (myproxy.gridpp.rl.ac.uk).
  • There was a special presentation on Castor 2.1.14. It is hoped to roll this out around April.
  • Reminder: The software server used by the small VOs will be withdrawn from service (aiming for June).
WLCG Grid Deployment Board - Agendas MB agendas

Empty



NGI UK - Homepage CA

Empty

Events

4th February 2014 With reference to the OMB on 30th January.

- UMD-2 will be decommissioned in the coming months. 30th April end of security support. 31st May all services to have been removed or upgraded.

- UK sites failing Glue2:

RAL-LCG2 UKI-LT2-IC-HEP UKI-NORTHGRID-MAN-HEP UKI-NORTHGRID-SHEF-HEP UKI-SCOTGRID-GLASGOW UKI-SOUTHGRID-RALPP

Check with the glue2 validator to see errors.

- A new Operations Dashboard will be in pre-production during February and moved into production in March.

- Availability/reliability targets for EGI are moving to 80%/85%.

- Some new ARC SAM tests are being introduced: org.nordugrid.ARC-CE-LFC-result; org.nordugrid.ARC-CE-LFC-submit; org.nordugrid.ARC-CE-SRM-result; org.nordugrid.ARC-CE-SRM-submit; org.nordugrid.ARC-CE-submit

- There is a summary page on how to publish from various middleware types: https://wiki.egi.eu/wiki/MAN09

- There was an overview of the French NGI adoption of iRODS.

- FedCloud to production: Management (GOCDB - ok); Monitoring (in progress); Accounting (in progress); Documentation (ok); Support (ok); Dashboard (in progress) and Security (in progress).

- The SAM tests are: org.nagios.CloudBDII-Check; eu.egi.cloud.OCCI-VM and org.nagios.OCCI-TCP. Security checks not yet available.

UK ATLAS - Shifter view News & Links

Empty

UK CMS

Empty

UK LHCb

Empty

UK OTHER
  • N/A
To note

  • N/A