Operations Bulletin 250313

From GridPP Wiki

Bulletin archive


Week commencing 18th March 2013
Task Areas
General updates

Monday 18th March

  • The March GDB (agenda) minutes are available. See also the actions and the pre-GDB on Clouds agenda and summary pages.
  • The next WLCG operations coordination planning meeting takes place this Thursday 21st March. (agenda)
  • EMI-3 ARGUS has again shown an issue with email addresses in certificates. The UK CA can now issue certificates without these addresses, and it may be beneficial for sites to change their certificates sooner rather than later.
  • EGI have been collecting information about problems found by component in the EMI-1 to EMI-2 transition. For those who can access it please check this page. If you cannot access the page, please email Jeremy with any problems you encountered so that he can check they have been captured.
  • The final WLCG availability report for February is now online.

Monday 11th March

  • Minutes from the GridPP cloud meeting of 1st March.
  • The agenda for the LHCONE/LHCOPN meeting on 17th and 18th March is now available. Well the link is available!
  • The EGI applications database has been relaunched.
  • Reminder for EMI-2: ARGUS needs a BDII entry now, because it has an ldap server itself.
  • For those intending to submit an abstract to CHEP 2013 please note the deadline of 25th March.
Tier-1 - Status Page

Tuesday 19th March

  • EMI-1 WMS nodes (lcgwms01, lcgwms02, lcgwms03) will be retired by end March. Report if this is a concern.
  • The intervention last Tuesday to replace the core switch in the Tier1 overran significantly (services fully restored around 6 hours late at 9pm). The intervention has resolved problems of asymmetric external network performance and an overloaded link to one of the internal switch stacks. However, it was not possible to re-establish the load balancing on the 2*10Gbit link to the UKLight router (our data link). This is currently running on a single 10Gbit link.
  • The change to nice the batch jobs has not fixed the problem of job set-up failures for Atlas & LHCb. Investigations are ongoing.
Storage & Data Management - Agendas/Minutes

Wed 20 March 2013

  • Ruminated over the agenda items from last week's GDB
    • EMI roadmap (dCache, and other things)
    • FTS support for HTTP - we knew this but how do we make use of it now
    • Storage accounting records, needs updated APEL;
    • Work of storage group(s) on interfaces and protocols, and future furlongpebbles.
  • RAL D1T0 evaluation.
    • Seems to be settling on HDFS and CEPH which will be run anyway
    • what about Lustre?
    • Presentation to PMB next Monday, but no decision yet.



Accounting - UK Grid Metrics HEPSPEC06 Atlas Dashboard HS06

Tuesday 12th March

  • APEL publishing stopped for Lancaster, QMUL and ECDF

Tuesday 12th February

  • SL HS06 page shows some odd ratios. Steve says he now takes "HS06 cpu numbers direct from ATLAS" and his page does get stuck every now and then.
  • An update of the metrics page has been requested.
Documentation - KeyDocs

See the worst KeyDocs list for documents needing review now and the names of the responsible people.

Tuesday 26th February

KeyDocs monitoring status: Grid Storage(7/0) Documentation(3/0) On-duty coordination(3/0) Staged rollout(3/0) Ticket follow-up(3/0) Regional tools(3/0) Security(3/0) Monitoring(3/0) Accounting(3/0) Core Grid services(3/0) Wider VO issues(3/0) Grid interoperation(3/0) Cluster Management(1/0) (brackets show total/missing)

Monday, 4th March

  • New draft document for putting a CE in downtime. It discusses the pros and cons of 3 approaches. Needs discussion to finalise.
  • EPIC integrated into Approved VOs, pending full acceptance.
  • Process commissioned to delete stale documents.

Thursday, 29th November

The Approved VOs document has been updated to automatically contain a table that lays out the resource requirements for each VO, as well as the maximum. We need to discuss whether this is useful - it seems that the majority of WN software requirements are passed around by word of mouth etc. Should this be formalized? Please see

   https://www.gridpp.ac.uk/wiki/GridPP_approved_VOs#VO_Resource_Requirements

This table will be kept up to date with a regular process that syncs it with the CIC Portal, should it prove to be useful.

Interoperation - EGI ops agendas

Monday 19th March

  • The next EGI operations meeting (agenda) takes place this Wednesday 20th March.

Monday 4th March

  • An EGI operations meeting agenda for today's meeting is now available.
  • SR: Large number of updates in UMD2 and UMD1. In particular, in the UMD1 release, the DPM, LFC and L&B are security updates. The UMD2 WMS is _not_ backwards compatible without a workaround, as described in the release notes: https://wiki.egi.eu/wiki/UMD-2:UMD-2.4.0
  • EMI-3 release expected 7th March, UMD-3 prioritisation underway
  • Argus should be in the Site-BDII; it has had the information provider since the EMI-2 release, so it's probably a good plan to update EMI-1 Argus servers. (As should VOMS servers; they've had the information providers in all EMI releases.)


Monitoring - Links MyWLCG

Tuesday 5th February

  • Task will focus on probes and sharing of useful tools - suggestions and comment welcome

Monday 2nd July

  • DC has almost finished an initial ranking. This will be reviewed by AF/JC and discussed at 10th July ops meeting

Wednesday 6th June

  • Ranking continues. Plan to have a meeting in July to discuss good approaches to the plethora of monitoring available.
  • Glasgow dashboard now packaged and can be downloaded here.
On-duty - Dashboard ROD rota

Tuesday 5th March

  • Handling tickets related to EMI-1 probes - what to expect.
  • Recommendation with respect to upgrading CE (drain first)

Tuesday 12th February

  • Need all ROD members to complete availability survey for the rota.
Rollout Status WLCG Baseline

Monday 4th March

  • EMI early adopters list by component.
  • Do we have a Staged Rollout list for EMI3?

Tuesday 5th February

References


Security - Incident Procedure Policies Rota

Tuesday 5th March

  • Two openafs vulnerabilities announced (CVE-2013-1794 and CVE-2013-1795). Further details available at http://www.openafs.org/security. Updated RPMS for SL5/6 available.



Services - PerfSonar dashboard | GridPP VOMS

Monday 18th February

  • PerfSonar tests to BNL reveal poor rates for several sites since upgrade

Tuesday 5th February

  • NGS VOMS to be switched off this week
Tickets

Monday 18th March 15.00 GMT
Only 26 open tickets this week, although a lot of them are "interesting".

NGI/ECDF
https://ggus.eu/ws/ticket_info.php?ticket=92512 (14/3)
ECDF have been accused of being in an UNKNOWN state for 22% of February. Wahid, pretty sure that ECDF's state through the month was fairly known, has questioned these results. Waiting for reply (18/3)

NGI/UCL
https://ggus.eu/ws/ticket_info.php?ticket=92412 (11/3)
Jeremy has given a reply to the EGI COD giving the reasons why UCL shouldn't be suspended. Waiting for reply (18/3)

VOMS
https://ggus.eu/ws/ticket_info.php?ticket=92306 (7/3)
Still no reply from the earthsci guys over the state of their corresponding domain name. As we might be playing Chinese whispers, with Mark asking for the creation on the proto-VO's behalf, things are likely to go slowly. Waiting for reply (11/3)

EPIC GLASGOW
https://ggus.eu/ws/ticket_info.php?ticket=91687 (21/2)
The epic VO testing problems have officially bounced to Glasgow. I know the chaps are investigating the UI oddness but they haven't accepted the reassigned ticket yet (and of course once this is fixed there still might be a problem). Assigned (18/3)

EMI 1 UPGRADE
Just a reminder of who has a ticket:
BIRMINGHAM, GLASGOW, SHEFFIELD, BRUNEL, RAL TIER 1, RALPP, BRISTOL and RHUL. Things are chugging along, with sites making various levels of progress. The only worry is RHUL, still no reply on their ticket: https://ggus.eu/ws/ticket_info.php?ticket=92111.

ATLAS DATA MOVEMENT
RALPP: https://ggus.eu/ws/ticket_info.php?ticket=90244
GLASGOW: https://ggus.eu/ws/ticket_info.php?ticket=89804
The last word on both of these tickets was from atlas, but maybe the conversation has moved offline? Either way both are nearly done by the looks of it.

TIER 1
https://ggus.eu/ws/ticket_info.php?ticket=92266 (6/3)
Chris W ticketing the Tier 1 over the mismatch between the MyProxy server's hostname and certificate. There was an attempt to switch over to a matching certificate today, but that caused a failure in retrieving existing credentials. Plan B is to create a new MyProxy. Chris asks if the CA can't issue a multi-alias certificate? In progress (18/3)

https://ggus.eu/ws/ticket_info.php?ticket=91658 (20/2)
Webdaving the Tier 1 LFC. Catalin asked if EMI3 had the latest and greatest webdav support in it, Ricardo reports that sadly it does not. In progress (15/3)

https://ggus.eu/ws/ticket_info.php?ticket=91146 (4/2)
Atlas ticketing the Tier 1 over their network bandwidth. The picture is much improved and atlas are happy that this will continue being looked at so they are happy to close the ticket. In progress (14/3)
(does this mean that the perfsonar problems in https://ggus.eu/ws/ticket_info.php?ticket=86152 are likely to be also fixed?).

SHEFFIELD
https://ggus.eu/ws/ticket_info.php?ticket=92299 (7/3)
Stephen B has noticed that this biomed publishing problem seems to have evaporated. In progress (probably can be closed) (18/3)

QMUL
https://ggus.eu/ws/ticket_info.php?ticket=92444 (12/3)
LHCB problems at QM, some problems were found and fixed - waiting to see if the VO still sees problems (you guys forgot to "Waiting for reply"). Implied "Waiting for reply" (13/3)

https://ggus.eu/ws/ticket_info.php?ticket=92158 (5/3)
Hone have given their blessing to close this ticket concerning hone job problems. In progress (15/3)

Tickets spotted by Stephen B
https://ggus.eu/tech/ticket_show.php?ticket=92585
Steve's (the other Steve) ticket to the argus chaps concerning their EMI3 argus problems and the long standing (but soon to be extinct?) emailaddress-in-the-cert problems.

https://ggus.eu/ws/ticket_info.php?ticket=90328
Winnie fixed the Storm publishing problems after getting in touch with the Storm developers - as Stephen pointed out it looks to be a Storm bug that only the devs know about. There's an important lesson to be learned here, and from EFDA-JET's similar ticket from a few weeks back - if the solution is non-obvious then don't be hesitant to ask the relevant devs!

Tools - MyEGI Nagios

Tuesday 13th November

  • Noticed two issues during the Tier1 power cut. The SRM and direct CREAM submission tests use the top BDII defined in the Nagios configuration to query the resource. These tests started to fail because the RAL top BDII was not accessible. It doesn't use BDII_LIST, so I cannot define more than one BDII. I am looking into how to make it more robust.
  • The Nagios web interface was not accessible to a few users because GOCDB was down. It is a bug in SAM-nagios and I have opened a ticket.

Site availability has not been affected by this issue, because Nagios sends a warning alert when it cannot find a resource through the BDII.
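One way to picture the robustness improvement being considered for the BDII lookup is a simple fallback over several top BDII endpoints. The sketch below is only an illustration and is not how SAM-Nagios works: the function names, the comma-separated "host:port" format and the example.org hostnames are all assumptions made for the example.

```python
import socket
from typing import Callable, Iterable, List, Optional


def parse_bdii_list(value: str) -> List[str]:
    """Split a comma-separated 'host:port,host:port' string into endpoints.

    The format is an assumption for this sketch.
    """
    return [item.strip() for item in value.split(",") if item.strip()]


def tcp_probe(endpoint: str, timeout: float = 5.0) -> bool:
    """Return True if a TCP connection to 'host:port' can be opened."""
    host, _, port = endpoint.rpartition(":")
    try:
        with socket.create_connection((host, int(port)), timeout=timeout):
            return True
    except (OSError, ValueError):
        # Unreachable host, refused connection, or malformed endpoint.
        return False


def first_reachable(endpoints: Iterable[str],
                    probe: Callable[[str], bool] = tcp_probe) -> Optional[str]:
    """Return the first endpoint for which probe() succeeds, else None."""
    for endpoint in endpoints:
        if probe(endpoint):
            return endpoint
    return None


if __name__ == "__main__":
    # Hypothetical hostnames, for illustration only.
    candidates = parse_bdii_list("bdii1.example.org:2170, bdii2.example.org:2170")
    print(first_reachable(candidates))
```

A test would then query whichever endpoint is returned, and raise a warning rather than a failure when none is reachable.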


VOs - GridPP VOMS VO IDs Approved VO table

Monday 4th March 2013

Monday 26th February 2013

  • NGS VOMS server. Durham fixed. Last site is Glasgow, and I'm running tests now. Hopefully this should now be fixed https://ggus.eu/ws/ticket_info.php?ticket=90356 - note that this has taken 3 months to complete.
  • SNO+ reports lcg-cp timeouts for large files. I suspect this is a problem with the UI.
  • Issues with Proxy renewal.
    • Certificate for RAL myproxy server doesn't match advertised hostname (how does this work at all?).
    • Other myproxy issues as well. GGUS#99105 GGUS#9172
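To illustrate the hostname mismatch, here is a simplified sketch of how a client might compare the hostname it connects to against the names listed in a server certificate. It is not the check performed by MyProxy or any grid middleware; the function name and the example.org hostnames are hypothetical, and the wildcard handling is deliberately minimal.

```python
from typing import Iterable


def hostname_matches(hostname: str, cert_names: Iterable[str]) -> bool:
    """Return True if hostname equals one of cert_names.

    Comparison is case-insensitive. A name of the form '*.domain' matches
    exactly one leftmost label (simplified wildcard rule).
    """
    host = hostname.lower().rstrip(".")
    for name in cert_names:
        pattern = name.lower().rstrip(".")
        if pattern == host:
            return True
        if pattern.startswith("*."):
            suffix = pattern[1:]  # e.g. ".example.org"
            if host.endswith(suffix):
                label = host[: -len(suffix)]
                # The wildcard must cover one non-empty label with no dots.
                if label and "." not in label:
                    return True
    return False


if __name__ == "__main__":
    # Advertised alias versus a certificate issued for the real host only.
    print(hostname_matches("myproxy.example.org", ["px01.example.org"]))
    # A certificate carrying both names (multi-alias) would match.
    print(hostname_matches("myproxy.example.org",
                           ["px01.example.org", "myproxy.example.org"]))
```

A certificate listing both the real host and the advertised alias satisfies this check, which is the idea behind asking for a multi-alias certificate.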

SNO+ Questions

  • Jobs appear to fail, but have uploaded output and it is in LFC
  • MC production
    • Want 2-3 people managing this
    • Shifters monitoring sites and filing tickets
    • How best to manage certificates - currently upload two proxies to myproxy - one for jobs to renew and one for the UI to renew.
    • How best to do this - should they use a robot cert?


Site Updates

Actions


Meeting Summaries
Project Management Board - Members Minutes Quarterly Reports

Empty

GridPP ops meeting - Agendas Actions Core Tasks

Empty


RAL Tier-1 Experiment Liaison Meeting (Wednesday 13:30) Agenda EVO meeting

Wednesday 13th March

  • Operations report
  • There were networking problems at RAL on Wed/Thu 6/7 March that caused a number of breaks in connectivity to the Tier1.
  • The planned intervention on Tues 12th overran significantly. The main core switch was replaced as planned and an internal bottleneck to one network stack relieved. However, the overrun was caused by problems re-establishing the paired (2*10Gbit) uplink, and at the moment this is running with a single 10Gbit connection.
  • The meeting continues to use Vidyo.
WLCG Grid Deployment Board - Agendas MB agendas

Empty



NGI UK - Homepage CA

Empty

Events

Empty

UK ATLAS - Shifter view News & Links

Empty

UK CMS

Empty

UK LHCb

Empty

UK OTHER
  • N/A
To note

  • N/A