Operations Bulletin 080413

From GridPP Wiki

Bulletin archive


Week commencing 1st April 2013
Task Areas
General updates

Tuesday 2nd April

  • Any remaining certificate problems?
  • Support for EMI-1 dCache was extended. See this broadcast. Report any tickets that have not been updated.
  • Jens has produced a page on Key Tokens. How do we want to use this now?
  • GGUS have released a new page on using the system.
  • There was an OMB meeting last Tuesday. (To be reviewed)


Monday 18th March

  • The March GDB (agenda) minutes are available. See also the actions and the pre-GDB on Clouds agenda and summary pages.
  • The next WLCG operations coordination planning meeting takes place this Thursday 21st March. (agenda)
  • EMI-3 ARGUS has again shown an issue with email addresses in certificates. The UK CA can now issue certificates without these addresses, and it may be beneficial for sites to change their certificates sooner rather than later.
  • EGI have been collecting information, by component, about problems found in the EMI-1 to EMI-2 transition. Those who can access it should check this page. If the page is not open to you, please email Jeremy with any problems you encountered so that he can check they have been captured.
  • The final WLCG availability report for February is now online.
WLCG Operations Coordination - Agendas

Tuesday 2nd April

  • A new task force on http proxy discovery is being formed (read more). They are looking for members.
  • Minutes of the 21st March planning meeting are now available.
Tier-1 - Status Page

Tuesday 2nd April

  • Generally quiet operations this last week.
  • Investigations are ongoing into problems at batch job set-up.
Storage & Data Management - Agendas/Minutes

Monday 1st April

  • DDN report?

Wed 20 March 2013

  • Ruminated over the agenda items from last week's GDB
    • EMI roadmap (dCache, and other things)
    • FTS support for HTTP - we knew this but how do we make use of it now
    • Storage accounting records, needs updated APEL;
    • Work of storage group(s) on interfaces and protocols, and future furlongpebbles.
  • RAL D1T0 evaluation.
    • Seems to be settling on HDFS and CEPH which will be run anyway
    • what about Lustre?
    • Presentation to PMB next Monday, but no decision yet.



Accounting - UK Grid Metrics HEPSPEC06 Atlas Dashboard HS06

Tuesday 12th March

  • APEL publishing stopped for Lancaster, QMUL and ECDF

Tuesday 12th February

  • SL HS06 page shows some odd ratios. Steve says he now takes "HS06 cpu numbers direct from ATLAS" and his page does get stuck every now and then.
  • An update of the metrics page has been requested.
Documentation - KeyDocs

See the worst KeyDocs list for documents needing review now and the names of the responsible people.

Tuesday 26th February

KeyDocs monitoring status: Grid Storage(7/0) Documentation(3/0) On-duty coordination(3/0) Staged rollout(3/0) Ticket follow-up(3/0) Regional tools(3/0) Security(3/0) Monitoring(3/0) Accounting(3/0) Core Grid services(3/0) Wider VO issues(3/0) Grid interoperation(3/0) Cluster Management(1/0) (brackets show total/missing)

Monday, 4th March

  • New draft document for putting a CE in downtime. It discusses the pros and cons of three approaches. Needs discussion to finalise.
  • EPIC integrated into Approved VOs, pending full acceptance.
  • Process commissioned to delete stale documents.

Thursday, 29th November

The Approved VOs document has been updated to automatically contain a table that lays out the resource requirements for each VO, as well as the maximum. We need to discuss whether this is useful - it seems that the majority of WN software requirements are passed around by word of mouth etc. Should this be formalized? Please see

   https://www.gridpp.ac.uk/wiki/GridPP_approved_VOs#VO_Resource_Requirements

This table will be kept up to date with a regular process that syncs it with the CIC Portal, should it prove to be useful.

Interoperation - EGI ops agendas

Tuesday 2nd April

  • Minutes of the 20th March EGI ops meeting are available.

Monday 19th March

  • The next EGI operations meeting (agenda) takes place this Wednesday 20th March.

Monday 4th March

  • An EGI operations meeting agenda for today's meeting is now available.
  • SR: Large number of updates in UMD2 and UMD1. In particular, in the UMD1 release, the DPM, LFC and L&B are security updates. The UMD2 WMS is _not_ backwards compatible without a workaround, as described in the release notes: https://wiki.egi.eu/wiki/UMD-2:UMD-2.4.0
  • EMI-3 release expected 7th March, UMD-3 prioritisation underway
  • Argus should be in the Site-BDII; it has had the information provider since the EMI-2 release, so it is probably a good plan to update EMI-1 Argus instances. (So should VOMS servers; they have had the information providers in all EMI releases.)


Monitoring - Links MyWLCG

Tuesday 5th February

  • Task will focus on probes and sharing of useful tools - suggestions and comments welcome

Monday 2nd July

  • DC has almost finished an initial ranking. This will be reviewed by AF/JC and discussed at 10th July ops meeting

Wednesday 6th June

  • Ranking continues. Plan to have a meeting in July to discuss good approaches to the plethora of monitoring available.
  • Glasgow dashboard now packaged and can be downloaded here.
On-duty - Dashboard ROD rota

Monday 1st April

  • A new GOCDB field related to the ROD email address was not populated. Emails should now reach the team.

Tuesday 5th March

  • Handling tickets related to EMI-1 probes - what to expect.
  • Recommendation with respect to upgrading CE (drain first)

Tuesday 12th February

  • Need all ROD members to complete availability survey for the rota.
Rollout Status WLCG Baseline

Tuesday 2nd April

  • EMI-1 components should be out of production. Nagios probes will report critical this month. Services remaining (without special condition) beyond 30th April will need to be placed in downtime.

Monday 4th March

  • EMI early adopters list by component.
  • Do we have a Staged Rollout list for EMI3?

Tuesday 5th February

References


Security - Incident Procedure Policies Rota

Tuesday 2nd April

  • Reminder about ptrace kernel issue (CVE-2013-0871)
  • Thanks to all those sites that took part in the security challenge

Tuesday 5th March

  • Two OpenAFS vulnerabilities announced (CVE-2013-1794 and CVE-2013-1795). Further details available at http://www.openafs.org/security. Updated RPMs for SL5/6 available.



Services - PerfSonar dashboard | GridPP VOMS

Tuesday 2nd April

  • Impending electrical work at Manchester - we need to commission the backup VOMS arrangement as soon as possible.

Monday 18th February

  • PerfSonar tests to BNL reveal poor rates for several sites since upgrade

Tuesday 5th February

  • NGS VOMS to be switched off this week
Tickets

Monday 1st April 20:00 BST

27 Open UK tickets, but we'll have to wait until next week for a full review of them all, as Matt's on leave this week and sends his apologies for tomorrow's meeting. Nothing's striking him as urgent, although someone on the ROD/Ops team might want to look at https://ggus.eu/ws/ticket_info.php?ticket=92512 (Wahid has set it to waiting for reply; there might be some confusion over who needs to do the replying).

In the meantime, if you aren't on leave too then please have a gander at your site's tickets and see if there's ought that needs your attention: http://tinyurl.com/cblj3ab

Otherwise he'll catch y'all next week, by then hopefully he will have stopped referring to himself in the third person again.

In other news:

EMI-3 StoRM is not production ready: https://ggus.eu/tech/ticket_show.php?ticket=92819


Tools - MyEGI Nagios

Tuesday 13th November

  • Noticed two issues during the Tier-1 power cut. The SRM and direct CREAM submission tests use the top BDII defined in the Nagios configuration to query for the resource. These tests started to fail because the RAL top BDII was not accessible. The configuration doesn't use BDII_LIST, so I cannot define more than one BDII. I am looking into how to make it more robust.
  • The Nagios web interface was not accessible to a few users because GOCDB was down. It is a bug in SAM-nagios and I have opened a ticket.

Site availability has not been affected by this issue, because Nagios sends only a warning alert when it cannot find a resource through the BDII.
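
As an illustration of the kind of robustness being looked for in the first item, here is a minimal sketch of trying several top BDIIs in turn instead of relying on one. It is not how SAM-nagios works or is configured: the function names, the hostnames and the use of port 2170 are all assumptions made for the example.

```python
import socket
from typing import Callable, Iterable, Optional


def first_reachable(endpoints: Iterable[str],
                    probe: Callable[[str], bool]) -> Optional[str]:
    """Return the first endpoint for which probe() succeeds, else None.

    'endpoints' is an ordered list of "host:port" strings (hypothetical
    top BDIIs); 'probe' is any callable that returns True when the
    endpoint is usable and may raise OSError when it is not.
    """
    for endpoint in endpoints:
        try:
            if probe(endpoint):
                return endpoint
        except OSError:
            # Network-level failure: treat this endpoint as down and move on.
            continue
    return None


def tcp_probe(endpoint: str, timeout: float = 5.0) -> bool:
    """Check only that a TCP connection to "host:port" can be opened."""
    host, port = endpoint.rsplit(":", 1)
    with socket.create_connection((host, int(port)), timeout=timeout):
        return True


if __name__ == "__main__":
    # Hypothetical hostnames; replace with real top BDIIs to try it out.
    candidates = ["bdii-a.example.org:2170", "bdii-b.example.org:2170"]
    print(first_reachable(candidates, tcp_probe))
```

The probe is passed in as a parameter so that the fallback order can be tested without touching the network; a real check would probably need to run an LDAP query rather than just open a socket.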


VOs - GridPP VOMS VO IDs Approved VO table

Tuesday 2 April 2013

Monday 4th March 2013

Monday 26th February 2013

  • NGS VOMS server. Durham fixed. Last site is Glasgow, and I'm running tests now. Hopefully this should now be fixed https://ggus.eu/ws/ticket_info.php?ticket=90356 - note that this has taken 3 months to complete.
  • SNO+ reports lcg-cp timeouts for large files. I suspect this is a problem with the UI.
  • Issues with Proxy renewal.
    • Certificate for RAL myproxy server doesn't match advertised hostname (how does this work at all?).
    • Other myproxy issues as well. GGUS#99105 GGUS#9172

SNO+ Questions

  • Jobs appear to fail, but have uploaded output and it is in LFC
  • MC production
    • Want 2-3 people managing this
    • Shifters monitoring sites and filing tickets
    • How best to manage certificates - currently upload two proxies to myproxy - one for jobs to renew and one for the UI to renew.
    • How best to do this - should they use a robot cert?


Site Updates

Actions


Meeting Summaries
Project Management Board - Members Minutes Quarterly Reports

Empty

GridPP ops meeting - Agendas Actions Core Tasks

Empty


RAL Tier-1 Experiment Liaison Meeting (Wednesday 13:30) Agenda EVO meeting

Wednesday 3rd April

  • Operations report
  • The three EMI-1 WMSs (lcgwms01,02,03) have been retired and replaced with lcgwms04,05,06.
  • There have been some networking problems over the last fortnight which have caused transitory problems.
  • Investigations continue into the problems affecting batch starts for Atlas & LHCb.
  • New hardware (disk and CPU) in final testing ahead of going into production.
WLCG Grid Deployment Board - Agendas MB agendas

Empty



NGI UK - Homepage CA

Empty

Events

Empty

UK ATLAS - Shifter view News & Links

Empty

UK CMS

Empty

UK LHCb

Empty

UK OTHER
  • N/A
To note

  • N/A