Operations Bulletin 250213

From GridPP Wiki

Bulletin archive


Week commencing 18th February 2013
Task Areas
General updates

Monday 18th February

  • Recent ATLAS jobs and WN crashes - what is the agreed action to take?
  • Glexec and ARGUS deployment guidelines
  • Agenda for Thursday's WLCG ops coordination meeting


Monday 11th February

Tier-1 - Status Page

Tuesday 19th February

  • Ongoing problems with the batch farm not starting enough jobs over the past couple of weeks
  • AFS clients removed from the worker nodes last week.
  • A small number of nodes now form an SL6 batch queue behind its own CE (lcgce12).
  • Ongoing testing of FTS version 3.
Storage & Data Management - Agendas/Minutes

Wednesday 13 Feb 2013

  • EGI Community Forum Preparations
    • "Small" VOs - the user/community perspective
    • More grids and clouds stuff -
  • DPM upgrades to 1.8.6.
  • UK in good shape on this, with many sites in the ATLAS "FAX" federation and the big CMS sites in theirs too.


Accounting - UK Grid Metrics HEPSPEC06 Atlas Dashboard HS06

Tuesday 12th February

  • SL HS06 page shows some odd ratios. Steve says he now takes "HS06 cpu numbers direct from ATLAS" and his page does get stuck every now and then.
  • An update of the metrics page has been requested.

Tuesday 30th October

  • Storage availability in SL pages has been affected by a number of sites being asked by ATLAS to retire the ATLASGROUPDISK space token while the SUM tests were still testing it as critical. The availability will be corrected manually once the month ends. Sites affected in different degrees are RHUL, CAM, BHAM, SHEF and MAN.
Documentation - KeyDocs

See the worst KeyDocs list for documents needing review now and the names of the responsible people.

Tuesday 12th February

KeyDocs monitoring status: Grid Storage(7/0) Documentation(3/0) On-duty coordination(3/0) Staged rollout(3/0) Ticket follow-up(3/0) Regional tools(3/0) Security(3/0) Monitoring(3/0) Accounting(3/0) Core Grid services(3/0) Wider VO issues(3/0) Grid interoperation(3/0) Cluster Management(1/0) (brackets show total/missing)

Thursday, 29th November

The Approved VOs document has been updated to automatically contain a table that lays out the resource requirements for each VO, as well as the maximum. We need to discuss whether this is useful - it seems that the majority of WN software requirements are passed around by word of mouth etc. Should this be formalized? Please see

   https://www.gridpp.ac.uk/wiki/GridPP_approved_VOs#VO_Resource_Requirements

This table will be kept up to date with a regular process that syncs it with the CIC Portal, should it prove to be useful.

Interoperation - EGI ops agendas

Monday 11th February

  • There was an EGI ops meeting today.
  • There is a list-match problem with EMI2 WMS (GGUS 90240)

Monday 28th January

  • There was an EGI ops meeting (https://wiki.egi.eu/wiki/Agenda-28-01-2013) yesterday.
  • EMI release: DPM 1.8.6 (Small update with security fixes) and VOMS 2.0.10-1 (Small update with security fixes)
  • EMI3 due in April. Looking for SR sites.
  • For UMD-2, CREAM is released to production, WMS had problems found.
  • CA update 1.52-1 under SR, release expected 30-01-2013 (so get ready to update the CA certs...), and SAM-Update 20


Monitoring - Links MyWLCG

Tuesday 5th February

  • Task will focus on probes and sharing of useful tools - suggestions and comments welcome

Monday 2nd July

  • DC has almost finished an initial ranking. This will be reviewed by AF/JC and discussed at 10th July ops meeting

Wednesday 6th June

  • Ranking continues. Plan to have a meeting in July to discuss good approaches to the plethora of monitoring available.
  • Glasgow dashboard now packaged and can be downloaded here.
On-duty - Dashboard ROD rota

Monday 25th February

  • There are a number of sites struggling with APEL - QMUL knows about this, UCL has an expired host cert (sigh). It's probably worth keeping an eye on this.

Tuesday 12th February

  • Need all ROD members to complete availability survey for the rota.


Rollout Status WLCG Baseline

Tuesday 5th February

References


Security - Incident Procedure Policies Rota

Tuesday 19th February

Monday 11th February

  • DPM status

Tuesday 5th February

  • Large Java update released by Red Hat [1] and [2]; SL5/SL6 updates expected imminently.
  • UPnP issue. Metasploit exploit released for Supermicro Onboard IPMI (X9SCL/X9SCM).

Monday 21st January

  • SHA-2 and glexec updates given at pre-GDB.
  • EUGridPMA agreed on timeline where SHA-2 becomes the default in August 2013. SHA-1 will still be requestable.

Tuesday 15th January

  • There has been a recent Java patch. Sites should consider rolling out updates to client machines (especially those containing user certificates). The most current version is "Version 7 Update 11".


Services - PerfSonar dashboard | GridPP VOMS

Monday 18th February

  • PerfSonar tests to BNL reveal poor rates for several sites since upgrade

Tuesday 5th February

  • NGS VOMS to be switched off this week
Tickets

Monday 18th February 15.00 GMT
33 open tickets for UK sites this week. Only 1 is "green" - 2 are "yellow" and the rest are "red".

TIER 1
https://ggus.eu/ws/ticket_info.php?ticket=91146 (4/2)
Atlas RAL bandwidth issues.
https://ggus.eu/ws/ticket_info.php?ticket=91029 (30/1)
Atlas having problems querying the FTS.

Both these tickets are waiting on something happening/being fixed in the (hopefully) not too distant future, so they should probably be put On Hold.

https://ggus.eu/ws/ticket_info.php?ticket=90528 (17/1)
Sno+ jobs don't get sent to Sheffield from the Tier 1 WMSes. Matt M has provided additional information. In progress (14/2)

CAMBRIDGE
https://ggus.eu/ws/ticket_info.php?ticket=91582 (17/2)
Atlas have used up all their data disk at Cambridge, then ticketed the site about it. John has set the ticket to In Progress, confirming the situation (the space is used up, there's no accounting problem or dead disk server). At the very least I think this ticket should be set to "Waiting for Reply" (When are you going to clean up your data?), or even have the ticket bounced to the Atlas DDM people. In progress (17/2)

GLASGOW
https://ggus.eu/ws/ticket_info.php?ticket=91439 (12/2)
Atlas had transfer problems at Glasgow, fixed but errors continue due to a problem at FZK. Bear that in mind if you get an Atlas ticket this week. In Progress, perhaps can be Waiting for Reply/FZK to fix themselves (18/2)

RALPP
https://ggus.eu/ws/ticket_info.php?ticket=91377 (11/2)
Atlas replied to say that they were still seeing transfer problems, although this was a couple of days ago. In progress (13/2)


ECDF
https://ggus.eu/ws/ticket_info.php?ticket=90878 (27/1)
This lhcb ticket concerning cvmfs problems (which weren't cvmfs problems after all) is looking a little neglected. LHCB replied to a question a while back now. In Progress (6/2)

LANCASTER
https://ggus.eu/ws/ticket_info.php?ticket=90395 (14/1)
Lancaster had problems with running dteam jobs (the CE in question had problems running anyone's jobs to be fair). The CE has been rejuvenated, but embarrassingly for me I have yet to configure one of the Lancaster UIs for dteam. In Progress (11/2)

DURHAM
https://ggus.eu/ws/ticket_info.php?ticket=90393
https://ggus.eu/ws/ticket_info.php?ticket=90340
https://ggus.eu/ws/ticket_info.php?ticket=90358
https://ggus.eu/ws/ticket_info.php?ticket=89825
https://ggus.eu/ws/ticket_info.php?ticket=75488

As mentioned last week, after Mike's Herculean efforts Durham is back on its feet. If Mike is back from his well-earned break, it is worth asking on each of these tickets whether the problems persist (or at least whether the error message has changed) and switching them all from "On hold" to "Waiting for Reply".

Any other tickets people want to go over - to or from the UK (or an issue which might affect us)?

Tools - MyEGI Nagios

Tuesday 13th November

  • Noticed two issues during the Tier-1 power cut. The SRM and direct CREAM submission tests use the top BDII defined in the Nagios configuration to query for the resource. These tests started to fail because the RAL top BDII was not accessible. The configuration does not use BDII_LIST, so I cannot define more than one BDII. I am looking into how to make it more robust.
  • The Nagios web interface was not accessible to a few users because GOCDB was down. It is a bug in SAM-nagios and I have opened a ticket.

Site availability has not been affected by this issue, because Nagios sends a warning alert when it cannot find a resource through the BDII.
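
One generic way to make a single-BDII lookup more robust is to try a list of endpoints in order and use the first that answers. The following is only a sketch of that idea, not the SAM-Nagios implementation: the function name `query_first_available`, the `query` callable and the host names in the usage note are all hypothetical, and the real probes may need a different approach.

```python
from typing import Callable, Iterable, Optional, Tuple, TypeVar

T = TypeVar("T")


def query_first_available(
    endpoints: Iterable[str],
    query: Callable[[str], T],
) -> Tuple[Optional[str], Optional[T]]:
    """Try each endpoint in order and return (endpoint, result) for the first success.

    `query` is a hypothetical callable that performs the lookup against one
    endpoint and raises an exception if that endpoint cannot be reached.
    Returns (None, None) if every endpoint fails.
    """
    for endpoint in endpoints:
        try:
            return endpoint, query(endpoint)
        except Exception:
            # This endpoint is unavailable; fall through to the next one.
            continue
    return None, None
```

A caller would pass its own lookup function, for example `query_first_available(["topbdii-a.example.org", "topbdii-b.example.org"], my_lookup)`, where both host names and `my_lookup` are placeholders.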


VOs - GridPP VOMS VO IDs Approved VO table

Monday 18 February 2013

Monday 12 February 2013

SNO+ Questions

  • Jobs appear to fail, but have uploaded output and it is in LFC
  • MC production
    • Want 2-3 people managing this
    • Shifters monitoring sites and filing tickets
    • How best to manage certificates - currently upload two proxies to myproxy - one for jobs to renew and one for the UI to renew.
    • How best to do this - should they use a robot cert?


Monday 14 January 2013

  • Neiss.org.uk
    • Now have VO-ID card in operations-portal (previously CIC portal)
    • GridPP/NGS VOMSs server issues
    • NGS WMS hadn't enabled current CEs at QMUL and Lancs, so I've requested the GridPP WMSs enable it - as the VO is supported on GridPP sites.
    • Would be a good use case for SARONGS - but they don't have the time to debug this.


Site Updates

Actions


Meeting Summaries
Project Management Board - Members Minutes Quarterly Reports

Monday 1st October

  • ELC work


Tuesday 25th September

  • Reviewing pledges.
  • Q2 2012 review
  • Clouds and DIRAC
GridPP ops meeting - Agendas Actions Core Tasks

Tuesday 21st August - link Agenda Minutes

  • TBC


RAL Tier-1 Experiment Liaison Meeting (Wednesday 13:30) Agenda EVO meeting

Wednesday 20th February

  • Operations report
  • A data loss (68 files) has been reported to T2K following a disk server failure.
  • Testing of Castor version 2.1.13 continues and two production Tape Servers have been successfully upgraded to this version.
  • The meeting continues to use Vidyo as part of a four-week test.
WLCG Grid Deployment Board - Agendas MB agendas

October meeting Wednesday 10th October




NGI UK - Homepage CA

Wednesday 22nd August

  • Operationally few changes - VOMS and Nagios changes on hold due to holidays
  • Upcoming meetings Digital Research 2012 and the EGI Technical Forum. UK NGI presence at both.
  • The NGS is rebranding to NES (National e-Infrastructure Service)
  • EGI is looking at options to become a European Research Infrastructure Consortium (ERIC). (Background document.)
  • Next meeting is on Friday 14th September at 13:00.
Events

WLCG workshop - 19th-20th May (NY) Information

CHEP 2012 - 21st-25th May (NY) Agenda

GridPP29 - 26th-27th September (Oxford)

UK ATLAS - Shifter view News & Links

Thursday 21st June

  • Over the last few months ATLAS have been testing their job recovery mechanism at RAL and a few other sites. This is something that was 'implemented' before but never really worked properly. It now appears to be working well, allowing jobs to finish even if the SE is down or unstable when the job finishes.
  • Job recovery works by writing the output of the job to a directory on the WN should it fail when writing the output to the SE. Subsequent pilots will check this directory and try again for a period of 3 hours. If you would like to have job recovery activated at your site you need to create a directory which (atlas) jobs can write to. I would also suggest that this directory has some form of tmpwatch enabled on it which clears up files and directories older than 48 hours. Evidence from RAL suggests that it's normally only 1 or 2 jobs that are ever written to the space at a time and the space used is normally less than a GB. I have not observed more than 10GB being used. Once you have created this space, please email atlas-support-cloud-uk at cern.ch with the directory (and your site!) and we can add it to the ATLAS configurations. We can switch off job recovery at any time if it does cause a problem at your site. Job recovery would only be used for production jobs as users complain if they have to wait a few hours for things to retry (even if it would save them time overall...)
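
For sites that prefer a script to tmpwatch, the 48-hour clean-up suggested above could look roughly like the sketch below. This is an illustration only: the function name `purge_old_entries` and the example path in the usage note are hypothetical, age is judged by modification time, and the sketch assumes it runs with permission to delete inside the recovery directory.

```python
import os
import time

MAX_AGE_SECONDS = 48 * 3600  # the 48-hour retention suggested in the text


def purge_old_entries(root, max_age=MAX_AGE_SECONDS):
    """Remove files under `root` whose modification time is older than `max_age`.

    Empty subdirectories that are themselves older than `max_age` are removed
    too. A subdirectory emptied during this run has a fresh modification time,
    so it is left for a later run. Returns the list of removed paths.
    """
    now = time.time()
    removed = []
    # Walk bottom-up so each directory is visited after its contents.
    for dirpath, _dirnames, filenames in os.walk(root, topdown=False):
        for name in filenames:
            path = os.path.join(dirpath, name)
            try:
                if now - os.lstat(path).st_mtime > max_age:
                    os.remove(path)
                    removed.append(path)
            except OSError:
                continue  # file vanished or could not be removed; skip it
        if dirpath == root:
            continue  # never remove the recovery directory itself
        try:
            if not os.listdir(dirpath) and now - os.lstat(dirpath).st_mtime > max_age:
                os.rmdir(dirpath)
                removed.append(dirpath)
        except OSError:
            continue
    return removed
```

It could be run periodically (for example from cron) as `purge_old_entries("/path/to/atlas/recovery")`, where the path is a placeholder for whatever directory the site has created.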
UK CMS

Tuesday 24th April

  • Brunel will be trialling CVMFS this week, will be interesting. RALPP doing OK with it.
UK LHCb

Tuesday 24th April

  • Things are running smoothly. We are going to run a few small scale tests of new codes. This will also run at T2, one UK T2 involved. Then we will soon launch new reprocessing of all data from this year. CVMFS update from last week; fixes cache corruption on WNs.
UK OTHER
  • N/A
To note

  • N/A