Operations Bulletin 180612

Bulletin archive


Week commencing 11th June 2012
Task Areas
General updates

Tuesday 12th June

  • There is a WLCG pre-GDB meeting on WN security - the agenda is here. Vidyo connection available.
  • The next GDB is this Wednesday. The agenda is here. Vidyo connection available.

Monday 11th June

  • T2 reliability and availability results for May 2012.
  • EGI querying whether NGIs/institutes host local UMD repositories.

Friday 8th June

  • Mingchao leaves his role as security officer today. Thanks to him for many years of contributions. His role will be covered by members of the ops team - recruitment of a new security officer is underway.

Thursday 31st May

  • Please note that the Technical Evolution Strategy documents have been moved within the WLCG Document Repository. The folder is now under this area.
  • Check site gstat values
Tier-1 - Status Page

Tuesday 12th June

  • A new version of the Castor Information Provider (CIP) was rolled out on Wednesday 6th June.
  • The problem starting the diesel generator, discovered during the test on 6th June, was resolved the next day.
  • Atlas Castor instance switched to use the Transfer Manager on Thursday 7th June.
  • Failure of a power distribution unit caused problems for a network switch and some of the BDII nodes on the afternoon of Thursday 7th June.
  • A problem with one of the FTS agent nodes during the morning of Friday 8th June disrupted file transfers, particularly to/from the RAL Tier1.
  • A problem with DNS lookups for nodes at Fermilab was investigated and fixed during the morning of Friday 8th June.
  • There was a short break in our network connectivity, lasting about 15 minutes, at around 17:30 on Friday 8th June.
  • Castor 2.1.11-9 update and LFC/FTS database Oracle 11 updates scheduled for Wednesday 13th June.
  • WMS01 unavailable Thu 14 to Wed 20 June for drain and re-installation.
  • Site unavailable for networking upgrade on Tuesday morning 19th June.
Storage & Data Management - Agendas/Minutes

Wednesday 6 June 2012 - we are still digesting CHEP information (see also the blog), plus some of the usual operational and upgrade work. Hoping to find a few spare clock cycles for some slightly more experimental stuff.

Wednesday 23 May 2012 - lots of exciting stuff at CHEP, we have about five things in, some posters, some oral.


Accounting - UK Grid Metrics HEPSPEC06

Wednesday 6th June - Core-ops

  • Request sites to publish HS06 figures from new kit to this page.
  • Please would all sites check the HS06 numbers they publish. Will review in detail on 26th June.

Friday 11th May - HEPSYSMAN

  • Discussion on HS06 reminding sites to publish using results from 32-bit mode benchmarking. A reminder for new kit results to be posted to the HS06 wiki page. See also the blog article by Pete Gronbech. The HEPiX guidelines for running the benchmark tests are at this link.


Friday 4th May - TB-SUPPORT

  • Long-standing problem: gstat overcounts things like RunningJobs and TotalJobs if there are multiple CE nodes feeding the same set of queues. The gstat developers have just put a comment on the ticket for it indicating that they have done something to remove the duplicates.
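
To illustrate the overcounting, here is a minimal Python sketch. It is not the gstat developers' fix (the bulletin does not say what they changed); the record layout, the field names (`ce`, `cluster`, `queue`, `running_jobs`) and the hostnames are all hypothetical. It assumes each CE in front of a shared queue reports that queue's full job count, so a naive sum counts the queue once per CE.

```python
def total_running_jobs(records):
    """Sum running jobs once per shared queue rather than once per CE."""
    per_queue = {}
    for rec in records:
        # Hypothetical key: CEs feeding the same batch queue share cluster + queue name.
        key = (rec["cluster"], rec["queue"])
        # Every CE in front of the queue reports the same jobs; keep the largest figure seen.
        per_queue[key] = max(per_queue.get(key, 0), rec["running_jobs"])
    return sum(per_queue.values())


# Two hypothetical CEs feeding the same "long" queue, one of them also feeding "short".
records = [
    {"ce": "ce01.example.ac.uk", "cluster": "batch.example.ac.uk", "queue": "long", "running_jobs": 120},
    {"ce": "ce02.example.ac.uk", "cluster": "batch.example.ac.uk", "queue": "long", "running_jobs": 120},
    {"ce": "ce01.example.ac.uk", "cluster": "batch.example.ac.uk", "queue": "short", "running_jobs": 30},
]

naive_total = sum(rec["running_jobs"] for rec in records)  # counts the "long" queue twice
deduplicated_total = total_running_jobs(records)
print(naive_total, deduplicated_total)
```

With this sample data the naive sum is 270 while the per-queue sum is 150, which is the kind of gap a site with several CEs in front of one batch system would see.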
Documentation - KeyDocs

Wednesday, 6th June

Released a document, hep.ph.liv.ac.uk/~sjones/VomsSnooper.odt, that describes how to

  • Maintain site VOMS info document for the approved VOs
  • Check a site's VOMS records correspond exactly with CIC portal
  • Create new site VOMS records direct from CIC portal, without manual transcription

Note: I'm accepting tips from GridPP core task members etc. about other use cases for these processes.

This will be converted to wiki format and made available in the normal way. Next jobs:

  • Review the logic/sequence of the VOMS admin process; document it if it works, fix it if it doesn't.
  • Create a standard baseline for the proxy renewal process, and write it up in the wiki.

Note: I'm accepting tips from other GridPP core team members etc. for document priorities. Please think about where the problems lie (i.e. what costs us yet is easy to fix) and get back to me.


Tuesday, 29th May

  • VOMS records in the GridPP Approved VO list are now up to date with the CIC Portal XML. Site admins can use this to ensure their site-info.def/vo.d directories are up to date. A tool, SidFormatter, will be released this week to facilitate comparison with the benchmark. A process has been devised to ensure that the GridPP Approved VO list is kept up to date to within a week of CIC Portal changes. Consultation to be made about further fields that we may wish to advertise in this manner.

Friday 27th April

  • Appeal for a volunteer to enhance "Grid User Crash Course" (https://www.gridpp.ac.uk/wiki/Grid_user_crash_course) with a simple use case for dependable proxy renewal for long jobs, as this is a recurrent requirement that has caused multiple queries on TB-SUPPORT.
Interoperation - EGI ops agendas

Friday 8th June - EGI ops agenda

  • Sites not publishing UserDNs. There are a number of reasons: site policy; wrong configuration; and a bug that the APEL team will contact affected sites about (plus another that doesn't apply here, affecting sites that publish via an aggregator). A list of sites publishing has been calculated:
  • UK sites not publishing UserDNs: UKI-LT2-Brunel; UKI-NORTHGRID-MAN-HEP; EFDA-JET; UKI-LT2-UCL-HEP; UKI-SCOTGRID-ECDF; UKI-SOUTHGRID-BRIS-HEP; UKI-SOUTHGRID-CAM-HEP
  • RHUL and RAL are counting slightly below 100% - I would expect that's probably because it's recently turned on, although there could be other issues lurking behind that stat (i.e. could also be recently turned off). EGI want a view on the various reasons for this. Suspect that this will be down to one of three reasons: Policy; Plan to but not had time yet; or some technical problem stopping it. Let SP know.


Monitoring - Links MyWLCG

Wednesday 6th June

  • Ranking continues. Plan to have a meeting in July to discuss good approaches to the plethora of monitoring available.
  • Glasgow dashboard now packaged and can be downloaded here.
On-duty - Dashboard ROD Rota

Monday 11th June - KM

  • Very busy week. A lot of sites were failing CAdist tests but all of them updated after being ticketed. Durham suffered another power cut and cooling failure and is still failing tests; a ticket is already open. Manchester is failing SE and CE tests. There are two open tickets against Manchester, with no update from the site.

Friday 25th May - AM

  • Several sites have downtimes, and a few raised non-trivial alarms during the week. All are ok again now or still in downtime, with Durham having the ce01 and se02 tickets although it's in an unplanned downtime now.
  • Glasgow had problems with MPI alarms even though MPI shouldn't have been advertised, and these alarms were eventually closed still at critical on the Dashboard along with the corresponding ticket (the Nagios pages for the nodes themselves were back to all green due to the removal of the MPI endpoints.)
Rollout Status WLCG Baseline

Monday 11th June

  • EMI2 is released but not in Staged Rollout yet. Buyers beware.

Thursday 10th May

  • The CREAM CE and the WMS, which were released at the end of April, have finally gone into Staged Rollout.
  • Call for more sites to take part in EMI-2 rollout tests.
  • The overall SR contributions are in this table.

Friday 27th April

  • Updated version information on rollout page
  • WN scan indicates some sites not keen on OS updates to those nodes.
Security - Incident Procedure Policies

Wednesday 6th June

  • The GridPP security team have begun operating a weekly "on duty" rota to provide cover until a replacement for Mingchao is appointed.
  • UKNGI will not be contributing to the EGI security officer duty rota for the time being.

Tuesday 15th May

  • The next NGI/GridPP security team meeting is on Wednesday 16th.
  • Main discussion is on the rota and handover tasks.

Wednesday 2nd May

  • Setting up a security on-duty rota to improve the team and cover the period after Mingchao leaves. Alessandra, Linda, Rob and Ewan are involved.
  • Currently reviewing the overall security task.
  • SSC5 preparations to start soon.


Services - PerfSonar dashboard

Wednesday 6th June

  • Plan to have 4 more sites perfsonar enabled by end of June.
  • Testing matrix to be confirmed.
  • Will survey sites looking for DRI kit deployment issues.
  • 23rd April requested network utilisation figures for March and April
  • LHCONE meeting last week in Stockholm
  • Agreed to focus on perfsonar.
Tickets

Monday 11th of June, 13:00 BST
29 open UK tickets this week.

Over the last week there have been a few queries, at the Ops-team meeting and in tickets themselves, concerning the correct procedure for involving other sites/support units. I hope to have something solid on this for next week.

QMUL/NGI
https://ggus.eu/ws/ticket_info.php?ticket=83020
Reliability/availability report ticket for QMUL. Jeremy's already notified the site but worth mentioning here too.

DURHAM/NGI
https://ggus.eu/ws/ticket_info.php?ticket=83006
Same for Durham. Update - Durham replied, citing the infrastructure issues that they've had.

NGI
https://ggus.eu/ws/ticket_info.php?ticket=82491
Jeremy's ticket to the VOMS chaps. Looks like it can be closed.

https://ggus.eu/ws/ticket_info.php?ticket=82492
Chris' ticket on the subject, can be closed or reassigned to the VOMS developers as an RFE.

SNO+
https://ggus.eu/ws/ticket_info.php?ticket=82671
https://ggus.eu/ws/ticket_info.php?ticket=82670
Have been seeing problems retrieving output from desdemona.zih.tu-dresden.de on both the Glasgow and Imperial WMSs. Likely a problem at the other end, but these tickets threaten to bounce around.

IC
https://ggus.eu/ws/ticket_info.php?ticket=82946
Interesting ticket - despite having cvmfs installed IC seem to be missing a release. This ticket is tracking the investigation.

QMUL
https://ggus.eu/ws/ticket_info.php?ticket=82842
https://ggus.eu/ws/ticket_info.php?ticket=82891
Chris has been seeing jobs dying with "Cancelled by CE admin" errors. This prompted his mail to TB-SUPPORT today (where he saw lots of files in /opt/glite/tmp that needed clearing out). Affecting biomed & hone jobs. Daniela suggests a full service restart and cites a recently discovered problem in ticket 82891. Update - Chris has moved non-LHC VOs off the affected CE, which seems to have calmed the problem.

DURHAM
https://ggus.eu/ws/ticket_info.php?ticket=82818
LHCb pilots seem to be dying due to cvmfs problems, although fixing it may be a low priority due to ongoing power troubles at the site. Update - Durham are currently having a bad time of it; this will get looked at in due course.

RALPP
https://ggus.eu/ws/ticket_info.php?ticket=82739
heplnx206.pp.rl.ac.uk not working for biomed; looks like information system errors (or the SE isn't actually for biomed's use). Ticket is looking a little neglected. Update - Chris has found the error to be caused by the "fix" to another ticket (75960); removing the fix cured the problem.

CAMBRIDGE
https://ggus.eu/ws/ticket_info.php?ticket=82296
Still no word from ATLAS on whether the problems have gone away. I suggest closing the ticket after leaving a quick description of your solution.


SOLVED CASE PILE
https://ggus.eu/ws/ticket_info.php?ticket=82749
A ticket from the UK (to NGI_GRNET) concerning the problems seen during the SUSSEX certification (81784); all issues have been solved.

FROM THE UK
Due to the glacial movement of many of the tickets I'm cataloguing these tickets elsewhere (https://www.gridpp.ac.uk/wiki/Tickets_From_The_UK), only documenting significant changes or new problems here.

https://ggus.eu/ws/ticket_info.php?ticket=83133
The na62 FTS service at CERN doesn't appear to be switched on...

Update: Daniela has two interesting tickets:
https://ggus.eu/ws/ticket_info.php?ticket=82746
Daniela spotted an error in the certificate handling for LB on SL6. The LB developers look like they've found the cause.

https://ggus.eu/tech/ticket_show.php?ticket=82448
The EMI-1 LB seems to have a habit of filling up /var/tmp with notifications when things aren't working as intended, tracked down to the glite-lb-notif-interlogd crashing. Investigation continues.


Tools - MyEGI Nagios

Wednesday 6th June

  • By end of June plan is to have backup Nagios available.
  • Smaller VO automated testing still waiting for a bug fix.

Sunday 8th April

  • Lancaster Nagios backup hardware has arrived
  • Nagios based VO testing setup for vo.southgrid.ac.uk. Nagios view is here. Tests run every 8 hours. The results are also fed back into the dashboard and alarms triggered.
  • There is a bug which prevents multiple VO Nagios tests with one Nagios instance. Developers informed.
VOs - GridPP VOMS VO IDs Approved

Wednesday 6th June

  • Cross-checking VOs enabled vs VO table.
  • Surveying VO-admins for problems faced in their VOs.


Site Updates

Tuesday 12th June

  • Sussex still in certification.



Meeting Summaries
Project Management Board - Members Minutes Quarterly Reports

Monday 11th June

  • NGS future planning
  • Options to make VOMS more resilient
  • Plans for Tier-1 review (to be held 20th June)
GridPP ops meeting - Agendas Actions Core Tasks

Tuesday 8th May - link Agenda Minutes

  • ATLAS DATADISK now being used for production input files
  • Deploy perfsonar-ps (for LHC compatibility), but run a Perfsonar-MDM portal to collate the information
  • Target date for perfsonar-ps at sites is the end of July
  • KeyDocs still not in place for all areas

Tuesday 1st May 2012 - Agenda Minutes

  • Trying the present bulletin as a conduit for information
  • 7 sites below 90% in ATLAS monitoring!
  • Instructions for small VO space-token setup/usage requested
  • EMI WN tarball now available - comment in ticket
  • Not every site running latest SL5 OS
RAL Tier-1 Experiment Liaison Meeting (Wednesday 13:30) Agenda EVO meeting

Wednesday 6th June

  • All Castor instances switched to use the Transfer Manager (which replaces LSF) except Atlas which is scheduled for 7th June.
  • The following Tier1 interventions are announced (not yet in GOC DB):
    • Wednesday 13th June: Castor 2.1.11-9 update. (Expect Castor down from morning to early afternoon).
    • Wednesday 13th June: Oracle 11 update for LFC & FTS databases. (Expect these services down all working day).
    • Tuesday 19th June: Replacement of RAL Site Access Router. (Expect external connectivity broken for up to 3 hours in morning.)
    • Wednesday 27th June: Update Castor databases to Oracle 11. (Expect Castor down all working day).
  • Test of backup diesel generator unsuccessful (today, Wed 6th June) as it failed to start. Experts expected tomorrow to investigate. Site at an additional risk if power cut.
  • Useful discussion with NA62 about issues they are having getting going with using the site.
  • Operations report
WLCG Grid Deployment Board - Agendas MB agendas

Next meeting Wednesday 13th June



NGI UK - Homepage CA

Monday 14th May

  • CA: Retiring 2007 CA cert: it stopped signing last October and now *must* vanish from IGTF at the end of Sep release.
  • CA: We won't be able to meet IGTF targets of 1st October 2012 for IPv6 support for our CRLs, Networks suggest even March 2013 might be optimistic.
  • Services: Leeds to deploy a clean WLCG Nagios development platform with help from Kashif
  • UKIROC decommissioning in progress
  • NGS Ops team to reconsider resilience of core services following recent network outages at RAL

Friday 9th March NGS-CA-TAG meeting

  • Priorities discussion for the CA. Plans to be clarified for future meeting.
  • Mathew Dovey has taken over from Neil Geddes as EGI Council and EB chair.
  • Email addresses now removed from certificates.
Events

WLCG workshop - 19th-20th May (NY) Information

CHEP 2012 - 21st-25th May (NY) Agenda

UK ATLAS - Shifter view News & Links

Wednesday 2nd May

  • Testing glideinWMS but some problems spotted


Tuesday 24th April

  • ATLAS has started to run reco jobs at T2s more extensively. These require bigger input data sets, which should be copied to the DATADISK space token. Some are being copied to PRODDISK. If you see PRODDISK filling up you should take action.

Thursday 3rd April

  • ECDF has a full data disk. Every time this happens the site is automatically blacklisted and cleaned up by ATLAS; however, this automatic process results in a few thousand failed transfers. There is also a concern that these failing transfers are potentially creating dark data. Wahid Bhimji is contacting ATLAS DDM operations to find out if these problems can be fixed. We are concerned that this may be a first for ATLAS as ECDF has a small amount of disk compared to its available CPU.
  • A disk server at Oxford has failed with the loss of a significant fraction of data. Users are noticing their jobs are failing at Oxford. We are going to proceed with the lost file recovery as soon as possible although are waiting for confirmation from the site. (https://ggus.eu/ws/ticket_info.php?ticket=81788)
  • The next meeting will be 17th May because of HEP Sysman being held at RAL on the 10/11th May.
UK CMS

Tuesday 24th April

  • Brunel will be trialling CVMFS this week, will be interesting. RALPP doing OK with it.
UK LHCb

Tuesday 24th April

  • Things are running smoothly. We are going to run a few small scale tests of new codes. This will also run at T2, one UK T2 involved. Then we will soon launch new reprocessing of all data from this year. CVMFS update from last week; fixes cache corruption on WNs.
UK OTHER

Tuesday 24th April

  • T2K are having problems with WMS proxy renewal: some WMSes are advertising support but don't actually work. Could be user error, but will need investigation.
Requests

  • More sites needed to test EMI-2