Operations Bulletin 250612

From GridPP Wiki
Revision as of 08:24, 25 June 2012 by Jeremy coles (Talk | contribs)


Bulletin archive


Week commencing 18th June 2012
Task Areas
General updates

Monday 18th June

Tuesday 12th June

  • There is a WLCG pre-GDB meeting on WN security - the agenda is here. Vidyo connection available.
  • The next GDB is this Wednesday. The agenda is here. Vidyo connection available.

Monday 11th June

  • T2 reliability and availability results for May 2012.
  • EGI querying whether NGIs/institutes host local UMD repositories.

Friday 8th June

  • Mingchao leaves his role as security officer today. Thanks to him for many years of contributions. His role will be covered by members of the ops team - recruitment of a new security officer is underway.

Thursday 31st May

  • Please note that the Technical Evolution Strategy documents have been moved within the WLCG Document Repository. The folder is now under this area.
  • Check site gstat values
Tier-1 - Status Page

Tuesday 19th June

  • The update of Castor on Wednesday 13th June to version 2.1.11-9 went well.
  • The update of the database behind the non-LHC VO's LFC and FTS on Wednesday 13th June was problematic. The FTS service was restored using a clean database that afternoon. The LFC service was not restored until the Friday morning (15th).
  • Today (Tues. 19th) there is a site networking upgrade.
  • A problem with file transfers to/from the German Tier1 (FZK) has been investigated and worked around.
  • Castor databases will be updated to Oracle 11 on Wednesday 27th June.
Storage & Data Management - Agendas/Minutes

Wednesday 6 June 2012 - we are still digesting CHEP information (see also the blog), plus some of the usual operational and upgrade work. Hoping to find a few spare clock cycles for some slightly more experimental stuff.

Wednesday 23 May 2012 - lots of exciting stuff at CHEP, we have about five things in, some posters, some oral.


Accounting - UK Grid Metrics HEPSPEC06

Wednesday 6th June - Core-ops

  • Request sites to publish HS06 figures from new kit to this page.
  • Please would all sites check the HS06 numbers they publish. Will review in detail on 26th June.

Friday 11th May - HEPSYSMAN

  • Discussion on HS06 reminding sites to publish using results from 32-bit mode benchmarking. A reminder for new kit results to be posted to the HS06 wiki page. See also the blog article by Pete Gronbech. The HEPiX guidelines for running the benchmark tests are at this link.


Friday 4th May - TB-SUPPORT

  • Long-standing problem: gstat overcounts things like RunningJobs and TotalJobs if there are multiple CE nodes feeding the same set of queues. The gstat developers have just put a comment on the ticket for that, indicating that they have done something to remove the duplicates.
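
To illustrate the overcounting described above, here is a minimal Python sketch. It assumes a hypothetical list of records, one per CE and queue, where two CEs report the same counts for a shared queue. The field names (ce, queue, running_jobs, total_jobs) and hostnames are made up for illustration; this is not gstat's data model, nor the fix the gstat developers applied.

```python
# Hypothetical records: two CEs front the same "long" queue, so each of
# them reports the same job counts for it.
records = [
    {"ce": "ce01.example.ac.uk", "queue": "long", "running_jobs": 100, "total_jobs": 150},
    {"ce": "ce02.example.ac.uk", "queue": "long", "running_jobs": 100, "total_jobs": 150},
    {"ce": "ce01.example.ac.uk", "queue": "short", "running_jobs": 20, "total_jobs": 25},
]


def naive_totals(recs):
    """Sum every record, so a shared queue is counted once per CE."""
    running = sum(r["running_jobs"] for r in recs)
    total = sum(r["total_jobs"] for r in recs)
    return running, total


def deduplicated_totals(recs):
    """Count each queue once, keeping the first record seen for it."""
    first_seen = {}
    for r in recs:
        first_seen.setdefault(r["queue"], r)
    running = sum(r["running_jobs"] for r in first_seen.values())
    total = sum(r["total_jobs"] for r in first_seen.values())
    return running, total


if __name__ == "__main__":
    print("naive:", naive_totals(records))
    print("deduplicated:", deduplicated_totals(records))
```

Keying on the queue name alone is an assumption of this sketch; a real site would need whatever identifier actually distinguishes one batch queue from another.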
Documentation - KeyDocs

Wednesday, 6th June

Released a document, hep.ph.liv.ac.uk/~sjones/VomsSnooper.odt, that describes how to

  • Maintain site VOMS info document for the approved VOs
  • Check a site's VOMS records correspond exactly with CIC portal
  • Create new site VOMS records direct from CIC portal, without manual transcription
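
The cross-check in the second bullet can be sketched in Python as below. This is not the procedure from the VomsSnooper document; it only assumes that both the site records and the CIC portal records have already been parsed into dictionaries mapping a VO name to a set of (host, port, DN) tuples. All VO names, hosts and DNs shown are invented.

```python
def compare_voms_records(site, portal):
    """Report VOs that are missing, unexpected, or different at the site."""
    site_vos, portal_vos = set(site), set(portal)
    return {
        "missing_at_site": sorted(portal_vos - site_vos),
        "not_in_portal": sorted(site_vos - portal_vos),
        "differing": sorted(
            vo for vo in site_vos & portal_vos if site[vo] != portal[vo]
        ),
    }


# Hypothetical reference records, as if taken from the portal.
portal_records = {
    "vo.alpha.example": {("voms1.example.org", 15001, "/C=XX/O=Example/CN=voms1.example.org")},
    "vo.beta.example": {("voms2.example.org", 15002, "/C=XX/O=Example/CN=voms2.example.org")},
    "vo.delta.example": {("voms4.example.org", 15004, "/C=XX/O=Example/CN=voms4.example.org")},
}

# Hypothetical site records: one stale DN, one VO absent, one extra VO.
site_records = {
    "vo.alpha.example": {("voms1.example.org", 15001, "/C=XX/O=Example/CN=voms1.example.org")},
    "vo.beta.example": {("voms2.example.org", 15002, "/C=XX/O=Example/CN=old-voms2.example.org")},
    "vo.gamma.example": {("voms3.example.org", 15003, "/C=XX/O=Example/CN=voms3.example.org")},
}

report = compare_voms_records(site_records, portal_records)

if __name__ == "__main__":
    for category, vos in report.items():
        print(category, vos)
```

Comparing whole sets per VO means any change in host, port or DN marks the VO as differing, which matches the "correspond exactly" requirement.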

Note: I'm accepting tips from GridPP core task members etc. about other use cases for these processes.

This will be converted to wiki format and made available in the normal way. Next jobs:

  • Review the logic/sequence of the VOMS admin process, document it if it works, fix it if it doesn't.
  • Create a standard baseline for the proxy renewal process, and write it up in the wiki.

Note: I'm accepting tips from other GridPP core team members etc. for document priorities. Please think about where the problems lie (i.e. what costs us yet is easy to fix) and get back to me.


Tuesday, 29th May

  • VOMS Records in GridPP Approved VO list now up to date with CICs Portal XML. This can be used by Site Admins to ensure their site-info.def/vo.d directories are up to date. A tool, SidFormatter, will be released this week to facilitate comparison with the benchmark. A process has been devised to ensure that GridPP Approved VO is kept up to date to within a week of CIC Portal changes. Consultation to be made about further fields that we may wish to advertise in this manner.

Friday 27th April

  • Appeal for a volunteer to enhance the "Grid User Crash Course" (https://www.gridpp.ac.uk/wiki/Grid_user_crash_course) with a simple use case for dependable proxy renewal for long jobs, as this is a recurrent requirement that has caused multiple queries on TB-SUPPORT.
Interoperation - EGI ops agendas

Monday 18th June

The EMI 1 updates are just minor revisions: Top BDII, BLAH and Storm. A repackage for GFAL/lcg-utils to handle the globus lib dependency problems. Further EMI-2 updates, probably of interest only for those doing EA of them.

  • Staged rollout: Lots of EMI-2 packages are working their way through the verification/SR process. Software that is just a repackage from EMI-1 to EMI-2 is skipping SR on SL5 - the SL6 versions will be tested. Most of the products on SL5 support upgrade and reconfiguration from the EMI-1 versions.
  • Note that CREAM is one of the products that can't do an in-place update - new DB schema, so it needs a drain/wipe/re-install.
  • Question from Tiziana - anyone using CREAM in Cluster Mode? Any feedback on that?


Friday 8th June - EGI ops agenda

  • Sites not publishing UserDNs. There are a number of reasons: site policy; wrong configuration; and a bug that the APEL team will contact affected sites about (plus another that doesn't apply here: sites that publish via an aggregator). A list of sites publishing has been calculated:
  • UK sites not publishing UserDNs: UKI-LT2-Brunel; UKI-NORTHGRID-MAN-HEP; EFDA-JET; UKI-LT2-UCL-HEP; UKI-SCOTGRID-ECDF; UKI-SOUTHGRID-BRIS-HEP; UKI-SOUTHGRID-CAM-HEP
  • RHUL and RAL are counting slightly below 100% - I would expect that's probably because publishing was recently turned on, although there could be other issues lurking behind that stat (e.g. it could also have been recently turned off). EGI want a view on the various reasons for this. I suspect it will come down to one of three reasons: policy; plan to but have not had time yet; or some technical problem stopping it. Let SP know.
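
As a rough illustration of how a "slightly below 100%" figure can arise, the Python sketch below computes, per site, the percentage of accounting records that carry a user DN. The record layout (site, user_dn) and the site names are assumptions made for this example; it does not reflect how APEL or EGI actually calculated the list.

```python
from collections import defaultdict


def userdn_percentages(records):
    """Return {site: percentage of that site's records with a user DN}."""
    totals = defaultdict(int)
    with_dn = defaultdict(int)
    for rec in records:
        totals[rec["site"]] += 1
        if rec.get("user_dn"):  # None or "" counts as not published
            with_dn[rec["site"]] += 1
    return {site: 100.0 * with_dn[site] / totals[site] for site in totals}


# Hypothetical records: SITE-A enabled publishing part-way through the
# period, SITE-B does not publish user DNs at all.
example_records = [
    {"site": "SITE-A", "user_dn": None},
    {"site": "SITE-A", "user_dn": "/C=XX/O=Example/CN=user one"},
    {"site": "SITE-A", "user_dn": "/C=XX/O=Example/CN=user two"},
    {"site": "SITE-A", "user_dn": "/C=XX/O=Example/CN=user one"},
    {"site": "SITE-B", "user_dn": None},
    {"site": "SITE-B", "user_dn": ""},
]

percentages = userdn_percentages(example_records)

if __name__ == "__main__":
    for site, pct in sorted(percentages.items()):
        print(f"{site}: {pct:.1f}%")
```

A figure between 0% and 100% cannot by itself distinguish "recently turned on" from "recently turned off"; that would need the record dates as well.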


Monitoring - Links MyWLCG

Wednesday 6th June

  • Ranking continues. Plan to have a meeting in July to discuss good approaches to the plethora of monitoring available.
  • Glasgow dashboard now packaged and can be downloaded here.
On-duty - Dashboard ROD Rota

Monday 11th June - KM

  • Very busy week. A lot of sites were failing CAdist tests but all of them updated after being ticketed. Durham suffered another power cut and cooling failure and is still failing tests; a ticket is already open. Manchester is failing SE and CE tests. Two open tickets against Manchester, no update from site.

Friday 25th May - AM

  • Several sites have downtimes, and a few raised non-trivial alarms during the week. All are ok again now or still in downtime, with Durham having the ce01 and se02 tickets although it's in an unplanned downtime now.
  • Glasgow had problems with MPI alarms even though MPI shouldn't have been advertised, and these alarms were eventually closed still at critical on the Dashboard along with the corresponding ticket (the Nagios pages for the nodes themselves were back to all green due to the removal of the MPI endpoints.)
Rollout Status WLCG Baseline

Monday 11th June

  • EMI2 is released but not in Staged Rollout yet. Buyers beware.

Thursday 10th May

  • The CREAM CE and the WMS, which were released at the end of April, have finally gone into Staged Rollout.
  • Call for more sites to take part in EMI-2 rollout tests.
  • The overall SR contributions are in this table.

Friday 27th April

  • Updated version information on rollout page
  • WN scan indicates some sites not keen on OS updates to those nodes.
Security - Incident Procedure Policies

Wednesday 6th June

  • The GridPP security team have begun operating a weekly "on duty" rota to provide cover until a replacement for Mingchao is appointed.
  • UKNGI will not be contributing to the EGI security officer duty rota for the time being.

Tuesday 15th May

  • The next NGI/GridPP security team meeting is on Wednesday 16th.
  • Main discussion is on the rota and handover tasks.

Wednesday 2nd May

  • Setting up a security on-duty rota - to strengthen the team and cover the period after Mingchao leaves. Alessandra, Linda, Rob and Ewan are involved.
  • Currently reviewing the overall security task.
  • SSC5 preparations to start soon.


Services - PerfSonar dashboard

Tuesday 19th June

  • Some of the volunteer sites may not have perfsonar by end of June. Which other sites are close?
  • GridPP will resume running VOMS. Current plan is for the master to remain at Manchester and to host backups at Oxford/Imperial.

Wednesday 6th June

  • Plan to have 4 more sites perfsonar enabled by end of June.
  • Testing matrix to be confirmed.
  • Will survey sites looking for DRI kit deployment issues.
  • 23rd April requested network utilisation figures for March and April
  • LHCONE meeting last week in Stockholm
  • Agreed to focus on perfsonar.
Tickets

Monday 18th of June, 13:00 BST
22 Open UK tickets this week.

NGI
https://ggus.eu/ws/ticket_info.php?ticket=80259
A few finishing touches and neurogrid.incf.org will be ready for launch.

OXFORD
https://ggus.eu/ws/ticket_info.php?ticket=83330
Atlas FTS transfers to Oxford were suffering from time out failures (that appeared to occur in batches). As I understand it the Oxford-RAL timeout settings had been reduced from their original (very high) settings, they've now been loosened up somewhat.

GLASGOW
https://ggus.eu/ws/ticket_info.php?ticket=83283
LHCB have been having software-setting-up problems on some nodes, Dave expects this is due to problems chronicled in https://savannah.cern.ch/bugs/index.php?95420 & https://savannah.cern.ch/support/?129468 compounded by local bandwidth problems to some subsets of their machines.

QMUL
https://ggus.eu/ws/ticket_info.php?ticket=83020
Chris is waiting on the availability/reliability site to fix their certificate chain (https://ggus.eu/ws/ticket_info.php?ticket=83237) before he can fully comment on their stats for May.

https://ggus.eu/ws/ticket_info.php?ticket=83198
Queen Mary are decommissioning one of their CEs (ce03.esc.qmul.ac.uk). Chris split this ticket into 15 and assigned one to each VO it supported. Which leads to..

T2K
reference: https://ggus.eu/ws/ticket_info.php?ticket=83209
As seen in this incarnation of Chris' ticket, t2k have requested that t2k.org get a VO entry in GGUS. Has anyone started the ball rolling on this?

(PS Chris, the pheno & camont tickets look like they can be closed, I suspect the ngs one will take some time...)

DURHAM
https://ggus.eu/ws/ticket_info.php?ticket=83006
Availability/Reliability for May ticket. Mike put in a good (in my eyes) answer last week, but no movement from elsewhere on this ticket.

https://ggus.eu/ws/ticket_info.php?ticket=82214
https://ggus.eu/ws/ticket_info.php?ticket=82818
Both these tickets are looking almost wrapped up, nice one!

IC
https://ggus.eu/ws/ticket_info.php?ticket=82946
Still watching this ticket on atlas troubles with cvmfs, no movement although Daniela is on the case.

SUSSEX
https://ggus.eu/ws/ticket_info.php?ticket=81784
The certification infrastructure at GRNET has started to cause problems (again), Jeremy ticketed them (https://ggus.eu/ws/ticket_info.php?ticket=83284).

SOLVED CASES
https://ggus.eu/ws/ticket_info.php?ticket=83326
Raul at Brunel was having cvmfs troubles on a few nodes, fixed by a forced clean-up & restart. Not very interesting on its own, but there seem to be a number of cvmfs tickets cropping up.

NEW
https://ggus.eu/ws/ticket_info.php?ticket=82670
SNO+ ticket that Daniela brought back to my attention from last week, the apparent WMS problem was actually a CREAM side "misconfiguration", details in the ticket and e-mail Daniela sent to the list.

FROM THE UK:
(https://www.gridpp.ac.uk/wiki/Tickets_From_The_UK)
No significant change since last week on existing tickets.

https://ggus.eu/ws/ticket_info.php?ticket=83243
Daniela noticed that IC weren't updating in APEL, this looks to be caused by the Imperial CEs not being registered in the gocdb as APEL endpoints.

https://ggus.eu/ws/ticket_info.php?ticket=83352
Daniela's ticket to track problems seen in the SL6/EMI2 bdii.

Tools - MyEGI Nagios

Tuesday 12th June

  • Lancaster backup Nagios now available (link).
  • A reminder as to who can see the data: all members of ops and dteam and anyone who is registered in GOCDB as site-admin, regional manager etc. It is also possible to add anyone who has a PKI certificate.

Wednesday 6th June

  • By end of June plan is to have backup Nagios available.
  • Smaller VO automated testing still waiting for a bug fix.

Sunday 8th April

  • Lancaster Nagios backup hardware has arrived
  • Nagios based VO testing setup for vo.southgrid.ac.uk. Nagios view is here. Tests run every 8 hours. The results are also fed back into the dashboard and alarms triggered.
  • There is a bug which prevents multiple VO Nagios tests with one Nagios instance. Developers informed.
VOs - GridPP VOMS VO IDs Approved

Wednesday 6th June

  • Cross-checking VOs enabled vs VO table.
  • Surveying VO-admins for problems faced in their VOs.


Site Updates

Monday 18th June

  • Sussex still in certification. The WNs have been reconfigured.



Meeting Summaries
Project Management Board - Members Minutes Quarterly Reports

Monday 11th June

  • NGS future planning
  • Options to make VOMS more resilient
  • Plans for Tier-1 review (to be held 20th June)
GridPP ops meeting - Agendas Actions Core Tasks

Tuesday 8th May - link Agenda Minutes

  • ATLAS DATADISK now being used for production input files
  • Deploy perfsonar-ps (for LHC compatibility), but run a Perfsonar-MDM portal to collate the information
  • Target date for perfsonar-ps at sites is the end of July
  • KeyDocs still not in place for all areas

Tuesday 1st May 2012 - Agenda Minutes

  • Trying the present bulletin as a conduit for information
  • 7 sites below 90% in ATLAS monitoring!
  • Instructions for small VO space-token setup/usage requested
  • EMI WN tarball now available - comment in ticket
  • Not every site running latest SL5 OS
RAL Tier-1 Experiment Liaison Meeting (Wednesday 13:30) Agenda EVO meeting

Wednesday 6th June

  • All Castor instances switched to use the Transfer Manager (which replaces LSF) except Atlas which is scheduled for 7th June.
  • The following Tier1 interventions are announced (not yet in GOC DB):
    • Wednesday 13th June: Castor 2.1.11-9 update. (Expect Castor down from morning to early afternoon).
    • Wednesday 13th June: Oracle 11 update for LFC & FTS databases. (Expect these services down all working day).
    • Tuesday 19th June: Replacement of RAL Site Access Router. (Expect external connectivity broken for up to 3 hours in morning.)
    • Wednesday 27th June: Update Castor databases to Oracle 11. (Expect Castor down all working day).
  • Test of backup diesel generator unsuccessful (today, Wed 6th June) as it failed to start. Experts expected tomorrow to investigate. Site at an additional risk if power cut.
  • Useful discussion with NA62 about issues they are having getting going with using the site.
  • Operations report
WLCG Grid Deployment Board - Agendas MB agendas

June meeting Wednesday 13th June

WLCG meeting notes

Welcome [MJ]

  • August meeting is cancelled. October meeting is in Annecy.
  • EGI Technical Forum 17-21st September: http://tf2012.egi.eu/
  • HEPiX Fall – 15th-19th October.

Post-TEG Working Groups [Ian Bird]

  • Large number of WGs proposed.
  • DM&S: Benchmarking. Federation. Networking
  • WLM: Extensions of CE (multi-core; whole node; pilot support). Information System.
  • Security: Proposals coming
  • Database: share experiences.
  • Operations: m/w sw process. Monitoring.
  • Teams approaches: Operations coordination team. Sharing experiences/tech watch (pre-GDB discussion)
  • Possibly Missing? Cloud. SRM (but to be more generic in title!). Batch systems.

Storage Accounting (John Gordon)

  • StAR
  • Plan to publish to APEL but in EMI-3 for May 2013
  • Interim possibility to use gstat
  • Noted that information that is published is not precise.

Information System Status and Evolution (Maria Alandes Pradillo)

  • glue-validator (in EMI-1 and 2)
  • glue 2 still to be deployed widely
  • Future work (EMIR; ginfo and IS monitoring/metadata).
  • Question if OSG fully engaged?

AAI on WN update (Romain Wartel)

  • Security controls – central banning body required
  • ARGUS locally needed (to pull banning lists from central ARGUS)
  • Ownership of traceability. VO-site collaboration needed to cover all cases
  • Recommendations to fulfill logging and traceability policy on WN.
  • Not currently possible to use clouds (VMs) in a way that conforms with WLCG security policies.
  • Critical proxy extension (ALICE less limited)
  • Proxy lifetime - reduce back to 24hrs? Balanced compromise between complexity and risk. Proxy credentials cannot be revoked.
  • Pool account recycling – recycle only after 6 months.

EMI update (Cristina Aiftimiei)

  • EMI-1 at update 15 (23.04.2012)
  • EMI-1 Full support & maintenance until 28.02.2012. Updates till 31.10.2012.
  • EMI-2 released 21.05.2012. Supports SL5 and SL6. Some Debian6.
  • New products: CANL, EMIR, EMI-Nagios, Pseudonymity, WNoDeS.
  • Hydra and WMS not released yet.
  • Some backward incompatibilities due to existing EPEL package names.
  • UI/WN tarballs in the next update.


Globus SW support at OSG

  • Discussions including use of Cream/Glue2; this to be investigated as it impacts use of the WMS

EMI Sustainability Plans (Alberto Di Meglio)

  • The end of EMI is the end of the coordination between product teams – not the end of those product teams.
  • Ian Bird: the outcome of the above WLCG-EMI-EGI meeting needs to be how do we manage software in the future, also to discuss: how do we do certification, staged rollout and deployment in general.


Communicating Machine Features to Batch Jobs (Tony Cass)

  • Jeff will share a script for PBS to test implementation using /etc/machinefeatures.
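
A test of such an implementation might start from something like the Python sketch below. It is not Jeff's PBS script. It assumes, purely for illustration, that the machine features location is a directory containing one small text file per key, with the file content as the value; the key names in the demo are invented, and the demo uses a temporary directory rather than /etc/machinefeatures.

```python
import os
import tempfile


def read_machine_features(directory):
    """Return {key: value} for every regular file in `directory`.

    Assumption: each file name is a key and the file's text is its value.
    """
    features = {}
    for name in sorted(os.listdir(directory)):
        path = os.path.join(directory, name)
        if os.path.isfile(path):
            with open(path) as handle:
                features[name] = handle.read().strip()
    return features


def make_demo_directory():
    """Create a throwaway directory holding two made-up keys."""
    demo = tempfile.mkdtemp()
    for key, value in {"example_key": "42", "another_key": "hello"}.items():
        with open(os.path.join(demo, key), "w") as handle:
            handle.write(value + "\n")
    return demo


if __name__ == "__main__":
    # On a worker node one might instead pass the real location.
    print(read_machine_features(make_demo_directory()))
```

Reading every file generically, rather than a fixed list of keys, keeps the sketch independent of whichever keys a site chooses to publish.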

MUPJ – gLexec update (Maarten Litmaath)

Federated Identity Vision (Romain Wartel)

  • Document presented at last GDB. Approved by MB on 5th June.
  • Pilot project for WLCG - any volunteers to be involved?



NGI UK - Homepage CA

Monday 14th May

  • CA: Retiring 2007 CA cert: it stopped signing last October and now *must* vanish from IGTF at the end of Sep release.
  • CA: We won't be able to meet IGTF targets of 1st October 2012 for IPv6 support for our CRLs, Networks suggest even March 2013 might be optimistic.
  • Services: Leeds to deploy a clean WLCG Nagios development platform with help from Kashif
  • UKIROC decommissioning in progress
  • NGS Ops team to reconsider resilience of core services following recent network outages at RAL

Friday 9th March NGS-CA-TAG meeting

  • Priorities discussion for the CA. Plans to be clarified for future meeting.
  • Mathew Dovey has taken over from Neil Geddes as EGI Council and EB chair.
  • Email addresses now removed from certificates.
Events

WLCG workshop - 19th-20th May (NY) Information

CHEP 2012 - 21st-25th May (NY) Agenda

UK ATLAS - Shifter view News & Links

Wednesday 2nd May

  • Testing glideinWMS but some problems spotted


Tuesday 24th April

  • ATLAS has started to run reco jobs at T2s more extensively. These require bigger input data sets, which should be copied to the DATADISK space token. Some are being copied to PRODDISK. If you see PRODDISK filling up you should take action.

Thursday 3rd April

  • ECDF has a full data disk. Every time this happens the site is automatically blacklisted and cleaned up by ATLAS; however, this automatic process results in a few thousand failed transfers. There is also a concern that these failing transfers are potentially creating dark data. Wahid Bhimji is contacting ATLAS DDM operations to find out if these problems can be fixed. We are concerned that this may be a first for ATLAS as ECDF has a small amount of disk compared to its available CPU.
  • A disk server at Oxford has failed with the loss of a significant fraction of data. Users are noticing their jobs are failing at Oxford. We are going to proceed with the lost file recovery as soon as possible although are waiting for confirmation from the site. (https://ggus.eu/ws/ticket_info.php?ticket=81788)
  • The next meeting will be 17th May because of HEP Sysman being held at RAL on the 10/11th May.
UK CMS

Tuesday 24th April

  • Brunel will be trialling CVMFS this week, will be interesting. RALPP doing OK with it.
UK LHCb

Tuesday 24th April

  • Things are running smoothly. We are going to run a few small scale tests of new codes. This will also run at T2, one UK T2 involved. Then we will soon launch new reprocessing of all data from this year. CVMFS update from last week; fixes cache corruption on WNs.
UK OTHER

Tuesday 24th April

  • T2K are having problems with WMS proxy renewal, and some WMSes are advertising support but don't actually work. Could be user error, but will need investigation.
Requests

  • More sites needed to test EMI-2