Operations Bulletin 170613

From GridPP Wiki

Bulletin archive


Week commencing 10th June 2013
Task Areas
General updates

Joint T1/T2 availability/reliability reporting (more information).

Tuesday 11th June

  • There is a GDB tomorrow (agenda). Topics: Security; HEPiX; Ops; Tier-0; Monitoring.
  • John Green has now left STFC/GridPP and the security coordination role returns to the security team.
  • The June EGI OMB takes place on 18th June. Please let Jeremy know of any EGI management issues that you would like discussed.
  • GGUS apologise for duplicate GGUS tickets being created for Savannah tickets (spotted by CMS). There was a bug in their mail parser.
  • GridPP/UKI ROC technical meetings have been moved from under the old EGEE Indigo category to GridPP-Technical-Operations. If as a consequence you need edit rights re-enabled please contact Jeremy.
  • The scope of the biweekly Friday GridPP Cloud meeting is increasing to cover other areas of development. Meeting agendas will appear under the GridPP technical area in Indigo. The agenda of the latest meeting is at this link.
  • Requests for CHEP2013 funding support were due with Dave Kelsey by Friday last week.
  • The NGI will shortly 'close' the sites that have remained in the 'uncertified' state for a long period (basically the NGS sites).
  • A new grid information system website is now available.


Monday 3rd June

  • Bristol is considering moving subnet. See the LCG-ROLLOUT thread for more details.
  • Currently checking the status of MPI across UK sites (more to follow).

Tuesday 28th May

  • There is an EGI OMB this morning (agenda).
  • Some VOs are hitting the default 365 day membership point. VO-admins can extend to a longer default and can renew individual memberships.
  • In process of putting names against Tier-2 GDB representation rota.
  • Updating of the HS06 table.
  • There was a GridPP cloud meeting on Friday (agenda: minutes).
  • A reminder to provide informal feedback on your site findings on EMI2/3 and SL5/6 compatibility.
WLCG Operations Coordination - Agendas

Tuesday 10th June

  • There was a short WLCG ops coordination meeting last Thursday 6th June (agenda: minutes).

Monday 3rd June

  • Minutes from the WLCG ops coordination meeting last Thursday.


Tuesday 30th April

Extracts from the 25th April 2013 meeting minutes

  • EGI operations
    • User suspension policy is under discussion; the emergency procedure for central suspension has been extended.
    • GGUS: new workflows have been defined. The most important one is that every supporter will have rw access to all the tickets and no best effort from the support units or team products will be accepted. Tickets should not be left without an answer.
    • Continuation of support for several products still needs to be clarified, including WMS and EMI-Common (UI, WN, YAIM, Torque config, emi-nagios). EGI will liaise directly with PTs to get information about release and software support plans.
  • Experiments
    • Atlas
      • SL6: see SL6 TF
      • cvmfs: waiting for a stable cvmfs 2.1 for sites that need NFS export. 2.1.9 is promising but needs more testing.
    • CMS
      • Submission of HammerCloud through GlideinWMS: comparison to gLite done, o.k. Will switch at the beginning of May.
      • Updates of Squid configuration to WLCG monitoring: about a third of the sites have done it. Followed in CMS Computing Operations.
    • LHCb
      • glexec: solved problems with software, deployment is manpower intensive though and experiment doesn't want to do it. WLCG needs to take care of it.
      • SL6: see TF
  • cvmfs
    • Atlas and LHCb deadline is the 30 April [note: UK is ok]
    • CMS deadline is 1 April 2014, but already from September 30 no software installation jobs will be sent.
    • CVMFS 2.1.9 is about to be released; the update is recommended but sites using the NFS export or at the 2.0 version should test it carefully for a few weeks.
    • Finally, the testing and deployment process is described; in particular, sites should upgrade their nodes in stages. Interested sites are invited to join the "pre-production" effort.
  • glexec
    • Maarten asks that the MB reiterate that all sites need to install glexec.
  • perfSONAR
    • Release candidate v3.3 is out. It is currently being tested in the US and at CERN and is not yet suggested to sites, but if no problems are discovered in 2 weeks sites will be encouraged to install this version.
  • SL6
    • Created a deployment page to track sites status: https://twiki.cern.ch/twiki/bin/view/LCG/SL6DeploymentSites
    • Increased information in procedures https://twiki.cern.ch/twiki/bin/view/LCG/SL6Migration#Procedures_and_how_to_contact_ex
      • None of the experiments want mixed queues; they ask for one queue per architecture.
      • Best way to go is to reuse an existing queue if possible
      • WLCG repository has been created and should be enabled by all sites. It already contains the latest version of HEP_OSlibs [tested by Brunel and Oxford plus two Turkish sites]
      • LHCb requires the CE/queue information to be published in production in the BDII otherwise they don't see them automatically and would like to avoid manual steps for 150 sites. [RAL SL6 testing queue wasn't published and this is why it wasn't used. It is now]
      • Atlas has found a problem with the excessive number of file descriptors (similar to those observed by Brunel on their SL6 CE). Problem has been passed to the TF.
  • Frontier/squid
    • Squid upgrade is now being followed by CMS and Atlas computing respectively
    • Dave Dykstra is not part of squid support anymore
      • Representatives of CMS and Atlas Frontier/squid groups have joined WLCG Coord to replace him and help with future squid requests.

Tuesday 23rd April

  • The next meeting takes place this Thursday (agenda)


Tier-1 - Status Page

Tuesday 11th June

  • The planned UPS/generator load test on the 4th June ran into some problems. There was a temporary problem with cooling in the HPD room. There was no effect on services apart from one batch of WNs being stopped manually. The test will need to be re-done.
  • Testing of alternative batch systems (slurm & Condor) proceeding.
  • Investigations are still ongoing into problems at batch job set-up.
Storage & Data Management - Agendas/Minutes

Tuesday 28th May

  • The 'Big Data' agenda is being compiled here. There is also now a suggestion for a cross disciplinary clouds and virtualisation workshop in July - the idea is 'in progress' but no more detail is yet available.

Tuesday 21st May

  • Do we have an agenda page for the June workshop?

Wednesday 1 May 2013

  • Puppet report from March Puppet camp in London
  • Technical suggestions for hepsysman, or otherwise

Tuesday 30th April

  • The DPM collaboration formally starts on 1st May 2013. For those interested in the collaboration agreement see this page.


Accounting - UK Grid Metrics HEPSPEC06 Atlas Dashboard HS06

Tuesday 30th April

  • A discussion is starting about how to account/reward disk that is reallocated to LHCb. By way of background, LHCb is changing its computing model to use more of Tier-2 sites. They plan to start with a small number of big/good T2 sites in the first instance, and commission them as T2-Ds with disk. Ideally such sites will provide >300TB but for now may allocate 100TB and build it up over time. Andrew McNab is coordinating the activity for LHCb. (Note the PMB is already aware that funding was not previously allocated for LHCb disk at T2s).

Tuesday 12th March

  • APEL publishing stopped for Lancaster, QMUL and ECDF

Tuesday 12th February

  • SL HS06 page shows some odd ratios. Steve says he now takes "HS06 cpu numbers direct from ATLAS" and his page does get stuck every now and then.
  • An update of the metrics page has been requested.
Documentation - KeyDocs

See the worst KeyDocs list for documents needing review now and the names of the responsible people.

Tuesday 30th April

Tuesday 9th April

  • Please could those responsible for key documents start addressing the completeness of documents for which they are responsible? Thank you.

Tuesday 26th February

KeyDocs monitoring status: Grid Storage(7/0) Documentation(3/0) On-duty coordination(3/0) Staged rollout(3/0) Ticket follow-up(3/0) Regional tools(3/0) Security(3/0) Monitoring(3/0) Accounting(3/0) Core Grid services(3/0) Wider VO issues(3/0) Grid interoperation(3/0) Cluster Management(1/0) (brackets show total/missing)

Thu, 25th April

  • Process for de-emailing the host certificates of a site. This allows an admin to replace certificates with email address components in the DN - they cause havoc in the security system.

Monday, 4th March

  • New draft document for putting a CE in downtime. It discusses the pros and cons of 3 approaches. Needs discussion to finalise.
  • EPIC integrated into Approved VOs, pending full acceptance.
  • Process commissioned to delete stale documents.

Thursday, 29th November

The Approved VOs document has been updated to automatically contain a table that lays out the resource requirements for each VO, as well as the maximum. We need to discuss whether this is useful - it seems that the majority of WN software requirements are passed around by word of mouth etc. Should this be formalized? Please see

   https://www.gridpp.ac.uk/wiki/GridPP_approved_VOs#VO_Resource_Requirements

This table will be kept up to date with a regular process that syncs it with the CIC Portal, should it prove to be useful.

Interoperation - EGI ops agendas

Monday 10th June

  • There was an EGI ops meeting today.
  • Staged Rollout: Highlights from the SR/UMD process are:

    • cream-slurm: verification under way.
    • cream-gridengine: new EMI-3 release on 3rd June fixes some issues with the previous version. Verification to start shortly.
    • StoRM: v1.11.10 in EMI-3 passed verification.
    • EMI-WN: for EMI-2, an update to fix issues with SL6.

  • SHA-2: Grid CAs will be able to issue certificates based on the SHA-2 digest soon. As a prelude to that, there's a list of software versions that support SHA-2 certificates.
  • An official calendar will be set out shortly, but there will be Ops portal alarms for sites with software that doesn't support SHA-2. In general, this is the EMI-2 / UMD-2 version, with the following exceptions (version with support in brackets):

    • CREAM (v1.14.4 from EMI-2 does, so a recent CREAM should do)
    • dCache (not released yet)
    • Pseudonymity (EMI-3)
    • StoRM (EMI-3, v1.11.10)
    • WMS (EMI-3, v3.5.0)
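
A site wanting to check its own installed versions against the exceptions above could do something along these lines. This is only a minimal sketch: the function names are hypothetical, the minimum versions are copied from the list above, and dCache and Pseudonymity are left out because no version number is given for them. Versions are compared as integer tuples because a plain string comparison would rank 1.11.9 above 1.11.10.

```python
# Minimum versions with SHA-2 support, copied from the exceptions listed above.
# dCache (not released yet) and Pseudonymity (only "EMI-3" is given) are
# deliberately left out because the bulletin quotes no version number for them.
SHA2_MIN_VERSION = {
    "CREAM": (1, 14, 4),
    "StoRM": (1, 11, 10),
    "WMS": (3, 5, 0),
}


def parse_version(text):
    """Turn a dotted version string such as '1.11.10' into a tuple of ints."""
    return tuple(int(part) for part in text.split("."))


def supports_sha2(component, version):
    """Return True or False for listed components, None when no minimum is known.

    Assumes three-part dotted versions, as used in the list above.
    """
    minimum = SHA2_MIN_VERSION.get(component)
    if minimum is None:
        return None
    return parse_version(version) >= minimum
```

For example, `supports_sha2("StoRM", "1.11.9")` is false while `supports_sha2("StoRM", "1.11.10")` is true; a component that is not in the table returns `None` rather than a guess.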

Tuesday 21st May

  • Updates from 15th May EGI ops meeting.
  • StagedRollout - EMI/UMD 3 update
    • A few minor update issues on LFC; Top BDII; DPM; ARGUS; UI; WN and LB. (Details)
    • More significant points: EMI-3 Cream installed EMI-3 APEL by default, see here for a way to stick on EMI-2 APEL.
    • EMI-3 APEL is _not_ backwards compatible, and needs configs changed on the APEL central. There is an upgrade plan to be followed. (Summary: You need to have a GGUS ticket, and have a dialogue with APEL to do the upgrade).
    • StoRM will be supported on EMI-1 until 21st July.
    • EMI/UMD-2 LB server: there's an incompatibility with the logging package in EPEL. If anyone runs this, don't do updates until this is fixed (or see this workaround).
gLite support calendar.


Monitoring - Links MyWLCG

Tuesday 13th May

  • David C will present the monitoring work at Glasgow (based on Graphite) at the HEPSYSMAN meeting in June.
  • Glasgow dashboard now packaged and can be downloaded here.
On-duty - Dashboard ROD rota

Tuesday 28th May

  • Very quiet week. No open tickets.

Tuesday 21st May

  • Quiet week. The only outstanding issue was the QMUL Storm.

Monday 13th May

  • QMUL received an EMI-1 ticket for Storm. Otherwise a quiet week.
Rollout Status WLCG Baseline

Tuesday 14th May

  • A reminder. Please could sites fill out the EMI-3 testing contributions page. This is for all testing not just SR sites as we want to know which sites have experience with each component.

References


Security - Incident Procedure Policies Rota

Tuesday 11th June

  • We would like to collect immediate feedback on the security training held last week in conjunction with HEPSYSMAN.
  • Suggestions on future training and on the content last week would be useful.
  • John added a wiki page on forensics.

Tuesday 21st May

  • SL6 vulnerability. Need to track progress. (See private thread).


Services - PerfSonar dashboard | GridPP VOMS

Monday 10th June

  • Issue with neurogrid.incf.org ownership. Is more guidance needed?
  • Where are we with the perfsonar mesh?
  • Are we ready for full rollout of the VOMS backups?

Monday 20th May

  • Letter sent to Internet2 from GridPP management.

Tuesday 14th May

  • The perfSonar support team is asking for statements from the projects using it to help secure funding for their team. The WLCG TF is looking for statements from the WLCG MB and Computing coordinators, but it was agreed that statements from the sites would also help. Below is the email sent to the users mailing list.

Tuesday 23rd April

Tickets

Monday 10th June 2013 13.00 BST
16 Open UK tickets this week, some of them so fresh the ink is still wet. And then I come in this morning and find a bunch of them solved, nice work! Shame that the tickets keep coming.


NEW
https://ggus.eu/ws/ticket_info.php?ticket=94780 (11/6)
The NGI has received a request to instantiate a cloud site. Assigned (11/6)

https://ggus.eu/ws/ticket_info.php?ticket=94766 (10/6)
The UK has received a ticket about the use of FTS3, which is causing errors for some atlas transfers (specifically to NET2). Brian is on the case. In progress (11/6)

Tier 1
https://ggus.eu/ws/ticket_info.php?ticket=94758 (10/6)
CMS noted a SAM test failure at RAL, who are already on it (and think it should have been cleared up). Waiting for reply (to confirm) (10/6) SOLVED

https://ggus.eu/ws/ticket_info.php?ticket=94755 (10/6)
A user, from an unspecified VO, is having trouble retrieving data from a RAL WMS. A bit of a cryptic ticket. Assigned (10/6)

https://ggus.eu/ws/ticket_info.php?ticket=94505 (3/6)
https://ggus.eu/ws/ticket_info.php?ticket=94615 (5/6)
These two tickets are almost identical: CMS Hammercloud failures caused by high load on the CMS Castor instance. In progress (10/6) Both SOLVED

https://ggus.eu/ws/ticket_info.php?ticket=94543 (4/6)
Sno+ were having problems receiving their outputs from the RAL WMS. Daniela referenced https://ggus.eu/ws/ticket_info.php?ticket=92288 and gave a possible fix; I don't know if this was tried out. In progress (5/6)

https://ggus.eu/ws/ticket_info.php?ticket=86152 (17/9/2012)
Correlated Packet Loss on the RAL Perfsonar. This ticket has passed its reminder date, so could do with updating before it gets too whiffy. On hold (19/3)

https://ggus.eu/ws/ticket_info.php?ticket=91658 (20/2)
Enabling webdav support on the RAL LFC. Waiting for a reply offline regarding the expected update to be put into production. On hold (29/5)

https://ggus.eu/ws/ticket_info.php?ticket=94731 (7/6) Request from Chris to enable cernatschool on the RAL WMS. In progress (10/6)

QMUL
https://ggus.eu/ws/ticket_info.php?ticket=94510 (3/6)
QM were publishing too large a MaxCPUTime, causing lhcb some grief. The interesting part was that Chris found his changes would work when he ran a `/etc/init.d/bdii restart`, but not with a `service bdii restart` (or a reboot). Some sge related variables in /etc/profile.d weren't being seen in the latter two instances. Maarten posted a patch to the bdii script, and discussion brought to light an old ticket from Andrew at ECDF: https://ggus.eu/ws/ticket_info.php?ticket=88284. This ticket looks like it can be closed though. In Progress (10/6)
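
The difference between the two restart commands in ticket 94510 is consistent with the `service` wrapper on Red Hat style systems starting init scripts with a scrubbed environment (it goes through `env -i` and passes on only a few variables such as PATH and TERM), while a script run directly from a login shell inherits whatever /etc/profile.d exported. That is offered as a likely explanation, not a confirmed diagnosis of the QMUL case. The sketch below reproduces the effect on any POSIX system; the variable name `DEMO_SGE_SETTING` and the helper names are made up for illustration.

```python
import os
import subprocess

# Hypothetical variable standing in for the SGE settings from /etc/profile.d.
VAR = "DEMO_SGE_SETTING"

# Shell one-liner that prints the variable, or "unset" when it is missing.
CHILD = 'echo "${%s:-unset}"' % VAR


def run_child(env):
    """Run the child shell with the given environment and return its output."""
    result = subprocess.run(
        ["/bin/sh", "-c", CHILD],
        env=env,
        stdout=subprocess.PIPE,
        universal_newlines=True,
        check=True,
    )
    return result.stdout.strip()


def inherited_env():
    """Environment as a script sees it when started from a login shell."""
    env = dict(os.environ)
    env[VAR] = "from-profile"
    return env


def scrubbed_env():
    """Environment cut down to a short whitelist, roughly what `env -i` leaves."""
    keep = ("PATH", "TERM", "LANG")
    return {k: v for k, v in inherited_env().items() if k in keep}
```

Calling `run_child(inherited_env())` shows the value, while `run_child(scrubbed_env())` reports it as unset. A script that has to work under both invocations needs to load the settings it depends on itself, rather than rely on the caller's environment.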

https://ggus.eu/ws/ticket_info.php?ticket=94746 (10/6)
Biomed complaining as the QM SE published Biomed support when it is supposed to be decommissioned for the VO. Perhaps some yaim artifact has cropped up in a recent reconfigure? Assigned (10/6)

IC
https://ggus.eu/ws/ticket_info.php?ticket=94358 (27/5)
Biomed complaining about software version tests failing at IC, due to the probe not containing the up-to-date tweak to take into account the tarballs. Daniela links two tickets trying to get this taken into account (one from me, 90768, and one from herself, 89891). I also wonder why Lancaster hasn't received one of these tickets. Waiting for reply (10/6)

ECDF
https://ggus.eu/ws/ticket_info.php?ticket=94781 (11/6)
An Ops test, eu.egi.sec.WN-ops, is failing. I think you have a glite-version "hack" in your path which is causing the failure. The official way of publishing your version is to have EMI_TARBALL_BASE in your environment; perhaps it disappeared on the move to SL6? Assigned (11/6)

GLASGOW
https://ggus.eu/ws/ticket_info.php?ticket=94732 (7/6)
Chris requested cernatschool support on the Glasgow WMS. Gareth has set things up and requests a test run (but noted that he couldn't see them in the CiC portal, and I couldn't either). Waiting for reply (10/6)

RALPP
https://ggus.eu/ws/ticket_info.php?ticket=94602 (5/6)
Hone were having jobs aborted on one of RALPP's queues. Chris kicked things but problems continue. In Progress (6/6) SOLVED

SUSSEX
https://ggus.eu/ws/ticket_info.php?ticket=94241 (21/5)
Please close the ticket. I draw the line at closing people's tickets. Don't make me cross that line! In Progress (30/5) SOLVED - THANKS!

UCL
https://ggus.eu/ws/ticket_info.php?ticket=94247 (21/5)
Atlas WLCG squid change over. Ben has installed the new squid and is holding the ticket until ready for a changeover. On hold (but reminder date passed) (30/5)

SHEFFIELD
https://ggus.eu/ws/ticket_info.php?ticket=94423 (30/5)
Atlas space shuffling request, from LOCALGROUP to DATA. Elena informed that juggling tokens was difficult due to a dodgy disk server. In progress (30/5) SOLVED

Tools - MyEGI Nagios

Tuesday 11th June

  • Installation of DIRAC instance at IC ready for 'another' test user.

Tuesday 13th November

  • Noticed two issues during the tier1 powercut. The SRM and direct cream submission tests use the top bdii defined in the Nagios configuration to query about the resource. These tests started to fail because the RAL top BDII was not accessible. The configuration doesn't use BDII_LIST, so I cannot define more than one BDII. I am looking into how to make it more robust.
  • Nagios web interface was not accessible to a few users because of GOCDB being down. It is a bug in SAM-nagios and I have opened a ticket.

Availability of sites has not been affected by this issue, because Nagios sends a warning alert when it is not able to find a resource through the BDII.
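
One generic way to make a lookup like this more robust is to try a list of top BDIIs in order and use the first one that answers. The sketch below shows only that pattern and is not how SAM-nagios is implemented: the function names are hypothetical, the actual LDAP query is left as a callable supplied by the caller, and the comma-separated `host:port` format is an assumption about how such a list would be written.

```python
def parse_bdii_list(value):
    """Split an assumed comma-separated 'host:port,host:port' string into a list."""
    return [item.strip() for item in value.split(",") if item.strip()]


def query_with_failover(endpoints, query):
    """Try each endpoint in order; return (endpoint, result) for the first success.

    `query` is any callable that takes an endpoint and returns a result,
    raising an exception when that endpoint cannot be reached.
    """
    errors = []
    for endpoint in endpoints:
        try:
            return endpoint, query(endpoint)
        except Exception as exc:  # broad on purpose: any failure moves us on
            errors.append("%s: %s" % (endpoint, exc))
    raise RuntimeError("all endpoints failed: " + "; ".join(errors))
```

With two endpoints configured, an outage of the first one costs a failed attempt and the test still gets its answer from the second. Only when every endpoint fails does the caller see an error, and that error lists what went wrong for each one.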


VOs - GridPP VOMS VO IDs Approved VO table

Thurs 6th June

  • SNO+ jobs now work through the Glasgow WMS

Mon 20 May

  • RAL wms02 and wms03 seem to have been taken out of commission but were still in the information system.
  • Glasgow WMS doesn't accept SNO+ jobs (https://ggus.eu/ws/ticket_info.php?ticket=94213)
  • SNO+ filling with water and expect to be taking test data Aug/Sept - expect more grid use after that.
  • Epic doing serious testing - running at Glasgow, Liverpool and Lancs.

Thurs 16 May

  • SL6 - likely to be deployed for LHC VOs, non LHC should be aware - see mail to vo-admins list.


Monday 8th April

  • Please note Chris W is away this week.
  • Information is being gathered for the Q1 2013 quarterly report.

Tuesday 2 April 2013

Monday 4th March 2013

Site Updates

Actions


Meeting Summaries
Project Management Board - Members Minutes Quarterly Reports

Empty

GridPP ops meeting - Agendas Actions Core Tasks

Empty


RAL Tier-1 Experiment Liaison Meeting (Wednesday 13:30) Agenda EVO meeting

Wednesday 22nd May

  • Operations report
  • On Tuesday (21st) Tier1 Castor & batch services were stopped for a change in the RAL network, which was carried out successfully.
  • Plans for updating Tier1 Castor to version 2.1.13-9 are delayed. The (non-Tier1) Castor instance that has been upgraded has uncovered a problem that needs to be understood and resolved before the Tier1 Castor instances are upgraded.
WLCG Grid Deployment Board - Agendas MB agendas

Empty



NGI UK - Homepage CA

Empty

Events

Empty

UK ATLAS - Shifter view News & Links

Empty

UK CMS

Empty

UK LHCb

Empty

UK OTHER
  • N/A
To note

  • N/A