Operations Bulletin 260813


Bulletin archive


Week commencing 19th August 2013
Task Areas
General updates

  • An EGI task force has started (see kickoff agenda) to develop the use of CVMFS by non-LHC VOs. Task force progress is being tracked in the associated CVMFS wiki page.

Tuesday 20th August

  • An EMI-2 SL5 tarball WN has been put into CVMFS thanks to Matt and Jakob (see GGUS:96030 for further information).
  • SL6 s/w areas have been discussed several times - the conclusion is that it is too late for a single approach to be adopted.
  • The WLCG T2 availability/reliability figures are now final.
  • A provisional date of 2nd September has been scheduled for the GOCDB v5 release. This is a major release which involves migration of the data to a new domain model that supports different DBs. The change involves the use of new primary keys. For more (testing) information see here.


Monday 12th August

  • Redirector problems affecting many UK ATLAS sites.
  • EGI reliability/availability of UK core services for July 2013 (i.e. top BDII) is recorded as good (almost 100%).
  • EGI also monitors site availability/reliability (results akin to the WLCG numbers) and the figures are available via these tables. For July use this link.
  • WLCG operations generally quiet (Monday's summary). Main UK issue is CMS Caltech-RAL transfers - under investigation.
  • A reminder that registration for GridPP31 is now open. The meeting will be at Imperial on 24th and 25th September.


Tuesday 30th July

  • GridPP news: Tomorrow is the last day that Stuart will be with GridPP ... thanks Stuart for your contributions! Coincidentally, Neasan will also be moving on after tomorrow so again thanks are due. Good luck to both!
  • GLUE2 BDII output at Liverpool - restart fixed the problem. Midmon was reporting CRITICAL - errors 1068, warnings 1071, info 1299.
  • All UK ATLAS sites are now managed by RAL FTS3, though not all site transfers use it at the moment. QMUL has an issue due to StoRM.
  • A reminder that (EGI/NGI) operations procedures for certain key tasks can be found linked from here in the EGI wiki. In particular PROC13 lists the steps we are expected to follow when decommissioning a VO (something we will be doing shortly).
  • EGI is setting up a task force to explore CVMFS as a service for all EGI VOs. This follows Ian's talk at the Manchester Community Forum in April. Catalin will be leading the task force.
  • Early bird registration for the EGI Technical Forum will now end at midnight on Sunday 4th August 2013.
  • The Tier-1 will cease operation of the RAL AFS cell on the 31st October this year.
WLCG Operations Coordination - Agendas

Monday 12th August

  • There have been no recent meetings. The next is on 29th August.

Tuesday 16th July

  • SL6
    • EMI-3 voms-proxy-info: a third problem, Java eating away memory. You can follow the story in tickets GGUS 94878 and GGUS 95574.
      • A fix is in the testing repositories and has been tested at Liverpool and Oxford.
    • UK status: 4 sites online, 3 testing, 7 with a plan, 3 without a plan (UCL, Durham, RALPP).
    • Presentation today at Atlas ADC weekly
    • Checking now with sites how LHCb is doing. Not running everywhere it seems.
  • Monitoring
  • Next Coord meeting Thursday 18/7/2013

Tuesday 9th July

  • SL6
    • Atlas new sw validation system scalability problem has been solved.
    • voms are now in the EMI-3 repository. No testing or prod PT repositories are necessary.
    • UK status: 3&1/2 sites online, 3 testing, 7 with a plan, 4 without a plan (UCL, Durham, RALPP, SUSX).
    • HS06: T0 tests on the compilers didn't give significant differences. HEPiX has started an SL6 HS06 page where sites are welcome to post their results (SL6 HS06 benchmark results).
  • Monitoring
    • A WLCG Monitoring consolidation group has been formed to consolidate the WLCG monitoring. It does not cover all the monitoring: a portion developed by the experiments is not included, but it does concern the well-known dashboards.
      • WLCG monitoring Initial status.
      • First meeting last week. The experiments have already given a first evaluation; sites will be represented via WLCG Ops Coordination. To get feedback from sites a group has been set up to collect sites' opinions (see Maria's slide). Anyone interested should contact Pepe Flix (jflix@NOSPAMpic.es). David Crooks and Kashif might want to be part of it as this touches on the GridPP core tasks.
    • Among things interesting to discuss
      • myWLCG vs SUM tests: they both get the information from the same source, i.e. Nagios.
      • Personalised dashboard looks interesting but was never publicized much.
      • Sites monitoring requirements: SUM tests not representing the real experiment status for example.
Tier-1 - Status Page

Tuesday 20th August

  • Problems (timeouts) on cms_tape believed to be fixed.
  • Completed update of CVMFS to 2.1.14-1 in response to EGI-SVG-2013-5890
  • Testing of alternative batch system (Condor/ARC CEs/SL6) proceeding.
Storage & Data Management - Agendas/Minutes

Tuesday 15th July

  • Three sites are now running all UK FTS traffic via the 'FTS3' service as a test. Mostly successful (a small issue with a few US sites is to be resolved before taking tests further).

Tuesday 28th May

  • The 'Big Data' agenda is being compiled here. There is also now a suggestion for a cross disciplinary clouds and virtualisation workshop in July - the idea is 'in progress' but no more detail is yet available.


Accounting - UK Grid Metrics HEPSPEC06 Atlas Dashboard HS06

Tuesday 13th August

Tuesday 23rd July

  • Sites moving to SL6 are reminded of the need to re-benchmark their WNs. Some sites have updated the wiki already and provide an idea of the performance change.
  • There is an ongoing PMB discussion about the timeline for the next Tier-2 hardware tranche. Please let Pete or Jeremy know if your site will benefit from a spend this financial year.

Tuesday 30th April

  • A discussion is starting about how to account for/reward disk that is reallocated to LHCb. By way of background, LHCb is changing its computing model to make more use of Tier-2 sites. They plan to start with a small number of big/good T2 sites in the first instance, and commission them as T2-Ds with disk. Ideally such sites will provide >300TB but for now may allocate 100TB and build it up over time. Andrew McNab is coordinating the activity for LHCb. (Note the PMB is already aware that funding was not previously allocated for LHCb disk at T2s.)

Tuesday 12th March

  • APEL publishing stopped for Lancaster, QMUL and ECDF

Tuesday 12th February

  • SL HS06 page shows some odd ratios. Steve says he now takes "HS06 cpu numbers direct from ATLAS" and his page does get stuck every now and then.
  • An update of the metrics page has been requested.
Documentation - KeyDocs

See the worst KeyDocs list for documents needing review now and the names of the responsible people.

Tuesday 20th August

  • Observed that VO approvals information is not being recorded clearly. A status table may be a good way to trace the status.

Tuesday 13th August

  • Due to various staff changes/moves there is now a non-negligible number of documents that need new 'owners'. We will reallocate at the next core tasks meeting.

KeyDocs monitoring status: Grid Storage(7/0) Documentation(3/0) On-duty coordination(3/0) Staged rollout(3/0) Ticket follow-up(3/0) Regional tools(3/0) Security(3/0) Monitoring(3/0) Accounting(3/0) Core Grid services(3/0) Wider VO issues(3/0) Grid interoperation(3/0) Cluster Management(1/0) (brackets show total/missing)
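
The status line above packs each task area as Name(total/missing). A minimal sketch of how such a line could be split into per-area counts, assuming the format is exactly as shown here; the function name `parse_keydocs_status` is hypothetical and not part of any GridPP tool:

```python
import re

# Each entry looks like "Grid Storage(7/0)": a name, then (total/missing).
_ENTRY = re.compile(r"\s*([^()]+?)\((\d+)/(\d+)\)")


def parse_keydocs_status(line):
    """Return {task area: (total, missing)} for a KeyDocs status line.

    Hypothetical helper: assumes names contain no parentheses and that
    both counts are non-negative integers.
    """
    return {
        name.strip(): (int(total), int(missing))
        for name, total, missing in _ENTRY.findall(line)
    }


if __name__ == "__main__":
    sample = "Grid Storage(7/0) Documentation(3/0) Cluster Management(1/0)"
    for area, (total, missing) in parse_keydocs_status(sample).items():
        print(f"{area}: {total} documents, {missing} missing")
```

Areas with a non-zero second number are the ones that would need attention at the next core tasks meeting.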

Interoperation - EGI ops agendas

Monday 19th August

  • Yesterday's agenda. Attended by David+ and Raul.
  • There is an EGI discussion forum for UMD updates.
  • Coming releases: DPM (being tested in EPEL); BDII core (bug and vulnerability fix); BDII site (bug fixes).
  • Noted that dCache 2.6.5 contains a serious bug: "if you have a permanent migration thread and issue a 'save', the next time you save after a restart, the migration will be saved as 'null', causing a second restart to fail, so something you do now can cause a fail to happen in 2 months time when you restart dcache for some other issue". Thought to be fixed but not yet rolled out.

  • New in staged Rollout:
    • emi-ui - 3.0.2
    • gridsite - 2.1.2

This is a hotfix for GridSite, allowing the previously disallowed dash character in delegation IDs. Delegation IDs containing non-alphanumeric characters other than a dot, comma, underscore or dash are rejected. It also properly sets the type of proxy before calling the signing function from CAnL.

    • canl - 2.1.2

This is a hotfix for a bug whereby the type of proxy to sign was erroneously hard-coded to a single value for different types of proxies, most importantly affecting RFC proxies.

    • cream - 1.16.1

Authentication and authorization in the CREAM service now make use of the CAnL library. The gLite security libraries are no longer required.

    • blah - 1.20.2

Memory leak in BNotifier

  • Already in SR:
    • bdii-site - 1.2.1

This new version of the site BDII contains a fix in the ldap info provider script to set cached GLUE state attributes to 'Unknown'. Bug fixes: BUG #101709: Set to 'Unknown' ldap info provider cached state attributes for the site BDII.

    • bdii-top - 1.1.1

This version of the top BDII fixes a bug in the publication of delayed delete GLUE entries. A new plugin is responsible for publishing cached entries with value 'Unknown' in the corresponding GLUE state attributes. This version also includes a bug fix in the glite-info-update-endpoints script.

    • wms - 3.6.0

This version solves the problem with Argus and WMS integration (SL6).

    • voms - 3.2.0

VOMS Admin now supports Group managers, a mechanism which allows the hierarchical dispatching of the notifications resulting from user VO membership and group membership requests.

    • apel - 2.2.0

vulnerability bug fix
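
The GridSite 2.1.2 note above describes which delegation IDs are accepted. As an illustration only, the stated rule can be written as a character whitelist. This is a sketch of the rule as worded in the release note, not GridSite's actual code; the function name is hypothetical, "alphanumeric" is assumed to mean ASCII letters and digits, and rejecting an empty ID is also an assumption since the note does not say.

```python
import re

# Assumed reading of the release note: ASCII letters, digits, dot, comma,
# underscore and dash are allowed; any other character causes rejection.
_ALLOWED_ID = re.compile(r"[A-Za-z0-9.,_-]+")


def is_acceptable_delegation_id(delegation_id: str) -> bool:
    """Return True if every character of the ID is in the allowed set.

    Hypothetical helper for illustration; an empty ID is rejected here,
    which is an assumption rather than documented behaviour.
    """
    return _ALLOWED_ID.fullmatch(delegation_id) is not None


if __name__ == "__main__":
    for candidate in ("job-42", "a.b,c_d", "bad/id", "has space"):
        verdict = "accepted" if is_acceptable_delegation_id(candidate) else "rejected"
        print(f"{candidate!r}: {verdict}")
```

A whitelist is used rather than a list of forbidden characters so that anything not explicitly named in the note is rejected by default.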


  • SHA-2 monitoring: The Nagios instance midmon continues to monitor the services not supporting SHA-2. An extensive overview will be part of next week's OMB meeting.

Currently there are two products not yet supporting SHA-2: dCache and StoRM.

  • dCache version 2.6.5 has been released by the product team, and currently it is in the UMD software provisioning process.

The SHA-2 supporting version for the 2.2.x golden release is expected, but not yet released by the product team. The target for the 2.2.x version is UMD-2.

  • StoRM: the latest UMD-3 version is 1.11.1. It supports SHA-2 but has some critical load issues. It is recommended not to deploy this version.

StoRM PT is testing a new release (1.11.2) which solves the critical issues of the current one. Currently midmon is not generating critical alarms for dCache and StoRM: there are no related alarms in the operations dashboard.

  • VOMS monitoring for SHA-2 support has been re-enabled. It was suspended because of a problem in the resource information provider, which was causing some false positives.

The problem has been identified (it was related to a sudo version), and information has been provided to ROD teams. In case of a false positive with VOMS, site administrators should be able to solve the problem quickly and make the service publish itself again.

This is believed to have been fixed but not yet rolled out.


gLite support calendar.


Monitoring - Links MyWLCG

Tuesday 23rd July

Tuesday 18th June

  • David C is taking feedback on the Graphite implementation presented at the HEPSYSMAN meeting. Also considering integrating Site Nagios.
  • Glasgow dashboard now packaged and can be downloaded here.
On-duty - Dashboard ROD rota

Tuesday 20th August

  • Some sites not running yum update for their EMI2 CREAM-CEs.
  • Rota updated. Gareth from the T1 will participate.
  • Glasgow will continue to provide effort in this area.


Rollout Status WLCG Baseline

Tuesday 9th July

  • New EMI2 and EMI3 release yesterday. No staged rollout requests yet. Imperial upgraded their WMS and they have been somewhat shaky ever since.

Tuesday 18th June

  • New EMI3 CE coming into SR. Liverpool will test.
  • A lot of EMI3 testing done at Brunel.
  • The EMI-3 testing page contains all the issues I am aware of. It's a Wiki though, so if you find an issue, please put it in the appropriate category.

Tuesday 14th May

  • A reminder. Please could sites fill out the EMI-3 testing contributions page. This is for all testing not just SR sites as we want to know which sites have experience with each component.

References


Security - Incident Procedure Policies Rota

Tuesday 20th August

  • Several sites showing up in pakiti this week.
  • Update on workaround discussed last week.


Services - PerfSonar dashboard | GridPP VOMS

Tuesday 13th August

  • PerfSONAR: The dashboard is showing more problems across various sites - presumably the monitoring?
  • VOMS: Using multiple VOMS servers in the ops portal (VO ID cards) requires careful use of designations.

Tuesday 23rd July

  • PerfSONAR: the issues with the WLCG mesh appear to be understood and a new minor release (e.g. 3.3.1) is likely to be released. In the meantime please could sites upgrade by following instructions here but leave the WLCG mesh URL (tests-wlcg-all.json) commented out. Please also update the site progress page.
  • Where are we with the VOMS rollout?

Monday 10th June

  • Issue with neurogrid.incf.org ownership. Is more guidance needed?
  • Where are we with the perfsonar mesh?
  • Are we ready for full rollout of the VOMS backups?


Tickets

Monday 19th August 2013, 15.00 BST

50 Open tickets this week. Keeping it light with the review though.

There are gLexec tickets, SHA-2 tickets, NGS site decommissioning tickets and Unresponsive VO tickets (minos and supernemo left), but I decided not to separate them out this week - I've only included them below if they stood out as needing extra soothing.

No ticket update from me next week, I'm taking the week off to Pug-sit.

NGI
https://ggus.eu/ws/ticket_info.php?ticket=96634 (15/8)
The "cloud" site, 100IT, has received a certification ticket. Assigned (15/8) (Child ticket of https://ggus.eu/ws/ticket_info.php?ticket=94780)

BIRMINGHAM
https://ggus.eu/ws/ticket_info.php?ticket=96685 (18/8)
LHCB have spotted pilot jobs failing on node epgf06.ph.bham.ac.uk. Assigned (18/8) Edit - Acknowledged and In progress.

https://ggus.eu/ws/ticket_info.php?ticket=96533 (9/8)
LHCB asked for g++ to be installed at Birmingham. Evolved into a discussion with Ewan about putting g++ into the HEPOSlibs_Sl6 rpm, which, from reading the updates, has been done. Huzzah. In progress (16/8)

GLASGOW
https://ggus.eu/ws/ticket_info.php?ticket=96528 (9/8)
444444 waiting jobs on some of the more up-to-date Glasgow CEs. Gareth involved the CREAM developers, but other than an acknowledgement from Maria there has been no sound from them. May need shimmying along. In progress (12/8) Update - The CREAM devs came back with a workaround, which solved the problem.

DURHAM
https://ggus.eu/ws/ticket_info.php?ticket=96530 (9/8)
Another 444444 waiting job ticket, could do with a bit of an update. In progress (12/8)

https://ggus.eu/ws/ticket_info.php?ticket=95302 (1/7)
Durham's gLexec ticket. Could do with a spot of soothing - either an update or on-holding. In progress (12/7)

TIER 1
https://ggus.eu/ws/ticket_info.php?ticket=95996 (22/7)
One of the SHA-2 tickets, this could do with an update or the ROD will have to declare it overdue. In progress (22/7)

https://ggus.eu/ws/ticket_info.php?ticket=96321 (2/8)
Sno+ SAM jobs failing at RAL. Probably a problem with the cert these jobs are run under; I've involved Kashif in the ticket. Waiting for reply (19/8) Update - Kashif and Chris have had an exchange about this; as discussed last week the problem is due to Castor being VOMS unaware.

MANCHESTER
https://ggus.eu/ws/ticket_info.php?ticket=96629 (15/8)
Hone noticed a problem with jobs dying at Manchester, but the problem seems to have gone away since. However Hone forgot to "notify" the site (they also did this with a QMUL ticket). I've informed them that they need to start notifying us when things break. Assigned (16/8) Update - closed

Tools - MyEGI Nagios

Monday 12 Aug

  • VO-Nagios

t2wlcgnagios.physics.ox.ac.uk monitors a few UK VOs and uses a robot certificate. Due to some confusion the robot certificate expired. I had applied for an extension beforehand, but the Cert Wizard doesn't understand robot certificates and I thought that it had been extended. Finally Jens stepped in and sorted it out. VO Nagios is now working.

  • SHA2 Certificates

I have been issued a SHA2 certificate by Jens. I tested a few CEs and some interesting results came out. The GridPP VOMS server is SHA2 compatible, so SHA2 proxies can be created for VOs hosted at voms.gridpp.ac.uk. None of the CERN VOMS servers are SHA2 compatible, but there is a workaround: adding a secondary SHA2 certificate. Details are here https://twiki.cern.ch/twiki/bin/view/LCG/SHA2readinessTesting#SHA_2_VOMS_server I have added my SHA2 certificate but it is not approved yet as most people are on holiday. Interestingly, when I submitted a few jobs using ngs.ac.uk with the SHA2 certificate, they finished successfully on CEs which are not supposed to be SHA2 compliant. I will test again with the OPS VO to confirm it.

Tuesday 23rd July

  • In a campaign to update VO ID card details it turns out that a few of our supported VOs are obsolete: babar, possibly supernemo and ngs.ac.uk. The first of these can be safely removed but we need to confirm our announcement process.
VOs - GridPP VOMS VO IDs Approved VO table

Monday 19 August

  • EPIC
    • Support requested at Tier-1
    • Any other sites prepared to support them?
  • Catalogue synchronisation - Biomed working on it.


Monday 12 August

  • HyperK.org
    • VOMS servers set up (Manchester, Oxford, Imperial)
    • VOID card - stalled on a homepage.
    • WMS set up (Imperial) - awaiting Glasgow, RAL
    • Site set up (QMUL)
    • LFC - in progress
    • CVMFS - considering


  • SNO+
    • Dirac set up for some CEs
  • Epic
    • Doing stuff
  • ngs.ac.uk VO - any reason to keep it?
  • Software areas for SL6
    • Are we keeping the same areas as sl5?
    • What about the software tags?
    • Push CVMFS?

Friday 2 August 2013

  • SNO+ would like to streamline their submission
    • Is Dirac possible?
  • WebDAV support at RAL LFC
    • Firewall seems to be in the way.
  • HyperK.org
    • Waiting on WMS support from somebody.
    • 1 month so far from starting this off - can we do this quicker next time?
Site Updates

Actions


Meeting Summaries
Project Management Board - Members Minutes Quarterly Reports

Empty

GridPP ops meeting - Agendas Actions Core Tasks

Empty


RAL Tier-1 Experiment Liaison Meeting (Wednesday 13:30) Agenda EVO meeting

Wednesday 14th August

  • Operations report
  • There have been some problems with the cms-tape service class within Castor that are under investigation.
  • The test ARC-CEs have been enabled for more non-LHC VOs (now includes hone, biomed, mice, na62, superb, snoplus.)
  • We continue to work with Atlas on the testing of FTS3.
WLCG Grid Deployment Board - Agendas MB agendas

Empty



NGI UK - Homepage CA

Empty

Events

Empty

UK ATLAS - Shifter view News & Links

Empty

UK CMS

Empty

UK LHCb

Empty

UK OTHER
  • N/A
To note

  • N/A