Operations Bulletin 290413

From GridPP Wiki

Bulletin archive


Week commencing 22nd April 2013
Task Areas
General updates

Tuesday 23rd April

  • An almost final reminder for any site running EMI-1 middleware: any component not upgraded by next Wednesday, 1st May, must be put in downtime unless there is a good technical reason (agreed with EGI) not to upgrade. Sites not complying face suspension.
  • EGI is chasing EMI product teams that have not indicated their plans post-EMI (i.e. after next week)! There are some significant components where ongoing development/support is unknown (WMS, EMI-Common (EMI-UI, EMI-WN, gLite-yaim-core, Torque server config, Torque WN config, emi-nagios), EMI-Messaging, gLite-Infosys and WNoDES).
  • No re-computations were requested for the March 2013 WLCG Tier-2 availability and reliability report.
  • Please could sites review their non-LHC supported VOs and consider supporting additional ones. LHC VO work has decreased and it would be good to support other communities' work (for example the enmr.eu VO) while we can. Use VomsSnooper to check your configurations.
  • The sysadmin guide has been updated with new hardware and publishing requirements for top-bdii and bdii nodes.


Tuesday 16th April

  • There was an EGI OMB on Friday (agenda)
  • It has been agreed that tickets stuck without a response after several reminders will be manually closed as 'unsolved' by GGUS (flow diagram).
  • HEPiX is taking place this week in Bologna. (agenda)
WLCG Operations Coordination - Agendas

Tuesday 23rd April

  • The next meeting takes place this Thursday (agenda)

Tuesday 16th April

Extracts from the 11th April 2013 meeting minutes

  • A new IPv6 compatibility task force is being created to test IPv6 within the experiments' frameworks. Site representatives are needed. Considering the IPv6 effort in the UK, perhaps someone wants to join?
  • Middleware (WLCG Baseline)
    • there was a security release for CREAM, sites should upgrade to it
    • now the baseline versions table contains the versions of clients to deploy on UIs and WNs
    • EMI-3 has been released but no product is baseline yet; still, sites are free to upgrade services to EMI-3 (the WN needs more testing)
    • CERN WLCG repository to be created in the coming days, can augment EGI WLCG repository and/or serve for failover, will serve various use cases (HEP_OSlibs, XrootD plugins)
  • Experiments
    • CMS Requests to the Tier-2 sites
      • Fair share allocations: 50% Role=production or Role=t1production, 40% Role=pilot, 10% remaining CMS
      • Provide and publish 48h job queues
  • CVMFS
    • A SAM probe for CVMFS, currently in preparation, may be included in the experiment SAM suites - is this enough for experiment testing?
  • glexec
    • LHCb has to reimplement a good portion of DIRAC because it isn't working anymore: no timeline for this is given
    • The ATLAS PanDA implementation is ongoing and might be finished by the end of May.
  • squid
    • There was a request to upgrade squid by the end of April to enable the new monitoring; however, the new monitoring isn't visible yet. CMS has already sent out instructions for their sites; ATLAS will do so when they are ready, hopefully when the monitoring becomes visible. It will require opening an additional port.
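
The CMS fair-share request in the extracts above (50% production, 40% pilot, 10% remaining CMS) can be turned into concrete slot targets. The following is a minimal sketch only: the helper name `split_slots` and the idea of expressing the shares as whole job slots are assumptions for illustration, and real batch systems express fair shares in their own configuration syntax.

```python
# Percentages taken from the CMS request; the group names are illustrative.
CMS_SHARES = {
    "production": 50,  # Role=production or Role=t1production
    "pilot": 40,       # Role=pilot
    "other": 10,       # remaining CMS users
}


def split_slots(total_slots):
    """Return integer slot targets per group that sum to total_slots."""
    # Integer arithmetic avoids floating-point rounding surprises.
    targets = {group: total_slots * pct // 100
               for group, pct in CMS_SHARES.items()}
    # Hand out any slots lost to rounding, largest share first.
    remainder = total_slots - sum(targets.values())
    for group in sorted(CMS_SHARES, key=CMS_SHARES.get, reverse=True):
        if remainder == 0:
            break
        targets[group] += 1
        remainder -= 1
    return targets


if __name__ == "__main__":
    print(split_slots(1000))
```

For a CMS allocation of 1000 slots this gives 500 for production, 400 for pilot and 100 for other CMS work.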

Monday 8th April

  • The dates of the next WLCG Operations Coordination meetings are: Thursday 11th and 25th April, 15:30 CEST.
  • The agenda for Thursday is currently based on the standing items. Let Alessandra or Jeremy know if you have items you would like raised/discussed.
Tier-1 - Status Page

Tuesday 23rd April

  • Planned interventions to apply patches to FTS/LFC & Castor standby databases tomorrow (24th) (At Risk / Warning). Will do main Castor databases the following week.
  • Testing of alternative batch system (slurm) proceeding.
  • Three more disk servers deployed into AtlasDataDisk from second batch of new disk servers.
  • Investigations are ongoing into problems at batch job set-up.
Storage & Data Management - Agendas/Minutes

Friday 17th April

  • Good buzz at EGI CF last week: excellent GridPP presence, loads of useful people to talk to. We spent today's meeting comparing notes.

Tuesday 9th April

Monday 1st April

  • DDN report - see slides circulated by Pete G.

Wed 20 March 2013

  • Ruminated over the agenda items from last week's GDB
    • EMI roadmap (dCache, and other things)
    • FTS support for HTTP - we knew this, but how do we make use of it now?
    • Storage accounting records, needs updated APEL
    • Work of storage group(s) on interfaces and protocols, and future furlongpebbles.
  • RAL D1T0 evaluation.
    • Seems to be settling on HDFS and CEPH which will be run anyway
    • what about Lustre?
    • Presentation to PMB next Monday, but no decision yet.



Accounting - UK Grid Metrics HEPSPEC06 Atlas Dashboard HS06

Tuesday 12th March

  • APEL publishing stopped for Lancaster, QMUL and ECDF

Tuesday 12th February

  • SL HS06 page shows some odd ratios. Steve says he now takes "HS06 cpu numbers direct from ATLAS" and his page does get stuck every now and then.
  • An update of the metrics page has been requested.
Documentation - KeyDocs

See the worst KeyDocs list for documents needing review now and the names of the responsible people.

Tuesday 9th April

  • Please could those responsible for key documents start addressing the completeness of documents for which they are responsible? Thank you.

Tuesday 26th February

KeyDocs monitoring status: Grid Storage(7/0) Documentation(3/0) On-duty coordination(3/0) Staged rollout(3/0) Ticket follow-up(3/0) Regional tools(3/0) Security(3/0) Monitoring(3/0) Accounting(3/0) Core Grid services(3/0) Wider VO issues(3/0) Grid interoperation(3/0) Cluster Management(1/0) (brackets show total/missing)

Thu, 25th April

  • Process for de-emailing the host certificates of a site. This allows an admin to replace certificates with email address components in the DN - they cause havoc in the security system.
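
Before following the de-emailing process it helps to know which host certificates are affected. The following is a minimal sketch, assuming subject DNs are available as strings in the usual slash- or comma-separated form and that the email component appears under an attribute named `emailAddress`, `Email` or `E`; the helper name and the example DNs are made up for illustration.

```python
import re

# Assumed attribute names for an email component in a DN.
# The attribute must start a DN component (after '/', ',' or at the start).
EMAIL_ATTR = re.compile(r"(?:^|[/,])\s*(?:emailAddress|Email|E)\s*=",
                        re.IGNORECASE)


def has_email_component(dn):
    """Return True if the DN contains an email address component."""
    return bool(EMAIL_ATTR.search(dn))


if __name__ == "__main__":
    # Hypothetical DNs, not real certificates.
    dns = [
        "/C=UK/O=eScience/OU=Example/L=Site/CN=host.example.ac.uk",
        "/C=UK/O=eScience/OU=Example/L=Site/CN=host.example.ac.uk"
        "/emailAddress=admin@example.ac.uk",
    ]
    for dn in dns:
        print(has_email_component(dn), dn)
```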

Monday, 4th March

  • New draft document for putting a CE in downtime. It discusses the pros and cons of 3 approaches. Needs discussion to finalise.
  • EPIC integrated into Approved VOs, pending full acceptance.
  • Process commissioned to delete stale documents.

Thursday, 29th November

The Approved VOs document has been updated to automatically contain a table that lays out the resource requirements for each VO, as well as the maximum. We need to discuss whether this is useful - it seems that the majority of WN software requirements are passed around by word of mouth etc. Should this be formalized? Please see

   https://www.gridpp.ac.uk/wiki/GridPP_approved_VOs#VO_Resource_Requirements

This table will be kept up to date with a regular process that syncs it with the CIC Portal, should it prove to be useful.

Interoperation - EGI ops agendas

Tuesday 9th April

  • There was an EGI ops meeting on 3rd April.
  • UMD/SR - note issues with CREAM in UMD-2 - also there's a new CREAM in EMI-2, with security updates. Does anyone in the UK run CREAM from UMD-2 at the moment?
  • EMI-2 WN tarball has passed SR. Expect a deadline for the upgrade soon. gLite 3.2 WN tarballs should be updated ASAP.
  • EMI-3 WMS on SL6 doesn't work with Argus (GGUS 92773)
  • EMI-3 VOMS Critical issue; fix scheduled April 18th.
  • Only APEL and VOMS appear to have stopped supporting YAIM core in the early EMI-3 release.

Tuesday 2nd April

  • Minutes of the 20th March EGI ops meeting are available.


Monitoring - Links MyWLCG

Tuesday 9th April

  • David C has material to present (Glasgow solutions to monitoring) but cannot make our Tuesday ops meeting. Looking at options.

Tuesday 5th February

  • Task will focus on probes and sharing of useful tools - suggestions and comment welcome
  • Glasgow dashboard now packaged and can be downloaded here.
On-duty - Dashboard ROD rota

Monday 22nd April

  • Quiet week. Not very much progress (or response!) on tickets about EMI 1 upgrades though.

Monday 15th April

  • A lot of alarms because of a networking problem at the Tier-1 at the start of the week.
  • Three sites have open EMI tickets.


Rollout Status WLCG Baseline

Tuesday 23rd April

  • Please could sites fill out the EMI-3 testing contributions page. This is for all testing not just SR sites as we want to know which sites have experience with each component.

Tuesday 2nd April

  • EMI-1 components should be out of production. Nagios probes will report critical this month. Services remaining (without special condition) beyond 30th April will need to be placed in downtime.

Monday 4th March

  • EMI early adopters list by component.
  • Do we have a Staged Rollout list for EMI3?

Tuesday 5th February

References


Security - Incident Procedure Policies Rota

Monday 16th April

  • Sites are continuing to upgrade their kernels to rectify CVE-2013-0871. This vulnerability is still considered HIGH risk by EGI-CSIRT.

Monday 8th April

  • We have a number of site notifications from Pakiti. Please check your site summary.

Tuesday 2nd April

  • Reminder about ptrace kernel issue (CVE-2013-0871)
  • Thanks to all those sites that took part in the security challenge

Tuesday 5th March

  • Two openafs vulnerabilities announced (CVE-2013-1794 and CVE-2013-1795). Further details available at http://www.openafs.org/security. Updated RPMS for SL5/6 available.



Services - PerfSonar dashboard | GridPP VOMS

Tuesday 23rd April

Tuesday 9th April

  • It is now becoming urgent to configure and enable the backup VOMS instances at Oxford and Imperial. Please can we arrange a follow-up meeting (postponed last week as Daniela was out)?
Tickets

Monday 22nd April 2013, 15.00 BST
Only 20 open tickets this week.

gridpp.ac.uk
https://ggus.eu/ws/ticket_info.php?ticket=93337 (15/4)
The user's problem accessing the gridpp website has been solved, so the ticket can be closed. In progress (can be closed) (16/4)

VOMS
https://ggus.eu/ws/ticket_info.php?ticket=92306 (7/3)
The earthsci VO has been deployed to all the UK VOMS servers. Steve J has asked the VO a question about becoming an Approved VO (tm); it is not clear whether this has been received by the VO/Mark Mitchell. Otherwise, if the VOMS is working for the VO, this ticket can be closed. In progress (9/4)

EMI1
The argus EMI1 alerts have started showing up:
GLASGOW: https://ggus.eu/ws/ticket_info.php?ticket=93407
Gareth can't find any EMI1-ness about their argus box; it is running EMI2, and manually running the ldapsearch matching the test supports this. Waiting for reply.
BRUNEL: https://ggus.eu/ws/ticket_info.php?ticket=93406
Raul has put down a comprehensive reply, although again the offending server is EMI2 (although it's being/has been shut down). A service which should fail the tests (which is scheduled for an upgrade) managed to sneak under the radar. Raul solved this ticket whilst I was typing.

And there are still the two DPM tickets:
GLASGOW: https://ggus.eu/ws/ticket_info.php?ticket=92805
DURHAM: https://ggus.eu/ws/ticket_info.php?ticket=92804
Both these tickets could really do with an update from their respective sites.

TIER 1
https://ggus.eu/ws/ticket_info.php?ticket=92266 (6/3)
The new myproxy service at the Tier 1 is ready and open for testing: myproxy.gridpp.rl.ac.uk. It's not in the GOCDB yet, but feel free to give it a whirl. Waiting for reply (feedback) (16/4)

RALPP
https://ggus.eu/ws/ticket_info.php?ticket=93416 (17/3)
Biomed reported seeing nagios job failures at RALPP; it turned out that the CE was having problems due to a flood of biomed jobs. Biomed have asked for the user's DN so that they can have a word. In progress (18/3)

GLASGOW
https://ggus.eu/ws/ticket_info.php?ticket=93493 (19/4)
Another biomed ticket concerning their nagios jobs, which were failing on some Glasgow CEs. It looks like the problem is at their end, with their proxies expiring and their LFC not working, but they seem to have got confused. Waiting for reply (21/4)

RHUL
https://ggus.eu/ws/ticket_info.php?ticket=89751 (17/12/12)
Govind had been informed that the path MTU discovery to RHUL should work, but Chris has reported that he still sees the problem. It might be useful to find out where the external box that the RHUL network admins used for their test resides. In progress (17/4)

Tools - MyEGI Nagios

Tuesday 16th April

  • Installation of DIRAC instances at IC pending return of Janusz.

Tuesday 13th November

  • Noticed two issues during the Tier-1 power cut. The SRM and direct CREAM submission tests use the top BDII defined in the Nagios configuration to query for the resource. These tests started to fail because the RAL top BDII was not accessible. The configuration doesn't use BDII_LIST, so I cannot define more than one BDII. I am looking into how to make this more robust.
  • The Nagios web interface was not accessible to a few users because GOCDB was down. It is a bug in SAM-nagios and I have opened a ticket.

Site availability has not been affected by this issue, because Nagios sends a warning alert when it cannot find a resource through the BDII.
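
One way to make such a lookup more robust is to try several top BDIIs in turn. The following is a minimal sketch of that idea only, not a description of how SAM-Nagios works: the function names, the comma-separated `host:port` format, the example host names and the caller-supplied `query` function are all assumptions. Port 2170 is the usual BDII LDAP port.

```python
DEFAULT_BDII_PORT = 2170  # usual BDII LDAP port


def parse_bdii_list(value):
    """Split 'host1:2170,host2' into a list of (host, port) tuples."""
    endpoints = []
    for item in value.split(","):
        item = item.strip()
        if not item:
            continue
        host, _, port = item.partition(":")
        endpoints.append((host, int(port) if port else DEFAULT_BDII_PORT))
    return endpoints


def query_with_failover(endpoints, query):
    """Call query(host, port) on each endpoint until one succeeds.

    Connection-level failures (OSError and subclasses) move on to the
    next endpoint; if every endpoint fails, RuntimeError is raised.
    """
    errors = []
    for host, port in endpoints:
        try:
            return query(host, port)
        except OSError as exc:
            errors.append((host, str(exc)))
    raise RuntimeError(f"all top BDIIs failed: {errors}")


if __name__ == "__main__":
    # Hypothetical hosts and a stand-in query function for illustration.
    def fake_query(host, port):
        if host == "bdii-a.example.org":
            raise ConnectionRefusedError("unreachable")
        return f"answer from {host}:{port}"

    bdiis = parse_bdii_list("bdii-a.example.org:2170, bdii-b.example.org")
    print(query_with_failover(bdiis, fake_query))
```

With a list of two or more BDIIs, an outage of the first one only costs the time taken to fail over rather than failing the test.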


VOs - GridPP VOMS VO IDs Approved VO table

Monday 8th April

  • Please note Chris W is away this week.
  • Information is being gathered for the Q1 2013 quarterly report.

Tuesday 2 April 2013

Monday 4th March 2013

Site Updates

Actions


Meeting Summaries
Project Management Board - Members Minutes Quarterly Reports

Empty

GridPP ops meeting - Agendas Actions Core Tasks

Empty


RAL Tier-1 Experiment Liaison Meeting (Wednesday 13:30) Agenda EVO meeting

Wednesday 24th April

  • Operations report
  • It has been a quiet week with steady running.
  • A problem encountered with testing the next version of Castor (2.1.13) has been understood. This was blocking its roll-out. Final stress testing of version 2.1.13 is now taking place ahead of scheduling its roll-out.
  • A problem has been understood (and a fix made available) for occasional time-out problems seen during Castor access. These have been seen as periodic failures in SUM tests of the SRM.
WLCG Grid Deployment Board - Agendas MB agendas

Empty



NGI UK - Homepage CA

Empty

Events

Empty

UK ATLAS - Shifter view News & Links

Empty

UK CMS

Empty

UK LHCb

Empty

UK OTHER
  • N/A
To note

  • N/A