Operations Bulletin 220413


Latest revision as of 08:40, 22 April 2013

Bulletin archive


Week commencing 15th April 2013
Task Areas
General updates

Tuesday 16th April

  • There was an EGI OMB on Friday (agenda)
  • It has been agreed that tickets stuck without a response after several reminders will be manually closed as 'unsolved' by GGUS (flow diagram).
  • HEPiX is taking place this week in Bologna. (agenda)

Monday 8th April

  • As pointed out in Alessandra's email last week, aspects of the experiment computing model evolution are addressed in this DM presentation for ATLAS and CMS and this one on LHCONE.
  • Due to the Easter networking outages at RAL (4 days in total), APEL is still catching up. Some sites are publishing several months of data and the server has gone from processing 6 million to 16 million events per day. Sites have been seen to time out when the consumer table is optimized (a new issue) and this is being investigated. Please be patient as the service catches up. WLCG and GridPP are aware that accounting data is not yet up to date for March (for the monthly and quarterly reports).
  • On Thursday at the EGI Community Forum there is a talk on EMI-3 APEL - if you are a sysadmin and at the EGI CF that day please consider attending to give feedback on the approach. We can also get Will along to an ops meeting.


WLCG Operations Coordination - Agendas

Tuesday 16th April

Extracts from the 11th April 2013 meeting minutes

  • A new IPv6 compatibility task force is being created to test IPv6 within the experiments' frameworks. Site representatives are needed. Considering the IPv6 effort in the UK, perhaps someone wants to join?
  • Middleware (WLCG Baseline)
    • there was a security release for CREAM, sites should upgrade to it
    • now the baseline versions table contains the versions of clients to deploy on UIs and WNs
    • EMI-3 has been released but no product is baseline yet; still, sites are free to upgrade services to EMI-3 (the WN needs more testing)
    • CERN WLCG repository to be created in the coming days, can augment EGI WLCG repository and/or serve for failover, will serve various use cases (HEP_OSlibs, XrootD plugins)
  • Experiments
    • CMS Requests to the Tier-2 sites
      • Fair share allocations: 50% Role=production or Role=t1production, 40% Role=pilot, 10% remaining CMS
      • Provide and publish 48h job queues
  • CVMFS
    • A SAM probe for CVMFS, currently in preparation, may be included in the experiment SAM suites - is this enough for experiment testing?
  • glexec
    • LHCb has to reimplement a good portion of DIRAC because it isn't working anymore: no timeline for this is given
    • The ATLAS PanDA implementation is ongoing and might be finished by the end of May.
  • squid
    • There was a request to upgrade squid by the end of April to enable the new monitoring; however, the new monitoring isn't visible yet. CMS has already sent out instructions for their sites; ATLAS will do so when they are ready, hopefully when the monitoring becomes visible. The upgrade will require opening an additional port.
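
As a rough illustration of the CMS fair-share request above (50% production, 40% pilot, 10% remaining CMS), the following sketch shows what those percentages would mean for a given number of job slots. The function name, the dictionary keys and the rounding rule are assumptions made for this example and do not come from the minutes. Batch systems normally treat fair shares as scheduling targets over time rather than as hard slot partitions, so this is illustrative arithmetic only.

```python
from fractions import Fraction

# Percentages quoted in the CMS request to Tier-2 sites.
# The key names are hypothetical labels chosen for this sketch.
CMS_SHARES = {
    "production_or_t1production": 50,
    "pilot": 40,
    "other_cms": 10,
}


def split_slots(total_slots, shares=CMS_SHARES):
    """Split a slot count by percentage share.

    Uses largest-remainder rounding (an assumption of this sketch)
    so that the parts always add up to total_slots.
    """
    if sum(shares.values()) != 100:
        raise ValueError("shares must sum to 100")
    # Exact (fractional) entitlement of each group.
    exact = {name: Fraction(total_slots * pct, 100) for name, pct in shares.items()}
    # Start from the whole-number part of each entitlement.
    result = {name: int(value) for name, value in exact.items()}
    leftover = total_slots - sum(result.values())
    # Hand the leftover slots to the groups with the largest fractional parts.
    by_remainder = sorted(exact, key=lambda name: exact[name] - result[name], reverse=True)
    for name in by_remainder[:leftover]:
        result[name] += 1
    return result


if __name__ == "__main__":
    print(split_slots(1000))
```

For example, a site offering 1000 slots to CMS would aim for roughly 500 production, 400 pilot and 100 other CMS slots on average.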

Monday 8th April

  • The dates of the next WLCG Operations Coordination meetings are: Thursday 11th and 25th April, 15:30 CEST.
  • The agenda for Thursday is currently based on the standing items. Let Alessandra or Jeremy know if you have items you would like raised/discussed.
Tier-1 - Status Page

Tuesday 16th April

  • Planned intervention on the database behind FTS/LFC services on Thursday was successful.
  • Starting more extensive tests of alternative batch system (slurm).
  • All new worker nodes in production. Now over 10K job slots (with hyperthreading).
  • Part of new disk purchase deployed. (540TB to AtlasDataDisk & 720TB to CMSDisk).
  • Investigations are ongoing into problems at batch job set-up.
Storage & Data Management - Agendas/Minutes

17 April

  • Good buzz at EGI CF last week: excellent GridPP presence, loads of useful people to talk to. We spent today's meeting comparing notes.

Tuesday 9th April

Monday 1st April

  • DDN report - see slides circulated by Pete G.

Wed 20 March 2013

  • Ruminated over the agenda items from last week's GDB
    • EMI roadmap (dCache, and other things)
    • FTS support for HTTP - we knew this but how do we make use of it now
    • Storage accounting records, needs updated APEL;
    • Work of storage group(s) on interfaces and protocols, and future furlongpebbles.
  • RAL D1T0 evaluation.
    • Seems to be settling on HDFS and CEPH which will be run anyway
    • what about Lustre?
    • Presentation to PMB next Monday, but no decision yet.



Accounting - UK Grid Metrics HEPSPEC06 Atlas Dashboard HS06

Tuesday 12th March

  • APEL publishing stopped for Lancaster, QMUL and ECDF

Tuesday 12th February

  • SL HS06 page shows some odd ratios. Steve says he now takes "HS06 cpu numbers direct from ATLAS" and his page does get stuck every now and then.
  • An update of the metrics page has been requested.
Documentation - KeyDocs

See the worst KeyDocs list for documents needing review now and the names of the responsible people.

Tuesday 9th April

  • Please could those responsible for key documents start addressing the completeness of documents for which they are responsible? Thank you.

Tuesday 26th February

KeyDocs monitoring status: Grid Storage(7/0) Documentation(3/0) On-duty coordination(3/0) Staged rollout(3/0) Ticket follow-up(3/0) Regional tools(3/0) Security(3/0) Monitoring(3/0) Accounting(3/0) Core Grid services(3/0) Wider VO issues(3/0) Grid interoperation(3/0) Cluster Management(1/0) (brackets show total/missing)

Monday, 4th March

  • New draft document for putting a CE in downtime. It discusses the pros and cons of three approaches. Needs discussion to finalise.
  • EPIC integrated into Approved VOs, pending full acceptance.
  • Process commissioned to delete stale documents.

Thursday, 29th November

The Approved VOs document has been updated to automatically contain a table that lays out the resource requirements for each VO, as well as the maximum. We need to discuss whether this is useful - it seems that the majority of WN software requirements are passed around by word of mouth etc. Should this be formalized? Please see

   https://www.gridpp.ac.uk/wiki/GridPP_approved_VOs#VO_Resource_Requirements

This table will be kept up to date with a regular process that syncs it with the CIC Portal, should it prove to be useful.

Interoperation - EGI ops agendas

Tuesday 9th April

  • There was an EGI ops meeting on 3rd April.
  • UMD/SR - note issues with CREAM in UMD-2 - also there's a new CREAM in EMI-2, with security updates. Does anyone in the UK run CREAM from UMD-2 at the moment?
  • EMI-2 WN tarball has passed SR. Expect a deadline for the upgrade soon. gLite 3.2 WN tarballs should be updated ASAP.
  • EMI-3 WMS on SL6 doesn't work with Argus (GGUS 92773)
  • EMI-3 VOMS Critical issue; fix scheduled April 18th.
  • Only APEL and VOMS appear to have stopped supporting YAIM core in the early EMI-3 release.

Tuesday 2nd April

  • Minutes of the 20th March EGI ops meeting are available.


Monitoring - Links MyWLCG

Tuesday 9th April

  • David C has material to present (Glasgow solutions to monitoring) but cannot make our Tuesday ops meeting. Looking at options.

Tuesday 5th February

  • Task will focus on probes and sharing of useful tools - suggestions and comments welcome.
  • Glasgow dashboard now packaged and can be downloaded here.
On-duty - Dashboard ROD rota

Monday 15th April

  • A lot of alarms because of a networking problem at the Tier-1 at the start of the week.
  • Three sites have open EMI tickets.

Monday 1st April

  • A new GOCDB field related to the ROD email address was not populated. Emails should now reach the team.

Tuesday 5th March

  • Handling tickets related to EMI-1 probes - what to expect.
  • Recommendation with respect to upgrading CE (drain first)

Tuesday 12th February

  • Need all ROD members to complete availability survey for the rota.
Rollout Status WLCG Baseline

Tuesday 2nd April

  • EMI-1 components should be out of production. Nagios probes will report critical this month. Services remaining (without special condition) beyond 30th April will need to be placed in downtime.

Monday 4th March

  • EMI early adopters list by component.
  • Do we have a Staged Rollout list for EMI3?

Tuesday 5th February

References


Security - Incident Procedure Policies Rota

Monday 16th April

  • Sites are continuing to upgrade their kernels to rectify CVE-2013-0871. This vulnerability is still considered HIGH risk by EGI-CSIRT.

Monday 8th April

  • We have a number of site notifications from Pakiti. Please check your site summary.

Tuesday 2nd April

  • Reminder about ptrace kernel issue (CVE-2013-0871)
  • Thanks to all those sites that took part in the security challenge

Tuesday 5th March

  • Two OpenAFS vulnerabilities announced (CVE-2013-1794 and CVE-2013-1795). Further details available at http://www.openafs.org/security. Updated RPMs for SL5/6 available.



Services - PerfSonar dashboard | GridPP VOMS

Tuesday 9th April

  • It is now becoming urgent to configure and enable the backup VOMS instances at Oxford and Imperial. Please can we arrange a follow-up meeting (last week's was postponed as Daniela was out)?

Tuesday 2nd April

  • Impending electrical work at Manchester - we need to commission the backup VOMS arrangement as soon as possible.

Monday 18th February

  • PerfSonar tests to BNL reveal poor rates for several sites since upgrade

Tuesday 5th February

  • NGS VOMS to be switched off this week
Tickets

Monday 15th April 2013 14.30 BST

26 open UK tickets this week; most seem in hand. Here are the ones that jump out. I have an ill-timed appointment at the vet's so I might not make it to the meeting in time, but the important bits are the gridpp.ac.uk ticket and the remaining three EMI1 upgrade tickets, which need updating by the corresponding sites (Glasgow, Durham, RALPP).

NGI/gridpp.ac.uk
https://ggus.eu/ws/ticket_info.php?ticket=93337 (15/4)
This one stumped me about where it should be sent to. The submitter is having cert problems with the gridpp.ac.uk website, possibly due to the CA certs being out of date. Assigned (15/4)
Update: Andrew sorted this out, and the user reports the problem solved. Looks like this can be closed.

GLASGOW
https://ggus.eu/ws/ticket_info.php?ticket=93343 (15/4)
This ticket has been assigned to NGS-GLASGOW, which I'm almost certain is wrong - can one of the Glasgow chaps check and reassign to themselves if I'm right? Assigned (15/4)
Update: Gareth solved this one.

EMI1 Upgrade
Only the DPM tickets at Glasgow and Durham, and the dCache ticket at RALPP, remain. There are special circumstances around all of them (DPM and dCache versioning is quite separate from the EMI number) but all three have requests for updates on them.
GLASGOW: https://ggus.eu/ws/ticket_info.php?ticket=92805
DURHAM: https://ggus.eu/ws/ticket_info.php?ticket=92804
RALPP: https://ggus.eu/ws/ticket_info.php?ticket=91997
Update: Chris has solved the RALPP ticket; although there are still errors on the dashboard, everything is upgraded.

RHUL
https://ggus.eu/ws/ticket_info.php?ticket=92969 (29/3)
Biomed reported seeing negative used space values for the RHUL DPM. Govind attempted to apply the old patch and failed, and has opened a new ticket with the DPM devs: https://ggus.eu/tech/ticket_show.php?ticket=93026
In Progress (might want to put On Hold if a new patch looks slow in coming) (10/4)

Of interest
https://ggus.eu/ws/ticket_info.php?ticket=92498
I overlooked this one last week, but QMUL's ticket charting their upgrade to EMI3 APEL might be of interest.

Tools - MyEGI Nagios

Tuesday 16th April

  • Installation of DIRAC instances at IC pending return of Janusz.

Tuesday 13th November

  • Noticed two issues during the Tier-1 power cut. The SRM and direct CREAM submission tests use the top BDII defined in the Nagios configuration to query for the resource. These tests started to fail because the RAL top BDII was not accessible. The configuration doesn't use BDII_LIST, so I cannot define more than one BDII. I am looking into how to make it more robust.
  • The Nagios web interface was not accessible to a few users because GOCDB was down. It is a bug in SAM-nagios and I have opened a ticket.

Site availability has not been affected by this issue, because Nagios sends a warning alert when it cannot find a resource through the BDII.
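
One general way to make a lookup of this kind more robust is to try several top BDII endpoints in order and use the first one that responds. The sketch below only illustrates that fallback idea; it is not how the SAM-Nagios tests are configured or implemented. The function name, the example hostnames and the use of port 2170 are assumptions for this example, and a real test would go on to run an LDAP query rather than just open a TCP connection.

```python
import socket


def first_reachable(endpoints, timeout=5.0, connect=socket.create_connection):
    """Return the first (host, port) that accepts a TCP connection, else None.

    endpoints: ordered list of (host, port) tuples, preferred first.
    connect:   injectable for testing; defaults to socket.create_connection.
    """
    for host, port in endpoints:
        try:
            conn = connect((host, port), timeout)
        except OSError:
            # Endpoint unreachable (refused, timed out, DNS failure): try the next.
            continue
        conn.close()
        return (host, port)
    return None


if __name__ == "__main__":
    # Hypothetical hostnames, for illustration only.
    candidates = [
        ("top-bdii-1.example.org", 2170),
        ("top-bdii-2.example.org", 2170),
    ]
    print(first_reachable(candidates, timeout=2.0))
```

With a list like this, the loss of a single top BDII would lead to a fallback to the next entry instead of a failed test.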


VOs - GridPP VOMS VO IDs Approved VO table

Monday 8th April

  • Please note Chris W is away this week.
  • Information is being gathered for the Q1 2013 quarterly report.

Tuesday 2 April 2013

Monday 4th March 2013

Monday 26th February 2013

  • NGS VOMS server. Durham fixed. Last site is Glasgow, and I'm running tests now. Hopefully this should now be fixed https://ggus.eu/ws/ticket_info.php?ticket=90356 - note that this has taken 3 months to complete.
  • SNO+ reports lcg-cp timeouts for large files. I suspect this is a problem with the UI.
  • Issues with Proxy renewal.
    • Certificate for RAL myproxy server doesn't match advertised hostname (how does this work at all?).
    • Other myproxy issues as well. GGUS#99105 GGUS#9172

SNO+ Questions

  • Jobs appear to fail, but have uploaded output and it is in LFC
  • MC production
    • Want 2-3 people managing this
    • Shifters monitoring sites and filing tickets
    • How best to manage certificates - currently upload two proxies to myproxy - one for jobs to renew and one for the UI to renew.
    • How best to do this - should they use a robot cert?


Site Updates

Actions


Meeting Summaries
Project Management Board - Members Minutes Quarterly Reports

Empty

GridPP ops meeting - Agendas Actions Core Tasks

Empty


RAL Tier-1 Experiment Liaison Meeting (Wednesday 13:30) Agenda EVO meeting

Wednesday 17th April

  • Operations report
  • The second of the two batches of new worker nodes has been deployed into production. All the 2012 CPU purchase is now in service and the batch farm currently has over 10,000 job slots.
  • The Post Mortem review of the failure of disk server GDSS594 (GenTape) in February that led to the loss of 68 T2K files has been completed. See here
WLCG Grid Deployment Board - Agendas MB agendas

Empty



NGI UK - Homepage CA

Empty

Events

Empty

UK ATLAS - Shifter view News & Links

Empty

UK CMS

Empty

UK LHCb

Empty

UK OTHER
  • N/A
To note

  • N/A