Operations Bulletin 100613

From GridPP Wiki
Latest revision as of 17:58, 10 June 2013

Bulletin archive


Week commencing 3rd June 2013
Task Areas
General updates

Monday 3rd June

  • Bristol is considering moving to a new subnet. See the LCG-ROLLOUT thread for more details.
  • Currently checking the status of MPI across UK sites (more to follow).

Tuesday 28th May

  • There is an EGI OMB this morning (agenda).
  • Some VOs are hitting the default 365-day membership point. VO-admins can extend the default to a longer period and can renew individual memberships.
  • In process of putting names against Tier-2 GDB representation rota.
  • Updating of the HS06 table.
  • There was a GridPP cloud meeting on Friday (agenda: minutes).
  • A reminder to provide informal feedback on your site findings on EMI2/3 and SL5/6 compatibility.
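The 365-day default membership point mentioned above can be illustrated with a small sketch. `membership_expiry` and `needs_renewal` are hypothetical helper names, not part of VOMS (VOMS enforces expiry server-side; the VO-admin changes the default lifetime there):

```python
from datetime import date, timedelta

# Illustrative sketch only: VOMS manages expiry server-side. A VO-admin can
# raise the default lifetime or renew individual memberships as noted above.
DEFAULT_LIFETIME_DAYS = 365  # the default that VOs are now hitting

def membership_expiry(joined: date, lifetime_days: int = DEFAULT_LIFETIME_DAYS) -> date:
    """Date on which a membership created on `joined` lapses."""
    return joined + timedelta(days=lifetime_days)

def needs_renewal(joined: date, today: date, warn_days: int = 30) -> bool:
    """True if the membership expires within `warn_days` of `today`."""
    return membership_expiry(joined) - today <= timedelta(days=warn_days)
```

A VO-admin sweeping the membership list with something like `needs_renewal` would catch members approaching the default cut-off before their proxies stop working.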


Monday 20th May

  • The next update for the EGI Trust Anchor distribution has been pushed to the RT SW-REL queue as ticket #5544. It will be released on or after 27th May.
  • 7 sites have yet to update their Squid monitoring to work with http://wlcg-squid-monitor.cern.ch/. The instructions are online.
  • UMD 3.0.0 has been released (details).
  • The WLCG SHA-2 timeline was approved at a recent EUGridPMA meeting (notes).
  • Pete updated the HEPSYSMAN agenda to allow for longer puppet/hiera templates discussion.


WLCG Operations Coordination - Agendas

Monday 3rd June

  • Minutes from the WLCG ops coordination meeting last Thursday.

Tuesday 28th May

Next Meeting 30th May 2013

Tuesday 21st May

  • Minutes from the 16th May meeting
  • From the EGI talk on UMD 3.0.0 note slide 7.

Tuesday 14th May

Next Meeting 16th May 2013

Tuesday 30th April

Extracts from the 25th April 2013 meeting minutes

  • EGI operations
    • User suspension policy is under discussion; the emergency procedure for central suspension has been extended.
    • GGUS: new workflows have been defined. The most important are that every supporter will have read/write access to all tickets, and that best-effort-only responses from support units or product teams will not be accepted. Tickets should not be left without an answer.
    • Continuation of support for several products still needs to be clarified, including: WMS, EMI-Common (UI, WN, YAIM, Torque config, emi-nagios), EGI will liaise directly with PTs to get information about release and software support plans;
  • Experiments
    • Atlas
      • SL6: see SL6 TF
      • cvmfs: waiting for a stable cvmfs 2.1 for sites that need NFS export. 2.1.9 is promising but needs more testing.
    • CMS
      • Submission of HammerCloud through GlideinWMS: comparison to gLite done - OK. Will switch at the beginning of May.
      • Updates of Squid configuration to WLCG monitoring: about a third of the sites have done it. Followed up in CMS Computing Operations.
    • LHCb
      • glexec: software problems solved; deployment is manpower intensive though, and the experiment doesn't want to do it. WLCG needs to take care of it.
      • SL6: see TF
  • cvmfs
    • Atlas and LHCb deadline is the 30 April [note: UK is ok]
    • The CMS deadline is 1 April 2014, but already from 30 September no software installation jobs will be sent.
    • CVMFS 2.1.9 is about to be released; the update is recommended but sites using the NFS export or at the 2.0 version should test it carefully for a few weeks.
    • Finally, the testing and deployment process is described; in particular, sites should upgrade their nodes in stages. Interested sites are invited to join the "pre-production" effort.
  • glexec
    • Maarten asks that the MB reiterates that all sites need to install glexec.
  • perfSONAR
    • Release candidate v3.3 is out. It is currently being tested in the US and at CERN and is not yet suggested to sites, but if no problems are discovered within 2 weeks sites will be encouraged to install this version.
  • SL6
    • Created a deployment page to track sites status: https://twiki.cern.ch/twiki/bin/view/LCG/SL6DeploymentSites
    • Increased information in procedures https://twiki.cern.ch/twiki/bin/view/LCG/SL6Migration#Procedures_and_how_to_contact_ex
      • None of the experiments wants mixed queues; they ask for one queue per architecture.
      • Best way to go is to reuse an existing queue if possible
      • WLCG repository has been created and should be enabled by all sites. It already contains the latest version of HEP_OSlibs [tested by Brunel and Oxford plus two Turkish sites]
      • LHCb requires the CE/queue information to be published in production in the BDII otherwise they don't see them automatically and would like to avoid manual steps for 150 sites. [RAL SL6 testing queue wasn't published and this is why it wasn't used. It is now]
      • Atlas has found a problem with the excessive number of file descriptors (similar to those observed by Brunel on their SL6 CE). Problem has been passed to the TF.
  • Frontier/squid
    • Squid upgrade is now being followed by CMS and Atlas computing respectively
    • Dave Dykstra is not part of squid support anymore
      • Representatives of CMS and Atlas Frontier/squid groups have joined WLCG Coord to replace him and help with future squid requests.
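On the Atlas file-descriptor point above (also seen by Brunel on their SL6 CE), a site admin can quickly check a node's per-process open-file limits. This is a generic sketch using Python's `resource` module; the 4096 threshold is illustrative, not an experiment requirement:

```python
import resource

def nofile_limits() -> tuple[int, int]:
    """Return the (soft, hard) per-process open-file limits (RLIMIT_NOFILE)."""
    return resource.getrlimit(resource.RLIMIT_NOFILE)

def check_nofile(minimum: int = 4096) -> bool:
    """Flag hosts whose soft limit is below `minimum` (threshold illustrative)."""
    soft, _hard = nofile_limits()
    return soft == resource.RLIM_INFINITY or soft >= minimum
```

Run across worker nodes, a check like this would show whether a queue's default `ulimit -n` is low enough to trip the problem the task force is now chasing.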

Tuesday 23rd April

  • The next meeting takes place this Thursday (agenda)


Tier-1 - Status Page

Tuesday 28th May

  • Today's planned update of Tier1 Castor (nameserver & Atlas instance) was cancelled. Problems were found a week ago in the other (non-Tier1) instance that had been upgraded. This problem is now largely understood - will get back to re-scheduling the Tier1 upgrade in due course.
  • Testing of alternative batch systems (SLURM & Condor) is proceeding.
  • Investigations are still ongoing into problems at batch job set-up.
Storage & Data Management - Agendas/Minutes

Tuesday 28th May

  • The 'Big Data' agenda is being compiled here. There is also now a suggestion for a cross-disciplinary clouds and virtualisation workshop in July - the idea is 'in progress' but no more detail is yet available.

Tuesday 21st May

  • Do we have an agenda page for the June workshop?

Wednesday 1 May 2013

  • Puppet report from March Puppet camp in London
  • Technical suggestions for hepsysman, or otherwise

Tuesday 30th April

  • The DPM collaboration formally starts on 1st May 2013. For those interested in the collaboration agreement see this page.


Accounting - UK Grid Metrics HEPSPEC06 Atlas Dashboard HS06

Tuesday 30th April

  • A discussion is starting about how to account/reward disk that is reallocated to LHCb. By way of background, LHCb is changing its computing model to use more of Tier-2 sites. They plan to start with a small number of big/good T2 sites in the first instance, and commission them as T2-Ds with disk. Ideally such sites will provide >300TB but for now may allocate 100TB and build it up over time. Andrew McNab is coordinating the activity for LHCb. (Note the PMB is already aware that funding was not previously allocated for LHCb disk at T2s).

Tuesday 12th March

  • APEL publishing stopped for Lancaster, QMUL and ECDF

Tuesday 12th February

  • SL HS06 page shows some odd ratios. Steve says he now takes "HS06 cpu numbers direct from ATLAS" and his page does get stuck every now and then.
  • An update of the metrics page has been requested.
Documentation - KeyDocs

See the worst KeyDocs list for documents needing review now and the names of the responsible people.

Tuesday 30th April

Tuesday 9th April

  • Please could those responsible for key documents start addressing the completeness of documents for which they are responsible? Thank you.

Tuesday 26th February

KeyDocs monitoring status: Grid Storage(7/0) Documentation(3/0) On-duty coordination(3/0) Staged rollout(3/0) Ticket follow-up(3/0) Regional tools(3/0) Security(3/0) Monitoring(3/0) Accounting(3/0) Core Grid services(3/0) Wider VO issues(3/0) Grid interoperation(3/0) Cluster Management(1/0) (brackets show total/missing)
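The status line above can be tabulated programmatically. A minimal sketch of parsing the `Area(total/missing)` entries; `parse_keydocs` is a hypothetical helper, not part of any GridPP tooling:

```python
import re

# The KeyDocs monitoring line as published above, in Area(total/missing) form.
STATUS = ("Grid Storage(7/0) Documentation(3/0) On-duty coordination(3/0) "
          "Staged rollout(3/0) Ticket follow-up(3/0) Regional tools(3/0) "
          "Security(3/0) Monitoring(3/0) Accounting(3/0) Core Grid services(3/0) "
          "Wider VO issues(3/0) Grid interoperation(3/0) Cluster Management(1/0)")

def parse_keydocs(line: str) -> dict[str, tuple[int, int]]:
    """Map each task area to its (total, missing) KeyDoc counts."""
    pattern = re.compile(r"([A-Za-z -]+?)\((\d+)/(\d+)\)")
    return {name.strip(): (int(total), int(missing))
            for name, total, missing in pattern.findall(line)}
```

Parsed this way, the line above yields thirteen task areas with zero missing documents, which matches the "worst KeyDocs" list being empty of urgent entries.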

Thu, 25th April

  • Process for de-emailing the host certificates of a site. This allows an admin to replace certificates with email address components in the DN - they cause havoc in the security system.
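A sketch of how a site might spot the offending certificates before replacing them. `has_email_component` is a hypothetical helper that flags OpenSSL-style DNs carrying an email RDN (the component the de-emailing process removes); it is not part of the documented process itself:

```python
import re

# Matches an email RDN in an OpenSSL one-line DN, e.g. /emailAddress=... or /E=...
# These are the DN components the bulletin says "cause havoc in the security system".
EMAIL_RDN = re.compile(r"/(?:emailAddress|Email|E)=([^/]+)", re.IGNORECASE)

def has_email_component(dn: str) -> bool:
    """True if the OpenSSL-style DN contains an email address component."""
    return EMAIL_RDN.search(dn) is not None
```

An admin could run this over the DNs of all host certificates at the site and re-request any that match, following the de-emailing process linked above.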

Monday, 4th March

  • New draft document for putting a CE in downtime. It discusses the pros and cons of 3 approaches. Needs discussion to finalise.
  • EPIC integrated into Approved VOs, pending full acceptance.
  • Process commissioned to delete stale documents.

Thursday, 29th November

The Approved VOs document has been updated to automatically contain a table that lays out the resource requirements for each VO, as well as the maximum. We need to discuss whether this is useful - it seems that the majority of WN software requirements are passed around by word of mouth etc. Should this be formalized? Please see

   https://www.gridpp.ac.uk/wiki/GridPP_approved_VOs#VO_Resource_Requirements

This table will be kept up to date with a regular process that syncs it with the CIC Portal, should it prove to be useful.

Interoperation - EGI ops agendas

Tuesday 21st May

  • Updates from 15th May EGI ops meeting.
  • StagedRollout - EMI/UMD 3 update
    • A few minor update issues on LFC; Top BDII; DPM; ARGUS; UI; WN and LB. (Details)
    • More significant points: EMI-3 CREAM installs EMI-3 APEL by default; see here for a way to stick with EMI-2 APEL.
    • EMI-3 APEL is _not_ backwards compatible, and needs configuration changes on the central APEL. There is an upgrade plan to be followed. (Summary: you need to open a GGUS ticket and have a dialogue with the APEL team to do the upgrade.)
    • StoRM will be supported on EMI-1 until 21st July.
    • EMI/UMD-2 LB server: there's an incompatibility with the logging package in EPEL. If anyone runs this, don't do updates until it is fixed (or see this workaround).

Monday 6th May

  • There was an update of the log4c package (from v1.2.1 to v1.2.3) in EPEL which is not compatible with, and breaks, the LB version currently available in the UMD-2 repositories (LB v3.2.8). For the moment, I would not recommend updating the log4c package. This issue affects both SL5 and SL6 installations.

Developers provided a workaround which is now described in the LB known issues.


Tuesday 30th April

  • There was an EGI ops meeting on 24th April. (agenda).
  • Note there is a change in the recommended hardware for running a top-BDII.
gLite support calendar.


Monitoring - Links MyWLCG

Tuesday 13th May

  • David C will present the monitoring work at Glasgow (based on Graphite) at the HEPSYSMAN meeting in June.
  • Glasgow dashboard now packaged and can be downloaded here.
On-duty - Dashboard ROD rota

Tuesday 28th May

  • Very quiet week. No open tickets.

Tuesday 21st May

  • Quiet week. The only outstanding issue was the QMUL StoRM.

Monday 13th May

  • QMUL received an EMI-1 ticket for StoRM. Otherwise a quiet week.
Rollout Status WLCG Baseline

Tuesday 14th May

  • A reminder: please could sites fill out the EMI-3 testing contributions page. This is for all testing, not just SR sites, as we want to know which sites have experience with each component.

References


Security - Incident Procedure Policies Rota

Tuesday 21st May

  • SL6 vulnerability. Need to track progress. (See private thread).

Tuesday 30th April

  • Sites are continuing to upgrade their kernels to rectify CVE-2013-0871. Security dashboard shows 3 sites with outstanding upgrades. This vulnerability is still considered HIGH risk by EGI-CSIRT.
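A site wanting to verify its own status could compare the running kernel release against the patched one from the advisory. A minimal sketch; the `PATCHED` value below is a placeholder, not the actual CVE-2013-0871 fix version, which the EGI-CSIRT advisory names:

```python
def kernel_tuple(release: str) -> tuple[int, ...]:
    """Parse a release like '2.6.32-358.2.1.el6' into a comparable int tuple."""
    base = release.split(".el")[0]          # drop the '.el5'/'.el6' suffix
    return tuple(int(p) for p in base.replace("-", ".").split("."))

# Placeholder value: substitute the minimum patched kernel from the advisory.
PATCHED = "2.6.32-358.2.1.el6"

def is_patched(running: str, patched: str = PATCHED) -> bool:
    """True if the running kernel is at or above the patched release."""
    return kernel_tuple(running) >= kernel_tuple(patched)
```

Feeding in the output of `uname -r` gives a quick local check to compare against what the security dashboard (via Pakiti) reports for the site.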

Monday 8th April

  • We have a number of site notifications from Pakiti. Please check your site summary.


Services - PerfSonar dashboard | GridPP VOMS

Monday 20th May

  • Letter sent to Internet2 from GridPP management.

Tuesday 14th May

  • The perfSONAR support team is asking for statements from the projects using it, to help secure funding for their team. The WLCG TF is looking for statements from the WLCG MB and computing coordinators, but it was agreed that statements from the sites would also help. Below is the email sent to the users mailing list.

Tuesday 23rd April

Tuesday 9th April

  • It is now getting urgent to configure and have enabled the backup VOMS instances at Oxford and Imperial. Please can we arrange a follow-up meeting (postponed last week as Daniela was out).
Tickets

Stardate 04-06-13-00.04

No proper ticket overview this week, as we're defending the Federation from Klingon hackers.

My net connection is a bit ropey at Starfleet's barracks here at Cosener's, but of the 16 UK tickets there's a RHUL and a QMUL one that need attending to (probably as the admins have been busy defending the free galaxy):

   https://ggus.eu/ws/ticket_info.php?ticket=94521 (RHUL)
   https://ggus.eu/ws/ticket_info.php?ticket=94510 (QMUL)

There's a ROD ticket I don't really understand (as it concerns Lancaster, but as far as I know we haven't experienced any 72-hour problems):

   https://ggus.eu/ws/ticket_info.php?ticket=94519

This Sussex ticket concerning Sno+ looks as if it can be closed:

   https://ggus.eu/ws/ticket_info.php?ticket=94241

Whilst this other Sno+ ticket to Glasgow looks like it could use some love:

   https://ggus.eu/ws/ticket_info.php?ticket=94213

Could Kashif or someone else with Nagios expertise please comment on this Biomed ticket to IC? Biomed are confused about Nagios versions and need advice:

   https://ggus.eu/ws/ticket_info.php?ticket=94358

If I see the admins listed tomorrow I'll give you a personal prod.

Live long and prosper!

Tools - MyEGI Nagios

Tuesday 16th April

  • Installation of DIRAC instances at IC pending return of Janusz.

Tuesday 13th November

  • Noticed two issues during the Tier1 power cut. The SRM and direct CREAM submission tests use the top BDII defined in the Nagios configuration to query the resource, so these tests started to fail when the RAL top BDII became inaccessible. The configuration doesn't use BDII_LIST, so I cannot define more than one BDII. I am looking into how to make this more robust.
  • The Nagios web interface was not accessible to a few users because GOCDB was down. This is a bug in SAM-nagios and I have opened a ticket.

Site availability has not been affected by this issue because Nagios sends a warning alert when it cannot find a resource through the BDII.
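The BDII_LIST behaviour asked for above amounts to simple failover across several top-level BDIIs. A sketch under that assumption: `query` stands in for the real LDAP lookup, and the endpoint names are illustrative, not a recommended configuration:

```python
from typing import Callable, Optional

# Illustrative comma-separated list in the style of the BDII_LIST variable.
BDII_LIST = "lcg-bdii.gridpp.rl.ac.uk:2170,topbdii.grid.hep.ph.ic.ac.uk:2170"

def query_with_failover(bdii_list: str,
                        query: Callable[[str], Optional[dict]]) -> Optional[dict]:
    """Try each top BDII in turn; return the first successful result.

    Returns None if every endpoint fails, so the caller can raise a
    warning alert (as Nagios does) rather than a hard test failure.
    """
    for endpoint in bdii_list.split(","):
        try:
            result = query(endpoint.strip())
        except OSError:          # endpoint unreachable, e.g. during a power cut
            continue
        if result is not None:
            return result
    return None
```

With this shape, the SRM and CREAM tests would have fallen back to a second top BDII during the RAL outage instead of failing outright.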


VOs - GridPP VOMS VO IDs Approved VO table

Thurs 6th June

  • SNO+ jobs now work through the Glasgow WMS

Mon 20 May

  • RAL wms02 and wms03 seem to have been taken out of commission but were still in the information system.
  • Glasgow WMS doesn't accept SNO+ jobs (https://ggus.eu/ws/ticket_info.php?ticket=94213)
  • SNO+ filling with water and expect to be taking test data Aug/Sept - expect more grid use after that.
  • EPIC doing serious testing - running at Glasgow, Liverpool and Lancaster.

Thurs 16 May

  • SL6 - likely to be deployed for LHC VOs; non-LHC VOs should be aware - see mail to the vo-admins list.


Monday 8th April

  • Please note Chris W is away this week.
  • Information is being gathered for the Q1 2013 quarterly report.

Tuesday 2 April 2013

Monday 4th March 2013

Site Updates

Actions


Meeting Summaries
Project Management Board - MembersMinutes Quarterly Reports

Empty

GridPP ops meeting - Agendas Actions Core Tasks

Empty


RAL Tier-1 Experiment Liaison Meeting (Wednesday 13:30) Agenda EVO meeting

Wednesday 22nd May

  • Operations report
  • On Tuesday (21st) the Tier1 Castor & batch services were stopped for a change in the RAL network, which was carried out successfully.
  • Plans for updating Tier1 Castor to version 2.1.13-9 are delayed. The (non-Tier1) Castor instance that has been upgraded has uncovered a problem that needs to be understood and resolved before the Tier1 Castor instances are upgraded.
WLCG Grid Deployment Board - Agendas MB agendas

Empty



NGI UK - Homepage CA

Empty

Events

Empty

UK ATLAS - Shifter view News & Links

Empty

UK CMS

Empty

UK LHCb

Empty

UK OTHER
  • N/A
To note

  • N/A