Operations Bulletin 060513

Bulletin archive


Week commencing 29th April 2013
Task Areas
General updates

Tuesday 30th April

  • There is a GDB next week (agenda).
  • WLCG squid monitoring is in place. How are we using it?
  • The EarthSci VO (now under earthsci.vo.gridpp.ac.uk) has been approved by the PMB.

Tuesday 23rd April

  • An almost final reminder for any site still running EMI-1 middleware: unless each component is upgraded by next Wednesday 1st May, the service must be put in downtime, except where there is a good technical reason (agreed with EGI) not to upgrade. Sites not complying face suspension.
  • EGI is chasing EMI product teams that have not indicated their plans post-EMI (i.e. after next week)! There are some significant components where ongoing development/support is unknown (WMS, EMI-Common (EMI-UI, EMI-WN, gLite-yaim-core, Torque server config, Torque WN config, emi-nagios), EMI-Messaging, gLite-Infosys and WNoDES).
  • No re-computations were requested for the March 2013 WLCG Tier-2 availability and reliability report.
  • Please could sites review the non-LHC VOs they support and consider supporting additional ones. LHC VO work has decreased and it would be good to support other communities' work (for example the enmr.eu VO) while we can. Use VomsSnooper to check your configurations (a rough check of the same idea is sketched after this list).
  • The sysadmin guide has been updated with new hardware and publishing requirements for top-bdii and bdii nodes.
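For reference, here is a minimal sketch in Python of the kind of configuration comparison VomsSnooper automates, assuming the standard /etc/vomses directory and a hand-typed approved-VO list; it is not a substitute for running VomsSnooper itself.

    #!/usr/bin/env python
    """Rough check of which VOs a site's vomses files declare, compared against
    a hand-maintained approved-VO list. Only a sketch of the kind of comparison
    VomsSnooper performs, not a replacement for it."""

    import glob
    import shlex

    # Assumption: paste the approved VO names in from the GridPP Approved VOs page.
    APPROVED_VOS = set(["atlas", "cms", "lhcb", "enmr.eu", "earthsci.vo.gridpp.ac.uk"])

    def configured_vos(vomses_dir="/etc/vomses"):
        """Return the set of VO names found in the site's vomses files.
        Each vomses line has the form: "alias" "host" "port" "server DN" "vo name"."""
        vos = set()
        for path in glob.glob(vomses_dir + "/*"):
            for line in open(path):
                line = line.strip()
                if not line or line.startswith("#"):
                    continue
                fields = shlex.split(line)  # respects the quoted fields
                if len(fields) >= 5:
                    vos.add(fields[4])
        return vos

    if __name__ == "__main__":
        site_vos = configured_vos()
        print("Configured but not approved: %s" % sorted(site_vos - APPROVED_VOS))
        print("Approved but not configured: %s" % sorted(APPROVED_VOS - site_vos))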


Tuesday 16th April

  • There was an EGI OMB on Friday (agenda)
  • It has been agreed that tickets stuck without a response after several reminders will be manually closed as 'unsolved' by GGUS (flow diagram).
  • HEPiX is taking place this week in Bologna. (agenda)
WLCG Operations Coordination - Agendas

Tuesday 30th April

Extracts from the 25th April 2013 meeting minutes

  • EGI operations
    • User suspension policy is under discussion; the emergency procedure for central suspension has been extended.
    • New GGUS workflows have been defined. The most important are that every supporter will have read/write access to all tickets, and that 'best effort' support from the support units or product teams will no longer be accepted. Tickets should not be left without an answer.
    • Continuation of support for several products still needs to be clarified, including WMS and EMI-Common (UI, WN, YAIM, Torque config, emi-nagios). EGI will liaise directly with the PTs to get information about release and software support plans.
  • Experiments
    • Atlas
      • SL6: see SL6 TF
      • cvmfs: waiting for a stable cvmfs 2.1 for sites that need the NFS export. 2.1.9 is promising but needs more testing.
    • CMS
      • Submission of HammerCloud through GlideinWMS: comparison to gLite done and OK. Will switch at the beginning of May.
      • Updates of Squid configuration for WLCG monitoring: about a third of the sites have done it. Being followed up in CMS Computing Operations.
    • LHCb
      • glexec: software problems solved; deployment is manpower-intensive though, and the experiment doesn't want to do it. WLCG needs to take care of it.
      • SL6: see TF
  • cvmfs
    • Atlas and LHCb deadline is 30th April [note: UK is OK]
    • CMS deadline is 1st April 2014, but already from 30th September no software installation jobs will be sent
    • CVMFS 2.1.9 is about to be released; the update is recommended but sites using the NFS export or at the 2.0 version should test it carefully for a few weeks.
    • Finally, the testing and deployment process is described; in particular, sites should upgrade their nodes in stages (a rough upgrade-tracking sketch follows these minutes). Interested sites are invited to join the "pre-production" effort.
  • glexec
    • Maarten asks that the MB reiterate that all sites need to install glexec
  • perfSONAR
    • Release candidate v3.3 is out. It is currently being tested in the US and at CERN and is not yet suggested to sites, but if no problems are discovered within two weeks sites will be encouraged to install this version.
  • SL6
    • Created a deployment page to track sites status: https://twiki.cern.ch/twiki/bin/view/LCG/SL6DeploymentSites
    • Increased information in procedures https://twiki.cern.ch/twiki/bin/view/LCG/SL6Migration#Procedures_and_how_to_contact_ex
      • None of the experiments want mixed queues; they ask for one queue per architecture
      • Best way to go is to reuse an existing queue if possible
      • WLCG repository has been created and should be enabled by all sites. It already contains the latest version of HEP_OSlibs [tested by Brunel and Oxford plus two Turkish sites]
      • LHCb requires the CE/queue information to be published in production in the BDII, otherwise they do not see it automatically; they would like to avoid manual steps for 150 sites. [RAL SL6 testing queue wasn't published and this is why it wasn't used. It is now]
      • Atlas has found a problem with an excessive number of file descriptors (similar to that observed by Brunel on their SL6 CE). The problem has been passed to the TF.
  • Frontier/squid
    • Squid upgrade is now being followed by CMS and Atlas computing respectively
    • Dave Dykstra is not part of squid support anymore
      • Representatives of CMS and Atlas Frontier/squid groups have joined WLCG Coord to replace him and help with future squid requests.
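Relating to the cvmfs staged-deployment item above, a rough sketch of how a site might track which batch of nodes has reached the recommended client version, assuming passwordless ssh to the worker nodes and that the client is installed as the 'cvmfs' RPM (node names are placeholders):

    #!/usr/bin/env python
    """Rough tracker for a staged cvmfs client upgrade: ask each worker node for
    its installed cvmfs RPM version and flag nodes still below the target.
    Assumes passwordless ssh from the admin host; node names are placeholders."""

    import subprocess

    NODES = ["wn%02d.example.ac.uk" % i for i in range(1, 11)]  # placeholder WN list
    TARGET = "2.1.9"

    def installed_version(node):
        """Return the cvmfs RPM version on a node, or None if the query fails."""
        cmd = ["ssh", "-o", "ConnectTimeout=5", node,
               "rpm -q --qf '%{VERSION}' cvmfs"]
        try:
            out = subprocess.check_output(cmd)
        except (subprocess.CalledProcessError, OSError):
            return None
        return out.decode("ascii", "replace").strip()

    if __name__ == "__main__":
        for node in NODES:
            version = installed_version(node)
            state = "OK" if version == TARGET else "PENDING"
            print("%-25s %-12s %s" % (node, version or "unreachable", state))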

Tuesday 23rd April

  • The next meeting takes place this Thursday (agenda)


Tier-1 - Status Page

Tuesday 30th April

  • Overall a quiet week operationally. We saw saturation of the network uplink over the last week, traced to a test data flow (now stopped).
  • 'Warning' (At Risk) on Castor tomorrow (Wed 1st May) for Oracle patches to be applied to back-end database.
  • The problem that was blocking the upgrade to Castor 2.1.13 has been understood. Final tests progressing before deployment.
  • A time-out problem in Castor that intermittently affected CMS SUM tests has been found and fixed.
  • Seven more disk servers deployed into CMSDisk.
  • Testing of alternative batch system (slurm) proceeding.
  • Investigations are ongoing into problems at batch job set-up.
Storage & Data Management - Agendas/Minutes

Wednesday 1 May 2013

  • Puppet report from March Puppet camp in London
  • Technical suggestions for hepsysman, or otherwise

Tuesday 30th April

  • The DPM collaboration formally starts on 1st May 2013. For those interested in the collaboration agreement see this page.

Friday 17th April

  • Good buzz at EGI CF last week: excellent GridPP presence, loads of useful people to talk to. We spent today's meeting comparing notes.

Tuesday 9th April


Accounting - UK Grid Metrics HEPSPEC06 Atlas Dashboard HS06

Tuesday 30th April

  • A discussion is starting about how to account/reward disk that is reallocated to LHCb. By way of background, LHCb is changing its computing model to use more of Tier-2 sites. They plan to start with a small number of big/good T2 sites in the first instance, and commission them as T2-Ds with disk. Ideally such sites will provide >300TB but for now may allocate 100TB and build it up over time. Andrew McNab is coordinating the activity for LHCb. (Note the PMB is already aware that funding was not previously allocated for LHCb disk at T2s).

Tuesday 12th March

  • APEL publishing stopped for Lancaster, QMUL and ECDF

Tuesday 12th February

  • SL HS06 page shows some odd ratios. Steve says he now takes "HS06 cpu numbers direct from ATLAS" and his page does get stuck every now and then.
  • An update of the metrics page has been requested.
Documentation - KeyDocs

See the worst KeyDocs list for documents needing review now and the names of the responsible people.

Tuesday 30th April

Tuesday 9th April

  • Please could those responsible for key documents start addressing the completeness of documents for which they are responsible? Thank you.

Tuesday 26th February

KeyDocs monitoring status: Grid Storage(7/0) Documentation(3/0) On-duty coordination(3/0) Staged rollout(3/0) Ticket follow-up(3/0) Regional tools(3/0) Security(3/0) Monitoring(3/0) Accounting(3/0) Core Grid services(3/0) Wider VO issues(3/0) Grid interoperation(3/0) Cluster Management(1/0) (brackets show total/missing)

Thu, 25th April

  • Process for de-emailing the host certificates of a site. This allows an admin to replace certificates that have email address components in the DN - they cause havoc in the security system.

Monday, 4th March

  • New draft document for putting a CE in downtime. It discusses the pros and cons of 3 approaches. Needs discussion to finalise.
  • EPIC integrated into Approved VOs, pending full acceptance.
  • Process commissioned to delete stale documents.

Thursday, 29th November

The Approved VOs document has been updated to automatically contain a table that lays out the resource requirements for each VO, as well as the maximum. We need to discuss whether this is useful - it seems that the majority of WN software requirements are passed around by word of mouth etc. Should this be formalized? Please see

   https://www.gridpp.ac.uk/wiki/GridPP_approved_VOs#VO_Resource_Requirements

This table will be kept up to date with a regular process that syncs it with the CIC Portal, should it prove to be useful.
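As a minimal sketch of the table-regeneration step only: the real process pulls the requirements from the CIC Portal, whereas here they are read from a hypothetical local CSV with illustrative column names, and MediaWiki table markup is emitted for pasting into the page.

    #!/usr/bin/env python
    """Sketch of regenerating the VO resource-requirements table. The input file
    name and column names (vo, cpu_hours, memory_mb, scratch_gb) are illustrative
    assumptions; the real sync would take these values from the CIC Portal."""

    import csv

    def wiki_table(rows):
        """Render rows (dicts with the hypothetical keys above) as MediaWiki markup."""
        lines = ['{| class="wikitable"',
                 "! VO !! Max CPU time (h) !! Max memory (MB) !! Scratch space (GB)"]
        for row in rows:
            lines.append("|-")
            lines.append("| %(vo)s || %(cpu_hours)s || %(memory_mb)s || %(scratch_gb)s" % row)
        lines.append("|}")
        return "\n".join(lines)

    if __name__ == "__main__":
        with open("vo_requirements.csv") as handle:   # hypothetical input file
            rows = list(csv.DictReader(handle))
        print(wiki_table(rows))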

Interoperation - EGI ops agendas

Tuesday 30th April

  • There was an EGI ops meeting on 24th April. (agenda).
  • Note there is a change in the recommended hardware for running a top-BDII.

Tuesday 9th April

  • There was an EGI ops meeting on 3rd April.
  • UMD/SR - note issues with CREAM in UMD-2 - also there's a new CREAM in EMI-2, with security updates. Does anyone in the UK run CREAM from UMD-2 at the moment?
  • EMI-2 WN tarball has passed SR. Expect a deadline for the upgrade soon. gLite 3.2 WN tarballs should be updated ASAP.
  • EMI-3 WMS on SL6 doesn't work with Argus (GGUS 92773)
  • EMI-3 VOMS Critical issue; fix scheduled April 18th.
  • Only APEL and VOMS appear to have stopped supporting YAIM core in the early EMI-3 release.


Monitoring - Links MyWLCG

Tuesday 9th April

  • David C has material to present (Glasgow solutions to monitoring) but cannot make our Tuesday ops meeting. Looking at options.

Tuesday 5th February

  • Task will focus on probes and sharing of useful tools - suggestions and comment welcome
  • Glasgow dashboard now packaged and can be downloaded here.
On-duty - Dashboard ROD rota

Monday 29th April

  • Not much to report. This is the week the EMI1 tickets will have to be dealt with one way or another.
  • RALPP has upgraded their CEs, but the EMI1 alarm keeps coming back intermittently, I guess there's a bad cache somewhere. The other sites with open EMI1 tickets are Glasgow (for the WMS) and Durham.

Monday 22nd April

  • Quiet week. Not very much progress (or response!) on tickets about EMI 1 upgrades though.

Monday 15th April

  • A lot of alarms because of a networking problem at the Tier-1 at the start of the week.
  • Three sites have open EMI tickets.


Rollout Status WLCG Baseline

Tuesday 23rd April

  • Please could sites fill out the EMI-3 testing contributions page. This is for all testing, not just SR sites, as we want to know which sites have experience with each component.

Tuesday 2nd April

  • EMI-1 components should be out of production. Nagios probes will report critical this month. Services remaining (without special condition) beyond 30th April will need to be placed in downtime.

Monday 4th March

  • EMI early adopters list by component.
  • Do we have a Staged Rollout list for EMI3?

Tuesday 5th February

References


Security - Incident Procedure Policies Rota

Tuesday 30th April

  • Sites are continuing to upgrade their kernels to rectify CVE-2013-0871. Security dashboard shows 3 sites with outstanding upgrades. This vulnerability is still considered HIGH risk by EGI-CSIRT.
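A quick local check that sites could adapt when confirming the kernel upgrade, sketched in Python; the threshold kernel version below is a placeholder assumption, and the authoritative fixed version should be taken from the relevant vendor erratum for CVE-2013-0871.

    #!/usr/bin/env python
    """Quick local check of whether the running kernel is at least the version the
    site treats as patched for CVE-2013-0871. The threshold below is a placeholder
    assumption; take the real fixed version from the relevant SL/RHEL erratum."""

    import os
    import re

    PATCHED = "2.6.32-358.0.1"   # placeholder threshold, not the authoritative value

    def version_key(release):
        """Turn a kernel release string into a tuple of integers for comparison."""
        return tuple(int(part) for part in re.findall(r"\d+", release))

    if __name__ == "__main__":
        running = os.uname()[2]   # e.g. "2.6.32-279.22.1.el6.x86_64"
        if version_key(running) >= version_key(PATCHED):
            print("Kernel %s is at or above the threshold %s" % (running, PATCHED))
        else:
            print("Kernel %s is OLDER than %s - schedule a reboot into the patched kernel"
                  % (running, PATCHED))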

Monday 8th April

  • We have a number of site notifications from Pakiti. Please check your site summary.


Services - PerfSonar dashboard | GridPP VOMS

Tuesday 23rd April

Tuesday 9th April

  • It is now getting urgent to configure and enable the backup VOMS instances at Oxford and Imperial. Please can we arrange a follow-up meeting (postponed last week as Daniela was out).
Tickets

Monday 29th April 2013, 14.45 BST
Only 17 open tickets assigned to the UK NGI this week. Make that 16 open.

EMI UPGRADE SEASON
No doubt this will be covered elsewhere in the meeting, but with the deadline imminent it doesn't hurt to repeat ourselves on this.

RALPP: https://ggus.eu/ws/ticket_info.php?ticket=93676 (26/4)
The site got re-ticketed about this on Friday, and the chaps might not have noticed it yet. As Daniela pointed out, if this is the red herring that it looks to be, we need to counter-ticket the EU Nagios by the end of the month to avoid the ban-hammer. Assigned (26/4) Update - Chris, Stephen and Daniela are in discussion about what to do; it could be the Nagios caching things it shouldn't, or it could be related to the CERN BDII problems - https://ggus.eu/ws/ticket_info.php?ticket=93650

GLASGOW: https://ggus.eu/ws/ticket_info.php?ticket=93632 (24/4)
Glasgow closed their other tickets, but got a new one about their WMSii for their trouble (no rest for the wicked?). Gareth has stated the plan to take these EMI1 WMS down on the 30th, to be brought back when/if the rebuild troubles they've seen can be worked out.

MUNDANE TICKETS
gridpp.ac.uk
https://ggus.eu/ws/ticket_info.php?ticket=93337 (15/4)
This ticket still looks like it's solved; if no one objects I'll close it myself (I assume the solution was "updated certificates on the web server"?). In progress (can be closed) (23/4)

VOMS
https://ggus.eu/ws/ticket_info.php?ticket=92306 (7/3)
Setting up the earthsci VO. Robert has asked for David and Gareth's e-mail addresses to use for the VO records. Waiting for reply (24/4)

TIER 1
https://ggus.eu/ws/ticket_info.php?ticket=93654 (25/4)
Chris has put in a request to have the T2K LFC at RAL upgraded from a "local" to a "global" LFC. The RAL team are on it. In progress (26/4)

GLASGOW
https://ggus.eu/ws/ticket_info.php?ticket=93642 (24/4)
I've singled out this ticket for two reasons. The first is that you should discourage VOs from piling extra, unrelated issues onto an existing ticket. The second is that sites should remember that tickets that are "re-opened" on you still need to have their statuses changed once they land back in your lap. (The ticket is also technically interesting as it codifies the problems Glasgow have been seeing with pile jobs on their many-core nodes, but this has been discussed in the Atlas UK meetings.) In progress (29/4) Update - the storm has passed and things have calmed down; Elena closed the ticket.

OXFORD
https://ggus.eu/ws/ticket_info.php?ticket=93532 (29/4)
I think this CMS ticket can be put to Waiting for Reply now that you have your SL6 nodes working, but I'm not sure enough to interfere with it myself. On hold (29/4)

No Tickets in the solved pile catch my eye.

Tickets of Interest.

https://ggus.eu/tech/ticket_show.php?ticket=93701
Chris ticketed the Argus support unit requesting a man page for pap-admin. The ticket was promptly closed (unsolved) with "not enough man power to produce a man page", further stating that the help command should be sufficient. A little concerning.

https://ggus.eu/ws/ticket_info.php?ticket=92498
The ticket covering Chris and his EMI3 APEL migration. As I'm going to have to migrate to EMI3 soon for the improved LSF support this is very relevant to my interests.

In fact let's keep going with a few more of Chris' tickets...

https://ggus.eu/tech/ticket_show.php?ticket=91587
Memory Leak in BUpdaterSGE. Chris upgraded to EMI3 and still sees the issue.

https://ggus.eu/tech/ticket_show.php?ticket=88976
"glite-wn-info doesn't list any conf files" I think that this was supposed to be fixed in EMI3 WN, but there has been deathly silence from the WN devs (are there any now?).

Tools - MyEGI Nagios

Tuesday 16th April

  • Installation of DIRAC instances at IC pending return of Janusz.

Tuesday 13th November

  • Noticed two issues during the Tier-1 power cut. The SRM and direct CREAM submission tests use the top BDII defined in the Nagios configuration to query about the resource. These tests started to fail because the RAL top BDII was not accessible. Nagios does not use BDII_LIST, so I cannot define more than one BDII. I am looking into how to make this more robust (see the fallback sketch below).
  • The Nagios web interface was not accessible to a few users because GOCDB was down. This is a bug in SAM-nagios and I have opened a ticket.

Site availability has not been affected by this issue because Nagios only raises a warning alert when it cannot find a resource through the BDII.
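As a sketch of the fallback idea mentioned above (hostnames are placeholders; port 2170 and base "o=grid" are the usual top-BDII defaults), one could try each configured top BDII in turn and use the first that answers a trivial LDAP query:

    #!/usr/bin/env python
    """Sketch of a top-BDII fallback: try each configured top BDII in turn and
    use the first one that answers a trivial anonymous LDAP query, instead of
    relying on a single host. Hostnames below are placeholders."""

    import os
    import subprocess

    TOP_BDIIS = ["top-bdii1.example.ac.uk", "top-bdii2.example.ac.uk"]  # placeholders

    def bdii_responds(host, port=2170, base="o=grid"):
        """Return True if the BDII answers a base-scope anonymous LDAP query."""
        cmd = ["ldapsearch", "-x", "-LLL",
               "-H", "ldap://%s:%d" % (host, port),
               "-b", base, "-s", "base", "objectClass"]
        try:
            with open(os.devnull, "w") as devnull:
                subprocess.check_call(cmd, stdout=devnull, stderr=devnull)
            return True
        except (subprocess.CalledProcessError, OSError):
            return False

    def first_working_bdii():
        """Return the first responsive top BDII, or None if none answer."""
        for host in TOP_BDIIS:
            if bdii_responds(host):
                return host
        return None

    if __name__ == "__main__":
        choice = first_working_bdii()
        if choice:
            print("Using top BDII: %s" % choice)
        else:
            print("No top BDII reachable")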


VOs - GridPP VOMS VO IDs Approved VO table

Monday 8th April

  • Please note Chris W is away this week.
  • Information is being gathered for the Q1 2013 quarterly report.

Tuesday 2 April 2013

Monday 4th March 2013

Site Updates

Actions


Meeting Summaries
Project Management Board - Members Minutes Quarterly Reports

Empty

GridPP ops meeting - Agendas Actions Core Tasks

Empty


RAL Tier-1 Experiment Liaison Meeting (Wednesday 13:30) Agenda EVO meeting

Wednesday 1st May

  • Operations report
  • It has been a quiet week with steady running. There was a problem of saturation of the uplink traced to a CMS test.
  • The fix to occasional time-out problems seen during Castor access has been rolled out. SUM test results are now cleaner.
  • There are a number of interventions in the pipeline. These await detailed scheduling, but include:
    • Switch of Castor Production/Standby databases.
    • Re-establish paired (2 * 10Gbit) uplink.
    • Castor 2.1.13 upgrade, now that the blocking issue found during testing has been resolved.
WLCG Grid Deployment Board - Agendas MB agendas

Empty



NGI UK - Homepage CA

Empty

Events

Empty

UK ATLAS - Shifter view News & Links

Empty

UK CMS

Empty

UK LHCb

Empty

UK OTHER
  • N/A
To note

  • N/A