Difference between revisions of "Operations Bulletin 160913"

From GridPP Wiki

Latest revision as of 09:38, 16 September 2013

Bulletin archive


Week commencing 9th September 2013
Task Areas
General updates

Tuesday 10th September

  • There is a GDB tomorrow - Agenda.
  • EGI is looking for NGIs to volunteer for central banning testing. In the UK we plan to deploy ARGUS at every site on the same timeline as deploying glexec at each site.
  • Some changes have recently been implemented to improve the communications channels of various EGI security teams: the EGI CSIRT and EGI Incident Response Task Force (IRTF). The new lists are csirt and irtf at mailman.egi.eu respectively and now include the backup team members.
  • There will be an EGI User Community Board meeting at the TF. The purpose of this meeting is to "discuss new needs emerging from EGI user communities, in particular new use cases for integration of heterogeneous data from multiple sources to support interdisciplinary science, value added services for computing, simulation and data exploration that reuse tools and services available from existing infrastructures". If you have any input please let Jeremy know.


Tuesday 3rd September

  • There is a call (deadline this Thursday) for expressions of interest to attend the WLCG workshop in November.
  • We are joined by two new year-in-industry students: Sam Worley and Mohit Mittal. Please be patient with them as they learn how to direct tickets!
  • Progress with the HEPiX puppet working group was expected via an announcement on hepix-users last week (Bob Jones).
  • The GOCDB v5 release, which was provisionally scheduled for the 2nd of September, was delayed while some newly encountered issues are resolved (compatibility problems and hardware issues experienced by the server hosting the GOCDB test instance at https://gocdb-test.esc.rl.ac.uk/v5). The delay will be of order 3 weeks.

  • The agenda for next week's GDB is now online.
WLCG Operations Coordination - Agendas

Tuesday 2nd September

  • Middleware
    • New BDII release in the latest EMI-2/3 update, including better GLUE-2 support and security fixes. Sites should update all their BDII instances
    • New CVMFS version released for a security fix. Sites should upgrade or at least apply the hot fix in the above twiki
    • perfSONAR: sites should upgrade to the latest version, fixing many deployment problems
    • The end of support for dCache 1.9.12 has been postponed to September 30 due to a delay in releasing the SHA-2 compliant version in the dCache 2.2 series.
    • Consult https://twiki.cern.ch/twiki/bin/view/LCG/WLCGBaselineVersions
  • SHA-2
    • Discussion mostly dedicated to the experiments' testing status. Atlas and LHCb have tested the services but not job submission yet. All experiments have been encouraged to test this.
  • SL6
    • T2 Done: 49/129 (Alice 11/39, Atlas 28/89, CMS 22/65, LHCb 13/45) -> 80/129 still to be done.
    • HS06: Reminder that sites are requested to run HS06 benchmark and update the value in the BDII. Increased values might be discussed at the WLCG MB.
    • EMI-3: voms-clients have been fixed and the latest version is in the PT repository but not in EMI-3 yet. Both CMS and Atlas work on DPM/dCache sites with this patch. (QMUL might want to give an update on StoRM when they upgrade.)
    • UK status: Liverpool to be finished soon; Bham in downtime to upgrade this week; Bristol and Sussex should be done by 15/09/2013; RALPP by 20/09/2013; and QMUL, Lancaster and UCL by 30/09/2013.
  • glexec
    • 55 sites still to respond they have attached the installation to SL6 upgrade.

Monday 12th August

  • There have been no recent meetings. The next is on 29th August.
      • Sites monitoring requirements: SUM tests not representing the real experiment status for example.
Tier-1 - Status Page

Tuesday 10th September

  • There were Castor problems over the weekend (mainly affecting Atlas). Diagnosis was difficult owing to a problem with the logging.
  • New batch farm: We have decided that we will move to Condor and that we will have (at least initially) both ARC and CREAM CEs. We are currently moving more nodes ('08 & '09 WNs) into the Condor farm to extend capacity, with WNs upgraded to SL6 as they migrate. Depending on how this goes, we will decide whether the remaining WNs need to be upgraded to SL6 whilst still on the old Torque/Maui farm or whether all nodes can be migrated to Condor/SL6.
  • WMSs being upgraded to EMI-3 (for SHA-2 compliance).
  • During tomorrow's Liaison meeting there will be a presentation about the work to add a WebDav interface to Castor.
Storage & Data Management - Agendas/Minutes

Tuesday 15th July

  • Three sites now run all UK FTS traffic via the 'FTS3' service as a test. Mostly successful (a small issue with a few US sites is to be resolved before taking tests further).

Tuesday 28th May

  • The 'Big Data' agenda is being compiled here. There is also now a suggestion for a cross disciplinary clouds and virtualisation workshop in July - the idea is 'in progress' but no more detail is yet available.


Accounting - UK Grid Metrics HEPSPEC06 Atlas Dashboard HS06

Tuesday 13th August

Tuesday 23rd July

  • Sites moving to SL6 are reminded of the need to re-benchmark their WNs. Some sites have updated the wiki already and provide an idea of the performance change.
  • There is an ongoing PMB discussion about the timeline for the next Tier-2 hardware tranche. Please let Pete or Jeremy know if your site will benefit from a spend this financial year.

Tuesday 30th April

  • A discussion is starting about how to account/reward disk that is reallocated to LHCb. By way of background, LHCb is changing its computing model to use more of Tier-2 sites. They plan to start with a small number of big/good T2 sites in the first instance, and commission them as T2-Ds with disk. Ideally such sites will provide >300TB but for now may allocate 100TB and build it up over time. Andrew McNab is coordinating the activity for LHCb. (Note the PMB is already aware that funding was not previously allocated for LHCb disk at T2s).

Tuesday 12th March

  • APEL publishing stopped for Lancaster, QMUL and ECDF

Tuesday 12th February

  • SL HS06 page shows some odd ratios. Steve says he now takes "HS06 cpu numbers direct from ATLAS" and his page does get stuck every now and then.
  • An update of the metrics page has been requested.
Documentation - KeyDocs

See the worst KeyDocs list for documents needing review now and the names of the responsible people.

Tuesday 3 September 2013

  • Proposal for "Instant UI", with the aim of producing a suite of documentation and software that will enable a new user to set up a UI and join the grid with the minimum of hassle. The documentation will show how to administer a UI that can be used to submit jobs and retrieve output for a given set of users belonging to a given set of VOs. "Instant UI" is currently in the consultation phase with the GridPP admin community.

Tuesday 27th August

  • The document status was reviewed at a core ops meeting last Thursday. Several documents are to be reassigned and some removed.
  • Looking at VO lifecycle.

KeyDocs monitoring status: Grid Storage(7/0) Documentation(3/0) On-duty coordination(3/0) Staged rollout(3/0) Ticket follow-up(3/0) Regional tools(3/0) Security(3/0) Monitoring(3/0) Accounting(3/0) Core Grid services(3/0) Wider VO issues(3/0) Grid interoperation(3/0) Cluster Management(1/0) (brackets show total/missing)

Interoperation - EGI ops agendas

Monday 2nd September

  • Yesterday's agenda. Attended by David and Raul.

1. Middleware releases and staged rollout

1.1 News from URT

  • StoRM 1.11.2: still not released. The Product Team found an issue (fixed) during the testing and the fix is currently under testing (within days if everything goes well).
  • Gridsite 2.1.3 (UMD-3) and 1.7.28 (UMD-2): to be released during this week by the PT. Not in EPEL yet (the gridsite PT is not releasing in EPEL yet). Single fix for a problem affecting delegation.
  • DPM v1.8.7: need to remove packages from the EMI repositories before releasing in EPEL.
  • dCache 2.2.15: released. Not the SHA-2 compliant version for UMD-2; 2.2.16 will include SHA-2 support and will be released tomorrow (Tuesday).
  • Globus 3.2.2:
    • Bug fix release from IGE
    • GridWay
    • GSI-SSHTerm
    • BES-GRAM
    • Globus Info Provider Service


1.2 VOMS: A question was asked, "has voms been tested with IPv6 now?" The answer was not yet, as there were no IPv6 machines in the testbed. This was to be followed up.

1.3 UMD-3 to be released ~next week, waiting a couple of days to allow for the possibility of StoRM update to be tested.

2.2 Host name in the host certificates

As reported in several GGUS tickets (Rules for issuing certificates for hosts with an alias, Problems related to a myproxy service), host certificates must have the hostname used to reach the service in either the CN or the DNS fields (Common Name and Subject Alternative Name). Please make sure to provide all the needed information to your CA when requesting a new host certificate.

The only service which this affects in the UK is lcgrbp01.gridpp.rl.ac.uk ; Peter noted that he would issue tickets shortly (if not by now).
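The hostname rule above can be sketched as a small check. This is a minimal illustration, not an official EGI or CA tool: it assumes the certificate has already been decoded into the dictionary layout that Python's `ssl.SSLSocket.getpeercert()` returns, it only does exact case-insensitive matching (no wildcard handling), and the hostnames in the example are hypothetical.

```python
def names_in_cert(cert):
    """Collect the Common Name(s) and DNS Subject Alternative Names of a cert.

    `cert` is assumed to follow the dict layout of ssl.SSLSocket.getpeercert().
    """
    names = set()
    # 'subject' is a tuple of RDNs, each RDN a tuple of (key, value) pairs.
    for rdn in cert.get("subject", ()):
        for key, value in rdn:
            if key == "commonName":
                names.add(value.lower())
    # 'subjectAltName' is a tuple of (type, value) pairs.
    for kind, value in cert.get("subjectAltName", ()):
        if kind == "DNS":
            names.add(value.lower())
    return names


def covers_hostname(cert, hostname):
    """True if the name used to reach the service appears in the CN or DNS SAN."""
    return hostname.lower() in names_in_cert(cert)


# Hypothetical host reached through an alias.
example_cert = {
    "subject": ((("commonName", "host01.example.org"),),),
    "subjectAltName": (("DNS", "host01.example.org"), ("DNS", "myproxy.example.org")),
}
print(covers_hostname(example_cert, "myproxy.example.org"))
print(covers_hostname(example_cert, "other.example.org"))
```

A service reached through an alias passes only if that alias was given to the CA when the certificate was requested, which is the point of the reminder above.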

2.A. Peter Solagna reported on the central emergency suspension project. He requested that NGIs report back to him on whether each has an NGI ARGUS instance for use by the framework, so that the central ARGUS server can have a limited set of servers talking to it.

As a reminder, register ARGUS servers in GOCDB.

2.3 UserDN publication in the accounting records. Reminder: follow up with sites generating alarms for non-published User DN. Currently only a few sites have 'critical' failures on that check.

2.4 CVMFS webinar: they trailed Catalin's webinar, as discussed on TB-SUPPORT.

2.4 AAI (Authentication Authorization Infrastructure) workshop at TF Madrid: Tuesday September 17th at 11:00.

AOB: Monday 16th September, presentation by Product Teams during the Technical Forum.

3.2 Next meeting: September 23rd, 14:00 Amsterdam time. This might be modified because of the OMB meeting on Friday 27th.

No meeting in 2 weeks because of TF.


gLite support calendar.


Monitoring - Links MyWLCG

Tuesday 23rd July

Tuesday 18th June

  • David C is taking feedback on the Graphite implementation presented at the HEPSYSMAN meeting. Also considering integrating Site Nagios.
  • Glasgow dashboard now packaged and can be downloaded here.
On-duty - Dashboard ROD rota

Monday 2nd September

  • A lot of intermittent Nagios failures due to a Tier-1 BDII issue. The Imperial WMS also had some issues on Thursday and a lot of Nagios jobs stayed in waiting because of that.
  • One ticket escalated to the NGI.
  • A few SHA2 tickets remain open. Bristol is testing an ARC CE and that CE was not in production but monitored. It was generating alarms in Dashboard. Luke agreed to put it in downtime until fixed.
  • Several of the SHA-2 tickets have gone red as they've reached 30 days of being open. We need to decide how to handle them.
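
The 30-day threshold mentioned above can be sketched as a small filter. This is only an illustration: the ticket identifiers and opening dates below are made up, and the function is a hypothetical helper, not part of the dashboard or GGUS.

```python
from datetime import date

OVERDUE_AFTER_DAYS = 30  # threshold quoted in the bulletin text


def overdue_tickets(opened_on, today, limit_days=OVERDUE_AFTER_DAYS):
    """Return the IDs of tickets that have been open for at least `limit_days`.

    `opened_on` maps a ticket ID to the date it was opened.
    """
    return sorted(
        ticket_id
        for ticket_id, opened in opened_on.items()
        if (today - opened).days >= limit_days
    )


# Made-up tickets, for illustration only.
tickets = {
    "TICKET-A": date(2013, 8, 1),
    "TICKET-B": date(2013, 8, 20),
}
print(overdue_tickets(tickets, today=date(2013, 9, 2)))
```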

Tuesday 27th August

  • A COD ticket was raised due to overdue SHA2 tickets: https://ggus.eu/ws/ticket_info.php?ticket=96765
  • QMUL has an odd alarm for a non-production machine: eu.egi.MPI-GOCDB-Check. The machine appears to be declared correctly in the GOCDB.
  • Sussex is quite far behind in its APEL publishing.
  • Durham has several issues.
Rollout Status WLCG Baseline

Tuesday 9th July

  • New EMI2 and EMI3 release yesterday. No staged rollout requests yet. Imperial upgraded their WMS and they have been somewhat shaky ever since.

Tuesday 18th June

  • New EMI3 CE coming into SR. Liverpool will test.
  • A lot of EMI3 testing done at Brunel.
  • The EMI-3 testing page contains all the issues I am aware of. It's a wiki though, so if you find an issue, please put it in the appropriate category.

Tuesday 14th May

  • A reminder. Please could sites fill out the EMI-3 testing contributions page. This is for all testing not just SR sites as we want to know which sites have experience with each component.

References


Security - Incident Procedure Policies Rota

Tuesday 3rd September

  • Security contacts and system staff of the partners of EGI, EUDAT and PRACE are invited for a joint security event from Monday, October 7 - Wednesday, October 9 in Linköping, Sweden. Monday and Tuesday will be a training event which should be of interest for all staff managing the systems which are part of our infrastructures. The second part of the meeting will consist of a discussion of security policies and the collaboration among the different infrastructures.
  • Rob H has moved to the Tier-1. Looking at options for security team.
  • At the next team meeting we should review our approach to 100percentit (just moved to uncertified).

Tuesday 20th August

  • Several sites showing up in pakiti this week.
  • Update on workaround discussed last week.


Services - PerfSonar dashboard | GridPP VOMS

Tuesday 27th August

  • PerfSONAR: version 3.3.1 was released on 21st Aug. This update should fix problems with the WLCG mesh (which can now be included in the config file). Indeed traceroute and pingER are now working at Cambridge and Imperial but not, yet, at Oxford or Lancaster. Please could sites upgrade to this or, at a minimum, check that their existing perfSONAR host(s) are working OK on the dashboard.


Tickets

Monday 9th September 2013, 15.00 BST
We have 50 open UK tickets - a number of issues have cropped up but I don't see any patterns. I went through all the tickets last week, so we'll just skim them today. Everyone's doing a good job, and a lot of the fresh issues that have cropped up are being handled nicely.

GLEXEC
Not much movement on the GLEXEC tickets since last week, but none was expected really.

HyperK
https://ggus.eu/ws/ticket_info.php?ticket=96235
https://ggus.eu/ws/ticket_info.php?ticket=96233
Whilst testing the HyperK VO Chris noticed some problems when he used proxies from voms2 and voms3 for the WMS and LFC, I guess due to a misconfiguration at the RAL end (and not problems with the voms servers), but I'm often wrong. Both In Progress (6/9)

PERFSONAR
Duncan has been busy poking sites that were looking bad in:
http://perfsonar.racf.bnl.gov:8080/exda/?page=25&cloudName=UK
A few sites have fixed the problems already, other sites are working on it. No tickets seem stalled (the Durham one was reopened though, they hadn't quite quashed the gremlins).

Well that's not many tickets is it. Nothing too exciting on the solved ticket pile either.

If anyone has any issues they want brought up please let us know.

And finally, a reminder that if you thought you read something in one of the Ticket Round-Ups that has since been overwritten and you can't be bothered trawling through the wiki history to resurrect the information, I keep the old Ticket Round-Ups here:
https://www.gridpp.ac.uk/wiki/Past_Ticket_Bulletins

(or if you're really looking for a blast from the past: https://www.gridpp.ac.uk/wiki/Past_Ticket_Bulletins_2012 ).

Tools - MyEGI Nagios

Monday 2nd September

  • Intermittent Nagios errors -> Imperial WMS and all the jobs going through it were failing with ‘no compatible error’. Some reports of ongoing issues. What is the direct impact?
  • MyEGI and gstat were also down last week.
  • Jens is testing SHA-2 compliance of components. The version of gridsite on the GridPP website is not compliant but SHA-2 will be supported with a move to a new server (when?).

Monday 12 Aug

  • VO-Nagios

t2wlcgnagios.physics.ox.ac.uk monitors a few UK VOs and uses a Robot Certificate. Due to some confusion the Robot Certificate expired. I had applied for an extension beforehand, but Cert Wizard doesn't understand Robot Certificates and I thought that it had been extended. Finally Jens stepped in and sorted it out. VO Nagios is now working.

  • SHA2 Certificates

I have been issued a SHA2 certificate by Jens. I tested a few CEs and some interesting results came out. The GridPP VOMS server is SHA2 compatible, so SHA2 proxies can be created for VOs hosted at voms.gridpp.ac.uk. None of the CERN VOMS servers are SHA2 compatible, but there is a workaround: adding a secondary SHA2 certificate. Details are here: https://twiki.cern.ch/twiki/bin/view/LCG/SHA2readinessTesting#SHA_2_VOMS_server. I have added my SHA2 certificate but it is not approved yet as most people are on holiday. Interestingly, when I submitted a few jobs using ngs.ac.uk with the SHA2 certificate, they finished successfully on CEs which are not supposed to be SHA2 compliant. I will test again with the OPS VO to confirm it.
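
For anyone wanting to confirm that a certificate really is SHA-2 signed before testing with it, a minimal sketch follows. It assumes the text has been produced beforehand with `openssl x509 -noout -text -in cert.pem` and simply inspects the first "Signature Algorithm" line. The helper names are hypothetical and this is not the procedure used in the tests above.

```python
import re

# Digest names belonging to the SHA-2 family, as they appear in OpenSSL's output.
SHA2_DIGESTS = ("sha224", "sha256", "sha384", "sha512")


def signature_algorithm(openssl_text):
    """Extract the first 'Signature Algorithm' value from `openssl x509 -text` output."""
    match = re.search(r"Signature Algorithm:\s*(\S+)", openssl_text)
    return match.group(1) if match else None


def is_sha2_signed(openssl_text):
    """True if the certificate's signature algorithm uses a SHA-2 digest."""
    algorithm = signature_algorithm(openssl_text)
    if algorithm is None:
        return False
    lowered = algorithm.lower()
    return any(digest in lowered for digest in SHA2_DIGESTS)


# Shortened, illustrative fragments of the openssl text output.
print(is_sha2_signed("    Signature Algorithm: sha256WithRSAEncryption\n"))
print(is_sha2_signed("    Signature Algorithm: sha1WithRSAEncryption\n"))
```

Note that this only checks how the certificate itself was signed. Whether a CE or VOMS server accepts it is a separate question, which is what the tests described above probe.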

Tuesday 23rd July

  • In a campaign to update VO ID card details it turns out that a few of our supported VOs are obsolete: babar, possibly supernemo and ngs.ac.uk. The first of these can be safely removed but we need to confirm our announcement process.
VOs - GridPP VOMS VO IDs Approved VO table

Monday 2nd September

  • The next quarterly Tier 1 allocation/resourcing meeting is scheduled for Wednesday 18th September (after the weekly T1 meeting). The hardware requirements and fair-shares for the period October-December 2013 will be reviewed, looking ahead over the next 12-month timeframe. Can all experiments/projects please let Pete G have any updates or requests to these numbers by Friday 13th September?

Monday 19 August

  • EPIC
    • Support requested at Tier-1
    • Any other sites prepared to support them?
  • Catalogue synchronisation - Biomed working on it.


Monday 12 August

  • HyperK.org
    • VOMS servers set up (Manchester, Oxford, Imperial)
    • VOID card - stalled on a homepage.
    • WMS set up (Imperial) - awaiting Glasgow, RAL
    • Site set up (QMUL)
    • LFC - in progress
    • CVMFS - considering
  • SNO+
    • Dirac set up for some CEs
  • Epic
    • Doing stuff
  • ngs.ac.uk VO - any reason to keep it?
  • Software areas for SL6
    • Are we keeping the same areas as sl5?
    • What about the software tags?
    • Push CVMFS?
Site Updates

Actions


Meeting Summaries
Project Management Board - Members Minutes Quarterly Reports

Empty

GridPP ops meeting - Agendas Actions Core Tasks

Empty


RAL Tier-1 Experiment Liaison Meeting (Wednesday 13:30) Agenda EVO meeting

Wednesday 4th September

  • Operations report
  • The Change Control process has agreed that we will move to a Condor batch farm with (at least initially) both ARC and CREAM CEs. Final testing is ongoing with this farm (ARC-CEs, Condor, SL6). The '08 and '09 batches of worker nodes are currently being drained and moved from the old (Torque/Maui) farm to the Condor farm. However, we have not yet finalised whether the migration of all nodes to SL6 will be done by moving the remaining WNs to the Condor farm or whether a portion of the farm will be upgraded 'in-situ' in the Torque/Maui farm.
  • RAL will move to SuperJanet 6 on Tuesday 1st October.
  • There was a presentation (and demonstration - only visible to those at RAL) about the WebDAV interface to CASTOR.
WLCG Grid Deployment Board - Agendas MB agendas

Empty



NGI UK - Homepage CA

Empty

Events

Empty

UK ATLAS - Shifter view News & Links

Empty

UK CMS

Empty

UK LHCb

Empty

UK OTHER
  • N/A
To note

  • N/A