Operations Bulletin 071013

From GridPP Wiki
Revision as of 10:41, 7 October 2013 by Jeremy Coles


Bulletin archive


Week commencing 30th September 2013
Task Areas
General updates

Tuesday 17th September

  • The EGI Technical Forum is taking place this week. Take a look at the agenda. Not all talks are yet online. There are training materials/sessions on security and Glue2 publishing validation.
  • The GPGPU working group in EGI is seeking more participants.
  • The SHA-2 deadline has moved to 1st December.
  • There is a new CHEP 2013 bulletin.
  • WLCG continues to run a twice-weekly operations meeting on Mondays and Thursdays. See the minutes here.
  • The final WLCG T2 reliability/availability report for August has now been released.


Tuesday 10th September

  • There is a GDB tomorrow - Agenda.
  • EGI is looking for NGIs to volunteer for central banning testing. In the UK we plan to deploy ARGUS at every site on the same timeline as deploying glexec at each site.
  • Some changes have recently been implemented to improve the communications channels of various EGI security teams: the EGI CSIRT and EGI Incident Response Task Force (IRTF). The new lists are csirt and irtf at mailman.egi.eu respectively and now include the backup team members.
  • There will be an EGI User Community Board meeting at the TF. The purpose of this meeting is to "discuss new needs emerging from EGI user communities, in particular new use cases for integration of heterogeneous data from multiple sources to support interdisciplinary science, value added services for computing, simulation and data exploration that reuse tools and services available from existing infrastructures". If you have any input please let Jeremy know.
WLCG Operations Coordination - Agendas

Tuesday 17th September

  • The agenda of the next WLCG operations meeting is available here. The details of the agenda are not yet final. Participation of the Tier-1 contacts is strongly encouraged, and Tier-2 sites are also welcome to listen in and contribute (via Vidyo).

Tuesday 2nd September

  • Middleware
    • New BDII release in the latest EMI-2/3 update, including better GLUE-2 support and security fixes. Sites should update all their BDII instances.
    • New CVMFS version released for a security fix. Sites should upgrade or at least apply the hot fix in the above twiki.
    • perfSONAR: sites should upgrade to the latest version, fixing many deployment problems
    • The end of support for dCache 1.9.12 has been postponed to September 30 due to a delay in releasing the SHA-2 compliant version in the dCache 2.2 series.
    • Consult https://twiki.cern.ch/twiki/bin/view/LCG/WLCGBaselineVersions
  • SHA-2
    • Discussion mostly dedicated to the experiments testing status. Atlas and LHCb have tested the services but not job submission yet. All experiments have been encouraged to test this.
  • SL6
    • T2 Done: 49/129 (Alice 11/39, Atlas 28/89, CMS 22/65, LHCb 13/45) -> 80/129 still to be done.
    • HS06: Reminder that sites are requested to run HS06 benchmark and update the value in the BDII. Increased values might be discussed at the WLCG MB.
    • EMI-3: voms-clients have been fixed and the latest version is in the PT repository but not in EMI-3 yet. Both CMS and Atlas work on DPM/dCache sites with this patch. (QMUL might want to give an update on StoRM when they upgrade.)
    • UK status: Liverpool to be finished soon, Bham in downtime to upgrade this week, Bristol and Sussex should be done by 15/9/2013, RALPP by 20/09/2013 and QMUL, Lancaster and UCL by 30/09/2013.
  • glexec
    • 55 sites have still to respond; they have tied the glexec installation to their SL6 upgrade.

Monday 12th August

  • There have been no recent meetings. The next is on 29th August.
      • Sites monitoring requirements: SUM tests not representing the real experiment status for example.
Tier-1 - Status Page

Tuesday 1st October

  • All three Top BDIIs have been updated to the latest software version.
  • The Condor batch farm was marked as being in 'Production' status in the GOC DB on Monday. It has around 50% of the batch capacity. We are just checking plans to upgrade the remaining nodes to SL6 (staying within the Torque/Maui farm) - hopefully this will be done on Thursday.
  • The RAL site was successfully upgraded to JANET 6 this morning.
  • We have announced an 'At Risk' for a UPS/generator load test tomorrow morning (Wednesday 2nd Oct).
Storage & Data Management - Agendas/Minutes

Tuesday 17th September

  • Perhaps someone could summarise the "Dark Data identification tools" thread on TB-Support?

Tuesday 15th July

  • Three sites now run all UK FTS traffic via the 'FTS3' service as a test. Mostly successful (a small issue with a few US sites is to be resolved before taking the tests further).

Tuesday 28th May

  • The 'Big Data' agenda is being compiled here. There is also now a suggestion for a cross disciplinary clouds and virtualisation workshop in July - the idea is 'in progress' but no more detail is yet available.


Accounting - UK Grid Metrics HEPSPEC06 Atlas Dashboard HS06

Tuesday 13th August

Tuesday 23rd July

  • Sites moving to SL6 are reminded of the need to re-benchmark their WNs. Some sites have updated the wiki already and provide an idea of the performance change.
  • There is an ongoing PMB discussion about the timeline for the next Tier-2 hardware tranche. Please let Pete or Jeremy know if your site will benefit from a spend this financial year.

Tuesday 30th April

  • A discussion is starting about how to account/reward disk that is reallocated to LHCb. By way of background, LHCb is changing its computing model to make more use of Tier-2 sites. They plan to start with a small number of big/good T2 sites in the first instance, and commission them as T2-Ds with disk. Ideally such sites will provide >300TB but for now may allocate 100TB and build it up over time. Andrew McNab is coordinating the activity for LHCb. (Note the PMB is already aware that funding was not previously allocated for LHCb disk at T2s.)

Tuesday 12th March

  • APEL publishing stopped for Lancaster, QMUL and ECDF

Tuesday 12th February

  • SL HS06 page shows some odd ratios. Steve says he now takes "HS06 cpu numbers direct from ATLAS" and his page does get stuck every now and then.
  • An update of the metrics page has been requested.
Documentation - KeyDocs

See the worst KeyDocs list for documents needing review now and the names of the responsible people.

Tuesday 17th September

  • Little obvious change in the status table since last week.

Tuesday 3 September 2013

  • Proposal for "Instant UI", with the aim to produce a suite of documentation and software that will enable a new user to set up a UI and join the grid with the minimum of hassle. The document will show how to administer a UI that can be used to submit jobs and retrieve output for a given set of users belonging to a given set of VOs. "Instant UI" is currently in a consultation phase with the GridPP admin community.

Tuesday 27th August

  • The document status was reviewed at a core ops meeting last Thursday. Several documents are to be reassigned and some removed.
  • Looking at VO lifecycle.

KeyDocs monitoring status: Grid Storage(7/0) Documentation(3/0) On-duty coordination(3/0) Staged rollout(3/0) Ticket follow-up(3/0) Regional tools(3/0) Security(3/0) Monitoring(3/0) Accounting(3/0) Core Grid services(3/0) Wider VO issues(3/0) Grid interoperation(3/0) Cluster Management(1/0) (brackets show total/missing)
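
The status line above can be tallied with a few lines of Python. This is only a minimal sketch: it assumes every entry has the form Name(total/missing) exactly as in the line above, and the function name parse_keydocs is hypothetical rather than part of any GridPP tool.

```python
import re

# Status line copied from the bulletin (brackets show total/missing).
STATUS_LINE = (
    "Grid Storage(7/0) Documentation(3/0) On-duty coordination(3/0) "
    "Staged rollout(3/0) Ticket follow-up(3/0) Regional tools(3/0) "
    "Security(3/0) Monitoring(3/0) Accounting(3/0) Core Grid services(3/0) "
    "Wider VO issues(3/0) Grid interoperation(3/0) Cluster Management(1/0)"
)

# One entry: an area name followed by "(total/missing)".
ENTRY = re.compile(r"\s*(.+?)\((\d+)/(\d+)\)")


def parse_keydocs(line):
    """Return {area: (total, missing)} for a KeyDocs status line."""
    return {
        name.strip(): (int(total), int(missing))
        for name, total, missing in ENTRY.findall(line)
    }


status = parse_keydocs(STATUS_LINE)
total_docs = sum(total for total, _ in status.values())
missing_docs = sum(missing for _, missing in status.values())
# Areas with at least one missing document are the ones needing attention.
areas_needing_review = [area for area, (_, missing) in status.items() if missing > 0]
print(total_docs, missing_docs, areas_needing_review)
```

Any area with a non-zero second number would appear in areas_needing_review.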

Interoperation - EGI ops agendas

Monday 16th September

  • The next meeting takes place on 23rd September at 13:00 (UK time).
  • UMD 3.2.0 was released last week. See the release page for more information.

Monday 2nd September

  • Yesterday's agenda. Attended by David and Raul.

1. Middleware releases and staged rollout

1.1 News from URT

  • StoRM 1.11.2
    • Still not released.
    • The Product Team found an issue (fixed) during the testing and the fix is currently under testing (within days if everything goes well).
  • Gridsite 2.1.3 (UMD-3) and 1.7.28 (UMD-2)
    • To be released during this week by the PT.
    • Not in EPEL yet (the gridsite PT is not releasing in EPEL yet).
    • Single fix for a problem affecting delegation.
  • DPM v1.8.7
    • Need to remove packages from the EMI repositories before releasing in EPEL.
  • dCache 2.2.15 released
    • Not the SHA-2 compliant version for UMD-2.
    • 2.2.16 will include SHA-2 support and will be released tomorrow (Tuesday).
  • Globus 3.2.2
    • Bug fix release from IGE
    • GridWay
    • GSI-SSHTerm
    • BES-GRAM
    • Globus Info Provider Service


1.2 VOMS: A question was asked, "has voms been tested with IPv6 now?" - the answer was not yet, as there were no IPv6 machines in the testbed - this was to be followed up.

1.3 UMD-3 to be released ~next week, waiting a couple of days to allow for the possibility of StoRM update to be tested.

2.2 Host name in the host certificates

As reported in several GGUS tickets (Rules for issuing certificates for hosts with an alias, Problems related to a myproxy service), host certificates must have the hostname used to reach the service in either the CN or DNS fields (Common Name and Subject Alternative Name). Please make sure to provide all the needed information to your CA when requesting a new host certificate.

The only service which this affects in the UK is lcgrbp01.gridpp.rl.ac.uk; Peter noted that he would issue tickets shortly (if not by now).
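
As a rough illustration of the rule in 2.2, the check can be sketched in Python. This is a minimal sketch, not a CA or middleware tool. It assumes the certificate has already been decoded into the dictionary layout that Python's ssl.SSLSocket.getpeercert() returns. The function names and the example.org host names are hypothetical, and it does exact, case-insensitive matching only (no wildcard handling).

```python
def names_in_cert(cert):
    """Collect the CN and DNS subjectAltName entries from a certificate dict.

    The dict layout is the one returned by ssl.SSLSocket.getpeercert().
    """
    names = set()
    # 'subject' is a tuple of RDNs, each a tuple of (key, value) pairs.
    for rdn in cert.get("subject", ()):
        for key, value in rdn:
            if key == "commonName":
                names.add(value.lower())
    # 'subjectAltName' is a tuple of (type, value) pairs.
    for kind, value in cert.get("subjectAltName", ()):
        if kind == "DNS":
            names.add(value.lower())
    return names


def hostname_covered(cert, hostname):
    """Return True if hostname appears as the CN or as a DNS alternative name."""
    return hostname.lower() in names_in_cert(cert)


# Hypothetical certificate for a host that is also reached through an alias.
example_cert = {
    "subject": ((("commonName", "host01.example.org"),),),
    "subjectAltName": (
        ("DNS", "host01.example.org"),
        ("DNS", "myproxy.example.org"),
    ),
}

for name in ("host01.example.org", "myproxy.example.org", "other.example.org"):
    print(name, hostname_covered(example_cert, name))
```

A service reached through an alias that is missing from both fields would fail this check, which is the situation the GGUS tickets describe.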

2.A. Peter Solagna reported on the central emergency suspension project. He requested that NGIs report back to him on whether each has an NGI ARGUS instance for use by the framework, so that the central ARGUS server can have a limited set of servers talking to it.

As a reminder, register ARGUS servers in GOCDB.

2.3 UserDN publication in the accounting records

Reminder: follow up with sites generating alarms for non-published User DNs. Currently only a few sites have 'critical' failures on that check.

2.4 CVMFS webinar

They trailed Catalin's webinar as discussed on TB-SUPPORT.

2.4 AAI workshop at TF Madrid

The AAI (Authentication Authorization Infrastructure) workshop is on Tuesday September 17th at 11:00.

AOB

Monday 16th September: presentation by Product Teams during the Technical Forum.

3.2 Next meeting

September 23rd, 14:00 Amsterdam time. This might be modified because of the OMB meeting on Friday 27th.

No meeting in 2 weeks because of TF.


gLite support calendar.


Monitoring - Links MyWLCG

Tuesday 23rd July

Tuesday 18th June

  • David C is taking feedback on the Graphite implementation presented at the HEPSYSMAN meeting. Also considering integrating Site Nagios.
  • Glasgow dashboard now packaged and can be downloaded here.
On-duty - Dashboard ROD rota

Tuesday 17th September

  • Daniela is on duty this week.
  • A lot of alarms last week because of the Top BDII issue at the RAL Tier-1. The lcg-* commands use the Top BDII to get SE endpoints, so this issue is affecting actual jobs as well.
  • SHA2 tickets still open for three sites.
  • SE issue at Durham is still going on. Open ticket.
  • Opened ticket for Sussex APEL issue. No response from the site.


Monday 2nd September

  • A lot of intermittent Nagios failures due to the Tier-1 BDII issue. The Imperial WMS also had some issues on Thursday and a lot of Nagios jobs stayed in waiting because of that.
  • One ticket escalated to the NGI.
  • A few SHA2 tickets remain open. Bristol is testing an ARC CE and that CE was not in production but monitored. It was generating alarms in Dashboard. Luke agreed to put it in downtime until fixed.
  • Several of the SHA-2 tickets have gone red as they've reached 30 days of being open. We need to decide how to handle them.
Rollout Status WLCG Baseline

Tuesday 17th September

  • Chris sent in a report for Storm.

Tuesday 9th July

  • New EMI2 and EMI3 release yesterday. No staged rollout requests yet. Imperial upgraded their WMS and they have been somewhat shaky ever since.

Tuesday 18th June

  • New EMI3 CE coming into SR. Liverpool will test.
  • A lot of EMI3 testing done at Brunel.
  • The EMI-3 testing page contains all issues I am aware of. It's a wiki though, so if you find an issue, please put it in the appropriate category.

Tuesday 14th May

  • A reminder. Please could sites fill out the EMI-3 testing contributions page. This is for all testing not just SR sites as we want to know which sites have experience with each component.

References


Security - Incident Procedure Policies Rota

Tuesday 17th September

  • More information on the EGI/PRACE/EUDAT Joint Security Training event mentioned last week is now available.

Tuesday 3rd September

  • Security contacts and system staff of the partners of EGI, EUDAT and PRACE are invited for a joint security event from Monday, October 7 - Wednesday, October 9 in Linköping, Sweden. Monday and Tuesday will be a training event which should be of interest for all staff managing the systems which are part of our infrastructures. The second part of the meeting will consist of a discussion of security policies and the collaboration among the different infrastructures.
  • Rob H has moved to the Tier-1. Looking at options for security team.
  • At the next team meeting we should review our approach to 100percentit (just moved to uncertified).

Tuesday 20th August

  • Several sites showing up in pakiti this week.
  • Update on workaround discussed last week.


Services - PerfSonar dashboard | GridPP VOMS

Tuesday 1st October

  • Bristol have installed two perfSONAR hosts and the google maps view is now all green for the UK.

Tuesday 17th September

  • Upgrading/re-installing hosts to v3.3.1/mesh is only making slow progress.
  • There is a new view of the status between sites.
  • An outage at Manchester due to central switch maintenance means that VOMS is not going to be contactable for a period this morning. It is clear that we need the backup VOMS instances fully available to VOs - please can someone take a lead?

Tuesday 27th August

  • PerfSONAR: version 3.3.1 was released on 21st Aug. This update should fix problems with the WLCG mesh (which can now be included in the config file). Indeed traceroute and pingER are now working at Cambridge and Imperial but not, yet, at Oxford or Lancaster. Please could sites upgrade to this or at a minimum check on the dashboard that their existing perfSONAR host(s) are working OK.


Tickets

Monday 30th September 15.00 BST

First up, a gentle reminder that if you've asked the submitter a question in a ticket (usually "Is it still broke for you?") remember to set the ticket to "Waiting for Reply". Then it's obvious to us watching that any tardiness on the ticket is the user's fault.

Secondly, a general request: as next week we'll have a full review, could sites have a bit of an autumnal clean of their tickets, updating what needs an update and closing what can be closed.

Thirdly, I've noticed a few cases of "boomerang tickets" over the last week, with tickets submitted by sites faithfully returning to their submitters. Yet another thing to watch out for!

GLEXEC Hall of People with GLEXEC tickets:
CAMBRIDGE, RHUL, DURHAM, UCL, SHEFFIELD, EDINBURGH, QMUL, EFDA-JET, LANCASTER, BRISTOL and MANCHESTER all have glexec tickets open. It would be nice if they all received updates over the next week. BIRMINGHAM almost got glexec dusted, but their ticket got reopened on them (95304). Looks like they should be out of the woods soon though.

NGI
https://ggus.eu/ws/ticket_info.php?ticket=95469 (5/7)
Unresponsive VO master ticket. Malgorzata asks if the ticket can be passed to VO services to start the decommissioning process for the VOs that have had their day. On Hold (23/9)

SUSSEX
https://ggus.eu/ws/ticket_info.php?ticket=97139 (9/9)
APEL test failures at Sussex. If you guys are stuck (no shame there, accounting problems are hard), then I suggest you set the ticket On Hold and/or open a support ticket with the APEL guys (and cross reference it here). In progress (18/9)

EFDA-JET
https://ggus.eu/ws/ticket_info.php?ticket=97485 (21/9)
This Jet ticket is looking a little crusty. Could someone in SouthGrid please have a poke? IIRC this just needs an upgrade of the CA cert rpms. In progress (23/9)

RAL (sort of)
https://ggus.eu/ws/ticket_info.php?ticket=97360 (17/9)
You may have heard about this infamous epic ticket last week at GridPP. It'll make you laugh, it'll make you cry. Then probably make you cry some more. It documents how circumstances have conspired to allow a single misconfigured CE to break all WMS/CREAM interactions - others with a better understanding could say more. In Progress (possibly can be closed) (27/9)

DURHAM
https://ggus.eu/ws/ticket_info.php?ticket=97103 (6/9)
Another issue that cropped up in conversation at GridPP last week was this one at Durham, the likely suspect for their GridFTP problems being network security tools on the Durham firewall. Hopefully CIS will purge the IDS from your subnet! In progress (24/9)

Ticket Update: Supplemental
Squire Jones has brought this EMI3 APEL ticket he submitted to our attention:
https://ggus.eu/ws/ticket_info.php?ticket=97528

It describes a fairly annoying bug in the new APEL connection handling, as well as a doubling up of entries that we've also seen at Lancaster (where the Job's assigned "MachineName" is "MachineName" - https://ggus.eu/ws/ticket_info.php?ticket=95365). Steve's advice is "You'd be better off sticking for now.". For those of us on the EMI3 APEL boat there are some nice instructions in the ticket for handling things if your connection jams up. Thanks Steve!

Tools - MyEGI Nagios

Monday 2nd September

  • Intermittent Nagios errors -> Imperial WMS and all the jobs going through it were failing with ‘no compatible error’. Some reports of ongoing issues. What is the direct impact?
  • MyEGI and gstat were also down last week.
  • Jens is testing SHA-2 compliance of components. The version of gridsite on the GridPP website is not compliant but SHA-2 will be supported with a move to a new server (when?).

Monday 12 Aug

  • VO-Nagios

t2wlcgnagios.physics.ox.ac.uk monitors a few UK VOs and uses a robot certificate. Due to some confusion the robot certificate expired. I had applied for an extension beforehand, but Cert Wizard doesn't understand robot certificates and I thought that it had been extended. Finally Jens stepped in and sorted it out. VO Nagios is now working.

  • SHA2 Certificates

I have been issued a SHA2 certificate by Jens. I tested a few CEs and some interesting results came out. The GridPP VOMS server is SHA2 compatible, so SHA2 proxies can be created for VOs hosted at voms.gridpp.ac.uk. None of the CERN VOMS servers are SHA2 compatible, but there is a workaround: adding a secondary SHA2 certificate. Details are here: https://twiki.cern.ch/twiki/bin/view/LCG/SHA2readinessTesting#SHA_2_VOMS_server I have added my SHA2 certificate but it is not approved yet as most people are on holiday. Interestingly, when I submitted a few jobs using ngs.ac.uk with the SHA2 certificate, they finished successfully on CEs which are not supposed to be SHA2 compliant. I will test again with the OPS VO to confirm it.

Tuesday 23rd July

  • In a campaign to update VO ID card details it turns out that a few of our supported VOs are obsolete: babar, possibly supernemo and ngs.ac.uk. The first of these can be safely removed but we need to confirm our announcement process.
VOs - GridPP VOMS VO IDs Approved VO table

Monday 2nd September

  • The next quarterly Tier 1 allocation/resourcing meeting is scheduled for Wednesday 18th September (after the weekly T1 meeting). The hardware requirements and fair-shares for the period October-December 2013 will be reviewed, looking ahead over the next 12 months. Can all experiments/projects please let Pete G have any updates or requests to these numbers by Friday 13th September?

Monday 19 August

  • EPIC
    • Support requested at Tier-1
    • Any other sites prepared to support them?
  • Catalogue synchronisation - Biomed working on it.


Monday 12 August

  • HyperK.org
    • VOMS servers set up (Manchester, Oxford, Imperial)
    • VOID card - stalled on a homepage.
    • WMS set up (Imperial) - awaiting Glasgow, RAL
    • Site set up (QMUL)
    • LFC - in progress
    • CVMFS - considering
  • SNO+
    • Dirac set up for some CEs
  • Epic
    • Doing stuff
  • ngs.ac.uk VO - any reason to keep it?
  • Software areas for SL6
    • Are we keeping the same areas as sl5?
    • What about the software tags?
    • Push CVMFS?
Site Updates

Actions


Meeting Summaries
Project Management Board - Members Minutes Quarterly Reports

Empty

GridPP ops meeting - Agendas Actions Core Tasks

Empty


RAL Tier-1 Experiment Liaison Meeting (Wednesday 13:30) Agenda EVO meeting

Wednesday 2nd October

  • Operations report
  • The Condor batch farm has been declared as being in production, with about 50% of our batch capacity in it - all WNs running SL6. The remaining nodes will be upgraded from SL5 to SL6 within the Torque/Maui farm. This upgrade will take place tomorrow (3rd Oct).
  • All three Top-BDIIs have been upgraded to the latest version (EMI v3.8.0). Hopefully this will alleviate the problems seen in the previous weeks.
  • The RAL link and the backup OPN link to CERN were successfully moved to the SuperJanet 6 infrastructure yesterday (Tuesday 1st October).
  • Castor version 2.1.14 has just been released by CERN and RAL will begin testing it.
WLCG Grid Deployment Board - Agendas MB agendas

Empty



NGI UK - Homepage CA

Empty

Events

Empty

UK ATLAS - Shifter view News & Links

Empty

UK CMS

Empty

UK LHCb

Empty

UK OTHER
  • N/A
To note

  • N/A