Operations Bulletin 141013

From GridPP Wiki
Revision as of 09:46, 14 October 2013 by Jeremy coles (Talk | contribs)

Bulletin archive


Week commencing 7th October 2013
Task Areas
General updates

Tuesday 8th October

  • The September WLCG Tier-2 availability/reliability report has now been circulated. Amendments are to be requested by this Friday.
  • There was an lcg-voms.cern.ch problem when FTS proxies at BNL were renewed. It caused failed transfers and job failures due to LFC authentication errors for about two hours on Monday (for details see the ops report). There is a wider issue with vo.racf.bnl.gov, whose host DN has recently changed: services using it need updating (see related broadcast).
  • A storage area network (SAN) fault affected GOCDB and APEL over the weekend (4th to 6th Oct). Delays in APEL updating are expected.


Monday 30th September

  • There is a provisional agenda for the GDB next Wednesday.
  • Michel has indicated a desire for WLCG to review cloud progress - particularly around security issues like identity used to execute user payloads (identity change?), traceability requirements, and policy update (topics like root account usage). A F2F meeting on these topics is likely to take place in December/January.
  • We plan to decommission several VOs (see this ticket for the background). Does anybody have any knowledge of oxgrid.ox.ac.uk? Was there a ticket following up on this VO?
  • From today 3 ARC and 2 CREAM CEs will be declared as in production under Condor at RAL.
  • There was an EGI OMB meeting last Friday.
  • Problems with an ATLAS pilot factory and also FTS3 issues last week led to a low number of jobs across UK Tier-2s on several days.
  • Registration for the WLCG workshop in Copenhagen (November 11th-12th) is now open. Please let Jeremy know if you have a specific interest in attending (to ensure travel requests can be accommodated).
  • GOCDB v5 is scheduled for release on the morning of Wednesday 2nd October. This is a major release.
  • Following the HEPiX meeting in Bologna, a configuration management working group is being established, focussed primarily on Puppet. The goal is not to replace the wider Puppet community with something HEP-specific for general Puppet activity, but to find areas of collaboration for HEPiX-specific problems. The initial motivation for setting up the WG was to share information between sites that are deploying Puppet, in particular for services where ongoing YAIM support is unclear. For more information see the wiki or contact Ben Jones or Yves Kemp.

Tuesday 17th September

  • The EGI Technical Forum is taking place this week. Take a look at the agenda. Not all talks are yet online. There are training materials/sessions on security and Glue2 publishing validation.
  • The GPGPU working group in EGI is seeking more participants.
  • The SHA-2 deadline has moved to 1st December.
  • There is a new CHEP 2013 bulletin.
  • WLCG continues to run a twice-weekly operations meeting on Mondays and Thursdays. See the minutes here.
  • The final WLCG T2 reliability/availability report for August has now been released.
WLCG Operations Coordination - Agendas

Tuesday 17th September

  • The agenda of the next WLCG operations meeting is available here. The details of the agenda are not yet final. The participation of the Tier-1 contacts is being strongly encouraged, but also Tier-2 sites are welcome to listen in and contribute (via Vidyo).

Tuesday 2nd September

  • Middleware
    • New BDII release in the latest EMI-2/3 update, including better GLUE-2 support and security fixes. Sites should update all their BDII instances.
    • New CVMFS version released for a security fix. Sites should upgrade or at least apply the hot fix in the above twiki.
    • perfSONAR: sites should upgrade to the latest version, which fixes many deployment problems.
    • The end of support for dCache 1.9.12 has been postponed to September 30 due to a delay in releasing the SHA-2 compliant version in the dCache 2.2 series.
    • Consult https://twiki.cern.ch/twiki/bin/view/LCG/WLCGBaselineVersions
  • SHA-2
    • Discussion mostly dedicated to the experiments' testing status. Atlas and LHCb have tested the services but not job submission yet. All experiments have been encouraged to test this.
  • SL6
    • T2 Done: 49/129 (Alice 11/39, Atlas 28/89, CMS 22/65, LHCb 13/45) -> 80/129 still to be done.
    • HS06: Reminder that sites are requested to run HS06 benchmark and update the value in the BDII. Increased values might be discussed at the WLCG MB.
    • EMI-3: voms-clients have been fixed and the latest version is in the PT repository but not in EMI-3 yet. Both CMS and Atlas work on DPM/dCache sites with this patch. (QMUL might want to give an update on Storm when they upgrade.)
    • UK status: Liverpool to be finished soon; Bham in downtime to upgrade this week; Bristol and Sussex should be done by 15/09/2013, RALPP by 20/09/2013, and QMUL, Lancaster and UCL by 30/09/2013.
  • glexec
    • 55 sites are still to respond; they have attached the installation to the SL6 upgrade.
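
The SL6 tally quoted in the list above can be re-checked with a few lines of Python. This is only a sketch: the figures are copied from the bulletin, the helper names are made up for illustration, and the per-experiment counts are not expected to add up to the overall figure (presumably because many sites support more than one experiment, though the bulletin does not say so).

```python
# SL6 migration tally as quoted in the bulletin: (sites done, total sites).
# The dictionary layout and helper names are illustrative assumptions.
tally = {
    "T2 overall": (49, 129),
    "Alice": (11, 39),
    "Atlas": (28, 89),
    "CMS": (22, 65),
    "LHCb": (13, 45),
}


def remaining(done, total):
    """Number of sites that have not yet migrated."""
    return total - done


def percent_done(done, total):
    """Share of sites already migrated, rounded to one decimal place."""
    return round(100.0 * done / total, 1)


for name, (done, total) in tally.items():
    print(f"{name}: {done}/{total} done "
          f"({percent_done(done, total)}%), {remaining(done, total)} to go")
```

The overall line reproduces the "80/129 still to be done" figure given in the bulletin.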

Monday 12th August

  • There have been no recent meetings. The next is on 29th August.
      • Sites monitoring requirements: SUM tests not representing the real experiment status for example.
Tier-1 - Status Page

Tuesday 8th October

  • The Condor batch farm (with SL6 WNs, both ARC & CREAM CEs) has been running stably with about 50% of the total Tier1 batch capacity.
  • On Thursday, 3rd October, the WNs in the Torque/Maui farm were successfully upgraded to SL6. Of the five batches of WNs in that farm two were put into production (as SL6) on the Thursday and another the following day. The final two batches are being re-installed which will restore full batch capacity.
  • We will run with the two farms until early November when the Torque/Maui farm will be decommissioned and its WNs moved to the Condor farm.
  • Last Tuesday (1st Oct.) the RAL site network links were migrated successfully to Janet 6 infrastructure. The backup OPN link to CERN was moved that day too.
  • The primary OPN link to CERN was successfully moved to the Janet 6 infrastructure this morning (8th Oct).
  • We had a network break of about 45 minutes overnight Wed/Thu 2/3 October (not due to the Janet 6 migration).
  • On Wednesday (2nd Oct) there was a successful load test on the UPS/generator system.
  • We are working on plans for managing services during a planned intervention on the UPS on 5/6 November.
Storage & Data Management - Agendas/Minutes

Tuesday 8th October

  • The DPM workshop agenda and registration page will appear here.

Monday 30th September

  • A DPM workshop is being organised in Edinburgh for 13th December. The GridPP PMB anticipates covering travel for around 10 UK sysadmins for this event. Interest should be indicated during the storage group meeting.

Tuesday 17th September

  • Perhaps someone could summarise the "Dark Data identification tools" thread on TB-Support?



Accounting - UK Grid Metrics HEPSPEC06 Atlas Dashboard HS06

Tuesday 13th August

Tuesday 23rd July

  • Sites moving to SL6 are reminded of the need to re-benchmark their WNs. Some sites have updated the wiki already and provide an idea of the performance change.
  • There is an ongoing PMB discussion about the timeline for the next Tier-2 hardware tranche. Please let Pete or Jeremy know if your site will benefit from a spend this financial year.

Tuesday 30th April

  • A discussion is starting about how to account/reward disk that is reallocated to LHCb. By way of background, LHCb is changing its computing model to use more of Tier-2 sites. They plan to start with a small number of big/good T2 sites in the first instance, and commission them as T2-Ds with disk. Ideally such sites will provide >300TB but for now may allocate 100TB and build it up over time. Andrew McNab is coordinating the activity for LHCb. (Note the PMB is already aware that funding was not previously allocated for LHCb disk at T2s).

Tuesday 12th March

  • APEL publishing stopped for Lancaster, QMUL and ECDF

Tuesday 12th February

  • SL HS06 page shows some odd ratios. Steve says he now takes "HS06 cpu numbers direct from ATLAS" and his page does get stuck every now and then.
  • An update of the metrics page has been requested.
Documentation - KeyDocs

See the worst KeyDocs list for documents needing review now and the names of the responsible people.

Tuesday 1st October

  • The approved VOs page has been updated with the newest data from the operations portal. Note that the VOMS records for LondonGrid now contain some alternative VOMS servers. The migration plan for use of these backup servers is now documented here.

Tuesday 17th September

  • Little obvious change in the status table since last week.

Tuesday 3 September 2013

  • Proposal for "Instant UI", with the aim of producing a suite of documentation and software that will enable a new user to set up a UI and join the grid with the minimum of hassle. The documentation will show how to administer a UI that can be used to submit jobs and retrieve output for a given set of users belonging to a given set of VOs. "Instant UI" is currently in a consultation phase with the GridPP admin community.
Interoperation - EGI ops agendas

Monday 7th October

  • The ops meeting today covered: news from URT; staged rollout updates; UMD updates; DMSU updates (WMS problems at CNAF); ARGUS connection problems and SHA-2 update.

Monday 30th September

  • There was an EGI ops meeting on 23rd September. See the agenda for more details.

Monday 16th September

  • The next meeting takes place on 23rd September at 13:00 (UK time).
  • UMD 3.2.0 was released last week. See the release page for more information.

Monday 2nd September

  • Yesterday's agenda. Attended by David and Raul.
gLite support calendar.


Monitoring - Links MyWLCG

Monday 30th September

  • David summarised the UK sites' position on Nagios in an email last week as:
  • There is a desire for a monitoring solution that gives automatic notifications and links to further information, and doesn't require additional webpages (which describes Nagios). We noted that Nagios could be used to import central Nagios tests and repurpose them for local testing.
  • In addition, it would be useful if the further details could include details of the testing execution commands (even including the test itself) for local diagnosis.
  • We wondered whether (and where) there might be common ground with the WLCG Nagios project - while this may have been discussed, it would be useful to clarify this.
  • It's important to have a clear and documented messaging/transport layer for any solution that's decided on, for integration with future monitoring solutions.

Tuesday 23rd July

Tuesday 18th June

  • David C is taking feedback on the Graphite implementation presented at the HEPSYSMAN meeting. Also considering integrating Site Nagios.
  • Glasgow dashboard now packaged and can be downloaded here.
On-duty - Dashboard ROD rota

Monday 30th September

  • Quite a good week - with problems at only a couple of sites. No UK-wide problems.
  • Point to note for this week: several of the SHA-2 tickets are currently set to expire at the start of October, due to the original deadline.

Tuesday 17th September

  • Daniela is on duty this week.
  • A lot of alarms last week because of a top BDII issue at the RAL Tier1. The lcg-* commands use the top BDII to get SE endpoints, so this issue is affecting actual jobs as well.
  • SHA-2 tickets still open for three sites.
  • SE issue at Durham is still going on. Open ticket.
  • Opened ticket for Sussex APEL issue. No response from the site.
Rollout Status WLCG Baseline

Tuesday 8th October

  • There were updates to EMI2 and EMI3 yesterday, but no new request for Staged Rollout. There is a problem with dcap-libs: [GGUS 97805]

Tuesday 17th September

  • Chris sent in a report for Storm.

Tuesday 9th July

  • New EMI2 and EMI3 release yesterday. No staged rollout requests yet. Imperial upgraded their WMS and they have been somewhat shaky ever since.

Tuesday 18th June

  • New EMI3 CE coming into SR. Liverpool will test.
  • A lot of EMI3 testing done at Brunel.
  • The EMI-3 testing page contains all issues I am aware of. It's a Wiki though, so if you find an issue, please put it in the appropriate category.

Tuesday 14th May

  • A reminder. Please could sites fill out the EMI-3 testing contributions page. This is for all testing not just SR sites as we want to know which sites have experience with each component.

References


Security - Incident Procedure Policies Rota

Tuesday 8th October

  • ARGUS setup for UK
  • ARGUS configuration (see Chris's email)

Tuesday 17th September

  • More information on the EGI/PRACE/EUDAT Joint Security Training event mentioned last week is now available.

Tuesday 3rd September

  • Security contacts and system staff of the partners of EGI, EUDAT and PRACE are invited for a joint security event from Monday, October 7 - Wednesday, October 9 in Linköping, Sweden. Monday and Tuesday will be a training event which should be of interest for all staff managing the systems which are part of our infrastructures. The second part of the meeting will consist of a discussion of security policies and the collaboration among the different infrastructures.
  • Rob H has moved to the Tier-1. Looking at options for security team.
  • At the next team meeting we should review our approach to 100percentit (just moved to uncertified).

Tuesday 20th August

  • Several sites showing up in pakiti this week.
  • Update on workaround discussed last week.


Services - PerfSonar dashboard | GridPP VOMS

Tuesday 1st October

  • PerfSONAR latency hosts configured to use the WLCG meshes should now have a traceroute measurement archive (MA) accessible from the GUI under 'Service Graphs' --> 'Traceroute'. Here is an example.

Tuesday 17th September

  • Upgrading/re-installing hosts to v3.3.1/mesh is only making slow progress.
  • There is a new view of the status between sites.
  • An outage at Manchester due to central switch maintenance means that VOMS is not going to be contactable for a period this morning. It is clear that we need the backup VOMS instances fully available to VOs - please can someone take a lead?
Tickets

Monday 7th October 2013, 14.30 BST
38 Open UK tickets this month (nice, they're going down).

GLEXEC tickets - keeping these separate. Congratulations to Manchester for vanquishing their lack of glexec during their SL6 upgrade.

CAMBRIDGE
https://ggus.eu/ws/ticket_info.php?ticket=95306
John said in his last update that they intend to roll out glexec alongside their SL6 upgrade, which should be happening this week. How goes it? On hold (2/9)

BRISTOL
https://ggus.eu/ws/ticket_info.php?ticket=95305
glexec will be dealt with after the other stuff (SL6, new CEs etc). On hold (11/7)

BIRMINGHAM
https://ggus.eu/ws/ticket_info.php?ticket=95304
Mark enabled glexec, but had a few bugs that needed ironing out for Alice. I'd suggest solving the ticket again if you think these are fixed, Mark. In progress (26/9)

  • Note the LHCb test is now passing intermittently.

ECDF
https://ggus.eu/ws/ticket_info.php?ticket=95303
Waiting on the tarball - *tarball hat on* Sorry guys, I've dropped the ball here. Keeping the lights on at Lancaster ate all of September, not giving me much time to tackle this problem. I hope to have an update soon in a big tarball refresh. On hold (21/8)

DURHAM
https://ggus.eu/ws/ticket_info.php?ticket=95302
It looked like Durham were close, but Mike was on leave. Any news? On hold (2/9)

SHEFFIELD
https://ggus.eu/ws/ticket_info.php?ticket=95301
Like most others, Elena is rolling this out with the SL6 upgrade, so soon hopefully. On hold (1/10)

LANCASTER
https://ggus.eu/ws/ticket_info.php?ticket=95299
See my apology to ECDF. On hold (17/7)

RHUL
https://ggus.eu/ws/ticket_info.php?ticket=95297
Coming along with SL6. On hold (2/9)

QMUL
https://ggus.eu/ws/ticket_info.php?ticket=95296
glexec is basically working on QM's SL6 nodes, so it's just a matter of time before glexec is rolled out across their nodes. On hold (12/8)

UCL
https://ggus.eu/ws/ticket_info.php?ticket=95298
Ben hoped to roll this out by the end of September along with the SL6 upgrade. On hold (29/8)

EFDA-JET
https://ggus.eu/ws/ticket_info.php?ticket=95295
There was a plan to roll this out in a few weeks' time, back in July. On hold (19/7)

Common or Garden Tickets:

NGI
https://ggus.eu/ws/ticket_info.php?ticket=95469 (5/7)
The last of the Unresponsive VO child tickets (supernemo). Jeremy has mentioned that he will close this ticket; it just looks like he hasn't got around to it yet. Malgorzata has chimed in asking for the decommissioning ticket to be opened. In progress (7/10)

  • JC: I had (I thought) closed it immediately after the comment! Now 'solved'.

https://ggus.eu/ws/ticket_info.php?ticket=95442 (4/7)
The unresponsive VO master ticket. Nearly done now. On hold (12/8)

https://ggus.eu/ws/ticket_info.php?ticket=95833 (17/7)
Decommissioning of ral-ngs2. On hold whilst migrating some of the remaining services to Scarf. On hold (23/9)

VOMS
https://ggus.eu/ws/ticket_info.php?ticket=97823 (7/10)
A ticket has come into Manchester requesting that their VOMS stop supporting minos as part of the VO decommissioning process. Assigned (7/10)

SUSSEX
https://ggus.eu/ws/ticket_info.php?ticket=95165 (28/6)
Duncan asked Sussex to check their perfSONAR. At last word they were going to reinstall it with the latest version of perfSONAR. Is anything likely to happen soon? On hold (14/8)

https://ggus.eu/ws/ticket_info.php?ticket=97139 (9/9)
APEL-Pub nagios test failures. Kashif has extended the ticket again. Other ROD shifters might not be so kind! In progress (18/9)

BRISTOL
https://ggus.eu/ws/ticket_info.php?ticket=96261 (30/7)
Bristol are seeing some stage out problems for a CMS user's jobs. They're trying valiantly to fix them, but not having much luck. In progress (3/10)

GLASGOW
https://ggus.eu/ws/ticket_info.php?ticket=97068 (5/9)
Glasgow's perfSONAR boxen need a kick too. Dave reports that they'll review the ports being used whilst updating their perfSONAR box to the latest and greatest version. On hold (18/9)

https://ggus.eu/ws/ticket_info.php?ticket=96234 (29/7)
Request for WMS HyperK support. The plan was to do it after GridPP. It's after GridPP. Just mentioning that... On hold (18/9)

ECDF
https://ggus.eu/ws/ticket_info.php?ticket=96002 (22/7)
A SHA-2 nagios ticket. Enough said. You might want to update your CEs for other reasons before SHA-2 becomes "mandatory". On hold (20/8)

DURHAM
https://ggus.eu/ws/ticket_info.php?ticket=97378 (17/9)
Another APEL-Pub nagios test ticket. Things are looking much better at Durham, and Stuart from the APEL team has gotten involved. In progress (7/10)

SHEFFIELD
https://ggus.eu/ws/ticket_info.php?ticket=97039 (4/9)
Biomed complaining about 44444444444 waiting jobs at Sheffield. Have you had a chance to take a peek at the problem, Elena? On hold (11/9)

MANCHESTER
https://ggus.eu/ws/ticket_info.php?ticket=97066 (5/9)
Duncan spotted that the Manchester perfSONAR wasn't working very well. Alessandra reports that they won't get to this until after the SL6 upgrade and until they've finished debugging their central switch. On hold (9/9)

LIVERPOOL
https://ggus.eu/ws/ticket_info.php?ticket=97682 (1/10)
Liverpool's perfSONAR has fallen ill after a power cut. It seems these boxes eventually go bad. The Liver lads can't get the perfSONAR-BUOY tests to start, so it's looking like a reinstall is on the cards. In progress (2/10)

UCL
https://ggus.eu/ws/ticket_info.php?ticket=97783 (5/10)
Atlas are seeing transfer failures into UCL. There appears to be a discrepancy between real and reported space causing the issue. In progress (7/10)

https://ggus.eu/ws/ticket_info.php?ticket=97461 (20/9)
Atlas transfer failures caused by network troubles by the looks of it. Network engineers are putting in some 10G cards tomorrow (the 8th), so hopefully that will soothe it. In progress (3/10)

QMUL
https://ggus.eu/ws/ticket_info.php?ticket=97615 (27/9)
Atlas complaining that the QM SE is not responding fast enough to lcg-stmd queries. Dan asked for some deserved clarification. Things are better, but Chris would like the Storm developers to get involved. On hold (2/10)

https://ggus.eu/ws/ticket_info.php?ticket=97819 (7/10)
hone have noticed a lack of disk space on some QM nodes affecting their jobs. Dan has tracked this down to epic not cleaning up after themselves; we've noticed the same at Lancaster. Dan has let epic know of their error. In progress (7/10)

TIER 1
https://ggus.eu/ws/ticket_info.php?ticket=97385 (17/9)
Request for a HyperK (the most breakfast cerealy sounding VO) cvmfs repo. Things are being worked on, but slowly. If things are going too slowly, Chris, pipe up! On hold (26/9)

https://ggus.eu/ws/ticket_info.php?ticket=97025 (3/9)
A ticket that keeps popping up in one form or another, concerning the RAL myproxy server's certificate. This has been worked on before, with a new service on standby. The ticket has been held to prevent the Tier 1 being hassled about this again. On hold (12/9)

https://ggus.eu/ws/ticket_info.php?ticket=97516 (23/9)
t2k had problems FTSing their files about. It looks like this could have been caused by temporary network problems at external sites. The Tier 1 guys would like to know if the problem persists for t2k. Waiting for reply (30/9)

https://ggus.eu/ws/ticket_info.php?ticket=97759 (4/10)
The Tier 1's SHA-2 ticket. The plan is currently to let all the current out of date CEs die natural deaths as the cluster is decommissioned from under them. If they get a reprieve then they'll get upgraded. On hold (4/10)

https://ggus.eu/ws/ticket_info.php?ticket=97479 (20/9)
Atlas were seeing high failure rates on RAL SL5 nodes. Maybe an old cvmfs bug. With the move to SL6 this wasn't too much of a worry. Can this ticket be closed now? On hold (30/9)

https://ggus.eu/ws/ticket_info.php?ticket=86152 (17/9/2012)
I forgot to get a birthday card for this ticket - "correlated packet-loss on perfsonar host". Any news? On hold (17/6)

https://ggus.eu/ws/ticket_info.php?ticket=91658 (20/2)
Chris' request for webdav support on the RAL LFC. Last word was that a few holes needed to be poked in the RAL firewall for the service, then silence. On hold (9/8)

EFDA-JET
https://ggus.eu/ws/ticket_info.php?ticket=97485 (21/9)
LHCb job failures, probably due to an out of date set of CA certificates. Jet have updated and asked LHCb if things have started working. Waiting for reply (2/10)
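
The ticket entries above follow a regular pattern: a GGUS link of the form `https://ggus.eu/ws/ticket_info.php?ticket=N`, and a closing status with a day/month date such as "On hold (2/9)". For anyone wanting to tabulate them, a minimal Python sketch might look like the following. The helper names are hypothetical, and the status list is just the set of values that appear in this bulletin, not an authoritative list of GGUS states.

```python
import re
from urllib.parse import parse_qs, urlparse

# URL pattern as it appears in the ticket entries of this bulletin.
GGUS_BASE = "https://ggus.eu/ws/ticket_info.php"

# Statuses seen in this bulletin only (an assumption, not a full GGUS list),
# followed by a day/month date in brackets at the end of the entry.
STATUS_RE = re.compile(
    r"(On hold|In progress|Assigned|Waiting for reply) "
    r"\((\d{1,2})/(\d{1,2})\)\s*$"
)


def ticket_url(ticket_id):
    """Build the link for a ticket number (hypothetical helper)."""
    return f"{GGUS_BASE}?ticket={ticket_id}"


def ticket_id_from_url(url):
    """Extract the ticket number from a link of the form used above."""
    query = parse_qs(urlparse(url).query)
    return int(query["ticket"][0])


def parse_status(entry):
    """Return (status, day, month) from the end of an entry, or None."""
    match = STATUS_RE.search(entry)
    if match is None:
        return None
    status, day, month = match.groups()
    return status, int(day), int(month)


if __name__ == "__main__":
    url = "https://ggus.eu/ws/ticket_info.php?ticket=95306"
    print(ticket_id_from_url(url))
    print(parse_status("How goes it? On hold (2/9)"))
```

Dates are kept as plain day and month integers because the entries mostly omit the year.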


Tools - MyEGI Nagios

Monday 30th September

  • Ewan has put together a slightly modified WLCG VO box, but the effect is of a UI that takes gsi ssh logins from people in one particular VO, but then can be used as a UI for other VOs once you're logged in. The idea is that anyone who would need access to a central UI machine (so, mostly not people in PP depts.) would join a special-purpose VO. See Ewan's TB-SUPPORT email on 23rd September for more details.

Monday 2nd September

  • Intermittent Nagios errors -> Imperial WMS and all the jobs going through it were failing with ‘no compatible error’. Some reports of ongoing issues. What is the direct impact?
  • MyEGI and gstat were also down last week.
  • Jens is testing SHA-2 compliance of components. The version of gridsite on the GridPP website is not compliant but SHA-2 will be supported with a move to a new server (when?).
VOs - GridPP VOMS VO IDs Approved VO table

Monday 7th October 2013

Monday 2nd September

  • The next quarterly Tier 1 allocation/resourcing meeting is scheduled for Wednesday 18th September (after the weekly T1 meeting). The hardware requirements and fair-shares for the period October-December 2013 will be reviewed, and the meeting also looks ahead over the next 12 months. Can all experiments/projects please let Pete G have any updates or requests to these numbers by Friday 13th September?

Monday 19 August

  • EPIC
    • Support requested at Tier-1
    • Any other sites prepared to support them?
  • Catalogue synchronisation - Biomed working on it.


Monday 12 August

  • HyperK.org
    • VOMS servers set up (Manchester, Oxford, Imperial)
    • VOID card - stalled on a homepage.
    • WMS set up (Imperial) - awaiting Glasgow, RAL
    • Site set up (QMUL)
    • LFC - in progress
    • CVMFS - considering
  • SNO+
    • Dirac set up for some CEs
  • Epic
    • Doing stuff
  • ngs.ac.uk VO - any reason to keep it?
  • Software areas for SL6
    • Are we keeping the same areas as SL5?
    • What about the software tags?
    • Push CVMFS?
Site Updates

Actions


Meeting Summaries
Project Management Board - Members Minutes Quarterly Reports

Empty

GridPP ops meeting - Agendas Actions Core Tasks

Empty


RAL Tier-1 Experiment Liaison Meeting (Wednesday 13:30) Agenda EVO meeting

Wednesday 9th October

  • Operations report
  • Last week the remaining nodes in the Torque/Maui batch farm were upgraded from SL5 to SL6. All WNs (in both Condor & Torque/Maui farms) are now running SL6.
  • Yesterday morning (Tuesday 8th Oct) the primary OPN link was moved to the Janet 6 infrastructure, completing the RAL site's move to Janet 6.
  • Initial plans for service interruptions during work on the UPS on 5/6 November were discussed.
WLCG Grid Deployment Board - Agendas MB agendas

Empty



NGI UK - Homepage CA

Empty

Events

Empty

UK ATLAS - Shifter view News & Links

Empty

UK CMS

Empty

UK LHCb

Empty

UK OTHER
  • N/A
To note

  • N/A