General updates
|
Monday 29th October
Tuesday 23rd October
- Site contact lists have been added to the list wlcg-operations at cern.ch to improve the transmission of WLCG ops information. There will be an attempt to minimise 'noise'. Please let Jeremy know of any concerns to be fed back to the WLCG ops team.
- The status of EMI releases was last given in this update.
- When decommissioning services please remember to check the EGI procedures for steps.
- Alessandra created this useful table to collate plans. Thank you to everyone for keeping it updated! Just 3 sites have yet to share their plans....
- The ROD team have been given extended access to the Security Dashboard reports to help chase sites on middleware issues spotted by EGI.
- It has been proposed that from 1st November, lcg-ce tests are removed from the three SAM profiles: ROC [1], ROC_OPERATORS [2] (the list of tests generating alarms into the operations portal), and ROC_CRITICAL [3] (the list of tests generating results that are taken into account for Availability/Reliability reports). This should not affect any UK sites as none are among the 29 services found by EGI as still active on gLite 3.1.
- The Nagios tests need updating for SL6 as some issues have been found.
- The final WLCG Tier-2 availability/reliability report for September 2012 is now available.
Monday 15th October
- There is a new Nagios test whose alarms give sites and NGIs an early warning that a site is about to fail the OLA requirements. The test has recently been implemented in the Dashboard and raises alarms named egi.eu.lowAvailability. Availability alarms are to be handled by RODs in the Dashboard and will in future replace the Availability/Reliability tickets that the COD submits to underperforming NGI sites. The alarms warn NGIs about poor performance of sites within the last 30 days.
- HEPiX is this week. See talks via the agenda page or join via Vidyo.
- The status of the EMI-WN testing is captured here.
- SAM test consequences of removing ATLASGROUPDISK space tokens.
- The official GDB meeting minutes from the October meeting are now available.
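As a rough illustration of the egi.eu.lowAvailability alarm described under Monday 15th October, the sketch below averages a site's daily availability over the last 30 days and flags it when the result drops below a threshold. This is not the Dashboard's implementation: the function names, the data layout and the 0.70 threshold are assumptions made for the example, and the actual OLA figure should be taken from the EGI documentation.

```python
from datetime import date, timedelta

# Assumed values for illustration only; the bulletin does not state the OLA threshold.
AVAILABILITY_THRESHOLD = 0.70
WINDOW_DAYS = 30


def rolling_availability(daily, end_day, window_days=WINDOW_DAYS):
    """Mean availability over the window ending on end_day (inclusive).

    daily maps a date to that day's availability fraction (0.0 to 1.0).
    Days with no entry are skipped rather than counted as zero.
    """
    days = [end_day - timedelta(days=i) for i in range(window_days)]
    values = [daily[d] for d in days if d in daily]
    if not values:
        return None
    return sum(values) / len(values)


def low_availability_alarm(daily, end_day, threshold=AVAILABILITY_THRESHOLD):
    """Return True when the rolling availability is below the threshold."""
    avail = rolling_availability(daily, end_day)
    return avail is not None and avail < threshold


# Hypothetical site history: fully down for the most recent 12 days, up before that.
end = date(2012, 10, 15)
history = {end - timedelta(days=i): (0.0 if i < 12 else 1.0) for i in range(30)}

print(rolling_availability(history, end))
print(low_availability_alarm(history, end))
```

Skipping days with no data is a design choice for the sketch; a real monitor might treat unknown status differently.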
|
Tier-1 - Status Page
|
Tuesday 30th October
- Castor 2.1.12 update for LHCb completed OK last Tuesday (23rd). GEN upgrade ongoing this morning.
- Replacement of gLite CREAM CEs with EMI CREAM CEs went OK last Tuesday. Took some time to get all SUM tests moved to new CEs.
- This morning (Tues 30th) failed over to backup link to CERN. Also, short (10 minute) problem caused by RAL firewall.
- Investigating some asymmetric network links via perfsonar. The North American Tier1s were routing packets back to us over the production networks. This has been fixed. (Routing confirmed correct for all other Tier1s now.)
- All WMSs now running EMI versions of software.
- Continuing test of hyperthreading. Plan to implement after CE updates completed.
- Continue with ten EMI-2 SL-5 worker nodes in normal production.
- Test instance of FTS version 3 now available. Non-LHC VOs that use the existing service have been enabled on it, and we are looking for one of the VOs to test it.
|
Storage & Data Management - Agendas/Minutes
|
Wednesday 10th October
- DPM EMI upgrades:
- 9 sites need to upgrade from gLite 3.2
- QMUL asking for FTS settings to be increased to fully test Network link.
- Initial discussion on how Brunel might upgrade its SE and decommission its old SE.
- Classic SE support, both for new SEs and the plan to remove current publishing of classic SE endpoint.
|
Accounting - UK Grid Metrics HEPSPEC06 Atlas Dashboard HS06
|
Tuesday 30th October
- Storage availability in SL pages has been affected by a number of sites being asked by ATLAS to retire the ATLASGROUPDISK space token while the SUM tests were still testing it as critical. The availability will be corrected manually once the month ends. Sites affected in different degrees are RHUL, CAM, BHAM, SHEF and MAN.
Friday 28th September
- Tier-2 pledges to WLCG will be made shortly. The situation is fine unless there are significant equipment retirements coming up.
- See Steve Lloyd's GridPP29 talk for the latest on the GridPP accounting.
Wednesday 6th September
- Sites should check the ATLAS page reporting the HS06 coefficient because, according to the latest statement from Steve, that is what is going to be used. ATLAS Dashboard coefficients are averages over time.
I am going to suggest using the ATLAS production and analysis numbers given in hs06 directly rather
than use cpu secs and try and convert them ourselves as we have been doing. There doesn't seem to be
any robust way of doing it any more and so we may as well use ATLAS numbers which are the ones they are
checking against pledges etc anyway. If the conversion factors are wrong then we should get them fixed in our
BDIIs. No doubt there will be a lively debate at GridPP29!
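For context, the conversion the statement above proposes to stop doing amounts to scaling CPU time by a per-core HS06 factor. The sketch below shows that arithmetic and the reverse check (the factor implied by a reported HS06 figure). The function names and the example numbers are hypothetical; it only illustrates why a wrong published factor skews the converted totals.

```python
def cpu_seconds_to_hs06_hours(cpu_seconds, hs06_per_core):
    """Convert CPU seconds to HS06 hours using a per-core HS06 factor."""
    return cpu_seconds / 3600.0 * hs06_per_core


def implied_hs06_per_core(cpu_seconds, reported_hs06_hours):
    """Per-core factor implied by a reported HS06-hours figure."""
    if cpu_seconds <= 0:
        raise ValueError("cpu_seconds must be positive")
    return reported_hs06_hours * 3600.0 / cpu_seconds


# Hypothetical numbers: 7200 CPU seconds on cores rated at 10 HS06 each.
converted = cpu_seconds_to_hs06_hours(7200, 10.0)
print(converted)  # 20.0

# If the experiment reports 18 HS06 hours for the same work, the implied
# factor is 9.0, which would point at the published factor needing a check.
implied = implied_hs06_per_core(7200, 18.0)
print(implied)  # 9.0
```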
|
Interoperation - EGI ops agendas
|
Monday 8th October
- COD are about to launch monitoring tickets for 'out of support' services (i.e. gLite 3.2), for removal by the end of the month. (They seem to have missed some gLite 3.2 CREAM CEs, however; we need to make sure we don't.)
- EMI updates. EMI-2, expected today or so.
DPM 1.8.4 (Yay! But let it filter through staged rollout a bit...)
LB and WMS 3.4 (Both with security updates).
UI and WN (including 32bit libs, and a few other dependencies).
- Tarballs were raised. Tiziana raised the need for an EMI-2 tarball before the gLite 3.2 tarballs are retired.
- Staged Rollout. The ARC 2.0.0 clients are in the production repositories due to the emi-ui being in production. (Don't think that affects anyone in the UK).
- Released today: BDII Core and GFAL/lcgUtils.
- Products in staged rollout: WMS 3.3.8; CREAM 1.13.4 (due to a mismatch between EMI and UMD versions, this is 1.13.5 in UMD)
- It's been noted that there are a number of products without early adopters in EMI-2: EMIR, Pseudonymity, Wnodes, GridSAM and OGSA-DAI. These will not be included in UMD unless there's an EA and demand from NGIs. There are also a few with no EA in EMI-2 but with one in EMI-1, and these are expected to move to EMI-2 at some point: CLUSTER, CREAM-LSF. (VOMS was listed, but the EA was present and pointed out they are on EMI-2.)
- Unsupported services on 8th October EGI list.
|
Monitoring - Links MyWLCG
|
Monday 2nd July
- DC has almost finished an initial ranking. This will be reviewed by AF/JC and discussed at 10th July ops meeting
Wednesday 6th June
- Ranking continues. Plan to have a meeting in July to discuss good approaches to the plethora of monitoring available.
- Glasgow dashboard now packaged and can be downloaded here.
|
On-duty - Dashboard ROD rota
|
Friday 19th October
- Many sites continue to have planned downtimes for EMI upgrades, with knock-on effects on other local services (e.g. SRM -> WN Rep alarms). Changeover back to Oxford GridPP Nagios midweek, with some caching weirdness (reloading the dashboard produced one of two different sets of results!) which quickly went away.
Friday 12th October
- There is a new ROD newsletter available from EGI.
- As of this week, John Walsh will no longer be contributing to the ROD work. Many thanks to John for his input over the years!
|
Rollout Status WLCG Baseline
|
Monday 1st October
- All gLite 3.1 services and nodes should now have been upgraded or removed.
Thursday 13th September
- Updated all SR pages.
Monday 3rd September
- Test queues for EMI WNs: RAL T1, Oxford, Liverpool?, Brunel
Tuesday 31st July
- Brunel has a test EMI2/SL6 cluster almost ready for testing. Who (smaller VOs?) will test it with their software?
Wednesday 18th July - Core ops
- Sites (that needed a tarball install) will need to work on own glexec installs
- Reminder that gLite 3.1 no longer supported. 3.2 support is also decreasing. Need to push for EMI.
|
Security - Incident Procedure Policies Rota
|
Monday 22nd October
- Last week's UK security activity was very much business as usual; there are a lot of alarms in the dashboard for UK sites, but for most of the week they only related to the gLite 3.2 retirement.
Friday 12th October
- The main activity over the last week has been due to new Nagios tests for obsoleted glite middleware and classic SE instances. Most UK sites have alerts against them in the security dashboard and the COD has ticketed sites as appropriate. Several problems have been fixed already, though it seems that the dashboard is slow to notice the fixes.
Tuesday 25th September
|
|
Services - PerfSonar dashboard
|
Thursday 18th October
- VOMS sub-group meeting on Thursday with David Wallom to discuss the NGS VOs. Approximately 20 will be supported on the GridPP VOMS. The intention is to go live with the combined (upgraded) VOMS on 14th November.
- The Manchester-Oxford replication has been successfully tested. Imperial to test shortly.
Tuesday 18th September
- VOMS in Manchester is now installed with both NGS and GridPP VOs. A political decision is still to be taken on how to support and maintain the NGS VOs, but they have been installed. Replication tests between Manchester and Oxford can now start.
- Meeting date/time for follow-up VOMS discussion needs to be agreed for later this week
Tuesday 11th September
- Still some sites needing to deploy perfsonar
- Meeting date/time for follow-up VOMS discussion needs to be agreed for later this week
|
Tools - MyEGI Nagios
|
Wednesday 17th October
Monday 17th September
Monday 10th September
- Discussion needed on which Nagios instance is reporting for the WLCG (metrics) view.
|
VOs - GridPP VOMS VO IDs Approved VO table
|
Tuesday 23 October
- A local user wants to get on the grid and set up his own UI. Do we have instructions?
Monday 15th October
- SNO+ jobs now work at Dresden https://ggus.eu/ws/ticket_info.php?ticket=86741, but there has got to be a better way.
- Discussion with SNO+ about their requirements - discussions started on the following topics:
- Robot certificates and hardware keys
- FCR
- Managing storage - how to avoid users filling up the space
Monday 8th October
- SNO+ had problems with EMI-2 WN and ganga - formatting changes in EMI-2 command output.
- Now fixed by Mark Slater (8 hours to install EMI2-WN and 20 mins to fix ganga).
- SNO+ jobs don't work at Dresden https://ggus.eu/ws/ticket_info.php?ticket=86741
- Draft e-mail warning "non-LHC VOs" about upcoming updates sent to ops list. Comments please.
Friday 30th September
- Summary of some VO activities given at GridPP29
- Need more feedback/testing from smaller VOs ahead of EMI2-WN change and then SL6.
Tuesday 18 September 2012
- No VOs reporting issues.
- VOs have been asked for a brief summary for the GridPP meeting.
Monday 27th August
|
Site Updates
|
Tuesday 9th October
- SUSSEX: Site working on enabling of ATLAS jobs.
|
|