Operations Bulletin 101212

From GridPP Wiki

Bulletin archive


Week commencing 3rd December 2012
Task Areas
General updates

Monday 3rd December

  • The 2nd DPM Community Workshop is taking place this Monday and Tuesday. It includes tutorial sessions on DM-Lite.
  • Following on from the GridPP29 discussions, there was a GridPP/UK cloud kick-off meeting last Friday. If your site is already running a cloud of some description, then please let David know. The core group is defined in the slides. There is a jiscmail mailing list if you want to follow/contribute to this work: GRIDPP-CLOUD.
  • An agenda is now available for December's GDB.
  • Last week a subset of the PMB plus Neasan and Chris discussed project impact and knowledge exchange topics. This will be important in future proposals. We would like to find/produce some VO exemplars from non-HEP areas. Please let Jeremy and Chris know if you have a community at your site that may (significantly) benefit from using the Grid and has an interest in piloting a project.


Monday 26th November

  • A reminder of the repository hosted at Manchester.
  • Please note that the SARoNGS CA certificate expires on the 30th Nov (this Friday). A quick way to check the expiry of the installed copy is sketched after this list.
  • There was a WLCG Operations Coordination Team meeting last Thursday. Note that a gLExec test has been added to the EGI ROC_OPERATORS profile. A push for enablement will likely start in December or January.
  • Brian circulated some tables showing (ATLAS) WAN performance.
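As a quick check for the SARoNGS expiry mentioned above, something like the following prints the subject and expiry date of the installed CA certificate. The file name is a placeholder; adjust it to whatever the SARoNGS CA is actually installed as under /etc/grid-security/certificates.

   # Print the subject and expiry of an installed CA certificate via openssl.
   # The file name below is a placeholder for the SARoNGS CA certificate.
   import subprocess

   CA_FILE = "/etc/grid-security/certificates/SARoNGS-CA.pem"  # hypothetical name

   out = subprocess.check_output(
       ["openssl", "x509", "-in", CA_FILE, "-noout", "-subject", "-enddate"])
   print(out.decode())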


Tuesday 20th November

  • The WLCG T2 reliability/availability report for October is now final.
  • There was a GDB last week. Ewan's report is in the wiki.
  • There is an ongoing EGI Operations Management Board meeting.
  • The DTEAM VO membership service is currently provided by VOMRS (AUTH). VOMRS has been unsupported since 01-10-2012. VOMRS is ready for migration to VOMS at the end of November.
Tier-1 - Status Page

Tuesday 4th December

  • We still see somewhat degraded availability, with some services not completely stable following the power incident of two weeks ago. Work is ongoing to source replacement components in order to rebuild resilience.
  • All worker nodes upgraded to EMI-2.
  • Other items:
    • Investigation is ongoing into asymmetric data rates seen to remote sites.
    • Roll-out of batch job over-commit to make use of hyperthreading is ongoing (see the sketch after this list).
    • Test instance of FTS version 3 now available and being tested by Atlas & CMS.
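For context, a minimal sketch of what the over-commit could look like, assuming a Torque-style nodes file (the batch system details and numbers at RAL may well differ): a worker node with 16 physical cores / 32 hyperthreads might be declared with more job slots than physical cores.

   # Hypothetical entry in a Torque server_priv/nodes file: np is set above
   # the physical core count so that extra jobs land on hyperthreads.
   wn0001.example.ac.uk np=24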
Storage & Data Management - Agendas/Minutes

Wednesday 10th October

  • DPM EMI upgrades:
    • 9 sites need to upgrade from gLite 3.2
  • QMUL asking for FTS settings to be increased to fully test the network link.
  • Initial discussion on how Brunel might upgrade its SE and decommission its old SE.
  • Classic SE support, both for new SEs and the plan to remove the current publishing of classic SE endpoints.


Accounting - UK Grid Metrics HEPSPEC06 Atlas Dashboard HS06

Tuesday 30th October

  • Storage availability in SL pages has been affected by a number of sites being asked by ATLAS to retire the ATLASGROUPDISK space token while the SUM tests were still testing it as critical. The availability will be corrected manually once the month ends. Sites affected in different degrees are RHUL, CAM, BHAM, SHEF and MAN.

Friday 28th September

  • Tier-2 pledges to WLCG will be made shortly. The situation is fine unless there are significant equipment retirements coming up.
  • See Steve Lloyd's GridPP29 talk for the latest on the GridPP accounting.

Wednesday 6th September

  • Sites should check the ATLAS page reporting the HS06 coefficients because, according to the latest statement from Steve, that is what is going to be used. The ATLAS Dashboard coefficients are averages over time.

I am going to suggest using the ATLAS production and analysis numbers given in hs06 directly rather than use cpu secs and try and convert them ourselves as we have been doing. There doesn't seem to be any robust way of doing it any more and so we may as well use ATLAS numbers which are the ones they are checking against pledges etc anyway. If the conversion factors are wrong then we should get them fixed in our BDIIs. No doubt there will be a lively debate at GridPP29!
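For illustration, this is the sort of conversion we have been doing by hand and which the suggestion above would retire; both numbers are made-up example values, not a site's real figures.

   # Illustrative only: converting accounted raw CPU seconds into HS06 hours
   # using a per-core HS06 coefficient. Both values below are examples.
   cpu_seconds = 3600000      # accounted CPU time for a batch of jobs
   hs06_per_core = 9.5        # hypothetical per-core benchmark coefficient

   hs06_hours = cpu_seconds * hs06_per_core / 3600.0
   print("HS06-hours: %.1f" % hs06_hours)

The fragility is in the coefficient: if the published value drifts or is wrong, every derived number is wrong, which is why taking the ATLAS HS06 figures directly is attractive.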

Documentation - KeyDocs

Tuesday 4th December

KeyDocs monitoring status: Grid Storage(7/0) Documentation(3/0) On-duty coordination(3/0) Staged rollout(3/0) Ticket follow-up(3/0) Regional tools(3/0) Security(3/0) Monitoring(3/0) Accounting(3/0) Core Grid services(3/0) Wider VO issues(3/0) Grid interoperation(3/0) Cluster Management(1/0) (brackets show total/missing)

Thursday, 29th November

The Approved VOs document has been updated to automatically contain a table that lays out the resource requirements for each VO, as well as the maximum. We need to discuss whether this is useful - it seems that the majority of WN software requirements are passed around by word of mouth etc. Should this be formalized? Please see

   https://www.gridpp.ac.uk/wiki/GridPP_approved_VOs#VO_Resource_Requirements

This table will be kept up to date with a regular process that syncs it with the CIC Portal, should it prove to be useful.


Tuesday 6th November

  • Do we need the Approved VOs document to set out the software needs for the VOs?

Tuesday 23rd October

KeyDocs monitoring status: Grid Storage(7/0) Documentation(3/0) On-duty coordination(3/0) Staged rollout(3/0) Ticket follow-up(3/0) Regional tools(3/0) Security(3/0) Monitoring(3/0) Accounting(3/0) Core Grid services(3/0) Wider VO issues(3/0) Grid interoperation(3/0) Cluster Management(1/0) (brackets show total/missing)

Thursday 26th July

All the "site update pages" have been reconfigured from a topic oriented structure into a site oriented structure. This is available to view at https://www.gridpp.ac.uk/wiki/Separate_Site_Status_Pages#Site_Specific_Pages

Please do not edit these pages yet - any changes would be lost when we refine the template. Any comments gratefully received, contact: sjones@hep.ph.liv.ac.uk

Interoperation - EGI ops agendas

Monday 3rd December


Monday 5th November

  • There was an EGI ops meeting today.
  • UMD 2.3.0 in preparation. Release due 19 November, freeze date 12 November.
  • EMI-2 updates: DPM/LFC and VOMS - bugfixes, and glue 2.0 in DPM.
  • EGI have a list of sites considered unresponsive or having insufficient plans for the middleware migration. The one UK site mentioned has today updated their ticket again with further information.
  • In general an upgrade plan cannot extend beyond the end of 2012.
  • A dCache probe was rolled into production yesterday; alarms should appear on the security dashboard in the next 24 hours.
  • CSIRT is taking over from COD on migration ticketing. By next Monday the NGIs with problematic sites will be asked to contact the sites, asking them to register a downtime for their unsupported services.
  • Problems with WMS in EMI-2 (update 4) - WMS version 3.4.0. Basically, it can get proxy interaction with MyProxy a bit wrong. The detail is at GGUS 87802, and there exist a couple of workarounds.



Monitoring - Links MyWLCG

Monday 2nd July

  • DC has almost finished an initial ranking. This will be reviewed by AF/JC and discussed at the 10th July ops meeting.

Wednesday 6th June

  • Ranking continues. Plan to have a meeting in July to discuss good approaches to the plethora of monitoring available.
  • Glasgow dashboard now packaged and can be downloaded here.
On-duty - Dashboard ROD rota

Sunday 2nd December - SP

  • Phenomenally quiet week - with one exception, all I ever saw was alarms that had already gone green!
  • One case open at the moment - ECDF, which is clicking over into the 'ticket them' window today - so that'll need some action either today or tomorrow morning.

Friday 23rd November - DB

  • Despite the meltdown at RAL nothing exciting to report. The Imperial WMS got stuck at some point (you'll see an accumulation of "job cancelled" messages due to jobs failing to run). Simon managed to reproduce the bug https://ggus.eu/tech/ticket_show.php?ticket=88831 (in real life it was the Durham CE that caused the problem). If this happens again, let Daniela know and she will check the WMS.

Monday 19th November - AM

  • Good week overall. No UK-wide problems. Several sites still with (upgrade) related planned downtimes. Only one outstanding ticket (Durham) and no alarms left open over the weekend.

Monday 12th November

  • Birmingham is in downtime until further notice as the university is operating on emergency power.
  • Durham has an open ticket.
  • A lot of Nagios test failures occurred because of the power failure at the Tier-1, but everything is now back to normal.


Rollout Status WLCG Baseline

Tuesday 6th November

References


Security - Incident Procedure Policies Rota

Monday 3rd December

  • One critical alarm for a legacy CREAMCE-gLite-32 service which is already in downtime. World-writable directory warning; the site will raise it with ATLAS.

Monday 22nd October

  • Last week's UK security activity was very much business as usual; there are a lot of alarms in the dashboard for UK sites, but for most of the week they only related to the gLite 3.2 retirement.

Friday 12th October

  • The main activity over the last week has been due to new Nagios tests for obsoleted glite middleware and classic SE instances. Most UK sites have alerts against them in the security dashboard and the COD has ticketed sites as appropriate. Several problems have been fixed already, though it seems that the dashboard is slow to notice the fixes.
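As a rough local cross-check against the dashboard alarms, a site can list any gLite packages still installed on a node; the glob pattern is illustrative and may need adjusting to the metapackages a site actually deployed.

   # List any remaining gLite RPMs on this node (the pattern is illustrative).
   import subprocess

   out = subprocess.check_output(["rpm", "-qa", "glite-*"]).decode()
   print(out if out.strip() else "no matching gLite packages installed")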


Services - PerfSonar dashboard | GridPP VOMS

Tuesday 20th November

  • Reminder for sites to add perfSONAR services in GOCDB.
  • VOMS upgraded at Manchester. No reported problems. Next step to do the replication to Oxford/Imperial.

Monday 5th November

  • perfSONAR service types are now defined in GOCDB.
  • Reminder that the gridpp VOMS will be upgraded next Wednesday.

Thursday 18th October

  • VOMS sub-group meeting on Thursday with David Wallom to discuss the NGS VOs. Approximately 20 will be supported on the GridPP VOMS. The intention is to go live with the combined (upgraded) VOMS on 14th November.
  • The Manchester-Oxford replication has been successfully tested. Imperial to test shortly.


Tickets

Monday 3rd December 13.45 GMT. 32 open UK tickets this week. It's the start of the month, so all tickets, great or small, will get reviewed.

  • NGI/VOMS

https://ggus.eu/ws/ticket_info.php?ticket=88546 (16/11) Creation of epic.vo.gridpp.ac.uk. The name has been settled on, deployed on the master VOMS instance and rolled out to the backups, ready for whatever the next step will be. In progress (30/11)

https://ggus.eu/ws/ticket_info.php?ticket=87813 (25/10) Migration of vo.helio-vo.eu to the UK. At last word everything was done on the VOMS side and testing on grid resources still needed to be done. In progress (15/11)

  • TIER 1

https://ggus.eu/ws/ticket_info.php?ticket=89141 (3/12) RAL are seeing a high atlas production job failure rate, and a possibly related high FTS failure rate. In Progress (3/12)

https://ggus.eu/ws/ticket_info.php?ticket=89081 (30/11) Failed biomed SAM tests, tracked to a missing / in a .lsc file. Should be fixed, waiting for confirmation (but don't wait too long). Waiting for reply (3/12)
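Since the biomed failure above came down to a missing "/" in a .lsc file, a minimal check along the following lines can catch that class of typo. The path and host name are hypothetical; the expected content of an .lsc file is simply the VOMS server's certificate subject DN followed by its issuer (CA) DN.

   # Sanity-check a VOMS .lsc file: two non-comment lines, each a DN
   # starting with "/". The path and host name below are hypothetical.
   LSC = "/etc/grid-security/vomsdir/biomed/voms.example.fr.lsc"

   with open(LSC) as f:
       lines = [l.strip() for l in f if l.strip() and not l.startswith("#")]

   assert len(lines) == 2, "expected exactly two DN lines (subject, issuer)"
   for dn in lines:
       assert dn.startswith("/"), "DN does not start with '/': %r" % dn
   print("looks OK:\n" + "\n".join(lines))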

https://ggus.eu/ws/ticket_info.php?ticket=89063 (30/11) The atlas frontier squids at RAL weren't working; this was fixed (a networking problem) but the ticket was reopened and placed on hold as the monitoring for these boxes needs updating. On hold (30/11)

https://ggus.eu/ws/ticket_info.php?ticket=88596 (19/11) t2k.org jobs weren't being delegated to RAL. After some effort this has been fixed; the ticket can be closed. In progress (1/12)

https://ggus.eu/ws/ticket_info.php?ticket=86690 (3/10) "JPKEKCRC02 missing from FTS ganglia metrics" for t2k. This has been a pain to fix; at last word RAL were waiting on their ganglia expert to come back, but that was a while ago (however I suspect they had bigger fish to fry in November). In progress (6/11)

https://ggus.eu/ws/ticket_info.php?ticket=86152 (17/9) Correlated packet loss on the RAL perfsonar. On hold pending a wider-scale investigation. On hold (31/10)

  • UCL

https://ggus.eu/ws/ticket_info.php?ticket=87468 (17/10) The last unsupported gLite software ticket (until the next batch). Ben has put the remaining out-of-date CE into downtime after updating another. In progress (29/11)

  • BIRMINGHAM

https://ggus.eu/ws/ticket_info.php?ticket=89129 (3/12) High atlas production failure rate, likely to be due to the migration to EMI. It could be a problem with the software area; Mark has involved Alessandro De Salvo. Waiting for reply (3/12)

https://ggus.eu/ws/ticket_info.php?ticket=86105 (14/9) Low atlas sonar rates to BNL from Birmingham. The atlas tag was removed from the ticket to lower noise. On hold (30/11)

  • IMPERIAL

https://ggus.eu/ws/ticket_info.php?ticket=89105 (1/12) t2k.org jobs failing on I.C. WMSs due to proxy expiry. Daniela thinks that it may be a problem with myproxy (the CERN myproxy servers are having DNS alias trouble by the looks of it). In progress (3/12)

  • SHEFFIELD

https://ggus.eu/ws/ticket_info.php?ticket=89096 (30/11) lhcb jobs to Sheffield that go through the WMS are seeing "BrokerHelper: no compatible resources" errors, possibly due to the published values for GlueCEStateFreeCPUs & GlueCEStateFreeJobSlots being 0. In progress (3/12)
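A quick way to see what a CE is actually publishing is to query the site BDII for the GLUE attributes in question; the sketch below assumes ldapsearch is available and uses a placeholder host name.

   # Query a site BDII for the free CPU/slot values its CEs publish.
   # The BDII host name is a placeholder.
   import subprocess

   BDII = "site-bdii.example.ac.uk:2170"
   cmd = ["ldapsearch", "-x", "-LLL", "-H", "ldap://" + BDII, "-b", "o=grid",
          "(objectClass=GlueCE)",
          "GlueCEUniqueID", "GlueCEStateFreeCPUs", "GlueCEStateFreeJobSlots"]
   print(subprocess.check_output(cmd).decode())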

  • LANCASTER

https://ggus.eu/ws/ticket_info.php?ticket=89066 (30/11) biomed nagios tests failing on the Lancaster SE ("problem listing Storage Path(s)"), which suggests to me that we have a publishing problem. Couldn't find any obvious bugbears though; keeping on digging. In progress (30/11)

https://ggus.eu/ws/ticket_info.php?ticket=89084 (30/11) The problem in 89066 is also affecting the biomed CE tests. On hold (30/11)

https://ggus.eu/ws/ticket_info.php?ticket=88628 (20/11) Getting t2k working on our clusters. Had some problems building root on one cluster, and even just submitting jobs to the other. In progress (30/11)

https://ggus.eu/ws/ticket_info.php?ticket=88772 (22/11) One of Lancaster's clusters is reporting default values for "GlueCEPolicyMaxCPUTime", mucking up lhcb's job scheduling. Tracked to a problem in the scripts (https://ggus.eu/ws/ticket_info.php?ticket=88904); the fix will be out in January so I've put this on hold until then. On hold (3/12)

https://ggus.eu/ws/ticket_info.php?ticket=85367 (20/8) ilc jobs always fail on a Lancaster CE, possibly due to the CE's poor performance. For the third time in a row I've had to put this work off for a month. On hold (3/12)

https://ggus.eu/ws/ticket_info.php?ticket=84461 (23/7) t2k transfer failures to Lancaster. Having trouble getting a routing change put through with the RAL networking team, probably because they have had a lot on their plate over the past month. In progress (3/12)

  • LIVERPOOL

https://ggus.eu/ws/ticket_info.php?ticket=88761 (22/11) Technically a ticket from Liverpool to lhcb. A complaint over the bandwidth used by lhcb jobs, probably due to a spike in lhcb jobs running during an atlas quiet period. Are all sides satisfied about the cause of this problem and the steps taken to prevent it happening again? In progress (23/11)

  • SUSSEX

https://ggus.eu/ws/ticket_info.php?ticket=88631 (20/11) It looks like Emyr has fixed Sussex's not-publishing-UserDNs APEL problem, so this ticket can be closed. In progress (26/11)

  • QMUL

https://ggus.eu/ws/ticket_info.php?ticket=88822 (23/11) A similar ticket to 88772 at Lancaster. It could be that the SGE scripts need updating too. In progress (26/11)

https://ggus.eu/ws/ticket_info.php?ticket=88987 (28/11) t2k jobs are failing on ce05. In progress (30/11)

https://ggus.eu/ws/ticket_info.php?ticket=88887 (26/11) lhcb pilots are also failing on ce05. In progress (28/11)

https://ggus.eu/ws/ticket_info.php?ticket=88878 (26/11) hone are also having troubles on ce05... In progress (26/11)

https://ggus.eu/ws/ticket_info.php?ticket=86306 (22/9) LHCb redundant, hard-to-kill pilots at QMUL. Chris opened a ticket to the cream developers (https://ggus.eu/tech/ticket_show.php?ticket=87891), but the requests to purge lists still come in from lhcb. In progress (21/11).

  • GLASGOW

https://ggus.eu/ws/ticket_info.php?ticket=88376 (8/11) Biomed authorisation errors on CE svr026. Sam asked on the 9th if this was the only CE that has seen this problem. No reply since; I added the biomed e-mail address explicitly to the cc list to try and coax a response. Waiting for reply (9/11)

  • ECDF

https://ggus.eu/ws/ticket_info.php?ticket=86334 (24/9) Low atlas sonar rates to BNL. Apparently things went from bad to worse on the 23rd/24th of October. Duncan has removed the atlas VO tag on the ticket to lower the noise on the atlas daily summary. On hold (30/11)

  • EFDA-JET

https://ggus.eu/ws/ticket_info.php?ticket=88227 (6/11) biomed complaining about 444444 waiting jobs & no running jobs being published by jet. The guys there have had a go at fixing the problem (probably caused by their update to EMI-2), but are likely out of ideas. I had a brainwave regarding user access in maui.cfg, but if that's not the solution I'm sure they'll appreciate ideas. In progress (3/12).

  • OXFORD

https://ggus.eu/ws/ticket_info.php?ticket=86106 (14/9) Poor atlas sonar rates from Oxford to BNL. On hold due to running out of fixes to try, and the fact that they get good rates elsewhere. VO tag removed to reduce noise. On hold (30/11)

  • DURHAM

https://ggus.eu/ws/ticket_info.php?ticket=84123 (11/7) atlas production failures at Durham. Site still in "quarantine". On hold (20/11).

https://ggus.eu/ws/ticket_info.php?ticket=75488 (19/10/11) compchem authentication failures. As this ticket has been on hold at a low priority since January, it would seem worthwhile to contact the ticket originators to see what they want to do. On hold (8/10)

Tools - MyEGI Nagios

Tuesday 13th November

  • Noticed two issues during the Tier-1 power cut. The SRM and direct CREAM submission tests use the top BDII defined in the Nagios configuration to query for the resource. These tests started to fail because the RAL top BDII was not accessible. The configuration doesn't use BDII_LIST, so I cannot define more than one BDII; I am looking into how to make this more robust (see the sketch below).
  • The Nagios web interface was not accessible to a few users because GOCDB was down. This is a bug in SAM-nagios and I have opened a ticket.

Site availability has not been affected by this issue because Nagios sends a warning alert when it cannot find a resource through the BDII.
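The kind of robustness being looked for might resemble the sketch below: try each top BDII from a comma-separated list (in the style of LCG_GFAL_INFOSYS) and fall back to the next endpoint on failure. The host names are placeholders, and this is only an illustration of the idea, not how the SAM/Nagios probes are currently wired.

   # Sketch: query a list of top BDIIs in order, falling back on failure.
   # The endpoints below are placeholders.
   import subprocess

   TOP_BDIIS = "top-bdii1.example.ac.uk:2170,top-bdii2.example.ac.uk:2170"

   def query_first_working(ldap_filter, attrs):
       for endpoint in TOP_BDIIS.split(","):
           cmd = ["ldapsearch", "-x", "-LLL", "-H", "ldap://" + endpoint,
                  "-b", "o=grid", ldap_filter] + attrs
           try:
               return subprocess.check_output(cmd)
           except Exception:
               continue  # endpoint down or query failed: try the next one
       raise RuntimeError("no top BDII responded")

   print(query_first_working("(objectClass=GlueSE)", ["GlueSEUniqueID"]).decode())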


Wednesday 17th October

Monday 17th September

  • Current state of Nagios is now on this page.

Monday 10th September

  • Discussion needed on which Nagios instance is reporting for the WLCG (metrics) view



VOs - GridPP VOMS VO IDs Approved VO table

Tue 4th December

Thursday 29 November

Tuesday 27 November

  • VOs supported at sites page updated
    • now lists number of sites supporting a VO, and number of VOs supported by a site.
    • Linked to by Steve Lloyd's pages


Tuesday 23 October

  • A local user wants to get on the grid and set up his own UI. Do we have instructions?


Site Updates

Monday 5th November

  • SUSSEX: Site working on enabling of ATLAS jobs.


Meeting Summaries
Project Management Board - MembersMinutes Quarterly Reports

Monday 1st October

  • ELC work


Tuesday 25th September

  • Reviewing pledges.
  • Q2 2012 review
  • Clouds and DIRAC
GridPP ops meeting - Agendas Actions Core Tasks

Tuesday 21st August - link Agenda Minutes

  • TBC


RAL Tier-1 Experiment Liaison Meeting (Wednesday 13:30) Agenda EVO meeting

Wednesday 5th December

  • Operations report
  • Since the power incident of 20th November the Tier-1 has been running, but there have been a greater than usual number of problems. Resilience has also been significantly reduced, although the main area of concern, the power supplies for the fibrechannel SAN switches, was resolved yesterday (4th Dec). Discussions are also taking place as to whether to schedule a test of the UPS/diesel generator next week.
  • The Tier1 team is checking whether any VO requires the AFS client on the worker nodes.
WLCG Grid Deployment Board - Agendas MB agendas

October meeting Wednesday 10th October




NGI UK - Homepage CA

Wednesday 22nd August

  • Operationally few changes - VOMS and Nagios changes on hold due to holidays
  • Upcoming meetings Digital Research 2012 and the EGI Technical Forum. UK NGI presence at both.
  • The NGS is rebranding to NES (National e-Infrastructure Service)
  • EGI is looking at options to become a European Research Infrastructure Consortium (ERIC). (Background document.)
  • Next meeting is on Friday 14th September at 13:00.
Events

WLCG workshop - 19th-20th May (NY) Information

CHEP 2012 - 21st-25th May (NY) Agenda

GridPP29 - 26th-27th September (Oxford)

UK ATLAS - Shifter view News & Links

Thursday 21st June

  • Over the last few months ATLAS have been testing their job recovery mechanism at RAL and a few other sites. This is something that was 'implemented' before but never really worked properly. It now appears to be working well, allowing jobs to finish even if the SE is down or unstable when the job finishes.
  • Job recovery works by writing the output of the job to a directory on the WN should it fail when writing the output to the SE. Subsequent pilots will check this directory and try again for a period of 3 hours. If you would like to have job recovery activated at your site you need to create a directory which (atlas) jobs can write to. I would also suggest that this directory has some form of tmp watch enabled on it which clears up files and directories older than 48 hours (a possible cron entry for this is sketched below). Evidence from RAL suggests that it's normally only 1 or 2 jobs that are ever written to the space at a time and the space used is normally less than a GB; I have not observed more than 10GB being used. Once you have created this space, please email atlas-support-cloud-uk at cern.ch with the directory (and your site!) and we can add it to the ATLAS configurations. We can switch off job recovery at any time if it does cause a problem at your site. Job recovery would only be used for production jobs as users complain if they have to wait a few hours for things to retry (even if it would save them time overall...)
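On the tmp-watch suggestion above, one possible clean-up, assuming tmpwatch is installed and using a made-up directory path, is a cron entry along these lines:

   # Hypothetical /etc/cron.d entry: clear out job-recovery files and
   # directories not modified for more than 48 hours. The path is illustrative.
   0 5 * * * root /usr/sbin/tmpwatch --mtime 48 /pool/atlas/jobrecovery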
UK CMS

Tuesday 24th April

  • Brunel will be trialling CVMFS this week, which will be interesting. RALPP is doing OK with it.
UK LHCb

Tuesday 24th April

  • Things are running smoothly. We are going to run a few small-scale tests of new codes. This will also run at T2s, with one UK T2 involved. Then we will soon launch a new reprocessing of all data from this year. A CVMFS update from last week fixes cache corruption on WNs.
UK OTHER

Thursday 21st June - JANET6

  • JANET6 meeting in London (agenda)
  • Spend of order £24M for strategic rather than operational needs.
  • Recommendations to BIS shortly
  • Requirements: bandwidth, flexibility, agility, cost, service delivery - reliability & resilience
  • Core presently 100Gb/s backbone. Looking to 400Gb/s and later 1Tb/s.
  • Reliability limited by funding not ops so need smart provisioning to reduce costs
  • Expecting a 'data deluge' (ITER; EBI; EVLBI; JASMIN)
  • Goal of dynamic provisioning
  • Looking at ubiquitous connectivity via ISPs
  • Contracts were 10yrs wrt connection and 5yrs transmission equipment.
  • Current native capacity 80 channels of 100Gb/s per channel
  • Fibre procurement for next phase underway (standard players) - 6400km fibre
  • Transmission equipment also at tender stage
  • Industry engagement - Glaxo case study.
  • Extra requirements: software coding, security, domain knowledge.
  • Expect genome data usage to explode in 3-5yrs.
  • Licensing is a clear issue
To note

Tuesday 26th June