Operations Bulletin 221012

From GridPP Wiki

Latest revision as of 05:05, 22 October 2012

Bulletin archive


Week commencing 15th October 2012
Task Areas
General updates

Monday 15th October

  • A new Nagios test gives sites and NGIs an early warning, through its alarms, that a site is about to fail the OLA requirements. The test has recently been implemented in the Dashboard and raises alarms named egi.eu.lowAvailability. Availability alarms are to be handled by RODs in the Dashboard and will in future replace the Availability/Reliability tickets that the COD submits to underperforming NGI sites. The alarms warn NGIs about poor site performance within the last 30 days.
  • HEPiX is this week. See talks via the agenda page or join via Vidyo.
  • The status of the EMI-WN testing is captured here.
  • SAM test consequence of removing ATLASGROUPDISK Space tokens.
  • The official GDB meeting minutes from the October meeting are now available.
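
The 30-day low-availability warning described above can be sketched roughly as follows. This is only an illustration: the `DayStatus` record, the 70% threshold and the treatment of UNKNOWN time are assumptions made for the example, not the actual egi.eu.lowAvailability implementation.

```python
from dataclasses import dataclass


@dataclass
class DayStatus:
    """Hypothetical per-day summary of a site's test results."""
    up_hours: float        # hours in which the site passed its tests
    unknown_hours: float   # hours with no test result (status UNKNOWN)
    total_hours: float = 24.0


def availability(days):
    """Fraction of known time the site was up; None if nothing is known."""
    known = sum(d.total_hours - d.unknown_hours for d in days)
    if known <= 0:
        return None
    return sum(d.up_hours for d in days) / known


def low_availability_alarm(days, threshold=0.70, window=30):
    """True if availability over the last `window` days is below threshold.

    The threshold value is an assumption for illustration only.
    """
    value = availability(days[-window:])
    return value is not None and value < threshold
```

For example, a site that was fully up for 10 days and up for half of each of the following 20 days has an availability of about 67% over the window, so the sketch would raise the alarm.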

Wednesday 10th October


Tuesday 9th October

  • There is a GDB this week. Please look at the agenda.
  • gLite 3.2 CEs became unsupported at the end of September. Sites need to move fully to EMI CEs this month.
  • GridPP travel requests can now be done online via the web based approval system.


Tuesday 2nd October

  • Draft September WLCG reliability and availability figures released.
  • CREAM in High Availability systems. EGI have asked for feedback on the release of this feature which is foreseen with EMI 3. To finalize development and documentation, the CREAM Project Team needs feedback from site administrators about their existing site setup in the following areas: used shared file systems (NFS, GPFS, etc); used gateways (DNS, Apache server); load balancing algorithms in use (Round Robin, Weight based, etc) and data replication. Please email Jeremy with any feedback/interest. Further details are in this talk.


Tier-1 - Status Page

Tuesday 16th October

  • Castor 2.1.12 update for CMS being done this morning.
  • Oracle patches applied to FTS, non-LHC LFC and Atlas conditions databases yesterday.
  • The intermittent problem with the FTS (which stops accepting new transfer requests or starting transfers in the queue) is being tracked with FTS developers. A recently applied patch changed the behaviour but did not fix it.
  • The roll-out of the EMI-2 CREAM CEs is underway. Issues with the information provided to the BDIIs are being followed up.
  • One of the WMSs (WMS03) has been upgraded to the EMI version. The other two are following (announced in GOC DB).
  • Continuing test of hyperthreading. Plan to implement after CE updates completed.
  • Continue with ten EMI-2 SL-5 worker nodes in normal production.
  • Test instance of FTS version 3 now available. Non-LHC VOs that use the existing service have been enabled on it and we are looking for one of the VOs to test it.
Storage & Data Management - Agendas/Minutes

Wednesday 10th October

  • DPM EMI upgrades:
    • 9 sites need to upgrade from gLite 3.2
  • QMUL asking for FTS settings to be increased to fully test Network link.
  • Initial discussion on how Brunel might upgrade its SE and decommission its old SE.
  • Classic SE support, both for new SEs and the plan to remove current publishing of the classic SE endpoint.


Accounting - UK Grid Metrics HEPSPEC06 Atlas Dashboard HS06

Friday 28th September

  • Tier-2 pledges to WLCG will be made shortly. The situation is fine unless there are significant equipment retirements coming up.
  • See Steve Lloyd's GridPP29 talk for the latest on the GridPP accounting.


Wednesday 6th September

  • Sites should check the ATLAS page reporting the HS06 coefficient because, according to the latest statement from Steve, that is what is going to be used. ATLAS Dashboard coefficients are averages over time.

I am going to suggest using the ATLAS production and analysis numbers given in hs06 directly rather than use cpu secs and try and convert them ourselves as we have been doing. There doesn't seem to be any robust way of doing it any more and so we may as well use ATLAS numbers which are the ones they are checking against pledges etc anyway. If the conversion factors are wrong then we should get them fixed in our BDIIs. No doubt there will be a lively debate at GridPP29!
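
To illustrate why the conversion factors matter, here is a minimal sketch of the kind of CPU-seconds-to-HS06 conversion referred to above. The function name and the example numbers are made up for illustration; the real accounting chain is more involved.

```python
def cpu_seconds_to_hs06_hours(cpu_seconds, hs06_per_core):
    """Convert raw CPU seconds to HS06-hours using a per-core HS06 factor.

    hs06_per_core is the benchmark coefficient assumed for the worker
    node core (for example, the value a site publishes in its BDII).
    """
    return cpu_seconds * hs06_per_core / 3600.0


# Illustrative numbers only: the same CPU time with two different coefficients.
work = 1_000_000  # CPU seconds
published = cpu_seconds_to_hs06_hours(work, 8.0)
measured = cpu_seconds_to_hs06_hours(work, 10.0)
print(f"published factor: {published:.0f} HS06-hours")
print(f"measured factor:  {measured:.0f} HS06-hours")
```

Because the result scales linearly with the coefficient, a coefficient that is 20% too low under-reports the delivered work by the same 20%.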

Documentation - KeyDocs

Tuesday 9th October

KeyDocs monitoring status: Grid Storage(7/0) Documentation(3/0) On-duty coordination(3/0) Staged rollout(3/0) Ticket follow-up(3/0) Regional tools(3/0) Security(3/0) Accounting(3/0) Core Grid services(3/0) Wider VO issues(3/0) Grid interoperation(3/0) Monitoring(2/0) Cluster Management(1/0) (brackets show total/missing)

Thursday 26th July

All the "site update pages" have been reconfigured from a topic oriented structure into a site oriented structure. This is available to view at https://www.gridpp.ac.uk/wiki/Separate_Site_Status_Pages#Site_Specific_Pages

Please do not edit these pages yet - any changes would be lost when we refine the template. Any comments gratefully received, contact: sjones@hep.ph.liv.ac.uk

Interoperation - EGI ops agendas

Monday 8th October

  • COD are about to launch monitoring tickets for 'out of support' services (i.e. gLite 3.2), for removal by the end of the month. (They seem to have missed some gLite 3.2 CREAM CEs, however - we need to make sure we don't.)
  • EMI updates. EMI-2, expected today or so.

DPM 1.8.4 (Yay! But let it filter through staged rollout a bit...), LB and WMS 3.4 (both with security updates), UI and WN (including 32bit libs, and a few other dependencies).

  • Tarballs were raised. Tiziana raised the need for a tarball (EMI-2) before gLite 3.2 is retired.
  • Staged Rollout. The ARC 2.0.0 clients are in the production repositories due to the emi-ui being in production. (Don't think that affects anyone in the UK).
  • Released today: BDII Core and GFAL/lcgUtils.
  • Products in staged rollout: WMS 3.3.8; CREAM 1.13.4 (due to a mismatch between EMI and UMD versions, this is 1.13.5 in UMD)
  • It's been noted that there are a number of products without early adopters in EMI 2: EMIR, Pseudonymity, Wnodes, GridSAM and OGSA-DAI. These will not be included in UMD unless there's an EA and demand from NGIs. There are also a few with no EA in EMI-2 but with one in EMI-1, and these are expected to move to EMI-2 at some point: CLUSTER, CREAM-LSF. (VOMS was listed, but the EA was present, and pointed out they are on EMI-2.)
  • Unsupported services on 8th October EGI list.


Monday 24th September

  • EGI operations meeting minutes.
  • Decommission of a service deploying unsupported software (gLite 3.1 and part of gLite3.2): Site managers must decommission the unsupported software following the production service decommissioning procedure (PROC12), this includes (among other actions) removing the service from GOCDB and the Site-BDII.
  • UMD 2.2: after a first prioritization within SA2, the candidates for the next UMD2 update are CREAM-SGE; Unicore-UVOs; MPI; BDII-core; EMI-myProxy; ARGUS; Trustmanager and GRIDSite.
  • Missing CSIRT info for UCL.

Monday 10th September - EGI ops meeting minutes

Tuesday 4th September

  • The end of security support of the following products:

    • glite 3.2 glite-UI
    • glite 3.2 glite-WN
    • glite 3.2 glite-GLEXEC_wn
    • glite 3.2 glite-LFC_mysql/glite-LFC_oracle
    • glite 3.2 glite-SE_dpm_disk/glite-SE_dpm_mysql

was extended to 30/11/2012 (http://glite.cern.ch/support_calendar/).


Monitoring - Links MyWLCG

Monday 2nd July

  • DC has almost finished an initial ranking. This will be reviewed by AF/JC and discussed at 10th July ops meeting

Wednesday 6th June

  • Ranking continues. Plan to have a meeting in July to discuss good approaches to the plethora of monitoring available.
  • Glasgow dashboard now packaged and can be downloaded here.
On-duty - Dashboard ROD rota

Friday 12th October

  • There is a new ROD newsletter available from EGI.
  • As of this week, John Walsh will no longer be contributing to the ROD work. Many thanks to John for his input over the years!

Friday 5th October

  • Normal week. Not much to report apart from usual transient failures at different sites.
  • Oxford, Liverpool and Brunel have open tickets due to a bug in the EMI WN.
  • QMUL has a ticket about APEL installation which is still ongoing.

Friday 28th September

  • Currently 8 sites have at least 1 alarm or ticket open. There are 5 sites with open tickets (including the 3 above).
  • Durham has an alarm relating to low availability for the month.


Rollout Status WLCG Baseline

Monday 1st October

  • All gLite 3.1 services and nodes should now have been upgraded or removed.

Thursday 13th September

Updated all SR pages.

Monday 3rd September

  • Test queues for EMI WNs: RAL T1, Oxford, Liverpool?, Brunel

Tuesday 31st July

  • Brunel has a test EMI2/SL6 cluster almost ready for testing - who (smaller VOs?) will test it with their software?

Wednesday 18th July - Core ops

  • Sites (that needed a tarball install) will need to work on their own glexec installs.
  • Reminder that gLite 3.1 no longer supported. 3.2 support is also decreasing. Need to push for EMI.
Security - Incident Procedure Policies Rota

Friday 12th October

  • The main activity over the last week has been due to new Nagios tests for obsoleted glite middleware and classic SE instances. Most UK sites have alerts against them in the security dashboard and the COD has ticketed sites as appropriate. Several problems have been fixed already, though it seems that the dashboard is slow to notice the fixes.

Tuesday 25th September


Services - PerfSonar dashboard

Tuesday 18th September

  • VOMS in Manchester is now installed with both NGS/GridPP VOs. There is some political decision to take about how to support the NGS VOs and how to maintain them but they have been installed. Replication tests between Manchester and Oxford can now start.
  • Meeting date/time for follow-up VOMS discussion needs to be agreed for later this week

Tuesday 11th September

  • Still some sites needing to deploy perfsonar
  • Meeting date/time for follow-up VOMS discussion needs to be agreed for later this week
Tickets

Monday 15th October 14:00 BST
3631 open UK tickets this week. Nothing too exciting other than a bunch of tickets regarding unsupported glite software, most of which have been handled (and many would argue, and have argued, are non-issues, e.g. ClassicSE entries). I've once again glossed over the various networking tickets, but progress on those fronts is expectedly slow.

NGI/ROD
https://ggus.eu/ws/ticket_info.php?ticket=87317 (12/10)
This ticket against the ROD complained about how old we "let" ticket 85973 get (Brunel's problem with the lcg utils timing out in EMI WNs). The offending ticket is closed, so I think this ticket can be closed too. Waiting for reply (15/10)

https://ggus.eu/ws/ticket_info.php?ticket=86927 (8/10)
Sussex's high level of UNKNOWN status in September. John G has involved NGI Ops and Kashif to ask why - I replied. In progress (15/10) SOLVED

https://ggus.eu/ws/ticket_info.php?ticket=86847
https://ggus.eu/ws/ticket_info.php?ticket=86846
The September availability/reliability tickets for Glasgow & Durham. Both sites have submitted a good explanation; I've asked if the powers that be are satisfied. Waiting for reply (15/10) BOTH SOLVED

EFDA-JET
https://ggus.eu/ws/ticket_info.php?ticket=87308 (12/10)
Biomed are seeing the default job numbers in gstat; it looks like dynamic publishing is broken. Assigned (12/10) SOLVED 14/10

https://ggus.eu/ws/ticket_info.php?ticket=87169 (10/10)
"Unsupported Glite Software" ticket. A plan is in the works for upgrading. In progress (10/10)

IC
https://ggus.eu/ws/ticket_info.php?ticket=87272 (11/10)
An interesting one. LHCB jobs were failing at an Imperial CE due to the node running out of inodes - not something I've seen before. Nothing wrong with this ticket, just it caught my eye. In progress (11/10) UPDATE - Daniela has filed a bug report about this: https://ggus.eu/ws/ticket_info.php?ticket=87264
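
For anyone wanting to keep an eye on inode exhaustion of the kind seen at Imperial, a minimal sketch of a check is below. It uses the standard `os.statvfs` call (Unix only); the 90% warning level and the function names are assumptions for illustration, not a recommendation from the ticket.

```python
import os


def inode_usage(total_inodes, free_inodes):
    """Fraction of inodes in use, or None if the filesystem reports none."""
    if total_inodes <= 0:
        return None
    return (total_inodes - free_inodes) / total_inodes


def inodes_nearly_full(path, warn_fraction=0.90):
    """True if the filesystem holding `path` has used >= warn_fraction inodes.

    warn_fraction is an arbitrary example threshold.
    """
    st = os.statvfs(path)
    used = inode_usage(st.f_files, st.f_ffree)
    return used is not None and used >= warn_fraction
```

Some filesystems report zero total inodes, which is why the helper returns None rather than dividing by zero.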

OXFORD
https://ggus.eu/ws/ticket_info.php?ticket=87343 (14/10)
Oxford's torque server crashed, and lhcb noticed. Fixed quickly, so the ticket can either be solved or lhcb can be asked if things are okay for them now. In Progress (14/10)

EDINBURGH
https://ggus.eu/ws/ticket_info.php?ticket=87171 (10/10)
Edinburgh has received a ticket about Unsupported Glite Software at their site. It's being handled though. In Progress (10/10)

GLASGOW
https://ggus.eu/ws/ticket_info.php?ticket=87170 (10/10)
Similar ticket for t'other side of Scotland. Sam makes some good points about the "Classic SE" witchhunt, and asks for clarity on when the actual deadlines are. In progress (11/10)

LIVERPOOL
https://ggus.eu/ws/ticket_info.php?ticket=87167 (10/10)
Liverpool's U.G.S. ticket. Steve gave it a cheeky In Progress, but no other news (11/10).

QMUL
https://ggus.eu/ws/ticket_info.php?ticket=86306 (22/9)
LHCB are still seeing un-purgable jobs at Queen Mary, delivering another list of zombie jobs. Has anyone else seen this? In progress (12/10)

SUSSEX
https://ggus.eu/ws/ticket_info.php?ticket=81784 (1/5)
That Sussex ticket :-) The nagios problems have been worked around, Brian has been giving advice on Space Tokens. In Progress (11/10)

Solved Cases
https://ggus.eu/ws/ticket_info.php?ticket=86753
UCL's dpm problems were solved (touch wood) by increasing the VM resources and correspondingly upping the mysql innodb_buffer_pool_size setting. The lesson is don't skimp on your dpm headnode resources!

The Tier-1, Oxford, Cambridge, Bristol, Birmingham and Sheffield all got U.G.S. tickets that they sorted promptly.

No other tickets of interest that I noticed, does anyone have any?

Tools - MyEGI Nagios

Monday 17th September

  • Current state of Nagios is now on this page.

Monday 10th September

  • Discussion needed on which Nagios instance is reporting for the WLCG (metrics) view



VOs - GridPP VOMS VO IDs Approved VO table

Monday 15th October

  • Robot certificates and hardware keys
  • FCR
  • Managing storage - how to avoid users filling up the space

Monday 8th October

  • Sno+ had problems with EMI-2 WN and ganga - formatting changes in EMI-2 command output.
  • Now fixed by Mark Slater (8 hours to install EMI2-WN and 20 mins to fix ganga).
  • Snoplus jobs don't work at Dresden https://ggus.eu/ws/ticket_info.php?ticket=86741
  • Draft e-mail warning "non-LHC VOs" about upcoming updates sent to ops list. Comments please.


Friday 30th September

  • Summary of some VO activities given at GridPP29
  • Need more feedback/testing from smaller VOs ahead of EMI2-WN change and then SL6.

Tuesday 18 September 2012

  • No VOs reporting issues.
  • VOs have been asked for a brief summary for the GridPP meeting.

Monday 27th August


Site Updates

Tuesday 9th October

  • SUSSEX: Site working on enabling of ATLAS jobs.


Meeting Summaries
Project Management Board - Members Minutes Quarterly Reports

Monday 1st October

  • ELC work


Tuesday 25th September

  • Reviewing pledges.
  • Q2 2012 review
  • Clouds and DIRAC
GridPP ops meeting - Agendas Actions Core Tasks

Tuesday 21st August - link Agenda Minutes

  • TBC


RAL Tier-1 Experiment Liaison Meeting (Wednesday 13:30) Agenda EVO meeting

Wednesday 17th October

  • Operations report
  • Castor 2.1.12 upgrade for CMS instance successful yesterday (16th Oct). Dates for other instances announced now: LHCb on 23rd Oct, GEN on 30th Oct.
  • CE upgrades to EMI versions. Final testing ongoing. If all goes well plan to do this next Tuesday (23rd).
  • The change to enable use of hyperthreading has been approved. This will be implemented once the above CE changes have been completed.
  • Migration of LHCb data from T10000A to T10000C tapes completed.
WLCG Grid Deployment Board - Agendas MB agendas

October meeting Wednesday 10th October




NGI UK - Homepage CA

Wednesday 22nd August

  • Operationally few changes - VOMS and Nagios changes on hold due to holidays
  • Upcoming meetings Digital Research 2012 and the EGI Technical Forum. UK NGI presence at both.
  • The NGS is rebranding to NES (National e-Infrastructure Service)
  • EGI is looking at options to become a European Research Infrastructure Consortium (ERIC). (Background document.)
  • Next meeting is on Friday 14th September at 13:00.
Events

WLCG workshop - 19th-20th May (NY) Information

CHEP 2012 - 21st-25th May (NY) Agenda

GridPP29 - 26th-27th September (Oxford)

UK ATLAS - Shifter view News & Links

Thursday 21st June

  • Over the last few months ATLAS have been testing their job recovery mechanism at RAL and a few other sites. This is something that was 'implemented' before but never really worked properly. It now appears to be working well, allowing jobs to finish even if the SE is down or unstable when the job finishes.
  • Job recovery works by writing the output of the job to a directory on the WN if the write to the SE fails. Subsequent pilots check this directory and retry for a period of 3 hours. If you would like job recovery activated at your site, you need to create a directory which (ATLAS) jobs can write to. I would also suggest that this directory has some form of tmp watch enabled on it which clears up files and directories older than 48 hours. Evidence from RAL suggests that normally only 1 or 2 jobs are written to the space at a time and the space used is normally less than a GB. I have not observed more than 10GB being used. Once you have created this space, please email atlas-support-cloud-uk at cern.ch with the directory (and your site!) and we can add it to the ATLAS configurations. We can switch off job recovery at any time if it does cause a problem at your site. Job recovery would only be used for production jobs, as users complain if they have to wait a few hours for things to retry (even if it would save them time overall...).
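
As a rough illustration of the clean-up suggested above for the job recovery directory, the following sketch removes entries older than 48 hours. It is an assumption-laden example rather than what RAL actually runs: the function name and the directory path in the usage note are hypothetical, and most sites would probably use an existing tool from cron instead.

```python
import os
import time


def purge_old_entries(root, max_age_hours=48, now=None):
    """Delete files (and empty directories) under root older than max_age_hours.

    Returns the list of paths removed. Age is judged by modification time.
    """
    now = time.time() if now is None else now
    cutoff = now - max_age_hours * 3600
    removed = []
    # Walk bottom-up so that sub-directories are visited before their parents.
    for dirpath, dirnames, filenames in os.walk(root, topdown=False):
        for name in filenames:
            path = os.path.join(dirpath, name)
            if os.lstat(path).st_mtime < cutoff:
                os.remove(path)
                removed.append(path)
        for name in dirnames:
            path = os.path.join(dirpath, name)
            if os.path.islink(path):
                continue  # never follow or remove symlinked directories
            # A directory emptied in this pass has a fresh mtime, so it is
            # only removed on a later pass once it has aged again.
            if not os.listdir(path) and os.lstat(path).st_mtime < cutoff:
                os.rmdir(path)
                removed.append(path)
    return removed
```

A call such as `purge_old_entries("/scratch/atlas_recovery")` (path made up for the example) could be run periodically; the 48-hour default matches the figure suggested in the text.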
UK CMS

Tuesday 24th April

  • Brunel will be trialling CVMFS this week, will be interesting. RALPP doing OK with it.
UK LHCb

Tuesday 24th April

  • Things are running smoothly. We are going to run a few small scale tests of new codes. This will also run at T2, one UK T2 involved. Then we will soon launch new reprocessing of all data from this year. CVMFS update from last week; fixes cache corruption on WNs.
UK OTHER

Thursday 21st June - JANET6

  • JANET6 meeting in London (agenda)
  • Spend of order £24M for strategic rather than operational needs.
  • Recommendations to BIS shortly
  • Requirements: bandwidth, flexibility, agility, cost, service delivery - reliability & resilience
  • Core presently 100Gb/s backbone. Looking to 400Gb/s and later 1Tb/s.
  • Reliability limited by funding not ops so need smart provisioning to reduce costs
  • Expecting a 'data deluge' (ITER; EBI; EVLBI; JASMIN)
  • Goal of dynamic provisioning
  • Looking at ubiquitous connectivity via ISPs
  • Contracts were 10yrs wrt connection and 5yrs transmission equipment.
  • Current native capacity 80 channels of 100Gb/s per channel
  • Fibre procurement for next phase underway (standard players) - 6400km fibre
  • Transmission equipment also at tender stage
  • Industry engagement - Glaxo case study.
  • Extra requirements: software coding, security, domain knowledge.
  • Expect genome data usage to explode in 3-5yrs.
  • Licensing is a clear issue
To note

Tuesday 26th June