Operations Team Completed Actions

From GridPP Wiki

This is a Wiki area to track operations team actions

Do we really need all actions going back to 2011? What's wrong with the last 12 months or so?

Action ID prefix Status
O = From Operations team meeting Open = Action has been created
OS = From joint Operations team and sites meeting Progress = Action is being worked on
BR = Created by Buck Rogers Closed = Action is complete


Actions from dteam meetings
Action ID Action description Owner Target date Status Date closed Notes
O-123-45 The summary description The owner Target date for closure Current status (open/progress/closed) Date closed - when closed Progress notes + summary upon closure
O-110927-03 Develop the https://www.gridpp.ac.uk/wiki/Documentation page. Develop a tool to highlight pages which have not been updated recently. Survey the documentation to spot outdated, and missing information. Assign areas to the core areas, who can devolve down to others to be responsible. Report on progress at Ops meeting in two weeks. Andrew & Stephen 2011-10-11 Closed 2012-07-26
O-110816-03 Action on Wahid to create a MySQL query (for DPM) to move non-space token files into space tokens, check it with Jean-Philippe and circulate. Wahid 2011-08-30 Closed 3-04-2012 JP has left but the query can be checked with someone else. As of Dec 2011 this is low on Wahid's priority list. Taken up by core DPM team.


O-110906-04 Cream jobwrapper loses job exit status Jeremy Coles -> Chris Brew 2011-09-22 Closed 2012-07-26 Chris Brew reports that the CREAM jobwrapper doesn't report the underlying job exit status. Jeremy Coles to talk to the developers about this. 061211 Chris unsure of his request but the issue still could do with follow up. Follow-up led to request to create ticket/savannah bug... so back with Chris! This was discussed at length on LCG-ROLLOUT here (started by Alessandra!) ATM the user job exit status is in the Cream Logs but there was discussion of defining a standard, did this get as far as an RFE? Alessandra?
O-120313-02 Email everyone on how to hack the publishing system to avoid publishing incorrect GlueSubClusterWNTmpDir. Stuart Purdie 2012-03-27 Closed 26-06-2012 Emailed 1st May: If you need to hack GlueSubClusterWNTmpDir, it's set on the CEs in /opt/glite/var/tmp/gip/glite-info-static-cluster.conf (/var/lib/bdii/gip/ldif/static-file-Cluster.ldif on EMI CREAM). It appears to default to /tmp, based on /opt/glite/yaim/functions/config_gip ... so I don't think that YAIM lets you configure it.
O-120320-03 UCL DPM TICKETs 80366, 80331 dpm problems. Duncan Rand 20-04-2012 Closed 3-04-2012 Tickets closed. DPM working though somewhat unstable maybe (with developers)


O-120320-01 At QMUL, CW saw LHCB job that wanted 30GB of memory. Pls Check. Raja Nandakumar 20-05-2012 Closed 3-04-2012 Just one or two jobs. One-off; Chris has not seen it again
O-111025-02 CMS and RAL to liaise over future tests between RAL (Tier 1 and 2) and IC. Brian and Stuart W 2011-11-01 Closed


O-110906-03 GGUS 73773 Sam Skipsey 2011-09-13 Closed Sam to update ticket status - still open (4/10/11). Closed by December.


O-110516-01 Migration strategy for ROC? Jeremy Closed 2010-11-02 Current situation appears to be fine; further changes depend on the NGS side of things. December 2010: JG still suggests early in 2011. Late January Ireland have started the validation process, so expect NGI_UK process to start in mid-February. Ireland finished creation of NGI_IE w/c 20th June so the UK part of the migration should begin soon. July: The process has finally started! Hold ya breath... Done September.


O-110524-01 Determine who is an EMI-1 early adopter, and get their feedback from deployment process Jeremy Coles - presumably now Duncan/Daniela 2011-07-24 Closed Please refer to Webpage. Now part of core tasks.


O-112806-02 Consider nominations for "WLCG Technology Evolution Work Group". All 2011-06-28 Closed Some feedback given on the revised smaller working group representation suggestions. UK provided suggestions, we await information as to the outcome. Oct: Involvement in security, storage and ops groups.
O-110830-03 Determine why Birmingham sporadically fails HC tests, resulting in Broker Off. Mark Slater 2011-09-30 Closed After more investigation, MS believes that the period in question was due to a number of other problems at the same time causing the failures. In fact, BHAM has only been offlined once in the last week (04/10/11) so things seem to be OK now. MS will keep an eye on the problem though and will try to match offlines with specific problems.



O-110816-01 https://www.gridpp.ac.uk/wiki/ROD_rota needs to be updated soon. Jeremy 2011-08-30 Closed


O-110524-02 Test/validate/release CREAM/Argus configuration. Raul Lopes 2011-07-24 Closed Will submit a staged-rollout report
O-110516-03 Provide input on hardware scaling issues for disk servers at Tier-2s, particularly TB_disk/Gb_networking ratios. Sam, Brian, Wahid 2010-12-07 Closed 2011-06-28 2011-01-12 Being discussed internally to provide a consensus value. 2011-06-28 http://www.gridpp.ac.uk/wiki/Suggestions_for_suitable_hardware_to_run_a_Grid_SE#Network_capabilities
O-110516-05 Raise issues with ATLAS Panda service view of experiment software being site, not cluster or queue, resolution. (Which causes huge problems with heterogeneous clustered sites). Alessandra 2011-04-12 Closed 2011-05-16 Issue raised, got positive response from 1 person but not from the other who should do the work. They all agree updating panda is a good idea and it's in the todo list but it doesn't have a high priority. Will keep on raising the issue when it appears again.
O-110524-03 Assign an Ops Core Team "volunteer" to be T2 rep for GDB Jeremy Coles 2011-06-01 Closed
O-110524-04 Ticket to increase ATLAS software area to be > 250 GB Duncan Rand 2011-07-24 Closed 2011-07-26 https://ggus.eu/ws/ticket_info.php?ticket=70960
O-110524-05 Increase UK Cloud Space to safer value. Alessandra Forti 2011-07-24 Closed 2011-05-25 Space was increased at all sites that needed it. Might need review in the future to decrease it to an appropriate value when the new clean up policy kicks in.
O-110607-01 Follow up on the monitoring infrastructure Ireland ROC will use. Jeremy 2011-06-14 Closed
O-110607-02 Review each site's network situation, particularly w.r.t. subnetting and monitoring. All admins 2011-06-29 Closed If possible include details in site update talks at HEPSYSMAN. [Jeremy is collating].
O-110607-03 Invite "management" to Ops meeting to discuss benchmarking and metrics. Jeremy? 2011-06-14 Closed Steve L invited but he does not see the need to discuss this further. Manchester in private discussion.
O-110614-01 Find out within ATLAS if and how more UK sites can register to receive work from additional ATLAS clouds Alessandra Forti 2011-06-28 Closed Mark reported "confirmation that ATLAS GDP (Grid Data Processing) decide who can join multiple clouds (FR, CERN) , other T2D sites should be able to join." Jeremy checked at the PMB and Roger Jones indicated that this was in-hand (i.e. sites not currently expected to follow up themselves).
O-112806-01 Review "fallback" SE options for SAM nagios tests. Kashif 2011-06-28 Closed 2011-08-16 Gridppnagios is using storage-monit.physics.ox.ac.uk as mail storage replication server and dgc-grid-38.brunel.ac.uk and se01.dur.scotgrid.ac.uk as backup


O-112806-03 Follow up on site glexec statuses. Jeremy 2011-06-28 Closed PMB updated on status.


O-110906-01 BDII crashes - what version of openldap does QMUL use? Chris Walker 2011-09-13 Closed 2011-09-06 Details posted to tb-support: https://www.jiscmail.ac.uk/cgi-bin/webadmin?A2=ind1109&L=TB-SUPPORT&F=&S=&P=27670
O-110830-01 Obtain and distribute summary to clarify Steve Lloyd's tests and test pages. Jeremy Coles 2011-09-30 Closed Done for metrics but not the test pages. 061211 We need to think about use of these tests for the GridPP reports. Done online.
O-110906-02 GGUS 73872 Chris Walker 2011-09-13 Closed 2011-09-06 Update ticket status


O-110524-06 Eliminate glibc as a CREAM issue in newest release. Symptoms - fails on heavy load. Stuart Purdie, Christopher Walker, Mohammad Kashif 2011-07-24 Closed 2011-06-28 Stuart checking, will report back next meeting. Chris reports new gLite has a new cream/sge, could have fix. This particular issue is no longer relevant.


O-110516-04 Review the need for GridPP VO lists http://www.gridpp.ac.uk/wiki/GridPP_approved_VOs Jeremy 2011-03-01 Closed 2011-02-22: Revised text which was out of date. PG updated dteam entry. Review of VO information and check on CIC generation of vomses entries ongoing. Incorporating into VO ops-team task. Now in wider VOs task.


O-112607-01 Check steps to take at a site when a VO is no longer supported Jeremy/Brian Closed What's wrong with just submitting a GGUS ticket to the VO, explaining the situation? EGI now have a procedure for the decommissioning process.


O-110830-02 Implement security measures to protect network topology twiki. Andrew McNab, Jeremy Coles 2011-09-30 Closed Andrew has developed the required measures to solve the issue. JC (and others) need to make use of it. Any protected wiki page just needs to have a name starting with: protected . This action will close once the networking pages have been confirmed.


O-110927-01 Develop the https://www.gridpp.ac.uk/wiki/Staged_rollout page, to provide more information about which UK sites are doing what. Also to encourage others to contribute, either in an ad hoc way or formally. Report on progress at next Ops meeting. Daniela & Duncan 2011-10-04 Closed https://www.gridpp.ac.uk/wiki/Staged_rollout
O-110927-02 Develop the https://www.gridpp.ac.uk/wiki/Ticket_follow-up page, keep an eye on UK tickets and look out for commonalities, and escalating tickets. Report on progress at next Ops meeting. Matt 2011-10-04 Closed https://www.gridpp.ac.uk/wiki/Ticket_follow-up
O-111025-01 Check the validity of the security probes on the EGI dashboard after false positives at Manchester and Liverpool. Mingchao 2011-11-01 Closed


O-111101-01 Follow up information provided by security dashboard and access control/information Mingchao Closed
O-120320-02 CW says that at QMUL 5000 FTS jobs stuck in transferring state. Chris Walker, G Smith 20-04-2012 Closed 23-03-2012 FTS had stuck transfers, they have applied a patch to prevent this blocking the channel. Cedric fixed whatever was causing transfers not to be sent to the FTS.



O-110906-03 GGUS 73773 Sam Skipsey 2011-09-13 Closed 11-04-2012 Sam to update ticket status - still open (4/10/11). Closed by December.
O-120313-05 Everyone should check that they have correct permissions to access security dashboard. Everyone 2012-03-27 Closed 11-04-2012
O-120320-06 Is there tolerance for late-paying Unis re: hardware purchase? Pete Gronbech 20-04-2012 Closed 11-04-2012 Some admins were concerned about process/deadlines etc.
O-120313-04 Put link on Wiki to EGI procedure for correcting availability/reliability stats. Jeremy 2012-03-27 Closed For EGI this is done in the ticket response. For WLCG an email to the WLCG project office is required. Done.


O-120320-04 Pls add storage key docs Wahid Bhimji 20-04-2012 Closed Should be grouped with other KeyDoc actions


O-120313-03 Investigate state of Cambridge SE now that it has been upgraded. Storage group (via Sam) 2012-03-27 Closed Passed to Storage Group


O-110516-01 Organise reviewers for web pages and wiki sections Jeremy 2009-07-07 Closed - part of core tasks 20/07/07 asked in the first instance for volunteers. Remainder will be assigned across team to balance. Revisit in November. It is mainly the sysadmins introductory pages that need attention. Update December 2010: This is still ongoing and needs to be brought in line with timelines for website changes for GridPP4. Review at meeting on 8th February 2011. June 2011: Current plan is to integrate this with ops-team tasks. July: Will work on this in conjunction with documentation core task. February 2012: Ongoing with core task. Closed July 2012 as part of documentation task.


O-120410-02 Research Backup VOMS server ideas Chris Walker and Jeremy Coles 10-04-2012 Closed 19-06-2012 This action has been superseded by the planned trio of GridPP-run VOMS servers.
O-120410-01 Check up on those UCL/VPN errors (for Atlas) Duncan Rand 10-04-2012 Closed 8-05-2012
O-111122-01 Find out how to get EGI to distribute RPMs containing LSC files. Jeremy 2011-12-06 Closed Request with EMI, to reach consensus on distribution mechanism (O-111122-01,O-120313-01,O-120320-05 all related). EMI will not be taking up this request due to manpower issues. Examine again in post EMI proposal!
O-120312-01 Report to Andrew the three Key Docs for your area. Ops members 2012-03-06 Closed Pressing. Pls do it by GridPP 28. (13/9: Can this be closed?)


O-110524-07 Timeline required for relocatable GLEXEC tarball Martin Litmaath, Jeremy Coles 2011-06-15 Closed 15th June No feedback. Same on the 28th of June. email activity late July. It goes on ... Current status: untested, AFAIK. July 2012 - update from Daniela. September/October 2012: Interest from ATLAS for CVMFS UI/WN tarball. We need to work with A Elwell/D Smith. Matt D has volunteered to take first steps. Wahid also looking at build scripts. Work started Nov 2012.


O-120313-01 Investigate getting YAIM extracts for VO config via CIC portal. Jeremy (or reassign to appropriate person) 2012-03-27 Closed (O-111122-01,O-120313-01,O-120320-05 all related). Proposed to EGI. Will not happen within EMI. Vomssnooper was the result of the UK effort to address this issue.


O-120619-02 Discuss options and plans for the "GridPP VOMS Takeover". Jeremy, potential VOMS admins 03-07-2012 Closed Lay the groundwork over the next fortnight for the GridPP-run VOMS servers at Manchester, Oxford & I.C. Supersedes O-120410-02. 18th July - transfer of core work on hold while Robert F is on leave at Manchester. Will set up parallel test infrastructure. 22nd October: Testing done for Manchester & Oxford. Awaiting IC update. Switchover date set for 14th November 2012 - D Wallom informing NGS VOs.


O-120320-05 Discuss poss. automation of VOID transfer Chris Walker, Steve Jones, Jeremy 20-04-2012 Closed A tool for this was released (VomsSnooper). Discuss whether this is the proper approach (O-111122-01,O-120313-01,O-120320-05 all related). November: Move to action sites to make use of VomsSnooper?
O-130319-01 Examine with John Kewley moving to DNs without emailAddresses in CN. Steve Jones (&John Kewley) 2013-04-19 Closed Issue particularly concerns the need to be compatible with EMI-3 ARGUS (&possibly other EMI3 services). https://www.gridpp.ac.uk/wiki/Grid_Certificate#Converting_host_certificates_to_omit_the_email_addresses_from_DNs

O-130219-02 Collate and document sites' sysctl settings for transfers. Wahid 05-03-2013 Closed 12-03-2013 Wahid volunteered to summarise sites' sysctl settings on the Wiki. It now appears in the wiki.
O-130212-01 UMD status page not public - request it be made so Jeremy Coles 05-02-2013 Closed Request sent to EGI. Question came back about what specifically those without an RT account wish to see! Is the EGI summary sufficient? March 2013 - EGI will not make the page public but sysadmins can access the existing page by requesting an EGI sso account.
O-120619-01 Lancaster to engage with IC & QMUL to help with glexec tarball testing. Matt Doidge 26-06-2012 Closed 12-03-13 QMUL is now using rpms. Lancaster taken over tarball glexec devel (after tarball finalised). Imperial is testing tarball.
O-120605-02 Recommend, and document how to use, an inexpensive hardware token for storing the private part of grid certificates. Jens Jensen 06-06-2012 Closed 02-04-2013 https://www.gridpp.ac.uk/wiki/KeyTokens
O-130513-02 Rob to take over ownership of Security Keydocs. Rob H 2013-06-01 Complete 2013-06-21
O-120410-03 Look at how to improve and document Voms Admin policies and procedures. Chris Walker, Jeremy Coles, John Gordon 10-04-2012 Open 10-05-2012 Somehow, some users got deleted when their records expired. A discussion ensued on how to deal with it, but still not resolved. 24th July - documents being updated. October - little progress due to other priorities. Review in December.
O-130513-03 Prepare PerfSonar support document for PMB review Jeremy, Alessandra and Duncan 2013-05-20 Closed 2013-05-20 Chris W, J Coles and D. Britton drafted a letter of support.
O-130624-01 Send out a Doodle poll to find a suitable date for a Puppet discussion meeting in EVO. Pete 2013-07-02 Closed 2013-06-26 Poll set up and sent to hepsysman and tb-support. This action can be moved after next week's meeting.
O-130219-03 Update Perfsonar boxes to use Perfsonar version 3.3 and implement Mesh Lancaster, Oxford and Birmingham 01-07-2013 Oxford, Cambridge and Lancaster updated. Closed. 2013-08-01 These sites volunteered to update their Perfsonar boxes first.
O-110816-02 Check out state of DPM-LFC checker and make available for testing. Sam 2011-08-30 Closed See O-130820-01 for redux


O-110830-04 Solve APEL/LSF parser mismatch, allowing Lanc. accounting to be published. Matt Doidge 2011-09-30 Closed GGUS Ticket https://ggus.eu/ws/ticket_info.php?ticket=84015. Manual accounting updates in place since January. Use EMI3 code?
O-130226-01 Try vomsSnooper to check site config. All sites 31-06-2013 It works. Closed. Sites to try the application and provide feedback to Steve. Some sites did try - any more volunteers?


O-130219-01 All sites to query the IPv6 status of their respective institutions Everyone 11-03-2013 Closed Even if sites don't intend IPv6 testing in the near future. (Update 21/01/2014, review this in February) All sites have spoken to their networking teams and updated the wiki here: https://www.gridpp.ac.uk/wiki/IPv6_site_status
O-131029-01 Co-ordinate update of site-configuration for Backup VOMS Chris, Daniela, Kashif, Robert and Steve 2013-08-17 Closed 2014-02-11 29/10/13 Onus now on sites to upgrade before Chris starts handing out tickets. Most sites did some testing, it is now being rolled out and any problems will surface


O-120605-01 Document how to use robot certificates to perform routine tasks without regular human intervention. Jens Jensen 06-06-2012 Closed T2k and snoplus should probably be doing this to routinely move their data. It would be useful to have it documented. 2013-08-20: Jeremy to chase up. (update 21/01/14 follow up at subsequent meeting with Jens)


O-120605-03 Document how to renew a server's grid certificate without loading it into a browser. Jens Jensen, John Kewley 06-06-2012 Closed If sysadmin A renews a server's certificate on a Friday, then goes on holiday, it would be useful for sysadmin B to deploy the newly renewed certificate. For renewal, you could try this: README QUICKSTART - that's what we use at Imperial. (update 21/01/14 follow up at subsequent meeting with Jens)


O-130513-01 Explore possibilities for LFC backup at a Tier-2 Ops Team 2013-06-14 Closed? (Might be less relevant now) Oracle expertise may be needed. Ian Collier volunteered to confirm this at the Tier 1. (update 21/01/2014, following internal discussions may have this at RAL site, possibly Daresbury, Jeremy will follow this up)


O-130528-01 Plan out the future of CE/Batch System integration. Torque/maui are not supported by EGI. Lay out an agenda with proposals. Jeremy, Alessandra 2013-05-20 Closed 2014-03-14 After the pre-GDB on batch systems Alessandra created a batch system comparison table in the GDB twiki area
O-130820-01 Explore LFC/SE consistency checking status and develop tools if needed. Sam 2013-08-20 Closed - see below Examining possibility of generalising Biomed tool https://github.com/frmichel/biomed-support-tools/blob/master/SE/consistency/diff-dpm-lfc.sh. Exploring status of the SYNCAT (SEMsg) Working Group via Fabrizio Furano. (Placeholder update 21/01/14, discussed that there may be a need to have a tool and should maybe talk to other VOs, Jeremy said he would update this action)


O-140121-01 Follow up on benchmarking of sites following SL6 upgrade Closed
O-140121-02 Progression to using backup VOMS servers Closed
O-140225-01 Sites to fill in Batch System Status table Various 11/3/14 Closed https://www.gridpp.ac.uk/wiki/Batch_system_status still needs filling in by Brunel, Lancaster, RHUL, UCL, Sheffield, Birmingham, RALPP and Sussex.
O-140225-02 ILC software area migration to cvmfs Sites supporting ILC 4/3/14 Closed https://ggus.eu/ws/ticket_info.php?ticket=101502 Sites should move to using cvmfs for ILC or reconsider their support. Will review in next meeting and decide if child tickets are needed.
O-140225-03 Redistribute Perfsonar Documentation Jeremy, Duncan or Steve 4/3/14 Closed Sites have asked for Perfsonar documentation, particularly that pertaining to the reinstall, upgrade and mesh deployment, to be redistributed on TB-SUPPORT.
O-140506-01 Investigate best procedure for managing top bdiis during network outages/interventions Gareth Smith Closed Following the network intervention at RAL recently the 3 Top BDIIs displayed different information but all claimed to be authoritative.


O-141216-01 Discuss regional VO cvmfs semantics, put forward a recommendation. Ewan, Tom, Others. 2014-12-16 Closed. With the rollout of regional VO cvmfs areas Ewan made a point that a one cvmfs repo per model might not be the best solution. Either we could use one cvmfs area for multiple VOs or the opposite: a cvmfs area for each VO's subgroup.


O-140303-04 to circulate details on other middleware affected by Java and SSLv3. Brian 10-03-2014 Closed 10-03-2015 See argus action below. Email sent to tb-support
O-140303-01 contact Catalin regarding a discussion next week on CVMFS Jeremy 10-03-2014 Closed Catalin will attend on 10th March and give a summary of the UK CVMFS status and provide a review of the CVMFS workshop.


O-140303-05 Look into perfsonar dashboards. And data extraction meeting. Jeremy 10-05-2014 Closed 070415 Testing and evaluation of the pilot instances for esmond/maddash ongoing as psds.grid.iu.edu and psmad.grid.iu.edu. The production instance of the infrastructure monitoring is at psomd.grid.iu.edu. Data extraction meeting slides: https://indico.cern.ch/event/377034/material/slides/1.pdf (see slides 1 and 2 only).


O-140303-03 check 'pap-admin lp' works and if not fix All sites 10-05-2014 Closed fix by updating argus
O-141216-02 UClan New Users Tom, Jeremy, Lancaster lads 2014-12-16 Closed Test jobs submitted via DIRAC and software compiled and running on a GridPP-contextualised SL6 CERN VM. Preparing for CVMFS deployment.


See also: Operations Team Action items