Operations Team Completed Actions
From GridPP Wiki
Revision as of 11:02, 7 April 2015 by Peter Gronbech
This is a Wiki area to track operations team actions
Do we really need all actions going back to 2011? What's wrong with the last 12 months or so?
Action ID prefix | Status |
---|---|
O = From Operations team meeting | Open = Action has been created |
OS = From joint Operations team and sites meeting | Progress = Action is being worked on |
BR = Created by Buck Rogers | Closed = Action is complete |
Action ID | Action description | Owner | Target date | Status | Date closed | Notes | |
---|---|---|---|---|---|---|---|
O-123-45 | The summary description | The owner | Target date for closure | Current status (open/progress/closed) | Date closed - when closed | Progress notes + summary upon closure | |
O-110927-03 | Develop the https://www.gridpp.ac.uk/wiki/Documentation page. Develop a tool to highlight pages which have not been updated recently. Survey the documentation to spot outdated, and missing information. Assign areas to the core areas, who can devolve down to others to be responsible. Report on progress at Ops meeting in two weeks. | Andrew & Stephen | 2011-10-11 | Closed | 2012-07-26 | ||
O-110816-03 | Action on Wahid to create a MySQL query (for DPM) to move non-space token files into space tokens, check it with Jean-Philippe and circulate. | Wahid | 2011-08-30 | Closed | 3-04-2012 | JP has left but the query can be checked with someone else. As of Dec 2011 this is low on Wahid's priority list. Taken up by core DPM team.
| |
O-110906-04 | Cream jobwrapper loses job exit status | Jeremy Coles -> Chris Brew | 2011-09-22 | Closed | 2012-07-26 | Chris Brew reports that the CREAM jobwrapper doesn't report the underlying job exit status. Jeremy Coles to talk to the developers about this. 061211 Chris unsure of his request but the issue still could do with follow up. Follow-up led to request to create ticket/savannah bug... so back with Chris! This was discussed at length on LCG-ROLLOUT here (started by Alessandra!) ATM the user job exit status is in the Cream Logs but there was discussion of defining a standard; did this get as far as an RFE? Alessandra? | |
O-120313-02 | Email everyone on how to hack the publishing system to avoid publishing incorrect GlueSubClusterWNTmpDir. | Stuart Purdie | 2012-03-27 | Closed | 26-06-2012 | Emailed 1st May: If you need to hack GlueSubClusterWNTmpDir, it's set on the CEs in /opt/glite/var/tmp/gip/glite-info-static-cluster.conf (/var/lib/bdii/gip/ldif/static-file-Cluster.ldif on EMI CREAM). It appears to default to /tmp, based on /opt/glite/yaim/functions/config_gip ... so I don't think that YAIM lets you configure it. | |
O-120320-03 | UCL DPM TICKETs 80366, 80331 dpm problems. | Duncan Rand | 20-04-2012 | Closed | 3-04-2012 | Tickets closed. DPM working though somewhat unstable maybe (with developers)
| |
O-120320-01 | At QMUL, CW saw LHCB job that wanted 30GB of memory. Pls Check. | Raja Nandakumar | 20-05-2012 | Closed | 3-04-2012 | Just one or two jobs. One-off - Chris has not seen it again | |
O-111025-02 | CMS and RAL to liaise over future tests between RAL (Tier 1 and 2) and IC. | Brian and Stuart W | 2011-11-01 | Closed |
| ||
O-110906-03 | GGUS 73773 | Sam Skipsey | 2011-09-13 | Closed | Sam to update ticket status - still open (4/10/11). Closed by December.
| ||
O-110516-01 | Migration strategy for ROC? | Jeremy | Closed | 2010-11-02 Current situation appears to be fine; further changes depend on the NGS side of things. December 2010: JG still suggests early in 2011. Late January Ireland have started the validation process, so expect NGI_UK process to start in mid-February. Ireland finished creation of NGI_IE w/c 20th June so the UK part of the migration should begin soon. July: The process has finally started! Hold ya breath... Done September.
| |||
O-110524-01 | Determine who is an EMI-1 early adopter, and get their feedback from deployment process | Jeremy Coles - presumably now Duncan/Daniela | 2011-07-24 | Closed | Please refer to Webpage. Now part of core tasks.
| ||
O-112806-02 | Consider nominations for "WLCG Technology Evolution Work Group". | All | 2011-06-28 | Closed | Some feedback given on the revised smaller working group representation suggestions. UK provided suggestions, we await information as to the outcome. Oct: Involvement in security, storage and ops groups. | ||
O-110830-03 | Determine why Birmingham sporadically fails HC tests, resulting in Broker Off. | Mark Slater | 2011-09-30 | Closed | After more investigation, MS believes that the period in question was due to a number of other problems at the same time causing the failures. In fact, BHAM has only been offlined once in the last week (04/10/11) so things seem to be OK now. MS will keep an eye on the problem though and will try to match offlines with specific problems.
| ||
O-110816-01 | https://www.gridpp.ac.uk/wiki/ROD_rota needs to be updated soon. | Jeremy | 2011-08-30 | Closed |
| ||
O-110524-02 | Test/validate/release CREAM/Argus configuration. | Raul Lopes | 2011-07-24 | Closed | Will submit a staged-rollout report | ||
O-110516-03 | Provide input on hardware scaling issues for disk servers at Tier-2s, particularly TB_disk/Gb_networking ratios. | Sam, Brian, Wahid | 2010-12-07 | Closed | 2011-06-28 | 2011-01-12 Being discussed internally to provide a consensus value. 2011-06-28 http://www.gridpp.ac.uk/wiki/Suggestions_for_suitable_hardware_to_run_a_Grid_SE#Network_capabilities | |
O-110516-05 | Raise issues with ATLAS Panda service view of experiment software being site, not cluster or queue, resolution. (Which causes huge problems with heterogeneous clustered sites). | Alessandra | 2011-04-12 | Closed | 2011-05-16 | Issue raised, got positive response from 1 person but not from the other who should do the work. They all agree updating panda is a good idea and it's in the todo list but it doesn't have a high priority. Will keep on raising the issue when it appears again. | |
O-110524-03 | Assign an Ops Core Team "volunteer" to be T2 rep for GDB | Jeremy Coles | 2011-06-01 | Closed | |||
O-110524-04 | Ticket to increase ATLAS software area to be > 250 GB | Duncan Rand | 2011-07-24 | closed | 2011-07-26 | https://ggus.eu/ws/ticket_info.php?ticket=70960 | |
O-110524-05 | Increase UK Cloud Space to safer value. | Alessandra Forti | 2011-07-24 | closed | 2011-05-25 | Space was increased at all sites that needed it. Might need review in the future to decrease it to an appropriate value when the new clean up policy kicks in. | |
O-110607-01 | Follow up on the monitoring infrastructure Ireland ROC will use. | Jeremy | 2011-06-14 | Closed | |||
O-110607-02 | Review each site's network situation, particularly w.r.t. subnetting and monitoring. | All admins | 2011-06-29 | Closed | If possible include details in site update talks at HEPSYSMAN. [Jeremy is collating]. | ||
O-110607-03 | Invite "management" to Ops meeting to discuss benchmarking and metrics. | Jeremy? | 2011-06-14 | Closed | Steve L invited but he does not see the need to discuss this further. Manchester in private discussion. | ||
O-110614-01 | Find out within ATLAS if and how more UK sites can register to receive work from additional ATLAS clouds | Alessandra Forti | 2011-06-28 | Closed | Mark reported "confirmation that ATLAS GDP (Grid Data Processing) decide who can join multiple clouds (FR, CERN) , other T2D sites should be able to join." Jeremy checked at the PMB and Roger Jones indicated that this was in-hand (i.e. sites not currently expected to follow up themselves). | ||
O-112806-01 | Review "fallback" SE options for SAM nagios tests. | Kashif | 2011-06-28 | Closed | 2011-08-16 | Gridppnagios is using storage-monit.physics.ox.ac.uk as mail storage replication server and dgc-grid-38.brunel.ac.uk and se01.dur.scotgrid.ac.uk as backup
| |
O-112806-03 | Follow up on site glexec statuses. | Jeremy | 2011-06-28 | Closed | PMB updated on status.
| ||
O-110906-01 | BDII crashes - what version of openldap does QMUL use? | Chris Walker | 2011-09-13 | Closed | 2011-09-06 | Details posted to tb-support: https://www.jiscmail.ac.uk/cgi-bin/webadmin?A2=ind1109&L=TB-SUPPORT&F=&S=&P=27670 | |
O-110830-01 | Obtain and distribute summary to clarify Steve Lloyd's tests and test pages. | Jeremy Coles | 2011-09-30 | Closed | Done for metrics but not the test pages. 061211 We need to think about use of these tests for the GridPP reports. Done online. | ||
O-110906-02 | GGUS 73872 | Chris Walker | 2011-09-13 | Closed | 2011-09-06 | Update ticket status
| |
O-110524-06 | Eliminate glibc as a CREAM issue in newest release. Symptoms - fails on heavy load. | Stuart Purdie, Christopher Walker, Mohammad Kashif | 2011-07-24 | closed | 2011-06-28 | Stuart checking, will report back next meeting. Chris reports new gLite has a new cream/sge, could have fix. This particular issue is no longer relevant.
| |
O-110516-04 | Review the need for GridPP VO lists http://www.gridpp.ac.uk/wiki/GridPP_approved_VOs | Jeremy | 2011-03-01 | Closed | 2011-02-22: Revised text which was out of date. PG updated dteam entry. Review of VO information and check on CIC generation of vomses entries ongoing. Incorporating into VO ops-team task. Now in wider VOs task.
| ||
O-112607-01 | Check steps to take at a site when a VO is no longer supported | Jeremy/Brian | Closed | What's wrong with just submitting a GGUS ticket to the VO, explaining the situation? EGI now have a procedure for the decommissioning process.
| |||
O-110830-02 | Implement security measures to protect network topology twiki. | Andrew McNab, Jeremy Coles | 2011-09-30 | Closed | Andrew has developed the required measures to solve the issue. JC (and others) need to make use of it. Any protected wiki page just needs to have a name starting with: protected . This action will close once the networking pages have been confirmed.
| ||
O-110927-01 | Develop the https://www.gridpp.ac.uk/wiki/Staged_rollout page, to provide more information about which UK sites are doing what. Also to encourage others to contribute, either in an ad hoc way or formally. Report on progress at next Ops meeting. | Daniela & Duncan | 2011-10-04 | Closed | https://www.gridpp.ac.uk/wiki/Staged_rollout | ||
O-110927-02 | Develop the https://www.gridpp.ac.uk/wiki/Ticket_follow-up page, keep an eye on UK tickets and look out for commonalities, and escalating tickets. Report on progress at next Ops meeting. | Matt | 2011-10-04 | Closed | https://www.gridpp.ac.uk/wiki/Ticket_follow-up | ||
O-111025-01 | Check the validity of the security probes on the EGI dashboard after false positives at Manchester and Liverpool. | Mingchao | 2011-11-01 | Closed |
| ||
O-111101-01 | Follow up information provided by security dashboard and access control/information | Mingchao | Closed | ||||
O-120320-02 | CW says that at QMUL 5000 FTS jobs stuck in transferring state. | Chris Walker, G Smith | 20-04-2012 | Closed | 23-03-2012 | FTS had stuck transfers, they have applied a patch to prevent this blocking the channel. Cedric fixed whatever was causing transfers not to be sent to the FTS.
| |
O-110906-03 | GGUS 73773 | Sam Skipsey | 2011-09-13 | Closed | 11-04-2012 | Sam to update ticket status - still open (4/10/11). Closed by December. | |
O-120313-05 | Everyone should check that they have correct permissions to access security dashboard. | Everyone | 2012-03-27 | Closed | 11-04-2012 | ||
O-120320-06 | Is there tolerance for late paying Uni's re: hardware purchase? | Pete Gronbech | 20-04-2012 | Closed | 11-04-2012 | Some admins were concerned about process/deadlines etc | |
O-120313-04 | Put link on Wiki to EGI procedure for correcting availability/reliability stats. | Jeremy | 2012-03-27 | Closed | For EGI this is done in the ticket response. For WLCG an email to the WLCG project office is required. Done.
| ||
O-120320-04 | Pls add storage key docs | Wahid Bhimji | 20-04-2012 | Closed | Should be grouped with other KeyDoc actions
| ||
O-120313-03 | Investigate state of Cambridge SE now that it has been upgraded. | Storage group (via Sam) | 2012-03-27 | Closed | Passed to Storage Group
| ||
O-110516-01 | Organise reviewers for web pages and wiki sections | Jeremy | 2009-07-07 | Closed - part of core tasks | 20/07/07 asked in the first instance for volunteers. Remainder will be assigned across team to balance. Revisit in November. It is mainly the sysadmins introductory pages that need attention. Update December 2010: This is still ongoing and needs to be brought in line with timelines for website changes for GridPP4. Review at meeting on 8th February 2011. June 2011: Current plan is to integrate this with ops-team tasks. July: Will work on this in conjunction with documentation core task. February 2012: Ongoing with core task. Closed July 2012 as part of documentation task.
| ||
O-120410-02 | Research Backup VOMS server ideas | Chris Walker and Jeremy Coles | 10-04-2012 | Closed | 19-06-2012 | This action has been superseded by the planned trio of GridPP-run VOMS servers. | |
O-120410-01 | Check up on those UCL/VPN errors (for Atlas) | Duncan Rand | 10-04-2012 | Closed | 8-05-2012 | ||
O-111122-01 | Find out how to get EGI to distribute RPMs containing LSC files. | Jeremy | 2011-12-06 | Closed | Request with EMI, to reach consensus on distribution mechanism (O-111122-01,O-120313-01,O-120320-05 all related). EMI will not be taking up this request due to manpower issues. Examine again in post EMI proposal! | ||
O-120312-01 | Report to Andrew the three Key Docs for your area. | Ops members | 2012-03-06 | Closed | Pressing. Pls do it by GridPP 28. (13/9: Can this be closed?)
| ||
O-110524-07 | Timeline required for relocatable GLEXEC tarball | Martin Litmaath, Jeremy Coles | 2011-06-15 | Closed | 15th June No feedback. Same on the 28th of June. email activity late July. It goes on ... Current status: untested, AFAIK. July 2012 - update from Daniela. September/October 2012: Interest from ATLAS for CVMFS UI/WN tarball. We need to work with A Elwell/D Smith. Matt D has volunteered to take first steps. Wahid also looking at build scripts. Work started Nov 2012.
| ||
O-120313-01 | Investigate getting YAIM extracts for VO config via CIC portal. | Jeremy (or reassign to appropriate person) | 2012-03-27 | Closed | (O-111122-01,O-120313-01,O-120320-05 all related). Proposed to EGI. Will not happen within EMI. Vomssnooper was the result of the UK effort to address this issue.
| ||
O-120619-02 | Discuss options and plans for the "GridPP VOMS Takeover". | Jeremy, potential VOMS admins | 03-07-2012 | Closed | Lay the groundwork over the next fortnight for the GridPP-run VOMS servers at Manchester, Oxford & I.C. Supersedes O-120410-02. 18th July - transfer of core work on-hold while Robert F is on leave at Manchester. Will set up parallel test infrastructure. 22nd October: Testing done for Manchester & Oxford. Awaiting IC update. Switchover date set for 14th November 2012 - D Wallom informing NGS VOs.
| ||
O-120320-05 | discuss poss. automation of VOID transfer | Chris Walker, Steve Jones, Jeremy | 20-04-2012 | Closed | A tool for this was released (VomsSnooper) Discuss if this is proper approach (O-111122-01,O-120313-01,O-120320-05 all related). November: Move to action sites to make use of VomsSnooper? | ||
O-130319-01 | Examine with John Kewley moving to DNs without emailAddresses in CN. | Steve Jones (&John Kewley) | 2013-04-19 | Closed | Issue particularly concerns the need to be compatible with EMI-3 ARGUS (&possibly other EMI3 services) | ||
O-130219-02 | Collate and document sites' sysctl settings for transfers. | Wahid | 05-03-2013 | Closed | 12-03-2013 | Wahid volunteered to summarise sites' sysctl settings on the Wiki. It now appears in the wiki. | |
O-130212-01 | UMD status page not public - request it be made so | Jeremy Coles | 05-02-2013 | Closed | Request sent to EGI. Question came back about what specifically those without an RT account wish to see! Is the EGI summary sufficient? March 2013 - EGI will not make the page public but sysadmins can access the existing page by requesting an EGI sso account. | ||
O-120619-01 | Lancaster to engage with IC & QMUL to help with glexec tarball testing. | Matt Doidge | 26-06-2012 | Closed | 12-03-13 | QMUL is now using rpms. Lancaster taken over tarball glexec devel (after tarball finalised). Imperial is testing tarball. | |
O-120605-02 | Recommend, and document how to use, an inexpensive hardware token for storing the private part of grid certificates. | Jens Jensen | 06-06-2012 | Closed | 02-04-2013 | https://www.gridpp.ac.uk/wiki/KeyTokens | |
O-130513-02 | Rob to take over ownership of Security Keydocs. | Rob H | 2013-06-01 | Complete | 2013-06-21 | ||
O-120410-03 | Look at how to improve and document Voms Admin policies and procedures. | Chris Walker, Jeremy Coles, John Gordon | 10-04-2012 | Open | 10-05-2012 | Somehow, some users got deleted when their records expired. A discussion ensued on how to deal with it, but still not resolved. 24th July - documents being updated. October - little progress due to other priorities. Review in December. | |
O-130513-03 | Prepare PerfSonar support document for PMB review | Jeremy, Alessandra and Duncan | 2013-05-20 | Closed | 2013-05-20 | Chris W, J Coles and D. Britton drafted a letter of support. | |
O-130624-01 | Send out a Doodle poll to find a suitable date for a Puppet discussion meeting in EVO. | Pete | 2013-07-02 | Closed | 2013-06-26 | Poll set up and sent to hepsysman and tb-support. This action can be moved after next week's meeting. | |
O-130219-03 | Update Perfsonar boxes to use the version 3.3 Perfsonar and implement Mesh | Lancaster, Oxford and Birmingham | 01-07-2013 | Oxford, Cambridge and Lancaster updated. Closed. | 2013-08-01 | These sites volunteered to update their Perfsonar boxes first. | |
O-110816-02 | Check out state of DPM-LFC checker and make available for testing. | Sam | 2011-08-30 | Closed | See O-130820-01 for redux
| ||
O-110830-04 | Solve APEL/LSF parser mismatch, allowing Lanc. accounting to be published. | Matt Doidge | 2011-09-30 | Closed | GGUS Ticket https://ggus.eu/ws/ticket_info.php?ticket=84015. Manual accounting updates in place since January. Use EMI3 code? | ||
O-130226-01 | Try vomsSnooper to check site config. | All sites | 31-06-2013 | It works. Closed. | Sites to try the application and provide feedback to Steve. Some sites did try - any more volunteers? |
| |
O-130219-01 | All sites to query the IPv6 status of their respective institutions | Everyone | 11-03-2013 | Closed | Even if sites don't intend IPv6 testing in the near future. (Update 21/01/2014, review this in February) All sites have spoken to their networking teams and updated the wiki here: https://www.gridpp.ac.uk/wiki/IPv6_site_status | ||
O-131029-01 | Co-ordinate update of site-configuration for Backup VOMS | Chris, Daniela, Kashif, Robert and Steve | 2013-08-17 | Closed | 2014-02-11 | 29/10/13 Onus now on sites to upgrade before Chris starts handing out tickets. Most sites did some testing, it is now being rolled out and any problems will surface |
|
O-120605-01 | Document how to use robot certificates to perform routine tasks without regular human intervention. | Jens Jensen | 06-06-2012 | Closed | T2k and snoplus should probably be doing this to routinely move their data. It would be useful to have it documented. 2013-08-20: Jeremy to chase up. (update 21/01/14 follow up at subsequent meeting with Jens)
| ||
O-120605-03 | Document how to renew a server's grid certificate without loading it into a browser. | Jens Jensen, John Kewley | 06-06-2012 | Closed | If sysadmin A renews a server's certificate on a Friday, then goes on holiday, it would be useful for sysadmin B to deploy the newly renewed certificate. For renewal, you could try this: README QUICKSTART - that's what we use at Imperial. (update 21/01/14 follow up at subsequent meeting with Jens)
| ||
O-130513-01 | Explore possibilities for LFC backup at a Tier-2 | Ops Team | 2013-06-14 | Closed? (Might be less relevant now) | Oracle expertise may be needed. Ian Collier volunteered to confirm this at the Tier 1. (update 21/01/2014, following internal discussions may have this at RAL site, possible Daresbury, Jeremy will follow this up) |
| |
O-130528-01 | Plan out the future of CE/Batch System integration. Torque/maui are not supported by EGI. Lay out an agenda with proposals. | Jeremy, Alessandra | 2013-05-20 | Closed | 2014-03-14 After the pre-GDB on batch systems Alessandra created a batch system comparison table in the GDB twiki area | ||
O-130820-01 | Explore LFC/SE consistency checking status and develop tools if needed. | Sam | 2013-08-20 | Closed - see below | Examining possibility of generalising Biomed tool https://github.com/frmichel/biomed-support-tools/blob/master/SE/consistency/diff-dpm-lfc.sh. Exploring status of the SYNCAT (SEMsg) Working Group via Fabrizio Furano. (Placeholder update 21/01/14, discussed that there may be a need to have a tool and should maybe talk to other VOs, Jeremy said he would update this action) |
| |
O-140121-01 | Follow up on benchmarking of sites following SL6 upgrade | Closed | |||||
O-140121-02 | Progression to using backup VOMS servers | Closed | |||||
O-140225-01 | Sites to fill in Batch System Status table | Various | 11/3/14 | Closed | https://www.gridpp.ac.uk/wiki/Batch_system_status still needs filling in by Brunel, Lancaster, RHUL, UCL, Sheffield, Birmingham, RALPP and Sussex. | ||
O-140225-02 | ILC software area migration to cvmfs | Sites supporting ILC | 4/3/14 | Closed | https://ggus.eu/ws/ticket_info.php?ticket=101502 Sites should move to using cvmfs for ILC or reconsider their support. Will review in next meeting and decide if child tickets are needed. | ||
O-140225-03 | Redistribute Perfsonar Documentation | Jeremy, Duncan or Steve | 4/3/14 | Closed | Sites have asked for Perfsonar documentation, particularly that pertaining to the reinstall, upgrade and mesh deployment, to be redistributed on TB-SUPPORT. | ||
O-140506-01 | Investigate best procedure for managing top bdiis during network outages/interventions | Gareth Smith | Closed | Following the network intervention at RAL recently the 3 Top BDIIs displayed different information but all claimed to be authoritative.
| |||
O-141216-01 | Discuss regional VO cvmfs semantics, put forward a recommendation. | Ewan, Tom, Others. | 2014-12-16 | Closed. | With the rollout of regional VO cvmfs areas Ewan made a point that a one cvmfs repo per model might not be the best solution. Either we could use one cvmfs area for multiple VOs or the opposite: a cvmfs area for each VO's subgroup. |
| |
O-140303-04 | to circulate details on other middleware affected by Java and SSLv3. | Brian | 10-03-2014 | Closed | 10-03-2015 | See argus action below. Email sent to tb-support | |
O-140303-01 | contact Catalin regarding a discussion next week on CVMFS | Jeremy | 10-03-2014 | Closed | Catalin will attend on 10th March and give a summary of the UK CVMFS status and provide a review of the CVMFS workshop.
| ||
O-140303-05 | Look into perfsonar dashboards. And data extraction meeting. | Jeremy | 10-05-2014 | Closed | 070415 | Testing and evaluation of the pilot instances for esmond/maddash ongoing as psds.grid.iu.edu and psmad.grid.iu.edu. The production instance of the infrastructure monitoring is at psomd.grid.iu.edu. Data extraction meeting slides: https://indico.cern.ch/event/377034/material/slides/1.pdf (see slides 1 and 2 only).
| |
O-140303-03 | check 'pap-admin lp' works and if not fix | All sites | 10-05-2014 | Closed | fix by updating argus | ||
O-141216-02 | UClan New Users | Tom, Jeremy, Lancaster lads | 2014-12-16 | Closed | Test jobs submitted via DIRAC and software compiled and running on a GridPP-contextualised SL6 CERN VM. Preparing for CVMFS deployment.
|
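For action O-120313-02 above, a site could check which GlueSubClusterWNTmpDir value its static file publishes before deciding whether to edit it. The following is only a minimal sketch, assuming the file is plain LDIF with one `attribute: value` pair per line. The function names and the sample DN are hypothetical and used for illustration only; the file paths are the ones quoted in the action notes.

```python
from pathlib import Path

ATTRIBUTE = "GlueSubClusterWNTmpDir"


def find_wn_tmp_dirs(ldif_text: str) -> list[str]:
    """Return every plain-text value given for GlueSubClusterWNTmpDir."""
    values = []
    for line in ldif_text.splitlines():
        name, sep, value = line.partition(":")
        if not sep or name.strip() != ATTRIBUTE:
            continue
        if value.startswith(":"):
            # "attr:: ..." marks a base64-encoded value; this sketch skips it.
            continue
        values.append(value.strip())
    return values


def check_file(path: str) -> list[str]:
    """Read a static LDIF file (path supplied by the caller) and report the values."""
    return find_wn_tmp_dirs(Path(path).read_text())


if __name__ == "__main__":
    # Hypothetical sample entry, not taken from any real site.
    sample = (
        "dn: GlueSubClusterUniqueID=ce.example.org\n"
        "GlueSubClusterWNTmpDir: /tmp\n"
    )
    print(find_wn_tmp_dirs(sample))
```

On an EMI CREAM CE the notes suggest calling `check_file("/var/lib/bdii/gip/ldif/static-file-Cluster.ldif")`; an empty list would mean the attribute is not set in that file.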
See also: Operations Team Action items