RAL Tier1 OnCall Actions

From GridPP Wiki

Lists of actions related to the RAL Tier-1 On-call Service

See also RAL Tier1 OnCall Milestones

Open Actions

Action ID Associated milestones Owner Description Status
A-20081107-02 Jonathan Follow up on us having our own modem to call out. [2008-11-14] Problem using internal extension. [2009-06-18] The idea of using another group's modem does not look promising as they do not provide 24-hour cover in the event of a failure. Migrated to Footprints Incident Helpdesk
A-20090122-03 Jonathan Check, and if necessary modify, the routing of e-mails (particularly alarms) via PAT so as to eliminate unnecessary steps. This may need changes to MX records. [2009-01-29] - started and ongoing. Action terminated. Will be absorbed in general update of internal mail system.
A-20090129-02 Jonathan Look at methods to eliminate the problem of a 'clear' being missed because it has arrived too soon. One possibility discussed was to delay processing the clears. [2009-03-26] It was pointed out that whilst the 'clear' is missed by the ticket system, the alarm will be removed from nincom when the issue goes away. [2009-04-24] Once we move to R89 and have different call mechanism (modem) this code will be reworked anyway. Review this then. Migrated to Footprints Incident Helpdesk (merged with action 20090924-02)
A-20090212-06 James T Chase up getting a reduction in the number of FSPROBE errors. A flood of these posed a threat to RT. [2009-03-12] This is already on the Fabric team's list of issues. Migrated to Footprints Incident Helpdesk
A-20090226-04 Jonathan During 23/24 February the database monitoring sent a message to RT. This should have generated a callout but didn't. Investigate what went wrong. [2009-02-27] Initial thought - at the time of the alarms host cdbc13 was in Nagios downtime because of the intervention on the Castor Atlas instance. BUT it appears there is a problem, as the Nagios hostgroup used for the downtimes includes a node that serves both Atlas and LHCb. Need a more general solution here. [2009-03-05] Discussion: Carmine thought the allocation of nodes to the Nagios HostGroup may not be correct in this case, or at least could be improved. This should be checked. [2009-03-12] Agreed that the database systems NEPTUNE and PLUTO should be taken out of the HostGroup and made stand-alone (each in its own group). Likewise, the disk servers (which can be auto-configured) should be separated from the Castor central nodes for each instance.
A-20090305-02 James T Follow up with getting double disk failures and RAID rebuild failures to page. [2009-03-12] This is being looked at. [2009-06-18] New firmware expected that will alleviate problem. [2009-08-20] New firmware received and being tested. [2009-09-03] New firmware does not install on relevant nodes. Migrated to Footprints Incident Helpdesk
A-20090423-02 Gareth Obtain contact details for experiment shifters (phone & e-mail) for use by AoD/On-Call. Migrated to Footprints Incident Helpdesk
A-20090521-01 James T Set up a method to view the time until the next fsck is due on a system (e.g. in MIMIC). [2009-08-06] This can be done by running a command or script (i.e. at a point in time). This is a low priority action. Action dropped as low priority. Also, we will be running an fsck on the disk servers and setting the date of its next default run beyond the end of the next LHC run.
A-20090709-02 Matt V Check on stale nfs mounts. This was a problem for the Castor LSF system after the network intervention. Castor will be reconfigured next week to not use nfs. However, this flags up that other systems may benefit from monitoring nfs mounts (Derek suggested the CEs). James to add this to the 'exception monitoring' task. [2009-09-24] For Grid Services a list has been provided. Need some input from Castor with either a list of systems that remain to be added to this list, or confirmation there are none. This has been included in Cheney's list of Nagios changes. A general place-holder for Castor Nagios changes has been added to the Footprints Incident Helpdesk
A-20090716-04 Jonathan Modify tests for Castor tape servers so that there is a callout only when two or more failures occur on different systems. Just send a ticket to RT for problems on a single tape server. This is in response to an out-of-hours callout for the failure of rfiod and rtcpd on Castor201. Migrated to Footprints Incident Helpdesk
A-20090716-05 Jonathan Callout (i.e. page) for failures of VO-specific SAM tests during the working day (on three-in-a-row failures, as for the OPS tests). This is initially only during the day while we build up more confidence in the reliability of the VO-specific SAM tests. Migrated to Footprints Incident Helpdesk
A-20090806-04 Tim / Cheney Errors on tape servers (missing processes.) Need a general change (more general than A-20090716-04) such that all these tests only call out if two (or more) tape servers have problems. Migrated to Nagios Implementation List (Cheney)
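The aggregation policy in the tape-server actions above (page only when two or more tape servers show a failed process; a single failure just raises a ticket) could be sketched as follows. This is a hypothetical helper for illustration, not the actual Nagios event-handler code; the function name and return values are invented.

```python
def escalation(failed_hosts):
    """Decide the alert level from the tape servers currently reporting
    a failed process (e.g. rfiod or rtcpd). Hypothetical helper, not the
    real Nagios configuration."""
    distinct = set(failed_hosts)  # count distinct servers, not alarms
    if len(distinct) >= 2:
        return "page"    # failures on two or more servers: call out
    if len(distinct) == 1:
        return "ticket"  # a single server: RT ticket only
    return "ok"
```

The same host-counting idea generalises to the "more general change" requested in A-20090806-04: any per-tape-server test feeds its failing hostname into one shared decision.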
A-20090806-05 Cheney Review documentation for Castor callouts whose spreadsheet entries say "Call OPS". It is not always clear under what circumstances this should be done. Migrated to Nagios Implementation List (Cheney)
A-20090806-06 James T / Chris GDSS229 - status incorrect in Overwatch: Implement a system to verify the status of disk servers in Overwatch against their actual Castor status. [2009-09-10] This is being worked on and progress being made. Migrated to Footprints Incident Helpdesk
A-20090806-08 Matt H. MAUI - configure Nagios test on the number of held jobs in MAUI. This has been done. It just remains to create/check documentation for handling the fault.

[2009-09-02] Documentation updated (Grid alarm response); this check not yet configured to call out. [2009-09-25] Not adding to callout while considering the range of situations that can result in held jobs. Also, the batch service can now temporarily defer jobs before putting them into batch hold, which may change the response. The problem with incorrect SSH keys on WNs is not fully solved. Note added 2-Mar-2010. This does not call out so does not need 'call-out' documentation. Complete.

A-20090820-02 James T / Jonathan Review status and monitoring of MARLEY and other systems involved in the callout system. Is it appropriate to call out on these systems (e.g. if two out of the three fail)? Migrated to Footprints Incident Helpdesk
A-20090827-01 Gareth Raise issue of updating CRLs. Should a better(?) method be put in place, for example a common method across the Tier1. [2009-09-10] This was discussed at the Post Mortem Review for the A/C problems. It was agreed some method of checking (and if necessary updating) CRLs at/after boot was deemed necessary. Leave marker here until satisfied this action tracked appropriately. [2009-09-24] Another occurrence of this over the weekend on a single machine. Check if there is a Nagios test for out of date CRLs. Migrated to Footprints Incident Helpdesk
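The "Nagios test for out of date CRLs" mentioned above could look something like the sketch below, assuming PEM-format CRL files and the openssl CLI on the PATH. The function names and the 6-hour warning margin are illustrative, not taken from any actual Tier1 check.

```python
import datetime
import subprocess

def crl_next_update(path):
    """Return the nextUpdate time of a CRL file, via the openssl CLI.
    Sketch only: assumes a PEM-format CRL and openssl on the PATH."""
    out = subprocess.run(
        ["openssl", "crl", "-in", path, "-noout", "-nextupdate"],
        capture_output=True, text=True, check=True,
    ).stdout.strip()
    # openssl prints e.g. "nextUpdate=Sep 30 12:00:00 2009 GMT"
    stamp = out.split("=", 1)[1]
    return datetime.datetime.strptime(stamp, "%b %d %H:%M:%S %Y %Z")

def crl_is_stale(next_update, now, warn_hours=6):
    """True if the CRL has expired or will expire within warn_hours.
    The 6-hour margin is an illustrative choice."""
    return next_update - now < datetime.timedelta(hours=warn_hours)
```

Run at/after boot over each file in the CA certificates directory, a stale result would trigger a re-fetch of the CRLs before services start.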
A-20090924-02 Jonathan Look up the cause of our missing a 'clear' when an alarm is raised and cleared in quick succession - as happens in some cases with the database monitoring (Peaceful). Migrated to Footprints Incident Helpdesk (merged with action 20090129-02)
A-20090924-03 Jonathan Check/Update list of systems being checked for NFS servers to include all software serving nodes in PBS. Action terminated. No longer so relevant.
A-20090924-04 Derek Document how to access and modify RT so as to be able to break a loop of e-mails/tickets bouncing. Migrated to Footprints Incident Helpdesk
A-20090924-05 Gareth Find out where we are as regards a test for crond not running. Migrated to Footprints Incident Helpdesk
A-20091015-02 Gareth Clarify with OPS (Hiten) & Tim the procedure for calling out when there are Castor problems. Should go through the standard route (i.e. primary on-call). Done. Feb 2010.
A-20091105-01 (Gareth) Implement a test for failure to get database backups to tape. This has been included in Cheney's list of Nagios changes. A general place-holder for Castor Nagios changes has been added to the Footprints Incident Helpdesk
A-20091105-01 Matt H/Jonathan Implement a test for read-only file systems. This was discussed in relation to Worker Nodes. However, it is something that should exist across all systems.
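A test for read-only file systems, as proposed above, could scan /proc/mounts for the "ro" mount option. The sketch below is illustrative only; the function name and the list of filesystem types to ignore are assumptions that a real check would need to tune per system class.

```python
def readonly_mounts(mounts_text, ignore_fstypes=("proc", "sysfs", "tmpfs", "iso9660")):
    """Scan the content of /proc/mounts for filesystems mounted read-only.
    Sketch of the proposed check. Field layout per line:
    device mountpoint fstype options dump pass."""
    hits = []
    for line in mounts_text.splitlines():
        fields = line.split()
        if len(fields) < 4 or fields[2] in ignore_fstypes:
            continue
        if "ro" in fields[3].split(","):  # options are comma-separated
            hits.append(fields[1])
    return hits

# On a live host:
# with open("/proc/mounts") as f:
#     bad = readonly_mounts(f.read())
```

An ext3 filesystem remounted read-only after an I/O error would show up here even though the original mount requested "rw", which is exactly the Worker Node failure mode this action describes.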

Closed Actions

Action ID Associated milestones Owner Description Status Date Closed
A-20071130-01 M-2 Andrew Talk to Neil regarding his knowledge of issues around benefit in kind and mail around a summary Done. 2008-01-25
A-20071130-02 M-3 Andrew Martin David Gordon Ensure list of critical hosts, alarms and responses is produced. Documentation of responses will be required by December 21 [2007-12-07] CASTOR list generated by Cheney; Grid Team lists generated by Derek and Matt (waiting on Catalin); James progressing Fabric Team list.

[2007-12-20] Documentation from Cheney, Derek and Matt uploaded to mailing list file area.

[2008-01-11] Need to discuss DB side with Gordon; Fabric Team list/responses and some Grid Services responses still pending.

[2008-01-18] DB monitoring system will escalate alarms to Nagios via e-mail; need Nagios host monitoring of the monitoring system, peaceful. Documentation of DB alarms progressing.

[2008-01-21] Documentation received from Catalin, and critical alarm list from Jonathan.

[2008-01-25] Documentation received from Jonathan.

[2008-02-15] Documentation previously sent by Carmine, but permission problems with the mailing list. Closing action.

2008-02-15
A-20071130-03 M-5 Jonathan Demonstrate end to end callout by December 21 [2007-12-21] End-to-end callout test done. 2007-12-21
A-20071130-04 M-5 M-7 Jonathan Cheney Coordinate work on escalation of Nagios alarms to callouts [2007-12-07] CASTOR configuration to map critical alarms onto castor-contacts-callout Contact Group.

[2008-01-25] Will use event handler for critical host/service combinations.

[2008-02-15] Will not use SMS. Event handler configured for Fabric Team callout alarms. Sequence is Nagios->sysreq/sure->automate->bleep.

[2008-02-22] Waiting for Hiten to return to progress work on automate. Expect to use two bleepers for primary oncall, then one each for oncall backup, CASTOR, DB and Fabric/Grid. Bleep should always go to primary oncall. Need to keep a log of bleeps before going live.

[2008-03-07] Script controlling callout updated and tested; need to update configuration of Automate to cover Tier-1 primary contact.

Jonathan to add e-mailing callout alerts from script.

[2008-04-04] E-mailing of callout alerts raised and cleared now working.

2008-04-11
A-20071130-05 M-5 James Find lead time for obtaining pagers [2008-01-11] Have five pagers. 2008-01-25
A-20071130-06 Matt Add David to mailing list 2007-12-07
A-20071130-07 M-6 David Check group laptop requirements 2007-12-07
A-20071211-01 M-2 Andrew Resolve Questions Raised wrt Remuneration Paper (I) 1) Can wireless routers be provided to attach to pre-existing broadband.

[2008-01-11] Treat as per pre-existing laptops.

2) Can charges incurred on pre-existing equipment/phone be claimed back.

[2008-01-11] Probably not.

3) How are multiple calls in the same 2 hour window to be handled.

[2008-01-18] For unrelated callouts, each call can be claimed. More advice needed regarding claiming of overtime for overlapping work in the minimum two-hour overtime period.

[2008-02-15] One callout payment per 2-hour interval, but can be paid for multiple blocks of overtime for overlapping callouts.

4) Does equipment kept at home need to be insured.

[2008-01-11] No, but authorisation is required to take equipment home.

5) How will people handle flexi-clock after a late night session?

[2008-01-11] As per Operations.

2008-02-15
A-20071214-01 M-5 Carmine Jonathan Implement logging of DB faults via Nagios [2008-01-25] Will use passive test as per SAM plugin.

[2008-02-15] Testing is ongoing.

[2008-02-22] Options are snmp trap handling, or nsca for more direct communication with Nagios.

[2008-03-07] Nagios needs configuring to accept DB alarms; scripts are ready on the DB side.

2008-04-18
A-20071211-02 M-2 David Resolve Questions Raised wrt Remuneration Paper (II) 1) Can we pay to uplift people's car insurance in order to handle on-site attendance.

[2008-01-07] In principle, STFC should be able to meet these costs (staff should check whether existing policies cover business use); reimbursement would be linked to period over which staff are expected to provide cover. Reimbursement may be taxable (to be confirmed).

[2008-01-14] This would be a benefit in kind, and therefore taxable. Necessary insurance should be in place before attending on site; group leaders may want to confirm this before authorising payments.

[2008-02-15] David will clarify and resend.

2) What is the maximum safe working time on laptops.

[2008-01-07] DSE regulations state that the same limits apply to desktops and laptops: regular breaks (5-10 minutes per hour of continuous use) are more effective than longer, less frequent breaks; a suitable surface should be used for laptops, and docking stations should be provided, and used whenever possible.

2008-02-15
A-20080111-01 M-3 David Extend CASTOR incident logging to Tier-1 and Database groups. [2008-02-15] Trialing continues for CASTOR Group. David will discuss with Gordon; Grid/Fabric Groups already aware of the processes. 2008-04-04
A-20080111-02 M-3 M-7 Andrew Write document detailing oncall processes. [2008-02-22] Closing of incidents needs careful consideration, and each incident should have a single coordinator to avoid the problem of people working on faults declaring an incident over when they have not been exposed to the full extent of the fault. Need to decide how to deal with communication during incidents. 2008-10-24
A-20080307-01 Jonathan Configure nagios to feed all critical alarms to callout. [2008-04-04] Grid Services and DB alarms now callout; CASTOR alarms still to be configured. 2008-05-02
A-20080307-02 Jonathan Set up callout script to e-mail oncall alerts to Matt/Jonathan, Cheney (CASTOR). [2008-04-04] Still need to add Cheney.

[2008-04-18] Add Cheney and dutyadm.

[2008-04-25] Dutyadm done; still need to add Bonny and Cheney.

2008-05-02
A-20080404-01 Jonathan Document configuration process for adding/removing callout alarms. This is needed for the entire system (Nagios/Sure/Automate). Need to ensure that those requiring access to the necessary hosts have it. 2008-09-19
A-20080404-02 Jonathan Add all diskservers as critical hosts. [2008-04-25] Exclude only non-LHC, non-CASTOR and dteam disk servers. 2008-05-02
A-20080404-03 Matt Write documentation for processes (getting started, prerequisites, etc.). 2008-05-30
A-20080404-04 Matt Check uptake of primary oncall and second-line support. 2008-04-25
A-20080404-05 Matt Draft rota/contacts page on internal e-Science Wiki. 2008-04-25
A-20080411-01 Cheney Review list of CASTOR callout alarms 2008-05-09
A-20080425-01 James Consider remote console management for critical services. [2008-07-11] This work is in progress. Disk servers are being cabled for IPMI. 2008-10-24
A-20080425-02 Catalin Reconstruct timeline for alarms and helpdesk tickets resulting from the network failures on 2008-04-24 and 2008-04-25. 2008-05-02
A-20080425-03 Jonathan Raise issue of Fabric Team participation in OnCall at their Group meeting 2008-05-09
A-20080425-04 Jonathan Confirm 8-hour acknowledgment limit in Sure, and decide what to do about it. [2008-05-02] Alarms should be ignored on nincom rather than acknowledged, since acknowledgements expire after 8 hours. Need to document this on Wiki.

[2008-05-09] If alarms are ignored to circumvent the 8-hour clearing of acknowledged alarms, don't close helpdesk tickets; dutyadm should follow up.

[2008-05-16] Agreed that we should increase the acknowledgement expiry.

[2008-06-27] Updated to 24h - closed

2008-06-27
A-20080502-01 Jonathan Raise issue of how to write Nagios tests for the various networks in Fabric Team meeting. [2008-05-16] Need separate network monitoring infrastructure from switches through to OPN. 2008-05-30
A-20080502-02 Jonathan Document best practice for adding/removing hosts from checking (also groups of hosts/services) [2008-05-09] Document need to disable event handler if required.

[2008-05-16] Still need some clarity on how to handle scheduled service interruptions. Do we have to stop passive tests, disable event handler, etc?

2008-09-19
A-20080523-01 Derek Review whether CE02 high-load alarm should callout. High load ignored. 2008-05-30
A-20080509-01 Jonathan Implement method of stopping and restarting callouts Done 2009-03-19
A-20080509-02 Cheney Ready to act as backup for Jonathan for all Nagios configuration. (Formerly part of A-20080404-01.)

[2008-04-18] Need Cheney ready to act as backup for Jonathan by CCRC08. Install keys where needed for primary on-call (e.g., nincom, service hosts).

[2008-05-30] This should be considered a high priority. [2008-07-11] Cheney is able to do this - item closed

2008-07-11
A-20080627-01 Jonathan Investigate purchase of email->pager gateway [2008-07-16] No need to purchase a service; e-mail can be sent to <pager number>@paging.vodafone.net (this sends mail to the on-call pager). Note that there is a message limit of 240 characters, including the Subject field, and a limit of 25 messages per 3-month accounting period (to exceed this we need to pay £1 per 175 messages). 2008-08-22
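The e-mail-to-pager gateway above could be driven by a small script like this sketch, which truncates the body so subject plus body stay within the 240-character pager limit. The SMTP host, sender address, and function names are placeholders, not the site's real values.

```python
import smtplib
from email.message import EmailMessage

PAGER_LIMIT = 240  # pager message limit, including the Subject field

def fit_to_pager(subject, body, limit=PAGER_LIMIT):
    """Trim the body so that subject + body stay within the pager limit."""
    return body[:max(limit - len(subject), 0)]

def send_page(pager_number, subject, body,
              smtp_host="localhost", sender="nagios@example.org"):
    """Send an alert to the on-call pager via the Vodafone e-mail
    gateway described above. smtp_host and sender are placeholders."""
    msg = EmailMessage()
    msg["To"] = "%s@paging.vodafone.net" % pager_number
    msg["From"] = sender
    msg["Subject"] = subject
    msg.set_content(fit_to_pager(subject, body))
    with smtplib.SMTP(smtp_host) as s:
        s.send_message(msg)
```

The 25-messages-per-quarter quota would still need tracking separately, e.g. by logging every call to send_page.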
A-20080627-02 Cheney Castor team to consider policy in the event of disk servers being down. Policy/process to be documented on the wiki in a special section. 2008-09-29
A-20080627-03 Andrew How should blue group members notify on-call of faults they find? [2008-07-11] Documented in the process documentation 2008-07-11
A-20080711-01 Jonathan How to monitor nincom/nagios more comprehensively 2008-08-22
A-20080801-01 Jonathan Need mechanism (outside automate?) to swing between pagers. [2008-11-14] Now coded but not tested; will add wrapper script. 2008-12-05
A-20080801-02 Jonathan Set up test bleep at 16.00 daily, to be cleared at 16.15 2008-09-19
A-20080912-01 Jonathan Produce a list of disabled event handlers and mail it to on call mailing list before next meeting. 2008-09-19
A-20081024-01 Andrew Agree Networking callout procedures: terms of engagement, how to contact them, types of requests that we can make. 2008-11-14
A-20081024-02 Ian Determine method to access network topology. [2009-03-19] In progress Done 2009-11-26
A-20081121-01 Jonathan Reduce frequency of ping testing for host up/down after initial failure. Cannot remember why we wanted this, so remove action. 2009-07-09
A-20081128-01 James Thorne Follow up with the Fabric Team on how we will deal with drive throws during the holiday period. Decide whether we call out on single/double disk failures, and document any procedures. 2009-01-22
A-20081128-02 Matt Set up extra rotas for holiday period 2008-12-05
A-20081128-03 Jonathan Check that the alarm responses for core Fabric hosts are sufficient 2009-01-22
A-20081128-09 Cheney Configure Nagios to make CASTOR checks more frequent. Some checks run infrequently (e.g. every two hours), and we get generic callouts (SAM test failures) when specific problems should already be known. [2008-12-05] May not be trivial as frequency depends on the number of tests configured on the (01) slave. Might have to, for example, split CASTOR tests across more than one slave. [2009-04-24] Review this again after hardware changes made to Nagios systems. Deprecated 2009-12-02
A-20081128-04 Gareth Advertise to the experiments what our commitments over the holiday period are 2009-01-22
A-20081128-05 Andrew Confirm Networking oncall availability over the holiday period 2009-01-22
A-20081128-06 Matt Confirm OPS oncall availability over the holiday period 2009-01-22
A-20081128-07 Matt Add pager-person mapping to rota 2008-12-05
A-20081128-08 Jonathan Set up pager swings for the holiday period once rota is finalised 2009-01-22
A-20090122-01 Gareth Clarify ownership and subsequent handling of tickets in the oncall queue that are created by an oncall person overnight. [2009-02-13] - proposal circulated. Not yet added to Wiki. 2009-06-18
A-20090129-01 Gareth Check with Shaun under what circumstances the srmDaemon back end process should (or should not) be restarted. If necessary clarify in documentation. This was confirmed with Shaun. The documentation is correct as it stands. In particular there is no requirement to always restart the srmDaemon. However, restarting does not cause undue problems. 2009-02-12
A-20090129-03 Carmine Check documentation for alarms on Seth and Neptune. Discussed if we should flag up to correlate with other FTS errors (in this case). With Gareth make sure there are links to the documentation that details the roles of these database systems. A Wiki page has been provided that gives these details and has been linked in from elsewhere in the On-Call wiki. 2009-02-13
A-20090129-05 Gareth Discuss with Ian the possibility of Puppet generating a message of the day, or similar to give an easy way of finding out what a system does.

[2009-04-09] Change to discuss with Ian re Quattor.

Done - added to Ian list for Quattor. 2009-04-23
A-20090212-02 Cheney Can the restart of the JobMgr service be automated? This arose following alarm Tier1_service_c_dlf_proc_jobmgr_on_host_ccsd08. [2009-03-12] This work is scheduled to be done. [2009-11-05] This will be done via cron. Done 2009-11-26
A-20090212-03 Cheney Documentation missing for alarm: Tier1_service_c_ts_rtstat_failure-process_on_host_ctsc11 [2009-03-12] This work is scheduled to be done. [2009-09-03] Checked and documentation in place.
A-20090212-05 Gareth Update list of contacts on the Wiki with surnames. Done. 2009-03-02
A-20090226-01 Cheney Nagios scripts have been put in place to test cleanup of Puppet area. Check that this calls out and there is documentation Done 2009-04-23
A-20090226-02 Cheney The restart script for SRMDaemon didn't work. Check it works for the new version. This is done by Puppet. Cancel this action. This work is scheduled to be done. 2009-04-24
A-20090226-03 Matt H We seem to have called out on fewer than 3-in-a-row SAM test failures. Check logic. Unable to check the logic as too much time has now elapsed and the information has already been cleaned up. 2009-02-26
A-20090226-05 Cheney Check (and if necessary put in place) Nagios host check for all SRM Stager machines. Done 2009-04-23
A-20090226-06 Jonathan Add Nagios check and callout on NFSDaemons Nagios checks with callout are now in place for hosts lcg0614/7/8 and their rpc.mountd (NFS server) daemon. I have also updated the Fabric-alarm-response spreadsheet on the e-Science wiki with procedures in case either one of these hosts or a rpc.mountd (NFS server) daemon fails. This involved updating the documentation for maui restarts as well. 2009-03-03
A-20090226-07 James T Add Nagios test on degraded software RAID arrays. [2009-03-12] Need to check if these alerts should call out. Decided that this should not call out out-of-hours; however, it is urgent that the test is enabled. This needs follow-up outside this meeting. [2009-04-23] A list of the systems that need this test has been provided to Fabric team. Just needs implementing. [2009-05-07] This has been done for the list of servers provided by the grid services team. However, it was realized that there remain some Castor (and other) servers which need it. It would be expected to be needed on all servers that have software mirrored system disks. 2009-05-21
A-20090226-08 James T Enable, where possible, and test hot swapping of disks in software RAID arrays. [2009-03-12] In progress. Note that this will require a re-install of the OS so will be implemented for new systems. [2009-04-24] Flag this up to ensure included in Fabric Team's list of issues. Regard as done here. 2009-05-21
A-20090305-01 Chris K Check documentation for Tier1_service_c_stager_proc_rhserver_priv_new_on_host_castor101 error Done 2009-11-05
A-20090305-02 Gareth Investigate Nagios groups for disk servers so as to ease the task of setting only these into downtime. Could the Nagios host group membership be generated from Overwatch? See A-20090226-04 2009-04-23
A-20090326-01 Gareth A message via the web page will be missed if there is already a message there. (These are Tier1_Service_Check_Mail_Alarms_on_host_lcghelp0613 messages.) Add an item to the AoD checklist to verify there are no messages left there. 2009-06-18
A-20090326-02 Jonathan On 24th March there was a major power glitch but no callout. Need to understand why not, and fix this. [2009-05-07] This is now understood but not yet fixed. The Nagios slaves only call out pager 1, which was not the one in use by the on-call at the time of the incident. Furthermore, Pager 1 had exceeded the monthly quota of calls that it could receive via the web interface used by the Nagios slaves. (Now fixed). 2009-06-18
A-20090409-01 Jonathan Investigate ways to stop spam messages to Pager 1. [2009-05-21] In progress. Investigating specifying that only certain e-mail addresses can send messages to the pager(s). New number for pager now in place and working.
2009-06-18
A-20090409-02 Gareth Discuss with Tim creating ticket (e.g. by e-mail) when significant problem on tape robot, rather than just e-mail to duty admin. [2009-07-09] In progress. Instructions were to only send an e-mail outside working hours. Now extended to any time. Done. 2009-08-20
A-20090423-01 Matt H. Finalise procedures for throttling back access to the BDII following problem of particular site overloading all national BDII. [2009-05-28] Posted blog entry about BDII monitoring and blocking. Final procedure is automatic attempts to block problem networks, and if these don't work normal escalation to callout. 2009-07-09
A-20090430-01 Jonathan Investigate why there was no callout for GDSS200 when it failed on 29-Apr. [2009-04-30] This server, along with about a dozen other disk servers, was mis-configured in Nagios. This was fixed, although how it came about is unclear. It is believed these servers were put into production before the new server deployment scheme. 2009-05-07
A-20090618-02 Matt H & Jonathan Understand batch scheduler problem of 17th June and appropriateness of Nagios test on MAUI calling/not calling out. [2009-07-09] Believe a nagios test calling out would not be appropriate in this case as it is likely to lead to many false alarms, and this particular case is rather pathological. [This to be confirmed with Matt H.] Matt H. says that there is a nagios test available in SVN. Jonathan to apply it in August. [JohnK 2009-07-23] 2009-08-27
A-20090618-03 Matt H Follow up on the problem of the CE not accepting Atlas work when it reaches 32000 files in a directory. [2009-07-09] Done apart from setting up a nagios test for this condition. 2009-08-27
A-20090618-04 Castor team & Gareth Why was there no callout on the Nagios host test on SRM0661, which failed on 17th June. Too detailed to be picked up now. Ignore action. 2009-07-09
A-20090618-05 Gareth Contact Hiten to confirm that the automate system will not have a gap in its service if (when) it moves to R89. Automate will move to R89. Hiten is aware that we need notice of this. Expect a gap (about 4 hours) in callout during the move, which will be during a working day. 2009-07-09
A-20090709-01 Jonathan We did not receive many tickets (or callouts) during the network intervention on 7th July. Check logs to verify this was correct, and not because of some failure. I.e. should we have received more tickets? No longer relevant. 2009-08-20
A-20090716-01 Jonathan/Cheney Now that there is no longer a /lsf partition on the Castor disk servers, remove the test for this being full. However, confirm that we are alerted if "/" fills up. Also, verify where the LSF logs now go. Complete; check for /lsf partition removed as it is covered by check for / partition. LSF logs go to /lsf/log 2009-09-03
A-20090716-02 Gareth There is a long term plan for the existing SAM infrastructure to be replaced with Nagios. Add this to the list of long term (strategic?) tasks Done 2009-08-27
A-20090716-03 Shaun/Jonathan/Gareth Investigate why there was no callout for the failure of gdss312. This did not call out or raise a ticket at the end of the intervention on Tuesday 14th when a notification was received by e-mail. (Confirm exactly what happened.) Done 2009-08-20
A-20090716-06 Gareth For large interventions with a significant number of nodes, modify the downtime procedures so that Nagios downtimes end an hour before the overall downtime ends (in the GOC DB). Also add in a review some 30 minutes before this. This is to allow time for the Nagios monitoring to report problems before the outage ends but after systems are (or should be) up. Done 2009-09-17
A-20090730-01 John K Ask James about faulty worker nodes being put back into production and causing jobs to be held. Done 2009-08-20
A-20090806-01 Jonathan Confirm Old Loggers removed from callout and documentation updated accordingly. Done 2009-08-11
A-20090806-02 Gareth / Tim Verify that tape systems go into Nagios downtime ahead of interventions. (I think this is the correct action; I realised my notes just state "check procedure for this type of thing!") Done 2009-09-17
A-20090806-03 Chris Logs on LSF server: need to update the documentation to correct the path to the log files. Done 2009-11-05
A-20090806-07 Matt H. Implement a test for a number of batch workers being unavailable. Suggest callout if more than 25% of them are unavailable. Done 2009-08-20
A-20090820-01 Gareth Confirm with Martin status of documentation on power controllers. There is some lack of confidence in this as it incorrectly refers to equipment still being in R27. Verified with James A. 2009-09-24
A-20090827-02 Gareth We have been seeing intermittent failures on the test of the OPN. Understand this. Done. Tiju has modified the test to go to a different node at CERN. 2009-09-03
A-20090827-03 Gareth Add a link to the LHCOPN ticket system to the documentation on the OPN test failure. Done 2009-09-03
A-20090827-04 Gareth / Jonathan Check on Nagios downtimes expiring over the weekend (especially the forthcoming long weekend.) Document this as a regular weekend activity for AoD. Jonathan has made the check ahead of this weekend. Gareth added to AoD documentation. 2009-09-03
A-20090903-01 Gareth Check with Hiten cause of Pager not working on Wednesday late afternoon (2nd Sep). The problem was that the modem was getting a 'busy' from the other end. Hiten contacted Vodafone who cleared a fault at their end. 2009-09-10
A-20090903-02 Matt V. Find out how to make a claim for on-call/overtime once SSC up and running. (Matt V. to ask at next week's presentation.) Matt reported back from one of the SSC familiarisation sessions. The response was that this process will not change at the moment. 2009-09-10
A-20090910-01 Matt H. Look at improving Nagios test on unavailability of batch workers to allow for case when many nodes in draining for intervention. [2009-09-11] Done. Script now ignores WNs in Nagios downtime. 2009-09-24
A-20090910-02 Jonathan (Gareth) Create an end-to-end test of Nagios and Pager. Use this instead of the current 'Test Pager'. New test in place - old one left there as well. 2009-09-17
A-20090917-01 Derek Check on-call documentation and actions following the upgrade of the batch system to SL5. Done 2009-09-24
A-20090917-02 Gareth Check procedures for intervention on D1T0 disk servers following a memory fault. A quicker response may be obtained by taking memory out of a spare system, rather than waiting for new memory to be ordered. Done 2009-10-22
A-20090924-01 Matt V Check if there is a re-starter on the mighunter process (or verify if one is needed). Done (via crond) 2009-11-05
A-20091015-01 Jonathan Enable standard Nagios check for DNS servers. Set to create ticket (by e-mail) but not to page. Done 2009-11-19
A-20091022-01 Carmine/Tiju Modify Database Team's operating procedures to put services into Nagios downtime during interventions. Done 2009-11-05