RAL Tier1 OnCall Milestones

List of milestones related to the RAL Tier-1 On-call Service.

See also RAL Tier1 OnCall Actions

Milestone ID | Milestone | By End | Associated Actions | Owner | Description | Status
M-1 | Define objectives of callout | October | | Andrew | Define our general requirement: what kinds of events should be handled by a callout. (Probably not WLCG-07-01.) | Done
M-2 | Staff agree to provide cover | November | A-20071130-01 (done), A-20071211-01 (done), A-20071211-02 (done) | Andrew | Provide details of the financial remuneration that staff will receive. Address concerns regarding tax for benefit in kind on hardware. Define expectations with respect to: mode of working (on-call/best efforts); remote versus on-site visit; time to respond and hours to work; number of calls to handle per week. Gauge initial staff take-up. | [2007-12-14] Initial uptake from all groups polled. Need Neil to sign off before people can start claiming. [2008-01-11] Signed off by Neil; people can claim for work done over the Christmas holidays.
M-3 | Alarm list and response | December | A-20071130-02 (done), A-20080111-01, A-20080111-02 | Matt | Define the list of hosts and alarms that we will call out on. Define the procedures to follow when an alarm is raised. |
M-4 | Define interaction with third parties | December | | Matt | Need to decide how third parties (such as experiment production, CIC, ROC and other RAL teams) may interact with the callout service, and what they are allowed to do. | [2008-01-18] Will use peered Nagios (CERN-RAL) when/if that is available. This will monitor experiment services, and we will choose which critical alarms will raise callouts.
M-5 | Automation system | December | A-20071130-03 (done), A-20071130-04, A-20071130-05 (done), A-20071214-01 | Jonathan | Monitoring and automation system capable of calling out. We have Nagios alarms, but we still have to get them to SMS or a bleeper. | [2007-12-21] End-to-end callout test done.
M-6 | On-call hardware | December | | James | Provide staff with laptops and, if necessary, mobile phones to allow them to respond to calls. Expect delays in obtaining laptops owing to supply shortages. | [2007-12-14] Specs almost finalised; need to consider 3G and/or Bluetooth requirements. [2008-01-11] Most people have chosen which laptop they want. [2008-01-25] List sent to FBU/IT for quote. [2008-02-12] Requisition form sent to Finance. [2008-02-22] Laptops ordered. [2008-02-29] Laptops have to be Windows due to encryption issues. [2008-04-04] Most laptops encrypted and collected; lend laptops to CASTOR Team members if required. [2008-05-16] Investigate 3G options. We should provide this for anyone who needs it. [2008-05-30] James has ordered 3G cards for testing. Cards for other laptops to follow. [2008-07-04] Everyone has had the opportunity to acquire 3G cards; no outstanding issues. Closing.
M-7 | Trial (dummy) service | January | A-20071130-02 (done), A-20071130-04, A-20080111-02 | Matt | Trial service with dummy callout. Test alarm handling processes and documentation. Admin on duty handles alarm condition during daytime. Will run first trial during CCRC08 (M-10). |
M-8 | Complete safety risk assessment | January | | Andrew | | [2008-05-09] Safety risk assessment agreed; Andrew to circulate document. [2008-08-22] Done except for agreement on minor issue. [2008-05-30] At final draft stage.
M-9 | Recruit incident response staff | February | | Andrew | Recruit incident response staff, who will be required to handle the first line of callout. Need to resolve the problem that we needed three but only got two. | [2008-02-15] Consider how to start the service with existing staff; experience will be useful for training new starters. [2008-02-22] Paperwork done and approved. [2008-05-16] External recruitments approved. [2008-08-22] Tracked elsewhere; closing.
M-10 | Proposal to run trial on-call for CCRC08 in February | February | | Andrew | | [2008-02-15] May not be possible to get the service running during CCRC08; will aim for a trial service as soon as possible, subject to completing Nagios configuration and delivery of laptops.
M-11 | Callout service WLCG-07-03 | March | | Matt | Start of live callout service. |