Difference between revisions of "SRM File Loss"

From GridPP Wiki

Revision as of 12:03, 19 July 2017

If you become aware of file loss from your SRM the procedure to be followed is roughly:

Initial Investigations

Make an assessment of the situation, to determine how bad it is and whether files have been irretrievably lost.

  • If this is starting to take too long (>30min?) then put your site into downtime and make an EGEE broadcast stating that an investigation has started.

Can the disk/machine be restarted?

If it can, then restart it and restore service as quickly as possible. If not...

Is the loss temporary or permanent?

If the machine has a correctable fault (e.g. blown power supply), then

  1. It may be appropriate to put your site into downtime (you can mark a downtime as only affecting the SE).
  2. If so, make an EGEE broadcast stating that your site is in unscheduled downtime, hopefully giving an estimate of the return time.

If the fault is uncorrectable, then initiate the file loss procedure.

Call in the experts?

There are lots of storage experts in the UK who may be able to help with this procedure. You should certainly work with your Tier-2 co-ordinator.

File Loss Procedure

If you are really convinced that files have been lost then there may be little point in going into downtime - they aren't going to magically come back...

Find out which SURLs are affected (this section needs reviewing)

You need to be able to tell the VO data managers which files have been lost, so that they can be purged from the VOs' catalogs.

Generally, it's a better idea to put this information on the web and send a link, rather than making a giant broadcast message.
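
As a rough illustration of building such a list, the sketch below turns namespace paths (for example from a storage dump of the failed disk server) into short-form SURLs and de-duplicates them. It is only a sketch: the function name build_surls, the host se01.example.ac.uk and the /dpm/... paths are made up for the example, and the exact SURL layout depends on your SE implementation, so check the output against a known-good SURL for your site.

```python
def build_surls(namespace_paths, se_host):
    """Turn SE namespace paths into short-form SURLs (srm://host/path).

    namespace_paths: iterable of absolute paths as seen in the SE namespace.
    se_host: hostname of the SRM endpoint (hypothetical in this example).
    """
    surls = []
    for raw in namespace_paths:
        path = raw.strip()
        if not path:
            continue  # skip blank lines in the dump
        if not path.startswith("/"):
            raise ValueError(f"expected an absolute namespace path, got {path!r}")
        surls.append(f"srm://{se_host}{path}")
    # De-duplicate and sort so the list is stable and easy to compare.
    return sorted(set(surls))


if __name__ == "__main__":
    # Assumed example input; replace with the paths from your own dump.
    lost_paths = [
        "/dpm/example.ac.uk/home/atlas/file_b",
        "/dpm/example.ac.uk/home/atlas/file_a",
    ]
    for surl in build_surls(lost_paths, "se01.example.ac.uk"):
        print(surl)
```

The output, one SURL per line, can be saved to a text file on a web server and the link sent to the VO.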

Inform the VOs

Once you know which files have been lost, use the EGEE broadcast tool on the CIC portal to send this information to the VO data managers. You should CC the UKI ROC and let your Tier 2 coordinator know.

Postmortem

If you have suffered file loss then it would be worthwhile reviewing your storage hardware/policies/procedures to see if future losses can be avoided.


Current Position on SE File Loss (at Tier-2s)

ATLAS

Rationale
  1. Files that are non-unique (have replicas elsewhere) are less valuable than files that are unique. Files in the ATLAS SCRATCHDISK and (LOCAL)GROUPDISK tokens are more likely to be unique than the contents of any other token.
  2. Inconsistency in experiment infrastructure catalogues is more damaging than lost non-unique files (as it indirectly results in additional lost work as jobs are incorrectly sent to sites without the data they expect). Therefore, reestablishing consistency is usually more important than recovering every last file.
Position for Tier-2 Sites (ATLAS perspective)
  1. generate a list of all SURLs (complete with full srm:// path) for data on the lost server.
  2. post the list of SURLs to atlas-support-cloud-uk@cern.ch (0), along with a brief description of the situation (site name, disk server loss type, if the disk server is completely lost or if there is some possibility of data recovery).
  3. if the data is totally unrecoverable, then:
    1. the ATLAS consistency service will automate updating ATLAS catalogues to the new state (this includes removing the files from the SE in question, and repairing datasets at the affected site where possible)
  4. else (partial loss of a data server and potential loss of the data therein)
    1. files that are *not* unique *will* be considered lost immediately, and the consistency service run to repair matters from the existing replicas (which includes removing the copies at the affected site from their namespace). This is identical to 3.1.
    2. it is possible that ATLAS may ask the site to attempt further recovery of unique data, but this will be decided on a case-by-case basis (dependent on the type of files and the difficulty of recovery estimated by the site, amongst other factors). If not, the files will be treated as lost, as with case 3.


(0) or, if you know how to, open a Jira ticket in the relevant project at https://its.cern.ch/jira/browse/
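
To illustrate the rationale above, the sketch below splits a list of lost SURLs into unique and non-unique files. The replica counts are an assumption made for the example: a site will not normally have this information, which is held in the ATLAS catalogues, and the function name split_by_uniqueness is hypothetical. It is not the ATLAS consistency service, just a picture of the distinction it relies on.

```python
def split_by_uniqueness(lost_surls, replica_counts):
    """Split lost SURLs into (unique, replicated) lists.

    replica_counts: assumed mapping of SURL -> number of replicas held at
    OTHER sites. A SURL missing from the mapping is treated as unique,
    which is the cautious choice.
    """
    unique, replicated = [], []
    for surl in lost_surls:
        if replica_counts.get(surl, 0) > 0:
            replicated.append(surl)  # can be repaired from another replica
        else:
            unique.append(surl)      # candidate for further recovery attempts
    return unique, replicated
```

In the partial-loss case, the replicated list corresponds to files that are declared lost immediately, while the unique list is what ATLAS might ask the site to try to recover.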

CMS

Position for Tier-2 Sites
  1. Tier 2 site admin generates a list of all SURLs (complete with full srm:// path)
  2. Give the list of SURLs to their Tier 2 CMS liaison. (This may well be the same person, or someone else at the site, or at another T2.)
(Nota bene: there are various CMS tools that whoever runs the PhEDEx configuration for the site can use to clean up and consistency-check the site.)

LHCb

Create a list of SURLs for the files which are lost. The site may have to remove the physical/nameserver file from the site, depending on the failure mode. Email the UK VO contacts and CC the LHCb data management mailing list <lhcb-datamanagement@cern.ch>.

Alice

Request for information with VO contacts. This will be updated when information is received.

This page is a Key Document, and is the responsibility of Brian Davies. It was last reviewed on 2017-07-17 when it was considered to be 95% complete. It was last judged to be accurate on 2017-07-17.