RAL Tier1 CASTOR planning

Page no longer actively maintained.

Castor at RAL Status and Plans

STATUS

As of 9 Dec 2010:

  • Prod stager instances
    • cms: 2.1.9-6, SLC4-64
    • atlas: 2.1.9-6 + hotfix for database performance, SLC4-64
    • lhcb: 2.1.9-6, SL4-64
    • gen: 2.1.9-6, SLC4-64
    • repack: 2.1.8-14, SLC4-64
  • Prod nameservers: 2.1.8-3, SLC4-64 (2 hosts)
  • Prod tapeservers: 2.1.8-8, SLC4-64
  • VDQM2 servers (x2): 2.1.8-3
  • Prod SRMv2:
    • cms: 2.8-2 (database version 2.8-2, local ns: 2.1.8-17)
    • atlas: 2.8-6 (database version 2.8-6, local ns: 2.1.8-17)
    • lhcb: 2.8-2 (database version 2.8-2, local ns: 2.1.8-17)
    • alice/gen: 2.8-2 (database version 2.8-2, local ns: 2.1.8-17)
  • Oracle:
    • Test Oracle RAC (Vulcan) for pre-prod instance
    • Prod Oracle RAC-1 (Pluto): cms stager, cms srm, gen stager, gen(alice) srm, repack
    • Prod Oracle RAC-2 (Neptune): nameserver, lhcb stager, lhcb srm, atlas stager, atlas srm
  • Tape drives: 16 drives
  • Castor Information Provider (CIP):
    • 2.1.0.1
  • LSF:
    • cms: v7.0.2
    • atlas: v7.0.2
    • lhcb: v7.0.2
    • gen: v7.0.2
    • (different job slots/protocol functionality added; all instances now use HTTP instead of NFS for the job manager)
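
Keeping the version summary above current mostly comes down to asking each host which CASTOR packages it has installed. The sketch below is one minimal way to collect that over ssh; the host names and the castor* RPM naming pattern are illustrative assumptions rather than the actual production inventory.

  #!/usr/bin/env python
  """Minimal sketch: collect installed CASTOR RPM versions from a set of hosts
  over ssh.  Host names and the 'castor*' package pattern are assumptions."""

  import subprocess

  # Hypothetical host list; in practice this would come from the site inventory.
  HOSTS = ["stager-atlas.example", "stager-cms.example", "ns01.example"]

  def castor_packages(host):
      """Return the sorted list of castor* RPMs reported by 'rpm -qa' on a host."""
      result = subprocess.run(
          ["ssh", host, "rpm -qa 'castor*'"],
          capture_output=True, text=True, check=True,
      )
      return sorted(result.stdout.split())

  if __name__ == "__main__":
      for host in HOSTS:
          print(host)
          for pkg in castor_packages(host):
              print("   ", pkg)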

KEY DATES

VO tests, data challenges and key dates

The plans for each experiment are now on separate pages:

Oct 2009?: First data taking (ALICE, ATLAS, CMS, LHCb)

WEEKLY OPERATIONS

Weekly reports on the activities of the Tier1 CASTOR group can be found here.

CASTOR SERVICE SCHEDULES AND DEVELOPMENTS

Prod instance upgrades

For past and future upgrades please check the weekly CASTOR reports linked above.

Rootd and xrootd

16 Feb 2009: 11 disk servers with xrootd for ALICE installed and working correctly.
18 Nov 2008: Rootd fully deployed for LHCb.

SRMv2

27 Nov: SRM 2.7-10 and 2.7-15 services in production in parallel with 1.3-27.
mid-Feb 09: SRM 1.3-27 discontinued.
10 Jan: SRM database updated to release 2.7-12.

CASTOR Info Provider

For past and future upgrades please check the weekly CASTOR reports linked above.

Tape Drives and Migration Policies

Completed Nov 08: All tape drives now running CASTOR 2.1.7-15 (64-bit).
Dec 08 - Jan 09: Testing new ATLAS tape families.
Dec 08 - Jan 09: Preparation and testing of changes to CMS tape families for custodial data.

Repack

A separate repack instance is now being installed, initially concentrating on repacking CMS tapes to high-density media.

CASTOR TESTING

Pre-prod instance

Being set up to stress-test 2.1.8 (see the sketch below).
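
At its simplest, stress testing the pre-prod stager means driving many concurrent client requests against it. The sketch below shows one way that could be scripted with parallel rfcp transfers; the CASTOR target directory, local test file and concurrency figures are illustrative assumptions, not the actual test plan.

  #!/usr/bin/env python
  """Minimal sketch: generate concurrent write load against a test stager by
  running many rfcp transfers in parallel.  The target directory, local file
  and concurrency values are assumptions, not the real test configuration."""

  import concurrent.futures
  import subprocess
  import uuid

  TARGET_DIR = "/castor/preprod.example/test/stress"  # hypothetical namespace area
  LOCAL_FILE = "/tmp/testfile"                        # pre-created local test file
  CONCURRENCY = 20                                    # simultaneous transfers
  TRANSFERS = 200                                     # total transfers to attempt

  def one_transfer(_):
      """Copy the local file to a uniquely named CASTOR path; return the exit code."""
      target = f"{TARGET_DIR}/{uuid.uuid4().hex}"
      return subprocess.call(["rfcp", LOCAL_FILE, target])

  if __name__ == "__main__":
      with concurrent.futures.ThreadPoolExecutor(max_workers=CONCURRENCY) as pool:
          results = list(pool.map(one_transfer, range(TRANSFERS)))
      failures = sum(1 for rc in results if rc != 0)
      print(f"{failures}/{TRANSFERS} transfers failed")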

Certification testbed

In progress: Being reconfigured to certify 2.1.8.
In progress: Test gridftp-internal with CASTOR 2.1.7-24.
In progress: Test new LSF configuration with CASTOR 2.1.7-24.
Done: Test vdqm2 with CASTOR 2.1.7-24.
Done: Test black and white list with CASTOR 2.1.7-24.

CASTOR ISSUES AND TOPICS

Resilience and availability

Autumn 2009: Review and improve deployment, disaster recovery, and backup.
Autumn 2009: Deploy redundant, load-balanced stagers.

Monitoring by users

  • http://castormon.gridpp.rl.ac.uk/atlas/
  • http://castormon.gridpp.rl.ac.uk/cms/
  • http://castormon.gridpp.rl.ac.uk/gen/
  • http://castormon.gridpp.rl.ac.uk/lhcb/
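
A quick availability check of these monitoring pages can be scripted against the URLs above; the sketch below is a minimal example using only the standard library, with the timeout chosen arbitrarily.

  #!/usr/bin/env python
  """Minimal sketch: check that the per-VO CASTOR monitoring pages listed above
  respond.  Only the URLs are taken from this page; everything else is generic."""

  import urllib.request

  MONITORING_PAGES = [
      "http://castormon.gridpp.rl.ac.uk/atlas/",
      "http://castormon.gridpp.rl.ac.uk/cms/",
      "http://castormon.gridpp.rl.ac.uk/gen/",
      "http://castormon.gridpp.rl.ac.uk/lhcb/",
  ]

  def check(url, timeout=10):
      """Return the HTTP status code, or the error string if the fetch fails."""
      try:
          with urllib.request.urlopen(url, timeout=timeout) as response:
              return response.getcode()
      except OSError as exc:
          return str(exc)

  if __name__ == "__main__":
      for url in MONITORING_PAGES:
          print(url, "->", check(url))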

General problem areas

Ongoing: Two problems are currently seen in the Job Manager, where it occasionally becomes unresponsive for 2-3 minutes: one delays jobs reaching the job manager from the stager, the other delays jobs reaching LSF. Under investigation.
Fixed: Big IDs: very large values were continuously being inserted into id2type. A workaround was released in 2.1.7-27 and Oracle have released a fix, which has now been applied (13/7/09). (See the monitoring sketch at the end of this section.)
Ongoing: Oracle unique constraint violations in the request handler; not understood, Oracle triggers added to gain more information. Seen while running ATLAS on standalone Oracle, so unrelated to RAC.
Fixed: Possible crosstalk between the ATLAS and LHCb stagers led to deletion of LHCb files in Aug 08. Synchronization was then turned off to avoid the possibility of further file deletion, and the problem did not recur with synchronization off. CERN released a hotfix in 2.1.7-26 that allows synchronization to be turned back on without the danger of further file deletions (2/6/09). The fix is now deployed with the 2.1.7-27 upgrade, and synchronization is turned on and running daily (15/7/09).
Ongoing: Migration performance: improvements made for CMS, to be deployed for ATLAS and LHCb after the 2.1.7 upgrade.
Ongoing: Still getting recurrent stuck recalls.
Resolved?: Problem with stuck disk2disk copies not seen in 2.1.7.
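
The "Big IDs" entry above was diagnosed from very large values accumulating in the stager's id2type table, so one simple way to watch for a recurrence is to track the largest id in that table. The sketch below assumes the cx_Oracle bindings, a hypothetical read-only account and DSN, and that id2type exposes a numeric id column; only the table name comes from the report above.

  #!/usr/bin/env python
  """Minimal sketch: watch for the 'big IDs' symptom by checking the largest
  value in the stager's id2type table.  The connection details and the column
  name ('id') are assumptions, not taken from the production schema."""

  import cx_Oracle  # Oracle client bindings; assumed to be installed

  # Hypothetical connection details for a read-only stager database account.
  DSN = cx_Oracle.makedsn("pluto-rac.example", 1521, "STAGERDB")

  # Rough threshold above which an id looks suspiciously large (assumption).
  THRESHOLD = 10**15

  def max_id2type_id():
      """Return the largest id currently stored in id2type."""
      conn = cx_Oracle.connect("readonly_user", "password", DSN)
      try:
          cur = conn.cursor()
          cur.execute("SELECT MAX(id) FROM id2type")
          (max_id,) = cur.fetchone()
          return max_id
      finally:
          conn.close()

  if __name__ == "__main__":
      max_id = max_id2type_id()
      print("largest id in id2type:", max_id)
      if max_id is not None and max_id > THRESHOLD:
          print("WARNING: id2type contains suspiciously large ids")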