RAL Tier1 CASTOR planning

Page no longer actively maintained.

Castor at RAL Status and Plans

STATUS

As of 9 Dec 2010:

  • Prod stager instances
    • cms: 2.1.9-6, SLC4-64
    • atlas: 2.1.9-6 + hotfix for database performance, SLC4-64
    • lhcb: 2.1.9-6, SL4-64
    • gen: 2.1.9-6, SLC4-64
    • repack: 2.1.8-14, SLC4-64
  • Prod nameservers: 2.1.8-3, SLC4-64 (2 hosts)
  • Prod tapeservers: 2.1.8-8, SLC4-64
  • VDQM2 servers (x2): 2.1.8-3
  • Prod SRMv2:
    • cms: 2.8-2 (database version 2.8-2, local ns: 2.1.8-17)
    • atlas: 2.8-6 (database version 2.8-6, local ns: 2.1.8-17)
    • lhcb: 2.8-2 (database version 2.8-2, local ns: 2.1.8-17)
    • alice/gen: 2.8-2 (database version 2.8-2, local ns: 2.1.8-17)
  • Oracle:
    • Test Oracle RAC (Vulcan) for pre-prod instance
    • Prod Oracle RAC-1 (Pluto): cms stager, cms srm, gen stager, gen(alice) srm, repack
    • Prod Oracle RAC-2 (Neptune): nameserver, lhcb stager, lhcb srm, atlas stager, atlas srm
  • Tape drives: 16 drives
  • Castor Information Provider (CIP):
    • 2.1.0.1
  • LSF:
    • cms: v7.0.2
    • atlas: v7.0.2
    • lhcb: v7.0.2
    • gen: v7.0.2
    • (different job slots/protocol functionality added; all instances now use HTTP instead of NFS for the job manager)
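
Keeping the version summary above current mostly comes down to asking each host which CASTOR packages it has installed. The sketch below is one minimal way to collect that over ssh; the host names and the castor* RPM naming pattern are illustrative assumptions rather than the actual production inventory.

  #!/usr/bin/env python
  """Minimal sketch: collect installed CASTOR RPM versions from a set of hosts
  over ssh.  Host names and the 'castor*' package pattern are assumptions."""

  import subprocess

  # Hypothetical host list; in practice this would come from the site inventory.
  HOSTS = ["stager-atlas.example", "stager-cms.example", "ns01.example"]

  def castor_packages(host):
      """Return the sorted list of castor* RPMs reported by 'rpm -qa' on a host."""
      result = subprocess.run(
          ["ssh", host, "rpm -qa 'castor*'"],
          capture_output=True, text=True, check=True,
      )
      return sorted(result.stdout.split())

  if __name__ == "__main__":
      for host in HOSTS:
          print(host)
          for pkg in castor_packages(host):
              print("   ", pkg)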

KEY DATES

VO tests, data challenges and key dates

The plans for each experiment are now on separate pages:

Oct 2009?: First data taking (ALICE, ATLAS, CMS, LHCb)

WEEKLY OPERATIONS

Weekly reports on the activities of the Tier1 CASTOR group can be found here.

CASTOR SERVICE SCHEDULES AND DEVELOPMENTS

Prod instance upgrades

For past and future upgrades please check the weekly CASTOR reports linked above.

Rootd and xrootd

16 Feb 2009: 11 disk servers with xrootd for ALICE installed and working correctly.
18 Nov 2008: Rootd fully deployed for LHCb.

SRMv2

27 Nov: SRM 2.7-10 and 2.7-15 services in production in parallel with 1.3-27.
mid-Feb 09: SRM 1.3-27 discontinued.
10 Jan: SRM database updated to release 2.7-12.

CASTOR Info Provider

For past and future upgrades please check the weekly CASTOR reports linked above.

Tape Drives and Migration Policies

Completed Nov 08: All tape drives now running CASTOR 2.1.7-15 (64-bit).
Dec 08 - Jan 09: Testing new ATLAS tape families.
Dec 08 - Jan 09: Preparation and testing of changes to CMS tape families for custodial data.

Repack

A separate repack instance is now being installed, initially concentrating on repacking CMS tapes to high-density media.

CASTOR TESTING

Pre-prod instance

Being set up to stress-test 2.1.8 (see the sketch below).
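
At its simplest, stress testing the pre-prod stager means driving many concurrent client requests against it. The sketch below shows one way that could be scripted with parallel rfcp transfers; the CASTOR target directory, local test file and concurrency figures are illustrative assumptions, not the actual test plan.

  #!/usr/bin/env python
  """Minimal sketch: generate concurrent write load against a test stager by
  running many rfcp transfers in parallel.  The target directory, local file
  and concurrency values are assumptions, not the real test configuration."""

  import concurrent.futures
  import subprocess
  import uuid

  TARGET_DIR = "/castor/preprod.example/test/stress"  # hypothetical namespace area
  LOCAL_FILE = "/tmp/testfile"                        # pre-created local test file
  CONCURRENCY = 20                                    # simultaneous transfers
  TRANSFERS = 200                                     # total transfers to attempt

  def one_transfer(_):
      """Copy the local file to a uniquely named CASTOR path; return the exit code."""
      target = f"{TARGET_DIR}/{uuid.uuid4().hex}"
      return subprocess.call(["rfcp", LOCAL_FILE, target])

  if __name__ == "__main__":
      with concurrent.futures.ThreadPoolExecutor(max_workers=CONCURRENCY) as pool:
          results = list(pool.map(one_transfer, range(TRANSFERS)))
      failures = sum(1 for rc in results if rc != 0)
      print(f"{failures}/{TRANSFERS} transfers failed")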

Certification testbed

In progress: Being reconfigured to certify 2.1.8.
In progress: Test gridftp-internal with CASTOR 2.1.7-24.
In progress: Test new LSF configuration with CASTOR 2.1.7-24.
Done: Test vdqm2 with CASTOR 2.1.7-24.
Done: Test black and white list with CASTOR 2.1.7-24.

CASTOR ISSUES AND TOPICS

Resilience and availability

Autumn 2009: Review and improve deployment, disaster recovery, and backup.
Autumn 2009: Deploy redundant, load-balanced stagers.

Monitoring by users

  • http://castormon.gridpp.rl.ac.uk/atlas/
  • http://castormon.gridpp.rl.ac.uk/cms/
  • http://castormon.gridpp.rl.ac.uk/gen/
  • http://castormon.gridpp.rl.ac.uk/lhcb/
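
A quick availability check of these monitoring pages can be scripted against the URLs above; the sketch below is a minimal example using only the standard library, with the timeout chosen arbitrarily.

  #!/usr/bin/env python
  """Minimal sketch: check that the per-VO CASTOR monitoring pages listed above
  respond.  Only the URLs are taken from this page; everything else is generic."""

  import urllib.request

  MONITORING_PAGES = [
      "http://castormon.gridpp.rl.ac.uk/atlas/",
      "http://castormon.gridpp.rl.ac.uk/cms/",
      "http://castormon.gridpp.rl.ac.uk/gen/",
      "http://castormon.gridpp.rl.ac.uk/lhcb/",
  ]

  def check(url, timeout=10):
      """Return the HTTP status code, or the error string if the fetch fails."""
      try:
          with urllib.request.urlopen(url, timeout=timeout) as response:
              return response.getcode()
      except OSError as exc:
          return str(exc)

  if __name__ == "__main__":
      for url in MONITORING_PAGES:
          print(url, "->", check(url))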

General problem areas

Ongoing: Two problems are currently seen in the Job Manager, where it occasionally becomes unresponsive for 2-3 minutes: one delays jobs reaching the job manager from the stager, the other delays jobs reaching LSF. Under investigation.
Fixed: Big IDs: very large values were continuously being inserted into id2type. A workaround was released in 2.1.7-27 and Oracle have released a fix, which has now been applied (13/7/09). (See the monitoring sketch at the end of this section.)
Ongoing: Oracle unique constraint violations in the request handler; not understood, Oracle triggers added to gain more information. Seen while running ATLAS on standalone Oracle, so unrelated to RAC.
Fixed: Possible crosstalk between the ATLAS and LHCb stagers led to deletion of LHCb files in Aug 08. Synchronization was then turned off to avoid the possibility of further file deletion, and the problem did not recur with synchronization off. CERN released a hotfix in 2.1.7-26 that allows synchronization to be turned back on without the danger of further file deletions (2/6/09). The fix is now deployed with the 2.1.7-27 upgrade, and synchronization is turned on and running daily (15/7/09).
Ongoing: Migration performance: improvements made for CMS, to be deployed for ATLAS and LHCb after the 2.1.7 upgrade.
Ongoing: Still getting recurrent stuck recalls.
Resolved?: Problem with stuck disk2disk copies not seen in 2.1.7.
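
The "Big IDs" entry above was diagnosed from very large values accumulating in the stager's id2type table, so one simple way to watch for a recurrence is to track the largest id in that table. The sketch below assumes the cx_Oracle bindings, a hypothetical read-only account and DSN, and that id2type exposes a numeric id column; only the table name comes from the report above.

  #!/usr/bin/env python
  """Minimal sketch: watch for the 'big IDs' symptom by checking the largest
  value in the stager's id2type table.  The connection details and the column
  name ('id') are assumptions, not taken from the production schema."""

  import cx_Oracle  # Oracle client bindings; assumed to be installed

  # Hypothetical connection details for a read-only stager database account.
  DSN = cx_Oracle.makedsn("pluto-rac.example", 1521, "STAGERDB")

  # Rough threshold above which an id looks suspiciously large (assumption).
  THRESHOLD = 10**15

  def max_id2type_id():
      """Return the largest id currently stored in id2type."""
      conn = cx_Oracle.connect("readonly_user", "password", DSN)
      try:
          cur = conn.cursor()
          cur.execute("SELECT MAX(id) FROM id2type")
          (max_id,) = cur.fetchone()
          return max_id
      finally:
          conn.close()

  if __name__ == "__main__":
      max_id = max_id2type_id()
      print("largest id in id2type:", max_id)
      if max_id is not None and max_id > THRESHOLD:
          print("WARNING: id2type contains suspiciously large ids")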