RAL Tier1 CASTOR planning
From GridPP Wiki
Page no longer actively maintained.
Castor at RAL Status and Plans
STATUS
As of 9 Dec 2010:
- Prod stager instances
- cms: 2.1.9-6, SLC4-64
- atlas: 2.1.9-6 + hotfix for database performance, SLC4-64
- lhcb: 2.1.9-6, SLC4-64
- gen: 2.1.9-6, SLC4-64
- repack: 2.1.8-14, SLC4-64
- Prod nameservers: 2.1.8-3, SLC4-64 (2 hosts)
- Prod tapeservers: 2.1.8-8, SLC4-64
- VDQM2 servers (x2): 2.1.8-3
- Prod SRMv2:
- cms: 2.8-2 (database version 2.8-2, local ns: 2.1.8-17)
- atlas: 2.8-6 (database version 2.8-6, local ns: 2.1.8-17)
- lhcb: 2.8-2 (database version 2.8-2, local ns: 2.1.8-17)
- alice/gen: 2.8-2 (database version 2.8-2, local ns: 2.1.8-17)
- Oracle:
- Test Oracle RAC (Vulcan) for pre-prod instance
- Prod Oracle RAC-1 (Pluto): cms stager, cms srm, gen stager, gen(alice) srm, repack
- Prod Oracle RAC-2 (Neptune): nameserver, lhcb stager, lhcb srm, atlas stager, atlas srm
- Tape drives: 16 drives
- Castor Information Provider (CIP):
- 2.1.0.1
- LSF:
- cms: v7.0.2
- atlas: v7.0.2
- lhcb: v7.0.2
- gen: v7.0.2
- (different job slots/protocol functionality added; all instances now use HTTP instead of NFS for the job manager)
KEY DATES
VO tests, data challenges and key dates
The plans for each experiment are now on separate pages:
When? | Who? | What? |
---|---|---|
Oct 2009? | ALICE, ATLAS, CMS, LHCb | First data taking |
WEEKLY OPERATIONS
Weekly reports of the activities of the Tier1 CASTOR group can be found here.
CASTOR SERVICE SCHEDULES AND DEVELOPMENTS
Prod instance upgrades
For past and future upgrades please check the weekly CASTOR reports linked above.
Rootd and xrootd
Dates | Activity |
---|---|
16 Feb 2009 | 11 disk servers with xrootd for Alice installed and working correctly. |
18 Nov 2008 | Rootd fully deployed for lhcb. |
SRMv2
Dates | Activity |
---|---|
27 Nov | SRM 2.7-10 and 2.7-15 services in production in parallel with 1.3-27 |
mid-Feb 09 | SRM 1.3-27 discontinued |
10 Jan | SRM Database updated to release 2.7-12 |
CASTOR Info Provider
For past and future upgrades please check the weekly CASTOR reports linked above.
Tape Drives and Migration Policies
Dates | Activity |
---|---|
Completed Nov 08 | All tape drives now running castor 2.1.7-15 with 64-bit |
Dec 08 - Jan 09 | Testing new atlas tape families |
Dec 08 - Jan 09 | Preparation and testing of changes to CMS tape families for custodial data |
Repack
A separate repack instance is now being installed, initially concentrating on repacking CMS tapes to high density.
CASTOR TESTING
Pre-prod instance
Being set up to stress test 2.1.8.
Certification testbed
Dates | Activity |
---|---|
In progress | Being reconfigured to certify 2.1.8 |
In progress | Test gridftp-internal with castor 2.1.7-24 |
In progress | Test new LSF configuration with castor 2.1.7-24 |
Done | Test vdqm2 with castor 2.1.7-24 |
Done | Test black and white list with castor 2.1.7-24 |
CASTOR ISSUES AND TOPICS
Resilience and availability
Dates | Activity |
---|---|
Autumn 2009 | Review and improve deployment, disaster recovery, and backup |
Autumn 2009 | Deploy redundant, load-balanced stagers |
Monitoring by users
- http://castormon.gridpp.rl.ac.uk/atlas/
- http://castormon.gridpp.rl.ac.uk/cms/
- http://castormon.gridpp.rl.ac.uk/gen/
- http://castormon.gridpp.rl.ac.uk/lhcb/
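The per-VO monitoring pages above can be polled to confirm they are responding. This is a minimal sketch, not an official tool; the URLs come from this page, while the polling approach, timeout, and function names are illustrative assumptions.

```python
# Sketch: poll the per-VO CASTOR monitoring pages and report which respond.
# Base URL and VO names are taken from this page; everything else is an
# illustrative assumption.
from urllib.request import urlopen
from urllib.error import URLError

BASE = "http://castormon.gridpp.rl.ac.uk"
VOS = ["atlas", "cms", "gen", "lhcb"]

def monitoring_urls(base=BASE, vos=VOS):
    """Return the monitoring URL for each VO instance."""
    return {vo: f"{base}/{vo}/" for vo in vos}

def check(url, timeout=5):
    """Return True if the page answers with HTTP 200, False otherwise."""
    try:
        with urlopen(url, timeout=timeout) as resp:
            return resp.status == 200
    except (URLError, OSError):
        return False

if __name__ == "__main__":
    for vo, url in monitoring_urls().items():
        print(vo, "OK" if check(url) else "UNREACHABLE")
```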
General problem areas
Dates | Activity |
---|---|
Ongoing | Two problems currently seen in the Job Manager, where it occasionally becomes unresponsive for 2-3 minutes: one causes delays for jobs reaching the job manager from the stager, the other delays for jobs reaching LSF. Under investigation. |
Fixed | Big IDs: very big values were continuously being inserted in id2type. A workaround was released in 2.1.7-27, and Oracle released a fix, which has now been applied (13/7/09) |
Ongoing | Oracle unique constraint violations in request handler; not understood, oracle triggers added to gain more info. Seen while running atlas on standalone Oracle, so unrelated to RAC. |
Fixed | Possible crosstalk between the atlas and lhcb stagers led to deletion of lhcb files in Aug 08. Synchronization was turned off to avoid further file deletions, and the problem has not recurred with it off. CERN released a hotfix in 2.1.7-26 that enables us to turn synchronization back on without the danger of further file deletions (2/6/09). The fix is now deployed with the 2.1.7-27 upgrade, and synchronization is turned on and running daily (15/7/09) |
Ongoing | Migration performance - improvements made on CMS; to be deployed for atlas and lhcb after 2.1.7 upgrade |
Ongoing | Still get recurrent stuck recalls |
Resolved? | Problem with stuck disk2disk copies not seen in 2.1.7 |