RAL Tier1 CASTOR Experiments Completed Actions 2007

From GridPP Wiki
Revision as of 11:54, 13 January 2009 by James thorne (Talk | contribs)

Actions from the RAL Tier1 CASTOR Experiments Actions page that were closed in 2007.

Action ID Priority Experiment(s) Owner Description Status Completed date
A-20070425-01 ATLAS Matt Hodges Discussion on disk allocation needs to be concluded. 02/05/2007
A-20070425-02 ATLAS Matt Hodges Stop jobs immediately; add new disk servers after upgrade on 30 April/1 May. Need to have disk allocation details (action A-20070425-01) by Friday 27/4/2007. 02/05/2007
A-20070425-03 ATLAS Chris Kruk Check provisioning for LSF licences.
A-20070425-04 All Matt Hodges Create a table on the wiki showing which disk servers are in which CASTOR pools. See here 02/05/2007
A-20070425-05 All Matt Hodges Send SSH keys to Chris Kruk. 02/05/2007
A-20070425-06 ATLAS, LHCb Matt Hodges Split ATLAS/LHCb shared disk pool. 02/05/2007
A-20070425-08 LHCb, CMS(?) Raja Nandakumar Email Dave Newbold a copy of LHCb test plans. 02/05/2007
A-20070502-01 LHCb Raja Nandakumar Ensure that files on list sent by Derek Ross are not requested from castor tape. 09/05/2007
A-20070502-02 All David Corney Ensure that the experiments are kept "in the loop" regarding RAL plans for castor configuration change. 09/05/2007
A-20070509-02 All James Thorne Provision 26 servers for CASTOR. 16/05/2007
A-20070509-05 CMS Simon Metson Arrange a meeting with Bonny and Shaun in Bristol. 16/05/2007
A-20070509-01 All Shaun Change service class replication for all experiments. On hold until production platform working 23/05/2007
A-20070523-03 CMS Simon Metson Check on key files prior to CMS file deletion. Can probably delete files; CMS are now checking all key files are on tape. CMS data management system needs syncing. 13/06/2007
A-20070523-04 All Tim Folkes Update Bonny Strong on repack to ensure progression while Tim is on leave. 13/06/2007
A-20070523-05 ATLAS Matt Hodges Provision new disk pools for ATLAS tests. CASTOR config pending.
A-20070531-01 All Shaun De Witt, Bonny Strong Need plan to co-ordinate testing of re-install by the experiments. 13/06/2007
A-20070516-01 All Nick White Introduce monitoring to detect "out of memory" on disc servers. Needs highlighting in Nagios; if OOM starts then it is probably too late. It looks like a low memory problem. Not related to no. of jobs (only 30 per host), may be related to GridFTP. David Corney to ask Martin Bly to contact Nick White. Refer to issue.
A-20070523-01 All Shaun De Witt, Bonny Strong Revised plan for separate instances and upgrade for discussion next week to clarify experiment needs, especially ATLAS and LHCb. Include integration with SRM 2.2 endpoint. Bonny to push hardware install for ATLAS so that it's about 2 weeks behind CMS for 2.1.3. James Thorne working with Cheney to get ATLAS machines set up.
A-20070613-01 CMS Shaun De Witt Email Dave Newbold description of SQL (from shell) problem for forwarding onto Dave's SQL contacts. 20/06/2007
A-20070613-03 All David Corney, James Thorne Put plan to show timetable of roll out of 2.1.3 CASTOR instances on wiki. See the RAL Tier1 CASTOR 2.1.3 Roll Out page. 20/06/2007
A-20070620-01 ATLAS Bonny Strong, Matt Hodges, Martin Bly Review (re-)deployment of ATLAS disk servers for dCache. CASTOR group have two servers that may be available. 27/06/2007
A-20070620-03 LHCb Bonny Strong, Martin Bly Meet to discuss hardware requirements for LHCb 2.1.3 instance. Meeting on Monday 25/06/2007? 26/06/2007
A-20070620-04 ATLAS Bonny Strong Open tape SRM (d0t1) for ATLAS 27/06/2007
A-20070627-02 All Matt Hodges, Shaun De Witt Update/review disk server wiki page. Done. 04/07/2007
A-20070509-03 CMS Simon Metson Investigate use of "WAN out". Review at meeting in Bristol on 22/05. Ongoing, some sort of PhEDEx problem. Bonny talking to Simon. Fixed.
A-20070523-02 CMS Bonny Strong Set up CMS pool for temporary files. Ongoing. Don't garbage collect small files; d0t0. This needs to be done before production use.
A-20070704-04 All James Thorne Link to SRM endpoint information from the meeting page. Done. Link to the RAL Tier1 CASTOR SRM page added. 11/07/2007
A-20070704-05 All Bonny Strong Discussion needed about using different tape pools for different ns paths (e.g. to group all data in a year, or by type). Questions about castor-specific meaning of pathnames, how other sites would be affected, and if this should/could be handled by storage tokens in SRMv2. Should include Shaun de Witt. Site specific. 11/07/2007
A-20070711-03 All James Thorne Link from the wiki page to James Jackson's CMS testing documentation (RAL Tier1 CASTOR CMS Testing). Done. 12/07/2007
A-20070425-07 All Andrew Sansum Put hardware thoughts and discussion on the wiki. Done 25/07/2007
A-20070509-04 All Andrew Sansum Re-visit disk server network tuning. Wait until CASTOR source(?) is stable for days Done 25/07/2007
A-20070620-02 ATLAS Bonny Strong, Stephen Burke Plan ATLAS 2.1.3 instance testing. Done 25/07/2007
A-20070627-01 ATLAS Shaun De Witt Discuss with ATLAS the effect of moving to shorter paths and whether to keep the longer paths for permanent data. Done 25/07/2007
A-20070704-01 CMS I-20070704-02, James Jackson Run functionality tests for CMS 2.1.3 instance. Also look at timings (I-20070704-04). Done 25/07/2007
A-20070704-02 ATLAS Shaun De Witt, Stephen Burke Discuss moving disk servers from 2.1.2 to new 2.1.3 instance. Done 25/07/2007
A-20070711-01 All Chris Kruk Co-ordinate with David Edoh to apply hot fixes for Oracle. Done 25/07/2007
A-20070711-02 All Andrew Sansum See also A-20070509. Meet with James Jackson and Nick White to discuss disk server tuning. Done 25/07/2007
A-20070711-04 All Chris Kruk Create/update SRM wiki page. Done 25/07/2007
A-20070711-08 ATLAS Shaun De Witt Send Brian Davies some hints on testing. Done 25/07/2007
A-20070711-11 All Shaun De Witt Deploy SRM2 Done 25/07/2007
A-20070704-03 ATLAS Stephen Burke Inform Bonny of decision on whether to move d1 files to 2.1.3 production instance. Done 08/08/2007
A-20070711-09 ATLAS Brian Davies Send out test results when ATLAS has some figures. Done 08/08/2007
A-20070711-10 ATLAS Shaun De Witt, Stephen Burke Discussion on how we progress to SRM2. For discussion at 1st Aug meeting Done 08/08/2007
A-20070725-02 CMS James/Simon Metson Agree date with CMS for formal switch on of production system. Done 08/08/2007
A-20070725-03 ATLAS Stephen Burke Decide when to switch d1t0 end point to 2.1.3 and inform CASTOR team at RAL ASAP, as test machine upgrade is waiting on this decision. Done 08/08/2007
A-20070711-05 All David Corney Review CASTOR@RAL issues list. Let James Thorne have the updated list for the wiki. 22/08/2007
A-20070725-01 ATLAS Shaun Phone Miguel to find out when T0 testing starts for ATLAS. Done 22/08/2007
A-20070808-01 All Bonny Set up new SA path for BDII for RAL SRM. Done 22/08/2007
A-20070808-02 All David Publish Experiment test dates via links on this wiki. Done 22/08/2007
A-20070711-06 CMS Dave Newbold Script to delete test data. Ongoing 29/08/2007
A-20070829-01 ATLAS, LHCb David Corney Contact ATLAS and LHCb to ensure disc tuning can be scheduled ASAP and before Nick White goes on leave. Closed 05/09/2007
A-20070822-03 All Matt Hodges Tune FTS Ongoing. Done for CMS. Closed, see A-20070905-06. 12/09/2007
A-20070905-03 All Dave Newbold, Shaun De Witt Check RFIO timeout for A-20070905-01. Timeout is 300s. Closed. 12/09/2007
A-20070905-08 CMS Bonny Strong Arrange meeting with Simon, Dave N., Shaun, Matt on Friday 7/9/2007 to discuss CMS transfer performance. Closed. 12/09/2007
A-20070905-09 All Bonny Strong, Shaun De Witt Check with developers regarding "prepare to get" being scheduled. Prepare to get is being scheduled. Closed. 12/09/2007
A-20070905-10 All Tim Folkes Provide Matt with tape usage figures. Done, closed. 12/09/2007
A-20070711-07 All Simon Metson Look at logs from CSA06 to see if the transfer "dead time" was in evidence then. Ongoing. 26/09/2007
A-20070905-02 All Bonny Strong Run fix script for A-20070905-01 to check the impact on the databases. Script needs to run more frequently than RFIO timeout. Not automated until we have a better understanding of the problem. 26/09/2007
A-20070905-04 All Bonny Strong Provide name server hosts on wiki to mitigate against problems with nsrm command. Ongoing. 26/09/2007
A-20070905-07 ATLAS Shaun De Witt Chase up LSF tuning with ATLAS. Ongoing. 26/09/2007
A-20070912-01 CMS Bonny Strong Investigate upping LSF slots to 8. 26/09/2007
A-20070912-02 CMS Bonny Strong, Chris Brew Debug file transfers by tracking a file through the system. 26/09/2007
A-20070912-03 CMS Chris Kruk Check Dave Newbold's password to LSF web GUI. 26/09/2007
A-20070912-04 ATLAS Shaun De Witt Do LSF tuning for ATLAS (see also A-20070905-07). 26/09/2007
A-20070912-05 All Shaun De Witt Put something in Savannah re issue I-20070822-01. 26/09/2007
A-20070912-06 All Andrew Sansum Raise issue at PMB meeting. 26/09/2007
A-20070926-01 All Shaun De Witt When is the real target date for production deployment of SRM 2.2? Answer next week. 10/10/2007
A-20071003-03 ATLAS Shaun De Witt Check dates of ATLAS M5. 22nd Nov. 10/10/2007
A-20071003-04 ATLAS Catalin Condurache Raise participation in ATLAS M4 reprocessing on the ATLAS UK mailing list. Problems with M4 reprocessing software mean M4 reprocessing is delayed. 17/10/2007
A-20071017-04 All Shaun De Witt, Bonny Strong Set up test for internal CASTOR gridFTP v2. Being done by CERN. 24/10/2007
A-20071003-02 CMS James Jackson, Bonny Strong Investigate processing of info from the log archives (A-20071003-01) Being treated along with A-20071003-01 now so closing this one. 24/10/2007
A-20070711-12 CMS Andrew Sansum Move CMS WANout disk server to new route to JANET. Waiting for network people. CMS would like this done before 24/9/2007. Andrew has asked networking for this to be done before Tuesday. Ongoing, escalated. Should be testing by next meeting (20071107). Using gdss128 to test. 07/11/2007
A-20071003-01 CMS James Jackson, Bonny Strong Investigate archiving and processing of log information. James has a useful tool set for this now. James will support other VOs who want to use the CMS stuff. Each VO will need access to stager and LSF logs. 07/11/2007
A-20071017-02 CMS Bonny Strong Reduce by one the number of tape drives for CMS WanInTest. Done but WanIn didn't work. Bonny to change file class for load test. 07/11/2007
A-20071024-01 All Shaun De Witt Ask David Edoh to look at database performance before and after upgrade. 07/11/2007
A-20071024-02 ATLAS Brian Davies Try to get files out of CASTOR post database upgrade. 07/11/2007
A-20071024-03 CMS James Jackson Document logging tools. 07/11/2007
A-20070905-01 All Bonny Strong Put fix in place for recalls problem. Ongoing. Recall problem went away for ATLAS when connections to database were increased. Monitoring. Closed as the issue is being monitored, see issue list. 07/11/2007
A-20070613-02 CMS Shaun De Witt, Bonny Strong, Dave Newbold Set up discussion on service classes and tape pool mapping After CSA07. Done. 14/11/2007
A-20071003-05 All Shaun De Witt Send SRM deadtime issue to CERN Waiting for a second set of graphs from Brian. Done. 14/11/2007
A-20071024-04 All Bonny Strong, experiment reps. Discuss possibility of cross-experiment testing. Done. 14/11/2007
A-20071107-01 LHCb Derek Ross Change LHCb SA paths in BDII. Done. 14/11/2007
A-20070822-01 ATLAS, LHCb Catalin Condurache Plan ATLAS disk usage and dCache -> CASTOR migration. Plan for LHCb too. ATLAS have 84TB on disk and LHCb have 70TB on disk. 120 MB/s (~10 TB/day) with 8 streams (8 parallel globus URL copies), max. 50 MB/s per server. This is without any checksumming. Need to look at tape to tape migration too. 28/11/2007
A-20070822-02 All I-20070620-01, Bonny Strong Write wiki How-to on deleting files from CASTOR. Done 28/11/2007
A-20070905-05 All Andrew Sansum Put tuning information on wiki, including ext3 journalling options. Andrew has asked Nick White to do this. Done. On the wiki at RAL Tier1 Disk Server Tuning. 28/11/2007
A-20070905-06 ATLAS Matt Hodges Determine FTS tuning requirements for ATLAS in the same way as CMS. See also A-20070822-03. Done 28/11/2007
A-20071010-01 ATLAS Bonny Strong Stuck recalls for ATLAS are not being monitored/checked; need a tool to do this. Gone away in 2.1.4. 28/11/2007
A-20071017-03 ATLAS Shaun De Witt Ask Miguel to open up ATLAS T0 tests to repack instance. Shaun asked Miguel. Done. 28/11/2007
A-20071114-01 All Shaun De Witt Put a Request For Enhancement into Savannah for the ability to dedicate tape drives to a service class. Done. 28/11/2007
A-20071114-02 ATLAS, CMS Matt Hodges Send a broadcast announcement for the ATLAS and CMS CASTOR downtime. Done. 28/11/2007
A-20071128-01 All David Corney Follow up with Jens regarding dynamic/daily publishing of space information from SRMv2. 12/12/2007
A-20071128-02 All Matt Hodges Move inbound FTS channels to Globus URL copy. 12/12/2007
A-20071017-05 All Derek Ross, Jens Jensen Publish GLUE Schema for SRM2.2. Ongoing. Done for srmf. 19/12/2007
A-20071128-04 ALICE Matt Hodges Allocate disk servers for ALICE. Depends on A-20071128-05. 19/12/2007
A-20071128-05 ALICE Catalin Condurache Engage ALICE in discussion on xrootd/CASTOR Inform ALICE when test setup ready. 19/12/2007
A-20071212-01 All Derek Ross Update space token documentation on wiki 19/12/2007
A-20071212-03 All David Corney Draft a list of "top ten" Tier1 CASTOR issues and circulate to the experiments. 19/12/2007
A-20071212-04 ALICE Shaun De Witt Set up ALICE SRM/CASTOR/xrootd test bed. 19/12/2007
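
The migration figures quoted in action A-20070822-01 (120 MB/s, roughly 10 TB/day, with 84TB for ATLAS and 70TB for LHCb) can be checked with a short calculation. The sketch below is illustrative only: it assumes decimal units (1 TB = 1,000,000 MB), a rate sustained around the clock, and each experiment migrated on its own at the full measured rate. The function and variable names are hypothetical and do not come from any CASTOR or dCache tool.

```python
# Rough check of the dCache -> CASTOR migration figures in A-20070822-01.
# Assumptions (not from the source): decimal units, a constant sustained
# rate, and one experiment migrated at a time.

SECONDS_PER_DAY = 24 * 60 * 60
MB_PER_TB = 1_000_000  # decimal terabyte

# Figures quoted in the action.
RATE_MB_PER_S = 120
DISK_VOLUMES_TB = {"ATLAS": 84, "LHCb": 70}


def tb_per_day(rate_mb_per_s: float) -> float:
    """Convert a sustained rate in MB/s to TB/day."""
    return rate_mb_per_s * SECONDS_PER_DAY / MB_PER_TB


def days_to_migrate(volume_tb: float, rate_mb_per_s: float) -> float:
    """Days needed to move volume_tb at a sustained rate_mb_per_s."""
    return volume_tb / tb_per_day(rate_mb_per_s)


if __name__ == "__main__":
    print(f"{RATE_MB_PER_S} MB/s is {tb_per_day(RATE_MB_PER_S):.2f} TB/day")
    for experiment, volume in DISK_VOLUMES_TB.items():
        days = days_to_migrate(volume, RATE_MB_PER_S)
        print(f"{experiment}: {volume} TB takes about {days:.1f} days")
```

Under these assumptions 120 MB/s works out to a little over 10 TB/day, consistent with the figure in the action, and each experiment's disk data would take roughly a week to copy. Checksumming, which the action notes was not included in the measurement, would be expected to lengthen this.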