RAL Tier1 CASTOR Experiments Completed Actions 2010

From GridPP Wiki
Revision as of 13:56, 5 January 2011 by James thorne (Talk | contribs)


Actions from RAL Tier1 Experiments Liaison Meeting closed in 2010.

Action ID Priority Experiment(s) Owner Description Status Completed date
20081001-05 ATLAS Andrew Sansum Case by case basis for job limits - meet with Roger. Removed as there are other actions covering limits. 2008-11-05
20081001-01 ATLAS Derek Ross Need to know memory configuration; memory limits and kill policy. Discussion with Graeme Stewart. Wiki page requested. Done. 2008-12-03
20081001-02 Derek Ross Look at queue limits. Awaiting document/wiki page from action 20081001-01. Done. 2008-12-03
20081105-01 ILC Matt Hodges Speak to Glenn Patrick about upping the priority of the ILC disk deployment. Done. 2008-12-03
20081105-02 ILC Brian Davies Ensure CASTOR team is aware that ILC space token setup is fairly urgent. Done. 2008-12-03
20081203-02 All James Thorne Consider combining items 1 and 2 on the agenda. Done. 2009-01-07
20081203-03 All Brian Davies Ensure relevant Tier1 staff are aware that CASTOR is not "VOMS aware" and bear this in mind when resolving user problems. Closed. 2009-01-07
20081203-04 ILC Martin Bly Stop ILC jobs in the batch system before 09:00 on 4/12/2008 while Chris Brew updates dCache at RALPP. Closed. 2009-01-07
20081001-03 ATLAS Catalin Condurache Liaise with ATLAS to arrange to get job recovery working. Discussion held and can be implemented. Can put recovery area on worker nodes or have common area on a disk server, maybe the software area (depends on the load). Ongoing. 2009-02-04
20081001-04 ATLAS Catalin Condurache Permissions check on software areas. Some sites have wrong permissions. Ongoing. 2009-02-04
20090107-01 ALICE Gareth Smith Follow up ALICE disk quota issues with Cristina via email. Closed. 2009-02-04
20081203-01 All Gareth Smith Clarify use and purpose of CASTOR-PP and csf-l mailing lists. Gareth suggests closing the lists and using the EGEE broadcast system. Closed. 2009-03-04
20090204-02 All Martin Bly Report at the next meeting the time scale for moving to 64-bit and/or SL5 worker nodes. Closed. 2009-03-04
20090107-02 ATLAS Brian Davies Co-ordinate solution for ATLAS hot files such as conditions data. Graeme suggests something called "pCache". Closed. 2009-05-13
20090204-01 All Gareth Smith Determine actions required before starting a CASTOR downtime (e.g. stopping batch queues). No response from some experiments. Closed. 2009-05-13
20090304-02 CMS Chris Brew Look at whether the RAL Tier1 can take extra data sets on behalf of ASGC. No longer relevant. Closed. 2009-05-13
20090107-03 CMS, ALICE Matt Hodges Liaise with CMS and ALICE regarding software server installations. Ongoing. CMS using theirs. Done. 2009-06-03
A-20090311-03 High All James Thorne Deploy scheduled verifies on disk servers by mid-May. Will be verifying 1/5 of a disk pool at any one time. Deployed on non-production CASTOR machines and on non-CASTOR machines. Rest will follow in July, after the move. Closed. 2009-06-17
A-20090506-03 Medium All Experiment Reps. Provide feedback and comments on Gareth's draft table specifying the actions to be taken when closing batch queues ahead of CASTOR interventions (circulated to the list on 6 May). Feedback received from most experiments. Closed. 2009-06-17
A-20090521-01 High LHCb James Thorne, Matt Viljoen Track down status of LHCb server (gdss160) with problems. Machine is back in production. Closed. 2009-06-17
A-20090527-01 Medium ALICE Lee Barnby Confirm ALICE's plans for STEP09. Done, closed. 2009-06-17
A-20090527-02 Medium All Brian Davies Distribute URLs for the VO CASTOR monitoring pages to the experiment reps. Done, closed. 2009-06-17
20090603-03 All Martin Bly Review planned dates for site network outage (23 and 30 June). Dates are now 7 and 14 July which are OK (ish). Closed. 2009-06-17
20090603-05 n/a Andrew Sansum Encourage Jeremy Coles to attend the meeting. Jeremy is attending the meeting, closed. 2009-06-17
20090603-01 All Derek Ross Take BDII issue back to the dteam. Discussed by dteam. 2009-06-24
20090603-04 All Brian Davies Create a page explaining what the castormon plots mean. Closed, put titles on plots rather than creating a separate page. 2009-06-24
20090617-01 High All Matt Hodges Prepare plan for move to SL5 WNs and present it at the meeting on 24 June. Done 2009-06-24
20090624-01 High All Gareth Smith Confirm with Networking whether the site network intervention on 7 July will involve a break in connectivity or just an "at risk". Closed. 2009-07-08
20090624-02 High All Andrew Sansum Decide priorities for Tier1 update window by 1 July. Closed. 2009-07-15
20090624-03 Medium CMS Chris Brew Send link to CMS pre-staging info to meeting list. Closed. 2009-07-15
20090624-05 Medium LHCb Shaun de Witt Test whether upping the LSF job cap to 600 causes a performance problem. Decided not necessary. Closed. 2009-07-15
20090603-02 Medium ATLAS Brian Davies Circulate link to twiki page containing out of hours contact details for ATLAS shifters. Closed. 2009-07-29
20090624-06 High All Matt Hodges Publish SL5 plan on the Tier1 blog and seek input from the experiments. Ongoing, publish plans on blog before 29 July. Closed. 2009-07-29
20090722-02 Medium LHCb Raja Nandakumar Send to the Tier1 the job IDs of some jobs that are not having job info (mem use etc.) added to the pbs database. Closed. 2009-07-29
20090513-01 Medium All Martin Bly Contact VO contact list to determine VO hardware requirements for next procurement. Martin to forward email sent to HAG to Andrew for distribution to the UB. Closed. 2009-08-06
20090624-07 Medium All James Thorne Locate a new meeting room with better facilities. Ongoing, trying the Access Grid room and EVO on 29 July. Closed. 2009-08-06
20090715-02 Medium ATLAS Brian Davies Follow up with Tim Folkes regarding the ability to recall entire tapes; ATLAS thinks this will be more efficient than recalling individual files. It is possible. Closed. 2009-08-06
20090722-01 High ATLAS Matt Hodges Look at rescheduling the LFC RAC migration to August (in Catalin's absence) and ensure the rest of the grid team are familiar with the migration procedure. Meeting on 30 July with ATLAS. Closed. 2009-08-06
20090722-04 Medium All Experiment reps. Let Martin Bly know how long their jobs may need to exceed their memory limits. 10 mins? 20 mins? Closed. 2009-08-06
20090722-05 Medium All Derek Ross Circulate the location of the PBS info to the meeting list. Done. 2009-08-19
20090806-01 High All Andrew Sansum Flag the planned dates (September) for migration to SL5 to PMB and MB. Closed 2009-08-19
20090826-02 Medium MICE David Corney Mail Paul Kyberd and Henry to encourage them to present and discuss their plans at the weekly liaison meeting. Done 2009-09-16
20090826-01 Medium All Shaun de Witt Suggest to Tim that we should repack the tapes that may have been affected by the water ingress into the tape robot. This is to ensure that the data is on known clean tapes. Shaun spoke to Tim and Tim will have stats on tape damage, if any, in the next week. Closed. 2009-09-16
A-20090225-02 Low ILC Chris Kruk Speak to Dennis at CERN to try and find out if there's a way to implement "fairshare" in CASTOR (for the gen instance). Closed. 2009-09-30
A-20090429-02 Medium All Gareth Smith Review our procedure for notifying VOs of broken disk servers. Ongoing. Review of disk server intervention procedure on 1 October. Closed. 2009-10-14
20090715-01 Medium LHCb James Thorne Determine why Ganglia graphs for Storage_LHCb stopped after the CASTOR upgrade and reboot of disk servers. Ongoing. Believed to be caused by the data source machines being started after the rest of the machines in the cluster. Need to document start order. The start order was a red herring. One of the Storage_LHCb cluster data sources had become an ATLAS server and so ganglia was "confused". Closed. 2009-10-14
20090923-01 Medium MICE Shaun de Witt Send Henry link to SRM download site. Closed. 2009-10-14
20090923-03 Medium CMS Shaun de Witt Investigate gridFTP "marker" timeouts with CASTOR@RAL to dCache site transfers. Done. 2009-11-04
20090930-01 Low All Matthew Viljoen Explore possibilities for "fair share" in CASTOR (e.g. extra instances). Done. 2009-11-04
20090204-03 Medium All Alastair Dewhurst Ask Mingchao to perform a security audit of all experiment software areas. This follows up from action 20081001-04 (closed). LHC software areas done. Ongoing. Alastair has written a Nagios check. Done. 2010-01-13
A-20090506-04 High All Andrew Sansum Investigate disk server failure rates. Investigation started. Ongoing. Have something from Ian Collier by Christmas 2009. Closed. 2010-01-20
20091111-01 Medium T2K Shaun de Witt Investigate the possibility of setting up T2K to test file transfers on the gen instance. Closed. 2010-01-20
20091202-01 Medium All Martin Bly, Fabric team Prepare summary of disk failure rates (action A-20090506-04) for an "item 5" presentation. Presentation on January 20th. Closed. 2010-01-20
20090722-03 Medium All Alastair Dewhurst Gather a list of "requirements questions" for the experiments and circulate to the meeting list before the next meeting on 29 July. With Martin Bly, Matt Viljoen, Alastair Dewhurst and Derek Ross. Ongoing. Alastair has drafted something for ATLAS and passed on to CMS to see if it can be applied to them as well. Alastair will also pass it on to LHCb. Raja working with Alastair on LHCb requirements. Presentation given on 10/2/2010 2010-02-10
20100203-01 Medium All James Thorne Ask Martin for an update on 2009 CPU tender deliveries. For meeting on 10/2/2010. Martin gave update at meeting. 2010-02-10
20090304-01 Medium ATLAS Alastair Dewhurst Review and understand what maui, torque and linux are doing regarding memory limits when killing ATLAS batch jobs. Ongoing. Is memory calculation including paged memory? Still a problem in SL5. Monitoring of jobs now in place. Closed. 2010-02-24
20100210-01 Medium All Shaun de Witt Ask Matt Viljoen to set up discussion on upgrade plans for CASTOR 2.1.9. Closed. 2010-02-24
20100217-01 Medium All Matt Hodges, James Thorne Clarify first deployment date for Viglen 08 kit. Report back on 2010-02-24. Closed. 2010-02-24
20100224-01 Medium All Gareth Smith Chase up status of backup OPN link commissioning. "The installation of the backup OPN link is expected to be complete by 31 March 2010, and it to be brought into service through April." Information provided by Robin Tasker. Closed. 2010-03-03
20100303-01 Medium CMS Shaun de Witt Check on status of James Jackson's work on migration policies. Closed. 2010-03-10
20100303-02 Medium All Matt Viljoen Circulate summary of 2.1.8/2.1.9 presentation given on 2010-03-03. Closed. 2010-03-01
20091007-01 Low All Andrew Sansum Check/confirm priority order of services (re)start for Tier1. Closed at PMB. 2010-03-31
20100317-01 Medium All Shaun de Witt Presentation on CASTOR upgrade plans on 31 March 2010 under "special presentations". Closed 2010-03-31
20100331-01 Medium All Andrew Sansum Chase status of current disk allocations with Glen. Closed. 2010-04-21
20100407-01 Medium ATLAS Alastair Dewhurst Confirm with ATLAS priority for publishing disabled space. ATLAS are happy with the status quo. Closed. 2010-04-21
20090923-02 Low MICE Henry Nebrensky Resolve data permissions requirements. Other problems that need sorting out before the permissions. Waiting for Henry to test with second VOMS role. Closed. 2010-04-28
20090624-04 Low ATLAS Alastair Dewhurst Determine ATLAS requirements and plans for group analysis: # jobs, users... Ongoing. Power user jobs working well. Need to re-test group prod. analysis when possible. Done. 2010-05-12
20100512-01 Medium SuperB James Thorne Invite Fergus Wilson to meeting as representative for SuperB. Done. 2010-05-19
20100512-02 High All Gareth Smith Revisit the banning of user cron jobs on the UIs with James A. Some users at the liaison meeting use cron. Done. Users need to submit a Tier1 support ticket if they require access. 2010-06-02
20100526-01 Medium SuperB Shaun de Witt Follow up SuperB problems reported in helpdesk ticket #59984. Closed. 2010-06-02
20100526-02 Medium MICE Matt Viljoen Follow up MICE requirement for two tape copies of files. Closed 2010-06-09
20100519-01 Medium LHCb Gareth Smith Determine procedure for on call staff to contact LHCb via GGUS alarm tickets. Done. 2010-06-23
20100609-01 Medium ATLAS Alastair Dewhurst Circulate details of ATLAS data discussion to Tier1 team. Done. 2010-06-23
20100609-02 Medium All Matthew Viljoen Provide summary of functional differences between CASTOR 2.1.7, 2.1.8 and 2.1.9 on the wiki. Done, provided at CASTOR_218_219_upgrade_feature_highlights 2010-06-30
20100616-02 Medium All David Corney Provide a summary of R89 building problems to the liaison meeting on 23/06/2010. Will circulate before meeting on 30 June. Circulated on 25 June. Done. 2010-06-30
20100616-01 Medium All All Experiment reps Confirm with Jeremy Coles whether they have tools that interface directly with the GOCDB. Done. 2010-06-30
20100623-01 Medium CMS Chris Brew Choose a "good date" for T10KB migration and inform Tier1 staff. Done. 2010-06-30
20100623-02 High All Andrew Sansum Co-ordinate T10KB migration; determine order of changes required. Done. 2010-06-30
20100714-01 Medium LHCb James Thorne Check that a replacement machine was allocated to replace gdss474. Replacement was not provided but gdss474 has now gone back into production so this action is no longer needed. 2010-07-21
20100714-03 Medium ATLAS Alastair Dewhurst Give brief report on recent ATLAS tape recall tests at next week's meeting. Closed. See link. 2010-07-28
20100721-01 Medium ALICE Derek Ross Check if there is a cap on the number of running ALICE jobs and report the cap to the meeting if so. No cap at time of asking, but now added. Closed. 2010-07-28
20100721-02 Medium CMS Tim Folkes Confirm total time for CMS tape migration. About three months. Closed. 2010-07-28
20100714-02 Medium ATLAS Brian Davies Check that Alastair is chasing up ATLAS software server problems. 2010-07-28: Total number of ATLAS jobs is capped; Ian is testing alternative solution (CVMFS) to AFS. 2010-08-04
20100804-01 Medium All Shaun de Witt Contact Tim regarding repack. Done 2010-08-11
20100804-03 Medium All All experiment reps Inform Shaun de Witt of experiment plans for testing the CASTOR 2.1.9 pre-prod instance. Now redundant. Closed. 2010-09-01
20100811-01 Medium All All experiment reps Let Matthew Viljoen know what tests experiments would like to do to re-validate CASTOR at RAL after the 2.1.9 upgrade. All done. Closed. 2010-09-01
20100818-01 Medium H1 Derek Ross Put ce02 back into action (pointing at the SL5 farm) so that H1 can run jobs. Propose a solution to H1's problems with CREAM CE. Done, closed. 2010-09-01
20100804-02 Medium CMS Chris Brew Arrange hammercloud test against pre-prod instance. Closed. 2010-09-08
20100908-01 High All Gareth Smith Review hardware intervention procedure. Include procedures for replacing disk servers removed from production and the dialogue with experiments. Done. 2010-09-15
20100908-02 High All Andrew Sansum Arrange change control meeting to review the proposed CASTOR upgrade. Done. 2010-09-15
20100915-01 High All Gareth Smith Resolve procedural loophole that allowed gdss379 to be put back into production incorrectly. Done. 2010-09-29
20100929-02 High BaBar Gareth Smith Make sure BaBar know what's happening this weekend. Done. 2010-10-06
20100929-01 High All Alastair Dewhurst Understand impact of draining diskservers on FTS. Closed. 2010-10-13
20101013-01 Medium All Gareth Smith Ensure the issue of worker-node black holes is discussed at the on-call meeting. Discussed at on call meeting and suitable actions created. Being tracked there. Closed. 2010-10-20
20101020-01 High CMS Andrew Lahiff Understand cause of CMS job hangs. Some jobs completed OK on worker nodes but batch system thinks that they are still running. Caused by load on batch system and CMS central problem. Closed. 2010-10-27
20101117-01 Medium All Derek Ross Report back on possible causes of batch system resources problem. Report at meeting on 24/11/2010. Correlated with an update. Done. 2010-11-24
20101117-02 High ATLAS Martin Bly Ensure James Adams is aware that the deployment of additional ATLAS SRMs is blocked until boxes provided. Closed. 2010-11-24
20101006-01 Medium All Derek Ross Provide an additional CREAM CE for ATLAS. Closed. 2010-12-08
20101124-01 Medium All Andrew Sansum Circulate an update on the progress with "fixing" the R89 UPS. Done. 2010-12-15
20101208-02 Medium CMS Andrew Lahiff Test inheritance of ACLs in CASTOR 2.1.9. Done. 2010-12-15
20101208-03 Medium CMS Andrew Lahiff Find out if CMS have any plans for moving to CREAM CE only. CMS to start a push in January to move. 2010-12-15