Difference between revisions of "RAL Tier1 CASTOR Experiments Completed Actions 2010"
From GridPP Wiki
James thorne (Talk | contribs) |
Latest revision as of 13:56, 5 January 2011
Actions from RAL Tier1 Experiments Liaison Meeting closed in 2010.
Action ID | Priority | Experiment(s) | Owner | Description | Status | Completed date |
---|---|---|---|---|---|---|
20081001-05 | | ATLAS | Andrew Sansum | Case by case basis for job limits - meet with Roger. | Removed as there are other actions covering limits. | 2008-11-05 |
20081001-01 | | ATLAS | Derek Ross | Need to know memory configuration; memory limits and kill policy. | Discussion with Graeme Stewart. Wiki page requested. Done. | 2008-12-03 |
20081001-02 | | | Derek Ross | Look at queue limits. | Awaiting document/wiki page from action 20081001-01. Done. | 2008-12-03 |
20081105-01 | | ILC | Matt Hodges | Speak to Glenn Patrick about upping the priority of the ILC disk deployment. | Done. | 2008-12-03 |
20081105-02 | | ILC | Brian Davies | Ensure CASTOR team is aware that ILC space token setup is fairly urgent. | Done. | 2008-12-03 |
20081203-02 | | All | James Thorne | Consider combining items 1 and 2 on the agenda. | Done. | 2009-01-07 |
20081203-03 | | All | Brian Davies | Ensure relevant Tier1 staff are aware that CASTOR is not "VOMS aware" and bear this in mind when resolving user problems. | Closed. | 2009-01-07 |
20081203-04 | | ILC | Martin Bly | Stop ILC jobs in the batch system before 09:00 on 4/12/2008 while Chris Brew updates dCache at RALPP. | Closed. | 2009-01-07 |
20081001-03 | | ATLAS | Catalin Condurache | Liaise with ATLAS to arrange to get job recovery working. | Discussion held and can be implemented. Can put recovery area on worker nodes or have common area on a disk server, maybe the software area (depends on the load). Ongoing. | 2009-02-04 |
20081001-04 | | ATLAS | Catalin Condurache | Permissions check on software areas. Some sites have wrong permissions. | Ongoing. | 2009-02-04 |
20090107-01 | | ALICE | Gareth Smith | Follow up ALICE disk quota issues with Cristina via email. | Closed. | 2009-02-04 |
20081203-01 | | All | Gareth Smith | Clarify use and purpose of CASTOR-PP and csf-l mailing lists. | Gareth suggests closing the lists and using the EGEE broadcast system. Closed. | 2009-03-04 |
20090204-02 | | All | Martin Bly | Report at the next meeting the time scale for moving to 64-bit and/or SL5 worker nodes. | Closed. | 2009-03-04 |
20090107-02 | | ATLAS | Brian Davies | Co-ordinate solution for ATLAS hot files such as conditions data. | Graeme suggests something called "pCache". Closed. | 2009-05-13 |
20090204-01 | | All | Gareth Smith | Determine actions required before starting a CASTOR downtime (e.g. stopping batch queues). | No response from some experiments. Closed. | 2009-05-13 |
20090304-02 | | CMS | Chris Brew | Look at whether the RAL Tier1 can take extra data sets on behalf of ASGC. | No longer relevant. Closed. | 2009-05-13 |
20090107-03 | | CMS, ALICE | Matt Hodges | Liaise with CMS and ALICE regarding software server installations. | Ongoing. CMS using theirs. Done. | 2009-06-03 |
A-20090311-03 | High | All | James Thorne | Deploy scheduled verifies on disk servers by mid-May. Will be verifying 1/5 of a disk pool at any one time. | Deployed on non-production CASTOR machines and on non-CASTOR machines. Rest will follow in July, after the move. Closed. | 2009-06-17 |
A-20090506-03 | Medium | All | Experiment Reps. | Provide feedback and comments on Gareth's draft table specifying the actions to be taken when closing batch queues ahead of CASTOR interventions (circulated to the list on 6 May). | Feedback received from most experiments. Closed. | 2009-06-17 |
A-20090521-01 | High | LHCb | James Thorne, Matt Viljoen | Track down status of LHCb server (gdss160) with problems. | Machine is back in production. Closed. | 2009-06-17 |
A-20090527-01 | Medium | ALICE | Lee Barnby | Confirm ALICE's plans for STEP09. | Done, closed. | 2009-06-17 |
A-20090527-02 | Medium | All | Brian Davies | Distribute URLs for the VO CASTOR monitoring pages to the experiment reps. | Done, closed. | 2009-06-17 |
20090603-03 | | All | Martin Bly | Review planned dates for site network outage (23 and 30 June). | Dates are now 7 and 14 July which are OK (ish). Closed. | 2009-06-17 |
20090603-05 | | n/a | Andrew Sansum | Encourage Jeremy Coles to attend the meeting. | Jeremy is attending the meeting, closed. | 2009-06-17 |
20090603-01 | | All | Derek Ross | Take BDII issue back to the dteam. | Discussed by dteam. | 2009-06-24 |
20090603-04 | | All | Brian Davies | Create a page explaining what the castormon plots mean. | Closed. Put titles on plots rather than creating a separate page. | 2009-06-24 |
20090617-01 | High | All | Matt Hodges | Prepare plan for move to SL5 WNs and present it at the meeting on 24 June. | Done | 2009-06-24 |
20090624-01 | High | All | Gareth Smith | Confirm with Networking whether the site network intervention on 7 July will involve a break in connectivity or just an "at risk". | Closed. | 2009-07-08 |
20090624-02 | High | All | Andrew Sansum | Decide priorities for Tier1 update window by 1 July. | Closed. | 2009-07-15 |
20090624-03 | Medium | CMS | Chris Brew | Send link to CMS pre-staging info to meeting list. | Closed. | 2009-07-15 |
20090624-05 | Medium | LHCb | Shaun de Witt | Test whether upping the LSF job cap to 600 causes a performance problem. | Decided not necessary. Closed. | 2009-07-15 |
20090603-02 | Medium | ATLAS | Brian Davies | Circulate link to twiki page containing out of hours contact details for ATLAS shifters. | Closed. | 2009-07-29 |
20090624-06 | High | All | Matt Hodges | Publish SL5 plan on the Tier1 blog and seek input from the experiments. | Ongoing, publish plans on blog before 29 July. Closed. | 2009-07-29 |
20090722-02 | Medium | LHCb | Raja Nandakumar | Send to the Tier1 the job IDs of some jobs that are not having job info (mem use etc.) added to the pbs database. | Closed. | 2009-07-29 |
20090513-01 | Medium | All | Martin Bly | Contact VO contact list to determine VO hardware requirements for next procurement. | Martin to forward email sent to HAG to Andrew for distribution to the UB. Closed. | 2009-08-06 |
20090624-07 | Medium | All | James Thorne | Locate a new meeting room with better facilities. | Ongoing, trying the Access Grid room and EVO on 29 July. Closed. | 2009-08-06 |
20090715-02 | Medium | ATLAS | Brian Davies | Follow up with Tim Folkes regarding the ability to recall entire tapes. | ATLAS thinks this will be more efficient than recalling individual files. It is possible. Closed. | 2009-08-06 |
20090722-01 | High | ATLAS | Matt Hodges | Look at rescheduling the LFC RAC migration to August (in Catalin's absence) and ensure the rest of the grid team are familiar with the migration procedure. | Meeting on 30 July with ATLAS. Closed. | 2009-08-06 |
20090722-04 | Medium | All | Experiment reps. | Let Martin Bly know how long their jobs may need to exceed their memory limits. | 10 mins? 20 mins? Closed. | 2009-08-06 |
20090722-05 | Medium | All | Derek Ross | Circulate the location of the PBS info to the meeting list. | Done. | 2009-08-19 |
20090806-01 | High | All | Andrew Sansum | Flag the planned dates (September) for migration to SL5 to PMB and MB. | Closed | 2009-08-19 |
20090826-02 | Medium | MICE | David Corney | Mail Paul Kyberd and Henry to encourage them to present and discuss their plans at the weekly liaison meeting. | Done | 2009-09-16 |
20090826-01 | Medium | All | Shaun de Witt | Suggest to Tim that we should repack the tapes that may have been affected by the water ingress into the tape robot. | This is to ensure that the data is on known clean tapes. Shaun spoke to Tim, and Tim will have stats on tape damage, if any, in the next week. Closed. | 2009-09-16 |
A-20090225-02 | Low | ILC | Chris Kruk | Speak to Dennis at CERN to try and find out if there's a way to implement "fairshare" in CASTOR (for the gen instance). | Closed. | 2009-09-30 |
A-20090429-02 | Medium | All | Gareth Smith | Review our procedure for notifying VOs of broken disk servers. | Ongoing. Review of disk server intervention procedure on 1 October. Closed. | 2009-10-14 |
20090715-01 | Medium | LHCb | James Thorne | Determine why Ganglia graphs for Storage_LHCb stopped after the CASTOR upgrade and reboot of disk servers. | Ongoing. Believed to be caused by the data source machines being started after the rest of the machines in the cluster. Need to document start order. The start order was a red herring. One of the Storage_LHCb cluster data sources had become an ATLAS server and so ganglia was "confused". Closed. | 2009-10-14 |
20090923-01 | Medium | MICE | Shaun de Witt | Send Henry link to SRM download site. | Closed. | 2009-10-14 |
20090923-03 | Medium | CMS | Shaun de Witt | Investigate gridFTP "marker" timeouts with CASTOR@RAL to dCache site transfers. | Done. | 2009-11-04 |
20090930-01 | Low | All | Matthew Viljoen | Explore possibilities for "fair share" in CASTOR (e.g. extra instances). | Done. | 2009-11-04 |
20090204-03 | Medium | All | Alastair Dewhurst | Ask Mingchao to perform a security audit of all experiment software areas. | This follows up from action 20081001-04 (closed). LHC software areas done. Ongoing. Alastair has written a Nagios check. Done. | 2010-01-13 |
A-20090506-04 | High | All | Andrew Sansum | Investigate disk server failure rates. | Investigation started. Ongoing. Have something from Ian Collier by Christmas 2009. Closed. | 2010-01-20 |
20091111-01 | Medium | T2K | Shaun de Witt | Investigate the possibility of setting up T2K to test file transfers on the gen instance. | Closed. | 2010-01-20 |
20091202-01 | Medium | All | Martin Bly, Fabric team | Prepare summary of disk failure rates (action A-20090506-04) for an "item 5" presentation. | Presentation on January 20th. Closed. | 2010-01-20 |
20090722-03 | Medium | All | Alastair Dewhurst | Gather a list of "requirements questions" for the experiments and circulate to the meeting list before the next meeting on 29 July. | With Martin Bly, Matt Viljoen, Alastair Dewhurst and Derek Ross. Ongoing. Alastair has drafted something for ATLAS and passed on to CMS to see if it can be applied to them as well. Alastair will also pass it on to LHCb. Raja working with Alastair on LHCb requirements. Presentation given on 10/2/2010 | 2010-02-10 |
20100203-01 | Medium | All | James Thorne | Ask Martin for an update on 2009 CPU tender deliveries. | For meeting on 10/2/2010. Martin gave update at meeting. | 2010-02-10 |
20090304-01 | Medium | ATLAS | Alastair Dewhurst | Review and understand what maui, torque and linux are doing regarding memory limits when killing ATLAS batch jobs. | Ongoing. Is memory calculation including paged memory? Still a problem in SL5. Monitoring of jobs now in place. Closed. | 2010-02-24 |
20100210-01 | Medium | All | Shaun de Witt | Ask Matt Viljoen to set up discussion on upgrade plans for CASTOR 2.1.9 | Closed. | 2010-02-24 |
20100217-01 | Medium | All | Matt Hodges, James Thorne | Clarify first deployment date for Viglen 08 kit. | Report back on 2010-02-24. Closed. | 2010-02-24 |
20100224-01 | Medium | All | Gareth Smith | Chase up status of backup OPN link commissioning. | "The installation of the backup OPN link is expected to be complete by 31 March 2010, and it to be brought into service through April." Information provided by Robin Tasker. Closed. | 2010-03-03 |
20100303-01 | Medium | CMS | Shaun de Witt | Check on status of James Jackson's work on migration policies. | Closed. | 2010-03-10 |
20100303-02 | Medium | All | Matt Viljoen | Circulate summary of 2.1.8/2.1.9 presentation given on 2010-03-03. | Closed. | 2010-03-01 |
20091007-01 | Low | All | Andrew Sansum | Check/confirm priority order of services (re)start for Tier1. | Closed at PMB. | 2010-03-31 |
20100317-01 | Medium | All | Shaun de Witt | Presentation on CASTOR upgrade plans on 31 March 2010 under "special presentations". | Closed | 2010-03-31 |
20100331-01 | Medium | All | Andrew Sansum | Chase status of current disk allocations with Glen. | Closed. | 2010-04-21 |
20100407-01 | Medium | ATLAS | Alastair Dewhurst | Confirm with ATLAS priority for publishing disabled space. | ATLAS are happy with the status quo. Closed. | 2010-04-21 |
20090923-02 | Low | MICE | Henry Nebrensky | Resolve data permissions requirements. | Other problems that need sorting out before the permissions. Waiting for Henry to test with second VOMS role. Closed. | 2010-04-28 |
20090624-04 | Low | ATLAS | Alastair Dewhurst | Determine ATLAS requirements and plans for group analysis: # jobs, users... | Ongoing. Power user jobs working well. Need to re-test group prod. analysis when possible. Done. | 2010-05-12 |
20100512-01 | Medium | SuperB | James Thorne | Invite Fergus Wilson to meeting as representative for SuperB. | Done. | 2010-05-19 |
20100512-02 | High | All | Gareth Smith | Revisit the banning of user cron jobs on the UIs with James A. Some users at the liaison meeting use cron. | Done. Users need to submit a Tier1 support ticket if they require access. | 2010-06-02 |
20100526-01 | Medium | SuperB | Shaun de Witt | Follow up SuperB problems reported in helpdesk ticket #59984. | Closed. | 2010-06-02 |
20100526-02 | Medium | MICE | Matt Viljoen | Follow up MICE requirement for two tape copies of files. | Closed | 2010-06-09 |
20100519-01 | Medium | LHCb | Gareth Smith | Determine procedure for on call staff to contact LHCb via GGUS alarm tickets. | Done. | 2010-06-23 |
20100609-01 | Medium | ATLAS | Alastair Dewhurst | Circulate details of ATLAS data discussion to Tier1 team. | Done. | 2010-06-23 |
20100609-02 | Medium | All | Matthew Viljoen | Provide summary of functional differences between CASTOR 2.1.7, 2.1.8 and 2.1.9 on the wiki. | Done, provided at CASTOR_218_219_upgrade_feature_highlights | 2010-06-30 |
20100616-02 | Medium | All | David Corney | Provide a summary of R89 building problems to the liaison meeting on 23/06/2010. | Will circulate before meeting on 30 June. Circulated on 25 June. Done. | 2010-06-30 |
20100616-01 | Medium | All | All Experiment reps | Confirm with Jeremy Coles whether they have tools that interface directly with the GOCDB. | Done. | 2010-06-30 |
20100623-01 | Medium | CMS | Chris Brew | Choose a "good date" for T10KB migration and inform Tier1 staff. | Done. | 2010-06-30 |
20100623-02 | High | All | Andrew Sansum | Co-ordinate T10KB migration; determine order of changes required. | Done. | 2010-06-30 |
20100714-01 | Medium | LHCb | James Thorne | Check that a replacement machine was allocated to replace gdss474. | Replacement was not provided but gdss474 has now gone back into production so this action is no longer needed. | 2010-07-21 |
20100714-03 | Medium | ATLAS | Alastair Dewhurst | Give brief report on recent ATLAS tape recall tests at next week's meeting. | Closed. See link. | 2010-07-28 |
20100721-01 | Medium | ALICE | Derek Ross | Check if there is a cap on the number of running ALICE jobs and report the cap to the meeting if so. | No cap at time of asking, but now added. Closed. | 2010-07-28 |
20100721-02 | Medium | CMS | Tim Folkes | Confirm total time for CMS tape migration. | About three months. Closed. | 2010-07-28 |
20100714-02 | Medium | ATLAS | Brian Davies | Check that Alastair is chasing up ATLAS software server problems. | 2010-07-28: Total number of ATLAS jobs is capped; Ian is testing alternative solution (CVMFS) to AFS. | 2010-08-04 |
20100804-01 | Medium | All | Shaun de Witt | Contact Tim regarding repack. | Done | 2010-08-11 |
20100804-03 | Medium | All | All experiment reps | Inform Shaun de Witt of experiment plans for testing the CASTOR 2.1.9 pre-prod instance. | Now redundant. Closed. | 2010-09-01 |
20100811-01 | Medium | All | All experiment reps | Let Matthew Viljoen know what tests experiments would like to do to re-validate CASTOR at RAL after the 2.1.9 upgrade. | All done. Closed. | 2010-09-01 |
20100818-01 | Medium | H1 | Derek Ross | Put ce02 back into action (pointing at the SL5 farm) so that H1 can run jobs. Propose a solution to H1's problems with CREAM CE. | Done, closed. | 2010-09-01 |
20100804-02 | Medium | CMS | Chris Brew | Arrange hammercloud test against pre-prod instance. | Closed. | 2010-09-08 |
20100908-01 | High | All | Gareth Smith | Review hardware intervention procedure. | Include procedures for replacing disk servers removed from production and the dialogue with experiments. Done. | 2010-09-15 |
20100908-02 | High | All | Andrew Sansum | Arrange change control meeting to review the proposed CASTOR upgrade. | Done. | 2010-09-15 |
20100915-01 | High | All | Gareth Smith | Resolve procedural loophole that allowed gdss379 to be put back into production incorrectly. | Done. | 2010-09-29 |
20100929-02 | High | BaBar | Gareth Smith | Make sure BaBar know what's happening this weekend. | Done. | 2010-10-06 |
20100929-01 | High | All | Alastair Dewhurst | Understand impact of draining diskservers on FTS. | Closed. | 2010-10-13 |
20101013-01 | Medium | All | Gareth Smith | Ensure the issue of worker-node black holes is discussed at the on-call meeting. | Discussed at on call meeting and suitable actions created. Being tracked there. Closed. | 2010-10-20 |
20101020-01 | High | CMS | Andrew Lahiff | Understand cause of CMS job hangs. | Some jobs completed OK on worker nodes but batch system thinks that they are still running. Caused by load on batch system and CMS central problem. Closed. | 2010-10-27 |
20101117-01 | Medium | All | Derek Ross | Report back on possible causes of batch system resources problem. | Report at meeting on 24/11/2010. Correlated with an update. Done. | 2010-11-24 |
20101117-02 | High | ATLAS | Martin Bly | Ensure James Adams is aware that the deployment of additional ATLAS SRMs is blocked until boxes provided. | Closed. | 2010-11-24 |
20101006-01 | Medium | All | Derek Ross | Provide an additional CREAM CE for ATLAS. | Closed. | 2010-12-08 |
20101124-01 | Medium | All | Andrew Sansum | Circulate an update on the progress with "fixing" the R89 UPS. | Done. | 2010-12-15 |
20101208-02 | Medium | CMS | Andrew Lahiff | Test inheritance of ACLs in CASTOR 2.1.9. | Done. | 2010-12-15 |
20101208-03 | Medium | CMS | Andrew Lahiff | Find out if CMS have any plans for moving to CREAM CE only. | CMS to start a push in January to move. | 2010-12-15 |