RAL Tier1 Experiments Liaison Meeting
Covers all aspects of the Tier1.
Agenda
Chairman: David Corney
Secretary: James Thorne
- Summary of Operational Status and Issues (Gareth)
- Highlights/summary of the Tier1 Monday operations meeting.
- Experiment plans and operational issues
- CMS
- ATLAS
- LHCb
- ALICE
- Others
- Special presentations (agreed in advance)
- (none)
- Actions
- AoB
Open Actions
| Action ID | Priority | Experiment(s) | Owner | Action | Status |
|---|---|---|---|---|---|
| 20090624-04 | Low | ATLAS | Alastair Dewhurst | Determine ATLAS requirements and plans for group analysis: # jobs, users... | Ongoing. We will be allowing power users at RAL. Check that the Tier1 is set up correctly. Pilots do not have power user status, so no power user jobs can run via Panda. CERN have fixed their WMS now. Power user jobs have been working well since 2010-02-26. AD is monitoring and will report back. Need to run a manual HammerCloud test. |
| 20090923-02 | Medium | MICE | Henry Nebrensky | Resolve data permissions requirements. | Other problems that need sorting out before the permissions. Waiting for Henry to test with second VOMS role. |
| 20091007-01 | Low | All | Andrew Sansum | Check/confirm priority order of services (re)start for Tier1 | Ongoing. To be closed off at PMB. |
Completed Actions
Archives of actions completed can be found at:
- RAL Tier1 CASTOR Experiments Completed Actions 2007
- RAL Tier1 CASTOR Experiments Completed Actions 2008
- RAL Tier1 CASTOR Experiments Completed Actions 2009
| Action ID | Priority | Experiment(s) | Owner | Action | Status | Completed date |
|---|---|---|---|---|---|---|
| 20081001-05 | | ATLAS | Andrew Sansum | Case by case basis for job limits - meet with Roger. | Removed as there are other actions covering limits. | 2008-11-05 |
| 20081001-01 | | ATLAS | Derek Ross | Need to know memory configuration; memory limits and kill policy. | Discussion with Graeme Stewart. Wiki page requested. Done. | 2008-12-03 |
| 20081001-02 | | | Derek Ross | Look at queue limits. | Awaiting document/wiki page from action 20081001-01. Done. | 2008-12-03 |
| 20081105-01 | | ILC | Matt Hodges | Speak to Glenn Patrick about upping the priority of the ILC disk deployment. | Done. | 2008-12-03 |
| 20081105-02 | | ILC | Brian Davies | Ensure CASTOR team is aware that ILC space token setup is fairly urgent. | Done. | 2008-12-03 |
| 20081203-02 | | All | James Thorne | Consider combining items 1 and 2 on the agenda. | Done. | 2009-01-07 |
| 20081203-03 | | All | Brian Davies | Ensure relevant Tier1 staff are aware that CASTOR is not "VOMS aware" and bear this in mind when resolving user problems. | Closed. | 2009-01-07 |
| 20081203-04 | | ILC | Martin Bly | Stop ILC jobs in the batch system before 09:00 on 4/12/2008 while Chris Brew updates dCache at RALPP. | Closed. | 2009-01-07 |
| 20081001-03 | | ATLAS | Catalin Condurache | Liaise with ATLAS to arrange to get job recovery working. | Discussion held and can be implemented. Can put recovery area on worker nodes or have common area on a disk server, maybe the software area (depends on the load). Ongoing. | 2009-02-04 |
| 20081001-04 | | ATLAS | Catalin Condurache | Permissions check on software areas. Some sites have wrong permissions. | Ongoing. | 2009-02-04 |
| 20090107-01 | | ALICE | Gareth Smith | Follow up ALICE disk quota issues with Cristina via email. | Closed. | 2009-02-04 |
| 20081203-01 | | All | Gareth Smith | Clarify use and purpose of CASTOR-PP and csf-l mailing lists. | Gareth suggests closing the lists and using the EGEE broadcast system. Closed. | 2009-03-04 |
| 20090204-02 | | All | Martin Bly | Report at the next meeting the time scale for moving to 64-bit and/or SL5 worker nodes. | Closed. | 2009-03-04 |
| 20090107-02 | | ATLAS | Brian Davies | Co-ordinate solution for ATLAS hot files such as conditions data. | Graeme suggests something called "pCache". Closed. | 2009-05-13 |
| 20090204-01 | | All | Gareth Smith | Determine actions required before starting a CASTOR downtime (e.g. stopping batch queues). | No response from some experiments. Closed. | 2009-05-13 |
| 20090304-02 | | CMS | Chris Brew | Look at whether the RAL Tier1 can take extra data sets on behalf of ASGC. | No longer relevant. Closed. | 2009-05-13 |
| 20090107-03 | | CMS, ALICE | Matt Hodges | Liaise with CMS and ALICE regarding software server installations. | Ongoing. CMS using theirs. Done. | 2009-06-03 |
| A-20090311-03 | High | All | James Thorne | Deploy scheduled verifies on disk servers by mid-May. Will be verifying 1/5 of a disk pool at any one time. | Deployed on non-production CASTOR machines and on non-CASTOR machines. Rest will follow in July, after the move. Closed. | 2009-06-17 |
| A-20090506-03 | Medium | All | Experiment Reps. | Provide feedback and comments on Gareth's draft table specifying the actions to be taken when closing batch queues ahead of CASTOR interventions (circulated to the list on 6 May). | Feedback received from most experiments. Closed. | 2009-06-17 |
| A-20090521-01 | High | LHCb | James Thorne, Matt Viljoen | Track down status of LHCb server (gdss160) with problems. | Machine is back in production. Closed. | 2009-06-17 |
| A-20090527-01 | Medium | ALICE | Lee Barnby | Confirm ALICE's plans for STEP09. | Done, closed. | 2009-06-17 |
| A-20090527-02 | Medium | All | Brian Davies | Distribute URLs for the VO CASTOR monitoring pages to the experiment reps. | Done, closed. | 2009-06-17 |
| 20090603-03 | | All | Martin Bly | Review planned dates for site network outage (23 and 30 June). | Dates are now 7 and 14 July which are OK (ish). Closed. | 2009-06-17 |
| 20090603-05 | | n/a | Andrew Sansum | Encourage Jeremy Coles to attend the meeting. | Jeremy is attending the meeting, closed. | 2009-06-17 |
| 20090603-01 | | All | Derek Ross | Take BDII issue back to the dteam. | Discussed by dteam. | 2009-06-24 |
| 20090603-04 | | All | Brian Davies | Create a page explaining what the castormon plots mean. | Closed; put titles on plots rather than creating a separate page. | 2009-06-24 |
| 20090617-01 | High | All | Matt Hodges | Prepare plan for move to SL5 WNs and present it at the meeting on 24 June. | Done | 2009-06-24 |
| 20090624-01 | High | All | Gareth Smith | Confirm with Networking whether the site network intervention on 7 July will involve a break in connectivity or just an "at risk". | Closed. | 2009-07-08 |
| 20090624-02 | High | All | Andrew Sansum | Decide priorities for Tier1 update window by 1 July. | Closed. | 2009-07-15 |
| 20090624-03 | Medium | CMS | Chris Brew | Send link to CMS pre-staging info to meeting list. | Closed. | 2009-07-15 |
| 20090624-05 | Medium | LHCb | Shaun de Witt | Test whether upping the LSF job cap to 600 causes a performance problem. | Decided not necessary. Closed. | 2009-07-15 |
| 20090603-02 | Medium | ATLAS | Brian Davies | Circulate link to twiki page containing out of hours contact details for ATLAS shifters. | Closed. | 2009-07-29 |
| 20090624-06 | High | All | Matt Hodges | Publish SL5 plan on the Tier1 blog and seek input from the experiments. | Ongoing, publish plans on blog before 29 July. Closed. | 2009-07-29 |
| 20090722-02 | Medium | LHCb | Raja Nandakumar | Send to the Tier1 the job IDs of some jobs that are not having job info (mem use etc.) added to the pbs database. | Closed. | 2009-07-29 |
| 20090513-01 | Medium | All | Martin Bly | Contact VO contact list to determine VO hardware requirements for next procurement. | Martin to forward email sent to HAG to Andrew for distribution to the UB. Closed. | 2009-08-06 |
| 20090624-07 | Medium | All | James Thorne | Locate a new meeting room with better facilities. | Ongoing, trying the Access Grid room and EVO on 29 July. Closed. | 2009-08-06 |
| 20090715-02 | Medium | ATLAS | Brian Davies | Follow up with Tim Folkes regarding the ability to recall entire tapes | ATLAS thinks this will be more efficient than individual files. It is possible. Closed. | 2009-08-06 |
| 20090722-01 | High | ATLAS | Matt Hodges | Look at rescheduling the LFC RAC migration to August (in Catalin's absence) and ensure the rest of the grid team are familiar with the migration procedure. | Meeting on 30 July with ATLAS. Closed. | 2009-08-06 |
| 20090722-04 | Medium | All | Experiment reps. | Let Martin Bly know how long their jobs may need to exceed their memory limits. | 10 mins? 20 mins? Closed. | 2009-08-06 |
| 20090722-05 | Medium | All | Derek Ross | Circulate the location of the PBS info to the meeting list. | Done. | 2009-08-19 |
| 20090806-01 | High | All | Andrew Sansum | Flag the planned dates (September) for migration to SL5 to PMB and MB. | Closed | 2009-08-19 |
| 20090826-02 | Medium | MICE | David Corney | Mail Paul Kyberd and Henry to encourage them to present and discuss their plans at the weekly liaison meeting. | Done | 2009-09-16 |
| 20090826-01 | Medium | All | Shaun de Witt | Suggest to Tim that we should repack the tapes that may have been affected by the water ingress into the tape robot. | This is to ensure that the data is on known clean tapes. Shaun spoke to Tim, and Tim will have stats on tape damage, if any, in the next week. Closed. | 2009-09-16 |
| A-20090225-02 | Low | ILC | Chris Kruk | Speak to Dennis at CERN to try and find out if there's a way to implement "fairshare" in CASTOR (for the gen instance). | Closed. | 2009-09-30 |
| A-20090429-02 | Medium | All | Gareth Smith | Review our procedure for notifying VOs of broken disk servers. | Ongoing. Review of disk server intervention procedure on 1 October. Closed. | 2009-10-14 |
| 20090715-01 | Medium | LHCb | James Thorne | Determine why Ganglia graphs for Storage_LHCb stopped after the CASTOR upgrade and reboot of disk servers. | Ongoing. Believed to be caused by the data source machines being started after the rest of the machines in the cluster. Need to document start order. The start order was a red herring. One of the Storage_LHCb cluster data sources had become an ATLAS server and so ganglia was "confused". Closed. | 2009-10-14 |
| 20090923-01 | Medium | MICE | Shaun de Witt | Send Henry link to SRM download site. | Closed. | 2009-10-14 |
| 20090923-03 | Medium | CMS | Shaun de Witt | Investigate gridFTP "marker" timeouts with CASTOR@RAL to dCache site transfers. | Done. | 2009-11-04 |
| 20090930-01 | Low | All | Matthew Viljoen | Explore possibilities for "fair share" in CASTOR (e.g. extra instances). | Done. | 2009-11-04 |
| 20090204-03 | Medium | All | Alastair Dewhurst | Ask Mingchao to perform a security audit of all experiment software areas. | This follows up from action 20081001-04 (closed). LHC software areas done. Ongoing. Alastair has written a Nagios check. Done. | 2010-01-13 |
| A-20090506-04 | High | All | Andrew Sansum | Investigate disk server failure rates. | Investigation started. Ongoing. Have something from Ian Collier by Christmas 2009. Closed. | 2010-01-20 |
| 20091111-01 | Medium | T2K | Shaun de Witt | Investigate the possibility of setting up T2K to test file transfers on the gen instance. | Closed. | 2010-01-20 |
| 20091202-01 | Medium | All | Martin Bly, Fabric team | Prepare summary of disk failure rates (action A-20090506-04) for an "item 5" presentation. | Presentation on January 20th. Closed. | 2010-01-20 |
| 20090722-03 | Medium | All | Alastair Dewhurst | Gather a list of "requirements questions" for the experiments and circulate to the meeting list before the next meeting on 29 July. | With Martin Bly, Matt Viljoen, Alastair Dewhurst and Derek Ross. Ongoing. Alastair has drafted something for ATLAS and passed on to CMS to see if it can be applied to them as well. Alastair will also pass it on to LHCb. Raja working with Alastair on LHCb requirements. Presentation given on 10/2/2010 | 2010-02-10 |
| 20100203-01 | Medium | All | James Thorne | Ask Martin for an update on 2009 CPU tender deliveries. | For meeting on 10/2/2010. Martin gave update at meeting. | 2010-02-10 |
| 20090304-01 | Medium | ATLAS | Alastair Dewhurst | Review and understand what Maui, Torque and Linux are doing regarding memory limits when killing ATLAS batch jobs. | Ongoing. Is the memory calculation including paged memory? Still a problem in SL5. Monitoring of jobs now in place. Closed. | 2010-02-24 |
| 20100210-01 | Medium | All | Shaun de Witt | Ask Matt Viljoen to set up discussion on upgrade plans for CASTOR 2.1.9 | Closed. | 2010-02-24 |
| 20100217-01 | Medium | All | Matt Hodges, James Thorne | Clarify first deployment date for Viglen 08 kit. | Report back on 2010-02-24. Closed. | 2010-02-24 |
| 20100224-01 | Medium | All | Gareth Smith | Chase up status of backup OPN link commissioning. | "The installation of the backup OPN link is expected to be complete by 31 March 2010, and it to be brought into service through April." Information provided by Robin Tasker. Closed. | 2010-03-03 |
| 20100303-01 | Medium | CMS | Shaun de Witt | Check on status of James Jackson's work on migration policies. | Closed. | 2010-03-10 |
| 20100303-02 | Medium | All | Matt Viljoen | Circulate summary of 2.1.8/2.1.9 presentation given on 2010-03-03. | Closed. | 2010-03-10 |
