RAL Tier1 Experiments Liaison Meeting

Covers all aspects of the Tier1.

Agenda

Chairman: David Corney

Secretary: James Thorne

  1. Summary of Operational Status and Issues (Gareth)
  2. Highlights/summary of the Tier1 Monday operations meeting.
  3. Experiment plans and operational issues
    • CMS
    • ATLAS
    • LHCb
    • ALICE
    • Others
  4. Special presentations (agreed in advance)
    • (none)
  5. Actions
  6. AoB

Open Actions

Action ID Priority Experiment(s) Owner Action Status
20090624-04 Low ATLAS Alastair Dewhurst Determine ATLAS requirements and plans for group analysis: # jobs, users... Ongoing. We will be allowing power users at RAL. Check that the Tier1 is set up correctly. Pilots do not have power user status, so no power user jobs via Panda. CERN have fixed their WMS now. Power user jobs working well since 2010-02-26. AD is monitoring and will report back. Need to run a manual HammerCloud test.
20090923-02 Medium MICE Henry Nebrensky Resolve data permissions requirements. Other problems need sorting out before the permissions. Waiting for Henry to test with second VOMS role.
20091007-01 Low All Andrew Sansum Check/confirm priority order of services (re)start for Tier1. Ongoing. To be closed off at PMB.

Completed Actions

Archives of actions completed can be found at:

Action ID Priority Experiment(s) Owner Action Status Completed date
20081001-05 ATLAS Andrew Sansum Case by case basis for job limits - meet with Roger. Removed as there are other actions covering limits. 2008-11-05
20081001-01 ATLAS Derek Ross Need to know memory configuration; memory limits and kill policy. Discussion with Graeme Stewart. Wiki page requested. Done. 2008-12-03
20081001-02 Derek Ross Look at queue limits. Awaiting document/wiki page from action 20081001-01. Done. 2008-12-03
20081105-01 ILC Matt Hodges Speak to Glenn Patrick about upping the priority of the ILC disk deployment. Done. 2008-12-03
20081105-02 ILC Brian Davies Ensure CASTOR team is aware that ILC space token setup is fairly urgent. Done. 2008-12-03
20081203-02 All James Thorne Consider combining items 1 and 2 on the agenda. Done. 2009-01-07
20081203-03 All Brian Davies Ensure relevant Tier1 staff are aware that CASTOR is not "VOMS aware" and bear this in mind when resolving user problems. Closed. 2009-01-07
20081203-04 ILC Martin Bly Stop ILC jobs in the batch system before 09:00 on 4/12/2008 while Chris Brew updates dCache at RALPP. Closed. 2009-01-07
20081001-03 ATLAS Catalin Condurache Liaise with ATLAS to arrange to get job recovery working. Discussion held and can be implemented. Can put recovery area on worker nodes or have common area on a disk server, maybe the software area (depends on the load). Ongoing. 2009-02-04
20081001-04 ATLAS Catalin Condurache Permissions check on software areas. Some sites have wrong permissions. Ongoing. 2009-02-04
20090107-01 ALICE Gareth Smith Follow up ALICE disk quota issues with Cristina via email. Closed. 2009-02-04
20081203-01 All Gareth Smith Clarify use and purpose of CASTOR-PP and csf-l mailing lists. Gareth suggests closing the lists and using the EGEE broadcast system. Closed. 2009-03-04
20090204-02 All Martin Bly Report at the next meeting the time scale for moving to 64-bit and/or SL5 worker nodes. Closed. 2009-03-04
20090107-02 ATLAS Brian Davies Co-ordinate solution for ATLAS hot files such as conditions data. Graeme suggests something called "pCache". Closed. 2009-05-13
20090204-01 All Gareth Smith Determine actions required before starting a CASTOR downtime (e.g. stopping batch queues). No response from some experiments. Closed. 2009-05-13
20090304-02 CMS Chris Brew Look at whether the RAL Tier1 can take extra data sets on behalf of ASGC. No longer relevant. Closed. 2009-05-13
20090107-03 CMS, ALICE Matt Hodges Liaise with CMS and ALICE regarding software server installations. Ongoing. CMS using theirs. Done. 2009-06-03
A-20090311-03 High All James Thorne Deploy scheduled verifies on disk servers by mid-May. Will be verifying 1/5 of a disk pool at any one time. Deployed on non-production CASTOR machines and on non-CASTOR machines. Rest will follow in July, after the move. Closed. 2009-06-17
A-20090506-03 Medium All Experiment Reps. Provide feedback and comments on Gareth's draft table specifying the actions to be taken when closing batch queues ahead of CASTOR interventions (circulated to the list on 6 May). Feedback received from most experiments. Closed. 2009-06-17
A-20090521-01 High LHCb James Thorne, Matt Viljoen Track down status of LHCb server (gdss160) with problems. Machine is back in production. Closed. 2009-06-17
A-20090527-01 Medium ALICE Lee Barnby Confirm ALICE's plans for STEP09. Done, closed. 2009-06-17
A-20090527-02 Medium All Brian Davies Distribute URLs for the VO CASTOR monitoring pages to the experiment reps. Done, closed. 2009-06-17
20090603-03 All Martin Bly Review planned dates for site network outage (23 and 30 June). Dates are now 7 and 14 July which are OK (ish). Closed. 2009-06-17
20090603-05 n/a Andrew Sansum Encourage Jeremy Coles to attend the meeting. Jeremy is attending the meeting, closed. 2009-06-17
20090603-01 All Derek Ross Take BDII issue back to the dteam. Discussed by dteam. 2009-06-24
20090603-04 All Brian Davies Create a page explaining what the castormon plots mean. Closed, put titles on plots rather than creating a separate page. 2009-06-24
20090617-01 High All Matt Hodges Prepare plan for move to SL5 WNs and present it at the meeting on 24 June. Done 2009-06-24
20090624-01 High All Gareth Smith Confirm with Networking whether the site network intervention on 7 July will involve a break in connectivity or just an "at risk". Closed. 2009-07-08
20090624-02 High All Andrew Sansum Decide priorities for Tier1 update window by 1 July. Closed. 2009-07-15
20090624-03 Medium CMS Chris Brew Send link to CMS pre-staging info to meeting list. Closed. 2009-07-15
20090624-05 Medium LHCb Shaun de Witt Test whether upping the LSF job cap to 600 causes a performance problem. Decided not necessary. Closed. 2009-07-15
20090603-02 Medium ATLAS Brian Davies Circulate link to twiki page containing out of hours contact details for ATLAS shifters. Closed. 2009-07-29
20090624-06 High All Matt Hodges Publish SL5 plan on the Tier1 blog and seek input from the experiments. Ongoing, publish plans on blog before 29 July. Closed. 2009-07-29
20090722-02 Medium LHCb Raja Nandakumar Send to the Tier1 the job IDs of some jobs that are not having job info (mem use etc.) added to the pbs database. Closed. 2009-07-29
20090513-01 Medium All Martin Bly Contact VO contact list to determine VO hardware requirements for next procurement. Martin to forward email sent to HAG to Andrew for distribution to the UB. Closed. 2009-08-06
20090624-07 Medium All James Thorne Locate a new meeting room with better facilities. Ongoing, trying the Access Grid room and EVO on 29 July. Closed. 2009-08-06
20090715-02 Medium ATLAS Brian Davies Follow up with Tim Folkes regarding the ability to recall entire tapes. ATLAS thinks this will be more efficient than individual files. It is possible. Closed. 2009-08-06
20090722-01 High ATLAS Matt Hodges Look at rescheduling the LFC RAC migration to August (in Catalin's absence) and ensure the rest of the grid team are familiar with the migration procedure. Meeting on 30 July with ATLAS. Closed. 2009-08-06
20090722-04 Medium All Experiment reps. Let Martin Bly know how long their jobs may need to exceed their memory limits. 10 mins? 20 mins? Closed. 2009-08-06
20090722-05 Medium All Derek Ross Circulate the location of the PBS info to the meeting list. Done. 2009-08-19
20090806-01 High All Andrew Sansum Flag the planned dates (September) for migration to SL5 to PMB and MB. Closed 2009-08-19
20090826-02 Medium MICE David Corney Mail Paul Kyberd and Henry to encourage them to present and discuss their plans at the weekly liaison meeting. Done 2009-09-16
20090826-01 Medium All Shaun de Witt Suggest to Tim that we should repack the tapes that may have been affected by the water ingress into the tape robot. This is to ensure that the data is on known clean tapes. Shaun spoke to Tim and Tim will have stats on tape damage, if any, in the next week. Closed. 2009-09-16
A-20090225-02 Low ILC Chris Kruk Speak to Dennis at CERN to try and find out if there's a way to implement "fairshare" in CASTOR (for the gen instance). Closed. 2009-09-30
A-20090429-02 Medium All Gareth Smith Review our procedure for notifying VOs of broken disk servers. Ongoing. Review of disk server intervention procedure on 1 October. Closed. 2009-10-14
20090715-01 Medium LHCb James Thorne Determine why Ganglia graphs for Storage_LHCb stopped after the CASTOR upgrade and reboot of disk servers. Ongoing. Believed to be caused by the data source machines being started after the rest of the machines in the cluster. Need to document start order. The start order was a red herring. One of the Storage_LHCb cluster data sources had become an ATLAS server and so Ganglia was "confused". Closed. 2009-10-14
20090923-01 Medium MICE Shaun de Witt Send Henry link to SRM download site. Closed. 2009-10-14
20090923-03 Medium CMS Shaun de Witt Investigate gridFTP "marker" timeouts with CASTOR@RAL to dCache site transfers. Done. 2009-11-04
20090930-01 Low All Matthew Viljoen Explore possibilities for "fair share" in CASTOR (e.g. extra instances). Done. 2009-11-04
20090204-03 Medium All Alastair Dewhurst Ask Mingchao to perform a security audit of all experiment software areas. This follows up from action 20081001-04 (closed). LHC software areas done. Ongoing. Alastair has written a Nagios check. Done. 2010-01-13
A-20090506-04 High All Andrew Sansum Investigate disk server failure rates. Investigation started. Ongoing. Have something from Ian Collier by Christmas 2009. Closed. 2010-01-20
20091111-01 Medium T2K Shaun de Witt Investigate the possibility of setting up T2K to test file transfers on the gen instance. Closed. 2010-01-20
20091202-01 Medium All Martin Bly, Fabric team Prepare summary of disk failure rates (action A-20090506-04) for an "item 5" presentation. Presentation on January 20th. Closed. 2010-01-20
20090722-03 Medium All Alastair Dewhurst Gather a list of "requirements questions" for the experiments and circulate to the meeting list before the next meeting on 29 July. With Martin Bly, Matt Viljoen, Alastair Dewhurst and Derek Ross. Ongoing. Alastair has drafted something for ATLAS and passed it on to CMS to see if it can be applied to them as well. Alastair will also pass it on to LHCb. Raja working with Alastair on LHCb requirements. Presentation given on 10/2/2010. 2010-02-10
20100203-01 Medium All James Thorne Ask Martin for an update on 2009 CPU tender deliveries. For meeting on 10/2/2010. Martin gave update at meeting. 2010-02-10
20090304-01 Medium ATLAS Alastair Dewhurst Review and understand what Maui, Torque and Linux are doing regarding memory limits when killing ATLAS batch jobs. Ongoing. Is memory calculation including paged memory? Still a problem in SL5. Monitoring of jobs now in place. Closed. 2010-02-24
20100210-01 Medium All Shaun de Witt Ask Matt Viljoen to set up discussion on upgrade plans for CASTOR 2.1.9. Closed. 2010-02-24
20100217-01 Medium All Matt Hodges, James Thorne Clarify first deployment date for Viglen 08 kit. Report back on 2010-02-24. Closed. 2010-02-24
20100224-01 Medium All Gareth Smith Chase up status of backup OPN link commissioning. "The installation of the backup OPN link is expected to be complete by 31 March 2010, and it to be brought into service through April." Information provided by Robin Tasker. Closed. 2010-03-03
20100303-01 Medium CMS Shaun de Witt Check on status of James Jackson's work on migration policies. Closed. 2010-03-10
20100303-02 Medium All Matt Viljoen Circulate summary of 2.1.8/2.1.9 presentation given on 2010-03-03. Closed. 2010-03-10