GridPP PMB Minutes 376 (08.02.10)
=================================

Present: David Britton (Chair), Sarah Pearce (remote), Andrew Sansum, Tony Doyle, Jeremy Coles, Steve Lloyd, Tony Cass (remote), Glenn Patrick, Roger Jones, Dave Colling, David Kelsey, Robin Middleton (Suzanne Scott, Minutes)

Apologies: John Gordon, Pete Clarke, Neil Geddes

1. RAL Status
==============

AS reported that CASTOR had re-started on Tuesday morning and was running fine. Last week problems had arisen when moving back to the EMC hardware with multiple paths to the database. Although this had previously worked, it now caused problems. At present, a single-path layout had been adopted; this was still more resilient than the OverLand kit but was not as good as would be liked. A team, headed by Martin Bly, was working through the configuration to see what was broken - they had found nothing so far, but work was ongoing. The Post Mortem was being assembled, which would provide analysis for future decisions.

DB asked if the test was being carried out on the EMC kit. AS confirmed no, it was being done on loaned equipment. AS advised that they had also lost half a day to a known Oracle bug, which was compounded by a faulty piece of hardware. DB asked if the fix for the bug had changed or affected something else. AS noted it wasn't believed to be related to the recent problems.

AS advised that Oracle was very difficult to run - there were a lot of behaviours not fully understood in their configuration. These were compounded by CASTOR issues. DB noted that this represented a large cost to the project, in terms of both manpower and finance, to fix equipment, and also to pay for licences. For the long view, DB suggested that we needed to question the decision re Oracle and CASTOR - it was difficult to defend a long-term strategy when we did not have short-term success. AS noted that dCache was also problematic.
TD gave a counter-argument: that we had invested a lot of time and effort building a team with expertise, and other systems would require similar effort. DB noted that other Tier-1 sites had abandoned CASTOR and gone to dCache, supported by Fermilab, and they had managed to reduce their manpower. DC agreed that we do need to think about longer term issues, and it was possible that CASTOR was not the best solution.

2. GridPP4 Proposal: Outcomes of the OC meeting
================================================

DB advised that he had circulated a summary on Friday, which was taken from notes at the time. DB reported that the OC meeting had been a helpful one. It was clear that the PPRP will view our proposal as a 'regular' proposal; however, Peter Jones would try and assist this process, and had asked STFC for specific guidance for the PPRP. The RMR received today details a lot of information required within the Management space - this is a pointer on how to write the proposal. They had asked for Gantt charts, but after discussion, had agreed we don't need to provide these. DB noted a mismatch of expectations - we may have to write down how GridPP does actually work (in comparison with other 'regular' projects); however, there wasn't much time left for a major re-write. The token would pass to SL for the whole of next week. SP would feed-in management information to SL whilst DB was away.

ACTION 376.1 SP to feed-in management information to SL whilst DB is away (for the proposal document and in line with RMR information required).

3. Work Breakdown Schedule (WBS) and Risk Register
===================================================

SP had circulated documents. Re the Project Map, DB considered this looked sensible. He noted that Work Packages all have obvious Managers except WP-C. Could SP add the names? DB noted it might be possible to do this as an appendix.

DB suggested discussing the first lot of risk register points, to ensure some consistency of approach.
It was agreed that 'Site Operations' would be changed to 'Site Performance'. There was a discussion on the risk forms themselves (theirs and ours). DB noted that risks had to lead to contingency and working allowance; working allowance has to be above the line. Lower risks meant lower contingency, and lower working allowance, which ultimately has to be balanced against the number of posts requested. There was dubiety about the difference between 'existing' and 'current' on the form; also the effect of 'mitigation' - could SP check these terms?

ACTION 376.2 SP to check the Risk Register terminology, specifically the difference between 'existing' and 'current', 'inherent' and 'residual' on the form, also the effect of mitigation and how that should be correctly expressed.

1. Loss of custodial data at the Tier-1
- should be changed to 'Significant' loss
- likelihood 10%, impact 75%

2. Prolonged outage of the UK Tier-1
- ie: over 5 days
- likelihood 50%, impact 35%
TD noted that over 4 years of the project, this would be increased by a factor of 4.

3. Tier-2 sites cannot deal ...
- the inherent risk was high, but the residual risk was low
JC noted this should be 'middleware' or 'software' at Tier-2 sites (this risk was different to risk No 7). TD also noted to remove 'extra' from manpower note in 'mitigation'.

4. More consolidated Tier-2 structure ...
- 'to' all physicists, should be changed to 'for' all physicists
It was asked whether this was a real or an imagined disadvantage. It was a real disadvantage. DB noted we needed something here to justify the experiment support posts, eg: 'Risk that experiment software runs inefficiently on the Grid to the disadvantage of UK physicists'. The inherent risk was high but the residual risk was low.

5. Loss of experienced personnel
- DB noted this risk was mitigated by having 2 people at 'core' sites. The 'residual' risk for other sites was moderate. TD noted that the likelihood was still high, so the impact was also high.
These two 'high' risks should become 'moderate' risks at residual level.

6. Unplanned infrastructure costs
- This related to electricity and networking. DB noted it was spread over all sites at 20-30% and was therefore moderately low. This should go into contingency so it should be moved down the sheet.

7. Insufficient manpower to operate core Tier-2 sites
- The inherent risk was a low one as there were two people at sites. DC noted however that it was not zero risk. DB advised we couldn't make it too high as we can't get 8 more people from either the working allowance or contingency.

8. Failure to deliver or report EGI requirements
- The impact was considered to be very low.

9. Failure to retain or recruit ...
- This should be modified to state 'due to recruitment embargoes'. The owner should be SP or NG.

10. Loss or damage to hardware at the Tier-1
- This was similar to risk No 2? AS noted the impact as 100% but the risk itself was low at 10%. SP asked about water on the tape robot. AS noted this related to issues of hardware or finance. TD thought we should add scale of loss and a financial amount. The item TBC.

DB advised that all posts should be contained within the table as mitigating risks - and this was an effective way to defend posts. Sensible working allowance and contingency were required. It was agreed that SP would make the above changes and complete the rest of the table, and bring this back to the next PMB for discussion.

ACTION 376.3 SP would make the agreed changes to the STFC Risk Register and complete the rest of the table, and bring this back to the next PMB for discussion.

It was agreed that all owners would send text comments to SP by the end of the week, including numbers if possible, but 'low', 'medium' or 'high' was also fine, and the table would be checked at the next PMB.
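
Note on the figures: the percentages quoted in the discussion above can be combined in more than one way, and the minutes do not record which scoring rule the STFC form uses, nor whether a likelihood is per year or over the whole project. The sketch below is illustrative only. It assumes a simple likelihood x impact score and, for the second helper, a per-year likelihood that is independent from year to year; the function names `exposure` and `likelihood_over_years` are hypothetical. Under those assumptions, multiplying by the number of years is a good approximation only when the annual likelihood is small.

```python
# Illustrative sketch only: the scoring rule (likelihood x impact) and the
# per-year compounding are assumptions, not the STFC Risk Register method.


def exposure(likelihood: float, impact: float) -> float:
    """Return likelihood x impact, both given as fractions in [0, 1]."""
    return likelihood * impact


def likelihood_over_years(annual: float, years: int) -> float:
    """Chance of at least one occurrence in `years` independent years."""
    return 1.0 - (1.0 - annual) ** years


# Likelihood and impact as quoted in the minutes for risks 1 and 2.
risks = {
    1: ("Loss of custodial data at the Tier-1", 0.10, 0.75),
    2: ("Prolonged outage of the UK Tier-1", 0.50, 0.35),
}

# Score each risk under the assumed likelihood x impact rule.
scores = {
    number: exposure(likelihood, impact)
    for number, (_, likelihood, impact) in risks.items()
}

for number, (name, _, _) in risks.items():
    print(f"Risk {number}: {name}: score {scores[number]:.3f}")

# Compare linear scaling with compounding for a hypothetical small annual
# likelihood (2%) and for a large one (50%), over a 4-year project.
for annual in (0.02, 0.50):
    linear = 4 * annual
    compounded = likelihood_over_years(annual, 4)
    print(f"annual {annual:.2f}: x4 gives {linear:.2f}, "
          f"compounded gives {compounded:.4f}")
```

If the form expects a single project-lifetime likelihood, only the first helper is relevant and the per-year comparison can be ignored.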
ACTION 376.4 All: Risk Register owners to send text comments to SP by the end of the week, including numbers if possible, but 'low', 'medium' or 'high' was also fine, and the table would be checked at the next PMB.

STANDING ITEMS
==============

SI-1 Tier-1 Manager's Report
-----------------------------

AS reported as follows:

Fabric:

1) The disk drives on our problematic lot of disk servers were replaced over late December and early January. Acceptance testing is ongoing and the first servers have completed it. The remaining servers will come through the system by 17th February.

2) FY09 procurements:
- Delivery of the disk servers is scheduled for mid February with a second tranche from one supplier on March 4th.
- CPU deliveries are scheduled for mid February (problems leading to delays to one tranche have been resolved but we are waiting for a new delivery date).

3) Corrective work on the UPS room supply was carried out. A 50% improvement was achieved. We are waiting for the formal assessment of how to proceed.

4) There was a power failure into R89 on Friday at about 16:30 caused by a HV transformer trip. This took out all cooling to the LPD room (where the robot is) and power to 6 non-Tier-1 racks. Building services rapidly responded and bypassed the failed unit. Power was restored by 17:05. There was no impact on the Tier-1. We are still awaiting full details of the incident.

5) A site network intervention is scheduled for Tuesday morning. We will pause batch work over the period of the network break.

Service:

1) SAM test availability for the ops VO was 82%. This was due to the unscheduled CASTOR downtime.

2) The CASTOR service was opened for production use on Tuesday 2nd February after an extended unscheduled downtime following the service upgrade (move back to the production RAID arrays).
A draft post mortem is available at:
https://www.gridpp.ac.uk/wiki/RAL_Tier1_Incident_20100129
The current configuration is an improvement on what we had previously, but still shows unexpected behaviour and does not provide the full resilience we had expected. Multi-path could not be made to work and although we have mirrored EMC RAID arrays, if one crashes it brings down the ORACLE database rather than service continuing on the second. This is not yet understood. Investigations are underway to identify the cause of the problem.

3) The LFC, FTS and 3D services were successfully moved back to a second ORACLE RAC.

4) A number of other upgrades were carried out to the batch service during the scheduled downtime.

SI-2 ATLAS weekly review & plans
---------------------------------

RJ had left the meeting.

SI-3 CMS weekly review & plans
-------------------------------

DC reported that sites were on SL5 now; things were quiet generally. Monte Carlo had generated results.

SI-4 LHCb weekly review & plans
--------------------------------

GP noted LFC issues at CERN.

SI-5 Production Manager's Report
---------------------------------

JC reported as follows:

1) Ambitious targets have recently been set for moving to regional Nagios. The suggestion is that all ROCs move to their country instance (at CERN) from 12th February and then to the regionally based Nagios from March, aiming for a switchover by the end of March. Current plans suggest that we may lose some functionality such as the SAM submission portal for admins. This and the nature of default tests (does SAM still run under the ops VO) are currently under discussion. For reference our UK Nagios portal (hosted at Oxford) page is here:
https://gridppnagios.physics.ox.ac.uk/myegee

2) Recovery of the APEL database is taking time due to the large database size (100GB or so?).
Site data is still being published as usual and sites do not need to do anything different, but the summary portal pages will not currently display beyond mid-January.

3) The RAL frontier server for ATLAS has been reported as under-performing (factor of 3) as compared to the installs at other sites. This is under investigation.

4) Graeme reported that there is a need to work with APEL support to correct Glasgow’s absolute number of hours, which appear wrong in APEL. Other sites have been encouraged to cross-check their APEL vs local figures.

5) There were intermittent network problems at RAL last Thursday afternoon that caused the GOCDB to be unavailable for a short period.

6) Lancaster has raised a concern about their experience with sub-clusters under a single CE. There was a recommendation that each sub-cluster should have a separate CE but this was not evident to them when installing (we are checking the background communication and documentation on this topic). As a consequence they have had to close their SL5 queue to LHCb who were unable to cope with the multiple view of the software area. Both queues work fine for ATLAS. (Note: Oxford worked around the need for separate physical boxes by deploying CEs on VMs).

7) There has been a request for GridPP to support travel for a GridPP funded person to attend experiment shift training. There is clearly an advantage to GridPP if sites are aware of experiment computing operation activities, but also to the experiments if the person then is a shifter. The PMB needs to clarify the policy with regard to this sort of request - especially since we are moving towards more focussed experiment support in GridPP4.

This was discussed before RJ left the meeting: In general ATLAS fully expects to fund travel expenses for people doing training for ATLAS shifts. At present this is from institute budgets; after April this will be from the ATLAS computing co-ordinators budget.
However, in this specific instance, the GridPP person concerned was doing the training to gain a deeper understanding of ATLAS requirements and did not have any immediate plans to run ATLAS shifts. GridPP had thus agreed to fund the airfare and would continue to consider such requests in the future on a case-by-case basis. This was entirely consistent with the established GridPP travel policy of providing (typically matching) funding for GridPP staff to attend experiment-specific jamborees, and no change was necessary to the policy.

8) In running the HEPSPEC06 benchmark on SL5 it has become apparent that the options chosen need to be consistent between sites. At the moment sites are using subtly different settings that will lead to different performance values (SL5 is not selectable directly as the OS). To address this the deployment team are reviewing the options to put forward a recommendation.

SI-6 LCG Management Board Report
=================================

There had been no MB. No PMB rep could attend for the next meeting, where it was noted that RAL issues would be discussed.

SI-7 Dissemination Report
==========================

There were no issues to report.

REVIEW OF ACTIONS
=================

354.2 JC to consult with site admins on a framework policy for releases, with a mechanism for escalation, plus a mechanism for monitoring. JC reported that the consultation happened. There were a few suggestions in the deployment team about how to progress in this area. It needs writing up and an implementation plan. ONGOING.

366.8 AS to confirm that the Tier-1 proposes to use Tape-based storage in the period 2011 - 2015. AS noted this depended on money costs. DB advised this related to long-term plans and power capacity. Physical footprint space? Alternatives? Early action on AS required. AS had sent tech questions round the team and would forward inputs when available. ONGOING.
DC noted to the meeting that today was the 16th Nov - only 4 weeks remained until Imperial, by which time we needed to have made extensive progress. To be discussed at the F2F on Friday. AS noted that alternative further costings were required. AS to progress.

367.2 RM to fill-in the grey boxes on DB's UK NGI diagram of a minimal NGI, as to what NGS would be doing in the areas listed. ONGOING.

374.11 Re (8.3 Data Support): Re point (2.) on p23, it was agreed DB to move the second half of the sentence (commencing: 'This post would ..' to '.. the resources') to the section above, which would help introduce the post descriptions. DB would check this. Done, item closed.

375.1 GP to provide post descriptions for experiment-specific posts in Appendix A. GP would forward this today.

375.2 DB to co-ordinate post descriptions for the Tier-2 posts, which should be as unique as possible in order to present a strong case. ONGOING.

375.3 TD to do the data posts. TD would do this by Friday and forward to SL.

375.4 PMB ALL: those relevant to do their own post descriptors. ONGOING.

375.5 DB to do the Admin Asst post. ONGOING.

375.6 SP to do the Impact post. Done, item closed.

375.7 SL to put v9.3 of the GridPP4 proposal with the other CB documents. Done, item closed.

375.8 SP to organise a blank copy of the Project Map, and iterate with DB on the work breakdown and schedule. Done, item closed.

375.9 RM to provide a skeleton outline plan, including post details, of GridPP/NGS convergence. ONGOING.
[Previous action background: SP to work with the working group on the following issues in relation to GridPP/NGS convergence:
1. identify Institutes
2. identify manpower
3. decide who is bidding for what
- a draft transition plan would be made available by the end of the year; GridPP4 requirements would also be considered. SP was waiting on the Working Group to reply to her. A meeting had been held before Christmas re a transition plan.
SP was awaiting a skeleton outline plan from RM, allocating people to sections. This action to be re-allocated to RM. Done for SP - action closed.]

ACTIONS AS AT 08.02.10
======================

354.2 JC to consult with site admins on a framework policy for releases, with a mechanism for escalation, plus a mechanism for monitoring. JC reported that the consultation happened. There were a few suggestions in the deployment team about how to progress in this area. It needs writing up and an implementation plan.

366.8 AS to confirm that the Tier-1 proposes to use Tape-based storage in the period 2011 - 2015. AS noted this depended on money costs. DB advised this related to long-term plans and power capacity. Physical footprint space? Alternatives? Early action on AS required. AS had sent tech questions round the team and would forward inputs when available. DC noted to the meeting that today was the 16th Nov - only 4 weeks remained until Imperial, by which time we needed to have made extensive progress. To be discussed at the F2F on Friday. AS noted that alternative further costings were required. AS to progress.

367.2 RM to fill-in the grey boxes on DB's UK NGI diagram of a minimal NGI, as to what NGS would be doing in the areas listed. Ongoing. RM reported that he had met with Andy Richards, but there was more work to be done and nothing definitive had been decided. There were ongoing discussions about EGI effort but no direct answer. DB noted that the PPRP needed to ensure they were not funding beyond the GridPP remit, and that GridPP were not under threat if NGS4 did not get funded. DB advised that this issue was important, and information on the UK NGI and NGS remits would be needed by Wed 24th February when the final version is submitted. DB would circulate the version of the proposal to SL on Friday, who would have the token until DB returned. Comments to SL next week.

375.1 GP to provide post descriptions for experiment-specific posts in Appendix A.
GP would forward this today.

375.2 DB to co-ordinate post descriptions for the Tier-2 posts, which should be as unique as possible in order to present a strong case.

375.3 TD to do the data posts. TD would do this by Friday and forward to SL.

375.4 PMB ALL: those relevant to do their own post descriptors.

375.5 DB to do the Admin Asst post.

375.9 RM to provide a skeleton outline plan, including post details, of GridPP/NGS convergence.
[Previous action background: SP to work with the working group on the following issues in relation to GridPP/NGS convergence:
1. identify Institutes
2. identify manpower
3. decide who is bidding for what
- a draft transition plan would be made available by the end of the year; GridPP4 requirements would also be considered. SP was waiting on the Working Group to reply to her. A meeting had been held before Christmas re a transition plan. SP was awaiting a skeleton outline plan from RM, allocating people to sections. This action to be re-allocated to RM. Done for SP - action closed.]

376.1 SP to feed-in management information to SL whilst DB is away (for the proposal document and in line with RMR information required).

376.2 SP to check the Risk Register terminology, specifically the difference between 'existing' and 'current', 'inherent' and 'residual' on the form, also the effect of mitigation and how that should be correctly expressed.

376.3 SP would make the agreed changes to the STFC Risk Register and complete the rest of the table, and bring this back to the next PMB for discussion.

376.4 All: Risk Register owners to send text comments to SP by the end of the week, including numbers if possible, but 'low', 'medium' or 'high' was also fine, and the table would be checked at the next PMB.
INACTIVE CATEGORY
=================

359.4 JC to follow up dTeam actions from the DB, as follows:
---------------------------
05.02 dTeam to try and sort out CPU shares and priority resources, at Glasgow first (perhaps by raising the job priority in Panda).
---------------------------
JC would check the situation with Graeme Stewart (who was currently on annual leave). JC followed up with Graeme and the other experiments. A test was started but this area has been deemed low priority and further progress is not expected for some time. ATLAS see no issues with contention. LHCb are not intending to pursue anything in this area. A CMS discussion has started but again it does not appear to be urgent. If the experiments are not pushing this internally then there is nothing for the deployment team to follow up! It was noted there was no priority in ATLAS at present; this will be pending for a while. Move to inactive as it is a long-term action.

---------------------

DB would circulate the latest version of the GridPP4 proposal to SL on Friday, who would have the token until DB returned. Comments to SL next week.

The next PMB, with JG chairing, would take place on Monday 15th February.