GridPP PMB Minutes 345 - 27th April 2009
========================================

Present: David Britton (Chair), Sarah Pearce, Tony Doyle, Andrew Sansum, Dave Colling, Roger Jones, Glenn Patrick, Steve Lloyd, John Gordon, Robin Middleton, Jeremy Coles, Pete Clarke (Suzanne Scott - Minutes)

Apologies: David Kelsey, Tony Cass, Neil Geddes

1. GridMon
===========

SP reported that she had chased Robin Tasker regarding this. RT had got back to her saying that Mark Leese would email her. ML contacted SP to say that he was currently working 6 days pw on secondment to JANET, but was hoping to look into GridMon within the next 7 days. There followed a discussion on how network deliverables could be addressed. It was felt that GridMon was valuable and worth maintaining. JG offered to discuss a way forward with RT.

ACTION 345.1 JG to speak to RT regarding GridMon and GridPP funded network effort.

2. Studentship Applications
============================

DB reported that GridPP had received 5 Summer Studentship applications. DB noted that we were seeking the best projects for the amount of money required. JC asked whether we were aiming to support one, or two, of these applications? DB advised that some were more expensive than others, and asked RM to what level we could commit funds? RM advised that we could afford more than one, based on past years' records. DB also asked how much it was worth spending on these projects in comparison with the Collaboration Meeting costs, which were ~£10k? SP and SL agreed that we should not fund them all. SP felt that funding one, or two, would be useful. DB suggested funding two in total - a 'cut' would be applied to the funding, based on current circumstances, and the quality of the applications would be evaluated.

DB noted the issues as:
1. comment on quality
2. examination of costs
3. other institute funding might be available, so we could fund at a certain level, knowing that they could make up the difference

It was agreed to return to this issue later in the meeting, once the email from GP enclosing a late application had been circulated & received.

3. Week's Notes
================

- DB advised that a paper submitted to the STFC Operations Board in relation to EGI.org, and associated costs, had been discussed and approved for an STFC contribution of up to £10k per annum for the next two years to the end of GridPP3.
- DB reported on STFC staff changes: Janet Seed had been appointed Associate Director of the Science Programme (continuing to deal with this year's funding round); Deborah Miller would be taking Janet's position at least temporarily.
- DB noted that there had been an STFC statement relating to Economic Impact (EI) - and grant applications should from now on include a statement/report on EI.

STANDING ITEMS
==============

SI-1 Tier-1 Manager's Report
-----------------------------

AS provided the following report:

Fabric:
1) The R89 machine room is considered to be suitable to accept equipment. A summary of its capabilities will be distributed when it becomes available.
2) Migration to R89 is expected to commence (subject to pending supplier confirmation) on Monday 22nd June and take 2 weeks.
3) Disk, CPU and robotics deliveries are being scheduled. We expect to install w/b 18th May.
4) Some user files and directories were accidentally deleted from the home filesystem after an installation script was run at an inappropriate point in the installation cycle. More protective measures have been added to the script and a post mortem will be carried out. It became clear that this only affected a very small number of users who have now been notified individually.

Staffing:
1) The first experiment support post has been accepted. I have been verbally notified that STFC has approved the second post for advertisement.
2) Interviews tomorrow for the EGEE funded PPS post.
3) The YII student (funded by ESC) has informally notified us that they will accept.
4) The CASTOR d/b admin has not yet been approved for external recruitment. An internal search for suitable staff is underway.

Service:
1) SAM availability last week was 100%, WLCG availability (ops VO) for March was 100%.
2) CASTOR
a) We have received CERN's explanation of the crosstalk problem (Steve Lloyd circulated a similar summary). We are considering our options to manage the problem in the existing release (automated detection/cleanup) and are waiting to know what CERN's medium term plan to manage the situation is.
b) The ORACLE database RAID array upgrade tests have now been carried out. We expect the upgrade to take 2 days. It is expected to be scheduled in mid May.
c) There will be a scheduled "at risk" to CASTOR on 28th April to upgrade VDQM2 (which manages the tape drives).
d) There will be a scheduled "at risk" for CASTOR on the 29th April to roll out Oracle patches.
3) The FTS and LFC are planned to be upgraded to new hardware on 6th May. This will lead to a 1 day downtime of these services.
4) A rolling re-configuration/upgrade of the CEs is underway:
Lcgce03 - 27/04/09-30/04/09 Atlas CMS
Lcgce04 - 05/05/09-08/05/09 CMS LHCb Alice
Lcgce05 - 11/05/09-14/05/09 Atlas LHCb
5) Work on an SL5 service was underway. A test service was nearly ready for deployment.

RJ noted that upgrading to SL5 before STEP was tricky; ATLAS would then be doing cosmics, so the upgrade would be disruptive. JG advised that LHC at CERN wanted all sites to have a step system to check queues etc. AS reported that no downtime was foreseen, but he agreed that there was not really enough time for a smooth rollout. DC noted merits to having SL5 but CMS were doing cosmics too. GP noted that minimising the risk was the key issue. DB felt it was better to deploy after STEP09 and get the experiments to test it as they could.
It was agreed to wait meantime before deploying SL5.

SI-2 ATLAS weekly review & plans
---------------------------------

RJ had circulated plots showing ATLAS status for the last 7 days, showing SL's test status with summary numbers, then availability via the SAM tests. Typically, these showed issues at sites which involved ATLAS. RJ highlighted problems at Lancaster, Manchester and Birmingham with queueing. RJ was hoping to be able to produce these weekly for info. RJ noted that things had looked good in the reprocessing exercise, but they were doing a lot of that from disk rather than tape, therefore it looked like they performed better. RJ advised they were doing Hammercloud tests with the Tier-2s and will be doing more. The Ganga I/O problem was being looked into.

SI-3 CMS weekly review & plans
-------------------------------

DC reported that the Tier-1 RAL had been available last week ok; Tier-2 was also ok but Imperial were upgrading their SE. It had been a quiet week overall; they were currently addressing communication issues.

SI-4 LHCb weekly review & plans
--------------------------------

GP reported as follows:
1. Problems seen on the RAL WMS with some roles incorrectly defined. Fixed last week and has been working fine since then. Problems still seen on GridKa and PIC WMS-es.
2. Problems at Manchester (NFS/SQLite) and UCL (VOMS) which are being investigated by the site admins.
3. Most UK Tier-2 sites have not implemented the pilot role as requested in the LHCb VO-card at https://cic.gridops.org/

DB advised that if the request had been made, this should not have been generally ignored. JC had replied to this point by email, noting that the issue had not been raised at recent relevant meetings. JC asked where the request had been originally made? JG reported it had been made at the GDB. DB suggested that this request had not cascaded outward successfully. It was agreed that JC would follow this up and report-back.
ACTION 345.2 JC to follow-up the issue of UK Tier-2 sites not implementing the pilot role as requested in the LHCb VO-card - and report-back to the PMB as to why this had not been generally actioned.

Current status and outlook:
1. New versions of LHCb application software (Gauss, Boole, Brunel) released last week and are being tested now. Some systematic crashes seen and are being debugged by the experts. No site-specific problems so far.
2. Still plan to start "large" productions of events late this week / next week. Will know the status better after today's (Monday) operations meeting at 1:30 PM.
3. User analysis - averaged about 200 jobs-a-day at RAL over the last two weeks.

SI-5 Production Manager's Report
---------------------------------

JC reported as follows:
1) ATLAS has started to run increased loads for their hammer cloud work. Running these in parallel has uncovered i/o issues for Ganga. Initial tests/results show that high SE loads result in "unmanageable" situations for some current site configurations. SAM tests are seen to fail at sites when these loads are reached. There has been an extended discussion on TB-SUPPORT about the accessibility of information for sysadmins which details the nature of and inner workings of the hammer tests.
2) JG asked about (Thursday 23rd) removal of this page: http://goc02.grid-support.ac.uk/googlemaps/gridpp.html. To the best of my knowledge this is not being actively used by any GridPP operations work. I support John's conclusion that we can remove it.
3) The Manchester site came close to suspension last week following lack of activity on a GGUS ticket. The site staff were away and had indicated in the ticket that there would be no progress. The ticket relates to publishing of accounting data. The matter is taking a long time to resolve due to the need to rebuild their accounting database.
4) Liverpool has raised a concern about the lack of a 32-bit release for WNs in future releases.
There are currently no plans to build gLite 3.2 SL5 in a 32-bit configuration. Until now no site had come forward with concerns in this area but it seems possible that the scope of the move to 64-bit SL5 was not properly understood by sites (i.e. they thought 32-bit would still be an option). We need to clarify if this affects other sites, understand the lifetime of the gLite 3.1 SL4 32-bit release and the expected migration dates to SL5 for the experiments.

ACTION 345.3 JC to survey the rest of the Tier-2 sites re 32/64bit and SL4/5 hardware capabilities to find out if this problem is widespread.
345.4 JC to ascertain from the experiments an indication of timescale in relation to the move to 64-bit SL5 (although it was expected that this would not be an issue currently) - JC to alleviate Liverpool's concerns meantime.
345.5 GP and RJ to check their experiments' timescales in relation to the move to 64-bit SL5.

JG advised that regarding availability reports at the CB, WLCG were weighting availability of the Tier-2 by the number of CPU at sites - however a lot of sites were reporting physical 'zero' CPU (in relation to GSTAT multiple CEs being set to zero). JG wished to flag this issue to the PMB.

SI-6 LCG Management Board report
---------------------------------

JG advised that the main issue had been the Quarterly Reports. Regarding tape metrics, Ian Bird was requesting the Tier-1s to show the method of how they were going to demonstrate they had met the bandwidth installed - STEP09 would be illustrative in this regard. The weekly average at present was not large, because tape use was not sustained. AS reported that monitoring was limited, they couldn't get things in on time. JG asked what if the experiments said the Tier-1 wasn't delivering? AS advised that monitoring was not too detailed, but there had been a lot of work done to upgrade the method - network rates were ok but the flow was difficult.
DB noted that the ALICE Quarterly Report was relevant to GridPP. JG further reported that there had been a discussion on high-level milestones, relating to S-CASS deployment. The PMB had a brief discussion about the CREAM CE.

SI-7 Dissemination Report
--------------------------

SP reported that Neasan O'Neil had now put some blue 'help' buttons on the GridPP website front pages, and was doing work on the help sections behind this.

2. Studentship Applications
============================

The meeting returned to considering the applications received. Five applications had been received, from Imperial (x2), Birmingham, Brunel and Lancaster. The meeting considered these in the order in which they had been received.

Brunel
------

DB noted that this was CMS-specific, and would cost 2500CHF. Relevance to GridPP? SP, SL & RJ all considered this to be the weakest application.

Birmingham
----------

DB summarised that this was a request to develop gadgets to monitor processes on the Grid. The proposal seemed well put-together, was potentially useful, flexible, might produce one or more gadgets, and was a modest request. Birmingham could contribute 25%, which was good. PC & SL agreed that this was 'do-able' by a student in the time allowed. JG asked if it was similar to the Grid Observatory? DC advised that the student could work with them, and that this was a good area of work for a summer studentship.

Lancaster
---------

DB noted that this proposal related to virtual machine tools. Questions were: how ATLAS-specific was this; and could a realistic contribution be made in 10 weeks? JG noted the ATLAS element was reasonable, but agreed on the question about achievability - he noted that work like this was already going on in other places, but not the CERNvm. DB asked if work was going on in this area with HPC? RJ noted in relation to NGS and worker nodes, virtualisation was needed and this therefore created links with HPC. RJ advised that part-funding would be acceptable.
DB noted that this was a good piece of work that needed to be done, but perhaps was ambitious for a summer student. PC advised that virtualisation was a major topic, but if someone did this, and left, what was the continuity to ensure the work was not lost? A plan was needed to engage with output. DC asked whether there were plans to continue this work at Lancaster afterwards? RJ noted yes, certainly within Lancaster.

DB summarised that, thus far, the consensus seemed to be 1 x No, 1 x Yes, 1 x Possible.

Imperial
--------

DB noted that this was for the 3D RTM. There was a long connection here between GridPP and the RTM - in both 2D and 3D versions. Questions were: would a summer student really solve the long-running problems with the 3D version? DC advised that they were looking to employ a student to develop a version that runs under an OpenGL emulator that would solve the general problem, rather than fire-fight the known problem cases. DB asked about the longer-term support of the RTM, and funding? DC wasn't sure as yet - to the end of EGEE certainly, then unknown. Possibly through the EU thereafter, or the Dissemination budget. SP noted her interest in this proposal, as she had tried 4 times today to get the RTM to install. It was a high-profile tool that she personally felt was useful and if it could be made to work more reliably, that would be good. TD noted that we would be likely to find the right sort of student for 3D-graphics - it was top of the five proposals for him - the RTM was visible and worked in the past, it needed to continue to work, and injection of effort to deal with problems was good - this was an important task for GridPP, and this fact needed to be impressed upon the student concerned. DC advised that there would be follow-on afterwards; they could employ someone for a few hours per week to provide ongoing support.
PC had seen it used, noted it was difficult to get it going, but believed that it had been useful so far and was worth continuing to invest in. DC noted in passing that it had been the very first student involved in this who had got the RTM going in the first place. RM noted that the RTM was often the first thing you saw when you went to conferences. JC also agreed that this was worth doing, and asked about EGEE funding? None was available. DB summarised that for this proposal there was general support across the PMB - it would be good for us to retain ownership of this project.

Imperial
--------

DB noted that this was the third year that Ganga had applied for a studentship - it addressed the issue of generalised interface for output, and was a user-driven modification. However, 8 weeks seemed ambitious. SL noted that only three actual weeks were being allocated to the work - other weeks were for familiarising and writing-up etc. TD noted that work was being done on this area anyway, elsewhere. SL believed that it needed someone who already knows how to do it, to spend the full 8 weeks on improvements. PC noted that this was core LCG stuff - it's got to be done anyway - there were other proposals which were not in that category and wouldn't be done if we didn't fund them. DB concluded for all that this was important work, it would be done anyway and GridPP already funds Ganga at Imperial.

DB noted that, thus far, there was support for the Birmingham proposal, the RTM at Imperial, and next Lancaster; the other 2 proposals had received less support. In terms of financing, the Birmingham cost was £1200 and the Imperial cost was £1800, a total of £3k. DB asked whether the PMB felt they could fund a third proposal (at Lancaster), or not? This would be an additional £2300. RJ advised that less funding would still be acceptable at Lancaster.
DB advised that the high-level issue was that future financing was murky, but we were talking about £1-2k here of availability to support studentships, which wasn't much. The case for the third studentship at Lancaster was not overwhelming, although it could be useful work. He asked whether the PMB wished to draw a line at funding two; and to fund the third proposal in part or whole? TD advised that it was too Lancaster-specific - the ideas needed to be fleshed out and onward relevance 'outward' demonstrated, i.e. dissemination at HEPSYSMAN or whatever. PC noted that it should be written-up as a publicly-available report, the output needs to be given to others. JG noted that there were, in addition, other things happening on virtualisation elsewhere. DB concluded that the PMB were 'sitting on the fence' re the Lancaster one - we need to know how this work contributes to the field and GridPP - if it were only an internal Lancaster thing then it was the same as many other groups - there had to be an element of dissemination and a wider value. He asked if RJ would be prepared to re-think/re-write the proposal and bring it back to the next PMB? RJ confirmed yes.

DB finalised that GridPP would fund Birmingham and the RTM at Imperial. The other two would be declined.

AOB
===

SL advised that, re the ATLAS report, SL's tests looked worse than the availability and reliability for the CEs, because SL's tests go to every CE he can find - whereas the latter (SAM) tests are ORed over the CEs at a site. The SL ATLAS tests are diagnostic whereas the SAM tests show how the Tiers are being judged by ATLAS. DB advised that it was a question about whether we put these plots in the Minutes. RJ noted he could add-in availability and reliability. It could be judged next time.

Re the next PMB meetings in May, DB advised that Mondays (due to holidays and absences) were unavailable. It was agreed that the next PMB meetings in May would take place on Tuesday 5th, Thursday 14th and Thursday 21st.
There would be no PMB week commencing 25th.

The meeting closed at 2.35 pm

REVIEW OF ACTIONS
=================

332.1 AS to provide a plan for the tape drives: The experiments would be producing new numbers by ~end of March which would enable this action to progress. O N G O I N G.
332.3 PC to pursue the issue of the network resilient link - providing installation costs and annual costs, and report-back to the PMB. PC had sent info round, and Robin Tasker would provide further info at GridPP 22 UCL. The issue would need to be referred back to the OC. Action DONE, but left open as placeholder.
336.1 JC to document procedure for ensuring black-listed sites are re-instated. JC noted that experiments have different ways of 'blacklisting' (re top level results & switches etc). O N G O I N G.
339.8 JC to follow-up VO Registration cards. JC reported that some VOs need to be decommissioned, there were also VOs at institute-level but these do not appear on the CIC operations portal. O N G O I N G.
341.1 AS to review resilience of services that may have to remain in the ATLAS building. DONE, item closed.
341.3 NG to consult/inform GRIDPP PMB on responses to EU questionnaires. O N G O I N G.
341.4 JG to establish new date for the wLCG-ORACLE meeting. DONE - meeting unlikely, Oracle seminar might be more possible.
341.5 JC to investigate how EGEE VOs request resources (relates to enabling more VOs in the UK). O N G O I N G.
341.7 SP to chase non-functioning of GridMon with Robin Tasker and report back. DONE, but here as placeholder - Mark Leese still to respond.
342.1 DB to circulate a draft letter from GridPP to Richard Wade in relation to the pending 2nd experiment support post at RAL [done following the meeting]. DONE, item closed.
342.2 SP to add-in a morning discussion time about GridPP4 to the F2F meeting; also to check if there were any issues arising from the Quarterly Reports that needed to be discussed. DONE, item closed.
343.1 GP to follow-up on the ILC experience to ensure issues are addressed. DONE, item closed.
343.2 GP to discuss new users coming to GridPP with NO (possibly also SB) regarding the GridPP website and making relevant changes that would assist them. DONE, item closed.
343.3 Re IMENSE & the iLexIR N-gram enquiry, DB to draft a response to Andy Parker supporting the request to use the Tier-1 before the LHC turns on. DONE, item closed.
343.4 JC to raise the N-gram issue at dTeam and update the VO ID card.
343.5 SP to contact Tier-2 P.I.'s to discuss Tier-2 grant timing. DONE - DB to respond to Janet Seed.
343.6 JC/JG to ensure that the joint taskforce meet and consider issues: work was clearly required on the table presented by AR - more columns needed to be added: NGS approach, GridPP approach, an outline map of funding, service overlaps. O N G O I N G.
343.7 AS to look into the CASTOR/ATLAS instance availability (in the light of the experiment red metrics). O N G O I N G.
343.8 TD to follow-up overdue milestone in relation to storage usage per user in a VO. TD noted that a report was awaited from Jens Jensen. O N G O I N G.
343.9 DB to contact Akram re the summer student call. DONE - calls will be considered.

ACTIONS AS AT 27.04.09
======================

332.1 AS to provide a plan for the tape drives: The experiments would be producing new numbers by ~end of March which would enable this action to progress.
332.3 PC to pursue the issue of the network resilient link - providing installation costs and annual costs, and report-back to the PMB. PC had sent info round, and Robin Tasker would provide further info at GridPP 22 UCL. The issue would need to be referred back to the OC. Action DONE, but left open as placeholder.
336.1 JC to document procedure for ensuring black-listed sites are re-instated. JC noted that experiments have different ways of 'blacklisting' (re top level results & switches etc).
339.8 JC to follow-up VO Registration cards.
JC reported that some VOs need to be decommissioned, there were also VOs at institute-level but these do not appear on the CIC operations portal.
341.3 NG to consult/inform GRIDPP PMB on responses to EU questionnaires.
341.5 JC to investigate how EGEE VOs request resources (relates to enabling more VOs in the UK).
341.7 SP to chase non-functioning of GridMon with Robin Tasker and report back. DONE, but here as placeholder - Mark Leese still to respond.
343.4 JC to raise the N-gram issue at dTeam and update the VO ID card.
343.6 JC/JG to ensure that the joint taskforce meet and consider issues: work was clearly required on the table presented by AR - more columns needed to be added: NGS approach, GridPP approach, an outline map of funding, service overlaps.
343.7 AS to look into the CASTOR/ATLAS instance availability (in the light of the experiment red metrics).
343.8 TD to follow-up overdue milestone in relation to storage usage per user in a VO.
344.1 DB to respond to Janet Seed's request, but highlight to her that procedure for grant release needs to be fairly speedy, and not take 9 months.
344.2 SL to raise the benchmarking issue at dTeam in relation to the DB Minutes.
344.3 JC to follow-up concerns with dTeam, over security and bandwidth in relation to Camont needs at UKI sites.
344.4 JC to contact Camont and seek assurance from them that they have thought-through issues of security/reliability/commercial use of networks, plus other concerns raised about content.
344.5 TD to raise the 'inactive category' action 282.8 with SL (cc RM) re adding to the quarterly report. This needs to be documented.
344.6 GP to contact the ILC community and give our support, highlighting that there may be possible contention issues re STEP09 and fairshares.
345.1 JG to speak to RT regarding GridMon and GridPP funded network effort.
345.2 JC to follow-up the issue of UK Tier-2 sites not implementing the pilot role as requested in the LHCb VO-card - and report-back to the PMB as to why this had not been generally actioned.
345.3 JC to survey the rest of the Tier-2 sites re 32/64bit and SL4/5 hardware capabilities to find out if this problem is widespread.
345.4 JC to ascertain from the experiments an indication of timescale in relation to the move to 64-bit SL5 (although it was expected that this would not be an issue currently) - JC to alleviate Liverpool's concerns meantime.
345.5 GP and RJ to check their experiments' timescales in relation to the move to 64-bit SL5.

INACTIVE CATEGORY
=================

282.8 RM to monitor how R-GMA and networking issues impact on GridPP as matters progress. RM advised that this item should be moved to the 'inactive' category as it will develop over the coming months. RM discussed the issue with Steve Fisher and advised that support of R-GMA is required whilst APEL is dependent on it. RM reported that he has spoken to SF and there is currently no change to the R-GMA situation - process ongoing. RM advised that a small amount of effort was going into R-GMA on APEL but for the long term he wasn't sure. The item needed to be kept here for review from time to time, and required to be re-visited around Easter 2009.