RAL Tier1 Grid Team Closed Actions

From GridPP Wiki

This is a Wiki area for closed Tier-1 Grid Services Team actions.

See also RAL Tier1 Grid Team Actions.

Action ID prefix
G = From Grid Services Team Meeting
T = Added by Team members or Team Leader
P = Created by other project members
R = Created by UKI ROC/Production Manager

Status
Open = Action has been created
Progress = Action is being worked on
Closed = Action is complete
Rejected = Action is rejected


Action ID Priority Owner Action Title Target date Status Date closed Notes
G-01 High Catalin RB Availability poor in GRIDVIEW Closed RB is running at 100%. Will schedule downtime on the RB for work to be completed by 27th October. Will consider hardware upgrade possibilities including mirror disks. Cleanup completed but now running at over 95%. Will deploy a second RB - waiting on hardware from Martin. Catalin will ask Martin again and we will review at the Monday meeting.[10.11.06]Second RB is available, basic dteam test complete, going to test with Alice. Will review load on existing RB once use of 2nd climbs and decide if matters have been improved.[16.11.06]Alice and LHCb asked to switch to the second RB. Some additional CPU load on RB2, but no decrease at all on RB1.[21.11.06]Biomed asked to use RB2 but, much more importantly, two rogue processes were found to be producing the high CPU load on RB1. They were killed and CPU load decreased significantly; it is now less than on RB2.
G-02 High Derek dCache tape SE availability mediocre in Gridview Closed 23/03/07 Load problem? Investigating but no progress.[08/12/06]Await further resources (disk)[30/01/07]Additional tape cache server is just about to be added.[16/2/07]Has been better (in SAM) recently, but down last night - Derek investigating. Upgrade to 1.7 planned for 1st March.[2/3/07]dCache was upgraded on Thursday and we are waiting to see what impact this has had.[8/3/07]does not appear to have improved the situation. Some problems may be caused by gridftp door hangs - still under investigation. We will also put additional disk capacity into the disk pools for tape once (and for as long as) spare capacity is available.[23/03/07]Following the Java 1.5 upgrade gridftp doors have stopped crashing and availability is much better.
G-03 Medium Matt FTS performance reported to degrade from time to time Closed Several interventions were needed last week and dteam crashed it. Will consult FTS developers next week and come up with a plan. Will discuss hardware requirements with Martin. Probably will need 2 web front ends and 2 back end nodes. Has mailed Martin regarding hardware needs - reconfiguration will not happen until next FTS release. [10/11/06]FTS performance fine at present. Possible issue resolved by rolling update. Other FTS work part of improving FTS resilience.
G-04 Low Matt FTS intermittently monitored in Gridview Closed 01/12/06 Tests are intermittent! Just keep an eye on this until it's working. Still unreliable - continuing to monitor.[01/12/06]Test no longer intermittent.
G-07 Low Matt Rationalise CMS VO servers 13/4/07 Closed Several CMS services on different boxes. Need to terminate dev08. Will schedule rationalisation after CSA06 finishes. Need plan of what to do.[10/11/06]Will investigate possible intervention during CASTOR upgrade.[24/11/06]Will begin next week[02/12/06]Need to procure hardware, plan agreed with CMS[08/12/06]Testing services, expect deployment early next year[12/01/07]Waiting on hardware - will prompt Martin[19/01/07]Reminded Martin about hardware requirement[2/3/07]Matt reminded Martin today to drain a system so that migration could start next week[23/03/07]Monalisa has moved off ganglia. Frontier installed on VO box, but needs testing after firewall change.[30/03/07]Migration completed - waiting for CMS to test. Tested and working
G-08 Medium Matt ATLAS disk pools are full 30-11-06 Closed Some cleanup underway. Matt will ask CMS if they can release dCache space for ATLAS. Ownership moved to Matt who needs to discuss with UB where priority for spare capacity rests. ATLAS are waiting for extra resource. CMS are well resourced at the moment but we are worried about what happens if the T2 takes the disk back. Next H/W delivery looking good and should fill the hole (we hope).[10/11/06]Have requested CMS release but they are still checking.[24/11/06]CMS attempted to move files into CASTOR but too slow - now out of time. Derek will move CMS disk files to tape[01/12/06]1 system moved to Atlas, downgraded to high[15/12/06]Downgraded to medium, 7TB moved to Atlas, will continue to free space[22/12/06]Derek continues to free up space over xmas[12.01.07] Derek has now released a further 3TB (the last ATLAS will get from this process)
G-09 Low Andrew Circulate LCG security policy 10-11-06 Closed Catalin mentioned that he had been unaware of log retention policy - RAS will circulate the URL to the policy. Not yet done.[10/11/06]Done.
G-10 High Andrew Automatic DNS updates 10-11-06 Closed 12/12/08 Ask networking if we can have some means of live updating our DNS records. Was unable to ask at the network meeting as we ran out of time. Will do by email.[10/11/06]Done - will report on response.[15/12/06]Raised with networking, asked for progress by end of March [30/01/07]Should also consider HA Linux as a local solution[8/3/07]We have agreed to schedule a meeting with networking to discuss possible solutions[11/5/07]We met but the meeting ran out of time - RAS will reschedule the meeting.[29/06/07]A meeting is scheduled for late in July[27/07/07]Network Group can provide access to a mysql database to allow this. Derek will liaise with Chris Seelig to get something working.[21/9/07]Raised to medium as CASTOR team find they need this.[16/05/08]Andrew to follow up with Networking regarding the status of this[30/05/08]Not clear that a solution is imminent, need to consider other solutions - ip address adoption perhaps.[19/9/08]In principle networking are prepared to give us access to the DNS. [31/10/08]Moved to Andrew[12/12/08]Closed
G-11 High Andrew Priority order for Service hardening 10-11-06 Closed 2008-02-15 RAS will mail round priorities. No progress.[2008-02-15]Track in Delivery Plan.
G-12 Medium Team Identify workplan for individual services 30-11-06 Closed 2008-02-15 Team will identify priority order tasks needing completion to harden systems. Discussion postponed until Derek gets back.[2008-02-15]Track in Delivery Plan.
G-13 Medium Matt Install test gLite UI 30/11/06 Closed 01/12/06 Provide test UI by end November.[24/11/06] Kickstart done, still checking compatibility of glite-* commands vs edg-* commands, Pheno testing.[29/11/06] LCG RB is not upwards compatible, i.e., glite-job-submit fails. gLite RB is backwards compatible, i.e., edg-job-submit works. RAL CE now working with gLite RB (rb102.cern.ch).
G-14 Medium Catalin Install gLite test RB 30-11-06 Closed Provide test RB by end November. Will ask Martin for hardware.[10/11/06]A request was made but it was not yet possible to provide this resource. We noted the problem but need to deploy this system - Catalin will escalate.[21.11.06]No news, but if G-01 concludes that only one RB is necessary at RAL, then the second RB can be re-deployed as gLite RB.[08/12/06]Investigating load balancing LCG RBs, deploying gLite RB[15/12/06]Have solution for load balancing but requires configuration of job submission tools, will proceed next year. gLite RB tested with patch, awaiting production release[22/12/06]Is preparing an email and will ask UK sites to try it after xmas[12/01/06]A load balanced RB is available for production. Catalin will now return to the deployment of the gLite RB[19/01/07]Need gftp firewall hole for RB to use external CEs.[30/12/07]Waiting for the firewall update[9/2/07]Firewall done and tests successful.
G-15 Low Derek Install test gLite CE 30-11-06 Closed 17/08/07 Plan to have completed first attempt at CE by end November. [10/11/06]Has re-installed gLite CE. Seemed straightforward. Now investigating farm integration. Were asked by LCG for a milestone date for this task. We have replied 6 weeks after release of gLite CE for production deployment.[24/11/06]Base install done[08/12/06]IS now in shape, investigating integration with scheduler[15/06/07]We don't plan to do any more work on this until we have a candidate release[06/08/07]gLite CE has been/will be deprecated apparently, need to reexamine this action: do we go to SL4 LCG CE, or SL4 CREAM CE?[17/08/07]Most likely we will get an LCG CE for SL4. Item closed.
G-16 Medium Derek LCG worker node update 10/11/06 Closed 15/12/06 Aim to complete by 10th November.[10/11/06]Has contacted Steve asking how best to do this as it is difficult. Waiting for feedback and will then document on wiki. [24/11/06] Released to Fabric team for deployment[01/12/06]Largely deployed, monitor to ensure completion, review next week.
T-17 Medium Andrew Migrate UKQCD to use SRM Closed 13/4/07 Assess how best to get UKQCD off a standalone service. Discussion postponed until next time.[9/2/07]RAS spoke to Nick about this. UKQCD will move away from stand alone services by the end of GRIDPP2. Ideally we should keep the existing server running through some means.[23/03/07]Appears that they will not move to SRM until late in the year. [13/4/07]This is in UKQCD's plan - scheduled for 2008. We will evaluate how easy it is to maintain the QCDgrid software at the next scheduled upgrade.
G-18 Low Matt Improve FTS resilience Closed 7/9/07 Probably will need 2 web front ends and 2 back end nodes. Has mailed Martin regarding hardware needs - reconfiguration will not happen until next FTS release. [24/11/06] Next FTS release scheduled for Jan 07, coordinate with PPS depending on progress.[9/2/07]Expect a new FTS release in a matter of weeks.[23/03/07]Expect Tier-1s to receive it in May - RAL may start early with a test release.[13/04/07]We have been talking to CERN about beta testing the latest release. Matt will request new hardware from Martin.[21/5/07]Hardware is drained and ready, DNS names done, work will commence this week.[25/5/07]Systems are available, waiting on certificates.[31/5/07]Still waiting for certificates; starting work on configuration in advance of this.[08/06/07]No problems so far with multi-box deployment, not expected to bring into production for 3-4 weeks[15/06/07]We have been asked not to upgrade until September. We will do no more work on this unless we see performance problems with FTS 1.5.[29/06/07]FTS developers investigating how best to deploy FTS1.5 now that installing this release from the latest gLite repository is not possible.[27/07/07]We believe FTS 2.0 is released to the PPS. [10/8/07]FTS 2 is released.[17/08/07]Matt is working on a plan. CMS are using the test FTS 2 at the moment.[24/08/07]Moving to FTS2 on Tuesday[7/9/07]Completed - now two web front ends and two agent back ends


G-19 High Andrew Review non capacity use of hardware Closed Find out where all the hardware is going[8/3/07]Now have an updated server count for non capacity use. Martin and Andrew will work on a list of what the servers are used for.[23/03/07]Completed - only thing of note is that we seem to be running so many UIs.
G-20 Medium Catalin Allow intraqueue scheduling for ATLAS 17/11/06 Closed 15/12/06 ATLAS require 80% cpu for production. We are investigating and have a range of options but most are messy. Still investigating.[13.11.06]Change in maui.cfg done. Monitoring needed as there is no atlasprg activity.[21.11.06]'atlasprg' fairshare not too high (but higher than in the past). Not sure if because of configuration or lack of submitted jobs.[24/11/06]Continue to monitor[01/12/06]Do not believe that configuration changes are having desired effect. Need to review - possibly rework gid structure and scheduler config.[08/12/06]Have updated config, will monitor to verify[15/12/06]Appears to be working, but difficult to confirm.
G-21 Critical Derek CE dropping out of BDII 17/11/06 Closed 01/12/06 Losing CE from site and global BDII. Impacting CSA06. Backups were changed and problem was substantially reduced. Still investigating remaining cause. [16/11/06] Recent list match failures yesterday and today appear related to infrequency of SAM test running on site (FCR). [24/11/06]Review next week.
G-22 Medium Catalin User docs are out of date 31/10/07 Closed 18/01/08 The user documentation is out of date. [11/5/07]We have been asked to prioritise work on the web site. Jonathan is preparing a basic update and we should then complete the task. Catalin will review what we have and work on improving the site.[21/5/07]Jonathan has done some basic web site updates.[20/7/07]Work on this is not scheduled until October[17/08/07]We need progress on this by the end of October.[28/9/07]Scheduled to start next week.[05/10/07]Catalin has started looking at existing documentation[12/10/07]Catalin has reviewed what we have, and will book a meeting to discuss requirements.[26/10/07]Wiki pages now exist to be filled in by appropriate team members
T-23 Medium Derek OPs test not reliably scheduled Closed 15/06/07 [Derek] SAM tests do not appear to be prioritised enough over other jobs to guarantee that they are run soon after submission. [24/11/06] Have adjusted maui cfg to solve current problem, longer term - consider separating ops from dteam[22/12/06]Have decided that we will separate ops and dteam and will action this in the new year. believe it is fairly straightforward.[16/2/07]Derek now has pool accounts for operations[27/4/07]Derek has requested a new queue.[11/5/07]A new queue is now available. CE needs reconfiguring to use the new queue.[21/5/07]Derek expects to complete reconfiguration this week.[25/5/07]Waiting for Martin to configure SL4 queue (this work will be done at the same time).[08/06/07]Attempted to do on Thursday - encountered problem with missing symlinks on batch workers, need new release of edgonCSF on batch workers to fix[15/06/07]Now done.
G-24 Critical Matt Rollout glite UI Closed Coordinate with Martin, merging of FE and gliteUI[08/12/06]Babar asked to test FE functionality appears okay so far [22/12/06]Satisfied with tests - waiting for Martin to action this - needed early in the new year.[19/01/07]Reminded Martin about Rollout[16/2/07]Problem with dependencies on SL3.[2/3/07]Other sites are impacted by this problem also[05/04/07]Ticket #18451 - User unable to submit job list match on old lcg UI, works on gLite UI.[13/4/07]Andrew has raised this as urgent with Martin.[21/5/07]UKIROC have asked us for a status update as progress has been slow. We will try a rolling update on a test system and if that works OK we will upgrade other hosts[25/5/07]We informed UKIROC that this would be done this week, but we are waiting for Martin to rollout the upgrades.[31/5/07]Upgrade path retested, and workaround for change in YAIM noted.[08/06/07]Fabric team don't wish to deploy in case it affects Babar in run up to conference[15/06/07]It is critical that the CMS UI is upgraded by Monday otherwise Phedex->CERN FTS stops working. Matt will ensure this is done. We will expect Martin to update the remaining systems next week.
G-25 Critical Derek Deploy dCache pool on SL4 Closed 15/12/06 Need to be able to deploy dCache on new disk servers, which need SL4, test deploy on test dCache.[15/12/06] Done, tested.
G-26 Medium Derek Rollout gLite on SL4 worker nodes 30/4/07 Closed Once SL4 is supported on gLite, plan deployment.[22/12/06]Marian has installed torque/maui on SL4. Need to firm up timeline early in the new year.[9/2/07]Expect to hold a planning meeting next week to decide schedule.[16/2/07]Meeting not arranged yet - RAS will arrange.[2/3/07]Meeting held and a plan exists. Target is to provide CMS with a test service by the end of March and deliver a CE routing to SL4 by end of April[30/3/07]Derek has published a schedule which leads to a test SL4 by 30th April.[11/5/07]Basic test install has been completed. Derek needs to meet with Martin to discuss how queues will be managed.[21/5/07]We expect to have the new queue set up later this week. [25/5/07]This is waiting on Martin setting up the queue. It is likely there is no time now before Derek goes on holiday for him to complete the CE configuration and run the test jobs as planned.[08/06/07]Now have queue and a SL4 compat worker node available in CE, yet to get successful job submission. Native SL4 worker node has been released; have asked Fabric team for a "clean" SL4 worker to install on[15/06/07]SAM tests now working. Now need to test the native SL4 worker node instance. This requires another system building.[29/06/07]Native SL4 now running dteam SAM tests. We don't expect any experiment testing until we have a separate CE - which we expect to have running some time next week. We may need to track down extra libraries for some experiments (although LHCB ship theirs with the job).[6/7/07]We have a test CE and 2 worker nodes. Has been announced to dteam. Some work has run on it already as it is published in the info system for all LHC, ops and dteam. Planning on completing the migration by end of August unless we hear otherwise from the experiments.[13/7/07]Testing is ongoing. Current CE is on test hardware, Derek will arrange to install a new instance on production hardware shortly.[20/7/07]Is working on configuring a production CE. ATLAS working OK, ALICE having problems querying info provider, CMS have run work, problems with LHCB at the moment[27/07/07]Production CE is nearly in place and queues are set up. Plan to move 20% of capacity to SL4 next week. Will get confirmation from PMB/Production manager.[10/8/07]20% of capacity is moved and new CE is in production. We will contact LHC experiments and ask when they want to move. At present LHCB are unable to move because of a problem in lcg-cp.[24/08/07]No response from experiments about further capacity moves, will decide on another percentage to shift over.[7/9/07]We now have a statement from the experiments and plan to move 90% as soon as possible. Once we have the 90% moved we will announce termination of SL3 in 1 further month.[14/9/07]60-70% [actually 60%] of capacity is now SL4.[21/9/2007]We expect to reach 88% by the end of September. We will need to provide SL4 front ends (UIs) shortly for Babar. Since the start of migration we are not running at full capacity. It is becoming urgent to eliminate LHC use of SL3. LHCB report that they are able to use SL4 now. [28/9/07]We now have general queues so all VOs can use the service. Yet to be announced. Problem with Babar (missing packages) which is holding up the deployment (now at 70%).[07/10/05]Migration ongoing, SL3 grid queues now closed to job submission, draining remaining queued non-grid SL3 jobs.[12/10/07]Now at 88% in SL4, closure dates broadcast for SL3 service - 19th for grid service and early Nov for non-grid[09/11/07]Last worker nodes migrated to SL4 this week, lcgce01 taken out of BDII
G-27 High Catalin Christmas cover for grid service Closed 22/12/06 Announce plans for cover over Christmas closed period (22/12-02/01)
G-28 Medium Catalin Size of DB Tables affecting RB performance Closed Mysql tables on lcgrb01 are very large and are leading to job aborts, need to find way to reduce table size.[15/12/06]Will investigate archiving old data to separate database.[12.01.07]Load balanced RB and some cleanup may help this situation downgraded to medium.[19/12/07]No official method, consider submitting GGUS ticket.[30/12/06]CERN tell us they have no solution to the problem.[30/12/06]We are not aware of any outstanding problem.
T-29 Low Matt Atlas UK believe they found a hung FTS channel Closed David Cameron will email Matt and give him details. RAL are asked to investigate if we have monitoring that should/could have spotted this problem and report back to the ATLAS weekly meeting.[22/12/06]Was followed up but not enough info for FTS team. Matt is investigating a method of detecting jobs in pending state.[19/12/07]Added a nagios check for this error condition[30/01/07]No outstanding problem we are aware of

G-30 Medium Matt Map UB allocations to dCache & Castor Closed Now UB allocations are nearly finalised, decide how to split up the allocations between dCache and Castor servers and inform appropriate people.[22/12/06]Basically done - a few loose ends to complete.[12.01.07]Plan exists - closed
G-31 High Matt WLCG Monitoring questionnaire needs completing Closed We need to complete the questionnaire[12.01.07]Waiting for comments[19/12/07]Submitted
G-32 High Derek Gridftp doors are running out of memory Closed 23/03/07 Not fully understood yet - Derek will provide a monitoring script today if possible.[12.01.07]A script exists to stop hung doors but they still don't always work - still does not understand cause.[19/01/07]Reverting gftp door to previous release did not remove failure, increasing memory in config appears to delay but not prevent occurrence. [30/12/07]Increased memory for dCache Java configuration - this appears to have helped the situation - but not entirely resolved it.[9/2/07]Tests of new dCache release/upgrade have gone well and we expect to schedule the upgrade on the production service shortly.[17/2/07]No hangs since 6th.[2/3/07]Waiting to see how new release of dCache behaves [8/3/07]It is still a problem.[16/3/07]Have now upgraded doors to use Java 1.5, seems to be making some improvement.[23/3/07]Seems to be fixed!
G-33 Medium Marian Will need more nodes in January Closed [12.01.07]Has asked Martin for nodes and Martin is draining them[19/01/07]Nodes received.
T-34 Medium Matt Load on top level BDII causing occasional timeouts Closed 23/03/07 From time to time, failures in CE replica management tests appear to be caused by timeouts contacting lcgbdii02.[19/01/07]Still investigating[30/01/07]Has investigated - high CPU usage. Have decided to task a second node to the BDII. Waiting on Martin[9/2/07]Expect to receive the hardware today. Matt will request a firewall update and plans to deploy ASAP.[16/2/07]Firewall update done but waiting on the DNS update.[2/3/07]Was done - problem mainly resolved. Have had only 1 timeout at RAL this week. Load from LHCB VO server seems very high and we are pursuing that.[8/3/07]Remains a problem - we will task another system.[23/3/07]No recent timeouts

T-35 Critical Derek SAM jobs occasionally do not run CE Closed 27/4/07 From time to time, a SAM job fails as it does not appear to be submitted to the CE[30/01/07] Saw this happen. Job reached the CE but was dequeued by the gridmanager. Will submit to GGUS[16/02/07]GGUS ticket 18603[8/3/07]The GGUS ticket has been closed with status unsolved.[30/3/07]LHCB have reported that 400 jobs aborted for unknown reasons.[13/4/07]Derek believes we are failing a large number of jobs - this issue is becoming very urgent. We believe this may need a rebuild of Torque. Derek is working on a ganglia plugin to monitor this.[27/4/07]Identified some "black hole" nodes and carried out a Torque update - problem resolved.
T-36 High Derek Upgrade dCache to provide SRM 2.2 31/03/07 Closed The latest releases of dCache include an SRM v2.2 interface. Our installation of dCache should be upgraded.[9/2/07]See G-32[16/2/07]Scheduled for 1st March[2/3/07]Completed
T-37 High Catalin RB unavailability because multiple cancellations for a same job Closed A higher than usual CPU system load was noticed on lcgrb02 and also got reports about poor reliability of that RB.[31/01/07]Checked and found out that it was about a known problem appearing when the same job was cancelled multiple times. The RB can't repair itself so human intervention is needed (multiple restarts of edg-wl-proxyrenewal and edg-wl-wm), otherwise the system appears as being overloaded and unresponsive.[01/02/07]A monitoring framework for lcg-RB and glite-WMSLB is going to be deployed (depending on resources available). To be discussed with Derek.[9/02/07]Monitoring framework is deployed. Needs to find a way of detecting the fault state. [2/3/07]Manual interventions have to be carried out in order to restart daemons until high load condition is eliminated. Same problem at many sites.
G-38 Medium Derek / Catalin Implement Service Monitoring Framework Closed Matt has produced a service monitoring framework for nagios. We should implement on Grid service nodes.[8/3/07]Jonathan will release in the next nagios release.[27/4/07]Catalin plans to work on this over the next couple of weeks[11/5/07]Basic checks are now in place on Nagios[13/7/07]Catalin is working on this for the RB.[05/10/07]Tests added to LFC[16/11/07]Matt to investigate why Nagios SAM plugin doesn't work[30/11/07]Handed over to Jonathan[11/01/08]Closed, SAM alarms now appearing in Nagios
G-39 Low Derek Held jobs in the scheduler impact our ETT Closed We shouldn't be holding jobs unnecessarily as they raise the ETT and make our site less attractive.[23/03/07]Derek will ask Martin to not hold failed jobs. We have a workaround in the info provider[21/5/07]Martin agrees that he will delete all automatically held Grid jobs on the scheduler[25/5/07]Martin believes he has found a way to fix the original problem causing jobs to enter state 15057 - it is being tested[29/6/07]Held jobs no longer impact our ETT. GRIDPP job deletion policy expected soon and we will then update our process WRT held jobs.[27/07/07]Policy is released - we will evaluate and see how this policy should be implemented.[24/08/07]Nagios check for low efficiency jobs is running[21/9/07]Fabric team will ensure that jobs terminating with state 1507 are simply deleted as there is nothing else we can do with them.[11/01/08]Closed, held jobs not seen for a long time.
T-40 Medium Catalin LFC may need to be more resilient Closed The LFC is critical for Tier-1 and UK GRIDPP operations; we need to ensure that we minimise the risk of service downtime and data loss.[22/6/07]Will see how Oracle LFC for LHCb goes, may consider moving all LFC to Oracle RAC.[29/06/07]Catalin will contact Gordon to see if LHCB requirements have been clarified via 3D. Will also request a test LFC from Fabric Team.[18/01/08]The LFC does need to be more resilient; need plan for how to achieve this.[25/01/08]Equipment to run LFC (and FTS) on a RAC has been ordered. Need to also consider resilience of the frontend.[2008-02-15]ATLAS have an opportunity to migrate in the second half of March; depends on hardware delivery, and coordination with Fabric and DB teams.[29/02/08]Planning to move to Oracle in April when RAC delivered. Develop a timeline for deployment before May.[18/04/2008]Have now deployed on Oracle, but RAC will wait until after CCRC08.[27/06/2008]Inform Martin and Gordon of latest date that RAC can be delivered.[08/08/08]Second frontend to be installed by end August in round robin DNS configuration, RAC deployment with Database team[15/08/08]Have second host installed but still awaiting firewall holes[12/09/08]Two frontends for Atlas now in production. RAC migration will wait until after data taking run.[17/10/08]Catalin will talk to DB team to move this along[31/10/08]Problems with RAC deployment appear resolved, expect full deployment by start of December.[21/11/08]DB team made good progress. Work almost done, few minor issues to be easily sorted out. Plan to test next week, possible migration first (or second) week in December.[28/11/08]Test successful, discussing possible dates with Atlas for migration, FTS may also be at the same time.[12/12/08]Postponed until after R89 move[9/1/09]Review plans in light of machine room move[16/1/09]Will remain on single instance until after machine room move[30/1/09]Waiting for standby single frontend[2009/02/13]Hardware still not available.[2009/03/06]Closing this action as we now have a plan

T-41 Medium Catalin RGMA Registry server may need new hardware Closed This is a world critical server on non resilient hardware[11/5/07]Catalin has started discussion with RGMA group[21/5/07]Some input has been received. Probably need to move the MYSQL service to a RAID filesystem. Needs to be discussed with Martin.[31/5/07]Need to ask Martin for better hardware, initially for testing, and then for production service.[08/06/07]Have promise of hardware from Fabric team, so progress may be made next week[15/6/07]Catalin has new hardware and is working on the kickstart and is talking to RGMA developers to install a new system.[27/07/07]Was raised at the operations meeting. Advised to make this change in September - waiting for feedback from the ROCs.[21/9/07]Catalin has been in discussion with Alistaire Duncan wrt how to carry out the installation and migration. [28/9/07] A test server is in place and is available to the RGMA developers. [05/10/07] Hardware seems suitable, however the ip migration issue is still under discussion but may take several months, discuss with Martin remaining on test network.[12/10/07]Need to know timeline for deployment for RGMA dns issue[26/10/07]Still awaiting timeline[16/11/07]RGMA team working to certify patches[29/2/08]Patch in under certification, available in 2 weeks. New hardware will use SL4.[2008/03/14]Received instructions for installing on SL4.[2008/03/28]Moving ahead- aiming for after CCRC[2008/04/04]Have tested installation of MON box PPS release successfully (similar to IC) awaiting Production release [16/05/2008]Intend to deploy on 1st July, but needs confirmation by 1st June[27/06/2008]Hardware is in production but is being forwarded from old host, will break this link on 9th July and chase hosts which have not switched via GGUS.[08/08/08]Closed - new hardware deployed


T-42 Medium Catalin Do we need to migrate LFC to Oracle? Closed LHCB may need LFC to have Oracle backend to support Oracle streams. Catalin will discuss with Raja.[21/5/07]LHCB state that they require Oracle - a formal request will arrive from LHCB shortly which Catalin will forward to Gordon for comment.[25/5/07]Yes we will need to implement an Oracle instance of the LFC. The Oracle back end will be hosted on the LHCB 3D hardware. We will need to create and test a new LFC instance using Oracle. This is being discussed within the 3D project but we expect our end of the deployment will not be too difficult.[15/6/07]We will need separate hardware for an LHCB LFC.[22/06/07]Still awaiting requirements from LHCb - have been waiting for 1 month.[6/7/07]Catalin still waiting for a response. LFC data replication schedule seems to be being planned in 3D meetings. Raja will raise question with LHCB as to what plan/schedule is.[13/07/07]Raja has reported that LHCB will ask the Tier-1s to deploy an Oracle based LFC. Catalin will agree with Gordon and Martin a schedule to deploy at RAL.[20/7/07]Hardware request has been made to Fabric team[27/07/07]Still waiting on hardware.[10/8/07]Still waiting on hardware.[17/08/07]Node has been drained and a name assigned.[24/08/07]Awaiting database configuration from Oracle DB team, expected today [7/9/07]95% completed - installation may be completed today - otherwise it will not be until the end of next week.[14/9/07]No progress[21/9/07]The recipe initially provided does not appear to work, discussions are underway at CERN to identify the correct configuration that should be used.[28/9/07]No progress[12/10/07]New suite of LHCb LFC rpms to be released in 2 weeks to test[26/10/07]Packages under certification, sl3 2 weeks, sl4 4 weeks; suggestion is to deploy sl3 version once it's certified[09/11/07]SL3 release ready for install by middle of next week[16/11/07]Expect to deploy by end of 1st week of December[30/11/07]Still waiting for go ahead from LHCb[14/12/07]Oracle LFC deployed in test mode for LHCb, LHCB have done basic functionality tests and confirm it works, more testing will be done[11/01/08]Closed, LHCb now performing stress testing
G-43 Low Matt Short Grid queues do not meet needs of experiments Closed 17/08/07 Short queue only runs one job per user at any time. Experiments appear to need to run many jobs.[13/04/07]One job per user limit is lifted.[17/08/07]Not obvious that we can progress this further at present - item closed.
G-44 High Catalin Need an SLA with each VO for VO box management Closed Catalin has draft SLAs already prepared - will complete and circulate next week.[31/5/07]Delayed for early June, due to late request from CMS.[08/06/07]Still waiting on CMS, but other 3 are drafted[6/7/07]RAS received draft on Wednesday but has not yet been able to find time to reply.[10/8/07]Alice and LHCB have copies to review. Will let LHCB know what further input we need from them.[17/08/07]LHCB have agreed the document. ATLAS do not require a VO box at RAL. Matt will update the document and send to CMS.[24/08/07]LHCb is done, we believe Alice is done but no reply from Alice yet[7/9/07]Still no response from Alice - We still also need to draft the CMS SLA[14/9/07]CMS now have the SLA. No response from Alice.[21/9/07]Alice inform us that they do have some comments but we have not received them yet. No comments yet received from CMS. We plan to finalise these SLAs by the end of September if possible.[28/9/07]We have received input from Alice but their requirements for disk partitioning do not match our configuration. We will negotiate. No response from CMS.[05/10/07]Alice now done but needs glite WMS + VOBOX for full deployment[26/10/07]CMS have provided a draft, need to review[09/11/07]Agreed CMS VOBOX SLA, action closed
G-45 Low Derek dCache tape backend cannot keep up with LHCB demand Closed 29/06/07 Derek has spoken to Tim Folkes, we believe that the ADS service is the bottleneck. Tim has noted that there are two tapes with problems and will let us have details.[11/05/07]Three physical tapes were involved - all but three files have been recovered. Will monitor that performance is now improved. [21/5/07]New tape servers have been added.[25/5/07]LHCB still have performance problems and say that this is an urgent matter to be resolved.[08/06/07]Have increased numbers of restores from 5 to 7, initially increased to 10 but this hit ADS too hard, so was scaled back.[15/6/07]We have caught up with the backlog and reliability is better.[29/06/07]Seems to be OK now

T-46 High Catalin LFC needs to support TOTALEP Closed The LFC needs to support the totalep collaboration. STFC and GridPP consider this partnership as evidence of our wider working and success and it is therefore considered a high priority by the collaboration.[11/05/07] This was completed long ago but testing couldn't confirm this. This has now been confirmed as working at dteam meeting
G-47 Low Catalin ATLAS VO box running 100% busy at times Closed 10/8/07 [31/7/05] More memory needed in principle, but ATLAS advise that memory requirements will decrease with next DQ2 release.[08/06/07]Increased memory from 2GB to 3GB, not a major problem for Atlas.[15/06/07]We will wait and see how performance is after next week's upgrade.[22/06/07]Atlas may not even need this system in the future[29/06/07]Load is low at the moment probably because ATLAS production is not running. Catalin will raise this at the weekly ATLAS-UK Tier-1 meeting.[6/7/07]It has been confirmed that we will still need a VO box for some ATLAS activities but not DQ related work.[10/8/07]RAL VO box no longer exists at RAL - now at CERN.
G-48 Low Catalin Mon box memory usage high at times Closed 6/7/07 [31/5/07]Agreed with R-GMA that load on the machine is to be expected, and we should consider migrating to better hardware.[6/7/07]We do not perceive this to be a problem any longer
G-49 Medium Derek Jobs still being held at CE Closed 13/7/07 Jobs are being held at CE (20 per week?)[13/7/07]We see no evidence of a problem at present.
T-50 Critical Catalin Problems with RB at RAL Closed Second RB at RAL seems broken according to some users.[23/05/07] The cause might be some very large tables within lbserver20 database. A scheduled MySQL maintenance plan for the next two weeks was circulated.[25/5/07]Further investigations are needed. We will generate an action plan to improve our monitoring of the situation and improve load balancing and d/b cleanup. Will also task a third RB. Need to find out if other sites are seeing similar problems.[31/5/07]Now in maintenance and closed to new jobs.[08/06/07]Maintenance done, expect to do same to rb01 in a month
G-51 Medium Derek CE is becoming overloaded Closed 6/7/07 [25/5/07]CE has grown increasingly overloaded in the last 4 days. Availability will be impacted if it grows much more. Derek will investigate[31/5/07] High load remains and caused availability (SAM) drop out over the bank holiday weekend, and several outages since then.[08/06/07]CE has been okay since last Friday, still unsure of cause, similar problems have been seen at CERN over same period.[15/6/07]We need one or more additional CEs[22/6/07]Additional CE will be required for SL4.[6/7/07]No repetition and no planned changes - therefore closed.

G-52 High Derek Clarify downtime announcement procedure Closed 7/9/07 Recent Castor downtime wasn't announced, need to clarify who does what and what should be done for scheduled and unscheduled downtimes. [15/6/07]Draft circulated[22/6/07]Will update for notification/veto procedures and distribute more widely. [13/7/07]Derek has circulated a new draft for comment.[7/9/07]Now complete
G-53 Low Andrew Clarify status of Babar wrt Service Changes Closed 10/8/07 Babar have traditionally been able to block system updates if they have not been certified by them. Need to clarify status with PMB. [22/6/07]Babar were content to be treated as other non-LHC experiments (following a dialogue with Fergus Wilson).[29/06/07]Andrew needs to pass on this info to Fabric Team.[10/8/07]Done
G-54 Medium Catalin Do we need an Oracle LFC for ATLAS Closed 25/01/08 Some discussion that ATLAS will wish to use an Oracle LFC.[22/6/07]Oracle seems better for bulk operations, Atlas would accept MySQL.[18/01/08] Discuss scheduling and hardware requirements with Gordon. [25/01/08]Closing; see Action T-40.
G-55 Medium Catalin Jobs not completing on RB Closed Jobs are not reaching Done state on the RB, believe this is caused by large job submissions from LHCb and Alice causing backlogs, will deploy separate RBs for Alice and LHCb. Will need hardware from Fabric team[29/06/07]rb01 under maintenance planned to be back in service by Tuesday. Hardware for rb03 is now available to implement an RB for Alice (and maybe LHCB). Planned to do dteam tests today - we expect to have this operational by Tuesday. rb02 is struggling on its own at the moment.[6/7/07]All 3 RBs are now working - downgraded to medium. Will watch another week and then close if everything is fine.[13/7/07]We had another problem last weekend which has now been fixed. We (and SAM) perceive that the RB is operating well but Steve Lloyd's monitoring indicates a problem. We have not resolved this difference yet. LHCB is now moved to rb03 with Alice.[20/7/07]All appears to be working as well as can be expected
T-56 Medium Derek Enable mice and new minos vos Closed 18/01/08 Enable new vos on various parts of the system. Minos is a rename of an old vo, so should just need the time to do it. Mice is new and requires disk space, which would be allocated on csfnfs58 but Fabric team wary about adding more partitions.[6/7/07]Now in the siteinfo.def. needs to be scheduled for deployment on Grid Services.[14/9/07]Minos are now able to submit jobs. Storage space and LFC entries need renaming.[21/9/07]LFC is done for Minos.[28/9/07]Mice now enabled on the CE but still need storage space.[26/10/07]Minos now done except for Castor[11/01/08]have space for MICE, just need to add to dCache.[18/01/08]MICE have no disk allocation for Q1; MINOS disk depends on other factors.
G-57 Medium Matt Occasional top level BDII timeouts Closed 13/7/07 We are seeing some situations where the BDII is overloaded. Probably by LHCB queries when a storage element is down. Matt has upgraded one of the three BDII servers with the new release which supports indexing. Load on this server has been halved and we will run this for some time to see how it behaves. The version of this software was released today (29/06/07) [6/7/07]All BDIIs are now updated. This was completed on Tuesday. [13/7/07]Is fixed.
G-58 Low Marian Need access to SRM 2.2 from PPS Closed We are discussing with CASTOR team how to provide SRM 2.2 to PPS. [13/07/07]Waiting on input from Jeremy.[20/7/07]Not a requirement
G-59 Medium Marian Experiments desire more capacity on PPS Closed 18/01/08 We need to explore how to allow PPS access to main batch farm capacity[13/7/07]Has received new hardware for a new CE. Marian and Derek will discuss ways of allowing PPS to grow into CPU farm when demand is low from production service.[20/7/07]Derek is investigating various issues.[18/01/08]No longer relevant; closing.
G-60 Medium Marian PPS needs to enable LHC VOs Closed 18/01/08 Only dteam and ops are enabled at present - all LHC VOs need to be enabled.[20/7/07]Marian will enable additional VOs at the next update.[17/08/07]VOs have been added but not tested as PPS is broken.[18/01/08]Enabled, but not fully tested; closing.
G-61 Medium Derek Need to test use of glexec Closed We need to build experience of glexec to see if it is acceptable for use at RAL. Will discuss next week.[13/7/07]Marian will provide support to Mingchao for testing glexec[20/7/07]Marian is working on this[10/8/07]Mingchao is testing glexec[21/9/07]Mingchao has produced a draft report but needs access to a batch system. We will provide a test batch worker on the production service. Ownership changed to Derek.[26/10/07] Mingchao has been given lcg0351
G-62 Low Catalin RBs fail Steve Lloyd's tests Closed Steve is investigating with Glasgow[21/9/07]Not obviously anything we can do here to progress this and no fault on the RAL RBs therefore closed.
G-63 Medium Andrew Need read access to firewall ruleset Closed 2009-03-06 We need to be able to make live queries of the firewall ruleset.[10/8/07]Known fault - Andrew will raise at next network meeting.[21/9/07]We are discussing how best to improve our liaison with RAL networking group to allow us to raise this issue at a higher level.[16/11/07]Networking group are procuring a management system for the firewall[14/12/07]Management system has been delivered, awaiting access[27/06/2008]Networking have had considerable problems making the unit work.[2009-03-06]Don't expect any further action to be taken. Closing.
G-64 Critical Marian latest update of glite CE does not install from scratch 14/9/07 Closed latest PPS update failed after the glite CE could not be installed. Ticket has been submitted but we don't think it is likely that it will be solved as we expect support for gLite CE has probably ceased. Marian plans to replace with PPS version of LCG CE for now.[17/08/07]Marian plans to persist with gLite CE until next update and if this fails will move to LCG CE.[14/9/07]Marian has concluded that the gLite CE cannot be made to work.

G-65 Low Derek Use of swap growing on dCache Closed Swap use is growing[17/08/07]May be an interaction between Java 1.4 and Redhat 7.3. Derek plans to upgrade Java on Monday.[24/08/07]No problems seen after upgrade but will continue to monitor[7/9/07]Swap continues to grow - will continue to watch[14/9/07]Appears to be mainly on the old RH 7.3 service - does not seem to be a problem at the moment and will go away with SL4. Will continue to monitor for some time.[05/10/07]Still a problem - moving to Java 1.5 appears to alleviate[14/12/07]All servers moved to 1.5 as part of upgrade, have not yet checked to see whether this is still an issue.[11/01/08]Not a problem any more - closing
G-66 Medium Matt CMS may need a second VO box Closed 7/9/07 Derek has received a request for a second CMS VO box[17/08/07]Ownership changed to Matt.[24/08/07]CMS will do installation from dedicated WN, and all capacity will be moved to SL4[7/9/07]Not required
G-67 Medium Matt CMS need additional FTS channels Closed New channels are needed for CMS Tier-2 sites[17/08/07]Some have been set up more are ongoing - this is routine operating position and the issue is closed.
G-68 Low Marian Needs to have an externally visible web server Closed 7/9/07 [17/08/07]Marian needs to have a PPS repository that can be made externally visible via the web. We propose an NFS volume connected to an existing external web server. Marian will discuss with Catalin how best to proceed.[24/08/07]Derek has given Marian login access and a directory on lcgwww.gridpp.rl.ac.uk that is web accessible[7/9/07]Done
G-69 High Matt Space is getting short on external software area Closed 09/11/07 [7/9/07]Only 7% free we are monitoring the situation. Expect space to be freed up at the end of SL3.[14/9/07]Now at 5% and raised to high.[26/10/07]Space now effectively zero, CMS attempting to move onto dedicated server[02/11/07]CMS have been moved onto a new server, old data will be archived then removed[09/11/07]CMS moved off - ~50% free now
G-70 Medium Derek Performance of csfnfs58 is inadequate for software serving Closed nfs58 is overloaded and we need to consider alternative ways of managing experiment software repositories.[26/10/07]Will discuss on Monday[02/11/07]Given to Catalin to move Atlas and LHCb on to separate servers[29/02/08]New host being deployed for Atlas.[2008/03/14]ATLAS done; LHCb to follow.[2008/03/28]Aiming for LHCb before LHCb[2008/04/04]Worker node set aside for LHCb[2008/04/25]LHCB Done[2008/05/09]Migrate remaining software areas and glite stack off - moved to Derek[28/11/08]Fabric team may wish to decommission csfnfs58, may need to raise priority of this.[12/12/08]raised priority to medium[16/1/09]Spoke to Fabric team, 3 options: AFS, dedicated nfs or shared nfs server, will follow up with Jonathan about actual AFS details to see if feasible[20/02/09]AFS space now allocated, testing installation[2009/03/06]One issue to be resolved - handling of automated CRL updates.[2009/03/20]Will use afs servers for cron jobs, still need to resolve 24hr support for AFS.[2009/05/01]Waiting for Fabric team for 24/7 support[2009/05/22]Now have a plan - aiming to have in place by end of July[2009/09/11]Closed due to quattor WN not requiring this

G-71 High Catalin Need gLite WMS service Closed gLite WMS is released.[28/9/07]Catalin will let Derek (by Tuesday 11:00) know what his timetable will be to get a test WMS running.[05/10/07]Alice is expecting a WMS by the end of the year.[09/11/07]Needed to be in place for CCRC[16/11/07]Test host deployed with some local and remote submission, some missing updates in yum repository now in place, need production hardware, expect to deploy by 1st week of December[23/11/07]Functionality test performed successfully with external submitters, hardware and firewall requests made to Fabric team[30/11/07]Service now in place but not announced, looking for large testers and maintenance procedures but have had little success. Will announce as production.[14/12/07]Has been announced and successfully tested by phenogrid and camont, closing.
T-72 Medium Derek Plan dCache 1.8 upgrade Closed dCache 1.8 is expected to be production ready in November.[28/9/07]December 17th has been proposed - we will negotiate a different date.[05/10/07]Have suggested December 3rd as alternative[23/11/07]Upgrade successful on test hardware[30/11/07]Upgrade will happen on Monday
G-73 Low Catalin ILC mysql database Closed 18/01/08 [05/10/07]ILC have requested that we run a mysql service for them.[26/10/07]Underway - now have server, resolved some installation issues[02/11/07]Testing underway, awaiting feedback[09/11/07]Some issues discovered[16/11/07]Awaiting response from ILC
T-74 Medium Matt SL4 UI Closed 31/10/08 Minos have requested a SL4 frontend (assume also a UI) before we close SL3 RT #21313[26/10/07]Meeting with JamesT on Monday to discuss[02/11/07]JamesT testing kickstart config for frontend[09/11/07]Machine built, fixing various issues[23/11/07]James is working on issues[14/12/07]All encountered issues expected to be fixed in next iteration of build, to be done today.[2008-02-15]Kickstart iterated; need to resolve minor issues.[29/02/08]Will chase up status[07/03/2008]Need compatible builds of maui and torque, SL3 versions don't work.[2008/03/14]Maui and torque need to be built without tcl/tk support, and SL3 builds as such are expected to work.[2008/03/25]Fabric team have a June milestone for this item and there will be no progress until then.[27/06/2008]SL4 torque and maui clients cannot talk to SL3 scheduler.[2008/8/01]Investigate using Glite SL4 clients and incorporating our patches into GLite if necessary.[2008/9/19]Matt is making good progress and will be meeting with James on Tuesday to discuss rollout schedule.[17/10/08]Announced SL4 test UI to GridPP Users, still to agree decommission schedule for SL3 UIs.[31/10/08]Closing
T-75 Medium Derek Gridmap generation on Castor servers Closed Discuss with Castor team problems with gridmap generation[16/11/07]Possibly fixed - will double check with Chris[23/11/07]Chris has found a solution - closing.
T-76 Low Matt Shared VO box for small VOs Closed 18/01/08 Should we provide a shared VOBOX for small VOs? Minos has a requirement to run proxy renewal services.[18/01/08]Not currently required; closing.
T-77 High Matt CMS need dedicated Frontier service (on SL4) Closed CMS need a resilient Frontier service, split from PhEDEx.[14/12/07]Still have load problems with PhEDEx on single system, awaiting hardware requirements from CMS for PhEDEx before moving off.[18/01/08]Requested hardware for more resilient squid and separate PhEDEx.[2008-02-15]One server now dedicated to Frontier; second to follow after firewall changes.[22/02/08]Working on deploying second server[29/02/08]CMS want SL4, so will test deployment.[2008/03/14]Squid installed on SL4; second server needs upgrading from SL3.[2008/04/04]Awaiting UK sites to transition to DNS alias, will then update SL3 box[2008/04/25]Complete
P-78 Medium Matt UKQCD transition Closed 2008-02-15 UKQCD currently run on their own grid. Ultimately, they will need transitioning onto the main LCG farm.[2007-06-27]UKQCD not planning to move till early 2008.[2007-11-14]Transferring this issue to Grid Services team for them to pick up contacts from NGHW.[2008-02-15]Taken over from Nick.

T-79 High Derek Investigate removing dependency on home fs for grid jobs Closed It should be possible to configure grid jobs to use a home directory local to the batch worker they are running on, this would remove the dependency on the home filesystem for grid jobs so they would be unaffected by nongrid users hammering the home filesystem[2008/05/02]Need to deploy new CEs into service quickly due to current CE reaching capacity, which will delay this.[2008/08/15]May need to change home dirs on CE to match new layout on WNs, testing ongoing.[12/09/2008] Testing now done, need to discuss with Fabric team best time to deploy - possibly after data-taking.[17/10/08]csfnfs02 is causing callouts, have increased the priority of this[14/11/08]James A is progressing this [21/11/08] Raising to High, following the last days problems with csfnfs02 (one callout and one crash, both related to high load presumably).[28/11/08]Ongoing[9/1/09]Almost complete, just need to confirm that WN install is up to date to create the new partition, once this is done WNs can be reinstalled behind the scenes.[16/1/09]Hosts have now been reinstalled, no complaints seen - closing
T-80 Low Catalin New SL4 VO Box for ALICE Closed Request from ALICE for SL4 VO Box; hardware request to be made.[2008-02-15]Waiting on feedback from ALICE.[2008/03/14]Tried new kickstart but possible firewall issues.[2008/03/28]Awaiting feedback from Alice[2008/04/04]Some certificate issues being worked on[2008/04/11]Machine is up and running but awaiting confirmation from Alice that their services are installed.[2008/04/25]Done
T-81 Low Catalin New hardware for MON Box Closed Need to move MON Box off old blade hardware.[2008/03/14]Will use existing UI as replacement hardware.[2008/03/12]Awaiting Glite3.1 MON Box release[2008/04/04]Have working PPS MON instance awaiting Production release[2008/04/11]Migration to SL4/Glite 3.1 scheduled for Monday[18/04/2008]New Mon box in production.
T-82 Med Derek Migration to SL4/GLite 3.1 Closed [29/02/08]Need a wiki page to track migration of Grid Services.[2008/03/14]Matt has produced and circulated a draft of gLite services and versions.[2008/04/04]Moved from Derek to Matt. [30/05/08]Remaining services RBs(obsolete), ce02 and myproxy, UIs[27/06/2008]SL4 myproxy appears to be unstable, deployment on hold for now.[2008/08/01]Hoping to test RB/WMS renewal next week.[08/08/08]SL4 myproxy deployed on Monday during site downtime.[08/08/15]Myproxy now on SL4, need to develop plan for moving lcgce02 to GLite 3.1. [21/11/08] Ownership moved to Derek[12/12/08]Plan to move Alice to CREAM CE and deploy a new lcg-ce instance for small vos. [30/1/09]CREAM CE now deployed, will put in hardware request for small vos CE[20/02/09]Glite 3.0 support ending in April potentially[2009/03/20]Is now tied to SL5 migration.[2009/04/24]Have now CE and WN and test scheduler, will start publishing for OPS early next week, other VOs to follow when ops tests are successful[2009/05/01]OPS tests are passing fine[2009/05/08]Have decided on 1 Oct as final date to stop using glite3.0 CE[2009/05/15]Closing
T-83 Med Catalin Migration of RB service to WMS Closed [29/02/08]Plan schedule to terminate RBs and ramp up WMS service[07/01/2008]Propose putting in place final infrastructure as soon as possible.[2008-03-14]About to finish certification, then will be available in PPS (probably).[27/06/2008]Deploying test setup - 2 WMS + 2 LB for demonstrating proof of concept.[08/08/08]New setup will be deployed by end August[08/08/15]Stalled pending firewall holes[12/09/2008]2 WMS and 2 LB now in production, now developing plan to close RBs.[19/9/08]We plan to close the RB service by the end of the year. Announcements need to be made.[17/10/08]Closing, announcement made
T-84 Closed Derek Scaling up CE service Closed [07/01/2008]Consider scaling up CE service using multiple CEs to cope with heavier load from faster WMSes[2008/03/28]Deploying shared nfs server for common files[2008/04/04]Hardware requested for new CE, ip address allocated for nfs server[2008/05/02]nfs server now in use, deploying new CE, hardware request in for 2 further CEs[2008/05/16]lcgce03 now in production for Atlas. Hardware now received for next 2 CEs.[2008/05/23]ce04 & ce05 now deployed for CMS and LHCb respectively[30/05/08]Have enabled atlas, cms and lhcb each on another CE, have scheduled removal from ce02 from 3rd June.[27/6/2008]CE service is now scaled up.
T-85 Med Matt Load issues on FTS DB Closed Sometimes the FTS Oracle database host has very high load, the reasons for this are not well understood. [2008/05/23]There is a pending Oracle upgrade scheduled for post CCRC08 but don't know whether this will fix the problem, RAC upgrades are also pending. [2008/08/01]Had a repeat occurrence of load issues due to Atlas backlog after Castor downtime.[2009/04/24]Move to RAC scheduled from 6th May[2009/05/08]Move to RAC complete - action closed
T-86 High Catalin xrootd on Castor for Alice Closed 31/10/08 Need to deploy xrootd on Castor interface for Alice. Manager is installed and setup, awaiting hardware for the peer. Will not be deployed for CCRC08.[2008/05/16]Have hardware for manager and peer, awaiting configuration details from Alice people, still need xrootd on disk server.[27/06/2008]Have hit some issues with configuration.[08/08/08]Deploying test config over single dns domain to see if that resolves issues.[12/09/08]Still having issues, have asked CERN for assistance and given access to a developer[26/09/08]Our tests work but Alice tests fail[17/10/08]Alice creating zero size files but don't know if this is intentional.[31/10/08]Closed, final problems resolved, has been added to Alice production
R-87 Medium Catalin Check up-to-dateness of wiki pages - particularly Grid Services Closed Action from Jeremy[08/08/08]Give to Matt to generate list of pages to assign to team members.[12/09/08]List of wiki pages assigned to Grid Services team has been further sub-divided to team members.[31/10/08]Still to update WMS pages [21/11/08]Ownership to Catalin[16/1/09]Closed
G-88 Medium Catalin LFC keeps crashing Closed 17/10/08 The LFC keeps crashing. Catalin is in touch with CERN[26/09/08]Possibly two issues, one of which is now fixed, second appears to be due to OS differences on frontends[17/10/08]Closing, oracle client update fixed problem
G-90 Medium Derek Alice would like a CREAM CE Closed Alice have requested sites deploy a CREAM CE[31/10/08]Have received instructions to deploy CREAM CE[7/11/08]Awaiting hardware from Fabric team[9/1/09]Aiming to have working CREAM CE by end of next week[16/1/09]Firewall holes requested, working on publishing, Derek will contact Alice to arrange testing[30/1/09]CREAM CE now deployed, Alice need 200GB on VOBOX + gridftp server, request for more disk in to Fabric team, will investigate using software area.[2009/02/13]Still waiting for hardware.[2009/03/06]Disk available, need to coordinate with Alice for installation.[2009/03/13]200GB disk to be added next week on the VOBOX.[2009/03/20]New disk added, closing
G-91 Medium Catalin Separate non LHC and LHC WMS service Closed Small vos are adversely affecting the WMS service for the LHC vos, will deploy a new WMS frontend for the small VOs, and restrict existing WMSes to LHC vos. [21/11/08] Hardware provided by Fabric team. Will start deployment and tests.[28/11/08]Will start proper testing next week[12/12/08]Postponed until after Christmas[9/1/09]WMS01 LHC only, WMS03 non-LHC only, WMS02 mixed will become LHC only shortly.
G-92 Medium Derek Deploy CE on NGS for EGEE seed VO Closed Work with RAL NGS to deploy CE service.[28/11/08]Installation proceeding, tracking down required rpms.[12/12/08]Making progress, can now submit jobs to NGS queue.[9/1/09]Need to confirm status of this activity with Andrew[16/1/09]Have a deadline to deploy on 29th Jan, job submission is working but jobs are not running correctly.[30/1/09]CE is published, but ops SAM tests don't run frequently enough, talking to Kevin to resolve, and have started implementing fallback plan to support vo on Tier1 through lcgce02[2009/03/06]Issue was resolved, job successfully submitted via NGS CE, need to arrange meeting about long-term support of this system.[2009/03/20]CE is now deployed - closing
G-93 Medium Matt Deploy top-level BDII for PPS testing Closed Deploy top-level BDII for release testing in PPS[2009/03/06]Fabric team is working on providing PPS hardware, if request comes in will fulfil it with older WN.[2009/03/20]Moving to low priority - no request has been received yet[2009/06/19]SL5 BDII is imminent, priority increasing again[2009/07/03]May become involved in PPS testing at a PPS deployment testing (earlier stage than certification)[2009/07/24]Moved to Matt. An SL5 release has been made and we are trying to deploy it[2009/7/31]Release deployed
G-94 Medium Catalin Arrange meeting with Production Team to discuss routine service updates Closed [2009/03/06]Closing as it's tracked elsewhere.
G-95 High Matt Develop Grid Services plan for changes through to Data Taking (June) Closed At next Experiments liaison meeting (4th March) Andrew wants to present the Tier 1's plan for changes up to June and the expected start of the LHC run.[2009/06/03]Done, closed
G-96 Medium Derek Move to pool accounts for production roles Closed CREAM CE has issue with inability to map to single accounts, historically we have had single accounts for sgm and production usage. While we have some case for this with sgm accounts - issues with permissions on software areas, and we limit them to 1 running job - the same is not true for production accounts. We should move to using pool accounts for production roles.[2009/03/06]Request in to Fabric team to create needed accounts[2009/03/20]Will contact experiment contacts to alert them that this will happen[17/4/09]Plan distributed to experiment contacts[2009/4/24]Need to confirm that scheduler can cope with new groups configuration.[2009/05/01]Initial tests were successful.[2009/05/08]Problems encountered with CMS - work underway to fix this, but ce04 stuck in drained state, which we want to resolve before starting the ce05 drain.[2009/05/15]ce04 now done - no problems seen so far, ce05 to drain over weekend, reinstall on Monday[2009/05/22]Final CE now moved - closing action
G-97 Medium Catalin LFC/FTS DB migration to RAC Closed Distributed proposal to decouple from R89 move, awaiting feedback. Aiming for end of April for migration.[2009/03/13]RAC OS is now 64bit. Minor configurations still needed (Fabric team). Oracle to be deployed.[17/4/09]Planning for 6th May[2009/4/24]Now confirmed for 6th May[2009/05/08]Migration complete - closing
G-98 Medium Catalin Software RAID callouts Closed Define list of hosts using SW RAID and criticality of these.[2009/3/20]Waiting for Fabric team to add config to Nagios[2009/03/27]List of hosts sent to Fabric team[2009/5/08]Tests added to Nagios - closing
G-99 Medium Derek Deploy ngs vo on CE Closed Request to become affiliate NGS vo, need to update voms config, check status of pool accounts then configure lcgce02[2009/4/24]NGS can submit job to the CE but not to scheduler as it uses the default queue and our default queue is not submittable to from the CE[2009/05/09]Default queue changed, NGS will add us to their top-level BDII - closing
G-100 Medium Derek Reinstall CEs to use RAID 1 setup Closed CEs (3,4,5) need to be reconfigured to use raid 1 setup[2009/05/01]lcgce03 done, the rest to follow[2009/05/15]ce04 done, ce05 next week[2009/05/22]ce05 now done - closing
G-101 Medium Catalin Add 1 more frontend to lfc.gridpp.rl.ac.uk Closed LFC frontends are overloaded, add 1 more frontend to improve load balancing[2009/4/24]Hardware, DNS and firewall received, now installing host[2009/05/08]Aiming to install next week[2009/05/15]Installed but not added to DNS round robin yet[2009/5/22]Now added to production (lcglfc0447.gridpp.rl.ac.uk), closing


G-102 Medium Catalin Separate LFC for Atlas? Closed Atlas may wish a separate LFC. Carmine believes that it should be possible without additional Oracle licenses[2009/05/01] A response was sent to ATLAS, we aim to separate the LFC by end of July[2009/05/15]Procedure is simple - may be able to do before end May - have asked Atlas[2009/5/22]Have decided on end of July[2009/06/19]Need information from CERN on ways to ban users once split occurs[2009/06/24]Have received information, aiming for 3rd or 4th week in July.[2009/07/03]Scheduled for 20th July, DNS request submitted[2009/07/24]Cancelled downtime, will be rescheduled in August[2009/7/31]Frontend config change in 1st week of August, backend change on 26th August[2009/08/14]Rescheduled for August 20th but now in doubt, because of machine room issues[2009/08/21]LFC split - closing
G-103 Medium Catalin New vo box for Alice Closed Alice need new vo box to test SL5, hardware requested[2009/05/22]Hardware received - installation in progress[2009/06/05]Now in production - closing
T-104 Medium Matt Remove intra-VO fairsharing for LHCb Closed With LHCb user and production jobs now belonging to different primary groups, it isn't possible with the current (Maui) scheduling scheme to avoid intra-VO fairshares for LHCb. Shares have been set to 50% for each group pending investigation of reworking the scheduling policies, which may require changes to the configuration for all VOs.[2009/5/22]Derek suggests moving lhcb prod accounts to same group as lhcb normal pool accounts - will discuss with Matt[2009/06/19]In progress[2009/06/26]Closed, not an issue as LHCb don't queue work, will reopen if becomes an issue
T-105 Medium Matt LHCb concerned about memory limits on queues Closed Have received an e-mail from LHCb querying our position on the memory limits we have on queues.[2009/06/19]Closing, no action required at the moment
G-106 Medium Derek Improve monitoring on CREAM CE Closed Alice encountered problems submitting jobs to the CREAM CE, which were not caught by our monitoring - we need to expand our monitoring to cover this.[2009/06/22]Test written, request into Fabric team to add to Nagios config[2009/08/21]Test added to Nagios - closing
T-107 Medium Derek Improve monitoring on LCG CE Closed ATLAS encountered problems submitting jobs to the LCG CE, which were not caught by our monitoring - we need to expand our monitoring to cover this.[2009/06/19]Have patched JobManager to stop it creating a large number of directories in the same directory, will submit to GGUS[2009/06/26]Patch not successful, will implement cron job to clean up and monitor for reaching limit, will also discuss with Atlas to see why directories are not cleaned up after job completion.[2009/07/03]Need to implement monitoring for hitting limit[2009/09/11]Check implemented and in place - closing
T-108 High Catalin Plan LFC/FTS RAC upgrades Closed Need to schedule resilience upgrades for LFC/FTS RAC for next period in which changes can be made before data taking.[2009/07/31]Aiming to do one at same time as LFC split, and one other later[2009/08/7]2 At-risks and 1 downtime (26/8/09) to do this
T-109 Low Catalin Separate Ganglia Grid Services group Closed Need to assess pros/cons of splitting the single Services_Grid Ganglia group.[2009/10/23]Closed, not going to be implemented


T-110 Medium Matt Audit hotswap status of servers Closed 2009-09-07 Need to check which servers are ready for hotswap, and check if reinstallation is required for those that are not. Audit done; added actions to track this.
T-112 Medium Matt Track baseline requirements Closed Decide how to monitor services to check that we are compliant with baseline services list[2009/07/31]Tracking on GridPP wiki


G-113 Medium Andrew Document process for adding DNs with SGM mapping on VOBOXs Closed 2009-09-07 Process not documented, and RPM spec file needs to be generated.
G-114 Medium Catalin Install SL5 VOBox Closed SL5 VOBox is required for Alice, this is expected to be available after the 1 October freeze date.[2009/10/23]Kickstart completed, awaiting hardware[2009/10/30]Making progress, aiming to complete next week[2009/12/11]Done


T-116 Medium Alastair Perform security audit of VO software directories Closed [2009/10/23]Nothing found yet, but double checking[2009/10/30]Investigating automating check every few months[2009/12/11]Jonathan Wheeler fixing some Nagios problems with script, will open new action if needed
T-117 Medium Andrew Fix unpublished CREAM CE records in APEL Closed [2009/10/23]October is checked and okay, still need to check August [2009/10/30]August has been checked
T-120 High Matt Determine LHCb diskserver/service class requirements for Q4 deployment Closed [2009/10/23]Done
T-121 High Alastair Determine ATLAS diskserver/service class requirements for Q4 deployment Closed [2009/10/23]Done
T-122 High Andrew Determine CMS diskserver/service class requirements for Q4 deployment Closed [2009/10/23]Done
T-123 Medium Catalin Determine ALICE diskserver/service class requirements for Q4 deployment Closed [2009/10/23]Done