Deployment team discussion and open tasks/issues list
This page captures deployment and operations tasks that need to be actioned. The list is reviewed and updated at least once a month.
Monday 18th May 2009
- Review deployment webpages and wiki (18/05/09)
- Write-up procedures for removing sites from the experiment blacklists (18/05/09)
- Cross-check APEL, site batch logs and experiment accounting data (18/05/09)
- Understand where we are going with the federated Nagios service (18/05/09) [Also from DTEAM F2F: "Further information is required about the EGEE message bus (for use with Nagios) and how our testing would interface with central instances."]
- Document technical options for using resources as T3 [From DB] (18/05/09)
The technical implementation of T3 resources (shared with a T2) is currently not well understood.
- Investigate what other countries do (re CPU shares and priority resources) and report back to the Deployment Board (18/05/09)
- Preventing sites from getting bad versions of releases. UKI release testing and repository strategy (18/05/09)
- Interaction with core JANET operations [With A Sansum] (18/05/09)
- Publishing software tags for NGS (18/05/09)
- WMS and top-level BDII strategy - where and by when (18/05/09)
Decided at the DTEAM F2F that the current WMSes were sufficient in number but unstable.
Decided at the DTEAM F2F to deploy one top-level BDII per Tier-2.
- Metrics for sites - what is the core/minimal set of pages to watch (18/05/09)
- VO enablement - are the VO ID cards and "GridPP supported VO page" up-to-date (18/05/09)
- VOMS policies such as data retention periods need to be reviewed. (18/05/09)
- UK SAM instance placement (at RAL?) (18/05/09)
- HEP admin training requirements (18/05/09)
- SGE integration with CREAM. Is the testing sufficient? (18/05/09)
- Publishing logical/physical CPU numbers correctly (18/05/09)
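As a sketch of how sites could cross-check what they publish, the logical and physical CPU counts appear as GLUE attributes in the site's BDII and can be inspected with a standard LDAP query. The host name below is a placeholder; the port and search base assume a typical gLite resource BDII deployment:

```shell
# Query a site's resource BDII (host name is hypothetical) for the
# GLUE SubCluster CPU counts it publishes to the information system.
ldapsearch -x -LLL \
  -H ldap://site-bdii.example.ac.uk:2170 \
  -b mds-vo-name=resource,o=grid \
  '(objectClass=GlueSubCluster)' \
  GlueSubClusterLogicalCPUs GlueSubClusterPhysicalCPUs
```

Comparing the returned values against the batch system's actual core and socket counts would show whether a site is publishing correctly.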
- Impacts from T1 building move - where can T2s help (e.g. WMSes) (18/05/09)
- UKI regional procedures - e.g. site escalation and suspension (18/05/09)
- Job efficiencies as seen at the sites - variations in tool outputs (18/05/09)
A review of site interventions to correct or remove inefficient jobs shows that very few GridPP sites are actively doing this at present. Since CPU capacity is still under-utilised overall, this is not an immediate concern, but site feedback on these jobs can help users to correct badly behaving jobs. Some sites have implemented a monitoring system using MonAmi which has shown potential to automate spotting these problem jobs, but issues with batch system reporting mean we may gain little by recommending that other sites adopt this system at the current time.
- Dealing with legacy data and removing SEs (18/05/09)
- Site fairshare systems (18/05/09)
- Communicating with users/sites not in the UK (18/05/09)
There was a recent problem with a VOMS upgrade (a key was not replaced after a certificate change) which led to about 24 hours of disruption for VOs hosted on the GridPP VOMS. Several VOs were in contact, including phenogrid, supernemo, camont and mice.
- Implementing and following up on resilience concerns (18/05/09)
- GridPP wide disaster planning - to interface to T1 scheme (18/05/09)
- Follow up to the GridPP-NGS meeting: http://indico.cern.ch/conferenceDisplay.py?confId=42642 (18/05/09)
- Helpdesk/GGUS features and requests (18/05/09)
Add links and subject lines to the weekly summary reports.
- Upcoming meetings dates and links page (18/05/09)
- Follow up with non-HEP VOs - a number of VO activities need to be tracked (18/05/09)
- Incident reports for GridPP wide problems (T1 has this in hand for internal problems) (18/05/09)
- Sharing hardware requirements, selections and product information (18/05/09)
- File persistency at T2s (18/05/09)
- Feedback from and input to the TMB (18/05/09)
- Where do users go to understand the state of the infrastructure (19/05/09)