GridPP ops meeting - Agendas Actions Core Tasks
|
Tuesday 24th April 2012- Agenda Minutes
- Expected this week: BLAH, DPM, Hydra, GFAL/lcg_util, StoRM and WMS
- UI/WN tarball: There are testing releases of the tarballs, linked off this ticket.
- At T1 two new FTS front end systems on virtual machines.
- Networking monitoring: consensus is to deploy perfsonar (US version)
- Reminder about 11th/12th May HEPSYSMAN
|
WLCG Grid Deployment Board - Agendas MB agendas
|
Wednesday 18th April - Agenda Summary report
Introduction
- Suggestions for future format - TEGs/working groups in pre-GDB slot
- Need to look at public cloud models
TEG outlines
- DM & Storage: Look at http and webdav. gridFTP needed medium term. FTS3 plan to include http. LFC not needed medium term. Security more work.
- Ops Tools: Awaiting task prioritization. Common monitoring (including WLCG coordination body), CVMFS, common sysadmin training, review services, endorse EPEL, apps repository. Develop GGUS and broadcasts. Expand pre-release uptake. Revisit SSB.
- Workload Management: Use glexec. Extend CE for streamed submission, whole node and multi-node jobs, job types (i/o or CPU bound). Remove WMS and simplify InfoSystem.
- Security: Risk analysis done. Fine-grained traceability issues. Data ownership and other issues TBC. Lack of stakeholder input.
- Databases: Use COOL. More Frontier usage. WLCG to monitor squids. Interest in NoSQL options.
PerfSonar
- Need standard. It aids diagnosis - but alert who? Two boxes: latency and bandwidth. Configs flexible. Main issues firewalls and congested GPNs.
Middleware
- EMI-1: Update 15 due 20th April. Fixes for BLAH, WMS, DPM/LFC, GFAL, Proxy renewal, VOMS-admin. gLite security fixes end 30th April (WN and UI covered till 30th Sept.).
- EMI-2: SL5 and SL6 builds now >95%. Release due 7th May.
- UMD and WLCG: EMI tests seek elimination of bugs. EGI tests seek continuous service delivery.
SHA2 & RFC proxies
- IGTF want CAs using SHA-2 ASAP. Target Jan 2013.
- Therefore need to move to RFC proxies (away from Globus ones). SHA-1 risks are dCache and BestMan. RFC support needed for middleware but most components ok in EMI-2.
Glexec deployment
- Check regional tests. For UK click here.
- Sites need to flag support in GOCDB
- Experiments have plans to use
OSG software update
- OSG3 now using RPM format (via Koji). Pushing EPEL. Best support RHEL. Tarballs may come soon.
|
NGI UK - Homepage CA
|
January Management meeting?
Friday 9th March NGS-CA-TAG meeting
- Priorities discussion for the CA. Plans to be clarified for future meeting.
- Email addresses now removed from certificates.
|
Events
|
HEPiX - 23rd-27th April (Prague) Agenda
HEPSYSMAN - 10th-11th May (RAL) Agenda
WLCG workshop - 19th-20th May (NY) Information
CHEP 2012 - 21st-25th May (NY) Agenda
|
UK ATLAS - Shifter view News & Links
|
Tuesday 24th April
- ATLAS started to run reco jobs at T2s more extensively. These require bigger input data sets. They should be copied to the DATADISK space token. Some are copying to PRODDISK. If you see PRODDISK filling up you should take action.
|
UK CMS
|
Tuesday 24th April
- Brunel will be trialling CVMFS this week, which will be interesting. RALPP doing OK with it.
|
UK LHCb
|
Tuesday 24th April
- Things are running smoothly. We are going to run a few small scale tests of new codes. This will also run at T2, one UK T2 involved. Then we will soon launch new reprocessing of all data from this year. CVMFS update from last week; fixes cache corruption on WNs.
|
UK OTHER
|
Tuesday 24th April
- T2K are having problems with WMS proxy renewal, and some WMSes are advertising support but don't actually work. Could be user error, but will need investigation.
|
Requests
- More sites needed to test EMI-2
|
|
Tier-1 Status Page
|
Tuesday 24th April
- Problem on one of the Atlas Castor headnodes caused by time drift.
- Problem with xrootd access to the AtlasStripDeg service class - traced to a configuration problem.
- Found an unnecessary restriction on our 4GB batch queue - a limit that we have raised.
- Added two new FTS front end systems on virtual machines.
|
Storage & Data Management - Agendas/Minutes
|
Wednesday 25th April
- "Exploding" DPMs. Bug <1.8.3?
- Document data model recommended for small VOs
- HEPiX approaching
|
Accounting - UK Grid Metrics HEPSPEC06
|
Tuesday 17th April
- SL presented models for disk storage accounting at GridPP28
- AF at GridPP28 presented impacts of changed ATLAS submissions
|
Documentation - KeyDocs
|
Friday 27th April
a) Background effort to address the shortcomings in the GridPP Approved VO list and detail records.
No RPM of LSC files is planned (after discussions with Christina), so primary approach
may be to generate VO Approved list by querying BDII of all VOs, and omitting rare or special ones.
Plans also in train to automate document (e.g. via VomsSnooper) whenever XML changes.
Manual transcription is far too error prone.
b) Rolled out into GridPP wiki a "Grid User Crash Course", based heavily on Ewan's cheat sheet.
New users and/or VOs may wish to consult this early on, to get basic feel of grid applications.
References the "Glite User Guide", which remains the best reference source for all user cases.
c) I appeal for a volunteer to enhance "Grid User Crash Course"
(https://www.gridpp.ac.uk/wiki/Grid_user_crash_course) with simple use case for
dependable proxy renewal for long jobs, as this is a recurrent requirement that has caused
multiple queries on TB_SUPPORT.
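For item a), the filtering step (keep widely supported VOs, omit rare or special ones) might look like the sketch below. It is illustrative only: it does not query a BDII, the site names and VO sets are made-up sample data, and the function name `approved_vos`, the `min_sites` threshold and the `exclude` set are assumptions rather than anything agreed.

```python
from collections import Counter


def approved_vos(site_vos, min_sites=2, exclude=frozenset()):
    """Return the sorted VO names supported by at least `min_sites` sites.

    site_vos:  mapping of site name -> iterable of VO names the site publishes
               (assumed to have been gathered from the BDII beforehand).
    min_sites: hypothetical threshold below which a VO counts as "rare".
    exclude:   hypothetical set of "special" VOs to omit regardless of count.
    """
    # Count each VO at most once per site.
    counts = Counter(vo for vos in site_vos.values() for vo in set(vos))
    return sorted(
        vo for vo, n in counts.items() if n >= min_sites and vo not in exclude
    )


# Made-up sample data, standing in for information published by sites.
sample = {
    "SITE-A": {"atlas", "cms", "ops", "vo.southgrid.ac.uk"},
    "SITE-B": {"atlas", "lhcb", "ops"},
    "SITE-C": {"atlas", "cms", "snoplus.snolab.ca"},
}

print(approved_vos(sample, min_sites=2, exclude={"ops"}))
```

With this sample data the call prints ['atlas', 'cms']. Any real threshold and exclusion list would need to be decided by whoever maintains the Approved VO document.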
|
Interoperation - EGI ops agendas
|
Monday 16th April - EGI ops agenda
- BDII Instability: Did we observe problems with BDII on April 12? (One for RoD?)
|
Monitoring - Links MyWLCG
|
- Glasgow dashboard now packaged and can be downloaded here.
|
On-duty - Dashboard ROD Rota
|
Monday 10th April - JW
- A few sites were caught out with the update to CA RPMs 1.46 just before the Easter break.
- There were intermittent problems with the Glasgow (srv022) and a RAL WMS, for which alarms were flapping. Glasgow still has problems.
Monday 16th April - AM
- Several sites in planned and unplanned downtimes still (following round of hardware upgrades?) but no UK-wide issues.
|
Rollout Status
|
Friday 27th April
- Updated version information on rollout page
- WN scan indicates some sites not keen on OS updates to those nodes.
|
Security - Incident Procedure Policies
|
- Phone meeting planned in early May to ensure continuity when Mingchao leaves
- SSC5 preparations to start soon.
|
Services - PerfSonar dashboard
|
- 23rd April requested network utilisation figures for March and April
- LHCONE meeting next week in Amsterdam
- Agreed to focus on perfsonar.
|
Tools - MyEGI Nagios
|
Sunday 8th April
- Lancaster Nagios backup hardware has arrived
- Nagios based VO testing setup for vo.southgrid.ac.uk. Nagios view is here. Tests run every 8 hours. The results are also fed back into the dashboard and alarms triggered.
- There is a bug which prevents multiple VO Nagios tests with one Nagios instance. Developers informed.
|
VOs - GridPP VOMS VO IDs Approved
|
a) Discussion about VO information in LSC files - EMI says no.
b) Tidying up VO information and gathering addresses for VO admin email list.
c) WMS issue for SNO+ fixed with Sussex UI update
|
|