Operations Bulletin 160712

Bulletin archive

Week commencing 9th July 2012

Task Areas

General updates

Monday 9th July

Problems with certificates and Kashif's email address appearing!
EGI and WLCG reliability/availability figures for June now available
On Tuesday there is a meeting on extending CE capabilities to deal with whole-node and multi-core needs. One option is dedicated queues, but the WL TEG believed a better solution is to let the VOs specify their requirements and have the CE handle them.

Monday 2nd July

Tier-2 quarterly reports requested by Wednesday 18th July
There is now an LHCb CVMFS sites map

Tier-1 - Status Page

Tuesday 10th July

The Castor Information Provider ("CIP") was updated yesterday to resolve an issue of over-reporting of disk capacity. No other significant changes.

Storage & Data Management - Agendas/Minutes

Considering input to a community support model for DPM and possible alternatives.

Wednesday 20th June

snoplus needs/plans on agenda last week.
The collaboration will depend on the RAL LFC and are looking to increase Tier-2 usage – current needs of 10TB/site will increase to 20TB/site.
Data taking will start in the autumn and continue for 6 months.

Accounting - UK Grid Metrics HEPSPEC06

Wednesday 6th June - Core-ops

Request sites to publish HS06 figures from new kit to this page.

Please would all sites check the HS06 numbers they publish. Will review in detail on 26th June.

Friday 11th May - HEPSYSMAN

Discussion on HS06 reminding sites to publish using results from 32-bit mode benchmarking. A reminder for new kit results to be posted to the HS06 wiki page. See also the blog article by Pete Gronbech. The HEPiX guidelines for running the benchmark tests are at this link.

Check publishing via: http://gstat2.grid.sinica.edu.tw/gstat/summary/Country/UK/

Documentation - KeyDocs

Tuesday 3rd July

Started a page on stale documents. Please update this page if you find documents or pages that need attention.

Wednesday, 6th June

Released a document, hep.ph.liv.ac.uk/~sjones/VomsSnooper.odt, that describes how to

Maintain site VOMS info document for the approved VOs
Check a site's VOMS records correspond exactly with CIC portal
Create new site VOMS records direct from CIC portal, without manual transcription

Note: I'm accepting tips from GridPP core task members etc. about other use cases for these processes. The document will be converted to wiki format and made available in the normal way. Next jobs:

review the logic/sequence of the VOMS admin process, document it if it works, fix it if it doesn't.
create standard baseline for proxy renewal process, and write it up in wiki.

Note: I'm accepting tips from other GridPP core team members etc. for document priorities. Please think about where the problems lie (i.e. what costs us yet is easy to fix) and get back to me.

Tuesday, 29th May

VOMS Records in GridPP Approved VO list now up to date with CICs Portal XML. This can be used by Site Admins to ensure their site-info.def/vo.d directories are up to date. A tool, SidFormatter, will be released this week to facilitate comparison with the benchmark. A process has been devised to ensure that GridPP Approved VO is kept up to date to within a week of CIC Portal changes. Consultation to be made about further fields that we may wish to advertise in this manner.
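The comparison of site records against the benchmark described above could be sketched roughly as follows. This is not SidFormatter or VomsSnooper; the function name and the record layout (a dict mapping each VO name to a set of VOMS server strings) are assumptions made purely for illustration.

```python
def compare_records(site, reference):
    """Compare site VOMS records against a reference list.

    Both arguments are hypothetical dicts: VO name -> set of server strings.
    """
    site_vos, ref_vos = set(site), set(reference)
    return {
        # VOs in the reference list that the site has not configured
        "missing": sorted(ref_vos - site_vos),
        # VOs configured at the site but absent from the reference list
        "extra": sorted(site_vos - ref_vos),
        # VOs present in both whose records do not match exactly
        "differing": sorted(
            vo for vo in site_vos & ref_vos if site[vo] != reference[vo]
        ),
    }
```

A site admin would feed in whatever was parsed from the local vo.d directory and from the approved VO list; any non-empty entry in the result points at a record worth checking by hand.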

Friday 27th April

Appeal for a volunteer to enhance "Grid User Crash Course" (https://www.gridpp.ac.uk/wiki/Grid_user_crash_course) with simple use case for dependable proxy renewal for long jobs, as this is a recurrent requirement that has caused multiple queries on TB_SUPPORT.

Interoperation - EGI ops agendas

Tuesday 3rd July

There was an EGI meeting yesterday. Minutes have been uploaded.

Monitoring - Links MyWLCG

Monday 2nd July

DC has almost finished an initial ranking. This will be reviewed by AF/JC and discussed at the 10th July ops meeting.

Wednesday 6th June

Ranking continues. Plan to have a meeting in July to discuss good approaches to the plethora of monitoring available.

Current priority is ranking the tools available.

Glasgow dashboard now packaged and can be downloaded here.

On-duty - Dashboard ROD Rota

Monday 9th July - KM

Quiet week. Nothing to report. Just a reminder that the Dashboard is now getting results from gridppnagios at Lancaster, so if you want to see the raw results for something, go to https://gridppnagios.lancs.ac.uk/nagios

Monday 2nd July - AM

Good week overall. No major region-wide problems. Tickets were opened towards the end of the week for job failures at Durham and Lancaster, and for the WMS at Glasgow (which is in downtime now). QMUL continues to have machines that are not in production raising alarms outside of downtimes, and clearing these counts against the ROD quality.

Rollout Status WLCG Baseline

Monday 11th June

EMI2 is released but not in Staged Rollout yet. Buyers beware.

Thursday 10th May

The CREAM CE and the WMS which were released at the end of April have finally gone into Staged Rollout.
Call for more sites to take part in EMI-2 rollout tests.
The overall SR contributions are in this table.

Friday 27th April

Updated version information on rollout page
WN scan indicates some sites not keen on OS updates to those nodes.

Security - Incident Procedure Policies

Monday 25th June

Rota availability responses slow
Is anyone following up on SSC5/6?
Stratuslab VM (ex UK)
gridftp

Services - PerfSonar dashboard

Monday 9th July

The perfsonar link has changed to a new production instance (see link above)
A couple more sites have been added in the last week.
The GridMon boxes are to be returned to Daresbury!

Tickets

Monday 9th of July, 13:30 BST

21 Open UK GGUS tickets this week.

NEW https://ggus.eu/ws/ticket_info.php?ticket=84066 Durham have an availability/reliability ticket for June.

NGI https://ggus.eu/ws/ticket_info.php?ticket=83794 There's been a request to update the latitude & longitude information for sites in the gocdb, although this is being handled "centrally" so sites shouldn't have to worry about it.

https://ggus.eu/ws/ticket_info.php?ticket=80259 Mark has set himself up as a temp VO manager for neurogrid.incf.org, and opened another ticket (https://ggus.eu/ws/ticket_info.php?ticket=83926) to cover the final VO registration steps in the CiC portal.

QMUL https://ggus.eu/ws/ticket_info.php?ticket=83773 atlas were being hit by cvmfs timeouts (https://savannah.cern.ch/support/?129468). Chris rolled out the new (beta) version of cvmfs (cvmfs-2.0.18-0.3.3574svn), which seems to be settling things. Similar to Glasgow's ticket https://ggus.eu/ws/ticket_info.php?ticket=83283 (although I believe Glasgow have bigger fish to fry at the moment).

https://ggus.eu/ws/ticket_info.php?ticket=83587 SNO+ are working on rolling out "git" in their software area, currently working out a few kinks. Matt M asks what the situation is for a cvmfs host at RAL?

SUSSEX https://ggus.eu/ws/ticket_info.php?ticket=81784 Emyr has come back from his holiday to expired certificates and troubles with Jeremy M losing his CA status! Although everyone's having a hard time with certs this week.

ECDF https://ggus.eu/ws/ticket_info.php?ticket=80152 Wahid's very succinct reply to Matt H's query on this ticket's progress made me smile. But it raises the question: are these short "ticket proddings" useful? Is there an alternative?

SOLVED CASES RHUL https://ggus.eu/ws/ticket_info.php?ticket=83933 https://ggus.eu/ws/ticket_info.php?ticket=83912 Suffered a DPM crash last Friday. Have you upgraded to a "patched" release of DPM? (Moving to glite 1.8.2-5 fixed things for Lancaster; the latest EMI releases should be immune to the common causes of these crashes.)

TICKETS FROM THE UK https://savannah.cern.ch/support/?130203 T2K have requested to be added as a VO to GGUS.

Tools - MyEGI Nagios

Monday 2nd July

Switched on the backup Nagios at Lancaster and stopped the Nagios instance at Oxford. Stopping the instance at Oxford means that it is no longer sending results to the dashboard and central DB. Keeping a close eye on it and will revert to the original setup if any problems are encountered.

VOs - GridPP VOMS VO IDs Approved

Wednesday 6th July

Cross-checking VOs enabled vs VO table.
Surveying VO-admins for problems faced in their VOs.
SNO+ have asked about using git to deploy software - what are the options?

Site Updates

Monday 25th June

N/A

Meeting Summaries

Project Management Board - Members Minutes Quarterly Reports

Monday 2nd July

No meeting. Next PMB on Monday 9th.

GridPP ops meeting - Agendas Actions Core Tasks

Tuesday 8th May - link Agenda Minutes

ATLAS DATADISK now being used for production input files
Deploy perfsonar-ps (for LHC compatibility), but run a Perfsonar-MDM portal to collate the information
Target date for perfsonar-ps at sites is the end of July
KeyDocs still not in place for all areas

Tuesday 1st May 2012 - Agenda Minutes

Trying the present bulletin as a conduit for information
7 sites below 90% in ATLAS monitoring!
Instructions for small VO space-token setup/usage requested
EMI WN tarball now available - comment in ticket
Not every site running latest SL5 OS

RAL Tier-1 Experiment Liaison Meeting (Wednesday 13:30) Agenda EVO meeting

Wednesday 27th June

The move of the Castor databases to Oracle 11 this morning has been completed successfully.
There were useful discussions with MICE representatives present about their resource and other requirements for reconstruction of data at the Tier1.
Operations report

WLCG Grid Deployment Board - Agendas MB agendas

June meeting Wednesday 13th June

WLCG meeting notes

Welcome [MJ]

August meeting is cancelled. October meeting is in Annecy.
EGI Technical Forum 17-21st September: http://tf2012.egi.eu/
HEPiX Fall – 15th-19th October.

Post-TEG Working Groups [Ian Bird]

Large number of WGs proposed.
DM&S: Benchmarking. Federation. Networking
WLM: Extensions of CE (multi-core; whole node; pilot support). Information System.
Security: Proposals coming
Database: share experiences.
Operations: m/w sw process. Monitoring.
Teams approaches: Operations coordination team. Sharing experiences/tech watch (pre-GDB discussion)
Possibly missing? Cloud. SRM (but to be more generic in title!). Batch systems.

Storage Accounting (John Gordon)

StAR
Plan to publish to APEL but in EMI-3 for May 2013
Interim possibility to use gstat
Noted that information that is published is not precise.

Information System Status and Evolution (Maria Alandes Pradillo)

Caching BDII. In since February. Documentation improved. Use variable BDII_DELETE_DELAY. (https://tomtools.cern.ch/confluence/display/IS
Failover. LCG_GFAL_INFOSYS: use BDIIs in the region as entries 1 & 2 and CERN as entry 3.
On data quality

glue-validator (in EMI-1 and 2)
GLUE 2 still to be deployed widely
Future work (EMIR; ginfo and IS monitoring/metadata). Question of whether OSG is fully engaged?
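The failover ordering noted above (two regional BDIIs first, CERN last) can be sketched as a small helper that builds the value for LCG_GFAL_INFOSYS. This assumes the variable takes a comma-separated list of host:port entries tried in order and that the BDIIs listen on port 2170; the function name and any hostnames passed in are illustrative, not a recommended configuration.

```python
def build_infosys(regional, fallback, port=2170):
    """Return a comma-separated BDII list: up to two regional hosts, then a fallback.

    Hostnames are supplied by the caller; the default port is an assumption.
    """
    entries = []
    for host in list(regional[:2]) + [fallback]:
        # Append the default port only when the caller did not give one
        entry = host if ":" in host else f"{host}:{port}"
        if entry not in entries:  # skip duplicates, keep first position
            entries.append(entry)
    return ",".join(entries)
```

The result would then be exported in the environment of the UI or WN, for example through whatever mechanism the site already uses to set LCG_GFAL_INFOSYS.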

AAI on WN update (Romain Wartel)

Security controls – central banning body required
ARGUS locally needed (to pull banning lists from central ARGUS)
Ownership of traceability. VO-site collaboration needed to cover all cases
Recommendations to fulfill logging and traceability policy on WN.
Not currently possible to use clouds (VMs) in a way that conforms with WLCG security policies.
Critical proxy extension (ALICE less limited)
Proxy lifetime - reduce back to 24hrs? Balanced compromise between complexity and risk. Proxy credentials can not be revoked.
Pool account recycling – recycle only after 6 months.

EMI update (Cristina Aiftimiei)

EMI-1 at update 15 (23.04.2012)
EMI-1 Full support & maintenance until 28.02.2012. Updates till 31.10.2012.
EMI-2 released 21.05.2012. Supports SL5 and SL6. Some Debian6.
New products: CANL, EMIR, EMI-Nagios, Pseudonymity, WNoDeS.

Hydra and WMS not released yet.
Some backward incompatibilities due to existing EPEL package names.
UI/WN tarballs in the next update.

Globus SW support at OSG

Discussions including use of Cream/Glue2; this to be investigated as it impacts use of the WMS

EMI Sustainability Plans (Alberto Di Meglio)

The end of EMI is the end of the coordination between product teams – not the end of those product teams.

Ian Bird: the outcome of the above WLCG-EMI-EGI meeting needs to be how do we manage software in the future, also to discuss: how do we do certification, staged rollout and deployment in general.

Communicating Machine Features to Batch Jobs (Tony Cass)

Jeff will share a script for PBS to test implementation using /etc/machinefeatures.
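For sites wanting to experiment before that script is shared, a job-side reader might look like the sketch below. It is not Jeff's script. It assumes /etc/machinefeatures is a directory holding one small file per feature, with the file name as the key and the file content as the value; that layout and the function name are assumptions.

```python
import os


def read_features(directory="/etc/machinefeatures"):
    """Read every regular file in the directory into a {name: content} dict."""
    features = {}
    if not os.path.isdir(directory):
        # Nothing published on this node: report no features rather than fail
        return features
    for name in sorted(os.listdir(directory)):
        path = os.path.join(directory, name)
        if os.path.isfile(path):
            with open(path) as handle:
                features[name] = handle.read().strip()
    return features
```

Returning an empty dict when the directory is absent lets the same job wrapper run unchanged at sites that have not yet deployed anything.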

MUPJ – gLexec update (Maarten Litmaath)

gLExec flag needs to be set in GOCDB
Status at http://cern.ch/go/PX7
Deployment guide https://twiki.cern.ch/twiki/bin/view/LCG/GlexecDeployment
CMS pushing because of July security challenge.

Federated Identity Vision (Romain Wartel)

Document presented at last GDB. Approved by MB on 5th June.
Pilot project for WLCG - any volunteers to be involved?

NGI UK - Homepage CA

Monday 2nd July

Next meeting is on 9th July.

Events

WLCG workshop - 19th-20th May (NY) Information

CHEP 2012 - 21st-25th May (NY) Agenda

UK ATLAS - Shifter view News & Links

Thursday 21st June

Over the last few months ATLAS have been testing their job recovery mechanism at RAL and a few other sites. This is something that was 'implemented' before but never really worked properly. It now appears to be working well, allowing jobs to finish even if the SE is down or unstable when the job finishes.

Job recovery works by writing the output of the job to a directory on the WN should it fail when writing the output to the SE. Subsequent pilots will check this directory and try again for a period of 3 hours. If you would like to have job recovery activated at your site, you need to create a directory which (atlas) jobs can write to. I would also suggest that this directory has some form of tmpwatch enabled on it which clears up files and directories older than 48 hours. Evidence from RAL suggests that it's normally only 1 or 2 jobs that are ever written to the space at a time and the space used is normally less than a GB. I have not observed more than 10GB being used. Once you have created this space, please email atlas-support-cloud-uk at cern.ch with the directory (and your site!) and we can add it to the ATLAS configurations. We can switch off job recovery at any time if it does cause a problem at your site. Job recovery would only be used for production jobs, as users complain if they have to wait a few hours for things to retry (even if it would save them time overall...)
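The 48-hour clean-up suggested above is normally left to tmpwatch or a similar tool. Where that is not available, a minimal stand-in might look like the following sketch; the function names are illustrative, it only removes regular files (directories are left in place), and the recovery directory path is whatever the site has chosen.

```python
import os
import time

MAX_AGE_SECONDS = 48 * 3600  # the 48-hour retention suggested in the text


def find_stale(root, max_age=MAX_AGE_SECONDS, now=None):
    """Return sorted paths of files under root not modified within max_age seconds."""
    now = time.time() if now is None else now
    stale = []
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            path = os.path.join(dirpath, name)
            try:
                if now - os.path.getmtime(path) > max_age:
                    stale.append(path)
            except OSError:
                # File vanished between listing and stat, e.g. a pilot moved it
                continue
    return sorted(stale)


def purge(root, max_age=MAX_AGE_SECONDS, now=None):
    """Delete stale files under root and return the paths actually removed."""
    removed = []
    for path in find_stale(root, max_age, now):
        try:
            os.remove(path)
            removed.append(path)
        except OSError:
            continue
    return removed
```

Running `purge` from cron against the recovery directory once an hour would be one way to use it. Since pilots only retry for 3 hours, anything older than 48 hours should be safe to drop.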

UK CMS

Tuesday 24th April

Brunel will be trialling CVMFS this week, will be interesting. RALPP doing OK with it.

UK LHCb

Tuesday 24th April

Things are running smoothly. We are going to run a few small scale tests of new codes. This will also run at T2, one UK T2 involved. Then we will soon launch new reprocessing of all data from this year. CVMFS update from last week; fixes cache corruption on WNs.

UK OTHER

Thursday 21st June - JANET6

JANET6 meeting in London (agenda)
Spend of order £24M for strategic rather than operational needs.
Recommendations to BIS shortly
Requirements: bandwidth, flexibility, agility, cost, service delivery - reliability & resilience
Core presently 100Gb/s backbone. Looking to 400Gb/s and later 1Tb/s.
Reliability limited by funding not ops so need smart provisioning to reduce costs
Expecting a 'data deluge' (ITER; EBI; EVLBI; JASMIN)
Goal of dynamic provisioning
Looking at ubiquitous connectivity via ISPs
Contracts were 10yrs wrt connection and 5yrs transmission equipment.
Current native capacity 80 channels of 100Gb/s per channel
Fibre procurement for next phase underway (standard players) - 6400km fibre
Transmission equipment also at tender stage

Industry engagement - Glaxo case study.
Extra requirements: software coding, security, domain knowledge.
Expect genome data usage to explode in 3-5yrs.
Licensing is a clear issue

To note

Tuesday 26th June

On Tuesday 31st July 2012 the GOCDB read-write portal at https://gocdb4.esc.rl.ac.uk/portal will be decommissioned and replaced by a single read-write version at https://goc.egi.eu/portal. This will consolidate the service (including the PI) under the same URL. All GOCDB client maintainers are requested to ensure their PI configuration URLs point to https://goc.egi.eu/portal.
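Client maintainers could check their configurations for the old address with something like the sketch below. The two base URLs are taken from the notice above; the function names and the idea of scanning configuration text line by line are assumptions for illustration, and any path after the base URL is carried over unchanged.

```python
OLD_BASE = "https://gocdb4.esc.rl.ac.uk/portal"
NEW_BASE = "https://goc.egi.eu/portal"


def migrate_url(url):
    """Rewrite a URL on the old portal to the new base, keeping any suffix."""
    if url.startswith(OLD_BASE):
        return NEW_BASE + url[len(OLD_BASE):]
    return url


def stale_lines(config_text):
    """Return (line number, line) pairs that still mention the old host."""
    return [
        (number, line)
        for number, line in enumerate(config_text.splitlines(), 1)
        if "gocdb4.esc.rl.ac.uk" in line
    ]
```

`stale_lines` only reports matches, so the actual edit can still be reviewed by hand before the July switch-over.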
