Operations Bulletin 200812

Bulletin archive

Week commencing 13th August 2012

Task Areas

General updates

Tuesday 14th August

The WLCG July availability/reliability final figures are now online.

Monday 6th August

EGI Technical Forum (http://tf2012.egi.eu) will be held in Prague 17-21st September 2012
EGI Resource Centre survey on GPGPU. Please participate via https://operations-portal.egi.eu/broadcast/archive/id/705.
VOMS in transition service levels
ATLAS checks require MaxCPUTime and MaxWCTime to be set correctly. Also note that Panda queues have been found not always to match those in the BDII - please check.
Is it possible for a sysadmin to restart cream jobs? (CW)
Steve L's LFC tests now working again!
Getting at VO contact details - is there a clear process?
WLCG T2 Reliability & Availability report for July 2012: sam-reports.web.cern.ch/sam-reports/2012/201207/wlcg/WLCG_Tier2_Jul2012.pdf (updates by 13th).

Tuesday 30th July

T2K now has a queue in GGUS
Steve is reviewing VO software areas and tags
The 'latest' update on the GridMon boxes... there is a do nothing option!

Tier-1 - Status Page

Tuesday 14th August

No major operational issues to report for the last week.
Continuing test of hyperthreading, one batch of worker nodes has number of jobs increased further (from 18 to 20) on Thu 9th August.
As stated before: CVMFS available for testing by non-LHC VOs (including "stratum 0" facilities).
Site "At Risk" (in GOCDB) for part of the morning of Tuesday 21st August during site firewall reconfiguration. Will drain and stop the FTS during this time.

Storage & Data Management - Agendas/Minutes

Considering input to a community support model for DPM and possible alternatives.

Tuesday 24th July

Sam testing xrootd redirection on Glasgow test cluster - going well.

Wednesday 18th July

snoplus needs/plans on how (on grid resources) to "blind" data for analysis.
Plans for sites to finally upgrade/decommission glite3.1 Storage services ahead of Oct 1st deadline.
GLite to EMUI upgrade path.
Issues regarding PXE booting from 10G cards.
Plan to discuss UK role in community support of DPM.

Accounting - UK Grid Metrics HEPSPEC06

Wednesday 18th July - Core-ops

Still need definitive statement on disk situation and SL ATLAS accounting conclusions.
Sites should again check Steve's HS06 page.

Wednesday 6th June - Core-ops

Request sites to publish HS06 figures from new kit to this page.

Please would all sites check the HS06 numbers they publish. Will review in detail on 26th June.

Friday 11th May - HEPSYSMAN

Discussion on HS06 reminding sites to publish using results from 32-bit mode benchmarking. A reminder for new kit results to be posted to the HS06 wiki page. See also the blog article by Pete Gronbech. The HEPiX guidelines for running the benchmark tests are at this link.

Check publishing via: http://gstat2.grid.sinica.edu.tw/gstat/summary/Country/UK/

Documentation - KeyDocs

Tuesday 14th August

KeyDocs monitoring status: Grid Storage(7/0) Documentation(3/0) Ticket follow-up(3/0) Security(3/0) On-duty coordination(2/0) Regional tools(2/0) Monitoring(2/0) Accounting(2/0) Wider VO issues(2/0) Staged rollout(1/0) Core Grid services(1/0) Grid interoperation(1/0) Cluster Management(1/0) (brackets show total/missing)
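
The status line above packs each area as Name(total/missing). As a minimal sketch, not part of the KeyDocs tooling, such a line could be parsed in Python to list any areas with missing documents. The function name parse_keydocs_status is hypothetical and the sample string is a shortened copy of the line above.

```python
import re

def parse_keydocs_status(line):
    """Return {area: (total, missing)} from 'Area(total/missing)' entries."""
    pattern = re.compile(r"([A-Za-z][A-Za-z \-]*?)\((\d+)/(\d+)\)")
    return {m.group(1).strip(): (int(m.group(2)), int(m.group(3)))
            for m in pattern.finditer(line)}

# Shortened sample taken from the status line above.
status = "Grid Storage(7/0) Documentation(3/0) Ticket follow-up(3/0) Security(3/0)"
areas = parse_keydocs_status(status)

# Areas where at least one document is reported missing.
missing = [name for name, (total, miss) in areas.items() if miss > 0]
print(areas)
print("Areas with missing documents:", missing)
```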

Thursday 26th July

All the "site update pages" have been reconfigured from a topic oriented structure into a site oriented structure. This is available to view at https://www.gridpp.ac.uk/wiki/Separate_Site_Status_Pages#Site_Specific_Pages

Please do not edit these pages yet - any changes would be lost when we refine the template. Any comments gratefully received, contact: sjones@hep.ph.liv.ac.uk

Wednesday 18th July - Core ops

Review of stale documents in coming weeks
Plan to setup a site template page to cover 'special' site update pages that are no longer updated

Tuesday 3rd July

Started a page on stale documents. Please update this page if you find documents or pages that need attention.

Interoperation - EGI ops agendas

The next EGI ops meeting is on 30th August. The next NGI management meeting is w/c 20th August.

Monday 30th July (last EGI ops meeting agenda.)

Check which VOs have tested/used EMI WNs. There is a need to avoid any problems when End Of Life for gLite 3.2 WNs is announced.

EMI: Forthcoming, 9th August: EMI-1: BDII, CREAM, WMS. EMI-2: BDII, Trustmanager. The CREAM and WMS updates look like they fix some useful bits.

StagedRollout:

IGTF: CA 1.49, SR this week.

The SAM/Nagios 17 is still in SR and has problems. A patch is ready and should be done soon. This is needed for support of EMI-2 WNs.

UMD1.8: Blah. gsoap-gss, Storm and IGE Gridftp to be released soon.

Lots of stuff for the UMD-2; note the WN problems. There is some discussion, it looks like the EMI-2 WN will not be released to the UMD until the Sam/Nagios problems are solved.

WMS vulnerabilities: Some discussion. UK all patched and up-to-date, yay!

EMI WN: There's a wiki page https://wiki.egi.eu/wiki/NGI-VO_WN_tests which EGI would like filled in a bit. (NGI-CZ have been EMI-1 WN only for months, so might cover lots of things).

Monitoring - Links MyWLCG

Monday 2nd July

DC has almost finished an initial ranking. This will be reviewed by AF/JC and discussed at 10th July ops meeting

Wednesday 6th June

Ranking continues. Plan to have a meeting in July to discuss good approaches to the plethora of monitoring available.

Current priority is ranking the tools available.

Glasgow dashboard now packaged and can be downloaded here.

On-duty - Dashboard ROD rota

Monday 23rd July - KM

Quiet week. Nothing to report.

Monday 16th July - DB

Quiet week. Sheffield is showing an alarm for a machine that's not in the GOCDB, that might be worth following up. QMUL have an alarm for a machine not in production, they are aware of it, but short of closing the alarm every couple of days I haven't got any better suggestion on how to deal with it.

Monday 9th July - KM

Quiet week. Nothing to report. Just a reminder that the Dashboard is now getting results from gridppnagios at Lancaster. So if you want to see raw results for something, go to https://gridppnagios.lancs.ac.uk/nagios

Rollout Status WLCG Baseline

Tuesday 31st July

The Staged Rollout pages (now separated into EMI1 & 2) and the page listing the deployed versions are extracted from the BDII, so they should all be reasonably up-to-date:
http://www.hep.ph.ic.ac.uk/~dbauer/grid/staged_rollout.html
http://www.hep.ph.ic.ac.uk/~dbauer/grid/staged_rollout_emi2.html
http://www.hep.ph.ic.ac.uk/~dbauer/grid/state_of_the_nation.html

Brunel has a test EMI2/SL6 cluster almost ready for testing - who (smaller VOs?) will test it with their software?

Wednesday 18th July - Core ops

Sites (that needed a tarball install) will need to work on own glexec installs
Reminder that gLite 3.1 no longer supported. 3.2 support is also decreasing. Need to push for EMI.

Security - Incident Procedure Policies Rota

Monday 30th July

WMSes patched/configured correctly.

Monday 23rd July

WMS vulnerabilities identified. Sites will have been contacted. Please respond to tickets ASAP.

Services - PerfSonar dashboard

Monday 13th August

Over half of GridPP sites now appear in the dashboard
Duncan has been following up on some of the dashboard results (often simple config problems)

Monday 30th July

Glasgow has been added and Liverpool will be shortly.
VOMS 'ownership' in transition (TBC). Who is picking up requests?

Monday 23rd July

Sheffield added to perfsonar matrix. Glasgow almost ready.
QMUL would like to test with sites using MTU >1500.

Tickets

Monday 13th of August, 13:00 BST

23 open UK tickets this week. I spotted a few ilc tickets this week, they seem to be suffering from the usual problems VOs face after a quiet time. Otherwise not much to report.

Question: Do sites mind me "tidying tickets"; essentially when a ticket is obviously "In Progress" (work has started) or "Waiting for Reply" (the last post was a question from the site to a person of interest) I'll poke my nose in? I've done this a few times where people have started work without setting a ticket "in progress", but I don't want to stick my oar in where it's not wanted!

Quiet tickets? https://ggus.eu/ws/ticket_info.php?ticket=85018 (Lancs) https://ggus.eu/ws/ticket_info.php?ticket=85017 (QMUL) Lancaster and QMUL received some Ops tickets on the 9/8. Lancaster got one notification (which I admit I missed) but no follow up e-mails. Queen Mary have been uncharacteristically quiet, which makes me think they might have missed their ticket too. Anyone else had any missed tickets? Could this ticket quietness be caused by issues similar to those discussed in TB-SUPPORT this week (concerning multiple e-mail addresses in GGUS & EGI broadcasts)?

UK Tickets https://ggus.eu/ws/ticket_info.php?ticket=84381 (19/7) Creation of the COMET VO. A voms instance is online and some formalities have been fulfilled. Daniela asks what's needed to host the VO on the gridpp voms server? (13/8)

QMUL https://ggus.eu/ws/ticket_info.php?ticket=84938 (7/8) neiss.org.uk needed updated VOMS server information on QMUL's servers. Chris has added the "2B" to his .lscs which should have got it, and is waiting to see if this has fixed things on the CEs before rolling out to the SE (should be waiting for reply, 9/8).

DURHAM https://ggus.eu/ws/ticket_info.php?ticket=84123 (11/7) Atlas production problems at Durham. After a lot of peril Alastair reports that Mike has things more or less up and running, with a few issues with services that need to be understood. Remaining in downtime till these are dealt with. In progress (9/8)

Other Durham tickets are on hold except:

https://ggus.eu/ws/ticket_info.php?ticket=68859 (Brian's request for DPM upgrade plans.) which probably should be.

BRISTOL https://ggus.eu/ws/ticket_info.php?ticket=80155 (12/3) Upgrade plans for the Bristol SE. Winnie has outlined a plan (9/7), ticket has been put on hold (18/7) until the end of August. Now on hold until the end of September, as the upgrade cannot be done this month (but with guarantees from Winnie that it will be done before the 30th of September). On hold (7/8).

GLASGOW https://ggus.eu/ws/ticket_info.php?ticket=83283 (14/6) lhcb cvmfs problems. Dave cited https://savannah.cern.ch/bugs/index.php?95420 & https://savannah.cern.ch/support/?129468 (18/6). Have there been any plans to try out the newer versions of cvmfs (or does the problem even still exist)? Jeremy set to Waiting for Reply (31/7). On the 8/8 Mark sparked discussion with lhcb over whether or not failures persisted (the site had in fact been banned for most of the period of this ticket). It was confirmed that the problem still plagues lhcb jobs, and the list of problem nodes corresponds to the new "high-density" workers. Glasgow is currently working on installing the cvmfs upgrade to fix lhcb & atlas problems at the site, but it will take a few days. In progress (10/8)

Tickets from the UK https://ggus.eu/ws/ticket_info.php?ticket=84993 Raul has sent out a batch of cloned tickets to VOs with data still on Brunel's soon to be retired dgc-grid-50.brunel.ac.uk (ref https://ggus.eu/ws/ticket_info.php?ticket=84639).

https://ggus.eu/ws/ticket_info.php?ticket=85021 (9/8) Emyr ticketed EGI over their improper signing of umd-release-1.8.0-1.el5.noarch.rpm. No word yet on the ticket.

Of Interest: https://ggus.eu/ws/ticket_info.php?ticket=85029 Daniela pointed out on TB-SUPPORT a probable cause of some sites' intermittent Ops test failures.

No exciting solved cases this week.

Tools - MyEGI Nagios

Tuesday 25th July

Gridppnagios at Lancaster will remain main Nagios instance until further announcement. KM writing down procedure for switch over to backup nagios in case of emergency https://www.gridpp.ac.uk/wiki/Backup_Regional_Nagios . KM now away for one month holiday and may not be able to reply to emails. New email address for Nagios: gridppnagios-admin at physics.ox.ac.uk for any question or information regarding regional nagios. Currently this mail goes to Ewan and Kashif.

Monday 2nd July

Switched on backup Nagios at Lancaster and stopped Nagios instance at Oxford. Stopping Nagios instance at Oxford means that it is not sending results to the dashboard and central DB. Keeping a close eye on it and will revert it back to original position if any problems encountered.

VOs - GridPP VOMS VO IDs Approved

Monday 23rd July

CW requested feedback from non-LHC VOs on issues
Proxy renewal issues thought to be resolved on all services except FTS - a new version may solve problems on that service.
Steve's VOMS snooper application is picking up many site VO config problems. We should review all sites in turn.

Monday 16th July

EGI UCST Representatives have changed VO neurogrid.incf.org status from New to Production

Wednesday 6th July

Cross-checking VOs enabled vs VO table.
Surveying VO-admins for problems faced in their VOs.
SNO+ have asked about using git to deploy software - what are the options?

Site Updates

Monday 30th July

SUSSEX: Still running into problems wrt stability and getting certified. Is there further help we can provide?

Meeting Summaries

Project Management Board - Members Minutes Quarterly Reports

Monday 2nd July

No meeting. Next PMB on Monday 9th.

GridPP ops meeting - Agendas Actions Core Tasks

Tuesday 8th May - link Agenda Minutes

ATLAS DATADISK now being used for production input files
Deploy perfsonar-ps (for LHC compatibility), but run a Perfsonar-MDM portal to collate the information
Target date for perfsonar-ps at sites is the end of July
KeyDocs still not in place for all areas

Tuesday 1st May 2012 - Agenda Minutes

Trying the present bulletin as a conduit for information
7 sites below 90% in ATLAS monitoring!
Instructions for small VO space-token setup/usage requested
EMI WN tarball now available - comment in ticket
Not every site running latest SL5 OS

RAL Tier-1 Experiment Liaison Meeting (Wednesday 13:30) Agenda EVO meeting

Wednesday 1st August

Operations report
Site "At Risk" (in GOCDB) for part of the morning of Tuesday 21st August during site firewall reconfiguration. Will drain and stop the FTS during this time.
CVMFS available for testing by non-LHC VOs (including "stratum 0" facilities).
Continue with hyperthreading tests. One batch (2011 Dell Worker Nodes) have number of jobs further increased to 20.
Test batch queue ("gridTest") available to try out EMI2/SL5 Worker nodes.

WLCG Grid Deployment Board - Agendas MB agendas

July meeting Wednesday 11th July

Welcome (Michel Jouvin)
• September meeting to include IPv6, LS1 and extended run plans
• EMI-2 WN testing also in September

CVMFS deployment status (Ian Collier)
• Recap; 78/104 sites for ATLAS – the UK looks good thanks to Alessandra
• Using two repos for ATLAS. Local shared area will be dropped in future.
• 36/86 for LHCb. Two WN mounts. Pref for CVMFS – extra work
• 5 T2s for CMS. Info for sites https://twiki.cern.ch/twiki/bin/view/CMSPublic/CompOpsCVMFS
• Client with shared cache in testing
• Looking at NFS client and MAC OS X

Pre-GDB on CE Extensions (Davide Salomoni)
• https://indico.cern.ch/conferenceDisplay.py?confId=196743
• Goal – review proposed extensions + focus on whole-node/multi-core set
• Also agree development plan + timeline for CEs.
• Fixed cores and variable number of cores + mem requirements. May impact expt. frameworks.
• Some extra attributes added in Glue 2.0 – e.g. MaxSlotsPerJob
• JDL. Devel. Interest. Queue level or site level.
• How. CE implementations. Plan. Actions.

Initial Meeting with EMI, EGI and OSG (Michel Jouvin)
• ID issues related to end of supporting projects (e.g. EMI)
• Globus (community?); EMI MW (WLCG); OSG; validation
• Discussion has not included all stakeholders.

How to identify the best top level BDIIs (Maria Alandes Pradillo)
• Only 11% are “properly” configured (LCG_GFAL_INFOSYS 1,2,3)
• UK BDIIs appear in top 20 of ‘most configured’.

MUPJ – gLExec update (Maarten Litmaath)
• ‘glexec’ flag in GOCDB for each supporting CE
• http://cern.ch/go/PX7p (so far… T1, Brunel, IC-HEP, Liv, Man, Glasgow, Ox, RALPP)
• Improved instructions: https://twiki.cern.ch/twiki/bin/view/LCG/GlexecDeployment
• CMS ticketing sites. Working on GlideinWMS.

WG on Storage Federations (Fabrizio Furano)
• Federated access to data – clarify what needs supporting
• ‘fail over’ for jobs; ‘repair mechanisms’; access control
• So far XROOTD clustering through WAN = natural solution
• Setting up group.

DPM Collaboration – Motivation and proposal (Oliver Keeble)
• Context. Why. Who…
• UK is 3rd largest user (by region/country)
• Section on myths: DPM has had investment. Not only for small sites…
• New features: HTTP/WebDAV, NFSv4.1, Perfsuite…
• Improvements with xrootd plugin
• Looking for stakeholders to express interest … expect proposal shortly
• Possible model: 3-5 MoU or ‘maintain’

Update on SHA-2 and RFC proxy support
• IGTF wish CAs -> SHA-2 signatures ASAP. For WLCG means use RFC in place of current Globus legacy proxies.
• dCache & BestMan may look at EMI Common Authentication Library (CANL) – supports SHA-2 with legacy proxies.
• IGTF aim for Jan 2013 (then takes 395 days for SHA-1 to disappear)
• Concern about timeline (LHC run now extended)
• Status: https://twiki.cern.ch/twiki/bin/view/LCG/RFCproxySHA2support
• Plan deployed SW support RFC proxies (Summer 2013) and SHA-2 (except dCache/BeStMan – Summer 2013). Introduce SHA-2 CAs Jan 2014.
• Plan B – short-lived WLCG catch-all CA
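
For a sense of the timeline above, 395 days is roughly 13 months. A small Python sketch of the arithmetic follows. It assumes, purely for illustration, that the clock starts on 1 January 2013; the notes only say "Jan 2013", so the exact day is hypothetical.

```python
from datetime import date, timedelta

# Assumed start date: the notes give only "Jan 2013", so the day is hypothetical.
sha2_start = date(2013, 1, 1)
sha1_lifetime = timedelta(days=395)  # figure quoted in the notes

# Date by which SHA-1 would have disappeared under this assumption.
sha1_gone = sha2_start + sha1_lifetime
print(sha1_gone.isoformat())
```

Under that assumption SHA-1 would disappear around the end of January 2014; a later start date shifts the result by the same amount.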

ARGUS Authorization Service (Valery Tschopp)
• Authorisation examples & ARGUS motivation (many services, global banning, policies static). Can user X perform action Y on resource Z?
• ARGUS built on top of a XACML policy engine
• PAP = Policy Administration Point. Tool to author policies.
• PDP = Policy Decision Point (evaluates requests)
• PEP = Policy Enforcement Point (reformats requests)
• Hide XACML with Simplified Policy Language (SPL)
• Central banning = Hierarchical policy distribution
• Pilot job authorization – gLExec executes payload on WN https://twiki.cern.ch/twiki/bin/view/EGEE/AuthorizationFramework
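
The "can user X perform action Y on resource Z" question can be illustrated with a toy decision function. This Python sketch is not the ARGUS API and does not use XACML or SPL. The Rule and decide names, the DNs, the host names and the first-match-wins ordering are all assumptions made for illustration only.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Rule:
    effect: str      # "permit" or "deny"
    subject: str     # user DN, or "*" for any user
    action: str      # action name, or "*" for any action
    resource: str    # resource name, or "*" for any resource

def _matches(pattern, value):
    """A pattern matches if it is the wildcard or equals the value exactly."""
    return pattern == "*" or pattern == value

def decide(rules, subject, action, resource):
    """Return the effect of the first matching rule; deny if none match."""
    for rule in rules:
        if (_matches(rule.subject, subject)
                and _matches(rule.action, action)
                and _matches(rule.resource, resource)):
            return rule.effect
    return "deny"

# Hypothetical policy: a central ban listed first, then a general permit.
policy = [
    Rule("deny", "/DC=org/CN=banned user", "*", "*"),
    Rule("permit", "*", "submit-job", "ce.example.org"),
]

print(decide(policy, "/DC=org/CN=alice", "submit-job", "ce.example.org"))
print(decide(policy, "/DC=org/CN=banned user", "submit-job", "ce.example.org"))
print(decide(policy, "/DC=org/CN=alice", "delete-file", "se.example.org"))
```

Putting the ban rule first mirrors the idea of central banning taking precedence over local permits, although how ARGUS actually orders and combines policies is not described in these notes.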

Operations Coordination Team (Maria Girone)
• Mandate – addresses needs in WLCG service coordination recommendations & commissioning of OPS and Tools.
• Establish core teams of experts to validate, commission and troubleshoot services.
• Team goals: understand services needed; monitor health; negotiate configs; commission new services; help with transitions.
• Team roles: core members (sites, regions, expt., services) + targeted experts
• Tasks: CVMFS, Perfsonar, gLExec

Jobs with High Memory Profiles
• See expt reports.

NGI UK - Homepage CA

Monday 2nd July

Next meeting is on 9th July.

Events

WLCG workshop - 19th-20th May (NY) Information

CHEP 2012 - 21st-25th May (NY) Agenda

UK ATLAS - Shifter view News & Links

Thursday 21st June

Over the last few months ATLAS have been testing their job recovery mechanism at RAL and a few other sites. This is something that was 'implemented' before but never really worked properly. It now appears to be working well, allowing jobs to finish even if the SE is not up/unstable when the job finishes.

Job recovery works by writing the output of the job to a directory on the WN should it fail when writing the output to the SE. Subsequent pilots will check this directory and try again for a period of 3 hours. If you would like to have job recovery activated at your site you need to create a directory which (atlas) jobs can write to. I would also suggest that this directory has some form of tmpwatch enabled on it which clears up files and directories older than 48 hours. Evidence from RAL suggests that it's normally only 1 or 2 jobs that are ever written to the space at a time and the space used is normally less than a GB. I have not observed more than 10GB being used. Once you have created this space, please email atlas-support-cloud-uk at cern.ch with the directory (and your site!) and we can add it to the ATLAS configurations. We can switch off job recovery at any time if it does cause a problem at your site. Job recovery would only be used for production jobs as users complain if they have to wait a few hours for things to retry (even if it would save them time overall...)
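
The clean-up suggested above (clearing anything older than 48 hours from the recovery directory) would normally be done with a tool such as tmpwatch. As an illustration only, the same idea can be sketched in Python. The function name and the path in the usage comment are hypothetical, and this sketch removes files only, leaving directories in place.

```python
import os
import time

MAX_AGE_SECONDS = 48 * 3600  # 48 hours, as suggested in the text

def remove_old_files(directory, max_age=MAX_AGE_SECONDS, now=None):
    """Delete files under `directory` last modified more than `max_age` seconds ago.

    Returns the list of removed paths. Directories themselves are left in place.
    """
    now = time.time() if now is None else now
    removed = []
    for root, _dirs, files in os.walk(directory):
        for name in files:
            path = os.path.join(root, name)
            # lstat so that a symlink is judged by its own age, not its target's.
            if now - os.lstat(path).st_mtime > max_age:
                os.remove(path)
                removed.append(path)
    return removed

# Example (hypothetical path): remove_old_files("/scratch/atlas-recovery")
```

Run periodically, for example from cron, this would keep the space bounded; given the sizes reported from RAL, the directory should rarely hold much.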

UK CMS

Tuesday 24th April

Brunel will be trialling CVMFS this week, will be interesting. RALPP doing OK with it.

UK LHCb

Tuesday 24th April

Things are running smoothly. We are going to run a few small scale tests of new codes. This will also run at T2, one UK T2 involved. Then we will soon launch new reprocessing of all data from this year. CVMFS update from last week; fixes cache corruption on WNs.

UK OTHER

Thursday 21st June - JANET6

JANET6 meeting in London (agenda)
Spend of order £24M for strategic rather than operational needs.
Recommendations to BIS shortly
Requirements: bandwidth, flexibility, agility, cost, service delivery - reliability & resilience
Core presently 100Gb/s backbone. Looking to 400Gb/s and later 1Tb/s.
Reliability limited by funding not ops so need smart provisioning to reduce costs
Expecting a 'data deluge' (ITER; EBI; EVLBI; JASMIN)
Goal of dynamic provisioning
Looking at ubiquitous connectivity via ISPs
Contracts were 10yrs wrt connection and 5yrs transmission equipment.
Current native capacity 80 channels of 100Gb/s per channel
Fibre procurement for next phase underway (standard players) - 6400km fibre
Transmission equipment also at tender stage

Industry engagement - Glaxo case study.
Extra requirements: software coding, security, domain knowledge.
Expect genome data usage to explode in 3-5yrs.
Licensing is a clear issue

To note

Tuesday 26th June

On Tuesday 31st July 2012 the GOCDB read-write portal at https://gocdb4.esc.rl.ac.uk/portal will be decommissioned and replaced by a single read-write version at https://goc.egi.eu/portal. This will consolidate the service (including the PI) under the same URL. All GOCDB client maintainers are requested to ensure their PI configuration URLs point to https://goc.egi.eu/portal.
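
Client maintainers could check their configuration for the old host with something like the following Python sketch. The function name and the sample configuration line are hypothetical; only the two portal URLs come from the announcement above. The sketch rewrites just the old read-write portal URL, so any other legacy PI endpoint a client uses would need adding.

```python
OLD_PORTAL = "https://gocdb4.esc.rl.ac.uk/portal"
NEW_PORTAL = "https://goc.egi.eu/portal"

def migrate_gocdb_urls(config_text):
    """Return (new_text, count), replacing the old portal URL with the new one."""
    count = config_text.count(OLD_PORTAL)
    return config_text.replace(OLD_PORTAL, NEW_PORTAL), count

# Hypothetical configuration line; real clients will differ.
sample = "gocdb_url = https://gocdb4.esc.rl.ac.uk/portal\n"
updated, changed = migrate_gocdb_urls(sample)
print(changed, updated.strip())
```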
