Operations Bulletin 030912


Bulletin archive


Week commencing 27th August 2012
Task Areas
General updates

Tuesday 28th August

  • Support of the dteam VO is required as well as ops. All UK sites currently good.
  • Recent CE retirements have led to myEGI and REBUS reporting different logical CPU counts for some sites. (example). Ewan recommended using glite-CLUSTER.
  • There is a need to gain VO feedback on their use of EMI WNs (EGI summary).
  • NGS is rebranding to NES (National e-Infrastructure Service).
  • Steve's EMI CE install issue -> "If you've got anything enabled except for SL, EPEL, and EMI, then you should probably turn it off". (A repository check sketch follows after this list.)
  • GOCDB will be set to read-only mode and will be at risk between 08:00 and 14:00 [UTC] on 30-08-2012. This is to apply database updates.
  • UserDNs still to be published by: JET, UCL, ECDF, Bristol and Cambridge.
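A minimal sketch of the kind of repository check Steve's advice implies, assuming a yum-based install; the SL/EPEL/EMI prefix whitelist and paths below are illustrative and would need adjusting to the repository ids a site actually uses:

```python
#!/usr/bin/env python3
# Sketch: list enabled yum repositories and flag any outside an expected
# whitelist (SL, EPEL, EMI). The prefix whitelist is an assumption; adjust
# it to the repository ids actually configured at your site.
import configparser
import glob

EXPECTED_PREFIXES = ("sl", "epel", "emi")  # assumed naming convention

def enabled_repos(repo_dir="/etc/yum.repos.d"):
    repos = []
    for path in glob.glob(f"{repo_dir}/*.repo"):
        parser = configparser.ConfigParser(interpolation=None)
        try:
            parser.read(path)
        except configparser.Error:
            continue  # skip malformed repo files
        for section in parser.sections():
            # yum treats a missing 'enabled' option as enabled=1
            if parser.getboolean(section, "enabled", fallback=True):
                repos.append((section, path))
    return repos

if __name__ == "__main__":
    for repo_id, path in enabled_repos():
        status = "ok" if repo_id.lower().startswith(EXPECTED_PREFIXES) else "UNEXPECTED"
        print(f"{status}: {repo_id} ({path})")
```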


Tier-1 - Status Page

Tuesday 28th August

  • No major operational issues to report for the last week.
  • Castor Repack instance upgraded to 2.1.12-10.
  • Hyperthreading tests continue: one batch of worker nodes has had its number of jobs increased further (from 22 to 24).
  • As stated before: CVMFS available for testing by non-LHC VOs (including "stratum 0" facilities).
  • Updated the CASTOR information provider. We had inadvertently been running an older version with more bugs in its dynamic accounting; the current version (2.2.12) has bugfixes for both tape and shared disk.
Storage & Data Management - Agendas/Minutes

Wednesday 29th August

  • Update on DPM support plan - aim to define tasks, then look for "volunteers"
  • Planning ahead for coming events - particularly GridPP29
  • Volunteers for ATLAS job recovery?

Wednesday 15th August

  • Tier-1 approach for disk only solution.
  • Considering input to a community support model for DPM and possible alternatives.

Tuesday 24th July

  • Sam testing xrootd redirection on Glasgow test cluster - going well.


Accounting - UK Grid Metrics HEPSPEC06

Wednesday 18th July - Core-ops

  • Still need definitive statement on disk situation and SL ATLAS accounting conclusions.
  • Sites should again check Steve's HS06 page.


Wednesday 6th June - Core-ops

  • Request sites to publish HS06 figures from new kit to this page.
  • Please would all sites check the HS06 numbers they publish. Will review in detail on 26th June.

Friday 11th May - HEPSYSMAN

  • Discussion on HS06, reminding sites to publish using results from 32-bit mode benchmarking. A reminder that results for new kit should be posted to the HS06 wiki page; see the worked sketch after this item. See also the blog article by Pete Gronbech. The HEPiX guidelines for running the benchmark tests are at this link.
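For reference, the published figure is usually just the whole-node benchmark total divided across its logical CPUs / job slots; a trivial worked sketch, with all numbers made up for illustration:

```python
#!/usr/bin/env python3
# Worked sketch of the usual arithmetic for publishing HS06: run the
# benchmark (in 32-bit mode) on the whole node and divide the total score
# by the number of logical CPUs. All numbers here are illustrative only.
node_total_hs06 = 180.0    # hypothetical 32-bit HS06 score for one node
logical_cpus = 16          # hypothetical logical CPU count of that node

hs06_per_core = node_total_hs06 / logical_cpus
print(f"Publish {hs06_per_core:.2f} HS06 per logical CPU "
      f"({node_total_hs06} HS06 over {logical_cpus} logical CPUs)")
```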


Documentation - KeyDocs

Tuesday 28th August

KeyDocs monitoring status: Grid Storage(7/0) Documentation(3/0) Ticket follow-up(3/0) Security(3/0) On-duty coordination(2/0) Regional tools(2/0) Monitoring(2/0) Accounting(2/0) Wider VO issues(2/0) Staged rollout(1/0) Core Grid services(1/0) Grid interoperation(1/0) Cluster Management(1/0) (brackets show total/missing)

Thursday 26th July

All the "site update pages" have been reconfigured from a topic oriented structure into a site oriented structure. This is available to view at https://www.gridpp.ac.uk/wiki/Separate_Site_Status_Pages#Site_Specific_Pages

Please do not edit these pages yet - any changes would be lost when we refine the template. Any comments gratefully received, contact: sjones@hep.ph.liv.ac.uk

Interoperation - EGI ops agendas

Monday 20th August

  • EGI operations meeting minutes available.
  • SAM/Nagios 17.1 release (needed for EMI2 WN monitoring) expected around 22nd August (but requires a workaround!).
  • Next meeting 10th September


Monday 30th July (last EGI ops meeting agenda.)

  • Check which VOs have tested/used EMI WNs. There is a need to avoid any problems when the gLite 3.2 WN End of Life is announced.
  • EMI: forthcoming on 9th August: EMI-1: BDII, CREAM, WMS; EMI-2: BDII, Trustmanager. The CREAM and WMS updates look like they fix some useful bits.
  • StagedRollout:

IGTF: CA 1.49, SR this week.

SAM/Nagios 17 is still in SR and has problems. A patch is ready and should be done soon. This is needed for support of EMI-2 WNs.

UMD 1.8: BLAH, gsoap-gss, StoRM and IGE GridFTP to be released soon.

Lots of stuff for UMD-2; note the WN problems. After some discussion, it looks like the EMI-2 WN will not be released to the UMD until the SAM/Nagios problems are solved.

  • WMS vulnerabilities: some discussion. UK all patched and up-to-date, yay!


Monitoring - Links MyWLCG

Monday 2nd July

  • DC has almost finished an initial ranking. This will be reviewed by AF/JC and discussed at the 10th July ops meeting.

Wednesday 6th June

  • Ranking continues. Plan to have a meeting in July to discuss good approaches to the plethora of monitoring available.
  • Glasgow dashboard now packaged and can be downloaded here.
On-duty - Dashboard ROD rota

Monday 27th August

  • COD believes Brunel has been in downtime for a month - we need to respond to the ticket.

Friday 24th August - AM

  • Quite a patchy week with many problems at sites. Several still have unplanned downtimes due to outages, with varying degrees of severity. The WMS at Imperial continues to be unstable (due to overload?) and is awaiting reinstallation on a new machine, and the Dashboard has been particularly slow at reflecting the IC WMS's return to proper operation. I've created another ticket to track these alarms. Brunel has had several alarms, none of which has lasted more than 12 hours or so, and yet today it is showing as red. Confirming the inconsistency, no thresholds are exceeded for Brunel or NGI_UK in the metrics and performance index pages in the Dashboard. If this red state persists, I think it should be reported as a Dashboard bug.
Rollout Status WLCG Baseline

Tuesday 31st July

  • Brunel has a test EMI2/SL6 cluster almost ready for testing - who (smaller VOs?) will test it with their software?

Wednesday 18th July - Core ops

  • Sites (that needed a tarball install) will need to work on own glexec installs
  • Reminder that gLite 3.1 is no longer supported and 3.2 support is also decreasing. Need to push for EMI.
Security - Incident Procedure Policies Rota

Monday 30th July

  • WMSes patched/configured correctly.

Monday 23rd July

  • WMS vulnerabilities identified. Sites will have been contacted. Please respond to tickets ASAP.


Services - PerfSonar dashboard

Tuesday 28th August

  • 14 GridPP sites now enabled with perfSONAR. Are they all reporting correctly in the GridPP community?
  • Some asymmetry issues observed. (A case study on resolving a US problem).
  • Setting up representative UK-international tests.


Monday 20th August

  • There will be a UK CA-TAG meeting in the next week. Are there any CA issues that GridPP should raise?
  • BNL transition to (new hardware and therefore) new collector for all perfSONAR-PS on Monday 20th August.

Monday 13th August

  • Over half of GridPP sites now appear in the dashboard
  • Duncan has been following up on some of the dashboard results (often simple config problems)
Tickets

Tuesday 28th August 00:00 BST

I see 43 open UK tickets this week (the build-up is alarming), and I figure bothering people the morning after a long weekend about tickets at their site would be an exercise in generating bad will! A peruse through them doesn't show any urgent tickets decaying unnoticed.

However this build-up of tickets is worrying, so please can everybody scan the UK tickets for any pertaining to their site, and take the chance to update any relevant tickets if they need it. I'm guessing a lot of people are on holiday though.

All open UK tickets can be seen here: http://tinyurl.com/93vlp2e

During today's meeting we'll have a quick surgery in case anyone wants to mention any ticket-related problems, rather than the usual round-up. This will hopefully clear some time for the site round-table we've been meaning to have for a while!

Tools - MyEGI Nagios

Tuesday 25th July

Gridppnagios at Lancaster will remain the main Nagios instance until further announcement. KM is writing down the procedure for switching over to the backup Nagios in case of emergency: https://www.gridpp.ac.uk/wiki/Backup_Regional_Nagios . KM is now away on holiday for a month and may not be able to reply to emails. There is a new email address for any question or information regarding the regional Nagios: gridppnagios-admin at physics.ox.ac.uk. Currently this mail goes to Ewan and Kashif.

Monday 2nd July

  • Switched on the backup Nagios at Lancaster and stopped the Nagios instance at Oxford. Stopping the Oxford instance means that it is not sending results to the dashboard and central DB. We are keeping a close eye on it and will revert to the original arrangement if any problems are encountered.



VOs - GridPP VOMS VO IDs Approved VO table

Monday 27th August


Monday 23rd July

  • CW requested feedback from non-LHC VOs on issues
  • Proxy renewal issues thought to be resolved on all services except FTS - a new version may solve problems on that service.
  • Steve's VOMS snooper application is picking up many site VO config problems. We should review all sites in turn; an illustrative vomses check is sketched below.
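Not Steve's VOMS Snooper itself, but an illustrative sketch of the kind of check involved: parse a node's vomses configuration and list the VOs it claims to support, for manual comparison against the approved VO table. It assumes the usual quoted five-field vomses line format and the default /etc/vomses location:

```python
#!/usr/bin/env python3
# Illustrative sketch only (not Steve's VOMS Snooper): parse the vomses
# configuration on a node and list the VOs it claims to support, so the
# output can be compared against the GridPP approved VO table by hand.
# Assumes the usual line format: "vo" "host" "port" "server DN" "alias".
import os
import shlex

def vomses_entries(path="/etc/vomses"):
    if os.path.isdir(path):
        files = [os.path.join(path, name) for name in sorted(os.listdir(path))]
    elif os.path.isfile(path):
        files = [path]
    else:
        files = []
    entries = set()
    for filename in files:
        if not os.path.isfile(filename):
            continue
        with open(filename) as handle:
            for line in handle:
                line = line.strip()
                if not line or line.startswith("#"):
                    continue
                fields = shlex.split(line)  # handles the quoted fields
                if len(fields) >= 4:
                    entries.add((fields[0], fields[1], fields[2], fields[3]))
    return entries

if __name__ == "__main__":
    for vo, host, port, server_dn in sorted(vomses_entries()):
        print(f"{vo:<20} {host}:{port}  {server_dn}")
```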


Site Updates

Monday 20th August

  • SUSSEX: Still in downtime. Almost there?



Meeting Summaries
Project Management Board - MembersMinutes Quarterly Reports

Monday 2nd July

  • No meeting. Next PMB on Monday 3rd September.
GridPP ops meeting - Agendas Actions Core Tasks

Tuesday 21st August - link Agenda Minutes

  • TBC


RAL Tier-1 Experiment Liaison Meeting (Wednesday 13:30) Agenda EVO meeting

Wednesday 22nd August

  • Operations report
  • During last week one problematic switch stack affected access to a batch of worker nodes on the morning of the 16th.
  • A number of different issues have affected SUM tests.
  • CVMFS available for testing by non-LHC VOs (including "stratum 0" facilities).
  • Hyperthreading tests continue. One batch (2011 Dell worker nodes) has had its number of jobs further increased to 22.
  • Test batch queue ("gridTest") available to try out EMI2/SL5 Worker nodes.
  • Discussion on better (or at least less bad) times for some required interventions on the power systems in the building, although these should only be "At Risk"s.
WLCG Grid Deployment Board - Agendas MB agendas

July meeting Wednesday 11th July

Welcome (Michel Jouvin)

  • September meeting to include IPv6, LS1 and extended run plans
  • EMI-2 WN testing also in September

CVMFS deployment status (Ian Collier)

  • Recap: 78/104 sites for ATLAS – the UK looks good thanks to Alessandra
  • Using two repos for ATLAS. The local shared area will be dropped in future.
  • 36/86 for LHCb. Two WN mounts. Preference for CVMFS – extra work
  • 5 T2s for CMS. Info for sites: https://twiki.cern.ch/twiki/bin/view/CMSPublic/CompOpsCVMFS
  • Client with shared cache in testing
  • Looking at the NFS client and Mac OS X

Pre-GDB on CE Extensions (Davide Salomoni)

  • https://indico.cern.ch/conferenceDisplay.py?confId=196743
  • Goal – review the proposed extensions, with a focus on the whole-node/multi-core set
  • Also agree a development plan and timeline for the CEs
  • Fixed cores and variable numbers of cores + memory requirements. May impact experiment frameworks.
  • Some extra attributes added in GLUE 2.0 – e.g. MaxSlotsPerJob
  • JDL. Development interest. Queue level or site level.
  • How: CE implementations. Plan. Actions.

Initial Meeting with EMI, EGI and OSG (Michel Jouvin)

  • ID issues related to the end of supporting projects (e.g. EMI)
  • Globus (community?); EMI MW (WLCG); OSG; validation
  • Discussion has not included all stakeholders.

How to identify the best top level BDIIs (Maria Alandes Pradillo)

  • Only 11% are "properly" configured (LCG_GFAL_INFOSYS 1,2,3)
  • UK BDIIs appear in the top 20 of 'most configured'.
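A minimal local check in the spirit of that talk, assuming the usual comma-separated host:port format of LCG_GFAL_INFOSYS; the target of three endpoints is taken from the talk, everything else is illustrative:

```python
#!/usr/bin/env python3
# Sketch: check that LCG_GFAL_INFOSYS lists several top-level BDIIs
# (comma-separated host:port entries) so information-system lookups can
# fail over; the target of three endpoints follows the GDB talk above.
import os

def check_infosys(minimum=3):
    value = os.environ.get("LCG_GFAL_INFOSYS", "")
    endpoints = [entry.strip() for entry in value.split(",") if entry.strip()]
    if not endpoints:
        print("LCG_GFAL_INFOSYS is not set")
    elif len(endpoints) < minimum:
        print(f"Only {len(endpoints)} top-level BDII(s) configured: {', '.join(endpoints)}")
    else:
        print(f"OK: {len(endpoints)} top-level BDIIs configured")

if __name__ == "__main__":
    check_infosys()
```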

MUPJ – gLExec update (Maarten Litmaath)

  • 'glexec' flag in GOCDB for each supporting CE
  • http://cern.ch/go/PX7p (so far… T1, Brunel, IC-HEP, Liv, Man, Glasgow, Ox, RALPP)
  • Improved instructions: https://twiki.cern.ch/twiki/bin/view/LCG/GlexecDeployment
  • CMS is ticketing sites. Working on GlideinWMS.

WG on Storage Federations (Fabrizio Furano)

  • Federated access to data – clarify what needs supporting
  • 'Fail over' for jobs; 'repair mechanisms'; access control
  • So far XROOTD clustering through the WAN looks the natural solution
  • Setting up the group.

DPM Collaboration – Motivation and proposal (Oliver Keeble)

  • Context. Why. Who…
  • The UK is the 3rd largest user (by region/country)
  • Section on myths: DPM has had investment. Not only for small sites…
  • New features: HTTP/WebDAV, NFSv4.1, Perfsuite…
  • Improvements with the xrootd plugin
  • Looking for stakeholders to express interest… expect a proposal shortly
  • Possible model: 3-5 MoU or 'maintain'

Update on SHA-2 and RFC proxy support

  • IGTF wishes CAs to move to SHA-2 signatures ASAP. For WLCG this means using RFC proxies in place of the current Globus legacy proxies.
  • dCache & BeStMan may look at the EMI Common Authentication Library (CANL) – it supports SHA-2 with legacy proxies.
  • IGTF aims for Jan 2013 (it then takes 395 days for SHA-1 to disappear)
  • Concern about the timeline (the LHC run is now extended)
  • Status: https://twiki.cern.ch/twiki/bin/view/LCG/RFCproxySHA2support
  • Plan: deployed SW supports RFC proxies (Summer 2013) and SHA-2 (except dCache/BeStMan – Summer 2013). Introduce SHA-2 CAs Jan 2014.
  • Plan B – a short-lived WLCG catch-all CA
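A small sketch for checking which signature hash algorithm a certificate actually carries (the property the SHA-1 to SHA-2 migration is about); it assumes the pyca/cryptography Python package is available, and the default path is illustrative:

```python
#!/usr/bin/env python3
# Sketch: report the signature hash algorithm (e.g. sha1 vs sha256) of a
# PEM certificate, the property that matters for the SHA-2 migration.
# Requires the pyca/cryptography package; the example path is illustrative.
import sys
from cryptography import x509

def signature_algorithm(pem_path):
    with open(pem_path, "rb") as handle:
        cert = x509.load_pem_x509_certificate(handle.read())
    algo = cert.signature_hash_algorithm  # may be None for some key types
    return algo.name if algo is not None else "unknown"

if __name__ == "__main__":
    path = sys.argv[1] if len(sys.argv) > 1 else "/etc/grid-security/hostcert.pem"
    print(f"{path}: signed with {signature_algorithm(path)}")
```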

ARGUS Authorization Service (Valery Tschopp)

  • Authorisation examples & ARGUS motivation (many services, global banning, static policies). Can user X perform action Y on resource Z?
  • ARGUS is built on top of an XACML policy engine
  • PAP = Policy Administration Point: tool to author policies
  • PDP = Policy Decision Point (evaluates requests)
  • PEP = Policy Enforcement Point (reformats requests)
  • XACML is hidden behind the Simplified Policy Language (SPL)
  • Central banning = hierarchical policy distribution
  • Pilot job authorization – gLExec executes the payload on the WN: https://twiki.cern.ch/twiki/bin/view/EGEE/AuthorizationFramework
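Purely as a conceptual illustration of the PAP/PDP/PEP split described above, not the ARGUS API or its SPL syntax (the rules and attribute values are invented): a request of the form "can user X perform action Y on resource Z?" is evaluated against an ordered policy list with a deny-by-default outcome:

```python
#!/usr/bin/env python3
# Conceptual illustration of the PAP/PDP/PEP split, not the ARGUS API:
# the PAP authors an ordered list of rules, the PDP evaluates a request
# ("can user X perform action Y on resource Z?") against them, and a PEP
# would enforce the resulting decision. All attribute values are invented.
from dataclasses import dataclass

@dataclass(frozen=True)
class Rule:
    effect: str      # "permit" or "deny"
    subject: str     # user DN or VO; "*" matches anything
    action: str      # e.g. "submit-job"; "*" matches anything
    resource: str    # e.g. "ce.example.ac.uk"; "*" matches anything

# Policies as a PAP might author them: a central ban first, then a permit.
POLICIES = [
    Rule("deny", "/DC=org/DC=example/CN=banned user", "*", "*"),
    Rule("permit", "vo:atlas", "submit-job", "ce.example.ac.uk"),
]

def pdp_decide(subject: str, action: str, resource: str) -> str:
    """Evaluate a request against the ordered rules; deny by default."""
    for rule in POLICIES:
        if (rule.subject in ("*", subject)
                and rule.action in ("*", action)
                and rule.resource in ("*", resource)):
            return rule.effect
    return "deny"

if __name__ == "__main__":
    # A real PEP would translate the incoming request into these attributes
    # and then enforce the decision returned by the PDP.
    print(pdp_decide("vo:atlas", "submit-job", "ce.example.ac.uk"))  # permit
    print(pdp_decide("/DC=org/DC=example/CN=banned user",
                     "submit-job", "ce.example.ac.uk"))              # deny
```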

Operations Coordination Team (Maria Girone)

  • Mandate – addresses needs in the WLCG service coordination recommendations & commissioning of OPS and Tools
  • Establish core teams of experts to validate, commission and troubleshoot services
  • Team goals: understand the services needed; monitor health; negotiate configs; commission new services; help with transitions
  • Team roles: core members (sites, regions, experiments, services) + targeted experts
  • Tasks: CVMFS, perfSONAR, gLExec

Jobs with High Memory Profiles

  • See experiment reports.



NGI UK - Homepage CA

Wednesday 22nd August

  • Operationally few changes - VOMS and Nagios changes on hold due to holidays
  • Upcoming meetings Digital Research 2012 and the EGI Technical Forum. UK NGI presence at both.
  • The NGS is rebranding to NES (National e-Infrastructure Service)
  • EGI is looking at options to become a European Research Infrastructure Consortium (ERIC). (Background document.)
  • Next meeting is on Friday 14th September at 13:00.
Events

WLCG workshop - 19th-20th May (NY) Information

CHEP 2012 - 21st-25th May (NY) Agenda

GridPP29 - 26th-27th September (Oxford)

UK ATLAS - Shifter view News & Links

Thursday 21st June

  • Over the last few months ATLAS have been testing their job recovery mechanism at RAL and a few other sites. This is something that was 'implemented' before but never really worked properly. It now appears to be working well, allowing jobs to finish even if the SE is down or unstable when the job finishes.
  • Job recovery works by writing the output of the job to a directory on the WN should writing to the SE fail. Subsequent pilots will check this directory and retry for a period of 3 hours. If you would like job recovery activated at your site you need to create a directory which (ATLAS) jobs can write to. I would also suggest that this directory has some form of tmpwatch enabled on it which clears up files and directories older than 48 hours (a cleanup sketch follows below). Evidence from RAL suggests that normally only 1 or 2 jobs are written to the space at a time, and the space used is normally less than a GB; I have not observed more than 10GB being used. Once you have created this space, email atlas-support-cloud-uk at cern.ch with the directory (and your site!) and we can add it to the ATLAS configuration. We can switch off job recovery at any time if it causes a problem at your site. Job recovery would only be used for production jobs, as users complain if they have to wait a few hours for things to retry (even if it would save them time overall...).
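A minimal cleanup sketch in the spirit of tmpwatch for such a recovery area (the path is illustrative and the 48-hour threshold comes from the note above; in practice a site would more likely just configure tmpwatch itself):

```python
#!/usr/bin/env python3
# Sketch of a tmpwatch-style cleanup for an ATLAS job-recovery directory:
# remove files and directories whose modification time is older than 48
# hours. The path is illustrative; adjust it to your site's recovery area.
import os
import shutil
import time

RECOVERY_DIR = "/scratch/atlas-job-recovery"   # illustrative path
MAX_AGE_SECONDS = 48 * 3600                    # 48 hours, as suggested above

def clean_old_entries(base=RECOVERY_DIR, max_age=MAX_AGE_SECONDS):
    now = time.time()
    if not os.path.isdir(base):
        return
    for name in os.listdir(base):
        path = os.path.join(base, name)
        try:
            age = now - os.path.getmtime(path)
            if age > max_age:
                if os.path.isdir(path):
                    shutil.rmtree(path)
                else:
                    os.remove(path)
                print(f"removed {path} ({age / 3600:.1f} h old)")
        except OSError as err:
            print(f"skipping {path}: {err}")

if __name__ == "__main__":
    clean_old_entries()
```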
UK CMS

Tuesday 24th April

  • Brunel will be trialling CVMFS this week, will be interesting. RALPP doing OK with it.
UK LHCb

Tuesday 24th April

  • Things are running smoothly. We are going to run a few small-scale tests of new code. These will also run at T2s, with one UK T2 involved. We will then soon launch a new reprocessing of all data from this year. CVMFS was updated last week; the update fixes cache corruption on WNs.
UK OTHER

Thursday 21st June - JANET6

  • JANET6 meeting in London (agenda)
  • Spend of order £24M for strategic rather than operational needs.
  • Recommendations to BIS shortly
  • Requirements: bandwidth, flexibility, agility, cost, service delivery - reliability & resilience
  • Core presently 100Gb/s backbone. Looking to 400Gb/s and later 1Tb/s.
  • Reliability limited by funding not ops so need smart provisioning to reduce costs
  • Expecting a 'data deluge' (ITER; EBI; EVLBI; JASMIN)
  • Goal of dynamic provisioning
  • Looking at ubiquitous connectivity via ISPs
  • Contracts were 10 years for connection and 5 years for transmission equipment.
  • Current native capacity is 80 channels at 100Gb/s per channel
  • Fibre procurement for next phase underway (standard players) - 6400km fibre
  • Transmission equipment also at tender stage
  • Industry engagement - Glaxo case study.
  • Extra requirements: software coding, security, domain knowledge.
  • Expect genome data usage to explode in 3-5yrs.
  • Licensing is a clear issue
To note

Tuesday 26th June