Operations Bulletin 081012



Week commencing 1st October 2012
Task Areas
General updates

Tuesday 2nd October

  • Draft September WLCG reliability and availability figures released.
  • CREAM in High Availability systems. EGI have asked for feedback on the release of this feature, which is foreseen with EMI 3. To finalise development and documentation, the CREAM Project Team needs feedback from site administrators about their existing site setup in the following areas: shared file systems used (NFS, GPFS, etc.); gateways used (DNS, Apache server); load balancing algorithms in use (Round Robin, weight-based, etc.); and data replication. Please email Jeremy with any feedback/interest. Further details are in this talk.

Monday 1st October

  • The structure of the WLCG Operations Coordination Team is described here. There are minutes from the meeting on 24th September. Next meeting this Thursday.
  • Ticket responses and a query regarding this ticket.
  • If you use the Alcatel audioconf for meetings and experience problems connecting then read these suggestions. They mainly concern browser issues.
  • The next GDB is on 10th October. There is a pre-GDB on Storage interfaces and Access. Vidyo is available for both meetings.



Tier-1 - Status Page

Tuesday 2nd October

  • Castor 2.1.12 update for Atlas stager went OK last Tuesday (25th Sep).
  • Oracle update also applied to Atlas TAGS database on Tuesday (25th Sep).
  • Problem with Atlas SRM Sunday evening (30th Sep). Problem in SRM Database and unrelated to Castor 2.1.12 update.
  • Brief interruption to one of the servers in the RAL AFS cell this morning (Tues 2nd Oct).
  • Continuing test of hyperthreading. Some problems encountered and have reduced overcommit of nodes.
  • Continue with ten EMI-2 SL-5 worker nodes in normal production.
  • Test instance of FTS version 3 now available. Non-LHC VOs that use the existing service have been enabled on it, and we are looking for one of these VOs to test it.
Storage & Data Management - Agendas/Minutes

Friday 14th September

  • The current GridPP response on the DPM community support proposal: "GridPP acknowledges the concerns and issues raised in the DPM Community proposal. As a collaboration that has many sites with DPM endpoints we presently have a good level of engagement with the DPM development team and in providing additional tools, testing and (currently mainly local) support for DPM. We would be happy to continue this level of contribution and take part in meetings shaping the emerging DPM community. Over the coming months it would be useful to trial working with the DPM team to develop and test additional DPM components which would help us, and sites across WLCG more generally, be better placed to understand how DPM can deliver to presently unknown but evidently changing WLCG experiment requirements on the LS1 timescale."
  • Has anyone in the SG looked at the Glue 2.0 document yet?


Accounting - UK Grid Metrics HEPSPEC06 Atlas Dashboard HS06

Friday 28th September

  • Tier-2 pledges to WLCG will be made shortly. The situation is fine unless there are significant equipment retirements coming up.
  • See Steve Lloyd's GridPP29 talk for the latest on the GridPP accounting.


Wednesday 6th September

  • Sites should check the ATLAS page reporting the HS06 coefficient because, according to the latest statement from Steve, that is what is going to be used. The Atlas Dashboard coefficients are averages over time.

I am going to suggest using the ATLAS production and analysis numbers given in hs06 directly rather than use cpu secs and try and convert them ourselves as we have been doing. There doesn't seem to be any robust way of doing it any more and so we may as well use ATLAS numbers which are the ones they are checking against pledges etc anyway. If the conversion factors are wrong then we should get them fixed in our BDIIs. No doubt there will be a lively debate at GridPP29!
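For reference, the conversion we have been attempting ourselves is just a scaling of raw CPU seconds by a per-core HS06 factor; a minimal sketch (the factor of 10 here is made up, not any site's real coefficient):

```shell
cpu_secs=86400        # one day of CPU time
hs06_per_core=10      # illustrative benchmark factor, not a real one
# cpu-secs * HS06-per-core / 3600 = HS06-hours
awk -v s="$cpu_secs" -v f="$hs06_per_core" 'BEGIN { print s * f / 3600 }'
```

If the factor a site publishes is wrong, every number derived this way is wrong with it, which is the argument for taking the ATLAS HS06 numbers directly.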

Documentation - KeyDocs

Tuesday 11th September

KeyDocs monitoring status: Grid Storage(7/0) Documentation(3/0) On-duty coordination(3/0) Staged rollout(3/0) Ticket follow-up(3/0) Regional tools(3/0) Security(3/0) Accounting(3/0) Core Grid services(3/0) Wider VO issues(3/0) Grid interoperation(3/0) Monitoring(2/0) Cluster Management(1/0) (brackets show total/missing)

Thursday 26th July

All the "site update pages" have been reconfigured from a topic oriented structure into a site oriented structure. This is available to view at https://www.gridpp.ac.uk/wiki/Separate_Site_Status_Pages#Site_Specific_Pages

Please do not edit these pages yet - any changes would be lost when we refine the template. Any comments gratefully received, contact: sjones@hep.ph.liv.ac.uk

Interoperation - EGI ops agendas

Monday 24th September

  • EGI operations meeting minutes.
  • Decommissioning of services deploying unsupported software (gLite 3.1 and parts of gLite 3.2): site managers must decommission the unsupported software following the production service decommissioning procedure (PROC12); this includes (among other actions) removing the service from the GOCDB and the site BDII.
  • UMD 2.2: after a first prioritisation within SA2, the list of candidates for the next UMD2 update is: CREAM-SGE; Unicore-UVOs; MPI; BDII-core; EMI-myProxy; ARGUS; TrustManager and GridSite.
  • Missing CSIRT info for UCL.

Monday 10th September - EGI ops meeting minutes

Tuesday 4th September

  • Security support for the following products was extended to 30/11/2012 (http://glite.cern.ch/support_calendar/):

- glite 3.2 glite-UI
- glite 3.2 glite-WN
- glite 3.2 glite-GLEXEC_wn
- glite 3.2 glite-LFC_mysql/glite-LFC_oracle
- glite 3.2 glite-SE_dpm_disk/glite-SE_dpm_mysql


Monitoring - Links MyWLCG

Monday 2nd July

  • DC has almost finished an initial ranking. This will be reviewed by AF/JC and discussed at the 10th July ops meeting.

Wednesday 6th June

  • Ranking continues. Plan to have a meeting in July to discuss good approaches to the plethora of monitoring available.
  • Glasgow dashboard now packaged and can be downloaded here.
On-duty - Dashboard ROD rota

Friday 28th September

  • Currently 8 sites have at least 1 alarm or ticket open. There are 5 sites with open tickets (including the 3 above).
  • Durham has an alarm relating to low availability for the month.


Rollout Status WLCG Baseline

Monday 1st October

  • All gLite 3.1 services and nodes should now have been upgraded or removed.

Thursday 13th September

Updated all SR pages.

Monday 3rd September

  • Test queues for EMI WNs: RAL T1, Oxford, Liverpool?, Brunel

Tuesday 31st July

  • Brunel has a test EMI2/SL6 cluster almost ready for testing - who (smaller VOs?) will test it with their software?

Wednesday 18th July - Core ops

  • Sites (that needed a tarball install) will need to work on their own glexec installs
  • Reminder that gLite 3.1 no longer supported. 3.2 support is also decreasing. Need to push for EMI.
Security - Incident Procedure Policies Rota

Tuesday 25th September

Monday 10th September

  • Lessons from SSC6 (ops meeting feedback TBC)


Services - PerfSonar dashboard

Tuesday 18th September

  • VOMS in Manchester is now installed with both NGS/GridPP VOs. There are some political decisions to be taken about how to support the NGS VOs and how to maintain them, but they have been installed. Replication tests between Manchester and Oxford can now start.
  • Meeting date/time for follow-up VOMS discussion needs to be agreed for later this week

Tuesday 11th September

  • Still some sites needing to deploy perfsonar
  • Meeting date/time for follow-up VOMS discussion needs to be agreed for later this week
Tickets

Monday 1st October 14.30 BST
36 open tickets this week (down from 40), and it's the start of the month so we get to go over all of them! Maybe every month is a little too often for such a review...

NGI
https://ggus.eu/ws/ticket_info.php?ticket=84381 (19/7)
COMET VO creation, On Hold pending the other VO creation gubbins (6/9). UPDATE - THE GUBBINS TICKET (https://ggus.eu/ws/ticket_info.php?ticket=85736) SUGGESTS THAT THE VO IS IN PRODUCTION

https://ggus.eu/ws/ticket_info.php?ticket=82492 (24/5)
Chris's VOMS request rejig ticket. On hold until the UK VOMS reshuffle is complete; the reminder date (24/9) has passed. (6/9)

TIER 1
https://ggus.eu/ws/ticket_info.php?ticket=86570 (1/10)
GGUS is moving to a SHA2 certificate at their next release (~24th), and have asked if the SHA2 cert will cause any trouble. Gareth has noted the ticket, but it's unclear if others will take notice. In progress (1/10)

https://ggus.eu/ws/ticket_info.php?ticket=86552 (30/9)
Atlas transfers from/to RAL-LCG2 failed, apparently due to high load at the RAL end. Found to be caused by a database problem. Should be fixed, at risk for a little while longer. In progress (1/10) SOLVED - ORACLE WORKAROUND PUT BACK IN PLACE

https://ggus.eu/ws/ticket_info.php?ticket=86541 (29/9)
Before the above problem, atlas transfers were failing with SECURITY_ERRORs. A known FTS bug caused this (https://ggus.eu/tech/ticket_show.php?ticket=81844). Patch applied this morning. In progress (1/10) SOLVED - PATCH WORKED

https://ggus.eu/ws/ticket_info.php?ticket=86152 (17/9)
Duncan has ticketed the Tier 1 over packet loss seen on many (not all) Perfsonar tests where the RAL perfsonar is the destination. The RAL chaps are looking into it, but aren't expecting a solution to easily present itself. In progress (19/9)

https://ggus.eu/ws/ticket_info.php?ticket=85077 (13/8)
biomed nagios jobs can't register files on srm-biomed.gridpp.rl.ac.uk. An odd problem that only seemed to affect biomed jobs. Looked to be dealt with for a while, but seems to have re-emerged. In progress (24/9)

https://ggus.eu/ws/ticket_info.php?ticket=68853 (22/3/11)
SL4 DPM retirement master ticket. On hold but should be In progressed with a view to close (6/9)

DURHAM
https://ggus.eu/ws/ticket_info.php?ticket=86578 (1/10)
Ops srm-put tests are failing. Only put in this morning though, still assigned. (1/10)

https://ggus.eu/ws/ticket_info.php?ticket=86534 (28/9)
Ops wn-rep tests failing. Related to the above? Still just assigned (28/9)

https://ggus.eu/ws/ticket_info.php?ticket=86281 (21/9)
Another wn-rep related ticket (for a different CE). This one too is just assigned. Are these getting to Mike? (21/9) UPDATE - MACHINES ARE HAVING CERTIFICATE ISSUES

https://ggus.eu/ws/ticket_info.php?ticket=86242 (20/9)
Biomed having trouble submitting to cream02, "no space left on device" errors. Not much movement, just in progressed (24/9)

https://ggus.eu/ws/ticket_info.php?ticket=85181 (20/8)
One of the last two glite 3.1 retirement tickets. No reply since Daniela asked if the BDII was indeed glite 3.1. In Progress (13/9)

https://ggus.eu/ws/ticket_info.php?ticket=84123 (11/7)
High atlas production failure rate at Durham. Durham's rocky summer hasn't helped, but hopefully they're out of the woods(?). On Hold (3/9)

https://ggus.eu/ws/ticket_info.php?ticket=75488 (19/10/11)
Ancient compchem ticket. On hold but might not be relevant as all the CEs have been reinstalled (6/9)

https://ggus.eu/ws/ticket_info.php?ticket=68859 (22/3/11)
SL4 retirement ticket. It looks like it can be closed, just need some confirmation from someone Durham side. In progress (28/9)

OXFORD
https://ggus.eu/ws/ticket_info.php?ticket=86544 (29/9)
Problems after running out of atlas pool accounts at Oxford. Probably caused by the lcg-expiregridmapdir bug (I missed the discussion of this, maybe it was offline?), fix in place. Long term plan to up the number of atlas pool accounts. In progress (29/9)

https://ggus.eu/ws/ticket_info.php?ticket=86106 (14/9)
Low atlas sonar rate seen between Oxford & BNL. Ewan has been, and still is, looking into it (17/9)

https://ggus.eu/ws/ticket_info.php?ticket=85968 (10/9)
Oxford being bitten by the EMI lcg_utils bug. On hold pending EMI pulling their finger out. (20/9)

LIVERPOOL
https://ggus.eu/ws/ticket_info.php?ticket=86542 (29/9)
Liverpool suffered a bunch of SRM transfer failures in a short timeframe, no obvious causes found at the time. Were investigating, but were probably interrupted by their unexpected cable bisecting incident today. In progress (29/9). SOLVED - PROBLEM WAS TRANSIENT

https://ggus.eu/ws/ticket_info.php?ticket=86095 (14/9)
Liverpool's encounter with the EMI lcg-utils bug mucking up their WN-rep ops tests. On hold, but has been green for a while - maybe they've just been lucky? (20/9)

BIRMINGHAM
https://ggus.eu/ws/ticket_info.php?ticket=86540 (28/9)
Atlas transfers to Birmingham failed with "SRM_ABORTED" messages. Mark reports that the VM they are using as a headnode isn't beefy enough to cope with the demand, causing SRM responses to be too slow. He upped the power of the VM but that wasn't a full fix, hoping to get a reinstall in today. A note from atlas this morning mentions that transfers fail for DATADISK but not for PRODDISK, which is odd. Are there any differences in the nature of these transfers? In progress (1/10)

https://ggus.eu/ws/ticket_info.php?ticket=86105 (14/9)
One of the tickets clocking poor atlas sonar rates between Birmingham and BNL. Mark and Laurie have looked into this, but not come up with anything conclusive. In progress (19/9)

BRUNEL
https://ggus.eu/ws/ticket_info.php?ticket=86533 (28/9)
Ops "WN-RepDel" tests failing, likely due to the known EMI WN lcg-utils timing-out bug. As Brunel already have a ticket about this issue on a different CE (presumably fronting the same WNs), Daniela asks if the ROD team can sum it up in one ticket rather than multiple. In progress (1/10) CLOSED DUE TO BEING A DUPLICATE

https://ggus.eu/ws/ticket_info.php?ticket=85973 (10/9)
The "original" RepDel test failure ticket at Brunel. On hold (awaiting a fix from EMI) (20/9)

IMPERIAL
https://ggus.eu/ws/ticket_info.php?ticket=86426 (26/9)
Hone have trouble submitting to the Imperial WMSi. Daniela reports that the machines are suffering from being too old (something we can all relate to), replacements should have arrived on Friday but hadn't. Dell report a new delivery date of the 8th. In progress (could be on hold until the kit arrives?) (29/9).

GLASGOW
https://ggus.eu/ws/ticket_info.php?ticket=86391 (25/9)
Atlas were having staging in problems due to high disk server load. Problems however persisted for a while after the load on the server calmed down. Did things sort themselves out after the weekend? In progress (27/9)

https://ggus.eu/ws/ticket_info.php?ticket=85183 (14/8)
One of the last few glite 3.1 retirement tickets. Due to the severe crustiness of the old WMS hardware Glasgow powered it down rather than upgrade it (was it only 32-bit hardware?) and are now pondering the next steps. In progress (28/9) UPDATE - DO WE NEED TO REMOVE IT FROM THE GOCDB OR IS DOWNTIME ENOUGH?

https://ggus.eu/ws/ticket_info.php?ticket=85025 (9/8)
Sno+ WMS problems at Glasgow. AFAICS the wms in question has been switched off due to the reasons above? It might be useful to make that clear to Sno+! In progress (10/9).

RHUL
https://ggus.eu/ws/ticket_info.php?ticket=86383 (25/9)
RHUL stopped publishing UserDN accounting after "upgrading" from glite to EMI apel in August. Apel support have been called in, and Daniela suggests checking the FAQ. In progress (1/10)

QMUL
https://ggus.eu/ws/ticket_info.php?ticket=86378 (25/9)
Hone had jobs waiting "too long" at QM, but the problems disappeared. Along with a bunch of jobs, looks like the QM creams suffered from the database resetting issue (https://ggus.eu/tech/ticket_show.php?ticket=85970, as advertised by Daniela). In progress (27/9)

https://ggus.eu/ws/ticket_info.php?ticket=86306 (22/9)
Queen Mary is being swamped by unkillable lhcb zombie pilots. Neither the submitters nor the site admins can do anything about them using "normal" tools. Daniela has suggested some DB queries to try, or attempting to use the JobPurger tool (which would be my suggestion too). In progress (1/10). UPDATE - Some success with the JobPurger with a 5 day time frame

https://ggus.eu/ws/ticket_info.php?ticket=85967 (10/9)
QM failing ops Apel tests. Chris ticketed apel support for help (https://ggus.eu/ws/ticket_info.php?ticket=84326), but they're not having much luck due to the sheer size of their DB, and progress was interrupted by GridPP last week. Hopefully they will break this problem this week. On hold (21/9)

ECDF
https://ggus.eu/ws/ticket_info.php?ticket=86334 (24/9)
Poor atlas sonar rates between BNL and ECDF. Waiting on moving disk servers to new switches and other general network wizardry scheduled for this week. On hold till then (28/9).

CAMBRIDGE
https://ggus.eu/ws/ticket_info.php?ticket=86108 (14/9)
Duncan noticed a WAN bandwidth asymmetry at Cambridge. John contacted the local networking guys, who've investigated and found nothing. Still in progress (26/9)

LANCASTER
https://ggus.eu/ws/ticket_info.php?ticket=85367 (20/8)
ilc were having trouble submitting jobs to one of Lancaster's CEs. Robin tracked the issues to high disk IO load, and we're figuring out some ways of mitigating these problems. In progress (1/10)

https://ggus.eu/ws/ticket_info.php?ticket=84583 (26/7)
lhcb jobs failing on a Lancaster CE, originally due to a pool account misconfiguration. The problem has been fixed (probably...) but files don't seem to be being staged in for lhcb and there are no errors (or mention of lhcb at all) in the gridftp logs. Debugging is not being helped by the load issues documented above. In progress (27/9)

https://ggus.eu/ws/ticket_info.php?ticket=84461 (23/7)
t2k.org transfers from RAL to Lancaster timing out. We hoped the gateway upgrade would improve things, but we were disappointed. Back to the network investigation. In progress (1/10)

RALPP
https://ggus.eu/ws/ticket_info.php?ticket=85019 (9/8)
ILC had some adventures due to VO misconfiguration at RALPP, but it looks like things are fixed and the ticket can be closed now. In progress (1/10)

SUSSEX
https://ggus.eu/ws/ticket_info.php?ticket=81784 (1/5)
Emyr wondered last week if this was the longest ticket ever? Sadly I doubt it! The baton has passed, oddly enough, to Lancaster, as we've come across a bizarre problem whereby communication from the Sussex cream CE (and only the cream CE) is being refused by machines on a specific Lancaster subnet. Sadly this is the subnet where the Lancaster nagios box is sitting. We've ruled out firewalls and had the network chaps at both sides take a look. Traffic is being stopped at the Lancaster end, but by the servers themselves (not the network gateways). I'm currently investigating to see if there's any oddity with our network settings. In progress (26/9)

Ticket of Interest:
https://ggus.eu/tech/ticket_show.php?ticket=85970
As mentioned above, the ticket documenting the EMI2 cream database "reset" problems.

Solved Tickets
Ran out of time for these, but I notice that most of the glite 3.1 tickets are closed and the neurogrid VO has taken off. Good stuff!

Tools - MyEGI Nagios

Monday 17th September

  • Current state of Nagios is now on this page.

Monday 10th September

  • Discussion needed on which Nagios instance is reporting for the WLCG (metrics) view



VOs - GridPP VOMS VO IDs Approved VO table

Friday 30th September

  • Summary of some VO activities given at GridPP29
  • Need more feedback/testing from smaller VOs ahead of EMI2-WN change and then SL6.

Tuesday 18 September 2012

  • No VOs reporting issues.
  • VOs have been asked for a brief summary for the GridPP meeting.

Monday 27th August


Site Updates

Friday 30th September

  • SUSSEX: Site is now out of downtime and passing ops tests. Issue remains with Lancaster Nagios. ATLAS work to be enabled.


Meeting Summaries
Project Management Board - Members Minutes Quarterly Reports

Monday 1st October

  • ELC work


Tuesday 25th September

  • Reviewing pledges.
  • Q2 2012 review
  • Clouds and DIRAC
GridPP ops meeting - Agendas Actions Core Tasks

Tuesday 21st August - link Agenda Minutes

  • TBC


RAL Tier-1 Experiment Liaison Meeting (Wednesday 13:30) Agenda EVO meeting

Wednesday 3rd October

  • Operations report
  • Castor 2.1.12 upgrade for Atlas instance successful on 25th September. Dates for other instances announced.
  • There have been a couple of outages of the Atlas Castor instance. Issue involving the SRM database understood and workaround in place.
  • LHCb have said their Conditions Database at RAL can be retired.
WLCG Grid Deployment Board - Agendas MB agendas

July meeting Wednesday 11th July

Welcome (Michel Jouvin)
  • September meeting to include IPv6, LS1 and extended run plans
  • EMI-2 WN testing also in September

CVMFS deployment status (Ian Collier)
  • Recap: 78/104 sites for ATLAS – the UK looks good thanks to Alessandra
  • Using two repos for ATLAS. Local shared area will be dropped in future.
  • 36/86 for LHCb. Two WN mounts. Preference for CVMFS – extra work
  • 5 T2s for CMS. Info for sites: https://twiki.cern.ch/twiki/bin/view/CMSPublic/CompOpsCVMFS
  • Client with shared cache in testing
  • Looking at NFS client and Mac OS X

Pre-GDB on CE Extensions (Davide Salomoni)
  • https://indico.cern.ch/conferenceDisplay.py?confId=196743
  • Goal – review proposed extensions + focus on whole-node/multi-core set
  • Also agree development plan + timeline for CEs.
  • Fixed cores and variable number of cores + memory requirements. May impact expt. frameworks.
  • Some extra attributes added in Glue 2.0 – e.g. MaxSlotsPerJob
  • JDL. Devel. Interest. Queue level or site level.
  • How. CE implementations. Plan. Actions.

Initial Meeting with EMI, EGI and OSG (Michel Jouvin)
  • ID issues related to end of supporting projects (e.g. EMI)
  • Globus (community?); EMI MW (WLCG); OSG; validation
  • Discussion has not included all stakeholders.

How to identify the best top level BDIIs (Maria Alandes Pradillo)
  • Only 11% are "properly" configured (LCG_GFAL_INFOSYS 1,2,3)
  • UK BDIIs appear in top 20 of 'most configured'.
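A "properly" configured client in this sense lists three top-level BDIIs, comma-separated, in LCG_GFAL_INFOSYS so that lookups can fail over; a sketch with hypothetical hostnames:

```shell
# Hypothetical hostnames - substitute your NGI's real top-level BDIIs.
export LCG_GFAL_INFOSYS="top-bdii1.example.ac.uk:2170,top-bdii2.example.ac.uk:2170,top-bdii3.example.ac.uk:2170"
# Clients try each endpoint in turn if an earlier one is unreachable.
echo "$LCG_GFAL_INFOSYS" | tr ',' '\n' | wc -l
```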

MUPJ – gLExec update (Maarten Litmaath)
  • 'glexec' flag in GOCDB for each supporting CE
  • http://cern.ch/go/PX7p (so far… T1, Brunel, IC-HEP, Liv, Man, Glasgow, Ox, RALPP)
  • Improved instructions: https://twiki.cern.ch/twiki/bin/view/LCG/GlexecDeployment
  • CMS ticketing sites. Working on GlideinWMS.

WG on Storage Federations (Fabrizio Furano)
  • Federated access to data – clarify what needs supporting
  • 'fail over' for jobs; 'repair mechanisms'; access control
  • So far XROOTD clustering through WAN = natural solution
  • Setting up group.

DPM Collaboration – Motivation and proposal (Oliver Keeble)
  • Context. Why. Who…
  • UK is 3rd largest user (by region/country)
  • Section on myths: DPM has had investment. Not only for small sites…
  • New features: HTTP/WebDAV, NFSv4.1, Perfsuite…
  • Improvements with xrootd plugin
  • Looking for stakeholders to express interest… expect proposal shortly
  • Possible model: 3-5 MoU or 'maintain'

Update on SHA-2 and RFC proxy support
  • IGTF wish CAs to move to SHA-2 signatures ASAP. For WLCG this means using RFC proxies in place of the current Globus legacy proxies.
  • dCache & BeStMan may look at the EMI Common Authentication Library (CANL) – supports SHA-2 with legacy proxies.
  • IGTF aim for Jan 2013 (then it takes 395 days for SHA-1 to disappear)
  • Concern about timeline (LHC run now extended)
  • Status: https://twiki.cern.ch/twiki/bin/view/LCG/RFCproxySHA2support
  • Plan: deployed SW supports RFC proxies (Summer 2013) and SHA-2 (except dCache/BeStMan – Summer 2013). Introduce SHA-2 CAs Jan 2014.
  • Plan B – a short-lived WLCG catch-all CA
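As a practical aside (not from the talk), a site can check which hash algorithm a certificate is signed with using openssl; sketched here against a throwaway self-signed certificate rather than a real host or CA cert:

```shell
# Generate a disposable SHA-256-signed cert purely for demonstration.
openssl req -x509 -newkey rsa:2048 -sha256 -nodes -subj "/CN=demo" \
    -keyout /tmp/demo-key.pem -out /tmp/demo-cert.pem 2>/dev/null
# Show the signature algorithm; legacy certs report sha1WithRSAEncryption.
openssl x509 -in /tmp/demo-cert.pem -noout -text | grep -m1 "Signature Algorithm"
```

The same `openssl x509` inspection works on a host certificate or a CA certificate from /etc/grid-security/certificates.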

ARGUS Authorization Service (Valery Tschopp)
  • Authorisation examples & ARGUS motivation (many services, global banning, static policies). Can user X perform action Y on resource Z?
  • ARGUS is built on top of a XACML policy engine
  • PAP = Policy Administration Point. Tool to author policies.
  • PDP = Policy Decision Point (evaluates requests)
  • PEP = Policy Enforcement Point (reformats requests)
  • Hides XACML behind the Simplified Policy Language (SPL)
  • Central banning = hierarchical policy distribution
  • Pilot job authorization – gLExec executes the payload on the WN
  • https://twiki.cern.ch/twiki/bin/view/EGEE/AuthorizationFramework
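For flavour, a banning policy in SPL looks roughly like the following sketch (the resource ID, action pattern and DN are invented for illustration, and exact attribute names may vary between ARGUS versions):

```
resource "http://authz.example.org/resource/ce" {
    action ".*" {
        rule deny { subject = "CN=Banned User,O=Example,C=UK" }
        rule permit { fqan = "/atlas" }
    }
}
```

The PAP distributes such policies down the hierarchy, which is what makes central banning workable.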

Operations Coordination Team (Maria Girone)
  • Mandate – addresses needs in WLCG service coordination recommendations & commissioning of OPS and Tools.
  • Establish core teams of experts to validate, commission and troubleshoot services.
  • Team goals: understand services needed; monitor health; negotiate configs; commission new services; help with transitions.
  • Team roles: core members (sites, regions, expt., services) + targeted experts
  • Tasks: CVMFS, Perfsonar, gLExec

Jobs with High Memory Profiles
  • See expt reports.



NGI UK - Homepage CA

Wednesday 22nd August

  • Operationally few changes - VOMS and Nagios changes on hold due to holidays
  • Upcoming meetings Digital Research 2012 and the EGI Technical Forum. UK NGI presence at both.
  • The NGS is rebranding to NES (National e-Infrastructure Service)
  • EGI is looking at options to become a European Research Infrastructure Consortium (ERIC). (Background document.)
  • Next meeting is on Friday 14th September at 13:00.
Events

WLCG workshop - 19th-20th May (NY) Information

CHEP 2012 - 21st-25th May (NY) Agenda

GridPP29 - 26th-27th September (Oxford)

UK ATLAS - Shifter view News & Links

Thursday 21st June

  • Over the last few months ATLAS have been testing their job recovery mechanism at RAL and a few other sites. This is something that was 'implemented' before but never really worked properly. It now appears to be working well, allowing jobs to finish even if the SE is down/unstable when the job finishes.
  • Job recovery works by writing the output of the job to a directory on the WN should it fail when writing the output to the SE. Subsequent pilots will check this directory and try again for a period of 3 hours. If you would like to have job recovery activated at your site you need to create a directory which (atlas) jobs can write to. I would also suggest that this directory has some form of tmpwatch enabled on it which clears up files and directories older than 48 hours. Evidence from RAL suggests that it's normally only 1 or 2 jobs that are ever written to the space at a time, and the space used is normally less than a GB; I have not observed more than 10GB being used. Once you have created this space, email atlas-support-cloud-uk at cern.ch with the directory (and your site!) and we can add it to the ATLAS configurations. We can switch off job recovery at any time if it does cause a problem at your site. Job recovery would only be used for production jobs, as users complain if they have to wait a few hours for things to retry (even if it would save them time overall...)
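The tmpwatch-style cleanup suggested above can be sketched with a find one-liner, shown here against a scratch directory since the real recovery path is site-specific:

```shell
# Demo on a throwaway directory; in production, point this at the
# job-recovery directory you created for ATLAS (path is site-specific)
# and run it daily from cron.
RECOVERY_DIR=$(mktemp -d)
touch -d '3 days ago' "$RECOVERY_DIR/stale-job-output"
touch "$RECOVERY_DIR/fresh-job-output"
# Delete anything older than 48 hours (-mtime +2); -mindepth 1 keeps
# the recovery directory itself.
find "$RECOVERY_DIR" -mindepth 1 -mtime +2 -delete
ls "$RECOVERY_DIR"
```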
UK CMS

Tuesday 24th April

  • Brunel will be trialling CVMFS this week, will be interesting. RALPP doing OK with it.
UK LHCb

Tuesday 24th April

  • Things are running smoothly. We are going to run a few small scale tests of new codes. This will also run at T2, one UK T2 involved. Then we will soon launch new reprocessing of all data from this year. CVMFS update from last week; fixes cache corruption on WNs.
UK OTHER

Thursday 21st June - JANET6

  • JANET6 meeting in London (agenda)
  • Spend of order £24M for strategic rather than operational needs.
  • Recommendations to BIS shortly
  • Requirements: bandwidth, flexibility, agility, cost, service delivery - reliability & resilience
  • Core presently 100Gb/s backbone. Looking to 400Gb/s and later 1Tb/s.
  • Reliability limited by funding not ops so need smart provisioning to reduce costs
  • Expecting a 'data deluge' (ITER; EBI; EVLBI; JASMIN)
  • Goal of dynamic provisioning
  • Looking at ubiquitous connectivity via ISPs
  • Contracts were 10yrs wrt connection and 5yrs transmission equipment.
  • Current native capacity 80 channels of 100Gb/s per channel
  • Fibre procurement for next phase underway (standard players) - 6400km fibre
  • Transmission equipment also at tender stage
  • Industry engagement - Glaxo case study.
  • Extra requirements: software coding, security, domain knowledge.
  • Expect genome data usage to explode in 3-5yrs.
  • Licensing is a clear issue
To note

Tuesday 26th June