Difference between revisions of "Operations Bulletin Latest"

From GridPP Wiki
Line 429: Line 429:
 
===== =====
 
===== =====
 
<!-- ******************Edit start********************* ----->
 
<!-- ******************Edit start********************* ----->
 +
'''Monday 6th July 2015, 14.00 BST''' <br />
 +
30 Open UK Tickets this month. Looking at them all!
  
 +
'''NGI'''<br />
 +
[https://ggus.eu/?mode=ticket_info&ticket_id=114233 114233] (10/6)<br />
 +
The UK not publishing core counts at all sites. Some progress, but at last check John G couldn't see a change for Oxford or Glasgow. In progress (30/6)
 +
 +
'''RALPP'''<br />
 +
[https://ggus.eu/?mode=ticket_info&ticket_id=114442 114442] (18/6)<br />
 +
Gridpp Pilot role ticket. Accounts need to be created, but no word for a few weeks. In progress (19/6)
 +
 +
[https://ggus.eu/?mode=ticket_info&ticket_id=114764 114764] (1/7)<br />
 +
Ticket tracking (false) availability issues, created to appease COD - the problem was caused by a broken CA rpm release for Arc CEs. Kashif has created a counter-ticket [https://ggus.eu/index.php?mode=ticket_info&ticket_id=114742 114742]. Gordon's sage advice is to submit a recalculation request once the issue is fixed. Assigned (1/7)
 +
 +
'''BRISTOL'''<br />
 +
[https://ggus.eu/?mode=ticket_info&ticket_id=114485 114485] (19/6)<br />
 +
Bristol's gridpp pilot role ticket. No news, could do with an update really. In progress (22/6)
 +
 +
[https://ggus.eu/?mode=ticket_info&ticket_id=114426 114426] (18/6)<br />
 +
CMS AAA reading test problems. The Bristol admins have transferred data to their new shiny SE and have asked CMS to test again. No word since. Waiting for reply (30/6)
 +
 +
'''EDINBURGH'''<br />
 +
[https://ggus.eu/?mode=ticket_info&ticket_id=95303 95303] (1/7/13...)<br />
 +
Tarball glexec ticket, now 2 years old. After a really promising burst the last 6 weeks haven't seen any progress, due to a lot of other "normal" tarball work taking up the time. Sorry! On hold (18/5)
 +
 +
'''DURHAM'''<br />
 +
[https://ggus.eu/?mode=ticket_info&ticket_id=114536 114536] (22/6)<br />
 +
Durham's gridpp pilot role ticket. Not acknowledged yet, is Oliver back yet? Assigned (22/6)
 +
 +
[https://ggus.eu/?mode=ticket_info&ticket_id=114765 114765] (1/7)<br />
 +
See RALPP ticket 114764. Assigned (1/7)
 +
 +
[https://ggus.eu/?mode=ticket_info&ticket_id=114727 114727] (30/6)<br />
 +
Catalin ticketed that a number of SW_DIR variables at Durham are still pointing to the old school .gridpp.ac.uk cvmfs space. Assigned (30/6)
 +
 +
[https://ggus.eu/?mode=ticket_info&ticket_id=114381 114381] (16/6)<br />
 +
John G ticketed Durham over a small percentage of jobs being published as "zero core". Looks like a SLURM timeout problem, although a fix isn't obvious. Put on the back burner whilst Oliver is on holiday. On Hold (19/6)
 +
 +
'''SHEFFIELD'''<br />
 +
[https://ggus.eu/?mode=ticket_info&ticket_id=114649 114649] (26/6)<br />
 +
A ticket from a Sno+ user about not being able to access software using the Sheffield CEs. Acknowledged but no news. In progress (26/6)
 +
 +
[https://ggus.eu/?mode=ticket_info&ticket_id=114460 114460] (18/6)<br />
 +
Sheffield's gridpp pilot role ticket. Did you get round to rolling them out? In progress (19/6)
 +
 +
'''MANCHESTER'''<br />
 +
[https://ggus.eu/?mode=ticket_info&ticket_id=114444 114444] (18/6)<br />
 +
LHCB ticket concerning the DPM's SRM not returning checksum information. On hold whilst a related ticket is being looked at (https://ggus.eu/index.php?mode=ticket_info&ticket_id=111403). On Hold (22/6)
 +
 +
'''LIVERPOOL'''<br />
 +
[https://ggus.eu/index.php?mode=ticket_info&ticket_id=111403 111403] (10/6)<br />
 +
Another Sno+ ticket, about grid production jobs failing at Liverpool. AIUI caused by Sno+ running out of space on the shared pool. At last check Steve posted the usage information for Sno+ but no word since (and Steve's off on his hols). In progress (17/6)
 +
 +
'''LANCASTER'''<br />
 +
[https://ggus.eu/?mode=ticket_info&ticket_id=114845 114845] (6/7)<br />
 +
LHCB pilots failing at Lancaster. Looks like a simple node misconfiguration, hopefully fixed, waiting to see if it is. On hold (6/7)
 +
 +
[https://ggus.eu/?mode=ticket_info&ticket_id=95299 95299] (1/7/2013)<br />
 +
glexec ticket - see Edinburgh description. On hold (15/5)
 +
 +
[https://ggus.eu/?mode=ticket_info&ticket_id=100566 100566] (27/1)<br />
 +
Bad bandwidth performance at Lancaster. Hoping that IPv6 will shake things up a bit so pushing that. On hold (18/5)
 +
 +
'''UCL'''<br />
 +
[https://ggus.eu/?mode=ticket_info&ticket_id=114746 114746] (30/6)<br />
 +
SRM-put failures ROD ticket. No news at all. Assigned (30/6)
 +
 +
[https://ggus.eu/?mode=ticket_info&ticket_id=114851 114851] (6/7)<br />
 +
Low availability ROD ticket, related to above. Assigned (6/7)
 +
 +
'''RHUL'''<br />
 +
[https://ggus.eu/?mode=ticket_info&ticket_id=114441 114441] (18/6)<br />
 +
Another GridPP pilot role ticket. Pilots rolled out, but something isn't quite right and they're not working - Govind is looking again. In progress (6/7)
 +
 +
'''QMUL'''<br />
 +
[https://ggus.eu/?mode=ticket_info&ticket_id=114573 114573] (23/6)<br />
 +
LHCB ticket about two out of three QM CEs not responding for them. Dan spotted that the broken CEs were dual-stacked and the working one wasn't. The ticket seems to have trailed off into some confusion over who needs to do some testing where. I agree with Dan that the "who" needs to be someone with LHCB credentials! The waters still seem muddied. In progress (1/7)
 +
 +
'''IC'''<br />
 +
[https://ggus.eu/?mode=ticket_info&ticket_id=114737 114737] (30/6)<br />
 +
The IC voms wasn't updating properly, due to what I infer from the ticket as "SSL/mysql madness". Simon and Robert have been heroically battling this one - it's a good read. On hold (3/7)
 +
 +
[https://ggus.eu/?mode=ticket_info&ticket_id=114379 114379] (16/6)<br />
 +
Sam's ticket about SE support in Dirac. Sam will shortly try testing things out on the new Dirac to see how it fares. In progress (6/7)
 +
 +
'''BRUNEL'''<br />
 +
[https://ggus.eu/?mode=ticket_info&ticket_id=114447 114447] (18/6)<br />
 +
Brunel's gridpp pilot ticket. Being worked on, with one CE with the pilots enabled. In progress (26/6)
 +
 +
[https://ggus.eu/?mode=ticket_info&ticket_id=114006 114006] (31/5)<br />
 +
A ticket from APEL, about Brunel under-reporting the number of jobs they are doing. Turned out to be a problem with Arc, which Raul upgraded to the fixed 5.0 version. The APEL team deleted the sync records, but no word since. In progress (30/6)
 +
 +
[https://ggus.eu/?mode=ticket_info&ticket_id=114850 114850] (6/7)<br />
 +
Another APEL ticket, likely the fallout of the previous one - it looks like GAP publishing has been left on for the Brunel CREAM CEs. Assigned (6/7)
 +
 +
'''TIER 1'''<br />
 +
[https://ggus.eu/?mode=ticket_info&ticket_id=114786 114786] (2/7)<br />
 +
Low availability ticket - see RALPP ticket 114442 - probably could do with being put On hold. In progress (2/7)
 +
 +
[https://ggus.eu/?mode=ticket_info&ticket_id=113910 113910] (26/5)<br />
 +
Sno+ data staging problems. Brian gave some advice on how the large VOs do data staging from tape, and has asked if Sno+ still has problems. Matt M might still be on leave though. Waiting for reply (23/6)
 +
 +
[https://ggus.eu/?mode=ticket_info&ticket_id=108944 108944] (1/10/14)<br />
 +
CMS AAA problems, which eventually brought to light a problem with super-hot datasets, which was alleviated (I think). Despite an update to castor that improved performance, the last batch of tests didn't show improved results. No news since. In progress (17/6)
 +
 +
[https://ggus.eu/?mode=ticket_info&ticket_id=113836 113836] (20/5)<br />
 +
Glue mismatch problems at RAL. Working on getting "many-Arcs" to correctly publish. In progress (24/6)
  
 
<!-- ******************Edit stop********************* ----->
 
<!-- ******************Edit stop********************* ----->

Revision as of 14:45, 6 July 2015

Bulletin archive


Week commencing 6th July 2015
Task Areas
General updates

Tuesday 30th June

  • Steve's GridPP tests relied on the CERN LFC (ATLAS) which was removed last week. We need to review what tests if any we still want to run in the UK. Do we?
  • A new mailing list has been set up for the discussion of batch and CE matters in WLCG: project-lcg-gdb-batch at cern.ch.
  • Some have noted that the GOCDB was not sending out downtime notifications. A ticket was raised.
  • A preliminary draft of the July GDB agenda is available.
  • In order to get the broadest input, EGI has prepared a survey (deadline July 7th).
  • There was an EGI OMB last week. Please take a look at the brief minutes and actions.
  • The SAM tests used for ATLAS Availability and Reliability of SEs are going to be changed to new gfal2 DDM-probes from 1st July. The new monitoring is already running at: http://wlcg-sam-atlas-dev.cern.ch/ (the names of the tests are DDM-srm-Del, DDM-srm-Get and DDM-srm-Put). From 1st July these will move to http://wlcg-sam-atlas.cern.ch/.
  • The camont collaboration is no longer active, but the VO is enabled and useful. Previous contributors to the camont work would like to re-purpose the VO to support the Cambridge 'Geant Human Oncology Simulation Tool'. Provided we broadcast any change to sites supporting camont and the AUP changes (and all users re-sign it), is there any objection to this pragmatic solution? It fits well with our current aim of getting varied communities using our resources successfully.

Tuesday 23rd June

  • The BDII discussion of last week continued into the WLCG ops coordination meeting (see email thread).
  • The IGTF is about to release an update to the trust anchor repository (1.65).
  • hone has notified us that they have completed their use of the grid.
  • CVMFS space for GridPP VO: /cvmfs/gridpp.gridpp.ac.uk or /cvmfs/gridpp.egi.eu
  • supernemo data in Liverpool DPM - can it be removed?
  • On July 17 a 4 hour long EGI Federated Cloud tutorial will be organised in London (at SAP near Heathrow). It's a free event, part of a 3-day long software carpentry workshop.
  • GridPP35 @ Liverpool in September.


WLCG Operations Coordination - Agendas

Tuesday 30th June

  • A new WLCG ops portal will be live soon. If you are particularly keen to give feedback please contact Jeremy

Tuesday 23rd June

  • There was a WLCG ops coordination meeting last Thursday 18th June: Agenda. [1].
  • Highlights: Information System discussion started. Use cases and dependencies will be built up and reviewed. May have a pre-GDB on topic.
  • All sites should enable multicore accounting
  • News: No update.
  • Baselines: Removed WMS & L&B. LFC will be removed soon.
  • MW issues: New globus-gss api released to mitigate problem reported last time.
  • T0&1 services: T0 LHCb and shared LFC will be decommissioned 22nd June. Some dCache upgrades reported.
  • T0 news: Efficiency meeting held. Cloud team making I/O changes. LHC exits see some improvement but T0 still behind other sites.
  • T1 feedback: NTR
  • T2 feedback: UK response on Information System: Useful for service discovery; minor VO usage; contains too much information; cloud raises new questions; mixed data types; YAIM helpful to fill schemas.
  • OSG: Provide InfoSys as service to VOs. Best case deprecation early 2016 but depends on USATLAS.
  • InfoSys: HTC > GLUE in OSG. AGIS uses it (ATLAS seek merge of GOCDB, OIM and BDII). LHCb uses for CE discovery. CMS no clear usage. ALICE for SAM and CERN IT C5 reports.
  • ALICE: High activity. CASTOR issue with xrd3cp. Request sites to plan for Xrootd v4.1.
  • ATLAS: Good data taking. T0 some issues with batch/OpenStack improving. CERN network issue had impacts. CERN to BNL data backlog due to FTS not pushing hard enough.
  • CMS: Data taking but technical stops. MC going well. T1 CPU should be 90% production role and 10% pilot. File transfer FNAL-RAL - possible WAITIO on storage nodes due to many CMS jobs.
  • LHCb: Run2 offline processing workflows validated. Some issues with old files at RAL without checksums.
  • gLExec: NTR
  • RFC proxies: SAM - okay now for ALICE. CMS PhEDEx instances being switched.
  • Machine/Job features: NTR
  • Middleware readiness: Good work. Credit to ECDF and GRIF for DPM work. New pakiti-client imminent in EPEL stable. MW readiness App now available on a production instance. EL7 support for Argus urgent. Next meeting 16th September.
  • Multicore: Several sites still not publishing. APEL tickets on NGIs. Issues identified for CREAM and ARC MC publishing.
  • IPv6: NTR
  • Network and transfers WG: PS: proposed mesh for up to 100 sites. Potential bug noted. Next meeting 8th July.
  • HTTP: 2nd meeting on 3rd June. Draft conclusions. Next meeting 15th July.

Tuesday 16th June

  • The next WLCG ops coordination meeting is this Thursday 18th June: Agenda. There will be presentations and discussions on the Information System.
  • The next middleware readiness meeting is on Wednesday 17th June @ 3pm BST: Agenda.


Tier-1 - Status Page
  • A reminder that there is a weekly Tier-1 experiment liaison meeting.
  • The agenda follows this format:
    • 1. Summary of Operational Status and Issues
    • 2. Highlights/summary of the Tier1 Monday operations meeting (Grid Services; Fabric; CASTOR and Other)
    • 3. Experiment plans and operational issues (CMS; ATLAS; LHCb; ALICE and Others)
    • 4. Special presentations
    • 5. Actions
    • 6. Highlights for Operations Bulletin Latest
    • 7. AoB

Tuesday 30th June

  • There were two separate network-related problems on Tuesday afternoon last week. The first was a short (less than ten minute) break in connectivity when a router rebooted. The second, which lasted around 45 minutes, was a period of very high traffic caused by an operational/configuration problem on a hypervisor.
  • We have announced an 'at risk' for the next quarterly UPS/generator load test tomorrow (Wednesday) morning.
Storage & Data Management - Agendas/Minutes

Wednesday 01 July

  • Feedback on CMS's proposal for listing contents of storage
  • Simple storage on expensive raided disks vs complicated storage on el cheapo or archive drives?

Wednesday 24 June

  • Heard about the Indigo datacloud project, a H2020 project in which STFC is participating
  • Data transfers, theory and practice
    • Somewhat clunky tools to set up but perform well when they run
    • Will continue to work on recommendations/overview document
    • Worth having recommendations/experiences for different audiences - (potential) users, decision makers, techies

Tuesday 23rd June

  • Good progress with DiRAC transfers from Durham - data flowing since Monday.

Wednesday 17 June

  • EU projects - SAGE: HSM for HPC
  • Progress on new VOs. Can test as members of 'gridpp' or similar until they get their own allocations.
    • We've talked about it before; should VOs have individual T2 allocations to avoid stepping on each other's toes?
    • Case for expanding back-up-into-T1 to other VOs?

Wednesday 27 May

  • Working on troubleshooting DIRAC data for/with LIGO (not to be confused with DiRAC or with any of the other things called DiRAC)
  • Working on setting up DiRAC at Tier 1 (not to be confused with DIRAC or Dirac or with any other thing called Dirac)
  • New secret user support list!



Accounting - UK Grid Metrics HEPSPEC06 Atlas Dashboard HS06

Tuesday 16th June

  • Region not publishing accounting by number of cores.
    • "0" core submission hosts:
    • ce3.dur.scotgrid.ac.uk
    • ce4.dur.scotgrid.ac.uk
    • cetest02.grid.hep.ph.ic.ac.uk
    • hepgrid5.ph.liv.ac.uk
    • hepgrid6.ph.liv.ac.uk
    • hepgrid97.ph.liv.ac.uk
    • svr009.gla.scotgrid.ac.uk
    • t2ce06.physics.ox.ac.uk

Tuesday 9th June

  • Delay noted for Sheffield

Tuesday 26th May

  • Delay noted for Sheffield.

Tuesday 12th May

  • Issues noted with sync for Brunel, Liv, ECDF (see EGI ticket 113473). Message broker issues (memory related) are likely the underlying EGI problem.
  • Need to check on VAC sync publishing.


Documentation - KeyDocs

Tuesday 23rd June

  • Reminder that documents need reviewing!

Tuesday 9th June

LSST voms2 records are not present in VOID cards yet. As a workaround, a temporary note of the actual values has been added to the LSST section of Approved VOs.

https://www.gridpp.ac.uk/wiki/GridPP_approved_VOs

General note

See the worst KeyDocs list for documents needing review now and the names of the responsible people.

Tuesday 21st April

  • The Approved VOs document has been updated to take account of changes to the Ops Portal VOID cards. For SNOPLUS.SNOLAB.CA, the port numbers for voms02.gridpp.ac.uk and voms03.gridpp.ac.uk have both been updated from 15003 to 15503. Sites that support SNOPLUS.SNOLAB.CA should ensure that their configuration conforms to these settings: Approved VOs
  • KeyDocs still need updating since agreements reached at last core ops meeting.
  • New section in Wiki called "Project Management Pages".
    The idea is to cluster all Self-Edited Site Tracking Tables in here. Sites should keep entries in Current Activities up to date. Once a Self-Edited Site Tracking Table has served its purpose, the PM should move it to the Historical Archive or otherwise dispose of the table.
Interoperation - EGI ops agendas

Monday 15th June

  • There was an EGI operations meeting today: agenda.
  • New action for the NGIs: please start tracking which sites are still using SL5 services (how many services, whether each service is still needed on SL5, and whether upgrades of SL5 services are expected). A wiki has been provided to record updates. It would also be interesting to understand who is using Debian.

Tuesday 21st April

  • There was an EGI ops meeting on Monday 20th.
  • David updated the UK SL5 response.
  • Please review the agenda/minutes.


Monitoring - Links MyWLCG

Tuesday 16th June

  • F Melaccio & D Crooks decided to add a FAQs section devoted to common monitoring issues under the monitoring page.
  • Feedback welcome.


Tuesday 31st March

Monday 7th December

On-duty - Dashboard ROD rota

Monday 22nd June

  • Generally quiet. There are some 'glue2' errors that were ticketed. Tried to let these go and see if they would clear. However, in some cases the amount of time the error was outstanding was building up. Unclear if Glue2 is used anywhere.


Monday 8th June

  • The eu.repository has now made a comeback, so the arc alarms cleared, but the site availabilities (probably) need to be corrected.
  • Still getting on/off bdii alarms for a variety of sites.

Monday 11th May

  • Rota responses awaited from Andrew and Daniela.
  • Handover summary should be uploaded to the bulletin please.


Rollout Status WLCG Baseline

Tuesday 12th May

  • MW Readiness WG meeting Wed May 6th at 4pm. Attended by Raul, Matt, Sam and Jeremy.

Tuesday 17th March

  • Daniela has updated the [https://www.gridpp.ac.uk/wiki/Staged_rollout_emi3 EMI-3 testing table]. Please check it is correct for your site. We want a clear view of where we are contributing.
  • There is a middleware readiness meeting this Wednesday. Would be good if a few site representatives joined.
  • Machine job features solution testing. Fed back that we will only commence tests if more documentation made available. This stops the HTC solution until after CHEP. Is there interest in testing other batch systems? Raul mentioned SLURM. There is also SGE and Torque.

References


Security - Incident Procedure Policies Rota

Monday 29th June

  • EUGridPMA have announced a new set of CA rpms. Based on this IGTF release, a new set of CA RPMs has been packaged for EGI. Please upgrade at your earliest convenience, within the next seven days. After that period, SAM will throw critical errors on CA tests if old CAs are still detected.
  • The next security team meeting is this Wednesday 1st July.

Tuesday 16th June

  • Security team meeting this Wednesday.
  • One topic for review concerns ES.

Tuesday 9th June



Services - PerfSonar dashboard | GridPP VOMS

- This includes notifying of (inter)national services that will have an outage in the coming weeks or will be impacted by work elsewhere. (Cross-check the Tier-1 update).

Tuesday 23rd June

  • GridPP issued a position statement regarding LHCONE.
    • ...Concerning LHCONE for both T1 and T2. The high level summary is that the UK is not in favour, as within the UK we have no explicit need for LHCONE for any reason of T1 capacity planning, but to implement it involves additional complexity and possibly cost. The current system works fine and we therefore see no overriding reason to remove T1-T1 transit via LHCOPN. ...The UK is sensitive to the “collective” needs of the community, and as a general statement we would always seek to address any legitimate request agreed by the WLCG MB in order to play our role in meeting international expectations.

Tuesday 12th May

  • LHCOPN & LHCONE joint meeting at LBL June 1st & 2nd. Agenda taking shape.

Tuesday 31st March

Tickets

Monday 6th July 2015, 14.00 BST
30 Open UK Tickets this month. Looking at them all!

NGI
114233 (10/6)
The UK not publishing core counts at all sites. Some progress, but at last check John G couldn't see a change for Oxford or Glasgow. In progress (30/6)

RALPP
114442 (18/6)
Gridpp Pilot role ticket. Accounts need to be created, but no word for a few weeks. In progress (19/6)

114764 (1/7)
Ticket tracking (false) availability issues, created to appease COD - the problem was caused by a broken CA rpm release for Arc CEs. Kashif has created a counter-ticket 114742. Gordon's sage advice is to submit a recalculation request once the issue is fixed. Assigned (1/7)

BRISTOL
114485 (19/6)
Bristol's gridpp pilot role ticket. No news, could do with an update really. In progress (22/6)

114426 (18/6)
CMS AAA reading test problems. The Bristol admins have transferred data to their new shiny SE and have asked CMS to test again. No word since. Waiting for reply (30/6)

EDINBURGH
95303 (1/7/13...)
Tarball glexec ticket, now 2 years old. After a really promising burst the last 6 weeks haven't seen any progress, due to a lot of other "normal" tarball work taking up the time. Sorry! On hold (18/5)

DURHAM
114536 (22/6)
Durham's gridpp pilot role ticket. Not acknowledged yet, is Oliver back yet? Assigned (22/6)

114765 (1/7)
See RALPP ticket 114764. Assigned (1/7)

114727 (30/6)
Catalin ticketed that a number of SW_DIR variables at Durham are still pointing to the old school .gridpp.ac.uk cvmfs space. Assigned (30/6)

114381 (16/6)
John G ticketed Durham over a small percentage of jobs being published as "zero core". Looks like a SLURM timeout problem, although a fix isn't obvious. Put on the back burner whilst Oliver is on holiday. On Hold (19/6)

SHEFFIELD
114649 (26/6)
A ticket from a Sno+ user about not being able to access software using the Sheffield CEs. Acknowledged but no news. In progress (26/6)

114460 (18/6)
Sheffield's gridpp pilot role ticket. Did you get round to rolling them out? In progress (19/6)

MANCHESTER
114444 (18/6)
LHCB ticket concerning the DPM's SRM not returning checksum information. On hold whilst a related ticket is being looked at (https://ggus.eu/index.php?mode=ticket_info&ticket_id=111403). On Hold (22/6)

LIVERPOOL
111403 (10/6)
Another Sno+ ticket, about grid production jobs failing at Liverpool. AIUI caused by Sno+ running out of space on the shared pool. At last check Steve posted the usage information for Sno+ but no word since (and Steve's off on his hols). In progress (17/6)

LANCASTER
114845 (6/7)
LHCB pilots failing at Lancaster. Looks like a simple node misconfiguration, hopefully fixed, waiting to see if it is. On hold (6/7)

95299 (1/7/2013)
glexec ticket - see Edinburgh description. On hold (15/5)

100566 (27/1)
Bad bandwidth performance at Lancaster. Hoping that IPv6 will shake things up a bit so pushing that. On hold (18/5)

UCL
114746 (30/6)
SRM-put failures ROD ticket. No news at all. Assigned (30/6)

114851 (6/7)
Low availability ROD ticket, related to above. Assigned (6/7)

RHUL
114441 (18/6)
Another GridPP pilot role ticket. Pilots rolled out, but something isn't quite right and they're not working - Govind is looking again. In progress (6/7)

QMUL
114573 (23/6)
LHCB ticket about two out of three QM CEs not responding for them. Dan spotted that the broken CEs were dual-stacked and the working one wasn't. The ticket seems to have trailed off into some confusion over who needs to do some testing where. I agree with Dan that the "who" needs to be someone with LHCB credentials! The waters still seem muddied. In progress (1/7)

IC
114737 (30/6)
The IC voms wasn't updating properly, due to what I infer from the ticket as "SSL/mysql madness". Simon and Robert have been heroically battling this one - it's a good read. On hold (3/7)

114379 (16/6)
Sam's ticket about SE support in Dirac. Sam will shortly try testing things out on the new Dirac to see how it fares. In progress (6/7)

BRUNEL
114447 (18/6)
Brunel's gridpp pilot ticket. Being worked on, with one CE with the pilots enabled. In progress (26/6)

114006 (31/5)
A ticket from APEL, about Brunel under-reporting the number of jobs they are doing. Turned out to be a problem with Arc, which Raul upgraded to the fixed 5.0 version. The APEL team deleted the sync records, but no word since. In progress (30/6)

114850 (6/7)
Another APEL ticket, likely the fallout of the previous one - it looks like GAP publishing has been left on for the Brunel CREAM CEs. Assigned (6/7)

TIER 1
114786 (2/7)
Low availability ticket - see RALPP ticket 114442 - probably could do with being put On hold. In progress (2/7)

113910 (26/5)
Sno+ data staging problems. Brian gave some advice on how the large VOs do data staging from tape, and has asked if Sno+ still has problems. Matt M might still be on leave though. Waiting for reply (23/6)

108944 (1/10/14)
CMS AAA problems, which eventually brought to light a problem with super-hot datasets, which was alleviated (I think). Despite an update to castor that improved performance, the last batch of tests didn't show improved results. No news since. In progress (17/6)

113836 (20/5)
Glue mismatch problems at RAL. Working on getting "many-Arcs" to correctly publish. In progress (24/6)

Tools - MyEGI Nagios

Tuesday 09 June 2015

  • ARC CEs were failing the Nagios test because the EGI repository was unavailable (the Nagios test compares the CA version against the EGI repo). It started on 5th June: one of the IP addresses behind the webserver was not responding. The problem went away after approximately 3 hours. The same problem started again on 6th June and was finally fixed on 8th June. No reason was given in any of the tickets opened regarding this outage.

Tuesday 17th February

  • Another period where message brokers were temporarily unavailable seen yesterday. Any news on the last follow-up?

Tuesday 27th January

  • Unscheduled outage of the EGI message broker (GRNET) caused a short-lived disruption to GridPP site monitoring (jobs failed) last Thursday 22nd January. Suspect BDII caching meant no immediate failover to stomp://mq.cro-ngi.hr:6163/ from stomp://mq.afroditi.hellasgrid.gr:6163/
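
The failover the entry above was hoping for can be sketched roughly as follows. This is only an illustration, assuming a plain TCP reachability check is a good enough stand-in for "broker is up"; it does not speak STOMP and is not how the monitoring framework actually selects a broker. The helper names `tcp_probe` and `first_reachable` are made up for this sketch; the two broker URLs are the ones quoted above.

```python
import socket
from urllib.parse import urlsplit

# Broker URLs quoted in the bulletin entry above, in the preferred order.
BROKERS = [
    "stomp://mq.afroditi.hellasgrid.gr:6163/",
    "stomp://mq.cro-ngi.hr:6163/",
]


def tcp_probe(host, port, timeout=5.0):
    """Return True if a TCP connection to host:port opens within the timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False


def first_reachable(brokers, probe=tcp_probe):
    """Return the first broker URL whose endpoint answers the probe, else None."""
    for url in brokers:
        parts = urlsplit(url)
        if probe(parts.hostname, parts.port):
            return url
    return None
```

Calling `first_reachable(BROKERS)` tries the brokers in order. The `probe` argument is there so the selection logic can be exercised without touching the network.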


VOs - GridPP VOMS VO IDs Approved VO table

Tuesday 19th May

  • There is a current priority for enabling/supporting our joining communities.

Tuesday 5th May

  • We have a number of VOs to be removed. Dedicated follow-up meeting proposed.

Tuesday 28th April

  • For SNOPLUS.SNOLAB.CA, the port numbers for voms02.gridpp.ac.uk and voms03.gridpp.ac.uk have both been updated from 15003 to 15503.
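
A site could check its own configuration against the port change above with something like the sketch below. It assumes the usual five-field quoted vomses line layout (alias, host, port, server DN, VO name). The function name `stale_entries` is made up here, and the DN in the usage note is a placeholder, not the real server DN.

```python
import shlex

# Port stated in the bulletin for both GridPP VOMS servers listed above.
EXPECTED_PORT = 15503
VO = "snoplus.snolab.ca"
HOSTS = {"voms02.gridpp.ac.uk", "voms03.gridpp.ac.uk"}


def stale_entries(vomses_text):
    """Return (host, port) pairs for VO entries on HOSTS not using EXPECTED_PORT."""
    stale = []
    for line in vomses_text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and comments
        fields = shlex.split(line)  # handles the double-quoted fields
        if len(fields) < 5:
            continue  # not a complete vomses entry
        _alias, host, port, _dn, vo = fields[:5]
        if not port.isdigit():
            continue  # malformed port field, ignore in this sketch
        if vo.lower() == VO and host in HOSTS and int(port) != EXPECTED_PORT:
            stale.append((host, int(port)))
    return stale
```

For example, a line such as `"snoplus.snolab.ca" "voms02.gridpp.ac.uk" "15003" "/PLACEHOLDER/DN" "snoplus.snolab.ca"` would be reported as stale, while the same line with port 15503 would not.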

Tuesday 31st March

  • LIGO are in need of additional support for debugging some tests.
  • LSST now enabled on 3 sites. No 'own' CVMFS yet.
Site Updates

Tuesday 24th February

  • Next review of status today.

Tuesday 27th January

  • Squids not in GOCDB for: UCL; ECDF; Birmingham; Durham; RHUL; IC; Sussex; Lancaster
  • Squids in GOCDB for: EFDA-JET; Manchester; Liverpool; Cambridge; Sheffield; Bristol; Brunel; QMUL; T1; Oxford; Glasgow; RALPPD.

Tuesday 2nd December

  • Multicore status. Queues available (63%)
    • YES: RAL T1; Brunel; Imperial; QMUL; Lancaster; Liverpool; Manchester; Glasgow; Cambridge; Oxford; RALPP; Sussex (12)
    • NO: RHUL (testing); UCL; Sheffield (testing); Durham; ECDF (testing); Birmingham; Bristol (7)
  • According to our table for cloud/VMs (26%)
    • YES: RAL T1; Brunel; Imperial; Manchester; Oxford (5)
    • NO: QMUL; RHUL; UCL; Lancaster; Liverpool; Sheffield; Durham; ECDF; Glasgow; Birmingham; Bristol; Cambridge; RALPP; Sussex (14)
  • GridPP DIRAC jobs successful (58%)
    • YES: Bristol; Glasgow; Lancaster; Liverpool; Manchester; Oxford; Sheffield; Brunel; IC; QMUL; RHUL (11)
    • NO: Cambridge; Durham; RALPP; RAL T1 (4) + ECDF; Sussex; UCL; Birmingham (4)
  • IPv6 status
    • Allocation - 42%
    • YES: RAL T1; Brunel; IC; QMUL; Manchester; Sheffield; Cambridge; Oxford (8)
    • NO: RHUL; UCL; Lancaster; Liverpool; Durham; ECDF; Glasgow; Birmingham; Bristol; RALPP; Sussex
  • Dual stack nodes - 21%
    • YES: Brunel; IC; QMUL; Oxford (4)
    • NO: RHUL; UCL; Lancaster; Glasgow; Liverpool; Manchester; Sheffield; Durham; ECDF; Birmingham; Bristol; Cambridge; RALPP; Sussex, RAL T1 (15)
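
The percentages quoted above appear to be the YES count over the 19 sites surveyed, rounded to a whole percent. A quick check, using only the counts from the lists above (the dictionary keys are just labels chosen for this sketch):

```python
# Site counts copied from the lists above; each YES/NO pair adds up to 19 sites.
TOTAL_SITES = 19

yes_counts = {
    "multicore queues": 12,
    "cloud/VMs": 5,
    "GridPP DIRAC jobs": 11,
    "IPv6 allocation": 8,
    "dual stack nodes": 4,
}


def percentage(yes, total=TOTAL_SITES):
    """Return the share of sites answering YES, rounded to a whole percent."""
    return round(100 * yes / total)


summary = {name: percentage(count) for name, count in yes_counts.items()}

for name, pct in summary.items():
    print(f"{name}: {pct}%")
```

This reproduces the 63%, 26%, 58%, 42% and 21% figures given in the status list.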


Tuesday 21st October

  • High loads seen in xroot by several sites: Liverpool and RALT1... and also Bristol (see Luke's TB-S email on 16/10 for questions about changes to help).

Tuesday 9th September

  • Intel announced the new generation of Xeon based on Haswell.



Meeting Summaries
Project Management Board - Members Minutes Quarterly Reports

Empty

GridPP ops meeting - Agendas Actions Core Tasks

Empty


RAL Tier-1 Experiment Liaison Meeting (Wednesday 13:30) Agenda Meeting takes place on Vidyo.

Wednesday 1st July 2015 Operations report

  • Successful UPS/Generator load test this morning.
  • An updated version of the xroot plugin for Castor has allowed 3rd party transfers for Alice.
WLCG Grid Deployment Board - Agendas MB agendas

Empty



NGI UK - Homepage CA

Empty

Events
UK ATLAS - Shifter view News & Links

Atlas S&C week 2-6 Feb 2015

Production

• Prodsys-2 in production since Dec 1st

• Deployment has not been transparent, but many issues have been solved and the grid is filled again

• MC15 is expected to start soon, waiting for physics validations; evgen testing is underway and close to finalised. Simulation expected to be broadly similar to MC14, no blockers expected.

Rucio

• Rucio in production since Dec 1st and is ready for LHC RUN-2. Some fields need improvements, including transfer and deletion agents, documentation and monitoring.

Rucio dumps available.

Dark data cleaning

files declaration. Only DDM ops can issue a lost files declaration for now; cloud support needs to file a ticket.

• Webdav panda functional tests with Hammercloud are ongoing

Monitoring

Main page

DDM Accounting

space

Deletion

ASAP

• ASAP (ATLAS Site Availability Performance) in place. Every 3 months the T2 sites performing BELOW 80% are reported to the International Computing Board.


UK CMS

Empty

UK LHCb

Empty

UK OTHER
  • N/A
To note

  • N/A