Operations Bulletin Latest

From GridPP Wiki

Bulletin archive


Week commencing 10th November 2014
Task Areas
General updates

Tuesday 11th November

  • Do we need the "Monitoring=N and production=Y" combination of flags in GOC DB?
  • EGI circulated a request to pull together interest in SME engagement. We ought to record any involvement and interest.
  • RAL PPD experience of perfSONAR 3.4 installation and MySQL issues.

Monday 3rd November

  • The OPS VOMS server setting for gridppnagios.physics.ox.ac.uk was updated on 29th October. It now uses the new VOMS servers (lcg-voms2.cern.ch and voms2.cern.ch). Watch out for failing Nagios tests!
  • The draft October WLCG T2 A/R figures have been released. ALICE. ATLAS. CMS. LHCb.
  • EGI operations will be trying a new approach with broadcasts - perhaps one summary broadcast each month.
  • Are there any outstanding VOMS (e.g. dteam or gridpp) or GOCDB role requests?
  • The agendas for November's GDB and the pre-GDB on volunteer computing are available.
WLCG Operations Coordination - Agendas

Tuesday 28th October

Tuesday 21st October

  • There was a WLCG ops coordination meeting last Thursday 16th October. Minutes are available. The brief summary:
  • News: A survey to investigate operational efficiency areas is being prepared - it covers managing changes, communication channels, monitoring, grid service administration. Expect it to be ready for public input soon.
  • MW news: dCache 2.6.35 and 2.10.7 have been released - update is a priority. perfSONAR 3.4 has many fixes - SSLv3.0 vulnerability TBC - but baseline at 3.3.4 until docs ready. CESNET will co-maintain the classads package needed by CREAM, WMS, L&B, UI, WN (it was removed from EPEL).
  • T0&T1 services: FTS and dCache upgrades.
  • Oracle: DB migrations timetable is work in progress.
  • T0: VOMS-ADMIN test cluster available again. Here is a link. lxplus5 now off. Job efficiency differences remain unexplained. High Availability FTS3 being discussed. Quattor End of Service (EoS) now end November.
  • T1: NTR
  • T2: NTR
  • ALICE: Improved job eff at CERN. Operational instability noted on 15th Oct.
  • ATLAS: Full Chain test considered successful. Reco campaign finished. Had 2 FTS3 problems.
  • CMS: PromptRECO will need Tier-1 sites in addition to CERN. Still need SLS and AFS-UI after Oct. Encourage dCache update.
  • LHCb: Proposes to re-visit the list of critical services before Run2. Setting up a prototype installation for http access and access federation where 70 % of the storage endpoints are accessible.
  • gLExec: NTR
  • Machine/Job features: Concluded on a single architecture for cloud and batch implementations.
  • MW readiness: Last met 1st Oct. Verification process now well documented. More volunteer sites needed. DB designed to hold verification results. HTCondor testing request from ATLAS. MW Package Reporter designs discussed. Next meeting 19th November.
  • Multicore: CMS started testing.
  • SHA-2: Ops and ALICE VOMS opened last Monday. LHCb TBC. CMS reminded. ATLAS tests progressing.
  • IPv6: All T1s should join HEPiX IPv6 WG. Dual-stack SAM to be put in place. Deploy PS in dualstack in 2015. OSG infrastructure ready.
  • Squid and HTTP proxy: Squid registration campaign pushed to November.
  • Network & transfer metrics WG: Request for sites to await 3.4 instructions. Internal security audit performed. Recommend all sites running 3.3 to temporarily disable SSL3 - (as at 16th Oct) patches coming.
  • Other: Vidyo ticket opened.

Tuesday 14th October


Tier-1 - Status Page

Tuesday 11th November

  • Reboot of our Tier1 Router (that connects us to the rest of the RAL site) took place successfully last Wednesday morning (5th Nov).
  • Nothing else to report.
Storage & Data Management - Agendas/Minutes

Wedn 29 Oct

Wedn 22 Oct

  • Martin Bly (T1) - HEPiX report

Wedn 15 Oct

Wedn 01 Oct

  • Summary of all the exciting events in Amsterdam last week - EUDAT, EGI big data, RDA
  • DPM 1.8.9 early testing, and (separately) xroot4 early-ish testing. Supporting multiple VOs in one xroot server.


Accounting - UK Grid Metrics HEPSPEC06 Atlas Dashboard HS06

Tuesday 4th November

  • Sussex and Sheffield publishing issues still apparent.

Monday 27th October

  • Sites considering a move to HTCondor should be aware that prototype APEL parsers for HTCondor are in use, so if you continue using CREAM as your CE you can continue to use APEL accounting. The previous Condor parser for APEL was retired in EMI3 as there was no demand.


Documentation - KeyDocs

See the worst KeyDocs list for documents needing review now and the names of the responsible people.

Tuesday 4th Nov

  • New section in Wiki called "Project Management Pages". The idea is to cluster all Self-Edited Site Tracking Tables in here. Sites should keep entries in Current Activities up to date. Once a Self-Edited Site Tracking Table has served its purpose, the PM will move it to the Historical Archive or otherwise dispose of the table.


Tuesday 28th October

Tuesday 7th October

  • Keydocs were reviewed at the core-ops meeting last week. The situation with updates is improving.
  • Main GridPP website expected to use Wordpress with a plug-in to cover the gridsite aspects.


Interoperation - EGI ops agendas

Tuesday 21st October

    • URT:
    • dCache server v. 2.6.35 verified by WLCG as baseline
    • DPM 1.8.9 in EPEL-testing
    • SR: If sites have been using/testing EMI-WN 3.1.0 please get in touch to help with verification. They seem keen for people to test this.
    • New VOMS servers rollout: NGI SAMs being notified for reconfiguration as of yesterday.
    • MySQL 5.0 EOL campaign: note progress in agenda.


Monday 6th October

There was a meeting today - link: https://wiki.egi.eu/wiki/Agenda-06-10-2014

  • EMI-WN 3.1.0 in SR: if anyone is running this in production please get in touch to help get this past rollout
  • MySQL 5.0 noted to be under Oracle Lifetime Sustaining Support (for some time now).
    • See agenda for guidance on middleware consequences
  • classads "retired" from EPEL repos
  • SL/SLC/CentOS 5 Support Lifetime
    • This was highlighted, though not suggested to be urgent
Monitoring - Links MyWLCG

Tuesday 4/11

  • Next meeting 14th November: Status of SAM3 rollout (proposed topic)
On-duty - Dashboard ROD rota

Tuesday 28th October

  • AM reports a quiet shift. Dashboard not catching up earlier in the week but ok later on.

Tuesday 21st October

  • Possible issue with slow updates to GGUS tickets as viewed via ROD Dashboard (or is it a RAL cache issue?)
  • EFDA - Recurring alarms owing to site availability. Just waiting for the bad period to age out of the 30-day window.
  • SUSSEX - Ongoing problem. Matt has a ticket open with middleware developers.
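
The EFDA item above relies on a bad period ageing out of a 30-day window. As a rough illustration of why the alarm can persist long after the fault is fixed, here is a minimal sketch. It assumes the figure is a plain mean of daily availability values over the last 30 days; the function names and the 0.75 threshold are hypothetical and are not taken from the ROD dashboard.

```python
def rolling_availability(daily_up, window=30):
    """Mean of the most recent `window` daily availability values (0.0 to 1.0)."""
    recent = daily_up[-window:]
    return sum(recent) / len(recent)


def days_until_clear(daily_up, threshold, window=30):
    """Count the fully available days needed before the rolling mean
    reaches `threshold`. Assumes every future day is 100% available."""
    if not 0.0 <= threshold <= 1.0:
        raise ValueError("threshold must be between 0 and 1")
    history = list(daily_up)
    days = 0
    while rolling_availability(history, window) < threshold:
        history.append(1.0)  # one more good day
        days += 1
    return days


# Hypothetical history: 20 good days followed by a 10-day outage.
history = [1.0] * 20 + [0.0] * 10
print(rolling_availability(history))    # about 0.67
print(days_until_clear(history, 0.75))  # good days needed to recover
```

In this made-up case the first 20 good days only replace older good days, so the mean does not start to improve until the outage itself begins to leave the window.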


Rollout Status WLCG Baseline

Tuesday 26th August

Monday 28th July


References


Security - Incident Procedure Policies Rota

Tuesday 4th November

Tuesday 28th October

  • Note EGI-ADV-2014-10-28.

Tuesday 21st October

  • The IGTF has an update which introduced rather unexpected changes in the trust anchors used by Comodo for the TCS. There is now a set of SHA-2 intermediate CAs in addition to the old ones.


Services - PerfSonar dashboard | GridPP VOMS

- This includes notifying of (inter)national services that will have an outage in the coming weeks or will be impacted by work elsewhere. (Cross-check the Tier-1 update).

Tuesday 4th November

  • perfSONAR 3.4+ install/update instructions are ready. More details will be included in the WLCG broadcast to all sites planned for later today.

Tuesday 28th October

  • Have the perfSONAR 3.4 instructions/documentation been updated yet? Last week volunteers were sought at the...
  • perfSONAR operations meeting took place on 22nd October.
    • There is a recommendation for sites supporting IPv6 to deploy perfSONAR dual-stack.
    • Concerned about Tier-3s requesting to join Tier-2 meshes.
    • A network transfer metrics wiki page is available.

Tuesday 21st October

Monday 13th October

  • perfSONAR 3.4 has been released. Clear documentation on what to do (clean reinstall) coming this week together with information on mesh updates. See the GDB presentation slides 13 and 14.
  • RIPE have sent a reminder to connect probes that have been handed out (some weeks ago now). Please could the following sites check their status: Lancaster; Brunel; Sussex; and ECDF. 20599 at RAL has never properly connected (DHCP issue?).
Tickets

Monday 10th November 2014, 15.00 GMT;

21 Open UK Tickets this week.

Tier 1
109712(29/10)
CMS seeing glexec errors at the Tier 1, likely due to a lack of "wildcard mapping" at RAL. Andrew L was investigating, but no news on the ticket since. In progress (29/10)

100IT
108356(10/9)
The setting up vmcatcher ticket at 100IT. It looks like this ticket is done, or at least getting there. I've prodded the ticket. In progress (29/10)

Sheffield
109906(5/11)
Some publishing problems have caused Sheffield to get a low availability ticket. Things are fixed but, as has been pointed out before, this alarm requires time to soothe it. My advice is to put the ticket On Hold whilst waiting for it to clear on its own. Waiting for reply (7/11) Also how goes the Sno+ ticket 109207?

Durham and Lancaster
108273
108715
Both these sites have perfSONAR tickets on hold after the shellshock scare. Here's a gentle reminder that the new perfSONAR and accompanying instructions are available. (I hoped to not have to mention Lancaster by sneaking in a reinstall this morning, but had trouble getting my iDRAC interfaces to work).

The Ticket with no home
107880(26/8)
Those funny SNO+ SUSE users and their problems with srmcp. The Tier 1 has cast this ticket out onto the streets, and assigned it to QMUL. Who don't really want it (or deserve it!). As mentioned in his last update, Chris has been having a chat with Matt M and it looks like srmcp is working...kinda (if you give it the correct port numbers, and somehow magically know these for each SE). Chris mentions that this could be viewed as a bug in srmcp, or solved with a wrapper script that he doesn't have time to write. My suggestion is to give the SUSE users the necessary ldap query to pull the information they need and let them sort out the rest! Assigned (7/11)
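
For what the suggested ldap query might look like, here is a hedged sketch rather than a tested recipe. It assumes the SEs publish GLUE 1.3 service records (GlueServiceType, GlueServiceAccessControlBaseRule, GlueServiceEndpoint) in a top-level BDII on port 2170 under base o=grid. The BDII host name and the helper names are placeholders, and the exact GlueServiceType value may differ between sites.

```python
from urllib.parse import urlsplit


def srm_query_command(vo, bdii_host="topbdii.example.org"):
    """Build an ldapsearch command listing SRM endpoints that advertise `vo`.
    The host is a placeholder; the filter assumes GLUE 1.3 attributes."""
    ldap_filter = (
        "(&(GlueServiceType=SRM)"
        f"(GlueServiceAccessControlBaseRule=VO:{vo}))"
    )
    return [
        "ldapsearch", "-x", "-LLL",
        "-H", f"ldap://{bdii_host}:2170",
        "-b", "o=grid",
        ldap_filter, "GlueServiceEndpoint",
    ]


def srm_ports(ldif_text):
    """Map SE host name to SRM port from the GlueServiceEndpoint lines."""
    ports = {}
    for line in ldif_text.splitlines():
        if line.startswith("GlueServiceEndpoint:"):
            url = urlsplit(line.split(":", 1)[1].strip())
            if url.hostname and url.port:
                ports[url.hostname] = url.port
    return ports


# Made-up LDIF output, for illustration only.
sample = """dn: GlueServiceUniqueID=httpg://se1.example.org:8446/srm/managerv2,o=grid
GlueServiceEndpoint: httpg://se1.example.org:8446/srm/managerv2

dn: GlueServiceUniqueID=httpg://se2.example.org:8443/srm/managerv2,o=grid
GlueServiceEndpoint: httpg://se2.example.org:8443/srm/managerv2
"""
print(" ".join(srm_query_command("snoplus.snolab.ca")))
print(srm_ports(sample))
```

The command is only built here, not run, so the parsing can be checked without access to a BDII.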


Tools - MyEGI Nagios

Tuesday 21st October

VOMS servers for OPS based at CERN were down on Saturday 18th October for around 12 hours. Nagios tests started failing after the existing proxy expired. Availability figures will be slightly affected, but the outage will be considered as unknown.

Blog about VO Nagios

http://southgrid.blogspot.co.uk/2014/10/nagios-monitoring-for-non-lhc-vos.html


Tuesday 16th Sep

  • Multi VO nagios maintained at Oxford has been upgraded to add ARC CE tests.
  • https://vo-nagios.physics.ox.ac.uk/nagios/
  • It is currently monitoring gridpp, pheno, t2k.org, snoplus.snolab.ca, vo.southgrid.ac.uk
  • Should we start monitoring it more actively and open tickets for sites failing tests?


VOs - GridPP VOMS VO IDs Approved VO table

Monday 3rd November 2014

  • Please update cvmfs-keys and VO_<VONAME>_SW_DIR
  • Working with SNO+ and Mark Slater on ganga job submission rate
  • Northgrid support on RAL WMS being checked
  • Gfal-copy and castor issue


Thursday 23 October 2014

  • CVMFS keys - new cvmfs-keys package cvmfs-keys-1.5
    • Part of decoupling of CVMFS from CERN - support for keys from various repositories
    • <voname>.gridpp.ac.uk -> <voname>.egi.eu
    • Please update and change VO_<VONAME>_SW_DIR to point to new directory
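
As a small illustration of the rename above, the sketch below derives the new repository name and a candidate value for VO_<VONAME>_SW_DIR. It assumes repositories are mounted under /cvmfs and that the variable name is the VO name upper-cased with non-alphanumeric characters replaced by underscores. Both are assumptions, and "examplevo" is a made-up repository name, so check against your own site configuration.

```python
import re

OLD_SUFFIX = ".gridpp.ac.uk"
NEW_SUFFIX = ".egi.eu"


def new_repository(old_repo):
    """Map <voname>.gridpp.ac.uk to <voname>.egi.eu."""
    if not old_repo.endswith(OLD_SUFFIX):
        raise ValueError(f"not a {OLD_SUFFIX} repository: {old_repo}")
    return old_repo[: -len(OLD_SUFFIX)] + NEW_SUFFIX


def sw_dir_variable(vo_name):
    """Assumed naming rule: upper-case, with '.' and '-' turned into '_'."""
    return "VO_" + re.sub(r"[^A-Za-z0-9]", "_", vo_name).upper() + "_SW_DIR"


def sw_dir_value(old_repo):
    """Assumed mount point for the renamed repository."""
    return "/cvmfs/" + new_repository(old_repo)


# Hypothetical VO and repository names, for illustration only.
print(sw_dir_variable("examplevo.example.org") + "=" + sw_dir_value("examplevo.gridpp.ac.uk"))
```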

Monday 11th August

  • Steve J sent an email to hyperk on 7th regarding "software directory for Hyperk (CVMFS)" and entries in the VO ID card.

Monday 14th July 2014

  • HyperK.org will initially use remote storage (iRODS at QMUL) - so CPU resources would be appreciated.

Monday 30 June 2014

  • HyperK.org request for support from other sites
    • 2TB storage requested.
    • CVMFS required
  • Cernatschool.org
    • WebDAV access to storage - world read works at QMUL.
    • ideally will configure federated access with DFC as LFC allows.


Site Updates

Tuesday 21st October

  • High loads seen in xroot by several sites: Liverpool and RALT1... and also Bristol (see Luke's TB-S email on 16/10 for questions about changes to help).

Tuesday 9th September

  • Intel announced the new generation of Xeon based on Haswell.

Tuesday 20th May

  • Various sites but notably Oxford have ARGUS problems. 100s of requests seen per minute. Performance issues have been noted after initial installation at RAL, QMUL and others.


Meeting Summaries
Project Management Board - Members Minutes Quarterly Reports

Empty

GridPP ops meeting - Agendas Actions Core Tasks

Empty


RAL Tier-1 Experiment Liaison Meeting (Wednesday 13:30) Agenda Meeting takes place on Vidyo.

Wednesday 5th November 2014

WLCG Grid Deployment Board - Agendas MB agendas

Empty



NGI UK - Homepage CA

Empty

Events
UK ATLAS - Shifter view News & Links

Empty

UK CMS

Empty

UK LHCb

Empty

UK OTHER
  • N/A
To note

  • N/A