Operations Bulletin 040814

From GridPP Wiki

Bulletin archive


Week commencing 28th July 2014
Task Areas
General updates

Tuesday 29th July


Tuesday 22nd July

  • The RAL FTS2 is due to be switched off on 2nd September.
  • HyperK enablement request still stands.
  • The biweekly WLCG status from yesterday's ops meeting is available here.
  • There was a general reminder last week about putting too much detail into messages that go to our public email lists. Please remember the Traffic Light Protocol!
  • There will be an IPv6 session at GridPP33. JANET will participate. Pete C and G are taking ideas for talks.
  • There was a GridPP technical meeting on Friday.
  • The final WLCG T2 availability and reliability figures for June 2014 are now available.
  • The first meeting of the HEP Software Foundation took place last Wednesday. The first step is to prepare a “call for volunteers” who can devote time in the coming months to lead the work that has to be done.


WLCG Operations Coordination - Agendas

Tuesday 29th July

  • There was a WLCG coordination meeting last Thursday. Minutes are available.
  • News: MB agreed gstat no longer supported. Notes from WLCG workshop available.
  • Baseline: Changes FTS3 v3.2.26; DPM 1.8.8; perfSONAR 3.3.2-17; Storm 1.11.4.
  • Baseline: Issues: “Missing Key Usage Extension in delegated proxy” affecting CREAM; dCache upgrade; EOS (Brazilian proxies). Tickets for CVMFS upgrade to 2.1.19.
  • Tier-2 – feedback from GridPP given. CVMFS monitoring, gstat and ARGUS-SE status.
  • ALICE: Very high to low throughput due to input tasks. ALICE site admins to note VOMS changes
  • ATLAS: issues related to the commissioning/integration of the new ATLAS frameworks. Using MCORE and want more. Under discussion within ATLAS: a new way of measuring site availability, through availability for analysis.
  • CMS: Busy. CAS2014 soon; target 40k concurrent jobs. Want to make xrootd-fallback test critical. Removal of individual release tags from CEs. 5 site reminders.
  • LHCb: Access to lcg-voms2 and voms2 from outside CERN is timing out. Recurring problem after upgrade of dCache sites. CVMFS clients not all upgraded. Question: what happens to a VM if it fails and needs reinstantiation? Lose the cores? No.
  • Tracking tools: Recommendation to close TF.
  • FTS3: Only a few CMS sites running over FTS2. Can now monitor FTS3 transfers against user DN (needed by CRAB3)
  • Glexec: no change. Some running unreleased rpms due to ARGUS status. Someone needs to pledge effort to support ARGUS. SWITCH unable to pledge effort.
  • MW readiness: DPM is the pilot WLCG MW package used for the Readiness Verification effort. Progress is traced in a dedicated JIRA tracker. Next meeting 1st October.
  • Multicore: ATLAS steady flow. With CMS concurrent at 5 sites. How to track.
  • SHA-2: Intro new VOMS servers. Testing infrastructure with preprod SAM. Some good; many failures due to simple configuration issues of CREAM and/or Argus (Action to check). Aim for 15th September. Will open tickets. RFC no progress.
  • WMS decommissioning: Work ongoing to improve SAM Condor probe jobs lifecycle. End of September or October.
  • IPv6: All but ALICE now testing (migrating AliEN to Perl 5.18).
  • HTTP proxy dis: NTR
  • Network and transfer metrics WG: proposed dates for kickoff are in the last week of August and second week in September.



Tuesday 21st July

  • The next coordination meeting takes place this Thursday at 14:30 UK time. There is now a standing item for sites to raise issues of concern. Is there anything we would like to mention this week? We are invited to update the twiki up until 1 hour before the meeting.


Tier-1 - Status Page

Tuesday 29th July

  • All Castor instances have been upgraded to version 2.1.14. The upgrade is complete, including turning off compatibility mode on the nameserver component, which was done at the end of last week.
  • We have announced that we will shut down both the FTS2 service and the software server used by the small VOs on 2nd September.
  • There was a site outage owing to a network problem for around 45 minutes last Tuesday at the time of this DTEAM/OPS meeting.
Storage & Data Management - Agendas/Minutes

Wednesday 30 July 2014

  • Very optimistically hoping to get some Real Work™ done during supposedly quiet time optimistically expected in August.
  • Suggestion to ask experiments to present their view of storage and data management at GridPP33: after all, it's their opinion which is most important.
  • Collating feedback on the storage meetings.

Wednesday 23 July 2014

  • We really should try to document our VO policy: the stuff in people's heads, experiences with "small" VOs (that tend to grow bigger), best practices. Also, much of the wiki needs reviewing. "Boring" old documentation - hey ho!

Tuesday 22nd July

  • Alert today advising to update DPM, mainly to get the new (bug fixed)

Wednesday 2 July

  • Guidance and policies for "small" VOs: how to get them started with stuff, without preventing them later growing bigger.

Tuesday 1st July


Accounting - UK Grid Metrics HEPSPEC06 Atlas Dashboard HS06

Tuesday 29th July

  • Accounting gap in last week for UCL and Durham.

Tuesday 22nd July

  • EGI accounting portal does not show any significant outages in publishing in recent days.


Documentation - KeyDocs

See the worst KeyDocs list for documents needing review now and the names of the responsible people.

Tuesday 22nd July

  • Starting on revisions this week.
  • Is the alert system now working?

Monday 16th June

  • A review is starting of old and obsolete pages within the GridPP website - there are many! Please review sections that you have created and update them if necessary.

Tuesday 6th April

  • KeyDocs are going to be reviewed (in next 4 weeks) as the system is not working (or not adding anything) in some areas.


Interoperation - EGI ops agendas

Tuesday 14th July

  • Last meeting yesterday.
  • URT: see agenda for details
  • SR: In verification: gfal2 v. 2.5.5; active: globus-info-provider-service v. 0.2.1 cream v. 1.16.3; ready to be released: storm v. 1.11.4 lb v. 11.1 wms v. 3.6.5 dcache v. 2.6.28
  • DMSU report: CREAM CLI/GridSite SegFaults at Long-Lived Proxies solved
  • Migration of Central SAM services: Note that if services are being reinstalled, make sure that patches are applied.
  • EMI-2/APEL-2 - Looks like UCL is still publishing with APEL-2 publisher
  • Hoped that gr.net issues resolved on Monday. Summary of discussion to be in minutes.
  • Next meeting placeholder 28th July, but may not happen (OMD depending)
  • Please fill out this UMD customer satisfaction survey in the next couple of weeks if you have a moment: https://www.surveymonkey.com/s/MQ6G8BZ

Tuesday 1st July

  • Today's ops meeting cancelled - partly due to forthcoming 4th EGI annual review.
  • EMI-2 decommissioning: The situation is being followed by COD (GGUS 106354). "Please remember that we passed the decommissioning deadline and after today sites still deploying unsupported service end-points risk suspension, unless documented technical reasons prevent a Site Admin from updating these end-points" (source PROC16).
  • There is STILL use of UMD2/EMI2 APEL clients to send accounting data. As of today there are 20 sites (see latest list) still using UMD2/EMI2 APEL clients.



Monitoring - Links MyWLCG

Tuesday 22nd July

  • Chris noted yesterday that gstat reports most sites as critical. SB thought the underlying problem is that a value is supposed to be "Production" in GLUE 1 and "production" in GLUE 2; at some point they changed both to lower case.
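
The case mismatch SB describes can be illustrated with a short sketch. This is not gstat's actual code: the function names and the idea that the check is a plain string comparison are assumptions made purely for illustration. It shows why an exact-case comparison would flag every site once the published value changed case.

```python
def is_production_strict(state: str) -> bool:
    """Hypothetical exact-case check: only the capitalised spelling passes."""
    return state == "Production"


def is_production_tolerant(state: str) -> bool:
    """Hypothetical case-insensitive check: accepts either spelling."""
    return state.strip().lower() == "production"


# The two spellings mentioned in the note above.
published = ["Production", "production"]

strict = [is_production_strict(s) for s in published]
tolerant = [is_production_tolerant(s) for s in published]
print(strict, tolerant)
```

A tolerant comparison of this kind accepts both spellings, whereas the strict one rejects the lower-case value.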

Tuesday 15th July

On-duty - Dashboard ROD rota

Tuesday 29th July

  • Dealing with low availability alarms in dashboard - notepad message from ROD?

Tuesday 22nd July

  • Another quiet week. Bham availability alarm ticket created.
Rollout Status WLCG Baseline

Monday 28th July

Tuesday 18th March

Tuesday 11th February

  • 31st May has been set as the deadline for EMI-2 decommissioning. There may be an issue for dCache (related to 3rd party/enstore component).

References


Security - Incident Procedure Policies Rota

Tuesday 29th July

Tuesday 22nd July

Monday 14th July

  • EGI CSIRT ADVISORY [EGI-ADV-20140625]



Services - PerfSonar dashboard | GridPP VOMS

- This includes notifying of (inter)national services that will have an outage in the coming weeks or will be impacted by work elsewhere. (Cross-check the Tier-1 update).

Tuesday 22nd July

  • There was a problem with VOMS admin that was noticed last Thursday: VO page requests resulted in the WEB-INF directory of the jetty app being displayed. A server restart fixed the problem.

Tuesday 17th June

  • The GridPP VOMS server was updated on 11/06/2014 - no issues reported.


Tickets

Monday 28th July 2014, 14.15 BST
I'm on holiday again, but the UK tickets can be checked here:
http://tinyurl.com/p37ey64

There were 19 UK tickets at time of writing, none of which looked particularly troubling.

If for some reason you ever want to look up past ticket reviews they can be found at:
https://www.gridpp.ac.uk/wiki/Past_Ticket_Bulletins

Cheers!
Matt

Tuesday 29th July As Matt is away here is my review of the tickets - Jeremy

BRUNEL

Remove mapping to PIC FTS2. GGUS 107282. (Created: 01/07? Only assigned 28/07).

SHEFFIELD

org.bdii.GLUE2-Validate issue. GGUS 107217. (24/07: In progress 25/07). Site is publishing SW tags for CMS – unable to delete them GGUS 106820. Looks stuck but RHUL resolved similar issue by editing tags. GGUS 106819 (11/07: In progress 14/07).

UCL

Publishing with EMI-2 APEL. GGUS 106876. Currently reinstalling APEL node, after backing up database. (14/07: Last updated 24/07).

RAL Tier-1

Two alarms from srm-dteam. GGUS 106655. Cross-contamination of information due to the GEN-CASTOR SRMs sharing a database, and some VOs sharing service classes. In progress. (04/07: Last update 24/07)

NGI

Decommissioning FTS2 Service at RAL Tier1. GGUS 106615 It is a master ticket following the decommissioning process. (02/07: On hold 14/07)

RED TICKETS!

BRISTOL

6 day backlog transferring from FNAL to Bristol. GGUS 106554. Several things tried. Luke looking for some suggestions. (29/06: Last update 22/07)

LANCASTER

CVMFS problem. Pilots aborting. Various things tried. GGUS 106406 (23/06: on hold 28/07).

IMPERIAL

Biomed unable to list files with gsiftp. GGUS 106369 They want gsiftp rather than SRM for performance but it is not supported for this use in dCache (20/06: Last update 22/07).

IMPERIAL-CLOUD

Cloud site Service leading to 12% of all WLCG traffic to the service cvmfs-stratum-one.cern.ch GGUS 106347. Thought shoal may help. Site in maintenance. (19/06: on hold 14/07).

BRISTOL

CMS pilots losing network connections. GGUS 106325. Tier-1 sees something similar so waiting on GGUS 106324. (18/06: on hold 14/07)

RAL T1

CMS pilots losing network connections. GGUS 106324. Network settings suggestion made… no response yet from the site. Needs a response. (18/06: 23/07)

GGUS 105405. Check your vidyo router config. Was being followed up and a question remained about connections for clients (14/05: on hold 01/07).

UCL

Problem with perfsonar host. GGUS 101285. Reinstalled and as of 24/07 waiting to be added to WLCG mesh before closure of ticket. (16/02: 24/07)

LANCASTER

perfSONAR poor performance GGUS 100566. Host reinstalled but issue remains. Matt looking for ideas. (27/01:01/07). On hold since 23/06.

EFDA-JET

LHCb jobs failed. GGUS 97485. (21/09/13!: 12/05). On hold in May. Site needs help to resolve.

ECDF

Glexec deployment. GGUS 95303. There is no tarball. (01/07/13: on hold 29/11/13).

LANCASTER

Glexec deployment. GGUS 95299. There is no tarball. (01/07/13: on hold 07/07/13).

UCL

Glexec deployment. GGUS 95298. Was awaiting a site update… and possibly new staff member. (01/07/13: on hold 29/08/13).




Tools - MyEGI Nagios

Monday 14th July

Winnie reported on Saturday 12th July that most of the UK sites were failing Nagios tests. The problem started with an unscheduled power cut at a Greek site hosting an EGI message broker (mq.afroditi.hellasgrid.gr) around 2PM on 11th July. The message broker was put in downtime but top BDIIs continued to publish it for quite a long time. Stephen Burke mentioned in the TB support thread that the default caching time is now 4 days. When I checked on Monday morning only Manchester was still publishing mq.afroditi, and it went away after Alessandra manually restarted the top BDII. It seems that Imperial is configured with a much shorter cache time. Only Oxford and Imperial were almost unaffected, and the reason may be that Oxford WNs have the Imperial top BDII as the first option in BDII_LIST. Other NGIs have reported the same problem and this outage is likely to be considered when calculating availability/reliability. All Nagios tests have now come back to normal.

Emir reported this on the tools-admin mailing list: "We were planning to raise this issue at the next Operations meeting. In these extreme cases 24h cache rule in Top BDII has to be somehow circumvented."
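
To see why the cache lifetime matters here, the arithmetic can be sketched as follows. This is a simplified illustration, not how a top BDII is implemented: it assumes the cache clock starts at the moment the broker disappeared (around 2PM on 11th July, time zone not stated in the report), and the function name is hypothetical. The two lifetimes are the 4 days and 24h figures quoted above.

```python
from datetime import datetime, timedelta


def last_published(removed_at: datetime, cache_ttl: timedelta) -> datetime:
    """Latest time a cached entry could still be published (simplified model)."""
    return removed_at + cache_ttl


# Approximate start of the outage, taken from the report above.
outage_start = datetime(2014, 7, 11, 14, 0)

four_day = last_published(outage_start, timedelta(days=4))
one_day = last_published(outage_start, timedelta(hours=24))
print("4 day cache: entry may persist until", four_day)
print("24h cache:   entry may persist until", one_day)
```

Under these assumptions a 4 day cache could keep the stale entry visible through Monday 14th July, which is consistent with Manchester still publishing it on Monday morning, while a 24h cache would have dropped it over the weekend.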

Tuesday 1st July

  • There was a monitoring problem on 26th June. All ARC CEs were using storage-monit.physics.ox.ac.uk for replicating files as part of the Nagios testing. storage-monit was updated but not re-yaimed until later, so it was broken for the morning, leading to all ARC SRM tests failing.

Tuesday 24th June

  • An update from Janusz on DIRAC:
  • We had a stupid bug in Dirac which affected the gridpp VO and storage. Now it is fixed and I was able to successfully upload a test file to Liverpool and register the file with the DFC
  • The async FTS is still under study; there are some issues with this.
  • I have a link to software to sync user database from a VOMS server, haven’t looked into this in detail yet.


VOs - GridPP VOMS VO IDs Approved VO table

Monday 14th July 2014

  • HyperK.org will initially use remote storage (irods at QMUL) - so CPU resources would be appreciated.

Monday 30 June 2014

  • HyperK.org request for support from other sites
    • 2TB storage requested.
    • CVMFS required
  • Cernatschool.org
    • WebDAV access to storage - world read works at QMUL.
    • ideally will configure federated access with DFC as LFC allows.


Monday 16 June 2014

  • CVMFS
    • Snoplus almost ready to move to CVMFS - waiting on two sites. Will use symlinks in existing software
  • VOMS server: Snoplus has problems with some of the VOMS servers - see GGUS 106243 - may be related to update.


Tuesday 15th April

  • Is there interest in an FTS3 web front end? (more details)


Site Updates

Tuesday 20th May

  • Various sites but notably Oxford have ARGUS problems. 100s of requests seen per minute. Performance issues have been noted after initial installation at RAL, QMUL and others.


Meeting Summaries
Project Management Board - Members Minutes Quarterly Reports

Empty

GridPP ops meeting - Agendas Actions Core Tasks

Empty


RAL Tier-1 Experiment Liaison Meeting (Wednesday 13:30) Agenda Meeting takes place on Vidyo.

Wednesday 30th July 2014

  • Operations report
  • Business as usual....
  • The termination of the FTS2 service has been announced for the 2nd September.
  • The software server used by the smaller VOs will be turned off - also on 2nd September.
  • We are planning to turn off access to the CREAM CEs (possibly keeping them open for ALICE), although no date has yet been decided for this.
WLCG Grid Deployment Board - Agendas MB agendas

Empty



NGI UK - Homepage CA

Empty

Events
UK ATLAS - Shifter view News & Links

Empty

UK CMS

Empty

UK LHCb

Empty

UK OTHER
  • N/A
To note

  • N/A