Difference between revisions of "Operations Bulletin Latest"

From GridPP Wiki

Revision as of 09:52, 12 August 2014

Bulletin archive

Week commencing 11th August 2014
Task Areas
General updates

Tuesday 12th August

  • Bristol suggests it is seeing connection problems mostly, but not exclusively, to US sites.
  • The GOCDB test server has been updated to v5.3.
  • GOCDB has received a new service type request for ‘egi.Perun’. Under the lightweight EGI review process, we are required to respond with any suggestions/issues before 14th August. Perun is used by the EGI Fed Cloud to manage users' access rights to cloud services. Every cloud VO therefore needs to be supported by Perun, which is why it has been requested that the service type be properly registered and then monitored.
  • LHCb have reported that the dCache problems they have seen recently do not seem to have any correlation with a particular dCache version. Even different endpoints in the same site could fail or work OK. This is all related to xrootd endpoints and some sites have solved the issues that seem to be caused by misconfigurations on their site.
  • GridPP will receive some RIPE probes for distribution - note there is now a waiting list for UK based requests.

Tuesday 5th August

  • There was an EGI Operations Management Board (OMB) meeting last week. Several UK issues (VO DM/job approaches, NFS area futures and availability alarm handling) were input for discussion, but due to the tight agenda will be reviewed at the next EGI ops meeting.
  • Some things to note from the OMB (see also the meeting minutes):
    • Editing of the EGI wiki is now limited to EGI members (SSO), with access for others on request.
    • A reminder to keep GOCDB information up-to-date - it is used to populate various tools.
    • A federated cloud security survey is in progress.
    • There is an EGI Big Data conference 24th-26th September.
    • Resource requests - 19 pools registered. 13 available for allocation. It is a brokering service only. There is one request in the system for cloud resources.
    • There is a new draft Resource Centre OLA for comment till 15th August. Updates coming for Technical Policy, User and EGI.eu SLAs. Refer to the performance wiki page for a chart showing relevance.
    • A monthly release of the Ops portal following 1 week testing and input from TAG has been proposed.
    • A SAM probes Task Force has been set up to assess the support status of probes and improve documentation. An initial list of issues is available.
    • There are plans for VAPOR - combined Vo Administration and operations PORtal. There is a prototype available.

WLCG Operations Coordination - Agendas

Tuesday 12th August

  • The next meeting is on 21st August.
  • For the September MB we have been asked for ideas on improving operational efficiency within WLCG.

Tuesday 29th July

  • There was a WLCG coordination meeting last Thursday. Minutes are available.
  • News: MB agreed gstat no longer supported. Notes from WLCG workshop available.
  • Baseline: Changes FTS3 v3.2.26; DPM 1.8.8; perfSONAR 3.3.2-17; Storm 1.11.4.
  • Baseline: Issues: “Missing Key Usage Extension in delegated proxy” affecting CREAM; dCache upgrade; EOS (Brazilian proxies). Tickets for CVMFS upgrade to 2.1.19.
  • Tier-2 – feedback from GridPP given. CVMFS monitoring, gstat and ARGUS-SE status.
  • ALICE: Very high to low throughput due to input tasks. ALICE site admins to note VOMS changes
  • ATLAS: Issues related to the commissioning/integration of the new ATLAS frameworks. Using MCORE and want more. A new way of measuring site availability, through availability for analysis, is under discussion within ATLAS.
  • CMS: Busy. CAS2014 soon target 40k concurrent jobs. Want to make xrootd-fallback test critical. Removal of individual release tags from CEs. 5 site reminders.
  • LHCb: Access to lcg-voms2 and voms2 from outside CERN is timing out. Recurring problem after upgrade of dCache sites. CVMFS clients not all upgraded. Question - what happens to a VM if it fails and needs reinstantiation? Lose the cores? No.
  • Tracking tools: Recommendation to close TF.
  • FTS3: Only a few CMS sites running over FTS2. Can now monitor FTS3 transfers against user DN (needed by CRAB3)
  • Glexec: no change. Some running unreleased rpms due to ARGUS status. Someone needs to pledge effort to support ARGUS. SWITCH unable to pledge effort.
  • MW readiness: DPM is the pilot WLCG MW package used for the Readiness Verification effort. Progress is traced in a dedicated JIRA tracker. Next meeting 1st October.
  • Multicore: ATLAS steady flow. With CMS concurrent at 5 sites. How to track.
  • SHA-2: Intro new VOMS servers. Testing infrastructure with preprod SAM. Some good; many failures due to simple configuration issues of CREAM and/or Argus (Action to check). Aim for 15th September. Will open tickets. RFC no progress.
  • WMS decommissioning: Work ongoing to improve SAM Condor probe jobs lifecycle. End of September or October.
  • IPv6: All but ALICE now testing (migrating AliEN to Perl 5.18).
  • HTTP proxy dis: NTR
  • Network and transfer metrics WG: proposed dates for kickoff are in the last week of August and second week in September.

Tuesday 21st July

  • The next coordination meeting takes place this Thursday at 14:30 UK time. There is now a standing item for sites to raise issues of concern. Is there anything we would like to mention this week? We are invited to update the twiki up until 1 hour before the meeting.

Tier-1 - Status Page

Tuesday 12th August

  • Ongoing investigations into problems with draining disk servers in Castor 2.1.14.
  • We have announced that we will shut down both the FTS2 service and the software server used by the small VOs on the 2nd September.

Storage & Data Management - Agendas/Minutes

Monday 11th August

  • Pool nodes at RHUL have received test errors.

Tuesday 5th August

  • The list of work Jens reviewed last Wednesday
    • WebFTS testing
    • Updating storage documentation (the wiki) and testing it
    • Upgrading DPM 1.8.7s?
    • GLUE 2.0 for storage revisited?
    • IPv6
    • WebDAV

Accounting - UK Grid Metrics HEPSPEC06 Atlas Dashboard HS06

Tuesday 12th August

  • Accounting looks behind for UCL, Sheffield and Sussex.

Tuesday 29th July

  • Accounting gap in last week for UCL and Durham.

Tuesday 22nd July

  • EGI accounting portal does not show any significant outages in publishing in recent days.

Documentation - KeyDocs

See the worst KeyDocs list for documents needing review now and the names of the responsible people.

Tuesday 12th August

  • The keydocs php scripts are working now, so we can restart our review process....

Tuesday 22nd July

  • Starting on revisions this week.
  • Is the alert system now working?

Monday 16th June

  • A review is starting of old and obsolete pages within the GridPP website - there are many! Please review sections that you have created and update them if necessary.

Interoperation - EGI ops agendas

Tuesday 14th July

  • Last meeting yesterday.
  • URT: see agenda for details
  • SR: In verification: gfal2 v. 2.5.5; active: globus-info-provider-service v. 0.2.1 cream v. 1.16.3; ready to be released: storm v. 1.11.4 lb v. 11.1 wms v. 3.6.5 dcache v. 2.6.28
  • DMSU report: CREAM CLI/GridSite SegFaults at Long-Lived Proxies solved
  • Migration of Central SAM services: Note to make sure that if being reinstalled that patches are applied
  • EMI-2/APEL-2 - Looks like UCL is still publishing with APEL-2 publisher
  • Hoped that gr.net issues resolved on Monday. Summary of discussion to be in minutes.
  • Next meeting placeholder 28th July, but may not happen (OMD depending)
  • Please fill out this UMD customer satisfaction survey in the next couple of weeks if you have a moment: https://www.surveymonkey.com/s/MQ6G8BZ

Tuesday 1st July

  • Today's ops meeting cancelled - partly due to forthcoming 4th EGI annual review.
  • EMI-2 decommissioning: The situation is followed by COD (GGUS 106354). "Please remember that we passed the decommissioning deadline and after today - Sites still deploying unsupported service end-points risk suspension, unless documented technical reasons prevent a Site Admin from updating these end-points" (source PROC16).
  • There is STILL use of UMD2/EMI2 APEL clients to send accounting data. As of today there are 20 sites (see latest list) still using UMD2/EMI2 APEL clients.

Monitoring - Links MyWLCG

Tuesday 11th August

  • Next consolidation meeting this Friday, Messaging and SAM 3 UI: https://indico.cern.ch/event/334354/
  • Kick-off meeting discussing cvmfs monitoring in squid monitoring TF being arranged now.

Tuesday 22nd July

  • Chris noted yesterday that gstat reports most sites as critical. SB thought the underlying problem is that a value is supposed to be "Production" in GLUE 1 and "production" in GLUE 2; at some point they changed both to lower case.
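
If the case mismatch SB describes is indeed the cause, the failure mode can be illustrated with a minimal sketch. This is not gstat's code; the function names are hypothetical and only the two spellings "Production" and "production" come from the note above.

```python
def strict_glue1_check(state: str) -> bool:
    """Hypothetical strict check pinned to the capitalised spelling."""
    return state == "Production"


def is_production(state: str) -> bool:
    """Hypothetical tolerant check: ignore case and surrounding whitespace."""
    return state.strip().lower() == "production"


# The strict check rejects the lower-case spelling, the tolerant one accepts both.
for value in ("Production", "production"):
    print(value, strict_glue1_check(value), is_production(value))
```

A consumer that compares against one exact spelling would flag every site publishing the other spelling, which would be consistent with most sites appearing as critical at once.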

On-duty - Dashboard ROD rota

Tuesday 12th August

  • Last week was quiet.
  • Still one or two responses needed for next rota allocations.

Tuesday 5th August

  • Reminder to ROD team on rota availability
  • A fairly quiet week. Site availability alerts are a bit of a pain - there are open "Rod notepad" entries for each but no tickets created. We close the alarms but they reappear each day. The UCL case is steadily improving, and Durham are still moving equipment to a new server room so the situation will persist for them.

Rollout Status WLCG Baseline

Monday 28th July

Tuesday 18th March

Tuesday 11th February

  • 31st May has been set as the deadline for EMI-2 decommissioning. There may be an issue for dCache (related to 3rd party/enstore component).


Security - Incident Procedure Policies Rota

Monday 11th August

  • There is an issue at the moment in the evaluation of vulnerabilities causing everything rated 'High' by Pakiti to display as 'Critical' in the Dashboard.

Tuesday 5th August

  • There is a security team meeting tomorrow at 11am.

Tuesday 29th July

Tuesday 22nd July

Services - PerfSonar dashboard | GridPP VOMS

- This includes notifying of (inter)national services that will have an outage in the coming weeks or will be impacted by work elsewhere. (Cross-check the Tier-1 update).

Tuesday 12th August

  • A reminder to update site status information in the IPv6 pages.
  • We will shortly review issues being picked up by perfSONAR and the steps to take when investigating.

Tuesday 22nd July

  • There was a problem with VOMS admin that was noticed last Thursday: VO page requests resulted in the WEB-INF directory of the jetty app being displayed. A server restart fixed the problem.

Tuesday 17th June

  • The GridPP VOMS server was updated on 11/06/2014 - no issues reported.


Monday 11th August 2014, 15.30 BST
20 Open UK tickets this week.

CSIRT Site Security Checks
Master ticket: 107538(7/8)

Jeremy has been submitting tickets to site security e-mails addresses to test that they're all working. Most seem to be present and correct, but on the not-yet-replied pile we have:
SUSSEX (107545)
EFDA-JET (107551)
CAMBRIDGE (107558)

It could be that the sites in question have everyone on holiday, or that the site security contact address is a generic one for the University and someone at the other end is wondering what in the name of Odin's beard a GGUS ticket is.

In the event that any of these three saw the tickets here first rather than through the proper channels, please, please can you check your lines of communication rather than treating this like any other ticket.

The information publishing on one of Elena's CEs seems to have spontaneously broken (as I found Torque and Maui to be wont to do). If anyone has had their CEs suddenly start publishing all 4s for their job numbers recently then I'm sure any help would be appreciated. In progress (6/8)

As seen from TB-SUPPORT, Bristol are having networking problems - especially with regards to the US. Lukasz and Winnie are working on it, but the submitter (and maybe CMS) seems to be losing patience (unfairly). In progress (6/8)

Decommissioning of the FTS2 service. Gareth sent out a broadcast - https://operations-portal.in2p3.fr/broadcast/archive/id/1187 Preparations are well under way. In progress (11/8)

The RAL counterpart to Bristol's 106325 (cms pilots losing contact to their submission hosts), it looks like this problem still persists as well. A mixup between RALPP and Bristol on the CMS end (and not for the first time) meant that CMS didn't answer RAL's question (which was do you see the same problem at RALPP too?). Still Waiting for reply (7/8)

Tools - MyEGI Nagios

Monday 14th July

Winnie reported on Saturday 12th July that most of the UK sites were failing Nagios tests. The problem started with an unscheduled power cut at a Greek site hosting an EGI message broker (mq.afroditi.hellasgrid.gr) around 2PM on 11th July. The message broker was put in downtime but top BDIIs continued to publish it for quite a long time. Stephen Burke mentioned in the TB-SUPPORT thread that the default caching time is now 4 days. When I checked on Monday morning only Manchester was still publishing mq.afroditi, and it went away after Alessandra manually restarted the top BDII. It seems that Imperial is configured with a much shorter cache time. Only Oxford and Imperial were almost unaffected, and the reason may be that Oxford WNs have the Imperial top BDII as first option in BDII_LIST. Other NGIs have reported the same problem and this outage is likely to be considered when calculating availability/reliability. All Nagios tests have now returned to normal.

Emir reported this on the tools-admin mailing list: "We were planning to raise this issue at the next Operations meeting. In these extreme cases 24h cache rule in Top BDII has to be somehow circumvented."
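
The effect of the cache time on how long a dead service stays visible can be sketched as below. This is an illustration only, not Top BDII code: the function name is hypothetical, the 09:00 check time on Monday is an assumption, and the two cache values are the 4-day default and the 24h figure mentioned in the thread.

```python
from datetime import datetime, timedelta


def still_published(last_seen: datetime, now: datetime, cache_ttl: timedelta) -> bool:
    """True while a cached entry for an unreachable service would still be served."""
    return now - last_seen < cache_ttl


# Power cut around 2PM on 11th July; checked on Monday morning (09:00 assumed).
outage_start = datetime(2014, 7, 11, 14, 0)
monday_morning = datetime(2014, 7, 14, 9, 0)

four_day_cache = timedelta(days=4)    # default quoted in the TB-SUPPORT thread
one_day_cache = timedelta(hours=24)   # figure quoted in Emir's mail

print(still_published(outage_start, monday_morning, four_day_cache))
print(still_published(outage_start, monday_morning, one_day_cache))
```

With roughly 67 hours elapsed, an instance holding entries for 4 days would still publish the broker on Monday morning, while one with a 24 hour limit would have dropped it during Saturday.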

Tuesday 1st July

  • There was a monitoring problem on 26th June. All ARC CEs were using storage-monit.phyics.ox.ac.uk for replicating files as part of the Nagios testing. storage-monit was updated but not re-yaimed until later, so it was broken for the morning, leading to all ARC SRM tests failing.

Tuesday 24th June

  • An update from Janusz on DIRAC:
  • We had a stupid bug in Dirac which affected the gridpp VO and storage. Now it is fixed and I was able to successfully upload a test file to Liverpool and register the file with the DFC
  • The async FTS is still under study; there are some issues with this.
  • I have a link to software to sync user database from a VOMS server, haven’t looked into this in detail yet.

VOs - GridPP VOMS VO IDs Approved VO table

Monday 11th August

  • Steve J sent an email to hyperk on 7th regarding "software directory for Hyperk (CVMFS)" and entries in the VO ID card.

Monday 14th July 2014

  • HyperK.org will initially use remote storage (irods at QMUL) - so CPU resources would be appreciated.

Monday 30 June 2014

  • HyperK.org request for support from other sites
    • 2TB storage requested.
    • CVMFS required
  • Cernatschool.org
    • WebDAV access to storage - world read works at QMUL.
    • ideally will configure federated access with DFC as LFC allows.

Monday 16 June 2014

  • Snoplus almost ready to move to CVMFS - waiting on two sites. Will use symlinks in existing software
  • VOMS server: Snoplus has problems with some of the VOMS servers - see ggus 106243 - may be related to update.

Tuesday 15th April

  • Is there interest in an FTS3 web front end? (more details)

Site Updates

Tuesday 20th May

  • Various sites but notably Oxford have ARGUS problems. 100s of requests seen per minute. Performance issues have been noted after initial installation at RAL, QMUL and others.

Meeting Summaries
Project Management Board - Members Minutes Quarterly Reports


GridPP ops meeting - Agendas Actions Core Tasks


RAL Tier-1 Experiment Liaison Meeting (Wednesday 13:30) Agenda Meeting takes place on Vidyo.

Wednesday 30th July 2014

  • Operations report
  • Business as usual....
  • The termination of the FTS2 service has been announced for the 2nd September.
  • The software server used by the smaller VOs will be turned off - also on 2nd September.
  • We are planning to turn off access to the cream CEs (possibly keeping them open for Alice) - although no date yet decided for this.

WLCG Grid Deployment Board - Agendas MB agendas


NGI UK - Homepage CA


UK ATLAS - Shifter view News & Links






  • N/A
To note

  • N/A