Operations Bulletin Latest

From GridPP Wiki

Bulletin archive


Week commencing 8th September 2014
Task Areas
General updates

Monday 8th September

  • Be ready for the new CERN and ops VOMS. Compare the prod and preprod instances for:
  • An EMI3 WN tarball update has been done by Matt (see also GGUS 107869).
  • There is an LHCONE/LHCOPN meeting next week on 16th and 17th (agenda). It would be good to have some remote participation.
  • Website redesign - please complete this survey.


Monday 1st September

  • A/R results for August have been released.
    • ALICE: All good.
    • ATLAS: Durham (89%:98%) - very close! Sussex (45%:86%) - downtime for various updates. Problems with CE for WMS jobs only, so fine for ATLAS.
    • CMS: All good.
    • LHCb: All good. Northgrid and London perfect!
  • EGI A/R results have also been uploaded to this table. July's results show the UK at 96% overall. UCL, Durham and Birmingham had a couple of issues that affected them.
  • There is a UK CA TAG on 3rd September. Please let Jeremy know if you have any CA related issues or comments.
  • There has been discussion about lock-up problems with 2.6.32-431 kernels on Supermicro kit. Any conclusions?
  • VOMS update checks (mixed amongst pre-prod critical alarms):
    • CMS: Bristol, RALPP, RHUL.
    • LHCb: ECDF, EFDA.
    • ATLAS: UCL, ECDF, Oxford, RALPP.
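
As a reminder of what the A/R pairs quoted in the August results express, here is a minimal sketch. It assumes the usual WLCG-style definitions (availability is measured over all known time, reliability additionally excludes scheduled downtime) and assumes the pairs are written as availability:reliability. The function names and the sample hours are hypothetical and are not taken from any site's real figures.

```python
def availability(up_hours, total_hours, unknown_hours=0.0):
    """Fraction of known time during which the site was up."""
    known = total_hours - unknown_hours
    if known <= 0:
        raise ValueError("no known time in the reporting period")
    return up_hours / known


def reliability(up_hours, total_hours, scheduled_down_hours=0.0, unknown_hours=0.0):
    """Like availability, but scheduled downtime is not held against the site."""
    expected = total_hours - scheduled_down_hours - unknown_hours
    if expected <= 0:
        raise ValueError("no expected uptime in the reporting period")
    return up_hours / expected


if __name__ == "__main__":
    # Hypothetical month: 31 days = 744 hours, 600 hours up,
    # 100 hours of scheduled downtime, no unknown time.
    a = availability(600, 744)
    r = reliability(600, 744, scheduled_down_hours=100)
    print(f"{a:.0%}:{r:.0%}")
```

Under these assumptions a site with a long scheduled downtime shows a large gap between the two numbers, while unscheduled problems pull both down together.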

Monday 25th August

  • GridPP33 took place in Ambleside last week.
  • Our thanks to Sam and Mohit, Year in Industry students, who have now finished. There will be fewer ticket prompts until new students are in place.
  • Reminder of 17th March message: new VOMS servers for Ops and LHC experiments. The deadline is Monday 15th September. The experiment pre-prod instances will switch earlier. Already started: ALICE 23rd July; LHCb 22nd August. Pending ATLAS 28th August; CMS 28th August.
  • For CMS: transition Savannah to GGUS (CMS Computing Operations): September 1st - Disable submission of new tickets; September 30th - Close Savannah (still open issues will be transferred to GGUS).
  • ATLAS RIPE probes have been handed out to some GridPP sites; these sites should have received a welcome notification message.
  • A UK CA TAG meeting is planned for 3rd September. One discussion item concerns an opportunity to migrate the UK e-Science CA to a new commercial CA as part of a JANET agreement.
  • On 18th August the main DNS servers associated with the egi.eu domain were switched from Nikhef to CESNET.
WLCG Operations Coordination - Agendas

Monday 8th September

  • There will be a multi-core meeting on Tuesday 9th at 14:30 (CERN time). It will cover reviews of the UGE setup for multicore jobs at CCIN2P3 and of the method for passing job requirement arguments to batch systems via the CE. (Agenda)
  • A review of last week's ops meeting....

Tuesday 2nd September

  • The next WLCG ops coordination meeting is this Thursday 4th September.
  • There will be a Tier-1/2 feedback section in the agenda IF there is feedback/input. Do we have any items to raise?

Tuesday 26th August

  • There was a WLCG coordination meeting last Thursday. Minutes [1] are available.
  • News: CERN-IT to terminate the SLC5-based interactive and batch services (lxplus5 and lxbatch5) soon. The current target date is 30 September 2014.
  • A study to assess how operational effort in WLCG is used and could be optimised will launch in the next weeks. This will cover the management of sites and site services. It will (generally) not cover the experiment computing operations.
  • MW baselines: No recent updates
  • MW Issues: StoRM and Argus integration issues. APEL fails to parse accounting records, affecting APEL 1.2.1 (released mid-August). Sites affected should move to 1.2.2. CVMFS upgrade to 2.1.19 almost done.
  • Oracle: upgrade plans now available.
  • T0: ARGUS latest version deployed. Looking at decommissioning AFS UI. A few users have already contacted CERN pointing out that they need SLC5 to build their software, as they haven't completed the porting to SLC6 yet - plan to push users to VMs on OpenStack.
  • Confirmation wanted on the AFS UI tarball support.
  • T1: No feedback.
  • T2: No feedback.
  • ALICE: steady production and analysis activities throughout the past weeks.
  • ATLAS: No report.
  • CMS: Finishing samples for CSA14; Computing Analysis Software challenge 2014 extended till mid-September. Users happy with AAA and miniAOD. Reminder for sites: Need to change xrootd redirectors, see this hn post; Need to adapt site-local-config.xml to include <phedex-node value="Tx_CO_Site{_type}"/> (e.g. value="T1_DE_KIT_Disk") in the <local-stage-out> section and the same format (but the PhEDEx name for the fallback endpoint) in <fallback-stage-out> NEW; Need to upgrade to CVMFS >= 2.1.19 immediately.
  • LHCb: Low activity, mainly Monte Carlo simulation and user jobs. To probe the ARC CEs at several UK sites, the SAM/Nagios probes are now submitted via a WMS instance from RAL-LCG2. It was confirmed that this WMS instance will be kept in production, also for this purpose, until at least 2015.
  • Tracking tools: no report.
  • FTS3: no report.
  • glexec: no report.
  • Machine/job features: Developer is leaving OSG.
  • MW readiness: A new version of the WLCG Package Reporter has been released. A new BDII update 9 and CREAM CE 1.6.3 are being deployed for CMS verification.
  • Multicore: no report.
  • SHA-2: Progress with new VOMS servers - compliance with the WLCG infrastructure being tested, ALICE results show CREAM/ARGUS config issues at some sites. Broadcast next week with a hard deadline of 15th September. Sites that fail the SAM preprod tests by the end of August will be ticketed.
  • WMS decommissioning: Condor validation - ATLAS and CMS ready. Deployment to production is planned on Wed 1st of October 2014.
  • IPv6: Ewan ran tests on pure IPv6 EMI-3 UI. Mixed results.
  • Squid monitoring/HTTP proxy: Reactivated Squid Monitoring TF to track its task list.
  • Network and transfer metrics: Tasks/membership updated. perfSONAR Toolkit 3.4rc2 has become available for testing. Version 3.4 is a major milestone for the WG as it enables access via a REST API and introduces several important performance improvements; a deployment campaign will therefore follow once a stable release is available.
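
For the CMS site-local-config.xml reminder in the list above, a site could check that both stage-out sections carry a phedex-node entry with a short script. This is only a sketch: the element names phedex-node, local-stage-out and fallback-stage-out come from the bulletin, while the surrounding XML layout, the sample site name T2_XX_Example and the function names are assumptions made for illustration.

```python
import xml.etree.ElementTree as ET

# Sections that, per the CMS reminder, should contain a <phedex-node> element.
SECTIONS = ("local-stage-out", "fallback-stage-out")

# Hypothetical, heavily simplified config; a real file contains much more.
SAMPLE = """
<site-local-config>
  <site name="T2_XX_Example">
    <local-stage-out>
      <phedex-node value="T2_XX_Example"/>
    </local-stage-out>
    <fallback-stage-out>
    </fallback-stage-out>
  </site>
</site-local-config>
"""


def phedex_nodes(xml_text):
    """Map each stage-out section name to the phedex-node values found in it."""
    root = ET.fromstring(xml_text)
    found = {}
    for section in SECTIONS:
        values = []
        for block in root.iter(section):
            for node in block.iter("phedex-node"):
                values.append(node.get("value"))
        found[section] = values
    return found


def missing_phedex_sections(xml_text):
    """Return the section names that have no phedex-node entry."""
    return [name for name, values in phedex_nodes(xml_text).items() if not values]


if __name__ == "__main__":
    print(missing_phedex_sections(SAMPLE))
```

With the sample above, only the fallback section is reported as missing its entry. Reading a real file would use `ET.parse(path).getroot()` instead of `ET.fromstring`.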

Tuesday 12th August

  • The next meeting is on 21st August.


Tier-1 - Status Page

Tuesday 2nd September

  • There was a problem on Saturday (30th August) when a network switch failed. For reasons not yet understood some of the virtual machine infrastructure (supporting production services) had a problem despite not being on the network stack containing the failed switch. All services (except Castor) were declared down for around 5.5 hours.
  • Both the FTS2 service and the software server used by the small VOs are being shut down TODAY.
Storage & Data Management - Agendas/Minutes

Monday 1st September

  • FAX sites to update the C++ N2N rpms.
  • There is interest regarding issues/performance when placing storage outside firewalls. JC will shortly start a (closed) discussion/survey.

Monday 11th August

  • Pool nodes at RHUL have received test errors.

Tuesday 5th August

  • The list of work Jens reviewed last Wednesday:
    • WebFTS testing
    • Updating storage documentation (the wiki) and testing it
    • Upgrading DPM 1.8.7s?
    • GLUE 2.0 for storage revisited?
    • IPv6
    • WebDAV



Accounting - UK Grid Metrics HEPSPEC06 Atlas Dashboard HS06

Tuesday 2nd September

Tuesday 26th August

  • Sheffield has stopped publishing.
Documentation - KeyDocs

See the worst KeyDocs list for documents needing review now and the names of the responsible people.

Tuesday 2nd September

  • This work needs a kick-start! Reminders should now be arriving.
  • Tom/Andrew in discussion about options for main site - main considerations are WordPress and Drupal.

Tuesday 26th August

  • KeyDocs now working again. Several documents assigned to Jeremy for re-allocation. Owners need discussion.

Tuesday 12th August

  • The KeyDocs PHP scripts are not yet working, so we cannot restart our review process....
Interoperation - EGI ops agendas

Monday 8th September


Monitoring - Links MyWLCG

Tuesday 2nd September

  • Monitoring consolidation meeting last Friday
  • Squid monitoring TF meeting last Thursday
On-duty - Dashboard ROD rota

Tuesday 2nd September

  • Sussex is back in business; their low availability alarm kept being closed with reference to the GGUS ticket.
  • The UCL ticket is now finally receiving some attention.
  • Ongoing problems at RAL.

Tuesday 26th August

  • RAL : Nagios jobs staying in the queue for a long time - to be investigated.
  • Sussex : Matt needs help, probably from SGE experts.
  • UCL : No acknowledgement from the site (ticket escalated to second level).
  • 100IT : There is an alarm from EGI federated cloud - this needs discussion.
  • Durham : Availability alarms - require constant closing with some comments. Ticket with devs is open.

Tuesday 12th August

  • Last week was quiet.
  • Still one or two responses needed for next rota allocations.


Rollout Status WLCG Baseline

Tuesday 26th August

Monday 28th July


References


Security - Incident Procedure Policies Rota

Monday 8th September

  • There was a security team meeting last Wednesday.
  • There was also a CA TAG meeting last Wednesday.

Monday 11th August

  • Topics as mentioned during the last GridPP technical meeting.
  • There is an issue at the moment in the evaluation of vulnerabilities causing everything rated 'High' by Pakiti to display as 'Critical' in the Dashboard.



Services - PerfSonar dashboard | GridPP VOMS

- This includes notifying of (inter)national services that will have an outage in the coming weeks or will be impacted by work elsewhere. (Cross-check the Tier-1 update).

Tuesday 2nd September

  • Only a few of the RIPE probes went live last week - any issues at the other sites to be discussed?
  • JANET is going to deploy a perfSONAR instance on one of the exchange points in London. They hope it will help raise awareness of issues with local systems affecting their transfer performance.

Tuesday 12th August

  • A reminder to update site status information in the IPv6 pages.
  • There is a new version (v3.4rc2) of perfSONAR being tested at QMUL [2]. Details here [3].
  • We will shortly review issues being picked up by perfSONAR and the steps to take when investigating.
Tickets
Tools - MyEGI Nagios

Monday 14th July

Winnie reported on Saturday 12th July that most of the UK sites were failing Nagios tests. The problem started with an unscheduled power cut at a Greek site hosting the EGI message broker (mq.afroditi.hellasgrid.gr) at around 2PM on 11th July. The message broker was put in downtime, but top BDIIs continued to publish it for quite a long time. Stephen Burke mentioned in the TB support thread that the default caching time is now 4 days. When I checked on Monday morning only Manchester was still publishing mq.afroditi, and it went away after Alessandra manually restarted the top BDII. It seems that Imperial is configured with a much shorter cache time. Only Oxford and Imperial were almost unaffected, possibly because Oxford WNs have the Imperial top BDII as the first option in BDII_LIST. Other NGIs have reported the same problem, and this outage is likely to be taken into account when calculating availability/reliability. All Nagios tests have now returned to normal.

Emir reported this on the tools-admin mailing list: "We were planning to raise this issue at the next Operations meeting. In these extreme cases 24h cache rule in Top BDII has to be somehow circumvented."
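
The effect of the cache time on this incident can be illustrated with a small sketch. It assumes that a top BDII keeps publishing the last-known entry for a vanished service until a fixed cache period has elapsed since the entry was last refreshed. The function name and the 09:00 "Monday morning" check time are hypothetical; the outage time and the 4-day and 24-hour cache periods are the ones mentioned above.

```python
from datetime import datetime, timedelta


def still_published(last_refreshed, now, cache_period):
    """True while a cached entry has not yet aged out of the top BDII."""
    return now < last_refreshed + cache_period


# Power cut at the broker's site: around 2PM on 11th July 2014.
outage = datetime(2014, 7, 11, 14, 0)
# Assumed time of the Monday-morning check (14th July).
monday_check = datetime(2014, 7, 14, 9, 0)

four_day_cache = still_published(outage, monday_check, timedelta(days=4))
one_day_cache = still_published(outage, monday_check, timedelta(hours=24))

if __name__ == "__main__":
    print("4-day cache still publishing on Monday:", four_day_cache)
    print("24-hour cache still publishing on Monday:", one_day_cache)
```

Under these assumptions a 4-day cache would still list the broker on Monday morning, whereas a 24-hour cache would have dropped it by Saturday afternoon, which is consistent with sites using a shorter cache time recovering sooner.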

Tuesday 1st July

  • There was a monitoring problem on 26th June. All ARC CEs were using storage-monit.phyics.ox.ac.uk for replicating files as part of the Nagios testing. storage-monit was updated but not re-yaimed until later, so it was broken for the morning, causing all ARC SRM tests to fail.

Tuesday 24th June

  • An update from Janusz on DIRAC:
  • We had a stupid bug in DIRAC which affected the gridpp VO and storage. Now it is fixed and I was able to successfully upload a test file to Liverpool and register the file with the DFC.
  • The async FTS is still under study; there are some issues with this.
  • I have a link to software to sync user database from a VOMS server, haven’t looked into this in detail yet.


VOs - GridPP VOMS VO IDs Approved VO table

Monday 11th August

  • Steve J sent an email to hyperk on 7th regarding "software directory for Hyperk (CVMFS)" and entries in the VO ID card.

Monday 14th July 2014

  • HyperK.org will initially use remote storage (irods at QMUL) - so CPU resources would be appreciated.

Monday 30 June 2014

  • HyperK.org request for support from other sites
    • 2TB storage requested.
    • CVMFS required
  • Cernatschool.org
    • WebDAV access to storage: world read works at QMUL.
    • ideally will configure federated access with DFC as LFC allows.


Monday 16 June 2014

  • CVMFS
    • Snoplus almost ready to move to CVMFS - waiting on two sites. Will use symlinks in existing software
  • VOMS server: Snoplus has problems with some of the VOMS servers - see GGUS 106243 - may be related to update.


Tuesday 15th April

  • Is there interest in an FTS3 web front end? (more details)


Site Updates

Tuesday 20th May

  • Various sites but notably Oxford have ARGUS problems. 100s of requests seen per minute. Performance issues have been noted after initial installation at RAL, QMUL and others.


Meeting Summaries
Project Management Board - Members Minutes Quarterly Reports

Empty

GridPP ops meeting - Agendas Actions Core Tasks

Empty


RAL Tier-1 Experiment Liaison Meeting (Wednesday 13:30) Agenda Meeting takes place on Vidyo.

Wednesday 3rd September 2014

  • Operations report
  • Tenders for this year's CPU & Disk purchases are underway.
  • The FTS2 service was terminated on the 2nd September.
  • The software server used by the smaller VOs has been turned off.
  • Access to the CREAM CEs will be withdrawn, except for ALICE. The proposed date for this is Tuesday 23rd September.
WLCG Grid Deployment Board - Agendas MB agendas

Empty



NGI UK - Homepage CA

Empty

Events
UK ATLAS - Shifter view News & Links

Empty

UK CMS

Empty

UK LHCb

Empty

UK OTHER
  • N/A
To note

  • N/A