Operations Bulletin Latest

From GridPP Wiki

Bulletin archive


Week commencing 14th July 2014
Task Areas
General updates

Monday 14th July

  • Workshop - CVMFS monitoring feedback
  • ATLAS DC14 13TeV simulation starting - note Alessandra's recommendation regarding Nikhef scripts and multicore running (for Torque/Maui sites).
  • topBDII caching and errors
  • ILC VOMS changes
  • Sites with ARC CEs who want to support LHCb need to make a few configuration changes. This is to ensure that there is an environment variable available to jobs which specifies the name of the queue.
  • EGI A/R report for June
  • Did anyone else see kernel problems like Liverpool (see blog)?
  • Large numbers of biomed jobs have been impacting various sites. Is setting MaxTotalJobs the answer? Do we need follow-up with the VO?
  • HyperK can now make use of additional resources and a general request for enablement was circulated. It has been confirmed that they only need disk at QMUL.
WLCG Operations Coordination - Agendas

Tuesday 14th July

Tuesday 1st July

Monday 23rd June

  • Minutes from last Thursday's meeting. Highlights....
  • A page is available listing current known middleware issues affecting WLCG.
  • Baselines: Storm 1.11.4 released in EMI containing several bug fixes. Baseline update with UMD release.
  • 3 issues affected some sites after the latest EMI update of Cream and LB. The problems are under investigation by the PTs.
  • CVMFS: Starting from July, sites not compliant with the 2.1.19 version will be notified with a GGUS ticket (noted that the upgrade just requires an update of the RPM and a restart of CVMFS).
  • T0: The OPS VO now runs in voms-admin instead of VOMRS, after the migration done on June 17th
  • Tier-1/Tier-2 feedback: NTR!
  • ALICE: successful campaign for users to move away from old ROOT versions. T0 job efficiency issues ongoing.
  • ATLAS: DC14 expected to start in approximately 2 weeks from now. Panda/Jedi is now fully ready for user analysis.
  • CMS: Started to remove individual release tags from CEs. After the introduction of disk/tape separation at the T1 sites, CMS now must site readiness measures for T1 sites
  • LHCb: Recommend CVMFS 2.1.19. General request: ensure that downtimes, including unscheduled outages, accurately reflect the specific services which are unavailable.
  • FTS3: Monitoring the auto-tuning algorithm closely and adjusting various monitoring tools of FTS3.
  • glexec: 10 sites have yet to enable it. ARGUS instabilities being investigated.
  • Machine/job features: PBS/torque and LSF implemented. SLURM pending. SGE and HTCondor in progress.
  • MW readiness: ATLAS and CMS DPM setups in progress. Monitoring prototype being deployed at test sites.
  • Multicore: CMS stable flow. Gathering reports for July workshop. ATLAS MC jobs on-hold pending new software release.
  • SHA-2: New VOMS fix for CERN instances requires sites to update ARGUS, UI, CREAM and WN instances.
  • WMS decommissioning: Progress with SAM Condor validation. ARC-CE WN tests failing for some CMS sites (incl. Imperial).
  • IPv6: NTR
  • HTTP proxy discovery: Task overview table updated.
  • Network and transfers metrics: Mesh leaders developed. Kick off in July.
  • AOB: OSG plan to migrate to HTCondor CEs by October.
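
The CVMFS item above says sites not compliant with version 2.1.19 will be ticketed. A minimal sketch of such a check follows, assuming "compliant" means running 2.1.19 or newer (the bulletin does not spell this out). The site names and installed versions are hypothetical, and this is not the tool WLCG uses. Versions are compared as integer tuples, because a plain string comparison would wrongly rank "2.1.9" above "2.1.19".

```python
def parse_version(text):
    """Turn a dotted version string such as '2.1.19' into a tuple of ints."""
    return tuple(int(part) for part in text.strip().split("."))


# Baseline version named in the bulletin.
BASELINE = parse_version("2.1.19")


def is_compliant(installed):
    """Assumption: a site is compliant if its version is at or above the baseline."""
    return parse_version(installed) >= BASELINE


# Hypothetical inventory: site name -> installed CVMFS client version.
sites = {"SITE-A": "2.1.19", "SITE-B": "2.1.15", "SITE-C": "2.1.20"}

non_compliant = sorted(name for name, ver in sites.items() if not is_compliant(ver))
print(non_compliant)
```

With the hypothetical inventory above, only SITE-B would be flagged for a ticket.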



Tier-1 - Status Page

Tuesday 1st July

  • LHCb Castor Stager Upgrade was carried out successfully last Thursday. The final update is the Atlas Castor instance stager, which is planned for Tue 1st July.
  • There is a UPS/Generator load test tomorrow morning (Wed 2nd July) and the site has been declared At Risk (warning) in the GOCDB from 10 to 11 local time.
  • We are looking at how to end the FTS2 service, now FTS3 is becoming widely used.
  • The software server used by the small VOs will be withdrawn from service. Its use as a software server is very limited (possibly only SNO+) although a few VOs use it for uploading files to the CVMFS repository.
Storage & Data Management - Agendas/Minutes

Wednesday 2 July

  • Guidance and policies for "small" VOs: how to get them started with stuff, without preventing them later growing bigger.

Tuesday 1st July


Accounting - UK Grid Metrics HEPSPEC06 Atlas Dashboard HS06

Tuesday 1st July

  • There are no SL6 HS06 entries in our wiki for UCL and EFDA.
  • Are there any observations from the latest GridPP metrics tables? (Does anything need addressing or correcting?).
  • APEL is not up-to-date for: RHUL, Manchester and Durham.


Tuesday 24th June

  • APEL not up-to-date for: RHUL, Manchester, Durham and Sussex.


Documentation - KeyDocs

See the worst KeyDocs list for documents needing review now and the names of the responsible people.

Monday 16th June

  • A review is starting of old and obsolete pages within the GridPP website - there are many! Please review sections that you have created and update them if necessary.

Tuesday 6th April

  • KeyDocs are going to be reviewed (in next 4 weeks) as the system is not working (or not adding anything) in some areas.


Interoperation - EGI ops agendas

Tuesday 14th July

  • Last meeting yesterday.
  • URT: see agenda for details
  • SR: In verification: gfal2 v. 2.5.5; active: globus-info-provider-service v. 0.2.1, cream v. 1.16.3; ready to be released: storm v. 1.11.4, lb v. 11.1, wms v. 3.6.5, dcache v. 2.6.28
  • DMSU report: CREAM CLI/GridSite SegFaults at Long-Lived Proxies solved
  • Migration of Central SAM services: Note that if services are being reinstalled, make sure that patches are applied
  • EMI-2/APEL-2 - Looks like UCL is still publishing with APEL-2 publisher
  • Hoped that gr.net issues resolved on Monday. Summary of discussion to be in minutes.
  • Next meeting placeholder 28th July, but may not happen (OMD depending)
  • Please fill out this UMD customer satisfaction survey in the next couple of weeks if you have a moment: https://www.surveymonkey.com/s/MQ6G8BZ

Tuesday 1st July

  • Today's ops meeting cancelled - partly due to forthcoming 4th EGI annual review.
  • EMI-2 decommissioning: The situation is followed by COD (GGUS 106354). "Please remember that we passed the decommissioning deadline and after today - Sites still deploying unsupported service end-points risk suspension, unless documented technical reasons prevent a Site Admin from updating these end-points" (source PROC16).
  • There is STILL use of UMD2/EMI2 APEL clients to send accounting data. As of today there are 20 sites (see latest list) still using UMD2/EMI2 APEL clients



Monitoring - Links MyWLCG

Tuesday 24th June

On-duty - Dashboard ROD rota

Tuesday 1st July

  • Quiet week. Sussex emi2 ticket is still open. UCL also has an open ticket regarding some problem with storage.

Tuesday 24th June

  • Very quiet shift. Dashboard downtime on Tuesday seemed to go ok.


Rollout Status WLCG Baseline

Tuesday 18th March

Tuesday 11th February

  • 31st May has been set as the deadline for EMI-2 decommissioning. There may be an issue for dCache (related to 3rd party/enstore component).

References


Security - Incident Procedure Policies Rota

Monday 14th July

  • EGI CSIRT ADVISORY [EGI-ADV-20140625]


Tuesday 1st July

  • There was a very useful security challenge debrief last week. Thanks to Heiko.
  • There may be a site contacts challenge in the coming months. Please could every site review their site security contact details and ensure that the GOCDB entry is up-to-date and working.
  • EGI indicates that site ARGUS instances can now be hooked up with the regional instances.
  • There was one EGI amber final report last week.
  • Next team meeting 16th July.

Monday 23rd June

  • CVE-2014-3153 - but no public exploit.
    • This kernel vulnerability has been patched in errata released last week.
  • PerfSonar/Cacti updates.
  • New IGTF CA release 1.58 - the EGI release is due on 30th June.



Services - PerfSonar dashboard | GridPP VOMS

- This includes notifying of (inter)national services that will have an outage in the coming weeks or will be impacted by work elsewhere. (Cross-check the Tier-1 update).

Tuesday 17th June

  • The GridPP VOMS server was updated on 11/06/2014 - no issues reported.


Tickets

Monday 14th July 2014, 14.30 BST.
29 Open UK tickets today. I might have to send my apologies to this week's meeting as Lancaster is receiving a delivery Tuesday morning.

FNAL VOMS TICKETS
As seen on TB-SUPPORT - a number of sites got tickets concerning jobs still contacting the FNAL voms server for CMS/ILC. Birmingham, RHUL, Liverpool and the Tier 1's tickets are still being worked on - RHUL's ticket might not have been spotted yet (still assigned).

DECOMMISSIONING THE FTS2 SERVICE
https://ggus.eu/index.php?mode=ticket_info&ticket_id=106615 (2/7)
Gareth opened a ticket to document the retirement, in accordance with ancient grid laws. As naught is happening until the 2nd of September I put it on hold till nearer the time. On Hold (14/7)

TIER 1
https://ggus.eu/index.php?mode=ticket_info&ticket_id=106770 (10/7)
enmr.eu wanted to add tags to one of the Tier 1's arc ces, which of course didn't work. There was an interesting exchange about why a VO would still want to have a site publish tags in the age of cvmfs (essentially so they can minimise changes to the submission gubbins). Andrew offered to add in the tag "VO-enmr.eu-CVMFS" by hand to his CE, it's likely that other sites might be asked to do the same - and it's a solution worth noting for other VOs. In progress (14/7)

https://ggus.eu/index.php?mode=ticket_info&ticket_id=106610 (2/7)
Enabling HyperK at the Tier 1. Ticket looks a little stalled after Chris commented that it was wise for Hyper K to be enabled on only Arc-CEs (in light of RAL going dairy free). In progress (2/7)

UCL
https://ggus.eu/index.php?mode=ticket_info&ticket_id=106425 (4/7)
UCL are still having trouble with nagios tests after a pool node died. Ben is having trouble getting the new disk server set up - I tried to give him some tips and advised shouting out for help. In progress (8/7)

BRISTOL
https://ggus.eu/index.php?mode=ticket_info&ticket_id=106554 (1/6)
Bristol having trouble with CMS transfers - Lukasz noticed Storm was being odd (believing there to be no free space when there was). The SE was kicked but the problem (or a similar one) showed up again. Anyone seen similar? (Looking at Chris Walker:Storm Sage again here). In Progress (9/7)

https://ggus.eu/index.php?mode=ticket_info&ticket_id=106325 (1/6)
cf TIER 1 ticket https://ggus.eu/index.php?mode=ticket_info&ticket_id=106324
CMS pilots losing contact with their home base. Looks similar to the issue at RAL, where they seem to have had some success (still waiting to see if it was complete). If the RAL chaps could elaborate on the firewall tweaks that brought about this improvement it would be greatly appreciated (The RAL ticket could do with an update too)! In Progress (14/7)

Tools - MyEGI Nagios

Monday 14th July

Winnie reported on Saturday 12th July that most of the UK sites were failing nagios tests. The problem started with an unscheduled power cut at a Greek site hosting the EGI message broker (mq.afroditi.hellasgrid.gr) around 2PM on 11th July. The message broker was put in downtime but top BDIIs continued to publish it for quite a long time. Stephen Burke mentioned in the TB-SUPPORT thread that the default caching time is now 4 days. When I checked on Monday morning only Manchester was still publishing mq.afroditi and it went away after Alessandra manually restarted the top BDII. It seems that Imperial is configured with a much shorter cache time. Only Oxford and Imperial were almost unaffected, and the reason may be that Oxford WNs have the Imperial top BDII as the first option in BDII_LIST. Other NGIs have reported the same problem and this outage is likely to be considered when calculating availability/reliability.

Emir reported this on the tools-admin mailing list: "We were planning to raise this issue at the next Operations meeting. In these extreme cases 24h cache rule in Top BDII has to be somehow circumvented."
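
To illustrate why the cache lifetime mattered here, the sketch below models a top BDII that keeps publishing a cached entry for a fixed lifetime after the source disappears. It is a simplified model, not the BDII implementation. The outage time (around 2PM on 11th July) comes from the report above, the two lifetimes are the 4-day and 24h figures quoted above, and the 09:00 Monday check time is an assumption.

```python
from datetime import datetime, timedelta


def last_published(gone_at, cache_lifetime):
    """Time at which a cached entry expires after its source went away."""
    return gone_at + cache_lifetime


def still_published(now, gone_at, cache_lifetime):
    """True while the stale entry is still inside the cache lifetime."""
    return gone_at <= now < last_published(gone_at, cache_lifetime)


broker_down = datetime(2014, 7, 11, 14, 0)    # around 2PM on 11th July (from the report)
monday_morning = datetime(2014, 7, 14, 9, 0)  # assumed time of the Monday check

for label, lifetime in [("4-day cache", timedelta(days=4)),
                        ("24h cache", timedelta(hours=24))]:
    print(label, still_published(monday_morning, broker_down, lifetime))
```

Under this model a 4-day cache would still be publishing the dead broker on Monday morning, while a 24h cache would have dropped it by Saturday afternoon.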

Tuesday 1st July

  • There was a monitoring problem on 26th June. All ARC CEs were using storage-monit.physics.ox.ac.uk for replicating files as part of the nagios testing. Storage-monit was updated but not re-yaimed until later, so it was broken for the morning, leading to all ARC SRM tests failing.

Tuesday 24th June

  • An update from Janusz on DIRAC:
  • We had a stupid bug in Dirac which affected the gridpp VO and storage. Now it is fixed and I was able to successfully upload a test file to Liverpool and register the file with the DFC
  • The async FTS is still under study, there are some issues with this.
  • I have a link to software to sync user database from a VOMS server, haven’t looked into this in detail yet.


VOs - GridPP VOMS VO IDs Approved VO table

Monday 14th July 2014

  • HyperK.org will initially use remote storage (irods at QMUL) - so CPU resources would be appreciated.

Monday 30 June 2014

  • HyperK.org request for support from other sites
    • 2TB storage requested.
    • CVMFS required
  • Cernatschool.org
    • WebDAV access to storage - world read works at QMUL.
    • ideally will configure federated access with DFC as LFC allows.


Monday 16 June 2014

  • CVMFS
    • Snoplus almost ready to move to CVMFS - waiting on two sites. Will use symlinks in existing software
  • VOMS server: Snoplus has problems with some of the VOMS servers - see ggus 106243 - may be related to update.


Tuesday 15th April

  • Is there interest in an FTS3 web front end? (more details)


Site Updates

Tuesday 20th May

  • Various sites but notably Oxford have ARGUS problems. 100s of requests seen per minute. Performance issues have been noted after initial installation at RAL, QMUL and others.


Meeting Summaries
Project Management Board - Members Minutes Quarterly Reports

Empty

GridPP ops meeting - Agendas Actions Core Tasks

Empty


RAL Tier-1 Experiment Liaison Meeting (Wednesday 13:30) Agenda Meeting takes place on Vidyo.

Wednesday 25th June 2014

  • Operations report
  • Castor GEN Stager 2.1.14-13 updated yesterday (24th June). Some problems with xroot for ALICE not resolved until following morning. Remaining stager dates as follows (LHCb - Thu 26th June; Atlas - Tue 8th July.)
WLCG Grid Deployment Board - Agendas MB agendas

Empty



NGI UK - Homepage CA

Empty

Events
UK ATLAS - Shifter view News & Links

Empty

UK CMS

Empty

UK LHCb

Empty

UK OTHER
  • N/A
To note

  • N/A