General updates
|
Tuesday 30th September
Monday 22nd September
- Operations Portal v3.0.5 is now online.
- There were some reports of CMS overloading WNs... discussion needed?
- There was an EGI OMB last week (see the agenda and minutes):
- we have a few more days to feed back on the revised OLAs;
- gstat has now got some support again;
- feedback is requested on the EGI.eu SLA;
- site certification (tagging and process) within the EGI federated cloud is being re-evaluated;
- ARGUS central banning is being piloted, after which rollout across EGI is expected in early 2015;
- SAM probe changes for ops have been agreed (e.g. removing the FTS2 tests);
- Some products are still seeking early adopter sites (e.g. CREAM-LSF and CVMFS).
- The next EGI OMB is on 30th October.
- There is an EGI conference this week: 'Challenges and Solutions for Big Data processing on Cloud'.
Monday 15th September
Monday 8th September
- Storage placement - survey TBC.
|
WLCG Operations Coordination - Agendas
|
Tuesday 23rd September
- There was a WLCG ops coordination meeting last Thursday (see the agenda and minutes).
- Some highlights....
- News: At the MB it was decided that WLCG Operations will investigate efficiency ideas in more detail and carry out a survey among sites to understand where effort is currently used.
- Baselines: New BDII version (1.6.0) released in EMI-3, containing small fixes for GLUE 2; new versions of the EMI-3 UI and WN available, with new dependencies (gfal2-util and ginfo) and a new bouncycastle-mail version required for SHA-2 VOMS; Frontier/Squid 2.7.STABLE9-19, with enhancements and bug fixes.
- MW issues: There is an issue in the info-provider for StoRM.
- Oracle: migration to GoldenGate is progressing well at CERN.
- T0: Investigating use cases for the AFS UI; from October SAM tests will use the EGI WMS; lxplus5/lxbatch5 are still used by users to build binaries under SLC5; a KB article describes how to set up a private VM under OpenStack with all the relevant RPMs.
- T2: gfal packages will not be removed from the repositories. WN version 3.1.0, including gfal2, will be declared baseline as soon as it is also released in UMD.
- ALICE: investigation of job failure rates and inefficiencies.
- ATLAS: NTR
- CMS: Bad incident with xrootd fallback test on Sep 16. Want to close 'unknowns' in the storage accounting. Check out the AAA page.
- LHCb: NTR
- glexec: Minor site developments. ATLAS: the pilot is now able to use glExec where it is found working, falling back to traditional mode otherwise. LHCb implemented glExec support a long time ago but has not yet started pushing sites to use it.
- Machine/job features: NTR
- Middleware readiness: Next meeting on 6th October.
- Multicore: Publishing multicore accounting to APEL works.
- SHA-2: SAM prod machines have been using the new servers since Tuesday 16th September. Checking experiment systems.
- WMS decomm: Deployment of the Condor-based SAM probes is planned for Wednesday 1st October 2014.
- IPv6: NTR
- Squid/HTTP: NTR
- Network & transfer metrics: Good participation in the first meeting. An initial overview of the current status of network and transfer metrics was presented, and a list of topics and tasks to work on in the short term was proposed.
Tuesday 16th September
- The next core ops meeting is on 18th September.
- The next multi-core meeting is today at 14:30 CERN time. It is on dynamic partitioning with LSF at CNAF.
|
Tier-1 - Status Page
|
Tuesday 30th September
- We are planning to stop access for all VOs apart from ALICE to our CREAM CEs today.
- There were some problems with the switch of the ATLAS Frontier system to the new database last week, and that change was reverted.
- We have declared an 'At Risk' on the site for a couple of hours tomorrow morning for the next quarterly UPS/generator load test.
|
Storage & Data Management - Agendas/Minutes
|
Wednesday 10th September
- High load at Liverpool causing low throughput - how to throttle xroot transfers? (And is the load necessary, or a bug?)
- Still testing WebFTS
- Prep for DPM workshop
Monday 1st September
- FAX sites to update the C++ N2N rpms.
- There is interest regarding issues/performance when placing storage outside firewalls. JC will shortly start a (closed) discussion/survey.
Monday 11th August
- Pool nodes at RHUL have received test errors.
|
Documentation - KeyDocs
|
See the worst KeyDocs list for documents needing review now and the names of the responsible people.
Tuesday 9th September
- Looking a bit better. Will review in more detail at the core ops meeting (next Thursday 18th @ 11:30am unless there is a clash).
Tuesday 2nd September
- This work needs a kick-start! Reminders should now be being received.
- Tom/Andrew in discussion about options for the main site - the main candidates are WordPress and Drupal.
|
Interoperation - EGI ops agendas
|
Tuesday 9th September
- Mostly a short meeting covering product updates from over the summer.
- Please read the agenda/minutes for the full set, but to pull out a couple of things:
- Note that as per http://dmc.web.cern.ch, gfal and lcg-util are in end-of-life mode and support will end for both on 1st November.
- FTS3, SQUID and CVMFS will soon be included in UMD; early adopters are requested.
- Next meeting planned for October 6th.
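As a migration aid for the gfal/lcg-util retirement noted above, here is a rough sketch of how the most common lcg-util commands map onto gfal2-util replacements. The mapping and helper function are illustrative only; consult the DMC documentation at http://dmc.web.cern.ch for the authoritative list.

```python
# Illustrative mapping from deprecated lcg-util commands to their
# gfal2-util replacements; verify against the DMC documentation
# before relying on this in scripts.
LCG_TO_GFAL2 = {
    "lcg-cp": "gfal-copy",   # file copy between SEs / local disk
    "lcg-ls": "gfal-ls",     # directory listing / stat
    "lcg-del": "gfal-rm",    # replica deletion
}

def suggest_replacement(old_cmd):
    """Suggest the gfal2-util command to try in place of an lcg-util one."""
    return LCG_TO_GFAL2.get(old_cmd, "no direct equivalent known")
```

A helper like this could be used to flag deprecated commands in site cron jobs before the 1st November support cut-off.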
Monday 8th September
|
Monitoring - Links MyWLCG
|
Tuesday 23rd September
- Next (replacement) meeting this Friday to continue discussions.
Tuesday 16th September
- SAM Nagios probes refactoring TF meeting
We had the first SAM Nagios probe refactoring TF meeting on 12th September. Some of the identified issues are listed on the TF wiki:
https://wiki.egi.eu/wiki/SAM_Nagios_probes_refactoring_TF
We need the operations team's opinion on some of the issues:
- Removing LFC tests: everyone at the meeting agreed they should be removed.
- Use of WMS for SAM tests.
- Testing SRM client tools from the WN.
Tuesday 2nd September
- Monitoring consolidation meeting last Friday
- Squid monitoring TF meeting last Thursday
|
On-duty - Dashboard ROD rota
|
Monday 22nd September
- Quiet week - little to report.
- EGI is looking for people to join the ops portal review and testing TF.
Tuesday 2nd September
- Sussex is back in business - kept closing their low-availability alarm with reference to the GGUS ticket.
- The UCL ticket is now finally receiving some attention.
- Ongoing problems at RAL.
Tuesday 26th August
- RAL : Nagios jobs staying in queue for long time - to be investigated.
- Sussex : Matt probably needs help from some SGE experts.
- UCL : No acknowledgement from the site (ticket escalated to second level).
- 100IT : There is an alarm from EGI federated cloud - this needs discussion.
- Durham : Availability alarms - require constant closing with some comments. Ticket with devs is open.
Tuesday 12th August
- Last week was quiet.
- Still one or two responses needed for the next rota allocations.
|
Security - Incident Procedure Policies Rota
|
Tuesday 23rd September
- High priority vs critical tests in Pakiti.
Monday 8th September
- There was a security team meeting last Wednesday.
- There was a CA TAG meeting also last Wednesday.
Monday 11th August
- Topics as mentioned during the last GridPP technical meeting.
- There is an issue at the moment in the evaluation of vulnerabilities causing everything rated 'High' by Pakiti to display as 'Critical' in the Dashboard.
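The intended behaviour can be sketched as a simple severity mapping; the bug described above behaves as if 'High' were being collapsed into 'Critical'. All names here are hypothetical, not the actual Dashboard code.

```python
# Hypothetical sketch (not the actual Dashboard code) of the intended
# Pakiti-rating-to-Dashboard mapping; the reported bug behaves as if
# 'High' were mapped to 'Critical'.
SEVERITY_MAP = {
    "Critical": "Critical",
    "High": "High",      # the buggy Dashboard displays this as 'Critical'
    "Medium": "Medium",
    "Low": "Low",
}

def dashboard_severity(pakiti_rating):
    """Return the Dashboard display level for a Pakiti vulnerability rating."""
    return SEVERITY_MAP.get(pakiti_rating, "Unknown")
```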
|
|
Services - PerfSonar dashboard | GridPP VOMS
|
- This includes notification of (inter)national services that will have an outage in the coming weeks or will be impacted by work elsewhere. (Cross-check with the Tier-1 update.)
Tuesday 23rd September
Tuesday 16th September
Tuesday 9th September
- RIPE probes now hosted: Cambridge, Sheffield, Liverpool, Lancaster (& Oxford and QMUL). Glasgow connected but no data.
- RIPE probes not yet hosted: 6 sites.
|
Tickets
|
Monday 29th September 2014, 15.00 BST
24 Open UK tickets this week.
TIER 1
107935(27/8)
Inconsistent BDII/SRM reported storage numbers for ATLASHOTDISK. No news on this ticket for quite a while. In Progress (3/9)
107880(26/8)
Not quite a RAL ticket: Sno+ asking about how to accommodate users who only have access to srm tools to access data. Nothing since Henry's helpful input on the 10th. Anyone with any ideas? I've mentioned the UI tarball but I think that's clutching at the weediest of straws. In progress (29/9)
Sno+ having trouble running jobs.
Sheffield: 108716(23/9)
Lancaster: 108715(23/9)
Matt M reports that Sno+ are having troubles running at Lancaster and Sheffield - it looks like we could be seeing the same problem, I'll let you (Elena) know what I find out. Both In Progress.
100IT
108356(10/9)
100IT not running VMCatcher at their site. There was some trouble creating an AppDB profile, but this was solved (108548). No news on this ticket since then. In progress (17/9)
SUSSEX
108765(24/9)
Sussex has a BDII nagios check ticket that has escalated - can you please give it an update and if you're stuck let us know - Lancaster got a similar ticket on Friday and these issues are annoying to tackle. In Progress (29/9)
RHUL
108448(12/9)
Just a warning that this atlas data transfer ticket has been re-opened on you, with a new set of transfers detected. Reopened (29/9)
(It could well be that 108856 is a duplicate of this ticket.)
|
Tools - MyEGI Nagios
|
Tuesday 16th Sep
The multi-VO Nagios instance maintained at Oxford has been upgraded to add ARC CE tests.
https://vo-nagios.physics.ox.ac.uk/nagios/
It is currently monitoring gridpp, pheno, t2k.org, snoplus.snolab.ca and vo.southgrid.ac.uk.
Should we start monitoring it more actively and open tickets for sites failing tests?
Monday 14th July
Winnie reported on Saturday 12th July that most of the UK sites were failing Nagios tests. The problem started with an unscheduled power cut at around 2pm on 11th July at the Greek site hosting an EGI message broker (mq.afroditi.hellasgrid.gr). The message broker was put into downtime, but top BDIIs continued to publish it for quite a long time. Stephen Burke mentioned in a TB-SUPPORT thread that the default caching time is now 4 days. When I checked on Monday morning only Manchester was still publishing mq.afroditi, and it went away after Alessandra manually restarted the top BDII. It seems that Imperial is configured with a much shorter cache time.
Only Oxford and Imperial were largely unaffected; the reason may be that Oxford's WNs have the Imperial top BDII as the first option in BDII_LIST. Other NGIs have reported the same problem, and this outage is likely to be considered when calculating availability/reliability. All Nagios tests have now come back to normal.
Emir reported this on tools-admin mailing list
"We were planning to raise this issue at the next Operations meeting. In these extreme cases 24h cache rule in Top BDII has to be somehow circumvented."
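The caching behaviour described above can be sketched roughly as follows. The TTL value and the function are illustrative (the thread mentions a roughly 4-day default), not the actual top-BDII implementation.

```python
# Illustrative sketch (not the real top-BDII code) of why a dead
# endpoint lingers: cached entries keep being served until they are
# older than the cache TTL, so a long TTL keeps publishing a service
# well after it has gone down.
CACHE_TTL_SECONDS = 4 * 24 * 3600  # the ~4-day default mentioned above

def still_published(seconds_since_last_seen, ttl=CACHE_TTL_SECONDS):
    """True while a cached endpoint would still appear in query results."""
    return seconds_since_last_seen < ttl
```

Under this sketch a broker last seen two days ago is still served, which matches the observed behaviour; a site configured with a much shorter TTL (as Imperial appears to be) would drop the dead broker far sooner.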
|
VOs - GridPP VOMS VO IDs Approved VO table
|
Monday 11th August
- Steve J sent an email to hyperk on 7th regarding "software directory for Hyperk (CVMFS)" and entries in the VO ID card.
Monday 14th July 2014
- HyperK.org will initially use remote storage (irods at QMUL) - so CPU resources would be appreciated.
Monday 30th June 2014
- HyperK.org request for support from other sites
- 2TB storage requested.
- CVMFS required
- Cernatschool.org
- WebDAV access to storage - world read works at QMUL.
- ideally will configure federated access with DFC as LFC allows.
Monday 16 June 2014
- CVMFS
- Snoplus almost ready to move to CVMFS - waiting on two sites. Will use symlinks in the existing software area.
- VOMS server: Snoplus has problems with some of the VOMS servers - see GGUS 106243 - may be related to an update.
Tuesday 15th April
|
Site Updates
|
Tuesday 9th September
- Intel announced the new generation of Xeon based on Haswell.
Tuesday 20th May
- Various sites but notably Oxford have ARGUS problems. 100s of requests seen per minute. Performance issues have been noted after initial installation at RAL, QMUL and others.
|
|