General updates
|
Tuesday 30th September
Monday 22nd September
- Operations Portal v3.0.5 is now online.
- There were some reports of CMS overloading WNs... discussion needed?
- There was an EGI OMB last week (see the agenda and minutes):
- we have a few more days to feed back on the revised OLAs;
- gstat has now got some support again;
- feedback is requested on the EGI.eu SLA;
- site certification (tagging and process) within the EGI federated cloud is being re-evaluated;
- ARGUS central banning is being piloted, after which rollout across EGI is expected in early 2015;
- SAM probe changes for ops have been agreed (e.g. removing the FTS2 tests);
- Some products are still seeking early adopter sites (e.g. CREAM-LSF and CVMFS).
- The next EGI OMB is on 30th October.
- There is an EGI conference this week: 'Challenges and Solutions for Big Data processing on Cloud'.
Monday 15th September
Monday 8th September
- Storage placement - survey TBC.
|
WLCG Operations Coordination - Agendas
|
Tuesday 23rd September
- There was a WLCG ops coordination meeting last Thursday (see the agenda and minutes).
- Some highlights....
- News: At the MB it was decided that WLCG Operations will investigate efficiency ideas in more detail and carry out a survey among sites to understand where effort is currently used.
- Baselines: New BDII version (1.6.0) released in EMI-3, containing small fixes for GLUE 2; new versions of the EMI-3 UI and WN available, with new dependencies (gfal2-util and ginfo) and a new bouncycastle-mail version required for SHA-2 VOMS; Frontier/Squid 2.7.STABLE9-19, with enhancements and bug fixes.
- MW issues: There is an issue in the info-provider for StoRM.
- Oracle: migration to GoldenGate is progressing well at CERN.
- T0: Investigating use cases for the AFS UI; from October SAM tests will use the EGI WMS; lxplus5/lxbatch5 are still used by users to build binaries under SLC5; a KB article describes how to set up a private VM under OpenStack with all the relevant RPMs.
- T2: gfal packages will not be removed from the repositories. WN version 3.1.0, including gfal2, will be declared baseline as soon as it is also released in UMD.
- ALICE: investigation of job failure rates and inefficiencies.
- ATLAS: NTR
- CMS: Bad incident with xrootd fallback test on Sep 16. Want to close 'unknowns' in the storage accounting. Check out the AAA page.
- LHCb: NTR
- glexec: Minor site developments. ATLAS: the pilot is now able to use glExec where it is found working, falling back to traditional mode otherwise. LHCb implemented glExec support a long time ago but has not yet started pushing sites to use it.
- Machine/job features: NTR
- Middleware readiness: Next meeting on 6th October.
- Multicore: Publishing multicore accounting to APEL works.
- SHA-2: SAM prod machines have been using the new servers since Tuesday 16th September. Checking experiment systems.
- WMS decomm: Deployment of the Condor-based SAM probes is planned for Wednesday 1st October 2014.
- IPv6: NTR
- Squid/HTTP: NTR
- Network & transfer metrics: Good participation in the first meeting. An initial overview of the current status of network and transfer metrics was presented, and a list of topics and tasks to work on in the short term was proposed.
Tuesday 16th September
- The next core ops meeting is on 18th September.
- The next multi-core meeting is today at 14:30 CERN time. It is on dynamic partitioning with LSF at CNAF.
|
Tier-1 - Status Page
|
Tuesday 30th September
- We are planning to stop access for all VOs apart from ALICE to our CREAM CEs today.
- There were some problems with the switch of the ATLAS Frontier system to the new database last week, and that change was reverted.
- We have declared an 'At Risk' on the site for a couple of hours tomorrow morning for the next quarterly UPS/generator load test.
|
Storage & Data Management - Agendas/Minutes
|
Wednesday 10th September
- High load at Liverpool causing low throughput - how to throttle xroot transfers? (And is the load necessary, or a bug?)
- Still testing WebFTS
- Prep for DPM workshop
Monday 1st September
- FAX sites to update the C++ N2N rpms.
- There is interest regarding issues/performance when placing storage outside firewalls. JC will shortly start a (closed) discussion/survey.
Monday 11th August
- Pool nodes at RHUL have received test errors.
|
Documentation - KeyDocs
|
See the worst KeyDocs list for documents needing review now and the names of the responsible people.
Tuesday 9th September
- Looking a bit better. Will review in more detail at the core ops meeting (next Thursday 18th @ 11:30am unless there is a clash).
Tuesday 2nd September
- This work needs a kick-start! Reminders should now be being received.
- Tom/Andrew in discussion about options for the main site - the main candidates are WordPress and Drupal.
|
Interoperation - EGI ops agendas
|
Tuesday 9th September
- Mostly a short meeting covering product updates from over the summer.
- Please read the agenda/minutes for the full set, but to pull out a couple of things:
- Note that as per http://dmc.web.cern.ch, gfal and lcg-util are in end-of-life mode and support will end for both on 1st November.
- FTS3, SQUID and CVMFS will soon be included in UMD; early adopters are requested.
- Next meeting planned for October 6th.
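As a migration aid for the gfal/lcg-util retirement noted above, here is a rough sketch of how the most common lcg-util commands map onto gfal2-util replacements. The mapping and helper function are illustrative only; consult the DMC documentation at http://dmc.web.cern.ch for the authoritative list.

```python
# Illustrative mapping from deprecated lcg-util commands to their
# gfal2-util replacements; verify against the DMC documentation
# before relying on this in scripts.
LCG_TO_GFAL2 = {
    "lcg-cp": "gfal-copy",   # file copy between SEs / local disk
    "lcg-ls": "gfal-ls",     # directory listing / stat
    "lcg-del": "gfal-rm",    # replica deletion
}

def suggest_replacement(old_cmd):
    """Suggest the gfal2-util command to try in place of an lcg-util one."""
    return LCG_TO_GFAL2.get(old_cmd, "no direct equivalent known")
```

A helper like this could be used to flag deprecated commands in site cron jobs before the 1st November support cut-off.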
Monday 8th September
|
Monitoring - Links MyWLCG
|
Tuesday 23rd September
- Next (replacement) meeting this Friday to continue discussions.
Tuesday 16th September
- SAM Nagios probes refactoring TF meeting
We had the first SAM Nagios probe refactoring TF meeting on 12th September. Some of the identified issues are listed on the TF wiki:
https://wiki.egi.eu/wiki/SAM_Nagios_probes_refactoring_TF
We need the operations team's opinion on some of the issues:
- Removing LFC tests: everyone at the meeting agreed they should be removed.
- Use of WMS for SAM tests.
- Testing SRM client tools from the WN.
Tuesday 2nd September
- Monitoring consolidation meeting last Friday
- Squid monitoring TF meeting last Thursday
|
On-duty - Dashboard ROD rota
|
Monday 22nd September
- Quiet week - little to report.
- EGI is looking for people to join the ops portal review and testing TF.
Tuesday 2nd September
- Sussex is back in business - kept closing their low-availability alarm with reference to the GGUS ticket.
- The UCL ticket is now finally receiving some attention.
- Ongoing problems at RAL.
Tuesday 26th August
- RAL : Nagios jobs staying in queue for long time - to be investigated.
- Sussex : Matt probably needs help from some SGE experts.
- UCL : No acknowledgement from the site (ticket escalated to second level).
- 100IT : There is an alarm from EGI federated cloud - this needs discussion.
- Durham : Availability alarms - require constant closing with some comments. Ticket with devs is open.
Tuesday 12th August
- Last week was quiet.
- Still one or two responses needed for the next rota allocations.
|
Security - Incident Procedure Policies Rota
|
Tuesday 23rd September
- High priority vs critical tests in Pakiti.
Monday 8th September
- There was a security team meeting last Wednesday.
- There was a CA TAG meeting also last Wednesday.
Monday 11th August
- Topics as mentioned during the last GridPP technical meeting.
- There is an issue at the moment in the evaluation of vulnerabilities causing everything rated 'High' by Pakiti to display as 'Critical' in the Dashboard.
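The intended behaviour can be sketched as a simple severity mapping; the bug described above behaves as if 'High' were being collapsed into 'Critical'. All names here are hypothetical, not the actual Dashboard code.

```python
# Hypothetical sketch (not the actual Dashboard code) of the intended
# Pakiti-rating-to-Dashboard mapping; the reported bug behaves as if
# 'High' were mapped to 'Critical'.
SEVERITY_MAP = {
    "Critical": "Critical",
    "High": "High",      # the buggy Dashboard displays this as 'Critical'
    "Medium": "Medium",
    "Low": "Low",
}

def dashboard_severity(pakiti_rating):
    """Return the Dashboard display level for a Pakiti vulnerability rating."""
    return SEVERITY_MAP.get(pakiti_rating, "Unknown")
```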
|
|
Services - PerfSonar dashboard | GridPP VOMS
|
- This includes notification of (inter)national services that will have an outage in the coming weeks or will be impacted by work elsewhere. (Cross-check with the Tier-1 update.)
Tuesday 23rd September
Tuesday 16th September
Tuesday 9th September
- RIPE probes now hosted: Cambridge, Sheffield, Liverpool, Lancaster (& Oxford and QMUL). Glasgow connected but no data.
- RIPE probes not yet hosted: 6 sites.
|
Tickets
|
Monday 29th September 2014, 15.00 BST
24 Open UK tickets this week.
TIER 1
107935(27/8)
Inconsistent BDII/SRM reported storage numbers for ATLASHOTDISK. No news on this ticket for quite a while. In Progress (3/9)
107880(26/8)
Not quite a RAL ticket: Sno+ asking about how to accommodate users who only have access to srm tools to access data. Nothing since Henry's helpful input on the 10th. Anyone with any ideas? I've mentioned the UI tarball but I think that's clutching at the weediest of straws. In progress (29/9)
Sno+ having trouble running jobs.
Sheffield: 108716(23/9)
Lancaster: 108715(23/9)
Matt M reports that Sno+ are having troubles running at Lancaster and Sheffield - it looks like we could be seeing the same problem, I'll let you (Elena) know what I find out. Both In Progress.
100IT
108356(10/9)
100IT not running VMCatcher at their site. There was some trouble creating an AppDB profile, but this was solved (108548). No news on this ticket since then. In progress (17/9)
SUSSEX
108765(24/9)
Sussex has a BDII nagios check ticket that has escalated - can you please give it an update and if you're stuck let us know - Lancaster got a similar ticket on Friday and these issues are annoying to tackle. In Progress (29/9)
RHUL
108448(12/9)
Just a warning that this atlas data transfer ticket has been re-opened on you, with a new set of transfers detected. Reopened (29/9)
(It could well be that 108856 is a duplicate of this ticket.)
|
Tools - MyEGI Nagios
|
Tuesday 16th Sep
The multi-VO Nagios instance maintained at Oxford has been upgraded to add ARC CE tests.
https://vo-nagios.physics.ox.ac.uk/nagios/
It is currently monitoring gridpp, pheno, t2k.org, snoplus.snolab.ca and vo.southgrid.ac.uk.
Should we start monitoring it more actively and open tickets for sites failing tests?
Monday 14th July
Winnie reported on Saturday 12th July that most of the UK sites were failing Nagios tests. The problem started with an unscheduled power cut at around 2pm on 11th July at the Greek site hosting an EGI message broker (mq.afroditi.hellasgrid.gr). The message broker was put into downtime, but top BDIIs continued to publish it for quite a long time. Stephen Burke mentioned in a TB-SUPPORT thread that the default caching time is now 4 days. When I checked on Monday morning only Manchester was still publishing mq.afroditi, and it went away after Alessandra manually restarted the top BDII. It seems that Imperial is configured with a much shorter cache time.
Only Oxford and Imperial were largely unaffected; the reason may be that Oxford's WNs have the Imperial top BDII as the first option in BDII_LIST. Other NGIs have reported the same problem, and this outage is likely to be considered when calculating availability/reliability. All Nagios tests have now come back to normal.
Emir reported this on tools-admin mailing list
"We were planning to raise this issue at the next Operations meeting. In these extreme cases 24h cache rule in Top BDII has to be somehow circumvented."
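The caching behaviour described above can be sketched roughly as follows. The TTL value and the function are illustrative (the thread mentions a roughly 4-day default), not the actual top-BDII implementation.

```python
# Illustrative sketch (not the real top-BDII code) of why a dead
# endpoint lingers: cached entries keep being served until they are
# older than the cache TTL, so a long TTL keeps publishing a service
# well after it has gone down.
CACHE_TTL_SECONDS = 4 * 24 * 3600  # the ~4-day default mentioned above

def still_published(seconds_since_last_seen, ttl=CACHE_TTL_SECONDS):
    """True while a cached endpoint would still appear in query results."""
    return seconds_since_last_seen < ttl
```

Under this sketch a broker last seen two days ago is still served, which matches the observed behaviour; a site configured with a much shorter TTL (as Imperial appears to be) would drop the dead broker far sooner.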
|
VOs - GridPP VOMS VO IDs Approved VO table
|
Monday 11th August
- Steve J sent an email to hyperk on 7th regarding "software directory for Hyperk (CVMFS)" and entries in the VO ID card.
Monday 14th July 2014
- HyperK.org will initially use remote storage (irods at QMUL) - so CPU resources would be appreciated.
Monday 30th June 2014
- HyperK.org request for support from other sites
- 2TB storage requested.
- CVMFS required
- Cernatschool.org
- WebDAV access to storage - world read works at QMUL.
- ideally will configure federated access with DFC as LFC allows.
Monday 16 June 2014
- CVMFS
- Snoplus almost ready to move to CVMFS - waiting on two sites. Will use symlinks in the existing software area.
- VOMS server: Snoplus has problems with some of the VOMS servers - see GGUS 106243 - may be related to an update.
Tuesday 15th April
|
Site Updates
|
Tuesday 9th September
- Intel announced the new generation of Xeon based on Haswell.
Tuesday 20th May
- Various sites but notably Oxford have ARGUS problems. 100s of requests seen per minute. Performance issues have been noted after initial installation at RAL, QMUL and others.
|
|