Difference between revisions of "Operations Bulletin Latest"

Revision as of 09:54, 10 February 2015

Bulletin archive

Week commencing 9th February 2015

Task Areas

General updates

Tuesday 10th February

Is there support to start using the bulletin sections (bottom of this page) for experiment input? It helps review issues and progress if the information is in one place.
Pete has reminded us of HEPIX in [https://indico.cern.ch/event/346931/ Oxford 23rd-27th March 2015].
WLCG draft T2 reliability/availability figures have been circulated for January 2015 - please respond if your site was below the 90% targets! ALICE; ATLAS; CMS; LHCb.
There is an LHCONE/LHCOPN meeting in Cambridge 9th & 10th February with good Geant participation. Some interesting links from yesterday's talks: The CERN netstat displays - could we do something for GridPP to show traffic levels month to month? Brocade have initiated a CERN tier partner program. If you are in REBUS you can benefit. A lot of discussion in the afternoon was around the LHCONE AUP!
A cloud traceability workshop/meeting is taking place today - Vidyo is available. GridPP is well represented.
Tomorrow the February GDB has the linked agenda. David Crooks is the Tier-2 rep. Anything to raise?
Lots of core-ops discussion about CERN accounts last week. This is being followed up. Note there is an eGroup for CERN external accounts - were people aware?

Tuesday 3rd February

HEP S/W foundation workshop - agenda link. Notes from meeting.
There was an EGI OMB on Thursday 29th January. A snapshot of areas/updates discussed:

Operations: RC/RP OLA: User SLA Biovel, DRIHM, CompChem.
- QoS: (response) GGUS report monitor: https://ggus.eu/?mode=report_view
- ROD: evolution. Some small. Propose stop monitoring ROD.
- MC accounting: At 33% sites not reporting.
- New global VOs: eli-np.eu (lasers); vo.compass.cern.ch (SPS); fermilab.fnal.gov (Fermi wide)
- Cloud Accounting campaign: FedCloud sites to upgrade probe (OpenStack & OpenNebula)
- Documentation: Upcoming new procedures/manuals (PROC19 - adding resources; PROC20 - CVMFS replication).
- Guide for ops tool developers.
- EGI 2015 conference: Open for registration. April is the early bird deadline. Co-located: OGF44 & Globus meeting.
- Next OMB meeting: 26th February

Security
- Current: CVE-2014-9322 (linux kernel) ongoing. Pakiti issues. Some sites suspended.
- CVE-2015-0235 (glibc) – under consideration.
GOCDB:
- v5.3.1 out for testing.
- Login via x509 or EGI SSO; get_user PI update.
Fedcloud F2F:
- EGI-Engage (standards & capabilities) & INDIGO Datacloud (services) will be funded.
- Use cases: 26 communities with 50 use cases. 43 NatSci; 3 Medical; 3 Humanities.
- Many sites experimenting with Big Data services also for IaaS provisioning
- Cloud accounting working but being improved.
- Cloud security: response; policies; assessment (to happen)
- VM Management: Using OCCI 1.1. Support OS; ON and Synnefo.
- Data Management: CDMI uses Synnefo. Lack of support.
- Cloud Blueprint docs: Cover GOCDB, monitoring & accounting.
- AAI – AARC and EGI-Engage will push this forward.
- Monitoring: for FedCloud interfaces/services done or in dev.

Earth science e-infrastructures workshop
- Outreach included ES communities (VERCE, CLIPC, EIDA, EPOS…) + EGI.eu+Geant+EUDAT+PRACE.
- Sessions: AAI; Data management; Cloud. Use cases.
- Virtual Earthquake & Seismology Research Community in Europe (VERCE) – iRODS
- Climate4Impact: Data from Earth System Grid Federation (ESGF) 6PB. Data transfer bottlenecks.
- ... + many others...

EGI participation to PRACE data storage pilot call - brief discussion. Survey for responses.

Evolution of UMD repositories
- Will form managed UMD-preview. Preserve EMI repos characteristics.
- Web pages http://repository.egi.eu/

UMD 3.11 update
- FTS 3
- UMD u11: CREAM 1.16.4 (no VOMS client dependencies); BLAH 1.20.7; Torque 2.1.4; VOMS-admin 2.0.12; voms-clients 3.0.6; DPM 1.8.9; DMLite 0.7.2; CERN DM clients (e.g. gfal2…).

As of yesterday AFS UI at CERN closed. The top directory of the AFS UI installation is /afs/cern.ch/project/gd/LCG-share

WLCG Operations Coordination - Agendas

--

News: An updated version of the agenda for the Okinawa workshop will be produced soon.
MW Baselines: frontier/squid 2.7.STABLE9-22.
MW Issues: Ghost vulnerability. ARGUS failures with latest Java (disable SSLv3 by default) – workaround in place.
MW Services: Various upgrades mentioned for T0 and T1s.
T0 news: Latest VOMS-admin in test. If no issues VOMRS off from 16th Feb. AFS UI closed 2nd Feb.
T1: NTR
T2: Asked about CERN accounts for T2 sysadmins. Should be possible!
ALICE:
ATLAS:
CMS: Bigger run 2 MC started. Doing staging/consistency checks. >50% T1 capacity MC – partitionable slot model so slots will be used either way.
LHCb: Run-1 legacy stripping almost done. Problems due to SARA disk failure. Some issues CERN simulation jobs. APEL issues 2 sites. VOMS tests.
Glexec: With panda - 55 sites covered. T1 and biggest T2s. Issue resolved at BNL. Further ramping ATLAS.
SHA-2: old VOMS retirement – will keep special firewall/router configs until VOMRS gone. VOMS config reminder broadcast soon.
MJF: Site testing – documentation question.
MW readiness: Some steady progress.
IPv6: TBC
Squid mon + HTTP proxy dis TF: Progress. Not all yet registered so will broadcast again. Generate ATLAS/CMS pages for monitoring. Contributors busy. Additional CVMFS repos now available. Some sites put restriction.

Net+Transfer metrics:

Tuesday 3rd February

Next meeting this Thursday 5th February.
- Tier-2 feedback so far: Has there been any progress on CERN accounts or material access?

Tier-1 - Status Page

Tuesday 10th February

Patching of Oracle databases behind Castor scheduled for Feb 11th.
Some racks in the datacentre have to be moved to make way for latest procurement. This will affect one generation of production diskservers. Details are still being finalised.
The problems with our primary network router are still being followed up.

Storage & Data Management - Agendas/Minutes

Tuesday 10th February

SURFsara have produced a Service Incident Report (SIR) following their dCache problems. This might be of interest.

Wedn 28 Jan

Ready for run 2!
Towards the exascale with GridPP?
Should we spring clean our defunct VOs?

Wedn 21 Jan

Wahid's report on "protocol zoo" - can the set of protocols be simplified (and how long would it take)
Update from RAL's CEPH team.

Wedn 07 Jan

HNY. Most sites ticked over with relatively few glitches.
Next week's pre-GDB on preservation and protocols: harmonising for LHC VOs(?) but we also need to support non-LHC VOs
Apparently no major outstanding issues to sort out prior to run 2.

Wedn 17 Dec

Update from T1 CEPH team.
Last meeting of year. No updating and configuring or touching! Merry and Happy!

Accounting - UK Grid Metrics HEPSPEC06 Atlas Dashboard HS06

Tuesday 3rd February

A reminder to keep updating the HEPSPEC06 tables.

Tuesday 20th January

Some APEL delays seen for ECDF and Sheffield.

Tuesday 13th January

Some glitches over holiday period when RAL network lost, but site publishing now looks fine across all sites.

APEL status: An issue at Sheffield?

Documentation - KeyDocs

See the worst KeyDocs list for documents needing review now and the names of the responsible people.

Survey of KeyDocs
https://www.gridpp.ac.uk/wiki/Current_Activities

Tuesday 3 Feb

New VO, LSST, in Approved VOs: https://www.gridpp.ac.uk/w/index.php?title=GridPP_approved_VOs

Tuesday 27th January

Reviewed briefly at the core ops meeting last week. Doc owners are to be requested to review the usefulness of their keydocs and report any they believe should be removed from keydoc status.

Monday 8th December

Chris Walker is handing over several 'other VO' documents. Some aspects of the role are being taken on by a combination of Duncan and Daniela... but the documents need reviewing in a core-ops meeting (next Thursday @ 11am being likely).

Tuesday 4th Nov

New section in Wiki called "Project Management Pages".

The idea is to cluster all Self-Edited Site Tracking Tables
in here. Sites should keep entries in Current Activities
up to date. Once a Self-Edited Site Tracking Table has
served its purpose, PM to move it to Historical Archive
or otherwise dispose of the table.

Interoperation - EGI ops agendas

Monday 9th February

The agenda for February's EGI ops meeting is here.
- Argus PAP update (1.6.2) out imminently. This should fix the SSL problems for sites affected after a java update. Hopefully this will solve Raul's problems in his ticket 111505.
- Other update notifications include APEL 1.4.0 (with a bunch of new features), Storm 1.11.6, UNICORE v7.2.0 and a new version of Globus in EPEL testing. UMD release 3.11.0 being prepared.
- Multicore republishing was discussed - if sites need to republish multicore data previously parsed as single core then they need to reparse the logs. Any site wanting to republish is asked to submit a ticket to apel support first.
- Operational issues were discussed whereby sites saw problems running ARGUS and CREAM on the same node; special care should be taken if you want to take this mad path. See tickets 111464 and 111373.
- The decommissioning of the EMI2 APEL service and the UMD repositories was mentioned.
- Sites are advised to avoid using MySQL v.5.0 where possible.
- A comment was passed that the EPEL5 repository was becoming harder to keep up to date, bringing the effective EOL of SL5 closer for the community
  - SL7 is a topic that still needs to be discussed in earnest in this forum.
- Next meeting is March 9th - effort to keep the number of meetings down.

Monday 12th January

There was an EGI ops meeting on Monday 12th January. Minutes to follow.
- STORM 1.11.5
- SR: Another reminder to check your contacts
- Call for sites to test FTS3 and SQUID (Glasgow is testing the new squid)
- Multicore accounting: As per broadcast, sites have been asked to publish the number of cores used by jobs by updating the apel config where appropriate; instructions given to test publishing in accounting-devel.
- EGI forum in May: slightly delayed because they were waiting for approval to start fine-grained definition of programme. They will circulate the programme soon (end of the month).

Monitoring - Links MyWLCG

Monday 7th December

Meeting last Friday - agenda: https://indico.cern.ch/event/356853/ minutes: https://indico.cern.ch/event/356853/material/minutes/1.pdf
This was the wrap-up meeting of the consolidation TF; the mailing list will remain extant for a while yet.

On-duty - Dashboard ROD rota

Tuesday 27th January

Rota to be updated this week based on previous input.

Tuesday 20th January

Last week was quiet.
Kashif + Gordon (shadowing) next week.
The ROD work is to be included in the GridPP5 D/O/S activity.

Monday 13th January

A couple of tickets for systems with low availability.
There are a couple of tests for QMUL that keep being flagged up although the service is in a "not production but monitored" state.
Tier-1 aware of long standing ticket.

Rollout Status WLCG Baseline

Tuesday 20th January

From Cristina's GDB talk last week note that EMI repositories will be frozen and the product team releases will become UMD-preview (repository content and webpages similar but now managed).

Tuesday 11th November

UMD v.3.9.0 was released Monday 10th November. It supports Scientific Linux 5 and 6 and also Debian 6 (Squeeze).
As proposed during October OMB production EGI resource centres will be notified later in the month with a summary broadcast together with other communications, to reduce the number of broadcasts sent to sites.

References

Staged Rollout pages (now separated into EMI1 & 2), and the page listing the deployed versions is extractable from the bdii, so they should all be reasonably up-to-date:
http://www.hep.ph.ic.ac.uk/~dbauer/grid/staged_rollout.html
http://www.hep.ph.ic.ac.uk/~dbauer/grid/staged_rollout_emi2.html
http://www.hep.ph.ic.ac.uk/~dbauer/grid/state_of_the_nation.html

Security - Incident Procedure Policies Rota

Tuesday 27th January

Keep watching Pakiti - WNs are popping up (seemingly) unpatched at some sites!

Tuesday 20th January

DPM configuration instructions.
Status of CVE-2014-9322.
Steve's old perf package false positive.

Tuesday 13th January

EGI CSIRT alert 'High' Risk - CVE-2014-9295 - Remote code execution in NTP.
Any issues over holiday period?

Tuesday 16th December

Any update on the FTS3/GFAL bug?

The EGI security dashboard.

Services - PerfSonar dashboard | GridPP VOMS

- This includes notifying of (inter)national services that will have an outage in the coming weeks or will be impacted by work elsewhere. (Cross-check the Tier-1 update).

Tuesday 3rd February

There was a network and transfer metrics WG meeting last week. Any feedback?

Tuesday 20th January

The next LHCOPN/LHCONE meeting is in Cambridge 9-10 February.
Perfsonar is expected to be available at sites - where are we with the reinstall?

Tickets

Monday 9th February 2015, 15.00 GMT

Other VO Nagios Results
At the time of writing the only site showing red that isn't suffering from an understood problem is RALPP, with org.nordugrid.ARC-CE-submit and SRM-submit test failures for gridpp, pheno, t2k and southgrid on both its CEs and its SE. The failures are between 1 and 12 hours old, so the problem doesn't seem to be persistent, but it does seem to be quite consistent. They all seem to be failing with "Job submission failed... arcsub exited with code 256: ...ERROR: Failed to connect to XXXX(IPv4):443 .... Job submission failed, no more possible targets". Has anyone seen something like this before?

Only 20 Open UK tickets this week.

Biomed tickets:
111356 (Manchester)
111357 (Imperial)
Biomed have linked both these tickets as children of 110636, being worked on by the cream blah team. AFAIKS no sign of Cream 1.16.5 just yet.

TIER 1
111347 (22/1)
CMS consistency checks for January 2015. It looks like everything that was asked of RAL has been done by RAL, so hopefully this can be successfully closed. In progress (3/2)

111120 (12/1)
Another ticket, this time concerning a period of Atlas transfer failures between RAL and BNL, that looks like it can be closed as the failures seem to have stopped (and might well have been at the BNL end). Waiting for reply (22/1)

108944 (1/10/14)
CMS AAA test failures at RAL. Federica can't connect to the new xrootd service according to the error messages. No news for a while. In progress (29/1)

100IT 108356
111333
Both of these 100IT tickets are looking a bit crusty - the first is waiting for advice, the second was just put "In progress".

QMUL
110353 (25/11/14) Dan has set up se02.esc.qmul.ac.uk to test out the latest https-accessible version of storm for dteam and atlas. As a cherry on top this node is also IPv6 enabled. I'm not sure if Dan wants others in the UK to "give it a go"? In progress (6/2)

LANCASTER
100566 (27/1/14)
(Blatantly scrounging for advice) Trying to figure out why Lancaster's perfsonar is under-performing. Ewan kindly gave us access to an iperf endpoint and it's been very useful in characterising some of the weirdness - although I'm still confused. Ewan also gave us a bunch of suggestions for testing that have been useful - next stop, window sizes. If anyone else wants to throw advice my way, all wisdom donations are thankfully accepted. My advice for others is to be careful trying to connect to the default iperf port on a working DPM pool node.... In Progress (9/2)
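For anyone following the window-size line of investigation, the usual back-of-envelope check is the bandwidth-delay product: a single TCP stream cannot exceed window / RTT, so the window needed to fill a link is bandwidth x RTT. The sketch below is illustrative only; the 1 Gbit/s link speed, 10 ms round-trip time and 64 KiB window are assumed example values, not measurements from the Lancaster ticket.

```python
def bdp_bytes(bandwidth_bits_per_s, rtt_s):
    """Bandwidth-delay product: bytes that must be in flight to fill the link."""
    return bandwidth_bits_per_s * rtt_s / 8


def window_limited_throughput(window_bytes, rtt_s):
    """Upper bound on single-stream TCP throughput (bits/s) for a given window."""
    return window_bytes * 8 / rtt_s


# Assumed example values (hypothetical, not taken from the ticket).
LINK_BITS_PER_S = 1e9      # 1 Gbit/s path
RTT_S = 0.010              # 10 ms round-trip time
WINDOW_BYTES = 64 * 1024   # 64 KiB window

needed = bdp_bytes(LINK_BITS_PER_S, RTT_S)
ceiling = window_limited_throughput(WINDOW_BYTES, RTT_S)

print(f"window needed to fill the link: {needed / 1e6:.2f} MB")
print(f"ceiling with a 64 KiB window:   {ceiling / 1e6:.1f} Mbit/s")
```

If a measured single-stream rate sits close to the window-limited ceiling, the window size is a plausible suspect; if it sits well below, the cause is more likely elsewhere.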

Tools - MyEGI Nagios

Tuesday 27th January

Unscheduled outage of the EGI message broker (GRNET) caused a short-lived disruption to GridPP site monitoring (jobs failed) last Thursday 22nd January. Suspect BDII caching meant no immediate failover to stomp://mq.cro-ngi.hr:6163/ from stomp://mq.afroditi.hellasgrid.gr:6163/
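As a rough illustration of the failover behaviour that was expected here, the sketch below tries a list of broker URLs in order and returns the first one that accepts a connection. It is not how the SAM/Nagios tooling selects brokers (that appears to depend on what is cached from the BDII); the function names `first_reachable` and `tcp_probe` are hypothetical, and the only inputs taken from the bulletin are the two broker URLs.

```python
import socket
from urllib.parse import urlsplit

# Broker URLs as quoted in the bulletin: primary first, failover second.
BROKERS = [
    "stomp://mq.afroditi.hellasgrid.gr:6163/",
    "stomp://mq.cro-ngi.hr:6163/",
]


def tcp_probe(url, timeout=5.0):
    """Open and close a TCP connection to the broker; raises OSError on failure."""
    parts = urlsplit(url)
    with socket.create_connection((parts.hostname, parts.port), timeout=timeout):
        pass


def first_reachable(brokers, try_connect=tcp_probe):
    """Return the first broker URL for which try_connect succeeds."""
    errors = {}
    for url in brokers:
        try:
            try_connect(url)
            return url
        except OSError as exc:
            # Remember why this broker was skipped, then try the next one.
            errors[url] = exc
    raise ConnectionError(f"no broker reachable: {errors}")


if __name__ == "__main__":
    # Simulate the 22nd January outage without touching the network:
    # the GRNET broker refuses connections, the second broker accepts.
    def simulated_outage(url):
        if "hellasgrid" in url:
            raise ConnectionRefusedError("simulated broker outage")

    print(first_reachable(BROKERS, try_connect=simulated_outage))
```

Passing the probe in as a parameter keeps the ordering logic testable offline; with the default `tcp_probe` the same function would check the real endpoints.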

Tuesday 25th November

Backup SAM Nagios at Lancaster was upgraded to update-23 as part of the staged rollout process. It is a major upgrade as some tests were removed from CE and probes have moved from the SAM repository to the UMD3 repository.

Tests added:

   ch.cern.FTS3-Service
   ch.cern.FTS3-StalledTransfers
   org.bdii.GLUE2-Validate

Tests removed:

   org.nordugrid.ARC-CE-LFC-result
   org.nordugrid.ARC-CE-lfc
   org.nordugrid.ARC-CE-LFC-submit
   org.sam.WN-RepDel
   org.sam.WN-RepISenv
   org.sam.WN-RepFree
   org.sam.WN-RepCr
   org.sam.WN-RepGet
   org.sam.WN-RepRep
   org.sam.WN-Rep

The release note is available here: https://wiki.egi.eu/wiki/SAMUpdate23

Tuesday 21st October

VOMS servers for OPS based at CERN were down on Saturday 18th October for around 12 hours. Nagios tests started failing after the existing proxy expired. Availability figures will be slightly affected but the outage will be considered as unknown.

Blog about VO Nagios

http://southgrid.blogspot.co.uk/2014/10/nagios-monitoring-for-non-lhc-vos.html

Tuesday 16th Sep

Multi VO nagios maintained at Oxford has been upgraded to add ARC CE tests.
https://vo-nagios.physics.ox.ac.uk/nagios/
It is currently monitoring gridpp, pheno, t2k.org, snoplus.snolab.ca, vo.southgrid.ac.uk
Should we start monitoring it more actively and open tickets for sites failing tests?

VOs - GridPP VOMS VO IDs Approved VO table

Tuesday 6th January 2015

GEANT4 has now updated their VOMS records in the ops portal to suit the new servers, lcg-voms2.cern.ch voms2.cern.ch. The Approved VOs document has been updated to match:
https://www.gridpp.ac.uk/wiki/GridPP_approved_VOs

Tuesday 16th December 2014

Discussion of setting up CVMFS for 'other VOs'.
LondonGrid VO now established in CVMFS - decision on top level needed together with who has write access.

Impact
- Citation policy (https://www.gridpp.ac.uk/acknowledging.html)

Site Updates

Tuesday 27th January

Squids not in GOCDB for: UCL; ECDF; Birmingham; Durham; RHUL; IC; Sussex; Lancaster
Squids in GOCDB for: EFDA-JET; Manchester; Liverpool; Cambridge; Sheffield; Bristol; Brunel; QMUL; T1; Oxford; Glasgow; RALPPD.

Tuesday 2nd December

Multicore status. Queues available (63%)
- YES: RAL T1; Brunel; Imperial; QMUL; Lancaster; Liverpool; Manchester; Glasgow; Cambridge; Oxford; RALPP; Sussex (12)
- NO: RHUL (testing); UCL; Sheffield (testing); Durham; ECDF (testing); Birmingham; Bristol (7)

According to our table for cloud/VMs (26%)
- YES: RAL T1; Brunel; Imperial; Manchester; Oxford (5)
- NO: QMUL; RHUL; UCL; Lancaster; Liverpool; Sheffield; Durham; ECDF; Glasgow; Birmingham; Bristol; Cambridge; RALPP; Sussex (14)

GridPP DIRAC jobs successful (58%)
- YES: Bristol; Glasgow; Lancaster; Liverpool; Manchester; Oxford; Sheffield; Brunel; IC; QMUL; RHUL (11)
- NO: Cambridge; Durham; RALPP; RAL T1 (4) + ECDF; Sussex; UCL; Birmingham (4)

IPv6 status
- Allocation - 42%
- YES: RAL T1; Brunel; IC; QMUL; Manchester; Sheffield; Cambridge; Oxford (8)
- NO: RHUL; UCL; Lancaster; Liverpool; Durham; ECDF; Glasgow; Birmingham; Bristol; RALPP; Sussex (11)

Dual stack nodes - 21%
- YES: Brunel; IC; QMUL; Oxford (4)
- NO: RHUL; UCL; Lancaster; Glasgow; Liverpool; Manchester; Sheffield; Durham; ECDF; Birmingham; Bristol; Cambridge; RALPP; Sussex; RAL T1 (15)
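The percentages quoted in this status summary appear to be the YES count divided by all 19 sites listed. A minimal sketch of that arithmetic, using the IPv6 allocation lists above as input; the helper names `parse_sites` and `yes_percentage` are made up for illustration, and rounding to the nearest whole percent is an assumption about how the figures were produced.

```python
def parse_sites(line):
    """Split a semicolon-separated site list into clean site names."""
    return [name.strip() for name in line.split(";") if name.strip()]


def yes_percentage(yes_sites, no_sites):
    """Share of sites answering YES, rounded to a whole percentage."""
    total = len(yes_sites) + len(no_sites)
    if total == 0:
        raise ValueError("no sites listed")
    return round(100 * len(yes_sites) / total)


# IPv6 allocation lists copied from the bulletin entry above.
ipv6_yes = parse_sites(
    "RAL T1; Brunel; IC; QMUL; Manchester; Sheffield; Cambridge; Oxford"
)
ipv6_no = parse_sites(
    "RHUL; UCL; Lancaster; Liverpool; Durham; ECDF; Glasgow; "
    "Birmingham; Bristol; RALPP; Sussex"
)

print(len(ipv6_yes), len(ipv6_no), yes_percentage(ipv6_yes, ipv6_no))
```

The same function reproduces the other headline figures from their counts: 12 of 19 for multicore queues, 5 of 19 for cloud/VMs, 11 of 19 for DIRAC jobs and 4 of 19 for dual-stack nodes.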

Tuesday 21st October

High loads seen in xroot by several sites: Liverpool and RALT1... and also Bristol (see Luke's TB-S email on 16/10 for questions about changes to help).

Tuesday 9th September

Intel announced the new generation of Xeon based on Haswell.

Meeting Summaries

Project Management Board - Members Minutes Quarterly Reports

Empty

GridPP ops meeting - Agendas Actions Core Tasks

Empty

RAL Tier-1 Experiment Liaison Meeting (Wednesday 13:30) Agenda Meeting takes place on Vidyo.

Wednesday 28th January 2015

Operations report
Only ALICE left using CREAM CEs now and they should be finished with them by 16th February. Will decommission then.
Electrical power circuit testing completed.
The migration of all data off T10000A & B media has been completed.
Ongoing discussions with vendor to investigate problems on Primary Tier1 router.

WLCG Grid Deployment Board - Agendas MB agendas

Empty

NGI UK - Homepage CA

Empty

Events

UK ATLAS - Shifter view News & Links

Empty

UK CMS

Empty

UK LHCb

Empty

UK OTHER

N/A

To note

N/A

@@ Line 146: / Line 146: @@
 <!-- *********************************************************** ----->
 <!-- ***********************Start T1 text*********************** ----->
-'''Tuesday 3rd February'''
+'''Tuesday 10th February'''
-* Safety testing of electrical circuits in the R89 machine room has been completed. Some problems were encountered but no critical services affected.
+* Patching of Oracle databases behind Castor scheduled for Feb 11th.
-* Backup link to CERN moved to a new route this morning. Currently running over that link as a test.
+* Some racks in the datacentre have to be moved to make way for latest procurement. This will affect one generation of production diskservers. Details are still being finalised.
 * The problems with our primary network router are still being followed up.
-* Outage for reboot of Castor systems yesterday (to pick up latest OS patches).
 <!-- **********************End T1 text************************** ----->
 <!-- *********************************************************** ----->
