Operations Bulletin 030214

From GridPP Wiki

Latest revision as of 11:39, 3 February 2014

Bulletin archive


Week commencing 27th January 2014
Task Areas
General updates

Tuesday 28th January

  • There are suggestions for a WLCG pre-GDB on batch systems in March.
  • openssl status update
  • There was an IPv6 working group meeting at CERN last week (agenda).

Tuesday 21st January

  • ipv6.hepix.org VO request
  • WLCG T2 A/R report available. Feedback required from 5 sites.
  • Please check the VO derived A/R figures.
  • The January GDB took place last week.
  • In-kind contribution follow-up still needed for Imperial, QMUL (Steve will do this), Liverpool and Manchester, and also UCL and Durham.

Tuesday 14th January

  • The HEPSYSMAN meeting took place yesterday. Summary?
  • A summary of experiment activities is given in the WLCG ops summary from Monday.
  • ATLAS multi-core: apologies that resources are not fully utilised, but the current plan requests that sites leave 50% of resources behind these queues where agreed.
  • EGI released its availability/reliability report for sites for December 2013.
  • APEL announcement on 8th Jan: A problem was discovered at 10.00 UTC this morning with the software responsible for receiving data from EMI3 Apel clients and other clients using SSM2 (ARC/JURA, QCG, EDGI). There has been _no_ data loss. However, no data has been loaded into the accounting system from these clients since 18.00 UTC on 31st December.


WLCG Operations Coordination - Agendas

Tuesday 28th January

...

Tuesday 10th December

  • Confirmation of the multi-core task force with this mandate. Some concerns about overlaps with the machine/job features TF.
  • Discussion of experiment Christmas plans
  • Update of the baseline versions (https://twiki.cern.ch/twiki/bin/view/LCG/WLCGBaselineVersions). BDII update important for SAM BDII nodes at CERN.
  • Tier-1 WNs on OPN is now being tracked here.
  • ALICE - MC will continue over break. Best efforts approach appreciated.
  • ATLAS - plans for ramp up of MC production. Repro and analysis ramp up also expected in coming weeks.
  • CMS - Run2 MC samples prep starting. "Appreciate all support from the sites we can get, but don’t expect normal levels of support, especially for T2 sites"
  • LHCb: Usage of distributed grid resources for mainly monte carlo productions. Surveillance by the operations team on a best effort basis. Also note a new CVMFS dashboard for LHCb.
  • Christmas plans summary: "All experiments will run activities over christmas at non negligible scale. They do not require special effort from sites or WLCG in general, while best effort support is highly appreciated"
  • WMS decommissioning: looks like WMS usage by CMS is decreasing but it is variable.
  • glexec: 31 tickets remain open. Status tracked here.
  • FTS3: testing ongoing
  • Tracking tools: An engineer will be on-call for GGUS over the vacation period.
  • perfSONAR: Code maintenance is an issue with BNL funding cuts. Looking at OSG and ESNet options. 3.3.2 out soon. See Status & Plans update. Sites are asked to make the perfSONAR main page (https://<hostname>/toolkit) accessible for the central operations activity. Plans are for OSG to host the perfSONAR-PS central service; the BNL dashboard is not all correct.
  • IPv6: request from CMS to have IPv6 supported on SLC5 at CERN. Alistair D taking on ATLAS role for IPv6 testing.
  • Middleware readiness: Meeting planned for 12th December.
  • Machine/job features: Discussion between current implementation and proposed route minimizing draining waste (MDW) cpu time for multi-core pilots.
  • SHA-2: still some updates at sites ongoing (>10 sites). "by mid January the WLCG infrastructure is expected to be essentially ready". OSG plans to move in mid-January.
  • VOMRS: VOMS-Admin still in testing.
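
For the perfSONAR request above, a site could confirm that its toolkit main page is reachable from outside with something like the following. This is a minimal sketch rather than an official WLCG check: the function names and the example hostname are hypothetical, and only the `https://<hostname>/toolkit` pattern comes from the bulletin.

```python
import urllib.error
import urllib.request
from urllib.parse import urlunsplit


def toolkit_url(hostname: str) -> str:
    """Build the toolkit main-page URL (https://<hostname>/toolkit) for a host."""
    return urlunsplit(("https", hostname, "/toolkit", "", ""))


def toolkit_reachable(hostname: str, timeout: float = 10.0) -> bool:
    """Return True if the toolkit page answers with an HTTP status below 400.

    Certificates are verified with the system defaults, so a host with a
    self-signed certificate is reported as unreachable by this sketch.
    """
    try:
        with urllib.request.urlopen(toolkit_url(hostname), timeout=timeout) as response:
            return response.status < 400
    except (urllib.error.URLError, OSError):
        # Covers DNS failures, refused connections, timeouts and HTTP errors.
        return False


if __name__ == "__main__":
    # "ps.example.org" is a placeholder, not a real perfSONAR host.
    print(toolkit_url("ps.example.org"))
```

Running the check from a machine outside the site network is what matters here, since the page needs to be visible to the central operations activity.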

Tuesday 3rd December

Tier-1 - Status Page

Tuesday 28th January

  • Operations have generally been smooth. There was a problem on Thursday afternoon (23rd Jan) when the condor master daemons on worker nodes were restarted due to a configuration error. However, the system recovered by itself from this, although there were some batch job losses.
  • The microcode in the tape libraries was successfully updated last Tuesday (21st Jan). We are planning a further short (2 hour) interruption to the tape services next Tuesday (4th Feb) to test a new server that interfaces to the tape library. This is running upgraded software to enable access to higher capacity T10000D tape drives. This intervention is just for a test. The system will be returned to its current configuration afterwards. We plan to put the upgrade into service in a few weeks' time.
  • Over the last day there has been a problem with a disk area on one of the WMSs (lcgwms05) filling owing to unsuitable user jobs.
  • There has been an update to FTS3 to fix the ssl problem. The first attempt at this (last Tuesday, 21st Jan) failed. The upgrade was successfully done yesterday but included deleting all existing proxies on the FTS3 servers.
Storage & Data Management - Agendas/Minutes

Tuesday 28th January

  • Trying to liaise with DiRAC colleagues to setup a technical meeting.

Monday 9th December

  • Spacetokens for non-LHC VOs - recommendations.



Accounting - UK Grid Metrics HEPSPEC06 Atlas Dashboard HS06

Tuesday 28th January

  • It has been noted that many sites are not publishing revised benchmark HS06 figures following their upgrade to SL6. Please check your site ASAP. Tickets will be raised shortly against sites observed to be unchanged. Please include updates in the HS06 wiki page.

Tuesday 26th November

Tuesday 5th November

  • A reminder to keep an eye on the SL HS06 page for odd ratios. Steve takes HS06 cpu numbers direct from ATLAS and the page does get stuck every now and then.
  • The metrics page has been updated.

Tuesday 13th August

Documentation - KeyDocs

See the worst KeyDocs list for documents needing review now and the names of the responsible people.

Tuesday 28th January

  • Document assignments were reviewed at a core ops tasks meeting last week... so the dashboard for keydocs should improve soon!
  • It has been noted that blog activity has dropped to a very low level. The blogs have been very useful to disseminate details of what is happening at sites, to share technical solutions and to make GridPP's work more visible. They also help avoid duplication... so please give some thought to reviving your blogs!
Interoperation - EGI ops agendas

Tuesday 21 January

  • Notes from the last meeting (Tuesday 14th). Agenda: https://wiki.egi.eu/wiki/Agenda-14-01-2014 . Summary:
    • Updates for Gridsite and Globus gatekeeper as part of openssl; Gridsite in staged rollout
    • Flagged the webdav StoRM update 1.11.3 from December
    • Releases: only major release ARC 4.0.0
    • SHA-2 update: 99% of services completed. No discussion of NGIs moving to SHA-2 certs
    • Flagged that the clients are not monitored, but for dcache-srm-client the first version supporting SHA-2 certificates is v2.2.22.
    • Decommission of CERN WMSes: Planned for April; it was noted by service managers that there was a high usage for ops tests.
    • We mentioned the UK's trial of RPMs for VOMS data; this was favourably received and I said we'd update in a few weeks with more experience.
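
Since the clients are not monitored, a site may want to check its own installed dcache-srm-client against the 2.2.22 threshold noted above. The helper below is a hedged sketch: the function names are hypothetical, and it assumes plain dotted version strings (package release suffixes such as "-1" are not handled). It compares versions numerically, because a plain string comparison would rank "2.2.9" above "2.2.22".

```python
from typing import Tuple


def parse_version(text: str) -> Tuple[int, ...]:
    """Turn a dotted version such as 'v2.2.22' into a tuple of integers."""
    return tuple(int(part) for part in text.lstrip("vV").split("."))


# From the meeting notes: first dcache-srm-client version supporting SHA-2.
MIN_SHA2_SRM_CLIENT = parse_version("2.2.22")


def supports_sha2(installed: str) -> bool:
    """Return True if the installed client version is at least 2.2.22."""
    return parse_version(installed) >= MIN_SHA2_SRM_CLIENT


if __name__ == "__main__":
    for version in ("2.2.9", "2.2.22", "v2.6.17"):
        print(version, supports_sha2(version))
```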

Tuesday 17th December

  • The next meeting will be on Thursday combined with the EGI OMB.

Tuesday 3rd December

  • Additional notes:
    • The 2.6.16 version of dCache mentioned has a serious bug in the migration module; 2.6.17 has this fixed so should be used in preference. The possibility of skipping 2.6.16 in the overall release of EMI-3 is being discussed.
    • Note that the cream updates mentioned in this meeting contain security updates and so are recommended.
    • Looking for CREAM/LSF plugin staged rollout, but don't believe there are any such sites in the UK
    • SHA-2 : 17 sites remaining in the EGI that are publishing SHA-2 and alarming; I don't think that any such sites in the UK (just a couple) are unaccounted for/previously documented.
    • It was asked when CAs would start issuing SHA-2 certs only (UK noting that it's planning to from January)
  • Next meeting: (last for 2013) 16th December
gLite support calendar.


Monitoring - Links MyWLCG

Tuesday 21 January

  • Update from meeting on Friday (the 17th); the main item under discussion was the nagios probes (in particular, the Condorg and CREAM-CE).

Tuesday 10th December

  • Feedback transmitted and discussed by consolidation group; next meeting is now in January.

Tuesday 26th November

  • As noted by Alessandra, if possible we'd like site feedback on the consolidated monitoring prototype before the next meeting a week on Friday to report back to the group (with thanks to everyone who has already contributed)
  • Some notes to form a wiki page on Graphite can be found at https://www.gridpp.ac.uk/wiki/MonitoringTools . These are still under development; if there are areas that people would find useful to have expanded, please let David know.
  • Glasgow dashboard now packaged and can be downloaded here.
On-duty - Dashboard ROD rota

Monday 20th January

  • Sussex have put their CE into a downtime. Cleared up some tickets after they did this.
  • The APEL alarms at Brunel reappeared from time to time. Closed these referring to a ticket Daniela raised. See: https://ggus.eu/ws/ticket_info.php?ticket=100287

Monday 13th January

  • Quiet week (after APEL problems resolved). Brunel still seems to get dubious APEL alarms occasionally.
  • Glexec at Sussex unresolved - it may be best to put this into downtime whilst the issue is addressed?


  • Thanks to the team who continued with the ROD rota over the Christmas (Andrew) and New Year (Daniela) periods!
  • Gareth needs help with ROD work next week (esp. Monday and Thursday).

Monday 6th January

  • Very quiet week. Transient problems as usual, and a couple of new tickets but other than that almost all sites working well.
  • Sussex still has the outstanding escalated/expired glexec ticket but they're hopeful about getting this sorted now.


Tuesday 17th December

  • Quiet week with no UK wide problems.
  • A few sites (EFDA, Sussex, Brunel) have tickets which haven't made any visible progress, partly because of waiting for fixes/help. The other tickets are hopefully transient problems that the sites will fix next week.

Monday 9th December

  • UCL tickets closed.


Rollout Status WLCG Baseline

Tuesday 29th Oct

  • Yesterday the first staged rollout request (for the CREAMCE) in months came through. I've updated the Stage of the Nation page.


Tuesday 8th Oct

  • There were updates to EMI2 and EMI3 yesterday, but no new request for Staged Rollout. There is a problem with dcap-libs: [GGUS 97805]


Security - Incident Procedure Policies Rota

Tuesday 14th January

  • nmap test results show 4 UK sites yet to take action on perfSONAR
  • openssl status

Tuesday 19th November

  • There was a team meeting last Friday 15th November. Next meeting on 29th.
  • Just a couple of site issues showing up in Pakiti.
  • Looking at ARGUS server for UK NGI.


Services - PerfSonar dashboard | GridPP VOMS

Tuesday 7th January

  • A perfSONAR dashboard has been established in London based on maDDash.

Tuesday 26th November

  • The main perfSONAR issues this week affect Manchester and Sussex.

Tuesday 19th November

  • There is a new dashboard. Feedback is welcome.
  • Manchester, Durham, Glasgow and Sussex show problems across the board.
Tickets

Monday 27th January 2014, 15.00 GMT
33 Open UK Tickets this week.

Courtesy of John Kewley's Posse of Ticket Wranglers we have:

OXFORD
https://ggus.eu/ws/ticket_info.php?ticket=99642 (10/12/2012)
Southgrid Backup Voms server testing. I suspect other, squeakier wheels have been getting the Oxford grease (where the heck am I going with this analogy?). Unless you're going to get stuck into it right now, probably best to On Hold until you're actually sat down actively poking it. In progress (8/1)

SHEFFIELD
https://ggus.eu/ws/ticket_info.php?ticket=100037 (3/1)
Problems with the Sheffield Perfsonar host. Looks like the Sheffield host might need an upgrade (or at least implementation of the mesh). Again, if it doesn't look like you'll get to this soon can you On Hold. In progress (13/1)

Spotted with my own eyes:

RHUL
https://ggus.eu/ws/ticket_info.php?ticket=100527 (24/1)
An atlas ticket concerning the RHUL storage. Looks like it might have snuck in amongst the Monday morning e-mail pile. Assigned (24/1)

That's all really. We're down to 33 tickets (from 42 last week). As usual I'll be going over all of them next week, but feel free to bring any up that are particularly close to your heart in the meeting or online.

Please check your site tickets here:
http://tinyurl.com/cblj3ab

Tools - MyEGI Nagios

Tuesday 26th November

  • Regional Nagios updated to release 22. It is a glite to UMD update and it required a fresh installation.
  • There have been some internal changes in SAM-Nagios. Test probes are now the responsibility of the product team. Some test names have been changed as a result of this reorganization. For example, the org.sam.CREAMCE-DirectJobSubmit test has become emi.cream.CREAMCE-DirectJobSubmit. This does not affect the operational activities.
  • Please could all site admins look at the services associated with their site and mail Kashif if anything odd is noticed. Site admins can reschedule tests for their sites and it would be helpful if most functionalities are tested.
  • Also, look at myegi which can be useful with links to the Dashboard, GSTAT, Accounting Portal and GGUS.
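
Site scripts or local dashboards that match on SAM-Nagios test names may need to cope with the renaming described above. One way to do this is a small lookup table, sketched below. The names `RENAMED_TESTS` and `current_test_name` are hypothetical, and the only mapping included is the one quoted in this bulletin; any further renames would have to be added from the actual probe list.

```python
# Old test name -> new test name. Only this entry is taken from the bulletin;
# other renamed tests are not listed here and must be added by hand.
RENAMED_TESTS = {
    "org.sam.CREAMCE-DirectJobSubmit": "emi.cream.CREAMCE-DirectJobSubmit",
}


def current_test_name(name: str) -> str:
    """Return the new name for a renamed test, or the name unchanged."""
    return RENAMED_TESTS.get(name, name)


if __name__ == "__main__":
    print(current_test_name("org.sam.CREAMCE-DirectJobSubmit"))
```

Names that are not in the table pass through untouched, so the helper is safe to apply to every test name a script encounters.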
VOs - GridPP VOMS VO IDs Approved VO table

Tuesday 28 January 2014

  • hyperk having problems with proxy renewal.
    • This may be related to openssl

Tuesday 21 January 2014

  • Backup VOMS servers now configured at sites. Some small problems remain. Sites and VOs recommended to update their UIs to use all three servers.
  • Dirac setup (from Janusz):
    • Mice
    • NA62
    • Londongrid
    • SNO+
    • Landslides (resources need to be configured)


Tuesday 9 December 2013

  • Backup VOMS server
    • VO managers still need to check sites - Scotgrid, northgrid, southgrid, londongrid, gridpp VOs were going first, but have not yet updated their status.

Monday 2nd December 2013


Monday 25th November 2013

  • CVMFS progress - but not quite there yet.
  • 6 VOs (cern@school, gridpp, na62, pheno, sno+, t2k.org) have updated their VOID card entries and updated the wiki.
  • Storage
    • Gfal2 - GGUS 99043, 99044, 99055, 99067 - not performant, but very interesting functionality
    • Webdav now enabled on LFC@RAL and ports free from firewall - needs testing

Tuesday 19 November 2013

Site Updates

Actions


Meeting Summaries
Project Management Board - Members Minutes Quarterly Reports

Empty

GridPP ops meeting - Agendas Actions Core Tasks

Empty


RAL Tier-1 Experiment Liaison Meeting (Wednesday 13:30) Agenda Meeting takes place on Vidyo.

Wednesday 29th January

  • Operations report
  • Next Tuesday (08:00-10:00) there will be a short break in tape access to test a new server that provides the interface to the tape library. This server will be required to support T10000D tape drives.
  • Work is progressing on Tier1 Network changes. On Tuesday 11th March it is planned to move the Tier1 to use the new site firewall. The plan is to install the new Routing layer for the Tier1 & change the way the Tier1 connects to the RAL network before this date.
  • FTS3 was successfully upgraded (to fix openssl problems) on Monday 27th Jan - the second try at this. During this second attempt all existing proxies on the FTS3 systems were deleted.
WLCG Grid Deployment Board - Agendas MB agendas

Empty



NGI UK - Homepage CA

Empty

Events

Empty

UK ATLAS - Shifter view News & Links

Empty

UK CMS

Empty

UK LHCb

Empty

UK OTHER
  • N/A
To note

  • N/A