Operations Bulletin 150713

From GridPP Wiki

Latest revision as of 08:06, 15 July 2013

Bulletin archive


Week commencing 8th July 2013
Task Areas
General updates

Tuesday 9th July

  • Interested parties had a Puppet discussion meeting on Monday 8th (agenda) to try to establish common best practice among UK PP sites. There will be another meeting in EVO on 22nd July, and there is a possibility of arranging a face-to-face meeting.
  • There is a Pre-GDB this afternoon (agenda)
  • There is a GDB tomorrow (agenda)

Tuesday 18th June

  • There was a GDB last Wednesday (agenda: minutes). The T2 report will appear in our wiki area soon.
  • Note that glexec deployment is now being monitored more closely.
  • Joint T1/T2 availability/reliability reporting (more information).
  • There is an EGI OMB meeting today. Discussing whether SHA-2 compliance should be mandatory (Storm, WMS, dCache user feedback needed)... other topics similar to GDB updates plus one on EGI brokering VOs access to 'opportunistic pool' resources.
  • Andrew Sansum is moving forwards with the suggestion to move retired servers from the Tier-1 to Tier-2 sites. Sites will be asked to explain what equipment they would find useful and why... so please be ready for when a request arrives.
  • A reminder that SL6 migration has started... please keep the Ops coordination table updated with regard to plans. The problems summary and progress are being tracked by the ops coordination working group.
  • Registration has opened for the EGI Technical Forum. A call for posters and papers will close on 23rd June.
  • There is an LHCOPN/LHCONE meeting this week (Mark is attending and may report back next week). Agenda.
  • WebDav deployment is being tracked on this page - please keep it up-to-date.


WLCG Operations Coordination - Agendas

Tuesday 9th July

  • SL6
    • The scalability problem in the new ATLAS software validation system has been solved.
    • voms are now in the EMI-3 repository. No testing or prod PT repositories are necessary.
    • UK status: 3½ sites online, 3 testing, 7 with a plan, 4 without a plan (UCL, Durham, RALPP, SUSX).
    • HS06: T0 tests on the compilers didn't show significant differences. HEPiX has started an SL6 HS06 page where sites are welcome to post their SL6 HS06 benchmark results.
  • Monitoring
    • A WLCG Monitoring Consolidation group has been set up to consolidate WLCG monitoring. It does not cover all monitoring: a portion developed by the experiments is not included, but it does concern the well-known dashboards.
      • WLCG monitoring Initial status.
      • First meeting last week. The experiments have already given a first evaluation; sites will be represented via WLCG Ops Coordination. To get feedback from sites, a group has been set up to collect site opinions (see Maria's slide). Anyone interested should contact Pepe Flix (jflix@NOSPAMpic.es). David Crooks and Kashif might want to be part of it as this touches on the GridPP core tasks.
    • Topics of interest for discussion:
      • myWLCG vs SUM tests: both get their information from the same source, i.e. Nagios.
      • The personalised dashboard looks interesting but was never publicised much.
      • Site monitoring requirements: for example, SUM tests do not represent the real experiment status.

Tuesday 4th July

Tier-1 - Status Page

Tuesday 9th July

  • Last week's update of the Castor Nameserver to version 2.1.13 was successful. The Atlas stager will be upgraded tomorrow (Wed. 10th July).
  • We had another occurrence of (we believe) the bug that caused problems for the Atlas castor instance on Friday late afternoon. (This is fixed in 2.1.13).
  • We have had some further CVMFS problems. Note that we have been using a new version of the client to try to track down a problem with batch job set-up.
  • The number of ALICE batch jobs has been capped since the firewall problem linked to BitTorrent. However, as ALICE have a workaround for this, we are raising the limit.
  • Testing of alternative batch system proceeding.
  • The single 10Gbit uplink for data has been saturated most of the time this last fortnight.

Storage & Data Management - Agendas/Minutes

Tuesday 28th May

  • The 'Big Data' agenda is being compiled here. There is also now a suggestion for a cross disciplinary clouds and virtualisation workshop in July - the idea is 'in progress' but no more detail is yet available.

Tuesday 21st May

  • Do we have an agenda page for the June workshop?


Accounting - UK Grid Metrics HEPSPEC06 Atlas Dashboard HS06

Tuesday 30th April

  • A discussion is starting about how to account for and reward disk that is reallocated to LHCb. By way of background, LHCb is changing its computing model to make more use of Tier-2 sites. They plan to start with a small number of big/good T2 sites in the first instance, and commission them as T2-Ds with disk. Ideally such sites will provide >300TB, but for now they may allocate 100TB and build it up over time. Andrew McNab is coordinating the activity for LHCb. (Note the PMB is already aware that funding was not previously allocated for LHCb disk at T2s.)

Tuesday 12th March

  • APEL publishing stopped for Lancaster, QMUL and ECDF

Tuesday 12th February

  • SL HS06 page shows some odd ratios. Steve says he now takes "HS06 cpu numbers direct from ATLAS" and his page does get stuck every now and then.
  • An update of the metrics page has been requested.

Documentation - KeyDocs

See the worst KeyDocs list for documents needing review now and the names of the responsible people.

Tuesday 30th April

Tuesday 9th April

  • Please could those responsible for key documents start addressing the completeness of documents for which they are responsible? Thank you.

Tuesday 26th February

KeyDocs monitoring status: Grid Storage(7/0) Documentation(3/0) On-duty coordination(3/0) Staged rollout(3/0) Ticket follow-up(3/0) Regional tools(3/0) Security(3/0) Monitoring(3/0) Accounting(3/0) Core Grid services(3/0) Wider VO issues(3/0) Grid interoperation(3/0) Cluster Management(1/0) (brackets show total/missing)
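The status line above packs one `(total/missing)` pair per task area. As a small illustration only, the following Python sketch parses such a line into a dictionary. The function name `parse_keydocs_status` is hypothetical, and the regular expression assumes exactly the `Name(total/missing)` layout shown in this bulletin.

```python
import re
from typing import Dict, Tuple

# One entry looks like "Grid Storage(7/0)": an area name followed by
# "(total/missing)". The area name may contain spaces but no parentheses.
_ENTRY = re.compile(r"\s*(?P<area>[^()]+?)\((?P<total>\d+)/(?P<missing>\d+)\)")


def parse_keydocs_status(line: str) -> Dict[str, Tuple[int, int]]:
    """Map each task area to its (total, missing) document counts."""
    # Drop an optional "KeyDocs monitoring status:" style prefix.
    body = line.split(":", 1)[-1]
    return {
        m.group("area").strip(): (int(m.group("total")), int(m.group("missing")))
        for m in _ENTRY.finditer(body)
    }
```

The trailing "(brackets show total/missing)" note is ignored because it contains no digits in the expected position.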

Thu, 25th April

  • Process for de-emailing the host certificates of a site. This allows an admin to replace certificates that have email address components in the DN, which cause havoc in the security system.

Monday, 4th March

  • New draft document for putting a CE in downtime. It discusses the pros and cons of three approaches. Needs discussion to finalise.
  • EPIC integrated into Approved VOs, pending full acceptance.
  • Process commissioned to delete stale documents.

Thursday, 29th November

The Approved VOs document has been updated to automatically contain a table that lays out the resource requirements for each VO, as well as the maximum. We need to discuss whether this is useful - it seems that the majority of WN software requirements are passed around by word of mouth etc. Should this be formalized? Please see

   https://www.gridpp.ac.uk/wiki/GridPP_approved_VOs#VO_Resource_Requirements

This table will be kept up to date with a regular process that syncs it with the CIC Portal, should it prove to be useful.

Interoperation - EGI ops agendas

Monday 10th June

  • There was an EGI ops meeting today.
  • Staged Rollout: Highlights from the SR/UMD process are:

    • cream-slurm: verification under way.
    • cream-gridengine: new EMI-3 release on 3rd June fixes some issues with the previous version. Verification to start shortly.
    • StoRM: v1.11.10 in EMI-3 passed verification.
    • EMI-WN: for EMI-2, update to fix issues with SL6.

  • SHA-2: Grid CAs will be able to issue certificates based on the SHA-2 digest soon. As a prelude to that, there's a list of software versions that support SHA-2 certificates.
  • An official calendar will be set out shortly, but there will be Ops portal alarms for sites with software that doesn't support SHA-2. In general, this is the EMI-2 / UMD-2 version; with the following exceptions (version with support in brackets):

    • CREAM (v1.14.4 from EMI-2 does, so a recent CREAM should do)
    • dCache (not released yet)
    • Pseudonymity (EMI-3)
    • StoRM (EMI-3, v1.11.10)
    • WMS (EMI-3, v3.5.0)

gLite support calendar.


Monitoring - Links MyWLCG

Tuesday 18th June

  • David C is taking feedback on the Graphite implementation presented at the HEPSYSMAN meeting. Also considering integrating Site Nagios.
  • Glasgow dashboard now packaged and can be downloaded here.

On-duty - Dashboard ROD rota

Monday 17th June

  • Quiet week. No longstanding issues. Ticket opened for this afternoon's alarm.


Rollout Status WLCG Baseline

Tuesday 9th July

  • New EMI2 and EMI3 release yesterday. No staged rollout requests yet. Imperial upgraded their WMS and they have been somewhat shaky ever since.

Tuesday 18th June

  • New EMI3 CE coming into SR. Liverpool will test.
  • A lot of EMI3 testing done at Brunel.
  • The EMI-3 testing page contains all issues I am aware of. It's a wiki though, so if you find an issue, please put it in the appropriate category.

Tuesday 14th May

  • A reminder. Please could sites fill out the EMI-3 testing contributions page. This is for all testing not just SR sites as we want to know which sites have experience with each component.

References


Security - Incident Procedure Policies Rota

Tuesday 11th June

  • We would like to collect immediate feedback on the security training held last week in conjunction with HEPSYSMAN.
  • Suggestions on future training, and on last week's content, would be useful.
  • John added a wiki page on forensics.

Tuesday 21st May

  • SL6 vulnerability. Need to track progress. (See private thread).


Services - PerfSonar dashboard | GridPP VOMS

Monday 10th June

  • Issue with neurogrid.incf.org ownership. Is more guidance needed?
  • Where are we with the perfsonar mesh?
  • Are we ready for full rollout of the VOMS backups?

Monday 20th May

  • Letter sent to Internet2 from GridPP management.

Tuesday 14th May

  • The perfSONAR support team is asking for statements from the projects using it to help secure funding for their team. The WLCG TF is looking for statements from the WLCG MB and computing coordinators, but it was agreed that statements from the sites would also help. Below is the email sent to the users mailing list.

Tuesday 23rd April

Tickets

Monday 8th July 2013 14.30 BST
34 open UK tickets this week. Let's dive in.

Unresponsive VOs hosted by NGI_UK (5/7)
https://ggus.eu/ws/ticket_info.php?ticket=95442 - The parent ticket; there's also mention of oxgrid.ox.ac.uk in the last update.
https://ggus.eu/ws/ticket_info.php?ticket=95474 - camont
https://ggus.eu/ws/ticket_info.php?ticket=95473 - gridpp
https://ggus.eu/ws/ticket_info.php?ticket=95472 - minos.vo.gridpp.ac.uk
https://ggus.eu/ws/ticket_info.php?ticket=95471 - Don't know for sure who this pertains to, perhaps vo.northgrid.ac.uk?
https://ggus.eu/ws/ticket_info.php?ticket=95470 - babar
https://ggus.eu/ws/ticket_info.php?ticket=95469 - Can't quite figure out who this is for either, probably supernemo.vo.eu-egee.org

These tickets are asking for the CiC portal entries to be updated and/or cleaned up for the relevant VOs. It could be that some of these VOs are no longer used and need the Old Yeller treatment. Not much progress on any of these tickets, but they were only opened on the 5th.

GLEXEC TICKETS (1/7)
A number of sites have changed the ticket category from "Incident" to "Change Request", which I would encourage. It's not an incident unless we fail to meet a deadline!
CAMBRIDGE https://ggus.eu/ws/ticket_info.php?ticket=95306 John put in a brief plan. In progress.
BRISTOL https://ggus.eu/ws/ticket_info.php?ticket=95305 In progress.
BIRMINGHAM https://ggus.eu/ws/ticket_info.php?ticket=95304 In progress.
ECDF https://ggus.eu/ws/ticket_info.php?ticket=95303 "Not urgent". On hold.
DURHAM https://ggus.eu/ws/ticket_info.php?ticket=95302 Plan to implement soon, with support from Glasgow. In progress.
SHEFFIELD https://ggus.eu/ws/ticket_info.php?ticket=95301 "gLexec installation is in progress". In progress.
MANCHESTER https://ggus.eu/ws/ticket_info.php?ticket=95300 Will roll out gLExec during the SL6 deployment on 14/10. On hold.
LANCASTER https://ggus.eu/ws/ticket_info.php?ticket=95299 Working on a gLExec tarball (if possible). In progress.
UCL https://ggus.eu/ws/ticket_info.php?ticket=95298 Will try to deploy later in July. On hold.
RHUL https://ggus.eu/ws/ticket_info.php?ticket=95297 Working on it. In progress.
QMUL https://ggus.eu/ws/ticket_info.php?ticket=95296 Hope to mix it in with the SL6/EMI3 move. ATLAS's problems on EMI3/SL6 might delay it though. In progress.
EFDA-JET https://ggus.eu/ws/ticket_info.php?ticket=95295 An exchange occurred over JET's nature. In progress.
SUSSEX https://ggus.eu/ws/ticket_info.php?ticket=95309 Assigned. My usual thought is that Emyr is on holiday?

OXFORD and RALPP have closed their tickets, as they already have gLExec deployed and were only ticketed due to glitches at their sites at the time.

BIRMINGHAM
https://ggus.eu/ws/ticket_info.php?ticket=95418 (4/7)
ALICE have ticketed the site about enabling CVMFS for them. Assigned (8/7)

MANCHESTER
https://ggus.eu/ws/ticket_info.php?ticket=95487 (6/7)
LHCb SAM jobs weren't picking up the VO_LHCB_SW_DIR env variable. Assigned (6/7)

ECDF
https://ggus.eu/ws/ticket_info.php?ticket=94873 (14/6)
The LHCb reply seems to have been to just set the ticket back to "In progress", so you might as well close this one. In progress (2/7)

TIER 1
https://ggus.eu/ws/ticket_info.php?ticket=91658 (20/2)
Catalin has put plans down for rolling out a read-only LFC WebDAV frontend to the Oracle DB as a standalone host alongside the current lfc.gridpp.rl.ac.uk alias. Please let him know if he's missed something. In progress (2/7)


Ticket Summary Supplemental!

Too late, I noticed that Raul was asking for some relevant tickets to be brought up:

He submitted:
https://ggus.eu/ws/ticket_info.php?ticket=95110
regarding his recent (bad) experiences after "upgrading" CVMFS.

Which is pertinent to two active UK tickets at the moment:
https://ggus.eu/ws/ticket_info.php?ticket=95125 (Brunel)
https://ggus.eu/ws/ticket_info.php?ticket=94880 (Imperial)

At last report the I.C. admins were hopeful that some changes they had made would help sort things out.

Raul's advice: "don't upgrade now as they are trying to release new version that should be more resilient." Hopefully the tickets above will have some hints for those who have already taken the plunge.


Tools - MyEGI Nagios

Tuesday 11th June

  • Installation of DIRAC instance at IC ready for 'another' test user.

Tuesday 13th November

  • Noticed two issues during the Tier-1 power cut. The SRM and direct CREAM submission tests use the top BDII defined in the Nagios configuration to query for the resource. These tests started to fail because the RAL top BDII was not accessible. The configuration doesn't use BDII_LIST, so I cannot define more than one BDII. I am looking into how to make it more robust.
  • The Nagios web interface was not accessible to a few users because GOCDB was down. This is a bug in SAM-Nagios and I have opened a ticket.

Site availability has not been affected by this issue, because Nagios sends a warning alert when it cannot find a resource through the BDII.
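The single-BDII dependency described above can be illustrated with a short Python sketch of the alternative: trying a list of BDII endpoints in order and using the first one that answers. This is not the SAM-Nagios implementation. The function `query_first_available`, the injected `query` callable, the `fake_query` stand-in and the example hostnames are all hypothetical; only the idea of configuring more than one BDII comes from the text.

```python
from typing import Callable, Iterable, List, Tuple


class AllEndpointsFailed(Exception):
    """Raised when no endpoint in the list could answer the query."""


def query_first_available(
    endpoints: Iterable[str],
    query: Callable[[str], List[str]],
) -> Tuple[str, List[str]]:
    """Try each endpoint in order; return (endpoint, result) for the first success."""
    errors = []
    for endpoint in endpoints:
        try:
            return endpoint, query(endpoint)
        except OSError as exc:  # ConnectionError and timeouts are OSError subclasses
            errors.append(f"{endpoint}: {exc}")
    raise AllEndpointsFailed("; ".join(errors) or "no endpoints configured")


def fake_query(endpoint: str) -> List[str]:
    """Hypothetical stand-in for a real information-system lookup."""
    if endpoint == "bdii-down.example.org:2170":
        raise ConnectionError("unreachable")
    return ["srm.example.org"]
```

With a list such as `["bdii-down.example.org:2170", "bdii-up.example.org:2170"]`, the lookup falls through to the second endpoint instead of failing the test outright.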


VOs - GridPP VOMS VO IDs Approved VO table

Mon 17th June

  • SNO+ request for Ubuntu UI. Do we have one?
  • Short DIRAC update from Janusz
  • cernatschool.org VO WMS enabled at Glasgow - waiting for testing. Operations portal entry to be created.

Thurs 6th June

  • SNO+ jobs now work through the Glasgow WMS

Mon 20 May

  • RAL wms02 and wm03 seem to have been taken out of commission but were still in the information system.
  • Glasgow WMS doesn't accept SNO+ jobs (https://ggus.eu/ws/ticket_info.php?ticket=94213)
  • SNO+ filling with water and expect to be taking test data Aug/Sept - expect more grid use after that.
  • EPIC doing serious testing - running at Glasgow, Liverpool and Lancs.

Thurs 16 May

  • SL6 - likely to be deployed for LHC VOs, non LHC should be aware - see mail to vo-admins list.


Site Updates

Actions


Meeting Summaries
Project Management Board - Members Minutes Quarterly Reports

Empty

GridPP ops meeting - Agendas Actions Core Tasks

Empty


RAL Tier-1 Experiment Liaison Meeting (Wednesday 13:30) Agenda EVO meeting

Wednesday 3rd July

  • Operations report
  • The RAL CVMFS Stratum 1 server failed on Wednesday (27th June). CVMFS problems appeared on the Tier-1 batch farm and affected Tier-2s. A post mortem is being prepared for this incident.
  • There were problems with the ATLAS Castor instance that led to an outage from Friday afternoon until Sunday morning. This has subsequently been traced to a known bug in Castor. A post mortem is being prepared for this incident too.
  • The Castor Nameserver was successfully upgraded to version 2.1.13-9 this morning. Stager upgrades will follow in the coming weeks.

WLCG Grid Deployment Board - Agendas MB agendas

Empty



NGI UK - Homepage CA

Empty

Events

Empty

UK ATLAS - Shifter view News & Links

Empty

UK CMS

Empty

UK LHCb

Empty

UK OTHER

  • N/A

To note

  • N/A