Operations Bulletin 020913

From GridPP Wiki

Latest revision as of 08:36, 2 September 2013

Bulletin archive


Week commencing 26th August 2013
Task Areas
General updates

Tuesday 27th August

  • An EGI task force has started (see kickoff agenda) to develop the use of CVMFS by non-LHC VOs. Task force progress is being tracked in the associated CVMFS wiki page. Webinar at 10am on 5th September.
  • There is an EGI OMB today (agenda). Remember the SHA2 deadline is 1st October and tickets are tracking upgrade plans.

Tuesday 20th August

  • An EMI-2 SL5 tarball WN has been put into CVMFS thanks to Matt and Jakob (see GGUS:96030 for further information).
  • SL6 s/w areas have been discussed several times - the conclusion is that it is too late for a single approach to be adopted.
  • The WLCG T2 availability/reliability figures are now final.
  • A provisional date of 2nd September has been scheduled for the GOCDB v5 release. This is a major release which involves migration of the data to a new domain model that supports different DBs. The change involves the use of new primary keys. For more (testing) information see here.


Monday 12th August

  • Redirector problems affecting many UK ATLAS sites.
  • EGI reliability/availability of UK core services for July 2013 (i.e. top BDII) is recorded as good (almost 100%).
  • EGI also monitors site availability/reliability (results akin to the WLCG numbers) and the figures are available via these tables. For July use this link.
  • WLCG operations generally quiet (Monday's summary). Main UK issue is CMS Caltech-RAL transfers - under investigation.
  • A reminder that registration for GridPP31 is now open. The meeting will be at Imperial on 24th and 25th September.
WLCG Operations Coordination - Agendas

Monday 12th August

  • There have been no recent meetings. The next is on 29th August.

Tuesday 16th July

  • SL6
    • EMI-3 voms-proxy-info: a third problem, with Java eating away memory. You can follow the story in tickets GGUS 94878 and GGUS 95574.
      • A fix is in the testing repositories and has been tested at Liverpool and Oxford.
    • UK status: 4 sites online, 3 testing, 7 with a plan, 3 without a plan (UCL, Durham, RALPP).
    • Presentation today at Atlas ADC weekly
    • Checking now with sites how LHCb is doing. Not running everywhere it seems.
  • Monitoring
  • Next Coord meeting Thursday 18/7/2013

Tuesday 9th July

  • SL6
    • Atlas new sw validation system scalability problem has been solved.
    • voms are now in the EMI-3 repository. No testing or prod PT repositories are necessary.
    • UK status: 3&1/2 sites online, 3 testing, 7 with a plan, 4 without a plan (UCL, Durham, RALPP, SUSX).
    • HS06: T0 tests on the compilers didn't give significant differences. Hepix has started an SL6 HS06 page where sites are welcome to post their results SL6 HS06 benchmark results
  • Monitoring
    • A WLCG Monitoring consolidation group has been formed to consolidate the WLCG monitoring. It does not cover all the monitoring: a portion developed by the experiments is not included, but it does concern the well-known dashboards.
      • WLCG monitoring Initial status.
      • First meeting last week. The experiments have already given a first evaluation; sites will be represented via WLCG Ops Coordination. To get feedback from sites, a group has been set up to collect site opinions (see Maria's slide). Anyone interested should contact Pepe Flix (jflix@NOSPAMpic.es). David Crooks and Kashif might want to be part of it as this touches on the GridPP core tasks.
    • Among things interesting to discuss
      • myWLCG vs SUM tests: they both get the information from the same source, i.e. Nagios.
      • Personalised dashboard looks interesting but was never publicized much.
      • Sites monitoring requirements: SUM tests not representing the real experiment status for example.
Tier-1 - Status Page

Tuesday 27th August

  • It is a privilege meeting at RAL today.

Tuesday 20th August

  • Problems (timeouts) on cms_tape believed to be fixed.
  • Completed update of CVMFS to 2.1.14-1 in response to EGI-SVG-2013-5890
  • Testing of alternative batch system (Condor/ARC CEs/SL6) proceeding.
Storage & Data Management - Agendas/Minutes

Tuesday 15th July

  • Three sites are now running all UK FTS traffic via the 'FTS3' service as a test. Mostly successful (a small issue with a few US sites is to be resolved before taking tests further).

Tuesday 28th May

  • The 'Big Data' agenda is being compiled here. There is also now a suggestion for a cross disciplinary clouds and virtualisation workshop in July - the idea is 'in progress' but no more detail is yet available.


Accounting - UK Grid Metrics HEPSPEC06 Atlas Dashboard HS06

Tuesday 13th August

Tuesday 23rd July

  • Sites moving to SL6 are reminded of the need to re-benchmark their WNs. Some sites have updated the wiki already and provide an idea of the performance change.
  • There is an ongoing PMB discussion about the timeline for the next Tier-2 hardware tranche. Please let Pete or Jeremy know if your site will benefit from a spend this financial year.

Tuesday 30th April

  • A discussion is starting about how to account/reward disk that is reallocated to LHCb. By way of background, LHCb is changing its computing model to use more of Tier-2 sites. They plan to start with a small number of big/good T2 sites in the first instance, and commission them as T2-Ds with disk. Ideally such sites will provide >300TB but for now may allocate 100TB and build it up over time. Andrew McNab is coordinating the activity for LHCb. (Note the PMB is already aware that funding was not previously allocated for LHCb disk at T2s).

Tuesday 12th March

  • APEL publishing stopped for Lancaster, QMUL and ECDF

Tuesday 12th February

  • SL HS06 page shows some odd ratios. Steve says he now takes "HS06 cpu numbers direct from ATLAS" and his page does get stuck every now and then.
  • An update of the metrics page has been requested.
Documentation - KeyDocs

See the worst KeyDocs list for documents needing review now and the names of the responsible people.

Tuesday 27th August

  • The document status was reviewed at a core ops meeting last Thursday. Several documents are to be reassigned and some removed.
  • Looking at VO lifecycle.

KeyDocs monitoring status: Grid Storage(7/0) Documentation(3/0) On-duty coordination(3/0) Staged rollout(3/0) Ticket follow-up(3/0) Regional tools(3/0) Security(3/0) Monitoring(3/0) Accounting(3/0) Core Grid services(3/0) Wider VO issues(3/0) Grid interoperation(3/0) Cluster Management(1/0) (brackets show total/missing)
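For anyone wanting to track the KeyDocs status line over time, a minimal sketch of parsing it is given below. It assumes only the "Area(total/missing)" layout visible in the line above; the function names are hypothetical and not part of any GridPP tool.

```python
import re

# Status line as it appears in this bulletin (brackets show total/missing).
STATUS_LINE = (
    "Grid Storage(7/0) Documentation(3/0) On-duty coordination(3/0) "
    "Staged rollout(3/0) Ticket follow-up(3/0) Regional tools(3/0) "
    "Security(3/0) Monitoring(3/0) Accounting(3/0) Core Grid services(3/0) "
    "Wider VO issues(3/0) Grid interoperation(3/0) Cluster Management(1/0) "
    "(brackets show total/missing)"
)

# One entry is an area name followed by "(total/missing)".
_ENTRY = re.compile(r"\s*([^()]+?)\((\d+)/(\d+)\)")


def parse_keydocs_status(line):
    """Return {area: (total, missing)} for each 'Area(total/missing)' entry."""
    return {
        m.group(1).strip(): (int(m.group(2)), int(m.group(3)))
        for m in _ENTRY.finditer(line)
    }


def areas_with_missing_docs(status):
    """List the areas whose 'missing' count is non-zero."""
    return [area for area, (_, missing) in status.items() if missing > 0]


if __name__ == "__main__":
    status = parse_keydocs_status(STATUS_LINE)
    for area, (total, missing) in status.items():
        print(f"{area}: {total} documents, {missing} missing")
```

The trailing explanatory note in brackets is skipped because it contains no "digits/digits" pair.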

Interoperation - EGI ops agendas

Monday 19th August

  • Yesterday's agenda. Attended by David+ and Raul.
  • There is an EGI discussion forum for UMD updates.
  • Coming releases: DPM (being tested in EPEL); BDII core (bug and vulnerability fix); BDII site (bug fixes).
  • Noted that dCache 2.6.5 contains a serious bug: "if you have a permanent migration thread and issue a 'save', the next time you save after a restart, the migration will be saved as 'null', causing a second restart to fail, so something you do now can cause a fail to happen in 2 months time when you restart dcache for some other issue". Thought fixed but not rolled out.

  • New in staged Rollout:
    • emi-ui - 3.0.2
    • gridsite - 2.1.2

This is a hotfix for GridSite, allowing the previously disallowed dash character in delegation IDs. Delegation IDs containing non-alphanumeric characters other than a dot, comma, underscore or dash are rejected. It also properly sets the type of proxy before calling the signing function from caNl.

    • canl - 2.1.2

This is a hotfix for a bug whereby the type of proxy to sign was erroneously hard-coded to a single value for different types of proxies, most importantly affecting RFC proxies.

    • cream - 1.16.1

Authentication and authorization in the CREAM service now make use of the CAnL library. The gLite security libraries are no longer required.

    • blah - 1.20.2

Memory leak in BNotifier

  • Already in SR:
    • bdii-site - 1.2.1

This new version of the site BDII contains a fix in the ldap info provider script to set cached GLUE state attributes to 'Unknown'. Bug fixes: BUG #101709: Set to 'Unknown' ldap info provider cached state attributes for the site BDII.

    • bdii-top - 1.1.1

This version of the top BDII fixes a bug in the publication of delayed delete GLUE entries. A new plugin is responsible for publishing cached entries with value 'Unknown' in the corresponding GLUE state attributes. This version also includes a bug fix in the glite-info-update-endpoints script.

    • wms - 3.6.0

This version solves the problem with Argus and WMS integration (SL6).

    • voms - 3.2.0

VOMS Admin now supports Group managers, a mechanism which allows the hierarchical dispatching of the notifications resulting from user VO membership and group membership requests.

    • apel - 2.2.0

vulnerability bug fix


  • SHA-2 monitoring: The Nagios instance midmon continues to monitor the services not supporting SHA-2. An extensive overview will be part of next week's OMB meeting.

Currently there are two products not yet supporting SHA-2: dCache and StoRM.

  • dCache version 2.6.5 has been released by the product team, and currently it is in the UMD software provisioning process.

The SHA-2 supporting version for the 2.2.x golden release is expected, but not yet released by the product team. The target for the 2.2.x version is UMD-2.

  • StoRM: the latest UMD-3 version is 1.11.1. It supports SHA-2 but has some critical load issues. It is recommended not to deploy this version.

StoRM PT is testing a new release (1.11.2) which solves the critical issues of the current one. Currently midmon is not generating critical alarms for dCache and StoRM: there are no related alarms in the operations dashboard

  • VOMS monitoring for SHA-2 support has been re-enabled. It was suspended because of a problem in the resource information provider, which was causing some false positives

The problem has been identified (it was related to a sudo version), and information has been provided to ROD teams. In case of a false positive with VOMS, site administrators should be able to solve the problem quickly and make the service publish itself again.

This is believed to have been fixed but not yet rolled out.
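As a rough illustration of the delegation ID rule described in the gridsite 2.1.2 note above (alphanumerics plus dot, comma, underscore and dash are accepted, anything else is rejected), a minimal sketch follows. This is not GridSite's own code: the function name is hypothetical, and the ASCII-only and non-empty requirements are assumptions rather than something stated in the release note.

```python
import re

# Characters accepted per the release note: letters, digits, dot, comma,
# underscore and dash. Assumption: ASCII only, and the ID must be non-empty.
_ALLOWED_ID = re.compile(r"[A-Za-z0-9.,_-]+")


def is_acceptable_delegation_id(delegation_id):
    """Return True if every character of the ID is in the accepted set."""
    return _ALLOWED_ID.fullmatch(delegation_id) is not None


if __name__ == "__main__":
    for candidate in ("abc-123", "a.b,c_d", "bad/id", "has space"):
        verdict = "accepted" if is_acceptable_delegation_id(candidate) else "rejected"
        print(f"{candidate!r}: {verdict}")
```

An allow-list pattern is used here rather than a list of forbidden characters, so that any character not explicitly named in the note is rejected by default.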


gLite support calendar.


Monitoring - Links MyWLCG

Tuesday 23rd July

Tuesday 18th June

  • David C is taking feedback on the Graphite implementation presented at the HEPSYSMAN meeting. Also considering integrating Site Nagios.
  • Glasgow dashboard now packaged and can be downloaded here.
On-duty - Dashboard ROD rota

Tuesday 27th August

  • A COD ticket was raised due to overdue SHA2 tickets: https://ggus.eu/ws/ticket_info.php?ticket=96765
  • QMUL has an odd alarm for a non-production machine: eu.egi.MPI-GOCDB-Check. The machine appears to be declared correctly in the GOCDB.
  • Sussex is quite far behind in its APEL publishing.
  • Durham has several issues.

Tuesday 20th August

  • Some sites not running yum update for their EMI2 CREAM-CEs.
  • Rota updated. Gareth from the T1 will participate.
  • Glasgow will continue to provide effort in this area.


Rollout Status WLCG Baseline

Tuesday 9th July

  • New EMI2 and EMI3 release yesterday. No staged rollout requests yet. Imperial upgraded their WMS and they have been somewhat shaky ever since.

Tuesday 18th June

  • New EMI3 CE coming into SR. Liverpool will test.
  • A lot of EMI3 testing done at Brunel.
  • EMI-3 testing page contains all issues I am aware of. It's a Wiki though, so if you find an issue, please put it in the appropriate category.

Tuesday 14th May

  • A reminder. Please could sites fill out the EMI-3 testing contributions page. This is for all testing not just SR sites as we want to know which sites have experience with each component.

References


Security - Incident Procedure Policies Rota

Tuesday 20th August

  • Several sites showing up in pakiti this week.
  • Update on workaround discussed last week.


Services - PerfSonar dashboard | GridPP VOMS

Tuesday 27th August

  • PerfSONAR: version 3.3.1 was released on 21st Aug. This update should fix problems with the WLCG mesh (which can now be included in the config file). Indeed traceroute and pingER are now working at Cambridge and Imperial but not, yet, at Oxford or Lancaster. Please could sites upgrade to this or at a minimum check that their existing perfsonar host(s) are working OK on the dashboard.

Tuesday 13th August

  • PerfSONAR: The dashboard is showing more problems across various sites - presumably the monitoring?
  • VOMS: Using multiple VOMS servers in the ops portal (VO ID cards) requires careful use of designations.

Tuesday 23rd July

  • PerfSONAR: the issues with the WLCG mesh appear to be understood and a new minor release (e.g. 3.3.1) is likely to be released. In the meantime please could sites upgrade by following instructions here but leave the WLCG mesh URL (tests-wlcg-all.json) commented out. Please also update the site progress page.
  • Where are we with the VOMS rollout?

Monday 10th June

  • Issue with neurogrid.incf.org ownership. Is more guidance needed?
  • Where are we with the perfsonar mesh?
  • Are we ready for full rollout of the VOMS backups?


Tickets

Tuesday 27th August 2013, 09.30 BST
45 Open tickets this week. For the ticket list click here.

Some progress since last week. There remain gLexec Tickets, SHA-2 tickets, NGS site decommissioning tickets and Unresponsive VO tickets (minos and supernemo no change). Looking at the 'red' tickets that are not 'on hold' or interesting tickets.

NGI
https://ggus.eu/ws/ticket_info.php?ticket=96634 (15/8)
The "cloud" site, 100IT, has received a certification ticket. Assigned (15/8) (Child ticket of https://ggus.eu/ws/ticket_info.php?ticket=94780)


DURHAM
https://ggus.eu/ws/ticket_info.php?ticket=96530 (9/8)
Another 444444 waiting job ticket, could do with a bit of an update. In progress (12/8). No update since last ops meeting.

https://ggus.eu/ws/ticket_info.php?ticket=95302 (1/7)
Durham's gLexec ticket. Could do with a spot of soothing - either an update or on-holding. In progress (12/7). No updates in recent weeks.

Added a COD ticket https://ggus.eu/ws/ticket_info.php?ticket=96628. APEL pub failure. In progress (15/8)


TIER 1
https://ggus.eu/ws/ticket_info.php?ticket=95996 (22/7)
One of the SHA-2 tickets, this could do with an update or the ROD will have to declare it overdue. In progress (22/7). Catalin responded.

https://ggus.eu/ws/ticket_info.php?ticket=96321 (2/8)
Sno+ SAM jobs failing at RAL. Probably a problem with the cert these jobs are run under, I've involved Kashif in the ticket. Waiting for reply (19/8). Update - Kashif and Chris have had an exchange about this; as discussed last week the problem is due to Castor being VOMS unaware.

EFDA-JET
https://ggus.eu/ws/ticket_info.php?ticket=96625 (15/8)
Issue with installed certificates. In progress (15/8)
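When cross-referencing the ticket links above against other lists, it can help to pull out the ticket number itself. The sketch below assumes only the URL layout seen in this bulletin (a "ticket" query parameter); the helper name is hypothetical.

```python
from urllib.parse import parse_qs, urlparse


def ggus_ticket_id(url):
    """Return the numeric 'ticket' query parameter of a GGUS link, or None."""
    values = parse_qs(urlparse(url).query).get("ticket")
    if values and values[0].isdigit():
        return int(values[0])
    return None


if __name__ == "__main__":
    links = [
        "https://ggus.eu/ws/ticket_info.php?ticket=96634",
        "https://ggus.eu/ws/ticket_info.php?ticket=96530",
    ]
    print(sorted(ggus_ticket_id(link) for link in links))
```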


Tools - MyEGI Nagios

Monday 12 Aug

  • VO-Nagios

t2wlcgnagios.physics.ox.ac.uk monitors a few UK VOs and uses a Robot Certificate. Due to some confusion the Robot Certificate expired. I had applied for an extension beforehand, but the Cert Wizard doesn't understand Robot Certificates and I thought it had been extended. Finally Jens stepped in and sorted it out. VO Nagios is now working.

  • SHA2 Certificates

I have been issued a SHA2 certificate by Jens. I tested a few CEs and some interesting results came out. The GridPP VOMS server is SHA2 compatible, so SHA2 proxies can be created for VOs hosted at voms.gridpp.ac.uk. None of the CERN VOMS servers are SHA2 compatible, but there is a workaround: adding a secondary SHA2 certificate. Details are here https://twiki.cern.ch/twiki/bin/view/LCG/SHA2readinessTesting#SHA_2_VOMS_server I have added my SHA2 certificate but it is not approved yet, as most people are on holiday. Interestingly, when I submitted a few jobs using ngs.ac.uk with the SHA2 certificate, they finished successfully on CEs which are not supposed to be SHA2 compliant. I will test again with the OPS VO to confirm it.
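To confirm whether a given certificate is actually SHA-2 signed before testing with it, one option is to look at the "Signature Algorithm" line printed by `openssl x509 -in cert.pem -noout -text`. The sketch below only classifies that text output; the function names and the sample strings are illustrative assumptions, not taken from any of the certificates mentioned here.

```python
import re

# Digest names in the SHA-2 family as they appear inside OpenSSL algorithm
# names such as "sha256WithRSAEncryption" or "ecdsa-with-SHA256".
_SHA2_DIGESTS = ("sha224", "sha256", "sha384", "sha512")


def signature_algorithm(openssl_text):
    """Return the first 'Signature Algorithm' value found, or None."""
    match = re.search(r"Signature Algorithm:\s*(\S+)", openssl_text)
    return match.group(1) if match else None


def is_sha2_signed(openssl_text):
    """True if the reported signature algorithm uses a SHA-2 family digest."""
    algorithm = signature_algorithm(openssl_text)
    if algorithm is None:
        return False
    lowered = algorithm.lower()
    return any(digest in lowered for digest in _SHA2_DIGESTS)


if __name__ == "__main__":
    # Illustrative snippets of `openssl x509 -text` output.
    print(is_sha2_signed("    Signature Algorithm: sha256WithRSAEncryption"))
    print(is_sha2_signed("    Signature Algorithm: sha1WithRSAEncryption"))
```

This checks only how the certificate itself was signed; it says nothing about whether a CE or VOMS server will accept it.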

Tuesday 23rd July

  • In a campaign to update VO ID card details it turns out that a few of our supported VOs are obsolete: babar, possibly supernemo and ngs.ac.uk. The first of these can be safely removed but we need to confirm our announcement process.
VOs - GridPP VOMS VO IDs Approved VO table

Monday 19 August

  • EPIC
    • Support requested at Tier-1
    • Any other sites prepared to support them?
  • Catalogue synchronisation - Biomed working on it.


Monday 12 August

  • HyperK.org
    • VOMS servers set up (Manchester, Oxford, Imperial)
    • VOID card - stalled on a homepage.
    • WMS set up (Imperial) - awaiting Glasgow, Ral
    • Site set up (QMUL)
    • LFC - in progress
    • CVMFS - considering


  • SNO+
    • Dirac set up for some CEs
  • Epic
    • Doing stuff
  • ngs.ac.uk VO - any reason to keep it?
  • Software areas for SL6
    • Are we keeping the same areas as sl5?
    • What about the software tags?
    • Push CVMFS?

Friday 2 August 2013

  • SNO+ would like to streamline their submission
    • Is Dirac possible?
  • WebDAV support at RAL LFC
    • Firewall seems to be in the way.
  • HyperK.org
    • Waiting on WMS support from somebody.
    • 1 month so far from starting this off - can we do this quicker next time?
Site Updates

Actions


Meeting Summaries
Project Management Board - Members Minutes Quarterly Reports

Empty

GridPP ops meeting - Agendas Actions Core Tasks

Empty


RAL Tier-1 Experiment Liaison Meeting (Wednesday 13:30) Agenda EVO meeting

Wednesday 14th August

  • Operations report
  • There have been some problems with the cms-tape service class within Castor that are under investigation.
  • The test ARC-CEs have been enabled for more non-LHC VOs (now includes hone, biomed, mice, na62, superb, snoplus.)
  • We continue to work with Atlas on the testing of FTS3.
WLCG Grid Deployment Board - Agendas MB agendas

Empty



NGI UK - Homepage CA

Empty

Events

Empty

UK ATLAS - Shifter view News & Links

Empty

UK CMS

Empty

UK LHCb

Empty

UK OTHER
  • N/A
To note

  • N/A