Operations Bulletin 270114


Bulletin archive


Week commencing 20th January 2014
Task Areas
General updates

  • openssl status update

Tuesday 21st January

  • ipv6.hepix.org VO request
  • WLCG T2 A/R report available. Feedback required from 5 sites.
  • Please check the VO derived A/R figures.
  • The January GDB took place last week.
  • In-kind contribution follow-up is still needed for Imperial, QMUL (Steve will do this), Liverpool, Manchester, UCL and Durham.

Tuesday 14th January

  • The HEPSYSMAN meeting took place yesterday. Summary?
  • A summary of experiment activities is given in the WLCG ops summary from Monday.
  • ATLAS multi-core: apologies that resources are not being fully utilised, but the current plan asks sites to leave 50% of resources behind these queues where agreed.
  • EGI released its availability/reliability report for sites for December 2013.
  • APEL announcement on 8th Jan: A problem was discovered at 10.00 UTC this morning with the software responsible for receiving data from EMI3 Apel clients and other clients using SSM2 (ARC/JURA, QCG, EDGI). There has been _no_ data loss. However, no data has been loaded into the accounting system from these clients since 18.00 UTC on 31st December.



Tuesday 7th January


Tuesday 17th December

  • The December GDB agenda is here. Official notes are not yet available but will be placed in a summary on this page.
  • Details of, and talks from, the DPM workshop that took place in Edinburgh last Friday (13th) can be found here.
WLCG Operations Coordination - Agendas

Tuesday 21st January

  • There was a virtual meeting last Thursday - see minutes.
  • There is a F2F meeting (pre-GDB) on 11th February.

Tuesday 14th January


Tuesday 17th December

  • The next WLCG ops coordination meeting is on Thursday 16th (agenda).


Tuesday 10th December

  • Confirmation of the multi-core task force with this mandate. Some concerns about overlaps with the machine/job features TF.
  • Discussion of experiment Christmas plans
  • Update of the baseline versions (https://twiki.cern.ch/twiki/bin/view/LCG/WLCGBaselineVersions). BDII update important for SAM BDII nodes at CERN.
  • Tier-1 WNs on OPN is now being tracked here.
  • ALICE - MC will continue over break. Best efforts approach appreciated.
  • ATLAS - plans for ramp up of MC production. Repro and analysis ramp up also expected in coming weeks.
  • CMS - Run2 MC samples prep starting. "Appreciate all support from the sites we can get, but don’t expect normal levels of support, especially for T2 sites"
  • LHCb: Usage of distributed grid resources for mainly monte carlo productions. Surveillance by the operations team on a best effort basis. Also note a new CVMFS dashboard for LHCb.
  • Christmas plans summary: "All experiments will run activities over christmas at non negligible scale. They do not require special effort from sites or WLCG in general, while best effort support is highly appreciated"
  • WMS decommissioning: WMS usage by CMS appears to be decreasing, but it is variable.
  • glexec: 31 tickets remain open. Status tracked here.
  • FTS3: testing ongoing
  • Tracking tools: An engineer will be on-call for GGUS over the vacation period.
  • perfSONAR: Code maintenance is an issue with BNL funding cuts. Looking at OSG and ESNet options. 3.3.2 out soon. See Status & Plans update. Sites are asked to make the perfSONAR main page (https://<hostname>/toolkit) accessible for the central operations activity. Plans are for OSG to host the perfSONAR-PS central service; the BNL dashboard is not all correct.

  • IPv6: request from CMS to have IPV6 supported on SLC5 at CERN. Alistair D taking on ATLAS role for IPv6 testing.
  • Middleware readiness: Meeting planned for 12th December.
  • Machine/job features: Discussion between current implementation and proposed route minimizing draining waste (MDW) cpu time for multi-core pilots.
  • SHA-2: still some updates at sites ongoing (>10 sites). "by mid January the WLCG infrastructure is expected to be essentially ready". OSG plans to move in mid-January.
  • VOMRS: VOMS-Admin still in testing.

Tuesday 3rd December

Tier-1 - Status Page

Tuesday 21st January

  • Operations have generally been smooth. The tape system is down this morning (21st Jan) for an update to the microcode in the tape libraries.
  • Both disk and one CPU tranche now delivered. Last major item (second CPU tranche) being delivered this week.
  • We are working on the major upgrades to the Tier1 network and plan to have this set-up within around 6 weeks.
Storage & Data Management - Agendas/Minutes

Monday 9th December

  • Spacetokens for non-LHC VOs - recommendations.

Tuesday 8th October

  • The DPM workshop agenda and registration page will appear here.

Monday 30th September

  • A DPM workshop is being organised in Edinburgh for 13th December. GridPP PMB anticipated covering travel for around 10 UK sysadmins for this event. Interest should be indicated during the storage group meeting.



Accounting - UK Grid Metrics HEPSPEC06 Atlas Dashboard HS06

Tuesday 26th November

Tuesday 5th November

  • A reminder to keep an eye on the SL HS06 page for odd ratios. Steve takes HS06 cpu numbers direct from ATLAS and the page does get stuck every now and then.
  • The metrics page has been updated.

Tuesday 13th August

Documentation - KeyDocs

See the worst KeyDocs list for documents needing review now and the names of the responsible people.

Tuesday 7th January

  • Warnings are going out on a number of documents that need attention. Please could the owners take a look?
  • There are problems migrating the wiki schema (needed for the SHA-2 migration). Andrew has suggested a workaround and we should take comments on it.

Tuesday 17th December

  • A number of documents have gone into the warning state. Could those with responsibilities here please review their documents - the server will be emailing you with a reminder.


Monday 11 November

  • The plan for adoption of the backup servers continues to evolve. Please see the latest version here. The new version contains details of tests and concluding operations for site and VO admins.
  • The approved VOs page continues to be updated with the newest data from the operations portal.

Note: T2K now requires liblockfile-devel.

Tuesday 5th November

  • Document states will be reviewed at the core ops meeting this coming Thursday.

Tuesday 1st October

  • The approved VOs page has been updated with the newest data from the operations portal. Note that the VOMS records for LondonGrid now contain some alternative VOMS servers. The migration plan for use of these backup servers is now documented here.
Interoperation - EGI ops agendas

Tuesday 21 January

  • Notes from the last meeting, on Tuesday 14th. Agenda: https://wiki.egi.eu/wiki/Agenda-14-01-2014 . Summary:
    • Updates for Gridsite and Globus gatekeeper as part of openssl; Gridsite is in staged rollout
    • Flagged the webdav StoRM update 1.11.3 from December
    • Releases: only major release ARC 4.0.0
    • SHA-2 update: 99% of services completed. No discussion of NGIs moving to SHA-2 certs
    • Flagged that the clients are not monitored, but for dcache-srm-client the first version supporting SHA-2 certificates is v2.2.22.
    • Decommissioning of CERN WMSes: Planned for April; it was noted by service managers that there was a high usage for ops tests.
    • We mentioned the UK's trial of RPMs for VOMS data; this was favourably received and I said we'd update in a few weeks with more experience.
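
Since clients are not monitored, a site may want to check installed dcache-srm-client versions by hand against the v2.2.22 threshold noted above. The following is only a minimal sketch of that comparison; the function names are illustrative and how the installed version string is obtained is left out. Versions are compared as integer tuples because a plain string comparison would wrongly rank "2.2.4" above "2.2.22".

```python
def parse_version(text):
    """Turn a dotted version such as 'v2.2.22' into a tuple of ints."""
    return tuple(int(part) for part in text.lstrip("v").split("."))


# First dcache-srm-client version supporting SHA-2 certificates,
# as quoted in the meeting notes above.
MIN_SHA2_VERSION = parse_version("2.2.22")


def supports_sha2(installed):
    """Return True if the given client version is at or above the threshold."""
    return parse_version(installed) >= MIN_SHA2_VERSION


if __name__ == "__main__":
    # Example version strings chosen purely for illustration.
    for candidate in ("2.2.4", "2.2.22", "2.6.17"):
        print(candidate, supports_sha2(candidate))
```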

Tuesday 17th December

  • The next meeting will be on Thursday combined with the EGI OMB.

Tuesday 3rd December

  • Additional notes:
    • The 2.6.16 version of dCache mentioned has a serious bug in the migration module; 2.6.17 has this fixed so should be used in preference. The possibility of skipping 2.6.16 in the overall release of EMI-3 is being discussed.
    • Note that the cream updates mentioned in this meeting contain security updates and so are recommended.
    • Looking for CREAM/LSF plugin staged rollout, but don't believe there are any such sites in the UK
    • SHA-2: 17 sites remain in EGI that are publishing SHA-2 and alarming; I believe the UK sites among them (just a couple) are all accounted for and previously documented.
    • It was asked when CAs would start issuing SHA-2 certs only (UK noting that it's planning to from January)
  • Next meeting: (last for 2013) 16th December
gLite support calendar.


Monitoring - Links MyWLCG

Tuesday 21 January

  • Update from meeting on Friday (the 17th); the main item under discussion was the nagios probes (in particular, the Condorg and CREAM-CE).

Tuesday 10th December

  • Feedback transmitted and discussed by consolidation group; next meeting is now in January.

Tuesday 26th November

  • As noted by Alessandra, if possible we'd like site feedback on the consolidated monitoring prototype before the next meeting a week on Friday to report back to the group (with thanks to everyone who has already contributed)
  • Some notes towards a wiki page on Graphite can be found here: https://www.gridpp.ac.uk/wiki/MonitoringTools . These are still under development; if there are areas people would find useful to have expanded, please let David know.
  • Glasgow dashboard now packaged and can be downloaded here.
On-duty - Dashboard ROD rota

Monday 20th January

  • Sussex have put their CE into a downtime. Cleared up some tickets after they did this.
  • The APEL alarms at Brunel reappeared from time to time. Closed these referring to a ticket Daniela raised. See: https://ggus.eu/ws/ticket_info.php?ticket=100287.

Monday 13th January

  • Quiet week (after APEL problems resolved). Brunel still seems to get dubious APEL alarms occasionally.

  • Glexec at Sussex unresolved - it may be best to put this into downtime whilst the issue is addressed?


  • Thanks to the team who continued with the ROD rota over the Christmas (Andrew) and New Year (Daniela) periods!
  • Gareth needs help with ROD work next week (esp. Monday and Thursday).

Monday 6th January

  • Very quiet week. Transient problems as usual, and a couple of new tickets but other than that almost all sites working well.
  • Sussex still has the outstanding escalated/expired glexec ticket but they're hopeful about getting this sorted now.


Tuesday 17th December

  • Quiet week with no UK wide problems.
  • A few sites (EFDA, Sussex, Brunel) have tickets which haven't made any visible progress, partly because of waiting for fixes/help. The other tickets are hopefully transient problems that the sites will fix next week.

Monday 9th December

  • UCL tickets closed.


Rollout Status WLCG Baseline

Tuesday 29th Oct

  • Yesterday the first staged rollout request (for the CREAMCE) in months came through. I've updated the Stage of the Nation page.


Tuesday 8th Oct

  • There were updates to EMI2 and 3 yesterday, but no new request for Staged Rollout. There is a problem with dcap-libs: [GGUS 97805]


Security - Incident Procedure Policies Rota

Tuesday 14th January

  • nmap test results show 4 UK sites yet to take action on perfSONAR
  • openssl status

Tuesday 19th November

  • There was a team meeting last Friday 15th November. Next meeting on 29th.
  • Just a couple of site issues showing up in Pakiti.
  • Looking at ARGUS server for UK NGI.


Services - PerfSonar dashboard | GridPP VOMS

Tuesday 7th January

  • A perfSONAR dashboard has been established in London based on maDDash.

Tuesday 26th November

  • The main perfSONAR issues this week affect Manchester and Sussex.

Tuesday 19th November

  • There is a new dashboard. Feedback is welcome.
  • Manchester, Durham, Glasgow and Sussex show problems across the board.
Tickets

Monday 20th January 2014, 14.30 GMT

There are 42 open UK tickets this week. Where did they all come from? Let's take a look.

EFDA-JET
https://ggus.eu/ws/ticket_info.php?ticket=97485 (21/9/2013)
LHCB jobs failing at JET. The JET chaps have just fixed an SSL problem at their site, so would like to see if this has fixed the LHCB problems. Waiting for reply (20/1). Update - things are still failing; reading the error, perhaps JET have picked up some weird rpms somewhere?

(This also possibly solves the JET glexec ticket https://ggus.eu/ws/ticket_info.php?ticket=95295 . UPDATE - SOLVED: the JET guys put in a fix to Java to solve the keysize problem and things work now.)

UCL
https://ggus.eu/ws/ticket_info.php?ticket=100342 (16/1)
ATLAS are seeing transfer failures to/from UCL's DPM. Looks like an authentication problem, Ben might need a hand. In progress (20/1)

TIER 1
https://ggus.eu/ws/ticket_info.php?ticket=100333 (16/1)
Looks like this problem Tom and Chris spotted with one of the RAL WMSii has been solved, so the case can be closed. In progress (17/1) SOLVED

https://ggus.eu/ws/ticket_info.php?ticket=100343 (16/1)
But the WMSses still bring us pain; here Chris documents that the RAL ones are still producing 512-bit proxies. Chris also helpfully links two other WMS tickets. In progress (17/1)

https://ggus.eu/ws/ticket_info.php?ticket=98122 (17/10/2013)
But Tom provides another win, this time with the cern@school cvmfs repo. He's managed to get it working and is able to put data into it, so this ticket can probably be closed too. In progress (17/1) SOLVED

https://ggus.eu/ws/ticket_info.php?ticket=100114 (8/1)
But then the WMS try to spoil our buzz again with another ticket, although I believe this is the forerunner to 100343 above. In progress (16/1)

BRUNEL
https://ggus.eu/ws/ticket_info.php?ticket=100188 (10/1)
Raul has provided Brian with the database dump from his SE (it should have landed in Brian's inbox). I think this ticket can be closed if the dump looks alright. In progress (16/1)

BRISTOL
https://ggus.eu/ws/ticket_info.php?ticket=99910 (20/12/2013)
LHCB problems at Bristol, due to ARC doing strange things to the environment. A few brave fixes have been attempted, but no joy. Waiting on feedback from the ARC developers - if that takes a while this ticket will need to be On Holded. In progress (14/1)

ECDF
https://ggus.eu/ws/ticket_info.php?ticket=99794 (16/12/2013)
Poking holes in the Edinburgh firewall for the perfsonar box. Any news from the IT overlords? I understand that there's a pending Edinburgh baby boom, so I'm not sure if anyone's still about? On hold (13/1)

GLASGOW
https://ggus.eu/ws/ticket_info.php?ticket=98253 (21/10/2013)
The "getting CMS working at Glasgow" ticket. It's looking almost as neglected as my gym membership. On hold (16/12/2013)

MANCHESTER
https://ggus.eu/ws/ticket_info.php?ticket=97066 (5/9/13)
Getting the Manchester perfsonar boxes back up and running. How goes it? On hold (7/1)

SHEFFIELD
https://ggus.eu/ws/ticket_info.php?ticket=98594 (4/11/2013)
The LHCB job uploading problem at Sheffield. It seems all parties have gotten stuck, so we need to decide where to go with this. On hold (8/1)

DURHAM
https://ggus.eu/ws/ticket_info.php?ticket=99621 (10/12/13)
Just making sure this ticket, with a bad node needing offlining, isn't forgotten about. On hold (19/12)

Similar with the Durham GLEXEC ticket https://ggus.eu/ws/ticket_info.php?ticket=95302 - it was On Holded over Christmas, but Christmas was a while ago now. In fact, with Creme eggs out, it must be nearly Easter already... right?

EXTRA EXTRA
RALPP https://ggus.eu/ws/ticket_info.php?ticket=100401 (20/1) This nagios glexec alarm ticket, which Chris quickly jumped on, has been reopened on you guys. Just bringing it up as reopened tickets have a habit of sneaking under the radar. Reopened (21/1)

OXFORD
https://ggus.eu/ws/ticket_info.php?ticket=100348 (17/10) ATLAS are getting a little antsy for some news on this ticket. And also don't seem to understand what the waiting for reply state is for... Waiting for reply (21/1)


Tools - MyEGI Nagios

Tuesday 26th November

  • Regional Nagios updated to release 22. It is a glite to UMD update and it required a fresh installation.
  • There have been some internal changes in SAM-Nagios. Test probes are now the responsibility of the product team. Some test names have been changed as a result of this reorganization. For example the org.sam.CREAMCE-DirectJobSubmit test has become emi.cream.CREAMCE-DirectJobSubmit. This does not affect the operational activities.
  • Could all site admins please look at the services associated with their site and mail Kashif if anything odd is noticed. Site admins can reschedule tests for their sites and it would be helpful if most functionalities are tested.
  • Also, look at myegi which can be useful with links to the Dashboard, GSTAT, Accounting Portal and GGUS.
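
For anyone with local scripts or alert filters keyed on the old SAM test names, the rename mentioned above could be handled with a small lookup table. This is only a sketch: the table holds the single rename quoted in this bulletin, the names `RENAMED_TESTS` and `current_test_name` are hypothetical, and any further entries would need to be checked against the regional Nagios itself.

```python
# Old test name -> new test name. Only this one rename is taken from the
# bulletin; it is not a complete list of the changes in SAM-Nagios.
RENAMED_TESTS = {
    "org.sam.CREAMCE-DirectJobSubmit": "emi.cream.CREAMCE-DirectJobSubmit",
}


def current_test_name(name):
    """Return the new name for a renamed test, or the name unchanged."""
    return RENAMED_TESTS.get(name, name)


if __name__ == "__main__":
    print(current_test_name("org.sam.CREAMCE-DirectJobSubmit"))
```

Names that are not in the table pass through untouched, so the helper is safe to apply to every test name a script sees.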
VOs - GridPP VOMS VO IDs Approved VO table

Tuesday 21 January 2014

  • Backup VOMS servers now configured at sites. Some small problems remain. Sites and VOs are recommended to update their UIs to use all three servers.
  • Dirac setup (from Janusz):
    • Mice
    • NA62
    • Londongrid
    • SNO+
    • Landslides (resources need to be configured)


Tuesday 9 December 2013

  • Backup VOMS server
    • VO managers still need to check sites - Scotgrid, northgrid, southgrid, londongrid and gridpp VOs were going first, but have not yet updated their status.

Monday 2nd December 2013


Monday 25th November 2013

  • CVMFS progress - but not quite there yet.
  • 6 VOs (cern@school, gridpp, na62, pheno, sno+, t2k.org) have updated their VOID card entries and updated the wiki.
  • Storage
    • Gfal2 - GGUS 99043,99044,99055,99067 - not performant, but very interesting functionality
    • Webdav now enabled on LFC@RAL and ports free from firewall - needs testing

Tuesday 19 November 2013

Site Updates

Actions


Meeting Summaries
Project Management Board - Members Minutes Quarterly Reports

Empty

GridPP ops meeting - Agendas Actions Core Tasks

Empty


RAL Tier-1 Experiment Liaison Meeting (Wednesday 13:30) - Agenda. Meeting takes place on Vidyo.

Wednesday 22nd January

  • Operations report
  • The Tier1 operated smoothly during this last week.
  • All of this financial year's capacity procurement has been delivered.
  • Planning to move the Tier1 across to use the new site firewall on 10th March. The installation of the Tier1's new Routing Layer and changes to the way the Tier1 connects to the RAL site network need to be done before this date.
WLCG Grid Deployment Board - Agendas MB agendas

Empty



NGI UK - Homepage CA

Empty

Events

Empty

UK ATLAS - Shifter view News & Links

Empty

UK CMS

Empty

UK LHCb

Empty

UK OTHER
  • N/A
To note

  • N/A