Difference between revisions of "Operations Bulletin Latest"

From GridPP Wiki
Revision as of 10:21, 1 April 2014

Bulletin archive


Week commencing 31st March 2014
Task Areas
General updates

Tuesday 1st April

Monday 24th March

  • There is no ops meeting this week due to GridPP32 in Pitlochry.

Tuesday 18th March

  • The GridPP website was upgraded over the weekend (it is now SHA-2 ready). Please inform Andrew if you encounter any problems with it.
  • Last week there was a pre-GDB on batch systems and a GDB. The March GDB meeting summary is now available. The GDB actions list has been updated.
  • There was a GridPP Strategic Review yesterday. Recommendations will be shared with the PMB and CB in due course.
  • The CERN VOMS service will move to new hosts whose host certificates are signed by the new (SHA-2) CERN CA during 2014. First, VOMS-aware services in WLCG need to be made aware of these hosts. See the EGI broadcast message which indicates a timeline of 6th May for services to be updated.
  • The GridPP IPv6 site status table has been updated to provide an 'allocation' column. This follows discussion of allocation strategies across sites. This IETF document may be of interest.
  • EGI invites entries to win a funded trip to the community forum in May.
  • The next WLCG middleware readiness WG meeting takes place this afternoon at 13:30 UK time. There are pre-meeting updates in the wiki.
  • Alarms (and then tickets) are now being raised against EMI-2 services at sites. Please respond to the tickets quickly and indicate your plans for removing the EMI-2 based node - replies are expected within 10 working days. If you believe alarms/tickets result from false-positives then please indicate this in the ticket... we are aware of some examples already.
  • France is deploying an EGI-wide Dirac instance for 'other communities' and EGI is considering including this as part of a H2020 proposal.
  • CVMFS v2.1.17 was recently released. Ian reported that RAL T1 had been using it stably for several weeks.
  • The EGI availability/reliability figures for February have been added to the reports wiki page. The UK services show 100%. Ops for NGI_UK is 98%:98%. No sites were below the EGI 70% targets.


WLCG Operations Coordination - Agendas

Tuesday 1st April

Thursday 6th March

  • There was a meeting today (agenda, minutes).
  • Simone Campana was nominated ATLAS Distributed Computing coordinator and will step down as chair of WLCG Operations Coordination.
  • Baselines: WMS fix for 512-bit keys, already applied at CERN.
  • CERN would like to propose a deadline to switch FTS 2 off on the 1st of August.
  • Review of baselines - main update is that fix for the 512-bit keys on WMSes is being applied.
  • dCache is going to extend the security support for 2.2 until Enstore and dCache 2.6 are properly integrated. This will happen by summer.
  • Tests ongoing with Oracle12
  • ALICE: Tier-1/2 workshop in Japan.
  • ATLAS: in the middle of a disk crisis, many of the Tier1s are almost full of primary data. JEDI is under testing now. JEM activated (Job Evolution Monitor) for all the production resources. Rucio migration (Rucio as file catalog instead of LFC) in progress.
  • CMS: Soon starting 13 TeV MC DIGI/RECO. Looking at access to high memory resources and multi-core jobs.
  • LHCb: 2014 spring incremental stripping in full swing, 1/4 of the data has been processed (statistics).
  • FTS3 deployment: Discussed with experiment DM developers how to integrate multiple FTS3 servers with experiment frameworks.
  • glexec deployment: 79 tickets closed and verified, 16 still open
  • Machine job features: no update
  • Middleware readiness: Next meeting Tuesday 2014/03/18 @ 14:30h CET. See twiki.
  • Multicore: Several meetings. Conclusions so far: the systems reviewed are capable of supporting multicore jobs; however, each system needs tuning (draining/reservation of resources) to absorb them when they run alongside single core jobs.
  • perfSONAR: perfSONAR 3.3.2 is now baseline. Deadline April 1, 2014 - all WLCG sites should have instances deployed, using the mesh configuration, and registered in OIM/GOCDB. Instructions in slides.
  • SHA2: Many new users already registered OK with SHA-2 certificates. Host certs of CERN future VOMS servers are from the new SHA-2 CERN CA.
  • Tracking tools: no update
  • WMS decommissioning: no update
  • xrootd deployment: no update
Tier-1 - Status Page

Tuesday 1st April

  • The software server used by the small VOs will be withdrawn from service (aiming for June).
  • A replacement MyProxy server has been put into production (to resolve the MyProxy issues raised in GGUS ticket 97025). This new service is called myproxy.gridpp.rl.ac.uk. Sites and VOs need to make appropriate reconfigurations to use this. We plan to turn the old one (lcgrbp01.gridpp.rl.ac.uk) off tomorrow (2nd April).
  • Load related problems with the CMS Castor instance have been ongoing. Plans to mitigate this are in place.
  • EMI-3 WN roll out underway with half of the nodes now done.
  • The most recent purchases of worker nodes are currently being deployed into the batch farm. New disk server deployment is also ongoing.
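As a rough aid for the MyProxy reconfiguration noted above, the following is a minimal sketch of how a site might look for leftover references to the old host in its own configuration text. The two hostnames come from the bulletin; the function names and the sample configuration line are purely illustrative assumptions, not part of any GridPP or MyProxy tooling.

```python
# Hostnames as given in the bulletin.
OLD_HOST = "lcgrbp01.gridpp.rl.ac.uk"
NEW_HOST = "myproxy.gridpp.rl.ac.uk"


def find_old_myproxy_refs(text):
    """Return (line_number, line) pairs that still mention the old host.

    Hypothetical helper: 'text' is the content of any config file you
    choose to check; no particular file format is assumed.
    """
    return [
        (number, line)
        for number, line in enumerate(text.splitlines(), start=1)
        if OLD_HOST in line
    ]


def switch_myproxy_host(text):
    """Return a copy of 'text' with the old hostname replaced by the new one."""
    return text.replace(OLD_HOST, NEW_HOST)


if __name__ == "__main__":
    # Illustrative config fragment (assumed, not taken from a real site).
    sample = "MYPROXY_SERVER=lcgrbp01.gridpp.rl.ac.uk\nOTHER_SETTING=1\n"
    for number, line in find_old_myproxy_refs(sample):
        print(f"line {number}: {line}")
    print(switch_myproxy_host(sample))
```

Reviewing the matches before rewriting anything is probably the safer order, since some references (for example in logs or comments) may not need changing.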
Storage & Data Management - Agendas/Minutes

Tuesday 18th March

  • Chris noticed some of Steve's tests failing. At IC this related to a full spacetoken. Bristol is not working as there is no SCRATCHDISK spacetoken. Durham fails with a no space left on device error message.

March 2014

  • How would we move data between DiRAC and GridPP?


Accounting - UK Grid Metrics HEPSPEC06 Atlas Dashboard HS06

Tuesday 1st April

Tuesday 18th March

  • A review of the HEPSPEC page reveals that most sites now have SL6 entries with the exceptions being: UCL-HEP; Manchester and RALPP. We will be checking HS06 figures in the 2014 Q1 quarterly reports so the deadline for action is 31st March 2014.
  • APEL publishing appears up-to-date for all sites.

Tuesday 11th February

  • Another review of the HEPSPEC page shows no SL6 (or equivalent) entry for: UCL; Lancaster; Liverpool; Durham; ECDF; Glasgow; Birmingham; RALPP and RAL Tier-1. So... basically no change since last week. Tickets needed?
  • The accounting pages show the following sites as not up-to-date with publishing accounting data: Glasgow (minor); RALPP and Sussex.
  • Under publishing... sites are encouraged to run the Glue2 validator. There were problems observed for RAL-LCG2; UKI-LT2-IC-HEP; UKI-NORTHGRID-MAN-HEP; UKI-NORTHGRID-SHEF-HEP; UKI-SCOTGRID-GLASGOW and UKI-SOUTHGRID-RALPP.

Tuesday 4th February

  • A review of the HEPSPEC page shows no SL6 (or equivalent) entry for: UCL; Lancaster; Liverpool; Durham; ECDF; Glasgow; Birmingham; RALPP and RAL Tier-1.
  • The accounting pages show the following sites as not up-to-date with publishing accounting data: Lancaster (minor); RALPP and Sussex.


Documentation - KeyDocs

See the worst KeyDocs list for documents needing review now and the names of the responsible people.

Tuesday 1st April

  • Keydocs action needed by Jens J; Rob H/Security T; Alessandra F; Wahid B; David C and Matt D.
  • We need to reassign Mark M's documents on Core Grid Services


Tuesday 18th March

  • Keydocs action needed by: Mark M; Jens J; Rob H/Security T; Alessandra F; Wahid B; David C and Matt D.
Interoperation - EGI ops agendas

Monday 31st March

Monday 10th March

  • An EGI operations meeting took place today (agenda).
  • URT recent or future planned releases
    • GridSite 2.2.2 (bug fix) and dCache 2.6.20 for UMD-3
  • SR updates
    • WMS 3.6.3 (today) - EMI3 WN tarball SR flagged
  • New Nagios probes
    • emi-cream-nagios v. 1.1.1 - released with EMI 3 Update 14, released soon in SAM framework
    • org.sam.WN-SoftVer - new probes check the $EMI_TARBALL_BASE/etc/emi-version file
    • WN replication tests in emi-nagios are now distributed by the SAM-team, nagios-plugins-wn-rep
    • The --wn-se-rep option, as well as all the other previous --wn-* options, will no longer be supported by the technology provider (see #88835 and #91683)
    • NGIs are requested to give feedback on how they feel about this option no longer being supported.
  • EMI-2 decommissioning
    • dCache extended the support for the 2.2.x versions until July 2014.
    • List of services failing given Services
    • Alarms to begin on Wednesday, so please check this list for errors ASAP.
  • Cloud probes start raising alarms
    • 4 cloud sites have been certified in the last weeks, sites are currently monitored by cloudmon, but the errors are not raising alarms.
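For sites running the WN tarball, the org.sam.WN-SoftVer item above says the new probes check the $EMI_TARBALL_BASE/etc/emi-version file. The sketch below is one way to inspect that file by hand before alarms are raised. It is not the probe's implementation: the function name is hypothetical and nothing is assumed about the file's contents beyond it being text.

```python
import os
from pathlib import Path


def read_emi_version(env=None):
    """Return the stripped contents of $EMI_TARBALL_BASE/etc/emi-version.

    Returns None if the variable is unset or the file is missing.
    Hypothetical helper for manual checks; not the SAM probe itself.
    """
    env = os.environ if env is None else env
    base = env.get("EMI_TARBALL_BASE")
    if not base:
        return None
    version_file = Path(base) / "etc" / "emi-version"
    if not version_file.is_file():
        return None
    return version_file.read_text().strip()


if __name__ == "__main__":
    version = read_emi_version()
    if version is None:
        print("EMI_TARBALL_BASE unset or emi-version file not found")
    else:
        print(f"emi-version: {version}")
```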


Monitoring - Links MyWLCG

Tuesday 4th March

  • Summary on HC functional tests
  • Overview of feedback


Monday 24th February

  • Next meeting this Friday, agenda looking at HammerCloud Functional tests

Tuesday 21 January

  • Update from meeting on Friday (the 17th); the main item under discussion was the nagios probes (in particular, the Condorg and CREAM-CE).
On-duty - Dashboard ROD rota

Monday 31st March

  • Routine alarms and a lot of EMI-2 tickets. Most of the EMI-2 tickets until next week.
  • Thanks to Daniela for cleaning up the dashboard early in the week.

Tuesday 18th March

  • Many EMI-2 tickets created during the week. Some false positives add to confusion!
  • Tickets outstanding (as on 15th March) for Brunel (2); Oxford (although the system was in downtime until yesterday) and Sheffield.


Rollout Status WLCG Baseline

Tuesday 18th March

Tuesday 11th February

  • 31st May has been set as the deadline for EMI-2 decommissioning. There may be an issue for dCache (related to 3rd party/enstore component).

References


Security - Incident Procedure Policies Rota

Tuesday 5th March

  • Ready for more ARGUS testing
  • SHA-2 looks ready for UK CA switch
  • Looking at technologies

Monday 24th February

  • Let Orlin know if you wish to try connections with the NGI ARGUS server. Tested working for EM including national banning. There is some setup documentation.


Services - PerfSonar dashboard | GridPP VOMS

Tuesday 18th March

  • A reminder of this perfSONAR overview for UK sites. Username is WLCGps. Currently shows a number of problems that need to be addressed at various sites.

Monday 10th March

  • A reminder that the perfSONAR documentation is available here.
  • Deadline for 3.3.2 is 1st April.

Tuesday 4th March

  • The full UK perfSONAR view is given on this dashboard.
  • When perfSONAR is performing in a stable fashion the site will appear on the main monitoring page.
Tickets

Monday 31st March 2014, 15.00 BST
34 Open UK tickets this week.

TIER 1
https://ggus.eu/index.php?mode=ticket_info&ticket_id=101968 (11/3)
Atlas deletion errors at the Tier 1. Alastair posted a good explanation of the problem and some mitigation details, but atlas would like an update. On hold (12/3) Update - Problem persists, reminder set for 7/4

https://ggus.eu/index.php?mode=ticket_info&ticket_id=102611 (24/3)
The Tier 1's EMI upgrade ticket. Some false positives on this list, Kashif asks if the NGI argus is also a false alarm? In progress (28/3)

https://ggus.eu/index.php?mode=ticket_info&ticket_id=101079 (9/2)
Tweaking the ARCCE DefaultSE publishing. As a bit of bookkeeping can the priority be tweaked to less urgent (seeing as the issue isn't causing great woe). On hold (17/3)

As an aside tickets often are submitted using the default priority of "urgent" and category of "Incident" - if you catch these in your tickets then you should feel free to change them.

SUSSEX
https://ggus.eu/index.php?mode=ticket_info&ticket_id=102810 (28/3)
Sussex's original EMI upgrade ticket (102212) was closed "automatically" ("broken ticket - close by Operations Portal"), leaving this one in its stead. I'm not sure if the information Matt RB carefully posted in the previous ticket needs to be cut and pasted over to here. All seems a bit weird. In progress (28/3)

OXFORD
https://ggus.eu/index.php?mode=ticket_info&ticket_id=102469 (19/3)
https://ggus.eu/index.php?mode=ticket_info&ticket_id=102544 (21/3) Solved
A couple of Oxford tickets look a bit neglected (one about cvmfs for T2K, t'other an lhcb/torque problem). I suspect these got overlooked with the excitement of Pitlochry last week. In progress (21/3)

(Also there's ticket https://ggus.eu/index.php?mode=ticket_info&ticket_id=102740, which could be seen as either an annoyingly finicky request or the epitome of a low hanging fruit, for when you *really* need a win that day!). Also Solved.

SHEFFIELD
https://ggus.eu/index.php?mode=ticket_info&ticket_id=102489 (20/3)
Similarly at Sheffield, maybe this biomed "invalid publishing" ticket got forgotten about on the trip to sunny Scotland. In progress (20/3)

https://ggus.eu/index.php?mode=ticket_info&ticket_id=100037 (3/1)
The Sheffield perfsonar ticket. Things just needed finishing off by the looks of it - let us know if any advice is needed. On hold (11/3)

BIRMINGHAM
https://ggus.eu/index.php?mode=ticket_info&ticket_id=102404 (18/3)
Birmingham's perfsonar "being weird" (ignoring Bristol), although Matt fixed it. Just doing the post-game roundup to figure out what magic actually fixed things, but could do with an update in the ticket. In progress (20/3) Update-Solved

QMUL
https://ggus.eu/index.php?mode=ticket_info&ticket_id=101639 (26/2)
RFC3820 proxy problems. The problem is spread wider than QM, and likely needs a middleware patch or three to solve. Dan and Chris have asked for a master ticket to be created (failing that some more information would be nice). Nothing forthcoming from the submitter yet. I think this ticket has mutated to include the issues from RAL as well as QM. A bit of a mess. In progress (18/3)


Tools - MyEGI Nagios

Monday 17th March

Tuesday 26th November

  • Regional Nagios updated to release 22. It is a glite to UMD update and it required a fresh installation.
  • There have been some internal changes in SAM-Nagios. Test probes are now the responsibility of product team. Some test names have been changed as a result of this reorganization. For example the org.sam.CREAMCE-DirectJobSubmit test has become emi.cream.CREAMCE-DirectJobSubmit. This does not affect the operational activities.
  • Please could all site admins look at services associated to their site and please mail Kashif if anything odd is noticed. Site admins can reschedule tests for their sites and it would be helpful if most functionalities are tested.
  • Also, look at myegi which can be useful with links to the Dashboard, GSTAT, Accounting Portal and GGUS.
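Scripts or reports that refer to SAM tests by name may need to follow the renaming described above. A minimal sketch of a lookup table is given here; only the single mapping quoted in the bulletin is known, and the table and function names are hypothetical, so any further entries would have to come from the Nagios configuration itself.

```python
# Old-name -> new-name table. Only this one entry is taken from the
# bulletin; other renamed tests must be added from your own Nagios setup.
RENAMED_TESTS = {
    "org.sam.CREAMCE-DirectJobSubmit": "emi.cream.CREAMCE-DirectJobSubmit",
}


def current_test_name(name):
    """Return the post-reorganization test name, or 'name' unchanged if unknown."""
    return RENAMED_TESTS.get(name, name)


if __name__ == "__main__":
    for old in ("org.sam.CREAMCE-DirectJobSubmit", "org.sam.WN-SoftVer"):
        print(old, "->", current_test_name(old))
```

Falling back to the original name keeps the helper harmless for tests that were not renamed.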
VOs - GridPP VOMS VO IDs Approved VO table

Monday 17 February 2014

  • Proxy renewal
    • All RAL WMSs now renew proxies with 1024 bits. This looks like the end of this (at last).


Tuesday 11 February 2014

  • Proxy renewal
    • lcgwms06 at RAL has been upgraded and works
    • Both Imperial's WMSs work
    • Glasgow's will still need to be upgraded (unless they have been since Friday).
Site Updates

Actions


Meeting Summaries
Project Management Board - Members Minutes Quarterly Reports

Empty

GridPP ops meeting - Agendas Actions Core Tasks

Empty


RAL Tier-1 Experiment Liaison Meeting (Wednesday 13:30) Agenda Meeting takes place on Vidyo.

Wednesday 19th March

  • Operations report
  • Move of Tier1 to use new site firewall was completed successfully on 17th March.
  • There have been problems with the CMS Castor instance caused by load issues through the disk cache in front of CMS_Tape.
  • Investigations ongoing into file deletion problems seen by Atlas as well as a longer standing, and now reproducible, Castor problem seen when accessing via the SRM.
  • A new MyProxy server is in production (myproxy.gridpp.rl.ac.uk).
  • Reminder: The software server used by the small VOs will be withdrawn from service (aiming for June).
WLCG Grid Deployment Board - Agendas MB agendas

Empty



NGI UK - Homepage CA

Empty

Events

4th February 2014 With reference to the OMB on 30th January.

- UMD-2 will be decommissioned in the coming months. 30th April end of security support. 31st May all services to have been removed or upgraded.

- UK sites failing Glue2:

RAL-LCG2 UKI-LT2-IC-HEP UKI-NORTHGRID-MAN-HEP UKI-NORTHGRID-SHEF-HEP UKI-SCOTGRID-GLASGOW UKI-SOUTHGRID-RALPP

Check with the glue2 validator to see errors.

- A new Operations Dashboard will be in pre-production during February and moved into production in March.

- Availability/reliability targets for EGI are moving to 80%/85%.

- Some new ARC SAM tests are being introduced: org.nordugrid.ARC-CE-LFC-result; org.nordugrid.ARC-CE-LFC-submit; org.nordugrid.ARC-CE-SRM-result; org.nordugrid.ARC-CE-SRM-submit; org.nordugrid.ARC-CE-submit

- There is a summary page on how to publish from various middleware types: https://wiki.egi.eu/wiki/MAN09

- There was an overview of the French NGI adoption of iRODS.

- FedCloud to production: Management (GOCDB - ok); Monitoring (in progress); Accounting (in progress); Documentation (ok); Support (ok); Dashboard (in progress) and Security (in progress).

- The SAM tests are: org.nagios.CloudBDII-Check; eu.egi.cloud.OCCI-VM and org.nagios.OCCI-TCP. Security checks not yet available.

UK ATLAS - Shifter view News & Links

Empty

UK CMS

Empty

UK LHCb

Empty

UK OTHER
  • N/A
To note

  • N/A