Past Ticket Bulletins 2016

From GridPP Wiki

Revision as of 13:32, 8 February 2016

Monday 1st February 2016, 14.30 GMT
50 Open UK Tickets this week. With no Ops meeting scheduled, I'm postponing a full review.

org.bdii.GLUE2-Validate tickets
We had 8 sites with these tickets (now 7, as Bristol have slain theirs); they are being discussed on TB-SUPPORT. A lot of them are still just "Assigned", though. Even if the issue is not really our fault, we still need to handle the tickets properly. Rising above it all and all that.

If someone has submitted or knows of a counter-ticket for this issue please let me know.

NGI
Talking of pains in the Information System, the UK still has this ticket to close (it has a similar root problem): 118930

CMS Siteconf problems.
GLASGOW 119196
EDINBURGH 119195
OXFORD 119197

CMS spotted a number of misconfigured T3s across the globe (on a Friday afternoon). The fix seems straightforward enough, and Glasgow look like they're done already. Proper job!

ATLAS CONSISTENCY CHECKS
We still have 8 tickets open on this issue, although a couple are waiting for feedback from atlas. I'll bring this up at the Thursday UK atlas meeting to see if we can shimmy those tickets along.

PILOTS
117723
Whilst investigating pilot issues at QM, Daniela reminded us of this page, which shows what DIRAC things should be going on at your site. It might be handy for pre-empting problems:
http://www.hep.ph.ic.ac.uk/~dbauer/dirac/site_pilot_status.html

118628
Whilst rolling out similar changes for LZ at RALPP, Chris stumbled upon a problem and submitted a bug report to NorduGrid: http://bugzilla.nordugrid.org/show_bug.cgi?id=3529

AND FINALLY

QMUL
118985 (21/1)
Biomed have got back to Dan suggesting that, rather than banning them altogether until he has a cgroup corral to put their jobs in, he might be willing and able to supply a list of the problem users. Of course, this requires that there be any non-problem users in the VO... On hold (1/2)

Monday 25th January 2016, 14.30 GMT

"OTHER VO" NAGIOS
It looks like hepgrid2.ph.liv.ac.uk at Liverpool is playing up for all VOs, and the Sheffield SE is misbehaving for the gridpp VO. Other than that it looks clean.

43 Open UK Tickets this week.

That ticket to the NGI...
118930 (18/1)
Steve J put in a comprehensive reply about what Liverpool do to get their publishing (kinda) right. Last week's view on this ticket was to close it with a <carefully|harshly> worded statement about why this is a bit of a pointless request. Who was formulating the reply? If it was me, I dropped that ball! Assigned (19/1)

Pilot Problems.
BRUNEL: 117710 Pheno. On Hold (19/11/15)
QMUL: 117723 Pheno - hopefully sorted. Waiting for reply (25/1)
SHEFFIELD: 114460 gridpp et al. In Progress (20/1)
RALPP: 118628 LZ (and maybe LSST?). In progress (14/1)

We have a few pilot rollout tickets; the last two are being worked on but are proving problematic.

RHUL
119027 (22/1)
As seen on the gridpp-storage list, Sno+ have asked RHUL (and will no doubt ask others) for storage space (~20TB). In progress (22/1)

(For the interest of others, Govind's other thread on gridpp-storage was likely triggered by https://ggus.eu/?mode=ticket_info&ticket_id=118553)

QMUL
118985 (21/1)
QM have banished Biomed from their cluster until they have a batch system that can put Biomed jobs in a cgroup cage (they're looking at Slurm). On Hold (21/1)

BIRMINGHAM
118155 (4/12)
Talking of Biomed, they've asked if they've successfully cleaned up all their files on the Birmingham SE - a cheeky uberftp onto your SE suggests the biomed directory is still full of cra.. I mean, files. In Progress (20/1)

HTTP TF Tickets
118787 (ECDF)
118764 (SHEFFIELD)
Feel free to poke the gridpp storage group for help with these. (I left out the 2 Manchester tickets as their immediate showstopper isn't their configs, but they can ask for help too!)

ATLAS CONSISTENCY CHECKS
Manchester, Oxford, Birmingham, Sussex, RHUL, Sheffield, Brunel and QMUL still open - a mix of chugging along nicely and being very much "On Hold".

Monday 18th January 2016, 14.00 GMT
49(!!) Open UK Tickets this week

NGI
118930 (18/1)
The NGI received a ticket concerning incorrect or missing GLUE information for the Tier 1, Brunel, Imperial, Liverpool, Durham, Glasgow, Bristol, Oxford and RALPP. The attributes in question are GlueSubClusterPhysicalCPUs, GlueSubClusterLogicalCPUs and GlueHostProcessorOtherDescription. There are some extra instructions in the ticket; it would be nice if we didn't have to create child tickets (hint hint...).
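For sites wondering whether they're among the culprits, a quick self-check might help. What follows is a minimal sketch, not anything the ticket asks for: it assumes you've dumped your site BDII's GLUE 1.3 output to LDIF text, and it only checks that the three attributes named in the ticket are present and non-empty in each GlueSubCluster entry (plus that the two CPU counts look like integers). The function names `parse_ldif` and `check_subclusters` are made up for illustration, and passing this check doesn't mean the values are actually *correct*.

```python
from typing import Dict, List, Optional

# The three attributes named in NGI ticket 118930.
REQUIRED = (
    "GlueSubClusterPhysicalCPUs",
    "GlueSubClusterLogicalCPUs",
    "GlueHostProcessorOtherDescription",
)
# The two CPU counts should at least look like non-negative integers.
INTEGER_ATTRS = {"GlueSubClusterPhysicalCPUs", "GlueSubClusterLogicalCPUs"}

Entry = Dict[str, List[str]]


def parse_ldif(text: str) -> List[Entry]:
    """Split LDIF text into entries mapping attribute name -> list of values."""
    entries: List[Entry] = []
    current: Entry = {}
    last_attr: Optional[str] = None
    for line in text.splitlines():
        if not line.strip():
            # A blank line ends the current entry.
            if current:
                entries.append(current)
            current, last_attr = {}, None
        elif line.startswith("#"):
            continue  # LDIF comment line
        elif line.startswith(" ") and last_attr is not None:
            # Continuation of a wrapped line: drop the single leading space.
            current[last_attr][-1] += line[1:]
        else:
            attr, _, value = line.partition(":")
            # "attr:: value" means base64; we only need to know a value exists.
            current.setdefault(attr, []).append(value.lstrip(":").strip())
            last_attr = attr
    if current:
        entries.append(current)
    return entries


def check_subclusters(entries: List[Entry]) -> List[str]:
    """Report missing, empty or non-integer values in GlueSubCluster entries."""
    problems: List[str] = []
    for entry in entries:
        ids = entry.get("GlueSubClusterUniqueID")
        if not ids:
            continue  # not a GlueSubCluster entry
        name = ids[0]
        for attr in REQUIRED:
            values = entry.get(attr, [])
            if not values or not values[0]:
                problems.append(f"{name}: {attr} missing or empty")
            elif attr in INTEGER_ATTRS and not values[0].isdigit():
                problems.append(f"{name}: {attr} is not an integer ({values[0]!r})")
    return problems
```

One way to get the input is to save the output of something like `ldapsearch -x -LLL -H ldap://<your-site-bdii>:2170 -b o=grid` to a file and pass its contents to `parse_ldif`; the wrapped-line handling is there because ldapsearch folds long values by default.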

ATLAS CONSISTENCY CHECKS (10 tickets)
Progress, or at least non-exciting but reassuring updates, on these. Birmingham and Glasgow tickets could do with an update (even if it's a "nothing to see here").

The QMUL ticket had an update providing feedback that might be useful to others too:
https://ggus.eu/?mode=ticket_info&ticket_id=117880

HTTP TF (5 tickets)
ECDF, Manchester, Sheffield and Glasgow are on the HTTP TF list - although no tickets are stale at the moment.

TIER 1 RECOMMENDATIONS
118809 (12/1) An interesting ticket asking the T0 and T1s to fill in a questionnaire on configuring batch job memory limits. The Tier 1 have done their bit and the ticket is On Hold pending feedback.

GLASGOW
118732 (9/1)
This ticket has become confusing: atlas want a dump for files "lost" at Glasgow that, by the looks of it, never made it to the site in the first place... Waiting for reply (15/1)

TIER 1 DUPLICATES
Are these three CMS tickets about the same (or similar, or related) issue, or am I just getting my wires crossed?
118494 (23/12/15)
116864 (12/10/15)
118722 (8/1)

CAN BE CLOSED (I THINK)
IC - 118162 (lfc ticket)
QM - 118839 (atlas mcore job failures; it doesn't look like the problem persists).

NEARLY THERE:
Lancaster - 118637 (squid misconfiguration hammering stratum-0)
Birmingham - 118155 (biomed SE use - biomed now think they deleted all data at Birmingham).

Monday 11th January 2016, 14.30 GMT
48(!) Open UK Tickets this week

  • VOMS TWEAK

118603: A request has been made to remove nsccs.ac.uk from the gridpp VOMS servers. It's just "Assigned" to the UK as a whole at the moment.

  • THE HTTP TASK FORCE STRIKES

Lancaster, RHUL and Manchester all had http TF tickets alongside Glasgow. Your site might be next! It'll be worth checking the monitoring pages and reviewing the documentation, just in case:
atlas: http://cern.ch/go/h8Rr
lhcb: http://cern.ch/go/Bk8J
documentation: https://twiki.cern.ch/twiki/bin/view/LCG/HTTPTFSAMProbe

  • TRANSFER ODDITIES

118494: The Tier-1 have a CMS ticket where xrootd is expecting a file which phedex and DAS don't think is at RAL. Is this even a site problem?

118728: In a similar vein, QMUL have an atlas ticket where a single file is refusing to be transferred. Dan has noticed a number of write attempts followed by immediate deletion. Could checksumming be causing a problem?

  • LOW HANGING FRUIT: tickets that can probably be closed, or are close to it.

IMPERIAL 118162
A ticket for the Imperial LFC, which appeared to be working (for Janusz at least).

RALPP 117740
Atlas datadisk cleanup ticket. Elena confirmed that the step09 directory can go for the chop. Not sure if Brian has had a chance to look at the users directory contents yet.

BRISTOL 118311
I suspect that this CMS SAM ticket can be closed as the CEs were all green.

  • ATLAS CONSISTENCY CHECKS

As requested at the Thursday atlas meeting, here are the outstanding consistency check tickets.

IMPERIAL: 117879
Not much news, (understandably) low priority for the site.

SUSSEX: 117894
It doesn't look like Matt got round to this before he left.

SHEFFIELD: 117886
Set in progress but no news since.

OXFORD: 117892
A similar case here. I assume it's on Ewan's to-do list before he heads off to pastures green.

BIRMINGHAM: 117890
Matt was going to look at this again in the New Year. Any joy?

RHUL: 117881
Govind was going to try to get to this before Christmas. Any luck?

GLASGOW: 117889
Back in 2015 the dumps were run and Sam asked for some clarification. Considering Glasgow's current state, any dump made using these tools might be full of lies, but I know that you chaps are working on this problem.

BRUNEL 117878
Raul asked some questions in his ticket, to which atlas only replied last week.

QMUL: 117880
Dan has created dumps and has asked for the all clear before he sets up the monthly cron.

TIER 1: 117846
Dumps have been created, but gfal and castor issues have slowed down the checking process (gfal-cat doesn't seem to work with castor).

MANCHESTER: 117885
This ticket was recently put On Hold, as Manchester currently has no free space outside of space tokens whilst a few disk servers are down.
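For anyone who hasn't dug into these tickets yet: as I understand it, the point of the dumps is that atlas compare a listing of what is actually on your SE against what their catalogue thinks should be there. The heart of that comparison is just two set differences, sketched below under loud assumptions: both dumps are plain text with one path per line in the same namespace, and `load_paths`/`compare_dumps` are hypothetical names, not atlas tooling. The sketch also ignores timing effects such as files written or deleted while the dumps were being made, which a real check has to allow for.

```python
from typing import Iterable, Set, Tuple


def load_paths(lines: Iterable[str]) -> Set[str]:
    """Read one file path per line, ignoring blank lines and surrounding whitespace."""
    return {line.strip() for line in lines if line.strip()}


def compare_dumps(storage: Set[str], catalogue: Set[str]) -> Tuple[Set[str], Set[str]]:
    """Compare a site storage dump with a catalogue dump.

    dark: files on the SE that the catalogue doesn't know about.
    lost: files the catalogue expects at the site that aren't on the SE.
    """
    dark = storage - catalogue
    lost = catalogue - storage
    return dark, lost
```

Files that never made it to the site in the first place (as in Glasgow's 118732) would turn up in `lost` here, even though nothing was ever deleted.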

Monday 4th January 2016, 14.30 GMT
HAPPY NEW YEAR EVERYONE!

38 Open UK Tickets this year.

All-the-UK-tickets URL: http://tinyurl.com/nwgrnys

As Jeremy spotted, with Matt RB off to pastures new, the Sussex tickets are looking a bit neglected, especially as one was reopened after his departure:
118337
118289

Finally, in this Glasgow ticket the submitter gave two new links for the http taskforce monitoring: 118052

The links to the http tf monitoring pages are:
atlas: http://cern.ch/go/h8Rr
lhcb: http://cern.ch/go/Bk8J