Past Ticket Bulletins 2012

Monday 10th December 2012 15.00 GMT</br> 29 tickets this week.</br>

I haven't seen any sign of a fresh wave of "Unsupported Glite Software" tickets, but the ROD team have started issuing tickets for security alarms in the dashboard. Two of the three sites ticketed are tarball reliant (I wonder how Lancaster dodged the bullet? Only one of our clusters is running the tarball prototype). For your records the current tarball ticket is https://ggus.eu/ws/ticket_info.php?ticket=81496.

A lot of sites upgraded their worker nodes at the same time as the CEs, so with luck there shouldn't be too many more tickets showing up on this front.

  • NGI</br>

https://ggus.eu/ws/ticket_info.php?ticket=89350 (10/12)</br> Bristol and ECDF aren't publishing UserDN, so the NGI got ticketed. In progress (10/12)

  • GLASGOW</br>

https://ggus.eu/ws/ticket_info.php?ticket=89221 (5/12)</br> This ticket concerning enmr.eu accounting at Glasgow seems to have trailed off into a debate between two VO members, so I think it can be closed. In progress (5/12)

  • BIRMINGHAM</br>

https://ggus.eu/ws/ticket_info.php?ticket=89129 (3/12)</br> High atlas prod job failure rates at Birmingham. Thought to be caused by the transition to EMI2 workers interacting with the atlas local area. Despite being all EMI now, and having completely reinstalled the workers, the ghosts of gLite past still linger somehow and the atlas software validation jobs won't update. Everyone who should be involved is involved, but something for others to watch out for. Waiting for reply (10/10)

  • LIVERPOOL</br>

https://ggus.eu/ws/ticket_info.php?ticket=89374 (10/12)</br> Fusion is having trouble getting its files. A very unhelpful split ticket without a notified site. I think it might be a problem at Liverpool (.py.liv.ac.uk servers), but feel free to punt it elsewhere if it ain't. Assigned (10/12)

https://ggus.eu/ws/ticket_info.php?ticket=88761 (22/11)</br> lhcb jobs clogging the Liverpool network pipes. As a ticket that was submitted by Liverpool and wound up going back to Liverpool, it could easily end up in limbo. Both the site and lhcb have taken steps to stop this happening again, so I think it can be closed. In progress (4/12)

No other tickets caught my eye.


Monday 3rd December 13.45 GMT</br> 32 Open UK tickets this week. It's the start of the month, so all tickets, great or small, will get reviewed.

  • NGI/VOMS</br>

https://ggus.eu/ws/ticket_info.php?ticket=88546 (16/11)</br> Creation of epic.vo.gridpp.ac.uk. Name has been settled on, deployed on the master VOMS instance and rolled out to the backups, ready for whatever the next step will be. In progress (30/11)

https://ggus.eu/ws/ticket_info.php?ticket=87813 (25/10)</br> Migration of vo.helio-vo.eu to the UK. At last word everything was done on the VOMS side, and testing on grid resources still needed to be done. In progress (15/11)

  • TIER 1</br>

https://ggus.eu/ws/ticket_info.php?ticket=89141 (3/12)</br> RAL are seeing a high atlas production job failure rate, and a possibly related high FTS failure rate. In Progress (3/12)

https://ggus.eu/ws/ticket_info.php?ticket=89081 (30/11)</br> Failed biomed SAM tests, tracked to a missing / in a .lsc file. Should be fixed, waiting for confirmation (but don't wait too long). Waiting for reply (3/12)
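For reference, and hedged since I haven't seen the file in question: a .lsc file under /etc/grid-security/vomsdir/<vo>/ holds just two DNs, the VOMS server certificate subject on the first line and its issuing CA on the second, so a dropped leading "/" in either line is enough to break the match. A minimal sketch, with a made-up hostname and DNs rather than the real biomed values:

  /etc/grid-security/vomsdir/biomed/voms.example.org.lsc would contain:
  /DC=org/DC=example/OU=Services/CN=voms.example.org
  /DC=org/DC=example/CN=Example CA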

https://ggus.eu/ws/ticket_info.php?ticket=89063 (30/11)</br> The atlas frontier squids at RAL weren't working, fixed (networking problem) but ticket reopened and placed on hold as the monitoring for these boxes needs updating. On hold (30/11)

https://ggus.eu/ws/ticket_info.php?ticket=88596 (19/11)</br> t2k.org jobs weren't being delegated to RAL. After some effort this has been fixed, the ticket can be closed. In progress (1/12)

https://ggus.eu/ws/ticket_info.php?ticket=86690 (3/10)</br> "JPKEKCRC02 missing from FTS ganglia metrics" for t2k. This has been a pain to fix, at last word RAL were waiting on their ganglia expert to come back, but that was a while ago (however I suspect they had bigger fish to fry in November). In progress (6/11)

https://ggus.eu/ws/ticket_info.php?ticket=86152 (17/9)</br> Correlated packet loss on the RAL perfsonar. On hold pending a wider scale investigation. On hold (31/10)

  • UCL

https://ggus.eu/ws/ticket_info.php?ticket=87468 (17/10)</br> The last Unsupported gLite software ticket (until the next batch). Ben has put the remaining out of date CE into downtime after updating another. In progress (29/11)

  • BIRMINGHAM</br>

https://ggus.eu/ws/ticket_info.php?ticket=89129 (3/12)</br> High atlas production failure rate, likely to be due to the migration to EMI. It could be a problem with the software area, Mark has involved Alessandro De Salvo. Waiting for reply (3/12)

https://ggus.eu/ws/ticket_info.php?ticket=86105 (14/9)</br> Low atlas sonar rates to BNL from Birmingham. atlas tag removed from ticket to lower noise. On hold (30/11)

  • IMPERIAL</br>

https://ggus.eu/ws/ticket_info.php?ticket=89105 (1/12)</br> t2k.org jobs failing on I.C. WMSs due to proxy expiry. Daniela thinks that it may be a problem with myproxy (the cern myproxy servers are having dns alias trouble by the looks of it). In progress (3/12)

  • SHEFFIELD</br>

https://ggus.eu/ws/ticket_info.php?ticket=89096 (30/11)</br> lhcb jobs to Sheffield that go through the WMS are seeing "BrokerHelper: no compatible resources" errors, possibly due to the published values for GlueCEStateFreeCPUs & GlueCEStateFreeJobSlots being 0. In progress (3/12)
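If anyone wants to see what their site is actually publishing, a direct query of the site BDII does the job; a sketch, with the hostname as a placeholder (the same query, swapping the attribute names, would also cover the GlueCEPolicyMaxCPUTime tickets at Lancaster and QMUL below):

  ldapsearch -x -H ldap://site-bdii.example.ac.uk:2170 -b o=grid \
    '(objectClass=GlueCE)' GlueCEStateFreeCPUs GlueCEStateFreeJobSlots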

  • LANCASTER</br>

https://ggus.eu/ws/ticket_info.php?ticket=89066 (30/11)</br> biomed nagios tests failing on the Lancaster SE. "problem listing Storage Path(s)", which suggests to me that we have a publishing problem. Couldn't find any obvious bugbears though; we'll keep digging. In progress (30/11)

https://ggus.eu/ws/ticket_info.php?ticket=89084 (30/11)</br> The problem in 89066 is also affecting the biomed CE tests. On hold (30/11)

https://ggus.eu/ws/ticket_info.php?ticket=88628 (20/11)</br> Getting t2k working on our clusters. Had some problems building ROOT on one cluster, and even just submitting jobs to the other. In progress (30/11)

https://ggus.eu/ws/ticket_info.php?ticket=88772 (22/11)</br> One of Lancaster's clusters is reporting default values for "GlueCEPolicyMaxCPUTime", mucking up lhcb's job scheduling. Tracked to a problem in the scripts (https://ggus.eu/ws/ticket_info.php?ticket=88904), the fix will be out in January so I've on-holded this until then. On hold (3/12)

https://ggus.eu/ws/ticket_info.php?ticket=85367 (20/8)</br> ilc jobs always fail on a Lancaster CE, possibly due to the CE's poor performance. For the third time in a row I've had to put this work off for a month. On hold (3/12)

https://ggus.eu/ws/ticket_info.php?ticket=84461 (23/7)</br> t2k transfer failures to Lancaster. Having trouble getting a routing change put through with the RAL networking team, probably due to them having a lot on their plate over the past month. In Progress (3/12)

  • LIVERPOOL</br>

https://ggus.eu/ws/ticket_info.php?ticket=88761 (22/11)</br> Technically a ticket from Liverpool to lhcb. A complaint over the bandwidth used by lhcb jobs, probably due to a spike in lhcb jobs running during an atlas quiet period. Are all sides satisfied about the cause of this problem and the steps taken to prevent this happening again? In progress (23/11)

  • SUSSEX</br>

https://ggus.eu/ws/ticket_info.php?ticket=88631 (20/11)</br> Looks like Emyr has fixed Sussex's not-publishing-UserDNs APEL problem, so this ticket can be closed. In Progress (26/11)

  • QMUL</br>

https://ggus.eu/ws/ticket_info.php?ticket=88822 (23/11)</br> A similar ticket to 88772 at Lancaster. It could be that the SGE scripts need updating too. In progress (26/11)

https://ggus.eu/ws/ticket_info.php?ticket=88987 (28/11)</br> t2k jobs are failing on ce05. In progress (30/11)

https://ggus.eu/ws/ticket_info.php?ticket=88887 (26/11)</br> lhcb pilots are also failing on ce05. In progress (28/11)

https://ggus.eu/ws/ticket_info.php?ticket=88878 (26/11)</br> hone are also having troubles on ce05... In progress (26/11)

https://ggus.eu/ws/ticket_info.php?ticket=86306 (22/9)</br> Redundant, hard-to-kill LHCB pilots at QMUL. Chris opened a ticket to the cream developers (https://ggus.eu/tech/ticket_show.php?ticket=87891). But the requests to purge lists still come in from lhcb. In progress (21/11).

  • GLASGOW</br>

https://ggus.eu/ws/ticket_info.php?ticket=88376 (8/11)</br> Biomed authorisation errors on CE svr026. Sam asked on the 9th if this was the only CE that has seen this problem. No reply since, so I added the biomed e-mail address explicitly to the cc list to try and coax a response. Waiting for reply (9/11)

  • ECDF</br>

https://ggus.eu/ws/ticket_info.php?ticket=86334 (24/9)</br> Low atlas sonar rates to BNL. Apparently things went from bad to worse on the 23rd/24th of October. Duncan has removed the atlas VO tag on the ticket to lower the noise on the atlas daily summary. On hold (30/11)

  • EFDA-JET</br>

https://ggus.eu/ws/ticket_info.php?ticket=88227 (6/11)</br> biomed complaining about 444444 waiting jobs & no running jobs being published by jet. The guys there have had a go at fixing the problem (probably caused by their update to EMI2), but are likely out of ideas. I had a brain wave regarding user access in maui.cfg but if that's not the solution I'm sure they'll appreciate ideas. In progress (3/12).
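On the maui.cfg point, and this is only my guess at what's wrong there: the dynamic info provider typically runs showq/diagnose as an unprivileged account, and if maui doesn't recognise that account it refuses to answer, leaving the 444444 defaults published. The usual user-access fix is to add the account to the read-only admin level, something like the line below (path and usernames are assumptions, check what the info provider actually runs as):

  # /var/spool/maui/maui.cfg - grant read-only query access to the info-provider account
  ADMIN3 edginfo ldap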

  • OXFORD</br>

https://ggus.eu/ws/ticket_info.php?ticket=86106 (14/9)</br> Poor atlas sonar rates from Oxford to BNL. On hold due to running out of fixes to try, and the fact that they get good rates elsewhere. VO tag removed to reduce noise. On hold (30/11)

  • DURHAM</br>

https://ggus.eu/ws/ticket_info.php?ticket=84123 (11/7)</br> atlas production failures at Durham. Site still in "quarantine". On hold (20/11).

https://ggus.eu/ws/ticket_info.php?ticket=75488 (19/10/11)</br> compchem authentication failures. As this ticket has been on hold at a low priority since January, it would seem worthwhile to contact the ticket originators to see what they want to do. On hold (8/10)

Monday 26th November 14.30 GMT</br> 35 Open UK tickets today.</br>

I had to set a few tickets "In Progress" this week. Remember, if a VO reopens a ticket and you go back to re-fix the issue, the ticket should be set back to "In Progress".

A few trends this week. t2k are on a software update spree, I know at Lancaster there were problems due to us not having installed the packages listed on the CiC portal. Biomed continue to give the silent treatment on their tickets, and there seem to be a few lhcb tickets about the place.

  • Unsupported Glite Software:</br>

BRISTOL: https://ggus.eu/ws/ticket_info.php?ticket=87472 (17/10) In Progress (23/11)</br> Cream CE & Workernodes are EMI2. This only leaves the APEL box and I believe (the hard one) the Bristol StoRM box. Good stuff though.</br> EDINBURGH: https://ggus.eu/ws/ticket_info.php?ticket=87171 (10/10) In Progress (21/11)</br> "pretty much done with our EMI deployment". Great news, once you're done you can close this ticket, it's not being done for us anymore.</br> MANCHESTER: https://ggus.eu/ws/ticket_info.php?ticket=87467 (17/10) On hold (5/11)</br> Manchester have the final push of upgrades planned for this week. Good luck!</br> UCL: https://ggus.eu/ws/ticket_info.php?ticket=87468 (17/10) On hold (1/11)</br> Things are quiet on the UCL front.

  • QMUL</br>

https://ggus.eu/ws/ticket_info.php?ticket=88822 (23/11)</br> This lhcb ticket concerning 99999 values for Max CPU time information publishing might have slipped under the QM radar. Assigned (23/11)

Similar problem at Lancaster: https://ggus.eu/ws/ticket_info.php?ticket=88772</br> Sheffield solved their ticket: https://ggus.eu/ws/ticket_info.php?ticket=88781

https://ggus.eu/ws/ticket_info.php?ticket=86306 (22/9)</br> Hard to Kill lhcb pilots on the QMUL cream. LHCB sent out another "to purge" list on the 21st. The corresponding ticket from Chris to the CREAM developers (https://ggus.eu/tech/ticket_show.php?ticket=87891) has been cheekily closed because Chris couldn't provide log snippets right away (due to Cream's overactive log rotation). In progress (21/11)

BTW Sheffield has seen this problem too in the last week: https://ggus.eu/ws/ticket_info.php?ticket=88719

  • LIVERPOOL</br>

https://ggus.eu/ws/ticket_info.php?ticket=88761 (22/11)</br> Technically this is a "from the UK ticket". LHCB jobs have been swamping the Liverpool WAN. Cutting back the number of running lhcb jobs has alleviated the problem somewhat. Is the cause of the sudden lhcb bandwidth gobbling still thought to be simply a surge of lhcb work whilst atlas were quiet? In progress (23/11)

  • LANCASTER</br>

https://ggus.eu/ws/ticket_info.php?ticket=88628 (20/11)</br> t2k were having problems installing their software on one of our clusters, largely due to not taking t2k into account in our latest node image. However the latest problem is a crash when building ROOT which has us stumped. Apparently Oxford see this too, so expect a poke from me soon in an attempt to scavenge a solution. In progress (26/11)

  • RHUL</br>

https://ggus.eu/ws/ticket_info.php?ticket=88417 (11/11)</br> Atlas squid problems - any luck getting the firewall opened for your new squid config? In Progress (20/11)

The solved tickets and tickets from the UK bits have kind of merged into the site tickets this week.


Monday 19th November 14.30 GMT</br> 29 Open UK tickets this week. Thanks for all your hard work making my job easier :-)

  • Unsupported Glite Software.</br>

MANCHESTER: https://ggus.eu/ws/ticket_info.php?ticket=87467 (17/10) On Hold (5/11)</br> Just the DPM left to go now I think, which is scheduled for next week.</br>

ECDF: https://ggus.eu/ws/ticket_info.php?ticket=87171 (10/10) In progress (7/11)</br> Opened support tickets to help deal with their issues:</br> https://ggus.eu/tech/ticket_show.php?ticket=88284 (cream canceling jobs-in progress)</br> https://ggus.eu/ws/ticket_info.php?ticket=88285 (cream sge information pub-solved)</br> https://ggus.eu/tech/ticket_show.php?ticket=88286 (argus problems-on hold)</br>

EFDA-JET: https://ggus.eu/ws/ticket_info.php?ticket=87169 (10/10) In Progress (9/11)</br> I think jet are all upgraded, but having nagios issues. Might need a hand from someone. Not sure why this ticket hasn't closed though. Update-solved

BRISTOL: https://ggus.eu/ws/ticket_info.php?ticket=87472 (17/10) In Progress (19/11)</br> Have an emi2 test CE up and running, so things are looking good.

BRUNEL: https://ggus.eu/ws/ticket_info.php?ticket=87469 (17/10) In Progress (5/11)</br> Site is on EMI2, so I don't understand why this ticket didn't auto-close. Worth manually solving it (unless you guys have some hidden glite about the place). Update-solved

UCL: https://ggus.eu/ws/ticket_info.php?ticket=87468 (17/10) On hold (1/11)</br> Ben hoped to have an EMI cream by the 9th, not sure that target was reached.

SHEFFIELD: Elena has closed their ticket after upgrading everything.

  • NGI/VOMS</br>

https://ggus.eu/ws/ticket_info.php?ticket=88546 (16/11)</br> Setting up a new "epic" VO (that's actually what they're calling themselves). Debate on the finer points of naming - the original suggestion was "epic.gridpp.ac.uk" but we need to decide on some precedent for future VO naming. Andy McNab suggests the VO registers its own domain name and uses that. In progress (16/11)

https://ggus.eu/ws/ticket_info.php?ticket=88395 (9/11)</br> David Meredith asks the NGI (i.e. us) if there are any objections to deleting the "UKI-Local-MAN-HEP" site-that-never-was from the gocdb. Waiting for reply (13/11)

  • RHUL</br>

https://ggus.eu/ws/ticket_info.php?ticket=88417 (11/11)</br> Alastair would like to know what you have in your squid ACLs/customize.sh to debug the squid problems at RHUL. In progress (12/12)

  • GLASGOW</br>

https://ggus.eu/ws/ticket_info.php?ticket=88376 (8/11)</br> Biomed ticketed Glasgow with a problem on one of their CEs, but have neglected to reply to Sam's question on the 9th. Still Waiting for Reply (9/11)

  • DURHAM</br>

https://ggus.eu/ws/ticket_info.php?ticket=86242 (20/9)</br> Another example of biomed's silence when asked a question. Waiting for reply (5/11)

  • BIRMINGHAM</br>

https://ggus.eu/ws/ticket_info.php?ticket=88262 (6/11)</br> How are Birmingham's power problems coming along? On hold (9/11) Update-downtime extended

  • RALPP</br>

https://ggus.eu/ws/ticket_info.php?ticket=88099 (3/11)</br> Transfer errors continued, although they've changed in nature. The last update was from Wahid on Thursday, any word from the site? They've been quiet on this one, which suggests that they're not getting alerts. In progress (16/11)

  • OXFORD</br>

https://ggus.eu/ws/ticket_info.php?ticket=86106 (14/9) Low atlas sonar rates to BNL. Brian asked if you guys could see if the bad rates also applied to direct globus-url-copies. Have you had a chance to have a bash at this? In Progress (6/11)
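For anyone who hasn't driven it by hand before, a direct transfer outside FTS can be done with something along these lines (endpoints and paths are placeholders; -p sets the number of parallel streams and -tcp-bs the TCP buffer size in bytes):

  globus-url-copy -vb -p 8 -tcp-bs 4194304 \
    gsiftp://se.oxford.example.ac.uk/dpm/example.ac.uk/home/atlas/testfile \
    gsiftp://se.bnl.example.gov/path/to/testfile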

  • Discussion point (mainly for atlas): In https://ggus.eu/ws/ticket_info.php?ticket=86334 Wahid and Brian had an exchange about the usefulness of tickets to track very long standing issues- the main grumble for Wahid seemed to be the constant presence of ECDF on the daily resume because of this ticket. I agree with Brian that tickets are the tool to track issues, but the constant noise created by this ticket (which isn't going anywhere fast) is a nuisance. Maybe the resume needs to start ignoring certain classes of tickets (low priority, on hold ones for example).
  • Tickets from the UK:</br>

https://ggus.eu/tech/ticket_show.php?ticket=87891</br> Chris's ticket concerning the hard-to-kill lhcb jobs.</br> https://ggus.eu/ws/ticket_info.php?ticket=87264</br> Daniela's ticket concerning the enormous number of entries showing up in /var/glite/log, with some interesting input from Daniela. QMUL & Lancaster also see this problem, if you do too please add your voice (to coax a swifter fix).

If you have any tickets you'd like the spotlight on, please let me know. Update - Daniela has been schooling biomed in https://ggus.eu/ws/ticket_info.php?ticket=88489 and would like feedback from other sites - although I think that her & Sam's suggestions are perfectly sensible.

Monday 12th November 2012 14.00 GMT

32 Open UK tickets this week.

  • Unsupported Glite Software Tickets:</br>

SHEFFIELD: https://ggus.eu/ws/ticket_info.php?ticket=87466 (17/10) On hold (12/11) - Plan to upgrade DPM by the 23rd.</br> EFDA-JET: https://ggus.eu/ws/ticket_info.php?ticket=87169 (10/10) In progress (9/11) - Upgraded, but having nagios test issues.</br> BRISTOL: https://ggus.eu/ws/ticket_info.php?ticket=87472 (17/10) In progress (25/10)</br> BRUNEL: https://ggus.eu/ws/ticket_info.php?ticket=87469 (17/10) In progress (5/11) - site on EMI2 now.</br> UCL: https://ggus.eu/ws/ticket_info.php?ticket=87468 (17/10) On hold (1/11) - Planned to have upgraded their CREAM by the 9th.</br> MANCHESTER: https://ggus.eu/ws/ticket_info.php?ticket=87467 (17/10) On hold (5/11) - CE's upgraded. DPM will be done by the end of the month.</br> ECDF: https://ggus.eu/ws/ticket_info.php?ticket=87171 (10/10) In Progress (7/11)- Upgraded, but having trouble and have 3 open support tickets:</br> https://ggus.eu/tech/ticket_show.php?ticket=88284 (cream canceling jobs)</br> https://ggus.eu/ws/ticket_info.php?ticket=88285 (cream sge information pub)</br> https://ggus.eu/tech/ticket_show.php?ticket=88286 (argus problems)</br> CAMBRIDGE: Solved their ticket by turning off their old glite 3.2 CE.

  • A BIT INTERESTING</br>

https://ggus.eu/ws/ticket_info.php?ticket=88395 (9/11)</br> Alessandra has asked if/how a site can be deleted from the GOCDB. Apparently a site can but shouldn't be. In Progress (9/11)

  • TIER 1</br>

https://ggus.eu/ws/ticket_info.php?ticket=88406 (9/11)</br> LHCB pilot jobs aborting at the Tier-1. No reply yet to this ticket, LHCB have upped it to "TOP PRIORITY". Assigned (10/11) UPDATE-SOLVED

  • RHUL</br>

https://ggus.eu/ws/ticket_info.php?ticket=88417 (11/11)</br> You chaps have an atlas ticket about your frontier squid. Assigned (11/11) UPDATE-IN PROGRESS NOW

https://ggus.eu/ws/ticket_info.php?ticket=88310 (8/11)</br> Nagios Tests were failing on SL6 workernodes (not a site problem). However Kashif has applied a patch, and SL6 WNs should pass nagios tests now. Soon to be closed (12/11)

EFDA-JET</br> https://ggus.eu/ws/ticket_info.php?ticket=88227 (6/11)</br> EFDA-JET are publishing 4444444s for their job numbers, and biomed are complaining. Not much beyond ticket acknowledgement - can someone down south please take a look? In progress (6/11)

ECDF</br> https://ggus.eu/ws/ticket_info.php?ticket=87958 (31/10)</br> ATLAS transfer failures to FZK. Ticket open with FZK, but hasn't been put into this ticket yet. Hint hint. In Progress (6/11) UPDATE- Ticket to FZK added (https://ggus.eu/ws/ticket_info.php?ticket=88006), but there seem to be other troubles haunting FZK. Wahid suggests reassigning the ticket to them

RALPP</br> https://ggus.eu/ws/ticket_info.php?ticket=88099 (3/11)</br> Some atlas transfer problems from RAL, probably load based problems. Rob & Chris couldn't comment so Brian had to step in - was the posting problem powercut related or is ggus playing silly beggars for them? On hold (12/11)

No significant goings on with the other tickets this week. https://ggus.eu/ws/ticket_info.php?ticket=87264 Daniela has asked us to take a look at the number of entries in /var/glite/log on our EMI CEs. I see them at Lancaster on our older EMI1 CE.
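A quick way to see whether you're affected, assuming the same log path as in Daniela's ticket:

  ls /var/glite/log | wc -l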


Monday 5th November 14:00 GMT</br> 32 Open UK Tickets this week. It's the first Monday of the month, so we get to look at all of them. Have all the GGUS access problems experienced by atlas team members last week soothed themselves?</br>

It's worth noting that a quarter of the open tickets are concerning networking/transfer type problems.</br>

  • UNSUPPORTED GLITE SOFTWARE TICKETS</br>

Congratulations to those sites who closed their tickets. I suspect these will be gone over in greater detail so again I'll just summarise them, we can look at each in the meeting if needed. All seem to be in hand, but my rule of thumb is the more recent the update the lesser the worry.

BRISTOL: https://ggus.eu/ws/ticket_info.php?ticket=87472 (17/10) In Progress (25/10)</br> CAMBRIDGE: https://ggus.eu/ws/ticket_info.php?ticket=87470 (17/10) In Progress (30/10)</br> BRUNEL: https://ggus.eu/ws/ticket_info.php?ticket=87469 (17/10) In Progress (30/10)</br> UCL: https://ggus.eu/ws/ticket_info.php?ticket=87468 (17/10) In Progress (1/11)</br> MANCHESTER: https://ggus.eu/ws/ticket_info.php?ticket=87467 (17/10) On Hold (24/10) In Progress (5/11)</br> SHEFFIELD: https://ggus.eu/ws/ticket_info.php?ticket=87466 (17/10) On Hold (31/10)</br> ECDF: https://ggus.eu/ws/ticket_info.php?ticket=87171 (10/10) In progress (30/10) (5/11)</br> EFDA-JET: https://ggus.eu/ws/ticket_info.php?ticket=87169 (10/10) In Progress (31/10)</br>

  • NGI/VOMS</br>

https://ggus.eu/ws/ticket_info.php?ticket=87813 (25/10)</br> Migration of vo.helio-vo.eu to Manchester. The transfer was completed manually, and users were asked if things are okay. In Progress, I "waiting for replied" it today. (30/10) David indicates it works and will now test with WMS/CE (5/11)

  • TIER 1</br>

https://ggus.eu/ws/ticket_info.php?ticket=88112 (3/11)</br> Slow atlas transfers, found to be caused by database problems. The problems have been fixed, the atlas instance restarted and data is flowing once more. Waiting for the thumbs up from atlas. Waiting for reply (5/11)

https://ggus.eu/ws/ticket_info.php?ticket=86690 (3/10)</br> t2k are missing JPKEKCRC02 FTS ganglia metrics. There were some problems with the rrd files that meant they had to be deleted, which hopefully will fix the plots. Things look better to my eyes, In Progress, can be waiting for replied/solved (31/10) t2k give the thumbs up, seems okay to them now

https://ggus.eu/ws/ticket_info.php?ticket=86152 (17/9)</br> Packet loss on the RAL perfsonar. This is being taken under the wing of wider network investigations at RAL. On hold (31/10)

https://ggus.eu/ws/ticket_info.php?ticket=68853 (22/3/11)</br> DPM SL4 retirement ticket. The only reason this is open is possible SL4 disk servers at Durham, right? Are they still there? In progress (30/10)

  • RALPP</br>

https://ggus.eu/ws/ticket_info.php?ticket=88099 (3/11)</br> atlas seeing transfer errors into RALPP with "No transfer markers received" errors, although the problem seems to be abating slowly. Still just "Assigned" (4/11) ATLAS still see the problem (5/11). Still just assigned

  • BRUNEL</br>

https://ggus.eu/ws/ticket_info.php?ticket=88019 (1/11)</br> lhcb seeing failures on some nodes, blaming cvmfs. Raul has put CE in downtime. In Progress (1/11)

  • BIRMINGHAM</br>

https://ggus.eu/ws/ticket_info.php?ticket=88009 (1/11)</br> Hone with one of their usual politely worded requests to get their jobs moving. Mark tweaked the batch system, and hone are happy again. In progress, can be closed (2/11) Solved

https://ggus.eu/ws/ticket_info.php?ticket=86105 (14/9)</br> Poor sonar rates between Birmingham & BNL. Investigation made difficult due to EMI2 problems with the DPM, Brian has tried to see if doubling the number of streams would help. Did it? On hold (16/10)

  • DURHAM</br>

https://ggus.eu/ws/ticket_info.php?ticket=88151 (5/11)</br> apel nagios test problems. Assigned (5/11)

https://ggus.eu/ws/ticket_info.php?ticket=86242 (20/9)</br> Biomed not cleaning out their cream sandbox. Mike pulled them up about this a while ago but no reply. We should close this ticket and/or re-ticket the VO if they're causing a mess. Waiting for reply (4/10)

https://ggus.eu/ws/ticket_info.php?ticket=84123 (11/7)</br> atlas production job failures at Durham, which has become a bit of a catch-all ticket for atlas problems at Durham. On hold (3/9)

https://ggus.eu/ws/ticket_info.php?ticket=75488 (19/10/11)</br> Compchem authentication ticket. On hold, but is it still relevant? (8/10)

  • ECDF</br>

https://ggus.eu/ws/ticket_info.php?ticket=88119 (4/11)</br> Atlas transfers are failing due to a sickly pool node. In Progress (5/11)

https://ggus.eu/ws/ticket_info.php?ticket=87958 (31/10)</br> atlas transfers between Edinburgh & FZK having problems, likely due to their firewall. FZK had been ticketed (no ticket number given though). In Progress (1/11)

https://ggus.eu/ws/ticket_info.php?ticket=86334 (24/9)</br> Poor atlas sonar rates between ECDF & BNL. Wahid has "harmonised" his tcp tunings, and is waiting on some further WAN upgrades. On hold (25/10)

  • GLASGOW</br>

https://ggus.eu/ws/ticket_info.php?ticket=87879 (29/10)</br> na62 mapping problems, traced to a pool node not making its grid map. Seems things are fixed now, despite the user's initial protests to the contrary. Turns out they were just being impatient! In progress, can be closed (30/10) SOLVED

  • SUSSEX</br>

https://ggus.eu/ws/ticket_info.php?ticket=86996 (8/10)</br> Sussex's APEL problems. Things look better now after a lot of work. In progress, can be closed (5/11)

https://ggus.eu/ws/ticket_info.php?ticket=81784 (1/5)</br> The Sussex Certification Chronicle. Surely the Grid Overlords are satisfied that Sussex is worthy of certification, after paying so much tribute in tears and sanity? :-) In progress (bit quiet though) (23/10) SOLVED! SUSSEX IS ONE OF US NOW...

  • QMUL</br>

https://ggus.eu/ws/ticket_info.php?ticket=86306 (22/9)</br> Hard-to-kill lhcb jobs at QMUL. Chris is still getting regular hit-lists. Chris's corresponding ticket to the cream developers (https://ggus.eu/tech/ticket_show.php?ticket=87891) has problems as lhcb can't reply to it! He has however written information in this ticket. In progress (1/11)

  • CAMBRIDGE</br>

https://ggus.eu/ws/ticket_info.php?ticket=86108 (14/9)</br> Perfsonar WAN bandwidth asymmetry. Been on hold for a while, the classic question must be asked - has the problem gone away all by itself? On hold (2/10)

  • OXFORD</br>

https://ggus.eu/ws/ticket_info.php?ticket=86106 (14/9)</br> Low atlas sonar rates between BNL and Oxford. Tweaking the FTS settings hasn't made any difference. The next step was to tweak tcp tuning parameters. Duncan observed similar transfer rates between Oxford & TRIUMF. In progress (19/10) Tuning tcp didn't help, what to do next...

  • LANCASTER</br>

https://ggus.eu/ws/ticket_info.php?ticket=85367 (20/8)</br> ilc jobs were aborting on one of Lancaster's CEs. This CE has poor performance, which for some reason was affecting ilc jobs more than most. The only fix is a reinstall (and reconfigure), but other priorities keep getting in the way (the latest being the use of this CE to test EMI2 tarballs). On hold (5/11)

t2k.org transfer timeout failures between RAL and Lancaster. Traffic is in the process of being routed over SJ5 rather than the lightpath to see if that helps. Other than that there is the possibility that this is a taking-too-long-to-stage-from-tape thing - but no reason why that would only be a problem for us. In progress (1/11)

Monday 29th October 15:00 GMT</br>


37 open tickets this week. Not a lot jumped out at me, but I'll be going over them all in excruciating detail next week as it'll be the start of the month.</br>

UNSUPPORTED GLITE SOFTWARE:</br> All of these tickets seem in hand, plans have been stated (after a few requests for greater verbosity).</br> BRUNEL: https://ggus.eu/ws/ticket_info.php?ticket=87469 (17/10) In Progress (25/10)</br> UCL: https://ggus.eu/ws/ticket_info.php?ticket=87468 (17/10) On hold (23/10)</br> QMUL: https://ggus.eu/ws/ticket_info.php?ticket=87473 (17/10) Solved! (23/10)</br> BRISTOL: https://ggus.eu/ws/ticket_info.php?ticket=87472 (17/10) In Progress (25/10)</br> OXFORD: https://ggus.eu/ws/ticket_info.php?ticket=87471 (17/10) In Progress (17/10)</br> CAMBRIDGE: https://ggus.eu/ws/ticket_info.php?ticket=87470 (17/10) In Progress (23/10)</br> MANCHESTER: https://ggus.eu/ws/ticket_info.php?ticket=87467 (17/10) On hold (24/10)</br> SHEFFIELD: https://ggus.eu/ws/ticket_info.php?ticket=87466 (17/10) On hold (17/10)</br> EDINBURGH: https://ggus.eu/ws/ticket_info.php?ticket=87171 (10/10) In progress (23/10)</br> GLASGOW: https://ggus.eu/ws/ticket_info.php?ticket=87170 (10/10) In progress (15/10)</br> EFDA-JET: https://ggus.eu/ws/ticket_info.php?ticket=87169 (10/10) In progress (24/10)</br> LIVERPOOL: https://ggus.eu/ws/ticket_info.php?ticket=87167 (10/10) In progress (19/10)</br>

NGI</br> https://ggus.eu/ws/ticket_info.php?ticket=87813 (25/10)</br> Perhaps of interest, a VO (vo.helio-vo.eu) is transferring to the Manchester VOMS. In progress (26/10)

LANCASTER</br> https://ggus.eu/ws/ticket_info.php?ticket=87860 (27/10)</br> It looks like Lancaster is having idle lhcb zombie jobs as well - interesting as we're seeing it on an LSF cluster. In Progress (29/10)</br> (related to the QMUL ticket https://ggus.eu/ws/ticket_info.php?ticket=86306). Update - Chris has submitted https://ggus.eu/tech/ticket_show.php?ticket=87891 to the Cream developers.

BRUNEL</br> https://ggus.eu/ws/ticket_info.php?ticket=87797 (25/10)</br> Hone were having trouble getting their jobs to run on SL6 and have been using Brunel as a testbed. Having some successes now thanks to Raul's work (although I'll be intrigued to find out what libraries Raul installed that were missing on SL6). In Progress. Update - Raul has told us to watch this space for when he has a conclusive list

SUSSEX</br> https://ggus.eu/ws/ticket_info.php?ticket=86996 (8/10)</br> Sussex APEL Problems. John Gordon asks if you've had any success fixing things? In progress (26/10) Update- I see Emyr contacted TB-SUPPORT about this

No exciting tickets in the "Other" piles that I can see.


Monday 22nd of October 14:30 BST</br> 36 Open UK tickets today, 12 of them are Unsupported Glite Software tickets.

Unsupported GLite Software tickets:</br> No sites have left these tickets untouched, which is good. I'm sure these will be talked about elsewhere in the meeting. For completeness I summarised them below, with the dates they were opened and last updated.

BRUNEL: https://ggus.eu/ws/ticket_info.php?ticket=87469 (17/10) In Progress (18/10)</br> UCL: https://ggus.eu/ws/ticket_info.php?ticket=87468 (17/10) In Progress (17/10)</br> QMUL: https://ggus.eu/ws/ticket_info.php?ticket=87473 (17/10) Waiting for reply (18/10)</br> BRISTOL: https://ggus.eu/ws/ticket_info.php?ticket=87472 (17/10) In Progress (18/10)</br> OXFORD: https://ggus.eu/ws/ticket_info.php?ticket=87471 (17/10) In Progress (17/10)</br> CAMBRIDGE: https://ggus.eu/ws/ticket_info.php?ticket=87470 (17/10) In Progress (17/10)</br> MANCHESTER: https://ggus.eu/ws/ticket_info.php?ticket=87467 (17/10) On hold (17/10)</br> SHEFFIELD: https://ggus.eu/ws/ticket_info.php?ticket=87466 (17/10) On hold (17/10)</br> EDINBURGH: https://ggus.eu/ws/ticket_info.php?ticket=87171 (10/10) In progress (10/10)</br> GLASGOW: https://ggus.eu/ws/ticket_info.php?ticket=87170 (10/10) In progress (15/10)</br> EFDA-JET: https://ggus.eu/ws/ticket_info.php?ticket=87169 (10/10) In progress (10/10)</br> LIVERPOOL: https://ggus.eu/ws/ticket_info.php?ticket=87167 (10/10) In progress (10/10)</br>


TIER 1</br> https://ggus.eu/ws/ticket_info.php?ticket=86705 (3/10)</br> Sno+ job errors at RAL. Last word was that Sno+ production users needed to be implemented; Matt M from Sno+ asks if there's been any progress - a reversed waiting for reply. The RAL chaps have probably been very busy with the move to EMI creams. In progress (17/10)

https://ggus.eu/ws/ticket_info.php?ticket=86152 (17/9)</br> Packet loss on the RAL Tier 1 Perfsonar. Under investigation, but due to the slow progress expected when debugging network niggles this probably should be On Holded. In progress (19/9)

SUSSEX</br> https://ggus.eu/ws/ticket_info.php?ticket=87535 (18/10)</br> This nagios ticket concerning your apel box might have flown under the Sussex radar; it needs tending. Assigned (18/10)</br> (almost certainly an extension of the problem in https://ggus.eu/ws/ticket_info.php?ticket=86996)

https://ggus.eu/ws/ticket_info.php?ticket=81784 (1/5)</br> Ye Olde Sussex certification ticket. How did implementing Atlas Space Tokens go? In progress (11/10)

QMUL</br> https://ggus.eu/ws/ticket_info.php?ticket=86306 (22/09)</br> Chris is still having to regularly manually purge unkillable idle lhcb jobs from his system. This doesn't seem ideal - is there a way to break the cycle? In Progress (22/10)

Of interest:</br> https://ggus.eu/ws/ticket_info.php?ticket=81496</br> The current (AFAIK) "Where's the tarball?" ticket, thanks to Chris for reminding me in his ticket.


Monday 15th October 14:00 BST</br> 3631 open UK tickets this week. Nothing too exciting other than a bunch of tickets regarding unsupported glite software, most of which have been handled (and many would argue, and have argued, are non-issues, e.g. ClassicSE entries). I've once again glossed over the various networking tickets, but progress on those fronts is expectedly slow.

NGI/ROD</br> https://ggus.eu/ws/ticket_info.php?ticket=87317 (12/10)</br> This ticket against the ROD complained about how old we "let" ticket 85973 get (Brunel's problem with the lcg utils timing out in EMI WNs). The offending ticket is closed, so I think this ticket can be closed too. Waiting for reply (15/10)

https://ggus.eu/ws/ticket_info.php?ticket=86927 (8/10)</br> Sussex's high level of UNKNOWN status in September. John G has involved NGI Ops and Kashif to ask why - I replied. In progress (15/10) SOLVED

https://ggus.eu/ws/ticket_info.php?ticket=86847</br> https://ggus.eu/ws/ticket_info.php?ticket=86846</br> The September availability/reliability tickets for Glasgow & Durham. Both sites have submitted a good explanation; I've asked if the powers that be are satisfied. Waiting for reply (15/10) BOTH SOLVED

EFDA-JET</br> https://ggus.eu/ws/ticket_info.php?ticket=87308 (12/10)</br> Biomed are seeing the default job numbers in gstat, looks like dynamic publishing is broken. Assigned (12/10) SOLVED 14/10

https://ggus.eu/ws/ticket_info.php?ticket=87169 (10/10)</br> "Unsupported Glite Software" ticket. A plan is in the works for upgrading. In progress (10/10)

IC</br> https://ggus.eu/ws/ticket_info.php?ticket=87272 (11/10)</br> An interesting one. LHCB jobs were failing at an Imperial CE due to the node running out of inodes - not something I've seen before. Nothing wrong with this ticket, just it caught my eye. In progress (11/10) UPDATE- Daniela has filed a bug report about this: https://ggus.eu/ws/ticket_info.php?ticket=87264

OXFORD</br> https://ggus.eu/ws/ticket_info.php?ticket=87343 (14/10)</br> Oxford's torque server crashed, and lhcb noticed. Fixed quickly, so the ticket can either be solved or lhcb can be asked if things are okay for them now. In Progress (14/10)

EDINBURGH</br> https://ggus.eu/ws/ticket_info.php?ticket=87171 (10/10)</br> Edinburgh has received a ticket about Unsupported Glite Software at their site. It's being handled though. In Progress (10/10)

GLASGOW</br> https://ggus.eu/ws/ticket_info.php?ticket=87170 (10/10)</br> Similar ticket for t'other side of Scotland. Sam makes some good points about the "Classic SE" witchhunt, and asks for clarity on when the actual deadlines are. In progress (11/10)

LIVERPOOL</br> https://ggus.eu/ws/ticket_info.php?ticket=87167 (10/10)</br> Liverpool's U.G.S. ticket. Steve gave it a cheeky In Progress, but no other news (11/10).

QMUL</br> https://ggus.eu/ws/ticket_info.php?ticket=86306 (22/9)</br> LHCB are still seeing un-purgable jobs at Queen Mary, delivering another list of zombie jobs. Has anyone else seen this? In progress (12/10)

SUSSEX</br> https://ggus.eu/ws/ticket_info.php?ticket=81784 (1/5)</br> That Sussex ticket :-) The nagios problems have been worked around, Brian has been giving advice on Space Tokens. In Progress (11/10)

Solved Cases</br> https://ggus.eu/ws/ticket_info.php?ticket=86753</br> UCL's dpm problems were solved (touch wood) by increasing the VM resources and correspondingly upping the mysql innodb_buffer_pool_size setting. The lesson is don't skimp on your dpm headnode resources!
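For anyone else running a DPM head node on a skinny VM, the knob in question lives in the MySQL config; a sketch, with the path and value purely illustrative (size it to the memory you actually have):

  # /etc/my.cnf on the DPM head node
  [mysqld]
  innodb_buffer_pool_size = 2G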

The Tier-1, Oxford, Cambridge, Bristol, Birmingham and Sheffield all got U.G.S. tickets that they sorted promptly.

No other tickets of interest that I noticed, does anyone have any?

Monday 8th October 15:30 BST</br> 35 tickets this week - maybe this is our new plateau? Here are the highlights; a number of tickets are "network related" and are likely to be slow to resolve, so I've left them out this week - there hadn't been any exciting movement with any of them.

NGI</br> https://ggus.eu/ws/ticket_info.php?ticket=86927 (8/10)</br> The NGI have been ticketed over Sussex's high % of Unknown status. "We would like to kindly ask you to investigate this issue and take an action to decrease Unknown status percentage." "Please close this ticket as a acknowledge that you received this information." Seems like a slightly odd ticket to me (doesn't ask for an explanation). I'll close this ticket after tomorrow's meeting if no one beats me to it. The reason behind the Unknowns is likely the odd network behaviour seen between the Lancaster nagios and the Sussex cream, which we're still investigating.

https://ggus.eu/ws/ticket_info.php?ticket=86995 (8/10)</br> Chris tickets the NGI over getting his new CE monitored. He raises a point in his ticket "shouldn't this happen automatically I add the host to goc.egi.eu?". Assigned (8/10)

https://ggus.eu/ws/ticket_info.php?ticket=86847 (5/10)</br> Glasgow's availability/reliability ticket for September. These tickets sneak up on us so I like to catch them. Assigned (5/10)</br>

https://ggus.eu/ws/ticket_info.php?ticket=86846 (5/10)</br> Durham's availability/reliability ticket. Assigned (5/10)

UCL</br> https://ggus.eu/ws/ticket_info.php?ticket=86753 (4/10)</br> Atlas transfers to UCL failing with SRM errors. Ben noticed some log errors and would like someone in the know to take a look. They seem a little fishy but nothing jumped out at me. In progress (8/10)

IC</br> https://ggus.eu/ws/ticket_info.php?ticket=86633 (2/10)</br> Hone jobs must flow... This ticket was sorted on the day, and can be closed. (2/10)

https://ggus.eu/ws/ticket_info.php?ticket=86426 (26/9)</br> Did your new Dell servers show up? In progress (28/9)

QMUL</br> https://ggus.eu/ws/ticket_info.php?ticket=86306 (22/9)</br> Unkillable lhcb pilots at QM. Daniela suggested a full service stop then running the purge scripts, and asked if the jobs are in the batch system at all. In progress (2/10)

DURHAM</br> https://ggus.eu/ws/ticket_info.php?ticket=86242 (20/9)</br> Biomed running out of space on the Durham CE, due to not cleaning up after themselves. Anyone else have biomed cluttering up their sandbox? I waiting for replied it (4/10)
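If you want to check whether biomed (or anyone else) is cluttering up your own CE, something like the following shows the per-VO usage; /var/cream_sandbox is the usual default sandbox location, but treat the path as an assumption for your setup:

  du -sh /var/cream_sandbox/*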

GLASGOW</br> https://ggus.eu/ws/ticket_info.php?ticket=85025 (9/8)</br> Is this Sno+ WMS ticket effectively replaced with this Sno+ ticket:</br> https://ggus.eu/ws/ticket_info.php?ticket=86704 (3/10)</br> (Different servers, but I believe the WMS in the first ticket is dead.)

Tickets from the UK</br> Mentioned last week, but of interest to anyone installing EMI2 CREAMs:</br> https://ggus.eu/tech/ticket_show.php?ticket=85970</br> The magical resetting database issue.</br> And of interest to anyone planning to be messing about with their accounting:</br> https://ggus.eu/ws/ticket_info.php?ticket=84326</br> Chris has a nice chronicle about the perils of republishing a lot of data.

No exciting solved cases again. I think I'm getting jaded though.

Monday 1st October 14.30 BST</br> 36 Open tickets this week, and it's the start of the month so we get to go over all of them! Maybe every month is a little too often for such a review...

NGI</br> https://ggus.eu/ws/ticket_info.php?ticket=84381 (19/7)</br> COMET VO creation, On Hold pending the other VO creation gubbins (6/9) UPDATE - THE GUBBINS TICKET SUGGESTS THAT THE VO IS IN PRODUCTION https://ggus.eu/ws/ticket_info.php?ticket=85736

https://ggus.eu/ws/ticket_info.php?ticket=82492 (24/5)</br> Chris's VOMS request rejig ticket. On hold until the UK Voms reshuffle is complete, the reminder date (24/9) has passed. (6/9)

TIER 1</br> https://ggus.eu/ws/ticket_info.php?ticket=86570 (1/10)</br> GGUS is moving to a SHA2 certificate on their next release (~24th), and have asked if the SHA2 cert will cause any trouble. Gareth has noted the ticket, but unsure if others will take notice. In progress (1/10)

https://ggus.eu/ws/ticket_info.php?ticket=86552 (30/9)</br> Atlas transfers from/to RAL-LCG2 failed, apparently due to high load at the RAL end. Found to be caused by a database problem. Should be fixed, at risk for a little while longer. In Progress (1/10) SOLVED-ORACLE WORKAROUND PUT BACK IN PLACE

https://ggus.eu/ws/ticket_info.php?ticket=86541 (29/9)</br> Before the above problem atlas transfers were failing with SECURITY_ERRORs. A known FTS bug caused this (https://ggus.eu/tech/ticket_show.php?ticket=81844). Patch applied this morning. In progress (1/10) SOLVED -PATCH WORKED

https://ggus.eu/ws/ticket_info.php?ticket=86152 (17/9)</br> Duncan has ticketed the Tier 1 over packet loss seen on many (not all) Perfsonar tests where the RAL perfsonar is the destination. The RAL chaps are looking into it, but aren't expecting a solution to easily present itself. In progress (19/9)

https://ggus.eu/ws/ticket_info.php?ticket=85077 (13/8)</br> biomed nagios jobs can't register files on srm-biomed.gridpp.rl.ac.uk. An odd problem that only seemed to affect biomed jobs. Looked to be dealt with for a while, but seems to have re-emerged. In progress (24/9)

https://ggus.eu/ws/ticket_info.php?ticket=68853 (22/3/11)</br> SL4 DPM retirement master ticket. On hold but should be In progressed with a view to close (6/9)

DURHAM</br> https://ggus.eu/ws/ticket_info.php?ticket=86578 (1/10)</br> Ops srm-put tests are failing. Only put in this morning though, still assigned. (1/10)

https://ggus.eu/ws/ticket_info.php?ticket=86534 (28/9)</br> Ops wn-rep tests failing. Related to above? Still just assigned (28/8)

https://ggus.eu/ws/ticket_info.php?ticket=86281 (21/9)</br> Another wn-rep related ticket (for a different CE). This one too is just assigned. Are these getting to Mike? (21/9) UPDATE- MACHINES ARE HAVING CERTIFICATE ISSUES

https://ggus.eu/ws/ticket_info.php?ticket=86242 (20/9)</br> Biomed having trouble submitting to cream02, "no space left on device" errors. Not much movement, just in progressed (24/9)

https://ggus.eu/ws/ticket_info.php?ticket=85181 (20/8)</br> One of the last two glite 3.1 retirement tickets. No reply since Daniela asked if the BDII was indeed glite 3.1. In Progress (13/9)

https://ggus.eu/ws/ticket_info.php?ticket=84123 (11/7)</br> High atlas production failure rate at Durham. Durham's rocky summer hasn't helped, but hopefully they're out of the woods(?). On Hold (3/9)

https://ggus.eu/ws/ticket_info.php?ticket=75488 (19/10/11)</br> Ancient compchem ticket. On hold but might not be relevant as all the CEs have been reinstalled (6/9)

https://ggus.eu/ws/ticket_info.php?ticket=68859 (22/3/11)</br> SL4 retirement ticket. It looks like it can be closed, just need some confirmation from someone Durham side. In progress (28/9)

OXFORD</br> https://ggus.eu/ws/ticket_info.php?ticket=86544 (29/9)</br> Problems after running out of atlas pool accounts at Oxford. Probably caused by the lcg-expiregridmapdir bug (I missed the discussion of this, maybe it was offline?); fix in place. Long-term plan to up the number of atlas pool accounts. In progress (29/9)

https://ggus.eu/ws/ticket_info.php?ticket=86106 (14/9)</br> Low atlas sonar rate seen between Oxford & BNL. Ewan has been and is looking into it (17/9)

https://ggus.eu/ws/ticket_info.php?ticket=85968 (10/9)</br> Oxford being bitten by the EMI lcg_utils bug. On hold pending EMI pulling their finger out. (20/9)

LIVERPOOL</br> https://ggus.eu/ws/ticket_info.php?ticket=86542 (29/9)</br> Liverpool suffered a bunch of SRM transfer failures in a short timeframe, no obvious causes found at the time. Were investigating, but were probably interrupted by their unexpected cable bisecting incident today. In progress (29/9). SOLVED - PROBLEM WAS TRANSIENT

https://ggus.eu/ws/ticket_info.php?ticket=86095 (14/9)</br> Liverpool's encounter with the EMI lcg-utils bug mucking up their WN-rep ops tests. On hold, but has been green for a while -maybe just been lucky? (20/9)

BIRMINGHAM</br> https://ggus.eu/ws/ticket_info.php?ticket=86540 (28/9)</br> Atlas transfers to Birmingham failed with "SRM_ABORTED" messages. Mark reports that the VM they are using as a headnode isn't beefy enough to cope with the demand, causing SRM responses to be too slow. He upped the power of the VM but that wasn't a full fix, hoping to get a reinstall in today. A note from atlas this morning mentions that transfers fail for DATADISK but not for PRODDISK, which is odd. Are there any differences in the nature of these transfers? In progress (1/10)

https://ggus.eu/ws/ticket_info.php?ticket=86105 (14/9)</br> One of the tickets clocking poor atlas sonar rates between Birmingham and BNL. Mark and Laurie have looked into this, but not come up with anything conclusive. In progress (19/9)

BRUNEL</br> https://ggus.eu/ws/ticket_info.php?ticket=86533 (28/9)</br> Ops "WN-RepDel" tests failing, likely due to the known EMI WN lcg-utils timing out bug. As Brunel already have a ticket about this issue on a different CE (presumably fronting the same WNs), Daniela asks if the ROD team can sum it up in one ticket rather than multiples. In progress (1/10) CLOSED DUE TO BEING A DUPLICATE

https://ggus.eu/ws/ticket_info.php?ticket=85973 (10/9)</br> The "original" RepDel test failure ticket at Brunel. On hold (awaiting a fix from EMI) (20/9)

IMPERIAL</br> https://ggus.eu/ws/ticket_info.php?ticket=86426 (26/9)</br> Hone have trouble submitting to the Imperial WMSi. Daniela reports that the machines are suffering from being too old (something we can all relate to), replacements should have arrived on Friday but hadn't. Dell report a new delivery date of the 8th. In progress (could be on hold until the kit arrives?) (29/9).

GLASGOW</br> https://ggus.eu/ws/ticket_info.php?ticket=86391 (25/9)</br> Atlas were having staging in problems due to high disk server load. Problems however persisted for a while after the load on the server calmed down. Did things sort themselves out after the weekend? In progress (27/9)

https://ggus.eu/ws/ticket_info.php?ticket=85183 (14/8)</br> One of the last few glite 3.1 retirement tickets. Due to the severe crustiness of the old WMS hardware Glasgow powered it down rather than upgrade (was it only 32-bit hardware?) and are now pondering the next steps. In progress (28/9) UPDATE - DO WE NEED TO REMOVE IT FROM THE GOCDB OR IS DOWNTIME ENOUGH?

https://ggus.eu/ws/ticket_info.php?ticket=85025 (9/8)</br> Sno+ WMS problems at Glasgow. AFAICS the wms in question has been switched off due to the reasons above? It might be useful to make that clear to Sno+! In progress (10/9).

RHUL</br> https://ggus.eu/ws/ticket_info.php?ticket=86383 (25/9)</br> RHUL stopped publishing UserDN accounting after "upgrading" from glite to EMI apel in August. Apel support have been called in, and Daniela suggests checking the FAQ. In progress (1/10)

QMUL</br> https://ggus.eu/ws/ticket_info.php?ticket=86378 (25/9)</br> Hone had jobs waiting "too long" at QM, but the problems disappeared. Along with a bunch of jobs, looks like the QM creams suffered from the database resetting issue (https://ggus.eu/tech/ticket_show.php?ticket=85970, as advertised by Daniela). In progress (27/9)

https://ggus.eu/ws/ticket_info.php?ticket=86306 (22/9)</br> Queen Mary is being swamped by unkillable lhcb zombie pilots. Neither the submitters nor the site admins can do aught about them using "normal" tools. Daniela has suggested some DB queries to try or attempting to use the JobPurger tool (which would be my suggestion too). In progress (1/10). UPDATE- Some success with the JobPurger with a 5 day time frame

https://ggus.eu/ws/ticket_info.php?ticket=85967 (10/9)</br> QM failing ops Apel tests. Chris ticketed apel support for help (https://ggus.eu/ws/ticket_info.php?ticket=84326), but not having much luck due to the sheer size of their DB, and progress interrupted by GridPP last week. Hopefully will break this problem this week. On hold (21/9)

ECDF</br> https://ggus.eu/ws/ticket_info.php?ticket=86334 (24/9)</br> Poor atlas sonar rates between BNL and ECDF. Waiting on moving disk servers to new switches and other general network wizardry scheduled for this week. On hold till then (28/9).

CAMBRIDGE</br> https://ggus.eu/ws/ticket_info.php?ticket=86108 (14/9)</br> Duncan noticed a WAN bandwidth asymmetry at Cambridge. John contacted the local networking guys, who've investigated and found nothing. Still in progress (26/9)

LANCASTER</br> https://ggus.eu/ws/ticket_info.php?ticket=85367 (20/8)</br> ilc were having trouble submitting jobs to one of Lancaster's CEs. Robin tracked the issues to high disk IO load, and we're figuring out some ways of mitigating these problems. In progress (1/10)

https://ggus.eu/ws/ticket_info.php?ticket=84583 (26/7)</br> lhcb jobs failing on a Lancaster CE, originally due to a pool account misconfiguration. The problem has been fixed (probably...) but files don't seem to be being staged in for lhcb and there are no errors (or mention of lhcb at all) in the gridftp logs. Debugging is not being helped by the load issues documented above. In progress (27/9)

https://ggus.eu/ws/ticket_info.php?ticket=84461 (23/7)</br> t2k.org transfers from RAL to Lancaster timing out. We hoped the gateway upgrade would improve things, but we were disappointed. Back to the network investigation. In progress (1/10)

RALPP</br> https://ggus.eu/ws/ticket_info.php?ticket=85019 (9/8)</br> ILC had some adventures due to VO misconfiguration at RALPP, but looks like things are fixed and the ticket can be closed now. In progress (1/10)

SUSSEX</br> https://ggus.eu/ws/ticket_info.php?ticket=81784 (1/5) Emyr wondered last week if this was the longest ticket ever? Sadly I doubt it! The baton has passed oddly enough to Lancaster, as we've come across a bizarre problem whereby communication from the Sussex cream CE (and only the cream CE) is being refused by machines on a specific Lancaster subnet. Sadly this is the subnet where the Lancaster nagios box is sitting. We've ruled out firewalls and had the network chaps at both sides take a look. Traffic is being stopped at the Lancaster end, but by the servers themselves (not the network gateways). I'm currently investigating to see if there's any oddity with our network settings. In progress (26/9)

Ticket of Interest:</br> https://ggus.eu/tech/ticket_show.php?ticket=85970</br> As mentioned above, the ticket documenting the EMI2 cream database "reset" problems.

Solved Tickets</br> Ran out of time for these, but I notice that most of the glite 3.1 tickets are closed and the neurogrid VO has taken off. Good stuff!

Monday 17th September 16:00 BST

36 open tickets this week. Although I can't complain as more than the fair share of them are mine. No ticket update from me next week, as I'm on leave again (I need to learn to take my holidays earlier in the year!). Although with the GridPP meeting next week I'm not sure there will be a Tuesday meeting anyway.

ROD</br> https://ggus.eu/ws/ticket_info.php?ticket=86009 (11/9)</br> Our ROD team is being picked on to justify the August metrics. No response from anyone since the 11th when this ticket was submitted, has this snuck under the radar? Assigned (11/9)

THE GLITE 3.1 PURGE</br> Just 3 tickets left. So we'll just about be finished with this in time to do the same with the glite 3.2 services!</br> https://ggus.eu/ws/ticket_info.php?ticket=85185 CAMBRIDGE. John has turned off the lcg-CE. I think the BDII is done too, so looks like this is a victory.</br> https://ggus.eu/ws/ticket_info.php?ticket=85181 DURHAM. Daniela asks about the glite 3.1 BDII, a service that can't just be turned off. Time is running out. (13/9).</br> https://ggus.eu/ws/ticket_info.php?ticket=85183 GLASGOW. Worryingly quiet. I assume the Glasgow lads are working on it but the rest of us are getting nervous!</br>

TIER 1</br> https://ggus.eu/ws/ticket_info.php?ticket=85077 (13/8)</br> biomed SRM problems. A terse reply from biomed suggests that everything is okay, I asked for clarification and their blessing to close it. Waiting for reply (17/9)

https://ggus.eu/ws/ticket_info.php?ticket=68853 (22/3)</br> The ticket tracking older DPMs. Should be put back in progress with view to closing after the above purge is finished. On Hold. (6/9)

LIVERPOOL</br> https://ggus.eu/ws/ticket_info.php?ticket=86095 (14/9) Liverpool Ops test jobs falling afoul of the lcg_utils bug. I put it On Hold as there's no hope until the bug is fixed. On Hold (17/9)

BRUNEL</br> https://ggus.eu/ws/ticket_info.php?ticket=85973 (10/9)</br> Being bitten by the same bug as Liverpool. On Hold (12/9)</br>

OXFORD</br> https://ggus.eu/ws/ticket_info.php?ticket=85968 (10/9)</br> And again! On Hold (12/9)

Ewan helpfully linked to the likely cause's ticket: https://ggus.eu/tech/ticket_show.php?ticket=85601</br> Although apart from Ewan's cross-referencing the tickets there's been no movement since 29/8

QMUL</br> https://ggus.eu/ws/ticket_info.php?ticket=85967 (10/9)</br> Nagios APEL tests failing during move from glite to EMI. I got the wrong end of the stick here, QMUL are replacing their APEL box with a newer, EMI machine after a publishing error broke APEL at Queen Mary & RAL. John Gordon adds a reminder to be sure that you don't accidentally republish all your data! In Progress (12/9)

https://ggus.eu/ws/ticket_info.php?ticket=80052 (8/3)</br> A ticket from QMUL rather than to it, concerning availability calculation back in March. I think this is some kind of ticket orphan.

GLASGOW</br> https://ggus.eu/ws/ticket_info.php?ticket=85025 (9/8)</br> Sno+ WMS problems. Sno+ supplied some requested information (6/9). No word since. Is this WMS one that will go in the above mentioned glite 3.1 purge? If so then it will be worth mentioning. In progress (6/9)

BRUNEL (again, I didn't want to break my chain above)</br> https://ggus.eu/ws/ticket_info.php?ticket=85011 (9/8)</br> Retirement of their old CE. I believe that this ticket can be closed. In progress (10/9)

SUSSEX</br> https://ggus.eu/ws/ticket_info.php?ticket=81784 (1/5)</br> Things have been quiet on the Sussex front. Jeremy gave a target for being out of downtime by the 21st. Is that looking likely? In progress (10/9)

Solved Tickets</br> All the UserDN accounting tickets are now closed. No other solved tickets jump out at me.

Tickets from the UK</br> https://ggus.eu/ws/ticket_info.php?ticket=85449</br> Winnie had some trouble with a deleted downtime not being taken into account. This has sparked some deliberation and some changes to the way SAM polls the GOCDB, to prevent this happening again. It may be that Winnie will have to open an availability/reliability amendment ticket to take the false downtime into account.

Monday 10th September 15:00 BST.</br> 34 Open Tickets this week. The number is slowly shrinking.

Still no sign of ticket reminders here at Lancaster; did anyone else get round to testing this to make sure it's not just me? Also, is anyone still getting the automated weekly reminder e-mails from GGUS for tickets at your site? I haven't seen one of those in a while either.

Sno+ (after some nudging from Jeremy) have started answering their tickets again. Thanks to Jeremy for some epic ticket wrangling last week in general.

Glite 3.1 Retirement Tickets (14/8)</br> https://ggus.eu/ws/ticket_info.php?ticket=85189 (UCL)</br> Daniela has offered to help, but needs the bare installs set up first. In Progress (6/9)</br> https://ggus.eu/ws/ticket_info.php?ticket=85185 (Cambridge)</br> Plan to switch off the lcg-CEs, but to delay as long as possible as this will mean the loss of 128 job slots. Probably don't want to leave it until the 11th hour though! In progress, Jeremy upped to "Very Urgent" with the rest (29/8)</br> https://ggus.eu/ws/ticket_info.php?ticket=85183 (Glasgow)</br> Not much news from Glasgow, at last check they were trying to debug problems they were seeing with the EMI-1 WMS/LB. Jeremy asks how this is going (and has knocked the status to "In Progress" from "On Hold"). (14/8)</br> https://ggus.eu/ws/ticket_info.php?ticket=85181 (Durham)</br> The cunning plan here is to simply switch off the offending CEs nearer the deadline. In Progress (6/9)</br> https://ggus.eu/ws/ticket_info.php?ticket=80155 (Bristol)</br> Bristol are confident and committed to upgrading by the deadline (6/9)</br>

Brunel, ECDF & RHUL have purged glite 3.1 from their sites; their tickets are nicely closed.</br>

TIER 1</br> https://ggus.eu/ws/ticket_info.php?ticket=85077 (13/8)</br> Biomed nagios errors at RAL when registering files on srm-biomed.gridpp.rl.ac.uk. In an uncertain state since 3/9; Jeremy queried what was going on (6/9).

https://ggus.eu/ws/ticket_info.php?ticket=84492 (24/7)</br> SNO+ were having job matching problems submitting to RAL. Looks like these have been solved (although this uncovered new problems at Glasgow). (6/9)

https://ggus.eu/ws/ticket_info.php?ticket=85023 (9/8)</br> SNO+ WMS problems at RAL. After a very long break waiting for a reply, and a gentle nudge from Jeremy, James from SNO+ has provided some output to help continue the investigation. In Progress (6/9)

GLASGOW</br> https://ggus.eu/ws/ticket_info.php?ticket=85025 (9/8)</br> Sister ticket to 85023. SNO+ have provided some of the requested information. In Progress (6/9)

NGI/RALPP</br> https://ggus.eu/ws/ticket_info.php?ticket=85793 (5/9)</br> As seen on TB-SUPPORT, RALPP request a recalculation of August's availability due to problems with jobs sent by the Lancaster Nagios, not at the site. It looks like it's being worked on. In progress (7/9)

NGI</br> https://ggus.eu/ws/ticket_info.php?ticket=84381 (19/7)</br> COMET VO creation ticket. After some confusion with Imperial's mail system the question was raised of how best to handle tickets to the VOMS team in the future. On hold (6/9)</br> The related ticket for the VO's validation (https://ggus.eu/ws/ticket_info.php?ticket=85736) has stalled due to the AUP not being suitable for use. A new AUP has been requested (7/9)

https://ggus.eu/ws/ticket_info.php?ticket=68853 (22/3/11)</br> Brian's SL4 SE tracking ticket. With the glite 3.1 deadline approaching, should this be set to In Progress (to be closed soon)?

UCL</br> https://ggus.eu/ws/ticket_info.php?ticket=85549 (28/8)</br> The last UserDN publishing ticket (a child of 85547). Still no movement on it. Has a fix been attempted? (28/8)

BRUNEL</br> https://ggus.eu/ws/ticket_info.php?ticket=85011 (9/8)</br> Pheno have given their blessing for shutting down the older SE dgc-grid-50.brunel.ac.uk after Jeremy explained the situation. The ticket looks like it can be closed. In Progress (10/9).

DURHAM</br> https://ggus.eu/ws/ticket_info.php?ticket=68859 (22/3/11)</br> Durham SE tracking ticket. As Durham's DPM is now EMI-1, can this ticket be closed? In progress (6/9)

Tickets from the UK</br> https://ggus.eu/ws/ticket_info.php?ticket=84015</br> Tracking the LSF publishing problems seen at Lancaster. An updated .jar has been received and is undergoing testing.

Monday 3rd September 14:30 BST.</br> 37 Open UK tickets this week. It's the start of the month so it's time for a deep review.

Has anyone else not been receiving their ticket reminders? I haven't for several Lancaster tickets.


UK</br> https://ggus.eu/ws/ticket_info.php?ticket=84408 (20/7)</br> Setting up of the neurogrid.incf.org WMS & LFC. Both have been put in place; Catalin wonders if the LFC can be tested (a quick test sketch is below). Waiting for reply (29/8)</br> https://ggus.eu/ws/ticket_info.php?ticket=80259 (14/3)</br> neurogrid.incf.org creation ticket. Nearly finished now. In Progress (29/8)
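
Not part of the ticket itself, but for anyone wanting to give a newly deployed LFC a quick once-over, the sketch below is one way to do it. It assumes the gLite/EMI LFC client tools (lfc-mkdir, lfc-ls) are installed on a UI with a valid neurogrid.incf.org proxy; the LFC hostname is a placeholder, not the actual endpoint.
<pre>
# Minimal LFC smoke test sketch -- the LFC hostname and test path are placeholders.
import os
import subprocess

env = dict(os.environ, LFC_HOST="lfc.example.ac.uk")  # hypothetical LFC endpoint

# Create a scratch directory under the VO's namespace, then list the VO area back.
subprocess.check_call(["lfc-mkdir", "/grid/neurogrid.incf.org/lfc-test"], env=env)
subprocess.check_call(["lfc-ls", "-l", "/grid/neurogrid.incf.org"], env=env)
</pre>
If both commands succeed the catalogue is at least accepting authenticated writes; a fuller test would register a replica with lcg-cr and read it back.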

https://ggus.eu/ws/ticket_info.php?ticket=68853 (22/3/11)</br> Brian's ticket to track older DPMs in the UK. Still have Durham, Bristol and Brunel to go at last update (but Brunel are retiring their old SE). On Hold (30/7)

https://ggus.eu/ws/ticket_info.php?ticket=84381 (19/7)</br> Setting up the COMET VO. Registering it in the EU Ops Portal (ticket 85736); on hold until this is done (3/9).

https://ggus.eu/ws/ticket_info.php?ticket=82492 (24/5)</br> Chris' ticket to change the reminder periods for the GridPP VOMS server. Assigned to Robert Frank; On Hold during the VOMS transition (28/8)

TIER 1</br> https://ggus.eu/ws/ticket_info.php?ticket=85438 (23/8)</br> atlas were seeing FTS transfer failures from RAL. Some files have been corrupted and replacements may have to be retrieved from tape. Waiting for Reply (31/8)

https://ggus.eu/ws/ticket_info.php?ticket=85077 (13/8)</br> Biomed were seeing their nagios tests fail to register files at RAL, but it looks to be a (peculiar) problem with their SAM jobs. Other units are involved. In Progress (3/9).

https://ggus.eu/ws/ticket_info.php?ticket=85023 (9/8)</br> SNO+ having troubles with one of the RAL WMSi. No reply after request to attempt job submission to lcgwms02. Waiting for Reply (10/8)

https://ggus.eu/ws/ticket_info.php?ticket=84492 (24/7)</br> SNO+ having job-matching problems at RAL. Some odd behaviour, but In Progress (31/8)

GLITE 3.1 Upgrade tickets (14/8):</br> https://ggus.eu/ws/ticket_info.php?ticket=85189 (UCL) In Progress (29/8)</br> https://ggus.eu/ws/ticket_info.php?ticket=85185 (CAMBRIDGE) In Progress (29/8)</br> https://ggus.eu/ws/ticket_info.php?ticket=85183 (GLASGOW) On hold (14/8)</br> https://ggus.eu/ws/ticket_info.php?ticket=85181 (DURHAM) In Progress (On hold?) (14/8)</br> https://ggus.eu/ws/ticket_info.php?ticket=85179 (Brunel) In Progress (22/8)

UK/SAM/GOCDB</br> https://ggus.eu/ws/ticket_info.php?ticket=85449 (23/8)</br> Bristol cancelled an ongoing downtime but weren't brought out of it by the system, thus penalising them. Winnie is out to find the cause of the problem and get back the lost uptime. Reset to "In Progress" after some ticket tennis (3/9)

PHENO/BRUNEL</br> https://ggus.eu/ws/ticket_info.php?ticket=85011 (28/8)</br> Pheno seem to be surprised that they have data on the retiring Brunel SE. In Progress (28/8)

SUSSEX</br> https://ggus.eu/ws/ticket_info.php?ticket=81784 (1/5)</br> The Sussex Certification Chronicle. Jeremy wants to push getting Sussex out of downtime this week to avoid having to re-certify. In Progress (3/9)

UCL</br> https://ggus.eu/ws/ticket_info.php?ticket=85467 (24/8)</br> Atlas transfer errors to UCL. Clock skew on the head node took some of the blame, but more failures are still being seen, with "Error reading token data header" messages. In Progress (30/8)
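
As an aside, clock skew like the one blamed above is cheap to check for. The sketch below (just an illustration, not part of the ticket) asks an NTP server directly for the time and prints the offset from the local clock; the server name is an example and a site would normally point this at its own time source.
<pre>
# Rough SNTP-based clock skew check -- the NTP server name is an example only.
import socket
import struct
import time

NTP_EPOCH_OFFSET = 2208988800  # seconds between 1900-01-01 (NTP epoch) and 1970-01-01 (Unix epoch)

def clock_offset(server="pool.ntp.org", timeout=5):
    packet = b'\x1b' + 47 * b'\0'  # minimal SNTP client request (version 3, mode 3)
    sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    sock.settimeout(timeout)
    try:
        sock.sendto(packet, (server, 123))
        data, _ = sock.recvfrom(512)
    finally:
        sock.close()
    # The transmit timestamp (integer seconds) sits at bytes 40-43 of the reply.
    remote = struct.unpack("!I", data[40:44])[0] - NTP_EPOCH_OFFSET
    return remote - time.time()

print("Clock offset: %.1f seconds" % clock_offset())
</pre>
A large offset is known to upset GSI authentication, which is presumably why the head node clock took the initial blame here.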

https://ggus.eu/ws/ticket_info.php?ticket=85549 (28/8)</br> Last of the User DN accounting tickets (the last child of 85547). In Progress (28/8)

DURHAM</br> https://ggus.eu/ws/ticket_info.php?ticket=85679 (31/8)</br> se01 failing Ops tests.</br> https://ggus.eu/ws/ticket_info.php?ticket=85731 (3/9)</br> ce01 failing APEL Pub tests.</br>

https://ggus.eu/ws/ticket_info.php?ticket=84123 (11/7)</br> atlas production failures. On hold as Mike expects slow progress (3/9).</br> https://ggus.eu/ws/ticket_info.php?ticket=83950 (7/7)</br> lhcb cvmfs errors. On hold (7/8)

https://ggus.eu/ws/ticket_info.php?ticket=68859 (22/3/11)</br> SE Upgrade ticket. Probably should be On Hold. (28/8).</br>

https://ggus.eu/ws/ticket_info.php?ticket=75488 (19/10/2011)</br> CompChem job failures at Durham. On hold due to the other problems, but once Durham is out of the woods it's worth checking whether the problem persists. (8/8).

GLASGOW</br> https://ggus.eu/ws/ticket_info.php?ticket=85025 (9/8)</br> SNO+ were having problems with one of the Glasgow WMSs (twinned ticket to 85023). Stuart asked for the FQAN used for the jobs as the problems seemed VOMS-related, but there's been no news since. Waiting for Reply (10/8)

https://ggus.eu/ws/ticket_info.php?ticket=83283 (14/6)</br> LHCB seeing a high rate of job failures, likely to be caused by cvmfs. Glasgow upgraded all their nodes to the latest cvmfs but failures are still seen on the "high-core" nodes, correlated with high numbers of atlas jobs starting up. Investigation continues. In Progress (30/8)
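
For anyone chasing similar cvmfs trouble, a quick health sweep can be scripted around cvmfs_config probe. The sketch below is only an illustration: the worker node names are made up, and it assumes passwordless ssh to the nodes.
<pre>
# Rough cvmfs health sweep -- the node names are placeholders for real worker nodes.
import subprocess

NODES = ["node001", "node002", "node003"]  # hypothetical high-core worker nodes

for node in NODES:
    # cvmfs_config probe tries to mount and check every configured repository.
    result = subprocess.call(["ssh", node, "cvmfs_config", "probe"])
    print("%s: %s" % (node, "OK" if result == 0 else "PROBE FAILED"))
</pre>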

OXFORD</br> https://ggus.eu/ws/ticket_info.php?ticket=85496 (25/8)</br> LHCB had job failures that were not cvmfs-related (they reckoned a lack of 32-bit gcc rpms or some OS difference). The problem seemed to evaporate though; did anything change? In progress, can probably be closed (31/8)

IC</br> https://ggus.eu/ws/ticket_info.php?ticket=85524 (27/8)</br> Hone had problems submitting jobs through the Imperial WMSs due to "System load is too high" errors. Some magic was worked, and Hone see a massive improvement and propose to close the ticket. Can be closed (31/8).

LANCASTER (to my shame)</br> https://ggus.eu/ws/ticket_info.php?ticket=85412 (22/8)</br> JobSubmit tests failing on one of Lancaster's CEs. With help from LCG-SUPPORT this was tracked to a desync between ICE on the WMS & the CREAM CE. The best solution is a CREAM reinstall, which is being planned. On hold (3/9)

https://ggus.eu/ws/ticket_info.php?ticket=85367 (20/8)</br> Lancaster's other CE isn't working well for ILC. Would like to reinstall, but will wait until ticket 85412 is solved. On hold (3/9)

https://ggus.eu/ws/ticket_info.php?ticket=84583 (26/7)</br> Similarly LHCB are having problems on the same node. Lancaster is suffering a ticket pileup. On hold (3/9)

https://ggus.eu/ws/ticket_info.php?ticket=84461 (23/7)</br> T2K transfers fail from RAL to Lancaster. Looks to be a networking problem. With new routing to be put in place soon, hopefully this problem will disappear, as it has so far eluded understanding. On hold (3/9)

BRISTOL</br> https://ggus.eu/ws/ticket_info.php?ticket=85286 (17/8)</br> CMS transfers to Bristol failing. Winnie tracked it to a maxed-out data link. In Progress (20/8)</br> https://ggus.eu/ws/ticket_info.php?ticket=80155 (12/3/11)</br> SE upgrade ticket. Bristol are prepping for the upgrade, with a test server. On hold (17/8)

RALPP</br> https://ggus.eu/ws/ticket_info.php?ticket=85019 (9/8)</br> ILC were having problems running jobs at RALPP. It needed a lot of configuration work, but progress has been made. In Progress (23/8)

RHUL</br> https://ggus.eu/ws/ticket_info.php?ticket=83627 (27/6)</br> Biomed are seeing negative published space, a repeat of ticket 81439. Despite great efforts this remains unsolved so far. On hold (31/8)
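
For tickets like this one it can be handy to see exactly what the information system is publishing for the VO's storage areas. The sketch below (assuming the python-ldap package is available; the BDII hostname is made up, not RHUL's real endpoint) pulls the GlueSA space attributes for biomed from a site BDII so that any negative numbers show up directly.
<pre>
# Query a site BDII for biomed storage area space figures.
# The BDII hostname is a placeholder.
import ldap

conn = ldap.initialize("ldap://site-bdii.example.ac.uk:2170")
results = conn.search_s(
    "o=grid", ldap.SCOPE_SUBTREE,
    "(&(objectClass=GlueSA)(GlueSAAccessControlBaseRule=*biomed*))",
    ["GlueSAStateAvailableSpace", "GlueSAStateUsedSpace", "GlueSATotalOnlineSize"],
)
for dn, attrs in results:
    print(dn)
    for name, values in attrs.items():
        # A negative available space here is exactly what the ticket complains about.
        print("  %s = %s" % (name, b", ".join(values).decode()))
</pre>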

No exciting tickets from the UK or solved UK tickets that I can see this week (this seems to be the case very often, which makes me suspect I'm missing something!).