GDB 9th June 2010


GDB 090610

Notes by Pete Gronbech. Agenda: http://indico.cern.ch/conferenceDisplay.py?confId=72057


John Gordon opened the meeting.

There will be no GDB in July; the next GDB is scheduled for August 11th.

Future meetings: Data Management Jamboree; EGEE III final review; LHC OPN; WLCG Workshop; EGI Technical Forum, Amsterdam, 13-17 September; CHEP'10, Taipei, 17-22 October; HEPiX, Cornell, 1-5 November.

GDB issues: multi-user pilot jobs, SCAS, glexec, CREAM, SL5, etc. Installed capacity: sites need to look in gstat to check that it is being reported correctly; there is a BDII browser in gstat.

Availability Monitoring

Nagios monitoring: there is a proposal to stop running the OPS SAM tests on 15th June if the May results are acceptable. This will be considered at the MB in two weeks.

Jamie Shiers spoke on EGI

EGI-InSPIRE: the operations groups are starting to forge a plan for meetings and approaches. One of the most relevant parts is SA1, but for SA3 also see https://wiki.egi.eu/wiki/WP6:_Services_for_the_Heavy_User_Community (it covers shared services like dashboards and Ganga).

WP6 HEP manpower: management by Jamie; 1.86M euros from the EU. Various CERN fellows are starting in July. The money is fixed; the time may be not quite so fixed. He would like to make the reporting as lightweight as possible. Status should be presented at the EGI Technical Forum in Amsterdam.

Try to link in with other work, e.g. Earth Science and Physics for Health. Work can be shared, e.g. tools like Ganga.

IEEE Nuclear Science Symposium. Lead partner to coordinate the work.

Much less money: 1.8M euros instead of 5M, and no ROSCOE. The 1.8M is what CERN gets for SA3, etc.

Stephen Burke Job Matching

LHCb raised a question about installed capacity. The installed capacity document, coordinated by F. Donno, was approved last year.

The GLUE 1 schema had a lot of constraints. Very informative talk from Stephen Burke: http://indico.cern.ch/getFile.py/access?contribId=6&sessionId=0&resId=0&materialId=slides&confId=72057 A bug in early YAIM published CPUScalingFactorSI00 instead of CPUScalingReferenceSI00.

APEL now reading power from CECapability instead of SubCluster.

The solution to LHCb’s problem is about 15 lines of JDL. There were various complaints that it is very ugly and complicated. How can we have a correct, simple solution? Trying to set up a sub-group to work on this: Maarten Litmaath, Steve Traylen, Stephen Burke, etc.

The aim is to publish information in order to provide the WLCG management with a view of the total installed capacity and of the resource usage by VOs at sites.
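As a quick way for a site to check what it is actually publishing for the attributes above, a minimal sketch follows. It is not from the talk: the site BDII hostname and LDAP base are placeholders, and it assumes the standard ldapsearch client is available on the path.

 import subprocess
 
 # Placeholder site BDII endpoint and LDAP base; substitute your own site name.
 BDII_HOST, BDII_PORT = "site-bdii.example.org", "2170"
 BASE = "mds-vo-name=MY-SITE,o=grid"
 
 # Ask the BDII for the CE entries and print the CPUScalingReferenceSI00 values
 # carried in GlueCECapability (the attribute APEL now reads, per the talk).
 cmd = [
     "ldapsearch", "-x", "-LLL",
     "-h", BDII_HOST, "-p", BDII_PORT,
     "-b", BASE,
     "(objectClass=GlueCE)",
     "GlueCEUniqueID", "GlueCECapability",
 ]
 out = subprocess.run(cmd, capture_output=True, text=True, check=True).stdout
 
 for line in out.splitlines():
     if line.startswith("GlueCEUniqueID:") or "CPUScalingReferenceSI00" in line:
         print(line)

A CE that still publishes the old CPUScalingFactorSI00 name, or nothing at all, shows up immediately in the output.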

Cyril L’Orphelin gave a talk on new features of the Operations Portal.

https://operations-portal.in2p3.fr/ He went through the architecture: the portal pulls data from sources such as GOCDB, SAM, gstat, Nagios and the VO ID cards to populate broadcasts, dashboards and so on. A new broadcast tool is planned, as is a new VO ID card.


Lunch

Site reports summarized by John Gordon

KIT: saw higher WAN and LAN rates than ever before, even higher than during STEP09.

It is a GPFS site; the number of servers was scaled up from 3 to 5, but this did not help for LHCb, as it increased the network traffic.

GS: all sites are special in some way or other, e.g. the RAL rfio issues (32-bit rfio libs were needed). Tier-2 support: in general the T1s say the Tier-2s can cope and don't need much support; the experiments' view is different.

The gLite 3.2 (SL5) BDII is unstable. Networking: if sites are being bombarded by transfers they can limit the number of channels to their site. Some of the T2 sites in the UK are now saturating their gigabit links, so rather than seeing 1 Gbit/s from each of the 4 distributed T2s, we may now have 1 Gbit/s from each of 20 T2 sites, which would saturate RAL's total capacity.

Some low rates between Lyon and BNL. Difficult to diagnose.

Talk about having monitoring dashboards for WAN links.


RAL: memory limits on the batch farm. Graeme Stewart says only 1 ATLAS job in 1000 will require more than 2 GB; these should fail in a clean way so that they can be sent to a special high-memory queue.

Conclusions: LHCb issues remain around the storage area; analysis at the T1s is OK; there are software issues to address; T2 support is OK. In the current financial difficulties it is difficult to justify new purchases if nodes are not fully occupied.


Middleware

CREAM FTS

YAIM will be transferred from CERN to INFN.

The new APEL version uses ActiveMQ instead of R-GMA. A new node type replaces the MON box; the release notes have details of how to transfer the database. JG would like some sites to try it in a staged-rollout manner, so that they can be carefully watched.

a) FTS 2.2.4 is coming soon.

b) gLite releases: until the EMI/EGI process is up and running, releases will be driven by WLCG. This is documented at https://twiki.cern.ch/twiki/bin/view/EGEE/DevelopersGuide and https://twiki.cern.ch/twiki/bin/view/EGEE/ProdInt. Francesco Giacomini is the gLite Release Manager and runs the weekly EMT meetings; Maria Alandes collects feedback for WLCG priorities and presents progress at the Tier-1 Coordination meetings: https://twiki.cern.ch/twiki/bin/view/EGEE/LCGprioritiesgLite. See the talk slides from slide 5 onwards for release news.
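Purely to illustrate the APEL transport change mentioned above (ActiveMQ rather than R-GMA), here is a minimal sketch of a STOMP exchange with a message broker. The broker host, port and queue name are placeholders, and this is not the actual APEL publisher code.

 import socket
 
 # Placeholder broker endpoint and queue; 61613 is the usual ActiveMQ STOMP port.
 BROKER_HOST, BROKER_PORT = "msg-broker.example.org", 61613
 QUEUE = "/queue/example.accounting.test"
 
 def stomp_frame(command, headers, body=""):
     """Build a minimal STOMP 1.0 frame: command, headers, blank line, body, NUL."""
     lines = [command] + [f"{k}:{v}" for k, v in headers.items()]
     return ("\n".join(lines) + "\n\n" + body + "\x00").encode()
 
 with socket.create_connection((BROKER_HOST, BROKER_PORT)) as s:
     s.sendall(stomp_frame("CONNECT", {}))  # open a session with the broker
     s.recv(4096)                           # expect a CONNECTED frame back
     # Publish one dummy record; a real APEL record would be a structured
     # accounting summary, not this placeholder text.
     s.sendall(stomp_frame("SEND", {"destination": QUEUE}, "example accounting record"))
     s.sendall(stomp_frame("DISCONNECT", {}))

In production the publisher and broker would of course authenticate each other; this sketch omits that entirely.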

Discussion on gLite 3.1 services and what can be phased out – a list will be generated soon (the first EMI release will be March/April 2011).

glexec: Nagios OPS testing for:

LCG CE: https://samnag010.cern.ch/nagios/cgi-bin/status.cgi?servicegroup=SERVICE_CE&style=detail

CREAM: https://samnag010.cern.ch/nagios/cgi-bin/status.cgi?servicegroup=SERVICE_CREAM-CE&style=detail

The experiments need to import the glexec tests into their Nagios frameworks to confirm that they work for them.

For us to check:

Q: What is the current status of CREAM deployment?

Q: What ARGUS/SCAS services have so far been installed?

Q: Which sites believe they are glexec ready?

GLEXEC for MUPJ

Maarten Litmaath is testing glexec at sites, with mixed results.
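As a reminder of what these tests exercise, below is a rough sketch of the multi-user pilot job pattern, assuming glexec installed at a typical gLite location and using the standard GLEXEC_CLIENT_CERT / GLEXEC_SOURCE_PROXY environment variables; the paths and the payload command are placeholders.

 import os
 import subprocess
 
 # Placeholder paths for illustration only.
 GLEXEC = "/opt/glite/sbin/glexec"          # typical gLite install location (assumption)
 PAYLOAD_PROXY = "/tmp/payload_user.proxy"  # proxy delegated by the payload owner
 
 def run_payload(cmd):
     """Run a payload command under the identity mapped from the payload proxy.
 
     The pilot points glexec at the payload owner's proxy; glexec consults the
     site authorization system (SCAS or ARGUS via LCAS/LCMAPS) and, if allowed,
     executes the command as the mapped local account.
     """
     env = dict(os.environ)
     env["GLEXEC_CLIENT_CERT"] = PAYLOAD_PROXY   # credential used for the mapping decision
     env["GLEXEC_SOURCE_PROXY"] = PAYLOAD_PROXY  # proxy handed over to the payload
     return subprocess.run([GLEXEC] + cmd, env=env).returncode
 
 if __name__ == "__main__":
     # e.g. confirm that the payload really runs under a different account
     print("glexec exit code:", run_payload(["/usr/bin/id"]))

In the successful case the id output shows the mapped account of the payload owner rather than the pilot's account.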

The lxplus alias will be changed to point to SL5 by default on 21st June 2010.


Chat Window:

[10:08:04] Jeff Templon the workaround works for now

[10:08:36] Jeff Templon the glue 2 static info should be made available as quickly as possible since this attribute is now an important one!

[10:08:52] Jeff Templon and the new classads should be added to the WMS as well, but with less priority than the glue 2 in my opinion

[10:09:06] Stephen Burke But this attribute is in the dynamic (CE) info

[10:09:28] Jeff Templon can we make just that change ...?

[10:09:34] Stephen Burke The thing we could have quickly is the glue 2 equivalent of the subcluster (ExecutionEnvironment)

[10:18:16] Jeff Templon some sites don't use APEL so they may have already changed

[10:18:41] Jeff Templon we need to not tell them "carry on" but exactly what should be published where, it needs to be very explicit

[10:19:20] Jeff Templon ok

[10:19:36] Massimo Sgaravatto There is also an issue with batch system support

[10:19:53] Massimo Sgaravatto see the mail sent to the glite consortium

[10:21:21] Stephen Burke what mail?

[10:21:52] Massimo Sgaravatto mail sent by Francesco Giacomini

[10:22:29] Massimo Sgaravatto Basically most of the partners responsible for batch system support (info provider, yaim) are not part anymore of EMI

[10:22:31] Pete Gronbech Question about APEL: If it now uses the CPU scaling reference rather than GlueHostBenchmarkSI00, results will be different if the Scaling reference is the minimum CPU power rather than the average

[10:23:58] Stephen Burke Yes, but if it's a big effect sites should use scaling

[10:24:19] Stephen Burke Otherwise jobs have to assume the worst, i.e. they may land on the least powerful node

[10:25:26] Stephen Burke Also it shouldn't be a change because sites were always recommended to put the minimum in BenchmarkSI00
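To put numbers on the point being discussed here, a small illustration with made-up figures (ignoring the detail of exactly how APEL applies the factor; the accounted work simply scales with whichever reference value the site publishes):

 # Hypothetical site with two node types, rated 1500 and 2500 SI2K per core.
 node_ratings = [1500, 2500]
 avg_rating = sum(node_ratings) / len(node_ratings)   # 2000 SI2K (average)
 min_rating = min(node_ratings)                        # 1500 SI2K (least powerful node)
 
 cpu_hours = 10.0  # raw CPU time of one job, in hours
 
 # Accounted (normalised) work if the published reference is the average vs. the minimum.
 work_avg = cpu_hours * avg_rating   # 20000 SI2K-hours
 work_min = cpu_hours * min_rating   # 15000 SI2K-hours
 
 print(f"average reference: {work_avg:.0f} SI2K-hours")
 print(f"minimum reference: {work_min:.0f} SI2K-hours")
 print(f"reduction when using the minimum: {1 - work_min / work_avg:.0%}")  # 25%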

[13:08:12] Gonzalo Merino john, PIC also sent feedback

[13:08:31] Gonzalo Merino you answered my e-mail

[13:08:35] Gonzalo Merino no problem

[13:11:47] Jeff Templon John, I think NL-T1 was the first site to give input, and is missing from your list on page 3!!!!

[13:12:06] Jeff Templon maybe we were so much earlier than the others that you forgot

[13:13:28] Gonzalo Merino Jeff, I think our mistake was to edit the subject of the mail and insert there our site name.

[13:14:02] Graeme Stewart It's not true for ATLAS

[13:14:11] Graeme Stewart RAW goes by T1 share

[13:17:22] Graeme Stewart there was a bug in athena

[13:17:37] Graeme Stewart now corrected (not closing file handles promptly)

[13:45:26] Derek Ross RAL have something similar

[14:02:42] Jeff Templon single VO is much easier!

[14:05:12] Jeff Templon i would be interested in knowing how to get torque to overcommit memory .. i could never get it to work!

[14:08:51] Jeff Templon are you sure you have no users contacting the CE with many jobs using globus-job-run? this could cause that effect

[14:11:01] Massimo Sgaravatto infoprovider also query the batch system, right ?

[14:11:24] Jeff Templon yes but this has a much lower frequency

[14:11:43] Jeff Templon order once per minute

[14:12:10] Massimo Sgaravatto this was my understanding as well

[14:13:08] Jeff Templon globus-job-run on CE makes one jobmanager per batch job ... this generates high query load. use of WMS goes through condor-G which has one jobmanager per user/WMS combo ... much lower load of qstat queries. Maarten would know best.

[14:13:50] Massimo Sgaravatto How did they find 1 query/sec ?

[14:14:02] Massimo Sgaravatto Are qstat commands logged somewhere ?

[14:16:49] Jeff Templon no, at least not at the default log level

[14:17:26] Massimo Sgaravatto Would be good to know how they found that info. This should help finding out where these queries come from

[14:20:54] Pete Gronbech You say APEL is a new node type instead of the mon box. Is it intended that the apel node type is run on the ce's? or a new system called the apel box?

[14:21:56] Stephen Burke the latter

[14:22:10] Pete Gronbech so we are renaming the mon box!!

[14:22:12] Andrew Elwell glite-APEL

[14:22:19] Andrew Elwell and removing rGMA

[14:22:46] Stephen Burke the name "mon" goes back to datagrid!

[14:38:03] Stephen Burke WN and UI also affect users ...

[14:41:57] Graeme Stewart I was suggesting that raising this through the MB would help

[14:56:20] Massimo Sgaravatto What does DB error/DB exception mean ? Which DB ?

[14:57:52] Massimo Sgaravatto Ok thanks. If I know some more details I can try to help if needed

[14:59:23] Jeff Templon this is what maarten sees on CREAM DB exception

[14:59:24] Derek Ross If I can get the failed job id I'll take a look at the logs

[14:59:59] Jeff Templon the CREAM DB cwd error is usually related to authorization problems getting into the CE itself, it never makes it as far as glexec

[15:01:44] Jeff Templon cwd error can be : mapping somehow failed, like you get mapped to ops010 but ops010 does not exist

[15:02:07] Jeff Templon or it could be that the cream sandbox area does not exist or is not writable

[15:02:33] Jeff Templon or it could be that the user in question (ops010) does not have sufficient permission to write in the sandbox area ...

[15:02:52] Jeff Templon glexec only gets tested if this cwd does not happen.

[15:03:13] Jeff Templon cwd error does not happen, sorry.

[15:03:48] Jeff Templon cwd error

[15:03:52] Jeff Templon yes!

[15:05:11] Jeff Templon see

[15:05:13] Jeff Templon http://www.nikhef.nl/pub/projects/grid/gridwiki/index.php/CREAM_CE_Troubleshooting

[15:07:48] Massimo Sgaravatto What do you mean with cwd error? The ugly 'working directory cannot be null'?

[15:08:48] Jeff Templon i suspect that is what he means. i saw one for ops in our logs

[15:08:53] Jeff Templon don't understand why ...

[15:09:10] Jeff Templon must have been something transient

[15:09:32] Massimo Sgaravatto Ok. I Guess Maarten will provide some more details

[15:09:41] Jeff Templon did you check the link?

[15:09:52] Jeff Templon i will add stuff as i find it

[15:10:18] Massimo Sgaravatto The samnag010 page ?

[15:10:30] Jeff Templon no, the link to www.nikhef.nl/ wiki page

[15:10:33] Jeff Templon a few lines back

[15:10:36] Jeff Templon in the chat window

[15:11:37] Massimo Sgaravatto Yes

[15:12:36] Massimo Sgaravatto http://grid.pd.infn.it/cream/field.php?n=Main.ErrorMessagesReportedByCREAMToClient#rollback

[15:13:50] Jeff Templon you might want to make clear on this page, that a 'glexec problem' is here different than the ones reported by Maarten