GridPP14, discussion session 2
"Running Applications on the Grid" (Why won't my jobs run?)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

Discussion preceded by a talk by Ivan Hollins (Birmingham): "Using the Grid - a user's perspective". See slides for more details.
Using LJSF (ATLAS tool) on top of LCG. Some teething problems; not always clear whether a problem lies with LCG, LJSF or other ATLAS software.

Peter Love: Sending feedback in case of site failures? Hard to tell whether a problem is with the site/job or the collaboration software.
Steve Fisher: can find out about the CPU a job is using. URL in Dave Kant's talk. Linked from monitoring page.
Owen Synge: should not need to know about specific SEs but work instead through the file catalogue.
John Gordon, Stephen Burke: tools don't use info on free space?
Ivan Hollins: need to name the SE in LJSF, otherwise it goes to Castor by default.
Stephen Burke: user support sufficient?
Ivan Hollins: Yves is helpful, also lcg-atlas list, various web pages.

Panel Introductions
^^^^^^^^^^^^^^^^^^^

James Werner (Manchester)
Developing job submission system for BaBar.
Working with benchmarks: (1) tau -> pi pi0, calculating inv mass of 1 to 4 pi0; (2) looking for deuterons using dE/dx.
Hundreds of jobs, millions of events in each.
EasyGrid job submission worked well. Provides output or abend/abort listings if job/grid fails.
Various sources of errors (see slides), e.g. sometimes use only 15% CPU because of long time to access data.
Limit on parallel requests for RLS/SE files.
RB can only deal with 3 submissions/minute.
Policy leads to jobs being queued for a long time.

David Grellscheid (Durham)
Applications person for PhenoGrid.
Setting up VO finally worked in last 2-3 weeks. Got very good support.
Problems have been with integrating different parts.
VO has ~10 members, all with slightly different software needs.
Where is the Fortran compiler? Job for VO manager or sites? If sites, how should they publish this information?
Members are often the only grid user at their site.
Need local UI installation.

James Catmore (Lancaster)
ATLAS b-phys group, like Ivan.
Happy with LCG framework. Fairly intuitive/reliable commands, site ranking etc.
Not grid fabric causing problems but getting users' software to run on the Grid: libraries, environment variables etc.
Grid easier to use than LSF batch system!
User manual off-putting for beginners. Users put off by instructions for getting a certificate. Need a 4-5 line overview.
Would like wildcard system when listing files.
Points of contact: who to contact with problems? CERN/RAL?

Steve Lloyd (QMUL)
Chair of Collaboration Board, Tier-2 Board.
Writing ATLAS workbook - getting started with ATLAS software, running HelloWorld on lxplus and grid.
Most user problems seem to be with data handling and cataloguing. Where should I put my data?

Gianfranco Sciacca (UCL)
Installing middleware for local site. Started recently as user.
Getting certificate and joining VO is frustrating.
Porting RTT to grid. First attempts basically successful but only using local site. Will probably have more questions when submitting to many sites.
Problems with jobs so far only connected to RB failures.

Dave Colling (Imperial)
Running CMS MC production recently. Runs reasonably well.
Most problems are with getting data back to RB. Also misconfigured sites.
There are some incompatibilities between particular sites and particular SEs, e.g. jobs at CNAF ran but output was lost.

Giuliano Castelli (RAL)
BaBar. Intensively using grid for MC production. Not many problems.

User experience getting grid certificates
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

Jens Jensen: Working with grid support centre to document certificate procedure. Can't be entirely user-friendly because must meet requirements such as proving ID with photo.
James Catmore: different organizations and countries have different procedures and documentation. Hard to provide single point of contact.
Chris Brew: We should provide clear instructions for the UK CA and encourage other CAs to do the same.
James Catmore: the b-phys group has only a few UK members, who get asked for help by non-UK members.
Stephen Burke: Are problems mainly technical (certificates/browsers) or operational (delays)?
Gianfranco Sciacca: Waited 2-3 weeks for an e-mail that didn't arrive because someone had forgotten to reply. Don't know after clicking "submit" whether the request has gone through successfully.
Stephen Burke: Got a certificate within a day. Why does it sometimes take so much longer for other users?
Show of hands: ~1/3 of people present took longer than a week to get a certificate. Largely due to delays dealing with local RAs.
Need more automation. User should get confirmation that request has been submitted, e.g. automatic e-mail to user and local RA so user knows who is supposed to be dealing with the request. People at GridPP14 know who to phone to chase things up, but other users don't.
Andrew McNab: process should start with user contacting local RA rather than going to CA web site.
Stefan Stonjek: should have single web form, and automatically identify correct RA from IP address.
David Colling: New PhD students at Imperial will get a certificate as a matter of course.
John Gordon: list of RAs is on web: http://www.grid-support.ac.uk/content/view/31/37/

Problems with file transfer to SEs
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

Problem raised by David Colling: transfers fail from particular sites to particular SEs with error "not a recognised protocol". Ticket has been raised.
Owen Synge: General problem: storage generally involves file transfer between sites. What about monitoring these transfers according to the sites involved?
Roger Jones: D Colling and R Jones both interested in monitoring rates for real file transfers. Could be extended as a diagnostic tool.
David Colling: These failures wouldn't get as far as putting an entry in a log. Wouldn't see the info unless you look at the job logs.
RJ: Site Functional Tests?
DC: those monitor transfers from a site, not between sites.
RJ: need NxN matrix - people might object!
ATLAS is using a site whitelist, but mainly for brokering problems.

Java GUI for job tracking
^^^^^^^^^^^^^^^^^^^^^^^^^

Henry Nebrensky sent a mail about this to the TB-Support mailing list after GridPP13:
http://www.jiscmail.ac.uk/cgi-bin/webadmin?A2=ind0507&L=tb-support&T=0&F=&S=&P=2217
PhenoGrid wanted to try this but stopped when it demanded complete read/write access to the home directory.

User support list
^^^^^^^^^^^^^^^^^

Chris Brew: need TB-Users list? Most users shouldn't need to follow everything on TB-Support.
Dave Kelsey: something like GridPP-announce could be useful.
Chris Brew: not an expert forum but users comparing problems.

VO software support
^^^^^^^^^^^^^^^^^^^

Peter Love: Re compilers etc., is there a list of basic requirements?
Chris Brew: e-mail the site admin and they will probably install what you want if it only involves "yum install" with the standard repositories. Otherwise it is your responsibility. VO can publish a tag. Part of the VO's validation procedure should be checking for required packages.

Summary (Dave Kelsey)
^^^^^^^^^^^^^^^^^^^^^

A number of people say things are working well - pleasant surprise - easier than LSF!
VO setup and requirements: don't want each VO to have to talk to each site. VO should provide a list of requirements for a site to support the VO.
Certificates: need to improve the situation. Once over this hurdle, using the grid is plainer sailing.
Data management issues are more of a problem than job or RB problems.
How to get information to the user re failures and support channels.
Monitoring real file transfers would be an interesting addition.