GDB 10th February 2010

From GridPP Wiki

Latest revision as of 16:32, 10 February 2010

EGEE Middleware Minimum Version Control (15') Nicholas Thackray (CERN)

Recommended baseline versions for services are being put on an official footing, to avoid sites sticking to old versions. Which versions are supported? All versions since the last high-priority update, or all versions of the last N months (N TBD), whichever is more recent. Tickets opened for releases older than the recommended version will be closed until the services have been upgraded. The list of recommended service versions is on the twiki. Sites with outdated versions might eventually be suspended unless they give a good reason. For security updates there might be a 4-week window. The LCG baseline could be different.
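
The support rule above can be sketched in a few lines of Python. This is only one possible reading of the rule, not the official policy: it assumes "whichever is more recent" means taking the later of the two cutoff dates, and it approximates a month as 30 days. The function names `support_cutoff` and `is_supported` are hypothetical, and N is still to be decided, so the value used in any call is purely illustrative.

```python
from datetime import date, timedelta


def support_cutoff(last_high_priority_update, today, n_months):
    """Return the earliest release date that is still supported.

    Assumption: the cutoff is the later of (a) the date of the last
    high-priority update and (b) the start of an N-month window ending
    today, with a month approximated as 30 days.
    """
    window_start = today - timedelta(days=30 * n_months)
    return max(last_high_priority_update, window_start)


def is_supported(release_date, last_high_priority_update, today, n_months):
    """True if a release is recent enough for tickets to be accepted."""
    cutoff = support_cutoff(last_high_priority_update, today, n_months)
    return release_date >= cutoff
```

For example, with a hypothetical N of 6 and a high-priority update on 1 November 2009, a release from December 2009 would be supported on 10 February 2010, while one from September 2009 would not.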

Argus Update (30') Christoph Witzig (Eidgenossische Technische Hochschule Zurich/ETH (ETH)) , Antonio Retico (CERN)

Release updates: authZ is being gradually increased in services in 6 different steps. For now this covers glexec on the WN and centralised user banning. Release 1.1 is in certification. OSCT global banning is operational. Release 1.2 is foreseen for Q2 2010. Argus is being integrated in CREAM.

A few sites have been testing Argus since December. Integration of glexec with SCAS or Argus in the experiment frameworks:

  • Alice: requires several changes in alien and also has some requirements at sites (like a VOBOX and a dedicated queue for the glexec infrastructure). No plan to test it yet.
  • CMS: implementing glexec calls in glide-ins. Plan to test it in mid-February. Planning to test Argus instead of SCAS at cnaf-t1. Glide-ins will be used only at T2 for analysis.
  • Atlas: glexec integrated in Panda with SCAS backend. They are not interested in Argus for now, but will be happy to step in when it works.
  • LHCb: integrated glexec in dirac with SCAS; not interested in Argus.

To use glexec with Argus at the moment, a specific version of lcmaps is needed.

Middleware Update (20') Maria Alandes Pradillo

New release process: there are links in the talk to the documentation describing it. Staged rollout is replacing pre-production. Early-adopter sites test the release in a production environment. Sites commit to potentially downgrading, so the resources must be easy to withdraw from production. There are 15-16 sites in staged rollout, and around 5 are real production sites rather than ex-PPS. Check the release updates for the latest releases in both glite 3.1/SL4 and glite 3.2/SL5. There is still an SL5Planning wiki page. There were problems with glite3.1 bundle0.4 due to dcache. Released at the beginning of February. The new lcg-A 1.33-1 is probably going to be released on 15/2/10. New version of yaim for CREAM: does it fix the problem of services being restarted in the wrong order? Not known. Mostly it fixes the wrong configuration of lcmaps for glexec. Will this process continue after the end of EGEE?

Operational Security (25') Romain Wartel (CERN)

Pakiti and security patching status monitoring. Incident statistics: in order, the major problems were caused by compromised accounts, web applications and weak passwords. Main escalation factors: failure to apply patches and weak passwords. Reasons for not patching: mostly miscommunication, only partially upgraded farms... Pakiti has been improved and extended, and the interface has been redesigned to make it easier to use. It is available from http://pakiti.sourceforge.net/

EGI without ROSCOE Jamie Shiers (CERN)

Some discussion of funding for activities going on at CERN, such as Ganga and experiment support.

Report from the Distributed Database Workshop and Tier1 Service Coordination (30') Maria Girone (CERN)

Discussed at the workshop: a review of the experiments' DB (FTS, LFC, ...) operations and of DB service readiness.

DB recovery exercises often failed at some T1s, with major downtime periods. As a consequence there was a review of the recommendations for configurations and backup policies.

Achievements: the experiments are satisfied with the level of service of the online and offline DBs at T0. Experiments, in particular Atlas online, are interested in a CORAL server.

Further information on the organisation of meetings and community.

Illuminating discussion about the number of meetings sites have to participate in.

Virtualisation Working Group (15') Tony Cass (CERN)

The aim is to enable virtual machine images created at some sites to be used at other HEPiX sites. This was too generic, and it has now been more or less restricted to images created by some big sites (like CERN) or by VOs. Images can be contextualised by local sites, mostly to introduce monitoring. Some sites are interested in connecting the image directly to the experiment workload management system (panda/dirac/glide-ins). 5 areas of work have been identified:

  • Generation of the images and how they can be trusted
  • Transmission of images to sites
  • Expiry and Revocation
  • Contextualisation to local needs
  • Support for multiple hypervisors to guarantee interoperability

F2F meeting at Hepix in Lisbon.

Experiments have their own virtualisation meetings; it is important to synchronise them.

Site Management Issues (15')

List of issues, with the fora where they might be discussed or possible solutions.

Multi-User Pilot Jobs (20') Maarten Litmaath (CERN)

A working group on Google hosts the discussion of possible issues with multi-user pilots. Intermediate summaries are on the wiki. Most slides were already presented in other GDBs. There is a final table of the questionnaire results, displaying the spectrum of responses on glexec use, and conclusions.

The UK sent a detailed list per site. The preliminary conclusion was that there has been a variety of responses, which needs to be taken into consideration if we don't want to lose resources. Some issues need to be clarified, such as how to support local users, though Atlas, for example, can prioritise groups and local vs non-local users. Some sites might not be used for analysis because they don't want to support glexec.

CREAM (30') Antonio Retico (CERN) , Nicholas Thackray (CERN) , Dennis van Dok (NIKHEF)

Sites are encouraged to deploy CREAM, but it is not very popular yet; there are ~50 instances so far.

  • Alice: happy with CREAM.
  • LHCb: have enabled it, but they see a 40% failure rate due to default configuration problems.
  • CMS: considerable testing activity; they found issues with the bookkeeping system. It is again a configuration problem and it is considered a showstopper by CMS production. CREAM 1.6 should solve the problem.
  • Atlas: testing is promising but inconclusive; problems are only found under heavy usage. They require the lcg-CE to be kept, but they also recommend exploring CREAM. One showstopper is that Condor asks for a 24-hour lease, and the job might be queued for most of that time and then be killed because the lease expires.

Release 1.6 should fix 70 bugs, mostly found by the developers. Availability is calculated as an OR over all the CEs. The way to shorten the time needed to find and solve issues is to increase the number of sites that deploy it. Experience at sites like Glasgow indicates that it no longer requires much babysitting and can be installed in production.
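
Two of the points above, availability as an OR over all the CEs and the 24-hour lease problem, can be illustrated with a minimal Python sketch. Everything here is an assumption made for illustration: the function names, the CE labels and the simplified lease model (a lease counted from submission and never renewed) are hypothetical and are not taken from the CREAM, Condor or availability-calculation code.

```python
from datetime import timedelta


def site_available(ce_status):
    """OR over all CEs: the site counts as available if any CE passes.

    ce_status maps a CE name (hypothetical labels) to True/False.
    """
    return any(ce_status.values())


def killed_by_lease(time_queued, time_running, lease=timedelta(hours=24)):
    """Simplified lease model (assumption): the lease starts at submission
    and is not renewed, so a job is killed if queueing plus running
    exceeds the lease duration."""
    return time_queued + time_running > lease
```

With this reading, a site with a failing CREAM CE and a working lcg-CE still counts as available, which is why adding a CREAM CE next to an existing lcg-CE should not lower a site's availability figure. Similarly, a job that waits 20 hours in the queue and then needs 6 hours to run would exceed a 24-hour lease.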

OSG Update (30') Ruth Pordes (FERMILAB)

OSG update

The EGEE-EGI Transition (30')

Monitoring is becoming regional. The operations dashboard uses Nagios and is used by regional teams. The dashboard is still central but has regional views. Regional teams are already in place and they look only at their region's sites. There is a mixed level of readiness. EGEE tasks that are not clear in EGI: failover instances, operations meetings, the dteam VO, the catch-all ROC, SLAs with sites. More details will probably come later. Some ROCs/NGIs wanted to act as NGIs ASAP to find issues and document them. The main issues are with:

  • Staged rollout: sites have to volunteer for this
  • ROCs need effort to test and deploy operational tools
  • Regionalised views
  • EGI.eu is not established and doesn't have associated staff.