GridPP 17th Collaboration Meeting, NeSC
---------------------------------------

Panel Discussion Session 6 - "Site Installation and Management"
---------------------------------------------------------------

Thursday 2nd November 2006, 11:15-12:30
---------------------------------------

Chair: John Gordon (JG)

Panel Members:
  Tony Cass (TC), CERN
  Pete Gronbech (PG), Oxford
  Dave Kelsey (DK), RAL
  Winnie Lacesso (WL), Bristol
  Colin Morey (CM), Manchester
  Mark Nelson (MN), Durham
  Derek Ross (DR), RAL Tier-1
  Graeme Stewart (GS), Glasgow
  Steve Thorn (ST), NeSC
  John Walsh (JW), Trinity College Dublin

Comments from Audience:
  Alessandra Forti (AF), Manchester
  Neil Geddes (NG), NGS
  Paul Millar (PM), Glasgow
  Olivier van der Aa (OvdA), LT2

Identified Issues:
- Security
- Need for a dashboard of site status/performance
- Better dissemination of knowledge: more documentation or greater use of
  the GridPP Wiki
- Why, when something fails, do we have to look through logs for 8-10
  different sub-systems to find the cause?
- Inability to automate the reconfiguration of a VO's privileges at a site
  when the VO's requirements change
- Do we want to develop a Simple Grid Installer Infrastructure package, as
  GS suggests, rather than each site developing its own install method?
  Alternatively, should all sites be using the same Quattor release, as TC
  encouraged?

Actions:
- JG to circulate details of the group being established to develop a site
  status "dashboard".

Contents:

JG (Chair) showed the summary slide of the equivalent session from the
Birmingham Collaboration Meeting (GridPP14, September 2005). Following this,
each panel member gave their initial thoughts or some background information
about their site.

MN (Durham): Relatively small site. 53 nodes (106 processors) and 3.5 TB of
storage. Deployed SL 3.0.4 via KickStart scripts, Yum and Yaim. Main lessons
learned:
- Durham suffered a DPM crash.
  It was important to have a pre-written procedure to recover the site.
- Front end hardware should be fault tolerant.

PG: Do you back up the DPM?

MN: Yes. Backing up metadata is also important.

JW (TCD): We've moved to Quattor. It gives good control of what software
goes on machines. You can interface Quattor with Yaim but we don't. It takes
a lot of time to prepare a Quattor release, so the middleware we deploy with
it will be a version or so behind, but that's okay. By the time we catch up,
several problems will have been identified and fixed.

CM: Manchester have moved to CFEngine. In terms of usefulness it's up there
with Nagios. It's saved days of effort when we've had to roll things out.
- We've had problems finding Nagios plug-ins for Grid specific things.
  Please let us know if you know of any.
- Lack of dCache documentation is also a problem.

PG (Oxford): Not much seems to have changed since last year. We're all
confident in using Yum, Yaim etc. but will this scale up to lots more nodes?
- We need to share more expertise on Nagios and CFEngine via the GridPP
  Wiki. There's possibly too much re-inventing the wheel happening at each
  site.
- Some clusters come with proprietary installers, which can be a problem.

DR (Tier-1): Our installs are via PXE. We don't use Yaim on the worker
nodes. That's a simple tarball install via NFS.
- There are seven SysAdmins in the Tier-1, so there's potential to fall over
  each other.
- We're moving to Nagios, but the problem is that it only tells you of a
  failure, not the reason why it occurred.
- We need to be careful of putting in place too much monitoring - we could
  easily end up with too many Web pages to look at.

JG: I think we need a site status "dashboard". There's a group being set up
to develop one. Are there any volunteers? I'll circulate details.

ST (Edinburgh): Most people are happy with Yaim, but it hides functionality.
By that I mean, how can you be sure that the default configurations for the
services Yaim installs are efficient/optimal and secure? Do documents exist
for these services? SysAdmins don't like to run scripts to do something;
they like to do it themselves to learn how it works.

PG & AF: You can look at Bash and Yaim scripts!

ST: Yes, but that's no substitute for a step-by-step guide. Yaim allows a
quick install but you may not always know what you're running or why.

ST: Another big problem is supporting VOs with Yaim. The current way for a
Tier-2 to deal with VOs won't scale for the numbers expected in the future.
- You provide an account or set of accounts that the members of a VO can
  use, with the account permissions set appropriately to control what the VO
  can do. Each time a VO changes its requirements, you potentially have to
  re-run Yaim to reconfigure the privileges the VO has at your site.
- Users within a VO having different levels of privileges will only make the
  situation worse.
- The process isn't automated enough. Right now you want to be able to put a
  tick in a box to enable particular services for particular VOs. In an
  ideal world, Yaim would say to VOMS "How should I be configured for this
  VO?".

PG: Is there a possibility to add VO support as a separate Yaim function?

AF: There's a Savannah bug around this feature.

ST: We had a problem that we couldn't even recreate because we weren't part
of the OPs VO. It's disappointing that the "Best Practice" document hasn't
progressed.

ST also asked (after the session had ended) if the notes could record his
concern that there is no automated way of knowing what to install on worker
nodes to support the requirements of a particular VO (e.g. a particular
compiler).

GS (Glasgow): We should be working more as a community, sharing knowledge
via GGUS. It's not effective for you to know everything about everything;
you should be leveraging community knowledge.
- Monitoring is essential: Nagios and Ganglia interfaces.
- Glasgow's install method is PXE -> KickStart -> Yaim + any other scripts.
  It's just a Glasgow process at the moment. With 17 distributed sites, we
  don't have the manpower to use time-intensive methods like Quattor.

A simple straw poll was held of the different install methods people use and
are familiar with:

Straw Poll of usage
-------------------
Image tools, e.g. SystemImager, CVOS                         4-5 people
Process Installers, e.g. KickStart + post-install scripts    most people
Engine tools, e.g. Quattor, CFEngine                         4-5 people

GS: Do we want to develop a Simple Grid Installer Infrastructure as a
package? Is there any interest? Why are we all developing our own install
methods?

JG: Should we start a Wiki discussion on this?

WL (Bristol): Most things have already been covered, like the dashboard.
- "Floating gurus" that know all the sites are useful.
- The email lists and documentation in the Wiki are very useful, but it's
  all in bits and pieces.
- It's unbelievable to me that when something fails you end up trawling
  through logs for 8-10 different sub-systems to find out the cause.

JG: AF will tell you that's a Top Ten issue!

DK: Wearing my security hat.
- Before the end of GridPP2 we need to focus on continuing to raise
  awareness of security, both general Linux/SysAdmin and Grid specific.
- There's work to be done on training and awareness. People need to know:
  - what logs to keep
  - how to respond to the need to apply urgent security patches
  - how to respond to security breaches
  - what data they're allowed to share offsite
- ONE YEAR FROM NOW WE NEED TO BE BETTER PREPARED.
- EGEE are developing a regional model for security personnel.
- There's an activity going on to standardise policy and document the
  processes for computer forensics.

JG: We need to talk about this. We shouldn't wait for EGEE to tell us what
to do.

PM: Should this be in the MoU? "As a Tier-2 you are expected to do X in
response to security incident Y"?

TC (CERN): I'm going to be provocative.
Yaim is disguising itself as your friend. By using Yaim at several sites,
GridPP2 lost the opportunity to do installations in a more organised way.
Instead, Quattor should be used, as groups in Ireland and Paris do.
- Our 17 sites are managed like 17 separate sites. They should be managed as
  smaller federations of sites. If more sites are the same, more expertise
  can be shared, and we'd need less SysAdmin effort.
- People say that using Quattor takes too long, maybe three weeks, so you
  can't roll out the latest Grid middleware - drops of which aren't user
  friendly enough for an easy installation. Well, so what? Lots of us are
  running SL3 not SL4, and is anyone complaining?

OvdA: How does this square with a computing centre? You can't impose an
installation method on a centre that has tried-and-tested, customised
install processes.

TC: It might not fit everyone, but it has the potential to benefit many
sites.

JG: Quattor works better the bigger the collaboration. I can't see us being
able to produce a Quattor release for the whole of GridPP, but it's possible
that one Tier-2 could look at this as a feasibility study.

GS: I can't see it working: each Tier-2 has issues that would stop there
being a single release for all Tier-2s. Why couldn't the federations of
sites be based on their suitability to use Quattor rather than geography?

TC: Managers at the higher levels will view this as a good thing. SysAdmins
will not; they'll view it as the potential loss of their job.

JW: We use Quattor because it's convenient, not to follow TC's argument. The
Tier-2s should be left to decide for themselves, and pick whatever makes
their lives easiest. I know sites that use tarballs because it suits them
best.

NG: If I were a SysAdmin I'd want to do this so I could automate as much as
possible to make more time for more interesting work, but there need to be
guarantees that using Quattor won't take up all their time.
We actually had a more centralised system than we do now with the first NGS.
We moved away from that because most sites wanted some level of
customisation. It seems like a no-brainer, but it needs to be done because
sites will benefit; it shouldn't be imposed. You can't tell Manchester to
throw away their SGI machine(s) just to get a Linux cluster that can be used
with Quattor.

DR: We didn't use Quattor because the learning curve was too steep. There
was too much work involved before we'd see any payback.

CM: Would we want to take all the sites down and tear down their
installations just to re-install using Quattor?

JW: It's not that bad. Once all our preparation was done, it took us just
three days to do our installs with Quattor, plus we upgraded the OS at the
same time. If anyone has more interest, there's a Quattor workshop in Dublin
next year.

DISCUSSION ENDED 12:30.