Virtually tracking software
Tue 31st January 2012
Cutting edge science needs bleeding edge software: you want your analysis to use the most up-to-date techniques and applications. Ensuring that a job submitted to the grid has access to the right software has been a headache for systems administrators and users from the beginning. To make this task easier, GridPP has just finished installing a CERN-developed application, CernVM-FS, at each of its sites. But what does this mean for research in the UK and on the grid?
Maintaining up-to-date software at every site for each experiment is technically difficult, costly, and prone to problems with current solutions. The software used by the LHC experiments is constantly evolving: as people's needs change, it changes. This has led various experiments and institutes to make different choices about what software they run and how they interact with the computational resources on offer.
Jakob Blomer, CernVM-FS Developer
This means that there are frequent changes to the code, and the applications are gigabytes in size. So shipping a copy of the software with every job is impractical, but ensuring every site has a local, up-to-date copy is almost impossible. Developers at CERN, however, saw some important characteristics in the software that they could exploit to solve the problem. “CERN Virtual Machine File System (CernVM-FS) actually started life as a part of a larger CERN project, the CERN Virtual Machine, designed to create an easy-to-install software environment that would be up-to-date as long as it had internet access,” explains Jakob Blomer, who is developing CernVM-FS. “This last part appealed to people working on the grid and we started testing CernVM-FS as a stand-alone service.”
The first realisation was that the software mirrored the experiments themselves. Each looks like a single entity, but dig deeper and you see that individuals are only interested in certain analyses and certain parts of the whole. So while everyone has things in common, no one is likely to need everything the software can do. The next was that different parts of the software used duplicate files; linking these would mean a file could be downloaded once but used by various parts of the applications. Also, updates to a piece of software usually involve only small changes, so the differences between releases are relatively minor. The final characteristic was that if the software looks for a file that is not there, it doesn’t remember that it failed to find it and will look for it again if asked to. These “unsuccessful lookups” happen as often as successful ones, so reducing them would reduce the load on the machine the software is running on. Using these observations as a starting point, the team developed a system for providing up-to-date software on the fly: CernVM-FS.
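The deduplication idea above can be sketched in a few lines. This is a hypothetical illustration (the class and field names are invented, not CernVM-FS internals): files are identified by a hash of their contents, so an unchanged file that appears in many releases is stored and downloaded only once.

```python
import hashlib

def content_hash(data: bytes) -> str:
    """Identify file contents by a hex digest of the bytes."""
    return hashlib.sha1(data).hexdigest()

class DedupStore:
    """Toy content-addressed store: many paths, one copy per unique file."""
    def __init__(self):
        self.objects = {}   # hash -> file contents (stored once)
        self.catalog = {}   # path -> hash (many paths may share a hash)

    def add(self, path: str, data: bytes):
        h = content_hash(data)
        # Identical contents produce identical hashes, so the data is
        # stored only once no matter how many paths reference it.
        self.objects.setdefault(h, data)
        self.catalog[path] = h

store = DedupStore()
store.add("release-1/lib/analysis.so", b"shared library bytes")
store.add("release-2/lib/analysis.so", b"shared library bytes")  # unchanged between releases
print(len(store.catalog), len(store.objects))  # 2 1
```

Two catalogue entries point at a single stored object, which is why minor software updates cost little: only the files that actually changed produce new hashes and new downloads.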
CernVM-FS is at its heart quite a simple concept, and boils down to various computers remembering which components they have used or seen recently. The experiment’s software developers keep their part of the code up-to-date on a server at CERN. CernVM-FS treats the various pieces of software as individual files, and tags each file with a hash generated from its contents, so that identical components have identical hashes. An institute’s grid cluster stays in contact with the CERN server via a local machine called a squid; everything transferred over CernVM-FS passes through this single machine, which can therefore track what has been downloaded. So when a piece of software running on a computer needs the next step in the application, it looks to see if it is already on that computer; if not, it asks the squid whether it has been used elsewhere on the cluster; and finally, if necessary, it gets the squid to download it from CERN. This means the correct component is asked for (thanks to the hash), that it is up to date, and that it is downloaded only when necessary, reducing the load on the computers, the network and the systems administrators.
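The tiered lookup described above can be sketched as follows. This is a minimal, hypothetical model (the `Tier` class and tier names are invented for illustration, not real CernVM-FS code): a worker node checks its own cache, then the site squid, and only falls back to the CERN server, with each tier remembering what it has fetched.

```python
class Tier:
    """Toy cache tier: answer from local cache, else ask upstream and remember."""
    def __init__(self, name, upstream=None, contents=None):
        self.name = name
        self.upstream = upstream           # next tier to ask, if any
        self.cache = dict(contents or {})  # hash -> file contents

    def fetch(self, file_hash):
        if file_hash in self.cache:        # already seen at this tier
            return self.cache[file_hash], self.name
        if self.upstream is None:          # top of the chain and still missing
            raise KeyError(file_hash)
        data, source = self.upstream.fetch(file_hash)
        self.cache[file_hash] = data       # remember for next time
        return data, source

cern = Tier("CERN server", contents={"abc123": b"software component"})
squid = Tier("site squid", upstream=cern)
worker = Tier("worker node", upstream=squid)

print(worker.fetch("abc123")[1])  # CERN server  (first request goes all the way up)
print(worker.fetch("abc123")[1])  # worker node  (now cached locally)
```

After the first fetch, both the worker node and the squid hold a copy, so repeat requests anywhere on the cluster never leave the site: exactly the bandwidth saving the article describes.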
One of the first sites outside CERN to test and use CernVM-FS was the Rutherford Appleton Laboratory (RAL) in Oxfordshire. Ian Collier was a member of the team working on it. “We first heard about CernVM-FS in the summer of 2010. We’d been experiencing acute performance problems with (one of) our experiment software servers, affecting both job performance and site availability. I saw the potential to solve both these problems, at the same time reducing overhead for both the site and the experiments. This seemed almost too good to be true. After a few months of testing with ATLAS and then LHCb we migrated to using CernVM-FS for all ATLAS and LHCb jobs early in 2011.”
As the usefulness of the software became more apparent, the decision was made to get it up and running across GridPP. “We were very happy with the results coming from RAL,” says Alessandra Forti, who has led the effort to get GridPP using CernVM-FS, “so we started testing it on other sites. It has not always been simple, but installing the software was not intrusive, so sites did not have to go offline at any stage. It only took us three months to move it into a production service across the entire collaboration; that is close to a site a week. Now we are happy to say that we are providing the best service we can to our users.” GridPP also hosts a replica of the CernVM-FS repository at RAL, allowing sites in the UK to download what they need from a server in their own country.
Currently CernVM-FS is being used by the ATLAS and LHCb experiments, but it is software agnostic: it can be used by anyone. Chris Walker from Queen Mary, University of London is involved in supporting new and small VOs on GridPP. “We were the first site in the UK to put CernVM-FS into production. I believe that if we can support CernVM-FS for the LHC experiments, there is no reason why we couldn’t support the same thing for any experiment. This makes it easier for a site to support multiple disciplines and makes the decision to support a new VO a simple one. But most importantly, it makes it easy for a community to maintain its software and push updates out across the grid so they can do their work quickly and efficiently.”