Glasgow New Cluster Tasklist
From GridPP Wiki
Revision as of 13:28, 25 January 2008 by Michael kenyon (Talk | contribs)
Installer
- Run firstbootwatcher as a daemon.
- Done Running from cron every minute rather than as a daemon, but does the business.
- Run postbootinstaller automatically, with a limit to the number of installers running simultaneously.
- Superseded Using cfengine for post kickstart installation now.
- Implement flexible logging (to syslog, and to console if a controlling tty).
- Done cfengine logs to syslog - but see centralise syslog action!
- Restructure cfengine to work better with SVN directory layout.
- Ongoing Link from Colin is useful: http://sial.org/howto/cfengine/repository/
- Put new YPF installer into SVN
- Done Repo on grid01.
- Change autokick to use new YPF clusterdb, instead of old classes.conf file.
- Ongoing Should be easy - PHP has SQLite bindings.
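The autokick item above only needs a simple lookup against the clusterdb. A minimal sketch of such a lookup, written in Python rather than the PHP mentioned above. The table name `hosts`, its columns `hostname` and `host_class`, and the sample rows are all hypothetical; the real YPF clusterdb schema is not described on this page.

```python
import sqlite3


def classes_for_host(conn, hostname):
    """Return the sorted install classes recorded for one host.

    Assumes a hypothetical table hosts(hostname, host_class);
    the real clusterdb layout may differ.
    """
    rows = conn.execute(
        "SELECT host_class FROM hosts WHERE hostname = ? ORDER BY host_class",
        (hostname,),
    ).fetchall()
    return [row[0] for row in rows]


def demo_db():
    """Build an in-memory stand-in for clusterdb with made-up rows."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE hosts (hostname TEXT, host_class TEXT)")
    conn.executemany(
        "INSERT INTO hosts VALUES (?, ?)",
        [
            ("node001", "workernode"),
            ("node001", "sl4"),
            ("disk032", "diskserver"),
        ],
    )
    conn.commit()
    return conn
```

In use, autokick would open the real database file with `sqlite3.connect(path)` and call `classes_for_host(conn, name)` where it currently parses classes.conf.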
Security
- Disable module loading on WNs.
- Ongoing
- Implement basic automatic integrity checks (like kickstart).
- Ongoing Can use cfengine to do this - it will checksum any file.
- Disable unnecessary SUID binaries.
- Ongoing Can use cfengine to do this.
- Outgoing packets checked and logged by NAT hosts.
- Ongoing David looking at shorewall on NAT boxes. No progress as of 2007-04, so fall back to simple iptables manipulation.
- Disable root logins between ssh known hosts.
- Partial Known hosts authentication does not work for root.
- Disable login from UI to other hosts - UI is the only place that users get a vanilla shell, so defend this.
- Done Login to UI is via gsissh.
- Centralise Email on svr031
- Ongoing Will use exim on svr031. Need sendmail recipe for SL3 boxes.
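For the integrity-check item above, cfengine does the checksumming itself. The sketch below is not cfengine configuration; it only illustrates the underlying idea (record a digest per file, then report files whose digest has changed). The function names and the choice of SHA-256 are assumptions made for the example.

```python
import hashlib
from pathlib import Path


def file_digest(path):
    """Return the SHA-256 hex digest of a file, read in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(65536), b""):
            digest.update(chunk)
    return digest.hexdigest()


def build_manifest(paths):
    """Map each path to its current digest (the trusted baseline)."""
    return {str(path): file_digest(path) for path in paths}


def changed_files(manifest):
    """Return sorted paths that are missing or no longer match the baseline."""
    changed = []
    for path, expected in manifest.items():
        if not Path(path).is_file() or file_digest(path) != expected:
            changed.append(path)
    return sorted(changed)
```

The baseline would need to be stored somewhere a compromised node cannot rewrite it, for example on the master node.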
Monitoring/Logging
- Ganglia
- Done Configuration files for grid servers, worker nodes and storage hosts controlled by cfengine.
- In Progress Configuration files for NAT boxes not working fully.
- Nagios
- Done but broken Installed on svr031, but currently dead due to president's missing brain. Use monami as core for new sensors.
- Central syslogging
- Done Stanza in cfengine. Log to master - investigate better sysloggers?
- MonAmi
- Ongoing DPM and NUT sensors deployed. Deploy new sensors as they are made available.
Local Accounting
- PBS Accounting - install Jamie's PBS->MySQL dump scripts.
- Done [1].
Notes and Wish List
To do now:
- Selecting an end date produces an end time at the beginning of that day, instead of the end.
- Bar and line CPU efficiency plots for short time periods produce JPGraph errors.
- Summary table should give wall clock and cpu times.
- Stated values of CPU Hours and KSI2k hours are unclear - need to compare with potential.
- Table of worker nodes should include a commissioning and decommissioning date, to enable the potential CPU hours and KSI2K to be calculated.
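A sketch of how the potential CPU hours could be derived once the worker node table carries commissioning and decommissioning dates. It also treats the selected end date as inclusive (the period runs to midnight after the end date), which is one way to address the end-date item above. The tuple layout `(cores, commissioned, decommissioned)` and the function names are assumptions, not the existing accounting code.

```python
from datetime import date, datetime, time, timedelta


def period_bounds(start_day, end_day):
    """Return (start, end) datetimes covering both days in full.

    The end bound is midnight after end_day, so the last day is included.
    """
    start = datetime.combine(start_day, time.min)
    end = datetime.combine(end_day + timedelta(days=1), time.min)
    return start, end


def potential_cpu_hours(nodes, start_day, end_day):
    """Sum cores * hours in service inside the reporting period.

    Each node is (cores, commissioned, decommissioned); decommissioned
    is None for a node that is still in service.
    """
    start, end = period_bounds(start_day, end_day)
    total = 0.0
    for cores, commissioned, decommissioned in nodes:
        first = max(start, datetime.combine(commissioned, time.min))
        if decommissioned is None:
            last = end
        else:
            last = min(end, datetime.combine(decommissioned, time.min))
        if last > first:
            total += cores * (last - first).total_seconds() / 3600.0
    return total
```

Potential KSI2K hours would follow the same pattern with each node's hours weighted by its per-core rating, which would need to be added to the table as well.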
Wish list:
- For individual groups, need to design a per-user display. (Nice to map to DNs - work and privacy issues though!)
- Should have a scheme for dribbling in data during the day, rather than having to wait for a log record to be complete before processing it (useful when cluster is busy).
Batch System
- Ensure TMPDIR properly defined.
- Done.
- Investigate SGE as alternative ;-)
- Ongoing This is not entirely in jest (oh alright, it is...)
Storage
- Deploy more disk servers in production grid mode (19+2+1 disks - RAID 6 with 1 hot spare).
- Ongoing disk033,043,035,036 are production servers for DPM. disk032 can be deployed anytime. Greig using disk038-041 for dCache tests for next month. Be sensitive to ATLAS needs. N.B. disk032 still needs to go into correct mode.
Backups
- DPM/LFC databases.
- Ongoing Done for DPM, but need to automate rsyncing to masternode (cfengine).
- Batch system configuration.
- Done Have copied maui configuration from old cluster, but need to add a fair share for local groups.
- VO Tags.
- Ongoing Incorporate into a rolling rsync on masternode.
- Installer subversion repository.
- Done but should institute some backups from grid01.
Efficiency
- Stop WNs downloading CRLs. These files should instead be copied from the CE.
- Done httpd installed on CE with /etc/grid-security/certificates exported. YAIM override function sets up mirroring of CRLs instead of direct downloads.
Networking
- Get nat nodes up and running
- Ongoing Installed NAT nodes as SL44 x86_64. David looking at shorewall.
- Ensure http_proxy and no_proxy are defined in batch environment while direct http access is barred.
- Done. There is an annoying bug in yum, which does not respect the no_proxy variable, so any scripts invoking yum need to undefine http_proxy and no_proxy; these are therefore not defined for root. We have since received an exemption from the webcache, so this is no longer needed (and the variables have been undefined).
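The yum workaround described above amounts to running yum with the two proxy variables removed from its environment. A minimal sketch of a wrapper doing that, for reference should the webcache exemption ever lapse. The function names are hypothetical, and only the lowercase variable names mentioned above are stripped.

```python
import os
import subprocess

# Variables that yum is reported above to mishandle.
PROXY_VARS = ("http_proxy", "no_proxy")


def env_without_proxy(env=None):
    """Return a copy of the environment with the proxy variables removed."""
    source = os.environ if env is None else env
    return {key: value for key, value in source.items() if key not in PROXY_VARS}


def run_yum(args):
    """Run yum with http_proxy and no_proxy undefined for that process only."""
    return subprocess.run(["yum", *args], env=env_without_proxy(), check=True)
```

For example, `run_yum(["-y", "update"])` leaves the calling shell's environment untouched, so batch jobs started from the same session still see the proxy settings.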
Grid Nodes
- Install UI for local access
- Done grid-mapfile mirrored from CE ensuring the same account for submitted jobs.
- Install separate BDII to improve system stability
- Done svr021.gla.scotgrid.ac.uk. Some problems still seen with GRIS on CE.
- Local LFC
- Ongoing Only required for ALICE, so not really needed right now, but would be necessary for SGS VOs.
- RB
- Done Improves job submission and local VO support. Installed on svr023, but not advertised in BDII.
- Top Level BDII
- Done Shares svr019 with R-GMA.
- Save R-GMA from itself
- Ongoing Anthony sent a cron job to sniff for signs of R-GMA putrefaction.
Documentation
- Procedure for adding new users.
- Documentation on job submission and storage use.
- Progress See Glasgow User Information. Anything else needed?