Glasgow New Cluster Tasklist

From GridPP Wiki
Revision as of 13:28, 25 January 2008 by Michael kenyon (Talk | contribs)


Installer

  • Run firstbootwatcher as a daemon.
    • Done Running from cron every minute rather than as a daemon, but it does the business.
  • Run postbootinstaller automatically, with a limit to the number of installers running simultaneously.
    • Superseded Using cfengine for post-kickstart installation now.
  • Implement flexible logging (to syslog, and to console if a controlling tty).
    • Done cfengine logs to syslog - but see the central syslogging action under Monitoring/Logging!
  • Restructure cfengine to work better with SVN directory layout.
  • Put new YPF installer into SVN
    • Done Repo on grid01.
  • Change autokick to use new YPF clusterdb, instead of old classes.conf file.
    • Ongoing Should be easy - PHP has SQLite bindings.


Security

  • Disable module loading on WNs.
    • Ongoing
  • Implement basic automatic integrity checks (like kickstart).
    • Ongoing Can use cfengine to do this - it will checksum any file.
  • Disable unnecessary SUID binaries.
    • Ongoing Can use cfengine to do this.
  • Outgoing packets checked and logged by NAT hosts.
    • Ongoing David is looking at shorewall on the NAT boxes. No progress as of 2007-04, so fall back to simple iptables manipulation.
  • Disable root logins between ssh known hosts.
    • Partial Known hosts authentication does not work for root.
  • Disable login from UI to other hosts - UI is the only place that users get a vanilla shell, so defend this.
    • Done Login to UI is via gsissh.
  • Centralise Email on svr031
    • Ongoing Will use exim on svr031. Need sendmail recipe for SL3 boxes.
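
The integrity-check item above relies on cfengine checksumming files. Purely as an illustration of the idea (this is not cfengine's implementation; the function names and the baseline dictionary format are hypothetical), a checksum comparison might be sketched in Python as:

```python
import hashlib


def file_digest(path, chunk_size=65536):
    """Return the SHA-256 hex digest of the file at `path`."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        # Read in chunks so large files do not need to fit in memory.
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def changed_files(baseline):
    """Return sorted paths whose current digest differs from the baseline.

    `baseline` maps path -> expected hex digest (hypothetical format).
    A missing or unreadable file is reported as changed.
    """
    changed = []
    for path, expected in baseline.items():
        try:
            current = file_digest(path)
        except OSError:
            current = None
        if current != expected:
            changed.append(path)
    return sorted(changed)
```

The baseline would be recorded on a freshly installed node and compared on each later run; where the baseline is stored, and how alerts are raised, is left open here.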

Monitoring/Logging

  • Ganglia
    • Done Configuration files for grid servers, worker nodes and storage hosts controlled by cfengine.
    • In Progress Configuration files for NAT boxes not working fully.
  • Nagios
    • Done but broken Installed on svr031, but currently dead due to president's missing brain. Use MonAmi as core for new sensors.
  • Central syslogging
    • Done Stanza in cfengine. Log to master - investigate better sysloggers?
  • MonAmi
    • Ongoing DPM and NUT sensors deployed. Deploy new sensors as they are made available.

Local Accounting

  • PBS Accounting - install Jamie's PBS->MySQL dump scripts.

Notes and Wish List

To do now:

  • Selecting an end date produces an end time at the beginning of that day, instead of the end.
  • Bar and line CPU efficiency plots for short time periods produce JPGraph errors.
  • Summary table should give wall clock and cpu times.
  • Stated values of CPU hours and KSI2K hours are unclear - need to compare them with the potential (maximum available) values.
  • Table of worker nodes should include a commissioning and decommissioning date, to enable the potential CPU hours and KSI2K hours to be calculated.
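
As a rough sketch of two of the items above (the end-date handling and the potential CPU hours), assuming a hypothetical node table of (cores, commissioned, decommissioned) tuples rather than the real accounting schema:

```python
from datetime import date, datetime, time, timedelta


def period_bounds(start_day, end_day):
    """Return (start, end) datetimes covering both days in full.

    The end is midnight at the start of the day after `end_day`, so the
    selected end date is included rather than cut off at 00:00.
    """
    start = datetime.combine(start_day, time.min)
    end = datetime.combine(end_day + timedelta(days=1), time.min)
    return start, end


def potential_cpu_hours(nodes, start_day, end_day):
    """Sum the CPU hours each node could have delivered in the period.

    `nodes` is a list of (cores, commissioned, decommissioned) tuples,
    a hypothetical layout; `decommissioned` is None for nodes in service.
    Dates are treated as midnight at the start of the given day.
    """
    start, end = period_bounds(start_day, end_day)
    total = 0.0
    for cores, commissioned, decommissioned in nodes:
        live_from = max(start, datetime.combine(commissioned, time.min))
        live_to = end
        if decommissioned is not None:
            live_to = min(end, datetime.combine(decommissioned, time.min))
        if live_to > live_from:
            total += cores * (live_to - live_from).total_seconds() / 3600.0
    return total
```

Weighting each node's contribution by a per-core SI2K rating would give potential KSI2K hours in the same way; no ratings are listed on this page, so that step is omitted.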

Wish list:

  • For individual groups, need to design a per-user display. (Nice to map to DNs - work and privacy issues though!)
  • Should have a scheme for dribbling in data during the day, rather than having to wait for a log record to be complete before processing it (useful when cluster is busy).
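
One possible reading of the "dribbling in" item is to pick up newly written, complete lines from a log file on each pass, remembering how far the previous pass got. A minimal sketch under that assumption (the function name and the offset bookkeeping are hypothetical, and nothing here is specific to the PBS log format):

```python
def read_new_lines(path, offset):
    """Return (lines, new_offset) for complete lines added after `offset`.

    A trailing line without a newline is assumed to be still being written
    and is left for the next call.
    """
    with open(path, "rb") as handle:
        handle.seek(offset)
        data = handle.read()
    last_newline = data.rfind(b"\n")
    if last_newline < 0:
        # Nothing complete yet; keep the old offset.
        return [], offset
    complete = data[: last_newline + 1]
    lines = complete.decode("utf-8", errors="replace").splitlines()
    return lines, offset + len(complete)
```

The caller would store the returned offset between runs (for example alongside the accounting database) and pass it back in on the next pass.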

Batch System

  • Ensure TMPDIR properly defined.
    • Done.
  • Investigate SGE as alternative ;-)
    • Ongoing This is not entirely in jest (oh alright, it is...)

Storage

  • Deploy more disk servers in production grid mode (19+2+1 disks - RAID 6 with 1 hot spare).
    • Ongoing disk033, 043, 035 and 036 are production servers for DPM. disk032 can be deployed at any time, but N.B. it still needs to go into the correct mode. Greig is using disk038-041 for dCache tests for the next month. Be sensitive to ATLAS needs.

Backups

  • DPM/LFC databases.
    • Ongoing Done for DPM, but need to automate rsyncing to masternode (cfengine).
  • Batch system configuration.
    • Done Have copied maui configuration from old cluster, but need to add a fair share for local groups.
  • VO Tags.
    • Ongoing Incorporate into a rolling rsync on masternode.
  • Installer subversion repository.
    • Done but should institute some backups from grid01.

Efficiency

  • Stop WNs downloading CRLs. These files should instead be copied from the CE.
    • Done httpd installed on CE with /etc/grid-security/certificates exported. YAIM override function sets up mirroring of CRLs instead of direct downloads.

Networking

  • Get nat nodes up and running
    • Ongoing Installed NAT nodes as SL44 x86_64. David looking at shorewall.
  • Ensure http_proxy and no_proxy are defined in the batch environment while direct http access is barred
    • Done. There is an annoying bug in yum, which does not respect the no_proxy variable, so any script invoking yum needs to undefine http_proxy and no_proxy; for this reason they are not defined for root. We have since received an exemption from the webcache, so this is no longer needed (and the variables have been undefined).
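
For the record, the yum workaround described above amounts to running yum with the proxy variables stripped from its environment. A minimal sketch, assuming a Python wrapper (the helper name is hypothetical; the original scripts are not shown on this page):

```python
import os

# Variables that had to be hidden from yum, as named in the item above.
PROXY_VARS = ("http_proxy", "no_proxy")


def env_without_proxy(env=None):
    """Return a copy of `env` (default: os.environ) minus the proxy variables."""
    source = os.environ if env is None else env
    return {key: value for key, value in source.items() if key not in PROXY_VARS}
```

A wrapper could then call, for example, `subprocess.call(["yum", "-y", "update"], env=env_without_proxy())`, leaving the caller's own environment untouched.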

Grid Nodes

  • Install UI for local access
    • Done grid-mapfile mirrored from CE ensuring the same account for submitted jobs.
  • Install separate BDII to improve system stability
    • Done svr021.gla.scotgrid.ac.uk. Some problems still seen with GRIS on CE.
  • Local LFC
    • Ongoing Only required for ALICE, so not really needed right now, but would be necessary for SGS VOs.
  • RB
    • Done Improves job submission and local VO support. Installed on svr023, but not advertised in BDII.
  • Top Level BDII
    • Done Shares svr019 with R-GMA.
  • Save R-GMA from itself
    • Ongoing Anthony sent a cron job to sniff for signs of R-GMA putrefaction.

Documentation