Detailed 2.1.9 Upgrade

From GridPP Wiki
Revision as of 14:57, 27 October 2010 by Chris kruk (Talk | contribs)


Friday before: Drain batch system for the affected VO
Monday 07:00: Start drain of FTS
09:00:

Upgrade instruction:

CASTOR: (MV) 8:00-8:30 Shut down all CASTOR services (including LSF, cron and puppet) and SRM for the affected VO

DataBase: (RS) 8:30-12:30 Take database backup and upgrade stager/dlf schemas

Disk Servers: (JT) 8:30-10:30 Upgrade Castor/OS RPMs on disk servers:

  • Puppet will provide the files:
    • /etc/castor/castor.conf
    • /etc/sysconfig/xrd
    • /usr/local/bin/show_castor_services
    • /etc/logrotate.d/syslog
  • Puppet will restart the xrd daemon
  • Restart (r)syslog to pick up new loggers
  • Make sure castor-gridftp-dsi-int is installed on all disk servers
  • Make sure that the following lines are uncommented in castor.conf on all disk servers:
   GSIFTP X509_USER_CERT  /etc/grid-security/castor-gridftp-dsi-int/castor-gridftp-dsi-int-cert.pem
   GSIFTP X509_USER_KEY   /etc/grid-security/castor-gridftp-dsi-int/castor-gridftp-dsi-int-key.pem   
  • Make sure that /opt/xrootd/keys/pkey.pem has been distributed to all disk servers. The key must first be generated on the DLF machine and put into Puppet
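
The castor.conf check in the list above can be scripted. The following is a minimal sketch, not part of the original procedure: the function name check_gsiftp is hypothetical, and it assumes an active setting is a line that starts with GSIFTP (optionally indented) followed by the key and a value.

```shell
# Report whether both GSIFTP certificate settings are active (uncommented)
# in a castor.conf-style file. check_gsiftp is a hypothetical helper name.
check_gsiftp() {
    conf="${1:-/etc/castor/castor.conf}"
    missing=0
    for key in X509_USER_CERT X509_USER_KEY; do
        # Active line: optional whitespace, GSIFTP, the key, then a value.
        # A leading '#' makes the line a comment, so it will not match.
        if grep -Eq "^[[:space:]]*GSIFTP[[:space:]]+${key}[[:space:]]+[^[:space:]]" "$conf"; then
            echo "OK: GSIFTP ${key}"
        else
            echo "MISSING: GSIFTP ${key}"
            missing=1
        fi
    done
    return "$missing"
}
```

Run it on each disk server as `check_gsiftp /etc/castor/castor.conf`; a non-zero exit status means at least one of the two lines is commented out or absent.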

(MV) 10:30-11:00 Change disk servers in puppet to 2.1.9

(MV) 11:00-12:00 Apply and verify changes using puppet to disk servers

Head Nodes: (CK) 8:30-12:00 Upgrade OS (SLC4->SL4)

  • Take a backup
  • Use this kickstart script: sl4-preProd-os-only-x86_64.cfg (151)
  • Use different section for partitions for lsf/dlf and stager/ns
  • Restore backup:
    • Make sure that you have correct:
      • .ssh directory
      • /etc/yum.repos.d/*
      • /etc/resolv.conf, which should have the following format:
   domain ads.rl.ac.uk
   search ads.rl.ac.uk gridpp.rl.ac.uk rl.ac.uk
   nameserver 130.246.8.13
   nameserver 130.246.56.240
   nameserver 130.246.72.21
    • Make sure that the directory /var/www/html/lsf exists on c<instance>lsf; if not, do:
      • mkdir -p /var/www/html/lsf
      • chmod 777 /var/www/html/lsf
      • service httpd start
    • add RHSERVER_PRIVATE_PORT=9004 to /etc/sysconfig/rhd
    • overwrite /etc/sysconfig/mighunterd.example with /etc/sysconfig/mighunterd
    • modify /etc/logrotate.d/syslog for all h/n with the following:
   /var/log/messages /var/log/secure /var/log/maillog /var/log/spooler /var/log/boot.log /var/log/cron {
   rotate 5
   daily
   compress
   sharedscripts
   postrotate
   /bin/kill -HUP `cat /var/run/syslogd.pid 2> /dev/null` 2> /dev/null || true
   endscript
   }
    • copy /usr/lib/log-processor/DLF.py from cPREdlf to all h/n and restart logprocessord on all h/n
    • install nrpe rpms and run 'chkconfig nrpe on' on all h/n
    • install gmond plus relevant tier1 config file

(CK) 12:00-12:30 Install Castor RPMs

  • Execute ‘install-castor-219’ which is at: ccsc15:/root/219-upgrade

(CK) 13:30-13:45 Modify sysconfig entries

(CK) 13:45-14:00 Deploy new castor.conf

  • Make sure the correct version is distributed; an example can be taken from preProd

(CK) 14:00-14:30 Configure rsyslog

(CK) 14:30-15:30 Install and configure xrootd mgr on DLF server

  • make sure the certificate is installed in /etc/grid-security
  • install certificate
  • this is not in puppet yet:
    • generate pair of xrootd keys
    • copy lcgcc246:/etc/sysconfig/xrd to c<instance>dlf
    • copy xrd.cf file from PreProd if needed
  • install the following:
    • xrootd-xrd3cp
    • xrootd-secssl
    • xrootd-xcastor2fs
    • fetch-crl
    • tier1-yum-lcg-ca-certs-2.0-1.noarch
    • lcg-CA
  • distribute grid-mapfile using puppet

(CK) 15:30-15:45 Update restarter scripts

  • /etc/rsyslog.conf can be taken from ccsc15
  • Restart rsyslog on all h/n

(CK) 15:45-16:00 Modify show_castor_services

  • Copy it from ccsc15:/root/219-upgrade

(CK) 16:00-16:30 The final checks and configuration

  • Create lhcblogging account: uid=510 gid=510 for stager and lsf machines only
  • 'chkconfig ypbind on' for all h/n
  • 'service ypbind start' for all h/n
  • Copy /etc/castor/logprocessord to all h/n
  • Make sure that there is an entry for the DLF DB in 'tnsnames.ora' on all h/n
  • 'chkconfig nrpe on' for all h/n
  • 'service nrpe start' for all h/n
  • 'chkconfig gmond on' for all h/n
  • 'service gmond start' for all h/n
  • Make sure you have correct /etc/sysconfig/rhd for stager machine:
    • /etc/sysconfig/rhd
    #RHD_OPTIONS=
    DAEMON_COREFILE_LIMIT=unlimited
    ROLES="public private"
    • /etc/sysconfig/rhd.public
    #RHD_OPTIONS=
    DAEMON_COREFILE_LIMIT=unlimited
    ROLES="public private"
    • /etc/sysconfig/rhd.private
    RHD_OPTIONS="-p 9004"
    DAEMON_COREFILE_LIMIT=unlimited
    ROLES="public private"
  • 'chkconfig httpd on' for dlf and lsf machines
  • 'service httpd start' for dlf and lsf machines
  • Make sure that file '/var/www/conf/dlf/login.conf' is owned by group 'apache'; if not, do:
   chgrp apache /var/www/conf/dlf/login.conf
  • Make sure that on DLF machine:
    • /etc/sysconfig/logprocessord looks like
   LOGPROCESSORDD_OPTIONS="-c /etc/castor/logprocessord.conf"
   #ROLES=
    • /etc/castor/logprocessord.conf looks like:
   #-------------------------------------------------------------------------------
   # General program settings
   #-------------------------------------------------------------------------------
   [main]
   pid_file         = /var/run/logprocessord.normal.pid
   log_file         = /var/log/castor/logprocessord.log
   plugin_path      = /usr/lib/log-processor
   daemon_processes = dlf-syslog-to-db
   
   #-------------------------------------------------------------------------------
   # DLF Database settings
   #-------------------------------------------------------------------------------
   [dest-dlf-db]
   module = DLF
   class  = DLFDbDest
   
   # The database connection string in the form <username>/<password>@<database>
   # or a reference to a file containing the password
   # e.g. file:///etc/castor/DLFCONFIG
   connection_string = file:///etc/castor/DLFCONFIG
   
   # The maximum number of records to be inserted in one bulk operation
   bulk_count        = 5000
   
   # The maximum amount of time to wait before inserting records into the database
   flush_interval    = 60
   
   # The domain name that should be appended to all encountered hostnames
   #domain_name       = cern.ch
   
   #-------------------------------------------------------------------------------
   # DLF source log file - from syslog
   #-------------------------------------------------------------------------------
   [source-dlf-log-file-syslog]
   module = DLF
   class  = DLFLogFile
   
   # There are basically two types of input that the logprocessor operates on:
   # files and pipes.
   
   # If the type is set to 'file' then it treats the file specified by the path
   # variable as the source for the messages. It reads the file treating each line
   # as a separate message and parses it. When it reaches the end of the file it
   # closes it and quits.
   
   # If the type is set to 'pipe' then it blocks until new data is available in
   # the currently opened file.
   
   # The 'seek' variable determines if it should find the end of the file and
   # insert only the newly arriving messages, if set to false it also inserts the
   # messages that are already in the file.
   
   # The 'dynfiles' option specifies if dynamic file names should be parsed. If
   # set to false the pipe just processes the file specified by path. If set to
   # true then the pipe processes the YYYY-MM-DD.log files stored in the directory
   # defined by the path variable (YYYY denotes year, MM - month, and DD - day).
   # It starts with the current date and changes the file it processes when there
   # is no more data in the current one and the system date changes.
   
   path     = /var/log/dlf/syslog.input
   type     = pipe
   dynfiles = false
   seek     = true
   
   #-------------------------------------------------------------------------------
   # DLF processes
   #-------------------------------------------------------------------------------
   [process-dlf-syslog-to-db]
   source      = dlf-log-file-syslog
   destination = dlf-db
    • /var/www/conf/dlf/login.conf points to the correct DLF DB and looks like:
   <?php
   
   /*
    * DLF Database connectivity information
    */
   $db_instances = array(
   
     "castor2" => array(
       "username" => "XXXXXXXXXXXXXX-username",
       "schema"   => "XXXXXXXXXXXXXX-the same as above",
       "password" => "XXXXXXXXXXXXXX-password",
       "server"   => "XXXXXXXXXXXXXX-DLF DB server name",
   
       /* stager database */
       "stagerdb" => array(
         "username" => "XXXXXXXXXXXXXX-username",
         "schema"   => "XXXXXXXXXXXXXX-the same as above",
         "password" => "XXXXXXXXXXXXXX-password",
         "server"   => "XXXXXXXXXXXXXX-STAGER DB server name",
       ),
     ),
   );
   
   ?>
  • Install the Amanda client on all h/n; the rpm and script can be taken from ccsc01.ads.rl.ac.uk:/root/amanda_client
  • Make sure the following lines are in /etc/rc.local on the lsf machine:
   source /lsf/conf/profile.lsf
   lsadmin limstartup
   lsadmin resstartup
   badmin hstartup
  • Place c2probe software by doing the following: (stager machine only)
    • Add lines to /etc/rc.local:
   ### c2probe runs itself without crontab
   ### but crontab runs the file transfer job
   c2probe --DirectoryName /castor/ads.rl.ac.uk/test/c2probe --StageHost clhcbstager.ads.rl.ac.uk --SvcClasses lhcbUser --NbBytesToWrite 1024 --RunAsUser gtf --SleepTime 600
    • Make sure these lines are in crontab:
   ### c2probe is started after a reboot using rc.local
   ### this job sends the output from c2probe to the monitoring server
   01 01,02,03,04,05,06,07,08,09,10,11,12,13,14,15,16,17,18,19,20,21,22,23 * * * /usr/local/bin/c2probe_send_data_lhcb.sh >/var/spool/c2probe_send_data_results.log 2>&1
    • Start up c2probe by hand or reboot the machine. To start it by hand, run:
   c2probe --DirectoryName /castor/ads.rl.ac.uk/test/c2probe --StageHost clhcbstager.ads.rl.ac.uk --SvcClasses lhcbUser --NbBytesToWrite 1024 --RunAsUser gtf --SleepTime 600
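
Several of the checks above amount to "make sure these lines are in this file" (/etc/rc.local on the lsf and stager machines, for example). A minimal sketch of such a check follows; the helper name require_lines is hypothetical, and it assumes each required line appears in the file exactly as written, with no leading or trailing whitespace.

```shell
# Print every required line that is absent from FILE and return 1 if any
# is missing. require_lines is a hypothetical helper name.
require_lines() {
    file="$1"; shift
    missing=0
    for line in "$@"; do
        # -F: fixed string, -x: match the whole line, -q: no output
        if ! grep -Fxq -- "$line" "$file"; then
            echo "missing in ${file}: ${line}"
            missing=1
        fi
    done
    return "$missing"
}

# Example for the lsf machine:
# require_lines /etc/rc.local \
#     "source /lsf/conf/profile.lsf" "lsadmin limstartup" \
#     "lsadmin resstartup" "badmin hstartup"
```

The same function can be pointed at a saved copy of the crontab (for instance the output of `crontab -l` written to a file) to confirm the c2probe_send_data_lhcb.sh entry.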

CASTOR-tests: (CT) 14:30-15:00 Start up Castor services

  • Make sure the following services are down and do not start up after reboot:
    • Stager:
       service cupvd stop
       service nsd stop
       service expertd stop
       service vmgrd stop
       service vdqmd stop
       service repackd stop
       service rmmasterd stop
       service jobmanagerd stop
       service xrd stop
       chkconfig cupvd off
       chkconfig nsd off
       chkconfig expertd off
       chkconfig vmgrd off
       chkconfig vdqmd off
       chkconfig repackd off
       chkconfig rmmasterd off
       chkconfig jobmanagerd off
       chkconfig xrd off
    • DLF:
       service cupvd stop
       service rtcpclientd stop
       service vmgrd stop
       service rhd stop
       service vdqmd stop
       service repackd stop
       service rmmasterd stop
       service stagerd stop
       service mighunterd stop
       service rechandlerd stop
       chkconfig cupvd off
       chkconfig rtcpclientd off
       chkconfig vmgrd off
       chkconfig rhd off
       chkconfig vdqmd off
       chkconfig repackd off
       chkconfig rmmasterd off
       chkconfig stagerd off
       chkconfig mighunterd off
       chkconfig rechandlerd off
    • LSF:
       service cupvd stop
       service nsd stop
       service expertd stop
       service rtcpclientd stop
       service vmgrd stop
       service rhd stop
       service vdqmd stop
       service repackd stop
       service stagerd stop
       service jobmanagerd stop
       service mighunterd stop
       service rechandlerd stop
       service xrd stop
       chkconfig cupvd off
       chkconfig nsd off
       chkconfig expertd off
       chkconfig rtcpclientd off
       chkconfig vmgrd off
       chkconfig rhd off
       chkconfig vdqmd off
       chkconfig repackd off
       chkconfig stagerd off
       chkconfig jobmanagerd off
       chkconfig mighunterd off
       chkconfig rechandlerd off
       chkconfig xrd off
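
The per-machine command lists above follow one pattern: stop each service, then disable it at boot. A loop can express this, sketched below under some assumptions: the function name disable_services and the DRY_RUN variable are hypothetical, and the loop stops and disables each service in turn rather than stopping all services first as the lists do.

```shell
# Stop each named service and disable it at boot.
# disable_services and DRY_RUN are hypothetical names. With DRY_RUN unset
# or set to 1 the commands are only printed; set DRY_RUN=0 to execute them.
disable_services() {
    for svc in "$@"; do
        for cmd in "service $svc stop" "chkconfig $svc off"; do
            if [ "${DRY_RUN:-1}" = "1" ]; then
                echo "$cmd"
            else
                # Word splitting of $cmd is intentional here.
                $cmd || echo "warning: '$cmd' failed" >&2
            fi
        done
    done
}

# Service list copied from the Stager entry above:
STAGER_OFF="cupvd nsd expertd vmgrd vdqmd repackd rmmasterd jobmanagerd xrd"
# disable_services $STAGER_OFF
```

Reviewing the dry-run output against the lists before setting DRY_RUN=0 keeps the manual procedure as the source of truth.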

(CT) 15:00-12:00 (next day) Internal tests

Nagios: (JK) 15:00-17:00 Apply modified checks