Detailed 2.1.9 Upgrade
Friday Before: Drain batch system for affected VO
Monday 07:00: Start drain of FTS
09:00:
Upgrade instruction:
CASTOR: (MV) 8:00-8:30 Shutdown all CASTOR services (including LSF, cron and puppet) and SRM for the affected VO
DataBase: (RS) 8:30-12:30 Take database backup and upgrade stager/dlf schemas
Disk Servers: (JT) 8:30-10:30 Upgrade Castor/OS RPMs on disk servers:
- Puppet will provide the files:
- /etc/castor/castor.conf
- /etc/sysconfig/xrd
- /usr/local/bin/show_castor_services
- /etc/logrotate.d/syslog
- Puppet will restart xrd daemon
- Restart (r)syslog to pick up new loggers
- Make sure castor-gridftp-dsi-int is installed on all disk servers
- Make sure that the following are uncommented in castor.conf on all disk servers:
GSIFTP X509_USER_CERT /etc/grid-security/castor-gridftp-dsi-int/castor-gridftp-dsi-int-cert.pem
GSIFTP X509_USER_KEY /etc/grid-security/castor-gridftp-dsi-int/castor-gridftp-dsi-int-key.pem
- Make sure that /opt/xrootd/keys/pkey.pem has been distributed to all disk servers. The key must first be generated on the DLF machine and put into Puppet
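The castor.conf check above can be scripted so it is easy to repeat on every disk server. The following is a minimal sketch, not part of the original procedure: check_gsiftp_conf is a hypothetical helper name, and it only tests that each GSIFTP entry is present and not commented out, not that the certificate files exist.

```shell
#!/bin/sh
# Sketch: report whether the GSIFTP certificate/key entries are active
# (present and not commented out) in a castor.conf file.
# check_gsiftp_conf is a hypothetical helper, not a CASTOR tool.
check_gsiftp_conf() {
    conf="$1"
    rc=0
    for key in X509_USER_CERT X509_USER_KEY; do
        # An active entry starts with GSIFTP (optional leading whitespace),
        # followed by the key name and a non-empty value; a leading '#'
        # means the entry is still commented out.
        if grep -Eq "^[[:space:]]*GSIFTP[[:space:]]+${key}[[:space:]]+[^[:space:]]" "$conf"; then
            echo "OK: GSIFTP ${key}"
        else
            echo "MISSING: GSIFTP ${key}"
            rc=1
        fi
    done
    return "$rc"
}
```

On a disk server this would be called as `check_gsiftp_conf /etc/castor/castor.conf`; a non-zero exit status means at least one entry is missing or still commented out.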
(MV) 10:30-11:00 Change disk servers in puppet to 2.1.9
(MV) 11:00-12:00 Apply and verify changes using puppet to disk servers
Head Nodes: (CK) 8:30-12:00 Upgrade OS (SLC4->SL4)
- Take a backup
- Use this kickstart script: sl4-preProd-os-only-x86_64.cfg (151)
- Use different section for partitions for lsf/dlf and stager/ns
- Restore backup:
- Make sure that you have correct:
- .ssh directory
- /etc/yum.repos.d/*
- /etc/resolv.conf, make sure it has the following format:
domain ads.rl.ac.uk
search ads.rl.ac.uk gridpp.rl.ac.uk rl.ac.uk
nameserver 130.246.8.13
nameserver 130.246.56.240
nameserver 130.246.72.21
- Make sure that directory lsf exists on c<instance>lsf, if not do:
- mkdir -p /var/www/html/lsf
- chmod 777 /var/www/html/lsf
- service httpd start
- add RHSERVER_PRIVATE_PORT=9004 to /etc/sysconfig/rhd
- overwrite /etc/sysconfig/mighunterd.example with /etc/sysconfig/mighunterd
- modify /etc/logrotate.d/syslog for all h/n with the following:
/var/log/messages /var/log/secure /var/log/maillog /var/log/spooler /var/log/boot.log /var/log/cron {
    rotate 5
    daily
    compress
    sharedscripts
    postrotate
        /bin/kill -HUP `cat /var/run/syslogd.pid 2> /dev/null` 2> /dev/null || true
    endscript
}
- copy /usr/lib/log-processor/DLF.py from cPREdlf to all h/n and restart logprocessord for all h/n
- install nrpe rpms, also do chkconfig nrpe on for all h/n
- install gmond plus relevant tier1 config file
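The /etc/resolv.conf check in the restore step above lends itself to a quick scripted comparison after each head node is rebuilt. This is a sketch under the assumption that every expected entry appears as a complete line of its own; check_resolv_conf is a hypothetical helper name and is not part of the upgrade tooling.

```shell
#!/bin/sh
# Sketch: verify that a resolv.conf file contains every entry listed in
# the upgrade plan. check_resolv_conf is a hypothetical helper.
check_resolv_conf() {
    conf="$1"
    rc=0
    for line in \
        "domain ads.rl.ac.uk" \
        "search ads.rl.ac.uk gridpp.rl.ac.uk rl.ac.uk" \
        "nameserver 130.246.8.13" \
        "nameserver 130.246.56.240" \
        "nameserver 130.246.72.21"
    do
        # -F: fixed string (dots are literal), -x: match the whole line.
        if ! grep -Fxq "$line" "$conf"; then
            echo "MISSING: $line"
            rc=1
        fi
    done
    return "$rc"
}
```

Calling `check_resolv_conf /etc/resolv.conf` prints one MISSING line per absent entry and returns non-zero if any entry is absent. It does not check the order of the lines.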
(CK) 12:00-12:30 Install Castor RPMs
- Execute 'install-castor-219', which is at ccsc15:/root/219-upgrade
(CK) 13:30-13:45 Modify sysconfig entries
(CK) 13:45-14:00 Deploy new castor.conf
- Make sure the correct version is distributed; an example can be taken from preProd
(CK) 14:00-14:30 Configure rsyslog
- Refer to https://www.gridpp.ac.uk/wiki/RAL_Tier1_Upgrade_Plan#Additional_Information_for_DLF_Replacement
- ln -s /etc/castor/tnsnames.ora /etc/tnsnames.ora
- copy /etc/castor/DLFCONFIG from c<instance>dlf to all h/n
(CK) 14:30-15:30 Install and configure xrootd mgr on DLF server
- make sure cert is installed in /etc/grid-security
- install certificate
- this is not in puppet yet:
- generate pair of xrootd keys
- copy lcgcc246:/etc/sysconfig/xrd to c<instance>dlf
- copy xrd.cf file from PreProd if needed
- install the following:
- xrootd-xrd3cp
- xrootd-secssl
- xrootd-xcastor2fs
- fetch-crl
- tier1-yum-lcg-ca-certs-2.0-1.noarch
- lcg-CA
- distribute grid-mapfile using puppet
(CK) 15:30-15:45 Update restarter scripts
- /etc/rsyslog.conf can be taken from ccsc15
- Restart rsyslog on all h/n
(CK) 15:45-16:00 Modify show_castor_services
- Copy it from ccsc15:/root/219-upgrade
(CK) 16:00-16:30 The final checks and configuration
- Create lhcblogging account: uid=510 gid=510 for stager and lsf machines only
- 'chkconfig ypbind on' for all h/n
- 'service ypbind start' for all h/n
- Copy /etc/castor/logprocessord to all h/n
- Make sure that there is an entry for the DLF DB in 'tnsnames.ora' on all h/n
- 'chkconfig nrpe on' for all h/n
- 'service nrpe start' for all h/n
- 'chkconfig gmond on' for all h/n
- 'service gmond start' for all h/n
- Make sure you have correct /etc/sysconfig/rhd for stager machine:
- /etc/sysconfig/rhd
#RHD_OPTIONS=
DAEMON_COREFILE_LIMIT=unlimited
ROLES="public private"
- /etc/sysconfig/rhd.public
#RHD_OPTIONS=
DAEMON_COREFILE_LIMIT=unlimited
ROLES="public private"
- /etc/sysconfig/rhd.private
RHD_OPTIONS="-p 9004"
DAEMON_COREFILE_LIMIT=unlimited
ROLES="public private"
- 'chkconfig httpd on' for dlf and lsf machines
- 'service httpd start' for dlf and lsf machines
- Make sure that file '/var/www/conf/dlf/login.conf' is owned by group 'apache'; if not, do:
chgrp apache /var/www/conf/dlf/login.conf
- Make sure that on DLF machine:
- /etc/sysconfig/logprocessord looks like:
LOGPROCESSORDD_OPTIONS="-c /etc/castor/logprocessord.conf"
#ROLES=
- /etc/castor/logprocessord.conf looks like:
#-------------------------------------------------------------------------------
# General program settings
#-------------------------------------------------------------------------------
[main]
pid_file = /var/run/logprocessord.normal.pid
log_file = /var/log/castor/logprocessord.log
plugin_path = /usr/lib/log-processor
daemon_processes = dlf-syslog-to-db

#-------------------------------------------------------------------------------
# DLF Database settings
#-------------------------------------------------------------------------------
[dest-dlf-db]
module = DLF
class = DLFDbDest

# The database connection string in the form <username>/<password>@<database>
# or a reference to a file containing the password
# e.g. file:///etc/castor/DLFCONFIG
connection_string = file:///etc/castor/DLFCONFIG

# The maximum number of records to be inserted in one bulk operation
bulk_count = 5000

# The maximum amount of time to wait before inserting records into the database
flush_interval = 60

# The domain name that should be appended to all encountered hostnames
#domain_name = cern.ch

#-------------------------------------------------------------------------------
# DLF source log file - from syslog
#-------------------------------------------------------------------------------
[source-dlf-log-file-syslog]
module = DLF
class = DLFLogFile

# There are basically two types of input that the logprocessor operates on:
# files and pipes.
# If the type is set to 'file' then it treats the file specified by the path
# variable as the source for the messages. It reads the file treating each line
# as a separate message and parses it. When it reaches the end of the file it
# closes it and quits.
# If the type is set to 'pipe' then it blocks until new data is available in
# the currently opened file.
# The 'seek' variable determines if it should find the end of the file and
# insert only the newly arriving messages, if set to false it also inserts the
# messages that are already in the file.
# The 'dynfiles' option specific if dynamic file names should be parsed. If
# set to false the pipe just processes the file specified by path. If set to
# true then the pipe processes the YYYY-MM-DD.log files stored in the directory
# defined by the path variable (YYYY denotes year, MM - month, and DD - day).
# It starts with the current date and changes the file it processes when there
# is no more data in the current one and the system date changes.
path = /var/log/dlf/syslog.input
type = pipe
dynfiles = false
seek = true

#-------------------------------------------------------------------------------
# DLF processes
#-------------------------------------------------------------------------------
[process-dlf-syslog-to-db]
source = dlf-log-file-syslog
destination = dlf-db
- /var/www/conf/dlf/login.conf is talking to the correct DLF DB and looks like:
<?php
/*
 * DLF Database connectivity information
 */
$db_instances = array(
    "castor2" => array(
        "username" => "XXXXXXXXXXXXXX-username",
        "schema" => "XXXXXXXXXXXXXX-the same as above",
        "password" => "XXXXXXXXXXXXXX-password",
        "server" => "XXXXXXXXXXXXXX-DLF DB server name",
        /* stager database */
        "stagerdb" => array(
            "username" => "XXXXXXXXXXXXXX-username",
            "schema" => "XXXXXXXXXXXXXX-the same as above",
            "password" => "XXXXXXXXXXXXXX-password",
            "server" => "XXXXXXXXXXXXXX-STAGER DB server name",
        ),
    ),
);
?>
- Install Amanda client, rpm and script can be taken from ccsc01.ads.rl.ac.uk:/root/amanda_client. Do it for all h/n
- Make sure the following lines are in /etc/rc.local for lsf machine:
source /lsf/conf/profile.lsf
lsadmin limstartup
lsadmin resstartup
badmin hstartup
- Place c2probe software by doing the following: (stager machine only)
- Add lines to /etc/rc.local:
### c2probe runs itself without crontab
### but crontab runs the file transfer job
c2probe --DirectoryName /castor/ads.rl.ac.uk/test/c2probe --StageHost clhcbstager.ads.rl.ac.uk --SvcClasses lhcbUser --NbBytesToWrite 1024 --RunAsUser gtf --SleepTime 600
- Make sure these lines are in crontab:
### c2probe is started after a reboot using rc.local
### this job sends the output from c2probe to the monitoring server
01 01,02,03,04,05,06,07,08,09,10,11,12,13,14,15,16,17,18,19,20,21,22,23 * * * /usr/local/bin/c2probe_send_data_lhcb.sh >/var/spool/c2probe_send_data_results.log 2>&1
- Start up c2probe by hand or reboot the machine. You can start up by command:
c2probe --DirectoryName /castor/ads.rl.ac.uk/test/c2probe --StageHost clhcbstager.ads.rl.ac.uk --SvcClasses lhcbUser --NbBytesToWrite 1024 --RunAsUser gtf --SleepTime 600
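Several of the final checks above are the same chkconfig/service pair repeated per daemon (ypbind, nrpe, gmond, httpd). A small helper can generate those pairs so that nothing is missed on a given head node. This is only a sketch: enable_services and the DO_RUN variable are hypothetical names introduced here, and by default the function only prints the commands instead of running them.

```shell
#!/bin/sh
# Sketch: for each named service, emit "chkconfig <svc> on" followed by
# "service <svc> start". With DO_RUN=1 the commands are executed instead
# of printed. enable_services and DO_RUN are hypothetical names.
enable_services() {
    for svc in "$@"; do
        for cmd in "chkconfig $svc on" "service $svc start"; do
            if [ "${DO_RUN:-0}" = "1" ]; then
                # Word splitting of $cmd is intentional here.
                $cmd
            else
                echo "$cmd"
            fi
        done
    done
}
```

For example, `enable_services ypbind nrpe gmond` prints the six commands for review, and `DO_RUN=1 enable_services ypbind nrpe gmond` would run them as root on the head node.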
CASTOR-tests: (CT) 14:30-15:00 Start up Castor services
- Make sure the following services are down and do not start up after reboot:
- Stager:
service cupvd stop
service nsd stop
service expertd stop
service vmgrd stop
service vdqmd stop
service repackd stop
service rmmasterd stop
service jobmanagerd stop
service xrd stop
chkconfig cupvd off
chkconfig nsd off
chkconfig expertd off
chkconfig vmgrd off
chkconfig vdqmd off
chkconfig repackd off
chkconfig rmmasterd off
chkconfig jobmanagerd off
chkconfig xrd off
- DLF:
service cupvd stop
service rtcpclientd stop
service vmgrd stop
service rhd stop
service vdqmd stop
service repackd stop
service rmmasterd stop
service stagerd stop
service mighunterd stop
service rechandlerd stop
chkconfig cupvd off
chkconfig rtcpclientd off
chkconfig vmgrd off
chkconfig rhd off
chkconfig vdqmd off
chkconfig repackd off
chkconfig rmmasterd off
chkconfig stagerd off
chkconfig mighunterd off
chkconfig rechandlerd off
- LSF:
service cupvd stop
service nsd stop
service expertd stop
service rtcpclientd stop
service vmgrd stop
service rhd stop
service vdqmd stop
service repackd stop
service stagerd stop
service jobmanagerd stop
service mighunterd stop
service rechandlerd stop
service xrd stop
chkconfig cupvd off
chkconfig nsd off
chkconfig expertd off
chkconfig rtcpclientd off
chkconfig vmgrd off
chkconfig rhd off
chkconfig vdqmd off
chkconfig repackd off
chkconfig stagerd off
chkconfig jobmanagerd off
chkconfig mighunterd off
chkconfig rechandlerd off
chkconfig xrd off
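The per-machine lists above can be kept in one place and expanded into commands, which avoids copy errors between the Stager, DLF and LSF nodes. The sketch below is illustrative only: services_for_role and print_disable_commands are hypothetical names, the service lists are copied from the lists above, and the function prints the commands rather than running them.

```shell
#!/bin/sh
# Sketch: map a head-node role to the services that must be stopped and
# disabled on it, then print the matching commands.
# Both function names are hypothetical; the lists mirror the plan above.
services_for_role() {
    case "$1" in
        stager) echo "cupvd nsd expertd vmgrd vdqmd repackd rmmasterd jobmanagerd xrd" ;;
        dlf)    echo "cupvd rtcpclientd vmgrd rhd vdqmd repackd rmmasterd stagerd mighunterd rechandlerd" ;;
        lsf)    echo "cupvd nsd expertd rtcpclientd vmgrd rhd vdqmd repackd stagerd jobmanagerd mighunterd rechandlerd xrd" ;;
        *)      echo "unknown role: $1" >&2; return 1 ;;
    esac
}

print_disable_commands() {
    list=$(services_for_role "$1") || return 1
    # Same order as the plan: stop everything first, then disable at boot.
    for svc in $list; do
        echo "service $svc stop"
    done
    for svc in $list; do
        echo "chkconfig $svc off"
    done
}
```

Running `print_disable_commands dlf` lists the commands for the DLF machine; once reviewed, the output could be piped to `sh` as root on that machine.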
(CT) 15:00-12:00 (next day) Internal tests
Nagios: (JK) 15:00-17:00 Apply modified checks