RALPP Logbook 200611


13/11/2006

Upgraded Torque/Maui on CE and WNs - Chris Brew

Stopped all the queues on Friday evening to allow all jobs to drain out, once the queues were empty on Monday I shut down all the batch services.
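
Something along these lines disables submission on every queue so the jobs can drain out (a sketch only, assuming the standard Torque queue commands; queue names are taken from qstat rather than hard-coded):

# stop accepting new jobs on every queue; already-queued jobs keep running and drain out
for q in $(qstat -Q | awk 'NR>2 {print $1}'); do
    qdisable $q
done
qstat -Q    # repeat until the Tot column is zero for every queue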

On the CE

service pbs_server stop
service pbs_mom stop
service maui stop

and on the WNs

service pbs_mom stop

I then created a new yum repository on heplnx100 called openpbs with the following rpms copied from the T1 yum server:

maui-3.2.6p16-2_sl3_ratio03.i386.rpm
maui-3.2.6p16-2_sl3_ratio03.src.rpm
maui-client-3.2.6p16-2_sl3_ratio03.i386.rpm
maui-devel-3.2.6p16-2_sl3_ratio03.i386.rpm
maui-server-3.2.6p16-2_sl3_ratio03.i386.rpm
torque-2.1.6-1cri_sl3_mjb.i386.rpm
torque-2.1.6-1cri_sl3_mjb.src.rpm
torque-client-2.1.6-1cri_sl3_mjb.i386.rpm
torque-devel-2.1.6-1cri_sl3_mjb.i386.rpm
torque-mom-2.1.6-1cri_sl3_mjb.i386.rpm
torque-server-2.1.6-1cri_sl3_mjb.i386.rpm
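
For reference, setting up the repository would look something like this (a sketch only: the web-root path, the i386 subdirectory and the client-side yum stanza are assumptions, and an SL3-era yum client may need yum-arch headers rather than createrepo metadata):

# on heplnx100: put the rpms under the web server and build the repository metadata
mkdir -p /var/www/html/yum/openpbs/i386
cp *.rpm /var/www/html/yum/openpbs/i386/
createrepo /var/www/html/yum/openpbs/i386    # or: yum-arch /var/www/html/yum/openpbs/i386

# on the CE and WNs: point yum at the new repository (yum.conf stanza, or a .repo file on newer yum)
cat >> /etc/yum.conf <<'EOF'
[openpbs]
name=Torque/Maui rpms copied from the T1
baseurl=http://heplnx100.pp.rl.ac.uk/yum/openpbs/i386
EOF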

Two of the torque rpms have changed names between the versions and had to be removed by hand first:

yum remove torque-resmom torque-clients

This caused the lcg-CE_torque metarpm to be removed.

Then I installed the new torque rpms:

yum install torque torque-client torque-devel torque-mom torque-server

This installed the new rpms correctly but replaced the <code>/var/spool/pbs/server_name</code> file with one containing <code>localhost</code>; however, it had left an <code>.rpmsave</code> version behind, so:
mv /var/spool/pbs/server_name.rpmsave /var/spool/pbs/server_name

I was then able to run a general

yum update

to get the new version of maui (and the new kernel).

I then rebooted the CE and the torque/maui services all started cleanly.

On each of the WNs I ran the following script to upgrade torque and the kernel, and to reboot the nodes:

unset http_proxy
yum -d 1 -e 1 -y remove torque-clients torque-resmom
yum -d 1 -e 1 -y install torque torque-mom torque-client
mv /var/spool/pbs/server_name.rpmsave /var/spool/pbs/server_name
yum -d 1 -e 1 -y update
shutdown -r now

As they rebooted they cleanly rejoined the torque cluster and the upgrade was complete.
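
Something like the following confirms the nodes have rejoined and re-enables the queues that were stopped for the drain (a sketch using the standard Torque client tools):

pbsnodes -l                  # lists any nodes still down or offline; empty output means everything is back
qstat -B                     # server summary, confirms the upgraded pbs_server is answering

# re-open the queues for new submissions now the upgrade is done
for q in $(qstat -Q | awk 'NR>2 {print $1}'); do
    qenable $q
done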

Upgraded dCache on the dCache server nodes - Chris

I copied the upgrade rpms to a yum repository on heplnx100, then followed the upgrade procedure at http://www.dcache.org/manuals/dcacheUpgrade_1_6-1_7.shtml.

First, shut down all the dCache services:

On the head node (heplnx204):

service dcache-pool stop
service dcache-core stop
service pnfs stop

and on the pool nodes

service dcache-pool stop
service dcache-core stop

Then on the head node drop and recreate the databases that have had their schemas updated:

dropdb -U srmdcache billing 
dropdb -U srmdcache dcache

This failed for me because srmdcache didn't own the database, so I had to run:

dropdb -U postgres dcache

to drop the database before continuing with the rest:

dropdb -U srmdcache replicas
createdb -U srmdcache billing
createdb -U srmdcache dcache
createdb -U srmdcache replicas
psql -U srmdcache replicas -f /opt/d-cache/etc/psql_install_replicas.sql

The next step was to remove the old rpms and install the new:

rpm -e --nodeps dcache-server
rpm -Uvh http://heplnx100.pp.rl.ac.uk/yum/dcache/noarch/dcache-server-1.7.0-18.noarch.rpm
rpm -e --nodeps dcache-client
rpm -Uvh http://heplnx100.pp.rl.ac.uk/yum/dcache/noarch/dcache-client-1.7.0-16.noarch.rpm

and the upgrade script:

rpm -Uvh http://heplnx100.pp.rl.ac.uk/yum/dcache/noarch/glite-dcache-upgrade-0.0.4-0.noarch.rpm

Run the script:

sh /opt/d-cache/install/dCacheUpgrade_1_6-1_7.sh

start pnfs

service pnfs start

run the install.sh script:

/opt/d-cache/install/install.sh

and start the rest of the services

service dcache-core start

During startup there was a Tomcat error, apparently coming from gPlazma, saying that it could not find a necessary file in /opt/d-cache/libexec/jakarta-tomcat-4.1.3/bin/. Looking at the filesystem it was obvious that the path should have been /opt/d-cache/libexec/jakarta-tomcat-4.1.31/bin/, but the bad path didn't seem to be in the gPlazma startup scripts. I eventually tracked it down to the dcache-srm startup script, /opt/d-cache/bin/dcache-srm, and corrected the typo.
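
For anyone chasing the same thing, finding and fixing the bad path boils down to something like this (the grep location is a guess; the version strings are the ones from the error above):

# find which script still references the truncated tomcat directory
grep -rl 'jakarta-tomcat-4.1.3/bin' /opt/d-cache/bin
# here it turned out to be /opt/d-cache/bin/dcache-srm; fix the path in place
sed -i 's|jakarta-tomcat-4.1.3/bin|jakarta-tomcat-4.1.31/bin|g' /opt/d-cache/bin/dcache-srm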

On the pool nodes the process was very similar, just without the DB work:

service dcache-pool stop
service dcache-core stop
rpm -e --nodeps dcache-server
rpm -Uvh http://heplnx100.pp.rl.ac.uk/yum/dcache/noarch/dcache-server-1.7.0-18.noarch.rpm
rpm -e --nodeps dcache-client
rpm -Uvh http://heplnx100.pp.rl.ac.uk/yum/dcache/noarch/dcache-client-1.7.0-16.noarch.rpm
rpm -Uvh http://heplnx100.pp.rl.ac.uk/yum/dcache/noarch/glite-dcache-upgrade-0.0.4-0.noarch.rpm
sh /opt/d-cache/install/dCacheUpgrade_1_6-1_7.sh
/opt/d-cache/install/install.sh 
service dcache-core start 
service dcache-pool start 

Once the upgrade was finished and dCache was running I realised that it had removed all my pool groups, links and the rest of the pool/VO setup. However, it had saved the old PoolManager.conf file, so it was a simple matter to copy the setup commands across and restart dCache to get it all back.
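
Roughly what that looks like (a sketch: the name and location of the saved copy are placeholders here, only /opt/d-cache/config/PoolManager.conf is the standard live file):

# compare the saved PoolManager.conf against the freshly installed one
diff /opt/d-cache/config/PoolManager.conf.saved /opt/d-cache/config/PoolManager.conf
# copy the psu commands (pool groups, links, units) from the saved copy back into the live file,
# then restart dCache so the PoolManager picks them up
service dcache-core restart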

I then had an interesting problem when I installed the new kernel and rebooted heplnx204. When it came back, dCache just wasn't working: most of the services didn't show up in the web pages and the logs were filling up with error messages.

Stopping and restarting dCache produced error messages saying that it couldn't stop lm on shutdown and that it might already be running on startup; ps, however, showed no sign of it. Eventually, by comparing the .lastPid file with the ps output, I found that during the reboot one of the OS processes had been started with the PID of the previous lm service. Deleting the lastPid file allowed me to start everything correctly.
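
For anyone hitting the same thing, the checks boil down to something like this (a sketch: the exact name and location of the lastPid file depend on the setup, so the path below is only illustrative):

# the PID dCache recorded for the lm domain the last time it ran
cat /opt/d-cache/config/lastPid.lm
# what actually owns that PID now - after the reboot it was an unrelated OS process
ps -fp $(cat /opt/d-cache/config/lastPid.lm)
# remove the stale record so the startup scripts no longer think lm is running, then start dCache
rm /opt/d-cache/config/lastPid.lm
service dcache-core start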