QMUL


Cross Site support at QMUL

Here is a list of checks you can run on each node:

  • ce02:
-bash-2.05b$ sudo tail -1 /var/log/messages
-bash-2.05b$ sudo /etc/rc.d/init.d/globus-gatekeeper status
-bash-2.05b$ sudo /etc/rc.d/init.d/globus-gridftp status
-bash-2.05b$ sudo /etc/rc.d/init.d/globus-mds status

  • ce01:
-bash-2.05b$ sudo /etc/rc.d/init.d/bdii status

  • se01:
-bash-2.05b$ sudo tail -1 /var/log/messages
-bash-2.05b$ sudo /etc/rc.d/init.d/dpm status

Local resources

Currently the HTC consists of a total of 174 machines (348 processors). There are 160 "compute nodes" (128 dual 2.8 GHz Intel Xeon nodes with 2 Gbyte RAM and 32 dual 2.0 GHz AMD Athlon nodes with 1 Gbyte RAM). There is a total of about 40 Tbyte of disk storage (25 Tbyte on RAID arrays and 15 Tbyte distributed amongst the cluster nodes). All the nodes are connected together on a dedicated Gbit Ethernet network and are also connected to the London MAN via a Gbit link.

e-Science High Throughput Cluster

Upgrades

Glite 3.0

06/07/06

Creation of the VO users
  • Made a script for the creation of users and groups in NIS from the users.yaim file.
  • The users.yaim file is created via qmul/scripts/updateVO.pl, which is invoked by running make in qmul/config.
  • The updateVO.pl script takes three files as input:
passwd
currentVO.cfg
site-info.def.main
  • currentVO.cfg has the following format, one entry per VO:
[geant4]
gname=geant4
gid=32001
numusers=50
voms_server_uri="vomss://lcg-voms.cern.ch:8443/voms/geant4?/geant4/"
vomses="'geant4 lcg-voms.cern.ch 15007 /C=CH/O=CERN/OU=GRID/CN=host/lcg-voms.cern.ch geant4'"
  • The updateVO.pl script produces three files:
site-info.def
users.yaim
vos.yaim
  • From users.yaim we need to create the passwd and group files for the NIS server (the assumed line format is sketched after this list).

  • This is done with the script makePasswd.py in the config directory.
  • makePasswd.py is invoked as: makePasswd.py [users.yaim] [pwdfile] [gfile] [homedir]
  • The pwdfile and gfile need to be appended to the passwd.local and group.local files on the NIS server, which contain the local users.
  • On the NIS server the two files need to be placed in /var/yp/src.
  • Then run make in /var/yp, which will create the relevant entries in /var/yp/htc.
  • On the WN the home directories are created from the NIS entries by running:
/etc/init.d/lcg2 start
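For reference, the makePasswd.py script below (and makeGroupConf.py further down) assume that users.yaim lines are colon-separated, with the UID, login name, GID and group name in the first four fields and the VO and role towards the end, roughly like this (values are purely illustrative, not taken from the real file):

32001:geant4001:32001:geant4:geant4::
32050:geant4050:32001:geant4:geant4:sgm: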
makePasswd script
#!/usr/bin/python
# Dummy script to create group and passwd file for nis
# usage: ./makePasswd.py users.yaim passwdfile groupfile /scratch/lcg2/

import string
import sys


def getUserAndGroup(filename):
        file=open(filename)
        content=file.readlines()
        poolinfo=[]
        for line in content:
                sline=string.split(line,":")
                current=[[sline[0],sline[1],sline[2],sline[3]]]
                poolinfo=poolinfo+current
        return poolinfo

#def create
def extractGroup(poolinfo):
        groups={}
        for apool in poolinfo:
                guid=apool[2]
                group=apool[3]
                if(not groups.has_key(group)):
                        groups[group]=guid
        return groups

def createPasswdFile(newpassfile,poolinfo,homedir):
        outfile=open(newpassfile,"w")
        for i in poolinfo:
                uid=i[0]
                login=i[1]
                guid=i[2]
                group=i[3]
                passwdline=login+':x:'+uid+':'+guid+':mapped user for group ID '+guid+':'+homedir+login+':/bin/bash\n'
                outfile.write(passwdline)

def createGroupFile(groupfile,groups):
        outfile=open(groupfile,"w")
        for i in groups.keys():
                groupline=i+':x:'+groups[i]+':'+'edguser\n'
                outfile.write(groupline)


def main(arg):
        if(len(arg)==5):
                poolinfo=getUserAndGroup(arg[1])
                createPasswdFile(arg[2],poolinfo,arg[4])
                groups=extractGroup(poolinfo)
                createGroupFile(arg[3],groups)
        else:
                print "Usage is makePasswd.py [users.yaim] [passwordfile] [groupfile] [homedir]"
#       print poolpass

main(sys.argv)
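As a usage sketch (file names and home directory are illustrative), running

./makePasswd.py users.yaim passwd.pool group.pool /home/

on a file with the lines shown above would write passwd and group entries of the form

geant4001:x:32001:32001:mapped user for group ID 32001:/home/geant4001:/bin/bash
geant4:x:32001:edguser

Note that the homedir argument is concatenated directly with the login name, so it should end with a trailing slash.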
Certificates
  • Giuseppe has received renewals for ce01, mon01, ce02 and se01, but none of the serials in the mail correspond to the serials on the boxes. Probably these were certificates requested previously and never used.
  • We have requested two certificates
wn01.esc.qmul.ac.uk (currently the name of the ce)
ce02.esc.qmul.ac.uk
Problem with submission at QMUL

SFTs are showing that job submission is failing to match. Observed that the BDII is alive but not responding on port 2170, hence the problem. Could submit a job on the short queue from Imperial as dteam.
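As a quick check of that symptom (a sketch, assuming ldapsearch is available and the site BDII runs on ce01), one can query port 2170 directly; a hang or an empty result reproduces the problem:

ldapsearch -x -H ldap://ce01.esc.qmul.ac.uk:2170 -b o=grid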

17/07/06

Installation of the rpms on wn01

the yum.conf file used is:

[glite3-base]
name=GLITE - 3_0_0 Repository For sl3 $basearch
baseurl=http://kickstart/RPMS/GLITE/3_0_0/sl3/$basearch/base

[glite3-externals]
name=GLITE - 3_0_0 Repository For sl3 $basearch
baseurl=http://kickstart/RPMS/GLITE/3_0_0/sl3/$basearch/externals

[glite3-updates]
name=GLITE - 3_0_0 Repository For sl3 $basearch
baseurl=http://kickstart/RPMS/GLITE/3_0_0/sl3/$basearch/updates

[CA]
name=CA Repository
baseurl=http://kickstart/RPMS/LCG_CA
  • yum -c yum.conf install lcg-CE
  • yum -c yum.conf install lcg-CA
makeGroupConf.py to populate groups.conf
  • The groups.conf can be populated from the users.conf, since the last two columns are the VO and the role.
#!/usr/bin/python
# Dummy script to create yaim groups.conf  from the users.conf file
# usage ./makeGroupConf.py users.yaim groups.conf

import string
import sys


def getVOandRoles(filename):
        file=open(filename)
        content=file.readlines()
        voroleDict={}
        for line in content:
                sline=string.split(line,":")
#               print sline
                vo=sline[-3]
                role=sline[-2]
                if(voroleDict.has_key(vo)):
                        if(not role in voroleDict[vo]):
                                voroleDict[vo]=voroleDict[vo]+[role]
                else:
                        voroleDict[vo]=[role]
#       print voroleDict
        return voroleDict

def createGroupConf(filename,vorole):
        map={'sgm':'lcgadmin','prd':'production'}
        outfile=open(filename,"w")
        for vo in vorole.keys():
                for role in vorole[vo]:
                        if(role!='' and role not in map.keys()):
                                print 'role='+role+' not found, skipping'
                                continue
                        if(role==''):
                                line='''"/VO='''+vo+'''/GROUP=/'''+vo+'''"::::'''
                        else:
                                line='''"/VO='''+vo+'''/GROUP=/'''+vo+'''/ROLE='''+map[role]+'''":::'''+role+''':'''
                        outfile.write(line+'\n')

def main(arg):
        if(len(arg)==3):
                vorole=getVOandRoles(arg[1])
                createGroupConf(arg[2],vorole)
        else:
                print "Usage is makeGroupConf.py [users.yaim] [groups.conf]"
#       print poolpass

main(sys.argv)
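For illustration (the VO name is assumed), a dteam entry with an empty role and one with role sgm in users.yaim would make the script above write groups.conf lines of the form:

"/VO=dteam/GROUP=/dteam"::::
"/VO=dteam/GROUP=/dteam/ROLE=lcgadmin":::sgm: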
Torque Client
  • commands used
[root@wn01 etc]# cat fstab
[root@wn01 etc]# cat fstab | tail -2
pbs:/var/spool/pbs     /var/spool/pbs          nfs     ro     0 0
That is, we added the above line to the /etc/fstab file.
yum install torque-clients
cd /etc/
mkdir pbs
cd pbs/
echo "pbs.htc.esc.qmul" >  /etc/pbs/server_name
mount /var/spool/pbs/

[root@wn01 pbs]# qstat -q
to test that it worked

Note that first we have to install the torque clients and only then mount the /var/spool/pbs directory, otherwise the installation of the torque-clients rpm will override the contents of the freshly mounted directory.

Maui
  • After configuring the nodes I realized that the information system was reporting an empty tree. In the log files I could see that the vomaxjob plugin did not return anything. The reason is that diagnose -g was not there, because the installation did not install the maui client tools.
  • The configuration file that specifies to run vomaxjob is /opt/lcg/etc/lcg-info-dynamic-scheduler.conf
  • Maui Client Tools.
    • yum install maui-clients provides the clients from the lcg distribution, which is the rpm maui-client-3.2.6p11-2_SL30X.i386.rpm. When used, it is not compatible with the version compiled by Alex on fe03 and built for FC2 (maui-3.2.6p10-4.fc2.qmul).
    • Tried to copy the diagnose binary from gfe03, but when used it complains with the following: ./diagnose: /lib/tls/libc.so.6: version `GLIBC_2.3.4' not found (required by ./diagnose)
    • As a temporary solution we have written a dummy diagnose command that does:
#!/bin/sh
# pass all arguments through to maui's diagnose on the pbs server
ssh root@fe03.htc.esc.qmul "diagnose $*"

This should be changed in the future.

Information system

After doing the maui tricks the information system was still not publishing the right information. I realized that there was still one lcg rpm missing, hence the final list to install is:

yum install lcg-info-dynamic
yum install lcg-info-dynamic-pbs
yum install lcg-info-dynamic-scheduler-generic
yum install lcg-info-dynamic-software
yum install lcg-info-dynamic-scheduler-pbs
yum install lcg-info-templates
yum install lcg-info-generic

Where the missing one was lcg-info-generic.

  • BUG: the lcg-CE dependency tree should contain those rpms. Need to check whether they are contained in the dependencies. It generally works because people keep their systems updated.
Yaim configuration
  • Yaim needs to create the rgma, edguser and edginfo users. Alex prefers that we do not put them in yp since they are only used on the service nodes.
  • Yaim will create those users automatically, but the /home directory is automounted and yaim crashes because it cannot create the three users above.
  • We had to comment out the entry for that home dir in /etc/auto.master.
  • Yaim needs the users.yaim and groups.conf files, and they cannot be empty. So we have deleted the config_users function from the node-info.def file to avoid it creating the pool accounts.
  • Issue: prd users are not in the yellow pages. They will be added if necessary.

18/07/2006 CE

  • The day before (not logged) we could submit a job to QMUL with globus-job-submit from gfe03. The problem is that the output cannot be retrieved. Usually this is because the pbs output could not be retrieved.
  • We tried to submit a job from wn01, now renamed ce02 (which caused us to rerun yaim).
  • From dteam001 on ce02, qsub -q lcg2_short test.sh returns

Bad UID for Job execution

  • Alex found that this is because ce02 was not listed in the shosts.equiv of the pbs server. After checking today it has disappeared from there.
  • After that we could submit a job but the output was not returned. Alex told us he was not expecting to support two CEs.
  • We understood that the authentication of the cn when the job comes back is done via an agent that is started in the prologue of the script. So two things need to be done, on the cn and the ce (see the sketch after this list):
    • ce: the home directories should contain an authorized_keys file that is readable by root only
    • cn: /etc/ssh/ssh_known_hosts should contain the public key of ce02.
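A minimal sketch of that setup (host names from above; the exact commands and the example pool account are assumptions to adapt):

# on each cn: append ce02's host public key to the system-wide known hosts
ssh-keyscan ce02.esc.qmul.ac.uk >> /etc/ssh/ssh_known_hosts
# on the ce: make a pool account's authorized_keys readable by root only
chown root:root ~dteam001/.ssh/authorized_keys
chmod 600 ~dteam001/.ssh/authorized_keys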

Mon Box upgrade

  • First of all I needed to disable the automount of /home, which was also there when running the installation on ce02.
  • The auto.master contains
# /misc /etc/auto.misc  --timeout=60
#/home /etc/auto.home  --timeout=300
#/opt /etc/auto.opt  --timeout=300
/opt/shared /etc/auto.opt --timeout=300
/mnt/auto /etc/auto.mnt  --timeout=60
  • /etc/init.d/autofs stop
  • updated the yum.conf with the following repositories:
[glite3-base]
name=GLITE - 3_0_0 Repository For sl3 $basearch
baseurl=http://kickstart/RPMS/GLITE/3_0_0/sl3/$basearch/base

[glite3-externals]
name=GLITE - 3_0_0 Repository For sl3 $basearch
baseurl=http://kickstart/RPMS/GLITE/3_0_0/sl3/$basearch/externals

[glite3-updates]
name=GLITE - 3_0_0 Repository For sl3 $basearch
baseurl=http://kickstart/RPMS/GLITE/3_0_0/sl3/$basearch/updates

[CA]
name=CA Repository
baseurl=http://kickstart/RPMS/LCG_CA
  • Removing the old meta-package lcg-MON (rpm -e lcg-MON)
  • Installing the glite-MON (yum install glite-MON) fails with
Package edg-rgma-api-perl needs librgma-c.so.0, this is not available.
Package edg-rgma-api-perl needs librgma-c.so.0, this is not available.
Package edg-rgma-api-perl needs librgma-c.so.0, this is not available.
Package edg-rgma-api-perl needs librgma-c.so.0, this is not available.
Package edg-rgma-api-perl needs edg-rgma-api-c, this is not available.
Package edg-rgma-api-perl needs librgma-cpp.so.0, this is not available.
Package edg-rgma-api-perl needs edg-rgma-api-cpp, this is not available.
Package edg-rgma-api-perl needs edg-rgma-base, this is not available.
  • The package edg-rgma-api-perl is not in the glite stack, hence I removed it: rpm -e edg-rgma-api-perl
  • yum install glite-MON
  • We have put all the yaim configuration in /opt/glite/yaim/config
  • configure_node /opt/glite/yaim/config/site-info.def MON fails with "Java Location not set"; the reason is that the java version on the mon box is older than on ce02.
  • We have installed the latest version, j2sdk-1_4_2_12-linux-i586.rpm
  • Note: this should be put in the QMUL repository for the other machines.
  • Running the configuration: configure_node /opt/glite/yaim/config/site-info.def MON
  • Everything is fine apart from the fmon (GridIcE) part, which we can ignore.
  • Checked whether the apel cron job /etc/cron.d/edg-apel-publisher works; it gets the following message:
org.glite.apel.core.ApelException: org.glite.rgma.RGMAException: Error registering producer table in Registry for table: LcgRecords
Caused by: cannot service request, client hostname is currently being blocked

This is because we were using a too old version of the rgma servlet, which causes problems for the registry. I (Ovda) have written a mail to Alastair Duncan to unblock us.

  • Checking what services are running, we decided to shut down:
    • stopped cupsd

SE Upgrade (19/07/06)

  • First thing, to avoid disasters, we backed up the MySQL database as advised on backup your MySQL database.
  • We have stored the backup in /root/mysql.backup.gz
  • Removed the metapackage with the lcg dependencies: rpm -e lcg-SE_dpm_mysql
  • Installed the grid sw on the machine:
  yum install glite-SE_dpm_mysql 
  • Copied config file from mon01:

[root@se01 scripts]# cd /opt/glite/yaim/
                     scp -r mon01:/opt/glite/yaim/config .

  • Created user accounts:
[root@se01 root]# /opt/glite/yaim/config/setup_lcg_pool_accounts
  • Installed newer version of java package:
j2sdk-1_4_2_12-linux-i586.rpm
  • Deleted the function config_users from the file
[root@se01 root]# less /opt/glite/yaim/scripts/node-info.def
  • Configured the machine
[root@se01 scripts]# ./configure_node /opt/glite/yaim/config/site-info.def SE_dpm_mysql
  • Used Alex's recipe:
[root@se01 etc]# cp resolv-pub.conf resolv.conf
[root@se01 etc]# /etc/rc.d/init.d/rfiod restart
[root@se01 etc]# /etc/rc.d/init.d/dpm-gsiftp restart
[root@se01 etc]# /etc/rc.d/init.d/dpnsdaemon restart
[root@se01 etc]# /etc/rc.d/init.d/dpm restart
[root@se01 etc]# /etc/rc.d/init.d/srmv1 restart
[root@se01 etc]# /etc/rc.d/init.d/srmv2 restart
[root@se01 etc]# cp resolv-priv.conf resolv.conf


  • Tested the machine:
[mazza@gfe03 mazza]$ globus-url-copy file:////`pwd`/pippo gsiftp://se01.esc.qmul.ac.uk:2811/dpm/esc.qmul.ac.uk/home/dteam/ol12
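To double-check that the file landed in the DPM namespace, one option (a sketch, assuming the DPM client tools are available on the UI) is:

export DPNS_HOST=se01.esc.qmul.ac.uk
dpns-ls -l /dpm/esc.qmul.ac.uk/home/dteam/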


cn069 tarball (20/07/06)

  • Configuration of the tarball on cn069.
  • Created a gridadmin user in the nis.
  • Ran yaim as gridadmin.
  • Had to get the cron job for the crl update installed as gridadmin.
  • Had to run install_cert_userland to make sure that the certs are in the tarball.
  • We had to make an rpm out of the tarball.
    The rpm would contain the certificates, the cron job for the crl, and the creation of links so as to have grid-security in /etc pointing to the one in the tarball (a sketch of the link follows). (I know rgma is not using the X509_ env vars properly.)
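A minimal sketch of that link, assuming the tarball is unpacked under /opt/glite-wn (the path is an assumption):

# make /etc/grid-security point at the copy shipped in the tarball
ln -s /opt/glite-wn/grid-security /etc/grid-security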

cn069 testing (21/07/06)

  • The problem is to be sure that dteam and ops jobs are going to the node.
  • We did that by setting a property on the cn069 node in /var/spool/pbs/nodes with properties=lcg2shortattr.
  • In qmgr we did the following to allow only dteam jobs and to specify that we want lcg2shortattr:
set queue lcg2_short acl_group_enable=true
set queue lcg2_short acl_group+=dteam
set queue lcg2_short resources_default.neednodes = lcg2shortattr

A similar thing can be done in maui.cfg by defining an sft partition.
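A rough sketch of what that could look like in maui.cfg (partition, node and group names taken from above; the directives should be checked against the maui documentation):

# put cn069 in a dedicated partition and give only dteam and ops access to it
PARTITIONMODE   ON
NODECFG[cn069]  PARTITION=sft
GROUPCFG[dteam] PLIST=sft
GROUPCFG[ops]   PLIST=sft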


CA 1.9 (20/09/2006)

The procedure to upgrade the CAs is far from perfect.

  • First we have to retrieve the CA rpm from this location: http://linuxsoft.cern.ch/LCG-CAs/current/RPMS.production/
  • On fe02 in /mnt/installs/RPMS there is a script get_LCG_CA that takes the latest rpms and creates the header files. Remember to remove the old rpms.
  • On each machine you have to run yum update lcg-CA

For the WN we have to rebuild the glite rpm that contains the whole software.

  • The build procedure is done on ce01 in /usr/src/redhat
  • First get a fresh version of the tarball from cn120.
  • untar it in /usr/src/redhat/SOURCES/temp/
  • add the new certificates in the ./grid-security/certificates/
  • cd /usr/src/redhat/SPECS/
  • Edit glite-qmul.spec and increase the version number
  • run rpmbuild -ba glite-qmul.spec >& glite-qmul-8.log
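Putting the steps above together, roughly (the tarball and certificate source paths are assumptions):

cd /usr/src/redhat/SOURCES/temp/
tar xzf /path/to/fresh-wn-tarball.tar.gz                              # fresh tarball fetched from cn120
cp /etc/grid-security/certificates/* ./grid-security/certificates/   # add the new CA certs
cd /usr/src/redhat/SPECS/
vi glite-qmul.spec                                                    # bump the version number
rpmbuild -ba glite-qmul.spec >& glite-qmul-8.log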

Testing:

  • To test, install on cn362 and direct jobs there by assigning the sft partition to OPS and DTEAM
See GROUPCFG[ops]
  • After running the SFTs, verify that they were OK here

Site log

SC4 Transfer Test

16/01/2006

Realized that the srm version distributed in dcache-client-1.6.6-4 gives

srmcp error: nulljava.lang.NullPointerException

Used the version distributed in http://www.dcache.org/downloads/dcache-v1.6.5-2.tgz

rpm2cpio d-cache-client-1.0-100-RH73.i386.rpm  > client.cpio
cpio --make-directories -F client.cpio -i

Defined SRM_PATH as the path of the unpacked srm client (see the example below). Then Graeme's scripts worked.
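For example (the unpack location is an assumption; point it at wherever the cpio extraction put the srm client):

export SRM_PATH=/path/to/unpacked-client/srm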


 nohup ./filetransfer.py --ftp-options="-p 10" --number=2  --delete -s 
 https://fts0344.gridpp.rl.ac.uk:8443/sc3ral/glite-data-transfer-fts/services/FileTransfer   
 srm://se2-gla.scotgrid.ac.uk:8443/dpm/scotgrid.ac.uk/home/dteam/tfr2tier2/canned1G  
 srm://se01.esc.qmul.ac.uk:8443/dpm/esc.qmul.ac.uk/home/dteam/can1G

Which resulted in:

Transfer Bandwidth Report:
  2/2 transferred in 237.424527884 seconds
  2000000000.0 bytes transferred.
Bandwidth: 67.389836015Mb/s
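As a cross-check of the reported figure: 2×10^9 bytes × 8 bits/byte ÷ 237.4 s ≈ 6.74×10^7 bit/s ≈ 67.4 Mbit/s (decimal megabits), consistent with the report.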

The network bandwidth obtained with iperf was 94 Mb/s, which indicates that there is a 100 Mb trunk in the line. When doing the same bandwidth test from IC-HEP we got 392 Mb/s, clearly indicating that the 100 Mb trunk is on the QMUL side.
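For reference, a typical iperf client invocation for such a test (a sketch; the host and duration are assumptions, and an iperf server must already be running on the far end):

iperf -c se01.esc.qmul.ac.uk -t 30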


Then submitted 50x1G

 nohup ./filetransfer.py --ftp-options="-p 10" --number=50  --delete -s 
 https://fts0344.gridpp.rl.ac.uk:8443/sc3ral/glite-data-transfer-fts/services/FileTransfer   
 srm://se2-gla.scotgrid.ac.uk:8443/dpm/scotgrid.ac.uk/home/dteam/tfr2tier2/canned1G  
 srm://se01.esc.qmul.ac.uk:8443/dpm/esc.qmul.ac.uk/home/dteam/can1G

I had to cancel the transfer since nohup crashed.

Tried to submit a 500 GB transfer but nohup did not keep the filetransfer up and I could not get the outcome.


17/01/2006

Submitted a 10 GB transfer test: 10 files, two streams per file.

Transfer Bandwidth Report:
  10/10 transferred in 1245.54273701 seconds
  10000000000.0 bytes transferred.
Bandwidth: 64.2290285376Mb/s

Alex and Giuseppe have moved se01.esc.qmul.ac.uk to a Gb connection. I have rescheduled a transfer for 18h00.

Have submitted using the 0.3.0 filetransfer script:

filetransfer.py --ftp-options="-p 2" --number=500 --background  --delete -s https://fts0344.gridpp.rl.ac.uk:8443/sc3ral/glite-data-transfer-fts/services/FileTransfer srm://se2-gla.scotgrid.ac.uk:8443/dpm/scotgrid.ac.uk/home/dteam/tfr2tier2/canned1G srm://se01.esc.qmul.ac.uk:8443/dpm/esc.qmul.ac.uk/home/dteam/can1G
Child:  /opt/glite/bin/glite-transfer-status -l 6b7363f8-884b-11da-a18f-e44be7748cb0 -s https://fts0344.gridpp.rl.ac.uk:8443/sc3ral/glite-data-transfer-fts/services/FileTransfer
FTS status query for 6b7363f8-884b-11da-a18f-e44be7748cb0 failed:
FTS Error: status: getFileStatus: requestID <6b7363f8-884b-11da-a18f-e44be7748cb0> was not found

I could definitely see the transfer on the se01 node and the machine load rising, so the transfer was going on. I could not cancel the transfer; it was giving a SOAP error.

Tried to destroy the myproxy credential, which did not have a direct effect, but after 66 files had transferred it stopped.

Submit time:    2006-01-18 17:54:07.000
Files:          500
        Done:           66
        Active:         0
        Pending:        0
        Canceled:       0
        Failed:         0
        Finished:       0
        Submitted:      0
        Restarted:      0

The bandwidth can be seen here

File:Qmul-transfer1.gif

File:Qmul-fts1.gif

15/02/2006

I used the following command:

[mazza@grid05 mazza]$ filetransfer.py --background --ftp-options="-p 2" --number=500 --delete srm://se01.esc.qmul.ac.uk:8443/dpm/esc.qmul.ac.uk/home/dteam/canned2G srm://dcache.gridpp.rl.ac.uk:8443/pnfs/gridpp.rl.ac.uk/data/dteam/qmul

The bandwidth can be seen here

File:QMUL-RAL graph.gif


File:QMUL-RAL graph 2.gif


The bandwidth mean value is 172.8 Mbit/s

22/02/2006

I used the following command:

[mazza@grid05 mazza]$ filetransfer.py --background --ftp-options="-p 2" --number=500 --delete srm://dcache.gridpp.rl.ac.uk:8443/pnfs/gridpp.rl.ac.uk/data/dteam/tfr2tier2/canned2G srm://se01.esc.qmul.ac.uk:8443/dpm/esc.qmul.ac.uk/home/dteam/canned2G_from_RAL

The bandwidth can be seen here

File:060222 RAL-QMUL.gif

The bandwidth mean value is 118.0 Mbit/s

SC4 Throughput Test

The throughput tests are meant to stress test the RAL Tier1 production network by pulling data from different Tier2s to the Tier1.

More details can be found here


Monitoring links

GSTAT for QMUL-eScience

GridPP storage status