Difference between revisions of "Imperial glideinwms"

From GridPP Wiki
Latest revision as of 11:15, 25 November 2015

Setting up a glideinwms (preliminaries)

We are going for the all-in-one solution here, only "lightly tested" according to the developers.
To set up a glidein WMS to work with an ARCCE as the Submit host, note the difference in setup in step 4 (Submit).

Documentation:
The glideinwms webpages.
Andrew Lahiff's glideinwms setup page.


Setup
For cloud work, we need v3_2 or higher.

Current versions
Our current (June 2014) versions are: condor-8.0.7-x86_64_RedHat6-unstripped, glideinWMS_v3_2_5 and javascriptrrd-1.1.1-with-flot-0.7-tooltip-0.4.4
Latest version of assorted config files: glideinWMS.ini-raincloud, glideinWMS.xml (needs uploading), frontend.xml (needs uploading), anything else ?

Preparations

The node needs a hostcert, plus an additional (host)cert for the frontend.

There are three distinct pieces of software:
a) condor (condor-8.0.2-x86_64_RedHat6-unstripped which I got from htcondor, leave tarball in /opt/tarballs)

b) javascript (javascriptrrd-0.6.4 from javascriptrrd, unpack the tarball in /opt)

c) glideinwms (from glideinWMS, unpack the tarball in install dir (here: /opt/raincloud) ):
tar -zxvf /opt/tarballs/glideinWMS_v3_2.tgz; chown -R root:root glideinwms

Install some missing packages:
wget http://repository.egi.eu/sw/production/cas/1/current/repo-files/EGI-trustanchors.repo -O /etc/yum.repos.d/EGI-trustanchors.repo
wget http://www.mirrorservice.org/sites/dl.fedoraproject.org/pub/epel/6/x86_64/fetch-crl-3.0.11-1.el6.noarch.rpm
rpm -iv fetch-crl-3.0.11-1.el6.noarch.rpm
chkconfig fetch-crl-cron on

yum install ca-policy-egi-core
yum install m2crypto
yum install rrdtool-python
yum install httpd

groupadd raincloud
useradd -m -g raincloud raincloud
Make a copy of the hostcert and key that is owned by this user.

Validate the ini file for all components
./manage-glideins --validate [component, e.g. wmscollector, in all lower case letters] --ini [inifile]
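
Validating each component by hand gets tedious, so the loop below is a minimal sketch that runs the validation for all five components named on this page. The paths are the ones used here; the DRY_RUN switch and the function name validate_all are assumptions made for illustration, not part of glideinWMS.

```shell
#!/bin/bash
# Sketch: validate the ini file for every component in turn.
# MANAGE and INI default to the paths used on this page.
MANAGE=${MANAGE:-/opt/raincloud/glideinwms/install/manage-glideins}
INI=${INI:-/opt/config/glideinWMS.ini-raincloud}

validate_all() {
    local component
    for component in wmscollector factory usercollector submit vofrontend; do
        if [ "${DRY_RUN:-0}" = "1" ]; then
            # Hypothetical dry-run mode: only print the command that would be run.
            echo "$MANAGE --validate $component --ini $INI"
        else
            # Stop at the first component that fails validation.
            "$MANAGE" --validate "$component" --ini "$INI" || return 1
        fi
    done
}
```

Running `DRY_RUN=1 validate_all` first shows the commands without executing them. If the Submit module lives on a separate CE, "submit" would presumably be dropped from the list on the glideinWMS node.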

Configuration

Open some ports
cetest02 (the submit host) needs to be able to accept connections from gwms00 (the glideinwms) and the cloud nodes themselves. Don't start without this.
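
A quick way to check this from gwms00 is a plain TCP connection test. The sketch below assumes bash (for /dev/tcp) and the coreutils timeout command; the function name check_port is made up for illustration, and port 9618 is an assumption based on the port opened in the selinux section, so other ports may matter as well.

```shell
#!/bin/bash
# check_port HOST PORT: print "open" or "closed"; the exit status matches.
check_port() {
    local host=$1 port=$2
    # /dev/tcp is a bash feature; timeout guards against filtered ports that hang.
    if timeout 5 bash -c "exec 3<>/dev/tcp/$host/$port" 2>/dev/null; then
        echo "open"
    else
        echo "closed"
        return 1
    fi
}
```

For example, `check_port cetest02 9618` run on gwms00 (short hostname assumed to resolve). This only tests one direction; the cloud nodes need the same check from their side.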

The ini file

Note: You cannot have too many condors. Don't make your life difficult by skimping on condor instances !
The working configuration file: glideinWMS.ini-raincloud
(Second try) glideinWMS.ini-raincloud
(First try) glideinWMS.ini-raincloud

Note: I don't really need privilege separation, but support for the 'no' option might be dropped soon.

Create some subdirectories.
Some bits of the glideinwms are very touchy about directory ownership. Right now I have:
[raincloud@gwms00 raincloud]$ pwd
/opt/raincloud
[raincloud@gwms00 raincloud]$ ls -l
drwxr-xr-x. 4 root root 4096 Oct 23 10:52 factory_client_files
drwxr-xr-x. 12 root root 4096 Oct 22 13:33 glideinwms
drwxr-xr-x. 4 raincloud root 4096 Oct 23 11:14 gwms
glideinwms contains the unpacked glideinwms tarball and nothing else.


Configuration
The raw log of the first run through can be found here.

  1. WMSCollector
    This has to be run as root to enable privilege separation. Answer 'y' to any questions the setup script throws at you.
    If this is a reconfiguration you need to remove the following directories first:
    /opt/raincloud/factory_client_files/clientlog/user_raincloud
    and
    /opt/raincloud/factory_client_files/clientproxies/user_raincloud

    [root@gwms00 ~]# /opt/raincloud/glideinwms/install/manage-glideins --install wmscollector --ini /opt/config/glideinWMS.ini-raincloud
    At the end you should see:
    You will need to have the WMSCollector service running if you intend
    to install the other glideinWMS components.
    ... would you like to start it now? (y/n): y
    ... running: /opt/raincloud/glideinwms/install/manage-glideins --start wmscollector --ini /opt/config/glideinWMS.ini-raincloud
    ... requested action completed
    Note that the WMSCollector also writes a file to /etc/condor (whose idea was that ?).

    You may want more than 20 VMs running at once...
    echo "GRIDMANAGER_MAX_SUBMITTED_JOBS_PER_RESOURCE_EC2 = 1000" >> /opt/raincloud/gwms/condor-wms/config.d/03_gwms_local.config
    source /opt/raincloud/gwms/condor-wms/condor.sh
    condor_reconfig

  2. Factory
    This needs to be done as the factory unix account (raincloud).
    Also the following directory needs to exist and be owned by raincloud:
    [root@gwms00 opt]# mkdir /var/www/html/cfactory
    [root@gwms00 opt]# chown raincloud:raincloud /var/www/html/cfactory

    [raincloud@gwms00 ~]# /opt/raincloud/glideinwms/install/manage-glideins --install factory --ini /opt/config/glideinWMS.ini-raincloud
    [...]
    Collecting configuration file data. It will be question/answer time.
    Using /opt/raincloud/condor-wms/etc/condor_config
    Do you want to fetch entries from RESS? (y/n): n
    Do you want to add manual entries? (y/n): y
    Please list all additional glidein entry points,
    Entry name (leave empty when finished): Imperial_GridPP_1
    Gatekeeper for 'Imperial_GridPP_1': http://gridppcl02.grid.hep.ph.ic.ac.uk:8773/services/Cloud
    RSL for 'Imperial_GridPP_1':
    Work dir for 'Imperial_GridPP_1': [.]
    Site name for 'Imperial_GridPP_1': [Imperial_GridPP_1]
    ...
    ======== Factory install complete ==========
    Do you want to create the glideins now? (y/n) [n]: n
    At this point edit /opt/raincloud/gwms/factory/glidein_c7.cfg/glideinWMS.xml to insert the cloud configuration (remove the auto-generated part belonging to Imperial_GridPP_1.)
    And change the CCB attributes to this:
     <attr name="USE_CCB" value="True" const="True" type="string" glidein_publish="True" publish="True" job_publish="False" parameter="True"/>
    

    Also add/replace the following tags if you want to enable gLExec:

     <attr name="GLEXEC_JOB" const="True" glidein_publish="False" job_publish="False" parameter="True" publish="True" type="string" value="True"/>
     <attr name="GLEXEC_BIN" const="True" glidein_publish="False" job_publish="False" parameter="True" publish="False" type="string" value="/usr/sbin/glexec"/>
    

    If you want to use 1024-bit proxies rather than the HTCondor default of 512, now is also a good time to change it. You have to include a script that will change the setting on the WN as the glidein starts up. To do this, create a new file at /opt/raincloud/gwms/factory/glidein_c7.cfg/proxy_length.sh. Once you've done this you can add the following line into the outermost <file> section in the XML config file:

      <file absfname="/opt/raincloud/gwms/factory/glidein_c7.cfg/proxy_length.sh" executable="True" comment="fix proxy length"/>
    

    Before starting the factory I need to make two directories and change their owner to raincloud:root. As far as I can tell this is the minimum invasive procedure... (Note that the validation/install will work without these directories, but you won't be able to create the glideins.)
    (as root do)
    mkdir /opt/raincloud/factory_client_files/clientlog/user_raincloud
    chown raincloud:root /opt/raincloud/factory_client_files/clientlog/user_raincloud
    mkdir /opt/raincloud/factory_client_files/clientproxies/user_raincloud
    chown raincloud:root /opt/raincloud/factory_client_files/clientproxies/user_raincloud

    Then create the glideins (as raincloud) and start the factory:
    . /opt/raincloud/gwms/factory/factory.sh
    /opt/raincloud/glideinwms/creation/create_glidein /opt/raincloud/gwms/factory/glidein_c7.cfg/glideinWMS.xml

    /opt/raincloud/glideinwms/install/manage-glideins --start factory --ini /opt/config/glideinWMS.ini-raincloud


  3. Usercollector
    (as raincloud - on installing as root see Usercollector_as_root)
    [raincloud@gwms00 ~]# /opt/raincloud/glideinwms/install/manage-glideins --install usercollector --ini /opt/config/glideinWMS.ini-raincloud
    Answer 'y' to any questions.

  4. Submit
    Note: When running in combination with a CE, the Submit module is installed on the ARCCE, not the glidein WMS and therefore this bit can be ignored when installing the glideinWMS.
    [root@gwms00 ~]# /opt/raincloud/glideinwms/install/manage-glideins --install submit --ini /opt/config/glideinWMS.ini-raincloud
    Note: previously we tried to share a condor with the user collector. This required a) editing the condor_mapfile by hand and b) restoring 11_gwms_secondary_collectors.config from backup, as the user collector config gets overwritten by the submit install. Not recommended.

    You may also want to stop this node sending out e-mail on every successful job, this is simple:
      echo "MAIL = /bin/true" > /opt/condor-submit/config.d/99_gwms_nomail.conf
      condor_reconfig
    

    If you want to copy an X509 proxy to the WN (which you probably do), you should add the following to /opt/condor-submit/config.d/03_gwms_local.config (or any other file in this directory) and re-run condor_reconfig (thanks to Andrew L. for the simple recipe to do this!):

      use_x509userproxy = True
      SUBMIT_EXPRS = $(SUBMIT_EXPRS) use_x509userproxy
    

  5. VOFrontend
    The VOFrontend can only be installed if the Submit module is installed and up and running.
    If running this with an ARCCE as the Submit host, make sure the ports are open (see note below) before attempting this.
    The frontend has its own hostcert and key with a different DN to the glideinWMS:
    [raincloud@gwms00 ~]$ pwd
    /home/raincloud
    [raincloud@gwms00 ~]$ ls -l
    -rw-r--r--. 1 raincloud raincloud 1814 Jun 11 17:18 frontend-cert.pem
    -rw-------. 1 raincloud raincloud 1679 Jun 11 17:18 frontend-key.pem
    -rw-------. 1 raincloud raincloud 3873 Jun 11 17:21 raincloud.proxy
    

    The proxy is made using voms-proxy-init -valid 72:00 -cert ~/frontend-cert.pem -key ~/frontend-key.pem -out ~/raincloud.proxy with an SL6 UI.

    It also needs the 'magic files' (technical term) to authenticate against the cloud controller in the home dir (AccessKeyID and SecretAccessKey - for obvious reasons not linked from this wiki :-D )
    (as root)
    mkdir /var/www/html/cfrontend
    chown raincloud:raincloud /var/www/html/cfrontend

    (as raincloud)
    /opt/raincloud/glideinwms/install/manage-glideins --install vofrontend --ini /opt/config/glideinWMS.ini-raincloud
    Answer 'yes'/hit Enter to any questions until you reach:
    Do you want to create the frontend now? (y/n) [n]: n


    Then you should edit frontend.xml with the following: Frontend_xml_imperial_cloud
    If you want gLExec enabled, replace the GLIDEIN_Glexec_Use attr with the following:

     <attr name="GLIDEIN_Glexec_Use" glidein_publish="True" job_publish="True" parameter="False" type="string" value="OPTIONAL"/>
    

    And after that create the frontend:
    . /opt/raincloud/gwms/frontend/frontend.sh
    /opt/raincloud/glideinwms/creation/create_frontend /opt/raincloud/gwms/frontend/instance_c7.cfg/frontend.xml
    and then start it:
    /opt/raincloud/glideinwms/install/manage-glideins --start vofrontend --ini /opt/config/glideinWMS.ini-raincloud


selinux

semanage port -a -t http_port_t -p tcp 8319
edit /etc/httpd/conf/httpd.conf to change "Listen 80" to "Listen 8319"
chkconfig httpd on; service httpd start
Open ports 8319 & 9618 in iptables.
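
The httpd.conf edit can also be scripted. This is a minimal sketch, assuming the file contains a line that reads exactly "Listen 80" (optionally followed by whitespace); the function name set_httpd_port is made up for illustration.

```shell
#!/bin/bash
# set_httpd_port CONF PORT: rewrite the "Listen 80" line in an Apache config.
set_httpd_port() {
    local conf=$1 port=$2
    # A .bak copy of the original is kept next to the file.
    # Only a line consisting of "Listen 80" is changed; anything else is left alone.
    sed -i.bak "s/^Listen 80[[:space:]]*\$/Listen $port/" "$conf"
}
```

Usage would be `set_httpd_port /etc/httpd/conf/httpd.conf 8319` as root, followed by the chkconfig and service commands above.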

Proxies

We run an hourly cron job (as raincloud) that renews the frontend proxy.
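
The cron job itself is not shown on this page. The following is only a sketch of what such a renewal helper could look like, reusing the voms-proxy-init options given in the VOFrontend section and assuming voms-proxy-init is available on the node. The function name, the VOMS_PROXY_INIT override and the script name in the crontab comment are all hypothetical.

```shell
#!/bin/bash
# Hypothetical renewal helper, e.g. called hourly from raincloud's crontab:
#   0 * * * * /home/raincloud/renew_proxy.sh   (script name is an assumption)
renew_proxy() {
    local dir=${1:-$HOME}
    local tmp="$dir/raincloud.proxy.new"
    # Write to a temporary file first so a failed renewal keeps the old proxy.
    # VOMS_PROXY_INIT is a hypothetical override for the command name.
    if ${VOMS_PROXY_INIT:-voms-proxy-init} -valid 72:00 \
            -cert "$dir/frontend-cert.pem" -key "$dir/frontend-key.pem" \
            -out "$tmp"; then
        mv "$tmp" "$dir/raincloud.proxy"
    else
        rm -f "$tmp"
        return 1
    fi
}
```

Writing to a temporary file and renaming it means the frontend never sees a half-written or missing proxy if the renewal fails partway.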

Checks, Starting and Stopping

Have a look: glidein_c7

Submitting a job
Use an innocent user (e.g. 'cloud').
su - cloud
source /opt/raincloud/gwms/condor-submit/condor.sh
condor_submit test2.jdl
condor_q
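
The contents of test2.jdl are not reproduced on this page. As a stand-in, the sketch below writes a minimal HTCondor submit description that just runs /bin/hostname; the function name write_test_jdl and the output file names are assumptions, and the real test2.jdl may well set more attributes.

```shell
#!/bin/bash
# write_test_jdl FILE: write a minimal submit description (a guess at test2.jdl).
write_test_jdl() {
    local file=${1:-test2.jdl}
    # Quoted heredoc so $(Cluster) and $(Process) are left for condor to expand.
    cat > "$file" <<'EOF'
universe   = vanilla
executable = /bin/hostname
output     = test2.$(Cluster).$(Process).out
error      = test2.$(Cluster).$(Process).err
log        = test2.log
queue 1
EOF
}
```

After `write_test_jdl test2.jdl`, the file can be submitted with condor_submit as shown above; the .out file should then contain the hostname of whichever node ran the job.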

For extra debugging config see here.
List of relevant log files:

  • /opt/raincloud/gwms/frontend/log/frontend_frontend_service-c7/group_main
    main.info.log: ERROR: Runtime Error. Failed to talk to schedd: -> check if submit module on cetest02 is running
  • /opt/raincloud/gwms/condor-user/condor_local/log/SchedLog

Reconfiguring Things

Now you've got it all working, no doubt you want to change things like the image AMI number without going through the trauma of re-installing everything. You can do this with the sections below.

Reconfiguring the factory

As raincloud:

  • /opt/raincloud/glideinwms/install/manage-glideins --stop factory --ini /opt/config/glideinWMS.ini-raincloud
  • cd /opt/raincloud/gwms/factory
  • source factory.sh
  • /opt/raincloud/glideinwms/creation/reconfig_glidein -xml glidein_c7.cfg/glideinWMS.xml
  • (Ignore the error about the monitor directory not existing)
  • /opt/raincloud/glideinwms/install/manage-glideins --start factory --ini /opt/config/glideinWMS.ini-raincloud

Reconfiguring the frontend (e.g. to update the image used)

As raincloud:

  • /opt/raincloud/glideinwms/install/manage-glideins --stop vofrontend --ini /opt/config/glideinWMS.ini-raincloud
  • cd /opt/raincloud/gwms/frontend
  • update instance_c7.cfg/frontend.xml
  • source frontend.sh
  • /opt/raincloud/glideinwms/creation/reconfig_frontend instance_c7.cfg/frontend.xml
  • /opt/raincloud/glideinwms/install/manage-glideins --start vofrontend --ini /opt/config/glideinWMS.ini-raincloud

Day-to-day maintenance

On cetest02:

source /opt/condor-submit/condor.sh
condor_q
/opt/glideinwms/install/manage-glideins --status submit --ini /opt/glideinwms-conf/condor.ini

(also --start and --stop)

On gwms00 (as root):

source /opt/raincloud/gwms/condor-wms/condor.sh
condor_q

To look at the reason why a job is held:

condor_q -global -long [jobid] 
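
The -long output is lengthy, and usually only the hold-related ClassAd attributes are of interest. A minimal filter, assuming the standard HoldReason, HoldReasonCode and HoldReasonSubCode attributes (the function name hold_reasons is made up for illustration):

```shell
#!/bin/bash
# hold_reasons: read "condor_q -long" output on stdin, keep only the hold attributes.
hold_reasons() {
    grep -E '^(HoldReason|HoldReasonCode|HoldReasonSubCode) = '
}
```

Usage: `condor_q -global -long [jobid] | hold_reasons`.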

nukem (forcibly remove all jobs from the glidein schedd):
condor_rm -name schedd_glideins2@gwms00.grid.hep.ph.ic.ac.uk -all -forcex
(as raincloud -- same output as cetest02):
source /opt/raincloud/gwms/condor-frontend/condor.sh
condor_q -global

The cloud interface can be found here: gridppcl03 (https://gridppcl03.grid.hep.ph.ic.ac.uk/dashboard).
The glidein user is ic_glidein.

To download the ec2 bundle: Access & Security -> API access -> Download EC2 credentials (unzip ichep-x509.zip).


Return to overview page.