Running jobs over multiple processors using SGE
What is SGE
Most common commands
More to come
1. What is SGE
SGE (Sun Grid Engine) is a Load Management System (LMS) that allocates resources such as processors (CPUs), memory, disk space, and computing time. Like other LMSs, Grid Engine enables transparent load sharing, controls the sharing of resources, and implements utilization and site policies.
Its features include batch queuing and load balancing, and it lets users suspend and resume jobs and check the status of their jobs.
Grid Engine can be used from the command line or through a Graphical User Interface (GUI) called qmon; both offer the same set of commands.
For our purposes, however, SGE is most useful for automatically finding spare CPUs on which to run jobs. This way, a user working at a workstation should not be affected by the jobs.
1.1 Getting Started
Grid Engine provides two ways to run your jobs: directly from the command line, or through the QMON GUI. It is up to the user to choose whichever is more convenient.
If the job is simple and consists of only a few commands, submission is more easily done from the command line. If the job requires many options and special requests, the GUI is very helpful, at least the first time you write your script, because it makes it easier to navigate through all the available options.
Sun Grid Engine has a large set of programs that let the user submit and delete jobs, check job status, and get information about available queues and environments. For the normal user, knowledge of the following basic commands should be sufficient to get started with Grid Engine and have full control of their jobs:
qconf: Shows (-s) the configuration and access permissions; normal users can only view them.
For example, qconf -sql lists all available queues and qconf -spl lists the available parallel environments.
qdel: Lets users delete their own jobs only.
qhost: Displays status information about Sun Grid Engine execution hosts.
qmod: Modifies the status of your jobs (for example, suspend/resume).
qmon: Provides the X-windows GUI command interface.
qstat: Provides a status listing of all jobs and queues associated with the cluster.
qsub: Is the user interface for submitting a job to Grid Engine.
All these commands come with many options and switches and are also available in the QMON GUI. They all have detailed man pages (e.g. "man qsub") and are documented in the Sun Grid Engine, Enterprise Edition 5.3 Administration and User's Guide. (about 2.5 MB)
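As a rough illustration of the read-only commands listed above, the following sketch wraps a few of them in a shell function. The function name sge_overview is made up for this example, and it assumes the SGE binaries are on your PATH; if they are not, it only prints a message.

```shell
#!/bin/sh
# sge_overview is a hypothetical helper name, not part of SGE.
sge_overview() {
    # Bail out politely on machines where SGE is not installed or not on PATH.
    if ! command -v qconf >/dev/null 2>&1; then
        echo "SGE commands not found on PATH"
        return 0
    fi
    echo "== Queues (qconf -sql) =="
    qconf -sql
    echo "== Parallel environments (qconf -spl) =="
    qconf -spl
    echo "== Execution hosts (qhost) =="
    qhost
    echo "== Jobs (qstat) =="
    qstat
}

sge_overview
```

None of these calls changes anything on the cluster, so the function is safe to run before submitting a first job.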
2.1 Using XPLOR with SGE
Although XPLOR has its own form of parallelization, a big problem is that you have to specify which machines it should use. We can instead get SGE to find the most suitable nodes. The trick is to request a parallel environment (mpi or lam) so that SGE chooses the most appropriate machines, and you only define how many. So instead of naming the machines, you just give a number.
Here is an example script to run an XPLOR job over 10 CPUs, or you can download this script. Note that the important directive for defining the number of CPUs is #$ -pe lam
To submit it, type: qsub myscript.sh. Be sure to change the XPLOR script names and to change to the right directories.
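A minimal sketch of what such a job script might look like is given here; it is not the original script. The commands below write myscript.sh into the current directory. The job name xplor_run and the commented-out XPLOR command line are assumptions, and how the allocated host list is handed to XPLOR depends on your installation, so that step is left as a placeholder.

```shell
#!/bin/sh
# Write a job script called myscript.sh into the current directory.
cat > myscript.sh <<'EOF'
#!/bin/sh
#$ -S /bin/sh
#$ -cwd
#$ -N xplor_run
#$ -pe lam 10

# SGE sets NSLOTS and PE_HOSTFILE for jobs that request a parallel environment.
echo "Running on $NSLOTS slots; allocated hosts:"
cat "$PE_HOSTFILE"

# Placeholder (assumption): replace with the XPLOR command line and the
# input/output file names used at your site.
# xplor < myjob.inp > myjob.out
EOF

echo "Wrote myscript.sh; submit it with: qsub myscript.sh"
```

The lines starting with #$ are read by qsub as options: -cwd runs the job in the submission directory, and -pe lam 10 asks the lam parallel environment for 10 slots.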
2.2 Using ARIA and SGE
This is a bit trickier. Again I use the lam environment to have SGE select the most available machines.
I then have a series of scripts that will put these machines into your “run.cns” script.
In addition you must set the environment and alias for ARIA in the SGE script:
To invoke this, type: qsub myscript.sh. Use qstat to monitor the job and qdel to remove it.
Here is an example script which you can also download
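The first step of that approach, getting the list of machines SGE has allocated, can be sketched as follows. This is an illustration under stated assumptions, not the original helper scripts: the function name hosts_from_pe_hostfile and the output file machines.txt are made up here, and writing the host names into run.cns is not shown because it depends on your ARIA setup.

```shell
#!/bin/sh
# hosts_from_pe_hostfile is a hypothetical helper name.
# Each line of the SGE host file reads: hostname slots queue processor-range.
# Print every host name once per slot granted on it.
hosts_from_pe_hostfile() {
    awk '{ for (i = 0; i < $2; i++) print $1 }' "$1"
}

# Inside a running SGE job PE_HOSTFILE is set; elsewhere this step is skipped.
if [ -n "${PE_HOSTFILE:-}" ] && [ -r "$PE_HOSTFILE" ]; then
    hosts_from_pe_hostfile "$PE_HOSTFILE" > machines.txt
    echo "Wrote $(wc -l < machines.txt) entries to machines.txt"
fi
```

Repeating a host once per slot means a dual-CPU node appears twice, which is convenient when each entry is meant to stand for one CPU.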
2.3 CNS and SGE
This is much simpler. As CNS cannot be parallelized, we simply use SGE as a scheduler. There is no need to specify the number of nodes; SGE will find the best machine to run on.
Again here is an example script which you can also download:
Submit with: qsub myscript.sh
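A serial job script of this kind might look roughly like the sketch below, which writes myscript.sh into the current directory. It is an assumption-based illustration, not the original script: the job name cns_run and the commented-out CNS command line are placeholders to adapt.

```shell
#!/bin/sh
# Write a serial job script called myscript.sh into the current directory.
cat > myscript.sh <<'EOF'
#!/bin/sh
#$ -S /bin/sh
#$ -cwd
#$ -N cns_run
#$ -j y

# No parallel environment is requested, so SGE schedules the job on one machine.
# Placeholder (assumption): adjust the executable and file names for your site.
# cns < myjob.inp > myjob.out
EOF

echo "Wrote myscript.sh; submit it with: qsub myscript.sh"
```

The -j y option merges the job's error output into its standard output file, which keeps everything from a run in one place.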
2.4 HADDOCK and SGE
Here I use the same strategy as for ARIA, this time running over 12 machines.
Notice that I use the $PWD variable (rather than specifying the whole pathname).
An example script is shown below which you can download
Submit with: qsub myscript.sh in the same directory as your run.cns script.
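The use of $PWD together with a 12-slot request could be sketched as below. This is only an outline under assumptions, not the original script: the job name haddock_run and the variable RUN_DIR are made up, and the actual HADDOCK start-up command is left as a placeholder because it depends on the installation.

```shell
#!/bin/sh
# Write a job script called myscript.sh into the current directory.
cat > myscript.sh <<'EOF'
#!/bin/sh
#$ -S /bin/sh
#$ -cwd
#$ -N haddock_run
#$ -pe lam 12

# With -cwd the job starts in the submission directory, so $PWD points at
# the directory holding run.cns without hard-coding the full path.
RUN_DIR=$PWD
if [ ! -f "$RUN_DIR/run.cns" ]; then
    echo "run.cns not found in $RUN_DIR" >&2
    exit 1
fi

# Placeholder (assumption): start HADDOCK here the way your installation expects.
EOF

echo "Wrote myscript.sh; submit it with: qsub myscript.sh"
```

Checking for run.cns first makes the job fail immediately with a clear message if it was submitted from the wrong directory.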