Training over a number of machines
There are a number of packages which allow processes to be shared over
several machines. In something as computationally expensive as
acoustic training this can be very advantageous. As we cannot offer
scripts that support all those different mechanisms, both free and
proprietary, we include a number of small scripts which seem
sufficient for gaining reasonable use of multiple machines for
processing.
The scripts in scripts_pl/mc/ offer sufficient support
for running jobs on other machines and managing the process, though
these scripts are not particularly elaborate.
There are two parts of the acoustic modelling process that benefit
most from parallelism. The first is the Baum-Welch runs: the code
supports splitting the training data, running the parts on different
processors, and combining the results. In training semi-continuous
models for Sphinx2 this can be used at three different stages. The
second is building the trees.
This basic multi-machine support uses ssh to run a remote job
on a machine. It requires that the current directory be accessible by
NFS (or otherwise) on that remote machine (though potentially by a
different path). Also, to work unsupervised, it is best if you have
set up access to the remote machines without requiring the entry of
passwords.
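One common way to set up password-less access (assuming OpenSSH is in
use and that installing a public key on the remote machines is
permitted at your site) is to generate a key pair with an empty
passphrase and copy the public key to each machine, for example
ssh-keygen -t rsa
ssh-copy-id happy.speech.randomdomain.com
The exact procedure depends on your local setup; the hostname above is
just one of the example machines used later in this section.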
For the MC scripts to take effect you need to do the following:
- Have working network access through ssh and NFS between machines
(the script mc_check will check this).
- Create an MC config file naming each available machine, the
path on that machine to the training directory, and the
number of processors the machine has.
- Enable MC by uncommenting the $MC=1; line in
etc/sphinx_train.cfg, and changing the number of parts
($CFG_NPART) to something greater than one, as illustrated below.
Each of these is discussed in more detail below.
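For example, after editing, the relevant lines in
etc/sphinx_train.cfg might look something like the following (the
value of $CFG_NPART here is purely illustrative; choose it to match
the processors available in your cluster):
$MC = 1;
$CFG_NPART = 4;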
MC config file
The config file should be in etc/mc_config. It consists of a
separate line for each machine that is to be used in the cluster.
Each line has three fields: the machine name, the absolute pathname on
that machine to the training directory, and the number of processors
on that machine. MC can run multiple jobs on machines with more than
one processor. Note there should be no blank lines, and there
is no method to include comments in this file.
A typical example might be
doc.speech.randomdomain.com /usr2/swb 1
happy.speech.randomdomain.com /net/doc/usr2/swb 1
sleepy.speech.randomdomain.com /mnt/doc.usr2/swb 1
grumpy.speech.randomdomain.com /net/doc/usr2/swb 1
If a machine has more than one processor, put the number of
processors in the third field. Note this list should normally include
the main machine you will be running from. The pathname for each
machine can be different, as may be the case due to various NFS/amd
conventions. The machine name should be sufficient for
ssh MACHINENAME
to resolve, which depends on your local setup.
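If the bare machine names do not resolve as written, one way to deal
with this (assuming OpenSSH) is to add Host entries to your
~/.ssh/config; the HostName and User values below are hypothetical
and only illustrate the idea:
Host happy.speech.randomdomain.com
    HostName 192.168.1.12
    User sphinxtrain
How name resolution is actually handled will depend on your DNS,
/etc/hosts and ssh configuration.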
Once you create this file you should check that these machines are
really accessible. This isn't foolproof in guaranteeing that
these machines will remain accessible over the training run,
but it is a good test.
scripts_pl/mc/mc_check
For any machine that does not give OK you should check to find out why,
which could be due to access problems or the directory not being
right. If the problem can't be resolved you should delete that machine
from the config list.
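You can also do the equivalent check by hand. Using one of the
example machines and paths from the config file above, something like
ssh happy.speech.randomdomain.com ls /net/doc/usr2/swb
should list the contents of the training directory on the remote
machine without prompting for a password; if it does not, fix the ssh
or NFS setup for that machine, or remove it from etc/mc_config.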