Training over a number of machines

A number of packages allow processes to be shared across several machines. For something as computationally expensive as acoustic training this can be very advantageous. As we cannot offer scripts that support all of these mechanisms, both free and proprietary, we include a number of small scripts that seem sufficient for making reasonable use of multiple machines. The scripts in scripts_pl/mc/ offer sufficient support for running jobs on other machines and managing the process, though they are not particularly elaborate.

Two parts of the acoustic modelling process benefit most from parallelism. The first is the Baum-Welch runs: the code supports splitting the training data, running the parts on different processors, and combining the results. In training semi-continuous models for Sphinx2 this can be used at three different stages. The second is building the trees.

This basic multi-machine support uses ssh to run a remote job on a machine. It requires that the current directory be accessible by NFS (or otherwise) on that remote machine, though potentially by a different path. To work unsupervised, it is also best to set up access to the remote machines that does not require entering passwords.

For the MC scripts to take effect you need to complete the setup steps described in more detail below.
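The split-and-combine idea behind the parallel Baum-Welch runs can be sketched as follows. This is only an illustration in Python, not the code used by the training scripts: the function names split_into_parts and combine_counts are hypothetical, and the "counts" are a toy stand-in for the statistics that each partial run would accumulate.

```python
from collections import Counter


def split_into_parts(items, nparts):
    """Split a list into nparts contiguous parts of nearly equal size.

    Illustrative only: the real training tools do their own splitting.
    """
    if nparts < 1:
        raise ValueError("nparts must be at least 1")
    base, extra = divmod(len(items), nparts)
    parts = []
    start = 0
    for i in range(nparts):
        # The first `extra` parts each take one additional item.
        size = base + (1 if i < extra else 0)
        parts.append(items[start:start + size])
        start += size
    return parts


def combine_counts(partials):
    """Sum per-part counts (toy accumulators) into a single total."""
    total = Counter()
    for partial in partials:
        total.update(partial)  # Counter.update adds counts together
    return dict(total)
```

Each part would be processed on a different processor, and the partial results summed afterwards, so the combined total does not depend on how the data was divided.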

MC config file

The config file should be in etc/mc_config. It consists of one line for each machine to be used in the cluster. Each line has three fields: the machine name, the absolute pathname on that machine to the training directory, and the number of processors on that machine. MC can run multiple jobs on machines with more than one processor. Note that there should be no blank lines, and there is no way to include comments in this file. A typical example might be
doc.speech.randomdomain.com     /usr2/swb 1
happy.speech.randomdomain.com   /net/doc/usr2/swb 1
sleepy.speech.randomdomain.com  /mnt/doc.usr2/swb 1
grumpy.speech.randomdomain.com  /net/doc/usr2/swb 1
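To make the three-field format concrete, here is a minimal Python sketch of a reader for such a file. It is not part of the MC scripts; the names Machine, parse_mc_config and job_slots are hypothetical, and the strict rejection of malformed lines is an assumption about how one might want to validate the file, not a description of what the real scripts do.

```python
from typing import List, NamedTuple


class Machine(NamedTuple):
    host: str   # machine name as given to ssh
    path: str   # absolute path to the training directory on that machine
    nproc: int  # number of processors on that machine


def parse_mc_config(text: str) -> List[Machine]:
    """Parse config text with one 'host path nproc' entry per line."""
    machines = []
    for lineno, line in enumerate(text.splitlines(), 1):
        fields = line.split()
        # Blank lines and comments are not allowed, so every line
        # must have exactly three whitespace-separated fields.
        if len(fields) != 3:
            raise ValueError(f"line {lineno}: expected 3 fields, got {len(fields)}")
        host, path, nproc = fields
        if not path.startswith("/"):
            raise ValueError(f"line {lineno}: path is not absolute: {path}")
        machines.append(Machine(host, path, int(nproc)))
    return machines


def job_slots(machines: List[Machine]) -> List[Machine]:
    """Repeat each machine once per processor, giving one entry per job slot."""
    return [m for m in machines for _ in range(m.nproc)]
```

Expanding the list into job slots is one simple way to model running several jobs on a multi-processor machine.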
If a machine has more than one processor, put the number of processors in the third field. Note that this list should normally include the main machine you will be running from. The pathname can be different for each machine, as may be the case with various NFS/amd conventions. The machine name should be sufficient for
ssh MACHINENAME 
to be resolved, which depends on your local setup. Once you have created this file, you should check that these machines are really accessible. This does not guarantee that the machines will remain accessible throughout the training run, but it is a good test.
scripts_pl/mc/mc_check
For any machine that is not reported as OK, you should find out why; the cause could be access problems or an incorrect directory. If the problem can't be resolved, you should delete the machine from the config list.
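When a machine fails the check, a manual test of the two likely causes (ssh access and the directory path) can help narrow things down. The Python sketch below is an assumption about how such a test could be done and does not reproduce what mc_check itself does; build_check_command and check_machine are hypothetical names. It relies on the standard ssh options BatchMode (fail instead of prompting for a password) and ConnectTimeout.

```python
import shlex
import subprocess


def build_check_command(host, path):
    """Build an ssh command that succeeds only if path is a directory on host."""
    return [
        "ssh",
        "-o", "BatchMode=yes",       # never prompt for a password
        "-o", "ConnectTimeout=10",   # give up quickly on unreachable hosts
        host,
        "test -d " + shlex.quote(path),
    ]


def check_machine(host, path, run=subprocess.run):
    """Return True if host is reachable and path exists there as a directory.

    `run` is injectable so the logic can be exercised without a network.
    """
    try:
        result = run(
            build_check_command(host, path),
            stdout=subprocess.DEVNULL,
            stderr=subprocess.DEVNULL,
            timeout=30,
        )
    except (OSError, subprocess.TimeoutExpired):
        return False
    return result.returncode == 0
```

A False result from check_machine with the host and path taken from a config line means either that passwordless ssh access is not set up for that host or that the directory is not visible there under the given path.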