Mendes Research Group - Ometer GA-DFA

Ometer — Linear Multiple Discriminant Analysis with Genetic Algorithm Variable Selection (GA-DFA)

GA-DFA consists of an initial variable selection step followed by a classic linear discriminant analysis (see MDA). The variable selection step chooses only a subset of the variables for the discriminant analysis, so the remaining variables are not used in the classification. In many cases, leaving some variables out yields a better classification.

In omic data there are usually many more variables than samples or classes, so it is useful to apply a variable selection step before the classification. The genetic algorithm (GA) attempts to optimize the combination of variables used. Because the GA is a stochastic algorithm, it will often produce different results in different runs. A common strategy is to run it repeatedly and then identify the variables that are picked most often for the discrimination (these are likely to be the most important for separating the classes).
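
The repeated-run strategy can be illustrated with a short Python sketch. It is not part of Ometer: the function name selection_frequency and the variable names are made up for illustration, and the lists of selected variables are assumed to have been collected by hand from the report of each run.

```python
from collections import Counter

def selection_frequency(runs):
    """Return (variable, fraction of runs that selected it), most frequent first."""
    if not runs:
        return []
    counts = Counter()
    for selected in runs:
        counts.update(set(selected))  # count each variable at most once per run
    n_runs = len(runs)
    # Sort by descending count, then by name so that ties are ordered reproducibly.
    ordered = sorted(counts.items(), key=lambda item: (-item[1], item[0]))
    return [(var, count / n_runs) for var, count in ordered]

# Hypothetical selections from four independent GA runs (made-up variable names).
runs = [
    ["m101", "m205", "m317"],
    ["m101", "m205", "m440"],
    ["m101", "m318", "m440"],
    ["m101", "m205", "m512"],
]

for var, freq in selection_frequency(runs):
    print(f"{var}\t{freq:.2f}")
```

Variables that appear in most runs are the natural candidates for closer inspection; those that appear only once may have been picked by chance.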

Ometer contains two different algorithms to carry out the variable selection. The first one (ga-dfa) requires a predetermined (fixed) number of variables to be used in the classification. The second one (ga-dfa2) attempts to choose an optimal number of variables for the classification. The first algorithm is similar to the one described by Jarvis and Goodacre.

Algorithm 1 (ga-dfa)

The data file must contain the classification of the samples. To see how to add class labels, see Ometer file formats. For the GA it is necessary to choose the number of variables to use in the discrimination (the number of genes, in GA speak), the population size (number of parallel searches), and the number of generations (iterations). If these are not specified explicitly, ometer will use default values.
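
To make the roles of the three settings concrete, here is a minimal Python sketch of a GA that selects a fixed number of variables. It is not Ometer's implementation. The fitness used here is the resubstitution accuracy of a nearest-centroid classifier, a deliberately simple stand-in for the discriminant analysis, and the tournament selection, crossover, mutation rate and toy data are all assumptions made for illustration.

```python
import random

def centroid_accuracy(rows, labels, variables):
    """Fraction of samples closest to their own class centroid on the chosen variables."""
    classes = sorted(set(labels))
    centroids = {}
    for c in classes:
        members = [r for r, lab in zip(rows, labels) if lab == c]
        centroids[c] = [sum(r[v] for r in members) / len(members) for v in variables]
    hits = 0
    for r, lab in zip(rows, labels):
        point = [r[v] for v in variables]
        nearest = min(classes, key=lambda c: sum((p - q) ** 2
                                                 for p, q in zip(point, centroids[c])))
        hits += nearest == lab
    return hits / len(rows)

def ga_select(rows, labels, n_genes, population=10, generations=100,
              mutation_rate=0.2, seed=None):
    """Search for n_genes variables that maximise the stand-in fitness."""
    rng = random.Random(seed)
    n_vars = len(rows[0])
    if n_genes > n_vars:
        raise ValueError("n_genes cannot exceed the number of variables")

    def fitness(individual):
        return centroid_accuracy(rows, labels, individual)

    # Each individual is a sorted tuple of distinct variable indices.
    pop = [tuple(sorted(rng.sample(range(n_vars), n_genes))) for _ in range(population)]
    for _ in range(generations):
        next_pop = [max(pop, key=fitness)]  # elitism: the best individual survives
        while len(next_pop) < population:
            # Tournament selection: the better of two random individuals becomes a parent.
            a = max(rng.sample(pop, 2), key=fitness)
            b = max(rng.sample(pop, 2), key=fitness)
            # Crossover: draw the child's genes from the union of both parents.
            child = rng.sample(sorted(set(a) | set(b)), n_genes)
            # Mutation: occasionally swap one gene for a variable not yet in the child.
            if rng.random() < mutation_rate:
                unused = [v for v in range(n_vars) if v not in child]
                if unused:
                    child[rng.randrange(n_genes)] = rng.choice(unused)
            next_pop.append(tuple(sorted(child)))
        pop = next_pop
    best = max(pop, key=fitness)
    return best, fitness(best)

# Toy data (made up): 12 samples, 8 variables; only variables 2 and 5 separate the classes.
data_rng = random.Random(0)
labels = ["A"] * 6 + ["B"] * 6
rows = []
for lab in labels:
    row = [data_rng.gauss(0, 1) for _ in range(8)]
    shift = 5.0 if lab == "B" else 0.0
    row[2] += shift
    row[5] += shift
    rows.append(row)

best, score = ga_select(rows, labels, n_genes=2, population=10, generations=30, seed=1)
print("selected variables:", best, "accuracy:", score)
```

In this sketch a larger population explores more variable combinations per generation, while more generations give the search more time to refine them. Both increase the run time.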

Examples of usage:

ometer -m=ga-dfa --ga-genes=9 --ga-generations=1000 --ga-population=20 -o result input.dat
will analyze the data in file input.dat, selecting 9 of its variables for the discrimination. The GA will run for 1000 generations (if the number of generations is not specified, the default is 100) with a population size of 20 (if not specified, the default is 10). The results are written to the following files:

result.html, an HTML report file with a summary of the classification success, including the variables selected and the confusion matrix
result-gadfa-proj.dat, a tab-delimited text file with the data projected on the discriminant axes (this file could be used as input for further analyses).
result-gadfa-loadings.dat, a tab-delimited text file with the loadings of each of the discriminant axes (DF). Note that all variables that were not selected have loadings of zero.
result-gadfa-scree.dat, a tab-delimited text file with the proportion of the between-groups variance explained by each discriminant axis.
result-gadfa-scree.plt, a gnuplot input file for plotting the scree plot.
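
Because unselected variables have loadings of zero, the selected variables can be recovered from the loadings file by keeping the rows that have at least one non-zero loading. The Python sketch below assumes a layout that this page does not specify: one header row, then one row per variable, with the variable name in the first column and one loading per discriminant axis in the remaining columns. The function name and the example data are hypothetical; check the actual layout of result-gadfa-loadings.dat before relying on it.

```python
import csv
import io

def selected_variables(handle, tol=0.0):
    """Return the names of variables with at least one loading larger than tol in magnitude."""
    reader = csv.reader(handle, delimiter="\t")
    next(reader, None)  # skip the header row (assumed to be present)
    selected = []
    for row in reader:
        if not row:
            continue  # ignore blank lines
        name = row[0]
        loadings = [float(x) for x in row[1:]]
        if any(abs(x) > tol for x in loadings):
            selected.append(name)
    return selected

# Made-up example in the assumed layout; with a real file, pass
# open("result-gadfa-loadings.dat", newline="") instead.
example = io.StringIO(
    "variable\tDF1\tDF2\n"
    "m101\t0.82\t-0.10\n"
    "m205\t0\t0\n"
    "m317\t-0.41\t0.93\n"
    "m440\t0\t0\n"
)
print(selected_variables(example))
```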

ometer -m=ga-dfa --ga-genes=9 --ga-generations=1000 --ga-population=20 --proj-2dplots -o result input.dat
will carry out the same analysis as in the previous example; in addition, it will create 2D plots of the data projected onto the discriminant axes (DF):

result-gadfa-2dplots.plt, a gnuplot input file that will display all of the 2D projection plots in a sequence
result-gadfa-DF1_DF2.plt, a gnuplot file to display the projection onto the first two discriminant axes. Similar files are created for DF3, DF4, etc., if those axes exist.

ometer -m=ga-dfa --ga-genes=9 --ga-generations=1000 --ga-population=20 --proj-2dplots --plot-png -o result input.dat
will carry out the same analysis as the previous two examples, but the gnuplot files are set up to produce PNG bitmaps (useful for displaying on web pages, importing into word processors, etc.)

Algorithm 2 (ga-dfa2)

The data file must contain the classification of the samples. To see how to add class labels, see Ometer file formats. It is not necessary to choose the final number of variables to use in the discrimination (the number of genes, in GA speak).

Examples of usage:

ometer -m=ga-dfa2 --ga-genes=9 --ga-generations=1000 --ga-population=20 -o result input.dat
will analyze the data in file input.dat. The GA will start with 9 variables for the discrimination, but this number will increase or decrease during its execution. The GA will run for 1000 generations (if the number of generations is not specified, the default is 100) with a population of 20 individuals (the default is 10). The results are written to the following files:

result.html, an HTML report file with a summary of the classification success, including the variables selected and the confusion matrix
result-gadfa2-proj.dat, a tab-delimited text file with the data projected on the discriminant axes (this file could be used as input for further analyses).
result-gadfa2-loadings.dat, a tab-delimited text file with the loadings of each of the discriminant axes (DF). Note that all variables that were not selected have loadings of zero.
result-gadfa2-scree.dat, a tab-delimited text file with the proportion of the between-groups variance explained by each discriminant axis.
result-gadfa2-scree.plt, a gnuplot input file for plotting the scree plot.
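
This page does not describe how ga-dfa2 changes the number of variables during a run. One common way to let a GA vary the size of its individuals is a mutation that occasionally adds or removes a single variable. The Python sketch below shows that idea only; the function name mutate_size and the probabilities are assumptions and are not taken from Ometer.

```python
import random

def mutate_size(individual, n_vars, rng, p_add=0.1, p_remove=0.1, min_genes=1):
    """Return a copy of individual with at most one variable added or removed."""
    genes = set(individual)
    roll = rng.random()
    if roll < p_add and len(genes) < n_vars:
        # Grow: add one variable that is not yet in the individual.
        genes.add(rng.choice([v for v in range(n_vars) if v not in genes]))
    elif p_add <= roll < p_add + p_remove and len(genes) > min_genes:
        # Shrink: drop one of the current variables.
        genes.remove(rng.choice(sorted(genes)))
    return tuple(sorted(genes))

# Start from 9 variables out of 50 (as with --ga-genes=9) and watch the size drift.
rng = random.Random(42)
individual = tuple(range(9))
sizes = []
for _ in range(20):
    individual = mutate_size(individual, n_vars=50, rng=rng)
    sizes.append(len(individual))
print(sizes)
```

In a full GA the fitness function then decides whether the larger or smaller variable sets survive, so the number of variables settles wherever classification works best.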

ometer -m=ga-dfa2 --ga-genes=9 --ga-generations=1000 --ga-population=20 --proj-2dplots -o result input.dat
will carry out the same analysis as in the previous example; in addition, it will create 2D plots of the data projected onto the discriminant axes (DF):

result-gadfa2-2dplots.plt, a gnuplot input file that will display all of the 2D projection plots in a sequence
result-gadfa2-DF1_DF2.plt, a gnuplot file to display the projection onto the first two discriminant axes. Similar files are created for DF3, DF4, etc., if those axes exist.

ometer -m=ga-dfa2 --ga-genes=9 --ga-generations=1000 --ga-population=20 --proj-2dplots --plot-png -o result input.dat
will carry out the same analysis as the previous two examples, but the gnuplot files are set up to produce PNG bitmaps (useful for displaying on web pages, importing into word processors, etc.)

All tab-delimited text files can be easily loaded into spreadsheets. The HTML report file should be viewed with a web browser.
