Ometer — Principal component analysis

Principal component analysis (PCA) is a technique that reveals the directions of largest variance in a multivariate data set. PCA results in a rotation of the original data to a new set of axes. The new axes, or principal components, are orthogonal and are ordered in terms of the amount of variance in the data set associated with that direction. The first PC corresponds to the direction with the largest variance in the data set.

PCA is often used to reduce the dimensionality of a data set. Since the principal components are ordered by the variance they explain, one can project the data set onto a small number of components and still retain most of the variation in the data.

Ometer carries out PCA on the samples of a data set (using singular value decomposition) and can project the data onto an arbitrary number of components (unlike many other programs, Ometer is not limited to 2D or 3D projections). The result of the projection is a data file in the same format as the input, so the projected data can be used in subsequent analyses.
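
As a rough illustration of what this involves (not Ometer's actual implementation), the following Python sketch performs PCA with NumPy's singular value decomposition and projects the samples onto the first few components. The function name pca_svd, the one-sample-per-row layout, and the centering of each variable are assumptions made for the example.

```python
import numpy as np

def pca_svd(data, n_components):
    """PCA of the samples (rows) of `data` via singular value decomposition.

    Assumed layout: one sample per row, one variable per column.
    Returns (scores, components, explained_variance_ratio).
    """
    X = np.asarray(data, dtype=float)
    X = X - X.mean(axis=0)                     # center each variable (assumption)
    U, s, Vt = np.linalg.svd(X, full_matrices=False)
    var_ratio = s**2 / np.sum(s**2)            # fraction of variance per component
    scores = U[:, :n_components] * s[:n_components]   # samples in the new axes
    return scores, Vt[:n_components], var_ratio

# Hypothetical data: 4 samples, 3 variables.
data = np.array([[2.0, 0.0, 1.0],
                 [4.0, 1.0, 3.0],
                 [6.0, 0.5, 2.0],
                 [8.0, 1.5, 5.0]])
scores, components, var_ratio = pca_svd(data, n_components=2)
```

Here scores holds the projected samples (one row per sample, one column per retained component), components holds the principal directions, and var_ratio lists the fraction of variance associated with each component in decreasing order.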

The loadings of a principal component measure the contribution of each variable (molecule) to that direction. To find the contribution of each PC to each molecule, one must carry out a PCA of the variables rather than of the samples (the projection will then contain these contributions). To do this with Ometer, transpose the original data file and then carry out PCA on the resulting file.
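
The transposition step can be done in a spreadsheet or with a few lines of code. The sketch below is one possible way to do it in Python, assuming the data file is a rectangular tab-delimited table. The function name transpose_tsv and the example labels (molA, s1, and so on) are hypothetical, and the exact header layout Ometer expects is not specified here.

```python
import csv
import io

def transpose_tsv(text):
    """Swap the rows and columns of a rectangular tab-delimited table."""
    rows = list(csv.reader(io.StringIO(text), delimiter="\t"))
    if len({len(row) for row in rows}) > 1:
        # Rows of unequal length usually mean missing fields.
        raise ValueError("table is not rectangular")
    out = io.StringIO()
    writer = csv.writer(out, delimiter="\t", lineterminator="\n")
    writer.writerows(zip(*rows))               # zip(*rows) yields the columns
    return out.getvalue()

# Hypothetical table: molecules in rows, samples in columns.
original = "name\ts1\ts2\nmolA\t1.0\t2.0\nmolB\t3.0\t4.0\n"
transposed = transpose_tsv(original)
```

To work on real files, read the input file into a string, pass it to the function, and write the returned string to a new file that is then given to Ometer.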

An efficient way to visualize both the variables and the samples of a data set is a biplot. Ometer can produce 2D and 3D biplots of the PCA-transformed data matrix (through gnuplot). It can also display only a selected subset of variables, and it allows the relative weight of variables and samples to be changed.
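
To make the idea of weighting concrete, the sketch below computes biplot coordinates from a singular value decomposition, splitting the singular values between variables and samples with an exponent alpha. This is a common textbook construction, not necessarily the one Ometer uses: the function name biplot_coordinates, the centering, and the choice to apply alpha to the variables and 1 - alpha to the samples are all assumptions.

```python
import numpy as np

def biplot_coordinates(data, alpha=0.5, dim=2):
    """Return (sample_points, variable_points) for a biplot of dimension `dim`.

    Assumed layout: one sample per row, one variable per column.
    Assumed convention: alpha weights the variables, 1 - alpha the samples.
    """
    X = np.asarray(data, dtype=float)
    X = X - X.mean(axis=0)                     # center each variable (assumption)
    U, s, Vt = np.linalg.svd(X, full_matrices=False)
    variables = Vt[:dim].T * s[:dim] ** alpha
    samples = U[:, :dim] * s[:dim] ** (1.0 - alpha)
    return samples, variables

# Hypothetical data: 4 samples, 3 variables.
data = np.array([[2.0, 0.0, 1.0],
                 [4.0, 1.0, 3.0],
                 [6.0, 0.5, 2.0],
                 [8.0, 1.5, 5.0]])
samples, variables = biplot_coordinates(data, alpha=0.7, dim=2)
```

Whatever the value of alpha, the product of the sample and variable coordinates reproduces the same low-rank approximation of the centered data; alpha only changes how the scale is shared between the two sets of points.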

Usage

The input data file must not have missing data (i.e. empty fields). Examples of usage:

ometer -m=pca -o result input.dat
will analyze the data in file input.dat and put the results in files whose names start with "result-".

ometer -m=pca --proj-variance=0.85 -o result input.dat
will carry out the analysis as above, but rather than retaining all of the data, it projects only onto as many principal components as are needed to keep 85% (0.85) of the variance. (The value of this option must be between 0 and 1.)

ometer -m=pca --proj-dimension=5 -o result input.dat
will carry out the analysis as above, but rather than retaining all of the data, it carries out a projection only onto the 5 most important principal components.

ometer -m=pca --proj-variance=0.946 --proj-dimension=3 -o result input.dat
will carry out the analysis as above, but will project onto 3 principal components or onto as many as are needed for 94.6% of the variance (whichever is larger).

ometer -m=pca --proj-2dplots -o result input.dat
will carry out the PCA and create gnuplot files for 2D plots of the data projection onto the principal components (PCs).

ometer -m=pca --proj-2dplots --plot-png -o result input.dat
will do the same as the previous example, but the gnuplot files are set up to produce PNG bitmaps (useful for displaying on web pages, importing into word processors, etc.).

ometer -m=pca --proj-dimension=4 --biplot-dim=2 --biplot-alpha=0.7 --biplot-vars="myvars.txt" -o result input.dat
will carry out PCA of input.dat, putting the output in files starting with "result-". It will produce a 2D biplot of the PCA-transformed data with alpha=0.7 (slightly more weight to the variables than to the samples). The biplot is written to the file result-biplot.plt, which needs gnuplot to be displayed. Note that --proj-dimension must be greater than or equal to --biplot-dim.

ometer -m=pca --proj-dimension=4 --biplot-dim=3 --biplot-alpha=0.7 --biplot-vars="myvars.txt" -o result input.dat
will produce the same as the previous example, but now with a 3D biplot (PC1, PC2, and PC3).
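
The rule described in the --proj-variance and --proj-dimension examples above (keep enough components for the requested variance, or the requested number, whichever is larger) can be sketched in Python as follows. The function name components_to_keep is hypothetical, and taking the variance along each component as proportional to the squared singular value is the standard convention, assumed here rather than confirmed for Ometer.

```python
def components_to_keep(singular_values, min_variance=None, dimension=None):
    """Number of leading principal components to project onto.

    min_variance: fraction (0 to 1) of the total variance to retain.
    dimension:    explicit number of components requested.
    If both are given, the larger of the two counts is used.
    """
    variances = [s * s for s in singular_values]   # variance is proportional to s**2
    total = sum(variances)
    k_variance = 0
    if min_variance is not None:
        if not 0.0 <= min_variance <= 1.0:
            raise ValueError("min_variance must be between 0 and 1")
        cumulative = 0.0
        for k_variance, v in enumerate(variances, start=1):
            cumulative += v
            if cumulative >= min_variance * total:
                break
    k = max(k_variance, dimension or 0)
    if k == 0:                  # neither option given: keep everything
        return len(variances)
    return min(k, len(variances))

# Hypothetical singular values, in decreasing order.
singular_values = [4.0, 2.0, 1.0, 1.0]
k = components_to_keep(singular_values, min_variance=0.85)   # -> 2
```

In this example the first two components carry 20/22 (about 91%) of the variance, so two components are enough to reach 85%.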

All PCA relevant options:

-m pca or --method=pca : required!
--proj-dimension int : number of dimensions (principal components) to project the data onto
--proj-variance double : minimum fraction of the variance kept in the projection (between 0 and 1)
--samples-center : centers all samples before analysis (i.e. subtracts the mean value of each sample)
--vars-center : centers all variables before analysis (i.e. subtracts the mean value of each variable)
--samples-z : transforms samples into z-scores before analysis (i.e. subtracts the mean value, and divides by standard deviation of each sample)
--vars-z : transforms variables into z-scores before analysis (i.e. subtracts the mean value, and divides by standard deviation of each variable)
--samples-total double : scales all samples to the same given total before analysis
--biplot-dim int : dimension of the biplot to produce, must be 2 or 3
--biplot-alpha double : weighting factor: if 1 the biplot is variables-weighted, if 0 it is samples-weighted; values in between give different weights to variables and samples. With alpha=0.5 the biplot is symmetric.
--biplot-vars string : filename of a text file containing names of variables to include in the biplot. Names must be included one per line, and only these variables will be included in the biplot (the PCA is carried out with all variables, of course).
-f int or --format=int : input file format: variables in rows = 1, variables in columns = 2
-o string or --output=string : name for output files
--proj-2dplots : create 2D data projection plots
--plot-png : gnuplot output in PNG format, only relevant if --proj-2dplots is given
--verbose int : print detailed information, higher values provide more information
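
The centering and scaling options in the list above correspond to simple transformations of the data matrix. The following Python sketch shows one way to express them with NumPy; the function names, the one-sample-per-row layout, and the use of the sample standard deviation (ddof=1) are assumptions and may differ from what Ometer does internally.

```python
import numpy as np

# Assumed layout: one sample per row, one variable per column.
# NumPy axis to reduce over to get one statistic per sample / per variable.
SAMPLES, VARIABLES = 1, 0

def center(X, axis):
    """Subtract the mean taken along `axis` (cf. --samples-center, --vars-center)."""
    return X - X.mean(axis=axis, keepdims=True)

def z_scores(X, axis, ddof=1):
    """Center, then divide by the standard deviation (cf. --samples-z, --vars-z)."""
    return center(X, axis) / X.std(axis=axis, ddof=ddof, keepdims=True)

def scale_to_total(X, total):
    """Rescale every sample (row) so its values sum to `total` (cf. --samples-total)."""
    return X * (total / X.sum(axis=1, keepdims=True))

# Hypothetical data: 3 samples, 3 variables.
X = np.array([[1.0, 2.0, 7.0],
              [2.0, 6.0, 12.0],
              [4.0, 1.0, 5.0]])
vars_centered = center(X, VARIABLES)
samples_z = z_scores(X, SAMPLES)
scaled = scale_to_total(X, 100.0)
```

After these calls each column of vars_centered has zero mean, each row of samples_z has zero mean and unit standard deviation, and each row of scaled sums to 100.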

All tab-delimited text files can easily be loaded into spreadsheets. The HTML report file should be viewed with a web browser.
