Mendes Research Group - Ometer file formats

Ometer — File formats

Each data file can be seen as a table of samples, where each sample is characterized by several variables (attributes). What counts as a sample and what counts as a variable depends on the nature of the data itself and/or the purpose of the analysis. In "omic" data sets, the samples are usually the biological samples (mutants, time points, etc.) and the variables are the molecules measured in those samples (metabolites, mRNA, etc.).

Ometer can read data in two distinct file formats: one with samples in the columns and variables in the rows (the default format), and the other with samples in the rows and variables in the columns. The data files should be ASCII text files with tab as the column separator (no commas or spaces!), and carriage return (CR) and/or line feed (LF) as row separators. Ometer accepts Windows, Unix and Mac style text files alike (i.e. CR-LF, LF, or CR).

One way to easily create data files for ometer is to export them from a spreadsheet (Excel or Gnumeric, for example), making sure that the file format is "tab-separated text".
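If the data are produced by a script rather than a spreadsheet, the same kind of file can be written directly. The following is a minimal Python sketch, not part of ometer; the function name to_tab_separated and the use of None for a missing value are assumptions made for illustration.

```python
def to_tab_separated(rows):
    """Return rows as tab-separated text, one row per line.

    None is written as an empty cell, so it is not confused with
    a measured value of 0.
    """
    lines = []
    for row in rows:
        cells = ["" if cell is None else str(cell) for cell in row]
        # Tabs and line breaks are the separators, so they cannot appear in a cell.
        if any("\t" in c or "\n" in c or "\r" in c for c in cells):
            raise ValueError("cells must not contain tabs or line breaks")
        lines.append("\t".join(cells))
    return "\n".join(lines) + "\n"


rows = [
    ["names", "sample 1", "sample 2", "sample 3"],
    ["glucose", 54, None, 73],  # the second measurement is missing
    ["classes", "normal", "disease", "disease"],
]
text = to_tab_separated(rows)
```

The resulting text can be saved with an ordinary file write; opening the file with newline="" prevents Python from translating the line breaks.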

Names and classes

In both formats the variables must be named explicitly. Samples should also be named explicitly, although in file format 1 the names may be omitted (in which case the samples will be numbered by their order in the file). For classification problems, each sample must have a class assignment. Sample names and class assignments, if they exist, are added in the same way as variables (rows in format 1, columns in format 2), and are indicated by the reserved strings "names" and "classes".

File format 1 (default)

In this format, each line of the file represents the measurements of one single variable in several samples. The line must start (first column) with a name for the variable, followed by the numerical values. If the first column contains a number rather than a name, that string is taken as the name. Data may be missing, in which case the field must be left empty, so that only its tab separator remains. If the numerical columns (all but the first) contain malformed numbers, these are taken as missing data. If you are exporting from a spreadsheet, missing data should be an empty cell, not the value 0 (zero), since that is a legitimate number (i.e. ometer assumes that the value 0 is the result of a measurement). This has consequences for some of the algorithms (e.g. correlation analysis). Because this is the default format you do not have to specify anything if your files are in this format, although the explicit command line option -f=1 will ensure that ometer reads them properly.

Below is a very small example; the keywords are the reserved strings "names" and "classes":

names	sample 1	sample 2	sample 3	sample 4
glucose	54	46	73	53
alanine	2.4	4.2	3.6	2.7
malate	1.2e-3	1.5e-3	2.1e-3	1.8e-3
classes	normal	disease	disease	normal
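The rules above can be illustrated with a short Python sketch that reads format 1 text. It is not ometer's own parser: the names read_format1 and parse_number are hypothetical, and Python's float() stands in for whatever ometer treats as a well-formed number, which is an assumption.

```python
import math


def parse_number(cell):
    """Return the cell as a float, or NaN when it is empty or malformed."""
    try:
        return float(cell)
    except ValueError:
        return math.nan


def read_format1(text):
    """Parse format 1 text into (sample names, classes, variables)."""
    names, classes, variables = None, None, {}
    # splitlines() accepts CR-LF, LF and CR line endings alike.
    for line in text.splitlines():
        if not line.strip():
            continue
        label, *cells = line.split("\t")
        if label == "names":
            names = cells
        elif label == "classes":
            classes = cells
        else:
            # Any other first column is taken as a variable name.
            variables[label] = [parse_number(c) for c in cells]
    if names is None:
        # Unnamed samples are numbered by their order in the file.
        width = max((len(v) for v in variables.values()), default=0)
        names = [str(i + 1) for i in range(width)]
    return names, classes, variables


sample = (
    "names\tsample 1\tsample 2\tsample 3\tsample 4\r\n"
    "glucose\t54\t\t73\t53\r\n"        # empty field: missing value
    "alanine\t2.4\t4.2\tn/a\t2.7\r\n"  # malformed number: missing value
    "classes\tnormal\tdisease\tdisease\tnormal\r\n"
)
names, classes, variables = read_format1(sample)
```

Missing values are kept as NaN rather than 0, which mirrors the distinction the text draws between an empty cell and a measured zero.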

File format 2

This file format is the transpose of file format 1. Each line in the file represents a sample, with the variables in the columns. This file format requires that the samples be named explicitly in the first column of each row. The top line of the file should contain the names of the variables, preceded by the identifier "names" in the first column. If there are class assignments, then there must also be a column for them, and the first line must contain the word "classes" in that same column. Because this is not the default format you must indicate that your files are in this format with the command line option -f=2.

Below is the same example as above, now in format 2:

names	glucose	alanine	malate	classes
sample 1	54	2.4	1.2e-3	normal
sample 2	46	4.2	1.5e-3	disease
sample 3	73	3.6	2.1e-3	disease
sample 4	53	2.7	1.8e-3	normal
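Since format 2 is the transpose of format 1, one can be converted into the other by swapping rows and columns. The following Python sketch shows the idea; the function name format2_to_format1 is hypothetical and the code is not part of ometer.

```python
def format2_to_format1(text):
    """Transpose tab-separated text so that rows become columns."""
    rows = [line.split("\t") for line in text.splitlines() if line.strip()]
    if not rows:
        return ""
    # Pad short rows with empty cells so that every row has the same width.
    width = max(len(row) for row in rows)
    padded = [row + [""] * (width - len(row)) for row in rows]
    columns = zip(*padded)
    return "\n".join("\t".join(column) for column in columns) + "\n"


format2 = (
    "names\tglucose\talanine\tmalate\tclasses\n"
    "sample 1\t54\t2.4\t1.2e-3\tnormal\n"
    "sample 2\t46\t4.2\t1.5e-3\tdisease\n"
    "sample 3\t73\t3.6\t2.1e-3\tdisease\n"
    "sample 4\t53\t2.7\t1.8e-3\tnormal\n"
)
format1 = format2_to_format1(format2)
```

Applied to the example above, the first output line starts with "names" and lists the samples, and the last line starts with "classes", as in the format 1 example.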

Files in format 2 must not have empty lines at the end, not even just one. This is a known bug and will be resolved in a later version (if there will ever be one...).
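Until that is fixed, trailing empty lines can be removed before the file is passed to ometer. Below is a minimal Python sketch; the function name strip_trailing_empty_lines and its final_newline parameter are hypothetical. The text does not say whether a single terminating line break counts as an empty line, so the sketch leaves it out by default as the cautious choice.

```python
def strip_trailing_empty_lines(text, final_newline=False):
    """Remove empty (or whitespace-only) lines from the end of the text."""
    lines = text.splitlines()
    while lines and not lines[-1].strip():
        lines.pop()
    cleaned = "\n".join(lines)
    # Assumption: a terminating line break is only added on request.
    if final_newline and cleaned:
        cleaned += "\n"
    return cleaned


raw = "names\tglucose\tclasses\nsample 1\t54\tnormal\n\n\n"
cleaned = strip_trailing_empty_lines(raw)
```

Empty lines in the middle of the file are left untouched; only those at the end are dropped.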

Data Output

Many analyses carried out by ometer produce numerical results of the same nature as the input data. Whenever this is the case, ometer writes the results to files conforming to format 1 (samples in columns). Such files can then be used as input for further analyses by ometer.

For example, it is often necessary to carry out a data reduction with PCA before classifying samples with MDA. With this mechanism, one first extracts the necessary PCs with ometer and then uses the result file with the projected data as input to DFA. If there are class assignments in the input file, they are also written to the output file, even if the requested analysis ignores class assignments (as PCA does).

Non-data output

Ometer also produces several results that are not of the same nature as the input data. For example, it calculates correlations and their associated p-values. Those results are put in a number of other files, which are specific to each analysis. Please see the documentation of each analysis method for details on these files.

All ometer runs produce a report, which is formatted in HTML. You should read or print such reports with a web browser (e.g. Firefox). In Linux, you can use the lynx text browser if you are working on a non-graphical terminal.
