1.2.1 Data files

johaGL edited this page Jan 31, 2024 · 2 revisions

Users must place their data files in the data folder. The original data files go in the dataset subfolder (within data/), along with a tabular metadata file that describes the experimental setup corresponding to the data.

Both the quantification and the metadata files must be provided in tab-delimited .csv format.

Note: you do not need to provide every type of measure. With at least one type you can still run some of the analyses, provided that the data you supply is suitable for the analysis you choose.

a. Metadata file

The tabular metadata file must contain 6 columns named name_to_plot, timepoint, timenum, condition, compartment, original_name.

Explanation of the metadata format

The columns have the following meaning:

  • name_to_plot is the string that will appear on the figures produced by DIMet
  • condition is the experimental condition
  • timepoint is the sampling time as defined in your experimental setup (it is an arbitrary string that can contain non-numerical characters)
  • timenum is the numerical encoding of the timepoint
  • compartment is the name of the cellular compartment in which the measurement was made (e.g. "endo", "endocellular", "cyto", etc.)
  • original_name contains the column names that are provided in the quantification files

Example:

name_to_plot  condition  timepoint  timenum  compartment  original_name
Cond1 T0      cond1      T0         0        comp_name    T0_cond_1
Cond1 T24     cond1      T24        24       comp_name    T24_cond_1
Cond2 T0      cond2      T0         0        comp_name    T0_cond_2
Cond2 T24     cond2      T24        24       comp_name    T24_cond_2
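
A metadata file can be checked against the requirements above before running any analysis. The following is a minimal sketch using only the Python standard library; it is not part of DIMet, and the function name read_metadata is hypothetical. It assumes a tab-delimited file whose first row is the header.

```python
import csv
import io

# Column names required by the metadata format described above.
REQUIRED_COLUMNS = {
    "name_to_plot", "condition", "timepoint",
    "timenum", "compartment", "original_name",
}


def read_metadata(handle):
    """Parse a tab-delimited metadata file and run basic sanity checks.

    Hypothetical helper for illustration; not a DIMet function.
    """
    reader = csv.DictReader(handle, delimiter="\t")
    missing = REQUIRED_COLUMNS - set(reader.fieldnames or [])
    if missing:
        raise ValueError(f"missing metadata columns: {sorted(missing)}")
    rows = list(reader)
    for row in rows:
        # timenum is the numerical encoding of timepoint, so it must
        # convert to a number; float() raises ValueError otherwise.
        float(row["timenum"])
    return rows


# In-memory stand-in for a metadata file with two of the example rows.
example = io.StringIO(
    "name_to_plot\tcondition\ttimepoint\ttimenum\tcompartment\toriginal_name\n"
    "Cond1 T0\tcond1\tT0\t0\tcomp_name\tT0_cond_1\n"
    "Cond1 T24\tcond1\tT24\t24\tcomp_name\tT24_cond_1\n"
)
metadata = read_metadata(example)
print([row["original_name"] for row in metadata])
```

To check a real file, pass an open file handle instead of the in-memory example, for instance one obtained with open(path, newline="").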

b. Quantification files

Each quantification file is expected to correspond to one type of measure. Supported measure types are:

  1. Isotopologue absolute values
  2. Total metabolite abundances
  3. Mean enrichment (also called Fractional contribution)
  4. Isotopologue proportions

Expected format of the quantification files with examples

Each row in the quantification files contains measurements for a given metabolite. Expected columns are the following:

  • ID contains the molecule identifiers
  • All the other columns contain measures in numeric format (no letters or symbols, only numbers).

Note 1: the column names of the quantification files must match the values in the original_name column of the metadata file.

Note 2: for the isotopologues, the ID must follow the convention metaboliteID_m+X (for example: AMP_m+4, cit_m+0, cit_m+1).
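
Both notes can be verified programmatically. The sketch below is an illustration under assumptions, not DIMet code: the helper names parse_isotopologue_id and unmatched_columns are hypothetical, and the regular expression assumes that X is a non-negative integer.

```python
import re

# Pattern for the isotopologue naming convention metaboliteID_m+X,
# assuming X is a non-negative integer.
ISOTOPOLOGUE_ID = re.compile(r"^(?P<metabolite>.+)_m\+(?P<label>\d+)$")


def parse_isotopologue_id(identifier):
    """Split an ID such as 'cit_m+1' into ('cit', 1); raise ValueError otherwise."""
    match = ISOTOPOLOGUE_ID.match(identifier)
    if match is None:
        raise ValueError(f"not a valid isotopologue ID: {identifier!r}")
    return match["metabolite"], int(match["label"])


def unmatched_columns(quantification_header, original_names):
    """Return the sample columns of a quantification file that do not
    appear in the original_name column of the metadata."""
    known = set(original_names)
    return [
        name for name in quantification_header
        if name != "ID" and name not in known
    ]


print(parse_isotopologue_id("AMP_m+4"))
# "T24_cond1" is a deliberate misspelling of "T24_cond_1".
print(unmatched_columns(
    ["ID", "T0_cond_1", "T24_cond1"],
    ["T0_cond_1", "T24_cond_1"],
))
```

An empty list from unmatched_columns means that every sample column has a counterpart in the metadata.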

The total metabolite abundances file:

ID T0_cond_1 T24_cond_1 T0_cond_2 T24_cond_2
PEP 3364610.46 10250098.25 1124772.29 1035932.25
citrate 5783654.51 5934305.65 3546334.99 3460334.88
fumarate 354387.74 360087.74 334287.74 350387.74
OA 9435186.33 9435186.33 9435186.33 9435186.33

The Mean enrichment (also called Fractional contribution) file:

ID T0_cond_1 T24_cond_1 T0_cond_2 T24_cond_2
PEP 0.5603 0.6391 0.9591 0.9553
citrate 0.8057 0.8870 0.7809 0.6918
fumarate 0.001 0 0.1508 0.1511
OA 0.7030 0.7006 0.001 0

The Isotopologue absolute values file:

ID T0_cond_1 T24_cond_1 T0_cond_2 T24_cond_2
PEP_m+0 357354.66 387054.66 0 0
PEP_m+1 965435.68 975030.68 668.91 568.87
PEP_m+2 1435050.95 7987654.66 136749.05 137709.05
PEP_m+3 606769.17 900358.25 987354.33 897654.33

The Isotopologue proportions file:

ID T0_cond_1 T24_cond_1 T0_cond_2 T24_cond_2
PEP_m+0 0.106 0.038 0.000 0.000
PEP_m+1 0.287 0.095 0.001 0.001
PEP_m+2 0.427 0.779 0.122 0.133
PEP_m+3 0.180 0.088 0.878 0.867
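
This page does not state how the four measure types relate to one another. As a consistency check, the sketch below applies commonly used definitions (an assumption here, not a description of DIMet's internals) to the PEP isotopologue absolute values of sample T0_cond_1 from the example files above.

```python
# Isotopologue absolute values for PEP in sample T0_cond_1,
# keyed by the number X in the ID PEP_m+X (taken from the example file).
absolute = {0: 357354.66, 1: 965435.68, 2: 1435050.95, 3: 606769.17}

# Total abundance: sum over all isotopologues of the metabolite.
total = sum(absolute.values())

# Isotopologue proportions: each absolute value divided by the total.
proportions = {label: value / total for label, value in absolute.items()}

# Mean enrichment: label-weighted sum of the proportions divided by the
# highest label number (assumed here to equal the number of labelable atoms).
n_max = max(absolute)
mean_enrichment = sum(label * p for label, p in proportions.items()) / n_max

print(round(total, 2))
print({label: round(p, 3) for label, p in proportions.items()})
print(round(mean_enrichment, 4))
```

Under these assumed definitions, the computed values agree with the T0_cond_1 entries for PEP in the total abundances, isotopologue proportions and mean enrichment examples.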

c. Data files for the omics integration (optional)

DIMet offers the possibility of pathway-based integration of the metabolome and the transcriptome through metabolograms.

Data files required for omics integration

Two data types are required:
  1. Metabolite quantification files, placed in the dataset subfolder.
  2. The results of the differential analysis of the transcriptome data, provided by the user and also placed in the dataset subfolder.

Along with the differentially expressed genes files, the user must also provide the pathway files (see item 2.2 of this subsection).

Thus the expected project data structure becomes:

MYPROJECT
  ├── config
  │   ├── analysis
  │   │   ├── dataset
  │   │   │   └── # --->'dataset configuration' yml files
  │   │   ├── # --->'analysis configuration' yml files
  │   ├── # ---> 'general configuration' yml files
  └── data
      └── DATASET1_data
          ├── # ---> tabular .csv files of metabolomics data
          ├── # ---> .csv files required for omics integration (genes and pathways)

  • 2.1 Files for differentially expressed genes (DEGs)

Files for differentially expressed genes (DEGs) must be provided in tab-delimited .csv format. For each file:

  1. The rows represent the genes (except the first one, which is the header containing the column names).
  2. The columns provide the information to be integrated. Two columns are compulsory:
    1. the gene names, given as strings
    2. the fold changes (or the log2 fold changes), in numeric format (no letters or symbols, only numbers)

Formatting example of differentially expressed genes files:

log2FoldChange gene_symbol
-16.1660338229612 GPI
3.32192809488736 HK1
2.32192809488736 RPIA
0.807354922057604 PFKL
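
A DEGs file in this layout can be loaded and checked with a few lines of standard-library Python. This is a sketch, not DIMet code: read_degs is a hypothetical helper, and the default column names are simply those used in the example above, since the page does not say whether specific names are required.

```python
import csv
import io


def read_degs(handle, gene_column="gene_symbol", value_column="log2FoldChange"):
    """Return {gene name: fold change} from a tab-delimited DEGs file.

    The default column names come from the example above and are an
    assumption; pass other names if your file uses different headers.
    """
    reader = csv.DictReader(handle, delimiter="\t")
    for column in (gene_column, value_column):
        if column not in (reader.fieldnames or []):
            raise ValueError(f"missing compulsory column: {column}")
    # float() raises ValueError when a value is not numeric text.
    return {row[gene_column]: float(row[value_column]) for row in reader}


# In-memory stand-in for a DEGs file with two of the example rows.
degs = read_degs(io.StringIO(
    "log2FoldChange\tgene_symbol\n"
    "-16.1660338229612\tGPI\n"
    "3.32192809488736\tHK1\n"
))
print(degs)
```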

  • 2.2 Metabolites per pathway and genes (or transcripts) per pathway files

These files list the user-provided metabolites and genes for each pathway. A metabolite or gene may appear in several pathways. Metabolite identifiers must match those in the quantification files in the dataset subfolder, and gene names must match those in the DEGs file.

Example for metabolites per pathway:

GLYCOLYSIS PENTOSE_PHOSPHATE ...
Glucose_6P Ribose_5P ...
Pyruvate Xylulose_5P ...
PEP Glucose_6P ...
... ... ...

Example for genes per pathway:

GLYCOLYSIS PENTOSE_PHOSPHATE ...
GPI RPIA ...
HK1 PGD ...
PFKL RBKS ...
... ... ...
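
Because the pathway files are organised column-wise (one pathway per column), a small reader helps when cross-checking names against the DEGs or quantification files. The sketch below is illustrative only: read_pathways and unknown_members are hypothetical helpers, and skipping empty cells assumes that pathways of different lengths are padded with blanks, which this page does not specify.

```python
import csv
import io


def read_pathways(handle):
    """Return {pathway name: [members]} from a tab-delimited file in
    which each column is one pathway."""
    reader = csv.reader(handle, delimiter="\t")
    header = next(reader)
    pathways = {name: [] for name in header}
    for row in reader:
        for name, member in zip(header, row):
            # Assumption: shorter pathways are padded with empty cells.
            if member.strip():
                pathways[name].append(member.strip())
    return pathways


def unknown_members(pathways, known_identifiers):
    """List (pathway, member) pairs whose member is not a known identifier."""
    known = set(known_identifiers)
    return [
        (name, member)
        for name, members in pathways.items()
        for member in members
        if member not in known
    ]


# In-memory stand-in for a genes per pathway file.
genes_per_pathway = read_pathways(io.StringIO(
    "GLYCOLYSIS\tPENTOSE_PHOSPHATE\n"
    "GPI\tRPIA\n"
    "HK1\tPGD\n"
    "PFKL\tRBKS\n"
))
# Gene names taken from the DEGs example.
deg_genes = ["GPI", "HK1", "RPIA", "PFKL"]
print(unknown_members(genes_per_pathway, deg_genes))
```

Names reported by unknown_members have no counterpart in the other file. This can simply mean that the gene or metabolite was not measured, but it is also a quick way to spot spelling mismatches between files.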

All these files must be provided in tab-delimited .csv format.
