Input File Type Specification

This document contains the characteristics of the input file types that are supported by the ageml pipelines (model_age, factor_correlation, clinical_groups, clinical_classify).

IMPORTANT: Files that do not follow these specifications may cause errors or unexpected behavior, so please make sure your input files comply with them.

Features File

Specified with the --features flag. Contains the age and the features of the subjects.

  • Extension: CSV. Comma (,) separated.
  • Header: It must contain the variable names. The first column is the row index, the first named column must contain the age of the subjects, and the rest of the columns contain the features (e.g., age, BMI, HDL).
  • Variables: All variables must be numeric. Age can be in any unit. NOTE: Support for categorical variables is on its way.
  • Format: The decimal separator is the decimal point (.).

Example (units are arbitrary, quantities are not real):

,age,HDL,LDL,hippocampus_volume,thalamus_volume
0,20,0.5,0.9,142,543
1,21,0.6,1.1,135,636
2,22,0.7,0.89,129,737
3,23,0.8,1.05,128,854
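
The rules above can be checked before running a pipeline. The following is a minimal sketch using only the Python standard library; `check_features` is a hypothetical helper written for this illustration, not part of ageml, and it assumes the file contents are already in a string.

```python
import csv
import io


def check_features(text):
    """Return a list of problems found in a features CSV given as a string.

    Hypothetical helper for illustration only; not part of ageml.
    """
    rows = list(csv.reader(io.StringIO(text)))
    header, body = rows[0], rows[1:]
    problems = []
    # Column 0 is the row index, so the age column is header[1].
    if len(header) < 2:
        problems.append("header needs a row index column and an age column")
    for line_no, row in enumerate(body, start=2):
        if len(row) != len(header):
            problems.append(
                f"line {line_no}: expected {len(header)} fields, got {len(row)}"
            )
            continue
        # Every value after the row index must parse as a number
        # written with a decimal point.
        for name, value in zip(header[1:], row[1:]):
            try:
                float(value)
            except ValueError:
                problems.append(f"line {line_no}: {name}={value!r} is not numeric")
    return problems


sample = ",age,HDL,LDL\n0,20,0.5,0.9\n1,21,0.6,1.1\n"
print(check_features(sample))  # an empty list means no problems were found
```

An empty list means the file passed these checks; each string in a non-empty list names the offending line.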

Covariates File

Specified with the --covariates flag. Contains the covariates of the subjects, which are used to separate subjects by category and/or to apply covariate corrections.

  • Extension: CSV. Comma (,) separated.
  • Header: It must contain the covariate names. The first column is the row index.
  • Variables: Categorical variables must be integers (int) and continuous variables must be floats.

Example:

,site,biological_gender,smoker,educ_years
0,1,0,1,10.0
1,2,1,1,16.0
2,3,0,0,12.0
3,2,0,0,10.0
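
To illustrate the int/float convention, the sketch below labels each covariate column as categorical or continuous according to how its values are written. `covariate_kinds` is a hypothetical helper for this document, not an ageml function, and how ageml itself tells the two kinds apart is not specified here.

```python
import csv
import io


def covariate_kinds(text):
    """Map each covariate name to 'categorical' or 'continuous'.

    Hypothetical helper for illustration only; not part of ageml.
    Raises ValueError on rows with the wrong number of fields or on
    values that are not numeric.
    """
    rows = list(csv.reader(io.StringIO(text)))
    header, body = rows[0], rows[1:]
    for line_no, row in enumerate(body, start=2):
        if len(row) != len(header):
            raise ValueError(
                f"line {line_no}: expected {len(header)} fields, got {len(row)}"
            )
    kinds = {}
    # Column 0 is the row index, so covariates start at column 1.
    for col, name in enumerate(header[1:], start=1):
        values = [row[col] for row in body]
        try:
            for value in values:
                int(value)  # "1" parses as int, "10.0" does not
            kinds[name] = "categorical"
        except ValueError:
            for value in values:
                float(value)  # raises ValueError if the value is not numeric
            kinds[name] = "continuous"
    return kinds


sample = ",site,smoker,educ_years\n0,1,1,10.0\n1,2,0,16.0\n"
print(covariate_kinds(sample))
```

A row with a missing field raises a ValueError that names the line, which makes this kind of mistake easy to locate.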

Clinical File

Specified with the --clinical flag. Contains the clinical groups to which every subject belongs.

  • Extension: CSV. Comma (,) separated.
  • Header: It must contain the clinical group names. The first column is the row index, and each of the remaining columns corresponds to one clinical group.
  • Variables: All values must be 0 or 1 (1 if the subject belongs to the group, 0 otherwise).

Example (in the context of Alzheimer's disease):

,CN,MCI,AD
0,1,0,0
1,1,0,0
2,0,1,0
3,0,0,1
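
As a quick sanity check, the group sizes can be counted while verifying that every value is 0 or 1. This is a minimal sketch; `group_sizes` is a hypothetical helper, not part of ageml.

```python
import csv
import io


def group_sizes(text):
    """Count the subjects in each clinical group of a clinical CSV string.

    Hypothetical helper for illustration only; not part of ageml.
    Raises ValueError if a value is not exactly 0 or 1.
    """
    reader = csv.reader(io.StringIO(text))
    header = next(reader)
    groups = header[1:]  # column 0 is the row index
    sizes = dict.fromkeys(groups, 0)
    for line_no, row in enumerate(reader, start=2):
        if len(row) != len(header):
            raise ValueError(
                f"line {line_no}: expected {len(header)} fields, got {len(row)}"
            )
        for name, value in zip(groups, row[1:]):
            if value not in ("0", "1"):
                raise ValueError(f"line {line_no}: {name}={value!r} must be 0 or 1")
            sizes[name] += int(value)
    return sizes


sample = ",CN,MCI,AD\n0,1,0,0\n1,1,0,0\n2,0,1,0\n3,0,0,1\n"
print(group_sizes(sample))  # {'CN': 2, 'MCI': 1, 'AD': 1}
```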

Factors File

Specified with the --factors flag. Contains the factors for exploring the correlation with the age delta.

  • Extension: CSV. Comma (,) separated.
  • Header: It must contain the factor names. The first column is the row index. The rest of the columns contain the factors.
  • Variables: All variables must be numeric.
  • Format: We use the decimal point (.) as the decimal separator.

Example (units and factors are arbitrary, quantities are not real):

,func_score,sedentarism_points,neuro_score,MOCA_SCORE,memory_perf,familiar_support,hygiene_habits
0,28.0,0.0,6.0,0.0,21.0,0.702,-0.154
1,30.0,0.0,3.0,1.0,28.0,1.812,2.046
2,30.0,0.0,8.0,0.0,25.0,0.846,0.812
3,26.0,1.0,21.0,20.0,19.0,-0.627,0.643

Systems File

Specified with the --systems flag. Contains the systems for which we want to train different models. Each system is a set of columns of the features file. Strictly speaking, systems are variable sets. The naming of the systems is up to the user.

  • Extension: txt.
  • Format: Each line contains the name of the system, followed by a colon (:) and the names of the variables that belong to it, separated by commas (,) (e.g., system_1_name:var1,var2,var3). The variable names must be written exactly as they appear in the header of the features file. Do not include empty lines after the last system.

Example:

cardiovascular:HDL,LDL
brain:hippocampus_volume,thalamus_volume
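
The sketch below parses a systems file in the format shown in the example and checks every variable against the header of the features file. `parse_systems` and its arguments are hypothetical names for this illustration, not ageml's own parser.

```python
def parse_systems(text, feature_names):
    """Parse 'name:var1,var2' lines into a dict of system name -> variables.

    Hypothetical helper for illustration only; not part of ageml.
    feature_names is the list of column names from the features file header.
    """
    systems = {}
    for line_no, line in enumerate(text.splitlines(), start=1):
        if not line.strip():
            raise ValueError(f"line {line_no}: empty lines are not allowed")
        name, sep, variables = line.partition(":")
        if not sep:
            raise ValueError(f"line {line_no}: missing ':' after the system name")
        names = variables.split(",")
        # Names must match the features header exactly, so no trimming is done.
        unknown = [v for v in names if v not in feature_names]
        if unknown:
            raise ValueError(f"line {line_no}: unknown variables {unknown}")
        systems[name] = names
    return systems


features_header = ["age", "HDL", "LDL", "hippocampus_volume", "thalamus_volume"]
systems_text = "cardiovascular:HDL,LDL\nbrain:hippocampus_volume,thalamus_volume"
print(parse_systems(systems_text, features_header))
```

Comparing names without trimming whitespace is a deliberate choice in this sketch: a stray space after a comma is reported as an unknown variable instead of being silently accepted.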