Recent developments in high-throughput omics technologies have opened up research in complex areas such as biomarker discovery. In metabolomics, current approaches typically suffer from problems with peak picking, alignment, statistical overfitting, variable selection, validation, unconservative test statistics, false positive discoveries and metabolite identification, with a high risk of misinterpreting the data as a consequence.
This project aims to develop effective, robust and validated methodologies for analyzing and interpreting data from large-scale metabolomics studies, in order to discover new dietary biomarkers and to assess the biological effects of dietary interventions. The project will lead to the build-up and integration of data-handling approaches that can, in principle, be applied to virtually any metabolomics/biomarker project.
The starting point will be two ongoing metabolomics studies generating: i) NMR-based metabolomics to investigate metabolic profiles in plasma samples after short-term consumption of fermentable dietary fibre, plant protein and rye in relation to appetite regulation (1000 samples); ii) LC-MS-based metabolomics to investigate dietary biomarkers over longer time scales for use in prospective cohort studies in relation to type 2 diabetes (2000 samples).
Data management schemes are built up in R and rely heavily on PLS and repeated double cross-validation. However, the validation procedures that we employ for multivariate analyses are computationally very demanding, especially the permutation analyses. In practice, single stand-alone computers are not sufficient for the workload. As an example, a pilot study on 112 observations of 16500 variables required 21.6 hours using all 4 cores on a HP Elitebook with an Intel i7-3687U processor.
Although variable prefiltering is a feasible option for decreasing the computational cost, permutation analysis with this type of model is simply not feasible without a more advanced computing infrastructure.
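The bottleneck can be illustrated with a toy sketch: a permutation test must repeat the entire cross-validation loop for every label shuffle, so the cost scales as (number of permutations) × (cost of one cross-validated fit). The sketch below is not the project's pipeline (which uses PLS with repeated double cross-validation in R); it substitutes a simple ridge regression in Python, and all dimensions and parameter values are illustrative only.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy data: 112 observations, as in the pilot study, but only 20 variables
# (the real data set has ~16500). Signal is placed in the first variable.
n, p = 112, 20
X = rng.standard_normal((n, p))
y = 2.0 * X[:, 0] + rng.standard_normal(n)

def cv_q2(X, y, k=7):
    """Q2 (cross-validated R2) of a ridge model over k folds.

    Stand-in for one full cross-validated model fit; in the actual
    pipeline this would be a repeated double cross-validated PLS model.
    """
    idx = rng.permutation(len(y))
    folds = np.array_split(idx, k)
    press = 0.0
    ss = np.sum((y - y.mean()) ** 2)
    lam = 1.0  # illustrative ridge penalty
    for test in folds:
        train = np.setdiff1d(idx, test)
        # Ridge solution: (X'X + lambda*I)^-1 X'y on the training fold
        A = X[train].T @ X[train] + lam * np.eye(X.shape[1])
        w = np.linalg.solve(A, X[train].T @ y[train])
        press += np.sum((y[test] - X[test] @ w) ** 2)
    return 1.0 - press / ss

q2_obs = cv_q2(X, y)

# Permutation null: rerun the whole cross-validation on shuffled labels.
# Real analyses need >= 1000 permutations, hence the computational load.
n_perm = 50
null = np.array([cv_q2(X, rng.permutation(y)) for _ in range(n_perm)])
p_val = (np.sum(null >= q2_obs) + 1) / (n_perm + 1)
print(f"Q2 = {q2_obs:.3f}, permutation p = {p_val:.4f}")
```

Because each permutation is an independent rerun of the same fitting procedure, the workload is embarrassingly parallel, which is exactly what makes a cluster or other advanced computing infrastructure attractive for this project.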