Data Preprocessing for Untargeted Metabolomics

Overview of Non-targeted Metabolomics

Non-targeted (untargeted) metabolomics involves comprehensive profiling of the metabolites present in biological samples without prior knowledge of their identities. This approach leverages advanced analytical techniques such as mass spectrometry and nuclear magnetic resonance spectroscopy to generate complex datasets containing qualitative and quantitative information on metabolite identity and abundance.

Why Is Data Preprocessing Needed?

Non-targeted metabolomics generates vast amounts of raw data, capturing diverse metabolite profiles from biological samples. However, the inherent complexity and variability within these datasets necessitate meticulous data preprocessing to ensure data quality, reliability, and meaningful interpretation.

Enhancing Data Quality and Reliability

The primary objective of data preprocessing in non-targeted metabolomics is to enhance data quality and reliability. Raw metabolomics data often contain technical artifacts, noise, and outlier measurements that can skew subsequent statistical analyses and biological interpretations. By systematically applying preprocessing steps such as outlier filtering and missing value handling, researchers can minimize the impact of these anomalies, thereby improving the accuracy and robustness of downstream analyses.

Facilitating Statistical Analyses and Interpretation

Effective data preprocessing lays the foundation for rigorous statistical analyses in non-targeted metabolomics. Preprocessed data sets are better suited for parametric and non-parametric tests, multivariate statistical approaches (e.g., principal component analysis, clustering), and pathway enrichment analyses. By normalizing data distributions and minimizing data variability through preprocessing, researchers can confidently identify significant metabolite changes associated with biological conditions or experimental treatments.

Enabling Biomarker Discovery and Validation

Data preprocessing is crucial for biomarker discovery in non-targeted metabolomics studies. Reliable identification and validation of biomarkers require consistent and reproducible metabolite measurements across samples. Preprocessing steps such as normalization ensure that biological variations are distinguished from technical variability, improving the sensitivity and specificity of biomarker detection. This process is essential for translating metabolomics findings into clinically relevant biomarkers for disease diagnosis, prognosis, and therapeutic monitoring.

Mitigating Technical Biases and Experimental Variability

Non-targeted metabolomics experiments are susceptible to various technical biases and experimental variability introduced by sample preparation, instrument calibration, and data acquisition protocols. Data preprocessing addresses these challenges by standardizing data formats, correcting batch effects, and harmonizing metabolite measurements across different experimental conditions or analytical platforms. By minimizing these sources of variability, preprocessing enhances the reproducibility and comparability of metabolomics data, facilitating robust scientific conclusions and data integration across studies.

Supporting Multi-omics Integration and Systems Biology Approaches

Integrating metabolomics data with other omics datasets (e.g., genomics, transcriptomics, proteomics) is essential for comprehensive systems biology investigations. Data preprocessing ensures that metabolomics data are compatible with other omics data types, enabling holistic analyses of biological pathways, network interactions, and regulatory mechanisms. Consistent preprocessing workflows facilitate data harmonization and integration, empowering researchers to uncover complex biological insights that span multiple molecular layers and biological scales.

Pipeline creation using the model for sources of variation (Riquelme et al., 2023).

Outlier Filtering in Untargeted Metabolomics Data Preprocessing

Outliers are data points within untargeted metabolomics datasets that significantly deviate from the expected range of measurements. These outliers can arise due to various factors, including technical errors during sample preparation, instrumental noise, or biological variability. Effective outlier filtering is essential to ensure the accuracy, reliability, and integrity of metabolomics data for subsequent statistical analyses and biological interpretations.

In untargeted metabolomics, outliers are typically identified based on statistical metrics calculated from quality control (QC) samples or across experimental replicates. Common measures used to detect outliers include the relative standard deviation (RSD), which quantifies the variability of metabolite measurements relative to their mean or median values.

  • Relative Standard Deviation (RSD): RSD is computed as the ratio of the standard deviation to the mean (or median) of metabolite measurements across QC samples. Metabolites with high RSD values (e.g., RSD > 0.3) are considered unstable and are flagged as potential outliers; a short calculation sketch follows this list.
  • Impact of Outliers: Outliers can skew statistical analyses by disproportionately influencing measures such as mean, standard deviation, and correlation coefficients. They can obscure genuine biological variations or lead to erroneous conclusions if not appropriately identified and filtered.
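
The RSD screen can be implemented in a few lines. The sketch below is a minimal illustration, assuming a pandas DataFrame named qc_data (rows = QC injections, columns = metabolite features) holding peak intensities; the variable names and the 0.3 cutoff are illustrative rather than prescribed.

```python
import pandas as pd

def flag_unstable_features(qc_data: pd.DataFrame, rsd_threshold: float = 0.3) -> pd.Series:
    """Return a boolean Series marking features whose RSD across QC samples
    exceeds the threshold and are therefore candidates for removal."""
    mean = qc_data.mean(axis=0)
    std = qc_data.std(axis=0, ddof=1)
    rsd = std / mean  # relative standard deviation per metabolite feature
    return rsd > rsd_threshold

# Hypothetical usage: drop unstable features from the study samples,
# assuming `samples` shares the same feature columns as `qc_data`.
# unstable = flag_unstable_features(qc_data)
# samples_filtered = samples.loc[:, ~unstable]
```

Basing the filter on QC-derived RSD keeps the decision independent of the biological samples themselves, so genuine biological variability is not mistaken for technical instability.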

Methods for Outlier Filtering

Several methods can be employed to filter outliers in untargeted metabolomics datasets:

  • Z-score Method: This statistical technique identifies outliers based on the number of standard deviations a data point is away from the mean. Data points exceeding a predefined threshold (e.g., ±3 standard deviations) are flagged as outliers.
  • Modified Z-score Method: This variant replaces the mean and standard deviation with the median and the median absolute deviation (MAD), making it robust to the skewed, non-normal distributions common in metabolomics data. Both variants are sketched after this list.
  • RSD-based Filtering: Utilizing RSD calculated from QC samples to establish a threshold for outlier detection. Metabolites exceeding the predefined RSD threshold are excluded from further analysis.
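
As a rough illustration of the first two methods, the sketch below flags outlying measurements of a single metabolite. It assumes a 1-D NumPy array named values, and the thresholds of 3 and 3.5 are common defaults rather than fixed rules.

```python
import numpy as np

def zscore_outliers(values: np.ndarray, threshold: float = 3.0) -> np.ndarray:
    """Flag points lying more than `threshold` standard deviations from the mean."""
    z = (values - values.mean()) / values.std(ddof=1)
    return np.abs(z) > threshold

def modified_zscore_outliers(values: np.ndarray, threshold: float = 3.5) -> np.ndarray:
    """Flag points using the median and median absolute deviation (MAD),
    which tolerate skewed, non-normal metabolite distributions."""
    median = np.median(values)
    mad = np.median(np.abs(values - median))
    modified_z = 0.6745 * (values - median) / mad
    return np.abs(modified_z) > threshold
```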

Missing Value Handling in Untargeted Metabolomics Data Preprocessing

Missing values are a common challenge in untargeted metabolomics datasets, arising from various factors such as instrumental limitations, detection thresholds, or sample-specific issues. Proper handling of missing values is crucial to ensure the completeness, reliability, and integrity of metabolomics data for subsequent statistical analyses and biological interpretations.

Missing values can complicate data analysis and interpretation in untargeted metabolomics research:

  • Biased Data Analysis: Incomplete datasets can bias statistical measures such as mean, variance, and correlation coefficients, leading to inaccurate assessments of metabolite abundances or associations.
  • Reduced Statistical Power: Missing values reduce the effective sample size available for analysis, potentially compromising the statistical power and reliability of findings.
  • Limitations in Biomarker Discovery: Unaddressed missing values may obscure significant metabolite changes or biomarkers associated with biological conditions or experimental treatments.

Strategies for Missing Value Handling

Several approaches can be employed to effectively handle missing values in untargeted metabolomics datasets:

Exclusion of Metabolites: Exclude metabolites with a high proportion of missing values (e.g., > 50%) from further analysis to focus on metabolites with robust and complete data across samples.
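
As a minimal sketch of this exclusion step, assume a pandas DataFrame named data with samples as rows, metabolites as columns, and undetected values stored as NaN; the 50% cutoff below is illustrative.

```python
import pandas as pd

def drop_sparse_metabolites(data: pd.DataFrame, max_missing_fraction: float = 0.5) -> pd.DataFrame:
    """Keep only metabolites quantified in at least (1 - max_missing_fraction)
    of the samples."""
    missing_fraction = data.isna().mean(axis=0)  # per-metabolite missing rate
    return data.loc[:, missing_fraction <= max_missing_fraction]
```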

Imputation Methods: Imputation involves estimating missing values based on statistical models or algorithms, aiming to provide plausible replacements for missing data points. Common imputation methods include:

  • Mean or Median Imputation: Replace missing values with the mean or median value of the metabolite across all samples. This simple approach assumes values are missing at random; because it shrinks the apparent variance of a metabolite, it is best reserved for features with few missing values.
  • K-Nearest Neighbors (KNN) Imputation: KNN imputation estimates each missing value from the values of the most similar samples, where similarity is measured by Euclidean distance or another metric in metabolite space; a brief sketch appears after this list.
  • Singular Value Decomposition (SVD): SVD-based imputation leverages matrix factorization techniques to reconstruct missing values based on the underlying structure of metabolomics data matrices.
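
As a minimal sketch of median and KNN imputation, assume the filtered DataFrame data from the exclusion step above (samples as rows, metabolites as columns, NaN for missing values); scikit-learn's KNNImputer is one widely used implementation and is shown here with its default NaN-aware Euclidean distance.

```python
import pandas as pd
from sklearn.impute import KNNImputer

# Median imputation: replace each missing value with that metabolite's median.
median_imputed = data.fillna(data.median())

# KNN imputation: estimate each missing value from the k most similar samples.
imputer = KNNImputer(n_neighbors=5)
knn_imputed = pd.DataFrame(imputer.fit_transform(data),
                           index=data.index, columns=data.columns)
```

Whichever method is chosen, the considerations on validation and QC integration described below still apply.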

Considerations for Effective Missing Value Imputation

Choosing an appropriate imputation method in untargeted metabolomics requires careful consideration of dataset characteristics, experimental design, and the potential impact on downstream analyses:

  • Data Distribution: Assessing the distribution of missing values and selecting imputation methods that align with the data's characteristics (e.g., normality, skewness) to minimize bias.
  • Validation and Sensitivity Analysis: Conducting sensitivity analyses to evaluate the robustness of imputation results and assessing their impact on statistical outcomes and biological interpretations.
  • Integration with Quality Control (QC) Measures: Incorporating QC samples and rigorous validation protocols to ensure imputation methods do not introduce artifacts or compromise data quality.

Data Normalization in Untargeted Metabolomics Data Preprocessing

Data normalization is a critical step in untargeted metabolomics data preprocessing aimed at reducing systematic variation introduced by experimental conditions, sample preparation, and instrumental settings. Normalization methods adjust metabolite measurements to ensure comparability across samples, thereby facilitating meaningful statistical analyses and biological interpretations.

Normalization is essential in untargeted metabolomics to address the following key challenges:

  • Technical Variability: Differences in sample volumes, injection volumes, and instrument responses can lead to systematic biases in metabolite measurements. Normalization mitigates these variations, allowing for more accurate comparisons of metabolite abundances across samples.
  • Enhanced Statistical Power: By reducing technical noise and variability, normalization improves the sensitivity and statistical power of analyses. It enables the detection of genuine biological differences and patterns amidst experimental noise.
  • Facilitates Comparative Analyses: Normalized data enable researchers to compare metabolite profiles across different biological conditions, experimental groups, or time points. This facilitates the identification of statistically significant metabolite changes associated with specific treatments or physiological states.

Common Methods of Data Normalization

Several approaches are employed to normalize untargeted metabolomics data:

  • Internal Standard Normalization: Utilizes stable isotopically labeled internal standards added to each sample before analysis. The intensity of internal standards serves as a reference for normalizing metabolite abundances across samples, correcting for variations in sample preparation and instrument performance.
  • Summation or Total Ion Current (TIC) Normalization: Calculates the total ion current or sum of all detected metabolite intensities within each sample. Each metabolite's intensity is then expressed as a fraction of the total, providing a relative measure of its abundance in the context of the overall metabolic profile.
  • Median or Quantile Normalization: Adjusts metabolite measurements so that their distributions are aligned across samples by equalizing medians or quantiles. This method is particularly useful for datasets with non-normal or skewed distributions; TIC and median normalization are sketched after this list.
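
As a minimal sketch of the last two approaches, assume an imputed DataFrame named data with samples as rows and metabolites as columns; which normalization is appropriate depends on the study design and should be checked against QC samples, as discussed in the considerations below.

```python
import pandas as pd

# Total ion current (TIC) normalization: express each metabolite as a
# fraction of the summed intensity within its own sample.
tic_normalized = data.div(data.sum(axis=1), axis=0)

# Median normalization: scale every sample so its median intensity matches
# the overall median of the sample medians.
sample_medians = data.median(axis=1)
median_normalized = data.div(sample_medians, axis=0) * sample_medians.median()
```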

Considerations for Effective Data Normalization

Choosing an appropriate normalization method in untargeted metabolomics requires careful consideration of experimental design, data characteristics, and analytical goals:

  • Normalization Strategy Selection: Assessing the suitability of normalization methods based on the data distribution, presence of internal standards, and compatibility with statistical analyses (e.g., parametric versus non-parametric tests).
  • Normalization Controls: Incorporating quality control (QC) samples and reference standards to validate normalization procedures and ensure consistency across analytical runs.
  • Normalization Workflow Integration: Integrating normalization into a comprehensive data preprocessing pipeline alongside outlier filtering, missing value handling, and other quality control measures to maintain data integrity and coherence.

Reference

  1. Riquelme, Gabriel, et al. "Model-driven data curation pipeline for LC–MS-based untargeted metabolomics." Metabolomics 19.3 (2023): 15.