Machine Learning for Pathway Prediction

Predicting metabolic interactions such as pathways, within and between cells, from genomic sequence information is a central problem in biology that links genotype to phenotype. The problem persists because our knowledge of the reactions and pathways operating in cells remains limited, even in model organisms like E. coli, where the majority of protein functions have been determined. To improve pathway prediction for genomes at different levels of complexity and completion, we are developing supervised and semi-supervised machine learning algorithms for metabolic pathway prediction.


Pathway-Centric Genome Analysis

Metabolic inference from genomic sequence information is a basic science problem with a direct impact on our capacity to understand, and ultimately engineer, cells at the individual, population, and community levels of organization. Metabolic interactions can be described in terms of molecular events, or reactions, coordinated within a series or cycle. The set of reactions within and between cells defines a reactome, while sets of linked reactions define pathways within and between cells. Reactomes and pathways can be predicted from primary sequence information and refined using mass spectrometry, both to validate known pathways and to uncover novel ones. This process spans an information hierarchy that ranges from individual organismal genomes to environmental genomes, with differing degrees of complexity and completion.
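
The distinction between a reactome (all reactions) and pathways (sets of linked reactions) can be illustrated with a minimal sketch. It assumes that two reactions are linked when a product of one is a substrate of the other. The reaction names, the metabolites, and the helper functions link_reactions and linked_groups are hypothetical and exist only for illustration. Curated pathways are not simply connected components, and common metabolites such as ATP or water would link nearly everything in real data.

```python
from collections import defaultdict

# Hypothetical toy reactions: name -> (substrates, products).
REACTIONS = {
    "R1": ({"glucose"}, {"g6p"}),
    "R2": ({"g6p"}, {"f6p"}),
    "R3": ({"f6p"}, {"fbp"}),
    "R4": ({"citrate"}, {"isocitrate"}),
}


def link_reactions(reactions):
    """Link reaction A -> B when a product of A is a substrate of B."""
    links = defaultdict(set)
    for a, (_, products_a) in reactions.items():
        for b, (substrates_b, _) in reactions.items():
            if a != b and products_a & substrates_b:
                links[a].add(b)
    return links


def linked_groups(reactions):
    """Group reactions into connected sets, ignoring link direction."""
    undirected = defaultdict(set)
    for a, targets in link_reactions(reactions).items():
        for b in targets:
            undirected[a].add(b)
            undirected[b].add(a)

    seen, groups = set(), []
    for start in sorted(reactions):
        if start in seen:
            continue
        stack, group = [start], set()
        while stack:  # depth-first traversal from the start reaction
            node = stack.pop()
            if node in group:
                continue
            group.add(node)
            stack.extend(undirected[node] - group)
        seen |= group
        groups.append(sorted(group))
    return groups


reactome = set(REACTIONS)          # every reaction, linked or not
groups = linked_groups(REACTIONS)  # candidate pathway-like groupings
print(groups)
```

In this toy example, R1, R2, and R3 form one linked chain, while R4 belongs to the reactome without being linked to the others.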

A common method for determining the metabolic potential encoded in genomes is to map conceptually translated open reading frames onto a database of known product descriptions. Such gene-centric methods are limited in their capacity to predict pathway presence or absence, and they do not support standardized rule sets for automated and reproducible research. Pathway-centric methods based on defined rule sets or machine learning algorithms provide an adjunct or alternative inference method that supports hypothesis generation and testing of metabolic relationships within and between cells. We have developed a series of machine learning algorithms for pathway inference that combine multi-label classification, representation learning, and community detection. These include mlLGPR (multi-label based on logistic regression for pathway prediction); pathway2vec, for feature engineering; and triUMPF (triple non-negative matrix factorization (NMF) with community detection), which combines three stages of NMF to capture relationships between enzymes and pathways within a network, followed by community detection to extract higher-order network structure.
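
To make the idea of multi-label classification with logistic regression concrete, the following is a minimal sketch. It is not the mlLGPR implementation. It assumes a binary relevance setup, with one independent logistic regression per pathway label, trained by plain batch gradient descent. The toy enzyme-presence matrix, the two pathway labels, and all function names and hyperparameters are hypothetical.

```python
import math


def sigmoid(z):
    """Numerically stable logistic function."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def score(w, b, x):
    """Linear score w.x + b for one feature vector."""
    return sum(wj * xj for wj, xj in zip(w, x)) + b


def train_one_label(X, y, lr=0.5, epochs=2000):
    """Fit one binary logistic regression by batch gradient descent."""
    w, b, n = [0.0] * len(X[0]), 0.0, len(X)
    for _ in range(epochs):
        grad_w, grad_b = [0.0] * len(w), 0.0
        for xi, yi in zip(X, y):
            err = sigmoid(score(w, b, xi)) - yi  # gradient of the log loss
            for j, xj in enumerate(xi):
                grad_w[j] += err * xj
            grad_b += err
        w = [wj - lr * g / n for wj, g in zip(w, grad_w)]
        b -= lr * grad_b / n
    return w, b


def train_multilabel(X, Y):
    """Binary relevance: one independent classifier per pathway label."""
    return [train_one_label(X, [row[k] for row in Y]) for k in range(len(Y[0]))]


def predict_multilabel(models, x, threshold=0.5):
    """Return a 0/1 presence call for every pathway label."""
    return [int(sigmoid(score(w, b, x)) >= threshold) for w, b in models]


# Hypothetical toy data: rows are genomes, columns are enzyme presence flags E1..E4.
X = [
    [1, 1, 0, 0],
    [1, 1, 1, 1],
    [0, 0, 1, 1],
    [0, 0, 0, 0],
    [1, 0, 0, 0],
    [0, 0, 1, 0],
]
# Hypothetical labels: pathway A needs E1 and E2; pathway B needs E3 and E4.
Y = [
    [1, 0],
    [1, 1],
    [0, 1],
    [0, 0],
    [0, 0],
    [0, 0],
]

models = train_multilabel(X, Y)
predictions = [predict_multilabel(models, x) for x in X]
print(predictions)
```

Each genome can receive any combination of pathway labels, which is what separates this setup from ordinary single-label classification. Binary relevance ignores dependencies between pathways, so it is only the simplest possible baseline for the approach described above.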