Putting the Pieces Back Together Again
Shotgun metagenomic sequencing with short-read technologies produces a wide distribution of assembled contigs that can be binned into population genomes using intrinsic sequence properties, including G+C content, coverage, and k-mer distribution profiles. The resulting bins represent amalgamations of related genotypes within a sample and can be used for more complete pathway reconstruction and hypothesis testing. We are developing binning algorithms that combine metagenomic and single-cell genomic data to more precisely link the taxonomic and functional information contained within microbial communities from diverse ecosystems.
Metagenome Assembled Genomes
A metagenome is a representation of genomic information that has been extracted from environmental, enrichment or host samples, then sequenced and assembled to produce a multi-organismal dataset. This dataset can be analyzed as a collective metabolic network or subdivided into genomic bins composed of closely related genotypes. These genomic bins, also known as population genomes or metagenome assembled genomes (MAGs), are constructed based on comparisons to reference genomes and on intrinsic nucleotide signatures such as GC content or k-mer frequency distributions. Although MAGs are routinely used to draw linkages between taxonomy and function in microbial communities, binning methods suffer from specific drawbacks, including chimeric assembly artefacts, coverage bias and contamination. Importantly, automated quality assessment of MAGs is not entirely reliable, and in most cases a manual curation step is needed, which can become impractical when dealing with hundreds or thousands of MAGs. Recent methods combine the outputs of multiple binning algorithms to create higher-quality consensus MAGs, but the resulting bins can still be a source of false discovery.
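The intrinsic nucleotide signatures mentioned above can be illustrated with a minimal Python sketch that computes GC content and a tetranucleotide (k = 4) frequency profile for a single contig. The function names `gc_content` and `kmer_profile` are hypothetical and chosen for illustration only; production binners typically also merge reverse-complement k-mers and add coverage information, which this sketch omits.

```python
from collections import Counter
from itertools import product


def gc_content(seq: str) -> float:
    """Fraction of unambiguous bases (A, C, G, T) that are G or C."""
    counts = Counter(seq.upper())
    acgt = sum(counts[base] for base in "ACGT")
    if acgt == 0:
        return 0.0
    return (counts["G"] + counts["C"]) / acgt


def kmer_profile(seq: str, k: int = 4) -> dict:
    """Relative frequency of every k-mer over the ACGT alphabet.

    Windows containing ambiguous bases (for example N) are skipped.
    """
    seq = seq.upper()
    kmers = ["".join(p) for p in product("ACGT", repeat=k)]
    counts = dict.fromkeys(kmers, 0)
    total = 0
    for i in range(len(seq) - k + 1):
        window = seq[i:i + k]
        if window in counts:
            counts[window] += 1
            total += 1
    if total == 0:
        return {kmer: 0.0 for kmer in kmers}
    return {kmer: n / total for kmer, n in counts.items()}


if __name__ == "__main__":
    contig = "ACGTACGTNNGGCC"  # toy sequence, not real data
    print(f"GC content: {gc_content(contig):.2f}")
    profile = kmer_profile(contig)
    print(f"Number of tetranucleotide features: {len(profile)}")
```

Each contig is thereby reduced to a fixed-length numeric vector (256 tetranucleotide frequencies plus GC content), which is the kind of representation that clustering-based binning operates on.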
A more direct way to access genomic information from environmental or host samples is to generate single-cell amplified genomes (SAGs) from individual cells. Sorting individual cells, followed by whole-genome amplification and sequencing, makes it possible to resolve genotypic diversity at the individual, population, and community levels of biological organization, providing a confident taxonomic label for each SAG. Thus, SAGs provide a much higher degree of confidence than MAGs when used to draw linkages between taxonomy and function. Despite this advantage, SAGs suffer from incomplete genome coverage (typically less than 50% of the genome) and amplification bias, both of which can limit their resolving power in metabolic reconstruction. We are developing the SAG anchored binner (SABer) application, which applies robust data mining and machine learning methods to paired SAGs and MAGs to produce more complete population genome assemblies. Sequence properties for each SAG are compiled to form a reference seed, allowing for large-scale, automated surveys of datasets with labelled SAG reference seeds. These reference seeds are compared to assembled metagenomes from the same sample to recruit matching sequences with high precision, resulting in metabolic reconstructions that are more complete than those built from incomplete SAGs or MAGs in isolation.
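The idea of recruiting metagenome contigs against a SAG reference seed can be sketched as follows. This is not the SABer implementation, whose methods are not detailed here. It is a simplified stand-in that assumes the seed is a pooled tetranucleotide count vector and that recruitment is decided by cosine similarity against a fixed threshold. The names `kmer_vector`, `cosine`, `recruit` and the `min_similarity` parameter are all hypothetical.

```python
import math
from itertools import product

K = 4  # assumed k-mer size for this sketch
KMERS = ["".join(p) for p in product("ACGT", repeat=K)]
INDEX = {kmer: i for i, kmer in enumerate(KMERS)}


def kmer_vector(seqs):
    """Pooled k-mer count vector over one or more sequences."""
    vec = [0] * len(KMERS)
    for seq in seqs:
        seq = seq.upper()
        for i in range(len(seq) - K + 1):
            j = INDEX.get(seq[i:i + K])  # None for windows with ambiguous bases
            if j is not None:
                vec[j] += 1
    return vec


def cosine(a, b):
    """Cosine similarity between two count vectors (0.0 if either is empty)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def recruit(sag_contigs, metagenome_contigs, min_similarity=0.9):
    """Return IDs of metagenome contigs whose profile resembles the SAG seed.

    sag_contigs: list of sequences from one SAG (forms the reference seed).
    metagenome_contigs: dict mapping contig ID to sequence.
    """
    seed = kmer_vector(sag_contigs)
    recruited = []
    for contig_id, seq in metagenome_contigs.items():
        if cosine(seed, kmer_vector([seq])) >= min_similarity:
            recruited.append(contig_id)
    return recruited


if __name__ == "__main__":
    # Toy sequences, not real data.
    sag = ["ACGTACGTACGTACGT"]
    metagenome = {"c1": "ACGTACGTACGT", "c2": "TTTTTTTTTTTT"}
    print(recruit(sag, metagenome))
```

In practice the threshold would need to be tuned per dataset, and short contigs give noisy k-mer profiles, so a minimum contig length filter would usually be applied before recruitment.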