MetaPathways Use & Setup

Running MetaPathways

1. Setting Parameters: Preparing for your MetaPathways run
Before starting our first run of the pipeline, we will again look at the parameters contained in template_param.txt. This file holds the instructions and settings for each step of the pipeline. Many of its default settings are general and should be adequate for many metagenomic analyses. However, you will often need to change them to reflect the questions and goals you have for your specific dataset.

Each setting in this file is a parameter name followed by its value, separated by spaces; multiple values are separated by commas:
parameter value
parameter value1,value2,...
INPUT: format fasta — specifies the type of input file. Possible values are fasta, gff-annotated, gff-unannotated, gbk-annotated, and gbk-unannotated. Annotated and unannotated indicate whether the General Feature Format (gff) or GenBank (gbk) input files already contain gene annotations.
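The parameter/value layout described above can be read with a few lines of Python. This is only a minimal sketch, not the parser MetaPathways itself uses: the function name parse_params, the treatment of lines starting with # as comments, and the example values and database names are all assumptions made for illustration.

```python
def parse_params(text):
    """Parse 'parameter value' lines into a dict.

    A value containing commas becomes a list of strings; a single value
    stays a plain string. Blank lines and lines starting with '#' are
    ignored (the comment convention is an assumption).
    """
    params = {}
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#"):
            continue
        # Split once on whitespace: parameter name first, the rest is the value.
        parts = line.split(None, 1)
        name = parts[0]
        value = parts[1].strip() if len(parts) > 1 else ""
        values = [v.strip() for v in value.split(",") if v.strip()]
        if len(values) > 1:
            params[name] = values
        else:
            params[name] = values[0] if values else ""
    return params


# Hypothetical settings, for illustration only.
example = """\
quality_control:min_length 180
annotation:dbs db_one,db_two
"""
params = parse_params(example)
print(params)
```

Values are kept as strings so the caller decides how each parameter should be interpreted (integer, float, or name).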

QC parameters
quality_control:min_length — specifies the minimum number of nucleotides a sequence must have during the QC phase.
quality_control:delete_replicates — removes duplicate sequences from input.
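To make the effect of the two QC parameters concrete, here is a small sketch of a filter that applies both. It is an illustration under stated assumptions, not the pipeline's own QC code: the function name quality_control is hypothetical, sequences are given as (name, sequence) pairs, and duplicates are compared case-insensitively.

```python
def quality_control(records, min_length, delete_replicates=True):
    """Return the (name, sequence) pairs that survive QC.

    records           -- iterable of (name, sequence) pairs
    min_length        -- minimum number of nucleotides a sequence must have
    delete_replicates -- if True, keep only the first copy of each sequence
    """
    seen = set()
    kept = []
    for name, seq in records:
        if len(seq) < min_length:
            continue  # too short
        key = seq.upper()  # assumption: replicates are compared ignoring case
        if delete_replicates:
            if key in seen:
                continue  # duplicate of an earlier sequence
            seen.add(key)
        kept.append((name, seq))
    return kept


reads = [
    ("a", "ACGTACGT"),
    ("b", "ACG"),        # shorter than min_length
    ("c", "acgtacgt"),   # replicate of "a"
    ("d", "TTTTTTTT"),
]
print(quality_control(reads, min_length=5))
```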

ORF prediction parameters
orf_prediction:algorithm — specifies the ORF prediction algorithm that is used. Currently only Prodigal is available.
orf_prediction:min_length — specifies the minimum number of amino acids in a predicted ORF.

Annotation parameters
annotation:algorithm — specifies which homology search algorithm to use for ORF annotation. Current options are blast and last; last is a more efficient implementation of the seed-and-extend approximation algorithm.
annotation:dbs — specifies which protein databases will be used for protein BLAST annotation, and in what order. Database names are separated by commas, and the names must exactly match the naming convention in the BLAST database folder blastDB/. For example, the compiled database consisting of files like refseq_protein_20130204.00.pni would have the name
annotation:min_bsr — specifies the minimum BLAST-score ratio (BSR) threshold. Only hits with a ratio greater than this threshold will be kept.
annotation:max_evalue — specifies the maximum e-value threshold. Only e-values smaller (more statistically significant) than this threshold will be kept.
annotation:min_score — specifies the minimum bit-score threshold. Only hits greater than this score will be kept.
annotation:min_length — specifies the minimum length threshold. Only annotations with a greater length will be kept.
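The four annotation thresholds act together as a filter on homology search hits. The sketch below shows one way to express that. It assumes the usual definition of the BLAST-score ratio (a hit's bit-score divided by the bit-score of the query aligned against itself) and uses strict comparisons to mirror the wording above. The Hit class, the passes function, and the threshold values are illustrative assumptions, not MetaPathways internals or recommended settings.

```python
from dataclasses import dataclass


@dataclass
class Hit:
    query: str
    target: str
    bitscore: float
    evalue: float
    length: int  # alignment length


def passes(hit, self_score, min_bsr, max_evalue, min_score, min_length):
    """Return True if the hit clears all four annotation thresholds.

    self_score is the bit-score of the query aligned against itself,
    used as the denominator of the BLAST-score ratio (BSR).
    """
    bsr = hit.bitscore / self_score if self_score > 0 else 0.0
    return (
        bsr > min_bsr
        and hit.evalue < max_evalue
        and hit.bitscore > min_score
        and hit.length > min_length
    )


# Illustrative thresholds only.
thresholds = dict(min_bsr=0.4, max_evalue=1e-6, min_score=20, min_length=60)

strong = Hit("orf1", "protein_a", bitscore=120.0, evalue=1e-30, length=150)
weak = Hit("orf1", "protein_b", bitscore=50.0, evalue=1e-3, length=80)

print(passes(strong, self_score=200.0, **thresholds))
print(passes(weak, self_score=200.0, **thresholds))
```

Because the BSR is normalized by the self-score, it stays comparable across ORFs of different lengths, which a raw bit-score cutoff alone does not.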

RNA parameters
Analogous to the protein BLAST settings above:
rRNA:refdbs — specifies the databases to be searched against. These database names must match the names of the nucleotide BLAST databases found in the blastDB/ folder specified in the pipeline configuration file.
rRNA:max_evalue — sets the 16S rRNA maximum expect value (e-value) threshold. Only hits with an e-value less than this threshold (more statistically significant) will be kept.
rRNA:min_identity — sets the minimum percent identity threshold. Only annotations with a greater percent identity with the query sequence will be kept.
rRNA:min_bitscore — only annotations with bit-scores greater than this minimum threshold will be kept.

Grid Settings
Settings associated with running protein homology searches on the grid.
grid engine:batch size — specifies the number of sequences to be included in each grid job. This should be set to respect the memory and CPU time requirements of the grid you are using.
grid engine:max concurrent batches — sets the maximum number of jobs to be submitted to a grid at one time. MetaPathways will maintain a job queue of this size waiting to be scheduled.
grid engine:walltime — sets the maximum amount of time an individual job can take. Setting this value too high can delay the scheduling of your jobs by the SunGrid scheduler. Setting it too low lets your job be scheduled, but the job will be stopped before completion.
grid engine:RAM — the maximum RAM usage for the job. This can also affect the scheduling of your jobs, and becomes an issue for larger databases such as RefSeq.
grid engine:user — username used to access the grid via ssh.
grid engine:server — the address of the compute grid accessed via ssh.
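The interaction between batch size and the maximum number of concurrent batches can be illustrated with a small simulation. This sketch does not talk to any grid and is not how MetaPathways submits jobs; make_batches and run_queue are hypothetical helpers, and the simulation simply assumes the oldest submitted batch always finishes first.

```python
from collections import deque


def make_batches(sequences, batch_size):
    """Split a list of sequences into consecutive batches of at most batch_size."""
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    return [sequences[i:i + batch_size] for i in range(0, len(sequences), batch_size)]


def run_queue(batches, max_concurrent):
    """Simulate submission with at most max_concurrent batches queued at once.

    Returns (order in which batch indices finished, peak queue size).
    """
    if max_concurrent < 1:
        raise ValueError("max_concurrent must be at least 1")
    pending = deque(range(len(batches)))
    active = deque()
    finished = []
    peak = 0
    while pending or active:
        # Top the queue up to the allowed size.
        while pending and len(active) < max_concurrent:
            active.append(pending.popleft())
        peak = max(peak, len(active))
        # Assumption: the oldest submitted batch completes first.
        finished.append(active.popleft())
    return finished, peak


batches = make_batches(list(range(10)), batch_size=4)
finished, peak = run_queue(batches, max_concurrent=2)
print(len(batches), finished, peak)
```

Smaller batches give the scheduler shorter jobs that fit a lower walltime and RAM request, at the cost of more submissions.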

Pathway Tools parameters
ptools_settings:taxonomic_pruning [yes/no] — specifies whether the ePGDB in Pathway Tools should be built with taxonomic pruning enabled (yes) or disabled (no).

2. Pipeline Execution Flags: yes, skip, stop, redo, grid
For each step of the pipeline one must specify one of the following actions:
yes — perform the operation with the above settings
skip — do not perform this operation (note that this could cause later dependent steps in the pipeline to fail)
stop — stop the pipeline run after completing the previous step
redo — recompute a specific step of the pipeline (after incomplete execution or error may have corrupted the output)
grid — compute this step on the grid. Currently only available for the BLAST/LAST homology search step
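The five flags can be read as a simple plan over the ordered pipeline steps. The sketch below turns a list of (step, flag) pairs into the actions that would be taken. It is an illustration of the flag semantics described above, not the pipeline's scheduler; the function plan_steps and the step names are hypothetical.

```python
def plan_steps(steps):
    """Translate (step_name, flag) pairs into planned actions.

    yes  -> run the step
    skip -> do not run it (later dependent steps may fail)
    redo -> recompute the step, replacing earlier output
    grid -> run the step on the grid
    stop -> end the run before this step; nothing after it is planned
    """
    actions = {"yes": "run", "skip": "skip", "redo": "rerun", "grid": "grid"}
    plan = []
    for name, flag in steps:
        if flag == "stop":
            break  # the previous step was the last one to execute
        if flag not in actions:
            raise ValueError(f"unknown flag {flag!r} for step {name!r}")
        plan.append((name, actions[flag]))
    return plan


# Hypothetical step names, for illustration only.
steps = [
    ("qc", "yes"),
    ("orf_prediction", "redo"),
    ("annotation", "grid"),
    ("rrna_scan", "skip"),
    ("pathway_tools", "stop"),
    ("report", "yes"),
]
for name, action in plan_steps(steps):
    print(name, action)
```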

3. Starting a Run
The MetaPathways pipeline is run using the script from the command line:

$ ./ -i [input file/folder] -o [output directory] -c [config file] -p [parameter file] -r [overwrite/overlay]
$ ./ -i testdata/ -o ~/MetaPathways/output -c ~/MetaPathways/template_config.txt -p ~/MetaPathways/template_param.txt -r overlay

-i specifies the input directory or a specific .fasta file
-o specifies the output directory
-c the configuration file to be used for this run
-p the parameter file to be used for this run
-r the run-style to be used for this run:
overlay - checks for an existing run in place and reuses existing files as it finds them, unless the pipeline step is set to redo
overwrite - overwrites existing output
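When launching runs from another program, the options listed above can be assembled into an argument list before being handed to a process launcher such as subprocess.run. This is a sketch only: build_command is a hypothetical helper, and "./pipeline_script" is a placeholder for the pipeline script, not its real name. Only the -i, -o, -c, -p and -r options come from the usage line above.

```python
import shlex


def build_command(script, input_path, output_dir, config_file, param_file,
                  run_style="overlay"):
    """Build the argument list for one pipeline run.

    script is the path to the pipeline script (supplied by the caller);
    run_style must be 'overlay' or 'overwrite'.
    """
    if run_style not in ("overlay", "overwrite"):
        raise ValueError("run_style must be 'overlay' or 'overwrite'")
    return [
        script,
        "-i", input_path,
        "-o", output_dir,
        "-c", config_file,
        "-p", param_file,
        "-r", run_style,
    ]


# "./pipeline_script" is a placeholder, not the actual script name.
cmd = build_command(
    "./pipeline_script",
    "testdata/",
    "output",
    "template_config.txt",
    "template_param.txt",
)
print(shlex.join(cmd))  # shell-quoted form, for logging (Python 3.8+)
```

Passing the list to a launcher without a shell avoids quoting problems with paths that contain spaces.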
Note: The script will do a simple run on sequences in the testdata/ folder: