MetaPathways

Download


MetaPathways 1.0 comes in a number of flavors:

MetaPathways 1.0 Requirements:
  • Linux/Unix (64-bit), Mac OSX 10.6.x or later, or Windows (64-bit)
  • Python 2.x
  • Perl 5.x
  • GCC
  • Pathway Tools v15.5 or later

Licence:
  • GNU GPL
  • Academic licenses are required for NCBI BLAST and Pathway Tools

Installation

A downloadable version of this installation page can be found here.
1. Downloading MetaPathways.
Download the zip file MetaPathways_v1.0.zip from http://hallam.microbiology.ubc.ca/MetaPathways/ or the GitHub releases page. After you have downloaded the file, unzip it and inspect the contents of the MetaPathways/ folder (Figure 1).
Figure 1 - An example of the MetaPathways/ folder from the MetaPathways_v1.0.zip file. Notice that the folder has a number of different files and folders inside it. The template configuration (template_config.txt) and parameter configuration (template_param.txt) files are used to configure and set parameter settings of each of the analytical steps of the pipeline. Additionally, the Python script, MetaPathways.py, is used to start the pipeline.
A tour of the MetaPathways folder:
  • blastDB/ - place where BLAST databases are stored along with name-mapping and taxonomic support files for specific databases like KEGG and COG
  • daemon.py - a script that carries out external operations on supercomputing grids using the Sun Grid engine
  • executables/ - contains various analytical and data handling programs that process the inputs and outputs of different steps of the pipeline, e.g. BLAST, Prodigal, tRNAscan-SE, etc.
  • libs/ - the code library folder contains different Perl and Python functions and code that coordinate different steps of the pipeline
  • MetaPathways.py - the starter script/program that runs the pipeline with specific configuration and parameter settings for each of the steps
  • MetaPathwaysrc - a unix source file that ensures that the computer system knows where the MetaPathways/ folder is, sets the local Python and Perl paths, and compiles some executable code
  • template_config.txt - a configuration file that specifies the location of different programming resources on the computer, e.g. the location of BLAST databases, Perl, Python, etc.
  • template_header.txt - a template header for output GenBank (.gbk) files
  • template_param.txt - a parameter file that specifies the analytical settings for all pipeline steps, e.g. BLAST cut-offs, steps to include in a run of the analysis, what order to annotate databases in, etc.
  • testdata/ - contains some simple .fasta files to do a dry-run to ensure that everything in the pipeline is working properly
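
As a quick sanity check after unzipping, the folder tour above can be turned into a small script. This is only a sketch: it assumes the folder was placed at ~/MetaPathways, and the function name missing_entries is made up for illustration. It is not part of MetaPathways itself.

```python
from __future__ import print_function
import os

# Names taken from the folder tour above; entries ending in "/" are folders.
EXPECTED_ENTRIES = [
    "blastDB/", "daemon.py", "executables/", "libs/", "MetaPathways.py",
    "MetaPathwaysrc", "template_config.txt", "template_header.txt",
    "template_param.txt", "testdata/",
]


def missing_entries(root):
    """Return the expected files/folders that are absent from `root`."""
    missing = []
    for name in EXPECTED_ENTRIES:
        path = os.path.join(root, name.rstrip("/"))
        if name.endswith("/"):
            is_present = os.path.isdir(path)
        else:
            is_present = os.path.isfile(path)
        if not is_present:
            missing.append(name)
    return missing


# Assumed install location; adjust if the folder was unzipped elsewhere.
root = os.path.expanduser("~/MetaPathways")
print("Missing from {0}: {1}".format(root, missing_entries(root) or "nothing"))
```

An empty result suggests the archive was extracted completely; any listed name points to a file or folder worth re-extracting.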

For simplicity we are going to perform this installation out of the user home folder, /Users/[username]/ by default on OSX. In unix commands the tilde character ~ stands for your home directory. On OSX systems the home folder can be found through any of the following:
  • Double-click the “Macintosh HD” on the Desktop
  • Right-click (control-click) the “Finder” icon in the Dock and select “New Finder Window”
  • Left-click the “Finder” icon and press (command + n)
  • Go to home from any finder folder by pressing (shift + command + h)
Drag-and-drop the newly extracted MetaPathways/ folder into the home directory. It should appear as ~/MetaPathways/ when accessed through the terminal.

MetaPathways requires the use of the unix command-line terminal to run. On OSX systems this is done through the “Terminal” program located in:
  • Applications > Utilities > Terminal
You may want to place this program on your OSX Dock for future convenience.
2. Installing programming languages Python, Perl, and GCC.
Install the required Python 2.x, Perl 5.x, and GCC compiler. For OSX users, these are all contained within the current release of Xcode 4, which can be obtained for free from https://developer.apple.com/xcode/ or from the Apple App Store in recent releases of OSX. Alternatively, Perl and Python installation files and documentation can be obtained from their respective websites.
They can also be obtained through a package management system like Synaptic. Many Unix distributions, like the popular Ubuntu, include versions of Python, Perl, and GCC by default, but you will want to ensure that they are the proper versions.

In many instances, installing new programming languages is quite low-level from an OS perspective, and may require some discussion with your local system administrator. A restart of the computer might also be required. It is also a good idea to open the terminal after installation to check if these installations made it to your system’s $PATH variable using the which command:
# tests to see if perl is included in your Unix $PATH variable
$ which perl
/usr/bin/perl
$ which python
/usr/bin/python
$ which gcc
/Developer/usr/bin/gcc
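
The same check can be scripted, which is handy when several machines need to be set up. The sketch below mimics what the which command does by walking the folders in $PATH. The helper name find_on_path is an illustrative assumption, and the code is written to run under either Python 2.7 or Python 3.

```python
from __future__ import print_function
import os


def find_on_path(program, path=None):
    """Return the first executable named `program` on `path`, or None."""
    if path is None:
        path = os.environ.get("PATH", "")
    for folder in path.split(os.pathsep):
        candidate = os.path.join(folder, program)
        # Accept only regular files that the current user may execute.
        if os.path.isfile(candidate) and os.access(candidate, os.X_OK):
            return candidate
    return None


for tool in ("perl", "python", "gcc"):
    print("{0}: {1}".format(tool, find_on_path(tool) or "NOT FOUND on $PATH"))
```

A tool reported as not found is either not installed or installed in a folder that still has to be added to $PATH.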
3. Install Pathway Tools
One of the final steps of the MetaPathways pipeline uses the software Pathway Tools to build a Pathway/Genome Database (PGDB) from environmental nucleotide sequences. The Pathway Tools software can be obtained directly from SRI International and requires an academic licence (http://biocyc.org/download.shtml). This is free for academic users, and approval usually takes 1-2 business days. Questions about licensing can be emailed to ptools-info@ai.sri.com. SRI International provides installation instructions for OSX and Unix, and the software is extensively documented at its homepage: http://bioinformatics.ai.sri.com/ptools/. Once your licence is approved, you will receive an email from the Pathway Tools group that allows you to download the software (Figure 2).
Figure 2 - Table of the available versions of Pathway Tools. For most people starting out, the versions circled in red, just containing EcoCyc and MetaCyc, will be sufficient. Additional databases from within the BioCyc umbrella are available for download individually through the internal P2P function of Pathway Tools.
In short, you will obtain an install file like pathway-tools-XX.X-macosx-tier1-install.dmg. Mounting this disk image places a folder on the desktop containing a file that starts an installation wizard (Figure 3).
Figure 3 - The Pathway Tools 16.0 install wizard for OSX. We recommend that installation defaults are followed, placing the pathway-tools and ptools-local directories in their default location in the user home folder. On typical Mac OSX installations these are ~/pathway-tools and ~/ptools-local, respectively.
For ease of instruction we encourage the use of the default installation locations of Pathway Tools directories in the standard home folder locations: ~/pathway-tools and ~/ptools-local.

On OSX systems, a window will appear during the Pathway Tools installation prompting you to install xQuartz. This downloads an additional .dmg file to install xQuartz. Allow the installation of xQuartz to finish before continuing with the Pathway Tools installation. On some systems, installation of xQuartz may require a manual restart, so please restart your system before running Pathway Tools for the first time.

After installing Pathway Tools you can launch it from the terminal by executing the following from the command line:
$ cd ~
$ ./pathway-tools/pathway-tools
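
Before launching, it can be useful to confirm that the launcher sits where the default installation puts it. The sketch below assumes the default ~/pathway-tools location recommended above. The function name pathway_tools_launcher is made up for illustration.

```python
from __future__ import print_function
import os


def pathway_tools_launcher(home=None):
    """Return the launcher path if it exists and is executable, else None."""
    if home is None:
        home = os.path.expanduser("~")
    # Default layout: ~/pathway-tools/pathway-tools
    launcher = os.path.join(home, "pathway-tools", "pathway-tools")
    if os.path.isfile(launcher) and os.access(launcher, os.X_OK):
        return launcher
    return None


print("Pathway Tools launcher:", pathway_tools_launcher() or "not found")
```

The path printed here is also a reasonable first place to look when setting PATHOLOGIC_EXECUTABLE later, although the exact value expected there should be confirmed against the template_config.txt file.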
4. BLAST Databases
The Basic Local Alignment Search Tool (BLAST) is used for a number of pipeline steps; specifically the Open Reading Frame (ORF) functional annotation and the taxonomic identification of sequences through RNA homology. Essentially we are searching for similarity between our query sequence and a set of known sequences contained in public databases. In order to perform this step locally you need a copy of the databases on your computer. We only provide the MetaCyc database (metacyc-v5-2011-10-21) which is the same as a file coupled with the Pathway Tools software (uniprot-seq-ids.seq), just reformatted into the common .fasta format.

However, the choice of database often depends on the specific scientific question you are asking. As such, many databases are freely maintained for download from public ftp servers.
Note: These databases are large and they grow in size every day. Downloads add up to gigabytes (GBs), so a high-speed internet connection will be required. Many of these databases are hosted on file transfer protocol (ftp) servers; we recommend Cyberduck http://cyberduck.ch as a free, simple, and user-friendly ftp client.
We will use a directory to store and compile these databases:
  • create a folder named blastDB/ in the MetaPathways/ directory, if it is not already present

We will discuss a number of these here and go through how to obtain them:

Protein Databases

RefSeq - a major protein reference database maintained by the National Center for Biotechnology Information (NCBI) http://www.ncbi.nlm.nih.gov/RefSeq/
  • connect to the BLAST database ftp server ftp://ftp.ncbi.nlm.nih.gov/blast/db
  • download the set of files named refseq_protein.XX.tar.gz, where XX are numbers
  • extract the .tar.gz archives (usually by simply double-clicking on them)
  • MetaPathways requires the original fasta sequences of the RefSeq database to start. Extract the sequences from the refseq_protein BLAST database using blastdbcmd or the older fastacmd, both of which can be found on NCBI's BLAST Software and Databases website:
$ blastdbcmd -db refseq_protein -dbtype prot -entry all -outfmt %f -out Refseq_2013
$ fastacmd -D 1 -d refseq_protein -o Refseq_2013
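
After the extraction finishes, counting the records in the output is a cheap way to confirm that the dump is not empty or truncated. This is a minimal sketch that relies only on the FASTA convention that each record begins with a ">" header line. The function name count_fasta_records is an illustrative assumption.

```python
from __future__ import print_function


def count_fasta_records(path):
    """Count header lines (those starting with '>') in a FASTA file."""
    count = 0
    # Read as bytes so that unusual characters in headers cannot raise errors.
    with open(path, "rb") as handle:
        for line in handle:
            if line.startswith(b">"):
                count += 1
    return count


# Example, using the output name from the commands above:
# print(count_fasta_records("Refseq_2013"))
```

The file is read line by line, so memory use stays small even for a multi-gigabyte database.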
KEGG

The Kyoto Encyclopedia of Genes and Genomes http://www.genome.jp/kegg/ and http://www.bioinformatics.jp/en/keggftp.html. MetaPathways is configured to handle KEGG annotations and provide summary tables. Unfortunately, KEGG now requires a subscription fee to access its databases. However, once sequences are obtained they can be simply placed in the blastDB/ folder.

Nucleotide Taxonomic Databases

Silva - comprehensive ribosomal database project http://www.arb-silva.de/download/
  • navigate links: Download > Archive > Current > Exports
  • download the current SSU database (SSURef_111_NR_tax_silva.fasta.tgz) and the current LSU database (LSURef_111_tax_silva.fasta.tgz)

Greengenes - 16S rRNA gene database and workbench compatible with ARB
  • navigate links: Download > Sequence Data > Fasta_data_files
  • download current_GREENGENES_gg16S_unaligned.fasta.gz

Note: you only need to download the databases in .fasta format and place them in the blastDB/ folder. MetaPathways is programmed to format them automatically on-the-fly.
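
Downloads that arrive as a single gzip-compressed file, such as the Greengenes .fasta.gz above, can be unpacked straight into blastDB/ with a few lines of Python. This is a sketch only; the function name gunzip_into is hypothetical, and the Silva .tgz files are tar archives that this helper does not handle.

```python
import gzip
import os
import shutil


def gunzip_into(archive_path, dest_dir):
    """Decompress `archive_path` (.gz) into `dest_dir`; return the new path."""
    name = os.path.basename(archive_path)
    if name.endswith(".gz"):
        name = name[:-3]  # drop the ".gz" suffix for the output file name
    out_path = os.path.join(dest_dir, name)
    with gzip.open(archive_path, "rb") as src:
        with open(out_path, "wb") as dst:
            # Stream in chunks so large databases do not have to fit in memory.
            shutil.copyfileobj(src, dst)
    return out_path


# Example, assuming the archive was saved to the current folder:
# gunzip_into("current_GREENGENES_gg16S_unaligned.fasta.gz",
#             os.path.expanduser("~/MetaPathways/blastDB"))
```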
5. Configuring the template_config.txt
The template_config.txt file configures the pipeline to find the resources it needs to run. Paths will have to be set for the PERL_EXECUTABLE, PYTHON_EXECUTABLE, PATHOLOGIC_EXECUTABLE, REFDBS, and METAPATHWAYS_PATH.

Direct the Terminal to the MetaPathways/ folder and source the MetaPathwaysrc file. This compiles the Perl and Python code and locates Perl, Python, and the MetaPathways directory for the config file:
$ cd MetaPathways/
$ source MetaPathwaysrc
Checking for Python and Perl:
Python found in /usr/bin/python
Please set variable PYTHON_EXECUTABLE in file template_config.txt as:
PYTHON_EXECUTABLE /usr/bin/python
Perl found in /usr/bin/perl
Please set variable PERL_EXECUTABLE in file template_config.txt as:
PERL_EXECUTABLE /usr/bin/perl
Adding installation folder of MetaPathways to PYTHONPATH
Your MetaPathways is installed in :
Please set variable METAPATHWAYS_PATH in file template_config.txt as:
METAPATHWAYS_PATH /Users/username/MetaPathways
Follow the printed instructions and update the PYTHON_EXECUTABLE, PERL_EXECUTABLE, METAPATHWAYS_PATH, PATHOLOGIC_EXECUTABLE, and SYSTEM keyword in template_config.txt (Figure 4). The METAPATHWAYS_PATH and PATHOLOGIC_EXECUTABLE represent the absolute paths to MetaPathways and Pathway Tools, respectively.
Figure 4 - An example of how to edit the template_config.txt file for MetaPathways setup. In most cases, one only needs to edit the PYTHON_EXECUTABLE, PERL_EXECUTABLE, METAPATHWAYS_PATH, and PATHOLOGIC_EXECUTABLE, and then replace the SYSTEM keyword with either mac, linux, or win depending on the operating system. These fields are highlighted in the red boxes on the left, and potential changes in blue boxes on the right, during an example setup for a Mac OSX operating system.
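
Editing by hand is usually enough, but the update can also be scripted. The sketch below assumes that each setting occupies one line of the form "KEYWORD value" separated by whitespace, as suggested by the lines printed when sourcing MetaPathwaysrc (e.g. PYTHON_EXECUTABLE /usr/bin/python). That layout is an assumption about the file, and set_config_value is a hypothetical helper, not part of MetaPathways.

```python
def set_config_value(lines, key, value):
    """Return a copy of `lines` with `key` set to `value`.

    Assumes one whitespace-separated "KEYWORD value" pair per line;
    every other line (comments, blanks) is passed through unchanged.
    """
    updated = []
    found = False
    for line in lines:
        fields = line.split()
        if fields and fields[0] == key:
            updated.append("{0} {1}\n".format(key, value))
            found = True
        else:
            updated.append(line)
    if not found:
        # Make sure the previous line is terminated before appending.
        if updated and not updated[-1].endswith("\n"):
            updated[-1] += "\n"
        updated.append("{0} {1}\n".format(key, value))
    return updated


# Example usage (back up template_config.txt before overwriting it):
# with open("template_config.txt") as fh:
#     lines = fh.readlines()
# lines = set_config_value(lines, "PYTHON_EXECUTABLE", "/usr/bin/python")
# with open("template_config.txt", "w") as fh:
#     fh.writelines(lines)
```

Returning a new list rather than editing the file in place makes it easy to inspect the result before anything is written to disk.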
6. Configuring the template_param.txt
The template_param.txt file defines the parameter settings of all the analytical steps in a MetaPathways run. It needs to be updated with the exact names of your protein and nucleotide databases in the blastDB/ folder (Figure 5).
Figure 5 - The template_param.txt file. The exact names of the BLAST databases need to be listed in the above highlighted lines. These must be the exact names of the database sequence files in the blastDB/ folder.
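
Because the names must match exactly, a mismatch between template_param.txt and the contents of blastDB/ is an easy mistake to make. The sketch below compares a list of database names against the folder. The names are passed in by hand because the layout of the parameter file is not described here, and missing_databases is a hypothetical helper.

```python
import os


def missing_databases(blastdb_dir, names):
    """Return the names in `names` that have no identically named file
    in `blastdb_dir` (the comparison is exact and case-sensitive)."""
    present = set(os.listdir(blastdb_dir))
    return [name for name in names if name not in present]


# Example, using database names mentioned in the sections above:
# missing_databases(os.path.expanduser("~/MetaPathways/blastDB"),
#                   ["metacyc-v5-2011-10-21", "Refseq_2013"])
```

An empty list means every name given has a matching file in the folder.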
7. Connecting with the Grid (optional).

MetaPathways can offload computationally heavy tasks, like protein BLAST searches, to supercomputing facilities, provided they use the Sun Grid Engine. This step is optional but highly recommended. It requires ssh access and sufficient user permissions to set up password-less login on the supercomputing server. This might be a good time to check with your local system administrator whether this kind of setup is permissible.
  • test to see if you can connect to your account via ssh:
    $ ssh username@server.address.com
  • You should be asked for your password.
  • check to see there is a .ssh/ folder in your home directory
    $ ls ~/.ssh/
    authorized_keys known_hosts

  • if not you should create it:
    $ mkdir ~/.ssh/
  • press control + d to return to your local computer
  • navigate to the ~/.ssh/ directory
    $ cd ~/.ssh/
  • run ssh-keygen to create an RSA public and private key pair
    $ ssh-keygen -t rsa
    Generating public/private rsa key pair.
    Enter file in which to save the key (/Users/username/.ssh/id_rsa):
    Enter passphrase (empty for no passphrase):
    Enter same passphrase again:
    Your identification has been saved in id_rsa.
    Your public key has been saved in id_rsa.pub.

  • Copy your public key to your grid .ssh/ folder with scp
    $ scp id_rsa.pub username@server.address.com:~/.ssh/
  • Log back in to your external server account using ssh
    $ ssh username@server.address.com
  • Navigate to the ~/.ssh/ directory again
    $ cd ~/.ssh
  • append the public key to a file called authorized_keys
    $ cat id_rsa.pub >> authorized_keys
  • change the permissions of the authorized_keys file and .ssh/ directory such that only your username can read/write it
    $ chmod 600 ~/.ssh/authorized_keys
    $ chmod 700 ~/.ssh/

  • log out and return to your local computer by pressing control + d
  • try to log in again using ssh; you should not need to type in your password this time
    $ ssh username@server.address.com
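
The final step of the procedure above can also be tested non-interactively. The sketch below runs ssh with the OpenSSH option BatchMode=yes, which makes ssh fail instead of prompting for a password, so a zero exit status indicates that key-based login works. The function names are illustrative assumptions, and username@server.address.com is the same placeholder used in the steps above.

```python
from __future__ import print_function
import subprocess


def build_ssh_check_command(login):
    """Build an ssh command that fails rather than prompting for a password."""
    # BatchMode=yes tells ssh never to ask for a password or passphrase;
    # "true" is a remote command that does nothing and exits successfully.
    return ["ssh", "-o", "BatchMode=yes", "-o", "ConnectTimeout=10",
            login, "true"]


def keyless_login_works(login):
    """Return True if ssh can log in to `login` without any prompt."""
    return subprocess.call(build_ssh_check_command(login)) == 0


# Example (replace the placeholder with your own account and server):
# print(keyless_login_works("username@server.address.com"))
```

Note that a key protected by a passphrase will also fail this check unless an ssh agent is holding the unlocked key.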

If the above procedure did not work, then you likely have a more complicated setup on your hands. At this point it would be good to ask a local system administrator to help you set up keyless login. If this is not possible, a useful search term is “ssh keyless login”.
Congratulations! You have completed what is in some cases a convoluted and unintuitive setup, but with some luck the MetaPathways pipeline is now ready for action. You can now obtain some .fasta files full of sample sequences and let the analysis commence. Using the pipeline is simple if you are familiar with the Unix command line; however, we have provided run examples to help you out.