Tutorial

orthofisher enables researchers to conduct high-throughput identification of orthologous genes using Hidden Markov Models (HMM). This tutorial covers the easy-to-implement workflow needed for using orthofisher.

1) Creating HMMs

Before using orthofisher, users will have to create (or curate) sets of HMMs of interest. HMMs can be curated from various sources including databases such as Pfam database. In this tutorial, we will go over creating your own HMMs.

First, we have to decide on genes of interest, download them, and align them. In this tutorial, we will download protein sequences that are top hits in BLASTP searches to the GAL1, GAL7, and GAL10 genes in Saccharomyces cerevisiae, which make up a very famous metabolic gene cluster in fungi. To do so, I individually BLASTed the GAL1, GAL7, and GAL10 genes against NCBI’s default database. Next, I downloaded the aligned FASTA sequences for each gene, which can be downloaded here:

Download GAL1/7/10 alignments: GAL alignments

Unzip the directory and from within the new directory, and generate alignments for each set of protein sequences. To do so, I will use Clustal Omega alignment protocol, which is available via EMBL-EBI. However, other alignment software like MAFFT can be used. Next, use the resulting alignments as input into the hmmbuild function from the HMMER suite using the following commands:

$ hmmbuild Gal1p.hmm Gal1p.aln.faa
$ hmmbuild Gal7p.hmm Gal7p.aln.faa
$ hmmbuild Gal10p.hmm Gal10p.aln.faa

2) Download the test data

For ease of use, this tutorial will rely on a small dataset, which can be downloaded using the following link:

Download test data: tutorial dataset

Next, unzip the downloaded directory and change directory to the newly downloaded directory.

$ cd path_to_unzipped_directory/orthofisher_tutorial

3) Run orthofisher

Two arguments are required when using orthofisher.

The first arument, -f/–fasta, points to a two column tab delimited file that specifies the location of fasta files that will be searched using HMMs. Typically, these are protein fasta files from the entire genome/transcriptome of an organism. Additionally, the second column of the file specifies the identifier for the organism. This will be used when representing sequences from a given proteome in a multi-fasta file. In this tutorial, this is file fasta_arg.txt.

The second argument, -m is a file that points to the location of HMMs that you wish to identify or fish out of a given proteome. In this tutorial, this is file hmms.txt.

$ orthofisher -m hmms.txt -f fasta_arg.txt

4) Examine output

In the current working directory, a subdirectory will be made titled orthofisher_output. Each subdirectory therein contains desirable output, which is briefly desired as:

all_sequences: multi-fasta file sequences of every hit identified during sequence similarity search.
hmmsearch_output: output files generated during hmmsearches
scog: a directory of single copy orthologous HMMs identified in the various fasta files

Also, two text files are made with helpful information that summarizes all the searches:

1. long_summary.txt: Hits identified during sequence similarity search per fasta file per HMM. HMMs are considered single-copy, multi-copy, or absent in a given fasta file.

$ cat orthofisher_output/long_summary.txt Clavispora_lusitaniae_P5.faa    Gal10p.hmm      single-copy     1       QFZ46524.1
Clavispora_lusitaniae_P5.faa    Gal1p.hmm       single-copy     1       QFZ46523.1
Clavispora_lusitaniae_P5.faa    Gal7p.hmm       single-copy     1       QFZ46527.1
Kluyveromyces_lactis_NRRL_Y-1140.faa    Gal10p.hmm      single-copy     1       XP_455462.1
Kluyveromyces_lactis_NRRL_Y-1140.faa    Gal1p.hmm       single-copy     1       XP_455461.1
Kluyveromyces_lactis_NRRL_Y-1140.faa    Gal7p.hmm       single-copy     1       XP_455463.1
Saccharomyces_cerevisiae_S288C.faa      Gal10p.hmm      single-copy     1       NP_009575.1
Saccharomyces_cerevisiae_S288C.faa      Gal1p.hmm       multi-copy      2       NP_009576.1
Saccharomyces_cerevisiae_S288C.faa      Gal1p.hmm       multi-copy      2       NP_010292.1
Saccharomyces_cerevisiae_S288C.faa      Gal7p.hmm       single-copy     1       NP_009574.1
Yarrowia_lipolytica_DSM3286.faa Gal10p.hmm      single-copy     1       QNP99911.1
Yarrowia_lipolytica_DSM3286.faa Gal1p.hmm       single-copy     1       QNP96229.1
Yarrowia_lipolytica_DSM3286.faa Gal7p.hmm       single-copy     1       QNP99718.1

col. 1: Query proteome fasta file.
col. 2: HMM file used during sequence similarity search.
col. 3: The sequence represented by the HMM is considered single_copy, multi-copy, or absent in a query proteome.
col. 4: Absolute copy number of hits from the sequence similarity search.
col. 5: The fasta entry identifier of the gene identified.

2. short_summary.txt: Summary of the absolute number and percentage of single-copy, multi-copy, or absent HMMs per fasta file.

$ cat orthofisher_output/short_summary.txt
file_name       single-copy     multi-copy      absent  per_single-copy per_multi-copy  per_absent
Clavispora_lusitaniae_P5.faa    3       0       0       1.0     0.0     0.0
Kluyveromyces_lactis_NRRL_Y-1140.faa    3       0       0       1.0     0.0     0.0
Saccharomyces_cerevisiae_S288C.faa      2       1       0       0.67    0.33    0.0
Yarrowia_lipolytica_DSM3286.faa 3       0       0       1.0     0.0     0.0

col. 1: Query proteome fasta file.
col. 2-4: Absolute number of sequences represented by HMMs that are present in single-copy, multi-copy, or absent.
col. 5-7: Percetange of sequences represented by HMMs that are present in single-copy, multi-copy, or absent.

5) Estimating gene family copy number

In the original paper describing orthofisher, we note that orthofisher can also be used to estimate gene family copy number using a zinc finger, C2H2 type, (Pfam: PF00096) as an example. Following that, we will now estimate PF00096 domain copy number among the same set of proteomes.

To do so, first download the PF00096 HMM from EMBL-EBI, which is also available here:

Download Zinc Finger, C2H2 type, HMM: PF00096.hmm

Next, follow the step described above to run orthofisher using PF00096.hmm. Of note, we recommend lowering the bitscore threshold to a less stringent value. The default bitscore threshold is 85% of the best hit, which follows the BUSCO pipeline. However, depending on the research question, lowering the threshold may be reasonable. For example, try running orthofisher using the -b parameter set to 0.85 and 0.2. Using a threshold of 0.2, Clavispora_lusitaniae_P5.faa has 27 copies, Kluyveromyces_lactis_NRRL_Y-1140.faa has 23 copies, Saccharomyces_cerevisiae_S288C.faa has 31 copies, and Yarrowia_lipolytica_DSM3286.faa has 28 copies of PF00096 whereas the same species have 1, 1, 1, and 3 copies, respectively, using default parameters.

We provide an additional level of user-flexibility by having a e-value threshold that can be used during HMM-based sequence similarity search. To change the e-value threshold from the default value of 1e-3, use the -e parameter.

6) Running orthofisher faster

In this section, I go over some suggestions to make orthofisher run faster.

As of version 1.0.0, two new arguments have been added to orthofisher. The first is -c/--cpu, which can be use for multithreading during the HMMsearch. The second is -o/--output_dir, which can be used to specify the name of the output file. These arguments enable users to accelerate the HMMsearch and pseudo-parallelize orthofisher runs.

To pseudo-parallelize orthofisher runs, users can split up their -f/--fasta input file into multiple pieces and then run orthofisher using each smaller file in an array of jobs. When doing so, users can write to different output directories for each input file and then combine results at the end.