orthofisher enables researchers to conduct high-throughput identification of orthologous genes using Hidden Markov Models (HMMs). This tutorial covers the easy-to-implement workflow for using orthofisher.
1) Download the test data
For ease of use, this tutorial will rely on a small dataset, which can be downloaded using the following link:
Download test data:
Next, unzip the downloaded archive and change into the newly unzipped directory.
$ cd path_to_unzipped_directory/orthofisher_tutorial
2) Run orthofisher
Two arguments are required when using orthofisher.
The first argument, -f/--fasta, points to a two-column, tab-delimited file. The first column specifies the location of the fasta files that will be searched using HMMs. Typically, these are protein fasta files from the entire genome/transcriptome of an organism. The second column specifies the identifier for the organism, which is used when representing sequences from a given proteome in a multi-fasta file. In this tutorial, this file is fasta_arg.txt.
The second argument, -m, points to a file that lists the locations of the HMMs that you wish to identify, or fish out of, a given proteome. In this tutorial, this file is hmms.txt.
$ orthofisher -m hmms.txt -f fasta_arg.txt
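The two input files described above can also be generated programmatically. The following is a minimal sketch, not part of orthofisher itself: the helper names `write_fasta_arg` and `write_hmm_list` are hypothetical, the organism identifier is assumed to be the fasta file name without its extension, and the HMM list is assumed to hold one path per line.

```python
from pathlib import Path


def write_fasta_arg(fasta_dir, out_path, suffix=".faa"):
    """Write a two-column, tab-delimited file: fasta path, organism identifier."""
    rows = []
    for fasta in sorted(Path(fasta_dir).glob(f"*{suffix}")):
        # Assumption: the file name without its extension serves as the identifier.
        rows.append(f"{fasta}\t{fasta.stem}")
    Path(out_path).write_text("".join(row + "\n" for row in rows))
    return rows


def write_hmm_list(hmm_dir, out_path, suffix=".hmm"):
    """Write the location of each HMM file (assumed layout: one path per line)."""
    paths = [str(p) for p in sorted(Path(hmm_dir).glob(f"*{suffix}"))]
    Path(out_path).write_text("".join(p + "\n" for p in paths))
    return paths
```

For example, `write_fasta_arg("proteomes", "fasta_arg.txt")` would list every `.faa` file in a hypothetical `proteomes` directory, and the resulting file could then be passed to `-f`.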
3) Examine output
In the current working directory, a subdirectory titled orthofisher_output will be created. Each subdirectory therein contains output that is briefly described as follows:
all_sequences: multi-fasta files of the sequences of every hit identified during the sequence similarity search.
hmmsearch_output: output files generated during the hmmsearches.
scog: a directory of single-copy orthologous HMMs identified in the various fasta files.
Also, two text files are made with helpful information that summarizes all the searches:
1. long_summary.txt: Hits identified during sequence similarity search per fasta file per HMM. HMMs are considered single-copy, multi-copy, or absent in a given fasta file.
$ cat orthofisher_output/long_summary.txt
GCF_010094145.1_Didex1_protein.faa  718307-1.fa.mafft.hmm  single-copy  1  XP_033451061.1
GCF_010094145.1_Didex1_protein.faa  1001705at2759.hmm      single-copy  1  XP_033445010.1
GCA_011032825.1_Masph1_protein.faa  718307-1.fa.mafft.hmm  single-copy  1  KAF2869753.1
GCA_011032825.1_Masph1_protein.faa  1001705at2759.hmm      single-copy  1  KAF2868776.1
no_copy.faa                         718307-1.fa.mafft.hmm  absent       0  NA
no_copy.faa                         1001705at2759.hmm      absent       0  NA
multi_copy.faa                      718307-1.fa.mafft.hmm  multi-copy   2  Massariosphaeria_phaeospora
multi_copy.faa                      718307-1.fa.mafft.hmm  multi-copy   2  Didymella_exigua
multi_copy.faa                      1001705at2759.hmm      absent       0  NA
col. 1: Query proteome fasta file.
col. 2: HMM file used during sequence similarity search.
col. 3: Whether the sequence represented by the HMM is considered single-copy, multi-copy, or absent in a query proteome.
col. 4: Absolute copy number of hits from the sequence similarity search.
col. 5: The fasta entry identifier of the gene identified.
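The five columns described above can be read into a simple lookup keyed by fasta file and HMM. This is a minimal sketch rather than an orthofisher feature: the function name `parse_long_summary` is hypothetical, and it assumes that fields are separated by whitespace and that sequence identifiers contain no spaces.

```python
from collections import defaultdict


def parse_long_summary(lines):
    """Group long_summary.txt rows by (fasta file, HMM)."""
    hits = defaultdict(lambda: {"status": None, "copies": 0, "ids": []})
    for line in lines:
        fields = line.split()
        if len(fields) != 5:
            continue  # skip blank or malformed lines
        fasta, hmm, status, copies, seq_id = fields
        entry = hits[(fasta, hmm)]
        entry["status"] = status
        entry["copies"] = int(copies)
        if seq_id != "NA":  # absent HMMs carry NA instead of an identifier
            entry["ids"].append(seq_id)
    return dict(hits)


# Three rows copied from the tutorial output above.
example = [
    "no_copy.faa 718307-1.fa.mafft.hmm absent 0 NA",
    "multi_copy.faa 718307-1.fa.mafft.hmm multi-copy 2 Massariosphaeria_phaeospora",
    "multi_copy.faa 718307-1.fa.mafft.hmm multi-copy 2 Didymella_exigua",
]
summary = parse_long_summary(example)
```

In practice, the lines could come from `open("orthofisher_output/long_summary.txt")`. Multi-copy HMMs occupy one row per hit, so their identifiers are collected into a single list.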
2. short_summary.txt: Summary of the absolute number and percentage of single-copy, multi-copy, or absent HMMs per fasta file.
$ cat orthofisher_output/short_summary.txt
file_name                           single-copy  multi-copy  absent  per_single-copy  per_multi-copy  per_absent
GCF_010094145.1_Didex1_protein.faa  2            0           0       1.0              0.0             0.0
GCA_011032825.1_Masph1_protein.faa  2            0           0       1.0              0.0             0.0
no_copy.faa                         0            0           2       0.0              0.0             1.0
multi_copy.faa                      0            1           1       0.0              0.5             0.5
col. 1: Query proteome fasta file.
col. 2-4: Absolute number of sequences represented by HMMs that are present in single-copy, multi-copy, or absent.
col. 5-7: Percentage, expressed as a fraction between 0 and 1, of sequences represented by HMMs that are present in single-copy, multi-copy, or absent.
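To illustrate how columns 2-4 relate to columns 5-7, the sketch below parses the summary and recomputes the fractions from the counts. It is an illustration under stated assumptions, not orthofisher's own code: the function names are hypothetical, fields are assumed to be whitespace-separated, and each fraction is assumed to be a count divided by the total number of HMMs searched, which is consistent with the example values above.

```python
def parse_short_summary(lines):
    """Map each fasta file to its counts and reported fractions."""
    rows = {}
    for line in lines[1:]:  # the first line is the header
        fields = line.split()
        if len(fields) != 7:
            continue  # skip blank or malformed lines
        counts = tuple(int(x) for x in fields[1:4])
        fractions = tuple(float(x) for x in fields[4:7])
        rows[fields[0]] = {"counts": counts, "fractions": fractions}
    return rows


def fractions_from_counts(counts):
    """Assumed relationship: each count divided by the total number of HMMs."""
    total = sum(counts)
    return tuple(c / total if total else 0.0 for c in counts)


# Header and two rows copied from the tutorial output above.
example = [
    "file_name single-copy multi-copy absent per_single-copy per_multi-copy per_absent",
    "no_copy.faa 0 0 2 0.0 0.0 1.0",
    "multi_copy.faa 0 1 1 0.0 0.5 0.5",
]
table = parse_short_summary(example)
```

For multi_copy.faa, two HMMs were searched, one was multi-copy and one was absent, so each of those categories accounts for 1 / 2 = 0.5.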