Advanced Usage

OrthoHMM output

All OrthoHMM outputs have the prefix orthohmm so that they are easy to find.

orthohmm_gene_count.txt
- A matrix of gene counts per taxon for each orthogroup.
orthohmm_orthogroups.txt
- Genes present in each orthogroup.
orthohmm_single_copy_orthogroups.txt
- A single-column list of single-copy orthologs.
orthohmm_orthogroups
- A directory of FASTA files wherein each file is an orthogroup.
orthohmm_single_copy_orthogroups
- A directory of FASTA files wherein each file is a single-copy ortholog.
- Headers are modified to have taxon names come before the gene identifier.
- Taxon names are the file name excluding the extension.
- Taxon name and gene identifier are separated by a pipe symbol “|”.
- This aims to help streamline phylogenomic workflows wherein sequences will be concatenated downstream based on taxon names.
orthohmm_working_res
- A directory of intermediate files, such as all-by-all search results, network edges, and clusters inferred from network edges.
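
Because headers in orthohmm_single_copy_orthogroups place the taxon name before the pipe symbol, the taxon for each sequence can be recovered with standard text tools. The following is a minimal sketch, not part of OrthoHMM; the function name list_taxa and the toy taxon and gene names are hypothetical and used only for illustration.

```shell
# List the taxon name of every sequence in one single-copy ortholog FASTA file.
# Assumes headers look like ">taxon|gene_id", as described above.
list_taxa() {
    # Keep header lines, drop the leading ">", print the text before the first "|".
    grep '^>' "$1" | sed 's/^>//' | cut -d '|' -f 1
}

# Toy input (hypothetical names) so the sketch runs on its own.
demo_fasta="$(mktemp)"
printf '>taxon_A|gene_1\nMKT\n>taxon_B|gene_7\nMKS\n' > "$demo_fasta"
list_taxa "$demo_fasta"
```

Downstream concatenation tools can then match sequences across files by these taxon names.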

The remaining sections describe the various features and options of OrthoHMM.

Input Directory
Output Directory
Phmmer
E-value Threshold
Substitution Matrix
CPU
Single-Copy Threshold
MCL
Inflation Value
Stop
Start
All Options

Input Directory

A directory that contains FASTA files of protein sequences that also have the extensions .fa, .faa, .fas, .fasta, .pep, or .prot. OrthoHMM will automatically identify files with these extensions and use them for analyses. This directory must be the first argument.

# specifying input directory
orthohmm <path_to_directory_of_FASTA_files>
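
To preview which files would likely be picked up before starting a run, the recognized extensions can be checked with a short shell loop. This is a sketch under assumptions: it looks only at the top level of the directory and matches extensions case-sensitively, which may differ from how OrthoHMM itself scans the directory. The function name list_fasta_inputs and the demo file names are hypothetical.

```shell
# Print "taxon<TAB>path" for each file with a recognized FASTA extension.
# The taxon name is the file name without its extension.
list_fasta_inputs() {
    for ext in fa faa fas fasta pep prot; do
        for f in "$1"/*."$ext"; do
            # An unmatched glob stays literal, so confirm the file really exists.
            if [ -f "$f" ]; then
                printf '%s\t%s\n' "$(basename "$f" ".$ext")" "$f"
            fi
        done
    done
}

# Toy directory (hypothetical file names) so the sketch runs on its own.
demo_dir="$(mktemp -d)"
touch "$demo_dir/species_1.fa" "$demo_dir/species_2.fasta" "$demo_dir/notes.txt"
list_fasta_inputs "$demo_dir"
```

Here notes.txt is skipped because its extension is not in the recognized list.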

Output Directory

Name of the directory in which to store OrthoHMM results. This directory should already exist. By default, result files are written to the input directory of FASTA files. (-o, --output_directory)

# specifying output directory
orthohmm <path_to_directory_of_FASTA_files> -o <output_directory>
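
Since the output directory should already exist, it can be created just before the run. A minimal sketch follows; the path stored in out_dir is hypothetical, and the orthohmm call is left as a comment so the snippet runs without OrthoHMM installed.

```shell
# Hypothetical output location; replace with a real path.
out_dir="$(mktemp -d)/orthohmm_results"

# -p creates missing parent directories and succeeds if the directory already exists.
mkdir -p "$out_dir"

# Then run, for example:
# orthohmm <path_to_directory_of_FASTA_files> -o "$out_dir"
```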

Phmmer

Path to the phmmer executable from the HMMER suite. By default, phmmer is assumed to be in the PATH variable; in other words, phmmer can be invoked by typing phmmer.

# specify path to phmmer executable
orthohmm <path_to_directory_of_FASTA_files> -p /path/to/phmmer
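
Whether phmmer is reachable through PATH can be checked up front. The helper below is a sketch, and the name require_exe is hypothetical; it relies only on the standard shell builtin command -v.

```shell
# Print the resolved path of an executable, or fail with a message if it is missing.
require_exe() {
    if ! command -v "$1" >/dev/null 2>&1; then
        echo "error: '$1' not found in PATH" >&2
        return 1
    fi
    command -v "$1"
}

# If phmmer is not found, its location has to be passed explicitly.
require_exe phmmer || echo "pass the location with -p /path/to/phmmer"
```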

E-value Threshold

E-value threshold to use when filtering phmmer results. E-value thresholds are applied after the searches are run, so users can change the e-value threshold when using the --start argument. The default is 0.0001.

# specify e-value threshold
orthohmm <path_to_directory_of_FASTA_files> -e 0.0001
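
The idea of filtering after the search can be illustrated on a toy table. This sketch does not read OrthoHMM's actual intermediate files: the three-column, tab-separated layout (query, target, e-value), the function name filter_hits, and the use of a strict "less than" comparison are all assumptions made for illustration.

```shell
# Keep hits whose e-value (third column of a hypothetical table) is below a threshold.
filter_hits() {
    # Adding 0 forces awk to compare the values as numbers, including 1e-50 notation.
    awk -F '\t' -v thr="$2" '($3 + 0) < (thr + 0)' "$1"
}

# Toy search results (hypothetical gene names and e-values).
hits="$(mktemp)"
printf 'geneA\tgeneB\t1e-50\ngeneA\tgeneC\t0.01\ngeneB\tgeneC\t3e-6\n' > "$hits"

# With the default threshold, the 0.01 hit is dropped and the other two are kept.
filter_hits "$hits" 0.0001
```

Because the stored hits are unchanged, the same table can be filtered again with a stricter or looser threshold without repeating the search.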

Substitution Matrix

Residue alignment probabilities will be determined from the specified substitution matrix. Supported substitution matrices include: BLOSUM45, BLOSUM50, BLOSUM62, BLOSUM80, BLOSUM90, PAM30, PAM70, PAM120, and PAM240. The default is BLOSUM62.

# specify using the BLOSUM80 substitution matrix
orthohmm <path_to_directory_of_FASTA_files> -x BLOSUM80

CPU

Number of CPU workers for multithreading during sequence search. This argument is used by phmmer during all-vs-all comparisons. By default, the number of CPUs available will be auto-detected.

# run orthohmm using 8 CPUs
orthohmm <path_to_directory_of_FASTA_files> -c 8

Single-Copy Threshold

Taxon occupancy threshold when identifying single-copy orthologs. By default, the threshold is 50% taxon occupancy, which is specified as a fraction (that is, 0.5).

# specify single-copy threshold as a fraction
orthohmm <path_to_directory_of_FASTA_files> -s 0.5
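
As a worked illustration of taxon occupancy, the sketch below checks one row of per-taxon gene counts against a threshold. The exact rule OrthoHMM applies is not spelled out here, so this assumes that no taxon may have more than one copy and that the fraction of taxa with exactly one copy must be at least the threshold. The function name passes_threshold is hypothetical.

```shell
# Usage: passes_threshold THRESHOLD COUNT [COUNT ...]
# Succeeds (exit status 0) if the counts meet the assumed single-copy rule.
passes_threshold() {
    threshold="$1"; shift
    printf '%s\n' "$@" | awk -v thr="$threshold" '
        $1 > 1  { multi = 1 }    # a taxon with several copies breaks single-copy status
        $1 == 1 { present++ }    # taxa represented by exactly one gene
        END { exit !(multi == 0 && NR > 0 && present / NR >= thr) }'
}

# Four taxa, three with one copy each: occupancy is 3/4 = 0.75.
if passes_threshold 0.5 1 1 0 1; then echo "kept"; else echo "dropped"; fi
# Four taxa, one with one copy: occupancy is 1/4 = 0.25.
if passes_threshold 0.5 1 0 0 0; then echo "kept"; else echo "dropped"; fi
```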

MCL

Path to the mcl executable from the MCL software. By default, mcl is assumed to be in the PATH variable; in other words, mcl can be invoked by typing mcl.

# specify path to mcl executable
orthohmm <path_to_directory_of_FASTA_files> -m /path/to/mcl

Inflation Value

MCL inflation parameter for clustering genes into orthologous groups. Lower values are more permissive, resulting in larger orthogroups. Higher values are stricter, resulting in smaller orthogroups. The default value is 1.5.

# use an inflation value of 1.5 during mcl clustering
orthohmm <path_to_directory_of_FASTA_files> -i 1.5

Stop

Similar to other ortholog-calling algorithms, different steps in the OrthoHMM workflow can be CPU- or memory-intensive. Thus, users may want to stop OrthoHMM at certain steps to facilitate more practical resource allocation. There are three choices for when to stop the analysis: prepare, infer, and write.

prepare: Stop after preparing input files for the all-by-all search
infer: Stop after inferring the orthogroups
write: Stop after writing sequence files for the orthogroups. Currently, this is synonymous with not specifying a step to stop the analysis at.

# stop orthohmm after preparing the all-by-all search commands
orthohmm <path_to_directory_of_FASTA_files> --stop prepare

Start

Start analysis from a specific intermediate step. Currently, this can only be applied to the results from the all-by-all search.

search_res: Start analysis from all-by-all search results.

# start orthohmm from after the all-by-all searches are complete
orthohmm <path_to_directory_of_FASTA_files> --start search_res
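
Because e-value thresholds are applied after the searches, --start search_res can be combined with -e to re-run the later steps at several thresholds without repeating the all-by-all search. The sketch below only prints the commands (a dry run) so it works without OrthoHMM installed. The directory name, the helper build_rerun_cmd, and the chosen thresholds are hypothetical, and it does not address whether a later run overwrites the results of an earlier one.

```shell
# Hypothetical input directory; replace with the real directory of FASTA files.
fasta_dir="path/to/fasta_dir"

# Build one re-run command that starts from existing all-by-all search results.
build_rerun_cmd() {
    printf 'orthohmm %s --start search_res -e %s\n' "$1" "$2"
}

# Print one command per e-value threshold; run each by hand or pipe the output to sh.
for evalue in 0.001 0.0001 0.00001; do
    build_rerun_cmd "$fasta_dir" "$evalue"
done
```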

All Options

Option	Usage and meaning
-h/--help	Print help message
-v/--version	Print software version
-o/--output_directory	Output directory name. Default: same directory as directory of FASTA files
-p/--phmmer	Path to phmmer from HMMER suite. Default: phmmer
-e	E-value threshold to use when filtering phmmer results. Default: 0.0001
-x/--substitution_matrix	Specify substitution matrix to use for generating the HMMs. Default: BLOSUM62
-c/--cpu	Number of parallel CPU workers to use for multithreading. Default: auto detect
-s/--single_copy_threshold	Taxon occupancy threshold for single-copy orthologs. Default: 0.5
-m/--mcl	Path to mcl software. Default: mcl
-i/--inflation_value	MCL inflation parameter. Default: 1.5
--stop	Stop OrthoHMM run at a specific step. Default: None
--start	Start OrthoHMM run at a specific step. Default: None