Advanced Usage
OrthoHMM output
All OrthoHMM outputs have the prefix orthohmm so that they are easy to find.
- orthohmm_gene_count.txt
A matrix of gene counts per taxon for each orthogroup.
- orthohmm_orthogroups.txt
Genes present in each orthogroup.
- orthohmm_single_copy_orthogroups.txt
A single-column list of single-copy orthologs.
- orthohmm_orthogroups
A directory of FASTA files wherein each file is an orthogroup.
- orthohmm_single_copy_orthogroups
A directory of FASTA files wherein each file is a single-copy ortholog.
Headers are modified to have taxon names come before the gene identifier.
Taxon names are the file name excluding the extension.
Taxon name and gene identifier are separated by a pipe symbol “|”.
This aims to help streamline phylogenomic workflows wherein sequences will be concatenated downstream based on taxon names.
- orthohmm_working_res
A directory of intermediate files, such as all-by-all search results, network edges, and clusters inferred from network edges.
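Because every header in the orthogroup FASTA files follows the taxon|gene pattern described above, the set of taxa in a file can be recovered with standard text tools. The following is a minimal sketch under that assumption; the function name taxa_in_fasta and the example sequences are hypothetical and not part of OrthoHMM.

```shell
# Print the unique taxon names found in one orthogroup FASTA file.
# Assumes headers look like ">taxon|gene_id" and that taxon names
# themselves contain no pipe symbol.
taxa_in_fasta() {
    grep '^>' "$1" | sed 's/^>//' | cut -d'|' -f1 | sort -u
}

# Hypothetical example file, written only to make the sketch runnable.
example_fa=$(mktemp)
printf '>species_A|gene1\nMKV\n>species_B|gene7\nMKI\n' > "$example_fa"
taxa_in_fasta "$example_fa"
rm -f "$example_fa"
```

Grouping sequences by this taxon field is the step that a downstream concatenation workflow would typically rely on.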
The remaining sections describe the various features and options of OrthoHMM.
Input Directory
A directory that contains FASTA files of protein sequences that also have the extensions .fa, .faa, .fas, .fasta, .pep, or .prot. OrthoHMM will automatically identify files with these extensions and use them for analyses. This directory must be the first argument.
# specifying input directory
orthohmm <path_to_directory_of_FASTA_files>
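Before a run, it can help to preview which files would be picked up and which taxon names they would produce. The sketch below assumes that only the top level of the directory is scanned and that the taxon name is the file name with its last extension removed; both helper names (list_fasta_inputs, taxon_name) are hypothetical and not part of OrthoHMM.

```shell
# List files whose extension is one OrthoHMM recognizes
# (.fa, .faa, .fas, .fasta, .pep, .prot), one path per line.
# Assumption: subdirectories are not searched.
list_fasta_inputs() {
    find "$1" -maxdepth 1 -type f \( \
        -name '*.fa' -o -name '*.faa' -o -name '*.fas' -o \
        -name '*.fasta' -o -name '*.pep' -o -name '*.prot' \) | sort
}

# Derive a taxon name: the file name without its last extension.
taxon_name() {
    base=$(basename "$1")
    printf '%s\n' "${base%.*}"
}

# Hypothetical demo directory, created only to make the sketch runnable.
demo_dir=$(mktemp -d)
touch "$demo_dir/species_A.faa" "$demo_dir/species_B.fasta" "$demo_dir/notes.txt"
list_fasta_inputs "$demo_dir" | while read -r f; do taxon_name "$f"; done
rm -rf "$demo_dir"
```

Files such as notes.txt are skipped because their extension is not in the recognized list.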
Output directory
Output directory name to store OrthoHMM results. This directory should already exist. By default, result files are written to the input directory of FASTA files. (-o, --output_directory)
# specifying output directory
orthohmm <path_to_directory_of_FASTA_files> -o <output_directory>
Search Mode
Selects which sequence search engine OrthoHMM uses for the all-vs-all homology step. Two choices:
builtin (default): a built-in profile HMM + k-mer prefilter scorer. No external HMMER dependency. Faster on bacterial proteomes and the recommended path for large datasets.
phmmer: legacy path that shells out to the phmmer binary from the HMMER suite. Requires HMMER to be installed and reachable on PATH (or via -p).
# default — uses the built-in engine
orthohmm <path_to_directory_of_FASTA_files>
# opt into the legacy phmmer pipeline
orthohmm <path_to_directory_of_FASTA_files> --search_mode phmmer
Phmmer
Path to phmmer executable from HMMER suite. Only consulted when
--search_mode phmmer is set. By default, phmmer is assumed to be
in the PATH variable; in other words, phmmer can be invoked by typing
phmmer.
# specify path to phmmer executable
orthohmm <path_to_directory_of_FASTA_files> --search_mode phmmer -p /path/to/phmmer
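A quick way to find out whether -p is needed is to test whether the shell can resolve phmmer on its own. The helper below is a hypothetical convenience, not part of OrthoHMM; it relies only on the standard command -v builtin.

```shell
# Succeed if a tool can be run, either by name via PATH or by explicit path
# (e.g. tool_available /path/to/phmmer).
tool_available() {
    command -v "$1" >/dev/null 2>&1
}

if tool_available phmmer; then
    echo "phmmer found on PATH"
else
    echo "phmmer not on PATH: pass its location with -p, or keep the default builtin search mode"
fi
```

The same check applies to any other external binary a run depends on, such as mcl.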
E-value Threshold
E-value threshold to use when filtering phmmer results. The threshold is applied after the searches finish, so users can change it when restarting with the --start argument. The default is 0.0001.
# specify e-value threshold
orthohmm <path_to_directory_of_FASTA_files> -e 0.0001
Substitution Matrix
Residue alignment probabilities will be determined from the specified substitution matrix. Supported substitution matrices include: BLOSUM45, BLOSUM50, BLOSUM62, BLOSUM80, BLOSUM90, PAM30, PAM70, PAM120, PAM240, WAG, and LG. The default is BLOSUM62.
# specify using the BLOSUM80 substitution matrix
orthohmm <path_to_directory_of_FASTA_files> -x BLOSUM80
CPU
Number of CPU workers for multithreading during sequence search. This argument is used by phmmer during all-vs-all comparisons. By default, the number of CPUs available will be auto-detected.
# run orthohmm using 8 CPUs
orthohmm <path_to_directory_of_FASTA_files> -c 8
Single-Copy Threshold
Taxon occupancy threshold when identifying single-copy orthologs. By default, the threshold is 50% taxon occupancy, which is specified as a fraction - that is, 0.5.
# specify single-copy threshold as a fraction
orthohmm <path_to_directory_of_FASTA_files> -s 0.5
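To make the notion of taxon occupancy concrete, the sketch below computes it for one orthogroup FASTA file as the number of distinct taxa present divided by the total number of taxa in the analysis. It assumes the taxon|gene header format of the OrthoHMM output files; the function name occupancy is hypothetical, and how OrthoHMM itself compares the value against the threshold (for example, whether the bound is inclusive) is not specified here.

```shell
# Taxon occupancy of one orthogroup FASTA file.
#   $1 = FASTA file with ">taxon|gene_id" headers
#   $2 = total number of taxa in the analysis
occupancy() {
    n_present=$(grep '^>' "$1" | cut -d'|' -f1 | sort -u | wc -l)
    awk -v p="$n_present" -v n="$2" 'BEGIN { printf "%.2f\n", p / n }'
}

# Hypothetical orthogroup with 3 of 4 taxa represented.
og_fa=$(mktemp)
printf '>A|g1\nMK\n>B|g2\nMK\n>C|g3\nMK\n' > "$og_fa"
occupancy "$og_fa" 4   # prints 0.75
rm -f "$og_fa"
```

With the default threshold of 0.5, an orthogroup like this one, present in three of four taxa, would sit above the cutoff.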
Clustering
Selects which clustering algorithm groups RBNH edges into orthogroups. Two choices:
mcl (default): Markov Cluster algorithm with inflation=1.5, via the external mcl binary. Robust across phylogenetic diversity ranges; never collapses to all-singletons, even on cross-kingdom inputs. F=62.4% on OrthoBench, F=88.4% on Three Kingdoms.
leiden: Leiden community detection with the Constant Potts Model. Pure Python via igraph and leidenalg; no external binary required. Higher peak F-score on closely-related inputs (F=65.7-65.9% on OrthoBench), but the fixed default cpm_resolution=0.1 collapses on cross-kingdom data. Pair with --cpm_resolution auto for robustness, which yields F=90.7% on Three Kingdoms.
# default — MCL inflation=1.5
orthohmm <path_to_directory_of_FASTA_files>
# Leiden with auto-tuned CPM resolution
orthohmm <path_to_directory_of_FASTA_files> --clustering leiden --cpm_resolution auto
CPM Resolution
Resolution parameter for Leiden CPM clustering. Lower values produce
larger, more permissive orthogroups; higher values produce smaller,
stricter ones. The default is 0.1, which gave the best F-score on
the OrthoBench reference.
# tighter clustering
orthohmm <path_to_directory_of_FASTA_files> --cpm_resolution 0.3
MCL
Path to mcl executable. Only consulted when --clustering mcl is set.
By default, mcl is assumed to be in the PATH variable; in other words,
mcl can be invoked by typing mcl.
# specify path to mcl executable
orthohmm <path_to_directory_of_FASTA_files> -m /path/to/mcl
Inflation Value
MCL inflation parameter for clustering genes into orthologous groups. Lower values are more permissive resulting in larger orthogroups. Higher values are stricter resulting in smaller orthogroups. The default value is 1.5.
# use an inflation value of 1.5 during mcl clustering
orthohmm <path_to_directory_of_FASTA_files> -i 1.5
Stop
As with other ortholog-calling algorithms, different steps in the OrthoHMM workflow can be CPU or memory intensive. Users may therefore want to stop OrthoHMM at certain steps to facilitate more practical resource allocation. There are three choices for when to stop the analysis: prepare, infer, and write.
prepare: Stop after preparing input files for the all-by-all search
infer: Stop after inferring the orthogroups
write: Stop after writing sequence files for the orthogroups. Currently, this is synonymous with not specifying a step to stop the analysis at.
# stop orthohmm after preparing the all-by-all search commands
orthohmm <path_to_directory_of_FASTA_files> --stop prepare
Start
Start analysis from a specific intermediate step. Currently, this can only be applied to the results from the all-by-all search.
search_res: Start analysis from all-by-all search results.
# start orthohmm from after the all-by-all searches are complete
orthohmm <path_to_directory_of_FASTA_files> --start search_res
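Since the e-value threshold is applied after the searches finish, restarting from search_res allows several thresholds to be tried without repeating the all-by-all search. The sketch below only prints the commands so they can be reviewed first; the helper name refilter_commands, the ./proteomes directory, and the chosen thresholds are hypothetical. Whether successive runs overwrite earlier result files is not covered here, so keeping a copy of the results between runs may be prudent.

```shell
# Print one orthohmm command per e-value, each restarting from existing
# all-by-all search results.
#   $1   = input directory of FASTA files
#   $2.. = e-value thresholds to try
refilter_commands() {
    fasta_dir=$1
    shift
    for evalue in "$@"; do
        printf 'orthohmm "%s" --start search_res -e %s\n' "$fasta_dir" "$evalue"
    done
}

# Hypothetical directory and thresholds.
refilter_commands ./proteomes 0.0001 0.00001
```

Once the printed commands look right, they can be run one at a time or piped to sh.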
All options
| Option | Usage and meaning |
|---|---|
| -h/--help | Print help message |
| -v/--version | Print software version |
| -o/--output_directory | Output directory name. Default: same directory as directory of FASTA files |
| -p/--phmmer | Path to phmmer from HMMER suite. Default: phmmer |
| --search_mode | Search engine: builtin or phmmer. Default: builtin |
| --clustering | Clustering algorithm: mcl or leiden. Default: mcl |
| --cpm_resolution | Leiden CPM resolution: float or auto. Default: 0.1 |
| -x/--substitution_matrix | Specify substitution matrix to use for generating the HMMs. Default: BLOSUM62 |
| -c/--cpu | Number of parallel CPU workers to use for multithreading. Default: auto detect |
| -s/--single_copy_threshold | Taxon occupancy threshold for single-copy orthologs. Default: 0.5 |
| -m/--mcl | Path to mcl software. Default: mcl |
| -i/--inflation_value | MCL inflation parameter. Default: 1.5 |
| --stop | Stop OrthoHMM run at a specific step. Default: None |
| --start | Start OrthoHMM run at a specific step. Default: None |