Advanced Usage
OrthoHMM output
All OrthoHMM outputs have the prefix orthohmm so that they are easy to find.
- orthohmm_gene_count.txt
A matrix of gene counts per taxon for each orthogroup.
- orthohmm_orthogroups.txt
Genes present in each orthogroup.
- orthohmm_single_copy_orthogroups.txt
A single-column list of single-copy orthologs.
- orthohmm_orthogroups
A directory of FASTA files wherein each file is an orthogroup.
- orthohmm_single_copy_orthogroups
A directory of FASTA files wherein each file is a single-copy ortholog.
Headers are modified to have taxon names come before the gene identifier.
Taxon names are the file name excluding the extension.
Taxon name and gene identifier are separated by a pipe symbol “|”.
This aims to help streamline phylogenomic workflows wherein sequences will be concatenated downstream based on taxon names.
- orthohmm_working_res
A directory of intermediate files, such as all-by-all search results, network edges, and clusters inferred from network edges.
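Because sequence headers follow the taxon|gene convention described above, downstream scripts can recover the taxon name with plain shell parameter expansion. The following is a minimal sketch, assuming a header of the form >taxon|gene; the function name split_header and the example header are hypothetical and not part of OrthoHMM.

```shell
# Hypothetical helper (not part of OrthoHMM): split a FASTA header of the
# form ">taxon|gene" into the taxon name and the gene identifier.
split_header() {
  local header="${1#>}"        # drop the leading ">" if present
  local taxon="${header%%|*}"  # text before the first pipe
  local gene="${header#*|}"    # text after the first pipe
  printf '%s\t%s\n' "$taxon" "$gene"
}

# Example call with a made-up header; prints the two fields tab-separated.
split_header '>taxonA|gene0001'
```

Splitting on the first pipe only means a gene identifier that itself contains a pipe is kept intact.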
The remaining sections describe the features and options of OrthoHMM.
Input Directory
A directory containing FASTA files of protein sequences with one of the extensions .fa, .faa, .fas, .fasta, .pep, or .prot. OrthoHMM automatically identifies files with these extensions and uses them for analysis. This directory must be the first argument.
# specifying input directory
orthohmm <path_to_directory_of_FASTA_files>
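Before a run, it can help to check which files in a directory carry a recognized extension and which taxon name each one would receive (the file name excluding the extension). The sketch below only mirrors the extension list given above; list_fasta_inputs is a hypothetical helper, and OrthoHMM's own file discovery may differ in detail.

```shell
# Hypothetical helper (not part of OrthoHMM): list the files in a directory
# that carry one of the recognized extensions, together with the taxon name
# each one would get (file name excluding the extension).
list_fasta_inputs() {
  local dir="$1" ext file name
  for ext in fa faa fas fasta pep prot; do
    for file in "$dir"/*."$ext"; do
      # an unmatched glob stays literal, so skip paths that do not exist
      if [ -f "$file" ]; then
        name="$(basename "$file")"
        printf '%s\t%s\n' "${name%.*}" "$file"
      fi
    done
  done
}
```

Calling list_fasta_inputs with the input directory prints one tab-separated line per recognized file, taxon name first.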
Output directory
Name of the directory in which to store OrthoHMM results. This directory should already exist. By default, result files are written to the input directory of FASTA files. (-o, --output_directory)
# specifying output directory
orthohmm <path_to_directory_of_FASTA_files> -o <output_directory>
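Since the output directory should already exist, a small wrapper can create it before calling OrthoHMM. This is a sketch under that assumption; run_with_output_dir is a hypothetical name, and only the -o option shown above is used.

```shell
# Sketch: make sure the output directory exists, then run OrthoHMM with -o.
# run_with_output_dir is a hypothetical wrapper, not part of OrthoHMM.
run_with_output_dir() {
  local fasta_dir="$1" out_dir="$2"
  mkdir -p "$out_dir"   # no-op if the directory is already there
  orthohmm "$fasta_dir" -o "$out_dir"
}
```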
Phmmer
Path to the phmmer executable from the HMMER suite. By default, phmmer is assumed to be in the PATH variable; in other words, phmmer can be invoked by typing phmmer.
# specify path to phmmer executable
orthohmm <path_to_directory_of_FASTA_files> -p /path/to/phmmer
E-value Threshold
E-value threshold to use when filtering phmmer results. The threshold is applied after the searches are run, so users can change it when resuming an analysis with the --start argument. The default is 0.0001.
# specify e-value threshold
orthohmm <path_to_directory_of_FASTA_files> -e 0.0001
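To illustrate what filtering after the search means, the sketch below drops hits above a threshold from tabular search results read on standard input. It assumes the layout of HMMER's --tblout output, in which the fifth whitespace-separated field is the full-sequence E-value; whether OrthoHMM's intermediate files use this layout, and whether its comparison is inclusive, are assumptions. filter_by_evalue is a hypothetical name.

```shell
# Hypothetical helper (not part of OrthoHMM): keep hits whose E-value is at
# or below a threshold. Assumes the fifth whitespace-separated field holds
# the E-value, as in HMMER's --tblout format; comment lines are skipped.
filter_by_evalue() {
  local threshold="$1"
  awk -v thr="$threshold" '!/^#/ && NF >= 5 && ($5 + 0) <= (thr + 0)'
}
```

Because the filter runs on stored results, trying a different threshold does not require repeating the searches.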
Substitution Matrix
Residue alignment probabilities will be determined from the specified substitution matrix. Supported substitution matrices include: BLOSUM45, BLOSUM50, BLOSUM62, BLOSUM80, BLOSUM90, PAM30, PAM70, PAM120, and PAM240. The default is BLOSUM62.
# specify using the BLOSUM80 substitution matrix
orthohmm <path_to_directory_of_FASTA_files> -x BLOSUM80
CPU
Number of CPU workers for multithreading during sequence search. This argument is used by phmmer during all-vs-all comparisons. By default, the number of CPUs available will be auto-detected.
# run orthohmm using 8 CPUs
orthohmm <path_to_directory_of_FASTA_files> -c 8
Single-Copy Threshold
Taxon occupancy threshold when identifying single-copy orthologs. By default, the threshold is 50% taxon occupancy, specified as a fraction (0.5).
# specify single-copy threshold as a fraction
orthohmm <path_to_directory_of_FASTA_files> -s 0.5
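As a worked example of the threshold, taxon occupancy can be read as the fraction of input taxa represented in an orthogroup. The sketch below checks that fraction against a threshold; the inclusive comparison and the name passes_occupancy are assumptions made for illustration, not a description of OrthoHMM's internals.

```shell
# Hypothetical helper (not part of OrthoHMM): succeed when the fraction of
# taxa present in an orthogroup meets the threshold. The ">=" comparison is
# an assumption.
passes_occupancy() {
  local present="$1" total="$2" threshold="$3"
  awk -v p="$present" -v t="$total" -v thr="$threshold" \
    'BEGIN { exit !((p / t) >= thr) }'
}

# 6 of 10 taxa present is an occupancy of 0.6, which meets a 0.5 threshold.
if passes_occupancy 6 10 0.5; then
  echo "kept"
fi
```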
MCL
Path to the mcl executable from the MCL software. By default, mcl is assumed to be in the PATH variable; in other words, mcl can be invoked by typing mcl.
# specify path to mcl executable
orthohmm <path_to_directory_of_FASTA_files> -m /path/to/mcl
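Whether phmmer and mcl are reachable through PATH can be checked before starting a run. The sketch below uses the standard command -v builtin; require_tool is a hypothetical helper name.

```shell
# Hypothetical helper (not part of OrthoHMM): print the resolved location of
# a tool if it is on PATH, otherwise report the problem and fail.
require_tool() {
  local tool="$1"
  if command -v "$tool" >/dev/null 2>&1; then
    command -v "$tool"
  else
    printf 'error: %s not found in PATH; pass its full path instead\n' "$tool" >&2
    return 1
  fi
}
```

For example, running require_tool phmmer and require_tool mcl shows whether the defaults will work; if either call fails, the full path can be supplied with -p or -m as shown above.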
Inflation Value
MCL inflation parameter for clustering genes into orthologous groups. Lower values are more permissive, resulting in larger orthogroups; higher values are stricter, resulting in smaller orthogroups. The default value is 1.5.
# use an inflation value of 1.5 during mcl clustering
orthohmm <path_to_directory_of_FASTA_files> -i 1.5
Stop
Similar to other ortholog calling algorithms, different steps in the OrthoHMM workflow can be CPU or memory intensive. Thus, users may want to stop OrthoHMM at certain steps to facilitate more practical resource allocation. There are three choices for when to stop the analysis: prepare, infer, and write.
prepare: Stop after preparing input files for the all-by-all search
infer: Stop after inferring the orthogroups
write: Stop after writing sequence files for the orthogroups. Currently, this is equivalent to not specifying a stop step.
# stop orthohmm after preparing the all-by-all search commands
orthohmm <path_to_directory_of_FASTA_files> --stop prepare
Start
Start analysis from a specific intermediate step. Currently, this can only be applied to the results from the all-by-all search.
search_res: Start analysis from all-by-all search results.
# start orthohmm after the all-by-all searches are complete
orthohmm <path_to_directory_of_FASTA_files> --start search_res
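The --stop and --start options can be combined into a staged run: prepare the inputs, run the all-by-all searches separately, then resume from the search results, optionally with a different E-value threshold. The sketch below uses only the options documented above; the function names are hypothetical, and how the searches are launched between the two stages is deliberately left out.

```shell
# Stage 1 (hypothetical wrapper): prepare inputs for the all-by-all search,
# then stop.
prepare_searches() {
  local fasta_dir="$1"
  orthohmm "$fasta_dir" --stop prepare
}

# Stage 2 happens outside this sketch: run the all-by-all searches, for
# example on a machine with more CPUs or memory.

# Stage 3 (hypothetical wrapper): resume from the search results and apply
# the E-value threshold given as the second argument.
resume_from_searches() {
  local fasta_dir="$1" evalue="$2"
  orthohmm "$fasta_dir" --start search_res -e "$evalue"
}
```

Splitting the stages into separate functions lets each one be submitted with its own resource request.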
All options
| Option | Usage and meaning |
|---|---|
| -h/--help | Print help message |
| -v/--version | Print software version |
| -o/--output_directory | Output directory name. Default: same directory as directory of FASTA files |
| -p/--phmmer | Path to phmmer from HMMER suite. Default: phmmer |
| -x/--substitution_matrix | Specify substitution matrix to use for generating the HMMs. Default: BLOSUM62 |
| -c/--cpu | Number of parallel CPU workers to use for multithreading. Default: auto detect |
| -s/--single_copy_threshold | Taxon occupancy threshold for single-copy orthologs. Default: 0.5 |
| -m/--mcl | Path to mcl software. Default: mcl |
| -i/--inflation_value | MCL inflation parameter. Default: 1.5 |
| --stop | Stop OrthoHMM run at a specific step. Default: None |
| --start | Start OrthoHMM run at a specific step. Default: None |