Advanced Usage


OrthoHMM output

All OrthoHMM outputs have the prefix orthohmm so that they are easy to find.

  • orthohmm_gene_count.txt
    • A matrix of gene counts per taxon for each orthogroup.

  • orthohmm_orthogroups.txt
    • Genes present in each orthogroup.

  • orthohmm_single_copy_orthogroups.txt
    • A single-column list of single-copy orthologs.

  • orthohmm_orthogroups
    • A directory of FASTA files wherein each file is an orthogroup.

  • orthohmm_single_copy_orthogroups
    • A directory of FASTA files wherein each file is a single-copy ortholog.

    • Headers are modified to have taxon names come before the gene identifier.

    • Taxon names are the file name excluding the extension.

    • Taxon name and gene identifier are separated by a pipe symbol “|”.

    • This aims to help streamline phylogenomic workflows wherein sequences will be concatenated downstream based on taxon names.

  • orthohmm_working_res
    • A directory of intermediate files, such as all-by-all search results, network edges, and clusters inferred from network edges.
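
The header convention described above (taxon name, a pipe, then the gene identifier) makes it straightforward to recover taxon names with standard text tools. The following is a minimal sketch; the file name, taxon names, gene identifiers, and sequences are made up for illustration and are not OrthoHMM output.

```shell
# Write a small, made-up FASTA file that follows the "taxon|gene" header convention.
tmp_dir=$(mktemp -d)
cat > "$tmp_dir/example_orthogroup.fa" <<'EOF'
>species_A|gene_001
MKTAYIAKQR
>species_B|gene_417
MKTAYLAKQR
EOF

# Keep header lines, drop the leading ">", and print the text before the pipe.
taxa=$(grep '^>' "$tmp_dir/example_orthogroup.fa" | sed 's/^>//' | cut -d '|' -f 1)
echo "$taxa"

rm -r "$tmp_dir"
```

Because every file uses the same taxon names, a downstream concatenation step can key on this first field.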


The remaining sections describe the features and options of OrthoHMM.


Input Directory

A directory of FASTA files containing protein sequences. Files must have one of the extensions .fa, .faa, .fas, .fasta, .pep, or .prot; OrthoHMM automatically identifies files with these extensions and uses them for analysis. This directory must be the first argument.

# specify input directory of FASTA files
orthohmm <path_to_directory_of_FASTA_files>
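
Before a run, it can help to preview which files would be recognized and which taxon names they imply (the file name without its extension). The helper below is a sketch, not part of OrthoHMM: the function name is hypothetical, and it assumes only the top level of the directory is inspected.

```shell
# Hypothetical helper: list files in directory "$1" whose extension is one
# OrthoHMM recognizes, then print the taxon name implied by each file name.
# Assumption: only the top level of the directory is inspected.
list_taxa() {
    for f in "$1"/*; do
        [ -f "$f" ] || continue
        case "$f" in
            *.fa|*.faa|*.fas|*.fasta|*.pep|*.prot)
                name=$(basename "$f")
                echo "${name%.*}"   # strip the final extension
                ;;
        esac
    done
}
```

Calling list_taxa with the path to the input directory prints one taxon name per recognized file; files with other extensions are skipped.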

Output Directory

Name of the directory in which to store OrthoHMM results. This directory should already exist. By default, result files are written to the input directory of FASTA files. (-o, --output_directory)

# specify output directory
orthohmm <path_to_directory_of_FASTA_files> -o <output_directory>
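
Since the output directory should exist before OrthoHMM runs, one option is to create it in the same script. This is a sketch; both directory names are hypothetical placeholders.

```shell
fasta_dir="proteomes"        # hypothetical input directory
out_dir="orthohmm_results"   # hypothetical output directory

# Create the output directory up front; -p makes this a no-op if it exists.
mkdir -p "$out_dir"

# Run only if orthohmm is installed and the input directory is present.
if command -v orthohmm >/dev/null 2>&1 && [ -d "$fasta_dir" ]; then
    orthohmm "$fasta_dir" -o "$out_dir"
fi
```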

Phmmer

Path to the phmmer executable from the HMMER suite. By default, phmmer is assumed to be in the PATH variable; in other words, phmmer can be invoked by typing phmmer.

# specify path to phmmer executable
orthohmm <path_to_directory_of_FASTA_files> -p /path/to/phmmer

E-value Threshold

E-value threshold used to filter phmmer results. The threshold is applied after the searches are run, so users can change it when resuming with the --start argument without repeating the searches. The default is 0.0001.

# specify e-value threshold
orthohmm <path_to_directory_of_FASTA_files> -e 0.0001

Substitution Matrix

Residue alignment probabilities will be determined from the specified substitution matrix. Supported substitution matrices include: BLOSUM45, BLOSUM50, BLOSUM62, BLOSUM80, BLOSUM90, PAM30, PAM70, PAM120, and PAM240. The default is BLOSUM62.

# specify using the BLOSUM80 substitution matrix
orthohmm <path_to_directory_of_FASTA_files> -x BLOSUM80

CPU

Number of CPU workers for multithreading during sequence search. This argument is used by phmmer during all-vs-all comparisons. By default, the number of CPUs available will be auto-detected.

# run orthohmm using 8 CPUs
orthohmm <path_to_directory_of_FASTA_files> -c 8

Single-Copy Threshold

Taxon occupancy threshold when identifying single-copy orthologs. By default, the threshold is 50% taxon occupancy, specified as a fraction (0.5).

# specify single-copy threshold as a fraction
orthohmm <path_to_directory_of_FASTA_files> -s 0.5
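
As a worked example of what a fractional threshold means in practice, the sketch below computes how many taxa must be represented for a given dataset size. It assumes an orthogroup qualifies when it covers at least threshold × total taxa; the exact comparison OrthoHMM uses is not stated here, and the function name is hypothetical.

```shell
# Hypothetical helper: smallest number of taxa that satisfies an occupancy
# threshold, assuming the comparison is "at least threshold x total taxa".
min_taxa() {
    awk -v n="$1" -v t="$2" 'BEGIN {
        need = n * t
        whole = int(need)
        if (whole < need) whole++   # round up to a whole number of taxa
        print whole
    }'
}

min_taxa 12 0.5    # prints 6
min_taxa 10 0.75   # prints 8
```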

MCL

Path to the mcl executable from the MCL software. By default, mcl is assumed to be in the PATH variable; in other words, mcl can be invoked by typing mcl.

# specify path to mcl executable
orthohmm <path_to_directory_of_FASTA_files> -m /path/to/mcl
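
To check whether phmmer and mcl can be found on PATH before relying on the defaults, a small check like the following may help. It is a sketch; the function name is hypothetical and not part of OrthoHMM.

```shell
# Hypothetical helper: report whether a tool can be found on PATH.
check_tool() {
    if command -v "$1" >/dev/null 2>&1; then
        echo "$1: $(command -v "$1")"
    else
        echo "$1: not found on PATH" >&2
        return 1
    fi
}

# "|| true" keeps the script going so both tools are always reported.
check_tool phmmer || true
check_tool mcl || true
```

If either tool is reported as missing, its location can be passed explicitly with -p or -m.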

Inflation Value

MCL inflation parameter for clustering genes into orthologous groups. Lower values are more permissive, resulting in larger orthogroups. Higher values are stricter, resulting in smaller orthogroups. The default value is 1.5.

# use an inflation value of 1.5 during mcl clustering
orthohmm <path_to_directory_of_FASTA_files> -i 1.5

Stop

As with other ortholog-calling algorithms, some steps in the OrthoHMM workflow can be CPU- or memory-intensive. Users may therefore want to stop OrthoHMM at certain steps to facilitate more practical resource allocation. There are three choices for when to stop the analysis: prepare, infer, and write.

  • prepare: Stop after preparing input files for the all-by-all search.

  • infer: Stop after inferring the orthogroups.

  • write: Stop after writing sequence files for the orthogroups. Currently, this is equivalent to not specifying a stop step.

# stop orthohmm after preparing the all-by-all search commands
orthohmm <path_to_directory_of_FASTA_files> --stop prepare

Start

Start analysis from a specific intermediate step. Currently, this can only be applied to the results from the all-by-all search.

  • search_res: Start analysis from all-by-all search results.

# start orthohmm from completed all-by-all search results
orthohmm <path_to_directory_of_FASTA_files> --start search_res
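
Because the e-value threshold is applied after the searches (see E-value Threshold above), --start search_res can be combined with -e to infer orthogroups again at a different threshold. The sketch below assumes an earlier run already produced all-by-all search results for the same input directory; the directory name and threshold value are hypothetical.

```shell
fasta_dir="proteomes"   # hypothetical input directory from an earlier run
evalue="0.00001"        # hypothetical, stricter than the 0.0001 default

# Print the command first so it can be inspected or logged.
cmd="orthohmm $fasta_dir --start search_res -e $evalue"
echo "$cmd"

# Run only if orthohmm is installed and the input directory is present.
if command -v orthohmm >/dev/null 2>&1 && [ -d "$fasta_dir" ]; then
    orthohmm "$fasta_dir" --start search_res -e "$evalue"
fi
```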

All options

Option                       Usage and meaning
-h/--help                    Print help message
-v/--version                 Print software version
-o/--output_directory        Output directory name. Default: same directory as directory of FASTA files
-p/--phmmer                  Path to phmmer from HMMER suite. Default: phmmer
-e                           E-value threshold for filtering phmmer results. Default: 0.0001
-x/--substitution_matrix     Substitution matrix to use for generating the HMMs. Default: BLOSUM62
-c/--cpu                     Number of parallel CPU workers to use for multithreading. Default: auto detect
-s/--single_copy_threshold   Taxon occupancy threshold for single-copy orthologs. Default: 0.5
-m/--mcl                     Path to mcl software. Default: mcl
-i/--inflation_value         MCL inflation parameter. Default: 1.5
--stop                       Stop OrthoHMM run at a specific step. Default: None
--start                      Start OrthoHMM run at a specific step. Default: None