Usage

This section covers basic OrthoSNAP usage. We recommend going through the tutorial as well, which gives a worked example of what OrthoSNAP does.

OrthoSNAP fits into a larger pipeline for constructing phylogenomic data matrices. After orthologous groups of genes are called, single-copy orthologous groups usually make up only a small fraction of all orthologous groups. Typically, only these single-copy orthologous groups are used for phylogenomic analyses. OrthoSNAP identifies subgroups of single-copy orthologous genes within multi-copy orthologous groups of genes.

The inputs for OrthoSNAP are the phylogeny of the orthologous group of genes and the unaligned FASTA file of sequences used to infer the phylogeny. The output of OrthoSNAP is a set of multi-FASTA files of subgroups of single-copy orthologous genes appropriate for downstream analyses.


Basic usage

The following command is the simplest form of an OrthoSNAP run and will suffice for most use cases:

$ orthosnap -f orthogroup_of_genes.faa -t phylogeny_of_orthogroup_of_genes.tre

Here, the two required arguments are shown: the -f/--fasta argument, which specifies the unaligned orthologous group of sequences, and the -t/--tree argument, which specifies the phylogeny inferred from the orthologous group of sequences. The sequence file should be in FASTA format and the tree file should be in Newick format.
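Because the tree is inferred from the sequences in the FASTA file, the two inputs would be expected to describe the same set of sequences. The Python sketch below is one way to check this before a run. It is not part of OrthoSNAP: the helper names (fasta_ids, newick_tip_labels, compare_inputs) and the example sequence names are hypothetical, and the Newick handling is a simplification that assumes unquoted tip labels.

```python
import re


def fasta_ids(fasta_text):
    """Return sequence identifiers (first word of each '>' header line)."""
    ids = []
    for line in fasta_text.splitlines():
        if line.startswith(">"):
            header = line[1:].strip()
            if header:
                ids.append(header.split()[0])
    return ids


def newick_tip_labels(newick_text):
    """Return tip labels from a Newick string.

    Simplification (assumption): labels are unquoted. A tip label is any
    name that directly follows '(' or ','; labels that follow ')' are
    internal-node support values and are skipped.
    """
    return re.findall(r"[(,]\s*([^(),:;\s]+)", newick_text)


def compare_inputs(fasta_text, newick_text):
    """Return (ids only in the FASTA file, ids only in the tree)."""
    seqs = set(fasta_ids(fasta_text))
    tips = set(newick_tip_labels(newick_text))
    return sorted(seqs - tips), sorted(tips - seqs)


# Hypothetical example inputs
example_fasta = ">seq1\nMKT\n>seq2\nMKV\n>seq3\nMRT\n"
example_tree = "((seq1:0.1,seq2:0.2)95:0.05,seq3:0.3);"

only_in_fasta, only_in_tree = compare_inputs(example_fasta, example_tree)
print("only in FASTA:", only_in_fasta)
print("only in tree:", only_in_tree)
```

Two empty lists indicate that the FASTA file and the tree name the same sequences.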


Accounting for tree uncertainty

As part of the OrthoSNAP pipeline, species-specific inparalogs and paralogs are pruned following the approach described in PhyloTreePruner. To do so, poorly supported bipartitions are collapsed to account for tree uncertainty.

The default threshold for collapsing bipartitions is 80 and can be modified using the -s/--support argument. For example, bipartitions can be collapsed at a threshold of 70 with the following command:

$ orthosnap -f orthogroup_of_genes.faa -t phylogeny_of_orthogroup_of_genes.tre -s 70
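To illustrate what collapsing a bipartition means, here is a minimal Python sketch on a toy tree. It is not OrthoSNAP's implementation: the Node class, the function names, and the tip names are hypothetical, and the sketch assumes that a bipartition is collapsed when its support is strictly below the threshold. Collapsing replaces a weakly supported internal branch with a polytomy by attaching its children directly to its parent.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class Node:
    name: str = ""                   # tip label (empty for internal nodes)
    support: Optional[float] = None  # bipartition support on internal nodes
    children: List["Node"] = field(default_factory=list)


def collapse_low_support(node, threshold=80):
    """Collapse, in place, internal branches with support below threshold."""
    kept = []
    for child in node.children:
        collapse_low_support(child, threshold)
        is_internal = bool(child.children)
        if is_internal and child.support is not None and child.support < threshold:
            # Splice the grandchildren into this node, creating a polytomy.
            kept.extend(child.children)
        else:
            kept.append(child)
    node.children = kept
    return node


def topology(node):
    """Return the tree as nested tuples of tip names, for easy inspection."""
    if not node.children:
        return node.name
    return tuple(topology(child) for child in node.children)


def make_tree():
    """Toy tree: ((A,B)60,(C,D)95);"""
    return Node(children=[
        Node(support=60, children=[Node("A"), Node("B")]),
        Node(support=95, children=[Node("C"), Node("D")]),
    ])


print(topology(collapse_low_support(make_tree(), threshold=80)))
print(topology(collapse_low_support(make_tree(), threshold=50)))
```

With a threshold of 80, the (A,B) bipartition (support 60) is collapsed while (C,D) (support 95) is retained; with a threshold of 50, both are retained.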

Specifying which inparalog to keep

During species-specific inparalog pruning, it is standard practice in transcriptomic analysis to keep the longest inparalog. However, this may not always be the user’s preference. OrthoSNAP allows flexibility in which inparalog is kept through the -ip/--inparalog_to_keep parameter. Specifically, users can choose to keep the sequence with the longest length (default; longest_seq_len), the shortest length (shortest_seq_len), or, in the case of three or more inparalogs, the median length (median_seq_len). Users can also choose which inparalog to keep based on tree-based metrics: the longest branch length (longest_branch_len), the shortest branch length (shortest_branch_len), or, in the case of three or more inparalogs, the median branch length (median_branch_len).

For example, the inparalog with the shortest branch length can be kept using the following command:

$ orthosnap -f orthogroup_of_genes.faa -t phylogeny_of_orthogroup_of_genes.tre -ip shortest_branch_len

Alternatively, the inparalog with the median sequence length can be kept using the following command:

$ orthosnap -f orthogroup_of_genes.faa -t phylogeny_of_orthogroup_of_genes.tre -ip median_seq_len

Again, following common practice in transcriptomics, the default is to keep the longest sequence because, at least in theory, it is the most complete gene annotation.
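The six -ip/--inparalog_to_keep values described above can be read as a ranking rule (longest, shortest, or median) applied to a metric (sequence length or branch length). The Python sketch below illustrates that reading. It is not OrthoSNAP's code: the function name, the data layout, and the gene names are hypothetical. Tie-breaking by name, the choice of the upper-middle element for an even number of inparalogs, and raising an error when a median is requested for fewer than three inparalogs are assumptions made for the example.

```python
VALID_OPTIONS = {
    "longest_seq_len", "shortest_seq_len", "median_seq_len",
    "longest_branch_len", "shortest_branch_len", "median_branch_len",
}


def choose_inparalog(inparalogs, option="longest_seq_len"):
    """Pick one inparalog name.

    inparalogs maps a gene name to {"seq_len": int, "branch_len": float}.
    """
    if option not in VALID_OPTIONS:
        raise ValueError(f"unknown option: {option}")
    # "median_branch_len" -> rank "median", metric "branch_len"
    rank, metric = option.split("_", 1)
    # Sort by the metric; ties are broken by name (assumption).
    ranked = sorted(inparalogs, key=lambda name: (inparalogs[name][metric], name))
    if rank == "shortest":
        return ranked[0]
    if rank == "longest":
        return ranked[-1]
    # Median applies only to three or more inparalogs.
    if len(ranked) < 3:
        raise ValueError("median requires three or more inparalogs")
    return ranked[len(ranked) // 2]


# Hypothetical inparalogs from one species
example = {
    "geneA": {"seq_len": 310, "branch_len": 0.12},
    "geneB": {"seq_len": 295, "branch_len": 0.05},
    "geneC": {"seq_len": 402, "branch_len": 0.31},
}

for option in sorted(VALID_OPTIONS):
    print(option, "->", choose_inparalog(example, option))
```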

Report inparalog handling

To report inparalogs and specify which was kept per SNAP-OG, use the -rih/--report_inparalog_handling argument. The resulting file, which will have the suffix “.inparalog_report.txt”, will have three columns:
- col 1 is the orthogroup file
- col 2 is the inparalog that was kept
- col 3 lists the inparalog(s) that were trimmed, separated by semicolons (“;”)

To generate this file, use the following command:

$ orthosnap -f orthogroup_of_genes.faa -t phylogeny_of_orthogroup_of_genes.tre -rih
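The three-column report can be read into a simple structure for downstream checks. The Python sketch below shows one way to do this. It assumes the columns are tab-separated, which is not stated above and should be verified against a real report file. The function name, the file names, and the gene names are hypothetical.

```python
def parse_inparalog_report(lines):
    """Parse report lines into a list of dictionaries.

    Assumption: columns are separated by tabs. Lines that do not have
    exactly three columns are skipped.
    """
    records = []
    for line in lines:
        fields = line.rstrip("\n").split("\t")
        if len(fields) != 3:
            continue
        orthogroup, kept, trimmed = fields
        records.append({
            "orthogroup": orthogroup,
            "kept": kept,
            # Column 3 holds one or more names separated by ";"
            "trimmed": [name.strip() for name in trimmed.split(";") if name.strip()],
        })
    return records


# Hypothetical report content
example_report = (
    "OG1.faa\tgeneC\tgeneA;geneB\n"
    "OG2.faa\tgeneX\tgeneY\n"
)

for record in parse_inparalog_report(example_report.splitlines()):
    print(record["orthogroup"], "kept", record["kept"], "trimmed", record["trimmed"])
```

For a file on disk, an open file handle can be passed in place of the list of lines.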

All options

Option                               Usage and meaning
-h/--help                            Print help message
-v/--version                         Print software version
-t/--tree                            Input tree file (format: newick)
-s/--support                         Bipartition support threshold for collapsing uncertain branches (default: 80)
-o/--occupancy                       Occupancy threshold for identifying a subgroup of interest (default: 50%)
-r/--rooted                          Boolean argument for whether the input phylogeny is already rooted (default: false)
-st/--snap_trees                     Boolean argument for whether trees of SNAP-OGs should be output (default: false)
-ip/--inparalog_to_keep              Which sequence to keep in the case of species-specific inparalogs, using sequence- or tree-based options (default: longest_seq_len)
-op/--output_path                    Path to which output files are written (default: same as -f input)
-rih/--report_inparalog_handling     Create a summary file of which inparalogs were kept and which were trimmed

*For genome-scale analyses, we recommend setting the -o/--occupancy parameter to the same value for all large gene families so that the minimum SNAP-OG occupancy is the same for all SNAP-OGs.
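One way to follow this recommendation is to generate every OrthoSNAP command from a single shared occupancy setting. The Python sketch below only builds and prints the commands. The helper name and the file names are hypothetical, and the occupancy value shown is a placeholder: the form of value that -o/--occupancy accepts should be confirmed with orthosnap -h.

```python
import shlex


def build_orthosnap_command(fasta, tree, occupancy=None):
    """Return the argument list for one OrthoSNAP run."""
    command = ["orthosnap", "-f", fasta, "-t", tree]
    if occupancy is not None:
        command += ["-o", str(occupancy)]
    return command


# Placeholder value (assumption); check `orthosnap -h` for the accepted form.
SHARED_OCCUPANCY = "0.5"

# Hypothetical FASTA/tree pairs, one per orthologous group
orthogroups = [
    ("OG0000001.faa", "OG0000001.tre"),
    ("OG0000002.faa", "OG0000002.tre"),
]

commands = [
    build_orthosnap_command(fasta, tree, SHARED_OCCUPANCY)
    for fasta, tree in orthogroups
]

for command in commands:
    # Print the command; it could instead be passed to subprocess.run(command)
    print(shlex.join(command))
```

Keeping the value in one variable makes it harder for individual gene families to end up with different thresholds.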