Usage

PhyKIT helps process and analyze multiple sequence alignments and phylogenies.

Generally, all functions are designed to help understand the contents of alignments (e.g., gc content or the number of parsimony informative sites) and the shape of trees (e.g., treeness, degree of violation of a molecular clock).

Some help messages indicate that summary statistics are reported (e.g., bipartition_support_stats). Summary statistics include mean, median, 25th percentile, 75th percentile, minimum, maximum, standard deviation, and variance. These functions typically have a verbose option that allows users to get the underlying data used to calculate summary statistics.


General usage

Calling functions

phykit <command> [optional command arguments]

Command specific help messages can be viewed by adding a -h/\-\-help argument after the command. For example, to see the help message for the command 'treeness', execute:

phykit treeness -h
# or
phykit treeness --help

Function aliases

Each function comes with aliases to save the user some key strokes. For example, to get the help message for the 'treeness' function, you can type:

phykit tness -h

Command line interfaces

As of version 1.2.0, all functions (including aliases) can be executed using a command line interface that starts with pk_. For example, instead of typing the previous command to get the help message of the treeness function, you can type:

pk_treeness -h
# or
pk_tness -h

All possible function names are specified at the top of each function section.


Functions by analytical category

The functions above are organized by input type. Below, the same functions are grouped by analytical purpose to help you find the right tool for your analysis.

Alignment quality & statistics

Alignment manipulation

Tree summary statistics

Tree manipulation & utilities

Tree comparison & consensus

Phylogenetic signal

Trait evolution

Phylogenetic comparative methods

Evolutionary rate analysis

Homology assessment

Saturation & model adequacy


Alignment-based functions

Alignment entropy

Function names: alignment_entropy; aln_entropy; entropy
Command line interface: pk_alignment_entropy; pk_aln_entropy; pk_entropy

Calculate alignment entropy.

Site-wise entropy is calculated using Shannon entropy. By default, PhyKIT reports the mean entropy across all sites in the alignment. With the -v/--verbose option, PhyKIT reports entropy for each site.

phykit alignment_entropy <alignment> [-v/--verbose] [--plot] [--plot-output <path>] [--json]

Example output (default):

0.657

Example output (-v):

1       0.0
2       1.0
3       0.971

Options:
<alignment>: first argument after function name should be an alignment file
-v/\-\-verbose: optional argument to print entropy for each site
--plot: save a per-site alignment entropy plot
--plot-output: output path for plot (default: alignment_entropy_plot.png)
--json: optional argument to print results as JSON


Alignment length

../_images/aln_len.png

Function names: alignment_length; aln_len; al
Command line interface: pk_alignment_length; pk_aln_len; pk_al

Length of an input alignment is calculated using this function.

Longer alignments are associated with strong phylogenetic signal.

Association between alignment length and phylogenetic signal was determined by Shen et al., Genome Biology and Evolution (2016), doi: 10.1093/gbe/evw179.

phykit aln_len <alignment> [--json]

Options:
<alignment>: first argument after function name should be an alignment file
--json: optional argument to print results as JSON


Alignment length no gaps

../_images/aln_len_no_gaps.png

Function names: alignment_length_no_gaps; aln_len_no_gaps; alng
Command line interface: pk_alignment_length_no_gaps; pk_aln_len_no_gaps; pk_alng

Calculate alignment length excluding sites with gaps.

Longer alignments when excluding sites with gaps is associated with strong phylogenetic signal.

PhyKIT reports three tab delimited values: col1: number of sites without gaps col2: total number of sites col3: percentage of sites without gaps

Association between alignment length when excluding sites with gaps and phylogenetic signal was determined by Shen et al., Genome Biology and Evolution (2016), doi: 10.1093/gbe/evw179.

phykit aln_len_no_gaps <alignment> [--json]

Options:
<alignment>: first argument after function name should be an alignment file
--json: optional argument to print results as JSON


Alignment outlier taxa

Function names: alignment_outlier_taxa; outlier_taxa; aot
Command line interface: pk_alignment_outlier_taxa; pk_outlier_taxa; pk_aot

Identify potential outlier taxa in an alignment and explicitly report why each taxon was flagged.

The following features are evaluated per taxon: 1) gap_rate: fraction of gap/ambiguous symbols
2) occupancy: fraction of valid symbols
3) composition_distance: Euclidean distance from the median composition profile
4) long_branch_proxy: mean pairwise sequence distance to other taxa
5) rcvt: relative composition variability per taxon
6) entropy_burden: average site entropy over this taxon's valid positions

If a taxon exceeds one or more feature-specific thresholds, PhyKIT reports the exact feature(s), observed value(s), threshold(s), and explanation(s) for the flag.

phykit alignment_outlier_taxa <alignment> [--gap-z <float>] [--composition-z <float>] [--distance-z <float>] [--rcvt-z <float>] [--occupancy-z <float>] [--entropy-z <float>] [--json]

Example output:

features_evaluated      gap_rate,occupancy,composition_distance,long_branch_proxy,rcvt,entropy_burden
thresholds      gap_rate>0.0;occupancy<0.4;composition_distance>0.1;long_branch_proxy>0.4181;rcvt>0.0095;entropy_burden>0.35
taxon_d composition_distance=1.3454>0.1;long_branch_proxy=1.0>0.4181    Unusual sequence composition profile relative to other taxa. | High mean pairwise sequence distance to other taxa.
taxon_e gap_rate=0.6>0.0;occupancy=0.4<0.4      High fraction of gap/ambiguous symbols compared to other taxa. | Low fraction of valid symbols compared to other taxa.

Options:
<alignment>: first argument after function name should be an alignment file
--gap-z: z-threshold used for high-gap outlier detection (default: 3.0)
--composition-z: z-threshold used for composition-distance outlier detection (default: 3.0)
--distance-z: z-threshold used for long-branch-proxy outlier detection (default: 3.0)
--rcvt-z: z-threshold used for RCVT outlier detection (default: 3.0)
--occupancy-z: z-threshold used for low-occupancy outlier detection (default: 3.0)
--entropy-z: z-threshold used for entropy-burden outlier detection (default: 3.0)
--json: optional argument to print results as JSON


Alignment recoding

Function names: alignment_recoding; aln_recoding; recode
Command line interface: pk_alignment_recoding; pk_aln_recoding; pk_recode

Recode alignments using reduced character states.

Alignments can be recoded using established or custom recoding schemes. Recoding schemes are specified using the -c/--code argument. Custom recoding schemes can be used and should be formatted as a two column file wherein the first column is the recoded character and the second column is the character in the alignment.

phykit alignment_recoding <fasta> [-c/--code <code>] [--json]

Codes for which recoding scheme to use:
RY-nucleotide
R = purines (i.e., A and G)
Y = pyrimidines (i.e., T and C)

SandR-6
0 = A, P, S, and T
1 = D, E, N, and G
2 = Q, K, and R
3 = M, I, V, and L
4 = W and C
5 = F, Y, and H

KGB-6
0 = A, G, P, and S
1 = D, E, N, Q, H, K, R, and T
2 = M, I, and L
3 = W
4 = F and Y
5 = C and V

Dayhoff-6
0 = A, G, P, S, and T
1 = D, E, N, and Q
2 = H, K, and R
3 = I, L, M, and V
4 = F, W, and Y
5 = C

Dayhoff-9
0 = D, E, H, N, and Q
1 = I, L, M, and V
2 = F and Y
3 = A, S, and T
4 = K and R
5 = G
6 = P
7 = C
8 = W

Dayhoff-12
0 = D, E, and Q
1 = M, L, I, and V
2 = F and Y
3 = K, H, and R
4 = G
5 = A
6 = P
7 = S
8 = T
9 = N
A = W
B = C

Dayhoff-15
0 = D, E, and Q
1 = M and L
2 = I and V
3 = F and Y
4 = G
5 = A
6 = P
7 = S
8 = T
9 = N
A = K
B = H
C = R
D = W
E = C

Dayhoff-18
0 = F and Y
1 = M and L
2 = I
3 = V
4 = G
5 = A
6 = P
7 = S
8 = T
9 = D
A = E
B = Q
C = N
D = K
E = H
F = R
G = W
H = C

Options:
<alignment>: first argument after function name should be an alignment file
-c/--code: argument to specify the recoding scheme to use
--json: optional argument to print results as JSON


Column score

../_images/column_score.png

Function names: column_score; cs
Command line interface: pk_column_score; pk_cs

Calculates column score.

Column is an accuracy metric for a multiple alignment relative to a reference alignment. It is calculated by summing the correctly aligned columns over all columns in an alignment. Thus, values range from 0 to 1 and higher values indicate more accurate alignments.

Column score is calculated following Thompson et al., Nucleic Acids Research (1999), doi: 10.1093/nar/27.13.2682.

phykit column_score <alignment> --reference <reference_alignment> [--json]

Options:
<alignment>: first argument after function name should be a query fasta alignment file to be scored for accuracy
-r/\-\-reference: reference alignment to compare the query alignment to
--json: optional argument to print results as JSON


Composition per taxon

Function names: composition_per_taxon; comp_taxon; comp_tax
Command line interface: pk_composition_per_taxon; pk_comp_taxon; pk_comp_tax

Calculate sequence composition per taxon in an alignment.

Composition is reported as semicolon-separated symbol:frequency pairs for each taxon. Frequencies are calculated from valid (non-gap/non-ambiguous) characters. Symbol order is alphabetical.

phykit composition_per_taxon <alignment> [--json]

Example output:

1       A:0.4;C:0.0;G:0.2;T:0.4
2       A:0.5;C:0.0;G:0.25;T:0.25

Options:
<alignment>: first argument after function name should be an alignment file
--json: optional argument to print results as JSON


Compositional bias per site

Function names: compositional_bias_per_site; comp_bias_per_site; cbps
Command line interface: pk_compositional_bias_per_site; pk_comp_bias_per_site; pk_cbps

Calculates compositional bias per site in an alignment.

Site-wise chi-squared tests are conducted in an alignment to detect compositional biases. PhyKIT outputs four columns:
col 1: index in alignment
col 2: chi-squared statistic (higher values indicate greater bias)
col 3: multi-test corrected p-value (Benjamini-Hochberg false discovery rate procedure)
col 4: uncorrected p-value

phykit comp_bias_per_site <alignment> [--plot] [--plot-output <path>] [--json]

Options:
<alignment>: first argument after function name should be a query fasta alignment to calculate the site-wise compositional bias of
--plot: save a Manhattan-style plot of site-wise compositional bias
--plot-output: output path for plot (default: compositional_bias_per_site_plot.png)
--json: optional argument to print results as JSON


Create concatenation matrix

../_images/create_concat_matrix.png

Function names: create_concatenation_matrix, create_concat, cc
Command line interface: pk_create_concatenation_matrix, pk_create_concat, pk_cc

Create a concatenated alignment file. This function is used to help in the construction of multi-locus data matrices.

PhyKIT will output three files: 1) A fasta file with '.fa' appended to the prefix specified with the -p/\-\-prefix parameter. 2) A partition file ready for input into RAxML or IQ-tree. 3) An occupancy file that summarizes the taxon occupancy per sequence.

phykit create_concat -a <file> -p <string> [--threshold <float>] [--plot-occupancy] [--plot-output <path>] [--json]

Options:
-a/\-\-alignment: alignment list file. File should contain a single column list of alignment sequence files to concatenate into a single matrix. Provide path to files relative to working directory or provide absolute path.
-p/\-\-prefix: prefix of output files
--threshold: minimum fraction of informative (non-gap, non-ambiguous) sites across the concatenated alignment for a taxon to be included. Taxa whose effective occupancy falls below this value are excluded from the output. Set to 0 to disable filtering (default: 0).
--plot-occupancy: optional argument to output an occupancy map figure where x-axis shows concatenated positions with gene boundaries and y-axis shows taxa sorted by total occupancy. Colors denote represented characters, gap/ambiguous characters in present genes, and fully absent gene blocks.
--plot-output: optional custom output path for occupancy map figure (default: <prefix>.occupancy.png).
--json: optional argument to print summary metadata as JSON


Evolutionary Rate per Site

Function names: evolutionary_rate_per_site; evo_rate_per_site; erps
Command line interface: pk_evolutionary_rate_per_site; pk_evo_rate_per_site; pk_erps

Estimate evolutionary rate per site.

Evolutionary rate per site is one minus the sum of squared frequency of different characters at a given site. Values may range from 0 (slow evolving; no diversity at the given site) to 1 (fast evolving; all characters appear only once).

phykit evo_rate_per_site <alignment> [--plot] [--plot-output <path>] [--json]

Options:
<alignment>: first argument after function name should be a query fasta alignment to calculate the site-wise evolutionary rate of
--plot: save a per-site evolutionary-rate plot
--plot-output: output path for plot (default: evolutionary_rate_per_site_plot.png)
--json: optional argument to print results as JSON


Faidx

../_images/faidx.png

Function names: faidx; get_entry; ge
Command line interface: pk_faidx; pk_get_entry; pk_ge

Extracts sequence entry from fasta file.

This function works similarly to the faidx function in samtools, but does not requiring an indexing step.

To obtain multiple entries, input multiple entries separated by a comma (,). For example, if you want entries named "seq_0" and "seq_1", the string "seq_0,seq_1" should be associated with the -e argument.

phykit faidx <fasta> -e/--entry <fasta entry> [--json]

Options:
<fasta>: first argument after function name should be a fasta file
-e/\-\-entry: entry name to be extracted from the inputted fasta file
--json: optional argument to print results as JSON


Guanine-cytosine (GC) content

../_images/gc_content.png

Function names: gc_content; gc
Command line interface: pk_gc_content; pk_gc

Calculate GC content of a fasta file.

GC content is negatively correlated with phylogenetic signal.

If there are multiple entries, use the -v/\-\-verbose option to determine the GC content of each fasta entry separately. Association between GC content and phylogenetic signal was determined by Shen et al., Genome Biology and Evolution (2016), doi: 10.1093/gbe/evw179.

phykit gc_content <fasta> [-v/--verbose] [--json]

Options:
<fasta>: first argument after function name should be a fasta file
-v/\-\-verbose: optional argument to print the GC content of each fasta entry
--json: optional argument to print results as JSON


Mask alignment

Function names: mask_alignment; mask_aln; mask
Command line interface: pk_mask_alignment; pk_mask_aln; pk_mask

Mask alignment sites based on threshold criteria.

Sites are retained when they pass all active thresholds: maximum gap fraction, minimum occupancy, and maximum site entropy.

phykit mask_alignment <alignment> [-g/--max_gap <float>] [-o/--min_occupancy <float>] [-e/--max_entropy <float>] [--json]

Options:
<alignment>: first argument after function name should be an alignment file
-g/\-\-max_gap: maximum allowed fraction of missing/invalid characters at a site (default: 1.0)
-o/\-\-min_occupancy: minimum required occupancy at a site (default: 0.0)
-e/\-\-max_entropy: maximum allowed site entropy (default: no filter)
--json: optional argument to print results as JSON


Occupancy per taxon

Function names: occupancy_per_taxon; occupancy_taxon; occ_tax
Command line interface: pk_occupancy_per_taxon; pk_occupancy_taxon; pk_occ_tax

Calculate occupancy per taxon in an alignment.

Occupancy is the fraction of valid (non-gap/non-ambiguous) characters for each taxon.

phykit occupancy_per_taxon <alignment> [--json]

Example output:

1       0.8333
2       0.6667

Options:
<alignment>: first argument after function name should be an alignment file
--json: optional argument to print results as JSON


Pairwise identity

../_images/pairwise_identity.png

Function names: pairwise_identity; pairwise_id, pi
Command line interface: pk_pairwise_identity; pk_pairwise_id, pk_pi

Calculate the average pairwise identity among sequences.

Pairwise identities can be used as proxies for the evolutionary rate of sequences.

Pairwise identity is defined as the number of identical columns (including gaps) between two aligned sequences divided by the number of columns in the alignment. Summary statistics are reported unless used with the verbose option in which all pairwise identities will be reported.

An example of pairwise identities being used as a proxy for evolutionary rate can be found here: Chen et al. Genome Biology and Evolution (2017), doi: 10.1093/gbe/evx147.

phykit pairwise_identity <alignment> [-v/--verbose] [-e/--exclude_gaps] [--plot] [--plot-output <file>] [--json]

Options:
<alignment>: first argument after function name should be an alignment file
-v/\-\-verbose: optional argument to print identity per pair|br| -e/--exclude_gaps: if a site has a gap, ignore it
--plot: save a clustered pairwise-identity heatmap
--plot-output: output path for heatmap (default: pairwise_identity_heatmap.png)
--json: optional argument to print results as JSON


Parsimony informative sites

Function names: parsimony_informative_sites; pis
Command line interface: pk_parsimony_informative_sites; pk_pis

Calculate the number and percentage of parsimony informative sites in an alignment.

The number of parsimony informative sites in an alignment is associated with strong phylogenetic signal.

PhyKIT reports three tab delimited values: col1: number of parsimony informative sites col2: total number of sites col3: percentage of parsimony informative sites

Association between the number of parsimony informative sites and phylogenetic signal was determined by Shen et al., Genome Biology and Evolution (2016), doi: 10.1093/gbe/evw179 and Steenwyk et al., PLOS Biology (2020), doi: 10.1371/journal.pbio.3001007.

phykit parsimony_informative_sites <alignment> [--json]

Options:
<alignment>: first argument after function name should be an alignment file
--json: optional argument to print results as JSON


Plot alignment QC

Function names: plot_alignment_qc; plot_qc; paqc
Command line interface: pk_plot_alignment_qc; pk_plot_qc; pk_paqc

Generate a multi-panel alignment quality-control plot.

The figure includes: 1) occupancy per taxon
2) gap rate per taxon
3) composition distance vs long-branch proxy scatter
4) count of flagged outliers by feature

Outlier evaluation uses the same features as alignment_outlier_taxa: gap_rate, occupancy, composition_distance, long_branch_proxy, rcvt, and entropy_burden.

phykit plot_alignment_qc <alignment> [-o/--output <path>] [--width <float>] [--height <float>] [--dpi <int>] [--gap-z <float>] [--composition-z <float>] [--distance-z <float>] [--rcvt-z <float>] [--occupancy-z <float>] [--entropy-z <float>] [--json]

Options:
<alignment>: first argument after function name should be an alignment file
-o/\-\-output: output image path (default: alignment_qc.png)
--width: figure width in inches (default: 14.0)
--height: figure height in inches (default: 10.0)
--dpi: output image DPI (default: 300)
--gap-z: z-threshold for gap-rate outliers (default: 3.0)
--composition-z: z-threshold for composition-distance outliers (default: 3.0)
--distance-z: z-threshold for long-branch-proxy outliers (default: 3.0)
--rcvt-z: z-threshold for RCVT outliers (default: 3.0)
--occupancy-z: z-threshold for low-occupancy outliers (default: 3.0)
--entropy-z: z-threshold for entropy-burden outliers (default: 3.0)
--json: optional argument to print plot metadata and outlier summary as JSON


Protein-to-nucleotide alignment

Function names: thread_dna; pal2nal, p2n
Command line interface: pk_thread_dna; pk_pal2nal, pk_p2n

Thread DNA sequence onto a protein alignment to create a codon-based alignment.

This function requires input alignments are in fasta format. Codon alignments are then printed to stdout. Note, paired sequences are assumed to have the same name between the protein and nucleotide file. The order does not matter.

To thread nucleotide sequences over a trimmed amino acid alignment, provide PhyKIT with a log file specifying which sites have been trimmed and which have been kept. The log file must be formatted the same as the log files outputted by the alignment trimming toolkit ClipKIT (see -l in ClipKIT documentation.) Details about ClipKIT can be seen here: https://github.com/JLSteenwyk/ClipKIT.

If using a ClipKIT log file, the untrimmed protein alignment should be provided in the -p/--protein argument.

phykit thread_dna -p <file> -n <file> [-s] [--json]

Options:
-p/\-\-protein: protein alignment file
-n/\-\-nucleotide: nucleotide sequence file
-c/\-\-clipkit_log: clipkit outputted log file
-s/\-\-stop: boolean for whether or not stop codons should be kept. If used, stop codons will be removed.
--json: optional argument to print results as JSON


Relative composition variability

Function names: relative_composition_variability; rel_comp_var; rcv
Command line interface: pk_relative_composition_variability; pk_rel_comp_var; pk_rcv

Calculate RCV (relative composition variability) for an alignment.

Lower RCV values are thought to be desirable because they represent a lower composition bias in an alignment. Statistically, RCV describes the average variability in sequence composition among taxa.

RCV is calculated following Phillips and Penny, Molecular Phylogenetics and Evolution (2003), doi: 10.1016/S1055-7903(03)00057-5.

RCV calculations are case-insensitive. Gap and ambiguous characters are excluded from composition counts and correction terms, and each taxon is normalized by its valid (non-excluded) sequence length.

phykit relative_composition_variability <alignment> [--json]

Options:
<alignment>: first argument after function name should be an alignment file
--json: optional argument to print results as JSON


Relative composition variability, taxon

Function names: relative_composition_variability_taxon; rel_comp_var_taxon; rcvt
Command line interface: pk_relative_composition_variability_taxon; pk_rel_comp_var_taxon; pk_rcvt

Calculate RCVT (relative composition variability, taxon) for an alignment.

RCVT is the relative composition variability metric for individual taxa. This facilitates identifying specific taxa that may have compositional biases. Lower RCVT values are more desirable because they indicate a lower composition bias for a given taxon in an alignment.

RCVT calculations are case-insensitive and exclude gap/ambiguous symbols from composition counts and normalization.

phykit relative_composition_variability_taxon <alignment> [--plot] [--plot-output <path>] [--json]

Options:
<alignment>: first argument after function name should be an alignment file
--plot: optional argument to generate an RCVT per-taxon barplot
--plot-output: output path for the RCVT plot (default: rcvt_plot.png)
--json: optional argument to print results as JSON


Rename FASTA entries

Function names: rename_fasta_entries; rename_fasta
Command line interface: pk_rename_fasta_entries; pk_rename_fasta

Renames fasta entries.

Renaming fasta entries will follow the scheme of a tab-delimited file wherein the first column is the current fasta entry name and the second column is the new fasta entry name in the resulting output alignment. Note, the input fasta file does not need to be an alignment file.

phykit rename_fasta_entries <fasta> -i/--idmap <idmap> [-o/--output <output_file>] [--json]

Options:
<fasta>: first argument after function name should be a FASTA file
-i/\-\-idmap: identifier map of current FASTA names (col1) and desired FASTA names (col2)
--json: optional argument to print results as JSON


Sum-of-pairs score

Function names: sum_of_pairs_score; sops; sop
Command line interface: pk_sum_of_pairs_score; pk_sops; pk_sop

Calculates sum-of-pairs score.

Sum-of-pairs is an accuracy metric for a multiple alignment relative to a reference alignment. It is calculated by summing the correctly aligned residue pairs over all pairs of sequences. Thus, values range from 0 to 1 and higher values indicate more accurate alignments.

Column score is calculated following Thompson et al., Nucleic Acids Research (1999), doi: 10.1093/nar/27.13.2682.

phykit sum_of_pairs_score <alignment> --reference <reference_alignment> [--json]

Options:
<alignment>: first argument after function name should be a query fasta alignment file to be scored for accuracy
-r/\-\-reference: reference alignment to compare the query alignment to
--json: optional argument to print results as JSON


Variable sites

Function names: variable_sites; vs
Command line interface: pk_variable_sites; pk_vs

Calculate the number of variable sites in an alignment.

The number of variable sites in an alignment is associated with strong phylogenetic signal. PhyKIT reports three tab delimited values: col1: number of variable sites col2: total number of sites col3: percentage of variable sites

Association between the number of variable sites and phylogenetic signal was determined by Shen et al., Genome Biology and Evolution (2016), doi: 10.1093/gbe/evw179.

phykit variable_sites <alignment> [--json]

Options:
<alignment>: first argument after function name should be an alignment file
--json: optional argument to print results as JSON


Tree-based functions

Ancestral state reconstruction

Function names: ancestral_state_reconstruction; asr; anc_recon
Command line interface: pk_ancestral_state_reconstruction; pk_asr; pk_anc_recon

Estimate ancestral states for continuous traits using maximum likelihood, analogous to R's phytools::fastAnc() and ape::ace(type="ML"). Optionally produce a contMap plot showing continuous trait values mapped onto the phylogeny.

Two methods are available:

  • fast (default): Felsenstein's pruning/contrasts shortcut, O(n) time

  • ml: full VCV-based ML with exact conditional CIs, O(n^3)

Both methods produce identical point estimates; ml gives exact conditional confidence intervals.

Input trait data can be either a two-column file (taxon<tab>value) when -c is omitted, or a multi-trait file with header row when -c specifies which column to use.

phykit ancestral_state_reconstruction -t <tree> -d <trait_data> [-c <trait>] [-m <method>] [--ci] [--plot <output>] [--json]

Options:
-t/\-\-tree: a phylogenetic tree file
-d/\-\-trait_data: trait data file (two-column or multi-trait with header)
-c/\-\-trait: trait column name (required for multi-trait files)
-m/\-\-method: method to use: fast or ml (default: fast)
--ci: include 95% confidence intervals
--plot: output path for contMap plot (requires matplotlib)
--json: output results as JSON

Example contMap plot generated with the --plot option. Branches are colored by interpolated ancestral trait values:

../_images/asr_example.png

Concordance-aware ancestral state reconstruction

Function names: concordance_asr; conc_asr; casr
Command line interface: pk_concordance_asr; pk_conc_asr; pk_casr

Concordance-aware ancestral state reconstruction that incorporates gene tree discordance into ancestral estimates. Standard ASR operates on a single species tree and ignores gene tree conflict. This command propagates topological uncertainty from gene tree discordance into ancestral state estimates using gene concordance factors (gCF).

Two strategies are available:

  • weighted (default): For each internal node, compute gCF (fraction of gene trees supporting the species-tree bipartition) and gDF1, gDF2 (fractions for NNI alternatives). Run ASR on the species tree and NNI alternative trees, then combine estimates weighted by concordance. Uses the law of total variance to separate topological vs parameter uncertainty.

  • distribution: Run ASR independently on each gene tree, map nodes across trees by descendant-set identity, and report concordance-weighted means with percentile confidence intervals (2.5th--97.5th).

phykit concordance_asr -t <species_tree> -g <gene_trees> -d <trait_data>
    [-c <trait>] [-m weighted|distribution] [--ci]
    [--plot <output>] [--missing-taxa error|shared] [--json]

Options:
-t/\-\-tree: species tree file
-g/\-\-gene-trees: file with gene trees (multi-Newick, one per line)
-d/\-\-trait_data: trait data file (two-column or multi-trait with header)
-c/\-\-trait: trait column name (required for multi-trait files)
-m/\-\-method: method to use: weighted or distribution (default: weighted)
--ci: include 95% confidence intervals
--plot: output path for concordance ASR plot
--missing-taxa: how to handle taxa mismatches: shared (default, prune to intersection) or error (reject)
--json: output results as JSON

Example output:

Concordance-Aware Ancestral State Reconstruction

Method: weighted
Number of tips: 8
Number of gene trees: 10
Sigma-squared (BM rate): 0.043893

Ancestral estimates:
  Node          Desc    Estimate     gCF                  95% CI    Var_topo   Var_param
  N1 (root)        8      1.6447   1.000        [0.8937, 2.3957]    0.000000    0.146822
  N2               2      1.6881   0.700        [0.9529, 2.4234]    0.000569    0.140151
  N3               5      1.4878   0.857        [0.6727, 2.3028]    0.005878    0.167045
  N4               2      1.7682   0.900        [0.8987, 2.6378]    0.015002    0.181806
  N5               3      1.2674   0.900        [0.3663, 2.1684]    0.001044    0.210295
  N6               2      0.9895   1.000       [-0.5654, 2.5443]    0.000000    0.629294

Example plot generated with the --plot option. Internal nodes are sized and colored by gene concordance factor (gCF):

../_images/concordance_asr_example.png

Bipartition support statistics

Function names: bipartition_support_stats; bss
Command line interface: pk_bipartition_support_stats; pk_bss

Calculate summary statistics for bipartition support.

High bipartition support values are thought to be desirable because they are indicative of greater certainty in tree topology.

To obtain all bipartition support values, use the -v/\-\-verbose option. In addition to support values for each node, the names of all terminal branches tips are also included. Each terminal branch name is separated with a semi-colon (;).

phykit bipartition_support_stats <tree> [-v/--verbose]
    [--thresholds <comma-separated-floats>] [--json]

Options:
<tree>: first argument after function name should be a tree file
-v/\-\-verbose: optional argument to print all bipartition support values
--thresholds: optional comma-separated support cutoffs; prints count and fraction of bipartitions below each cutoff
--json: optional argument to print results as JSON

Example JSON output (summary mode):

phykit bipartition_support_stats test.tre --thresholds 70,90 --json
{"summary": {"maximum": 100, "mean": 95.71428571428571, "median": 100, "minimum": 85, "seventy_fifth": 100.0, "standard_deviation": 7.319250547113999, "twenty_fifth": 92.5, "variance": 53.57142857142857}, "thresholds": [{"count_below": 0, "fraction_below": 0.0, "threshold": 70.0}, {"count_below": 2, "fraction_below": 0.2857142857142857, "threshold": 90.0}], "verbose": false}

Example JSON output (verbose mode):

phykit bipartition_support_stats test.tre -v --json
{"bipartitions": [{"support": 85, "terminals": ["taxon_a", "taxon_b"]}, {"support": 100, "terminals": ["taxon_c", "taxon_d"]}], "thresholds": [], "verbose": true}

Branch length multiplier

Function names: branch_length_multiplier; blm
Command line interface: pk_branch_length_multiplier; pk_blm

Multiply branch lengths in a phylogeny by a given factor.

This can help modify reference trees when conducting simulations or other analyses.

phykit branch_length_multiplier <tree> -f n [-o/--output <output_file>] [--json]

Options:
<tree>: first argument after function name should be a tree file
-f/\-\-factor: factor to multiply branch lengths by
-o/\-\-output: optional argument to name the outputted tree file. Default output will have the same name as the input file but with the suffix ".factor_(n).tre"
--json: optional argument to print results as JSON


Collapse bipartitions

Function names: collapse_branches, collapse, cb
Command line interface: pk_collapse_branches, pk_collapse, pk_cb

Collapse branches on a phylogeny according to bipartition support.

Bipartitions will be collapsed if they are less than the user specified value.

phykit collapse_branches <tree> -s/--support n [-o/--output <output_file>] [--json]

Options:
<tree>: first argument after function name should be a tree file
-s/\-\-support: bipartitions with support less than this value will be collapsed
-o/\-\-output: optional argument to name the outputted tree file. Default output will have the same name as the input file but with the suffix ".collapsed_(support).tre"
--json: optional argument to print results as JSON


Consensus network

Function names: consensus_network; consnet; splitnet; splits_network
Command line interface: pk_consensus_network; pk_consnet; pk_splitnet; pk_splits_network

Extract bipartition splits from a collection of gene trees and summarize conflicting phylogenetic signal. Counts how frequently each non-trivial bipartition appears across input trees and filters by a minimum frequency threshold. Optionally draws a circular splits network diagram.

Input can be either: 1) a file with one Newick tree per line, or 2) a file with one tree-file path per line.

phykit consensus_network -t/--trees <trees> [--threshold 0.1] [--missing-taxa error|shared] [--plot-output <file>] [--json]

Options:
-t/\-\-trees: file containing trees (one Newick per line) or tree-file paths (one per line)
--threshold: minimum split frequency to include, between 0 and 1 (default: 0.1)
--missing-taxa: handling strategy for mismatched taxa (error or shared; default: error)
--plot-output: output filename for the circular splits network plot (optional)
--json: optional argument to print results as JSON

When --plot-output is specified, a circular splits network diagram is produced. Taxa are arranged at equal angles around a circle. Each split is drawn as a chord connecting the boundary points between the two sides of the bipartition. Chord thickness and opacity scale with split frequency — thicker, darker lines indicate splits supported by more gene trees.

../_images/consensus_network_example.png

Consensus tree

Function names: consensus_tree; consensus; ctree
Command line interface: pk_consensus_tree; pk_consensus; pk_ctree

Infer a consensus tree from a collection of trees.

Input can be either: 1) a file with one Newick tree per line, or 2) a file with one tree-file path per line.

Consensus methods: * majority: majority-rule consensus (default)
* strict: strict consensus

Missing taxa handling: * --missing-taxa error (default): exits if trees do not share identical tip sets
* --missing-taxa shared: prunes all trees to the intersection of taxa before inferring consensus

phykit consensus_tree -t/--trees <trees> [-m/--method strict|majority] [--missing-taxa error|shared] [--json]

Options:
-t/\-\-trees: file containing trees (one Newick per line) or tree-file paths (one per line)
-m/\-\-method: consensus method (strict or majority; default: majority)
--missing-taxa: handling strategy for mismatched taxa (error or shared; default: error)
--json: optional argument to print results as JSON


Continuous trait evolution model comparison (fitContinuous)

Function names: fit_continuous; fitcontinuous; fc
Command line interface: pk_fit_continuous; pk_fitcontinuous; pk_fc

Compare models of continuous trait evolution on a phylogeny, analogous to R's geiger::fitContinuous(). Fits up to 7 models and ranks them by AIC, BIC, and AIC weights.

Models:

  • BM -- Brownian motion (baseline, 2 params)

  • OU -- Ornstein-Uhlenbeck / stabilizing selection (3 params)

  • EB -- Early Burst (Harmon et al. 2010) (3 params)

  • Lambda -- Pagel's lambda / phylogenetic signal (3 params)

  • Delta -- Pagel's delta / tempo of evolution (3 params)

  • Kappa -- Pagel's kappa / punctuational vs gradual (3 params)

  • White -- White noise / no phylogenetic signal (2 params)

Each model reports R² = 1 - (σ²_model / σ²_White), measuring how much variance is explained relative to the white noise baseline. White serves as the reference (R² = 0).

phykit fit_continuous -t <tree> -d <trait_data> [--models BM,OU,Lambda] [-g <gene_trees>] [--json]

Options:
-t/\-\-tree: a tree file in Newick format
-d/\-\-trait_data: tab-delimited trait file (taxon<tab>value)
--models: comma-separated list of models to fit (default: all 7)
-g/\-\-gene-trees: optional multi-Newick file of gene trees; when provided, uses a discordance-aware VCV (genome-wide average) instead of the species-tree VCV
--json: optional argument to print results as JSON

Example output:

Model Comparison (fitContinuous)

Number of tips: 8

Model       Param     Value      Sigma2    z0        LL         AIC     dAIC    AICw    BIC     dBIC
BM          -         -          0.0384    1.6447    -11.570    27.14   0.00    0.453   27.83   0.00
OU          alpha     0.0012     0.0385    1.6420    -11.568    29.14   2.00    0.167   30.18   2.35
...

Best model (AIC): BM
Best model (BIC): BM

Continuous trait mapping (contMap)

Function names: cont_map; contmap; cmap
Command line interface: pk_cont_map; pk_contmap; pk_cmap

Plot a phylogram with branches colored by continuous trait values (analogous to R's phytools::contMap()). Ancestral states are estimated via maximum-likelihood (two-pass Felsenstein algorithm) and mapped onto branches using a color gradient (coolwarm colormap).

phykit cont_map -t <tree> -d <trait_data> -o <output.png> [--json]

Options:
-t/\-\-tree: a tree file in Newick format
-d/\-\-trait_data: tab-delimited trait file (taxon<tab>value)
-o/\-\-output: output plot file path (required)
--json: optional argument to print results as JSON

../_images/contmap_example.png

Cophylogenetic plot (tanglegram)

Function names: cophylo; tanglegram; tangle
Command line interface: pk_cophylo; pk_tanglegram; pk_tangle

Plot a cophylogenetic tanglegram of two phylogenies (analogous to R's phytools::cophylo()). Draws two trees facing each other with connecting lines between matching taxa. By default, taxa are matched by identical tip names. Internal nodes of tree2 are rotated to minimize line crossings.

phykit cophylo -t <tree1> -t2 <tree2> -o <output.png> [-m <mapping>] [--json]

Options:
-t/\-\-tree1: first tree file in Newick format
-t2/\-\-tree2: second tree file in Newick format
-o/\-\-output: output plot file path (required)
-m/\-\-mapping: optional tab-delimited mapping file (taxon1<tab>taxon2)
--json: optional argument to print results as JSON

../_images/cophylo_example.png

Covarying evolutionary rates

Function names: covarying_evolutionary_rates; cover
Command line interface: pk_covarying_evolutionary_rates; pk_cover

Determine if two genes have a signature of covariation with one another. Genes that have covarying evolutionary histories tend to have similar functions and expression levels.

Input two phylogenies and calculate the correlation among relative evolutionary rates between the two phylogenies. The two input trees do not have to have the same taxa. This function will first prune both trees to have the same tips. To transform branch lengths into relative rates, PhyKIT uses the putative species tree's branch lengths, which is inputted by the user. As recommended by the original method developers, outlier branche lengths are removed. Outlier branches have a relative evolutionary rate greater than five.

PhyKIT reports two tab delimited values: col1: correlation coefficient col2: p-value

Method is empirically evaluated by Clark et al., Genome Research (2012), doi: 10.1101/gr.132647.111. Normalization method using a species tree follows Sato et al., Bioinformatics (2005), doi: 10.1093/bioinformatics/bti564.

phykit covarying_evolutionary_rates <tree_file_zero> <tree_file_one> -r/--reference <reference_tree_file> [-v/--verbose] [--plot] [--plot-output <path>] [--json]

Options:
<tree_file_zero>: first argument after function name should be an alignment file
<tree_file_one>: first argument after function name should be an alignment file
-r/\-\-reference: a tree to correct branch lengths by in the two input trees. Typically, this is a putative species tree.
-v/\-\-verbose: print out corrected branch lengths shared between tree 0 and tree 1
--plot: save a covarying-rates scatter plot
--plot-output: output path for plot (default: covarying_rates_plot.png)
--json: optional argument to print results as JSON


Degree of violation of the molecular clock

Function names: degree_of_violation_of_a_molecular_clock, dvmc
Command line interface: pk_degree_of_violation_of_a_molecular_clock, pk_dvmc

Calculate degree of violation of a molecular clock (or DVMC) in a phylogeny.

Lower DVMC values are thought to be desirable because they are indicative of a lower degree of violation in the molecular clock assumption.

Typically, outgroup taxa are not included in molecular clock analysis. Thus, prior to calculating DVMC from a single gene tree, users may want to prune outgroup taxa from the phylogeny. To prune tips from a phylogeny, see the prune_tree function.

Calculate DVMC in a tree following Liu et al., PNAS (2017), doi: 10.1073/pnas.1616744114.

phykit degree_of_violation_of_a_molecular_clock <tree> [--json]

Options:
<tree>: input file tree name
--json: optional argument to print results as JSON


Density map

Function names: density_map; densitymap; dmap
Command line interface: pk_density_map; pk_densitymap; pk_dmap

Plot a phylogram with branches colored by posterior probabilities of discrete character states from stochastic character mapping (analogous to R's phytools::densityMap()). Runs N simulations of stochastic character mapping internally and, for each point along each branch, computes the fraction of simulations in each state.

phykit density_map -t <tree> -d <trait_data> -c <trait_column> -o <output.png> [-n <nsim>] [--seed <seed>] [--json]

Options:
-t/\-\-tree: a tree file in Newick format
-d/\-\-trait_data: tab-delimited trait file (taxon<tab>state)
-c/\-\-trait: column name of the trait to map
-o/\-\-output: output plot file path (required)
-n/\-\-nsim: number of stochastic mapping simulations (default: 100)
--seed: random seed for reproducibility
--json: optional argument to print results as JSON

../_images/densitymap_example.png

Evolutionary tempo mapping

Function names: evo_tempo_map; etm
Command line interface: pk_evo_tempo_map; pk_etm

Detect rate-topology associations by comparing branch length distributions between concordant and discordant gene trees at each species tree branch.

Under the multispecies coalescent, discordant gene trees should have shorter internal branches near the discordant node (because the coalescence happened deeper, in the ancestral population). Deviations from this expectation suggest substitution rate heterogeneity correlated with topology, which could indicate adaptive evolution, different selective pressures in hybridizing lineages, or systematic error from model misspecification.

For each internal branch of the species tree, gene trees are classified as concordant or discordant via bipartition matching (same as gCF). The homologous branch length is extracted from each gene tree and the two groups are compared using a Mann-Whitney U test and a permutation test (1000 permutations). P-values are corrected for multiple testing using Benjamini-Hochberg FDR.

A global treeness (internal/total branch length ratio) comparison between concordant and discordant gene trees is also reported.

phykit evo_tempo_map -t <species_tree> -g <gene_trees> [--plot <output>] [-v] [--json]

Options:
-t/\-\-tree: a species tree file
-g/\-\-gene-trees: multi-Newick file of gene trees with branch lengths
--plot: optional output path for box/strip plot (PNG)
-v/\-\-verbose: print per-gene-tree classification details
--json: optional argument to print results as JSON

Example output:

branch                          n_conc  n_disc    med_conc    med_disc      U_pval   perm_pval       fdr_p
----------------------------------------------------------------------------------------------------------
bear,dog,raccoon                     6       1    3.875000    3.600000          NA          NA          NA
bear,raccoon                         7       3    0.880000    0.700000    0.516667    0.077000    0.516667
cat,monkey                          10       0   20.450000          NA          NA          NA          NA
cat,monkey,weasel                    9       1    2.120000    2.800000          NA          NA          NA
sea_lion,seal                        9       1    7.500000    7.200000          NA          NA          NA
---
Global treeness: concordant=0.126489 (n=6), discordant=0.119014 (n=4)
Branches tested: 1, significant (FDR<0.05): 0

Each row corresponds to an internal branch of the species tree identified by the smaller partition of taxa. The n_conc and n_disc columns show how many gene trees are concordant or discordant at that branch. The med_conc and med_disc columns show the median branch length (in substitutions/site) for each group. The U_pval is the two-sided Mann-Whitney U test p-value, perm_pval is the permutation test p-value (1000 permutations), and fdr_p is the Benjamini-Hochberg corrected p-value. Branches with fewer than 2 gene trees in either group show NA for p-values.

The global treeness comparison tests whether concordant gene trees have systematically different ratios of internal to total branch lengths.

To generate a visualization:

phykit evo_tempo_map -t <species_tree> -g <gene_trees> --plot tempo_map.png
../_images/tutorial_etm_plot.png

The plot shows grouped box plots with jittered data points for each species tree branch, comparing branch lengths between concordant (blue) and discordant (orange) gene trees. Branches where the FDR-corrected p-value is below 0.05 are marked with an asterisk.


Discordance asymmetry

Function names: discordance_asymmetry; disc_asym; da
Command line interface: pk_discordance_asymmetry; pk_disc_asym; pk_da

Test whether the two discordant NNI alternative topologies at each species tree branch are equally frequent. Under incomplete lineage sorting (ILS) alone, the two minor NNI alternatives (gDF1 and gDF2) should appear at equal frequency. When they are significantly asymmetric, it suggests introgression or gene flow between specific lineages.

For each internal branch of the species tree, a two-sided binomial test (H0: P(alt1) = 0.5) is applied, and p-values are corrected for multiple testing using Benjamini-Hochberg FDR.

phykit discordance_asymmetry -t <species_tree> -g <gene_trees> [--plot <output>] [-v] [--json]

Options:
-t/\-\-tree: a species tree file
-g/\-\-gene-trees: multi-Newick file of gene trees (branch lengths not required)
--plot: optional output path for asymmetry phylogram (PNG)
-v/\-\-verbose: print per-branch details
--json: optional argument to print results as JSON

Example output:

branch                          n_conc  n_alt1  n_alt2  asym_ratio     binom_p       fdr_p   gene_flow
------------------------------------------------------------------------------------------------------
bear,dog,raccoon                     6       0       1       1.000      1.0000      1.0000           -
bear,raccoon                         7       1       2       0.667      1.0000      1.0000           -
cat,monkey                          10       0       0          NA          NA          NA           -
cat,monkey,weasel                    9       1       0       1.000      1.0000      1.0000           -
sea_lion,seal                        9       1       0       1.000      1.0000      1.0000           -
---
Summary: 4 branches tested, 0 significant (FDR<0.05)

Each row corresponds to an internal branch of the species tree identified by the smaller partition of taxa. The n_conc column shows concordant gene trees, while n_alt1 and n_alt2 show the counts for the two NNI alternative topologies. The asym_ratio is max(n_alt1, n_alt2) / (n_alt1 + n_alt2), ranging from 0.5 (perfectly symmetric) to 1.0 (maximally asymmetric). NA indicates no discordant gene trees were observed. The binom_p is the two-sided binomial test p-value, fdr_p is the Benjamini-Hochberg corrected p-value, and gene_flow shows which NNI alternative is favored when the result is significant (FDR < 0.05).

Interpretation:

  • Symmetric discordance (asym_ratio near 0.5, not significant): Consistent with ILS alone — both NNI alternatives arise with equal frequency from random coalescent sorting.

  • Asymmetric discordance (asym_ratio near 1.0, significant): Suggests gene flow or introgression. The favored NNI alternative indicates which lineages are exchanging genetic material. For alt1, gene flow is between the C1 lineage and the S (sibling) lineage; for alt2, gene flow is between C2 and S.

To generate a visualization:

phykit discordance_asymmetry -t <species_tree> -g <gene_trees> --plot asymmetry.png
../_images/discordance_asymmetry_plot.png

The plot shows a phylogram with internal branches colored by asymmetry ratio using a diverging colormap (blue = symmetric/0.5, red = highly asymmetric/1.0). Significant branches (FDR < 0.05) are marked with a red star. Internal nodes are annotated with gCF values.


Evolutionary rate

Function names: evolutionary_rate, evo_rate
Command line interface: pk_evolutionary_rate, pk_evo_rate

Calculate a tree-based estimation of the evolutionary rate of a gene.

Evolutionary rate is the total tree length divided by the number of terminals.

Calculate evolutionary rate following Telford et al., Proceedings of the Royal Society B (2014).

phykit evolutionary_rate <tree> [--json]

Options:
<tree>: input file tree name
--json: optional argument to print results as JSON


Hidden paralogy check

Function names: hidden_paralogy_check, clan_check
Command line interface: pk_hidden_paralogy_check, pk_clan_check

Scan tree for evidence of hidden paralogy.

This analysis can be used to identify hidden paralogy. Specifically, this method will examine if a set of well known monophyletic taxa are, in fact, monophyletic. If they are not, the evolutionary history of the gene may be subject to hidden paralogy. This analysis is typically done with single-copy orthologous genes.

Requires a clade file, which species which monophyletic lineages to check for. Multiple monophyletic lineages can be specified. Each lineage should be specified on a single line and each tip name (or taxon name) should be separated by a space. For example, if it is anticipated that tips "A", "B", and "C" are monophyletic and "D", "E", and "F" are expected to be monophyletic, the clade file should be formatted as follows:
"
A B C
D E F
"

The output will report if the specified taxa were monophyletic or not. The number of rows will reflect how many groups of taxa were checked for monophyly. For example, if there were three rows of clades in the -c file, there will be three rows in the output where the first row in the output corresponds to the results of the first row in the clade file.

The concept behind this analysis follows Siu-Ting et al., Molecular Biology and Evolution (2019), doi: 10.1093/molbev/msz067.

phykit hidden_paralogy_check <tree> -c/--clade <clade_file> [--json]

Options:
-t/\-\-tree: input file tree name
-c/\-\-clade: clade file detailing which monophyletic lineages should be scanned for
--json: optional argument to print results as JSON


Internal branch statistics

Function names: internal_branch_stats; ibs
Command line interface: pk_internal_branch_stats; pk_ibs

Calculate summary statistics for internal branch lengths in a phylogeny.

Internal branch lengths can be useful for phylogeny diagnostics.

To obtain all internal branch lengths, use the -v/\-\-verbose option.

phykit internal_branch_stats <tree> [-v/--verbose] [--json]

Options:
<tree>: first argument after function name should be a tree file
-v/\-\-verbose: optional argument to print all internal branch lengths
--json: optional argument to print results as JSON


Internode labeler

Function names: internode_labeler; il
Command line interface: pk_internode_labeler; pk_il

Appends numerical identifiers to bipartitions in place of support values. This is helpful for pointing to specific internodes in supplementary files or otherwise.

phykit internode_labeler <tree> [-o/--output <file>] [--json]

Options:
<tree>: first argument after function name should be a tree file
-o/\-\-output: optional argument to name the outputted tree file
--json: optional argument to print results as JSON


Last common ancestor subtree

Function names: last_common_ancestor_subtree; lca_subtree
Command line interface: pk_last_common_ancestor_subtree; pk_lca_subtree

Obtains subtree from a phylogeny by getting the last common ancestor from a list of taxa.

phykit last_common_ancestor_subtree <file> <list_of_taxa> [-o/--output <file>] [--json]

Options:
<tree>: first argument after function name should be a tree file
<list_of_taxa>: second argument after function name should be a single column file with the list of taxa to get the last common ancestor subtree for -o/\-\-output: optional argument to name the outputted tree file
--json: optional argument to print results as JSON


Lineage-through-time plot and gamma statistic

Function names: ltt; gamma_stat; gamma
Command line interface: pk_ltt; pk_gamma_stat; pk_gamma

Compute the Pybus & Harvey (2000) gamma statistic and generate lineage-through-time (LTT) plots to test for temporal variation in diversification rates.

Under a constant-rate pure-birth (Yule) process, the gamma statistic follows a standard normal distribution: gamma ~ N(0, 1). Negative values indicate early diversification (decelerating rates, consistent with an adaptive radiation followed by niche filling), while positive values indicate late diversification (accelerating rates, potentially reflecting recent ecological opportunity or mass extinction recovery).

phykit ltt -t <tree> [-v/--verbose] [--plot-output <file>] [--json]

Options:
-t/\-\-tree: a rooted phylogeny file with branch lengths (required)
-v/\-\-verbose: print branching times and full LTT data points
--plot-output: output filename for the LTT plot (PNG, PDF, SVG)
--json: output results as JSON

Default output: tab-separated gamma statistic and two-tailed p-value.

# Basic gamma statistic
phykit ltt -t species.tre
# -1.4142   0.1573

# With LTT plot
phykit ltt -t species.tre --plot-output ltt_plot.png

# Full JSON output
phykit ltt -t species.tre --json

Tutorial: testing diversification tempo in a clade

Suppose you have a dated phylogeny of 50 gecko species and want to test whether speciation rates were constant or whether diversification decelerated over time (consistent with ecological limits).

Step 1: Compute the gamma statistic.

phykit ltt -t gecko_dated.tre --plot-output gecko_ltt.png
# -2.8514   0.0043

A significantly negative gamma (p = 0.004) rejects the constant-rate null, indicating that lineage accumulation was concentrated early in the clade's history — consistent with an early burst of diversification followed by a slowdown as niches filled.

Step 2: Examine the LTT plot.

The --plot-output option generates a step-function plot of lineage count (log scale) vs. time from root. Under constant-rate birth, lineages accumulate exponentially (straight line on log-scale). An early burst appears as a curve that rises steeply then flattens.

Example: early burst (decelerating diversification)

Most branching events occur near the root, then the curve plateaus. The significantly negative gamma (p < 0.001) rejects constant-rate birth.

../_images/ltt_example_early_burst.png

Example: recent burst (accelerating diversification)

Lineage accumulation is concentrated near the present. The significantly positive gamma (p = 0.022) indicates late diversification.

../_images/ltt_example_recent_burst.png

Step 3: Verbose output for downstream analysis.

phykit ltt -t gecko_dated.tre -v

Verbose mode prints branching times (node ages) and the full LTT data table (time_from_root, n_lineages), which can be piped to custom plotting or further analysis.

Validation against R's ape::gammaStat()

PhyKIT's gamma statistic was validated against R's ape package (v5.8.1, R 4.4.0). Results match to 10 decimal places:

Tree topology

PhyKIT gamma

R ape gamma

Balanced 8-tip

-1.4142135624

-1.4142135624

Ladder (caterpillar) 5-tip

-0.7142857143

-0.7142857143

Recent burst 10-tip

2.2824790785

2.2824790785

Early burst 7-tip

-3.5362021857

-3.5362021857

The algorithm replicates the exact formula from ape's gammaStat.R source, including the rev() step on internode intervals.


Long branch score

Function names: long_branch_score; lb_score; lbs
Command line interface: pk_long_branch_score; pk_lb_score; pk_lbs

Calculate long branch (LB) scores in a phylogeny.

Lower LB scores are thought to be desirable because they are indicative of taxa or trees that likely do not have issues with long branch attraction.

LB score is the mean pairwise patristic distance of taxon i compared to all other taxa over the average pairwise patristic distance.

PhyKIT reports summary statistics. To obtain LB scores for each taxa, use the -v/--verbose option.

LB scores are calculated following Struck, Evolutionary Bioinformatics (2014), doi: 10.4137/EBO.S14239.

phykit long_branch_score <tree> [-v/--verbose] [--json]

Options:
<tree>: first argument after function name should be a tree file
-v/\-\-verbose: optional argument to print all LB score values
--json: optional argument to print results as JSON


Monophyly check

Function names: monophyly_check; is_monophyletic
Command line interface: pk_monophyly_check; pk_is_monophyletic

This analysis can be used to determine if a set of taxa are exclusively monophyletic. By exclusively monophyletic, if other taxa are in the same clade, the lineage will not be considered exclusively monophyletic.

Requires a taxa file, which species which tip names are expected to be monophyletic. File format is a single column file with tip names. Tip names not present in the tree will not be considered when examining monophyly.

The output will have six columns. col 1: if the clade was or wasn't monophyletic col 2: average bipartition support value in the clade of interest col 3: maximum bipartition support value in the clade of interest col 4: minimum bipartition support value in the clade of interest col 5: standard deviation of bipartition support values in the clade of interest col 6: tip names of taxa monophyletic with the lineage of interest excluding those that are listed in the taxa_of_interest file

phykit monophyly_check <tree> <list_of_taxa> [--json]

Options:
<tree>: first argument after function name should be a tree file
<list_of_taxa>: single column file with list of tip names to examine the monophyly of
--json: optional argument to print results as JSON


Multi-regime OU models (OUwie)

Function names: ouwie; fit_ouwie; multi_regime_ou
Command line interface: pk_ouwie; pk_fit_ouwie; pk_multi_regime_ou

Fit multi-regime Ornstein-Uhlenbeck models of continuous trait evolution, analogous to R's OUwie package (Beaulieu et al. 2012). Fits up to 7 models and ranks them by AICc, BIC, and AICc weights. Regime assignments to internal branches are inferred via Fitch parsimony.

Models:

  • BM1 -- single-rate Brownian motion (2 params)

  • BMS -- multi-rate Brownian motion with per-regime sigma2 (R+1 params)

  • OU1 -- single-regime Ornstein-Uhlenbeck (3 params)

  • OUM -- multi-regime OU with per-regime trait optima (R+2 params)

  • OUMV -- OUM + per-regime sigma2 (2R+1 params)

  • OUMA -- OUM + per-regime alpha (2R+1 params)

  • OUMVA -- all parameters regime-specific (3R params)

Each model reports R² = 1 - (σ²_model / σ²_BM1), measuring improvement over the simplest Brownian motion baseline. For multi-regime models with per-regime σ² values, the average is used.

phykit ouwie -t <tree> -d <trait_data> -r <regime_data> [--models BM1,OUM,OUMVA] [--json]

Options:
-t/\-\-tree: a tree file in Newick format
-d/\-\-trait_data: tab-delimited trait file (taxon<tab>value)
-r/\-\-regime_data: tab-delimited regime file (taxon<tab>regime_label)
--models: comma-separated list of models to fit (default: all 7)
--json: optional argument to print results as JSON

The trait data file is a two-column tab-delimited file mapping taxon names to continuous trait values:

dog  1.1
bear 1.9
raccoon      1.5
seal 1.8
sea_lion     1.8
cat  0.5
weasel       1.7
monkey       0.3

The regime data file is a two-column tab-delimited file mapping taxon names to discrete regime labels (e.g., habitat, diet category):

dog  terrestrial
bear terrestrial
raccoon      terrestrial
seal aquatic
sea_lion     aquatic
cat  terrestrial
weasel       terrestrial
monkey       terrestrial

Example output:

OUwie Model Comparison
======================
Regimes: aquatic, terrestrial

Model      logLik     AICc      BIC     k  AICc_w  Params
-----  ----------  -------  -------  ----  ------  ------
OUMVA     -6.9859  27.9717  29.5459     6  0.0040  alpha={aquatic:0.38, terrestrial:0.38}, sigma2={aquatic:0.01, terrestrial:0.05}, theta={aquatic:1.80, terrestrial:1.24}
OUMA      -6.9859  27.9717  29.0119     5  0.0040  alpha={aquatic:0.38, terrestrial:0.38}, sigma2=0.0384, theta={aquatic:1.80, terrestrial:1.24}
OUMV      -6.9859  27.9717  29.0119     5  0.0040  alpha=0.3849, sigma2={aquatic:0.01, terrestrial:0.05}, theta={aquatic:1.80, terrestrial:1.24}
OUM       -8.6297  25.2594  26.3276     4  0.0488  alpha=0.0706, sigma2=0.0329, theta={aquatic:1.80, terrestrial:1.33}
OU1      -10.2890  27.2447  28.0459     3  0.0063  alpha=0.0398, sigma2=0.0363, theta=1.64
BMS      -11.2046  29.0759  29.8771     3  0.0024  sigma2={aquatic:0.01, terrestrial:0.05}, z0=1.64
BM1      -11.5697  27.1393  27.6735     2  0.0073  sigma2=0.0384, z0=1.64

Best model (AICc): OUM
Best model (BIC): OUM

Nearest neighbor interchange

Function names: nearest_neighbor_interchange; nni
Command line interface: pk_nearest_neighbor_interchange; pk_nni

Generate all nearest neighbor interchange moves for a binary rooted tree.

By default, the output file will have the same name as the input file but with the suffix ".nnis"

The output file will also include the original phylogeny.

phykit nearest_neighbor_interchange <tree> [-o/--output <output_file>] [--json]

Options:
<tree>: first argument after function name should be a tree file
-o/\-\-output: optional argument to specify output file name
--json: optional argument to print summary metadata as JSON


Network signal

Function names: network_signal; netsig; net_signal
Command line interface: pk_network_signal; pk_netsig; pk_net_signal

Compute phylogenetic signal (Blomberg's K and/or Pagel's lambda) on a phylogenetic network rather than a tree. This accounts for hybridization and introgression when estimating how strongly a continuous trait tracks evolutionary history.

Standard phylogenetic signal methods assume a strictly bifurcating tree. When the true history includes reticulation, the tree-based variance-covariance (VCV) matrix is incorrect and signal estimates may be biased. network_signal replaces the tree VCV with a network VCV computed using the recursive algorithm of Bastide et al. (Systematic Biology, 2018), which properly weights shared ancestry through both tree-like and hybrid lineages.

Two signal metrics are available (same as phylogenetic_signal):

  • Blomberg's K (Blomberg et al. 2003): K = 1 under Brownian motion; K < 1 = less signal than expected; K > 1 = more. P-value via permutation test. Computing K on a network is a novel capability not available in any other tool.

  • Pagel's lambda (Pagel 1999): lambda = 0 = no signal; lambda = 1 = full BM signal. P-value via likelihood ratio test.

Network specification — two options:

  1. Explicit hybrid edges (--hybrid): specify one or more reticulation events as donor:recipient:gamma where gamma is the inheritance probability from the donor lineage (0 < gamma < 0.5).

  2. From quartet_network JSON (--quartet-json): auto-infer hybrid edges from the output of phykit quartet_network --json. The command identifies taxon pairs that swap across hybrid quartets and estimates gamma from concordance factor ratios.

# With explicit hybrid edges
phykit network_signal -t <tree> -d <trait_data> --hybrid <donor:recipient:gamma> [--method both|blombergs_k|lambda] [--permutations 1000] [--json]

# With quartet_network JSON output
phykit network_signal -t <tree> -d <trait_data> --quartet-json <quartets.json> [--method both|blombergs_k|lambda] [--permutations 1000] [--json]

Options:
-t/\-\-tree: a rooted species tree in Newick format (with branch lengths)
-d/\-\-trait-data: tab-delimited trait file (taxon_name<tab>trait_value)
--hybrid: one or more hybrid edge specifications (donor:recipient:gamma);
donor is the source lineage, recipient receives gene flow, gamma is the
inheritance proportion from the donor (e.g., B:C:0.3)
--quartet-json: path to JSON output from phykit quartet_network --json
--method: both (default), blombergs_k, or lambda
--permutations: number of permutations for K p-value (default: 1000)
-v/\-\-verbose: print network VCV matrix details
--json: optional argument to print results as JSON

Output for default (both) mode:
Hybrid edge: B -> C (gamma=0.3000)
Network taxa: 5
---
Blomberg's K: 0.8234    p-value: 0.0320
Pagel's lambda: 0.7651    log-likelihood: -12.3456    p-value: 0.0012


Tutorial: Wing pattern evolution in Heliconius butterflies

This example shows a realistic workflow for computing phylogenetic signal on a network, starting from gene tree discordance analysis through to signal estimation. The scenario is motivated by the Heliconius butterfly system, where H. melpomene and H. cydno are sister species that hybridize with H. heurippa, producing introgression of wing pattern genes across species boundaries (Mavárez et al., Nature, 2006).

Step 1: Identify hybridization from gene trees.

You have gene trees from 200 loci across 6 Heliconius species. First, use quartet_network to test whether gene tree discordance is due to ILS alone or also involves hybridization:

phykit quartet_network -t gene_trees.nwk --json > quartets.json

Examine the output to see which quartets are classified as hybrid:

# Quick summary
python -c "
import json
data = json.load(open('quartets.json'))
print(f'Tree-like: {data[\"tree_count\"]}')
print(f'Hybrid: {data[\"hybrid_count\"]}')
print(f'Unresolved: {data[\"unresolved_count\"]}')
for q in data['quartets']:
    if q['classification'] == 'hybrid':
        print(f'  {q[\"dominant_topology\"]}  CFs: {q[\"cfs\"]}')
"

Suppose the output shows that quartets involving H. melpomene and H. heurippa are consistently classified as hybrid, with asymmetric minor concordance factors — evidence of gene flow between these lineages.

Step 2a: Compute signal using quartet_network output directly.

Feed the quartet JSON into network_signal along with a rooted species tree and wing pattern measurements (e.g., forewing red band area, log-transformed):

phykit network_signal \
    -t species_tree.nwk \
    -d wing_pattern.tsv \
    --quartet-json quartets.json

The command automatically identifies the strongest hybrid signal from the quartet classifications and estimates the inheritance probability (gamma).

Step 2b: Alternatively, specify hybrid edges explicitly.

If you know the donor and recipient lineages (e.g., from prior knowledge or external network inference), you can specify the hybrid edge directly. Here, H. melpomene is the donor of wing pattern alleles to H. heurippa with an estimated 25% introgression:

phykit network_signal \
    -t species_tree.nwk \
    -d wing_pattern.tsv \
    --hybrid H_melpomene:H_heurippa:0.25

Step 3: Interpret the results.

Example output:

Hybrid edge: H_melpomene -> H_heurippa (gamma=0.2500)
Network taxa: 6
---
Blomberg's K: 0.6821    p-value: 0.0410
Pagel's lambda: 0.5934    log-likelihood: -8.7231    p-value: 0.0085

Interpretation:

  • K = 0.68 (p = 0.04): significant phylogenetic signal, but less than expected under Brownian motion (K < 1). This is consistent with the wing pattern being phylogenetically conserved in most lineages but displaced in H. heurippa due to introgression from H. melpomene.

  • Lambda = 0.59 (p = 0.009): moderate phylogenetic signal. The trait is not evolving independently of the network (lambda > 0), but the fit is better with reduced covariance (lambda < 1).

  • Comparison with tree-based signal: running phylogenetic_signal on the species tree alone would likely produce a lower K value because the tree VCV does not account for the shared ancestry introduced by introgression. The network-based K is a more accurate estimate of how much evolutionary history explains trait variation.

Why this matters: Without accounting for the network, the tree treats H. heurippa's wing pattern as an independent observation. In reality, its wing pattern was partly inherited from H. melpomene through hybridization — the network VCV correctly reflects this shared ancestry, producing a more accurate phylogenetic signal estimate.


Algorithm — Bastide et al. 2018 network VCV

For Brownian motion on a network, the trait at a hybrid node h with parents p1 and p2 is:

X_h = gamma * (X_p1 + noise_1) + (1-gamma) * (X_p2 + noise_2)

The network VCV is computed recursively in topological order:

  • Tree node c with parent p, edge length l:
    V[c,c] = V[p,p] + l
    V[c,j] = V[p,j] for all other nodes j

  • Hybrid node h with parents p1 (weight gamma) and p2 (weight 1-gamma):
    V[h,h] = gamma^2 * (V[p1,p1] + l1) + (1-gamma)^2 * (V[p2,p2] + l2) + 2*gamma*(1-gamma)*V[p1,p2]
    V[h,j] = gamma * V[p1,j] + (1-gamma) * V[p2,j]

The tip-by-tip submatrix is the VCV used for K and lambda.


OU shift detection (l1ou)

Function names: ou_shift_detection; ou_shifts; l1ou; detect_shifts
Command line interface: pk_ou_shift_detection; pk_ou_shifts; pk_l1ou; pk_detect_shifts

Automatic OU shift detection using the LASSO-based approach from Khabbazian et al. (2016). Discovers where on the phylogeny the adaptive optimum changed without requiring an a priori regime assignment. Only a tree and continuous trait data are needed.

The algorithm:

  1. Fits a single-regime OU model to estimate alpha (selection strength)

  2. Builds a design matrix with one column per candidate shift edge

  3. Uses Cholesky transformation to remove phylogenetic correlation

  4. Runs a LASSO path to identify candidate shift configurations

  5. Selects the best model using pBIC, BIC, or AICc

phykit l1ou -t <tree> -d <trait_data> [--criterion pBIC] [--max-shifts N] [--json]

Options:
-t/\-\-tree: a tree file in Newick format
-d/\-\-trait_data: tab-delimited trait file (taxon<tab>value)
--criterion: model selection criterion: pBIC (default), BIC, or AICc
--max-shifts: maximum number of shifts to consider (default: n/2)
--json: optional argument to print results as JSON

Example output (no shifts detected):

============================================================
OU Shift Detection (l1ou)
============================================================
Number of tips:       8
Number of shifts:     0
Selection criterion:  pBIC
Alpha (OU strength):  0.784803
Sigma² (BM rate):     1.203455
Root optimum (θ₀):    1.206251
Log-likelihood:       -10.2890
pBIC:                 26.8163
BIC:                  26.8163
AICc:                 32.5780

No shifts detected — single-regime OU is best.
============================================================

Example output (shifts detected):

============================================================
OU Shift Detection (l1ou)
============================================================
Number of tips:       100
Number of shifts:     8
Selection criterion:  pBIC
Alpha (OU strength):  0.606894
Sigma² (BM rate):     0.062519
Root optimum (θ₀):    0.248810
Log-likelihood:       48.6896
pBIC:                 17.6266
BIC:                  -9.8811
AICc:                 -49.8793

Detected shifts:
------------------------------------------------------------
  Shift 1: terminal branch to valencienni
           New optimum: -0.564678
  Shift 2: terminal branch to insolitus
           New optimum: -0.876398
  Shift 3: stem of (barbatus, porcus, ... +2 more)
           New optimum: -0.635087
  Shift 4: stem of (altitudinalis, oporinus, ... +13 more)
           New optimum: -0.462944
============================================================

Results have been validated against R's l1ou package (Khabbazian et al. 2016). On a 100-tip lizard dataset, PhyKIT recovers the same 8 adaptive shifts with matching alpha (0.607) and pBIC (17.6 vs R's 16.8).


Patristic distances

Function names: patristic_distances; pd
Command line interface: pk_patristic_distances; pk_pd

Calculate summary statistics among patristic distances in a phylogeny.

Patristic distances are all tip-to-tip distances in a phylogeny.

To obtain all patristic distances, use the -v/--verbose option. With the -v option, the first column will have two taxon names separated by a '-' followed by the patristic distance. Features will be tab separated.

phykit patristic_distances <tree> [-v/--verbose] [--json]

Options:
<tree>: first argument after function name should be a tree file
-v/\-\-verbose: optional argument to print all tip-to-tip distances
--json: optional argument to print results as JSON


Phenogram (traitgram)

Function names: phenogram; traitgram; tg
Command line interface: pk_phenogram; pk_traitgram; pk_tg

Plot a phenogram (traitgram) showing trait evolution along a phylogeny (analogous to R's phytools::phenogram()). The X-axis represents distance from the root and the Y-axis represents trait values. Ancestral states are reconstructed via maximum-likelihood.

phykit phenogram -t <tree> -d <trait_data> -o <output.png> [--json]

Options:
-t/\-\-tree: a tree file in Newick format
-d/\-\-trait_data: tab-delimited trait file (taxon<tab>value)
-o/\-\-output: output plot file path (required)
--json: optional argument to print results as JSON

../_images/phenogram_example.png

Phylogenetic GLM

Function names: phylogenetic_glm; phylo_glm; pglm
Command line interface: pk_phylogenetic_glm; pk_phylo_glm; pk_pglm

Fit a Phylogenetic Generalized Linear Model (GLM) for binary or count response data while accounting for phylogenetic non-independence among species.

Two families are supported:

  • binomial: logistic regression via Maximum Penalized Likelihood Estimation (logistic_MPLE; Ives & Garland 2010). Uses Firth's penalty to prevent bias from complete/quasi-complete separation, and jointly estimates the phylogenetic signal parameter alpha via a two-state continuous-time Markov chain on the tree.

  • poisson: Poisson regression via Generalized Estimating Equations (poisson_GEE; Paradis & Claude 2002). Uses Fisher scoring with the phylogenetic correlation matrix and reports an overdispersion parameter.

The multi-trait input file should be tab-delimited with a header row: taxon<tab>trait1<tab>trait2<tab>...

Output includes coefficient estimates, standard errors, z-values, p-values, log-likelihood, AIC, and McFadden's pseudo-R² (computed from full vs. intercept-only model log-likelihoods).

phykit phylogenetic_glm -t <tree> -d <trait_data> -y <response> -x <predictor1> [predictor2 ...] --family <binomial|poisson> [-g <gene_trees>] [--json]

Options:
-t/\-\-tree: a tree file in Newick format
-d/\-\-trait_data: tab-delimited multi-trait file with header row
-y/\-\-response: response (dependent) variable column name
-x/\-\-predictors: one or more predictor column names
--family: distribution family: binomial or poisson
--method: estimation method: logistic_MPLE or poisson_GEE (auto from family)
--btol: linear predictor bound for logistic model (default: 10)
--log-alpha-bound: bound on log(alpha) for logistic model (default: 4)
-g/\-\-gene-trees: optional multi-Newick file of gene trees; when provided, uses a discordance-aware VCV (genome-wide average) instead of the species-tree VCV
--json: optional argument to print results as JSON


Phylogenetic Ordination

Function names: phylogenetic_ordination; phylo_ordination; ordination; ord; phylo_pca; phyl_pca; ppca; phylo_dimreduce; dimreduce; pdr
Command line interface: pk_phylogenetic_ordination; pk_phylo_ordination; pk_ordination; pk_ord; pk_phylo_pca; pk_phyl_pca; pk_ppca; pk_phylo_dimreduce; pk_dimreduce; pk_pdr

Perform phylogenetic ordination (PCA, t-SNE, or UMAP) on continuous multi-trait data while accounting for phylogenetic non-independence among species.

All methods use GLS-centering via the phylogenetic variance-covariance matrix. For PCA, eigendecomposition of the evolutionary rate matrix is performed to extract principal components. For t-SNE and UMAP, nonlinear embedding is applied to the GLS-centered data.

Three ordination methods are available via --method:

  • pca (default): phylogenetic PCA (Revell 2009). Performs eigendecomposition of the evolutionary rate matrix.

  • tsne: t-SNE embedding of the GLS-centered data. Perplexity is auto-set to min(30, (n-1)/3); requires at least 4 taxa.

  • umap: UMAP embedding of the GLS-centered data. n_neighbors is auto-set to min(15, n-1); requires at least 3 taxa.

Two phylogenetic correction modes are available via --correction:

  • BM (default): assumes traits evolved under Brownian motion. The phylogenetic VCV matrix is used directly.

  • lambda: jointly estimates a single Pagel's lambda parameter across all traits by maximum likelihood, then rescales the off-diagonal elements of the VCV matrix. This accounts for deviations from Brownian motion.

For PCA, two modes are available via --mode:

  • cov (default): PCA on the evolutionary rate (covariance) matrix.

  • corr: PCA on the evolutionary rate correlation matrix, which standardizes traits to equal variance before extracting components.

The multi-trait input file should be tab-delimited with a header row: taxon<tab>trait1<tab>trait2<tab>... Lines starting with '#' are treated as comments. If the tree and trait file have different taxa, the intersection is used and warnings are printed to stderr.

PCA output includes eigenvalues with proportion of variance explained, trait loadings, and taxon scores for each principal component. t-SNE/UMAP output includes method parameters and embedding coordinates. When correction=lambda is used, the estimated lambda and log-likelihood are also reported.

PCA results have been benchmarked against the R package phytools (phyl.pca function; Revell 2012). Eigenvalues, loadings, and scores match phytools across all method/mode combinations (BM+cov, BM+corr, lambda+cov, lambda+corr) within numerical tolerance (1e-4).

../_images/phylogenetic_pca_plot.png

../_images/phylogenetic_tsne_plot.png

../_images/phylogenetic_umap_plot.png

phykit phylogenetic_ordination -t <tree> -d <trait_data> [--method <pca|tsne|umap>] [--correction <BM|lambda>] [--mode <cov|corr>] [--n-components <int>] [--perplexity <float>] [--n-neighbors <int>] [--min-dist <float>] [--seed <int>] [--plot] [--plot-tree] [--no-plot-tree] [--color-by <col_or_file>] [--tree-color-by <col_or_file>] [--plot-output <path>] [-g <gene_trees>] [--json]

Options:
-t/\-\-tree: a tree file in Newick format
-d/\-\-trait_data: tab-delimited multi-trait file with header row
--method: ordination method: pca, tsne, or umap (default: pca)
--correction: phylogenetic correction: BM or lambda (default: BM)
--mode: PCA mode: cov or corr (default: cov; PCA only)
--n-components: number of embedding dimensions (default: 2; tsne/umap only)
--perplexity: t-SNE perplexity (default: auto)
--n-neighbors: UMAP n_neighbors (default: auto)
--min-dist: UMAP min_dist (default: 0.1)
--seed: random seed for reproducibility
--plot: optional argument to save a scatter plot
--plot-tree: overlay the phylogeny via ancestral reconstruction (default for tsne/umap; opt-in for pca)
--no-plot-tree: disable the phylogeny overlay for tsne/umap plots
--color-by: color tip points by a trait; specify a column name from the multi-trait file or a separate tab-delimited file (taxon<tab>value) for continuous or discrete coloring
--tree-color-by: color phylogeny edges by a trait; specify a column name or a tab-delimited file (default: distance from root)
--plot-output: output path for plot (default: phylo_ordination_plot.png)
-g/\-\-gene-trees: optional multi-Newick file of gene trees; when provided, uses a discordance-aware VCV (genome-wide average) instead of the species-tree VCV
--json: optional argument to print results as JSON


Phylogenetic regression (PGLS)

Function names: phylogenetic_regression; phylo_regression; pgls
Command line interface: pk_phylogenetic_regression; pk_phylo_regression; pk_pgls

Fit a Phylogenetic Generalized Least Squares (PGLS) regression while accounting for phylogenetic non-independence among species, analogous to R's caper::pgls().

The multi-trait input file should be tab-delimited with a header row: taxon<tab>trait1<tab>trait2<tab>... Lines starting with '#' are treated as comments. If the tree and trait file have different taxa, the intersection is used and warnings are printed to stderr.

Two methods are available:

  • BM (default): Brownian motion with lambda fixed at 1

  • lambda: jointly estimates Pagel's lambda via maximum likelihood

Output includes coefficient estimates, standard errors, t-values, p-values, R-squared, adjusted R-squared, F-statistic, log-likelihood, and AIC.

A three-way variance decomposition is also reported: R²_total (variance explained by phylogeny + predictors combined), R²_pred (predictor contribution given phylogeny, = standard R²), and R²_phylo (phylogeny's unique contribution). R²_phylo + R²_pred = R²_total.

The implementation uses the raw phylogenetic variance-covariance (VCV) matrix for GLS estimation, matching the approach used by R's caper::pgls(). Note that this differs from nlme::gls() with corBrownian, which normalizes the VCV to a correlation matrix (ones on the diagonal). For non-ultrametric trees the two approaches yield different coefficient estimates; PhyKIT follows the caper convention.

All results have been validated against R 4.4.0 using manual GLS with ape::vcv() (raw VCV). Coefficients, standard errors, t-values, p-values, R-squared, F-statistic, log-likelihood, and AIC match R to at least four decimal places for both simple and multiple regression.

phykit phylogenetic_regression -t <tree> -d <trait_data> -y <response> -x <predictor1> [predictor2 ...] [-m <method>] [-g <gene_trees>] [--json]

Options:
-t/\-\-tree: a tree file in Newick format
-d/\-\-trait_data: tab-delimited multi-trait file with header row
-y/\-\-response: response (dependent) variable column name
-x/\-\-predictors: one or more predictor column names
-m/\-\-method: method to use: BM or lambda (default: BM)
-g/\-\-gene-trees: optional multi-Newick file of gene trees; when provided, uses a discordance-aware VCV (genome-wide average) instead of the species-tree VCV
--json: optional argument to print results as JSON


Phylogenetic signal

Function names: phylogenetic_signal; phylo_signal; ps
Command line interface: pk_phylogenetic_signal; pk_phylo_signal; pk_ps

Calculate phylogenetic signal for continuous trait data on a phylogeny.

Two methods are available:

  • Blomberg's K (Blomberg et al. 2003): measures the degree of phylogenetic signal relative to expectation under Brownian motion. K = 1 indicates trait variation consistent with BM; K < 1 indicates less phylogenetic signal than expected; K > 1 indicates more. P-value is computed via permutation test.

  • Pagel's lambda (Pagel 1999): a tree-scaling parameter estimated by maximum likelihood. Lambda = 0 indicates no phylogenetic signal; lambda = 1 indicates trait evolution consistent with BM. P-value is computed via likelihood ratio test against lambda = 0.

The trait file should be tab-delimited with two columns (taxon_name<tab>trait_value). Lines starting with '#' are treated as comments. If the tree and trait file have different taxa, the intersection is used and warnings are printed to stderr.

Output for Blomberg's K: K_value<tab>p_value<tab>R2_phylo
Output for Pagel's lambda: lambda_value<tab>log_likelihood<tab>p_value<tab>R2_phylo

R²_phylo reports the fraction of trait variance explained by phylogenetic structure: R²_phylo = 1 - (σ²_BM / σ²_WN). Values near 1 indicate strong phylogenetic signal; values near 0 indicate phylogeny explains little trait variance.

Results have been validated against the R package phytools (phylosig function) across 95 simulated datasets spanning diverse tree sizes (5-50 tips), topologies (pure-birth, coalescent), trait models (random, Brownian motion, known lambda), and branch length scales. All metrics show Pearson r > 0.999 with phytools.

../_images/phylogenetic_signal_validation.png

phykit phylogenetic_signal -t <tree> -d <trait_data> [-m <method>] [-p <permutations>] [-g <gene_trees>] [--json]

Options:
-t/\-\-tree: a tree file in Newick format
-d/\-\-trait_data: tab-delimited trait file (taxon_name<tab>trait_value)
-m/\-\-method: method to use: blombergs_k or lambda (default: blombergs_k)
-p/\-\-permutations: number of permutations for Blomberg's K (default: 1000)
-g/\-\-gene-trees: optional multi-Newick file of gene trees; when provided, uses a discordance-aware VCV (genome-wide average) instead of the species-tree VCV
--json: optional argument to print results as JSON


Phylomorphospace

Function names: phylomorphospace; phylomorpho; phmo
Command line interface: pk_phylomorphospace; pk_phylomorpho; pk_phmo

Plot a phylomorphospace: two raw traits in trait space with the phylogeny overlaid via ML-reconstructed ancestral states at internal nodes. This differs from the phylogenetic_ordination --plot-tree option, which plots in PC space; phylomorphospace plots raw trait values directly on the x and y axes.

Tree edges connect species through ML-reconstructed ancestral states and are colored by distance from root (coolwarm colormap with colorbar). Tip points default to blue, or can be colored by a continuous or discrete variable using the --color-by option.

The multi-trait input file should be tab-delimited with a header row: taxon<tab>trait1<tab>trait2<tab>... Lines starting with '#' are treated as comments. If the tree and trait file have different taxa, the intersection is used and warnings are printed to stderr.

If the trait file has exactly 2 trait columns and --trait-x / --trait-y are omitted, the first two columns are selected automatically.

../_images/phylomorphospace_plot.png

phykit phylomorphospace -t <tree> -d <trait_data> [--trait-x <name>] [--trait-y <name>] [--color-by <col_or_file>] [--plot-output <path>] [--json]

Options:
-t/\-\-tree: a tree file in Newick format
-d/\-\-trait_data: tab-delimited multi-trait file with header row
--trait-x: column name for x-axis trait
--trait-y: column name for y-axis trait
--color-by: color tip points by a trait; specify a column name from the multi-trait file or a separate tab-delimited file (taxon<tab>value) for continuous or discrete coloring
--plot-output: output path for plot (default: phylomorphospace_plot.png)
--json: optional argument to print results as JSON


Polytomy testing

Function names: polytomy_test; polyt_test; polyt; ptt
Command line interface: pk_polytomy_test; pk_polyt_test; pk_polyt; pk_ptt

Conduct a polytomy test for three clades in a phylogeny.

Polytomy tests can be used to identify putative radiations as well as identify well supported alternative topologies.

The polytomy testing function takes as input a file with the three groups of taxa to test the relationships for and a single column file with the names of the desired tree files to use for polytomy testing. Next, the script to examine support for the grouping of the three taxa using triplets and gene support frequencies.

This function can account for uncertainty in gene trees - that is, the input phylogenies can have collapsed bipartitions.

Thereafter, a chi-squared test is conducted to determine if there is evidence to reject the null hypothesis wherein the null hypothesis is that the three possible topologies among the three groups are equally supported. This test is done using gene support frequencies.

phykit polytomy_test -t/--trees <trees> -g/--groups <groups> [--json]

Options:
-t/\-\-trees <trees>: single column file with the names of phylogenies to use for polytomy testing
-g/\-\-groups: a tab-delimited file with the grouping designations to test. Lines starting with commetns are not considered. Names of individual taxa should be separated by a semi-colon ';'
--json: optional argument to print results as JSON

For example, the groups file could look like the following:

#label group0  group1  group2
name_of_test    tip_name_A;tip_name_B   tip_name_C  tip_name_D;tip_name_E

Prune tree

Function names: prune_tree; prune
Command line interface: pk_prune_tree; pk_prune

Prune tips from a phylogeny.

Provide a single column file with the names of the tips in the input phylogeny you would like to prune from the tree.

phykit prune_tree <tree> <list_of_taxa> [-o/--output <output_file>] [-k/--keep] [--json]

Options:
<tree>: first argument after function name should be a tree file
<list_of_taxa>: single column file with the names of the tips to remove from the phylogeny
-o/\-\-output: name of output file for the pruned phylogeny. Default output will have the same name as the input file but with the suffix ".pruned"
-k/--keep: optional argument. If used instead of pruning taxa in <list_of_taxa>, keep them
--json: optional argument to print results as JSON |

Quartet network

Function names: quartet_network; quartet_net; qnet; nanuq
Command line interface: pk_quartet_network; pk_quartet_net; pk_qnet; pk_nanuq

Quartet-based network inference (NANUQ-style) for distinguishing incomplete lineage sorting (ILS) from hybridization/gene flow using quartet concordance factors from gene trees.

For each 4-taxon subset, counts how many gene trees display each of the 3 possible unrooted topologies and applies two hypothesis tests:

  1. Star test (Pearson chi-squared): tests whether the three topology counts are consistent with a star tree (equal probabilities 1/3 each). If p_star > beta (default 0.95), the quartet is classified as unresolved.

  2. T3 tree model test (G-test / likelihood ratio): tests whether the counts are consistent with any resolved quartet tree under the multispecies coalescent. If p_tree > alpha (default 0.05), the quartet is classified as tree-like (conflict is due to ILS).

  3. If both tests reject their null hypotheses, the quartet is classified as hybrid (asymmetric discordance indicating gene flow or hybridization).

This algorithm matches the NANUQ method of Allman, Baños & Rhodes (2019), implemented in R's MSCquartets package.

Input can be either: 1) a file with one Newick tree per line, or 2) a file with one tree-file path per line.

phykit quartet_network -t/--trees <trees> [--alpha 0.05] [--beta 0.95] [--missing-taxa error|shared] [--plot-output <file>] [--json]

Options:
-t/\-\-trees: file containing trees (one Newick per line) or tree-file paths (one per line)
--alpha: significance level for the T3 tree model test (default: 0.05)
--beta: threshold for the star tree test; quartets with p_star > beta are called unresolved (default: 0.95)
--missing-taxa: handling strategy for mismatched taxa (error or shared; default: error)
--plot-output: output filename for the quartet network plot (optional)
--json: optional argument to print results as JSON

When --plot-output is specified, a NANUQ-style splits graph is drawn from the quartet distance matrix using Neighbor-Joining and circular split decomposition. Tree-like relationships appear as simple branching, while hybridization / reticulation produces characteristic box (parallelogram) structures — the same style of output produced by R's MSCquartets + NeighborNet pipeline.

No hybridization signal — all quartets are tree-like, so the splits graph is a clean unrooted tree:

../_images/quartet_network_tree.png

Hybridization signal present — hybrid quartets introduce conflicting splits that appear as boxes in the network, indicating reticulation among C, D, E, and F:

../_images/quartet_network_hybrid.png

Rate heterogeneity test (multi-rate Brownian motion)

Function names: rate_heterogeneity; brownie; rh
Command line interface: pk_rate_heterogeneity; pk_brownie; pk_rh

Test for rate heterogeneity across phylogenetic regimes using multi-rate Brownian motion (O'Meara et al. 2006), analogous to R's phytools::brownie.lite().

Fits single-rate vs. multi-rate BM models and performs a likelihood ratio test. Users specify a tree, continuous trait data, and a regime file mapping tips to regimes (tab-delimited, taxon<tab>regime_label).

Regime assignments to internal branches are inferred via Fitch parsimony. Per-regime VCV matrices are decomposed and per-regime sigma-squared values are estimated via maximum likelihood.

Optionally, a parametric bootstrap can be run to compute a simulated p-value (-n/--nsim). A horizontal phylogram with branches colored by regime can be generated using --plot.

An effect size metric R²_regime is also reported, measuring the variance reduction from regime-specific rates vs. a single rate, weighted by the number of tips per regime.

phykit rate_heterogeneity -t <tree> -d <trait_data> -r <regime_data> [-n <nsim>] [--seed <seed>] [--plot <output.png>] [--json]

Options:
-t/\-\-tree: a tree file in Newick format
-d/\-\-trait_data: tab-delimited trait file (taxon<tab>value)
-r/\-\-regime_data: tab-delimited regime file (taxon<tab>regime_label)
-n/\-\-nsim: number of parametric bootstrap simulations (default: 0)
--seed: random seed for reproducibility
--plot: output plot file path for phylogram with colored branches
--json: optional argument to print results as JSON


Rename tree tips

Function names: rename_tree_tips; rename_tree; rename_tips
Command line interface: pk_rename_tree_tips; pk_rename_tree; pk_rename_tips

Renames tips in a phylogeny.

Renaming tip files will follow the scheme of a tab-delimited file wherein the first column is the current tip name and the second column is the desired tip name in the resulting phylogeny.

phykit rename_tree_tips <tree> -i/--idmap <idmap.txt> [-o/--output <output_file>] [--json]

Options:
<tree>: first argument after function name should be a tree file
-i/\-\-idmap: identifier map of current tip names (col1) and desired tip names (col2)
-o/\-\-output: optional argument to write the renamed tree files to. Default output will have the same name as the input file but with the suffix ".renamed"
--json: optional argument to print results as JSON


Robinson-Foulds distance

Function names: robinson_foulds_distance; rf_distance; rf_dist; rf
Command line interface: pk_robinson_foulds_distance; pk_rf_distance; pk_rf_dist; pk_rf

Calculate Robinson-Foulds (RF) distance between two trees.

Low RF distances reflect greater similarity between two phylogenies. This function prints out two values, the plain RF value and the normalized RF value, which are separated by a tab. Normalized RF values are calculated by taking the plain RF value and dividing it by 2(n-3) where n is the number of tips in the phylogeny. Prior to calculating an RF value, PhyKIT will first determine the number of shared tips between the two input phylogenies and prune them to a common set of tips. Thus, users can input trees with different topologies and infer an RF value among subtrees with shared tips.

PhyKIT will print out col 1; the plain RF distance and col 2: the normalized RF distance.

RF distances are calculated following Robinson & Foulds, Mathematical Biosciences (1981), doi: 10.1016/0025-5564(81)90043-2.

phykit robinson_foulds_distance <tree_file_zero> <tree_file_one> [--json]

Options:
<tree_file_zero>: first argument after function name should be a tree file
<tree_file_one>: second argument after function name should be a tree file
--json: optional argument to print results as JSON


Root tree

Function names: root_tree; root; rt
Command line interface: pk_root_tree; pk_root; pk_rt

Roots phylogeny using user-specified taxa.

A list of taxa to root the phylogeny on should be specified using the -r argument. The root_taxa file should be a single-column file with taxa names. The outputted file will have the same name as the inputted tree file but with the suffix ".rooted".

phykit root_tree <tree> -r/--root <root_taxa> [-o/--output <output_file>] [--json]

Options:
<tree>: first argument after function name should be a tree file to root|br| -r/\-\-root: single column file with taxa names to root the phylogeny on|br| -o/\-\-output: optional argument to specify the name of the output file
--json: optional argument to print results as JSON


Spurious homolog identification

Function names: spurious_sequence; spurious_seq; ss
Command line interface: pk_spurious_sequence; pk_spurious_seq; pk_ss

Determines potentially spurious homologs using branch lengths.

Identifies potentially spurious sequences and reports tips in the phylogeny that could possibly be removed from the associated multiple sequence alignment. PhyKIT does so by identifying and reporting long terminal branches defined as branches that are equal to or 20 times the median length of all branches.

PhyKIT reports the following information col1: name of tip that is a putatively spurious sequence col2: length of branch leading to putatively spurious sequence col3: threshold used to identify putatively spurious sequences col4: median branch length in the phylogeny

If there are no putatively spurious sequences, "None" is reported.

Using this method to identify potentially spurious sequences was, to my knowledge, first introduced by Shen et al., (2018) Cell doi: 10.1016/j.cell.2018.10.023.

phykit spurious_seq <file> -f/\\-\\-factor [--json]

Options:
<file>: first argument after function name should be a tree file
-f/\-\-factor: factor to multiply median branch length by to calculate the threshold of long branches. (Default: 20)
--json: optional argument to print results as JSON


Stochastic character mapping (SIMMAP)

Function names: stochastic_character_map; simmap; scm
Command line interface: pk_stochastic_character_map; pk_simmap; pk_scm

Perform Stochastic Character Mapping (SIMMAP) of discrete traits onto a phylogeny (Huelsenbeck et al. 2003; Bollback 2006), analogous to R's phytools::make.simmap().

Fits a continuous-time Markov chain (CTMC) rate matrix Q via maximum likelihood using Felsenstein's pruning algorithm. Three substitution models are available:

  • ER (equal rates): 1 free parameter

  • SYM (symmetric rates): k(k-1)/2 free parameters

  • ARD (all rates differ): k(k-1) free parameters

The input trait file should be tab-delimited with a header row: taxon<tab>trait_column<tab>... Lines starting with '#' are treated as comments. If the tree and trait file have different taxa, the intersection is used and warnings are printed to stderr.

Output includes the fitted Q matrix, log-likelihood, mean dwelling times, mean transition counts, and posterior probabilities at internal nodes.

Optionally, a horizontal phylogram plot with branches colored by mapped character state can be generated using the --plot argument.

All results have been validated against R 4.4.0 using phytools::fitMk and phytools::make.simmap. ER log-likelihoods match R to within 0.002 (both optimizers converge to the same flat plateau). For the ARD model, PhyKIT's multi-start optimizer finds a slightly better local optimum than R (loglik -8.38 vs -8.43). Dwelling times sum to total tree length exactly, and Q matrix structural properties (rows sum to zero, model nesting) are verified.

phykit stochastic_character_map -t <tree> -d <trait_data> -c <trait_column> [-m <model>] [-n <nsim>] [--seed <seed>] [--plot <output.png>] [--json]

Options:
-t/\-\-tree: a tree file in Newick format
-d/\-\-trait_data: tab-delimited trait file with header row
-c/\-\-trait: column name for discrete character trait
-m/\-\-model: substitution model: ER, SYM, or ARD (default: ER)
-n/\-\-nsim: number of stochastic mapping simulations (default: 100)
--seed: random seed for reproducibility
--plot: output plot file path for phylogram with colored branches
--json: optional argument to print results as JSON


Terminal branch statistics

Function names: terminal_branch_stats; tbs
Command line interface: pk_terminal_branch_stats; pk_tbs

Calculate summary statistics for terminal branch lengths in a phylogeny.

Terminal branch lengths can be useful for phylogeny diagnostics.

To obtain all terminal branch lengths, use the -v/\-\-verbose option.

phykit terminal_branch_stats <tree> [-v/--verbose] [--json]

Options:
<tree>: first argument after function name should be a tree file
-v/\-\-verbose: optional argument to print all terminal branch lengths
--json: optional argument to print results as JSON


Threshold model

Function names: threshold_model; threshold; thresh; threshbayes; thresh_bayes
Command line interface: pk_threshold_model; pk_threshold; pk_thresh; pk_threshbayes; pk_thresh_bayes

Estimate the evolutionary correlation between two traits using the Felsenstein (2012) threshold model via MCMC. Binary discrete characters are modelled as arising from continuous latent "liabilities" that evolve under Brownian motion and cross a threshold at 0. This lets you estimate correlations between binary traits (or between a binary and a continuous trait) using BM rather than Mk transition rates.

This is the Python equivalent of phytools::threshBayes in R.

The sampler uses a Gibbs / Metropolis-Hastings hybrid:

  • Gibbs step: sample each discrete tip's liability from a truncated normal conditioned on all other values

  • Metropolis step: update sigma2_1, sigma2_2 (log-normal proposal), ancestral values a1, a2 (normal proposal), and the correlation r (normal proposal with reflection on [-1, 1])

  • Adaptive tuning: during burn-in, proposal variances are adjusted to target ~23% acceptance

Trait types:

  • discrete: binary (0/1). Liabilities < 0 map to state 0, liabilities > 0 map to state 1.

  • continuous: observed values used directly (no liability needed).

Any combination of two traits is supported: discrete+continuous, discrete+discrete, or continuous+continuous.

phykit threshold_model -t <tree> -d <trait_data> --traits <t1,t2> --types <type1,type2> [--ngen 100000] [--sample 100] [--burnin 0.2] [--seed <int>] [--plot <file>] [--json]

Options:
-t/\-\-tree: a rooted phylogeny file with branch lengths (required)
-d/\-\-trait-data: tab-delimited trait file with header row (required)
--traits: comma-separated pair of trait column names (required)
--types: comma-separated pair of trait types, each discrete or continuous (required)
--ngen: number of MCMC generations (default: 100000)
--sample: sample frequency (default: 100)
--burnin: burn-in fraction (default: 0.2)
--seed: random seed for reproducibility
--plot: output filename for trace and posterior density plot (3 rows x 2 columns:
left = MCMC trace, right = posterior histogram with 95% HPD shading)
--json: optional argument to print results as JSON

Output (text mode):
Trait 1: habitat (discrete, 2 states: 0, 1)
Trait 2: body_mass (continuous)
MCMC: 100000 generations, sampled every 100, burn-in 20%
---
Posterior correlation (r): 0.6234 (95% HPD: 0.312, 0.891)
Posterior sigma2_1: 1.234 (95% HPD: 0.456, 2.345)
Posterior sigma2_2: 0.567 (95% HPD: 0.234, 1.012)
Acceptance rates: r=0.234, sigma2_1=0.312, sigma2_2=0.287, a1=0.241, a2=0.228


Tutorial: habitat type and body mass in carnivores

This example uses the classic 8-taxon carnivore tree to test whether habitat type (a binary trait: 0 = non-arboreal, 1 = arboreal) is correlated with body mass on the latent liability scale.

Step 1: Prepare the trait file.

Create a tab-delimited file with a header row. The first column is the taxon name, followed by columns for each trait:

taxon        habitat body_mass
raccoon      0       1.04
bear 0       2.39
sea_lion     0       2.30
seal 0       1.88
monkey       1       0.60
cat  1       0.56
weasel       1       -0.30
dog  0       1.18

Step 2: Run the threshold model.

phykit threshold_model \
  -t carnivore.tre \
  -d traits.tsv \
  --traits habitat,body_mass \
  --types discrete,continuous \
  --ngen 100000 \
  --seed 42

Step 3: Examine the trace plots for convergence.

phykit threshold_model \
  -t carnivore.tre \
  -d traits.tsv \
  --traits habitat,body_mass \
  --types discrete,continuous \
  --ngen 100000 \
  --seed 42 \
  --plot trace.png

This generates a 3-row x 2-column figure. The left column shows MCMC trace plots for r, sigma2_1, and sigma2_2 (check for stationarity and good mixing). The right column shows posterior density histograms with red shading for the 95% HPD interval and a dashed line at the posterior mean.

Step 4: Get full posterior samples as JSON for custom analysis.

phykit threshold_model \
  -t carnivore.tre \
  -d traits.tsv \
  --traits habitat,body_mass \
  --types discrete,continuous \
  --ngen 100000 \
  --seed 42 \
  --json > posterior.json

The JSON output includes full posterior sample arrays, summary statistics (mean, median, 95% HPD), and MCMC metadata.


Tip labels

Function names: tip_labels; tree_labels; labels; tl
Command line interface: pk_tip_labels; pk_tree_labels; pk_labels; pk_tl

Prints the tip labels (or names) a phylogeny.

phykit tip_labels <tree> [--json]

Options:
<tree>: first argument after function name should be a tree file
--json: optional argument to print results as JSON


Tip-to-tip distance

Function names: tip_to_tip_distance; t2t_dist; t2t
Command line interface: pk_tip_to_tip_distance; pk_t2t_dist; pk_t2t

Calculate distance between two tips (or leaves) in a phylogeny.

Distances are in substitutions per site.

phykit tip_to_tip_distance <tree_file> <tip_1> <tip_2> [--json]
phykit tip_to_tip_distance <tree_file> --all-pairs [--plot] [--plot-output <path>] [--json]

Options:
<tree_file>: first argument after function name should be a tree file
<tip_1>: second argument should be the name of the first tip of interest
<tip_2>: third argument should be the name of the second tip of interest
--all-pairs: optional argument to report all pairwise tip distances
--plot: optional argument to save a clustered distance heatmap (requires --all-pairs)
--plot-output: output path for heatmap (default: tip_to_tip_distance_heatmap.png)
--json: optional argument to print results as JSON


Tip-to-tip node distance

Function names: tip_to_tip_node_distance; t2t_node_dist; t2t_nd
Command line interface: pk_tip_to_tip_node_distance; pk_t2t_node_dist; pk_t2t_nd

Calculate distance between two tips (or leaves) in a phylogeny.

Distance is measured by the number of nodes between one tip and another.

phykit tip_to_tip_node_distance <tree_file> <tip_1> <tip_2> [--json]

Options:
<tree_file>: first argument after function name should be a tree file
<tip_1>: second argument should be the name of the first tip of interest
<tip_2>: third argument should be the name of the second tip of interest
--json: optional argument to print results as JSON


Total tree length

Function names: total_tree_length; tree_len
Command line interface: pk_total_tree_length; pk_tree_len

Calculate total tree length, which is a sum of all branches.

phykit total_tree_length <tree> [--json]

Options:
<tree>: first argument after function name should be a tree file
--json: optional argument to print results as JSON


Treeness

Function names: treeness; tness
Command line interface: pk_treeness; pk_tness

Calculate treeness statistic for a phylogeny.

Higher treeness values are thought to be desirable because they represent a higher signal-to-noise ratio.

Treeness is the sum of internal branch lengths divided by the total tree length. Therefore, values range from 0 to 1. Treeness can be used as a measure of the signal-to-noise ratio in a phylogeny.

Calculate treeness (also referred to as stemminess) following Lanyon, The Auk (1988), doi: 10.1093/auk/105.3.565 and Phillips and Penny, Molecular Phylogenetics and Evolution (2003), doi: 10.1016/S1055-7903(03)00057-5.

phykit treeness <tree> [--json]

Options:
<tree>: first argument after function name should be a tree file
--json: optional argument to print results as JSON


Alignment- and tree-based functions

Relative rate test

Function names: relative_rate_test; rrt; tajima_rrt
Command line interface: pk_relative_rate_test; pk_rrt; pk_tajima_rrt

Tajima's relative rate test (Tajima, Genetics, 1993).

Tests whether two ingroup lineages have evolved at equal rates since diverging from their common ancestor. The outgroup is automatically inferred from the rooted tree as the earliest-diverging taxon (the single taxon on the smaller side of the root split). All pairwise ingroup combinations are tested with Bonferroni and Benjamini-Hochberg FDR multiple testing correction.

At each alignment column, the test classifies informative sites:

  • m1: the first ingroup taxon differs from the outgroup, but the second matches

  • m2: the second ingroup taxon differs from the outgroup, but the first matches

  • Sites where both differ or both match the outgroup are uninformative and skipped

  • Sites with gaps or ambiguous characters are skipped

Test statistic: chi2 = (m1 - m2)^2 / (m1 + m2), with 1 degree of freedom.

Single alignment mode:

phykit relative_rate_test -a <alignment> -t <rooted_tree> [-v/--verbose] [--json]

Batch mode (multiple alignments, one shared tree):

phykit relative_rate_test -l <alignment_list> -t <rooted_tree> [-v/--verbose] [--json]

Options:
-a/\-\-alignment: a single alignment file
-l/\-\-alignment-list: a file with one alignment path per line (batch mode)
-t/\-\-tree: a rooted tree file
-v/\-\-verbose: print additional detail
--plot-output: save a pairwise p-value heatmap to this path
--json: optional argument to print results as JSON

Single mode outputs one row per ingroup pair with m1, m2, chi-squared, raw p-value, Bonferroni-corrected p-value, FDR-corrected p-value, and a significance indicator. Batch mode aggregates across genes, reporting the number and percentage of genes rejecting equal rates for each pair.

When --plot-output is provided, a symmetric heatmap of -log10(FDR-corrected p-values) is saved. Cells with significant pairs (FDR < 0.05) are marked with an asterisk. Darker colors indicate stronger evidence for unequal evolutionary rates. The dashed line on the colorbar marks the significance threshold (-log10(0.05) ≈ 1.3); cells above this line are significant.

Equal rates — all taxa evolve at similar rates, so no pairwise comparison reaches significance. The heatmap is uniformly pale with no asterisks:

../_images/rrt_equal_rates.png

Unequal rates — taxa B and C have accumulated many more substitutions than A and D. Significant pairs (dark red, marked with *) indicate lineages evolving at detectably different rates:

../_images/rrt_unequal_rates.png

Validated against R's pegas::rr.test() — chi-squared and p-values match to machine precision.


Saturation

Function names: saturation; sat
Command line interface: pk_saturation; pk_sat

Calculate saturation for a given tree and alignment.

Saturation is defined as sequences in multiple sequence alignments that have undergone numerous substitutions such that the distances between taxa are underestimated.

Data with no saturation will have a value of 1. The closer the value is to 1, the less saturated the data.

This function outputs two values (as of v1.19.9). The first value is the saturation value and the second column is the absolute value of saturation minus 1. Thus, lower values in the second column are indicative of values closer to one and, thus, less saturation.

Saturation is calculated following Philippe et al., PLoS Biology (2011), doi: 10.1371/journal.pbio.1000602.

phykit saturation -a <alignment> -t <tree> [-v/--verbose] [-e/--exclude_gaps] [--plot] [--plot-output <path>] [--json]

Options:
-a/\-\-alignment: an alignment file
-t/\-\-tree: a tree file
-e/--exclude_gaps: if a site has a gap, ignore it
-v/\-\-verbose: print out patristic distances and uncorrected
distances used to determine saturation
--plot: save a saturation scatter plot with fitted slope through origin
--plot-output: output path for saturation plot (default: saturation_plot.png)
--json: optional argument to print results as JSON

Treeness over RCV

Function names: treeness_over_rcv; toverr; tor
Command line interface: pk_treeness_over_rcv; pk_toverr; pk_tor

Calculate treeness/RCV for a given alignment and tree.

Higher treeness/RCV values are thought to be desirable because they harbor a high signal-to-noise ratio and are least susceptible to composition bias.

PhyKIT reports three tab delimited values: col1: treeness/RCV col2: treeness col3: RCV

Calculate treeness/RCV following Phillips and Penny, Molecular Phylogenetics and Evolution (2003), doi: 10.1016/S1055-7903(03)00057-5.

phykit treeness_over_rcv -a/--alignment <alignment> -t/--tree <tree> [--json]

Options:
-a/\-\-alignment: an alignment file
-t/\-\-tree: a tree file
--json: optional argument to print results as JSON