Usage
PhyKIT helps process and analyze multiple sequence alignments and phylogenies.
Generally, all functions are designed to help understand the contents of alignments (e.g., gc content or the number of parsimony informative sites) and the shape of trees (e.g., treeness, degree of violation of a molecular clock).
Some help messages indicate that summary statistics are reported (e.g., bipartition_support_stats). Summary statistics include mean, median, 25th percentile, 75th percentile, minimum, maximum, standard deviation, and variance. These functions typically have a verbose option that allows users to get the underlying data used to calculate summary statistics.
General usage
Calling functions
phykit <command> [optional command arguments]
Command specific help messages can be viewed by adding a -h/\-\-help argument after the command. For example, to see the help message for the command 'treeness', execute:
phykit treeness -h
# or
phykit treeness --help
Function aliases
Each function comes with aliases to save the user some key strokes. For example, to get the help message for the 'treeness' function, you can type:
phykit tness -h
Command line interfaces
As of version 1.2.0, all functions (including aliases) can be executed using a command line interface that starts with pk_. For example, instead of typing the previous command to get the help message of the treeness function, you can type:
pk_treeness -h
# or
pk_tness -h
All possible function names are specified at the top of each function section.
Functions by analytical category
The functions above are organized by input type. Below, the same functions are grouped by analytical purpose to help you find the right tool for your analysis.
Alignment quality & statistics
Alignment entropy: Shannon entropy across alignment sites
Alignment length: Length of an input alignment
Alignment length no gaps: Alignment length excluding gapped sites
Alignment outlier taxa: Identify outlier taxa in alignments
Column score: Column score for alignment quality
Compositional bias per site: Detect compositional bias across sites
Composition per taxon: Nucleotide or amino acid composition per taxon
Evolutionary Rate per Site: Site-specific evolutionary rate estimation
Guanine-cytosine (GC) content: GC content of an alignment
Occupancy per taxon: Taxon occupancy in alignment columns
Pairwise identity: Pairwise sequence identity in an alignment
Parsimony informative sites: Count parsimony informative sites
Plot alignment QC: Visual quality control plots for alignments
Relative composition variability: Composition variability across taxa
Relative composition variability, taxon: Per-taxon relative composition variability
Sum-of-pairs score: Sum-of-pairs alignment quality score
Variable sites: Count variable sites in an alignment
Alignment manipulation
Alignment recoding: Recode alignment into reduced alphabets
Create concatenation matrix: Concatenate multiple alignments into a supermatrix
Faidx: Extract entries from FASTA files
Mask alignment: Mask sites in an alignment
Rename FASTA entries: Rename entries in a FASTA file
Protein-to-nucleotide alignment: Thread nucleotide onto protein alignment
Tree summary statistics
Bipartition support statistics: Summary statistics of bipartition support values
Degree of violation of the molecular clock: Measure molecular clock violation
Evolutionary rate: Calculate tree-based evolutionary rate
Internal branch statistics: Summary statistics of internal branch lengths
Lineage-through-time plot and gamma statistic: Lineage-through-time analysis and gamma statistic
Long branch score: Identify long branches in a tree
Patristic distances: Pairwise patristic distances between taxa
Terminal branch statistics: Summary statistics of terminal branch lengths
Tip-to-tip distance: Distance between two tips in a tree
Tip-to-tip node distance: Node distance between two tips
Total tree length: Sum of all branch lengths
Treeness: Ratio of internal to total branch lengths
Tree manipulation & utilities
Branch length multiplier: Multiply branch lengths by a factor
Collapse bipartitions: Collapse low-support bipartitions
Internode labeler: Label internal nodes of a tree
Last common ancestor subtree: Extract subtree from LCA of specified taxa
Monophyly check: Test monophyly of a group of taxa
Nearest neighbor interchange: Generate NNI tree rearrangements
Print tree: Print ASCII representation of a tree
Prune tree: Prune taxa from a tree
Rename tree tips: Rename tip labels in a tree
Root tree: Root or reroot a tree
Tip labels: Print tip labels of a tree
Tree comparison & consensus
Consensus network: Consensus network from multiple trees
Consensus tree: Consensus tree from multiple trees
Cophylogenetic plot (tanglegram): Tanglegram for comparing two trees
Discordance asymmetry: Test for asymmetric discordance (gene flow detection)
Evolutionary tempo mapping: Detect rate-topology associations in gene trees
Polytomy testing: Test for polytomies in a tree
Quartet network: Quartet-based network visualization
Robinson-Foulds distance: Topological distance between trees
Phylogenetic signal
Network signal: Phylogenetic signal on networks
Phylogenetic signal: Test for phylogenetic signal in traits (supports discordance-aware VCV with
-g)
Trait evolution
Ancestral state reconstruction: Reconstruct ancestral character states
Concordance-aware ASR: ASR incorporating gene tree discordance
Continuous trait mapping (contMap): Map continuous traits onto a phylogeny
Density map: Posterior density of stochastic character maps
Continuous trait evolution model comparison (fitContinuous): Compare continuous trait evolution models (supports discordance-aware VCV with
-g)OU shift detection (l1ou): Detect OU regime shifts on a phylogeny
Multi-regime OU models (OUwie): Multi-regime Ornstein-Uhlenbeck models
Phenogram (traitgram): Phenogram visualizing trait evolution
Rate heterogeneity test (multi-rate Brownian motion): Test for rate heterogeneity in trait evolution
Stochastic character mapping (SIMMAP): Stochastic character mapping on a phylogeny
Threshold model: Felsenstein threshold model for trait correlation
Phylogenetic comparative methods
Phylogenetic GLM: Phylogenetic generalized linear model (supports discordance-aware VCV with
-g)Phylogenetic Ordination: Ordination incorporating phylogenetic structure (supports discordance-aware VCV with
-g)Phylogenetic regression (PGLS): Phylogenetic generalized least squares regression (supports discordance-aware VCV with
-g)Phylomorphospace: Phylomorphospace visualization
Evolutionary rate analysis
Covarying evolutionary rates: Detect covariation in evolutionary rates
Relative rate test: Relative rate test between lineages
Homology assessment
Hidden paralogy check: Check for hidden paralogy in gene trees
Spurious homolog identification: Identify spurious sequences in alignments
Saturation & model adequacy
Saturation: Test for substitution saturation
Treeness over RCV: Treeness over relative composition variability
Alignment-based functions
Alignment entropy
Function names: alignment_entropy; aln_entropy; entropy
Command line interface: pk_alignment_entropy; pk_aln_entropy; pk_entropy
Calculate alignment entropy.
Site-wise entropy is calculated using Shannon entropy. By default, PhyKIT reports the mean entropy across all sites in the alignment. With the -v/--verbose option, PhyKIT reports entropy for each site.
phykit alignment_entropy <alignment> [-v/--verbose] [--plot] [--plot-output <path>] [--json]
Example output (default):
0.657
Example output (-v):
1 0.0
2 1.0
3 0.971
Options:
<alignment>: first argument after function name should be an alignment file
-v/\-\-verbose: optional argument to print entropy for each site
--plot: save a per-site alignment entropy plot
--plot-output: output path for plot (default: alignment_entropy_plot.png)
--json: optional argument to print results as JSON
Alignment length
Function names: alignment_length; aln_len; al
Command line interface: pk_alignment_length; pk_aln_len; pk_al
Length of an input alignment is calculated using this function.
Longer alignments are associated with strong phylogenetic signal.
Association between alignment length and phylogenetic signal was determined by Shen et al., Genome Biology and Evolution (2016), doi: 10.1093/gbe/evw179.
phykit aln_len <alignment> [--json]
Options:
<alignment>: first argument after function name should be an alignment file
--json: optional argument to print results as JSON
Alignment length no gaps
Function names: alignment_length_no_gaps; aln_len_no_gaps; alng
Command line interface: pk_alignment_length_no_gaps; pk_aln_len_no_gaps; pk_alng
Calculate alignment length excluding sites with gaps.
Longer alignments when excluding sites with gaps is associated with strong phylogenetic signal.
PhyKIT reports three tab delimited values: col1: number of sites without gaps col2: total number of sites col3: percentage of sites without gaps
Association between alignment length when excluding sites with gaps and phylogenetic signal was determined by Shen et al., Genome Biology and Evolution (2016), doi: 10.1093/gbe/evw179.
phykit aln_len_no_gaps <alignment> [--json]
Options:
<alignment>: first argument after function name should be an alignment file
--json: optional argument to print results as JSON
Alignment outlier taxa
Function names: alignment_outlier_taxa; outlier_taxa; aot
Command line interface: pk_alignment_outlier_taxa; pk_outlier_taxa; pk_aot
Identify potential outlier taxa in an alignment and explicitly report why each taxon was flagged.
The following features are evaluated per taxon:
1) gap_rate: fraction of gap/ambiguous symbols
2) occupancy: fraction of valid symbols
3) composition_distance: Euclidean distance from the median composition profile
4) long_branch_proxy: mean pairwise sequence distance to other taxa
5) rcvt: relative composition variability per taxon
6) entropy_burden: average site entropy over this taxon's valid positions
If a taxon exceeds one or more feature-specific thresholds, PhyKIT reports the exact feature(s), observed value(s), threshold(s), and explanation(s) for the flag.
phykit alignment_outlier_taxa <alignment> [--gap-z <float>] [--composition-z <float>] [--distance-z <float>] [--rcvt-z <float>] [--occupancy-z <float>] [--entropy-z <float>] [--json]
Example output:
features_evaluated gap_rate,occupancy,composition_distance,long_branch_proxy,rcvt,entropy_burden
thresholds gap_rate>0.0;occupancy<0.4;composition_distance>0.1;long_branch_proxy>0.4181;rcvt>0.0095;entropy_burden>0.35
taxon_d composition_distance=1.3454>0.1;long_branch_proxy=1.0>0.4181 Unusual sequence composition profile relative to other taxa. | High mean pairwise sequence distance to other taxa.
taxon_e gap_rate=0.6>0.0;occupancy=0.4<0.4 High fraction of gap/ambiguous symbols compared to other taxa. | Low fraction of valid symbols compared to other taxa.
Options:
<alignment>: first argument after function name should be an alignment file
--gap-z: z-threshold used for high-gap outlier detection (default: 3.0)
--composition-z: z-threshold used for composition-distance outlier detection (default: 3.0)
--distance-z: z-threshold used for long-branch-proxy outlier detection (default: 3.0)
--rcvt-z: z-threshold used for RCVT outlier detection (default: 3.0)
--occupancy-z: z-threshold used for low-occupancy outlier detection (default: 3.0)
--entropy-z: z-threshold used for entropy-burden outlier detection (default: 3.0)
--json: optional argument to print results as JSON
Alignment recoding
Function names: alignment_recoding; aln_recoding; recode
Command line interface: pk_alignment_recoding; pk_aln_recoding; pk_recode
Recode alignments using reduced character states.
Alignments can be recoded using established or custom recoding schemes. Recoding schemes are specified using the -c/--code argument. Custom recoding schemes can be used and should be formatted as a two column file wherein the first column is the recoded character and the second column is the character in the alignment.
phykit alignment_recoding <fasta> [-c/--code <code>] [--json]
Codes for which recoding scheme to use:
RY-nucleotide
R = purines (i.e., A and G)
Y = pyrimidines (i.e., T and C)
SandR-6
0 = A, P, S, and T
1 = D, E, N, and G
2 = Q, K, and R
3 = M, I, V, and L
4 = W and C
5 = F, Y, and H
KGB-6
0 = A, G, P, and S
1 = D, E, N, Q, H, K, R, and T
2 = M, I, and L
3 = W
4 = F and Y
5 = C and V
Dayhoff-6
0 = A, G, P, S, and T
1 = D, E, N, and Q
2 = H, K, and R
3 = I, L, M, and V
4 = F, W, and Y
5 = C
Dayhoff-9
0 = D, E, H, N, and Q
1 = I, L, M, and V
2 = F and Y
3 = A, S, and T
4 = K and R
5 = G
6 = P
7 = C
8 = W
Dayhoff-12
0 = D, E, and Q
1 = M, L, I, and V
2 = F and Y
3 = K, H, and R
4 = G
5 = A
6 = P
7 = S
8 = T
9 = N
A = W
B = C
Dayhoff-15
0 = D, E, and Q
1 = M and L
2 = I and V
3 = F and Y
4 = G
5 = A
6 = P
7 = S
8 = T
9 = N
A = K
B = H
C = R
D = W
E = C
Dayhoff-18
0 = F and Y
1 = M and L
2 = I
3 = V
4 = G
5 = A
6 = P
7 = S
8 = T
9 = D
A = E
B = Q
C = N
D = K
E = H
F = R
G = W
H = C
Options:
<alignment>: first argument after function name should be an alignment file
-c/--code: argument to specify the recoding scheme to use
--json: optional argument to print results as JSON
Column score
Function names: column_score; cs
Command line interface: pk_column_score; pk_cs
Calculates column score.
Column is an accuracy metric for a multiple alignment relative to a reference alignment. It is calculated by summing the correctly aligned columns over all columns in an alignment. Thus, values range from 0 to 1 and higher values indicate more accurate alignments.
Column score is calculated following Thompson et al., Nucleic Acids Research (1999), doi: 10.1093/nar/27.13.2682.
phykit column_score <alignment> --reference <reference_alignment> [--json]
Options:
<alignment>: first argument after function name should be a query
fasta alignment file to be scored for accuracy
-r/\-\-reference: reference alignment to compare the query alignment
to
--json: optional argument to print results as JSON
Composition per taxon
Function names: composition_per_taxon; comp_taxon; comp_tax
Command line interface: pk_composition_per_taxon; pk_comp_taxon; pk_comp_tax
Calculate sequence composition per taxon in an alignment.
Composition is reported as semicolon-separated symbol:frequency pairs for each taxon. Frequencies are calculated from valid (non-gap/non-ambiguous) characters. Symbol order is alphabetical.
phykit composition_per_taxon <alignment> [--json]
Example output:
1 A:0.4;C:0.0;G:0.2;T:0.4
2 A:0.5;C:0.0;G:0.25;T:0.25
Options:
<alignment>: first argument after function name should be an alignment file
--json: optional argument to print results as JSON
Compositional bias per site
Function names: compositional_bias_per_site; comp_bias_per_site; cbps
Command line interface: pk_compositional_bias_per_site; pk_comp_bias_per_site; pk_cbps
Calculates compositional bias per site in an alignment.
Site-wise chi-squared tests are conducted in an alignment to
detect compositional biases. PhyKIT outputs four columns:
col 1: index in alignment
col 2: chi-squared statistic (higher values indicate greater bias)
col 3: multi-test corrected p-value (Benjamini-Hochberg false discovery rate procedure)
col 4: uncorrected p-value
phykit comp_bias_per_site <alignment> [--plot] [--plot-output <path>] [--json]
Options:
<alignment>: first argument after function name should be a query
fasta alignment to calculate the site-wise compositional bias of
--plot: save a Manhattan-style plot of site-wise compositional bias
--plot-output: output path for plot (default: compositional_bias_per_site_plot.png)
--json: optional argument to print results as JSON
Create concatenation matrix
Function names: create_concatenation_matrix, create_concat, cc
Command line interface: pk_create_concatenation_matrix, pk_create_concat, pk_cc
Create a concatenated alignment file. This function is used to help in the construction of multi-locus data matrices.
PhyKIT will output three files: 1) A fasta file with '.fa' appended to the prefix specified with the -p/\-\-prefix parameter. 2) A partition file ready for input into RAxML or IQ-tree. 3) An occupancy file that summarizes the taxon occupancy per sequence.
phykit create_concat -a <file> -p <string> [--threshold <float>] [--plot-occupancy] [--plot-output <path>] [--json]
Options:
-a/\-\-alignment: alignment list file. File should contain a single column list of alignment
sequence files to concatenate into a single matrix. Provide path to files relative to
working directory or provide absolute path.
-p/\-\-prefix: prefix of output files
--threshold: minimum fraction of informative (non-gap, non-ambiguous) sites across the
concatenated alignment for a taxon to be included. Taxa whose effective occupancy falls
below this value are excluded from the output. Set to 0 to disable filtering
(default: 0).
--plot-occupancy: optional argument to output an occupancy map figure where
x-axis shows concatenated positions with gene boundaries and y-axis shows taxa
sorted by total occupancy. Colors denote represented characters, gap/ambiguous
characters in present genes, and fully absent gene blocks.
--plot-output: optional custom output path for occupancy map figure
(default: <prefix>.occupancy.png).
--json: optional argument to print summary metadata as JSON
Evolutionary Rate per Site
Function names: evolutionary_rate_per_site; evo_rate_per_site; erps
Command line interface: pk_evolutionary_rate_per_site; pk_evo_rate_per_site; pk_erps
Estimate evolutionary rate per site.
Evolutionary rate per site is one minus the sum of squared frequency of different characters at a given site. Values may range from 0 (slow evolving; no diversity at the given site) to 1 (fast evolving; all characters appear only once).
phykit evo_rate_per_site <alignment> [--plot] [--plot-output <path>] [--json]
Options:
<alignment>: first argument after function name should be a query
fasta alignment to calculate the site-wise evolutionary rate of
--plot: save a per-site evolutionary-rate plot
--plot-output: output path for plot (default: evolutionary_rate_per_site_plot.png)
--json: optional argument to print results as JSON
Faidx
Function names: faidx; get_entry; ge
Command line interface: pk_faidx; pk_get_entry; pk_ge
Extracts sequence entry from fasta file.
This function works similarly to the faidx function in samtools, but does not requiring an indexing step.
To obtain multiple entries, input multiple entries separated by a comma (,). For example, if you want entries named "seq_0" and "seq_1", the string "seq_0,seq_1" should be associated with the -e argument.
phykit faidx <fasta> -e/--entry <fasta entry> [--json]
Options:
<fasta>: first argument after function name should be a fasta file
-e/\-\-entry: entry name to be extracted from the inputted fasta file
--json: optional argument to print results as JSON
Guanine-cytosine (GC) content
Function names: gc_content; gc
Command line interface: pk_gc_content; pk_gc
Calculate GC content of a fasta file.
GC content is negatively correlated with phylogenetic signal.
If there are multiple entries, use the -v/\-\-verbose option to determine the GC content of each fasta entry separately. Association between GC content and phylogenetic signal was determined by Shen et al., Genome Biology and Evolution (2016), doi: 10.1093/gbe/evw179.
phykit gc_content <fasta> [-v/--verbose] [--json]
Options:
<fasta>: first argument after function name should be a fasta file
-v/\-\-verbose: optional argument to print the GC content of each fasta
entry
--json: optional argument to print results as JSON
Mask alignment
Function names: mask_alignment; mask_aln; mask
Command line interface: pk_mask_alignment; pk_mask_aln; pk_mask
Mask alignment sites based on threshold criteria.
Sites are retained when they pass all active thresholds: maximum gap fraction, minimum occupancy, and maximum site entropy.
phykit mask_alignment <alignment> [-g/--max_gap <float>] [-o/--min_occupancy <float>] [-e/--max_entropy <float>] [--json]
Options:
<alignment>: first argument after function name should be an alignment file
-g/\-\-max_gap: maximum allowed fraction of missing/invalid characters at a site (default: 1.0)
-o/\-\-min_occupancy: minimum required occupancy at a site (default: 0.0)
-e/\-\-max_entropy: maximum allowed site entropy (default: no filter)
--json: optional argument to print results as JSON
Occupancy per taxon
Function names: occupancy_per_taxon; occupancy_taxon; occ_tax
Command line interface: pk_occupancy_per_taxon; pk_occupancy_taxon; pk_occ_tax
Calculate occupancy per taxon in an alignment.
Occupancy is the fraction of valid (non-gap/non-ambiguous) characters for each taxon.
phykit occupancy_per_taxon <alignment> [--json]
Example output:
1 0.8333
2 0.6667
Options:
<alignment>: first argument after function name should be an alignment file
--json: optional argument to print results as JSON
Pairwise identity
Function names: pairwise_identity; pairwise_id, pi
Command line interface: pk_pairwise_identity; pk_pairwise_id, pk_pi
Calculate the average pairwise identity among sequences.
Pairwise identities can be used as proxies for the evolutionary rate of sequences.
Pairwise identity is defined as the number of identical columns (including gaps) between two aligned sequences divided by the number of columns in the alignment. Summary statistics are reported unless used with the verbose option in which all pairwise identities will be reported.
An example of pairwise identities being used as a proxy for evolutionary rate can be found here: Chen et al. Genome Biology and Evolution (2017), doi: 10.1093/gbe/evx147.
phykit pairwise_identity <alignment> [-v/--verbose] [-e/--exclude_gaps] [--plot] [--plot-output <file>] [--json]
Options:
<alignment>: first argument after function name should be an alignment file
-v/\-\-verbose: optional argument to print identity per pair|br|
-e/--exclude_gaps: if a site has a gap, ignore it
--plot: save a clustered pairwise-identity heatmap
--plot-output: output path for heatmap (default: pairwise_identity_heatmap.png)
--json: optional argument to print results as JSON
Parsimony informative sites
Function names: parsimony_informative_sites; pis
Command line interface: pk_parsimony_informative_sites; pk_pis
Calculate the number and percentage of parsimony informative sites in an alignment.
The number of parsimony informative sites in an alignment is associated with strong phylogenetic signal.
PhyKIT reports three tab delimited values: col1: number of parsimony informative sites col2: total number of sites col3: percentage of parsimony informative sites
Association between the number of parsimony informative sites and phylogenetic signal was determined by Shen et al., Genome Biology and Evolution (2016), doi: 10.1093/gbe/evw179 and Steenwyk et al., PLOS Biology (2020), doi: 10.1371/journal.pbio.3001007.
phykit parsimony_informative_sites <alignment> [--json]
Options:
<alignment>: first argument after function name should be an alignment file
--json: optional argument to print results as JSON
Plot alignment QC
Function names: plot_alignment_qc; plot_qc; paqc
Command line interface: pk_plot_alignment_qc; pk_plot_qc; pk_paqc
Generate a multi-panel alignment quality-control plot.
The figure includes:
1) occupancy per taxon
2) gap rate per taxon
3) composition distance vs long-branch proxy scatter
4) count of flagged outliers by feature
Outlier evaluation uses the same features as alignment_outlier_taxa:
gap_rate, occupancy, composition_distance, long_branch_proxy,
rcvt, and entropy_burden.
phykit plot_alignment_qc <alignment> [-o/--output <path>] [--width <float>] [--height <float>] [--dpi <int>] [--gap-z <float>] [--composition-z <float>] [--distance-z <float>] [--rcvt-z <float>] [--occupancy-z <float>] [--entropy-z <float>] [--json]
Options:
<alignment>: first argument after function name should be an alignment file
-o/\-\-output: output image path (default: alignment_qc.png)
--width: figure width in inches (default: 14.0)
--height: figure height in inches (default: 10.0)
--dpi: output image DPI (default: 300)
--gap-z: z-threshold for gap-rate outliers (default: 3.0)
--composition-z: z-threshold for composition-distance outliers (default: 3.0)
--distance-z: z-threshold for long-branch-proxy outliers (default: 3.0)
--rcvt-z: z-threshold for RCVT outliers (default: 3.0)
--occupancy-z: z-threshold for low-occupancy outliers (default: 3.0)
--entropy-z: z-threshold for entropy-burden outliers (default: 3.0)
--json: optional argument to print plot metadata and outlier summary as JSON
Protein-to-nucleotide alignment
Function names: thread_dna; pal2nal, p2n
Command line interface: pk_thread_dna; pk_pal2nal, pk_p2n
Thread DNA sequence onto a protein alignment to create a codon-based alignment.
This function requires input alignments are in fasta format. Codon alignments are then printed to stdout. Note, paired sequences are assumed to have the same name between the protein and nucleotide file. The order does not matter.
To thread nucleotide sequences over a trimmed amino acid alignment, provide PhyKIT with a log file specifying which sites have been trimmed and which have been kept. The log file must be formatted the same as the log files outputted by the alignment trimming toolkit ClipKIT (see -l in ClipKIT documentation.) Details about ClipKIT can be seen here: https://github.com/JLSteenwyk/ClipKIT.
If using a ClipKIT log file, the untrimmed protein alignment should be provided in the -p/--protein argument.
phykit thread_dna -p <file> -n <file> [-s] [--json]
Options:
-p/\-\-protein: protein alignment file
-n/\-\-nucleotide: nucleotide sequence file
-c/\-\-clipkit_log: clipkit outputted log file
-s/\-\-stop: boolean for whether or not stop codons should be kept.
If used, stop codons will be removed.
--json: optional argument to print results as JSON
Relative composition variability
Function names: relative_composition_variability; rel_comp_var; rcv
Command line interface: pk_relative_composition_variability; pk_rel_comp_var; pk_rcv
Calculate RCV (relative composition variability) for an alignment.
Lower RCV values are thought to be desirable because they represent a lower composition bias in an alignment. Statistically, RCV describes the average variability in sequence composition among taxa.
RCV is calculated following Phillips and Penny, Molecular Phylogenetics and Evolution (2003), doi: 10.1016/S1055-7903(03)00057-5.
RCV calculations are case-insensitive. Gap and ambiguous characters are excluded from composition counts and correction terms, and each taxon is normalized by its valid (non-excluded) sequence length.
phykit relative_composition_variability <alignment> [--json]
Options:
<alignment>: first argument after function name should be an alignment file
--json: optional argument to print results as JSON
Relative composition variability, taxon
Function names: relative_composition_variability_taxon; rel_comp_var_taxon; rcvt
Command line interface: pk_relative_composition_variability_taxon; pk_rel_comp_var_taxon; pk_rcvt
Calculate RCVT (relative composition variability, taxon) for an alignment.
RCVT is the relative composition variability metric for individual taxa. This facilitates identifying specific taxa that may have compositional biases. Lower RCVT values are more desirable because they indicate a lower composition bias for a given taxon in an alignment.
RCVT calculations are case-insensitive and exclude gap/ambiguous symbols from composition counts and normalization.
phykit relative_composition_variability_taxon <alignment> [--plot] [--plot-output <path>] [--json]
Options:
<alignment>: first argument after function name should be an alignment file
--plot: optional argument to generate an RCVT per-taxon barplot
--plot-output: output path for the RCVT plot (default: rcvt_plot.png)
--json: optional argument to print results as JSON
Rename FASTA entries
Function names: rename_fasta_entries; rename_fasta
Command line interface: pk_rename_fasta_entries; pk_rename_fasta
Renames fasta entries.
Renaming fasta entries will follow the scheme of a tab-delimited file wherein the first column is the current fasta entry name and the second column is the new fasta entry name in the resulting output alignment. Note, the input fasta file does not need to be an alignment file.
phykit rename_fasta_entries <fasta> -i/--idmap <idmap> [-o/--output <output_file>] [--json]
Options:
<fasta>: first argument after function name should be a FASTA file
-i/\-\-idmap: identifier map of current FASTA names (col1) and desired FASTA names (col2)
--json: optional argument to print results as JSON
Sum-of-pairs score
Function names: sum_of_pairs_score; sops; sop
Command line interface: pk_sum_of_pairs_score; pk_sops; pk_sop
Calculates sum-of-pairs score.
Sum-of-pairs is an accuracy metric for a multiple alignment relative to a reference alignment. It is calculated by summing the correctly aligned residue pairs over all pairs of sequences. Thus, values range from 0 to 1 and higher values indicate more accurate alignments.
Column score is calculated following Thompson et al., Nucleic Acids Research (1999), doi: 10.1093/nar/27.13.2682.
phykit sum_of_pairs_score <alignment> --reference <reference_alignment> [--json]
Options:
<alignment>: first argument after function name should be a query
fasta alignment file to be scored for accuracy
-r/\-\-reference: reference alignment to compare the query alignment
to
--json: optional argument to print results as JSON
Variable sites
Function names: variable_sites; vs
Command line interface: pk_variable_sites; pk_vs
Calculate the number of variable sites in an alignment.
The number of variable sites in an alignment is associated with strong phylogenetic signal. PhyKIT reports three tab delimited values: col1: number of variable sites col2: total number of sites col3: percentage of variable sites
Association between the number of variable sites and phylogenetic signal was determined by Shen et al., Genome Biology and Evolution (2016), doi: 10.1093/gbe/evw179.
phykit variable_sites <alignment> [--json]
Options:
<alignment>: first argument after function name should be an alignment file
--json: optional argument to print results as JSON
Tree-based functions
Ancestral state reconstruction
Function names: ancestral_state_reconstruction; asr; anc_recon
Command line interface: pk_ancestral_state_reconstruction; pk_asr; pk_anc_recon
Estimate ancestral states for continuous traits using maximum likelihood,
analogous to R's phytools::fastAnc() and ape::ace(type="ML").
Optionally produce a contMap plot showing continuous trait values mapped
onto the phylogeny.
Two methods are available:
fast (default): Felsenstein's pruning/contrasts shortcut, O(n) time
ml: full VCV-based ML with exact conditional CIs, O(n^3)
Both methods produce identical point estimates; ml gives exact
conditional confidence intervals.
Input trait data can be either a two-column file (taxon<tab>value)
when -c is omitted, or a multi-trait file with header row when -c
specifies which column to use.
phykit ancestral_state_reconstruction -t <tree> -d <trait_data> [-c <trait>] [-m <method>] [--ci] [--plot <output>] [--json]
Options:
-t/\-\-tree: a phylogenetic tree file
-d/\-\-trait_data: trait data file (two-column or multi-trait with header)
-c/\-\-trait: trait column name (required for multi-trait files)
-m/\-\-method: method to use: fast or ml (default: fast)
--ci: include 95% confidence intervals
--plot: output path for contMap plot (requires matplotlib)
--json: output results as JSON
Example contMap plot generated with the --plot option. Branches are colored
by interpolated ancestral trait values:
Concordance-aware ancestral state reconstruction
Function names: concordance_asr; conc_asr; casr
Command line interface: pk_concordance_asr; pk_conc_asr; pk_casr
Concordance-aware ancestral state reconstruction that incorporates gene tree discordance into ancestral estimates. Standard ASR operates on a single species tree and ignores gene tree conflict. This command propagates topological uncertainty from gene tree discordance into ancestral state estimates using gene concordance factors (gCF).
Two strategies are available:
weighted (default): For each internal node, compute gCF (fraction of gene trees supporting the species-tree bipartition) and gDF1, gDF2 (fractions for NNI alternatives). Run ASR on the species tree and NNI alternative trees, then combine estimates weighted by concordance. Uses the law of total variance to separate topological vs parameter uncertainty.
distribution: Run ASR independently on each gene tree, map nodes across trees by descendant-set identity, and report concordance-weighted means with percentile confidence intervals (2.5th--97.5th).
phykit concordance_asr -t <species_tree> -g <gene_trees> -d <trait_data>
[-c <trait>] [-m weighted|distribution] [--ci]
[--plot <output>] [--missing-taxa error|shared] [--json]
Options:
-t/\-\-tree: species tree file
-g/\-\-gene-trees: file with gene trees (multi-Newick, one per line)
-d/\-\-trait_data: trait data file (two-column or multi-trait with header)
-c/\-\-trait: trait column name (required for multi-trait files)
-m/\-\-method: method to use: weighted or distribution (default: weighted)
--ci: include 95% confidence intervals
--plot: output path for concordance ASR plot
--missing-taxa: how to handle taxa mismatches: shared (default, prune to intersection) or error (reject)
--json: output results as JSON
Example output:
Concordance-Aware Ancestral State Reconstruction
Method: weighted
Number of tips: 8
Number of gene trees: 10
Sigma-squared (BM rate): 0.043893
Ancestral estimates:
Node Desc Estimate gCF 95% CI Var_topo Var_param
N1 (root) 8 1.6447 1.000 [0.8937, 2.3957] 0.000000 0.146822
N2 2 1.6881 0.700 [0.9529, 2.4234] 0.000569 0.140151
N3 5 1.4878 0.857 [0.6727, 2.3028] 0.005878 0.167045
N4 2 1.7682 0.900 [0.8987, 2.6378] 0.015002 0.181806
N5 3 1.2674 0.900 [0.3663, 2.1684] 0.001044 0.210295
N6 2 0.9895 1.000 [-0.5654, 2.5443] 0.000000 0.629294
Example plot generated with the --plot option. Internal nodes are sized
and colored by gene concordance factor (gCF):
Bipartition support statistics
Function names: bipartition_support_stats; bss
Command line interface: pk_bipartition_support_stats; pk_bss
Calculate summary statistics for bipartition support.
High bipartition support values are thought to be desirable because they are indicative of greater certainty in tree topology.
To obtain all bipartition support values, use the -v/\-\-verbose option. In addition to support values for each node, the names of all terminal branches tips are also included. Each terminal branch name is separated with a semi-colon (;).
phykit bipartition_support_stats <tree> [-v/--verbose]
[--thresholds <comma-separated-floats>] [--json]
Options:
<tree>: first argument after function name should be a tree file
-v/\-\-verbose: optional argument to print all bipartition support values
--thresholds: optional comma-separated support cutoffs; prints count and
fraction of bipartitions below each cutoff
--json: optional argument to print results as JSON
Example JSON output (summary mode):
phykit bipartition_support_stats test.tre --thresholds 70,90 --json
{"summary": {"maximum": 100, "mean": 95.71428571428571, "median": 100, "minimum": 85, "seventy_fifth": 100.0, "standard_deviation": 7.319250547113999, "twenty_fifth": 92.5, "variance": 53.57142857142857}, "thresholds": [{"count_below": 0, "fraction_below": 0.0, "threshold": 70.0}, {"count_below": 2, "fraction_below": 0.2857142857142857, "threshold": 90.0}], "verbose": false}
Example JSON output (verbose mode):
phykit bipartition_support_stats test.tre -v --json
{"bipartitions": [{"support": 85, "terminals": ["taxon_a", "taxon_b"]}, {"support": 100, "terminals": ["taxon_c", "taxon_d"]}], "thresholds": [], "verbose": true}
Branch length multiplier
Function names: branch_length_multiplier; blm
Command line interface: pk_branch_length_multiplier; pk_blm
Multiply branch lengths in a phylogeny by a given factor.
This can help modify reference trees when conducting simulations or other analyses.
phykit branch_length_multiplier <tree> -f n [-o/--output <output_file>] [--json]
Options:
<tree>: first argument after function name should be a tree file
-f/\-\-factor: factor to multiply branch lengths by
-o/\-\-output: optional argument to name the outputted tree file. Default
output will have the same name as the input file but with the suffix ".factor_(n).tre"
--json: optional argument to print results as JSON
Collapse bipartitions
Function names: collapse_branches, collapse, cb
Command line interface: pk_collapse_branches, pk_collapse, pk_cb
Collapse branches on a phylogeny according to bipartition support.
Bipartitions will be collapsed if they are less than the user specified value.
phykit collapse_branches <tree> -s/--support n [-o/--output <output_file>] [--json]
Options:
<tree>: first argument after function name should be a tree file
-s/\-\-support: bipartitions with support less than this value will be
collapsed
-o/\-\-output: optional argument to name the outputted tree file. Default
output will have the same name as the input file but with the suffix
".collapsed_(support).tre"
--json: optional argument to print results as JSON
Consensus network
Function names: consensus_network; consnet; splitnet; splits_network
Command line interface: pk_consensus_network; pk_consnet; pk_splitnet; pk_splits_network
Extract bipartition splits from a collection of gene trees and summarize conflicting phylogenetic signal. Counts how frequently each non-trivial bipartition appears across input trees and filters by a minimum frequency threshold. Optionally draws a circular splits network diagram.
Input can be either: 1) a file with one Newick tree per line, or 2) a file with one tree-file path per line.
phykit consensus_network -t/--trees <trees> [--threshold 0.1] [--missing-taxa error|shared] [--plot-output <file>] [--json]
Options:
-t/\-\-trees: file containing trees (one Newick per line) or tree-file paths (one per line)
--threshold: minimum split frequency to include, between 0 and 1 (default: 0.1)
--missing-taxa: handling strategy for mismatched taxa (error or shared; default: error)
--plot-output: output filename for the circular splits network plot (optional)
--json: optional argument to print results as JSON
When --plot-output is specified, a circular splits network diagram is produced.
Taxa are arranged at equal angles around a circle. Each split is drawn as a chord
connecting the boundary points between the two sides of the bipartition. Chord
thickness and opacity scale with split frequency — thicker, darker lines indicate
splits supported by more gene trees.
Consensus tree
Function names: consensus_tree; consensus; ctree
Command line interface: pk_consensus_tree; pk_consensus; pk_ctree
Infer a consensus tree from a collection of trees.
Input can be either: 1) a file with one Newick tree per line, or 2) a file with one tree-file path per line.
Consensus methods:
* majority: majority-rule consensus (default)
* strict: strict consensus
Missing taxa handling:
* --missing-taxa error (default): exits if trees do not share identical tip sets
* --missing-taxa shared: prunes all trees to the intersection of taxa before inferring consensus
phykit consensus_tree -t/--trees <trees> [-m/--method strict|majority] [--missing-taxa error|shared] [--json]
Options:
-t/\-\-trees: file containing trees (one Newick per line) or tree-file paths (one per line)
-m/\-\-method: consensus method (strict or majority; default: majority)
--missing-taxa: handling strategy for mismatched taxa (error or shared; default: error)
--json: optional argument to print results as JSON
Continuous trait evolution model comparison (fitContinuous)
Function names: fit_continuous; fitcontinuous; fc
Command line interface: pk_fit_continuous; pk_fitcontinuous; pk_fc
Compare models of continuous trait evolution on a phylogeny, analogous to
R's geiger::fitContinuous(). Fits up to 7 models and ranks them by
AIC, BIC, and AIC weights.
Models:
BM -- Brownian motion (baseline, 2 params)
OU -- Ornstein-Uhlenbeck / stabilizing selection (3 params)
EB -- Early Burst (Harmon et al. 2010) (3 params)
Lambda -- Pagel's lambda / phylogenetic signal (3 params)
Delta -- Pagel's delta / tempo of evolution (3 params)
Kappa -- Pagel's kappa / punctuational vs gradual (3 params)
White -- White noise / no phylogenetic signal (2 params)
Each model reports R² = 1 - (σ²_model / σ²_White), measuring how much variance is explained relative to the white noise baseline. White serves as the reference (R² = 0).
phykit fit_continuous -t <tree> -d <trait_data> [--models BM,OU,Lambda] [-g <gene_trees>] [--json]
Options:
-t/\-\-tree: a tree file in Newick format
-d/\-\-trait_data: tab-delimited trait file (taxon<tab>value)
--models: comma-separated list of models to fit (default: all 7)
-g/\-\-gene-trees: optional multi-Newick file of gene trees; when provided, uses a discordance-aware VCV (genome-wide average) instead of the species-tree VCV
--json: optional argument to print results as JSON
Example output:
Model Comparison (fitContinuous)
Number of tips: 8
Model Param Value Sigma2 z0 LL AIC dAIC AICw BIC dBIC
BM - - 0.0384 1.6447 -11.570 27.14 0.00 0.453 27.83 0.00
OU alpha 0.0012 0.0385 1.6420 -11.568 29.14 2.00 0.167 30.18 2.35
...
Best model (AIC): BM
Best model (BIC): BM
Continuous trait mapping (contMap)
Function names: cont_map; contmap; cmap
Command line interface: pk_cont_map; pk_contmap; pk_cmap
Plot a phylogram with branches colored by continuous trait values
(analogous to R's phytools::contMap()). Ancestral states are
estimated via maximum-likelihood (two-pass Felsenstein algorithm)
and mapped onto branches using a color gradient (coolwarm colormap).
phykit cont_map -t <tree> -d <trait_data> -o <output.png> [--json]
Options:
-t/\-\-tree: a tree file in Newick format
-d/\-\-trait_data: tab-delimited trait file (taxon<tab>value)
-o/\-\-output: output plot file path (required)
--json: optional argument to print results as JSON
Cophylogenetic plot (tanglegram)
Function names: cophylo; tanglegram; tangle
Command line interface: pk_cophylo; pk_tanglegram; pk_tangle
Plot a cophylogenetic tanglegram of two phylogenies (analogous to R's
phytools::cophylo()). Draws two trees facing each other with
connecting lines between matching taxa. By default, taxa are matched
by identical tip names. Internal nodes of tree2 are rotated to minimize
line crossings.
phykit cophylo -t <tree1> -t2 <tree2> -o <output.png> [-m <mapping>] [--json]
Options:
-t/\-\-tree1: first tree file in Newick format
-t2/\-\-tree2: second tree file in Newick format
-o/\-\-output: output plot file path (required)
-m/\-\-mapping: optional tab-delimited mapping file (taxon1<tab>taxon2)
--json: optional argument to print results as JSON
Covarying evolutionary rates
Function names: covarying_evolutionary_rates; cover
Command line interface: pk_covarying_evolutionary_rates; pk_cover
Determine if two genes have a signature of covariation with one another. Genes that have covarying evolutionary histories tend to have similar functions and expression levels.
Input two phylogenies and calculate the correlation among relative evolutionary rates between the two phylogenies. The two input trees do not have to have the same taxa. This function will first prune both trees to have the same tips. To transform branch lengths into relative rates, PhyKIT uses the putative species tree's branch lengths, which is inputted by the user. As recommended by the original method developers, outlier branche lengths are removed. Outlier branches have a relative evolutionary rate greater than five.
PhyKIT reports two tab delimited values: col1: correlation coefficient col2: p-value
Method is empirically evaluated by Clark et al., Genome Research (2012), doi: 10.1101/gr.132647.111. Normalization method using a species tree follows Sato et al., Bioinformatics (2005), doi: 10.1093/bioinformatics/bti564.
phykit covarying_evolutionary_rates <tree_file_zero> <tree_file_one> -r/--reference <reference_tree_file> [-v/--verbose] [--plot] [--plot-output <path>] [--json]
Options:
<tree_file_zero>: first argument after function name should be an alignment file
<tree_file_one>: first argument after function name should be an alignment file
-r/\-\-reference: a tree to correct branch lengths by in the two input trees. Typically,
this is a putative species tree.
-v/\-\-verbose: print out corrected branch lengths shared between tree 0 and tree 1
--plot: save a covarying-rates scatter plot
--plot-output: output path for plot (default: covarying_rates_plot.png)
--json: optional argument to print results as JSON
Degree of violation of the molecular clock
Function names: degree_of_violation_of_a_molecular_clock, dvmc
Command line interface: pk_degree_of_violation_of_a_molecular_clock, pk_dvmc
Calculate degree of violation of a molecular clock (or DVMC) in a phylogeny.
Lower DVMC values are thought to be desirable because they are indicative of a lower degree of violation in the molecular clock assumption.
Typically, outgroup taxa are not included in molecular clock analysis. Thus, prior to calculating DVMC from a single gene tree, users may want to prune outgroup taxa from the phylogeny. To prune tips from a phylogeny, see the prune_tree function.
Calculate DVMC in a tree following Liu et al., PNAS (2017), doi: 10.1073/pnas.1616744114.
phykit degree_of_violation_of_a_molecular_clock <tree> [--json]
Options:
<tree>: input file tree name
--json: optional argument to print results as JSON
Density map
Function names: density_map; densitymap; dmap
Command line interface: pk_density_map; pk_densitymap; pk_dmap
Plot a phylogram with branches colored by posterior probabilities of
discrete character states from stochastic character mapping (analogous
to R's phytools::densityMap()). Runs N simulations of stochastic
character mapping internally and, for each point along each branch,
computes the fraction of simulations in each state.
phykit density_map -t <tree> -d <trait_data> -c <trait_column> -o <output.png> [-n <nsim>] [--seed <seed>] [--json]
Options:
-t/\-\-tree: a tree file in Newick format
-d/\-\-trait_data: tab-delimited trait file (taxon<tab>state)
-c/\-\-trait: column name of the trait to map
-o/\-\-output: output plot file path (required)
-n/\-\-nsim: number of stochastic mapping simulations (default: 100)
--seed: random seed for reproducibility
--json: optional argument to print results as JSON
Evolutionary tempo mapping
Function names: evo_tempo_map; etm
Command line interface: pk_evo_tempo_map; pk_etm
Detect rate-topology associations by comparing branch length distributions between concordant and discordant gene trees at each species tree branch.
Under the multispecies coalescent, discordant gene trees should have shorter internal branches near the discordant node (because the coalescence happened deeper, in the ancestral population). Deviations from this expectation suggest substitution rate heterogeneity correlated with topology, which could indicate adaptive evolution, different selective pressures in hybridizing lineages, or systematic error from model misspecification.
For each internal branch of the species tree, gene trees are classified as concordant or discordant via bipartition matching (same as gCF). The homologous branch length is extracted from each gene tree and the two groups are compared using a Mann-Whitney U test and a permutation test (1000 permutations). P-values are corrected for multiple testing using Benjamini-Hochberg FDR.
A global treeness (internal/total branch length ratio) comparison between concordant and discordant gene trees is also reported.
phykit evo_tempo_map -t <species_tree> -g <gene_trees> [--plot <output>] [-v] [--json]
Options:
-t/\-\-tree: a species tree file
-g/\-\-gene-trees: multi-Newick file of gene trees with branch lengths
--plot: optional output path for box/strip plot (PNG)
-v/\-\-verbose: print per-gene-tree classification details
--json: optional argument to print results as JSON
Example output:
branch n_conc n_disc med_conc med_disc U_pval perm_pval fdr_p
----------------------------------------------------------------------------------------------------------
bear,dog,raccoon 6 1 3.875000 3.600000 NA NA NA
bear,raccoon 7 3 0.880000 0.700000 0.516667 0.077000 0.516667
cat,monkey 10 0 20.450000 NA NA NA NA
cat,monkey,weasel 9 1 2.120000 2.800000 NA NA NA
sea_lion,seal 9 1 7.500000 7.200000 NA NA NA
---
Global treeness: concordant=0.126489 (n=6), discordant=0.119014 (n=4)
Branches tested: 1, significant (FDR<0.05): 0
Each row corresponds to an internal branch of the species tree identified by the
smaller partition of taxa. The n_conc and n_disc columns show how many gene
trees are concordant or discordant at that branch. The med_conc and med_disc
columns show the median branch length (in substitutions/site) for each group. The
U_pval is the two-sided Mann-Whitney U test p-value, perm_pval is the
permutation test p-value (1000 permutations), and fdr_p is the
Benjamini-Hochberg corrected p-value. Branches with fewer than 2 gene trees
in either group show NA for p-values.
The global treeness comparison tests whether concordant gene trees have systematically different ratios of internal to total branch lengths.
To generate a visualization:
phykit evo_tempo_map -t <species_tree> -g <gene_trees> --plot tempo_map.png
The plot shows grouped box plots with jittered data points for each species tree branch, comparing branch lengths between concordant (blue) and discordant (orange) gene trees. Branches where the FDR-corrected p-value is below 0.05 are marked with an asterisk.
Discordance asymmetry
Function names: discordance_asymmetry; disc_asym; da
Command line interface: pk_discordance_asymmetry; pk_disc_asym; pk_da
Test whether the two discordant NNI alternative topologies at each species tree branch are equally frequent. Under incomplete lineage sorting (ILS) alone, the two minor NNI alternatives (gDF1 and gDF2) should appear at equal frequency. When they are significantly asymmetric, it suggests introgression or gene flow between specific lineages.
For each internal branch of the species tree, a two-sided binomial test (H0: P(alt1) = 0.5) is applied, and p-values are corrected for multiple testing using Benjamini-Hochberg FDR.
phykit discordance_asymmetry -t <species_tree> -g <gene_trees> [--plot <output>] [-v] [--json]
Options:
-t/\-\-tree: a species tree file
-g/\-\-gene-trees: multi-Newick file of gene trees (branch lengths not required)
--plot: optional output path for asymmetry phylogram (PNG)
-v/\-\-verbose: print per-branch details
--json: optional argument to print results as JSON
Example output:
branch n_conc n_alt1 n_alt2 asym_ratio binom_p fdr_p gene_flow
------------------------------------------------------------------------------------------------------
bear,dog,raccoon 6 0 1 1.000 1.0000 1.0000 -
bear,raccoon 7 1 2 0.667 1.0000 1.0000 -
cat,monkey 10 0 0 NA NA NA -
cat,monkey,weasel 9 1 0 1.000 1.0000 1.0000 -
sea_lion,seal 9 1 0 1.000 1.0000 1.0000 -
---
Summary: 4 branches tested, 0 significant (FDR<0.05)
Each row corresponds to an internal branch of the species tree identified by the
smaller partition of taxa. The n_conc column shows concordant gene trees, while
n_alt1 and n_alt2 show the counts for the two NNI alternative topologies.
The asym_ratio is max(n_alt1, n_alt2) / (n_alt1 + n_alt2), ranging from 0.5
(perfectly symmetric) to 1.0 (maximally asymmetric). NA indicates no discordant
gene trees were observed. The binom_p is the two-sided binomial test p-value,
fdr_p is the Benjamini-Hochberg corrected p-value, and gene_flow shows
which NNI alternative is favored when the result is significant (FDR < 0.05).
Interpretation:
Symmetric discordance (asym_ratio near 0.5, not significant): Consistent with ILS alone — both NNI alternatives arise with equal frequency from random coalescent sorting.
Asymmetric discordance (asym_ratio near 1.0, significant): Suggests gene flow or introgression. The favored NNI alternative indicates which lineages are exchanging genetic material. For alt1, gene flow is between the C1 lineage and the S (sibling) lineage; for alt2, gene flow is between C2 and S.
To generate a visualization:
phykit discordance_asymmetry -t <species_tree> -g <gene_trees> --plot asymmetry.png
The plot shows a phylogram with internal branches colored by asymmetry ratio using a diverging colormap (blue = symmetric/0.5, red = highly asymmetric/1.0). Significant branches (FDR < 0.05) are marked with a red star. Internal nodes are annotated with gCF values.
Evolutionary rate
Function names: evolutionary_rate, evo_rate
Command line interface: pk_evolutionary_rate, pk_evo_rate
Calculate a tree-based estimation of the evolutionary rate of a gene.
Evolutionary rate is the total tree length divided by the number of terminals.
Calculate evolutionary rate following Telford et al., Proceedings of the Royal Society B (2014).
phykit evolutionary_rate <tree> [--json]
Options:
<tree>: input file tree name
--json: optional argument to print results as JSON
Internal branch statistics
Function names: internal_branch_stats; ibs
Command line interface: pk_internal_branch_stats; pk_ibs
Calculate summary statistics for internal branch lengths in a phylogeny.
Internal branch lengths can be useful for phylogeny diagnostics.
To obtain all internal branch lengths, use the -v/\-\-verbose option.
phykit internal_branch_stats <tree> [-v/--verbose] [--json]
Options:
<tree>: first argument after function name should be a tree file
-v/\-\-verbose: optional argument to print all internal branch lengths
--json: optional argument to print results as JSON
Internode labeler
Function names: internode_labeler; il
Command line interface: pk_internode_labeler; pk_il
Appends numerical identifiers to bipartitions in place of support values. This is helpful for pointing to specific internodes in supplementary files or otherwise.
phykit internode_labeler <tree> [-o/--output <file>] [--json]
Options:
<tree>: first argument after function name should be a tree file
-o/\-\-output: optional argument to name the outputted tree file
--json: optional argument to print results as JSON
Last common ancestor subtree
Function names: last_common_ancestor_subtree; lca_subtree
Command line interface: pk_last_common_ancestor_subtree; pk_lca_subtree
Obtains subtree from a phylogeny by getting the last common ancestor from a list of taxa.
phykit last_common_ancestor_subtree <file> <list_of_taxa> [-o/--output <file>] [--json]
Options:
<tree>: first argument after function name should be a tree file
<list_of_taxa>: second argument after function name should be a single column
file with the list of taxa to get the last common ancestor subtree for
-o/\-\-output: optional argument to name the outputted tree file
--json: optional argument to print results as JSON
Lineage-through-time plot and gamma statistic
Function names: ltt; gamma_stat; gamma
Command line interface: pk_ltt; pk_gamma_stat; pk_gamma
Compute the Pybus & Harvey (2000) gamma statistic and generate lineage-through-time (LTT) plots to test for temporal variation in diversification rates.
Under a constant-rate pure-birth (Yule) process, the gamma statistic follows a standard normal distribution: gamma ~ N(0, 1). Negative values indicate early diversification (decelerating rates, consistent with an adaptive radiation followed by niche filling), while positive values indicate late diversification (accelerating rates, potentially reflecting recent ecological opportunity or mass extinction recovery).
phykit ltt -t <tree> [-v/--verbose] [--plot-output <file>] [--json]
Options:
-t/\-\-tree: a rooted phylogeny file with branch lengths (required)
-v/\-\-verbose: print branching times and full LTT data points
--plot-output: output filename for the LTT plot (PNG, PDF, SVG)
--json: output results as JSON
Default output: tab-separated gamma statistic and two-tailed p-value.
# Basic gamma statistic
phykit ltt -t species.tre
# -1.4142 0.1573
# With LTT plot
phykit ltt -t species.tre --plot-output ltt_plot.png
# Full JSON output
phykit ltt -t species.tre --json
Tutorial: testing diversification tempo in a clade
Suppose you have a dated phylogeny of 50 gecko species and want to test whether speciation rates were constant or whether diversification decelerated over time (consistent with ecological limits).
Step 1: Compute the gamma statistic.
phykit ltt -t gecko_dated.tre --plot-output gecko_ltt.png
# -2.8514 0.0043
A significantly negative gamma (p = 0.004) rejects the constant-rate null, indicating that lineage accumulation was concentrated early in the clade's history — consistent with an early burst of diversification followed by a slowdown as niches filled.
Step 2: Examine the LTT plot.
The --plot-output option generates a step-function plot of lineage
count (log scale) vs. time from root. Under constant-rate birth,
lineages accumulate exponentially (straight line on log-scale). An
early burst appears as a curve that rises steeply then flattens.
Example: early burst (decelerating diversification)
Most branching events occur near the root, then the curve plateaus. The significantly negative gamma (p < 0.001) rejects constant-rate birth.
Example: recent burst (accelerating diversification)
Lineage accumulation is concentrated near the present. The significantly positive gamma (p = 0.022) indicates late diversification.
Step 3: Verbose output for downstream analysis.
phykit ltt -t gecko_dated.tre -v
Verbose mode prints branching times (node ages) and the full LTT data table (time_from_root, n_lineages), which can be piped to custom plotting or further analysis.
Validation against R's ape::gammaStat()
PhyKIT's gamma statistic was validated against R's ape package (v5.8.1, R 4.4.0). Results match to 10 decimal places:
Tree topology |
PhyKIT gamma |
R ape gamma |
|---|---|---|
Balanced 8-tip |
-1.4142135624 |
-1.4142135624 |
Ladder (caterpillar) 5-tip |
-0.7142857143 |
-0.7142857143 |
Recent burst 10-tip |
2.2824790785 |
2.2824790785 |
Early burst 7-tip |
-3.5362021857 |
-3.5362021857 |
The algorithm replicates the exact formula from ape's gammaStat.R
source, including the rev() step on internode intervals.
Long branch score
Function names: long_branch_score; lb_score; lbs
Command line interface: pk_long_branch_score; pk_lb_score; pk_lbs
Calculate long branch (LB) scores in a phylogeny.
Lower LB scores are thought to be desirable because they are indicative of taxa or trees that likely do not have issues with long branch attraction.
LB score is the mean pairwise patristic distance of taxon i compared to all other taxa over the average pairwise patristic distance.
PhyKIT reports summary statistics. To obtain LB scores for each taxa, use the -v/--verbose option.
LB scores are calculated following Struck, Evolutionary Bioinformatics (2014), doi: 10.4137/EBO.S14239.
phykit long_branch_score <tree> [-v/--verbose] [--json]
Options:
<tree>: first argument after function name should be a tree file
-v/\-\-verbose: optional argument to print all LB score values
--json: optional argument to print results as JSON
Monophyly check
Function names: monophyly_check; is_monophyletic
Command line interface: pk_monophyly_check; pk_is_monophyletic
This analysis can be used to determine if a set of taxa are exclusively monophyletic. By exclusively monophyletic, if other taxa are in the same clade, the lineage will not be considered exclusively monophyletic.
Requires a taxa file, which species which tip names are expected to be monophyletic. File format is a single column file with tip names. Tip names not present in the tree will not be considered when examining monophyly.
The output will have six columns. col 1: if the clade was or wasn't monophyletic col 2: average bipartition support value in the clade of interest col 3: maximum bipartition support value in the clade of interest col 4: minimum bipartition support value in the clade of interest col 5: standard deviation of bipartition support values in the clade of interest col 6: tip names of taxa monophyletic with the lineage of interest excluding those that are listed in the taxa_of_interest file
phykit monophyly_check <tree> <list_of_taxa> [--json]
Options:
<tree>: first argument after function name should be a tree file
<list_of_taxa>: single column file with list of tip names to
examine the monophyly of
--json: optional argument to print results as JSON
Multi-regime OU models (OUwie)
Function names: ouwie; fit_ouwie; multi_regime_ou
Command line interface: pk_ouwie; pk_fit_ouwie; pk_multi_regime_ou
Fit multi-regime Ornstein-Uhlenbeck models of continuous trait evolution,
analogous to R's OUwie package (Beaulieu et al. 2012). Fits up to 7
models and ranks them by AICc, BIC, and AICc weights. Regime assignments
to internal branches are inferred via Fitch parsimony.
Models:
BM1 -- single-rate Brownian motion (2 params)
BMS -- multi-rate Brownian motion with per-regime sigma2 (R+1 params)
OU1 -- single-regime Ornstein-Uhlenbeck (3 params)
OUM -- multi-regime OU with per-regime trait optima (R+2 params)
OUMV -- OUM + per-regime sigma2 (2R+1 params)
OUMA -- OUM + per-regime alpha (2R+1 params)
OUMVA -- all parameters regime-specific (3R params)
Each model reports R² = 1 - (σ²_model / σ²_BM1), measuring improvement over the simplest Brownian motion baseline. For multi-regime models with per-regime σ² values, the average is used.
phykit ouwie -t <tree> -d <trait_data> -r <regime_data> [--models BM1,OUM,OUMVA] [--json]
Options:
-t/\-\-tree: a tree file in Newick format
-d/\-\-trait_data: tab-delimited trait file (taxon<tab>value)
-r/\-\-regime_data: tab-delimited regime file (taxon<tab>regime_label)
--models: comma-separated list of models to fit (default: all 7)
--json: optional argument to print results as JSON
The trait data file is a two-column tab-delimited file mapping taxon names to continuous trait values:
dog 1.1
bear 1.9
raccoon 1.5
seal 1.8
sea_lion 1.8
cat 0.5
weasel 1.7
monkey 0.3
The regime data file is a two-column tab-delimited file mapping taxon names to discrete regime labels (e.g., habitat, diet category):
dog terrestrial
bear terrestrial
raccoon terrestrial
seal aquatic
sea_lion aquatic
cat terrestrial
weasel terrestrial
monkey terrestrial
Example output:
OUwie Model Comparison
======================
Regimes: aquatic, terrestrial
Model logLik AICc BIC k AICc_w Params
----- ---------- ------- ------- ---- ------ ------
OUMVA -6.9859 27.9717 29.5459 6 0.0040 alpha={aquatic:0.38, terrestrial:0.38}, sigma2={aquatic:0.01, terrestrial:0.05}, theta={aquatic:1.80, terrestrial:1.24}
OUMA -6.9859 27.9717 29.0119 5 0.0040 alpha={aquatic:0.38, terrestrial:0.38}, sigma2=0.0384, theta={aquatic:1.80, terrestrial:1.24}
OUMV -6.9859 27.9717 29.0119 5 0.0040 alpha=0.3849, sigma2={aquatic:0.01, terrestrial:0.05}, theta={aquatic:1.80, terrestrial:1.24}
OUM -8.6297 25.2594 26.3276 4 0.0488 alpha=0.0706, sigma2=0.0329, theta={aquatic:1.80, terrestrial:1.33}
OU1 -10.2890 27.2447 28.0459 3 0.0063 alpha=0.0398, sigma2=0.0363, theta=1.64
BMS -11.2046 29.0759 29.8771 3 0.0024 sigma2={aquatic:0.01, terrestrial:0.05}, z0=1.64
BM1 -11.5697 27.1393 27.6735 2 0.0073 sigma2=0.0384, z0=1.64
Best model (AICc): OUM
Best model (BIC): OUM
Nearest neighbor interchange
Function names: nearest_neighbor_interchange; nni
Command line interface: pk_nearest_neighbor_interchange; pk_nni
Generate all nearest neighbor interchange moves for a binary rooted tree.
By default, the output file will have the same name as the input file but with the suffix ".nnis"
The output file will also include the original phylogeny.
phykit nearest_neighbor_interchange <tree> [-o/--output <output_file>] [--json]
Options:
<tree>: first argument after function name should be a tree file
-o/\-\-output: optional argument to specify output file name
--json: optional argument to print summary metadata as JSON
Network signal
Function names: network_signal; netsig; net_signal
Command line interface: pk_network_signal; pk_netsig; pk_net_signal
Compute phylogenetic signal (Blomberg's K and/or Pagel's lambda) on a phylogenetic network rather than a tree. This accounts for hybridization and introgression when estimating how strongly a continuous trait tracks evolutionary history.
Standard phylogenetic signal methods assume a strictly bifurcating tree.
When the true history includes reticulation, the tree-based
variance-covariance (VCV) matrix is incorrect and signal estimates
may be biased. network_signal replaces the tree VCV with a
network VCV computed using the recursive algorithm of
Bastide et al. (Systematic Biology, 2018), which properly
weights shared ancestry through both tree-like and hybrid lineages.
Two signal metrics are available (same as phylogenetic_signal):
Blomberg's K (Blomberg et al. 2003): K = 1 under Brownian motion; K < 1 = less signal than expected; K > 1 = more. P-value via permutation test. Computing K on a network is a novel capability not available in any other tool.
Pagel's lambda (Pagel 1999): lambda = 0 = no signal; lambda = 1 = full BM signal. P-value via likelihood ratio test.
Network specification — two options:
Explicit hybrid edges (
--hybrid): specify one or more reticulation events asdonor:recipient:gammawhere gamma is the inheritance probability from the donor lineage (0 < gamma < 0.5).From quartet_network JSON (
--quartet-json): auto-infer hybrid edges from the output ofphykit quartet_network --json. The command identifies taxon pairs that swap across hybrid quartets and estimates gamma from concordance factor ratios.
# With explicit hybrid edges
phykit network_signal -t <tree> -d <trait_data> --hybrid <donor:recipient:gamma> [--method both|blombergs_k|lambda] [--permutations 1000] [--json]
# With quartet_network JSON output
phykit network_signal -t <tree> -d <trait_data> --quartet-json <quartets.json> [--method both|blombergs_k|lambda] [--permutations 1000] [--json]
Options:
-t/\-\-tree: a rooted species tree in Newick format (with branch lengths)
-d/\-\-trait-data: tab-delimited trait file (taxon_name<tab>trait_value)
--hybrid: one or more hybrid edge specifications (donor:recipient:gamma);
donor is the source lineage, recipient receives gene flow, gamma is the
inheritance proportion from the donor (e.g., B:C:0.3)
--quartet-json: path to JSON output from phykit quartet_network --json
--method: both (default), blombergs_k, or lambda
--permutations: number of permutations for K p-value (default: 1000)
-v/\-\-verbose: print network VCV matrix details
--json: optional argument to print results as JSON
Output for default (both) mode:
Hybrid edge: B -> C (gamma=0.3000)
Network taxa: 5
---
Blomberg's K: 0.8234 p-value: 0.0320
Pagel's lambda: 0.7651 log-likelihood: -12.3456 p-value: 0.0012
Tutorial: Wing pattern evolution in Heliconius butterflies
This example shows a realistic workflow for computing phylogenetic signal on a network, starting from gene tree discordance analysis through to signal estimation. The scenario is motivated by the Heliconius butterfly system, where H. melpomene and H. cydno are sister species that hybridize with H. heurippa, producing introgression of wing pattern genes across species boundaries (Mavárez et al., Nature, 2006).
Step 1: Identify hybridization from gene trees.
You have gene trees from 200 loci across 6 Heliconius species.
First, use quartet_network to test whether gene tree discordance
is due to ILS alone or also involves hybridization:
phykit quartet_network -t gene_trees.nwk --json > quartets.json
Examine the output to see which quartets are classified as hybrid:
# Quick summary
python -c "
import json
data = json.load(open('quartets.json'))
print(f'Tree-like: {data[\"tree_count\"]}')
print(f'Hybrid: {data[\"hybrid_count\"]}')
print(f'Unresolved: {data[\"unresolved_count\"]}')
for q in data['quartets']:
if q['classification'] == 'hybrid':
print(f' {q[\"dominant_topology\"]} CFs: {q[\"cfs\"]}')
"
Suppose the output shows that quartets involving H. melpomene and H. heurippa are consistently classified as hybrid, with asymmetric minor concordance factors — evidence of gene flow between these lineages.
Step 2a: Compute signal using quartet_network output directly.
Feed the quartet JSON into network_signal along with a rooted
species tree and wing pattern measurements (e.g., forewing red band
area, log-transformed):
phykit network_signal \
-t species_tree.nwk \
-d wing_pattern.tsv \
--quartet-json quartets.json
The command automatically identifies the strongest hybrid signal from the quartet classifications and estimates the inheritance probability (gamma).
Step 2b: Alternatively, specify hybrid edges explicitly.
If you know the donor and recipient lineages (e.g., from prior knowledge or external network inference), you can specify the hybrid edge directly. Here, H. melpomene is the donor of wing pattern alleles to H. heurippa with an estimated 25% introgression:
phykit network_signal \
-t species_tree.nwk \
-d wing_pattern.tsv \
--hybrid H_melpomene:H_heurippa:0.25
Step 3: Interpret the results.
Example output:
Hybrid edge: H_melpomene -> H_heurippa (gamma=0.2500)
Network taxa: 6
---
Blomberg's K: 0.6821 p-value: 0.0410
Pagel's lambda: 0.5934 log-likelihood: -8.7231 p-value: 0.0085
Interpretation:
K = 0.68 (p = 0.04): significant phylogenetic signal, but less than expected under Brownian motion (K < 1). This is consistent with the wing pattern being phylogenetically conserved in most lineages but displaced in H. heurippa due to introgression from H. melpomene.
Lambda = 0.59 (p = 0.009): moderate phylogenetic signal. The trait is not evolving independently of the network (lambda > 0), but the fit is better with reduced covariance (lambda < 1).
Comparison with tree-based signal: running
phylogenetic_signalon the species tree alone would likely produce a lower K value because the tree VCV does not account for the shared ancestry introduced by introgression. The network-based K is a more accurate estimate of how much evolutionary history explains trait variation.
Why this matters: Without accounting for the network, the tree treats H. heurippa's wing pattern as an independent observation. In reality, its wing pattern was partly inherited from H. melpomene through hybridization — the network VCV correctly reflects this shared ancestry, producing a more accurate phylogenetic signal estimate.
Algorithm — Bastide et al. 2018 network VCV
For Brownian motion on a network, the trait at a hybrid node h with parents p1 and p2 is:
X_h = gamma * (X_p1 + noise_1) + (1-gamma) * (X_p2 + noise_2)
The network VCV is computed recursively in topological order:
Tree node c with parent p, edge length l:
V[c,c] = V[p,p] + l
V[c,j] = V[p,j]for all other nodes jHybrid node h with parents p1 (weight gamma) and p2 (weight 1-gamma):
V[h,h] = gamma^2 * (V[p1,p1] + l1) + (1-gamma)^2 * (V[p2,p2] + l2) + 2*gamma*(1-gamma)*V[p1,p2]
V[h,j] = gamma * V[p1,j] + (1-gamma) * V[p2,j]
The tip-by-tip submatrix is the VCV used for K and lambda.
OU shift detection (l1ou)
Function names: ou_shift_detection; ou_shifts; l1ou; detect_shifts
Command line interface: pk_ou_shift_detection; pk_ou_shifts; pk_l1ou; pk_detect_shifts
Automatic OU shift detection using the LASSO-based approach from Khabbazian et al. (2016). Discovers where on the phylogeny the adaptive optimum changed without requiring an a priori regime assignment. Only a tree and continuous trait data are needed.
The algorithm:
Fits a single-regime OU model to estimate alpha (selection strength)
Builds a design matrix with one column per candidate shift edge
Uses Cholesky transformation to remove phylogenetic correlation
Runs a LASSO path to identify candidate shift configurations
Selects the best model using pBIC, BIC, or AICc
phykit l1ou -t <tree> -d <trait_data> [--criterion pBIC] [--max-shifts N] [--json]
Options:
-t/\-\-tree: a tree file in Newick format
-d/\-\-trait_data: tab-delimited trait file (taxon<tab>value)
--criterion: model selection criterion: pBIC (default), BIC, or AICc
--max-shifts: maximum number of shifts to consider (default: n/2)
--json: optional argument to print results as JSON
Example output (no shifts detected):
============================================================
OU Shift Detection (l1ou)
============================================================
Number of tips: 8
Number of shifts: 0
Selection criterion: pBIC
Alpha (OU strength): 0.784803
Sigma² (BM rate): 1.203455
Root optimum (θ₀): 1.206251
Log-likelihood: -10.2890
pBIC: 26.8163
BIC: 26.8163
AICc: 32.5780
No shifts detected — single-regime OU is best.
============================================================
Example output (shifts detected):
============================================================
OU Shift Detection (l1ou)
============================================================
Number of tips: 100
Number of shifts: 8
Selection criterion: pBIC
Alpha (OU strength): 0.606894
Sigma² (BM rate): 0.062519
Root optimum (θ₀): 0.248810
Log-likelihood: 48.6896
pBIC: 17.6266
BIC: -9.8811
AICc: -49.8793
Detected shifts:
------------------------------------------------------------
Shift 1: terminal branch to valencienni
New optimum: -0.564678
Shift 2: terminal branch to insolitus
New optimum: -0.876398
Shift 3: stem of (barbatus, porcus, ... +2 more)
New optimum: -0.635087
Shift 4: stem of (altitudinalis, oporinus, ... +13 more)
New optimum: -0.462944
============================================================
Results have been validated against R's l1ou package (Khabbazian et al. 2016). On a 100-tip lizard dataset, PhyKIT recovers the same 8 adaptive shifts with matching alpha (0.607) and pBIC (17.6 vs R's 16.8).
Patristic distances
Function names: patristic_distances; pd
Command line interface: pk_patristic_distances; pk_pd
Calculate summary statistics among patristic distances in a phylogeny.
Patristic distances are all tip-to-tip distances in a phylogeny.
To obtain all patristic distances, use the -v/--verbose option. With the -v option, the first column will have two taxon names separated by a '-' followed by the patristic distance. Features will be tab separated.
phykit patristic_distances <tree> [-v/--verbose] [--json]
Options:
<tree>: first argument after function name should be a tree file
-v/\-\-verbose: optional argument to print all tip-to-tip distances
--json: optional argument to print results as JSON
Phenogram (traitgram)
Function names: phenogram; traitgram; tg
Command line interface: pk_phenogram; pk_traitgram; pk_tg
Plot a phenogram (traitgram) showing trait evolution along a phylogeny
(analogous to R's phytools::phenogram()). The X-axis represents
distance from the root and the Y-axis represents trait values.
Ancestral states are reconstructed via maximum-likelihood.
phykit phenogram -t <tree> -d <trait_data> -o <output.png> [--json]
Options:
-t/\-\-tree: a tree file in Newick format
-d/\-\-trait_data: tab-delimited trait file (taxon<tab>value)
-o/\-\-output: output plot file path (required)
--json: optional argument to print results as JSON
Phylogenetic GLM
Function names: phylogenetic_glm; phylo_glm; pglm
Command line interface: pk_phylogenetic_glm; pk_phylo_glm; pk_pglm
Fit a Phylogenetic Generalized Linear Model (GLM) for binary or count response data while accounting for phylogenetic non-independence among species.
Two families are supported:
binomial: logistic regression via Maximum Penalized Likelihood Estimation (logistic_MPLE; Ives & Garland 2010). Uses Firth's penalty to prevent bias from complete/quasi-complete separation, and jointly estimates the phylogenetic signal parameter alpha via a two-state continuous-time Markov chain on the tree.
poisson: Poisson regression via Generalized Estimating Equations (poisson_GEE; Paradis & Claude 2002). Uses Fisher scoring with the phylogenetic correlation matrix and reports an overdispersion parameter.
The multi-trait input file should be tab-delimited with a header row:
taxon<tab>trait1<tab>trait2<tab>...
Output includes coefficient estimates, standard errors, z-values, p-values, log-likelihood, AIC, and McFadden's pseudo-R² (computed from full vs. intercept-only model log-likelihoods).
phykit phylogenetic_glm -t <tree> -d <trait_data> -y <response> -x <predictor1> [predictor2 ...] --family <binomial|poisson> [-g <gene_trees>] [--json]
Options:
-t/\-\-tree: a tree file in Newick format
-d/\-\-trait_data: tab-delimited multi-trait file with header row
-y/\-\-response: response (dependent) variable column name
-x/\-\-predictors: one or more predictor column names
--family: distribution family: binomial or poisson
--method: estimation method: logistic_MPLE or poisson_GEE (auto from family)
--btol: linear predictor bound for logistic model (default: 10)
--log-alpha-bound: bound on log(alpha) for logistic model (default: 4)
-g/\-\-gene-trees: optional multi-Newick file of gene trees; when provided, uses a discordance-aware VCV (genome-wide average) instead of the species-tree VCV
--json: optional argument to print results as JSON
Phylogenetic Ordination
Function names: phylogenetic_ordination; phylo_ordination; ordination; ord;
phylo_pca; phyl_pca; ppca; phylo_dimreduce; dimreduce; pdr
Command line interface: pk_phylogenetic_ordination; pk_phylo_ordination;
pk_ordination; pk_ord; pk_phylo_pca; pk_phyl_pca; pk_ppca;
pk_phylo_dimreduce; pk_dimreduce; pk_pdr
Perform phylogenetic ordination (PCA, t-SNE, or UMAP) on continuous multi-trait data while accounting for phylogenetic non-independence among species.
All methods use GLS-centering via the phylogenetic variance-covariance matrix. For PCA, eigendecomposition of the evolutionary rate matrix is performed to extract principal components. For t-SNE and UMAP, nonlinear embedding is applied to the GLS-centered data.
Three ordination methods are available via --method:
pca (default): phylogenetic PCA (Revell 2009). Performs eigendecomposition of the evolutionary rate matrix.
tsne: t-SNE embedding of the GLS-centered data. Perplexity is auto-set to
min(30, (n-1)/3); requires at least 4 taxa.umap: UMAP embedding of the GLS-centered data. n_neighbors is auto-set to
min(15, n-1); requires at least 3 taxa.
Two phylogenetic correction modes are available via --correction:
BM (default): assumes traits evolved under Brownian motion. The phylogenetic VCV matrix is used directly.
lambda: jointly estimates a single Pagel's lambda parameter across all traits by maximum likelihood, then rescales the off-diagonal elements of the VCV matrix. This accounts for deviations from Brownian motion.
For PCA, two modes are available via --mode:
cov (default): PCA on the evolutionary rate (covariance) matrix.
corr: PCA on the evolutionary rate correlation matrix, which standardizes traits to equal variance before extracting components.
The multi-trait input file should be tab-delimited with a header row:
taxon<tab>trait1<tab>trait2<tab>...
Lines starting with '#' are treated as comments. If the tree and trait file
have different taxa, the intersection is used and warnings are printed to
stderr.
PCA output includes eigenvalues with proportion of variance explained, trait
loadings, and taxon scores for each principal component. t-SNE/UMAP output
includes method parameters and embedding coordinates. When correction=lambda
is used, the estimated lambda and log-likelihood are also reported.
PCA results have been benchmarked against the R package
phytools
(phyl.pca function; Revell 2012). Eigenvalues, loadings, and scores match
phytools across all method/mode combinations (BM+cov, BM+corr, lambda+cov,
lambda+corr) within numerical tolerance (1e-4).
phykit phylogenetic_ordination -t <tree> -d <trait_data> [--method <pca|tsne|umap>] [--correction <BM|lambda>] [--mode <cov|corr>] [--n-components <int>] [--perplexity <float>] [--n-neighbors <int>] [--min-dist <float>] [--seed <int>] [--plot] [--plot-tree] [--no-plot-tree] [--color-by <col_or_file>] [--tree-color-by <col_or_file>] [--plot-output <path>] [-g <gene_trees>] [--json]
Options:
-t/\-\-tree: a tree file in Newick format
-d/\-\-trait_data: tab-delimited multi-trait file with header row
--method: ordination method: pca, tsne, or umap (default: pca)
--correction: phylogenetic correction: BM or lambda (default: BM)
--mode: PCA mode: cov or corr (default: cov; PCA only)
--n-components: number of embedding dimensions (default: 2; tsne/umap only)
--perplexity: t-SNE perplexity (default: auto)
--n-neighbors: UMAP n_neighbors (default: auto)
--min-dist: UMAP min_dist (default: 0.1)
--seed: random seed for reproducibility
--plot: optional argument to save a scatter plot
--plot-tree: overlay the phylogeny via ancestral reconstruction (default for tsne/umap; opt-in for pca)
--no-plot-tree: disable the phylogeny overlay for tsne/umap plots
--color-by: color tip points by a trait; specify a column name from the multi-trait file or a separate tab-delimited file (taxon<tab>value) for continuous or discrete coloring
--tree-color-by: color phylogeny edges by a trait; specify a column name or a tab-delimited file (default: distance from root)
--plot-output: output path for plot (default: phylo_ordination_plot.png)
-g/\-\-gene-trees: optional multi-Newick file of gene trees; when provided, uses a discordance-aware VCV (genome-wide average) instead of the species-tree VCV
--json: optional argument to print results as JSON
Phylogenetic regression (PGLS)
Function names: phylogenetic_regression; phylo_regression; pgls
Command line interface: pk_phylogenetic_regression; pk_phylo_regression; pk_pgls
Fit a Phylogenetic Generalized Least Squares (PGLS) regression while
accounting for phylogenetic non-independence among species, analogous to
R's caper::pgls().
The multi-trait input file should be tab-delimited with a header row:
taxon<tab>trait1<tab>trait2<tab>...
Lines starting with '#' are treated as comments. If the tree and trait file
have different taxa, the intersection is used and warnings are printed to
stderr.
Two methods are available:
BM (default): Brownian motion with lambda fixed at 1
lambda: jointly estimates Pagel's lambda via maximum likelihood
Output includes coefficient estimates, standard errors, t-values, p-values, R-squared, adjusted R-squared, F-statistic, log-likelihood, and AIC.
A three-way variance decomposition is also reported: R²_total (variance explained by phylogeny + predictors combined), R²_pred (predictor contribution given phylogeny, = standard R²), and R²_phylo (phylogeny's unique contribution). R²_phylo + R²_pred = R²_total.
The implementation uses the raw phylogenetic variance-covariance (VCV) matrix
for GLS estimation, matching the approach used by R's caper::pgls().
Note that this differs from nlme::gls() with corBrownian, which
normalizes the VCV to a correlation matrix (ones on the diagonal). For
non-ultrametric trees the two approaches yield different coefficient estimates;
PhyKIT follows the caper convention.
All results have been validated against R 4.4.0 using manual GLS with
ape::vcv() (raw VCV). Coefficients, standard errors, t-values, p-values,
R-squared, F-statistic, log-likelihood, and AIC match R to at least four
decimal places for both simple and multiple regression.
phykit phylogenetic_regression -t <tree> -d <trait_data> -y <response> -x <predictor1> [predictor2 ...] [-m <method>] [-g <gene_trees>] [--json]
Options:
-t/\-\-tree: a tree file in Newick format
-d/\-\-trait_data: tab-delimited multi-trait file with header row
-y/\-\-response: response (dependent) variable column name
-x/\-\-predictors: one or more predictor column names
-m/\-\-method: method to use: BM or lambda (default: BM)
-g/\-\-gene-trees: optional multi-Newick file of gene trees; when provided, uses a discordance-aware VCV (genome-wide average) instead of the species-tree VCV
--json: optional argument to print results as JSON
Phylogenetic signal
Function names: phylogenetic_signal; phylo_signal; ps
Command line interface: pk_phylogenetic_signal; pk_phylo_signal; pk_ps
Calculate phylogenetic signal for continuous trait data on a phylogeny.
Two methods are available:
Blomberg's K (Blomberg et al. 2003): measures the degree of phylogenetic signal relative to expectation under Brownian motion. K = 1 indicates trait variation consistent with BM; K < 1 indicates less phylogenetic signal than expected; K > 1 indicates more. P-value is computed via permutation test.
Pagel's lambda (Pagel 1999): a tree-scaling parameter estimated by maximum likelihood. Lambda = 0 indicates no phylogenetic signal; lambda = 1 indicates trait evolution consistent with BM. P-value is computed via likelihood ratio test against lambda = 0.
The trait file should be tab-delimited with two columns (taxon_name<tab>trait_value). Lines starting with '#' are treated as comments. If the tree and trait file have different taxa, the intersection is used and warnings are printed to stderr.
Output for Blomberg's K: K_value<tab>p_value<tab>R2_phylo
Output for Pagel's lambda: lambda_value<tab>log_likelihood<tab>p_value<tab>R2_phylo
R²_phylo reports the fraction of trait variance explained by phylogenetic structure:
R²_phylo = 1 - (σ²_BM / σ²_WN). Values near 1 indicate strong phylogenetic signal;
values near 0 indicate phylogeny explains little trait variance.
Results have been validated against the R package phytools (phylosig function)
across 95 simulated datasets spanning diverse tree sizes (5-50 tips), topologies
(pure-birth, coalescent), trait models (random, Brownian motion, known lambda),
and branch length scales. All metrics show Pearson r > 0.999 with phytools.
phykit phylogenetic_signal -t <tree> -d <trait_data> [-m <method>] [-p <permutations>] [-g <gene_trees>] [--json]
Options:
-t/\-\-tree: a tree file in Newick format
-d/\-\-trait_data: tab-delimited trait file (taxon_name<tab>trait_value)
-m/\-\-method: method to use: blombergs_k or lambda (default: blombergs_k)
-p/\-\-permutations: number of permutations for Blomberg's K (default: 1000)
-g/\-\-gene-trees: optional multi-Newick file of gene trees; when provided, uses a discordance-aware VCV (genome-wide average) instead of the species-tree VCV
--json: optional argument to print results as JSON
Phylomorphospace
Function names: phylomorphospace; phylomorpho; phmo
Command line interface: pk_phylomorphospace; pk_phylomorpho; pk_phmo
Plot a phylomorphospace: two raw traits in trait space with the phylogeny
overlaid via ML-reconstructed ancestral states at internal nodes. This differs
from the phylogenetic_ordination --plot-tree option, which plots in PC space;
phylomorphospace plots raw trait values directly on the x and y axes.
Tree edges connect species through ML-reconstructed ancestral states and are
colored by distance from root (coolwarm colormap with colorbar). Tip points
default to blue, or can be colored by a continuous or discrete variable using
the --color-by option.
The multi-trait input file should be tab-delimited with a header row:
taxon<tab>trait1<tab>trait2<tab>...
Lines starting with '#' are treated as comments. If the tree and trait file
have different taxa, the intersection is used and warnings are printed to
stderr.
If the trait file has exactly 2 trait columns and --trait-x / --trait-y
are omitted, the first two columns are selected automatically.
phykit phylomorphospace -t <tree> -d <trait_data> [--trait-x <name>] [--trait-y <name>] [--color-by <col_or_file>] [--plot-output <path>] [--json]
Options:
-t/\-\-tree: a tree file in Newick format
-d/\-\-trait_data: tab-delimited multi-trait file with header row
--trait-x: column name for x-axis trait
--trait-y: column name for y-axis trait
--color-by: color tip points by a trait; specify a column name from the multi-trait file or a separate tab-delimited file (taxon<tab>value) for continuous or discrete coloring
--plot-output: output path for plot (default: phylomorphospace_plot.png)
--json: optional argument to print results as JSON
Polytomy testing
Function names: polytomy_test; polyt_test; polyt; ptt
Command line interface: pk_polytomy_test; pk_polyt_test; pk_polyt; pk_ptt
Conduct a polytomy test for three clades in a phylogeny.
Polytomy tests can be used to identify putative radiations as well as identify well supported alternative topologies.
The polytomy testing function takes as input a file with the three groups of taxa to test the relationships for and a single column file with the names of the desired tree files to use for polytomy testing. Next, the script to examine support for the grouping of the three taxa using triplets and gene support frequencies.
This function can account for uncertainty in gene trees - that is, the input phylogenies can have collapsed bipartitions.
Thereafter, a chi-squared test is conducted to determine if there is evidence to reject the null hypothesis wherein the null hypothesis is that the three possible topologies among the three groups are equally supported. This test is done using gene support frequencies.
phykit polytomy_test -t/--trees <trees> -g/--groups <groups> [--json]
Options:
-t/\-\-trees <trees>: single column file with the names of
phylogenies to use for polytomy testing
-g/\-\-groups: a tab-delimited file with the grouping designations
to test. Lines starting with commetns are not considered. Names of
individual taxa should be separated by a semi-colon ';'
--json: optional argument to print results as JSON
For example, the groups file could look like the following:
#label group0 group1 group2
name_of_test tip_name_A;tip_name_B tip_name_C tip_name_D;tip_name_E
Print tree
Function names: print_tree; print; pt
Command line interface: pk_print_tree; pk_print; pk_pt
Print ascii tree of input phylogeny.
Phylogeny can be printed with or without branch lengths. By default, the phylogeny will be printed with branch lengths but branch lengths can be removed using the -r/--remove argument.
phykit print_tree <tree> [-r/--remove] [--json]
Options:
<tree>: first argument after function name should be a tree file
-r/\-\-remove: optional argument to print the phylogeny without branch
lengths
--json: optional argument to print results as JSON
Prune tree
Function names: prune_tree; prune
Command line interface: pk_prune_tree; pk_prune
Prune tips from a phylogeny.
Provide a single column file with the names of the tips in the input phylogeny you would like to prune from the tree.
phykit prune_tree <tree> <list_of_taxa> [-o/--output <output_file>] [-k/--keep] [--json]
Options:
<tree>: first argument after function name should be a tree file
<list_of_taxa>: single column file with the names of the tips to remove
from the phylogeny
-o/\-\-output: name of output file for the pruned phylogeny.
Default output will have the same name as the input file but with the suffix
".pruned"
-k/--keep: optional argument. If used instead of pruning taxa in <list_of_taxa>,
keep them
--json: optional argument to print results as JSON
|
Quartet network
Function names: quartet_network; quartet_net; qnet; nanuq
Command line interface: pk_quartet_network; pk_quartet_net; pk_qnet; pk_nanuq
Quartet-based network inference (NANUQ-style) for distinguishing incomplete lineage sorting (ILS) from hybridization/gene flow using quartet concordance factors from gene trees.
For each 4-taxon subset, counts how many gene trees display each of the 3 possible unrooted topologies and applies two hypothesis tests:
Star test (Pearson chi-squared): tests whether the three topology counts are consistent with a star tree (equal probabilities 1/3 each). If p_star > beta (default 0.95), the quartet is classified as unresolved.
T3 tree model test (G-test / likelihood ratio): tests whether the counts are consistent with any resolved quartet tree under the multispecies coalescent. If p_tree > alpha (default 0.05), the quartet is classified as tree-like (conflict is due to ILS).
If both tests reject their null hypotheses, the quartet is classified as hybrid (asymmetric discordance indicating gene flow or hybridization).
This algorithm matches the NANUQ method of Allman, Baños & Rhodes (2019), implemented in R's MSCquartets package.
Input can be either: 1) a file with one Newick tree per line, or 2) a file with one tree-file path per line.
phykit quartet_network -t/--trees <trees> [--alpha 0.05] [--beta 0.95] [--missing-taxa error|shared] [--plot-output <file>] [--json]
Options:
-t/\-\-trees: file containing trees (one Newick per line) or tree-file paths (one per line)
--alpha: significance level for the T3 tree model test (default: 0.05)
--beta: threshold for the star tree test; quartets with p_star > beta are called unresolved (default: 0.95)
--missing-taxa: handling strategy for mismatched taxa (error or shared; default: error)
--plot-output: output filename for the quartet network plot (optional)
--json: optional argument to print results as JSON
When --plot-output is specified, a NANUQ-style splits graph is drawn from
the quartet distance matrix using Neighbor-Joining and circular split
decomposition. Tree-like relationships appear as simple branching, while
hybridization / reticulation produces characteristic box (parallelogram)
structures — the same style of output produced by R's MSCquartets +
NeighborNet pipeline.
No hybridization signal — all quartets are tree-like, so the splits graph is a clean unrooted tree:
Hybridization signal present — hybrid quartets introduce conflicting splits that appear as boxes in the network, indicating reticulation among C, D, E, and F:
Rate heterogeneity test (multi-rate Brownian motion)
Function names: rate_heterogeneity; brownie; rh
Command line interface: pk_rate_heterogeneity; pk_brownie; pk_rh
Test for rate heterogeneity across phylogenetic regimes using multi-rate
Brownian motion (O'Meara et al. 2006), analogous to R's
phytools::brownie.lite().
Fits single-rate vs. multi-rate BM models and performs a likelihood ratio
test. Users specify a tree, continuous trait data, and a regime file
mapping tips to regimes (tab-delimited, taxon<tab>regime_label).
Regime assignments to internal branches are inferred via Fitch parsimony. Per-regime VCV matrices are decomposed and per-regime sigma-squared values are estimated via maximum likelihood.
Optionally, a parametric bootstrap can be run to compute a simulated
p-value (-n/--nsim). A horizontal phylogram with branches colored by
regime can be generated using --plot.
An effect size metric R²_regime is also reported, measuring the variance reduction from regime-specific rates vs. a single rate, weighted by the number of tips per regime.
phykit rate_heterogeneity -t <tree> -d <trait_data> -r <regime_data> [-n <nsim>] [--seed <seed>] [--plot <output.png>] [--json]
Options:
-t/\-\-tree: a tree file in Newick format
-d/\-\-trait_data: tab-delimited trait file (taxon<tab>value)
-r/\-\-regime_data: tab-delimited regime file (taxon<tab>regime_label)
-n/\-\-nsim: number of parametric bootstrap simulations (default: 0)
--seed: random seed for reproducibility
--plot: output plot file path for phylogram with colored branches
--json: optional argument to print results as JSON
Rename tree tips
Function names: rename_tree_tips; rename_tree; rename_tips
Command line interface: pk_rename_tree_tips; pk_rename_tree; pk_rename_tips
Renames tips in a phylogeny.
Renaming tip files will follow the scheme of a tab-delimited file wherein the first column is the current tip name and the second column is the desired tip name in the resulting phylogeny.
phykit rename_tree_tips <tree> -i/--idmap <idmap.txt> [-o/--output <output_file>] [--json]
Options:
<tree>: first argument after function name should be a tree file
-i/\-\-idmap: identifier map of current tip names (col1) and desired
tip names (col2)
-o/\-\-output: optional argument to write the renamed tree files to. Default
output will have the same name as the input file but with the suffix ".renamed"
--json: optional argument to print results as JSON
Robinson-Foulds distance
Function names: robinson_foulds_distance; rf_distance; rf_dist; rf
Command line interface: pk_robinson_foulds_distance; pk_rf_distance; pk_rf_dist; pk_rf
Calculate Robinson-Foulds (RF) distance between two trees.
Low RF distances reflect greater similarity between two phylogenies. This function prints out two values, the plain RF value and the normalized RF value, which are separated by a tab. Normalized RF values are calculated by taking the plain RF value and dividing it by 2(n-3) where n is the number of tips in the phylogeny. Prior to calculating an RF value, PhyKIT will first determine the number of shared tips between the two input phylogenies and prune them to a common set of tips. Thus, users can input trees with different topologies and infer an RF value among subtrees with shared tips.
PhyKIT will print out col 1; the plain RF distance and col 2: the normalized RF distance.
RF distances are calculated following Robinson & Foulds, Mathematical Biosciences (1981), doi: 10.1016/0025-5564(81)90043-2.
phykit robinson_foulds_distance <tree_file_zero> <tree_file_one> [--json]
Options:
<tree_file_zero>: first argument after function name should be a tree file
<tree_file_one>: second argument after function name should be a tree file
--json: optional argument to print results as JSON
Root tree
Function names: root_tree; root; rt
Command line interface: pk_root_tree; pk_root; pk_rt
Roots phylogeny using user-specified taxa.
A list of taxa to root the phylogeny on should be specified using the -r argument. The root_taxa file should be a single-column file with taxa names. The outputted file will have the same name as the inputted tree file but with the suffix ".rooted".
phykit root_tree <tree> -r/--root <root_taxa> [-o/--output <output_file>] [--json]
Options:
<tree>: first argument after function name should be a tree file to root|br|
-r/\-\-root: single column file with taxa names to root the phylogeny on|br|
-o/\-\-output: optional argument to specify the name of the output file
--json: optional argument to print results as JSON
Spurious homolog identification
Function names: spurious_sequence; spurious_seq; ss
Command line interface: pk_spurious_sequence; pk_spurious_seq; pk_ss
Determines potentially spurious homologs using branch lengths.
Identifies potentially spurious sequences and reports tips in the phylogeny that could possibly be removed from the associated multiple sequence alignment. PhyKIT does so by identifying and reporting long terminal branches defined as branches that are equal to or 20 times the median length of all branches.
PhyKIT reports the following information col1: name of tip that is a putatively spurious sequence col2: length of branch leading to putatively spurious sequence col3: threshold used to identify putatively spurious sequences col4: median branch length in the phylogeny
If there are no putatively spurious sequences, "None" is reported.
Using this method to identify potentially spurious sequences was, to my knowledge, first introduced by Shen et al., (2018) Cell doi: 10.1016/j.cell.2018.10.023.
phykit spurious_seq <file> -f/\\-\\-factor [--json]
Options:
<file>: first argument after function name should be a tree file
-f/\-\-factor: factor to multiply median branch length by to calculate
the threshold of long branches. (Default: 20)
--json: optional argument to print results as JSON
Stochastic character mapping (SIMMAP)
Function names: stochastic_character_map; simmap; scm
Command line interface: pk_stochastic_character_map; pk_simmap; pk_scm
Perform Stochastic Character Mapping (SIMMAP) of discrete traits onto a
phylogeny (Huelsenbeck et al. 2003; Bollback 2006), analogous to R's
phytools::make.simmap().
Fits a continuous-time Markov chain (CTMC) rate matrix Q via maximum likelihood using Felsenstein's pruning algorithm. Three substitution models are available:
ER (equal rates): 1 free parameter
SYM (symmetric rates): k(k-1)/2 free parameters
ARD (all rates differ): k(k-1) free parameters
The input trait file should be tab-delimited with a header row:
taxon<tab>trait_column<tab>...
Lines starting with '#' are treated as comments. If the tree and trait file
have different taxa, the intersection is used and warnings are printed to
stderr.
Output includes the fitted Q matrix, log-likelihood, mean dwelling times, mean transition counts, and posterior probabilities at internal nodes.
Optionally, a horizontal phylogram plot with branches colored by mapped
character state can be generated using the --plot argument.
All results have been validated against R 4.4.0 using phytools::fitMk
and phytools::make.simmap. ER log-likelihoods match R to within 0.002
(both optimizers converge to the same flat plateau). For the ARD model,
PhyKIT's multi-start optimizer finds a slightly better local optimum
than R (loglik -8.38 vs -8.43). Dwelling times sum to total tree length
exactly, and Q matrix structural properties (rows sum to zero, model
nesting) are verified.
phykit stochastic_character_map -t <tree> -d <trait_data> -c <trait_column> [-m <model>] [-n <nsim>] [--seed <seed>] [--plot <output.png>] [--json]
Options:
-t/\-\-tree: a tree file in Newick format
-d/\-\-trait_data: tab-delimited trait file with header row
-c/\-\-trait: column name for discrete character trait
-m/\-\-model: substitution model: ER, SYM, or ARD (default: ER)
-n/\-\-nsim: number of stochastic mapping simulations (default: 100)
--seed: random seed for reproducibility
--plot: output plot file path for phylogram with colored branches
--json: optional argument to print results as JSON
Terminal branch statistics
Function names: terminal_branch_stats; tbs
Command line interface: pk_terminal_branch_stats; pk_tbs
Calculate summary statistics for terminal branch lengths in a phylogeny.
Terminal branch lengths can be useful for phylogeny diagnostics.
To obtain all terminal branch lengths, use the -v/\-\-verbose option.
phykit terminal_branch_stats <tree> [-v/--verbose] [--json]
Options:
<tree>: first argument after function name should be a tree file
-v/\-\-verbose: optional argument to print all terminal branch lengths
--json: optional argument to print results as JSON
Threshold model
Function names: threshold_model; threshold; thresh; threshbayes; thresh_bayes
Command line interface: pk_threshold_model; pk_threshold; pk_thresh; pk_threshbayes; pk_thresh_bayes
Estimate the evolutionary correlation between two traits using the Felsenstein (2012) threshold model via MCMC. Binary discrete characters are modelled as arising from continuous latent "liabilities" that evolve under Brownian motion and cross a threshold at 0. This lets you estimate correlations between binary traits (or between a binary and a continuous trait) using BM rather than Mk transition rates.
This is the Python equivalent of phytools::threshBayes in R.
The sampler uses a Gibbs / Metropolis-Hastings hybrid:
Gibbs step: sample each discrete tip's liability from a truncated normal conditioned on all other values
Metropolis step: update sigma2_1, sigma2_2 (log-normal proposal), ancestral values a1, a2 (normal proposal), and the correlation r (normal proposal with reflection on [-1, 1])
Adaptive tuning: during burn-in, proposal variances are adjusted to target ~23% acceptance
Trait types:
discrete: binary (0/1). Liabilities < 0 map to state 0, liabilities > 0 map to state 1.continuous: observed values used directly (no liability needed).
Any combination of two traits is supported: discrete+continuous, discrete+discrete, or continuous+continuous.
phykit threshold_model -t <tree> -d <trait_data> --traits <t1,t2> --types <type1,type2> [--ngen 100000] [--sample 100] [--burnin 0.2] [--seed <int>] [--plot <file>] [--json]
Options:
-t/\-\-tree: a rooted phylogeny file with branch lengths (required)
-d/\-\-trait-data: tab-delimited trait file with header row (required)
--traits: comma-separated pair of trait column names (required)
--types: comma-separated pair of trait types, each discrete or continuous (required)
--ngen: number of MCMC generations (default: 100000)
--sample: sample frequency (default: 100)
--burnin: burn-in fraction (default: 0.2)
--seed: random seed for reproducibility
--plot: output filename for trace and posterior density plot (3 rows x 2 columns:
left = MCMC trace, right = posterior histogram with 95% HPD shading)
--json: optional argument to print results as JSON
Output (text mode):
Trait 1: habitat (discrete, 2 states: 0, 1)
Trait 2: body_mass (continuous)
MCMC: 100000 generations, sampled every 100, burn-in 20%
---
Posterior correlation (r): 0.6234 (95% HPD: 0.312, 0.891)
Posterior sigma2_1: 1.234 (95% HPD: 0.456, 2.345)
Posterior sigma2_2: 0.567 (95% HPD: 0.234, 1.012)
Acceptance rates: r=0.234, sigma2_1=0.312, sigma2_2=0.287, a1=0.241, a2=0.228
Tutorial: habitat type and body mass in carnivores
This example uses the classic 8-taxon carnivore tree to test whether habitat type (a binary trait: 0 = non-arboreal, 1 = arboreal) is correlated with body mass on the latent liability scale.
Step 1: Prepare the trait file.
Create a tab-delimited file with a header row. The first column is the taxon name, followed by columns for each trait:
taxon habitat body_mass
raccoon 0 1.04
bear 0 2.39
sea_lion 0 2.30
seal 0 1.88
monkey 1 0.60
cat 1 0.56
weasel 1 -0.30
dog 0 1.18
Step 2: Run the threshold model.
phykit threshold_model \
-t carnivore.tre \
-d traits.tsv \
--traits habitat,body_mass \
--types discrete,continuous \
--ngen 100000 \
--seed 42
Step 3: Examine the trace plots for convergence.
phykit threshold_model \
-t carnivore.tre \
-d traits.tsv \
--traits habitat,body_mass \
--types discrete,continuous \
--ngen 100000 \
--seed 42 \
--plot trace.png
This generates a 3-row x 2-column figure. The left column shows MCMC trace plots for r, sigma2_1, and sigma2_2 (check for stationarity and good mixing). The right column shows posterior density histograms with red shading for the 95% HPD interval and a dashed line at the posterior mean.
Step 4: Get full posterior samples as JSON for custom analysis.
phykit threshold_model \
-t carnivore.tre \
-d traits.tsv \
--traits habitat,body_mass \
--types discrete,continuous \
--ngen 100000 \
--seed 42 \
--json > posterior.json
The JSON output includes full posterior sample arrays, summary statistics (mean, median, 95% HPD), and MCMC metadata.
Tip labels
Function names: tip_labels; tree_labels; labels; tl
Command line interface: pk_tip_labels; pk_tree_labels; pk_labels; pk_tl
Prints the tip labels (or names) a phylogeny.
phykit tip_labels <tree> [--json]
Options:
<tree>: first argument after function name should be a tree file
--json: optional argument to print results as JSON
Tip-to-tip distance
Function names: tip_to_tip_distance; t2t_dist; t2t
Command line interface: pk_tip_to_tip_distance; pk_t2t_dist; pk_t2t
Calculate distance between two tips (or leaves) in a phylogeny.
Distances are in substitutions per site.
phykit tip_to_tip_distance <tree_file> <tip_1> <tip_2> [--json]
phykit tip_to_tip_distance <tree_file> --all-pairs [--plot] [--plot-output <path>] [--json]
Options:
<tree_file>: first argument after function name should be a tree file
<tip_1>: second argument should be the name of the first tip of interest
<tip_2>: third argument should be the name of the second tip of interest
--all-pairs: optional argument to report all pairwise tip distances
--plot: optional argument to save a clustered distance heatmap (requires --all-pairs)
--plot-output: output path for heatmap (default: tip_to_tip_distance_heatmap.png)
--json: optional argument to print results as JSON
Tip-to-tip node distance
Function names: tip_to_tip_node_distance; t2t_node_dist; t2t_nd
Command line interface: pk_tip_to_tip_node_distance; pk_t2t_node_dist; pk_t2t_nd
Calculate distance between two tips (or leaves) in a phylogeny.
Distance is measured by the number of nodes between one tip and another.
phykit tip_to_tip_node_distance <tree_file> <tip_1> <tip_2> [--json]
Options:
<tree_file>: first argument after function name should be a tree file
<tip_1>: second argument should be the name of the first tip of interest
<tip_2>: third argument should be the name of the second tip of interest
--json: optional argument to print results as JSON
Total tree length
Function names: total_tree_length; tree_len
Command line interface: pk_total_tree_length; pk_tree_len
Calculate total tree length, which is a sum of all branches.
phykit total_tree_length <tree> [--json]
Options:
<tree>: first argument after function name should be a tree file
--json: optional argument to print results as JSON
Treeness
Function names: treeness; tness
Command line interface: pk_treeness; pk_tness
Calculate treeness statistic for a phylogeny.
Higher treeness values are thought to be desirable because they represent a higher signal-to-noise ratio.
Treeness is the sum of internal branch lengths divided by the total tree length. Therefore, values range from 0 to 1. Treeness can be used as a measure of the signal-to-noise ratio in a phylogeny.
Calculate treeness (also referred to as stemminess) following Lanyon, The Auk (1988), doi: 10.1093/auk/105.3.565 and Phillips and Penny, Molecular Phylogenetics and Evolution (2003), doi: 10.1016/S1055-7903(03)00057-5.
phykit treeness <tree> [--json]
Options:
<tree>: first argument after function name should be a tree file
--json: optional argument to print results as JSON
Alignment- and tree-based functions
Relative rate test
Function names: relative_rate_test; rrt; tajima_rrt
Command line interface: pk_relative_rate_test; pk_rrt; pk_tajima_rrt
Tajima's relative rate test (Tajima, Genetics, 1993).
Tests whether two ingroup lineages have evolved at equal rates since diverging from their common ancestor. The outgroup is automatically inferred from the rooted tree as the earliest-diverging taxon (the single taxon on the smaller side of the root split). All pairwise ingroup combinations are tested with Bonferroni and Benjamini-Hochberg FDR multiple testing correction.
At each alignment column, the test classifies informative sites:
m1: the first ingroup taxon differs from the outgroup, but the second matches
m2: the second ingroup taxon differs from the outgroup, but the first matches
Sites where both differ or both match the outgroup are uninformative and skipped
Sites with gaps or ambiguous characters are skipped
Test statistic: chi2 = (m1 - m2)^2 / (m1 + m2), with 1 degree of freedom.
Single alignment mode:
phykit relative_rate_test -a <alignment> -t <rooted_tree> [-v/--verbose] [--json]
Batch mode (multiple alignments, one shared tree):
phykit relative_rate_test -l <alignment_list> -t <rooted_tree> [-v/--verbose] [--json]
Options:
-a/\-\-alignment: a single alignment file
-l/\-\-alignment-list: a file with one alignment path per line (batch mode)
-t/\-\-tree: a rooted tree file
-v/\-\-verbose: print additional detail
--plot-output: save a pairwise p-value heatmap to this path
--json: optional argument to print results as JSON
Single mode outputs one row per ingroup pair with m1, m2, chi-squared, raw p-value, Bonferroni-corrected p-value, FDR-corrected p-value, and a significance indicator. Batch mode aggregates across genes, reporting the number and percentage of genes rejecting equal rates for each pair.
When --plot-output is provided, a symmetric heatmap of
-log10(FDR-corrected p-values) is saved. Cells with significant
pairs (FDR < 0.05) are marked with an asterisk. Darker colors
indicate stronger evidence for unequal evolutionary rates.
The dashed line on the colorbar marks the significance threshold
(-log10(0.05) ≈ 1.3); cells above this line are significant.
Equal rates — all taxa evolve at similar rates, so no pairwise comparison reaches significance. The heatmap is uniformly pale with no asterisks:
Unequal rates — taxa B and C have accumulated many more substitutions
than A and D. Significant pairs (dark red, marked with *) indicate
lineages evolving at detectably different rates:
Validated against R's pegas::rr.test() — chi-squared and p-values
match to machine precision.
Saturation
Function names: saturation; sat
Command line interface: pk_saturation; pk_sat
Calculate saturation for a given tree and alignment.
Saturation is defined as sequences in multiple sequence alignments that have undergone numerous substitutions such that the distances between taxa are underestimated.
Data with no saturation will have a value of 1. The closer the value is to 1, the less saturated the data.
This function outputs two values (as of v1.19.9). The first value is the saturation value and the second column is the absolute value of saturation minus 1. Thus, lower values in the second column are indicative of values closer to one and, thus, less saturation.
Saturation is calculated following Philippe et al., PLoS Biology (2011), doi: 10.1371/journal.pbio.1000602.
phykit saturation -a <alignment> -t <tree> [-v/--verbose] [-e/--exclude_gaps] [--plot] [--plot-output <path>] [--json]
Options:
-a/\-\-alignment: an alignment file
-t/\-\-tree: a tree file
-e/--exclude_gaps: if a site has a gap, ignore it
-v/\-\-verbose: print out patristic distances and uncorrected
distances used to determine saturation
--plot: save a saturation scatter plot with fitted slope through origin
--plot-output: output path for saturation plot (default: saturation_plot.png)
--json: optional argument to print results as JSON
Treeness over RCV
Function names: treeness_over_rcv; toverr; tor
Command line interface: pk_treeness_over_rcv; pk_toverr; pk_tor
Calculate treeness/RCV for a given alignment and tree.
Higher treeness/RCV values are thought to be desirable because they harbor a high signal-to-noise ratio and are least susceptible to composition bias.
PhyKIT reports three tab delimited values: col1: treeness/RCV col2: treeness col3: RCV
Calculate treeness/RCV following Phillips and Penny, Molecular Phylogenetics and Evolution (2003), doi: 10.1016/S1055-7903(03)00057-5.
phykit treeness_over_rcv -a/--alignment <alignment> -t/--tree <tree> [--json]
Options:
-a/\-\-alignment: an alignment file
-t/\-\-tree: a tree file
--json: optional argument to print results as JSON