Change log

Major changes to OrthoHMM are summarized here.

0.5.0 Reverted the v0.3.0 default clustering choice. In head-to-head benchmarks across a range of phylogenetic diversity (OrthoBench: 12 bilaterians; QfO 2020: 78 mixed-domain proteomes; Three Kingdoms: 12 cross-kingdom proteomes), MCL with inflation=1.5 was the only clustering choice that never failed catastrophically. Leiden CPM with the fixed resolution=0.1 default collapses to all-singletons on cross-kingdom inputs (F=0.0003 on Three Kingdoms), and Leiden with --cpm_resolution auto over-merges on QfO functional metrics. MCL at inflation=1.5 trails the best Leiden setting by about 3 percentage points on closely related inputs but is robust across the full diversity range. --clustering leiden --cpm_resolution auto remains the recommended choice for users who prefer the pure-Python path or who can confirm that their inputs are tightly clustered. mcl is once again a required external binary; see the README/install docs.

0.4.2 --cpm_resolution auto now anchors γ on the smallest positive edge weight (excluding numerical-zero artifacts). The earlier formula (γ = 4 × strict-min) collapsed to γ ≈ 0 on the QfO 40M-edge graph and caused Leiden to segfault.
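
The 0.4.2 anchoring rule can be illustrated with a minimal sketch. This is not OrthoHMM's actual code: the function name auto_cpm_resolution, the eps cutoff used to discard numerical-zero weights, and the assumption that the factor of 4 from 0.4.1 is retained are all hypothetical and chosen only for illustration.

```python
from typing import Iterable


def auto_cpm_resolution(
    edge_weights: Iterable[float],
    factor: float = 4.0,   # assumption: the 0.4.1 multiplier is kept in 0.4.2
    eps: float = 1e-12,    # hypothetical cutoff for "numerical-zero" weights
) -> float:
    """Return a CPM resolution anchored on the smallest positive edge weight."""
    # Drop zeros and near-zero artifacts before taking the minimum.
    positive = [w for w in edge_weights if w > eps]
    if not positive:
        raise ValueError("no positive edge weights to anchor the resolution on")
    return factor * min(positive)


weights = [0.0, 1e-300, 0.05, 0.2, 0.9]

# Strict minimum (the 0.4.1 behaviour): a single zero weight drives gamma to 0.
strict_gamma = 4.0 * min(weights)

# Smallest positive weight (the 0.4.2 behaviour as sketched here).
anchored_gamma = auto_cpm_resolution(weights)
```

With the sample weights, the strict minimum yields γ = 0, while the anchored version ignores the zero and the 1e-300 artifact and uses 0.05 as its anchor.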

0.4.1 Refined --cpm_resolution auto to use γ = 4 × min(edge_weight), which beats the v0.4.0 10th-percentile heuristic on both OrthoBench (F=65.9 vs 58.8) and Three Kingdoms (F=0.907 vs 0.821).

0.4.0 New --cpm_resolution auto flag. Auto-tunes γ to the post-RBNH edge-weight distribution so OrthoHMM no longer collapses to all-singletons on distantly-related inputs.

0.3.0 Replaced the external MCL binary with in-process Leiden CPM clustering via igraph + leidenalg. Both libraries are pip-installed wheels, so OrthoHMM no longer requires any external executable when run with defaults. Leiden CPM (resolution=0.1) beats MCL on the OrthoBench 2020 reference: F=65.7% vs MCL’s best F=62.4% (inflation=1.5) on the identical RBNH edge set. New flags: --clustering {leiden, mcl} and --cpm_resolution. Selecting --clustering mcl reverts to the prior MCL pipeline and re-introduces the external mcl requirement.

0.2.0 Added a built-in profile HMM + k-mer prefilter search engine that replaces the phmmer subprocess. The new engine is the default and substantially reduces wall time and memory on multi-proteome datasets (see the bacterial scaling table on the home page). HMMER is now optional and is required only when opting into --search_mode phmmer. This release also adds the WAG and LG substitution matrices, drops Python 3.9 support (Python 3.10+ is now required), and adds optional C/AVX2 and CUDA kernels that are compiled at install time when a suitable toolchain is available; otherwise the runtime falls back transparently to a Numba implementation.

0.1.1 There is no longer a limit on the length of gene names for single-copy orthologous genes.

0.1.0 Changed how phmmer multiprocessing is handled so that searches run in parallel. Specifically, if the user sets the number of CPUs to 8, eight phmmer runs execute at the same time.