Change log
Major changes to BioKIT are summarized here.
0.1.0: Functions that remove adapters from FASTQ files (trim_se_adapters_fastq and trim_pe_adapters_fastq) are now part of BioKIT. Note that these functions identify only exact matches of adapters.
Parsimony informative sites, constant sites, and variable sites functions now have a verbose option that allows users to examine the characterization of each site in an alignment.
The name gw-RSCU has been shortened to gRSCU.
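To illustrate what exact matching means for the adapter-trimming functions in 0.1.0, here is a minimal sketch. The helper name trim_exact_adapter and the decision to cut at the first match are assumptions made for illustration; they are not taken from BioKIT's source.

```python
def trim_exact_adapter(read: str, qual: str, adapter: str) -> tuple[str, str]:
    """Cut a read and its quality string at the first exact adapter match.

    Hypothetical helper: BioKIT's own trimming logic may differ.
    """
    idx = read.find(adapter)
    if idx == -1:
        # No exact match, so the read is returned untouched.
        return read, qual
    return read[:idx], qual[:idx]


adapter = "AGATCGGAAGAGC"
exact = "ACGTACGT" + adapter
one_mismatch = "ACGTACGT" + "AGATCGGTAGAGC"

print(trim_exact_adapter(exact, "I" * len(exact), adapter))
print(trim_exact_adapter(one_mismatch, "I" * len(one_mismatch), adapter))
```

In this sketch a single mismatch inside the adapter is enough to leave the read untrimmed, which is the practical consequence of exact matching.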
0.0.9: Functions that look at codons (e.g., RSCU and gw-RSCU) can now account for ambiguous codons, that is, codons that contain ambiguous characters, such as “CNN.” These codons are skipped during analysis of RSCU and gw-RSCU.
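A minimal sketch of skipping ambiguous codons while counting is shown here. The function name and the restriction to the DNA alphabet A, C, G, T are assumptions for illustration, not BioKIT's implementation.

```python
from collections import Counter

UNAMBIGUOUS = set("ACGT")  # assumption: DNA input, no U


def count_unambiguous_codons(cds: str) -> Counter:
    """Count codons in frame 1, skipping any codon with a non-ACGT character."""
    cds = cds.upper()
    usable = len(cds) - len(cds) % 3  # ignore a trailing partial codon
    counts = Counter()
    for i in range(0, usable, 3):
        codon = cds[i:i + 3]
        if set(codon) <= UNAMBIGUOUS:
            counts[codon] += 1
    return counts


# "CNN" is skipped; the two GCT codons are still counted.
print(count_unambiguous_codons("ATGCNNGCTGCTTAA"))
```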
1.0.1: Added “X” as a gap character during alignment recoding.
1.1.0: Added Dayhoff-9, -12, -15, and -18 recoding schemes.
1.1.3: Fixed bug in RSCU calculations.
1.1.4: Added structured output support (tsv/json/yaml) to additional commands and expanded integration coverage with output format consistency and CLI output contract tests.
1.1.5: Updated dependency pinning to fix installation failures and moved supported Python versions to 3.11, 3.12, and 3.13 only.
1.1.24: Added transition_transversion_ratio (alias: ti_tv) command
to calculate the transition/transversion ratio from a nucleotide
alignment. Counts pairwise substitutions across all alignment columns;
columns containing any gap or ? are skipped. With -v/--verbose,
prints per-site classification (transition, transversion, constant, gap)
where a site is labeled transversion if any pairwise transversion is
present.
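The counting scheme described for 1.1.24 can be sketched as follows. This is an illustrative reading of the description, not the command's source; the function name is hypothetical, and the sketch assumes that every non-gap character is A, C, G, or T.

```python
from itertools import combinations

PURINES = {"A", "G"}
PYRIMIDINES = {"C", "T"}


def ti_tv_ratio(alignment: list[str]) -> float:
    """Transition/transversion ratio from pairwise differences per column."""
    transitions = transversions = 0
    for column in zip(*alignment):
        column = [base.upper() for base in column]
        if any(base in "-?" for base in column):
            continue  # columns with any gap or ? are skipped
        for a, b in combinations(column, 2):
            if a == b:
                continue
            # Purine<->purine or pyrimidine<->pyrimidine is a transition.
            if {a, b} <= PURINES or {a, b} <= PYRIMIDINES:
                transitions += 1
            else:
                transversions += 1
    return transitions / transversions if transversions else float("nan")


# Columns 1 and 3 contribute two transitions each, column 2 contributes
# two transversions, and column 4 is skipped.
print(ti_tv_ratio(["AAC-", "GAC?", "ATTA"]))
```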
1.1.23: Added codon_adaptation_index (alias: cai) command to
compute Sharp & Li’s (1987) CAI for each coding sequence using a
user-supplied reference set of highly expressed CDS (e.g. ribosomal
proteins). Relative adaptiveness values (w_i) are derived from the
reference; each query gene is scored as the geometric mean of w_i
across its synonymous codons. CAI ranges from 0 to 1. A pseudocount
of 0.5 is added to reference codons with zero observations.
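A sketch of the CAI calculation described for 1.1.23 follows. It assumes the standard genetic code, excludes the single-codon amino acids (Met, Trp) and stop codons as in Sharp & Li's formulation, and uses hypothetical function names; BioKIT's implementation may differ in these details.

```python
import math
from collections import Counter, defaultdict

# Standard genetic code in the usual TCAG ordering.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    a + b + c: AMINO_ACIDS[16 * i + 4 * j + k]
    for i, a in enumerate(BASES)
    for j, b in enumerate(BASES)
    for k, c in enumerate(BASES)
}


def codons(cds: str) -> list[str]:
    cds = cds.upper()
    return [cds[i:i + 3] for i in range(0, len(cds) - len(cds) % 3, 3)]


def relative_adaptiveness(reference: list[str]) -> dict[str, float]:
    """w_i = count of codon i / count of the most used synonymous codon."""
    counts = Counter(c for seq in reference for c in codons(seq) if c in CODON_TABLE)
    families = defaultdict(list)
    for codon, aa in CODON_TABLE.items():
        if aa != "*":
            families[aa].append(codon)
    weights = {}
    for family in families.values():
        if len(family) == 1:
            continue  # Met and Trp have no synonymous alternatives
        # Pseudocount of 0.5 for codons never seen in the reference.
        adjusted = {c: counts[c] if counts[c] > 0 else 0.5 for c in family}
        top = max(adjusted.values())
        for c in family:
            weights[c] = adjusted[c] / top
    return weights


def cai(cds: str, weights: dict[str, float]) -> float:
    """Geometric mean of w_i over the scored codons of one gene."""
    logs = [math.log(weights[c]) for c in codons(cds) if c in weights]
    return math.exp(sum(logs) / len(logs)) if logs else float("nan")
```

Working in log space avoids underflow when multiplying many values between 0 and 1 for long genes.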
1.1.22: Added effective_number_of_codons (alias: enc) command to
compute Wright’s (1990) effective number of codons for each coding
sequence. ENC ranges from 20 (extreme bias) to 61 (no bias). Sequences
whose length is not divisible by 3 are skipped. Reuses the existing
translation table machinery (29 built-in tables plus custom). With
-v/--verbose adds per-gene GC3; with -p/--plot saves an ENC
vs. GC3 scatter plot with the expected curve under mutation-only bias as PNG.
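For reference, the expected curve under mutation-only bias is usually taken from Wright (1990) as ENC = 2 + s + 29 / (s^2 + (1 - s)^2), where s is GC3. The sketch below evaluates that formula; the function names are hypothetical, and whether the plot uses exactly this form is an assumption.

```python
def gc3(cds: str) -> float:
    """Fraction of G or C at third codon positions (trailing partial codon ignored)."""
    cds = cds.upper()
    usable = len(cds) - len(cds) % 3
    third = cds[2:usable:3]
    if not third:
        raise ValueError("sequence has no complete codon")
    return sum(base in "GC" for base in third) / len(third)


def expected_enc(s: float) -> float:
    """Wright's expected ENC for a gene whose codon bias is due to GC3 alone."""
    return 2 + s + 29 / (s ** 2 + (1 - s) ** 2)


# The curve peaks near s = 0.5 and falls toward either extreme of GC3.
for s in (0.0, 0.25, 0.5, 0.75, 1.0):
    print(s, round(expected_enc(s), 2))
```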
1.1.21: Added protein_properties (alias: prot_prop) command to
report per-sequence protein properties for a protein FASTA: length,
molecular weight, isoelectric point (pI), GRAVY (Grand Average of
Hydropathy), aromaticity, and instability index. Uses BioPython’s
ProtParam for the computations. Stop characters and gaps are stripped
before analysis.
1.1.20: Added dinucleotide_odds (alias: dno) command to compute
observed/expected (O/E) ratios for all 16 dinucleotides in a FASTA file.
CpG O/E is widely used in methylation studies and viral genomics, where CpG
suppression is a hallmark of vertebrate-adapted viruses. With
-v/--verbose, reports per-sequence ratios instead of aggregated ratios.
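One common definition of the O/E ratio divides the observed dinucleotide frequency by the product of the two mononucleotide frequencies. The sketch below uses that definition; the function name and the exact normalization are assumptions, and the command may normalize differently.

```python
from collections import Counter
from itertools import product


def dinucleotide_odds(seq: str) -> dict[str, float]:
    """Observed/expected ratio for all 16 dinucleotides of one sequence."""
    seq = seq.upper()
    if len(seq) < 2:
        raise ValueError("need at least two bases")
    mono = Counter(seq)
    di = Counter(seq[i:i + 2] for i in range(len(seq) - 1))
    n_mono, n_di = len(seq), len(seq) - 1
    ratios = {}
    for x, y in product("ACGT", repeat=2):
        expected = (mono[x] / n_mono) * (mono[y] / n_mono)
        observed = di[x + y] / n_di
        # A ratio is undefined when one of the two bases never occurs.
        ratios[x + y] = observed / expected if expected > 0 else float("nan")
    return ratios


ratios = dinucleotide_odds("ACGTACGTTTAA")
print(round(ratios["CG"], 3))
```

Values well below 1 for CG would indicate CpG depletion under this definition.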
1.1.19: Added neutrality_plot command for GC12 vs. GC3 regression
analysis of coding sequences. The regression slope is a classical codon
usage diagnostic: a slope near 1 indicates that codon usage is largely
mutation-driven, while a slope near 0 indicates that GC12 is constrained
by selection while GC3 drifts. Reports slope, intercept, r-squared, and
n; with -v/--verbose includes per-gene GC12 and GC3 values; with
-p/--plot saves a scatter plot with the regression line as PNG.
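The regression behind a neutrality plot is an ordinary least-squares fit of GC12 on GC3. A self-contained sketch, with hypothetical function names and no claim to match the command's internals:

```python
def gc12_gc3(cds: str) -> tuple[float, float]:
    """GC fraction at codon positions 1 and 2 combined, and at position 3."""
    cds = cds.upper()
    usable = len(cds) - len(cds) % 3
    if usable == 0:
        raise ValueError("sequence has no complete codon")
    first_two = [cds[i] for i in range(usable) if i % 3 != 2]
    third = cds[2:usable:3]
    gc12 = sum(b in "GC" for b in first_two) / len(first_two)
    gc3 = sum(b in "GC" for b in third) / len(third)
    return gc12, gc3


def neutrality_regression(points: list[tuple[float, float]]):
    """Least-squares fit of GC12 (y) on GC3 (x); points are (gc3, gc12) pairs."""
    n = len(points)
    mean_x = sum(x for x, _ in points) / n
    mean_y = sum(y for _, y in points) / n
    sxx = sum((x - mean_x) ** 2 for x, _ in points)
    syy = sum((y - mean_y) ** 2 for _, y in points)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in points)
    if sxx == 0:
        raise ValueError("GC3 does not vary across genes")
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    r_squared = sxy ** 2 / (sxx * syy) if syy > 0 else float("nan")
    return slope, intercept, r_squared, n


print(neutrality_regression([(0.2, 0.3), (0.4, 0.4), (0.6, 0.5)]))
```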
1.1.18: Added sample_sequences (alias: sample_seqs) command to
randomly draw N sequences (-n/--number) or a percentage
(-p/--percent, default 10%) from a FASTA file without replacement.
Supports reproducible sampling via -s/--seed and optional output
file via -o/--output.
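Seeded sampling without replacement can be sketched with the standard library as follows. The function name, the record representation as (header, sequence) tuples, and the rounding of the percentage are assumptions for illustration.

```python
import random


def sample_records(records, number=None, percent=10.0, seed=None):
    """Draw records without replacement; the same seed gives the same draw."""
    rng = random.Random(seed)  # local generator, global state is untouched
    if number is None:
        number = round(len(records) * percent / 100)
    number = min(number, len(records))
    return rng.sample(records, number)


records = [(f"seq{i}", "ACGT") for i in range(20)]
print(sample_records(records, seed=42))
```

Using a dedicated random.Random instance keeps the draw reproducible even if other code also uses the random module.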
1.1.17: Added find_orfs (alias: orfs) command to identify all open
reading frames above a minimum length in nucleotide sequences, searching
all 6 reading frames (3 forward, 3 reverse). Reports id, frame, start,
stop, and length in nucleotides and amino acids. With --extract,
outputs the ORF sequences as FASTA; with --protein, outputs protein
translations instead of nucleotides. Supports custom genetic codes via
-tt/--translation_table.
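A six-frame ORF scan can be sketched as below. Several choices here are assumptions and not taken from the command: an ORF is taken to run from an ATG to the first in-frame stop of the standard code, coordinates are 1-based on the strand being scanned, and the function name and frame labels are hypothetical.

```python
STOPS = {"TAA", "TAG", "TGA"}  # assumption: standard genetic code
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq: str) -> str:
    return seq.translate(COMPLEMENT)[::-1]


def find_orfs(seq: str, min_length: int = 30):
    """Return (frame, start, stop, length_nt) for ORFs in all six frames."""
    seq = seq.upper()
    orfs = []
    for strand, template in (("+", seq), ("-", reverse_complement(seq))):
        for offset in range(3):
            start = None
            for i in range(offset, len(template) - 2, 3):
                codon = template[i:i + 3]
                if start is None and codon == "ATG":
                    start = i  # first ATG opens the ORF
                elif start is not None and codon in STOPS:
                    length = i + 3 - start  # length includes the stop codon
                    if length >= min_length:
                        orfs.append((f"{strand}{offset + 1}", start + 1, i + 3, length))
                    start = None
    return orfs


print(find_orfs("CCATGAAATTTTAGCC", min_length=9))
```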
1.1.16: Added kmer_frequency (alias: kmer_freq) command to count
all k-mers of a given size in sequences. Useful for genomic signatures,
metagenomics binning prep, and alignment-free sequence comparison.
Supports --canonical to collapse each k-mer with its reverse complement
and -v/--verbose for per-sequence counts. K-mers containing non-ACGT
characters are skipped.
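A sketch of k-mer counting with an optional canonical mode follows. Choosing the lexicographically smaller of a k-mer and its reverse complement as the canonical form is a common convention and an assumption here; the function name is hypothetical.

```python
from collections import Counter

COMPLEMENT = str.maketrans("ACGT", "TGCA")
VALID = set("ACGT")


def count_kmers(seq: str, k: int, canonical: bool = False) -> Counter:
    """Count k-mers, skipping any k-mer with a non-ACGT character."""
    seq = seq.upper()
    counts = Counter()
    for i in range(len(seq) - k + 1):
        kmer = seq[i:i + k]
        if not set(kmer) <= VALID:
            continue
        if canonical:
            # Collapse a k-mer and its reverse complement onto one key.
            kmer = min(kmer, kmer.translate(COMPLEMENT)[::-1])
        counts[kmer] += 1
    return counts


print(count_kmers("ACGTN", 2))
print(count_kmers("ACGTN", 2, canonical=True))
```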
1.1.15: Added homopolymer_runs (alias: homopolymer) command to
find the longest homopolymer run per sequence in a FASTA file. Reports
length, base, and 1-based start position. Relevant for nanopore/PacBio
QC where homopolymer errors are the dominant error mode. Use
--per-base to report the longest run for each of A, C, G, T.
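Finding the longest run, its base, and its 1-based start can be sketched with itertools.groupby. The function name and the tie-breaking rule (first longest run wins) are assumptions.

```python
from itertools import groupby


def longest_homopolymer(seq: str) -> tuple[int, str, int]:
    """Return (length, base, 1-based start) of the longest homopolymer run."""
    best = (0, "", 0)
    position = 1
    for base, run in groupby(seq.upper()):
        length = len(list(run))
        if length > best[0]:  # strict comparison keeps the first of equal runs
            best = (length, base, position)
        position += length
    return best


print(longest_homopolymer("AACCCCGTTT"))
```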
1.1.14: Added fastq_to_fasta (alias: fq2fa) command to convert
FASTQ files to FASTA by stripping quality scores. Supports stdin input
via - and optional output file via -o/--output.
1.1.13: Added genbank_to_fasta (alias: gb2fa) command to extract
sequences from GenBank flat files. Optionally filters by feature type
(e.g., CDS, rRNA, tRNA, gene) via -t/--feature_type and can output
protein translations for CDS features via --translate.
1.1.12: Added assembly_curve (alias: asm_curve) command to output
cumulative length vs. contig rank (sorted by descending length). Supports
optional plotting via -p/--plot to generate a PNG assembly curve figure.
1.1.11: Added melting_temperature (alias: tm) command to calculate melting
temperature of nucleotide sequences using nearest-neighbor thermodynamics. Supports
custom Na+ and oligo concentrations.
1.1.10: Added shuffle_sequences (alias: shuffle_seqs) command to randomly
shuffle nucleotide or amino acid order within each sequence while preserving
composition. Supports reproducible shuffling via -s/--seed.
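Composition-preserving shuffling amounts to permuting the characters of each sequence. A minimal seeded sketch, with a hypothetical function name:

```python
import random
from collections import Counter


def shuffle_sequence(seq: str, rng: random.Random) -> str:
    """Permute the characters of one sequence; composition is unchanged."""
    chars = list(seq)
    rng.shuffle(chars)
    return "".join(chars)


rng = random.Random(42)  # a fixed seed makes the shuffle reproducible
original = "ATGGCCATTGTAATGGGCCGC"
shuffled = shuffle_sequence(original, rng)
print(shuffled, Counter(shuffled) == Counter(original))
```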
1.1.9: Added restriction_sites (alias: re_sites) command to find restriction
enzyme recognition sites in sequences. Reports cut positions and fragment sizes
for one or more enzymes per sequence. Uses BioPython’s Restriction module.
1.1.8: Added gc_content_four_fold_degenerate_sites (alias: gc4) command to
calculate GC content at four-fold degenerate sites in coding sequences. Supports
custom translation tables via -tt and per-sequence output via -v.
1.1.7: Added protein_charge (alias: prot_charge) command to calculate the
net charge of protein sequences at a given pH (default 7.0). Uses BioPython’s
ProteinAnalysis for the calculation. Supports tsv, json, and yaml output.
1.1.6: Added fasta_deduplication (alias: dedup) command to remove duplicate
sequences from multi-FASTA files. Sequences are compared case-insensitively
and the first occurrence of each unique sequence is kept.
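The keep-first, case-insensitive rule can be sketched in a few lines. The function name and the (header, sequence) record representation are assumptions for illustration.

```python
def deduplicate(records):
    """Keep the first record of each unique sequence, ignoring case."""
    seen = set()
    kept = []
    for header, seq in records:
        key = seq.upper()  # case-insensitive comparison
        if key not in seen:
            seen.add(key)
            kept.append((header, seq))  # original casing is preserved
    return kept


records = [("a", "ACGT"), ("b", "acgt"), ("c", "TTTT"), ("d", "ACGT")]
print(deduplicate(records))
```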