Advanced usage

ClustKIT clusters protein sequences through a six-phase pipeline: MinHash sketching, LSH candidate generation, banded Smith-Waterman alignment, similarity graph construction, Leiden community detection, and representative selection. Many aspects of this pipeline can be tuned via command-line options.

General usage

clustkit -i <input.fasta> -o <output_dir> [options]

Identity threshold

The -t / --threshold option sets the sequence identity threshold (0.0-1.0). Pairs with identity below this threshold are excluded from the similarity graph.

# cluster at 30% identity (low-identity regime)
clustkit -i proteins.fasta -o output/ -t 0.3

# cluster at 90% identity (high-identity regime)
clustkit -i proteins.fasta -o output/ -t 0.9

Default: 0.9
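
As a rough illustration of what the threshold does to the graph, the sketch below drops scored pairs that fall below it. The pair tuples and the `filter_pairs` helper are hypothetical and are not part of ClustKIT.

```python
def filter_pairs(scored_pairs, threshold):
    """Keep only pairs whose identity is at or above the threshold."""
    return [(a, b, ident) for a, b, ident in scored_pairs if ident >= threshold]

# Hypothetical pairwise identities, for illustration only.
pairs = [("s1", "s2", 0.95), ("s1", "s3", 0.62), ("s2", "s4", 0.31)]

print(filter_pairs(pairs, 0.9))  # only the s1/s2 pair survives
print(filter_pairs(pairs, 0.3))  # all three pairs survive
```

Lowering the threshold keeps more edges, so the graph becomes denser and clusters tend to grow.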

Clustering mode

The --clustering-mode option provides threshold-aware presets that automatically configure sketch size, LSH sensitivity, and other internal parameters.

balanced – Good accuracy/speed trade-off (default)
accurate – Maximum sensitivity, slower
fast – Speed-optimized, lower sensitivity at low thresholds

clustkit -i proteins.fasta -o output/ -t 0.3 --clustering-mode accurate

Default: balanced

Clustering method

The --cluster-method option selects how the similarity graph is partitioned into clusters.

leiden – Leiden community detection (default). Optimizes a global modularity objective. Produces well-connected clusters, especially at low identity thresholds.
connected – Connected components. All sequences linked by a chain of above-threshold pairs end up in one cluster. Fast, but a single bridging sequence can merge distinct families.
greedy – Greedy centroid-based clustering. Processes sequences by descending degree; each unassigned high-degree node claims its unassigned neighbors.

# Leiden community detection (default, recommended)
clustkit -i proteins.fasta -o output/ -t 0.5 --cluster-method leiden

# Connected components
clustkit -i proteins.fasta -o output/ -t 0.7 --cluster-method connected

# Greedy centroid-based
clustkit -i proteins.fasta -o output/ -t 0.7 --cluster-method greedy

Default: leiden
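
The difference between connected and greedy can be seen on a toy graph. The following is a minimal sketch based only on the descriptions above, not ClustKIT's implementation; the function names and the tie-breaking rule (input order) are assumptions. Leiden is omitted because it needs a dedicated library.

```python
from collections import defaultdict

def build_adjacency(edges):
    adj = defaultdict(set)
    for a, b in edges:
        adj[a].add(b)
        adj[b].add(a)
    return adj

def connected_components(nodes, edges):
    """Every node reachable through a chain of edges joins the same cluster."""
    adj = build_adjacency(edges)
    seen, clusters = set(), []
    for start in nodes:
        if start in seen:
            continue
        stack, component = [start], set()
        seen.add(start)
        while stack:
            node = stack.pop()
            component.add(node)
            for nb in adj[node]:
                if nb not in seen:
                    seen.add(nb)
                    stack.append(nb)
        clusters.append(component)
    return clusters

def greedy_by_degree(nodes, edges):
    """Visit nodes by descending degree; each unassigned node claims its unassigned neighbors."""
    adj = build_adjacency(edges)
    assigned, clusters = set(), []
    # sorted() is stable, so ties keep the input order of `nodes` (an assumption).
    for node in sorted(nodes, key=lambda n: -len(adj[n])):
        if node in assigned:
            continue
        members = {node} | (adj[node] - assigned)
        assigned |= members
        clusters.append(members)
    return clusters

# A chain a-b-c-d-e: neighbors are similar, but the two ends are not linked directly.
nodes = ["a", "b", "c", "d", "e"]
edges = [("a", "b"), ("b", "c"), ("c", "d"), ("d", "e")]

print(connected_components(nodes, edges))  # one cluster of five
print(greedy_by_degree(nodes, edges))      # two clusters: {a, b, c} and {d, e}
```

On this chain, connected components merges everything into one cluster, while the greedy pass splits it because each center only claims its direct neighbors.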

Alignment method

The --alignment option controls how pairwise similarity is computed.

align – Banded Smith-Waterman alignment with BLOSUM62 scoring and affine gap penalties. Accurate, especially at low identity thresholds. This is the default.
kmer – K-mer overlap scoring. Faster but less accurate below ~50% identity.

# Smith-Waterman alignment (default, accurate)
clustkit -i proteins.fasta -o output/ -t 0.3 --alignment align

# K-mer scoring (fast)
clustkit -i proteins.fasta -o output/ -t 0.7 --alignment kmer

Default: align
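
To see why k-mer scoring loses accuracy as sequences diverge, consider one common form of k-mer overlap, the Jaccard index of the two k-mer sets. The exact score ClustKIT computes in kmer mode is not specified here, so the formula and the function names below are assumptions.

```python
def kmer_set(seq, k):
    """All distinct substrings of length k."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}

def kmer_jaccard(seq_a, seq_b, k=3):
    """Shared k-mers divided by all distinct k-mers in either sequence."""
    a, b = kmer_set(seq_a, k), kmer_set(seq_b, k)
    if not a and not b:
        return 0.0
    return len(a & b) / len(a | b)

# Two made-up peptides that differ at a single position (8 of 9 residues match).
print(kmer_jaccard("MKVLAAGIV", "MKVLSAGIV", k=3))  # 0.4
```

A single substitution changes every k-mer that spans it, so the overlap score falls much faster than alignment identity does.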

Threads

The --threads option controls the number of CPU threads used for alignment.

clustkit -i proteins.fasta -o output/ -t 0.5 --threads 8

Default: 1

GPU acceleration

The --device option enables GPU-accelerated Smith-Waterman alignment using CuPy. This requires the clustkit[gpu] installation.

cpu – CPU only (default)
auto – Benchmark a sample to pick the fastest device
0, 1, … – Specific GPU device ID

# Use first GPU
clustkit -i proteins.fasta -o output/ -t 0.3 --device 0

# Auto-detect fastest device
clustkit -i proteins.fasta -o output/ -t 0.3 --device auto

Default: cpu

LSH sensitivity

The --sensitivity option overrides the LSH sensitivity set by --clustering-mode. Higher sensitivity finds more candidate pairs but is slower.

low – Fewer candidates, faster
medium – Balanced
high – More candidates, slower

clustkit -i proteins.fasta -o output/ -t 0.3 --sensitivity high

Default: set by --clustering-mode
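
The trade-off can be illustrated with the standard banding scheme for MinHash LSH: a signature is cut into bands, and two sequences become a candidate pair if they agree on any whole band. Whether ClustKIT uses exactly this scheme, and how low, medium, and high map to band counts, is not stated here, so treat both as assumptions. The signatures below are made up.

```python
from collections import defaultdict
from itertools import combinations

def lsh_candidates(signatures, bands):
    """Return ID pairs whose signatures agree on at least one whole band."""
    sig_len = len(next(iter(signatures.values())))
    rows = sig_len // bands  # hash values per band
    candidates = set()
    for band in range(bands):
        buckets = defaultdict(list)
        for seq_id, sig in signatures.items():
            key = tuple(sig[band * rows:(band + 1) * rows])
            buckets[key].append(seq_id)
        for ids in buckets.values():
            candidates.update(combinations(sorted(ids), 2))
    return candidates

# Hypothetical 4-value signatures.
signatures = {
    "s1": [1, 2, 3, 4],
    "s2": [1, 2, 9, 9],
    "s3": [7, 2, 3, 8],
}

for bands in (1, 2, 4):
    print(bands, sorted(lsh_candidates(signatures, bands)))
```

With one band no pair matches, with two bands only s1/s2 does, and with four bands all three pairs become candidates. More, narrower bands mean more candidates to align.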

Sketch size

The --sketch-size option sets the number of hash values in each sequence's MinHash sketch. Larger sketches improve candidate recall but increase memory usage. This option overrides the value set by --clustering-mode.

clustkit -i proteins.fasta -o output/ --sketch-size 256

Default: set by --clustering-mode
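
For intuition about what the sketch size controls, here is a minimal MinHash sketch that keeps one minimum per seeded hash function. The choice of BLAKE2b, the seeding scheme, and the function names are assumptions for illustration; ClustKIT's own hashing is not described here.

```python
import hashlib

def kmer_set(seq, k):
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}

def minhash_sketch(seq, k=5, sketch_size=128):
    """One minimum hash per seed; the list length equals the sketch size."""
    kmers = kmer_set(seq, k)
    if not kmers:
        raise ValueError("sequence is shorter than k")
    sketch = []
    for seed in range(sketch_size):
        salt = seed.to_bytes(8, "little")  # a different salt acts as a different hash function
        sketch.append(min(
            hashlib.blake2b(kmer.encode(), digest_size=8, salt=salt).digest()
            for kmer in kmers
        ))
    return sketch

def estimate_jaccard(sketch_a, sketch_b):
    """Fraction of sketch positions where the two minima agree."""
    matches = sum(x == y for x, y in zip(sketch_a, sketch_b))
    return matches / len(sketch_a)

# Two made-up peptides differing at one position.
a = "MKVLAAGIVGLSAGAVLLK"
b = "MKVLAAGIVGLSTGAVLLK"
for size in (16, 256):
    print(size, estimate_jaccard(minhash_sketch(a, sketch_size=size),
                                 minhash_sketch(b, sketch_size=size)))
```

Each sketch position is one independent estimate of k-mer set overlap, so a larger sketch gives a less noisy estimate at the cost of storing more values per sequence.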

K-mer size

The -k / --kmer-size option sets the k-mer length for MinHash sketching.

clustkit -i proteins.fasta -o output/ -k 3

Default: 5

Representative selection

The --representative option controls how a representative sequence is chosen for each cluster.

longest – Longest sequence in the cluster (default)
centroid – Sequence with highest average similarity to other cluster members
most_connected – Sequence with the most edges in the similarity graph

clustkit -i proteins.fasta -o output/ --representative centroid

Default: longest
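
The three strategies can pick different sequences from the same cluster. The sketch below follows the descriptions above under stated assumptions: the data layout, the function name, and the choice to count missing edges as zero similarity are all hypothetical.

```python
def pick_representative(members, lengths, similarity, strategy="longest"):
    """members: list of IDs; lengths: ID -> sequence length;
    similarity: frozenset({a, b}) -> score for each edge inside the cluster."""
    def edge_scores(seq_id):
        return [score for pair, score in similarity.items() if seq_id in pair]

    if strategy == "longest":
        return max(members, key=lambda m: lengths[m])
    if strategy == "centroid":
        # Average over the other members; missing edges count as 0 (an assumption).
        others = max(len(members) - 1, 1)
        return max(members, key=lambda m: sum(edge_scores(m)) / others)
    if strategy == "most_connected":
        return max(members, key=lambda m: len(edge_scores(m)))
    raise ValueError(f"unknown strategy: {strategy}")

# Made-up cluster of four sequences.
members = ["a", "b", "c", "d"]
lengths = {"a": 300, "b": 250, "c": 280, "d": 200}
similarity = {
    frozenset({"a", "b"}): 0.4,
    frozenset({"b", "c"}): 0.5,
    frozenset({"b", "d"}): 0.4,
    frozenset({"c", "d"}): 0.95,
}

for strategy in ("longest", "centroid", "most_connected"):
    print(strategy, pick_representative(members, lengths, similarity, strategy))
```

Here longest picks a, centroid picks c, and most_connected picks b, which shows that the choice matters when downstream analysis uses only the representatives.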

Output format

The --format option controls the output format.

tsv – Tab-separated file with columns: sequence_id, cluster_id, is_representative (default)
cdhit – CD-HIT-style .clstr format for compatibility with existing pipelines

clustkit -i proteins.fasta -o output/ --format cdhit

Default: tsv
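
The tsv output can be regrouped by cluster with a few lines of Python. This is a sketch under assumptions: the column order is taken from the description above, but whether the file has a header row and how is_representative is encoded are not stated here, so the code tolerates an optional header and accepts a few common truthy spellings.

```python
import csv
import io
from collections import defaultdict

COLUMNS = ["sequence_id", "cluster_id", "is_representative"]
TRUTHY = {"1", "true", "yes"}  # assumption: the real encoding may differ

def read_clusters(handle):
    """Group sequence IDs by cluster and record each cluster's representative."""
    members, representatives = defaultdict(list), {}
    for row in csv.reader(handle, delimiter="\t"):
        if not row or row == COLUMNS:  # skip blank lines and an optional header
            continue
        seq_id, cluster_id, is_rep = row[:3]
        members[cluster_id].append(seq_id)
        if is_rep.strip().lower() in TRUTHY:
            representatives[cluster_id] = seq_id
    return dict(members), representatives

# Made-up file contents standing in for a real output file.
sample = io.StringIO(
    "sequence_id\tcluster_id\tis_representative\n"
    "s1\t0\t1\n"
    "s2\t0\t0\n"
    "s3\t1\t1\n"
)
members, representatives = read_clusters(sample)
print(members)
print(representatives)
```

For a real run, pass an open file handle for the TSV in the output directory instead of the in-memory sample.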

Plot output

The --plot flag generates a two-panel cluster size distribution figure saved as cluster_size_distribution.png in the output directory.

clustkit -i proteins.fasta -o output/ -t 0.5 --threads 8 --plot

[Figure: cluster_size_distribution.png, example cluster size distribution plot]

The left panel shows a histogram of cluster sizes (log-scaled x-axis) with the singleton count annotated. The right panel shows cumulative sequence coverage: the fraction of sequences in clusters of at least a given size. Together, these panels provide a quick sanity check of clustering granularity.
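
The quantities behind both panels are easy to compute from a list of cluster sizes. The sketch below is an illustration with a hypothetical helper name, not the code ClustKIT uses to draw the figure.

```python
from collections import Counter

def size_summary(cluster_sizes):
    """Return (histogram, singleton count, coverage) for a list of cluster sizes."""
    histogram = Counter(cluster_sizes)  # cluster size -> number of clusters
    total = sum(cluster_sizes)          # total number of sequences
    coverage, covered = {}, 0
    # Walk sizes from largest to smallest so `covered` counts sequences
    # in clusters of at least the current size.
    for size in sorted(histogram, reverse=True):
        covered += size * histogram[size]
        coverage[size] = covered / total
    return histogram, histogram[1], coverage

# Made-up clustering: three singletons, one pair, one cluster of five.
histogram, singletons, coverage = size_summary([1, 1, 1, 2, 5])
print(singletons)  # 3
print(coverage)    # {5: 0.5, 2: 0.7, 1: 1.0}
```

In this toy case half of the sequences sit in the single largest cluster and 30% are singletons, which is the kind of skew the two panels make visible at a glance.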

All options

Option	Description	Default
`-i`, `--input`	Input FASTA/FASTQ file	required
`-o`, `--output`	Output directory	required
`-t`, `--threshold`	Identity threshold (0.0-1.0)	0.9
`--threads`	Number of CPU threads	1
`--device`	`cpu`, `auto`, or GPU device ID (e.g., `0`)	`cpu`
`--cluster-method`	`leiden`, `connected`, or `greedy`	`leiden`
`--alignment`	`align` (SW, accurate) or `kmer` (fast)	`align`
`--clustering-mode`	`balanced`, `accurate`, or `fast`	`balanced`
`--sensitivity`	LSH sensitivity: `low`, `medium`, `high`	per mode
`--sketch-size`	MinHash sketch size	128
`-k`, `--kmer-size`	K-mer size for sketching	5
`--representative`	`longest`, `centroid`, or `most_connected`	`longest`
`--format`	Output format: `tsv` or `cdhit`	`tsv`
`--plot`	Generate cluster size distribution plot	off