Tutorials
The following tutorials demonstrate common ClustKIT workflows using real protein sequence data.
Tutorial 1: Basic clustering
This tutorial walks through a basic clustering run at 50% identity.
# Cluster proteins at 50% identity using 8 threads
clustkit -i proteins.fasta -o results_50/ -t 0.5 --threads 8
This produces three output files:
results_50/clusters.tsv – Cluster assignments
results_50/representatives.fasta – Representative sequences
results_50/run_info.json – Run statistics
The clusters.tsv file is tab-separated with three columns:
sequence_id cluster_id is_representative
seq_001 0 True
seq_042 0 False
seq_007 1 True
...
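To inspect the result programmatically, the file can be read with a few lines of Python. The following is a minimal sketch, not part of ClustKIT: the function name summarize_clusters is hypothetical, and it assumes the file may start with the header row shown above and that the is_representative column holds the strings True and False.

```python
import csv
from collections import Counter


def summarize_clusters(lines):
    """Return (cluster sizes, representative per cluster) from clusters.tsv lines."""
    sizes = Counter()
    representatives = {}
    for row in csv.reader(lines, delimiter="\t"):
        # Skip blank lines and the header row, if one is present (assumption).
        if not row or row[0] == "sequence_id":
            continue
        seq_id, cluster_id, is_rep = row[:3]
        sizes[cluster_id] += 1
        if is_rep.strip().lower() == "true":
            representatives[cluster_id] = seq_id
    return sizes, representatives


# The three example rows shown above, written with explicit tab separators.
sample = [
    "sequence_id\tcluster_id\tis_representative",
    "seq_001\t0\tTrue",
    "seq_042\t0\tFalse",
    "seq_007\t1\tTrue",
]
sizes, representatives = summarize_clusters(sample)
print(dict(sizes))        # {'0': 2, '1': 1}
print(representatives)    # {'0': 'seq_001', '1': 'seq_007'}
```

For a real run, pass an open file handle instead of the sample list, for example open("results_50/clusters.tsv", newline="").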
Tutorial 2: Low-identity clustering
Clustering at low identity thresholds (30-50%) is where ClustKIT provides the greatest advantage over greedy methods like CD-HIT and Linclust. The Leiden community detection algorithm produces well-connected clusters even when sequence similarity is sparse.
# Cluster at 30% identity with accurate mode
clustkit -i proteins.fasta -o results_30/ -t 0.3 \
--clustering-mode accurate --threads 8
For maximum sensitivity at low thresholds, use --clustering-mode accurate.
This increases the sketch size and LSH sensitivity to find more candidate pairs.
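To see why a larger sketch can help, consider a toy MinHash example. This is an illustrative sketch only: it assumes a MinHash-style sketch over k-mers, which this tutorial does not confirm as ClustKIT's implementation. The two sequences, the k-mer size, and all function names are made up for the example.

```python
import hashlib


def kmers(seq, k=3):
    """Return the set of overlapping k-mers in a sequence."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def _hash(seed, item):
    """Deterministic 64-bit hash of an item under a given seed."""
    digest = hashlib.blake2b(f"{seed}:{item}".encode(), digest_size=8).digest()
    return int.from_bytes(digest, "big")


def minhash_sketch(items, sketch_size):
    """Keep the minimum hash value under each of sketch_size seeded hash functions."""
    return [min(_hash(seed, item) for item in items) for seed in range(sketch_size)]


def estimate_jaccard(sketch_a, sketch_b):
    """Estimate Jaccard similarity as the fraction of matching sketch positions."""
    matches = sum(a == b for a, b in zip(sketch_a, sketch_b))
    return matches / len(sketch_a)


# Two hypothetical, related protein fragments (not real data).
seq_a = "MKTAYIAKQRQISFVKSHFSRQLEERLGLIEVQ"
seq_b = "MKTAYLAKQRNISFVKAHFSRQLDERLGLIEVQ"
set_a, set_b = kmers(seq_a), kmers(seq_b)
true_jaccard = len(set_a & set_b) / len(set_a | set_b)

for size in (16, 128):
    est = estimate_jaccard(minhash_sketch(set_a, size), minhash_sketch(set_b, size))
    print(f"sketch size {size}: estimate {est:.3f}, exact {true_jaccard:.3f}")
```

With more hash functions the estimate tends to sit closer to the exact value, so fewer truly similar pairs fall below a candidate cutoff by chance. The cost is more hashing work per sequence.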
Tutorial 3: GPU-accelerated clustering
For large datasets, GPU acceleration can significantly reduce alignment time.
This requires the clustkit[gpu] installation.
# Install with GPU support
pip install clustkit[gpu]
# Cluster using a single GPU
clustkit -i large_proteins.fasta -o results_gpu/ -t 0.3 \
--device 0 --threads 8
The --device option accepts:
cpu – CPU only (default)
auto – Automatically benchmark CPU vs. GPU and pick the faster
0, 1, etc. – A specific GPU device ID
GPU acceleration provides the greatest speedup at low identity thresholds where the alignment phase dominates runtime. At t = 0.3, a single GPU provides ~1.3x speedup over 8 CPU threads, and dual GPUs provide ~2.7x speedup.
Tutorial 4: Comparing clustering methods
ClustKIT supports three clustering methods. This tutorial compares them on the same input.
# Leiden community detection (default)
clustkit -i proteins.fasta -o results_leiden/ -t 0.5 \
--cluster-method leiden --threads 8
# Connected components
clustkit -i proteins.fasta -o results_cc/ -t 0.5 \
--cluster-method connected --threads 8
# Greedy centroid-based
clustkit -i proteins.fasta -o results_greedy/ -t 0.5 \
--cluster-method greedy --threads 8
When to use each method:
Leiden (default): Best for low-to-medium identity thresholds (t <= 0.5). Optimizes a global modularity objective and avoids the chain-extension artifacts of greedy methods.
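Connected components: Useful when every pair of sequences above the threshold must end up in the same cluster. Fast, but may merge distinct families through transitive connections (A links to B and B links to C, so A and C are grouped even if they are dissimilar to each other).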
Greedy: Familiar centroid-based approach similar to CD-HIT. Works well at high identity thresholds (t >= 0.7) where clusters are dense.
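The transitive merging behavior of connected components can be illustrated on a toy similarity graph. The following is a generic union-find sketch, not ClustKIT's implementation. The node names and edges are hypothetical.

```python
def connected_components(nodes, edges):
    """Group nodes into connected components using union-find."""
    parent = {n: n for n in nodes}

    def find(x):
        # Follow parent links to the root, shortening the path as we go.
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    for a, b in edges:
        root_a, root_b = find(a), find(b)
        if root_a != root_b:
            parent[root_b] = root_a

    groups = {}
    for n in nodes:
        groups.setdefault(find(n), []).append(n)
    return sorted(groups.values())


# A-B and B-C are above threshold; A-C is not, so there is no A-C edge.
nodes = ["A", "B", "C", "D"]
edges = [("A", "B"), ("B", "C")]
components = connected_components(nodes, edges)
print(components)  # [['A', 'B', 'C'], ['D']]
```

A and C land in the same cluster only because both connect to B. This chaining effect is the one to watch for at low thresholds.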
Tutorial 5: CD-HIT-compatible output
If your downstream pipeline expects CD-HIT-format .clstr files, use the
--format cdhit option:
clustkit -i proteins.fasta -o results_cdhit/ -t 0.5 \
--format cdhit --threads 8
This produces output in the standard CD-HIT cluster format for compatibility with existing workflows.
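If you need to read a .clstr file back into Python, a small parser is usually enough. The following is a minimal sketch that assumes the conventional CD-HIT layout: a ">Cluster N" header line, followed by member lines that end in "*" for the representative or "at NN.NN%" for other members. The function name parse_clstr and the sample lengths and identities are hypothetical.

```python
import re

# Member line, e.g. "1<TAB>342aa, >seq_042... at 87.50%"
MEMBER = re.compile(r"^\d+\s+(\d+)(?:aa|nt), >(.+?)\.\.\. (\*|at .+)$")


def parse_clstr(lines):
    """Return {cluster_id: {"representative": id, "members": [ids]}}."""
    clusters = {}
    current = None
    for line in lines:
        line = line.rstrip("\n")
        if line.startswith(">Cluster"):
            current = line.split()[1]
            clusters[current] = {"representative": None, "members": []}
        elif current is not None:
            match = MEMBER.match(line)
            if match is None:
                continue  # Ignore lines that do not look like member entries.
            _length, seq_id, status = match.groups()
            clusters[current]["members"].append(seq_id)
            if status == "*":
                clusters[current]["representative"] = seq_id
    return clusters


# Hypothetical sample in CD-HIT cluster format.
sample = [
    ">Cluster 0",
    "0\t350aa, >seq_001... *",
    "1\t342aa, >seq_042... at 87.50%",
    ">Cluster 1",
    "0\t120aa, >seq_007... *",
]
parsed = parse_clstr(sample)
print(parsed["0"]["representative"])  # seq_001
print(parsed["0"]["members"])         # ['seq_001', 'seq_042']
```

For a real file, pass an open file handle to parse_clstr in place of the sample list.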
Tutorial 6: Choosing representative sequences
ClustKIT offers three strategies for selecting representative sequences from each cluster:
# Longest sequence (default)
clustkit -i proteins.fasta -o results_longest/ --representative longest
# Centroid (highest average similarity to cluster members)
clustkit -i proteins.fasta -o results_centroid/ --representative centroid
# Most connected (most edges in the similarity graph)
clustkit -i proteins.fasta -o results_most_connected/ --representative most_connected
The longest strategy is the default and matches the behavior of CD-HIT and
VSEARCH. The centroid strategy may be preferred when the representative
should be maximally similar to all cluster members. The most_connected
strategy selects the sequence with the most pairwise relationships above
threshold.
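The difference between the three strategies is easiest to see on a small example. The following is an illustrative sketch based only on the descriptions above, not ClustKIT's actual code. The function pick_representative, the lengths, the similarity values, and the 0.5 threshold are all hypothetical.

```python
def pick_representative(members, lengths, similarity, strategy, threshold=0.5):
    """Pick one representative from a cluster using the named strategy."""

    def sim(a, b):
        # Similarities are stored once per unordered pair.
        return similarity.get((a, b), similarity.get((b, a), 0.0))

    if strategy == "longest":
        return max(members, key=lambda m: lengths[m])

    if strategy == "centroid":
        def mean_similarity(m):
            others = [o for o in members if o != m]
            return sum(sim(m, o) for o in others) / len(others) if others else 0.0
        return max(members, key=mean_similarity)

    if strategy == "most_connected":
        def degree(m):
            return sum(1 for o in members if o != m and sim(m, o) >= threshold)
        return max(members, key=degree)

    raise ValueError(f"unknown strategy: {strategy}")


# Hypothetical four-member cluster.
members = ["a", "b", "c", "d"]
lengths = {"a": 400, "b": 300, "c": 310, "d": 290}
similarity = {
    ("a", "b"): 0.55, ("a", "c"): 0.40, ("a", "d"): 0.35,
    ("b", "c"): 0.70, ("b", "d"): 0.52, ("c", "d"): 0.95,
}

for strategy in ("longest", "centroid", "most_connected"):
    print(strategy, pick_representative(members, lengths, similarity, strategy))
# longest a
# centroid c
# most_connected b
```

In this toy cluster each strategy picks a different sequence: a is the longest, c has the highest mean similarity to the other members, and b has the most pairs at or above the threshold.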