Usage

BioKIT helps process and analyze data for bioinformatics research.

Generally, all functions are designed to help understand the contents of various files including those representing alignments, coding sequences, fastq files, and genomes.

General usage

Calling functions

biokit <command> [optional command arguments]

Command-specific help messages can be viewed by adding a -h/--help argument after the command. For example, to see the help message for the command ‘alignment_summary’, execute:

biokit alignment_summary -h
# or
biokit alignment_summary --help

Function aliases

Each function comes with aliases to save the user some keystrokes. For example, to get the help message for the ‘alignment_summary’ function, you can type:

biokit aln_summary -h

Command line interfaces

All functions (including aliases) can also be executed using a command-line interface that starts with bk_. For example, instead of typing the previous command to get the help message of the alignment_summary function, you can type:

bk_alignment_summary -h
# or
bk_aln_summary -h

All possible function names are specified at the top of each function section.

Alignment-based functions

Alignment length

Function names: alignment_length; aln_len
Command line interface: bk_alignment_length; bk_aln_len

Calculate the length of an alignment.

biokit alignment_length <fasta>

Options:
<fasta>: first argument after function name should be an alignment file

Alignment recoding

Function names: alignment_recoding; aln_recoding; recode
Command line interface: bk_alignment_recoding; bk_aln_recoding; bk_recode

Recode alignments using reduced character states.

Alignments can be recoded using established or custom recoding schemes. Recoding schemes are specified using the -c/--code argument. Custom recoding schemes should be formatted as a two-column file wherein the first column is the recoded character and the second column is the character in the alignment.

biokit alignment_recoding <fasta> [-c/--code <code>]

Codes for which recoding scheme to use:
RY-nucleotide
R = purines (i.e., A and G)
Y = pyrimidines (i.e., T and C)

SandR-6
0 = A, P, S, and T
1 = D, E, N, and G
2 = Q, K, and R
3 = M, I, V, and L
4 = W and C
5 = F, Y, and H

KGB-6
0 = A, G, P, and S
1 = D, E, N, Q, H, K, R, and T
2 = M, I, and L
3 = W
4 = F and Y
5 = C and V

Dayhoff-6
0 = A, G, P, S, and T
1 = D, E, N, and Q
2 = H, K, and R
3 = I, L, M, and V
4 = F, W, and Y
5 = C

Dayhoff-9
0 = D, E, H, N, and Q
1 = I, L, M, and V
2 = F and Y
3 = A, S, and T
4 = K and R
5 = G
6 = P
7 = C
8 = W

Dayhoff-12
0 = D, E, and Q
1 = M, L, I, and V
2 = F and Y
3 = K, H, and R
4 = G
5 = A
6 = P
7 = S
8 = T
9 = N
A = W
B = C

Dayhoff-15
0 = D, E, and Q
1 = M and L
2 = I and V
3 = F and Y
4 = G
5 = A
6 = P
7 = S
8 = T
9 = N
A = K
B = H
C = R
D = W
E = C

Dayhoff-18
0 = F and Y
1 = M and L
2 = I
3 = V
4 = G
5 = A
6 = P
7 = S
8 = T
9 = D
A = E
B = Q
C = N
D = K
E = H
F = R
G = W
H = C

Options:
<fasta>: first argument after function name should be an alignment file
-c/--code: argument to specify the recoding scheme to use
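
As a rough illustration of what recoding does, the sketch below applies the RY-nucleotide and Dayhoff-6 schemes listed above to a sequence string. The names `build_lookup` and `recode` are hypothetical and this is not BioKIT's implementation; leaving characters that are absent from the scheme (such as gaps) unchanged is an assumption.

```python
# Illustrative sketch only; not BioKIT's implementation.
# Schemes are written as {recoded character: original characters},
# copied from the RY-nucleotide and Dayhoff-6 tables above.
RY_NUCLEOTIDE = {"R": "AG", "Y": "TC"}
DAYHOFF_6 = {
    "0": "AGPST",
    "1": "DENQ",
    "2": "HKR",
    "3": "ILMV",
    "4": "FWY",
    "5": "C",
}


def build_lookup(scheme):
    """Invert {recoded: originals} into {original: recoded}."""
    return {orig: new for new, origs in scheme.items() for orig in origs}


def recode(sequence, scheme):
    """Recode a sequence string.

    Assumption: characters absent from the scheme (e.g., gaps) are
    passed through unchanged.
    """
    lookup = build_lookup(scheme)
    return "".join(lookup.get(ch, ch) for ch in sequence.upper())


print(recode("ACGT-", RY_NUCLEOTIDE))
print(recode("MKWC", DAYHOFF_6))
```

A custom scheme in the two-column format described above could be read into the same dictionary shape before calling `recode`.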

Alignment summary

Function names: alignment_summary; aln_summary
Command line interface: bk_alignment_summary; bk_aln_summary

Calculate summary statistics for an alignment. Reported statistics include alignment length, number of taxa, number of parsimony informative sites, number of variable sites, number of constant sites, and the frequency of each character (including gaps, which are considered to be ‘-’ or ‘?’).

biokit alignment_summary <fasta>

Options:
<fasta>: first argument after function name should be an alignment file

Consensus sequence

Function names: consensus_sequence; con_seq
Command line interface: bk_consensus_sequence; bk_con_seq

Generates a consensus sequence from a multiple sequence alignment file in FASTA format.
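
One way to picture a threshold-based consensus is sketched below. The function name `consensus`, the default threshold of 0.7, and the choice to ignore gap characters when counting are all assumptions made for illustration; BioKIT's actual behavior may differ.

```python
from collections import Counter

# Illustrative sketch only; not BioKIT's implementation.


def consensus(seqs, threshold=0.7, ambiguous="N"):
    """Build a consensus from equal-length aligned sequences.

    For each column, the most common residue is emitted if its frequency
    among non-gap characters is at least `threshold`; otherwise the
    ambiguity character is emitted. The 0.7 default is hypothetical.
    """
    out = []
    for column in zip(*seqs):
        residues = [ch for ch in column if ch not in "-?"]
        if not residues:
            # Column made up entirely of gaps (assumed to be ambiguous).
            out.append(ambiguous)
            continue
        residue, count = Counter(residues).most_common(1)[0]
        out.append(residue if count / len(residues) >= threshold else ambiguous)
    return "".join(out)


print(consensus(["ACGT", "ACGA", "ACTA", "AGTA"]))
```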

biokit consensus_sequence <fasta> -t/--threshold <threshold> -ac/--ambiguous_character <ambiguous character>

Options:
<fasta>: first argument after function name should be an alignment fasta file
-t/--threshold: threshold for how common a residue must be to be represented
-ac/--ambiguous_character: the ambiguity character to use. Default is ‘N’

Constant sites

Function names: constant_sites; con_sites
Command line interface: bk_constant_sites; bk_con_sites

Calculate the number of constant sites in an alignment.

A constant site is a site in an alignment with the same nucleotide or amino acid (excluding gaps) among all taxa.
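
The definition above can be sketched in a few lines of Python. The function name `count_constant_sites` is hypothetical, gaps are taken to be ‘-’ or ‘?’, and treating a column made up entirely of gaps as not constant is an assumption.

```python
# Illustrative sketch only; not BioKIT's implementation.


def count_constant_sites(seqs, gap_chars="-?"):
    """Count alignment columns with exactly one non-gap character state."""
    constant = 0
    for column in zip(*seqs):
        states = {ch.upper() for ch in column if ch not in gap_chars}
        # Assumption: an all-gap column has zero states and is not counted.
        if len(states) == 1:
            constant += 1
    return constant


print(count_constant_sites(["ACG-", "ACT-", "A-GT"]))
```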

biokit constant_sites <fasta> [-v/--verbose]

Options:
<fasta>: first argument after function name should be an alignment file
-v/--verbose: optional argument to print site-by-site categorization

Parsimony informative sites

Function names: parsimony_informative_sites; pi_sites; pis
Command line interface: bk_parsimony_informative_sites; bk_pi_sites; bk_pis

Calculate the number of parsimony informative sites in an alignment.

A parsimony informative site is a site in an alignment with at least two nucleotides or amino acids that each occur at least twice.
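
A minimal sketch of this definition follows. The function names are hypothetical, and ignoring gap characters (‘-’ or ‘?’) when counting is an assumption.

```python
from collections import Counter

# Illustrative sketch only; not BioKIT's implementation.


def is_parsimony_informative(column, gap_chars="-?"):
    """True if at least two character states each occur at least twice."""
    counts = Counter(ch.upper() for ch in column if ch not in gap_chars)
    states_seen_twice = sum(1 for n in counts.values() if n >= 2)
    return states_seen_twice >= 2


def count_pi_sites(seqs):
    """Count parsimony informative columns in equal-length sequences."""
    return sum(is_parsimony_informative(col) for col in zip(*seqs))


print(count_pi_sites(["AAG", "AAG", "TAC", "TCA"]))
```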

biokit parsimony_informative_sites <fasta> [-v/--verbose]

Options:
<fasta>: first argument after function name should be an alignment file
-v/--verbose: optional argument to print site-by-site categorization

Position specific score matrix

Function names: position_specific_score_matrix; pssm
Command line interface: bk_position_specific_score_matrix; bk_pssm

Generates a position specific score matrix for an alignment.

biokit position_specific_score_matrix <fasta> [-ac/--ambiguous_character <ambiguous character>]

Options:
<fasta>: first argument after function name should be an alignment fasta file
-ac/--ambiguous_character: the ambiguity character to use. Default is ‘N’

Variable sites

Function names: variable_sites; var_sites; vs
Command line interface: bk_variable_sites; bk_var_sites; bk_vs

Calculate the number of variable sites in an alignment.

A variable site is a site in an alignment with at least two different nucleotide or amino acid characters among all taxa.

biokit variable_sites <fasta> [-v/--verbose]

Options:
<fasta>: first argument after function name should be an alignment file
-v/--verbose: optional argument to print site-by-site categorization

Coding sequence-based functions

GC content first codon position

Function names: gc_content_first_position; gc1
Command line interface: bk_gc_content_first_position; bk_gc1

Calculate GC content of the first codon position. The input must be the coding sequence of a gene or genes. All genes are assumed to have sequence lengths divisible by three.

biokit gc_content_first_position <fasta> [-v/--verbose]

Options:
<fasta>: first argument after function name should be a fasta file
-v/--verbose: optional argument to print the GC content of each fasta entry
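
The calculation for a given codon position can be sketched as below. The function name `gc_content_at_codon_position` is hypothetical, and the sketch relies on the same assumption stated above: the sequence is in frame and its length is divisible by three.

```python
# Illustrative sketch only; not BioKIT's implementation.


def gc_content_at_codon_position(cds, position):
    """Fraction of G or C bases at codon position 1, 2, or 3."""
    if position not in (1, 2, 3):
        raise ValueError("position must be 1, 2, or 3")
    # Take every third base, starting at the requested codon position.
    bases = cds.upper()[position - 1::3]
    if not bases:
        return 0.0
    return sum(1 for base in bases if base in "GC") / len(bases)


# Codons: ATG GCC TAA
for pos in (1, 2, 3):
    print(pos, gc_content_at_codon_position("ATGGCCTAA", pos))
```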

GC content second codon position

Function names: gc_content_second_position; gc2
Command line interface: bk_gc_content_second_position; bk_gc2

Calculate GC content of the second codon position. The input must be the coding sequence of a gene or genes. All genes are assumed to have sequence lengths divisible by three.

biokit gc_content_second_position <fasta> [-v/--verbose]

Options:
<fasta>: first argument after function name should be a fasta file
-v/--verbose: optional argument to print the GC content of each fasta entry

GC content third codon position

Function names: gc_content_third_position; gc3
Command line interface: bk_gc_content_third_position; bk_gc3

Calculate GC content of the third codon position. The input must be the coding sequence of a gene or genes. All genes are assumed to have sequence lengths divisible by three.

biokit gc_content_third_position <fasta> [-v/--verbose]

Options:
<fasta>: first argument after function name should be a fasta file
-v/--verbose: optional argument to print the GC content of each fasta entry

Gene-wise relative synonymous codon usage (gRSCU)

Function names: gene_wise_relative_synonymous_codon_usage; gene_wise_rscu; gw_rscu; grscu
Command line interface: bk_gene_wise_relative_synonymous_codon_usage; bk_gene_wise_rscu; bk_gw_rscu; bk_grscu

Calculate gene-wise relative synonymous codon usage (gRSCU).

Codon usage bias is the unequal use of synonymous codons. We adapted RSCU, which is normally reported per codon, to be applied to individual genes. Specifically, gRSCU is the mean (or median) RSCU value of the codons observed in a particular gene. This provides insight into how codon usage bias influences codon usage for a particular gene. This function also outputs the standard deviation of RSCU values for a given gene.

The output is col 1: the gene identifier; col 2: the gRSCU based on the mean RSCU value observed in a gene; col 3: the gRSCU based on the median RSCU value observed in a gene; and col 4: the standard deviation of RSCU values observed in a gene.

Custom genetic codes can be used as input and should be formatted with the codon in first column and the resulting amino acid in the second column.

biokit gene_wise_relative_synonymous_codon_usage <fasta> [-tt/--translation_table <code>]

Options:
<fasta>: first argument after function name should be a fasta file
-tt/--translation_table: optional argument of the code for the translation table to be used. Default: 1, which is the standard code.
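
To make the gene-wise summary concrete, the sketch below takes per-codon RSCU values that have already been computed and summarizes them for one gene. The function name `gene_wise_rscu`, the example RSCU values, the choice to skip codons without an RSCU value, and the use of the sample standard deviation are all assumptions for illustration.

```python
import statistics

# Illustrative sketch only; not BioKIT's implementation.


def gene_wise_rscu(gene, rscu):
    """Return (mean, median, standard deviation) of RSCU values in a gene.

    `rscu` maps codons to precomputed RSCU values. Codons missing from
    the map are skipped (an assumption). At least two codons with values
    are needed for the sample standard deviation.
    """
    gene = gene.upper()
    usable = len(gene) - len(gene) % 3
    codons = [gene[i:i + 3] for i in range(0, usable, 3)]
    values = [rscu[codon] for codon in codons if codon in rscu]
    return (
        statistics.mean(values),
        statistics.median(values),
        statistics.stdev(values),
    )


# Hypothetical RSCU values, for illustration only.
example_rscu = {"GCT": 1.5, "GCC": 0.5, "AAA": 1.0}
print(gene_wise_rscu("GCTGCCAAAGCT", example_rscu))
```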

Relative synonymous codon usage

Function names: relative_synonymous_codon_usage; rscu
Command line interface: bk_relative_synonymous_codon_usage; bk_rscu

Calculate relative synonymous codon usage.

Relative synonymous codon usage is the ratio of the observed frequency of codons over the expected frequency given that all the synonymous codons for the same amino acids are used equally.

Custom genetic codes can be used as input and should be formatted with the codon in first column and the resulting amino acid in the second column.

biokit relative_synonymous_codon_usage <fasta> [-tt/--translation_table <code>]

Options:
<fasta>: first argument after function name should be a fasta file
-tt/--translation_table: optional argument of the code for the translation table to be used. Default: 1, which is the standard code.
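
The ratio described above can be sketched as follows. The function name `rscu` and the small `EXAMPLE_CODE` table (alanine and lysine codons only) are hypothetical; the table mirrors the codon-to-amino-acid layout of a custom genetic code but is deliberately incomplete.

```python
from collections import Counter, defaultdict

# Illustrative sketch only; not BioKIT's implementation.

# Partial codon table used only for this example (Ala and Lys codons).
EXAMPLE_CODE = {
    "GCT": "A", "GCC": "A", "GCA": "A", "GCG": "A",
    "AAA": "K", "AAG": "K",
}


def rscu(sequences, codon_to_aa):
    """Compute RSCU = observed codon count / expected count.

    The expected count assumes all synonymous codons for an amino acid
    are used equally. Amino acids that never occur are omitted.
    """
    counts = Counter()
    for seq in sequences:
        seq = seq.upper()
        for i in range(0, len(seq) - len(seq) % 3, 3):
            codon = seq[i:i + 3]
            if codon in codon_to_aa:
                counts[codon] += 1

    synonyms = defaultdict(list)
    for codon, aa in codon_to_aa.items():
        synonyms[aa].append(codon)

    result = {}
    for codons in synonyms.values():
        total = sum(counts[c] for c in codons)
        if total == 0:
            continue
        expected = total / len(codons)
        for c in codons:
            result[c] = counts[c] / expected
    return result


print(rscu(["GCTGCTGCCAAA"], EXAMPLE_CODE))
```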

Translate sequence

Function names: translate_sequence; translate_seq; trans_seq
Command line interface: bk_translate_sequence; bk_translate_seq; bk_trans_seq

Translates coding sequences to amino acid sequences. Sequences can be translated using diverse genetic codes. For codons that can encode two amino acids (e.g., TAG encodes Glu or STOP in the Blastocrithidia Nuclear Code), the standard genetic code is used.

Custom genetic codes can be used as input and should be formatted with the codon in first column and the resulting amino acid in the second column.

biokit translate_sequence <fasta> [-tt/--translation_table <code> -o/--output <output_file>]

Options:
<fasta>: first argument after function name should be a fasta file
-tt/--translation_table: optional argument of the code for the translation table to be used. Default: 1, which is the standard code.
-o/--output: optional argument to write the translated fasta file to. Default output has the same name as the input file with the suffix “.translated.fa” added to it.
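
For readers who want to see the mechanics, here is a minimal sketch of translation with the standard genetic code (translation table 1). The function name `translate`, the use of ‘*’ for stop codons, and ‘X’ for codons missing from the table are assumptions; a custom code could be passed as a codon-to-amino-acid dictionary.

```python
from itertools import product

# Illustrative sketch only; not BioKIT's implementation.

# Standard genetic code in the compact TCAG ordering:
# TTT, TTC, TTA, TTG, TCT, ... GGG.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
STANDARD_CODE = {
    "".join(codon): aa
    for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}


def translate(cds, table=STANDARD_CODE):
    """Translate an in-frame coding sequence codon by codon.

    Trailing bases that do not form a full codon are ignored, and codons
    not found in the table are written as 'X' (both are assumptions).
    """
    cds = cds.upper()
    usable = len(cds) - len(cds) % 3
    return "".join(table.get(cds[i:i + 3], "X") for i in range(0, usable, 3))


print(translate("ATGGCCTAA"))
```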

FASTQ file functions

FASTQ read lengths

Function names: fastq_read_lengths; fastq_read_lens
Command line interface: bk_fastq_read_lengths; bk_fastq_read_lens

Determine lengths of FASTQ reads.

Using default arguments, the average and standard deviation of read lengths in a FASTQ file will be reported. To obtain the lengths of all FASTQ reads, use the verbose option.

biokit fastq_read_lengths <fastq> [-v/--verbose]

Options:
<fastq>: first argument after function name should be a FASTQ file
-v/--verbose: print length of each FASTQ read
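
The summary can be sketched as below, assuming the common four-lines-per-record FASTQ layout. The function names are hypothetical, and whether BioKIT reports the sample or population standard deviation is not stated here; the sketch uses the sample standard deviation.

```python
import statistics

# Illustrative sketch only; not BioKIT's implementation.


def fastq_read_lengths(lines):
    """Return read lengths from FASTQ lines (four lines per record assumed)."""
    lengths = []
    for index, line in enumerate(lines):
        # The sequence is the second line of each four-line record.
        if index % 4 == 1:
            lengths.append(len(line.strip()))
    return lengths


def summarize(lengths):
    """Return (mean, sample standard deviation) of read lengths."""
    return statistics.mean(lengths), statistics.stdev(lengths)


records = ["@r1", "ACGT", "+", "IIII", "@r2", "ACGTAC", "+", "IIIIII"]
lengths = fastq_read_lengths(records)
print(lengths, summarize(lengths))
```

With a real file, the lines could come from an open file handle instead of a list.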

Subset PE FASTQ reads

Function names: subset_pe_fastq_reads; subset_pe_fastq
Command line interface: bk_subset_pe_fastq_reads; bk_subset_pe_fastq

Subset paired-end FASTQ data.

Subsetting FASTQ data may be helpful for running test scripts or achieving equal coverage between samples. A percentage of total reads in paired-end FASTQ data can be obtained with this function. Random subsamples are obtained using seeds for reproducibility. If no seed is specified, a seed is generated based off of the date and time. During subsampling, paired-end information is maintained.

Output files have the suffix “_subset.fq”.

biokit subset_pe_fastq_reads <fastq1> <fastq2> [-p/--percent <percent> -s/--seed <seed> -o/--output_file <output_file>]

Options:
<fastq1>: first argument after function name should be the name of one of the fastq files
<fastq2>: second argument after function name should be the name of the other fastq file
-p/--percent: percentage of reads to maintain in subsetted data
-s/--seed: seed for random sampling
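
The idea of seeded subsampling that keeps mates together can be sketched as follows. The function name `subset_pairs` and the in-memory list representation of reads are assumptions; the point is only that the same indices are drawn for both files and that a fixed seed reproduces the same subset.

```python
import random

# Illustrative sketch only; not BioKIT's implementation.


def subset_pairs(reads1, reads2, percent, seed):
    """Keep roughly `percent` of read pairs, sampling both mates together."""
    if len(reads1) != len(reads2):
        raise ValueError("paired-end inputs must contain the same number of reads")
    if not 0 <= percent <= 100:
        raise ValueError("percent must be between 0 and 100")
    rng = random.Random(seed)  # a fixed seed makes the sample reproducible
    keep = round(len(reads1) * percent / 100)
    indices = sorted(rng.sample(range(len(reads1)), keep))
    return [reads1[i] for i in indices], [reads2[i] for i in indices]


forward = [f"read{i}/1" for i in range(20)]
reverse = [f"read{i}/2" for i in range(20)]
sub_fwd, sub_rev = subset_pairs(forward, reverse, percent=50, seed=42)
print(len(sub_fwd), len(sub_rev))
```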

Subset SE FASTQ reads

Function names: subset_se_fastq_reads; subset_se_fastq
Command line interface: bk_subset_se_fastq_reads; bk_subset_se_fastq

Subset single-end FASTQ data.

Subsetting FASTQ data may be helpful for running test scripts or achieving equal coverage between samples. A percentage of total reads in single-end FASTQ data can be obtained with this function. Random subsamples are obtained using seeds for reproducibility. If no seed is specified, a seed is generated based off of the date and time.

Output files will have the suffix “_subset.fq”

biokit subset_se_fastq_reads <fastq> [-p/--percent <percent> -s/--seed <seed> -o/--output_file <output_file>]

Options:
<fastq>: first argument after function name should be the fastq file
-p/--percent: percentage of reads to maintain in subsetted data
-s/--seed: seed for random sampling
-o/--output_file: specify the name of the output file

Trim PE adapters FASTQ reads

Function names: trim_pe_adapters_fastq_reads; trim_pe_adapters_fastq
Command line interface: bk_trim_pe_adapters_fastq_reads; bk_trim_pe_adapters_fastq

Trim adapters from paired-end FASTQ data.

FASTQ data will be trimmed according to exact match to known adapter sequences.

Output file has the suffix “_adapter_removed.fq” or can be named by the user with the output_file argument.

biokit trim_pe_adapters_fastq_reads <fastq1> <fastq2> [-a/--adapters TruSeq2-PE -l/--length 20]

Adapters available:
NexteraPE-PE
TruSeq2-PE
TruSeq2-SE
TruSeq3-PE-2
TruSeq3-PE
TruSeq3-SE

Options:
<fastq1>: first argument after function name should be the name of one of the fastq files
<fastq2>: second argument after function name should be the name of the other fastq file
-a/--adapters: adapter sequences to trim. Default: TruSeq2-PE
-l/--length: minimum length of read to be kept. Default: 20

Trim PE FASTQ reads

Function names: trim_pe_fastq_reads; trim_pe_fastq
Command line interface: bk_trim_pe_fastq_reads; bk_trim_pe_fastq

Quality trim paired-end FASTQ data.

FASTQ data will be trimmed according to quality score and length of the reads. Users can specify quality and length thresholds. Paired reads that are maintained are saved to files with the suffix “_paired_trimmed.fq”. Single reads that passed quality thresholds are saved to files with the suffix “_unpaired_trimmed.fq”.

biokit trim_pe_fastq_reads <fastq1> <fastq2> [-m/--minimum 20 -l/--length 20]

Options:
<fastq1>: first argument after function name should be the name of one of the fastq files
<fastq2>: second argument after function name should be the name of the other fastq file
-m/--minimum: minimum quality of read to be kept. Default: 20
-l/--length: minimum length of read to be kept. Default: 20

Trim SE adapters FASTQ reads

Function names: trim_se_adapters_fastq_reads; trim_se_adapters_fastq
Command line interface: bk_trim_se_adapters_fastq_reads; bk_trim_se_adapters_fastq

Trim adapters from single-end FASTQ data.

FASTQ data will be trimmed according to exact match to known adapter sequences.

Output file has the suffix “_adapter_removed.fq” or can be named by the user with the output_file argument.

biokit trim_se_adapters_fastq_reads <fastq> [-a/--adapters TruSeq2-SE -l/--length 20]

Options:
<fastq>: first argument after function name should be the fastq file
-a/--adapters: adapter sequences to trim. Default: TruSeq2-SE
-l/--length: minimum length of read to be kept. Default: 20
-o/--output_file: name of the output file of trimmed reads

Trim SE FASTQ reads

Function names: trim_se_fastq_reads; trim_se_fastq
Command line interface: bk_trim_se_fastq_reads; bk_trim_se_fastq

Quality trim single-end FASTQ data.

FASTQ data will be trimmed according to quality score and length of the reads. Users can specify quality and length thresholds. Output file has the suffix “_trimmed.fq” or can be named by the user with the output_file argument.

biokit trim_se_fastq_reads <fastq> [-m/--minimum 20 -l/--length 20]

Options:
<fastq>: first argument after function name should be the fastq file
-m/--minimum: minimum quality of read to be kept. Default: 20
-l/--length: minimum length of read to be kept. Default: 20
-o/--output_file: name of the output file of trimmed reads

Genome functions

GC content

Function names: gc_content; gc
Command line interface: bk_gc_content; bk_gc

Calculate GC content of a fasta file.

GC content is the fraction of bases that are either guanines or cytosines. To obtain GC content per FASTA entry, use the verbose option.

biokit gc_content <fasta> [-v/--verbose]

Options:
<fasta>: first argument after function name should be a fasta file
-v/--verbose: optional argument to print the GC content of each fasta entry

Genome assembly metrics

Function names: genome_assembly_metrics; assembly_metrics
Command line interface: bk_genome_assembly_metrics; bk_assembly_metrics

Calculate L50, L90, N50, N90, GC content, assembly size, number of scaffolds, number and sum length of large scaffolds, and frequency of A, T, C, and G.

L50: The smallest number of contigs whose length sum makes up half of the genome size.
L90: The smallest number of contigs whose length sum makes up 90% of the genome size.
N50: The sequence length of the shortest contig at half of the genome size.
N90: The sequence length of the shortest contig at 90% of the genome size.
GC content: The fraction of bases that are either guanines or cytosines.
Assembly size: The sum length of all contigs in an assembly.
Number of scaffolds: The total number of scaffolds in an assembly.
Number of large scaffolds: The total number of scaffolds that are greater than the threshold for small scaffolds.
Sum length of large scaffolds: The sum length of all large scaffolds.
Frequency of A: The number of occurrences of A corrected by assembly size.
Frequency of T: The number of occurrences of T corrected by assembly size.
Frequency of C: The number of occurrences of C corrected by assembly size.
Frequency of G: The number of occurrences of G corrected by assembly size.

biokit genome_assembly_metrics <fasta> [-t/--threshold <int>]

Options:
<fasta>: first argument after function name should be a fasta file
-t/--threshold: threshold for what is considered a large scaffold. Only scaffolds with a length greater than this value will be counted. Default: 500
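
The N50/L50 and N90/L90 definitions above can be worked through with a short sketch. The function name `nx_lx` is hypothetical, and the genome size is taken to be the sum of all contig lengths, as in the assembly size definition.

```python
# Illustrative sketch only; not BioKIT's implementation.


def nx_lx(lengths, fraction):
    """Return (Nx, Lx) for contig lengths and a fraction such as 0.5 or 0.9.

    Contigs are visited from longest to shortest. Lx is the number of
    contigs needed to reach `fraction` of the total length, and Nx is
    the length of the last (shortest) contig added.
    """
    total = sum(lengths)
    running = 0
    for count, length in enumerate(sorted(lengths, reverse=True), start=1):
        running += length
        if running >= total * fraction:
            return length, count
    raise ValueError("lengths must be non-empty")


contigs = [100, 80, 60, 40, 20]  # total length: 300
print(nx_lx(contigs, 0.5))  # N50 and L50
print(nx_lx(contigs, 0.9))  # N90 and L90
```

For these five contigs, the two longest sum to 180, which passes half of 300, so N50 is 80 and L50 is 2; four contigs (280) are needed to pass 270, so N90 is 40 and L90 is 4.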

L50

Function names: l50
Command line interface: bk_l50

Calculates L50 for a genome assembly.

L50 is the smallest number of contigs whose length sum makes up half of the genome size.

biokit l50 <fasta>

Options:
<fasta>: first argument after function name should be a fasta file

L90

Function names: l90
Command line interface: bk_l90

Calculates L90 for a genome assembly.

L90 is the smallest number of contigs whose length sum makes up 90% of the genome size.

biokit l90 <fasta>

Options:
<fasta>: first argument after function name should be a fasta file

Longest scaffold

Function names: longest_scaffold; longest_scaff; longest_contig; longest_cont
Command line interface: bk_longest_scaffold; bk_longest_scaff; bk_longest_contig; bk_longest_cont

Determine the length of the longest scaffold in a genome assembly.

biokit longest_scaffold <fasta>

Options:
<fasta>: first argument after function name should be a fasta file

N50

Function names: n50
Command line interface: bk_n50

Calculates N50 for a genome assembly.

N50 is the sequence length of the shortest contig at 50% of the genome size.

biokit n50 <fasta>

Options:
<fasta>: first argument after function name should be a fasta file

N90

Function names: n90
Command line interface: bk_n90

Calculates N90 for a genome assembly.

N90 is the sequence length of the shortest contig at 90% of the genome size.

biokit n90 <fasta>

Options:
<fasta>: first argument after function name should be a fasta file

Number of large scaffolds

Function names: number_of_large_scaffolds; num_of_lrg_scaffolds; number_of_large_contigs; num_of_lrg_cont
Command line interface: bk_number_of_large_scaffolds; bk_num_of_lrg_scaffolds; bk_number_of_large_contigs; bk_num_of_lrg_cont

Calculate number and total sequence length of large scaffolds. Each value is represented as column 1 and column 2 in the output, respectively.

biokit number_of_large_scaffolds <fasta> [-t/--threshold <int>]

Options:
<fasta>: first argument after function name should be a fasta file
-t/--threshold: threshold for what is considered a large scaffold. Only scaffolds with a length greater than this value will be counted. Default: 500

Number of scaffolds

Function names: number_of_scaffolds; num_of_scaffolds; number_of_contigs; num_of_cont
Command line interface: bk_number_of_scaffolds; bk_num_of_scaffolds; bk_number_of_contigs; bk_num_of_cont

Calculate the number of scaffolds or entries in a FASTA file. In this way, a user can also determine the number of predicted genes in a coding sequence or protein FASTA file with this function.

biokit number_of_scaffolds <fasta>

Options:
<fasta>: first argument after function name should be a fasta file

Sum of scaffold lengths

Function names: sum_of_scaffold_lengths; sum_of_contig_lengths
Command line interface: bk_sum_of_scaffold_lengths; bk_sum_of_contig_lengths

Determine the sum of scaffold lengths.

The intended use of this function is to determine the length of a genome assembly, but it can also be used, for example, to determine the sum length of all coding sequences.

biokit sum_of_scaffold_lengths <fasta>

Options:
<fasta>: first argument after function name should be a fasta file

Sequence summary and processing functions

Character frequency

Function names: character_frequency; char_freq
Command line interface: bk_character_frequency; bk_char_freq

Calculate the frequency of characters in a FASTA file.

This can be used to determine the frequency of A, T, C, and G in a genome or the frequency of amino acids in a proteome.

biokit character_frequency <fasta> [-v/--verbose]

Options:
<fasta>: first argument after function name should be a fasta file

Get FASTA entry (faidx)

Function names: faidx; get_entry; ge
Command line interface: bk_faidx; bk_get_entry; bk_ge

Extracts sequence entry from fasta file.

This function works similarly to the faidx function in samtools, but does not require indexing the sequence file.

biokit faidx <fasta> -e/--entry <entry>

Options:
<fasta>: first argument after function name should be a fasta file
-e/--entry: entry name to be extracted from the inputted fasta file

File format converter

Function names: file_format_converter; format_converter; ffc
Command line interface: bk_file_format_converter; bk_format_converter; bk_ffc

Converts a multiple sequence file from one format to another.

Acceptable file formats include FASTA, Clustal, MAF, Mauve, Phylip, Phylip-sequential, Phylip-relaxed, and Stockholm. Input and output file formats are specified with the --input_file_format and --output_file_format arguments; input and output files are specified with the --input_file and --output_file arguments.

biokit file_format_converter -i/--input_file <input_file> -iff/--input_file_format <input_file_format>  -o/--output_file <output_file> -off/--output_file_format <output_file_format>

Options:
-i/--input_file: input file name
-iff/--input_file_format: input file format
-o/--output_file: output file name
-off/--output_file_format: output file format

Multiple line to single line FASTA

Function names: multiple_line_to_single_line_fasta; ml2sl
Command line interface: bk_multiple_line_to_single_line_fasta; bk_ml2sl

Converts FASTA files with multiple lines per sequence to a FASTA file with the sequence represented on one line.

biokit multiple_line_to_single_line_fasta <fasta> [-o/--output <output_file>]

Options:
<fasta>: first argument after function name should be a fasta file
-o/--output: optional argument to name the output file

Remove FASTA entry

Function names: remove_fasta_entry
Command line interface: bk_remove_fasta_entry

Remove FASTA entry from multi-FASTA file.

Output will have the suffix “pruned.fa” unless the user specifies a different output file name.

biokit remove_fasta_entry <fasta> -e/--entry <entry> [-o/--output <output_file>]

Options:
<fasta>: first argument after function name should be a fasta file
-e/--entry: entry name to be removed from the inputted fasta file
-o/--output: optional argument to write the pruned fasta file to. Default output has the same name as the input file with the suffix “pruned.fa” added to it.

Remove short sequences

Function names: remove_short_sequences; remove_short_seqs
Command line interface: bk_remove_short_sequences; bk_remove_short_seqs

Remove short sequences from a multi-FASTA file.

Short sequences are defined as having a length less than 500. Users can specify their own threshold. All sequences longer than the threshold will be kept in the resulting file.

Output will have the suffix “long_seqs.fa” unless the user specifies a different output file name.

biokit remove_short_sequences <fasta> -t/--threshold <threshold> [-o/--output <output_file>]

Options:
<fasta>: first argument after function name should be a fasta file
-t/--threshold: threshold for short sequences. Sequences greater than this value will be kept
-o/--output: optional argument to write the filtered fasta file to. Default output has the same name as the input file with the suffix “long_seqs.fa” added to it.

Rename FASTA entries

Function names: rename_fasta_entries; rename_fasta
Command line interface: bk_rename_fasta_entries; bk_rename_fasta

Renames fasta entries.

Renaming fasta entries will follow the scheme of a tab-delimited file wherein the first column is the current fasta entry name and the second column is the new fasta entry name in the resulting output file.

biokit rename_fasta_entries <fasta> -i/--idmap <idmap> [-o/--output <output_file>]

Options:
<fasta>: first argument after function name should be a fasta file
-i/--idmap: identifier map of current FASTA names (col1) and desired FASTA names (col2)
-o/--output: optional argument to name the output file

Reorder by sequence length

Function names: reorder_by_sequence_length; reorder_by_seq_len
Command line interface: bk_reorder_by_sequence_length; bk_reorder_by_seq_len

Reorder FASTA file entries from the longest entry to the shortest entry.

biokit reorder_by_sequence_length <fasta> [-o/--output <output_file>]

Options:
<fasta>: first argument after function name should be a fasta file
-o/--output: optional argument to write the reordered fasta file to. Default output has the same name as the input file with the suffix “.reordered.fa” added to it.

Sequence complement

Function names: sequence_complement; seq_comp
Command line interface: bk_sequence_complement; bk_seq_comp

Generates the sequence complement for all entries in a multi-FASTA file. To generate a reverse complement sequence, add the -r/--reverse argument.

biokit sequence_complement <fasta> [-r/--reverse]

Options:
<fasta>: first argument after function name should be a fasta file
-r/--reverse: if used, the reverse complement sequence will be generated
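
The difference between the complement and the reverse complement can be shown with a short sketch. The function name `complement` is hypothetical, and leaving characters other than A, C, G, and T (for example, ambiguity codes) unchanged is an assumption.

```python
# Illustrative sketch only; not BioKIT's implementation.

# Translation table for the four standard bases, preserving case.
COMPLEMENT = str.maketrans("ACGTacgt", "TGCAtgca")


def complement(seq, reverse=False):
    """Return the complement, or the reverse complement if reverse=True."""
    comp = seq.translate(COMPLEMENT)
    return comp[::-1] if reverse else comp


print(complement("AACG"))
print(complement("AACG", reverse=True))
```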

Sequence length

Function names: sequence_length; seq_len
Command line interface: bk_sequence_length; bk_seq_len

Calculate sequence length of each FASTA entry.

biokit sequence_length <fasta>

Options:
<fasta>: first argument after function name should be a fasta file

Single line to multiple line fasta

Function names: single_line_to_multiple_line_fasta; sl2ml
Command line interface: bk_single_line_to_multiple_line_fasta; bk_sl2ml

Converts FASTA files with the sequence represented on one line to a FASTA file with multiple lines per sequence.

biokit single_line_to_multiple_line_fasta <fasta>

Options:
<fasta>: first argument after function name should be a fasta file