Five-step phylogenomics, from proteomes to species tree
Introduction
Overview of topics
and further reading
Step 1: Data collection and software installation
1.1. Set up your working directory
1.2. Collect data
1.3. Install software
Step 2: Identify single-copy orthologous genes
2.1. Obtain HMMs
2.2. Identify SC-OGs using orthofisher
Step 3: Align and trim SC-OGs
3.1. Align SC-OGs using MAFFT
3.2. Trim alignments using ClipKIT
Step 4: Create a concatenated supermatrix
4.1. Using the "create_concatenation_matrix" in PhyKIT
4.2. Examine alignment properties using "alignment_summary" in BioKIT
Step 5: Infer the putative species tree
5.1. Infer the species tree using IQ-TREE
5.2. Root the phylogeny using "root_tree" in PhyKIT
Closing remarks
Summary, caveats, and considerations
References
Introduction
Knowledge of the evolutionary history among a set of organisms is a prerequisite for studying numerous biological
phenomenon such as positive selection, trait evolution, and more. The use of genome-scale data
to infer the evolutionary history among a set of organisms, which is often referred to as a species tree,
is known as phylogenomics.
In this tutorial, we will go over a simple five-step workflow to go from proteomes
to a (putative) species tree.
This tutorial is non-exhaustive and serves as
a soft introduction for how to conduct "hands-on" phylogenomic analysis. For practical reasons, we will not
reconstruct the evolutionary
history among hundreds of species using hundreds of loci, which is becoming more commonplace; instead, to
convey key concepts and
introduce a "typical" workflow, we will use eight proteomes and 10 loci. This workflow is based on
previously published work
(Steenwyk and Rokas, 2019).
As is true of any analysis grounded in statistics, assumptions will be made.
An in-depth understanding of the assumptions being made during phylogenomic analysis is important, but that is beyond
the scope of this tutorial. Readers may refer to an non-exhaustive list of additional
resources for further reading to enhance their understanding of potential caveats and sources of
error—Kapli et al. (2020), Nat. Rev. Gen.;
Yang & Rannala (2012), Nat. Rev. Gen.; and
Philippe
et al. (2005), Ann. Rev. of Eco., Evo., and Sys.. I'd also like to acknowledge that species trees are
hypotheses and should be treated as such. It is possible, and in some cases expected, that different workflows
will result in different species tree inferences.
Note, all commands are intended to be executed in the UNIX environment, a common system for many bioinformaticians.
This tutorial assumes readers have some familiarity with basic UNIX commands such as making directories, changing into
them, etc.
Commands are displayed in code blocks that have a light grey background.
For example,
this is a command you would execute!
With that, let's conduct phylogenomic analysis using five simple steps!
Step 1: Data collection and software installation
1.1. Set up your working directory
To keep everything organized, we will first make a directory for this tutorial called phylogenomics_made_easy with the
mkdir command:
mkdir phylogenomics_made_easy
.
Thereafter, move into the directory you made with the cd command:
cd phylogenomics_made_easy
.
1.2. Collect data
Next, collect the proteomes of the species whose evolutionary history you are interested in inferring. For simplicity,
I have precompiled a set of eight proteomes of Aspergillus and Penicillium species to use with one outgroup taxon,
Penicilliopsis zonata. All proteomes
were downloaded from the National Center for Biotechnology Information (or NCBI).
We will use these proteomes because their evolutionary relationships have been previously established (Steenwyk and Rokas, 2019). Move the proteomes to the current working directory with the mv command:
mv path_to_file/prot.tar.gz .
. Be sure to change path_to_file to the appropriate relative or
absolute file path of prot.tar.gz. Unzip the directory with the tar command:
tar -zxvf prot.tar.gz
.
1.3. Install software
We will have to download and install the necessary software for the tutorial. Most software will be downloaded from PyPi. To do so, we will continue to implement best practices and first create a virtual environment to store all installed software. A virtual environment is a "self-contained directory tree that contains a Python installation for a particular version of Python, plus a number of additional packages" (Python docs). Create a virtual environment with the following command:
python -m venv venv
and activate it
source venv/bin/activate
.
Next, install
orthofisher,
ClipKIT,
PhyKIT, and
BioKIT using the pip install command:
pip install orthofisher clipkit phykit jlsteenwyk-biokit
.
Disclaimer: these are all software I have developed. They have enabled me to conduct diverse
phylogenomic analyses and so I want to share what I've learned and created with you.
The last pieces of software that need to be installed are HMMER, MAFFT, and IQ-TREE. Each of these software have their own installation instructions. Please follow the link for each software and install according to the developer's recommendations. Add the path of each executable to your $PATH variable. Instructions to do so can be found on numerous online discussion forums such as askubuntu and unix stackexchange.
Step 2: Identify single-copy orthologous genes
2.1. Obtain HMMs
Single-copy orthologous genes (SC-OGs) refer a set of homologous genes that originated via speciation and are present only once
among species of interest. These genes are typically used for phylogenomic analyses because current wisdom suggests genes that
have undergone duplication and deletion are likely to have evolutionary histories that do not follow the species tree.
To identify SC-OGs, we will use
orthofisher
(Steenwyk and Rokas, 2021).
orthofisher requires as input a list of HMMs and proteomes and then will use
HMMER to identify SC-OGs in each proteome. Importantly, the output of
orthofisher simplifies downstream processes, which is in part why we are using it.
Now we need HMMs. For this tutorial, we will use a subsampled set of HMMs derived from near universally SC-OGs from the kingdom
fungi, which are available from
OrthoDB
(Zdobnov et al., 2017).
Of note, a custom set of HMMs can be used as input for orthofisher. Instructions on creating custom HMMs can be found in the orthofisher documentation. Move the downloaded HMMs to the current working directory with the mv command:
mv path_to_file/hmms.tar.gz .
. As above, be sure to change path_to_file to the appropriate relative or
absolute file path of hmms.tar.gz. Unzip the directory with the tar command:
tar -zxvf hmms.tar.gz
. We can see that there are 10 HMMs by counting how many files are in the hmms directory using
the following command:
ls hmms | wc -l
.
2.2. Identify SC-OGs using orthofisher
Now, we can create the input files for orthofisher. To do so, list the input files and write the output to text files using the following commands:
ls hmms/* > hmms_list.txt
and
paste <(ls prot/*) <(ls prot/* | sed 's/prot\///g' | sed 's/_protein.faa//g') > prot_and_names_list.txt
. The latter creates a two column file wherein the first column is the file name and path of the proteome
and the second column is the species name associated with the proteome. To view the contents of prot_and_names_list.txt,
use the cat command:
cat prot_and_names_list.txt
prot/Aspergillus_fischeri_protein.faa Aspergillus_fischeri
prot/Aspergillus_flavus_protein.faa Aspergillus_flavus
prot/Aspergillus_nidulans_protein.faa Aspergillus_nidulans
prot/Aspergillus_oryzae_protein.faa Aspergillus_oryzae
prot/Penicilliopsis_zonata_protein.faa Penicilliopsis_zonata
prot/Penicillium_antarcticum_protein.faa Penicillium_antarcticum
prot/Penicillium_arizonense_protein.faa Penicillium_arizonense
prot/Penicillium_digitatum_protein.faa Penicillium_digitatum
Next, execute orthofisher using the following command:
orthofisher -f prot_and_names_list.txt -m hmms_list.txt
.
This will create a directory named orthofisher_output, which contains multiple files. Let's examine the contents of the tab-delimited file short_summary.txt using the cat command:
cat orthofisher_output/short_summary.txt
file_name single-copy multi-copy absent per_single-copy per_multi-copy per_absent
Aspergillus_fischeri_protein.faa 10 0 0 1.0 0.0 0.0
Aspergillus_flavus_protein.faa 10 0 0 1.0 0.0 0.0
Aspergillus_nidulans_protein.faa 10 0 0 1.0 0.0 0.0
Aspergillus_oryzae_protein.faa 10 0 0 1.0 0.0 0.0
Penicilliopsis_zonata_protein.faa 10 0 0 1.0 0.0 0.0
Penicillium_antarcticum_protein.faa 10 0 0 1.0 0.0 0.0
Penicillium_arizonense_protein.faa 10 0 0 1.0 0.0 0.0
Penicillium_digitatum_protein.faa 9 1 0 0.9 0.1 0.0
As noted in the second column, for each of the 10 sequence searches executed by orthofisher
resulted in the identification of one SC-OG with the exception of Penicillium digitatum, which had one near
universally SC-OG present in multiple copies (see column the third column). To determine which near universally SC-OG was present
in multiple copies, see the tab-delimited file, long_summary.txt, in the orthofisher_output directory.
Based on these observations, we can
conclude that these genes have sufficient taxon occupancy for use in species tree inference.
Next, let's examine the scog output from orthofisher, which contains individual multiple-sequence FASTA files for each near universally SC-OG.
ls orthofisher_output/scog/
orthofisher_output/scog/EOG092C00SU.hmm.orthofisher orthofisher_output/scog/EOG092C07CA.hmm.orthofisher
orthofisher_output/scog/EOG092C00WD.hmm.orthofisher orthofisher_output/scog/EOG092C08KH.hmm.orthofisher
orthofisher_output/scog/EOG092C02J3.hmm.orthofisher orthofisher_output/scog/EOG092C097K.hmm.orthofisher
orthofisher_output/scog/EOG092C038B.hmm.orthofisher orthofisher_output/scog/EOG092C0B3U.hmm.orthofisher
orthofisher_output/scog/EOG092C03KT.hmm.orthofisher orthofisher_output/scog/EOG092C0BAH.hmm.orthofisher
One advantage of orthofisher
is that these files are automatically made and formatted in a user-friendly way.
For example, the headers of each sequence are nicely named;
examination of the headers in orthofisher_output/scog/EOG092C07CA.hmm.orthofisher reveals that it is obvious which sequence
comes from which taxon.
grep ">" ../../FILES_tutorial/orthofisher_output/scog/EOG092C07CA.hmm.orthofisher
>Aspergillus_fischeri EAW21995.1|0.0|1156.8
>Aspergillus_flavus QRD94170.1|0.0|1153.6
>Aspergillus_nidulans EAA62991.1|0.0|1143.2
>Aspergillus_oryzae BAE63272.1|0.0|1153.9
>Penicilliopsis_zonata OJJ43347.1|0.0|1156.1
>Penicillium_antarcticum OQD80813.1|0.0|1145.7
>Penicillium_arizonense OGE54164.1|0.0|1145.2
>Penicillium_digitatum EKV21956.1|0.0|1137.9
The first column in the header is the genus and species name, which was specified in the second column of the
prot_and_names_list.txt file and the second column has the gene identifier, e-value, and bitscore separated by a
vertical bar (or pipe character "|") for the best hit
identified during sequence search.
You're doing great! Some of the hardest parts of the workflow are now complete so you can definintely finish the tutorial from here!
Step 3: Align and trim SC-OGs
3.1. Align SC-OGs using MAFFT
To align sequences, we will use MAFFT with the "auto" parameter
(Katoh,K. and Standley, 2013) using the following command:
for i in $(ls orthofisher_output/scog/*); do
mafft --auto $i > $i.mafft ; done
. This will loop through every multiple-sequence FASTA file in orthofisher_output/scog/,
align the sequences in each file, and redirect the output to the files to the same directory and with the same name except
with the suffix ".mafft" added to them. You can view the output using the ls command: ls
orthofisher_output/scog/*mafft
.
3.2. Trim alignments using ClipKIT
To trim sequences, we will use ClipKIT with the default mode, the "smart-gap" mode
(Steenwyk et al., 2020),
by executing the following command:
for i in $(ls orthofisher_output/scog/*mafft); do
clipkit $i -o $i.clipkit ; done
. Similar to the previous command, this will loop through every aligned multiple-sequence
FASTA file, trim sites in each alignment, and the outputted trimmed alignments will have the same name as the input files but with
the ".clipkit" added as a suffix.
Step 4: Create a concatenated supermatrix
4.1. Using the "create_concatenation_matrix" in PhyKIT
To concatenate sequences into a single supermatrix, we will use the "create_concatenation_matrix" (alias: "create_concat")
function from PhyKIT
(Steenwyk et al., 2021a). To do so, PhyKIT
requires only one input file—a single-column file with the path to each trimmed alignment—which can be created using the
following command:
ls orthofisher_output/scog/*clipkit > alignment_list.txt
.
Next, execute the "create_concat" function in PhyKIT with the following command:
phykit create_concat -a alignment_list.txt -p AspPen8_loci10
--------------------
| General features |
--------------------
Total number of taxa: 8
Total number of alignments: 10
----------------
| Output files |
----------------
Partition file output: AspPen8_loci10.partition
Concatenated fasta output: AspPen8_loci10.fa
Occupancy report: AspPen8_loci10.occupancy
Complete!
where the -a argument points to the file we just made and the -p argument provides the prefix
of the output files created by PhyKIT. Here, I named the prefix "AspPen8_loci10"
because we are working with eight proteomes from Aspergillus and Penicillium species (and one outgroup taxon) and we are using 10 loci. The
standard output also provides additional details such as the number of taxa and alignments that contribute to the concatenated alignment.
The names of the outputted files (i.e., AspPen8_loci10.partition, AspPen8_loci10.fa, and AspPen8_loci10.occupancy) are also printed to the standard output. The partition file contains the boundaries of each gene in the concatenated alignment in RAxML format (Stamatakis, 2014). The concatenated FASTA output has the suffix ".fa" and the file with the ".occupancy" suffix details what taxa are present/absent in each gene.
4.2. Examine alignment properties using "alignment_summary" in BioKIT
Properties of multiple sequence alignments (e.g., alignment length) are commonly reported in phylogenomic studies. To summarize the properties of the concatenated alignment, we will use the "alignment_summary" (alias: aln_summary) function in BioKIT (Steenwyk et al., 2021b) using the following command:
biokit aln_summary AspPen8_loci10.fa
General Characteristics
=======================
8 Number of taxa
17143 Alignment length
3767 Parsimony informative sites
6619 Variable sites
10524 Constant sites
Character Frequencies
=====================
Y 3745
W 1367
V 7925
T 6936
S 9949
R 7584
Q 5166
P 6397
N 4784
M 3104
L 13540
K 8252
I 6531
H 2777
G 7189
F 4945
E 9243
D 8140
C 1530
A 10442
- 7598
Not all properties calculated by the "aln_summary" function may be of interest to us such as character frequency
(with the exception of alignment gappyness represented by the "-" character). Rather than calculating all of these parameters,
you can use BioKIT to calculate individual
parameters such as alignment length. In addition to
BioKIT, functions in PhyKIT can also help
summarize alignment (and
phylogenetic tree) information content.
Step 5: Infer the putative species tree
5.1. Infer the species tree using IQ-TREE
To infer the species tree, we will use the maximum likelihood approach implemented in IQ-TREE
(Minh et al., 2020).
To do so, execute the following command:
iqtree2 -s AspPen8_loci10.fa -bb 1000 -m TEST
which will identify the best-fitting substitution model using
ModelFinder
(Kalyaanamoorthy et al., 2017).
Bipartition support will be examined
using 1000 ultrafast bootstrap approximations
(Hoang et al., 2017).
After exeucting this command, a number of different files will be made. Readers can obtain more information about
the various outputted files
on the IQ-TREE website. For now, we will examine the contents of
the file with best tree identified during tree search, AspPen8_loci10.fa.treefile.
To view the contents in a friendly format, we will use the "print_tree" function in
PhyKIT by executing the following command:
phykit print_tree AspPen8_loci10.fa.treefile
_____________________ Aspergillus_fischeri
|
| , Aspergillus_flavus
| ___________________|
|__| | Aspergillus_oryzae
_| |
| |________________________________________ Aspergillus_nidulans
|
| ____________________________________________ Penicilliopsis_zonata
| |
|_______| ___ Penicillium_antarcticum
| ________|
|_________________________| |__ Penicillium_arizonense
|
|______________ Penicillium_digitatum
5.2. Root the phylogeny using "root_tree" in PhyKIT
This phylogeny is unrooted (you can tell by the trifurcating first branch). An outgroup taxon, Penicilliopsis zonata, was included; thus, we can root the phylogeny and then reexamine the rooted tree topology. To root the phylogeny, we will use the "tree_root" function in PhyKIT. First, we need to create a file that lists all outgroup taxa, which in this case is just a single species, Penicilliopsis_zonata, by executing the following command:
echo "Penicilliopsis_zonata" > root.txt
Next, root the tree with the following command:
phykit root_tree AspPen8_loci10.fa.treefile -r root.txt
.
This will output a file named AspPen8_loci10.fa.treefile.rooted. View the rooted
tree using the same command as before.
phykit print_tree AspPen8_loci10.fa.treefile.rooted
____________ Aspergillus_fischeri
|
___| , Aspergillus_flavus
| | _________|
| |__| | Aspergillus_oryzae
| |
_______________________| |_____________________ Aspergillus_nidulans
| |
| | __ Penicillium_antarcticum
| | ___|
_| |_____________| |_ Penicillium_arizonense
| |
| |_______ Penicillium_digitatum
|
| Penicilliopsis_zonata
Next, we can cross reference this phylogeny with the previously published
Aspergillus
and Penicillium phylogeny. Careful examination reveals that the two phylogenies
are perfectly congruent. Thus, we have successfully inferred the species tree among these
eight taxa and found that our inferred tree supports previously inferred relationships - awesome!
Closing remarks
In this tutorial, we conducted phylogenomic analysis in five basic steps. The
workflow executed here can be, and often is, supported by other analyses.
For example, in the paper referenced earlier—in which a
species tree for
Aspergillus and Penicillium species was inferred—a phylogenomic subsampling strategy was implemented to
identify incongruent bipartitions. Also, this tutorial should not be thought of as the phylogenomic workflow.
Numerous workflows have been presented over the years and I encourage readers to find what works best for their
research question.
Congratulations on inferring your species tree! Please do not hesitate to
contact me if you have any questions or constructive comments.
References
1. Hoang,D.T. et al. (2018)
UFBoot2: Improving the
Ultrafast Bootstrap Approximation. Mol. Biol. Evol., 35, 518–522.
2. Kalyaanamoorthy,S. et al. (2017)
ModelFinder: fast model selection
for accurate phylogenetic estimates. Nat. Methods, 14, 587–589.
3. Kapli,P. et al. (2020)
Phylogenetic tree building
in the genomic age. Nat. Rev. Genet.
4. Minh,B.Q. et al. (2020)
IQ-TREE 2: New Models
and Efficient Methods for Phylogenetic Inference in the Genomic Era.
Mol. Biol. Evol., 37, 1530–1534.
5. Philippe,H. et al. (2005)
Phylogenomics.
Annu. Rev. Ecol. Evol. Syst., 36, 541–562.
6. Stamatakis,A. (2014)
RAxML version
8: a tool for phylogenetic analysis and post-analysis of large phylogenies.
Bioinformatics, 30, 1312–1313.
7. Steenwyk,J.L. et al. (2019)
A Robust Phylogenomic
Time Tree for Biotechnologically and Medically Important Fungi in the Genera
Aspergillus and Penicillium. MBio, 10.
8. Steenwyk,J.L., et al. (2021)
BioKIT: a versatile
toolkit for processing and analyzing diverse types of sequence data. bioRxiv,
2021.10.02.462868.
9. Steenwyk,J.L. et al. (2020)
ClipKIT:
A multiple sequence alignment trimming software for accurate phylogenomic inference.
PLOS Biol., 18, e3001007.
10. Steenwyk,J.L., et al. (2021)
PhyKIT:
a broadly applicable UNIX shell toolkit for processing and analyzing phylogenomic data.
Bioinformatics.
11. Steenwyk,J.L. and Rokas,A. (2021)
orthofisher: a
broadly applicable tool for automated gene identification and retrieval.
G3 Genes|Genomes|Genetics, 11.
12. Yang,Z. and Rannala,B. (2012)
Molecular phylogenetics: principles and practice.
Nat. Rev. Genet., 13, 303–314.
13. Zdobnov,E.M. et al. (2017) OrthoDB
v9.1: cataloging evolutionary and functional annotations for animal, fungal, plant,
archaeal, bacterial and viral orthologs. Nucleic Acids Res., 45, D744–D749.