Five-step phylogenomics, from proteomes to species tree

Introduction
    Overview of topics and further reading
Step 1: Data collection and software installation
    1.1. Set up your working directory
    1.2. Collect data
    1.3. Install software
Step 2: Identify single-copy orthologous genes
    2.1. Obtain HMMs
    2.2. Identify SC-OGs using orthofisher
Step 3: Align and trim SC-OGs
    3.1. Align SC-OGs using MAFFT
    3.2. Trim alignments using ClipKIT
Step 4: Create a concatenated supermatrix
    4.1. Using the "create_concatenation_matrix" in PhyKIT
    4.2. Examine alignment properties using "alignment_summary" in BioKIT
Step 5: Infer the putative species tree
    5.1. Infer the species tree using IQ-TREE
    5.2. Root the phylogeny using "root_tree" in PhyKIT
Closing remarks
    Summary, caveats, and considerations
References


Introduction

Knowledge of the evolutionary history among a set of organisms is a prerequisite for studying numerous biological phenomenon such as positive selection, trait evolution, and more. The use of genome-scale data to infer the evolutionary history among a set of organisms, which is often referred to as a species tree, is known as phylogenomics.

In this tutorial, we will go over a simple five-step workflow to go from proteomes to a (putative) species tree. This tutorial is non-exhaustive and serves as a soft introduction for how to conduct "hands-on" phylogenomic analysis. For practical reasons, we will not reconstruct the evolutionary history among hundreds of species using hundreds of loci, which is becoming more commonplace; instead, to convey key concepts and introduce a "typical" workflow, we will use eight proteomes and 10 loci. This workflow is based on previously published work (Steenwyk and Rokas, 2019).

As is true of any analysis grounded in statistics, assumptions will be made. An in-depth understanding of the assumptions being made during phylogenomic analysis is important, but that is beyond the scope of this tutorial. Readers may refer to an non-exhaustive list of additional resources for further reading to enhance their understanding of potential caveats and sources of error—Kapli et al. (2020), Nat. Rev. Gen.; Yang & Rannala (2012), Nat. Rev. Gen.; and Philippe et al. (2005), Ann. Rev. of Eco., Evo., and Sys.. I'd also like to acknowledge that species trees are hypotheses and should be treated as such. It is possible, and in some cases expected, that different workflows will result in different species tree inferences.

Note, all commands are intended to be executed in the UNIX environment, a common system for many bioinformaticians. This tutorial assumes readers have some familiarity with basic UNIX commands such as making directories, changing into them, etc. Commands are displayed in code blocks that have a light grey background. For example, this is a command you would execute!

With that, let's conduct phylogenomic analysis using five simple steps!


Top of page


Step 1: Data collection and software installation

1.1. Set up your working directory
To keep everything organized, we will first make a directory for this tutorial called phylogenomics_made_easy with the mkdir command: mkdir phylogenomics_made_easy . Thereafter, move into the directory you made with the cd command: cd phylogenomics_made_easy .

1.2. Collect data
Next, collect the proteomes of the species whose evolutionary history you are interested in inferring. For simplicity, I have precompiled a set of eight proteomes of Aspergillus and Penicillium species to use with one outgroup taxon, Penicilliopsis zonata. All proteomes were downloaded from the National Center for Biotechnology Information (or NCBI).

Download proteomes

We will use these proteomes because their evolutionary relationships have been previously established (Steenwyk and Rokas, 2019). Move the proteomes to the current working directory with the mv command: mv path_to_file/prot.tar.gz . . Be sure to change path_to_file to the appropriate relative or absolute file path of prot.tar.gz. Unzip the directory with the tar command: tar -zxvf prot.tar.gz .

1.3. Install software
We will have to download and install the necessary software for the tutorial. Most software will be downloaded from PyPi. To do so, we will continue to implement best practices and first create a virtual environment to store all installed software. A virtual environment is a "self-contained directory tree that contains a Python installation for a particular version of Python, plus a number of additional packages" (Python docs). Create a virtual environment with the following command: python -m venv venv and activate it source venv/bin/activate . Next, install orthofisher, ClipKIT, PhyKIT, and BioKIT using the pip install command: pip install orthofisher clipkit phykit jlsteenwyk-biokit . Disclaimer: these are all software I have developed. They have enabled me to conduct diverse phylogenomic analyses and so I want to share what I've learned and created with you.

The last pieces of software that need to be installed are HMMER, MAFFT, and IQ-TREE. Each of these software have their own installation instructions. Please follow the link for each software and install according to the developer's recommendations. Add the path of each executable to your $PATH variable. Instructions to do so can be found on numerous online discussion forums such as askubuntu and unix stackexchange.


Top of page

Step 2: Identify single-copy orthologous genes

2.1. Obtain HMMs
Single-copy orthologous genes (SC-OGs) refer a set of homologous genes that originated via speciation and are present only once among species of interest. These genes are typically used for phylogenomic analyses because current wisdom suggests genes that have undergone duplication and deletion are likely to have evolutionary histories that do not follow the species tree.

To identify SC-OGs, we will use orthofisher (Steenwyk and Rokas, 2021). orthofisher requires as input a list of HMMs and proteomes and then will use HMMER to identify SC-OGs in each proteome. Importantly, the output of orthofisher simplifies downstream processes, which is in part why we are using it. Now we need HMMs. For this tutorial, we will use a subsampled set of HMMs derived from near universally SC-OGs from the kingdom fungi, which are available from OrthoDB (Zdobnov et al., 2017).

Download HMMs

Of note, a custom set of HMMs can be used as input for orthofisher. Instructions on creating custom HMMs can be found in the orthofisher documentation. Move the downloaded HMMs to the current working directory with the mv command: mv path_to_file/hmms.tar.gz . . As above, be sure to change path_to_file to the appropriate relative or absolute file path of hmms.tar.gz. Unzip the directory with the tar command: tar -zxvf hmms.tar.gz . We can see that there are 10 HMMs by counting how many files are in the hmms directory using the following command: ls hmms | wc -l .

2.2. Identify SC-OGs using orthofisher
Now, we can create the input files for orthofisher. To do so, list the input files and write the output to text files using the following commands: ls hmms/* > hmms_list.txt and paste <(ls prot/*) <(ls prot/* | sed 's/prot\///g' | sed 's/_protein.faa//g') > prot_and_names_list.txt . The latter creates a two column file wherein the first column is the file name and path of the proteome and the second column is the species name associated with the proteome. To view the contents of prot_and_names_list.txt, use the cat command:

cat prot_and_names_list.txt 
	prot/Aspergillus_fischeri_protein.faa	Aspergillus_fischeri
	prot/Aspergillus_flavus_protein.faa	Aspergillus_flavus
	prot/Aspergillus_nidulans_protein.faa	Aspergillus_nidulans
	prot/Aspergillus_oryzae_protein.faa	Aspergillus_oryzae
	prot/Penicilliopsis_zonata_protein.faa	Penicilliopsis_zonata
	prot/Penicillium_antarcticum_protein.faa	Penicillium_antarcticum
	prot/Penicillium_arizonense_protein.faa	Penicillium_arizonense
	prot/Penicillium_digitatum_protein.faa	Penicillium_digitatum
									
Next, execute orthofisher using the following command: orthofisher -f prot_and_names_list.txt -m hmms_list.txt .

This will create a directory named orthofisher_output, which contains multiple files. Let's examine the contents of the tab-delimited file short_summary.txt using the cat command:

cat orthofisher_output/short_summary.txt 
	file_name	single-copy	multi-copy	absent	per_single-copy	per_multi-copy	per_absent
	Aspergillus_fischeri_protein.faa	10	0	0	1.0	0.0	0.0
	Aspergillus_flavus_protein.faa	10	0	0	1.0	0.0	0.0
	Aspergillus_nidulans_protein.faa	10	0	0	1.0	0.0	0.0
	Aspergillus_oryzae_protein.faa	10	0	0	1.0	0.0	0.0
	Penicilliopsis_zonata_protein.faa	10	0	0	1.0	0.0	0.0
	Penicillium_antarcticum_protein.faa	10	0	0	1.0	0.0	0.0
	Penicillium_arizonense_protein.faa	10	0	0	1.0	0.0	0.0
	Penicillium_digitatum_protein.faa	9	1	0	0.9	0.1	0.0
									
As noted in the second column, for each of the 10 sequence searches executed by orthofisher resulted in the identification of one SC-OG with the exception of Penicillium digitatum, which had one near universally SC-OG present in multiple copies (see column the third column). To determine which near universally SC-OG was present in multiple copies, see the tab-delimited file, long_summary.txt, in the orthofisher_output directory. Based on these observations, we can conclude that these genes have sufficient taxon occupancy for use in species tree inference.

Next, let's examine the scog output from orthofisher, which contains individual multiple-sequence FASTA files for each near universally SC-OG.

ls orthofisher_output/scog/
	orthofisher_output/scog/EOG092C00SU.hmm.orthofisher  orthofisher_output/scog/EOG092C07CA.hmm.orthofisher
	orthofisher_output/scog/EOG092C00WD.hmm.orthofisher  orthofisher_output/scog/EOG092C08KH.hmm.orthofisher
	orthofisher_output/scog/EOG092C02J3.hmm.orthofisher  orthofisher_output/scog/EOG092C097K.hmm.orthofisher
	orthofisher_output/scog/EOG092C038B.hmm.orthofisher  orthofisher_output/scog/EOG092C0B3U.hmm.orthofisher
	orthofisher_output/scog/EOG092C03KT.hmm.orthofisher  orthofisher_output/scog/EOG092C0BAH.hmm.orthofisher
									
One advantage of orthofisher is that these files are automatically made and formatted in a user-friendly way. For example, the headers of each sequence are nicely named; examination of the headers in orthofisher_output/scog/EOG092C07CA.hmm.orthofisher reveals that it is obvious which sequence comes from which taxon.

grep ">" ../../FILES_tutorial/orthofisher_output/scog/EOG092C07CA.hmm.orthofisher
	>Aspergillus_fischeri EAW21995.1|0.0|1156.8
	>Aspergillus_flavus QRD94170.1|0.0|1153.6
	>Aspergillus_nidulans EAA62991.1|0.0|1143.2
	>Aspergillus_oryzae BAE63272.1|0.0|1153.9
	>Penicilliopsis_zonata OJJ43347.1|0.0|1156.1
	>Penicillium_antarcticum OQD80813.1|0.0|1145.7
	>Penicillium_arizonense OGE54164.1|0.0|1145.2
	>Penicillium_digitatum EKV21956.1|0.0|1137.9
									
The first column in the header is the genus and species name, which was specified in the second column of the prot_and_names_list.txt file and the second column has the gene identifier, e-value, and bitscore separated by a vertical bar (or pipe character "|") for the best hit identified during sequence search.

You're doing great! Some of the hardest parts of the workflow are now complete so you can definintely finish the tutorial from here!


Top of page

Step 3: Align and trim SC-OGs

3.1. Align SC-OGs using MAFFT
To align sequences, we will use MAFFT with the "auto" parameter (Katoh,K. and Standley, 2013) using the following command: for i in $(ls orthofisher_output/scog/*); do mafft --auto $i > $i.mafft ; done. This will loop through every multiple-sequence FASTA file in orthofisher_output/scog/, align the sequences in each file, and redirect the output to the files to the same directory and with the same name except with the suffix ".mafft" added to them. You can view the output using the ls command: ls orthofisher_output/scog/*mafft.

3.2. Trim alignments using ClipKIT
To trim sequences, we will use ClipKIT with the default mode, the "smart-gap" mode (Steenwyk et al., 2020), by executing the following command: for i in $(ls orthofisher_output/scog/*mafft); do clipkit $i -o $i.clipkit ; done. Similar to the previous command, this will loop through every aligned multiple-sequence FASTA file, trim sites in each alignment, and the outputted trimmed alignments will have the same name as the input files but with the ".clipkit" added as a suffix.


Top of page

Step 4: Create a concatenated supermatrix

4.1. Using the "create_concatenation_matrix" in PhyKIT
To concatenate sequences into a single supermatrix, we will use the "create_concatenation_matrix" (alias: "create_concat") function from PhyKIT (Steenwyk et al., 2021a). To do so, PhyKIT requires only one input file—a single-column file with the path to each trimmed alignment—which can be created using the following command: ls orthofisher_output/scog/*clipkit > alignment_list.txt . Next, execute the "create_concat" function in PhyKIT with the following command:


phykit create_concat -a alignment_list.txt -p AspPen8_loci10
	--------------------
	| General features |
	--------------------
	Total number of taxa: 8
	Total number of alignments: 10


	----------------
	| Output files |
	----------------
	Partition file output: AspPen8_loci10.partition
	Concatenated fasta output: AspPen8_loci10.fa
	Occupancy report: AspPen8_loci10.occupancy

	Complete!
									
where the -a argument points to the file we just made and the -p argument provides the prefix of the output files created by PhyKIT. Here, I named the prefix "AspPen8_loci10" because we are working with eight proteomes from Aspergillus and Penicillium species (and one outgroup taxon) and we are using 10 loci. The standard output also provides additional details such as the number of taxa and alignments that contribute to the concatenated alignment.

The names of the outputted files (i.e., AspPen8_loci10.partition, AspPen8_loci10.fa, and AspPen8_loci10.occupancy) are also printed to the standard output. The partition file contains the boundaries of each gene in the concatenated alignment in RAxML format (Stamatakis, 2014). The concatenated FASTA output has the suffix ".fa" and the file with the ".occupancy" suffix details what taxa are present/absent in each gene.

4.2. Examine alignment properties using "alignment_summary" in BioKIT
Properties of multiple sequence alignments (e.g., alignment length) are commonly reported in phylogenomic studies. To summarize the properties of the concatenated alignment, we will use the "alignment_summary" (alias: aln_summary) function in BioKIT (Steenwyk et al., 2021b) using the following command:

biokit aln_summary AspPen8_loci10.fa

	General Characteristics
	=======================
	8	Number of taxa
	17143	Alignment length
	3767	Parsimony informative sites
	6619	Variable sites
	10524	Constant sites

	Character Frequencies
	=====================
	Y	3745
	W	1367
	V	7925
	T	6936
	S	9949
	R	7584
	Q	5166
	P	6397
	N	4784
	M	3104
	L	13540
	K	8252
	I	6531
	H	2777
	G	7189
	F	4945
	E	9243
	D	8140
	C	1530
	A	10442
	-	7598
									
Not all properties calculated by the "aln_summary" function may be of interest to us such as character frequency (with the exception of alignment gappyness represented by the "-" character). Rather than calculating all of these parameters, you can use BioKIT to calculate individual parameters such as alignment length. In addition to BioKIT, functions in PhyKIT can also help summarize alignment (and phylogenetic tree) information content.

Top of page

Step 5: Infer the putative species tree

5.1. Infer the species tree using IQ-TREE
To infer the species tree, we will use the maximum likelihood approach implemented in IQ-TREE (Minh et al., 2020). To do so, execute the following command: iqtree2 -s AspPen8_loci10.fa -bb 1000 -m TEST which will identify the best-fitting substitution model using ModelFinder (Kalyaanamoorthy et al., 2017). Bipartition support will be examined using 1000 ultrafast bootstrap approximations (Hoang et al., 2017).

After exeucting this command, a number of different files will be made. Readers can obtain more information about the various outputted files on the IQ-TREE website. For now, we will examine the contents of the file with best tree identified during tree search, AspPen8_loci10.fa.treefile. To view the contents in a friendly format, we will use the "print_tree" function in PhyKIT by executing the following command:


phykit print_tree AspPen8_loci10.fa.treefile
     _____________________ Aspergillus_fischeri
    |
    |                      , Aspergillus_flavus
    |   ___________________|
    |__|                   | Aspergillus_oryzae
   _|  |
    |  |________________________________________ Aspergillus_nidulans
    |
    |        ____________________________________________ Penicilliopsis_zonata
    |       |
    |_______|                                   ___ Penicillium_antarcticum
            |                          ________|
            |_________________________|        |__ Penicillium_arizonense
                                      |
                                      |______________ Penicillium_digitatum
									

5.2. Root the phylogeny using "root_tree" in PhyKIT
This phylogeny is unrooted (you can tell by the trifurcating first branch). An outgroup taxon, Penicilliopsis zonata, was included; thus, we can root the phylogeny and then reexamine the rooted tree topology. To root the phylogeny, we will use the "tree_root" function in PhyKIT. First, we need to create a file that lists all outgroup taxa, which in this case is just a single species, Penicilliopsis_zonata, by executing the following command: echo "Penicilliopsis_zonata" > root.txt Next, root the tree with the following command: phykit root_tree AspPen8_loci10.fa.treefile -r root.txt . This will output a file named AspPen8_loci10.fa.treefile.rooted. View the rooted tree using the same command as before.

phykit print_tree AspPen8_loci10.fa.treefile.rooted 
                                 ____________ Aspergillus_fischeri
                                |
                             ___|            , Aspergillus_flavus
                            |   |   _________|
                            |   |__|         | Aspergillus_oryzae
                            |      |
     _______________________|      |_____________________ Aspergillus_nidulans
    |                       |
    |                       |                  __ Penicillium_antarcticum
    |                       |              ___|
   _|                       |_____________|   |_ Penicillium_arizonense
    |                                     |
    |                                     |_______ Penicillium_digitatum
    |
    | Penicilliopsis_zonata
									
Next, we can cross reference this phylogeny with the previously published Aspergillus and Penicillium phylogeny. Careful examination reveals that the two phylogenies are perfectly congruent. Thus, we have successfully inferred the species tree among these eight taxa and found that our inferred tree supports previously inferred relationships - awesome!


Top of page

Closing remarks

In this tutorial, we conducted phylogenomic analysis in five basic steps. The workflow executed here can be, and often is, supported by other analyses. For example, in the paper referenced earlier—in which a species tree for Aspergillus and Penicillium species was inferred—a phylogenomic subsampling strategy was implemented to identify incongruent bipartitions. Also, this tutorial should not be thought of as the phylogenomic workflow. Numerous workflows have been presented over the years and I encourage readers to find what works best for their research question.

Congratulations on inferring your species tree! Please do not hesitate to contact me if you have any questions or constructive comments.

Top of page

References

1. Hoang,D.T. et al. (2018) UFBoot2: Improving the Ultrafast Bootstrap Approximation. Mol. Biol. Evol., 35, 518–522.
2. Kalyaanamoorthy,S. et al. (2017) ModelFinder: fast model selection for accurate phylogenetic estimates. Nat. Methods, 14, 587–589.
3. Kapli,P. et al. (2020) Phylogenetic tree building in the genomic age. Nat. Rev. Genet.
4. Minh,B.Q. et al. (2020) IQ-TREE 2: New Models and Efficient Methods for Phylogenetic Inference in the Genomic Era. Mol. Biol. Evol., 37, 1530–1534.
5. Philippe,H. et al. (2005) Phylogenomics. Annu. Rev. Ecol. Evol. Syst., 36, 541–562.
6. Stamatakis,A. (2014) RAxML version 8: a tool for phylogenetic analysis and post-analysis of large phylogenies. Bioinformatics, 30, 1312–1313.
7. Steenwyk,J.L. et al. (2019) A Robust Phylogenomic Time Tree for Biotechnologically and Medically Important Fungi in the Genera Aspergillus and Penicillium. MBio, 10.
8. Steenwyk,J.L., et al. (2021) BioKIT: a versatile toolkit for processing and analyzing diverse types of sequence data. bioRxiv, 2021.10.02.462868.
9. Steenwyk,J.L. et al. (2020) ClipKIT: A multiple sequence alignment trimming software for accurate phylogenomic inference. PLOS Biol., 18, e3001007.
10. Steenwyk,J.L., et al. (2021) PhyKIT: a broadly applicable UNIX shell toolkit for processing and analyzing phylogenomic data. Bioinformatics.
11. Steenwyk,J.L. and Rokas,A. (2021) orthofisher: a broadly applicable tool for automated gene identification and retrieval. G3 Genes|Genomes|Genetics, 11.
12. Yang,Z. and Rannala,B. (2012) Molecular phylogenetics: principles and practice. Nat. Rev. Genet., 13, 303–314.
13. Zdobnov,E.M. et al. (2017) OrthoDB v9.1: cataloging evolutionary and functional annotations for animal, fungal, plant, archaeal, bacterial and viral orthologs. Nucleic Acids Res., 45, D744–D749.


Top of page