Methodological studies of phylogenetic artifacts caused by compositional biases in sequences
Performance assessment of RY-coding and non-homogeneous models in phylogenetic inferences from nucleotide sequences with significant compositional heterogeneity
In phylogenetic analyses of nucleotide sequences, ‘homogeneous’ substitution models, which assume that base composition is stationary across lineages, are widely used. However, an analysis based on a homogeneous model can yield an artifactual tree when the data exhibit heterogeneous base compositions among sequences. Potential artifacts stemming from compositional heterogeneity in tree reconstruction can be countered by two approaches: ‘RY-coding’ and ‘non-homogeneous (NH)’ models. The former approach converts the four bases into two-state characters, purine (R) and pyrimidine (Y), to homogenize compositions among sequences (Phillips and Penny, 2003). In contrast, the latter approach explicitly incorporates compositional heterogeneity by allocating free model parameters in a branch-by-branch fashion (Galtier and Gouy, 1998; Dutheil and Boussau, 2008). Although these approaches have been applied to several real-world data analyses, their basic properties had not been fully examined in simulation studies.
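The recoding step behind RY-coding can be illustrated with a minimal Python sketch. The function names `ry_recode` and `state_frequencies`, and the choice to leave gaps and ambiguity codes untouched, are assumptions made for illustration; they are not taken from the cited studies.

```python
# Purines (A, G) become R; pyrimidines (C, T, and U for RNA) become Y.
_RY_MAP = {"A": "R", "G": "R", "C": "Y", "T": "Y", "U": "Y"}


def ry_recode(sequence: str) -> str:
    """Recode a nucleotide sequence into two-state R/Y characters.

    Assumption: any other symbol (gaps, N, ambiguity codes) is kept as-is.
    """
    return "".join(_RY_MAP.get(base, base) for base in sequence.upper())


def state_frequencies(sequence: str, states: str) -> dict:
    """Return the relative frequency of each listed state in the sequence."""
    counts = {s: sequence.count(s) for s in states}
    total = sum(counts.values())
    return {s: (counts[s] / total if total else 0.0) for s in states}


# Two invented toy sequences: one AT-rich, one GC-rich.
at_rich = "ATATATGCAT"
gc_rich = "GCGCGCATGC"

print(state_frequencies(at_rich, "ACGT"))            # four-state composition
print(state_frequencies(gc_rich, "ACGT"))
print(state_frequencies(ry_recode(at_rich), "RY"))   # two-state composition
print(state_frequencies(ry_recode(gc_rich), "RY"))
```

In this toy case the two sequences differ strongly in AT content, yet both recode to the same alternating R/Y string, so their two-state compositions are identical. Real data need not homogenize this cleanly.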
In this study, I conducted what was effectively the first simulation study to assess the performance of maximum-likelihood phylogenetic analyses incorporating RY-coding and NH models in the presence of compositional heterogeneity. The two methods were applied to ‘4-taxon’ datasets bearing various degrees of heterogeneity in adenine and thymine (AT) content. Compared with a homogeneous model-based analysis, both RY-coding and NH model-based analyses were better at reconstructing the true phylogenetic relationships in the face of an ~20% difference in AT content among sequences. Nevertheless, I found that the accuracy of phylogenetic inference based on RY-coding depends, at least to some extent, on the substitution process that generated the sequence data of interest (e.g., the transition/transversion ratio). Furthermore, inferences from RY-coding-based analyses can be severely biased when the recoding cannot ameliorate complex patterns of compositional heterogeneity in the data. In contrast, NH models appeared to be robust against all types of compositional heterogeneity examined in this study, and are widely applicable to phylogenetic analyses of various empirical datasets. For more information, please refer to Ishikawa, Inagaki, and Hashimoto (2012), listed in my CV.
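As a rough illustration of what a difference in AT content among sequences means, the following Python sketch measures it on a toy alignment. The helper names `at_content` and `max_at_difference` and the four sequences are hypothetical, invented for this example; they are not the simulated datasets from the study.

```python
def at_content(sequence: str) -> float:
    """Fraction of A and T among the unambiguous bases (A, C, G, T)."""
    seq = sequence.upper()
    at = seq.count("A") + seq.count("T")
    total = at + seq.count("C") + seq.count("G")
    return at / total if total else 0.0


def max_at_difference(alignment: dict) -> float:
    """Largest difference in AT content between any two sequences.

    Assumption: the alignment maps taxon names to sequences and is non-empty.
    """
    values = [at_content(seq) for seq in alignment.values()]
    return max(values) - min(values)


# Hypothetical 4-taxon alignment, invented for illustration only.
toy_alignment = {
    "taxon_A": "ATATATGCAT",
    "taxon_B": "ATTAGCATAT",
    "taxon_C": "ATGCATGCAT",
    "taxon_D": "ATGCATATGC",
}

for name, seq in toy_alignment.items():
    print(f"{name}: AT content = {at_content(seq):.2f}")
print(f"max AT difference = {max_at_difference(toy_alignment):.2f}")
```

Here two taxa have an AT content of 0.80 and the other two 0.60, so the maximum difference is about 0.20.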
Computational challenges in the efficient parallelization of phylogenetic inference with non-homogeneous models on current supercomputing systems
Recent advances in genome sequencing techniques enable us to phylogenetically analyze large matrices composed of hundreds of genes derived from diverse organisms. Such ‘phylogenomic analyses,’ however, are often influenced by heterogeneity in base or amino-acid composition, codon usage, and substitution rate across genomes, or even within a genome. Non-homogeneous (NH) models are expected to be critical for ameliorating the artifacts caused by these systematic biases in phylogenomic analyses. Nevertheless, phylogenomic analyses have been conducted almost exclusively under homogeneous models, for two reasons. First, phylogenetic inference based on NH models can be computationally much more intensive than inference based on homogeneous models, because NH models require a very large number of model parameters to be optimized. Second, the currently available phylogenetic programs that exploit modern parallel computing techniques on many CPUs (and GPUs) implement only homogeneous models. Therefore, there is an urgent need for a new phylogenetic program that combines efficient parallel computing methods with NH models.
For this computational effort, I have collaborated with the laboratory for High Performance Computing Systems at the University of Tsukuba, aiming to parallelize the phylogenetic program ‘NHML’, which implements an NH model that allows the AT content to vary across lineages (Galtier and Gouy, 1998). A fine-grained parallelization with OpenMP was applied to the calculation of site-wise log-likelihoods (site-lnLs) for a given tree, while a coarse-grained parallelization with the Message Passing Interface (MPI) was applied to the evaluation of alternative trees during the maximum-likelihood (ML) tree search based on the subtree pruning and regrafting (SPR) method. In addition to this ‘Hybrid’ parallelization, I newly implemented a medium-grained parallelization with MPI: during the lnL calculation for a given tree, the optimization of model parameters (e.g., the equilibrium AT content on each branch) and of branch lengths can be assigned to different groups of MPI processes and run in parallel. The performance of this ‘multi-grained’ parallelization of NHML was benchmarked by analyzing simulated datasets comprising ~130 species and ~10,000 nucleotide positions. As a result, I achieved an acceptable speedup (i.e., parallel efficiency >= 0.5) of the ML tree inference on up to 64 computational nodes and 1,024 CPU cores on the supercomputer system ‘T2K-Tsukuba’ (http://www.top500.org/system/176215) at the Center for Computational Sciences, University of Tsukuba.
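The parallel-efficiency criterion quoted above can be made concrete with a short Python sketch. It assumes the common definition (speedup divided by the relative increase in processing units). The function name, the choice of a 16-core reference run, and all timings are hypothetical; none of these numbers are measurements from T2K-Tsukuba.

```python
def parallel_efficiency(t_reference: float, t_parallel: float,
                        n_units: int, n_reference_units: int = 1) -> float:
    """Speedup divided by the relative increase in processing units.

    speedup    = t_reference / t_parallel
    efficiency = speedup / (n_units / n_reference_units)
    """
    if t_parallel <= 0 or n_units <= 0 or n_reference_units <= 0:
        raise ValueError("timings and unit counts must be positive")
    speedup = t_reference / t_parallel
    return speedup / (n_units / n_reference_units)


# Made-up wall-clock times in seconds (cores -> time), NOT real benchmarks.
hypothetical_runs = {16: 1000.0, 256: 100.0, 1024: 31.25}

reference_cores = 16
for cores, seconds in hypothetical_runs.items():
    eff = parallel_efficiency(hypothetical_runs[reference_cores], seconds,
                              cores, reference_cores)
    print(f"{cores:5d} cores: efficiency = {eff:.3f}")
```

With these invented timings, the 1,024-core run is 32 times faster than the 16-core reference while using 64 times as many cores, giving an efficiency of 0.5, which sits exactly at the threshold.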
Detection of gene conversion (recombination) events among bacterial sequences based on phylogenetic methods
Bacteria have two paralogs of peptide-chain release factor, RF1 and RF2, which differ from each other in stop-codon recognition. The two RF families are generally expected to have followed independent evolutionary paths after they arose from a single gene-duplication event in the ancestral bacterial genome. However, my survey based on phylogenetic and statistical methods detected inter- or intra-genomic conversions between RF1 and RF2 genes in diverse bacterial genomes. The converted regions encompass a domain that plays a key role in the interaction with the ribosome during the translation termination process. Structural analyses suggested that conversions of this region are functionally neutral for both RF1 and RF2, implying that ‘partial’ conversion between paralogous genes is more frequent than generally assumed. For more detailed information, please see Ishikawa, Kamikawa, and Inagaki (2015), listed in my CV.
Collaborations on large-scale phylogenetic analyses
In addition to the main research themes described above, I have collaborated with a number of evolutionary biologists on the global phylogeny of eukaryotes. In particular, I contributed substantially to two large projects aimed at elucidating the evolutionary affiliations of two novel microbial eukaryotes, Tsukubamonas globosa and Palpitomonas bilix. I took the lead in running the 157-protein phylogenomic analyses to determine the positions of T. globosa and P. bilix in the global phylogeny of eukaryotes. I also carried out statistical analyses to investigate underlying systematic errors (e.g., long-branch attraction, compositional biases, and covarions).