Keywords: bioinformatics, phylogeny, multiple sequence alignment, RNA virus tree
Sequence alignments are ubiquitous in molecular biology, but when sequences are distantly related, alignment columns are often ambiguous, meaningless or wrong. Standard practice is to make an alignment using well-known software (say, MUSCLE), hoping that it is good enough for the downstream analysis. For example, if you make a phylogenetic tree, you must assume or hope that the alignment is correct, or that alignment errors can be neglected, or that alignment errors will be reflected in low bootstrap confidence. I describe a novel approach which enables quantitative testing of these assumptions. An ensemble of alternative alignments (say, 100) is generated by introducing small variations into gap penalties and other parameters. Parameter variations are small enough that alignment accuracy is not degraded on benchmark tests, but large enough that different alignments are obtained for diverged sequences. Confidence of a given alignment column is assessed by how often it is reproduced in the ensemble, similarly to bootstrap confidence in a tree branch. The Felsenstein non-parametric bootstrap procedure for tree inference is generalized to re-sample columns from an ensemble rather than a single alignment. I re-analyze an influential published tree of RNA virus polymerase genes, showing that high bootstrap confidences are artifacts of systematic errors in the authors’ alignment.
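The two ideas in the abstract, column confidence and the ensemble bootstrap, can be illustrated with a minimal Python sketch. It is not the speaker's implementation: the function names, the toy alignments, and two choices are assumptions made for illustration. A column is taken to be "reproduced" when another alignment contains a column aligning exactly the same residues, and a bootstrap replicate is drawn by picking an alignment uniformly at random and then one of its columns.

```python
import random
from collections import Counter

def column_signatures(alignment):
    """One signature per column: for each sequence, the index of the
    residue placed in that column, or None where the sequence has a gap."""
    next_residue = [0] * len(alignment)
    signatures = []
    for col in range(len(alignment[0])):
        sig = []
        for i, row in enumerate(alignment):
            if row[col] == "-":
                sig.append(None)
            else:
                sig.append(next_residue[i])
                next_residue[i] += 1
        signatures.append(tuple(sig))
    return signatures

def column_confidences(reference, ensemble):
    """Fraction of ensemble alignments that contain each reference column."""
    counts = Counter()
    for aln in ensemble:
        counts.update(set(column_signatures(aln)))
    return [counts[sig] / len(ensemble) for sig in column_signatures(reference)]

def ensemble_bootstrap(ensemble, n_columns, rng):
    """Draw n_columns columns with replacement, pooled over the ensemble
    (assumed scheme: pick an alignment, then a column, both uniformly)."""
    columns = []
    for _ in range(n_columns):
        aln = rng.choice(ensemble)
        col = rng.randrange(len(aln[0]))
        columns.append([row[col] for row in aln])
    # Transpose the sampled columns back into one row per sequence.
    return ["".join(c[i] for c in columns) for i in range(len(ensemble[0]))]

# Toy ensemble: three alternative alignments of the sequences ACGT and AGT.
ensemble = [
    ["ACGT", "A-GT"],
    ["ACGT", "AG-T"],
    ["ACGT", "A-GT"],
]
conf = column_confidences(ensemble[0], ensemble)
replicate = ensemble_bootstrap(ensemble, 10, random.Random(0))
print(conf)
print(replicate)
```

In this toy case the first and last columns appear in all three alignments, while the two middle columns of the first alignment appear in only two of the three. Each replicate would then be passed to a tree-building program, exactly as in the standard bootstrap.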
This seminar will be online at the following link: https://teams.microsoft.com/l/meetup-join/19%3a8ba527d9b67a49f99a705c485e691b61%40thread.tacv2/1633593184520?context=%7b%22Tid%22%3a%22096815dc-d9eb-4bc3-a5a3-53c77e7d34e2%22%2c%22Oid%22%3a%220c1ed5a9-f397-4c25-861e-508686c39914%22%7d
An “online coffee” room is available for discussions between participants and the speaker after the talk: https://spatial.chat/s/dbcseminar