Link to HAL – pasteur-04337999
International Conference on Clinical Metagenomics 2023, Nov 2023, Geneva, Switzerland
Understanding the virosphere, including the realm of unknown viruses, is a critical objective in virology. Metagenomic taxonomic assignment tools have traditionally relied on datasets derived from biological samples or in silico spiked samples containing known viral sequences present in public databases, which limits their capacity to detect novel or distant viruses. Simulating realistic evolutionary directions is therefore key to benchmarking and improving these tools. We introduce “Virus Pop,” a tool designed to address this limitation by simulating realistic protein sequences and expanding protein phylogenetic trees with new branches. It does so by generating simulated sequences with site-specific substitution rate variations, mirroring authentic protein evolution, and by inferring ancestral sequences corresponding to multiple internal nodes of the phylogenetic tree of the input data. Combined, these two strategies generate sequences that can be inserted precisely at points of interest in the phylogenetic tree of the group under study. To provide a user-friendly experience, Virus Pop can automatically fetch input data from NCBI given only a taxonomic group name. Every subsequent step can then be precisely parameterized or run with default parameters. Using the spike protein of sarbecoviruses as an example, we demonstrated that Virus Pop produces simulated sequences that closely match the structural and functional characteristics of real protein sequences. Virus Pop also succeeded in creating sequences that resemble real sequences not included in the databases. In conclusion, Virus Pop presents a powerful solution for assessing taxonomic assignment tools and enhancing databases to improve the detection of distant viruses. By addressing the challenge of simulating realistic evolutionary processes, Virus Pop opens new possibilities in the quest to understand and uncover the mysteries of the virosphere.
Its utility extends to both practical applications and potential advancements in virology and bioinformatics.
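
To illustrate what site-specific substitution rate variation means in practice, the following minimal Python sketch evolves a protein sequence along a single branch starting from an ancestral sequence. It is not Virus Pop's implementation and makes several simplifying assumptions that are not taken from the abstract: per-site rates are drawn from a gamma distribution with mean 1, substitutions follow an equal-rates (Poisson) amino acid model rather than an empirical matrix, and the function names, the toy ancestral sequence, and all parameter values are hypothetical.

```python
import math
import random

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def draw_site_rates(n_sites, alpha, rng):
    """Draw one relative rate per site from a gamma distribution with mean 1.

    Assumption: gamma-distributed rate heterogeneity (shape alpha, scale 1/alpha).
    Small alpha values give a few fast sites and many nearly invariant ones.
    """
    return [rng.gammavariate(alpha, 1.0 / alpha) for _ in range(n_sites)]


def evolve(ancestor, site_rates, branch_length, rng):
    """Evolve a protein along one branch under an equal-rates (Poisson) model.

    branch_length is in expected substitutions per site for a site of rate 1.
    Each site is scaled by its own relative rate, so slow sites stay conserved.
    """
    if len(ancestor) != len(site_rates):
        raise ValueError("ancestor and site_rates must have the same length")
    k = len(AMINO_ACIDS)
    descendant = []
    for residue, rate in zip(ancestor, site_rates):
        # Probability that the residue differs at the end of the branch
        # under a k-state equal-rates model.
        p_change = (k - 1) / k * (1.0 - math.exp(-k / (k - 1) * rate * branch_length))
        if rng.random() < p_change:
            residue = rng.choice([a for a in AMINO_ACIDS if a != residue])
        descendant.append(residue)
    return "".join(descendant)


if __name__ == "__main__":
    rng = random.Random(42)  # fixed seed so the run is reproducible
    ancestor = "MKTAYIAKQRQISFVKSHFSRQLEERLGLIEVQ"  # made-up toy sequence
    rates = draw_site_rates(len(ancestor), alpha=0.5, rng=rng)
    print(ancestor)
    print(evolve(ancestor, rates, branch_length=0.3, rng=rng))
```

In a setting like the one described above, the starting sequence would be an inferred ancestral sequence at an internal node of the tree, and the branch length would control how distant the simulated sequence is from that node.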