Link to HAL – pasteur-04337913
8th International Conference on Clinical Metagenomics, Nov 2023, Geneva, Switzerland
Background: Virus identification and discovery from metagenomic data are critical for understanding the viral component of diverse ecosystems, including clinical specimens. Microseek is a novel bioinformatics pipeline for comprehensive analysis of metagenomic Next-Generation Sequencing (mNGS) data, with a focus on virus detection and discovery. Detection relies on the fact that viral protein sequences are more highly conserved than the corresponding nucleic acid sequences.

Methods: Microseek's pipeline comprises several key steps. First, it filters raw mNGS data and performs de novo assembly to reconstruct larger contigs from fragmented sequences. Then, it generates translated protein sequences from reads and contigs. Taxonomic classification is determined by applying the Lowest Common Ancestor methodology to the set of best hits identified from three distinct database queries. We demonstrated the effectiveness of Microseek in detecting known and distant viruses using two representative mNGS datasets derived from human tissue and plasma specimens. Our results were compared with those obtained from the reference cloud-based mNGS pipeline, Chan Zuckerberg ID.

Results: Microseek performs robustly and reliably in identifying known viral sequences and excels in the detection of distant pseudoviral sequences, particularly in complex samples such as human plasma. Moreover, it minimizes non-relevant hits, enhancing data accuracy.

Discussion: This pipeline is a user-friendly solution for virus research, allowing researchers to explore viral diversity and evolution from mNGS data with ease. In summary, Microseek holds promise as a valuable tool for advancing our understanding of viruses in various ecological and clinical contexts.
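
As an illustration of the Lowest Common Ancestor step mentioned in the Methods, the following is a minimal sketch and not Microseek's actual implementation. It assumes a toy child-to-parent taxonomy (the `PARENT` table is simplified and hypothetical, skipping intermediate ranks) and assumes that each of the three database queries contributes one best-hit taxon. The function names `lineage` and `lowest_common_ancestor` are likewise illustrative.

```python
from typing import Dict, Iterable, List

# Hypothetical, heavily simplified taxonomy: child -> parent.
# The root is its own parent. Intermediate ranks are omitted.
PARENT: Dict[str, str] = {
    "root": "root",
    "Viruses": "root",
    "Riboviria": "Viruses",
    "Picornaviridae": "Riboviria",
    "Enterovirus": "Picornaviridae",
    "Parechovirus": "Picornaviridae",
    "Flaviviridae": "Riboviria",
    "Pegivirus": "Flaviviridae",
}


def lineage(taxon: str, parent: Dict[str, str]) -> List[str]:
    """Return the path from the root down to `taxon`."""
    path = [taxon]
    while parent[path[-1]] != path[-1]:  # stop once the root is reached
        path.append(parent[path[-1]])
    path.reverse()
    return path


def lowest_common_ancestor(taxa: Iterable[str], parent: Dict[str, str]) -> str:
    """Return the deepest taxon shared by the lineages of all `taxa`."""
    lineages = [lineage(t, parent) for t in taxa]
    if not lineages:
        raise ValueError("at least one taxon is required")
    lca = lineages[0][0]  # every lineage starts at the root
    # Walk down the lineages in parallel until they diverge.
    for level in zip(*lineages):
        if len(set(level)) != 1:
            break
        lca = level[0]
    return lca


# Assumed scenario: one best-hit taxon from each of three database queries.
best_hits = ["Enterovirus", "Parechovirus", "Enterovirus"]
assignment = lowest_common_ancestor(best_hits, PARENT)
print(assignment)
```

In this sketch, hits that agree at the genus level keep a genus-level assignment, whereas conflicting hits push the assignment up to the deepest rank they share, which is how an LCA approach trades resolution for fewer spurious specific assignments.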