When we study the evolution of a species, we use different models depending on what we want to achieve or infer. We might restrict ourselves to SNP variation in the “core genome” (presumably inherited vertically) to study phylogeography or an outbreak. By reducing the problem to the analysis of SNPs (and invariant sites), researchers have been able to build a range of sophisticated phylogenetic models. However, once we try to incorporate genome organisation, chromosomal rearrangements, or the movement of plasmids, transposons or phage, the modelling problem becomes far harder. The question of how to properly model bacterial genetic variation is wide open and extremely challenging.
A prerequisite for any solution is a decision on how to describe the variation in the first place, since you cannot model variation until you can represent it. Note that this is true even if you have perfect genome assemblies: even if it were possible to build a multiple sequence alignment of them, this would not really help in recognising that a SNP at one position in one genome is “the same” as a SNP somewhere else in another.
In this talk, I want to discuss a solution we have been developing to this representation problem. We show how it is possible to represent the pan-genome of a species as a network of “floating” graphs, representing the ensemble of known variation in orthology blocks (we use genes and intergenic regions, but this could also be done for mobile elements). In doing so, it becomes possible to discover and describe genetic variation at both fine (SNP/indel) and coarse (gene order) levels.
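To make the representation idea concrete, here is a minimal sketch in Python. It is an illustration under simplifying assumptions, not the method presented in the talk: each block's local graph is reduced to a dictionary of known alleles, and every name in it (`LocusGraph`, `Genome`, `fine_variation`, `same_locus_order`, the loci and sequences) is hypothetical. A genome is described as an ordered list of (block, allele) pairs, so SNPs are located relative to their block rather than to a whole-genome coordinate, and block order is compared separately.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Tuple


@dataclass
class LocusGraph:
    """Hypothetical stand-in for one local graph: the known alleles of a block."""
    name: str
    alleles: Dict[str, str] = field(default_factory=dict)  # allele id -> sequence

    def snp_positions(self, a: str, b: str) -> List[int]:
        """Block-relative positions where two equal-length alleles differ."""
        seq_a, seq_b = self.alleles[a], self.alleles[b]
        if len(seq_a) != len(seq_b):
            raise ValueError("this sketch only compares equal-length alleles")
        return [i for i, (x, y) in enumerate(zip(seq_a, seq_b)) if x != y]


@dataclass
class Genome:
    """A genome as an ordered list of (block name, allele id) pairs."""
    name: str
    loci: List[Tuple[str, str]]


def fine_variation(graphs: Dict[str, LocusGraph], g1: Genome, g2: Genome) -> Dict[str, List[int]]:
    """Fine level: SNPs per shared block, in block coordinates."""
    alleles1, alleles2 = dict(g1.loci), dict(g2.loci)
    result = {}
    for locus in alleles1.keys() & alleles2.keys():
        diffs = graphs[locus].snp_positions(alleles1[locus], alleles2[locus])
        if diffs:
            result[locus] = diffs
    return result


def same_locus_order(g1: Genome, g2: Genome) -> bool:
    """Coarse level: do the two genomes list their blocks in the same order?"""
    return [name for name, _ in g1.loci] == [name for name, _ in g2.loci]


# Toy data, invented for illustration only.
graphs = {
    "geneA": LocusGraph("geneA", {"ref": "ACGTAC", "alt": "ACGGAC"}),
    "geneB": LocusGraph("geneB", {"ref": "TTGACA"}),
}
g1 = Genome("sample1", [("geneA", "ref"), ("geneB", "ref")])
g2 = Genome("sample2", [("geneB", "ref"), ("geneA", "alt")])

print(fine_variation(graphs, g1, g2))  # SNP in geneA at block position 3
print(same_locus_order(g1, g2))        # False: the blocks are rearranged
```

Because the SNP is recorded against geneA rather than against a genome-wide coordinate, it is recognised as the same variant even though geneA sits at a different place in the two toy genomes.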
This is a major research theme for my group, and I will describe progress to date, including results on both Illumina and Nanopore data.