identifying sampling bias in bacterial genome databases

When analyzing the diversity of bacterial populations, it is important to account for potential sampling bias, which can distort a pangenome analysis.
Our tool PhyloThin is a coalescent-theory-based approach to identify prokaryotic genomes that can be considered oversampled.

We are currently writing the manuscript for this tool. Stay tuned for our first results. If you need preliminary access to PhyloThin, please let us know.

reconstructing the history of CRISPR spacer arrays

The immune memory of the prokaryotic defense system CRISPR-Cas is remarkable, as it encodes a chronological record of past infection attempts in a heritable spacer array. SpacerPlacer is a tool designed to reconstruct the ancestral states and events that formed this spacer array. If you want to reconstruct the spacer arrays of your favorite species, check out our GitHub repository. We reconstructed the spacer deletion events in the CRISPRCasdb database, revealing some interesting properties of immunity loss in CRISPR-Cas systems. Have a look at our paper to find out more.

effective large scale coalescent simulations

msprime uses succinct tree sequences to speed up the simulation of ancestral relationships. We are part of the fantastic open-source scientific community that develops new features for msprime, which is maintained by Jerome Kelleher. We have implemented gene conversion in msprime and are currently incorporating additional features that are particularly relevant for microbial evolution.

Have a look at the msprime documentation here.

supervised feature selection for ancestry informative markers

Together with Peter Pfaffelhuber, Franziska Grundner-Culemann, and Veronika Lipphardt, we created AIMsetfinder.
AIMsetfinder is a supervised feature selection approach that identifies sets of Ancestry Informative Markers (AIMs) minimizing the log-loss error of a naive Bayes classifier.
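
To illustrate the general idea, here is a minimal sketch of greedy forward selection that picks markers one at a time so as to minimize the log loss of a naive Bayes classifier. It is not the AIMsetfinder implementation: the function names, the 0/1/2 genotype coding, the Laplace smoothing parameter `alpha`, and the use of the training-set log loss are all assumptions made for this example.

```python
import math
from collections import Counter


def fit_naive_bayes(genotypes, labels, markers, alpha=1.0):
    """Estimate class priors and Laplace-smoothed genotype probabilities."""
    class_counts = Counter(labels)
    priors = {c: class_counts[c] / len(labels) for c in class_counts}
    # counts[c][m][g]: samples of class c carrying genotype g (0, 1 or 2) at marker m
    counts = {c: {m: [0, 0, 0] for m in markers} for c in class_counts}
    for row, c in zip(genotypes, labels):
        for m in markers:
            counts[c][m][row[m]] += 1
    likelihoods = {
        c: {
            m: [(counts[c][m][g] + alpha) / (class_counts[c] + 3 * alpha) for g in range(3)]
            for m in markers
        }
        for c in class_counts
    }
    return priors, likelihoods


def log_posterior(row, priors, likelihoods, markers):
    """Log posterior class probabilities for one sample (log-sum-exp normalized)."""
    log_joint = {
        c: math.log(priors[c]) + sum(math.log(likelihoods[c][m][row[m]]) for m in markers)
        for c in priors
    }
    top = max(log_joint.values())
    log_norm = top + math.log(sum(math.exp(v - top) for v in log_joint.values()))
    return {c: v - log_norm for c, v in log_joint.items()}


def log_loss(genotypes, labels, markers, alpha=1.0):
    """Mean negative log posterior of the true class (training-set log loss)."""
    priors, likelihoods = fit_naive_bayes(genotypes, labels, markers, alpha)
    total = 0.0
    for row, c in zip(genotypes, labels):
        total -= log_posterior(row, priors, likelihoods, markers)[c]
    return total / len(labels)


def select_aims(genotypes, labels, n_markers, alpha=1.0):
    """Greedily add the marker that lowers the log loss the most."""
    selected = []
    remaining = set(range(len(genotypes[0])))
    while remaining and len(selected) < n_markers:
        best = min(
            sorted(remaining),
            key=lambda m: log_loss(genotypes, labels, selected + [m], alpha),
        )
        selected.append(best)
        remaining.remove(best)
    return selected


# Toy data (hypothetical): marker 0 separates the two groups, markers 1 and 2 do not.
genotypes = [
    [0, 0, 1],
    [0, 1, 1],
    [0, 2, 1],
    [2, 0, 1],
    [2, 1, 1],
    [2, 2, 1],
]
labels = ["A", "A", "A", "B", "B", "B"]

print(select_aims(genotypes, labels, n_markers=1))  # → [0]
```

On real data, one would likely score candidate sets on held-out samples rather than on the training set, since the training-set log loss favors overfitting.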

Have a look at what we have done on GitHub.

standardizing population genetic simulations

We are a member of the PopSim Consortium, which aims to standardize population genetic simulations for frequently studied model organisms and species, including humans.
As a result, the software stdpopsim allows users to a) easily re-simulate published population models for many species and b) compare the results of new inference methods against a standardized benchmark.

Have a look at the stdpopsim documentation here.

Pan-genome Analysis and Exploration

Richard Neher, Wei Ding, and I created panX, a pipeline to automatically analyze pangenomes. The most distinctive parts of this project are the phylogeny-based identification of paralogs and the visualization of the pangenome in a browser. This makes it very easy to explore a pangenome and search it for certain genes or features.

Have a look at what we have done on this demo webpage.

panicmage (formerly known as IMaGe)

Panicmage is short for "pangenome analyzer for infinitely - which means considerably - many genes".

The name changed from IMaGe to panicmage, since "image" is possibly one of the most stupid words to search for on Google. In addition, the new name emphasizes that in our model, while there are infinitely many possible genes, only a finite number of genes exists in the population at any time.

Given

  • a genealogy
  • the gene frequency spectrum
  • the number of generations to the most recent common ancestor (optional),

panicmage estimates

  • the parameters of a neutral Infinitely Many Genes Model (gene gain and gene loss rates)
  • the number of core genes of the whole population
  • the expected number of new genes found in the next sequenced strain
  • the size of the persistent pangenome
  • the size of the total pangenome.
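
To give a feeling for these quantities, the following sketch evaluates closed-form expectations for the dispensable genes in a neutral infinitely many genes model. It is not panicmage's code, and it makes assumptions that go beyond the description above: it averages over the standard coalescent instead of using a given genealogy, it uses the scaled gene gain rate `theta` and gene loss rate `rho` as inputs rather than estimating them, and it ignores core genes. All function names are made up for this example.

```python
def expected_gene_frequency_spectrum(n, theta, rho):
    """E[G_k] for k = 1..n: expected number of genes present in exactly k of n genomes.

    Uses E[G_k] = theta / k * prod_{i=0}^{k-1} (n - i) / (n - 1 - i + rho).
    """
    spectrum = []
    product = 1.0
    for k in range(1, n + 1):
        # multiply in the k-th factor of the product
        product *= (n - k + 1) / (n - k + rho)
        spectrum.append(theta / k * product)
    return spectrum


def expected_genome_size(theta, rho):
    """Expected number of dispensable genes in a single genome."""
    return theta / rho


def expected_pangenome_size(n, theta, rho):
    """Expected number of distinct dispensable genes in a sample of n genomes."""
    return theta * sum(1.0 / (rho + i) for i in range(n))


def expected_new_genes(n, theta, rho):
    """Expected number of new genes found when one more genome is added to n genomes."""
    return theta / (rho + n)


# Hypothetical parameter values, chosen only for illustration.
n, theta, rho = 5, 100.0, 2.0
spectrum = expected_gene_frequency_spectrum(n, theta, rho)
for k, g in enumerate(spectrum, start=1):
    print(f"genes in exactly {k} of {n} genomes: {g:.1f}")
print(f"pangenome size: {expected_pangenome_size(n, theta, rho):.1f}")
print(f"new genes in genome {n + 1}: {expected_new_genes(n, theta, rho):.1f}")
```

Two consistency checks follow directly from the formulas: the spectrum sums to the expected pangenome size, and the average number of genes per genome implied by the spectrum equals `theta / rho`.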

In addition, panicmage can simulate distributed genomes and compute the p-value of a gene frequency spectrum for a given genealogy. So far, p-values can be computed for neutral evolution and for existing sampling bias.

To install panicmage, please visit the panicmage GitHub repository.

view panicmage manual