An intriguing map, as published by Virshup et al. in JACS (DOI). But what is it? The Virshup article describes an algorithm for generating a universal library of small molecules optimised for diversity. Apparently a lot of chemical space is left unexplored. This blog made an honest attempt at fleshing out the details, but the article proved stubborn.
The take-off point is surprisingly simple: the algorithm starts with just two molecules, benzene and cyclohexane. In each new generation atoms are substituted, bonds are broken or formed, and molecules are joined as fragments. The new molecular set is then filtered for things like synthetic feasibility, conformational stability and drug-likeness. In addition, only a maximally diverse subset of each generation makes it to the next round. So far so good: the final set contains 9 million very diverse molecules.
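To make the generate, filter and select loop a bit more concrete, here is a minimal Python sketch. It is not the algorithm from the article: molecules are stood in for by made-up three-number descriptors, `mutate`, `join` and `passes_filters` are hypothetical placeholders for the real chemistry, and the diversity step is a plain greedy max-min pick, which is only one of several ways such a selection could be done.

```python
import random

def mutate(mol, rng):
    """Toy stand-in for substituting atoms or changing bonds: nudge one number."""
    i = rng.randrange(len(mol))
    step = rng.choice((-1, 1))
    return mol[:i] + (mol[i] + step,) + mol[i + 1:]

def join(a, b):
    """Toy stand-in for joining two molecules as fragments: add their numbers."""
    return tuple(x + y for x, y in zip(a, b))

def passes_filters(mol):
    """Placeholder for the feasibility, stability and drug-likeness filters."""
    return all(0 <= x <= 10 for x in mol)

def distance(a, b):
    """Manhattan distance between two toy descriptors."""
    return sum(abs(x - y) for x, y in zip(a, b))

def pick_diverse(candidates, k):
    """Greedy max-min pick: keep adding the candidate farthest from those chosen."""
    pool = sorted(set(candidates))
    chosen = [pool.pop(0)]
    while pool and len(chosen) < k:
        best = max(pool, key=lambda m: min(distance(m, c) for c in chosen))
        pool.remove(best)
        chosen.append(best)
    return chosen

def evolve(seeds, generations=5, keep=20, seed=0):
    """Run the generate, filter, select loop for a fixed number of generations."""
    rng = random.Random(seed)
    population = list(seeds)
    for _ in range(generations):
        children = [mutate(m, rng) for m in population for _ in range(4)]
        children += [join(a, b) for a in population for b in population if a != b]
        candidates = [m for m in population + children if passes_filters(m)]
        population = pick_diverse(candidates, keep)
    return population

# Made-up descriptors: (ring atoms, double bonds, rings). An assumption, not real chemistry.
BENZENE = (6, 3, 1)
CYCLOHEXANE = (6, 0, 1)

library = evolve([BENZENE, CYCLOHEXANE])
print(len(library), "toy molecules kept")
```

The point of the sketch is the shape of the loop: every generation grows the candidate pool, the filters prune it, and the diversity pick caps it at `keep` members so the set spreads out rather than piling up around the two seeds.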
The article then describes how diversity is assessed: molecular topology is captured in 40 dimensions and converted to a vector representation. The 9 million molecules (existing or non-existing) are positioned on a grid, with each section sharing some topological traits. The final step is to plot all existing molecules, from let's say the PubChem database, on the same grid. It then emerges that they all appear to cluster together in a vast mountain range, leaving most of the grid eerily empty. But then again, the article proved stubborn.
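The grid comparison can be sketched in a few lines of Python. How the article gets from 40 dimensions down to a map is exactly the part that stayed unclear here, so this sketch simply assumes every molecule already has a two-dimensional coordinate. The data are invented: `library_2d` spreads one point over every grid cell, and `known_2d` is a tight cluster standing in for the existing molecules.

```python
def cell_of(point, cell_size=1.0):
    """Return the (column, row) grid cell that a 2D point falls into."""
    x, y = point
    return (int(x // cell_size), int(y // cell_size))

def occupied_cells(points, cell_size=1.0):
    """Set of all grid cells that contain at least one point."""
    return {cell_of(p, cell_size) for p in points}

# Invented data: the generated library covers a 10 x 10 grid evenly...
library_2d = [(i + 0.5, j + 0.5) for i in range(10) for j in range(10)]
# ...while the "known" molecules sit together in one small patch.
known_2d = [(3 + 0.1 * a, 4 + 0.1 * b) for a in range(15) for b in range(15)]

lib_cells = occupied_cells(library_2d)
known_cells = occupied_cells(known_2d)
unexplored = lib_cells - known_cells

print(len(unexplored), "of", len(lib_cells), "library cells hold no known molecule")
```

With these invented numbers the known molecules touch only four cells, so almost the whole grid counts as unexplored. Real data would of course need a real projection and real coordinates.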