CAS number crunching

29 June 2008 - cheminformatics

A cheminformatics analysis of the CAS registry, recently carried out by CAS staff, reveals some interesting figures (Lipkus et al. DOI). Of over 24 million frameworks (the cyclic core of a molecule, with side chains and stereochemistry ignored), 98% contain a heteroatom. In total, 18% of framework atoms are heteroatoms, with nitrogen in the lead (66% of heteroatoms), followed by oxygen (23%) and sulfur (9%). Half of all frameworks have between 20 and 30 atoms, and frameworks with an even number of atoms outnumber those with an odd number, an effect attributed to dimerization.
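
To make the heteroatom figures concrete, here is a minimal sketch of how such shares could be computed for a single framework. It assumes the framework is already available as a plain list of heavy-atom element symbols; the function name `heteroatom_stats` and the morpholine example are illustrative only and are not taken from the Lipkus et al. analysis.

```python
from collections import Counter


def heteroatom_stats(atoms):
    """Return (heteroatom fraction, share of each element among heteroatoms).

    Assumption: `atoms` holds only heavy (non-hydrogen) framework atoms,
    so every symbol other than "C" is counted as a heteroatom.
    """
    hetero = [a for a in atoms if a != "C"]
    fraction = len(hetero) / len(atoms) if atoms else 0.0
    counts = Counter(hetero)
    # Empty when the framework is a pure carbocycle, so no division by zero.
    shares = {element: n / len(hetero) for element, n in counts.items()}
    return fraction, shares


# Hypothetical input: the six ring atoms of morpholine.
morpholine_ring = ["C", "C", "N", "C", "C", "O"]
fraction, shares = heteroatom_stats(morpholine_ring)
print(f"{fraction:.0%} of ring atoms are heteroatoms")
print(shares)
```

Summing the same counts over every framework in a collection, rather than over one ring, would give registry-wide percentages of the kind quoted above.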

A collection of just 143 frameworks is sufficient to describe half the compounds. The distribution of frameworks in fact obeys a power law: a small number of frameworks account for a large share of the molecules, while many frameworks occur only rarely. This power law is comparable to Zipf's law for word counts, or to the rich-get-richer principle, and confirms that the more often a framework has been used in the past, the more likely chemists are to use it again to make a new compound.
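
The "few frameworks cover half the compounds" effect can be reproduced with a toy calculation. The sketch below is not the method used in the paper: it assumes an idealized Zipf-like distribution in which the framework of rank r occurs roughly top/r times, and the names `zipf_counts` and `frameworks_to_cover` are hypothetical helpers introduced here for illustration.

```python
def zipf_counts(n_frameworks, top):
    """Toy Zipf-like frequencies: the rank-r framework occurs about top/r times."""
    return [max(1, round(top / rank)) for rank in range(1, n_frameworks + 1)]


def frameworks_to_cover(counts, fraction=0.5):
    """Smallest number of most-frequent frameworks covering `fraction` of compounds."""
    total = sum(counts)
    if total == 0:
        return 0
    covered = 0
    # Walk frameworks from most to least frequent, accumulating compounds.
    for used, count in enumerate(sorted(counts, reverse=True), start=1):
        covered += count
        if covered >= fraction * total:
            return used
    return len(counts)


# Synthetic data only; these are not CAS registry numbers.
counts = zipf_counts(n_frameworks=100_000, top=1_000_000)
needed = frameworks_to_cover(counts, fraction=0.5)
print(f"{needed} of {len(counts)} frameworks cover half of the compounds")
```

With a heavy-tailed input like this, the number of frameworks needed is a tiny fraction of the total, which is the same qualitative picture as the 143 frameworks reported for the registry. With a flat distribution, by contrast, half of the frameworks would be needed.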