Chematica on linguistics

08 August 2014 - The CRD

Chematica caused quite a stir back in 2012 with its announcement of a "chemical brain" and a "Google on steroids". Chematica is software that promises to design synthetic pathways by computer algorithm and contains a database of 10 million compounds interconnected by reaction vectors. It would make the retrosynthetic chemist obsolete. Chematica is associated with Northwestern University, Illinois, but it is also a commercial product. The website does not have a lot to offer. How was the chemical reaction database compiled? A licence from SciFinder or Reaxys would cost a fortune. So far all claims made by Chematica are untested by independent sources and remain unchallenged. Is this network really second in size only to the World-Wide Web, as the website claims? Surely computer-generated brains larger than that must exist.

Chematica is back in the news with an article in Angewandte Chemie on organic chemistry as a language. Too bad Angewandte has a crappy disclosure policy: the disclosure is tucked away as the last entry in the references. Other chemistry publications position disclosures more prominently. The article describes the relationship between organic molecules and the English language. In linguistic terms the latter follows Zipf's law, and it would be interesting to see if the same holds for the former. Key finding: it does, though based not on functional groups but on molecular fragments in general.
In linguistics, pairs of sentences taken from a set of sentences are compared and their maximum common substrings (MCS) are recorded. In a ranking of MCSs that follows Zipf's law, the frequency of an entry is roughly inversely proportional to its rank: the top entry is about twice as populous as the second and three times as populous as the third. In the Chematica research the same is done with sets of molecules. Now an MCS can be the isopropylbenzene fragment or a specific carbon-carbon bond in a larger framework. Fragments such as the methyl group that are likely to show up in every molecule are filtered out. Again the ranking follows Zipf's law, but the actual ranking is not disclosed in the article. The research program also had a second goal: find out if in a molecule the "most characteristic bond" (the one with the highest fragment frequency) would also be the prime target for retrosynthetic disconnection. The article does not state a rationale for this hypothesis and this blog is clueless. Disconnections are about chemical opportunity and not about graph theory. In any event, a panel of Ph.D.-level chemists was set to work and "In about 97% of cases, at least one chemist suggested one of the top-three computer-chosen bonds".
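For readers who want to see what the sentence version of this ranking looks like in practice, here is a minimal Python sketch. It is not the method from the article, which is not spelled out in enough detail to reproduce. It assumes that "maximum common substring" means the longest contiguous run of words shared by two sentences, and that the Zipf check is a straight-line fit of log frequency against log rank. The function names are made up for illustration.

```python
from collections import Counter
from itertools import combinations
import math


def longest_common_substring(a, b):
    """Return the longest contiguous run of tokens shared by sequences a and b."""
    best_len, best_end = 0, 0
    prev = [0] * (len(b) + 1)
    for i in range(1, len(a) + 1):
        curr = [0] * (len(b) + 1)
        for j in range(1, len(b) + 1):
            if a[i - 1] == b[j - 1]:
                # Extend the common run that ended at the previous token pair.
                curr[j] = prev[j - 1] + 1
                if curr[j] > best_len:
                    best_len, best_end = curr[j], i
        prev = curr
    return tuple(a[best_end - best_len:best_end])


def rank_common_substrings(sentences):
    """Count the longest common substring of every sentence pair, most frequent first."""
    counts = Counter()
    for s, t in combinations(sentences, 2):
        mcs = longest_common_substring(s.split(), t.split())
        if mcs:  # skip pairs with no word in common
            counts[mcs] += 1
    return counts.most_common()


def zipf_exponent(frequencies):
    """Least-squares slope of log(frequency) versus log(rank).

    Expects at least two frequencies, sorted from most to least frequent.
    A slope close to -1 is what Zipf's law predicts.
    """
    xs = [math.log(rank) for rank in range(1, len(frequencies) + 1)]
    ys = [math.log(f) for f in frequencies]
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    return sxy / sxx
```

Feeding a list of sentences to `rank_common_substrings` and passing the resulting counts to `zipf_exponent` gives a single number to compare against -1. The molecular version would need a graph-based common-substructure search instead of the word-level comparison used here, which is well beyond this sketch.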

See previous update on chemical reaction databases here.