NNNS Chemistry blog

CRD update

10 August 2020 - Data

An update on the Chemical Reaction Database (CRD) project, started back in 2012 (link, link, link). The idea is simple: construct a public website containing a database of chemical reactions as an alternative to the existing paywalled ones. Reaction data is taken from the general literature; molecule names are converted to SMILES by OPSIN, and SVG images are generated with the Obabel library. Eight years down the road, several flaws in the plan have become apparent. The name CRD itself is flawed because only organic reactions are of interest (personal choice). The graphic representation of a reaction (SVG images of the reactants followed by that of the product) is also flawed because Obabel was not designed to handle reactions. The obvious third flaw is the staggering number of known organic reactions in the literature, all of which have to be entered manually. Even with optimized web forms the total stands at a pitiful 398 reactions. Time for a new approach.

With respect to image drawing, I decided to write my own SVG drawing software. After all, you can retrieve molfiles from OPSIN, and each molfile gives you the 2D coordinates of every atom and the order of every bond in a molecule. Combined with an SVG drawing library, basic programming skills and ample time on account of the lockdown, what could go wrong? Did the software write itself? Not really. A 2D framework was easily made, this time with a standard bond length regardless of molecule size (in Obabel every molecule is fitted into the same 100 by 100 pixels), so that was a win. The trouble started with double bonds: how to position the second line relative to the first, and relative to other double bonds. Obabel handles this very well, and knowing which carbon atoms belong to a ring must be part of the solution. The other obvious problem was how to keep different parts of the molecule from crowding each other in the 2D plane. Again Obabel is able to handle this, but there are limitations (a taxol skeleton, for example, goes completely haywire).
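To make the molfile-to-SVG idea concrete, here is a minimal sketch in Python (standard library only). It is not the code used for the project. It assumes a V2000 molfile, reads the fixed-width atom and bond blocks, scales everything so that the first bond has a fixed on-screen length, and always draws the second line of a double bond on the same side of the first. The names `parse_molfile`, `to_svg`, `BOND_PX` and `GAP_PX` are made up for this illustration, and atom labels are left out.

```python
import math

# Hand-written V2000 molfile for ethene (C=C), hydrogens omitted.
ETHENE_MOLFILE = """\

  sketch

  2  1  0  0  0  0  0  0  0  0999 V2000
    0.0000    0.0000    0.0000 C   0  0  0  0  0  0  0  0  0  0  0  0
    1.3000    0.0000    0.0000 C   0  0  0  0  0  0  0  0  0  0  0  0
  1  2  2  0
M  END
"""

BOND_PX = 40.0  # assumed on-screen length of one bond, in pixels
GAP_PX = 5.0    # assumed spacing between the two lines of a double bond


def parse_molfile(text):
    """Return (atoms, bonds): atoms as (symbol, x, y), bonds as (i, j, order), 0-based."""
    lines = text.splitlines()
    counts = lines[3]  # fourth line holds the atom and bond counts
    n_atoms, n_bonds = int(counts[0:3]), int(counts[3:6])
    atoms = []
    for line in lines[4:4 + n_atoms]:
        x, y = float(line[0:10]), float(line[10:20])
        atoms.append((line[31:34].strip(), x, y))
    bonds = []
    for line in lines[4 + n_atoms:4 + n_atoms + n_bonds]:
        bonds.append((int(line[0:3]) - 1, int(line[3:6]) - 1, int(line[6:9])))
    return atoms, bonds


def _line(x1, y1, x2, y2):
    return (f'<line x1="{x1:.1f}" y1="{y1:.1f}" x2="{x2:.1f}" y2="{y2:.1f}" '
            'stroke="black"/>')


def to_svg(atoms, bonds, margin=20.0):
    """Draw bonds as SVG lines; the image grows with the molecule."""
    i, j, _ = bonds[0]
    ref = math.hypot(atoms[j][1] - atoms[i][1], atoms[j][2] - atoms[i][2])
    scale = BOND_PX / ref  # the first bond defines the standard length
    xs = [a[1] for a in atoms]
    ys = [a[2] for a in atoms]
    # Molfile y points up, SVG y points down, hence (max y - y).
    pts = [((x - min(xs)) * scale + margin, (max(ys) - y) * scale + margin)
           for _, x, y in atoms]
    width = (max(xs) - min(xs)) * scale + 2 * margin
    height = (max(ys) - min(ys)) * scale + 2 * margin
    parts = [f'<svg xmlns="http://www.w3.org/2000/svg" '
             f'width="{width:.0f}" height="{height:.0f}">']
    for i, j, order in bonds:
        (x1, y1), (x2, y2) = pts[i], pts[j]
        parts.append(_line(x1, y1, x2, y2))
        if order == 2:
            # Naive choice: shift the second line along one fixed normal.
            # Picking the correct side (e.g. towards a ring) is the hard part.
            length = math.hypot(x2 - x1, y2 - y1)
            nx = -(y2 - y1) / length * GAP_PX
            ny = (x2 - x1) / length * GAP_PX
            parts.append(_line(x1 + nx, y1 + ny, x2 + nx, y2 + ny))
    parts.append("</svg>")
    return "\n".join(parts)


print(to_svg(*parse_molfile(ETHENE_MOLFILE)))
```

Because the scale is tied to a bond rather than to the canvas, a larger molecule simply produces a larger image. The fixed-side rule for double bonds is exactly the shortcut that breaks down in rings and conjugated systems.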

It was time to abandon this part of the project and look for alternatives. Enter RDKit, a general-purpose cheminformatics toolkit. Not only can it draw molecules from SMILES input, it can also handle reaction SMILES. It is also open source, so why reinvent the wheel? The API offers a glimpse into how the overcrowding problem is solved: the algorithm makes substituents perform random rotations until the overcrowding is lifted.
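For readers who have not met reaction SMILES: it is a single string of the form reactants>agents>products, with the species inside each section separated by dots. The sketch below only illustrates that layout with plain string handling. It is not how a toolkit parses it, and `split_reaction_smiles` and `Reaction` are names invented for this example.

```python
from typing import List, NamedTuple


class Reaction(NamedTuple):
    reactants: List[str]
    agents: List[str]
    products: List[str]


def split_reaction_smiles(rxn: str) -> Reaction:
    """Split 'reactants>agents>products' into three lists of SMILES strings."""
    sections = rxn.strip().split(">")
    if len(sections) != 3:
        raise ValueError("expected exactly two '>' separators")
    # Caveat: a dot also separates the ions of a salt, so one species
    # written as '[Na+].[Cl-]' would be split in two by this naive approach.
    return Reaction(*[[s for s in section.split(".") if s] for section in sections])


# Fischer esterification: acetic acid + ethanol, sulfuric acid as agent.
rxn = split_reaction_smiles("CC(=O)O.OCC>OS(=O)(=O)O>CC(=O)OCC.O")
print(rxn.reactants, rxn.agents, rxn.products)
```

The middle section may be empty, as in `CCO>>CC=O`, which then yields an empty list of agents.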

And creating new database records? For some time now a US patent reaction database has been in the public domain, thanks to Daniel Lowe, who found a way to automatically scrape content from the raw patent literature. The database comes as a collection of XML files, and I am now in the process of transferring them to a SQL database. Some tinkering had to be done. Not all patents have useful content, and the distinction between a reactant chemical and a workup chemical is not always clear, but keep in mind that this type of data mining as devised by Lowe must have been tremendously difficult. With a data set covering the years 2002 to 2016, I am now back in 2004 with 111,000 records, 280 MB in total and counting. Will keep you posted!
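The XML-to-SQL transfer could look roughly like the sketch below, using only Python's standard library and an in-memory SQLite database. The XML layout here is a simplified stand-in invented for this example (the real patent files use a richer schema), and the table names, column names and the `load_reactions` function are assumptions as well.

```python
import sqlite3
import xml.etree.ElementTree as ET

# Made-up, simplified XML; the patent number is a placeholder.
SAMPLE_XML = """<reactionList>
  <reaction patent="US-HYPOTHETICAL-1" year="2004">
    <reactant smiles="CC(=O)O"/>
    <reactant smiles="OCC"/>
    <agent smiles="OS(=O)(=O)O"/>
    <product smiles="CC(=O)OCC"/>
  </reaction>
</reactionList>"""

ROLES = ("reactant", "agent", "product")


def load_reactions(conn, xml_text):
    """Insert every <reaction> element and its components into two tables."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS reaction ("
        "id INTEGER PRIMARY KEY, patent TEXT, year INTEGER)"
    )
    conn.execute(
        "CREATE TABLE IF NOT EXISTS component ("
        "reaction_id INTEGER REFERENCES reaction(id), role TEXT, smiles TEXT)"
    )
    root = ET.fromstring(xml_text)
    for rx in root.iter("reaction"):
        cur = conn.execute(
            "INSERT INTO reaction (patent, year) VALUES (?, ?)",
            (rx.get("patent"), int(rx.get("year"))),
        )
        reaction_id = cur.lastrowid
        # One row per molecule, tagged with the role it plays in the reaction.
        rows = [(reaction_id, child.tag, child.get("smiles"))
                for child in rx if child.tag in ROLES]
        conn.executemany("INSERT INTO component VALUES (?, ?, ?)", rows)
    conn.commit()


conn = sqlite3.connect(":memory:")
load_reactions(conn, SAMPLE_XML)
print(conn.execute("SELECT role, smiles FROM component").fetchall())
```

Keeping the role in its own column means a chemical that was filed under the wrong heading (reactant versus workup, say) can be fixed later with a single UPDATE instead of a re-import.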