A brief update on the Chemical Reaction Database (CRD) project, but first a recap. The project started in 2012 with the aim of creating a database of organic reactions. Around 2018 a slow realization set in that progress was going to remain slow because data entry, even though optimized for speed, is basically manual labor. Then came the happy find of a collection of US patent data covering 2006 - 2016. Efforts thus far in 2020: redesigned the data management web application from scratch, imported the US patent data and switched to RDKit for the graphical representation of the stored reactions.
The database now contains over one million reactions, a good improvement on the original 400 manually entered reactions. But it is not the quantity that counts but the quality, and the data have serious issues. One issue is that the reagents listed in a reaction are piled together with the workup chemicals. A typical workup chemical such as magnesium sulfate can easily be recognized as such and a correction can be made. But how to handle a reaction that lists both hydrochloric acid and sodium hydroxide as reactants? A reaction is either acid catalyzed or base catalyzed, and in the workup the acid or base is neutralized. The classification, like all data retrieved from the patents, is based on data mining, and from the text it is not always clear where the reaction ends and the workup starts. In the import stage a rule was added to the algorithm stating simply that once a mix and a stir have both been mentioned in a sequence, the remaining steps must be workup. This rule turned out not to be very effective. In any event, the procedure text itself was also imported, so the reactions can be revisited in the future with different rules.
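To make the import rule concrete, here is a minimal sketch of how such a heuristic could look. It is not the actual import code: the (action, chemical) step format, the action labels and the example procedure are all assumptions made for illustration.

```python
from typing import List, Tuple

# Hypothetical step format: (action, chemical). The real import pipeline
# and its action vocabulary are not described in this post.
Step = Tuple[str, str]


def split_reaction_and_workup(steps: List[Step]) -> Tuple[List[Step], List[Step]]:
    """Apply the rule: once both a 'mix' and a 'stir' have been seen,
    every remaining step is classified as workup."""
    seen = set()
    for i, (action, _chemical) in enumerate(steps):
        seen.add(action.lower())
        if {"mix", "stir"} <= seen:
            return steps[: i + 1], steps[i + 1 :]
    # The rule never fired: treat the whole procedure as reaction steps.
    return list(steps), []


# Made-up procedure in which a reagent is added after the first stir.
procedure = [
    ("mix", "aldehyde"),
    ("stir", ""),
    ("add", "sodium borohydride"),
    ("stir", ""),
    ("wash", "hydrochloric acid"),
    ("dry", "magnesium sulfate"),
]

reaction, workup = split_reaction_and_workup(procedure)
print(reaction)  # the first two steps
print(workup)    # the remaining four steps, including the borohydride addition
```

In this made-up procedure the reducing agent ends up in the workup list simply because it was added after the first stir. That is one plausible way for the rule to misfire.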
The second problem is the missing SMILES. The easy solution is to just abandon the reactions that do not smile (a missing SMILES means no reaction SMILES), but the favored solution is to try and rescue these reactions. Missing SMILES crop up if a product is simply called "the product" or "the ester", or if the previous reaction is referenced. These reactions are beyond rescue. In many other cases the names of simple chemicals contain spelling errors or are plainly wrong. It is amazing in how many ways a compound like N,N-diisopropylethylamine or 1,1'-bis(diphenylphosphino)ferrocene can be misspelled. In many cases these entries can be rerouted to the correct entries with a custom crafted database query.
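The rerouting itself is done with database queries, which are not shown here. As a stand-in, the sketch below illustrates the same idea with fuzzy string matching from Python's standard library. The small lookup table, the function name resolve_name and the 0.85 cutoff are assumptions chosen for the example, not values taken from the CRD.

```python
import difflib
from typing import Optional, Tuple

# Tiny illustrative lookup table; a real one would come from the database.
KNOWN_SMILES = {
    "N,N-diisopropylethylamine": "CCN(C(C)C)C(C)C",
    "triethylamine": "CCN(CC)CC",
    "sodium hydroxide": "[Na+].[OH-]",
    "magnesium sulfate": "[Mg+2].[O-]S(=O)(=O)[O-]",
}
_LOWER_TO_CANONICAL = {name.lower(): name for name in KNOWN_SMILES}


def resolve_name(raw: str, cutoff: float = 0.85) -> Optional[Tuple[str, str]]:
    """Return (canonical name, SMILES) for the closest known name,
    or None if nothing is similar enough."""
    key = raw.strip().lower()
    matches = difflib.get_close_matches(
        key, list(_LOWER_TO_CANONICAL), n=1, cutoff=cutoff
    )
    if not matches:
        return None
    canonical = _LOWER_TO_CANONICAL[matches[0]]
    return canonical, KNOWN_SMILES[canonical]


print(resolve_name("N,N-diisoproylethylamine"))  # misspelled, still resolved
print(resolve_name("the product"))               # no usable name: None
```

A high cutoff keeps a vague label such as "the product" from being matched to anything. The price is that badly mangled names stay unresolved and still need a manual look.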
If spelling mistakes in patent literature are common, academic literature is not faring any better, at least if the small sample of articles used for manual data entry is representative. For example, a recent blog post discussing the reaction prediction software platform Chematica offered the opportunity to incorporate a total synthesis of Dauricine in the database. The supplementary info was well organized, with clear procedures and the systematic names of reactant and product, which is the exception and not the rule. All data did find their way into the database but not without some debugging. The systematic name 1-(4-(benzyloxy)-3-(4-(((6,7-dimethoxy-1,2,3,4-tetrahydroisoquinolin-1-yl)methyl)- phenoxy)benzyl)-6,7-dimethoxy-1,2,3,4-tetrahydroisoquinoline was not accepted by OPSIN and had to be rewritten to 4-((1R)-6,7-dimethoxy-1,2,3,4-tetrahydroisoquinolin-1-yl)methyl-2-(4-((1R)-6,7-dimethoxy-1,2,3,4-tetrahydroisoquinolin-1-yl)methylphenoxy)-1-(benzyloxy)-benzene (with square brackets replaced due to blog software incompatibilities). When is a name a correct name? When it is consistent with IUPAC rules? If you take a more practical attitude, a name is correct if OPSIN can handle it.
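One cheap check that can be run before a name ever reaches a name-to-structure parser is bracket balance. The sketch below is only an illustration and not part of the CRD code; the function name brackets_balanced is made up. As quoted above, the rejected name appears to contain six opening and five closing parentheses, while the rewritten name is balanced.

```python
CLOSERS = {")": "(", "]": "[", "}": "{"}


def brackets_balanced(name: str) -> bool:
    """Return True if every bracket in the name is closed in the right order."""
    stack = []
    for ch in name:
        if ch in "([{":
            stack.append(ch)
        elif ch in CLOSERS:
            if not stack or stack.pop() != CLOSERS[ch]:
                return False
    return not stack


# Both names copied from the paragraph above.
REJECTED = (
    "1-(4-(benzyloxy)-3-(4-(((6,7-dimethoxy-1,2,3,4-tetrahydroisoquinolin-1-yl)"
    "methyl)- phenoxy)benzyl)-6,7-dimethoxy-1,2,3,4-tetrahydroisoquinoline"
)
REWRITTEN = (
    "4-((1R)-6,7-dimethoxy-1,2,3,4-tetrahydroisoquinolin-1-yl)methyl-2-"
    "(4-((1R)-6,7-dimethoxy-1,2,3,4-tetrahydroisoquinolin-1-yl)methylphenoxy)"
    "-1-(benzyloxy)-benzene"
)

print(brackets_balanced(REJECTED))
print(brackets_balanced(REWRITTEN))
```

Passing this check says nothing about whether a name is chemically sensible or whether OPSIN will accept it. It only filters out one class of typing error early.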
Another example can be found in this article, where the title of the procedure says 2,5-dibromothieno(3,2-b)thiophene but the 2,6 isomer is supposedly formed. Again, this is a mostly harmless error for a human reader, but when it comes to data mining it is lethal. The poor quality of reaction data sets has been mentioned by others (see blog and blog).
And how realistic is this particular branch of data mining? In 2016 this blog mentioned Peter Murray-Rust and his ContentMine effort. More recently Lee Cronin announced he had cracked synthesis translation (blog here). It is one of my goals to have a go at data mining myself sometime in the future, but I am not optimistic. Mining the content of Organic Syntheses must be doable, but beyond that it will be a long uphill battle. Only the scientific publishers can require authors to organize their supplementary information strictly. But at the same time the scientific publishers that own the currently existing reaction databases will argue that there is really no need for data mining. They have expensive databases to sell.