NNNS Chemistry blog

Welcome to my nightmare

31 December 2021 - The CRD

Exactly one year ago I updated you on my Covid project, the Chemical Reaction Database (CRD). What was accomplished in 2020? I successfully imported the USPTO database into a SQL database, performed a data cleanup operation, created a web interface for data handling, created a new website for public access to the data and found a new way to visualize the reactions.

The data cleanup continued into 2021. The reaction data set is now free of duplicate reactions (hopefully), free of duplicate compounds (hopefully), free of non-existent compounds (hopefully) and free of nonsensical reactions (hopefully). A persistent batch of nonsensical compounds consisted of discrete cations or anions such as ammonium or carbonate, where of course you cannot have one without a counterion.
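A check for such lone ions could look something like the sketch below. It is not the code behind the CRD, just a minimal illustration that assumes each compound is stored as a SMILES string and that all formal charges are written in bracket atoms. The names `net_charge` and `is_lone_ion` are made up for this example.

```python
import re

# One bracket atom such as [NH4+], [O-], [Fe+2] or [Cu++].
BRACKET_ATOM = re.compile(r"\[([^\]]*)\]")
# Charge at the end of the bracket content, before an optional atom-map class.
CHARGE = re.compile(r"(\++|-+)(\d*)(?::\d+)?$")


def net_charge(smiles: str) -> int:
    """Sum the formal charges written in the bracket atoms of a SMILES string."""
    total = 0
    for content in BRACKET_ATOM.findall(smiles):
        match = CHARGE.search(content)
        if match is None:
            continue  # neutral bracket atom, e.g. [13CH4] or [C@@H]
        signs, digits = match.groups()
        sign = 1 if signs[0] == "+" else -1
        # "+2" carries an explicit number, "++" is counted by its length.
        magnitude = int(digits) if digits else len(signs)
        total += sign * magnitude
    return total


def is_lone_ion(smiles: str) -> bool:
    """Flag a compound record whose charges do not add up to zero."""
    return net_charge(smiles) != 0


print(is_lone_ion("[NH4+]"))                       # ammonium on its own
print(is_lone_ion("[NH4+].[NH4+].[O-]C([O-])=O"))  # ammonium carbonate
```

Zwitterions and nitro groups pass this test because their charges cancel within the same record, which is the behaviour you would want here.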

Over 100,000 reactions have now been vetted and the total number is expected to reach 400,000. Every compound must have an InChIKey and therefore all reactions are run through OPSIN before being made public. Apologies to the OPSIN server for all the data traffic!
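To illustrate why an InChIKey per compound is so useful: once every compound has one, spotting duplicate reactions becomes a matter of comparing sets of keys. The sketch below is an assumption about how that could be done, not a description of the CRD internals. It treats a reaction as a pair of reactant keys and product keys, and the function names are hypothetical.

```python
import re

# A standard InChIKey is 27 characters:
# 14 letters, a hyphen, 10 letters, a hyphen and 1 letter.
INCHIKEY = re.compile(r"[A-Z]{14}-[A-Z]{10}-[A-Z]")


def is_inchikey(text: str) -> bool:
    """Check the shape of the key only, not whether the compound exists."""
    return INCHIKEY.fullmatch(text) is not None


def reaction_signature(reactants, products):
    """Order-independent fingerprint of a reaction, built from InChIKeys."""
    return (frozenset(reactants), frozenset(products))


def drop_duplicates(reactions):
    """Keep the first occurrence of every reaction signature."""
    seen = set()
    unique = []
    for reactants, products in reactions:
        signature = reaction_signature(reactants, products)
        if signature in seen:
            continue
        seen.add(signature)
        unique.append((reactants, products))
    return unique
```

Two entries that list the same reactants in a different order end up with the same signature, so only one of them survives. Reagents, solvents and conditions are ignored in this sketch, and whether they should count is a design decision.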

Patents aside, many more reactions are of course hidden in the academic literature, and they are of greater value. The reactions in the patent literature are limited in variety and more is expected from the vast body of total synthesis work. But how to import them? One at a time is one option. Given a procedure from a supplementary information file, I can have it included within one minute, depending on the quality of the text. Extra points for any text that includes systematic names for the organic compounds (not compound 1 or S2 or ligand L!) and extra points for publications that do not have 10 general procedures lacking information on scale (just equivalents). Also extra points if the systematic name, when present, can be read by OPSIN and does not contain a logical error. That said, even with a processing time of under a minute, the total number of reactions entered this way is a meager 1200.

But how to speed things up? In a previous blog I mentioned IBM RXN, the free (!) web service that allows you to convert text to a procedure. It runs on AI and it even has an API! A new avenue has now opened up: you copy and paste a procedure into a text box, the IBM API returns a list of procedure steps, then OPSIN returns the molecular data and finally RDKit renders the image. How is that for speed?

If the original text does not have systematic names and just a picture to look at, there is a new tool called Syntelly with a very good graphical editor for redrawing the molecule, and it is also very good at returning a systematic name from that image. Syntelly is Russian and maybe it exists solely for the purpose of industrial espionage. Use at your own peril. Did I just now insult Russians? My apologies!

The last couple of days I have tried out this approach and I am not a happy man. The IBM tool has its quirks. If the text starts with a chemical name, that name is not picked up as part of the procedure. It does not handle newlines and line breaks. The procedure shorthand is not exactly well defined if you want to process the data further, and the list of procedure steps does not tell you when the reaction has completed and the workup starts. Even copying text from PDF files has its quirks: spaces disappear and in one text it was possible to copy everything except the chemical compound names. A publisher's plot to thwart scrapers like me?
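The newline problem at least can be worked around by tidying the pasted text before it goes anywhere near the API. A minimal sketch, using nothing but the Python standard library, is shown below. The function name and the `rejoin_hyphens` switch are inventions for this example, and nothing here fixes missing spaces or missing compound names.

```python
import re


def clean_procedure_text(raw: str, rejoin_hyphens: bool = False) -> str:
    """Flatten text pasted from a PDF into a single line of prose."""
    # Normalise Windows and old Mac line endings first.
    text = raw.replace("\r\n", "\n").replace("\r", "\n")
    if rejoin_hyphens:
        # Glue "dichloro-\nmethane" back together. Off by default because it
        # also destroys real hyphens at a line end, e.g. "tert-\nbutyl".
        text = re.sub(r"(?<=[a-z])-\n(?=[a-z])", "", text)
    # Every remaining line break or run of whitespace becomes one space.
    text = re.sub(r"\s+", " ", text)
    return text.strip()


pasted = "To a solution of the ketone\n(1.0 g)  in THF\r\nwas added NaBH4."
print(clean_procedure_text(pasted))
```

The hyphen option is deliberately opt-in: chemical names are full of genuine hyphens, so blindly removing them at line ends can do more harm than good.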

And then there are the synonyms. The IBM step may deliver a synonym that OPSIN cannot handle, for example BuLi or n-BuLi or nBuLi. The authors of all these supplementary information files can dream up any synonym they want, so what is needed is a comprehensive synonym list. I was not able to find an existing one and therefore had to scrape the Wikipedia pages for every reagent and solvent in the reagent and solvent lists. If Wikipedia felt the scraping was excessive, again my apologies!
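In its simplest form such a synonym list is a lookup table that is consulted before a name is handed to the name parser. The sketch below shows the idea with a tiny hand-made table. The entries and the target names are chosen for illustration only and are not taken from the scraped list, and whether OPSIN accepts each target name would have to be checked.

```python
import re

# Hypothetical excerpt of a synonym table: lower-cased synonym -> preferred name.
SYNONYMS = {
    "buli": "butyllithium",
    "n-buli": "butyllithium",
    "nbuli": "butyllithium",
    "thf": "tetrahydrofuran",
    "dcm": "dichloromethane",
}


def normalise_name(name: str) -> str:
    """Return the preferred name for a known synonym, else the name itself."""
    stripped = name.strip()
    # Collapse inner whitespace and ignore case when looking up the synonym.
    key = re.sub(r"\s+", " ", stripped).casefold()
    return SYNONYMS.get(key, stripped)


for raw in ["BuLi", "n-BuLi", "nBuLi", "2-bromopyridine"]:
    print(raw, "->", normalise_name(raw))
```

Names that are not in the table pass through unchanged, so systematic names still reach the parser as written.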

And what is the competition doing? This blog has been highlighting the Open Reaction Database. This initiative is open to dataset contributions but it remains to be seen if people have datasets to share (not evident from the GitHub repo). And there is also Lee Cronin, who is still pursuing his reaction robot. Hand the robot any procedure as an XDL file and it cooks up the goods. As this recent tweet shows, Cronin can afford to lock up a bunch of students in a basement to crank out these XDL files. Look at the faces of these poor kids after they have just resurfaced. Their nightmare has ended, mine is far from over.

Stay healthy and kill off the virus in 2022.

Rik