CRD year in review

11 December 2022 - And what next

Quick update on the CRD project. The objectives have remained unchanged: collect published organic reactions in a database, as many as possible, aiming for diversity and for relevance to pharma. The ultimate goal is to maintain an open-access reaction training set for use in machine learning.

The story so far: 400K reactions were successfully imported from the 2002-2016 USPTO repository as compiled by Daniel Lowe. Then came collecting reaction sets from the academic literature (with an emphasis on total synthesis), which is of course a snail-paced process. But how to move forward? Luckily the USPTO (the US patent office) publishes weekly updates on its website with fresh new patents to data mine! This has opened the opportunity to continue with the patent literature from 2016 to the present day.

Results so far? Each weekly USPTO update is about a gigabyte of data with up to 8000 individual patents. Isolating the ones with patent classification C07 (organic chemistry) leaves about 200 files and 70 MB. Isolating distinct reaction procedures from each of these files is tricky and an ongoing process. Challenges to tackle: missing systematic names for reactants and products, product names mentioned only in a title and not in the text, reagent jargon, and an assortment of nuisances such as OCR mistakes, character encoding errors and line breaks and spaces in all the wrong places. In the last batch about 2500 procedures were isolated, ready for the next step.
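For the record, the C07 filtering itself does not need anything fancy. A minimal sketch in Python could look like the following; the bulk file name and the classification element names are my assumptions about the weekly XML and should be checked against the actual schema:

```python
import re
from pathlib import Path

# Split a weekly USPTO bulk file (one big concatenation of per-patent XML
# documents) into individual documents and keep the organic-chemistry ones.
# File name and element names (classification-ipcr, section, class) are
# assumptions; adjust them to the real schema. Reading everything into
# memory is fine for a sketch, but the full ~1 GB file is better streamed.

BULK_FILE = Path("ipg221206.xml")   # hypothetical weekly file name
OUT_DIR = Path("c07_patents")
OUT_DIR.mkdir(exist_ok=True)

text = BULK_FILE.read_text(encoding="utf-8", errors="replace")
# Each patent document starts with its own XML declaration.
documents = ["<?xml" + chunk for chunk in text.split("<?xml")[1:]]

c07 = re.compile(
    r"<classification-ipcr>.*?<section>C</section>\s*<class>07</class>",
    re.DOTALL,
)

kept = 0
for i, doc in enumerate(documents):
    if c07.search(doc):
        (OUT_DIR / f"patent_{i:05d}.xml").write_text(doc, encoding="utf-8")
        kept += 1

print(f"kept {kept} of {len(documents)} documents")
```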

Next step? That step requires the isolation of the reactant and product by systematic name, together with a specification of the reagents (leaving out all the workup solvents and chromatography details), all including amounts used, reaction time and reaction temperature. The tool for getting this done can be IBM RXN, with state-of-the-art machine learning capabilities. It dissects any text into a crisp set of procedure steps such as "ADD", "MIX" and "YIELD". With this breakdown, incorporating the ingredients into a relational database is easy. It must be said, the tool does not work flawlessly: it cannot handle line breaks, it cannot handle certain phrases and it occasionally mangles systematic names. Its use of parentheses to capture details clashes with the parentheses used in chemical names, which takes advanced regular expressions to sort out. The most eye-catching disadvantage of IBM RXN is that the API endpoints simply do not work ("too many requests"), a bit strange because you might expect a company like IBM to know a thing or two about servers and to have them in abundant supply.
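For those who want to try it, a minimal sketch of the call through the rxn4chemistry Python wrapper could look like this. The retry loop is my own workaround for the rate limiting, and the example paragraph is made up:

```python
import time
from rxn4chemistry import RXN4ChemistryWrapper

# paragraph_to_actions is the documented entry point of the wrapper; the
# back-off loop is only there because of the "too many requests" errors.
wrapper = RXN4ChemistryWrapper(api_key="YOUR_IBM_RXN_API_KEY")

def paragraph_to_actions(paragraph: str, retries: int = 5) -> list:
    for attempt in range(retries):
        try:
            result = wrapper.paragraph_to_actions(paragraph)
            return result.get("actions", [])
        except Exception:              # e.g. rate limit hit
            time.sleep(2 ** attempt)   # back off and try again
    return []

actions = paragraph_to_actions(
    "To a solution of the ketone (1.0 mmol) in methanol (5 mL) was added "
    "sodium borohydride (1.2 mmol) and the mixture was stirred for 2 hours."
)
print(actions)   # e.g. a list of steps like ADD, STIR, YIELD
```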

Plan B? Of course there is ChemicalTagger, a Java library maintained by the Blue Obelisk people that also interprets chemical procedures. It does not break the text up into pieces like IBM RXN does; instead it marks up every word in it. If the library finds a molecule it wraps it in MOLECULE tags, and in a text string like "2 hours" that whole part is interpreted as a time phrase. With this result in hand it is again possible to isolate all the chemicals in the process and store them in the database. Results so far? From a recent batch of 2500 procedures, 1000 of them successfully made it to the finish line. The custom code responsible for digesting the ChemicalTagger results works with a lot of rules (at least one reactant, all molecules must map to an actual valid molecule) and if one of them is not satisfied the procedure is invalidated.
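A minimal sketch of that digesting step, assuming the ChemicalTagger XML output has been written to file and uses MOLECULE and TimePhrase elements as described above (the real output has more element types and deeper nesting):

```python
import xml.etree.ElementTree as ET

def extract_entities(xml_path):
    """Collect molecule names and time phrases from ChemicalTagger XML output."""
    root = ET.parse(xml_path).getroot()
    molecules = ["".join(m.itertext()).strip() for m in root.iter("MOLECULE")]
    times = ["".join(t.itertext()).strip() for t in root.iter("TimePhrase")]
    return molecules, times

molecules, times = extract_entities("procedure_0001.xml")  # hypothetical file

# Stand-in for the "at least one reactant" rule: no recognised molecules
# means the procedure is invalidated.
if not molecules:
    raise ValueError("procedure invalidated: no molecules recognised")
```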

ChemicalTagger is particularly good at recognizing systematically named molecules. With informal names the results are a cause for headaches. For example, it cannot deal with names such as palladium on carbon, Hantzsch ester or Pd2(dba)3. It also continues to be difficult to figure out whether a solvent was part of the reaction medium or part of the workup procedure.
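One way to take the edge off is a hand-curated synonym table that maps informal names to systematic ones before the molecules are validated; the entries below are merely illustrative and the real list grows with every problematic batch:

```python
# Informal reagent names that the tagger trips over, mapped to names that
# do resolve to a structure or at least to a single canonical label.
INFORMAL_NAMES = {
    "palladium on carbon": "Pd/C",
    "pd/c": "Pd/C",
    "hantzsch ester": "diethyl 1,4-dihydro-2,6-dimethylpyridine-3,5-dicarboxylate",
    "pd2(dba)3": "tris(dibenzylideneacetone)dipalladium(0)",
}

def normalise_reagent(name: str) -> str:
    """Return the canonical name for an informal reagent name, if known."""
    return INFORMAL_NAMES.get(" ".join(name.lower().split()), name)
```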

Final verdict? Well, until last week I was confident that I was going to stick with ChemicalTagger. Please note that all the important parts of the current toolbox (Opsin, ChemicalTagger, OSCAR4, RDKit) derive from the people at Blue Obelisk. Many thanks!

But? Of course, last week was the week that ChatGPT rocked the world, and we have seen many examples on Twitter (yes, Twitter) of what the AI chatbot is capable of. Thanks, OpenAI and Elon Musk! The things you can do with a billion dollars! Naturally I created an account and decided to find out how ChatGPT would answer my chemical procedure questions.

I must say I find the results staggering. Just ask the question "In the following procedure, what are the reactants, the product and the reagents?", follow the question with a chemical procedure, and out comes a specified list of the chemicals. Then ask, for example, "In the following procedure how much Hantzsch ester is needed?" and out comes the answer "1.0 mmol of Hantzsch ester is needed in the procedure". Mind you, with ChemicalTagger the phrases Hantzsch and ester are in no way related. In one example I worked out, the reaction happens to be a photochemical reaction with a 5W blue LED bulb. IBM RXN does not process this information at all, and ChemicalTagger incorrectly assumes "blue LED" must be a molecule, invalidating the entire procedure. ChatGPT, on the other hand, can be asked "Can the following procedure be considered a photochemical reaction?" (answer: yes) or "In the following procedure how is the reaction irradiated?" (answer: "The reaction is irradiated using a 5W blue LED bulb").
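To give an idea of what automating this interrogation could look like, here is a minimal sketch using the openai Python package. ChatGPT itself currently only runs in the browser, so a GPT-3 completion model stands in, and the model name, questions and key are placeholders:

```python
import openai

openai.api_key = "sk-..."  # placeholder key

# The same set of questions is asked for every procedure; the answers can
# then be parsed into the database.
QUESTIONS = [
    "In the following procedure, what are the reactants, the product and the reagents?",
    "In the following procedure, what amounts, reaction time and reaction temperature are used?",
    "Can the following procedure be considered a photochemical reaction?",
]

def interrogate(procedure: str) -> dict:
    """Ask each question about one procedure and collect the answers."""
    answers = {}
    for question in QUESTIONS:
        response = openai.Completion.create(
            model="text-davinci-003",   # GPT-3 stand-in for ChatGPT
            prompt=f"{question}\n\n{procedure}",
            max_tokens=256,
            temperature=0,
        )
        answers[question] = response["choices"][0]["text"].strip()
    return answers
```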

Now I am imagining that for my raw-text-to-database conversion I can apply ChatGPT by asking a series of specific questions for each procedure I throw at it, something like the conversation script sketched above. The tool comes with an API so it lends itself to automation. The only unknown is the subscription price: for now the use of the tool is free, but I am hearing of fees up to 1000 dollars per month. Too rich for my 0 dollar annual R&D budget. In any event, what started in 2020 as a Covid project progressed through 2021 and 2022, and the targets for 2023 are set and aimed for. See you there!