On synthesis translators

04 October 2020 - Cheminformatics

Much is going on in the field of chemical data mining. The Glaswegian group of Leroy Cronin has a new paper out detailing a fully automated assembly line for chemical reactions that goes from a text source describing a procedure to a completed product in a reactor (DOI). Their synthesis robot, called Chemputer and unveiled in 2019 (DOI), fits in a regular fume hood and is all pumps, containers, tubes, valves and processors. An earlier blog mentions the Chemputer's involvement (results not yet reported) in a chloroquine synthesis (Link).

Automation in organic chemistry research laboratories seems like an obvious development and perhaps long overdue. The work is all manual labor, it is repetitive, and it could easily be carried out by low-level technicians with a minimum of training. In academic laboratories, of course, labor is cheap, with an abundance of highly motivated students and other researchers. A big plus of a 24/7 robot is the possibility of greatly increasing the number of experiments. Why would you settle for 10 manually executed optimization reactions if you can have 1000 automated ones? With this in mind, a research team from Liverpool has unveiled, also in 2020, another - mobile - robotic chemist.

Added to the Glasgow assembly line in 2020 is SynthReader, an NLP implementation that translates any text describing a chemical reaction into machine-readable code in the Chemical Description Language or XDL (the X is the Greek letter chi, pronounced "kai", as in the Greek word for chemistry). XDL distinguishes 44 different actions (for example add, stir, heat) and describes the chemicals involved with quantities and temperatures. It also handles subtleties such as adding drop-wise and not just adding. Text extraction is based on pattern recognition with over 16 thousand patterns to choose from. Over 500 procedures were successfully extracted from the journal Organic Syntheses this way. The researchers acknowledge that the algorithm is not flawless. They write that "because of the flexibility of natural language, there will always be cases where the algorithm fails and information is lost, and thus, the output of any such algorithm cannot be blindly trusted". Yes, blindly starting reactions this way may have unpleasant consequences.
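To get a feel for what pattern-based extraction means in practice, here is a minimal Python sketch. It is not SynthReader and it does not emit real XDL. The three regular expressions, the `Action` class and the `extract_actions` function are all made up for illustration, and a real system would need thousands of such patterns. The point is only to show how a sentence either hits a pattern and becomes an action with parameters, or hits nothing and is silently lost.

```python
import re
from dataclasses import dataclass, field


@dataclass
class Action:
    """A toy stand-in for one extracted synthesis action (not real XDL)."""
    name: str
    params: dict = field(default_factory=dict)


# Hypothetical patterns, invented for this sketch. Past tense only.
PATTERNS = [
    ("Add", re.compile(
        r"(?P<reagent>[\w\s-]+?) \((?P<amount>[\d.]+ ?(?:mmol|mg|mL|g))\)"
        r" was added(?: (?P<mode>dropwise))?", re.I)),
    ("Stir", re.compile(
        r"was stirred (?:at (?P<temp>[\w\s°-]+?) )?"
        r"for (?P<time>[\d.]+ ?(?:min|h|days?))", re.I)),
    ("Filter", re.compile(r"was filtered", re.I)),
]


def extract_actions(text):
    """Return one Action per sentence that matches a known pattern."""
    actions = []
    # Split on whitespace that follows a full stop; "2.0 g" stays intact.
    for sentence in re.split(r"(?<=\.)\s+", text.strip()):
        for name, pattern in PATTERNS:
            match = pattern.search(sentence)
            if match:
                # Keep only the named groups that actually matched.
                params = {k: v.strip()
                          for k, v in match.groupdict().items() if v}
                actions.append(Action(name, params))
                break
        # A sentence that matches nothing is dropped without any warning.
    return actions


if __name__ == "__main__":
    procedure = (
        "Sodium hydroxide (2.0 g) was added dropwise. "
        "The mixture was stirred for 3 days. "
        "The flask was left over the weekend. "
        "The solid was filtered."
    )
    for action in extract_actions(procedure):
        print(action)
```

The third sentence in the example procedure matches none of the toy patterns, so it never shows up in the output. That is exactly the kind of silent information loss the authors warn about.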

The Cronin group has competition in the chemical procedure extraction field. 2020 also saw the publication of an article from IBM Zurich (DOI) detailing the extraction of chemical synthesis actions from experimental procedures. IBM seems a strange entrant in this field, but we have already seen other big companies like Tencent, Google and Ant Financial active in the field of chemical reaction prediction via AI. The two lead authors of the IBM article at least have a chemical background. The research was based on a dataset owned by NextMove Software, a private company specializing in patent data mining. Reportedly, 60% of the sentences yielded a correct action.

Also in 2020, a Massachusetts team has described a data mining effort (DOI), this time for inorganic reactions. New synthetic procedures were proposed based on artificial intelligence trained on existing literature.

But back to SynthReader. The Chemify website has an online synthesis translator for everyone to try out! The website warns that the translator is a work in progress, but what are the first impressions? This blog is currently working with a large US patent data set, so we have around 1.5 million reaction descriptions to choose from. Take for example US20160272602A1_0618, a simple ester cleavage, but we are off to a bad start. The reader fails to identify the reactant and the base. In a second example (US20080064673A1_0146) a three-day stirring action is missed. Instead, the interpreted actions go from add, add, add straight to filter and dry.

Of course the system was not trained on US patents but on Organic Syntheses preps. The curious thing here is that, although the synthesis translator warns to always use the past tense ("the solution was stirred"), OS has the present tense by default. Unsurprisingly, not a single prep worked out.
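Why tense matters for a pattern-based reader is easy to show with a toy example. The two regular expressions below are hypothetical and not taken from SynthReader. One only knows the past tense, the other also accepts the present tense.

```python
import re

# Hypothetical pattern that only recognises the past tense.
PAST_STIR = re.compile(r"\bwas stirred\b", re.I)

# Hypothetical tense-tolerant variant of the same pattern.
ANY_STIR = re.compile(r"\b(?:is|was|are|were) stirred\b", re.I)


def detects_stir(pattern, sentence):
    """Return True if the pattern finds a stirring action in the sentence."""
    return pattern.search(sentence) is not None


if __name__ == "__main__":
    past = "The solution was stirred for 3 h."
    present = "The solution is stirred for 3 h."
    for sentence in (past, present):
        print(sentence,
              detects_stir(PAST_STIR, sentence),
              detects_stir(ANY_STIR, sentence))
```

With the past-tense-only pattern the present-tense sentence goes unnoticed, even though a chemist reads both sentences as the same instruction.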