The chemical reaction database

NNNS Chemistry blog

On ORD and CRD

07 November 2021 - Open data

A new and much-needed initiative from the field of chemistry data curation: a novel schema for describing organic reaction data, as highlighted in JACS (Kearnes, Coley et al. DOI). If universally adopted, many of the difficulties with exchanging reaction data would become a thing of the past. And with the people behind the initiative hailing from industry (Pfizer, Merck among them), academia (MIT) and the tech bros (Google), wide adoption of the schema is not unthinkable.

This initiative is not just a set of rules defining the schema. There is also a database, the Open Reaction Database (ORD), which has my interest because I am running the similar-sounding Chemical Reaction Database (CRD).

ORD is open to contributions through a web app located at editor.open-reaction-database.org. This app is basically one long form with inputs and outcomes, each input having a reaction role, an amount, an identifier and so on. The amount of detail is large! The YouTube tutorial explaining just the form runs for an hour. You can add chemical vendors, addition speeds, flow rates, vessel details, stirring rates and more.

I am surprised by some of the mandatory fields. Reactant amounts are mandatory but not always known. Another mandatory field is 'limiting reagent yes/no', but the answer to this question can be calculated from the reactant amounts. The number of roles for a compound is very limited, just 8. Compounds with ligand roles or base roles all have to be added as reagents.
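To illustrate why the 'limiting reagent' flag looks redundant, here is a minimal sketch of how it could be derived. It assumes the amounts are given in moles and that the stoichiometric coefficients are known; the `Reactant` class, the `limiting_reactant` function and the example charge are hypothetical and not part of the ORD schema.

```python
from dataclasses import dataclass


@dataclass
class Reactant:
    name: str
    moles: float        # amount charged, in mol
    coefficient: float  # stoichiometric coefficient in the balanced equation


def limiting_reactant(reactants):
    """Return the reactant with the smallest moles-to-coefficient ratio."""
    if not reactants:
        raise ValueError("need at least one reactant")
    return min(reactants, key=lambda r: r.moles / r.coefficient)


# Hypothetical 1:1 coupling: 1.0 mmol aryl bromide, 1.2 mmol boronic acid.
charge = [
    Reactant("aryl bromide", moles=1.0e-3, coefficient=1),
    Reactant("boronic acid", moles=1.2e-3, coefficient=1),
]

limiting = limiting_reactant(charge)

# The yes/no flag per compound follows directly from the amounts.
flags = {r.name: r is limiting for r in charge}
print(flags)
```

When two reactants are charged in exactly stoichiometric amounts the sketch simply picks the first one, which is a case where an explicit flag in the record could still carry information.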

As a field test, CRD file 1785479 was entered as ORD file ord-ef44929c4e76463483e6d5d1ada3fcf2. But what do you get for your trouble? Well, at least a reaction.pbtxt export file. And that requires some explaining. I expected an output file based on XML or JSON (the USPTO dataset, for example, is a collection of XML files) but ORD is based on so-called Protocol Buffers. This data-serialization format is binary (.pb), was invented by Google as early as 2001 and has a remote soap smell. The ORD datasets can be found on GitHub (https://github.com/open-reaction-database/ord-data), which is great news (open access), but they are only available as binary files, which require specialized software to make them human-readable. The .pbtxt file format appears to be a trade-off.

This is problematic. For example, in publishing research articles, an ideal-world scenario would do away with the supplementary information as we know it (freestyle prose) and replace it with synthetic procedures locked in an agreed-upon data format. If you choose XML, it is both human-readable and machine-readable. A binary protocol buffer, it seems, is only machine-readable.

And the Open Reaction Database itself? Does it have a web presence and is it searchable? Move to client.open-reaction-database.org and (sub)search for reactions by reagent, reaction or dataset.
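As an aside on the XML argument: the sketch below writes a tiny reaction record as XML that a chemist can read by eye, and then parses the same text back with the Python standard library. The element and attribute names are made up for illustration and do not follow the USPTO files, the ORD schema or any other published format.

```python
import xml.etree.ElementTree as ET

# Hypothetical record layout, invented for this example.
reaction = ET.Element("reaction", id="example-1")
reactants = ET.SubElement(reaction, "reactants")
for name, mmol in [("aryl bromide", "1.0"), ("boronic acid", "1.2")]:
    compound = ET.SubElement(reactants, "compound", role="reactant")
    ET.SubElement(compound, "name").text = name
    ET.SubElement(compound, "amount", unit="mmol").text = mmol

# Pretty-print for the human reader (ET.indent needs Python 3.9 or later).
ET.indent(reaction)
xml_text = ET.tostring(reaction, encoding="unicode")
print(xml_text)

# The very same text is machine-readable: parse it back and pull out the amounts.
parsed = ET.fromstring(xml_text)
amounts = {
    c.findtext("name"): float(c.findtext("amount"))
    for c in parsed.iter("compound")
}
print(amounts)
```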

Since 2012 I have been working on CRD as the Chemical Reaction Database, and in 2020 this morphed into kmt.vander-lingen.nl, the open-access and searchable organic reaction database (59 thousand entries and counting). Both CRD and ORD have incorporated the USPTO dataset, so the two can be compared directly. For example, CRD-1143562 (US20130190293A1) and ord-6a01bf15b70e4aee99918916c61b608b describe the same reaction. More examples: CRD-606868 matches ord-0937fdc5897a4464beac2a4265c0b72d and CRD-170034 matches ord-195173b39cd24007bc2ae1d6cae4dd7b.

Of course ORD is miles ahead of CRD (which is still trying to master multistep synthesis, substructure search and image rendering, to name just a few things), but then I do not have the resources that are available to organizations like Merck, Pfizer, MIT and Google. CRD is not a circus but just a dog and pony show, with the pony suffering from stage fright and the dog from a midlife crisis.