Supplemental Information as XML

14 March 2022 - Data

Richard Apodaca has written an interesting blog post on the topic of big reaction data, triggered by a recent article by Pierre Baldi of the University of California, Irvine. In it Baldi notes that a public non-profit database of chemical reactions is lacking but clearly needed. Research into reaction prediction by artificial intelligence, for example, would greatly benefit. His proposal: bring together an international consortium akin to the one behind the Large Hadron Collider and start a database like PubChem, the small-molecule database maintained by the US National Institutes of Health (NIH). And how to fill the database? The USPTO dataset would be an obvious place to start, but Baldi also thinks that the commercial database maintainers (Reaxys by Elsevier) could be persuaded to sell data, or that the non-profit CAS (maintained by the American Chemical Society) would be willing to share its data.

Apodaca is skeptical. The commercial database providers will not abandon their business model. Overseeing a single database also brings technical and organizational challenges: what process will decide whether the data describing a reaction is accurate? The model favored by Apodaca is one where multiple reaction data repositories exist that can exchange data, almost like a file-sharing service such as BitTorrent. He specifically mentions Chemotion ELN, an electronic laboratory notebook in existence since 2017. It consists of a database, the software to manage the database and, critically, the infrastructure to share and synchronize data with other ELN users. Chemotion also has a central repo at chemotion-repository.net.

In my view Chemotion plays in the same league as the open-reaction-database (ORD) highlighted earlier in this blog. Both initiatives offer an editor, a database schema and a repository. Both initiatives also share the same disadvantage: entering all the data involves a lot of work, and who is prepared to do it? ORD is not seeing new dataset additions, and Chemotion is currently at 1085 published reactions, with additions arriving piecemeal.

My proposal would take a different approach and start right at the creation of the chemical synthesis data itself: the writing up of the supplementary information (SI). Currently the SI is a collection of rule-free prose, so why not replace it with a structured XML document that can be data-mined at will? XML can be read by both humans and computers. The XML document would replace the SI and be stored together with the article on the server of the publisher. It could be retrieved in the same way as the article itself, via a digital object identifier (DOI). I am aware of one journal (Beilstein Journal of Organic Chemistry) that offers not the SI but the article itself as an XML file with the explicit purpose of data mining, so the idea is not completely alien. ChemRxiv would also be a good location for storage. Hammering out an XML schema should not be that difficult, storage costs are distributed, quality control is built in and submissions are transparent. Next blog: an XML schema!
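
To make the data-mining argument a little more concrete ahead of that, here is a minimal sketch in Python of what reading such an SI file could look like. Everything in the XML fragment is an assumption made up for illustration: the element names (supplementaryInformation, reaction, reactant, product, conditions, procedure), the attributes and the placeholder DOI are not taken from any existing standard or from the schema still to be proposed. Only the Python standard library is used.

```python
import xml.etree.ElementTree as ET

# Hypothetical SI fragment. All element names, attribute names and the DOI
# are invented placeholders for illustration, not a proposed standard.
SI_XML = """\
<supplementaryInformation doi="10.0000/example">
  <reaction id="rxn-1">
    <reactant role="substrate" smiles="CC(=O)O" amount="10" unit="mmol"/>
    <reactant role="reagent" smiles="CCO" amount="12" unit="mmol"/>
    <product smiles="CC(=O)OCC" yield="85"/>
    <conditions temperature="80" temperatureUnit="C" time="4" timeUnit="h"/>
    <procedure>Acetic acid and ethanol were heated with catalytic sulfuric acid.</procedure>
  </reaction>
</supplementaryInformation>
"""


def extract_reactions(xml_text):
    """Return one dictionary per <reaction> element in the SI document."""
    root = ET.fromstring(xml_text)
    reactions = []
    for rxn in root.findall("reaction"):
        product = rxn.find("product")
        reactions.append({
            "id": rxn.get("id"),
            # Collect the SMILES string of every reactant, in document order.
            "reactants": [r.get("smiles") for r in rxn.findall("reactant")],
            "product": product.get("smiles"),
            "yield_percent": float(product.get("yield")),
            # The free-text procedure stays available for human readers.
            "procedure": rxn.findtext("procedure"),
        })
    return reactions


if __name__ == "__main__":
    for entry in extract_reactions(SI_XML):
        print(entry)
```

The point of the sketch is only that the same file serves both audiences: the procedure text remains readable for a chemist, while reactants, product and yield can be pulled out in a few lines without any text mining.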