Metal-organic frameworks digitized

20 May 2022 - Digi everything

Peyman Moghadam and his team at the University of Sheffield have reported the creation of a database for metal-organic framework (MOF) synthesis in a recent ChemRxiv publication (Kristian Gubsch et al. DOI). The article describes a workflow starting from the Cambridge Structural Database (CSD), which contains about 100,000 MOF entries. The CSD API provides the same number of DOIs; the corresponding HTML pages were automatically downloaded and then text-scraped / data-mined with a tool called ChemDataExtractor (CDE). To extract meaning from the text, natural language processing was combined with custom part-of-speech parsers based on Python regular expressions. The authors acknowledge the universal truth that a strict parser yields few results and that a lenient parser is inaccurate. To quote their frustration: "We found it extremely challenging to accommodate the considerable diversity of sentence structures observed in MOF literature without compromising the precision of the parsers". A total of 100K MOFs were extracted for the DigiMOF database, but complete datasets made up only 15% of the total. The quote "If a compound is labelled as '1' or '2' without a specifier such as 'compound' (...) then the parsers will not associate the label with anything, and so cannot extract a property relationship" again illustrates the authors' frustration with scientific publishing in general.
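The strict-versus-lenient trade-off and the bare-label problem are easy to reproduce with a toy example. The sketch below is not the authors' parser and does not use CDE; the two patterns, the `extract` helper and the example sentences are all made up for illustration. The strict pattern only accepts a label that follows a specifier such as "compound", while the lenient one makes the specifier optional.

```python
import re

# Hypothetical strict pattern: a label counts only when it follows a
# specifier word such as "compound" or "complex".
STRICT = re.compile(
    r"\b(?:compound|complex)\s+(?P<label>\d+[a-z]?)\s+was\s+"
    r"(?:synthesi[sz]ed|prepared)\s+(?:in|from)\s+(?P<solvent>[A-Za-z]+)",
    re.IGNORECASE,
)

# Hypothetical lenient pattern: the specifier is optional, so any bare
# number in the right position is treated as a compound label.
LENIENT = re.compile(
    r"\b(?:(?:compound|complex)\s+)?(?P<label>\d+[a-z]?)\s+was\s+"
    r"(?:synthesi[sz]ed|prepared)\s+(?:in|from)\s+(?P<solvent>[A-Za-z]+)",
    re.IGNORECASE,
)


def extract(pattern, sentence):
    """Return (label, solvent) pairs that the pattern finds in one sentence."""
    return [(m.group("label"), m.group("solvent")) for m in pattern.finditer(sentence)]


# Invented sentences, not taken from any publication.
with_specifier = "Compound 1 was synthesized in DMF at 120 C."
bare_label = "2 was prepared in ethanol."
not_a_compound = "Figure 3 was prepared in Origin."

for sentence in (with_specifier, bare_label, not_a_compound):
    print(sentence)
    print("  strict :", extract(STRICT, sentence))
    print("  lenient:", extract(LENIENT, sentence))
```

The strict pattern misses the bare "2" entirely, and the lenient pattern recovers it at the price of reporting "Figure 3" as a compound made in a solvent called Origin.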

Were the authors free to use the articles they selected for data mining? The big publishers are generally reluctant partners (it hurts the business model), but the article is not very specific. All it mentions is that not all CSD DOIs were accessible to them. With some digging, the GitHub repo containing the DigiMOF database turned up a folder with a Python scrape file for each publisher. Apparently Springer and Elsevier work with an API (so they are okay with it?); the code for the RSC publications, on the other hand, contains the line "Please get the permission from RSC before web-scraping".

And the results? The DigiMOF database yields very useful histograms on synthesis methods, topologies, solvents, linkers and metals. The trends uncovered include an increased use of solvent-free methods and the continued use of organic solvents rather than water. It is evident from the article that the data mining exercise was a long slog. I can sympathize. The CRD database maintained by this blog is also a data mining project, though for organic reactions rather than MOFs. The article recommends that "future MOF synthesis publications contain specifically formatted tables of key information, presented in a way that is friendly to text mining algorithms to enable the scraping of data using a high throughput screening approach", a statement that resonates well with my earlier blog on supplementary info as XML.
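To see why a formatted table is so much friendlier than running text, here is a minimal sketch that reads a small comma-separated table and counts solvents, the kind of tally behind a histogram. The column names, the rows and both helper functions are invented for illustration; they are not a format proposed in the article and not data from DigiMOF.

```python
import csv
import io
from collections import Counter

# Made-up rows in a hypothetical column layout, for illustration only.
TABLE = """\
label,metal,linker,solvent,method
1,Zn,linker-A,DMF,solvothermal
2,Cu,linker-B,water,hydrothermal
3,Zn,linker-A,DMF,solvothermal
4,Zr,linker-C,none,mechanochemical
"""


def read_records(text):
    """Parse the table into one dict per row, keyed by column name."""
    return list(csv.DictReader(io.StringIO(text)))


def histogram(records, field):
    """Count how often each value of the given column occurs."""
    return Counter(record[field] for record in records)


records = read_records(TABLE)
solvent_counts = histogram(records, "solvent")

for solvent, count in solvent_counts.most_common():
    print(f"{solvent:8s} {'#' * count}")
```

No sentence parsing and no label guessing is needed here: every property is tied to its compound by the row it sits in.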