Fun with ContentMine

06 May 2016 - CRD 206

With 206 chemical reactions amassed in 4 years, the CRD (Chemical Reaction Database) is not really making progress. For earlier episodes see here, here, here, here, here, here, here, here, here, here, here, here, here, here, here, here and here. What is clearly lacking is the motivation to enter the data manually. The better strategy is of course data mining, but as described here in 2013, the scientific publishers are unwilling partners in any such venture.

Open-access advocate Peter Murray-Rust has not been waiting for the outcome of the legal battle and is building his open-source data mining software under the name ContentMine. For now it mines open-access literature in partnership with Europe PubMed Central. For instructions and a first field test see the procedurecheck blog. The set of tools effortlessly produces 21 papers on cubane synthesis in a handy XML format. One of the first tries also yielded a paper on camel milk (?), but this one was from "The Cuban Hospital". It demonstrates the importance of using double quotation marks instead of single quotation marks in the database query.
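To make the quoting point concrete, here is a minimal sketch in Python of how a phrase query could be wrapped in double quotes before it is sent to Europe PMC. It is not part of ContentMine itself. The helper name `build_search_url` is made up for illustration, and the endpoint URL, the `OPEN_ACCESS:y` filter and the `pageSize` parameter are assumptions about the Europe PMC REST search service that should be checked against its current documentation. The sketch only builds the URL; it does not contact the server.

```python
from urllib.parse import urlencode

# Assumed Europe PMC REST search endpoint; verify against the current API docs.
BASE_URL = "https://www.ebi.ac.uk/europepmc/webservices/rest/search"


def build_search_url(phrase, open_access_only=True, page_size=25):
    """Hypothetical helper: return a search URL that sends `phrase`
    as one double-quoted phrase rather than as loose words."""
    # Drop any quotes already present, then wrap the phrase in double quotes.
    query = '"' + phrase.replace('"', "").strip() + '"'
    if open_access_only:
        # Assumed field syntax for restricting hits to open-access papers.
        query += " AND OPEN_ACCESS:y"
    params = {"query": query, "format": "json", "pageSize": page_size}
    # urlencode percent-encodes the quotes (%22) and turns spaces into '+'.
    return BASE_URL + "?" + urlencode(params)


if __name__ == "__main__":
    print(build_search_url("cubane synthesis"))
```

Doing the quoting in code also sidesteps the command-line shell, which may treat single and double quotation marks differently before the query ever reaches the database.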

In a very recent two-part blog post, here and here, Peter Murray-Rust discusses the new kid on the block, Sci-Hub. He basically announces a ContentMine + Sci-Hub collaboration, and that would really be big news. Allowing scientists to mine the literature for data would really drive innovation. For example, he foresees the "extraction of 15 million chemical reactions a year". That would of course put the CRD out of business, but for a good reason.

But is he really going to do this? (Illegal!) The answer in Murray-Rust's own words: "I haven't said I'm going to use it. You'll have to wait till the next blog post".