Who wants to read a database?

06 April 2019 - Machine learning

Not impressed at all. Not impressed by this press release from Springer Nature about a new book called "Lithium-Ion Batteries", authored by Beta Writer. A strange name for an author, chosen because the book was written by an algorithm, as in machine learning and artificial intelligence. Bots are increasingly finding employment in journalism (see this 2019 NYT article), churning out financial reports, sports reports and weather forecasts.

The book is a joint venture between the publisher Springer and the Applied Computational Linguistics lab of Goethe University Frankfurt/Main, built on the premise that a review in any field of chemistry can be written just as well by a bot as by a human specialist in the field. There is not just the press release but also a free eBook that contains the Lithium-Ion Batteries book itself and an explanation of the methodology. This blog has several issues with the initiative itself - more on that later - but first some methodology issues that surfaced as well.

The book carves up the field of Li-ion batteries into 4 main fields, each with 2 or 3 subfields, and each subfield containing 10 to 30 subsections. The authors mention that 53000 articles have appeared in the last three years alone, but the total number of in-text references in the eBook is just 400. More out-of-text references are listed (about 1200), but what is the point of a long list of references, even ignoring the missing article titles and the missing article DOIs? From a small sample it becomes clear that the references themselves include reviews. So are we dealing with a review of research articles or a review of reviews? And what is the scope of the book? The title is, as usual in scientific publishing, highly misleading, because the book is not about Li-ion batteries in general but about progress over the last three years in 4 highly specialised sub-topics. Oh yeah, the review was also limited to Springer-published content.
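To put those numbers in proportion, here is a back-of-the-envelope calculation in Python. The three figures are the ones quoted above; the variable names are made up for this sketch, and it assumes the roughly 1200 listed references do not overlap with the 400 in-text ones, which the eBook may or may not bear out. Since the review only drew on Springer content, using all 53000 articles as the denominator is a simplification as well.

```python
# Figures as quoted in this post; names are illustrative only.
articles_published = 53_000   # articles in the last three years, per the authors
in_text_refs = 400            # references actually cited in the text
listed_only_refs = 1_200      # approximate, assumed not to overlap with in_text_refs

in_text_share = in_text_refs / articles_published
all_refs_share = (in_text_refs + listed_only_refs) / articles_published

print(f"cited in the text: {in_text_share:.2%}")
print(f"cited or listed:   {all_refs_share:.2%}")
```

Under these assumptions less than 1 in 100 of the published articles makes it into the text.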

As for the methodology, one questionable design choice is the effort to "preserve as much as possible from the original text". That is not a machine-learning exercise but a lazy copy-paste student assignment. Luckily the authors actively invite feedback from their readers, so they are not blind to criticism. The methodology section contains some strange contradictions, for example "we build on existing open source software" and "we do not depend on any specific third-party contribution" captured in the same sentence. A good thing is that the creation of data trees is mentioned (breaking down a potentially huge number of articles into a hierarchy), but then comes a disappointment when a "recursive non-hierarchical clustering" method is preferred instead. More from the methodology: the number of citations is artificially capped at 20, and human experts were allowed to move articles around and even delete them. Optimistically the authors exclaim "We consider the resulting publication nevertheless to be machine-generated". They also identify several remaining challenges to tackle, for example the generation of meaningful headlines; for now they stick to a list of keywords. And something for the future: breaking down texts into their graph representation as source material for new summaries (now we are talking!).

A practical example from a randomly chosen section, 1.2.3. There is something wrong with this sentence: "The TiO2 needs to be noted that the carbon coating enhances the overall electronic electrical conductivity and the few-layer MoS2 fosters the diffusion of lithium ions and provides more active sites for lithium-ion storage". And the qualification "facile" in "Few-layer and carbon MoS2 nanosheets co-modified TiO2 nano-composites (conceptualized as MoS2-C@TiO2) were prepared via a facile single-step pyrolysis reaction method" is taken directly from the source text: "Carbon and few-layer MoS2 nanosheets co-modified TiO2 nanocomposites (defined as MoS2-C@TiO2) were prepared through a facile one-step pyrolysis reaction technique." Here this blog would really expect a human expert to judge whether this claim is a) rubbish, b) truthful or c) undecided and, for the sake of a neutral tone of voice, to remove the word "facile".
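How close the generated sentence stays to its source can be made visible with a few lines of Python. This is only a rough sketch using word-set overlap (Jaccard similarity) as a stand-in measure; it is not the method the Beta Writer authors use, and the function names and the tokenisation rule are assumptions made up for this post.

```python
import re

# The two sentences quoted above: the book's version and its source.
generated = ("Few-layer and carbon MoS2 nanosheets co-modified TiO2 nano-composites "
             "(conceptualized as MoS2-C@TiO2) were prepared via a facile single-step "
             "pyrolysis reaction method")
source = ("Carbon and few-layer MoS2 nanosheets co-modified TiO2 nanocomposites "
          "(defined as MoS2-C@TiO2) were prepared through a facile one-step "
          "pyrolysis reaction technique.")

def tokens(text):
    """Lower-case the text and return its set of words (keeping '-' and '@')."""
    return set(re.findall(r"[\w@-]+", text.lower()))

def jaccard(a, b):
    """Shared words divided by all distinct words across both texts."""
    ta, tb = tokens(a), tokens(b)
    return len(ta & tb) / len(ta | tb)

shared = tokens(generated) & tokens(source)
overlap = jaccard(generated, source)

print(f"word overlap: {overlap:.2f}")
print("only in the generated text:", sorted(tokens(generated) - tokens(source)))
print("'facile' copied over:", "facile" in shared)
```

The words that appear only in the generated text turn out to be synonym swaps, while loaded qualifiers such as "facile" sit in the shared set untouched.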

Then there are the general issues with this type of publication: who reads them anyway? As exciting as reading a phone book. They were made obsolete by the appearance of searchable databases in the early nineties. But the publishing industry is still producing them by the bucketload. As for the reason why, this blog is reminded of a story about how university libraries buy any scientific publication just because they have to (link to Guardian article). The hard-cover price for this book is 54 euro. (link) And finally, on the topic of data mining scientific literature: is only Springer allowed to do this? Others may want to data mine, for example Peter Murray-Rust, but he is not allowed to (earlier blog coverage here). There are plenty of innovative ways to make the huge amount of scientific literature more transparent, but for this to happen the data-mining ban must go.

Rik