Wikipedia has an unusual page called the "List of chemical compounds with unusual names". The page tries to explain why a chemical compound such as basketane or bullvalene is named the way it is, even though a respectable IUPAC name exists. Since 2004 this page has been through no fewer than seven so-called articles-for-deletion debates, with opponents arguing that the page is not encyclopedic or is based on original research and should be deleted, and admirers arguing that the page is, well... fun! The page has thus far dodged the bullet. It should be mentioned that at the inception of the page a University of Bristol website was pilfered.
Both parties in the debate should be notified, however, that the page was recently included as a citation in a respectable scientific publication (DOI). In it, a German team headed by Christoph Steinbeck explains that an IUPAC name is much more useful because a molecular structure can always be deduced from the systematic name. In cheminformatics an algorithm can take an IUPAC systematic name as input and spit out SMILES, the string representation of a molecular structure. For example, benzene will return c1ccccc1. In the CRD project, aimed at creating a chemical reaction database, a tool called OPSIN does just that.
The Germans do the reverse: take a structure and find the IUPAC name that goes with it. Enter c1ccccc1 and get benzene as a result. They call their invention STOUT as in SMILES-TO-IUPAC-name translator, a play on the English beer that no doubt was at one time involved in the brainstorming session leading up to this article.
To build a SMILES-to-IUPAC translator, the researchers looked at deep learning and specifically at the so-called transformers used in natural language translation. Transformers were also mentioned in the previous blog on organic reaction prediction and the basic idea is the same. In language translation you take pairs of English and French sentences, look at the relationships between the words and train on a large dataset. But a systematic name and a SMILES representation can be considered sentences as well, and the algorithm works just as well on them.
Sixty million molecules were drawn from PubChem for training. The SMILES-to-IUPAC translation was then tested on 2 million other molecules. The quality was evaluated by converting the predicted IUPAC name back to SMILES using the aforementioned OPSIN and calculating the so-called Tanimoto similarity index between the original and the round-tripped structure.
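The Tanimoto index itself is simple: the number of features two molecules share, divided by the number of features found in either. Below is a minimal sketch in Python. Note the assumptions: a real pipeline would compare molecular fingerprints computed by a cheminformatics toolkit, whereas the `toy_features` helper here is a hypothetical stand-in that just collects character pairs from the SMILES string, and the "round trip" strings are typed in by hand rather than produced by OPSIN.

```python
def tanimoto(a: set, b: set) -> float:
    """Tanimoto (Jaccard) index: shared features / all features."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


def toy_features(smiles: str, n: int = 2) -> set:
    """Hypothetical stand-in for a fingerprint: the set of character n-grams."""
    return {smiles[i:i + n] for i in range(len(smiles) - n + 1)}


original = "c1ccccc1"     # benzene, the SMILES that went into the translator
round_trip = "c1ccccc1"   # what name -> SMILES would ideally give back
different = "c1ccncc1"    # pyridine, standing in for a wrong prediction

perfect = tanimoto(toy_features(original), toy_features(round_trip))
near_miss = tanimoto(toy_features(original), toy_features(different))

print(perfect)    # identical feature sets give an index of 1
print(near_miss)  # partly overlapping feature sets give a value below 1
```

An index of 1 means the round trip landed on a molecule with the same feature set; anything lower means the predicted name describes something else. Character pairs are only good enough to show the arithmetic, since the same molecule can be written as many different SMILES strings.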
A similarity index of 1 was found for 83% of the molecules. The BLEU score, a metric for the quality of a machine translation, was 0.94. Sounds good but the authors are cautious: "any large scale and uncurated application should be currently handled with care".
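For readers unfamiliar with BLEU: it measures how many short word sequences (n-grams) in the machine output also occur in the reference translation, with a penalty for output that is too short. The sketch below is a simplified single-reference version with uniform weights and no smoothing. Splitting the names into single characters is an assumption made here for illustration; how the authors tokenized their names is not covered in this post.

```python
import math
from collections import Counter


def ngrams(tokens, n):
    """Count all runs of n consecutive tokens."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def bleu(reference, candidate, max_n=4):
    """Simplified BLEU: one reference, uniform weights, no smoothing."""
    if not candidate:
        return 0.0
    log_precisions = []
    for n in range(1, max_n + 1):
        cand = ngrams(candidate, n)
        ref = ngrams(reference, n)
        total = sum(cand.values())
        # Clip each candidate n-gram count by its count in the reference.
        overlap = sum(min(count, ref[gram]) for gram, count in cand.items())
        if total == 0 or overlap == 0:
            return 0.0
        log_precisions.append(math.log(overlap / total))
    # Brevity penalty: punish candidates shorter than the reference.
    if len(candidate) >= len(reference):
        penalty = 1.0
    else:
        penalty = math.exp(1 - len(reference) / len(candidate))
    return penalty * math.exp(sum(log_precisions) / max_n)


reference = list("propan-2-ol")   # name on record (character tokens, assumed)
exact = bleu(reference, list("propan-2-ol"))
one_off = bleu(reference, list("propan-1-ol"))

print(exact)    # 1.0 for a perfect match
print(one_off)  # below 1, yet still high for a chemically different compound
```

The second example shows why a high BLEU score alone is not proof of a correct structure: a single wrong locant barely dents the score but changes the molecule.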
2021 is a busy year for chemical deep learning and the field is competitive. The German team was in fact beaten by a month by an English team. Their publication (Handsel et al., DOI) already has the two key ingredients: a 50 million compound training set from PubChem and the Transformer algorithm. The work differs in that the SMILES string is first converted into an InChI string. The validation process is troublesome though. Where the Germans rely on OPSIN and the Tanimoto index, the English simply compare the IUPAC string that comes out with the name in the PubChem dataset itself.
But systematic names as entered into the PubChem database are troublesome, as they are often generated by one of several commercial software apps. All over the world scientists create chemical drawings on their computers and out comes a systematic name. But is it a valid name? In my experience, when systematic names are included in a publication (only rarely), a lot of them get bounced by OPSIN. These apps are closed-source, so the code is not open for examination. How many chemists practice the art of picking up the IUPAC Blue Book and deducing the systematic name by themselves?
In any event both the Germans and the English were beaten by a few months by a Russian team with lead author Sergey Sosnin and their Struct2IUPAC (DOI). With a 50 million PubChem training set and a 50 million test set they report 98% accuracy. The Russians also developed the reverse tool IUPAC2Struct, but mainly for "aesthetic reasons", as they acknowledge OPSIN works just as well. Interesting finds: the algorithm (again Transformer-based) has an optimal performance range of 10 to 60 SMILES tokens. It finds methane impossible to handle. The algorithm can also find multiple IUPAC names for a single compound, which is counter-intuitive because a systematic name implies a system, but apparently you can start "reading" a molecule at different positions.
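To get a feel for what 10 to 60 SMILES tokens means, here is a rough token counter in Python. The tokenization rule is an assumption made for this post (bracket atoms, the two-letter halogens and two-digit ring closures count as one token, everything else is one character per token), and the function names are hypothetical; the scheme actually used in Struct2IUPAC may differ.

```python
import re

# Assumed tokenization: bracket atoms, Br/Cl, %NN ring closures,
# then any single remaining character.
SMILES_TOKEN = re.compile(r"\[[^\]]+\]|Br|Cl|%\d{2}|.")


def tokenize(smiles: str) -> list:
    """Split a SMILES string into tokens under the assumed scheme."""
    return SMILES_TOKEN.findall(smiles)


def in_reported_range(smiles: str, low: int = 10, high: int = 60) -> bool:
    """True if the token count falls in the 10 to 60 range quoted above."""
    return low <= len(tokenize(smiles)) <= high


examples = {
    "methane": "C",
    "benzene": "c1ccccc1",
    "aspirin": "CC(=O)Oc1ccccc1C(=O)O",
}

for name, smiles in examples.items():
    print(name, len(tokenize(smiles)), in_reported_range(smiles))
```

Under this scheme methane is a single token and even benzene comes in below ten, while a typical drug-sized molecule such as aspirin sits comfortably inside the range. Whether the short length is the actual reason methane trips up the model is not something this sketch can tell.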
The Russian article has one puzzling sentence: "Most chemical journals require IUPAC names for organic structures too". Russian journals? In my experience the use of systematic names in research articles is rare. Authors are happy to call their compounds compound number 10 or 11 or SI1, and readers are expected to look at the picture of the molecule to find out what the compound actually is.