Mining the data

14 January 2022 - Cheminformatics

Thus far the 100,000 reactions in the reaction database have just been sitting there, not exactly searchable except by name and by relationships. Is the venture a solution waiting for a problem? Just as I was wondering what the next step should be, a Twitter message caught my eye: "we are struggling with the preparation of a boronic ester out of a simple aromatic iodine. Any tips that may not be specified in literature?" There was the problem! Okay, the question specifically asked for tips not found in the existing literature, but perhaps the patent literature was overlooked?

The RDKit library has several advanced tools for data querying. One of them adopts the SMILES arbitrary target specification (SMARTS) substructure identification language, invented by David Weininger in the early eighties as an extension of SMILES (DOI). SMILES is a line notation for molecules (bromobenzene is C1=CC=C(C=C1)Br) and SMARTS specifies the motif you are searching for inside the molecule that the SMILES string describes. For example, [^2]Br (written with RDKit's hybridization extension to SMARTS) is a bromine atom connected to an sp2 atom, here an aromatic carbon (https://www.daylight.com/dayhtml/doc/theory/theory.smarts.html).

Weininger was an interesting guy: this astronomer - musician - pilot - chef - chemist was a co-founder of Daylight Chemical Information Systems, the company that owns SMILES and SMARTS as trademarks. The SMILES origin story is also interesting. Weininger's first job was catching trout in the Canadian Great Lakes and measuring PCB levels. He then went on to build a database for structure - biological activity relationships in a second job. In that job computer operators were struggling to feed the chemical structures into the database (I imagine in the way of ASCII art), and hence Weininger invented SMILES to make their lives easier (link).

If you use a PostgreSQL database, RDKit allows direct SMARTS queries, but as I am stuck with MariaDB a workaround was required. It is fairly simple to generate the reaction SMILES from the component SMILES and put them together in a single file, one reaction per line. This is a file a Python script can work on to extract the reactions we want.
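Generating that file could look roughly like the sketch below. The function names, the output path and the shape of the rows (three lists of component SMILES per reaction) are assumptions for illustration; how the rows are fetched from MariaDB is left out.

```python
from typing import Iterable, List, Tuple

# One reaction = (reactant SMILES, reagent SMILES, product SMILES); hypothetical row shape.
Reaction = Tuple[List[str], List[str], List[str]]


def reaction_smiles(reactants: List[str], reagents: List[str],
                    products: List[str], sep: str = ".") -> str:
    """Join component SMILES into one reaction SMILES: reactants>reagents>products."""
    return ">".join(sep.join(part) for part in (reactants, reagents, products))


def write_reactions(path: str, reactions: Iterable[Reaction]) -> None:
    """Write one reaction SMILES per line."""
    with open(path, "w", encoding="utf-8") as handle:
        for reactants, reagents, products in reactions:
            handle.write(reaction_smiles(reactants, reagents, products) + "\n")
```

An empty reagent list simply leaves the middle field empty, so the two greater-than signs end up next to each other.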

A few lines do the work:

from rdkit import Chem
mol = Chem.MolFromSmiles("C1=CC=C(C=C1)Br", sanitize=True)
pattern = Chem.MolFromSmarts("[^2]Br")  # bromine on an sp2 atom
matches = mol.GetSubstructMatches(pattern)

where len(matches) gives you the number of sp2 bromines in bromobenzene (the variable 'matches' itself is a tuple of matching atom-index tuples). To find the specific reaction mentioned on Twitter, the script has to isolate reactants and products, and count the number of carbon-iodine bonds in the reactants and the number of carbon-boron bonds in the products. The script also has to check that the number of carbon-iodine bonds has decreased by one in the products and that the number of carbon-boron bonds has increased.
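The counting and the before/after check can be sketched as follows. This is not the actual script: the two SMARTS patterns, the function names and the choice to skip compounds RDKit cannot parse are all assumptions. The reactants and products are taken as already-split lists of SMILES.

```python
from typing import Iterable, Tuple

ARYL_IODIDE = "[^2]I"   # assumed pattern: iodine on an sp2 atom (RDKit hybridization query)
CARBON_BORON = "[#6]B"  # assumed pattern: any carbon-boron bond


def count_matches(smiles_list: Iterable[str], smarts: str) -> int:
    """Total number of SMARTS matches over a list of SMILES strings."""
    from rdkit import Chem  # imported here so the helper below also works without RDKit

    pattern = Chem.MolFromSmarts(smarts)
    total = 0
    for smiles in smiles_list:
        mol = Chem.MolFromSmiles(smiles)
        if mol is None:  # RDKit could not parse or sanitize this compound
            continue
        total += len(mol.GetSubstructMatches(pattern))
    return total


def side_counts(smiles_list: Iterable[str]) -> Tuple[int, int]:
    """Return (carbon-iodine bonds, carbon-boron bonds) for one side of a reaction."""
    smiles_list = list(smiles_list)
    return (count_matches(smiles_list, ARYL_IODIDE),
            count_matches(smiles_list, CARBON_BORON))


def is_iodide_to_boron(reactant_counts: Tuple[int, int],
                       product_counts: Tuple[int, int]) -> bool:
    """True if one C-I bond disappeared and the number of C-B bonds went up."""
    iodides_before, borons_before = reactant_counts
    iodides_after, borons_after = product_counts
    return iodides_before - iodides_after == 1 and borons_after > borons_before
```

A reaction is then kept when is_iodide_to_boron(side_counts(reactants), side_counts(products)) is true, for instance with iodobenzene and bis(pinacolato)diboron as reactants and the phenylboronic pinacol ester as product.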

Other workarounds: in a reaction SMILES a greater-than sign (>) divides reactants, reagents and products, and a dot divides individual SMILES. But the dot also divides ion and counterion within a compound, and running a SMARTS query on a lone anion does not make sense. The workaround consists of replacing the dot between compounds by a semicolon in the reaction SMILES.
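Splitting such a line back into compounds might then look like this. It is a sketch under the assumption that every line has exactly two greater-than signs and that the semicolon only ever appears between compounds; the function name is made up for illustration.

```python
from typing import List, Tuple


def split_reaction(line: str, sep: str = ";") -> Tuple[List[str], List[str], List[str]]:
    """Split 'reactants>reagents>products' into three lists of compound SMILES.

    Compounds are separated by `sep`; a dot stays inside a compound,
    so a salt such as CC(=O)[O-].[K+] is kept in one piece.
    """
    reactants, reagents, products = line.strip().split(">")

    def compounds(side: str) -> List[str]:
        return [smiles for smiles in side.split(sep) if smiles]

    return compounds(reactants), compounds(reagents), compounds(products)
```

With potassium acetate as reagent, the acetate and the potassium ion stay together as one entry instead of being queried separately.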

The library appears to choke on certain molecules. Running the compound 9-BBN through

mol = Chem.MolFromSmiles("C12CCCC(CCC1)B12[H]B2(C3CCCC2CCC3)[H]1", sanitize=True)
if mol is not None:
    matches = mol.GetSubstructMatches(Chem.MolFromSmarts("[^2]I"))

results in an error and then a segmentation fault, and the server (a Mac mini) is not able to recover from it.

# Explicit valence for atom # 8 B, 4, is greater than permitted
# zsh: segmentation fault python3 tests.py

Another problematic compound, for some reason, is pyridinium tribromide. The issue is under investigation.
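Until the cause is found, one possible stopgap (not something the current script does) is to run the risky step in a child process, since a segmentation fault cannot be caught with try/except but does show up as a non-zero exit status of the child. The helper name and the timeout below are assumptions.

```python
import subprocess
import sys
from typing import Optional


def run_isolated(code: str, timeout: float = 30.0) -> Optional[str]:
    """Run a Python snippet in a child process.

    Returns the child's stdout, or None if it crashed, failed or timed out.
    """
    try:
        result = subprocess.run(
            [sys.executable, "-c", code],
            capture_output=True,
            text=True,
            timeout=timeout,
        )
    except subprocess.TimeoutExpired:
        return None
    if result.returncode != 0:  # on POSIX a negative value means killed by a signal
        return None
    return result.stdout
```

The parent keeps running whatever happens to the child, so a compound that brings down the parser only costs one skipped reaction. Starting a new interpreter per compound is slow, so this is more suited to re-checking suspect compounds than to the whole file.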

And how did it work? Just fine. The query yielded 48 valid reactions (boronic ester formation from an aromatic iodide) within a few seconds.