I recently found the library deepsmiles, which translates SMILES into a format optimized for use with generative models. The problem with traditional SMILES is that rings and branches require paired symbols, e.g. an opening and a closing parenthesis for a branch. If only one of the pair is present in a generated string, the SMILES is unreadable.
DeepSMILES fixes the issue by using only closing parentheses, with the number of consecutive closing parentheses indicating the size of the branch. A similar strategy is applied to rings: the paired ring-closure digits are replaced by a single digit at the ring-closing atom that gives the ring size. With no paired symbols left to get wrong, this allows models to generate more valid SMILES. All SMILES can be converted to DeepSMILES and back.
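To make that concrete, here is a small sketch of what the encoding looks like. The expected strings in the comments are my understanding of the scheme, so treat them as illustrative rather than authoritative:

import deepsmiles as ds

converter = ds.Converter(rings=True, branches=True)

# Benzene: the paired ring-closure digits "1...1" become a single
# ring-size digit "6" at the ring-closing atom.
print(converter.encode("c1ccccc1"))        # expected: cccccc6

# Benzoic acid: the one-atom branch "(=O)" loses its opening
# parenthesis; a single ")" marks a branch of length 1.
print(converter.encode("c1ccccc1C(=O)O"))  # expected: cccccc6C=O)O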
This paper used DeepSMILES in its generative models, and I adapted the code below from it to create the DeepSMILES strings.
import deepsmiles as ds

DEEPSMI_CONVERTERS = {
    "rings": ds.Converter(rings=True),
    "branches": ds.Converter(branches=True),
    "both": ds.Converter(rings=True, branches=True),
}

def to_deepsmiles(smi, converter="both"):
    """
    Converts a SMILES string to its DeepSMILES alternative.

    :param smi: SMILES string.
    :param converter: Which conversion to apply: "rings", "branches" or "both".
    :return: A DeepSMILES string.
    """
    return DEEPSMI_CONVERTERS[converter].encode(smi)
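Decoding works the same way through Converter.decode, and a malformed DeepSMILES string raises deepsmiles.DecodeError. A round trip could look like this (a minimal sketch; the sample SMILES is just an arbitrary choice):

deep = to_deepsmiles("CC(=O)Oc1ccccc1C(=O)O")  # aspirin
print(deep)

# Decode back to plain SMILES; invalid strings raise ds.DecodeError.
try:
    smiles = DEEPSMI_CONVERTERS["both"].decode(deep)
    print(smiles)
except ds.DecodeError as err:
    print("Invalid DeepSMILES:", err)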
I trained the DeepChem implementation of the seq2seq fingerprint with the DeepSMILES strings. Unfortunately, my training run was cancelled and I did not print the loss over time. But the model I trained (for an unknown number of epochs) could recreate 450 of the 500 SMILES in the validation set (in the tutorial, only ~360 were recreated). I know there is quite a big variance in the performance, so this could also just be chance, but I think the initial result is promising.
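For anyone who wants to reproduce this, here is roughly how I plugged the DeepSMILES strings into DeepChem's SeqToSeq model, following its tutorial. This is a sketch only: the hyperparameters are the tutorial's as far as I remember them, and train_smiles / valid_smiles stand in for whatever SMILES lists you use:

import deepchem as dc

# Encode the training and validation sets as DeepSMILES first.
train_deep = [to_deepsmiles(s) for s in train_smiles]
valid_deep = [to_deepsmiles(s) for s in valid_smiles]

# Build the token vocabulary and maximum length from the data.
tokens = sorted(set(c for s in train_deep + valid_deep for c in s))
max_length = max(len(s) for s in train_deep + valid_deep)

model = dc.models.SeqToSeq(tokens, tokens, max_length,
                           encoder_layers=2,
                           decoder_layers=2,
                           embedding_dimension=256)

def generate_sequences(epochs):
    # The fingerprint model is trained as an autoencoder: input == output.
    for _ in range(epochs):
        for s in train_deep:
            yield (s, s)

model.fit_sequences(generate_sequences(40))

# Count how many validation strings are recreated exactly.
predicted = model.predict_from_sequences(valid_deep, beam_width=4)
recreated = sum(1 for s, p in zip(valid_deep, predicted) if "".join(p) == s)
print(recreated, "of", len(valid_deep), "recreated")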
So maybe people interested in generative models can use this information and try DeepSMILES out.