ChemBERTa tokenizer

I have an issue with ChemBERTa's tokenizer. Either I'm not using it properly or it's not working correctly.
The example below demonstrates that while '[O+]' is in the vocabulary, it is encoded as 'O'. (This is a fragment of a SMILES string; it might not be a valid molecule, but the same issue happens with the full SMILES.)

from transformers import AutoTokenizer
model_checkpoint = 'DeepChem/ChemBERTa-77M-MTR'
tokenizer = AutoTokenizer.from_pretrained(model_checkpoint)
test_smiles = 'CCC1=[O+]'
print(tokenizer.vocab['[O+]'])
print(tokenizer.tokenize(test_smiles))

This outputs:

73
['C', 'C', 'C', '1', '=', 'O']

I believe the ChemBERTa tokenizer may have some collisions in the encoding (it’s not 1-1).

@seyonec Would you be able to confirm?

Well, if you mean that in this example 'O' and '[O+]' are encoded to the same token index, that's not the case: going by tokenizer.vocab (I don't have it open right now), the two strings are both keys in that dictionary and point to different values.
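A quick check along these lines confirms it (the exact index printed for 'O' may differ; 73 for '[O+]' matches the output above), which suggests the issue is in how the SMILES string is split into tokens before lookup rather than an actual ID collision:

from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained('DeepChem/ChemBERTa-77M-MTR')

# Both strings are separate keys in the vocab, so they map to different IDs;
# the '[O+]' -> 'O' behaviour comes from the splitting step, not the vocab itself.
print(tokenizer.vocab['O'])
print(tokenizer.vocab['[O+]'])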

I have replied to this question on the HF Forum :hugs:.


@lianghsun that answer works for most cases, but a few edge cases still slip through: Br and Cl still wouldn't tokenize correctly, along with a few connector tokens. Here is the regex that worked for me in the end:

"Cl|Br|%[0-9]{2}|>>|\[(.*?)\]|."
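For anyone who wants to try it, here is a minimal sketch of using that pattern to pre-tokenize SMILES with Python's re module (the helper name pretokenize_smiles is just for illustration):

import re

# Order matters: two-character atoms (Cl, Br), ring-bond labels like %12,
# the reaction arrow >>, and bracket atoms like [O+] are matched before
# the catch-all '.' that picks up any remaining single character.
SMILES_PATTERN = re.compile(r"Cl|Br|%[0-9]{2}|>>|\[(.*?)\]|.")

def pretokenize_smiles(smiles):
    # Use finditer and group(0) (the whole match) rather than findall,
    # which would return only the capturing-group contents and drop the brackets.
    return [m.group(0) for m in SMILES_PATTERN.finditer(smiles)]

print(pretokenize_smiles('CCC1=[O+]'))  # ['C', 'C', 'C', '1', '=', '[O+]']
print(pretokenize_smiles('CCl.CBr'))    # ['C', 'Cl', '.', 'C', 'Br']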