Fragment growing strategies for molecule generation

Hi all,

I’m wondering how feasible it would be to use ChemBERTa as a fragment-growing molecule generator. My idea is the following:

  1. Fine-tuning inputs would be small-molecule fragments:
  • from fragment screens against multiple targets, i.e. strong or even low-affinity hits that chemists have since optimized and expanded
  • from bioactive molecules broken down into fragments, as a data augmentation strategy
  2. Outputs would be the full inhibitor molecules that are “grown” or linked from the input fragments, with multiple fragments pointing to the same output or to multiple outputs.
  3. Use specific input fragments to generate output molecules.
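
To make the data layout concrete, here is a minimal sketch of how I imagine the (fragment, full molecule) training pairs. Everything in it is an assumption for illustration: the function names `build_pairs` and `group_by_fragment` are hypothetical, and the SMILES strings are hand-written toy examples rather than real screening data or the output of a fragmentation tool. In practice the fragments would presumably come from a cheminformatics toolkit (for example a BRICS-style decomposition) rather than being typed in.

```python
from collections import defaultdict


def build_pairs(parent_to_fragments):
    """Flatten {parent_smiles: [fragment_smiles, ...]} into unique
    (fragment, parent) pairs, i.e. (model input, model target)."""
    pairs = []
    seen = set()
    for parent, fragments in parent_to_fragments.items():
        for frag in fragments:
            key = (frag, parent)
            # Skip empty fragments, trivial "fragment == parent" pairs, and duplicates.
            if frag and frag != parent and key not in seen:
                seen.add(key)
                pairs.append(key)
    return pairs


def group_by_fragment(pairs):
    """Group targets by input fragment to see which fragments map to several outputs."""
    grouped = defaultdict(list)
    for frag, parent in pairs:
        grouped[frag].append(parent)
    return dict(grouped)


# Hypothetical toy data: two parent molecules that share a phenol fragment.
example = {
    "CC(=O)Nc1ccc(O)cc1": ["Oc1ccccc1", "CC(N)=O"],
    "Nc1ccc(O)cc1": ["Oc1ccccc1", "Nc1ccccc1"],
}

pairs = build_pairs(example)
grouped = group_by_fragment(pairs)

for frag, parents in grouped.items():
    print(frag, "->", parents)
```

Here the shared phenol fragment maps to two different targets, which is the "one fragment, multiple outputs" case from point 2, while several fragments map to the same parent.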

Does this sound reasonable? I have struggled to find much in the literature that tackles the fragment growing problem in this way.

A second pertinent question is how much data would be needed to effectively fine-tune a model? Is there a minimum number of samples or does it really just depend on a bunch of different factors so there’s no single answer?

Thanks for your time!

I think this would be pretty reasonable. We don’t yet have a sampling method for ChemBERTa, but I think it would be a great tool to add.

For fine-tuning, I think you would probably need at least a few hundred samples. I don’t think ChemBERTa-2 is at a large enough scale to show low-shot learning effects, but we may get there in future versions.