BELKA dataset - 300M measured binding interactions to train on

andrewdblevins · April 4, 2024, 1:36pm

Hey DeepChem,
Working in this field, I have been frustrated by the size and/or quality of the publicly available datasets for training and benchmarking models. So when my co-founder and I started our company, we swore we would open-source some data as quickly as possible.

I am excited to announce our new Kaggle competition.

The training set is ~100M molecules screened against 3 proteins (sEH, BRD4, and HSA), or roughly 300M measured interactions in total, drawn from multiple replicates of DNA-encoded library (DEL) screens we ran here at Leash Bio. We are also offering $50k in prizes. We really hope this will help the community compare many different techniques for molecule/protein representation on a dataset big enough and clean enough to trust the comparison.