Making DeepChem more useful for bioinformatics?

DeepChem right now isn’t very useful for bioinformatics. I’m raising this issue to help brainstorm ways we can improve our support for bioinformatics applications. Here are a couple thoughts:

Please chime in with more suggestions here!

pyrpipe looks like a really cool resource as well: https://academic.oup.com/nargab/article/3/2/lqab049/6290623, https://github.com/urmi-21/pyrpipe

Hi DeepChem community! I am really interested in structural bioinformatics and protein modeling. Looking through the DeepChem GitHub, here are some issues and discussions that could help summarize the current state and add more ideas (hope that helps):

  • Adding protein sequence datasets to MolNet (https://github.com/deepchem/deepchem/issues/2330). I think this is a crucial step! Including datasets with labeled data, such as changes in protein stability upon mutation, thermostability, and other properties, could help users create awesome models. Here is an example of a good database for protein design: https://loschmidt.chemi.muni.cz/fireprot/
    Other databases such as EBI-proteins could provide a lot of sequences and curated information for this purpose.

  • Add support for multiple sequence alignment, homology modeling, and deep structural prediction (https://github.com/deepchem/deepchem/issues/2150). This issue has a lot of discussion about new featurizers that could be really useful; some examples are MultipleSequenceAlignment and Mean ContactPotential.
    I think other feature extractors for protein sequences could be taken into account as well. For example, Python packages such as iFeature (paper: https://academic.oup.com/bioinformatics/article/34/14/2499/4924718, code: https://github.com/Superzchen/iFeature/) and propy (https://pypi.org/project/propy3/) can be used to extract protein sequence descriptors such as amino acid composition, amphiphilic pseudo amino acid composition, and others. Finally, 3D protein structure representations could add more interesting features with a variety of applications, such as protein-protein interface prediction or antibody-antigen binding references. Here is a nice review of DL in protein structural modeling with a lot of information: https://doi.org/10.1016/j.patter.2020.100142
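
To make the simplest of the sequence descriptors mentioned above concrete, here is a minimal sketch of amino acid composition in plain Python. It only illustrates the idea: the function name `amino_acid_composition` and the choice to ignore non-standard residues are my own assumptions, and this is not the API of DeepChem, iFeature, or propy.

```python
from collections import Counter

# The 20 standard amino acids, in a fixed order so every sequence
# maps to a feature vector with the same layout.
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def amino_acid_composition(sequence):
    """Return the fraction of each standard residue in `sequence`.

    Hypothetical helper for illustration only. Characters outside the
    20 standard residues (e.g. 'X' or gap symbols) are ignored, which
    is an assumption rather than a convention taken from any library.
    """
    counts = Counter(sequence.upper())
    total = sum(counts[aa] for aa in AMINO_ACIDS)
    if total == 0:
        # Empty sequence or no standard residues: return all zeros.
        return [0.0] * len(AMINO_ACIDS)
    return [counts[aa] / total for aa in AMINO_ACIDS]


# Example: a 20-dimensional feature vector for a short peptide.
features = amino_acid_composition("MKTAYIAKQR")
```

A fixed-length vector like this could be fed to any standard model, which is part of why composition-style descriptors seem like an easy first target compared with alignment-based features.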


Following up on the call last week, I looked back at some of the other suggestions for improving bioinformatics capabilities. Besides what was already mentioned here, some of the ideas were

I think David’s suggestions regarding featurizers are a good next step: a few are already implemented in PNet and they are core to the bioinformatics workflow, so they’re a high-impact, easy target. The featurizers plus protein sequence datasets will give DeepChem a better foundation for working with bio data, and later we can integrate some of the more cutting-edge papers.

If that sounds reasonable, I can start to put some time into trying to integrate the PNet featurizers mentioned in issue 2150.


This sounds like an excellent plan. How about we discuss it at the next developer call? We can coordinate with Michael, who developed PNet, on the right path to start integrating these models into DeepChem.
