Making DeepChem more useful for bioinformatics?

bharath · September 18, 2021, 7:31pm

DeepChem right now isn’t very useful for bioinformatics. I’m raising this issue to help brainstorm ways we can improve our support for bioinformatics applications. Here are a couple thoughts:

Adding more file format loaders:
- We support loading of FASTA files but not .gbf, .vcf, .bam, .sam, and similar files. More loaders could be useful.
Adding ML architectures for GWAS. This review (https://www.frontiersin.org/articles/10.3389/fgene.2020.00350/full) has a really nice overview.
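To make the loader idea concrete, here is a rough sketch of what reading the fixed columns of a VCF file involves. This is only an illustration in plain Python; `parse_vcf_records` is a hypothetical helper name, not an existing DeepChem function, and a real loader would probably build on an established parsing library and handle the INFO and per-sample columns properly.

```python
from typing import Dict, Iterable, Iterator


def parse_vcf_records(lines: Iterable[str]) -> Iterator[Dict[str, object]]:
    """Yield one dict per VCF data line, covering the eight fixed columns.

    Hypothetical helper for illustration only; not part of DeepChem.
    """
    for raw in lines:
        line = raw.rstrip("\n")
        # Skip blank lines, "##" meta-information lines, and the "#CHROM" header.
        if not line or line.startswith("#"):
            continue
        fields = line.split("\t")
        if len(fields) < 8:
            raise ValueError(f"expected at least 8 tab-separated columns: {line!r}")
        chrom, pos, variant_id, ref, alt, qual, filt, info = fields[:8]
        yield {
            "chrom": chrom,
            "pos": int(pos),
            "id": variant_id,
            "ref": ref,
            "alt": alt.split(","),  # multiple alternate alleles are comma-separated
            "qual": None if qual == "." else float(qual),  # "." marks a missing value
            "filter": filt,
            "info": info,  # left as a raw string in this sketch
        }


if __name__ == "__main__":
    example = [
        "##fileformat=VCFv4.2\n",
        "#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO\n",
        "chr1\t12345\trs1\tA\tG,T\t50\tPASS\tDP=10\n",
    ]
    for record in parse_vcf_records(example):
        print(record)
```

Binary formats such as .bam would need a dedicated library rather than line-by-line parsing like this.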

Please chime in with more suggestions here!

bharath · September 18, 2021, 7:47pm

pyrpipe looks like a really cool resource as well: https://academic.oup.com/nargab/article/3/2/lqab049/6290623, https://github.com/urmi-21/pyrpipe

davidRFB · September 21, 2021, 6:14pm

Hi DeepChem community! I am really interested in structural bioinformatics and protein modeling. Looking through the DeepChem GitHub, here are some issues and discussions that could help summarize and add more ideas (hope that helps):

Adding protein sequence datasets to MolNet (https://github.com/deepchem/deepchem/issues/2330). I think this is a crucial step! Including datasets with labeled data, such as protein stability changes upon mutation, thermostability, and other properties, could help users create awesome models. Here is an example of a good database for protein design (https://loschmidt.chemi.muni.cz/fireprot/).
Other databases such as EBI-proteins could have a lot of sequences and curated information for this purpose.
Add support for multiple sequence alignment, homology modeling, and deep structural prediction (https://github.com/deepchem/deepchem/issues/2150). This issue has a lot of discussion about new featurizers that could be really useful; some examples are MultipleSequenceAlignment and Mean ContactPotential.
I think other feature extractors for protein sequences could be taken into account. For example, Python packages such as iFeature (paper: https://academic.oup.com/bioinformatics/article/34/14/2499/4924718, code: https://github.com/Superzchen/iFeature/) and propy (https://pypi.org/project/propy3/) can be used to extract protein sequence descriptors such as amino acid composition, amphiphilic pseudo amino acid composition descriptors, and others. Finally, 3D protein structure representations could add more interesting features with a variety of applications, such as protein-protein interface prediction or antibody-antigen binding. Here is a nice review of DL in protein structural modeling with a lot of information. [https://doi.org/10.1016/j.patter.2020.100142]
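As a small illustration of the simplest descriptor mentioned here, amino acid composition is just the fraction of each of the 20 standard residues in a sequence. The sketch below is plain Python and does not use the iFeature or propy APIs; `amino_acid_composition` is a hypothetical name, and ignoring non-standard letters is an assumption made for this example.

```python
from collections import Counter
from typing import List

# The 20 standard amino acids in alphabetical one-letter order.
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def amino_acid_composition(sequence: str) -> List[float]:
    """Return the fraction of each standard amino acid, ordered as AMINO_ACIDS.

    Hypothetical helper for illustration; non-standard letters (e.g. "X")
    are ignored, which is an assumption of this sketch.
    """
    counts = Counter(sequence.upper())
    total = sum(counts[aa] for aa in AMINO_ACIDS)
    if total == 0:
        raise ValueError("sequence contains no standard amino acids")
    return [counts[aa] / total for aa in AMINO_ACIDS]


if __name__ == "__main__":
    features = amino_acid_composition("MKTAYIAKQR")
    for aa, fraction in zip(AMINO_ACIDS, features):
        if fraction > 0:
            print(aa, round(fraction, 2))
```

The result is a fixed-length vector of 20 numbers regardless of sequence length, which is what makes this kind of descriptor easy to feed into standard models.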

TonyD · September 26, 2021, 6:39pm

Following up on the call last week, I looked back at some of the other suggestions for improving bioinformatics capabilities. Besides what was already mentioned here, some of the ideas were:

RNA structure prediction (https://academic.oup.com/nargab/article/2/4/lqaa090/5983421)
Modeling protein surface interactions (https://www.biorxiv.org/content/10.1101/606202v1.full.pdf)

I think David’s suggestions regarding featurizers are a good next step: a few are already implemented in PNet and they are core to the bioinformatics workflow, so they’re a high-impact, easy target. The featurizers plus protein sequence datasets will give DeepChem a better foundation for working with bio data, and later we can integrate some of the more cutting-edge papers.

If that sounds reasonable, I can start to put some time into trying to integrate the PNet featurizers mentioned in issue 2150.

bharath · September 30, 2021, 3:32am

This sounds like an excellent plan. How about we discuss at the next developer call? We can coordinate with Michael, who developed PNet, about the right path to start integrating these models into DeepChem.

davidRFB · November 19, 2021, 8:07pm

Hi! Following up on the call!

Here is a summary about making DeepChem more useful for bioinformatics:

Datasets and loaders (related to MolNet)

Adding labeled and unlabeled data from proteomics and genomics databases could improve MoleculeNet.
Some examples are Swiss-Prot and TrEMBL (unlabeled) or FireProt (labeled).

Some loaders for formats such as .bam/.sam etc.

Featurization and Representations

Feature extractors for protein sequences, such as amino acid composition, amphiphilic pseudo amino acid composition descriptors, and others.

Additionally, standard protein featurizations such as multiple sequence alignment and MeanContactMaps.

Some models for specific tasks

RNA structure prediction or modelling of protein surface interactions.

(New) Simulations and Molecular Modeling Support.

Based on “Adding a Simulations Modules” in the discussion Making DeepChem a Better Framework for AI-Driven Science. Additional support for docking, visualization, and molecular dynamics simulation setup and execution could improve DeepChem’s capabilities.

bharath · November 20, 2021, 6:49pm

This is a great set of suggestions! I’ve started up the DeepChem-bio chatroom on Gitter for us to coordinate this work:

Join the room if you’re interested!