Google Summer of Code 2022: Strengthening DeepChem’s Bioinformatics arm

About me:

Hello DeepChem Community!

My name is Paulina, and this summer I was given the opportunity to contribute to DeepChem via GSoC 2022. I am currently a research associate at the Gladstone Institutes, working on ArchR, a package for the analysis of scATAC-seq data. Over the past couple of months I have been studying applications of deep learning for genomics and drug discovery. The DeepChem codebase and mentors have been a great introduction to this space, and I am excited to continue learning this summer!

Project Description:
The project I am proposing will expand DeepChem’s tools for working with genomic datasets for drug discovery, thus strengthening DeepChem’s new Bioinformatics initiatives. I will implement a state-of-the-art predictive model for regulatory genomics and add the relevant datasets for testing. As part of the project, I will write a tutorial overview on interpreting regulatory sequence data using deep learning. I will work out which loaders and featurizers to use to translate genomics data into numerical representations that machine learning models can understand. I will also implement gkm-SVM so that it is easier to develop other models that depend on it down the road. A big part of my project will be identifying how to leverage DeepChem’s infrastructure for biomedical questions informed by genomics, as well as identifying future areas for development.
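
As a rough illustration of what a "numerical representation" of genomics data can look like, a DNA sequence is often one-hot encoded into a matrix with one row per base. The snippet below is a minimal NumPy sketch, not DeepChem's own featurizer; the function name `one_hot_encode` and the A, C, G, T column order are assumptions made for this example.

```python
import numpy as np

# Column order for the four DNA bases (an arbitrary choice for this sketch).
BASES = "ACGT"
BASE_TO_INDEX = {base: i for i, base in enumerate(BASES)}


def one_hot_encode(sequence: str) -> np.ndarray:
    """Return a (len(sequence), 4) one-hot matrix for a DNA string.

    Characters outside A/C/G/T (for example N) become all-zero rows.
    The function name and layout are hypothetical, for illustration only.
    """
    encoded = np.zeros((len(sequence), len(BASES)), dtype=np.float32)
    for position, base in enumerate(sequence.upper()):
        index = BASE_TO_INDEX.get(base)
        if index is not None:
            encoded[position, index] = 1.0
    return encoded


if __name__ == "__main__":
    # Each row has a single 1 marking the base; the final N row stays all zeros.
    print(one_hot_encode("ACGTN"))
```

A matrix like this can be fed directly to a convolutional or transformer model, which is one reason it is a common starting point for regulatory sequence data.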

GitHub: @paupaiz
DeepChem Slack

Happy Friday everyone! During this “Community bonding” period I got to meet the other contributors and hear about their proposals. I also brainstormed ideas with Stanley, my mentor, about leveraging transformer architectures and the amazing work of previous contributors to really bridge Hugging Face and DeepChem. This is aligned with the goals of my proposal because it would expand our tools for sequence-to-sequence predictions. I finished adding all the BibTeX tutorial citations and posted on the forum so folks can check and add or remove themselves as they like. I also researched factors we should keep in mind when adding datasets to DeepBio and added them to Arun’s issue here. We discussed how MolNet should be renamed as we add other kinds of datasets. I also started designing a new logo that would reflect this restructuring of the codebase to be more inclusive of the different scientific pillars. On a separate note, I helped a friend understand why his company would benefit from using DeepChem instead of building in-house.

Happy Friday everyone! This week I worked on the following:

  • First official PR for single-cell analysis (scVI) tutorial
  • Insights from user interviews about why they weren’t initially convinced about using DeepChem:
    • Overwhelmed/confused on what exactly they could use it for
    • Thought DeepChem had some overlaps with other libraries
      • Mentioned specifically one that has data-loaders
    • “If you know PyTorch or TF, it’s easier to program directly with them… DC should be more of an RDKit or scikit-learn: plug-and-play.”
  • Learning the Git ropes
  • Reviewing Alana’s work to build on
  • Starting to review for next week: HF data loader

Happy Friday!

This week we held a meeting to discuss how we can continue building the bridge between DeepChem and Hugging Face, leveraging the strengths of both. This helped us identify the need for a diagram that summarizes a typical user workflow in Hugging Face, so that we can locate where DeepChem should plug in and how. After this meeting I read through a couple of tutorials and compiled resources to start drawing this diagram in LucidChart. I hope it can be useful for members of the DeepChem community interested in jumping on the 🤗 boat. In the process, I noticed that Hugging Face has only a handful of genetic and protein datasets, so this could be a good contribution from DeepChem as an organization.
