Google Summer of Code 2022: Strengthening DeepChem’s Bioinformatics arm

paupaiz · June 1, 2022, 4:01am

About me:

Hello DeepChem Community!

My name is Paulina and this summer I was given the opportunity to contribute to DeepChem via GSoC 2022. I am currently a research associate at the Gladstone Institutes working on a package for analysis of scATAC-seq data called ArchR. Over the past couple of months I have been studying applications of deep learning for genomics and drug discovery. The DeepChem codebase and mentors have been a great introduction to this space and I am excited to continue learning this summer!

Project Description:
The project I am proposing would expand DeepChem’s tools for working with genomic datasets for drug discovery, thus strengthening DeepChem’s new Bioinformatics initiatives. I will be implementing a state-of-the-art predictive model for regulatory genomics and adding the relevant datasets for testing. As part of my project, I will write a tutorial overview on interpreting regulatory sequence data using deep learning. I will work out which loaders and featurizers to use to translate genomics data into numerical representations that machine learning models can understand. I will also implement gkm-SVM so that it is easier to develop other models down the road that have this dependency. A big part of my project will be identifying how to leverage DeepChem’s infrastructure towards biomedical questions informed by genomics, as well as identifying future areas for development.

Contact:
GitHub: @paupaiz
Twitter: https://twitter.com/paulinapaiz02
DeepChem Slack
www.pau-paiz.com

paupaiz · June 10, 2022, 4:10pm

Happy Friday everyone! In this “Community bonding” period I got to meet the other contributors and hear about their proposals. I also brainstormed ideas with Stanley, my mentor, about leveraging transformer architectures and the amazing work of previous contributors to really bridge Hugging Face and DeepChem. This is aligned with the goals of my proposal because it would expand our tools for sequence-to-sequence predictions. I finished adding all BibTeX tutorial citations and posted on the forum so folks can check and add/remove themselves as they like. I also researched factors we should keep in mind for adding datasets to DeepBio and added them to Arun’s issue here. We discussed how MolNet should be renamed as we add other kinds of datasets. I also started designing a new logo that would reflect this restructuring of the codebase to be more inclusive of the different scientific pillars. On a separate note, I helped a friend understand why his company would benefit from using DeepChem instead of building in-house.

paupaiz · June 17, 2022, 4:28pm

Happy Friday everyone! This week I worked on the following:

First official PR for single-cell analysis (scVI) tutorial
Insights from user interviews about why they weren’t initially convinced about using DeepChem:
- Overwhelmed/confused on what exactly they could use it for
- Thought DeepChem had some overlaps with other libraries
  - Mentioned specifically one that has data-loaders
- “If you know PyTorch or TF easier to program directly with them… DC should be more of a RDKit or Sci-kit learn: plug-and-play.”
Git ropes
Reviewing Alana’s work to build on
Starting to review for next week: HF data loader

paupaiz · June 24, 2022, 4:19pm

Happy Friday!

This week we held a meeting to discuss how we can continue building the bridge between DeepChem and Hugging Face, leveraging the strengths of both. This helped us identify the need for a diagram that summarizes a typical user workflow in Hugging Face, so we can locate where DeepChem should plug in and how. After this meeting I read through a couple of tutorials and compiled resources to start drawing this diagram in LucidChart. I hope it can be useful for anyone in the DeepChem community interested in jumping on board. In the process I noticed Hugging Face only has a handful of genetic and protein datasets, so this could be a good contribution from DeepChem as an organization.

paupaiz · July 1, 2022, 3:55pm

Happy Friday!

This week was busy as I was at the Lindau Nobel Laureate conference (here is the recording of the panel I participated in if you are interested). However, I was still able to work on the following:

Wrote data loader for VCF files (WIP)
Read about how VCF files are used in machine learning and what features are important to extract
Diagram for Hugging Face and DeepChem integration
Read about how data structure objects are handled in both HF and DC.
Next week I will start writing pseudo code for how we could pass a DeepChem dataset object and use it to train a model in Hugging Face.
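
For readers new to VCF, here is a rough sketch of the kind of features a loader might pull out of a single record. It is plain Python for illustration only, not the loader from the PR: `parse_vcf_line` is a hypothetical helper, the example record is made up, and encoding each genotype as a count of non-reference alleles is just one assumed featurization.

```python
from typing import Dict, List


def parse_vcf_line(line: str) -> Dict[str, object]:
    """Parse one tab-separated VCF data line (not a '#' header line)."""
    fields = line.rstrip("\n").split("\t")
    # The first eight columns are fixed by the VCF format.
    chrom, pos, var_id, ref, alt, qual, filt, info = fields[:8]
    # Column 9 (FORMAT) names the per-sample subfields, e.g. "GT:GQ".
    fmt_keys = fields[8].split(":") if len(fields) > 8 else []

    alt_counts: List[int] = []
    for sample in fields[9:]:
        values = dict(zip(fmt_keys, sample.split(":")))
        # Genotypes look like "0/1" (unphased) or "0|1" (phased).
        alleles = values.get("GT", ".").replace("|", "/").split("/")
        if "." in alleles:
            alt_counts.append(-1)  # missing call (illustrative convention)
        else:
            alt_counts.append(sum(1 for a in alleles if a != "0"))

    return {
        "chrom": chrom,
        "pos": int(pos),
        "id": var_id,
        "ref": ref,
        "alt": alt.split(","),
        "qual": None if qual == "." else float(qual),
        "filter": filt,
        "info": info,
        "alt_counts": alt_counts,
    }


# Made-up example record with three samples.
record = ("20\t14370\trs6054257\tG\tA\t29\tPASS\tNS=3;DP=14\tGT:GQ"
          "\t0|0:48\t1|0:48\t1/1:43")
parsed = parse_vcf_line(record)
print(parsed["pos"], parsed["alt_counts"])  # 14370 [0, 1, 2]
```

Stacking `alt_counts` across records gives a variants-by-samples matrix, which is the sort of numerical array a machine learning model can consume.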

paupaiz · July 8, 2022, 9:14pm

Happy Friday!

Here are my updates for the week:

Made progress on VCF and FASTQ loader
Started writing code to add VCF and FASTQ files to MolNet
- Link to script where I am working on classes and methods described above
Studied FASTA loader already on DeepChem
Studied Kipoi utilities for VCF handling
- Will refactor code to go from using scikit-allel to kipoi

If you are interested: More information on VCF and FASTQ files which I hope provides intuition on why it would be useful for DeepChem to support them as well as an example workflow using both.
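
To give some intuition for the FASTQ side, below is a minimal sketch of how a FASTQ record is laid out and how its quality string maps to numbers. It assumes the common four-lines-per-record layout and Phred+33 quality encoding; `read_fastq` is a hypothetical helper written for this post, not the DeepChem loader.

```python
from typing import Iterator, List, Tuple


def read_fastq(lines: List[str]) -> Iterator[Tuple[str, str, List[int]]]:
    """Yield (read_id, sequence, phred_scores) from FASTQ text lines.

    Assumes each record spans exactly four lines:
    '@id', sequence, '+', quality string.
    """
    for i in range(0, len(lines) - 3, 4):
        header, seq, plus, qual = (ln.strip() for ln in lines[i:i + 4])
        if not header.startswith("@") or not plus.startswith("+"):
            raise ValueError(f"Malformed FASTQ record at line {i + 1}")
        if len(seq) != len(qual):
            raise ValueError(f"Sequence/quality length mismatch at line {i + 1}")
        # Phred+33: quality score = ASCII code of the character minus 33.
        scores = [ord(c) - 33 for c in qual]
        yield header[1:], seq, scores


example = ["@read1", "GATTACA", "+", "IIIIII#"]
for read_id, seq, scores in read_fastq(example):
    print(read_id, seq, scores)  # read1 GATTACA [40, 40, 40, 40, 40, 40, 2]
```

The per-base scores are what make FASTQ richer than FASTA: a model or a filtering step can down-weight or drop low-confidence bases such as the final one here.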

paupaiz · July 15, 2022, 4:27pm

Happy Friday!

Here are my updates for the week:

The updated vcfloader now has sharding enabled. Files can be broken by number of variants or number of samples per variant or even both.
Scikit-allel is the package I am using to read the variant call data. After studying and testing Kipoi extensively, I found it has lots of low-level dependencies that I unfortunately don’t have the expertise to work with, but scikit-allel does the job!
Extending the single-cell tutorial to include ScanPy as it is part of the scvi-tools workflow.
Next week: allow sharding for FASTA (a TO-DO item of the person who developed it, already on DeepChem) and FASTQ dataloaders.
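
The sharding idea described above (splitting by variants, by samples, or both) can be sketched as follows. This is a simplified illustration on a plain list-of-lists genotype matrix, with rows as variants and columns as samples; `iter_shards` and its parameter names are hypothetical and do not mirror the actual loader's API.

```python
from typing import Iterator, List, Optional

Matrix = List[List[int]]


def iter_shards(genotypes: Matrix,
                variants_per_shard: Optional[int] = None,
                samples_per_shard: Optional[int] = None) -> Iterator[Matrix]:
    """Yield blocks of a variants-by-samples matrix.

    Leaving a limit as None means "do not split along that axis".
    """
    n_variants = len(genotypes)
    n_samples = len(genotypes[0]) if genotypes else 0
    v_step = variants_per_shard or max(n_variants, 1)
    s_step = samples_per_shard or max(n_samples, 1)
    for v in range(0, n_variants, v_step):
        for s in range(0, n_samples, s_step):
            # Slice rows (variants) first, then columns (samples).
            yield [row[s:s + s_step] for row in genotypes[v:v + v_step]]


# 4 variants x 3 samples, values are made-up alt-allele counts.
genotypes = [[0, 1, 2],
             [1, 1, 0],
             [2, 0, 0],
             [0, 0, 1]]
shards = list(iter_shards(genotypes, variants_per_shard=2, samples_per_shard=2))
print(len(shards))  # 4
print(shards[0])    # [[0, 1], [1, 1]]
```

Yielding shards lazily means only one block needs to be in memory at a time, which is the main point of sharding large variant files.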

paupaiz · July 22, 2022, 10:49pm

Happy Friday!

This week I extended DeepBio’s tutorial offerings by adding a tutorial of ScanPy as it is part of the scvi-tools workflow.
- One thing I was surprised to learn is how even the core tutorials of packages like this don’t run smoothly, or perhaps used to run and don’t anymore, which can be discouraging for new users.
- I also found that some issues I hit while running the notebook locally in VSCode wouldn’t reproduce on Google Colab with a GPU runtime…
I didn’t get to sharding for the FASTA (a TO-DO item of the person who developed it, already on DeepChem) and FASTQ dataloaders, but I don’t expect it to take too long.
Here is a useful demonstration of adding “Open in Colab Buttons” to your Jupyter Notebooks in case anyone is working on tutorials for other topics.

paupaiz · July 29, 2022, 3:34pm

Happy Friday! My update for the week:

Thank you for the feedback on both my scvi-tools and scanpy tutorials. I implemented the feedback and submitted a new PR for the scvi-tools tutorial and a different one for the ScanPy tutorial.
- What to do with readme commits?
- How to check more easily when doing the PRs?
Wrote unit tests for FASTQ loader
Adding docstrings and checking style for FASTQ loader
Will test Aryan’s style guideline checks
Allowed sharding for the FASTQ data loader. Today I will work on FASTA (a TO-DO item of the person who developed it, already on DeepChem); I don’t expect it to take too long.
DeepChem Featurizer Tutorials? issue #1143
- One-hot featurizer
Licenses (important as an open-source developer)
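
As a pointer for the one-hot featurizer item above, here is a minimal sketch of what one-hot encoding a DNA sequence looks like. It is pure Python for illustration and is not DeepChem's featurizer; the `one_hot_encode` name, the "ACGT" column order, and the choice to map unknown bases such as N to an all-zero row are assumptions.

```python
from typing import List


def one_hot_encode(sequence: str, alphabet: str = "ACGT") -> List[List[int]]:
    """Return one row per base, with a 1 in the column for that base."""
    index = {base: i for i, base in enumerate(alphabet)}
    encoded = []
    for base in sequence.upper():
        row = [0] * len(alphabet)
        if base in index:
            row[index[base]] = 1
        # Bases outside the alphabet (e.g. "N") stay all zeros.
        encoded.append(row)
    return encoded


print(one_hot_encode("ACGN"))
# [[1, 0, 0, 0], [0, 1, 0, 0], [0, 0, 1, 0], [0, 0, 0, 0]]
```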

paupaiz · August 5, 2022, 4:42pm

Happy Friday! My update for the week:

Merged Scanpy tutorial PR
Implemented feedback for scvi-tools tutorial PR
Haven’t been able to commit changes because of pre-commit error
Test Aryan’s style guideline checks
Wrote unit tests for FASTQ loader
- Adding docstrings and checking style for FASTQ loader
- Need to upload the test files.
- Need to add FASTQ loader to the readthedocs
Modified the tutorial citation; please check the other commits

I do want to get the licenses overview put together out of curiosity but will do that once I merge the PRs I have hanging.

paupaiz · August 12, 2022, 4:02pm

Happy Friday! This week I focused on:

PR for scVI-tools
- Added more comments and background information.
- Implemented feedback
FASTQ loader
- Corrected indentation (Shift tab vs double space in vscode)
Fixed Environment issues
Numpy docstring conventions & linting & unittest
- PEP 8 79 characters
Thanks for Pycharm (will move later)

paupaiz · August 19, 2022, 3:43pm

Happy Friday!
Unfortunately, I didn’t get much done this week because I was struggling with the unit tests and coding convention tests, and I also switched machines, but here is my update:

Got PR for scvi-tools merged in
Added the 2 FASTQ files for testing, the unit tests and the loader itself to my branch
- Final round of testing!

Thank you to @ARY2260 for helping me this week!

paupaiz · August 26, 2022, 4:05am

Hi everyone,

This week I finally had no build issues related to FASTQ loader.
Here is a summary of the checks that might be useful for others:

Convention Checks

yapf -i
flake8 --count
python -m doctest
pytest -v -m 'slow' deepchem
mypy -p deepchem --ignore-missing-imports

yapf 0.32	code formatting
flake8	linting (syntax and style)
doctest	runs the examples in docstrings
pytest	functionality; arguably the most important. Also runs unittest-style tests
mypy	static type checker

Run pytest and mypy like this:
pytest path_to_file -v
mypy --follow-imports=skip path_to_file

I also started studying “A Brief Guide to Metagenomic Sequence Analysis” to put together a notebook for DeepBio

paupaiz · September 2, 2022, 3:15pm

Happy Friday everyone! Here are my updates:

Merged FASTQ loader (hope to test it on the metagenomics pipeline)
Metagenomics pipeline
3 tools for nanopore data: OLC

Canu
SPAdes
MinION

Canu is the fastest w/o compromising quality
Hierarchical assembler (correction -> trimming -> Assembly)
Colab notebook where I started testing Canu
“A step towards neural genome assembly” and “Learning to untangle genome assembly with graph convolutional networks” for the layout phase
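
For anyone unfamiliar with OLC (overlap, layout, consensus), the overlap step can be sketched with a toy example. This is an assumption-laden illustration using exact suffix/prefix matching on made-up, error-free reads; real assemblers such as Canu use much more scalable, error-tolerant overlap detection, and the function names here are hypothetical.

```python
from typing import Dict, List, Tuple


def overlap_length(a: str, b: str, min_length: int = 3) -> int:
    """Length of the longest suffix of `a` that equals a prefix of `b`."""
    shortest_allowed = max(min_length, 1)
    for n in range(min(len(a), len(b)), shortest_allowed - 1, -1):
        if a[-n:] == b[:n]:
            return n
    return 0


def overlap_graph(reads: List[str],
                  min_length: int = 3) -> Dict[Tuple[int, int], int]:
    """Map (i, j) to the overlap length when read i runs into read j."""
    edges: Dict[Tuple[int, int], int] = {}
    for i, a in enumerate(reads):
        for j, b in enumerate(reads):
            if i != j:
                n = overlap_length(a, b, min_length)
                if n:
                    edges[(i, j)] = n
    return edges


reads = ["ATGCGT", "GCGTAC", "GTACGG"]
print(overlap_graph(reads))  # {(0, 1): 4, (1, 2): 4}
```

The layout phase then has to find a path through this graph (here 0 -> 1 -> 2), which is the part the two graph-neural-network papers above try to learn.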

paupaiz · September 9, 2022, 3:25pm

Happy Friday everyone! This final week I reviewed the Submission Guidelines and wrapped up my work here:
https://docs.google.com/document/d/17apLcN7yTMlI7hDht1ZNJcshMKAwGjyUVLKFQE2XoeM/edit#heading=h.qi1laykwdbox

I also met with Jay (from New Atlantis) to look at the metagenome pipelines as a motivating case for the assembly class.