Hello!
This summer, I had the chance to contribute to DeepChem through the Google Summer of Code program.
Links:
Link to Proposal: https://summerofcode.withgoogle.com/programs/2025/projects/y70fgXcU
Link to weekly updates thread: GSOC 2025 Project: Integrating Single Cell RNA-seq modeling for Cell Type Identification into Deepchem
My Github: https://github.com/Harindhar10
My LinkedIn: https://www.linkedin.com/in/harindhar-adithya-34864421b/
Preface:
My project focused on integrating single-cell RNA-sequencing data modeling into DeepChem. Specifically, I set out to incorporate the ACTINN model (Automated Cell Type Identification using Neural Networks). In hindsight, this was quite ambitious given the constraints and conventions of contributing to an open-source library. The original goal was to add the model directly to the codebase, but this proved challenging since DeepChem lacked the necessary infrastructure for loading (dataloader class) and storing (dataset class) such data. As a result, the project goal shifted toward providing support through tutorials instead.
DeepChem offers a few tutorials for working with scRNA-seq data (using the Scanpy and scVI libraries), but these are standalone tutorials that are not connected to the rest of the DeepChem ecosystem. My contributions were aimed at analysing scRNA-seq data with DeepChem itself.
PR 1:
This PR is a tutorial that introduces scRNA-seq data to beginners and implements, trains, and evaluates the ACTINN model using DeepChem.
Challenges and Learnings:
In ACTINN’s original implementation, feature selection was performed using both the training and test sets, which risked information leakage from the test set. To avoid this, the DeepChem implementation carried out feature selection exclusively on the training set and then used the corresponding feature IDs (gene names) to subset the selected features from the test set.
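To make the idea concrete, here is a minimal sketch of train-only feature selection followed by name-based subsetting of the test set. It is not the tutorial's code: the helper names `select_genes_on_train` and `subset_by_gene_name` are hypothetical, the data is random, and plain variance ranking stands in for ACTINN's actual gene filter.

```python
import numpy as np


def select_genes_on_train(X_train, gene_names, n_top):
    """Pick the n_top highest-variance genes using the training cells only."""
    variances = X_train.var(axis=0)
    top = np.argsort(variances)[::-1][:n_top]
    return [gene_names[i] for i in top]


def subset_by_gene_name(X, gene_names, selected):
    """Keep only the selected genes, looked up by name rather than position."""
    index_of = {name: i for i, name in enumerate(gene_names)}
    cols = [index_of[name] for name in selected]
    return X[:, cols]


rng = np.random.default_rng(0)
train_genes = [f"gene_{i}" for i in range(6)]  # made-up gene names

# Rows are cells, columns are genes (random counts, for illustration only).
X_train = rng.poisson(5, size=(20, 6)).astype(float)
X_test_reference = rng.poisson(5, size=(8, 6)).astype(float)

# Simulate a test set whose columns arrive in a different gene order.
test_genes = train_genes[::-1]
X_test = X_test_reference[:, ::-1]

# The test set plays no part in deciding which genes are kept.
selected = select_genes_on_train(X_train, train_genes, n_top=3)
X_train_sel = subset_by_gene_name(X_train, train_genes, selected)
X_test_sel = subset_by_gene_name(X_test, test_genes, selected)
```

Looking genes up by name rather than by column position keeps the two matrices aligned even when the test set lists its genes in a different order.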
PR 2:
Scanpy and AnnData are widely used open-source libraries for scRNA-seq data analysis. I enhanced the existing Scanpy tutorial by adding an introduction to these tools, fixing bugs, and demonstrating how a neural network can be trained on Scanpy-processed data using DeepChem.
Challenges and Learnings:
Working with AnnData and understanding how it organizes data helped me see the limitations of DeepChem’s Dataset classes and pushed me to think about ways to address them. One key limitation is that DeepChem’s Dataset classes cannot store feature-level metadata (such as column or gene names). I first tried storing the column names alongside the data points in the X variable, but this approach failed since the splitters shuffle the data. My next idea was to design a new Dataset class capable of storing feature-level metadata. However, since that would be a fairly large undertaking, I decided not to pursue it within the scope of the summer project.
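As a rough illustration of what feature-level metadata could look like, here is a small sketch of a container that keeps gene names next to the data matrix. `LabeledMatrix` is a hypothetical class written with plain NumPy; it is not part of DeepChem and only shows why keeping the names outside of X survives a shuffle of the rows.

```python
from dataclasses import dataclass

import numpy as np


@dataclass
class LabeledMatrix:
    """Hypothetical container: X is (cells, genes), y is (cells,)."""
    X: np.ndarray
    y: np.ndarray
    feature_names: list

    def __post_init__(self):
        if self.X.shape[1] != len(self.feature_names):
            raise ValueError("need one feature name per column of X")
        if self.X.shape[0] != self.y.shape[0]:
            raise ValueError("X and y must have the same number of rows")

    def shuffled(self, seed=0):
        """Shuffle the rows (cells); the column names are untouched."""
        order = np.random.default_rng(seed).permutation(self.X.shape[0])
        return LabeledMatrix(self.X[order], self.y[order], list(self.feature_names))

    def select_features(self, names):
        """Subset the columns by name, in the order given."""
        index_of = {name: i for i, name in enumerate(self.feature_names)}
        cols = [index_of[name] for name in names]
        return LabeledMatrix(self.X[:, cols], self.y, list(names))


data = LabeledMatrix(
    X=np.arange(12.0).reshape(4, 3),
    y=np.array([0, 1, 0, 1]),
    feature_names=["CD3E", "CD19", "MS4A1"],
)
shuffled = data.shuffled(seed=1)
subset = shuffled.select_features(["MS4A1", "CD3E"])
```

Because the names describe columns and a shuffle only reorders rows, the metadata stays valid after shuffling, which is exactly what breaks when the names are stored as an extra row inside X.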
PR 3:
This PR adds a loss curve to visualise how the training loss changes while training a DeepChem model on Scanpy-processed data.
Challenges and Learnings:
I learned the importance of analysing the loss curve to check that an ML model is neither underfitting nor overfitting.
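One way to read a loss curve programmatically is sketched below. It is a simple heuristic of my own, not something from the PR: it assumes a validation loss is recorded per epoch alongside the training loss, that losses are positive, and the function name `diagnose_loss_curves` and the 5% tolerance are arbitrary choices for illustration.

```python
import numpy as np


def diagnose_loss_curves(train_loss, valid_loss, tolerance=0.05):
    """Return a rough diagnosis and the epoch with the lowest validation loss."""
    train_loss = np.asarray(train_loss, dtype=float)
    valid_loss = np.asarray(valid_loss, dtype=float)
    best_epoch = int(np.argmin(valid_loss))

    # Underfitting signal: the training loss barely moved from its start.
    if train_loss[0] - train_loss[-1] < tolerance * train_loss[0]:
        return "possible underfit", best_epoch

    # Overfitting signal: validation loss has risen from its minimum
    # while the training loss kept falling.
    valid_rose = valid_loss[-1] > valid_loss[best_epoch] * (1 + tolerance)
    train_fell = train_loss[-1] < train_loss[best_epoch]
    if valid_rose and train_fell:
        return "possible overfit", best_epoch

    return "no obvious problem", best_epoch


# Made-up per-epoch losses for illustration.
overfit_case = diagnose_loss_curves(
    [1.0, 0.6, 0.4, 0.3, 0.2, 0.1], [1.1, 0.7, 0.5, 0.55, 0.7, 0.9]
)
underfit_case = diagnose_loss_curves([1.0, 0.99, 0.98], [1.0, 1.0, 0.99])
healthy_case = diagnose_loss_curves([1.0, 0.5, 0.3], [1.1, 0.6, 0.4])
```

A check like this does not replace looking at the plot, but the returned epoch gives a reasonable place to stop training or to look more closely.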
Future Work:
Adding a dataloader class to load .h5 data, a dataset class similar to AnnData, and basic preprocessing functions to DeepChem would go a long way toward DeepChem-izing the pipeline.
Acknowledgement:
I’m grateful to my mentors, Rakshit Kr. Singh and Bharath Ramsundar, for their support and feedback throughout this project. Contributing to DeepChem has been a great learning experience. It gave me a better understanding of how open-source scientific software is built and maintained, and also taught me the importance of writing code and tutorials that are accessible even to people who aren’t familiar with programming.