GSoC 2025 Project: Integrating Single-Cell RNA-seq Modeling for Cell Type Identification into DeepChem

Hi everyone!

I’m Harindhar, a final-year Dual Degree student in Biological Engineering at IIT Madras. I’m using this thread to document my GSoC 2025 project with DeepChem. As part of the project, I’ll be integrating ACTINN, a neural network model for automated cell type identification in single-cell RNA-seq data, into the DeepChem ecosystem.

I’ll be sharing weekly updates here throughout the summer as the project progresses. Stay tuned!

Progress: June 2–8

Ran training and evaluation loops for the existing TF and PyTorch implementations.

Roadblocks: The script kept crashing during data manipulation (normalization, concatenation, etc.) because of the dataset size (20k genes × 50k cells), even though the data loaded fine in pandas.

Fixes: Sampled 1k cells so the code could run. Some sparsely expressed genes had zero total expression, which produced NaNs when computing the coefficient of variation (CV = std/mean). Filtering out genes with zero mean resolved this.
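To illustrate the NaN issue, here is a minimal NumPy sketch rather than the actual project code. It assumes a dense genes × cells matrix of non-negative expression values; the function name `cv_per_gene` and the toy matrix are made up for illustration.

```python
import numpy as np

def cv_per_gene(expr):
    """Return per-gene CV (std / mean) and the mask of genes that were kept.

    expr: 2-D array of shape (n_genes, n_cells), assumed non-negative.
    Genes with zero mean are dropped first, since std / mean would be 0 / 0.
    """
    expr = np.asarray(expr, dtype=np.float64)
    mean = expr.mean(axis=1)
    keep = mean > 0                      # a zero-mean gene would give NaN
    cv = expr[keep].std(axis=1) / mean[keep]
    return cv, keep

# Toy matrix: 3 genes x 4 cells; the second gene is never expressed.
toy = np.array([[1.0, 2.0, 3.0, 4.0],
                [0.0, 0.0, 0.0, 0.0],
                [5.0, 5.0, 5.0, 5.0]])
cv, keep = cv_per_gene(toy)
```

Here `keep` marks the first and third genes, and `cv` contains no NaNs because the all-zero gene never reaches the division.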

To do:

  • Create a DeepChem iterator for single-cell RNA-seq data
  • Add the classification layer

Progress: June 9–16

  • Converted the .h5 file to a .csv format and sampled 1,000 rows due to memory limitations. Created a DiskDataset object using CSVLoader.
  • Currently working on a featurizer that normalizes the data and removes genes in the bottom and top 1%.
  • Exploring alternative normalization and gene filtering strategies, since the full dataset cannot be loaded into memory at once.
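One possible way around the memory limit, sketched below under my own assumptions, is to accumulate per-gene statistics chunk by chunk and only then build the percentile mask. This is not the featurizer from the project. It assumes each chunk is a cells × genes array (rows are cells, as in the sampled CSV), and the names `streaming_gene_stats` and `percentile_mask` are hypothetical.

```python
import numpy as np

def streaming_gene_stats(chunks):
    """Per-gene mean and (population) std from an iterable of chunks.

    Each chunk has shape (n_cells_in_chunk, n_genes); only running sums
    are kept in memory, never the full matrix.
    """
    n = 0
    total = None
    total_sq = None
    for chunk in chunks:
        chunk = np.asarray(chunk, dtype=np.float64)
        if total is None:
            total = np.zeros(chunk.shape[1])
            total_sq = np.zeros(chunk.shape[1])
        total += chunk.sum(axis=0)
        total_sq += (chunk ** 2).sum(axis=0)
        n += chunk.shape[0]
    mean = total / n
    # Clip tiny negative values caused by floating-point round-off.
    var = np.maximum(total_sq / n - mean ** 2, 0.0)
    return mean, np.sqrt(var)

def percentile_mask(values, low=1.0, high=99.0):
    """Boolean mask keeping values between the low and high percentiles."""
    lo, hi = np.percentile(values, [low, high])
    return (values >= lo) & (values <= hi)

# Synthetic stand-in for the real data: 1,000 cells x 50 genes in 10 chunks.
rng = np.random.default_rng(0)
data = rng.poisson(2.0, size=(1000, 50)).astype(float)
chunks = (data[i:i + 100] for i in range(0, 1000, 100))
mean, std = streaming_gene_stats(chunks)
mask = percentile_mask(mean)
```

In practice the chunks could come from something like `pandas.read_csv(..., chunksize=...)`. Any per-cell normalization can be applied to each chunk before it is accumulated, since it only needs one row at a time.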

Week 3 (June 16 - 22)

  • Opened a draft PR to merge the ACTINN featurizer (PR #4463)
  • The featurizer works with DeepChem’s CSVLoader class

To do:

  • Figure out how to preprocess the train and test data (the implementation provided with the paper processes them together)
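One option for the open question above is to learn the gene filter from the training set only and reuse the same mask on the test set. The sketch below shows that idea under my own assumptions; it is not what the paper's implementation does and not a decision made in the project. `GeneFilter` is a hypothetical helper, and filtering on total expression per gene is just one example criterion.

```python
import numpy as np

class GeneFilter:
    """Hypothetical helper: choose genes from training data, apply anywhere."""

    def __init__(self, low=1.0, high=99.0):
        self.low = low
        self.high = high
        self.mask_ = None

    def fit(self, X_train):
        # X_train has shape (n_cells, n_genes); rank genes by total expression.
        total = np.asarray(X_train, dtype=np.float64).sum(axis=0)
        lo, hi = np.percentile(total, [self.low, self.high])
        self.mask_ = (total >= lo) & (total <= hi)
        return self

    def transform(self, X):
        if self.mask_ is None:
            raise RuntimeError("call fit() before transform()")
        # The same gene columns are kept no matter which split X comes from.
        return np.asarray(X)[:, self.mask_]

# Synthetic train/test splits that share the same 40 genes.
rng = np.random.default_rng(1)
train = rng.poisson(3.0, size=(200, 40))
test = rng.poisson(3.0, size=(50, 40))
gene_filter = GeneFilter().fit(train)
train_f = gene_filter.transform(train)
test_f = gene_filter.transform(test)
```

With this split between `fit` and `transform`, the test data never influences which genes are kept, and both splits end up with identical feature columns.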