GSoC'25 Project: Model-Parallel DeepChem Model Training

Hey everyone!

I’m Bhuvan, a 3rd-year Computer Science student from Bangalore, India. Super excited to share that I’ll be working on multi-GPU support for DeepChem models for my GSoC 2025 project! This is something I’ve been tinkering with for a while, and I can’t wait to dive deeper into the challenges of distributed training. I’ll be posting weekly updates here, so be sure to check them out!

Looking forward to learning and building this. Catch you in the updates! :slight_smile:

Week 1: June 2 - June 8

Work Done:

  • Implemented custom __getitem__ and __len__ methods for DiskDataset to ensure compatibility with torch.utils.data.DataLoader.
  • Developed a custom collate function for models inheriting the TorchModel base class, specifically for the Lightning Data wrapper.
  • Created a Lightning data wrapper to utilize DiskDataset data with PyTorch Lightning.
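To make the indexing idea above concrete, here is a minimal sketch of how a sharded dataset can expose `__len__` and `__getitem__`, which is the map-style protocol that `torch.utils.data.DataLoader` expects. It is not the actual DiskDataset code: `ShardedSamples` is a hypothetical in-memory stand-in, and the lookup via a running total of shard sizes is an assumption about one reasonable way to map a global index to a shard.

```python
from bisect import bisect_right
from itertools import accumulate
from typing import List, Sequence, Tuple

# One sample is (features, label, weight, id), mirroring the X, y, w, ids
# fields of a DeepChem dataset. Everything below is a hypothetical stand-in.
Sample = Tuple[List[float], float, float, str]


class ShardedSamples:
    """Map-style view over in-memory 'shards' of unequal length."""

    def __init__(self, shards: Sequence[Sequence[Sample]]) -> None:
        self._shards = [list(shard) for shard in shards]
        # Running totals of shard sizes, e.g. sizes [3, 1, 4] -> [3, 4, 8].
        self._cumulative = list(accumulate(len(s) for s in self._shards))

    def __len__(self) -> int:
        return self._cumulative[-1] if self._cumulative else 0

    def __getitem__(self, index: int) -> Sample:
        n = len(self)
        if index < 0:
            index += n  # support negative indices, like a list
        if not 0 <= index < n:
            raise IndexError(f"index out of range for dataset of size {n}")
        # Find the first shard whose running total exceeds the index.
        shard_id = bisect_right(self._cumulative, index)
        start = self._cumulative[shard_id - 1] if shard_id > 0 else 0
        return self._shards[shard_id][index - start]


# Two shards of uneven size (3 samples, then 1 sample).
shards = [
    [([0.0], 0.0, 1.0, "a"), ([1.0], 1.0, 1.0, "b"), ([2.0], 0.0, 1.0, "c")],
    [([3.0], 1.0, 1.0, "d")],
]
data = ShardedSamples(shards)
print(len(data), data[3][3], data[-1][3])  # prints: 4 d d
```

Because only `__len__` and `__getitem__` are involved, an object like this can be handed to a DataLoader as-is. The real implementation reads shards from disk rather than holding them in memory.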

Issues Faced:

  • The iterable method implemented by DeepChem currently causes GPU deadlocks in a multi-GPU environment.
  • Designing a custom general collate function for all models inheriting the TorchModel base class directly.
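As a rough illustration of the collate problem, here is one possible shape for a model-agnostic collate function. It is a sketch under assumptions, not the function from the PR: `collate_samples` is a hypothetical name, and it assumes each sample is an `(X, y, w, id)` tuple. It only regroups samples column-wise and leaves any model-specific conversion (tensors, graph batching for DGL-based models) to a later step.

```python
from typing import List, Sequence, Tuple

# Hypothetical sample and batch layouts: (features, label, weight, id).
Sample = Tuple[List[float], float, float, str]
Batch = Tuple[List[List[float]], List[float], List[float], List[str]]


def collate_samples(samples: Sequence[Sample]) -> Batch:
    """Regroup a list of per-sample tuples into per-field lists."""
    if not samples:
        raise ValueError("cannot collate an empty batch")
    # zip(*samples) transposes [(X0, y0, w0, id0), (X1, y1, w1, id1), ...]
    # into (X0, X1, ...), (y0, y1, ...), (w0, w1, ...), (id0, id1, ...).
    features, labels, weights, ids = zip(*samples)
    return list(features), list(labels), list(weights), list(ids)


batch = collate_samples([
    ([0.0, 1.0], 0.0, 1.0, "a"),
    ([2.0, 3.0], 1.0, 0.5, "b"),
])
print(batch[3])  # prints: ['a', 'b']
```

With PyTorch, a function like this would be passed through the `collate_fn` argument of `torch.utils.data.DataLoader`. Keeping it free of model-specific logic is one way to make it reusable across models, at the cost of needing a separate conversion step per model.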

Open Pull Requests:

Week 2: June 9 - June 13

Work Done

  • Wrote a custom test case for checkpoint saving/loading and model fit/prediction, using two Torch models as examples:
    • MultitaskClassifier: A simple model that inherits from the TorchModel base class.
    • GCN: A model that inherits from TorchModel and also uses DGL.
  • Created a Lightning model wrapper for models inheriting the TorchModel base class, using PyTorch Lightning.
  • Developed a simple trainer, integrating previously written Lightning data and model wrappers.

Issues Faced

  • Ensuring the wrapper worked with models utilizing DGL. This was later resolved by properly installing DGL from here.
  • Making sure train/predict functionality worked as expected in the TorchModel base class initially took some time.

Open Pull Requests

Week 3: June 14 - June 20

Work Done

  • DiskDataset Refactor:
    • Separated DiskDataset enhancements from the main PR (#4454) into a dedicated PR (#4458).
    • Added detailed docstrings for better clarity and maintainability.
  • Testing Improvements:
    • Created a pytest fixture dummy_disk_dataset for testing datasets with uneven shard sizes.
    • Added unit tests for:
      • __getitem__: Including edge cases and out-of-bounds access.
      • _cumulative_sum: Verifying correctness across a variety of inputs.
  • PyTorch Lightning Integration:
    • Introduced DeepChemLightningDataModule and DeepChemLightningModule in PR #4461.
    • Integration tests for PyTorch Lightning with the GCN model cover:
      • Training and prediction workflows
      • Checkpointing capabilities
      • GPU support for accelerated computation
    • Comprehensive tests added to ensure the Lightning modules behave as expected. The test results are shown in the image below.
  • Package Exposure:
    • Updated __init__.py to expose the new Lightning module classes.

Issues Faced

  • Minor integration issues and strict type-checking challenges.

Open Pull Requests

Slide Link