GSoC '25 | Model-Parallel DeepChem Model Training | Final Report

Hello all, this is my final report for my GSoC '25 project.

Project Title: Model-Parallel DeepChem Model Training
Contributor: Bhuvan M
My LinkedIn: https://www.linkedin.com/in/bhuvan-murthy
My GitHub: https://github.com/bhuvanmdev

1. Project Overview:

This project tackled the challenge of training models too large for a single GPU. By integrating PyTorch Lightning’s distributed training capabilities, DeepChem can now train massive chemical language models and molecular transformers that were previously constrained by memory limitations.

2. Key Contributions:

  • Lightning Dataset Module: The existing DCLightningDataModule in DeepChem lacked predict loaders and relied on manual sampling through iterative/generative data pipelines. While this approach worked for DDP setups, it caused deadlocks and numerous other issues during FSDP training. My key contribution was making sampling, batching, and shuffling dynamic by letting torch.utils.data.DataLoader handle these operations natively, which resolved the distributed training stability issues (see the DataModule sketch after this list).

  • Lightning Model Module: The existing DCLightningModule in DeepChem supported only native DeepChem models, and only under DDP; DeepChem models that use HuggingFace as their base were not supported at all - neither training nor inference worked with them. There was also no proper predict method for inference through the Lightning wrapper. My implementation provides full support for both training and prediction across the DeepChem ecosystem: native DeepChem models as well as DeepChem models built on HuggingFace foundations (see the module sketch after this list). I also made sure that the checkpoints produced by Lightning are compatible with DeepChem’s TorchModel class.

  • Random Access Map-style Dataset: The original generator-based approach worked fine for single-GPU training but caused deadlocks and instability in distributed environments. My _TorchIndexDiskDataset implementation provides random, index-based access to DeepChem datasets, which works seamlessly with both FSDP and DDP strategies (a minimal sketch follows this list). For a simple breakdown of why this was essential and how it solved the distributed training challenges, check out my slides analysis.

  • LightningTorchModel Integration: I developed the LightningTorchModel class as a comprehensive wrapper that bridges DeepChem’s model ecosystem with PyTorch Lightning’s distributed training capabilities. The class inherits from DeepChem’s base Model and exposes a clean, unified interface that abstracts away the complexity of distributed training setup. It handles multiple strategies (FSDP and DDP), gradient accumulation, mixed-precision training, and checkpoint management while maintaining full compatibility with both native DeepChem models and HuggingFace-based architectures. Through methods like fit(), predict(), and restore(), it gives researchers a familiar API that scales from single-GPU prototyping to multi-GPU training without requiring any changes to their existing DeepChem workflows (a usage sketch follows this list).

  • Advanced Collate Function: The collate_dataset_fn in lightning_utils.py is a significant evolution of the simple DCLightningDatasetBatch wrapper. While DCLightningDatasetBatch just wrapped batch data into lists, the new collate function performs model-aware preprocessing by leveraging the model’s default_generator() and _prepare_batch() methods. This enables model-specific transformations and ensures proper tensor formatting for both native DeepChem models and HuggingFace-based architectures. The function extracts features, labels, weights, and IDs from the batch, creates a temporary NumpyDataset, processes it through the model’s pipeline, and returns properly formatted tensors ready for training or inference (see the collate sketch after this list). This eliminates the manual data handling required by the simpler batch wrapper and provides a robust foundation for distributed training scenarios.

  • Real-World Testing: Every feature was battle-tested on Kaggle’s 2x T4 GPUs. I wanted to make sure this wasn’t just theoretical: it had to work in the kind of environment where researchers actually do their work. Each feature I developed has 2-3 test cases that were run in this multi-GPU setting. The test cases live in test_lightning_utils.py, test_dc_lightning_modules.py, test_dc_lightning_trainer.py, and test_lightning_hf.py.
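
To make the pieces above concrete, the sketches below walk through them in rough data-flow order: the map-style dataset, the model-aware collate function, the data and model modules, and finally end-user training with LightningTorchModel. Class names prefixed with Minimal, helpers such as make_collate_fn, and any constructor arguments not shown in the PRs are illustrative assumptions, not the merged code. First, the idea behind _TorchIndexDiskDataset: expose a DeepChem dataset through __len__/__getitem__ so torch.utils.data.DataLoader (and its DistributedSampler) can own sampling and shuffling.

```python
import torch
import deepchem as dc


class MinimalIndexDataset(torch.utils.data.Dataset):
    """Sketch of a map-style view over a DeepChem dataset.

    Random access by index lets DataLoader own sampling and shuffling,
    avoiding the iterator-style pipelines that deadlocked under FSDP.
    """

    def __init__(self, dataset: dc.data.Dataset):
        self.dataset = dataset

    def __len__(self) -> int:
        return len(self.dataset)

    def __getitem__(self, idx: int):
        # For illustration the in-memory arrays are indexed directly; a
        # DiskDataset-backed version would resolve the shard holding `idx`
        # instead of materializing everything at once.
        return (self.dataset.X[idx], self.dataset.y[idx],
                self.dataset.w[idx], self.dataset.ids[idx])
```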
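
Next, the idea behind collate_dataset_fn: instead of hand-rolling tensor conversion, batch the raw samples into a temporary NumpyDataset and reuse the model’s own default_generator()/_prepare_batch() pipeline. The helper name make_collate_fn and the sample layout are assumptions; the merged function lives in PR #4483.

```python
import numpy as np
import deepchem as dc


def make_collate_fn(dc_model):
    """Sketch of a model-aware collate_fn for a torch DataLoader."""

    def collate(samples):
        # samples: list of (X, y, w, ids) tuples from the map-style dataset.
        X, y, w, ids = (np.asarray(col) for col in zip(*samples))
        batch = dc.data.NumpyDataset(X=X, y=y, w=w, ids=ids)
        # default_generator yields batches in the layout the model expects;
        # _prepare_batch converts them into torch tensors on the right device.
        generator = dc_model.default_generator(batch, epochs=1,
                                               deterministic=True)
        inputs, labels, weights = dc_model._prepare_batch(next(generator))
        return inputs, labels, weights

    return collate
```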
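
With those two pieces, the DataModule becomes a thin shell around plain DataLoaders: shuffling and batching are native, a predict loader exists, and Lightning injects the DistributedSampler itself. This is a sketch of the approach (pytorch_lightning 2.x assumed), not the merged DCLightningDataModule.

```python
import pytorch_lightning as pl
from torch.utils.data import DataLoader


class MinimalDCDataModule(pl.LightningDataModule):
    """Sketch: hand sampling, batching, and shuffling to DataLoader and
    expose both train and predict loaders."""

    def __init__(self, map_dataset, collate_fn, batch_size=32, num_workers=2):
        super().__init__()
        self.map_dataset = map_dataset    # map-style dataset (sketch above)
        self.collate_fn = collate_fn      # model-aware collate (sketch above)
        self.batch_size = batch_size
        self.num_workers = num_workers

    def train_dataloader(self):
        # Under DDP/FSDP the Trainer wraps this loader with a
        # DistributedSampler automatically; no manual sharding needed.
        return DataLoader(self.map_dataset, batch_size=self.batch_size,
                          shuffle=True, num_workers=self.num_workers,
                          collate_fn=self.collate_fn)

    def predict_dataloader(self):
        return DataLoader(self.map_dataset, batch_size=self.batch_size,
                          shuffle=False, num_workers=self.num_workers,
                          collate_fn=self.collate_fn)
```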
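
The model module wraps a DeepChem TorchModel so that both training and prediction run through Lightning. The attribute names below (model, optimizer, _loss_fn, _prepare_batch-style batches) come from DeepChem’s TorchModel; everything else, including the dispatch between native and HuggingFace-style inputs, is a simplified assumption about what DCLightningModule does, not its actual code.

```python
import torch
import pytorch_lightning as pl


class MinimalDCLightningModule(pl.LightningModule):
    """Sketch of wrapping a DeepChem TorchModel (native or HuggingFace-based)
    for Lightning-driven training and prediction."""

    def __init__(self, dc_model):
        super().__init__()
        self.dc_model = dc_model          # DeepChem TorchModel (or HF subclass)
        self.pt_model = dc_model.model    # underlying torch.nn.Module

    def configure_optimizers(self):
        # Reuse the optimizer definition attached to the DeepChem model.
        return self.dc_model.optimizer._create_pytorch_optimizer(
            self.pt_model.parameters())

    def _forward(self, inputs):
        # Native models take positional tensors; HuggingFace-based models
        # take a dict of tokenized tensors.
        if isinstance(inputs, dict):
            return self.pt_model(**inputs)
        return self.pt_model(*inputs)

    def training_step(self, batch, batch_idx):
        inputs, labels, weights = batch   # already prepared by the collate_fn
        outputs = self._forward(inputs)
        if isinstance(outputs, torch.Tensor):
            outputs = [outputs]
        loss = self.dc_model._loss_fn(outputs, labels, weights)
        self.log("train_loss", loss, sync_dist=True)
        return loss

    def predict_step(self, batch, batch_idx):
        inputs, _, _ = batch
        return self._forward(inputs)
```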
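
Finally, from the user’s side the whole stack stays behind the familiar fit/predict/restore API. The snippet below is a usage sketch only: the import path, the constructor keywords, and the choice of Delaney + MultitaskRegressor are illustrative; the real signature is in PR #4508.

```python
import deepchem as dc
from deepchem.models.torch_models import MultitaskRegressor
from deepchem.models.lightning import LightningTorchModel  # import path assumed

# Any DeepChem TorchModel should work; ECFP + MultitaskRegressor is used
# purely as a small, dependency-light example.
tasks, (train, valid, test), _ = dc.molnet.load_delaney(featurizer="ECFP")
dc_model = MultitaskRegressor(n_tasks=len(tasks), n_features=1024,
                              layer_sizes=[256], batch_size=64)

# Keyword arguments (strategy, devices, precision, ...) are illustrative.
model = LightningTorchModel(dc_model, batch_size=64, strategy="ddp",
                            devices=2, precision="16-mixed")

model.fit(train, nb_epoch=5)   # multi-GPU training
preds = model.predict(test)    # distributed-safe inference
model.restore()                # reload weights from the latest checkpoint
```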

3. Future Directions:

There’s so much more we can build on this foundation! Some of these possibilities include, but are not limited to:

  • Scaling Beyond Current Limits: Due to resource constraints, my validation was limited to 2x T4 GPUs on Kaggle. The next major milestone is testing and optimizing for larger GPU clusters and multi-node distributed systems. This will reveal new scaling challenges and optimization opportunities that aren’t apparent at smaller scales.

  • Multi-Billion Parameter Model Support: For truly massive models, we’ll need to implement auto-wrap policies for the FSDP strategy. These policies intelligently determine how to shard model parameters across devices, which becomes critical when dealing with multi-billion parameter language models (see the configuration sketch after this list).

  • Expanded Strategy Support: While I’ve validated DDP and FSDP strategies, Lightning supports many other distributed training approaches like DeepSpeed, which offers advanced memory optimization techniques. Integrating these additional strategies would give researchers more options for their specific use cases and hardware configurations.

  • Leveraging Lightning’s Full Ecosystem: Lightning provides a vast array of advanced features beyond basic distributed training - custom callbacks for model-specific monitoring, sophisticated mixed precision training modes, advanced profiling tools, and specialized logging systems. Integrating these capabilities would unlock even more potential for DeepChem researchers.
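
As a pointer for the auto-wrap direction, here is roughly what such a policy looks like with Lightning’s FSDP strategy. This is a minimal sketch assuming pytorch_lightning 2.x and a size-based policy; the parameter threshold and the wiring into LightningTorchModel are illustrative.

```python
import functools

from pytorch_lightning.strategies import FSDPStrategy
from torch.distributed.fsdp.wrap import size_based_auto_wrap_policy

# Shard every submodule above ~1M parameters instead of treating the whole
# network as a single FSDP unit; the threshold here is purely illustrative.
policy = functools.partial(size_based_auto_wrap_policy, min_num_params=1_000_000)

strategy = FSDPStrategy(auto_wrap_policy=policy)
# This strategy object would then be handed to the Lightning Trainer
# (or surfaced through LightningTorchModel) in place of the plain "fsdp" string.
```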

The foundation is set, and these enhancements would transform DeepChem into an even more powerful platform for large-scale molecular modeling research.

4. Contributions (PRs):

  • Lightning Utilities Implementation PR #4483 - Advanced collate functions and _TorchIndexDiskDataset implementation with comprehensive test cases for distributed training stability.

  • Lightning Dataset and Model Modules PR #4494 - Core DCLightningDataModule and DCLightningModule wrappers with predict loaders and full model ecosystem support, including extensive test coverage with pytest.

  • Lightning Trainer Integration PR #4508 - Complete LightningTorchModel wrapper class with unified API for distributed training, checkpoint management, and seamless DeepChem integration with test validation using pytest.

  • HuggingFace Compatibility Enhancements PR #4523 - Critical compatibility fixes and enhancements to ensure HuggingFace-based DeepChem models work seamlessly with the Lightning framework.

  • Lightning-DeepChem Checkpoint Compatibility PR #4527 - Ensures that checkpoints saved by Lightning’s backend remain compatible with DeepChem’s TorchModel base class, so models trained with Lightning can be loaded back into DeepChem for inference.

5. Challenges and Learnings:

  • Resolved GPU deadlocks in multi-GPU training by switching to map-style datasets.
  • Analyzed FSDP and DDP strategies for memory efficiency and simplicity.
  • Maintained compatibility between DeepChem and Lightning while enabling distributed training.
  • Validated solutions in real multi-GPU environments, ensuring robust development and testing.
  • Addressed CI issues and documented all implemented modules.

6. Acknowledgement:

I’m grateful to my mentors, Riya Singh, Aryan Amit Barsainyan and Bharath Ramsundar, for their generous support and thoughtful feedback throughout this project. It was an incredible learning experience, and I truly appreciate their patience in guiding me, even when I made plenty of silly mistakes. Beyond the technical work, I gained a deeper understanding of how open-source communities function, what good collaboration looks like, and the best practices for contributing meaningfully. I also ended up learning more chemistry than I ever did in any classroom, making this journey far richer than I had imagined.

Technical analysis slides: FSDP vs DDP Comparison
Weekly updates: DeepChem Forum
