Exploring Polymer Implementation in DeepChem: A GSoC Project Journey

Greetings, Fellow Developers,

I’m Debasish, and I’m thrilled to be part of GSoC 2024, contributing to DeepChem. My project centers on integrating polymer materials into the DeepChem architecture. It’s an exciting endeavor because it extends DeepChem’s fundamental representation of macromolecules, potentially paving the way for diverse model architectures and predictive capabilities.

This forum thread serves as a hub to monitor the progress of this project.

Thank you for your interest! Drop by occasionally for updates!


To learn more about the project, see the official GSoC page at the following link.

For the fundamentals of the graph representation, a sample implementation is available in a Colab notebook at the following link.


Progress Report: [ Week - 0 ( 27.05.2024 - 03.06.2024 ) ]

New Featurizers for Polymer Representation and Utilities for Graph-Based Featurizations!

Hi everyone,

I’m excited to share two new PRs that introduce tools for working with polymers and graph-based featurizations in DeepChem!

PR 1: Polymer Featurizer Base Class and Weighted Directed Graph Data Structure

This PR tackles the challenge of representing and featurizing polymers, which are large molecules with repeating structural units.

  • New PolymerFeaturizer Base Class: This abstract class provides a framework for converting different polymer representations into features. It supports both BigSMILES string representations and a novel Weighted Directed Graph representation (more on that below!). Subclasses can implement the _featurize method to handle specific feature calculations.
  • Introducing WeightedDirectedGraphData: This new class allows us to represent polymers as weighted directed graphs, capturing both the monomer structure and the distribution of different fragments within the polymer chain. This representation provides a more detailed and flexible way to encode polymer information.
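To make the two ideas above concrete, here is a minimal sketch of a weighted directed graph container and an abstract featurizer whose subclasses only implement _featurize. Every name in it (ToyWeightedDirectedGraph, ToyPolymerFeaturizer, CharacterChainFeaturizer) is hypothetical and chosen for illustration; it is not the code from the PR and it does not import DeepChem.

```python
from abc import ABC, abstractmethod
from dataclasses import dataclass

import numpy as np


@dataclass
class ToyWeightedDirectedGraph:
    """Hypothetical container; not DeepChem's WeightedDirectedGraphData."""
    node_features: np.ndarray  # shape (num_nodes, num_node_features)
    edge_index: np.ndarray     # shape (2, num_edges); rows are (source, target)
    edge_weights: np.ndarray   # shape (num_edges,); one weight per directed edge
    node_weights: np.ndarray   # shape (num_nodes,); one weight per node

    def __post_init__(self):
        # Basic shape checks so malformed graphs fail early.
        num_nodes = self.node_features.shape[0]
        if self.edge_index.shape[0] != 2:
            raise ValueError("edge_index must have shape (2, num_edges)")
        num_edges = self.edge_index.shape[1]
        if self.edge_weights.shape != (num_edges,):
            raise ValueError("need exactly one weight per directed edge")
        if self.node_weights.shape != (num_nodes,):
            raise ValueError("need exactly one weight per node")
        if num_edges and self.edge_index.max() >= num_nodes:
            raise ValueError("edge_index refers to a node that does not exist")


class ToyPolymerFeaturizer(ABC):
    """Hypothetical base class: subclasses only implement _featurize."""

    def featurize(self, polymers):
        return [self._featurize(polymer) for polymer in polymers]

    @abstractmethod
    def _featurize(self, polymer):
        ...


class CharacterChainFeaturizer(ToyPolymerFeaturizer):
    """Toy subclass: each character becomes a node in a one-way chain."""

    def _featurize(self, polymer):
        n = len(polymer)
        sources = np.arange(n - 1)
        return ToyWeightedDirectedGraph(
            node_features=np.ones((n, 1)),
            edge_index=np.stack([sources, sources + 1]),
            edge_weights=np.ones(n - 1),
            node_weights=np.full(n, 1.0 / n),
        )


graph = CharacterChainFeaturizer().featurize(["CCO"])[0]
print(graph.edge_index.shape, graph.node_weights)
```

The toy subclass ignores chemistry entirely; its only purpose is to show how the base class delegates the per-polymer work to _featurize.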

PR 2: Utility Functions for Graph-Based Featurizations

This PR focuses on providing helpful utility functions to streamline the process of building graph-based featurizers, particularly for polymers.

  • FeaturizationParameters Class: This class stores all the parameters needed for encoding atom and bond features, ensuring consistency and simplifying featurizer development.
  • Hydrogen Handling: The handle_hydrogen function provides fine-grained control over hydrogen addition and removal during molecule construction from SMILES strings.
  • Polymer-Specific Utilities: Functions like make_polymer_mol, parse_polymer_rules, and tag_atoms_in_repeating_unit offer specialized tools for building and manipulating polymer molecules from SMILES strings and rules describing their composition.
  • Feature Generation: The generate_atom_features and generate_bond_features functions provide standardized methods for generating feature vectors for atoms and bonds, respectively.
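As a rough illustration of how a shared parameter object keeps atom and bond encodings consistent, the sketch below one-hot encodes a few properties against fixed lists of allowed values. The names ToyFeaturizationParameters, toy_atom_features, and toy_bond_features, and the particular properties encoded, are assumptions made for this example; the real FeaturizationParameters, generate_atom_features, and generate_bond_features may be organized differently.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ToyFeaturizationParameters:
    """Hypothetical stand-in: the allowed values for each encoded property."""
    atomic_nums: tuple = (6, 7, 8, 9)  # C, N, O, F
    degrees: tuple = (0, 1, 2, 3, 4)
    bond_types: tuple = ("SINGLE", "DOUBLE", "TRIPLE", "AROMATIC")


def one_hot_with_unknown(value, choices):
    """One-hot encode value; the extra final slot marks 'not in choices'."""
    encoding = [0] * (len(choices) + 1)
    index = choices.index(value) if value in choices else len(choices)
    encoding[index] = 1
    return encoding


def toy_atom_features(atomic_num, degree, params):
    """Concatenate the one-hot encodings of element and degree."""
    return (one_hot_with_unknown(atomic_num, params.atomic_nums)
            + one_hot_with_unknown(degree, params.degrees))


def toy_bond_features(bond_type, in_ring, params):
    """One-hot bond type followed by a single ring-membership flag."""
    return one_hot_with_unknown(bond_type, params.bond_types) + [int(in_ring)]


params = ToyFeaturizationParameters()
carbon = toy_atom_features(6, 4, params)
bromine = toy_atom_features(35, 1, params)  # 35 is not listed, so it lands in the unknown slot
print(len(carbon), len(bromine))
```

Because every vector is built from the same parameter object, all atoms get feature vectors of the same length, including elements outside the listed set.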

Pointers for Review

I welcome your feedback and suggestions on these PRs! Let’s discuss how we can further enhance DeepChem’s capabilities for polymer modeling and graph-based featurizations.


Progress Report: [ Week - 1 ( 03.06.2024 - 10.06.2024 ) ]

DeepChem Improvements for Polymer Materials

This week has been productive for the DeepChem project with progress on two pull requests (PRs) and some local changes. Let’s dive into the details!

PR #3984: Docstring Mapping

  • This PR fixes the docstring mapping so that the correct documentation is associated with the relevant code, which improves clarity and maintainability for developers.

PR #3992: Weighted Directed Graph String Validator

  • This PR introduces a new class, PolyWDGStringValidator, specifically designed to validate the string format of weighted directed graph (WDG) data used for polymer materials. The validator ensures the data adheres to the expected format, improving data integrity and reducing potential errors during processing.
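The kind of check such a validator performs can be sketched as follows. This post does not spell out the WDG string format, so the layout below is an assumption: monomer SMILES joined by ".", then one fraction per monomer, then bonding rules, with the sections separated by "|" and each rule written as "<i-j:w_ij:w_ji". The function validate_wdg_string is hypothetical and is not the PolyWDGStringValidator API.

```python
import re

# One rule looks like "1-3:0.5:0.5": two attachment-point indices, two weights.
_RULE = re.compile(r"^(\d+)-(\d+):(\d*\.?\d+):(\d*\.?\d+)$")


def validate_wdg_string(text):
    """Return True, or raise ValueError if text breaks the assumed layout."""
    parts = text.split("|")
    if len(parts) < 3:
        raise ValueError("expected monomers, fractions and rules separated by '|'")
    monomers, fractions, rules = parts[0].split("."), parts[1:-1], parts[-1]

    if len(fractions) != len(monomers):
        raise ValueError("need exactly one fraction per monomer")
    try:
        values = [float(fraction) for fraction in fractions]
    except ValueError:
        raise ValueError("fractions must be numbers") from None
    if any(value < 0 for value in values):
        raise ValueError("fractions must be non-negative")

    # Attachment points are written as [*:n] inside the monomer SMILES.
    attachment_points = set(re.findall(r"\[\*:(\d+)\]", parts[0]))
    if not rules.startswith("<"):
        raise ValueError("rules section must start with '<'")
    for rule in rules[1:].split("<"):
        match = _RULE.match(rule)
        if match is None:
            raise ValueError(f"malformed rule: {rule!r}")
        for index in match.group(1, 2):
            if index not in attachment_points:
                raise ValueError(f"rule {rule!r} uses unknown attachment point {index}")
    return True


example = "[*:1]CC[*:2].[*:3]CO[*:4]|0.5|0.5|<1-3:0.5:0.5<2-4:0.5:0.5"
print(validate_wdg_string(example))
```

Running the format check before featurization means a malformed input fails with a clear message instead of producing a silently wrong graph.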

Local Changes

In addition to the changes in the PRs, there have been some local changes made on a separate branch. This branch was created by rebasing the previous branches from the two PRs mentioned above. Here’s a summary of the local changes:

  • PolymerFeaturizer Inheritance: The new class, WDPolyFeaturizer, inherits from the PolymerFeaturizer base class. This ensures compatibility with existing functionalities while providing specific features for polymer materials.
  • PolyWDGStringValidator Integration: The local changes integrate the PolyWDGStringValidator class introduced in PR #3992. This allows the WDPolyFeaturizer to leverage the validation functionality for WDG data.
  • Featurization Functions: Several functions are implemented within the WDPolyFeaturizer class to handle various aspects of featurizing polymer molecules, including atom featurization, bond featurization within monomers, and bond featurization between monomers.
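The split between bonds within monomers and bonds between monomers can be pictured with the small sketch below, which turns both kinds of bond into directed, weighted edges. The helper build_directed_edges is hypothetical, and giving bonds inside a monomer a weight of 1.0 in both directions is an assumption made for this example rather than something stated in this post.

```python
import numpy as np


def build_directed_edges(intra_bonds, inter_bonds):
    """Return (edge_index, edge_weights) for a weighted directed graph.

    intra_bonds: iterable of (i, j) atom pairs bonded inside one monomer.
    inter_bonds: iterable of (i, j, w_ij, w_ji) bonds that join monomers,
        where w_ij weights the edge i -> j and w_ji the edge j -> i.
    """
    sources, targets, weights = [], [], []
    for i, j in intra_bonds:
        # Assumption: bonds inside a monomer always exist, so weight 1.0 each way.
        sources += [i, j]
        targets += [j, i]
        weights += [1.0, 1.0]
    for i, j, w_ij, w_ji in inter_bonds:
        # Bonds between monomers carry the weights supplied by the rules.
        sources += [i, j]
        targets += [j, i]
        weights += [w_ij, w_ji]
    edge_index = np.array([sources, targets], dtype=int)
    return edge_index, np.array(weights, dtype=float)


# Monomer A has atoms 0 and 1, monomer B has atoms 2 and 3.
edge_index, edge_weights = build_directed_edges(
    intra_bonds=[(0, 1), (2, 3)],
    inter_bonds=[(1, 2, 0.5, 0.5)],
)
print(edge_index)
print(edge_weights)
```

Storing each bond as two directed edges is what lets the two directions of a bond between monomers carry different weights when the rules call for it.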

Overall, these changes contribute to a more robust and efficient workflow for handling polymer materials within DeepChem. The improvements in data validation and featurization will enhance the accuracy and reliability of the machine learning models working with these materials.

Progress Report: [ Week - 2 ( 10.06.2024 - 17.06.2024 ) ]

DeepChem - WDPolyFeaturizer Improvements

Hey everyone,

This week’s update focuses on the ongoing development of the WDPolyFeaturizer class within DeepChem. We’ve been working on improvements to enhance its functionality and make it more user-friendly. Here’s a breakdown of the key changes:

Local Enhancements:

  • Improved Code Clarity: We’ve added comments and explanations within the codebase, particularly for the _featurize method. This will make it easier for users to understand the logic behind the code and potentially customize it for their specific needs.
  • Unit Test Cases: We’ve added unit test cases in @test_poly_wd_featurizer.py to ensure the WDPolyFeaturizer class functions as expected. These tests verify core functionalities like feature shape, weight assignments, and mapping accuracy.

PR #3992: Documentation Updates:

  • Input/Output Notation: We’ve added clear explanations for the input and output notations used by the WDPolyFeaturizer class. This will help users understand what kind of data the class expects and what format the output takes.
  • Examples: We’ve included more detailed examples demonstrating how to use the WDPolyFeaturizer class with different types of data. This will provide a practical guide for users who want to integrate the class into their workflows.
  • Data Class Documentation: The documentation for the WeightedDirectedGraphData class has been enriched to provide a clearer understanding of the data structures used to represent the featurized graphs.

Progress Report: [ Week - 3 ( 17.06.2024 - 24.06.2024 ) ]

DeepChem Weekly Update: Polymer Featurizer Progress

This week has been productive as we continue development on the polymer featurizer! We’ve made significant progress by splitting up the larger pull requests (PRs) into more manageable chunks, focusing on modular changes.

Here’s a breakdown of what we accomplished:

Modularized PRs:

  • We’ve broken down the initial, bulk PRs into four separate ones, each focusing on a specific aspect of the polymer featurizer:
    • PR #4016 (Under Review): Introduces the base polymer featurizer with unit test cases for verification.
    • PR #4017 (Under Review): Implements weighted directed data classes with comprehensive documentation.
    • PR #4020 (Under Review): Adds the weighted directed data validator class and splits its unit tests into simpler cases.
    • PR #4021 (Under Review): Adds utility functions for creating weighted directed graph data.

Additional Improvements:

  • PR #4023 (Merged): Successfully fixed the pytest issues in unit tests by adapting the codebase to the newer XGBoost version.

Local Changes:

  • Continued development on a Colab notebook demonstrating the end-to-end implementation of the polymer featurizer. (In Progress)

We’re excited about this progress and believe this modular approach will streamline further development and code review. We welcome any feedback or questions you may have about the ongoing work. Stay tuned for next week’s update!

Progress Report: [ Week - 4 ( 24.06.2024 - 01.07.2024 ) ]

Hey everyone,

This week’s update brings exciting progress: a showcase that applies a GCN to predict the ionization potential of conductive polymers using the weighted directed graph featurization. Here’s a breakdown of the progress made:

Pull Request Status:

  • PR #4016: We’ve successfully rebased the pull request. This PR is currently under review and requires approval before merging.

Local Progress:

  • GCN Implementation (Almost Done!): Our local implementation for using Graph Featurization with GCNs for predicting ionization potential is nearing completion. This will be a crucial component showcasing the application in analyzing conductive polymers for biosensing and biofuel applications.
  • Polymer Series Tutorial - Round 2: We’ve drafted a second tutorial in our Polymer Series, diving deep into the weighted directed graph featurization method. This tutorial includes visualizations and examples to enhance clarity. The pull request for this tutorial is almost ready for submission.
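For readers new to graph convolutions, the GCN item above can be pictured with the NumPy sketch below: one message-passing step over a weighted directed graph, followed by mean pooling and a linear readout that produces a single number per polymer. This is not the notebook's model. The function names are hypothetical, the weights are random and untrained, and the aggregation is a simple row-normalised average rather than the symmetric normalisation of the original GCN formulation.

```python
import numpy as np


def weighted_graph_conv(node_features, edge_index, edge_weights, weight_matrix):
    """One graph-convolution step using a row-normalised weighted adjacency."""
    num_nodes = node_features.shape[0]
    adjacency = np.zeros((num_nodes, num_nodes))
    # adjacency[target, source] accumulates the weight of each directed edge.
    np.add.at(adjacency, (edge_index[1], edge_index[0]), edge_weights)
    adjacency += np.eye(num_nodes)  # self-loops keep each node's own features
    incoming = adjacency.sum(axis=1, keepdims=True)  # >= 1 for non-negative weights
    aggregated = (adjacency / incoming) @ node_features
    return np.maximum(aggregated @ weight_matrix, 0.0)  # ReLU


def predict_property(node_features, edge_index, edge_weights, w_hidden, w_out):
    """Graph conv, then mean pooling over nodes, then a linear readout."""
    hidden = weighted_graph_conv(node_features, edge_index, edge_weights, w_hidden)
    graph_embedding = hidden.mean(axis=0)
    return float(graph_embedding @ w_out)


rng = np.random.default_rng(0)
features = rng.normal(size=(4, 3))              # 4 atoms, 3 features each
edges = np.array([[0, 1, 1, 2, 2, 3],           # sources
                  [1, 0, 2, 1, 3, 2]])          # targets
weights = np.array([1.0, 1.0, 0.5, 0.5, 1.0, 1.0])
prediction = predict_property(features, edges, weights,
                              rng.normal(size=(3, 8)), rng.normal(size=8))
print(prediction)  # meaningless until the weights are trained
```

In a real workflow the two weight arrays would be learned from measured ionization potentials; the sketch only shows how edge weights enter the aggregation.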

Progress Report: [ Week - 5 ( 01.07.2024 - 08.07.2024 ) ]

Hi everyone!

I’m excited to share some progress on my GSoC project.

Merged Pull Requests:

  • PR #4016: This pull request was successfully merged on July 3rd!

Pull Requests Under Review:

  • PR #4017, #4020, #4021: I’ve rebased these pull requests and they’re now ready for your review. Please take a look when you have a chance!
  • PR #4037: This pull request adds a tutorial notebook on weighted directed graph featurization for polymers. The notebook covers the basics of polymer chain generation and introduces simple applications.

Local Changes:

  • I’ve completed the Colab implementation using Graph Featurization with GCN to predict conductive polymers’ ionization potential. The implementation is now under review (PR not yet submitted).
  • I’ve submitted my progress report for the GSoC mid-term evaluation.