Exploring Polymer Implementation in DeepChem: A GSoC Project Journey

Greetings, Fellow Developers,

I’m Debasish, and I’m thrilled to be part of GSoC 2024, contributing to DeepChem. My project centers on integrating polymer materials into the DeepChem architecture. It’s an exciting endeavor because it extends DeepChem’s fundamental representation of macromolecules, potentially paving the way for diverse model architectures and predictive capabilities.

This forum thread serves as a hub to monitor the progress of this project.

Thank you for your interest! Drop by occasionally for updates!

To learn more about the project, see the official GSoC page at the following link.

For a basic implementation of graph representations, see the sample Colab notebook at the following link.

https://colab.research.google.com/drive/18TDH7fkTKPMlEQ1jXugurcqQ2vSgswG_?usp=sharing

Progress Report [ Week - 0 (27.05.2024 - 03.06.2024) ]

New Featurizers for Polymer Representation and Utilities for Graph-Based Featurizations!

Hi everyone,

I’m excited to share two new PRs that introduce tools for working with polymers and graph-based featurizations in DeepChem!

PR 1: Polymer Featurizer Base Class and Weighted Directed Graph Data Structure

This PR tackles the challenge of representing and featurizing polymers, which are large molecules with repeating structural units.

  • New PolymerFeaturizer Base Class: This abstract class provides a framework for converting different polymer representations into features. It supports both BigSMILES string representations and a novel Weighted Directed Graph representation (more on that below!). Subclasses can implement the _featurize method to handle specific feature calculations.
  • Introducing WeightedDirectedGraphData : This new class allows us to represent polymers as weighted directed graphs, capturing both the monomer structure and the distribution of different fragments within the polymer chain. This representation provides a more detailed and flexible way to encode polymer information.
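To make the two ideas above concrete, here is a minimal, self-contained sketch of the pattern. It does not use the DeepChem classes themselves: ToyPolymerFeaturizer, ToyWeightedDirectedGraph, and MonomerCountFeaturizer are hypothetical names invented for illustration, and the actual signatures in the PRs may differ.

```python
from abc import ABC, abstractmethod
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class ToyWeightedDirectedGraph:
    """Hypothetical stand-in for a weighted directed graph container."""
    node_features: List[List[float]]  # one feature vector per node
    edges: List[Tuple[int, int]]      # directed (source, target) pairs
    edge_weights: List[float]         # one weight per directed edge

    def __post_init__(self) -> None:
        # Basic consistency checks between edges, weights, and nodes.
        if len(self.edges) != len(self.edge_weights):
            raise ValueError("each edge needs exactly one weight")
        n = len(self.node_features)
        for src, dst in self.edges:
            if not (0 <= src < n and 0 <= dst < n):
                raise ValueError(f"edge ({src}, {dst}) references a missing node")


class ToyPolymerFeaturizer(ABC):
    """Hypothetical base class: subclasses only implement _featurize."""

    def featurize(self, datapoints: List[str]) -> list:
        return [self._featurize(d) for d in datapoints]

    @abstractmethod
    def _featurize(self, datapoint: str):
        ...


class MonomerCountFeaturizer(ToyPolymerFeaturizer):
    """Toy subclass: counts '.'-separated fragments in the input string."""

    def _featurize(self, datapoint: str) -> List[float]:
        return [float(len(datapoint.split(".")))]
```

The base class owns the loop over datapoints, so each subclass stays focused on a single representation.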

PR 2: Utility Functions for Graph-Based Featurizations

This PR focuses on providing helpful utility functions to streamline the process of building graph-based featurizers, particularly for polymers.

  • FeaturizationParameters Class: This class stores all the parameters needed for encoding atom and bond features, ensuring consistency and simplifying featurizer development.
  • Hydrogen Handling: The handle_hydrogen function provides fine-grained control over hydrogen addition and removal during molecule construction from SMILES strings.
  • Polymer-Specific Utilities: Functions like make_polymer_mol , parse_polymer_rules , and tag_atoms_in_repeating_unit offer specialized tools for building and manipulating polymer molecules from SMILES strings and rules describing their composition.
  • Feature Generation: The generate_atom_features and generate_bond_features functions provide standardized methods for generating feature vectors for atoms and bonds, respectively.
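As a rough illustration of how a shared parameters object can keep atom encodings consistent, here is a small sketch in plain Python. The names ToyFeaturizationParameters, one_hot_with_unknown, and toy_atom_features are hypothetical, and the chosen feature lists are assumptions, not the values used in the PR.

```python
from dataclasses import dataclass
from typing import List, Sequence


@dataclass(frozen=True)
class ToyFeaturizationParameters:
    """Hypothetical parameter holder; the value lists are illustrative."""
    atomic_nums: Sequence[int] = (6, 7, 8)      # C, N, O
    degrees: Sequence[int] = (0, 1, 2, 3, 4)


def one_hot_with_unknown(value: int, choices: Sequence[int]) -> List[int]:
    """One-hot encode `value`; the extra last slot marks unknown values."""
    encoding = [0] * (len(choices) + 1)
    index = list(choices).index(value) if value in choices else len(choices)
    encoding[index] = 1
    return encoding


def toy_atom_features(atomic_num: int, degree: int,
                      params: ToyFeaturizationParameters) -> List[int]:
    """Concatenate the one-hot encodings of atomic number and degree."""
    return (one_hot_with_unknown(atomic_num, params.atomic_nums)
            + one_hot_with_unknown(degree, params.degrees))


params = ToyFeaturizationParameters()
nitrogen = toy_atom_features(7, 2, params)   # known atom, known degree
sulfur = toy_atom_features(16, 2, params)    # atom type not in the list
```

Because every feature function reads from the same parameters object, the feature vector length stays fixed regardless of the input atom.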

Pointers for Review

I welcome your feedback and suggestions on these PRs! Let’s discuss how we can further enhance DeepChem’s capabilities for polymer modeling and graph-based featurizations.

Progress Report: [ Week - 1 ( 03.06.2024 - 10.06.2024 ) ]

DeepChem Improvements for Polymer Materials

This week has been productive for the DeepChem project with progress on two pull requests (PRs) and some local changes. Let’s dive into the details!

PR #3984: Docstring Mapping

  • This PR addresses docstring mapping, ensuring that the proper documentation is associated with the relevant code. This improves code clarity and maintainability for developers.

PR #3992: Weighted Directed Graph String Validator

  • This PR introduces a new class, PolyWDGStringValidator, specifically designed to validate the string format of weighted directed graph (WDG) data used for polymer materials. This validator ensures the data adheres to the expected format, improving data integrity and reducing potential errors during processing.
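For a sense of what such validation can look like, here is a minimal sketch. It assumes a string layout of the form monomers|fraction|...|<i-j:w_ij:w_ji, where the monomers are dot-separated SMILES fragments; that layout, the function name toy_validate_wdg_string, and the specific checks are assumptions for illustration and are not taken from PolyWDGStringValidator.

```python
import re

# One rule looks like "1-3:0.5:0.5": two attachment indices and two weights.
RULE_PATTERN = re.compile(r"^\d+-\d+:\d*\.?\d+:\d*\.?\d+$")


def toy_validate_wdg_string(text: str) -> bool:
    """Return True if `text` matches the assumed layout, else raise ValueError."""
    parts = text.split("|")
    if len(parts) < 3:
        raise ValueError("expected monomers, fractions, and rules separated by '|'")

    monomers = parts[0].split(".")
    fractions = parts[1:-1]
    rules_section = parts[-1]

    if len(fractions) != len(monomers):
        raise ValueError("need exactly one fraction per monomer")
    for fraction in fractions:
        float(fraction)  # raises ValueError when the fraction is not numeric

    if not rules_section.startswith("<"):
        raise ValueError("rules section must start with '<'")
    for rule in filter(None, rules_section.split("<")):
        if not RULE_PATTERN.match(rule):
            raise ValueError(f"malformed rule: {rule!r}")
    return True


example = "[*:1]CC[*:2].[*:3]CO[*:4]|0.5|0.5|<1-3:0.5:0.5<2-4:0.5:0.5"
is_valid = toy_validate_wdg_string(example)
```

Failing early with a descriptive error keeps malformed strings from reaching the featurizer.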

Local Changes

In addition to the changes in the PRs, there have been some local changes made on a separate branch. This branch was created by rebasing the previous branches from the two PRs mentioned above. Here’s a summary of the local changes:

  • PolymerFeaturizer Inheritance: The new class, WDPolyFeaturizer , inherits from the PolymerFeaturizer base class. This ensures compatibility with existing functionalities while providing specific features for polymer materials.
  • PolyWDGStringValidator Integration: The local changes integrate the PolyWDGStringValidator class introduced in PR #3992. This allows the WDPolyFeaturizer to leverage the validation functionality for WDG data.
  • Featurization Functions: Several functions are implemented within the WDPolyFeaturizer class to handle various aspects of featurizing polymer molecules, including atom featurization, bond featurization within monomers, and bond featurization between monomers.
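The bonds between monomers are where the directed weights come in. The sketch below shows one way to turn connection rules into directed, weighted edges. It assumes each rule has the form <i-j:w_ij:w_ji, with one weight per direction; the function name toy_rules_to_directed_edges is hypothetical and this is not the WDPolyFeaturizer code.

```python
from typing import Dict, Tuple


def toy_rules_to_directed_edges(rules: str) -> Dict[Tuple[int, int], float]:
    """Map each rule '<i-j:w_ij:w_ji' to two directed edges with their weights."""
    edges: Dict[Tuple[int, int], float] = {}
    for rule in filter(None, rules.split("<")):
        pair, weight_forward, weight_reverse = rule.split(":")
        i, j = (int(index) for index in pair.split("-"))
        edges[(i, j)] = float(weight_forward)   # edge i -> j
        edges[(j, i)] = float(weight_reverse)   # edge j -> i
    return edges


edges = toy_rules_to_directed_edges("<1-2:0.25:0.75<3-4:0.5:0.5")
```

Storing the two directions separately is what lets the graph encode asymmetric connection probabilities between attachment points.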

Overall, these changes contribute to a more robust and efficient workflow for handling polymer materials within DeepChem. The improvements in data validation and featurization will enhance the accuracy and reliability of the machine learning models working with these materials.

Progress Report: [ Week - 2 ( 10.06.2024 - 17.06.2024 ) ]

DeepChem - WDPolyFeaturizer Improvements

Hey everyone,

This week’s update focuses on the ongoing development of the WDPolyFeaturizer class within DeepChem. We’ve been working on improvements to enhance its functionality and make it more user-friendly. Here’s a breakdown of the key changes:

Local Enhancements:

  • Improved Code Clarity: We’ve added comments and explanations within the codebase, particularly for the _featurize method. This will make it easier for users to understand the logic behind the code and potentially customize it for their specific needs.
  • Unit Test Cases: We’ve added unit test cases in test_poly_wd_featurizer.py to ensure the WDPolyFeaturizer class functions as expected. These tests verify core functionality such as feature shape, weight assignments, and mapping accuracy.

PR #3992: Documentation Updates:

  • Input/Output Notation: We’ve added clear explanations for the input and output notations used by the WDPolyFeaturizer class. This will help users understand what kind of data the class expects and what format the output takes.
  • Examples: We’ve included more detailed examples demonstrating how to use the WDPolyFeaturizer class with different types of data. This will provide a practical guide for users who want to integrate the class into their workflows.
  • Data Class Documentation: The documentation for the WeightedDirectedGraphData class has been enriched to provide a clearer understanding of the data structures used to represent the featurized graphs.

Progress Report: [ Week - 3 ( 17.06.2024 - 24.06.2024 ) ]

DeepChem Weekly Update: Polymer Featurizer Progress

This week has been productive as we continue development on the polymer featurizer! We’ve made significant progress by splitting up the larger pull requests (PRs) into more manageable chunks, focusing on modular changes.

Here’s a breakdown of what we accomplished:

Modularized PRs:

  • We’ve broken down the initial, bulk PRs into four separate ones, each focusing on a specific aspect of the polymer featurizer:
    • PR #4016 (Under Review): Introduces the base polymer featurizer with unit test cases for verification.
    • PR #4017 (Under Review): Implements weighted directed data classes with comprehensive documentation.
    • PR #4020 (Under Review): Adds the weighted directed data validator class and splits its unit tests into simpler cases.
    • PR #4021 (Under Review): Adds utility functions for creating weighted directed graph data.

Additional Improvements:

  • PR #4023 (Merged): Successfully fixed the pytest issues in unit tests by adapting the codebase to the newer XGBoost version. (:white_check_mark:)

Local Changes:

  • Continued development on a Colab notebook demonstrating the end-to-end implementation of the polymer featurizer. (In Progress)

We’re excited about this progress and believe this modular approach will streamline further development and code review. We welcome any feedback or questions you may have about the ongoing work. Stay tuned for next week’s update!

Progress Report: [ Week - 4 ( 24.06.2024 - 01.07.2024 ) ]

Hey everyone,

This week’s update brings exciting progress: a showcase of GCNs applied to predicting the ionization potential of conductive polymers using the weighted directed graph featurization. Here’s a breakdown of the progress made:

Pull Request Status:

  • PR #4016: We’ve successfully rebased the pull request. This PR is currently under review and requires approval before merging.

Local Progress:

  • GCN Implementation (Almost Done!): Our local implementation for using Graph Featurization with GCNs for predicting ionization potential is nearing completion. This will be a crucial component showcasing the application in analyzing conductive polymers for biosensing and biofuel applications.
  • Polymer Series Tutorial - Round 2: We’ve drafted a second tutorial in our Polymer Series, diving deep into the weighted directed graph featurization method. This tutorial includes visualizations and examples to enhance clarity. The pull request for this tutorial is almost ready for submission.
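To give a feel for how edge weights enter a graph model like the GCN work mentioned above, here is a toy sketch of one round of weighted neighbor aggregation. It is a simplification written for this post: it leaves out the learned weight matrices, normalization, and nonlinearity of a real GCN layer, and the function name weighted_aggregate is hypothetical.

```python
from typing import Dict, List, Tuple


def weighted_aggregate(
    node_features: List[List[float]],
    edge_weights: Dict[Tuple[int, int], float],
) -> List[List[float]]:
    """Compute h_v' = h_v + sum over edges (u -> v) of w_uv * h_u."""
    updated = [list(features) for features in node_features]
    for (src, dst), weight in edge_weights.items():
        for k, value in enumerate(node_features[src]):
            updated[dst][k] += weight * value
    return updated


features = [[1.0, 0.0], [0.0, 2.0]]
weights = {(0, 1): 0.5, (1, 0): 0.25}
updated = weighted_aggregate(features, weights)
```

Each node keeps its own features and adds its neighbors’ features scaled by the weight of the incoming edge, so the two directions of a bond can contribute differently.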

Progress Report: [ Week - 5 ( 01.07.2024 - 08.07.2024 ) ]

Hi everyone!

I’m excited to share some progress on my GSoC project.

Merged Pull Requests:

  • PR #4016: This pull request was successfully merged on July 3rd! (:white_check_mark:)

Pull Requests Under Review:

  • PR #4017, #4020, #4021: I’ve rebased these pull requests and they’re now ready for your review. Please take a look when you have a chance!
  • PR #4037: This pull request explores the concept of weighted directed graph featurization for polymers. The tutorial notebook explores the basic understanding of polymer chain generation with an introduction to simple applications.

Local Changes:

  • I’ve completed the Colab implementation using Graph Featurization with GCN to predict conductive polymers’ ionization potential. The implementation is now under review (PR not yet submitted).
  • I’ve submitted my progress report for the GSoC mid-term evaluation.

Progress Report: [ Week - 6 ( 08.07.2024 - 15.07.2024 ) ]

Hey everyone,

This week’s update brings some progress on several fronts! Let’s dive into the details:

Pull Requests:

  • PR #4017 (Merged!) - We’re happy to announce that this pull request was merged on July 8th! (:white_check_mark:) You can now use the WeightedDirectedGraphData class to load, validate, and work with the weighted directed graph data type with ease. Refer to the documentation for more details.

  • PR #4020 (Under Review) - Thanks to the admin and reviewers for their feedback on adding more details about the string representation of the data. We’ve incorporated these suggestions by adding more descriptions, citations, and references in both the code and documentation.

  • PR #4021 (Under Review) - This PR has been rebased and is ready for your review!

  • PR #4035 (Status Update) - We received valuable feedback that the tutorial was too complex and lacked real-world application references. Based on this, we’ll likely remove this PR and create simpler, more practical tutorials instead.

Local Changes:

  • Mid-Term Evaluation Complete! - We’ve successfully completed the mid-term evaluation for our ongoing project.

Looking Ahead:

For the upcoming week, we’ll be focusing on creating a series of simple, conceptual tutorials. These tutorials will serve as a foundation for later, more technical tutorials that reference the conceptual ones for a smoother learning experience.

Progress Report: [ Week - 7 ( 15.07.2024 - 22.07.2024 ) ]

Hey everyone,

This week has been productive with progress on several key areas. Let’s dive into the details:

Pull Requests (PRs):

  • PR #4020 (Status: Under Review):
    • Trimmed documentation and kept it concise (under 70 characters).
    • Added detailed documentation for private methods within the validator class.
    • Incorporated Rakshit’s feedback and revised the changes accordingly.
  • PR #4021 (Status: Under Review):
    • Received LGTM (Looks Good To Me) from Rakshit! :white_check_mark:
  • PR #4035 (Status: Under Review):
    • Looking forward to finalizing this PR and removing it after discussion with my mentor.

Local Changes:

  • Gearing up to raise the main featurizer PR! This is an exciting step.

Progress Report: [ Week - 8 ( 22.07.2024 - 29.07.2024 ) ]

Progress Status

This week has been productive, with several key milestones achieved and a clear roadmap for the upcoming week. Here’s a detailed breakdown of the progress and plans:

Pull Requests (PRs) Under Review:

  • PR #4020 (Status: Under Review)
    • Received feedback to add the citation inside the docstring as well as in the .rst file.
  • PR #4021 (Status: Under Review)
    • Still needs to be validated with a working implementation.
  • PR #4035 (Status: Under Review)
    • Discussed with mentor Shreyas to simplify the tutorial.
    • Need to split the PR into two parts to separately show implementation for graph formation and featurization.

Local Changes:

  • Graph Formation Tutorial
    • The tutorial for graph formation is almost ready to be raised as a PR.
  • Crystallization Tendency Regression Tutorial
    • Developed a tutorial on crystallization tendency regression for drug delivery applications.

Development Plan for Upcoming Week:

  • Raise several PRs on polymer tutorials, including basic theories and implementation showcases.

Progress Report: [ Week - 9 ( 29.07.2024 - 05.08.2024 ) ]

PR Status

  • PRs #4020 and #4021: These PRs are under review and will be revisited later.
  • PR #4035: Currently under review. The tutorial content is being reorganized into smaller, more focused tutorials.
  • PR #4070: Merged. This PR adds a tutorial that uses DeepChem models to predict polymer crystallization tendency. :white_check_mark:
  • PR #4073: Under review. This PR, originally part of #4035, now focuses on explaining weighted directed graphs.

Local Progress

  • Progress is being made on the PSMILES tutorial.

Upcoming Week’s Focus

  • In-depth exploration of polyBERT to create relevant tutorials and explanations for DeepChem.

Progress Report: [ Week - 10 ( 05.08.2024 - 12.08.2024 ) ]

Progress Highlights:

This week has been productive for the polymer project. We’ve successfully merged two significant pull requests:

  • PR #4073: The tutorial on Weighted-Directed Graph implementation for polymers is now live. This tutorial will be a valuable resource for understanding how to represent polymers using graph structures. :white_check_mark:
  • PR #4097: The tutorial on PSMILES, their formation, and tokenization has also been merged. This provides a solid foundation for working with polymer SMILES representations. :white_check_mark:

Additionally, we’ve initiated a deep dive into the PolyBERT research paper to lay the groundwork for the next phase of development.

Upcoming Focus:

Next week, we will concentrate on developing a comprehensive tutorial demonstrating how to implement PolyBERT using the Hugging Face model. This tutorial will be instrumental in applying advanced language models to the realm of polymer science.

Progress Report: [ Week - 11 ( 12.08.2024 - 19.08.2024 ) ]

This week has been productive for the polymer project. We have worked on resolving the image rendering issue in one of our tutorials. We are also working on developing a tutorial on PolyBERT.

Merged PR:

  • PR #4093: The image rendering issue in polymer tutorials using GitHub image paths has been resolved. :white_check_mark:

Local Progress:

  • Significant progress has been made on similarity estimation of polymer molecules using PolyBERT’s latent space.
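Similarity in a latent space is commonly measured with cosine similarity. The sketch below assumes each polymer has already been embedded as a plain vector of floats. The vectors here are made-up placeholders, not real PolyBERT outputs, and this is not necessarily the metric the final tutorial will use.

```python
import math
from typing import Sequence


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine of the angle between two equal-length, non-zero vectors."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        raise ValueError("cosine similarity is undefined for zero vectors")
    return dot / (norm_a * norm_b)


# Placeholder embeddings for illustration only.
polymer_a = [1.0, 0.0, 0.0]
polymer_b = [1.0, 0.0, 0.0]
polymer_c = [0.0, 1.0, 0.0]

same = cosine_similarity(polymer_a, polymer_b)
different = cosine_similarity(polymer_a, polymer_c)
```

Values close to 1 indicate embeddings pointing in nearly the same direction, while values near 0 indicate unrelated ones.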