Graph data from a pandas dataframe of chemical formulas

Fede · January 22, 2021, 9:36am

Hello everybody! First post here, just a brief introduction:

My name is Federico and I’m a Computer Science PhD student dealing with Machine Learning in the field of Material Science.

Led by a recent interest in GNNs, I’ve started looking into useful toolboxes/APIs for dealing with graph data structures.

Basically, what I’m trying to understand for now is this: if I start with a typical pandas dataframe containing several chemical formulas (strings), is it possible to convert them into a graph data format and then featurize these graphs, for example to get node features for use in a Graph Convolutional Network? If so, could you give me some suggestions on how to do that?

For now (outside of the GNN context) I’ve just done some practice with the Matminer toolbox, converting formula strings into ‘Composition’ objects and then featurizing them to obtain a variety of chemical/compositional/physical properties, but I’m not sure at all how to do this for graph-type data.

Thank you all,

Fede

bharath · January 23, 2021, 12:05am

Welcome Fede!

You should be able to use DeepChem featurizers directly on pandas columns. So something like

import deepchem as dc

feat = dc.feat.ConvMolFeaturizer()
convmols = feat.featurize(df["smiles"])

Note that there isn’t a standard graph conv data format. ConvMolFeaturizer, for example, only works with dc.models.GraphConvModel. I’d recommend checking out our tutorial series for more tips/tricks. This tutorial might be useful in particular:

https://github.com/deepchem/deepchem/blob/master/examples/tutorials/06_Introduction_to_Graph_Convolutions.ipynb


Fede · January 25, 2021, 11:27am

Hello! Thanks a lot for your kind answer! I have tried to follow your procedure (also looked at the interesting tutorial you have suggested). Anyway, I think that I’m missing some important steps before getting to it. I’ll explain my situation better with an example. My dataset looks like this:

formula    Band_gap(eV)
NaCl       3.5
NO2        1.4
He         1.2
ZnO        4.5
...        ...

This is a simple pandas dataframe with two columns (‘formula’ and ‘Band_gap(eV)’). I would like to perform regression here, relating the different compositions to band gap values.
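Each entry in the formula column can be split into element counts before any featurization. The following is a minimal sketch using only the Python standard library; `parse_formula` and `fractions` are hypothetical helper names, and the parser assumes flat formulas without parentheses, charges, or hydrates.

```python
import re
from collections import Counter

# An element symbol (capital letter plus optional lowercase letter),
# followed by an optional integer or decimal amount.
_TOKEN = re.compile(r"([A-Z][a-z]?)(\d+(?:\.\d+)?)?")


def parse_formula(formula):
    """Return {element: amount} for a flat formula such as 'Fe2O3'."""
    counts = Counter()
    pos = 0
    for match in _TOKEN.finditer(formula):
        if match.start() != pos:
            # Something between two tokens was not understood, e.g. '('.
            raise ValueError(f"cannot parse {formula!r} at position {pos}")
        symbol, amount = match.groups()
        counts[symbol] += float(amount) if amount else 1.0
        pos = match.end()
    if pos != len(formula) or not counts:
        raise ValueError(f"cannot parse {formula!r}")
    return dict(counts)


def fractions(counts):
    """Convert absolute amounts into fractional amounts that sum to 1."""
    total = sum(counts.values())
    return {element: amount / total for element, amount in counts.items()}


if __name__ == "__main__":
    for formula in ["NaCl", "NO2", "He", "ZnO"]:
        print(formula, fractions(parse_formula(formula)))
```

With a dataframe like this one, something along the lines of `df["formula"].map(parse_formula)` would apply the parser to every row. Note that this only checks the shape of each symbol, not whether it is a real element.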

I guess that I should convert the chemical formulas in the ‘formula’ column into these ‘smiles’ objects, which I still haven’t fully figured out (maybe I should look at some RDKit guidelines).
Is that correct? If so, could you suggest how to do that?

So thankful for the help!

bharath · January 25, 2021, 11:41pm

Ah that makes a lot more sense, thank you! Ok, so in this case, some of your compounds wouldn’t have an easy transformation into smiles strings (like NaCl). This dataset seems a little more like a materials science dataset than a usual chemoinformatic dataset.

@ncfrey Would you have any recommendations on how to process this type of dataset? I’m not sure if pymatgen might have some good tools that @Fede could use

ncfrey · January 26, 2021, 1:21pm

Hi @Fede, to get a graph representation of a material you can use the CGCNN featurizer. However, the graph representation requires 3D coordinates (in the form of a pymatgen Structure). If you only have compositions without a structure, you probably want to use something like ElementPropertyFingerprint. Hope this helps!
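As a rough illustration of what a composition-only fingerprint looks like, the sketch below turns element amounts into a fixed-length vector by taking the fraction-weighted mean of a few elemental properties. This is not how ElementPropertyFingerprint is implemented: `ELEMENT_DATA` and `composition_features` are hypothetical names, and the tiny lookup table (atomic number and standard atomic weight) is an assumption made only for this example.

```python
# Illustrative lookup table: element -> (atomic number, standard atomic weight).
ELEMENT_DATA = {
    "He": (2, 4.0026),
    "N": (7, 14.007),
    "O": (8, 15.999),
    "Na": (11, 22.990),
    "Cl": (17, 35.45),
    "Zn": (30, 65.38),
}


def composition_features(counts):
    """Return the fraction-weighted mean of each property in ELEMENT_DATA.

    `counts` maps element symbols to amounts, e.g. {"N": 1, "O": 2}.
    The result has one entry per property, regardless of how many
    elements the compound contains.
    """
    total = sum(counts.values())
    n_props = len(next(iter(ELEMENT_DATA.values())))
    features = [0.0] * n_props
    for element, amount in counts.items():
        for i, value in enumerate(ELEMENT_DATA[element]):
            features[i] += (amount / total) * value
    return features


if __name__ == "__main__":
    compounds = {
        "NaCl": {"Na": 1, "Cl": 1},
        "NO2": {"N": 1, "O": 2},
        "He": {"He": 1},
        "ZnO": {"Zn": 1, "O": 1},
    }
    for name, counts in compounds.items():
        print(name, composition_features(counts))
```

Because every compound maps to a vector of the same length, the output can be fed to any standard regressor; no graph structure is involved.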

JanoschMenke · January 28, 2021, 12:56pm

Hi @Fede

You have probably figured it out by now, but maybe this helps a little. SMILES is a specific format for storing molecular structures -> https://en.wikipedia.org/wiki/Simplified_molecular-input_line-entry_system

Most chemistry toolkits can read and write SMILES, and because a molecule fits on a single line, it is a frequently used format to “store molecules”.
With RDKit you can check whether your compounds are valid SMILES:
from rdkit import Chem
Chem.MolFromSmiles("NaCl")
will fail (RDKit logs a parse error and returns None instead of a molecule), and the correct SMILES would be:

Chem.MolFromSmiles("[Na+].[Cl-]")

The easiest method (I found) to generate SMILES from generic compound names is to use cactus.nci.nih.gov.
Not my code; source: https://stackoverflow.com/questions/54930121/converting-molecule-name-to-smiles

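from urllib.parse import quote
from urllib.request import urlopen

def CIRconvert(ids):
    # Ask the NCI/CADD Chemical Identifier Resolver for the SMILES of a name or formula.
    try:
        # quote() escapes spaces and other characters that are not URL-safe.
        url = 'http://cactus.nci.nih.gov/chemical/structure/' + quote(ids) + '/smiles'
        return urlopen(url).read().decode('utf8')
    except OSError:
        # Covers network failures and HTTP errors returned by the server.
        return 'Did not work'

identifiers = ['NaCl', 'NO2', 'He', 'ZnO']

for ids in identifiers:
    print(ids, CIRconvert(ids))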

>>>>
NaCl [Na+].[Cl-]
NO2 O=[N+]=O
He [He]
ZnO O|[Zn++]|O

Fede · January 28, 2021, 4:16pm

Many thanks everyone for the precious help. Now I start to see things a bit more clearly.

Basically, when doing Machine Learning from a materials science perspective, we typically have datasets that report the compositions of the materials (formulas) and pay little attention to structural information. Also, if we want to potentially discover new materials (as in my case), we might not even know their crystal structures, so we rely on the composition alone. (Is this correct?)

That’s why I think, as suggested by @ncfrey, that graph neural networks may not be the ideal candidate: the input to the learning algorithm has to be a graph representation, and in a chemistry context I suppose the most natural graph to provide is the molecular structure.

There’s this interesting paper, https://arxiv.org/abs/1910.00617, where the authors propose a GNN architecture in which the graphs are built from the stoichiometric information of the compounds alone, weighted by the fractional amount of each element.
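One way to picture such a composition-only graph is sketched below. It assumes one node per distinct element, an edge between every ordered pair of distinct nodes, and the fractional amounts stored as node weights. These choices and the name `composition_graph` are assumptions for illustration, and the paper's exact construction may differ.

```python
from itertools import permutations


def composition_graph(counts):
    """Build a dense directed graph over the distinct elements of a compound.

    `counts` maps element symbols to amounts, e.g. {"Fe": 2, "O": 3}.
    Returns (nodes, weights, edges):
      nodes   - element symbols, sorted so the node order is reproducible
      weights - fractional amount of the element at each node
      edges   - (source, target) index pairs for every ordered pair of
                distinct nodes
    """
    nodes = sorted(counts)
    total = sum(counts.values())
    weights = [counts[element] / total for element in nodes]
    edges = list(permutations(range(len(nodes)), 2))
    return nodes, weights, edges


if __name__ == "__main__":
    nodes, weights, edges = composition_graph({"Fe": 2, "O": 3})
    print(nodes, weights, edges)
```

Under these assumptions a single-element entry such as He yields one node and no edges, so a model would need self-loops or some other special handling for that case.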

@JanoschMenke The code is really good! Unfortunately, it fails to convert many samples from my dataset to SMILES.

Again, many thanks for your consideration!

P.S. I have also tried to use ElementPropertyFingerprint() as suggested by @ncfrey, but when I just copy and paste the following code from the documentation,

import pymatgen as mg

comp = mg.Composition("Fe2O3")

featurizer = ElementPropertyFingerprint()

features = featurizer.featurize([comp])

I get a NameError: name ‘ElementPropertyFingerprint’ is not defined, even though I have installed both pymatgen and matminer.

Fede

JanoschMenke · January 28, 2021, 5:47pm

If you have some examples of compounds that were not recognized, maybe we can find a solution.

peastman · January 28, 2021, 6:45pm

That should be

import deepchem as dc
featurizer = dc.feat.ElementPropertyFingerprint()

Fede · February 1, 2021, 11:33am

Thank you all again,

@JanoschMenke The band gap dataset that I’m currently looking at is the one used in this paper: https://pubs.acs.org/doi/abs/10.1021/acs.jpclett.8b00124
It is available under the ‘Supporting Information’ section (‘Complete training set of experimental band gaps…’).

Reading this paper again, the authors mention the Universal Fragment Descriptors approach and also the CGCNN model (I have been reading about these over the past few days), but they say that these were proposed for DFT-calculated data, while unfortunately crystallographic information is lacking for experimental band gaps. That’s why they rely on compositional data only, using Support Vector Machine and Support Vector Regression.

@peastman Thanks for the clarification. I’ll try later to make it work.

The ElementPropertyFingerprint() featurizer is not supposed to work with Graph Neural Networks, right? It only generates features that are then used to train a ‘standard’ model (basically the same procedure I followed with Matminer).