Molfiles: Understanding, Loading & Embedding

kreamstac · January 10, 2022, 3:46pm

TL;DR I want to load a “.mol” file into a graph object, or something.

I am trying to build a dataset from a collection of 3D molecule structure files.

Happily, “.mol” files are available for each of the molecules in question. I have scraped those files and am now trying to load them into sensible objects in Python. A graph would be a good start, but I am not sure what I am looking at in these files. Here is a single example; scroll down to see the printed Molfile.

the file

the molecule represented

I also found a partial specification of the file format on Wikipedia. I can’t link it because new users are limited to two links per post.

One thing that bothers me is that the last column of the atom block seems to record an atom index and that the last row of that column has a 0. If you then look in the bond block’s first two columns, which record the atoms involved in a bond, you see no 0s. As if that last hydrogen were floating unattached.

~~Another thing is that there seem to be more atoms listed than I would have estimated from the stick diagram on that Erowid page.~~ Confirmed with chemist friend that the atom count is correct!

I would like to answer those questions, but I know that they are more like chemistry questions.

But one really simple question for deepchem users that I have not been able to answer is: are there basic utilities for loading these files into appropriate data structures? I have in mind, for instance, to do dimension reduction and clustering. I cannot do that on a raw Molfile string and I would rather not do something crude like pack all the 3D vectors into a giant matrix space. Surely there are tools to build a graph from this? Or some other embedding? I am new here.

Thanks!

ellagale · January 11, 2022, 2:58pm

You want to use something like this:
import rdkit.Chem.rdmolfiles

mol_orig = rdkit.Chem.rdmolfiles.MolFromMolFile(
    test_file_location,
    sanitize=False)

RDKit can read your .mol files, and this call returns an RDKit molecule object, which has the formula, the coordinates and so on. (MolFromMolFile is the reader for MDL “.mol” files; MolFromMol2File is for the different Tripos “.mol2” format.) I’d recommend having a read of RDKit’s website here.
You can then make a database of molecule files.
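If it helps to see what is actually inside the file before handing it to a library, here is a minimal sketch of reading a V2000 Molfile by hand into an atom list and an adjacency list. It assumes the fixed-width V2000 layout (three header lines, a counts line, then the atom and bond blocks). The names `parse_molfile_v2000` and `EXAMPLE_MOLBLOCK` are made up for illustration, and the water geometry is hand-written and only approximate. One thing it makes visible: atom lines carry no explicit index, and the bond block refers to atoms by their 1-based row position in the atom block.

```python
# Sketch only: a tiny V2000 Molfile reader using fixed-width fields.
# parse_molfile_v2000 and EXAMPLE_MOLBLOCK are hypothetical names, not a library API.

EXAMPLE_MOLBLOCK = """\
water
  hand-written example

  3  2  0  0  0  0  0  0  0  0999 V2000
    0.0000    0.0000    0.0000 O   0  0  0  0  0  0  0  0  0  0  0  0
    0.9572    0.0000    0.0000 H   0  0  0  0  0  0  0  0  0  0  0  0
   -0.2400    0.9266    0.0000 H   0  0  0  0  0  0  0  0  0  0  0  0
  1  2  1  0
  1  3  1  0
M  END
"""


def parse_molfile_v2000(text):
    """Return (atoms, adjacency) from a V2000 Molfile string.

    atoms is a list of (symbol, (x, y, z)) in file order.
    adjacency maps a 0-based atom index to a list of (neighbor, bond_order).
    """
    lines = text.splitlines()
    counts = lines[3]  # the fourth line holds the atom and bond counts
    if "V2000" not in counts:
        raise ValueError("only V2000 connection tables are handled in this sketch")
    n_atoms = int(counts[0:3])
    n_bonds = int(counts[3:6])

    atoms = []
    for line in lines[4:4 + n_atoms]:
        # x, y, z occupy three 10-character fields; the symbol follows a space
        xyz = (float(line[0:10]), float(line[10:20]), float(line[20:30]))
        symbol = line[31:34].strip()
        atoms.append((symbol, xyz))

    adjacency = {i: [] for i in range(n_atoms)}
    for line in lines[4 + n_atoms:4 + n_atoms + n_bonds]:
        # bond lines reference atoms by 1-based row number; shift to 0-based
        a = int(line[0:3]) - 1
        b = int(line[3:6]) - 1
        order = int(line[6:9])
        adjacency[a].append((b, order))
        adjacency[b].append((a, order))
    return atoms, adjacency


atoms, adjacency = parse_molfile_v2000(EXAMPLE_MOLBLOCK)
print([symbol for symbol, _ in atoms])
print(adjacency)
```

In practice I would still let RDKit do the parsing, since it handles charges, stereo flags and the property lines that this sketch ignores. The same information is available there through mol.GetAtoms() and mol.GetBonds(), with 0-based atom indices.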

kreamstac · January 12, 2022, 11:58am

Thanks, this is very helpful. I think I should also spend some time with the deepchem tutorials, which I gave only a cursory glance months ago. Probably there is a lot of basic utility stuff demonstrated there that would get me up to speed.

While I have your attention, are there any other learning materials you can recommend? Like a MOOC that is worthwhile? I am looking at drugs right now but am broadly interested in everything deep learning can do for biotechnology.

Thanks again!

ellagale · January 15, 2022, 5:37pm

Uh, Ok, I have a PhD course (intro to machine learning) here:

It’s general ML stuff, but the last notebook (no. 5) has the stuff the students have to do with deepchem.
In that notebook I reference some books. The one called something like Deep learning in the biological sciences is pretty good and has a lot of code examples, so definitely check that out.

kreamstac · January 15, 2022, 5:51pm

Very cool, thank you.