TL;DR: I want to load a “.mol” file into a graph object or some similar data structure.
I am trying to build a dataset from a collection of 3D molecule structure files.
Happily, “.mol” files are available for each of the molecules in question. I have scraped those files and am now trying to load them into sensible objects in Python. A graph would be a good start, but I am not sure what I am looking at in these files. Here is a single example; scroll down to see the printed Molfile.
I also found a partial specification of the file format on Wikipedia, but I can’t link it because new users are limited to two links per post.
One thing that bothers me is that the last column of the atom block seems to record an atom index, and the last row of that column has a 0. If you then look at the first two columns of the bond block, which record the atoms involved in each bond, you see no 0s. It looks as if that last hydrogen were floating unattached.
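To make the “floating hydrogen” worry testable, here is a minimal sketch that lists atoms appearing in no bond line. It assumes the V2000 layout, where the counts line is the fourth line, fields are fixed-width (three characters each for the atom and bond counts and for the two atom numbers on a bond line), and bond lines refer to atoms by their 1-based position in the atom block. The function name `unbonded_atoms` and the `SAMPLE` string are made up for illustration; the sample is a hand-written toy, not one of the scraped files.

```python
def unbonded_atoms(text):
    """Return 1-based indices of atoms that appear in no bond line (V2000)."""
    lines = text.splitlines()
    counts = lines[3]  # three header lines come before the counts line
    n_atoms, n_bonds = int(counts[0:3]), int(counts[3:6])

    bonded = set()
    first_bond = 4 + n_atoms  # the bond block starts right after the atom block
    for line in lines[first_bond:first_bond + n_bonds]:
        # Columns 0-2 and 3-5 hold the two atom numbers of the bond.
        bonded.update((int(line[0:3]), int(line[3:6])))

    return [i for i in range(1, n_atoms + 1) if i not in bonded]


# Hand-written toy input (an assumption, not real data): three atoms, but
# only one bond, so the third atom is deliberately left unattached.
SAMPLE = "\n".join([
    "toy example",
    "  hand-written, not real data",
    "",
    "  3  1  0  0  0  0  0  0  0  0999 V2000",
    "    0.0000    0.0000    0.0000 O   0  0  0  0  0  0  0  0  0  0  0  0",
    "    0.9572    0.0000    0.0000 H   0  0  0  0  0  0  0  0  0  0  0  0",
    "   -0.2400    0.9266    0.0000 H   0  0  0  0  0  0  0  0  0  0  0  0",
    "  1  2  1  0",
    "M  END",
])

print(unbonded_atoms(SAMPLE))  # [3]
```

If this returns an empty list for a real file, then every atom, including the last one, takes part in at least one bond.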
Another thing is that there seem to be more atoms listed than I would have estimated from the stick diagram on that Erowid page. I have since confirmed with a chemist friend that the atom count is correct!
I would like answers to those questions, but I realize that they are more like chemistry questions.
One really simple question for deepchem users that I have not been able to answer is: are there basic utilities for loading these files into appropriate data structures? I have in mind, for instance, doing dimensionality reduction and clustering. I cannot do that on a raw Molfile string, and I would rather not do something crude like packing all the 3D coordinate vectors into one giant matrix. Surely there are tools to build a graph from this? Or some other embedding? I am new here.
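To make the goal concrete, here is a rough sketch of the kind of loader I have in mind. It is a hand-rolled parser, not a deepchem utility, and it assumes the V2000 Molfile layout with fixed-width fields: coordinates in three 10-character columns, the element symbol in columns 31-33, and bond lines that name atoms by 1-based position. The names `parse_molfile_v2000` and `WATER` are hypothetical, and the water molecule is a hand-written example rather than one of my scraped files.

```python
from collections import defaultdict


def parse_molfile_v2000(text):
    """Parse a V2000 Molfile string into atoms and an adjacency map.

    Returns (atoms, adjacency): atoms is a list of (element, (x, y, z))
    tuples, and adjacency maps a 0-based atom index to a list of
    (neighbor_index, bond_order) pairs.
    """
    lines = text.splitlines()
    counts = lines[3]  # three header lines come before the counts line
    if "V2000" not in counts:
        raise ValueError("only the V2000 layout is handled in this sketch")
    n_atoms, n_bonds = int(counts[0:3]), int(counts[3:6])

    atoms = []
    for line in lines[4:4 + n_atoms]:
        xyz = tuple(float(line[i:i + 10]) for i in (0, 10, 20))
        atoms.append((line[31:34].strip(), xyz))

    adjacency = defaultdict(list)
    first_bond = 4 + n_atoms
    for line in lines[first_bond:first_bond + n_bonds]:
        # Convert the 1-based atom numbers to 0-based list indices.
        a, b = int(line[0:3]) - 1, int(line[3:6]) - 1
        order = int(line[6:9])
        adjacency[a].append((b, order))
        adjacency[b].append((a, order))
    return atoms, dict(adjacency)


# Hand-written water molecule (an assumption, used only to exercise the parser).
WATER = "\n".join([
    "water",
    "  hand-written example",
    "",
    "  3  2  0  0  0  0  0  0  0  0999 V2000",
    "    0.0000    0.0000    0.0000 O   0  0  0  0  0  0  0  0  0  0  0  0",
    "    0.9572    0.0000    0.0000 H   0  0  0  0  0  0  0  0  0  0  0  0",
    "   -0.2400    0.9266    0.0000 H   0  0  0  0  0  0  0  0  0  0  0  0",
    "  1  2  1  0",
    "  1  3  1  0",
    "M  END",
])

atoms, adjacency = parse_molfile_v2000(WATER)
print([element for element, _ in atoms])  # ['O', 'H', 'H']
print(adjacency)  # {0: [(1, 1), (2, 1)], 1: [(0, 1)], 2: [(0, 1)]}
```

Something shaped like this, with node labels, 3D coordinates, and weighted edges, is what I would hope an existing utility could hand me, ideally with more care for the parts of the format this sketch ignores (charges, property lines, and the V3000 variant).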
Thanks!