A Sketch of a ModelHub

In Making DeepChem a Better Framework for AI-Driven Science, I suggested adding a ModelHub to DeepChem. In this post, I sketch the design of a ModelHub. Here is the rough sequence of steps:

  1. Overhaul dc.models.Model to provide generic weights interface
  2. Introduce a standard weight naming convention
  3. Create uniform weight saving format
    • Implement for KerasModel, TorchModel, SklearnModel
  4. Implement generic Model.from_pretrained implementation
  5. Introduce DEEPCHEM_MODEL_HUB environment variable

I’ll discuss each of these steps below.

Overhaul Model Superclass

The current dc.models.Model class should really be named Estimator, since it only provides a sensible API for regressors/classifiers. Rename the current model class to Estimator and introduce a new abstract model class that specifies the following methods:

from typing import Any, Dict


class Model:

  def __init__(self, **kwargs):
    # The same as the __init__ method we have now
    ...

  def get_weights(self) -> Dict[str, Any]:
    # Returns a dictionary mapping weight names to weight values
    ...

  def set_weight(self, name: str, value: Any):
    # Set the specified weight to the specified value
    ...

  def get_parameters(self) -> Dict[str, Any]:
    # Returns a dict of hyperparameters
    ...

  def to_pretrained(self):
    # More details on this below
    ...

  @staticmethod
  def from_pretrained(pretrained_weights) -> "Model":
    # A constructor method that creates a Model from specified pretrained weights
    ...

This API provides a standard way of accessing, updating, and saving the weights for a model. Optionally, we could implement dictionary-like semantics so you could do things like:

gc = GraphConvModel()
# Get a weight from the model
gc["graph-conv-1"]

Note that each model must be able to set its weights. This will have to be implemented separately for different models like KerasModel, TorchModel, and SklearnModel. We should also make Metalearner and Policy implement this API since these classes are models in the generic sense as well.
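
To make this concrete, here is a rough sketch of what the weight accessors might look like for a PyTorch-backed model, assuming the wrapped torch.nn.Module lives on self.model (the attribute name and the use of PyTorch's own parameter names here are assumptions for illustration, not the current DeepChem implementation):

import torch


class TorchModel(Model):

  def get_weights(self):
    # Map each parameter name to a NumPy copy of its current value
    return {name: param.detach().cpu().numpy()
            for name, param in self.model.named_parameters()}

  def set_weight(self, name, value):
    # Copy the provided array into the matching parameter in place
    param = dict(self.model.named_parameters())[name]
    with torch.no_grad():
      param.copy_(torch.as_tensor(value))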

Implement a Standard Weight-Naming Convention

I propose a rough weight-naming convention:

"model_name-layer_name-i"

where model_name is the name of the model, layer_name is the name of the layer, and i is an index denoting which weight this is within the layer. The advantage of a standard weight-naming convention is that inspecting models becomes easier.
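
As a purely illustrative example (these are not real DeepChem layer names), a GraphConvModel could expose weights such as:

graphconvmodel-graph_conv_1-0   # first weight of the first graph conv layer
graphconvmodel-graph_conv_1-1   # bias of the first graph conv layer
graphconvmodel-dense-0          # weight matrix of the final dense layer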

Create Standard Weight Saving Format

At present, different models are stored differently on disk. We may want to consider adopting a standard format for saving weights. For simplicity, we could do something like the following directory structure:

model_weights/
  -> params.json
  -> weight-name1.npy
  -> ...
  -> weight-name-n.npy

Here params.json holds the parameters of the model, that is, its class and the constructor arguments (using GraphConvModel as an example):

{
  "class": "GraphConvModel",
  "n_tasks": 10,
...
  "graph_conv_layers": [100, 100, 100]
}

The actual weights are stored in .npy files on disk; the filename of each .npy file is the corresponding weight name. We would want a method for each model:

def to_pretrained(self):
  # Generate the pretrained folder on disk
  ...
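
Under the get_parameters/get_weights API sketched above, to_pretrained could be roughly the following (the model_dir argument and the exact layout are assumptions for illustration):

import json
import os

import numpy as np


def to_pretrained(self, model_dir="model_weights"):
  # Write params.json plus one .npy file per named weight (hypothetical layout)
  os.makedirs(model_dir, exist_ok=True)
  params = dict(self.get_parameters())
  params["class"] = type(self).__name__
  with open(os.path.join(model_dir, "params.json"), "w") as f:
    json.dump(params, f, indent=2)
  for name, value in self.get_weights().items():
    np.save(os.path.join(model_dir, name + ".npy"), value)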

The advantage of having our own format is that we won’t break stored weights if TensorFlow/PyTorch change their checkpointing formats. We used to have models stored in TF1 format, but they all broke with the TF2 upgrade. Ideally, by having a simple format we control, we can reduce the risk of breakage.

Implement from_pretrained

Here is a simple implementation for from_pretrained:

@staticmethod
def from_pretrained(pretrained):
  # load_pretrained reads params.json and the .npy files, returning the model
  # class, the constructor arguments, and a dict mapping weight names to values
  model_class, params, weights = load_pretrained(pretrained)
  # Initialize the model from its constructor arguments
  model = model_class(**params)
  # Copy the pretrained values into the freshly constructed model
  for weight_name, weight_value in weights.items():
    model.set_weight(weight_name, weight_value)
  return model
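
The load_pretrained helper used here doesn’t exist yet; a rough sketch following the directory layout above (looking the class up by name in dc.models is just one possible strategy) might be:

import glob
import json
import os

import numpy as np

import deepchem as dc


def load_pretrained(pretrained_dir):
  # Recover the class name and constructor arguments from params.json
  with open(os.path.join(pretrained_dir, "params.json")) as f:
    params = json.load(f)
  model_class = getattr(dc.models, params.pop("class"))
  # Each .npy file holds one weight; the filename (minus .npy) is its name
  weights = {}
  for path in glob.glob(os.path.join(pretrained_dir, "*.npy")):
    name = os.path.splitext(os.path.basename(path))[0]
    weights[name] = np.load(path)
  return model_class, params, weights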

Environment Variable

We should introduce a new environment variable

DEEPCHEM_MODEL_HUB

By default this should point at the DeepChem S3 bucket for now. In the future, different companies may want to provide their own DeepChem model hubs. The model hub is simply a directory with the following structure:

modelhub/
  -> pretrained_model1/
  -> pretrained_model2/
  ...

To load from the model hub, we simply download pretrained_model1/ (in the file format we specified above) and call from_pretrained to load this model.
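
End to end, usage could look roughly like this (the URL is a placeholder, and passing a hub model name directly to from_pretrained is an assumption about how the lookup would work):

import os

import deepchem as dc

# Point DeepChem at a model hub (placeholder URL, not the real bucket)
os.environ["DEEPCHEM_MODEL_HUB"] = "https://deepchem-modelhub.example.com"

# Fetch pretrained_model1/ from the hub and rebuild the model from it
model = dc.models.GraphConvModel.from_pretrained("pretrained_model1")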

Potential Complications

We will need to implement methods for getting and setting weights separately for each of the different model types, which will take considerable work.

Feedback?

This is a first sketch of the design, so I’d love folks’ feedback on whether this makes sense. A lot of this design is still rough, but I wanted to get something written down for early review. @peastman @ncfrey I’d love your feedback in particular here! Feel free to suggest changes to this design as needed to make things sensible :slight_smile:

3 Likes

Something similar to the get_weights API was what I had proposed initially for restoring weights from a pretrained model. But at that point, TensorFlow registered variables to the session, so creating multiple models from the same model class in one session meant the same layers across those models would end up with different names. Peter’s recommendation and the final design both revolve around using the actual tensors themselves and copying values into them, rather than relying on the names of the tensors.

Does this issue still persist with TF 2.0? If yes, then part of the standardization would involve resolving this somehow. This is in contrast to PyTorch, where the variables are registered to the model and the same layers across models from a model class also share a common name.

1 Like

This looks great! One question I would have with the naming convention is how we would differentiate by pre-training corpora if there are numerous different pre-trained weights for the same model. Would this be specified in the model name found in the hub, similar to ChemBERTa?

1 Like

This looks like a really good idea.

Standardizing weight names could be difficult. For models we implement ourselves we can set them however we want. But a lot of our models are now just wrappers around models defined in other codebases. That leads to several difficulties: 1) we don’t control the names; 2) the names could change in a future update to an outside library; 3) even the structure of the model could change in a future update to an outside library. And as Vignesh notes, variable naming has complicated interactions with other frameworks and libraries.

Maybe we should consider allowing the variable names exposed through this API to differ from the ones assigned by TensorFlow or PyTorch?

The on-disk representation will be cleaner if we store all the arrays in a single .npz file rather than a separate .npy file for each one.
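
For reference, with a get_weights() dict like the one sketched in the proposal, that could be as simple as:

import numpy as np

# Save every named weight into one archive; the keys become the weight names
np.savez("model_weights.npz", **model.get_weights())

# Reloading gives a dict-like mapping of weight names back to arrays
weights = dict(np.load("model_weights.npz"))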

You assume that the entire definition of a model can be set strictly through constructor arguments. Are we certain that will always be the case?

Rather than an environment variable for the model hub, what about an optional argument to from_pretrained()? By default it looks for the model at the standard URL, but you can specify a different location if you have your own repository.

I think following huggingface’s conventions here would be a good idea!

This is a really good point. For wrapped models we don’t really have any control over the internals. It makes sense to me that for our models we define standard model names which we control.

Great suggestion!

I believe this is currently the case for all our models. I can’t think of a counterexample offhand. Is there a model in the literature where a reasonable implementation can’t be specified by constructor arguments?

Great suggestion! This will make it easier for users to use their own model hubs.

Following up on a discussion from Gitter, one thought for a modelhub is that perhaps we should provide a standard export of DeepChem models to a format like ONNX/PMML. I don’t yet know how easy or hard this would be, but it would have the strong advantage that we wouldn’t have to invent a custom storage format. ModelHub files would then be stored as ONNX/PMML files and reloaded from these files.
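
For PyTorch-backed models, at least the export side looks simple; a rough sketch (the dummy input shape and the model.model attribute are assumptions for illustration):

import torch

# Hypothetical: export the wrapped torch.nn.Module of a DeepChem TorchModel
dummy_input = torch.randn(1, 1024)  # input shape depends on the featurizer
torch.onnx.export(model.model, dummy_input, "pretrained_model1.onnx")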

1 Like

One possibility for simplifying things is to design the ModelHub to only support PyTorch models to start. This would considerably simplify our design and could fit well with our new positioning of PyTorch as the main backend framework.