Why does batch_size influence output shape

RWalser · January 28, 2023, 2:51pm

While running a GCN model it seems that the size of the prediction is contingent upon the batch_size.

Here is some dummy data with 240 samples:

!pip install --pre deepchem
!pip install --pre rdkit
import deepchem as dc
import numpy as np
import tensorflow as tf
from deepchem.feat.mol_graphs import ConvMol

mol = ['C-C-O']*240

ftr = dc.feat.ConvMolFeaturizer(per_atom_fragmentation=False) 
X=ftr.featurize(mol)
y = np.arange(0,240,1)  
w = np.arange(0,240,1)  
ids = np.arange(0,240,1)  
ds = dc.data.NumpyDataset(X=X, y=y, ids=ids)

And here is the data_generator:

def data_generator(dataset, epochs=1, batch_size = 100, pad_batches = True):
  print(dataset)
  for ind, (X_b, y_b, w_b, ids_b) in enumerate(dataset.iterbatches(batch_size, epochs,
                                                                   deterministic=False, pad_batches=pad_batches)):


    multiConvMol = ConvMol.agglomerate_mols(X_b)
    inputs = [multiConvMol.get_atom_features(), multiConvMol.deg_slice, np.array(multiConvMol.membership)]
    for i in range(1, len(multiConvMol.get_deg_adjacency_lists())):
      inputs.append(multiConvMol.get_deg_adjacency_lists()[i])

    labels = [y_b]
    weights = [w_b]

    yield (inputs, labels, weights)

Now we can define the model and fit it to the dataset generated above:

batch_size = 100
n_tasks = 1
class TestModel(tf.keras.Model):

  def __init__(self, model = 1):
    super(TestModel, self).__init__()
    self.model = model

    #____________Test Model 1___________
    if self.model == 1:
      self.gc1 = GraphConv(128, activation_fn=tf.nn.tanh)
      self.readout = GraphGather(batch_size=batch_size, 
                                activation_fn=tf.nn.tanh)
      self.dense2 = layers.Dense(1)

  def call(self, inputs):

    #____________Test Model 1___________
    if self.model == 1:
      gc1_output = self.gc1(inputs)
      readout_output = self.readout([gc1_output]+ inputs[1:])
      dense2_output = self.dense2(readout_output)


    return dense2_output

#Fit_generator
print("_________\nFitting:")
testmodel = dc.models.KerasModel(TestModel(1), loss=dc.models.losses.L2Loss())
testmodel.fit_generator(data_generator(ds, epochs=1, batch_size = 100))

Finally we try to predict the dataset labels. Setting batch_size = 100 and pad_batches to False, we expect the shape of pred to be (240,1) but it comes out to be (300,1):

#Predict
print("_________\nPredicting:")
pred = testmodel.predict_on_generator(data_generator(ds, epochs = 2, batch_size = 100, pad_batches = False))
print(ds.y.shape, pred.shape)

_________
Predicting:
<NumpyDataset X.shape: (240,), y.shape: (240,), w.shape: (240,), ids: [0 1 2 ... 237 238 239], task_names: [0]>
(240,) (300, 1)

I assumed that it had something to do with the dataset not being a multiple of batch size and changed it to 80 and then 60. The former still returned a pred shape of (300,1) while the latter gave a shape of (400,1). I also tried batch sizes of 120 (which returned an error) and 40 (for which the output shape was (600,1))

Can someone solve the mystery?

bharath · January 29, 2023, 10:49pm

It’s possible that there is a bug where padding is happening even though it’s not supposed to. Does this happen for the existing models or only for your wrapped model? You can check perhaps if data_generator is behaving correctly when padding behavior is turned off