Proposal for Updating DeepChem Code Style to Follow 4-space Indents

arunppsg · March 19, 2022, 11:01am

Background

DeepChem codebase currently uses 2 spaces for indents.

Proposal

Using/Migrating to 4 spaces for indents.

Pros for using 4 space for indent

Using 4 spaces for an indent will make DeepChem’s code consistent with the rest of the python community. This will largely help new developers to get started easily with DeepChem’s style guide.
When we are using code from another source, we need not reformat it to follow a 2 space indent style.
PEP 8 guidelines suggest using 4 spaces for indentation.

Cons for using 4 space for indent

Not any I know of.

Migration Steps

Here is a brief overview of steps which can be taken to migrate the code for following 4-space indents:

Setup a timeline for migration ~2-3 months
When writing a new file, use 4-space indents for new files.
For modifying an existing file,
- We can update as and when each file is modified by a developer during additions/deletions of code, performing bug fixes and other improvements to code (as was done during the yapf version update to 0.32)
- Chart out phases of updates for large and critical pieces of code, for example files containing the base classes which are important for the code organisation of DeepChem structure.
Catch inconsistencies in files during CI workflow (black/yapf can be used to detect inconsistency).

Converting 2-indent to 4-indent spaces:

There are a couple of ways in which a file using 2-space indent can be modified to user 4 space indent:

Yapf - Example: yapf --style='{indent_width: 4}' -i foo.py
Reindent.py - Python’s source code provides a script reindent.py which can be run on files to renindent them to 4 spaces.
Black - Black is an auto-formatting tool, from the Python Software Foundation. We can run black on files to reindent them.

Black is a good choice for using as it enforces strict consistency with PEP-8 guidelines. In the community, Pytorch geometric, Pytorch and Google uses Google style code-formatting implemented via yapf whereas Facebook, Dropbox, Mozilla, Quora use Black to style their code. If we wish to stick with google style formatting, we can use reindent.py/yapf otherwise for sticking to the broader python community, we can use Black for formatting.

Expected Disruption

For a short period of time (during the transition period), the code formatting in the repo across the files will not be consistent but at any time, during the migration process, we will have the style in a file consistent.

Other Thoughts

Python does not require all code in a repository/file to follow the same indentation style. For example, this is a valid piece of code:

def foo():
    print ("Foo has 4 space for indent")

def bar():
  print ("Bar has 2 space for indent")

if __name__ == "__main__":
      print ("Main has 6 spaces of indent")
      foo()
     bar()

But though there are some exceptions, like

 def perm(l):                       # error: first line indented
for i in range(len(l)):             # error: not indented
    s = l[:i] + l[i+1:]
        p = perm(l[:i] + l[i+1:])   # error: unexpected indent
        for x in p:
                r.append(l[i:i+1] + x)
            return r                # error: inconsistent dedent

(source: lexical analysis from python docs).

Indentation is also rejected as inconsistent if a source file mixes tabs and spaces. More on python indentation can be found here.

bharath · March 21, 2022, 5:54pm

Great proposal! One additional con I would add is:

While migration is happening, PRs could be very noisy (every line in a file would change because of indent). This could make reviews harder, so if we decide to proceed we may want to reindent important files early and upfront so future PRs don’t have too much noise

Let’s discuss and get feedback on the developer calls to the proposal after other folks have time to review and comment here

peastman · March 21, 2022, 6:08pm

When viewing diffs on Github, there’s an option to ignore whitespace changes.

arunppsg · December 22, 2022, 6:14am

Here is the concrete proposal on updating yapf.

Steps:

The first pull request should update setup.cfg to use indent_width as 4 for yapf.
Subsequent pull requests should update the source code to follow the new convention.

List of pull requests

I would like to suggest the following breakdown of pull requests to make for updating style. I splitted it based on the criteria that each pull request should be of reasonable size (~3000 loc) and core modules should be updated first.
The core modules are:

Update deepchem/data/datasets.py (Loc: 3102)
Update deepchem/data rest of the modules (Loc: 2168)
Update deepchem/feat/base_classes.py, graph_features.py (LoC: 1567)
Update optimizers.py, losses.py, models.py, callbacks.py, init.py in deepchem/models dir (LoC: 1677)
Update deepchem.models.torch_models.layers.py, torch_model.py (LoC: 3887)-
Update deepchem.models.keras_model.py, graph_models.py (LoC: 2544)

Updating deepchem/feat directory:

deepchem/feat/molecule_featurizers and all files (LoC: 3721)
deepchem/feat/complex_featurizers and all files (LoC: 1994)
deepchem/feat/tests all files (LoC: 3312)
rest of all .py deepchem/feat directory (LoC: 1644)

Updating deepchem molnet directory:

all .py files in deepchem/molnet directory (LoC: 2374)
all .py files in deepchem/molnet/load_function dir, load_dataset_template.py, molnet.py (LoC: 432)
rest of all .py files in deepchem/molnet/load_function dir (LoC: 4261)
all .py files in deepchem/molnet/material_datasets (LoC: 538)

Updating deepchem models directory:

Update deepchem/models/layers.py (LoC: 3858)
Update gan.py, seqtoseq.py, chemnet_layers.py, wandblogger.py, chemnet_models.py in deepchem/models dir (LoC: 2121)
Update rest of the files in deepchem/models dir
Update dmpnn.py, kfac_optimizer.py, ferminet.py, cgcnn.py, gcn.py (LoC: 2125)
Update gat.py, pagtn.py, mat.py, normalizing_flows_pytorch.py, cnn.py, megnet.py, mpnn.py, lcnn.py, attentivefp.py (2803)
Update test_reload.py, test_overfit.py, test_layers.py, test_layers_from_config.py, test_torch_model.py test_keras_model.py, test_losses.py in deepchem/models/tests dir (LoC: 4975)
rest of the tests in deepchem/models/tests dir (loc: 5129)

Utils directory:

rdkit_utils.py, data_utils.py, molecule_feature_utils.py in deepchem/utils directory (LoC: 2657)
evaluate.py noncovalent_utils.py fragment_utils.py coordinate_box_utils.py pdbqt_utils.py in deepchem/utild dir (LoC: 2094)
save.py pytorch_utils.py typing.py vina_utils.py debug_utils.py init.py hash_utils.py genomics_utils.py fake_data_generator.py sequence_utils.py voxel_utils.py geometry_utils.py electron_sampler.py conformers.py docking_utils.py in deepchem/utils dir (LoC: 2193)
all files in deepchem/utils/test dir (loc: 1892)

Other files and directories:

deepchem/dock and subdirectories (LoC: 1468)
deepchem/splits and subdirectories (LoC: 2462)
deepchem/rl directory and subdirectories: LoC 2153
deepchem/trans directory: Loc 2786
deepchem/trans/tests directory: Loc 1302
deepchem/hyper directory: Loc 1735
deepchem/metalearning and subdirectories (LoC: 433)
deepchem/metrics directory and subdirectories: LoC 1560
Update deepchem/data tests (Loc: 3018)

Altogether there will be about 30 pull requests.

Expected disruption

Let’s say an user wants to add a feature or a patch to a file following 2 indent style. From earlier experience (in updating yapf from 0.22 to 0.32), we found code review to be difficult since it contains both stylistic changes and new feature/patch. Hence, in such a scenario, the contributor can be asked to complete yapf conversion on a seprate PR and than land a pull request with a patch.

Timeline Jan 3 2023 - Jan 31 2023.