Version Controlling ML projects using DVC

10 minute read

1. Data Drift - Need for retraining and redeployment.

Unlike Software 1.0, data science (a.k.a. Software 2.0) requires constant monitoring of the deployed system. One cannot just deploy a model and relax, because it is prone to data drift. Data drift is a phenomenon where a model’s performance decays because the real-world data changes. For example, during COVID-19, Amazon’s recommendation system went bonkers and struggled to recommend the right products, probably because its historical data had never contained such a sudden surge in demand for toilet paper. Big companies deploy a model every 10 min (Ref: StanfordMLSystems). A quick turnaround in retraining and redeployment has become a major concern for up-and-coming small and mid-sized analytics firms.

Different patterns of drifts:

1. Recurrent drifts like Christmas, Friday Sales, Festivals, etc
2. Gradual Drifts
3. Sudden Drifts during emergency situations
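
One way to notice any of these patterns is to compare the distribution of a feature at training time against what the model sees in production. The sketch below uses the Population Stability Index, a common heuristic that this article does not otherwise rely on; the function name, the bin count and the sample values are illustrative assumptions.

```python
import math

def population_stability_index(reference, current, bins=5):
    """Compare two samples of one numeric feature; larger values mean more drift."""
    lo, hi = min(reference), max(reference)
    width = (hi - lo) / bins or 1.0  # fall back to 1.0 if all reference values are equal

    def bin_shares(values):
        counts = [0] * bins
        for v in values:
            # Clamp values outside the reference range into the edge bins.
            idx = min(max(int((v - lo) / width), 0), bins - 1)
            counts[idx] += 1
        # A small floor avoids log(0) for empty bins.
        return [max(c / len(values), 1e-6) for c in counts]

    ref, cur = bin_shares(reference), bin_shares(current)
    return sum((c - r) * math.log(c / r) for r, c in zip(ref, cur))

# Hypothetical feature values: production data shifted upwards by 50.
train_values = list(range(100))
live_values = [v + 50 for v in train_values]

no_drift = population_stability_index(train_values, train_values)
drift = population_stability_index(train_values, live_values)
print(no_drift, drift)
```

A score near zero means the two samples look alike. As a rule of thumb, values above roughly 0.25 are often treated as a shift worth retraining for, but the right threshold depends on the use case.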

Different types of drifts:

1. Concept Drift

1.2 Data Scientists should be more end-to-end

Open-source pretrained models and platforms like HuggingFace & spaCy have shrunk the modelling part of the job for Data Scientists, especially for those in the commercial B2B SaaS space. This has greatly reduced the time required for setting up the training code and infra. As B2B SaaS companies move towards subscription-based models, data drift becomes a built-in problem, and the ability to handle it is turning out to be a major factor in hiring.

1.3 Pipelines

For every change in data distribution, i.e., data drift, we have to reproduce all the data science steps: data processing, training, picking the best model and deploying. While these steps are easy to repeat by hand, automating them lets data scientists focus on more interesting problems.

Pipelines make it possible for artifacts to flow from the early stages, like data processing, all the way to redeployment. Tools like DVC, Kubernetes, KFServing and Kubeflow serve this purpose.

1.4 Data Versioning

Models are a result of data, hyperparams, optimizers, etc. A change in any of these inputs leads to a new model. So, apart from versioning the code, there is a need to track/version the data. This helps in inferring which data produced which model.

Versioning large files is relatively new, and there are quite a lot of tools for tracking artifacts. Git LFS, wandb, MLflow and DVC are some of them. Considering the topic of the article, let’s continue with DVC.

2. DVC

DVC is an open-source version control system for machine learning projects. Instead of relying on ad hoc file suffixes and comments in code, DVC tracks code and data together, which makes it easy to follow every iteration of the model and to switch back and forth between experiments.

DVC relies on Git for version control. Specifically, DVC tracks changes to the data and records them in a few small meta files, and Git versions those meta files. So, let’s create a GitHub repository.

Figure 1: Create a Git repo


Now, let’s initialize DVC here using dvc init.

Figure 2: Initialize DVC inside the repo

This generates a folder called .dvc/ that stores internal files for tracking. Let’s add this folder to git for tracking.

git add .dvc .dvcignore .gitignore
git commit -m "DVC is initialized"

2.1 Track your data

Before tracking the data, I would like to present the use case we will be working on in this article. We take some NER data and then (1) preprocess the data, (2) train an NER model using spaCy and (3) evaluate the model.

Here is the data that’s stored in data/data.json.

            ["Uber blew through $1 million a week", [[0, 4, "ORG"]]],
            ["Android Pay expands to Canada", [[0, 11, "PRODUCT"], [23, 29, "GPE"]]],
            ["Spotify steps up Asia expansion", [[0, 7, "ORG"], [17, 21, "LOC"]]],
            ["Google Maps launches location sharing", [[0, 11, "PRODUCT"]]],
            ["Google rebrands its business apps", [[0, 6, "ORG"]]],
            ["look what i found on google!", [[21, 28, "PRODUCT"]]]
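
Each record pairs a sentence with a list of [start, end, label] character offsets. A minimal sketch of how those offsets map back to the text is shown here; it copies two records from the data above, assumes the file wraps them in an outer list, and uses a hypothetical helper name, entity_texts.

```python
import json

# Two records copied from the data above; the enclosing [ ] is assumed.
raw = '''[
  ["Uber blew through $1 million a week", [[0, 4, "ORG"]]],
  ["Android Pay expands to Canada", [[0, 11, "PRODUCT"], [23, 29, "GPE"]]]
]'''

def entity_texts(records):
    """Return (surface text, label) pairs by slicing each sentence with its offsets."""
    return [
        (text[start:end], label)
        for text, spans in records
        for start, end, label in spans
    ]

entities = entity_texts(json.loads(raw))
print(entities)
```

Printing the sliced text next to its label is a quick sanity check on annotation offsets before any training run.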

Let DVC track the data with:

dvc add data/

This generates a meta file (data.dvc). Instead of tracking the huge data/ folder for changes, git can now track the meta file data.dvc.

git add data.dvc .gitignore

2.2 Adding Google Drive as remote storage

We can upload DVC-tracked data to any remote storage. This helps in replicating the artifacts in later stages. DVC supports most of the popular cloud storage services. For now, I am storing my data in Google Drive. Here are the steps to add it as a remote storage:

  1. Once you have created your folder, look at the folder’s link in the browser. In this case, it is 1rWmiZKpgAFQAUaZej15KQc-T6mvcVS5A.

    Figure 3: Folder link is present after folders/ in the URL
  2. Now, add the folder as the default remote:
     dvc remote add -d gcloud gdrive://1rWmiZKpgAFQAUaZej15KQc-T6mvcVS5A

    These details get stored in .dvc/config. So, let’s add these details to git.

     git add .dvc/config
     git commit -m "Configure remote storage"
  3. Now, push your tracked data to the cloud
    dvc push
    Figure 4: Pushing data to the cloud
    Figure 5: DVC uploads the files/folders under names derived from their content hashes

2.3 Pipelines

A pipeline is a flow of data through the sequence of steps required to get the desired output. The pipeline in our case is as follows:

Figure 6: Pipeline flow

DVC lets us automate the repetitive process of re-running the scripts by incorporating pipelines. Here is a snippet of the pipeline in our case:


  stages:
    preprocess:
      cmd: python
      deps:
        - data/
      outs:
        - features/

In the above YAML file (dvc.yaml), we defined just one stage, preprocess. For each stage, we define the command to run (cmd), its dependencies (deps) and its outputs (outs). DVC tracks these dependencies and outputs. It is important to note that the script itself must also be listed under deps, so that DVC knows to re-run the stage whenever the script changes.

You can run the pipeline using

dvc repro
Figure 7: Output of dvc repro

Along with the outputs generated by the stages, another file called dvc.lock is generated. This is a meta file that records the state of each stage (hashes of its dependencies and outputs). Commit dvc.lock to store the first version.

git add dvc.yaml dvc.lock .gitignore
git commit -m "DVC lock generated after 1st repro"

The significance of dvc.lock can be better understood when you run the pipeline again:

dvc repro
Figure 8: dvc.lock uses md5 hashing to not re-run unchanged steps

As we haven’t changed anything from our 1st dvc repro command, DVC skipped the computation for the preprocess stage.
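
The skip logic can be pictured as comparing the current hashes of a stage’s dependencies with the hashes recorded after the last run. The following is only a simplified sketch of that idea, not DVC’s actual implementation: the lock file name, its JSON layout and the helper names are assumptions, and real DVC also considers the command and the outputs.

```python
import hashlib
import json
from pathlib import Path

def md5_of(path):
    """Hash one file, or every file under a directory in sorted order."""
    path = Path(path)
    files = [path] if path.is_file() else sorted(p for p in path.rglob("*") if p.is_file())
    digest = hashlib.md5()
    for f in files:
        # Include the relative name so that renames also change the hash.
        digest.update(f.relative_to(path.parent).as_posix().encode())
        digest.update(f.read_bytes())
    return digest.hexdigest()

def run_stage(name, deps, action, lock_path="stage.lock.json"):
    """Run action() only if a dependency hash differs from the recorded one."""
    lock_file = Path(lock_path)
    lock = json.loads(lock_file.read_text()) if lock_file.exists() else {}
    current = {str(dep): md5_of(dep) for dep in deps}
    if lock.get(name) == current:
        return "skipped"
    action()
    lock[name] = current
    lock_file.write_text(json.dumps(lock, indent=2))
    return "ran"
```

Calling run_stage twice without touching the dependencies runs the action once and skips it the second time, which mirrors what the second dvc repro does for the preprocess stage.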

Let’s add the remaining stages, i.e., training and evaluate. Updated dvc.yaml:

  stages:
    preprocess:
      cmd: python
      deps:
        - data/
      outs:
        - features/
    training:
      cmd: python
      deps:
        - features/
      params:
        - model.epochs
      outs:
        - model/
    evaluate:
      cmd: python
      deps:
        - model/
      metrics:
        - scores.json:
            cache: false

Apart from the usual cmd, deps and outs, we have a few more entries here, namely params and metrics. We can store the parameters in another file, params.yaml.

  model:
    epochs: 5
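
The params entry model.epochs is a dotted path into params.yaml: the epochs key nested under model. A small sketch of that lookup follows, using a plain dict in place of the parsed YAML; get_param is a hypothetical helper, not part of DVC.

```python
from functools import reduce

# Nested mapping that the dotted path model.epochs points into.
params = {"model": {"epochs": 5}}

def get_param(tree, dotted_key):
    """Walk a nested dict along a dotted path such as 'model.epochs'."""
    return reduce(lambda node, key: node[key], dotted_key.split("."), tree)

epochs = get_param(params, "model.epochs")
print(epochs)
```

A training script can read the value the same way after loading params.yaml, so the number of epochs lives in one place that DVC watches for changes.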

Let’s run the pipeline again

dvc repro
Figure 9: Running the pipeline with all stages

You can see that the preprocess stage is skipped but the other 2 stages are executed.

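It is important to note that when you do a dvc push now, along with the data/ folder, the output folders features/ and model/ are also uploaded to the cloud. scores.json is declared with cache: false, so it stays out of the DVC cache and is small enough to be tracked by Git instead.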

2.4 Metrics

DVC tracks metrics. Specifically, it can track the change in metrics across commits. Let’s change the data (so that the model’s performance changes) and see how it works.

I have added a few more training instances in data/data.json. Let’s re-run the pipeline.

Figure 10: All stages in the pipeline ran and performance improved

We can see that our model has improved.

{'GPE': {'p': 100.0, 'r': 100.0, 'f': 100.0}}

We can look at the difference in performance compared to the last commit.

dvc metrics diff 
Figure 11: Figure shows the changes in F-score and Recall
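
Conceptually, a metrics diff flattens the nested scores of two versions and reports the values that changed. The sketch below illustrates the idea in plain Python and is not how DVC implements it. The new scores are the ones printed above, while the old scores are hypothetical numbers chosen only for illustration.

```python
def flatten(metrics, prefix=""):
    """Turn nested metrics into a flat {'GPE.f': value} style mapping."""
    flat = {}
    for key, value in metrics.items():
        path = f"{prefix}.{key}" if prefix else key
        if isinstance(value, dict):
            flat.update(flatten(value, path))
        else:
            flat[path] = value
    return flat

def metrics_diff(old, new):
    """Return new minus old for every metric present in both that changed."""
    old_flat, new_flat = flatten(old), flatten(new)
    return {
        path: round(new_flat[path] - old_flat[path], 4)
        for path in sorted(new_flat)
        if path in old_flat and new_flat[path] != old_flat[path]
    }

# Hypothetical earlier scores; only new_scores comes from the run above.
old_scores = {"GPE": {"p": 100.0, "r": 50.0, "f": 66.6667}}
new_scores = {"GPE": {"p": 100.0, "r": 100.0, "f": 100.0}}

changes = metrics_diff(old_scores, new_scores)
print(changes)
```

Unchanged metrics, such as precision in this made-up comparison, are left out of the result, so only the values that moved are reported.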
