Data Version Control

Guangzhou, China

Data Version Control is a data versioning, ML workflow automation, and experiment management tool that takes advantage of the existing software engineering toolset you're already familiar with (Git, your IDE, CI/CD, etc.). DVC helps data science and machine learning teams manage large datasets, make projects reproducible, and better collaborate.

Getting Started

To use DVC as a Python library, please install with pip or with conda:

Install with pip

pip install dvc

Initialize an ML Project

Initialize the project by running dvc init inside a Git project:

mkdir my-project
cd my-project
git init
dvc init

Initialized DVC repository.
You can now commit the changes to git.

A few internal files are created that should be added to Git:

git commit -m "Initialize DVC"

[master (root-commit) ed9cc50] Initialize DVC
3 files changed, 6 insertions(+)
create mode 100755 .dvc/.gitignore
create mode 100755 .dvc/config
create mode 100755 .dvcignore

Data Management Trail

  • Data and model versioning is the base layer of DVC for large files, datasets, and machine learning models. Use a standard Git workflow, but without storing large files in the repo. Data is cached by DVC, allowing for efficient sharing.

  • Data and model access covers using data artifacts from outside the project and importing them from another DVC project. This helps, for example, to download a specific version of an ML model to a deployment server or to import a dataset into another project.

  • Data pipelines describe how models and other data artifacts are built, and provide an efficient way to reproduce them. Think "Makefiles for data and ML projects" done right.

  • Metrics, parameters, and plots can be attached to pipelines. These let you capture, evaluate, and visualize ML projects without leaving Git.

Data Versioning

Having initialized a project in the previous section, we can get the data file (which we'll be using later) like this:

dvc get https://github.com/iterative/dataset-registry \
get-started/data.xml -o data/data.xml

To start tracking a file or directory, use dvc add:

dvc add data/data.xml

DVC stores information about the added file in a .dvc file named data/data.xml.dvc. This metadata file is a placeholder for the original data, and can be easily versioned like source code with Git. The data, meanwhile, is listed in .gitignore:

git add data/data.xml.dvc data/.gitignore
git commit -m "Add raw data"
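
To make the placeholder idea concrete, here is a minimal Python sketch of what tracking a file amounts to conceptually. It is an illustration under stated assumptions, not DVC's implementation: the track function, the cache directory layout, and the JSON placeholder are hypothetical (real .dvc files are YAML, and DVC's own cache layout may differ).

```python
import hashlib
import json
import shutil
import tempfile
from pathlib import Path


def track(data_file: Path, cache_dir: Path) -> Path:
    """Hypothetical helper: cache a file by content hash and write a placeholder."""
    digest = hashlib.md5(data_file.read_bytes()).hexdigest()

    # Store a copy of the data under a name derived from its content hash.
    cached = cache_dir / digest[:2] / digest[2:]
    cached.parent.mkdir(parents=True, exist_ok=True)
    shutil.copyfile(data_file, cached)

    # The small placeholder is what Git would version instead of the data.
    placeholder = data_file.parent / (data_file.name + ".dvc")
    placeholder.write_text(json.dumps({"md5": digest, "path": data_file.name}))

    # Keep the large file itself out of Git.
    with (data_file.parent / ".gitignore").open("a") as fh:
        fh.write(f"/{data_file.name}\n")
    return placeholder


# Throwaway workspace so the sketch does not touch a real project.
workspace = Path(tempfile.mkdtemp())
(workspace / "data").mkdir()
data_file = workspace / "data" / "data.xml"
data_file.write_bytes(b"<rows></rows>\n")

placeholder = track(data_file, workspace / "cache")
meta = json.loads(placeholder.read_text())
print(placeholder.name, meta["md5"])
```

The point of the design is that Git only ever sees the tiny placeholder and the .gitignore entry, while the data lives in the cache under its hash.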

Storing and Sharing

Remote

Just as GitHub provides storage hosting for Git repositories, DVC remotes provide a location to store and share data and models. You can pull data assets created by colleagues from DVC remotes without spending time and resources to build or process them locally. Remote storage can also save space in your local environment – DVC can fetch into the cache directory only the data you need for a specific branch or commit.

Using DVC with remote storage is optional. By default, DVC commands use the local cache (usually the .dvc/cache directory) as data storage. This enables the main DVC usage scenarios out of the box.

DVC supports several types of remote storage: local file system, SSH, Amazon S3, Google Cloud Storage, HTTP, HDFS, among others.

ssh

DVC requires both SSH and SFTP access to work with remote SSH locations.

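  • url - remote location, in a regular SSH format. Note that this can already include the user parameter, embedded into the URL:

dvc remote modify myremote url \
ssh://user@example.com:1234/path

  • user - user name to access the remote:

dvc remote modify --local myremote user myuser

  • port - port to access the remote:

dvc remote modify myremote port 2222

  • keyfile - path to the private key to access the remote:

dvc remote modify --local myremote keyfile /path/to/keyfile

  • password - password to access the remote:

dvc remote modify --local myremote password mypassword

  • ask_password - ask for the password to access the remote:

dvc remote modify myremote ask_password true

  • passphrase - passphrase of the private key:

dvc remote modify --local myremote passphrase mypassphrase

  • ask_passphrase - ask for the passphrase of the private key:

dvc remote modify myremote ask_passphrase true

The --local option writes a setting to a Git-ignored config file (.dvc/config.local), which keeps user names, key paths, and secrets out of the repository.

To rename a remote:

dvc remote rename oldremote newremote

To remove a remote:

dvc remote remove oldremote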
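
As a quick check of which parts an SSH URL already carries, the example URL above can be split with Python's standard library. This is only an illustration of the URL format; it is not a claim about how DVC itself parses the setting.

```python
from urllib.parse import urlsplit

# The example URL from the text above.
parts = urlsplit("ssh://user@example.com:1234/path")

print(parts.scheme)    # ssh
print(parts.username)  # user
print(parts.hostname)  # example.com
print(parts.port)      # 1234
print(parts.path)      # /path
```

Because the user and port are embedded in the URL here, setting them separately is only needed when they are not part of the url value.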

local

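While the term "local remote" may seem contradictory, it doesn't have to be. The "local" part refers to the type of location where the storage is: another directory in the same file system. "Remote" is simply the name for storage that a DVC project pushes to and pulls from.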

Add the directory as a remote with dvc remote add. The -d (--default) option also makes it the project's default remote:

dvc remote add -d demo-data /run/media/myuser/files/dvc
Setting 'demo-data' as a default remote.

The project's config file should now look like this:

[core]
remote = demo-data
['remote "demo-data"']
url = /run/media/myuser/files/dvc
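
The config shown above is INI-like, so the default remote and its URL can be looked up with a few lines of Python. This is a read-only sketch using the standard configparser module, which is an assumption made for illustration and not necessarily what DVC uses internally; CONFIG_TEXT simply repeats the file contents from the text.

```python
import configparser

# Contents of the project's config file, copied from the text above.
CONFIG_TEXT = """\
[core]
remote = demo-data
['remote "demo-data"']
url = /run/media/myuser/files/dvc
"""

parser = configparser.ConfigParser()
parser.read_string(CONFIG_TEXT)

# [core] names the default remote ...
default_remote = parser["core"]["remote"]

# ... and the matching section holds its settings. The section name keeps
# its quoting exactly as written in the file.
section = f"'remote \"{default_remote}\"'"
url = parser[section]["url"]

print(default_remote, url)
```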

List all remotes in the project:

dvc remote list
demo-data /run/media/myuser/files/dvc

Now commit the configuration change to Git:

git add .dvc/config
git commit -m "Configure remote storage"

Push & Pull Data

You can upload DVC-tracked data or model files with dvc push, so they're safely stored remotely. This also means they can be retrieved on other environments later with dvc pull:

dvc push
1 file pushed

Once DVC-tracked data and models are stored remotely, they can be downloaded when needed in other copies of this project with dvc pull:

dvc pull
Everything is up to date.
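
Conceptually, push and pull copy whatever the other side is missing. The sketch below models both with one hypothetical sync function over plain directories. It is a simplified mental model, not DVC's implementation: the directory names and the object name ab/cdef are made up, and a real dvc pull also places the files in the workspace, which is omitted here.

```python
import shutil
import tempfile
from pathlib import Path


def sync(src: Path, dst: Path) -> int:
    """Copy objects present in src but missing from dst; return how many were copied."""
    copied = 0
    for obj in sorted(src.rglob("*")):
        if not obj.is_file():
            continue
        target = dst / obj.relative_to(src)
        if target.exists():
            continue  # content-addressed names: same name means same content
        target.parent.mkdir(parents=True, exist_ok=True)
        shutil.copyfile(obj, target)
        copied += 1
    return copied


root = Path(tempfile.mkdtemp())
cache = root / "cache"              # local cache of this project copy
remote = root / "remote"            # stands in for remote storage
other_cache = root / "other_cache"  # cache of another copy of the project

(cache / "ab").mkdir(parents=True)
(cache / "ab" / "cdef").write_bytes(b"dataset contents")

pushed = sync(cache, remote)        # "push": local cache -> remote
pushed_again = sync(cache, remote)  # nothing new to upload
pulled = sync(remote, other_cache)  # "pull" in the other project copy
print(pushed, pushed_again, pulled)
```

The second push copies nothing, which mirrors the "Everything is up to date." message above.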

After updating your local dataset, run the following to record and upload the new version:

dvc add data/data.xml
git commit data/data.xml.dvc -m "Dataset updates"
dvc push

To switch between dataset versions, for example to go back to the previous one:

git checkout HEAD~1 data/data.xml.dvc
dvc checkout
git commit data/data.xml.dvc -m "Revert dataset updates"
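
The two-step pattern above (Git restores the placeholder, then DVC makes the data match it) can be sketched as follows. This is a toy model under stated assumptions: store, checkout, the flat cache directory, and the history list standing in for Git commits are all hypothetical.

```python
import hashlib
import tempfile
from pathlib import Path


def store(cache: Path, content: bytes) -> str:
    """Hypothetical helper: save content in the cache under its hash."""
    digest = hashlib.md5(content).hexdigest()
    (cache / digest).write_bytes(content)
    return digest


def checkout(cache: Path, digest: str, target: Path) -> None:
    """Make the workspace file match the version the placeholder names."""
    target.write_bytes((cache / digest).read_bytes())


root = Path(tempfile.mkdtemp())
cache = root / "cache"
cache.mkdir()
data_file = root / "data.xml"

v1 = store(cache, b"<rows>old</rows>")
v2 = store(cache, b"<rows>new</rows>")

# Hashes recorded by the placeholder over time, oldest first.
# In a real project this history lives in Git commits.
history = [v1, v2]

checkout(cache, history[-1], data_file)  # current version
latest = data_file.read_bytes()

checkout(cache, history[-2], data_file)  # like restoring the HEAD~1 placeholder
previous = data_file.read_bytes()

print(latest, previous)
```

Only the small hash changes between versions in this model; the data for both versions stays in the cache, so switching is a lookup and a copy.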

DVC is technically not a version control system. Git itself provides that layer. DVC in turn manipulates .dvc files, whose contents define the data file versions.