DVC Model Access
We've learned how to track data and models with DVC, and how to commit their versions to Git. The next questions are: How can we use these artifacts outside of the project? How do we download a model to deploy it? How do we download a specific version of a model? How do we reuse datasets across different projects?
Data and Model Access
DVC's remote storage config is also saved in Git, and contains all the information needed to access and download any version of datasets, files, and models. This means that a Git repository with DVC files becomes an entry point: you can go through the repository instead of accessing the files in remote storage directly.
Find a File or Directory
You can use dvc list to explore a DVC repository hosted on any Git server:
dvc list https://github.com/mpolinowski/dvc-demo-project.git
.dvcignore
data
The benefit of this command over browsing a Git hosting website is that the list includes files and directories tracked by both Git and DVC.
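To narrow the listing down, dvc list accepts a few options: -R recurses into directories, --dvc-only hides files tracked by Git alone, and --rev selects a Git revision. The sketch below wraps these in a small function. The function name list_tracked and the DVC_BIN variable are assumptions made for this example and are not part of DVC.

```shell
# Hypothetical helper, not part of DVC itself.
# DVC_BIN is an assumed variable; it defaults to the dvc executable on PATH.
DVC_BIN="${DVC_BIN:-dvc}"

# List only the DVC-tracked files of a repository, descending into directories.
#   $1: Git URL of the repository
#   $2: optional Git revision (branch, tag, or commit hash)
list_tracked() {
    if [ -n "${2:-}" ]; then
        "$DVC_BIN" list --dvc-only -R --rev "$2" "$1"
    else
        "$DVC_BIN" list --dvc-only -R "$1"
    fi
}
```

Calling list_tracked with https://github.com/mpolinowski/dvc-demo-project.git as the first argument should then show just the data that lives in remote storage rather than in Git.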
Download
One way to access the data is to simply download it with dvc get. This is useful when working outside of a DVC project environment, for example in an automated ML model deployment task:
dvc get https://github.com/mpolinowski/dvc-demo-project \
data
And now to the magic part: although the Git repository only contains the .dvc file that points to our data, the dvc get command we used above automatically downloaded the data from remote storage, in the version that was committed to Git:
ls -la data
256 Jan 5 17:25 .
232 Jan 5 17:25 ..
14445097 Jan 5 17:25 data.xml
10 Jan 5 17:25 .gitignore
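The same command can also answer the earlier question about downloading a specific version: dvc get takes --rev to pick a Git revision and -o to choose the local output path. The following is a minimal sketch for a deployment script. The function name get_version and the DVC_BIN variable are assumptions introduced here, and the revision you pass must be a branch, tag, or commit that actually exists in your repository.

```shell
# Hypothetical helper, not part of DVC itself.
# DVC_BIN is an assumed variable; it defaults to the dvc executable on PATH.
DVC_BIN="${DVC_BIN:-dvc}"

# Download a tracked file or directory as it was at a given Git revision.
#   $1: Git URL of the repository
#   $2: path inside the repository (for example: data)
#   $3: Git revision (branch, tag, or commit hash)
#   $4: local output path
get_version() {
    "$DVC_BIN" get --rev "$3" -o "$4" "$1" "$2"
}
```

Writing each revision to its own output path keeps several versions of the same dataset side by side, which can help when comparing them.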
Data Pipelines
WiP