DVC do not cache output of pipeline properly #10549
Comments
Can you try removing hardlink and symlink from
It'd be great if you could debug, and see why it's not being reflinked by adding a breakpoint here in dvc-objects. You can also try with the following snippets and see if they are getting reflinked:
from dvc.fs import LocalFileSystem
fs = LocalFileSystem()
fs.reflink("existing-file", "cloned-file")
@skshetry I have almost found the reason. When multiple pipelines create the same output, only the first one got
Here is a screenshot in which 5 pipelines operate on a one-image dataset. And I use
I was able to create a minimal reproduction project.
The reason why the first file in the provided screenshot is
I think this is due to a relink optimization that I did recently for
DVC looks at the file in the workspace and tries to determine whether it needs to relink based on cache types. So, for example, if a file is not a symlink, and you have
But DVC does not have a way to determine whether a file should be reflinked or not. So it leaves it as-is in the workspace, which saves us from doing a checkout, which can be expensive. If you are worried about storage, I think
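To illustrate the point about detection, here is a minimal sketch using only the Python standard library. It is not DVC's actual code; the names detect_link_type and classify_demo are hypothetical. It shows that a symlink and a hardlink can be recognized from file metadata, while a reflinked file and a plain copy look identical at that level (each has its own inode with a link count of 1).

```python
import os
import shutil
import tempfile


def detect_link_type(path):
    """Classify a file from lstat-level metadata only (hypothetical helper)."""
    if os.path.islink(path):
        return "symlink"
    if os.lstat(path).st_nlink > 1:
        return "hardlink"
    # A reflinked file and an independent copy both end up here: stat
    # metadata does not reveal whether data extents are shared.
    return "copy-or-reflink"


def classify_demo():
    """Create one file of each kind in a temp dir and classify them."""
    with tempfile.TemporaryDirectory() as tmp:
        cached = os.path.join(tmp, "cached")
        with open(cached, "w") as f:
            f.write("data")

        sym = os.path.join(tmp, "sym")
        hard = os.path.join(tmp, "hard")
        copy = os.path.join(tmp, "copy")
        os.symlink(cached, sym)
        os.link(cached, hard)
        shutil.copyfile(cached, copy)

        return {
            "sym": detect_link_type(sym),
            "hard": detect_link_type(hard),
            "copy": detect_link_type(copy),
        }


if __name__ == "__main__":
    print(classify_demo())
```

Telling a reflink apart from a copy would need a filesystem-level query about shared extents, which is beyond what this sketch attempts.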
I have a solution. When
Besides, I don't think this issue can be ignored. Even when multiple pipelines do not generate the same output, if a user updates an existing pipeline to generate a new output in which most of the files are the same as those in the cache, all those files will be duplicated in the cache and the workspace.
I may be open to some config to force-relink. Any thoughts @dberenbaum, @shcheklein?
Just to clarify and better understand things first, folks, a few questions:
Do we know how it does this? Is it FS-specific, or is there a general syscall that can do this? Is it expensive or not? @skshetry if we had a call
Could you clarify a bit: is it expensive because we would do a full output checkout (all files), since we can't detect the difference? We still traverse and check the link type, right? Would it be the same or less expensive, in the case of reflinks specifically, to force a relink right away without doing those checks?
FYI, https://github.com/tytso/e2fsprogs/blob/950a0d69c82b585aba30118f01bf80151deffe8c/misc/filefrag.c#L269, this line is where the
Bug Report
repro: doesn't cache output properly with reflink setup.
Description
I have 4 pipelines that transform the same input dataset for different tasks. The images were processed the same way, and cache.type was set to reflink. So, according to the documentation, there should be only one copy of the output images. But this is not the case: none of the pipeline outputs were reflinked to the cached files.
If I run dvc checkout -R --reflink after the pipelines have been executed, the disk usage behaves normally.
The output of btrfs fi du -s . right after repro:
     Total   Exclusive  Set shared  Filename
  90.50GiB    31.21GiB    29.64GiB  .
The output of btrfs fi du -s . right after dvc checkout -R --reflink:
     Total   Exclusive  Set shared  Filename
  90.50GiB     1.07GiB    29.83GiB  .
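To make the comparison concrete, the two summaries can be parsed and the drop in exclusive (unshared) usage computed. The following is a minimal sketch, assuming the summary is a header line followed by one data line as quoted above; parse_size and parse_du_summary are hypothetical helper names, not part of DVC or btrfs-progs.

```python
UNITS = {"B": 1, "KiB": 1024, "MiB": 1024**2, "GiB": 1024**3, "TiB": 1024**4}


def parse_size(text):
    """Convert a size such as '31.21GiB' to bytes."""
    # Check longer suffixes first, since 'GiB' also ends with 'B'.
    for unit in sorted(UNITS, key=len, reverse=True):
        if text.endswith(unit):
            return float(text[: -len(unit)]) * UNITS[unit]
    raise ValueError(f"unrecognized size: {text!r}")


def parse_du_summary(output):
    """Parse the data line of a 'btrfs fi du -s' summary (assumed layout)."""
    lines = [line for line in output.strip().splitlines() if line.strip()]
    total, exclusive, shared, filename = lines[-1].split(None, 3)
    return {
        "total": parse_size(total),
        "exclusive": parse_size(exclusive),
        "set_shared": parse_size(shared),
        "filename": filename,
    }


after_repro = """
     Total   Exclusive  Set shared  Filename
  90.50GiB    31.21GiB    29.64GiB  .
"""
after_checkout = """
     Total   Exclusive  Set shared  Filename
  90.50GiB     1.07GiB    29.83GiB  .
"""

before = parse_du_summary(after_repro)
after = parse_du_summary(after_checkout)
reclaimed_gib = (before["exclusive"] - after["exclusive"]) / UNITS["GiB"]

if __name__ == "__main__":
    print(f"Exclusive usage dropped by {reclaimed_gib:.2f} GiB")
```

With the figures reported above, exclusive usage falls from 31.21GiB to 1.07GiB, a difference of about 30.14 GiB, while the total stays at 90.50GiB.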
Reproduce
Expected
Environment information
Output of dvc doctor:
Additional Information (if any):