Taskvine: cachenames #3075

BarrySlyDelgado · 2022-12-16T17:49:10Z

BarrySlyDelgado
Dec 16, 2022
Collaborator

We want the mapping from a file to its cachename to be injective: two different files must never receive the same cachename. This ensures the files cached at the worker are exactly the files we need.

Each file type needs a different strategy for generating cachenames. Because files can have the same name across namespaces, the filename alone is not adequate for generating a cachename.

Preferably, cachenames would always be generated from data relevant to the contents of the file. However, that data is not always available. The following discusses methods for generating cachenames for each file type:

VINE_BUFFER - With buffers, the content is directly available to us, so we can apply an adequate hash function to the contents of the buffer.
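As a rough illustration of the buffer case, here is a minimal Python sketch. The choice of SHA-256, the function name `buffer_cachename`, and the `buffer-` prefix are all assumptions made for the example, not TaskVine's actual naming scheme.

```python
import hashlib

def buffer_cachename(data: bytes) -> str:
    """Derive a cachename from the buffer contents alone (hypothetical helper)."""
    # SHA-256 stands in for "an adequate hash function"; the prefix is illustrative.
    return "buffer-" + hashlib.sha256(data).hexdigest()
```

Two buffers with identical contents get the same name, and any change to the contents changes the name.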

VINE_FILE - With local files, assuming we have permission to read the given file, the contents of the file are available for us to hash. However, because file sizes vary and many files may need to be hashed, hashing can add an unwanted amount of overhead. Different hashing methods trade off overhead differently, so a cheaper method may be preferable. Whatever method is chosen, it is important that it is consistent and adequately avoids conflicts. For directories, which are a subset of the VINE_FILE classification, it is important that the directory is hashed from its contents. This can be done with a variation of a Merkle tree: the hash of a directory is a hash of the hashes of the files within the directory, applied recursively to subdirectories.
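The Merkle-style directory hash described above could look roughly like the following Python sketch. The names `hash_file` and `hash_path` are hypothetical, SHA-256 is an assumed hash choice, and mixing each entry's name into the directory hash is an extra assumption of this sketch (so that renaming a file changes the result); the post itself only specifies hashing the child hashes.

```python
import hashlib
import os

def hash_file(path, chunk_size=1 << 20):
    """Hash a regular file in chunks so large files are never fully in memory."""
    h = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            h.update(chunk)
    return h.hexdigest()

def hash_path(path):
    """Hash a file by its contents, or a directory by the hashes of its entries."""
    if os.path.isdir(path):
        h = hashlib.sha256()
        # Sort entries so the result does not depend on listing order.
        for name in sorted(os.listdir(path)):
            child = hash_path(os.path.join(path, name))
            # Assumption: the entry name is part of the directory's identity.
            h.update(os.fsencode(name) + b"\0" + child.encode() + b"\n")
        return h.hexdigest()
    return hash_file(path)
```

This sketch follows symbolic links and ignores permissions and timestamps; whether those should affect the cachename is a separate design question.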

VINE_EMPTY_DIR - Are there cases where an empty directory needs to be unique?

VINE_URL - With files possibly hosted on remote machines, we generally don't have access to the contents unless we transfer the entire file to the manager, which is somewhat antithetical to the use case for VINE_URLs. Here, our general strategy is to retrieve only the HTTP header for the file from the server. Some fields in the header can give us insight into the identity of the file.
Once the header is retrieved, fields such as Content-MD5, ETag, and Last-Modified can be used to generate the cachename. The following are details on each header field used:

Content-MD5 - This is an MD5 digest of the entity body. This field could be generated by an origin server or a client.
ETag - An ETag, or entity tag, is an "opaque" cache validator, typically used to detect changes to a given resource. There is no specification for how a server generates an ETag. It could be a hash of the content, but this is not always the case. ETags that begin with W/ indicate that a weak validator was used to generate the ETag.
Last-Modified - This is the date and time the resource was last changed on the server.

We then rank the header fields in the order they appear above. The reasoning is as follows: with MD5 hashes, we can determine that two files with the same hash are the same. For ETags, we can be confident that two files are the same only if they are from the same server. The same holds for Last-Modified timestamps, but an extra piece of information is needed to generate the cachename, as two files can have identical Last-Modified dates. Currently, this is the URL of the given file in addition to the server from which the file is to be retrieved. For each header retrieved, we opt for the field that is highest in the hierarchy when present. The pieces of information needed for the chosen field are then combined to generate the hash.
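The field hierarchy above can be sketched in Python as follows. This is only an illustration under stated assumptions: `url_cachename` is a hypothetical function, `headers` is assumed to be a mapping of response header names to values already obtained from a HEAD request, SHA-256 is an assumed way of combining the pieces, and returning `None` when no usable field is present is one possible policy.

```python
import hashlib
from urllib.parse import urlsplit

def url_cachename(url, headers):
    """Pick the highest-ranked header field present and hash it with its context."""
    h = {k.lower(): v for k, v in headers.items()}  # header names are case-insensitive
    server = urlsplit(url).netloc
    if "content-md5" in h:
        # A content digest identifies the file regardless of where it is hosted.
        parts = ["md5", h["content-md5"]]
    elif "etag" in h:
        # An ETag is only meaningful relative to the server that issued it.
        parts = ["etag", server, h["etag"]]
    elif "last-modified" in h:
        # A timestamp alone is ambiguous, so the full URL (server included) is mixed in.
        parts = ["last-modified", url, h["last-modified"]]
    else:
        return None  # assumption: no usable field means no cachename
    return "url-" + hashlib.sha256("\n".join(parts).encode()).hexdigest()
```

For example, two URLs on different servers that report the same Content-MD5 map to the same cachename, while the same ETag value reported by two different servers maps to two different cachenames.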

VINE_MINITASK - A minitask file is the file that results from executing a given command on the worker. At times these commands have their own file dependencies, which have their own cachenames. One possibility is to generate the cachename for the minitask from the cachenames of the files that the minitask depends on. However, certain commands have a level of dynamism such that we cannot use this method to adequately predict the identity of the resulting file. Another possibility is to let the user decide whether a command can have a cachename. However, what happens if they are wrong?

When a cachename cannot be generated - There is an argument that if a cachename cannot be generated from the information at hand, the file should not be cached. That is, generating a cachename anyway could lead to conflicts on the worker side.

dthain · 2022-12-16T19:20:18Z

dthain
Dec 16, 2022
Maintainer

This is on the right track -- it helps to work out one's intentions first, before writing code.

Re VINE_MINITASK, think of it this way. If you can express a task as a little document:

{
cmd = "./simulate.exe -p"
inputs = { name: "input.txt", source: "hash-xyz123" }
outputs = { name: "output.txt" }
env = "PATH=y"
}

Then you can make a unique name for that task by describing it as hash(document).

So, if you can do that, can you now make a unique name for the output file, without knowing its contents?
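One way to read this suggestion, as a minimal Python sketch: serialize the task document canonically, hash it, and derive each output's cachename from the task hash plus the output's name. The functions `task_id` and `output_cachename`, the JSON encoding, the SHA-256 hash, and the `task-` prefix are assumptions for illustration only.

```python
import hashlib
import json

def task_id(task):
    """Hash a canonical serialization of the task document."""
    # sort_keys makes the result independent of dictionary key order.
    canonical = json.dumps(task, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode()).hexdigest()

def output_cachename(task, output_name):
    """Name an output by the task that produces it, without reading its contents."""
    digest = hashlib.sha256(f"{task_id(task)}:{output_name}".encode()).hexdigest()
    return "task-" + digest

# The example document from above, written as a Python dictionary.
task = {
    "cmd": "./simulate.exe -p",
    "inputs": [{"name": "input.txt", "source": "hash-xyz123"}],
    "outputs": [{"name": "output.txt"}],
    "env": "PATH=y",
}
print(output_cachename(task, "output.txt"))
```

The name is only as trustworthy as the assumption that the command is deterministic given its inputs and environment, which is the concern raised for VINE_MINITASK in the original post.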

1 reply

BarrySlyDelgado Dec 16, 2022
Collaborator Author

Yes, that makes sense. I'll work on that implementation.
