-
Notifications
You must be signed in to change notification settings - Fork 91
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
chore(pip): avoid response buffering #873
base: main
Are you sure you want to change the base?
chore(pip): avoid response buffering #873
Conversation
@microsoft-github-policy-service agree company="Legit Security" |
var entry = package.GetEntry($"{name.Replace('-', '_')}-{version}.dist-info/METADATA"); | ||
// Store the .whl file on the disk temporarily | ||
var tempFilePath = Path.GetTempFileName(); | ||
await using (var fileStreamWrite = |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is this needed? I know these are temp files, but depending on how big the file can be, couldn't writing to disk cause errors for systems that are particularly low on disk space?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We encountered the opposite problem very often. We had enough disk space, but got OOM when buffering (and caching) large .whl files into memory in the HTTP SendAsync
part.
Perhaps we can make this behavior configurable, so that consumers will be able to select whether to use disk storage for the downloads?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Maybe we can start by adding some telemetry on the size of some wheels. I am a bit worried that we would eat up all the disk space downloading the wheels to disk, as technically we'd download them twice in a given CI build (pip install
+ running CD).
A slider or a configurable limit on either:
- The size of the given wheel (only write wheels smaller than x bytes to disk)
- Total disk space used downloading wheels
sounds like the best options.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
How about we have 2 options:
- Store wheels in memory (Default)
- Store wheels on the disk
Would that be fine?
It would leave current behavior unchanged, but give an option to use disk space instead of process memory for those who want it. A more fine-grained control like you suggested can be added at a later point if necessary.
We can set this with an environment variable (or maybe even an Options<T>
)
|
||
if (!response.IsSuccessStatusCode) | ||
return (await this.cachedDependencies.GetOrCreateAsync(key, FetchDependencies)).ToList(); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nit: .ToList might be redundant since the task result is already a list and it eventually gets converted to dictionary in pythonresolver.
{ | ||
dependencies.Add(new PipDependencySpecification(deps, true)); | ||
} | ||
File.Delete(tempFilePath); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
what if this fails?
The detector can run in parallel and potentially cause lockups.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This code uses a different temporary path every time (GetTempFileName
), so there won't be issues
Thanks for the contribution! Would you happen to have some memory benchmarks on this PR? Curious to see the improvement. |
Hi @melotic, sorry for not replying earlier. I ran memory profilers and saw that the peak memory does not increase with this implementation. I'm not sure a classic benchmark would be useful here, as this PR addresses the peak memory usage, but I can demonstrate this manually. I can add code that runs GC and logs the memory in-use immediately after. Would that be OK? |
PyPi dependencies contain large .whl files that are buffered in full into memory, only to resolve a dependency tree. This can lead to OOM errors in some environments. There are popular packages that are hundreds of megabytes in size.
In this change, the response is streamed to a temporary file, and the temporary file content is streamed to the
ZipArchive
, reducing the peak memory. Caching is also now done at a cache-key level, rather than an HTTP response level, as caching HTTP response streams is not possible.I'm not sure whether the dependencies cache should also be limited by the same env-var that is used for the HTTP responses cache, as the dependencies cache is much smaller. In this PR the cache size is the same in both.
This solves #731