Avoid excessive heap utilisation due to in memory creation of md5s #222
Thanks for the contribution! I am a little puzzled, though, as to why this was a problem, even with 100k files. Assume 100k files times (random guess) 100 chars per line: that's 10,000,000 chars, which is probably around 20MB of RAM. Is that already what you meant by excessive? How much memory usage did you see? I am just wondering if this really was the problem.
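For reference, the back-of-the-envelope estimate above can be written out as a small sketch. The class and method names are made up for illustration, and the 2 bytes per char figure assumes the character data is held as UTF-16 chars, as the estimate does.

```java
public class HeapEstimate {

    // Rough number of bytes retained by the character data alone:
    // one line per file, a fixed number of chars per line, a fixed cost per char.
    static long estimateBytes(long fileCount, long charsPerLine, int bytesPerChar) {
        return fileCount * charsPerLine * bytesPerChar;
    }

    public static void main(String[] args) {
        // Guessed figures from the comment: 100k files, 100 chars per line,
        // 2 bytes per char (assumption: UTF-16 storage).
        long bytes = estimateBytes(100_000, 100, 2);
        System.out.println(bytes + " bytes, about " + bytes / 1_000_000 + " MB");
    }
}
```

This only counts the characters themselves; any additional copies made while converting to bytes or writing the entry come on top of it.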
I'll set a breakpoint to capture the length, take a heap dump, and tell you exactly what the utilisation is. It is certainly in the hundreds of megabytes, due to the length of the paths.
Awesome, thanks! Hundreds of megabytes? That sounds quite fishy.
The final md5sums file is 33M. The StringBuilder will retain double that of course, thanks to Java's 2-byte representation of strings:
And two more copies again of the same bytes:
So a little over 220MB, I guess. Background is here, incidentally (we've got our fair share of heap issues in our Gradle plugin!):
Thanks for digging into this. I guess the two copies are where the problem becomes excessive. I am wondering if we could dial back the crazy by getting rid of those copies. On the other hand, building everything in memory will always hurt scalability. I am not so eager to use temp files, but at first glance the PR looks reasonable. I need to poke around a bit more, but I am inclined to accept it. Thanks for your work! (I so need to get started on jdeb2)
In an application with more than 100K files, we ran into problems while generating the checksums. This change writes the checksums to a file and then streams that file to the output stream, avoiding the heap utilisation during that phase.
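A minimal sketch of the general idea, not the code in this PR: each checksum line is appended to a temporary file as soon as it is computed, and the finished file is then copied to the target stream. The class name `StreamingMd5Sums`, its method names, and the `hash  path` line format are assumptions made for illustration.

```java
import java.io.BufferedWriter;
import java.io.IOException;
import java.io.InputStream;
import java.io.OutputStream;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.security.DigestInputStream;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;
import java.util.List;

public class StreamingMd5Sums {

    // Computes the MD5 of a file by streaming it through a fixed-size buffer.
    static String md5Hex(Path file) throws IOException, NoSuchAlgorithmException {
        MessageDigest md = MessageDigest.getInstance("MD5");
        try (InputStream in = new DigestInputStream(Files.newInputStream(file), md)) {
            byte[] buffer = new byte[8192];
            while (in.read(buffer) != -1) {
                // Reading is enough; the DigestInputStream updates the digest.
            }
        }
        StringBuilder hex = new StringBuilder(32);
        for (byte b : md.digest()) {
            hex.append(String.format("%02x", b));
        }
        return hex.toString();
    }

    // Writes one "hash  path" line per file to a temp file, then streams
    // that file to 'out', so the full listing is never held on the heap.
    static void writeMd5Sums(List<Path> files, OutputStream out)
            throws IOException, NoSuchAlgorithmException {
        Path temp = Files.createTempFile("md5sums", ".tmp");
        try {
            try (BufferedWriter writer = Files.newBufferedWriter(temp, StandardCharsets.UTF_8)) {
                for (Path file : files) {
                    writer.write(md5Hex(file) + "  " + file + "\n");
                }
            }
            // Files.copy transfers the file contents using a small internal buffer.
            Files.copy(temp, out);
        } finally {
            Files.deleteIfExists(temp);
        }
    }
}
```

With this shape, heap usage during the checksum phase stays roughly constant regardless of the number of files, at the cost of temporary disk space about the size of the final md5sums file.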