Improving handling of large files in k6 #2974
Comments
While thinking about #2975 (comment), I realized that I probably disagree with something here. I think the first points of the "Must-have" and "Nice to have" sections are somewhat flipped and need to be exchanged 😅 That is, #2975 is probably nice to have, while #2977 seems like a must.

What will happen if we only implement #2975? We'll have a very efficient archive bundle format (.tar or otherwise) that doesn't load everything into memory when k6 executes it. That would be awesome! It would mean users would be able to potentially cram huge static files into these archives. However, they would have no way to actually use these files in their scripts besides using open().

Whereas, if we don't touch .tar archives and still load their contents fully in memory, but we stop copying all of the file contents everywhere, and we add a way to open files without fully reading them into memory, users will still be able to work with them somewhat efficiently. Loading 500 MB in memory is not great, but as long as it happens only once, it's fairly tolerable, and fixing it becomes "nice to have", not a must.
TL;DR: The tar improvements are less obviously immediately valuable.

I agree with that, and agree that we should prioritize #2977 over it 👍🏻 I think parts of the idea still stand, though. Having done some research on this in the past, already having a more transparent API around files would help tremendously, even gaining granularity in how data is handled in scripts.

As of today, users don't have much choice. It's all or nothing: load all the data in memory, or nothing. Whereas with a File API, one could also open the files once and read their content whenever they need it, just in time (modern OSes all have some flavor of a buffer cache, which caches the content of read syscalls, so that when you read a file N times in a row, all the reads past the first one are served from this cache and are much, MUCH quicker). This also has the potential to improve memory usage.
lgtm.
I have a need to load test an API that takes in PDFs. I was hoping that we could utilize SharedArray to share blobs (ArrayBuffer) across VUs. We are trying to achieve load testing with a wide range of PDF sizes, up to 50MB, and we need to simulate 2000 VUs. Simple math shows that the feasibility is not in our favor if every VU needs to hold a 50MB blob in memory. Having a way to share blobs across VUs, or having streaming support, would make this more feasible. I personally think a streaming option in the k6/http functions would be the most flexible and scalable; it would be a similar pattern to the one seen in many code bases.

We just started using k6 a few months ago to load test our mission-critical services where I'm employed, and so far it's been a really good experience. I think k6 would benefit greatly and could really expand its capabilities if it can crack handling of large sets of unstructured data, like PDFs, JPEGs, etc.

Below is our POC to test SharedArray with ArrayBuffer. It currently doesn't work. Note: we are using TypeScript and transpiling with webpack using ts-loader, not that it should make a difference, I don't think. Also, if I am doing something wrong with this POC, please leave a comment. There could be others thinking of the same approach as we did.
import { SharedArray } from "k6/data";

// Path to one of our sample PDFs (placeholder value for this POC).
const path = "./sample.pdf";

const data: ArrayBuffer[] = new SharedArray<ArrayBuffer>("pdfs", function (): ArrayBuffer[] {
    // open(path, 'b') reads the whole file into an ArrayBuffer in the init context.
    const files: ArrayBuffer[] = [open(path, "b")];
    console.info("Bytes in LoadTestFiles: " + files[0].byteLength);
    return files;
});

// Start our k6 test
export default (): void => {
    console.info("Number of items in data: " + data.length);
    // This is where it breaks down: SharedArray only supports JSON-serializable
    // values, so the ArrayBuffer does not survive the round trip to the VUs.
    console.info("Bytes in VU: " + data[0].byteLength);
};
Hi @jade-lucas, thanks a lot for your constructive feedback and your concrete use case. We are currently experimenting with this topic, and we expect the first improvements to land in k6 in the not-so-distant future (no ETA yet). We have prioritized #2977 and have #2978 on our radar. We expect #2977 might help with your issue handling PDF files. Streaming in HTTP is, unfortunately, further down the road, as it is expected to potentially be part of the next iteration of the HTTP API.
I have run into the same issue as jade-lucas. We need to load-test an API on a large scale with binary file uploads. Having little knowledge of JavaScript buffers, I first couldn't understand why. As mentioned, being able to share the contents of binary files between VUs and stream their contents over HTTP would be awesome and would greatly aid our use case. Anyway, our experience with k6 has been amazing except for this one hurdle. Thanks for the great OSS 🙌🏻
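For context, a minimal sketch of the approach possible today, which is exactly what makes this hard at scale: the whole file is read into memory with open() in the init context, and since every VU runs the init code, every VU ends up holding its own copy of the blob. The URL and file name below are placeholders.

import http from "k6/http";

// Read the whole PDF into an ArrayBuffer in the init context.
// Each VU runs the init code, so each VU keeps its own in-memory copy.
const pdf: ArrayBuffer = open("./sample.pdf", "b");

export default (): void => {
    // Non-streaming multipart upload: the full blob travels through memory.
    http.post("https://example.com/upload", {
        file: http.file(pdf, "sample.pdf", "application/pdf"),
    });
};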
Quick update on this:
Any docs on this? I can't find a description at https://k6.io/docs/javascript-api/k6-experimental/
Hi @nk-tedo-001 👋🏻 Our docs have recently migrated to Grafana's; you can find more information about it there: https://grafana.com/docs/k6/latest/javascript-api/k6-experimental/fs/ 🙇🏻
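For readers landing here, a minimal sketch of what usage of the experimental fs module looks like, based on the linked documentation; the file name is a placeholder, and the exact API may evolve while the module is experimental. The file is opened once per VU, and its content is read in small chunks on demand instead of being loaded into memory all at once.

import { open } from "k6/experimental/fs";

// The fs module's open() is asynchronous; the init context does not support
// top-level await, hence the wrapping async function.
let file;
(async function () {
    file = await open("./large-dataset.bin"); // placeholder path
})();

export default async function (): Promise<void> {
    const buffer = new Uint8Array(4096);
    let totalBytesRead = 0;
    while (true) {
        // read() fills the buffer and resolves to the number of bytes read,
        // or null once the end of the file is reached.
        const bytesRead = await file.read(buffer);
        if (bytesRead === null) {
            break;
        }
        totalBytesRead += bytesRead;
    }
    console.info("read " + totalBytesRead + " bytes");
}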
Great job with the new fs module!
Thank you 🙇🏻 I'm glad it was helpful 🎉
Story
Problem Statement
Handling large files in k6, whether binary or structured formats such as CSV, leads to high memory usage. As a result, our users' experience, especially in the cloud, is degraded as soon as they need to handle large data sets, such as lists of user IDs.
This issue is experienced by our users in various situations and is caused by many design and implementation decisions in the current state of the k6 open-source tool.
Objectives
Product-oriented
The product-oriented objective of this story, and its definition of success, consists in landing a new streaming CSV parser in k6, allowing users to parse and use big CSV files (> 500MB) that wouldn't fit in memory and would otherwise likely cause their k6 scripts to crash. We are keen to provide any technical improvements to k6 that make this possible along the way.
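As a rough illustration of the developer experience this objective aims for, here is a hypothetical sketch of streaming CSV consumption; module and method names are assumptions modeled on k6's experimental fs and csv modules, and the actual API may differ.

import { open } from "k6/experimental/fs";
import csv from "k6/experimental/csv";

// Hypothetical usage: the parser reads the file incrementally instead of
// materializing the whole (> 500MB) CSV in memory.
let file;
let parser;
(async function () {
    file = await open("./users.csv"); // placeholder path
    parser = new csv.Parser(file);
})();

export default async function (): Promise<void> {
    // Each call to next() yields one record (an array of column values).
    const { done, value } = await parser.next();
    if (done) {
        return;
    }
    console.info("user id: " + value[0]);
}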
Technology-oriented
From a technological standpoint, the objective of this story is to provide all the design and technical changes necessary to complete the product story. Our primary objective should be to let users use large data set files in their k6 scripts without running into out-of-memory errors. As we pursue this goal, we aim to pick the solutions with the smallest overhead, and we are keen to take on technical debt if necessary.
Resolution
Through internal workshops with @sniku and @mstoykov, we surfaced various topics and issues that must be addressed to fulfill the objective.
Must-have
The bare minimum of must-have items needed to even start tackling the top-level product objective would be:
Nice to have
While we're at it, another set of features and refactors would be beneficial to the larger story of handling large files in k6:
The open() method of k6 is somewhat misnamed and performs a readFile() operation. This is also a result of k6 archiving users' content in a single tar archive and having to access resources through it. With more efficient and flexible access to k6's tar archive's content, we believe that k6 would also benefit from having a more "standard" file API to open, read and seek through files more conveniently. This would help support streaming use cases by providing more flexible navigation in a file's content, while also helping benefit from OS-based optimizations like the buffer cache (see the sketch below).
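A minimal sketch of what such a "standard" file API could enable, expressed here with the experimental fs module's open/seek/read primitives (names per its documentation; the path and offsets are placeholders): rather than reading the whole file as today's open() does, a script can jump to an offset and read only the bytes it needs.

import { open, SeekMode } from "k6/experimental/fs";

let file;
(async function () {
    file = await open("./large-data.bin"); // placeholder path
})();

export default async function (): Promise<void> {
    // Jump to byte offset 4096 and read a single 1 KiB chunk, instead of
    // loading the entire file into memory up front.
    await file.seek(4096, SeekMode.Start);
    const chunk = new Uint8Array(1024);
    const bytesRead = await file.read(chunk);
    console.info("read " + (bytesRead ?? 0) + " bytes at offset 4096");
}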
Problem Space
This issue approaches the problem at hand with a pragmatic, product-oriented objective. However, this specific set of issues has already been approached from various angles in the past and is connected to longer-term plans, as demonstrated in this list: