Detailed Results of the Benchmark
This project aims to create a DB-like benchmark for feature generation (or feature aggregation) tasks, specifically the task of generating ML features from time-series data. In other words, it is a benchmark of ETL tools on the task of generating a single partition of a Feature Store.
See a detailed description on the companion website.
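To make the task concrete, here is a minimal sketch of the kind of feature aggregation being benchmarked, written in Pandas. The schema (`user_id`, `ts`, `amount`) and the feature set are hypothetical, invented only for illustration; they are not the schema produced by this repository's generator.

```python
import pandas as pd

# Hypothetical raw time-series events: one row per event.
events = pd.DataFrame({
    "user_id": [1, 1, 2, 2, 2],
    "ts": pd.to_datetime([
        "2024-01-01", "2024-01-03", "2024-01-01", "2024-01-02", "2024-01-05",
    ]),
    "amount": [10.0, 25.0, 5.0, 7.5, 40.0],
})

# Aggregate events into per-entity ML features -- one row per entity,
# which is what a single partition of a Feature Store looks like.
features = (
    events.groupby("user_id")
    .agg(
        txn_count=("amount", "count"),
        amount_sum=("amount", "sum"),
        amount_mean=("amount", "mean"),
        last_event=("ts", "max"),
    )
    .reset_index()
)
print(features)
```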
To build the data generator (Linux):

```bash
# Build a wheel
maturin build --release

# Create and activate a virtual environment (python3.11 is required)
python3 -m venv .venv
source .venv/bin/activate

# Install the wheel (choose the one built for your system)
pip install target/wheels/data_generation-0.1.0-cp311-cp311-manylinux_2_34_x86_64.whl
```
Inside the venv from the previous step:

```bash
generator --help

# Generate tiny data
generator --prefix test_data_tiny

# Generate small data
generator --prefix test_data_small --size small
```
Contributions are very welcome. I created this benchmark not to prove that one framework is better than another, and I am not affiliated with any company that develops an ETL tool. I have a preference for Apache Spark because I like it, but the results and the benchmark are quite fair. For example, I'm not trying to hide how much faster Pandas is than Spark on small datasets that fit into memory.
What would be cool:
- Implement the same task in DuckDB;
- Implement the same task in Polars (see the sketch after this list);
- Implement the same task in Dask;
- Implement different approaches for Pandas;
- Implement different approaches for Spark;
- Set up CI to run benchmarks on GitHub runners instead of my laptop;
- ???
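As a rough idea of what such a contribution could look like, here is a sketch of the same toy aggregation from the introduction in Polars (recent API). This is only an illustration of the shape of an implementation, not benchmark code from this repository, and the schema remains hypothetical.

```python
import polars as pl

# Same hypothetical events as in the Pandas sketch above.
events = pl.DataFrame({
    "user_id": [1, 1, 2, 2, 2],
    "ts": ["2024-01-01", "2024-01-03", "2024-01-01", "2024-01-02", "2024-01-05"],
    "amount": [10.0, 25.0, 5.0, 7.5, 40.0],
}).with_columns(pl.col("ts").str.to_date())

# Using the lazy API so Polars can optimize the whole aggregation plan,
# which matters for a fair comparison against other engines.
features = (
    events.lazy()
    .group_by("user_id")
    .agg(
        pl.len().alias("txn_count"),
        pl.col("amount").sum().alias("amount_sum"),
        pl.col("amount").mean().alias("amount_mean"),
        pl.col("ts").max().alias("last_event"),
    )
    .collect()
)
print(features)
```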
Documentation is lacking for now, but I'm working on it. You may open an issue, open a PR, or just contact me via email: [email protected].