Exploring Enhanced Compaction Support in Rust #657

Open
amitgilad3 opened this issue Oct 5, 2024 · 1 comment

Comments

@amitgilad3

This discussion is related to issues #624 and #607. I have been investigating the compaction process in the Rust library, specifically comparing it to the Java implementation using Spark. During this investigation, I noticed a difference in how the FileScanTask class is handled between the two implementations.

In the Java version, the FileScanTask includes:

  1. A DataFile object, which provides crucial information about the partition, specId, and content. This information is necessary for the rewrite process in compaction. However, I am aware that @sdd previously raised a valid concern about including this data in the FileScanTask (see the comment on #607, "refactor: Store DataFile in FileScanTask instead").
  2. A List<DeleteFile>, which is used to remove the necessary rows from existing files.
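
To illustrate why the partition and spec ID matter for a rewrite, here is a minimal sketch. It assumes a compaction planner only rewrites together files that share a spec ID and partition. `DataFileInfo` and `group_for_rewrite` are hypothetical names made up for this example, not types or functions from iceberg-rust or the Java library.

```rust
use std::collections::BTreeMap;

/// Hypothetical stand-in for the per-file metadata a planner would need.
/// Not the real iceberg-rust `DataFile`.
#[derive(Debug, Clone, PartialEq)]
struct DataFileInfo {
    path: String,
    spec_id: i32,
    /// Partition value rendered as a string, purely for illustration.
    partition: String,
}

/// Groups files by (spec_id, partition) so that each group can be
/// rewritten independently of the others.
fn group_for_rewrite(files: &[DataFileInfo]) -> BTreeMap<(i32, String), Vec<DataFileInfo>> {
    let mut groups: BTreeMap<(i32, String), Vec<DataFileInfo>> = BTreeMap::new();
    for f in files {
        groups
            .entry((f.spec_id, f.partition.clone()))
            .or_default()
            .push(f.clone());
    }
    groups
}
```

Without the spec ID and partition on each planned file, a grouping step like this has nothing to key on, which is the gap described above.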

I would like to explore the preferred approach for adding the necessary data to facilitate the implementation of compaction in the Rust library. Here are a few potential options I am considering:

  1. Add the fields DataFile and List<DeleteFile> to FileScanTask.
  2. Propose a new API that returns a more informative version of FileScanTask (perhaps FileScanPlan?), which includes the required data but is not serializable.
  3. Other possible solutions? I am open to suggestions on alternative approaches.
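
As a rough sketch of what option 2 could look like: every type below is a hypothetical stand-in written for this example, not the actual iceberg-rust definitions, and the field names are assumptions. The idea is that the lightweight task stays as it is, while a richer wrapper carries the extra metadata.

```rust
/// Hypothetical stand-in for the existing lightweight, serializable task.
#[derive(Debug, Clone, PartialEq)]
struct FileScanTask {
    data_file_path: String,
    start: u64,
    length: u64,
}

/// Hypothetical stand-in for the data file metadata needed by a rewrite.
#[derive(Debug, Clone, PartialEq)]
struct DataFile {
    file_path: String,
    spec_id: i32,
    partition: String,
}

/// Hypothetical stand-in for a delete file that applies to the data file.
#[derive(Debug, Clone, PartialEq)]
struct DeleteFile {
    file_path: String,
}

/// Option 2: a richer plan entry that wraps the task and is not meant
/// to be serialized (note the absence of any serialization derive).
#[derive(Debug, Clone, PartialEq)]
struct FileScanPlan {
    task: FileScanTask,
    data_file: DataFile,
    deletes: Vec<DeleteFile>,
}

impl FileScanPlan {
    /// Returns true if any delete files must be applied during the rewrite.
    fn has_deletes(&self) -> bool {
        !self.deletes.is_empty()
    }

    /// Drops the extra metadata and hands back only the lightweight task,
    /// for example to ship it to a worker.
    fn into_task(self) -> FileScanTask {
        self.task
    }
}
```

With this shape, existing readers keep consuming FileScanTask unchanged, and only compaction-style callers pay for the additional metadata.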

I also tried to map the logic of the Java + Spark implementation to help us understand the flow, in the hope that we can do the same with Rust and DataFusion, and maybe Comet.

Would love to get your input @sdd @Xuanwo & @ZENOTME

[Diagram: compaction flow of RewriteDataFilesSparkAction]

@amitgilad3 changed the title from "Initial Compaction implementation strategy and missing data" to "Exploring Enhanced Compaction Support in Rust" on Oct 5, 2024
@liurenjie1024
Collaborator

Thanks @amitgilad3 for raising this. I think compaction is a relatively complex topic, and we are still some way from being able to do it. For example, we don't support reading delete files, and we don't have a transaction API. Also, compaction typically requires a distributed computing engine to process it. I think a better approach would be to provide the necessary primitives in this library and help other distributed engines build compaction on top of them?
