feat(test): integration test for S3 log store with pyspark #1988

Draft · wants to merge 6 commits into main from spark-integration-test
Conversation

dispanser
Contributor

Description

This adds an integration test to make sure that delta-rs and pyspark can write to a shared delta table concurrently, using the same lock table in a compatible way.

Due to the way pyspark instantiates the SparkSession, there's only one session for the entire test run, so we must set up the S3-relevant configs globally for all tests. This doesn't seem to interfere with the other tests, as they write locally and ignore any S3-related configuration.
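For readers unfamiliar with this setup, here is a minimal sketch (not the PR's actual fixture) of what a globally configured, session-scoped SparkSession could look like. The S3DynamoDBLogStore config keys follow the Delta Lake documentation for multi-cluster S3 writes; the app name, lock table name, and region are placeholders, not values taken from this PR.

```python
import pytest
from pyspark.sql import SparkSession


@pytest.fixture(scope="session")
def spark():
    # One session for the whole test run, so S3/DynamoDB settings must be global.
    return (
        SparkSession.builder.appName("delta-rs-s3-integration")
        .config("spark.sql.extensions", "io.delta.sql.DeltaSparkSessionExtension")
        .config(
            "spark.sql.catalog.spark_catalog",
            "org.apache.spark.sql.delta.catalog.DeltaCatalog",
        )
        # Route S3 commits through the DynamoDB-backed log store so Spark and
        # delta-rs coordinate through the same lock table (placeholder values).
        .config("spark.delta.logStore.s3a.impl", "io.delta.storage.S3DynamoDBLogStore")
        .config("spark.io.delta.storage.S3DynamoDBLogStore.ddb.tableName", "delta_log")
        .config("spark.io.delta.storage.S3DynamoDBLogStore.ddb.region", "us-east-1")
        .getOrCreate()
    )
```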

@github-actions github-actions bot added the binding/python Issues for the Python package label Dec 21, 2023
@dispanser
Contributor Author

@rtyler : the new test doesn't work out of the box, as it requires S3 credentials in the form of environment variables:

  • AWS_ACCESS_KEY_ID
  • AWS_SECRET_ACCESS_KEY
  • AWS_REGION

AWS_PROFILE might work instead, but I haven't tested that.
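A hedged sketch of how such a requirement could be guarded at the top of the test module, so runs without credentials skip rather than fail; the variable list mirrors the comment above, but the guard itself is illustrative and not part of this PR:

```python
import os

import pytest

REQUIRED_S3_VARS = ("AWS_ACCESS_KEY_ID", "AWS_SECRET_ACCESS_KEY", "AWS_REGION")
_missing = [v for v in REQUIRED_S3_VARS if not os.environ.get(v)]

# Skip the whole module when credentials are absent instead of failing the run.
pytestmark = pytest.mark.skipif(
    bool(_missing),
    reason=f"missing S3 credentials: {', '.join(_missing)}",
)
```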

@dispanser dispanser force-pushed the spark-integration-test branch 3 times, most recently from 29f3afb to 5f1e278 on December 21, 2023 13:48
@dispanser
Contributor Author

It may be necessary to mark this test with a separate tag so it can be excluded from a normal CI run, unless we have some credentials we want to put into GitHub secrets.
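One way to do that, sketched here purely as an illustration (the `s3_integration` marker name is hypothetical): register a pytest marker and deselect it by default.

```python
# conftest.py: register a hypothetical marker so S3 tests can be deselected.
def pytest_configure(config):
    config.addinivalue_line(
        "markers",
        "s3_integration: needs real S3 + DynamoDB credentials",
    )
```

A normal CI run could then invoke `pytest -m "not s3_integration"`, while a credentialed job runs `pytest -m s3_integration`.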

@rtyler rtyler self-assigned this Dec 22, 2023
@rtyler
Member

rtyler commented Dec 22, 2023

I think I can add some credentials to an AWS account I can control and structure specifically for this purpose. That said, it looks like I'm going to have to create an IAM user with a broad swath of permissions. Have you been testing basically with an admin-level IAM user, or do you have a specific set of IAM permissions in mind?

@rtyler rtyler marked this pull request as draft December 22, 2023 17:44
@dispanser
Contributor Author

I completely forgot about permissions and their implications while writing this. 😫

I've been using a very unrestricted user, indeed. The way it currently works is by creating a bucket and a table for each run, but that's purely for the sake of setup-free reproducibility. To reduce the required permissions, we could modify the tests to accept an existing bucket + table via environment variables, which would allow us to get away with very limited permissions: basically reading from + writing to a specific bucket and table. That table could also be pre-configured with a proper TTL and very small provisioned throughput so it fits neatly into the AWS free tier.
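A sketch of what that override could look like; the environment variable names (`DELTA_S3_TEST_BUCKET`, `DELTA_DYNAMO_LOCK_TABLE`) are hypothetical and only illustrate the idea of preferring pre-provisioned resources over per-run ones.

```python
import os
import uuid

# Prefer a pre-provisioned bucket and lock table from the environment; fall
# back to per-run names (which need create/delete permissions) only when
# nothing is configured. Variable names here are hypothetical.
run_id = uuid.uuid4().hex[:8]
bucket = os.environ.get("DELTA_S3_TEST_BUCKET", f"delta-rs-it-{run_id}")
lock_table = os.environ.get("DELTA_DYNAMO_LOCK_TABLE", f"delta_rs_lock_{run_id}")
table_uri = f"s3://{bucket}/tables/{run_id}"
```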

@dispanser dispanser force-pushed the spark-integration-test branch 3 times, most recently from f602332 to 2a23837 on December 30, 2023 10:59
@rtyler
Member

rtyler commented Jan 3, 2024

@dispanser The big refactor just landed; can you rebase and rework this pull request accordingly? 😄

@rtyler rtyler added this to the Rust v0.17 milestone Jan 3, 2024
@dispanser dispanser force-pushed the spark-integration-test branch from 2a23837 to 2131dc3 on January 4, 2024 16:36
@dispanser
Contributor Author

@rtyler : rebase done. What do you think about my proposal in the previous comment?

@dispanser dispanser force-pushed the spark-integration-test branch from 2131dc3 to 6bda0e9 on January 4, 2024 16:55
@rtyler rtyler force-pushed the spark-integration-test branch 2 times, most recently from 88984a6 to 6cfc616 on January 24, 2024 08:00
dispanser and others added 4 commits January 30, 2024 13:56
This still is not working, but it's not totally failing I guess
# Description
This PR upgrades `delta-rs` to using DataFusion 35.0, which was recently
released. In order to do this, I had to fix a few breaking changes, and
also upgrade Arrow to 50 and `sqlparser` to 0.41.

# Related Issue(s)
N/A

# Documentation
See here for the list of PRs which required code change:
- apache/datafusion#8703
- https://github.com/apache/arrow-datafusion/blob/ec6abece2dcfa68007b87c69eefa6b0d7333f628/dev/changelog/35.0.0.md?plain=1#L227

---------

Co-authored-by: Ming Ying <[email protected]>
…ad (delta-io#2120)

# Description

Make sure the read path for delta table commit entries passes through
the log store, enabling it to enforce the invariants and potentially
repair a broken commit in the context of the S3 / DynamoDB log store
implementation.

This also adds another test in the context of the S3 log store: repairing a
broken commit on load was not implemented previously.

Note that this is a stopgap and not a complete solution: it comes with a
performance penalty, as we're triggering a redundant object store list
operation just for the purpose of "triggering" the log store
functionality.


fixes delta-io#2109

---------

Co-authored-by: Ion Koutsouris <[email protected]>
Co-authored-by: R. Tyler Croy <[email protected]>
@rtyler rtyler force-pushed the spark-integration-test branch from 1f9bcd8 to d1b24f5 on January 31, 2024 05:21
@github-actions github-actions bot added the binding/rust Issues for the Rust crate label Jan 31, 2024
@github-actions github-actions bot removed the binding/rust Issues for the Rust crate label Feb 1, 2024
@rtyler rtyler removed this from the Rust v0.17 milestone Feb 6, 2024
Labels
binding/python Issues for the Python package
4 participants