feat(test): integration test for S3 log store with pyspark #1988
base: main
Conversation
@rtyler: the new test doesn't work out of the box, as it requires S3 credentials in the form of environment variables.
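For reference, a minimal sketch of the kind of environment check such a test run implies (the exact variable names are an assumption, not taken from this PR):

```python
import os

# Hypothetical names: standard AWS credential variables the S3 integration
# test would need before it can talk to a real bucket and DynamoDB table.
REQUIRED_ENV_VARS = ["AWS_ACCESS_KEY_ID", "AWS_SECRET_ACCESS_KEY", "AWS_DEFAULT_REGION"]

missing = [name for name in REQUIRED_ENV_VARS if name not in os.environ]
if missing:
    raise RuntimeError(f"S3 integration tests require env vars: {missing}")
```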
force-pushed from 29f3afb to 5f1e278
It may be necessary to mark this test with a separate tag so it can be excluded from a normal CI run, unless we have some credentials we want to stick into GitHub.
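For instance, a dedicated pytest marker (the marker name is purely illustrative) would let a normal CI run deselect these tests with `pytest -m "not s3_integration"`:

```python
import pytest

# Hypothetical marker; it would need to be registered in pytest.ini/pyproject
# so pytest doesn't warn about an unknown mark.
@pytest.mark.s3_integration
def test_concurrent_pyspark_and_delta_rs_writes():
    ...  # body elided; only the tagging mechanism is shown here
```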
I think I can add some credentials to an AWS account I can control and structure specifically for this purpose. That said, it looks like I'm going to have to create an IAM user with a broad swath of permissions. Have you been testing basically with an admin-level IAM user, or do you have a specific set of IAM permissions in mind?
I completely forgot about permissions and their implications while writing this. 😫 I've been using a very unrestricted user, indeed. The way it currently works is by creating a bucket and a table for each run, but that's purely for the sake of setup-free reproducibility. To reduce the required permissions, we could modify the tests to accept an existing bucket + table via environment variables, which would allow us to get away with very limited permissions: basically reading from and writing to a specific bucket and table. That table could also be pre-configured with a proper TTL and very small provisioned throughput so it fits neatly into any free AWS tier.
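A rough sketch of that idea (the variable names and fallback helper are hypothetical, not code from this PR):

```python
import os
import uuid

def resolve_test_resources():
    """Prefer a pre-provisioned bucket and DynamoDB lock table when supplied,
    so the IAM user only needs read/write on those two resources; otherwise
    fall back to the current behaviour of creating throwaway ones per run."""
    bucket = os.environ.get("DELTA_S3_TEST_BUCKET")
    lock_table = os.environ.get("DELTA_DYNAMO_TABLE_NAME")
    if bucket and lock_table:
        return bucket, lock_table, False  # False: nothing to clean up afterwards
    suffix = uuid.uuid4().hex[:8]
    return f"delta-rs-test-{suffix}", f"delta_rs_lock_{suffix}", True
```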
force-pushed from f602332 to 2a23837
@dispanser The big refactor just landed, can you rebase and rework this pull request accordingly? 😄
force-pushed from 2a23837 to 2131dc3
@rtyler: rebase done. What do you think about my proposal in the previous comment?
force-pushed from 2131dc3 to 6bda0e9
force-pushed from 88984a6 to 6cfc616
This still isn't working, but it's not totally failing, I guess.
# Description

This PR upgrades `delta-rs` to DataFusion 35.0, which was recently released. In order to do this, I had to fix a few breaking changes, and also upgrade Arrow to 50 and `sqlparser` to 0.41.

# Related Issue(s)

N/A

# Documentation

See here for the list of PRs which required code changes:
- apache/datafusion#8703
- https://github.com/apache/arrow-datafusion/blob/ec6abece2dcfa68007b87c69eefa6b0d7333f628/dev/changelog/35.0.0.md?plain=1#L227

---------

Co-authored-by: Ming Ying <[email protected]>
…ad (delta-io#2120)

# Description

Make sure the read path for delta table commit entries passes through the log store, enabling it to ensure the invariants and potentially repair a broken commit in the context of the S3 / DynamoDB log store implementation. This also adds another test in the context of the S3 log store: repairing a log store on load was not implemented previously.

Note that this is a stopgap and not a complete solution: it comes with a performance penalty, as we're triggering a redundant object store list operation just for the purpose of "triggering" the log store functionality.

fixes delta-io#2109

---------

Co-authored-by: Ion Koutsouris <[email protected]>
Co-authored-by: R. Tyler Croy <[email protected]>
force-pushed from 1f9bcd8 to d1b24f5
Description
This adds an integration test to make sure that delta-rs and pyspark can write to a shared delta table concurrently, using the same lock table in a compatible way.
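A minimal sketch of that scenario (not the PR's actual test; the table URI, schema, and option names are illustrative):

```python
import pyarrow as pa
from deltalake import write_deltalake

def append_from_both_writers(spark, table_uri: str):
    """Append to the same S3 table from delta-rs and pyspark; both writers are
    expected to coordinate through the shared DynamoDB lock table."""
    # Writer 1: delta-rs via the Python bindings.
    write_deltalake(
        table_uri,
        pa.table({"id": [1, 2, 3]}),
        mode="append",
        storage_options={"AWS_S3_LOCKING_PROVIDER": "dynamodb"},
    )
    # Writer 2: pyspark, using a session configured as sketched further below.
    spark.createDataFrame([(4,), (5,)], ["id"]) \
        .write.format("delta").mode("append").save(table_uri)
```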
Due to the way pyspark instantiates the `SparkSession`, there's only one session for the entire test run, so we must set up the S3-relevant configs globally, for all tests. This doesn't seem to interfere with the other tests, as they write locally and ignore any S3-related configuration.
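For illustration, a minimal sketch of such a globally configured session as a pytest fixture (the config keys follow the Delta Lake documentation for the DynamoDB-backed S3 log store; the fixture shape, environment variable names, and defaults are assumptions, and the delta-spark jars are assumed to be on the classpath):

```python
import os
import pytest
from pyspark.sql import SparkSession

@pytest.fixture(scope="session")
def spark():
    # One session for the whole run; S3/DynamoDB settings are applied globally.
    # Tests that write locally simply ignore these settings.
    return (
        SparkSession.builder.appName("delta-rs-s3-integration")
        .config("spark.sql.extensions", "io.delta.sql.DeltaSparkSessionExtension")
        .config("spark.sql.catalog.spark_catalog",
                "org.apache.spark.sql.delta.catalog.DeltaCatalog")
        # Route s3a:// commits through the DynamoDB-backed log store so that
        # pyspark and delta-rs coordinate on the same lock table.
        .config("spark.delta.logStore.s3a.impl",
                "io.delta.storage.S3DynamoDBLogStore")
        .config("spark.io.delta.storage.S3DynamoDBLogStore.ddb.tableName",
                os.environ.get("DELTA_DYNAMO_TABLE_NAME", "delta_log"))
        .config("spark.io.delta.storage.S3DynamoDBLogStore.ddb.region",
                os.environ.get("AWS_DEFAULT_REGION", "us-east-1"))
        .getOrCreate()
    )
```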