Snapshot chain getting broken - data incorrectly removed #11243

Open
cristian-fatu opened this issue Oct 1, 2024 · 2 comments
Labels: AWS, bug

Apache Iceberg version

1.5.0

Query engine

Spark

Please describe the bug 🐞

We're running Iceberg with Spark, using Spark Structured Streaming to read from a Kafka topic and write to an Iceberg table.
Recently we have also started running a batch Spark job to backfill some older data into the same table.
Both the streaming and the backfill jobs run at the same time, inserting into the table concurrently (only inserts, no merges or deletes). The streaming job runs 4-minute micro-batches, while the backfill job is run on demand and potentially inserts several times per minute. A rough sketch of the two write paths is included below.
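
This is only a minimal sketch of the shape of the two jobs, not our exact code; the catalog, database, table, topic, bucket, and checkpoint names are placeholders.

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Streaming job: reads from Kafka and appends to the Iceberg table
# in 4-minute micro-batches (all names below are placeholders).
kafka_df = (
    spark.readStream
    .format("kafka")
    .option("kafka.bootstrap.servers", "broker:9092")
    .option("subscribe", "events-topic")
    .load()
)

streaming_query = (
    kafka_df.selectExpr("CAST(value AS STRING) AS value")
    .writeStream
    .format("iceberg")
    .outputMode("append")
    .trigger(processingTime="4 minutes")
    .option("checkpointLocation", "s3://bucket/checkpoints/events")
    .toTable("glue_catalog.db.events")
)

# Backfill job: a plain batch append into the same table, run on demand
# and potentially committing several times per minute.
backfill_df = spark.read.parquet("s3://bucket/backfill/2024-09/")
backfill_df.writeTo("glue_catalog.db.events").append()
```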

We are seeing that, on occasion, a new snapshot gets created that has no parent snapshot ID. When that happens, the data loaded in previous snapshots effectively becomes "invisible" when querying the table.
Also, when we later expire snapshots and delete orphaned files (the maintenance calls are sketched below), the older data is hard-deleted (which sounds like the expected behavior once those snapshots are no longer ancestors of the current one).
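
The maintenance runs use the standard Iceberg Spark procedures, roughly like this (the catalog/table names and the timestamp are placeholders):

```python
# Expire old snapshots; with history.expire.max-snapshot-age-ms = 10800000
# this effectively drops anything older than about three hours.
spark.sql("""
    CALL glue_catalog.system.expire_snapshots(
        table => 'db.events',
        older_than => TIMESTAMP '2024-09-30 17:00:00'
    )
""")

# Remove files no longer referenced by any snapshot of the table.
spark.sql("""
    CALL glue_catalog.system.remove_orphan_files(
        table => 'db.events'
    )
""")
```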

The problem looks like it's caused by the concurrent updates to the table, but we haven't managed to reproduce it on demand.

As additional symptoms, we noticed the following when looking at the metadata information:

  • taking this sequence of snapshots as an example from the $history table (the query used to pull these rows is sketched after this list):

| made_current_at | snapshot_id | parent_id | is_current_ancestor |
| --- | --- | --- | --- |
| 2024-09-30 19:36:23.975000 | 4112309507491600842 | 4459680660798272782 | true |
| 2024-09-30 19:36:21.466000 | 4459680660798272782 | (null) | true |
| 2024-09-30 19:36:19.863000 | 6444149358610591875 | 8676544541428494413 | false |
| 2024-09-30 19:36:18.948000 | 8676544541428494413 | 3452861481380993540 | false |

  • in this example, only data starting from snapshot_id=4459680660798272782 remains visible
  • looking at the excerpt from the .metadata.json file written when snapshot_id=4459680660798272782 was created, we noticed that the older snapshots were using schema-id=91 while 4459680660798272782 uses an older schema-id=3; it's unclear whether this is related
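
For reference, the $history rows above come from a query along these lines (catalog and table names are placeholders):

```python
# Inspect the snapshot lineage via the history metadata table;
# a row with parent_id = (null) in the middle of the chain marks the break.
spark.sql("""
    SELECT made_current_at, snapshot_id, parent_id, is_current_ancestor
    FROM glue_catalog.db.events.history
    ORDER BY made_current_at DESC
""").show(truncate=False)
```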

Other relevant info:

  • running Spark 3.5.0
  • Iceberg 1.5.0
  • AWS Glue catalog
  • AWS S3 as storage

The target table has the following properties:
history.expire.max-snapshot-age-ms 10800000
write.metadata.previous-versions-max 20
write.parquet.compression-codec zstd
write.spark.accept-any-schema true
write.metadata.delete-after-commit.enabled true
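
These properties are set on the table roughly as follows (sketch only; the catalog/table name is a placeholder):

```python
# Apply the table properties listed above to the target Iceberg table.
spark.sql("""
    ALTER TABLE glue_catalog.db.events SET TBLPROPERTIES (
        'history.expire.max-snapshot-age-ms' = '10800000',
        'write.metadata.previous-versions-max' = '20',
        'write.parquet.compression-codec' = 'zstd',
        'write.spark.accept-any-schema' = 'true',
        'write.metadata.delete-after-commit.enabled' = 'true'
    )
""")
```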

Willingness to contribute

  • I can contribute a fix for this bug independently
  • I would be willing to contribute a fix for this bug with guidance from the Iceberg community
  • I cannot contribute a fix for this bug at this time
cristian-fatu added the bug label · Oct 1, 2024
@cristian-fatu (Author) commented:

It also looks like whenever this issue happens we get this in the Spark logs:

Retrying task after failure: Cannot commit REDACTED because base metadata location 's3://REDACTED/metadata/80740-e132a8c4-6481-441d-a1d8-5655699a61c4.metadata.json' is not same as the current Glue location 's3://REDACTED/metadata/80741-b80a0885-0864-46c1-a7b9-34819a27ff9f.metadata.json'

However, I checked and this error also appears at other times, when no data goes missing.

nastra added the AWS label · Oct 1, 2024
@cristian-fatu (Author) commented:

And just to clarify: we are only doing inserts/appends (i.e., no updates or deletes), and the streaming and batch jobs write to different partitions of the table.
