
[HUDI-8543] Fixing Secondary Index Record generation to not rely on WriteStatus #12313

Open · wants to merge 6 commits into base: master
Conversation

@nsivabalan (Contributor) commented Nov 21, 2024

Change Logs

SI record generation consists of two steps:
a. Find the record keys that were updated or deleted, and add delete records to the SI. We look those keys up in the SI to find the (secondary key, record key) combinations and prepare delete records.
b. For the latest data (inserted or updated), we read the records to find the (secondary key, record key) combinations and generate new insert records to ingest into the SI.

Of the above steps, (a) is the one that relied on WriteStatus.

In this patch we are only fixing (a), i.e. finding the list of record keys that got updated or deleted in the commit of interest will no longer rely on WriteStatus; instead we do an on-demand read from the data files.
Time permitting for the 1.x release, we may have a follow-up patch that unifies steps (a) and (b) so both are done in one pass.

This patch definitely has to go in to remove the dependency on WriteStatus. The optimization of merging steps (a) and (b) will be followed up based on available bandwidth and the timeframe before we wrap up 1.0; it is an optimization and does not impact correctness. Since the Secondary Index itself is a new feature introduced in 1.x, we wanted to take care of correctness and reliability first.
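
To make the on-demand read concrete, below is a minimal, hedged sketch using plain Java collections (not Hudi's actual types or method names) of how deleted and updated keys could be derived by diffing record keys read from the previous and new base files of a touched file group. Treating every carried-over key as an update candidate is a simplifying assumption of the sketch, not necessarily what the patch does.

import java.util.HashSet;
import java.util.Set;

public class KeyDiffSketch {
  // Assumption: prevKeys/newKeys are record keys read on demand from the
  // previous and newly written base files of a single file group.
  static Set<String> recordKeysDeletedOrUpdated(Set<String> prevKeys, Set<String> newKeys) {
    Set<String> deleted = new HashSet<>(prevKeys);
    deleted.removeAll(newKeys);   // present before, absent now -> deleted
    Set<String> updated = new HashSet<>(prevKeys);
    updated.retainAll(newKeys);   // carried over into the rewritten file -> update candidates
    Set<String> result = new HashSet<>(deleted);
    result.addAll(updated);
    return result;
  }
}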

Impact

  • Fixing Secondary Index Record generation to not rely on WriteStatus.

Risk level (write none, low, medium or high below)

low

Documentation Update

Describe any necessary documentation update if there is any new feature, config, or user-facing change. If not, put "none".

  • The config description must be updated if new configs are added or the default values of the configs are changed
  • Any new feature or user-facing change requires updating the Hudi website. Please create a Jira ticket, attach the ticket number here, and follow the instructions to make changes to the website.

Contributor's checklist

  • Read through contributor's guide
  • Change Logs and Impact were stated clearly
  • Adequate tests were added if applicable
  • CI passed

@github-actions bot added the size:L label (PR with lines of changes in (300, 1000]) on Nov 21, 2024
@vinothchandar (Member) left a comment:

Code style comments. Will do second pass again

  }
  updateFunctionalIndexIfPresent(commitMetadata, instantTime, partitionToRecordMap);
- updateSecondaryIndexIfPresent(commitMetadata, partitionToRecordMap, writeStatus);
+ updateSecondaryIndexIfPresent(commitMetadata, partitionToRecordMap, instantTime);
Member:

how come we still have the updateFromWriteStatuses method..

Member:

still rename writeStatus to writeStatuses. plural

@nsivabalan (Contributor, Author) commented Nov 21, 2024:

yes, I have deferred the changes below to the next patch:
a. Remove HoodieWriteDelegate from WriteStatus.
b. Remove the update API in HoodieMetadataWriter entirely.

The current patch ensures that none of the MDT record generation uses the RDD<WriteStatus>.

I can incorporate (b) in this patch if you prefer.

Member:

ok. we can follow up. but we should fix this for good.

@@ -159,6 +159,9 @@ public class HoodieWriteStat implements Serializable {
   @Nullable
   private Long maxEventTime;
+
+  @Nullable
+  private String prevBaseFile;
Member:

so you just need the basefile? not the entire previous file slice?

@nsivabalan (Contributor, Author):

HoodieDeltaWriteStat contains the log files; HoodieWriteStat only contains info about base files.
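
For illustration, a hedged, simplified view of the relationship described above (field names abbreviated; only prevBaseFile corresponds to the field added in this patch):

// Simplified sketch, not the actual Hudi classes verbatim.
class HoodieWriteStat {
  String path;          // base file written for this stat
  String prevBaseFile;  // nullable; the previous base file, added in this patch
}

class HoodieDeltaWriteStat extends HoodieWriteStat {
  java.util.List<String> logFiles; // log files written for MOR file groups
}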

* @return Iterator of {@link HoodieRecord}s for RLI Metadata partition.
* @throws IOException
*/
public static Iterator<HoodieRecord> generateRLIMetadataHoodieRecordsForBaseFile(String basePath,
Member:

UT? all methods in file.

@nsivabalan (Contributor, Author):

yes, we already have it for the RLI methods. I will be updating the patch w/ more tests for the rest (the SI-related ones).


import static java.util.stream.Collectors.toList;

public class BaseFileRecordParsingUtils {
Member:

do we need a new Utils class? adding lots of short utils classes makes the code harder to maintain

@nsivabalan (Contributor, Author):

I did not want to keep expanding HoodieTableMetadataUtil; it is already close to 3,000 lines.
So I kept all the record-level parsing methods in this new class.

HoodieCommitMetadata commitMetadata,
HoodieMetadataConfig metadataConfig,
HoodieTableMetaClient dataTableMetaClient,
int writesFileIdEncoding,
Member:

should this be an enum?

Member:

the method is a bit long, IMO. we need small methods that can be easily tested

@nsivabalan (Contributor, Author):

sure. will split it up into smaller ones

@nsivabalan (Contributor, Author) commented Nov 22, 2024:

sorry, which one did you mean for the enum?
For writesFileIdEncoding, we only have two possible values for now: 0 -> UUID based fileID, 1 -> random string.
We have not defined any enum for it.
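
For reference, a hypothetical enum along the lines the review suggests; Hudi does not define this today, per the comment above:

// Hypothetical sketch only; the code currently passes writesFileIdEncoding as a raw int.
public enum WritesFileIdEncoding {
  UUID_BASED(0),      // 0 -> UUID based fileID
  RANDOM_STRING(1);   // 1 -> random string

  private final int code;

  WritesFileIdEncoding(int code) {
    this.code = code;
  }

  public int getCode() {
    return code;
  }

  public static WritesFileIdEncoding fromCode(int code) {
    return code == 0 ? UUID_BASED : RANDOM_STRING;
  }
}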

Member:

let's file a code cleanup JIRA for 1.1 for these.

@hudi-bot commented:

CI report:

Bot commands: @hudi-bot supports the following commands:
  • @hudi-bot run azure : re-run the last Azure build

@vinothchandar (Member) commented:

the specific thing you are not addressing is custom record mergers and payloads (since those would need reading the previous image and merging fully using the merger/payload)? Can you make sure we throw an error for this scenario, if RLI is enabled with other mergers/payloads?
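
A hedged sketch of the kind of guard being requested; the method name and wiring are hypothetical, though the two payload class names are real Hudi classes:

// Illustrative only: fail fast when RLI is enabled with a custom payload/merger,
// since key diffing without WriteStatus cannot replay custom merge semantics.
class RecordIndexValidation {
  static void validatePayloadForRecordIndex(String payloadClassName, boolean recordIndexEnabled) {
    boolean knownPayload =
        "org.apache.hudi.common.model.OverwriteWithLatestAvroPayload".equals(payloadClassName)
            || "org.apache.hudi.common.model.DefaultHoodieRecordPayload".equals(payloadClassName);
    if (recordIndexEnabled && !knownPayload) {
      throw new UnsupportedOperationException(
          "Record level index is not supported with custom payload class: " + payloadClassName);
    }
  }
}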

@vinothchandar (Member) left a comment:

I am good with the approach overall. But please resolve testing/code cleanup/errors first


} else if (!isRecord1Deleted && isRecord2Deleted) {
  return record1;
} else {
  throw new HoodieIOException("Two HoodieRecord updates to RLI is seen for same record key " + record2.getRecordKey() + ", record 1 : "
Member:

I remember leaving a comment here that we should handle the true/true case by not throwing an error, for logs with redundant deletes, i.e. where we just encode a streaming delete for the same key without checking for its existence first.
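
For clarity, a minimal sketch of the merge rule under discussion, with types simplified and the true/true branch resolved by keeping one delete rather than throwing, as the comment suggests:

// Illustrative only, not Hudi's exact reduceByKeys implementation.
class RliMergeSketch {
  static <R> R mergeRliRecordsForKey(R record1, R record2,
                                     boolean isRecord1Deleted, boolean isRecord2Deleted) {
    if (isRecord1Deleted && isRecord2Deleted) {
      // Redundant streaming deletes for the same key: keep either one.
      return record1;
    } else if (isRecord1Deleted) {
      return record2; // keep the live record
    } else if (isRecord2Deleted) {
      return record1;
    } else {
      // Two live updates to the same key in one commit is still treated as an error.
      throw new IllegalStateException("Two HoodieRecord updates to RLI seen for the same record key");
    }
  }
}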

* @param parallelism parallelism to use.
* @return
*/
private static HoodieData<HoodieRecord> reduceByKeys(HoodieData<HoodieRecord> recordIndexRecords, int parallelism) {
Member:

make this package-protected and add a UT

}

static List<String> getRecordKeysDeletedOrUpdated(HoodieEngineContext engineContext,
                                                  HoodieCommitMetadata commitMetadata,
Member:

formatting


@VisibleForTesting
public static Set<String> getRecordKeys(String filePath, HoodieTableMetaClient datasetMetaClient,
                                        Option<Schema> writerSchemaOpt, int maxBufferSize,
Member:

formatting

int parallelism = Math.max(Math.min(allWriteStats.size(), metadataConfig.getRecordIndexMaxParallelism()), 1);
String basePath = dataTableMetaClient.getBasePath().toString();
// we might need to set some additional variables if we need to process log files.
boolean anyLogFilesWithDeletes = allWriteStats.stream().anyMatch(writeStat -> {
Member:

avoid duplicating these code blocks?

@nsivabalan (Contributor, Author):

there are some minor differences between the RLI and SI code paths:

  • one of them generates HoodieData, while the SI one just returns a List;
  • one of them is interested only in logs w/ deletes;
  • one of them is interested in INSERTS and DELETES, while the other is interested in DELETES and UPDATES.

I have avoided code duplication in BaseFileParsingUtils. Here it becomes a little messy; will try to take a stab at it, but it's not going to be an easy one. One possible shape is sketched below.
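
A hedged sketch (names hypothetical) of one way the near-duplicate blocks could share scaffolding, with the per-index differences passed in rather than copied:

import java.util.List;
import java.util.function.Predicate;
import java.util.stream.Collectors;

// Illustrative only; Hudi's actual types and call sites differ.
class IndexRecordPrep {
  // Both the RLI and SI paths first select the write stats they care about;
  // which stats matter (e.g. logs with deletes only) is supplied per index.
  static <S> List<S> selectWriteStats(List<S> allWriteStats, Predicate<S> perIndexFilter) {
    return allWriteStats.stream().filter(perIndexFilter).collect(Collectors.toList());
  }
}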

@nsivabalan (Contributor, Author) commented:

hey @vinothchandar

> the specific thing you are not addressing is custom record mergers and payloads (since those would need reading the previous image and merging fully using the merger/payload)? Can you make sure we throw an error for this scenario, if RLI is enabled with other mergers/payloads?

yes, we are throwing errors: #12337
