feat: Support COW bulk-insert, insert, upsert, delete works with spark datasource and lance #17731
Conversation
 * This writer is used for bulk insert operations and other optimized write paths that work
 * directly with InternalRow objects without HoodieRecord wrappers.
 */
public class HoodieInternalRowLanceWriter extends HoodieBaseLanceWriter&lt;InternalRow&gt;
Is it possible to update the HoodieSparkLanceWriter to implement the HoodieInternalRowFileWriter so we don't need to duplicate any of the writer logic or configuration in the future?
Will look into this.
Force-pushed from 65ad93d to 88904d7
public ClosableIterator&lt;HoodieRecord&lt;InternalRow&gt;&gt; getRecordIterator(HoodieSchema schema) throws IOException {
  ClosableIterator&lt;UnsafeRow&gt; iterator = getUnsafeRowIterator(schema);
  return new CloseableMappingIterator&lt;&gt;(iterator, data -> unsafeCast(new HoodieSparkRecord(data)));
  //TODO .copy() is needed for correctness, to investigate further in future.
@rahil-c what is the status of this TODO?
Currently this TODO is about the need for this .copy() workaround. I have filed a ticket with more findings here: #17754.
Can we just solve this as part of this? I am getting worried about the number of follow on tasks for the baseline features here. If it uses some shared buffer, then you need to copy. It is similar to other spark iterators that we have. If it is some setup issue, then fix that first and see if the copy is still required.
Currently in our code we leverage the following UnsafeProjection when converting an InternalRow to an UnsafeRow: https://github.com/apache/hudi/blob/master/hudi-client/hudi-spark-client/src/main/java/org/apache/hudi/io/storage/LanceRecordIterator.java#L132.

I went through the Spark docs looking for more detail on UnsafeProjection, but did not find any, so I examined the relevant classes in the Spark repo to get more insight into the behavior. In the following class I can see a mention of a shared buffer:

https://github.com/apache/spark/blob/master/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/InterpretedUnsafeProjection.scala

with a recommendation for the following:

> This class reuses the [[UnsafeRow]] it produces, a consumer should copy the row if it is being buffered.

> It generates the code for all the expressions, computes the total length for all the columns (can be accessed via variables), and then copies the data into a scratch buffer space in the form of UnsafeRow (the scratch buffer will grow as needed).
>
> @note The returned UnsafeRow will be pointed to a scratch buffer inside the projection.

Based on what you mentioned above:

> If it uses some shared buffer, then you need to copy.

Since we do need the copy, I'm thinking we should place the .copy() inside LanceRecordIterator's next(), so that callers do not have to worry about calling .copy() on the data themselves, as I was doing before in this specific read path:
```java
@Override
public UnsafeRow next() {
  if (!hasNext()) {
    throw new IllegalStateException("No more records available");
  }
  InternalRow row = rowIterator.next();
  // Convert to UnsafeRow immediately while batch is still open
  return projection.apply(row).copy();
}
```
assertEquals(3, commitCount, "Should have 3 completed commits (one per insert)")

// Verify that all commits are bulk_insert commits
val commits = metaClient.getCommitsTimeline.filterCompletedInstants().getInstants.iterator().asScala.toList
I think you can skip the iterator and just convert the Java list to a Scala one directly.
.orderBy("id")
.collect()

// Verify we have exactly 3 records (only from third commit)
For some of these verifications, it seems like we should be able to create a list of rows with expected values to make the validation a bit less verbose and easy to evolve in the future.
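One possible shape for that suggestion, sketched in plain Java rather than the actual Scala test code (the row values and helper names here are made up for illustration):

```java
import java.util.Arrays;
import java.util.List;

// Hypothetical sketch: validate collected rows against one expected list
// instead of asserting field by field. Rows are simplified to "id,name,value"
// strings; in the real test these would be Spark Row objects.
public class ExpectedRowsDemo {
    static List<String> expectedRows() {
        return Arrays.asList("7,row7,70", "8,row8,80", "9,row9,90");
    }

    // Stand-in for df.orderBy("id").collect() rendered as strings
    static List<String> collectedRows() {
        return Arrays.asList("7,row7,70", "8,row8,80", "9,row9,90");
    }

    public static void main(String[] args) {
        // A single equality check replaces a block of per-field assertions,
        // and evolving the schema only means editing the expected list.
        if (!expectedRows().equals(collectedRows())) {
            throw new AssertionError("rows mismatch: " + collectedRows());
        }
        System.out.println("rows match");
    }
}
```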
...-client/src/main/java/org/apache/hudi/io/storage/row/HoodieInternalRowFileWriterFactory.java
    Map&lt;String, String&gt; paramsMap) throws IOException {
  throw new UnsupportedOperationException("serializeRecordsToLogBlock with iterator is not yet supported for Lance format");
}
} (no newline at end of file)
Add a trailing newline here.
    StructType sparkSchema,
    TaskContextSupplier taskContextSupplier,
    HoodieStorage storage) throws IOException {
  this(file, sparkSchema, "0", taskContextSupplier, storage, false);
Let's make the instant time null instead of "0" so it is clear it is not set
@the-other-tim-brown @voonhous I have retriggered the Azure CI. Hoping to get an approval so we can merge once the Azure CI is green.

Describe the issue this Pull Request addresses
Feature: #14127
Goal: Write in Hudi using bulk-insert, insert, update, and deletes with Lance files and read back the data with the Spark Datasource on a COW Table
Exit Criteria: We should be able to construct a test that writes out multiple commits with spark and we can read back the same data. Testing should include time travel and incremental queries as well to ensure basic functionality works end to end.
Summary and Changelog
- `InternalRowWriter` interface (which is used by bulk insert): `HoodieInternalRowLanceWriter.java`
- `FileFormatUtils` changes in order to get upsert/delete functionality working correctly
- `TestLanceDataSource` for ensuring that we have full coverage of the above

Impact
None
Risk Level
Low
Documentation Update
None
Contributor's checklist