-
Notifications
You must be signed in to change notification settings - Fork 27
PECOBLR-1121 Arrow patch to circumvent Arrow issues with JDk 16+. #1156
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: main
Are you sure you want to change the base?
Conversation
Patch Arrow to create a Databricks ArrowBuf which allocates memory on the heap and provides access to it through Java methods. This removes the need to specify "--add-opens=java.base/java.nio=ALL-UNNAMED" as JVM args for JDK 16+.
Use native Arrow if available. Otherwise fallback to the patch version.
Remove irrelevant reference counting in patch code. Patch code uses heap memory for arrow operations and reference counting is not required.
Remove redundant todos for accounting.
Patch DecimalUtility to not use unsafe methods to set decimal values on DatabricksArrowBuf.
Add notice to all patched Arrow Java code. In NOTICE file mention Arrow has been patched by Databricks.
On static init failure of MemoryUtil class, it prints a stack trace to stderr. Remove this print, since now we fallback to DatabricksBufferAllocator when this happens. And the error is logged as well.
|
|
||
| // ---- Databricks patch start ---- | ||
| private final HistoricalLog historicalLog = | ||
| DEBUG ? new HistoricalLog(DEBUG_LOG_LENGTH, "ArrowBuf[%d]", id) : null; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
do we even need to worry about the historical log? can we not just set it to null? seems like its usage is null checked everywhere anyway?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
+1, add a comment on why this is needed
| long currentReservation = reservedSize.get(); | ||
| long newReservation = currentReservation + nBytes; | ||
| if (newReservation > allocator.getHeadroom() + currentReservation) { | ||
| return false; | ||
| } | ||
| reservedSize.addAndGet(nBytes); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
this is not thread safe, should we use compareAndSet?
| @Override | ||
| public ReferenceManager getReferenceManager() { | ||
| return referenceManager; | ||
| } | ||
|
|
||
| @Override | ||
| public long capacity() { | ||
| return capacity; | ||
| } |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
i think a bunch of these methods do not have a changed override behaviour, can we remove them so that it is easy to review and maintain?
| if (capacity > Integer.MAX_VALUE) { | ||
| throw new IllegalArgumentException( | ||
| "DatabricksArrowBuf does not support capacity > Integer.MAX_VALUE"); | ||
| } |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
- why this limit?
- this is missing from the other constructor
- can we reuse constructors
| static { | ||
| RootAllocator rootAllocator = null; | ||
| try { | ||
| rootAllocator = new RootAllocator(); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
we were using Integer.MAX_VALUE, not needed now?
| // ---- to avoid unsafe allocation initialization errors. | ||
| public static final String DEBUG_ALLOCATOR = "arrow.memory.debug.allocator"; | ||
| public static final int DEBUG_LOG_LENGTH = 6; | ||
| public static final boolean DEBUG; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nit: rename to better name than debug?
Description
Databricks server shares query results in Arrow format for easy cross language functionality. The JDBC driver experiences compatibility issues with JDK 16 and later versions when processing Arrow results.
This problem arises from stricter encapsulation of internal APIs in newer Java versions, which affects the driver's use of the Apache Arrow result format consumption with the Apache Arrow library. The JDBC driver is used in partner solutions, where they do not have control of the runtime environment, and the workaround of setting JVM arguments is not feasible.
Testing
Tests are added in other stacked PRs.
Additional Notes to the Reviewer
Its a stacked PR.