-
Notifications
You must be signed in to change notification settings - Fork 173
feat: Spec multi-result-set API #3871
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: spec-1.2.0
Are you sure you want to change the base?
Conversation
|
After discussion and agreement on the proposed function, I'll update the relevant implementations here of the other driver managers to leverage and expose this (python, rust, Go, etc.) |
|
Should we make a branch for spec changes? |
That makes sense, gather the spec updates in the branch and then eventually decide to push it as one thing. |
|
I've changed the base for this PR to point to the spec-1.2.0 branch. Now we just need people to weigh in on the proposal 😄 |
paleolimbot
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Nice!
The previous PR also had:
ADBC_EXPORT
AdbcStatusCode AdbcStatementNextResultSetSchema(struct AdbcStatement* statement,
struct ArrowSchema* schema,
struct AdbcError* error);Is that no longer planned or no longer needed?
c/include/arrow-adbc/adbc.h
Outdated
| ADBC_EXPORT | ||
| AdbcStatusCode AdbcStatementNextResultSet(struct AdbcStatement* statement, | ||
| struct ArrowArrayStream* out, | ||
| struct AdbcPartitions* partitions, | ||
| int64_t* rows_affected, | ||
| struct AdbcError* error); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Would it be possible to have this just be:
ADBC_EXPORT
AdbcStatusCode AdbcStatementNextResultSet(struct AdbcStatement* statement,
struct AdbcError* error);...and then subsequent calls to AdbcStatementExecute(), AdbcStatementExecuteSchema(), AdbcStatementExecutePartitions(), and future async versions of all of those don't have to be redefined?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
That's a reasonable idea. Maybe we can specify that the query must be NULL in those cases? (Though, that would translate weirdly into many of the language bindings, albeit they also don't necessarily have to be 1:1.)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
(I think we need some way to differentiate, in that case, "I want to start a new query" vs "I want to continue executing the query". And ideally we remain backwards-compatible, so that calling ExecuteQuery again would start a new query as it currently does.)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I tried to essentially handle all of the cases with the one function here (minus async) by specifying particular inputs as null or not. I'm don't particularly like having to figure out how to differentiate between the Execute* methods starting a new execution vs continuing after a call to NextResultSet, though that might just be bias from existing APIs I've used.
My thoughts were that the Execute* methods return a stream/schema/list of partitions, so calling NextResultSet should also do the same.
The other pattern I've seen would be that we'd create an AdbcResult struct which would have a NextResultSet callback on it (which IMO we should have done in the first place), but this would be a breaking change or at least require an entirely new set of functions.
@CurtHagenlocher @davidhcoe do either of you have any opinions on this?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Maybe we can specify that the query must be
NULLin those cases?
Does that mean the user has to explicitly set it back to NULL? Or that the statement should have internally set it to NULL when it was executed? That's kind of annoying if you're trying to re-use the same Statement instance to execute the same query multiple times if you have to re-set the query to execute it again. Some users might expect that they need to hold on to the Statement to execute it without paying the cost of preparing it again. They might even try to cache Statements that they've already prepared alongside the parent connection.
Including the partitions API makes this method weirdly modal, I would say if you wanted to lean into the partitions API then you should commit to it and just return a partition from here. Or just split it into two separate methods.
But it's already hard to understand from the API design how you're supposed to use partitions, much less how you're supposed to implement them in the driver. Having to go back to the connection to read the partition doesn't make a ton of sense to me, unless it's a separate query or protocol command to read from it, like fetching from a named cursor.
I suggested privately that there should be a new ExecuteMulti method that explicitly returns a multi-result-set stream. That encodes the user's expectation up-front that there will be multiple result sets from the query, and doesn't require Statement itself to become any more stateful than it is now.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@paleolimbot and @davidhcoe / @CurtHagenlocher?
If everyone's okay with this idea I'll take a stab at updating this proposal accordingly
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It's a great point that there's already a lot of state to keep track of in a statement, although that bookeeping still has to happen somewhere (both for the driver, ensuring it doesn't call something that will crash, and the caller, ensuring it doesn't violate any constraint about holding multiple results). Possibly there is just no getting around a lot of new C functions (whether they're on some new AdbcResultSet or the existing AdbcStatement).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I guess what I struggle with is how do you know when additional results are expected? Is there something that would split the SQL statements to know there are multiple? Do you always call AdbcStatementNextResultSet just to check if there is a second one?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm only familiar with the libpq way of doing this where you call PQgetResult() until it returns NULL (although some libraries probably expose a way to access the number of statements in something that is prepared).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The ODBC, JDBC and ADO.NET APIs all work in this fashion. One thing to consider for databases which support stored procedures is that it may not be possible to know in advance how many result sets they will return. I know this is true for MSSQL and think it might be true for PostgreSQL as well.
|
I've updated the PR based on the comments and discussion. Everyone please take a look and let me know what you think! |
| /// with the result set. | ||
| /// | ||
| /// \since ADBC API revision 1.2.0 | ||
| struct ADBC_EXPORT AdbcResultSet { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think I'd vaguely prefer these as free functions to make it so that there's only one struct that needs possible extension later.
| /// \since ADBC API revision 1.2.0 | ||
| struct ADBC_EXPORT AdbcResultSet { | ||
| /// \brief opaque implementation-defined state | ||
| void* private_data; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Maybe we should also have a pointer to the AdbcDriver as with the other structs?
| /// \return ADBC_STATUS_NOT_IMPLEMENTED if the driver only supports fetching results | ||
| /// as partitions, ADBC_STATUS_INVALID_STATE if called at an inappropriate time, | ||
| /// and ADBC_STATUS_OK (or an appropriate error code) otherwise. | ||
| AdbcStatusCode (*NextResultSet)(struct AdbcResultSet* self, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Did we also want a NextSchema callback?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Although, it'd probably be better to indicate the desire for that up front so that the driver can choose what to do properly. (And actually, I'm not sure if database clients generally support this in the first place...)
| /// \brief opaque implementation-defined state | ||
| void* private_data; | ||
|
|
||
| /// \brief Release the result set and any associated resources. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We can say this also cancels the query if not completed?
Extracted from #3607 with influence by the comments there and main...CurtHagenlocher:arrow-adbc:MoreResults, this contains a proposal for handling multi-result set query execution via ADBC by adding a new function for drivers,
AdbcStatementNextResultSet.This also includes the necessary changes for an ADBC API Revision 1.2.0 (macro defines and so on). The comment above the function includes all the semantic definitions of the behavior.