-
Notifications
You must be signed in to change notification settings - Fork 1.9k
WIP: using arrow-avro. remove own implementation #17861
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: main
Are you sure you want to change the base?
Conversation
|
❤️ amazing! Thank you @getChan |
|
Hi @getChan -- I am preparing to make an arrow release -- have you hit any blockers while integrating the new arrow-avro crate into DataFusion? |
No, not yet. Thanks for release. |
|
Thanks for jumping on this @getChan; let me know if I can help! |
# Conflicts: # Cargo.lock # Cargo.toml # datafusion/common/Cargo.toml
|
FYI I merged the arrow 57 upgrade to DataFusion -- so if you rebase this PR against main you'll have access to the new arrow-avro crate |
# Conflicts: # Cargo.lock
Thanks for help. |
# Conflicts: # datafusion/datasource-avro/src/source.rs
Coming in a little bit late to this. @getChan Basically the way we treat "projection" in I think to get this working we'll need to extend Also have you tried setting the What I mean by that is you can craft the correct Reader Schema json string and then use There's two ways I can see this working:
I'll create an CC: @nathaniel-d-ef |
|
I had a chance to get back at this today to try and find a workaround. Unless I'm missing something, the writer schema from the ReaderBuilder inference (from the Avro file, with the correct top level name) isn't exposed in a way that we can use in DataFusion. I'm curious if you've had success @getChan? I think this effort is blocked until we can make the arrow-avro modifications. |
|
@nathaniel-d-ef let avro_reader = ReaderBuilder::new().build(BufReader::new(reader))?;
println!("AVRO READER METADATA : {:?}", avro_reader.schema().metadata); // {} |
@getChan Try using let avro_reader = ReaderBuilder::new().build(BufReader::new(reader))?;
let header = avro_reader.avro_header();
println!("\nAVRO HEADER METADATA BYTES: {:?}", header.metadata().collect::<Vec<_>>());
let writer_avro = AvroSchema::new(
std::str::from_utf8(
header
.get(SCHEMA_METADATA_KEY.as_bytes())
.expect("missing avro.schema metadata"),
)
.unwrap()
.to_string(),
);
println!("AVRO HEADER SCHEMA METADATA : {:?}", writer_avro);You should see an output like this: |
jecsand838
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think I found an approach to get this working. I added comments detailing the suggested changes to make and it all seems to work for me locally. With that said, I'm still fairly new to this codebase, so I apologize in advance if I'm missing something.
Let me know what you think and if this solves the projection issue.
Co-authored-by: Connor Sanders <170039284+jecsand838@users.noreply.github.com>
| ---- | ||
| logical_plan TableScan: avro_table projection=[f1, f2, f3] | ||
| physical_plan DataSourceExec: file_groups={1 group: [[WORKSPACE_ROOT/testing/data/avro/simple_enum.avro]]}, projection=[f1, f2, f3], file_type=avro | ||
| physical_plan DataSourceExec: file_groups={4 groups: [[WORKSPACE_ROOT/testing/data/avro/simple_enum.avro:0..103], [WORKSPACE_ROOT/testing/data/avro/simple_enum.avro:103..206], [WORKSPACE_ROOT/testing/data/avro/simple_enum.avro:206..309], [WORKSPACE_ROOT/testing/data/avro/simple_enum.avro:309..411]]}, projection=[f1, f2, f3], file_type=avro |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
this is pretty neat
# Conflicts: # Cargo.lock # Cargo.toml # datafusion/datasource-avro/src/avro_to_arrow/arrow_array_reader.rs # datafusion/datasource-avro/src/avro_to_arrow/schema.rs # datafusion/datasource-avro/src/source.rs
# Conflicts: # Cargo.lock
|
There is a compatibility issue with projection. I'm waiting for a release of arrow-avro that includes the necessary projection features.
|
|
Thanks for the update @getChan If the fix already included in arrow-avro (and you are waiting on a release), you could rebase this PR against this branch #19355 to get access to the pre-release code We would have to wait for the arrow release to actually merge it but it could potentially help unblock your work I actually would love to get some validation that we can cut over to the new arrow-avro reader before we make the next arrow release (so we can fix any issue that might be found) |
@alamb I'm going to start working on apache/arrow-rs#8923 early next week and should have a PR up before Jan 1st. |
Which issue does this PR close?
arrow-avrofor performance and improved type support #14097.todo list
Rationale for this change
What changes are included in this PR?
Are these changes tested?
Are there any user-facing changes?