Skip to content

Conversation

@alamb
Copy link
Contributor

@alamb alamb commented Jan 6, 2026

Which issue does this PR close?

Rationale for this change

rust_decimal is a one person crate and is released somewhat infrequently

It also uses a non trivial number of dependencies, including https://crates.io/crates/rkyv, some sort of zero copy deserialization framework that was recently subject to a RUSTSEC security advisory, see #19656 / #19657

Since rust_decimal is only used for sqllogictests to parse the results from postgres, we can pretty easily remove the dependency on rust_decimal and inline the very small amount functionality we need for sqllogictests

This will both decrease the build time and our dependency trail.

What changes are included in this PR?

Removes the rust_decimal dependency from DataFusion
and inlines the smallest required subset of decimal functionality we need for
sqllogictests (which turns out to be pretty small)

Are these changes tested?

Yes by CI

Are there any user-facing changes?

No, this is all internal testing infrastructure

@alamb alamb added the development-process Related to development process of DataFusion label Jan 6, 2026
@github-actions github-actions bot added the sqllogictest SQL Logic Tests (.slt) label Jan 6, 2026
@alamb alamb changed the title Remove dependency on rust_decimal Remove dependency on rust_decimal, remove ignore of RUSTSEC-2026-0001 Jan 6, 2026
# underlying rkyv is patched, but rustsec database not yet updated
# Can remove when this is merged: https://github.com/rustsec/advisory-db/pull/2565
run: cargo audit --ignore RUSTSEC-2026-0001
# Note: you can ignore specific RUSTSEC issues using the `--ignore` flag ,for example:
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is added in #19657, but unecessary as this PR removes the dependency on rkyv


let start = Instant::now();
let rows = self.get_client().query(sql, &[]).await?;
let messages = self.get_client().simple_query(sql).await?;
Copy link
Contributor Author

@alamb alamb Jan 6, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

drive by cleanup to improve getting types -- use a prepared statement rather than running and looking at the first row

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This now always makes two queries - one for the types and another for the messages.
Before it was making the second query only if there were no messages.
Wouldn't this increase the time to run the test suite ?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

yes, you are right -- I will fix

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I played around with this more -- the reason there are two queries now is that the "simple_query" API returns only text (no types) but the sqllogictest harness needs to get type information

However, if we use the prepared statement API that returns types, we must also then have code to parse the postgres binary numeric representation (what I think we are using the rust_decimal crate for).

On the balance, I think planning the query twice, while it does have additional overhead, is better than a bunch of custom numeric parsing code. Given the small number of queries that are run in this mode I suspect we won't be able to even tell the difference.

I will add comments explaining this rationale

In case we want to avoid the second planning, here is the code that I wrote with codex (I am not 100% sure it correctly implements numeric parsing):

fn convert_rows(rows: &[Row]) -> Vec<Vec<String>> {
    rows.iter()
        .map(|row| {
            row.columns()
                .iter()
                .enumerate()
                .map(|(idx, column)| cell_to_string(row, column, idx))
                .collect::<Vec<String>>()
        })
        .collect::<Vec<_>>()
}

fn cell_to_string(row: &Row, column: &Column, idx: usize) -> String {
    // Decode binary values by Postgres type to keep normalization aligned with
    // the DataFusion engine output.
    match column.type_().clone() {
        Type::CHAR => {
            let value: Option<i8> = row.get(idx);
            value
                .map(|value| value.to_string())
                .unwrap_or_else(|| NULL_STR.to_string())
        }
        Type::INT2 => {
            let value: Option<i16> = row.get(idx);
            value
                .map(|value| value.to_string())
                .unwrap_or_else(|| NULL_STR.to_string())
        }
        Type::INT4 => {
            let value: Option<i32> = row.get(idx);
            value
                .map(|value| value.to_string())
                .unwrap_or_else(|| NULL_STR.to_string())
        }
        Type::INT8 => {
            let value: Option<i64> = row.get(idx);
            value
                .map(|value| value.to_string())
                .unwrap_or_else(|| NULL_STR.to_string())
        }
        Type::NUMERIC => {
            let value: Option<PgNumeric> = row.get(idx);
            match value {
                Some(PgNumeric(PgNumericValue::Number(value))) => decimal_to_str(value),
                Some(PgNumeric(PgNumericValue::NaN)) => "NaN".to_string(),
                None => NULL_STR.to_string(),
            }
        }
        Type::DATE => {
            let value: Option<NaiveDate> = row.get(idx);
            value
                .map(|value| value.to_string())
                .unwrap_or_else(|| NULL_STR.to_string())
        }
        Type::TIME => {
            let value: Option<NaiveTime> = row.get(idx);
            value
                .map(|value| value.to_string())
                .unwrap_or_else(|| NULL_STR.to_string())
        }
        Type::TIMESTAMP => {
            let value: Option<NaiveDateTime> = row.get(idx);
            value
                .map(|value| format!("{value:?}"))
                .unwrap_or_else(|| NULL_STR.to_string())
        }
        Type::BOOL => {
            let value: Option<bool> = row.get(idx);
            value.map(bool_to_str).unwrap_or_else(|| NULL_STR.to_string())
        }
        Type::BPCHAR | Type::VARCHAR | Type::TEXT => {
            let value: Option<&str> = row.get(idx);
            value
                .map(varchar_to_str)
                .unwrap_or_else(|| NULL_STR.to_string())
        }
        Type::FLOAT4 => {
            let value: Option<f32> = row.get(idx);
            value.map(f32_to_str).unwrap_or_else(|| NULL_STR.to_string())
        }
        Type::FLOAT8 => {
            let value: Option<f64> = row.get(idx);
            value.map(f64_to_str).unwrap_or_else(|| NULL_STR.to_string())
        }
        Type::REGTYPE => {
            let value: Option<u32> = row.get(idx);
            value
                .and_then(Type::from_oid)
                .map(|value| value.to_string())
                .unwrap_or_else(|| NULL_STR.to_string())
        }
        _ => unimplemented!("Unsupported type: {}", column.type_().name()),
    }
}


/// Minimal `NUMERIC` decoder for Postgres' binary format.
///
/// We use it in sqllogictest to avoid pulling in `rust_decimal` while still
/// converting `NUMERIC` values into `BigDecimal` for normalized output.
///
/// Verification:
/// - The binary layout and field meanings come from the Postgres protocol docs:
///   https://www.postgresql.org/docs/current/protocol-message-formats.html#PROTOCOL-MESSAGE-FORMATS-NUMERIC
/// - You can sanity-check correctness by comparing decoded values against the
///   server's text format, e.g. run `SELECT value::numeric, value::text` in
///   Postgres and ensure this parser produces the same normalized string.
enum PgNumericValue {
    Number(BigDecimal),
    NaN,
}

/// Wrapper type that implements `FromSql` for Postgres `NUMERIC` values.
///
/// This is intentionally scoped to the sqllogictest engine so we can parse
/// binary `NUMERIC` results in a single query without extra dependencies.
struct PgNumeric(PgNumericValue);

impl<'a> postgres_types::FromSql<'a> for PgNumeric {
    fn from_sql(
        ty: &Type,
        raw: &'a [u8],
    ) -> Result<Self, Box<dyn std::error::Error + Sync + Send>> {
        if *ty != Type::NUMERIC {
            return Err("invalid type".into());
        }

        let mut buf = raw;
        let ndigits = read_i16(&mut buf)? as i64;
        let weight = read_i16(&mut buf)? as i64;
        let sign = read_i16(&mut buf)?;
        let _dscale = read_i16(&mut buf)?;

        const NUMERIC_POS: i16 = 0x0000;
        const NUMERIC_NEG: i16 = 0x4000;
        const NUMERIC_NAN: i16 = 0xC000u16 as i16;

        if sign == NUMERIC_NAN {
            return Ok(PgNumeric(PgNumericValue::NaN));
        }

        let mut bigint = BigInt::from(0);
        for _ in 0..ndigits {
            let digit = read_i16(&mut buf)? as i64;
            bigint = bigint * 10000 + digit;
        }

        if sign == NUMERIC_NEG {
            bigint = -bigint;
        } else if sign != NUMERIC_POS {
            return Err(format!("invalid numeric sign {sign}").into());
        }

        let scale = if ndigits == 0 {
            0
        } else {
            (ndigits - (weight + 1)) * 4
        };
        let value = BigDecimal::from_bigint(bigint, scale);

        Ok(PgNumeric(PgNumericValue::Number(value)))
    }

    fn accepts(ty: &Type) -> bool {
        *ty == Type::NUMERIC
    }
}

fn read_i16(buf: &mut &[u8]) -> Result<i16, Box<dyn std::error::Error + Sync + Send>> {
    if buf.len() < 2 {
        return Err("invalid buffer size".into());
    }
    let val = i16::from_be_bytes([buf[0], buf[1]]);
    *buf = &buf[2..];
    Ok(val)
}

fn convert_types(types: Vec<Type>) -> Vec<DFColumnType> {
    types
        .into_iter()
        .map(|t| match t {
            Type::BOOL => DFColumnType::Boolean,
            Type::INT2 | Type::INT4 | Type::INT8 => DFColumnType::Integer,
            Type::BPCHAR | Type::VARCHAR | Type::TEXT => DFColumnType::Text,
            Type::FLOAT4 | Type::FLOAT8 | Type::NUMERIC => DFColumnType::Float,
            Type::DATE | Type::TIME => DFColumnType::DateTime,
            Type::TIMESTAMP => DFColumnType::Timestamp,
            _ => DFColumnType::Another,
        })
        .collect()
}

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I suspect we won't be able to even tell the difference.

👍🏻
If the execution time of the .slt tests is similar as before then there is no need to optimize further.

.map(|d| format!("{d:?}"))
.unwrap_or_else(|| "NULL".to_string())
fn cell_to_string(row: &SimpleQueryRow, column_type: &Type, idx: usize) -> String {
// simple_query returns text values, so we parse by Postgres type to keep
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Left in unwrap() so we fail fast in case there is some problem converting

"syn 2.0.113",
]

[[package]]
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

rust_decimal has many dependencies it turns out, so removing it results in quite a few less dependencies in datafusion

Copy link
Contributor

@mhilton mhilton left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

@alamb alamb marked this pull request as ready for review January 6, 2026 14:53
github-merge-queue bot pushed a commit that referenced this pull request Jan 6, 2026
## Which issue does this PR close?

<!--
We generally require a GitHub issue to be filed for all bug fixes and
enhancements and this helps us generate change logs for our releases.
You can link an issue to this PR using the GitHub syntax. For example
`Closes #123` indicates that this PR will close issue #123.
-->

- Follow on to #19666

## Rationale for this change

While working on #19666 I
noticed there were `149` dependencies behind the latest

```shell
andrewlamb@Andrews-MacBook-Pro-3:~/Software/datafusion2$ cargo update -p arrow
    Updating crates.io index
     Locking 0 packages to latest compatible versions
note: pass `--verbose` to see 149 unchanged dependencies behind latest
```

## What changes are included in this PR?

Run `cargo update` and check in the result

## Are these changes tested?

By CI
## Are there any user-facing changes?

<!--
If there are user-facing changes then we may require documentation to be
updated before approving the PR.
-->

<!--
If there are any breaking changes to public APIs, please add the `api
change` label.
-->
Copy link
Contributor

@comphead comphead left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks @alamb

Copy link
Member

@xudong963 xudong963 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I also encoutered RUSTSEC issue when working on the PR #19661.

Thanks for the fix, do you mind cherry-picking the PR to branch-52 after it's merged? Then I'll update the PR19661

@alamb
Copy link
Contributor Author

alamb commented Jan 7, 2026

I also encoutered RUSTSEC issue when working on the PR #19661.

Thanks for the fix, do you mind cherry-picking the PR to branch-52 after it's merged? Then I'll update the PR19661

Sure!

I think we can probably backport #19657 to the 52 line temporarily -- I will do so

@alamb alamb added this pull request to the merge queue Jan 7, 2026
Merged via the queue into apache:main with commit 3a0ca4e Jan 7, 2026
31 checks passed
@alamb
Copy link
Contributor Author

alamb commented Jan 7, 2026

Thank you again @martin-g for your insightful review

alamb added a commit to alamb/datafusion that referenced this pull request Jan 7, 2026
…001` (apache#19666)

- Part of apache#19656

rust_decimal is a one person crate and is released somewhat infrequently
* https://github.com/paupino/rust-decimal
* https://crates.io/crates/rust_decimal

It also uses a non trivial number of dependencies, including
https://crates.io/crates/rkyv, some sort of zero copy deserialization
framework that was recently subject to a RUSTSEC security advisory, see
apache#19656 /
apache#19657

Since `rust_decimal` is only used for sqllogictests to parse the results
from postgres, we can pretty easily remove the dependency on
`rust_decimal` and inline the very small amount functionality we need
for sqllogictests

This will both decrease the build time and our dependency trail.

Removes the `rust_decimal` dependency from DataFusion
and inlines the smallest required subset of decimal functionality we
need for
sqllogictests (which turns out to be pretty small)

Yes by CI

No, this is all internal testing infrastructure
@alamb
Copy link
Contributor Author

alamb commented Jan 7, 2026

@xudong963 here is the backport PR:

xudong963 pushed a commit that referenced this pull request Jan 8, 2026
…STSEC-2026-0001` (#19666) (#19686)

## Which issue does this PR close?

- part of #18566

## Rationale for this change

Let's resolve the audit workflow on branch-52 by removing our dependency
on the `rust_decimal` library, per @xudong963 's request here
#19666 (review)

## What changes are included in this PR?

- Backport  #19666 to branch-52

## Are these changes tested?

By CI

## Are there any user-facing changes?
No, this is an internal dev tool change
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

development-process Related to development process of DataFusion sqllogictest SQL Logic Tests (.slt)

Projects

None yet

Development

Successfully merging this pull request may close these issues.

5 participants