Commit d42ff16

fread: consider quoted na.strings in text columns (#7068)
* fread: consider quoted na.strings in text columns

Previously, Field() only called end_NA_string() for non-quoted fields, making it impossible to set na.strings='""' and parse empty quoted strings as missing.

Fixes: #6974

* NEWS item

---------

Co-authored-by: Michael Chirico <chiricom@google.com>
1 parent b88ffa6 commit d42ff16

3 files changed: +21 -0 lines changed

NEWS.md

Lines changed: 2 additions & 0 deletions
@@ -74,6 +74,8 @@

 15. Including an `ITime` object as a named input to `data.frame()` respects the provided name, i.e. `data.frame(a = as.ITime(...))` will have column `a`, [#4673](https://github.com/Rdatatable/data.table/issues/4673). Thanks @shrektan for the report and @MichaelChirico for the fix.

+16. `fread()` now handles the `na.strings` argument for quoted text columns, making it possible to specify `na.strings = '""'` and read empty quoted strings as `NA`s, [#6974](https://github.com/Rdatatable/data.table/issues/6974). Thanks to @AngelFelizR for the report and @aitap for the PR.
+
 ### NOTES

 1. Continued work to remove non-API C functions, [#6180](https://github.com/Rdatatable/data.table/issues/6180). Thanks Ivan Krylov for the PRs and for writing a clear and concise guide about the R API: https://aitap.codeberg.page/R-api/.

inst/tests/tests.Rraw

Lines changed: 15 additions & 0 deletions
@@ -21276,3 +21276,18 @@ test(2324.2,
   rollup(DT, j = sum(value) + ..sets, by=c("color","year","status"), label="total"),
   rollup(DT, j = sum(value), by=c("color","year","status"), label="total")
 )
+
+# allow na.strings to be quoted, #6974
+f = tempfile()
+DT = data.table(
+  "Date Example"=c("12/5/2012", NA),
+  "Question 1"=c("Yes", NA),
+  "Question 2"=c("Yes", NA),
+  "Site: Country"=c("Chile", "Virgin Islands, British")
+)
+fwrite(DT, f, na='""')
+test(2325.1, fread(f, na.strings='""'), DT)
+unlink(f)
+test(2325.2,
+  fread('"foo","bar","baz"\n"a","b","c"', na.strings=c('"foo"', '"bar"', '"baz"'), header=FALSE),
+  data.table(V1=c(NA, "a"), V2=c(NA, "b"), V3=c(NA, "c")))

src/fread.c

Lines changed: 4 additions & 0 deletions
@@ -515,6 +515,8 @@ static void Field(FieldParseContext *ctx)
   // the field is quoted and quotes are correctly escaped (quoteRule 0 and 1)
   // or the field is quoted but quotes are not escaped (quoteRule 2)
   // or the field is not quoted but the data contains a quote at the start (quoteRule 2 too)
+  // What if this string signifies an NA? Will find out after we're done parsing quotes
+  const char *field_after_NA = end_NA_string(fieldStart);
   fieldStart++; // step over opening quote
   switch(quoteRule) {
   case 0: // quoted with embedded quotes doubled; the final unescaped " must be followed by sep|eol
@@ -573,6 +575,8 @@ static void Field(FieldParseContext *ctx)
     if (ch == eof && quoteRule != 2) { target->off--; target->len++; } // test 1324 where final field has open quote but not ending quote; include the open quote like quote rule 2
     while(target->len > 0 && ((ch[-1] == ' ' && stripWhite) || ch[-1] == '\0')) { target->len--; ch--; } // test 1551.6; trailing whitespace in field [67,V37] == "\"\"A\"\" ST "
   }
+  // Does end-of-field correspond to end-of-possible-NA?
+  if (field_after_NA == ch) target->len = INT32_MIN;
 }

 static void str_to_i32_core(const char **pch, int32_t *target, bool parse_date)
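
For readers who want to experiment with the idea outside of fread.c, here is a minimal standalone sketch of the technique the hunks above appear to use: record where a candidate NA string would end, counting from the opening quote, then parse the quoted field as usual and mark it missing only if parsing stopped at exactly that position. This is not the code from the commit. `StrField`, `NA_LEN`, `match_na`, `parse_field` and `parse_line` are hypothetical names invented for illustration, the input is assumed to be a NUL-terminated single line, and only doubled-quote escaping is handled.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical stand-in for a string target: offset and length into the buffer. */
typedef struct { int32_t len; int32_t off; } StrField;

/* Sentinel length marking a missing value; the diff assigns INT32_MIN the same way. */
#define NA_LEN INT32_MIN

/* If any NA string matches at `start`, return the end of the longest match;
   otherwise return `start` itself (assumed to mirror end_NA_string()). */
static const char *match_na(const char *start, const char *const *na, size_t n_na) {
  const char *best = start;
  for (size_t i = 0; i < n_na; i++) {
    size_t len = strlen(na[i]);
    if (strncmp(start, na[i], len) == 0 && start + len > best) best = start + len;
  }
  return best;
}

/* Parse one field at *pch and advance *pch past it. */
static void parse_field(const char *buf, const char **pch, char sep,
                        const char *const *na, size_t n_na, StrField *out) {
  const char *ch = *pch;
  if (*ch != '"') {                       /* unquoted field */
    const char *start = ch;
    while (*ch != '\0' && *ch != sep) ch++;
    out->off = (int32_t)(start - buf);
    out->len = (int32_t)(ch - start);
    if (ch > start && match_na(start, na, n_na) == ch) out->len = NA_LEN;
    *pch = ch;
    return;
  }
  /* Quoted field: remember where an NA string would end, counting from
     the opening quote, before stepping over that quote. */
  const char *after_na = match_na(ch, na, n_na);
  ch++;                                   /* step over opening quote */
  const char *start = ch;
  while (*ch != '\0') {
    if (*ch == '"') {
      if (ch[1] == '"') { ch += 2; continue; }  /* doubled quote stays in the field */
      break;                              /* closing quote */
    }
    ch++;
  }
  out->off = (int32_t)(start - buf);
  out->len = (int32_t)(ch - start);
  if (*ch == '"') ch++;                   /* step over closing quote */
  /* With no match, after_na still points at the opening quote, so this is false. */
  if (after_na == ch) out->len = NA_LEN;
  *pch = ch;
}

/* Parse up to max_fields fields from one line; returns the number parsed. */
static int parse_line(const char *line, char sep, const char *const *na,
                      size_t n_na, StrField *out, int max_fields) {
  const char *ch = line;
  int n = 0;
  while (n < max_fields) {
    parse_field(line, &ch, sep, na, n_na, &out[n++]);
    if (*ch != sep) break;                /* end of line, or junk after a closing quote */
    ch++;                                 /* step over separator */
  }
  return n;
}
```

With the NA strings `""` (two quote characters) and `NA`, parsing the line `"","a",NA,"NA"` in this sketch marks the first and third fields as missing. The second field keeps length 1, and the quoted `"NA"` keeps length 2 because the NA list contains `NA` without quotes, so the match starting at the opening quote fails.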
