Commit 35377a8

Detect CSV/TSV column types by default (#683)
The `--detect-types` option is now turned on by default for all commands that deal with CSV or TSV. A new `--no-detect-types` option can be used to have all columns treated as text. Closes #679
1 parent 0bbc680 commit 35377a8

7 files changed: +138 -48 lines changed

docs/changelog.rst

Lines changed: 1 addition & 0 deletions
@@ -11,6 +11,7 @@ Unreleased

 - The ``table.insert_all()`` and ``table.upsert_all()`` methods can now accept an iterator of lists or tuples as an alternative to dictionaries. The first item should be a list/tuple of column names. See :ref:`python_api_insert_lists` for details. (:issue:`672`)
 - **Breaking change:** The default floating point column type has been changed from ``FLOAT`` to ``REAL``, which is the correct SQLite type for floating point values. This affects auto-detected columns when inserting data. (:issue:`645`)
+- **Breaking change:** Type detection is now the default behavior for the ``insert`` and ``upsert`` CLI commands when importing CSV or TSV data. Previously all columns were treated as ``TEXT`` unless the ``--detect-types`` flag was passed. Use the new ``--no-detect-types`` flag to restore the old behavior. The ``SQLITE_UTILS_DETECT_TYPES`` environment variable has been removed. (:issue:`679`)

 .. _v4_0a0:
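Because the ``SQLITE_UTILS_DETECT_TYPES`` environment variable is removed, scripts that relied on the old default may need to pass ``--no-detect-types`` explicitly. The following is a minimal sketch of a transitional wrapper, not part of sqlite-utils: the function names are hypothetical, and the rule for which variable values count as "on" is a simplifying assumption rather than the exact parsing the old option used.

```python
def legacy_compatible_flags(environ):
    """Return extra CLI flags that reproduce the pre-change default.

    Hypothetical helper, not part of sqlite-utils.
    """
    # Assumption: any non-empty value other than "0" used to mean "detect types".
    wanted_detection = environ.get("SQLITE_UTILS_DETECT_TYPES", "") not in ("", "0")
    # Detection is now the default, so only the opt-out flag is ever needed.
    return [] if wanted_detection else ["--no-detect-types"]


def build_insert_command(db_path, table, csv_path, environ):
    """Build an argument list for ``sqlite-utils insert ... --csv``."""
    base = ["sqlite-utils", "insert", db_path, table, csv_path, "--csv"]
    return base + legacy_compatible_flags(environ)


command = build_insert_command("creatures.db", "creatures", "creatures.csv", {})
print(command)
```

Passing ``os.environ`` as ``environ`` and handing the result to ``subprocess.run`` would keep an older script importing everything as ``TEXT`` unless the variable is set.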

docs/cli-reference.rst

Lines changed: 4 additions & 2 deletions
@@ -285,7 +285,8 @@ See :ref:`cli_inserting_data`, :ref:`cli_insert_csv_tsv`, :ref:`cli_insert_unstr
       --alter                   Alter existing table to add any missing columns
       --not-null TEXT           Columns that should be created as NOT NULL
       --default <TEXT TEXT>...  Default value that should be set for a column
-      -d, --detect-types        Detect types for columns in CSV/TSV data
+      -d, --detect-types        Detect types for columns in CSV/TSV data (default)
+      --no-detect-types         Treat all CSV/TSV columns as TEXT
       --analyze                 Run ANALYZE at the end of this operation
       --load-extension TEXT     Path to SQLite extension, with optional :entrypoint
       --silent                  Do not show progress bar
@@ -342,7 +343,8 @@ See :ref:`cli_upsert`.
       --alter                   Alter existing table to add any missing columns
       --not-null TEXT           Columns that should be created as NOT NULL
       --default <TEXT TEXT>...  Default value that should be set for a column
-      -d, --detect-types        Detect types for columns in CSV/TSV data
+      -d, --detect-types        Detect types for columns in CSV/TSV data (default)
+      --no-detect-types         Treat all CSV/TSV columns as TEXT
       --analyze                 Run ANALYZE at the end of this operation
       --load-extension TEXT     Path to SQLite extension, with optional :entrypoint
       --silent                  Do not show progress bar
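The two options documented above interact in a simple way: ``--detect-types`` is still accepted but no longer changes anything, and only ``--no-detect-types`` turns detection off. A minimal sketch of that resolution follows. It uses ``argparse`` from the standard library as a stand-in for the ``click`` options in the actual commit, and the names ``build_parser`` and ``should_detect_types`` are hypothetical.

```python
import argparse


def build_parser():
    """Parser with the same pair of flags (argparse stand-in for click)."""
    parser = argparse.ArgumentParser(prog="insert-sketch")
    parser.add_argument(
        "-d",
        "--detect-types",
        action="store_true",
        help="Detect types for columns in CSV/TSV data (default)",
    )
    parser.add_argument(
        "--no-detect-types",
        action="store_true",
        help="Treat all CSV/TSV columns as TEXT",
    )
    return parser


def should_detect_types(argv):
    """Detection is on unless --no-detect-types is given."""
    args = build_parser().parse_args(argv)
    # args.detect_types is parsed for backwards compatibility but not consulted.
    return not args.no_detect_types


for argv in ([], ["-d"], ["--no-detect-types"], ["-d", "--no-detect-types"]):
    print(argv, should_detect_types(argv))
```

Under this rule, passing both flags together disables detection, since the opt-out is the only flag that is checked.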

docs/cli.rst

Lines changed: 6 additions & 6 deletions
@@ -508,7 +508,7 @@ Incoming CSV data will be assumed to use ``utf-8``. If your data uses a differen

 If you are joining across multiple CSV files they must all use the same encoding.

-Column types will be automatically detected in CSV or TSV data, using the same mechanism as ``--detect-types`` described in :ref:`cli_insert_csv_tsv`. You can pass the ``--no-detect-types`` option to disable this automatic type detection and treat all CSV and TSV columns as ``TEXT``.
+Column types will be automatically detected in CSV or TSV data, as described in :ref:`cli_insert_csv_tsv`. You can pass the ``--no-detect-types`` option to disable this automatic type detection and treat all CSV and TSV columns as ``TEXT``.

 .. _cli_memory_explicit:

@@ -1263,7 +1263,7 @@ To stop inserting after a specified number of records - useful for getting a fas

 A progress bar is displayed when inserting data from a file. You can hide the progress bar using the ``--silent`` option.

-By default every column inserted from a CSV or TSV file will be of type ``TEXT``. To automatically detect column types - resulting in a mix of ``TEXT``, ``INTEGER`` and ``REAL`` columns, use the ``--detect-types`` option (or its shortcut ``-d``).
+By default, column types are automatically detected for CSV or TSV files - resulting in a mix of ``TEXT``, ``INTEGER`` and ``REAL`` columns. To disable type detection and treat all columns as ``TEXT``, use the ``--no-detect-types`` option.

 For example, given a ``creatures.csv`` file containing this:

@@ -1277,9 +1277,9 @@ The following command:

 .. code-block:: bash

-    sqlite-utils insert creatures.db creatures creatures.csv --csv --detect-types
+    sqlite-utils insert creatures.db creatures creatures.csv --csv

-Will produce this schema:
+Will produce this schema with automatically detected types:

 .. code-block:: bash

@@ -1293,11 +1293,11 @@ Will produce this schema:
        "weight" REAL
     );

-You can set the ``SQLITE_UTILS_DETECT_TYPES`` environment variable if you want ``--detect-types`` to be the default behavior:
+To disable type detection and treat all columns as TEXT, use ``--no-detect-types``:

 .. code-block:: bash

-    export SQLITE_UTILS_DETECT_TYPES=1
+    sqlite-utils insert creatures.db creatures creatures.csv --csv --no-detect-types

 If a CSV or TSV file includes empty cells, like this one:
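To illustrate why the new default matters in practice, here is a small sketch using only the standard library ``sqlite3`` and ``csv`` modules, not sqlite-utils itself. It loads the same CSV text into one table with ``INTEGER``/``REAL`` columns and one with all ``TEXT`` columns, then compares the sort order. The ``load`` helper and table names are hypothetical, and Dori's age is changed to 10 here so that the ordering difference is visible.

```python
import csv
import io
import sqlite3

# Assumption: sample data adapted from the creatures example (age 10 instead of 1).
CSV_DATA = "name,age,weight\nCleo,6,45.5\nDori,10,3.5\n"


def load(conn, table, column_types):
    """Create a table with the given column types and insert the CSV rows."""
    columns_sql = ", ".join(f'"{name}" {ctype}' for name, ctype in column_types.items())
    conn.execute(f'CREATE TABLE "{table}" ({columns_sql})')
    rows = list(csv.DictReader(io.StringIO(CSV_DATA)))  # every value is a str
    conn.executemany(f'INSERT INTO "{table}" VALUES (:name, :age, :weight)', rows)


conn = sqlite3.connect(":memory:")
load(conn, "typed", {"name": "TEXT", "age": "INTEGER", "weight": "REAL"})
load(conn, "untyped", {"name": "TEXT", "age": "TEXT", "weight": "TEXT"})

# Column affinity converts "6" and "10" to integers in the typed table.
typed_ages = [row[0] for row in conn.execute("SELECT age FROM typed ORDER BY age")]
# In the all-TEXT table the values stay strings and sort lexicographically.
untyped_ages = [row[0] for row in conn.execute("SELECT age FROM untyped ORDER BY age")]

print(typed_ages)    # [6, 10]
print(untyped_ages)  # ['10', '6']
```

Queries that sort or compare on numeric columns can therefore return different results after upgrading, unless ``--no-detect-types`` is passed.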

sqlite_utils/cli.py

Lines changed: 14 additions & 3 deletions
@@ -898,8 +898,12 @@ def inner(fn):
                     "-d",
                     "--detect-types",
                     is_flag=True,
-                    envvar="SQLITE_UTILS_DETECT_TYPES",
-                    help="Detect types for columns in CSV/TSV data",
+                    help="Detect types for columns in CSV/TSV data (default)",
+                ),
+                click.option(
+                    "--no-detect-types",
+                    is_flag=True,
+                    help="Treat all CSV/TSV columns as TEXT",
                 ),
                 click.option(
                     "--analyze",
@@ -951,6 +955,7 @@ def insert_upsert_implementation(
     not_null=None,
     default=None,
     detect_types=None,
+    no_detect_types=False,
     analyze=False,
     load_extension=None,
     silent=False,
@@ -1019,7 +1024,8 @@ def insert_upsert_implementation(
                 )
             else:
                 docs = (dict(zip(headers, row)) for row in reader)
-            if detect_types:
+            # detect_types is now the default, unless --no-detect-types is passed
+            if not no_detect_types:
                 tracker = TypeTracker()
                 docs = tracker.wrap(docs)
         elif lines:
@@ -1191,6 +1197,7 @@ def insert(
     stop_after,
     alter,
     detect_types,
+    no_detect_types,
     analyze,
     load_extension,
     silent,
@@ -1273,6 +1280,7 @@ def insert(
             replace=replace,
             truncate=truncate,
             detect_types=detect_types,
+            no_detect_types=no_detect_types,
             analyze=analyze,
             load_extension=load_extension,
             silent=silent,
@@ -1311,6 +1319,7 @@ def upsert(
     not_null,
     default,
     detect_types,
+    no_detect_types,
     analyze,
     load_extension,
     silent,
@@ -1356,6 +1365,7 @@ def upsert(
             not_null=not_null,
             default=default,
             detect_types=detect_types,
+            no_detect_types=no_detect_types,
             analyze=analyze,
             load_extension=load_extension,
             silent=silent,
@@ -1443,6 +1453,7 @@ def bulk(
             not_null=set(),
             default={},
             detect_types=False,
+            no_detect_types=True,
             load_extension=load_extension,
             silent=False,
             bulk_sql=sql,
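The diff above only shows that rows are passed through ``TypeTracker().wrap(docs)`` when detection is enabled. The sketch below illustrates the general idea of a wrapping tracker that observes string values as they stream past and narrows each column to integer, float or text. It is a simplified stand-in written for this note, not the library's ``TypeTracker``; the class name ``SimpleTypeTracker``, the ``types`` property and the handling of empty cells are all assumptions.

```python
class SimpleTypeTracker:
    """Hypothetical, simplified stand-in for a streaming column type tracker."""

    def __init__(self):
        self.candidates = {}  # column name -> set of types still possible
        self.seen = set()     # columns with at least one non-empty value

    @staticmethod
    def _parses(value, caster):
        try:
            caster(value)
            return True
        except (TypeError, ValueError):
            return False

    def wrap(self, rows):
        """Yield rows unchanged while recording which types each column allows."""
        for row in rows:
            for column, value in row.items():
                possible = self.candidates.setdefault(column, {"integer", "float"})
                if value is None or value == "":
                    continue  # assumption: empty cells do not rule anything out
                self.seen.add(column)
                if not self._parses(value, int):
                    possible.discard("integer")
                if not self._parses(value, float):
                    possible.discard("float")
            yield row

    @property
    def types(self):
        """Narrowest type per column; only complete once wrap() is exhausted."""
        result = {}
        for column, possible in self.candidates.items():
            if column not in self.seen:
                result[column] = "text"
            elif "integer" in possible:
                result[column] = "integer"
            elif "float" in possible:
                result[column] = "float"
            else:
                result[column] = "text"
        return result


rows = [
    {"name": "Cleo", "age": "6", "weight": "45.5"},
    {"name": "Dori", "age": "1", "weight": "3.5"},
]
tracker = SimpleTypeTracker()
consumed = list(tracker.wrap(rows))
print(tracker.types)
```

Wrapping the iterator keeps the import streaming: the rows are read once, and the detected types are available after the last row has been consumed. Python's ``int()`` and ``float()`` accept some inputs a stricter detector might reject (for example ``"1_000"`` or ``"nan"``), so this sketch should not be read as matching the library's rules.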

tests/test_cli.py

Lines changed: 109 additions & 33 deletions
@@ -4,7 +4,6 @@
 from pathlib import Path
 import subprocess
 import sys
-from unittest import mock
 import json
 import os
 import pytest
@@ -1907,7 +1906,16 @@ def test_insert_encoding(tmpdir):
     # Using --encoding=latin-1 should work
     good_result = CliRunner().invoke(
         cli.cli,
-        ["insert", db_path, "places", csv_path, "--encoding", "latin-1", "--csv"],
+        [
+            "insert",
+            db_path,
+            "places",
+            csv_path,
+            "--encoding",
+            "latin-1",
+            "--csv",
+            "--no-detect-types",
+        ],
         catch_exceptions=False,
     )
     assert good_result.exit_code == 0
@@ -2196,7 +2204,7 @@ def test_import_no_headers(tmpdir, args, tsv):
         csv_file.write("Tracy{sep}Spider{sep}7\n".format(sep=sep))
     result = CliRunner().invoke(
         cli.cli,
-        ["insert", db_path, "creatures", csv_path] + args,
+        ["insert", db_path, "creatures", csv_path] + args + ["--no-detect-types"],
         catch_exceptions=False,
     )
     assert result.exit_code == 0, result.output
@@ -2245,13 +2253,22 @@ def test_csv_insert_bom(tmpdir):
         fp.write(b"\xef\xbb\xbfname,age\nCleo,5")
     result = CliRunner().invoke(
         cli.cli,
-        ["insert", db_path, "broken", bom_csv_path, "--encoding", "utf-8", "--csv"],
+        [
+            "insert",
+            db_path,
+            "broken",
+            bom_csv_path,
+            "--encoding",
+            "utf-8",
+            "--csv",
+            "--no-detect-types",
+        ],
         catch_exceptions=False,
     )
     assert result.exit_code == 0
     result2 = CliRunner().invoke(
         cli.cli,
-        ["insert", db_path, "fixed", bom_csv_path, "--csv"],
+        ["insert", db_path, "fixed", bom_csv_path, "--csv", "--no-detect-types"],
         catch_exceptions=False,
     )
     assert result2.exit_code == 0
@@ -2263,43 +2280,40 @@ def test_csv_insert_bom(tmpdir):
     ]


-@pytest.mark.parametrize("option_or_env_var", (None, "-d", "--detect-types"))
-def test_insert_detect_types(tmpdir, option_or_env_var):
+@pytest.mark.parametrize("option", (None, "-d", "--detect-types"))
+def test_insert_detect_types(tmpdir, option):
+    """Test that type detection is now the default behavior"""
     db_path = str(tmpdir / "test.db")
     data = "name,age,weight\nCleo,6,45.5\nDori,1,3.5"
     extra = []
-    if option_or_env_var:
-        extra = [option_or_env_var]
+    if option:
+        extra = [option]

-    def _test():
-        result = CliRunner().invoke(
-            cli.cli,
-            ["insert", db_path, "creatures", "-", "--csv"] + extra,
-            catch_exceptions=False,
-            input=data,
-        )
-        assert result.exit_code == 0
-        db = Database(db_path)
-        assert list(db["creatures"].rows) == [
-            {"name": "Cleo", "age": 6, "weight": 45.5},
-            {"name": "Dori", "age": 1, "weight": 3.5},
-        ]
-
-    if option_or_env_var is None:
-        # Use environment variable instead of option
-        with mock.patch.dict(os.environ, {"SQLITE_UTILS_DETECT_TYPES": "1"}):
-            _test()
-    else:
-        _test()
+    result = CliRunner().invoke(
+        cli.cli,
+        ["insert", db_path, "creatures", "-", "--csv"] + extra,
+        catch_exceptions=False,
+        input=data,
+    )
+    assert result.exit_code == 0
+    db = Database(db_path)
+    assert list(db["creatures"].rows) == [
+        {"name": "Cleo", "age": 6, "weight": 45.5},
+        {"name": "Dori", "age": 1, "weight": 3.5},
+    ]


-@pytest.mark.parametrize("option", ("-d", "--detect-types"))
+@pytest.mark.parametrize("option", (None, "-d", "--detect-types"))
 def test_upsert_detect_types(tmpdir, option):
+    """Test that type detection is now the default behavior for upsert"""
     db_path = str(tmpdir / "test.db")
     data = "id,name,age,weight\n1,Cleo,6,45.5\n2,Dori,1,3.5"
+    extra = []
+    if option:
+        extra = [option]
     result = CliRunner().invoke(
         cli.cli,
-        ["upsert", db_path, "creatures", "-", "--csv", "--pk", "id"] + [option],
+        ["upsert", db_path, "creatures", "-", "--csv", "--pk", "id"] + extra,
         catch_exceptions=False,
         input=data,
     )
@@ -2312,12 +2326,12 @@ def test_upsert_detect_types(tmpdir, option):


 def test_csv_detect_types_creates_real_columns(tmpdir):
-    """Test that CSV import with --detect-types creates REAL columns for floats"""
+    """Test that CSV import creates REAL columns for floats (default behavior)"""
     db_path = str(tmpdir / "test.db")
     data = "name,age,weight\nCleo,6,45.5\nDori,1,3.5"
     result = CliRunner().invoke(
         cli.cli,
-        ["insert", db_path, "creatures", "-", "--csv", "--detect-types"],
+        ["insert", db_path, "creatures", "-", "--csv"],
         catch_exceptions=False,
         input=data,
     )
@@ -2333,6 +2347,68 @@ def test_csv_detect_types_creates_real_columns(tmpdir):
     )


+def test_insert_no_detect_types(tmpdir):
+    """Test that --no-detect-types treats all columns as TEXT"""
+    db_path = str(tmpdir / "test.db")
+    data = "name,age,weight\nCleo,6,45.5\nDori,1,3.5"
+    result = CliRunner().invoke(
+        cli.cli,
+        ["insert", db_path, "creatures", "-", "--csv", "--no-detect-types"],
+        catch_exceptions=False,
+        input=data,
+    )
+    assert result.exit_code == 0
+    db = Database(db_path)
+    # All columns should be TEXT when --no-detect-types is used
+    assert list(db["creatures"].rows) == [
+        {"name": "Cleo", "age": "6", "weight": "45.5"},
+        {"name": "Dori", "age": "1", "weight": "3.5"},
+    ]
+    assert db["creatures"].schema == (
+        'CREATE TABLE "creatures" (\n'
+        '   "name" TEXT,\n'
+        '   "age" TEXT,\n'
+        '   "weight" TEXT\n'
+        ")"
+    )
+
+
+def test_upsert_no_detect_types(tmpdir):
+    """Test that --no-detect-types treats all columns as TEXT for upsert"""
+    db_path = str(tmpdir / "test.db")
+    data = "id,name,age,weight\n1,Cleo,6,45.5\n2,Dori,1,3.5"
+    result = CliRunner().invoke(
+        cli.cli,
+        [
+            "upsert",
+            db_path,
+            "creatures",
+            "-",
+            "--csv",
+            "--pk",
+            "id",
+            "--no-detect-types",
+        ],
+        catch_exceptions=False,
+        input=data,
+    )
+    assert result.exit_code == 0
+    db = Database(db_path)
+    # All columns should be TEXT when --no-detect-types is used
+    assert list(db["creatures"].rows) == [
+        {"id": "1", "name": "Cleo", "age": "6", "weight": "45.5"},
+        {"id": "2", "name": "Dori", "age": "1", "weight": "3.5"},
+    ]
+    assert db["creatures"].schema == (
+        'CREATE TABLE "creatures" (\n'
+        '   "id" TEXT PRIMARY KEY,\n'
+        '   "name" TEXT,\n'
+        '   "age" TEXT,\n'
+        '   "weight" TEXT\n'
+        ")"
+    )
+
+
 def test_integer_overflow_error(tmpdir):
     db_path = str(tmpdir / "test.db")
     result = CliRunner().invoke(

tests/test_cli_insert.py

Lines changed: 3 additions & 3 deletions
@@ -227,7 +227,7 @@ def test_insert_csv_tsv(content, options, db_path, tmpdir):
         fp.write(content)
     result = CliRunner().invoke(
         cli.cli,
-        ["insert", db_path, "data", file_path] + options,
+        ["insert", db_path, "data", file_path] + options + ["--no-detect-types"],
         catch_exceptions=False,
     )
     assert result.exit_code == 0
@@ -236,7 +236,7 @@ def test_insert_csv_tsv(content, options, db_path, tmpdir):

 @pytest.mark.parametrize("empty_null", (True, False))
 def test_insert_csv_empty_null(db_path, empty_null):
-    options = ["--csv"]
+    options = ["--csv", "--no-detect-types"]
     if empty_null:
         options.append("--empty-null")
     result = CliRunner().invoke(
@@ -430,7 +430,7 @@ def test_insert_text(db_path):
     "options,input",
     (
         ([], '[{"id": "1", "name": "Bob"}, {"id": "2", "name": "Cat"}]'),
-        (["--csv"], "id,name\n1,Bob\n2,Cat"),
+        (["--csv", "--no-detect-types"], "id,name\n1,Bob\n2,Cat"),
         (["--nl"], '{"id": "1", "name": "Bob"}\n{"id": "2", "name": "Cat"}'),
     ),
 )

tests/test_sniff.py

Lines changed: 1 addition & 1 deletion
@@ -12,7 +12,7 @@ def test_sniff(tmpdir, filepath):
     runner = CliRunner()
     result = runner.invoke(
         cli.cli,
-        ["insert", db_path, "creatures", str(filepath), "--sniff"],
+        ["insert", db_path, "creatures", str(filepath), "--sniff", "--no-detect-types"],
         catch_exceptions=False,
     )
     assert result.exit_code == 0, result.stdout
