Fast regressions on database backends.
dbreg is an R package that leverages the power of databases to run regressions on very large datasets, which may not fit into R's memory. Various acceleration strategies allow for highly efficient computation, while robust standard errors are computed from sufficient statistics. Our default DuckDB backend provides a powerful, embedded analytics engine to get users up and running with minimal effort. Users can also specify alternative database backends, depending on their computing needs and setup.
The dbreg R package is inspired by, and has similar aims to, the duckreg Python package. This implementation offers some idiomatic, R-focused features like a formula interface and "pretty" print methods. But the two packages should otherwise be very similar.
dbreg can be installed from R-universe.
install.packages(
"dbreg",
repos = c("https://grantmcdermott.r-universe.dev", getOption("repos"))
)

To get ourselves situated, we'll first demonstrate by using an in-memory R dataset.
library(dbreg)
library(fixest) # for data and comparison
data("trade", package = "fixest")
dbreg(Euros ~ dist_km | Destination + Origin, data = trade, vcov = 'hc1')
#> [dbreg] Auto strategy:
#> - data has 38,325 rows with 2 FE (210 unique groups)
#> - compression ratio (0.01) satisfies threshold (0.6)
#> - decision: compress
#> [dbreg] Executing compress strategy SQL
#>
#> Compressed OLS estimation, Dep. Var.: Euros
#> Observations.: 38,325 (original) | 210 (compressed)
#> Standard-errors: Heteroskedasticity-robust
#> Estimate Std. Error t value Pr(>|t|)
#> dist_km -45709.8 1195.84 -38.224 < 2.2e-16 ***
#> ---
#> Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
#> RMSE: 124,221,786.3 Adj. R2: 0.215289

Behind the scenes, dbreg has compressed the original dataset down from
nearly 40,000 observations to only 210, before running the final (weighted)
regression on this much smaller data object. This compression trick follows
Wong et al. (2021) and effectively allows us to compute on a much lighter
object, saving time and memory.
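For intuition, here is a rough sketch of the same compression logic using
regular in-memory tools (dplyr plus fixest) rather than dbreg's own SQL. This is
purely illustrative, not dbreg's actual implementation: collapse the data by the
unique combinations of the regressors and fixed effects, then run a weighted
regression on the group means.
library(dplyr)

## collapse by the unique regressor + fixed-effect combinations (210 groups here)
compressed = trade |>
  group_by(dist_km, Destination, Origin) |>
  summarise(n = n(), Euros_mean = mean(Euros), .groups = "drop")

## a weighted regression on the group means reproduces the full-data point
## estimate exactly; robust standard errors need a few extra sufficient
## statistics, which is what dbreg collects for you behind the scenes
feols(Euros_mean ~ dist_km | Destination + Origin, data = compressed, weights = ~n)
Either way, we can confirm that dbreg still gives the same result as running
fixest::feols on the full dataset: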
feols(Euros ~ dist_km | Destination + Origin, data = trade, vcov = 'hc1')
#> OLS estimation, Dep. Var.: Euros
#> Observations: 38,325
#> Fixed-effects: Destination: 15, Origin: 15
#> Standard-errors: Heteroskedasticity-robust
#> Estimate Std. Error t value Pr(>|t|)
#> dist_km -45709.8 1195.84 -38.224 < 2.2e-16 ***
#> ---
#> Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
#> RMSE: 124,221,786.3 Adj. R2: 0.215289
#> Within R2: 0.025914

For a more appropriate dbreg use-case, let's run a regression on some NYC taxi data. (Download instructions here.) The dataset that we're working with here is about 180 million rows deep and takes up 8.5 GB on disk.1 dbreg offers two basic ways to analyse and interact with data of this size.
The first option is to use the path argument to read the data directly from
disk and perform the compression computation in an ephemeral DuckDB connection.
This requires that the data are small enough to fit into RAM... but please note
that "small enough" is a relative concept. Thanks to DuckDB's incredible
efficiency, your RAM should be able to handle very large datasets that would
otherwise crash your R session, and do so in only a fraction of the computation
time.
dbreg(
tip_amount ~ fare_amount + passenger_count | month + vendor_name,
path = "read_parquet('nyc-taxi/**/*.parquet')", ## path to hive-partitioned dataset
vcov = "hc1"
)
#> [dbreg] Auto strategy:
#> - data has 178,544,324 rows with 2 FE (24 unique groups)
#> - compression ratio (0.00) satisfies threshold (0.6)
#> - decision: compress
#> [dbreg] Executing compress strategy SQL
#>
#> Compressed OLS estimation, Dep. Var.: tip_amount
#> Observations.: 178,544,324 (original) | 70,782 (compressed)
#> Standard Errors: Heteroskedasticity-robust
#> Estimate Std. Error t value Pr(>|t|)
#> fare_amount 0.106744 0.000068 1564.742 < 2.2e-16 ***
#> passenger_count -0.029086 0.000106 -273.866 < 2.2e-16 ***
#> ---
#> Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
#> RMSE: 1.7 Adj. R2: 0.243549

Note the size of the original dataset, which is nearly 180 million rows, versus the compressed dataset, which is down to only 70k. On my laptop (M4 MacBook Pro) this regression completes in under 3 seconds... and that includes the time it took to determine an optimal estimation strategy, as well as read the data from disk!2
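If you would like to benchmark this on your own machine, a simple system.time()
wrapper around the same call will do it (timings will naturally vary with your
hardware and storage):
system.time(
  dbreg(
    tip_amount ~ fare_amount + passenger_count | month + vendor_name,
    path = "read_parquet('nyc-taxi/**/*.parquet')",
    vcov = "hc1"
  )
)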
In case you were wondering, obtaining clustered standard errors is just as easy;
simply pass the relevant cluster variable as a formula to the vcov argument.
Since we know that the optimal acceleration strategy is "compress", we'll also
go ahead and specify this explicitly to skip the auto strategy overhead.
dbreg(
tip_amount ~ fare_amount + passenger_count | month + vendor_name,
path = "read_parquet('nyc-taxi/**/*.parquet')",
vcov = ~month, # clustered SEs
strategy = "compress" # skip auto strategy overhead
)
#> [dbreg] Using strategy: compress
#> [dbreg] Executing compress strategy SQL
#>
#> Compressed OLS estimation, Dep. Var.: tip_amount
#> Observations.: 178,544,324 (original) | 70,782 (compressed)
#> Standard Errors: Clustered (12 clusters)
#> Estimate Std. Error t value Pr(>|t|)
#> fare_amount 0.106744 0.000657 162.4934 < 2.2e-16 ***
#> passenger_count -0.029086 0.001030 -28.2278 1.2923e-11 ***
#> ---
#> Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
#> RMSE: 1.7 Adj. R2: 0.243549

While querying on-the-fly with our default DuckDB backend is both convenient and
extremely performant, the second option is to run regressions against existing
tables in a persistent database connection. This could be DuckDB, but it could
also be any other supported backend.
All you need to do is specify the appropriate conn and table arguments.
# load the DBI package to connect to a persistent database
library(DBI)
# create connection to persistent DuckDB database (could be any supported backend)
con = dbConnect(duckdb::duckdb(), dbdir = "nyc.db")
# create a 'taxi' table in our new nyc.db database from our parquet dataset
dbExecute(
con,
"
CREATE TABLE taxi AS
FROM read_parquet('nyc-taxi/**/*.parquet')
SELECT tip_amount, fare_amount, passenger_count, month, vendor_name
"
)
# now run our regression against this conn+table combo
dbreg(
tip_amount ~ fare_amount + passenger_count | month + vendor_name,
conn = con, # database connection
table = "taxi", # table name
vcov = ~month,
strategy = "compress"
)
#> [dbreg] Using strategy: compress
#> [dbreg] Executing compress strategy SQL
#>
#> Compressed OLS estimation, Dep. Var.: tip_amount
#> Observations.: 178,544,324 (original) | 70,782 (compressed)
#> Standard Errors: Clustered (12 clusters)
#> Estimate Std. Error t value Pr(>|t|)
#> fare_amount 0.106744 0.000657 162.4934 < 2.2e-16 ***
#> passenger_count -0.029086 0.001030 -28.2278 1.2923e-11 ***
#> ---
#> Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
#> RMSE: 1.7 Adj. R2: 0.243549

Result: we get the same coefficient and standard error estimates as earlier.
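The same pattern carries over to other backends: the only thing that changes is
the connection object you pass to conn. As a purely hypothetical sketch,
assuming Postgres is among the supported backends for your setup (the connection
details and table name below are placeholders), pointing dbreg at a
Postgres-hosted table might look like:
# hypothetical: point dbreg at a table living in a Postgres database
# (dbname/host/user and the "taxi" table are placeholders for your own setup)
con_pg = DBI::dbConnect(
  RPostgres::Postgres(),
  dbname = "mydb", host = "localhost", user = "me"
)
dbreg(
  tip_amount ~ fare_amount + passenger_count | month + vendor_name,
  conn = con_pg,
  table = "taxi",
  vcov = ~month,
  strategy = "compress"
)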
We'll close by doing some (optional) clean up.
dbRemoveTable(con, "taxi")
dbDisconnect(con)
unlink("nyc.db") # remove from diskTip
If you don't want to create a persistent database (and materialize data), a
nice alternative is CREATE VIEW. This lets you define subsets or computed
columns on-the-fly. For example, to regress on Q1 2012 data with a day-of-week
fixed effect:
dbExecute(con, "
CREATE VIEW nyc_subset AS
SELECT
tip_amount, trip_distance, passenger_count,
vendor_name, month,
dayofweek(dropoff_datetime) AS dofw
FROM read_parquet('nyc-taxi/**/*.parquet')
WHERE year = 2012 AND month <= 3
")
dbreg(
tip_amount ~ trip_distance + passenger_count | month + dofw + vendor_name,
conn = con,
table = "nyc_subset",
vcov = ~dofw
)

All of the examples in this README have made use of the "compress" strategy.
But the compression trick is not the only game in town and dbreg supports
several other acceleration strategies: "moments", "demean", and "mundlak".
Depending on your data and regression requirements, one of these other
strategies may better suit your problem. The good news is that the default
strategy = "auto" option uses some intelligent heuristics to determine which
strategy is (probably) optimal for each case. The Acceleration Strategies
section of the ?dbreg helpfile contains a lot of detail about the different
options and tradeoffs involved, so please do consult the documentation.
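You can always override these heuristics by passing a strategy explicitly, just
as we did with "compress" above. Purely as a hypothetical illustration (whether
an alternative strategy is appropriate depends on your data and model; see
?dbreg):
# hypothetical: force the Mundlak strategy instead of letting "auto" decide
dbreg(
  tip_amount ~ fare_amount + passenger_count | month + vendor_name,
  path = "read_parquet('nyc-taxi/**/*.parquet')",
  vcov = "hc1",
  strategy = "mundlak"
)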
dbreg is a maturing package and there are a number of features that we still
plan to add before submitting it to CRAN. (See our
TO-DO list.) We also don't
yet support some standard R operations like interaction terms in the formula. At
the same time, the core dbreg() routine has been tested pretty thoroughly and
should work in standard cases. Please help us by kicking the tyres and creating
GitHub issues for both bug reports and feature requests.
Footnotes
1.
To be clear, this dataset would occupy significantly more RAM than 8.5 GB if we loaded it into R's memory, due to data serialization and the switch to richer representation formats (e.g., ordered factors require more memory). So there's a good chance that just trying to load this raw dataset into R would cause your whole system to crash... never mind doing any statistical analysis on it. ↩
2.
If we provided an explicit dbreg(..., strategy = "compress") argument (thus
skipping the automatic strategy determination), then the total computation time
drops to less than 1 second... ↩