This repository was archived by the owner on Mar 25, 2023. It is now read-only.

Commit b4c1e6d

Split dbt and Pachyderm sections (since they're fairly different from each other) and add a note on similarities/differences to Pachyderm.
1 parent 82e3259 commit b4c1e6d

File tree: 2 files changed (+25, -8 lines)


content/docs/0000_getting-started/0150_frequently_asked_questions.mdx

Lines changed: 22 additions & 8 deletions
@@ -76,28 +76,42 @@ PostgreSQL deployment can also be used on Splitgraph.
 No. Splitgraph can be used in a decentralized way, sharing data between two engines like one would
 with Git. Here's an [example](https://github.com/splitgraph/splitgraph/tree/master/examples/push-to-other-engine) of getting two Splitgraph instances to synchronize with each other.
 
-It is also possible to push data to S3-compatible storage (like [Minio](https://github.com/splitgraph/splitgraph/tree/487c704eb6aba5025708215bfa80399723c530b1/examples/push-to-object-storage)).
+It is also possible to push data to S3-compatible storage (like [Minio](https://github.com/splitgraph/splitgraph/tree/master/examples/push-to-object-storage)).
 
 You can use [Splitgraph Cloud](../splitgraph_cloud/introduction) if you wish to
 get or share public data or have a [REST API](../splitgraph_cloud/publish_rest_api) generated for your dataset.
 
 ### Why not just use...
 
-#### dbt, Pachyderm, ...
+#### dbt
 
-There are plenty of great tools around for building datasets and managing ETL pipelines. Firstly,
-they can also work against Splitgraph, since a Splitgraph engine is also a PostgreSQL instance.
-After the dataset is built, one can snapshot the schema it was built in and package it up as a Splitgraph image.
-This enriches the tool by adding version control, packaging and sharing to datasets that it uses and builds.
+dbt is a tool for transforming data inside the data warehouse that allows users to build up
+transformations from reusable, versionable SQL snippets.
 
-We have an example of running [dbt](../integrating_splitgraph/dbt) against Splitgraph, swapping between different versions of the
+dbt is enhanced by Splitgraph: since a Splitgraph engine is also a PostgreSQL instance, dbt can
+work against it, gaining benefits like version control, packaging and sharing for the datasets it uses and builds.
+
+We have an example of running [dbt](../integrating_splitgraph/dbt) in this way, swapping between different versions of the
 source dataset and looking at their effect on the built dbt model.

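The dbt version-swapping workflow described above can be sketched with the `sgr` CLI. This is a rough sketch, not part of this commit: the image name `my/source_data` and its tags are hypothetical, and `sgr checkout` is used to materialize each image version before rebuilding the model.

```
# Hypothetical image/tag names; sgr checkout materializes a given image version
sgr checkout my/source_data:v1
dbt run                          # build the dbt model against version 1
sgr checkout my/source_data:v2   # swap the source dataset underneath dbt
dbt run                          # rebuild and compare the resulting model
```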
-Secondly, Splitgraph offers its own method of building datasets: [Splitfiles](../concepts/splitfiles). Splitfiles offer Dockerfile-like caching, provenance tracking, fast dataset rebuilds, joins between datasets and full SQL support.
+Splitgraph also offers its own method of building datasets: [Splitfiles](../concepts/splitfiles). Splitfiles offer Dockerfile-like caching, provenance tracking, fast dataset rebuilds, joins between datasets and full SQL support.
 
 We envision Splitfiles as a replacement for ETL pipelines: instead of a series of processes that transform data between tables in a data warehouse,
 transformations are treated as pure functions between isolated self-contained datasets, allowing one to replay any part of their pipeline at any point in time.

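As an illustration of the Splitfile features mentioned above (Dockerfile-like syntax, full SQL support, provenance tracking), here is a minimal sketch; the `demo/weather` source image and its table and column names are hypothetical.

```
# Import a subset of a table from a versioned source image (hypothetical names)
FROM demo/weather:latest IMPORT {SELECT * FROM readings WHERE year = 2020} AS readings_2020

# Transform it with plain SQL; the result is a new self-contained image
SQL {
    CREATE TABLE monthly_avg AS
        SELECT month, avg(temperature) AS avg_temp
        FROM readings_2020 GROUP BY month
}
```

Because the source image and queries are recorded in the output image's metadata, the build can be inspected for provenance and replayed when the source image is updated.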
+#### Pachyderm
+
+Pachyderm is mostly used for managing and running distributed data pipelines on flat files (images,
+genomics data, etc.). By specializing in datasets that can be represented as tables in a database,
+Splitgraph gains benefits like delta compression of changed data and faster querying.
+
+Similarly to Pachyderm, Splitgraph supports [data lineage (or provenance)](../working_with_data/inspecting_provenance) tracking, where the
+commands and source datasets that were used to build a particular dataset are recorded in that
+dataset's metadata, allowing them to be replayed or inspected.
+
+Splitgraph can be integrated with Pachyderm using the same methods one would use [for PostgreSQL](https://docs.pachyderm.com/latest/how-tos/splitting-data/splitting/#ingesting-postgressql-data). This can then be used to run a [Splitfile](../concepts/splitfiles) to build a dataset as a
+Pachyderm stage.
+
 #### dvc, DataLad, ...
 
 Some tools use [git-annex](https://git-annex.branchable.com/) to version code and data together.

content/docs/0700_integrating_splitgraph/0400_dbt.mdx

Lines changed: 3 additions & 0 deletions
@@ -16,6 +16,9 @@ Turning the source and the target schemas that dbt uses into Splitgraph reposito
 * Built datasets can be pushed to other Splitgraph engines, shared publicly or serve as inputs to a pipeline of Splitfiles.
 * Input datasets can leverage Splitgraph's [layered querying](../large_datasets/layered_querying),
   allowing dbt to seamlessly query huge datasets with a limited amount of local disk space.
+* Input datasets can be backed by [foreign data wrappers](../ingesting_data/foreign_data_wrappers), allowing dbt
+  to directly use data from a wide variety of databases without having to write an extra ETL job to load the data
+  into the warehouse.

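Since a Splitgraph engine speaks the ordinary PostgreSQL protocol, pointing dbt at it only requires a standard Postgres profile. A minimal sketch of `~/.dbt/profiles.yml` follows; the host, port, credentials and schema names here are assumptions, not guaranteed engine defaults.

```yaml
# Connection values below are placeholders for a local Splitgraph engine
splitgraph:
  target: dev
  outputs:
    dev:
      type: postgres      # the engine is addressed as a regular Postgres database
      host: localhost
      port: 5432
      user: sgr
      pass: some_password
      dbname: splitgraph
      schema: dbt_output  # schema the built models are written to
      threads: 1
```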
 ## Example
 