Commit 94faa11

Merge pull request #8 from hummusonrails/add-more

Updates and further work on workshop

2 parents: 6afb2a6 + c6da3d2
68 files changed: 47091 additions & 2 deletions


README.md

Lines changed: 105 additions & 2 deletions
@@ -57,6 +57,15 @@ After creating a cluster, you can create a new bucket by following the steps bel
Before we can index and search data, we need to transform it into a format that can be used by the vector search engine. We will be using [Couchbase Vector Search](https://docs.couchbase.com/server/current/fts/fts-vector-search.html) for this workshop.

There are two options in this workshop to generate vector embeddings from data:

1. Use the `/embed` endpoint provided in this repository to transform the data. *You need an OpenAI API key to use this option.*
2. Import the data with *already generated embeddings* directly into the Couchbase bucket. You can use the data provided in the `./data/individual_items_with_embedding` directory.

Follow the instructions below for the option you choose.

### Option 1: Use the `/embed` Endpoint

Provided in this repository is an Express.js application that exposes a `/embed` endpoint to transform the data.

The Codespace environment already has all the dependencies installed. You can start the Express.js application by running the following command:
@@ -65,15 +74,109 @@ The Codespace environment already has all the dependencies installed. You can st
```bash
node server.js
```

The repository also has a sample set of data in the `./data/individual_items` directory. You can transform this data by making a POST request to the `/embed` endpoint providing the paths to the data files as an array in the request body.

```bash
curl -X POST http://localhost:3000/embed -H "Content-Type: application/json" -d '["./data/data1.json", "./data/data2.json"]'
```

The data has now been converted into vector embeddings and stored in the Couchbase bucket that you created earlier.
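
For orientation, the sketch below shows roughly what a `/embed` handler of this kind can do: read each file listed in the request body, generate an embedding, and upsert the result into Couchbase. This is not the repository's actual `server.js`; it is a minimal illustration that assumes the `openai` and `couchbase` npm packages and hypothetical `CB_*` environment variables for the connection details.

```javascript
// embed-sketch.js -- illustrative only; the workshop's server.js may differ.
const express = require('express');
const fs = require('fs/promises');
const couchbase = require('couchbase');
const OpenAI = require('openai');

const openai = new OpenAI({ apiKey: process.env.OPENAI_API_KEY });
const app = express();
app.use(express.json());

app.post('/embed', async (req, res) => {
  // Hypothetical env vars: CB_CONNSTR, CB_USERNAME, CB_PASSWORD, CB_BUCKET.
  const cluster = await couchbase.connect(process.env.CB_CONNSTR, {
    username: process.env.CB_USERNAME,
    password: process.env.CB_PASSWORD,
  });
  const collection = cluster.bucket(process.env.CB_BUCKET).defaultCollection();

  // The request body is an array of file paths, e.g. ["./data/data1.json"].
  for (const path of req.body) {
    const doc = JSON.parse(await fs.readFile(path, 'utf8'));

    // Generate an embedding for the document's JSON content
    // (the real app may embed a specific field instead).
    const response = await openai.embeddings.create({
      model: 'text-embedding-ada-002',
      input: JSON.stringify(doc),
    });
    doc.embedding = response.data[0].embedding;

    // Use the file path as the document key for this sketch.
    await collection.upsert(path, doc);
  }

  res.json({ status: 'ok', files: req.body.length });
});

app.listen(3000, () => console.log('Listening on port 3000'));
```
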
### Option 2: Import Data with Embeddings

If you choose to import the data directly, you can use the data provided in the `./data/individual_items_with_embedding` directory. The data is already in the format required to enable vector search on it.

Once you have opened this repository in a [GitHub Codespace](https://codespaces.new/hummusonrails/vector-search-nodejs-workshop), you can import the data with the generated embeddings using the [Couchbase shell](https://couchbase.sh/docs/#_importing_data) from the command line.

#### Edit the Config File

First, edit the `./config_file/config` file with your Couchbase Capella information.

Under the `[[cluster]]` section:

- Replace the empty string value for `identifier` with the name of the cluster you created earlier.
- Replace the empty string value for `connstr` with the connection string to your cluster.
  - Found in `Menu > Connect`
  ![](workshop_images/menu_with_connect_highlighted.png)
- Replace the empty string for `default-bucket` with the name of the bucket you created earlier.
- Replace the empty strings for `username` and `password` with the username and password of your Couchbase Capella account.
  - Found in `Menu > Settings > Cluster Access`
  ![](workshop_images/menu_with_settings_highlighted.png)
- Replace the empty string for `capella-organization` with the name of your organization.
  - Found by clicking on your avatar icon (usually your initials), then `Organizations`
  - If your organization name has multiple words, replace the spaces with dashes and use lowercase, e.g. "My Organization" becomes "my-organization".
  ![](workshop_images/menu_with_organizations_highlighted.png)

Under the `[[capella-organization]]` section:

- Replace the `identifier` empty string value with the name of your organization, formatted as in the previous step.
- Replace the `access-key` and `secret-key` empty string values with the access key and secret key for your organization.
  - Found in `Menu > Settings > API Keys`
  ![](workshop_images/menu_with_api_keys_highlighted.png)
- Replace the `default-project` empty string value with the name of the project you created earlier.
  - Found in the top-level view of all your clusters.
  ![](workshop_images/cluster_list_with_project_name.png)

#### Import Data with Couchbase Shell

Change into the directory where the data files with embeddings are:

```bash
cd data/individual_items_with_embedding
```

Open up Couchbase Shell, passing in the location of the directory that holds the config file defining your Couchbase information:

```bash
cbsh --config-dir ../config_file
```

Once in the shell, run the `nodes` command as a sanity check that you are connected to the correct cluster:

```bash
> nodes
```

This should output something similar to the following:

```bash
╭───┬───────────┬────────────────┬─────────┬──────────────────────────┬───────────────────────┬───────────────────────────┬──────────────┬─────────────┬─────────╮
│ # │ cluster │ hostname │ status │ services │ version │ os │ memory_total │ memory_free │ capella │
├───┼───────────┼────────────────┼─────────┼──────────────────────────┼───────────────────────┼───────────────────────────┼──────────────┼─────────────┼─────────┤
│ 0 │ dev.local │ 127.0.0.1:8091 │ healthy │ search,indexing,kv,query │ 8.0.0-1246-enterprise │ x86_64-apple-darwin19.6.0 │ 34359738368 │ 12026126336 │ false │
╰───┴───────────┴────────────────┴─────────┴──────────────────────────┴───────────────────────┴───────────────────────────┴──────────────┴─────────────┴─────────╯
```

Now, import the data into the bucket you created earlier:

```bash
> ls *_with_embedding.json | each { |it| open $it.name | wrap content | insert id $in.content._default.name } | doc upsert
```

This opens each `*_with_embedding.json` file, wraps its parsed contents in a `content` field, adds an `id` taken from `content._default.name`, and upserts each resulting document into your default bucket.

Once this is done, you can perform a sanity check to ensure the documents were inserted by running a query that selects just one:

```bash
> query "select * from name_of_your_bucket._default._default limit 1"
```

Replace `name_of_your_bucket` with the name of the bucket you created.
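
If you would rather confirm the import from Node.js instead of from Couchbase Shell, a short script along the following lines could work. It is a sketch with placeholder connection details, assuming the `couchbase` npm package:

```javascript
// verify-import.js -- illustrative sketch; replace the placeholder values.
const couchbase = require('couchbase');

async function main() {
  const cluster = await couchbase.connect('couchbases://your-connection-string', {
    username: 'your-username',
    password: 'your-password',
  });

  // Select a single document from the default scope and collection.
  const result = await cluster.query(
    'SELECT * FROM `name_of_your_bucket`._default._default LIMIT 1'
  );
  console.log(JSON.stringify(result.rows[0], null, 2));
}

main().catch(console.error);
```
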
## Index Data

Once the vector embeddings have been stored in the Couchbase bucket, we can create a vector search index to enable similarity search.

You will use Couchbase Shell to perform this action as well.

Run the following command from inside the shell:

```bash
> vector create-index --bucket name_of_your_bucket --similarity-metric dot_product vector-search-index embedding 1536
```

This creates an index named `vector-search-index` on the `embedding` field for 1536-dimensional vectors, using the dot product similarity metric. Replace `name_of_your_bucket` with the name of the bucket you created.

You can perform a sanity check to ensure the index was created by querying for all the indexes; you should see `vector-search-index` in the list:

```bash
> query indexes
```
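
With the index created, you can also run a vector similarity search from Node.js. The sketch below is illustrative rather than the workshop's final application code: it assumes a recent `couchbase` SDK release with vector search support, the `openai` package to embed the query text, placeholder credentials, and a made-up query string.

```javascript
// vector-search-sketch.js -- illustrative only; adjust names and credentials.
const couchbase = require('couchbase');
const { SearchRequest, VectorSearch, VectorQuery } = couchbase;
const OpenAI = require('openai');

async function main() {
  const openai = new OpenAI({ apiKey: process.env.OPENAI_API_KEY });
  const cluster = await couchbase.connect('couchbases://your-connection-string', {
    username: 'your-username',
    password: 'your-password',
  });

  // Embed the search text with the same model used for the stored documents
  // so the query vector lives in the same 1536-dimensional space.
  const { data } = await openai.embeddings.create({
    model: 'text-embedding-ada-002',
    input: 'a cozy reading chair', // hypothetical query text
  });
  const queryVector = data[0].embedding;

  // Ask the vector-search-index for the 5 nearest documents by dot product.
  const request = SearchRequest.create(
    VectorSearch.fromVectorQuery(
      VectorQuery.create('embedding', queryVector).numCandidates(5)
    )
  );
  const result = await cluster.search('vector-search-index', request);

  for (const row of result.rows) {
    console.log(row.id, row.score);
  }
}

main().catch(console.error);
```
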

config_file/config

Lines changed: 26 additions & 0 deletions
@@ -0,0 +1,26 @@
version = 1
llms = []

[[cluster]]
identifier = "" # This is the name of the cluster you created
connstr = "" # This is the connection string for the cluster and can be found in the Capella UI
default-bucket = "" # This is the name of the bucket you created
username = "" # This is the username you created in the connect settings in the Capella UI
password = "" # This is the password you created in the connect settings in the Capella UI
default-collection = "_default" # Keep these as is unless you changed the defaults in the Capella UI
default-scope = "_default" # Keep these as is unless you changed the defaults in the Capella UI
data-timeout = "10s" # Keep as is
connect-timeout = "1m 15s" # Keep as is
search-timeout = "1m 15s" # Keep as is
analytics-timeout = "1m 15s" # Keep as is
management-timeout = "1m 15s" # Keep as is
transaction-timeout = "1m 15s" # Keep as is
tls-enabled = true # Keep as is
tls-accept-all-certs = true # Keep as is
capella-organization = "" # This is the name of your organization; if it has multiple words, use hyphens and lowercase

[[capella-organization]]
identifier = "" # This is the name of your organization; if it has multiple words, use hyphens and lowercase
access-key = "" # This is the access key for your organization found in the Capella UI
secret-key = "" # This is the secret key for your organization found in the Capella UI
default-project = "" # This is the name of the project you created where all your clusters are stored

4 files renamed without changes.
