Merged
9 changes: 6 additions & 3 deletions .github/workflows/main.yml
@@ -5,9 +5,9 @@ name: Python Lint, Check, Test

on:
push:
branches: [ "main" ]
branches: [ "master" ]
pull_request:
branches: [ "main" ]
branches: [ "master" ]

jobs:
build:
@@ -37,4 +37,7 @@ jobs:
run: make type-check

- name: Tests (unit)
run: make pytest
run: make test-unit
env:
OPENAI_API_KEY: "dummy-key"
APIFY_API_TOKEN: "dummy-key"
13 changes: 11 additions & 2 deletions Makefile
@@ -30,5 +30,14 @@ pydantic-model:
datamodel-codegen --input $(DIRS_WITH_ACTORS)/qdrant/.actor/input_schema.json --output $(DIRS_WITH_CODE)/src/models/qdrant_input_model.py --input-file-type jsonschema --field-constraints --enum-field-as-literal all
datamodel-codegen --input $(DIRS_WITH_ACTORS)/weaviate/.actor/input_schema.json --output $(DIRS_WITH_CODE)/src/models/weaviate_input_model.py --input-file-type jsonschema --field-constraints --enum-field-as-literal all

pytest:
poetry run -C $(DIRS_WITH_CODE) pytest --with-integration --vcr-record=none

# Integration tests are marked with @pytest.mark.integration_test.
# All databases must be running to run these tests.
# Check docker-compose.yml for the list of databases.
test-integration:
poetry run -C $(DIRS_WITH_CODE) pytest --with-integration

test-unit:
poetry run -C $(DIRS_WITH_CODE) pytest

test: test-unit test-integration
28 changes: 16 additions & 12 deletions actors/chroma/.actor/input_schema.json
@@ -21,28 +21,31 @@
"title": "Chroma port",
"description": "Port argument for Chroma HTTP Client",
"type": "integer",
"editor": "number",
"default": 8000
"editor": "number"
},
"chromaClientSsl": {
"title": "Chroma SSL enabled",
"description": "Enable/Disable SSL",
"type": "boolean",
"default": false
"type": "boolean"
},
"chromaServerAuthCredentials": {
"title": "Chroma server Auth Static API token credentials",
"description": "Chroma server Auth Static API token.",
"chromaApiToken": {
"title": "Chroma API token",
"description": "Chroma API token for authentication",
"type": "string",
"editor": "textfield",
"isSecret": true
},
"chromaClientAuthProvider": {
"title": "Chroma client auth provider",
"description": "Chroma client auth provider",
"chromaTenant": {
"title": "Chroma tenant",
"description": "Chroma tenant ID",
"type": "string",
"editor": "textfield",
"default": "chromadb.auth.token_authn.TokenAuthClientProvider"
"editor": "textfield"
},
"chromaDatabase": {
"title": "Chroma database",
"description": "Chroma database name",
"type": "string",
"editor": "textfield"
},
"embeddingsProvider": {
"title": "Embeddings provider (as defined in the langchain API)",
@@ -172,6 +175,7 @@
}
},
"required": [
"chromaCollectionName",
"chromaClientHost",
"embeddingsProvider",
"embeddingsApiKey",
95 changes: 37 additions & 58 deletions actors/chroma/README.md
@@ -24,60 +24,16 @@ It uses [LangChain](https://www.langchain.com/) to compute embeddings and intera
2. _[Optional]_ Split text data into chunks using `langchain`'s `RecursiveCharacterTextSplitter`
(enable/disable using `performChunking` and specify `chunkSize`, `chunkOverlap`)
3. _[Optional]_ Update only changed data (select `dataUpdatesStrategy`)
4. Compute embeddings, e.g. using `OpenAI` or `Cohere` (specify `embeddings` and `embeddingsConfig`)
4. Compute embeddings, e.g. using `OpenAI` or `Cohere` (specify `embeddingsProvider` and `embeddingsConfig`)
5. Save data into the database

## ✅ Before you start

To utilize this integration, ensure you have:

- `Chroma` operational on a server or localhost.
- `Chroma` operational on a remote server or cloud instance.
- An account to compute embeddings using one of the providers, e.g., OpenAI or Cohere.

For quick Chroma setup, refer to [Chroma deployment](https://docs.trychroma.com/deployment#docker).
Chroma can be run in a Docker container with the following commands:

### Docker

```shell
docker pull chromadb/chroma
docker run -p 8000:8000 chromadb/chroma
```

### Authentication with Docker

To enable static API Token authentication, create a .env file with:

```dotenv
CHROMA_SERVER_AUTHN_CREDENTIALS=test-token
CHROMA_SERVER_AUTHN_PROVIDER=chromadb.auth.token_authn.TokenAuthenticationServerProvider
```

Then run Docker with:

```shell
docker run --env-file ./.env -p 8000:8000 chromadb/chroma
```

### If you are running Chroma locally, you can expose the localhost using Ngrok

[Install ngrok](https://ngrok.com/download) (you can use it for free or create an account). Expose Chroma using

```shell
ngrok http http://localhost:8080
```

You'll see an output similar to:
```text
Session Status online
Account a@a.ai (Plan: Free)
Forwarding https://fdfe-82-208-25-82.ngrok-free.app -> http://localhost:8000
```

The URL (`https://fdfe-82-208-25-82.ngrok-free.app`) can be used as the input value `chromaClientHost=https://fdfe-82-208-25-82.ngrok-free.app`.
Note that your specific URL will vary.


## 👉 Examples

The configuration consists of three parts: Chroma, embeddings provider, and data.
@@ -88,12 +44,23 @@ This means your Chroma index should also be configured to accommodate vectors of

For detailed input information refer to the [Input page](https://apify.com/apify/chroma-integration/input-schema).

#### Database: Chroma
#### Database: Chroma (simple)
```json
{
"chromaClientHost": "https://fdfe-82-208-25-82.ngrok-free.app",
"chromaCollectionName": "chroma",
"chromaServerAuthCredentials": "test-token"
"chromaClientHost": "https://your-chroma-instance.com",
"chromaApiToken": "your-api-token"
}
```

#### Database: Chroma with tenant and database (cloud/enterprise)
```json
{
"chromaCollectionName": "chroma",
"chromaClientHost": "https://your-chroma-instance.chroma.cloud",
"chromaApiToken": "your-api-token",
"chromaTenant": "your-tenant-id",
"chromaDatabase": "your-database-name"
}
```
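
A minimal sketch of how these input fields could be translated into connection arguments. The helper name `build_chroma_client_kwargs`, the field-to-argument mapping, and the `x-chroma-token` header are assumptions made for illustration; they are not taken from the Actor's source.

```python
from typing import Any, Dict


def build_chroma_client_kwargs(actor_input: Dict[str, Any]) -> Dict[str, Any]:
    """Map Actor input fields to keyword arguments for an HTTP client.

    Hypothetical helper: the mapping below is an assumption, not the Actor's code.
    """
    kwargs: Dict[str, Any] = {"host": actor_input["chromaClientHost"]}

    # Optional fields are forwarded only when the user set them, so the
    # client library's own defaults apply otherwise.
    optional = {
        "chromaClientPort": "port",
        "chromaClientSsl": "ssl",
        "chromaTenant": "tenant",
        "chromaDatabase": "database",
    }
    for input_key, kwarg_name in optional.items():
        value = actor_input.get(input_key)
        if value is not None:
            kwargs[kwarg_name] = value

    # Assumption: the API token travels in an "x-chroma-token" header.
    token = actor_input.get("chromaApiToken")
    if token:
        kwargs["headers"] = {"x-chroma-token": token}
    return kwargs
```

The resulting dictionary could then be passed to a client constructor such as `chromadb.HttpClient(**kwargs)`, assuming the `chromadb` package is installed. Leaving unset fields out keeps the simple and the tenant/database configurations on the same code path.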

@@ -169,7 +136,7 @@ To control how the integration updates data in the database, use the `dataUpdate
- For instance, this is useful in cases where unique items (such as user profiles or documents) need to be managed, ensuring the database reflects the latest changes.
- Check the `dataUpdatesPrimaryDatasetFields` parameter to specify which fields are used to uniquely identify each dataset item.

- **Delta updates (`deltaUpdates`)**:
- **Update changed data based on deltas (`deltaUpdates`)**:
- Incrementally updates records by identifying differences (deltas) between the new dataset and the existing database records.
- Ensures only new or modified records are processed, leaving unchanged records untouched. This minimizes unnecessary database operations and improves efficiency.
- This is the most efficient strategy when integrating data that evolves over time, such as website content or recurring crawls.
@@ -196,7 +163,7 @@ For instance, when working with the Website Content Crawler, you can use the URL
```json
{
"dataUpdatesStrategy": "deltaUpdates",
"dataUpdatePrimaryDatasetFields": ["url"]
"dataUpdatesPrimaryDatasetFields": ["url"]
}
```
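
The idea behind delta updates can be sketched as follows. This is an illustrative simplification under stated assumptions, not the integration's actual implementation: the function names `item_id`, `item_checksum`, and `compute_delta` are hypothetical, and the database state is represented by a plain dictionary that maps item IDs to checksums.

```python
import hashlib
import json
from typing import Any, Dict, List, Tuple


def item_id(item: Dict[str, Any], primary_fields: List[str]) -> str:
    """Derive a stable ID from the primary dataset fields (e.g. ["url"])."""
    key = json.dumps([item.get(field) for field in primary_fields])
    return hashlib.sha256(key.encode("utf-8")).hexdigest()


def item_checksum(item: Dict[str, Any]) -> str:
    """Hash the whole item so that any content change alters the checksum."""
    payload = json.dumps(item, sort_keys=True)
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()


def compute_delta(
    items: List[Dict[str, Any]],
    stored_checksums: Dict[str, str],
    primary_fields: List[str],
) -> Tuple[List[Dict[str, Any]], List[str]]:
    """Split items into those to upsert and the IDs that are unchanged."""
    to_upsert: List[Dict[str, Any]] = []
    unchanged_ids: List[str] = []
    for item in items:
        identifier = item_id(item, primary_fields)
        if stored_checksums.get(identifier) == item_checksum(item):
            unchanged_ids.append(identifier)  # same content: skip re-embedding
        else:
            to_upsert.append(item)  # new or modified: needs processing
    return to_upsert, unchanged_ids
```

Only the items in `to_upsert` would need new embeddings, which is where the efficiency gain of this strategy comes from.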

@@ -241,32 +208,44 @@ This integration will save the selected fields from your Actor to Chroma.

```json
{
"chromaClientHost": "https://fdfe-82-208-25-82.ngrok-free.app",
"chromaClientSsl": false,
"chromaCollectionName": "chroma",
"chromaClientHost": "https://your-chroma-instance.com",
"chromaClientSsl": true,
"embeddingsProvider": "OpenAI",
"embeddingsApiKey": "YOUR-OPENAI-API-KEY",
"embeddingsConfig": {
"model": "text-embedding-3-small"
},
"embeddingsProvider": "OpenAI",
"datasetFields": [
"text"
],
"dataUpdatesStrategy": "deltaUpdates",
"dataUpdatePrimaryDatasetFields": ["url"],
"dataUpdatesPrimaryDatasetFields": ["url"],
"deleteExpiredObjects": true,
"expiredObjectDeletionPeriodDays": 7,
"performChunking": true,
"chunkSize": 2000,
"chunkOverlap": 200
}
```
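
To illustrate how `chunkSize` and `chunkOverlap` interact, here is a deliberately simplified fixed-size character window. It is a stand-in for illustration only: `RecursiveCharacterTextSplitter` additionally tries to split on separators such as paragraphs and sentences, which this sketch does not do. The function name `chunk_text` is hypothetical.

```python
from typing import List


def chunk_text(text: str, chunk_size: int, chunk_overlap: int) -> List[str]:
    """Split text into windows of chunk_size characters.

    Consecutive windows share chunk_overlap characters.
    """
    if chunk_size <= 0 or not 0 <= chunk_overlap < chunk_size:
        raise ValueError("require chunk_size > 0 and 0 <= chunk_overlap < chunk_size")
    if not text:
        return []
    step = chunk_size - chunk_overlap  # how far each window advances
    # Stop once the remaining tail is already covered by the previous window.
    last_start = max(len(text) - chunk_overlap, 1)
    return [text[start:start + chunk_size] for start in range(0, last_start, step)]
```

With `chunkSize` 2000 and `chunkOverlap` 200, each window in this sketch advances by 1800 characters, so neighbouring chunks share 200 characters of context.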

#### Chroma
#### Chroma (simple)
```json
{
"chromaCollectionName": "chroma",
"chromaClientHost": "https://your-chroma-instance.com",
"chromaApiToken": "your-api-token"
}
```

#### Chroma (cloud/enterprise with tenant and database)
```json
{
"chromaClientHost": "https://fdfe-82-208-25-82.ngrok-free.app",
"chromaCollectionName": "chroma",
"chromaServerAuthCredentials": "test-token"
"chromaClientHost": "https://your-chroma-instance.chroma.cloud",
"chromaApiToken": "your-api-token",
"chromaTenant": "your-tenant-id",
"chromaDatabase": "your-database-name"
}
```

9 changes: 6 additions & 3 deletions code/.env.example
@@ -2,10 +2,13 @@ APIFY_API_TOKEN=
OPENAI_API_KEY=

# Chromadb
CHROMA_API_TOKEN=
CHROMA_CLIENT_HOST=
CHROMA_SERVER_AUTH_CREDENTIALS=
CHROMA_SERVER_AUTH_CREDENTIALS_PROVIDER=chromadb.auth.token.TokenConfigServerAuthCredentialsProvider
CHROMA_SERVER_AUTH_PROVIDER=chromadb.auth.token.TokenAuthServerProvider
CHROMA_CLIENT_PORT=
CHROMA_CLIENT_SSL=
CHROMA_COLLECTION_NAME=
CHROMA_DATABASE=
CHROMA_TENANT=

# Pinecone
PINECONE_API_KEY=