Play with Chroma DB to understand how vector databases work
Imagine searching through millions of documents and finding exactly what you need, not because of matching keywords, but because the system understands the meaning behind your query. This is the power of vector databases. ChromaDB, a lightweight open-source vector database, makes this capability accessible to developers through an intuitive API for embedding storage and similarity search. Whether you're building a smart search engine or a recommendation system, understanding vector databases is becoming increasingly important for modern development.
Reference: Getting Started with ChromaDB
Semantic Search
Semantic search goes beyond traditional keyword matching by understanding the meaning and context of search queries. Instead of looking for exact word matches, it finds content that is conceptually similar to the search query. ChromaDB makes implementing semantic search straightforward by handling the complex process of converting text into mathematical representations (vectors) and finding similar content.
Let's look at a practical example:
import chromadb

# Initialize the client
chroma_client = chromadb.Client()

# Create a collection - ChromaDB will use its default embedding function
collection = chroma_client.create_collection(name="example01_collection")

# Sample texts to analyze
TEXT_ABOUT_PHYSICS = (
    "The equivalence principle is the hypothesis that the observed equivalence of gravitational and "
    "inertial mass is a consequence of nature. The weak form, known for centuries, relates to "
    "masses of any composition in free fall taking the same trajectories and landing at identical "
    "times."
)
TEXT_ABOUT_IT = (
    "Some online sites offer customers the ability to use a six-digit code which randomly changes every "
    "30–60 seconds on a physical security token. The token has built-in computations and manipulates "
    "numbers based on the current time. This means that every thirty seconds only a certain array of "
    "numbers validate access."
)

# Add documents to the collection
collection.add(
    documents=[TEXT_ABOUT_PHYSICS, TEXT_ABOUT_IT],
    ids=["physics_text", "it_text"]
)

# Try different queries to demonstrate semantic search
results = collection.query(
    query_texts=["Bill Gates"],
    n_results=2
)
print(results["ids"])  # Which document ranks first?

# Then try another query
results = collection.query(
    query_texts=["Richard Feynman"],
    n_results=2
)
print(results["ids"])
Let's analyze what happens when we run these queries:
Query: "Bill Gates"
When we search for "Bill Gates", ChromaDB will return the IT-related text first. This happens because the semantic connection between Bill Gates and technology/security is stronger than his connection to physics. Even though Bill Gates isn't explicitly mentioned in the text, the embedding model understands the conceptual relationship between Gates, technology, and computer security.
Query: "Richard Feynman"
When searching for "Richard Feynman", the physics text will be returned as the closest match. This is because Feynman was a renowned physicist, and the embedding model recognizes the semantic relationship between Feynman and concepts like gravitational mass and physical principles, even though Feynman isn't mentioned in the text.
This demonstrates how semantic search understands context beyond simple keyword matching:
- No exact words need to match between query and results
- Results are based on conceptual similarity
- The system can draw connections based on learned relationships in the embedding model
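To see this ordering directly, you can inspect the ids and the parallel list of distances that query returns (a minimal sketch continuing the example above; a lower distance means a closer semantic match):

# Inspect how close each document is to the query
results = collection.query(query_texts=["Richard Feynman"], n_results=2)

for doc_id, distance in zip(results["ids"][0], results["distances"][0]):
    # Smaller distance = semantically closer to the query
    print(f"{doc_id}: distance = {distance:.4f}")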
Understanding Vectors
At the heart of vector databases lies the concept of vector embeddings. These are mathematical representations of data in high-dimensional space. To understand this better, let's start with a simple 3D example:
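Here's a minimal numeric sketch using numpy (the two points are hypothetical, chosen so the result reproduces the "4.69 units" readout from the original figure):

import numpy as np

# Two hypothetical points in 3D space (chosen for illustration)
p1 = np.array([0.0, 0.0, 0.0])
p2 = np.array([2.0, 3.0, 3.0])

# Straight-line (Euclidean) distance
distance = np.linalg.norm(p2 - p1)
print(f"Distance between points: {distance:.2f} units")  # 4.69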
The distance between two points is given by the Euclidean formula:
$$ d = \sqrt{(x_2-x_1)^2 + (y_2-y_1)^2 + (z_2-z_1)^2} $$
While vectors and embeddings are often used interchangeably in discussions about vector databases, there is a subtle but important distinction:
- Vector: A mathematical construct representing a point in space using an array of numbers. For example, the point (2, 3, 4) in 3D space is represented by a vector [2, 3, 4]. Vectors can represent any kind of numerical data.
- Embedding: The result of transforming non-numerical data (like text, images, or audio) into a vector format that captures the semantic meaning of the data. For example, the sentence "I love dogs" might be transformed into a vector [0.2, -0.5, 0.8, ...] where the values represent various semantic features learned by the embedding model.
In vector databases:
- The term "vector" refers to the mathematical representation used to store and compare data
- "Embedding" refers to both the process of creating these vectors and the resulting vectors themselves
- All embeddings are vectors, but not all vectors are embeddings
In vector databases, we work with much higher dimensions - typically 384 or more. ChromaDB's default embedding model creates 384-dimensional vectors, where each dimension captures different aspects of the input data. These vectors are normalized to have a magnitude of approximately 1.0, which helps ensure consistent comparisons regardless of text length.
Here's an example of examining vector properties:
import numpy as np

# Embeddings are not returned by default - request them explicitly
results = collection.query(
    query_texts=["Richard Feynman"],
    n_results=1,
    include=["embeddings"]
)

# Vector properties example
vector = np.array(results['embeddings'][0][0])  # first query, first result
magnitude = np.linalg.norm(vector)
print(f"Vector magnitude: {magnitude:.4f}")  # Should be close to 1.0
print(f"Dimensions: {vector.shape[0]}")      # 384 dimensions
Embedding Functions
Reference: Chroma - Embeddings
Embedding functions are algorithms that convert data (text, images, etc.) into vector representations. ChromaDB supports various embedding models optimized for different use cases:
Language Models
Examples:
- all-MiniLM-L6-v2 (384 dimensions) - Default, general purpose
- paraphrase-multilingual-MiniLM-L12-v2 (384 dimensions) - Multilingual support
- LaBSE (768 dimensions) - 109 languages support
Domain-Specific Models
Examples:
- pritamdeka/S-PubMedBert-MS-MARCO - Medical/scientific text
- legal-bert-base-uncased - Legal documents
- finbert - Financial text
Here's how to use a custom embedding function:
from chromadb.utils import embedding_functions

# Initialize a specific embedding function
scientific_ef = embedding_functions.SentenceTransformerEmbeddingFunction(
    model_name="pritamdeka/S-PubMedBert-MS-MARCO"
)

# Create collection with custom embedding function
collection = chroma_client.create_collection(
    name="scientific_collection",
    embedding_function=scientific_ef
)
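Usage is then identical to the default setup; ChromaDB runs every added document and query text through the custom embedder, keeping the vector space consistent (a brief sketch with a hypothetical abstract):

# Documents are embedded with S-PubMedBert when added
collection.add(
    documents=["Aspirin irreversibly inhibits cyclooxygenase, reducing prostaglandin synthesis."],
    ids=["abstract_001"]
)

# Queries are embedded with the same model
results = collection.query(query_texts=["How do NSAIDs work?"], n_results=1)
print(results["documents"][0])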
In example03-embeddings-functions.py, there's a practical demonstration of how choosing the right embedding function can significantly improve search results for specific use cases.
Distance Functions
Reference: Chroma - Distance functions
Distance functions measure the similarity between vectors. The choice of distance function can significantly impact search results and performance. ChromaDB supports three main distance metrics, each with specific strengths and ideal use cases:
1. Euclidean (L2) Distance (Default)
$$ d_{\text{L2}}(a,b) = \sum_{i=1}^{n} (a_i - b_i)^2 $$
Properties: Measures squared straight-line distance, considers both direction and magnitude. ChromaDB uses squared L2 distance (without the square root) for computational efficiency.
Range: 0 (identical) to 4 (opposite) for normalized vectors under the squared form, [0, ∞) for non-normalized vectors
Best Use Cases:
- Image Similarity: Comparing visual features in image recognition
- Audio Matching: Finding similar audio patterns or fingerprints
- Time Series Analysis: Comparing sequence patterns in data
- Anomaly Detection: Identifying outliers in numerical data
- Biometric Matching: Face recognition and fingerprint comparison
2. Inner Product Distance
$$ d_{\text{IP}}(a,b) = 1 - \sum_{i=1}^{n} a_i b_i $$
Properties: Sensitive to magnitude, computationally efficient, good for recommendations
Range: 0 (identical) to 2 (opposite) for normalized vectors, unbounded for non-normalized vectors
Best Use Cases:
- Collaborative Filtering: User-item recommendation systems
- Feature Matching: When vector magnitude carries important information
- Performance-Critical Applications: When computation speed is crucial
- Preference Modeling: Capturing user interests in recommendation systems
- Real-time Search: Applications requiring fast similarity computations
3. Cosine Distance
$$ d_{\text{cosine}}(a,b) = 1 - \frac{\sum_{i=1}^n a_i b_i}{\sqrt{\sum_{i=1}^n a_i^2}\,\sqrt{\sum_{i=1}^n b_i^2}} $$
Properties: Measures angle between vectors, ignores magnitude, ideal for semantic similarity
Range: 0 (identical) to 2 (opposite)
Best Use Cases:
- Semantic Search: Ideal for finding documents with similar meaning regardless of length
- Text Classification: Comparing document topics and themes
- Multilingual Search: Finding similar content across different languages
- Question-Answering: Matching questions with potential answers
- Content Recommendation: When document length shouldn't influence recommendations
Implementation Example:
# Create collections with different distance functions
cosine_collection = chroma_client.create_collection(
    name="cosine_example",
    metadata={"hnsw:space": "cosine"}  # Best for semantic search
)

l2_collection = chroma_client.create_collection(
    name="l2_example",
    metadata={"hnsw:space": "l2"}  # Best for image/audio similarity
)

ip_collection = chroma_client.create_collection(
    name="ip_example",
    metadata={"hnsw:space": "ip"}  # Best for recommendations
)
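To build intuition for how the three metrics relate, here's a small numpy sketch computing each distance by hand for two unit-length vectors (the vectors are illustrative; for normalized inputs, squared L2 works out to exactly twice the cosine distance, and inner product distance coincides with cosine distance):

import numpy as np

def normalize(v):
    return v / np.linalg.norm(v)

a = normalize(np.array([1.0, 2.0, 3.0]))
b = normalize(np.array([2.0, 1.0, 0.5]))

squared_l2 = np.sum((a - b) ** 2)   # ChromaDB's "l2" space
inner_product = 1.0 - np.dot(a, b)  # ChromaDB's "ip" space
cosine = 1.0 - np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))  # "cosine" space

print(f"squared L2:    {squared_l2:.4f}")   # equals 2 * cosine for unit vectors
print(f"inner product: {inner_product:.4f}")
print(f"cosine:        {cosine:.4f}")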
Choosing the Right Distance Function:
Consider these factors when selecting a distance function:
- Data Type: Text content (Cosine), multimedia (L2), user preferences (Inner Product)
- Vector Properties: Normalized vs non-normalized vectors
- Performance Requirements: Computation speed vs accuracy trade-offs
- Search Objectives: Semantic similarity vs structural similarity
- Scale Sensitivity: Whether magnitude differences matter for your use case
Chroma Storage and Deployment Options
Data Persistence
ChromaDB offers three main storage options to suit different needs:
1. In-Memory Storage (Default)
Data is stored in RAM and is lost when the process ends. Offers the fastest performance, making it ideal for testing, development, and scenarios where data persistence isn't required.
2. Local Storage
Perfect for development and small applications. Data is stored in a local directory on disk, providing a good balance between persistence and simplicity. Ideal for projects with up to 100K entries (see the sketch after this list).
3. PostgreSQL Backend
Recommended for production environments and larger datasets. Provides ACID compliance, concurrent access support, and better scalability. Suitable for applications with millions of entries and multiple users accessing the database simultaneously.
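For the local storage option, ChromaDB provides PersistentClient, which writes data to a directory of your choosing (the path below is arbitrary):

import chromadb

# Data is written to ./chroma_db and survives process restarts
client = chromadb.PersistentClient(path="./chroma_db")
collection = client.get_or_create_collection(name="persistent_collection")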
Deployment Architecture
1. Embedded Mode
ChromaDB runs within your application process. This is the simplest deployment option, perfect for single-user applications and rapid development.
2. Client-Server Mode
Runs ChromaDB as a separate service, allowing multiple clients to connect to the same server. Ideal for distributed systems and when multiple applications need to access the same database (see the sketch after this list).
3. Containerized Deployment
Package ChromaDB in Docker containers for easy scaling and deployment management. Best for cloud environments and when you need consistent deployment across different environments.
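For client-server mode, a minimal sketch assuming a server started locally (for example with chroma run --path ./chroma_data) on the default port:

import chromadb

# Connect to a ChromaDB server running on localhost:8000 (the default)
client = chromadb.HttpClient(host="localhost", port=8000)
collection = client.get_or_create_collection(name="shared_collection")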
Key Considerations
When choosing a storage and deployment option, consider:
- Data volume and growth expectations
- Required query performance
- Number of concurrent users
- Backup and recovery needs
- Development vs. production environment
Other Solutions
While ChromaDB is excellent for many use cases, the vector database landscape offers various solutions to suit different needs:
Dedicated Vector Databases
1. Pinecone
- Managed service optimized for production deployment
- Excellent scaling capabilities
- Real-time vector search
- Reference
2. Milvus
- Open-source, cloud-native vector database
- Highly scalable architecture
- Support for multiple index types
- Reference
3. Weaviate
- Graph-like structure for complex relationships
- GraphQL-based query interface
- Multi-modal vector search
- Reference
4. Qdrant
- Written in Rust for high performance
- Extended filtering capabilities
- Production-ready with cloud offering
- Reference
Cloud-Native Solutions
1. Amazon OpenSearch
- Fully managed service with vector search capabilities
- Seamless integration with AWS ecosystem
- Supports k-NN search with HNSW and IVF algorithms
- Reference
2. Azure Cognitive Search
- Vector search integrated with Microsoft Azure
- Hybrid search capabilities (vector + keyword)
- Built-in content extraction and enrichment
- Reference
3. Google Vertex AI Vector Search
- Fully managed vector similarity search
- Integration with Google Cloud AI products
- Automatic scaling and optimization
- Reference
ChromaDB is particularly recommended for:
- Rapid prototyping and development
- Small to medium-scale applications
- Educational purposes
- When local deployment is preferred
- Projects requiring simple integration
Where are Vector Databases Used?
Vector databases find applications in numerous domains:
1. Semantic Search Engines
- Document retrieval
- Similar product recommendations
- Knowledge base search
2. Recommendation Systems
- Content recommendations
- Product suggestions
- Similar item discovery
3. Image and Audio Processing
- Similar image search
- Audio matching and fingerprinting
- Face recognition
- Music recommendation systems
- Speech pattern analysis
4. Natural Language Processing
- Question answering systems
- Document classification
- Content similarity analysis
5. Anomaly Detection
- Fraud detection in financial transactions
- Network security threat detection
- Manufacturing quality control
- System health monitoring
- Time series anomaly detection
6. Bioinformatics
- Protein structure similarity search
- Gene sequence matching
- Drug discovery
- Molecular property prediction
- Biological pathway analysis
7. Audio Analysis
- Voice recognition systems
- Music similarity search
- Sound event detection
- Speaker diarization
- Acoustic scene classification
- Environmental sound analysis
Vector Database Art
The header graphic presents an artistic vision of a vector database, where points corresponding to normalized vectors form a hypersphere surface in multidimensional space.
Repository
For complete code examples and more detailed implementations, check out the ChromaDB Examples GitHub repository.