Play with Chroma DB to understand how vector databases work

Imagine searching through millions of documents and finding exactly what you need, not because keywords match, but because the system understands the meaning behind your query. This is the power of vector databases. ChromaDB, a lightweight open-source vector database, makes this capability accessible to developers through an intuitive API for storing embeddings and running similarity searches. Whether you're building a smart search engine or a recommendation system, understanding vector databases is increasingly useful for modern development.

Reference: Getting Started with ChromaDB

Semantic Search

Semantic search goes beyond traditional keyword matching by understanding the meaning and context of search queries. Instead of looking for exact word matches, it finds content that is conceptually similar to the search query. ChromaDB makes implementing semantic search straightforward by handling the complex process of converting text into mathematical representations (vectors) and finding similar content.

Let's look at a practical example:

import chromadb

# Initialize the client
chroma_client = chromadb.Client()

# Create a collection - ChromaDB will use default embedding function
collection = chroma_client.create_collection(name="example01_collection")

# Sample texts to analyze
TEXT_ABOUT_PHYSICS = ("The equivalence principle is the hypothesis that the observed equivalence of gravitational and "
                      "inertial mass is a consequence of nature. The weak form, known for centuries, relates to "
                      "masses of any composition in free fall taking the same trajectories and landing at identical "
                      "times.")

TEXT_ABOUT_IT = ("Some online sites offer customers the ability to use a six-digit code which randomly changes every "
                 "30–60 seconds on a physical security token. The token has built-in computations and manipulates "
                 "numbers based on the current time. This means that every thirty seconds only a certain array of "
                 "numbers validate access.")

# Add documents to collection
collection.add(
    documents=[TEXT_ABOUT_PHYSICS, TEXT_ABOUT_IT],
    ids=["physics_text", "it_text"]
)

# Try different queries to demonstrate semantic search
for query in ["Bill Gates", "Richard Feynman"]:
    results = collection.query(
        query_texts=[query],
        n_results=2
    )
    # Results are nested lists: one inner list per query text
    print(query, "->", results["ids"][0], results["distances"][0])

Let's analyze what happens when we run these queries:

Query: "Bill Gates"
When we search for "Bill Gates", ChromaDB will return the IT-related text first. This happens because the semantic connection between Bill Gates and technology/security is stronger than his connection to physics. Even though Bill Gates isn't explicitly mentioned in the text, the embedding model understands the conceptual relationship between Gates, technology, and computer security.

Query: "Richard Feynman"
When searching for "Richard Feynman", the physics text will be returned as the closest match. This is because Feynman was a renowned physicist, and the embedding model recognizes the semantic relationship between Feynman and concepts like gravitational mass and physical principles, even though Feynman isn't mentioned in the text.

This demonstrates how semantic search understands context beyond simple keyword matching:

  • No exact words need to match between query and results
  • Results are based on conceptual similarity
  • The system can draw connections based on learned relationships in the embedding model
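
The ranking step behind these results can be sketched without any model. The three-dimensional vectors below are made-up stand-ins for real embeddings (real ones have hundreds of dimensions and come from the embedding model), and `cosine_similarity` and `rank_documents` are hypothetical helper names, not part of ChromaDB:

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two vectors (1.0 means same direction)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# Made-up 3D "embeddings"; the axes loosely mean (physics, technology, cooking).
documents = {
    "physics_text": [0.9, 0.1, 0.0],
    "it_text": [0.1, 0.9, 0.1],
}
queries = {
    "a famous physicist": [0.8, 0.2, 0.1],
    "a software company founder": [0.1, 0.8, 0.2],
}

def rank_documents(query_vector):
    """Return document ids ordered from most to least similar to the query."""
    return sorted(
        documents,
        key=lambda doc_id: cosine_similarity(query_vector, documents[doc_id]),
        reverse=True,
    )

for text, vector in queries.items():
    print(text, "->", rank_documents(vector))
```

No word is shared between the queries and the document ids; the order comes purely from how close the vectors point in the same direction.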

Understanding Vectors

At the heart of vector databases lies the concept of vector embeddings. These are mathematical representations of data in high-dimensional space. To understand this better, let's start with a simple 3D example:

$$ \text{Distance between points:} \hspace{0.10in} 4.69 \hspace{0.05in} units \\[0.2in] \text{Formula:} \hspace{0.25in} d = \sqrt{(x_2-x_1)^2 + (y_2-y_1)^2 + (z_2-z_1)^2} $$
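
As a quick check of the formula, the sketch below uses two hypothetical points, (1, 2, 3) and (4, 5, 5). They are chosen here only because their distance works out to the square root of 22, about 4.69; they are an assumption, not necessarily the points from the original figure.

```python
import math

def euclidean_distance(p, q):
    """Straight-line distance between two points of equal dimension."""
    return math.sqrt(sum((b - a) ** 2 for a, b in zip(p, q)))

point_a = (1, 2, 3)  # (x1, y1, z1), hypothetical
point_b = (4, 5, 5)  # (x2, y2, z2), hypothetical

# (4-1)^2 + (5-2)^2 + (5-3)^2 = 9 + 9 + 4 = 22
distance = euclidean_distance(point_a, point_b)
print(f"Distance between points: {distance:.2f} units")
```

The same function works unchanged for 384-dimensional vectors, since it simply sums the squared differences over however many coordinates there are.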

While vectors and embeddings are often used interchangeably in discussions about vector databases, there is a subtle but important distinction:

  • Vector: A mathematical construct representing a point in space using an array of numbers. For example, the point (2, 3, 4) in 3D space is represented by a vector [2, 3, 4]. Vectors can represent any kind of numerical data.
  • Embedding: The result of transforming non-numerical data (like text, images, or audio) into a vector format that captures the semantic meaning of the data. For example, the sentence "I love dogs" might be transformed into a vector [0.2, -0.5, 0.8, ...] where the values represent various semantic features learned by the embedding model.

In vector databases:

  • The term "vector" refers to the mathematical representation used to store and compare data
  • "Embedding" refers to both the process of creating these vectors and the resulting vectors themselves
  • All embeddings are vectors, but not all vectors are embeddings

In vector databases, we work with much higher dimensions - typically 384 or more. ChromaDB's default embedding model creates 384-dimensional vectors, where each dimension captures different aspects of the input data. These vectors are normalized to have a magnitude of approximately 1.0, which helps ensure consistent comparisons regardless of text length.

Here's an example of examining vector properties:

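import numpy as np

# Vector properties example: fetch one stored embedding explicitly,
# because embeddings are only returned when requested via `include`
stored = collection.get(ids=["physics_text"], include=["embeddings"])
vector = np.array(stored["embeddings"][0])
magnitude = np.linalg.norm(vector)
print(f"Vector magnitude: {magnitude:.4f}")  # Should be close to 1.0
print(f"Dimensions: {vector.shape[0]}")      # 384 dimensions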
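
To see why normalization makes comparisons independent of scale, here is a minimal sketch in plain Python. The helper names `magnitude` and `normalize` and the two-dimensional sample vectors are illustrative assumptions, not ChromaDB functions:

```python
import math

def magnitude(vector):
    """Euclidean length of a vector."""
    return math.sqrt(sum(x * x for x in vector))

def normalize(vector):
    """Scale a vector to length 1.0 while keeping its direction."""
    length = magnitude(vector)
    if length == 0:
        raise ValueError("cannot normalize a zero vector")
    return [x / length for x in vector]

short_vector = [3.0, 4.0]    # length 5
long_vector = [30.0, 40.0]   # same direction, length 50

unit_short = normalize(short_vector)
unit_long = normalize(long_vector)

print(unit_short, magnitude(unit_short))
print(unit_long, magnitude(unit_long))
```

Both inputs end up as the same unit vector, so after normalization only the direction, which is what carries the meaning, takes part in the comparison.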

Embedding Functions

Reference: Chroma - Embeddings

Embedding functions are algorithms that convert data (text, images, etc.) into vector representations. ChromaDB supports various embedding models optimized for different use cases:

Language Models

Examples:

  • all-MiniLM-L6-v2 (384 dimensions) - Default, general purpose
  • paraphrase-multilingual-MiniLM-L12-v2 (384 dimensions) - Multilingual support
  • LaBSE (768 dimensions) - Supports 109 languages

Domain-Specific Models

Examples:

  • pritamdeka/S-PubMedBert-MS-MARCO - Medical/scientific text
  • legal-bert-base-uncased - Legal documents
  • finbert - Financial text

Here's how to use a custom embedding function:

from chromadb.utils import embedding_functions

# Initialize a specific embedding function
scientific_ef = embedding_functions.SentenceTransformerEmbeddingFunction(
    model_name="pritamdeka/S-PubMedBert-MS-MARCO"
)

# Create collection with custom embedding function
collection = chroma_client.create_collection(
    name="scientific_collection",
    embedding_function=scientific_ef
)
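
To make the idea of an embedding function concrete, here is a toy one written in plain Python. It hashes words into a fixed number of buckets and normalizes the result, so it captures word overlap rather than meaning. The class name `HashingEmbeddingFunction` is hypothetical. It mirrors the general call shape (a list of texts in, a list of vectors out), but whether it can be passed directly as `embedding_function` depends on the ChromaDB version, which may require subclassing its `EmbeddingFunction` base class; treat it as an illustration only.

```python
import math
import zlib

class HashingEmbeddingFunction:
    """Toy embedding: hashed bag-of-words, L2-normalized. Not semantic."""

    def __init__(self, dimensions=16):
        self.dimensions = dimensions

    def __call__(self, input):
        # One vector per input text, in the same order.
        return [self._embed(text) for text in input]

    def _embed(self, text):
        vector = [0.0] * self.dimensions
        for word in text.lower().split():
            # crc32 is deterministic across runs, unlike the built-in hash().
            bucket = zlib.crc32(word.encode("utf-8")) % self.dimensions
            vector[bucket] += 1.0
        length = math.sqrt(sum(x * x for x in vector))
        return [x / length for x in vector] if length else vector

embed = HashingEmbeddingFunction(dimensions=16)
vectors = embed(["free fall and gravity", "security token codes"])
print(len(vectors), len(vectors[0]))
```

A real model replaces the hashing step with a neural network, which is what lets texts with no words in common still land close together.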

In example03-embeddings-functions.py, there's a practical demonstration of how choosing the right embedding function can significantly improve search results for specific use cases.

Distance Functions

Reference: Chroma - Distance functions

Distance functions measure the similarity between vectors. The choice of distance function can significantly impact search results and performance. ChromaDB supports three main distance metrics, each with specific strengths and ideal use cases:

1. Euclidean (L2) Distance (Default)

$$ \text{L2_distance}(a,b) = \sum_{i=1}^n (a_i - b_i)^2 $$

Properties: Measures squared straight-line distance, considers both direction and magnitude. ChromaDB uses squared L2 distance (without the square root) for computational efficiency.

Range: 0 (identical) to 4 (opposite directions) for normalized vectors, because the squared L2 distance between unit vectors equals 2 − 2·cos θ; [0, ∞) for non-normalized vectors

Best Use Cases:

  • Image Similarity: Comparing visual features in image recognition
  • Audio Matching: Finding similar audio patterns or fingerprints
  • Time Series Analysis: Comparing sequence patterns in data
  • Anomaly Detection: Identifying outliers in numerical data
  • Biometric Matching: Face recognition and fingerprint comparison
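
The following sketch illustrates two properties of the squared form: dropping the square root does not change the ranking, and for unit vectors the squared L2 distance equals 2 − 2·cos θ. The helper names and the two-dimensional sample vectors are assumptions made for illustration.

```python
import math

def squared_l2(a, b):
    """Sum of squared coordinate differences (no square root)."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def cosine_similarity(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# All three vectors have length 1.
query = [1.0, 0.0]
candidates = {"near": [0.8, 0.6], "far": [0.0, 1.0]}

# Ranking by squared distance and by true distance gives the same order,
# because the square root is a monotonically increasing function.
by_squared = sorted(candidates, key=lambda k: squared_l2(query, candidates[k]))
by_true = sorted(candidates, key=lambda k: math.sqrt(squared_l2(query, candidates[k])))
print(by_squared, by_true)

for name, vector in candidates.items():
    print(name, squared_l2(query, vector), 2 - 2 * cosine_similarity(query, vector))
```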

2. Inner Product Distance

$$ \text{inner_product_distance}(a,b) = 1 - \sum_{i=1}^n a_i b_i $$

Properties: Sensitive to magnitude, computationally efficient, good for recommendations

Range: [0, 2] for normalized vectors (0: same direction, 2: opposite directions), unbounded otherwise

Best Use Cases:

  • Collaborative Filtering: User-item recommendation systems
  • Feature Matching: When vector magnitude carries important information
  • Performance-Critical Applications: When computation speed is crucial
  • Preference Modeling: Capturing user interests in recommendation systems
  • Real-time Search: Applications requiring fast similarity computations
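
Magnitude sensitivity is easiest to see by comparing a ranking driven by the raw dot product with one driven by cosine similarity. The sketch below uses hypothetical user and item vectors; whatever offset or sign convention a library applies to turn the dot product into a distance, the ordering is determined by the dot product itself.

```python
import math

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def cosine_similarity(a, b):
    return dot(a, b) / (math.sqrt(dot(a, a)) * math.sqrt(dot(b, b)))

user = [1.0, 1.0]
items = {
    "niche_item": [0.9, 0.9],    # same direction as the user, small magnitude
    "popular_item": [3.0, 2.0],  # slightly different direction, large magnitude
}

best_by_dot = max(items, key=lambda name: dot(user, items[name]))
best_by_cosine = max(items, key=lambda name: cosine_similarity(user, items[name]))

print("Best by inner product:", best_by_dot)
print("Best by cosine:", best_by_cosine)
```

The inner product favors the large-magnitude item even though the other one points in exactly the user's direction, which is useful when magnitude encodes something meaningful such as popularity or confidence.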

3. Cosine Distance

$$ \text{cosine_distance}(a,b) = 1 - \frac{\sum_{i=1}^n a_i b_i}{\sqrt{\sum_{i=1}^n a_i^2} \sqrt{\sum_{i=1}^n b_i^2}} $$

Properties: Measures angle between vectors, ignores magnitude, ideal for semantic similarity

Range: 0 (identical) to 2 (opposite)

Best Use Cases:

  • Semantic Search: Ideal for finding documents with similar meaning regardless of length
  • Text Classification: Comparing document topics and themes
  • Multilingual Search: Finding similar content across different languages
  • Question-Answering: Matching questions with potential answers
  • Content Recommendation: When document length shouldn't influence recommendations
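
A small sketch of the formula shows both ends of the range and the fact that length is ignored. The function name and the two-dimensional sample vectors are illustrative assumptions.

```python
import math

def cosine_distance(a, b):
    """1 minus the cosine of the angle between a and b."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return 1 - dot / (norm_a * norm_b)

short_doc = [1.0, 2.0]
long_doc = [10.0, 20.0]     # same direction, ten times longer
unrelated = [2.0, -1.0]     # perpendicular to short_doc
opposite = [-1.0, -2.0]     # points the other way

same_direction = cosine_distance(short_doc, long_doc)
perpendicular = cosine_distance(short_doc, unrelated)
reversed_direction = cosine_distance(short_doc, opposite)

print(same_direction, perpendicular, reversed_direction)
```

The first value is approximately 0 despite the tenfold difference in length, the second is 1, and the third is approximately 2.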

Implementation Example:

# Create collections with different distance functions
cosine_collection = chroma_client.create_collection(
    name="cosine_example",
    metadata={"hnsw:space": "cosine"}  # Best for semantic search
)

l2_collection = chroma_client.create_collection(
    name="l2_example",
    metadata={"hnsw:space": "l2"}  # Best for image/audio similarity
)

ip_collection = chroma_client.create_collection(
    name="ip_example",
    metadata={"hnsw:space": "ip"}  # Best for recommendations
)

Choosing the Right Distance Function:

Consider these factors when selecting a distance function:

  • Data Type: Text content (Cosine), multimedia (L2), user preferences (Inner Product)
  • Vector Properties: Normalized vs non-normalized vectors
  • Performance Requirements: Computation speed vs accuracy trade-offs
  • Search Objectives: Semantic similarity vs structural similarity
  • Scale Sensitivity: Whether magnitude differences matter for your use case

Chroma Storage and Deployment Options

Data Persistence

ChromaDB offers three main storage options to suit different needs:

1. In-Memory Storage (Default)
Data is stored in RAM and is lost when the process ends. Offers the fastest performance, making it ideal for testing, development, and scenarios where data persistence isn't required.

2. Local Storage
Perfect for development and small applications. Data is stored in a local directory on disk, providing a good balance between persistence and simplicity. Ideal for projects with up to 100K entries.

3. Server-Backed Storage
Recommended for production environments and larger datasets. A standalone Chroma server owns the persisted data, and applications connect to it over HTTP (for example with `chromadb.HttpClient`), so multiple users and applications can access the same database simultaneously.

Deployment Architecture

1. Embedded Mode
ChromaDB runs within your application process. This is the simplest deployment option, perfect for single-user applications and rapid development.

2. Client-Server Mode
Runs ChromaDB as a separate service, allowing multiple clients to connect to the same server. Ideal for distributed systems and when multiple applications need to access the same database.

3. Containerized Deployment
Package ChromaDB in Docker containers for easy scaling and deployment management. Best for cloud environments and when you need consistent deployment across different environments.

Key Considerations

When choosing a storage and deployment option, consider:

  • Data volume and growth expectations
  • Required query performance
  • Number of concurrent users
  • Backup and recovery needs
  • Development vs. production environment

Other Solutions

While ChromaDB is excellent for many use cases, the vector database landscape offers various solutions to suit different needs:

Dedicated Vector Databases

1. Pinecone

  • Managed service optimized for production deployment
  • Excellent scaling capabilities
  • Real-time vector search

2. Milvus

  • Open-source, cloud-native vector database
  • Highly scalable architecture
  • Support for multiple index types

3. Weaviate

  • Graph-like structure for complex relationships
  • GraphQL-based query interface
  • Multi-modal vector search

4. Qdrant

  • Written in Rust for high performance
  • Extended filtering capabilities
  • Production-ready with cloud offering

Cloud-Native Solutions

1. Amazon OpenSearch

  • Fully managed service with vector search capabilities
  • Seamless integration with AWS ecosystem
  • Supports k-NN search with HNSW and IVF algorithms

2. Azure AI Search (formerly Azure Cognitive Search)

  • Vector search integrated with Microsoft Azure
  • Hybrid search capabilities (vector + keyword)
  • Built-in content extraction and enrichment

3. Google Vertex AI Vector Search

  • Fully managed vector similarity search
  • Integration with Google Cloud AI products
  • Automatic scaling and optimization

ChromaDB is particularly recommended for:

  • Rapid prototyping and development
  • Small to medium-scale applications
  • Educational purposes
  • When local deployment is preferred
  • Projects requiring simple integration

Where are Vector Databases Used?

Vector databases find applications in numerous domains:

1. Semantic Search Engines

  • Document retrieval
  • Similar product recommendations
  • Knowledge base search

2. Recommendation Systems

  • Content recommendations
  • Product suggestions
  • Similar item discovery

3. Image and Audio Processing

  • Similar image search
  • Audio matching and fingerprinting
  • Face recognition
  • Music recommendation systems
  • Speech pattern analysis

4. Natural Language Processing

  • Question answering systems
  • Document classification
  • Content similarity analysis

5. Anomaly Detection

  • Fraud detection in financial transactions
  • Network security threat detection
  • Manufacturing quality control
  • System health monitoring
  • Time series anomaly detection

6. Bioinformatics

  • Protein structure similarity search
  • Gene sequence matching
  • Drug discovery
  • Molecular property prediction
  • Biological pathway analysis

7. Audio Analysis

  • Voice recognition systems
  • Music similarity search
  • Sound event detection
  • Speaker diarization
  • Acoustic scene classification
  • Environmental sound analysis

Vector Database Art

The header graphic presents an artistic vision of a vector database, where points corresponding to normalized vectors form a hypersphere surface in multidimensional space.

Repository

For complete code examples and more detailed implementations, check out the ChromaDB Examples GitHub repository.
