HNSW Vector Index Demo
Overview
This tutorial demonstrates HNSW (Hierarchical Navigable Small World) vector indexing in MatrixOne Python SDK. HNSW is a graph-based approximate nearest neighbor search algorithm that provides excellent search performance with high recall rates.
HNSW Advantages:
- ⚡ Very Fast Search: Superior query performance
- 🎯 High Recall: >99% accuracy for nearest neighbor search
- 🚀 No Training Required: Unlike IVF, no clustering training needed
- 📊 Predictable Performance: Consistent query latency
HNSW Characteristics:
- 🔒 Read-Only (Current Limitation): Cannot insert/update/delete after index creation
- 🔑 BigInteger Primary Key Required: Must use
BigIntegertype - 💾 Higher Memory Usage: Graph structure requires more memory than IVF
Future Enhancement: Incremental Updates Coming Soon
Good News: MatrixOne will soon support incremental updates for HNSW indexes!
- 🔄 Async Index Updates: New vectors will be added to the index asynchronously
- ➕ Insert Support: You'll be able to insert new vectors after index creation
- 🔧 Update Support: Modify existing vectors without rebuilding the entire index
- 🗑️ Delete Support: Remove vectors from the index
Current Status: This feature is under development and not yet available. For now, HNSW indexes remain read-only after creation. Once the index is created, you cannot insert or modify data without dropping and rebuilding the index.
Workaround for Now: Use IVF index if you need dynamic updates, or plan to rebuild HNSW index periodically.
MatrixOne Python SDK Documentation
For complete API reference, see MatrixOne Python SDK Documentation
Before You Start
Prerequisites
- MatrixOne database installed and running
- Python 3.7 or higher
- MatrixOne Python SDK installed
Install SDK
pip3 install matrixone-python-sdk
Complete Working Example
Save this as hnsw_demo.py and run with python3 hnsw_demo.py:
from matrixone import Client
from matrixone.config import get_connection_params
from matrixone.sqlalchemy_ext import create_vector_column
from matrixone.orm import declarative_base
from sqlalchemy import BigInteger, Column, String
import numpy as np
import time
np.random.seed(42)
print("="* 70)
print("MatrixOne HNSW Vector Index Demo")
print("="* 70)
# Connect to database
host, port, user, password, database = get_connection_params(database='demo')
client = Client()
client.connect(host=host, port=port, user=user, password=password, database=database)
print(f"Connected to database")
# Define table with BigInteger primary key (HNSW requirement!)
Base = declarative_base()
class Document(Base):
__tablename__ = "hnsw_demo_docs"
id = Column(BigInteger, primary_key=True) # Must be BigInteger!
title = Column(String(200))
category = Column(String(100))
embedding = create_vector_column(64, "f32")
print(f"Defined table with BigInteger primary key (HNSW requirement)")
# Create table
client.drop_table(Document)
client.create_table(Document)
# IMPORTANT: Insert data BEFORE creating HNSW index
documents = [
{
"id": i + 1,
"title": f"Document {i + 1}",
"category": f"Category_{i % 5}",
"embedding": np.random.rand(64).astype(np.float32).tolist()
}
for i in range(200)
]
client.batch_insert(Document, documents)
print(f"Inserted {len(documents)} documents (BEFORE creating index)")
# Enable HNSW and create index
client.vector_ops.enable_hnsw()
client.vector_ops.create_hnsw(
Document, # Using model
"idx_embedding_hnsw",
"embedding",
m=16,
ef_construction=200,
ef_search=50,
op_type="vector_l2_ops"
)
print("HNSW index created")
# Search with ORM-style query
query_vector = np.random.rand(64).astype(np.float32).tolist()
results = client.query(
Document.id,
Document.title,
Document.embedding.l2_distance(query_vector).label('distance')
).order_by('distance').limit(5).all()
print(f"\n Found {len(results)} similar documents:")
for i, row in enumerate(results, 1):
print(f"{i}. ID: {row.id}, Distance: {row.distance:.4f}")
# Cleanup
client.disconnect()
print("\n Demo completed!")
Key Concepts
1. HNSW Requirements
BigInteger Primary Key (Required!)
class Document(Base):
id = Column(BigInteger, primary_key=True) # ✅ Must be BigInteger
# id = Column(Integer, primary_key=True) # ❌ Won't work with HNSW!
Why? HNSW algorithm requires 64-bit integer IDs for graph node references.
Read-Only After Creation (Current Limitation)
Current Limitation: Read-Only Index
IMPORTANT: In the current version, once an HNSW index is created, the table becomes read-only for that indexed column. You cannot insert, update, or delete vectors.
Future Update: Incremental update support with async index updates is coming soon!
# ✅ Correct workflow (Current Version):
client.create_table(Document)
client.batch_insert(Document, all_data) # Insert ALL data first
client.vector_ops.create_hnsw(...) # Then create index
# Now table is READ-ONLY!
# ❌ After index creation, CANNOT (until incremental update is released):
# - Insert new vectors
# - Update existing vectors
# - Delete vectors
# ✅ To modify data (Current Workaround):
# 1. Drop HNSW index
# 2. Modify data (insert/update/delete)
# 3. Recreate HNSW index
Current Workaround:
def update_data_with_hnsw(client, Model, new_data):
"""Update data when using HNSW index (current version)"""
# Step 1: Drop existing HNSW index
client.vector_ops.drop(table_name, "idx_hnsw")
print("Dropped HNSW index")
# Step 2: Now you can modify data
client.batch_insert(Model, new_data)
print("Inserted new data")
# Step 3: Recreate HNSW index
client.vector_ops.create_hnsw(
Model, "idx_hnsw_v2", "embedding",
m=16, ef_construction=200, ef_search=50,
op_type="vector_l2_ops"
)
print("HNSW index recreated")
# Future (when incremental update is available):
# Just insert data directly, index will update asynchronously!
# client.batch_insert(Model, new_data) # Will work!
2. Create HNSW Index
client.vector_ops.enable_hnsw() # Enable HNSW first
client.vector_ops.create_hnsw(
Model, # Table model or table name string
index_name, # Index name
vector_column, # Vector column name
m=16, # Number of connections (default: 16)
ef_construction=200, # Construction parameter (default: 200)
ef_search=50, # Search parameter (default: 50)
op_type="vector_l2_ops" # Distance metric
)
3. HNSW Parameters
m (Number of Connections)
- What it does: Number of bi-directional links per node in the graph
- Impact: Higher m = better recall, slower construction, more memory
| m Value | Recall | Construction Speed | Memory | Use Case |
|---|---|---|---|---|
| 4-8 | ~95% | Fast | Low | Fast approximate search |
| 16 | ~99% | Medium | Medium | Recommended default |
| 32-64 | ~99.5%+ | Slow | High | High-precision search |
ef_construction (Index Quality)
- What it does: Dynamic candidate list size during index construction
- Impact: Higher value = better quality index, slower construction
| ef_construction | Quality | Construction Speed | Recommendation |
|---|---|---|---|
| 100-200 | Good | Fast | Quick setup |
| 200-500 | Better | Medium | Recommended (200) |
| 500+ | Best | Slow | High-precision needs |
ef_search (Search Quality)
- What it does: Dynamic candidate list size during search
- Impact: Higher value = better recall, slower search
| ef_search | Recall | Search Speed | Use Case |
|---|---|---|---|
| 10-50 | Good | Fast | Speed-critical |
| 50-100 | Better | Medium | Balanced (default 50) |
| 100+ | Best | Slower | Precision-critical |
Query Methods
Method 1: vector_ops.similarity_search()
Low-level API for direct vector similarity search:
results = client.vector_ops.similarity_search(
Document, # Model or table name
vector_column="embedding",
query_vector=query_vector,
limit=10,
distance_type="l2" # or "cosine", "ip"
)
# Returns list of tuples: (id, title, category, ..., distance)
for row in results:
id, title, category, distance = row[0], row[1], row[2], row[-1]
print(f"ID: {id}, Distance: {distance:.4f}")
Method 2: client.query() with ORM
SQLAlchemy-style ORM query (more flexible):
results = client.query(
Document.id,
Document.title,
Document.embedding.l2_distance(query_vector).label('distance')
).order_by('distance').limit(10).all()
# Returns list of row objects with attributes
for row in results:
print(f"ID: {row.id}, Distance: {row.distance:.4f}")
Advantages of ORM Method:
- ✅ Support filters (
.filter(Document.category == "Tech")) - ✅ Easy sorting and pagination
- ✅ Type-safe attribute access
- ✅ Combine with WHERE conditions
Usage Examples
Basic Similarity Search
# Generate query vector
query_vector = np.random.rand(64).astype(np.float32).tolist()
# Search top 10 similar documents
results = client.query(
Document.id,
Document.title,
Document.embedding.l2_distance(query_vector).label('distance')
).order_by('distance').limit(10).all()
for i, row in enumerate(results, 1):
print(f"{i}. {row.title} - Distance: {row.distance:.4f}")
Search with Category Filter
# Find similar documents in "Technology" category only
results = client.query(
Document.id,
Document.title,
Document.category,
Document.embedding.l2_distance(query_vector).label('distance')
).filter(
Document.category == "Technology"
).order_by('distance').limit(10).all()
Search with Multiple Filters
# Combine multiple conditions
results = client.query(
Document.id,
Document.title,
Document.embedding.l2_distance(query_vector).label('distance')
).filter(
Document.category == "Science"
).filter(
Document.doc_type == "Research"
).order_by('distance').limit(5).all()
Search with Distance Threshold
# Only return documents within distance 8.0
results = client.query(
Document.id,
Document.title,
Document.embedding.l2_distance(query_vector).label('distance')
).filter(
Document.embedding.l2_distance(query_vector) < 8.0
).order_by('distance').all()
Distance Metrics
L2 (Euclidean) Distance
General purpose distance metric:
results = client.query(
Document.id,
Document.embedding.l2_distance(query_vector).label('distance')
).order_by('distance').limit(10).all()
Use when: General vector similarity, absolute differences matter
Cosine Distance
Measures angular distance (direction similarity):
# Normalize query vector first
norm = np.linalg.norm(query_vector)
normalized_query = (np.array(query_vector) / norm).tolist()
results = client.query(
Document.id,
Document.embedding.cosine_distance(normalized_query).label('distance')
).order_by('distance').limit(10).all()
Use when: Semantic similarity, magnitude doesn't matter (e.g., text embeddings)
Inner Product
Dot product distance:
results = client.query(
Document.id,
Document.embedding.inner_product(query_vector).label('distance')
).order_by('distance').limit(10).all()
Use when: Need both magnitude and direction information
Performance Benchmarking
Compare Query Methods
import time
# Benchmark vector_ops method
times_vector_ops = []
for _ in range(10):
test_vector = np.random.rand(64).astype(np.float32).tolist()
start = time.time()
results = client.vector_ops.similarity_search(
Document, vector_column="embedding",
query_vector=test_vector, limit=10
)
times_vector_ops.append((time.time() - start) * 1000)
print(f"vector_ops avg: {np.mean(times_vector_ops):.2f}ms")
# Benchmark ORM method
times_orm = []
for _ in range(10):
test_vector = np.random.rand(64).astype(np.float32).tolist()
start = time.time()
results = client.query(
Document.id,
Document.embedding.l2_distance(test_vector).label('distance')
).order_by('distance').limit(10).all()
times_orm.append((time.time() - start) * 1000)
print(f"ORM query avg: {np.mean(times_orm):.2f}ms")
Test Different K Values
for k in [5, 10, 20, 50]:
start = time.time()
results = client.query(
Document.id,
Document.embedding.l2_distance(query_vector).label('distance')
).order_by('distance').limit(k).all()
elapsed = (time.time() - start) * 1000
print(f"K={k}: {elapsed:.2f}ms")
HNSW vs IVF Comparison
| Feature | HNSW | IVF |
|---|---|---|
| Index Type | Graph-based | Clustering-based |
| Training Required | No | Yes (k-means clustering) |
| Primary Key | BigInteger required | Any type |
| Insert After Index | ❌ No (coming soon!) | ✅ Yes |
| Update/Delete | ❌ No (coming soon!) | ✅ Yes |
| Update Method | 🔄 Async (future) | Immediate |
| Search Speed | ⚡ Very Fast | Fast |
| Recall Quality | 🎯 Excellent (>99%) | Good (>95%) |
| Memory Usage | Higher | Lower |
| Construction Time | Medium | Fast |
| Best For | Static datasets | Dynamic datasets |
| Current Status | Read-only | Fully dynamic |
When to Use HNSW
✅ Use HNSW when:
- Static or infrequently updated datasets
- High recall requirements (>99%)
- Fast search is critical
- High-dimensional vectors (>128 dims)
- You can rebuild index when data changes
When to Use IVF
✅ Use IVF when:
- Frequently updated datasets
- Need insert/update/delete operations
- Large datasets with memory constraints
- Acceptable recall (95-98%)
- Dynamic, growing datasets
HNSW Parameter Tuning
Parameter Selection Guide
def get_hnsw_parameters(vector_count, use_case):
"""Get recommended HNSW parameters based on use case"""
if use_case == "fast":
# Optimize for speed
return {"m": 8, "ef_construction": 100, "ef_search": 20}
elif use_case == "balanced":
# Balance speed and recall (recommended)
return {"m": 16, "ef_construction": 200, "ef_search": 50}
elif use_case == "high_recall":
# Optimize for accuracy
return {"m": 32, "ef_construction": 400, "ef_search": 100}
else:
# Default
return {"m": 16, "ef_construction": 200, "ef_search": 50}
# Usage
params = get_hnsw_parameters(10000, "balanced")
client.vector_ops.create_hnsw(
Model, index_name, column_name,
m=params["m"],
ef_construction=params["ef_construction"],
ef_search=params["ef_search"]
)
m Parameter Guide
m = Number of connections per node
-
m = 4-8: Fast search, lower recall (~95%)
- Construction: Very fast
- Search: Very fast
- Memory: Low
- Use: Approximate search OK
-
m = 16: Balanced (default, ~99% recall)
- Construction: Medium
- Search: Fast
- Memory: Medium
- Use: Recommended for most cases
-
m = 32-64: High recall (~99.5%+)
- Construction: Slow
- Search: Medium
- Memory: High
- Use: Precision-critical applications
ef_construction Parameter Guide
ef_construction = Candidate list size during construction
- 100-200: Fast construction, good quality
- 200-400: Better quality (default 200)
- 400+: Best quality, slow construction
Rule of thumb: Higher ef_construction improves index quality but increases build time linearly.
ef_search Parameter Guide
ef_search = Candidate list size during search
- 10-50: Fast search, lower recall
- 50-100: Balanced (default 50)
- 100+: High recall, slower search
Important: You can adjust ef_search at query time without rebuilding the index (if API supports).
Best Practices
1. Data Preparation (Critical!)
# ✅ CORRECT: Insert ALL data first
client.create_table(Document)
client.batch_insert(Document, all_data) # Insert 100% of data
client.vector_ops.create_hnsw(...) # Then create index
# ❌ WRONG: Create index on empty/partial data
client.create_table(Document)
client.vector_ops.create_hnsw(...) # Index on empty table
client.batch_insert(Document, data) # Insert later - WON'T WORK!
2. Use BigInteger Primary Key
# ✅ CORRECT
id = Column(BigInteger, primary_key=True)
# ❌ WRONG - HNSW won't work
id = Column(Integer, primary_key=True)
id = Column(String(50), primary_key=True)
3. Start with Default Parameters
# Start with defaults, then tune if needed
client.vector_ops.create_hnsw(
Model, index_name, column_name,
m=16, # Good default
ef_construction=200, # Good default
ef_search=50, # Good default
op_type="vector_l2_ops"
)
4. Choose Appropriate Distance Metric
# For text embeddings (semantic similarity)
op_type="vector_cosine_ops" # Cosine similarity
# For general vectors (absolute distance)
op_type="vector_l2_ops" # Euclidean distance
# For special cases
op_type="vector_ip_ops" # Inner product
5. Plan for Data Updates (Current Limitation)
Read-Only Index in Current Version
Current Limitation: HNSW indexes are read-only. Once created, you cannot insert, update, or delete vectors without dropping the index.
Coming Soon: Incremental update support with asynchronous index updates will be available in a future release!
Current Workaround: Drop, modify, rebuild
def update_hnsw_index(client, Model, new_data):
"""Update HNSW index with new data (current version workaround)"""
# 1. Drop existing HNSW index
client.vector_ops.drop(table_name, "idx_hnsw")
print("Dropped HNSW index (now table is writable)")
# 2. Insert/update/delete data
client.batch_insert(Model, new_data)
print("Modified data")
# 3. Recreate HNSW index
client.vector_ops.create_hnsw(
Model, "idx_hnsw_v2", "embedding",
m=16, ef_construction=200, ef_search=50,
op_type="vector_l2_ops"
)
print("HNSW index rebuilt with new data")
# Future (when incremental update is released):
def update_hnsw_index_future(client, Model, new_data):
"""Future: Direct insert with async index update"""
# Just insert - index will update asynchronously!
client.batch_insert(Model, new_data)
print("Data inserted, index updating in background")
Interim Solution: If you need frequent updates, consider:
- Using IVF index instead (supports dynamic updates)
- Scheduling periodic HNSW rebuilds (e.g., nightly)
- Maintaining a separate "pending" table for new data, merge periodically
Advanced Examples
Multi-Metric Comparison
Compare all three distance metrics:
metrics = [
("L2", lambda v: Document.embedding.l2_distance(v)),
("Cosine", lambda v: Document.embedding.cosine_distance(v)),
("Inner Product", lambda v: Document.embedding.inner_product(v))
]
for metric_name, metric_func in metrics:
results = client.query(
Document.id,
metric_func(query_vector).label('distance')
).order_by('distance').limit(5).all()
print(f"\n{metric_name} Distance:")
for i, row in enumerate(results[:3], 1):
print(f"{i}. ID: {row.id}, Distance: {row.distance:.4f}")
Combined Filters
# Complex query: similarity + category + distance threshold
results = client.query(
Document.id,
Document.title,
Document.category,
Document.embedding.l2_distance(query_vector).label('distance')
).filter(
Document.category.in_(["Technology", "Science"])
).filter(
Document.embedding.l2_distance(query_vector) < 10.0
).order_by('distance').limit(20).all()
Pagination
page_size = 10
page_num = 2 # Second page
results = client.query(
Document.id,
Document.title,
Document.embedding.l2_distance(query_vector).label('distance')
).order_by('distance').limit(page_size).offset((page_num - 1) * page_size).all()
Troubleshooting
Issue: "BigInteger primary key required"
Cause: Using wrong primary key type
Solution: Change to BigInteger
# ❌ Wrong
class Doc(Base):
id = Column(Integer, primary_key=True)
# ✅ Correct
class Doc(Base):
id = Column(BigInteger, primary_key=True)
Issue: "Cannot insert after creating HNSW index"
Cause: HNSW is read-only in current version
Solution: Insert all data before creating index, or use workaround
# ✅ Solution 1: Correct workflow (Insert data first)
client.create_table(Document)
client.batch_insert(Document, all_data) # ALL data
client.vector_ops.create_hnsw(...) # Then index
# Now can only query
# ✅ Solution 2: Drop and rebuild to add data
client.vector_ops.drop(table_name, "idx_hnsw") # Drop index
client.batch_insert(Document, new_data) # Add data
client.vector_ops.create_hnsw(...) # Rebuild index
# ✅ Solution 3: Use IVF if frequent updates needed
client.vector_ops.create_ivf(...) # IVF supports updates
Note: Incremental update support (async) is coming in a future release, which will allow direct inserts after index creation!
Issue: "Low recall / poor results"
Cause: Parameters too aggressive for speed
Solution: Increase ef_search or m
# Try higher parameters
client.vector_ops.create_hnsw(
Model, index_name, column_name,
m=32, # Increased from 16
ef_construction=400, # Increased from 200
ef_search=100, # Increased from 50
op_type="vector_l2_ops"
)
Issue: "Index creation too slow"
Cause: Parameters too high for large dataset
Solution: Reduce ef_construction
# Faster construction
client.vector_ops.create_hnsw(
Model, index_name, column_name,
m=16,
ef_construction=100, # Reduced from 200
ef_search=50,
op_type="vector_l2_ops"
)
Issue: "High memory usage"
Cause: High m value creates more connections
Solution: Reduce m or use IVF instead
# Lower memory usage
client.vector_ops.create_hnsw(
Model, index_name, column_name,
m=8, # Reduced from 16
ef_construction=200,
ef_search=50,
op_type="vector_l2_ops"
)
# Or consider IVF for large datasets
client.vector_ops.create_ivf(
table_name, index_name, column_name,
lists=100,
op_type="vector_l2_ops"
)
Performance Tips
1. Normalize Vectors for Cosine Similarity
def normalize_vector(vec):
"""Normalize vector for cosine similarity"""
norm = np.linalg.norm(vec)
return (np.array(vec) / norm).tolist() if norm > 0 else vec
# Use normalized vectors
query_normalized = normalize_vector(query_vector)
results = client.query(
Document.id,
Document.embedding.cosine_distance(query_normalized).label('distance')
).order_by('distance').limit(10).all()
2. Batch Process Queries
# Process multiple queries efficiently
query_vectors = [np.random.rand(64).tolist() for _ in range(100)]
all_results = []
start = time.time()
for qv in query_vectors:
results = client.query(
Document.id,
Document.embedding.l2_distance(qv).label('distance')
).order_by('distance').limit(10).all()
all_results.append(results)
total_time = time.time() - start
print(f"Processed {len(query_vectors)} queries in {total_time:.2f}s")
print(f"Average: {(total_time/len(query_vectors))*1000:.2f}ms per query")
3. Use Filters to Reduce Search Space
# Efficient: Filter by category first
results = client.query(
Document.id,
Document.embedding.l2_distance(query_vector).label('distance')
).filter(
Document.category == "Technology" # Indexed column - fast filter
).order_by('distance').limit(10).all()
Reference
- MatrixOne Python SDK Documentation
- GitHub - MatrixOne Python Client
- Vector Search Guide
- CREATE INDEX HNSW
- HNSW Algorithm Paper
Summary
HNSW vector indexing in MatrixOne provides:
✅ Superior Performance: Fastest vector search with >99% recall ✅ No Training Required: Direct index construction, no clustering ✅ Predictable Latency: Consistent query performance ✅ High Quality Results: Excellent accuracy for nearest neighbor search ⚠️ Read-Only (Current): Insert data before creating index ⚠️ BigInteger Primary Key: Required for HNSW to work 🔄 Future: Incremental Updates: Async update support coming soon!
Current Best Use Cases:
- Static datasets (catalogs, knowledge bases)
- Infrequently updated data (nightly/weekly refreshes)
- High-performance read-heavy workloads
After Incremental Update Release:
- Dynamic datasets with async updates
- Real-time data ingestion with background indexing
- Continuous data growth scenarios
Perfect for: Product catalogs, document search, image similarity, semantic search, and any high-recall vector search applications! 🚀
Development Roadmap: Once incremental update support is released, HNSW will combine the best of both worlds - superior search performance of HNSW with the flexibility of dynamic updates!