Roadmap: Making SEDB a complete smart embeddings database #1

Closed
opened 2026-03-22 22:37:22 +01:00 by catboi · 3 comments
Collaborator

Roadmap

1. Persistence & Durability

  • [x] Add SQL backing store for metadata (SQLite)
  • [x] Store vectors separately from metadata
  • [x] Transaction support, WAL mode for crash recovery
  • [x] Incremental saves instead of full export
  • [x] WAL checkpointing

2. Query Language

  • [x] Structured query DSL
  • [x] Filter operators: =, !=, in(), contains(), between()
  • [x] Aggregations: count, group_by, stats
  • [x] Time-range queries (QueryBuilder.timeRange())
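
A minimal sketch of how the filter operators listed above could compose, assuming rows are plain metadata maps and timestamps are stored as epoch milliseconds. `Filters` and every method on it are hypothetical names, not SEDB's actual QueryDsl or QueryBuilder API. The thread does not show SEDB's source; Java is used here only because the class names mentioned in the reviews suggest a JVM project.

```java
import java.time.Instant;
import java.util.Arrays;
import java.util.Collection;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Objects;
import java.util.Set;
import java.util.function.Predicate;
import java.util.stream.Collectors;

/** Hypothetical filter helpers; illustrative only, not SEDB's API. */
final class Filters {
    private Filters() {}

    /** field = value */
    static Predicate<Map<String, Object>> eq(String field, Object value) {
        return row -> Objects.equals(row.get(field), value);
    }

    /** field != value */
    static Predicate<Map<String, Object>> ne(String field, Object value) {
        return eq(field, value).negate();
    }

    /** field is one of the given values */
    static Predicate<Map<String, Object>> in(String field, Object... values) {
        Set<Object> allowed = new HashSet<>(Arrays.asList(values));
        return row -> row.containsKey(field) && allowed.contains(row.get(field));
    }

    /** string field contains the given substring */
    static Predicate<Map<String, Object>> contains(String field, String needle) {
        return row -> {
            Object v = row.get(field);
            return v instanceof String && ((String) v).contains(needle);
        };
    }

    /** numeric field lies in the inclusive range [lo, hi] */
    static Predicate<Map<String, Object>> between(String field, double lo, double hi) {
        return row -> {
            Object v = row.get(field);
            if (!(v instanceof Number)) return false;
            double d = ((Number) v).doubleValue();
            return d >= lo && d <= hi;
        };
    }

    /** epoch-millisecond field lies in the half-open range [from, to) */
    static Predicate<Map<String, Object>> timeRange(String field, Instant from, Instant to) {
        return row -> {
            Object v = row.get(field);
            if (!(v instanceof Long)) return false;
            long t = (Long) v;
            return t >= from.toEpochMilli() && t < to.toEpochMilli();
        };
    }

    /** Applies a filter to a collection of rows, preserving iteration order. */
    static List<Map<String, Object>> select(Collection<Map<String, Object>> rows,
                                            Predicate<Map<String, Object>> filter) {
        return rows.stream().filter(filter).collect(Collectors.toList());
    }
}
```

Because each operator returns a `Predicate`, conditions combine with the standard `and`, `or`, and `negate` methods, for example `Filters.eq("lang", "en").and(Filters.between("score", 0.5, 1.0))`.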

3. Collections/Namespace

  • [x] Multiple named collections
  • [x] Per-collection index config
  • [x] Collection-level stats

4. Ingestion Pipeline

  • [x] Batch insert with progress callbacks
  • [x] ID deduplication (upsert semantics)
  • [x] Parallel embedding computation
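
The three ingestion items above can be sketched together as follows. This is an illustration under assumptions, not SEDB's BatchIngestor: the class `BatchIngestSketch`, the `Doc` type, the in-memory map standing in for the store, and the `(done, total)` callback shape are all hypothetical.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.function.BiConsumer;
import java.util.function.Function;

/** Hypothetical batch ingestor; names and signatures are illustrative only. */
final class BatchIngestSketch {
    /** A document keyed by a caller-supplied ID. */
    static final class Doc {
        final String id;
        final String text;
        Doc(String id, String text) { this.id = id; this.text = text; }
    }

    // In-memory stand-in for the real vector store.
    private final Map<String, float[]> vectors = new ConcurrentHashMap<>();
    private final Function<String, float[]> embedder;
    private final int threads;

    BatchIngestSketch(Function<String, float[]> embedder, int threads) {
        this.embedder = embedder;
        this.threads = threads;
    }

    /** Embeds docs in parallel, upserts by ID in input order, reports (done, total) per doc. */
    void ingest(List<Doc> docs, BiConsumer<Integer, Integer> onProgress) {
        ExecutorService pool = Executors.newFixedThreadPool(threads);
        try {
            // Submit every embedding job up front so they run concurrently.
            List<Future<float[]>> pending = new ArrayList<>();
            for (Doc d : docs) {
                pending.add(pool.submit(() -> embedder.apply(d.text)));
            }
            for (int i = 0; i < docs.size(); i++) {
                // put() replaces any earlier vector with the same ID (upsert semantics).
                vectors.put(docs.get(i).id, pending.get(i).get());
                onProgress.accept(i + 1, docs.size());
            }
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException("ingest interrupted", e);
        } catch (ExecutionException e) {
            throw new IllegalStateException("embedding failed", e.getCause());
        } finally {
            pool.shutdown();
        }
    }

    int size() { return vectors.size(); }

    float[] get(String id) { return vectors.get(id); }
}
```

Writing results back in input order keeps the upsert deterministic: when a batch contains the same ID twice, the later document wins regardless of which embedding finished first.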

5. RAG Features

  • [x] Document chunking strategies
  • [x] Retrieval with optional reranking
  • [x] Hybrid search (vector + BM25)
  • [x] BM25 stopword filtering
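
For the keyword half of hybrid search, the sketch below shows BM25 scoring with stopwords removed at tokenization time. It is not SEDB's implementation: `Bm25Sketch`, the tokenizer, and the parameter values `K1 = 1.2` and `B = 0.75` (common defaults) are assumptions, and the IDF uses the non-negative `ln(1 + (N - n + 0.5) / (n + 0.5))` variant. Fusing these scores with vector similarity is left out.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.HashMap;
import java.util.HashSet;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Locale;
import java.util.Map;
import java.util.Set;

/** Hypothetical BM25 scorer with stopword filtering; illustrative only. */
final class Bm25Sketch {
    private static final double K1 = 1.2;   // term-frequency saturation (assumed default)
    private static final double B = 0.75;   // length normalization strength (assumed default)

    private final Set<String> stopwords;
    private final List<List<String>> docs = new ArrayList<>();
    private final Map<String, Integer> docFreq = new HashMap<>();
    private long totalTokens = 0;

    Bm25Sketch(Set<String> stopwords) {
        this.stopwords = stopwords;
    }

    /** Lowercases, splits on non-alphanumerics, and drops stopwords. */
    List<String> tokenize(String text) {
        List<String> out = new ArrayList<>();
        for (String t : text.toLowerCase(Locale.ROOT).split("[^\\p{L}\\p{N}]+")) {
            if (!t.isEmpty() && !stopwords.contains(t)) out.add(t);
        }
        return out;
    }

    /** Indexes one document and updates document frequencies. */
    void add(String text) {
        List<String> tokens = tokenize(text);
        docs.add(tokens);
        totalTokens += tokens.size();
        for (String t : new HashSet<>(tokens)) docFreq.merge(t, 1, Integer::sum);
    }

    /** BM25 score of the document at docIndex for the given query. */
    double score(String query, int docIndex) {
        List<String> doc = docs.get(docIndex);
        double avgLen = (double) totalTokens / docs.size();
        double score = 0.0;
        for (String term : new LinkedHashSet<>(tokenize(query))) {
            int n = docFreq.getOrDefault(term, 0);
            if (n == 0) continue;                       // term appears in no document
            int tf = Collections.frequency(doc, term);
            double idf = Math.log(1 + (docs.size() - n + 0.5) / (n + 0.5));
            double norm = tf + K1 * (1 - B + B * doc.size() / avgLen);
            score += idf * tf * (K1 + 1) / norm;
        }
        return score;
    }
}
```

Filtering stopwords in the tokenizer means they affect neither document length nor query matching, which is why a query made only of stopwords scores zero everywhere.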

6. Observability

  • [x] Query logging
  • [x] Slow query detection
  • [x] Index stats
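
Query logging and slow query detection can share one timing wrapper. The sketch below is a hypothetical illustration: `SlowQueryLogSketch`, its threshold parameter, and the log line format are assumptions rather than SEDB's actual logging.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.Supplier;

/** Hypothetical slow-query recorder; illustrative only. */
final class SlowQueryLogSketch {
    private final long thresholdNanos;
    private final List<String> slow = new ArrayList<>();

    SlowQueryLogSketch(long thresholdMillis) {
        this.thresholdNanos = thresholdMillis * 1_000_000L;
    }

    /** Runs the query, and records it if it took at least the threshold. */
    <T> T timed(String label, Supplier<T> query) {
        long start = System.nanoTime();
        try {
            return query.get();
        } finally {
            long elapsed = System.nanoTime() - start;
            if (elapsed >= thresholdNanos) {
                synchronized (slow) {
                    slow.add(label + " took " + (elapsed / 1_000_000L) + " ms");
                }
            }
        }
    }

    /** Returns a snapshot of the recorded slow queries. */
    List<String> slowQueries() {
        synchronized (slow) {
            return new ArrayList<>(slow);
        }
    }
}
```

The `finally` block ensures that queries which throw are still timed and recorded.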

7. Tech Debt

  • [x] PQ index needs more testing
  • [x] Query caching
  • [x] Connection pooling for Ollama

Note: SEDB is a generic vector DB. Platform-specific adapters belong in separate projects.

Owner

@catboi A new commit has been pushed. Bluesky-specific features are not included, since the focus is a general-purpose database.
Author
Collaborator

Code review of ef2a304 - excellent progress!

What's Implemented

  • SqliteStore with proper schema for vectors + metadata
  • QueryCache with TTL and size limits
  • OllamaConnectionPool for connection reuse
  • BatchIngestor with upsert semantics
  • HybridSearch combining vector + BM25
  • Collection/CollectionManager for namespacing
  • Aggregation support (count, groupBy, stats)
  • QueryDsl for structured queries
  • IndexStats for observability
  • DocumentChunker for RAG

Looks Good

  • Parallel embedding computation in BatchIngestor
  • WAL mode in SqliteStore
  • BM25 implementation for keyword search
  • Query logging with timing

Remaining Items

  1. Transaction support - SqliteStore uses autocommit, no explicit transactions
  2. Time-range queries - not in QueryDsl yet
  3. Archiver integration - BlueskySource not started
  4. PQ index testing - still needs coverage
  5. WAL checkpointing - SqliteStore has journal_mode=WAL but no explicit checkpoint

Minor Notes

  • QueryCache uses synchronizedMap, which serializes every access on a single lock; consider ConcurrentHashMap
  • BM25 could use a stopword list for better precision
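
As one way to act on the first note, here is a sketch of a TTL cache with a size cap built on `ConcurrentHashMap`. It is not SEDB's QueryCache: `TtlCacheSketch`, the injectable clock, and the crude eviction policy (expired entries first, then arbitrary ones) are assumptions. If the existing cache depends on LRU ordering, this would not be a drop-in replacement.

```java
import java.util.concurrent.ConcurrentHashMap;
import java.util.function.Function;
import java.util.function.LongSupplier;

/** Hypothetical TTL cache on ConcurrentHashMap; illustrative only. */
final class TtlCacheSketch<K, V> {
    private static final class Entry<V> {
        final V value;
        final long expiresAt;
        Entry(V value, long expiresAt) { this.value = value; this.expiresAt = expiresAt; }
    }

    private final ConcurrentHashMap<K, Entry<V>> map = new ConcurrentHashMap<>();
    private final long ttlMillis;
    private final int maxSize;
    private final LongSupplier clock;   // e.g. System::currentTimeMillis

    TtlCacheSketch(long ttlMillis, int maxSize, LongSupplier clock) {
        this.ttlMillis = ttlMillis;
        this.maxSize = maxSize;
        this.clock = clock;
    }

    /** Returns the cached value, or loads and caches it if missing or expired. */
    V get(K key, Function<K, V> loader) {
        long now = clock.getAsLong();
        // compute() is atomic per key, so concurrent callers load a key at most once.
        // The loader must not call back into this cache.
        Entry<V> e = map.compute(key, (k, old) ->
                (old != null && old.expiresAt > now)
                        ? old
                        : new Entry<>(loader.apply(k), now + ttlMillis));
        if (map.size() > maxSize) {
            map.values().removeIf(x -> x.expiresAt <= now);   // drop expired entries first
            for (K k : map.keySet()) {                        // then evict arbitrary others
                if (map.size() <= maxSize) break;
                if (!k.equals(key)) map.remove(k);
            }
        }
        return e.value;
    }

    int size() { return map.size(); }
}
```

Reads of different keys no longer contend on one monitor, which is the main difference from wrapping a map with `Collections.synchronizedMap`.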

Overall: ~70% of roadmap done. Archiver integration is the main remaining piece.

Author
Collaborator

Reviewed a79f5df - nice additions:

  • BM25 stopword removal now configurable
  • SqliteStore.transaction() now public with rollback support
  • WAL checkpoint functionality added
  • ProductQuantizer and SqliteStore test coverage expanded

Two roadmap items now complete: transaction support and PQ index testing.
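
For readers unfamiliar with the pattern, a transaction wrapper with rollback over plain JDBC might look like the sketch below. The signature of `SqliteStore.transaction()` is not shown in this thread, so `TxSketch`, `SqlWork`, and `inTransaction` are hypothetical names; only the `java.sql.Connection` calls are standard.

```java
import java.sql.Connection;
import java.sql.SQLException;

/** Hypothetical transaction helper over JDBC; illustrative only. */
final class TxSketch {
    private TxSketch() {}

    /** A unit of work that runs against an open connection. */
    @FunctionalInterface
    interface SqlWork<T> {
        T run(Connection conn) throws SQLException;
    }

    /** Runs work in one transaction: commit on success, rollback on failure. */
    static <T> T inTransaction(Connection conn, SqlWork<T> work) throws SQLException {
        boolean previous = conn.getAutoCommit();
        conn.setAutoCommit(false);          // leave autocommit mode for the duration
        try {
            T result = work.run(conn);
            conn.commit();
            return result;
        } catch (SQLException | RuntimeException e) {
            conn.rollback();                // undo partial writes before rethrowing
            throw e;
        } finally {
            conn.setAutoCommit(previous);   // restore the caller's setting
        }
    }
}
```

On the checkpoint side, SQLite exposes explicit checkpoints through `PRAGMA wal_checkpoint`, for example `PRAGMA wal_checkpoint(TRUNCATE);`, which can be executed as an ordinary statement outside a transaction.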

Reference
LeNooby09/SEDB#1