Knowledge Base Architecture
FAISS-based semantic search system over 11GB of indexed expert knowledge.
Overview
Nova AI's knowledge base provides:
- 11GB indexed content - 37 domain areas
- FAISS vector search - Millisecond-scale retrieval (8ms average)
- Semantic similarity - Natural language queries
- Metadata filtering - Domain, file type, date
- SDK MCP integration - 106x faster than stdio
graph LR
A[Query] --> B[Embedding]
B --> C[FAISS Index]
C --> D[Top-K Results]
D --> E[Metadata Filter]
E --> F[Ranked Results]
style C fill:#3f51b5,color:#fff
Directory Structure
kb_store/faiss_metadata/
├── manifest.json # KB metadata (backend, dimensions, file count)
├── metadata.parquet # Chunk metadata (file, text, domain, score)
└── faiss_ivfpq.index # FAISS index (11GB, IVF-PQ compressed)
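As a rough sketch of how a loader might sanity-check this layout before opening the index, the snippet below verifies that the three files from the listing exist and then parses the manifest. The `load_manifest` helper is a hypothetical name for illustration, not part of Nova AI's API, and the manifest's actual key names are not documented here.

```python
import json
from pathlib import Path

# File names taken from the directory listing above.
REQUIRED_FILES = ("manifest.json", "metadata.parquet", "faiss_ivfpq.index")


def load_manifest(kb_dir):
    """Check that the KB directory is complete, then return the parsed manifest.

    Hypothetical helper for illustration only.
    """
    kb_path = Path(kb_dir)
    missing = [name for name in REQUIRED_FILES if not (kb_path / name).is_file()]
    if missing:
        raise FileNotFoundError(
            f"KB directory {kb_path} is missing: {', '.join(missing)}"
        )
    with open(kb_path / "manifest.json", encoding="utf-8") as fh:
        return json.load(fh)
```

Calling `load_manifest("kb_store/faiss_metadata")` would then return the manifest as a dictionary, or fail early with a message naming the missing files.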
Architecture
FAISS Index
Type: IVF-PQ (Inverted File with Product Quantization)
Specifications:
- Dimension: 1536 (OpenAI text-embedding-ada-002 embeddings)
- Index size: ~11GB compressed
- Chunks: ~2.5 million
- Search time: 8ms average
- Accuracy: 95%+ recall@10
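The recall@10 figure compares what the approximate IVF-PQ search returns against the exact nearest neighbors for the same query. The sketch below uses one common definition of the metric (the overlap between the approximate and exact top-k), assuming both ID lists are already available. `recall_at_k` is a hypothetical helper name, not a FAISS function, and the IDs are made up.

```python
def recall_at_k(approx_ids, exact_ids, k=10):
    """Fraction of the true top-k neighbors that the approximate search returned."""
    if k <= 0:
        raise ValueError("k must be positive")
    truth = set(exact_ids[:k])
    if not truth:
        return 0.0
    found = truth.intersection(approx_ids[:k])
    return len(found) / len(truth)


# Example: the approximate search misses one of the ten true neighbors.
exact = [3, 7, 11, 19, 23, 42, 57, 64, 71, 88]
approx = [3, 7, 11, 19, 23, 42, 57, 64, 71, 90]
print(recall_at_k(approx, exact))  # 0.9
```

Averaging this value over a set of representative queries gives a single number comparable to the 95%+ target listed above.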
Metadata Store
Format: Apache Parquet (columnar storage)
Schema:
{
"chunk_id": int,
"file": str,
"text": str,
"domain": str,
"file_extension": str,
"created_at": datetime,
"score": float
}
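One way to mirror this schema in application code is a small typed record, so that rows read from the Parquet file are checked before use. The `ChunkMetadata` class and its `from_row` method are illustrative names, not part of the documented system.

```python
from dataclasses import dataclass, fields
from datetime import datetime


@dataclass(frozen=True)
class ChunkMetadata:
    """One row of metadata.parquet, following the schema above."""

    chunk_id: int
    file: str
    text: str
    domain: str
    file_extension: str
    created_at: datetime
    score: float

    @classmethod
    def from_row(cls, row):
        """Build a record from a dict-like row, rejecting missing or mistyped fields."""
        values = {}
        for field in fields(cls):
            if field.name not in row:
                raise KeyError(f"missing field: {field.name}")
            value = row[field.name]
            # Accept integers where a float is expected (e.g. a score of 1).
            if field.type is float and isinstance(value, int):
                value = float(value)
            if not isinstance(value, field.type):
                raise TypeError(
                    f"{field.name} should be {field.type.__name__}, "
                    f"got {type(value).__name__}"
                )
            values[field.name] = value
        return cls(**values)
```

A frozen dataclass is used here so that a record cannot be modified by accident after it has been validated.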
Query Process
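import numpy as np

# 1. User query
query = "JWT authentication best practices"
# 2. Generate embedding (1536-d vector); FAISS expects a float32 array of shape (1, d)
embedding = np.asarray(embed(query), dtype="float32").reshape(1, -1)
# 3. FAISS similarity search; search() returns (distances, indices), in that order
distances, indices = index.search(embedding, 10)
# 4. Fetch metadata for the hits (row 0 is our single query; -1 marks an empty slot)
#    Assumes the metadata DataFrame is indexed by the same IDs the FAISS index returns
hits = indices[0][indices[0] >= 0]
results = metadata.loc[hits]
# 5. Apply filters
filtered = results[results['domain'] == 'security']
# 6. Rank by score and keep the top 5
top_results = filtered.sort_values('score', ascending=False).head(5)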
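To experiment with the same flow without the real index or the embedding API, the steps can be mimicked on toy data. The sketch below swaps FAISS for an exact brute-force cosine search and uses hand-written 3-d vectors in place of 1536-d embeddings. All names and data in it are illustrative assumptions.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def search(query_vec, chunks, top_k=10, domain=None, limit=5):
    """Take the top-k by similarity first, then apply the metadata filter."""
    ranked = sorted(
        chunks, key=lambda c: cosine(query_vec, c["vector"]), reverse=True
    )
    top = ranked[:top_k]
    if domain is not None:
        top = [c for c in top if c["domain"] == domain]
    return top[:limit]


# Toy corpus: made-up chunks with 3-d stand-in embeddings.
chunks = [
    {"chunk_id": 0, "domain": "security", "vector": [0.9, 0.1, 0.0]},
    {"chunk_id": 1, "domain": "caching", "vector": [0.8, 0.2, 0.1]},
    {"chunk_id": 2, "domain": "security", "vector": [0.1, 0.9, 0.2]},
    {"chunk_id": 3, "domain": "logging", "vector": [0.0, 0.1, 0.9]},
]

hits = search([1.0, 0.0, 0.0], chunks, top_k=3, domain="security")
print([c["chunk_id"] for c in hits])  # [0, 2]
```

Because the filter runs after the top-k cut, a narrow filter can leave fewer results than `limit`. Requesting a larger `top_k` is the usual way to compensate.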
Performance
| Operation | Time | Details |
|---|---|---|
| Embedding generation | 50ms | OpenAI API |
| FAISS search | 8ms | IVF-PQ index |
| Metadata fetch | 2ms | Parquet read |
| Total | 60ms | End-to-end |
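A per-stage breakdown like the one in the table can be collected with a small timing helper. This is a minimal sketch, assuming wall-clock timing is sufficient. The `timed` context manager is a hypothetical name, and the `time.sleep` calls are placeholders for the real pipeline stages.

```python
import time
from contextlib import contextmanager

timings_ms = {}


@contextmanager
def timed(stage):
    """Record the wall-clock duration of a pipeline stage in milliseconds."""
    start = time.perf_counter()
    try:
        yield
    finally:
        timings_ms[stage] = (time.perf_counter() - start) * 1000.0


with timed("embedding"):
    time.sleep(0.005)  # placeholder for the embedding API call
with timed("faiss_search"):
    time.sleep(0.001)  # placeholder for index.search
with timed("metadata_fetch"):
    time.sleep(0.001)  # placeholder for the Parquet lookup

# End-to-end time is the sum of the stages measured above.
timings_ms["total"] = sum(timings_ms.values())
for stage, ms in timings_ms.items():
    print(f"{stage}: {ms:.1f}ms")
```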
Domains
37 indexed domains:
- Architecture, Security, Testing
- API design, Database, Caching
- Authentication, Authorization
- Error handling, Logging
- Performance, Scalability
- And 25 more...
Best Practices
- Use semantic queries - Natural language works best
- Filter by domain - Narrow search scope
- Check scores - Set min_score to 0.8 or higher
- Cache results - For frequently used queries
- Validate security - KB directory must be trusted
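The score threshold and caching advice can be combined in a thin wrapper around the search call. The sketch below is one possible arrangement. `fake_search` stands in for the real KB search and returns made-up results, while `filter_by_score` and `cached_search` are hypothetical helper names.

```python
from functools import lru_cache


def filter_by_score(results, min_score=0.8):
    """Keep only results whose score meets the threshold, best first."""
    kept = [r for r in results if r["score"] >= min_score]
    return sorted(kept, key=lambda r: r["score"], reverse=True)


def fake_search(query, domain):
    """Stand-in for the real KB search; returns canned, made-up results."""
    return [
        {"file": "auth/jwt.md", "domain": domain, "score": 0.91},
        {"file": "auth/sessions.md", "domain": domain, "score": 0.74},
    ]


@lru_cache(maxsize=256)
def cached_search(query, domain, min_score=0.8):
    """Cache by argument values; a tuple is returned so cached data stays immutable."""
    results = filter_by_score(fake_search(query, domain), min_score)
    return tuple(r["file"] for r in results)


print(cached_search("JWT authentication best practices", "security"))  # ('auth/jwt.md',)
print(cached_search.cache_info().hits)  # 0
cached_search("JWT authentication best practices", "security")
print(cached_search.cache_info().hits)  # 1
```

An in-process cache like this should be cleared with `cached_search.cache_clear()` whenever the index is rebuilt. Otherwise stale results would continue to be served.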