Building a Semantic Search Engine: From Embeddings to Production

The engineering story behind Queryra. How we built a vector search pipeline using sentence transformers, ChromaDB, and FastAPI that serves WooCommerce stores.

Rafal Gron
Founder, Queryra
January 28, 2026 · 12 min read

When I started building Queryra, the idea was simple: replace WooCommerce's broken LIKE query with something that actually understands what customers are searching for.

The reality was more complex. Building a semantic search engine that's fast enough for real-time e-commerce, accurate enough to beat keyword search, and affordable enough to offer at $9.99/month required solving a chain of engineering problems.

This article is the technical story of building Queryra — the architecture decisions, the trade-offs, and the lessons learned going from a Python prototype to a production service handling search for WooCommerce stores.

Why Not Just Use ChatGPT?

The obvious first approach: send each search query to the OpenAI API, include the product catalog as context, and let GPT find relevant matches.

I built this in a weekend. It worked. And it was completely impractical.

Speed: GPT-3.5 took 2-4 seconds per query. GPT-4 took 5-8 seconds. E-commerce search needs sub-500ms responses. Customers don't wait.

Cost: Every search = API call. With product context included, each query cost $0.01-0.05. A store with 500 searches/day would pay $150-750/month just in OpenAI fees — making a $9.99/month product impossible.

Reliability: Rate limits, API outages, and usage caps meant search could fail during peak traffic — exactly when it matters most.

Privacy: Sending every store's product catalog to OpenAI on every search raised GDPR concerns and customer trust issues.

The alternative: build a custom search pipeline using embeddings and vector similarity. More engineering work upfront, but faster, cheaper, and under our control.

Understanding Embeddings

The core of semantic search is vector embeddings — mathematical representations of text that capture meaning.

An embedding model reads text and outputs a vector (an array of numbers, typically 384-1536 dimensions). Texts with similar meanings produce vectors that are close together in this high-dimensional space.

```python
from sentence_transformers import SentenceTransformer

model = SentenceTransformer('all-MiniLM-L6-v2')

# These produce similar vectors (close in space)
v1 = model.encode("gift for dad who likes gardening")
v2 = model.encode("Professional Garden Tool Set")

# This produces a distant vector
v3 = model.encode("Bluetooth wireless earbuds")

# Cosine similarity
# v1 vs v2: ~0.82 (high - similar meaning)
# v1 vs v3: ~0.12 (low - unrelated)
```

The key insight: you generate product embeddings once (at index time) and query embeddings on each search. Finding matches is just finding the nearest vectors — which vector databases can do in milliseconds.
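To make "close in space" concrete, here is the cosine similarity computation those score comments refer to, as a minimal pure-Python sketch (the toy 3-dimensional vectors stand in for real 384-dimensional embeddings; in the actual pipeline the vector database does this math for us):

```python
import math

def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine of the angle between two vectors: near 1.0 = same direction, near 0 = unrelated."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

# Toy stand-ins for real embeddings
query = [0.9, 0.1, 0.0]         # "gift for dad who likes gardening"
garden_tools = [0.8, 0.2, 0.1]  # "Professional Garden Tool Set"
earbuds = [0.0, 0.1, 0.9]       # "Bluetooth wireless earbuds"

print(cosine_similarity(query, garden_tools))  # high: similar direction
print(cosine_similarity(query, earbuds))       # low: nearly orthogonal
```

Searching is then just computing this score between the query vector and every candidate and keeping the highest scorers; vector databases make that step fast at scale.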

Choosing the Right Model: Why MiniLM Over GPT

We evaluated several embedding models:

OpenAI text-embedding-ada-002 — 1536 dimensions, high quality. But requires API calls, costs per token, and introduces latency. Not suitable for a self-hosted pipeline.

all-MiniLM-L6-v2 — 384 dimensions, runs locally. From the sentence-transformers library (Hugging Face). Free, fast (5ms per encoding), and quality is excellent for product search.

all-mpnet-base-v2 — 768 dimensions, better quality than MiniLM but 3x slower. The quality difference was marginal for short e-commerce text.

We chose all-MiniLM-L6-v2 because:
- Runs entirely on our infrastructure (no API dependencies)
- 384 dimensions keeps vector storage efficient
- 5ms encoding time means real-time search is trivial
- Quality is excellent for product titles + descriptions (short text)
- Model is 80MB — small enough to deploy anywhere

The trade-off: lower dimensionality means slightly less nuance in very long, complex text. For e-commerce product search with 10-200 word descriptions, this doesn't matter.

Vector Storage: Why ChromaDB

Embeddings need a vector database — a system optimized for similarity search across millions of vectors.

We evaluated:

Pinecone — Managed, scalable, fast. But adds a cloud dependency, costs scale with vectors, and data leaves your infrastructure.

Weaviate — Open source, feature-rich. But complex to operate, heavy resource requirements.

FAISS (Facebook) — Blazing fast, battle-tested. But it's a library, not a database — no persistence, no CRUD operations, no metadata filtering.

ChromaDB — Open source, embeddable, Python-native. Simple API, built-in persistence, metadata filtering, and lightweight enough to run on a single server.

We chose ChromaDB because:
- Embeds directly in our FastAPI application (no separate service)
- Handles 100,000+ vectors with sub-100ms query times
- Metadata filtering lets us scope searches by user/store
- Simple Python API integrates cleanly with our pipeline
- Persistent storage with SQLite backend

The trade-off: ChromaDB is newer and less proven at massive scale (millions of vectors). For our current user base, it's more than sufficient. If we hit scale limits, migrating to FAISS or Pinecone is straightforward since the embedding format is standard.

The Production Architecture

Queryra's backend is a FastAPI application with three main components:

1. Indexing Pipeline
When a WooCommerce store syncs products:
- Plugin sends product data via HTTPS POST
- FastAPI receives and validates the data
- SentenceTransformer generates embeddings for each product
- Embeddings + metadata are stored in ChromaDB, scoped by user/store ID
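A detail the steps above gloss over is what text actually gets embedded per product. A reasonable approach is to collapse the fields into one blob before encoding; the sketch below shows the idea, but the field names are illustrative, not the actual plugin schema.

```python
def product_to_document(product: dict) -> str:
    """Collapse a product payload into one text blob for embedding.
    Field names here are hypothetical, not the real sync format."""
    parts = [
        product.get("title", ""),
        product.get("description", ""),
        " ".join(product.get("categories", [])),
    ]
    return " ".join(p.strip() for p in parts if p and p.strip())

doc = product_to_document({
    "title": "Professional Garden Tool Set",
    "description": "Stainless steel trowel, pruner, and rake.",
    "categories": ["Garden", "Gifts"],
})
print(doc)
```

Including categories alongside the title and description gives the embedding more semantic signal, which helps queries like "gift for dad who likes gardening" land on the right products.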

2. Search Pipeline
When a customer searches:
- Plugin sends query via HTTPS GET
- SentenceTransformer generates query embedding (5ms)
- ChromaDB performs cosine similarity search across the store's products (10-50ms)
- Results are filtered by min_score threshold
- Ranked results with scores are returned as JSON

3. User Management
PostgreSQL handles user accounts, API keys, usage tracking, and plan limits.

Infrastructure:
- AWS Lightsail (single instance — keeping it simple)
- Nginx as reverse proxy with SSL
- Docker for deployment consistency
- Let's Encrypt for HTTPS

Total monthly infrastructure cost: under $40. This is why we can offer semantic search at $9.99/month — the AI runs on our own hardware with zero per-query costs.

Performance: Getting to Sub-100ms

E-commerce search needs to be fast. Our target: under 100ms total response time.

Here's the breakdown of a typical search request:

```
Network (client → server): ~20ms
Query embedding (MiniLM): ~5ms
Vector search (ChromaDB): ~15ms
Result serialization: ~2ms
Network (server → client): ~20ms
────────────────────────────────────
Total: ~62ms
```

The bottleneck is network latency, not computation. The actual AI work (embedding + vector search) takes about 20ms.

Optimizations that mattered:
- Model warm-up: Load MiniLM once at startup, keep in memory. Cold-start encoding takes 200ms; warm takes 5ms.
- ChromaDB in-process: No network hop to a separate database. ChromaDB runs in the same Python process.
- Connection pooling: Reuse HTTPS connections from the WordPress plugin.
- Result caching: Popular queries are cached for 5 minutes.
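The result-caching idea can be sketched as a minimal time-to-live cache. Our production cache differs in the details, but the mechanism is the same: store results keyed by (store, query), expire them after five minutes.

```python
import time

class TTLCache:
    """Minimal time-based cache for search results (a sketch, not the production code)."""

    def __init__(self, ttl_seconds: float = 300):  # 5 minutes, matching the text above
        self.ttl = ttl_seconds
        self._entries = {}  # key -> (expires_at, value)

    def get(self, key):
        entry = self._entries.get(key)
        if entry is None:
            return None
        expires_at, value = entry
        if time.monotonic() > expires_at:
            del self._entries[key]  # stale: evict and report a miss
            return None
        return value

    def set(self, key, value) -> None:
        self._entries[key] = (time.monotonic() + self.ttl, value)

cache = TTLCache(ttl_seconds=300)
cache.set(("shop-a", "garden gift"), ["p1", "p7"])
print(cache.get(("shop-a", "garden gift")))  # cache hit until the TTL expires
```

For popular queries this skips the embedding and vector-search steps entirely, so repeat searches cost almost nothing.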

Lessons Learned

1. Embeddings beat GPT for search. You don't need a massive general-purpose language model for search. An 80MB sentence transformer plus a vector database produces better, faster, cheaper results for the specific task of matching queries to products.

2. Simple infrastructure wins. A single server running FastAPI + ChromaDB handles thousands of searches per minute. Don't over-architect early. Scale when you need to, not before.

3. Users hate API keys. The single biggest decision that affects adoption: requiring external API keys kills activation rates. One popular WordPress AI search plugin has 27,000 downloads and 90 active installs. The OpenAI key requirement is the primary drop-off point.

4. Product text is short. E-commerce product titles and descriptions are typically 10-200 words. This makes smaller embedding models (384 dimensions) perfectly adequate. You don't need 1536-dimension models designed for academic papers.

5. Free tier is essential. WordPress users expect to try before paying. A generous free tier (100 records, no credit card) gets people past the activation barrier. Conversion to paid happens when they see the value on their own store.

6. Relevance scoring needs tuning. Raw cosine similarity scores are useful but need a threshold to filter noise. Our default 50% min_score (relative to the top match) works well for most stores. The new Search Settings feature lets users adjust this.
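The relative-threshold rule is simple enough to sketch directly: compute a cutoff as a fraction of the top match's score and drop everything below it. A minimal version, assuming results arrive as (id, similarity) pairs sorted best-first:

```python
def filter_by_relative_score(results, min_score: float = 0.5):
    """Keep results scoring at least `min_score` of the top match.
    `results` is a list of (product_id, similarity) pairs, sorted best-first."""
    if not results:
        return []
    top_score = results[0][1]
    cutoff = top_score * min_score
    return [(pid, score) for pid, score in results if score >= cutoff]

ranked = [("p1", 0.82), ("p2", 0.55), ("p3", 0.30)]
print(filter_by_relative_score(ranked))  # p3 drops: 0.30 is below 50% of 0.82
```

Making the cutoff relative to the top match, rather than an absolute similarity value, keeps the behavior consistent across stores whose catalogs produce very different score ranges.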

What We're Building Next

The current architecture works well for single-store semantic search. Our roadmap includes:

Search analytics — What are customers searching for? Which queries return zero results? This data helps store owners optimize their catalogs.

Hybrid search — Combining semantic matching with keyword matching for the best of both worlds. Exact SKU lookups + natural language discovery in one search.

Multi-language support — The MiniLM model supports 50+ languages, but our pipeline needs localization work for proper multi-language indexing and search.

Larger model option — For stores that need maximum accuracy, offer an option to use a larger embedding model (768 or 1536 dimensions) with slightly higher latency.

If you're building something similar or have questions about the architecture, feel free to reach out at contact@queryra.com or check out the open source plugin on GitHub.


Frequently Asked Questions

What embedding model does Queryra use?

Queryra uses all-MiniLM-L6-v2 from the sentence-transformers library. It produces 384-dimensional embeddings, runs locally (no API calls), and encodes text in about 5 milliseconds. The 80MB model is optimized for short text like product titles and descriptions.

Why not use OpenAI embeddings?

OpenAI embeddings require API calls (latency + cost), send your data to external servers (privacy), and charge per token (unpredictable costs). Running our own model eliminates all three problems, which is why Queryra has no per-query costs.

What is ChromaDB?

ChromaDB is an open-source vector database designed for AI applications. It stores embeddings and performs similarity searches efficiently. Queryra uses it to find products whose meaning is closest to a search query, returning results in under 50 milliseconds.

How fast is Queryra's search?

Average total response time is under 100ms. The AI computation (embedding generation + vector search) takes about 20ms. The rest is network latency between the WordPress site and Queryra's servers.

Can this architecture handle large stores?

The current setup handles stores with up to 20,000+ products efficiently. ChromaDB queries 100,000+ vectors in under 50ms. For much larger catalogs, we can scale to FAISS or Pinecone with minimal changes since the embedding format is standardized.
