Semantic Search System
Why Semantic Search?
Traditional keyword-based search matches exact words or n-grams between the query and documents. While effective for many use cases, this approach has fundamental limitations when searching conceptual or knowledge-based content:
Limitations of Keyword Search
Vocabulary Mismatch
Users may search for "car" while relevant articles use "automobile" or "vehicle." Keyword search would miss these results despite semantic equivalence. This problem, known as the vocabulary mismatch problem, is pervasive in information retrieval.
Conceptual Queries
Queries like "articles about economic inequality" or "information on renewable energy adoption" express conceptual interests that may not have exact keyword matches. Relevant articles might discuss "income disparity" or "solar panel deployment" without using the exact query terms.
Polysemy and Context
Words have multiple meanings depending on context. A keyword search for "bank" cannot distinguish between financial institutions and river banks. Semantic search understands contextual meaning through learned representations.
Advantages of Semantic Approach
Semantic search addresses these limitations by operating in embedding space rather than lexical space:
- Synonym handling: Semantically equivalent terms have similar embeddings, so "car" and "automobile" are recognized as related concepts.
- Conceptual matching: Articles matching the query's intent are retrieved even without keyword overlap.
- Context sensitivity: Embeddings capture word meaning in context, disambiguating polysemous terms.
- Exploratory discovery: Users can find related content they wouldn't have known to search for using keywords alone.
How Semantic Search Works
Query Embedding
When a user submits a search query, it undergoes the same embedding process as article content:
query_embedding = Embed(query_text)
where:
query_embedding ∈ ℝ^1536
Embed() = text-embedding-ada-002
The embedding model transforms the query into the same 1536-dimensional vector space used for article embeddings. Critically, this means queries and articles inhabit the same semantic space, enabling meaningful similarity comparisons.
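As a sketch, the query-embedding step might look like the following, assuming the official OpenAI Python client; the `embed_query` helper and the injected `client` are illustrative, not the system's actual code:

```python
def embed_query(query_text, client):
    """Embed a search query into the same 1536-dimensional space as articles.

    `client` is an OpenAI client instance (e.g. openai.OpenAI()); injecting it
    keeps the helper easy to exercise with a stub in tests.
    """
    response = client.embeddings.create(
        model="text-embedding-ada-002",
        input=query_text,
    )
    # The API returns one embedding per input; a single query yields data[0].
    return response.data[0].embedding  # list of 1536 floats
```

Because the same model embeds both queries and articles, the returned vector can be compared directly against stored article embeddings.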
Similarity Computation
To find relevant articles, the system computes the cosine similarity between the query embedding and each article embedding:
similarity(query, article) = cos_sim(q, a)
= q · a / (||q|| * ||a||)
Since the embeddings are unit-normalized (||q|| = ||a|| = 1), this simplifies to:
= q · a
Range: [-1, 1]
- 1: identical meaning
- 0: orthogonal (unrelated)
- -1: opposite meaning (rare)
Cosine similarity measures the angle between vectors, with smaller angles (higher cosine values) indicating greater semantic similarity. For unit-normalized embeddings it yields the same ranking as Euclidean distance, since ||q − a||² = 2(1 − q · a); cosine similarity is preferred because its fixed [-1, 1] range is directly interpretable as a similarity score, independent of vector magnitude.
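The computation above can be sketched in a few lines of plain Python (the `cos_sim` helper name mirrors the formula; a production system would use numpy or push this into the database):

```python
import math

def cos_sim(q, a):
    """Cosine similarity: dot product divided by the product of the norms."""
    dot = sum(x * y for x, y in zip(q, a))
    norm_q = math.sqrt(sum(x * x for x in q))
    norm_a = math.sqrt(sum(x * x for x in a))
    return dot / (norm_q * norm_a)

# For unit-normalized vectors both norms are 1, so cos_sim reduces to the dot product.
```

For example, identical unit vectors score 1.0, orthogonal vectors score 0.0, and opposite vectors score -1.0, matching the range described above.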
Vector Database Optimization
Computing similarity against all articles would be computationally expensive for large corpora. The system uses pgvector, a PostgreSQL extension providing efficient vector similarity search:
Approximate Nearest Neighbor (ANN) Search
Instead of exhaustively computing all similarities, pgvector uses indexing structures (IVFFlat or HNSW) to quickly identify approximate nearest neighbors. This trades a small amount of accuracy for dramatic speed improvements.
SQL query structure:
SELECT id, title, embedding <=> query_vector AS distance
FROM articles
ORDER BY embedding <=> query_vector
LIMIT 20;
The <=> operator computes cosine distance (1 − cosine similarity), so ordering by ascending distance returns the most similar articles first; with an appropriate index in place, pgvector answers this query via ANN search rather than a full scan.
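For intuition, here is what that query computes, written as an exact brute-force scan in Python; pgvector's index answers the same question approximately but far faster. The `articles` structure and helper names are illustrative, not the system's schema:

```python
import heapq

def cosine_distance(q, a):
    """pgvector's <=> metric: 1 - cosine similarity (vectors assumed unit-normalized)."""
    return 1.0 - sum(x * y for x, y in zip(q, a))

def top_k_exact(query_vec, articles, k=20):
    """articles: iterable of (id, title, embedding).

    Returns the k nearest records as (distance, id, title) tuples,
    ordered by ascending cosine distance -- the exact answer that
    IVFFlat/HNSW indexes approximate.
    """
    return heapq.nsmallest(
        k,
        ((cosine_distance(query_vec, emb), id_, title) for id_, title, emb in articles),
    )
```

An ANN index may occasionally miss one of these true top-k results; that recall loss is the accuracy traded for speed.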
Result Ranking
Results are ranked by descending similarity score (equivalently, ascending cosine distance). The top results represent the articles whose content is most semantically similar to the query. Additional ranking factors can include:
- Popularity metrics: Page view counts or link centrality to boost authoritative articles
- Freshness: More recently updated articles may be prioritized
- Diversity: Results can be diversified to avoid returning many nearly-identical articles
Currently, the system uses pure semantic similarity for ranking, ensuring results are determined solely by content relevance to the query.
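A blended ranker along the lines of those optional factors might be sketched as follows; the weights and the `Result` fields are purely illustrative, and as noted, the current system uses similarity alone (the default weights below reproduce that behavior):

```python
from dataclasses import dataclass

@dataclass
class Result:
    id: int
    similarity: float   # cosine similarity in [-1, 1]
    popularity: float   # e.g. normalized page views in [0, 1]
    freshness: float    # e.g. recency score in [0, 1]

def rank(results, w_sim=1.0, w_pop=0.0, w_fresh=0.0):
    """Order results by a weighted blend of ranking signals.

    With the default weights the score is just the similarity,
    i.e. pure semantic ranking.
    """
    def score(r):
        return w_sim * r.similarity + w_pop * r.popularity + w_fresh * r.freshness
    return sorted(results, key=score, reverse=True)
```

Nonzero popularity or freshness weights would let authoritative or recently updated articles overtake slightly more similar ones.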
What Semantic Search Enables
Exploratory Knowledge Discovery
Semantic search transforms the visualization from a browse-only interface into an exploratory discovery tool:
Broad Conceptual Queries
Users can enter queries like "climate change impacts on agriculture" and receive articles covering crop yields, water scarcity, agricultural adaptation strategies, and related topics—even if those exact phrases don't appear in the articles.
Question-Based Search
Natural language questions work effectively. A query like "how do neural networks learn?" will retrieve articles about backpropagation, gradient descent, and training algorithms based on semantic content rather than keyword matching.
Topic Neighborhood Identification
By finding semantically similar articles and displaying them in the 3D space, users can identify not just individual matches but entire topic neighborhoods, seeing how the query concept relates to surrounding topics.
Cross-Domain Connections
Semantic search excels at finding conceptual connections across traditional domain boundaries:
Methodological Similarity
A search for "statistical methods" might retrieve articles from biology (population genetics), physics (experimental design), economics (econometrics), and psychology (psychometrics)—revealing how similar analytical approaches span disciplines.
Conceptual Analogies
Searching for "network effects" might surface articles about social networks, neural networks, ecological networks, and transportation networks—demonstrating how the same conceptual framework applies across domains.
Integration with Visualization
The search system integrates tightly with the 3D visualization, enabling coordinated exploration:
- Spatial highlighting: Search results are highlighted in the 3D space, showing where relevant articles are positioned and revealing patterns in result distribution.
- Result clustering: If search results cluster in particular regions, this indicates the query concept aligns with specific topic areas. Scattered results suggest a cross-cutting concept.
- Distance metrics: The visualization can display similarity scores as spatial overlays, showing query-to-article distance alongside inter-article relationships.
Search Modes: Text vs Semantic
The system offers both traditional text search and semantic search, allowing users to choose the appropriate mode for their information need:
Text Search Mode
Uses PostgreSQL's full-text search capabilities with the following features:
- Tokenization and stemming (reducing words to root forms)
- Stop word removal (filtering common words like "the," "and")
- Boolean operators (AND, OR, NOT) for complex queries
- Phrase matching for exact sequences
Best for: Finding specific articles by known title terms, author names, or when exact keyword matching is desired.
Semantic Search Mode
Uses the embedding-based similarity approach described above.
Best for: Conceptual exploration, finding articles on related topics, discovering cross-domain connections, and questions where you're unsure of exact terminology.
Mode Selection Guidance
Use text search when you know what you're looking for and can specify it with keywords. Use semantic search when exploring a topic area, when you're unsure of terminology, or when looking for conceptually similar content that might use different vocabulary.
Technical Implementation Details
System Architecture
Query flow:
1. User submits query text
2. Frontend sends POST request to /api/ai/search/semantic
3. Backend calls OpenAI API to generate query embedding
4. Backend queries pgvector: SELECT ... ORDER BY embedding <=> query
5. Results (id, title, similarity) returned to frontend
6. Frontend displays results and highlights in 3D space
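Steps 3-5 of the flow above can be sketched as a single backend function with its dependencies injected; `embed_fn` and the in-memory `corpus` stand in for the real OpenAI call and the pgvector query, and the function name is hypothetical:

```python
def semantic_search(query_text, embed_fn, corpus, limit=20):
    """Embed the query, score the corpus, and return the top hits.

    embed_fn: text -> embedding vector (the OpenAI API call in production)
    corpus:   list of (id, title, embedding) with unit-normalized embeddings
              (the pgvector-indexed articles table in production)
    """
    q = embed_fn(query_text)
    scored = [
        # Dot product equals cosine similarity for unit-normalized vectors.
        (sum(x * y for x, y in zip(q, emb)), id_, title)
        for id_, title, emb in corpus
    ]
    scored.sort(reverse=True)  # highest similarity first
    return [{"id": i, "title": t, "similarity": s} for s, i, t in scored[:limit]]
```

The returned (id, title, similarity) records are what the frontend consumes to display results and highlight them in the 3D space.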
Performance Characteristics
Query Latency
Typical end-to-end query time is 200-500ms, dominated by embedding generation (100-200ms for the OpenAI API round trip) and vector similarity search (50-150ms, depending on corpus size and index type); network latency between client and server adds a further 50-100ms.
Scalability
Query latency grows sublinearly with corpus size when ANN indexes are used: a corpus of 10,000 articles can be searched in roughly the same time as one of 1,000 with appropriate indexing. For very large corpora (100k+ articles), hierarchical or partitioned search strategies may be needed.