Semantic Search System
Why Semantic Search?
Traditional keyword-based search matches exact words or n-grams between the query and documents. While effective for many use cases, this approach has fundamental limitations when searching conceptual or knowledge-based content:
Limitations of Keyword Search
Vocabulary Mismatch
Users may search for "car" while relevant articles use "automobile" or "vehicle." Keyword search would miss these results despite semantic equivalence. This problem, known as the vocabulary mismatch problem, is pervasive in information retrieval.
Conceptual Queries
Queries like "articles about economic inequality" or "information on renewable energy adoption" express conceptual interests that may not have exact keyword matches. Relevant articles might discuss "income disparity" or "solar panel deployment" without using the exact query terms.
Polysemy and Context
Words have multiple meanings depending on context. A keyword search for "bank" cannot distinguish between financial institutions and river banks. Semantic search understands contextual meaning through learned representations.
Advantages of Semantic Approach
Semantic search addresses these limitations by operating in embedding space rather than lexical space:
- Synonym handling: Semantically equivalent terms have similar embeddings, so "car" and "automobile" are recognized as related concepts.
- Conceptual matching: Articles matching the query's intent are retrieved even without keyword overlap.
- Context sensitivity: Embeddings capture word meaning in context, disambiguating polysemous terms.
- Exploratory discovery: Users can find related content they wouldn't have known to search for using keywords alone.
How Semantic Search Works
Query Embedding
When a user submits a search query, it undergoes the same embedding process as article content:
query_embedding = Embed(query_text)
where:
query_embedding ∈ ℝ^1536
Embed() = text-embedding-ada-002
The embedding model transforms the query into the same 1536-dimensional vector space used for article embeddings. Critically, this means queries and articles inhabit the same semantic space, enabling meaningful similarity comparisons.
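As a sketch, the query-embedding step might look like the following, assuming the official OpenAI Python client; the `embed_query` helper and the injected `client` are illustrative, not the system's actual code:

```python
def embed_query(query_text, client):
    """Embed a search query into the same 1536-dimensional space as articles.

    `client` is an OpenAI client instance (e.g. openai.OpenAI()); injecting it
    keeps the helper easy to exercise with a stub in tests.
    """
    response = client.embeddings.create(
        model="text-embedding-ada-002",
        input=query_text,
    )
    # The API returns one embedding per input; a single query yields data[0].
    return response.data[0].embedding  # list of 1536 floats
```

Because the same model embeds both queries and articles, the returned vector can be compared directly against stored article embeddings.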
Similarity Computation
To find relevant articles, the system computes the cosine similarity between the query embedding and each article embedding:
similarity(query, article) = cos_sim(q, a)
= q · a / (||q|| * ||a||)
Since the embeddings are unit-normalized (||q|| = ||a|| = 1), this simplifies to:
= q · a
Range: [-1, 1]
- 1: identical meaning
- 0: orthogonal (unrelated)
- -1: opposite meaning (rare)
Cosine similarity measures the angle between vectors, with smaller angles (higher cosine values) indicating greater semantic similarity. For unit-normalized embeddings it yields the same ranking as Euclidean distance, since ||q − a||² = 2(1 − q · a); cosine similarity is preferred because its fixed [-1, 1] range is directly interpretable as a similarity score, independent of vector magnitude.
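The computation above can be sketched in a few lines of plain Python (the `cos_sim` helper name mirrors the formula; a production system would use numpy or push this into the database):

```python
import math

def cos_sim(q, a):
    """Cosine similarity: dot product divided by the product of the norms."""
    dot = sum(x * y for x, y in zip(q, a))
    norm_q = math.sqrt(sum(x * x for x in q))
    norm_a = math.sqrt(sum(x * x for x in a))
    return dot / (norm_q * norm_a)

# For unit-normalized vectors both norms are 1, so cos_sim reduces to the dot product.
```

For example, identical unit vectors score 1.0, orthogonal vectors score 0.0, and opposite vectors score -1.0, matching the range described above.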
Vector Database Optimization
Computing similarity against all articles would be computationally expensive for large corpora. The system uses pgvector, a PostgreSQL extension providing efficient vector similarity search:
Approximate Nearest Neighbor (ANN) Search
Instead of exhaustively computing all similarities, pgvector uses indexing structures (IVFFlat or HNSW) to quickly identify approximate nearest neighbors. This trades a small amount of accuracy for dramatic speed improvements.
SQL query structure:
SELECT id, title, embedding <=> query_vector AS distance
FROM articles
ORDER BY embedding <=> query_vector
LIMIT 20;
The <=> operator computes cosine distance (1 − cosine similarity), so ordering by ascending distance returns the most similar articles first; with an appropriate index in place, pgvector answers this query via ANN search rather than a full scan.
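For intuition, here is what that query computes, written as an exact brute-force scan in Python; pgvector's index answers the same question approximately but far faster. The `articles` structure and helper names are illustrative, not the system's schema:

```python
import heapq

def cosine_distance(q, a):
    """pgvector's <=> metric: 1 - cosine similarity (vectors assumed unit-normalized)."""
    return 1.0 - sum(x * y for x, y in zip(q, a))

def top_k_exact(query_vec, articles, k=20):
    """articles: iterable of (id, title, embedding).

    Returns the k nearest records as (distance, id, title) tuples,
    ordered by ascending cosine distance -- the exact answer that
    IVFFlat/HNSW indexes approximate.
    """
    return heapq.nsmallest(
        k,
        ((cosine_distance(query_vec, emb), id_, title) for id_, title, emb in articles),
    )
```

An ANN index may occasionally miss one of these true top-k results; that recall loss is the accuracy traded for speed.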
Result Ranking
Results are ranked by descending similarity score (equivalently, ascending cosine distance). The top results represent the articles whose content is most semantically similar to the query. Additional ranking factors can include:
- Popularity metrics: Page view counts or link centrality to boost authoritative articles
- Freshness: More recently updated articles may be prioritized
- Diversity: Results can be diversified to avoid returning many nearly-identical articles
Currently, the system uses pure semantic similarity for ranking, ensuring results are determined solely by content relevance to the query.
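A blended ranker along the lines of those optional factors might be sketched as follows; the weights and the `Result` fields are purely illustrative, and as noted, the current system uses similarity alone (the default weights below reproduce that behavior):

```python
from dataclasses import dataclass

@dataclass
class Result:
    id: int
    similarity: float   # cosine similarity in [-1, 1]
    popularity: float   # e.g. normalized page views in [0, 1]
    freshness: float    # e.g. recency score in [0, 1]

def rank(results, w_sim=1.0, w_pop=0.0, w_fresh=0.0):
    """Order results by a weighted blend of ranking signals.

    With the default weights the score is just the similarity,
    i.e. pure semantic ranking.
    """
    def score(r):
        return w_sim * r.similarity + w_pop * r.popularity + w_fresh * r.freshness
    return sorted(results, key=score, reverse=True)
```

Nonzero popularity or freshness weights would let authoritative or recently updated articles overtake slightly more similar ones.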
What Semantic Search Enables
Exploratory Knowledge Discovery
Semantic search transforms the visualization from a browse-only interface into an exploratory discovery tool:
Broad Conceptual Queries
Users can enter queries like "climate change impacts on agriculture" and receive articles covering crop yields, water scarcity, agricultural adaptation strategies, and related topics—even if those exact phrases don't appear in the articles.
Question-Based Search
Natural language questions work effectively. A query like "how do neural networks learn?" will retrieve articles about backpropagation, gradient descent, and training algorithms based on semantic content rather than keyword matching.
Topic Neighborhood Identification
By finding semantically similar articles and displaying them in the 3D space, users can identify not just individual matches but entire topic neighborhoods, seeing how the query concept relates to surrounding topics.
Cross-Domain Connections
Semantic search excels at finding conceptual connections across traditional domain boundaries:
Methodological Similarity
A search for "statistical methods" might retrieve articles from biology (population genetics), physics (experimental design), economics (econometrics), and psychology (psychometrics)—revealing how similar analytical approaches span disciplines.
Conceptual Analogies
Searching for "network effects" might surface articles about social networks, neural networks, ecological networks, and transportation networks—demonstrating how the same conceptual framework applies across domains.
Integration with Visualization
The search system integrates tightly with the 3D visualization, enabling coordinated exploration:
- Spatial highlighting: Search results are highlighted in the 3D space, showing where relevant articles are positioned and revealing patterns in result distribution.
- Result clustering: If search results cluster in particular regions, this indicates the query concept aligns with specific topic areas. Scattered results suggest a cross-cutting concept.
- Distance metrics: The visualization can display similarity scores as spatial overlays, showing query-to-article distance alongside inter-article relationships.
Search Modes: Text vs Semantic
The system offers both traditional text search and semantic search, allowing users to choose the appropriate mode for their information need:
Text Search Mode
Uses PostgreSQL's full-text search capabilities with the following features:
- Tokenization and stemming (reducing words to root forms)
- Stop word removal (filtering common words like "the," "and")
- Boolean operators (AND, OR, NOT) for complex queries
- Phrase matching for exact sequences
Best for: Finding specific articles by known title terms, author names, or when exact keyword matching is desired.
Semantic Search Mode
Uses the embedding-based similarity approach described above.
Best for: Conceptual exploration, finding articles on related topics, discovering cross-domain connections, and questions where you're unsure of exact terminology.
Mode Selection Guidance
Use text search when you know what you're looking for and can specify it with keywords. Use semantic search when exploring a topic area, when you're unsure of terminology, or when looking for conceptually similar content that might use different vocabulary.
Technical Implementation Details
System Architecture
Query flow:
1. User submits query text
2. Frontend sends POST request to /api/ai/search/semantic
3. Backend calls OpenAI API to generate query embedding
4. Backend queries pgvector: SELECT ... ORDER BY embedding <=> query
5. Results (id, title, similarity) returned to frontend
6. Frontend displays results and highlights in 3D space
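Steps 3-5 of the flow above can be sketched as a single backend function with its dependencies injected; `embed_fn` and the in-memory `corpus` stand in for the real OpenAI call and the pgvector query, and the function name is hypothetical:

```python
def semantic_search(query_text, embed_fn, corpus, limit=20):
    """Embed the query, score the corpus, and return the top hits.

    embed_fn: text -> embedding vector (the OpenAI API call in production)
    corpus:   list of (id, title, embedding) with unit-normalized embeddings
              (the pgvector-indexed articles table in production)
    """
    q = embed_fn(query_text)
    scored = [
        # Dot product equals cosine similarity for unit-normalized vectors.
        (sum(x * y for x, y in zip(q, emb)), id_, title)
        for id_, title, emb in corpus
    ]
    scored.sort(reverse=True)  # highest similarity first
    return [{"id": i, "title": t, "similarity": s} for s, i, t in scored[:limit]]
```

The returned (id, title, similarity) records are what the frontend consumes to display results and highlight them in the 3D space.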
Performance Characteristics
Query Latency
Typical end-to-end query time is 200-500ms, dominated by embedding generation (100-200ms for the OpenAI API round trip) and vector similarity search (50-150ms, depending on corpus size and index type); network latency between client and server adds a further 50-100ms.
Scalability
Query latency grows sublinearly with corpus size when ANN indexes are used: a corpus of 10,000 articles can be searched in roughly the same time as one of 1,000 with appropriate indexing. For very large corpora (100k+ articles), hierarchical or partitioned search strategies may be needed.