A three-dimensional graph-based visualization system for exploring semantic relationships in Wikipedia articles using neural embeddings, dimensionality reduction, and clustering algorithms.
This research project presents a novel approach to visualizing the semantic structure of Wikipedia articles in three-dimensional space. By leveraging state-of-the-art natural language processing techniques, specifically transformer-based neural embeddings, we transform textual content into high-dimensional vector representations that capture semantic meaning. These representations are then reduced to three dimensions using Uniform Manifold Approximation and Projection (UMAP), enabling spatial visualization where semantic similarity corresponds to geometric proximity.
Articles are grouped into thematic clusters using the Hierarchical Density-Based Spatial Clustering of Applications with Noise (HDBSCAN) algorithm, which identifies natural groupings in the data without requiring predetermined cluster counts. The resulting visualization represents a knowledge graph where nodes are articles, edges represent hyperlink relationships, and spatial positioning reflects semantic similarity.
Wikipedia articles are extracted using the MediaWiki API, capturing both content and metadata. The raw text undergoes preprocessing to remove wiki markup and extract clean, semantically meaningful content. Article metadata, including titles, categories, and link structure, is preserved for subsequent graph construction.
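The source does not include the extraction code itself; a minimal sketch of such a fetch against Wikipedia's public MediaWiki API, using only Python's standard library, might look like the following (function names and the exact parameter set are illustrative assumptions, not the project's actual implementation):

```python
import json
from urllib.parse import urlencode
from urllib.request import urlopen

API_URL = "https://en.wikipedia.org/w/api.php"  # public MediaWiki API endpoint

def build_extract_params(title):
    """Build query parameters for fetching plain-text content and links."""
    return {
        "action": "query",
        "prop": "extracts|links",  # article text plus outgoing links
        "explaintext": 1,          # ask the API to strip wiki markup
        "titles": title,
        "format": "json",
    }

def fetch_article(title):
    """Fetch one article's content and link metadata (performs a network call)."""
    url = API_URL + "?" + urlencode(build_extract_params(title))
    with urlopen(url, timeout=30) as resp:
        return json.load(resp)
```

The `explaintext` flag delegates markup stripping to the API, which reduces the preprocessing needed on the client side.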
Each article is transformed into a dense vector representation using OpenAI's text-embedding-ada-002 model. This model produces 1536-dimensional vectors where each dimension captures a learned semantic feature. The embeddings are computed using the following process:
e(a) = Embedding_Model(text(a))
where:
- e(a) ∈ ℝ^1536 is the embedding vector
- text(a) is the preprocessed article text
- Embedding_Model is the neural encoder
The resulting embeddings capture semantic relationships such that articles with similar content have vectors with high cosine similarity. This property is fundamental to the subsequent dimensionality reduction and clustering steps.
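The similarity property can be illustrated with cosine similarity computed over toy vectors (standing in for the 1536-dimensional ada-002 embeddings, which are too large to show here):

```python
import math

def cosine_similarity(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

# Near-parallel vectors -> similarity close to 1 (semantically similar articles);
# orthogonal vectors -> similarity near 0 (unrelated articles).
```

Because UMAP is configured with the cosine metric (see the parameters below), this is the same notion of similarity the dimensionality reduction preserves.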
Uniform Manifold Approximation and Projection (UMAP) reduces the 1536-dimensional embeddings to 3D coordinates suitable for visualization. UMAP operates on principles from Riemannian geometry and algebraic topology, constructing a high-dimensional graph of the data and optimizing a low-dimensional representation that preserves the manifold structure.
UMAP algorithm parameters:
- n_neighbors: 15
- min_dist: 0.1
- n_components: 3
- metric: cosine
The algorithm constructs a fuzzy topological representation of the high-dimensional data using a k-nearest-neighbors graph, then optimizes a low-dimensional layout by minimizing the cross-entropy between the high- and low-dimensional edge probability distributions:
minimize: CE = Σ_(i,j) [ v_ij · log(v_ij / w_ij) + (1 − v_ij) · log((1 − v_ij) / (1 − w_ij)) ]
where:
- v_ij: high-dimensional edge probability
- w_ij: low-dimensional edge probability
This approach preserves both local neighborhood structure and much of the global data topology, which often makes it preferable to t-SNE for visualization tasks, since t-SNE tends to emphasize local neighborhoods at the expense of global layout.
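The cross-entropy objective above can be evaluated directly; the following sketch computes it for a list of edge probabilities (an illustration of the formula, not the optimized stochastic implementation a real UMAP library uses):

```python
import math

def umap_cross_entropy(v, w, eps=1e-12):
    """Cross-entropy between high-dimensional edge probabilities v
    and low-dimensional edge probabilities w, summed over edges."""
    total = 0.0
    for vij, wij in zip(v, w):
        wij = min(max(wij, eps), 1 - eps)   # guard against log(0)
        vij_c = min(max(vij, eps), 1 - eps)
        total += (vij * math.log(vij_c / wij)
                  + (1 - vij) * math.log((1 - vij_c) / (1 - wij)))
    return total

# In practice the reduction itself would be run with the umap-learn library,
# e.g. umap.UMAP(n_neighbors=15, min_dist=0.1, n_components=3,
#                metric="cosine").fit_transform(embeddings)
```

The objective is zero when the low-dimensional edge probabilities match the high-dimensional ones exactly, and grows as the layout distorts the neighborhood graph.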
HDBSCAN identifies clusters in the 3D embedding space by analyzing density variations. Unlike k-means or other centroid-based methods, HDBSCAN does not require specifying the number of clusters a priori and can identify clusters of arbitrary shape while handling outliers robustly.
HDBSCAN parameters:
- min_cluster_size: 5
- min_samples: 3
- cluster_selection_method: eom
The algorithm operates in three phases:
- Compute mutual reachability distances, which smooth raw distances by incorporating each point's core (k-nearest-neighbor) distance
- Build a minimum spanning tree over the mutual reachability graph and condense it into a cluster hierarchy
- Extract the most stable clusters from the hierarchy, here using the excess-of-mass (eom) criterion
Points that do not belong to any dense region are classified as noise and remain unclustered. Each identified cluster represents a coherent topic area in the Wikipedia corpus.
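HDBSCAN's mutual reachability distance, which underlies its density estimates, can be sketched as follows (a naive per-query illustration; the hdbscan library computes this far more efficiently, and would be invoked here as `hdbscan.HDBSCAN(min_cluster_size=5, min_samples=3, cluster_selection_method="eom")`):

```python
import math

def core_distance(points, i, k):
    """Distance from points[i] to its k-th nearest neighbor."""
    dists = sorted(math.dist(points[i], p)
                   for j, p in enumerate(points) if j != i)
    return dists[k - 1]

def mutual_reachability(points, i, j, k):
    """Mutual reachability distance: the raw distance between i and j,
    floored by both points' core distances."""
    return max(core_distance(points, i, k),
               core_distance(points, j, k),
               math.dist(points[i], points[j]))
```

Flooring pairwise distances by core distances pushes sparse points apart, so the minimum spanning tree built on this metric separates dense regions cleanly from noise.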
The knowledge graph is constructed by analyzing Wikipedia's hyperlink structure. Each article corresponds to a node positioned at its UMAP-derived 3D coordinates. Edges are created based on hyperlinks between articles:
G = (V, E) where:
V = set of article nodes
E = set of directed edges (a_i → a_j)
Edge weight w_ij:
- w = 1 if unidirectional link
- w = 2 if bidirectional link
Bidirectional links (where both articles reference each other) indicate stronger semantic relationships and are visually distinguished with increased opacity and line thickness in the rendering.
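The weighting rule can be sketched as follows (the `links` mapping and `build_edges` name are illustrative; the source does not show its graph-construction code):

```python
def build_edges(links):
    """links: dict mapping each article title to the set of titles it links to.
    Returns a dict mapping directed (source, target) pairs to edge weights:
    2 when the link is reciprocated, 1 when it is one-way."""
    edges = {}
    for src, targets in links.items():
        for dst in targets:
            # Bidirectional: dst also links back to src.
            if dst in links and src in links.get(dst, set()):
                edges[(src, dst)] = 2
            else:
                edges[(src, dst)] = 1
    return edges
```

Both directed edges of a reciprocal pair carry weight 2, matching the definition of E as a set of directed edges with per-edge weights.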
The visualization is rendered using WebGL through the Three.js library and the React Three Fiber framework. Each article is represented as a sphere positioned at its (x, y, z) UMAP coordinates.
User interaction employs orbital camera controls allowing rotation, panning, and zooming. Article selection triggers detail panel display and camera interpolation to the selected node's position.
Rendering large graphs (10,000+ nodes) requires careful performance optimization.
The resulting visualization successfully groups semantically related articles into spatial clusters. Articles about similar topics (e.g., science, history, geography) naturally cluster together in 3D space, with cluster boundaries emerging from the density-based analysis rather than arbitrary geometric divisions.
The link structure reveals interesting patterns in how Wikipedia articles reference each other. Dense connection patterns within clusters indicate self-contained topic areas, while inter-cluster edges reveal conceptual bridges between domains. Bidirectional links often correspond to strongly related concepts or complementary topics.
This approach to knowledge graph visualization provides an intuitive interface for exploring large corpora of interlinked documents, potentially applicable to domains beyond Wikipedia including academic literature, legal documents, and enterprise knowledge bases.
Comprehensive technical documentation covering all aspects of the system:
- Interface controls, navigation, interpretation guidelines, and analytical applications
- Embedding generation, the UMAP algorithm, dimensionality reduction, and positioning mathematics
- Link extraction, directionality detection, weight calculation, and visual encoding
- HDBSCAN methodology, mutual reachability, MST construction, and cluster extraction
- Embedding similarity, vector search, query processing, and exploratory discovery
- Frontend structure, backend API, AI-layer integration, database schema, and deployment