Wikipedia Knowledge Graph Visualization

A three-dimensional graph-based visualization system for exploring semantic relationships in Wikipedia articles using neural embeddings, dimensionality reduction, and clustering algorithms.

Abstract

This research project presents a novel approach to visualizing the semantic structure of Wikipedia articles in three-dimensional space. By leveraging state-of-the-art natural language processing techniques, specifically transformer-based neural embeddings, we transform textual content into high-dimensional vector representations that capture semantic meaning. These representations are then reduced to three dimensions using Uniform Manifold Approximation and Projection (UMAP), enabling spatial visualization where semantic similarity corresponds to geometric proximity.

Articles are grouped into thematic clusters using the Hierarchical Density-Based Spatial Clustering of Applications with Noise (HDBSCAN) algorithm, which identifies natural groupings in the data without requiring predetermined cluster counts. The resulting visualization represents a knowledge graph where nodes are articles, edges represent hyperlink relationships, and spatial positioning reflects semantic similarity.

Technical Architecture

1. Data Acquisition and Processing

Wikipedia articles are extracted using the MediaWiki API, capturing both content and metadata. The raw text undergoes preprocessing to strip wiki markup and retain clean, semantically meaningful content. Article metadata, including titles, categories, and link structure, is preserved for subsequent graph construction.
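
A minimal sketch of this step, assuming the standard MediaWiki api.php endpoint with the TextExtracts extension; the function name and the fields kept are illustrative, and continuation of long link lists (plcontinue) is omitted for brevity:

```python
import requests

API_URL = "https://en.wikipedia.org/w/api.php"

def fetch_article(title: str) -> dict:
    """Fetch the plain-text extract, outgoing links, and categories for one article."""
    params = {
        "action": "query",
        "format": "json",
        "titles": title,
        "prop": "extracts|links|categories",
        "explaintext": 1,   # return the extract as plain text, with wiki markup stripped
        "pllimit": "max",   # as many outgoing links as the API allows per request
        "cllimit": "max",   # as many categories as the API allows per request
    }
    pages = requests.get(API_URL, params=params).json()["query"]["pages"]
    page = next(iter(pages.values()))  # single title requested, so take the only page
    return {
        "title": page.get("title", title),
        "text": page.get("extract", ""),
        "links": [link["title"] for link in page.get("links", [])],
        "categories": [cat["title"] for cat in page.get("categories", [])],
    }
```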

2. Neural Embeddings

Each article is transformed into a dense vector representation using OpenAI's text-embedding-ada-002 model. This model produces 1536-dimensional vectors whose dimensions jointly encode learned semantic features. The embeddings are computed using the following process:

e(a) = Embedding_Model(text(a))

where:

- e(a) ∈ ℝ^1536 is the embedding vector

- text(a) is the preprocessed article text

- Embedding_Model is the neural encoder

The resulting embeddings capture semantic relationships such that articles with similar content have vectors with high cosine similarity. This property is fundamental to the subsequent dimensionality reduction and clustering steps.
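
As a sketch, the batch embedding call might look like the following, assuming the OpenAI Python SDK (v1+) with an OPENAI_API_KEY available in the environment; the helper names are illustrative:

```python
import numpy as np
from openai import OpenAI  # openai >= 1.0

client = OpenAI()  # reads OPENAI_API_KEY from the environment

def embed_articles(texts: list[str]) -> np.ndarray:
    """Return an (n_articles, 1536) matrix of text-embedding-ada-002 vectors."""
    response = client.embeddings.create(model="text-embedding-ada-002", input=texts)
    return np.array([item.embedding for item in response.data])

def cosine_similarity(a: np.ndarray, b: np.ndarray) -> float:
    """Cosine similarity between two embedding vectors; higher means more similar content."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))
```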

3. Dimensionality Reduction via UMAP

Uniform Manifold Approximation and Projection (UMAP) reduces the 1536-dimensional embeddings to 3D coordinates suitable for visualization. UMAP operates on principles from Riemannian geometry and algebraic topology, constructing a high-dimensional graph of the data and optimizing a low-dimensional representation that preserves the manifold structure.

UMAP algorithm parameters:

- n_neighbors: 15

- min_dist: 0.1

- n_components: 3

- metric: cosine

The algorithm constructs a fuzzy topological representation of the high-dimensional data using a k-nearest-neighbors graph, then optimizes a low-dimensional layout by minimizing cross-entropy between high and low-dimensional probability distributions:

minimize: CE = Σ(v_ij * log(v_ij / w_ij) + (1-v_ij) * log((1-v_ij) / (1-w_ij)))

where:

- v_ij: high-dimensional edge probability

- w_ij: low-dimensional edge probability

This approach preserves local neighborhood structure while retaining much of the global data topology, which makes it better suited than t-SNE for layouts where the relative placement of clusters should remain meaningful.
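
A minimal sketch of the reduction step with the parameters listed above, using the umap-learn API; the fixed random_state is an illustrative addition for reproducible layouts:

```python
import numpy as np
import umap  # provided by the umap-learn package

def reduce_to_3d(embeddings: np.ndarray) -> np.ndarray:
    """Project (n, 1536) embedding vectors to (n, 3) coordinates for rendering."""
    reducer = umap.UMAP(
        n_neighbors=15,    # size of the local neighborhood used to build the fuzzy graph
        min_dist=0.1,      # minimum spacing between points in the low-dimensional layout
        n_components=3,    # target dimensionality for 3D visualization
        metric="cosine",   # cosine distance matches how the embeddings encode similarity
        random_state=42,   # fixed seed so repeated runs produce the same layout (assumption)
    )
    return reducer.fit_transform(embeddings)
```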

4. Hierarchical Density-Based Clustering (HDBSCAN)

HDBSCAN identifies clusters in the 3D embedding space by analyzing density variations. Unlike k-means or other centroid-based methods, HDBSCAN does not require specifying the number of clusters a priori and can identify clusters of arbitrary shape while handling outliers robustly.

HDBSCAN parameters:

- min_cluster_size: 5

- min_samples: 3

- cluster_selection_method: eom

The algorithm operates in three phases:

  1. Mutual reachability graph construction: Compute core distances and mutual reachability distances between points.
  2. Minimum spanning tree: Build an MST of the mutual reachability graph using Prim's algorithm.
  3. Cluster extraction: Convert the MST into a hierarchy of clusters and extract optimal flat clustering using excess of mass (EOM) optimization.

Points that do not belong to any dense region are classified as noise and remain unclustered. Each identified cluster represents a coherent topic area in the Wikipedia corpus.
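
A corresponding sketch using the hdbscan package with the parameters above; the function name is illustrative:

```python
import numpy as np
import hdbscan

def cluster_coordinates(coords_3d: np.ndarray) -> np.ndarray:
    """Assign a cluster label to each article; the label -1 marks noise points."""
    clusterer = hdbscan.HDBSCAN(
        min_cluster_size=5,              # smallest group of points treated as a cluster
        min_samples=3,                   # controls how conservative the density estimate is
        cluster_selection_method="eom",  # excess-of-mass flat-cluster extraction
    )
    return clusterer.fit_predict(coords_3d)
```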

5. Graph Construction and Link Analysis

The knowledge graph is constructed by analyzing Wikipedia's hyperlink structure. Each article corresponds to a node positioned at its UMAP-derived 3D coordinates. Edges are created based on hyperlinks between articles:

G = (V, E) where:

V = set of article nodes

E = set of directed edges (a_i → a_j)

Edge weight w_ij:

- w = 1 if unidirectional link

- w = 2 if bidirectional link

Bidirectional links (where each article links to the other) typically indicate a stronger relationship between topics and are visually distinguished with increased opacity and line thickness in the rendering.
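
The following sketch illustrates the edge-weighting rule, assuming an in-memory mapping from article titles to the metadata produced in the data-acquisition step; for simplicity it collapses a reciprocal pair of links into a single weight-2 edge:

```python
def build_link_edges(articles: dict[str, dict]) -> list[tuple[str, str, int]]:
    """Return (source, target, weight) edges; weight 2 marks a bidirectional link."""
    edges: list[tuple[str, str, int]] = []
    seen: set[tuple[str, str]] = set()
    for title, data in articles.items():
        for target in data["links"]:
            # Only keep links whose target is also part of the corpus.
            if target not in articles or (title, target) in seen:
                continue
            # A link is bidirectional if the target article links back to the source.
            bidirectional = title in articles[target]["links"]
            edges.append((title, target, 2 if bidirectional else 1))
            seen.add((title, target))
            if bidirectional:
                seen.add((target, title))  # avoid emitting the reverse edge separately
    return edges
```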

6. Rendering and Interaction

The visualization is rendered using WebGL through the Three.js library and React Three Fiber framework. Each article is represented as a sphere positioned at its (x, y, z) UMAP coordinates. Visual encoding includes:

  • Color: Determined by cluster membership, with each cluster assigned a distinct hue
  • Size: Base radius of 0.14 units, scaled to 0.22 when selected
  • Opacity: Full opacity (1.0) for all nodes to ensure visibility
  • Edges: Rendered as lines with opacity and thickness varying by link weight

User interaction employs orbital camera controls allowing rotation, panning, and zooming. Article selection triggers detail panel display and camera interpolation to the selected node's position.

Performance Considerations

Rendering large graphs (10,000+ nodes) requires careful performance optimization. Current optimizations include:

  • Frustum culling: Three.js automatically culls objects outside the camera view frustum, reducing draw calls.
  • Level of Detail (LOD) and instancing: Future implementations may render distant nodes with simpler geometry and use instanced meshes to draw all node spheres in a single draw call, rather than one call per node.
  • Edge filtering: Edges can be filtered by weight or distance to reduce visual clutter while preserving important connections.
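
As an illustration of the edge-filtering idea, a backend-side pass might drop edges below a weight threshold or edges spanning large distances in UMAP space before the graph is sent to the client; the function and its parameters are hypothetical:

```python
def filter_edges(
    edges: list[tuple[str, str, int]],
    positions: dict[str, tuple[float, float, float]],
    min_weight: int = 1,
    max_distance: float | None = None,
) -> list[tuple[str, str, int]]:
    """Keep edges at or above min_weight and, optionally, shorter than max_distance."""
    kept = []
    for source, target, weight in edges:
        if weight < min_weight:
            continue
        if max_distance is not None:
            sx, sy, sz = positions[source]
            tx, ty, tz = positions[target]
            distance = ((sx - tx) ** 2 + (sy - ty) ** 2 + (sz - tz) ** 2) ** 0.5
            if distance > max_distance:
                continue
        kept.append((source, target, weight))
    return kept
```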

Results and Discussion

The resulting visualization successfully groups semantically related articles into spatial clusters. Articles about similar topics (e.g., science, history, geography) naturally cluster together in 3D space, with cluster boundaries emerging from the density-based analysis rather than arbitrary geometric divisions.

The link structure reveals interesting patterns in how Wikipedia articles reference each other. Dense connection patterns within clusters indicate self-contained topic areas, while inter-cluster edges reveal conceptual bridges between domains. Bidirectional links often correspond to strongly related concepts or complementary topics.

This approach to knowledge graph visualization provides an intuitive interface for exploring large corpora of interlinked documents, potentially applicable to domains beyond Wikipedia including academic literature, legal documents, and enterprise knowledge bases.

Technical Stack

Backend

  • FastAPI (Python 3.11+)
  • PostgreSQL with pgvector extension
  • OpenAI API (text-embedding-ada-002)
  • UMAP-learn (dimensionality reduction)
  • HDBSCAN (clustering)
  • NumPy, scikit-learn

Frontend

  • Next.js 14 (React 18)
  • TypeScript
  • Three.js (WebGL rendering)
  • React Three Fiber
  • Tailwind CSS

Documentation

Comprehensive technical documentation covering all aspects of the system: