← Back

Documentation

Node Generation and Positioning

Node Generation Process

Each node in the visualization corresponds to a single Wikipedia article. The generation process begins with data extraction from Wikipedia's API and proceeds through multiple transformation stages before resulting in the positioned nodes you observe in the 3D space.

1. Article Extraction

Articles are retrieved using the MediaWiki API with specific parameters to ensure content quality:

  • Only articles from the main namespace (namespace 0) are included, excluding discussion pages, user pages, and Wikipedia meta-pages.
  • Disambiguation pages, redirect pages, and stub articles below a minimum word count threshold are filtered out to ensure substantive content.
  • The full article text is extracted and stored, along with metadata including title, ID, creation date, last modification date, and category assignments.

2. Text Preprocessing

Raw Wikipedia markup is processed to extract clean text suitable for embedding:

Preprocessing pipeline:

1. Remove MediaWiki markup (templates, formatting)

2. Extract plain text from HTML rendering

3. Remove citation markers [1], [2], etc.

4. Normalize whitespace and punctuation

5. Truncate to maximum token length (8191 for ada-002)

The preprocessing preserves semantic content while removing formatting artifacts that would introduce noise into the embedding process. Section headers are retained as they provide important structural information about content organization.

Graph Construction from Wikipedia Links

The core of node representation is the graph structure derived from Wikipedia's internal hyperlink network. Each article's connections to other articles form the edges of the graph.

Link Extraction

Links between articles are extracted during the data pipeline processing:

  • Wikipedia markup is parsed to identify internal wiki links ([[Article Name]])
  • Link targets are resolved to article IDs in the dataset
  • Bidirectional links are detected when both A→B and B→A exist
  • Edge weights can reflect link frequency or bidirectionality

Graph Structure

The resulting graph has specific properties:

Graph G = (V, E) where:

V = set of article nodes

E = set of directed edges (hyperlinks)

w(e) = edge weight (1 for unidirectional, 2 for bidirectional)

Properties:

- Directed (A links to B ≠ B links to A)

- Weighted (based on link characteristics)

- Sparse (most articles link to small fraction of total)

- Scale-free (power-law degree distribution)

This graph structure encodes how Wikipedia's editors have chosen to interconnect articles, revealing conceptual relationships and organizational patterns in human knowledge.

Network Characteristics

The Wikipedia link graph exhibits several important network properties:

Small-World Property

Most articles can be reached from any other article through a relatively short path (typically 3-6 hops), despite the large size of the network.

Hub Articles

Certain articles act as hubs with very high degree (many connections). These tend to be broad concepts like "United States" or "Earth" that many specialized articles reference.

Community Structure

Articles naturally group into densely-connected communities corresponding to topic areas, even though no explicit categorization is imposed.

Force-Directed Layout Algorithm

The force-directed (spring layout) algorithm positions nodes in 3D space by simulating physical forces. Connected nodes attract each other while all nodes repel each other, creating a natural spatial arrangement that reflects the graph structure.

Theoretical Foundation

Force-directed algorithms model the graph as a physical system where edges are springs and nodes are charged particles. The system evolves toward a minimum-energy state where:

  • Connected nodes (via edges) are pulled together by spring forces
  • All nodes repel each other through coulomb-like forces
  • The equilibrium state minimizes total system energy
  • Dense subgraphs form tight spatial clusters

This approach naturally reveals community structure: articles that are highly interconnected will cluster together, while loosely connected articles will be pushed apart.

Algorithm Steps

Step 1: Initialize Positions

Nodes are randomly placed in 3D space with coordinates drawn from a uniform distribution. A random seed (42) ensures reproducibility:

For each node i:

x_i, y_i, z_i ~ Uniform(-1, 1)

Initial positions are random but deterministic

Step 2: Calculate Forces

At each iteration, forces on each node are computed based on its relationships with other nodes:

Attractive force (for connected nodes i, j):

F_attr = -k * d_ij * (p_i - p_j) / d_ij

Repulsive force (for all node pairs):

F_rep = k^2 / d_ij * (p_i - p_j) / d_ij

where:

- k = optimal distance parameter (1/√n * k_scale)

- d_ij = current distance between nodes i and j

- p_i, p_j = position vectors of nodes i and j

The parameter k scales with graph size: larger graphs use smaller k to prevent excessive spreading. The k_scale multiplier fine-tunes the overall spacing.

Step 3: Update Positions

Node positions are updated iteratively based on the net force acting on each node:

For each node i:

F_total = Σ F_attr(i,j) + Σ F_rep(i,k)

displacement = F_total * temperature

p_i_new = p_i + displacement

Temperature decreases each iteration:

T_t = T_0 * (1 - t/iterations)

The temperature parameter acts as a simulated annealing schedule: large movements early in the process allow major reorganization, while small movements later enable fine-tuning. The algorithm typically runs for 100-300 iterations until convergence.

Parameter Configuration

iterations = 100-300

Number of iterations for layout optimization. More iterations allow better convergence to minimum-energy state but increase computation time. Typical values of 100-300 provide good balance for graphs with thousands of nodes.

k_scale = 3.0-150.0

Multiplier for the optimal distance parameter k. Higher values create more spread-out layouts with greater spacing between nodes. Lower values produce more compact layouts. This parameter is adjusted based on desired visualization density.

weight consideration

Edge weights influence attractive forces. Bidirectional links (weight=2) create stronger attraction than unidirectional links (weight=1), causing mutually-referencing articles to position closer together.

dim = 3

Spatial dimensions for layout. While 2D is simpler, 3D provides additional degrees of freedom for resolving complex graph structures without excessive node overlap.

Final Node Positioning

The force-directed algorithm outputs (x, y, z) coordinates for each article. These coordinates define the node positions you observe in the visualization. Important characteristics of this positioning:

Deterministic Given Input

With a fixed random seed (42), the same graph structure will produce consistent positioning across runs. The exact coordinates are deterministic, making the visualization reproducible.

Scale Invariance

After layout computation, positions are normalized: centered at the origin and scaled so the furthest node is at a specified radius (typically 300 units). This normalization maintains relative relationships while fitting the graph in a predictable volume.

Radial Structure Preserved

Unlike some visualizations that project all nodes onto a sphere surface, this implementation preserves radial information. Nodes exist at varying distances from the origin, with central hub articles tending toward the center and peripheral articles pushed outward. This creates a natural "cloud" or "ball" structure.

Graph-Based Positioning

Node positions are determined entirely by the Wikipedia link structure. Articles that link to each other frequently (or have many mutual connections) will be positioned close together. The spatial proximity directly reflects connection density rather than semantic content similarity.