Documentation

Edge Calculation and Visualization

Edge Generation from Wikipedia Links

Edges in this visualization represent hyperlink connections between Wikipedia articles. They are derived directly from Wikipedia's explicit link structure rather than from semantic embeddings of article content, so they show how articles reference one another rather than how similar their text is.

Link Extraction Process

For each article in the visualization, we extract outgoing links using the following process:

  1. Parse the article's WikiText or HTML rendering to identify all internal links (links to other Wikipedia articles).
  2. Filter links to exclude non-content pages (templates, categories, files, special pages).
  3. Resolve redirects so that all links point to the canonical article page.
  4. Store links as directed edges in a graph database: if article A links to article B, we create an edge A → B.
  5. Deduplicate multiple links from the same source to the same target (if A links to B multiple times in the article text, we store only one edge).
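The steps above can be sketched in Python as follows. This is a minimal illustration and not the pipeline's actual code: it assumes the input is raw wikitext, that redirects are available as a plain dictionary, and that a short hand-written prefix list is enough to spot non-content pages. The names `extract_edges`, `normalize_title`, and `NON_CONTENT_PREFIXES` are hypothetical, and a set stands in for the graph database.

```python
import re

# Namespace prefixes treated as non-content (illustrative, not exhaustive).
NON_CONTENT_PREFIXES = (
    "template:", "category:", "file:", "image:", "special:", "help:", "wikipedia:",
)

# Matches [[Target]], [[Target|label]] and [[Target#Section|label]].
WIKILINK = re.compile(r"\[\[([^\[\]|#]+)(?:#[^\[\]|]*)?(?:\|[^\[\]]*)?\]\]")


def normalize_title(raw):
    """Trim whitespace, turn underscores into spaces, capitalize the first letter."""
    title = raw.strip().replace("_", " ")
    return title[:1].upper() + title[1:]


def extract_edges(source, wikitext, redirects):
    """Return the deduplicated set of (source, target) edges for one article."""
    edges = set()
    for match in WIKILINK.finditer(wikitext):  # step 1: find internal links
        title = normalize_title(match.group(1))
        if not title or title.lower().startswith(NON_CONTENT_PREFIXES):
            continue  # step 2: skip non-content pages
        target = redirects.get(title, title)  # step 3: resolve redirects
        edges.add((source, target))  # steps 4-5: the set drops duplicates
    return edges
```

Calling `extract_edges("Physics", text, redirects)` on an article that links to "Force" several times yields a single `("Physics", "Force")` edge, which mirrors the deduplication in step 5.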

Link Semantics

Wikipedia's Manual of Style provides guidelines for linking, which influence what edges represent:

Definitional Links

The first occurrence of a technical term is typically linked to its definition. These edges represent "uses concept" or "depends on" relationships.

Contextual Links

Links to related topics, historical background, or examples appear throughout articles. These represent "related to" or "see also" relationships.

Navigational Links

Links may be included to help readers navigate between articles on related subjects, even if the connection is not semantically strong.

Edge Directionality and Weight Calculation

Directed Graph Structure

The edge graph is fundamentally directed. An edge from A to B means "article A contains a hyperlink to article B." This does not imply that B links back to A. The directionality encodes information about conceptual dependencies and authorial choices in article construction.

Graph representation: G = (V, E) where:

V = set of article nodes

E ⊆ V × V = set of directed edges

e = (v_i, v_j) means v_i links to v_j

Bidirectional Link Detection

We identify bidirectional links by checking for reciprocal edges. If both A → B and B → A exist, we flag this as a bidirectional relationship. The weight calculation is:

Edge weight w(e):

w(v_i, v_j) = 1 if only v_i → v_j exists

w(v_i, v_j) = 2 if both v_i → v_j and v_j → v_i exist

Bidirectional links are semantically significant because they indicate mutual relevance. When two articles link to each other, it suggests:

  • Closely related topics (e.g., "Machine Learning" ↔ "Neural Networks")
  • Complementary perspectives (e.g., "World War I" ↔ "World War II")
  • Definitional circularity (e.g., "Force" ↔ "Newton's Laws of Motion")
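The weight rule can be expressed in a few lines of Python. This is a sketch under the assumption that edges are held as `(source, target)` tuples; the function name `edge_weights` is hypothetical.

```python
def edge_weights(edges):
    """Map each directed edge (source, target) to 1, or to 2 if the reverse edge exists."""
    edge_set = set(edges)  # set membership makes the reciprocal lookup O(1)
    return {
        (src, dst): 2 if (dst, src) in edge_set else 1
        for (src, dst) in edge_set
    }
```

For the edges A → B, B → A, and A → C, both A → B and B → A receive weight 2 while A → C receives weight 1.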

Visualization of Weight

Edge weight is encoded through visual properties to convey relationship strength at a glance:

Unidirectional Edges (weight = 1)

Rendered in a dark gray at low opacity so that they recede into the background, indicating a single directional reference.

Color: #404040 (dark gray)
Opacity: 0.15
Line width: 1.2px

Bidirectional Edges (weight = 2)

Rendered in a lighter gray at higher opacity to emphasize the mutual relationship.

Color: #606060 (medium gray)
Opacity: 0.35
Line width: 1.2px

Color Encoding System

The visualization employs a grayscale color scheme for edges to maintain visual clarity and avoid confusing edges with node colors (which represent clusters). Within that scheme, the shade of gray encodes edge weight: dark gray for unidirectional links and medium gray for bidirectional ones.

Standard Edge Colors

Dark Gray (#404040): Unidirectional links with low visual prominence. These form the background network structure.

Medium Gray (#606060): Bidirectional links with moderate prominence. These stand out as important connections without overwhelming the visualization.

Highlighted Edge Colors

When a user selects an article node, edges connected to that article receive special highlighting to aid in understanding the article's immediate neighborhood:

Selected Article Edges

All edges where the selected article is either the source or target are highlighted.

Color: #ffffff (white)
Opacity: 0.6
Line width: 2.0px

This highlighting reveals the selected article's link neighborhood, making it easy to trace which articles it references and which articles reference it.
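The three styles can be collected into one lookup, shown here as a Python sketch. The color, opacity, and width values are the ones listed in this document; the `EdgeStyle` class and `style_for` function are hypothetical names, and the rule that highlighting takes precedence over weight is an assumption.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class EdgeStyle:
    color: str
    opacity: float
    width_px: float


UNIDIRECTIONAL = EdgeStyle("#404040", 0.15, 1.2)  # weight = 1
BIDIRECTIONAL = EdgeStyle("#606060", 0.35, 1.2)   # weight = 2
HIGHLIGHTED = EdgeStyle("#ffffff", 0.6, 2.0)      # touches the selected article


def style_for(source, target, weight, selected=None):
    """Pick the style for one edge; highlighting is assumed to override weight."""
    if selected is not None and selected in (source, target):
        return HIGHLIGHTED
    return BIDIRECTIONAL if weight == 2 else UNIDIRECTIONAL
```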

Visual Design Rationale

The grayscale color scheme serves multiple purposes:

  • Avoids color overload: With nodes using color to represent clusters, using grayscale for edges prevents visual confusion and maintains clear figure-ground separation.
  • Emphasizes structure: The monochromatic palette directs attention to graph topology and connection patterns rather than individual edge colors.
  • Preserves hierarchy: The progression from dark gray at low opacity (weak connections) to lighter gray at higher opacity (strong connections) creates a natural visual hierarchy that matches semantic importance.

Edge Influence on Node Positioning

In this visualization, edges directly influence node positioning through the force-directed layout algorithm. This is the fundamental principle of spring-based graph layouts.

Graph-Based Positioning

The force-directed algorithm uses the Wikipedia link graph to position nodes in 3D space:

Attractive Forces (Edges)

Connected nodes (linked articles) are pulled together by spring-like attractive forces. The strength of attraction can be weighted by edge properties (e.g., bidirectional links create stronger attraction).

Repulsive Forces (All Node Pairs)

All nodes repel each other regardless of connections, preventing overlap and spreading the graph. The equilibrium between attractive and repulsive forces determines final positions.
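The interplay of the two forces can be illustrated with a small Fruchterman-Reingold-style loop in Python. This is a sketch only and may not match the layout engine the visualization actually uses: the force laws (attraction proportional to weight × distance² / k, repulsion proportional to k² / distance), the step size, the per-iteration movement cap, and the name `force_layout` are all assumptions chosen for illustration.

```python
import math
import random


def force_layout(nodes, weighted_edges, iterations=200, k=1.0, step=0.05, seed=0):
    """Return {node: [x, y, z]} after a fixed number of force iterations.

    weighted_edges maps one (a, b) entry per linked pair to its weight;
    k is the preferred edge length. All defaults are illustrative.
    """
    rng = random.Random(seed)
    pos = {n: [rng.uniform(-1, 1) for _ in range(3)] for n in nodes}
    for _ in range(iterations):
        force = {n: [0.0, 0.0, 0.0] for n in nodes}
        # Repulsion between every pair of nodes, magnitude k^2 / distance.
        for i, a in enumerate(nodes):
            for b in nodes[i + 1:]:
                delta = [pa - pb for pa, pb in zip(pos[a], pos[b])]
                dist = math.sqrt(sum(d * d for d in delta)) or 1e-9
                for axis in range(3):
                    push = delta[axis] / dist * (k * k / dist)
                    force[a][axis] += push
                    force[b][axis] -= push
        # Attraction along each edge, magnitude weight * distance^2 / k.
        for (a, b), weight in weighted_edges.items():
            delta = [pa - pb for pa, pb in zip(pos[a], pos[b])]
            dist = math.sqrt(sum(d * d for d in delta)) or 1e-9
            for axis in range(3):
                pull = delta[axis] / dist * (weight * dist * dist / k)
                force[a][axis] -= pull
                force[b][axis] += pull
        # Move each node a small, capped distance along its net force.
        for n in nodes:
            length = math.sqrt(sum(f * f for f in force[n])) or 1e-9
            move = min(length * step, 0.1)
            for axis in range(3):
                pos[n][axis] += force[n][axis] / length * move
    return pos
```

With two linked pairs and no links between them, each pair settles at a short distance set by its weight while the two pairs drift apart, which is the behavior the attractive and repulsive forces are meant to produce.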

Analytical Implications

Because positioning is based on the link graph, the spatial structure directly reflects Wikipedia's hyperlink network:

Densely Connected Communities

Articles with many mutual links will cluster tightly together, forming visible communities. These spatial clusters reflect how Wikipedia articles are actually interconnected.

Hub Articles at Center

Articles with many connections (high degree) tend toward the center because their neighbors pull them from many directions at once. Peripheral articles have fewer connections.

Cross-Topic Bridge Links

Long edges spanning between clusters represent cross-domain connections. These are articles that link between different topic areas, serving as conceptual bridges in Wikipedia's knowledge graph.

Comparison with Semantic-Based Layouts

This force-directed approach differs from semantic embedding-based layouts (like UMAP):

Force-Directed Layout (Used Here):

Position ← f(edges, repulsion)

Result:

- Connected nodes pulled together

- Position reflects link structure

- Spatial proximity = many shared connections

Semantic Layout (Alternative):

Position ← UMAP(semantic_embeddings)

Result:

- Semantically similar nodes positioned together

- Position reflects content similarity

- Spatial proximity = similar topics/words

The force-directed approach reveals how Wikipedia articles are actually connected through hyperlinks, rather than how similar their content might be. Both approaches have value for different analytical purposes.

Edge Filtering and Performance

In large graphs with thousands of articles, the number of edges can become substantial. To maintain visualization performance and readability, the system implements selective edge rendering.

Edge Density Management

The current implementation includes a maximum edge count threshold (2000 edges) for rendering. When the total number of edges exceeds this limit, edges are prioritized by weight:

if |E| > MAX_EDGES:
    E' = top(E, MAX_EDGES, key=weight)
    render(E')
else:
    render(E)

This ensures bidirectional links (weight = 2) are preferentially retained, as they represent the strongest relationships. Unidirectional links are included until the threshold is reached.
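A runnable version of this selection rule might look like the following Python sketch. It assumes edge weights are stored in a dictionary keyed by `(source, target)`; the function name `select_edges` is hypothetical, and how ties among equal-weight edges are broken in the real system is not specified here.

```python
import heapq

MAX_EDGES = 2000  # rendering threshold stated in the text


def select_edges(weights, max_edges=MAX_EDGES):
    """Return the edges to render, preferring higher weights when over the limit."""
    if len(weights) <= max_edges:
        return list(weights)
    # nlargest keeps the max_edges heaviest edges without sorting the full set.
    return heapq.nlargest(max_edges, weights, key=weights.get)
```

Using `heapq.nlargest` avoids a full sort when the edge set is much larger than the threshold.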

Cluster-Specific Edge Loading

When a user selects a specific cluster, the system loads edges relevant to that cluster:

  • All edges where both source and target are in the selected cluster (intra-cluster edges)
  • Edges connecting the selected cluster to other clusters (inter-cluster edges)

This focused loading improves both performance and analytical clarity, allowing users to study the internal structure of a topic area without visual noise from unrelated connections.
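The two edge categories can be separated with a single pass over the edge list, as in this Python sketch. It assumes a dictionary `cluster_of` that maps each article to its cluster identifier; that mapping and the function name `cluster_edges` are hypothetical.

```python
def cluster_edges(edges, cluster_of, selected_cluster):
    """Split edges touching the selected cluster into intra- and inter-cluster lists."""
    intra, inter = [], []
    for src, dst in edges:
        src_in = cluster_of.get(src) == selected_cluster
        dst_in = cluster_of.get(dst) == selected_cluster
        if src_in and dst_in:
            intra.append((src, dst))  # both endpoints inside the cluster
        elif src_in or dst_in:
            inter.append((src, dst))  # exactly one endpoint inside the cluster
    return intra, inter
```

Edges with neither endpoint in the selected cluster are dropped, which is what keeps unrelated connections out of the focused view.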