Edge Calculation and Visualization
Edge Generation from Wikipedia Links
Edges in this visualization represent hyperlink connections between Wikipedia articles. They are derived directly from Wikipedia's explicit link structure rather than from content similarity, so they show how articles actually reference one another.
Link Extraction Process
For each article in the visualization, we extract outgoing links using the following process:
- Parse the article's WikiText or HTML rendering to identify all internal links (links to other Wikipedia articles).
- Filter links to exclude non-content pages (templates, categories, files, special pages).
- Resolve redirects so that all links point to the canonical article page.
- Store links as directed edges in a graph database: if article A links to article B, we create an edge A → B.
- Deduplicate multiple links from the same source to the same target (if A links to B multiple times in the article text, we store only one edge).
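The steps above can be sketched roughly as follows. This is a minimal illustration, not the project's actual pipeline: the function name `extract_edges`, the `redirects` lookup table, the namespace list, and the regular expression are all assumptions made for the example, and real WikiText has cases (nested templates, leading-colon links) that this pattern does not handle.

```python
import re

# Namespace prefixes treated as non-content (illustrative, incomplete list).
NON_CONTENT_PREFIXES = ("template:", "category:", "file:", "special:")

# Matches [[Target]], [[Target|label]] and [[Target#Section|label]].
WIKILINK = re.compile(r"\[\[([^\[\]|#]+)(?:#[^\[\]|]*)?(?:\|[^\[\]]*)?\]\]")


def extract_edges(source, wikitext, redirects):
    """Return the deduplicated set of directed edges (source, target).

    `redirects` is a hypothetical dict mapping redirect titles to
    canonical article titles.
    """
    edges = set()  # a set stores each (source, target) pair only once
    for match in WIKILINK.finditer(wikitext):
        target = match.group(1).strip().replace("_", " ")
        if not target or target.lower().startswith(NON_CONTENT_PREFIXES):
            continue  # skip templates, categories, files, special pages
        target = redirects.get(target, target)  # resolve to canonical title
        edges.add((source, target))
    return edges
```

Using a set for the result covers the deduplication step: repeated links from A to B collapse into a single edge A → B.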
Link Semantics
Wikipedia's Manual of Style provides guidelines for linking, and these guidelines shape what an edge represents:
Definitional Links
The first occurrence of a technical term is typically linked to its definition. These edges represent "uses concept" or "depends on" relationships.
Contextual Links
Links to related topics, historical background, or examples appear throughout articles. These represent "related to" or "see also" relationships.
Navigational Links
Links may be included to help readers navigate between articles on related subjects, even if the connection is not semantically strong.
Edge Directionality and Weight Calculation
Directed Graph Structure
The edge graph is fundamentally directed. An edge from A to B means "article A contains a hyperlink to article B." This does not imply that B links back to A. The directionality encodes information about conceptual dependencies and authorial choices in article construction.
Graph representation: G = (V, E) where:
V = set of article nodes
E ⊆ V × V = set of directed edges
e = (v_i, v_j) means v_i links to v_j
Bidirectional Link Detection
We identify bidirectional links by checking for reciprocal edges. If both A → B and B → A exist, we flag this as a bidirectional relationship. The weight calculation is:
Edge weight w(e):
w(v_i, v_j) = 1 if only v_i → v_j exists
w(v_i, v_j) = 2 if both v_i → v_j and v_j → v_i exist
Bidirectional links are semantically significant because they indicate mutual relevance. When two articles link to each other, it suggests:
- Closely related topics (e.g., "Machine Learning" ↔ "Neural Networks")
- Complementary perspectives (e.g., "World War I" ↔ "World War II")
- Definitional circularity (e.g., "Force" ↔ "Newton's Laws of Motion")
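The weight rule above can be sketched in a few lines. The function name `edge_weights` and the input format (an iterable of `(source, target)` pairs) are assumptions made for illustration. The guard against self-links is also an assumption, since the text does not say how an article linking to itself is treated.

```python
def edge_weights(edges):
    """Map each directed edge to 2 if its reverse also exists, else 1."""
    edge_set = set(edges)
    return {
        # Assumption: a self-link (a == b) is not counted as bidirectional.
        (a, b): 2 if a != b and (b, a) in edge_set else 1
        for (a, b) in edge_set
    }
```

Note that a bidirectional relationship appears twice in the result, once per direction, and both entries carry weight 2.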
Visualization of Weight
Edge weight is encoded through visual properties to convey relationship strength at a glance:
Unidirectional Edges (weight = 1)
Rendered in dark gray at low opacity, so a single directional reference stays visually subdued.
Color: #404040 (dark gray)
Opacity: 0.15
Line width: 1.2px
Bidirectional Edges (weight = 2)
Rendered in a brighter medium gray at higher opacity to emphasize the mutual relationship. The line width is the same as for unidirectional edges.
Color: #606060 (medium gray)
Opacity: 0.35
Line width: 1.2px
Color Encoding System
The visualization employs a grayscale color scheme for edges to maintain visual clarity and avoid confusing edges with node colors (which represent clusters). Within that grayscale range, brightness and opacity encode edge weight and selection state.
Standard Edge Colors
Dark Gray (#404040): Unidirectional links with low visual prominence. These form the background network structure.
Medium Gray (#606060): Bidirectional links with moderate prominence. These stand out as important connections without overwhelming the visualization.
Highlighted Edge Colors
When a user selects an article node, edges connected to that article receive special highlighting to aid in understanding the article's immediate neighborhood:
Selected Article Edges
All edges where the selected article is either the source or target are highlighted.
Color: #ffffff (white)
Opacity: 0.6
Line width: 2.0px
This highlighting reveals the selected article's link neighborhood, making it easy to trace which articles it references and which articles reference it.
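The three edge styles listed in this document can be collected into one lookup, as in the sketch below. The color, opacity, and width values come from the text. The names `EdgeStyle` and `edge_style`, and the rule that highlighting takes precedence over weight, are assumptions for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class EdgeStyle:
    color: str
    opacity: float
    width_px: float


UNIDIRECTIONAL = EdgeStyle("#404040", 0.15, 1.2)  # weight = 1
BIDIRECTIONAL = EdgeStyle("#606060", 0.35, 1.2)   # weight = 2
HIGHLIGHTED = EdgeStyle("#ffffff", 0.6, 2.0)      # touches the selected node


def edge_style(source, target, weight, selected=None):
    """Pick the style for one edge; selection overrides weight (assumed)."""
    if selected is not None and selected in (source, target):
        return HIGHLIGHTED
    return BIDIRECTIONAL if weight == 2 else UNIDIRECTIONAL
```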
Visual Design Rationale
The grayscale color scheme serves multiple purposes:
- Avoids color overload: With nodes using color to represent clusters, using grayscale for edges prevents visual confusion and maintains clear figure-ground separation.
- Emphasizes structure: The monochromatic palette directs attention to graph topology and connection patterns rather than individual edge colors.
- Preserves hierarchy: The step from dim dark gray (unidirectional links) to brighter medium gray (bidirectional links) creates a natural visual hierarchy that matches relationship strength.
Edge Influence on Node Positioning
In this visualization, edges directly influence node positioning through the force-directed layout algorithm. This is the fundamental principle of spring-based graph layouts.
Graph-Based Positioning
The force-directed algorithm uses the Wikipedia link graph to position nodes in 3D space:
Attractive Forces (Edges)
Connected nodes (linked articles) are pulled together by spring-like attractive forces. The strength of attraction can be weighted by edge properties (e.g., bidirectional links create stronger attraction).
Repulsive Forces (All Node Pairs)
All nodes repel each other regardless of connections, preventing overlap and spreading the graph. The equilibrium between attractive and repulsive forces determines final positions.
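To make the interplay of the two forces concrete, here is a minimal single-iteration sketch in pure Python. It is not the layout engine used by the visualization: the function name `layout_step`, the linear spring, the inverse-square repulsion, and the constants `k_attract`, `k_repel`, and `step` are all assumptions chosen for illustration.

```python
import math


def layout_step(pos, weights, k_attract=0.05, k_repel=0.5, step=0.1):
    """Advance 3D positions by one step.

    pos: {node: [x, y, z]}; weights: {(source, target): weight}.
    """
    force = {n: [0.0, 0.0, 0.0] for n in pos}
    nodes = list(pos)
    # Repulsion between every pair of nodes, falling off with distance squared.
    for i, a in enumerate(nodes):
        for b in nodes[i + 1:]:
            delta = [pa - pb for pa, pb in zip(pos[a], pos[b])]
            dist = math.sqrt(sum(d * d for d in delta)) or 1e-9
            push = k_repel / (dist * dist)
            for axis in range(3):
                f = push * delta[axis] / dist
                force[a][axis] += f
                force[b][axis] -= f
    # Collapse A -> B and B -> A into one spring so it is not applied twice.
    pairs = {}
    for (a, b), w in weights.items():
        if a != b:
            key = (b, a) if (b, a) in pairs else (a, b)
            pairs[key] = max(w, pairs.get(key, 0))
    # Spring attraction along edges, scaled by edge weight.
    for (a, b), w in pairs.items():
        for axis in range(3):
            f = k_attract * w * (pos[b][axis] - pos[a][axis])
            force[a][axis] += f
            force[b][axis] -= f
    return {n: [p + step * f for p, f in zip(pos[n], force[n])] for n in pos}
```

Calling `layout_step` repeatedly until positions stop changing approximates the equilibrium described above. The pairwise repulsion loop is quadratic in the number of nodes, which is acceptable for a sketch but would need a spatial approximation for large graphs.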
Analytical Implications
Because positioning is based on the link graph, the spatial structure directly reflects Wikipedia's hyperlink network:
Densely Connected Communities
Articles with many mutual links will cluster tightly together, forming visible communities. These spatial clusters reflect how Wikipedia articles are actually interconnected.
Hub Articles at Center
Articles with many connections (high degree) tend toward the center due to being pulled by many attractive forces from different directions. Peripheral articles have fewer connections.
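Degree, as used here, can be computed directly from the edge list. The sketch below counts distinct neighbors per article and ignores direction, so a bidirectional pair counts once. That counting convention and the names `degrees` and `top_hubs` are assumptions, since the text does not define degree more precisely.

```python
def degrees(edges):
    """Count distinct neighbors per article, ignoring edge direction."""
    neighbors = {}
    for a, b in edges:
        if a == b:
            continue  # assumption: self-links do not add to degree
        neighbors.setdefault(a, set()).add(b)
        neighbors.setdefault(b, set()).add(a)
    return {node: len(adjacent) for node, adjacent in neighbors.items()}


def top_hubs(edges, n=5):
    """Return the n articles with the highest degree."""
    deg = degrees(edges)
    return sorted(deg, key=deg.get, reverse=True)[:n]
```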
Cross-Topic Bridge Links
Long edges spanning between clusters represent cross-domain connections. These are articles that link between different topic areas, serving as conceptual bridges in Wikipedia's knowledge graph.
Comparison with Semantic-Based Layouts
This force-directed approach differs from semantic embedding-based layouts (like UMAP):
Force-Directed Layout (Used Here):
Position ← f(edges, repulsion)
Result:
- Connected nodes pulled together
- Position reflects link structure
- Spatial proximity = many shared connections
Semantic Layout (Alternative):
Position ← UMAP(semantic_embeddings)
Result:
- Semantically similar nodes positioned together
- Position reflects content similarity
- Spatial proximity = similar topics/words
The force-directed approach reveals how Wikipedia articles are actually connected through hyperlinks, rather than how similar their content might be. Both approaches have value for different analytical purposes.
Edge Filtering and Performance
In large graphs with thousands of articles, the number of edges can become substantial. To maintain visualization performance and readability, the system implements selective edge rendering.
Edge Density Management
The current implementation includes a maximum edge count threshold (2000 edges) for rendering. When the total number of edges exceeds this limit, edges are prioritized by weight:
if |E| > MAX_EDGES:
    E' = top(E, MAX_EDGES, key=weight)
    render(E')
else:
    render(E)
This ensures bidirectional links (weight = 2) are preferentially retained, as they represent the strongest relationships. Unidirectional links are included until the threshold is reached.
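A runnable version of this thresholding might look like the sketch below. The 2000-edge limit comes from the text. The function name `select_edges` and the weight dictionary format are assumptions, and the order among edges of equal weight (here, input order) is not specified by the original.

```python
MAX_EDGES = 2000


def select_edges(weights, max_edges=MAX_EDGES):
    """Return the edges to render, highest weight first when over the limit.

    weights: {(source, target): weight}
    """
    if len(weights) <= max_edges:
        return list(weights)
    # sorted() is stable, so edges of equal weight keep their input order.
    ranked = sorted(weights, key=weights.get, reverse=True)
    return ranked[:max_edges]
```

Because every weight is either 1 or 2, this amounts to taking all bidirectional edges first and then filling the remaining slots with unidirectional ones.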
Cluster-Specific Edge Loading
When a user selects a specific cluster, the system loads edges relevant to that cluster:
- All edges where both source and target are in the selected cluster (intra-cluster edges)
- Edges connecting the selected cluster to other clusters (inter-cluster edges)
This focused loading improves both performance and analytical clarity, allowing users to study the internal structure of a topic area without visual noise from unrelated connections.
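The two edge categories loaded for a selected cluster can be separated with a single pass over the edge list. In this sketch the function name `cluster_edges` and the `cluster_of` mapping from article to cluster ID are assumptions for illustration.

```python
def cluster_edges(edges, cluster_of, selected_cluster):
    """Split edges touching the selected cluster into intra and inter lists.

    cluster_of: hypothetical dict mapping article title -> cluster ID.
    """
    intra, inter = [], []
    for a, b in edges:
        a_in = cluster_of.get(a) == selected_cluster
        b_in = cluster_of.get(b) == selected_cluster
        if a_in and b_in:
            intra.append((a, b))  # both endpoints inside the cluster
        elif a_in or b_in:
            inter.append((a, b))  # exactly one endpoint inside the cluster
    return intra, inter
```

Edges with neither endpoint in the selected cluster are dropped, which is what removes the unrelated connections from view.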