Cluster Generation and Analysis
Louvain Clustering Algorithm
Clusters in this visualization are identified with the Louvain algorithm, a community detection method that finds groups of densely connected nodes in the Wikipedia link graph. The algorithm optimizes modularity to uncover natural thematic communities.
Why Louvain?
The Louvain algorithm is specifically designed for community detection in networks. Unlike k-means or other distance-based clustering methods, Louvain operates directly on the graph structure, identifying communities based on link density. Key advantages:
- No predetermined k: The algorithm discovers the natural number of clusters by optimizing modularity rather than requiring a fixed cluster count.
- Graph-aware: Uses the Wikipedia link structure directly, identifying thematic communities based on how articles reference each other.
- Efficient: Scales well to large graphs, handling hundreds of thousands of nodes and edges with reasonable computation time.
- Hierarchical structure: Can reveal both broad topic areas and specific subtopics through its multi-level optimization process.
Algorithm Phases
Phase 1: Initialize Communities
The algorithm begins by assigning each article to its own community. This creates an initial state where there are as many communities as there are nodes in the graph.
Initial state:
Community(article_i) = i for all articles
Number of communities = Number of articles
From this starting point, the algorithm will iteratively merge communities to maximize modularity, a measure of how well the network is divided into communities.
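Phase 1 can be sketched in a few lines. This is an illustrative helper, not the project's actual code; the function and variable names are assumptions.

```python
# Phase 1 sketch: every article starts in its own singleton community.
def init_communities(articles):
    """Map each article to its own community id."""
    return {article: idx for idx, article in enumerate(articles)}

articles = ["Physics", "Chemistry", "Football"]
communities = init_communities(articles)
# One community per node: {'Physics': 0, 'Chemistry': 1, 'Football': 2}
```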
Phase 2: Modularity Optimization (First Pass)
The algorithm iterates through all articles in random order. For each article, it evaluates the modularity gain of moving it to each of its neighbors' communities:
For each article i:
1. Remove i from its current community C_old
2. For each neighbor j of i:
Calculate ΔQ if i joins community C_j
3. Move i to community with maximum ΔQ (if positive)
Repeat until no more moves improve modularity
Modularity (Q) measures the quality of the community structure. It compares the actual number of edges within communities to the expected number in a random graph:
Q = (1/2m) Σ [A_ij - (k_i * k_j)/2m] δ(c_i, c_j)
where:
- m = total number of edges
- A_ij = adjacency matrix entry (1 if i and j are linked, else 0)
- k_i, k_j = degree of nodes i and j
- δ(c_i, c_j) = 1 if i and j in same community
This phase continues until no single article move can improve the overall modularity, achieving a local optimum.
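The modularity formula and the local-moving pass above can be combined into a minimal, unoptimized sketch. It assumes an unweighted undirected graph stored as `{node: set(neighbors)}` with at least one edge; production implementations compute the modularity gain ΔQ incrementally, whereas this sketch simply recomputes Q after each trial move, which is equivalent but much slower. All names here are illustrative.

```python
import random

def modularity(graph, comm):
    """Q = (1/2m) * sum_ij [A_ij - k_i*k_j/2m] * delta(c_i, c_j)."""
    m = sum(len(nbrs) for nbrs in graph.values()) / 2   # total edges
    deg = {v: len(nbrs) for v, nbrs in graph.items()}   # node degrees
    q = 0.0
    for i in graph:
        for j in graph:
            if comm[i] != comm[j]:
                continue                                # delta = 0
            a_ij = 1.0 if j in graph[i] else 0.0
            q += a_ij - deg[i] * deg[j] / (2 * m)
    return q / (2 * m)

def louvain_first_pass(graph, seed=42):
    """Phases 1-2: singleton init, then greedy local moving."""
    rng = random.Random(seed)                    # fixed seed => reproducible
    comm = {v: i for i, v in enumerate(graph)}   # Phase 1: singletons
    moved = True
    while moved:
        moved = False
        nodes = list(graph)
        rng.shuffle(nodes)                       # random node order
        for i in nodes:
            old = comm[i]
            best_c, best_q = old, modularity(graph, comm)
            for j in graph[i]:                   # neighbor communities only
                if comm[j] == best_c:
                    continue
                comm[i] = comm[j]                # trial move
                q = modularity(graph, comm)
                if q > best_q + 1e-12:
                    best_c, best_q = comm[j], q
            comm[i] = best_c                     # keep the best community
            moved = moved or best_c != old
    return comm

# Toy graph: two disconnected triangles.
g = {"a": {"b", "c"}, "b": {"a", "c"}, "c": {"a", "b"},
     "d": {"e", "f"}, "e": {"d", "f"}, "f": {"d", "e"}}
comm = louvain_first_pass(g)   # each triangle becomes one community
```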
Phase 3: Community Aggregation
After the first pass, the algorithm builds a new graph where each community becomes a single "super-node":
Build aggregated graph G':
- Each community C becomes node n_C
- Edge weight between n_C and n_D = sum of edges between C and D
- Self-loop weight for n_C = sum of internal edges in C
This aggregation preserves the modularity value while reducing the number of nodes to process. The resulting graph represents communities and their interconnections.
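The aggregation rules above can be sketched as follows, assuming a weighted undirected adjacency representation `{node: {neighbor: weight}}` in which each edge is stored from both endpoints; the names are illustrative, not the project's code.

```python
from collections import defaultdict

def aggregate(graph, comm):
    """Phase 3: collapse each community into a super-node. Cross-community
    edge weights are summed; internal edges become a self-loop."""
    agg = defaultdict(lambda: defaultdict(float))
    for i, nbrs in graph.items():
        for j, w in nbrs.items():
            agg[comm[i]][comm[j]] += w
    # Each internal edge was visited from both endpoints, doubling the
    # self-loop; halve it so it equals the sum of internal edge weights.
    for c in list(agg):
        agg[c][c] /= 2
    return {c: dict(nbrs) for c, nbrs in agg.items()}

# Two triangles joined by one bridge edge (c-d), all weights 1.
g = {"a": {"b": 1, "c": 1}, "b": {"a": 1, "c": 1},
     "c": {"a": 1, "b": 1, "d": 1},
     "d": {"c": 1, "e": 1, "f": 1},
     "e": {"d": 1, "f": 1}, "f": {"d": 1, "e": 1}}
comm = {"a": 0, "b": 0, "c": 0, "d": 1, "e": 1, "f": 1}
agg = aggregate(g, comm)
# Self-loops of weight 3 (each triangle's internal edges), bridge weight 1.
```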
Phase 4: Iteration
The algorithm repeats Phases 2 and 3 on the aggregated graph, continuing to merge communities at increasingly coarse levels:
Repeat until convergence:
1. Run modularity optimization (Phase 2) on G'
2. Build new aggregated graph G''
3. Continue until no improvement in modularity
The algorithm terminates when a pass through all nodes produces no modularity improvement. At this point, the hierarchy of community structures has been fully explored.
The final clustering uses the partition from the finest level (first aggregation), which typically provides the most useful granularity for visualization. This results in communities of densely interconnected articles that correspond to coherent topics in the Wikipedia graph.
Parameter Configuration
random_state = 42
A random seed that ensures deterministic results across multiple runs. Without a fixed seed, the Louvain algorithm would produce different community assignments each time due to the random order of node processing.
Setting random_state to 42 ensures reproducibility, allowing consistent cluster assignments for the same graph structure.
resolution = 1.0
Controls the size of detected communities. Values greater than 1.0 favor smaller communities, while values less than 1.0 produce larger communities. The default value of 1.0 provides a balanced trade-off.
This parameter effectively tunes the granularity of the clustering, allowing adjustment between broad topic areas and specific subtopics.
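The resolution parameter enters the modularity formula as a multiplier on the null-model term, a common generalization: Q(γ) = (1/2m) Σ [A_ij − γ·k_i·k_j/2m] δ(c_i, c_j). The sketch below (illustrative names, unweighted `{node: set(neighbors)}` graph) shows that for a fixed partition, raising γ lowers Q, which is why the optimizer compensates by forming smaller communities.

```python
def modularity(graph, comm, resolution=1.0):
    """Resolution-scaled modularity: higher `resolution` penalizes the
    expected-edges (null model) term more strongly."""
    m = sum(len(nbrs) for nbrs in graph.values()) / 2
    deg = {v: len(nbrs) for v, nbrs in graph.items()}
    q = 0.0
    for i in graph:
        for j in graph:
            if comm[i] != comm[j]:
                continue
            a_ij = 1.0 if j in graph[i] else 0.0
            q += a_ij - resolution * deg[i] * deg[j] / (2 * m)
    return q / (2 * m)

# Two disconnected triangles, partitioned into the two triangles:
g = {"a": {"b", "c"}, "b": {"a", "c"}, "c": {"a", "b"},
     "d": {"e", "f"}, "e": {"d", "f"}, "f": {"d", "e"}}
part = {"a": 0, "b": 0, "c": 0, "d": 1, "e": 1, "f": 1}
# Q falls as resolution rises: 0.75 at 0.5, 0.5 at 1.0, 0.0 at 2.0.
```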
Input: Graph structure
The Louvain algorithm operates directly on the Wikipedia link graph, where edges represent hyperlinks between articles. It does not use spatial coordinates or embeddings.
This graph-based approach identifies communities based on link density, revealing how articles are actually connected rather than how similar their content might be.
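A minimal sketch of the input structure: an adjacency representation built from hyperlink pairs. The real pipeline presumably extracts these pairs from Wikipedia data; the helper, treating links as undirected (as Louvain typically requires), and the sample data are illustrative assumptions.

```python
from collections import defaultdict

def build_graph(links):
    """Build an undirected adjacency structure from (source, target)
    hyperlink pairs."""
    graph = defaultdict(set)
    for src, dst in links:
        graph[src].add(dst)
        graph[dst].add(src)   # treat hyperlinks as undirected edges
    return dict(graph)

links = [("Physics", "Quantum mechanics"), ("Physics", "Isaac Newton")]
g = build_graph(links)   # 3 nodes, 2 undirected edges
```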
Cluster Influence on Node Positioning
Clusters do not influence node positioning. This is a critical aspect of the visualization's methodology and distinguishes it from many graph visualization approaches.
Clustering as Post-Processing
The clustering process occurs after the force-directed algorithm has determined all article positions. The workflow is:
1. Build graph: articles + Wikipedia links
2. Apply force-directed layout: graph structure → 3D positions
3. Run Louvain: graph structure → cluster labels
4. Render: positions (from step 2) + colors (from step 3)
Louvain receives the graph structure (nodes and edges) and returns cluster assignments. It does not modify positions. This separation ensures that clustering reveals community structure in the link graph rather than imposing structure through position adjustment. Note that both positioning and clustering use the same underlying graph, but for different purposes: force-directed layout uses it for spatial arrangement, while Louvain uses it for community detection.
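The data flow above can be sketched with two independent functions that consume the same graph. Both bodies here are deliberate stand-ins, not the project's implementation: the layout stub produces deterministic hash-based coordinates in place of the force-directed algorithm, and connected components stand in for Louvain. The point is the shape of the pipeline: clustering never reads or writes positions.

```python
import hashlib

def layout(graph):
    """Stand-in for the force-directed layout: graph -> 3D positions."""
    def coord(name, axis):
        h = hashlib.sha256(f"{name}:{axis}".encode()).digest()
        return int.from_bytes(h[:4], "big") / 2**32
    return {v: (coord(v, 0), coord(v, 1), coord(v, 2)) for v in graph}

def cluster(graph):
    """Stand-in for Louvain (connected components here): graph -> labels,
    computed without ever touching positions."""
    labels, next_label = {}, 0
    for start in graph:
        if start in labels:
            continue
        stack = [start]
        while stack:
            v = stack.pop()
            if v in labels:
                continue
            labels[v] = next_label
            stack.extend(graph[v])
        next_label += 1
    return labels

g = {"a": {"b"}, "b": {"a"}, "c": set()}
positions = layout(g)   # step 2: graph -> positions
labels = cluster(g)     # step 3: graph -> labels; positions are untouched
```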
Advantages of Position-Independent Clustering
This design choice provides several analytical benefits:
Stable Positions
Article positions remain fixed regardless of cluster assignments. If clustering parameters change or clusters are recomputed, articles stay in the same locations. This stability aids in spatial memory and repeated analysis.
Independent Validation
Because positions are determined by the force-directed layout independently of clustering, the spatial grouping you observe is a genuine reflection of the link structure rather than an artifact of cluster-driven layout algorithms.
Emergent Structure
Clusters emerge from the data's inherent structure. If you observe tight spatial groupings, this reflects genuine coherence in the link graph rather than algorithmic forcing of proximity.
Cluster Naming and Descriptions
After clusters are identified, they are automatically assigned human-readable names and descriptions using a large language model (GPT-4).
Naming Process
- Extract a representative sample of article titles from the cluster (typically 10-20 articles)
- Optionally include article snippets or summaries for additional context
- Construct a prompt asking the LLM to identify the unifying theme or topic area
- The LLM generates a concise name (2-5 words) and a descriptive paragraph explaining the cluster's scope
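The prompt-construction step can be sketched as below. `call_llm` is a hypothetical stand-in for whatever GPT-4 client the pipeline actually uses; only the prompt assembly is shown, and the sample size and wording are assumptions.

```python
def build_naming_prompt(titles, sample_size=15):
    """Build an LLM prompt from a sample of a cluster's article titles."""
    sample = titles[:sample_size]            # representative sample
    listing = "\n".join(f"- {t}" for t in sample)
    return (
        "The following Wikipedia articles belong to one cluster:\n"
        f"{listing}\n\n"
        "Identify the unifying theme. Reply with a concise name "
        "(2-5 words) and one descriptive paragraph explaining the "
        "cluster's scope."
    )

prompt = build_naming_prompt(["Neuron", "Synapse", "Action potential"])
# name, description = call_llm(prompt)   # hypothetical GPT-4 call
```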
Cluster Metadata
Each cluster stores the following information:
Cluster metadata:
- cluster_id: integer identifier
- name: auto-generated descriptive name
- description: paragraph explaining scope
- article_count: number of member articles
- centroid: (x, y, z) geometric center of members
- bounding_box: spatial extent of cluster
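The spatial fields in the metadata above can be computed directly from member positions. A small sketch with illustrative names:

```python
def cluster_spatial_stats(positions):
    """positions: list of (x, y, z) tuples for one cluster's members.
    Returns the count, geometric centroid, and per-axis bounding box."""
    n = len(positions)
    centroid = tuple(sum(p[axis] for p in positions) / n for axis in range(3))
    bounding_box = tuple(
        (min(p[axis] for p in positions), max(p[axis] for p in positions))
        for axis in range(3)
    )
    return {"article_count": n, "centroid": centroid,
            "bounding_box": bounding_box}

stats = cluster_spatial_stats([(0, 0, 0), (2, 4, 6)])
# centroid (1.0, 2.0, 3.0); bounding_box ((0, 2), (0, 4), (0, 6))
```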
Interpreting Cluster Structure
Cluster Size Variation
Clusters vary significantly in size, reflecting Wikipedia's uneven coverage:
- Large clusters (100+ articles): Represent major topic areas with extensive Wikipedia coverage (e.g., scientific disciplines, major historical periods, geographic regions)
- Medium clusters (20-100 articles): Represent well-developed subtopics or specialized domains
- Small clusters (5-20 articles): Represent niche topics or emerging areas with limited coverage
Cluster Separation
The spatial separation between clusters indicates semantic distance:
- Adjacent clusters: Cover related topics or share conceptual overlap (e.g., "Biology" and "Medicine")
- Distant clusters: Represent fundamentally different domains (e.g., "Mathematics" and "Sports")
Noise Points
Articles classified as noise (cluster_id = -1) are outliers that do not belong to any dense community. Note that Louvain itself assigns every node to some community, so noise labels necessarily come from a separate post-processing step (for example, filtering out very small or weakly connected communities). Noise articles typically fall into these categories:
- Highly specific articles that don't fit broader topic patterns
- Interdisciplinary articles that blend multiple domains
- Short or stub articles with too few links to anchor them clearly in any community