Cluster Generation and Analysis
Louvain Clustering Algorithm
Clusters in this visualization are identified with the Louvain algorithm, a community detection method that finds groups of densely connected nodes in the Wikipedia link graph. The algorithm optimizes modularity to uncover natural thematic communities.
Why Louvain?
The Louvain algorithm is specifically designed for community detection in networks. Unlike k-means or other distance-based clustering methods, Louvain operates directly on the graph structure, identifying communities based on link density. Key advantages:
- No predetermined k: The algorithm discovers the natural number of clusters by optimizing modularity rather than requiring a fixed cluster count.
- Graph-aware: Uses the Wikipedia link structure directly, identifying thematic communities based on how articles reference each other.
- Efficient: Scales well to large graphs, handling hundreds of thousands of nodes and edges with reasonable computation time.
- Hierarchical structure: Can reveal both broad topic areas and specific subtopics through its multi-level optimization process.
Algorithm Phases
Phase 1: Initialize Communities
The algorithm begins by assigning each article to its own community. This creates an initial state where there are as many communities as there are nodes in the graph.
Initial state:
Community(article_i) = i for all articles
Number of communities = Number of articles
From this starting point, the algorithm will iteratively merge communities to maximize modularity, a measure of how well the network is divided into communities.
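Phase 1 can be sketched in a few lines. This is an illustrative helper, not the project's actual code; the function and variable names are assumptions.

```python
# Phase 1 sketch: every article starts in its own singleton community.
def init_communities(articles):
    """Map each article to its own community id."""
    return {article: idx for idx, article in enumerate(articles)}

articles = ["Physics", "Chemistry", "Football"]
communities = init_communities(articles)
# One community per node: {'Physics': 0, 'Chemistry': 1, 'Football': 2}
```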
Phase 2: Modularity Optimization (First Pass)
The algorithm iterates through all articles in random order. For each article, it evaluates the modularity gain of moving it to each of its neighbors' communities:
For each article i:
1. Remove i from its current community C_old
2. For each neighbor j of i:
Calculate ΔQ if i joins community C_j
3. Move i to community with maximum ΔQ (if positive)
Repeat until no more moves improve modularity
Modularity (Q) measures the quality of the community structure. It compares the actual number of edges within communities to the expected number in a random graph:
Q = (1/2m) Σ [A_ij - (k_i * k_j)/2m] δ(c_i, c_j)
where:
- m = total number of edges
- A_ij = adjacency matrix entry (1 if i and j are linked, else 0)
- k_i, k_j = degree of nodes i and j
- δ(c_i, c_j) = 1 if i and j in same community
This phase continues until no single article move can improve the overall modularity, achieving a local optimum.
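The modularity formula and the local-moving pass above can be combined into a minimal, unoptimized sketch. It assumes an unweighted undirected graph stored as `{node: set(neighbors)}` with at least one edge; production implementations compute the modularity gain ΔQ incrementally, whereas this sketch simply recomputes Q after each trial move, which is equivalent but much slower. All names here are illustrative.

```python
import random

def modularity(graph, comm):
    """Q = (1/2m) * sum_ij [A_ij - k_i*k_j/2m] * delta(c_i, c_j)."""
    m = sum(len(nbrs) for nbrs in graph.values()) / 2   # total edges
    deg = {v: len(nbrs) for v, nbrs in graph.items()}   # node degrees
    q = 0.0
    for i in graph:
        for j in graph:
            if comm[i] != comm[j]:
                continue                                # delta = 0
            a_ij = 1.0 if j in graph[i] else 0.0
            q += a_ij - deg[i] * deg[j] / (2 * m)
    return q / (2 * m)

def louvain_first_pass(graph, seed=42):
    """Phases 1-2: singleton init, then greedy local moving."""
    rng = random.Random(seed)                    # fixed seed => reproducible
    comm = {v: i for i, v in enumerate(graph)}   # Phase 1: singletons
    moved = True
    while moved:
        moved = False
        nodes = list(graph)
        rng.shuffle(nodes)                       # random node order
        for i in nodes:
            old = comm[i]
            best_c, best_q = old, modularity(graph, comm)
            for j in graph[i]:                   # neighbor communities only
                if comm[j] == best_c:
                    continue
                comm[i] = comm[j]                # trial move
                q = modularity(graph, comm)
                if q > best_q + 1e-12:
                    best_c, best_q = comm[j], q
            comm[i] = best_c                     # keep the best community
            moved = moved or best_c != old
    return comm

# Toy graph: two disconnected triangles.
g = {"a": {"b", "c"}, "b": {"a", "c"}, "c": {"a", "b"},
     "d": {"e", "f"}, "e": {"d", "f"}, "f": {"d", "e"}}
comm = louvain_first_pass(g)   # each triangle becomes one community
```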
Phase 3: Community Aggregation
After the first pass, the algorithm builds a new graph where each community becomes a single "super-node":
Build aggregated graph G':
- Each community C becomes node n_C
- Edge weight between n_C and n_D = sum of edges between C and D
- Self-loop weight for n_C = sum of internal edges in C
This aggregation preserves the modularity value while reducing the number of nodes to process. The resulting graph represents communities and their interconnections.
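The aggregation rules above can be sketched as follows, assuming a weighted undirected adjacency representation `{node: {neighbor: weight}}` in which each edge is stored from both endpoints; the names are illustrative, not the project's code.

```python
from collections import defaultdict

def aggregate(graph, comm):
    """Phase 3: collapse each community into a super-node. Cross-community
    edge weights are summed; internal edges become a self-loop."""
    agg = defaultdict(lambda: defaultdict(float))
    for i, nbrs in graph.items():
        for j, w in nbrs.items():
            agg[comm[i]][comm[j]] += w
    # Each internal edge was visited from both endpoints, doubling the
    # self-loop; halve it so it equals the sum of internal edge weights.
    for c in list(agg):
        agg[c][c] /= 2
    return {c: dict(nbrs) for c, nbrs in agg.items()}

# Two triangles joined by one bridge edge (c-d), all weights 1.
g = {"a": {"b": 1, "c": 1}, "b": {"a": 1, "c": 1},
     "c": {"a": 1, "b": 1, "d": 1},
     "d": {"c": 1, "e": 1, "f": 1},
     "e": {"d": 1, "f": 1}, "f": {"d": 1, "e": 1}}
comm = {"a": 0, "b": 0, "c": 0, "d": 1, "e": 1, "f": 1}
agg = aggregate(g, comm)
# Self-loops of weight 3 (each triangle's internal edges), bridge weight 1.
```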
Phase 4: Iteration
The algorithm repeats Phases 2 and 3 on the aggregated graph, continuing to merge communities at increasingly coarse levels:
Repeat until convergence:
1. Run modularity optimization (Phase 2) on G'
2. Build new aggregated graph G''
3. Continue until no improvement in modularity
The algorithm terminates when a pass through all nodes produces no modularity improvement. At this point, the hierarchy of community structures has been fully explored.
The final clustering uses the partition from the finest level (first aggregation), which typically provides the most useful granularity for visualization. This results in communities of densely interconnected articles that correspond to coherent topics in the Wikipedia graph.
Parameter Configuration
random_state = 42
A random seed that ensures deterministic results across multiple runs. Without a fixed seed, the Louvain algorithm would produce different community assignments each time due to the random order of node processing.
Setting random_state to 42 ensures reproducibility, allowing consistent cluster assignments for the same graph structure.
resolution = 1.0
Controls the size of detected communities. Values greater than 1.0 favor smaller communities, while values less than 1.0 produce larger communities. The default value of 1.0 provides a balanced trade-off.
This parameter effectively tunes the granularity of the clustering, allowing adjustment between broad topic areas and specific subtopics.
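The resolution parameter enters the modularity formula as a multiplier on the null-model term, a common generalization: Q(γ) = (1/2m) Σ [A_ij − γ·k_i·k_j/2m] δ(c_i, c_j). The sketch below (illustrative names, unweighted `{node: set(neighbors)}` graph) shows that for a fixed partition, raising γ lowers Q, which is why the optimizer compensates by forming smaller communities.

```python
def modularity(graph, comm, resolution=1.0):
    """Resolution-scaled modularity: higher `resolution` penalizes the
    expected-edges (null model) term more strongly."""
    m = sum(len(nbrs) for nbrs in graph.values()) / 2
    deg = {v: len(nbrs) for v, nbrs in graph.items()}
    q = 0.0
    for i in graph:
        for j in graph:
            if comm[i] != comm[j]:
                continue
            a_ij = 1.0 if j in graph[i] else 0.0
            q += a_ij - resolution * deg[i] * deg[j] / (2 * m)
    return q / (2 * m)

# Two disconnected triangles, partitioned into the two triangles:
g = {"a": {"b", "c"}, "b": {"a", "c"}, "c": {"a", "b"},
     "d": {"e", "f"}, "e": {"d", "f"}, "f": {"d", "e"}}
part = {"a": 0, "b": 0, "c": 0, "d": 1, "e": 1, "f": 1}
# Q falls as resolution rises: 0.75 at 0.5, 0.5 at 1.0, 0.0 at 2.0.
```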
Input: Graph structure
The Louvain algorithm operates directly on the Wikipedia link graph, where edges represent hyperlinks between articles. It does not use spatial coordinates or embeddings.
This graph-based approach identifies communities based on link density, revealing how articles are actually connected rather than how similar their content might be.
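A minimal sketch of the input structure: an adjacency representation built from hyperlink pairs. The real pipeline presumably extracts these pairs from Wikipedia data; the helper, treating links as undirected (as Louvain typically requires), and the sample data are illustrative assumptions.

```python
from collections import defaultdict

def build_graph(links):
    """Build an undirected adjacency structure from (source, target)
    hyperlink pairs."""
    graph = defaultdict(set)
    for src, dst in links:
        graph[src].add(dst)
        graph[dst].add(src)   # treat hyperlinks as undirected edges
    return dict(graph)

links = [("Physics", "Quantum mechanics"), ("Physics", "Isaac Newton")]
g = build_graph(links)   # 3 nodes, 2 undirected edges
```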
Cluster Influence on Node Positioning
Clusters do not influence node positioning. This is a critical aspect of the visualization's methodology and distinguishes it from many graph visualization approaches.
Clustering as Post-Processing
The clustering process occurs after the force-directed algorithm has determined all article positions. The workflow is:
1. Build graph: articles + Wikipedia links
2. Apply force-directed layout: graph structure → 3D positions
3. Run Louvain: graph structure → cluster labels
4. Render: positions (from step 2) + colors (from step 3)
Louvain receives the graph structure (nodes and edges) and returns cluster assignments. It does not modify positions. This separation ensures that clustering reveals community structure in the link graph rather than imposing structure through position adjustment. Note that both positioning and clustering use the same underlying graph, but for different purposes: force-directed layout uses it for spatial arrangement, while Louvain uses it for community detection.
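The data flow above can be sketched with two independent functions that consume the same graph. Both bodies here are deliberate stand-ins, not the project's implementation: the layout stub produces deterministic hash-based coordinates in place of the force-directed algorithm, and connected components stand in for Louvain. The point is the shape of the pipeline: clustering never reads or writes positions.

```python
import hashlib

def layout(graph):
    """Stand-in for the force-directed layout: graph -> 3D positions."""
    def coord(name, axis):
        h = hashlib.sha256(f"{name}:{axis}".encode()).digest()
        return int.from_bytes(h[:4], "big") / 2**32
    return {v: (coord(v, 0), coord(v, 1), coord(v, 2)) for v in graph}

def cluster(graph):
    """Stand-in for Louvain (connected components here): graph -> labels,
    computed without ever touching positions."""
    labels, next_label = {}, 0
    for start in graph:
        if start in labels:
            continue
        stack = [start]
        while stack:
            v = stack.pop()
            if v in labels:
                continue
            labels[v] = next_label
            stack.extend(graph[v])
        next_label += 1
    return labels

g = {"a": {"b"}, "b": {"a"}, "c": set()}
positions = layout(g)   # step 2: graph -> positions
labels = cluster(g)     # step 3: graph -> labels; positions are untouched
```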
Advantages of Position-Independent Clustering
This design choice provides several analytical benefits:
Stable Positions
Article positions remain fixed regardless of cluster assignments. If clustering parameters change or clusters are recomputed, articles stay in the same locations. This stability aids in spatial memory and repeated analysis.
Independent Validation
Because positions are determined by the force-directed layout independently of clustering, the spatial grouping you observe is a genuine reflection of the link structure rather than an artifact of cluster-driven layout algorithms.
Emergent Structure
Clusters emerge from the data's inherent structure. If you observe tight spatial groupings, this reflects genuine coherence in the link graph rather than algorithmic forcing of proximity.
Cluster Naming and Descriptions
After clusters are identified, they are automatically assigned human-readable names and descriptions using a large language model (GPT-4).
Naming Process
- Extract a representative sample of article titles from the cluster (typically 10-20 articles)
- Optionally include article snippets or summaries for additional context
- Construct a prompt asking the LLM to identify the unifying theme or topic area
- The LLM generates a concise name (2-5 words) and a descriptive paragraph explaining the cluster's scope
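The prompt-construction step can be sketched as below. `call_llm` is a hypothetical stand-in for whatever GPT-4 client the pipeline actually uses; only the prompt assembly is shown, and the sample size and wording are assumptions.

```python
def build_naming_prompt(titles, sample_size=15):
    """Build an LLM prompt from a sample of a cluster's article titles."""
    sample = titles[:sample_size]            # representative sample
    listing = "\n".join(f"- {t}" for t in sample)
    return (
        "The following Wikipedia articles belong to one cluster:\n"
        f"{listing}\n\n"
        "Identify the unifying theme. Reply with a concise name "
        "(2-5 words) and one descriptive paragraph explaining the "
        "cluster's scope."
    )

prompt = build_naming_prompt(["Neuron", "Synapse", "Action potential"])
# name, description = call_llm(prompt)   # hypothetical GPT-4 call
```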
Cluster Metadata
Each cluster stores the following information:
Cluster metadata:
- cluster_id: integer identifier
- name: auto-generated descriptive name
- description: paragraph explaining scope
- article_count: number of member articles
- centroid: (x, y, z) geometric center of members
- bounding_box: spatial extent of cluster
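The spatial fields in the metadata above can be computed directly from member positions. A small sketch with illustrative names:

```python
def cluster_spatial_stats(positions):
    """positions: list of (x, y, z) tuples for one cluster's members.
    Returns the count, geometric centroid, and per-axis bounding box."""
    n = len(positions)
    centroid = tuple(sum(p[axis] for p in positions) / n for axis in range(3))
    bounding_box = tuple(
        (min(p[axis] for p in positions), max(p[axis] for p in positions))
        for axis in range(3)
    )
    return {"article_count": n, "centroid": centroid,
            "bounding_box": bounding_box}

stats = cluster_spatial_stats([(0, 0, 0), (2, 4, 6)])
# centroid (1.0, 2.0, 3.0); bounding_box ((0, 2), (0, 4), (0, 6))
```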
Interpreting Cluster Structure
Cluster Size Variation
Clusters vary significantly in size, reflecting Wikipedia's uneven coverage:
- Large clusters (100+ articles): Represent major topic areas with extensive Wikipedia coverage (e.g., scientific disciplines, major historical periods, geographic regions)
- Medium clusters (20-100 articles): Represent well-developed subtopics or specialized domains
- Small clusters (5-20 articles): Represent niche topics or emerging areas with limited coverage
Cluster Separation
The spatial separation between clusters indicates semantic distance:
- Adjacent clusters: Cover related topics or share conceptual overlap (e.g., "Biology" and "Medicine")
- Distant clusters: Represent fundamentally different domains (e.g., "Mathematics" and "Sports")
Noise Points
Articles classified as noise (cluster_id = -1) are outliers that do not belong to any dense community. Note that Louvain itself assigns every node to some community, so noise labels necessarily come from a separate post-processing step (for example, filtering out very small or weakly connected communities). Noise articles typically fall into these categories:
- Highly specific articles that don't fit broader topic patterns
- Interdisciplinary articles that blend multiple domains
- Short or stub articles with too few links to anchor them clearly in any community