The Question

K-Means Clustering Implementation

Implement the K-Means clustering algorithm to partition a set of $N$ points into $K$ clusters. You are given: - points: A list of $D$-dimensional coordinates (as tuples). - k: The number of clusters to identify. - initial_centroids: A list of $K$ points representing the starting positions of the centroids. - max_iterations: The maximum number of assignment/update cycles to perform. Your algorithm should follow these steps: 1. Assignment: Assign each point to the cluster associated with the nearest centroid using squared Euclidean distance. 2. Update: Recompute the centroid of each cluster by taking the arithmetic mean of all points assigned to that cluster. If a cluster has no assigned points, its centroid remains unchanged for that iteration. 3. Convergence: If the centroids do not change between two consecutive iterations, terminate early. The final output should be a list of the $K$ cluster centroids, with each coordinate rounded to exactly 4 decimal places.

Python

K-Means Clustering

Euclidean Distance

Questions & Insights

Clarifying Questions

How should we handle empty clusters? If no points are assigned to a specific centroid during an iteration, should the centroid remain unchanged, or should we pick a random point to restart it? Assumption: The centroid will remain at its current position if no points are assigned to it.

What is the convergence criterion? Should the algorithm stop early if centroids do not change between iterations? Assumption: Yes, early exit is a standard optimization, though `max_iterations` acts as the hard upper bound.

Is the Euclidean distance used for "nearest" calculation?Assumption: Yes, standard L2 Euclidean distance will be used to determine point-to-centroid proximity.

What is the range of dimensions and number of points?Assumption: Points can be $D$-dimensional. The solution should handle arbitrary dimensions efficiently using basic Python primitives.

Thinking Process

Assignment Step: For every point in the dataset, calculate its distance to all

k

centroids. Assign the point to the cluster of the centroid with the minimum distance.

Update Step: For each cluster, calculate the arithmetic mean of all points assigned to it. This mean becomes the new centroid for the next iteration.

Convergence Check: Compare the new centroids with the previous ones. If they are identical (within a very small epsilon or exactly the same), stop the process before reaching max_iterations.

Formatting: After the loop finishes, round each coordinate of the final centroids to 4 decimal places as requested.

Implementation Breakdown

Problem Set

Functional Requirements:

Implement K-Means clustering.

Input: points (List[Tuple]), k (Int), initial_centroids (List[Tuple]), max_iterations (Int).

Output: Final centroids as a List of Tuples, rounded to 4 decimal places.

Constraints:

All points and centroids must have the same dimensionality

D

Perform a maximum of max_iterations.

k

is the number of clusters.

Approach

Algorithm: K-Means Clustering (Lloyd's Algorithm).

Data Structure: Lists and Tuples for point representation; Dictionaries/Lists for cluster grouping.

Complexity:

Time:

O(I \cdot N \cdot K \cdot D)

, where

I

s iterations,

N

is number of points,

K

is number of clusters, and

D

is dimensionality.

Space:

O(N \cdot D + K \cdot D)

to store points and centroid coordinates.

Implementation

Wrap Up

Advanced Topics

Initial Centroid Selection: While this problem provides initial_centroids, in practice, using K-Means++ significantly improves convergence speed and the quality of the final clusters by spreading out initial centroids.

Distance Metrics: While Euclidean (L2) is standard, other metrics like Manhattan (L1) or Cosine Similarity (Spherical K-Means) might be used depending on the nature of the data (e.g., text mining).

Scalability: For massive datasets that don't fit in memory, Mini-Batch K-Means can be used, which updates centroids using small random subsets of the data.

Parallelization: The "Assignment" step is embarrassingly parallel. One could use MapReduce or multi-threading to calculate distances for subsets of points across multiple CPU cores or nodes.