Identifying the optimal number of clusters for K-Means using the elbow method

This post explains how to use the elbow method to determine an optimal number of clusters for a given dataset with the K-Means clustering algorithm.

As per Wikipedia, “The correct choice of k is often ambiguous, with interpretations depending on the shape and scale of the distribution of points in a data set and the desired clustering resolution of the user. In addition, increasing k without penalty will always reduce the amount of error in the resulting clustering, to the extreme case of zero error if each data point is considered its own cluster (i.e., when k equals the number of data points, n). Intuitively then, the optimal choice of k will strike a balance between maximum compression of the data using a single cluster, and maximum accuracy by assigning each data point to its own cluster. If an appropriate value of k is not apparent from prior knowledge of the properties of the data set, it must be chosen somehow. There are several categories of methods for making this decision.”
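To make the quote's extreme case concrete, here is a minimal sketch (on toy data assumed for illustration, separate from the example that follows) showing that the inertia shrinks as k grows and hits zero once every point is its own cluster:

# Sketch of the quote's extreme case (toy data, not the post's dataset):
# inertia drops as k grows and reaches zero when k equals the number of points.
import numpy as np
from sklearn.cluster import KMeans

pts = np.random.RandomState(0).rand(10, 2)  # 10 random 2-D points
for k in (1, 5, 10):
    km = KMeans(n_clusters=k, n_init=10, random_state=0).fit(pts)
    print(k, km.inertia_)  # inertia_ is ~0 once k == len(pts)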

The example below uses the elbow method to find the right choice of K for randomly generated data.

import numpy as np
import pandas as pd
from sklearn import metrics, preprocessing
from sklearn.cluster import KMeans
import matplotlib.pyplot as plt
%matplotlib inline

The function below generates some random data in four well-separated bands:

def random_numbers():
    # Generate 100 (X1, X2) points in four well-separated bands of 25 each.
    X1 = []
    X2 = []
    np.random.seed(1)  # fix the seed so the same numbers are generated every run
    for i in range(0, 25):
        X1.append(i)
        X2.append(np.random.randint(100))       # band 1: X2 in [0, 100)

    for i in range(25, 50):
        X1.append(i)
        X2.append(np.random.randint(200, 300))  # band 2: X2 in [200, 300)

    for i in range(50, 75):
        X1.append(i)
        X2.append(np.random.randint(400, 500))  # band 3: X2 in [400, 500)

    for i in range(75, 100):
        X1.append(i)
        X2.append(np.random.randint(600, 700))  # band 4: X2 in [600, 700)

    return [X1, X2]

Data Normalization

def normalize(data):
    # Min-max scaling: map values into the [0, 1] range.
    maxs = np.max(data)
    mins = np.min(data)
    diff = maxs - mins
    data = (data - mins) / diff
    return data
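The normalize helper is plain min-max scaling; as a side note, the same result could be obtained with MinMaxScaler from the preprocessing module imported at the top (a sketch, assuming the data is reshaped to a single column):

# Sketch: the same min-max scaling via scikit-learn's MinMaxScaler.
from sklearn.preprocessing import MinMaxScaler

x = np.array([2.0, 4.0, 6.0, 10.0])
scaled = MinMaxScaler().fit_transform(x.reshape(-1, 1)).ravel()
print(scaled)  # [0.   0.25 0.5  1.  ], same as (x - x.min()) / (x.max() - x.min())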
def plot(X1, X2):
    # Scatter plot of the normalized data.
    plt.scatter(X1, X2)
    plt.xlabel('X1', fontweight="bold", fontsize=15)
    plt.ylabel('X2', fontweight="bold", fontsize=15)
    plt.title("Scatter Data", fontweight="bold", fontsize=20)
    plt.show()

Elbow Method

def elbow_method(data):
    # Fit K-Means for k = 1..9 and record the inertia (within-cluster
    # sum of squared distances) for each k.
    cluster = []
    for i in range(1, 10):
        kmeans = KMeans(n_clusters=i, init="k-means++", max_iter=1000, n_init=10, random_state=0)
        kmeans.fit(data)
        cluster.append(kmeans.inertia_)
    return cluster

def plot_cluster(cluster):
    # Plot the error curve; the "elbow" marks the candidate choice of k.
    plt.plot(range(1, 10), cluster)
    plt.xlabel('No of Clusters', fontweight="bold", fontsize=15)
    plt.ylabel('Sum of Squared Errors', fontweight="bold", fontsize=15)
    plt.title("Elbow Method using scikit-learn", fontweight="bold", fontsize=20)
    plt.show()
def sklearn_model(data):
    # Fit the final model with the chosen k = 4 and plot the clusters
    # together with their centroids.
    kmeans = KMeans(n_clusters=4, init="k-means++", max_iter=1000, n_init=10, random_state=0)
    kmeans.fit(data)
    labels = kmeans.labels_  # cluster assignment for each point
    centroids = kmeans.cluster_centers_
    plt.scatter(data[:, 0], data[:, 1], c=labels, cmap="rainbow")
    plt.scatter(centroids[:, 0], centroids[:, 1], marker="*", c="black")
    plt.xlabel('X1', fontweight="bold", fontsize=15)
    plt.ylabel('X2', fontweight="bold", fontsize=15)
    plt.title("K-Means Clusters using scikit-learn", fontweight="bold", fontsize=20)
    plt.show()
if __name__ == "__main__":
    X1, X2 = random_numbers()
    X1 = np.array(X1).astype(float)
    X2 = np.array(X2).astype(float)
    X1 = normalize(X1)
    X2 = normalize(X2)
    plot(X1, X2)
    data = np.array(list(zip(X1, X2)))  # stack into an (n, 2) feature matrix
    cluster = elbow_method(data)
    plot_cluster(cluster)
    sklearn_model(data)

Randomly generated scatter data

Choice of K using elbow method

From the elbow plot above, the choice of K for the given data could be 2 or 4. The better choice is 4, since the sum of squared errors has essentially converged by that point and barely decreases afterwards.
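The elbow can also be located programmatically instead of by eye. One simple heuristic (an assumption here, not something the original code does) is to pick the k where the error curve bends most sharply, i.e. the largest second difference:

# Sketch: estimate the elbow as the k with the largest second difference
# (discrete curvature) of the SSE curve; a heuristic, not a definitive rule.
def find_elbow(cluster):
    sse = np.array(cluster)
    second_diff = sse[:-2] - 2 * sse[1:-1] + sse[2:]
    return int(np.argmax(second_diff)) + 2  # +2 because k starts at 1

# Usage: find_elbow(elbow_method(data)) returns the suggested k.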

Clustering using the K-Means algorithm

The plot above shows the data clustered into 4 separate groups, each drawn in a different color with its centroid marked by a black star. Visually, clustering into 4 groups is more meaningful for this data than grouping into 2.
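As an optional numeric check (a sketch, not part of the original walkthrough), the metrics module imported at the top could compare silhouette scores for the two candidate values of K; a higher score means better-separated clusters:

# Sketch: compare silhouette scores for k = 2 and k = 4 (higher is better).
from sklearn.metrics import silhouette_score

for k in (2, 4):
    km = KMeans(n_clusters=k, n_init=10, random_state=0).fit(data)
    print(k, silhouette_score(data, km.labels_))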

References:

  1. https://scikit-learn.org/
  2. https://stackoverflow.com/
  3. https://en.wikipedia.org/wiki/K-means_clustering
  4. https://en.wikipedia.org/wiki/Cluster_analysis
  5. https://en.wikipedia.org/wiki/Determining_the_number_of_clusters_in_a_data_set
