Cluster-then-predict for classification tasks (2023)

How to leverage unsupervised learning in your supervised learning problems

Supervised classification problems require a dataset with (a) a categorical dependent variable (the “target variable”) and (b) a set of independent variables (“features”) which may (or may not!) be useful in predicting the class. The modeling task is to learn a function mapping features and their values to a target class. Logistic regression is one example of a model that learns such a function.

Unsupervised learning takes a dataset with no labels and attempts to find some latent structure within the data. K-means is one such algorithm. In this article, I will show you how to increase your classifier’s performance by using k-means to discover latent “clusters” in your dataset, and then either use these clusters as a new feature or partition your dataset by cluster and train a separate classifier on each.

We begin by generating a synthetic dataset using sklearn’s make_classification utility. We will simulate a multi-class classification problem and generate 8 features for prediction.

from sklearn.datasets import make_classification

X, y = make_classification(n_samples=1000, n_features=8, n_informative=5, n_classes=4)

We now have a dataset of 1000 rows with 4 classes and 8 features, 5 of which are informative (the other 3 being random noise). We convert these to a pandas dataframe for easier manipulation.

import pandas as pd

df = pd.DataFrame(X, columns=['f{}'.format(i) for i in range(8)])

We can now divide our data into train and test sets with a 75/25 split.

from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(df, y, test_size=0.25, random_state=90210)

First, you will want to determine the optimal k for your dataset.

For the sake of brevity and so as not to distract from the purpose of this article, I refer the reader to this excellent tutorial: How to Determine the Optimal K for K-Means? should you want to read further on this matter.

In our case, because we used the make_classification utility, the parameter n_clusters_per_class is already set and defaults to 2. Therefore, we do not need to determine the optimal k; however, we do need to identify the clusters! We will use the following function to find the 2 clusters in the training set, then predict them for our test set.

import numpy as np
from sklearn.cluster import KMeans
from typing import Tuple

def get_clusters(X_train: pd.DataFrame, X_test: pd.DataFrame, n_clusters: int) -> Tuple[pd.DataFrame, pd.DataFrame]:
    """
    applies k-means clustering to training data to find clusters and predicts them for the test set
    """
    clustering = KMeans(n_clusters=n_clusters, random_state=8675309)
    clustering.fit(X_train)

    # apply the cluster labels to the training set
    train_labels = clustering.labels_
    X_train_clstrs = X_train.copy()
    X_train_clstrs['clusters'] = train_labels

    # predict labels on the test set
    test_labels = clustering.predict(X_test)
    X_test_clstrs = X_test.copy()
    X_test_clstrs['clusters'] = test_labels

    return X_train_clstrs, X_test_clstrs

X_train_clstrs, X_test_clstrs = get_clusters(X_train, X_test, 2)

We now have a new feature called “clusters” with a value of 0 or 1.
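As a quick sanity check (a small aside, using the frames returned by get_clusters above), we can look at how many rows fall into each cluster:

# quick look at cluster sizes in the train and test sets
# (exact counts will vary with the random seed)
print(X_train_clstrs['clusters'].value_counts())
print(X_test_clstrs['clusters'].value_counts())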


Before we fit any models, we need to scale our features so that they are all on the same numerical scale. With a linear model like logistic regression, the magnitude of the coefficients learned during training depends on the scale of the features: if some features range from 0 to 1 and others from, say, 0 to 100, the coefficients cannot be reliably compared.

To scale the features, we use the following function, which computes z-scores for each feature on the training set and applies the same transformation to the test set.

from sklearn.preprocessing import StandardScaler

def scale_features(X_train: pd.DataFrame, X_test: pd.DataFrame) -> Tuple[pd.DataFrame, pd.DataFrame]:
    """
    applies standard scaler (z-scores) to training data and applies the same scaling to the test set
    """
    scaler = StandardScaler()
    to_scale = [col for col in X_train.columns.values]
    scaler.fit(X_train[to_scale])
    X_train[to_scale] = scaler.transform(X_train[to_scale])

    # apply the fitted scaler to the test set
    X_test[to_scale] = scaler.transform(X_test[to_scale])

    return X_train, X_test

X_train_scaled, X_test_scaled = scale_features(X_train_clstrs, X_test_clstrs)

We are now ready to run some experiments!

I chose to use Logistic Regression for this problem because it is extremely fast and inspection of the coefficients allows one to quickly assess feature importance.
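As an aside, here is a minimal sketch of that kind of coefficient inspection, assuming you have a fitted LogisticRegression at hand (for example, one refit on the base dataset created below); the helper name is mine, not part of the experiment:

import pandas as pd

# sketch: rank features by mean absolute coefficient across classes
# assumes `fitted_model` is a fitted multiclass LogisticRegression and
# `feature_names` lists the columns it was trained on
def rank_features(fitted_model, feature_names):
    # coef_ has shape (n_classes, n_features) in the multiclass case
    coefs = pd.DataFrame(fitted_model.coef_, columns=feature_names)
    return coefs.abs().mean().sort_values(ascending=False)

# example usage (after fitting a model on the base dataset built below):
# rank_features(clf, X_train_base.columns)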

To run our experiments, we will build a logistic regression model on 4 datasets:

  1. Dataset with no clustering information (base)
  2. Dataset with “clusters” as a feature (cluster-feature)
  3. Dataset for df[“clusters”] == 0 (cluster-0)
  4. Dataset for df[“clusters”] == 1 (cluster-1)

Our study is a 1x4 between-groups design with dataset [base, cluster-feature, cluster-0, cluster-1] as the only factor. The following code creates our datasets.

# to divide the df by cluster, we need to keep the class labels aligned with the rows; we'll use pandas for that
train_clusters = X_train_scaled.copy()
test_clusters = X_test_scaled.copy()
train_clusters['y'] = y_train
test_clusters['y'] = y_test
# locate the "0" cluster
train_0 = train_clusters.loc[train_clusters.clusters < 0] # after scaling, the "0" cluster labels became negative
test_0 = test_clusters.loc[test_clusters.clusters < 0]
y_train_0 = train_0.y.values
y_test_0 = test_0.y.values
# locate the "1" cluster
train_1 = train_clusters.loc[train_clusters.clusters > 0] # after scaling, the "1" cluster labels are still positive
test_1 = test_clusters.loc[test_clusters.clusters > 0]
y_train_1 = train_1.y.values
y_test_1 = test_1.y.values
# the base dataset has no "clusters" feature
X_train_base = X_train_scaled.drop(columns=['clusters'])
X_test_base = X_test_scaled.drop(columns=['clusters'])
# drop the targets from the training set
X_train_0 = train_0.drop(columns=['y'])
X_test_0 = test_0.drop(columns=['y'])
X_train_1 = train_1.drop(columns=['y'])
X_test_1 = test_1.drop(columns=['y'])
datasets = {
    'base': (X_train_base, y_train, X_test_base, y_test),
    'cluster-feature': (X_train_scaled, y_train, X_test_scaled, y_test),
    'cluster-0': (X_train_0, y_train_0, X_test_0, y_test_0),
    'cluster-1': (X_train_1, y_train_1, X_test_1, y_test_1),
}

To efficiently run our experiments, we’ll use the following function, which loops through the 4 datasets and runs 5-fold cross-validation on each. For each dataset, we obtain 5 estimates of the classifier’s accuracy, weighted precision, weighted recall, and weighted F1; we will plot these to observe general performance. We then obtain a classification report from each model on its respective test set to evaluate fine-grained performance.

from sklearn.linear_model import LogisticRegression
from sklearn import model_selection
from sklearn.metrics import classification_report

def run_exps(datasets: dict) -> pd.DataFrame:
    '''
    runs experiments on a dict of datasets
    '''
    # initialize a logistic regression classifier
    model = LogisticRegression(class_weight='balanced', solver='lbfgs', random_state=999, max_iter=250)

    dfs = []
    results = []
    conditions = []
    scoring = ['accuracy', 'precision_weighted', 'recall_weighted', 'f1_weighted']

    for condition, splits in datasets.items():
        X_train = splits[0]
        y_train = splits[1]
        X_test = splits[2]
        y_test = splits[3]

        kfold = model_selection.KFold(n_splits=5, shuffle=True, random_state=90210)
        cv_results = model_selection.cross_validate(model, X_train, y_train, cv=kfold, scoring=scoring)
        clf = model.fit(X_train, y_train)
        y_pred = clf.predict(X_test)
        print(condition)
        print(classification_report(y_test, y_pred))

        results.append(cv_results)
        conditions.append(condition)
        this_df = pd.DataFrame(cv_results)
        this_df['condition'] = condition
        dfs.append(this_df)

    final = pd.concat(dfs, ignore_index=True)

    # we have wide-format data; use pd.melt to convert it to long format
    results_long = pd.melt(final, id_vars=['condition'], var_name='metrics', value_name='values')

    # drop the fit/score time metrics, we don't need these
    time_metrics = ['fit_time', 'score_time']
    results = results_long[~results_long['metrics'].isin(time_metrics)]
    results = results.sort_values(by='values')

    return results

df = run_exps(datasets)

Let’s plot our results and see how each dataset affected classifier performance.

import matplotlib
import matplotlib.pyplot as plt
import seaborn as sns
plt.figure(figsize=(20, 12))
sns.set(font_scale=2.5)
g = sns.boxplot(x="condition", y="values", hue="metrics", data=df, palette="Set3")
plt.legend(bbox_to_anchor=(1.05, 1), loc=2, borderaxespad=0.)
plt.title('Comparison of Dataset by Classification Metric')
pd.pivot_table(df, index='condition',columns=['metrics'],values=['values'], aggfunc='mean')

In general, it appears that our “base” dataset, with no clustering information, creates the worst performing classifier. By adding our binary “clusters” as a feature, we see a modest boost to performance; however, when we fit a model on each cluster, we see the largest boost in performance.

When we look at classification reports for fine-grained performance evaluation, the picture becomes very clear: when the datasets are segmented by cluster, we see a large boost to performance.

base
precision recall f1-score support

0 0.48 0.31 0.38 64
1 0.59 0.59 0.59 71
2 0.42 0.66 0.51 50
3 0.59 0.52 0.55 65

accuracy 0.52 250
macro avg 0.52 0.52 0.51 250
weighted avg 0.53 0.52 0.51 250

cluster-feature
precision recall f1-score support

0 0.43 0.36 0.39 64
1 0.59 0.62 0.60 71
2 0.40 0.56 0.47 50
3 0.57 0.45 0.50 65

accuracy 0.50 250
macro avg 0.50 0.50 0.49 250
weighted avg 0.50 0.50 0.49 250

cluster-0
precision recall f1-score support

0 0.57 0.41 0.48 29
1 0.68 0.87 0.76 30
2 0.39 0.45 0.42 20
3 0.73 0.66 0.69 29

accuracy 0.61 108
macro avg 0.59 0.60 0.59 108
weighted avg 0.61 0.61 0.60 108

cluster-1
precision recall f1-score support

0 0.41 0.34 0.38 35
1 0.54 0.46 0.50 41
2 0.49 0.70 0.58 30
3 0.60 0.58 0.59 36

accuracy 0.51 142
macro avg 0.51 0.52 0.51 142
weighted avg 0.51 0.51 0.51 142

Consider the class “0”; the F1 scores across the four datasets are:

  • Base — “0” F1: 0.38
  • Cluster-feature — “0” F1: 0.39
  • Cluster-0 — “0” F1: 0.48
  • Cluster-1 — “0” F1: 0.38

For the “0” class, the model trained on the cluster-0 dataset shows a ~23% relative improvement in F1 score over the next-best condition (0.48 vs. 0.39).

In this article, I have shown how you can leverage “cluster-then-predict” for your classification problems and have teased some results suggesting that this technique can boost performance. There is still much more that can be done in terms of cluster creation and evaluation of the results.

In our case, we had a dataset with 2 clusters; in your problems, however, you may have many more clusters to find, once you have determined the optimal k (for example, with the elbow method) on your dataset.
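For reference, a minimal sketch of the elbow method might look like the following (X_train stands in for your feature matrix; the “right” k is read off where the inertia curve flattens):

from sklearn.cluster import KMeans
import matplotlib.pyplot as plt

# elbow method sketch: fit k-means for a range of k and plot the inertia
# (within-cluster sum of squares); look for the "elbow" where the curve flattens
inertias = []
ks = range(1, 11)
for k in ks:
    km = KMeans(n_clusters=k, random_state=8675309)
    km.fit(X_train)
    inertias.append(km.inertia_)

plt.plot(list(ks), inertias, marker='o')
plt.xlabel('k')
plt.ylabel('inertia')
plt.title('Elbow plot for choosing k')
plt.show()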

In the case of k>2, you can treat the “clusters” feature as a categorical variable and apply one-hot encoding to use them in your model. As k increases, you may run into issues of overfitting should you decide to fit a model for each cluster.
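As a sketch of that encoding step (assuming a k>2 “clusters” column produced the same way as above; pd.get_dummies is one simple option):

# one-hot encode the integer "clusters" feature before fitting the classifier
X_train_ohe = pd.get_dummies(X_train_clstrs, columns=['clusters'], prefix='cluster')
X_test_ohe = pd.get_dummies(X_test_clstrs, columns=['clusters'], prefix='cluster')

# align the test columns with the train columns in case a cluster is missing from one split
X_test_ohe = X_test_ohe.reindex(columns=X_train_ohe.columns, fill_value=0)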

If you find that K-Means is not increasing the performance of your classifier, perhaps your data is better suited for another clustering algorithm — see this article for an introduction to Hierarchical Clustering on imbalanced datasets.
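If you want to try that swap, a rough sketch with scikit-learn’s AgglomerativeClustering could look like this. Note that it has no predict method, so one option (an assumption on my part, not the only way) is to assign test rows to the nearest training-cluster centroid:

import numpy as np
from sklearn.cluster import AgglomerativeClustering

# hierarchical clustering on the training set; nearest-centroid assignment for the test set
agg = AgglomerativeClustering(n_clusters=2)
train_labels = agg.fit_predict(X_train)

# compute a centroid per cluster from the training data
centroids = np.vstack([X_train[train_labels == c].mean(axis=0) for c in np.unique(train_labels)])

# assign each test row to the closest centroid
dists = np.linalg.norm(X_test.values[:, None, :] - centroids[None, :, :], axis=2)
test_labels = dists.argmin(axis=1)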

As with all data science problems, experiment, experiment, experiment! Run tests for different techniques and let the data guide your modeling decisions.

References

Data structures for statistical computing in Python, McKinney, Proceedings of the 9th Python in Science Conference, Volume 445, 2010.

The pandas development team. pandas-dev/pandas: Pandas. Zenodo, 2020. https://doi.org/10.5281/zenodo.3509134

Harris, C.R., Millman, K.J., van der Walt, S.J. et al. Array programming with NumPy. Nature 585, 357–362 (2020). DOI: 10.1038/s41586-020-2649-2.

Scikit-learn: Machine Learning in Python, Pedregosa et al., JMLR 12, pp. 2825–2830, 2011.

J. D. Hunter, “Matplotlib: A 2D Graphics Environment”, Computing in Science & Engineering, vol. 9, no. 3, pp. 90–95, 2007.

Waskom, M. L., (2021). seaborn: statistical data visualization. Journal of Open Source Software, 6(60), 3021, https://doi.org/10.21105/joss.03021
