On this page, we thoroughly introduce the fuzzy c-means clustering algorithm, a soft clustering method.
So far we have learned various clustering methods, all of which are classified as hard clustering, in the sense that each observation is assigned to exactly one cluster in a clear-cut manner. On the other side of the spectrum, soft clustering methods work in a more flexible manner: they predict the probability that a data point belongs to each cluster, as opposed to giving definite cluster assignments.
One might opt for soft clustering over hard clustering when the clusters could be ambiguous, with observations plausibly belonging to more than one cluster. By assigning a probability of each observation belonging to each cluster, the analyst can identify points that are almost certainly in one cluster, as well as points that might belong to another cluster. In a critical setting such as cancer diagnosis, these probabilities can inform further tests and checks. One major downside of soft clustering is that it tends to take longer than hard clustering, given the additional computational work needed.
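To make the distinction concrete, here is a toy illustration in R (the numbers are invented purely for illustration):
# hard clustering: a single label per observation
hard_labels <- c(obs1 = 2, obs2 = 1)

# soft clustering: a membership probability for every cluster
# (each row sums to 1; obs1 is mostly, but not certainly, in cluster 2)
soft_memberships <- rbind(
  obs1 = c(0.10, 0.75, 0.15),
  obs2 = c(0.80, 0.15, 0.05)
)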
Fuzzy c-means is \(k\)-means, but with more than one cluster per observation.
Fuzzy c-means is similar to the \(k\)-means clustering method that we discussed in class, in that clusters are formed to minimize the distance between a cluster’s centroid and observations in that cluster.
However, rather than explicitly assigning an observation to a given cluster, this method assigns a probability that any given observation is in every single cluster. Observations in the center of a cluster are assigned a very high probability of belonging to that cluster, while observations further out are assigned a lower probability.
Fuzzy c-means is…
There are three components to the fuzzy c-means clustering:
We start by randomly assigning the \(n\) observations to \(j\) clusters. Then we can proceed with the steps below (Ghosh and Dubey 2013):
A graphical description of the approach, which is perhaps easier to follow, is as follows:
After each iteration, the weighted distances from each observation to the cluster centroids decrease, as we classify closer points as more probably belonging to a given cluster.
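To make these steps concrete, here is a minimal sketch of a single fuzzy c-means iteration in base R, using the standard update rules. The data matrix X (observations in rows), membership matrix U (rows summing to 1), fuzzifier m, and the function name are our own illustrative choices, not code from any package:
fcm_iteration <- function(X, U, m = 2) {
  # centroids: membership-weighted means of the observations
  W <- U^m
  centers <- t(W) %*% X / colSums(W)
  # squared euclidean distance from each observation to each centroid
  d2 <- sapply(seq_len(nrow(centers)), function(k) {
    rowSums(sweep(X, 2, centers[k, ])^2)
  })
  d2 <- pmax(d2, .Machine$double.eps)
  # updated memberships: proportional to d2^(-1/(m-1)), rescaled to sum to 1
  U_new <- 1 / d2^(1 / (m - 1))
  U_new <- U_new / rowSums(U_new)
  list(centers = centers, membership = U_new)
}
Iterating these two updates until the memberships stop changing is, in essence, the whole algorithm.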
As with \(k\)-means, the number of clusters \(k\) affects the resulting clustering. Unlike hierarchical clustering, \(k\) must be specified in advance; see the section Ideal Number of Clusters below for a full discussion of how to choose \(k\) a priori (that is, without reference to the actual underlying features of the dataset).
The hyperparameter \(m\) is a number greater than 1 giving the degree of fuzziness. Without prior knowledge of the dataset, it is usually set to 2. As \(m \rightarrow 1\), the membership probabilities converge to either 0 or 1, resulting in what is effectively a \(k\)-means clustering.
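As a quick illustration of this limiting behavior, using the standard membership update with made-up squared distances of 1 and 4 to two centroids (the helper function below is ours):
membership <- function(d2, m) {
  u <- 1 / d2^(1 / (m - 1))
  u / sum(u)
}
membership(c(1, 4), m = 2)    # 0.80 0.20 -- genuinely fuzzy
membership(c(1, 4), m = 1.1)  # ~1.00 ~0.00 -- effectively a hard assignment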
The number of iterations can also be controlled, either explicitly by setting iter.max, or by specifying a sensitivity threshold \(\varepsilon\) and stopping when the change in probabilities is less than \(\varepsilon\).
We will demonstrate fuzzy c-means clustering on the iris, juice, and market datasets.
# libraries
library(tidyverse)
# pipes
library(magrittr)
# cmeans
library(e1071)
library(advclust)
# visualize
library(factoextra)
library(ggplot2)
library(corrplot)
# fonts for ggplot
library(showtext)
# ISLR for data
library(ISLR)
knitr::opts_chunk$set(
  # Retina!
  fig.retina = 4,
  # use the showtext package for figures
  fig.showtext = TRUE,
  # output suppression
  # we oppose other forms of suppression
  # such as voter suppression
  message = FALSE,
  warning = FALSE,
  # cache = TRUE,
  # hold off results until end of chunk
  results = "hold"
)

# seed
set.seed(42)
# ggplot fonts
# if knitting:
if (isTRUE(getOption("knitr.in.progress"))) {
# add Lato from Google Fonts
font_add_google("Lato", "Lato")
# set font for base R for corrplot
par(family = "Lato")
}
The iris dataset was collected by Edgar Anderson in 1935. It comes with R and its predecessor S. From ?iris:
This famous (Fisher’s or Anderson’s) iris dataset gives the measurements in centimeters of the variables sepal length and width and petal length and width, respectively, for 50 flowers from each of 3 species of iris. The species are Iris setosa, versicolor, and virginica.
Four of the five variables of iris are numerical; the fifth is the outcome variable, Species. We will use the four quantitative variables to cluster observations into their respective species and use the outcome variable to check our model against the truth, recognizing that in many clustering scenarios we would not be able to check our results.
This dataset is a good choice for our purposes because it is relatively small and easy to understand and it contains quantitative predictors.
# iris data comes with base R
iris <- iris
The juice dataset comes from the 1998 book Business Analysis Using Regression, via the package ISLR. From ?ISLR::OJ:
The data contains 1070 purchases where the customer either purchased Citrus Hill or Minute Maid Orange Juice. A number of characteristics of the customer and product are recorded.
juice identifies customer brand loyalty with LoyalCH. The other variables identify the prices, discounts, place of purchase, etc. for the CH and MM juice brands. Most of the 18 variables in juice are quantitative.
Since fuzzy c-means depends on distance calculations, we will use only
the quantitative variables. After attempting to cluster observations as
either Citrus Hill or Minute Maid Orange Juice, we will be able to check
our results against the ground truth, recognizing that in many
clustering scenarios we would not be able to check our results.
We chose to use this dataset because it contains a variety of quantitative predictors, and because we were interested in applying our method to a real-world scenario.
# orange juice data comes with the ISLR package
data(OJ)
The market dataset is from Kaggle and is attributed to Dr. Omar Romero-Hernandez of UC Berkeley.
The dataset documents the following:
- Customer information, such as date of birth, income, number of children.
- Spending preference, such as the amount spent on wine, fruits, and fish.
- Purchase data, such as the number of purchases made online, through a catalog, or on the phone.
This dataset contains a mix of numerical and categorical variables. Since fuzzy c-means depends on distance calculations, we will use only the quantitative variables, attempting to classify observations into different types, or personalities, of spenders. There is no way to check this classification against reality.
We chose to use this dataset because it contains quantitative predictors and we thought it would be neat to try our method on a dataset with no outcome, since that is the scenario in which clustering techniques are typically used.
# marketing data
market <- read_delim(
  "marketing_campaign.csv",
  delim = ";",
  escape_double = FALSE,
  trim_ws = TRUE
)
First, since fuzzy c-means clusters using Euclidean distance, we must normalize our data. Otherwise, variables with large values would be overemphasized.
iris.scaled <- iris %>%
  select(-5) %>%
  scale()
We proceed to fit the dataset using cmeans() from the e1071 package.
iris_cmeans <- e1071::cmeans(
  # dataset
  iris.scaled,
  # number of clusters
  centers = 3,
  # type of distance to use: euclidean / manhattan
  dist = "euclidean",
  # degree of fuzzification; 2 is the default
  m = 2,
  # maximum number of iterations
  iter.max = 100
)
The fviz_cluster() function (from the factoextra package) can plot observations on the dataset's first two principal components. Alternatively, we can use choose.vars to specify which (original) variables to plot on the x and y axes.
The ellipses shown are drawn at the 0.70 normal-probability level, meaning that, if a cluster's points followed a bivariate normal distribution, roughly 70% of its observations would fall within its ellipse.
factoextra::fviz_cluster(
  # list object with data and cluster components
  list(
    # the data
    data = iris.scaled,
    # the cluster assignments
    cluster = iris_cmeans$cluster
  ),
  # for a confidence ellipse, use "confidence"
  ellipse.type = "norm",
  # size of the concentration ellipse in normal probability
  # see ?ggplot2::stat_ellipse for details
  ellipse.level = .7
) +
  labs(
    title = "Cluster plot",
    subtitle = "For the iris dataset, using fuzzy c-means"
  ) +
  theme(text = element_text(family = "Lato", size = 14))
Normally, clustering is used when there is no outcome variable, and it is thus unsupervised. Since we do have an outcome variable, we can go beyond fuzzy c-means and evaluate our performance as follows:
table(iris_cmeans$cluster, iris$Species)
##
## setosa versicolor virginica
## 1 50 0 0
## 2 0 39 13
## 3 0 11 37
Our model performed pretty well! One of our clusters aligns perfectly with setosa, and our other clusters contain the vast majority of just one type of flower.
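One rough way to put a single number on this table is to label each cluster by its majority species and compute the resulting agreement rate; a sketch:
tab <- table(iris_cmeans$cluster, iris$Species)
# proportion of observations falling in their cluster's majority species
sum(apply(tab, 1, max)) / sum(tab)
# (50 + 39 + 37) / 150 = 0.84 for the table above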
In most scenarios when we would actually want to use clustering, we would not have an outcome variable like the species of flower. In this section, we illustrate how we might identify the ideal number of clusters for fuzzy c-means clustering.
In \(k\)-means clustering, we often do this with the elbow method: we run a for loop to fit models with sequentially larger values of \(k\), plot the within-cluster sum of squares against \(k\), and select the \(k\) at the elbow point. There are also other methods, such as the average silhouette method and the gap statistic method.
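For reference, a standard \(k\)-means elbow loop looks something like the following sketch (using kmeans() from base R, which we also time later on this page):
# within-cluster sum of squares for k = 1, ..., 10
wss <- sapply(1:10, function(k) {
  kmeans(iris.scaled, centers = k, nstart = 20)$tot.withinss
})
plot(1:10, wss,
  type = "b",
  xlab = "Number of clusters",
  ylab = "Total within-cluster sum of squares"
)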
Weighted average distance vs. average WCSS
We formulated our own way of extending the elbow method to a fuzzy context, as described below.
First, we show how to fit a fuzzy c-means model with the advclust package. We choose this package because it automatically outputs the distance from each observation to the centroid of each cluster. We could have calculated this information ourselves and used the other package, but this was easier, and we thought it was advantageous to know how to do fuzzy c-means with two different implementations.
x <- advclust::fuzzy.CM(
  # the data
  iris.scaled,
  # how many clusters?
  K = 3,
  # fuzziness, as above
  m = 2,
  # we specify both stopping criteria here:
  # maximum number of iterations
  max.iteration = 100,
  # convergence threshold
  threshold = 1e-5,
  # random seed
  RandomNumber = 42
)
Recall that the elbow method minimizes the within-cluster sum of squares. Our extension to the fuzzy setting is to weight the distance from each observation to each cluster centroid by its membership probability, and then minimize that weighted sum of distances. To demonstrate how that would work:
# Extract distance from each observation to each cluster
l <- x@distance
# Extract probabilities of belonging to each cluster
ll <- t(x@member)

# Calculate the weighted distance
# From center for each observation
# (weighting euclidean distance with probability of belonging)
# Then sum them up
distance <- l %*% ll
r <- rowSums(distance)
sum(r)
## [1] 154857.5
To do this for a variety of numbers of clusters, we run a for loop that computes the weighted average distance to the centroids for each number of clusters.
# list to save distances
wag <- c()
for (i in 1:20) {
  x1 <- fuzzy.CM(iris.scaled,
    K = i,
    m = 2,
    max.iteration = 100,
    threshold = 1e-5,
    RandomNumber = 42,
    print.result = 0
  )
  # average distance
  dis <- x1@distance %*% t(x1@member) / (450)
  # sum and print
  # print(sum(diag(dis)))
  # write to list
  wag[i - 1] <- (sum(diag(dis)))
}
To plot our results:
# use tibble for ggplot
wag <- tibble(wag)
wag$k <- 1:19

ggplot(
  wag,
  aes(x = k, y = wag)
) +
  geom_line() +
  labs(
    title = "Weighted Average Distance to Centroids",
    subtitle = "For the iris dataset, using in-house elbow method",
    x = "Number of clusters",
    y = "Weighted avg. dist."
  ) +
  theme(text = element_text(family = "Lato", size = 14))
The elbow point here appears to be at either \(k = 3\) or \(k = 5\). In reality, we know there are 3 species of flower, so our method is fairly accurate, although the exact location of the elbow is debatable and \(k = 5\) is another reasonable reading of the graph.
These are some of the potential ways to improve model performance:
Creating robust validity metrics for fuzzy clustering is an active area of research.
The fclustIndex() function from the e1071 package calculates a number of validity metrics for soft clustering output.
fclustIndex(
  # the cmeans() object
  iris_cmeans,
  # the original data
  iris.scaled
)
## fhv apd pd xb fs
## 1.351036e-01 2.149397e+02 1.749217e+02 1.480292e-03 -3.245081e+02
## pc pe pre si
## 7.065091e-01 5.294223e-01 7.021749e+02 2.649665e-02
It would have been prohibitively complicated to delve into the derivations of each of the above validity metrics, so instead we provide a brief overview:
- fhv stands for fuzzy hypervolume. We want to minimize it.
- apd stands for average partition density. We want to maximize it.
- pd stands for physical partition density. We want to maximize it.
- xb stands for total variation / centroid vector separation. We want to minimize it.
- fs stands for membership fuzziness * row fuzziness. We want to minimize it.
- pc stands for partition coefficient, which is a heuristic measure of partition fuzziness. We want to maximize it.
- pe stands for partition entropy, which measures how much information is withheld. We want to minimize it.
- pre stands for proportion exponent, a measure of structural variation (crispness) in the partition matrix. We want to maximize it.
- si stands for separation index, which is a measure of compactness and separation that assumes a hard partition is derived from the fuzzy one. We want to minimize it.

We can loop over different numbers of clusters and graph those metrics to assist us in selecting the ideal number of clusters.
# initialize tibble to store results
error <- tribble(~fhv, ~apd, ~pd, ~xb, ~fs, ~pc, ~pe, ~pre, ~si)
# loop over number of clusters
for (i in 2:11) {
  # fit cmeans()
  iris_cmeans <- cmeans(
    iris.scaled,
    centers = i,
    dist = "euclidean"
  )
  # store the errors
  error <- rbind(
    error,
    as_tibble(t(
      fclustIndex(iris_cmeans, iris.scaled)
    ))
  )
}
# scale the indices because they are not always in [0,1]
error_scaled <- as_tibble(scale(error))

# number of clusters
error_scaled$num_clust <- 2:11

# pivot to plot
to_plot <- error_scaled %>%
  pivot_longer(
    -c(num_clust, pre),
    names_to = "variable",
    values_to = "value"
  )
ggplot(
  to_plot,
  aes(
    x = num_clust,
    y = value,
    color = variable
  )
) +
  geom_line() +
  geom_point() +
  labs(
    title = "Measures of cluster validity",
    subtitle = "For the iris dataset, from the 'e1071' package",
    x = "Number of clusters",
    y = "Normalized indices",
    color = "Index"
  ) +
  theme(text = element_text(family = "Lato", size = 14))
We can choose the ideal number of clusters using any of the variables in this graph. Note that pre is omitted, as the proportion exponent is only meaningful when \(k > 2\).
As we mentioned earlier, this is an active area of research, and there are other validity metrics that could be used. An entire package, zcebeci/fcvalid, is devoted to “Internal Validity Indexes for Fuzzy and Possibilistic Clustering” and implements a number of additional indices.
In terms of efficiency, our expectation is that fuzzy c-means takes longer to run than \(k\)-means, because it is a more complicated algorithm. Depending on the size of the dataset and the application, one may prefer the algorithm that runs faster. In this section, we therefore quantify runtimes to inform that decision.
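As an aside, base R's system.time() is a slightly more robust alternative to differencing two Sys.time() calls, since it also reports CPU time; a sketch:
# elapsed and CPU time for ten cmeans() fits
system.time({
  for (i in 1:10) {
    cmeans(iris.scaled, centers = i + 2, dist = "euclidean")
  }
})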
First, we'll find the run time for running fuzzy c-means clustering ten times using the advclust package.
storage <- c()
start_time <- Sys.time()

for (i in 1:10) {
  x1 <- fuzzy.CM(
    iris.scaled,
    K = i,
    m = 2,
    max.iteration = 100,
    threshold = 1e-5,
    RandomNumber = 1234
  )
  # average distance
  dis <- x1@distance %*% t(x1@member) / (450)
  print(sum(diag(dis)))
  storage[i] <- sum(diag(dis))
}

end_time <- Sys.time()
end_time - start_time # Time difference (run time) of 5.135732 seconds
## Time difference of 5.183285 secs
Then we compare the runtime with \(k\)-means clustering, the analogous clustering method, which we find to be much faster given that it is a hard clustering method.
start_time <- Sys.time()
store <- c()
for (i in 1:10) {
  kc <- kmeans(iris.scaled, i + 2, nstart = 20)
  store[i] <- kc$tot.withinss
}
# Time difference (run time) of 0.08174109 seconds
end_time <- Sys.time()
end_time - start_time
## Time difference of 0.04385304 secs
We also compare the runtime with the other fuzzy c-means clustering package, e1071.
start_time <- Sys.time()
store <- c()
for (i in 1:10) {
  kc <- cmeans(
    # dataset
    iris.scaled,
    # number of clusters
    centers = i + 2,
    # type of distance to use
    dist = "euclidean"
  )
}
# Time difference (run time) of 0.1188469 seconds
end_time <- Sys.time()
end_time - start_time
## Time difference of 0.05980611 secs
Overall, we find that \(k\)-means (from stats in base R) runs a little faster than cmeans() from the e1071 package, which in turn runs much faster than fuzzy.CM() from the advclust package. It is interesting to note that the advclust package took so much longer to run than the e1071 package, but to be fair, it also gives us more information, specifically the distance of each observation to the centroid of each cluster.
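That said, the observation-to-centroid distances can also be recovered from a cmeans() fit with a few lines of base R; a sketch using the iris fit from earlier (the object name dist_to_centers is ours):
# euclidean distance from each scaled observation to each cmeans() centroid
dist_to_centers <- apply(iris_cmeans$centers, 1, function(center) {
  sqrt(rowSums(sweep(iris.scaled, 2, center)^2))
})
dim(dist_to_centers) # one row per observation, one column per cluster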
As we have illustrated every step of the process in detail with the iris dataset, we will not bore you with repetition of everything on this dataset. Rather, we will perform clustering on the number of clusters that we know there should be, and remark on our model’s performance.
The data contains 1070 purchases where the customer either purchased Citrus Hill or Minute Maid Orange Juice. A number of characteristics of the customer and product are recorded.
The ISLR package defines the variables as follows:
- Purchase: A factor with levels CH and MM indicating whether the customer purchased Citrus Hill or Minute Maid Orange Juice
- WeekofPurchase: Week of purchase
- StoreID: Store ID
- PriceCH: Price charged for CH
- PriceMM: Price charged for MM
- DiscCH: Discount offered for CH
- DiscMM: Discount offered for MM
- SpecialCH: Indicator of special on CH
- SpecialMM: Indicator of special on MM
- LoyalCH: Customer brand loyalty for CH
- SalePriceMM: Sale price for MM
- SalePriceCH: Sale price for CH
- PriceDiff: Sale price of MM less sale price of CH
- Store7: A factor with levels No and Yes indicating whether the sale is at Store 7
- PctDiscMM: Percentage discount for MM
- PctDiscCH: Percentage discount for CH
- ListPriceDiff: List price of MM less list price of CH
- STORE: Which of 5 possible stores the sale occurred at

We want to see if we can cluster observations as purchasing one company or the other. Since we can only use quantitative variables (or meaningfully sortable categories, like “Totally agree” to “Totally disagree”) in fuzzy c-means, we will be removing STORE, Store7, Purchase, WeekofPurchase, StoreID, SpecialCH, and SpecialMM.
juice <- OJ %>%
  select(
    -STORE, -Store7, -Purchase,
    -WeekofPurchase, -StoreID,
    -SpecialCH, -SpecialMM
  )
Since fuzzy c-means clusters using Euclidean distance, we must normalize our data. Otherwise, variables that tend to have large values would inherently be overemphasized.
juice.scaled <- scale(juice)
We first fit a fuzzy c-means model with 2 centers, because we know that there are two classes that we care about: purchases of Citrus Hill and purchases of Minute Maid.
juice_cmeans <- cmeans(
  juice.scaled,
  centers = 2,
  dist = "euclidean"
)
We plot the observations on the first two principal components.
fviz_cluster(
  list(
    data = juice.scaled,
    cluster = juice_cmeans$cluster
  ),
  ellipse.type = "norm",
  ellipse.level = .7,
  repel = TRUE
) +
  labs(
    title = "Cluster plot",
    subtitle = "For the OJ dataset, using fuzzy c-means"
  ) +
  theme(text = element_text(family = "Lato", size = 14))
Normally, people use clustering when they do not have an outcome variable. Since we do have an outcome variable, we can go beyond fuzzy c-means and evaluate our performance as follows.
table(juice_cmeans$cluster, OJ$Purchase)
##
## CH MM
## 1 184 203
## 2 469 214
Our model didn’t do too well. We clustered 469 of the 653 Citrus Hill purchases (71.8% of them) together, but only about half of the Minute Maid purchases.
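Because fuzzy c-means also returns membership probabilities, we can check how confident these assignments actually are; a sketch:
# average membership probability of each observation's assigned cluster;
# with two clusters, values near 0.5 indicate very fuzzy assignments
mean(apply(juice_cmeans$membership, 1, max))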
We could theoretically use the same methods as we did with the iris dataset to pick the ideal number of clusters, but we think it would be more interesting to do that with the market dataset, in which we have no motivation for the number of clusters.
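If we did want to choose the number of clusters for the juice data, the validity-index loop from the iris section carries over with only the data swapped; a sketch (the object names error_juice and juice_fit are ours):
error_juice <- tribble(~fhv, ~apd, ~pd, ~xb, ~fs, ~pc, ~pe, ~pre, ~si)
for (i in 2:11) {
  juice_fit <- cmeans(juice.scaled, centers = i, dist = "euclidean")
  error_juice <- rbind(
    error_juice,
    as_tibble(t(fclustIndex(juice_fit, juice.scaled)))
  )
}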
We illustrated every step of the process in detail with the iris dataset, so we will not bore you with repetition of everything on this dataset. Rather, we will perform clustering on the market dataset and pick the ideal number of clusters, since we have no motivation for what number of clusters we should really have.
Kaggle describes the variables in the market dataset as follows:
- ID: Customer’s unique identifier
- Year_Birth: Customer’s birth year
- Education: Customer’s education level
- Marital_Status: Customer’s marital status
- Income: Customer’s yearly household income
- Kidhome: Number of children in customer’s household
- Teenhome: Number of teenagers in customer’s household
- Dt_Customer: Date of customer’s enrollment with the company
- Recency: Number of days since customer’s last purchase
- Complain: 1 if the customer complained in the last 2 years, 0 otherwise
- MntWines: Amount spent on wine in last 2 years
- MntFruits: Amount spent on fruits in last 2 years
- MntMeatProducts: Amount spent on meat in last 2 years
- MntFishProducts: Amount spent on fish in last 2 years
- MntSweetProducts: Amount spent on sweets in last 2 years
- MntGoldProds: Amount spent on gold in last 2 years
- NumDealsPurchases: Number of purchases made with a discount
- AcceptedCmp1: 1 if customer accepted the offer in the 1st campaign, 0 otherwise
- AcceptedCmp2: 1 if customer accepted the offer in the 2nd campaign, 0 otherwise
- AcceptedCmp3: 1 if customer accepted the offer in the 3rd campaign, 0 otherwise
- AcceptedCmp4: 1 if customer accepted the offer in the 4th campaign, 0 otherwise
- AcceptedCmp5: 1 if customer accepted the offer in the 5th campaign, 0 otherwise
- Response: 1 if customer accepted the offer in the last campaign, 0 otherwise
- NumWebPurchases: Number of purchases made through the company’s website
- NumCatalogPurchases: Number of purchases made using a catalogue
- NumStorePurchases: Number of purchases made directly in stores
- NumWebVisitsMonth: Number of visits to company’s website in the last month
- Z_CostContact: Documentation says “total values” – not sure what that means so we won’t use it.
- Z_Revenue: Documentation says “total values” – not sure what that means so we won’t use it.

Our idea is to cluster consumers into different consumer personalities. Since we can only use quantitative variables in fuzzy c-means, we can remove ID, Year_Birth, Education, Marital_Status, Dt_Customer, Complain, AcceptedCmp1, AcceptedCmp2, AcceptedCmp3, AcceptedCmp4, AcceptedCmp5, Response, Z_CostContact, and Z_Revenue.
market <- market %>%
  select(
    -ID, -Year_Birth, -Education,
    -Marital_Status, -Dt_Customer,
    -Complain, -AcceptedCmp1,
    -AcceptedCmp2, -AcceptedCmp3,
    -AcceptedCmp4, -AcceptedCmp5,
    -Response, -Z_CostContact, -Z_Revenue
  ) %>%
  # omit observations with NAs
  na.omit()

market.scaled <- scale(market)
# we pick the ideal number of clusters
error1 <- tribble(
  ~fhv, ~apd, ~pd, ~xb,
  ~fs, ~pc, ~pe, ~pre, ~si
)
for (i in 2:11) {
  market_cmeans <- cmeans(market.scaled,
    centers = i,
    dist = "euclidean"
  )
  error1 <- rbind(
    error1,
    as_tibble(t(
      fclustIndex(market_cmeans, market.scaled)
    ))
  )
}
error_scaled <- as_tibble(scale(error1))

error_scaled$num_clust <- 2:11

to_plot <- error_scaled %>%
  pivot_longer(
    -c(num_clust, pre),
    names_to = "variable",
    values_to = "value"
  )
ggplot(
  to_plot,
  aes(
    x = num_clust,
    y = value,
    color = variable
  )
) +
  geom_line() +
  geom_point() +
  labs(
    title = "Measures of cluster validity",
    subtitle = "For the market dataset, from the 'e1071' package",
    x = "Number of clusters",
    y = "Normalized indices",
    color = "Index"
  ) +
  theme(text = element_text(family = "Lato", size = 14))
These indices suggest 5 as a possibly optimal cluster size.
market_cmeans <- cmeans(
  market.scaled,
  centers = 5,
  dist = "euclidean"
)
wag <- c()
for (i in 1:20) {
  x1 <- fuzzy.CM(market.scaled,
    K = i,
    m = 2,
    max.iteration = 100,
    threshold = 1e-5,
    RandomNumber = 1234
  )
  # average distance
  dis <- x1@distance %*% t(x1@member) / (450)
  print(sum(diag(dis)))
  wag[i - 1] <- (sum(diag(dis)))
}
And plot our results
# use tibble for ggplot
wag <- tibble(wag)
wag$k <- 1:19

ggplot(
  wag,
  aes(x = k, y = wag)
) +
  geom_line() +
  labs(
    title = "Weighted Average Distance to Centroids",
    subtitle = "For the market dataset, using in-house elbow method",
    x = "Number of clusters",
    y = "Weighted avg. dist."
  ) +
  theme(text = element_text(family = "Lato", size = 14))
From the graph, we observe that although the elbow point occurs around \(k = 5\), the weighted average distances barely change as \(k\) increases. The difference between the weighted average distance at \(k = 1\) (around 69.2) and at \(k = 15\) (around 68.75) is less than 1% of the total.
fviz_cluster(
  list(
    data = market.scaled,
    cluster = market_cmeans$cluster
  ),
  ellipse.type = "norm",
  ellipse.level = .7,
  repel = TRUE
) +
  labs(
    title = "Cluster plot",
    subtitle = "For the market dataset, using fuzzy c-means"
  ) +
  theme(text = element_text(family = "Lato", size = 14))
Using the corrplot() function from the package of the same name, we can visualize the membership probabilities for each observation. Here, the top 10 observations are visualized.
# select 10 from the top
head(market_cmeans$membership, n = 10) %>%
  corrplot(
    # this is a membership matrix
    # not a correlation one
    # so set is.corr to false
    is.corr = FALSE
  )
Note: Observations are on the rows, and clusters are on the columns. Larger probabilities of belonging to a certain cluster are shown by larger and darker circles.
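To summarize the fuzzy solution a little further, we can look at the size of each (hard-assigned) cluster and at how confident the memberships are on average; a sketch:
# hard cluster sizes implied by the fuzzy memberships
table(market_cmeans$cluster)
# average membership probability of the assigned cluster
mean(apply(market_cmeans$membership, 1, max))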
Our analysis shows that the dataset is fairly homogeneous with respect to the variables we used, as the weighted average distance to the centroids barely changes as \(k\) increases.
The fuzzy c-means clustering method is quite similar to the \(k\)-means clustering algorithm we’ve learned in class. Both are non-hierarchical clustering methods that start with random assignment to clusters and then re-cluster by minimizing the within cluster variation.
Again, one major difference between \(k\)-means clustering and fuzzy c-means clustering is that \(k\)-means clustering is a hard clustering method, which means that it assigns each data point to one and only one cluster. \(k\)-means clustering works well with non-overlapping datasets where each data point clearly belongs to a single cluster, while fuzzy c-means clustering works better with overlapping datasets.
The main advantage of fuzzy c-means clustering is that it enables interpretation of complex relationships where observations may belong to more than one cluster. Thus, it is best used in overlapping datasets.
Its first disadvantage is that the somewhat arbitrary choice of cluster number may not work well when there is little information available on how many clusters there should be; it is harder to pick the ideal number of clusters.
Meanwhile, the complexity of the algorithm makes it difficult to cluster large datasets with this method: our experiment showed that \(k\)-means clustering took less than one second for 60 iterations, while fuzzy c-means clustering took over 2 minutes. The exact time difference may change based on the environment and packages used, but both our experiment and numerous comparative studies such as Neto, Souza, and Ribeiro (2020) show that fuzzy c-means clustering is usually significantly slower than \(k\)-means clustering.
If categories in the dataset do not overlap, then the results of fuzzy c-means can be harder to interpret than the results of \(k\)-means, since observations are classified as belonging to more than one cluster.
With a lower value of the stopping threshold \(\varepsilon\), we can get a better clustering result, but at the expense of more iterations.
Finally, fuzzy c-means can be harder to interpret than \(k\)-means where observations may belong to more than one class.
The following applies to both c-means and \(k\)-means:
In conclusion, fuzzy c-means is a clustering method similar to \(k\)-means clustering, best used for analyzing overlapping datasets, albeit at the expense of longer run times. The additional hyperparameters allow users to customize their results and better apply their understanding of the dataset.
We recommend using fuzzy c-means when one believes the dataset to be an overlapping one, and when one has a reasonable estimate of the fuzziness and the number of clusters. The performance of different degrees of fuzziness and numbers of clusters can also be compared using the aforementioned metrics (e.g., the weighted elbow method).
If one believes that the dataset is non-overlapping and has an estimate of the cluster count, we would recommend \(k\)-means clustering and other hard clustering methods. If one wants to analyze a completely unknown dataset, we would recommend also trying hierarchical clustering methods, which do not require choosing the number of clusters in advance. Furthermore, deep-learning clustering techniques could be useful as well, although interpreting such a black box would be much harder.