Introduction

On this page, we introduce the fuzzy c-means clustering algorithm, a soft clustering method, in detail.

Soft Clustering

So far we have learned a variety of clustering methods, all of which are classified as hard clustering, in the sense that each observation is assigned to exactly one cluster in a clear-cut manner. Soft clustering methods, on the other side of the spectrum, work more flexibly: rather than giving definite cluster assignments, they predict the probability that a data point belongs to each cluster.

One might opt for soft clustering over hard clustering to account for the fact that cluster membership could be ambiguous (an observation may plausibly belong to more than one cluster). By assigning each observation a probability of belonging to each cluster, the analyst can identify points that are most probably within one cluster and points that might belong to another. In a critical setting such as cancer diagnosis, these probabilities can inform further tests and checks. One major downside of soft clustering is that it tends to take longer than hard clustering, given the additional computational work required.

Fuzzy c-means

Fuzzy c-means is \(k\)-means, but with membership in more than one cluster per observation.

Fuzzy c-means is similar to the \(k\)-means clustering method that we discussed in class, in that clusters are formed to minimize the distance between a cluster’s centroid and observations in that cluster.

However, rather than explicitly assigning an observation to a given cluster, this method assigns a probability that any given observation is in every single cluster. Observations in the center of a cluster are assigned a very high probability of belonging to that cluster, while observations further out are assigned a lower probability.

Fuzzy c-means is…

  • A clustering technique (not classification or regression).
  • Unsupervised: learns from untagged data.
  • Soft: each observation gets a probability of membership in each cluster; the probabilities sum to 1.
  • Iterative: starts & improves from an initial set of random clusters.
  • Disjunctive: observations can belong to more than one cluster, based on the cutoff that we decide upon.

How the Method Works

The Math

There are three components to the fuzzy c-means clustering:

  • the centroid of each cluster;
  • the distance matrix, containing each observation’s distance from the centroid; and
  • the fuzzy partition matrix, which produces a cluster membership for each observation-cluster pair in the range \([0,1]\).

We start by randomly assigning the \(n\) observations to \(c\) clusters. Then we can proceed with the steps below (Ghosh and Dubey 2013):

  1. Calculate the centroid \(v_{ij}\) given by: \[\displaystyle v_{ij}=\frac{\sum_{k=1}^{n}{(\mu_{ik})^m x_{kj}}}{\sum_{k=1}^{n}{(\mu_{ik})^m}};\]
  2. Calculate the distance matrix \(D_{[c,n]}\), whose entries are given by: \[D_{ik}=\left(\sum_{j=1}^{p}{(x_{kj}-v_{ij})^2}\right)^\frac{1}{2},\] where \(p\) is the number of features;
  3. Update the partition matrix for the \(r\)th iteration, \(U^{(r)}\), given by: \[\mu_{ik}^{(r+1)}=\left(\sum_{j=1}^{c}{\left(\frac{d_{ik}^{(r)}}{d_{jk}^{(r)}}\right)^\frac{2}{m-1}}\right)^{-1};\]
  4. If \(\left\lVert U^{(r+1)} - U^{(r)} \right\rVert < \varepsilon\), then we stop iterating; otherwise, we return to Step 1.

A more intuitive walkthrough that is perhaps easier to follow is as follows:

  1. In Step 1, we calculate the centroid of each cluster. This is trivial for \(k\)-means; in c-means, we weight each observation by its membership in the cluster.
  2. In Step 2, we fill in the distance matrix, which simply records the distance from each observation to each cluster's centroid.
  3. In Step 3, we re-calculate the membership probabilities. Intuitively, points closer to a centroid get higher probabilities for that cluster, and vice versa.
  4. In Step 4, we take the norm of the change in the partition matrix since the previous iteration. If the change is below our sensitivity threshold \(\varepsilon\) (discussed below), we stop iterating.

After each iteration, the weighted distance from each observation to the centroids decreases, as we assign points that are closer to a centroid a higher membership in that cluster. A compact R translation of these steps is sketched below.
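
To make the steps concrete, here is a minimal sketch of the algorithm in plain R. The function fcm_sketch and everything inside it are our own illustrative names, not from any package, and the analyses later on use established implementations instead.

# a minimal sketch of the fuzzy c-means iteration, written directly from
# the formulas above; X must be a numeric matrix (n observations in rows)
fcm_sketch <- function(X, centers, m = 2, eps = 1e-5, max_iter = 100) {
  n <- nrow(X)
  # random initial partition matrix U (centers x n); each column sums to 1
  U <- matrix(runif(centers * n), nrow = centers)
  U <- sweep(U, 2, colSums(U), "/")
  for (r in seq_len(max_iter)) {
    # Step 1: centroids, weighting each observation by its membership^m
    W <- U^m
    V <- (W %*% X) / rowSums(W)
    # Step 2: Euclidean distance from every centroid (row of V) to every point
    D <- t(apply(V, 1, function(v) sqrt(colSums((t(X) - v)^2))))
    D <- pmax(D, .Machine$double.eps) # guard against division by zero
    # Step 3: update memberships; mu_ik is proportional to d_ik^(-2/(m-1))
    P <- D^(-2 / (m - 1))
    U_new <- sweep(P, 2, colSums(P), "/")
    # Step 4: stop once the partition matrix has (almost) stopped changing
    done <- max(abs(U_new - U)) < eps
    U <- U_new
    if (done) break
  }
  list(membership = t(U), centers = V, iterations = r)
}
# e.g. str(fcm_sketch(scale(iris[, -5]), centers = 3)$membership)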

Hyperparameters

As with \(k\)-means, the number of clusters \(k\) affects the resulting clustering. Unlike hierarchical clustering, \(k\) must be specified in advance; see the section Ideal Number of Clusters below for a full discussion of how to choose \(k\) when there is no a priori knowledge of the dataset's underlying structure.

The hyperparameter \(m\) is a number greater than 1 that controls the degree of fuzziness. Without prior knowledge of the dataset, \(m = 2\) is the usual choice. As \(m\rightarrow 1\), the membership probabilities converge to either 0 or 1, resulting in what is effectively a \(k\)-means clustering.

The number of iterations can also be controlled, either explicitly by setting iter.max, or by specifying a sensitivity threshold \(\varepsilon\) and stopping when the change in probabilities is less than \(\varepsilon\).
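
To see the effect of \(m\) in practice, the quick and entirely optional check below fits the same data with a small and a large fuzzifier using cmeans() from the e1071 package, which is introduced in the next section; the object names here are ours.

# effect of the fuzzifier m on how crisp the memberships are
library(e1071)
X <- scale(iris[, -5])
nearly_crisp <- cmeans(X, centers = 3, m = 1.2)
very_fuzzy <- cmeans(X, centers = 3, m = 3)
# each observation's largest membership:
# close to 1 when m is near 1, noticeably smaller when m is large
summary(apply(nearly_crisp$membership, 1, max))
summary(apply(very_fuzzy$membership, 1, max))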

Coding Demonstration

We will demonstrate fuzzy c-means clustering on the iris, juice, and market datasets.

Setup

# libraries
library(tidyverse)
# pipes
library(magrittr)
# cmeans
library(e1071)
library(advclust)
# visualize
library(factoextra)
library(ggplot2)
library(corrplot)
# fonts for ggplot
library(showtext)
# ISLR for data
library(ISLR)

knitr::opts_chunk$set(
  # Retina!
  fig.retina = 4,
  # use the showtext package for figures
  fig.showtext = TRUE,
  # output suppression
  # we oppose other forms of suppression
  # such as voter suppression
  message = FALSE,
  warning = FALSE,
  # cache = TRUE,
  # hold off results until end of chunk
  results = "hold"
)
# seed
set.seed(42)

# ggplot fonts
# if knitting:
if (isTRUE(getOption("knitr.in.progress"))) {
  # add Lato from Google Fonts
  font_add_google("Lato", "Lato")
  # set font for base R for corrplot
  par(family = "Lato")
}

Data sources

The iris dataset was collected by Edgar Anderson in 1935. It comes with R and its predecessor S. From ?iris:

This famous (Fisher’s or Anderson’s) iris dataset gives the measurements in centimeters of the variables sepal length and width and petal length and width, respectively, for 50 flowers from each of 3 species of iris. The species are Iris setosa, versicolor, and virginica.

Four of the five variables of iris are numerical; the fifth is the outcome variable, Species. We will use the four quantitative variables to cluster observations into their respective species and use the outcome variable to check our model against the truth, recognizing that in many clustering scenarios we would not be able to check our results.

This dataset is a good choice for our purposes because it is relatively small, easy to understand, and contains quantitative predictors.

# iris data comes with base R
iris <- iris

The juice dataset comes from the 1998 book Business Analysis Using Regression, via the package ISLR. From ?ISLR::OJ:

The data contains 1070 purchases where the customer either purchased Citrus Hill or Minute Maid Orange Juice. A number of characteristics of the customer and product are recorded.

juice identifies customer brand loyalty with LoyalCH. The other variables identify the prices, discounts, place of purchase, etc. for the CH and MM juice brands.

Most of the 18 variables in juice are quantitative. Since fuzzy c-means depends on distance calculations, we will use only the quantitative variables. After attempting to cluster observations as either Citrus Hill or Minute Maid Orange Juice, we will be able to check our results against the ground truth, recognizing that in many clustering scenarios we would not be able to check our results.

We chose to use this dataset because it contains a variety of quantitative predictors, and because we were interested in applying our method to a real-world scenario.

# orange juice data comes with ISLR dataset
data(OJ)

The market dataset is from Kaggle and attributed to Dr. Omar Romero-Hernandez of UC Berkeley.

The dataset documents the following:

  • Customer information, such as date of birth, income, and number of children.
  • Spending preferences, such as the amount spent on wine, fruits, and fish.
  • Purchase data, such as the number of purchases made online, through a catalog, or over the phone.

This dataset contains a mix of numerical and categorical variables. Since fuzzy c-means depends on distance calculations, we will use only the quantitative variables, attempting to classify observations into different types of/personalities of spenders. There is no way to check this classification against reality.

We chose to use this dataset because it contains quantitative predictors and we thought it would be neat to try our method on a dataset with no outcome, since that is the scenario in which clustering techniques are typically used.

# marketing data
market <- read_delim(
  "marketing_campaign.csv",
  delim = ";",
  escape_double = FALSE,
  trim_ws = TRUE
)

Iris Dataset: Model Fitting

First, since fuzzy c-means clusters using Euclidean distance, we must normalize our data. Otherwise, variables with large values would be overemphasized.

iris.scaled <- iris %>%
  select(-5) %>%
  scale()

We proceed to fit the dataset using cmeans() from the e1071 package.

iris_cmeans <- e1071::cmeans(
  # dataset
  iris.scaled,
  # number of clusters
  centers = 3,
  # type of distance to use: euclidean / manhattan
  dist = "euclidean",
  # degree of fuzzification; 2 is the default
  m = 2,
  # maximum number of iterations
  iter.max = 100
)

The fviz_cluster() function (from the factoextra package) can plot observations on the dataset’s first two principal components. Alternatively, we can also use choose.vars to specify which (original) variables to plot as the x and y axes.

The ellipses are concentration ellipses drawn at the 0.7 level: assuming approximate multivariate normality within a cluster, roughly 70% of that cluster's observations would be expected to fall inside its ellipse.

factoextra::fviz_cluster(
  # list object with data and cluster components
  list(
    # the data
    data = iris.scaled,
    # the cluster assignments
    cluster = iris_cmeans$cluster
  ),
  # for a confidence ellipse, use "confidence"
  ellipse.type = "norm",
  # size of the concentration ellipse in normal probability
  # see ?ggplot2::stat_ellipse for details
  ellipse.level = .7
) +
  labs(
    title = "Cluster plot",
    subtitle = "For the iris dataset, using fuzzy c-means",
  ) +
  theme(text = element_text(family = "Lato", size = 14))

Normally, people use clustering when they do not have an outcome variable, which is what makes it unsupervised. Since we do have an outcome variable, we can go beyond fuzzy c-means and evaluate our performance as follows:

table(iris_cmeans$cluster, iris$Species)
##    
##     setosa versicolor virginica
##   1     50          0         0
##   2      0         39        13
##   3      0         11        37

Our model performed pretty well! One of our clusters aligns perfectly with setosa, and our other clusters contain the vast majority of just one type of flower.

Ideal Number of Clusters

In most scenarios when we would actually want to use clustering, we would not have an outcome variable like the species of flower. In this section, we illustrate how we might identify the ideal number of clusters for fuzzy c-means clustering.

In \(k\)-means clustering, we often do this with the elbow method: we run a for-loop to fit models with sequentially larger \(k\)-values, plot the within cluster sum of squares versus the \(k\)-value, and select the \(k\)-value at the elbow point. There are also other methods, such as the average silhouette method and the gap statistic method.
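
For reference, a minimal version of that \(k\)-means elbow method on the scaled iris data might look like the sketch below, using base R's kmeans(); wcss is our own name.

# classic elbow method with k-means, for comparison
wcss <- sapply(1:10, function(k) {
  kmeans(iris.scaled, centers = k, nstart = 20)$tot.withinss
})
plot(1:10, wcss,
  type = "b",
  xlab = "Number of clusters k",
  ylab = "Total within-cluster sum of squares"
)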

Custom Elbow method

Weighted average distance vs. average WCSS

We formulated our own way of extending the elbow method to a fuzzy context, as described below.

First, we show how to fit a fuzzy c-means model with the advclust package. We use this package here because it automatically outputs the distance from each observation to the centroid of each cluster. We could have calculated this information ourselves and used e1071, but this was easier, and we thought it was advantageous to know how to perform fuzzy c-means with two different implementations.

x <- advclust::fuzzy.CM(
  # the data
  iris.scaled,
  # how many clusters?
  K = 3,
  # fuzziness, as above
  m = 2,
  # we specify both of the two hyperparameters here:
  # maximum number of iterations
  max.iteration = 100,
  # convergence criteria
  threshold = 1e-5,
  # random seed, for reproducibility
  RandomNumber = 42
)

Recall that the elbow method minimizes the within-cluster sum of squares. To extend this idea to fuzzy clustering, we weight the distance from each observation to each cluster by its membership, and then minimize that weighted sum of distances. To demonstrate how that would work:

# Extract distance from each observation to each cluster
l <- x@distance
# Extract probabilities of belonging to each cluster
ll <- t(x@member)

# Calculate the weighted distance
# From center for each observation
# (weighting euclidean distance with probability of belonging)
# Then sum them up
distance <- l %*% ll
r <- rowSums(distance)
sum(r)
## [1] 154857.5

To do this for a range of different numbers of clusters, we run a for loop that computes the weighted average distance to the centroids for each cluster count.

# list to save distances
wag <- c()
for (i in 1:20) {
  x1 <- fuzzy.CM(iris.scaled,
    K = i,
    m = 2,
    max.iteration = 100,
    threshold = 1e-5,
    RandomNumber = 42,
    print.result = 0
  )
  # average distance
  dis <- x1@distance %*% t(x1@member) / (450)
  # sum and print
  # print(sum(diag(dis)))
  # write to list
  wag[i] <- sum(diag(dis))
}

To plot our results:

# use tibble for ggplot
wag <- tibble(wag = wag)
wag$k <- 1:20

ggplot(
  wag,
  aes(x = k, y = wag)
) +
  geom_line() +
  labs(
    title = "Weighted Average Distance to Centroids",
    subtitle = "For the iris dataset, using in-house elbow method",
    x = "Number of clusters",
    y = "Weighted avg. dist.",
  ) +
  theme(text = element_text(family = "Lato", size = 14))

The elbow point here appears to be at either \(k = 3\) or \(k = 5\). In reality, we know there are 3 different species of flower, which suggests our method is fairly accurate, although the exact location of the elbow is debatable from the graph and \(k = 5\) is another plausible reading.

These are some potential ways to improve model performance:

  1. We could run the elbow method for different degrees of fuzziness, combined with different \(k\) values, to test the above hypothesis, although this may dramatically increase run time.
  2. We could develop alternative models and evaluation methods for the dataset.
  3. We could obtain more variables and more data points so that the model captures reality better.

Current research

Creating robust validity metrics for fuzzy clustering is an active area of research.

The fclustIndex() function from the e1071 package calculates a number of validity metrics on fuzzy (soft) clustering output.

fclustIndex(
  # the cmeans() object
  iris_cmeans,
  # the original data
  iris.scaled
)
##           fhv           apd            pd            xb            fs 
##  1.351036e-01  2.149397e+02  1.749217e+02  1.480292e-03 -3.245081e+02 
##            pc            pe           pre            si 
##  7.065091e-01  5.294223e-01  7.021749e+02  2.649665e-02

It would have been prohibitively complicated to delve into the derivations of each of the above validity metrics, so instead we provide a brief overview:

  • fhv stands for fuzzy hypervolume. We want to minimize it.
  • apd stands for average partition density. We want to maximize it.
  • pd stands for physical partition density. We want to maximize it.
  • xb stands for total variation / centroid vector separation. We want to minimize it.
  • fs stands for membership fuzziness * row fuzziness. We want to minimize it.
  • pc stands for partition coefficient, which is a heuristic measure of partition fuzziness. We want to maximize it.
  • pe stands for partition entropy, which measures how much information is withheld. We want to minimize it.
  • pre stands for proportion exponent, a measure of structural variation (crispness) in the partition matrix. We want to maximize it.
  • si stands for separation index, which is a measure of compactness and separation which assumes that a hard partition is derived from the fuzzy one. We want to minimize it.

We can loop over different numbers of clusters and graph these metrics to help us select the ideal number of clusters.

# initialize tibble to store results
error <- tribble(~fhv, ~apd, ~pd, ~xb, ~fs, ~pc, ~pe, ~pre, ~si)
# loop over number of clusters
for (i in 2:11) {
  # fit cmeans()
  iris_cmeans <- cmeans(
    iris.scaled,
    centers = i,
    dist = "euclidean"
  )
  # store the errors
  error <- rbind(
    error,
    as_tibble(t(
      fclustIndex(iris_cmeans, iris.scaled)
    ))
  )
}

# scale the indices because they are not always in [0,1]
error_scaled <- as_tibble(scale(error))

# number of clusters
error_scaled$num_clust <- 2:11

# pivot to plot
to_plot <- error_scaled %>%
  pivot_longer(
    -c(num_clust, pre),
    names_to = "variable",
    values_to = "value"
  )

ggplot(
  to_plot,
  aes(
    x = num_clust,
    y = value,
    color = variable
  )
) +
  geom_line() +
  geom_point() +
  labs(
    title = "Measures of cluster validity",
    subtitle = "For the iris dataset, from the 'factoextra' package",
    x = "Number of clusters",
    y = "Normalized indices",
    color = "Index"
  ) +
  theme(text = element_text(family = "Lato", size = 14))

We can choose the ideal number of clusters using any of the variables in this graph. Note that pre is omitted, as the proportion exponent is only meaningful when \(k>2\).
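
As a rough shortcut, we can also read off which \(k\) each index prefers under its own minimize/maximize rule, using the unscaled error table; best_k is just a throwaway name of ours, and which.min()/which.max() report only the first winner in the event of ties.

# the k preferred by a few of the indices, each under its own rule
best_k <- c(
  fhv = error_scaled$num_clust[which.min(error$fhv)], # minimize
  xb  = error_scaled$num_clust[which.min(error$xb)],  # minimize
  pc  = error_scaled$num_clust[which.max(error$pc)],  # maximize
  pe  = error_scaled$num_clust[which.min(error$pe)]   # minimize
)
best_k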

Like we mentioned earlier, this is an active area of research and there are also other validity metrics that could be used. An entire package, zcebeci/fcvalid, is devoted to “Internal Validity Indexes for Fuzzy and Possibilistic Clustering”, including the following additional indices:

  • Chen-Linkens Index
  • Composed Within and Between Scattering Index
  • Fuzzy Hyper Volume
  • Pakhira-Bandyopadhyay-Maulik Index
  • Kwon Index
  • Minimum Centroid Distance
  • Silhouette Index
  • Tang, Sun & Sun Index
  • Weighted Summation Index

Comparing Run Times

In terms of efficiency, our assumption is that fuzzy c-means takes longer to run than \(k\)-means because it is a more complicated algorithm. Depending on the size of the dataset and the application of the model, one may prefer the algorithm that runs faster, so in this section we quantify run times to inform that decision.

First, we'll find the run time of fitting fuzzy c-means clustering ten times (for \(K = 1\) through \(10\)) using the advclust package.

storage <- c()
start_time <- Sys.time()

for (i in 1:10) {
  x1 <- fuzzy.CM(
    iris.scaled,
    K = i,
    m = 2,
    max.iteration = 100,
    threshold = 1e-5,
    RandomNumber = 1234
  )
  dis <- x1@distance %*% t(x1@member) / (450) # average distance
  print(sum(diag(dis)))
  storage[i] <- sum(diag(dis))
}

end_time <- Sys.time()
end_time - start_time
## Time difference of 5.183285 secs

Then we compare the runtime with \(k\)-means clustering, the analogous clustering method, which we find to be much faster given that it is a hard clustering method.

start_time <- Sys.time()
store <- c()
for (i in 1:10) {
  kc <- kmeans(iris.scaled, i + 2, nstart = 20)
  store[i] <- kc$tot.withinss
}

end_time <- Sys.time()
end_time - start_time
## Time difference of 0.04385304 secs

We also compare the runtime with the other fuzzy c-means clustering package, e1071.

start_time <- Sys.time()
store <- c()
for (i in 1:10) {
  kc <- cmeans(
    iris.scaled, # dataset
    centers = i + 2, # number of clusters
    dist = "euclidean"
  ) # type of distance to use
}

end_time <- Sys.time()
end_time - start_time
## Time difference of 0.05980611 secs

Overall, we find that \(k\)-means (from stats in base R) runs a little faster than cmeans() from the e1071 package, which in turn runs much faster than fuzzy.CM() from the advclust package. It is interesting to note that the advclust package took so much longer to run than the e1071 package, but to be fair, it also gives us more information, specifically the distance of each observation to the centroid of each cluster.
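
For quick one-off comparisons, base R's system.time() measures a single fit directly; timings will of course vary by machine and by the chosen number of clusters.

# timing a single fit with each implementation
system.time(cmeans(iris.scaled, centers = 3, dist = "euclidean"))
system.time(
  fuzzy.CM(iris.scaled,
    K = 3,
    m = 2,
    max.iteration = 100,
    threshold = 1e-5,
    RandomNumber = 42
  )
)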

Orange Juice Dataset

As we have illustrated every step of the process in detail with the iris dataset, we will not bore you by repeating everything here. Rather, we will perform clustering with the number of clusters we know there should be, and remark on our model's performance.

The data contains 1070 purchases where the customer either purchased Citrus Hill or Minute Maid Orange Juice. A number of characteristics of the customer and product are recorded.

The ISLR package defines the variables as follows:

  • Purchase: A factor with levels CH and MM indicating whether the customer purchased Citrus Hill or Minute Maid Orange Juice
  • WeekofPurchase: Week of purchase
  • StoreID: Store ID
  • PriceCH: Price charged for CH
  • PriceMM: Price charged for MM
  • DiscCH: Discount offered for CH
  • DiscMM: Discount offered for MM
  • SpecialCH: Indicator of special on CH
  • SpecialMM: Indicator of special on MM
  • LoyalCH: Customer brand loyalty for CH
  • SalePriceMM: Sale price for MM
  • SalePriceCH: Sale price for CH
  • PriceDiff: Sale price of MM less sale price of CH
  • Store7: A factor with levels No and Yes indicating whether the sale is at Store 7
  • PctDiscMM: Percentage discount for MM
  • PctDiscCH: Percentage discount for CH
  • ListPriceDiff: List price of MM less list price of CH
  • STORE: Which of 5 possible stores the sale occurred at

We want to see if we can cluster observations by the brand purchased. Since we can only use quantitative variables (or meaningfully ordered categories, like "Totally agree" to "Totally disagree") in fuzzy c-means, we remove STORE, Store7, Purchase, WeekofPurchase, StoreID, SpecialCH, and SpecialMM.

juice <- OJ %>%
  select(
    -STORE, -Store7, -Purchase,
    -WeekofPurchase, -StoreID,
    -SpecialCH, -SpecialMM
  )

Since fuzzy c-means clusters using Euclidean distance, we must normalize our data. Otherwise, variables which tend to have large values would inherently be overemphasized.

juice.scaled <- scale(juice)

We first fit a fuzzy c-means model with 2 centers, because we know that there are two classes that we care about: purchases of Citrus Hill and purchases of Minute Maid.

juice_cmeans <- cmeans(
  juice.scaled,
  centers = 2,
  dist = "euclidean"
)

We plot the observations on the first two principal components.

fviz_cluster(
  list(
    data = juice.scaled,
    cluster = juice_cmeans$cluster
  ),
  ellipse.type = "norm",
  ellipse.level = .7,
  repel = TRUE,
) +
  labs(
    title = "Cluster plot",
    subtitle = "For the OJ dataset, using fuzzy c-means",
  ) +
  theme(text = element_text(family = "Lato", size = 14))

Normally, people use clustering when they do not have an outcome variable. Since we do have an outcome variable, we can go beyond fuzzy c-means and evaluate our performance as follows.

table(juice_cmeans$cluster, OJ$Purchase)
##    
##      CH  MM
##   1 184 203
##   2 469 214

Our model didn’t do too well. We clustered 469 of the 653 Citrus Hill purchases (71.8% of them) together, but only about half of the Minute Maid purchases.
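
Column-wise proportions make that comparison explicit; margin = 2 normalizes within each true brand.

# share of each true brand captured by each cluster
round(prop.table(table(juice_cmeans$cluster, OJ$Purchase), margin = 2), 3)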

We could theoretically use the same methods as we did with the iris dataset to pick the ideal number of clusters, but we think it would be more interesting to do that with the market dataset, in which we have no motivation for the number of clusters.

Market Dataset

We illustrated every step of the process in detail with the iris dataset, so we will not bore you by repeating everything here. Rather, we will perform clustering on the market dataset and pick the ideal number of clusters, since we have no prior motivation for how many clusters there should be.

Kaggle describes the variables in the market dataset as follows:

  • ID: Customer’s unique identifier
  • Year_Birth: Customer’s birth year
  • Education: Customer’s education level
  • Marital_Status: Customer’s marital status
  • Income: Customer’s yearly household income
  • Kidhome: Number of children in customer’s household
  • Teenhome: Number of teenagers in customer’s household
  • Dt_Customer: Date of customer’s enrollment with the company
  • Recency: Number of days since customer’s last purchase
  • Complain: 1 if the customer complained in the last 2 years, 0 otherwise
  • MntWines: Amount spent on wine in last 2 years
  • MntFruits: Amount spent on fruits in last 2 years
  • MntMeatProducts: Amount spent on meat in last 2 years
  • MntFishProducts: Amount spent on fish in last 2 years
  • MntSweetProducts: Amount spent on sweets in last 2 years
  • MntGoldProds: Amount spent on gold in last 2 years
  • NumDealsPurchases: Number of purchases made with a discount
  • AcceptedCmp1: 1 if customer accepted the offer in the 1st campaign, 0 otherwise
  • AcceptedCmp2: 1 if customer accepted the offer in the 2nd campaign, 0 otherwise
  • AcceptedCmp3: 1 if customer accepted the offer in the 3rd campaign, 0 otherwise
  • AcceptedCmp4: 1 if customer accepted the offer in the 4th campaign, 0 otherwise
  • AcceptedCmp5: 1 if customer accepted the offer in the 5th campaign, 0 otherwise
  • Response: 1 if customer accepted the offer in the last campaign, 0 otherwise
  • NumWebPurchases: Number of purchases made through the company’s website
  • NumCatalogPurchases: Number of purchases made using a catalogue
  • NumStorePurchases: Number of purchases made directly in stores
  • NumWebVisitsMonth: Number of visits to company’s website in the last month
  • Z_CostContact: Documentation says “total values” – not sure what that means so we won’t use it.
  • Z_Revenue: Documentation says “total values” – not sure what that means so we won’t use it.

Our idea is to cluster consumers into different consumer personalities. Since we can only use quantitative variables in fuzzy c-means, we remove ID, Year_Birth, Education, Marital_Status, Dt_Customer, Complain, AcceptedCmp1, AcceptedCmp2, AcceptedCmp3, AcceptedCmp4, AcceptedCmp5, Response, Z_CostContact, and Z_Revenue.

market <- market %>%
  select(
    -ID, -Year_Birth, -Education,
    -Marital_Status, -Dt_Customer,
    -Complain, -AcceptedCmp1,
    -AcceptedCmp2, -AcceptedCmp3,
    -AcceptedCmp4, -AcceptedCmp5,
    -Response, -Z_CostContact, -Z_Revenue
  ) %>%
  # omit observations with NAs
  na.omit()
market.scaled <- scale(market)
# we pick the ideal number of clusters
error1 <- tribble(
  ~fhv, ~apd, ~pd, ~xb,
  ~fs, ~pc, ~pe, ~pre, ~si
)
for (i in 2:11) {
  market_cmeans <- cmeans(market.scaled,
    centers = i,
    dist = "euclidean"
  )
  error1 <- rbind(
    error1,
    as_tibble(t(
      fclustIndex(market_cmeans, market.scaled)
    ))
  )
}

error_scaled <- as_tibble(scale(error1))

error_scaled$num_clust <- 2:11

to_plot <- error_scaled %>%
  pivot_longer(
    -c(num_clust, pre),
    names_to = "variable",
    values_to = "value"
  )

ggplot(
  to_plot,
  aes(
    x = num_clust,
    y = value,
    color = variable
  )
) +
  geom_line() +
  geom_point() +
  labs(
    title = "Measures of cluster validity",
    subtitle = "For the market dataset, from the 'factoextra' package",
    x = "Number of clusters",
    y = "Normalized indices",
    color = "Index"
  ) +
  theme(text = element_text(family = "Lato", size = 14))

These indices suggest 5 as a possibly optimal cluster size.

market_cmeans <- cmeans(
  market.scaled,
  centers = 5,
  dist = "euclidean"
)
wag <- c()
for (i in 1:20) {
  x1 <- fuzzy.CM(market.scaled, K = i, m = 2, max.iteration = 100, threshold = 1e-5, RandomNumber = 1234)
  dis <- x1@distance %*% t(x1@member) / (450) # average distance
  print(sum(diag(dis)))
  wag[i] <- sum(diag(dis))
}

And plot our results:

# use tibble for ggplot
wag <- tibble(wag = wag)
wag$k <- 1:20

ggplot(
  wag,
  aes(x = k, y = wag)
) +
  geom_line() +
  labs(
    title = "Weighted Average Distance to Centroids",
    subtitle = "For the market dataset, using in-house elbow method",
    x = "Number of clusters",
    y = "Weighted avg. dist.",
  ) +
  theme(text = element_text(family = "Lato", size = 14))

From the graph, we observe that although the elbow point occurs around \(k = 5\), the weighted average distances barely change as \(k\) increases. The difference between weighted average distances at \(k = 1\) (around 69.2) and at \(k = 15\) (around 68.75) is less than 5% of the total.
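
We can make that statement concrete by computing the relative drop across the whole range of \(k\); the exact value depends on the run, but it should be small.

# relative drop in weighted average distance from the smallest to the largest value
(max(wag$wag) - min(wag$wag)) / max(wag$wag)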

fviz_cluster(
  list(
    data = market.scaled,
    cluster = market_cmeans$cluster
  ),
  ellipse.type = "norm",
  ellipse.level = .7,
  repel = TRUE,
) +
  labs(
    title = "Cluster plot",
    subtitle = "For the market dataset, using fuzzy c-means",
  ) +
  theme(text = element_text(family = "Lato", size = 14))

Using the corrplot() function from the package of the same name, we can visualize the membership probabilities for each observation. Here, the first 10 observations are shown.

# select 10 from the top
head(market_cmeans$membership, n = 10) %>%
  corrplot(
    # this is a membership matrix
    # not a correlation one
    # so set is.corr to false
    is.corr = FALSE
  )

Note: Observations are on the rows, and clusters are on the columns. Larger probabilities of belonging to a certain cluster are shown by larger and darker circles.
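
Following the cutoff idea from earlier, we can also count how many customers have no dominant cluster; the 0.5 cutoff below is our arbitrary choice.

# customers whose largest membership is below 0.5 have no dominant cluster
ambiguous <- which(apply(market_cmeans$membership, 1, max) < 0.5)
length(ambiguous)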

Our analysis suggests that the dataset is fairly homogeneous with respect to the variables we selected, as the weighted average distance to the centroids barely changes as \(k\) increases.

Discussion/Conclusion

The fuzzy c-means clustering method is quite similar to the \(k\)-means clustering algorithm we’ve learned in class. Both are non-hierarchical clustering methods that start with random assignment to clusters and then re-cluster by minimizing the within cluster variation.

Again, one major difference between \(k\)-means clustering and fuzzy c-means clustering is that \(k\)-means clustering is a hard clustering method: it assigns each data point to one and only one cluster. \(k\)-means works well with non-overlapping datasets where each data point clearly belongs to a single cluster, while fuzzy c-means works better with overlapping datasets.

Advantages

The main advantage of fuzzy c-means clustering is that it enables interpretation of complex relationships where observations may belong to more than one cluster. Thus, it is best used in overlapping datasets.

Disadvantages

Its first disadvantage is that the arbitrary choice of cluster number may not work well when there is little information available on how many clusters there should be; it is harder to pick the ideal number of clusters.

Meanwhile, the complexity of the algorithm makes it difficult to apply this method to large datasets: our experiment showed that \(k\)-means clustering took less than one second for 60 iterations, while fuzzy c-means clustering took over 2 minutes. The exact time difference may change based on the environment and packages used, but both our experiment and comparative studies such as Neto, Souza, and Ribeiro (2020) show that fuzzy c-means clustering is usually significantly slower than \(k\)-means clustering.

If categories in the dataset do not overlap, then the results of fuzzy c-means can be harder to interpret than the results of \(k\)-means, since observations are classified as belonging to more than one cluster.

With a lower value of the stopping threshold \(\varepsilon\), we can get a better clustering result, but at the expense of more iterations.

Finally, fuzzy c-means can be harder to interpret than \(k\)-means when observations belong to more than one class.

Similarities to \(k\)-means

The following applies to both c-means and \(k\)-means:

  • Minimizes intra-cluster variance
  • Euclidean distance measures can unequally weight underlying factors, so scaling is needed.
  • The algorithm converges to a local minimum of the objective function, so it may stop at a sub-optimal result.
  • Results depend on the initial (random) choice of weights. This may be mitigated through repeated runs (as sketched below), but less so for c-means given its longer run times.
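
A minimal sketch of that mitigation for c-means is to keep the best of a few random starts, using the objective value that cmeans() stores in withinerror; fits and best_fit are our own names.

# keep the best of several random starts (smallest objective value)
fits <- lapply(1:5, function(s) {
  set.seed(s)
  cmeans(iris.scaled, centers = 3, m = 2)
})
best_fit <- fits[[which.min(sapply(fits, function(f) f$withinerror))]]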

In conclusion, fuzzy c-means is a clustering method similar to \(k\)-means clustering, best used for analyzing overlapping datasets, albeit at the expense of longer run times. Its additional hyperparameters allow users to customize the results and better apply their understanding of the dataset.

We recommend using fuzzy c-means when one believes the dataset to be an overlapping one, and when one has a reasonable estimate of the fuzziness and the number of clusters. The performance of different degrees of fuzziness and numbers of clusters can also be explored using the aforementioned metrics (e.g. the weighted elbow method).

If one believes that the dataset is non-overlapping and has an estimate of the cluster count, we would recommend \(k\)-means clustering or another hard clustering method. If one wants to analyze a completely unknown dataset, we would recommend also trying hierarchical clustering methods, which do not require the number of clusters to be chosen in advance. Deep-learning clustering techniques could be useful as well, although interpreting such a black box would be much harder.

Bibliography

Bezdek, James C. 1987. Pattern Recognition with Fuzzy Objective Function Algorithms. Plenum.
Bezdek, James C., Robert Ehrlich, and William Full. 1984. “FCM: The Fuzzy c-Means Clustering Algorithm.” Computers & Geosciences 10 (January): 191–203. https://doi.org/10.1016/0098-3004(84)90020-7.
Cebeci, Zeynel, Figen Yildiz, Alper Tuna Kavlak, Cagatay Cebeci, and Hasan Onder. 2020. “Ppclust: Probabilistic and Possibilistic Cluster Analysis.” CRAN. https://cran.r-project.org/web/packages/ppclust/.
Dias, Madson. 2022. “Fuzzy-c-Means.” PyPI. https://pypi.org/project/fuzzy-c-means/.
F, Achmad Fauzi Bagus, and Setia Pramana. 2016. “Advclust: Object Oriented Advanced Clustering.” CRAN. https://cran.r-project.org/web/packages/advclust/.
Gath, I., and A. B. Geva. 1989. “Unsupervised Optimal Fuzzy Clustering.” IEEE Transactions on Pattern Analysis and Machine Intelligence 11 (July): 773–80. https://doi.org/10.1109/34.192473.
Ghosh, Soumi, and Sanjay Kumar Dubey. 2013. “Comparative Analysis of k-Means and Fuzzy c-Means Algorithms.” International Journal of Advanced Computer Science and Applications 4: 35–39. https://doi.org/10.1.1.683.5131.
Hubs, Research. n.d. “Unsupervised Learning - Fuzzy c-Means.” researchhubs.com. Accessed May 17, 2022. https://researchhubs.com/post/ai/fundamentals/fuzzy-c-means.html.
James, Gareth, Daniela Witten, Trevor Hastie, and Robert Tibshirani. 2021. “ISLR: Data for an Introduction to Statistical Learning with Applications in R.” CRAN. https://cran.rstudio.com/web/packages/ISLR/.
Kassambara, Alboukadel. n.d. “Cmeans() r Function: Compute Fuzzy Clustering.” Datanovia. Accessed May 17, 2022. https://www.datanovia.com/en/lessons/fuzzy-clustering-essentials/cmeans-r-function-compute-fuzzy-clustering/.
Kassambara, Alboukadel, and Fabian Mundt. 2020. “Factoextra: Extract and Visualize the Results of Multivariate Data Analyses.” CRAN. https://cran.r-project.org/web/packages/factoextra/.
Kaushik, Saurav. 2019. “An Introduction to Clustering & Different Methods of Clustering.” Analytics Vidhya. https://www.analyticsvidhya.com/blog/2016/11/an-introduction-to-clustering-and-different-methods-of-clustering/.
Kent, University of. 2016. “Team:kent/Model - 2016.igem.org.” 2016.igem.org. https://2016.igem.org/Team:Kent/Model.
Kumar, Satyam. 2021. “Fuzzy c-Means Clustering —Is It Better Than k-Means Clustering?” Medium. https://towardsdatascience.com/fuzzy-c-means-clustering-is-it-better-than-k-means-clustering-448a0aba1ee7.
Malik, Farhad. 2019. “Machine Learning Hard Vs Soft Clustering.” FinTechExplained. https://medium.com/fintechexplained/machine-learning-hard-vs-soft-clustering-dc92710936af.
Meyer, David, Evgenia Dimitriadou, Kurt Hornik, Andreas Weingessel, Friedrich Leisch, Chih-Chung Chang, and Chih-Chen Lin. 2021. “E1071: Misc Functions of the Department of Statistics, Probability Theory Group (Formerly: E1071), TU Wien.” CRAN. https://cran.r-project.org/web/packages/e1071/.
Neto, João, Layse Souza, and Admilson Ribeiro. 2020. “Comparative Analysis Between the k-Means and Fuzzy c-Means Algorithms to Detect UDP Flood DDoS Attack on a SDN/NFV Environment.” Proceedings of the 16th International Conference on Web Information Systems and Technologies. https://doi.org/10.5220/0010176201050112.
Patel, Akash. 2021. “Customer Personality Analysis.” Kaggle. https://www.kaggle.com/datasets/imakash3011/customer-personality-analysis.
Rajprasad, Dhivya. 2017. “RPubs - Unsupervised Techniques: Clustering.” rpubs.com. https://rpubs.com/dhivya89/Clustering.
SAS. 2011. “Base SAS(r) 9.3 Procedures Guide: Statistical Procedures.” support.sas.com. http://support.sas.com/documentation/cdl/en/procstat/63963/HTML/default/viewer.htm#procstat_corr_sect021.htm.
Sites, Google. n.d. “Data Clustering Algorithms - Fuzzy c-Means Clustering Algorithm.” sites.google.com. Accessed 2022. https://sites.google.com/site/dataclusteringalgorithms/fuzzy-c-means-clustering-algorithm.
Soage, José Carlos. 2021. “Scatter Plot with Ellipses in Ggplot2.” R CHARTS | A collection of charts; graphs made with the R programming language. https://r-charts.com/correlation/scatter-plot-ellipses-ggplot2/.
Xie, X. L., and G. Beni. 1991. “A Validity Measure for Fuzzy Clustering.” IEEE Transactions on Pattern Analysis and Machine Intelligence 13: 841–47. https://doi.org/10.1109/34.85677.
Yoshiyuki, Fukuyama, and Sugeno Michio. 1989. “A New Method of Choosing the Number of Clusters for Fuzzy c-Means Method.” Proceedings of the 5th Fuzzy Systems Symposium, June, 247–52. https://cir.nii.ac.jp/crid/1572261551570694400.