Loads the necessary packages for the vignettes.

library(future) # allows parralel processing in greed()
library(greed)
library(aricode) # for ari computation
library(dplyr) # data wrangling
library(tidyr)
library(purrr)
library(ggpubr) # advanced ploting 
library(ggplot2)
library(ggwordcloud)

library(careless) # survey pre-processing

set.seed(2134)

future::plan("multisession", workers=5) # may be increased 

The greed package and DLVM framework allows the clustering of categorical data. This vignette describes typical use cases of the greed() function in this context, and illustrates its use on real datasets.

The model

We are interested in the clustering of categorical datasets, which are typically found in survey data or item response theory (ITR). In this context, we observe \(n\) individuals described by \(p\) variables, taking one among \(d_j\) modalities for each variable \(j\). Such datasets are typically represented using a one-hot-encoding of each factor in a design matrix \(\mathbf{X} \in \{0,1\}^{n \times d}\) where \(d = \sum_{j=1}^p d_j\). Latent class analysis (LCA) is a generative model for categorical data clustering which posits conditional independence of the factor variables conditionally on the (unknown) partition. Below is a description of its Bayesian formulation with the use of proper conjugate priors \[ \begin{equation} \begin{aligned} \pi&\sim \textrm{Dirichlet}(\alpha),\\ \forall k, \forall j, \quad \theta_{kj} &\sim \textrm{Dirichlet}_{d_j}(\beta), \\ Z_i&\sim \mathcal{M}_K(1,\pi),\\ \forall j=1, \ldots, p, \quad X_{ij}|Z_{ik}=1 &\sim \mathcal{M}_{d_j}(1, \theta_{kj}),\\ \end{aligned} \end{equation} \] For each cluster \(k\) and variable \(j\), the vector \(\theta_{kj}\) represents the probability of each of the \(d_j\) modalities. With the above choice of priors, the LCA model admits an exact ICL expression similar to the mixture of multinomials derived in the supplementary materials of Côme et. al. (2021, Section 3) .

The greed package implements this model via the Lca model-class and the default values for hyper-parameters are set as follows:

  • \(\alpha\) is set to 1
  • \(\beta\) is set to 1, defaulting to an uninformative uniform prior on \(\theta_{kj}\)

This default values may be changed and we refer to the ?Lca documentation for further details.

Analysis of two real datasets

We now illustrate the use of the greed package on two real datasets:

  1. Mushroom data from UCI Machine Learning Repository describing 8124 mushrooms with 22 phenotype variables. Each mushroom is classified as “edible” or “poisonous” and the goal is to recover the mushroom class from its phenotype.
  2. Young people survey data from Miroslav Sabo and available on the Kaggle platform. This is an authentic example of questionnaire data where Slovakian young people (15-30 years old) were asked musical preferences according to different genres (rock, hip-hop, classical, etc.).

These two datasets come attached with greed package.

Mushroom data

data("mushroom")

We begin by forming the necessary vectors for analysis. The data has \(n=8124\) rows and \(p=23\) columns. The first column contains the poisonous status of each mushroom with two possible values, “p” for “poisonous” and “e” for edible, it will serve as the clustering we seek to recover.

mushroom[,1:10] %>% head() %>% knitr::kable()
edibility V2 V3 V4 V5 V6 V7 V8 V9 V10
p x s n t p f c n k
e x s y t a f c b k
e b s w t l f c b n
p x y w t p f c n n
e x s g f n f w b k
e x y y t a f c b n

The remaining variables are used for clustering. Note that we only use a subset of the data for illustration purpose here.

X = mushroom[,-1]
subset =sample(1:nrow(X), size = 1000)
label = mushroom$edibility[subset]

Clustering

The clustering is done via the main function greed() with argument model set to LCA and the genetic hybrid algorithm for ICL maximization.

Note: The Lca model may only be used with datasets stored in a data.frame object containing factors only. When such data are provided to the greed() function, an Lca model is picked by default. To perform the clustering, it is therefore sufficient to call greed with the prepared data.frame.

sol_mush<-greed(X[subset,], model=Lca())
#> 
#> ── Fitting a LCA model ──
#> 
#> ℹ Initializing a population of 20 solutions.
#> ℹ Generation 1 : best solution with an ICL of -12696 and 13 clusters.
#> ℹ Generation 2 : best solution with an ICL of -12612 and 13 clusters.
#> ℹ Generation 3 : best solution with an ICL of -12603 and 13 clusters.
#> ℹ Generation 4 : best solution with an ICL of -12603 and 13 clusters.
#> ── Final clustering ──
#> 
#> ── Clustering with a LCA model 12 clusters and an ICL of -12589

The hybrid genetic algorithm found a solution with \(K=12\) clusters which is quite over-segmented. But, it display a good separation among edible and poisonous mushrooms:

table(label, clustering(sol_mush)) %>% knitr::kable()
1 2 3 4 5 6 7 8 9 10 11 12
e 0 0 11 27 32 27 92 0 74 11 7 221
p 173 209 3 9 0 0 0 46 0 37 21 0

Partition’s NMI is 0.29 which is explained by the over-segmentation of the solution compared to the \(2\)-class problem.

Analysis of the hierarchy

Exploring the dendrogram provided by the hierarchical algorithm is quite useful in this case.

plot(sol_mush, type='tree')
Dendogram obtained with greed with the Lca model on the Mushroom dataset.

Dendogram obtained with greed with the Lca model on the Mushroom dataset.

We clearly see a hierarchical structure appearing with \(K=2\) or \(K=4\) main clusters. Thus, we can cut the tree at this height and look at the solution.

sol2 = cut(sol_mush, 2)
table(label, clustering(sol2)) %>% knitr::kable()
1 2
e 0 502
p 382 116

Here, we clearly see that the order of merges is consistent with the labels, and the final ARI is 0.57. While, some poisonous mushrooms have been categorized as edible, this is the consequence of the way the labels have been set, since mushrooms for which the edibility status was unknown were classified as poisonous by default. While this choice is reasonable from a strict health perspective. Furthermore, as the data documentation specifies, ’‘’’. Thus, the unsupervised problem is hard and the obtained clustering is already satisfactory. Moreover, this illustrates the interest of having the hierarchical algorithm in order to access coarser partitions.

Young people survey data

data("Youngpeoplesurvey")

Data preparation

We begin by preprocessing the data, only keeping the categorical variable. The original dataset has \(n=1010\) respondents for \(p=19\). We keep only the feature related to the musical taste of the respondent and remove potential strike of identical responses. Eventually, the data are cast to factors with an explicit levels for the missing responses.



selected = Youngpeoplesurvey %>% 
  mutate(string = longstring(.)) %>%
  mutate(sel = if_else(string <= 10,TRUE,FALSE) ) %>% pull(sel)

Xnum = Youngpeoplesurvey %>% 
  filter(all_of(selected) )

X = Xnum %>% 
  mutate_all(function(x){
    x[is.na(x)]="NA"
    factor(x,levels=c(1:5,"NA"))
    }) %>%
  droplevels()

Hierarchical clustering

As previously the clustering is obtained with a call to greed and the Lca generative model will be taken by default since the dataset is a data.frame with factors only. We demonstrate here how to increase a little bit the exploration capabilities of the optimization algorithm by increasing the population size and the probability of mutation.

sol=greed(X, alg = Hybrid(pop_size = 50,prob_mutation = 0.5))
#> 
#> ── Fitting a LCA model ──
#> 
#> ℹ Initializing a population of 50 solutions.
#> ℹ Generation 1 : best solution with an ICL of -27704 and 15 clusters.
#> ℹ Generation 2 : best solution with an ICL of -27680 and 12 clusters.
#> ℹ Generation 3 : best solution with an ICL of -27529 and 9 clusters.
#> ℹ Generation 4 : best solution with an ICL of -27453 and 7 clusters.
#> ℹ Generation 5 : best solution with an ICL of -27435 and 9 clusters.
#> ℹ Generation 6 : best solution with an ICL of -27430 and 9 clusters.
#> ℹ Generation 7 : best solution with an ICL of -27411 and 7 clusters.
#> ℹ Generation 8 : best solution with an ICL of -27408 and 7 clusters.
#> ℹ Generation 9 : best solution with an ICL of -27395 and 7 clusters.
#> ℹ Generation 10 : best solution with an ICL of -27395 and 7 clusters.
#> ── Final clustering ──
#> 
#> ── Clustering with a LCA model 6 clusters and an ICL of -27393

The algorithm found \(K=6\) clusters which are quite balanced. To explore the results, we plot the dendogram found.

plot(sol,type='tree')
Dendogram obtained with greed with the Lca model on the Young people survey dataset.

Dendogram obtained with greed with the Lca model on the Young people survey dataset.

Clustering analysis

We may also used the marginals plot to depict the conditional probabilities of the different responses knowing the clusters assignment.

plot(sol,type='marginals')
Marginals plot of the clustering obtained with greed on the Young adult data.

Marginals plot of the clustering obtained with greed on the Young adult data.

The clusters appear to be quite different and coherent. For example, the seventh group (second row, in brown) display a strong tendency to like Opera and Classical music and dislike more modern genres (Techno, Trance, Punk or Hi-Hop), whereas the 4th group (top row, in pink) corresponds to music lovers with quite eclectic musical taste across the proposed survey’s genres.

In addition, the conditional maximum a posteriori estimates of \(\theta_{kj}\) are available through the coef() method.

params = coef(sol)

And look for example at the cluster distributions for the Country music style:

params$Thetak$Country %>% knitr::kable(digits=2)
1 2 3 4 5 NA
cluster1 0.55 0.31 0.11 0.02 0.00 0.00
cluster2 0.61 0.21 0.11 0.04 0.02 0.01
cluster3 0.26 0.32 0.25 0.11 0.04 0.01
cluster4 0.18 0.13 0.13 0.24 0.31 0.00
cluster5 0.18 0.34 0.20 0.22 0.06 0.00
cluster6 0.22 0.46 0.24 0.07 0.01 0.00

Cluster structures may also be visualized with a word clouds of feature names. In this representation, a color encodes if the feature has an average score greater than the average score of the whole population. A big blue feature corresponds to music type that scores higher than the mean in this cluster and big red feature to music type that have a smaller score than the mean in the corresponding group.

# get the Lca MAP parameters estimates
params = coef(sol)

# compute the mean score per cluster and features
means_scores = lapply(params$Thetak,function(x){
  apply(x[,1:5],1,function(r){
    sum(r*1:5)
    })
  })

# move to long format
means_scores_long = do.call(rbind,
                            map2(means_scores, names(means_scores),function(x,y){
                              tibble(cluster=1:K(sol),mean=x,var=y)
                              }))  %>% 
  mutate(var = gsub("\\."," ",var))

# compute the average score for the whole population
means_scores_glob = Xnum %>% 
  summarise_all(function(x){mean(x,na.rm=TRUE)}) %>% 
  tidyr::pivot_longer(all_of(1:ncol(Xnum)),names_to = "var",) %>% 
  mutate(var = gsub("[,-/]"," ",var))

# compute the differences between clusters scores and averages scores
gg = means_scores_long %>% 
  left_join(means_scores_glob) %>% 
  mutate(dm=mean-value) 
#> Joining, by = "var"


# create the word clouds (we kept only the music genre that depart from the average for each cluster)
ggplot(gg %>% filter(abs(dm)>0.1), aes(label = var, size = abs(dm),color=dm)) +
  geom_text_wordcloud() +
  scale_size_area(max_size = 7) +
  theme_minimal() +
  scale_color_gradient2(guide="none")+
  facet_wrap(~cluster)
Word cloud visualization of the greed clustering on the Young adult survey data.

Word cloud visualization of the greed clustering on the Young adult survey data.

Using such a visualization allow to easily describe the different groups. The first cluster almost exclusively likes Pop and Dance music, the second Opera and Classical music. The third cluster has higher score for almost all music genres and the third is close to the global average score and so on.