Evaluate Clustering

Created: 2022-04-29 15:26
#note

We need metrics that work when the truth labels are not known.

External validation

The methods used to implement this measure need ground truths.
Examples:

  • Measures based on matching:
    • Purity: Cluster $C_i$ contains points only from one ground truth partition;
    • Maximum matching: based on the assumption that only one cluster can match one partition, in this case pairwise matching is used, i.e. one element belongs to only one cluster;
    • F-measures: Precision, Recall;
  • Entropy measures: it represents the amount of orderness of the information in all the partitions → $H(C)=-\sum_{i=1}^r PC_i\log PC_i$, where $PC_i =\dfrac{n_i}{n}$. Some examples are:
    • Conditional entropy: given that $PC_i$ represents the probability of cluster $C_i$, the entropy of partitioning T for ground truth j for k groups is given by $H(T)=-\sum_{j=1}^k PT_i\log PT_j$. Then the entropy of T w.r.t. cluster $C_i$ (how ground truths are distributed among each cluster) is represented by $H(T|C_i)=-\sum_{j=1}^r (\dfrac{n_{ij}}{n_i})\log(\dfrac{n_{ij}}{n_i})$ and T's conditional entropy with respect to C clustering: $H(T|C)=-\sum_{i=1}^r (\dfrac{n_i}{n})H(T|C_i)$;
    • Normalized Mutual Information;
  • Pairwise measures -> TN, TP, FN, FP, Jaccard similarity, Rand statistic ((TP+TN)/N) and Fowlkes Mallow measure (geometric mean of precision and recall).

Internal measures

Ground truths are not needed, based on the ideas of intra-cluster compactness and inter-cluster separation.
Let's define $W(S,R)$ as the summation of weights on all edges having one vertex in cluster S and other vertex in cluster R.
The summation of overall intra-cluster weights over the clusters is: $W_{in}=\dfrac{1}{2} \sum_{i=1}^k W(C_i, C_i)$.
The summation of all the inter-clusters weights is: $W_{out}=\sum_{i=1}^{k-1}\sum_{j>1}W(C_i,C_j)$.
The number of distinct intra-cluster edges is given by: $N_{in}=\sum_{i=1}^k(2^{n_i})$.
The number of distinct inter cluster edges is given by: $N_{out}=\sum_{i=1}^{k-1}\sum_{j=i+1}^{k}n_i n_j$.
Some measures are:

  • Beta-CV measure: it is the ratio of the average intra-cluster distance to the average inter-cluster distance. The smaller the clustering, the better it is. $Beta CV=\dfrac{W_{in}/N_{in}}{W_out/N_{out}}$.
  • Normalized cut: $N_{cut}(A,B) = cut(A,B)(\dfrac{1}{vol(A)}+\dfrac{1}{vol(B)})$, where $vol(A)=\sum_{i \in A}d_i$, $cut(X,Y)=\sum_{i \in A, j \in B}W_{ij}$ where X and Y are sets that divide a graph, the weight of edges connecting vertices in X to Y is minimum, the size of X and Y is very similar and $W_{ij}$ control the neighborhood (and it is equal to $e^{\dfrac{|x_i=x_j|^2}{2\sigma ^ 2}}$). The higher the normalized cut, the better the clustering.

Relative measures

Used to compare disparate clustering usually obtained from the same algorithm changed in some parameters.
Silhouette Coefficient is the most used (it can be also considered as internal measure):
Silhouette coefficient as internal measure -> it checks cluster cohesion and separation. For each point $x_i$ its score $s_i$ is defined as: $s_i=\dfrac{\mu_{out}^{min}(x_i)- \mu_{in}(x_i)}{max{\mu_{out}^{min}(x_i),\mu_{in}(x_i)}}$, where $\mu_{in}(x_i)$ is the mean distance measure from $x_i$ to points in it closest cluster and $\mu_{out}^{min}$ is the mean distance measure from $x_i$ to points in its closest cluster. Then the Silohouette Coefficient is the mean value of s_i across all the points. We have a good cluster for a score close to +1.

Note: The Silhouette Coefficient is generally higher for convex clusters than other concepts of clusters, such as density based clusters like those obtained through DBSCAN.

Human judgment

Observation-based approaches

Evaluation based on looking at the most probable words in the topic, e.g. Word Cloud.

Another approach is called Termite. It consists in a visualization of the term-topic distributions produced by topic models by introducing two calculations:

  • saliency, which identifies words that are more relevant for the topics in which they appear;
  • seriation, which sorts words into more coherent groupings based on the degree of semantic similairty between them.

Termite visualizes clusters in a tabular form, the term-topic matrix, where rows correspond to terms and columns to topics. This matrix shows term distributions for all latent topics using circular areas.
Term saliency is defined as follows, for a given word w, we compute its conditional probability P(T|w) -> likelihood tat observed word w was generated by latent topic T. Instead, P(T) is the likelihood that any randomly-selected word w' was generated by topic T. Then the distinctiveness of word w is defined as the Kullback-Leibler divergence between P(T|w) and P(T) -> $distinctiveness(W)=\sum_T P(T|w) \log(\dfrac{P(T|w)}{P(T)})$. This measures how much the term w is important to generate the topic, compared to the randomly-selected word w', i.e. if a word appears in all the topics, observing the word tells us little about the document's topical mixture -> the word will recceive a low score for metric.
Saliency of a word is the product between the probability of having this word and the distinctiveness score.
We can order topics by index or topic size, whereas term can be ordererd alphabetically, by frequency or using seriation.
The seriation method used in Termite works as follows: an asymmetric similarity measure to account for co-occurrence and collocation likelihood between all pairs of words. Collocation defines the probability that a phrase occurs more often in a corpus than woulb be expected by chance. The likelihood is computed using $G^2$ statistics. Terms are then placed according to their similarity scores by applying the Bond Energy Algorithm. The BEA algorithm is terminated whenever a sorted sub-list with the desired number of terms is generated.

Interpretation-based approaches

These approaches are considered as "gold stardard" for evaluating topic models but they use human judgment. Some methods are described here.
Word intrusion: subjects are presentd with groups of 6 words, 5 of which belong to a given topic and one which does not. They are asked to identify the intruder word, if most of them agree, then the topic is good.
Topic intrusion: subjects have to identify the intruder topic from a group of topics, e.g. given a document, 4 topics are presented, three of which can belong to the topic.

Other metrics

From this paper.

Word Entropy: to evaluate the specificity of a topic, we measure a topic's word diversity using the conditional entropy of word types given a topic: $-\sum_i Pr(w_i|z)\log Pr(w_i|z)$. Topics composed of tokens from a small set of types will have low entropy, while topics more evenly spread out across the whole vocabulary will have high entropy. Extreme entropy scores indicate bad topics, extremely low entropy are overly specialized, while those with extremely high entropy are overly general.
Coherence: to measure the semantic quality of a topic using two word-cooccurrence-based coherence metrics -> it measures whether a topic's words actually occur together (in the working collection or an external collection). Internal coherence : $\sum_i \sum_{j<1} \log \dfrac{D(w_i,w_j)+ \epsilon}{D(w_i)}$, where D refers to the number of documents that contain a word or word-pair. External coherence: $\sum_i \sum_{j<1} \log \dfrac{Pr(w_i,w_j)+ \epsilon}{Pr(w_i)Pr(w_j)}$, where the probabilities are estimated from the number of 25-word sliding windows that contain a word or word-pair in an external corpus. Higher scores are better (for both internal and external metrics). Words that do not appear in the external corpus are ignored and topics with less than 10 attested words are skipped (the presence of skipped topics ia an indicator of model failure).
Exclusivity: a topic model can reach high coherence by repeating a single high-quality topic multiple times, so we can define a metric that consider how exclusive a word w is to a specific topic z: $\dfrac{Pr(w_i|z)}{\sum_{z'} Pr(w_i|z')}$. A word prelevant in many topics will have a low exclusivity score near 0, while a word occurring in a few topics will have a score near 1.
Davies-Bouldin Index: it is defined as the average similarity measure of each cluster with its most similar cluster, where similrity is the ration of within-cluster distances to between-cluster distances -> clusters which are farther apart and less dispersed will lead to a better score. The lower the value the better the clustering performance.
Calinski-Harabasz Index: it is also known as the Variance Ratio Criterion. It is defined as the ratio between the within-cluster dispersion and the between-cluster dispersion. The higher the index the better the performance. The formula is: $s=\dfrac{tr(B_k)}{tr(W_k)} \times \dfrac{n_E-k}{k-1}$, where $tr(B_k)$ is the trace of the between group dispersion matrix and $tr(W_k)$ is the trace of the within-cluster dispersion matrix defined by: $W_k=\sum_{q=1}^k \sum_{x \in C_q}(x-c_q)(x-c_q)^T$ and $B_k=\sum_{q=1}^k n_q(c_q-c_E)(c_q-c_E)^T$.

References

  1. Review-Clustering of high dimensional data
  2. Towards Data Science
  3. Topic modeling evaluation

Code

  1. Sklearn

Tags

#clustering