
I want to obtain a small optimal value of $k$ (with $k \le 5$) for $k$-means clustering on a dataset of size $5000$. I have used the BIC and the Gap statistic to determine the optimal number of clusters, and both methods indicated an optimal $k$ of $7$ or more. I would like to know whether I can make an adjustment (for example, by multiplying $s(k+1)$ by a factor $c > 1$ in the Gap criterion, or by weighting the penalty term in the BIC) so that I obtain a smaller optimal value of $k$.

The following are the calculations I have performed to compute the Gap statistic and the BIC:

BIC calculation $$\text{BIC}=n\ln\left(\frac{W}{n}\right)+m\ln(n)$$ where $W$ is the total within-cluster sum of squares, $m$ is the number of free parameters in the model, and $n$ is the total number of data points. The optimal $k$ is the one with the smallest BIC value.
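As a concrete illustration, here is a minimal sketch of this BIC criterion on toy data (three Gaussian blobs), using a plain Lloyd's k-means written from scratch. It assumes $m = kd$ free parameters (one $d$-dimensional centroid per cluster) and adds an optional multiplier $c$ on the penalty term, of the kind asked about above; all names here are made up for the example, not taken from any library:

```python
import numpy as np

rng = np.random.default_rng(0)

def kmeans(X, k, n_iter=50):
    # plain Lloyd's algorithm; enough for illustration
    centers = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(n_iter):
        labels = ((X[:, None, :] - centers) ** 2).sum(-1).argmin(1)
        centers = np.array([X[labels == j].mean(0) if np.any(labels == j)
                            else centers[j] for j in range(k)])
    return labels, centers

def bic(X, labels, centers, c=1.0):
    # BIC = n ln(W/n) + c * m ln(n); c = 1 is the criterion in the
    # question, c > 1 inflates the penalty and favours smaller k
    n, d = X.shape
    W = ((X - centers[labels]) ** 2).sum()  # total within-cluster SS
    m = len(centers) * d                    # assumed: k centroids, d dims each
    return n * np.log(W / n) + c * m * np.log(n)

# toy data: three well-separated blobs in 2-D
X = np.vstack([rng.normal(mu, 0.3, size=(100, 2))
               for mu in [(0, 0), (4, 0), (0, 4)]])
fits = {k: kmeans(X, k) for k in range(1, 6)}
best_plain = min(fits, key=lambda k: bic(X, *fits[k]))
best_heavy = min(fits, key=lambda k: bic(X, *fits[k], c=5.0))
```

Since the penalty $m\ln(n)$ is increasing in $k$, inflating it with $c > 1$ can only move the argmin towards smaller $k$ (or leave it unchanged); it never makes the selected $k$ larger.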

Gap statistic $$\text{Gap}(k)\ge\text{Gap}(k+1)-s(k+1)$$ where $\text{Gap}(k)=\frac{1}{B}\sum_{b=1}^{B}\ln(W_k^{*(b)})-\ln(W_k)$. The chosen $k$ is the smallest one satisfying this inequality.

Here $W_k$ is the within-cluster dispersion for $k$ clusters, and $W_k^{*(b)}$ is the within-cluster dispersion, for $k$ clusters, of the $b^\text{th}$ reference dataset (out of $B$ reference datasets generated from a distribution with no apparent clustering).
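The Gap computation and selection rule above can be sketched as follows. This is a minimal illustration, assuming uniform reference datasets drawn over the data's bounding box, $B = 10$ reference sets, and $s(k) = \mathrm{sd}_k\sqrt{1 + 1/B}$ as in Tibshirani et al.; the factor $c$ implements the relaxation asked about in the question, and all function names are invented for the example:

```python
import numpy as np

rng = np.random.default_rng(1)

def within_dispersion(X, k, n_iter=50):
    # W_k from a basic Lloyd's k-means fit
    centers = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(n_iter):
        labels = ((X[:, None, :] - centers) ** 2).sum(-1).argmin(1)
        centers = np.array([X[labels == j].mean(0) if np.any(labels == j)
                            else centers[j] for j in range(k)])
    return ((X - centers[labels]) ** 2).sum()

def gap_statistic(X, k_max=5, B=10):
    lo, hi = X.min(0), X.max(0)
    gaps, s = [], []
    for k in range(1, k_max + 1):
        log_W = np.log(within_dispersion(X, k))
        # B uniform reference datasets over the bounding box (assumption)
        log_W_ref = np.array([np.log(within_dispersion(
            rng.uniform(lo, hi, size=X.shape), k)) for _ in range(B)])
        gaps.append(log_W_ref.mean() - log_W)
        s.append(log_W_ref.std() * np.sqrt(1 + 1 / B))
    return np.array(gaps), np.array(s)

def choose_k(gaps, s, c=1.0):
    # smallest k with Gap(k) >= Gap(k+1) - c * s(k+1); c > 1 relaxes
    # the rule and tends to select smaller k
    for k in range(len(gaps) - 1):
        if gaps[k] >= gaps[k + 1] - c * s[k + 1]:
            return k + 1
    return len(gaps)

X = np.vstack([rng.normal(mu, 0.3, size=(100, 2))
               for mu in [(0, 0), (4, 0), (0, 4)]])
gaps, s = gap_statistic(X)
k_plain = choose_k(gaps, s)
k_relaxed = choose_k(gaps, s, c=3.0)
```

Because $s(k+1) \ge 0$, any $k$ that satisfies the standard rule also satisfies the relaxed one, so $c > 1$ can only select the same or a smaller $k$.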

Please let me know if I have made any errors, whether there are other methods that yield small cluster numbers, or what modifications I could make.

  • Generally it's more difficult for any clustering algorithm to determine both the clusters and how many clusters there are than it is to pre-specify the number of clusters. That's certainly the case with k-means. What, then, is the actual algorithm you are using that won't permit you to specify 5 clusters in advance? Commented Feb 11 at 12:58
  • It does, but I expected that using an optimal-k selection method would provide additional support for my chosen number of clusters. Commented Feb 11 at 13:02
  • Why don't you simply pick a number you are comfortable with? If you don't like the results of an algorithm, why do you think you need to look for a different rule that will give you what you want, instead of simply using what you think makes sense? (Why do you like a smaller number better?) Commented Feb 11 at 13:06

1 Answer

"I didn't get the answer I want, so let me keep fishing for one that gives me what I want" is not good practice. If you want the optimal number of clusters by some criterion, and you chose two criteria and they both said "7", then... well, the optimal number is 7.

But, there's nothing sacrosanct about "optimal". Cluster analysis methods are messy. However, rather than searching for some other method, or arbitrarily altering well-known formulas, I think it's fine to say:

"Five clusters were chosen because XXXXX" and give your reasons.

In one of my favorite stats books, *Statistics as Principled Argument*, the author (Robert Abelson) argues that the key question is often not "what can I do?" or "can I do XXX?" but "what can I justify?"

Might someone who reads your paper reject that justification? Sure. But you have to make a principled argument. Statistics can be part of that argument, but they certainly do not have to be the whole of it.

