I am generating synthetic clustering data from the Bayesian mixture of Gaussians model described in Bishop's Pattern Recognition and Machine Learning, with the model parameters drawn from the following priors:
$p(\boldsymbol{\pi}) = \text{Dir}(\boldsymbol{\pi} \mid \alpha)$
$p(\boldsymbol{\mu}_k, \boldsymbol{\Sigma}_k) = \mathcal{N}(\boldsymbol{\mu}_k \mid \boldsymbol{m}, \beta^{-1} \boldsymbol{\Sigma}_k)\; \mathcal{W}^{-1}(\boldsymbol{\Sigma}_k \mid \boldsymbol{W}^{-1}, \nu).$
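Concretely, once $\boldsymbol{\pi}$ and $\{\boldsymbol{\mu}_k, \boldsymbol{\Sigma}_k\}$ have been drawn from these priors, each observation is generated in the usual finite-mixture fashion (writing $z_n$ for the latent assignment of point $n$; this notation is my own shorthand):
$z_n \sim \mathrm{Cat}(\boldsymbol{\pi}), \qquad \boldsymbol{x}_n \mid z_n = k \;\sim\; \mathcal{N}(\boldsymbol{x}_n \mid \boldsymbol{\mu}_k, \boldsymbol{\Sigma}_k).$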
My goal is to generate synthetic data from this model and then estimate the number of components $K$ in the dataset using Scikit-learn's BayesianGaussianMixture. To do this, I fit the model for a range of candidate $K$ values, compute the ELBO minus $\ln(K!)$ for each, and select the $K$ that maximizes this quantity (following the method proposed by Bishop, PRML p. 484).
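In symbols, writing $\mathcal{L}_K$ for the converged ELBO of the model fitted with $K$ components, I pick
$\hat{K} = \arg\max_K \left[ \mathcal{L}_K - \ln(K!) \right].$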
However, when I apply this method, BayesianGaussianMixture consistently favors larger numbers of components and does a poor job of recovering the true $K$. Here is a minimal reproducible example, which also produces a plot of the scores:
import numpy as np
from scipy.stats import dirichlet, multivariate_normal, invwishart
import random
from sklearn.mixture import BayesianGaussianMixture
import math
import matplotlib.pyplot as plt
def generate_data(weight_concentration_prior=1, mean_precision_prior=0.01, features=5,
                  mean_prior=0.0, degrees_of_freedom_prior=5, covariance_prior=np.eye(5)):
    np.random.seed(42)
    random.seed(42)
    n_components = 5
    seq_len = np.random.randint(100, 500)
    # Sample mixture weights from the Dirichlet prior
    weights = np.random.dirichlet(np.ones(n_components) * weight_concentration_prior)
    # Assign points to components according to the weights
    counts = np.random.multinomial(seq_len, weights)
    # Sample cluster parameters from the Normal-inverse-Wishart prior
    means = []
    covariances = []
    for _ in range(n_components):
        Sigma = invwishart.rvs(df=degrees_of_freedom_prior, scale=covariance_prior)
        mu = np.random.multivariate_normal(np.full(features, mean_prior), Sigma / mean_precision_prior)
        means.append(mu)
        covariances.append(Sigma)
    # Sample points from the resulting GMM
    X = []
    y = []
    for k in range(n_components):
        n_k = counts[k]
        X_k = np.random.multivariate_normal(means[k], covariances[k], size=n_k)
        X.extend(X_k)
        y.extend(np.full(n_k, k))
    return np.array(X), np.array(y), n_components
weight_concentration_prior = 1
mean_precision_prior = 0.01
features = 5
mean_prior = 0.0
degrees_of_freedom_prior = 5
covariance_prior = np.eye(5)

X, y, true_components = generate_data()
max_components = 15
elbos = []
for component in range(1, max_components + 1):
    model = BayesianGaussianMixture(n_components=component,
                                    weight_concentration_prior_type='dirichlet_distribution',
                                    weight_concentration_prior=weight_concentration_prior,
                                    mean_prior=[mean_prior] * features,
                                    mean_precision_prior=mean_precision_prior,
                                    degrees_of_freedom_prior=degrees_of_freedom_prior,
                                    covariance_prior=covariance_prior,
                                    random_state=42)
    model.fit(X)
    lower_bound = model.lower_bound_
    # Score each candidate K by the ELBO penalized with ln(K!)
    curr_elbo = lower_bound - np.log(math.factorial(component))
    elbos.append(curr_elbo)

plt.scatter(np.arange(1, max_components + 1), elbos)
plt.show()

print(f"Actual clusters: {true_components}")
print(f"Predicted clusters: {np.argmax(elbos) + 1}")
Different random seeds give similar results.
My questions are:
- Is there a mismatch between the priors used in my data generation process and the priors in BayesianGaussianMixture?
- Is Bishop's method of determining the number of components not applicable here?
