
I am generating clustering data from the Bayesian mixture of Gaussians described in Bishop's Pattern Recognition and Machine Learning, with model parameters drawn from the following priors:

[Figure: Bayesian plate diagram of the model]

$p(\boldsymbol{\pi}) = \text{Dir}(\boldsymbol{\pi} \mid \alpha)$

$p(\boldsymbol{\mu}_k, \boldsymbol{\Sigma}_k) = \mathcal{N}(\boldsymbol{\mu}_k \mid \boldsymbol{m}, \beta^{-1} \boldsymbol{\Sigma}_k)\; \mathcal{W}^{-1}(\boldsymbol{\Sigma}_k \mid \boldsymbol{W}^{-1}, \nu).$
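
For reference, the corresponding generative process for each observation (standard for this model, and what the code below implements) is

$z_n \mid \boldsymbol{\pi} \sim \text{Cat}(\boldsymbol{\pi}), \qquad \boldsymbol{x}_n \mid z_n = k \sim \mathcal{N}(\boldsymbol{\mu}_k, \boldsymbol{\Sigma}_k).$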

My goal is to generate synthetic data from this model and then estimate the number of components $K$ with scikit-learn's BayesianGaussianMixture. To do this, I fit the model for a range of candidate $K$ values, compute the ELBO minus $\ln(K!)$ for each fit, and select the $K$ that maximizes this quantity, following the method proposed by Bishop (PRML, p. 484).
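
In symbols, writing $\mathcal{L}_K$ for the variational lower bound obtained with $K$ components, the selection rule is

$\hat{K} = \arg\max_{K} \left[ \mathcal{L}_K - \ln(K!) \right],$

where the $\ln(K!)$ term accounts for the $K!$ equivalent parameter settings obtained by relabeling the components.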

However, when I apply this method, BayesianGaussianMixture consistently favors larger numbers of components and recovers the true $K$ poorly. Here is a minimal reproducible example, with the resulting plot below:

import math

import matplotlib.pyplot as plt
import numpy as np
from scipy.stats import invwishart
from sklearn.mixture import BayesianGaussianMixture


def generate_data(weight_concentration_prior=1.0, mean_precision_prior=0.01, features=5,
                  mean_prior=0.0, degrees_of_freedom_prior=5, covariance_prior=np.eye(5)):
    np.random.seed(42)
    n_components = 5  # true K
    seq_len = np.random.randint(100, 500)

    # Sample mixing weights from the Dirichlet prior
    weights = np.random.dirichlet(np.ones(n_components) * weight_concentration_prior)

    # Assign points to components according to the weights
    counts = np.random.multinomial(seq_len, weights)

    # Sample cluster parameters:
    # Sigma_k ~ InvWishart(W^{-1}, nu), mu_k | Sigma_k ~ N(m, Sigma_k / beta)
    means, covariances = [], []
    for _ in range(n_components):
        Sigma = invwishart.rvs(df=degrees_of_freedom_prior, scale=covariance_prior)
        mu = np.random.multivariate_normal(np.full(features, mean_prior),
                                           Sigma / mean_precision_prior)
        means.append(mu)
        covariances.append(Sigma)

    # Sample observations from the resulting mixture
    X, y = [], []
    for k in range(n_components):
        X.extend(np.random.multivariate_normal(means[k], covariances[k], size=counts[k]))
        y.extend(np.full(counts[k], k))

    return np.asarray(X), np.asarray(y), n_components

weight_concentration_prior = 1.0
mean_precision_prior = 0.01
features = 5
mean_prior = 0.0
degrees_of_freedom_prior = 5
covariance_prior = np.eye(5)

X, y, true_components = generate_data()

max_components = 15
elbos = []
for k in range(1, max_components + 1):
    model = BayesianGaussianMixture(n_components=k,
                                    weight_concentration_prior_type='dirichlet_distribution',
                                    weight_concentration_prior=weight_concentration_prior,
                                    mean_prior=[mean_prior] * features,
                                    mean_precision_prior=mean_precision_prior,
                                    degrees_of_freedom_prior=degrees_of_freedom_prior,
                                    covariance_prior=covariance_prior,
                                    random_state=42)
    model.fit(X)
    # Penalize the ELBO by ln(K!), as in Bishop (PRML, p. 484)
    elbos.append(model.lower_bound_ - np.log(math.factorial(k)))

plt.scatter(np.arange(1, max_components + 1), elbos)
plt.xlabel('number of components K')
plt.ylabel('ELBO - ln(K!)')
plt.show()

print(f"Actual number of components: {true_components}")
print(f"Predicted number of components: {np.argmax(elbos) + 1}")

[Plot: ELBO − ln(K!) versus number of components K]

Different random seeds give similar results. My questions are:

  1. Is there a mismatch between the priors used in my data generation process and the priors in BayesianGaussianMixture?
  2. Is Bishop's method for determining the number of components not applicable here?