I am generating synthetic clustering data from the Bayesian mixture of Gaussians model described in Bishop's Pattern Recognition and Machine Learning, with the model parameters drawn from the following priors:
$p(\boldsymbol{\pi}) = \text{Dir}(\boldsymbol{\pi} \mid \alpha)$
$p(\boldsymbol{\mu}_k, \boldsymbol{\Sigma}_k) = \mathcal{N}(\boldsymbol{\mu}_k \mid \boldsymbol{m}, \beta^{-1} \boldsymbol{\Sigma}_k)\; \mathcal{W}^{-1}(\boldsymbol{\Sigma}_k \mid \boldsymbol{W}^{-1}, \nu).$
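Concretely, once $\boldsymbol{\pi}$ and $\{\boldsymbol{\mu}_k, \boldsymbol{\Sigma}_k\}$ have been drawn from these priors, each observation is generated in the usual finite-mixture fashion (writing $z_n$ for the latent assignment of point $n$; this notation is my own shorthand):
$z_n \sim \mathrm{Cat}(\boldsymbol{\pi}), \qquad \boldsymbol{x}_n \mid z_n = k \;\sim\; \mathcal{N}(\boldsymbol{x}_n \mid \boldsymbol{\mu}_k, \boldsymbol{\Sigma}_k).$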
My goal is to generate synthetic data from this model and then estimate the number of components $K$ in the dataset using Scikit-learn's BayesianGaussianMixture. To do this, I fit the model for a range of candidate $K$ values, compute the ELBO minus $\ln(K!)$ for each, and select the $K$ that maximizes this quantity (following the method proposed by Bishop, PRML p. 484).
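In symbols, writing $\mathcal{L}_K$ for the converged ELBO of the model fitted with $K$ components, I pick
$\hat{K} = \arg\max_K \left[ \mathcal{L}_K - \ln(K!) \right].$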
However, when I apply this method, BayesianGaussianMixture consistently favors larger numbers of components and does a poor job of recovering the true $K$. Here is a minimal reproducible example, which also produces a plot of the scores:
import numpy as np
from scipy.stats import dirichlet, multivariate_normal, invwishart
import random
from sklearn.mixture import BayesianGaussianMixture
import math
import matplotlib.pyplot as plt
def generate_data(weight_concentration_prior=1, mean_precision_prior=0.01, features=5,
                  mean_prior=0.0, degrees_of_freedom_prior=5, covariance_prior=np.eye(5)):
    np.random.seed(42)
    random.seed(42)
    n_components = 5
    seq_len = np.random.randint(100, 500)
    # Sample mixture weights from the Dirichlet prior
    weights = np.random.dirichlet(np.ones(n_components) * weight_concentration_prior)
    # Assign points to components according to the weights
    counts = np.random.multinomial(seq_len, weights)
    # Sample cluster parameters from the Normal-inverse-Wishart prior
    means = []
    covariances = []
    for _ in range(n_components):
        Sigma = invwishart.rvs(df=degrees_of_freedom_prior, scale=covariance_prior)
        mu = np.random.multivariate_normal(np.full(features, mean_prior), Sigma / mean_precision_prior)
        means.append(mu)
        covariances.append(Sigma)
    # Sample points from the resulting GMM
    X = []
    y = []
    for k in range(n_components):
        n_k = counts[k]
        X_k = np.random.multivariate_normal(means[k], covariances[k], size=n_k)
        X.extend(X_k)
        y.extend(np.full(n_k, k))
    return np.array(X), np.array(y), n_components
weight_concentration_prior = 1
mean_precision_prior = 0.01
features = 5
mean_prior = 0.0
degrees_of_freedom_prior = 5
covariance_prior = np.eye(5)

X, y, true_components = generate_data()
max_components = 15
elbos = []
for component in range(1, max_components + 1):
    model = BayesianGaussianMixture(n_components=component,
                                    weight_concentration_prior_type='dirichlet_distribution',
                                    weight_concentration_prior=weight_concentration_prior,
                                    mean_prior=[mean_prior] * features,
                                    mean_precision_prior=mean_precision_prior,
                                    degrees_of_freedom_prior=degrees_of_freedom_prior,
                                    covariance_prior=covariance_prior,
                                    random_state=42)
    model.fit(X)
    lower_bound = model.lower_bound_
    # Score each candidate K by the ELBO penalized with ln(K!)
    curr_elbo = lower_bound - np.log(math.factorial(component))
    elbos.append(curr_elbo)

plt.scatter(np.arange(1, max_components + 1), elbos)
plt.show()

print(f"Actual clusters: {true_components}")
print(f"Predicted clusters: {np.argmax(elbos) + 1}")
Different random seeds give similar results.
My questions are:
- Is there a mismatch between the priors used in my data generation process and the priors in BayesianGaussianMixture?
- Is Bishop's method of determining the number of components not applicable here?
