Newest 'clustering' Questions

0 votes

0 answers

23 views

Modeling recurring monthly transactions with weekend-shift effects: DBSCAN vs rule-based temporal detection?

I have 3 months of categorized bank transaction data and need to identify recurring cash inflows and outflows for lending risk modeling. Complications: 1. Income dates shift earlier when payday falls ...

Awande Ntombela

1

asked Nov 19 at 9:09

0 votes

0 answers

33 views

Role of Z-Tests in Kernel Density Estimation for Cluster Classification

In a recent bioinformatics paper, the authors describe a statistical/machine learning approach to classify clusters of cells using kernel density estimation (KDE) and Z-scores. While the details of ...

Michiel.Tawdarous

1

asked Nov 18 at 9:33

1 vote

1 answer

50 views

Vector direction of individual clusters after PCA

Suppose I have two multi-dimensional population samples - $A$ and $B$. I hypothesise that $\mathbb{E}[A]$ and $\mathbb{E}[B]$ are orthogonal in this high-dimensional space. To test this hypothesis, I ...

sunnydk

127

asked Nov 3 at 18:16

1 vote

0 answers

32 views

Supervised Clustering Algorithms / Full Graph Edge Prediction Algorithms

I have an interesting problem I am trying to solve and I cannot find any non-deep methods available to solve it. Problem description (plain): The real-life problem this relates to is handwritten digits ...

Ryan Folks

149

asked Oct 28 at 21:38

2 votes

1 answer

46 views

Pattern analysis for time between events data

I am trying to subset data based on a pattern of "strings" or clusters of food deliveries to young that I see in my data (see plots labeled 2, 4, 5, 6, and 8 in the figure below for the most ...

thegrayson

23

asked Oct 23 at 17:57

0 votes

0 answers

27 views

How to identify and quantify main tendencies across participants from cluster membership heatmaps?

I'd appreciate your thoughts on the following problem. I've created a heatmap plot (attached) showing the cluster membership ratio for each participant (in separate subplots) and condition (η). Now, I'...

maria mystakidou

1

asked Oct 23 at 10:08

2 votes

1 answer

122 views

Examining country-level effects based on individual-level data combined with country-level data

I am new to working with country-level effects in comparative OLS regression with individual-level data. Are there any good resources for this? Suppose my dependent variable is social integration (an ...

Olestan

71

asked Sep 30 at 11:45

0 votes

0 answers

44 views

Are there clustering algorithms or preprocessing strategies tailored for zero-inflated and continuous data types?

I am currently working on a project where I need to assign customers across N recipes before A/B testing such that KPIs for each customer are balanced across recipes (to reduce pre-test bias). Dataset ...

Rishab

1

asked Sep 26 at 6:09

0 votes

0 answers

57 views

How to perform clustering on heavily right-skewed and zero-inflated data

I am currently working on clustering continuous variables (such as AOV, RPV, and conversions (conversion/visits)). The variables are heavily right-skewed with long tails and one variable is dominated ...

Rishab

1

asked Sep 24 at 12:43

3 votes

1 answer

129 views

Bayesian Clustering with a Finite Gaussian Mixture Model with Missing Data

I would like to perform clustering with a finite Gaussian Mixture model, however, I have missing data (some features are missing at random). I am using Variational Inference to fit my Bayesian GMM. Is ...

Tom

1,112

asked Sep 4 at 16:36

2 votes

0 answers

72 views

Estimating number of clusters using Scikit Bayesian GMM

I am generating clustering data using the Bayesian mixture of Gaussian models described in Bishop's Pattern Recognition and Machine Learning textbook, with model parameters drawn from the following ...

PJB

21

asked Aug 9 at 7:01

1 vote

1 answer

59 views

Mixture-Based Clustering for Ordered Stereotype Model - Distance Scores

I have a 5-variable/3 category-level ordinal survey data set, e.g. 5 health variables ranked 1-3 (good-moderate-poor). I want to row-cluster different responses. But I also want to determine whether ...

EB3112

264

asked Aug 8 at 9:48

1 vote

0 answers

54 views

Are equal and diagonal variance matrices implicitly assumed in k-means clustering?

When applying k-means clustering, I understand that the goal is to partition the dataset by assigning each point to its nearest cluster center. However, I’ve come across statements that k-means can be ...

EngineerMathlover

153

asked Jul 7 at 17:30

1 vote

0 answers

72 views

"How to validate if a dataset has natural clusters?"

I've recently learnt unsupervised learning methods such as KMeans and DBSCAN. While working on this dataset, I applied KMeans clustering but faced the following issues: The Elbow Method showed no ...

ssmalik

41

asked Jun 24 at 7:43

0 votes

1 answer

60 views

Data cross validation to predict label from cluster analysis [closed]

My project has the following steps: Use elbow method to determine the features and number of clusters for kmeans. Run kmeans on the data (with determined features and n clusters), and gives the ...

Xin Niu

103

asked Jun 17 at 22:34

0 votes

0 answers

28 views

What is the interval of values of the CDbw index for clustering internal evaluation?

I'm currently studying the CDbw (Compose Density between and within clusters) index, which is a metric designed for internal clustering evaluation. The original article of this index was published in ...

DavideChicco.it

742

asked Jun 16 at 9:52

0 votes

0 answers

92 views

How can UMAP improve HDBSCAN clustering results when it also uses nearest neighbors i.e., clustering, internally

I went through UMAP's official documentation, which says that HDBSCAN, being a density-based algorithm, suffers from the curse of dimensionality, and that reducing dimensions with UMAP can improve the results. But! ...

Shradha

1

asked Jun 13 at 14:43

0 votes

0 answers

50 views

Cluster Trajectories in LCGA and GMM: Stable Levels vs. Directional Trends

I am currently performing latent class growth analysis (LCGA) and growth mixture modeling (GMM) to identify distinct subgroups within my study population based on the longitudinal trajectories of a ...

Konstantinos Gkirgkiris

473

asked Jun 8 at 10:16

4 votes

2 answers

114 views

Clustering based on the longitudinal trajectory of a single continuous variable

I am currently working on a longitudinal dataset in which I aim to cluster individuals based on the trajectory of a single continuous variable measured repeatedly across time (e.g., daily values). The ...

Konstantinos Gkirgkiris

473

asked Jun 2 at 16:28

0 votes

0 answers

10 views

How to improve inter-group performance - machine learning for sleep state prediction using EEG data [duplicate]

Short explanation of the problem I've been working on a project with avian EEG data — I'm trying to predict the birds' sleep state using already generated labels for this data set. Now, the ...

m0n74g3

1

asked May 18 at 8:56

0 votes

0 answers

83 views

K-means clustering 1D proof for intervals

I am supposed to prove that, given sorted data points such that $X_1 \leq X_2 \leq \dots \leq X_n$, in an optimal cluster assignment each cluster corresponds to some interval of points. Or in other words - ...

user123_pls

1

asked May 8 at 18:18

2 votes

1 answer

76 views

Simultaneous clustering of two matrices

I have two matrices of species abundances from two different types of organisms. I would like to cluster sites where these species co-occur based on the abundances of both types of organisms ...

Bobby Davis

51

asked Apr 22 at 22:34

4 votes

1 answer

138 views

What are other usages of t-SNE/UMAP, other than simply visualizing?

I have learned the basic concepts and algorithms behind t-SNE and UMAP. I have read some posts that say one cannot use the dataset after t-SNE and UMAP to do clustering. Instead, one can only perform ...

user398751

asked Apr 20 at 15:50

3 votes

1 answer

76 views

What will happen if DBSCAN chooses a noise point as the initial point?

I am learning about DBSCAN, and I’m wondering what happens if it chooses a noise point as the initial point. I know that if a point satisfies the two conditions related to epsilon and minPts, it will ...

Olivia

191

asked Apr 11 at 2:26

0 votes

0 answers

41 views

Is analyzing test scores a clustering problem or an EDA problem?

I have a dataset of 28 personality assessment features, which measure personality attributes like Diligence or Sociability to determine performance in the corporate workplace. I'm tasked with ...

Michael Tran

1

asked Apr 8 at 8:21

3 votes

0 answers

64 views

What clustering methods handle zero-inflated and continuous variables together?

I'm relatively new to data science and currently working on a project to group global cities based on exposure to various climate hazards. I've sourced climate data from GCMs participating in CMIP6 as ...

wobre

31

asked Apr 2 at 23:46

0 votes

0 answers

65 views

What is this algorithm for identifying an optimal number of clusters in HCA?

I'm using some software that does (among other things) hierarchical clustering and automatically chooses a number of clusters to use if one is not specified. I wanted to know what method it is using ...

thposs

123

asked Mar 24 at 14:49

1 vote

0 answers

118 views

Different results clustering with vcovHC vs. vcovCL

I'm conducting a panel analysis in R and would like to control for clustering at the individual level. I've run two-way fixed effects models using both lm() and plm(). These models produce identical ...

Eddie

11

asked Mar 19 at 22:36

1 vote

1 answer

321 views

Cluster analysis with Gower distance

I have a dataset that includes both numeric and categorical variables, and I want to perform cluster analysis. Thus, I chose the Gower distance as the distance metric. Next, I perform agglomerative ...

Elena O.

51

asked Mar 10 at 21:35

0 votes

0 answers

34 views

Measure to score elements of a set that can form clusters

Apologies in advance if this is the wrong place to ask this question. I'm trying to find a suitable measure to use to score elements of a set. Each member of the set has a geographical location and ...

gkaminski

1

asked Feb 23 at 22:24

0 votes

0 answers

29 views

Is there a known theoretical or practical proof that higher object detection performance leads to greater clustering accuracy?

I’m working on an image-based object detection problem where I’ve noticed a correlation: improvements in object detection performance (as measured by standard metrics such as mAP or IoU) appear to ...

Sabah Anis

1

asked Feb 23 at 4:34

2 votes

0 answers

29 views

Appropriate to use group-based trajectory modeling for time-unstructured data?

I have a longitudinal dataset where the assessment times vary across participants (time-unstructured data). I would like to identify clusters of individuals with similar developmental trajectories. ...

May

21

asked Feb 16 at 21:40

5 votes

2 answers

602 views

How can I use unsupervised methods to recommend an “ideal” number of managers for companies when no labels exist?

I have a dataset of around 100,000 companies. For each company, I have a bunch of features such as: Number of employees, Number of customers, Number of complaints, other additional company attributes ...

B_fig

63

asked Feb 13 at 12:57

1 vote

0 answers

76 views

Clustering Based on Transition Probabilities

I have panel data of workers' occupational histories, in which every worker's occupation is indicated for every time period. I am looking to cluster occupations into groups. The idea is to establish ...

ykkoca

23

asked Feb 11 at 22:08

1 vote

0 answers

55 views

Do you know of any dataset with this behavior when applying t-SNE? [closed]

I'm looking for a dataset that exhibits this behavior when applying t-SNE. t-SNE is a dimensionality reduction algorithm that can sometimes separate data points that originally belong to the same ...

greffao

11

asked Feb 10 at 20:33

0 votes

0 answers

32 views

What is the best analytic technique for analyzing "select all that apply" data where participants are asked to assign an experience to 1+ term?

I am trying to find the best approach to analyzing the following data: Participants (n~500) were given a list of 50+ experiences/symptoms/disorders and were asked to assign those with one of 5 terms (...

Lindsay

1

asked Feb 7 at 22:27

2 votes

0 answers

87 views

K-means cost function

In Elements of Statistical Learning (ESL), they state in equation (14.31) that the k-Means objective function is $$W(C) = \sum_{k=1}^KN_k \sum_{C(i)=k} ||x_i - \bar{x}_k||^2$$ where $K$ is the number ...

pseudo-goldstone

121

asked Feb 5 at 22:32

0 votes

0 answers

51 views

Biclustering from a graph or an existing distance matrix

If we have a set of measurements, e.g. gene expression in various conditions, two approaches are common: compute distances between columns (e.g. similarity between gene expression profiles), and ...

Alexlok

187

asked Feb 1 at 21:55

1 vote

1 answer

88 views

Are there advantages in extracting patterns (i.e. clustering) on Partial Least Square latent space?

I am using Partial Least Square in order to obtain linear model parameters in case of correlated covariates. I would like to try clustering in the Partial Least Square latent space, that is the space ...

LearningAlgorithm

227

asked Jan 30 at 17:27

9 votes

3 answers

1k views

What is a good approach to show my data only belongs to one cluster?

I hope the question is not stupid, but after a long search I have not found a satisfactory answer. I have a question about how to proceed if I want to test whether my data is from just one cluster or ...

David

101

asked Jan 13 at 10:27

1 vote

0 answers

54 views

Confidence intervals for stratified sampling and variable-sized clusters

I am working with survey data that involve two ordinal response variables that I wish to cross-tabulate and see the margins of (with CIs), and two different sets of weights: Stratified sampling: Different sampling ...

dzeltzer

151

asked Jan 12 at 0:25

0 votes

0 answers

105 views

Pickands-Balkema-De Haan theorem, Dependence and Shuffling

The Pickands-Balkema-de Haan theorem states that the conditional excess distribution function is well approximated by the generalized Pareto distribution (for high excesses and if the underlying RV's ...

bee14

1

asked Jan 7 at 12:53

0 votes

0 answers

46 views

Computing/calculating SSB (sum of squares between clusters) for k-modes

I am trying to calculate metrics for validating K-modes performance. I am doing my thesis and I need to group patients based on binary variables (diseases). I recently read that SSW and SSB ...

Cristina Muntañola

1

asked Dec 17, 2024 at 13:41

3 votes

1 answer

219 views

Clustering in 5-point Likert type dataset

I want to cluster the data collected with a 5-point Likert scale. But I couldn't understand which method is more accurate to use. I searched the literature but couldn't find a clear answer. Can you ...

ali

63

asked Dec 7, 2024 at 4:18

2 votes

0 answers

77 views

How to cluster based on x and y coordinates

I am trying to identify rows in groups of points using clustering algorithms. The bigger picture problem I'm trying to solve is to identify shelves given x and y coordinates of products. I can cluster ...

Tommy Wolfheart

121

asked Dec 4, 2024 at 13:17

0 votes

0 answers

24 views

Should I average data for agglomerative hierarchical clustering (AHC)?

I was conducting an experiment where I measured the response of wheat cultivars to pathogen inoculation. The experiment was repeated three times, with two reps each time. Two disease parameters were ...

user449708

1

asked Nov 30, 2024 at 14:02

0 votes

0 answers

96 views

Missing values in data set before DBSCAN

My goal is to identify bots and fraudulent users for an application. Ideally, this would be a regression problem where users are rated on a continuous scale. I have 4 tables that cover different ...

Burger

1

asked Nov 20, 2024 at 20:59

0 votes

0 answers

50 views

Identify predictors for clustering output?

I have a dataset with variables collected years ago, and many variables collected this year as outcome variables. I want to combine all the variables collected this year to get one outcome, e.g. ...

NPpsy

43

asked Nov 15, 2024 at 15:20

0 votes

1 answer

66 views

Index of spatial variability

I have a geographical area, divided in municipalities. Each municipality has the count of a disease occurrences. The procedure is replicated four times, for four diseases (we can call them A, B, C, D)....

Luke

161

asked Nov 11, 2024 at 12:09

2 votes

2 answers

691 views

Where is the inflection point here in this elbow chart?

Where do you think is the inflection point on this chart?

pdxtrader8888

21

asked Oct 27, 2024 at 21:48
