Questions tagged [clustering]
Cluster analysis is the task of partitioning data into subsets of objects according to their mutual "similarity," without using preexisting knowledge such as class labels. [Clustered-standard-errors and/or cluster-samples should be tagged as such; do NOT use the "clustering" tag for them.]
4,046 questions
0
votes
0
answers
23
views
Modeling recurring monthly transactions with weekend-shift effects: DBSCAN vs rule-based temporal detection?
I have 3 months of categorized bank transaction data and need to identify recurring cash inflows and outflows for lending risk modeling.
Complications:
1. Income dates shift earlier when payday falls ...
0
votes
0
answers
33
views
Role of Z-Tests in Kernel Density Estimation for Cluster Classification
In a recent bioinformatics paper, the authors describe a statistical/machine learning approach to classify clusters of cells using kernel density estimation (KDE) and Z-scores. While the details of ...
1
vote
1
answer
50
views
Vector direction of individual clusters after PCA
Suppose I have two multi-dimensional population samples - $A$ and $B$.
I hypothesise that $\mathbb{E}[A]$ and $\mathbb{E}[B]$ are orthogonal in this high-dimensional space.
To test this hypothesis, I ...
1
vote
0
answers
32
views
Supervised Clustering Algorithms / Full Graph Edge Prediction Algorithms
I have an interesting problem I am trying to solve and I cannot find any non-deep methods available to solve it.
Problem Description
Plain
The real-life problem this relates to is handwritten digits ...
2
votes
1
answer
46
views
Pattern analysis for time between events data
I am trying to subset data based on a pattern of "strings" or clusters of food deliveries to young that I see in my data (see plots labeled 2, 4, 5, 6, and 8 in the figure below for the most ...
0
votes
0
answers
27
views
How to identify and quantify main tendencies across participants from cluster membership heatmaps?
I'd appreciate your thoughts on the following problem.
I've created a heatmap plot (attached) showing the cluster membership ratio for each participant (in separate subplots) and condition (η).
Now, I'...
2
votes
1
answer
122
views
Examining country-level effects based on individual-level data combined with country-level data
I am new to working with country-level effects in comparative OLS regression with individual-level data. Are there any good resources for this?
Suppose my dependent variable is social integration (an ...
0
votes
0
answers
44
views
Are there clustering algorithms or preprocessing strategies tailored for zero-inflated and continuous data types?
I am currently working on a project where I need to assign customers across N recipes before A/B testing such that KPIs for each customer are balanced across recipes (to reduce pre-test bias).
Dataset ...
0
votes
0
answers
57
views
How to perform clustering on heavily right-skewed and zero-inflated data
I am currently working on clustering continuous variables (such as AOV, RPV, and conversions (conversion/visits)). The variables are heavily right-skewed with long tails and one variable is dominated ...
3
votes
1
answer
129
views
Bayesian Clustering with a Finite Gaussian Mixture Model with Missing Data
I would like to perform clustering with a finite Gaussian Mixture model, however, I have missing data (some features are missing at random). I am using Variational Inference to fit my Bayesian GMM. Is ...
2
votes
0
answers
72
views
Estimating number of clusters using Scikit Bayesian GMM
I am generating clustering data using the Bayesian mixture of Gaussian models described in Bishop's Pattern Recognition and Machine Learning textbook, with model parameters drawn from the following ...
1
vote
1
answer
59
views
Mixture-Based Clustering for Ordered Stereotype Model - Distance Scores
I have a 5-variable/3 category-level ordinal survey data set. E.g. 5 health variables ranked 1-3 (good-moderate-poor).
I want to row-cluster different responses. But I also want to determine whether ...
1
vote
0
answers
54
views
Are equal and diagonal variance matrices implicitly assumed in k-means clustering?
When applying k-means clustering, I understand that the goal is to partition the dataset by assigning each point to its nearest cluster center. However, I’ve come across statements that k-means can be ...
1
vote
0
answers
72
views
"How to validate if a dataset has natural clusters?"
I've recently learnt unsupervised learning methods such as KMeans and DBSCAN.
While working on this dataset, I applied KMeans clustering but faced the following issues: The Elbow Method showed no ...
0
votes
1
answer
60
views
Data cross validation to predict label from cluster analysis [closed]
My project has the following steps:
Use elbow method to determine the features and number of clusters for kmeans.
Run kmeans on the data (with determined features and n clusters), and gives the ...
0
votes
0
answers
28
views
What is the interval of values of the CDbw index for clustering internal evaluation?
I'm currently studying the CDbw (Compose Density between and within clusters) index, which is a metric designed for internal clustering evaluation.
The original article of this index was published in ...
0
votes
0
answers
92
views
How can UMAP improve HDBSCAN clustering results when it also uses nearest neighbors (i.e., clustering) internally?
I went through UMAP's official documentation, which says that HDBSCAN, being a density-based algorithm, suffers from the curse of dimensionality and that reducing dimensions with UMAP can improve the results. But! ...
0
votes
0
answers
50
views
Cluster Trajectories in LCGA and GMM: Stable Levels vs. Directional Trends
I am currently performing latent class growth analysis (LCGA) and growth mixture modeling (GMM) to identify distinct subgroups within my study population based on the longitudinal trajectories of a ...
4
votes
2
answers
114
views
Clustering based on the longitudinal trajectory of a single continuous variable
I am currently working on a longitudinal dataset in which I aim to cluster individuals based on the trajectory of a single continuous variable measured repeatedly across time (e.g., daily values). The ...
0
votes
0
answers
10
views
How to improve inter-group performance - machine learning for sleep state prediction using EEG data [duplicate]
Short explanation of the problem
I've been working on a project with avian EEG data — I'm trying to predict the birds' sleep state using already generated labels for this data set. Now, the ...
0
votes
0
answers
83
views
K-means clustering 1D proof for intervals
I am supposed to prove that, given sorted data points such that $X_1 \leq X_2 \leq \dots \leq X_n$, in an optimal cluster assignment each cluster corresponds to some interval of points.
Or in other words - ...
2
votes
1
answer
76
views
Simultaneous clustering of two matrices
I have two matrices of species abundances from two different types of organisms. I would like to cluster sites where these species co-occur based on the abundances of both types of organisms ...
4
votes
1
answer
138
views
What are other usages of t-SNE/UMAP, other than simply visualizing?
I have learned the basic concepts and algorithms behind t-SNE and UMAP.
I have read some posts that say one cannot use the dataset after t-SNE and UMAP to do clustering.
Instead, one can only perform ...
3
votes
1
answer
76
views
What will happen if DBSCAN chooses a noise point as the initial point?
I am learning about DBSCAN, and I’m wondering what happens if it chooses a noise point as the initial point. I know that if a point satisfies the two conditions related to epsilon and minPts, it will ...
0
votes
0
answers
41
views
Is analyzing test scores a clustering problem or an EDA problem?
I have a dataset of 28 personality assessment features, which measure personality attributes like Diligence or Sociability to determine performance in the corporate workplace. I'm tasked with ...
3
votes
0
answers
64
views
What clustering methods handle zero-inflated and continuous variables together?
I'm relatively new to data science and currently working on a project to group global cities based on exposure to various climate hazards. I've sourced climate data from GCMs participating in CMIP6 as ...
0
votes
0
answers
65
views
What is this algorithm for identifying an optimal number of clusters in HCA?
I'm using some software that does (among other things) hierarchical clustering and automatically chooses a number of clusters to use if one is not specified. I wanted to know what method it is using ...
1
vote
0
answers
118
views
Different results clustering with vcovHC vs. vcovCL
I'm conducting a panel analysis in R and would like to control for clustering at the individual level. I've run two-way fixed effects models using both lm() and plm(). These models produce identical ...
1
vote
1
answer
321
views
Cluster analysis with Gower distance
I have a dataset that includes both numeric and categorical variables, and I want to perform cluster analysis. Thus, I chose the Gower distance as the distance metric. Next, I perform agglomerative ...
0
votes
0
answers
34
views
Measure to score elements of a set that can form clusters
Apologies in advance if this is the wrong place to ask this question.
I'm trying to find a suitable measure to use to score elements of a set. Each member of the set has a geographical location and ...
0
votes
0
answers
29
views
Is there a known theoretical or practical proof that higher object detection performance leads to greater clustering accuracy?
I’m working on an image-based object detection problem where I’ve noticed a correlation: improvements in object detection performance (as measured by standard metrics such as mAP or IoU) appear to ...
2
votes
0
answers
29
views
Appropriate to use group-based trajectory modeling for time-unstructured data?
I have a longitudinal dataset where the assessment times vary across participants (time-unstructured data). I would like to identify clusters of individuals with similar developmental trajectories.
...
5
votes
2
answers
602
views
How can I use unsupervised methods to recommend an “ideal” number of managers for companies when no labels exist?
I have a dataset of around 100,000 companies. For each company, I have a bunch of features such as:
Number of employees,
Number of customers,
Number of complaints,
other additional company attributes ...
1
vote
0
answers
76
views
Clustering Based on Transition Probabilities
I have panel data of workers' occupational histories, in which every worker's occupation is indicated for every time period. I am looking to cluster occupations into groups. The idea is to establish ...
1
vote
0
answers
55
views
Do you know of any dataset with this behavior when applying t-SNE? [closed]
I'm looking for a dataset that exhibits this behavior when applying t-SNE. t-SNE is a dimensionality reduction algorithm that can sometimes separate data points that originally belong to the same ...
0
votes
0
answers
32
views
What is the best analytic technique for analyzing "select all that apply" data where participants are asked to assign an experience to 1+ term?
I am trying to find the best approach to analyzing the following data: Participants (n~500) were given a list of 50+ experiences/symptoms/disorders and were asked to assign those with one of 5 terms (...
2
votes
0
answers
87
views
K-means cost function
In Elements of Statistical Learning (ESL), they state in equation (14.31) that the k-Means objective function is
$$W(C) = \sum_{k=1}^KN_k \sum_{C(i)=k} ||x_i - \bar{x}_k||^2$$
where $K$ is the number ...
0
votes
0
answers
51
views
Biclustering from a graph or an existing distance matrix
If we have a set of measurements, e.g. gene expression in various conditions, two approaches are common:
compute distances between columns (e.g. similarity between gene expression profiles), and ...
1
vote
1
answer
88
views
Are there advantages in extracting patterns (i.e. clustering) in the Partial Least Squares latent space?
I am using Partial Least Squares to obtain linear model parameters in the case of correlated covariates.
I would like to try clustering in the Partial Least Squares latent space, that is the space ...
9
votes
3
answers
1k
views
What is a good approach to show my data only belongs to one cluster?
I hope the question is not stupid, but after a long search I have not found a satisfactory answer. I have a question about how to proceed if I want to test whether my data is from just one cluster or ...
1
vote
0
answers
54
views
Confidence intervals for stratified sampling and variable-sized clusters
I am working with survey data that involve two ordinal response variables, which I wish to cross-tabulate and see the margins of (with CIs), and two different sets of weights:
Stratified sampling: Different sampling ...
0
votes
0
answers
105
views
Pickands-Balkema-De Haan theorem, Dependence and Shuffling
The Pickands-Balkema-de Haan theorem states that the conditional excess distribution function is well approximated by the generalized Pareto distribution (for high excesses and if the underlying RV's ...
0
votes
0
answers
46
views
Computing/calculating SSB (sum of squares between clusters) for k-modes
I am trying to calculate metrics for validating the performance of k-modes. I am doing my thesis and I need to group patients based on binary variables (diseases). I recently read that SSW and SSB ...
3
votes
1
answer
219
views
Clustering in 5-point Likert type dataset
I want to cluster the data collected with a 5-point Likert scale. But I couldn't understand which method is more accurate to use. I searched the literature but couldn't find a clear answer. Can you ...
2
votes
0
answers
77
views
How to cluster based on x and y coordinates
I am trying to identify rows in groups of points using clustering algorithms. The bigger picture problem I'm trying to solve is to identify shelves given x and y coordinates of products. I can cluster ...
0
votes
0
answers
24
views
Should I average data for agglomerative hierarchical clustering (AHC)?
I was conducting an experiment where I measured the response of wheat cultivars to pathogen inoculation. The experiment was repeated three times, with two reps each time. Two disease parameters were ...
0
votes
0
answers
96
views
Missing values in data set before DBSCAN
My goal is to identify bots and fraudulent users for an application. Ideally, this would be a regression problem where users are rated on a continuous scale. I have 4 tables that cover different ...
0
votes
0
answers
50
views
Identify predictors for clustering output?
I have a dataset with variables collected years ago, and many variables collected this year as outcome variables. I want to combine all the variables collected this year to get one outcome, e.g. ...
0
votes
1
answer
66
views
Index of spatial variability
I have a geographical area, divided into municipalities. Each municipality has the count of a disease's occurrences. The procedure is replicated four times, for four diseases (we can call them A, B, C, D). ...
2
votes
2
answers
691
views
Where is the inflection point here in this elbow chart?
Where do you think is the inflection point on this chart?