
Questions tagged [clustering]

Cluster analysis is the task of partitioning data into subsets of objects according to their mutual "similarity," without using preexisting knowledge such as class labels. [Clustered-standard-errors and/or cluster-samples should be tagged as such; do NOT use the "clustering" tag for them.]

0 votes
0 answers
23 views

I have 3 months of categorized bank transaction data and need to identify recurring cash inflows and outflows for lending risk modeling. Complications: 1. Income dates shift earlier when payday falls ...
Awande Ntombela
0 votes
0 answers
33 views

In a recent bioinformatics paper, the authors describe a statistical/machine learning approach to classify clusters of cells using kernel density estimation (KDE) and Z-scores. While the details of ...
Michiel.Tawdarous
1 vote
1 answer
50 views

Suppose I have two multi-dimensional population samples - $A$ and $B$. I hypothesise that $\mathbb{E}[A]$ and $\mathbb{E}[B]$ are orthogonal in this high-dimensional space. To test this hypothesis, I ...
sunnydk
  • 127
1 vote
0 answers
32 views

I have an interesting problem I am trying to solve, and I cannot find any non-deep methods available to solve it. Problem Description Plain The real-life problem this relates to is handwritten digits ...
Ryan Folks
2 votes
1 answer
46 views

I am trying to subset data based on a pattern of "strings" or clusters of food deliveries to young that I see in my data (see plots labeled 2, 4, 5, 6, and 8 in the figure below for the most ...
thegrayson
0 votes
0 answers
27 views

I'd appreciate your thoughts on the following problem. I've created a heatmap plot (attached) showing the cluster membership ratio for each participant (in separate subplots) and condition (η). Now, I'...
maria mystakidou
2 votes
1 answer
122 views

I am new to working with country-level effects in comparative OLS regression with individual-level data. Are there any good resources for this? Suppose my dependent variable is social integration (an ...
Olestan
  • 71
0 votes
0 answers
44 views

I am currently working on a project where I need to assign customers across N recipes before A/B testing such that KPIs for each customer are balanced across recipes (to reduce pre-test bias). Dataset ...
Rishab
  • 1
0 votes
0 answers
57 views

I am currently working on clustering continuous variables (such as AOV, RPV, and conversions (conversion/visits)). The variables are heavily right-skewed with long tails, and one variable is dominated ...
Rishab
  • 1
3 votes
1 answer
129 views

I would like to perform clustering with a finite Gaussian mixture model; however, I have missing data (some features are missing at random). I am using variational inference to fit my Bayesian GMM. Is ...
Tom
  • 1,112
2 votes
0 answers
72 views

I am generating clustering data using the Bayesian mixture of Gaussian models described in Bishop's Pattern Recognition and Machine Learning textbook, with model parameters drawn from the following ...
PJB
  • 21
1 vote
1 answer
59 views

I have an ordinal survey data set with 5 variables and 3 category levels, e.g. 5 health variables ranked 1-3 (good-moderate-poor). I want to row-cluster different responses. I also want to determine whether ...
EB3112
  • 264
1 vote
0 answers
54 views

When applying k-means clustering, I understand that the goal is to partition the dataset by assigning each point to its nearest cluster center. However, I’ve come across statements that k-means can be ...
EngineerMathlover
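The excerpt above is truncated, so what follows does not try to answer the question. It is only a minimal sketch of the step the excerpt describes (assigning each point to its nearest center), followed by the usual mean update. The function names `assign_to_nearest` and `update_centers` and the toy data are illustrative assumptions, and the update assumes no cluster ends up empty.

```python
import numpy as np

def assign_to_nearest(points, centers):
    """Return, for each point, the index of its nearest center (squared Euclidean)."""
    # Shape (n_points, n_centers): squared distance from every point to every center.
    d2 = ((points[:, None, :] - centers[None, :, :]) ** 2).sum(axis=-1)
    return d2.argmin(axis=1)

def update_centers(points, labels, k):
    """Recompute each center as the mean of its assigned points.

    Assumes every cluster has at least one point (hypothetical simplification).
    """
    return np.stack([points[labels == j].mean(axis=0) for j in range(k)])

# Toy data (made up for illustration): two obvious groups in 2-D.
points = np.array([[0.0, 0.0], [0.0, 1.0], [5.0, 5.0], [6.0, 5.0]])
centers = np.array([[0.0, 0.0], [5.0, 5.0]])

labels = assign_to_nearest(points, centers)
centers = update_centers(points, labels, k=2)
```

Alternating these two steps until the labels stop changing is the standard Lloyd iteration. It only ever decreases the within-cluster sum of squares, which is why it can stop at a local rather than global optimum.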
1 vote
0 answers
72 views

I've recently learnt unsupervised learning methods such as KMeans and DBSCAN. While working on this dataset, I applied KMeans clustering but faced the following issues: The Elbow Method showed no ...
ssmalik
  • 41
0 votes
1 answer
60 views

My project has the following steps: Use the elbow method to determine the features and number of clusters for k-means. Run k-means on the data (with the determined features and n clusters), which gives the ...
Xin Niu
  • 103
0 votes
0 answers
28 views

I'm currently studying the CDbw (Compose Density between and within clusters) index, which is a metric designed for internal clustering evaluation. The original article on this index was published in ...
DavideChicco.it
0 votes
0 answers
92 views

I went through UMAP's official documentation, which says that HDBSCAN, being a density-based algorithm, suffers from the curse of dimensionality and that reducing dimensions with UMAP can improve the results. But! ...
Shradha
0 votes
0 answers
50 views

I am currently performing latent class growth analysis (LCGA) and growth mixture modeling (GMM) to identify distinct subgroups within my study population based on the longitudinal trajectories of a ...
Konstantinos Gkirgkiris
4 votes
2 answers
114 views

I am currently working on a longitudinal dataset in which I aim to cluster individuals based on the trajectory of a single continuous variable measured repeatedly across time (e.g., daily values). The ...
Konstantinos Gkirgkiris
0 votes
0 answers
10 views

Short explanation of the problem: I've been working on a project with avian EEG data, in which I'm trying to predict the birds' sleep state using already generated labels for this data set. Now, the ...
m0n74g3
0 votes
0 answers
83 views

I am supposed to prove that, given sorted data points such that $X_1 \leq X_2 \leq \dots \leq X_n$, in an optimal cluster assignment each cluster corresponds to some interval of points. Or in other words - ...
user123_pls
2 votes
1 answer
76 views

I have two matrices of species abundances from two different types of organisms. I would like to cluster sites where these species co-occur based on the abundances of both types of organisms ...
Bobby Davis
4 votes
1 answer
138 views

I have learned the basic concepts and algorithms behind t-SNE and UMAP. I have read some posts that say one cannot use the dataset after t-SNE and UMAP to do clustering. Instead, one can only perform ...
3 votes
1 answer
76 views

I am learning about DBSCAN, and I’m wondering what happens if it chooses a noise point as the initial point. I know that if a point satisfies the two conditions related to epsilon and minPts, it will ...
Olivia
  • 191
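As a hedged illustration of the situation in the excerpt above, here is a small from-scratch sketch of the textbook DBSCAN loop (not scikit-learn's implementation). It assumes `min_pts` counts the point itself. The names `dbscan` and `relabeled` and the 1-D toy data are made up for the example. The first point visited is not a core point, so it is tentatively marked as noise and later absorbed as a border point once a core point reaches it.

```python
import numpy as np

NOISE, UNVISITED = -1, -2

def dbscan(points, eps, min_pts):
    """Textbook-style DBSCAN; returns labels and the indices that were
    first marked as noise and later relabeled as border points."""
    points = np.asarray(points, dtype=float)
    if points.ndim == 1:
        points = points[:, None]
    n = len(points)
    dists = np.linalg.norm(points[:, None, :] - points[None, :, :], axis=-1)
    # eps-neighborhoods, each including the point itself.
    neighbors = [np.flatnonzero(dists[i] <= eps) for i in range(n)]

    labels = np.full(n, UNVISITED)
    relabeled = []
    cluster = -1
    for i in range(n):
        if labels[i] != UNVISITED:
            continue
        if len(neighbors[i]) < min_pts:
            labels[i] = NOISE  # tentative: may still become a border point
            continue
        cluster += 1
        labels[i] = cluster
        queue = list(neighbors[i])
        while queue:
            j = queue.pop()
            if labels[j] == NOISE:
                # Reachable from a core point: border point, not expanded.
                labels[j] = cluster
                relabeled.append(int(j))
                continue
            if labels[j] != UNVISITED:
                continue
            labels[j] = cluster
            if len(neighbors[j]) >= min_pts:
                queue.extend(neighbors[j])  # j is a core point: keep growing
    return labels, relabeled

# Index 0 is visited first but has only 2 points within eps, so it is not core.
x = [0.0, 0.9, 1.0, 1.1, 1.2, 10.0]
labels, relabeled = dbscan(x, eps=0.95, min_pts=4)
```

In this sketch, starting from a non-core point only produces a provisional noise label. The final label depends on whether a core point later reaches it, which is why the isolated point at 10.0 stays noise while index 0 does not.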
0 votes
0 answers
41 views

I have a dataset of 28 personality assessment features, which measure personality attributes like Diligence or Sociability to determine performance in the corporate workplace. I'm tasked with ...
Michael Tran
3 votes
0 answers
64 views

I'm relatively new to data science and currently working on a project to group global cities based on exposure to various climate hazards. I've sourced climate data from GCMs participating in CMIP6 as ...
wobre
  • 31
0 votes
0 answers
65 views

I'm using some software that does (among other things) hierarchical clustering and automatically chooses a number of clusters to use if one is not specified. I wanted to know what method it is using ...
thposs
  • 123
1 vote
0 answers
118 views

I'm conducting a panel analysis in R and would like to control for clustering at the individual level. I've run two-way fixed effects models using both lm() and plm(). These models produce identical ...
Eddie
  • 11
1 vote
1 answer
321 views

I have a dataset that includes both numeric and categorical variables, and I want to perform cluster analysis. Thus, I choose the Gower distance as the distance metric. Next, I perform agglomerative ...
Elena O.
0 votes
0 answers
34 views

Apologies in advance if this is the wrong place to ask this question. I'm trying to find a suitable measure to use to score elements of a set. Each member of the set has a geographical location and ...
gkaminski
0 votes
0 answers
29 views

I’m working on an image-based object detection problem where I’ve noticed a correlation: improvements in object detection performance (as measured by standard metrics such as mAP or IoU) appear to ...
Sabah Anis
2 votes
0 answers
29 views

I have a longitudinal dataset where the assessment times vary across participants (time-unstructured data). I would like to identify clusters of individuals with similar developmental trajectories. ...
May
  • 21
5 votes
2 answers
602 views

I have a dataset of around 100,000 companies. For each company, I have a bunch of features such as: Number of employees, Number of customers, Number of complaints, other additional company attributes ...
B_fig
  • 63
1 vote
0 answers
76 views

I have panel data of workers' occupational histories, in which every worker's occupation is indicated for every time period. I am looking to cluster occupations into groups. The idea is to establish ...
ykkoca
  • 23
1 vote
0 answers
55 views

I'm looking for a dataset that exhibits this behavior when applying t-SNE. t-SNE is a dimensionality reduction algorithm that can sometimes separate data points that originally belong to the same ...
greffao
  • 11
0 votes
0 answers
32 views

I am trying to find the best approach to analyzing the following data: Participants (n~500) were given a list of 50+ experiences/symptoms/disorders and were asked to assign those with one of 5 terms (...
Lindsay
2 votes
0 answers
87 views

In Elements of Statistical Learning (ESL), they state in equation (14.31) that the k-Means objective function is $$W(C) = \sum_{k=1}^KN_k \sum_{C(i)=k} ||x_i - \bar{x}_k||^2$$ where $K$ is the number ...
pseudo-goldstone
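The excerpt above is cut off, so it is unclear what the question goes on to ask. As a hedged aside, the quoted expression can be checked numerically against half the sum of squared pairwise distances within each cluster, to which it is algebraically equal. The sketch below uses random data and an arbitrary assignment, and the function names `w_centroid` and `w_pairwise` are illustrative assumptions.

```python
import numpy as np

def w_centroid(X, labels):
    """Sum over clusters of N_k * sum_i ||x_i - mean_k||^2 (the form quoted above)."""
    total = 0.0
    for k in np.unique(labels):
        Xk = X[labels == k]
        total += len(Xk) * ((Xk - Xk.mean(axis=0)) ** 2).sum()
    return total

def w_pairwise(X, labels):
    """Sum over clusters of 0.5 * sum_{i, i'} ||x_i - x_i'||^2."""
    total = 0.0
    for k in np.unique(labels):
        Xk = X[labels == k]
        diffs = Xk[:, None, :] - Xk[None, :, :]
        total += 0.5 * (diffs ** 2).sum()
    return total

rng = np.random.default_rng(0)
X = rng.normal(size=(30, 3))          # made-up data
labels = rng.integers(0, 4, size=30)  # arbitrary assignment, not an optimized one
```

Because of the $N_k$ factor, this quantity weights each cluster's scatter by its size. It therefore differs from the plain within-cluster sum of squares $\sum_k \sum_{C(i)=k} \|x_i - \bar{x}_k\|^2$ unless all clusters have the same size.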
0 votes
0 answers
51 views

If we have a set of measurements, e.g. gene expression in various conditions, two approaches are common: compute distances between columns (e.g. similarity between gene expression profiles), and ...
Alexlok
  • 187
1 vote
1 answer
88 views

I am using Partial Least Squares to obtain linear model parameters in the case of correlated covariates. I would like to try clustering in the Partial Least Squares latent space, that is the space ...
LearningAlgorithm
9 votes
3 answers
1k views

I hope the question is not stupid, but after a long search I have not found a satisfactory answer. I have a question about how to proceed if I want to test whether my data is from just one cluster or ...
David
  • 101
1 vote
0 answers
54 views

I am working with survey data that involve two ordinal response variables that I wish to cross-tabulate and see the margins of (with CIs), and two different sets of weights: Stratified sampling: Different sampling ...
dzeltzer
  • 151
0 votes
0 answers
105 views

The Pickands-Balkema-de Haan theorem states that the conditional excess distribution function is well approximated by the generalized Pareto distribution (for high excesses and if the underlying RV's ...
bee14
  • 1
0 votes
0 answers
46 views

I am trying to calculate metrics for validation/calculation of K-modes performance. I am working on my thesis and need to group patients based on binary variables (diseases). I recently read that SSW and SSB ...
Cristina Muntañola
3 votes
1 answer
219 views

I want to cluster the data collected with a 5-point Likert scale. But I couldn't understand which method is more accurate to use. I searched the literature but couldn't find a clear answer. Can you ...
ali
  • 63
2 votes
0 answers
77 views

I am trying to identify rows in groups of points using clustering algorithms. The bigger picture problem I'm trying to solve is to identify shelves given x and y coordinates of products. I can cluster ...
Tommy Wolfheart
0 votes
0 answers
24 views

I was conducting an experiment where I measured the response of wheat cultivars to pathogen inoculation. The experiment was repeated three times, with two reps each time. Two disease parameters were ...
user449708
0 votes
0 answers
96 views

My goal is to identify bots and fraudulent users for an application. Ideally, this would be a regression problem where users are rated on a continuous scale. I have 4 tables that cover different ...
Burger
  • 1
0 votes
0 answers
50 views

I have a dataset with variables collected years ago, and many variables collected this year as outcome variables. I want to combine all the variables collected this year to get one outcome, e.g. ...
NPpsy
  • 43
0 votes
1 answer
66 views

I have a geographical area divided into municipalities. Each municipality has a count of disease occurrences. The procedure is replicated four times, for four diseases (we can call them A, B, C, D). ...
Luke
  • 161
2 votes
2 answers
691 views

Where do you think is the inflection point on this chart?
pdxtrader8888
