Skip to main content
Search type Search syntax
Tags [tag]
Exact "words here"
Author user:1234
user:me (yours)
Score score:3 (3+)
score:0 (none)
Answers answers:3 (3+)
answers:0 (none)
isaccepted:yes
hasaccepted:no
inquestion:1234
Views views:250
Code code:"if (foo != bar)"
Sections title:apples
body:"apples oranges"
URL url:"*.example.com"
Saves in:saves
Status closed:yes
duplicate:no
migrated:no
wiki:no
Types is:question
is:answer
Exclude -[tag]
-apples
For more details on advanced search visit our help page
Results tagged with
Search options answers only not deleted user 7290

Cluster analysis is the task of partitioning data into subsets of objects according to their mutual "similarity," without using preexisting knowledge such as class labels. [Clustered-standard-errors and/or cluster-samples should be tagged as such; do NOT use the "clustering" tag for them.]

46 votes
Accepted

Clustering methods that do not require pre-specifying the number of clusters

Clustering algorithms are often categorized into broad kingdoms: Partitioning algorithms (like k-means and it's progeny) Hierarchical clustering (as @Tim describes) Density based clustering (such as … It might help you to read an overview of different types of clustering algorithms. The following might be a place to start: Berkhin, P. "Survey of Clustering Data Mining Techniques" (pdf) …
gung - Reinstate Monica's user avatar
9 votes

Can any dataset be clustered or does there need to be some sort of pattern in the data?

As other answers have pointed out, determining if the clustering represents 'real' latent groups is a very difficult task. … , and the section on evaluating clusterings in Wikipedia's clustering entry). None of the methods are perfect, however. …
gung - Reinstate Monica's user avatar
4 votes
Accepted

Clustering in data

Thus the 'clustering' in $X$ is not necessarily a problem. In addition, data tend to have less leverage* over the fitted regression line the closer they are to the mean of $X$. …
gung - Reinstate Monica's user avatar
1 vote

Clustering Matrices

And you can use whatever clustering algorithm you like and find appropriate (e.g., k-means) likewise. …
gung - Reinstate Monica's user avatar
8 votes
Accepted

Clustering a dense dataset

The clustering algorithms I am familiar with that can do so are fuzzy k-means, Gaussian mixture modeling, and clustering by kernel density estimation. … Clustering via nonparametric density estimation, Statisticis and Computing, 17, 1, pp. 71-80. …
gung - Reinstate Monica's user avatar
2 votes
Accepted

Visualise clusters and relationship with features; alternative to chord diagram

I agree with @Nick Cox. This figure is pretty, but doesn't seem very good to me except as eye candy. In essence, this is a Sankey plot (a.k.a., river plot or flow diagram) with just two levels where …
gung - Reinstate Monica's user avatar
1 vote

How to assess the consistency of clustering

You can see an example where I do this here: How to use both binary and continuous variables together in clustering? This works out fine with a small number of clusterings. … strength of association / correlation between nominal variables, or Determine the probability that two patterns are consistent across the clusterings in that if they are clustered together in one clustering
gung - Reinstate Monica's user avatar
6 votes
Accepted

Why is clustering data with many categorical variables so slow?

An alternative is to use a clustering algorithm that can operate over a distance matrix. Then you can calculate the distances only once. … here: How to use both binary and continous variables together in clustering? …
gung - Reinstate Monica's user avatar
2 votes

Is there a distance metric that measures the ratio between the two rows of data?

I agree with @Anony-Mousse. Using anything in statistics, simply because it is the default will never be the best way to go. Take your situation as an example: At the core of Euclidean distance is …
gung - Reinstate Monica's user avatar
1 vote

Clustering as a means of splitting up data for logistic regression

I want to acknowledge from the beginning that I know relatively little about clustering. However, I don't see the point of the procedure you describe. …
gung - Reinstate Monica's user avatar
2 votes

Does the sign of the adjusted residuals matter in a crosstable?

The null hypothesis for a $\chi^2$ analysis of a contingency table is that the rows and columns are independent of each other. Under this null model, the expected count is: $$ \hat E_{ij}=\left(\frac …
gung - Reinstate Monica's user avatar
9 votes
Accepted

Is it necessary to normalize data for hierarchical clustering of mixed variables using compl...

It is common to normalize all your variables before clustering. … The fact that you are using complete linkage vs. any other linkage, or hierarchical clustering vs. a different algorithm (e.g., k-means) isn't relevant. …
gung - Reinstate Monica's user avatar
7 votes
Accepted

A non parametric clustering algorithm suitable for high dimensional data

The most common clustering technique that meets your requirements would be DBSCAN. This finds points that are continuous by virtue of having shared nearest neighbors. …
gung - Reinstate Monica's user avatar
2 votes
Accepted

Is it possible to select features from completely unlabeled data?

Having figured out what variables are relevant for what you want to do, and having conducted a clustering, it isn't difficult to see which variables play the strongest roles in determining the the clustering … Concept learning and feature selection based on square-error clustering. Machine Learning, 35:25–39, 1999. …
gung - Reinstate Monica's user avatar
2 votes

Clustering numeric, categorical, and multivalue categorical data

With mixed data types the basic answer is to use Gower's distance (see @ttnphns' thorough explainer here: Hierarchical clustering with mixed type data - what distance/similarity to use?). … How to use both binary and continuous variables together in clustering?). …
gung - Reinstate Monica's user avatar

15 30 50 per page