$\begingroup$

This is my first question here :)

I am working with the kNN classifier on datasets drawn from multivariate normal distributions. I have two groups, drawn from N(mu_1, I) and N(mu_2, I), with different mean vectors mu_i and the identity matrix as the covariance in both cases (so the features are independent). The dimension is 1000. Each dataset consists of 3500 random draws from group 1 and 1500 random draws from group 2. I want to study kNN's performance on such high-dimensional data for different configurations of the mu_i.

Obviously, I run into the curse of dimensionality, but that is the point of the exercise. Still, just for fun, I did a PCA projection onto the first two principal components, effectively a reduction to two dimensions, and the classifier's accuracy improved by about 10%. How come? My features are independent. Shouldn't that make PCA unsuitable?

Edit: The dataset (1000 features, 5000 samples) is split 70/30 into a training set and a test set. The accuracy is computed on the test set, so I suppose it's an out-of-sample accuracy.

I thank you in advance for enlightening me ;)

Edit: Here is the R code to generate the data:

library(MASS)  # provides mvrnorm()

generate_data <- function(num_samples, num_features, group_split, mu_1) {
  split_samples <- num_samples * group_split
  Sigma <- diag(num_features)  # Covariance matrix
  mean_vector_1 <- rep(mu_1, num_features)
  mean_vector_2 <- rep(0, num_features)
  
  # Generating data for two groups
  data_1 <- mvrnorm(n = split_samples, mu = mean_vector_1, Sigma = Sigma)
  data_2 <- mvrnorm(n = num_samples - split_samples, mu = mean_vector_2, Sigma = Sigma)
  
  # Combine and label data
  data <- rbind(data_1, data_2)
  labels <- c(rep(1, split_samples), rep(2, num_samples - split_samples))
  
  return(as.data.frame(cbind(labels, data)))
}

This is the projection function I apply before running a simple kNN classifier:

library(FactoMineR)  # provides PCA()

projection <- function(dataset) {
  data_pca <- PCA(dataset[, -1], graph = FALSE)
  data.frame(labels = dataset[,1],
             PC1 = data_pca$ind$coord[,1], 
             PC2 = data_pca$ind$coord[,2])
}
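
For readers who want to reproduce the comparison, here is a minimal, self-contained sketch of the whole pipeline. It is not the original code, and several choices are assumptions made only for illustration: the reduced sizes (500 samples, 100 features, so it runs quickly), the value `mu_1 = 0.2`, `k = 5`, `knn()` from the `class` package as the classifier, and `prcomp()` in place of `FactoMineR::PCA()`. Because the covariance is the identity, independent `rnorm()` draws are used instead of `mvrnorm()`.

```r
library(class)  # knn(); part of the recommended packages shipped with R

set.seed(1)

# Assumed illustration values (hypothetical, not taken from the post)
num_samples  <- 500   # the post uses 5000
num_features <- 100   # the post uses 1000
group_split  <- 0.7   # 70% group 1, 30% group 2, as in the post
mu_1         <- 0.2   # per-feature mean of group 1 (assumption)
k            <- 5     # number of neighbours (assumption)

n1 <- round(num_samples * group_split)
n2 <- num_samples - n1

# Identity covariance: independent normal draws with a mean shift
x <- rbind(
  matrix(rnorm(n1 * num_features, mean = mu_1), nrow = n1),
  matrix(rnorm(n2 * num_features, mean = 0),    nrow = n2)
)
y <- factor(c(rep(1, n1), rep(2, n2)))

# 70/30 train/test split
train_idx <- sample(num_samples, size = round(0.7 * num_samples))
x_train <- x[train_idx, ]
y_train <- y[train_idx]
x_test  <- x[-train_idx, ]
y_test  <- y[-train_idx]

# kNN on all features
pred_full <- knn(x_train, x_test, cl = y_train, k = k)
acc_full  <- mean(pred_full == y_test)

# PCA fitted on the training rows only, then applied to the test rows
pca      <- prcomp(x_train, center = TRUE, scale. = FALSE)
pc_train <- pca$x[, 1:2]
pc_test  <- predict(pca, newdata = x_test)[, 1:2]

# kNN on the first two principal components
pred_pca <- knn(pc_train, pc_test, cl = y_train, k = k)
acc_pca  <- mean(pred_pca == y_test)

cat("accuracy, all features:  ", acc_full, "\n")
cat("accuracy, first two PCs: ", acc_pca, "\n")
```

Two differences from the `projection` function above are worth keeping in mind when comparing numbers: the sketch fits the PCA on the training rows only and then projects the test rows, and it does not rescale the features, whereas `FactoMineR::PCA()` scales to unit variance by default.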
$\endgroup$
  • $\begingroup$ Welcome to CV. I don't know the answer but, since this is all simulated data anyway, it might help if you included all your code for people to play with. $\endgroup$ Commented Dec 15, 2023 at 19:16
  • $\begingroup$ Welcome to Cross Validated! Is this in-sample or out-of-sample accuracy? $\endgroup$ Commented Dec 15, 2023 at 19:29
  • $\begingroup$ @PeterFlom I added the code $\endgroup$ Commented Dec 16, 2023 at 13:04
  • $\begingroup$ @Dave I added this info. I use a test set to determine the accuracy. So it should be out-of-sample $\endgroup$ Commented Dec 16, 2023 at 13:04
  • $\begingroup$ Thank you for the additional information. What kind of in-sample accuracy do you get? I’m wondering if this is a simple case of overfitting. $\endgroup$ Commented Dec 16, 2023 at 14:00
