This is my first question on here :)
I am working with the kNN classifier on datasets drawn from multivariate normal distributions. I have two groups coming from N(mu_1, I) and N(mu_2, I) with differing mean vectors mu_i and the identity matrix as covariance in both cases (so the features are independent). The dimension is 1000. One dataset consists of 3500 random draws from group 1 and 1500 random draws from group 2. I want to study kNN's performance on such high-dimensional data for different configurations of the mu_i.
Obviously, I struggle with the curse of dimensionality, but that is the point of the exercise. Still, just for fun, I did a PCA projection onto the first two components, so effectively a reduction to 2 dimensions, and the classifier's accuracy improved by about 10%. How come? My features are independent. Shouldn't that make PCA unsuitable?
Edit: The dataset (1000 features, 5000 samples) is split 70/30 into a train and a test set. The accuracy is computed on the test set, so I suppose it's an out-of-sample accuracy (a sketch of this step is at the end of the post).
I thank you in advance for enlightening me ;)
Edit: Here is the R code to generate the data:
library(MASS) # provides mvrnorm

generate_data <- function(num_samples, num_features, group_split, mu_1) {
  split_samples <- num_samples * group_split # size of group 1
  Sigma <- diag(num_features) # identity covariance matrix: independent features
  mean_vector_1 <- rep(mu_1, num_features)
  mean_vector_2 <- rep(0, num_features)
  # Generate the two groups, N(mu_1, I) and N(mu_2, I) with mu_2 = 0
  data_1 <- mvrnorm(n = split_samples, mu = mean_vector_1, Sigma = Sigma)
  data_2 <- mvrnorm(n = num_samples - split_samples, mu = mean_vector_2, Sigma = Sigma)
  # Combine and label the data
  data <- rbind(data_1, data_2)
  labels <- c(rep(1, split_samples), rep(2, num_samples - split_samples))
  return(as.data.frame(cbind(labels, data)))
}
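Called with the numbers from the question (the value of mu_1 varies per configuration; 0.1 here is just a placeholder):

dataset <- generate_data(num_samples = 5000, num_features = 1000,
                         group_split = 0.7, mu_1 = 0.1)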
This is the projection function I apply before running a simple kNN classifier (PCA here comes from the FactoMineR package):
library(FactoMineR) # provides PCA

projection <- function(dataset) {
  # Fit PCA on the features (all columns except the label in column 1)
  data_pca <- PCA(dataset[, -1], graph = FALSE)
  # Keep only the scores on the first two principal components
  data.frame(labels = dataset[, 1],
             PC1 = data_pca$ind$coord[, 1],
             PC2 = data_pca$ind$coord[, 2])
}