This is my first question on here :)
I am working with the kNN classifier on datasets drawn from multivariate normal distributions. I have two groups coming from N(mu_1, I) and N(mu_2, I) with differing mean vectors mu_i and the identity matrix as covariance in both cases (so the features are independent). The dimension is 1000. One dataset consists of 3500 random draws from group 1 and 1500 random draws from group 2. I want to study kNN's performance on such high-dimensional data for different configurations of the mu_i.
Obviously, I struggle with the curse of dimensionality, but that is the point of the exercise. Still, just for fun, I did a PCA projection onto the first two components, so effectively a reduction to 2 dimensions, and the classifier's accuracy improved by about 10%. How come? My features are independent. Shouldn't that make PCA unsuitable?
Edit: The dataset (1000 features, 5000 samples) is split 70/30 into a train and a test set. The accuracy is computed on the test set, so I suppose it's an out-of-sample accuracy (a sketch of this step is at the end of the post).
I thank you in advance for enlightening me ;)
Edit: Here is the R code to generate the data:
library(MASS) # provides mvrnorm

generate_data <- function(num_samples, num_features, group_split, mu_1) {
  split_samples <- num_samples * group_split # size of group 1
  Sigma <- diag(num_features) # identity covariance matrix: independent features
  mean_vector_1 <- rep(mu_1, num_features)
  mean_vector_2 <- rep(0, num_features)
  # Generate the two groups, N(mu_1, I) and N(mu_2, I) with mu_2 = 0
  data_1 <- mvrnorm(n = split_samples, mu = mean_vector_1, Sigma = Sigma)
  data_2 <- mvrnorm(n = num_samples - split_samples, mu = mean_vector_2, Sigma = Sigma)
  # Combine and label the data
  data <- rbind(data_1, data_2)
  labels <- c(rep(1, split_samples), rep(2, num_samples - split_samples))
  return(as.data.frame(cbind(labels, data)))
}
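Called with the numbers from the question (the value of mu_1 varies per configuration; 0.1 here is just a placeholder):

dataset <- generate_data(num_samples = 5000, num_features = 1000,
                         group_split = 0.7, mu_1 = 0.1)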
This is the projection function I apply before running a simple kNN classifier (PCA here comes from the FactoMineR package):
library(FactoMineR) # provides PCA

projection <- function(dataset) {
  # Fit PCA on the features (all columns except the label in column 1)
  data_pca <- PCA(dataset[, -1], graph = FALSE)
  # Keep only the scores on the first two principal components
  data.frame(labels = dataset[, 1],
             PC1 = data_pca$ind$coord[, 1],
             PC2 = data_pca$ind$coord[, 2])
}