I’m working with two malware datasets (dataset‑1 and dataset‑2) each with 256 features, but different ratios of malicious vs. benign samples. I’ve merged them into a third set (dataset‑3).
The samples in both datasets are Windows PE binaries, so they share the same structure, and some observations appear in both. I’m using byte‑unigram features (one feature per possible byte value), which gives me a 256‑dimensional vector per sample. My goal with PCA is to reduce that 256‑dimensional space to fewer components while matching or improving the original model’s performance.
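For concreteness, here is a minimal sketch of how a 256‑dimensional byte‑unigram vector could be computed from a file's raw bytes. The function name `byte_unigram_features` and the choice to normalize counts to relative frequencies are assumptions for illustration, and the actual extraction used for these datasets may differ.

```python
from collections import Counter
from typing import List


def byte_unigram_features(data: bytes, normalize: bool = True) -> List[float]:
    """Return a 256-element vector with one entry per possible byte value."""
    counts = Counter(data)  # iterating over bytes yields ints in 0..255
    vec = [float(counts.get(b, 0)) for b in range(256)]
    total = len(data)
    if normalize and total > 0:
        # Assumption: relative frequencies, so file size does not dominate.
        vec = [c / total for c in vec]
    return vec


# Tiny stand-in for a file's contents (starts with the "MZ" header bytes).
sample = b"MZ\x90\x00\x03\x00\x00\x00"
features = byte_unigram_features(sample)
```

With normalization the vector sums to 1 for any non-empty file. Without it, large binaries would carry much larger values, which would also change what PCA treats as high variance.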
Dataset differences:
Dataset‑1 and dataset‑2 have different class balances (i.e., different distributions of malicious vs. benign samples).
Combined data:
I created dataset‑3 by simply concatenating dataset‑1 and dataset‑2.
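Because the two sources share some observations, plain concatenation will put repeated rows into dataset‑3. The sketch below shows one way to concatenate and count exact duplicates, assuming the features and labels are NumPy arrays. The helper name `concat_and_dedupe` is hypothetical, and matching on identical feature vectors is only a proxy for "same file" (a file hash would be more reliable if one is available).

```python
import numpy as np


def concat_and_dedupe(X1, y1, X2, y2):
    """Stack two datasets and drop rows whose feature vectors repeat exactly.

    Returns the de-duplicated features, the matching labels, and the number
    of rows that were dropped.
    """
    X = np.vstack([X1, X2])
    y = np.concatenate([y1, y2])
    # return_index gives the first occurrence of each unique row.
    _, keep = np.unique(X, axis=0, return_index=True)
    keep = np.sort(keep)  # preserve the original row order
    return X[keep], y[keep], len(X) - len(keep)


# Toy 2-feature example; the real data would have 256 columns.
X1 = np.array([[1.0, 2.0], [3.0, 4.0]])
y1 = np.array([0, 1])
X2 = np.array([[3.0, 4.0], [5.0, 6.0]])
y2 = np.array([1, 0])
X3, y3, n_dropped = concat_and_dedupe(X1, y1, X2, y2)
```

Note that this keeps the first occurrence of each row, so two identical vectors with conflicting labels would be collapsed silently. That case is worth checking separately.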
PCA results (99% explained variance):
dataset‑1 → 52 components
dataset‑2 → 33 components
dataset‑3 → 61 components
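Counts like these are the smallest number of components whose cumulative explained‑variance ratio reaches 0.99. Below is a minimal NumPy sketch of that calculation. The function name `n_components_for_variance` and the synthetic low‑rank matrix are illustrative assumptions, not the code or data behind the numbers above.

```python
import numpy as np


def n_components_for_variance(X, threshold=0.99):
    """Smallest k such that the first k principal components explain
    at least `threshold` of the total variance of X."""
    Xc = X - X.mean(axis=0)  # PCA works on mean-centered data
    s = np.linalg.svd(Xc, compute_uv=False)  # singular values, descending
    ratio = s**2 / np.sum(s**2)  # per-component explained-variance ratio
    cumulative = np.cumsum(ratio)
    k = int(np.searchsorted(cumulative, threshold)) + 1
    return min(k, len(s))  # guard against floating-point shortfall


# Synthetic stand-in: 500 samples, 256 features, intrinsic rank 3.
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(500, 3)) @ rng.normal(size=(3, 256))
k_demo = n_components_for_variance(X_demo, 0.99)
```

In scikit‑learn, `PCA(n_components=0.99, svd_solver="full")` performs essentially the same selection. The count depends on the centering mean and on any scaling applied beforehand, so a PCA fitted on one dataset and a PCA fitted on another are not directly comparable component by component.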
Task:
Binary classification (malicious vs. benign) downstream.
My questions:
- Merging vs. separate PCA:
Given the differing class distributions, is it appropriate to merge the datasets before fitting PCA, or should I perform PCA independently on each?
- Component strategy:
Since the merged set needs 61 components to hit 99% variance, should I use a uniform 61‑component PCA for all three datasets, or tailor the cutoff (52, 33, 61) individually and then compare accuracy vs. component‑count curves?
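One way the accuracy vs. component‑count comparison could be set up is sketched below, assuming scikit‑learn. PCA sits inside a pipeline so that it is refitted on the training folds only. The classifier (`LogisticRegression`), the `balanced_accuracy` metric (swapped in for plain accuracy because the class ratios differ), the helper name `accuracy_vs_components`, and the synthetic data are all assumptions standing in for the real model and datasets.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.pipeline import make_pipeline


def accuracy_vs_components(X, y, component_counts, n_splits=5, seed=0,
                           scoring="balanced_accuracy"):
    """Cross-validated score for each candidate number of PCA components."""
    cv = StratifiedKFold(n_splits=n_splits, shuffle=True, random_state=seed)
    results = {}
    for k in component_counts:
        # PCA is fitted inside each fold, so held-out data never
        # influences the projection.
        model = make_pipeline(PCA(n_components=k),
                              LogisticRegression(max_iter=1000))
        scores = cross_val_score(model, X, y, cv=cv, scoring=scoring)
        results[k] = float(scores.mean())
    return results


# Synthetic stand-in for a 256-feature dataset with binary labels.
rng = np.random.default_rng(0)
X_demo = rng.random((300, 256))
y_demo = (X_demo[:, :8].sum(axis=1) > 4.0).astype(int)
curve = accuracy_vs_components(X_demo, y_demo, [5, 33, 52, 61])
```

Running the same function on dataset‑1, dataset‑2, and dataset‑3 with a shared grid of component counts would give curves that can be compared directly, rather than comparing only at each dataset's own 99% cutoff.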