
I’m working with two malware datasets (dataset‑1 and dataset‑2) each with 256 features, but different ratios of malicious vs. benign samples. I’ve merged them into a third set (dataset‑3).

Each sample in both datasets is a Windows PE binary, so the two datasets share the same structure and even have some overlapping observations. I’m using byte‑unigram features, which give me a 256‑dimensional vector per sample. My goal with PCA is to see whether I can reduce that 256‑dimensional space to a smaller number of components, ideally matching or improving on the original model’s performance while using far fewer features.
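
For reference, the byte‑unigram extraction is nothing more than a normalized byte‑frequency count; a minimal sketch (the function name and the placeholder blob are illustrative, not my exact code):

```python
import numpy as np

def byte_unigram(raw_bytes):
    """Return a 256-dimensional vector of normalized byte frequencies for one binary."""
    arr = np.frombuffer(raw_bytes, dtype=np.uint8)   # raw bytes of the PE file
    counts = np.bincount(arr, minlength=256)         # frequency of each byte value 0..255
    return counts / max(arr.size, 1)                 # normalize by file length

# Tiny placeholder blob standing in for a real PE file:
print(byte_unigram(bytes([0x4D, 0x5A]) + bytes(62)).shape)   # (256,)
```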

Dataset differences:

Dataset‑1 and dataset‑2 have different class balances (i.e., different distributions of malicious vs. benign samples).

Combined data:

I created dataset‑3 by simply concatenating dataset‑1 and dataset‑2.

PCA results (99% explained variance):

  • dataset‑1 → 52 components

  • dataset‑2 → 33 components

  • dataset‑3 → 61 components
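
These counts come from fitting PCA with a 99% explained‑variance cutoff; a minimal sketch with scikit-learn (X1 and X2 below are random placeholders standing in for the real byte‑unigram matrices, so the printed counts won’t match the numbers above):

```python
import numpy as np
from sklearn.decomposition import PCA

# Placeholder matrices standing in for the real byte-unigram data
# (only the 256 columns are meaningful here).
rng = np.random.default_rng(0)
X1 = rng.random((1000, 256))   # dataset-1 features
X2 = rng.random((1500, 256))   # dataset-2 features
X3 = np.vstack([X1, X2])       # dataset-3 = simple row-wise concatenation

def n_components_99(X):
    """Number of principal components needed to reach 99% explained variance."""
    return PCA(n_components=0.99).fit(X).n_components_

for name, X in [("dataset-1", X1), ("dataset-2", X2), ("dataset-3", X3)]:
    print(name, n_components_99(X))
```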

Task:

Binary classification (malicious vs. benign) downstream.

My questions:

  1. Merging vs. separate PCA:

Given the differing class distributions, is it appropriate to merge the datasets before fitting PCA, or should I perform PCA independently on each?

  2. Component strategy:

Since the merged set needs 61 components to hit 99% variance, should I use a uniform 61‑component PCA for all three datasets, or tailor the cutoff (52, 33, 61) individually and then compare accuracy vs. component‑count curves?
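
For the second question, the comparison I have in mind would look roughly like this (a sketch; logistic regression, the component grid, and the placeholder data are illustrative, not my actual setup):

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import Pipeline

# Placeholder data; in practice X, y are the byte-unigram features and
# malicious/benign labels of whichever dataset is being evaluated.
rng = np.random.default_rng(0)
X = rng.random((1000, 256))
y = rng.integers(0, 2, size=1000)

# Accuracy vs. number of retained components, estimated by cross-validation.
for k in [10, 20, 33, 52, 61, 100, 256]:
    pipe = Pipeline([
        ("pca", PCA(n_components=k)),
        ("clf", LogisticRegression(max_iter=1000)),
    ])
    acc = cross_val_score(pipe, X, y, cv=5, scoring="accuracy").mean()
    print(f"{k:3d} components: mean CV accuracy = {acc:.3f}")
```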

Comments:
  • Please tell us more. Can you combine the data sets? What is different about them? What use will you make of the results? etc. (Jul 24 at 14:43)
  • I have prophylactically closed this post not because it's bad but because it looks likely to collect a set of disparate but potentially conflicting or irrelevant answers. Please edit it to respond to @PeterFlom's questions and also tell us what you mean by "working with multiple datasets" and how you suppose them to be related, if at all. (Jul 24 at 14:47)
  • I voted to reopen the question, but why do you want to use PCA here? Is it because the number of observations is small relative to the number of features? Also, if you haven't already done so, you should check if the two original datasets that you concatenated share common observations, in which case you should deduplicate them. (Jul 24 at 15:22)
  • @J-J-J each sample in both datasets is a Windows PE binary, so they share the same structure and even have some overlapping observations. I’m using byte‑unigram features, which gives me a 256‑dimensional vector. My goal with PCA is to see if I can reduce that 256‑dimensional space to a smaller number of components, ideally matching or improving the original model’s performance while using far fewer features. (Jul 24 at 15:27)
  • I'd also recommend you to indicate in the title of your question that this is for a classification task (as PCA can be used in other contexts than classification). Adding the classification tag might be useful too. This will make it more likely that you attract the attention of people able to give you a good answer. (Jul 24 at 16:03)
