Applying Principal Component Analysis (PCA) to reduce dimensionality in multiple datasets for a classification task

Asked 4 months ago

Modified 4 months ago

I’m working with two malware datasets (dataset‑1 and dataset‑2), each with 256 features but with different ratios of malicious to benign samples. I’ve merged them into a third set (dataset‑3).

Each sample in both datasets is a Windows PE binary, so the two datasets share the same structure and even have some overlapping observations. I’m using byte‑unigram features, which give a 256‑dimensional vector per sample. My goal with PCA is to see whether I can reduce that 256‑dimensional space to a smaller number of components, ideally matching or improving the original model’s performance while using far fewer features.
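For readers unfamiliar with the feature type, here is a minimal sketch of how a byte‑unigram vector can be computed. The function name `byte_unigram_features` and the choice to normalize counts to frequencies are assumptions made for illustration; the post does not say whether raw counts or frequencies were used.

```python
from typing import List


def byte_unigram_features(data: bytes, normalize: bool = True) -> List[float]:
    """Return a 256-element vector of byte-value counts (or frequencies)."""
    counts = [0] * 256
    for b in data:  # iterating over bytes yields ints in 0..255
        counts[b] += 1
    if normalize and data:
        total = len(data)
        return [c / total for c in counts]
    return [float(c) for c in counts]


# Hypothetical usage: a short in-memory buffer stands in for a PE file's bytes.
vec = byte_unigram_features(b"MZ\x90\x00\x03\x00\x00\x00")
print(len(vec))
```

Normalizing makes files of different sizes comparable, but it is only one possible convention.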

Dataset differences:

Dataset‑1 and dataset‑2 have different class balances (i.e., different distributions of malicious vs. benign samples).

Combined data:

I created dataset‑3 by simply concatenating dataset‑1 and dataset‑2.
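Because the two sources overlap, a plain concatenation counts the shared samples twice. The sketch below shows one possible way to concatenate and drop exact duplicate feature rows with NumPy. It is not what was done for dataset‑3; the helper name `concat_dedup` and the toy arrays are hypothetical. Identical feature vectors do not guarantee identical files, so deduplicating on a hash of the raw file bytes would be more reliable where the files are available.

```python
import numpy as np


def concat_dedup(X1, y1, X2, y2):
    """Stack two datasets and keep the first occurrence of each feature row."""
    X = np.vstack([X1, X2])
    y = np.concatenate([y1, y2])
    # return_index gives the index of the first occurrence of each unique row
    _, first_idx = np.unique(X, axis=0, return_index=True)
    keep = np.sort(first_idx)  # restore the original row order
    # Note: if a duplicated row carries conflicting labels, the first label wins.
    return X[keep], y[keep]


# Toy data with 2 features instead of 256; the row [3, 4] appears in both sets.
X1 = np.array([[1, 2], [3, 4]])
y1 = np.array([0, 1])
X2 = np.array([[3, 4], [5, 6]])
y2 = np.array([1, 0])

X3, y3 = concat_dedup(X1, y1, X2, y2)
print(X3.shape)
```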

PCA results (99% explained variance):

dataset‑1 → 52 components
dataset‑2 → 33 components
dataset‑3 → 61 components
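For reference, component counts like these can be obtained from the cumulative explained‑variance ratio. The sketch below is a NumPy‑only illustration on synthetic data, not the code behind the numbers above. The function name `n_components_for_variance` is hypothetical, and the data are only mean‑centered, with no per‑feature scaling, which may differ from the preprocessing actually used.

```python
import numpy as np


def n_components_for_variance(X, threshold=0.99):
    """Smallest number of principal components whose cumulative
    explained-variance ratio reaches `threshold`."""
    Xc = X - X.mean(axis=0)                   # PCA operates on centered data
    s = np.linalg.svd(Xc, compute_uv=False)   # singular values, descending
    ratio = s**2 / np.sum(s**2)               # explained-variance ratio per component
    cum = np.cumsum(ratio)
    k = int(np.searchsorted(cum, threshold)) + 1
    return min(k, len(cum)), cum


# Synthetic stand-in: 256 features driven by 5 latent factors plus small noise.
rng = np.random.default_rng(0)
Z = rng.normal(size=(500, 5))
W = rng.normal(size=(5, 256))
X = Z @ W + 0.01 * rng.normal(size=(500, 256))

k, cum = n_components_for_variance(X, 0.99)
print(k, cum[k - 1])
```

In scikit‑learn, passing a float such as `PCA(n_components=0.99, svd_solver="full")` selects the component count by the same criterion and exposes it as `n_components_`.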

Task:

Binary classification (malicious vs. benign) downstream.

My questions:

Merging vs. separate PCA:

Given the differing class distributions, is it appropriate to merge the datasets before fitting PCA, or should I perform PCA independently on each?

Component strategy:

Since the merged set needs 61 components to hit 99% variance, should I use a uniform 61‑component PCA for all three datasets, or tailor the cutoff (52, 33, 61) individually and then compare accuracy vs. component‑count curves?
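The accuracy vs. component‑count comparison could be sketched as follows. This is an illustration under stated assumptions, not a recommendation: the data come from `make_classification` rather than real byte‑unigram features, logistic regression stands in for whatever classifier is actually used, and `accuracy_curve` is a hypothetical helper. PCA sits inside the pipeline so that it is refit on the training folds only.

```python
from sklearn.datasets import make_classification
from sklearn.decomposition import PCA
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline

# Synthetic stand-in for a 256-feature binary classification problem.
X, y = make_classification(
    n_samples=300, n_features=256, n_informative=20,
    n_redundant=0, random_state=0,
)


def accuracy_curve(X, y, component_counts, cv=3):
    """Mean cross-validated accuracy for each PCA component count."""
    results = {}
    for k in component_counts:
        model = make_pipeline(
            PCA(n_components=k, random_state=0),
            LogisticRegression(max_iter=1000),
        )
        scores = cross_val_score(model, X, y, cv=cv, scoring="accuracy")
        results[k] = float(scores.mean())
    return results


# The three cutoffs from the question, evaluated on the same data.
curve = accuracy_curve(X, y, [33, 52, 61])
for k, acc in curve.items():
    print(k, round(acc, 3))
```

Running the same function on each dataset gives one curve per dataset, which makes it possible to compare a uniform cutoff against per‑dataset cutoffs directly.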

edited Jul 24 at 16:05

asked Jul 24 at 14:37

0xh3xa

1

Please tell us more. Can you combine the data sets? What is different about them? What use will you make of the results? etc.

– Peter Flom

2025-07-24 14:43:42 +00:00
Commented Jul 24 at 14:43
3

I have prophylactically closed this post not because it's bad but because it looks likely to collect a set of disparate but potentially conflicting or irrelevant answers. Please edit it to respond to @PeterFlom's questions and also tell us what you mean by "working with multiple datasets" and how you suppose them to be related, if at all.

– whuber ♦

2025-07-24 14:47:54 +00:00
Commented Jul 24 at 14:47
1

I voted to reopen the question, but why do you want to use PCA here? Is it because the number of observations is small relative to the number of features? Also, if you haven't already done so, you should check if the two original datasets that you concatenated share common observations, in which case you should deduplicate them.

– J-J-J

2025-07-24 15:22:04 +00:00
Commented Jul 24 at 15:22
1

@J-J-J Each sample in both datasets is a Windows PE binary, so they share the same structure and even have some overlapping observations. I’m using byte‑unigram features, which gives me a 256‑dimensional vector. My goal with PCA is to see if I can reduce that 256‑dimensional space to a smaller number of components, ideally matching or improving the original model’s performance while using far fewer features.

– 0xh3xa

2025-07-24 15:27:25 +00:00
Commented Jul 24 at 15:27
2

I'd also recommend indicating in the title of your question that this is for a classification task (as PCA can be used in contexts other than classification). Adding the classification tag might be useful too. This will make it more likely that you attract the attention of people able to give you a good answer.

– J-J-J

2025-07-24 16:03:44 +00:00
Commented Jul 24 at 16:03
