Questions tagged [pca]
Principal component analysis (PCA) is a linear dimensionality reduction technique. It reduces a multivariate dataset to a smaller set of constructed variables while preserving as much of the information (variance) as possible. These variables, called principal components, are linear combinations of the input variables.
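As a rough illustration of that definition, here is a minimal sketch (on assumed toy data, not from any question below) of computing principal components as linear combinations of the input variables:

```python
# Minimal PCA sketch on toy data.
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 5))            # 200 observations, 5 input variables
Xc = X - X.mean(axis=0)                  # centre each variable

# Eigendecomposition of the sample covariance matrix
eigvals, eigvecs = np.linalg.eigh(np.cov(Xc, rowvar=False))
order = np.argsort(eigvals)[::-1]        # sort by decreasing variance
eigvals, eigvecs = eigvals[order], eigvecs[:, order]

scores = Xc @ eigvecs                    # each column is a linear combination of the inputs
explained = eigvals / eigvals.sum()      # proportion of variance per component
```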
3,454 questions
0
votes
0
answers
12
views
PCA minimal reconstruction error
Let $X \in \mathbb{R}^d$ be a random vector with covariance matrix $\Sigma$, with its eigenvalues ordered as $\lambda_1\geq \lambda_2 \geq \ldots \geq \lambda_d$, and the corresponding orthonormal ...
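For reference, the standard identity this question appears to be after (assuming $\mathbb{E}[X]=0$; otherwise apply it to $X-\mathbb{E}[X]$) is that projecting onto the span of the top $k$ eigenvectors of $\Sigma$ attains
$$
\min_{\substack{P=P^\top=P^2 \\ \operatorname{rank}(P)=k}} \mathbb{E}\,\lVert X - PX\rVert^2 \;=\; \sum_{i=k+1}^{d} \lambda_i .
$$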
1
vote
1
answer
50
views
Vector direction of individual clusters after PCA
Suppose I have two multi-dimensional population samples - $A$ and $B$.
I hypothesise that $\mathbb{E}[A]$ and $\mathbb{E}[B]$ are orthogonal in this high-dimensional space.
To test this hypothesis, I ...
0
votes
0
answers
22
views
Combining filtering with clr transform
I am working with a compositional dataset:
A very efficient way of dealing with compositional data is applying a clr transform (or a similar one), which effectively converts them to data in Euclidean ...
0
votes
0
answers
112
views
What is causing my feature importance weights to be so polarized?
I'm new to machine learning and don't post here much, but my lab and I are a bit stumped here.
I have trained an elastic net classifier on some cortical thickness (CT) data by region of interest (...
0
votes
0
answers
25
views
Should I weight genetic principal components in an IPW-weighted survival model using classic covariates and PRS?
I'm working on a survival analysis using Cox models where the exposure is a binary grouping variable, and I'm adjusting for a set of classical epidemiological covariates (sex, smoking, diabetes, ...
2
votes
0
answers
287
views
Interpreting angles between variables in a biplot
Upon reading the abstract of a recently published paper in ecology, I came across the claim:
Our results suggest that the chromatic contrasts of colours are non-redundant with the intensity of ...
4
votes
1
answer
147
views
How do you maintain orthonormality during optimization?
I am trying to iteratively optimize a set of vectors $\{w_1, w_2, ..., w_n\}$ such that the following holds:
$$
w_r =
\begin{cases}
\underset{w}{\arg\min} \; \sum_x \left\lVert (x^\top w) w - x \...
0
votes
1
answer
71
views
FAMD on large mixed dataset: low explained variance, still worth using?
I'm working with a large tabular dataset (~1.2 million rows) that includes 7 qualitative features and 3 quantitative ones. For dimensionality reduction, I'm using FAMD (Factor Analysis for Mixed Data) ...
1
vote
0
answers
38
views
Denoising: PCA vs averaging
Suppose I have $n$ vectors $v_1,\dots,v_n \in \mathbb{R}^d$. Let's assume there is some underlying direction common to all of them, and each $v_i$ is a noisy version of that direction, and the goal ...
4
votes
1
answer
69
views
What do the singular vectors in an SVD represent when there are repeated measurements in the original data matrix?
I'm wondering if this is correct reasoning:
SVD constructs new orthogonal vectors as linear combinations of the rows and columns of the data. In effect, correlations among the original variables are ...
0
votes
0
answers
51
views
Generalization Error PCA (with closed formula) versus Ridge
There is something I have an intuition about, but my numerical toy examples do not confirm it, and I really want to understand where my mistake is.
I suppose that I have a random vector $X = (X_1, \cdots, ...
1
vote
0
answers
95
views
Applying Principal Component Analysis (PCA) to reduce dimensionality in multiple datasets for a classification task
I’m working with two malware datasets (dataset‑1 and dataset‑2) each with 256 features, but different ratios of malicious vs. benign samples. I’ve merged them into a third set (dataset‑3).
The sample ...
3
votes
2
answers
576
views
How can I apply KMeans clustering if all variables are highly uncorrelated?
I'm applying K-Means clustering to a dataset of ship voyages. The goal is to group voyages into performance-based clusters like cost-efficient, underperforming, etc.
I have 12 features in total:
10 ...
0
votes
0
answers
68
views
Selecting number of PCs (principal components) to include in PCR (principal component regression)
How do you decide the number of principal components (PC) to include in principal component regression (PCR)?
I have seen these methods:
Choosing the lowest RMSEP with the pls package
Choosing PCs ...
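A hedged Python analogue of the cross-validated (RMSEP-style) selection mentioned above, using a PCA-plus-OLS pipeline on assumed toy data (the names and data here are illustrative assumptions, not from the question):

```python
# Choosing the number of components in principal component regression by cross-validation.
import numpy as np
from sklearn.decomposition import PCA
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 15))
y = X[:, :3].sum(axis=1) + rng.normal(scale=0.5, size=100)

pcr = Pipeline([("pca", PCA()), ("ols", LinearRegression())])
search = GridSearchCV(
    pcr,
    param_grid={"pca__n_components": list(range(1, 11))},
    scoring="neg_root_mean_squared_error",
    cv=5,
)
search.fit(X, y)
print(search.best_params_)   # number of components with the lowest CV prediction error
```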
0
votes
1
answer
110
views
Modelling w/ PCA
I'm trying to create a model (which is more interpretive than explanatory) in order to model the relationship between water quality (e.g. ammonium concentration, chlorine concentration) and regional ...
0
votes
0
answers
63
views
How to test for homogeneity of z-scores across members of a clinical population?
For N participants I have M measures for which a normative model is available. Let's assume these measures are hand finger lengths (so M=5); z=0 means the length of that finger is the mean in the ...
0
votes
0
answers
36
views
Correlation Analysis prior to PCA [duplicate]
So, I have a general question regarding PCA. As far as I understand, before performing PCA you are supposed to perform a correlation analysis between the features so that redundant features can be ...
0
votes
0
answers
55
views
Using a combination of 2 scores to make a combined score
I've got around 50,000 companies and the majority of them have two data points for their revenue: 2023 and 2024. The two metrics that I'm told to use are:
absolute growth, which is just the difference
...
0
votes
0
answers
31
views
Adding PCA scores to mixed model in R?
So I've done two separate tests, a PCA and a GLMM, using the same groups of individuals. The experiments have to do with animal behavior, so I did preliminary recordings of how the animals interact ...
0
votes
0
answers
79
views
Variable selection: Explanatory model with very low sample size
I am currently doing research in which I am examining the relationship between the quality of wastewater (e.g. biochemical oxygen demand, amount of nitrogen...) and regional characteristics of that ...
1
vote
1
answer
197
views
Can the elbow method be used in PCA (Principal Component Analysis) to determine the optimal number of components for dimensionality reduction?
The elbow method is commonly used with K-means clustering to determine the optimal number of clusters by plotting the within-cluster sum of squares (WCSS) against the number of clusters and looking ...
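For comparison, the usual PCA analogue of that plot is a scree plot of the explained variance; a minimal sketch on assumed toy data:

```python
# Scree ("elbow") plot of explained variance for PCA.
import numpy as np
import matplotlib.pyplot as plt
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
X = rng.normal(size=(300, 20))

pca = PCA().fit(X)
ratios = pca.explained_variance_ratio_

plt.plot(range(1, len(ratios) + 1), ratios, marker="o")
plt.xlabel("Component")
plt.ylabel("Proportion of variance explained")
plt.title("Scree plot: look for the 'elbow'")
plt.show()
```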
4
votes
1
answer
137
views
Something like PCA, but with every element of the first eigenvector positive / non-negative matrix factorization?
I am familiar with the PCA algorithm for dimension reduction, but I would like every element of the first principal component to have a positive sign. So when I try to use my principal component, it's a ...
1
vote
1
answer
80
views
How to assess collinearity and perform dimension reduction with mixed variables in SAS
I plan to do an ordinal logistic regression (plus I'm new to SAS v9.4). My dependent and independent variables are ordinal (Likert-type), but I want to add about 35 covariates (possible confounders) ...
0
votes
0
answers
28
views
Why Is the Posterior Estimate of the Latent Variables in Binary PPCA Different from the Ground Truth?
I’ve been working on implementing a binary variant of probabilistic PCA (PPCA) in Python (based on this paper), which uses variational EM for parameter estimation due to the non-conjugacy between the ...
2
votes
0
answers
56
views
PCA As Maximizing Variance Vs. Maximizing Original Length
I think I understand how one could view PCA as a means of finding the basis vectors such that projecting onto the subspace they span maximizes the variance of the new dataset ...
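The link between the two views comes from the Pythagorean decomposition: for a unit vector $w$ and any observation $x$,
$$
\lVert x \rVert^2 \;=\; \lVert (x^\top w)\,w \rVert^2 + \lVert x - (x^\top w)\,w \rVert^2 ,
$$
so, summed over a fixed dataset, maximizing the mean squared projected length is the same as minimizing the mean squared reconstruction error; for centred data the former is exactly the variance of the projections.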
1
vote
1
answer
126
views
Interpretation of PCA Loadings Plot
When interpreting loadings for different principal components in PCA, sometimes the same variable will have a positive loading for one PC, and a negative loading in another PC, despite both PCs having ...
2
votes
1
answer
104
views
How to perform multivariate analyses (e.g., PCA, RDA, coinertia) while accounting for a random effect?
New to multivariate analyses in R. I have two datasets consisting of multivariate response variables (e.g., physiological and environmental measures in a wildlife species) and I want to assess ...
0
votes
0
answers
71
views
Relative abundance, square-root transformation, PCA: do I have to re-normalise after removing spp <2%?
I have calculated relative abundance of species count data. I then removed the species <2%. I want to transform these data using the square root method to reduce the dominance of some species ...
0
votes
0
answers
35
views
Rotated loadings as input to hierarchical clustering
How do we feed the rotated loadings obtained through varimax rotation with the psych package into hierarchical clustering in the FactoMineR package (HCPC())? I want to use the rotated components instead of ...
4
votes
2
answers
837
views
Why is factor analysis needed?
I am studying a statistics course on multivariate analysis. The course starts with regression models (simple and multiple) and then moves to interdependence with a focus on dimensionality-reduction ...
2
votes
1
answer
90
views
How to visualize PCA group separation using boxplots in R?
I'm trying to follow best practices.
I’ve run a PCA using prcomp() in R on a set of scaled numeric features. Now I want to check if there's any visual separation between groups in my target categorical ...
0
votes
0
answers
57
views
PLS-SEM and collinearity
I hope someone can help me with this issue or point me in the right direction.
I have recently gotten myself into structural equation modelling (SEM) via PLS-SEM. However, I ran into the issue of ...
0
votes
0
answers
20
views
Discrepancy in Signs Between scikit-learn PCA and Custom PCA Implementation [duplicate]
I’m implementing my own version of PCA and comparing it with scikit-learn's PCA. However, I’m noticing a discrepancy in the signs of the principal components.
Using scikit-learn
...
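The sign of each principal component is arbitrary (any eigenvector can be flipped), so such discrepancies are expected; a hedged sketch of aligning signs before comparing, on assumed toy data:

```python
# Align per-component signs between a custom PCA and scikit-learn's PCA.
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 4))
Xc = X - X.mean(axis=0)

# Custom PCA via eigendecomposition of the covariance matrix (components in rows)
eigvals, eigvecs = np.linalg.eigh(np.cov(Xc, rowvar=False))
custom = eigvecs[:, np.argsort(eigvals)[::-1]].T

sk = PCA(n_components=4).fit(Xc).components_     # also components in rows

# Flip each custom component so it points the same way as sklearn's
signs = np.sign(np.sum(custom * sk, axis=1))
aligned = custom * signs[:, None]
print(np.allclose(aligned, sk, atol=1e-6))
```

With components aligned this way, any remaining difference beyond floating-point noise would point to a genuine bug rather than the sign convention.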
0
votes
0
answers
44
views
General Dynamic Principal Components in R
I am trying to build an index using the gdpc package in R. I am struggling to compute the PC1 provided by the package from the loadings and the input series. I want to build a chart representing the ...
0
votes
1
answer
110
views
How to deal with correlated features before classification?
I have a classification problem with ~2,500 observations and 50 features.
I perform feature selection beforehand, reducing the set to around 17 features. While my selection methods effectively ...
0
votes
0
answers
50
views
How to Compute a Scale-Invariant Reconstruction Error for PCA?
I am working with Principal Component Analysis (PCA) and trying to evaluate reconstruction error. Specifically I am interested in being able to compare the results of PCA on differently scaled data (...
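One common (hedged) option is to report the reconstruction error relative to the total variation: for centred $X$ with $k$ retained components,
$$
\frac{\lVert X - \hat{X}_k \rVert_F^2}{\lVert X \rVert_F^2}
\;=\; \frac{\sum_{i>k} \lambda_i}{\sum_{i} \lambda_i},
$$
which is unchanged if every entry of $X$ is multiplied by the same constant (though not under per-variable rescaling, which genuinely changes the PCA solution).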
1
vote
0
answers
86
views
Test statistic for determining if a calculated property/feature is significantly different across 3 clusters of amino acid sequences obtained from PCA
So I'm analyzing an IDR (Intrinsically Disordered Region) amino acid sequence in several organisms (~700) for a particular protein and have calculated several features typically calculated for IDRs ...
0
votes
0
answers
35
views
How to transform community matrix data from percent cover of quadrats
I have data from a field expedition where quadrats were done in multiple sites. The data for each site represent the percent cover for the species identified (note that the rows don't necessarily add ...
2
votes
1
answer
131
views
Are there good references discussing the use of survey weights with PCA and MCA?
I've found that the survey package in R allows using survey weights with principal component analysis, which is great.
However, it doesn't seem to provide the same for correspondence analysis or ...
1
vote
0
answers
60
views
What do the rows of SVD tell us about PCA?
If we have a matrix $X\in\mathbb{R}^{n\times p}$ with SVD $X = UDV^T$, we can say for example that the columns of $V$ are the principal directions and the columns of $UD$ are the principal components (...
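A minimal numerical check of the relations stated above, on assumed toy data (leaving the actual question about the rows aside):

```python
# Verify: columns of V are principal directions, U @ D gives the component scores.
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(50, 6))
Xc = X - X.mean(axis=0)

U, d, Vt = np.linalg.svd(Xc, full_matrices=False)
scores_svd = U * d                                   # same as U @ np.diag(d)

# Compare with projecting onto eigenvectors of the covariance matrix
eigvals, eigvecs = np.linalg.eigh(np.cov(Xc, rowvar=False))
order = np.argsort(eigvals)[::-1]
scores_eig = Xc @ eigvecs[:, order]

# Equal up to the usual per-component sign ambiguity
print(np.allclose(np.abs(scores_svd), np.abs(scores_eig), atol=1e-8))
```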
0
votes
0
answers
64
views
Impact of Excessive Zeros in miRNA Data on PCA and LDA
I am working on a case-cohort (~ case-control, but putting all cases in the subcohort) study evaluating miRNA markers. The variables of interest are continuous quantitative measures of miRNA ...
1
vote
1
answer
113
views
Linear VAE and pPCA
I am looking into the relationship between the linear Variational Autoencoder (VAE) and probabilistic PCA (pPCA) presented by Lucas et al. (2019) in the "Don't Blame the ELBO!" paper.
In the official ...
0
votes
0
answers
62
views
What are the differences between static factor models and dynamic factor models?
What are the main differences, apart from the dynamic model using lags?
I read this paper where the explanation of static factor models was that, given N time series of T periods each, they can be used to ...
0
votes
0
answers
77
views
How do you deal with feature selection when working with high-dimensional datasets?
I’ve been working on a classification problem with thousands of features, and I’m struggling with feature selection. I found this article that has a great breakdown of different Feature Selection ...
1
vote
1
answer
88
views
Are there advantages to extracting patterns (i.e. clustering) in the Partial Least Squares latent space?
I am using Partial Least Squares in order to obtain linear model parameters in the case of correlated covariates.
I would like to try clustering in the Partial Least Squares latent space, that is, the space ...
1
vote
1
answer
192
views
High Classification Accuracy Despite Poor Separation in PCA for Multi-class Data
I recently conducted a Principal Component Analysis (PCA) on a dataset with a four-category target variable. While the PCA score plot revealed excellent separation for one group, the remaining three ...
0
votes
0
answers
64
views
Should Data Be Weighted Differently for PCA vs. SVD ($\cos(\text{latitude})$ or $\sqrt{\cos(\text{latitude})}$)?
Background
I am analyzing data on a latitude-longitude grid and want to account for geographic distortions caused by the Earth's curvature (higher data density near the poles). To correct this, I plan ...
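For what it's worth, a common convention (an assumption here, not a definitive answer to the question) is to multiply the centred anomalies by $\sqrt{\cos(\text{latitude})}$ before the SVD, so that each grid point's variance enters the decomposition with weight $\cos(\text{latitude})$; a sketch on assumed toy data:

```python
# sqrt(cos(latitude)) area-weighting before an SVD (toy grid and data).
import numpy as np

n_time, n_lat = 120, 30
lats = np.linspace(-87, 87, n_lat)             # hypothetical latitude grid
rng = np.random.default_rng(0)
X = rng.normal(size=(n_time, n_lat))           # time x grid-point anomalies
Xc = X - X.mean(axis=0)

w = np.sqrt(np.cos(np.deg2rad(lats)))          # sqrt(cos) applied to the data columns
U, d, Vt = np.linalg.svd(Xc * w, full_matrices=False)

# The diagonal of the implied covariance now carries cos(latitude) weights:
var_weighted = ((Xc * w) ** 2).sum(axis=0) / (n_time - 1)
```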
0
votes
0
answers
96
views
Orthogonal and Non-Orthogonal Features
I have a bit of confusion regarding the scope of what PCA can do, and if it cannot do the thing I expected, whether any other, similar tool can. My understanding has been that PCA orthogonalizes ...
0
votes
0
answers
74
views
Rolling PCA for time-series regression: information leakage
Assume that a random variable $y_{i,t}$ is governed by some linear factors $x_{t,j}$ and a random noise term $\epsilon_{i,t}$:
$$
y_{i,t} = \sum_{j=1}^{M+1}\beta_{j,i}x_{t,j} + \epsilon_{i,t}
$$
Written ...