I think I understand how one can view PCA as a way of finding the basis vectors such that projecting onto the subspace they span maximizes the variance of the resulting dataset.
What I don't understand is why we view it this way. Why don't we simply say: PCA is a method for preserving as much of the length and direction of the original vectors as possible? This is what I mean:
Suppose we have a data matrix $$\newcommand{\bm}[1]{\boldsymbol{#1}} \bm{X} := \begin{bmatrix}\bm{x}_1 \\ \bm{x}_2 \\ \vdots \\ \bm{x}_N\end{bmatrix} \in \mathbb{R}^{N \times D},$$ whose rows $\bm{x}_i$ are the datapoints,
and we wish to find the vector $\bm{w} \in \mathbb{R}^{D \times 1}$ that spans the "major axis" running through the dataset. Then, what we want is to maximize the total length of the projections of the datapoints $\bm{x}_i$ onto the subspace spanned by $\bm{w}$. That is, we want to maximize $$\sum_{i = 1}^N |\bm{x}_i\bm{w}|.$$ Since we don't want $\bm{w}$ to grow arbitrarily long, we restrict $\|\bm{w}\| = 1$. We finally see that $$\operatorname*{argmax}_{\substack{\bm{w} \in \mathbb{R}^{D \times 1} \\ \|\bm{w}\| = 1}} \sum_{i = 1}^N |\bm{x}_i \bm{w}| = \operatorname*{argmax}_{\substack{\bm{w} \in \mathbb{R}^{D \times 1} \\ \|\bm{w}\| = 1}} \sum_{i = 1}^N (\bm{x}_i \bm{w})^2.$$ From there, we see that $$\sum_{i = 1}^N (\bm{x}_i\bm{w})^2 = \sum_{i = 1}^N \bm{w}^\top\bm{x}_i^\top\bm{x}_i\bm{w} = \bm{w}^\top\biggl(\sum_{i = 1}^N \bm{x}_i^\top \bm{x}_i\biggr)\bm{w} = \bm{w}^\top \bm{X}^\top \bm{X} \bm{w}.$$
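(Just as a sanity check, here is a minimal NumPy sketch of that last identity, using a made-up random $\bm{X}$ and $\bm{w}$ that are purely illustrative.)

```python
import numpy as np

# Sanity check of the identity sum_i (x_i w)^2 = w^T X^T X w,
# using a made-up random data matrix X and a random unit vector w.
rng = np.random.default_rng(0)
N, D = 100, 5
X = rng.normal(size=(N, D))
w = rng.normal(size=D)
w /= np.linalg.norm(w)                      # enforce ||w|| = 1

sum_of_squares = np.sum((X @ w) ** 2)       # sum_i (x_i w)^2
quadratic_form = w @ X.T @ X @ w            # w^T X^T X w
print(np.isclose(sum_of_squares, quadratic_form))   # True
```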
In the end, it is known that the vector $\bm{w}$ that maximizes the above quadratic form is a unit eigenvector of $\bm{X}^\top\bm{X}$ corresponding to the largest eigenvalue of $\bm{X}^\top\bm{X}$. The rest of the principal vectors can be found in a similar manner.
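(And another small sketch of that last step, again with a made-up random $\bm{X}$: the unit eigenvector belonging to the largest eigenvalue of $\bm{X}^\top\bm{X}$ attains the largest value of the quadratic form among unit vectors.)

```python
import numpy as np

# Minimal sketch (made-up random X, just for illustration): the unit
# eigenvector of X^T X with the largest eigenvalue attains the largest
# value of w^T X^T X w over unit vectors w.
rng = np.random.default_rng(1)
X = rng.normal(size=(100, 5))
A = X.T @ X                                  # the matrix in the quadratic form

eigvals, eigvecs = np.linalg.eigh(A)         # eigh: eigenvalues in ascending order
w_star = eigvecs[:, -1]                      # eigenvector of the largest eigenvalue
best = w_star @ A @ w_star                   # equals the largest eigenvalue

# Compare against many random unit vectors: none should exceed `best`.
W = rng.normal(size=(5, 10_000))
W /= np.linalg.norm(W, axis=0)               # normalize each column
others = np.sum(W * (A @ W), axis=0)         # w^T A w for each column w
print(best >= others.max())                  # True
```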
But see? I never had to bring up variance! My question is mostly about why we do bring up variance when discussing PCA.