I am using Mahalanobis distance for outlier detection. Sometimes my dataset has only 1 feature, sometimes many more. I believe the univariate Mahalanobis distance should equal the absolute z-score of the data (i.e. the magnitude of the standardized values), but the numbers from this sklearn function are consistently larger than expected.
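To spell out my reasoning: as I understand the definition, the squared Mahalanobis distance of a point $x$ from a distribution with mean $\mu$ and covariance $\Sigma$ is $d^2(x) = (x - \mu)^\top \Sigma^{-1} (x - \mu)$. With a single feature, $\Sigma$ is just the variance $\sigma^2$, so this reduces to $((x - \mu)/\sigma)^2$, the squared z-score, and the distance itself should be $|z|$.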
Here's some example code using sklearn.covariance.EllipticEnvelope.mahalanobis() that produces the effect:
import numpy as np
from sklearn.covariance import EllipticEnvelope
x = np.random.default_rng(42).normal(loc=1, scale=10, size=1000)
z = (x - x.mean()) / x.std()
ee = EllipticEnvelope().fit(z.reshape(-1, 1))
sqdist = ee.mahalanobis(z.reshape(-1, 1))
I'm giving it the z-score here, but I could also give it the raw x; it doesn't affect the outcome.
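For completeness, this is the quick check behind that claim (the names ee_raw and sqdist_raw are just for this comparison):
ee_raw = EllipticEnvelope().fit(x.reshape(-1, 1))  # fit on the raw, unstandardized data
sqdist_raw = ee_raw.mahalanobis(x.reshape(-1, 1))
print(np.allclose(sqdist, sqdist_raw))  # True in my runs; the rescaling doesn't change the distances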
Now we can compare the square root of this metric (squared Mahalanobis distance, according to the docs) to the absolute z-score, and find that they are different:
- First values of np.abs(z): [0.313, 1.0233, 0.75595, 0.94488, 1.92866]
- First values of np.sqrt(sqdist): [0.34309, 1.12069, 0.82829, 1.03524, 2.11241]
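The ratio between the two is roughly constant (around 1.095 for the values above), so this looks like a systematic scaling difference rather than noise:
# ratio of the two metrics for the first few points; roughly constant (~1.1)
print(np.sqrt(sqdist[:5]) / np.abs(z[:5]))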
The distributions look like this:
import matplotlib.pyplot as plt
import seaborn as sns
fig, axs = plt.subplots(ncols=2, figsize=(12, 3), sharey=True)
sns.histplot(np.abs(z), ax=axs[0])
sns.histplot(np.sqrt(sqdist), ax=axs[1])
I expected these metrics to be identical, so one of the following things must be true:
- The univariate Mahalanobis distance is not equal to the absolute z-score. If that is the case, how are they related?
- The metric produced by this function (which is the same as ee.dist_; see the check below) is not the true squared Mahalanobis distance, either because it is defined differently or because of a bug (which seems unlikely).
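For reference, here is the check I mean: per the docs, dist_ holds the Mahalanobis distances of the training observations, and as far as I can tell it matches the output of mahalanobis() on the same data.
# dist_ is documented as the Mahalanobis distances of the training set,
# so it should agree with calling mahalanobis() on the fitted data.
print(np.allclose(ee.dist_, sqdist))  # True in my runs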
What am I missing?
