I am using Mahalanobis distance for outlier detection. Sometimes my dataset has only 1 feature, sometimes many more. I believe the univariate Mahalanobis distance should equal the absolute z-score of the data (i.e. the magnitude of the standardized values), but the numbers from this sklearn function are consistently larger than expected.
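To spell out my reasoning: as I understand the definition, the squared Mahalanobis distance of a point $x$ from a distribution with mean $\mu$ and covariance $\Sigma$ is $d^2(x) = (x - \mu)^\top \Sigma^{-1} (x - \mu)$. With a single feature, $\Sigma$ is just the variance $\sigma^2$, so this reduces to $((x - \mu)/\sigma)^2$, the squared z-score, and the distance itself should be $|z|$.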
Here's some example code using sklearn.covariance.EllipticEnvelope.mahalanobis() that produces the effect:
import numpy as np
from sklearn.covariance import EllipticEnvelope
x = np.random.default_rng(42).normal(loc=1, scale=10, size=1000)
z = (x - x.mean()) / x.std()
ee = EllipticEnvelope().fit(z.reshape(-1, 1))
sqdist = ee.mahalanobis(z.reshape(-1, 1))
I'm giving it the z-score here, but I could also give it the raw x; it doesn't affect the outcome.
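For completeness, this is the quick check behind that claim (the names ee_raw and sqdist_raw are just for this comparison):
ee_raw = EllipticEnvelope().fit(x.reshape(-1, 1))  # fit on the raw, unstandardized data
sqdist_raw = ee_raw.mahalanobis(x.reshape(-1, 1))
print(np.allclose(sqdist, sqdist_raw))  # True in my runs; the rescaling doesn't change the distances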
Now we can compare the square root of this metric (squared Mahalanobis distance, according to the docs) to the absolute z-score, and find that they are different:
- First values of np.abs(z): [0.313, 1.0233, 0.75595, 0.94488, 1.92866]
- First values of np.sqrt(sqdist): [0.34309, 1.12069, 0.82829, 1.03524, 2.11241]
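The ratio between the two is roughly constant (around 1.095 for the values above), so this looks like a systematic scaling difference rather than noise:
# ratio of the two metrics for the first few points; roughly constant (~1.1)
print(np.sqrt(sqdist[:5]) / np.abs(z[:5]))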
The distributions look like this:
import matplotlib.pyplot as plt
import seaborn as sns
fig, axs = plt.subplots(ncols=2, figsize=(12, 3), sharey=True)
sns.histplot(np.abs(z), ax=axs[0])
sns.histplot(np.sqrt(sqdist), ax=axs[1])
I expected these metrics to be identical, so one of the following things must be true:
- The univariate Mahalanobis distance is not equal to the absolute z-score. If that is the case, how are they related?
- The metric produced by this function (which is the same as ee.dist_; see the check below) is not the true squared Mahalanobis distance, either because it is defined differently or because of a bug (which seems unlikely).
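For reference, here is the check I mean: per the docs, dist_ holds the Mahalanobis distances of the training observations, and as far as I can tell it matches the output of mahalanobis() on the same data.
# dist_ is documented as the Mahalanobis distances of the training set,
# so it should agree with calling mahalanobis() on the fitted data.
print(np.allclose(ee.dist_, sqdist))  # True in my runs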
What am I missing?
