$\begingroup$

I want to get some opinions on how to approach the following problem to do with detecting "unhealthy" behavior in time series data (either using a statistical/analytical model or ML/DL, I do not have a preference). I want to be able to detect "healthy" or "unhealthy" (so this could be framed as a classification problem) based on the following definitions.

Healthy: https://i.sstatic.net/WsGll.jpg

All lines follow the same pattern moving in tandem with minimal cross/overlap with one another.

Unhealthy: https://i.sstatic.net/xky5D.jpg

In the first two images, the behavior is obviously unhealthy as some of the colored lines stray away from the rest of the pack. However, there will be some edge cases, as seen in the third picture, where some lines cross over slightly and/or flat-line. You can see the red one in particular does not keep going down and has a bit of an uptick toward the end.

The plotted curves are essentially temperature over a time period. One idea was tracking the gradient of each temperature line over time and comparing it with the rest. I'm curious to know what else I could try.
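A minimal sketch of that gradient idea (Python / NumPy; the synthetic data and the 3x-median threshold are illustrative assumptions, not part of my actual setup):

```python
import numpy as np

# Hypothetical data: four curves trending together, one trending away.
rng = np.random.default_rng(0)
t = np.arange(10)
curves = np.stack([t + rng.normal(scale=0.1, size=t.size) for _ in range(4)])
curves = np.vstack([curves, -t])  # the "rogue" curve

# Compare each curve's gradient with the pack's median gradient.
grads = np.gradient(curves, axis=1)
median_grad = np.median(grads, axis=0)
deviation = np.abs(grads - median_grad).mean(axis=1)

# Flag curves whose average deviation is far above the typical one
# (the 3x factor is an arbitrary threshold for illustration).
flagged = np.where(deviation > 3 * np.median(deviation))[0]
```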

$\endgroup$
  • $\begingroup$ Do you want to identify the anomalies in the dataset that fail to provide combined information as part of outlier detection? $\endgroup$ Commented Mar 24, 2022 at 5:58
  • $\begingroup$ or is it that you have multiple timeseries and want to identify healthy/unhealthy time series from multiple series? Just looking for some clarity regarding your objective. $\endgroup$ Commented Mar 24, 2022 at 6:18
  • $\begingroup$ @AtulMishra Yes, I have multiple time-series data in csv files of which each need to be classified as healthy or unhealthy. Hope this clears things up. For reference, this is a follow-up to my question here: stats.stackexchange.com/questions/565863/… $\endgroup$ Commented Mar 24, 2022 at 9:52
  • $\begingroup$ Well, reading the context from your link, looks like you need to classify the whole time series. I think, you can use Stationarity Tests in a time-series and based upon the obtained p-value, you can classify your series to be Stationary(healthy) or Non-Stationary(Unhealthy). $\endgroup$ Commented Mar 24, 2022 at 10:41
  • $\begingroup$ This will help you in identifying very weirdly behaving time-series $\endgroup$ Commented Mar 24, 2022 at 10:42

1 Answer

$\begingroup$

Please see below several approaches, described both in general terms and with environment-specific examples (Python / NumPy / scikit ecosystem).

First, I needed to reconstruct a somewhat similar sample (python / numpy):

import numpy as np
import matplotlib.pyplot as plt

m = 10  # number of time steps
n = 5   # number of curves

# cubic baseline shape shared by all curves
a = np.arange(-m / 2, m / 2) ** 3

# each curve: baseline + a per-curve random offset + pointwise noise
b = np.empty((n, m))
for i in range(n):
    b[i] = a + (i * np.random.normal(scale=3.)) + np.random.rand(m)

fig, axs = plt.subplots(1, 1, sharex=True)
for i in range(n):
    axs.plot(b[i])

[plot: the synthetic curves plus random noise]

Then, added some extra noise to the last curve:

b[-1] += np.random.rand(m) * 100

fig, axs = plt.subplots(1, 1, sharex=True)
for i in range(n):
    axs.plot(b[i])

[plot: the curves after adding extra noise to the last one]

Now for the first approach - statistical analysis using the Correlation Coefficient Matrix:

The hypothesis: the Correlation Coefficient Matrix of the curves should indicate the "rogue" curve's corr vector with the rest of the curves.

You'll need to test it on your real data, but it seems that the similar curves will get a high (close to 1.0) corr score. Then you'll want to define some eps and threshold at (1 - eps): correlations below that indicate a suspicious or "Unhealthy" state.

np.corrcoef(b)
array([[1.        , 0.99999092, 0.99995627, 0.99995058, 0.88247431],
       [0.99999092, 1.        , 0.99997068, 0.9999665 , 0.88147866],
       [0.99995627, 0.99997068, 1.        , 0.99998015, 0.88093051],
       [0.99995058, 0.9999665 , 0.99998015, 1.        , 0.88041311],
       [0.88247431, 0.88147866, 0.88093051, 0.88041311, 1.        ]])

In the sample test above it's clear that the purple curve's corr vector entries, [0.88247431, 0.88147866, 0.88093051, 0.88041311], are below the threshold.
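As a sketch of that thresholding step (self-contained Python / NumPy; the synthetic data, the seed, eps = 0.05, and the median row-correlation score are all illustrative assumptions):

```python
import numpy as np

# Self-contained re-creation of similar synthetic curves (seeded so the
# "rogue" curve is again the last one).
rng = np.random.default_rng(42)
m, n = 10, 5
a = np.arange(-m / 2, m / 2) ** 3
b = np.stack([a + rng.normal(size=m) for _ in range(n)])
b[-1] += rng.random(m) * 100  # extra noise on the last curve

# Score each curve by its median correlation with all the others, then
# threshold at 1 - eps (eps = 0.05 is an arbitrary illustration value).
eps = 0.05
corr = np.corrcoef(b)
np.fill_diagonal(corr, np.nan)            # ignore self-correlation
med_corr = np.nanmedian(corr, axis=1)     # each curve vs. the rest
unhealthy = np.where(med_corr < 1 - eps)[0]
```

Using the median (rather than the mean) of each row keeps a single rogue curve from dragging down the healthy curves' scores.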

A second approach - Outlier Detection:

The hypothesis: an Unsupervised Outlier Detection algorithm, fed with the curves' values, should detect the "rogue" curve as an Outlier.

I recommend trying several unsupervised outlier detection algorithms; the scikit-learn documentation has a nice comparison of the available ones (Python / scikit-learn).

For example, I tried the LocalOutlierFactor:

from sklearn.neighbors import LocalOutlierFactor

X = b
clf = LocalOutlierFactor(n_neighbors=3)
clf.fit_predict(X)
array([ 1,  1,  1,  1, -1])

The result tells us it suspects the last curve.

Looking at the more detailed score negative_outlier_factor_ "... The opposite LOF of the training samples. The higher, the more normal. Inliers tend to have a LOF score close to 1 (negative_outlier_factor_ close to -1), while outliers tend to have a larger LOF score ..."

clf.negative_outlier_factor_
array([-1.06224748, -1.05244769, -0.94720943, -0.94720943, -3.47148487])

Curves #1-4 are around -1 ± eps, while #5 (-3.47148487) is clearly far from the pack.

Another approach - Cluster Analysis:

The hypothesis: a clustering algorithm should detect the "healthy" curves as one (or maybe more) dense clusters to which the "rogue" curves shouldn't belong. Whether it marks them as outliers, or their cluster properties point to that, depends on the design of the specific algorithm.

Look for algorithms / implementations that:

  • Do not require the number of clusters in advance (or try one with k=1, under the hypothesis that all curves except the outliers fall in the same cluster)
  • Provide some kind of scoring
  • Provide some kind of outlier indication
  • Prefer the density-based ones (that's more from experience and personal preference)

You could then validate by visual inspection and/or plotting after dimensionality reduction to 2 or 3 components (for 2D / 3D).

For example: HDBSCAN's Outliers Detection
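As a minimal density-based sketch, here's scikit-learn's plain DBSCAN standing in for HDBSCAN (an assumption; the synthetic data and the eps / min_samples values are illustrative and would need tuning on real data):

```python
import numpy as np
from sklearn.cluster import DBSCAN

# Same kind of synthetic data: four tight curves plus one rogue curve.
rng = np.random.default_rng(0)
m, n = 10, 5
a = np.arange(-m / 2, m / 2) ** 3
b = np.stack([a + rng.normal(size=m) for _ in range(n)])
b[-1] += rng.random(m) * 100  # rogue curve

# DBSCAN labels points that fall in no dense region as noise (-1).
# eps here is DBSCAN's neighborhood radius, chosen by eye for this
# scale of data; min_samples=2 lets the four tight curves form a cluster.
labels = DBSCAN(eps=20.0, min_samples=2).fit_predict(b)
```

The four "healthy" curves end up in one cluster, while the rogue curve gets the noise label -1, matching the hypothesis above.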

$\endgroup$
  • $\begingroup$ Wanted to clarify with you if you would use something like HDBSCAN for the edge cases such as this one: imgur.com/a/OtqZboC (this could be unhealthy because of the two crossing signals in green, but generally they are highly correlated with the other lines). $\endgroup$ Commented Mar 31, 2022 at 13:56
  • $\begingroup$ Also, for the correlation matrix method, could you explain what you mean by "eps"? Also, the matrix may not be scalable to larger time series data because you have 5 time steps in your example, whereas I am dealing with 1000s. How would you know which vector/curve to look at in a matrix that is 1000x1000? With LOF, I see a similar issue: how did you get a vector of 5 elements when there are 4 curves in your example? Again, I think those are the 5 time steps? $\endgroup$ Commented Mar 31, 2022 at 22:28
  • $\begingroup$ Furthermore, not sure how I would approach the clustering analysis. Could you please provide an example? $\endgroup$ Commented Mar 31, 2022 at 23:36
  • $\begingroup$ The actual performance of those 3 different approaches should now be tested on your real data. If you can provide a few real data points from your time-series I can try and add some more intuitions. I used synthetic data that was trying to "looks like" your data. There are 5 lines, one is hidden behind another.. $\endgroup$ Commented Apr 1, 2022 at 12:56
  • $\begingroup$ "eps" = epsilon, some small threshold value (e.g. 0.1, 0.01). In my example, 0.1 could have been a threshold for the correlation matrix, such that each correlation below the threshold would be considered low, and thus maybe an indication of an "unhealthy" sensor measurement. Same goes for the LOF: if you run it on real data you can find the right threshold yourself. $\endgroup$ Commented Apr 1, 2022 at 12:56
