$\begingroup$

I am working on a time series dataset, which I understand to follow a gamma distribution. I want to use a 99% probability threshold to establish upper and lower limits (cut-offs) and find anomalies. However, I am getting strange results when I run the code below.

What am I doing/understanding wrong?

import pandas as pd
from scipy.stats import gamma

data = {'Synthetic': [984.172, 1144.21, 1304.24, 1464.27, 1624.31,
1784.34, 1944.38, 2104.41, 2264.45, 2424.48, 2584.51, 2744.55, 2904.58,
3064.62, 3224.65, 3384.68, 3544.72, 3704.75, 3864.79, 4024.82, 4184.85,
4344.89, 4504.92, 4664.96, 4824.99, 4985.03, 5145.06, 5305.09, 5465.13,
5625.16, 5785.2, 5945.23, 6105.26, 6265.3, 6425.33, 6585.37, 6745.4,
6905.44, 7065.47, 7225.5, 7385.54, 7545.57, 7705.61, 7865.64, 8025.67,
8185.71, 8345.74, 8505.78]}

df = pd.DataFrame(data)

Upper_Lim = gamma.ppf(0.99, df.mean(), df.std())

print(Upper_Lim)

Lower_Lim = gamma.ppf(0.01, df.mean(), df.std())

print(Lower_Lim)

Why am I getting an upper limit of 7147 and a lower limit of 6826? I had imagined that with a 99% threshold, I would be casting a wider net on the dataset.

$\endgroup$
  • $\begingroup$ It would be useful to state explicitly which language the code is written in, and more specifically which libraries you're using (it looks like pandas and scipy), for readers who may not recognize it at first glance. It would also be helpful to format the code accordingly. I've made a quick edit for formatting and at least added the python tag. $\endgroup$ Commented Aug 1, 2022 at 0:11
  • $\begingroup$ Thank you, Glen_b. I am new here, I will keep it in mind. $\endgroup$ Commented Aug 1, 2022 at 6:57

1 Answer

$\begingroup$

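The strange results come from how the sample statistics are passed to gamma.ppf. In SciPy the signature is gamma.ppf(q, a, loc=0, scale=1), so your call uses the empirical mean as the shape parameter a and the empirical standard deviation as loc, while the scale stays at 1. Neither the shape nor the scale of a gamma distribution is equal to the mean or the standard deviation. If you want to estimate the parameters from the data, the easiest way is to use the .fit() method that SciPy provides for the distribution.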
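To make the parameter mix-up concrete, here is a minimal sketch. It assumes the question's data can be approximated by 48 evenly spaced values built with np.linspace, and it pins loc at 0 via floc=0, which is an assumption about what suits this data rather than something stated above. The method-of-moments estimate is an addition for comparison and is not part of the original answer.

```python
import numpy as np
from scipy.stats import gamma

# Assumption: 48 evenly spaced values approximate the question's data.
x = np.linspace(984.172, 8505.78, 48)
mean, std = x.mean(), x.std(ddof=1)

# What the question's call does: positional arguments are (q, a, loc).
wrong_upper = gamma.ppf(0.99, mean, std)
# The same call with explicit keywords: shape=mean, loc=std, scale=1.
explicit_upper = gamma.ppf(0.99, a=mean, loc=std, scale=1)
print(wrong_upper, explicit_upper)  # close to the 7147 reported in the question

# Method of moments with loc = 0: mean = a * scale, variance = a * scale**2.
shape_mom = mean**2 / std**2
scale_mom = std**2 / mean
lower_mom, upper_mom = gamma.ppf([0.01, 0.99], a=shape_mom, scale=scale_mom)
print(lower_mom, upper_mom)

# Maximum likelihood fit; floc=0 keeps the location fixed at zero.
a_hat, loc_hat, scale_hat = gamma.fit(x, floc=0)
lower_mle, upper_mle = gamma.ppf([0.01, 0.99], a=a_hat, loc=loc_hat, scale=scale_hat)
print(lower_mle, upper_mle)
```

Passing a, loc and scale as keywords makes the intent explicit and avoids silently feeding a statistic into the wrong parameter.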

Moreover, this is not the best way of detecting anomalies. If your data contains anomalies, they will affect the parameter estimates, because the estimation treats them as valid data. This is a chicken-and-egg problem: you would need to remove the outliers to estimate the parameters, yet you need the parameters to find the outliers. There are ways of solving this, but they are more complicated.
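One common way to sidestep the circularity is to use statistics that outliers barely move. The sketch below is not gamma-based and is not necessarily what the answer has in mind: it swaps in the modified z-score built on the median and the median absolute deviation (MAD), with the conventional 0.6745 constant and 3.5 cut-off. The function names and the sample readings are hypothetical.

```python
from statistics import median


def modified_z_scores(values):
    """Return 0.6745 * (x - median) / MAD for every value."""
    med = median(values)
    mad = median(abs(v - med) for v in values)
    if mad == 0:
        raise ValueError("MAD is zero; modified z-scores are undefined")
    return [0.6745 * (v - med) / mad for v in values]


def flag_outliers(values, threshold=3.5):
    """Indices whose absolute modified z-score exceeds the threshold."""
    scores = modified_z_scores(values)
    return [i for i, z in enumerate(scores) if abs(z) > threshold]


# Hypothetical readings with one obvious anomaly at index 7.
readings = [10.1, 9.8, 10.4, 9.9, 10.0, 10.3, 9.7, 55.0, 10.2, 9.6]
print(flag_outliers(readings))  # [7]
```

Because the median and MAD are computed from the bulk of the data, the single extreme reading does not inflate the spread estimate the way it would inflate a sample standard deviation.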

You also said that this is time-series data. By using a single distribution for all the data, you are assuming that the series is stationary, and that is a big assumption. More likely, you need a specialized algorithm for detecting anomalies in time series.
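The stationarity concern can be checked on the question's own numbers. The sketch below only uses NumPy and is an illustration rather than an anomaly detector: it compares the mean of the first and second half of the series and then looks at the first differences. Splitting the series in half and differencing are choices made here for illustration.

```python
import numpy as np

# The 'Synthetic' column from the question.
synthetic = np.array([
    984.172, 1144.21, 1304.24, 1464.27, 1624.31, 1784.34, 1944.38, 2104.41,
    2264.45, 2424.48, 2584.51, 2744.55, 2904.58, 3064.62, 3224.65, 3384.68,
    3544.72, 3704.75, 3864.79, 4024.82, 4184.85, 4344.89, 4504.92, 4664.96,
    4824.99, 4985.03, 5145.06, 5305.09, 5465.13, 5625.16, 5785.2, 5945.23,
    6105.26, 6265.3, 6425.33, 6585.37, 6745.4, 6905.44, 7065.47, 7225.5,
    7385.54, 7545.57, 7705.61, 7865.64, 8025.67, 8185.71, 8345.74, 8505.78,
])

# A stationary series would have a similar level in both halves.
level_shift = synthetic[24:].mean() - synthetic[:24].mean()
print("difference between half means:", level_shift)

# First differences remove a linear trend.
increments = np.diff(synthetic)
print("increment range:", increments.min(), increments.max())
```

The level changes by thousands between the two halves while the increments are nearly constant, so this series is a steady ramp. Limits taken from one distribution fitted to the raw values would mostly reflect the position on the ramp, which is why working with increments or with residuals from a trend model is a more natural starting point.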

$\endgroup$
  • $\begingroup$ Thank you, Tim! This is very helpful. Appreciate your guidance. $\endgroup$ Commented Aug 1, 2022 at 6:56
