
I have a dataset that provides the count of cyber incidents since 2011 for different countries and attack types, and I want to use it in a machine learning model to predict future attacks with an LSTM.

I am currently setting the time period of each observation to 10 days, which gives me around 370 points in total. Because such data is often sensitive and confidential, only major incidents are reported, and many days have 0 attacks. The result is a sparse dataset (more than 50% of cells are zeros), and most non-zero values are single digits like 1, 2, or 3, rarely 10 or 15. I am OK with predicting major cyber attacks only, since we cannot capture all attacks.

I have read in various sources that sparse data may lead to overfitting and poor predictive performance. One solution I have in mind is to aggregate the number of attacks monthly instead of every 10 days (a sketch of that aggregation is below). However, this would reduce the number of data points, which would also hurt the performance of the machine learning algorithm.
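For concreteness, here is a minimal pandas sketch of the aggregation I have in mind; the column names and toy counts are placeholders, not my real data:

```python
# Hypothetical sketch: re-aggregating 10-day incident counts into monthly
# totals with pandas. Column names and the toy series are placeholders.
import pandas as pd

# One row per 10-day period, with the count of reported attacks
df = pd.DataFrame({
    "date": pd.date_range("2011-01-01", periods=370, freq="10D"),
    "attacks": [0, 2, 0, 1] * 92 + [0, 3],  # sparse toy counts
})

# Summing within calendar months shrinks ~370 points to ~120
monthly = df.set_index("date")["attacks"].resample("MS").sum()
print(monthly.head())
```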

I am wondering what the best solution is in my case. I just want my dataset to be usable in a machine learning model that predicts the next number of attacks.

Many thanks in advance!

  • Not necessarily obligatory, but most likely relevant: How to know that your machine learning problem is hopeless? Commented Mar 20, 2022 at 15:22
  • One consideration: what period do you want the forecast for? Every 10 days, or monthly? You can also try it both ways and see how it turns out. Commented Mar 21, 2022 at 16:21
  • Thanks, I tried monthly aggregation earlier and was advised by colleagues that 120 points is too small to extract a pattern from. Do you have an opinion on that? Commented Mar 21, 2022 at 18:04
  • LSTM is a deep learning algorithm, and deep learning algorithms are generally data-hungry. I would hazard an opinion that 370 data points is too small for any deep learning technique. You are likely better off doing careful feature engineering and seeing if you can get a simple technique like linear regression or generalized linear regression working reasonably well; 370 data points is enough for linear regression (see the sketch after these comments). Commented Mar 22, 2022 at 20:05
  • I might need to expand on this as an answer, but the obvious thing to note is that these attacks often amount to anomalous/unusual activity. In that case, this forecasting work is closer to change-point/outlier detection. Commented Mar 23, 2022 at 13:58
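To make the linear/generalized-linear suggestion above concrete, here is a rough baseline sketch: a Poisson GLM on lagged counts via statsmodels. The simulated series and the two-lag design are illustrative assumptions, not part of the original question.

```python
# Illustrative baseline per the comment above: a Poisson GLM on lagged
# counts via statsmodels. The simulated series and the choice of two lags
# are assumptions, not part of the original question.
import numpy as np
import statsmodels.api as sm

counts = np.random.default_rng(1).poisson(0.6, size=370)  # stand-in data

# Predict the count at time t from the counts at t-1 and t-2
y = counts[2:]
X = sm.add_constant(np.column_stack([counts[1:-1], counts[:-2]]))

fit = sm.GLM(y, X, family=sm.families.Poisson()).fit()
print(fit.summary())
```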

2 Answers


The problem isn't that you have sparse data; it's that you have few data points, and the points you do have exhibit excess zeros.

My concern is that your LSTM will not have sufficient data to learn from, and that the model isn't structured enough to make sense of the limited data.

Since you have limited data, I would suggest a less flexible statistical model that makes stronger assumptions about the data-generating process: something like a zero-inflated Poisson model with lagged counts as regressors (e.g., the number of attacks in the prior 10-day period, and perhaps in the 11-20 day window before that). See the following link for a comparison and background on some zero-inflated models: Yang et al. (2017).

I like the pscl package in R for fitting zero-inflated regression models. The following link shows a basic example of how to build a model and how to compare models to get the best fit: Zero-inflated models in R, UCLA.
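For readers working in Python rather than R, here is a minimal sketch of the same idea, assuming statsmodels' ZeroInflatedPoisson as a stand-in for pscl; the simulated series, variable names, and two-lag design are illustrative assumptions:

```python
# A minimal sketch of zero-inflated Poisson regression with lagged counts
# as regressors, using statsmodels instead of R's pscl. The simulated
# series and the lag structure are illustrative assumptions.
import numpy as np
import pandas as pd
from statsmodels.discrete.count_model import ZeroInflatedPoisson

rng = np.random.default_rng(0)
# Toy sparse counts: Poisson draws thinned by a zero-inflation gate
y = rng.poisson(1.2, size=370) * rng.binomial(1, 0.4, size=370)

df = pd.DataFrame({"attacks": y})
df["lag1"] = df["attacks"].shift(1)  # attacks in the prior 10-day period
df["lag2"] = df["attacks"].shift(2)  # attacks two periods back
df = df.dropna()

X = np.column_stack([np.ones(len(df)), df["lag1"], df["lag2"]])
zip_fit = ZeroInflatedPoisson(df["attacks"].to_numpy(), X,
                              inflation="logit").fit(disp=False)
print(zip_fit.summary())
```

As with pscl::zeroinfl, the count part and the zero-inflation part are reported separately, and candidate lag structures can be compared by AIC.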


Not an answer, but rather an extended comment:

If I were you, I would worry about issues other than sparsity; half the rows being zeros is not that bad. I would worry about selection bias: who decides which incidents are big enough to be reported? Is it possible that reporting is inconsistent, so that incidents of the same magnitude may or may not be reported? The data also covers only the incidents that were detected, so there is a clear risk of survivorship bias as well. Unfortunately, neither problem can be solved with the data you have, because both concern the data you don't have.

If those concerns are valid, your model would only learn to detect "known unknowns" and would be blind to the "unknown unknowns" you would probably most want to detect. That could be a significant drawback and should be considered carefully.

