I have a dataset with counts of cyber incidents since 2011, broken down by country and attack type, and I want to use this data to train a machine learning model, specifically an LSTM, to predict future attacks.
I am currently binning the counts into 10-day periods, which gives roughly 370 observations in total. Because such data is often sensitive and confidential, only major incidents get reported, and many periods have 0 attacks. The result is a sparse dataset (more than 50% of the cells are zeros), and most non-zero values are single digits like 1, 2, or 3, rarely reaching 10 or 15. I am fine with predicting major cyber attacks only, since we cannot capture every attack anyway.
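For reference, here is roughly how I build the 10-day series (a sketch; the file name `incidents.csv` and the column name `date` are placeholders for my actual data):

```python
import pandas as pd

# One row per reported incident; file and column names are placeholders
df = pd.read_csv("incidents.csv", parse_dates=["date"])

# Count incidents per 10-day bin; bins with no incidents come out as 0
counts = df.set_index("date").resample("10D").size()

print(len(counts))           # ~370 observations since 2011
print((counts == 0).mean())  # fraction of zero bins, above 0.5 for me
```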
I have read in several sources that sparse data can lead to overfitting and poor predictive performance. So one solution I have in mind is to aggregate the attack counts monthly instead of every 10 days. However, this would reduce the number of data points, which would also hurt the performance of the machine learning algorithm.
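Aggregating monthly would look like this, which makes the trade-off concrete (same placeholder names as above):

```python
# Monthly aggregation: far fewer zero bins, but only about a third
# as many observations as the 10-day version
monthly = df.set_index("date").resample("MS").size()  # "MS" = month start

print(len(monthly))           # on the order of 130 points instead of ~370
print((monthly == 0).mean())  # noticeably fewer zeros
```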
What would be the best approach in my case? I just want my dataset to be suitable for a machine learning model that predicts the next number of attacks.
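For context, the modelling step I have in mind is a standard sliding-window LSTM, roughly like the minimal sketch below (the window length of 6 and the layer sizes are arbitrary, untuned choices; `counts` is the 10-day series from above):

```python
import numpy as np
from tensorflow import keras

def make_windows(series, window=6):
    """Use the previous `window` counts to predict the next count."""
    X, y = [], []
    for i in range(len(series) - window):
        X.append(series[i:i + window])
        y.append(series[i + window])
    # Shape (n_samples, window, 1) for the LSTM, plus the targets
    return np.array(X)[..., np.newaxis], np.array(y)

values = counts.to_numpy(dtype="float32")
X, y = make_windows(values)

model = keras.Sequential([
    keras.Input(shape=(X.shape[1], 1)),
    keras.layers.LSTM(16),
    keras.layers.Dense(1),
])
model.compile(optimizer="adam", loss="mse")
model.fit(X, y, epochs=50, verbose=0)
```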
Many thanks in advance!