I am working with a dataset that contains several missing values. Here's the current situation:

    colSums(is.na(dati_train)) / nrow(dati_train)  # Proportion of NAs per column

                  PAID      POINT_OF_SALE           EVENT_ID               YEAR
            0.00000000         0.00000000         0.00000000         0.00000000
                 MONTH    N_SUBSCRIPTIONS              PRICE       PHONE_NUMBER
            0.00000000         0.00000000         0.00000000         0.00000000
          PROP_CONBINI       PAYMENT_TYPE          FAV_GENRE                AGE
            0.00000000         0.00000000         0.05655301         0.10076613
       DAYS_FROM_PROMO         BOOKS_PAID     N_TRANSACTIONS            N_ITEMS
            0.00000000         0.32598398         0.32598398         0.00000000
    DATE_LAST_PURCHASE     CUSTOMER_SINCE               MAIL        SUBSCR_CANC
            0.32598398         0.32598398         0.00000000         0.00000000
                MARGIN
            0.32598398

Here is a visualization of the missing data pattern:
Here is a description of the variables:
| Variable | Description |
|---|---|
| EVENT_ID | Transaction ID |
| N_ITEMS | Total number of items purchased in the transaction |
| PROP_CONBINI | Proportion of "conbini" items in the transaction |
| FAV_GENRE | Favorite manga genre |
| PHONE_NUMBER | Customer's phone number (available) |
| MAIL | Customer's email address (available) |
| YEAR | Year of the transaction |
| MONTH | Month of the transaction |
| PAYMENT_TYPE | Agreed payment method |
| BOOKS_PAID | Number of manga paid for in previous transactions |
| PRICE | Transaction price |
| N_SUBSCRIPTIONS | Number of active manga series subscriptions |
| SUBSCR_CANC | Number of manga series subscriptions canceled in the past |
| POINT_OF_SALE | Point of sale |
| AGE | Customer's age |
| DAYS_FROM_PROMO | Days since the last promotion ended |
| MARGIN | Customer's cumulative margin |
| N_TRANSACTIONS | Total number of transactions made by the customer |
| CUSTOMER_SINCE | Date of the customer's first transaction |
| DATE_LAST_PURCHASE | Date of the customer's most recent transaction |
| PAID | Payment balance (target) |
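As you can see, the five variables with ~33% missing values (BOOKS_PAID, N_TRANSACTIONS, DATE_LAST_PURCHASE, CUSTOMER_SINCE, and MARGIN) have exactly the same proportion of NAs (0.32598398) and share the same systematic pattern. I am wondering what the best approach is to handle this situation.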
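
For anyone who wants to check the pattern numerically rather than from the plot, here is a minimal base R sketch. It uses a small made-up data frame (`toy`, with invented values and only a few of the real column names) in place of `dati_train`, so the counts it prints say nothing about my actual data. It tabulates the distinct missingness patterns and checks whether a set of columns is always missing on the same rows.

```r
# Toy stand-in for dati_train (hypothetical values, only four of the real columns)
toy <- data.frame(
  AGE            = c(25, NA, 40, 31, NA, 52),
  BOOKS_PAID     = c(3, 1, NA, NA, 7, 2),
  N_TRANSACTIONS = c(2, 1, NA, NA, 5, 1),
  MARGIN         = c(10.5, 4.0, NA, NA, 30.2, 8.1)
)

# Logical matrix: TRUE where a value is missing
na_mat <- is.na(toy)

# Encode each row's missingness as a string such as "0111" (1 = missing)
pattern <- apply(na_mat, 1, function(r) paste(as.integer(r), collapse = ""))

# Frequency of each distinct pattern, most common first
pattern_counts <- sort(table(pattern), decreasing = TRUE)
print(pattern_counts)

# Are the block variables missing on exactly the same rows?
# True only if every row has either none or all of them missing.
block_vars <- c("BOOKS_PAID", "N_TRANSACTIONS", "MARGIN")
n_missing_in_block <- rowSums(na_mat[, block_vars])
same_rows <- all(n_missing_in_block %in% c(0, length(block_vars)))
print(same_rows)
```

On the real data, `block_vars` would list all five ~33% columns; `same_rows` being `TRUE` would confirm that they are missing as a single block.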
My Questions:

1. Should I perform additional tests to better understand the nature of the missing data? If so, which tests would you recommend?
2. What is the best practice for handling such missing data patterns?
My Initial Plan:

1. Remove the variables that are irrelevant to the analysis.
2. Create a "flag" variable to indicate observations with missing data. Should I create this flag only for the variables with the systematic pattern (those with ~33% missing values), or for all variables with missing values?
3. Proceed with multiple imputation or another more sophisticated technique for handling missing data.
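
To make the two flagging options concrete, here is a minimal base R sketch of what I have in mind. The data frame `toy`, its values, and the flag names (`BLOCK_MISSING`, the `_NA` suffix) are all made up for illustration and are not part of my real dataset.

```r
# Toy stand-in for dati_train (hypothetical values and genre labels)
toy <- data.frame(
  AGE        = c(25, NA, 40, 31),
  FAV_GENRE  = c("shonen", "seinen", NA, "shojo"),
  BOOKS_PAID = c(3, NA, NA, 7),
  MARGIN     = c(10.5, NA, NA, 30.2)
)

# Columns that contain at least one NA (computed before any flag is added)
na_cols <- names(toy)[colSums(is.na(toy)) > 0]

# Option A: a single flag for the block of variables that go missing together
block_vars <- c("BOOKS_PAID", "MARGIN")
toy$BLOCK_MISSING <- as.integer(
  rowSums(is.na(toy[block_vars])) == length(block_vars)
)

# Option B: one 0/1 flag per column that has any missing value
flags <- as.data.frame(lapply(toy[na_cols], function(x) as.integer(is.na(x))))
names(flags) <- paste0(na_cols, "_NA")
toy_flagged <- cbind(toy, flags)

print(toy_flagged)
```

In this toy example the per-column flags for `BOOKS_PAID` and `MARGIN` are identical to each other and to `BLOCK_MISSING`. If the same holds in the real data, per-column flags inside the block would be perfectly collinear, which is part of why I am unsure about option B.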
After addressing the missing data, I plan to continue with my analysis. However, since I have never encountered a problem like this before, any advice or suggestions would be greatly appreciated.
Thank you in advance for your help!