$\begingroup$

I am working with a dataset that contains several missing values. Here's the current situation:

colSums(is.na(dati_train)) / nrow(dati_train)   # Proportion of NAs per column
              PAID      POINT_OF_SALE           EVENT_ID               YEAR
        0.00000000         0.00000000         0.00000000         0.00000000
             MONTH    N_SUBSCRIPTIONS              PRICE       PHONE_NUMBER
        0.00000000         0.00000000         0.00000000         0.00000000
      PROP_CONBINI       PAYMENT_TYPE          FAV_GENRE                AGE
        0.00000000         0.00000000         0.05655301         0.10076613
   DAYS_FROM_PROMO         BOOKS_PAID     N_TRANSACTIONS            N_ITEMS
        0.00000000         0.32598398         0.32598398         0.00000000
DATE_LAST_PURCHASE     CUSTOMER_SINCE               MAIL        SUBSCR_CANC
        0.32598398         0.32598398         0.00000000         0.00000000
            MARGIN
        0.32598398

Here is a visualization of the missing data pattern:

[Figure omitted: plot of the missing-data pattern]

Here is a description of the variables:


EVENT_ID: Transaction ID
N_ITEMS: Total number of items purchased in the transaction
PROP_CONBINI: Proportion of "conbini" items in the transaction
FAV_GENRE: Favorite manga genre
PHONE_NUMBER: Customer's phone number (available)
MAIL: Customer's email address (available)
YEAR: Year of the transaction
MONTH: Month of the transaction
PAYMENT_TYPE: Agreed payment method
BOOKS_PAID: Number of manga paid for in previous transactions
PRICE: Transaction price
N_SUBSCRIPTIONS: Number of active manga series subscriptions
SUBSCR_CANC: Number of manga series subscriptions canceled in the past
POINT_OF_SALE: Point of sale
AGE: Customer's age
DAYS_FROM_PROMO: Days since the last promotion ended
MARGIN: Customer's cumulative margin
N_TRANSACTIONS: Total number of transactions made by the customer
CUSTOMER_SINCE: Date of the customer's first transaction
DATE_LAST_PURCHASE: Date of the customer's most recent transaction
PAID: Payment balance (target)

As you can see, the variables with ~33% missing values share the same systematic pattern. I am wondering what the best approach is to handle this situation.
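
A quick way to verify that these columns are missing together, as a minimal sketch in base R (the five column names are taken from the output above):

    # If the pattern is systematic, every row should be missing either
    # none or all five of these columns
    miss_cols <- c("BOOKS_PAID", "N_TRANSACTIONS", "DATE_LAST_PURCHASE",
                   "CUSTOMER_SINCE", "MARGIN")
    table(rowSums(is.na(dati_train[, miss_cols])))   # expect only 0 and 5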

My Questions: Should I perform additional tests to better understand the nature of the missing data? If so, what tests would you recommend?

What is the best practice for handling such missing data patterns?

My Initial Plan:

  1. Remove the variables that are irrelevant to the analysis.

  2. Create a "flag" variable to indicate observations with missing data. Should I create this flag only for the variables with the systematic pattern (those with ~33% missing values), or for all variables with missing values? (See the sketch after this list.)

  3. Proceed with multiple imputation or other, more sophisticated missing-data techniques.
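
For the flag in step 2, a minimal sketch (assuming the five ~33% columns are indeed missing together, so a single shared flag suffices; the flag name is just a placeholder):

    # One indicator for the shared missingness pattern; any of the five
    # co-missing columns can serve as the reference column
    dati_train$MISS_HISTORY <- as.integer(is.na(dati_train$BOOKS_PAID))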

After addressing the missing data, I plan to continue with my analysis. However, since I have never encountered a problem like this before, any advice or suggestions would be greatly appreciated.

Thank you in advance for your help!

$\endgroup$
  • $\begingroup$ Hi Giulio. Can you provide more information about the analysis you intend to conduct? And what is the nature of the data? What are the observations? Sales? Clients? Something else? It looks like when the "CUSTOMER_SINCE" variable is missing, some other variables ("DATE_LAST_PURCHASE", "N_TRANSACTIONS", etc.) are missing too. Could it be that these variables simply do not apply to new customers? (e.g. "last purchase" would not really make sense for a new client). It would be useful to have more detailed info about your data; otherwise we can only speculate. $\endgroup$ Commented Jan 2 at 11:00
  • $\begingroup$ If it's what I suspect (variables not applying to new customers), then you have a case of nested variables. If so, this other thread might answer some of your questions: stats.stackexchange.com/q/372257/164936 $\endgroup$ Commented Jan 2 at 11:03
  • $\begingroup$ @J-J-J is right. Also, one key question is whether the remaining missing values (AGE and FAV_GENRE) are missing at random, missing completely at random, or missing not at random. $\endgroup$ Commented Jan 2 at 11:20
  • $\begingroup$ @J-J-J Here’s the scenario: I’m working on a classification task. My target is a boolean variable that indicates 1 when the customer has repaid their credit and 0 when they haven’t. You’re right that some variables are missing systematically—if one is missing, the others are as well. I’ll add an edit with the description of the variables. The variable N_Transaction represents the number of transactions made by the customer. I have records where N_Transaction = 1. This makes me think that these represent the first transaction. Am I wrong? $\endgroup$ Commented Jan 2 at 22:43
  • $\begingroup$ @PeterFlom how do you suggest proceeding? $\endgroup$ Commented Jan 2 at 22:51

2 Answers

$\begingroup$

To summarize comments into an answer:

It's important to distinguish whether data are missing because they can't have a value or whether they just didn't get recorded. In this data set it seems that both types of missingness are in play.

For the former, a comment notes:

the ones that are systematically missing occur only when the "SUBSCR_CANC" variable equals 0.

Then you have what are considered "nested" variables: a set of variables that only have values when SUBSCR_CANC is non-zero. Those would seem to be BOOKS_PAID, N_TRANSACTIONS, DATE_LAST_PURCHASE, CUSTOMER_SINCE, and MARGIN.

Following the suggestion on the page linked by @J-J-J, you can include interactions between each of those variables and SUBSCR_CANC. As SUBSCR_CANC can take any non-negative integer value, you will have to consider how to model both the SUBSCR_CANC variable itself and those interactions.
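
A minimal sketch of that encoding in R, using MARGIN as an example (the helper names are placeholders, and the fill-in constant 0 is arbitrary; it is absorbed because the indicator enters the model):

    # Indicator: has the customer ever canceled a subscription?
    dati_train$HAS_CANC <- as.integer(dati_train$SUBSCR_CANC > 0)
    # Fill the structurally missing values with an arbitrary constant (0)
    dati_train$MARGIN_F <- ifelse(dati_train$HAS_CANC == 1,
                                  dati_train$MARGIN, 0)
    # Indicator main effect plus its interaction with the filled variable
    fit <- glm(PAID ~ HAS_CANC + HAS_CANC:MARGIN_F,
               family = binomial, data = dati_train)

The same fill-and-interact step would be repeated for the other nested variables (BOOKS_PAID, N_TRANSACTIONS, and so on).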

You certainly should NOT try to impute values that can't exist when SUBSCR_CANC = 0.

For the second type of missingness, multiple imputation should help deal with the missing FAV_GENRE and AGE values. I recommend Stef van Buuren's Flexible Imputation of Missing Data as a reference for how to distinguish and deal with data missing completely at random (MCAR), missing at random (MAR), or missing not at random (MNAR), as @PeterFlom noted. As @J-J-J pointed out, this site has many pages devoted to those matters.
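
For instance, a minimal sketch with the mice package (assuming FAV_GENRE is stored as a factor and the structurally missing columns have been handled as above, so only the genuinely unrecorded values get imputed):

    library(mice)
    # Defaults: predictive mean matching for the numeric AGE, polytomous
    # regression for the factor FAV_GENRE
    imp  <- mice(dati_train, m = 5, seed = 123)
    # Fit the analysis model on each completed data set, then pool
    fits <- with(imp, glm(PAID ~ AGE + FAV_GENRE, family = binomial))
    summary(pool(fits))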

$\endgroup$
$\begingroup$

Assuming a comprehensive EDA has been done, so that we have a reasonable idea of what the missingness patterns in our data look like, using a learner that can automatically account for missing values (e.g. LightGBM's LGBMClassifier, given the binary target here; see the sketch after the list below) can, as a first step, be preferable to imputation. That is because:

  1. It simplifies the analysis pipeline.
  2. It maintains missingness information which can be lost during imputation.
  3. It avoids the creation of additional variables to capture missingness patterns.
  4. It minimises the possibility of data leakage (a big issue when imputation is combined with a cross-validation scheme).
  5. It requires less time (we don't have to estimate another model).
  6. If there is indeed an underlying systematic pattern, it should be detected automatically during learning; if not, it will be ignored.
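
A minimal sketch of this imputation-free route in R with the lightgbm package, which routes NA values natively at each split (the feature list and parameters are placeholders, not recommendations):

    library(lightgbm)
    # Numeric feature matrix; NAs are left in place for the learner
    X <- as.matrix(dati_train[, c("AGE", "PRICE", "MARGIN", "BOOKS_PAID",
                                  "N_TRANSACTIONS", "DAYS_FROM_PROMO")])
    dtrain <- lgb.Dataset(data = X, label = dati_train$PAID)
    params <- list(objective = "binary", metric = "auc")
    model  <- lgb.train(params = params, data = dtrain, nrounds = 200)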

That is not to say that imputation is unnecessary. If we are not satisfied with our results, comparing this imputation-free approach with imputation methods and understanding their trade-offs can provide a more comprehensive view. For example, there might be an interpretability trade-off: automatic handling of missing values can make it harder to understand how missingness affects predictions.

$\endgroup$
