
I'm doing some machine learning practice on Kaggle and I'm beginning to use the sklearn.pipeline.Pipeline class to transform my data several times and then train a model on it.

I want to encapsulate several parts of pre-processing my data: dropping rows with 30% or more NaNs, dropping columns with 30% or more NaNs, amongst other things.

Here's my attempt at a custom Transformer:

from sklearn.base import BaseEstimator, TransformerMixin

class NanHandler(BaseEstimator, TransformerMixin):
    def __init__(self, target_col, row_threshold=0.7, col_threshold=0.7):
        self.target_col = target_col
        self.row_threshold = row_threshold
        self.col_threshold = col_threshold

    def fit(self, X, y=None):
        return self

    def transform(self, X):
        # Drop rows, then columns, whose NaN fraction is at or above
        # 1 - threshold (i.e. 30% or more NaNs with the defaults of 0.7)
        X = X.loc[X.isna().mean(axis=1) < 1 - self.row_threshold]
        return X.loc[:, X.isna().mean(axis=0) < 1 - self.col_threshold]
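
To sanity-check it, here's the behaviour on a toy frame (made-up column names):

import numpy as np
import pandas as pd

df = pd.DataFrame({
    "colA": [1.0, np.nan, np.nan, 4.0],  # 50% NaN
    "colB": [1.0, 2.0, 3.0, 4.0],
    "colC": [5.0, 6.0, 7.0, 8.0],
    "colD": [9.0, 10.0, 11.0, 12.0],
    "target": [0, 1, 0, 1],
})

handler = NanHandler(target_col="target")
print(handler.fit_transform(df).columns.tolist())
# ['colB', 'colC', 'colD', 'target'] -- colA is gone (50% NaN >= 30%)
# No rows are dropped: each row is at most 1/5 = 20% NaN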

However, I want to use this Transformer with k-fold cross-validation. I'm concerned that if I do 3-fold cross-validation, it's possible (if unlikely) that I run into the following situation:

Train on folds 1 and 2, test on 3

Train on folds 2 and 3, test on 1

Train on folds 1 and 3, test on 2

Folds 1 and 2 combined may have over 30% NaNs in a specific column (call it colA), so my NanHandler will drop this column before training. However, folds 2 and 3 combined may have less than 30% NaNs, and so it won't drop colA, resulting in my model being trained on different columns than in the first pass.
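
To make the worry concrete, here's a sketch with hypothetical data where colA's NaNs are concentrated in fold 1:

import numpy as np
import pandas as pd

vals = np.arange(30.0)
df = pd.DataFrame({
    "colA": [np.nan] * 8 + [1.0] * 22,  # NaNs land entirely in fold 1
    "colB": vals, "colC": vals, "colD": vals,
    "target": [0, 1] * 15,
})
fold1, fold2, fold3 = df.iloc[:10], df.iloc[10:20], df.iloc[20:]

handler = NanHandler(target_col="target")
# folds 1 + 2: colA is 8/20 = 40% NaN, so it gets dropped
print(handler.fit_transform(pd.concat([fold1, fold2])).columns.tolist())
# folds 2 + 3: colA has no NaNs, so it survives
print(handler.fit_transform(pd.concat([fold2, fold3])).columns.tolist())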

1) How should I handle this situation?

2) Is this also a problem if I want to drop rows that have 30% or more NaN values (in that I'll train on a different number of rows during k-fold cross-validation)?

Thanks!

  • Missing values should be handled for the whole dataset and not separately for train and test. Commented Jan 30, 2018 at 6:16
  • @Vivek Kumar that's wrong. The preprocessing has to be done separately for the train set and the test set; otherwise the test set won't be "clean". In this case, maybe stratified cross-validation will work, or dropping the features altogether. Commented Aug 23, 2018 at 17:09
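
A minimal sketch of the pattern the second comment describes: learn the column decision in fit on the training folds only, then reapply exactly that decision in transform, so train and test folds always end up with the same columns (the class name here is made up):

from sklearn.base import BaseEstimator, TransformerMixin

class FittedNanColumnDropper(BaseEstimator, TransformerMixin):
    def __init__(self, col_threshold=0.7):
        self.col_threshold = col_threshold

    def fit(self, X, y=None):
        # Decide, on the training folds only, which columns survive
        self.columns_ = X.columns[X.isna().mean() < 1 - self.col_threshold]
        return self

    def transform(self, X):
        # Apply that same column selection to any later fold, train or test
        return X[self.columns_]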

1 Answer


The figure of 30% is a little ambiguous to me: 30% of your entire dataset, or 30% within each fold? For example, if you have a dataset with 90 samples and you break it up into 3 folds of 30, would you want 70% of the columns and rows within a fold of 30 points to be present? (I'm going to assume that this is the case.)

Then perhaps the following could work:

  1. Clear your entire dataset of all features and samples that have any missing values (NaN), and collect the data points that have at least one NaN into a separate pool.
  2. Then build your folds from the clean data.
  3. Now, based on your number of features and examples, you can resample points from your pool of points with NaNs and add them back to each of your folds (see the sketch below).
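
A rough sketch of those three steps (the function name and the nan_per_fold knob are made up; tune the resampling to your own feature and example counts):

import numpy as np
import pandas as pd

def build_folds_with_nan_pool(df, n_folds=3, nan_per_fold=5, seed=0):
    has_nan = df.isna().any(axis=1)
    clean, pool = df[~has_nan], df[has_nan]  # step 1: pool the rows with NaNs
    shuffled = clean.sample(frac=1, random_state=seed)
    # step 2: build the folds from the clean rows only
    folds = [shuffled.iloc[idx]
             for idx in np.array_split(np.arange(len(shuffled)), n_folds)]
    # step 3: draw rows from the NaN pool and add them back into each fold
    return [pd.concat([fold, pool.sample(n=min(nan_per_fold, len(pool)),
                                         random_state=seed + i)])
            for i, fold in enumerate(folds)]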

I hope this helps.
