I'm doing some machine learning practice on Kaggle and I'm beginning to use the sklearn.pipeline.Pipeline class to transform my data several times and then train a model on it.
I want to encapsulate several of my pre-processing steps: dropping rows with 30% or more NaNs, dropping columns with 30% or more NaNs, amongst other things.
Here's my attempt at a custom Transformer so far (the transform body is just a rough first pass using pandas' dropna with thresh):

from sklearn.base import BaseEstimator, TransformerMixin

class NanHandler(BaseEstimator, TransformerMixin):
    def __init__(self, target_col, row_threshold=0.7, col_threshold=0.7):
        self.target_col = target_col
        self.row_threshold = row_threshold
        self.col_threshold = col_threshold

    def transform(self, X):
        # drop rows and columns with >= 30% NaN values, i.e. keep only rows/columns
        # where at least row_threshold / col_threshold of the values are non-NaN
        X = X.dropna(axis=1, thresh=int(self.col_threshold * len(X)))
        X = X.dropna(axis=0, thresh=int(self.row_threshold * X.shape[1]))
        return X

    def fit(self, *_):
        return self
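For context, this is roughly how I plan to wire it up (just a sketch: "train.csv", the "target" column, the SimpleImputer step, and RandomForestClassifier are placeholders for my actual data, preprocessing, and model):

import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.impute import SimpleImputer
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import Pipeline

df = pd.read_csv("train.csv")                      # placeholder dataset
X, y = df.drop(columns=["target"]), df["target"]   # placeholder target column

pipe = Pipeline([
    ("nan_handler", NanHandler(target_col="target")),
    ("impute", SimpleImputer(strategy="median")),  # placeholder for the "other things"
    ("model", RandomForestClassifier()),
])

scores = cross_val_score(pipe, X, y, cv=3)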
However, I want to use this Transformer with k-fold cross-validation, and I'm concerned about an unlikely (but possible) situation. Say I do 3-fold cross-validation:
Train on folds 1 and 2, test on 3
Train on folds 2 and 3, test on 1
Train on folds 1 and 3, test on 2
Folds 1 and 2 combined may have over 30% NaNs in a specific column (call it colA), so my NanHandler will drop that column before training. However, folds 2 and 3 combined may have fewer than 30% NaNs in colA, so it won't be dropped, resulting in my model being trained on a different set of columns than in the first pass.
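To make that concrete, here's a toy illustration (made-up data; the NaN fractions are chosen just to trigger the two different outcomes with the dropna-based transform above):

import numpy as np
import pandas as pd

# "folds 1 and 2" combined: colA is 50% NaN, so it gets dropped
folds_1_2 = pd.DataFrame({"colA": [np.nan] * 5 + [1.0] * 5,
                          "colB": range(10)})

# "folds 2 and 3" combined: colA is only 20% NaN, so it survives
folds_2_3 = pd.DataFrame({"colA": [np.nan] * 2 + [1.0] * 8,
                          "colB": range(10)})

handler = NanHandler(target_col=None)  # target_col isn't relevant here
print(handler.fit_transform(folds_1_2).columns.tolist())  # ['colB']
print(handler.fit_transform(folds_2_3).columns.tolist())  # ['colA', 'colB']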
1) How should I handle this situation?
2) Is this also a problem if I want to drop rows that have 30% or more NaN values (in that I'll train on a different number of rows during k-fold cross-validation)?
Thanks!