I have two DataFrames: df1 contains examples of cats and df2 contains examples of dogs.
I need to do some preprocessing on these DataFrames, which at the moment I do by calling separate functions. I would like to move this into scikit-learn pipelines.
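From what I've read, a plain function can be wrapped with sklearn's FunctionTransformer to become a pipeline step. A minimal sketch of what I mean (drop_na_rows is just a made-up stand-in for one of my functions):

    import pandas as pd
    from sklearn.preprocessing import FunctionTransformer

    # hypothetical stand-in for one of my preprocessing functions
    def drop_na_rows(df: pd.DataFrame) -> pd.DataFrame:
        return df.dropna()

    # FunctionTransformer turns the plain function into a transformer step
    step = FunctionTransformer(drop_na_rows)
    cleaned = step.fit_transform(df1)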
One of these functions is a special encoder function that looks at one column of the df and returns an encoded value for each row. I rewrote that function as a class, following the pattern I've seen used in scikit-learn:
    import pandas as pd
    from sklearn.base import BaseEstimator, TransformerMixin

    class Encoder(BaseEstimator, TransformerMixin):
        def fit(self, X, y=None):
            # nothing to learn from the data, so fit is a no-op
            return self

        def encode(self, row):
            # `bases` is the external lookup table my original function uses
            return [bases[base] for base in row]

        def transform(self, X):
            assert isinstance(X, pd.DataFrame)
            # return a fresh result on every call instead of accumulating
            # into an instance attribute, so transforming df1 and then df2
            # doesn't mix the two results
            return X["seq_new"].apply(self.encode).tolist()
So now I would have two lists as a result:
    encode = Encoder()
    X1 = encode.transform(df1)
    X2 = encode.transform(df2)
The next step would be to concatenate the features:

    import numpy as np

    features = np.concatenate((X1, X2), axis=0)
Then I build the labels, in the same order as the features (df1 are the cats, df2 the dogs):

    Y_cat = [[0]] * len(X1)
    Y_dog = [[1]] * len(X2)
    labels = np.concatenate((Y_cat, Y_dog), axis=0)
After some other manipulations I then do a model_selection.train_test_split() to split the data into train and test sets.
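Concretely, I imagine the split step looking something like this (test_size=0.2 and the random_state are arbitrary choices for the sketch):

    from sklearn.model_selection import train_test_split

    # split the assembled features/labels into train and test portions
    X_train, X_test, y_train, y_test = train_test_split(
        features, labels, test_size=0.2, random_state=42
    )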
How would I chain all these steps in a scikit-learn pipeline? The examples I found all start from the point where the train/test split has already been done.
I also understand that if you fit() a model or transformer on your full dataset, it creates data leakage between the train and test sets.
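To make the question concrete, here is a minimal sketch of what I imagine the end result could look like. SVC is just a placeholder classifier, and I'm assuming every seq_new row encodes to a vector of the same length; the point is that the split happens on the raw data, so fit() never sees the test set:

    import numpy as np
    import pandas as pd
    from sklearn.model_selection import train_test_split
    from sklearn.pipeline import Pipeline
    from sklearn.svm import SVC

    # stack the raw (un-encoded) examples and build matching labels
    raw = pd.concat([df1, df2], ignore_index=True)
    y = np.array([0] * len(df1) + [1] * len(df2))  # 0 = cat, 1 = dog

    # split the raw data first, so nothing downstream sees the test set
    raw_train, raw_test, y_train, y_test = train_test_split(
        raw, y, test_size=0.2, random_state=42
    )

    # chain the encoder and the final estimator into one object;
    # SVC is a placeholder estimator for this sketch
    pipe = Pipeline([
        ("encoder", Encoder()),
        ("clf", SVC()),
    ])

    pipe.fit(raw_train, y_train)          # fit touches only the train split
    print(pipe.score(raw_test, y_test))   # evaluate on the held-out test split

Is this the right way to structure it, or is there a more idiomatic approach?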