I have two DataFrames: df1 contains examples of cats and df2 contains examples of dogs.

I have to do some preprocessing on these DataFrames, which at the moment I'm doing by calling different functions. I would like to use scikit-learn pipelines instead.

One of these functions is a special encoder that looks at a column in the DataFrame and returns a special value. I rewrote that function as a class, the way I saw it done in scikit-learn:

from sklearn.base import BaseEstimator, TransformerMixin
import pandas as pd


class Encoder(BaseEstimator, TransformerMixin):

    def fit(self, X, y=None):
        # Nothing to learn: the encoding is a fixed lookup.
        return self

    def encode(self, row):
        # "bases" is a lookup mapping defined elsewhere in my module.
        return [bases[base] for base in row]

    def transform(self, X):
        assert isinstance(X, pd.DataFrame)
        # Return a fresh list each call, rather than accumulating
        # results in instance state across multiple transform() calls.
        return X["seq_new"].apply(self.encode).tolist()

So now I would have two lists as a result:

encode = Encoder()
X1 = encode.transform(df1)
X2 = encode.transform(df2)

The next step would be:

features = np.concatenate((X1, X2), axis=0)

Next, build the labels (in the same order as the features: cats from df1 first, then dogs from df2):

Y_cat = [[0]] * len(X1)
Y_dog = [[1]] * len(X2)
labels = np.concatenate((Y_cat, Y_dog), axis=0)
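The features/labels construction above can be sketched end-to-end with stand-in data (the values here are illustrative placeholders, not output from my real encoder):

```python
import numpy as np

# Stand-in for the encoder output: two lists of encoded rows.
X1 = [[1, 2, 3], [4, 5, 6]]       # cats (from df1)
X2 = [[7, 8, 9], [10, 11, 12]]    # dogs (from df2)

features = np.concatenate((X1, X2), axis=0)  # shape (4, 3)

# Labels aligned with the concatenation order: cats first, then dogs.
labels = np.concatenate(([[0]] * len(X1), [[1]] * len(X2)), axis=0)  # shape (4, 1)
```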

After some other manipulations, I'll do a model_selection.train_test_split() to split the data into train and test sets.

How would I call all these functions in a scikit-learn pipeline? The examples I found start from the point where the train/test split has already been done.

Comments:

  • Out of curiosity, why are you calling transform twice when you just concatenate the two DataFrames together afterwards? If you want to use a pipeline, you generally use it once your data engineering has been done, i.e., after a train/test split. The reason is this: if you fit() a model or transformer on your full dataset, it creates data leakage into the model between the train and test sets. – Commented Oct 24, 2018 at 21:50
  • @G.Anderson ok I see, probably that's why I didn't find examples. Thanks. – Commented Oct 24, 2018 at 22:00
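The split-first workflow the comment describes can be sketched like this (the toy data and the scaler/classifier choices are placeholders, not part of the question):

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression

# Toy data standing in for already-encoded features and labels.
X = pd.DataFrame({"f1": [0.1, 0.4, 0.35, 0.8],
                  "f2": [1.0, 2.0, 1.5, 3.0]})
y = np.array([0, 0, 1, 1])

# Split FIRST, then fit the pipeline on the training data only,
# so no statistics from the test set leak into the transformers.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.5, random_state=0, stratify=y)

pipe = Pipeline([
    ("scale", StandardScaler()),    # fitted on X_train only
    ("clf", LogisticRegression()),
])
pipe.fit(X_train, y_train)
preds = pipe.predict(X_test)
```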

1 Answer


The thing about an sklearn.pipeline.Pipeline is that every step needs to implement fit and transform. So, for instance, if you know for a fact that you will ALWAYS need to perform the concatenation step, and you really are dying to put it into a Pipeline (which I wouldn't, but that's just my humble opinion), you need to create a Concatenator class with the appropriate fit and transform methods.

Something like this:

import numpy as np
import pandas as pd
import sklearn.pipeline


class Encoder(object):
    def fit(self, X, *args, **kwargs):
        return self

    def transform(self, X):
        # Stand-in encoding: double every value.
        return X * 2


class Concatenator(object):
    def fit(self, X, *args, **kwargs):
        return self

    def transform(self, Xs):
        # Stack the list of encoded frames into a single array.
        return np.concatenate(Xs, axis=0)


class MultiEncoder(Encoder):
    def transform(self, Xs):
        # Apply the single-frame encoder to each frame in the list.
        return list(map(super().transform, Xs))


pipe = sklearn.pipeline.Pipeline([
    ("encoder", MultiEncoder()),
    ("concatenator", Concatenator()),
])

pipe.fit_transform([
    pd.DataFrame([[1, 2], [3, 4]]),
    pd.DataFrame([[5, 6], [7, 8]]),
])
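An equivalent sketch using scikit-learn's built-in FunctionTransformer (an alternative I'm suggesting, not part of the original answer) avoids the custom classes; validate must be left False (the default in recent scikit-learn versions) so the list of DataFrames passes through to the functions untouched:

```python
import numpy as np
import pandas as pd
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import FunctionTransformer

pipe = Pipeline([
    # Double each frame in the list (stand-in for real encoding).
    ("encoder", FunctionTransformer(lambda Xs: [X * 2 for X in Xs])),
    # Stack the encoded frames into one array.
    ("concatenator", FunctionTransformer(lambda Xs: np.concatenate(Xs, axis=0))),
])

result = pipe.fit_transform([
    pd.DataFrame([[1, 2], [3, 4]]),
    pd.DataFrame([[5, 6], [7, 8]]),
])
# result is a 4x2 array: each frame doubled, then stacked.
```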
