I have two DataFrames: df1 contains examples of cats and df2 contains examples of dogs.
I need to do some preprocessing on these DataFrames, which at the moment I do by calling separate functions. I would like to move this into scikit-learn pipelines.
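From what I've read, a plain function can be wrapped with sklearn's FunctionTransformer to become a pipeline step. A minimal sketch of what I mean (drop_na_rows is just a made-up stand-in for one of my functions):

    import pandas as pd
    from sklearn.preprocessing import FunctionTransformer

    # hypothetical stand-in for one of my preprocessing functions
    def drop_na_rows(df: pd.DataFrame) -> pd.DataFrame:
        return df.dropna()

    # FunctionTransformer turns the plain function into a transformer step
    step = FunctionTransformer(drop_na_rows)
    cleaned = step.fit_transform(df1)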
One of these functions is a special encoder function that looks at one column of the df and returns an encoded value for each row. I rewrote that function as a class, following the pattern I've seen used in scikit-learn:
    import pandas as pd
    from sklearn.base import BaseEstimator, TransformerMixin

    class Encoder(BaseEstimator, TransformerMixin):
        def fit(self, X, y=None):
            # nothing to learn from the data, so fit is a no-op
            return self

        def encode(self, row):
            # `bases` is the external lookup table my original function uses
            return [bases[base] for base in row]

        def transform(self, X):
            assert isinstance(X, pd.DataFrame)
            # return a fresh result on every call instead of accumulating
            # into an instance attribute, so transforming df1 and then df2
            # doesn't mix the two results
            return X["seq_new"].apply(self.encode).tolist()
So now I would have two lists as a result:
    encode = Encoder()
    X1 = encode.transform(df1)
    X2 = encode.transform(df2)
The next step would be to concatenate the features:

    import numpy as np

    features = np.concatenate((X1, X2), axis=0)
Then I build the labels, in the same order as the features (df1 are the cats, df2 the dogs):

    Y_cat = [[0]] * len(X1)
    Y_dog = [[1]] * len(X2)
    labels = np.concatenate((Y_cat, Y_dog), axis=0)
After some other manipulations I then do a model_selection.train_test_split() to split the data into train and test sets.
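Concretely, I imagine the split step looking something like this (test_size=0.2 and the random_state are arbitrary choices for the sketch):

    from sklearn.model_selection import train_test_split

    # split the assembled features/labels into train and test portions
    X_train, X_test, y_train, y_test = train_test_split(
        features, labels, test_size=0.2, random_state=42
    )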
How would I chain all these steps in a scikit-learn pipeline? The examples I found all start from the point where the train/test split has already been done.
I also understand that if you fit() a model or transformer on your full dataset, it creates data leakage between the train and test sets.
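To make the question concrete, here is a minimal sketch of what I imagine the end result could look like. SVC is just a placeholder classifier, and I'm assuming every seq_new row encodes to a vector of the same length; the point is that the split happens on the raw data, so fit() never sees the test set:

    import numpy as np
    import pandas as pd
    from sklearn.model_selection import train_test_split
    from sklearn.pipeline import Pipeline
    from sklearn.svm import SVC

    # stack the raw (un-encoded) examples and build matching labels
    raw = pd.concat([df1, df2], ignore_index=True)
    y = np.array([0] * len(df1) + [1] * len(df2))  # 0 = cat, 1 = dog

    # split the raw data first, so nothing downstream sees the test set
    raw_train, raw_test, y_train, y_test = train_test_split(
        raw, y, test_size=0.2, random_state=42
    )

    # chain the encoder and the final estimator into one object;
    # SVC is a placeholder estimator for this sketch
    pipe = Pipeline([
        ("encoder", Encoder()),
        ("clf", SVC()),
    ])

    pipe.fit(raw_train, y_train)          # fit touches only the train split
    print(pipe.score(raw_test, y_test))   # evaluate on the held-out test split

Is this the right way to structure it, or is there a more idiomatic approach?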