
I'm training a model using sklearn, and one stage of my training requires running two different feature extraction pipelines.

Each pipeline fits the data without issue, and when each one is fitted and then used straight away, it also transforms the data without issue.

However, when the first pipeline is called after the second pipeline has been fitted, the first pipeline appears to have been altered, which results in a dimension mismatch error.

The code below reproduces the issue. I've simplified it heavily: in reality my two pipelines use different parameters, but this is a minimal reproducible example.

from sklearn.pipeline import Pipeline
from sklearn.decomposition import TruncatedSVD
from sklearn.feature_extraction.text import CountVectorizer
import pandas as pd

vectorizer = CountVectorizer()

data1 = ['foo bar', 'a foo bar duck', 'goose goose']
data2 = ['foo', 'duck duck swan', 'goose king queen goose']

pipeline1 = Pipeline([('vec', vectorizer),('svd', TruncatedSVD(n_components = 3))]).fit(data1)

print(pipeline1.transform(data1))

# Works fine

pipeline2 = Pipeline([('vec', vectorizer),('svd', TruncatedSVD(n_components = 3))]).fit(data2)

print(pipeline2.transform(data2))

# Works fine

print(pipeline1.transform(data1))

# ValueError: dimension mismatch

Clearly, fitting "pipeline2" interferes with "pipeline1" in some way, but I have no idea why. I'd like to be able to use both pipelines at the same time.

  • What happens if you re-initialise vectorizer = CountVectorizer() after you call the fit to pipeline1? Commented Aug 28, 2019 at 1:40
  • Did this fix the problem you were having? Commented Aug 28, 2019 at 5:58
  • From the documentation: "Hyper-parameters of an estimator can be updated after it has been constructed via the set_params() method. Calling fit() more than once will overwrite what was learned by any previous fit()." Commented Aug 28, 2019 at 7:00

1 Answer

What happens:

Because vectorizer is defined once and passed to both pipelines, the two pipelines share the same object. Here is the sequence of events:

  1. You create vectorizer.
  2. You fit the first pipeline:

    • vectorizer is fitted; its output has shape (3, 4), i.e. 3 documents and 4 words: foo, bar, duck, goose (the single-character token "a" is dropped by the default token pattern)
    • svd is fitted to expect 4 columns as input
  3. You fit the second pipeline:

    • the same vectorizer is fitted again, this time with 6 words (i.e. columns) as output: foo, duck, swan, goose, king, queen
    • the other svd is fitted, which is not relevant here
  4. You call the first pipeline again:

    • the vectorizer outputs a (3, 6) matrix, using the vocabulary from the last fit, i.e. the one learned for the second pipeline
    • the svd was fitted to accept 4 columns as input, so it raises an exception
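The sharing described above can be checked directly. The following is a minimal sketch that reuses the data from the question and leaves out the SVD step for brevity; the names shared_vectorizer, vocab_after_first_fit, vocab_after_second_fit and same_object are made up for illustration. It shows that both pipelines hold a reference to the same CountVectorizer, and that the second fit replaces the vocabulary learned by the first.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.pipeline import Pipeline

data1 = ['foo bar', 'a foo bar duck', 'goose goose']
data2 = ['foo', 'duck duck swan', 'goose king queen goose']

# One vectorizer instance, passed to both pipelines (as in the question)
shared_vectorizer = CountVectorizer()

pipeline1 = Pipeline([('vec', shared_vectorizer)]).fit(data1)
vocab_after_first_fit = sorted(shared_vectorizer.vocabulary_)

pipeline2 = Pipeline([('vec', shared_vectorizer)]).fit(data2)
vocab_after_second_fit = sorted(shared_vectorizer.vocabulary_)

# Both pipelines point at the very same object, not at copies
same_object = pipeline1.named_steps['vec'] is pipeline2.named_steps['vec']

print(same_object)             # True
print(vocab_after_first_fit)   # ['bar', 'duck', 'foo', 'goose']
print(vocab_after_second_fit)  # ['duck', 'foo', 'goose', 'king', 'queen', 'swan']
```

Since the step inside pipeline1 is that same object, pipeline1 has no way of keeping the vocabulary it was originally fitted with.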

How to verify this:

from sklearn.pipeline import Pipeline
from sklearn.feature_extraction.text import CountVectorizer

vectorizer = CountVectorizer()

data1 = ['foo bar', 'a foo bar duck', 'goose goose']
data2 = ['foo', 'duck duck swan', 'goose king queen goose']

# Pipelines containing only the shared vectorizer, to inspect its output shape
pipeline1 = Pipeline([('vec', vectorizer)]).fit(data1)
print(pipeline1.transform(data1).shape)
# (3, 4)

pipeline2 = Pipeline([('vec', vectorizer)]).fit(data2)
print(pipeline2.transform(data2).shape)
# (3, 6)

# pipeline1 now uses the vocabulary learned from data2
print(pipeline1.transform(data1).shape)
# (3, 6)

How to fix it:

Create a separate CountVectorizer instance inside each pipeline instead of sharing one, like so:

from sklearn.pipeline import Pipeline
from sklearn.decomposition import TruncatedSVD
from sklearn.feature_extraction.text import CountVectorizer


data1 = ['foo bar', 'a foo bar duck', 'goose goose']
data2 = ['foo', 'duck duck swan', 'goose king queen goose']

pipeline1 = Pipeline([('vec', CountVectorizer()),('svd', TruncatedSVD(n_components = 3))]).fit(data1)

print(pipeline1.transform(data1))

# Works fine

pipeline2 = Pipeline([('vec', CountVectorizer()),('svd', TruncatedSVD(n_components = 3))]).fit(data2)

print(pipeline2.transform(data2))

# Works fine

print(pipeline1.transform(data1))
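
If the vectorizer needs non-default parameters and you would rather configure it only once, one possible alternative is to give each pipeline its own unfitted copy using sklearn.base.clone. This is only a sketch under that assumption: base_vectorizer and the lowercase=True setting are placeholders for whatever configuration you actually use, and the SVD step is omitted for brevity.

```python
from sklearn.base import clone
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.pipeline import Pipeline

data1 = ['foo bar', 'a foo bar duck', 'goose goose']
data2 = ['foo', 'duck duck swan', 'goose king queen goose']

# Hypothetical shared configuration; this object itself is never fitted
base_vectorizer = CountVectorizer(lowercase=True)

# clone() returns a new unfitted estimator with the same parameters,
# so each pipeline learns and keeps its own vocabulary
pipeline1 = Pipeline([('vec', clone(base_vectorizer))]).fit(data1)
pipeline2 = Pipeline([('vec', clone(base_vectorizer))]).fit(data2)

shape1 = pipeline1.transform(data1).shape
shape2 = pipeline2.transform(data2).shape

print(shape1)  # (3, 4): pipeline1 is unaffected by fitting pipeline2
print(shape2)  # (3, 6)
```

The effect is the same as writing CountVectorizer() inline in each pipeline; clone just avoids repeating the parameter list.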