
I am solving a binary classification problem over some text documents using Python and the scikit-learn library, and I wish to try different models to compare and contrast results: mainly a Naive Bayes classifier and an SVM tuned with stratified K-fold cross-validation (CV=5). I am having difficulty combining all of the methods into one pipeline, given that the latter model uses GridSearchCV(). I cannot have multiple pipelines running during a single implementation due to concurrency issues, hence I need to implement all the different models using one pipeline.

This is what I have so far:

from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
from sklearn.naive_bayes import MultinomialNB
from sklearn.svm import SVC
from sklearn.pipeline import Pipeline
from sklearn.model_selection import GridSearchCV, StratifiedKFold

# pipeline for naive bayes
naive_bayes_pipeline = Pipeline([
    ('bow_transformer', CountVectorizer(analyzer=split_into_lemmas, stop_words='english')),
    ('tf_idf', TfidfTransformer()),
    ('classifier', MultinomialNB())
])

# accessing and using the pipelines
naive_bayes = naive_bayes_pipeline.fit(train_data['data'], train_data['gender'])

# pipeline for SVM
svm_pipeline = Pipeline([
    ('bow_transformer', CountVectorizer(analyzer=split_into_lemmas, stop_words='english')),
    ('tf_idf', TfidfTransformer()),
    ('classifier', SVC())
])

param_svm = [
  {'classifier__C': [1, 10], 'classifier__kernel': ['linear']},
  {'classifier__C': [1, 10], 'classifier__gamma': [0.001, 0.0001], 'classifier__kernel': ['rbf']},
]

grid_svm_skf = GridSearchCV(
    svm_pipeline,  # pipeline from above
    param_grid=param_svm,  # parameters to tune via cross validation
    refit=True,  # fit using all data, on the best detected classifier
    n_jobs=-1,  # number of cores to use for parallelization; -1 uses "all cores"
    scoring='accuracy',
    cv=StratifiedKFold(n_splits=5),  # stratified 5-fold CV; folds are stratified on the labels passed to fit
)

svm_skf = grid_svm_skf.fit(train_data['data'], train_data['gender'])
predictions_svm_skf = svm_skf.predict(test_data['data'])

EDIT 1: The second pipeline is the only one using GridSearchCV(), and it never seems to be executed.

EDIT 2: Added more code to show the GridSearchCV() usage.

  • What do you mean by concurrency issues? Are you running out of memory? How about saving each pipeline (after it is fit) to a file? Then load the one you want and train your model. Also, please share any error messages you are seeing. Commented Jan 29, 2018 at 18:32
  • Can you elaborate more about "I cannot have multiple Pipelines running during a single implementation due to concurrency issues", I suspect this is the X-Y problem. At least, it is not obvious to me what concurrency issues would be solved by a Pipeline. Commented Jan 29, 2018 at 18:35
  • @pault I can't seem to start the execution of the second pipeline, given that I already have a running pipeline. Commented Jan 29, 2018 at 18:35
  • So, then the second pipeline is the one using grid search... why do you say it never appears to be executed? I think you should expand on this as an edit to your question, before this becomes a long chain of comments. Commented Jan 29, 2018 at 18:43
  • @denbuttigieg, try passing GridSearchCV(..., verbose=3) and check what it outputs (a quick sketch follows these comments). Commented Jan 29, 2018 at 18:59
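Following that last suggestion, here is a quick sketch that re-uses the grid_svm_skf object defined in the question; verbose is a standard GridSearchCV constructor parameter, so it can also be set afterwards via set_params:

# Same search as defined in the question, but with per-candidate/per-fold progress printed,
# so you can confirm whether the grid search actually runs.
grid_svm_skf.set_params(verbose=3)
svm_skf = grid_svm_skf.fit(train_data['data'], train_data['gender'])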

1 Answer


Consider checking out similar questions here:

  1. Compare multiple algorithms with sklearn pipeline
  2. Pipeline: Multiple classifiers?

To summarize,

Here is an easy way to optimize over any classifier and, for each classifier, over any settings of its parameters.

Create a switcher class that works for any estimator

from sklearn.base import BaseEstimator
from sklearn.linear_model import SGDClassifier


class ClfSwitcher(BaseEstimator):

    def __init__(self, estimator=SGDClassifier()):
        """
        A custom BaseEstimator that can switch between classifiers.
        :param estimator: sklearn object - the classifier
        """
        self.estimator = estimator

    def fit(self, X, y=None, **kwargs):
        self.estimator.fit(X, y)
        return self

    def predict(self, X, y=None):
        return self.estimator.predict(X)

    def predict_proba(self, X):
        return self.estimator.predict_proba(X)

    def score(self, X, y):
        return self.estimator.score(X, y)

Now you can pass in anything for the estimator parameter. And you can optimize any parameter for any estimator you pass in as follows:

Perform hyper-parameter optimization

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.linear_model import SGDClassifier
from sklearn.pipeline import Pipeline
from sklearn.model_selection import GridSearchCV

pipeline = Pipeline([
    ('tfidf', TfidfVectorizer()),
    ('clf', ClfSwitcher()),
])

parameters = [
    {
        'clf__estimator': [SGDClassifier()], # SVM if hinge loss / logreg if log loss
        'tfidf__max_df': (0.25, 0.5, 0.75, 1.0),
        'tfidf__stop_words': ['english', None],
        'clf__estimator__penalty': ('l2', 'elasticnet', 'l1'),
        'clf__estimator__max_iter': [50, 80],
        'clf__estimator__tol': [1e-4],
        'clf__estimator__loss': ['hinge', 'log', 'modified_huber'],
    },
    {
        'clf__estimator': [MultinomialNB()],
        'tfidf__max_df': (0.25, 0.5, 0.75, 1.0),
        'tfidf__stop_words': [None],
        'clf__estimator__alpha': (1e-2, 1e-3, 1e-1),
    },
]

gscv = GridSearchCV(pipeline, parameters, cv=5, n_jobs=12, return_train_score=False, verbose=3)
gscv.fit(train_data, train_labels)

How to interpret clf__estimator__loss

clf__estimator__loss is interpreted as the loss parameter of whatever estimator is currently plugged in (estimator = SGDClassifier() in the first parameter grid above), and estimator is itself a parameter of clf, which is the ClfSwitcher object.
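
For example, once the search above has been fitted you can check which classifier and which parameter combination won. This is a minimal sketch that only assumes the gscv object from above; best_score_, best_params_ and best_estimator_ are standard GridSearchCV attributes, and test_docs is a stand-in for whatever held-out documents you want to predict on:

print(gscv.best_score_)                     # mean cross-validated score of the best candidate
print(gscv.best_params_['clf__estimator'])  # the winning classifier (SGDClassifier or MultinomialNB)
print(gscv.best_params_)                    # the full winning parameter setting

# refit=True is the default, so best_estimator_ is the whole pipeline refitted on all
# the training data and can be used directly on new documents.
predictions = gscv.best_estimator_.predict(test_docs)  # test_docs: held-out documents (assumed)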


6 Comments

  • I am familiar with GridSearchCV in the traditional case with one estimator. Can you explain what is actually happening in GridSearchCV when you provide parameters with two estimators? Does it perform 5-fold CV twice (i.e., one round for the SGDClassifier and one round for MultinomialNB) and then repeat it for each set of grid parameters?
  • Do you know if it is possible to provide multiple datasets as a parameter so that I can fit different estimators with different datasets?
  • Sure.. for dataset in datasets: gscv.fit(...)
  • I don't think that would work, as the multiple calls to gscv.fit would clobber the fit from the last dataset. I want the fit from each of the calls with different datasets to be kept.
  • Clobber? Just initialize each time: gscv = GridSearchCV(); gscv.fit(). There isn't much more to this (see the sketch below).
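
To make that last comment concrete, here is a minimal sketch assuming a hypothetical datasets dict that maps a name to (documents, labels); a fresh GridSearchCV is created for every dataset, so each fitted search is kept rather than overwritten (GridSearchCV clones the pipeline internally, so re-using the same pipeline and parameters objects is safe):

results = {}
for name, (docs, labels) in datasets.items():  # `datasets` is a hypothetical dict of corpora
    gscv = GridSearchCV(pipeline, parameters, cv=5, n_jobs=-1, return_train_score=False, verbose=3)
    gscv.fit(docs, labels)
    results[name] = gscv                       # keep the fitted search for this dataset
    print(name, gscv.best_score_, gscv.best_params_)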
