Taking into account instance cost in learning?

Question

I am trying to take costs into account in learning. The set-up is as follows: a statistical learning problem with the usual X and y, where y is imbalanced (roughly 1% of ones).

Scikit-learn usually offers weight parameters that you can set to match the imbalance, so the weights depend on the target. Assigning weights transforms the log loss into a weighted log loss, as shown below.

$\text{Log Loss} = -\frac{1}{N} \sum_{i=1}^{N} \left[ y_i \log(\hat{y}_i) + (1 - y_i) \log(1 - \hat{y}_i) \right]$

$\text{Weighted Log Loss} = -\frac{1}{N} \sum_{i=1}^{N} w_{y_i} \left[ y_i \log(\hat{y}_i) + (1 - y_i) \log(1 - \hat{y}_i) \right]$
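
As a minimal sketch of the two formulas above (not scikit-learn's internal implementation), the weighted loss simply looks up a class weight for each instance. The function name `weighted_log_loss`, the clipping constant `eps`, and the toy data are assumptions made for illustration.

```python
import numpy as np

def weighted_log_loss(y, p, class_weights=(1.0, 1.0), eps=1e-12):
    """Weighted log loss as written above: the weight depends only on y_i.

    class_weights[k] is the weight w_k for class k; (1, 1) gives the plain log loss.
    The result is normalized by N, as in the formula, not by the sum of the weights.
    """
    y = np.asarray(y, dtype=float)
    p = np.clip(np.asarray(p, dtype=float), eps, 1 - eps)  # avoid log(0)
    w = np.where(y == 1, class_weights[1], class_weights[0])  # w_{y_i}
    per_instance = y * np.log(p) + (1 - y) * np.log(1 - p)
    return -np.mean(w * per_instance)

# Toy data (assumed): three negatives, one positive.
y = np.array([0, 0, 0, 1])
p = np.array([0.1, 0.2, 0.1, 0.3])

plain = weighted_log_loss(y, p)
upweighted = weighted_log_loss(y, p, class_weights=(1.0, 10.0))
```

In scikit-learn itself, the same idea is usually expressed through the `class_weight` constructor argument or the `sample_weight` argument of `fit`, for estimators that support them (for example `LogisticRegression`).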

As you can see, the weight $w$ is constant within each class and depends only on $y_i$. More generally, I would like to specify the weights in terms of error costs. Specifically, I have costs associated with:

$c_{1,1}$, cost associated with True Positives (correctly identified positives)
$c_{0,1}$, cost associated with False Positives (Type 1 error)
$c_{1,0}$, cost associated with False Negatives (Type 2 error)
$c_{0,0}$, cost associated with True Negatives (correctly identified negatives)

With three sub-cases:

$c_{y_i,1,1}, c_{y_i,0,1}, c_{y_i,1,0}, c_{y_i,0,0}$ depend only on the class: typically I have classification costs for each class (8 parameters in total).
$c_{i,1,1}, c_{i,0,1}, c_{i,1,0}, c_{i,0,0}$ depend on the instance, so I have four values for each instance.
$c_{i,1,1}(\hat{y}_i), c_{i,0,1}(\hat{y}_i), c_{i,1,0}(\hat{y}_i), c_{i,0,0}(\hat{y}_i)$ depend on both the instance and the model output.
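
To make the instance-dependent sub-case concrete, here is a small sketch of one possible representation: a cost array of shape `(N, 2, 2)` indexed as `cost[i, true_label, predicted_label]`, following the index order used in the list above. The function name `total_cost`, the array layout, and the numbers are assumptions for illustration only.

```python
import numpy as np

def total_cost(y_true, y_pred, cost):
    """Sum the realized cost of hard 0/1 predictions.

    cost has shape (N, 2, 2); cost[i, t, q] is the cost for instance i when
    the true label is t and the predicted label is q, so cost[i, 1, 0] is a
    false negative and cost[i, 0, 1] is a false positive.
    """
    y_true = np.asarray(y_true, dtype=int)
    y_pred = np.asarray(y_pred, dtype=int)
    idx = np.arange(len(y_true))
    return cost[idx, y_true, y_pred].sum()

# Toy costs (assumed): correct predictions are free, false positives cost 1,
# and the false negative cost varies per instance.
cost = np.zeros((3, 2, 2))
cost[:, 0, 1] = 1.0
cost[:, 1, 0] = [5.0, 5.0, 50.0]

y_true = np.array([0, 1, 1])
y_pred = np.array([1, 1, 0])  # one false positive, one hit, one miss
print(total_cost(y_true, y_pred, cost))  # 51.0
```

A single cost matrix shared by all instances fits the same function after `np.broadcast_to(matrix, (N, 2, 2))`.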

It appears the loss can be rewritten as:

$\text{Cost-Sensitive Log Loss} = -\frac{1}{N} \sum_{i=1}^{N} \left[ y_i \cdot \left( C_{10} \cdot \log(\hat{y}_i) + C_{11} \cdot \log(1 - \hat{y}_i) \right) + (1 - y_i) \cdot \left( C_{01} \cdot \log(\hat{y}_i) + C_{00} \cdot \log(1 - \hat{y}_i) \right) \right]$
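
The formula above can be sketched directly in NumPy. This is only an illustration of the expression as written; the function name `cost_sensitive_log_loss`, the clipping constant `eps`, and the toy values are assumptions. Passing arrays instead of scalars for the `C` arguments gives the instance-dependent sub-case.

```python
import numpy as np

def cost_sensitive_log_loss(y, p, C10, C11, C01, C00, eps=1e-12):
    """Cost-sensitive log loss as written above.

    Each C may be a scalar (shared by all instances) or an array of length N
    (instance-dependent); NumPy broadcasting handles both.
    """
    y = np.asarray(y, dtype=float)
    p = np.clip(np.asarray(p, dtype=float), eps, 1 - eps)  # avoid log(0)
    log_p, log_q = np.log(p), np.log(1 - p)
    pos = y * (C10 * log_p + C11 * log_q)
    neg = (1 - y) * (C01 * log_p + C00 * log_q)
    return -np.mean(pos + neg)

y = np.array([0, 0, 1, 1])
p = np.array([0.2, 0.6, 0.7, 0.4])

# With C10 = C00 = 1 and C11 = C01 = 0 the formula reduces to the plain log loss.
plain = cost_sensitive_log_loss(y, p, C10=1.0, C11=0.0, C01=0.0, C00=1.0)

# Instance-dependent coefficients: the last positive counts five times as much.
C10_i = np.array([1.0, 1.0, 1.0, 5.0])
weighted = cost_sensitive_log_loss(y, p, C10=C10_i, C11=0.0, C01=0.0, C00=1.0)
```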

So it could be handled by a custom loss. Is there a generic way to handle these cases, ideally within scikit-learn or scikit-learn-compatible frameworks?

My impression is that you'd probably need to code this up in PyTorch or similar, which gives you a lot of flexibility for defining losses. The model could then be encapsulated as an sklearn estimator, with a .fit method etc.
– MuhammedYunus, Commented Dec 27, 2024 at 13:36

Raynard Bond · Accepted Answer · 2024-12-28 03:06:59Z

We can redefine the loss so that each predicted probability is “distorted” by an exponential function of the associated costs, then optimize via gradient-based methods. Let

$$ p_i = \hat{y}_i = \sigma(\mathbf{x}_i^\top \boldsymbol{\theta}) $$

be the standard logistic prediction for instance $i$, with

$$ \sigma(u) = \frac{1}{1 + e^{-u}}. $$

Let

$$ \alpha = e^{-c_{1,0}}, \quad \beta = e^{-c_{1,1}}, \quad \gamma = e^{-c_{0,1}}, \quad \delta = e^{-c_{0,0}}. $$

We define the following “cost-distorted” loss:

$$ \mathcal{L}(\boldsymbol{\theta}) = -\frac{1}{N}\sum_{i=1}^{N} \Big[y_i\big(\alpha \log(p_i) + \beta \log(1 - p_i)\big) + (1 - y_i)\big(\gamma \log(p_i) + \delta \log(1 - p_i)\big)\Big]. $$
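
To see what this loss rewards, one can minimize a single term over $p_i$ directly, ignoring for the moment that $p_i$ is tied to $\boldsymbol{\theta}$ (a simplifying assumption made here for illustration). For a positive instance ($y_i = 1$):

$$ \frac{\partial}{\partial p_i}\Big[-\alpha \log(p_i) - \beta \log(1 - p_i)\Big] = -\frac{\alpha}{p_i} + \frac{\beta}{1 - p_i} = 0 \quad\Longrightarrow\quad p_i^{*} = \frac{\alpha}{\alpha + \beta}, $$

and by the same argument $p_i^{*} = \gamma / (\gamma + \delta)$ for a negative instance ($y_i = 0$). Since exponentials are strictly positive, $\beta > 0$ and $\gamma > 0$, so the per-instance optimum lies strictly between 0 and 1 rather than at the label itself.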

Each log term is now scaled by its own cost-derived coefficient, so minimizing $\mathcal{L}$ trades off the four outcomes according to $\alpha, \beta, \gamma, \delta$ while keeping a log-based penalty. We can implement this with plain gradient descent. Applying the chain rule with $\partial p_i / \partial \boldsymbol{\theta} = p_i (1 - p_i)\, \mathbf{x}_i$, the gradient with respect to $\boldsymbol{\theta}$ is:

$$ \nabla_{\boldsymbol{\theta}} \mathcal{L} = -\frac{1}{N} \sum_{i=1}^{N} \left[ y_i\left(\alpha \frac{1}{p_i} - \beta \frac{1}{1 - p_i}\right) + (1 - y_i)\left(\gamma \frac{1}{p_i} - \delta \frac{1}{1 - p_i}\right) \right] p_i (1 - p_i) \mathbf{x}_i. $$
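
One way to gain confidence in this expression is a finite-difference check. The sketch below transcribes the loss and the gradient formula above and compares the analytic gradient with a central-difference estimate; the helper names, the random toy data, and the coefficient values are all assumptions for illustration.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def loss(theta, X, y, alpha, beta, gamma, delta):
    # The cost-distorted loss defined above.
    p = sigmoid(X @ theta)
    pos = y * (alpha * np.log(p) + beta * np.log(1 - p))
    neg = (1 - y) * (gamma * np.log(p) + delta * np.log(1 - p))
    return -np.mean(pos + neg)

def grad(theta, X, y, alpha, beta, gamma, delta):
    # Direct transcription of the gradient formula above.
    p = sigmoid(X @ theta)
    inner = (y * (alpha / p - beta / (1 - p))
             + (1 - y) * (gamma / p - delta / (1 - p)))
    return -(X.T @ (inner * p * (1 - p))) / len(y)

def numerical_grad(f, theta, h=1e-6):
    # Central finite differences, one coordinate at a time.
    g = np.zeros_like(theta)
    for j in range(len(theta)):
        step = np.zeros_like(theta)
        step[j] = h
        g[j] = (f(theta + step) - f(theta - step)) / (2 * h)
    return g

rng = np.random.default_rng(0)
X = rng.normal(size=(50, 3))
y = (rng.random(50) < 0.3).astype(float)
theta = rng.normal(size=3)
coeffs = (0.9, 0.1, 0.2, 0.8)  # arbitrary alpha, beta, gamma, delta

analytic = grad(theta, X, y, *coeffs)
numeric = numerical_grad(lambda t: loss(t, X, y, *coeffs), theta)
max_abs_diff = np.max(np.abs(analytic - numeric))
```

If the formula is transcribed correctly, `max_abs_diff` should be tiny compared with the gradient entries themselves.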

Below is a minimal scikit-learn compatible code snippet illustrating how one might implement this custom loss for logistic regression:

import numpy as np
from sklearn.base import BaseEstimator, ClassifierMixin

class CostDistortedLogisticRegression(BaseEstimator, ClassifierMixin):
    """Logistic regression trained on the cost-distorted log loss (no intercept term)."""

    def __init__(self, alpha, beta, gamma, delta, lr=0.01, max_iter=1000):
        self.alpha = alpha
        self.beta = beta
        self.gamma = gamma
        self.delta = delta
        self.lr = lr
        self.max_iter = max_iter

    def sigmoid(self, z):
        return 1.0 / (1.0 + np.exp(-z))

    def fit(self, X, y):
        # Convert inputs so the code also works for DataFrame/Series input.
        X = np.asarray(X, dtype=float)
        y = np.asarray(y, dtype=float)
        self.classes_ = np.array([0, 1])  # labels are assumed to be 0/1
        self.theta_ = np.zeros(X.shape[1])
        # Per-instance coefficients of log(p_i) and log(1 - p_i).
        a = y * self.alpha + (1 - y) * self.gamma
        b = y * self.beta + (1 - y) * self.delta
        for _ in range(self.max_iter):
            p = self.sigmoid(X @ self.theta_)
            # (a / p - b / (1 - p)) * p * (1 - p) simplifies to a * (1 - p) - b * p,
            # which avoids dividing by zero when p saturates at 0 or 1.
            grad = -(X.T @ (a * (1 - p) - b * p)) / len(y)
            self.theta_ -= self.lr * grad
        return self

    def predict_proba(self, X):
        p = self.sigmoid(np.asarray(X, dtype=float) @ self.theta_)
        return np.column_stack((1 - p, p))

    def predict(self, X):
        return (self.predict_proba(X)[:, 1] >= 0.5).astype(int)

In this construction, choosing $\alpha, \beta, \gamma, \delta$ as exponentials of the respective costs “tilts” the log loss: each log term is scaled by its own coefficient, creating a “differentiable distortion” of the usual logistic objective. Note that $e^{-c}$ decreases as $c$ grows, so with this particular mapping a larger cost gives a smaller coefficient; if a costlier outcome should carry a larger penalty, use an increasing mapping of $c_{1,0}, c_{1,1}, c_{0,1}, c_{0,0}$ instead and proceed with the same gradient-based optimization. This method, despite looking similar on the surface, differs from standard class weighting, which applies a single weight per class to the usual log loss: here both $\log(p_i)$ and $\log(1 - p_i)$ are penalized for every instance, each with its own coefficient. The approach can be extended to instance-specific or output-dependent costs, yet it remains compatible with scikit-learn pipelines and the usual cross-validation or hyperparameter-tuning procedures.
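
As a rough sketch of the instance-specific extension mentioned above, the four scalars can be replaced by arrays with one entry per instance. Everything named here is hypothetical: `fit_instance_coefficients`, the `amount` column used as a per-instance coefficient, and the synthetic data. How costs should be mapped to coefficients is left open.

```python
import numpy as np

def fit_instance_coefficients(X, y, alpha, beta, gamma, delta, lr=0.1, max_iter=5000):
    """Gradient descent on the cost-distorted loss with per-instance coefficients.

    alpha, beta, gamma, delta may each be a scalar or an array of length N.
    """
    X = np.asarray(X, dtype=float)
    y = np.asarray(y, dtype=float)
    a = y * alpha + (1 - y) * gamma  # coefficient of log(p_i)
    b = y * beta + (1 - y) * delta   # coefficient of log(1 - p_i)
    theta = np.zeros(X.shape[1])
    for _ in range(max_iter):
        p = 1.0 / (1.0 + np.exp(-(X @ theta)))
        # Same gradient as in the scalar case, written as a * (1 - p) - b * p.
        theta -= lr * (-(X.T @ (a * (1 - p) - b * p)) / len(y))
    return theta

def predict_proba(X, theta):
    return 1.0 / (1.0 + np.exp(-(X @ theta)))

# Synthetic, imbalanced toy data (assumed) with an explicit intercept column.
rng = np.random.default_rng(0)
n = 200
x = rng.normal(size=n)
X = np.column_stack([np.ones(n), x])
y = (rng.random(n) < 1 / (1 + np.exp(-(x - 2)))).astype(float)
amount = rng.uniform(1.0, 10.0, size=n)  # hypothetical per-instance stake

# alpha=1, beta=0, gamma=0, delta=1 is the ordinary log loss.
theta_plain = fit_instance_coefficients(X, y, 1.0, 0.0, 0.0, 1.0)
# Scale the log(p_i) term of each positive by its own amount.
theta_cost = fit_instance_coefficients(X, y, amount, 0.0, 0.0, 1.0)

mean_plain = predict_proba(X, theta_plain).mean()
mean_cost = predict_proba(X, theta_cost).mean()
```

In this toy set-up, putting more weight on the positives pushes the predicted probabilities up on average compared with the plain fit, which is the kind of shift one would expect from penalizing missed positives more heavily.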

I am not sure I understand how the exponential transformation impacts things...
– Lucas Morin, Commented Dec 28, 2024 at 4:47
In an exponential approach, each cost enters as $e^{-\text{cost}}$, so the model’s gradient is warped more steeply than with simple weighting. Errors with large cost get heavily suppressed or accentuated, reshaping the loss surface. This often yields different, potentially more flexible decision boundaries than linear reweighting alone.
– Raynard Bond, Commented Dec 28, 2024 at 5:23
Do you have a source for that? Or at least for why this distortion is optimal?
– Lucas Morin, Commented Dec 28, 2024 at 10:45
This doesn't appear to address the question at all, instead modifying the question?
– Ben Reiniger ♦, Commented Dec 28, 2024 at 21:14
