
I am running a linear regression model using PySpark, and came across the following weird behavior:

When I include a constant feature (representing an intercept term), it is completely ignored by Spark, i.e. the fitted coefficient is 0, despite my setting fitIntercept=False.

Is this expected behavior?

Example:

# Initialize Spark session
from pyspark.sql import SparkSession
from pyspark.ml.feature import VectorAssembler
from pyspark.ml.regression import LinearRegression

spark = SparkSession.builder.appName("test").getOrCreate()
data = [
    (1, 1, 0.01),
    (2, 1, 0.02),
    (3, 1, 0.03),
    (4, 1, 0.04),
    (5, 1, 0.03),
    (6, 1, 0.022)
]
columns = ["ID", "X_1", "Y"]
df = spark.createDataFrame(data, columns)

df = VectorAssembler(
    inputCols = ["X_1"],
    outputCol = "X"
).transform(df)


lm = LinearRegression(
    featuresCol = "X",
    labelCol = "Y",
    fitIntercept = False,
    standardization = False,
    solver = "normal"
)

res = lm.fit(df)
print("Multinomial coefficients: " + str(res.coefficients))
print("Multinomiaal intercepts: " + str(res.intercept))

# Multinomial coefficients: [0.0]
# Multinomiaal intercepts: 0.0

EDIT: Answers indicate this might be due to not using VectorAssembler. This is not the case, as shown in the updated example below.

  • I still get a 0 coefficient, even with fitIntercept=False.
  • I would expect the following behaviour:
      • fitIntercept=False: a non-zero coefficient.
      • fitIntercept=True: an error (due to rank deficiency), or a zero coefficient (due to Spark removing redundant features).

Updated example:

# Initialize Spark session
from pyspark.sql import SparkSession
from pyspark.ml.regression import LinearRegression
from pyspark.ml.feature import VectorAssembler


spark = SparkSession.builder.appName("test").getOrCreate()
data = [
    (1, 1, 0.12,  0.01),
    (2, 1, 0.123,  0.02),
    (3, 1, 0.524, 0.03),
    (4, 1, 0.1224, 0.04),
    (5, 1, 0.412, 0.03),
    (6, 1, 0.1224, 0.022)
]
columns = ["ID", "X_1", "X_2", "Y"]
df = spark.createDataFrame(data, columns)

df = VectorAssembler(
    inputCols = ["X_1", "X_2"],
    outputCol = "X"
).transform(df)

lm = LinearRegression(
    featuresCol = "X",
    labelCol = "Y",
    fitIntercept = False,
    standardization = False,
    solver = "normal"
)

res = lm.fit(df)
print("Multinomial coefficients: " + str(res.coefficients))
print("Multinomiaal intercepts: " + str(res.intercept))

# Multinomial coefficients: [0.0,0.07806237129637031]
# Multinomiaal intercepts: 0.0
  • Excluding the ID, there are not any other features in your dataframe apart from the constant one; this doesn't look like a proper regression problem. Plus, since you bother to show your print statements, standard practice would say you also include the results. (Commented Mar 25 at 20:15)
  • Since my question is about Spark, the regression problem is indeed very simple. I have updated with a new example. (Commented Mar 26 at 10:40)
  • As already requested, please update your post to include the results of your print statements. (Commented Mar 26 at 12:41)

1 Answer


Is this expected behavior?

TL;DR: Yes, this behaviour is expected. PySpark's LinearRegression requires all input features, including a constant intended as an intercept, to be assembled explicitly into a vector column using VectorAssembler. Even then, a constant feature receives a zero coefficient because it is redundant: it is collinear with the internal intercept when fitIntercept=True, and it has zero variance in any case, so the solver effectively drops it rather than silently ignoring your settings.

Details

Relation to an offset in linear regression

This situation is similar to an offset term in linear regression in that the algebra is essentially the same, but the interpretation, estimation, and conceptual meaning are fundamentally different. Algebraically, intercepts and offsets enter the model in the same way ($Y = C + X\beta + C_1 + \varepsilon$, where $X$ is the model matrix, so that $X\beta$ is the linear predictor). Conceptually, an intercept ($C$ in this case) is estimated from the data, whereas an offset ($C_1$) is fixed and known, and explicitly NOT estimated.
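For concreteness, a fixed, known offset is usually handled by subtracting it from the response before fitting, rather than by passing it to the model as a feature. A minimal sketch, assuming a known offset of 0.01 (the toy data and the column name Y_adj are made up for illustration):

from pyspark.sql import SparkSession
from pyspark.sql.functions import col, lit
from pyspark.ml.feature import VectorAssembler
from pyspark.ml.regression import LinearRegression

spark = SparkSession.builder.appName("offset_sketch").getOrCreate()
df = spark.createDataFrame(
    [(0.12, 0.02), (0.52, 0.04), (0.41, 0.03)], ["X_1", "Y"]
)

offset = 0.01  # fixed and known, explicitly NOT estimated

# Absorb the offset into the label, then fit on the adjusted label
df = df.withColumn("Y_adj", col("Y") - lit(offset))
df = VectorAssembler(inputCols=["X_1"], outputCol="X").transform(df)
model = LinearRegression(featuresCol="X", labelCol="Y_adj").fit(df)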

Mathematical Details

The linear regression model with intercept can be formally defined as:

$$ Y = \beta_0 + \beta_1 X + \varepsilon $$ where:

  • $\beta_0$ is the intercept
  • $\beta_1$ is the slope (fixed-effect) parameter
  • $X$ is the model matrix
  • $\varepsilon$ are the errors

In PySpark:

  • If fitIntercept=True, PySpark internally estimates the intercept term ($\beta_0$).
  • If fitIntercept=False, the intercept must be explicitly included as a constant feature vector to estimate $\beta_0$ directly; the equivalence is sketched below.
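To spell out the second case: appending a constant column of ones to the model matrix reproduces the intercept algebraically (this is a standard identity, not anything Spark-specific):

$$ Y = \beta_0 \mathbf{1} + X \beta_1 + \varepsilon = \begin{bmatrix} \mathbf{1} & X \end{bmatrix} \begin{bmatrix} \beta_0 \\ \beta_1 \end{bmatrix} + \varepsilon $$

so, in exact arithmetic, fitting with fitIntercept=False on the augmented matrix $[\mathbf{1} \; X]$ should recover the same $\beta_0$ as fitting with fitIntercept=True on $X$ alone, which is what makes the zero coefficient surprising at first sight.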

Collinearity or rank deficiency among the assembled features can likewise lead to zero-valued coefficients.

PySpark's ML regression models require a single vector-valued features column rather than scalar inputs, which is why every feature, including a constant one, must be assembled with VectorAssembler before fitting. Even when assembled correctly, however, a constant column intended as an explicit intercept is assigned a coefficient of zero. With fitIntercept=True this is because the constant feature is perfectly collinear with PySpark's internal intercept term, causing rank deficiency; moreover, Spark's solvers standardise features internally, and a constant feature has zero variance, so it carries no signal in the standardised problem and is effectively dropped (coefficient 0.0) even when fitIntercept=False. The feature is not being ignored arbitrarily; it is redundant.
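One way to diagnose this before fitting is to check each feature's variance, since a constant column shows up as zero variance. A minimal sketch, recreating a frame shaped like the OP's updated example:

from pyspark.sql import SparkSession
from pyspark.sql.functions import variance

spark = SparkSession.builder.appName("variance_check").getOrCreate()
df = spark.createDataFrame(
    [(1, 0.12), (1, 0.524), (1, 0.412)], ["X_1", "X_2"]
)

# Zero variance flags a constant feature, which the solver
# effectively drops (coefficient 0.0)
df.select(
    variance("X_1").alias("var_X_1"),
    variance("X_2").alias("var_X_2")
).show()
# var_X_1 is 0.0 here because X_1 is constant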

Therefore, the observed behaviour reflects PySpark's intended handling of redundant (collinear or zero-variance) features, rather than it ignoring features outright. The key to avoiding the issue lies in correctly assembling features and choosing an appropriate intercept strategy.

A Better Approach

In PySpark there are two clear approaches to handling intercepts. First, allow PySpark to handle the intercept internally by setting fitIntercept=True. No explicit constant feature is necessary. Alternatively, manually add a constant column, assemble it correctly with covariates using VectorAssembler, and explicitly disable the internal intercept estimation (fitIntercept=False). This allows PySpark to treat your intercept as a standard feature. I prefer the first option.
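A minimal sketch of the first option, reusing the covariate X_2 from the OP's updated example and letting Spark estimate the intercept itself:

from pyspark.sql import SparkSession
from pyspark.ml.feature import VectorAssembler
from pyspark.ml.regression import LinearRegression

spark = SparkSession.builder.appName("intercept_sketch").getOrCreate()
df = spark.createDataFrame(
    [(0.12, 0.01), (0.123, 0.02), (0.524, 0.03),
     (0.1224, 0.04), (0.412, 0.03), (0.1224, 0.022)],
    ["X_2", "Y"]
)

df = VectorAssembler(inputCols=["X_2"], outputCol="features").transform(df)

# No constant column needed: Spark estimates the intercept internally
lr = LinearRegression(featuresCol="features", labelCol="Y",
                      fitIntercept=True, solver="normal")
model = lr.fit(df)
print("Coefficients:", model.coefficients)
print("Intercept:", model.intercept)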

For detailed guidance, consult the Apache Spark MLlib Documentation on VectorAssembler. Additional context on feature collinearity issues can be found in standard machine learning regression modelling references, such as Hastie, Tibshirani, and Friedman's The Elements of Statistical Learning - see references below.

Example of the use of VectorAssembler

from pyspark.sql import SparkSession
from pyspark.ml.feature import VectorAssembler
from pyspark.ml.regression import LinearRegression
from pyspark.sql.functions import lit

spark = SparkSession.builder.appName("LinearRegressionExample").getOrCreate()

# Sample data from the OP: single constant feature
data = [
    (1, 1.0, 0.01),
    (2, 1.0, 0.02),
    (3, 1.0, 0.03),
    (4, 1.0, 0.04),
    (5, 1.0, 0.03),
    (6, 1.0, 0.022)
]
columns = ["ID", "X", "Y"]
df = spark.createDataFrame(data, columns)

# Create a constant 'intercept' column
df = df.withColumn("Intercept", lit(1.0))

# Assemble features into a vector (including constant)
assembler = VectorAssembler(inputCols=["X", "Intercept"], outputCol="features")
df_vectorised = assembler.transform(df)

# Fit the linear regression model with fitIntercept=False
lr = LinearRegression(
    featuresCol="features",
    labelCol="Y",
    fitIntercept=False,
    solver="normal"
)

model = lr.fit(df_vectorised)
print("Coefficients:", model.coefficients)
print("Intercept:", model.intercept)

Summing Up

The observed PySpark behaviour is expected. Correctly vectorising features and clearly deciding on how to handle intercepts prevents confusion and helps to ensure accurate coefficient estimation.

References

Apache Software Foundation. (2024). Apache Spark MLlib: Feature extraction and transformation. Retrieved March 25, 2025, from https://spark.apache.org/docs/latest/ml-features.html#vectorassembler

Hastie, T., Tibshirani, R., & Friedman, J. (2009). The elements of statistical learning: Data mining, inference, and prediction (2nd ed.). Springer. https://doi.org/10.1007/978-0-387-84858-7

McCullagh, P., & Nelder, J. A. (1989). Generalized linear models (2nd ed.). Chapman and Hall/CRC.

  • Instead of a ton of seemingly irrelevant too-generic info (including academic references to GLMs!), wouldn't it be better to show how to properly do the job using VectorAssembler, rather than linking to an external tutorial? (Commented Mar 25 at 20:18)
  • @desertnaut Thanks for your comment - you make fair points. I am somewhat wedded to McCullagh & Nelder, which has great explanations about the underlying issues here. Dobson & Barnett is probably too much though - so I have removed it, and I added an example of how to use VectorAssembler - good call. Could you expand a bit on what you think is "too generic"? Thanks! (Commented Mar 25 at 20:54)
  • Well, I meant that since the problem seems to be specifically about Spark and VectorAssembler, arguably citing textbooks and theory is too generic (and borderline irrelevant to the actual issue); plus the other detail that, in fact, there are no actual features here (the constant term is not a feature). And since you bothered to recreate the code, shouldn't you bother to share the actual results (i.e. the outputs of the print statements), too? (Commented Mar 25 at 21:30)
  • @RobertLong Thanks for the detailed answer. However, in your example, the fitted coefficient is 0, despite fitIntercept=False - so the issue isn't to do with VectorAssembler. I know that mathematically, fitting an intercept is equivalent to having a constant column (i.e. a vector of the same constant) in the design matrix. My question is more about how this is handled in Spark (I have updated my original question). (Commented Mar 26 at 11:58)
  • But good point that the features need to be assembled with VectorAssembler to work at all - I have updated my example. (Commented Mar 26 at 13:27)
