I am running a linear regression model in PySpark and came across the following weird behavior:
When I include a constant feature (representing an intercept term), Spark ignores it completely: the fitted coefficient is 0, despite my setting fitIntercept=False.
Is this expected behavior?
Example:
# Initialize Spark session
from pyspark.sql import SparkSession
from pyspark.ml.feature import VectorAssembler
from pyspark.ml.regression import LinearRegression

spark = SparkSession.builder.appName("test").getOrCreate()

data = [
    (1, 1, 0.01),
    (2, 1, 0.02),
    (3, 1, 0.03),
    (4, 1, 0.04),
    (5, 1, 0.03),
    (6, 1, 0.022),
]
columns = ["ID", "X_1", "Y"]
df = spark.createDataFrame(data, columns)

# Assemble the constant column X_1 into a feature vector
df = VectorAssembler(
    inputCols=["X_1"],
    outputCol="X"
).transform(df)

lm = LinearRegression(
    featuresCol="X",
    labelCol="Y",
    fitIntercept=False,
    standardization=False,
    solver="normal"
)
res = lm.fit(df)
print("Coefficients: " + str(res.coefficients))
print("Intercept: " + str(res.intercept))
# Coefficients: [0.0]
# Intercept: 0.0
EDIT: Answers indicate this might be due to not using VectorAssembler. This is not the case, as shown in the example below:

- I still get a 0 coefficient, even with fitIntercept=False.
- I would expect the following behaviour:
  - fitIntercept=False: a non-zero coefficient.
  - fitIntercept=True: an error (due to rank deficiency), or a zero coefficient (due to Spark removing redundant features).
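To make the expectation above concrete, here is a quick Spark-free sanity check (my own addition, plain Python rather than PySpark): for a single-feature no-intercept fit, the closed-form least-squares coefficient is sum(x*y)/sum(x*x), which for a feature constantly equal to 1 reduces to the mean of Y, so it should clearly not be 0:

```python
# No-intercept OLS with a single feature x: beta = sum(x*y) / sum(x*x).
# With x constantly 1 this reduces to mean(y) -- a non-zero value,
# not the 0.0 coefficient that Spark returns.
y = [0.01, 0.02, 0.03, 0.04, 0.03, 0.022]
x = [1.0] * len(y)

beta = sum(xi * yi for xi, yi in zip(x, y)) / sum(xi * xi for xi in x)
print(beta)  # 0.02533..., i.e. mean(y)
```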
Updated example:
# Initialize Spark session
from pyspark.sql import SparkSession
from pyspark.ml.regression import LinearRegression
from pyspark.ml.feature import VectorAssembler

spark = SparkSession.builder.appName("test").getOrCreate()

data = [
    (1, 1, 0.12, 0.01),
    (2, 1, 0.123, 0.02),
    (3, 1, 0.524, 0.03),
    (4, 1, 0.1224, 0.04),
    (5, 1, 0.412, 0.03),
    (6, 1, 0.1224, 0.022),
]
columns = ["ID", "X_1", "X_2", "Y"]
df = spark.createDataFrame(data, columns)

# Assemble the constant column X_1 and the varying column X_2 into a feature vector
df = VectorAssembler(
    inputCols=["X_1", "X_2"],
    outputCol="X"
).transform(df)

lm = LinearRegression(
    featuresCol="X",
    labelCol="Y",
    fitIntercept=False,
    standardization=False,
    solver="normal"
)
res = lm.fit(df)
print("Coefficients: " + str(res.coefficients))
print("Intercept: " + str(res.intercept))
# Coefficients: [0.0,0.07806237129637031]
# Intercept: 0.0
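For comparison, solving the no-intercept normal equations for this two-column design directly (a plain-Python sketch of ordinary least squares, independent of Spark's solver) gives non-zero coefficients for both columns, including the constant X_1:

```python
# Solve the 2x2 normal equations (X'X) b = X'y by Cramer's rule for the
# updated example's design matrix [X_1, X_2], with no separate intercept.
X = [(1.0, 0.12), (1.0, 0.123), (1.0, 0.524),
     (1.0, 0.1224), (1.0, 0.412), (1.0, 0.1224)]
y = [0.01, 0.02, 0.03, 0.04, 0.03, 0.022]

a11 = sum(x1 * x1 for x1, _ in X)
a12 = sum(x1 * x2 for x1, x2 in X)
a22 = sum(x2 * x2 for _, x2 in X)
b1 = sum(x1 * yi for (x1, _), yi in zip(X, y))
b2 = sum(x2 * yi for (_, x2), yi in zip(X, y))

det = a11 * a22 - a12 * a12
beta1 = (b1 * a22 - a12 * b2) / det  # coefficient on the constant X_1
beta2 = (a11 * b2 - a12 * b1) / det  # coefficient on X_2
print(beta1, beta2)  # ~0.0207 and ~0.0196 -- both non-zero,
                     # unlike Spark's [0.0, 0.078...]
```

Note that Spark's 0.07806... for X_2 is exactly what a single-feature no-intercept fit on X_2 alone would give, which suggests the constant column is being dropped rather than fitted.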