A toy example works fine when its schema is defined statically. The same schema built dynamically throws an error. Why, and how do I fix it? The two schemas look identical.

Statically defined:

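from pyspark.sql.types import StructType, StructField, StringType, IntegerType

# sc (SparkContext) and sess (SparkSession) are assumed to exist already
XXX = sc.parallelize([('kygiacomo', 0, 1), ('namohysip', 1, 0)])
schema = StructType([
    StructField("username", StringType(), True),
    StructField("FanFiction", IntegerType(), True),
    StructField("nfl", IntegerType(), True)])
print(schema)
df = sess.createDataFrame(XXX, schema)
df.show()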

Output which is good:

StructType(List(StructField(username,StringType,true),StructField(FanFiction,IntegerType,true),StructField(nfl,IntegerType,true)))
+---------+----------+---+
| username|FanFiction|nfl|
+---------+----------+---+
|kygiacomo|         0|  1|
|namohysip|         1|  0|
+---------+----------+---+

Dynamically-defined:

print(XXX.collect())
username_field = [StructField('username', StringType(), True)]
int_fields = [StructField(str(i), IntegerType(), True) for i in itemids.keys()]
schema = StructType(username_field + int_fields)
print(schema)
df = sess.createDataFrame(XXX, schema)
df.show()

Output which throws an error on df.show:

[('kygiacomo', 0, 1, 0, 0, 0, 0), ('namohysip', 1, 0, 0, 0, 0, 0), ('immortalis', 0, 1, 0, 0, 0, 0), ('403and780', 0, 0, 0, 0, 0, 1), ('SDsc0rch', 0, 0, 0, 1, 0, 0), ('shitpostlord4321', 0, 0, 0, 0, 1, 0), ('scarletcrawford', 0, 0, 1, 0, 0, 0)]
StructType(List(StructField(username,StringType,true),StructField(FanFiction,IntegerType,true),StructField(nfl,IntegerType,true),StructField(alteredcarbon,IntegerType,true),StructField(The_Donald,IntegerType,true),StructField(marvelstudios,IntegerType,true),StructField(hockey,IntegerType,true)))

---------------------------------------------------------------------------
Py4JJavaError                             Traceback (most recent call last)
...
TypeError: field FanFiction: IntegerType can not accept object 0 in type <class 'numpy.int64'>

I cannot see what the code is doing differently. Can you? Thanks.

1 Answer

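The error message points at the cause: the values in your dynamically built rows are numpy.int64, and Spark's IntegerType accepts only the built-in Python int, which is what the literals in your toy example are. The answer to your previous question already shows one possible solution: convert the data to standard Python types using tolist.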
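As a rough sketch of the tolist approach, assuming the counts live in a 2-D NumPy array with one row per user (the question does not show how the rows were built, so the helper rows_from_counts and the names usernames and counts are hypothetical):

```python
def rows_from_counts(usernames, counts):
    """Pair each username with its row of counts as built-in Python values.

    Assumption: `counts` is a 2-D NumPy array (one row per user). This helper
    and its parameter names are illustrative, not taken from the question.
    """
    # ndarray.tolist() returns nested lists whose elements are built-in
    # Python scalars (int, float) rather than numpy.int64 and friends.
    return [(name, *row) for name, row in zip(usernames, counts.tolist())]


# Possible usage with the objects from the question (sc, sess, schema):
# rows = rows_from_counts(usernames, counts)
# df = sess.createDataFrame(sc.parallelize(rows), schema)
```

Converting once, before the data reaches Spark, keeps the schema itself unchanged.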

Alternatively, convert each entry directly by calling the corresponding built-in function (int or float) on each value in the row.
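A minimal sketch of the per-value conversion, assuming every field after the username is an integer count as in the question's rows (the helper name to_builtin_row is made up for illustration):

```python
def to_builtin_row(row):
    """Convert the count fields of one row to built-in int.

    Assumption: the first field is the username and all remaining fields
    are integer-like (for example numpy.int64), as in the question's data.
    """
    username, *counts = row
    # int() applied to a numpy.int64 returns a plain Python int,
    # which IntegerType accepts.
    return (username, *(int(c) for c in counts))


# Possible usage with the RDD from the question:
# df = sess.createDataFrame(XXX.map(to_builtin_row), schema)
```

For float columns the same pattern would apply with float() in place of int().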
