Is it possible in PySpark to use the parallelize function over Python objects? I want to process a list of objects in parallel, modify each one with a function, and then print the results.

    from pyspark.sql import SparkSession

    def init_spark(appname):
        spark = SparkSession.builder.appName(appname).getOrCreate()
        sc = spark.sparkContext
        return spark, sc

    def run_on_configs_spark(object_list):
        spark, sc = init_spark(appname="analysis")
        p_configs_RDD = sc.parallelize(object_list)
        p_configs_RDD = p_configs_RDD.map(func)
        p_configs_RDD.foreach(print)

    def func(obj):
        # do_something is a placeholder for the actual per-object processing
        return do_something(obj)

When I run the above code, I get the error "AttributeError: Can't get attribute 'Object' on <module 'pyspark.daemon' from...>". How can I solve it?
I came up with the following workaround, but I don't think it is a good solution in general, and it assumes I can change the constructor of the object.
I convert each object into a dictionary, and then reconstruct the object from that dictionary inside the mapped function.

    def init_spark(appname):
        spark = SparkSession.builder.appName(appname).getOrCreate()
        sc = spark.sparkContext
        return spark, sc

    def run_on_configs_spark(object_list):
        spark, sc = init_spark(appname="analysis")
        # __dict__ is an attribute, not a method; copy it into a plain dict
        p_configs_RDD = sc.parallelize([dict(x.__dict__) for x in object_list])
        p_configs_RDD = p_configs_RDD.map(func)
        p_configs_RDD.foreach(print)

    def func(dictionary):
        # Rebuild the object from its attribute dictionary
        obj = Object(create_from_dict=True, dictionary=dictionary)
        # do_something is a placeholder for the actual per-object processing
        return do_something(obj)

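In the constructor of Object:

    class Object:
        # The real signature also takes other_params... (omitted here)
        def __init__(self, create_from_dict=False, dictionary=None):
            if create_from_dict:
                # Restore all attributes directly and skip normal initialization
                self.__dict__.update(dictionary)
                return
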
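
To show what the workaround does without Spark, here is a minimal sketch of the dictionary round trip. The `Config` class and the helpers `to_dict` and `from_dict` are hypothetical names used only for illustration, not my real code, and `pickle` stands in for whatever serialization happens between the driver and the workers. This variant rebuilds the object with `__new__`, so it would not need the extra constructor arguments, assuming all attribute values are plain picklable types.

```python
import pickle


class Config:
    """Hypothetical stand-in for the real class."""

    def __init__(self, name, threshold):
        self.name = name
        self.threshold = threshold


def to_dict(obj):
    # Copy the instance attributes into a plain dict. A dict of plain
    # values can be pickled without referring to the Config class.
    return dict(vars(obj))


def from_dict(cls, attributes):
    # Allocate an instance without calling __init__, then restore its state.
    obj = cls.__new__(cls)
    obj.__dict__.update(attributes)
    return obj


configs = [Config("a", 1), Config("b", 2)]

# Simulate shipping the data elsewhere: serialize and deserialize the dicts.
payload = pickle.dumps([to_dict(c) for c in configs])
restored = [from_dict(Config, d) for d in pickle.loads(payload)]

for c in restored:
    print(c.name, c.threshold)
```

In the Spark version, `to_dict` would be applied before `sc.parallelize` and `from_dict` inside the mapped function. The class itself would still have to be importable wherever `from_dict` runs.
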
Are there any better solutions?
