I have some text files containing JSON objects (one object per line). Example:
{"a": 1, "b": 2, "table": "foo"}
{"c": 3, "d": 4, "table": "bar"}
{"a": 5, "b": 6, "table": "foo"}
...
I want to parse the contents of these files into Spark DataFrames, one DataFrame per table name. So in the example above, I would have a DataFrame for "foo" and another for "bar". I have gotten as far as grouping the JSON lines by table inside an RDD with the following (pyspark) code:
import json, os

text_rdd = sc.textFile(os.path.join("/path/to/data", "*"))
tables_rdd = text_rdd.groupBy(lambda x: json.loads(x)['table'])
This produces a pair RDD of (table, lines) tuples with roughly the following structure:
RDD[("foo", ['{"a": 1, "b": 2, "table": "foo"}', ...],
("bar", ['{"c": 3, "d": 4, "table": "bar"}', ...]]
How do I break this RDD into a DataFrame for each table key?
Edit: I tried to clarify above that a single file contains multiple lines of information for a given table. I know that I can call .collectAsMap() on the "groupBy" RDD that I have created, but that would consume a sizeable amount of RAM on my driver. My question is: is there a way to break the "groupBy" RDD into multiple DataFrames without using .collectAsMap()?
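For reference, the collect-to-driver route I want to avoid looks roughly like this (a sketch only; it assumes a SparkSession named spark alongside the sc from above):

# Pull every group onto the driver, then re-parallelize each one
# as its own DataFrame. This is the memory-hungry approach.
tables = tables_rdd.mapValues(list).collectAsMap()  # {table: [json lines]}
dfs = {name: spark.read.json(sc.parallelize(lines))
       for name, lines in tables.items()}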