My raw data looks like this: a string followed by numbers.
"cat",21,6,160,110,3.9,2.62,16.46,0,1,4,4
"dog",21,6,160,110,3.9,2.875,17.02,0,1,4,4
...
When I create my RDD and DataFrame, I want to keep the string and cast the rest to floats. The expected DataFrame should therefore have 12 columns: the first column is the string, and the remaining 11 are floats.
My code is below:
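def parse_line(line):
    # Split one CSV line into the leading name and the numeric fields.
    s = line.split(',')
    name = s[0]
    features = s[1:]
    features = [float(x) for x in features]
    # Returns a 2-tuple: (name, list of floats)
    return name, features

f = sc.textFile("animals.data")
rdd = f.map(parse_line)
df = sqlContext.createDataFrame(rdd)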
The output has only two columns:
+--------------------+--------------------+
|                  _1|                  _2|
+--------------------+--------------------+
|               "cat"|[21.0, 6.0, 160.0...|
|               "dog"|[21.0, 6.0, 160.0...|
|               "rat"|[22.8, 4.0, 108.0...|
|            "monkey"|[21.4, 6.0, 258.0...|
...
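The two-column result is consistent with parse_line returning a 2-tuple whose second element is a list, so the whole list ends up in one column. A minimal sketch of one possible direction, assuming each tuple element is inferred as its own column, is to return a single flat tuple instead. The function name parse_line_flat is hypothetical and only used for illustration; the quotes around the name are left untouched.

```python
def parse_line_flat(line):
    """Return one flat tuple: (name, float, float, ...)."""
    s = line.split(',')
    name = s[0]
    # Cast every remaining field to float.
    features = [float(x) for x in s[1:]]
    # Concatenate so each value becomes its own tuple element,
    # rather than nesting the list inside a 2-tuple.
    return tuple([name] + features)


row = parse_line_flat('"cat",21,6,160,110,3.9,2.62,16.46,0,1,4,4')
print(len(row))
```

With this variant, f.map(parse_line_flat) followed by sqlContext.createDataFrame(rdd) would be expected to yield one column per tuple element (_1 through _12), though this has not been verified against the data file in the question.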