
My raw data looks like this: String followed by numbers.

"cat",21,6,160,110,3.9,2.62,16.46,0,1,4,4
"dog",21,6,160,110,3.9,2.875,17.02,0,1,4,4
...

When I create my RDD and DF I want to keep the string and cast the rest to floats. So expected output for my DF will have 12 columns. The first column will be the string, the rest will be floats.

My code is below:

def parse_line(line):
    s = line.split(',')
    name = s[0]
    features = [float(x) for x in s[1:]]
    return name, features

f = sc.textFile("animals.data")
rdd = f.map(parse_line)

df = sqlContext.createDataFrame(rdd)

The output only has two columns:

+--------------------+--------------------+
|                  _1|                  _2|
+--------------------+--------------------+
|               "cat"| [21.0, 6.0, 160.0...|
|               "dog"| [21.0, 6.0, 160.0...|
|               "rat"| [22.8, 4.0, 108.0...|
|            "monkey"| [21.4, 6.0, 258.0...|
...
  • Is there no method to create a DataFrame from a CSV? Commented Apr 29, 2021 at 18:18
  • My data is not really in a csv file. Commented Apr 29, 2021 at 18:24
  • Can you add a snapshot of what the animals.data file looks like? Commented Apr 29, 2021 at 18:26

2 Answers

Option 1: The function parse_line returns a tuple with two elements: the name and the list of features. Spark therefore infers only two columns. To fix that, parse_line should return a flat tuple with 12 elements, each one a string or a float:

def parse_line(line):
    [...]
    return (name,) + tuple(features)
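For completeness, the corrected function might look like this in full (pure Python, so it can be checked without a Spark session):

```python
def parse_line(line):
    # Split the comma-separated line into fields
    s = line.split(',')
    name = s[0]
    # Cast every remaining field to float
    features = [float(x) for x in s[1:]]
    # Return a flat 12-element tuple instead of (name, [features])
    return (name,) + tuple(features)
```

Note that the quotes around the name are kept as-is, since split() does not strip them; you would need an extra name.strip('"') if you want them removed.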

Option 2: You can use Spark to read the data as CSV directly, without using Pandas. It helps to define the schema before reading, which ensures that all numeric columns are treated as floats:

from pyspark.sql import types as T

schema = T.StructType([
  T.StructField("col1", T.StringType(), True),
  T.StructField("col2", T.FloatType(), True),
  T.StructField("col3", T.FloatType(), True),
  T.StructField("col4", T.FloatType(), True),
  T.StructField("col5", T.FloatType(), True),
  T.StructField("col6", T.FloatType(), True),
  T.StructField("col7", T.FloatType(), True),
  T.StructField("col8", T.FloatType(), True),
  T.StructField("col9", T.FloatType(), True),
  T.StructField("col10", T.FloatType(), True),
  T.StructField("col11", T.FloatType(), True),
  T.StructField("col12", T.FloatType(), True)])

df = spark.read.schema(schema).csv("animals.data")

With either option, the result is a Spark dataframe with one string column and 11 float columns.


3 Comments

So I read around and also found out about putting schema in DF. I specified first col as StringType and the rest as FloatType but I get a weird error saying: TypeError: field f1: FloatType can not accept object '21' in type <class 'str'> Even though 21 is clearly float.
Okay, it worked when I returned tuple(features). But can you please explain why?
@amnesic the number of cols in the df is equal to the number of elements in the tuple. ('cat', [21.0, 6.0, ...]) is a tuple with two elements -> 2 cols in the df, ('cat', 21.0, 6.0, ...) is a tuple with 12 elements -> 12 cols in the df
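The point in the comment above can be illustrated in plain Python: createDataFrame maps each tuple element to one column, so the tuple's length determines the column count (the values here are just the sample rows from the question):

```python
# A 2-element tuple -> Spark would infer 2 columns
nested = ('cat', [21.0, 6.0, 160.0, 110.0])

# A flat 12-element tuple -> Spark would infer 12 columns
flat = ('cat', 21.0, 6.0, 160.0, 110.0, 3.9, 2.62, 16.46, 0.0, 1.0, 4.0, 4.0)
```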

If the data in your animals.data file is homogeneous -- each row starting with the string you want, followed by the numeric features, all comma-separated -- you can likely skip the entire parse_line function and let pandas do the parsing for you. Try one of the following:

df = pd.read_csv('animals.data')

OR

df = pd.read_fwf('animals.data')

If neither works, please post part of the animals.data file so we can help.
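One caveat worth noting: the sample rows shown in the question have no header line, so read_csv would otherwise swallow the first row as column names. A sketch, assuming pandas is installed and using an inline copy of the sample data in place of the file:

```python
import io
import pandas as pd

# Inline sample mimicking animals.data (no header row)
raw = ('"cat",21,6,160,110,3.9,2.62,16.46,0,1,4,4\n'
       '"dog",21,6,160,110,3.9,2.875,17.02,0,1,4,4\n')

# header=None keeps the first data row as data;
# column 0 is the name, the remaining 11 are parsed as numbers
df = pd.read_csv(io.StringIO(raw), header=None)
```

Unlike str.split, pandas treats the double quotes as CSV quoting, so the name column comes back as cat rather than "cat".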

