0

Basically I have census data of US that I have read in Pandas from a csv file. Now I have to write a function that finds counties in a specific manner (not gonna explain that because that's not what the question is about) from the table I have gotten from csv file and return those counties.

MY TRY:

What I did is that I created lists with the names of the columns (that the function has to return), then applied the specific condition in the for loop using if-statement to read the entries of all required columns in their respective list. Now I created a new DataFrame and I want to read the entries from lists into this new DataFrame. I tried the same for loop to accomplish it, but all in vain, tried to make Series out of those lists and tried passing them as a parameter in the DataFrame, still all in vain, made DataFrames out of those lists and tried using append() function to concatenate them, but still all in vain. Any help would be appreciated.

CODE:

#idxl = list()
#st = list()
#cty = list()
idx2 = 0
cty_reg = pd.DataFrame(columns = ('STNAME', 'CTYNAME'))
for idx in range(census_df['CTYNAME'].count()):
    if((census_df.iloc[idx]['REGION'] == 1 or census_df.iloc[idx]['REGION'] == 2) and (census_df.iloc[idx]['POPESTIMATE2015'] > census_df.iloc[idx]['POPESTIMATE2014']) and census_df.loc[idx]['CTYNAME'].startswith('Washington')):
    #idxl.append(census_df.index[idx])
    #st.append(census_df.iloc[idx]['STNAME'])
    #cty.append(census_df.iloc[idx]['CTYNAME'])
    cty_reg.index[idx2] = census_df.index[idx]
    cty_reg.iloc[idxl2]['STNAME'] = census_df.iloc[idx]['STNAME']
    cty_reg.iloc[idxl2]['CTYNAME'] = census_df.iloc[idx]['CTYNAME']
    idx2 = idx2 + 1
cty_reg

CENSUS TABLE PIC:

enter image description here

SAMPLE TABLE:

   REGION  STNAME        CTYNAME
0       2  "Wisconsin"   "Washington County"
1       2  "Alabama"     "Washington County"
2       1  "Texas"       "Atauga County"
3       0  "California"  "Washington County"

SAMPLE OUTPUT:

  STNAME      CTYNAME
0 Wisconsin  Washington County
1 Alabama    Washington County

I am sorry for the less-knowledge about the US-states and counties, I just randomly put the state names and counties in the sample table, just to show you what do I want to get out of that. Thanks for the help in advanced.

6
  • 4
    Can you give us a small sample dataframe for both your input and (desired) output? That will make it easier for people to answer your question. Commented Oct 23, 2018 at 0:42
  • Done, added the sample table and sample output Commented Oct 23, 2018 at 0:57
  • 2
    Okay, great. I've improved the formatting so people can paste it into their code. Now, can you be clear about the logic you're using for picking these rows? Don't you need the population columns in your input? Commented Oct 23, 2018 at 1:16
  • That's right, I only need STNAME and CTYNAME columns in my output and the logic you explained in your answer is correct, that's what I was applying to get the desired DataFrame. Thanks for your answer as it has helped me a lot. I basically knew that Pandas makes these conditionals very easy, but at the time of writing code my mind was not focused on the easiness of Pandas. Thanks again. Can you read my comment on your answer and reply me there? Commented Oct 23, 2018 at 9:21
  • I need the population columns in my input, but in sample I just did not write because it would be a long table then. By the way I have already solved the question, thanks. Commented Oct 23, 2018 at 10:45

3 Answers 3

1

There are some missing columns in the source DF posted in the OP. However, reading the loop I don't think the loop is required at all. There are 3 filters required - for REGION, POPESTIMATE2015 and CTYNAME. If I have understood the logic in the OP, then this should be feasible without the loop

Option 1 - original answer

print df.loc[
            (df.REGION.isin([1,2])) & \
            (df.POPESTIMATE2015 > df.POPESTIMATE2014) & \
            (df.CTYNAME.str.startswith('Washington')), \
                          ['REGION', 'STNAME', 'CTYNAME']]

Option 2 - using and with pd.eval

q = pd.eval("(df.REGION.isin([1,2])) and \
            (df.POPESTIMATE2015 > df.POPESTIMATE2014) and \
            (df.CTYNAME.str.startswith('Washington'))", \
            engine='python')
print df.loc[q, ['REGION', 'STNAME', 'CTYNAME']]

Option 3 - using and with df.query

regions_list = [1,2]
dfq = df.query("(REGION==@regions_list) and \
              (POPESTIMATE2015 > POPESTIMATE2014) and \
              (CTYNAME.str.startswith('Washington'))", \
              engine='python')
print dfq[['REGION', 'STNAME', 'CTYNAME']]
Sign up to request clarification or add additional context in comments.

3 Comments

thanks a lot this really helped me to chose the only required Columns from the final Table. Can you tell me why are you using the 'fore-slash' ``\` after the 'AND' & symbol? Also can I use the words and and or instead of & and | symbols?
The \ was to extend the command across multiple lines. You should read about Boolean indexing - 1, 2, 3.
@KhubaibKhawar, I missed your question about using and and or. The 3 links I posted for Boolean indexing in Pandas are certainly valid, but if you strictly want to use and and or then you have to use .eval or .query. I have edited my answer to show these possible uses in the OP. Related 1 and 2.
1

If I'm reading the logic in your code right, you want to select rows according to the following conditions:

  • REGION should be 1 or 2
  • POPESTIMATE2015 > POPESTIMATE2014
  • CTYNAME needs to start with "Washington"

In general, Pandas makes it easy to select rows based on conditions without having to iterate over the dataframe:

df = census_df[
        ((df.REGION == 1) | (df.REGION == 2)) & \
        (df.POPESTIMATE2015 > POPESTIMATE2014) & \
        (df.CTYNAME.str.startswith('Washington'))
    ]

1 Comment

Thanks for the answer, it helped me to understand that Pandas makes it easier. By the way can you tell me why are you using the 'fore-slash' ``\` after the 'AND' & symbol? Also can I use the words and and or instead of & and | symbols?
1

Assuming you're selecting some kind of rows that satisfy a criteria, let's just say that select(row) and this function returns True if selected or False if not. I'll not infer what it is because you specifically said it was not important

And then you wanted the STNAME and CTYNAME of that row.

So here's what you would do:

your_new_df = census_df[census_df.apply(select, axis=1)]\
.apply(lambda x: x[['STNAME', 'CTYNAME']], axis=1)

This is the one liner that will get you what you wanted provided you wrote the select function that will pick the rows.

Comments

Your Answer

By clicking “Post Your Answer”, you agree to our terms of service and acknowledge you have read our privacy policy.

Start asking to get answers

Find the answer to your question by asking.

Ask question

Explore related questions

See similar questions with these tags.