Get a new DataFrame by multiple conditions with pd.DataFrame.isin()

Question

I am trying to write a function that takes a df and a dictionary that maps columns to values. The function should slice rows (by index) so that it returns only the rows whose values match the values listed under each key of ‘criteria’. For example:

df_isr13 = filterby_criteria(df, {"Area":["USA"], "Year":[2013]})

Only rows with "Year"=2013 and "Area"="USA" are included in the output.

I tried:

def filterby_criteria(df, criteria):
    for key, values in criteria.items():
        return df[df[key].isin(values)]

but I get only the first criterion. How can I get a new dataframe that satisfies all the criteria using pd.DataFrame.isin()?

Something like criteria = {"Area":["USA"], "Year":[2013]}; df[np.logical_and.reduce([df[k].isin(v) for k, v in criteria.items()])]?
– cs95, Commented Jun 12, 2019 at 19:27
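
The comment's suggestion can be sketched as a small function. This is a minimal sketch rather than the commenter's exact code: the name filterby_mask and the sample frame are made up for illustration, and the per-column masks are collected in a list because np.logical_and.reduce expects an array-like, not a generator.

```python
import numpy as np
import pandas as pd

def filterby_mask(df, criteria):
    # Hypothetical helper: one boolean mask per column in `criteria`.
    masks = [df[col].isin(values) for col, values in criteria.items()]
    if not masks:
        # No criteria given: nothing to filter out.
        return df.copy()
    # AND the masks together element-wise; a row is kept only if it
    # satisfies every criterion.
    keep = np.logical_and.reduce(masks)
    return df[keep]

# Illustrative data, not from the question.
sample = pd.DataFrame({
    "Area": ["USA", "USA", "UK", "USA"],
    "Year": [2013, 2014, 2013, 2013],
    "Value": [1.0, 2.0, 3.0, 4.0],
})

result = filterby_mask(sample, {"Area": ["USA"], "Year": [2013]})
print(result)
```

The frame is indexed only once, after all masks have been combined, so no intermediate dataframes are built.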

Alon · Accepted Answer · 2019-06-12 19:30:49Z


You can use a for loop and apply each criterion with the pandas merge function:

def filterby_criteria(df, criteria):
    for key, values in criteria.items():
        df = pd.merge(df[df[key].isin(values)], df, how='inner')
    return df


2 Comments

Parfait Over a year ago

One should never grow objects in a loop, which includes calling merge, concat, or append inside loops. This leads to excessive copying in memory.

cs95 Over a year ago

@Parfait is absolutely correct. This is a terrible answer.
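
To make the objection concrete, here is a small sketch; the frame and the names loop_merge and loop_mask are invented for illustration. Besides the repeated copying the comments mention, an inner merge of a frame with itself on all shared columns multiplies duplicated rows, so the loop-merge version can return more rows than the input contains. Accumulating a boolean mask and indexing once avoids both problems.

```python
import pandas as pd

def loop_merge(df, criteria):
    # The loop-and-merge approach from the answer above.
    for key, values in criteria.items():
        df = pd.merge(df[df[key].isin(values)], df, how="inner")
    return df

def loop_mask(df, criteria):
    # Hypothetical alternative: start with "keep everything" and
    # narrow the mask one criterion at a time.
    keep = pd.Series(True, index=df.index)
    for key, values in criteria.items():
        keep &= df[key].isin(values)
    # Index the frame a single time at the end.
    return df[keep]

# Illustrative frame with two identical rows.
dup = pd.DataFrame({"Area": ["USA", "USA", "UK"],
                    "Year": [2013, 2013, 2013]})
criteria = {"Area": ["USA"], "Year": [2013]}

merged = loop_merge(dup, criteria)
masked = loop_mask(dup, criteria)
print(len(dup), len(merged), len(masked))
```

Only two rows of the input match both criteria, which is what the mask version returns; the merge version returns more because each duplicated row is matched against every copy of itself.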

Parfait · Accepted Answer · 2019-06-12 21:19:47Z

Consider a simple merge of two data frames, since by default merge joins on all columns whose names match in both frames:

from itertools import product
import pandas as pd

def filterby_criteria(df, criteria):
    # EXTRACT DICT ITEMS
    k,v = criteria.keys(), criteria.values()
    # BUILD DF OF ALL POSSIBLE MATCHES
    # (columns= replaces set_axis(..., inplace=False); inplace was removed in pandas 2.0)
    all_matches = pd.DataFrame(list(product(*v)), columns=list(k))
    # RETURN MERGED DF
    return df.merge(all_matches)
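
For intuition, the intermediate frame of combinations can be inspected on its own. The sketch below is illustrative: the criteria dictionary is an example value, and the frame is built with the columns= argument of the DataFrame constructor.

```python
from itertools import product
import pandas as pd

# Example criteria: two allowed values per column.
criteria = {"Tool": ["python", "r"], "Year": [2013, 2015]}

# Cartesian product of the allowed values, one column per criterion.
all_matches = pd.DataFrame(list(product(*criteria.values())),
                           columns=list(criteria.keys()))

# Four rows, one per (Tool, Year) combination:
# (python, 2013), (python, 2015), (r, 2013), (r, 2015)
print(all_matches)
```

Because the merge joins on the shared column names, it keeps exactly the rows of df whose (Tool, Year) pair appears in this small frame.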

To demonstrate with random, seeded data:

Data

import numpy as np
import pandas as pd

np.random.seed(61219)

tools = ['sas', 'stata', 'spss', 'python', 'r', 'julia']
years = list(range(2013, 2019))
random_df = pd.DataFrame({'Tool': np.random.choice(tools, 500),
                          'Int': np.random.randint(1, 10, 500),
                          'Num': np.random.uniform(1, 100, 500),
                          'Year': np.random.choice(years, 500)
                          })

print(random_df.head(10))
#      Tool  Int        Num  Year
# 0    spss    4  96.465327  2016
# 1     sas    7  23.455771  2016
# 2       r    5  87.349825  2014
# 3   julia    4  18.214028  2017
# 4   julia    7  17.977237  2016
# 5   stata    3  41.196579  2013
# 6   stata    8  84.943676  2014
# 7  python    4  60.576030  2017
# 8    spss    4  47.024075  2018
# 9   stata    3  87.271072  2017

Function call

criteria = {"Tool":["python", "r"], "Year":[2013, 2015]}

def filterby_criteria(df, criteria):
    k,v = criteria.keys(), criteria.values()
    all_matches = pd.DataFrame(list(product(*v)), columns=list(k))
    return df.merge(all_matches)

final_df = filterby_criteria(random_df, criteria)

Output

print(final_df)
#       Tool  Int        Num  Year
# 0   python    8  96.611384  2015
# 1   python    7  66.782828  2015
# 2   python    9  73.638629  2015
# 3   python    4  70.763264  2015
# 4   python    2  28.311917  2015
# 5   python    3  69.888967  2015
# 6   python    8  97.609694  2015
# 7   python    3  59.198276  2015
# 8   python    3  64.497017  2015
# 9   python    8  87.672138  2015
# 10  python    9  33.605467  2015
# 11  python    8  25.225665  2015
# 12       r    3  72.202364  2013
# 13       r    1  62.192478  2013
# 14       r    7  39.264766  2013
# 15       r    3  14.599786  2013
# 16       r    4  22.963723  2013
# 17       r    1  97.647922  2013
# 18       r    5  60.457344  2013
# 19       r    5  15.711207  2013
# 20       r    7  80.273330  2013
# 21       r    7  74.190107  2013
# 22       r    7  37.923396  2013
# 23       r    2  91.970678  2013
# 24       r    4  31.489810  2013
# 25       r    1  37.580665  2013
# 26       r    2   9.686955  2013
# 27       r    6  56.238919  2013
# 28       r    6  72.820625  2015
# 29       r    3  61.255351  2015
# 30       r    4  45.690621  2015
# 31       r    5  71.143601  2015
# 32       r    6  54.744846  2015
# 33       r    1  68.171978  2015
# 34       r    5   8.521637  2015
# 35       r    7  87.027681  2015
# 36       r    3  93.614377  2015
# 37       r    7  37.918881  2015
# 38       r    3   7.715963  2015
# 39  python    1  42.681928  2013
# 40  python    6  57.354726  2013
# 41  python    1  48.189897  2013
# 42  python    4  12.201131  2013
# 43  python    9   1.078999  2013
# 44  python    9  75.615457  2013
# 45  python    8  12.631277  2013
# 46  python    9  82.227578  2013
# 47  python    7  97.802213  2013
# 48  python    1  57.103964  2013
# 49  python    1   1.941839  2013
# 50  python    3  81.981437  2013
# 51  python    1  56.869551  2013

