122

I have a pandas dataframe, df.

I want to select all indices in df that are not in a list, blacklist.

Now, I use list comprehension to create the desired labels to slice.

ix=[i for i in df.index if i not in blacklist]  
df_select=df.loc[ix]

Works fine, but may be clumsy if I need to do this often.

Is there a better way to do this?

1

9 Answers 9

190

Use isin on the index and invert the boolean index to perform label selection:

In [239]:

df = pd.DataFrame({'a':np.random.randn(5)})
df
Out[239]:
          a
0 -0.548275
1 -0.411741
2 -1.187369
3  1.028967
4 -2.755030
In [240]:

t = [2,4]
df.loc[~df.index.isin(t)]
Out[240]:
          a
0 -0.548275
1 -0.411741
3  1.028967
Sign up to request clarification or add additional context in comments.

Comments

26

You could use set() to create the difference between your original indices and those that you want to remove:

df.loc[set(df.index) - set(blacklist)]

It has the advantage of being parsimonious, as well as being easier to read than a list comprehension.

Comments

6
df = pd.DataFrame(data=[5,6,7,8], index=[1,2,3,4], columns=['D',])
blacklist = [2,3]

df.drop(blacklist,0)

3 Comments

Code-only answers are not particularly helpful. Please include a brief description of how this code solves the problem.
Please don't post only code as answer, but also provide an explanation what your code does and how it solves the problem of the question. Answers with an explanation are usually more helpful and of better quality, and are more likely to attract upvotes.
This, in my opinion, is a lot prettier/more elegant than doing df.loc[~df.index.isin(blacklist)]. However, it is less interpretable, since usually people only use drop to remove columns (so axis=1).
4

Thanks to ASGM; I found that I needed to turn the set into a list to make it work with a MultiIndex:

mi1 = pd.MultiIndex.from_tuples([("a", 1), ("a", 2), ("b", 1), ("b", 2)])
df1 = pd.DataFrame(data={"aaa":[1,2,3,4]}, index=mi1)
setValid = set(df1.index) - set([("a", 2)])
df1.loc[list(setValid)] # works
df1.loc[setValid] # fails

(sorry can't comment, insufficient rep)

Comments

4

If you are looking for a way to select all rows that are outside a condition you can use np.invert() given that the condition returns an array of booleans.

df.loc[np.invert(({condition 1}) & (condition 2))]

Comments

4

You could use difference() to obtain the difference between your original indices and those that you want to exclude:

df.loc[df.index.difference(blacklist), :]

It has the advantage of being easier to read.

Comments

2

Inside query you can access your variable blacklist using @:

df.query('index != @blacklist')

# Or alternatively:
df.query('index not in @blacklist')

Comments

1
import pandas as pd
df = pd.DataFrame(data=[5,6,7,8], index=[1,2,3,4], columns=['D',])
blacklist = [2,3]
#your current way ...
ix=[i for i in df.index if i not in blacklist]  
df_select=df.loc[ix]

# use a mask
mask = [True if x else False for x in df.index if x not in blacklist]
df.loc[mask]

http://pandas.pydata.org/pandas-docs/dev/indexing.html#indexing-label actually, loc and iloc both take a boolean array, in this case the mask. from now on you can reuse this mask and should be more efficient.

Comments

0

You can use the np.setdiff1d function which finds the set difference of two arrays.

index = np.array(blacklist)
not_index = np.setdiff1d(df.index.to_numpy(), index)
df.iloc[not_index]

Comments

Your Answer

By clicking “Post Your Answer”, you agree to our terms of service and acknowledge you have read our privacy policy.

Start asking to get answers

Find the answer to your question by asking.

Ask question

Explore related questions

See similar questions with these tags.