3

I have a df such that

       c_name  f_name 
0      abc     abc12  
1      xyz     abc1  
2      mnq     mnq2

The goal is to find a substring across the two columns an know which column it belongs to. Preference should be to c_name, as in if the substring is in both the columns then c_name gets precedence For eg: if I search for abc in the above dataframe I should somehow get row 0 abc for c_name and row 1 abc1 for f_name.

To solve this I started with df[df['c_name'].str.contains('abc', case=False)] which will give me the results for c_name. The question now is to how to exclude the rows where I already have the results from performing the same operation on f_name. Any help is greatly appreciated!

1

4 Answers 4

2
import pandas as pd
row  =[['abcx','abcy'],
       ['efg','abcz'],
       ['higj','UK']]
df= pd.DataFrame(row)
df.columns = ['c_name', 'f_name']

print df[df['c_name'].str.contains('abc', case=False)]

delta_df =df[~df['c_name'].str.contains('abc', case=False)]
print delta_df[delta_df['f_name'].str.contains('abc', case=False)]

output

  c_name f_name
0   abcx   abcy
  c_name f_name
1    efg   abcz
Sign up to request clarification or add additional context in comments.

Comments

2
  • stack into a series
  • str.contains to get truth value of sub string
  • unstack to get back dataframe
  • subset results to ensure at least one match
  • idxmax(1) gets first True in columns

def find_subtext(df, txt):
    contains = df.stack().str.contains(txt).unstack()
    return contains[contains.any(1)].idxmax(1)

find_subtext(df, 'abc')

0    c_name
1    f_name
dtype: object

df.assign(abc=find_subtext(df, 'abc'))

  c_name f_name     abc
0    abc  abc12  c_name
1    xyz   abc1  f_name
2    mnq   mnq2     NaN

1 Comment

I can always rely on you to come through with a response :) Thank you very much. Its an interesting approach
0

Mark your first search result as something like 2. This will be overridden(by 1) if the new search result meets.

1 Comment

how is the question :) I know what to do, I just dont know how to do it since I am not proficient at pandas
0

Here is another simple way:

  1. concatenate the target "string" columns into a new single column, e.g.

    df['new_col'] = df['c_name'] + ' ' + df['f_name']
    
  2. search the new_col for the substring, e.g.

    result = df[df['new_col'].str.contains('abc')]
    
  3. delete new_col after it has completed its mission:

    del results['new_col']
    


Here is an example:

>>> df= pd.DataFrame(row, columns=['c_name', 'f_name'])
>>> df
  c_name f_name
0   abcx   abcy
1    efg   abcz
2   higj     UK
>>> df['new_col'] = df['c_name'] + ' ' + df['f_name']
>>> results = df[df['new_col'].str.contains('abc')]
>>> del df['new_col'], results['new_col']
>>> results
  c_name f_name
0   abcx   abcy
1    efg   abcz

Comments

Your Answer

By clicking “Post Your Answer”, you agree to our terms of service and acknowledge you have read our privacy policy.

Start asking to get answers

Find the answer to your question by asking.

Ask question

Explore related questions

See similar questions with these tags.