0

I have a Python pandas dataframe with several columns. Now I want to copy all values into one single column to get a values_count result alle values included. At the end I need the total count of string1, string2, n. What is the best way to do it?

index row 1    row 2   ...
0     string1  string3
1     string1  string1
2     string2  string2
...
2
  • What would the output be for your example? 3? Commented Aug 25, 2018 at 14:46
  • string1 3, string2 2, string3 1 Commented Aug 25, 2018 at 14:51

1 Answer 1

4

If performance is an issue try:

from collections import Counter

Counter(df.values.ravel())
#Counter({'string1': 3, 'string2': 2, 'string3': 1})

Or stack it into one Series then use value_counts

df.stack().value_counts()
#string1    3
#string2    2
#string3    1
#dtype: int64

For larger (long) DataFrames with a small number of columns, looping may be faster than stacking:

s = pd.Series()
for col in df.columns:
    s = s.add(df[col].value_counts(), fill_value=0)

#string1    3.0
#string2    2.0
#string3    1.0
#dtype: float64

Also, there's a numpy solution:

import numpy as np
np.unique(df.to_numpy(), return_counts=True)

#(array(['string1', 'string2', 'string3'], dtype=object),
# array([3, 2, 1], dtype=int64))

df = pd.DataFrame({'row1': ['string1', 'string1', 'string2'],
                   'row2': ['string3', 'string1', 'string2']})

def vc_from_loop(df):
    s = pd.Series()
    for col in df.columns:
        s = s.add(df[col].value_counts(), fill_value=0)
    return s

Small DataFrame

%timeit Counter(df.values.ravel())
#11.1 µs ± 56.6 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)

%timeit df.stack().value_counts()
#835 µs ± 5.46 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)

%timeit vc_from_loop(df)
#2.15 ms ± 34.6 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

%timeit np.unique(df.to_numpy(), return_counts=True)
#23.8 µs ± 241 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)

Long DataFrame

df = pd.concat([df]*300000, ignore_index=True)

%timeit Counter(df.values.ravel())
#124 ms ± 1.85 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)

%timeit df.stack().value_counts()
#337 ms ± 3.59 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

%timeit vc_from_loop(df)
#182 ms ± 1.58 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)

%timeit np.unique(df.to_numpy(), return_counts=True)
#1.16 s ± 1.09 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
Sign up to request clarification or add additional context in comments.

1 Comment

You might find df.values.ravel() slightly more efficient.

Your Answer

By clicking “Post Your Answer”, you agree to our terms of service and acknowledge you have read our privacy policy.

Start asking to get answers

Find the answer to your question by asking.

Ask question

Explore related questions

See similar questions with these tags.