2

I'm subclassing pandas DataFrame in a project of mine. Most pandas operations preserve the subclass type, but df.groupby().agg() does not. Is this a bug? Is there a known workaround?

import pandas as pd

class MySeries(pd.Series):
    pass

class MyDataFrame(pd.DataFrame):
    @property
    def _constructor(self):
        return MyDataFrame
    _constructor_sliced = MySeries

MySeries._constructor_expanddim = MyDataFrame

df = MyDataFrame({"a": reversed(range(10)), "b": list('aaaabbbccc')})

print(type(df.groupby("b").sum()))
# <class '__main__.MyDataFrame'>

print(type(df.groupby("b").agg({"a": "sum"})))
# <class 'pandas.core.frame.DataFrame'>

It looks like there was an issue (described here) that fixed subclassing for df.groupby, but as far as I can tell df.groupby().agg() was missed. I'm using pandas version 2.0.3.

2
  • Looping in @alkasm and @grge who worked on the linked issue. Commented Aug 19, 2024 at 19:34
  • I've opened a pandas issue on Github Commented Aug 30, 2024 at 16:29

2 Answers 2

0

The workaround I'm currently using is to re-initialize the subclassed DataFrame and call the __finalize__ method, which propogates metadata to the new object.

MyDataFrame(my_df.groupby("b").agg({"a": "sum"})).__finalize__(other=my_df)

OP case

First, I've added a custom attribute to MyDataFrame:

import pandas as pd

class MySeries(pd.Series):
    _metadata = ['my_attr']

class MyDataFrame(pd.DataFrame):
    _metadata = ['my_attr']
    
    def __init__(
            self, 
            data, 
            my_attr=None, 
            index=None, 
            columns=None, 
            dtype=None, 
            copy=None
        ):
        self.my_attr = my_attr
        super().__init__(data, index, columns, dtype, copy)
    
    @property
    def _constructor(self):
        return MyDataFrame
    _constructor_sliced = MySeries

MySeries._constructor_expanddim = MyDataFrame

Now we can check that subclass type and custom attributes are preserved:

my_df = MyDataFrame(
    {"a": reversed(range(10)), "b": list('aaaabbbccc')},
    my_attr='foo'
)
assert isinstance(my_df, MyDataFrame)
# Success!
assert isinstance(my_df.sample(3), MyDataFrame)
# Success!
assert isinstance(my_df.copy(), MyDataFrame)
# Success!

new_df = my_df.groupby("b").sum()
assert isinstance(new_df, MyDataFrame)
# Success! - fixed by issue linked in question

new_df = my_df.groupby("b").agg({"a": "sum"})
assert isinstance(new_df, MyDataFrame)
# AssertionError
assert new_df.my_attr == 'foo'
# AttributeError

new_df = my_df.groupby("b").agg({"a": "sum"})
new_df = MyDataFrame(new_df).__finalize__(other=my_df)
assert isinstance(new_df, MyDataFrame)
# Success!
assert new_df.my_attr == 'foo'
# Success!
Sign up to request clarification or add additional context in comments.

1 Comment

Creating another object seems far from ideal, as I want end users to be able to use groupby().agg() on my DataFrame subclass without running into bugs. I'm open to accepting other answers that do a better job of patching groupby().agg() on the back end.
0

It turns out that groupby().agg() combines Series to build a DataFrame, so the subclassed Series constructor needs to be properly defined. See this documentation.

The following code runs with no errors:

import pandas as pd

class MySeries(pd.Series):
    @property
    def _constructor(self):
        return MySeries

    @property
    def _constructor_expanddim(self):
        return MyDataFrame

class MyDataFrame(pd.DataFrame):
    @property
    def _constructor(self):
        return MyDataFrame

    @property
    def _constructor_sliced(self):
        return MySeries


df = MyDataFrame({"a": reversed(range(10)), "b": list('aaaabbbccc')})

assert isinstance(df.groupby("b").agg({"a": "sum"}), MyDataFrame)

Comments

Your Answer

By clicking “Post Your Answer”, you agree to our terms of service and acknowledge you have read our privacy policy.

Start asking to get answers

Find the answer to your question by asking.

Ask question

Explore related questions

See similar questions with these tags.