Skip to content

DES: How much perf penalty will we accept to get rid of libreduction? #40263

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Closed
jbrockmendel opened this issue Mar 6, 2021 · 4 comments
Closed
Labels
Internals Related to non-user accessible pandas implementation Needs Discussion Requires discussion from core team before further action Performance Memory or execution speed performance

Comments

@jbrockmendel
Copy link
Member

xref some discussion #40171 (comment)

libreduction and the associated callers are a disproportionate maintenance headache [citation needed]. It would be nice to be able to rip it out and just have one path for those methods, but that would entail a non-trivial performance hit. Recently, though, we've managed to optimize the pure-python path a bit, and im optimistic we can shave off some more of the difference.

The question: how much of a perf penalty are we willing to accept in order to remove libreduction?

Copying from #40171 (comment)

I'll throw out a number: if we could get worst-case down to within about 3x and most-cases within 1.5x, I'd be open to removing the cython paths. They are a disproportionate producer of headaches. (From #36459 (possibly out of date) "entirely disabling the cython path leads to 4 currently-xfailed tests passing")

Besides which, if/when cython3 becomes available, we may have to get rid of these anyway.

@jbrockmendel jbrockmendel added Bug Needs Triage Issue that has not been reviewed by a pandas team member labels Mar 6, 2021
@jorisvandenbossche
Copy link
Member

The referenced benchmark is:

N = 10 ** 4
labels = np.random.randint(0, 2000, size=N)
labels2 = np.random.randint(0, 3, size=N)
df = DataFrame(
    {
        "key": labels,
        "key2": labels2,
        "value1": np.random.randn(N),
        "value2": ["foo", "bar", "baz", "qux"] * (N // 4),
    }
)
df.groupby("key").apply(lambda x: 1)

Running this with current master, I get:

In [2]: %timeit df.groupby("key").apply(lambda x: 1)
5.45 ms ± 92.2 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

When disabling the usage of libreduction fast_apply using this patch:

--- a/pandas/core/groupby/ops.py
+++ b/pandas/core/groupby/ops.py
@@ -390,6 +390,7 @@ class BaseGrouper:
             # for now -> relies on BlockManager internals
             pass
         elif (
+            False and
             com.get_callable_name(f) not in base.plotting_methods
             and isinstance(splitter, FrameSplitter)
             and axis == 0

I get the following timing:

In [4]: %timeit df.groupby("key").apply(lambda x: 1)
16.7 ms ± 1.25 ms per loop (mean ± std. dev. of 7 runs, 100 loops each)

So a 3-4x slowdown.

However, the applied function is not doing anything useful (just returning a constant), so basically this benchmark is only measuring the overhead. And whether a 3-4x slowdown in the overhead is significant in a real use case, depends on how much time this overhead itself takes.

So using a slightly more complex example, calculating the mean of one of the columns (which is still a relatively simple/fast function, I think). With master, this gives

In [4]: %timeit df.groupby("key").apply(lambda x: x['value1'].mean())
147 ms ± 5.41 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)

with libreduction disabled, I get:

In [6]: %timeit df.groupby("key").apply(lambda x: x['value1'].mean())
182 ms ± 3.73 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)

So still slower, but no longer a 3-4x slowdown.
And this slowdown is for me in the range of what's acceptable to get rid of libreduction, I think.

(it's probably useful to see if those numbers are similar on different machines)

@lithomas1 lithomas1 added Performance Memory or execution speed performance Needs Discussion Requires discussion from core team before further action Internals Related to non-user accessible pandas implementation and removed Bug Needs Triage Issue that has not been reviewed by a pandas team member labels May 2, 2021
@mroeschke
Copy link
Member

@jbrockmendel did #42992 close this?

@jbrockmendel
Copy link
Member Author

Only half of it

@jbrockmendel
Copy link
Member Author

Closed by #43189

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Internals Related to non-user accessible pandas implementation Needs Discussion Requires discussion from core team before further action Performance Memory or execution speed performance
Projects
None yet
Development

No branches or pull requests

4 participants