[FEAT] Should include more comprehensive benchmarking (primarily performance) #2760
Comments
This code was used for Google-internal micro-benchmarking: …
Here is a doc with background and results of the micro-benchmarking: …
A benchmarking tip: I think combining a C++ sampling profiler with a Python benchmark is extremely useful. My latest internal benchmark script (not the above one) supports …
This is a bit annoying if external contributors would feel a calling to help out. For future people without access: contributions and ideas are still welcome; sharing that code is up to the authors, but ideas can be discussed :-)
Yes, sorry, but getting clearance to make this info fully public is likely quite a bit of trouble. But we can add interested people, and we can summarize or report what we learn here.
No worries, @rwgk, I completely understand!
Also looking at the following to gauge how to possibly benchmark the CPython / PyPy low-level bits, for #2050-type stuff, in addition to Kibeom's suggestion for …
Since cppyy was mentioned: I use pytest-benchmark as well (see: https://bitbucket.org/wlav/cppyy/src/master/bench/). It's hard to write good benchmarks, though, as features and defaults differ. For example, releasing the GIL by default is costly for micro-benches (and for large, costly C++ functions, the bindings overhead doesn't matter). Another big expense is object tracking for identity matching on returns, which not all binders do (and which is useless for micro-benches).

For real high performance, the processor matters as well. For example, PyPy has guards on object types in its traces, based on which a specific C++ overload selected by cppyy will be compiled in. On a processor with good branch prediction and a deep out-of-order execution queue, that overhead will not show up in wall-clock time (assuming no hyper-threading, of course), but it will be measurable on a processor with simpler cores.

When sticking to CPython only, consider also that CFunction objects have seen a massive amount of support in the form of specialized tracks through the CPython interpreter since the 3.x releases. (This is what makes SWIG in "builtin" mode, not the default, absolutely smoke everything else.) Only since 3.8 have closures seen some love, with the API stabilized in 3.9. There's a 30% or so reduction in call overhead in there somewhere (for cppyy), but it's proving to be quite a lot of work to implement.

That last point, CPython internal developments and the need to track/make use of them, is also why I'd be interested if the proposed benchmarks end up being made public. The only way of measuring the usefulness of such changes is by having a historic record to compare against, and setting that up is quite some effort (esp. when switching development machines regularly).
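For concreteness, here is a minimal sketch of the kind of pytest-benchmark micro-benchmark discussed above; the module name `example_ext` and its `noop()` function are hypothetical stand-ins for a pybind11-built extension, not anything that exists in this thread.

```python
# Minimal call-overhead micro-benchmark with pytest-benchmark.
# Assumes a hypothetical pybind11 extension "example_ext" exposing an
# empty free function `noop()`, plus a pure-Python baseline.
import pytest

example_ext = pytest.importorskip("example_ext")  # skip if the module isn't built


def py_noop():
    pass


def test_call_overhead_python(benchmark):
    # Baseline: overhead of calling a trivial pure-Python function.
    benchmark(py_noop)


def test_call_overhead_pybind11(benchmark):
    # Bound-function overhead; with an empty body this is almost pure
    # binding-layer cost (argument handling, GIL policy, return conversion).
    benchmark(example_ext.noop)
```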
See also: benchmarking by @sloretz for usability, basic RAM analysis, etc., comparing popular binding options (boost.python, cffi, cppyy, raw CPython, pybind11, SIP, SWIG) for the rclpy library: ros2/rclpy#665

Wim, not sure if this is what you meant by improvements in 3.8, but Shane also pointed to possible optimizations via PEP 590: Vectorcall (I dunno if pybind11 already leverages this directly, but I see it crop up in this search: https://github.com/pybind/pybind11/issues?q=vectorcall+)
Yes, I was referring to vectorcall. The problem with its implementation is that it creates several distinctive paths, but then squeezes them through a single call interface. The upshot is very branchy code, more memory allocations to track, etc. Not the end of the world, but in the particular case of cppyy, certain code paths such as the handling of …

Not sure about pybind11, but with its greater call overhead, the benefits will be far less, and there is the additional problem of requiring an extra pointer data member per bound object, so vectorcalls do come with a clear memory cost that can only be worth it if the CPU gains are big enough. But then, if the C++ side does substantial work in each call, the call overhead matters little to nothing (see e.g. the benchmarking text you quote; myself, I benchmark mostly with empty functions), thus leaving that extra pointer as pure waste. (Aside: the way I currently see it, storing the dispatch pointer as a data member in the proxy is something that's only useful for C, not for C++.)
From @wjakob - use …
I'd be inclined to disagree (we were working with Levinthal back in the day when perf was being developed). The problem with Python code is that it is very branchy and consists mostly of pointer-chasing, so instructions (even when restricted to retired ones) give you a very muddy picture, as there is lots of out-of-order execution and (micro-)benchmarks will have branching predictability that can differ significantly from real applications.
FYI @EricCousineau-TRI. I've spent some time pondering on this, assuming that the answer to … is yes. There are many tools out there to write benchmarks, to profile builds, to profile execution, and to visualize results. What I think we are lacking is a definition of performance impact, though I'm not aware of the conclusions the TensorFlow team has arrived at.
I'll focus on evaluating the impact of a code change, where impact is defined by a set of predefined metrics. Judging that impact, or weighing it against the (sometimes subjective) value of a feature / issue resolution, is not straightforward in the general case. Even if the net effect of a change is observable, comparing dissimilar metrics requires extra decision making (barring scalarization, i.e. a sum of products with suitable weights). I'll further assume that new dimensions or axes (such as …), as well as different operating systems and processor architectures, can always be added by repeatedly evaluating the same metrics for different build-time and/or run-time configurations.
That's fair, but I'll keep some (prudential?) bias towards cross-platform solutions. To the best of my knowledge, it is generally not possible to fully characterize a piece of software in isolation (e.g. to have a performance baseline that's decoupled from downstream use cases and the infrastructure it runs on). @wlav brought up several good points regarding the difficulties of writing generally applicable benchmarks. Thus, it is perhaps easier to define each metric as a measurement plus an associated experiment to perform that measurement. A few good measurements were suggested above. Improving on that list, I think that the following measurements paint a comprehensive picture (though I'm sure I'm missing a few): …
To perform those measurements, I think that, to begin with, we can re-purpose a subset of …. As for the tooling, I think …. This toolkit is fairly new, but it offers cross-platform CMake macros as well as C++ and Python APIs for build-time and run-time instrumentation, hardware counters, vendor-specific profilers, and more. It has two downsides though: it uses ….

Thoughts? I haven't had time to check https://bitbucket.org/wlav/cppyy/src/master/bench/, but I will. I'd also appreciate @kkimdev's insights.
Sounds great! Some minor questions:
My naive Google/DDG-fu couldn't resolve my understanding of this term. Do you have a good ref. for me to understand this?
How would you imagine CI checks being performed? If they are costly (e.g. >=15 min?), my suggestion is that they be relegated to "on-demand" builds and semi-infrequent measures (weekly). From there, metrics at a coarse level can be tracked, and then fine-grained inspection can be done by bisecting across a specific increase, rather than indexing a "huge" corpus of data. Thoughts? (Is that already a feature of …?)
I briefly perused the docs, but I'm not sure I saw a concrete recommendation on data aggregation / metrics visualization. I saw …. Do you have thoughts on how to review these changes? (One option is to use an external free service; heck, even something like ….)
Hehe, mayhaps "you shouldn't pay for what you don't use" is a more apt description in most cases :P
In all cases, RSS stands for resident set size. During builds, perhaps inspecting the total virtual memory requirements may be revealing too.
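To make that metric concrete, here is a minimal sketch of how run-time RSS could be probed from Python; `example_ext` and `Struct` are hypothetical placeholders, and build-time memory would instead need to be watched from outside the compiler processes (e.g. with psutil).

```python
# A rough sketch of a run-time RSS probe (assumes Linux/macOS; the stdlib
# `resource` module is unavailable on Windows, where psutil could be used).
import resource
import sys


def peak_rss_bytes() -> int:
    """Peak resident set size of the current process so far."""
    ru_maxrss = resource.getrusage(resource.RUSAGE_SELF).ru_maxrss
    # ru_maxrss is reported in kilobytes on Linux but in bytes on macOS.
    return ru_maxrss if sys.platform == "darwin" else ru_maxrss * 1024


if __name__ == "__main__":
    before = peak_rss_bytes()
    import example_ext                                      # hypothetical pybind11 module
    objs = [example_ext.Struct() for _ in range(100_000)]   # hypothetical bound type
    print(f"peak RSS grew by ~{(peak_rss_bytes() - before) / 2**20:.1f} MiB")
```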
Agreed. I did a poor job above separating metrics by granularity (and focused on their feasibility instead). I definitely expect most automated benchmarks to perform coarse measurements and be run as such. A CI check for a PR'd patch simply amounts to a faster regression check, and for that we need simple yet representative measurements for each metric (e.g. benchmarks for the hottest or least cache friendly execution paths). It will require empirical evidence, and a process to update these measurements as the system evolves, so not the first thing in the queue.
No, and no.
+1 to an external service. Pandas is quite popular; it shouldn't be hard to find a good, cheap service that can process that data. Commit hashes can be used for tagging. Still, that doesn't address how to review changes in performance. And neither do I 😅. Quoting myself:
I don't know, TBH. We need data. I guess it's likely we'll end up focusing on trends in global metrics (like total execution time per benchmark and per suite) as opposed to fluctuations in local metrics (like cache misses in a function). Or maybe not. Maybe a few local metrics for a handful of functions shape everything else and we ought to be paying attention to them instead.
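As a sketch of the pandas-plus-commit-hash bookkeeping suggested above (the JSON layout assumes a `pytest-benchmark --benchmark-json` export; file names and column names are arbitrary choices, not an agreed-upon format):

```python
# Sketch: flatten benchmark results into a pandas frame tagged with the
# current commit hash, and append them to a long-running history file.
import json
import subprocess
from pathlib import Path

import pandas as pd


def load_results(json_path: Path) -> pd.DataFrame:
    """One row per benchmark from a pytest-benchmark JSON export."""
    data = json.loads(json_path.read_text())
    rows = [
        {
            "name": bench["name"],
            "mean_s": bench["stats"]["mean"],
            "stddev_s": bench["stats"]["stddev"],
        }
        for bench in data["benchmarks"]
    ]
    df = pd.DataFrame(rows)
    df["commit"] = subprocess.check_output(
        ["git", "rev-parse", "--short", "HEAD"], text=True
    ).strip()
    return df


if __name__ == "__main__":
    df = load_results(Path("benchmark_results.json"))
    # Append so global trends can later be plotted per commit.
    df.to_csv("benchmark_history.csv", mode="a", header=False, index=False)
```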
As to @hidmic's last point, yes, there is a series of "obvious" functions that matter to almost everyone (resolve an overload, access an …). Identifying those cases is the hard bit, not writing the optimized code paths, which is why a benchmark suite that is run over multiple binders would be useful for me, even as the original post has "other binders" only as a stretch goal: if one binder is faster than another on some benchmark, it is an obvious indicator that some optimization is missed in the slower binder, and such information is a lot easier to act on than trying to optimize a given benchmark where it's not known a priori whether (further) optimization is even possible.

Aside, PyPy development often progresses like that, too: someone compares …
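To illustrate the "same benchmark over multiple binders" idea, a pytest-benchmark sketch follows; the three module names are hypothetical placeholders for extensions built with each binder and are assumed to expose the same `noop()` function.

```python
# Sketch: run one micro-benchmark across several binders, so that a gap
# between binders flags a possibly missed optimization.
import pytest

BINDER_MODULES = ["example_pybind11", "example_cppyy", "example_swig"]


@pytest.fixture(params=BINDER_MODULES)
def bound_noop(request):
    mod = pytest.importorskip(request.param)  # skip binders that aren't built
    return mod.noop


def test_noop_call(benchmark, bound_noop):
    benchmark(bound_noop)
```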
Just an update on my comment from Jan 22 above: cppyy now supports vectorcall (in the repo; not yet released). Yes, it can achieve 35% performance improvements in some cases, but that isn't true across the board, with SWIG with …
Better late than never ... here are results from some benchmarking I did back in February 2021: I was comparing the performance of different holder types at the time (not too much to report), but was surprised by this finding (copied from the slide): …
I investigated why pybind11 was slow to e.g. create objects or cast objects to the C++ type. Long story short, by skipping complex logic and just doing what's needed, we can get significantly faster. To know whether it was the …
The cast is then:
NOTE: Of course, this only works for a wrapped C++ type without multiple inheritance, inheritance of a C++ type from Python, etc. But it's reasonable to expect this use case to be fast, and more complex features not to be executed when not needed. Generally speaking, whether we can perform many of these lookups only once depends on how much dynamic stuff is done by the user. I expect types to be created at start-up; if they are created later, they won't be used until they are created, and they won't be destroyed. In that case, doing these retrievals only once makes sense. I think it should be the default, and if one really wants fully dynamic type_casters that always perform a lookup, they should build their own (or we template it with a true/false). We can get significantly faster, close to raw C API performance, by only doing what is needed and nothing more.
I am comparing directly storing a struct with storing a unique_ptr to the struct, but it's easy to adapt. So I guess we could be rewriting, or adding, new type casters / templated functions which are simpler.
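As a sketch of how one might quantify the creation and cast overhead being discussed here (this is not the author's benchmark): `example_ext` is a hypothetical pybind11 module exposing a trivial `Struct` and a `consume(Struct)` function, with a plain Python class as the baseline.

```python
# Sketch: per-call cost of creating a bound object and of passing it back
# into C++, measured with timeit against a pure-Python baseline.
import timeit

import example_ext  # hypothetical pybind11 module


class PyStruct:      # pure-Python baseline for comparison
    __slots__ = ("x",)

    def __init__(self):
        self.x = 0


def bench(stmt, number=1_000_000):
    per_call_ns = timeit.timeit(stmt, globals=globals(), number=number) / number * 1e9
    print(f"{stmt:30s} {per_call_ns:8.1f} ns/call")


if __name__ == "__main__":
    bench("PyStruct()")                # baseline object creation
    bench("example_ext.Struct()")      # bound-type creation (wrapper + holder)
    s = example_ext.Struct()
    bench("example_ext.consume(s)")    # Python -> C++ cast of a bound instance
```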
As that document points to the dispatcher loop as the problem (which is also one of the slowest parts in SWIG, so it's not necessarily completely solvable with code generation): that's a perfect place for memoization based on the argument types. In CPython, cppyy does this explicitly; in PyPy, it is done implicitly by the JIT (types are guarded at the beginning of a JITed trace). If the type conversion of arguments of the memoized call fails, the run-time falls back on the normal, slower, overload resolution.
Yes; and these are still static properties. In cppyy, a flag is set for these cases and an optimized path is chosen, where possible, at run-time based on that flag. (The equivalent of the code above in cppyy is skipping the offset calculation.)
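A toy Python illustration of memoizing overload dispatch on argument types, as described for cppyy above (this is not how cppyy or pybind11 implement it; it only shows the shape of the idea, with fallback to full resolution when the memoized overload rejects the arguments):

```python
class OverloadDispatcher:
    def __init__(self, overloads):
        # overloads: list of callables that raise TypeError on a bad match.
        self._overloads = overloads
        self._memo = {}  # tuple of argument types -> previously chosen overload

    def __call__(self, *args):
        key = tuple(type(a) for a in args)
        chosen = self._memo.get(key)
        if chosen is not None:
            try:
                return chosen(*args)        # fast path: reuse the memoized overload
            except TypeError:
                pass                        # conversion failed; fall back below
        for overload in self._overloads:    # slow path: full overload resolution
            try:
                result = overload(*args)
            except TypeError:
                continue
            self._memo[key] = overload
            return result
        raise TypeError("no matching overload")


# Usage sketch:
add = OverloadDispatcher([lambda a, b: a + b])
add(1, 2)    # resolves the overload and memoizes it for (int, int)
add(3, 4)    # hits the memoized fast path
```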
Just to know what it would take to push some of the performance improvements: there are some changes which I think are beneficial but break the API (so either we add new methods and deprecate the current ones, which is ugly, or we increment the major version). For example, making
instead of the current
However, it assumes that the user won't dynamically add the type after this is called, as that would invalidate the cache. So how does governance work for such breaking changes? I am wondering if there is a way to get them in, e.g. by writing an email with an exact list of proposals along with benchmarks showing the improvements, but I am not sure there is a formal process for the owners to decide on these breaking changes.
Motivation
At present (b7dfe5c), pybind11 proper only benchmarks compile time and artifact size for one given test setup (which tests arguments and simple inheritance, but that's about it, I think); the results can be seen here:
https://pybind11.readthedocs.io/en/stable/benchmark.html
https://github.com/pybind/pybind11/blob/v2.6.1/docs/benchmark.rst
However, it may be difficult to objectively and concretely judge the performance impact of a PR, and weigh that against the value of the feature / issue resolution. Generally, benchmarking is done on an ad-hoc basis (totes works, but may make it difficult for less creative people like myself ;)
Primary motivating issues / PRs:
Secondary:
Fuzzy Scoping + Steps
- pybind11 only (out of scope: other binding approaches)
- dlopen-ish stuff (pybind11 internals upstart, binding registration, ...)
- whatever pybind11 finds the most important (how to weigh compile-time, size, speed, memory, etc.)

Given that performance benchmarks can be a P.I.T.A. (e.g. how to handle OS + interrupts, hardware capacity / abstractions, blah blah), ideally decisions should be made about relative performance on the same machine. Ideally, we should also publish some metrics for a given config to give people a "feel" for the performance, as was done for compile time.
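For reference, a minimal sketch of the compile-time / artifact-size style of measurement already published for pybind11; the source file name and compiler flags are assumptions and would need adapting to the real generated test module (the pybind11 include path would also be needed, e.g. via `python -m pybind11 --includes`).

```python
# Sketch: time a single compilation of a generated module and report the
# resulting shared-library size.
import subprocess
import sysconfig
import time
from pathlib import Path

SOURCE = Path("bench_module.cpp")   # hypothetical generated test module
OUTPUT = Path("bench_module.so")
CMD = [
    "c++", "-O3", "-shared", "-std=c++17", "-fPIC",
    f"-I{sysconfig.get_paths()['include']}",
    str(SOURCE), "-o", str(OUTPUT),
]

if __name__ == "__main__":
    start = time.perf_counter()
    subprocess.run(CMD, check=True)
    elapsed = time.perf_counter() - start
    print(f"compile time: {elapsed:.1f} s, "
          f"artifact size: {OUTPUT.stat().st_size / 1024:.0f} KiB")
```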
Suggested Solution Artifacts
- github.com/pybind/pybind-benchmarks?
- pytest-benchmark
@wjakob @rwgk @rhaschke @YannickJadoul @bstaletic @henryiii @ax3l
Can I ask what y'all think? Is this redundant w.r.t. what we already have?