[FEAT] Should include more comprehensive benchmarking (primarily performance) #2760
Comments
This code was used for Google-internal micro-benchmarking: …
Here is a doc with background and results of the micro-benchmarking: …
A benchmarking tip: I think combining a C++ sampling profiler with a Python benchmark is extremely useful. My latest internal benchmark script (not the above one) supports …
This is a bit annoying if external contributors would feel a calling to help out. For future people without access: contributions and ideas are still welcome; sharing that code is up to the authors, but ideas can be discussed :-)
Yes, sorry, but getting clearance to make this info fully public is likely quite a bit of trouble. But we can add interested people, and we can summarize or report what we learn here.
No worries, @rwgk, I completely understand!
Also looking at the following to gauge how to possibly benchmark the CPython / PyPy low-level bits, for #2050-type stuff, in addition to Kibeom's suggestion for …
Since cppyy was mentioned: I use pytest-benchmark as well (see: https://bitbucket.org/wlav/cppyy/src/master/bench/). It's hard to write good benchmarks, though, as features and defaults differ. For example, releasing the GIL by default is costly for micro-benches (and for large, costly C++ functions, the bindings overhead doesn't matter). Another big expense is object tracking for identity matching on returns, which not all binders do (and which is useless for micro-benches).

For real high performance, the processor matters as well. For example, PyPy has guards on object types in its traces, based on which a specific C++ overload selected by cppyy will be compiled in. On a processor with good branch prediction and a deep out-of-order execution queue, that overhead will not show up in wall-clock time (assuming no hyper-threading, of course), but it will be measurable on a processor with simpler cores.

When sticking to CPython only, consider also that CFunction objects have seen a massive amount of support in the form of specialized tracks through the CPython interpreter since the 3.x releases. (This is what makes SWIG in "builtin" mode, not the default, absolutely smoke everything else.) Only since 3.8 have closures seen some love, with the API stabilized in 3.9. There's a 30% or so reduction in call overhead in there somewhere (for cppyy), but it's proving to be quite a lot of work to implement.

That last point, CPython internal developments and the need to track/make use of them, is also why I'd be interested if the proposed benchmarks end up being made public. The only way of measuring the usefulness of such changes is by having a historic record to compare against, and setting that up is quite some effort (esp. when switching development machines regularly).
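For concreteness, here is a minimal sketch of the kind of pytest-benchmark micro-benchmark discussed above; the module name `example_ext` and its `noop()` function are hypothetical stand-ins for a pybind11-built extension, not anything that exists in this thread.

```python
# Minimal call-overhead micro-benchmark with pytest-benchmark.
# Assumes a hypothetical pybind11 extension "example_ext" exposing an
# empty free function `noop()`, plus a pure-Python baseline.
import pytest

example_ext = pytest.importorskip("example_ext")  # skip if the module isn't built


def py_noop():
    pass


def test_call_overhead_python(benchmark):
    # Baseline: overhead of calling a trivial pure-Python function.
    benchmark(py_noop)


def test_call_overhead_pybind11(benchmark):
    # Bound-function overhead; with an empty body this is almost pure
    # binding-layer cost (argument handling, GIL policy, return conversion).
    benchmark(example_ext.noop)
```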
See also: benchmarking by @sloretz for usability, basic RAM analysis, etc., comparing popular binding options (boost.python, cffi, cppyy, raw CPython, pybind11, SIP, SWIG) for the rclpy library: ros2/rclpy#665

Wim, not sure if this is what you meant by improvements in 3.8, but Shane also pointed to possible optimizations via PEP 590: Vectorcall (I dunno if pybind11 already leverages this directly, but I see it crop up in this search: https://github.com/pybind/pybind11/issues?q=vectorcall+)
Yes, I was referring to vectorcall. The problem with its implementation is that it creates several distinctive paths, but then squeezes them through a single call interface. The upshot is very branchy code, more memory allocations to track, etc. Not the end of the world, but in the particular case of cppyy, certain code paths such as the handling of …

Not sure about pybind11, but with its greater call overhead, the benefits will be far less, and there is the additional problem of requiring an extra pointer data member per bound object, so vectorcalls do come with a clear memory cost that can only be worth it if the CPU gains are big enough. But then, if the C++ side does substantial work in each call, the call overhead matters little to nothing (see e.g. the benchmarking text you quote; myself, I benchmark mostly with empty functions), thus leaving that extra pointer as pure waste. (Aside: the way I currently see it, storing the dispatch pointer as a data member in the proxy is something that's only useful for C, not for C++.)
From @wjakob - use …
I'd be inclined to disagree (we were working with Levinthal back in the day when perf was being developed). The problem with Python code is that it is very branchy and consists mostly of pointer-chasing, so instructions (even when restricted to retired ones) give you a very muddy picture, as there is lots of out-of-order execution and (micro-)benchmarks will have branching predictability that can differ significantly from real applications.
FYI @EricCousineau-TRI. I've spent some time pondering on this, assuming that the answer to … is yes. There are many tools out there to write benchmarks, to profile builds, to profile execution, and to visualize results. What I think we are lacking is a definition of performance impact, though I'm not aware of the conclusions the TensorFlow team has arrived at.
I'll focus on evaluating the impact of a code change, where impact is defined by a set of predefined metrics. Judging that impact, or weighing it against the (sometimes subjective) value of a feature / issue resolution, is not straightforward in the general case. Even if the net effect of a change is observable, comparing dissimilar metrics requires extra decision making (barring scalarization, i.e. a sum of products with suitable weights). I'll further assume that new dimensions or axes (such as …), as well as different operating systems and processor architectures, can always be added by repeatedly evaluating the same metrics for different build-time and/or run-time configurations.
That's fair, but I'll keep some (prudential?) bias towards cross-platform solutions. To the best of my knowledge, it is generally not possible to fully characterize a piece of software in isolation (e.g. to have a performance baseline that's decoupled from downstream use cases and the infrastructure it runs on). @wlav brought up several good points regarding the difficulties of writing generally applicable benchmarks. Thus, it is perhaps easier to define each metric as a measurement plus an associated experiment to perform that measurement. A few good measurements were suggested above. Improving on that list, I think that the following measurements paint a comprehensive picture (though I'm sure I'm missing a few): …
To perform those measurements, I think that, to begin with, we can re-purpose a subset of …. As for the tooling, I think …. This toolkit is fairly new, but it offers cross-platform CMake macros as well as C++ and Python APIs for build-time and run-time instrumentation, hardware counters, vendor-specific profilers, and more. It has two downsides though: it uses ….

Thoughts? I haven't had time to check https://bitbucket.org/wlav/cppyy/src/master/bench/, but I will. I'd also appreciate @kkimdev's insights.
Sounds great! Some minor questions:
My naive Google/DDG-fu couldn't resolve my understanding of this term. Do you have a good ref. for me to understand this?
How would you imagine CI checks being performed? If they are costly (e.g. >=15 min?), my suggestion is that they be relegated to "on-demand" builds and semi-infrequent measures (weekly). From there, metrics at a coarse level can be tracked, and then fine-grained inspection can be done by bisecting across a specific increase, rather than indexing a "huge" corpus of data. Thoughts? (Is that already a feature of …?)
I briefly perused the docs, but I'm not sure I saw a concrete recommendation on data aggregation / metrics visualization. I saw …. Do you have thoughts on how to review these changes? (One option is to use an external free service; heck, even something like ….)
Hehe, mayhaps "you shouldn't pay for what you don't use" is a more apt description in most cases :P
In all cases, RSS stands for resident set size. During builds, perhaps inspecting the total virtual memory requirements may be revealing too.
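To make that metric concrete, here is a minimal sketch of how run-time RSS could be probed from Python; `example_ext` and `Struct` are hypothetical placeholders, and build-time memory would instead need to be watched from outside the compiler processes (e.g. with psutil).

```python
# A rough sketch of a run-time RSS probe (assumes Linux/macOS; the stdlib
# `resource` module is unavailable on Windows, where psutil could be used).
import resource
import sys


def peak_rss_bytes() -> int:
    """Peak resident set size of the current process so far."""
    ru_maxrss = resource.getrusage(resource.RUSAGE_SELF).ru_maxrss
    # ru_maxrss is reported in kilobytes on Linux but in bytes on macOS.
    return ru_maxrss if sys.platform == "darwin" else ru_maxrss * 1024


if __name__ == "__main__":
    before = peak_rss_bytes()
    import example_ext                                      # hypothetical pybind11 module
    objs = [example_ext.Struct() for _ in range(100_000)]   # hypothetical bound type
    print(f"peak RSS grew by ~{(peak_rss_bytes() - before) / 2**20:.1f} MiB")
```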
Agreed. I did a poor job above separating metrics by granularity (and focused on their feasibility instead). I definitely expect most automated benchmarks to perform coarse measurements and be run as such. A CI check for a PR'd patch simply amounts to a faster regression check, and for that we need simple yet representative measurements for each metric (e.g. benchmarks for the hottest or least cache friendly execution paths). It will require empirical evidence, and a process to update these measurements as the system evolves, so not the first thing in the queue.
No, and no.
+1 to an external service. Pandas is quite popular; it shouldn't be hard to find a good, cheap service that can process that data. Commit hashes can be used for tagging. Still, that doesn't address how to review changes in performance. And neither do I 😅. Quoting myself:
I don't know, TBH. We need data. I guess it's likely we'll end up focusing on trends in global metrics (like total execution time per benchmark and per suite) as opposed to fluctuations in local metrics (like cache misses in a function). Or maybe not. Maybe a few local metrics for a handful of functions shape everything else and we ought to be paying attention to them instead.
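As a sketch of the pandas-plus-commit-hash bookkeeping suggested above (the JSON layout assumes a `pytest-benchmark --benchmark-json` export; file names and column names are arbitrary choices, not an agreed-upon format):

```python
# Sketch: flatten benchmark results into a pandas frame tagged with the
# current commit hash, and append them to a long-running history file.
import json
import subprocess
from pathlib import Path

import pandas as pd


def load_results(json_path: Path) -> pd.DataFrame:
    """One row per benchmark from a pytest-benchmark JSON export."""
    data = json.loads(json_path.read_text())
    rows = [
        {
            "name": bench["name"],
            "mean_s": bench["stats"]["mean"],
            "stddev_s": bench["stats"]["stddev"],
        }
        for bench in data["benchmarks"]
    ]
    df = pd.DataFrame(rows)
    df["commit"] = subprocess.check_output(
        ["git", "rev-parse", "--short", "HEAD"], text=True
    ).strip()
    return df


if __name__ == "__main__":
    df = load_results(Path("benchmark_results.json"))
    # Append so global trends can later be plotted per commit.
    df.to_csv("benchmark_history.csv", mode="a", header=False, index=False)
```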
As to @hidmic's last point, yes, there is a series of "obvious" functions that matter to almost everyone (resolve an overload, access an …). Identifying those cases is the hard bit, not writing the optimized code paths, which is why a benchmark suite that is run over multiple binders would be useful for me, even as the original post has "other binders" only as a stretch goal: if one binder is faster than another on some benchmark, it is an obvious indicator that some optimization is missed in the slower binder, and such information is a lot easier to act on than trying to optimize a given benchmark where it's not known a priori whether (further) optimization is even possible.

Aside, PyPy development often progresses like that, too: someone compares …
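To illustrate the "same benchmark over multiple binders" idea, a pytest-benchmark sketch follows; the three module names are hypothetical placeholders for extensions built with each binder and are assumed to expose the same `noop()` function.

```python
# Sketch: run one micro-benchmark across several binders, so that a gap
# between binders flags a possibly missed optimization.
import pytest

BINDER_MODULES = ["example_pybind11", "example_cppyy", "example_swig"]


@pytest.fixture(params=BINDER_MODULES)
def bound_noop(request):
    mod = pytest.importorskip(request.param)  # skip binders that aren't built
    return mod.noop


def test_noop_call(benchmark, bound_noop):
    benchmark(bound_noop)
```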
Just an update on my comment from Jan 22 above: cppyy now supports vectorcall (in the repo; not yet released). Yes, it can achieve 35% performance improvements in some cases, but that isn't true across the board, with SWIG with …
Better late than never ... here are results from some benchmarking I did back in February 2021: I was comparing the performance of different holder types at the time (not too much to report), but was surprised by this finding (copied from the slide): …
I investigated why pybind11 was slow to e.g. create objects or cast objects to the C++ type. Long story short, by skipping complex logic and just doing what's needed, we can get significantly faster. To know whether it was the …
The cast is then:
NOTE: Of course, this only works for a wrapped C++ type without multiple inheritance, inheritance of a C++ type from Python, etc. But it's reasonable to expect this use case to be fast, and more complex features not to be executed when not needed. Generally speaking, whether we can perform many of these lookups only once depends on how much dynamic stuff is done by the user. I expect types to be created at start-up; if they are created later, they won't be used until they are created, and they won't be destroyed. In that case, doing these retrievals only once makes sense. I think it should be the default, and if one really wants fully dynamic type_casters that always perform a lookup, they should build their own (or we template it with a true/false). We can get significantly faster, close to raw C API performance, by only doing what is needed and nothing more.
I am comparing directly storing a struct with storing a unique_ptr to the struct, but it's easy to adapt. So I guess we could be rewriting, or adding, new type casters / templated functions which are simpler.
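As a sketch of how one might quantify the creation and cast overhead being discussed here (this is not the author's benchmark): `example_ext` is a hypothetical pybind11 module exposing a trivial `Struct` and a `consume(Struct)` function, with a plain Python class as the baseline.

```python
# Sketch: per-call cost of creating a bound object and of passing it back
# into C++, measured with timeit against a pure-Python baseline.
import timeit

import example_ext  # hypothetical pybind11 module


class PyStruct:      # pure-Python baseline for comparison
    __slots__ = ("x",)

    def __init__(self):
        self.x = 0


def bench(stmt, number=1_000_000):
    per_call_ns = timeit.timeit(stmt, globals=globals(), number=number) / number * 1e9
    print(f"{stmt:30s} {per_call_ns:8.1f} ns/call")


if __name__ == "__main__":
    bench("PyStruct()")                # baseline object creation
    bench("example_ext.Struct()")      # bound-type creation (wrapper + holder)
    s = example_ext.Struct()
    bench("example_ext.consume(s)")    # Python -> C++ cast of a bound instance
```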
As that document points to the dispatcher loop as the problem (which is also one of the slowest parts in SWIG, so it's not necessarily completely solvable with code generation): that's a perfect place for memoization based on the argument types. In CPython, cppyy does this explicitly; in PyPy, it is done implicitly by the JIT (types are guarded at the beginning of a JITed trace). If the type conversion of arguments of the memoized call fails, the run-time falls back on the normal, slower, overload resolution.
Yes; and these are still static properties. In cppyy, a flag is set for these cases and an optimized path is chosen, where possible, at run-time based on that flag. (The equivalent of the code above in cppyy is skipping the offset calculation.)
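A toy Python illustration of memoizing overload dispatch on argument types, as described for cppyy above (this is not how cppyy or pybind11 implement it; it only shows the shape of the idea, with fallback to full resolution when the memoized overload rejects the arguments):

```python
class OverloadDispatcher:
    def __init__(self, overloads):
        # overloads: list of callables that raise TypeError on a bad match.
        self._overloads = overloads
        self._memo = {}  # tuple of argument types -> previously chosen overload

    def __call__(self, *args):
        key = tuple(type(a) for a in args)
        chosen = self._memo.get(key)
        if chosen is not None:
            try:
                return chosen(*args)        # fast path: reuse the memoized overload
            except TypeError:
                pass                        # conversion failed; fall back below
        for overload in self._overloads:    # slow path: full overload resolution
            try:
                result = overload(*args)
            except TypeError:
                continue
            self._memo[key] = overload
            return result
        raise TypeError("no matching overload")


# Usage sketch:
add = OverloadDispatcher([lambda a, b: a + b])
add(1, 2)    # resolves the overload and memoizes it for (int, int)
add(3, 4)    # hits the memoized fast path
```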
Just to know what it would take to push some of the performance improvements: there are some changes which I think are beneficial but break the API (so either we add new methods and deprecate the current ones, which is ugly, or we increment the major version). For example, making
instead of the current
However, it assumes that the user won't dynamically add the type after this is called, as that would invalidate the cache. So how does governance work for such breaking changes? I am wondering if there is a way to get them in, e.g. by writing an email with an exact list of proposals along with benchmarks showing the improvements, but I am not sure there is a formal process for the owners to decide on these breaking changes.
Motivation
At present (b7dfe5c), pybind11 proper only benchmarks compile time and artifact size for one given test setup (which tests arguments and simple inheritance, but that's about it, I think); the results can be seen here:
https://pybind11.readthedocs.io/en/stable/benchmark.html
https://github.com/pybind/pybind11/blob/v2.6.1/docs/benchmark.rst
However, it may be difficult to objectively and concretely judge the performance impact of a PR, and weigh that against the value of the feature / issue resolution. Generally, benchmarking is done on an ad-hoc basis (totes works, but may make it difficult for less creative people like myself ;)
Primary motivating issues / PRs:
Secondary:
Fuzzy Scoping + Steps
- pybind11 only (out of scope: other binding approaches)
- dlopen-ish stuff (pybind11 internals upstart, binding registration, ...)
- whatever pybind11 finds the most important (how to weigh compile-time, size, speed, memory, etc.)

Given that performance benchmarks can be a P.I.T.A. (e.g. how to handle OS + interrupts, hardware capacity / abstractions, blah blah), ideally decisions should be made about relative performance on the same machine. Ideally, we should also publish some metrics for a given config to give people a "feel" for the performance, as was done for compile time.
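For reference, a minimal sketch of the compile-time / artifact-size style of measurement already published for pybind11; the source file name and compiler flags are assumptions and would need adapting to the real generated test module (the pybind11 include path would also be needed, e.g. via `python -m pybind11 --includes`).

```python
# Sketch: time a single compilation of a generated module and report the
# resulting shared-library size.
import subprocess
import sysconfig
import time
from pathlib import Path

SOURCE = Path("bench_module.cpp")   # hypothetical generated test module
OUTPUT = Path("bench_module.so")
CMD = [
    "c++", "-O3", "-shared", "-std=c++17", "-fPIC",
    f"-I{sysconfig.get_paths()['include']}",
    str(SOURCE), "-o", str(OUTPUT),
]

if __name__ == "__main__":
    start = time.perf_counter()
    subprocess.run(CMD, check=True)
    elapsed = time.perf_counter() - start
    print(f"compile time: {elapsed:.1f} s, "
          f"artifact size: {OUTPUT.stat().st_size / 1024:.0f} KiB")
```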
Suggested Solution Artifacts
- github.com/pybind/pybind-benchmarks?
- pytest-benchmark
@wjakob @rwgk @rhaschke @YannickJadoul @bstaletic @henryiii @ax3l
Can I ask what y'all think? Is this redundant w.r.t. what we already have?