Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

chore(optimization): Tiny optimization for pointsInternal #60138

Merged

Conversation

lbartoletti
Copy link
Member

Description

This PR optimizes the pointsInternal method by caching trigonometric calculations used in ellipse point generation. The implementation precomputes sine and cosine values before generating ellipse points, reducing the number of expensive trigonometric function calls in the main loop.
Performance testing shows significant improvements for large segment counts (>50), while introducing negligible overhead for the default segment count (~32). This trade-off is acceptable as it optimizes the most computationally intensive cases where high-resolution ellipses are required.

benchmark:

236: Segments: 4
236: toLineString Time: 2 µs
236: toLineString2 Time: 3 µs
236: ------------------------------------
236: Segments: 8
236: toLineString Time: 2 µs
236: toLineString2 Time: 2 µs
236: ------------------------------------
236: Segments: 16
236: toLineString Time: 2 µs
236: toLineString2 Time: 2 µs
236: ------------------------------------
236: Segments: 32
236: toLineString Time: 4 µs
236: toLineString2 Time: 6 µs
236: ------------------------------------
236: Segments: 64
236: toLineString Time: 6 µs
236: toLineString2 Time: 5 µs
236: ------------------------------------
236: Segments: 128
236: toLineString Time: 12 µs
236: toLineString2 Time: 9 µs
236: ------------------------------------
236: Segments: 256
236: toLineString Time: 19 µs
236: toLineString2 Time: 17 µs
236: ------------------------------------
236: Segments: 512
236: toLineString Time: 37 µs
236: toLineString2 Time: 33 µs
236: ------------------------------------
236: Segments: 1024
236: toLineString Time: 73 µs
236: toLineString2 Time: 72 µs
236: ------------------------------------
236: Segments: 2048
236: toLineString Time: 153 µs
236: toLineString2 Time: 132 µs
236: ------------------------------------
236: Segments: 4096
236: toLineString Time: 303 µs
236: toLineString2 Time: 268 µs
236: ------------------------------------
236: Segments: 8192
236: toLineString Time: 607 µs
236: toLineString2 Time: 552 µs
236: ------------------------------------
236: Segments: 16384
236: toLineString Time: 1200 µs
236: toLineString2 Time: 1123 µs
236: ------------------------------------
236: Segments: 32768
236: toLineString Time: 2510 µs
236: toLineString2 Time: 2177 µs
236: ------------------------------------
236: Segments: 65536
236: toLineString Time: 4882 µs
236: toLineString2 Time: 4486 µs
236: ------------------------------------
236: Segments: 131072
236: toLineString Time: 10016 µs
236: toLineString2 Time: 9122 µs
236: ------------------------------------
236: Segments: 262144
236: toLineString Time: 19958 µs
236: toLineString2 Time: 18301 µs
236: ------------------------------------
236: Segments: 524288
236: toLineString Time: 39929 µs
236: toLineString2 Time: 36671 µs
236: ------------------------------------

@github-actions github-actions bot added this to the 3.42.0 milestone Jan 14, 2025
@lbartoletti lbartoletti self-assigned this Jan 14, 2025
@lbartoletti lbartoletti added API API improvement only, no visible user interface changes backport release-3_34 backport release-3_40 labels Jan 14, 2025
Copy link

github-actions bot commented Jan 14, 2025

🪟 Windows builds

Download Windows builds of this PR for testing.
Debug symbols for this build are available here.
(Built from commit 3c5f439)

🪟 Windows Qt6 builds

Download Windows Qt6 builds of this PR for testing.
(Built from commit 3c5f439)

@uclaros
Copy link
Contributor

uclaros commented Jan 14, 2025

Are those results on a release build?
I'd expect the compiler/cpu to render this kind of optimization unnecessary.

@rouault
Copy link
Contributor

rouault commented Jan 14, 2025

I'd expect the compiler/cpu to render this kind of optimization unnecessary.

I'm not sure the compiler (before C++26 which will declare them constexpr: https://en.cppreference.com/w/cpp/numeric/math/cos) can infer that the output of std::cos() doesn't change for a given output.
But instead of creating an array, which takes time, I'd suggest to just to just create temporary c = std::cos(t[i]) and s= std::sin(t[i]) variables in the loop.

reducing the number of expensive trigonometric function calls in the main loop.

hum, I would expect that your optimization would just divide the time by 2, not by 10. Did you look at the dissembled code to understand why you get such perf gain ? Which compiler do you use ? Maybe OpenMP is used and the array computation is multithreaded ? In any case adding a comment in the code explaining why the optimization works and under which circumstances would make later maintenance easier.

@lbartoletti
Copy link
Member Author

Are those results on a release build?

@uclaros I'm live only in a Release mode 😉

I'd expect the compiler/cpu to render this kind of optimization unnecessary.

I'm not sure the compiler (before C++26 which will declare them constexpr: https://en.cppreference.com/w/cpp/numeric/
math/cos) can infer that the output of std::cos() doesn't change for a given output. But instead of creating an array, which takes time, I'd suggest to just to just create temporary c = std::cos(t[i]) and s= std::sin(t[i]) variables in the loop.

reducing the number of expensive trigonometric function calls in the main loop.

hum, I would expect that your optimization would just divide the time by 2, not by 10. Did you look at the dissembled code to understand why you get such perf gain ? Which compiler do you use ? Maybe OpenMP is used and the array computation is multithreaded ? In any case adding a comment in the code explaining why the optimization works and under which circumstances would make later maintenance easier.

@rouault I did a more extensive test, which you can find here. It's not real QGIS anymore, but I think it's still representative.

@ptitjano and I have been talking about testing OpenMP on QGIS for several years now, and I've been looking a lot at what's being done on GRASS. We're thinking of proposing a QEP within the year, but first we need to do some testing (and learn OpenMP properly).

In the meantime, for this test, you can see that depending on the number of threads used, OpenMP is interesting around 256 vertices. In any case, it's better to use method 2 with the creation of cos/sin in the loop, as you suggest, and not to pre-calculate it as I did.

So we have choice between:

OpenMP ready

for (int i = 0; i < static_cast<int>(segments); ++i) {
            const double cosT{ std::cos(t[i]) };
            const double sinT{ std::sin(t[i]) };
            x[i] = centerX + mSemiMajorAxis * cosT * cosAzimuth -
                mSemiMinorAxis * sinT * sinAzimuth;
            y[i] = centerY + mSemiMajorAxis * cosT * sinAzimuth +
                mSemiMinorAxis * sinT * cosAzimuth;
            if (hasZ) z[i] = centerZ;
            if (hasM) m[i] = centerM;
        }

Actual with cos/sin calculated only one time.

 for (double it : t) {
            const double cosT{ std::cos(it) };
            const double sinT{ std::sin(it) };
            *xOut++ = centerX + mSemiMajorAxis * cosT * cosAzimuth -
                mSemiMinorAxis * sinT * sinAzimuth;
            *yOut++ = centerY + mSemiMajorAxis * cosT * sinAzimuth +
                mSemiMinorAxis * sinT * cosAzimuth;
            if (zOut) *zOut++ = centerZ;
            if (mOut) *mOut++ = centerM;
        }

@rouault
Copy link
Contributor

rouault commented Jan 21, 2025

@rouault I did a more extensive test, which you can find here. It's not real QGIS anymore, but I think it's still representative.

@lbartoletti I'm confused if you put this standalone benchmark just to answer my question... So please bear with my "too long didn't read" attitude here. Basically I was just curious about the reason why this PR gives such perf boost. In short, is OpenMP implicitly used by your optimizer... ?

@lbartoletti
Copy link
Member Author

@rouault I did a more extensive test, which you can find here. It's not real QGIS anymore, but I think it's still representative.

@lbartoletti I'm confused if you put this standalone benchmark just to answer my question... So please bear with my "too long didn't read" attitude here.

No worries, it was just an interesting test for us!

Basically I was just curious about the reason why this PR gives such perf boost. In short, is OpenMP implicitly used by your optimizer... ?

No, I don't think.

Copy link

The QGIS project highly values your contribution and would love to see this work merged! Unfortunately this PR has not had any activity in the last 14 days and is being automatically marked as "stale". If you think this pull request should be merged, please check

  • that all unit tests are passing

  • that all comments by reviewers have been addressed

  • that there is enough information for reviewers, in particular

    • link to any issues which this pull request fixes

    • add a description of workflows which this pull request fixes

    • add screenshots if applicable

  • that you have written unit tests where possible
    In case you should have any uncertainty, please leave a comment and we will be happy to help you proceed with this pull request.
    If there is no further activity on this pull request, it will be closed in a week.

@github-actions github-actions bot added the stale Uh oh! Seems this work is abandoned, and the PR is about to close. label Feb 13, 2025
@nyalldawson
Copy link
Collaborator

@lbartoletti

But instead of creating an array, which takes time, I'd suggest to just to just create temporary c = std::cos(t[i]) and s= std::sin(t[i]) variables in the loop.

Can you do this so that this change can be merged?

@github-actions github-actions bot removed the stale Uh oh! Seems this work is abandoned, and the PR is about to close. label Feb 14, 2025
@lbartoletti lbartoletti force-pushed the ellipse_pointsInternal_tinyoptim branch from 712b820 to 08ce466 Compare February 17, 2025 10:14
@lbartoletti
Copy link
Member Author

@lbartoletti

But instead of creating an array, which takes time, I'd suggest to just to just create temporary c = std::cos(t[i]) and s= std::sin(t[i]) variables in the loop.

Can you do this so that this change can be merged?

done. Thanks

@lbartoletti lbartoletti force-pushed the ellipse_pointsInternal_tinyoptim branch from 08ce466 to bfa37d5 Compare February 17, 2025 12:57
@lbartoletti lbartoletti force-pushed the ellipse_pointsInternal_tinyoptim branch from bfa37d5 to 3c5f439 Compare February 19, 2025 09:28
@ptitjano ptitjano self-requested a review February 19, 2025 09:48
@lbartoletti lbartoletti merged commit f08eda1 into qgis:master Feb 19, 2025
31 checks passed
@lbartoletti lbartoletti deleted the ellipse_pointsInternal_tinyoptim branch February 19, 2025 13:08
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
API API improvement only, no visible user interface changes backport release-3_40
Projects
None yet
Development

Successfully merging this pull request may close these issues.

6 participants