Add Aggregate Function geomean (mosaic-sql) #684

Merged on Feb 11, 2025 (3 commits)

@spren9er spren9er commented Feb 10, 2025

This pull request adds a new aggregate function, geomean, to the mosaic-sql package for computing the geometric mean.

What does this pull request cover?

  • Add geomean aggregate function
  • Add preaggregation handling (in sufficient-statistics.js)
  • Add tests (in preaggregator.test.js and aggregate.test.js)
  • Update documentation

What is missing?

The JSON schema file has not yet been updated.
It is currently unclear whether this file will be auto-generated or if a manual modification is required.

@jheer jheer left a comment

This looks great. Thanks!

@jheer jheer left a comment

Please see comments, particularly about adding tests for very large products and a possible solution if they reveal a problem with the current implementation.

function geomeanExpr(preagg, node) {
  const as = addStat(preagg, node);
  const { expr, name } = countExpr(preagg, node);
  return pow(product(pow(as, name)), div(1, expr));
}

This looks "theoretically" correct to me: given a set of per-bin geometric means, exponentiate each by its (bin-level) count to "undo" the bin-level nth-root, multiply the results, then take the (global-level) nth-root.
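
To make the combination step concrete, here is a minimal sketch in plain JavaScript. It is not the mosaic-sql implementation: the helper names `geomean` and `mergeGeomeans` are hypothetical, and plain arrays stand in for pre-aggregated bins.

```javascript
// Hypothetical helper: direct geometric mean of an array of positive numbers.
function geomean(values) {
  const product = values.reduce((p, v) => p * v, 1);
  return Math.pow(product, 1 / values.length);
}

// Each bin stores its own geometric mean (gm) and its row count (n).
const bins = [[2, 8], [4, 4, 4], [1, 16]];
const stats = bins.map((b) => ({ gm: geomean(b), n: b.length }));

// Undo each bin-level nth-root, multiply, then take the global nth-root.
function mergeGeomeans(binStats) {
  const total = binStats.reduce((s, { n }) => s + n, 0);
  const product = binStats.reduce((p, { gm, n }) => p * Math.pow(gm, n), 1);
  return Math.pow(product, 1 / total);
}

const merged = mergeGeomeans(stats);
const direct = geomean(bins.flat());
console.log(merged, direct); // the two agree up to floating-point rounding
```

The merged value matches the geometric mean computed over all rows at once, which is what makes the per-bin geometric mean plus count usable as a sufficient statistic, at least while the intermediate product stays representable.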

However, might this suffer from overflow? The intermediate product could get very large. (The result is the same as if we used product as the sufficient statistic rather than geomean, which would also simplify the overall scheme to a single pow call for the nth-root.)

This might be worth testing more with large numbers. If we see issues, a more robust alternative would be to instead use the sum of log values as the sufficient statistic. Then the output aggregate expression would be exp(div(sum(as), expr)).
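
A rough sketch of the difference, again in plain JavaScript rather than mosaic-sql expressions. The function names are hypothetical, and the log-based variant assumes all inputs are strictly positive.

```javascript
// Product-based geometric mean: the running product overflows to Infinity
// once it exceeds the largest finite double (about 1.8e308).
function geomeanViaProduct(values) {
  const product = values.reduce((p, v) => p * v, 1);
  return Math.pow(product, 1 / values.length);
}

// Log-based sufficient statistics: each bin keeps sum(ln(x)) and a count.
function binStats(values) {
  return {
    logSum: values.reduce((s, v) => s + Math.log(v), 0),
    n: values.length
  };
}

// Merge bins by summing the log sums and counts, then exp(logSum / n).
function mergeLogStats(stats) {
  const logSum = stats.reduce((s, b) => s + b.logSum, 0);
  const n = stats.reduce((s, b) => s + b.n, 0);
  return Math.exp(logSum / n);
}

const big = Array(400).fill(1e10); // true product is 1e4000
const bins = [big.slice(0, 150), big.slice(150)];

const viaProduct = geomeanViaProduct(big);          // overflows
const viaLogs = mergeLogStats(bins.map(binStats));  // stays finite
console.log(viaProduct, viaLogs);
```

The sum of logs grows linearly with the number of rows instead of multiplicatively, so it stays well within floating-point range for inputs where the raw product does not.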

@spren9er spren9er Feb 11, 2025

Yes, I agree.

I have already updated the code, since it is quite clear that there will be numerical issues.

[Image: duckdb_geomean]

Note: DuckDB uses log-based computation for geomean as well (see here).

@jheer jheer left a comment

Looks great. Thanks @spren9er!

@jheer jheer merged commit e73fe92 into uwdata:main Feb 11, 2025
3 checks passed