-
Notifications
You must be signed in to change notification settings - Fork 66
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Add Aggregate Function geomean
(mosaic-sql)
#684
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This looks great. Thanks!
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Please see comments, particularly about adding tests for very large products and a possible solution if they reveal a problem with the current implementation.
function geomeanExpr(preagg, node) { | ||
const as = addStat(preagg, node); | ||
const { expr, name } = countExpr(preagg, node); | ||
return pow(product(pow(as, name)), div(1, expr)); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This looks "theoretically" correct to me: given a set per-bin geometric means, exponentiate by the (bin-level) count to "undo" the bin-level nth-root, multiply the results, then take the (global-level) nth-root.
However, might this suffer from overflow? That intermediate product could get very large. (The same result as if we just used product as the sufficient statistic rather than geomean, which would also simplify the overall scheme to just a single pow
call for the nth-root.)
This might be worth testing more with large numbers. If we see issues, a more robust alternative would be to instead use the sum of log values as the sufficient statistic. Then the output aggregate expression would be exp(div(sum(as), expr))
.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, I agree.
The code is already updated as it is quite clear, that there will be numerical issues.
data:image/s3,"s3://crabby-images/deae3/deae306752ad9f3a5e054b6a6bbbfccadc5774a0" alt="duckdb_geomean"
Note: DuckDB uses log-based computation for geomean
as well (see here).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looks great. Thanks @spren9er!
This pull request introduces a new aggregate function —
geomean
— to themosaic-sql
package for computing the geometric mean.What does this pull request cover?
geomean
aggregate functionsufficient-statistics.js
)preaggregator.test.js
andaggregate.test.js
)What is missing?
The JSON schema file has not yet been updated.
It is currently unclear whether this file will be auto-generated or if a manual modification is required.