Tables (heavytables): String lookups match on incorrect values #4
This sort of gets to some questions I had about whether we should be indexing into things during model execution, or doing a join beforehand. My understanding is that dimensions of the table are basically of two types: those that are fixed per model point (e.g. strings in the modelpoint file) and those that vary over the projection (e.g. time or age).

Mortality tables for example have several columns that could be considered strings, "Sex" and "Preferred Status" for example. I don't know that it is ideal to do searchsorted multiple times over a list of strings, and this can be avoided by doing a join ahead of time to calculate a single linear index. Maybe the API involves the modelpoints as part of the lookup creation process. So here is a proposal for how mortality tables could work:

```python
mortality_lookup = Lookup(modelpoint_df, mortality_table)
# can access values like this
mortality_lookup(t)
```

Basically, column names with no match in the modelpoints are assumed to be looked up at runtime. And this will make handling strings easier. Not saying this is the right way, just my first thoughts.
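A rough sketch of that join-ahead-of-time idea, assuming hypothetical column names and a mortality table held as a DataFrame (the `Lookup` class above is only a proposal, so this just shows the underlying mechanics with pandas/numpy directly):

```python
import numpy as np
import pandas as pd

# hypothetical modelpoint and mortality table data (illustration only)
modelpoints = pd.DataFrame({"sex": ["M", "F", "M"], "smoker": ["N", "N", "S"]})
mort = pd.DataFrame({
    "sex":    ["M", "M", "F", "F"],
    "smoker": ["N", "S", "N", "S"],
    "rate":   [0.010, 0.020, 0.008, 0.016],
})

# join once, before the projection starts, to resolve the string dimensions
# into a single integer row index per modelpoint
mort = mort.reset_index().rename(columns={"index": "mort_row"})
enriched = modelpoints.merge(mort[["sex", "smoker", "mort_row"]],
                             on=["sex", "smoker"], how="left")

rates = mort["rate"].to_numpy()
row_idx = enriched["mort_row"].to_numpy()

# during the projection, only fast integer indexing remains
q_x = rates[row_idx]
print(q_x)  # [0.01  0.008 0.02 ]
```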
In terms of UK actuarial practice (which is a mix of tables from the CMI and tables bespoke to companies), it needs to handle the following cases:

- Base mortality tables generally have a Male table, a Female table, and then the same again for smokers. All have 2 dimensions (age + duration), so the lookup would be:
- As well as the base table, there are mortality improvement factors, which improve mortality over calendar years (e.g. someone aged 65 in 2028 might have 99.4% of the mortality of someone aged 65 in 2027); the lookup would be
- Bespoke tables tend to vary by product type; typically an expense table would have keys
- We also have contractual tables, for example risk premium reinsurance tables might have

I'm not sure it would be possible to create a dataframe that could handle these. Because I've got the lookup working using numpy (and vectorised), I think it is probably faster than going near pandas. To fix this issue, I just need to write a proper string mapper; I was a bit lazy in using the Band/bounded mapper as it mostly worked. Worth noting that at the moment, none of the proprietary software I use can do a lookup as well as
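As a rough illustration (not the heavytables internals), a vectorised age + duration lookup over a 2-dimensional numpy array could look like this, assuming integer ages and durations that map directly onto array offsets:

```python
import numpy as np

# hypothetical base table: rows = age 20..119, columns = duration 0..9 (illustration only)
ages = np.arange(20, 120)
durations = np.arange(0, 10)
q_table = np.random.default_rng(0).uniform(0.001, 0.2, size=(ages.size, durations.size))

def base_mortality(age, duration):
    """Vectorised lookup: age and duration can be scalars or equal-length arrays."""
    return q_table[np.asarray(age) - ages[0], np.asarray(duration) - durations[0]]

# one call covers a whole block of model points
print(base_mortality(np.array([65, 70, 80]), np.array([0, 3, 5])))
```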
I am adding an extra step before the model starts to run that can "enrich" a modelpoint file or precompute linear indexes. I do not propose using pandas during the projection of the model. If you do string mapping during execution of the model, it is always going to be slower than precomputing integer indexes based on the strings. Basically, whether you are faster or slower depends on how you measure it. Throughout the execution of the model, I am doing 1 join. You are doing potentially thousands of table lookups which are slower than they could be.

**How do we know the table relates to the data in the proper manner?**

It seems like there is potential for typos or data quality issues. To ensure the integrity of the relationship between the table and the modelpoints file, you might have to do the join anyway and verify that everything has a match. In that case, you are already doing the join.

**In summary**

Probably time to make a new benchmark for table lookups.
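A minimal sketch of that integrity check, assuming the keys live in a pandas DataFrame and using a merge with an indicator column; the column names are made up for illustration:

```python
import pandas as pd

# hypothetical data (illustration only)
modelpoints = pd.DataFrame({"sex": ["M", "F", "X"]})          # "X" is a data-quality error
table_keys  = pd.DataFrame({"sex": ["M", "F"], "row": [0, 1]})

merged = modelpoints.merge(table_keys, on="sex", how="left", indicator=True)

# fail fast if any modelpoint has no matching table key
missing = merged[merged["_merge"] == "left_only"]
if not missing.empty:
    raise ValueError(f"{len(missing)} modelpoints have keys not present in the table:\n{missing}")
```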
Strings are definitely a pain - it could work to convert in the input stage and apply the same conversion when setting up the table - they are always going to be static inputs per data point, it is only numbers (years, ages, durations, fund values etc.) that are dynamic over the projection. R has a categorical data type that might work too (as much as R annoys me, it's quite efficient memory-wise for storing category/indicator data).

Agree, I think we've got a few benchmarks to do, at least: (i) non-vectorised lookup performance, (ii) vectorised lookup performance. Some of the onus here is also on the model developer: they need to make sure their data and assumptions are consistent.
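For reference, pandas has a similar categorical type; a small sketch of converting a string column to integer codes at the input stage (hypothetical data, and the same category ordering would need to be reused for the table axis):

```python
import pandas as pd

# hypothetical string column (illustration only)
sex = pd.Series(["M", "F", "F", "M"], dtype="category")

codes = sex.cat.codes.to_numpy()   # compact int8 codes: [1, 0, 0, 1]
categories = sex.cat.categories    # Index(['F', 'M']) - reuse this ordering for the table axis
print(codes, list(categories))
```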
I'm trying to get heavytables set up with CSO 2017; I'm not exactly sure how to do it:
I've got documentation on this dev branch - the error is because the table is probably not fully rectangular across all of its dimensions: see Example 6. https://github.com/lewisfogden/heavylight/blob/dev_heavytable_integration/examples/notebook/table_documentation.ipynb (p.s. I need to do a pull request on this branch, but I think it might conflict with your PR so I've held off.) I'll have a look at the CSO2017 table and add it into the examples.
Got it working in Example 7. I think we could put the code into the constructor for tables, e.g. with a parameter. The other option here would be to pre-prep tables, which would give users the opportunity to check they are set correctly.

Edit: added a test of 100k samples from the original table - np.allclose() returns True.
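A rough sketch of what such a pre-prep / rectify step might do, assuming a long-format DataFrame and a default fill value; this is only one possible approach, not necessarily what the constructor parameter would implement:

```python
import pandas as pd

# hypothetical long-format table with a missing key combination (illustration only)
df = pd.DataFrame({
    "sex":  ["M", "M", "F"],
    "age":  [65, 66, 65],
    "rate": [0.012, 0.014, 0.010],
})

# reindex onto the full Cartesian product of the key values so the table is rectangular
full_index = pd.MultiIndex.from_product(
    [df["sex"].unique(), df["age"].unique()], names=["sex", "age"]
)
rect = df.set_index(["sex", "age"]).reindex(full_index, fill_value=0.0).reset_index()
print(rect)  # ('F', 66) now exists with rate 0.0
```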
I've added a method - see heavylight/src/heavylight/heavytables.py, line 166 in 250ceca.
**string lookup performance**

A quick analysis shows that the string lookup is 50-80 times slower than an equivalent integer lookup in table 2. It is twice as slow to use your int key lookup as to directly index into a NumPy array, but this might be due to the get_index math. I don't think a factor of 2 is problematic, so no concern about that.

https://github.com/actuarialopensource/benchmarks-python/blob/main/str_performance.ipynb

**CSO tables**

edit: deleted some complaining. Rectify seems good. I think you want to drop the

**my current concerns**

**recommendations**

edit: deleted unrealistic wishlist of features

last edit: Returning to normal work/school stuff for some weeks. Thanks for listening to the complaints, nice design. Please test it! Looking forward to using the software in the future.
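For context, a minimal sketch of the kind of timing comparison behind those numbers (made-up data; the notebook linked above is the actual benchmark, and real ratios depend on table size and hardware):

```python
import timeit
import numpy as np

values = np.arange(1000, dtype=np.float64)
str_keys = np.array([f"key_{i:04d}" for i in range(1000)])  # sorted string axis
query_str = np.random.default_rng(1).choice(str_keys, size=10_000)
query_int = np.searchsorted(str_keys, query_str)            # precomputed int positions

# string lookup via searchsorted vs. direct integer indexing
t_str = timeit.timeit(lambda: values[np.searchsorted(str_keys, query_str)], number=100)
t_int = timeit.timeit(lambda: values[query_int], number=100)
print(f"string searchsorted lookup: {t_str:.4f}s, direct int indexing: {t_int:.4f}s")
```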
Thank you for the input and contribution - enjoy work/school! I think the str vs. int trade-off is one to document for users: you're going to get the best performance by encoding data with ints for tables, but it means your data is less obvious. (The other trick is just to split earlier, and have a table per string identifier.) Wishlists/rants are all good, just raise issues for them.
I guess we can just add something like https://numpy.org/doc/stable/reference/generated/numpy.isin.html#numpy.isin and throw an error if our values aren't a subset of the table axis? Some overhead, but probably not a big deal compared to the overhead of strings versus int keys.
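A small sketch of that check, assuming the table's string axis is available as a numpy array:

```python
import numpy as np

axis_values = np.array(["A", "B", "C"])   # the table's string axis
lookup_keys = np.array(["A", "AB", "C"])  # keys coming from the modelpoints

# validate once before (or inside) the lookup: every key must exist on the axis
valid = np.isin(lookup_keys, axis_values)
if not valid.all():
    bad = lookup_keys[~valid]
    raise KeyError(f"keys not found in table axis: {bad}")
```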
If `table` has keys 'A', 'B' and 'C', then looking up `table['AB']` returns the value for `table['B']`.

Cause: `np.searchsorted` places 'AB' between 'A' and 'B'.

Ideal behaviour: should return `np.nan` or raise an exception if the key doesn't exist. As keys and data should be aligned this shouldn't happen, but if incorrect data is passed in it will not fail.
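A small reproduction of the behaviour described above:

```python
import numpy as np

keys = np.array(["A", "B", "C"])
values = np.array([1.0, 2.0, 3.0])

pos = np.searchsorted(keys, "AB")  # 'AB' sorts between 'A' and 'B', so pos == 1
print(pos, values[pos])            # 1 2.0 -> silently returns the value for 'B'
```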