
Tables (heavytables): String lookups match on incorrect values #4

Open
lewisfogden opened this issue Mar 23, 2024 · 12 comments

@lewisfogden
Owner

If a table has keys 'A', 'B' and 'C', then looking up table['AB'] returns the value for table['B'].

Cause: np.searchsorted places 'AB' between 'A' and 'B'.

Ideal behaviour: should return np.nan or raise an exception if the key doesn't exist.

As keys and data should be aligned this shouldn't normally happen, but if incorrect data is passed in, the lookup will not fail.
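A minimal reproduction of the behaviour, assuming a plain sorted-key searchsorted lookup (not the heavytables internals themselves):

import numpy as np

# searchsorted returns an insertion point, not an exact match, so a missing
# key silently maps to a neighbouring value.
keys = np.array(['A', 'B', 'C'])
values = np.array([1.0, 2.0, 3.0])

idx = np.searchsorted(keys, 'AB')  # 'AB' sorts between 'A' and 'B', so idx == 1
print(values[idx])                 # 2.0, i.e. the value for 'B'

# Ideal behaviour: detect the miss and fail loudly.
if idx == len(keys) or keys[idx] != 'AB':
    raise KeyError("'AB' is not a key of this table")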

@lewisfogden lewisfogden added the bug Something isn't working label Mar 23, 2024
@lewisfogden lewisfogden self-assigned this Mar 23, 2024
@MatthewCaseres
Collaborator

This sort of gets to some questions I had about whether we should be indexing into things during model execution, or doing a join beforehand. My understanding is that the dimensions of a table are basically of two types:

  • Varying with time
  • Not varying with time

Mortality tables, for example, have several columns that could be considered strings, such as "Sex" and "Preferred Status". I don't think it is ideal to run searchsorted multiple times over a list of strings, and this can be avoided by doing a join ahead of time to calculate a single linear index.

Maybe the API involves the modelpoints as part of the lookup creation process. So here is a proposal for how mortality tables could work.

mortality_lookup = Lookup(modelpoint_df, mortality_table)
# can access values like this
mortality_lookup(t)

Basically, column names with no match in the modelpoint file are assumed to be looked up at runtime, which will make handling strings easier.
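For illustration, a rough sketch of that idea with made-up names (assuming t is 0-based and contiguous, and that every non-time key column also appears in the modelpoint file), not a worked-out design:

import numpy as np
import pandas as pd

class Lookup:
    """Join string keys once up front; only t is resolved during the projection."""
    def __init__(self, modelpoint_df, table_df, value_col="q", time_col="t"):
        key_cols = [c for c in table_df.columns if c not in (value_col, time_col)]
        # Wide table: one row per key combination, one column per value of t.
        wide = table_df.pivot_table(index=key_cols, columns=time_col, values=value_col)
        self._values = wide.to_numpy()  # shape: (n_key_combinations, n_t)
        # Precomputed linear index: which wide row each modelpoint uses.
        rows = wide.index if isinstance(wide.index, pd.MultiIndex) else [(k,) for k in wide.index]
        row_of = {key: i for i, key in enumerate(rows)}
        mp_keys = modelpoint_df[key_cols].itertuples(index=False, name=None)
        self._rows = np.array([row_of[key] for key in mp_keys])

    def __call__(self, t):
        # Vectorised over all modelpoints for a single time step.
        return self._values[self._rows, t]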

Not saying this is the right way, just my first thoughts

@lewisfogden
Owner Author

In terms of UK actuarial practice (which uses a mix of tables from the CMI and tables bespoke to companies), it needs to handle the following cases:

Base mortality tables generally have a Male table, a Female table, and then the same again for smokers. All have two dimensions (age + duration), so the lookup would be mort_male_ns[age, dur].

As well as the base table, there are mortality improvement factors, which improve mortality over calendar years (e.g. someone aged 65 in 2028 might have 99.4% of the mortality of someone aged 65 in 2027); the lookup would be mi_male_ns[age, year].

Bespoke tables tend to vary by product type; typically an expense table would have keys product, year, expense_type.

We also have contractual tables; for example, risk premium reinsurance tables might have contract_start_date, product, sex, sum_assured_band. (These all vary by reinsurer; each does their own.)

I'm not sure it would be possible to create a single dataframe that could handle all of these. Because I've got the lookup working using numpy (and vectorised), I think it is probably faster than going near pandas.

To fix this issue, I just need to write a proper string mapper; I was a bit lazy in using the Band/bounded mapper as it mostly worked.
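For what it's worth, a minimal sketch of what an exact-match string mapper could look like (an assumption about the design, not the current heavytables code): it still uses searchsorted, but rejects anything that isn't an exact key.

import numpy as np

class StringMapper:
    def __init__(self, keys):
        self._keys = np.sort(np.asarray(keys))

    def get(self, values):
        values = np.asarray(values)
        idx = np.searchsorted(self._keys, values)
        idx = np.clip(idx, 0, len(self._keys) - 1)
        # Exact-match check: any value not in the key set fails loudly.
        missing = self._keys[idx] != values
        if np.any(missing):
            raise KeyError(f"Unknown key(s): {values[missing]}")
        return idx

mapper = StringMapper(['A', 'B', 'C'])
mapper.get(['B', 'C'])   # array([1, 2])
mapper.get(['AB'])       # raises KeyError instead of matching 'B'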

Worth noting that at the moment, none of the proprietary software I use can do a lookup as well as Table: one platform requires users to turn all string fields into integers (e.g. 0 = Male, 1 = Female), and another turns the keys into a single lookup string, e.g. age=23 & duration=5 would become the key '23_5'. This doesn't work properly for bands, so you need to write extra functions to find the bands.

@MatthewCaseres
Collaborator

> I think it is probably faster than going near pandas.

I am adding an extra step before the model starts to run that can "enrich" a modelpoint file or precompute linear indexes. I do not propose using Pandas during the projection of the model.

If you do string mapping during execution of the model, it is always going to be slower than precomputing integer indexes based on the strings.

Basically, whether you are faster or slower depends on how you measure it. Throughout the execution of the model, I am doing one join. You are doing potentially thousands of table lookups that are slower than they could be.

How do we know the table relates to the data in the proper manner?

It seems like there is potential for typos or data quality issues. To ensure the integrity of the relationship between the table and the modelpoint file, you might have to do the join anyway and verify that everything has a match. In that case, you are already doing the join.

In summary

Probably time to make a new benchmark for table lookups

@lewisfogden
Owner Author

Strings are definitely a pain. It could work to convert them at the input stage and apply the same conversion when setting up the table; they are always going to be static inputs per data point, and it is only numbers (years, ages, durations, fund values etc.) that are dynamic over the projection.

R has a categorical data type that might work too (as much as R annoys me, it's quite efficient memory-wise for storing category/indicator data).
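pandas has an equivalent Categorical dtype; a minimal sketch of the "convert at the input stage" idea using it (illustrative only, not current heavylight behaviour):

import pandas as pd

# Fix the category ordering once and apply it to both the data and the table
# keys, so both sides agree on the integer codes.
categories = ["Female", "Male"]

modelpoints = pd.DataFrame({"sex": ["Male", "Female", "Male"]})
table_keys = pd.DataFrame({"sex": ["Female", "Male"]})

modelpoints["sex_code"] = pd.Categorical(modelpoints["sex"], categories=categories).codes
table_keys["sex_code"] = pd.Categorical(table_keys["sex"], categories=categories).codes

# Anything outside the agreed categories gets code -1, which can be rejected
# before the projection starts rather than failing silently mid-run.
assert (modelpoints["sex_code"] >= 0).all()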

Agreed, I think we've got a few benchmarks to do, at least: (i) non-vectorised lookup performance, (ii) vectorised lookup performance.

Some of the onus here is also on the model developer; they need to make sure their data and assumptions are consistent.

@MatthewCaseres
Collaborator

I'm trying to get heavytables set up with CSO 2017, but I'm not exactly sure how to do it:
https://github.com/actuarialopensource/benchmarks-python/blob/main/cso_benchmark_heavytable.ipynb

@lewisfogden
Owner Author

I've got documentation on this dev branch. The error is because the table is probably not fully rectangular across all its dimensions: see Example 6. https://github.com/lewisfogden/heavylight/blob/dev_heavytable_integration/examples/notebook/table_documentation.ipynb

(P.S. I need to do a pull request for this branch, but I think it might conflict with your PR so I've held off.)

I'll have a look at the CSO2017 table and add it into the examples.

@lewisfogden
Owner Author

lewisfogden commented Mar 24, 2024

import pandas as pd

csv7_src = 'example_tables/2017_loaded_CSO_mortality_rates_heavytable.csv'
df7 = pd.read_csv(csv7_src)
df7 = df7.drop(columns=['IssueAge|int'])

# number of rows a fully rectangular table would need:
len(df7['Age|int'].unique()) * len(df7['Duration|int'].unique()) * len(df7['Underwriting|str'].unique()) * len(df7['Sex|str'].unique())

It's because the age/duration is triangular, i.e. age 18 only has 10 durations.

Age|int
120    1030
119    1020
118    1010
117    1000
116     990
       ... 
22       50
21       40
20       30
19       20
18       10
Name: count, Length: 103, dtype: int64
Duration|int
1      1030
2      1020
3      1010
4      1000
5       990
       ... 
99       50
100      40
101      30
102      20
103      10
Name: count, Length: 103, dtype: int64
Underwriting|str
NS_P     10712
NS_R     10712
NS_SP    10712
S_P      10712
S_R      10712
Name: count, dtype: int64
Sex|str
Female    26780
Male      26780
Name: count, dtype: int64
vals
1.00000    1030
0.94856     510
0.94780     510
0.89977     505
0.89833     505
           ... 
0.02581       1
0.02956       1
0.01209       1
0.02202       1
0.17419       1
Name: count, Length: 4579, dtype: int64

The table is about half the size it would need to be, so the options are either to split it into sub-tables (depending on the gaps), or to fill the gaps with np.nan. (I'll try both.)

@lewisfogden
Owner Author

lewisfogden commented Mar 24, 2024

Got it working in Example 7:

https://github.com/lewisfogden/heavylight/blob/dev_heavytable_integration/examples/notebook/table_documentation.ipynb

I think we could put the code into the table constructor, e.g. with a parameter make_rectangular (True/False) to make it optional. (Note that the other bit that would need to be fixed is making sure integer keys don't have gaps - generally not a problem though.)

The other option here would be to pre-prep tables, which would give users the opportunity to check they are set up correctly.

Edit: added a test of 100k samples from the original table - np.allclose() returns True.

@lewisfogden
Owner Author

I've added a method called rectify(df), which makes dataframes rectangular; updated in table documentation tab 7b. (And sorry, rectify is a terrible pun.)

def rectify(df: pd.DataFrame) -> pd.DataFrame:
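For illustration, a minimal sketch of what such a function might do (my assumption, not the actual heavylight implementation): reindex onto the full cartesian product of the key columns, leaving the missing combinations as NaN.

import pandas as pd

def rectify(df: pd.DataFrame, value_col: str = "vals") -> pd.DataFrame:
    """Make df rectangular: every combination of key values gets a row (sketch only)."""
    key_cols = [c for c in df.columns if c != value_col]
    full_index = pd.MultiIndex.from_product(
        [df[c].unique() for c in key_cols], names=key_cols
    )
    # Missing key combinations are filled with NaN in the value column.
    return df.set_index(key_cols).reindex(full_index).reset_index()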

@MatthewCaseres
Collaborator

MatthewCaseres commented Mar 24, 2024

string lookup performance

A quick analysis shows that the string lookup is 50-80 times slower than an equivalent integer lookup in table 2. It is twice as slow to use your int key lookup as to index directly into a NumPy array, but this might be due to the get_index math. I don't think a factor of 2 is problematic; I'm not concerned about that.

https://github.com/actuarialopensource/benchmarks-python/blob/main/str_performance.ipynb
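For reference, an illustrative micro-benchmark of the kind of comparison being made (made-up data, not the linked notebook):

import timeit
import numpy as np

# 1000 sorted string keys vs. direct integer positions into the same values.
str_keys = np.sort(np.array([f"KEY_{i:04d}" for i in range(1000)]))
values = np.random.rand(1000)
query_ints = np.random.randint(0, 1000, size=10_000)
query_strs = str_keys[query_ints]

t_str = timeit.timeit(lambda: values[np.searchsorted(str_keys, query_strs)], number=100)
t_int = timeit.timeit(lambda: values[query_ints], number=100)
print(f"string lookup / int lookup: {t_str / t_int:.1f}x")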

CSO tables

edit: deleted some complaining. Rectify seems good.

I think you want to drop the Age and not the IssueAge. Currently, we have age 18 at durations > 100, which implies a negative issue age. If you keep IssueAge, then rectify will extend late issue ages out into high durations (instead of early ages into high durations), which is more reasonable.

my current concerns

  • This current issue: a silent failure
    • ban string columns due to this issue and the performance problems?
  • Scenarios where data integrity is low: what if the user makes a table with gaps or duplicates? Can you provide protections against this?
  • edit: I said I thought nested bands wouldn't work, but they seem to work.

recommendations

  1. Before merging, I recommend rebasing off main so you have the testing structure, and testing the heck out of the code.
  2. Maybe enforce that a |str column must also have an |int column of the same name, and use that for the mapping?
  3. Enforce any preconditions for the software working correctly within the constructor of the table object. I see a lot of this is already done, which is great. Throwing the error, and also testing that the error is thrown, is maybe another step. (A sketch of this kind of check is below.)
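A minimal sketch of the kind of constructor precondition check suggested in (3) (hypothetical, not the current heavytables code):

import numpy as np
import pandas as pd

def validate_table(df: pd.DataFrame, value_col: str = "vals") -> None:
    key_cols = [c for c in df.columns if c != value_col]
    # Duplicates: the same key combination appearing more than once.
    if df.duplicated(subset=key_cols).any():
        raise ValueError("table contains duplicate key combinations")
    # Gaps: fewer rows than the full cartesian product of the key values.
    expected = int(np.prod([df[c].nunique() for c in key_cols]))
    if len(df) != expected:
        raise ValueError(f"table is not rectangular: {len(df)} rows, expected {expected}")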

edit: deleted unrealistic wishlist of features

last edit: Returning to normal work/school stuff for some weeks. Thanks for listening to complaints, nice design. Please test it! Looking forward to using the software in the future.

@lewisfogden
Owner Author

Thank you for the input and contribution - enjoy work/school!

I think the str vs. int trade-off is one to document for users: you're going to get the best performance by encoding table data with ints, but it means your data is less obvious. (The other trick is just to split earlier and have a table per string identifier.)

Wishlists/rants are all good, just open issues for them.

@MatthewCaseres
Collaborator

I guess we can just add something like https://numpy.org/doc/stable/reference/generated/numpy.isin.html#numpy.isin and throw an error if our values aren't a subset of the table axis? Some overhead, but probably not a big deal compared to the string overhead relative to an int key.
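Something along these lines, presumably (illustrative values only):

import numpy as np

table_keys = np.array(['A', 'B', 'C'])
lookup_values = np.array(['A', 'AB', 'C'])

# Validate once that every lookup value exists on the table axis.
missing = lookup_values[~np.isin(lookup_values, table_keys)]
if missing.size:
    raise KeyError(f"values not found in table keys: {missing}")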

lewisfogden added a commit that referenced this issue Apr 28, 2024