
Bias in the evaluation of SSP-MMC-FSRS #3

Open
1DWalker opened this issue Jan 25, 2025 · 0 comments
SSP-MMC-FSRS might simply be finding errors in FSRS's predictions and exploiting them. Even small errors would be exploited, so SSP-MMC-FSRS's predicted review cost is almost always likely to underestimate the true cost.

To properly evaluate SSP-MMC-FSRS, we could run an experiment with real users, but we probably have neither the resources nor the time, and such an experiment would take years just to measure the half-life.

So here's an alternative: use a different memory model as a stand-in for ground truth. For instance, we can use the GRU model, which predicts a forgetting curve.
The general problem setup would be:

  1. Sample a user.
  2. Sample a review-history prefix on which to pretrain both FSRS and GRU.
  3. Run SSP-MMC-FSRS to obtain a scheduler from the FSRS parameters.
  4. Run a simulation for a card: the review intervals are given by SSP-MMC-FSRS, the transition probabilities by GRU, and the objective is reached only when GRU's forgetting curve says it is.
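The simulation loop in steps 1-4 could be sketched roughly like this. Everything here is a placeholder assumption, not the real FSRS or GRU APIs: the scheduler rule, the forgetting-curve shape, the growth/shrink factors, and the review costs are all made up for illustration.

```python
import random

def fsrs_next_interval(stability):
    # Placeholder scheduler: interval derived from FSRS stability.
    # The real SSP-MMC-FSRS policy would go here.
    return max(1, round(stability))

def gru_recall_prob(true_halflife, interval):
    # Placeholder "GRU" forgetting curve: exponential decay
    # parameterized by the ground-truth half-life.
    return 0.5 ** (interval / true_halflife)

def simulate_card(fsrs_stability, true_halflife, target_halflife_days,
                  rng, max_reviews=1000):
    """Simulate one card: intervals come from the FSRS-based scheduler,
    review outcomes from the GRU-like model, and the run stops only when
    the GRU-side half-life reaches the target (step 4 above)."""
    reviews = 0
    cost = 0.0
    while true_halflife < target_halflife_days and reviews < max_reviews:
        interval = fsrs_next_interval(fsrs_stability)
        p_recall = gru_recall_prob(true_halflife, interval)
        if rng.random() < p_recall:
            # Successful recall: both models grow their memory estimate,
            # but by different (assumed) factors, so they can drift apart.
            fsrs_stability *= 1.5
            true_halflife *= 1.8
            cost += 1.0   # assumed cost of a successful review
        else:
            # Lapse: both estimates shrink; relearning is costlier.
            fsrs_stability = max(1.0, fsrs_stability * 0.5)
            true_halflife = max(1.0, true_halflife * 0.5)
            cost += 3.0   # assumed cost of a lapse plus relearning
        reviews += 1
    return reviews, cost
```

The key point of the sketch is the stop condition: it checks the GRU-side half-life, not FSRS's belief, so any optimism in FSRS shows up as extra simulated reviews and cost rather than as a premature success.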

Regarding the objective, there is a problem when SSP-MMC-FSRS believes the half-life has been reached but GRU doesn't. In this case I have some ideas:

  1. SSP-MMC-FSRS would just keep scheduling a review at the half-life interval.
  2. SSP-MMC-FSRS should be trained with a long horizon (e.g., 100 years) but evaluated on a much shorter one (e.g., 3 years). This should greatly reduce how often the two models disagree.
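Idea 1 could be wired into the simulation's interval selection roughly as follows. The function and argument names are hypothetical, invented for this sketch:

```python
def next_interval(fsrs_thinks_done, fsrs_interval, target_halflife_days):
    """Idea 1: once FSRS believes the objective half-life is reached,
    keep scheduling reviews at that half-life interval until the
    GRU-side stop condition is actually satisfied."""
    if fsrs_thinks_done:
        return target_halflife_days
    return fsrs_interval
```

This keeps the simulation well-defined even when the two models disagree: the card keeps being reviewed (and keeps accruing cost) until the ground-truth model agrees the objective is met.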

In addition, I believe that GRU was trained on data where the average retention is higher than 50%. If we want better results from evaluating with GRU, the objective should therefore be defined at something like 80% retention rather than 50% (a true half-life), so that the simulation stays closer to the retention regime GRU was trained on.
