
Disable validation Data split Random Forest Classification #175

Open
wmotte opened this issue Jul 25, 2022 · 2 comments

@wmotte

wmotte commented Jul 25, 2022

The Random Forest Classification requires at least 5% of the data to be kept apart for 'validation'. This is in addition to the minimum of 5% test data. However, it looks to me as if these validation samples are never used.

Would it be possible to disable this additional split, just like in the Decision Tree Classification, which only uses a 'Holdout Test Data' subset of 5-95%? That would keep more data available for either training or testing.

Thanks!

@koenderks
Collaborator

koenderks commented Oct 26, 2022

In all analyses, the validation data set is used to assess model performance at each iteration of the optimization loop, which in the case of the random forest runs over the number of trees up to the specified maximum. That is what the validation samples are used for.
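To make the idea concrete, here is a rough sketch of that loop in Python/scikit-learn. This is not JASP's actual code; the dataset, the split proportions, and the grid of forest sizes are all just illustrative assumptions.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

X, y = load_breast_cancer(return_X_y=True)

# 90 / 5 / 5 split: training, validation, and test data (the proportions discussed above).
X_train, X_tmp, y_train, y_tmp = train_test_split(X, y, test_size=0.10, random_state=1)
X_val, X_test, y_val, y_test = train_test_split(X_tmp, y_tmp, test_size=0.50, random_state=1)

# Optimization loop: grow the forest in steps and score each size on the
# validation set; these validation samples are what the loop "spends".
best_n, best_acc = None, -1.0
forest = RandomForestClassifier(warm_start=True, random_state=1)
for n_trees in range(10, 210, 10):
    forest.n_estimators = n_trees
    forest.fit(X_train, y_train)
    acc = accuracy_score(y_val, forest.predict(X_val))
    if acc > best_acc:
        best_n, best_acc = n_trees, acc

# Refit at the selected size and evaluate once on the held-out test set.
final = RandomForestClassifier(n_estimators=best_n, random_state=1).fit(X_train, y_train)
print(f"selected trees: {best_n}, test accuracy: {accuracy_score(y_test, final.predict(X_test)):.3f}")
```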

A train-test-only split can be achieved by manually setting the maximum number of trees in the forest, which removes the need for the validation samples.
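For comparison, a minimal sketch of that train-test-only case (again illustrative Python/scikit-learn rather than JASP's code; the 200 trees and the 5% test fraction are arbitrary choices):

```python
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

X, y = load_breast_cancer(return_X_y=True)

# 95 / 5 split: training data and a holdout test set only.
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.05, random_state=1)

# The forest size is fixed up front, so no validation set is needed to pick it.
forest = RandomForestClassifier(n_estimators=200, random_state=1).fit(X_train, y_train)
print(f"test accuracy: {accuracy_score(y_test, forest.predict(X_test)):.3f}")
```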

Given the above, I'm not entirely sure about the specifics of your request. Is this what you are looking for, or did I misunderstand?

@wmotte
Author

wmotte commented Oct 26, 2022

Thanks. I understand the optimisation loop, but in the random forest modelling there is this additional 5% kept apart, so you are required to 'spend' at least 10% of the data on test/validation, whereas in the Decision Tree Classification 5% is sufficient. There is no way to disable this in JASP's random forest settings, and only one assessment of model performance seems to be required during the training phase. In other words, how can the 95/5% split in the Decision Tree Classification and the 90/5/5% split in the random forest both be explained if in both cases 95/5% is sufficient (and is what is presented in the ROC curves)?
