Hello,
I am currently trying to use {textrecipes} as part of an NLP pipeline built on {tidymodels}, and I have run into a problem for which I have not yet found a solution. The step textrecipes::step_tfidf apparently only produces a dense matrix (in the form of a tibble) rather than a sparse matrix (dgCMatrix), and the resulting object is so large that I cannot process it in memory. The Details section of the function's documentation also recommends running step_tokenfilter beforehand for this reason. I would be very reluctant to do that, however, because I expect that in a sparse format, which I already use as the blueprint in the modelling workflow anyway, the resulting object would be small enough. Meanwhile, tidymodels also seems to be able to handle sparse matrices as input.
So my question is: is there a way to convert from a tokenlist representation to a sparse tf-idf (or other document-term matrix) representation within a recipe step, or to use another low-memory format as an intermediate step (such as the tidy long format from {tidytext})?
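For illustration, the kind of low-memory intermediate meant here can be built outside of a recipe with {tidytext}. The following is only a minimal sketch on a made-up toy corpus (`docs`, `doc_id`, and `text` are placeholder names, not part of the real project), and it does not show anything textrecipes does today:

```r
library(dplyr)
library(tidytext)

# Toy corpus; the object and column names are assumptions for this sketch.
docs <- tibble(
  doc_id = c("d1", "d2", "d3"),
  text   = c("sparse matrices save memory",
             "dense matrices use memory",
             "tokens become sparse columns")
)

# Long (tidy) format: one row per document-term pair, with tf-idf weights.
tfidf_long <- docs %>%
  unnest_tokens(word, text) %>%    # one row per token
  count(doc_id, word) %>%          # term counts per document
  bind_tf_idf(word, doc_id, n)     # adds tf, idf and tf_idf columns

# Cast the long format to a sparse document-term matrix (Matrix package class).
dtm <- cast_sparse(tfidf_long, doc_id, word, tf_idf)
class(dtm)
dim(dtm)
```

Only the non-zero document-term pairs are stored at every stage here, which is the property I would like to keep inside the recipe.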
It would also be interesting to know whether this is currently only a technical restriction, or whether the underlying view is that there is no legitimate modelling scenario that could not be handled (better) with a token filter or a word embedding.
Many thanks in advance and best regards!