Generate a sparse tf-idf matrix from a tokenlist in a recipe step

Hello,
I am currently trying to use {textrecipes} in a project as part of our NLP pipeline together with {tidymodels}, and I have run into a problem for which I have not yet found a solution. The step `textrecipes::step_tfidf` apparently only generates a dense matrix (in the form of a tibble) rather than a sparse matrix (dgCMatrix), and the resulting object is so large that I cannot process it in memory. The Details section of the documentation for this function recommends running `step_tokenfilter` beforehand for exactly this reason. I would be very reluctant to do that, however, because I assume that in a sparse format, which I already use as the blueprint composition in the modelling workflow anyway, the resulting object would be sufficiently small. Meanwhile, tidymodels also seems to be able to cope with sparse matrices as input.
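To make the situation concrete, here is a minimal sketch of the kind of recipe I mean. The toy data frame `df` and its `text` column are made up for illustration; my real corpus is far larger. The last call uses the `composition` argument of `recipes::bake()`, and my understanding (which may be wrong) is that this only converts the final result, so the dense columns are still created first.

```r
library(recipes)
library(textrecipes)

# Hypothetical toy data standing in for the real corpus.
df <- data.frame(
  text = c("the cat sat on the mat", "the dog chased the cat", "a bird sang"),
  stringsAsFactors = FALSE
)

rec <- recipe(~ text, data = df) %>%
  step_tokenize(text) %>%  # character column -> tokenlist
  step_tfidf(text)         # tokenlist -> one numeric column per distinct token

prepped <- prep(rec)

# Default output: a tibble with one tfidf_text_* column per token.
baked <- bake(prepped, new_data = NULL)
class(baked)
dim(baked)

# Requesting a sparse composition at bake time. Assumption: this is a
# conversion of the finished result, not a sparse computation throughout.
sparse <- bake(prepped, new_data = NULL, composition = "dgCMatrix")
class(sparse)
```

With a realistic vocabulary the tibble from the first `bake()` call is what no longer fits in memory for me.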
So my question is: is there a way to convert from a tokenlist representation to a sparse tf-idf (or other dtm) representation within a recipe step, or to use another low-memory format as an intermediate step (such as the format from {tidytext})?
It would also be interesting to know whether this is currently just a technical limitation, or whether the reasoning behind it is that there is no legitimate modelling scenario that could not be handled equally well (or better) with a token filter or another word embedding.

Many thanks in advance and best regards!

Generate a sparse tf-idf matrix from a tokenlist in a recipe step #258

Metadata

Assignees

Labels

Type

Projects

Milestone

Relationships

Development

Generate a sparse tf-idf matrix from a tokenlist in a recipe step #258

Description

Metadata

Metadata

Assignees

Labels

Type

Projects

Milestone

Relationships

Development

Issue actions