Adding analysis feature: Topic modelling #40
Labels
enhancement
New feature or request
major
Requires a significant amount of work
research-needed
Requires literature review
The first step is research. LDA seems like a promising method, but it needs to be adapted for tweets (see issue #7). word2vec is interesting too, but it requires a manual step to code the themes of the most common related words (ask @LaChapeliere for more details about that). Other techniques can be explored too.
For each method, the accuracy of the implementation should be evaluated in some way, and the documentation should suggest the best preprocessing parameters. The implementation should let users split the data by time period and compare results over time. The data-splitting part should live in the preprocessing module, since it will be shared by several analysis pipelines.
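As a starting point for the data-splitting part, here is a minimal sketch of what a shared helper could look like. It uses only the standard library. The function name `split_by_period`, the `(timestamp, text)` record shape, and the sample tweets are all hypothetical and not taken from the existing code base; the real preprocessing module may represent tweets differently.

```python
from collections import defaultdict
from datetime import datetime


def split_by_period(records, period="month"):
    """Group (timestamp, text) records into buckets keyed by time period.

    Hypothetical helper: the name and the record shape are assumptions.
    """
    formats = {"year": "%Y", "month": "%Y-%m", "day": "%Y-%m-%d"}
    if period not in formats:
        raise ValueError(f"unsupported period: {period!r}")
    buckets = defaultdict(list)
    for timestamp, text in records:
        buckets[timestamp.strftime(formats[period])].append(text)
    # Keys are zero-padded, so sorting them as strings is chronological.
    return dict(sorted(buckets.items()))


# Made-up sample data, for illustration only.
tweets = [
    (datetime(2020, 3, 14, 9, 30), "flood warning downtown"),
    (datetime(2020, 3, 20, 18, 5), "roads reopened after flood"),
    (datetime(2020, 4, 2, 12, 0), "community cleanup this weekend"),
]
by_month = split_by_period(tweets, period="month")
```

Each bucket could then be passed to a topic model separately, so that the topics found in one period can be compared with those of the next. Keeping the helper independent of any particular model is what would let several analysis pipelines reuse it.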
See the old implementation of LDA and word2vec in the resiliency_challenge-legacy branch, and related issues #6 and #5.