- Boilerpipe - tuned extractors for several use cases, e.g. news - Java
  - available as a service: http://boilerpipe-web.appspot.com/
- Readability
  - originally in JavaScript
  - ported to several languages
    - Java - snacktory - Apache 2.0
    - Java-readability
    - PHP-readability - Apache 2.0
    - PHP-readability by FiveFilters
    - Python - https://github.com/buriy/python-readability, https://github.com/gfxmonk/python-readability
  - available as a service: https://www.readability.com/developers/api
- Goose - Scala - open source by Gravity - Apache 2.0
  - available as a service: http://juicer.herokuapp.com/
- ReadabilityBUNDLE - bundles snacktory, goose and java-readability
- Beautiful Soup - Python - MIT license
- Tag soup
- Neko HTML
- cleaneval
- Google news dataset (by Boilerpipe)
http://readwrite.com/2011/06/10/head-to-head-comparison-of-tex
- Feedparser - Universal Feed Parser; handles RSS 0.9x, RSS 1.0, RSS 2.0, CDF, Atom 0.3, and Atom 1.0 feeds
- Oryx 2 - real-time machine learning at scale - lambda architecture
- JKernelMachines - structured kernels
- DIG - http://usc-isi-i2.github.io/dig/ - https://github.com/usc-isi-i2
- WordTree - https://blogs.princeton.edu/etc/2012/08/16/see-text-in-whole-new-waytext-visualization-tools/
- Theano
- Torch7
- topicmodels in R
- LDAvis - LDA visualization in R
- XGBoost - large-scale gradient boosting with Python and R wrappers
- Sofia-ml - large-scale online algorithms, e.g. Passive-Aggressive perceptron, Perceptron with Margins
- Vowpal Wabbit - fast, scalable ML - also includes LDA and active learning (Tutorial)
- Awesome Datasets
- Datahub - Open Knowledge Foundation
- CommonCrawl
- Knoema
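
For readers unfamiliar with the online learners mentioned in the Sofia-ml entry, the passive-aggressive update can be sketched in a few lines of plain Python. This is a minimal illustration of the PA-I variant for binary labels in {-1, +1}, not Sofia-ml's implementation; the function names `pa_update`, `train`, and `predict`, the aggressiveness parameter `C`, and the toy data are all assumptions made for this example.

```python
def dot(w, x):
    """Inner product of two equal-length lists."""
    return sum(wi * xi for wi, xi in zip(w, x))


def pa_update(w, x, y, C=1.0):
    """One PA-I step: leave w unchanged if the hinge loss is zero,
    otherwise move it just far enough toward a margin of 1 (step capped by C)."""
    loss = max(0.0, 1.0 - y * dot(w, x))
    sq_norm = dot(x, x)
    if loss == 0.0 or sq_norm == 0.0:
        return w  # passive: example already classified with enough margin
    tau = min(C, loss / sq_norm)  # aggressive: step size from the loss
    return [wi + tau * y * xi for wi, xi in zip(w, x)]


def train(data, n_features, epochs=5, C=1.0):
    """Run the online update over the data for a few passes."""
    w = [0.0] * n_features
    for _ in range(epochs):
        for x, y in data:
            w = pa_update(w, x, y, C)
    return w


def predict(w, x):
    """Sign of the linear score, with ties mapped to +1."""
    return 1 if dot(w, x) >= 0 else -1


# Hypothetical, linearly separable toy data: (features, label)
toy_data = [
    ([1.0, 1.0], 1),
    ([2.0, 0.5], 1),
    ([-1.0, -1.0], -1),
    ([-0.5, -2.0], -1),
]
weights = train(toy_data, n_features=2)
```

Each example is seen once per pass and the weights are updated in place of a batch solve, which is what makes this family of algorithms suitable for the large-scale, streaming settings the listed tools target.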