This repository was archived by the owner on Dec 19, 2018. It is now read-only.

Commit 5604ff8

Merge original repo's PR: Kyubyong#6
1 parent 450c459 commit 5604ff8

README.md

Lines changed: 8 additions & 4 deletions

This project has two purposes. First, I'd like to share some of my experience with NLP tasks such as word segmentation and word vectors. The second, more important purpose is that many people are probably searching for pre-trained word vector models for non-English languages. Alas! English has gained much more attention than any other language. Check [this](https://github.com/3Top/word2vec-api) to see how easily you can get a variety of pre-trained English word vectors without effort. I think it's time to turn our eyes to a multilingual version of this.

**Nearing the end of the work, I happened to learn that there is already a similar project named `polyglot`. I strongly encourage you to check out [this great project](https://sites.google.com/site/rmyeid/projects/polyglot). How embarrassing! Nevertheless, I decided to open this project. You will see that my work has its own flavor, after all.**

## Requirements
* nltk >= 1.11.1
* regex >= 2016.6.24
* lxml >= 3.3.3
* jieba >= 0.38 (Only for Chinese)
* gensim >= 0.13.1 (for Word2Vec)
* fastText (for [fasttext](https://github.com/facebookresearch/fastText))
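
The Python packages shown above can be installed with pip. A minimal sketch, with version pins copied from the list (fastText itself is built separately, per its repository):

```bash
# Install the Python dependencies (jieba is only needed for Chinese).
pip install "nltk>=1.11.1" "regex>=2016.6.24" "lxml>=3.3.3" "jieba>=0.38" "gensim>=0.13.1"
```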
## Background / References
* Check [this](https://en.wikipedia.org/wiki/Word_embedding) to learn what a word embedding is.
* Check [this](https://en.wikipedia.org/wiki/Word2vec) to quickly get a picture of Word2vec.
* Check [this](https://github.com/facebookresearch/fastText) to install fastText.
* Watch [this](https://www.youtube.com/watch?v=T8tQZChniMk&index=2&list=PL_6hBtWGKk2KdY3ANaEYbxL3N5YhRN9i0) to really understand what's happening under the hood of Word2vec.
* Go get various English word vectors [here](https://github.com/3Top/word2vec-api) if needed.

## Work Flow
* STEP 1. Download the [Wikipedia database backup dumps](https://dumps.wikimedia.org/backup-index.html) of the language you want (for example, for English go to `https://dumps.wikimedia.org/enwiki/`, click the latest timestamp, and download the `enwiki-YYYYMMDD-pages-articles-multistream.xml.bz2` file).
* STEP 2. Extract running texts to the `data/` folder.
* STEP 3. Run `build_corpus.py`.
* STEP 4-1. Run `make_wordvector.sh` to get Word2Vec word vectors.
* STEP 4-2. Run `fasttext.sh` to get fastText word vectors.
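
Concretely, the steps above might look like the following shell sketch. The `YYYYMMDD` timestamp is a placeholder per STEP 1, and the scripts are assumed to take no extra arguments (check each script before running):

```bash
# STEP 1: download a Wikipedia dump (English shown; substitute your language code).
wget https://dumps.wikimedia.org/enwiki/YYYYMMDD/enwiki-YYYYMMDD-pages-articles-multistream.xml.bz2

# STEP 2: extract the running texts into data/ with a tool of your choice.

# STEP 3: build the training corpus.
python build_corpus.py

# STEP 4-1 or 4-2: train the word vectors.
bash make_wordvector.sh   # Word2Vec
bash fasttext.sh          # fastText
```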

## Pre-trained models
Two types of pre-trained models are provided. `w` and `f` represent `word2vec` and `fastText`, respectively.

| Language | ISO 639-1 | Vector Size | Corpus Size | Vocabulary Size |
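
Once you have downloaded a model, it can be loaded with gensim, assuming the `w` models are stored in gensim's native `Word2Vec` format (an assumption; the file name and query word below are placeholders for whichever model you downloaded):

```python
from gensim.models import Word2Vec

# "ko.bin" is a placeholder path; point it at your downloaded model file.
model = Word2Vec.load("ko.bin")

# Nearest neighbours by cosine similarity.
# (On gensim < 1.0, call these on the model directly: model.most_similar(...), model["사과"].)
print(model.wv.most_similar("사과", topn=5))

# Raw embedding vector for a word.
print(model.wv["사과"].shape)
```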
