Commit 39d486b

added more readme
1 parent c562ce8 commit 39d486b

102 files changed

+2294
-143
lines changed


README.md

Lines changed: 30 additions & 74 deletions
Original file line numberDiff line numberDiff line change
@@ -1,6 +1,6 @@
11
<p align="center">
22
<a href="#readme">
3-
<img alt="logo" width="50%" src="wordcloud.png">
3+
<img alt="logo" width="50%" src="malay-dataset.png">
44
</a>
55
</p>
66
<p align="center">
@@ -802,7 +802,7 @@ Total size: 1.4 MB
802802
1. Positive
803803
2. Negative
804804

805-
Reference: https://www.nltk.org/api/nltk.corpus.reader.html#module-nltk.corpus.reader.categorized_sents
805+
Reference: http://www.cs.cornell.edu/people/pabo/movie-review-data/
806806

807807
<img src="https://img.shields.io/badge/third--party-red.svg">
808808

@@ -902,16 +902,6 @@ Total size: 482.4 MB
902902

903903
<img src="https://img.shields.io/badge/third--party-red.svg">
904904

905-
#### [Klook](crawl/klook)
906-
907-
**The copyright data remains with the original owners of the data, do not use this data for commercial purpose, https://www.klook.com/policy/**
908-
909-
Crawled up to 200 interesting locations from MY and SG Klook.
910-
911-
Total size: 10.3 MB
912-
913-
<img src="https://img.shields.io/badge/third--party-red.svg">
914-
915905
#### [IIUM-Confession](crawl/iium-confession)
916906

917907
**The copyright data remains with the original owners of the data, do not use this data for commercial purpose.**
@@ -922,23 +912,23 @@ Total size: 75.1 MB
922912

923913
<img src="https://img.shields.io/badge/third--party-red.svg">
924914

925-
#### [Wattpad](crawl/wattpad)
915+
#### [Iproperty](crawl/iproperty)
926916

927-
**The copyright data remains with the original owners of the data, do not use this data for commercial purpose, https://support.wattpad.com/hc/en-us/articles/216192503-Copyright-FAQ**
917+
**The copyright data remains with the original owners of the data, do not use this data for commercial purpose, https://www.iproperty.com.my/terms-of-use/**
928918

929-
Crawled using keywords,
919+
Crawled listings from up to 16 states in four categories: residential sales, commercial sales, residential rentals, and commercial rentals.
930920

931-
1. melayu
932-
2. malaysia
933-
3. seram
934-
4. hantu
935-
5. puisi
936-
6. sajak
937-
7. cerita
921+
Total size: 1329 MB
938922

939-
Crawled up to 7k fiction stories.
923+
<img src="https://img.shields.io/badge/third--party-red.svg">
940924

941-
Total size: 97 MB
925+
#### [Klook](crawl/klook)
926+
927+
**The copyright data remains with the original owners of the data, do not use this data for commercial purpose, https://www.klook.com/policy/**
928+
929+
Crawled up to 200 interesting locations from MY and SG Klook.
930+
931+
Total size: 10.3 MB
942932

943933
<img src="https://img.shields.io/badge/third--party-red.svg">
944934

@@ -954,61 +944,27 @@ Total size: 7.9 GB
954944

955945
**The copyright data remains with the original owners of the data, do not use this data for commercial purpose, https://www.ticket2u.com.my/copyright**
956946

957-
Contains 4282 events in Malaysia from 2017,
958-
959-
```python
960-
{'row': {'rownum': '4282',
961-
'rowtotal': '4282',
962-
'rowpp': '18',
963-
'link': 'https://www.ticket2u.com.my/event/10223/emi-business-networking-3.0',
964-
'time': '4:00PM',
965-
'avatar': 'https://www.ticket2u.com.my/upload/event/listing/0-10223-8ce30523-200c-4bfa-98a9-daadd142989b-GYQ6_X.jpg',
966-
'datefrom106': '26 Oct 2017',
967-
'dateto106': '26 Oct 2017',
968-
'day': 'Thursday',
969-
'date': '26',
970-
'month': 'Oct',
971-
'year': '2017',
972-
'datefrom': '2017-10-26T16:00:00',
973-
'dateto': '2017-10-26T19:00:00',
974-
'active': '1',
975-
'id': '10223',
976-
'name': 'EMI Business Networking 3.0',
977-
'titlename': 'EMI Business Networking 3.0',
978-
'excerpt': '',
979-
'pid': '0',
980-
'basecurrency': 'RM',
981-
'online': '0',
982-
'countryid': '1',
983-
'stateid': '1',
984-
'areaid': '0',
985-
'locname': 'Denai Alam Recreational and Riding Club',
986-
'statename': 'WP Kuala Lumpur',
987-
'latitude': '3.150970999999999',
988-
'type': '619',
989-
'regboo': '0',
990-
'pricefrom': '75.00',
991-
'longitude': '101.51955099999998',
992-
'eventcat': 'Business Sharing and Networking Event',
993-
'eventcatcode': 'business',
994-
'eventsubcat': 'Networking',
995-
'eventsubcatcode': 'networking',
996-
'showdate': '1',
997-
'exclusive': '0',
998-
'notexclusive': '0',
999-
'issaleend': '1',
1000-
'status': 'expired'}}
1001-
```
947+
Contains 4282 events in Malaysia from 2017.
1002948

1003949
<img src="https://img.shields.io/badge/third--party-red.svg">
1004950

1005-
#### [Iproperty](crawl/iproperty)
951+
#### [Wattpad](crawl/wattpad)
1006952

1007-
**The copyright data remains with the original owners of the data, do not use this data for commercial purpose, https://www.iproperty.com.my/terms-of-use/**
953+
**The copyright data remains with the original owners of the data, do not use this data for commercial purpose, https://support.wattpad.com/hc/en-us/articles/216192503-Copyright-FAQ**
1008954

1009-
Crawled listings from up to 16 states in four categories: residential sales, commercial sales, residential rentals, and commercial rentals.
955+
Crawled using keywords,
1010956

1011-
Total size: 1329 MB
957+
1. melayu
958+
2. malaysia
959+
3. seram
960+
4. hantu
961+
5. puisi
962+
6. sajak
963+
7. cerita
964+
965+
Crawled up to 7k fiction stories.
966+
967+
Total size: 97 MB
1012968

1013969
<img src="https://img.shields.io/badge/third--party-red.svg">
1014970

@@ -2811,7 +2767,7 @@ Total size: 1.7 GB
28112767

28122768
1. Please cite the repository if you use these corpora.
28132769

2814-
```
2770+
```bibtex
28152771
@misc{Malay-Dataset, We gather Bahasa Malaysia corpus!,
28162772
author = {Husein, Zolkepli},
28172773
title = {Malay-Dataset},

alignment/README.md

Lines changed: 11 additions & 0 deletions
@@ -1,3 +1,14 @@
1+
# Alignment
2+
3+
The purpose of this dataset is word alignment. For example:
4+
5+
```text
6+
a black cat
7+
kucing hitam
8+
9+
-> 1-1 2-0
10+
```
11+
112
## how-to
213

314
Or you can skip the steps and download the priors.
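
The `1-1 2-0` notation in the alignment example above can be read as pairs of 0-based word positions, source index first and target index second. The sketch below decodes such a string back into word pairs. It assumes whitespace tokenization and the `i-j` pair format inferred from that single example; `parse_alignment` is a hypothetical helper name, not something defined in this repository.

```python
def parse_alignment(source, target, alignment):
    """Decode an "i-j" alignment string into (source_word, target_word) pairs.

    Assumption: indices are 0-based, the first index refers to the source
    sentence and the second to the target sentence, as in the README example.
    """
    src_tokens = source.split()
    tgt_tokens = target.split()
    pairs = []
    for item in alignment.split():
        i, j = item.split("-")
        pairs.append((src_tokens[int(i)], tgt_tokens[int(j)]))
    return pairs


pairs = parse_alignment("a black cat", "kucing hitam", "1-1 2-0")
print(pairs)  # [('black', 'hitam'), ('cat', 'kucing')]
```

The article `a` (index 0) has no counterpart in the Malay sentence, so it does not appear in any pair.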

chatbot/blended-skill-talk/README.md

Lines changed: 29 additions & 2 deletions
@@ -1,3 +1,30 @@
1-
## how-to
1+
# Blended Skill Talk
22

3-
1. https://f000.backblazeb2.com/file/malay-dataset/chatbot/blended-skill-talk/blended_skill_talk.json.translate
3+
Original paper, https://arxiv.org/abs/2004.08449
4+
5+
## download
6+
7+
1. blended_skill_talk.json.translate, https://f000.backblazeb2.com/file/malay-dataset/chatbot/blended-skill-talk/blended_skill_talk.json.translate
8+
9+
## Citation
10+
11+
```bibtex
12+
@article{DBLP:journals/corr/abs-2004-08449,
13+
author = {Eric Michael Smith and
14+
Mary Williamson and
15+
Kurt Shuster and
16+
Jason Weston and
17+
Y{-}Lan Boureau},
18+
title = {Can You Put it All Together: Evaluating Conversational Agents' Ability
19+
to Blend Skills},
20+
journal = {CoRR},
21+
volume = {abs/2004.08449},
22+
year = {2020},
23+
url = {https://arxiv.org/abs/2004.08449},
24+
archivePrefix = {arXiv},
25+
eprint = {2004.08449},
26+
timestamp = {Sat, 23 Jan 2021 01:20:50 +0100},
27+
biburl = {https://dblp.org/rec/journals/corr/abs-2004-08449.bib},
28+
bibsource = {dblp computer science bibliography, https://dblp.org}
29+
}
30+
```

chatbot/convai2/README.md

Lines changed: 45 additions & 3 deletions
Original file line numberDiff line numberDiff line change
@@ -1,4 +1,46 @@
1-
## how-to
1+
# ConvAI2
22

3-
1. https://f000.backblazeb2.com/file/malay-dataset/chatbot/convai2/convai2-0.json.translate
4-
2. https://f000.backblazeb2.com/file/malay-dataset/chatbot/convai2/convai2-100000.json.translate
3+
Original website, https://parl.ai/projects/convai2/
4+
5+
Original paper, https://arxiv.org/abs/1902.00098
6+
7+
## download
8+
9+
1. convai2-0.json.translate, https://f000.backblazeb2.com/file/malay-dataset/chatbot/convai2/convai2-0.json.translate
10+
2. convai2-100000.json.translate, https://f000.backblazeb2.com/file/malay-dataset/chatbot/convai2/convai2-100000.json.translate
11+
12+
## Citation
13+
14+
```bibtex
15+
@article{DBLP:journals/corr/abs-1902-00098,
16+
author = {Emily Dinan and
17+
Varvara Logacheva and
18+
Valentin Malykh and
19+
Alexander H. Miller and
20+
Kurt Shuster and
21+
Jack Urbanek and
22+
Douwe Kiela and
23+
Arthur Szlam and
24+
Iulian Serban and
25+
Ryan Lowe and
26+
Shrimai Prabhumoye and
27+
Alan W. Black and
28+
Alexander I. Rudnicky and
29+
Jason Williams and
30+
Joelle Pineau and
31+
Mikhail S. Burtsev and
32+
Jason Weston},
33+
title = {The Second Conversational Intelligence Challenge (ConvAI2)},
34+
journal = {CoRR},
35+
volume = {abs/1902.00098},
36+
year = {2019},
37+
url = {http://arxiv.org/abs/1902.00098},
38+
archivePrefix = {arXiv},
39+
eprint = {1902.00098},
40+
timestamp = {Sat, 23 Jan 2021 01:11:58 +0100},
41+
biburl = {https://dblp.org/rec/journals/corr/abs-1902-00098.bib},
42+
bibsource = {dblp computer science bibliography, https://dblp.org}
43+
}
44+
45+
46+
```

chatbot/wiki-wizard/README.md

Lines changed: 31 additions & 10 deletions
Original file line numberDiff line numberDiff line change
@@ -1,17 +1,38 @@
1-
# dialog
1+
# Wizard of Wikipedia
22

3-
prefix, https://f000.backblazeb2.com/file/malay-dataset/
3+
Original paper, https://arxiv.org/abs/1811.01241
44

5-
## how-to
5+
## download
66

7-
1. wiki-wizard/dialogs.translate
7+
### dialog
88

9-
# information
9+
1. dialogs.translate, https://f000.backblazeb2.com/file/malay-dataset/wiki-wizard/dialogs.translate
1010

11-
prefix, https://f000.backblazeb2.com/file/malay-dataset/
11+
### information
1212

13-
## how-to
13+
1. informations-0.json.translate, https://f000.backblazeb2.com/file/malay-dataset/wiki-wizard/informations-0.json.translate
14+
2. informations-100000.json.translate, https://f000.backblazeb2.com/file/malay-dataset/wiki-wizard/informations-100000.json.translate
15+
3. informations-200000.json.translate, https://f000.backblazeb2.com/file/malay-dataset/wiki-wizard/informations-200000.json.translate
1416

15-
1. wiki-wizard/informations-0.json.translate
16-
1. wiki-wizard/informations-100000.json.translate
17-
1. wiki-wizard/informations-200000.json.translate
17+
## Citation
18+
19+
```bibtex
20+
@article{DBLP:journals/corr/abs-1811-01241,
21+
author = {Emily Dinan and
22+
Stephen Roller and
23+
Kurt Shuster and
24+
Angela Fan and
25+
Michael Auli and
26+
Jason Weston},
27+
title = {Wizard of Wikipedia: Knowledge-Powered Conversational agents},
28+
journal = {CoRR},
29+
volume = {abs/1811.01241},
30+
year = {2018},
31+
url = {http://arxiv.org/abs/1811.01241},
32+
archivePrefix = {arXiv},
33+
eprint = {1811.01241},
34+
timestamp = {Sat, 23 Jan 2021 01:19:39 +0100},
35+
biburl = {https://dblp.org/rec/journals/corr/abs-1811-01241.bib},
36+
bibsource = {dblp computer science bibliography, https://dblp.org}
37+
}
38+
```
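
The Wizard of Wikipedia information files listed above follow an offset naming pattern (`informations-0`, `informations-100000`, `informations-200000`). As a minimal sketch, the URLs can be generated rather than typed out. The base URL is copied from the list; the step of 100000 and the count of three shards are assumptions read off that list, and `shard_urls` is a hypothetical helper. The code only builds the URLs and performs no download.

```python
# Base URL copied from the download list in the README diff above.
BASE = "https://f000.backblazeb2.com/file/malay-dataset/wiki-wizard"


def shard_urls(count=3, step=100000):
    """Return the URLs of the information shards.

    Assumption: shards are named by their starting offset, in steps of
    100000, and exactly three exist (as listed in the README).
    """
    return [f"{BASE}/informations-{i * step}.json.translate" for i in range(count)]


for url in shard_urls():
    print(url)
```

Each URL could then be passed to a downloader of choice, for example `urllib.request.urlretrieve` from the Python standard library.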

corpus/audience/README.md

Lines changed: 11 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -0,0 +1,11 @@
1+
# Audience
2+
3+
Original website, https://www.kaggle.com/crowdflower/political-social-media-posts
4+
5+
## Citation
6+
7+
Auto-generated using https://www.bibme.org/bibtex/website-citation,
8+
9+
```bibtex
10+
@misc{eight_2016, title={Political Social Media Posts}, url={https://www.kaggle.com/crowdflower/political-social-media-posts}, journal={Kaggle}, author={Eight, Figure}, year={2016}, month={Nov}}
11+
```

corpus/emotion/README.md

Lines changed: 29 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -0,0 +1,29 @@
1+
# Emotion
2+
3+
Emotion dataset gathered using a lexicon; all steps are in the [notebook].
4+
5+
## download
6+
7+
1. emotion-twitter-lexicon.json, 27.4 MB, https://malaya-dataset.s3-ap-southeast-1.amazonaws.com/emotion/emotion-twitter-lexicon.json
8+
9+
```
10+
anger 108813
11+
fear 20316
12+
happy 30962
13+
love 20783
14+
sadness 26468
15+
surprise 13107
16+
```
17+
18+
## Citation
19+
20+
```bibtex
21+
@misc{Malay-Dataset, We gather Bahasa Malaysia corpus!, Semi-Supervised Emotion dataset,
22+
author = {Husein, Zolkepli},
23+
title = {Malay-Dataset},
24+
year = {2018},
25+
publisher = {GitHub},
26+
journal = {GitHub repository},
27+
howpublished = {\url{https://github.com/huseinzol05/malay-dataset/tree/master/corpus/emotion}}
28+
}
29+
```
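
The per-label numbers listed in the Emotion README above suggest a noticeably imbalanced dataset. The sketch below makes that concrete by computing each label's share of the total. It assumes the listed numbers are per-label sample counts, which the README does not state explicitly; the figures are copied from the listing, not read from the JSON file.

```python
# Per-label figures copied from the Emotion README listing.
# Assumption: they are sample counts per emotion label.
counts = {
    "anger": 108813,
    "fear": 20316,
    "happy": 30962,
    "love": 20783,
    "sadness": 26468,
    "surprise": 13107,
}

total = sum(counts.values())
shares = {label: n / total for label, n in counts.items()}

# Print labels from most to least frequent with their share of the total.
for label, share in sorted(shares.items(), key=lambda kv: kv[1], reverse=True):
    print(f"{label:<9}{share:7.1%}")
```

Under that assumption, `anger` makes up roughly half of the 220449 entries, so class weighting or resampling may be worth considering when training on this data.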

corpus/emotion/lexicon/README.md

Lines changed: 0 additions & 12 deletions
This file was deleted.
File renamed without changes.

corpus/gender/README.md

Lines changed: 11 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -0,0 +1,11 @@
1+
# Gender
2+
3+
Original website, https://www.kaggle.com/crowdflower/twitter-user-gender-classification
4+
5+
## Citation
6+
7+
Auto-generated using https://www.bibme.org/bibtex/website-citation,
8+
9+
```bibtex
10+
@misc{eight_2016, title={Twitter User Gender Classification}, url={https://www.kaggle.com/crowdflower/twitter-user-gender-classification}, journal={Kaggle}, author={Eight, Figure}, year={2016}, month={Nov}}
11+
```

corpus/insincere-question/README.md

Lines changed: 11 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -0,0 +1,11 @@
1+
# Insincere Question
2+
3+
Original website, https://www.kaggle.com/c/quora-insincere-questions-classification
4+
5+
## Citation
6+
7+
Auto-generated using https://www.bibme.org/bibtex/website-citation,
8+
9+
```bibtex
10+
@misc{kaggle, title={Quora Insincere Questions Classification}, url={https://www.kaggle.com/c/quora-insincere-questions-classification}, journal={Kaggle}}
11+
```
