From 99e3f73de0a33ddf4370bf77092d94d1fd3d531b Mon Sep 17 00:00:00 2001
From: Kartik Perisetla
Date: Sat, 4 Jan 2020 13:25:15 -0800
Subject: [PATCH 1/4] updated list and added section for Language Modeling

---
 README.md | 7 +++++++
 1 file changed, 7 insertions(+)

diff --git a/README.md b/README.md
index ed00238..7ff9b24 100644
--- a/README.md
+++ b/README.md
@@ -6,6 +6,7 @@ Suggestions and pull requests are welcome. The goal is to make this a collaborat
 * [Question Answering](#question-answering)
 * [Dialogue Systems](#dialogue-systems)
 * [Goal-Oriented Dialogue Systems](#goal-oriented-dialogue-systems)
+* [Language Modeling](#language-modeling)
 
 ## Question Answering
 * **(NLVR)** A Corpus of Natural Language for Visual Reasoning, 2017 [[paper]](http://yoavartzi.com/pub/slya-acl.2017.pdf) [[data]](http://lic.nlp.cornell.edu/nlvr)
@@ -22,6 +23,8 @@ Commonsense Stories, 2016 [[paper]](http://arxiv.org/abs/1604.01696) [[data]](ht
 * **(QuizBowl)** A Neural Network for Factoid Question Answering over Paragraphs, 2014 [[paper]](https://www.cs.umd.edu/~miyyer/pubs/2014_qb_rnn.pdf) [[data]](https://www.cs.umd.edu/~miyyer/qblearn/index.html)
 * **(MCTest)** MCTest: A Challenge Dataset for the Open-Domain Machine Comprehension of Text, 2013 [[paper]](http://research.microsoft.com/en-us/um/redmond/projects/mctest/MCTest_EMNLP2013.pdf) [[data]](http://research.microsoft.com/en-us/um/redmond/projects/mctest/data.html) [[alternate data link]](https://github.com/mcobzarenco/mctest/tree/master/data/MCTest)
 * **(QASent)** What is the Jeopardy model? A quasisynchronous grammar for QA, 2007 [[paper]](http://homes.cs.washington.edu/~nasmith/papers/wang+smith+mitamura.emnlp07.pdf) [[data]](http://cs.stanford.edu/people/mengqiu/data/qg-emnlp07-data.tgz)
+* **(DeepMind Question Answering Corpus)** Question answering dataset featured in "Teaching Machines to Read and Comprehend", 2015 [[repo]](https://github.com/deepmind/rc-data)
+* **(Amazon Question Answering Corpus)** Question and answer data from Amazon, totaling around 1.4 million answered questions [[download]](http://jmcauley.ucsd.edu/data/amazon/qa/)
 
 ## Dialogue Systems
 * **(Ubuntu Dialogue Corpus)** The Ubuntu Dialogue Corpus : A Large Dataset for Research in Unstructured Multi-Turn Dialogue Systems, 2015 [[paper]](http://arxiv.org/abs/1506.08909) [[data]](https://github.com/rkadlec/ubuntu-ranking-dataset-creator)
@@ -29,3 +32,7 @@ Commonsense Stories, 2016 [[paper]](http://arxiv.org/abs/1604.01696) [[data]](ht
 ## Goal-Oriented Dialogue Systems
 * **(Frames)** Frames: A Corpus for Adding Memory to Goal-Oriented Dialogue Systems, 2016 [[paper]](https://arxiv.org/abs/1704.00057) [[data]](http://datasets.maluuba.com/Frames)
 * **(DSTC 2 & 3)** Dialog State Tracking Challenge 2 & 3, 2013 [[paper]](http://camdial.org/~mh521/dstc/downloads/handbook.pdf) [[data]](http://camdial.org/~mh521/dstc/)
+
+## Language Modeling
+* **(Google 1 Billion Word Corpus)** A freely available corpus of relatively large size for building and testing language models, accompanied by baseline n-gram models [[download]](https://opensource.google/projects/lm-benchmark)
+* **(Project Gutenberg)** A large collection of free books that can be retrieved in plain text in a variety of languages [[download]](https://www.gutenberg.org/)
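The Amazon Q&A files linked in this patch are distributed gzipped, one Python dict literal per line rather than strict JSON, so `json.loads` can choke on the single-quoted keys. A minimal reading sketch under that assumption; the filename and field names follow the dataset page but should be treated as assumptions here:

```python
import ast
import gzip

def parse_qa(path):
    # Each line is a Python dict literal (single-quoted keys),
    # so ast.literal_eval is used instead of json.loads.
    with gzip.open(path, "rt", encoding="utf-8") as f:
        for line in f:
            yield ast.literal_eval(line)

# Hypothetical per-category file from jmcauley.ucsd.edu/data/amazon/qa/
for record in parse_qa("qa_Electronics.json.gz"):
    print(record["question"], "->", record["answer"])
    break  # just show the first record
```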
From 2ac1610c9e99f949aeb9e92b10c0bbc62b88d5b8 Mon Sep 17 00:00:00 2001
From: Kartik Perisetla
Date: Sat, 4 Jan 2020 14:14:49 -0800
Subject: [PATCH 2/4] added more dataset details for VQA

---
 README.md | 11 +++++++++++
 1 file changed, 11 insertions(+)

diff --git a/README.md b/README.md
index 7ff9b24..54e76dc 100644
--- a/README.md
+++ b/README.md
@@ -7,6 +7,7 @@ Suggestions and pull requests are welcome. The goal is to make this a collaborat
 * [Dialogue Systems](#dialogue-systems)
 * [Goal-Oriented Dialogue Systems](#goal-oriented-dialogue-systems)
 * [Language Modeling](#language-modeling)
+* [Visual Question Answering](#visual-question-answering)
 
 ## Question Answering
 * **(NLVR)** A Corpus of Natural Language for Visual Reasoning, 2017 [[paper]](http://yoavartzi.com/pub/slya-acl.2017.pdf) [[data]](http://lic.nlp.cornell.edu/nlvr)
@@ -35,4 +36,14 @@ Commonsense Stories, 2016 [[paper]](http://arxiv.org/abs/1604.01696) [[data]](ht
 
 ## Language Modeling
 * **(Google 1 Billion Word Corpus)** A freely available corpus of relatively large size for building and testing language models, accompanied by baseline n-gram models [[download]](https://opensource.google/projects/lm-benchmark)
+* **(WikiText-103)** The WikiText-103 corpus contains 267,735 unique tokens, each occurring at least three times in the training set. [[download]](https://www.salesforce.com/products/einstein/ai-research/the-wikitext-dependency-language-modeling-dataset/)
 * **(Project Gutenberg)** A large collection of free books that can be retrieved in plain text in a variety of languages [[download]](https://www.gutenberg.org/)
+
+## Visual Question Answering
+* **(VQA)** VQA is a dataset containing open-ended questions about images; answering them requires an understanding of vision, language, and commonsense knowledge. [[download]](https://visualqa.org/download.html)
+* **(DAQUAR)** DAtaset for QUestion Answering on Real-world images [[download]](https://www.mpi-inf.mpg.de/departments/computer-vision-and-machine-learning/research/vision-and-language/visual-turing-challenge/)
+* **((Visual7W))** Visual7W is a large-scale visual question answering (QA) dataset with object-level groundings and multimodal answers. Each question starts with one of the seven Ws: what, where, when, who, why, how, and which. [[paper]](https://arxiv.org/abs/1511.03416) [[download]](https://github.com/yukezhu/visual7w-toolkit)
+* **(Visual Madlibs)** A fill-in-the-blanks question answering dataset. [[paper]](https://www.cv-foundation.org/openaccess/content_iccv_2015/papers/Yu_Visual_Madlibs_Fill_ICCV_2015_paper.pdf) [[download]](http://tamaraberg.com/visualmadlibs/)
+* **(COCO-QA)** Another dataset built on MS-COCO; questions and answers are generated automatically from MS-COCO image captions and fall into four categories: Object, Number, Color, and Location. [[download]](http://www.cs.toronto.edu/~mren/research/imageqa/data/cocoqa/)
+* **(Visual Genome)** Visual Genome is a dataset, a knowledge base, and an ongoing effort to connect structured image concepts to language. It includes 1.7 million visual question-answer pairs. [[download]](https://visualgenome.org/)
+* **(SHAPES)** Consists of shapes of varying arrangements, types, and colors; questions ask about the attributes, relationships, and positions of the shapes. [[paper]](https://pdfs.semanticscholar.org/0ac8/f1a3c679b90d22c1f840cdc8d61ffef750ac.pdf)
\ No newline at end of file
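WikiText-103, added above, ships as plain tokenized text and can be consumed directly; if the Hugging Face `datasets` mirror is acceptable (an assumption, since the Salesforce page above is the canonical download), loading it takes one call:

```python
from datasets import load_dataset  # pip install datasets

# "wikitext-103-raw-v1" is the raw-text config;
# "wikitext-103-v1" is the tokenized variant used in the paper.
wiki = load_dataset("wikitext", "wikitext-103-raw-v1")

print(wiki)  # DatasetDict with train/validation/test splits
print(wiki["train"][10]["text"][:200])  # a fragment of raw text
```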
From 18a3b45844bc32b20bbe84c7f6f501c0271e22eb Mon Sep 17 00:00:00 2001
From: Kartik Perisetla
Date: Sat, 4 Jan 2020 14:15:48 -0800
Subject: [PATCH 3/4] nit change

---
 README.md | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/README.md b/README.md
index 54e76dc..5202d54 100644
--- a/README.md
+++ b/README.md
@@ -42,7 +42,7 @@ Commonsense Stories, 2016 [[paper]](http://arxiv.org/abs/1604.01696) [[data]](ht
 ## Visual Question Answering
 * **(VQA)** VQA is a dataset containing open-ended questions about images; answering them requires an understanding of vision, language, and commonsense knowledge. [[download]](https://visualqa.org/download.html)
 * **(DAQUAR)** DAtaset for QUestion Answering on Real-world images [[download]](https://www.mpi-inf.mpg.de/departments/computer-vision-and-machine-learning/research/vision-and-language/visual-turing-challenge/)
-* **((Visual7W))** Visual7W is a large-scale visual question answering (QA) dataset with object-level groundings and multimodal answers. Each question starts with one of the seven Ws: what, where, when, who, why, how, and which. [[paper]](https://arxiv.org/abs/1511.03416) [[download]](https://github.com/yukezhu/visual7w-toolkit)
+* **(Visual7W)** Visual7W is a large-scale visual question answering (QA) dataset with object-level groundings and multimodal answers. Each question starts with one of the seven Ws: what, where, when, who, why, how, and which. [[paper]](https://arxiv.org/abs/1511.03416) [[download]](https://github.com/yukezhu/visual7w-toolkit)
 * **(Visual Madlibs)** A fill-in-the-blanks question answering dataset. [[paper]](https://www.cv-foundation.org/openaccess/content_iccv_2015/papers/Yu_Visual_Madlibs_Fill_ICCV_2015_paper.pdf) [[download]](http://tamaraberg.com/visualmadlibs/)
 * **(COCO-QA)** Another dataset built on MS-COCO; questions and answers are generated automatically from MS-COCO image captions and fall into four categories: Object, Number, Color, and Location. [[download]](http://www.cs.toronto.edu/~mren/research/imageqa/data/cocoqa/)
 * **(Visual Genome)** Visual Genome is a dataset, a knowledge base, and an ongoing effort to connect structured image concepts to language. It includes 1.7 million visual question-answer pairs. [[download]](https://visualgenome.org/)
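For the VQA entry, the download page ships questions and annotations as separate JSON files. A sketch of iterating over a questions file, assuming the v2 naming convention and a top-level "questions" list with image_id/question/question_id fields (both assumptions based on the dataset documentation):

```python
import json

# Filename follows the pattern on visualqa.org (assumed here).
with open("v2_OpenEnded_mscoco_train2014_questions.json") as f:
    vqa_questions = json.load(f)["questions"]

for q in vqa_questions[:3]:
    # Each record ties a free-form question to a COCO image id.
    print(q["question_id"], q["image_id"], q["question"])
```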
From 21205a969c9a4737ab74332b73ec14b2b5df0c79 Mon Sep 17 00:00:00 2001
From: Kartik Perisetla
Date: Sat, 4 Jan 2020 14:31:52 -0800
Subject: [PATCH 4/4] adding natural questions dataset

---
 README.md | 1 +
 1 file changed, 1 insertion(+)

diff --git a/README.md b/README.md
index 5202d54..40e524c 100644
--- a/README.md
+++ b/README.md
@@ -24,6 +24,7 @@ Commonsense Stories, 2016 [[paper]](http://arxiv.org/abs/1604.01696) [[data]](ht
 * **(QuizBowl)** A Neural Network for Factoid Question Answering over Paragraphs, 2014 [[paper]](https://www.cs.umd.edu/~miyyer/pubs/2014_qb_rnn.pdf) [[data]](https://www.cs.umd.edu/~miyyer/qblearn/index.html)
 * **(MCTest)** MCTest: A Challenge Dataset for the Open-Domain Machine Comprehension of Text, 2013 [[paper]](http://research.microsoft.com/en-us/um/redmond/projects/mctest/MCTest_EMNLP2013.pdf) [[data]](http://research.microsoft.com/en-us/um/redmond/projects/mctest/data.html) [[alternate data link]](https://github.com/mcobzarenco/mctest/tree/master/data/MCTest)
 * **(QASent)** What is the Jeopardy model? A quasisynchronous grammar for QA, 2007 [[paper]](http://homes.cs.washington.edu/~nasmith/papers/wang+smith+mitamura.emnlp07.pdf) [[data]](http://cs.stanford.edu/people/mengqiu/data/qg-emnlp07-data.tgz)
+* **(Google Natural Questions)** Natural Questions: A Benchmark for Question Answering Research, 2019 [[paper]](https://research.google/pubs/pub47761/) [[download]](https://ai.google.com/research/NaturalQuestions)
 * **(DeepMind Question Answering Corpus)** Question answering dataset featured in "Teaching Machines to Read and Comprehend", 2015 [[repo]](https://github.com/deepmind/rc-data)
 * **(Amazon Question Answering Corpus)** Question and answer data from Amazon, totaling around 1.4 million answered questions [[download]](http://jmcauley.ucsd.edu/data/amazon/qa/)
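Natural Questions is large; its training set is distributed as gzipped JSON lines, which can be streamed without decompressing to disk first. A sketch assuming the simplified format (the shard name and field names such as question_text are assumptions; check the download page for the actual layout):

```python
import gzip
import json

# Assumed simplified-format shard; see ai.google.com/research/NaturalQuestions
PATH = "v1.0-simplified_simplified-nq-train.jsonl.gz"

with gzip.open(PATH, "rt", encoding="utf-8") as f:
    for i, line in enumerate(f):
        example = json.loads(line)  # one question + Wikipedia page per line
        print(example["question_text"])
        if i == 2:
            break
```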