From 99e3f73de0a33ddf4370bf77092d94d1fd3d531b Mon Sep 17 00:00:00 2001
From: Kartik Perisetla
Date: Sat, 4 Jan 2020 13:25:15 -0800
Subject: [PATCH 1/4] updated list and added section for Language Modeling

---
 README.md | 7 +++++++
 1 file changed, 7 insertions(+)

diff --git a/README.md b/README.md
index ed00238..7ff9b24 100644
--- a/README.md
+++ b/README.md
@@ -6,6 +6,7 @@ Suggestions and pull requests are welcome. The goal is to make this a collaborat
 * [Question Answering](#question-answering)
 * [Dialogue Systems](#dialogue-systems)
 * [Goal-Oriented Dialogue Systems](#goal-oriented-dialogue-systems)
+* [Language Modeling](#language-modeling)
 
 ## Question Answering
 * **(NLVR)** A Corpus of Natural Language for Visual Reasoning, 2017 [[paper]](http://yoavartzi.com/pub/slya-acl.2017.pdf) [[data]](http://lic.nlp.cornell.edu/nlvr)
@@ -22,6 +23,8 @@ Commonsense Stories, 2016 [[paper]](http://arxiv.org/abs/1604.01696) [[data]](ht
 * **(QuizBowl)** A Neural Network for Factoid Question Answering over Paragraphs, 2014 [[paper]](https://www.cs.umd.edu/~miyyer/pubs/2014_qb_rnn.pdf) [[data]](https://www.cs.umd.edu/~miyyer/qblearn/index.html)
 * **(MCTest)** MCTest: A Challenge Dataset for the Open-Domain Machine Comprehension of Text, 2013 [[paper]](http://research.microsoft.com/en-us/um/redmond/projects/mctest/MCTest_EMNLP2013.pdf) [[data]](http://research.microsoft.com/en-us/um/redmond/projects/mctest/data.html) [[alternate data link]](https://github.com/mcobzarenco/mctest/tree/master/data/MCTest)
 * **(QASent)** What is the Jeopardy model? A quasisynchronous grammar for QA, 2007 [[paper]](http://homes.cs.washington.edu/~nasmith/papers/wang+smith+mitamura.emnlp07.pdf) [[data]](http://cs.stanford.edu/people/mengqiu/data/qg-emnlp07-data.tgz)
+* **(DeepMind Question Answering Corpus)** Question answering dataset featured in "Teaching Machines to Read and Comprehend", 2015 [[repo]](https://github.com/deepmind/rc-data)
+* **(Amazon Question Answering Corpus)** Question and answer data from Amazon, totaling around 1.4 million answered questions [[download]](http://jmcauley.ucsd.edu/data/amazon/qa/)
 
 ## Dialogue Systems
 * **(Ubuntu Dialogue Corpus)** The Ubuntu Dialogue Corpus : A Large Dataset for Research in Unstructured Multi-Turn Dialogue Systems, 2015 [[paper]](http://arxiv.org/abs/1506.08909) [[data]](https://github.com/rkadlec/ubuntu-ranking-dataset-creator)
@@ -29,3 +32,7 @@ Commonsense Stories, 2016 [[paper]](http://arxiv.org/abs/1604.01696) [[data]](ht
 ## Goal-Oriented Dialogue Systems
 * **(Frames)** Frames: A Corpus for Adding Memory to Goal-Oriented Dialogue Systems, 2016 [[paper]](https://arxiv.org/abs/1704.00057) [[data]](http://datasets.maluuba.com/Frames)
 * **(DSTC 2 & 3)** Dialog State Tracking Challenge 2 & 3, 2013 [[paper]](http://camdial.org/~mh521/dstc/downloads/handbook.pdf) [[data]](http://camdial.org/~mh521/dstc/)
+
+## Language Modeling
+* **(Google 1 Billion Word Corpus)** A freely available corpus of relatively large size for building and testing language models, accompanied by baseline n-gram models [[download]](https://opensource.google/projects/lm-benchmark)
+* **(Project Gutenberg)** A large collection of free books that can be retrieved in plain text in a variety of languages [[download]](https://www.gutenberg.org/)
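The Amazon Q&A files linked in this patch are distributed gzipped, one Python dict literal per line rather than strict JSON, so `json.loads` can choke on the single-quoted keys. A minimal reading sketch under that assumption; the filename and field names follow the dataset page but should be treated as assumptions here:

```python
import ast
import gzip

def parse_qa(path):
    # Each line is a Python dict literal (single-quoted keys),
    # so ast.literal_eval is used instead of json.loads.
    with gzip.open(path, "rt", encoding="utf-8") as f:
        for line in f:
            yield ast.literal_eval(line)

# Hypothetical per-category file from jmcauley.ucsd.edu/data/amazon/qa/
for record in parse_qa("qa_Electronics.json.gz"):
    print(record["question"], "->", record["answer"])
    break  # just show the first record
```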
From 2ac1610c9e99f949aeb9e92b10c0bbc62b88d5b8 Mon Sep 17 00:00:00 2001
From: Kartik Perisetla
Date: Sat, 4 Jan 2020 14:14:49 -0800
Subject: [PATCH 2/4] added more dataset details for VQA

---
 README.md | 11 +++++++++++
 1 file changed, 11 insertions(+)

diff --git a/README.md b/README.md
index 7ff9b24..54e76dc 100644
--- a/README.md
+++ b/README.md
@@ -7,6 +7,7 @@ Suggestions and pull requests are welcome. The goal is to make this a collaborat
 * [Dialogue Systems](#dialogue-systems)
 * [Goal-Oriented Dialogue Systems](#goal-oriented-dialogue-systems)
 * [Language Modeling](#language-modeling)
+* [Visual Question Answering](#visual-question-answering)
 
 ## Question Answering
 * **(NLVR)** A Corpus of Natural Language for Visual Reasoning, 2017 [[paper]](http://yoavartzi.com/pub/slya-acl.2017.pdf) [[data]](http://lic.nlp.cornell.edu/nlvr)
@@ -35,4 +36,14 @@ Commonsense Stories, 2016 [[paper]](http://arxiv.org/abs/1604.01696) [[data]](ht
 
 ## Language Modeling
 * **(Google 1 Billion Word Corpus)** A freely available corpus of relatively large size for building and testing language models, accompanied by baseline n-gram models [[download]](https://opensource.google/projects/lm-benchmark)
+* **(WikiText-103)** The WikiText-103 corpus contains 267,735 unique tokens, each occurring at least three times in the training set. [[download]](https://www.salesforce.com/products/einstein/ai-research/the-wikitext-dependency-language-modeling-dataset/)
 * **(Project Gutenberg)** A large collection of free books that can be retrieved in plain text in a variety of languages [[download]](https://www.gutenberg.org/)
+
+## Visual Question Answering
+* **(VQA)** VQA is a dataset containing open-ended questions about images; answering them requires an understanding of vision, language, and commonsense knowledge. [[download]](https://visualqa.org/download.html)
+* **(DAQUAR)** DAtaset for QUestion Answering on Real-world images [[download]](https://www.mpi-inf.mpg.de/departments/computer-vision-and-machine-learning/research/vision-and-language/visual-turing-challenge/)
+* **((Visual7W))** Visual7W is a large-scale visual question answering (QA) dataset with object-level groundings and multimodal answers. Each question starts with one of the seven Ws: what, where, when, who, why, how, and which. [[paper]](https://arxiv.org/abs/1511.03416) [[download]](https://github.com/yukezhu/visual7w-toolkit)
+* **(Visual Madlibs)** A fill-in-the-blanks question answering dataset. [[paper]](https://www.cv-foundation.org/openaccess/content_iccv_2015/papers/Yu_Visual_Madlibs_Fill_ICCV_2015_paper.pdf) [[download]](http://tamaraberg.com/visualmadlibs/)
+* **(COCO-QA)** Another dataset built on MS-COCO; questions and answers are generated automatically from MS-COCO image captions and fall into four categories: Object, Number, Color, and Location. [[download]](http://www.cs.toronto.edu/~mren/research/imageqa/data/cocoqa/)
+* **(Visual Genome)** Visual Genome is a dataset, a knowledge base, and an ongoing effort to connect structured image concepts to language. It includes 1.7 million visual question-answer pairs. [[download]](https://visualgenome.org/)
+* **(SHAPES)** Consists of shapes of varying arrangements, types, and colors; questions ask about the attributes, relationships, and positions of the shapes. [[paper]](https://pdfs.semanticscholar.org/0ac8/f1a3c679b90d22c1f840cdc8d61ffef750ac.pdf)
\ No newline at end of file
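WikiText-103, added above, ships as plain tokenized text and can be consumed directly; if the Hugging Face `datasets` mirror is acceptable (an assumption, since the Salesforce page above is the canonical download), loading it takes one call:

```python
from datasets import load_dataset  # pip install datasets

# "wikitext-103-raw-v1" is the raw-text config;
# "wikitext-103-v1" is the tokenized variant used in the paper.
wiki = load_dataset("wikitext", "wikitext-103-raw-v1")

print(wiki)  # DatasetDict with train/validation/test splits
print(wiki["train"][10]["text"][:200])  # a fragment of raw text
```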
From 18a3b45844bc32b20bbe84c7f6f501c0271e22eb Mon Sep 17 00:00:00 2001
From: Kartik Perisetla
Date: Sat, 4 Jan 2020 14:15:48 -0800
Subject: [PATCH 3/4] nit change

---
 README.md | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/README.md b/README.md
index 54e76dc..5202d54 100644
--- a/README.md
+++ b/README.md
@@ -42,7 +42,7 @@ Commonsense Stories, 2016 [[paper]](http://arxiv.org/abs/1604.01696) [[data]](ht
 ## Visual Question Answering
 * **(VQA)** VQA is a dataset containing open-ended questions about images; answering them requires an understanding of vision, language, and commonsense knowledge. [[download]](https://visualqa.org/download.html)
 * **(DAQUAR)** DAtaset for QUestion Answering on Real-world images [[download]](https://www.mpi-inf.mpg.de/departments/computer-vision-and-machine-learning/research/vision-and-language/visual-turing-challenge/)
-* **((Visual7W))** Visual7W is a large-scale visual question answering (QA) dataset with object-level groundings and multimodal answers. Each question starts with one of the seven Ws: what, where, when, who, why, how, and which. [[paper]](https://arxiv.org/abs/1511.03416) [[download]](https://github.com/yukezhu/visual7w-toolkit)
+* **(Visual7W)** Visual7W is a large-scale visual question answering (QA) dataset with object-level groundings and multimodal answers. Each question starts with one of the seven Ws: what, where, when, who, why, how, and which. [[paper]](https://arxiv.org/abs/1511.03416) [[download]](https://github.com/yukezhu/visual7w-toolkit)
 * **(Visual Madlibs)** A fill-in-the-blanks question answering dataset. [[paper]](https://www.cv-foundation.org/openaccess/content_iccv_2015/papers/Yu_Visual_Madlibs_Fill_ICCV_2015_paper.pdf) [[download]](http://tamaraberg.com/visualmadlibs/)
 * **(COCO-QA)** Another dataset built on MS-COCO; questions and answers are generated automatically from MS-COCO image captions and fall into four categories: Object, Number, Color, and Location. [[download]](http://www.cs.toronto.edu/~mren/research/imageqa/data/cocoqa/)
 * **(Visual Genome)** Visual Genome is a dataset, a knowledge base, and an ongoing effort to connect structured image concepts to language. It includes 1.7 million visual question-answer pairs. [[download]](https://visualgenome.org/)
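For the VQA entry, the download page ships questions and annotations as separate JSON files. A sketch of iterating over a questions file, assuming the v2 naming convention and a top-level "questions" list with image_id/question/question_id fields (both assumptions based on the dataset documentation):

```python
import json

# Filename follows the pattern on visualqa.org (assumed here).
with open("v2_OpenEnded_mscoco_train2014_questions.json") as f:
    vqa_questions = json.load(f)["questions"]

for q in vqa_questions[:3]:
    # Each record ties a free-form question to a COCO image id.
    print(q["question_id"], q["image_id"], q["question"])
```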
From 21205a969c9a4737ab74332b73ec14b2b5df0c79 Mon Sep 17 00:00:00 2001
From: Kartik Perisetla
Date: Sat, 4 Jan 2020 14:31:52 -0800
Subject: [PATCH 4/4] adding natural questions dataset

---
 README.md | 1 +
 1 file changed, 1 insertion(+)

diff --git a/README.md b/README.md
index 5202d54..40e524c 100644
--- a/README.md
+++ b/README.md
@@ -24,6 +24,7 @@ Commonsense Stories, 2016 [[paper]](http://arxiv.org/abs/1604.01696) [[data]](ht
 * **(QuizBowl)** A Neural Network for Factoid Question Answering over Paragraphs, 2014 [[paper]](https://www.cs.umd.edu/~miyyer/pubs/2014_qb_rnn.pdf) [[data]](https://www.cs.umd.edu/~miyyer/qblearn/index.html)
 * **(MCTest)** MCTest: A Challenge Dataset for the Open-Domain Machine Comprehension of Text, 2013 [[paper]](http://research.microsoft.com/en-us/um/redmond/projects/mctest/MCTest_EMNLP2013.pdf) [[data]](http://research.microsoft.com/en-us/um/redmond/projects/mctest/data.html) [[alternate data link]](https://github.com/mcobzarenco/mctest/tree/master/data/MCTest)
 * **(QASent)** What is the Jeopardy model? A quasisynchronous grammar for QA, 2007 [[paper]](http://homes.cs.washington.edu/~nasmith/papers/wang+smith+mitamura.emnlp07.pdf) [[data]](http://cs.stanford.edu/people/mengqiu/data/qg-emnlp07-data.tgz)
+* **(Google Natural Questions)** Natural Questions: A Benchmark for Question Answering Research, 2019 [[paper]](https://research.google/pubs/pub47761/) [[download]](https://ai.google.com/research/NaturalQuestions)
 * **(DeepMind Question Answering Corpus)** Question answering dataset featured in "Teaching Machines to Read and Comprehend", 2015 [[repo]](https://github.com/deepmind/rc-data)
 * **(Amazon Question Answering Corpus)** Question and answer data from Amazon, totaling around 1.4 million answered questions [[download]](http://jmcauley.ucsd.edu/data/amazon/qa/)
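Natural Questions is large; its training set is distributed as gzipped JSON lines, which can be streamed without decompressing to disk first. A sketch assuming the simplified format (the shard name and field names such as question_text are assumptions; check the download page for the actual layout):

```python
import gzip
import json

# Assumed simplified-format shard; see ai.google.com/research/NaturalQuestions
PATH = "v1.0-simplified_simplified-nq-train.jsonl.gz"

with gzip.open(PATH, "rt", encoding="utf-8") as f:
    for i, line in enumerate(f):
        example = json.loads(line)  # one question + Wikipedia page per line
        print(example["question_text"])
        if i == 2:
            break
```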