- Parse CSV (using built-in csv module)
- Ensure the division into training and test sets is preserved
- Follow the paper's preprocessing by replacing all occurrences of answers in the question text with single entities (e.g., Ernest Hemingway becomes Ernest_Hemingway).
- Train word embeddings on the preprocessed question text from the training data.
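The entity replacement step above can be sketched as follows. This is a minimal sketch, not the paper's code: the function name `merge_answer_entity` is hypothetical, and matching the answer case-insensitively is an assumption that may or may not suit the dataset.

```python
import re


def merge_answer_entity(question_text, answer):
    """Replace every occurrence of `answer` in the text with a single token.

    Assumption: matching is case-insensitive and the merged token always
    uses the casing of `answer` itself.
    """
    entity = answer.replace(" ", "_")
    pattern = re.compile(re.escape(answer), flags=re.IGNORECASE)
    # A function is used as the replacement so that backslashes in the
    # entity are never interpreted as regex escapes.
    return pattern.sub(lambda match: entity, question_text)


print(merge_answer_entity("Ernest Hemingway wrote this novel.", "Ernest Hemingway"))
```

Answers that are substrings of other words would also be replaced by this sketch, so adding word-boundary checks may be worth considering.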
- Apply the Stanford Parser to the question sentences and get each dependency tree as text
- Check whether it drops sentences
- Check whether it splits sentence strings that are, in fact, two or more sentences. This happens in the question in line 136 in the 20k dataset.
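The two checks above could be automated roughly as below. This sketch assumes the parser is run with typed-dependency text output, where each line looks like `nsubj(wrote-2, Hemingway-1)` and each parsed sentence forms a block separated by a blank line. The names `parse_dependency_line` and `split_into_sentence_blocks` are hypothetical.

```python
import re

# Matches one typed-dependency line such as "nsubj(wrote-2, Hemingway-1)".
# The optional trailing apostrophes are assumed to mark copied nodes.
DEPENDENCY_LINE = re.compile(
    r"^(?P<relation>[^\s(]+)\((?P<head>.+)-(?P<head_pos>\d+)'*, "
    r"(?P<word>.+)-(?P<word_pos>\d+)'*\)$"
)


def parse_dependency_line(line):
    """Return (relation, head, head position, word, word position)."""
    match = DEPENDENCY_LINE.match(line.strip())
    if match is None:
        raise ValueError(f"not a typed dependency: {line!r}")
    return (
        match["relation"],
        match["head"],
        int(match["head_pos"]),
        match["word"],
        int(match["word_pos"]),
    )


def split_into_sentence_blocks(parser_output):
    """Split parser output into one list of lines per parsed sentence."""
    blocks = re.split(r"\n\s*\n", parser_output.strip())
    return [block.splitlines() for block in blocks if block.strip()]
```

If a single input sentence yields zero blocks it was dropped, and if it yields more than one block the parser split it.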
- Build a word vocabulary that maps words to indices (later to be used in fetching word embeddings).
- Either as a list where vocabulary.index(word) gives an index, or as a dict, where vocabulary[word] gives an index.
- Make sure to include the answer strings in the vocab. They need to have their own embedded representations.
- Consider whether answers should have spaces replaced with underscores, or whether we just keep them as a string with spaces.
- Build a dependency list (in fact an ordered set or vocabulary; only unique entries) that allows mapping a dependency string to an integer (later to be used for fetching the dependency matrix)
- Same remarks as for the word vocabulary
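A dict-based vocabulary, as discussed above, might look like the sketch below. The function name `build_vocabulary` and its parameters are hypothetical; the same function could serve for the dependency list by passing dependency strings instead of words.

```python
def build_vocabulary(token_lists, extra_entries=()):
    """Map each unique token to an integer index, in order of first appearance.

    `extra_entries` is meant for answer strings, so that every answer gets
    its own index (and later its own embedding) even if it never occurs in
    the question text.
    """
    vocabulary = {}
    for tokens in token_lists:
        for token in tokens:
            vocabulary.setdefault(token, len(vocabulary))
    for entry in extra_entries:
        vocabulary.setdefault(entry, len(vocabulary))
    return vocabulary


questions = [["who", "wrote", "it"], ["who", "painted", "it"]]
vocabulary = build_vocabulary(questions, extra_entries=["Ernest_Hemingway"])
print(vocabulary["painted"])
```

A dict lookup takes constant time on average, while `list.index` scans the list, so the dict is likely the better fit for large vocabularies.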
- Convert Stanford Parser dependency tree text to actual tree data structure
- Create and use a class `DependencyNode`:
  - has `DependencyNode.word_index` (the word's index in the vocabulary)
  - has `DependencyNode.dependency_index` (the index in the dependency list of the dependency between this node and its parent)
  - has `DependencyNode.children` (list of nodes)
- Create and use a class `DependencyTree`:
  - has `DependencyTree.root` (the top node)
  - has `DependencyTree.answer_index` (the answer string's index in the vocabulary)
  - has `DependencyTree.question_id`, a uniquely identifying question id (expected to be given by the input CSV)
  - has `DependencyTree.n_nodes()`, giving the number of nodes in the tree
- Figure out a meaningful way of storing trees, word vocabulary, dependency list
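The two classes described above could be sketched with dataclasses as follows. The attribute names come from the checklist. The helper `count_nodes`, the use of `None` as the root's `dependency_index`, and the string type of `question_id` are assumptions made for illustration.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class DependencyNode:
    word_index: int
    # Assumption: the root has no parent, so its dependency index is None.
    dependency_index: Optional[int] = None
    children: List["DependencyNode"] = field(default_factory=list)

    def count_nodes(self):
        """Count this node plus all of its descendants (hypothetical helper)."""
        return 1 + sum(child.count_nodes() for child in self.children)


@dataclass
class DependencyTree:
    root: DependencyNode
    answer_index: int
    question_id: str

    def n_nodes(self):
        """Return the number of nodes in the tree."""
        return self.root.count_nodes()


leaf = DependencyNode(word_index=3, dependency_index=1)
middle = DependencyNode(word_index=2, dependency_index=0, children=[leaf])
other = DependencyNode(word_index=1, dependency_index=0)
tree = DependencyTree(
    root=DependencyNode(word_index=0, children=[middle, other]),
    answer_index=4,
    question_id="q1",
)
print(tree.n_nodes())
```

Because the classes hold only integers, strings, and lists, they should be storable with `pickle` or convertible to JSON, which may help with the storage question above.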
- Create a QANTAModel class with
- A vector (word embedding) for each word in the vocabulary
- A matrix (dependency embedding) for each dependency in the dependency list
- The global "additional matrix" W_v (eq. 1 description)
- The global bias term
- Initialize word and dependency embeddings, and the additional matrix and bias
- Look into what is considered best practice for initialization of vectors and matrices
- Consider initializing as many words as possible with precomputed vectors, such as those from word2vec
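One possible starting point for initialization is small uniform random values, sketched below with the standard library only. The scale of 0.05, the zero bias, and the names `init_parameters`, `init_vector`, and `init_matrix` are all assumptions, not something taken from the paper or the original QANTA code.

```python
import random


def init_vector(dim, rng, scale=0.05):
    """Return a vector with entries drawn uniformly from [-scale, scale]."""
    return [rng.uniform(-scale, scale) for _ in range(dim)]


def init_matrix(dim, rng, scale=0.05):
    """Return a dim x dim matrix as a list of rows."""
    return [init_vector(dim, rng, scale) for _ in range(dim)]


def init_parameters(n_words, n_dependencies, dim, seed=0):
    """Create all model parameters; a fixed seed keeps runs reproducible."""
    rng = random.Random(seed)
    return {
        "word_vectors": [init_vector(dim, rng) for _ in range(n_words)],
        "dependency_matrices": [init_matrix(dim, rng) for _ in range(n_dependencies)],
        "W_v": init_matrix(dim, rng),
        "b": [0.0] * dim,  # assumption: the bias starts at zero
    }
```

Words that have a precomputed vector (for example from word2vec) could then overwrite the corresponding rows of `word_vectors`, provided the dimensions agree.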
- Remove stopwords as in the original qanta code.
- Implement a method (e.g. `QANTAModel.calculate_embedding`) that can calculate a DependencyTree's embedded representation (following eq. 4)
- Figure out why the original QANTA implementation does not use the `rank` calculation (eq. 5)
- Build an overall method `QANTAModel.train` that, given a list of `DependencyTree`s (and possibly a list of wrong answers?), will train the model.
- Implemented a method `QANTA.sentence_error` that calculates eq. 5, given
  - a sentence tree
  - a list of incorrect answers
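The embedding and error calculations might be sketched as below. This rests on several assumptions that should be checked against the paper: that eq. 4 has the form h_n = f(W_v x_n + b + sum over children k of W_R(n,k) h_k), that eq. 5 is a hinge loss summed over all nodes and all incorrect answers, and that f is a plain `tanh`. The rank weighting is left out, as in the original QANTA implementation mentioned above. Nodes are plain tuples `(word_index, dependency_index, children)` standing in for `DependencyNode`, and `params` is a hypothetical dict of parameters.

```python
import math


def mat_vec(matrix, vector):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(m * v for m, v in zip(row, vector)) for row in matrix]


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def node_embeddings(node, params):
    """Return hidden vectors for the whole subtree; `node`'s own vector is last."""
    word_index, _, children = node
    total = mat_vec(params["W_v"], params["word_vectors"][word_index])
    total = [t + bias for t, bias in zip(total, params["b"])]
    hidden = []
    for child in children:
        child_hidden = node_embeddings(child, params)
        hidden.extend(child_hidden)
        # child[1] is the index of the dependency between child and parent.
        relation_matrix = params["dependency_matrices"][child[1]]
        contribution = mat_vec(relation_matrix, child_hidden[-1])
        total = [t + c for t, c in zip(total, contribution)]
    hidden.append([math.tanh(t) for t in total])
    return hidden


def sentence_error(root, answer_index, wrong_answer_indices, params):
    """Sum the hinge loss over every node and every incorrect answer."""
    error = 0.0
    for h in node_embeddings(root, params):
        correct_score = dot(params["word_vectors"][answer_index], h)
        for wrong_index in wrong_answer_indices:
            wrong_score = dot(params["word_vectors"][wrong_index], h)
            error += max(0.0, 1.0 - correct_score + wrong_score)
    return error
```

Returning the hidden vectors of all nodes, rather than only the root's, keeps the values that backpropagation through structure would need later.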
- Calculate objective function (eq. 6) (model.py:94)
- Do backpropagation through structure / AdaGrad (eq. 7 and supplementary reading)
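For the AdaGrad part, a single update step on a flat list of parameters can be sketched as follows. The function name, the default learning rate, and the placement of `epsilon` outside the square root are assumptions; computing the gradients by backpropagation through structure is not covered here.

```python
import math


def adagrad_update(params, grads, grad_squares, learning_rate=0.05, epsilon=1e-8):
    """Apply one AdaGrad step in place.

    `grad_squares` accumulates the squared gradients seen so far, so that
    parameters with a history of large gradients take smaller steps.
    """
    for i, gradient in enumerate(grads):
        grad_squares[i] += gradient * gradient
        step = learning_rate * gradient / (math.sqrt(grad_squares[i]) + epsilon)
        params[i] -= step


weights = [1.0, -1.0]
history = [0.0, 0.0]
adagrad_update(weights, [2.0, 0.0], history, learning_rate=0.1)
print(weights)
```

The accumulator must persist across updates, so it would likely live on the model next to the parameters it belongs to.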
- Use batches in training
- Parallelize training
- Consider whether the tree error is ever used in backpropagation, and why it is not if it isn't
- Build an overall method `QANTAModel.predict` that, given a tree, returns the most likely answer (from the set of possible answers)
- Consider using LogisticRegression in prediction, as in original QANTA
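Before bringing in logistic regression, a simple baseline for prediction could pick the answer whose embedding has the largest dot product with the tree's embedded representation. This is an assumption about what "most likely" could mean, not the original QANTA method. The function `predict` here is a standalone stand-in for `QANTAModel.predict`, and both of its parameters are hypothetical.

```python
def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def predict(tree_vector, answer_vectors):
    """Return the answer string whose vector scores highest against the tree.

    `answer_vectors` maps each candidate answer string to its embedding.
    """
    return max(
        answer_vectors,
        key=lambda answer: dot(answer_vectors[answer], tree_vector),
    )


candidates = {
    "Ernest_Hemingway": [0.9, 0.1],
    "Mark_Twain": [0.1, 0.9],
}
print(predict([1.0, 0.0], candidates))
```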
- Actually train a model on the training set
- Evaluate the trained model's performance on the test set.
- Make it crystal clear what the expected input format is for preprocessing
- Make it crystal clear what the expected input format is for the QANTA model
- Ensure that every method has documentation for all arguments, and for the output
- Provide examples
- Did you document your code well? Could your girlfriend/parent/computerphobic cousin tell what is going on in the code?
- Did you write unit tests for the parts you developed? Did you push them?
- Did you update this checklist?