QANTA from scratch

Preprocessing

  • Parse the CSV (using Python's built-in csv module)
  • Preserve the given split into train and test sets
  • Follow the paper's preprocessing by replacing all occurrences of answers in the question text with single entities (e.g., Ernest Hemingway becomes Ernest_Hemingway).
  • Train word embeddings on the preprocessed question text from the training data.
  • Apply the Stanford Parser to the question sentences and get each dependency tree as text
    • Check whether it drops sentences
    • Check whether it splits sentence strings that are, in fact, two or more sentences. This happens in the question in line 136 in the 20k dataset.
  • Build a word vocabulary that maps words to indices (later to be used in fetching word embeddings).
    • Either as a list where vocabulary.index(word) gives an index, or as a dict, where vocabulary[word] gives an index.
    • Make sure to include the answer strings in the vocab. They need to have their own embedded representations.
    • Consider whether answers should have spaces replaced with underscores, or whether we just keep them as a string with spaces.
  • Build a dependency list (in effect an ordered set; only unique entries) that maps each dependency relation string to an integer (later to be used for fetching the corresponding dependency matrix)
    • Same remarks as for the word vocabulary
  • Convert Stanford Parser dependency tree text to actual tree data structure
    • Create and use a class DependencyNode (a sketch of both classes follows this list):
      • has DependencyNode.word_index (the word's index in the vocabulary)
      • has DependencyNode.dependency_index (the index in the dependency list of the dependency between this node and its parent)
      • has DependencyNode.children (list of nodes)
    • Create and use a class DependencyTree:
      • has DependencyTree.root (the top node)
      • has DependencyTree.answer_index (the answer string's index in the vocabulary)
      • has DependencyTree.question_id, a uniquely identifying question id (expected to be given by the input CSV)
      • has DependencyTree.n_nodes(), giving the number of nodes in the tree
  • Figure out a meaningful way of storing trees, word vocabulary, dependency list
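
A minimal sketch of the two tree classes and the vocabulary mapping described above. The attribute and method names (word_index, dependency_index, children, root, answer_index, question_id, n_nodes()) follow the list; everything else (the Vocabulary helper, the recursive counting) is an implementation choice, not something settled yet.

```python
class Vocabulary:
    """Maps strings (words, answers, or dependency labels) to integer indices."""

    def __init__(self):
        self._index = {}  # string -> int

    def add(self, token):
        """Return the token's index, adding it to the vocabulary if unseen."""
        if token not in self._index:
            self._index[token] = len(self._index)
        return self._index[token]

    def __getitem__(self, token):
        return self._index[token]

    def __len__(self):
        return len(self._index)


class DependencyNode:
    """One word in a parsed question sentence."""

    def __init__(self, word_index, dependency_index, children=None):
        self.word_index = word_index              # the word's index in the word vocabulary
        self.dependency_index = dependency_index  # index (in the dependency list) of the relation
                                                  # between this node and its parent
        self.children = children if children is not None else []  # list of DependencyNode


class DependencyTree:
    """A parsed question sentence together with its answer."""

    def __init__(self, root, answer_index, question_id):
        self.root = root                  # the top DependencyNode
        self.answer_index = answer_index  # the answer string's index in the word vocabulary
        self.question_id = question_id    # unique question id taken from the input CSV

    def n_nodes(self):
        """Number of nodes in the tree, counted recursively from the root."""
        def count(node):
            return 1 + sum(count(child) for child in node.children)
        return count(self.root)
```

Plain classes like these can be serialized with pickle, which would be one possible answer to the storage question in the last bullet.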

Model training

  • Create a QANTAModel class with
    • A vector (word embedding) for each word in the vocabulary
    • A matrix (dependency embedding) for each dependency in the dependency list
    • The global "additional matrix" W_v (described alongside eq. 1)
    • The global bias term
  • Initialize word and dependency embeddings, and the additional matrix and bias
    • Look into what is considered best practice for initializing vectors and matrices
    • Consider initializing as many words as possible with precomputed vectors, such as those from word2vec
  • Remove stopwords as in the original QANTA code.
  • Implement a method (e.g. QANTAModel.calculate_embedding) that can calculate a DependencyTree's embedded representation (following eq. 4; see the sketch after this list)
  • Figure out why the original QANTA implementation does not use the rank calculation (eq 5)
  • Build an overall method QANTAModel.train that, given a list of DependencyTrees (and possibly a list of wrong answers?), will train the model.
    • Implement a method QANTAModel.sentence_error that calculates eq. 5, given
      • a sentence tree
      • a list of incorrect answers
    • Calculate objective function (eq. 6) (model.py:94)
    • Do backpropagation through structure / AdaGrad (eq. 7 and supplementary reading)
  • Use batches in training
  • Parallelize training
  • Consider whether the tree error is ever used in backpropagation, and if not, why not
  • Build an overall method QANTAModel.predict that, given a tree, returns the most likely answer (from the set of possible answers)
    • Consider using LogisticRegression for prediction, as in the original QANTA
  • Actually train a model on the training set
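
A minimal numpy sketch of how QANTAModel.calculate_embedding (eq. 4) and QANTAModel.sentence_error (eq. 5, here without the rank weighting from the paper) could look. The dimensionality d, the tanh nonlinearity, and the small-Gaussian initialization are placeholder assumptions, not settled choices.

```python
import numpy as np


class QANTAModel:
    """Sketch of the model parameters plus the eq. 4 / eq. 5 computations."""

    def __init__(self, vocab_size, n_dependencies, d=100, seed=0):
        rng = np.random.default_rng(seed)
        self.We = 0.01 * rng.standard_normal((vocab_size, d))         # one vector per word/answer
        self.Wr = 0.01 * rng.standard_normal((n_dependencies, d, d))  # one matrix per dependency
        self.Wv = 0.01 * rng.standard_normal((d, d))                  # global "additional matrix"
        self.b = np.zeros(d)                                          # global bias term

    def calculate_embedding(self, node):
        """Hidden representation of a DependencyNode (eq. 4):
        h_n = f(Wv * x_n + b + sum over children k of W_{R(n,k)} * h_k)."""
        total = self.Wv @ self.We[node.word_index] + self.b
        for child in node.children:
            total += self.Wr[child.dependency_index] @ self.calculate_embedding(child)
        return np.tanh(total)

    def sentence_error(self, tree, wrong_answer_indices):
        """Max-margin error summed over all nodes of the tree (eq. 5),
        without the rank weighting used in the paper."""
        correct = self.We[tree.answer_index]
        error = 0.0
        for h in self._all_hidden(tree.root):
            for z in wrong_answer_indices:
                error += max(0.0, 1.0 - correct @ h + self.We[z] @ h)
        return error

    def _all_hidden(self, node):
        """Hidden representations of every node in the subtree rooted at `node`.
        NOTE: embeddings are recomputed for each subtree here; a real
        implementation would cache them during a single bottom-up pass."""
        yield self.calculate_embedding(node)
        for child in node.children:
            yield from self._all_hidden(child)
```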

Evaluation

  • Evaluate the trained model's performance on the test set.
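
The evaluation itself can be a simple accuracy loop over the test trees, assuming QANTAModel.predict from the list above; the function name below is just a placeholder.

```python
def evaluate(model, test_trees):
    """Fraction of test questions whose predicted answer index matches the true one."""
    correct = sum(1 for tree in test_trees if model.predict(tree) == tree.answer_index)
    return correct / len(test_trees)
```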

Usability

  • Make it crystal clear what the expected input format is for preprocessing
  • Make it crystal clear what the expected input format is for the QANTA model
  • Ensure that every method has documentation for all arguments, and for the output
  • Provide examples
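
As a possible convention for the documentation points above, every method could carry a docstring that names each argument and the return value; the helper below is made up purely for illustration.

```python
def replace_answer_mentions(question_text, answer):
    """Replace every occurrence of the answer in the question text with a single token.

    Arguments:
        question_text: the raw question string from the input CSV.
        answer: the answer string, e.g. "Ernest Hemingway".

    Returns:
        The question text with each occurrence of the answer replaced by its
        underscore-joined form, e.g. "Ernest_Hemingway".
    """
    return question_text.replace(answer, answer.replace(" ", "_"))
```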

General checklist

  • Did you document your code well? Could your girlfriend/parent/computerphobic cousin tell what is going on in the code?
  • Did you write unit tests for the parts you developed? Did you push them?
  • Did you update this checklist?
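
For the unit-test point, a minimal pytest-style test of DependencyTree.n_nodes could look like this (assuming the class sketch from the preprocessing section):

```python
def test_n_nodes_counts_root_and_children():
    # Two children hanging off a root node -> three nodes in total.
    child_a = DependencyNode(word_index=1, dependency_index=0)
    child_b = DependencyNode(word_index=2, dependency_index=1)
    root = DependencyNode(word_index=0, dependency_index=0, children=[child_a, child_b])
    tree = DependencyTree(root=root, answer_index=3, question_id="q-1")
    assert tree.n_nodes() == 3
```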