QANTA from scratch

Preprocessing

  • Parse the CSV (using Python's built-in csv module)
  • Preserve the given split into train and test sets
  • Follow the paper's preprocessing by replacing all occurrences of answers in the question text with single entities (e.g., Ernest Hemingway becomes Ernest_Hemingway).
  • Train word embeddings on the preprocessed question text from the training data.
  • Apply the Stanford Parser to the question sentences and get each dependency tree as text
    • Check whether it drops sentences
    • Check whether it splits sentence strings that are, in fact, two or more sentences. This happens in the question in line 136 in the 20k dataset.
  • Build a word vocabulary that maps words to indices (later to be used in fetching word embeddings).
    • Either as a list where vocabulary.index(word) gives an index, or as a dict, where vocabulary[word] gives an index.
    • Make sure to include the answer strings in the vocab. They need to have their own embedded representations.
    • Consider whether answers should have spaces replaced with underscores, or whether we just keep them as a string with spaces.
  • Build a dependency list (in effect an ordered set; only unique entries) that maps each dependency relation string to an integer (later to be used for fetching the corresponding dependency matrix)
    • Same remarks as for the word vocabulary
  • Convert Stanford Parser dependency tree text to actual tree data structure
    • Create and use a class DependencyNode (a sketch of both classes follows this list):
      • has DependencyNode.word_index (the word's index in the vocabulary)
      • has DependencyNode.dependency_index (the index in the dependency list of the dependency between this node and its parent)
      • has DependencyNode.children (list of nodes)
    • Create and use a class DependencyTree:
      • has DependencyTree.root (the top node)
      • has DependencyTree.answer_index (the answer string's index in the vocabulary)
      • has DependencyTree.question_id, a uniquely identifying question id (expected to be given by the input CSV)
      • has DependencyTree.n_nodes(), giving the number of nodes in the tree
  • Figure out a meaningful way of storing trees, word vocabulary, dependency list
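
A minimal sketch of the two tree classes and the vocabulary mapping described above. The attribute and method names (word_index, dependency_index, children, root, answer_index, question_id, n_nodes()) follow the list; everything else (the Vocabulary helper, the recursive counting) is an implementation choice, not something settled yet.

```python
class Vocabulary:
    """Maps strings (words, answers, or dependency labels) to integer indices."""

    def __init__(self):
        self._index = {}  # string -> int

    def add(self, token):
        """Return the token's index, adding it to the vocabulary if unseen."""
        if token not in self._index:
            self._index[token] = len(self._index)
        return self._index[token]

    def __getitem__(self, token):
        return self._index[token]

    def __len__(self):
        return len(self._index)


class DependencyNode:
    """One word in a parsed question sentence."""

    def __init__(self, word_index, dependency_index, children=None):
        self.word_index = word_index              # the word's index in the word vocabulary
        self.dependency_index = dependency_index  # index (in the dependency list) of the relation
                                                  # between this node and its parent
        self.children = children if children is not None else []  # list of DependencyNode


class DependencyTree:
    """A parsed question sentence together with its answer."""

    def __init__(self, root, answer_index, question_id):
        self.root = root                  # the top DependencyNode
        self.answer_index = answer_index  # the answer string's index in the word vocabulary
        self.question_id = question_id    # unique question id taken from the input CSV

    def n_nodes(self):
        """Number of nodes in the tree, counted recursively from the root."""
        def count(node):
            return 1 + sum(count(child) for child in node.children)
        return count(self.root)
```

Plain classes like these can be serialized with pickle, which would be one possible answer to the storage question in the last bullet.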

Model training

  • Create a QANTAModel class with
    • A vector (word embedding) for each word in the vocabulary
    • A matrix (dependency embedding) for each dependency in the dependency list
    • The global "additional matrix" W_v (described alongside eq. 1)
    • The global bias term
  • Initialize word and dependency embeddings, and the additional matrix and bias
    • Look into what is considered best practice for initializing vectors and matrices
    • Consider initializing as many words as possible with precomputed vectors, such as those from word2vec
  • Remove stopwords as in the original QANTA code.
  • Implement a method (e.g. QANTAModel.calculate_embedding) that can calculate a DependencyTree's embedded representation (following eq. 4; see the sketch after this list)
  • Figure out why the original QANTA implementation does not use the rank calculation (eq 5)
  • Build an overall method QANTAModel.train that, given a list of DependencyTrees (and possibly a list of wrong answers?), will train the model.
    • Implement a method QANTAModel.sentence_error that calculates eq. 5, given
      • a sentence tree
      • a list of incorrect answers
    • Calculate objective function (eq. 6) (model.py:94)
    • Do backpropagation through structure / AdaGrad (eq. 7 and supplementary reading)
  • Use batches in training
  • Parallelize training
  • Consider whether the tree error is ever used in backpropagation, and if not, why not
  • Build an overall method QANTAModel.predict that, given a tree, returns the most likely answer (from the set of possible answers)
    • Consider using LogisticRegression for prediction, as in the original QANTA
  • Actually train a model on the training set
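
A minimal numpy sketch of how QANTAModel.calculate_embedding (eq. 4) and QANTAModel.sentence_error (eq. 5, here without the rank weighting from the paper) could look. The dimensionality d, the tanh nonlinearity, and the small-Gaussian initialization are placeholder assumptions, not settled choices.

```python
import numpy as np


class QANTAModel:
    """Sketch of the model parameters plus the eq. 4 / eq. 5 computations."""

    def __init__(self, vocab_size, n_dependencies, d=100, seed=0):
        rng = np.random.default_rng(seed)
        self.We = 0.01 * rng.standard_normal((vocab_size, d))         # one vector per word/answer
        self.Wr = 0.01 * rng.standard_normal((n_dependencies, d, d))  # one matrix per dependency
        self.Wv = 0.01 * rng.standard_normal((d, d))                  # global "additional matrix"
        self.b = np.zeros(d)                                          # global bias term

    def calculate_embedding(self, node):
        """Hidden representation of a DependencyNode (eq. 4):
        h_n = f(Wv * x_n + b + sum over children k of W_{R(n,k)} * h_k)."""
        total = self.Wv @ self.We[node.word_index] + self.b
        for child in node.children:
            total += self.Wr[child.dependency_index] @ self.calculate_embedding(child)
        return np.tanh(total)

    def sentence_error(self, tree, wrong_answer_indices):
        """Max-margin error summed over all nodes of the tree (eq. 5),
        without the rank weighting used in the paper."""
        correct = self.We[tree.answer_index]
        error = 0.0
        for h in self._all_hidden(tree.root):
            for z in wrong_answer_indices:
                error += max(0.0, 1.0 - correct @ h + self.We[z] @ h)
        return error

    def _all_hidden(self, node):
        """Hidden representations of every node in the subtree rooted at `node`.
        NOTE: embeddings are recomputed for each subtree here; a real
        implementation would cache them during a single bottom-up pass."""
        yield self.calculate_embedding(node)
        for child in node.children:
            yield from self._all_hidden(child)
```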

Evaluation

  • Evaluate the trained model's performance on the test set.
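
The evaluation itself can be a simple accuracy loop over the test trees, assuming QANTAModel.predict from the list above; the function name below is just a placeholder.

```python
def evaluate(model, test_trees):
    """Fraction of test questions whose predicted answer index matches the true one."""
    correct = sum(1 for tree in test_trees if model.predict(tree) == tree.answer_index)
    return correct / len(test_trees)
```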

Usability

  • Make it crystal clear what the expected input format is for preprocessing
  • Make it crystal clear what the expected input format is for the QANTA model
  • Ensure that every method has documentation for all arguments, and for the output
  • Provide examples
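
As a possible convention for the documentation points above, every method could carry a docstring that names each argument and the return value; the helper below is made up purely for illustration.

```python
def replace_answer_mentions(question_text, answer):
    """Replace every occurrence of the answer in the question text with a single token.

    Arguments:
        question_text: the raw question string from the input CSV.
        answer: the answer string, e.g. "Ernest Hemingway".

    Returns:
        The question text with each occurrence of the answer replaced by its
        underscore-joined form, e.g. "Ernest_Hemingway".
    """
    return question_text.replace(answer, answer.replace(" ", "_"))
```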

General checklist

  • Did you document your code well? Could your girlfriend/parent/computerphobic cousin tell what is going on in the code?
  • Did you write unit tests for the parts you developed? Did you push them?
  • Did you update this checklist?
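
For the unit-test point, a minimal pytest-style test of DependencyTree.n_nodes could look like this (assuming the class sketch from the preprocessing section):

```python
def test_n_nodes_counts_root_and_children():
    # Two children hanging off a root node -> three nodes in total.
    child_a = DependencyNode(word_index=1, dependency_index=0)
    child_b = DependencyNode(word_index=2, dependency_index=1)
    root = DependencyNode(word_index=0, dependency_index=0, children=[child_a, child_b])
    tree = DependencyTree(root=root, answer_index=3, question_id="q-1")
    assert tree.n_nodes() == 3
```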