Old sm-cnn has hardcoded paths that match TrecQA #83

snapbug · 2017-11-10T15:31:39Z

The TrecQA dataset comes in clean- and raw- versions, and the sm_cnn main file loads one of these hardcoded prefixes (with a really badly named/described argument to switch between them; code ref: https://github.com/castorini/Castor/blob/61c8c0e622caa69e8579174dee429901128dc7e1/sm_cnn/main.py#L135-L137). Other datasets (such as WikiQA) don't have this distinction, which leads to an exception being thrown:

Traceback (most recent call last):
  File "main.py", line 152, in <module>
    trainer.load_input_data(args.dataset_folder, cache_file, train_set, dev_set, test_set)
  File "/castorini/castor/sm_cnn/train.py", line 51, in load_input_data
    utils.read_in_dataset(dataset_root_folder, set_folder)
  File "/castorini/castor/sm_cnn/utils.py", line 141, in read_in_dataset
    questions = read_in_data(dataset_folder, set_folder, "a.toks", False, stop_punct, dash_split)
  File "/castorini/castor/sm_cnn/utils.py", line 98, in read_in_data
    with open(os.path.join(datapath, set_name, file), encoding='utf-8') as inf:
FileNotFoundError: [Errno 2] No such file or directory: '../../data/WikiQA/raw-test/a.toks'

I'm torn between whether this is the dataset loader's fault, or whether it may just be simpler to change the dataset generation to match TrecQA -- that is, by default, if a collection doesn't have the distinction, generate everything under the raw- prefix.

- If the dataset generator changes, then other model files are almost certainly going to break if they were written with a non-TrecQA dataset in mind.
- Other models written for the TrecQA dataset will almost certainly have a similar problem.
- Adding a flag for a prefix is yet more flags.
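
To make the generator-side option concrete, here is a minimal sketch of the "default to raw-" convention. It is not code from Castor: `split_folders`, `DEFAULT_PREFIX`, and the split names train/dev/test are hypothetical, chosen only to illustrate the idea.

```python
import os

# Hypothetical default: collections without a raw/clean distinction
# are written under the "raw-" prefix, mirroring the TrecQA layout.
DEFAULT_PREFIX = "raw-"


def split_folders(dataset_root, splits=("train", "dev", "test"), prefixes=None):
    """Return the folders a dataset generator would write to.

    `prefixes` lists the variants a collection has (e.g. ["raw-", "clean-"]).
    If it is empty or None, everything goes under DEFAULT_PREFIX.
    """
    if not prefixes:
        prefixes = [DEFAULT_PREFIX]
    return [
        os.path.join(dataset_root, prefix + split)
        for prefix in prefixes
        for split in splits
    ]


# A collection with no variants gets raw-train, raw-dev, raw-test.
wikiqa_dirs = split_folders(os.path.join("data", "WikiQA"))

# A collection with both variants gets six folders.
trecqa_dirs = split_folders(os.path.join("data", "TrecQA"), prefixes=["raw-", "clean-"])
```

With this convention the loader's hardcoded `raw-test` path would exist for WikiQA, at the cost of the first bullet above: anything already written against unprefixed folders would break.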

Thoughts and comments?


gauravbaruah · 2017-11-10T18:09:27Z

It would be good to have a data-loader class for each dataset. Basically, the train_set, dev_set, test_set "names" need to be set up correctly to reflect the path to the respective sub-folders. TrecQA has 2 kinds of data (raw and clean), which was an added complication. Yes, this does need a better design.
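
A rough sketch of what per-dataset loader classes could look like. None of these class or method names exist in Castor; they are hypothetical. The WikiQA folder names (plain train/dev/test) are also an assumption, since the thread only shows that `raw-test` does not exist there.

```python
import os


class DatasetLoader:
    """Base class: each subclass declares which sub-folder holds each split."""

    name = None          # folder name of the dataset under the data root
    split_folders = {}   # maps "train"/"dev"/"test" to a sub-folder name

    def __init__(self, data_root):
        self.root = os.path.join(data_root, self.name)

    def split_path(self, split, filename):
        """Path to `filename` (e.g. "a.toks") inside the given split."""
        return os.path.join(self.root, self.split_folders[split], filename)


class TrecQALoader(DatasetLoader):
    """TrecQA has two variants, so the sub-folders carry a raw-/clean- prefix."""

    name = "TrecQA"

    def __init__(self, data_root, variant="raw"):
        super().__init__(data_root)
        if variant not in ("raw", "clean"):
            raise ValueError("variant must be 'raw' or 'clean'")
        self.split_folders = {
            split: "{}-{}".format(variant, split)
            for split in ("train", "dev", "test")
        }


class WikiQALoader(DatasetLoader):
    """WikiQA has no variants; folder names here are assumed, not confirmed."""

    name = "WikiQA"
    split_folders = {"train": "train", "dev": "dev", "test": "test"}


wikiqa_test = WikiQALoader("data").split_path("test", "a.toks")
trecqa_dev = TrecQALoader("data", variant="clean").split_path("dev", "a.toks")
```

The raw/clean distinction then lives only in the TrecQA class, so the model's main file would not need to hardcode a prefix or grow another flag.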

