Old sm-cnn has hardcoded paths that match TrecQA #83
Comments
It would be good to have a data-loader class for each dataset. Basically,
the train_set, dev_set, and test_set "names" need to be set up correctly to
reflect the paths to the respective sub-folders. TrecQA has two kinds of data
(raw and clean), which was an added complication. Yes, this does need a better
design.
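A minimal sketch of what such a per-dataset description could look like. Everything here is hypothetical and not taken from the Castor code: the `SplitNames` class, the two factory functions, and the folder names themselves (`raw-train`, `train`, and so on) are assumptions made for illustration.

```python
import os
from dataclasses import dataclass


@dataclass(frozen=True)
class SplitNames:
    """Sub-folder names for one dataset (hypothetical helper, not in Castor)."""
    train: str
    dev: str
    test: str

    def paths(self, dataset_root):
        # Map each split to its folder under the dataset root.
        return {
            "train": os.path.join(dataset_root, self.train),
            "dev": os.path.join(dataset_root, self.dev),
            "test": os.path.join(dataset_root, self.test),
        }


def trecqa_splits(variant="raw"):
    # Assumption: TrecQA folders are named "<variant>-<split>".
    if variant not in ("raw", "clean"):
        raise ValueError("TrecQA variant must be 'raw' or 'clean', got %r" % variant)
    return SplitNames(f"{variant}-train", f"{variant}-dev", f"{variant}-test")


def wikiqa_splits():
    # Assumption: WikiQA has no raw/clean distinction, so no prefix.
    return SplitNames("train", "dev", "test")
```

With something like this, the main file would ask the dataset for its folder names instead of building them from a hardcoded prefix.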
On Fri, Nov 10, 2017 at 10:31 AM Matt Crane ***@***.***> wrote:
The TrecQA dataset comes in clean- and raw- versions. The sm_cnn main
file loads one of these hardcoded prefixes (with a really badly
named/described argument to switch between them), code ref
<https://github.com/castorini/Castor/blob/61c8c0e622caa69e8579174dee429901128dc7e1/sm_cnn/main.py#L135-L137>,
but other datasets (such as WikiQA) don't have this distinction, which
leads to an exception being thrown:
Traceback (most recent call last):
  File "main.py", line 152, in <module>
    trainer.load_input_data(args.dataset_folder, cache_file, train_set, dev_set, test_set)
  File "/castorini/castor/sm_cnn/train.py", line 51, in load_input_data
    utils.read_in_dataset(dataset_root_folder, set_folder)
  File "/castorini/castor/sm_cnn/utils.py", line 141, in read_in_dataset
    questions = read_in_data(dataset_folder, set_folder, "a.toks", False, stop_punct, dash_split)
  File "/castorini/castor/sm_cnn/utils.py", line 98, in read_in_data
    with open(os.path.join(datapath, set_name, file), encoding='utf-8') as inf:
FileNotFoundError: [Errno 2] No such file or directory: '../../data/WikiQA/raw-test/a.toks'
I'm torn between whether this is the dataset loader's fault, or whether it
may just be simpler to change the dataset generation to match TrecQA --
that is, by default, if a collection doesn't have the distinction, generate
everything under the raw- prefix.
- If the dataset generator changes, then other model files are almost
certainly going to break if they were written with a non-TrecQA dataset in
mind.
- Other models written for the TrecQA dataset will almost certainly
have a similar problem.
- Adding a flag for a prefix is yet more flags.
Thoughts and comments?
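For comparison, a loader-side fix could be sketched as below. This is only an illustration under assumptions: `resolve_split_dir` is a hypothetical helper that does not exist in sm_cnn, and the fallback order (prefixed folder first, then the bare split name) is one possible choice, not something the current code does.

```python
import os


def resolve_split_dir(dataset_root, split, prefix="raw-"):
    """Return the folder for `split`, preferring the prefixed name.

    Hypothetical helper: tries e.g. "raw-test" first, then plain "test",
    so datasets without a raw/clean distinction still load.
    """
    candidates = [
        os.path.join(dataset_root, prefix + split),
        os.path.join(dataset_root, split),
    ]
    for candidate in candidates:
        if os.path.isdir(candidate):
            return candidate
    # Fail early with every path that was tried, instead of failing
    # later on a missing a.toks file.
    raise FileNotFoundError(
        "no folder for split %r under %r; tried: %s"
        % (split, dataset_root, ", ".join(candidates))
    )
```

This avoids a new flag and leaves the dataset generator alone, at the cost of the loader silently picking whichever layout it finds first.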