Error using searches counts-only with twarc2 from command line #615

Closed
igorbrigadir opened this issue Mar 24, 2022 · 6 comments · Fixed by #623
@igorbrigadir
Contributor

Via: https://twittercommunity.com/t/error-using-searches-counts-only-with-twarc2-from-command-line/168793

C:\Users\User_1>twarc2 searches --archive --counts-only --granularity day --start-time "2020-04-30" --end-time "2020-05-30" query_users.txt countstw.csv
0%|▏ | Processed 4/1500 lines of input file [01:19<8:16:32, 19.91s/it]

Traceback (most recent call last):
  File "C:\Users\39333\anaconda3\lib\runpy.py", line 197, in _run_module_as_main
    return _run_code(code, main_globals, None,
  File "C:\Users\39333\anaconda3\lib\runpy.py", line 87, in _run_code
    exec(code, run_globals)
  File "C:\Users\39333\anaconda3\Scripts\twarc2.exe\__main__.py", line 7, in <module>
  File "C:\Users\39333\anaconda3\lib\site-packages\click\core.py", line 1128, in __call__
    return self.main(*args, **kwargs)
  File "C:\Users\39333\anaconda3\lib\site-packages\click\core.py", line 1053, in main
    rv = self.invoke(ctx)
  File "C:\Users\39333\anaconda3\lib\site-packages\click\core.py", line 1659, in invoke
    return _process_result(sub_ctx.command.invoke(sub_ctx))
  File "C:\Users\39333\anaconda3\lib\site-packages\click\core.py", line 1395, in invoke
    return ctx.invoke(self.callback, **ctx.params)
  File "C:\Users\39333\anaconda3\lib\site-packages\click\core.py", line 754, in invoke
    return __callback(*args, **kwargs)
  File "C:\Users\39333\anaconda3\lib\site-packages\click\decorators.py", line 38, in new_func
    return f(get_current_context().obj, *args, **kwargs)
  File "C:\Users\39333\anaconda3\lib\site-packages\twarc\command2.py", line 1724, in searches
    for result in response:
  File "C:\Users\39333\anaconda3\lib\site-packages\twarc\client2.py", line 269, in _search
    == last_time_start
UnboundLocalError: local variable 'last_time_start' referenced before assignment

@igorbrigadir
Contributor Author

Input file: query_users.txt

@igorbrigadir
Contributor Author

Definitely a bug because it's not meant to be throwing that error - but the input file also had an issue:

from:923342569446236000 lang:it -is:reply -is:quote (\" vaccino \" OR \" vaccinazione \" OR \"vaccini\") 

(Also, I think the \" is redundant here because these are single words, not phrases, but it works regardless as far as I can tell.)

The user ID 923342569446236000 looks truncated: the trailing 000 suggests that at some point the file was opened in JavaScript or Excel and the 64-bit integer got corrupted. Running a query with a nonexistent user ID like this just gives back a blank response, which may be the source of the bug:

twarc2 counts "from:923342569446236000" --text

Maybe we should have a validation for spotting 000 at the end of IDs and issue a Warning?
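
A minimal sketch of what such a check could look like, assuming a plain regex over `from:` operators is good enough. The helper names `suspicious_ids` and `warn_on_suspicious_ids` and the 16-digit threshold are hypothetical and not part of twarc. A real ID can legitimately end in 000, so this only warns and never rejects the query:

```python
import re
import warnings

# Hypothetical helpers, not part of twarc.
# Matches numeric user IDs used with the from: operator, e.g. from:12345
FROM_ID = re.compile(r"\bfrom:(\d+)\b")


def suspicious_ids(query, min_digits=16):
    """Return numeric from: IDs that are long and end in 000.

    min_digits=16 is an assumed threshold: spreadsheets usually keep about
    15 significant digits, so longer IDs are the ones that get zero-padded.
    """
    return [
        user_id
        for user_id in FROM_ID.findall(query)
        if len(user_id) >= min_digits and user_id.endswith("000")
    ]


def warn_on_suspicious_ids(query):
    """Emit a warning for each ID that looks rounded; never raises."""
    for user_id in suspicious_ids(query):
        warnings.warn(
            f"User ID {user_id} ends in 000 and may have been truncated "
            "by a spreadsheet or a JavaScript number conversion."
        )
```

Calling `warn_on_suspicious_ids(line)` on each line of the input file before running it would flag the first query above and leave usernames and shorter IDs alone.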

@SamHames
Contributor

The bug is in the workaround for Twitter counts stopping prematurely - I think this should be reproducible with just a single count by itself?
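
For illustration only (this is not the twarc code, and the function names are made up): an UnboundLocalError like the one in the traceback typically appears when a variable is assigned only inside a loop and the loop body never runs, for example because the response came back empty. A sketch of the pattern and the usual fix:

```python
def last_bucket_start(buckets):
    """Buggy pattern: last_start is only bound if the loop body runs."""
    for bucket in buckets:
        last_start = bucket["start"]
    # Raises UnboundLocalError when buckets is empty.
    return last_start


def last_bucket_start_safe(buckets):
    """Same logic, but the name is bound before the loop."""
    last_start = None
    for bucket in buckets:
        last_start = bucket["start"]
    return last_start
```

Whether the actual fix should bind a default or skip the comparison entirely depends on how the workaround is meant to behave for empty responses.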

@SamHames
Contributor

I will investigate later today if I get a chance.

@igorbrigadir
Contributor Author

Another example of this:

twarc2 searches --archive --counts-only --granularity day --start-time "2020-04-30" --end-time "2020-05-30" query_users_covid_may_corr.txt countstw_querycovid_1_5k_may_corr.csv

 26%|██▌       | Processed 584/2255 lines of input file [20:30<58:41,  2.11s/it]

Traceback (most recent call last):
  File "/home/t495/.pyenv/versions/twarc/bin/twarc2", line 11, in <module>
    load_entry_point('twarc', 'console_scripts', 'twarc2')()
  File "/home/t495/.pyenv/versions/3.7.5/envs/twarc/lib/python3.7/site-packages/click/core.py", line 1128, in __call__
    return self.main(*args, **kwargs)
  File "/home/t495/.pyenv/versions/3.7.5/envs/twarc/lib/python3.7/site-packages/click/core.py", line 1053, in main
    rv = self.invoke(ctx)
  File "/home/t495/.pyenv/versions/3.7.5/envs/twarc/lib/python3.7/site-packages/click/core.py", line 1659, in invoke
    return _process_result(sub_ctx.command.invoke(sub_ctx))
  File "/home/t495/.pyenv/versions/3.7.5/envs/twarc/lib/python3.7/site-packages/click/core.py", line 1395, in invoke
    return ctx.invoke(self.callback, **ctx.params)
  File "/home/t495/.pyenv/versions/3.7.5/envs/twarc/lib/python3.7/site-packages/click/core.py", line 754, in invoke
    return __callback(*args, **kwargs)
  File "/home/t495/.pyenv/versions/3.7.5/envs/twarc/lib/python3.7/site-packages/click/decorators.py", line 38, in new_func
    return f(get_current_context().obj, *args, **kwargs)
  File "/home/t495/git/twarc/twarc/command2.py", line 1726, in searches
    for result in response:
  File "/home/t495/git/twarc/twarc/client2.py", line 269, in _search
    == last_time_start
UnboundLocalError: local variable 'last_time_start' referenced before assignment

Full input file (very long)
query_users_covid_may_corr.txt

This error is reproducible with the latest main version with just this:

twarc2 counts --archive --granularity day --start-time "2020-04-30" --end-time "2020-05-30" "from:743520688099758080 lang:it -is:reply -is:quote (\" vaccino \" OR \" vaccinazione \" OR \"vaccini\")" --text

I think the user with ID 743520688099758080 has been deleted.

I think it may be easier or better to have a general try / except around each "query" when working with batch input - the downside is that people may miss that something went wrong, but on the bright side it won't crash during a long process.
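
Roughly what I mean, as a sketch rather than a patch: `run_batch` and `run_query` are hypothetical names, with `run_query` standing in for whatever performs one search or count. Failures are logged as they happen and collected, so they can be reported again at the end instead of being silently lost:

```python
import logging

log = logging.getLogger("batch")


def run_batch(queries, run_query):
    """Run every query, collecting failures instead of stopping at the first.

    run_query is a placeholder callable for one search or count.
    Returns (results, failures); failures holds (line_number, query, error).
    """
    results, failures = [], []
    for line_number, query in enumerate(queries, start=1):
        try:
            results.append((query, run_query(query)))
        except Exception as error:
            # Log right away so the problem is visible during a long run.
            log.error("Line %d failed: %r (%s)", line_number, query, error)
            failures.append((line_number, query, repr(error)))
    return results, failures
```

Exiting with a non-zero status when `failures` is not empty would keep the failure hard to miss while still letting the rest of the batch finish.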

@SamHames
Contributor

SamHames commented Apr 1, 2022

I do have a partial fix for the original issue, but haven't had a chance to properly test it yet.

I think the question about error handling needs to be considered separately - I'd rather fail early and loudly where all the context is available. Also the logic is already more convoluted than I'd like, so I'd prefer to defer changes to when we have a better idea about #608

Related though, I have been thinking about an approach to resuming long running processes, which might make loud failure more palatable - on error you could fix the input file, then rerun the same command with a --resume option, maybe?
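
A rough sketch of the idea, under several assumptions: no `--resume` option exists yet, `process_with_resume` and the checkpoint file format are invented here, and the fixed input file is assumed to keep the same line order. Progress is written after each successful line, and an error still propagates loudly:

```python
import json
import os


def process_with_resume(lines, handle, checkpoint_path):
    """Process lines in order, skipping those a previous run completed.

    handle is a placeholder callable for one query. Any exception it raises
    propagates, but the checkpoint keeps the progress made so far.
    """
    done = 0
    if os.path.exists(checkpoint_path):
        with open(checkpoint_path) as f:
            done = json.load(f)["completed"]

    for index, line in enumerate(lines):
        if index < done:
            continue  # already handled by an earlier run
        handle(line)
        # Record progress only after the line succeeded.
        with open(checkpoint_path, "w") as f:
            json.dump({"completed": index + 1}, f)
```

After a crash, fixing the offending line in place and rerunning with the same checkpoint path would pick up at that line.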
