
Parameterize dateinc field, change the default date increment #3

Open
wants to merge 1 commit into master

Conversation

@rush00121

Instead of pulling data one day at a time, this extends the search filter parameter to cover a bigger date range. This should make the scraping process faster.
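A minimal sketch of the idea, assuming the scraper builds one Twitter search URL per date window (the function and parameter names below are illustrative, not the repo's actual identifiers):

```python
from datetime import date, timedelta

# Hypothetical sketch of the PR's change: instead of one search URL per
# day, step through the range in windows of `dateinc` days.
def date_windows(start, end, dateinc=50):
    """Yield (since, until) pairs covering [start, end) in steps of dateinc days."""
    step = timedelta(days=dateinc)
    since = start
    while since < end:
        until = min(since + step, end)
        yield since, until
        since = until

def search_url(user, since, until):
    # Twitter search supports since:/until: date filters on a query.
    return ('https://twitter.com/search?f=tweets&q='
            'from%3A{}%20since%3A{}%20until%3A{}'.format(user, since, until))

for since, until in date_windows(date(2017, 1, 1), date(2017, 2, 1), dateinc=50):
    print(search_url('realdonaldtrump', since, until))
```

With `dateinc=1` this degenerates to the original day-by-day behavior, so the window size becomes a tunable trade-off between speed and completeness.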

@bpb27 (Owner) commented Mar 29, 2017

In my experience, you actually get fewer results when you use an interval, as opposed to going day by day. Have you tried running this on a user with a lot of tweets (20K+) and compared the total with the interval method?

@rush00121 (Author)

I was trying to get tweets from realdonaldtrump, who has > 30k tweets. I refactored the code and ran it with a date interval of 50 days. It was way faster than fetching one day at a time. I did not record metrics to prove this, but it definitely sped up the scraping process for me.

@bpb27 (Owner) commented Mar 29, 2017

By fewer results I mean you only get/collect 28K total tweets (with the interval method) instead of the 30K total tweets (with the day-by-day method).

@rush00121 (Author)

I did not compare the number of tweets scraped. Let me test it and see whether both methods return the same results.

@rush00121 (Author)

I tested the date range 2010-01-01 to 2017-03-01.

With the previous code: total tweet count: 26141

With my modifications: total tweet count: 27357

I also took a smaller sample, 2017-01-01 to 2017-02-01. Both runs gave a total tweet count of 204.

I am not sure why my code gave more results in the previous run. Is it a timeout issue with the Twitter page, or something else?

But in both cases, I got a significant speed improvement.

@ryanbateman

I also ran this PR branch and got 14479 IDs for 2010-01-01 to 2017-06-30. This may well be an artifact of how pages load on my machine (a factor that would be an issue regardless), but it is definitely worth considering.

@ryanbateman

Increasing the page-load wait time to 2 seconds netted 15687 IDs for that same time period.
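For context, the wait in question is the pause after each page load before tweet IDs are read out of the DOM. A sketch of what bumping it looks like, assuming a Selenium-based scraper of that era (the variable name, CSS selector, and attribute are assumptions, not the repo's actual code):

```python
import time
from selenium import webdriver

load_delay = 2  # seconds to wait after each page load (up from 1)

driver = webdriver.Firefox()
driver.get('https://twitter.com/search?f=tweets&q=from%3Arealdonaldtrump'
           '%20since%3A2017-01-01%20until%3A2017-02-01')
time.sleep(load_delay)  # give slow pages time to render all results

# Assumed 2017-era Twitter markup: tweet elements carry a data-tweet-id.
ids = [tweet.get_attribute('data-tweet-id')
       for tweet in driver.find_elements_by_css_selector('div.tweet')]
print(len(ids))
driver.quit()
```

The trade-off is direct: a longer delay collects more of what each page actually shows, at the cost of some of the speed the interval change was meant to buy.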
