tor-browser-crawler-video

This is a fork of the tor-browser-crawler. The original fork was by Nate Mathews, and Danny Campuzano forked it from him. I forked it from Danny to update it for the YouTube, Dailymotion, Vimeo, Rumble, and Facebook Watch interfaces as they stood between late 2022 and early 2023, and to add functionality to crawl the same platforms without using Tor. I'm running Ubuntu Server 22.04 VMs with 2 CPUs and 2 GB of RAM.

Steps

Install Docker

sudo apt-get install ca-certificates curl gnupg lsb-release
sudo mkdir -p /etc/apt/keyrings
curl -fsSL https://download.docker.com/linux/ubuntu/gpg | sudo gpg --dearmor -o /etc/apt/keyrings/docker.gpg
echo "deb [arch=$(dpkg --print-architecture) signed-by=/etc/apt/keyrings/docker.gpg] https://download.docker.com/linux/ubuntu $(lsb_release -cs) stable" | sudo tee /etc/apt/sources.list.d/docker.list > /dev/null
sudo apt-get update
sudo apt-get install docker-ce docker-ce-cli containerd.io docker-compose-plugin
sudo usermod -aG docker $USER
sudo systemctl enable docker.service
sudo systemctl disable containerd.service

Log out and log back in (so the docker group change takes effect), then build the Docker image

sudo apt install make
make build

Set up your crawl configuration files
- replace the contents of videos.txt with your list of YouTube, Facebook Watch, Vimeo, and Rumble URLs to crawl, followed by a comma and the duration in seconds
- edit Makefile to use the correct network interface (find yours with ip link)
- make any desired changes to config.ini
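
As a rough illustration of the videos.txt format described above, the following sketch parses entries of the form URL, comma, duration in seconds. The function name parse_video_list, the one-entry-per-line layout, and the sample URLs are assumptions made for illustration and are not taken from the crawler's code.

```python
from typing import List, Tuple


def parse_video_list(text: str) -> List[Tuple[str, int]]:
    """Parse lines of the form '<url>,<duration in seconds>'."""
    entries = []
    for raw_line in text.splitlines():
        line = raw_line.strip()
        if not line:
            continue  # skip blank lines
        # Split on the last comma so a URL that itself contains commas still parses.
        url, _, duration = line.rpartition(",")
        entries.append((url.strip(), int(duration)))
    return entries


# Placeholder URLs, not real videos.
sample = (
    "https://www.youtube.com/watch?v=EXAMPLE,120\n"
    "https://vimeo.com/000000000,95\n"
)
print(parse_video_list(sample))
```
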
Start the crawl
- make run launches a container and starts crawling with the Tor Browser
- make run-without-tor starts crawling with Firefox ESR without Tor
- the logs, packet captures, and screenshots appear in the results directory

Notes

Software and Library Versions
- This project was originally frozen to v8.0.2 of the TBB, and I've updated it to v12.0.5 with geckodriver v0.32.2
- I've changed the Docker base image from python:2.7 to debian:bookworm-slim for the latest Python3 and selenium, tbselenium, etc. packages
- Bookworm also provides the latest Firefox ESR and uBlock Origin for the run-without-tor option
- To use another TBB version, change the version number in Dockerfile and run make build again
I've changed the triggers for ending a packet capture. For YouTube, the trigger used to be the player status becoming ended. For a while I instead waited for the fraction of the video loaded to reach 1. Now, for all platforms, the capture ends after the expected playback duration of the video or after 5 minutes, whichever is shorter. It also ends early if the crawler gets the "detected unusual traffic" page from YouTube or doesn't see certain page elements (depending on the platform) within 30 seconds; in those cases it deletes the whole subdirectory in results for that visit.
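
The capture-length rule above amounts to taking the smaller of the expected duration and a five-minute cap, and removing the visit's output when a visit is abandoned. The sketch below shows that logic under assumptions: the names capture_seconds, discard_visit, and MAX_CAPTURE_SECONDS are hypothetical and the crawler's real code may be organized differently.

```python
import shutil
import tempfile
from pathlib import Path

MAX_CAPTURE_SECONDS = 5 * 60  # the five-minute cap described above


def capture_seconds(expected_duration: float) -> float:
    """Capture for the expected playback duration, but never longer than the cap."""
    return min(expected_duration, MAX_CAPTURE_SECONDS)


def discard_visit(visit_dir: Path) -> None:
    """Delete everything recorded for an abandoned visit."""
    shutil.rmtree(visit_dir, ignore_errors=True)


print(capture_seconds(120))  # a short video is captured in full
print(capture_seconds(900))  # a long video is cut off at the cap

# Demonstrate cleanup on a throwaway directory rather than a real results folder.
scratch = Path(tempfile.mkdtemp())
discard_visit(scratch)
print(scratch.exists())
```
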
About 50% of the time when using the Tor Browser, YouTube serves a page saying it "detected unusual traffic". If the crawler appears to be in the EU, YouTube shows a "Before you continue to YouTube" banner or page about cookies, which prevents most of the video from loading. The crawler first tries to reject cookies, then checks whether there's a video player. If there's no player, it terminates the visit. If there's a player but the video isn't playing, it presses play. After that, it looks for the skip ad(s) button every ten seconds and presses it if found, which handles both pre-roll and mid-roll ads as a human would.
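
The periodic ad-skipping behaviour described above can be sketched as a simple polling loop. This is a driver-agnostic illustration, not the crawler's implementation: skip_ads_while_playing and its parameters are hypothetical names, and find_skip_button stands in for whatever Selenium lookup the crawler uses to locate the skip button (it should return an object with a click() method, or None when no button is present).

```python
import time
from typing import Any, Callable, Optional


def skip_ads_while_playing(
    find_skip_button: Callable[[], Optional[Any]],
    total_seconds: float,
    poll_seconds: float = 10.0,
    sleep: Callable[[float], None] = time.sleep,
) -> int:
    """Poll for a skip button until total_seconds have passed; return the click count."""
    clicks = 0
    elapsed = 0.0
    while elapsed < total_seconds:
        button = find_skip_button()
        if button is not None:
            button.click()  # covers pre-roll and mid-roll ads alike
            clicks += 1
        # Never sleep past the end of the capture window.
        wait = min(poll_seconds, total_seconds - elapsed)
        sleep(wait)
        elapsed += wait
    return clicks
```

Passing sleep as a parameter keeps the loop easy to exercise without real delays, for example by supplying a no-op function in place of time.sleep.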
Facebook Watch will autoplay and doesn't show ads. Sometimes it shows a banner asking about cookies, but the video will load and play to completion behind it. The crawler supports accessing videos either through a facebook.com URL or a facebookwkhpilnemxj7asaniu7vnjjbiltxjqhye3mhbshg7kx5tfyd.onion URL when using Tor.
Vimeo requires the crawler to press play but there are no ads. If the crawler appears to be in the EU, Vimeo might show a cookies banner that blocks the play button, so the crawler looks to press the reject button on that banner before pressing play.
Rumble requires the crawler to press play and there are a lot of ads (even with uBlock Origin for the run-without-tor option), which the crawler tries to skip after waiting 30 seconds, as a human would.
At some point in early 2023, Dailymotion stopped working over Tor; the page loaded but the video player hung on "Retrieving Ad". Without Tor, Dailymotion autoplays after dealing with a cookie banner, and it shows a lot of ads, which I don't currently handle.
I've set the --snapshot-length to 71 bytes for tcpdump, so it only saves the Ethernet, IP, and TCP headers and TLS record lengths. Depending on the threat model used, our analysis needs these fields. We don't need the encrypted payloads for anything, and they would require orders of magnitude more storage space.
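
One plausible way to arrive at 71 bytes is sketched below. The breakdown is an assumption, since the text above does not spell it out: it presumes IPv4 without options, a TCP header carrying the timestamp option (padded to 12 bytes), and a TLS record header that starts at the beginning of the TCP payload.

```python
# Header sizes in bytes; the breakdown itself is an assumption (see lead-in).
ETHERNET_HEADER = 14       # destination MAC, source MAC, EtherType
IPV4_HEADER = 20           # IPv4 header without options
TCP_HEADER = 20            # base TCP header
TCP_TIMESTAMP_OPTION = 12  # 10-byte timestamp option plus 2 bytes of padding
TLS_RECORD_HEADER = 5      # content type (1), version (2), length (2)

SNAPSHOT_LENGTH = (
    ETHERNET_HEADER
    + IPV4_HEADER
    + TCP_HEADER
    + TCP_TIMESTAMP_OPTION
    + TLS_RECORD_HEADER
)
print(SNAPSHOT_LENGTH)  # 71
```
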
Using the run-without-tor option, YouTube streams video over the QUIC protocol, so I've changed the tcpdump filter to capture UDP in addition to TCP.
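
For illustration, a tcpdump invocation that truncates packets to 71 bytes and keeps both TCP and UDP traffic could be assembled as below. The helper tcpdump_command, the interface name, and the output path are hypothetical, and the filter expression is a guess at the general shape rather than the one the crawler actually uses.

```python
from typing import List


def tcpdump_command(interface: str, pcap_path: str, snapshot_length: int = 71) -> List[str]:
    """Build an argument list for tcpdump that captures truncated TCP and UDP packets."""
    return [
        "tcpdump",
        "-i", interface,                          # capture interface
        f"--snapshot-length={snapshot_length}",   # keep headers only, drop payloads
        "-w", pcap_path,                          # write raw packets to a pcap file
        "tcp or udp",                             # UDP is needed to see QUIC traffic
    ]


# Placeholder interface and path for demonstration only.
print(" ".join(tcpdump_command("eth0", "results/capture.pcap")))
```
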
I've changed the virtual display size from 1280x800 to 1920x1200, a more common resolution these days (based on the Dell XPS 13 laptop), because the Tor Browser was choosing a very small window size and preventing some page elements from being visible.
The default Docker settings often resulted in a Selenium WebDriverException saying "failed to decode response from marionette", followed by "tried to run command without establishing a connection", when running execute_script() commands, even though the page and video were loading. The fix was to give the container higher runtime resource limits, specifically memory and shared memory (see https://stackoverflow.com/questions/49734915/failed-to-decode-response-from-marionette-message-in-python-firefox-headless-s). This is included in the run command in Makefile.
