Scripts needed to crawl and scrape Hackforums, along with some of the analysis performed in the paper.
Below are the setup and configuration steps required before running the crawler, followed by instructions for running it.
Operating system
sudo pacman -S tor torsocks privoxy sqlite3 tesseract
Python
pip install -r requirements.txt
Playwright
# Once inside the virtual environment
playwright install
Add the following lines to the virtual environment's activation script so that the tor and privoxy services start and stop together with the environment
# Add to the end of the activation file
sudo systemctl start tor
sudo systemctl start privoxy
# Add to the end of the deactivate function
sudo systemctl stop tor
sudo systemctl stop privoxy
# /etc/privoxy/config
forward-socks5t / 127.0.0.1:9050 .
keep-alive-timeout 600
default-server-timeout 600
socket-timeout 600
tor --hash-password <PASS>
# In /etc/tor/torrc
ControlPort 9051
HashedControlPassword <GENERATEDHASH>
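With ControlPort and HashedControlPassword set as above, a client can authenticate against the Tor control port and ask for a fresh circuit. The following is a minimal sketch using only the standard library; it is not taken from the crawler itself, and the helper names build_newnym_commands and request_new_identity are hypothetical. It assumes Tor is listening on 127.0.0.1:9051 and that the password passed in is the plain-text <PASS> given to tor --hash-password, not the generated hash.

```python
import socket


def build_newnym_commands(password: str) -> bytes:
    """Build the control-port commands that authenticate and request a new circuit."""
    # The password travels as a quoted string, so escape backslashes and quotes.
    escaped = password.replace("\\", "\\\\").replace('"', '\\"')
    commands = f'AUTHENTICATE "{escaped}"\r\nSIGNAL NEWNYM\r\nQUIT\r\n'
    return commands.encode("utf-8")


def request_new_identity(password: str, host: str = "127.0.0.1",
                         port: int = 9051, timeout: float = 10.0) -> str:
    """Send the commands to the control port and return Tor's raw reply."""
    with socket.create_connection((host, port), timeout=timeout) as conn:
        conn.sendall(build_newnym_commands(password))
        chunks = []
        while True:
            data = conn.recv(4096)
            if not data:  # Tor closes the connection after QUIT
                break
            chunks.append(data)
    return b"".join(chunks).decode("utf-8", errors="replace")


if __name__ == "__main__":
    # Replace <PASS> with the plain-text control password (placeholder, assumption).
    print(request_new_identity("<PASS>"))
```

A successful exchange answers each command with a line starting with 250; a line starting with 515 indicates that authentication failed.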
# Normal IP
curl http://ifconfig.me
# IP through tor
torify curl http://ifconfig.me
curl -x 127.0.0.1:8118 https://ifconfig.me
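The same check can be done from Python, which is closer to how the crawler will reach the network. This is a minimal sketch, assuming Privoxy listens on its default address 127.0.0.1:8118 (the one used in the curl command above) and that https://ifconfig.me/ip returns the caller's address as plain text; the function names are hypothetical and not part of the repository.

```python
import urllib.request


def privoxy_proxies(host: str = "127.0.0.1", port: int = 8118) -> dict:
    """Return a proxy mapping that sends HTTP and HTTPS traffic through Privoxy."""
    proxy_url = f"http://{host}:{port}"
    return {"http": proxy_url, "https": proxy_url}


def build_privoxy_opener(host: str = "127.0.0.1", port: int = 8118):
    """Build a urllib opener that routes every request through Privoxy."""
    handler = urllib.request.ProxyHandler(privoxy_proxies(host, port))
    return urllib.request.build_opener(handler)


def current_ip(opener, url: str = "https://ifconfig.me/ip", timeout: float = 30.0) -> str:
    """Fetch the public IP address as seen by the remote service."""
    with opener.open(url, timeout=timeout) as response:
        return response.read().decode("utf-8").strip()


if __name__ == "__main__":
    # An opener with an empty ProxyHandler bypasses any proxy for comparison.
    direct = urllib.request.build_opener(urllib.request.ProxyHandler({}))
    print("Normal IP:     ", current_ip(direct))
    print("IP through tor:", current_ip(build_privoxy_opener()))
```

If both lines print the same address, traffic is not going through Tor and the Privoxy forwarding rule should be rechecked.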
Stealthy Crawling using Scrapy, Tor and Privoxy
Tor installation and usage
sqlite3 hackforums.db < create.sql
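After creating the database it can be useful to confirm that the schema was actually loaded. The sketch below lists the tables found in the file; it makes no assumption about what create.sql defines, and the function name list_tables is hypothetical.

```python
import sqlite3


def list_tables(db_path: str = "hackforums.db") -> list:
    """Return the names of all tables in the SQLite database, sorted by name."""
    # Note: sqlite3.connect creates an empty file if db_path does not exist.
    conn = sqlite3.connect(db_path)
    try:
        rows = conn.execute(
            "SELECT name FROM sqlite_master WHERE type = 'table' ORDER BY name"
        ).fetchall()
    finally:
        conn.close()
    return [row[0] for row in rows]


if __name__ == "__main__":
    tables = list_tables()
    if tables:
        print("Tables found:", ", ".join(tables))
    else:
        print("No tables found; check that create.sql was applied.")
```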
The spider files are located in HFCrypterAnalysis/hackforums/hackforums/spiders
Before running the crawler, create a ScrapeOps account and add its API key to HFCrypterAnalysis/hackforums/hackforums/settings.py
First, run the crawler that collects the marketplace thread list
# Inside HFCrypterAnalysis/hackforums
scrapy crawl hackforums --nolog
Next, run the scraper that visits each thread; this step requires an account
# Inside HFCrypterAnalysis/hackforums
python3 hackforums/spiders/posts.py