Commit f9534ca

add youtube scraper
1 parent f9a7ff6 commit f9534ca

13 files changed: 6500 additions & 0 deletions

.github/workflows/test_scrapers.yaml

Lines changed: 12 additions & 0 deletions
@@ -158,6 +158,18 @@ jobs:
   test: test_page_scraping
 - project_dir: yelp-scraper
   test: test_search_scraping
+- project_dir: youtube-scraper
+  test: test_video_scraping
+- project_dir: youtube-scraper
+  test: test_comment_scraping
+- project_dir: youtube-scraper
+  test: test_channel_scraping
+- project_dir: youtube-scraper
+  test: test_channel_videos_scraping
+- project_dir: youtube-scraper
+  test: test_search_scraping
+- project_dir: youtube-scraper
+  test: test_shorts_scraping
 - project_dir: immoscout24-scraper
   test: test_search_scraping
 - project_dir: immoscout24-scraper
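
Each matrix entry in this hunk pairs a project directory with a test name. The sketch below shows how the six new entries could translate into test invocations. It assumes the workflow runs `poetry run pytest test.py -k <test>` inside each `project_dir`, which is the command form documented in the youtube-scraper README; the workflow's actual run step is not part of this hunk. `MATRIX` and `build_command` are hypothetical names used only for illustration.

```python
import shlex

# The six matrix entries added by this commit, copied from the hunk.
MATRIX = [
    {"project_dir": "youtube-scraper", "test": "test_video_scraping"},
    {"project_dir": "youtube-scraper", "test": "test_comment_scraping"},
    {"project_dir": "youtube-scraper", "test": "test_channel_scraping"},
    {"project_dir": "youtube-scraper", "test": "test_channel_videos_scraping"},
    {"project_dir": "youtube-scraper", "test": "test_search_scraping"},
    {"project_dir": "youtube-scraper", "test": "test_shorts_scraping"},
]


def build_command(entry: dict) -> str:
    """Build the assumed shell command for one matrix entry.

    The command form is an assumption based on the project README,
    not on the workflow file itself.
    """
    args = ["poetry", "run", "pytest", "test.py", "-k", entry["test"]]
    return f"cd {shlex.quote(entry['project_dir'])} && {shlex.join(args)}"


if __name__ == "__main__":
    for entry in MATRIX:
        print(build_command(entry))
```

Keeping one entry per test, rather than one entry per project, lets each scraping area pass or fail independently in CI.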

README.md

Lines changed: 17 additions & 0 deletions
@@ -565,6 +565,23 @@ Below is the list of available web scrapers for the supported domains along with
 <td><img src="https://img.shields.io/badge/Yelp_scraper-success-brightgreen" alt="Yelp-scraper-status"></td>
 </tr>
 
+<tr>
+<td><a href="/youtube-scraper/">YouTube.com</a></td>
+<td></td>
+<td>
+<ul>
+<li><a href="./youtube-scraper/results/channel_videos.json">Channel videos</a></li>
+<li><a href="./youtube-scraper/results/channels.json">Channel metadata</a></li>
+<li><a href="./youtube-scraper/results/channel_videos.json">Channel videos</a></li>
+<li><a href="./youtube-scraper/results/videos.json">Video metadata</a></li>
+<li><a href="./youtube-scraper/results/comments.json">Video comments</a></li>
+<li><a href="./youtube-scraper/results/shorts.json">Shorts metadata</a></li>
+</ul>
+</td>
+<td><img src="https://img.shields.io/badge/YouTube_scraper-success-brightgreen" alt="YouTube-scraper-status"></td>
+</tr>
+
+
 <tr>
 <td><a href="/zillow-scraper/">Zillow.com</a></td>
 <td><a href="https://scrapfly.io/blog/how-to-scrape-zillow/">How to Scrape Zillow Real Estate Property Data in Python</a></td>

youtube-scraper/README.md

Lines changed: 54 additions & 0 deletions
@@ -0,0 +1,54 @@
+# YouTube.com Scraper
+
+This scraper uses [scrapfly.io](https://scrapfly.io/) and Python to scrape public YouTube.com video, comment, channel, search, and Shorts data.
+
+Full tutorial
+
+The scraping code is located in the `youtube.py` file. It is fully documented and simplified for educational purposes. Example code for running the scraper can be found in the `run.py` file.
+
+This scraper scrapes:
+- YouTube video metadata
+- YouTube video comments
+- YouTube channel metadata
+- YouTube channel videos
+- YouTube search results
+- YouTube Shorts metadata
+
+For output examples, see the `./results` directory.
+
+
+## Fair Use Disclaimer
+
+Note that this code is provided free of charge as is, and Scrapfly does __not__ provide free web scraping support or consultation. To report bugs, see the issue tracker.
+
+## Setup and Use
+
+This YouTube.com scraper uses __Python 3.10__ with the [scrapfly-sdk](https://pypi.org/project/scrapfly-sdk/) package, which is used to scrape and parse YouTube's data.
+
+0. Ensure you have __Python 3.10__ and the [Poetry Python package manager](https://python-poetry.org/docs/#installation) installed on your system.
+1. Retrieve your Scrapfly API key from <https://scrapfly.io/dashboard> and set the `SCRAPFLY_KEY` environment variable:
+    ```shell
+    $ export SCRAPFLY_KEY="YOUR SCRAPFLY KEY"
+    ```
+2. Clone the repository and install the Python environment:
+    ```shell
+    $ git clone https://github.com/scrapfly/scrapfly-scrapers.git
+    $ cd scrapfly-scrapers/youtube-scraper
+    $ poetry install
+    ```
+3. Run the example scrape:
+    ```shell
+    $ poetry run python run.py
+    ```
+4. Run the tests:
+    ```shell
+    $ poetry install --with dev
+    $ poetry run pytest test.py
+    # or run specific scraping areas
+    $ poetry run pytest test.py -k test_video_scraping
+    $ poetry run pytest test.py -k test_comment_scraping
+    $ poetry run pytest test.py -k test_channel_scraping
+    $ poetry run pytest test.py -k test_channel_videos_scraping
+    $ poetry run pytest test.py -k test_search_scraping
+    $ poetry run pytest test.py -k test_shorts_scraping
+    ```
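
Step 1 of the README's setup depends on the `SCRAPFLY_KEY` environment variable being set before `run.py` is executed. The sketch below shows one way a script could read that variable and fail early with a clear message. It is an illustration under assumptions: `load_scrapfly_key` is a hypothetical helper and is not taken from `youtube.py` or `run.py`, whose contents are not shown in this part of the diff.

```python
import os
from typing import Mapping

ENV_VAR = "SCRAPFLY_KEY"  # variable name taken from step 1 of the README


def load_scrapfly_key(env: Mapping[str, str] = os.environ) -> str:
    """Return the API key from the environment, or raise if it is missing.

    Hypothetical helper: the real scraper may read the key differently.
    """
    key = env.get(ENV_VAR, "").strip()
    if not key:
        raise RuntimeError(f"{ENV_VAR} is not set; export it before running the scraper")
    return key


if __name__ == "__main__":
    try:
        load_scrapfly_key()
        print(f"{ENV_VAR} found")
    except RuntimeError as error:
        print(error)
```

Passing the environment as a mapping keeps the helper easy to test without touching the real process environment. In a real script the returned key would typically be handed to the SDK client.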

youtube-scraper/pyproject.toml

Lines changed: 34 additions & 0 deletions
@@ -0,0 +1,34 @@
+[tool.poetry]
+name = "scrapfly-youtube"
+version = "0.1.0"
+description = "demo web scraper for YouTube.com using Scrapfly"
+authors = ["Mazen Ramadan <[email protected]>"]
+license = "NPOSL-3.0"
+readme = "README.md"
+
+[tool.poetry.dependencies]
+python = "^3.10"
+scrapfly-sdk = {extras = ["all"], version = "^0.8.5"}
+loguru = "^0.7.0"
+
+[tool.poetry.group.dev.dependencies]
+black = "^23.3.0"
+ruff = "^0.0.269"
+cerberus = "^1.3.4"
+pytest = "^7.3.1"
+pytest-asyncio = "^0.21.0"
+pytest-rerunfailures = "^14.0"
+
+[build-system]
+requires = ["poetry-core"]
+build-backend = "poetry.core.masonry.api"
+
+[tool.pytest.ini_options]
+python_files = "test.py"
+
+[tool.black]
+line-length = 120
+target-version = ['py37', 'py38', 'py39', 'py310', 'py311']
+
+[tool.ruff]
+line-length = 120
