Web Crawler API

A FastAPI wrapper around crawl4ai for extracting markdown content from web pages.

Installation

  1. Install dependencies:
pip install -r requirements.txt

Usage

Starting the API Server

python api.py

The server starts on http://localhost:8000.
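
This README does not reproduce the server entry point; a minimal sketch of how api.py might wire FastAPI to uvicorn (module and app names are assumptions, not confirmed by the repo):

# Hypothetical skeleton of api.py; the actual file in the repo may differ.
from fastapi import FastAPI
import uvicorn

app = FastAPI(title="Web Crawler API")

if __name__ == "__main__":
    # Bind to the address the README documents: http://localhost:8000
    uvicorn.run(app, host="0.0.0.0", port=8000)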

API Documentation

Once the server is running, visit http://localhost:8000/docs for the interactive API documentation (FastAPI's built-in Swagger UI).

Endpoints

POST /crawl

Crawl a URL and return fitted markdown content.

Request Body:

{
    "url": "https://iq.linkedin.com/in/mazyarf",
    "profile_name": "profile_1759825962",
    "headless": false,
    "delay_before_return_html": 5.0,
    "threshold": 0.4
}
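
The body maps naturally onto a Pydantic model. A sketch of what the request schema could look like, with field names and defaults taken from the Parameters section below (the actual model in api.py may differ):

# Hypothetical request schema mirroring the documented fields and defaults.
from pydantic import BaseModel

class CrawlRequest(BaseModel):
    url: str                                  # required
    profile_name: str = "profile_1759825962"  # browser profile to use
    headless: bool = False                    # show the browser window by default
    delay_before_return_html: float = 5.0     # seconds to wait before capturing HTML
    threshold: float = 0.4                    # content filtering threshold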

Response:

{
    "success": true,
    "url": "https://iq.linkedin.com/in/mazyarf",
    "raw_markdown_length": 15234,
    "fit_markdown_length": 8432,
    "fit_markdown": "# LinkedIn Profile\n\nMazyar Farhad...",
    "raw_markdown": null,
    "error_message": null
}
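
The response fields could be modeled the same way; the Optional fields reflect the nulls in the example above (again a sketch, not the repo's actual code):

# Hypothetical response schema matching the documented JSON fields.
from typing import Optional
from pydantic import BaseModel

class CrawlResponse(BaseModel):
    success: bool
    url: str
    raw_markdown_length: int
    fit_markdown_length: Optional[int] = None
    fit_markdown: Optional[str] = None   # preferred content
    raw_markdown: Optional[str] = None   # fallback content
    error_message: Optional[str] = None  # set when the crawl fails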

GET /crawl

Simple GET endpoint for testing.

Example:

curl "http://localhost:8000/crawl?url=https://iq.linkedin.com/in/mazyarf"

Using the API

With curl (POST):

curl -X POST 'http://localhost:8000/crawl' \
  -H 'Content-Type: application/json' \
  -d '{"url": "https://iq.linkedin.com/in/mazyarf"}'

With curl (GET):

curl 'http://localhost:8000/crawl?url=https://iq.linkedin.com/in/mazyarf'

With Python requests:

import requests

# Simple function to get fitted markdown
def get_fitted_markdown(url):
    response = requests.post(
        'http://localhost:8000/crawl',
        json={'url': url}
    )
    if response.status_code == 200:
        result = response.json()
        if result['success']:
            # Prefer fitted markdown; fall back to raw markdown, then an empty string
            return result.get('fit_markdown') or result.get('raw_markdown') or ''
    return None

# Usage
markdown = get_fitted_markdown("https://iq.linkedin.com/in/mazyarf")
print(markdown)

Testing

Run the test script:

python test_api.py
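
test_api.py itself is not reproduced in this README. A minimal smoke test in the same spirit might look like this (a sketch, not the repo's actual file; it assumes the server is already running):

# Hypothetical smoke test for the /crawl endpoints.
import requests

BASE = "http://localhost:8000"

def test_get_crawl():
    r = requests.get(f"{BASE}/crawl", params={"url": "https://example.com"})
    assert r.status_code == 200

def test_post_crawl():
    r = requests.post(f"{BASE}/crawl", json={"url": "https://example.com"})
    assert r.status_code == 200
    assert r.json()["url"] == "https://example.com"

if __name__ == "__main__":
    test_get_crawl()
    test_post_crawl()
    print("Smoke tests passed.")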

Configuration

Parameters

The crawl request accepts the following parameters; a combined example follows the list.

  • url: The URL to crawl (required)
  • profile_name: Browser profile to use (default: "profile_1759825962")
  • headless: Run browser in headless mode (default: false)
  • delay_before_return_html: Delay before returning HTML in seconds (default: 5.0)
  • threshold: Content filtering threshold (default: 0.4)
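
Putting these together, a request that sets every parameter explicitly (values here are illustrative):

import requests

# Illustrative request that overrides every documented default.
payload = {
    "url": "https://example.com",
    "profile_name": "profile_1759825962",  # must match a profile created with `crwl profiles`
    "headless": True,                      # run without a visible browser window
    "delay_before_return_html": 10.0,      # wait longer for slow, JavaScript-heavy pages
    "threshold": 0.6,                      # stricter content filtering
}
response = requests.post("http://localhost:8000/crawl", json=payload)
print(response.json())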

Browser Profile

Important: You need to create a browser profile with the crwl CLI before using the API. The crwl CLI ships with the crawl4ai package, so installing the requirements above also installs it. To create a profile, run:

crwl profiles

This opens crawl4ai's profile manager, where you can create a new browser profile for crawling. The default profile name used by the API is "profile_1759825962"; you can override it with the profile_name field in the request.

To list existing profiles, run the same command:

crwl profiles

Response Format

The API returns a JSON object with the following fields:

  • success: Boolean indicating if the crawl was successful
  • url: The crawled URL
  • raw_markdown_length: Length of the raw markdown content
  • fit_markdown_length: Length of the fitted markdown content (if available)
  • fit_markdown: The fitted markdown content (preferred)
  • raw_markdown: The raw markdown content (fallback)
  • error_message: Error message if the crawl failed

Error Handling

The API returns the following HTTP status codes; a client-side handling example follows the list:

  • 200: Success
  • 400: Bad request (invalid URL, profile not found, crawl failed)
  • 422: Validation error (invalid request format)
  • 500: Internal server error
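
A client can branch on these codes; for example (a sketch using the requests library; FastAPI error bodies carry a "detail" field):

from typing import Optional
import requests

# Illustrative client-side handling of the documented status codes.
def crawl(url: str) -> Optional[str]:
    response = requests.post("http://localhost:8000/crawl", json={"url": url})
    if response.status_code == 200:
        result = response.json()
        if result["success"]:
            return result.get("fit_markdown") or result.get("raw_markdown")
        print(f"Crawl failed: {result.get('error_message')}")
    elif response.status_code == 400:
        print(f"Bad request: {response.json().get('detail')}")
    elif response.status_code == 422:
        print("Validation error: check field names and types.")
    else:
        print(f"Server error ({response.status_code}).")
    return None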

Files

  • api.py: Main FastAPI application
  • crawler.py: Original crawler script
  • test_api.py: Test script for the API
  • requirements.txt: Python dependencies
  • README.md: This file
