Web Crawler API

A FastAPI wrapper around crawl4ai for extracting markdown content from web pages.

Installation

  1. Install dependencies:
pip install -r requirements.txt

Usage

Starting the API Server

python api.py

The server starts on http://localhost:8000.
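
This README does not reproduce the server entry point; a minimal sketch of how api.py might wire FastAPI to uvicorn (module and app names are assumptions, not confirmed by the repo):

# Hypothetical skeleton of api.py; the actual file in the repo may differ.
from fastapi import FastAPI
import uvicorn

app = FastAPI(title="Web Crawler API")

if __name__ == "__main__":
    # Bind to the address the README documents: http://localhost:8000
    uvicorn.run(app, host="0.0.0.0", port=8000)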

API Documentation

Once the server is running, visit http://localhost:8000/docs for the interactive API documentation (FastAPI's built-in Swagger UI).

Endpoints

POST /crawl

Crawl a URL and return fitted markdown content.

Request Body:

{
    "url": "https://iq.linkedin.com/in/mazyarf",
    "profile_name": "profile_1759825962",
    "headless": false,
    "delay_before_return_html": 5.0,
    "threshold": 0.4
}
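
The body maps naturally onto a Pydantic model. A sketch of what the request schema could look like, with field names and defaults taken from the Parameters section below (the actual model in api.py may differ):

# Hypothetical request schema mirroring the documented fields and defaults.
from pydantic import BaseModel

class CrawlRequest(BaseModel):
    url: str                                  # required
    profile_name: str = "profile_1759825962"  # browser profile to use
    headless: bool = False                    # show the browser window by default
    delay_before_return_html: float = 5.0     # seconds to wait before capturing HTML
    threshold: float = 0.4                    # content filtering threshold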

Response:

{
    "success": true,
    "url": "https://iq.linkedin.com/in/mazyarf",
    "raw_markdown_length": 15234,
    "fit_markdown_length": 8432,
    "fit_markdown": "# LinkedIn Profile\n\nMazyar Farhad...",
    "raw_markdown": null,
    "error_message": null
}
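
The response fields could be modeled the same way; the Optional fields reflect the nulls in the example above (again a sketch, not the repo's actual code):

# Hypothetical response schema matching the documented JSON fields.
from typing import Optional
from pydantic import BaseModel

class CrawlResponse(BaseModel):
    success: bool
    url: str
    raw_markdown_length: int
    fit_markdown_length: Optional[int] = None
    fit_markdown: Optional[str] = None   # preferred content
    raw_markdown: Optional[str] = None   # fallback content
    error_message: Optional[str] = None  # set when the crawl fails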

GET /crawl

Simple GET endpoint for testing.

Example:

curl "http://localhost:8000/crawl?url=https://iq.linkedin.com/in/mazyarf"

Using the API

With curl (POST):

curl -X POST 'http://localhost:8000/crawl' \
  -H 'Content-Type: application/json' \
  -d '{"url": "https://iq.linkedin.com/in/mazyarf"}'

With curl (GET):

curl 'http://localhost:8000/crawl?url=https://iq.linkedin.com/in/mazyarf'

With Python requests:

import requests

# Simple function to get fitted markdown
def get_fitted_markdown(url):
    response = requests.post(
        'http://localhost:8000/crawl',
        json={'url': url}
    )
    if response.status_code == 200:
        result = response.json()
        if result['success']:
            # Prefer fitted markdown; fall back to raw markdown, then an empty string
            return result.get('fit_markdown') or result.get('raw_markdown') or ''
    return None

# Usage
markdown = get_fitted_markdown("https://iq.linkedin.com/in/mazyarf")
print(markdown)

Testing

Run the test script:

python test_api.py
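
test_api.py itself is not reproduced in this README. A minimal smoke test in the same spirit might look like this (a sketch, not the repo's actual file; it assumes the server is already running):

# Hypothetical smoke test for the /crawl endpoints.
import requests

BASE = "http://localhost:8000"

def test_get_crawl():
    r = requests.get(f"{BASE}/crawl", params={"url": "https://example.com"})
    assert r.status_code == 200

def test_post_crawl():
    r = requests.post(f"{BASE}/crawl", json={"url": "https://example.com"})
    assert r.status_code == 200
    assert r.json()["url"] == "https://example.com"

if __name__ == "__main__":
    test_get_crawl()
    test_post_crawl()
    print("Smoke tests passed.")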

Configuration

Parameters

The crawl request accepts the following parameters; a combined example follows the list.

  • url: The URL to crawl (required)
  • profile_name: Browser profile to use (default: "profile_1759825962")
  • headless: Run browser in headless mode (default: false)
  • delay_before_return_html: Delay before returning HTML in seconds (default: 5.0)
  • threshold: Content filtering threshold (default: 0.4)
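
Putting these together, a request that sets every parameter explicitly (values here are illustrative):

import requests

# Illustrative request that overrides every documented default.
payload = {
    "url": "https://example.com",
    "profile_name": "profile_1759825962",  # must match a profile created with `crwl profiles`
    "headless": True,                      # run without a visible browser window
    "delay_before_return_html": 10.0,      # wait longer for slow, JavaScript-heavy pages
    "threshold": 0.6,                      # stricter content filtering
}
response = requests.post("http://localhost:8000/crawl", json=payload)
print(response.json())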

Browser Profile

Important: You need to create a browser profile with the crwl CLI before using the API. The crwl CLI ships with the crawl4ai package, so installing the requirements above also installs it. To create a profile, run:

crwl profiles

This opens crawl4ai's profile manager, where you can create a new browser profile for crawling. The default profile name used by the API is "profile_1759825962"; you can override it with the profile_name field in the request.

To list existing profiles, run the same command:

crwl profiles

Response Format

The API returns a JSON object with the following fields:

  • success: Boolean indicating if the crawl was successful
  • url: The crawled URL
  • raw_markdown_length: Length of the raw markdown content
  • fit_markdown_length: Length of the fitted markdown content (if available)
  • fit_markdown: The fitted markdown content (preferred)
  • raw_markdown: The raw markdown content (fallback)
  • error_message: Error message if the crawl failed

Error Handling

The API returns the following HTTP status codes; a client-side handling example follows the list:

  • 200: Success
  • 400: Bad request (invalid URL, profile not found, crawl failed)
  • 422: Validation error (invalid request format)
  • 500: Internal server error
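
A client can branch on these codes; for example (a sketch using the requests library; FastAPI error bodies carry a "detail" field):

from typing import Optional
import requests

# Illustrative client-side handling of the documented status codes.
def crawl(url: str) -> Optional[str]:
    response = requests.post("http://localhost:8000/crawl", json={"url": url})
    if response.status_code == 200:
        result = response.json()
        if result["success"]:
            return result.get("fit_markdown") or result.get("raw_markdown")
        print(f"Crawl failed: {result.get('error_message')}")
    elif response.status_code == 400:
        print(f"Bad request: {response.json().get('detail')}")
    elif response.status_code == 422:
        print("Validation error: check field names and types.")
    else:
        print(f"Server error ({response.status_code}).")
    return None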

Files

  • api.py: Main FastAPI application
  • crawler.py: Original crawler script
  • test_api.py: Test script for the API
  • requirements.txt: Python dependencies
  • README.md: This file
