Commit 6fa6872

committed
feat: added headers
1 parent 2605f9f commit 6fa6872

3 files changed: +279 −0 lines changed

mint.json (+7)

@@ -85,9 +85,16 @@
       "services/smartscraper",
       "services/searchscraper",
       "services/markdownify",
+      {
+        "group": "Additional Parameters",
+        "pages": [
+          "services/additional-parameters/headers"
+        ]
+      },
       {
         "group": "Browser Extensions",
         "pages": [
+
           "services/extensions/firefox"
         ]
       }
services/additional-parameters/headers.mdx (new file, +231)

---
title: 'Headers & Cookies'
description: 'Customize request headers and cookies for web scraping'
icon: 'gear'
---

<Frame>
  <img src="/services/images/headers-banner.png" alt="Headers Configuration" />
</Frame>

## Overview

All our services (SmartScraper, SearchScraper, and Markdownify) support custom headers and cookies to help you:

- Bypass basic anti-bot protections
- Access authenticated content
- Maintain sessions
- Customize request behavior

## Headers

### Common Headers

You can set any of the following headers in your requests:

```json
{
  "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36", // Browser identification
  "Accept": "*/*", // Accepted content types
  "Accept-Encoding": "gzip, deflate, br", // Supported encodings
  "Accept-Language": "en-US,en;q=0.9", // Preferred languages
  "Cache-Control": "no-cache", // Caching behavior
  "Sec-Ch-Ua": "\"Google Chrome\";v=\"107\", \"Chromium\";v=\"107\"", // Browser details
  "Sec-Ch-Ua-Mobile": "?0", // Mobile browser flag
  "Sec-Ch-Ua-Platform": "\"macOS\"", // Operating system
  "Sec-Fetch-Dest": "document", // Request destination
  "Sec-Fetch-Mode": "navigate", // Request mode
  "Sec-Fetch-Site": "none", // Request origin
  "Sec-Fetch-User": "?1", // User-initiated flag
  "Upgrade-Insecure-Requests": "1" // HTTPS upgrade
}
```
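A common pattern is to keep a base header set and merge per-request overrides on top of it. A minimal sketch in plain Python (the `base_headers` values and `merged_headers` helper are illustrative, not part of the SDK):

```python
# Reusable defaults; per-request overrides win on key collisions.
base_headers = {
    "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36",
    "Accept-Language": "en-US,en;q=0.9",
}

def merged_headers(overrides=None):
    """Return base headers with per-request overrides applied on top."""
    return {**base_headers, **(overrides or {})}

# Override just the language for one request; everything else is inherited.
headers = merged_headers({"Accept-Language": "de-DE,de;q=0.9"})
```

The resulting dict can then be passed as the `headers` argument shown in the examples below.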
### Usage Examples

<CodeGroup>

```python Python
from scrapegraph_py import Client

client = Client(api_key="your-api-key")

# Define custom headers
headers = {
    "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36",
    "Accept-Language": "en-US,en;q=0.9",
    "Sec-Ch-Ua-Platform": "\"Windows\""
}

# Use with SmartScraper
response = client.smartscraper(
    website_url="https://example.com",
    user_prompt="Extract the main content",
    headers=headers
)

# Use with SearchScraper
response = client.searchscraper(
    user_prompt="Find information about...",
    headers=headers
)

# Use with Markdownify
response = client.markdownify(
    website_url="https://example.com",
    headers=headers
)
```

```typescript TypeScript
import { Client } from '@scrapegraph/sdk';

const client = new Client('your-api-key');

// Define custom headers
const headers = {
  'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36',
  'Accept-Language': 'en-US,en;q=0.9',
  'Sec-Ch-Ua-Platform': '"Windows"'
};

// Use with SmartScraper
const response = await client.smartscraper({
  websiteUrl: 'https://example.com',
  userPrompt: 'Extract the main content',
  headers: headers
});
```

</CodeGroup>
## Cookies

### Overview

Cookies are essential for:

- Accessing authenticated content
- Maintaining user sessions
- Handling website preferences
- Bypassing certain security measures

### Setting Cookies

Cookies are set using the `Cookie` header as a semicolon-separated string of key-value pairs:

```python
headers = {
    "Cookie": "session_id=abc123; user_id=12345; theme=dark"
}
```
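If you keep cookies in a dictionary, the semicolon-separated string above can be assembled programmatically rather than by hand. A minimal sketch (the cookie names and the `to_cookie_header` helper are illustrative):

```python
def to_cookie_header(cookies: dict) -> str:
    """Join cookie name/value pairs into a single Cookie header string."""
    return "; ".join(f"{name}={value}" for name, value in cookies.items())

cookies = {"session_id": "abc123", "user_id": "12345", "theme": "dark"}
headers = {"Cookie": to_cookie_header(cookies)}
# headers["Cookie"] == "session_id=abc123; user_id=12345; theme=dark"
```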
### Examples

<CodeGroup>

```python Python
# Example with session cookies
headers = {
    "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36",
    "Cookie": "session_id=abc123; user_id=12345; theme=dark"
}

response = client.smartscraper(
    website_url="https://example.com/dashboard",
    user_prompt="Extract user information",
    headers=headers
)
```

```typescript TypeScript
// Example with session cookies
const headers = {
  'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36',
  'Cookie': 'session_id=abc123; user_id=12345; theme=dark'
};

const response = await client.smartscraper({
  websiteUrl: 'https://example.com/dashboard',
  userPrompt: 'Extract user information',
  headers: headers
});
```

</CodeGroup>
### Common Use Cases

1. **Authentication**
   ```python
   headers = {
       "Cookie": "auth_token=xyz789; session_id=abc123"
   }
   ```

2. **Regional Settings**
   ```python
   headers = {
       "Cookie": "country=US; language=en; currency=USD"
   }
   ```

3. **User Preferences**
   ```python
   headers = {
       "Cookie": "theme=dark; notifications=enabled"
   }
   ```
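Going the other way, a Cookie string copied from a browser can be parsed back into name/value pairs with Python's standard library. A minimal sketch using `http.cookies.SimpleCookie`:

```python
from http.cookies import SimpleCookie

def parse_cookie_header(header: str) -> dict:
    """Parse a Cookie header string into a name -> value dict."""
    jar = SimpleCookie()
    jar.load(header)
    return {name: morsel.value for name, morsel in jar.items()}

cookies = parse_cookie_header("country=US; language=en; currency=USD")
# cookies == {"country": "US", "language": "en", "currency": "USD"}
```

This makes it easy to inspect or edit individual cookies before rebuilding the header.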
## Best Practices

1. **User Agent Best Practices**
   - Use recent browser versions
   - Match User-Agent with Sec-Ch-Ua headers
   - Consider region-specific variations

2. **Cookie Management**
   - Keep cookies up to date
   - Include all required session cookies
   - Remove unnecessary cookies
   - Handle cookie expiration

3. **Security Considerations**
   - Don't share sensitive cookies
   - Rotate User-Agents when appropriate
   - Use HTTPS when sending sensitive data
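The "match User-Agent with Sec-Ch-Ua" rule can be spot-checked mechanically by comparing the Chrome major version reported in both headers. A hedged sketch (the header values and `versions_match` helper are illustrative, not an SDK feature):

```python
import re

def versions_match(headers: dict) -> bool:
    """Check the Chrome major version agrees between User-Agent and Sec-Ch-Ua."""
    ua_ver = re.search(r"Chrome/(\d+)", headers.get("User-Agent", ""))
    ch_ver = re.search(r'"(?:Google )?Chrome";v="(\d+)"', headers.get("Sec-Ch-Ua", ""))
    if not ua_ver or not ch_ver:
        return False  # cannot verify; treat as mismatch
    return ua_ver.group(1) == ch_ver.group(1)

headers = {
    "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 "
                  "(KHTML, like Gecko) Chrome/107.0.0.0 Safari/537.36",
    "Sec-Ch-Ua": '"Google Chrome";v="107", "Chromium";v="107"',
}
# versions_match(headers) → True
```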
## Common Issues

<Accordion title="Cookie Expiration" icon="clock">
Cookies may expire during scraping. Solutions:
- Implement cookie refresh logic
- Monitor session status
- Handle re-authentication
</Accordion>

<Accordion title="Header Conflicts" icon="exclamation-triangle">
Some headers may conflict. Common fixes:
- Remove conflicting headers
- Ensure header values match
- Check case sensitivity
</Accordion>
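On the case-sensitivity point: HTTP header names are case-insensitive, so `User-Agent` and `user-agent` in the same dict are a conflict in disguise. A minimal sketch for flagging such duplicates before sending a request (the `find_case_conflicts` helper is illustrative):

```python
def find_case_conflicts(headers: dict) -> list:
    """Return groups of header names that collide when compared case-insensitively."""
    seen = {}
    for name in headers:
        seen.setdefault(name.lower(), []).append(name)
    return [names for names in seen.values() if len(names) > 1]

conflicts = find_case_conflicts({
    "User-Agent": "Mozilla/5.0",
    "user-agent": "curl/8.0",   # duplicate under case-insensitive comparison
    "Accept": "*/*",
})
# conflicts == [["User-Agent", "user-agent"]]
```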
## Support

<CardGroup cols={2}>
  <Card title="Documentation" icon="book" href="/introduction">
    Comprehensive guides and tutorials
  </Card>
  <Card title="API Reference" icon="code" href="/api-reference/introduction">
    Detailed API documentation
  </Card>
  <Card title="Community" icon="discord" href="https://discord.gg/uJN7TYcpNa">
    Join our Discord community
  </Card>
  <Card title="GitHub" icon="github" href="https://github.com/ScrapeGraphAI">
    Check out our open-source projects
  </Card>
</CardGroup>

<Card title="Need Help?" icon="question" href="mailto:[email protected]">
  Contact our support team for assistance with headers, cookies, or any other questions!
</Card>

services/smartscraper.mdx (+41)

@@ -35,6 +35,7 @@ response = client.smartscraper(
 Get your API key from the [dashboard](https://dashboard.scrapegraphai.com)
 </Note>

+
 <Accordion title="Example Response" icon="terminal">
 ```json
 {
@@ -68,6 +69,46 @@ The response includes:
 - `error`: Error message (if any occurred during extraction)
 </Accordion>

+<Accordion title="Using Your Own HTML" icon="code">
+Instead of providing a URL, you can optionally pass your own HTML content:
+
+```python
+html_content = """
+<html>
+  <body>
+    <h1>ScrapeGraphAI</h1>
+    <div class="description">
+      <p>AI-powered web scraping for modern applications.</p>
+    </div>
+    <div class="features">
+      <ul>
+        <li>Smart Extraction</li>
+        <li>Local Processing</li>
+        <li>Schema Support</li>
+      </ul>
+    </div>
+  </body>
+</html>
+"""
+
+response = client.smartscraper(
+    website_html=html_content,  # This will override website_url if both are provided
+    user_prompt="Extract info about the company"
+)
+```
+
+This is useful when:
+- You already have the HTML content cached
+- You want to process modified HTML
+- You're working with dynamically generated content
+- You need to process content offline
+- You want to pre-process the HTML before extraction
+
+<Note>
+When both `website_url` and `website_html` are provided, `website_html` takes precedence and will be used for extraction.
+</Note>
+</Accordion>

 ## Key Features

 <CardGroup cols={2}>
