Oysho Scraper collects structured product data from oysho.com across supported countries and languages. It’s built to turn messy catalog browsing into clean, export-ready product datasets for analytics, merchandising, and monitoring. If you need an Oysho product scraper that supports full-site runs or targeted URLs, this project is designed to scale without becoming fragile.
Created by Bitbash, built to showcase our approach to Scraping and Automation!
This project extracts product listings and detailed product-page information from Oysho’s online catalog. It solves the common problem of manually copying product data (or dealing with incomplete exports) by producing consistent JSON output with nested variants (colors, sizes, images). It’s ideal for developers, data teams, and ecommerce operators who need repeatable, auditable catalog data.
- Scrape an entire country storefront starting from its homepage URL.
- Scrape one or more category pages to focus on specific catalog sections.
- Scrape individual product pages for maximum detail and accuracy.
- Run multiple start URLs in a single job, including different regions.
- Apply limits to control maximum products and category depth per run.
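
The scraping modes above depend on telling storefront, category, and product URLs apart. A minimal sketch of that classification follows. The function name and the `-n<digits>` / `-l<digits>` suffix patterns are assumptions inferred only from the sample URLs in this README. They are not a documented URL scheme and are not taken from the project's source.

```python
import re
from urllib.parse import urlparse

# Assumed patterns, inferred only from the sample URLs in this README:
#   category: https://www.oysho.com/gb/womens-sports-t-shirts-n4764
#   product:  https://www.oysho.com/gb/comfortlux-overlay-tank-top-l30045904?pelement=...
CATEGORY_RE = re.compile(r"-n\d+$")
PRODUCT_RE = re.compile(r"-l\d+$")


def classify_start_url(url: str) -> str:
    """Return 'product', 'category', or 'storefront' for a start URL."""
    # Only the path is inspected, so query strings such as ?pelement=... are ignored.
    path = urlparse(url).path.rstrip("/")
    if PRODUCT_RE.search(path):
        return "product"
    if CATEGORY_RE.search(path):
        return "category"
    # Anything else (e.g. /gb/) is treated as a storefront homepage.
    return "storefront"
```

A job that mixes URL types could call `classify_start_url` on each start URL and route it to the matching crawl strategy.
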
| Feature | Description |
|---|---|
| Full-site scraping | Crawls storefronts to discover categories and products at scale. |
| Category-first scraping | Targets category pages to extract focused catalog segments quickly. |
| Product detail scraping | Visits product pages to capture long descriptions, materials, care, and variant details. |
| Multi-URL input | Scrapes multiple start URLs in one run, even across different regions. |
| Deduplication | Returns unique products across overlapping categories and multiple start URLs. |
| Variant-aware output | Preserves nested structures for colors, sizes, SKUs, and media assets. |
| Export-friendly summaries | Provides flat summary fields (like colors, sizes, mainImage) alongside detailed nested JSON. |
| Resilient retry logic | Designed to recover from temporary blocks and intermittent failures. |
| Field Name | Field Description |
|---|---|
| id | Numeric product identifier from the catalog. |
| name | Product name/title as displayed on the website. |
| description | Short description snippet (when available). |
| longDescription | Full product description from the product page. |
| reference | Internal product reference code. |
| displayReference | Customer-facing reference code format. |
| productType | High-level product type (e.g., Clothing). |
| mainImage | Primary image URL for the main/default variant. |
| colors | Comma-separated summary of available color names. |
| sizes | Comma-separated summary of available size labels. |
| price | Current price (integer minor units, when provided by the site). |
| oldPrice | Previous price if discounted, otherwise null. |
| keyword | URL-friendly product keyword/slug. |
| category | Category identifier or path segment for the product. |
| availabilityDate | First available date/time if provided. |
| isBuyable | Whether the product is purchasable. |
| onSpecial | Whether the item is marked as on special. |
| website | Storefront base URL used for the run. |
| categoryPage | Category page URL associated with discovery. |
| productPage | Canonical product page URL for the item. |
| mainColorid | Color ID used as the primary/default selection. |
| colorsSizesImagesJSON | Nested variant structure containing colors, size SKUs, dimensions, and media assets. |
| composition | Materials composition as a structured list by part. |
| compositionDetail | Detailed composition breakdown (parts, areas, components). |
| care | Care instructions list (wash/iron/dry rules). |
| sustainability | Sustainability flags and derived percentages (when present). |
| certifiedMaterials | Certified materials block including certification references and percentages. |
| traceability | Traceability data structure (when present). |
| additionalInfo | Any extra product info text provided by the site. |
{
  "id": 183390929,
  "name": "Comfortlux overlay tank top",
  "description": "",
  "longDescription": "Comfortlux tank top with bra overlay with removable lightly padded cups. Breathable, quick-drying, high-strength fabric. Crossed strap detail at the back.",
  "reference": "30045904-V2025",
  "displayReference": "0045/904",
  "productType": "Clothing",
  "mainImage": "https://static.oysho.net/assets/public/d76a/d5bb/b2a44336bc5a/f4b5f541a69c/30045904791-a1/30045904791-a1.jpg?ts=1738250071894",
  "colors": "Russet Mocha, Dark Brown",
  "sizes": "XS, S, M, L, XL",
  "price": 2999,
  "oldPrice": null,
  "keyword": "comfortlux-overlay-tank-top",
  "category": "womens-sports-t-shirts-n4764",
  "availabilityDate": "2025-01-30 14:58:16.0",
  "isBuyable": true,
  "onSpecial": false,
  "website": "https://www.oysho.com/gb/",
  "categoryPage": "https://www.oysho.com/gb/womens-sports-t-shirts-n4764",
  "productPage": "https://www.oysho.com/gb/comfortlux-overlay-tank-top-l30045904?pelement=183390929",
  "mainColorid": "791",
  "colorsSizesImagesJSON": [
    {
      "id": "791",
      "name": "RUSSET MOCHA",
      "productPageSelectedColor": "https://www.oysho.com/gb/comfortlux-overlay-tank-top-l30045904?pelement=183390929&colorId=791",
      "xmedia": [
        "https://static.oysho.net/assets/public/d76a/d5bb/b2a44336bc5a/f4b5f541a69c/30045904791-a1/30045904791-a1.jpg?ts=1738250071894"
      ],
      "sizes": [
        {
          "sku": 174731466,
          "name": "XS",
          "partnumber": "3004590479101-V2025",
          "isBuyable": true,
          "price": "2999",
          "oldPrice": null,
          "skuDimensions": [
            { "dimensionId": "127", "value": 41.7, "dimensionName": "FRONT LENGTH" }
          ]
        }
      ]
    }
  ]
}
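
For spreadsheet-style analysis it can help to flatten the nested `colorsSizesImagesJSON` structure into one row per SKU. The sketch below is illustrative and not part of the project. `sku_rows` is a hypothetical helper, and dividing by 100 assumes a currency with two decimal places, which fits the GB sample but may not hold for every storefront.

```python
from decimal import Decimal


def sku_rows(product: dict) -> list[dict]:
    """Flatten one product record into a list of per-SKU rows."""
    rows = []
    for color in product.get("colorsSizesImagesJSON", []):
        for size in color.get("sizes", []):
            raw_price = size.get("price")
            rows.append({
                "productId": product["id"],
                "color": color["name"],
                "size": size["name"],
                "sku": size["sku"],
                # The sample shows minor units as an int (2999) at product level
                # and as a string ("2999") at SKU level; int() accepts both.
                # Assumption: two decimal places per currency unit.
                "price": Decimal(int(raw_price)) / 100 if raw_price is not None else None,
                "isBuyable": size.get("isBuyable"),
            })
    return rows
```

Each row can then be passed to `csv.DictWriter` or loaded into a dataframe without further unpacking.
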
Oysho/
├── src/
│   ├── main.py
│   ├── cli.py
│   ├── runner/
│   │   ├── __init__.py
│   │   ├── job.py
│   │   ├── retry.py
│   │   └── limits.py
│   ├── crawler/
│   │   ├── __init__.py
│   │   ├── browser.py
│   │   ├── routes.py
│   │   └── session.py
│   ├── extractors/
│   │   ├── __init__.py
│   │   ├── discover.py
│   │   ├── category.py
│   │   ├── product.py
│   │   └── normalize.py
│   ├── models/
│   │   ├── __init__.py
│   │   ├── product.py
│   │   └── schema.py
│   ├── outputs/
│   │   ├── __init__.py
│   │   ├── json_writer.py
│   │   ├── csv_flatten.py
│   │   └── validators.py
│   ├── utils/
│   │   ├── __init__.py
│   │   ├── url.py
│   │   ├── dedupe.py
│   │   ├── logging.py
│   │   └── time.py
│   └── config/
│       ├── settings.example.json
│       └── user_agents.txt
├── data/
│   ├── inputs.sample.json
│   └── sample_output.json
├── tests/
│   ├── test_dedupe.py
│   ├── test_normalize.py
│   └── test_extractors.py
├── .gitignore
├── LICENSE
├── requirements.txt
└── README.md
- Ecommerce analysts use it to track price and availability changes, so they can spot promotions early and forecast demand.
- Merchandising teams use it to extract product attributes and variants, so they can compare assortments across regions.
- Data engineers use it to build product catalogs for BI pipelines, so they can standardize reporting across categories.
- Competitor monitoring teams use it to collect structured product feeds, so they can benchmark materials, sizing, and pricing trends.
- Marketplace operators use it to populate listings with images and variant data, so they can reduce manual entry and errors.
How do I choose what to scrape (full site vs category vs product URLs)?
Use storefront URLs when you want broad coverage, category URLs when you want a targeted segment, and product URLs when you only need specific items. You can also provide multiple URLs in a single run to mix and match strategies.
Why do I sometimes get fewer results than my configured maximum?
Some catalog pages may include placeholder or incomplete items that are filtered out. Also, the website can show separate color tiles for the same product, while the scraper returns a single bundled product containing multiple colors, so “5 tiles” on the page might become “1 product” in the output.
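
The tile-to-product bundling described above can be pictured with a small sketch. The tile shape (`id`, `name`, `colorId`) and the `bundle_tiles` helper are hypothetical and chosen for illustration. The project's actual deduplication logic may differ.

```python
def bundle_tiles(tiles: list[dict]) -> list[dict]:
    """Group listing tiles by product id, keeping first-seen order."""
    products: dict[int, dict] = {}
    for tile in tiles:
        # One entry per product id; later tiles only contribute their color.
        entry = products.setdefault(
            tile["id"],
            {"id": tile["id"], "name": tile["name"], "colorIds": []},
        )
        if tile["colorId"] not in entry["colorIds"]:
            entry["colorIds"].append(tile["colorId"])
    return list(products.values())
```

With this grouping, several tiles that share a product id collapse into one record, which is why the output count can be lower than the number of tiles seen on a category page.
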
What’s the difference between colors, sizes, and colorsSizesImagesJSON?
colors and sizes are flat summaries for quick exports (CSV/Sheets-friendly). colorsSizesImagesJSON contains the full nested variant structure with per-color media, per-size SKUs, and size dimensions.
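
To illustrate how the flat summaries relate to the nested structure, here is a rough sketch that derives `colors` and `sizes` strings from a `colorsSizesImagesJSON` list. `summarize_variants` is a hypothetical helper. Title-casing the color names is an inference from the sample record (nested `RUSSET MOCHA`, flat `Russet Mocha`), not confirmed behavior.

```python
def summarize_variants(variants: list[dict]) -> tuple[str, str]:
    """Return (colors, sizes) as comma-separated summaries."""
    colors: list[str] = []
    sizes: list[str] = []
    for color in variants:
        # Assumption: flat color names are the title-cased nested names.
        label = color["name"].title()
        if label not in colors:
            colors.append(label)
        for size in color.get("sizes", []):
            # Size labels are kept as-is and listed once, in first-seen order.
            if size["name"] not in sizes:
                sizes.append(size["name"])
    return ", ".join(colors), ", ".join(sizes)
```
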
What should I do if requests get blocked or I see access errors?
Temporary blocks can happen. The most effective fix is to rerun the job with retries enabled and reduce concurrency. If you’re running at high volume, prefer residential IP rotation and keep request rates steady rather than bursty.
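
As a rough picture of what retrying through temporary blocks can look like, the sketch below wraps a fetch callable in exponential backoff with jitter. It is not the project's `retry.py`. `TemporaryBlock`, `fetch_with_retry`, and the default values are assumptions made for illustration.

```python
import random
import time


class TemporaryBlock(Exception):
    """Hypothetical error a fetcher raises on a transient block."""


def fetch_with_retry(fetch, url, max_attempts=4, base_delay=1.0, sleep=time.sleep):
    """Call fetch(url), retrying on TemporaryBlock with exponential backoff."""
    for attempt in range(1, max_attempts + 1):
        try:
            return fetch(url)
        except TemporaryBlock:
            if attempt == max_attempts:
                raise  # Out of attempts: surface the error to the caller.
            # Delay doubles each attempt (1s, 2s, 4s, ...) plus up to 50% jitter,
            # which spreads retries out instead of sending them in bursts.
            delay = base_delay * 2 ** (attempt - 1)
            sleep(delay + random.uniform(0, delay / 2))
```

The `sleep` parameter is injectable so the behavior can be tested without waiting. In normal use the default `time.sleep` applies.
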
Primary Metric: ~1,000 products scraped in ~5 minutes when running category-first discovery with product detail extraction enabled.
Reliability Metric: 92–97% successful completion rate across large runs when using retry + backoff, with most failures tied to temporary access blocks.
Efficiency Metric: Average throughput of 3–5 product detail pages per second on typical configurations, with output streamed to JSON to avoid memory spikes.
Quality Metric: 95%+ completeness for core commercial fields (name, price, images, variants), with occasional gaps on products that load incomplete or placeholder data.
