Incorrect Handling of Nonexistent or Mis-cased Paths #8

Closed
@thunderpoot

Describe the bug
When downloading paths with a segment that doesn't exist (or one that exists but is written in the wrong case, such as cc-main-2025-05 in lowercase), the tool saves a file with the expected name and .gz extension, but the file contains the S3 error response instead of gzip data:

$ cat warc.paths.gz
<?xml version="1.0" encoding="UTF-8"?>
<Error><Code>NoSuchKey</Code><Message>The specified key does not exist.</Message><Key>crawl-data/asdfadsf/warc.paths.gz</Key><RequestId>NYVQWQNC27BD94GE</RequestId><HostId>RoAV4secLh5r9pf8ixAFKbMObnFnJ0tGI0m80X9NzxInsR7RILRIoT/cKekF/y7VlctccRX3CPQ=</HostId></Error>
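
For anyone who wants to confirm that a saved file is an error body rather than real gzip data, checking for the gzip magic number (the bytes 0x1f 0x8b at the start of the stream) is usually enough. Below is a minimal sketch in Python; the helper names are hypothetical and not part of cc-downloader.

```python
GZIP_MAGIC = b"\x1f\x8b"  # first two bytes of every gzip stream (RFC 1952)


def looks_like_gzip(data: bytes) -> bool:
    """Return True if the data starts with the gzip magic number."""
    return data[:2] == GZIP_MAGIC


def file_looks_like_gzip(path: str) -> bool:
    """Hypothetical helper: inspect only the first two bytes of a file on disk."""
    with open(path, "rb") as f:
        return looks_like_gzip(f.read(2))
```

Running file_looks_like_gzip("warc.paths.gz") on the file from this report should return False, since the file begins with "<?xml" rather than the gzip header.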

To Reproduce
Steps to reproduce the behavior:

cc-downloader download-paths asdfadsf warc .

Expected behavior
I would have expected it to tell me that the path was not found. The tool could check the existence of the path before attempting to download and fail gracefully with a nice user-friendly message instead of saving an error response.
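
The check described above could look roughly like the following. This is only a sketch in Python to illustrate the idea, not the tool's actual implementation: the base URL, the function names, the PathsNotFound exception, and the wording of the message are all assumptions. It treats an HTTP 404 as "not found", double-checks that the body is really gzip, and only then writes the file.

```python
import os
import urllib.error
import urllib.request

# Assumption: paths files live under this base URL; adjust to the real endpoint.
BASE_URL = "https://data.commoncrawl.org/crawl-data"
GZIP_MAGIC = b"\x1f\x8b"


class PathsNotFound(Exception):
    """Hypothetical error raised when the requested paths file does not exist."""


def fetch_paths(crawl, data_type, opener=urllib.request.urlopen):
    """Fetch <crawl>/<data_type>.paths.gz and return its bytes, or raise PathsNotFound."""
    url = f"{BASE_URL}/{crawl}/{data_type}.paths.gz"
    hint = (f"No {data_type} paths found for '{crawl}'. Check the spelling; "
            "names are case-sensitive.")
    try:
        with opener(url) as resp:
            body = resp.read()
    except urllib.error.HTTPError as err:
        if err.code == 404:
            raise PathsNotFound(hint) from err
        raise  # other HTTP errors are left for the caller to handle
    # Defensive check: never accept an XML error body as a .gz file.
    if body[:2] != GZIP_MAGIC:
        raise PathsNotFound(hint)
    return body


def download_paths(crawl, data_type, dest_dir, opener=urllib.request.urlopen):
    """Write the paths file to dest_dir only after the response has been validated."""
    body = fetch_paths(crawl, data_type, opener)
    dest = os.path.join(dest_dir, f"{data_type}.paths.gz")
    with open(dest, "wb") as f:
        f.write(body)
    return dest
```

Because validation happens before the file is opened for writing, a failed lookup leaves nothing on disk. The opener parameter exists only so the logic can be exercised without network access.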

Labels: bug, enhancement