Feature: follow meta refresh redirects #255

Open
mrcnski opened this issue Jan 29, 2025 · 2 comments

Comments


mrcnski commented Jan 29, 2025

Hi, I noticed that spider does not follow meta refresh redirects (without the chrome feature, at least). I also don't see an easy way to enable this behavior.

An example of what I mean, from view-source:https://docs.saltproject.io/:

<meta http-equiv="refresh" content="0; url=en/latest/contents.html" />
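For reference, the content attribute of a tag like the one above has the form "<delay>; url=<target>". Below is a minimal sketch of extracting it using only the standard library. Both `parse_refresh_content` and `find_meta_refresh` are hypothetical helpers, not part of spider. The tag scan is deliberately naive: it assumes double-quoted attributes and does not resolve the relative target against the page URL, which a real implementation would need to do.

```rust
/// Parse the `content` value of a meta refresh tag,
/// e.g. "0; url=en/latest/contents.html", into (delay_seconds, target).
/// Hypothetical helper for illustration; not part of spider.
fn parse_refresh_content(content: &str) -> Option<(u64, String)> {
    // Split "<delay>; url=<target>" at the first ';' (some pages use ',').
    let (delay, rest) = content.split_once(|c| c == ';' || c == ',')?;
    let delay = delay.trim().parse::<u64>().ok()?;
    let rest = rest.trim();
    // The "url=" prefix is matched case-insensitively.
    let prefix = rest.get(..4)?;
    if !prefix.eq_ignore_ascii_case("url=") {
        return None;
    }
    // Strip optional quotes around the target.
    let target = rest[4..].trim().trim_matches(|c| c == '"' || c == '\'');
    if target.is_empty() {
        None
    } else {
        Some((delay, target.to_string()))
    }
}

/// Naive scan for the first `<meta http-equiv="refresh" ...>` tag.
/// Assumes double-quoted attributes; a real crawler should use an HTML parser.
fn find_meta_refresh(html: &str) -> Option<(u64, String)> {
    // ASCII lowercasing preserves byte offsets, so indices match `html`.
    let lower = html.to_ascii_lowercase();
    let mut from = 0;
    while let Some(pos) = lower[from..].find("<meta") {
        let start = from + pos;
        let end = start + lower[start..].find('>')?;
        let tag = &html[start..end];
        let tag_lower = &lower[start..end];
        if tag_lower.contains("http-equiv=\"refresh\"") {
            if let Some(c) = tag_lower.find("content=\"") {
                let value_start = c + "content=\"".len();
                let value_end = value_start + tag[value_start..].find('"')?;
                return parse_refresh_content(&tag[value_start..value_end]);
            }
        }
        from = end;
    }
    None
}
```

Calling `find_meta_refresh` on the tag above yields a delay of 0 and the relative target `en/latest/contents.html`, which would still have to be joined with the page URL before being queued.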

mrcnski commented Jan 29, 2025

I tried something like this:

// Follow meta refresh redirects.
fn on_should_crawl_callback(page: &mut Page) -> bool {
    let redirect = get_meta_redirect_url(&page.get_html(), page.get_url());

    if let Some(redirect) = redirect {
        page.final_redirect_destination = Some(redirect);
    }

    true
}
website.on_should_crawl_callback = Some(on_should_crawl_callback);

However, the callback receives page as an immutable &Page, so I cannot modify it there.

There also doesn't seem to be any way to get a reference to website.queue(...) inside the callback. The callback's type signature restricts it to being either an fn or a closure that does not capture its environment.
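To illustrate the restriction, here is a self-contained sketch that uses stand-in types rather than spider's real ones. `Page`, `Website`, `record_redirects`, `build_website`, and `PENDING_REDIRECTS` are all hypothetical, and the field type `Option<fn(&Page) -> bool>` is an assumption based on the description above. It shows that a plain function (or non-capturing closure) is accepted while a capturing closure is not, and that a static can serve as a rough workaround for passing data out of the callback.

```rust
use std::sync::Mutex;

// Stand-in types for illustration only; spider's real `Page` and `Website` differ.
struct Page {
    url: String,
    html: String,
}

struct Website {
    // Assumed shape: a plain `fn` pointer, which cannot carry captured state.
    on_should_crawl_callback: Option<fn(&Page) -> bool>,
}

// Because the callback cannot capture anything, shared state lives in a static.
static PENDING_REDIRECTS: Mutex<Vec<String>> = Mutex::new(Vec::new());

fn record_redirects(page: &Page) -> bool {
    // `page` is `&Page`: it can be read here but not mutated.
    // This only records the URL of a page containing a meta refresh tag;
    // a real version would extract and resolve the redirect target.
    if page.html.contains("http-equiv=\"refresh\"") {
        PENDING_REDIRECTS.lock().unwrap().push(page.url.clone());
    }
    true
}

fn build_website() -> Website {
    // A named function coerces to `fn(&Page) -> bool`.
    let website = Website {
        on_should_crawl_callback: Some(record_redirects),
    };
    // A capturing closure would be rejected by the compiler:
    // let mut queue: Vec<String> = Vec::new();
    // let bad = Website {
    //     on_should_crawl_callback: Some(|p: &Page| { queue.push(p.url.clone()); true }),
    // };
    website
}
```

After the crawl, the collected entries could be drained from the static and queued manually. This is a workaround sketch, not a substitute for built-in support.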

j-mendez commented

It should be possible to handle this over plain HTTP too. I will take a look at this soon. We try to do as much extra automatic parsing as we can.
