
#1475 -- Regular crawling should work when autodiscovery of sitemaps is turned off #1477

Open · wants to merge 4 commits into main

Conversation

@tballison (Contributor) commented Feb 21, 2025

I tested this offline with https://www.cdc.gov and https://www.fda.gov.

I confirmed that with sitemap.discovery=false as the global default, I could set it to true for a URL in the seed file, and the behavior was as expected.

I also tested the opposite, where the default was true but the seed for one of the sites was false, and again the behavior was as expected.

I'm not sure this is the best solution. I don't like tightly coupling the SitemapFilter's logic into the FetcherBolts, but so it goes.

And, as usual, unit tests are, well, hard.

Let me know what you think.

Thank you for contributing to Apache StormCrawler.

In order to streamline the review of the contribution we ask you
to ensure the following steps have been taken:

For all changes:

  • Is there an issue associated with this PR? Is it referenced in the commit message?

  • Does your PR title start with #XXXX where XXXX is the issue number you are trying to resolve?

  • Has your PR been rebased against the latest commit within the target branch (typically main)?

  • Is your initial contribution a single, squashed commit?

  • Is the code properly formatted with mvn git-code-format:format-code -Dgcf.globPattern="**/*" -Dskip.format.code=false?

For code changes:

  • Have you ensured that the full suite of tests is executed via mvn clean verify?
  • Have you written or updated unit tests to verify your changes?
  • If adding new dependencies to the code, are these dependencies licensed in a way that is compatible for inclusion under ASF 2.0?
  • If applicable, have you updated the LICENSE file, including the main LICENSE file?
  • If applicable, have you updated the NOTICE file, including the main NOTICE file?

Note:

Please ensure that once the PR is submitted, you check GitHub Actions for build issues and submit an update to your PR as soon as possible.

@tballison tballison requested a review from jnioche February 21, 2025 17:35
@tballison tballison changed the title #1475 #1475 -- Regular crawling should work when autodiscovery of sitemaps is turned off Feb 21, 2025
@jnioche (Contributor) left a comment

Instead of passing a configuration value through the metadata, I think we should make use of the configure method which all filters inherit.
We can get the value from the conf there and, if sitemap detection is off, have a simple check at the beginning of the filter method and exit early.

@tballison (Contributor, Author) commented Feb 24, 2025

Thank you @jnioche . To confirm, though, we'll still need to check the parameter on the metadata for cases where a user overrides the default behavior for a given URL.

I'll push updates shortly that put all of the logic in the SitemapFilter.
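The precedence agreed on above could look roughly like the following, as a minimal self-contained sketch: the filter reads the global default from the topology conf in configure(), and a per-URL value carried in the metadata (e.g. set from the seed file) overrides it. The class name, the stub Map-based metadata, and the use of "sitemap.discovery" as both the conf key and the metadata key are illustrative assumptions, not StormCrawler's actual API.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical sketch of the discussed precedence; not StormCrawler's real classes.
public class SitemapDiscoveryCheck {
    // Assumed key, used here for both the topology conf and per-URL metadata.
    static final String KEY = "sitemap.discovery";

    private boolean defaultDiscovery = false;

    // Mirrors the configure(Map conf) hook that filters inherit:
    // reads the global default once, when the filter is set up.
    public void configure(Map<String, Object> conf) {
        Object v = conf.get(KEY);
        if (v != null) {
            defaultDiscovery = Boolean.parseBoolean(v.toString());
        }
    }

    // Decides whether sitemap detection should run for this URL:
    // a per-URL metadata value, when present, wins over the global default.
    public boolean isDiscoveryEnabled(Map<String, String> metadata) {
        String v = metadata.get(KEY);
        return (v != null) ? Boolean.parseBoolean(v) : defaultDiscovery;
    }

    public static void main(String[] args) {
        SitemapDiscoveryCheck filter = new SitemapDiscoveryCheck();

        Map<String, Object> conf = new HashMap<>();
        conf.put(KEY, "false"); // global default: discovery off
        filter.configure(conf);

        Map<String, String> plain = new HashMap<>();      // no override
        Map<String, String> overridden = new HashMap<>(); // seed-level override
        overridden.put(KEY, "true");

        System.out.println(filter.isDiscoveryEnabled(plain));      // false
        System.out.println(filter.isDiscoveryEnabled(overridden)); // true
    }
}
```

The early-exit check jnioche suggested would then be a one-liner at the top of the filter method: if isDiscoveryEnabled(metadata) is false, return the unmodified result immediately.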
