If you update Crux.createDefaultPlugins to place WebAppManifestParser before HtmlMetadataExtractor like this:
public fun createDefaultPlugins(okHttpClient: OkHttpClient): List<Plugin> = listOf(
  // Static redirectors go first, to avoid getting stuck into CAPTCHAs.
  GoogleUrlRewriter(),
  FacebookUrlRewriter(),
  // Remove any tracking parameters remaining.
  TrackingParameterRemover(),
  // Prefer canonical URLs over AMP URLs.
  AmpRedirector(refetchContentFromCanonicalUrl = true, okHttpClient),
  // Fetches and parses the Web Manifest. May replace existing favicon URL with one from the manifest.json.
  WebAppManifestParser(okHttpClient),
  // Parses many standard HTML metadata attributes.
  HtmlMetadataExtractor(okHttpClient),
  // Extracts the best possible favicon from all the markup available on the page itself.
  FaviconExtractor(),
  // Parses the content of the page to remove ads, navigation, and all the other fluff.
  ArticleExtractor(okHttpClient),
)
It will produce the correct results.
This is the simplest way we can resolve it. Is there a specific reason to have WebAppManifestParser after HtmlMetadataExtractor or can we reorder it?
If that is not possible then we might need to consider a new way to handle merging the fields.
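As one possible shape for such a merge, here is a minimal sketch that keeps a field that is already set and only fills in fields that are missing or blank. It is not Crux's API: `mergeKeepingExisting`, the plain string maps, and the sample values are hypothetical stand-ins for illustration.

```kotlin
// Hypothetical helper (not part of Crux): keeps an existing non-blank value
// and only copies a field from `incoming` when it is missing or blank.
fun mergeKeepingExisting(
    current: Map<String, String>,
    incoming: Map<String, String>,
): Map<String, String> {
    val merged = current.toMutableMap()
    for ((key, value) in incoming) {
        if (merged[key].isNullOrBlank()) merged[key] = value
    }
    return merged
}

fun main() {
    // Placeholder values: a page title from the HTML and a site name from the manifest.
    val fromHtml = mapOf("title" to "Article headline")
    val fromManifest = mapOf("title" to "BBC", "theme-color" to "#ffffff")

    // The title from the HTML is kept; the missing theme-color is filled in.
    println(mergeKeepingExisting(fromHtml, fromManifest))
}
```

A blanket "first value wins" rule would also stop the manifest from replacing an existing favicon URL, which the plugin comment describes as intended, so a real fix would probably need a per-field policy rather than a single rule.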
I've been running crux over several sites and noticed the following bug.
Problem
Here is an example URL that displays the problem: https://www.bbc.com/news/world-europe-61691816
Test based off the README example to verify the problem:
The sequence of events is:
Crux.extractFrom uses Resource.plus to merge the resources, overwriting the title with "BBC" (crux/src/main/kotlin/com/chimbori/crux/api/Resource.kt, line 51 in 3b4586c).
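The overwrite can be reproduced in isolation. The sketch below is a simplified stand-in, not the code in Resource.kt: `SimpleResource`, its `fields` map, `mergeInOrder`, and the placeholder title are hypothetical. It only assumes that the merge behaves like Kotlin's `Map.plus`, where the right-hand value wins for duplicate keys.

```kotlin
// Hypothetical, simplified stand-in for a resource: just a map of metadata fields.
data class SimpleResource(val fields: Map<String, String> = emptyMap()) {
    // Assumed merge semantics: Map.plus keeps the right-hand value for duplicate keys.
    operator fun plus(other: SimpleResource): SimpleResource =
        SimpleResource(fields + other.fields)
}

// Folds the per-plugin results together in the order the plugins ran.
fun mergeInOrder(vararg extracted: SimpleResource): SimpleResource =
    extracted.fold(SimpleResource()) { acc, next -> acc + next }

fun main() {
    val fromHtml = SimpleResource(mapOf("title" to "Article headline"))
    val fromManifest = SimpleResource(mapOf("title" to "BBC"))

    // Manifest result merged last: the site-wide name replaces the page title.
    println(mergeInOrder(fromHtml, fromManifest).fields["title"]) // prints "BBC"

    // Manifest result merged first: the page title from the HTML wins.
    println(mergeInOrder(fromManifest, fromHtml).fields["title"]) // prints "Article headline"
}
```

With last-writer-wins merging, whichever plugin runs later decides the final value of a shared field, which is why the plugin order matters here.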