Zeno v2 #166
Draft
equals215 wants to merge 302 commits into main from dev/v2
…ore sending it and resetting the pointer
…ded a freeze context used to keep other routines from interacting with it
Replace custom utils.StringInSlice with standard slices.Contains
…iting-at-finish Log remaining WARC writing while finishing
* add: streaming postprocessing for JSON
* add: streaming postprocessing for S3
* add: TestIsS3
* add: TestS3 + S3 extraction refactoring
* add: streaming postprocessing for XML
* fix: avoid 2-layer assets extraction when HTML is wrongly discovered as asset
* add: use Zeno's User-Agent and custom HTTP client when requesting exclusion file
* add: error handling when doing SetReadDeadline
* fix: extraction from <script> content
* fix: outlinks extraction
---------
Co-authored-by: Corentin Barreau <[email protected]>
The current code looks for a `base` tag but does not stop when it finds one; it keeps searching until the end of the document. The suggested improvement uses `doc.Find("base").First()` to take only the first element, since `<base>` is used just once, in the HTML document header. A unit test is also added.
Simplify hopsToPath and pathToHops
Optimise extractBaseTag and add unit test
HTMLOutlinks unit test
Group the `video[src]` and `audio[src]` selections into a single `goquery.Find` query because their handling is identical. Group the scan for the attributes `[data-item], [style], [data-preview]` into a single `goquery.Find` query as well. The previous query, `goquery.Find('*')`, returned every element in the document, and each one was then checked for specific attributes. The new query returns only the elements that have at least one of the specified attributes, so it should be much faster. Add unit tests to validate the suggested improvements.
Refactor HTMLAssets and add unit tests
Add the missing status 303 "See Other" (https://developer.mozilla.org/en-US/docs/Web/HTTP/Status/303). Refactor the function to make it more readable.
Improve isStatusCodeRedirect
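A minimal sketch of what such a check could look like, using the status constants from `net/http`. Only the addition of 303 comes from the commit message; the full set of codes listed here and the function signature are assumptions, not Zeno's actual implementation.

```go
package main

import (
	"fmt"
	"net/http"
)

// isStatusCodeRedirect reports whether statusCode is a redirect that carries
// a Location to follow. The exact set of codes is an assumption; the commit
// message only confirms that 303 was added.
func isStatusCodeRedirect(statusCode int) bool {
	switch statusCode {
	case http.StatusMovedPermanently, // 301
		http.StatusFound,             // 302
		http.StatusSeeOther,          // 303
		http.StatusTemporaryRedirect, // 307
		http.StatusPermanentRedirect: // 308
		return true
	default:
		return false
	}
}

func main() {
	for _, code := range []int{200, 301, 303, 304} {
		fmt.Println(code, isStatusCodeRedirect(code))
	}
}
```

A `switch` with named constants keeps the list of accepted codes in one place, which is one way to read the "more readable" goal of the refactor.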
* controler: make makeStageChannel() capable of creating buffered and unbuffered channels
* Rework preprocessor concurrency (#211)
* preprocessor: using fan-in-fan-out pattern instead of dynamic workers pattern ; controler: make the reactor output channel buffered of size WorkersCount
* preprocessor: log wording consistency
* Rework archiver concurrency (#212)
* archiver: using fan-in-fan-out pattern instead of dynamic workers pattern
* cmd,config,archiver: rename MaxConcurrentAssets to MaxConcurrentAssetsPerWorker to make it more explicit that this limit is (to be) enforced PER worker
* Revert "cmd,config,archiver: rename MaxConcurrentAssets to MaxConcurrentAssetsPerWorker to make it more explicit that this limit is (to be) enforced PER worker"
  This reverts commit 175af1e.
* preprocessor: use struct pointer for worker() method instead of global variable
* preprocessor: replace preprocessor.run by preprocessor.worker in the fieldedLogger
* Rework postprocessor concurrency (#214)
* postprocessor: using fan-in-fan-out pattern instead of dynamic workers pattern
* controler: make archiver and preprocessor channel buffered by size of WorkersCount
* archiver: check if context is done before passing seeds to the next stage
* Rework finisher concurrency (#219)
* stats: add counters for Finisher routines
* controler: make postprocessor, finisherFinish and finisherProduce chans buffered by size WorkersCount ; consume and discard finisherFinish and finisherProduce when HQ is not used
* finisher: make the finisher concurrent using fan-in-fan-out pattern
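The pattern named in these commits can be sketched roughly as follows: a fixed number of workers read from one input channel (fan-out) and write to one shared output channel (fan-in) whose buffer size equals the worker count. The signatures of `makeStageChannel` and `runStage`, and the use of `string` items, are assumptions made for illustration; they are not taken from Zeno's code.

```go
package main

import (
	"fmt"
	"sync"
)

// makeStageChannel returns an unbuffered channel when size <= 0 and a
// buffered one otherwise. Hypothetical signature, not Zeno's.
func makeStageChannel(size int) chan string {
	if size <= 0 {
		return make(chan string)
	}
	return make(chan string, size)
}

// runStage starts a fixed pool of workers that all read from in (fan-out)
// and all write to a single output channel (fan-in).
func runStage(in <-chan string, workers int, process func(string) string) <-chan string {
	out := makeStageChannel(workers) // buffered by the worker count
	var wg sync.WaitGroup
	for i := 0; i < workers; i++ {
		wg.Add(1)
		go func() {
			defer wg.Done()
			for item := range in {
				out <- process(item)
			}
		}()
	}
	// Close the output once every worker has drained the input.
	go func() {
		wg.Wait()
		close(out)
	}()
	return out
}

func main() {
	in := makeStageChannel(0)
	go func() {
		defer close(in)
		for _, u := range []string{"a", "b", "c", "d"} {
			in <- u
		}
	}()

	count := 0
	for range runStage(in, 2, func(s string) string { return "processed:" + s }) {
		count++
	}
	fmt.Println("items processed:", count)
}
```

Compared with spawning workers dynamically, a fixed pool bounds the number of goroutines per stage, and the buffered output lets each worker hand off one result without waiting for the next stage to be ready. Result order is not preserved across workers.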
* Drop item.seed attribute
  Drop `seed` from the `NewItem` constructor. Replace `item.seed` with `item.IsSeed`. Drop some checks involving `seed` and `parent` in `CheckConsistency`.
* Drop the seed param from all calls to NewItem
* Drop the seed param from all unit tests
  Also, remove 3 unit tests which became irrelevant due to the drop of the seed attribute.
…the logic of routing the logs, rotatedFile implements io.Writer interface
Rework log to use `samber/slog-multi` as a `log/slog` routing abstraction
* init commit to start the PR
* models.url: moved tests from utils package to models package and added a concurrency test for upcoming changes
* models.url: implemented @yzqzss idea to cache result of URLToString to reduce number of calls
No description provided.