
Run importers in parallel within a single importer instance #1278


Merged
merged 7 commits into trustification:main from the 1207 branch on Feb 18, 2025

Conversation

jcrossley3
Contributor

@jcrossley3 jcrossley3 commented Feb 10, 2025

Partial impl of #1207

Contributor

@ctron ctron left a comment

I think we should have a concept of what parallel means.

I think it makes sense to:

  • Allow running multiple instances of the importer
  • Allow running multiple importer runs in parallel

But I don't think it makes sense to run a single importer in a parallel mode.

@jcrossley3
Contributor Author

I think we should have a concept of what parallel means.

Sure

I think it makes sense to:

  • Allow running multiple instances of the importer

I'm thinking we'll need another column in the importer table to identify which instance is executing a "run" and some means of identifying a crashed/killed instance. Can you recommend a duration for a window to update a timestamp (last_run or last_success -- what's the difference?) denoting "I'm still alive" after which we can assume another instance needs to take over? Any thoughts on a unique identifier for each instance?

  • Allow running multiple importer runs in parallel

Can you clarify this? Are you referring to the join_all use in this PR or something else? I don't think we'd want multiple instances to run the same importer simultaneously, do we?

But I don't think it makes sense to run a single importer in a parallel mode.

You're specifically referring to the .walk_parallel calls, right? Just to avoid being rate-limited?

@JimFuller-RedHat
Collaborator

also ... I wonder at what point would we consider bringing in a job queue crate versus building one ourselves ... I have no experience with any rust crates so maybe there are none fit for purpose?

@jcrossley3
Contributor Author

jcrossley3 commented Feb 11, 2025

also ... I wonder at what point would we consider bringing in a job queue crate versus building one ourselves ... I have no experience with any rust crates so maybe there are none fit for purpose?

I'm thinking that threshold is very high. Because if we're requiring a distributed job queue, then I expect any crate would wrap redis/rabbit/etc.

@ctron
Contributor

ctron commented Feb 11, 2025

I think there's a bunch of ways we could deal with this. One way could be a Postgres-based locking mechanism: SKIP LOCKED, or whatever that was called.

Downside: it only works with Postgres, which might be ok.

Another way could be to have a hash approach: have a max number of importers, and an individual instance number, and then only process those jobs. Should work with a StatefulSet.

We could also hold state in the database. But then we'd need to define such things as timeouts, stale detection, etc.
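
For illustration, a minimal sketch of the SKIP LOCKED idea (not part of this PR; the table, columns, and use of sqlx are assumptions made for the example, not trustify's actual schema or data layer): each instance tries to claim one due job, and rows already locked by another instance are skipped rather than waited on.

```rust
// Hypothetical sketch only: claims at most one due importer job per call.
use sqlx::{PgPool, Row};

async fn claim_next_job(pool: &PgPool) -> Result<Option<String>, sqlx::Error> {
    let mut tx = pool.begin().await?;

    // Whoever acquires the row lock owns the job; other instances skip the
    // row instead of blocking on the lock.
    let row = sqlx::query(
        r#"
        SELECT name FROM importer
        WHERE enabled AND next_run <= now()
        ORDER BY next_run
        LIMIT 1
        FOR UPDATE SKIP LOCKED
        "#,
    )
    .fetch_optional(&mut *tx)
    .await?;

    let claimed = row.map(|r| r.get::<String, _>("name"));
    // Either run the import while the transaction (and lock) is held, or mark
    // the row as "running" before committing so the claim survives the commit.
    tx.commit().await?;
    Ok(claimed)
}
```

As noted above, the trade-off is that this ties the scheduling mechanism to Postgres.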

@jcrossley3
Contributor Author

If we think StatefulSet is the best option, then I'll re-assign this to you. :)

@ctron
Contributor

ctron commented Feb 11, 2025

If we think StatefulSet is the best option, then I'll re-assign this to you. :)

StatefulSet is just a way of ensuring we have X importers running and that we are able to assign ascending numbers to those instances. The input to the importer would be "x of y", which could be transported via env vars, even when using Ansible, with a default of "1 of 1" meaning "process them all".
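
A rough sketch of that "x of y" idea (the env var names are invented for illustration; the defaults reproduce the "1 of 1", process-everything behavior): every instance sees all importers but only processes the ones whose name hashes into its slot.

```rust
// Hypothetical sketch of "x of y" partitioning across importer instances.
use std::collections::hash_map::DefaultHasher;
use std::hash::{Hash, Hasher};

fn instance_slot() -> (u64, u64) {
    let index = std::env::var("IMPORTER_INSTANCE_INDEX")
        .ok()
        .and_then(|v| v.parse().ok())
        .unwrap_or(0); // default: instance 0 ...
    let count = std::env::var("IMPORTER_INSTANCE_COUNT")
        .ok()
        .and_then(|v| v.parse().ok())
        .unwrap_or(1); // ... of 1, i.e. "process them all"
    (index, count.max(1))
}

fn is_mine(importer_name: &str, index: u64, count: u64) -> bool {
    // Stable hash of the importer name decides which instance owns it.
    let mut hasher = DefaultHasher::new();
    importer_name.hash(&mut hasher);
    hasher.finish() % count == index
}
```

A StatefulSet's ordinal index maps naturally onto the instance index here.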

@jcrossley3 jcrossley3 marked this pull request as ready for review February 12, 2025 22:01
@jcrossley3 jcrossley3 changed the title from "Make the importer execution more concurrent, i.e. faster" to "Run importers in parallel within a single importer instance" Feb 12, 2025
@jcrossley3 jcrossley3 requested a review from ctron February 12, 2025 22:03
@ctron
Contributor

ctron commented Feb 13, 2025

To my understanding, this PR simply checks all importers in parallel. With no limitation. Waiting for the joint outcome of a single check run.

I think this would result in the case that one importer runs for a long time, blocking the next check, and thus not really improving the situation. On an initial startup with pending import runs, yes. But afterwards, no.

I think we'd need a strategy first, with the goal mentioned above, before having an implementation.

@jcrossley3
Contributor Author

I think this would result in the case that one importer runs for a long time, blocking the next check, and thus not really improving the situation. On an initial startup with pending import runs, yes. But afterwards, no.

True. So no worse than what we have, but it does address the primary motivation for this PR: QE will only have to wait as long as the longest enabled importer takes for them all to be finished.

Plus, better logging.

I think we'd need a strategy first, with the goal mentioned above, before having an implementation.

Agree. Still thinking about that.

@ctron
Contributor

ctron commented Feb 13, 2025

I would not want to merge something which only suits some specific QE task, potentially causing issues for the overall system. Like the case that most of the time it runs sequentially, while sometimes it runs in parallel, without any way of controlling that behavior.

@jcrossley3
Contributor Author

I would not want to merge something which only suits some specific QE task, potentially causing issues for the overall system. Like the case that most of the time it runs sequentially, while sometimes it runs in parallel, without any way of controlling that behavior.

It's not a QE task. They just happen to be the first users who care about how long it takes the app to import data. We already have a means to disable importers. I think there's a cancellation mechanism in there, no?

I don't love that those git-based importers are sync and block threads, but it's not like we expect to have hundreds of them.

Is your reluctance so high that you'd like me to close this PR?

@jcrossley3
Contributor Author

To my understanding, this PR simply checks all importers in parallel. With no limitation. Waiting for the joint outcome of a single check run.

I added a limitation. Defaults to the old behavior: concurrency=1.
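
A minimal sketch of what such a limit can look like (names are illustrative only, not the actual change in the PR): a bounded concurrent stream over the enabled importers, where a limit of 1 degenerates to the old sequential loop.

```rust
// Hypothetical sketch: run up to `concurrency` importer checks at once.
// With concurrency = 1 this behaves like the previous sequential behavior.
use futures::stream::{self, StreamExt};

async fn run_checks(importers: Vec<String>, concurrency: usize) {
    stream::iter(importers)
        .for_each_concurrent(concurrency.max(1), |importer| async move {
            // One importer failing should not abort the others, so errors
            // are only logged here.
            if let Err(err) = check_one(&importer).await {
                eprintln!("importer {importer} failed: {err}");
            }
        })
        .await;
}

// Placeholder for a real import run.
async fn check_one(_importer: &str) -> Result<(), Box<dyn std::error::Error>> {
    Ok(())
}
```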

@jcrossley3 jcrossley3 force-pushed the 1207 branch 2 times, most recently from 55ad1bb to eaa5c2c on February 16, 2025 23:24
```diff
@@ -294,7 +282,7 @@ where
     }
 }

-#[instrument(skip(self), err)]
+#[instrument(skip(self))]
```
Contributor

I'd like to keep errors. It's ok to reduce the level to e.g. INFO

Contributor

@ctron ctron left a comment

I personally think that's better. And it should also be possible to easily extend that implementation to a distributed version (running X instances at the same time).

I'd just prefer keeping the err instrumentation. Also converting the ret to err instead. And maybe lowering that level to Info, as it's not an application level error. But that's not a blocker to me.
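
For reference, one way to keep the error capture while lowering its level, assuming a tracing release recent enough to support the err(level = ...) form (the surrounding types are illustrative only):

```rust
use tracing::instrument;

struct Importer;

impl Importer {
    // Errors returned here are still recorded on the span, but emitted at
    // INFO rather than ERROR.
    #[instrument(skip(self), err(level = "info"))]
    async fn check(&self) -> Result<(), std::io::Error> {
        Ok(())
    }
}
```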

ctron
ctron previously requested changes Feb 17, 2025
Contributor

@ctron ctron left a comment

I am ok with this in general. But it also feels incomplete due to ongoing discussions on the ADR. Let's finish that discussion first, and then merge implementations.

@jcrossley3
Contributor Author

I am ok with this in general. But it also feels incomplete due to ongoing discussions on the ADR. Let's finish that discussion first, and then merge implementations.

Come on, man! They're independent. This PR simply adds the ability for a single instance to run multiple import jobs concurrently. Heck, this PR alone might mitigate the need for distributing work at all. It certainly satisfies the need that motivated the creation of the original downstream issue.

@chirino
Contributor

chirino commented Feb 17, 2025

Before we go distributed, we should make sure the DB is not bottlenecking us.

@jcrossley3
Contributor Author

Before we go distributed, we should make sure the DB is not bottlenecking us.

Amen. 100%. I think it's also worth looking at the logic within individual importers. The git-based ones are flawed since they're sync, but they only take a few minutes to run. The sbom importer's rate varies widely during a single run, and the CSAF importer just seems broken -- I've never seen it run to completion.

@ctron
Contributor

ctron commented Feb 17, 2025

The git-based ones are flawed since they're sync,

IIRC they run on the tokio worker thread pool. Not blocking anything.

CSAF importer just seems broken

I have. Multiple times: https://console-openshift-console.apps.cluster.trustification.rocks/search/ns/trustify-dumps?kind=core~v1~Pod&q=app.kubernetes.io%2Fcomponent%3Dcreate-dump

@jcrossley3
Contributor Author

IIRC they run on the tokio worker thread pool. Not blocking anything.

Blocks a clean shutdown when those sync git clone functions are running. Only takes a minute the first time, but 30 minutes the second, so it's a pain to have to manually kill -9 the trustd process.

@ctron
Contributor

ctron commented Feb 17, 2025

IIRC they run on the tokio worker thread pool. Not blocking anything.

Blocks a clean shutdown when those sync git clone functions are running. Only takes a minute the first time, but 30 minutes the second, so it's a pain to have to manually kill -9 the trustd process.

Maybe that's a local problem then. It looks ok on our OCP instance:

[screenshot: importer run timestamps from the OCP instance]

@jcrossley3
Contributor Author

[screenshot: importer run timestamps from the OCP instance]

Using the timestamps in your image, if you had attempted to Ctrl+C to break out of a cargo run trustd process at 2/13 15:30:00, the process would hang, I assume waiting for that 30 minute git clone to complete. If instead you hit Ctrl+C at 2/13 19:00, no problem, no hang. It'd even be fine at 18:00, still processing, but post git clone.

Try it for yourself.
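
To illustrate the behavior being described (a hypothetical sketch; function names are invented and this is not trustify's importer code): a synchronous clone moved onto a blocking thread cannot be interrupted, so a shutdown signal can stop the async side from waiting, but the clone itself keeps running until it finishes.

```rust
// Hypothetical illustration: tokio can stop *waiting* on the blocking task,
// but it cannot cancel the synchronous clone running inside it.
use tokio::task;

async fn run_git_importer() -> Result<(), Box<dyn std::error::Error>> {
    let clone = task::spawn_blocking(|| {
        // Long, synchronous git clone; nothing can interrupt this closure.
        sync_git_clone("https://example.com/repo.git")
    });

    tokio::select! {
        res = clone => res??,
        _ = tokio::signal::ctrl_c() => {
            // Shutdown requested: we stop waiting, but the blocking thread
            // keeps cloning until it finishes on its own.
            eprintln!("shutdown requested while a clone is still running");
        }
    }
    Ok(())
}

// Stand-in for the real synchronous clone.
fn sync_git_clone(_url: &str) -> Result<(), std::io::Error> {
    Ok(())
}
```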

@ctron ctron self-requested a review February 17, 2025 16:33
@chirino
Contributor

chirino commented Feb 17, 2025

you might want to include an update to https://github.com/trustification/trustify/blob/main/docs/env-vars.md too.

Contributor

@chirino chirino left a comment

LGTM.

Relates trustification#1207

Allows enabled/disabled state of jobs to be reflected immediately when manipulated by the UX.

Signed-off-by: Jim Crossley <[email protected]>
Signed-off-by: Jim Crossley <[email protected]>
@jcrossley3
Contributor Author

you might want to include an update to https://github.com/trustification/trustify/blob/main/docs/env-vars.md too.

Done, thx

@JimFuller-RedHat
Collaborator

JimFuller-RedHat commented Feb 18, 2025

Suggestion:
This change is a 'rubicon' of sorts, e.g. going from predictable serial processing to something more async was always something we had to cross to get better scalability. So while the changes seem straightforward, it would be useful to use a branch and test a bit more at scale to assess impact (maybe this has been done and I am ignorant!). Implementation-level details (like considering pg SKIP LOCKED) could be guided by those tests.

@jcrossley3
Contributor Author

Suggestion: So while the changes seem straightforward, it would be useful to use a branch and test a bit more at scale to assess impact (maybe this has been done and I am ignorant!).

The default value of the new concurrency option is 1, meaning the default "new" behavior should match the "old" sequential behavior. My local tests bear that out, but I've done nothing at scale.

My goal for the PR was to match current behavior by default after merging, so would the concurrency option serve the same purpose as a branch?

@JimFuller-RedHat
Collaborator

My goal for the PR was to match current behavior by default after merging, so would the concurrency option serve the same purpose as a branch?

ya I think so!

Collaborator

@JimFuller-RedHat JimFuller-RedHat left a comment

LGTM, with the caveat that testing might reveal a need for more work in this area. I think this should go in.

@jcrossley3 jcrossley3 added this pull request to the merge queue Feb 18, 2025
Merged via the queue into trustification:main with commit 7db1976 Feb 18, 2025
2 checks passed
@jcrossley3 jcrossley3 deleted the 1207 branch February 18, 2025 18:10