
Create thread for mask calculation #59

Merged
merged 4 commits into floe:experimental on Aug 5, 2021

Conversation

@peckto (Contributor) commented Apr 4, 2021

This is a new implementation of #38

I moved the grab functionality into the main thread and created a new thread for the mask calculation.

@BenBE (Collaborator) commented Apr 4, 2021

IMHO it's an advantage to have the grabber thread and the mask calculation in different threads. Ideally, and this is somewhat suboptimal right now, the grabber thread just provides images it captured from the camera while the mask calculation thread latches onto the last grabbed frame whenever it is ready for the next frame for processing. That way frames could be silently discarded when processing a mask is too slow.

@phlash (Collaborator) commented Apr 4, 2021

My fork (https://github.com/phlash/deepbacksub/) takes a similar approach to the one Ben suggests: a grab-mask-output thread that runs at full video frame rate using only OpenCV and V4L2, into which the mask calculator thread inserts an updated mask, generated at whatever rate it can maintain, from the latest frame grabbed. This is also the approach taken by ViBa and Jitsi to minimise video lag, even if the ML is a bit challenged on cycles.

Note that the grab thread at present exists to ensure we read input frames promptly, again to reduce lag, as OpenCV will queue up video rather than discard it (and there are few/no controls over this). As long as we continue to read-mask-write quickly I don't mind which thread has which responsibility :)
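
For illustration, the grab-and-latch pattern described above might look like this minimal sketch (names like grab_loop and latch_latest are illustrative, not project code):

    #include <opencv2/opencv.hpp>
    #include <mutex>

    cv::VideoCapture cap("/dev/video0", cv::CAP_V4L2);
    cv::Mat latest;
    std::mutex latest_lock;

    // Grabber: runs at full camera frame rate so V4L2/OpenCV never queue up.
    void grab_loop() {
        cv::Mat frame;
        while (cap.read(frame)) {
            std::lock_guard<std::mutex> g(latest_lock);
            frame.copyTo(latest);    // frames grabbed in between are dropped
        }
    }

    // Mask thread: latches onto the newest frame whenever it is ready.
    cv::Mat latch_latest() {
        std::lock_guard<std::mutex> g(latest_lock);
        return latest.clone();
    }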

@BenBE (Collaborator) commented Apr 4, 2021

The approach of updating the mask asynchronously and passing frames through with a possibly slightly outdated mask seems fine too for most situations. It might be an option to either discard frames or use the old mask.

@phlash (Collaborator) commented Apr 5, 2021

@peckto I have now read through your changes in detail and tried them out locally; it looks good. I'll mark up a couple of syntax things I spotted, but otherwise great stuff!

Do we think anyone would want this to operate synchronously below the input frame rate (i.e. the old behaviour)? If so, how?

deepseg.cc Outdated
// copy frame from shared buffer
cv::Mat raw;
pthread_mutex_lock(&info.lock_raw);
raw = info.raw.clone();
@phlash (Collaborator) commented Apr 5, 2021

Can we avoid the double cloning (in/out of shared buffer)? Unsure of performance impact... see below...

Reply (Collaborator):

Structurally we could pre-allocate a few picture buffers and just hand out pointers to them. This would require proper tracking of which pointers are in use at any time, and some preparations not yet present in the current code.

deepseg.cc Outdated
pthread_mutex_unlock(&capinfo.lock);
// we can now guarantee capinfo.raw will remain unchanged while we process it..
calcinfo.raw = *capinfo.raw;
calcinfo.raw = raw.clone();
Reply (Collaborator):

See the above comment regarding the double clone; we may want to avoid this one if there is any material performance improvement.

@peckto (Contributor, Author) commented Apr 5, 2021

Regarding opencv and the grab thread:
If I understood you correctly, the main problem is that opencv buffers frames, which can lead to a delay if we do not grab new frames at the maximum frame rate.
In the current (1-thread) implementation, this would occur only if the clipping and v4l2loopback steps took too long.
I would assume that under normal conditions this is not the case. Additionally, opencv now provides a buffer setting (CAP_PROP_BUFFERSIZE) for the V4L backend:
https://github.com/opencv/opencv/blob/master/modules/videoio/src/cap_v4l.cpp

With this, I think we can work with the single-thread option.
An extra grab thread makes things quite complicated in my opinion, and the benefit is probably not that high.
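
For illustration, capping the V4L capture queue with this property might look like the following minimal sketch (the device path is an example):

    #include <opencv2/videoio.hpp>

    int main() {
        cv::VideoCapture cap("/dev/video0", cv::CAP_V4L2);
        // Ask the V4L backend to keep at most one frame queued, so reads
        // return a recent frame rather than a stale, buffered one.
        cap.set(cv::CAP_PROP_BUFFERSIZE, 1);
        return 0;
    }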

@peckto (Contributor, Author) commented Apr 5, 2021

@phlash: Thank you for your feedback! I will address your points in the next few days.

I'm just not sure how we should proceed with the "double copy" thing.
If this is not really a performance issue, I prefer to keep it simple.
(The copy performance is of course dependent on the video resolution.)

@BenBE (Collaborator) commented Apr 5, 2021

It's preferable to avoid any large copies of memory buffers, both for performance reasons and for reasons of memory/resource usage. An easy solution is to pre-allocate several buffers large enough to hold a frame and just hand pointers to these allocated objects around. With the current architecture of one thread for grabbing, one for mask generation and the main thread handling the final frame merging, this means we'd be fine with a pool of about 8 buffers (one active and one inactive per thread, plus some small reserve). That way each thread could hold on to the last active buffer it acquired from the previous step without blocking, while still allowing enough slack that slow mask calculations won't stall the main thread too much, all while ensuring constant memory usage.
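
A minimal sketch of such a pool, using hypothetical names (FramePool, acquire, release) that are not part of this PR:

    #include <opencv2/core.hpp>
    #include <mutex>
    #include <vector>

    class FramePool {
        std::mutex lock;
        std::vector<cv::Mat> storage;      // allocated once, up front
        std::vector<cv::Mat*> free_list;
    public:
        FramePool(size_t count, int rows, int cols, int type) : storage(count) {
            for (auto &m : storage) {
                m.create(rows, cols, type);
                free_list.push_back(&m);
            }
        }
        cv::Mat *acquire() {               // returns nullptr when exhausted
            std::lock_guard<std::mutex> g(lock);
            if (free_list.empty()) return nullptr;
            cv::Mat *m = free_list.back();
            free_list.pop_back();
            return m;
        }
        void release(cv::Mat *m) {         // hand the buffer back to the pool
            std::lock_guard<std::mutex> g(lock);
            free_list.push_back(m);
        }
    };

Threads would then pass these pointers around instead of cloning pixel data, e.g. FramePool pool(8, height, width, CV_8UC3); for the pool of about 8 buffers mentioned above.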

@phlash (Collaborator) commented Apr 5, 2021

I'd like to hold off on a buffer pool until we know the impact of a couple of clone() calls between threads - the timing we have should make it obvious if we have a problem or not. Thanks for the speedy work @peckto!

@peckto (Contributor, Author) commented Apr 7, 2021

I added some more measurements for the ai thread.
On my machine it looks like the following:

$ ./deepseg -d -d -c /dev/video0 -v /dev/video20 -H 
video [grab: 58741363 retr:   721098 lock:      471 copy:   632780 post:   928098 v4l2:  2864112] ai [cond: 30599240 copy:   183267 pre:  1087840 tflt: 31788445 post:   680549 mask:   144196] - FPS: 16.56 (video) 19.19 (ai)
video [grab: 54829101 retr:   452109 lock:      365 copy:   314983 post:   950727 v4l2:  3267526] ai [cond: 26038497 copy:   230877 pre:   995302 tflt: 31148558 post:   778711 mask:   173230] - FPS: 18.12 (video) 15.63 (ai)
video [grab: 54883606 retr:  1623516 lock:      521 copy:   744499 post:   959518 v4l2:  3415035] ai [cond: 25964401 copy:   180704 pre:  1125331 tflt: 33466405 post:   686013 mask:   155247] - FPS: 17.22 (video) 16.24 (ai)
video [grab: 53589089 retr:   500830 lock:     1068 copy:   354910 post:   925405 v4l2:  3216689] ai [cond: 23785185 copy:   163880 pre:  1027631 tflt: 32905018 post:   707593 mask:   149251] - FPS: 18.61 (video) 16.38 (ai)
video [grab: 54525152 retr:  1019776 lock:      653 copy:   974318 post:  1064244 v4l2:  3157460] ai [cond: 21241901 copy:   244925 pre:  1312905 tflt: 37411994 post:   690571 mask:   179552] - FPS: 17.48 (video) 15.81 (ai)
video [grab: 53905385 retr:   452957 lock:      835 copy:   319837 post:   968303 v4l2:  3224764] ai [cond: 25224082 copy:   239520 pre:   966928 tflt: 31087658 post:   702594 mask:   343650] - FPS: 18.21 (video) 18.20 (ai)

(For some reason I currently don't get 30fps??)

I agree with you that, in general, we should avoid the double copy.
Maybe we could re-use the double-buffer structure from the grabber thread?

@phlash (Collaborator) commented Apr 17, 2021

@peckto, @BenBE: I'm not sure where we are with this PR? To me it looks like @peckto is waiting on a question about double buffering, and then we'll want to check that performance is OK before we merge? We can address a switch to the C++11 thread toolkit in a separate PR (as per the experimental branch).

Regarding the double buffering - I think that should work, with the 'slower' thread (always mask calculation) swapping buffer pointers (while holding a lock) as it either consumes or writes them. This avoids buffer pool management, while also avoiding multiple copies of video frames.
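
For illustration, the pointer swap could be as small as this (a sketch with made-up names, not code from the PR):

    #include <opencv2/core.hpp>
    #include <mutex>
    #include <utility>

    cv::Mat buf_a, buf_b;
    cv::Mat *write_buf = &buf_a;   // grabber copies the newest frame here
    cv::Mat *read_buf  = &buf_b;   // mask thread processes this one
    std::mutex swap_lock;

    // Called by the slower (mask) thread when it is ready for the next frame.
    void swap_buffers() {
        std::lock_guard<std::mutex> g(swap_lock);
        std::swap(write_buf, read_buf);   // O(1): only pointers move, no pixels
    }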

@BenBE (Collaborator) commented Apr 17, 2021

I haven't tested this yet, and with the larger structural rework I think having this land on experimental first is preferable (@floe What do you think?). This also has the advantage of already having the C++11 patches as a basis to work from.

I haven't had much time for thorough reviews and tests of the experimental branch in the last few days, so I'd recommend letting the experimental changes ripen there for another few days before we merge them to main.

Will try to get a review for this PR done tomorrow or on Monday, as time permits.

@peckto (Contributor, Author) commented Apr 18, 2021

I added the double-buffer structure for the frame buffer Mat. The main thread copies the received frame into the next buffer while the ai thread works on the current buffer, and the buffers are swapped on every cycle. This way, we have only one copy operation per frame.

@peckto (Contributor, Author) commented Apr 18, 2021

I had a misunderstanding regarding pthread_cond_signal and pthread_cond_wait. I thought the wait would be asynchronous, in the sense that the condition stays satisfied as long as the signal arrived before the wait. But as far as I can see, the wait is only released by the next signal. This caused an unnecessarily long wait at the pthread_cond_wait call. To fix this, I now use a flag to indicate that a new frame has arrived, and the ai thread just sleeps until the flag becomes true.
I'm not so much into threading. If you know a cleaner way to achieve this, please let me know.

@BenBE (Collaborator) commented Apr 18, 2021

You might be looking for this:
https://en.cppreference.com/w/cpp/thread/condition_variable/wait_for

@phlash (Collaborator) commented Apr 18, 2021

@peckto - yep you are correct, I re-read the manual (https://linux.die.net/man/3/pthread_cond_wait):

When using condition variables there is always a Boolean predicate involving shared variables associated with each condition wait that is true if the thread should proceed.

So there should indeed be a new-frame indicator, set just before raising the signal, and checked/cleared after wakeup in the other thread, e.g.:

    // In calc_mask()
    // ...
    pthread_mutex_lock(&info.lock_raw);
    while (!info.new_frame) {
        pthread_cond_wait(&info.condition_new_frame, &info.lock_raw);
    }
    info.new_frame = false;
    // ...any other stuff requiring the lock
    pthread_mutex_unlock(&info.lock_raw);
...
    // In main()
    // ...
    pthread_mutex_lock(&calcinfo.lock_raw);
    // ...any other stuff requiring the lock
    calcinfo.new_frame = true;
    pthread_cond_signal(&calcinfo.condition_new_frame);
    pthread_mutex_unlock(&calcinfo.lock_raw);

This will avoid the issue you noticed where the mask calculation rate drops to half the frame rate, since the thread otherwise waits for the next signal even when a new frame is already available.
As @BenBE linked to, the C++11 thread library works in a similar manner (although it has a helper overload that runs the predicate wait loop for you).
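
For comparison, a sketch of the same pattern with the C++11 primitives BenBE linked to; the predicate overload of std::condition_variable::wait replaces the explicit while loop, and the names mirror the pthread sketch above:

    #include <condition_variable>
    #include <mutex>

    std::mutex lock_raw;
    std::condition_variable condition_new_frame;
    bool new_frame = false;

    // Mask thread: sleeps until the flag is set, then clears it.
    void wait_for_frame() {
        std::unique_lock<std::mutex> lk(lock_raw);
        condition_new_frame.wait(lk, [] { return new_frame; });
        new_frame = false;
    }

    // Main thread: sets the flag under the lock, then wakes the mask thread.
    void publish_frame() {
        {
            std::lock_guard<std::mutex> g(lock_raw);
            new_frame = true;
        }
        condition_new_frame.notify_one();
    }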

@peckto (Contributor, Author) commented Apr 18, 2021

Combining pthread_cond_wait and the flag is a good idea; I implemented your suggestion.
I noticed the experimental branch now has patches to adapt to the C++ threading library.
I'm not quite sure I understood the order in which you want to merge things into main.

@BenBE (Collaborator) commented Apr 18, 2021

Major refactoring work should normally go to experimental; main should be left for minor fixups and continuous improvements that are unlikely to cause stability issues. Cf. here. Thus this should probably be going to experimental too, as noted above.

@phlash (Collaborator) commented May 24, 2021

@peckto, @BenBE polite nudge for this one - I suggest rebasing to experimental and ensuring the C++11 thread library is used? Then we can pull this in, as I find it makes a material improvement.

@BenBE (Collaborator) commented May 24, 2021

> I suggest rebasing to experimental and ensuring the C++11 thread library is used? Then we can pull this in, as I find it makes a material improvement.

ACK.

@peckto peckto changed the base branch from main to experimental May 26, 2021 13:59
@peckto (Contributor, Author) commented May 26, 2021

I'm working on the rebase onto the experimental branch.

I noticed PR #86, which seems to have code changes overlapping with this PR.
Do you have a plan for the order in which you want to merge these two PRs, to minimize merge conflicts?

@BenBE (Collaborator) commented May 26, 2021

@peckto Any preference on your side? IMHO splitting things into a library first and afterwards extracting some operations into their own thread is probably the more efficient way to go, as the library split forces a separation of concerns in some sense. What do you think?

@phlash (Collaborator) commented May 26, 2021

I think #86 is going to need a fair amount of tidying up before it gets merged, and I'd like this one in sooner to improve UX. IMO any threading should be external to TF/the library (which aims to do one job well), so there shouldn't be a clash (although I expect a few function name changes to come along with the separation, they should be trivial fixes).

@peckto (Contributor, Author) commented May 30, 2021

The MR is now rebased onto the experimental branch and the commits are squashed.
For threading, I'm now using the C++ standard library.

@BenBE (Collaborator) commented Jun 25, 2021

@peckto Would you mind rebasing to resolve the merge conflicts?

@peckto (Contributor, Author) commented Jun 25, 2021

Hi @BenBE, sorry for the delay.
I haven't had much time to work on this project in the last few weeks.

A rebase turns out not to be straightforward; a rewrite is probably better. The thread class is missing anyhow.
I'll have a look in the next few days.

@BenBE (Collaborator) commented Jun 25, 2021

Sure, don't worry. Absolutely no need to rush anything.

@peckto (Contributor, Author) commented Jun 30, 2021

Having the actual mask calculation in the lib makes this MR much cleaner, I think.
I pushed a first draft of the CalcMask class, implementing the threading.
From an architecture point of view, is this what you were thinking of?
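
For orientation only, a worker class along these lines might be structured like the sketch below; the member names are assumptions, not the actual CalcMask API from this PR:

    #include <opencv2/core.hpp>
    #include <condition_variable>
    #include <mutex>
    #include <thread>
    #include <utility>

    class CalcMask {
        std::mutex lock;
        std::condition_variable cond;
        cv::Mat frame_next, frame_current;   // double buffer
        bool new_frame = false, done = false;
        std::thread worker;                  // declared last: starts after members

        void run() {
            std::unique_lock<std::mutex> lk(lock);
            while (true) {
                cond.wait(lk, [this] { return new_frame || done; });
                if (done) break;
                new_frame = false;
                std::swap(frame_next, frame_current);
                lk.unlock();
                // ... run inference on frame_current and update the mask ...
                lk.lock();
            }
        }
    public:
        CalcMask() : worker(&CalcMask::run, this) {}
        ~CalcMask() {
            {
                std::lock_guard<std::mutex> g(lock);
                done = true;
            }
            cond.notify_one();
            worker.join();
        }
        // Called from the main thread for every grabbed frame.
        void new_frame_arrived(const cv::Mat &raw) {
            {
                std::lock_guard<std::mutex> g(lock);
                raw.copyTo(frame_next);
                new_frame = true;
            }
            cond.notify_one();
        }
    };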

@phlash (Collaborator) commented Jul 1, 2021

Thanks @peckto, that's looking really tidy!

@peckto (Contributor, Author) commented Jul 4, 2021

I adapted the logging.
For that, I moved the maskctx and the callbacks into the CalcMask class.
Performance looks good on my system.
What do you think?

@phlash (Collaborator) commented Aug 5, 2021

This LGTM, works as expected, merging!

@phlash phlash merged commit ec34fdb into floe:experimental Aug 5, 2021