
Create thread for mask calculation #59

Merged
merged 4 commits into floe:experimental on Aug 5, 2021

Conversation

@peckto (Contributor) commented Apr 4, 2021

This is a new implementation of #38

I moved the grab functionality into the main thread and created a new thread for the mask calculation.

@BenBE (Collaborator) commented Apr 4, 2021

IMHO it's an advantage to have the grabber thread and the mask calculation in different threads. Ideally, and this is somewhat suboptimal right now, the grabber thread just provides images it captured from the camera while the mask calculation thread latches onto the last grabbed frame whenever it is ready for the next frame for processing. That way frames could be silently discarded when processing a mask is too slow.

@phlash (Collaborator) commented Apr 4, 2021

My fork (https://github.com/phlash/deepbacksub/) takes a similar approach to the one Ben suggests: a grab-mask-output thread that runs at full video frame rate using only OpenCV and V4L2, into which the mask calculator thread inserts an updated mask, generated at whatever rate it can maintain, from the latest frame grabbed. This is also the approach taken by ViBa and Jitsi to minimise video lag, even if the ML is a bit challenged on cycles.

Note that the grab thread at present exists to ensure we read input frames promptly, again to reduce lag, as OpenCV will queue up video rather than discard it (and there are few/no controls over this). As long as we continue to read-mask-write quickly I don't mind which thread has which responsibility :)
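
For illustration, the grab-and-latch pattern described above might look like this minimal sketch (names like grab_loop and latch_latest are illustrative, not project code):

    #include <opencv2/opencv.hpp>
    #include <mutex>

    cv::VideoCapture cap("/dev/video0", cv::CAP_V4L2);
    cv::Mat latest;
    std::mutex latest_lock;

    // Grabber: runs at full camera frame rate so V4L2/OpenCV never queue up.
    void grab_loop() {
        cv::Mat frame;
        while (cap.read(frame)) {
            std::lock_guard<std::mutex> g(latest_lock);
            frame.copyTo(latest);    // frames grabbed in between are dropped
        }
    }

    // Mask thread: latches onto the newest frame whenever it is ready.
    cv::Mat latch_latest() {
        std::lock_guard<std::mutex> g(latest_lock);
        return latest.clone();
    }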

@BenBE (Collaborator) commented Apr 4, 2021

The approach of updating the mask asynchronously and passing frames through with a possibly slightly outdated mask seems fine too for most situations. It might be an option to either discard frames or use the old mask.

@phlash (Collaborator) commented Apr 5, 2021

@peckto I have now read through your changes in detail and tried them out locally; it looks good. I'll mark up a couple of syntax things I spotted, but otherwise great stuff!

Do we think anyone would want this to operate synchronously below the input frame rate (i.e. the old behaviour)? If so, how?

deepseg.cc Outdated
// copy frame from shared buffer
cv::Mat raw;
pthread_mutex_lock(&info.lock_raw);
raw = info.raw.clone();
@phlash (Collaborator) commented Apr 5, 2021

Can we avoid the double cloning (in/out of shared buffer)? Unsure of performance impact... see below...

Reply (Collaborator):

Structurally we could pre-allocate a few picture buffers and just hand out pointers to them. This would require proper tracking of which pointers are in use at any time, and some preparations not yet present in the current code.

deepseg.cc Outdated
pthread_mutex_unlock(&capinfo.lock);
// we can now guarantee capinfo.raw will remain unchanged while we process it..
calcinfo.raw = *capinfo.raw;
calcinfo.raw = raw.clone();
Reply (Collaborator):

See the above comment regarding the double clone; we may want to avoid this one if there is any material performance improvement.

@peckto (Contributor, Author) commented Apr 5, 2021

Regarding opencv and the grab thread:
If I understood you correctly, the main problem is that opencv buffers frames, which can lead to a delay if we do not grab new frames at the maximum frame rate.
In the current (1-thread) implementation, this would occur only if the clipping and v4l2loopback steps took too long.
I would assume that under normal conditions this is not the case. Additionally, opencv now provides a buffer setting (CAP_PROP_BUFFERSIZE) for the V4L backend:
https://github.com/opencv/opencv/blob/master/modules/videoio/src/cap_v4l.cpp

With this, I think we can work with the single-thread option.
An extra grab thread makes things quite complicated in my opinion, and the benefit is probably not that high.
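
For illustration, capping the V4L capture queue with this property might look like the following minimal sketch (the device path is an example):

    #include <opencv2/videoio.hpp>

    int main() {
        cv::VideoCapture cap("/dev/video0", cv::CAP_V4L2);
        // Ask the V4L backend to keep at most one frame queued, so reads
        // return a recent frame rather than a stale, buffered one.
        cap.set(cv::CAP_PROP_BUFFERSIZE, 1);
        return 0;
    }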

@peckto (Contributor, Author) commented Apr 5, 2021

@phlash: Thank you for your feedback! I will address your points in the next few days.

I'm just not sure how we should proceed with the "double copy" thing.
If this is not really a performance issue, I prefer to keep it simple.
(The copy performance is of course dependent on the video resolution.)

@BenBE (Collaborator) commented Apr 5, 2021

It's preferable to avoid any large copies of memory buffers, both for performance reasons and for reasons of memory/resource usage. An easy solution is to pre-allocate several buffers large enough to hold a frame and just hand pointers to these allocated objects around. With the current architecture of one thread for grabbing, one for mask generation and the main thread handling the final frame merging, this means we'd be fine with a pool of about 8 buffers (one active and one inactive per thread, plus some small reserve). That way each thread could hold on to the last active buffer it acquired from the previous step without blocking, while still allowing enough slack that slow mask calculations won't stall the main thread too much, all while ensuring constant memory usage.
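
A minimal sketch of such a pool, using hypothetical names (FramePool, acquire, release) that are not part of this PR:

    #include <opencv2/core.hpp>
    #include <mutex>
    #include <vector>

    class FramePool {
        std::mutex lock;
        std::vector<cv::Mat> storage;      // allocated once, up front
        std::vector<cv::Mat*> free_list;
    public:
        FramePool(size_t count, int rows, int cols, int type) : storage(count) {
            for (auto &m : storage) {
                m.create(rows, cols, type);
                free_list.push_back(&m);
            }
        }
        cv::Mat *acquire() {               // returns nullptr when exhausted
            std::lock_guard<std::mutex> g(lock);
            if (free_list.empty()) return nullptr;
            cv::Mat *m = free_list.back();
            free_list.pop_back();
            return m;
        }
        void release(cv::Mat *m) {         // hand the buffer back to the pool
            std::lock_guard<std::mutex> g(lock);
            free_list.push_back(m);
        }
    };

Threads would then pass these pointers around instead of cloning pixel data, e.g. FramePool pool(8, height, width, CV_8UC3); for the pool of about 8 buffers mentioned above.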

@phlash (Collaborator) commented Apr 5, 2021

I'd like to hold off on a buffer pool until we know the impact of a couple of clone() calls between threads - the timing we have should make it obvious if we have a problem or not. Thanks for the speedy work @peckto!

@peckto (Contributor, Author) commented Apr 7, 2021

I added some more measurements for the ai thread.
On my machine it looks like the following:

$ ./deepseg -d -d -c /dev/video0 -v /dev/video20 -H 
video [grab: 58741363 retr:   721098 lock:      471 copy:   632780 post:   928098 v4l2:  2864112] ai [cond: 30599240 copy:   183267 pre:  1087840 tflt: 31788445 post:   680549 mask:   144196] - FPS: 16.56 (video) 19.19 (ai)
video [grab: 54829101 retr:   452109 lock:      365 copy:   314983 post:   950727 v4l2:  3267526] ai [cond: 26038497 copy:   230877 pre:   995302 tflt: 31148558 post:   778711 mask:   173230] - FPS: 18.12 (video) 15.63 (ai)
video [grab: 54883606 retr:  1623516 lock:      521 copy:   744499 post:   959518 v4l2:  3415035] ai [cond: 25964401 copy:   180704 pre:  1125331 tflt: 33466405 post:   686013 mask:   155247] - FPS: 17.22 (video) 16.24 (ai)
video [grab: 53589089 retr:   500830 lock:     1068 copy:   354910 post:   925405 v4l2:  3216689] ai [cond: 23785185 copy:   163880 pre:  1027631 tflt: 32905018 post:   707593 mask:   149251] - FPS: 18.61 (video) 16.38 (ai)
video [grab: 54525152 retr:  1019776 lock:      653 copy:   974318 post:  1064244 v4l2:  3157460] ai [cond: 21241901 copy:   244925 pre:  1312905 tflt: 37411994 post:   690571 mask:   179552] - FPS: 17.48 (video) 15.81 (ai)
video [grab: 53905385 retr:   452957 lock:      835 copy:   319837 post:   968303 v4l2:  3224764] ai [cond: 25224082 copy:   239520 pre:   966928 tflt: 31087658 post:   702594 mask:   343650] - FPS: 18.21 (video) 18.20 (ai)

(For some reason I currently don't get 30fps??)

I agree with you that, in general, we should avoid the double copy.
Maybe we could re-use the double-buffer structure from the grabber thread?

@phlash (Collaborator) commented Apr 17, 2021

@peckto, @BenBE: I'm not sure where we are with this PR? To me it looks like @peckto is waiting on a question about double buffering, and then we'll want to check that performance is OK before we merge? We can address a switch to the C++11 thread toolkit in a separate PR (as per the experimental branch).

Regarding the double buffering - I think that should work, with the 'slower' thread (always mask calculation) swapping buffer pointers (while holding a lock) as it either consumes or writes them. This avoids buffer pool management, while also avoiding multiple copies of video frames.
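
For illustration, the pointer swap could be as small as this (a sketch with made-up names, not code from the PR):

    #include <opencv2/core.hpp>
    #include <mutex>
    #include <utility>

    cv::Mat buf_a, buf_b;
    cv::Mat *write_buf = &buf_a;   // grabber copies the newest frame here
    cv::Mat *read_buf  = &buf_b;   // mask thread processes this one
    std::mutex swap_lock;

    // Called by the slower (mask) thread when it is ready for the next frame.
    void swap_buffers() {
        std::lock_guard<std::mutex> g(swap_lock);
        std::swap(write_buf, read_buf);   // O(1): only pointers move, no pixels
    }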

@BenBE (Collaborator) commented Apr 17, 2021

I haven't tested this yet, and with the larger structural rework I think having this land on experimental first is preferable (@floe What do you think?). This also has the advantage of already having the C++11 patches as a basis to work from.

I haven't had much time for thorough reviews and tests of the experimental branch in the last few days, so I'd recommend letting the experimental changes ripen there for another few days before we merge them to main.

Will try to get a review for this PR done tomorrow or on Monday, as time permits.

@peckto (Contributor, Author) commented Apr 18, 2021

I added the double-buffer structure for the frame buffer Mat. The main thread copies the received frame into the next buffer while the ai thread works on the current buffer, and the buffers are swapped on every cycle. This way, we have only one copy operation per frame.

@peckto (Contributor, Author) commented Apr 18, 2021

I had a misunderstanding regarding pthread_cond_signal and pthread_cond_wait. I thought the wait would be asynchronous, in the sense that the condition stays satisfied as long as the signal arrived before the wait. But as far as I can see, the wait is only released by the next signal. This caused an unnecessarily long wait at the pthread_cond_wait call. To fix this, I now use a flag to indicate that a new frame has arrived, and the ai thread just sleeps until the flag becomes true.
I'm not so much into threading. If you know a cleaner way to achieve this, please let me know.

@BenBE (Collaborator) commented Apr 18, 2021

You might be looking for this:
https://en.cppreference.com/w/cpp/thread/condition_variable/wait_for

@phlash (Collaborator) commented Apr 18, 2021

@peckto - yep you are correct, I re-read the manual (https://linux.die.net/man/3/pthread_cond_wait):

When using condition variables there is always a Boolean predicate involving shared variables associated with each condition wait that is true if the thread should proceed.

So there should indeed be a new-frame indicator, set just before raising the signal, and checked/cleared after wakeup in the other thread, e.g.:

    // In calc_mask()
    // ...
    pthread_mutex_lock(&info.lock_raw);
    while (!info.new_frame) {
        pthread_cond_wait(&info.condition_new_frame, &info.lock_raw);
    }
    info.new_frame = false;
    // ...any other stuff requiring the lock
    pthread_mutex_unlock(&info.lock_raw);
...
    // In main()
    // ...
    pthread_mutex_lock(&calcinfo.lock_raw);
    // ...any other stuff requiring the lock
    calcinfo.new_frame = true;
    pthread_cond_signal(&calcinfo.condition_new_frame);
    pthread_mutex_unlock(&calcinfo.lock_raw);

This will avoid the issue you noticed where the mask calculation rate drops to half the frame rate, since the thread otherwise waits for the next signal even when a new frame is already available.
As @BenBE linked to, the C++11 thread library works in a similar manner (although it has a helper overload that runs the predicate wait loop for you).
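
For comparison, a sketch of the same pattern with the C++11 primitives BenBE linked to; the predicate overload of std::condition_variable::wait replaces the explicit while loop, and the names mirror the pthread sketch above:

    #include <condition_variable>
    #include <mutex>

    std::mutex lock_raw;
    std::condition_variable condition_new_frame;
    bool new_frame = false;

    // Mask thread: sleeps until the flag is set, then clears it.
    void wait_for_frame() {
        std::unique_lock<std::mutex> lk(lock_raw);
        condition_new_frame.wait(lk, [] { return new_frame; });
        new_frame = false;
    }

    // Main thread: sets the flag under the lock, then wakes the mask thread.
    void publish_frame() {
        {
            std::lock_guard<std::mutex> g(lock_raw);
            new_frame = true;
        }
        condition_new_frame.notify_one();
    }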

@peckto (Contributor, Author) commented Apr 18, 2021

Combining pthread_cond_wait and the flag is a good idea; I implemented your suggestion.
I noticed the experimental branch now has patches to adapt to the C++ threading library.
I'm not quite sure I understood the order in which you want to merge things into main.

@BenBE (Collaborator) commented Apr 18, 2021

Major refactoring work should normally go to experimental; main should be left for minor fixups and continuous improvements that are unlikely to cause stability issues. Cf. here. Thus this should probably be going to experimental too, as noted above.

@phlash (Collaborator) commented May 24, 2021

@peckto, @BenBE polite nudge for this one - I suggest rebasing to experimental and ensuring the C++11 thread library is used? Then we can pull this in, as I find it makes a material improvement.

@BenBE (Collaborator) commented May 24, 2021

> I suggest rebasing to experimental and ensuring the C++11 thread library is used? Then we can pull this in, as I find it makes a material improvement.

ACK.

@peckto peckto changed the base branch from main to experimental May 26, 2021 13:59
@peckto (Contributor, Author) commented May 26, 2021

I'm working on the rebase onto the experimental branch.

I noticed PR #86, which seems to have code changes overlapping with this PR.
Do you have a plan for the order in which you want to merge these two PRs, to minimize merge conflicts?

@BenBE (Collaborator) commented May 26, 2021

@peckto Any preference on your side? IMHO splitting things into a library first and afterwards extracting some operations into their own thread is probably the more efficient way to go, as the library split forces a separation of concerns in some sense. What do you think?

@phlash (Collaborator) commented May 26, 2021

I think #86 is going to need a fair amount of tidying up before it gets merged, and I'd like this one in sooner to improve UX. IMO any threading should be external to TF/the library (which aims to do one job well), so there shouldn't be a clash (although I expect a few function name changes to come along with the separation, they should be trivial fixes).

@peckto (Contributor, Author) commented May 30, 2021

The MR is now rebased onto the experimental branch and the commits are squashed.
For threading, I'm now using the C++ standard library.

@BenBE (Collaborator) commented Jun 25, 2021

@peckto Would you mind rebasing to resolve the merge conflicts?

@peckto (Contributor, Author) commented Jun 25, 2021

Hi @BenBE, sorry for the delay.
I haven't had much time to work on this project in the last few weeks.

A rebase turns out not to be straightforward; a rewrite is probably better. The thread class is missing anyhow.
I'll have a look in the next few days.

@BenBE (Collaborator) commented Jun 25, 2021

Sure, don't worry. Absolutely no need to rush anything.

@peckto (Contributor, Author) commented Jun 30, 2021

Having the actual mask calculation in the lib makes this MR much cleaner, I think.
I pushed a first draft of the CalcMask class, implementing the threading.
From an architecture point of view, is this what you were thinking of?
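
For orientation only, a worker class along these lines might be structured like the sketch below; the member names are assumptions, not the actual CalcMask API from this PR:

    #include <opencv2/core.hpp>
    #include <condition_variable>
    #include <mutex>
    #include <thread>
    #include <utility>

    class CalcMask {
        std::mutex lock;
        std::condition_variable cond;
        cv::Mat frame_next, frame_current;   // double buffer
        bool new_frame = false, done = false;
        std::thread worker;                  // declared last: starts after members

        void run() {
            std::unique_lock<std::mutex> lk(lock);
            while (true) {
                cond.wait(lk, [this] { return new_frame || done; });
                if (done) break;
                new_frame = false;
                std::swap(frame_next, frame_current);
                lk.unlock();
                // ... run inference on frame_current and update the mask ...
                lk.lock();
            }
        }
    public:
        CalcMask() : worker(&CalcMask::run, this) {}
        ~CalcMask() {
            {
                std::lock_guard<std::mutex> g(lock);
                done = true;
            }
            cond.notify_one();
            worker.join();
        }
        // Called from the main thread for every grabbed frame.
        void new_frame_arrived(const cv::Mat &raw) {
            {
                std::lock_guard<std::mutex> g(lock);
                raw.copyTo(frame_next);
                new_frame = true;
            }
            cond.notify_one();
        }
    };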

@phlash (Collaborator) commented Jul 1, 2021

Thanks @peckto, that's looking really tidy!

@peckto (Contributor, Author) commented Jul 4, 2021

I adapted the logging.
For that, I moved the maskctx and the callbacks into the CalcMask class.
Performance looks good on my system.
What do you think?

@phlash (Collaborator) commented Aug 5, 2021

This LGTM, works as expected, merging!

@phlash phlash merged commit ec34fdb into floe:experimental Aug 5, 2021