Is there any way to reduce the GPU memory usage and enhance the inference speed? #19

Open
JinraeKim opened this issue Sep 9, 2022 · 6 comments

Comments

@JinraeKim

M-LSD's pred_lines takes longer than I expected: it runs at about 6 Hz including the surrounding processing (even M-LSD-tiny only seems to reach about 10 Hz).

And it takes about 2 GB of GPU memory.
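For reference, a rough sketch of how the latency and peak GPU memory could be measured; `infer` here is just a stand-in for the actual pred_lines call, whose exact arguments I'm not showing:

```python
# Rough benchmarking sketch; `infer` is a stand-in for the actual
# pred_lines(...) call (fill in the real arguments for your setup).
import time
import torch

def benchmark(infer, n_iters=100, warmup=10):
    # Warm up so CUDA kernel compilation and allocator caching don't
    # distort the numbers.
    for _ in range(warmup):
        infer()
    torch.cuda.synchronize()
    torch.cuda.reset_peak_memory_stats()

    start = time.perf_counter()
    for _ in range(n_iters):
        infer()
    torch.cuda.synchronize()
    elapsed = time.perf_counter() - start

    print(f"avg latency: {1000 * elapsed / n_iters:.1f} ms "
          f"({n_iters / elapsed:.1f} Hz)")
    print(f"peak GPU memory: {torch.cuda.max_memory_allocated() / 1e9:.2f} GB")
```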

Is there a way to reduce the GPU memory usage and enhance the inference speed? (including TensorRT, etc.)

Please give me some advice, as I'm not an expert on this.

Thanks!

@lhwcv
Owner

lhwcv commented Sep 9, 2022

You can try the TensorRT version by @rhysdg: https://github.com/lhwcv/mlsd_pytorch#benchmarks

@JinraeKim
Author

> You can try the TensorRT version by @rhysdg: https://github.com/lhwcv/mlsd_pytorch#benchmarks

Thanks for sharing the link.

I'm not familiar with it. Would TensorRT reduce memory usage and improve inference speed at the same time?

@rhysdg
Contributor

rhysdg commented Sep 12, 2022

@JinraeKim @lhwcv Apologies for the late reply, busy times! For sure, the main aim with TensorRT is to reduce latency and therefore increase inference speed pretty significantly, with minimal reduction in quality at FP16. Given a successful conversion, you should also see a significant reduction in memory allocation overhead.
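At a high level, an FP16 torch2trt conversion looks something like this (a generic sketch with a stand-in model and shapes, not the exact conversion script in the repo):

```python
# Generic FP16 conversion sketch with torch2trt (commonly used on Jetson).
# The model below is a stand-in so the snippet is self-contained; replace it
# with the real M-LSD model loaded from a checkpoint.
import torch
from torch import nn
from torch2trt import torch2trt

model = nn.Sequential(nn.Conv2d(3, 16, 3, padding=1), nn.ReLU()).cuda().eval()
x = torch.ones(1, 3, 512, 512).cuda()  # example input shape

# fp16_mode=True builds a half-precision TensorRT engine, which is where most
# of the latency and memory savings come from.
model_trt = torch2trt(model, [x], fp16_mode=True)
torch.save(model_trt.state_dict(), "stand_in_trt_fp16.pth")
```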

It's worth bearing in mind that the setup I have here was developed for Jetson-series devices, although my understanding is that it plays nicely with Nvidia's NGC PyTorch Docker container. I'm hoping to start bringing in a TensorRT Python API / PyCUDA version shortly that should work across a wider range of devices. What were you hoping to deploy with, @JinraeKim?

@JinraeKim
Author

@rhysdg Thank you for the detailed explanation!
Yeah, I'm looking to deploy on an Nvidia Jetson, and also on my personal laptop for practice.

That gave me some really nice insight! Thank you again!

@rhysdg
Contributor

rhysdg commented Sep 21, 2022

@JinraeKim I'm working on a more robust tool over at trt-devel that adds the ability to convert custom-trained models with three-channel inputs as per the training code, and drops the result into a folder named according to the experiment. This will eventually become a PR, but I'm hoping to do a little more testing with the ONNX conversion when I get a chance. For now the tool works if you need it for a custom training run, and I can confirm that the results are fantastic with @lhwcv's training script plus some added aggressive pixel-level augmentations!
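For anyone who wants to try before the PR lands, a generic ONNX export of a custom checkpoint looks roughly like this (stand-in model and placeholder paths, not the trt-devel tool itself):

```python
# Generic ONNX export sketch for a custom-trained checkpoint (stand-in model
# and placeholder paths; not the trt-devel tool itself).
import torch
from torch import nn

model = nn.Sequential(nn.Conv2d(3, 16, 3, padding=1), nn.ReLU()).eval()
# model.load_state_dict(torch.load("./models/experiment.pth", map_location="cpu"))

dummy = torch.ones(1, 3, 512, 512)  # three-channel input, as per the training code
torch.onnx.export(
    model, dummy, "experiment.onnx",
    input_names=["input"], output_names=["output"],
    opset_version=11,
)
```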

After that's done, I'll work on a straight TensorRT conversion tool with wider device support, and also post-training quantization for the ONNX representation!
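Roughly what that quantization step could look like with onnxruntime (dynamic quantization as the simplest variant; file names are placeholders):

```python
# Post-training (dynamic) quantization of the exported ONNX model with
# onnxruntime; file names are placeholders.
from onnxruntime.quantization import QuantType, quantize_dynamic

quantize_dynamic(
    "experiment.onnx",        # FP32 export
    "experiment_int8.onnx",   # quantized output
    weight_type=QuantType.QUInt8,
)
```

For conv-heavy models like this, static quantization with a calibration data reader usually gives better accuracy, but dynamic quantization is the simplest starting point.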

@rhysdg
Contributor

rhysdg commented Sep 21, 2022

Ah yes, and I've yet to update the documentation accordingly, but adding the --custom experiment.pth arg, with your checkpoint dropped into ./models/experiment.pth, will result in a sped-up representation at ./models/experiment/mlsd_large/tiny__512_trt_fp16.pth.
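Loading that converted representation for inference would look roughly like this (assuming the saved .pth is a torch2trt TRTModule state dict, which is an assumption here):

```python
# Loading sketch, assuming the converted .pth is a torch2trt TRTModule
# state dict.
import torch
from torch2trt import TRTModule

model_trt = TRTModule()
model_trt.load_state_dict(
    torch.load("./models/experiment/mlsd_large/tiny__512_trt_fp16.pth"))
model_trt.eval()
```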
