Is there any way to reduce the GPU memory usage and enhance the inference speed? #19
Comments
You can try the TensorRT version by @rhysdg: https://github.com/lhwcv/mlsd_pytorch#benchmarks
Thanks for sharing the link. I'm not familiar with it. Would TensorRT reduce memory usage and enhance inference speed at the same time?
@JinraeKim @lhwcv Apologies for the late reply, busy times! For sure, the main aim with TensorRT is to reduce latency, and therefore increase inference speed pretty significantly with minimal reduction in quality at FP16. Given a successful conversion, you should also see a significant reduction in memory allocation overhead. It's worth bearing in mind that the setup I have here was developed for Jetson series devices, although my understanding is that it plays nicely with Nvidia's NGC PyTorch Docker container. I'm hoping to start bringing in a TensorRT Python API / PyCUDA version shortly that should work across a wider range of devices. What were you hoping to deploy with, @JinraeKim?
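For anyone following along, here is a minimal sketch of the FP16 route described above using NVIDIA's torch2trt package, which is a common path on Jetson devices. The model loader, input shape, and file name are placeholders, not the repo's actual conversion code:

```python
import torch
from torch2trt import torch2trt  # NVIDIA-AI-IOT/torch2trt

# Placeholder: load your trained M-LSD model however this repo does it.
model = load_mlsd_model().cuda().eval()

# Dummy input used only to trace the network; adjust the shape to what
# your M-LSD variant actually expects.
x = torch.randn(1, 4, 512, 512).cuda()

# Build a TensorRT-backed module with FP16 kernels enabled.
model_trt = torch2trt(model, [x], fp16_mode=True)

with torch.no_grad():
    out = model_trt(x)  # same call signature as the original nn.Module

# The converted module's state dict can be saved and re-loaded later.
torch.save(model_trt.state_dict(), 'mlsd_trt_fp16.pth')
```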
@rhysdg Thank you for the detailed explanation! It gave me a really nice insight! Thank you again!
@JinraeKim I'm working on a more robust tool over at trt-devel that adds the ability to convert custom-trained models with three-channel inputs as per the training code, and drops the result into a folder named according to the experiment. This will eventually become a PR, but I'm hoping to do a little more testing with the ONNX conversion when I get a chance. For now the tool works if you need it for a custom training run, and I can confirm that the results are fantastic with @lhwcv's training script plus some added aggressive pixel-level augs! After that's done I'll work on a straight TensorRT conversion tool with wider device support, and also post-training quantization for the ONNX representation!
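As a reference for the post-training quantization mentioned above, a minimal sketch with onnxruntime's dynamic quantization is shown below; it assumes the model has already been exported to ONNX, and the file names are hypothetical:

```python
from onnxruntime.quantization import quantize_dynamic, QuantType

# Quantize the weights of an already-exported ONNX model to INT8.
# 'mlsd_tiny.onnx' / 'mlsd_tiny_int8.onnx' are placeholder file names.
quantize_dynamic(
    model_input="mlsd_tiny.onnx",
    model_output="mlsd_tiny_int8.onnx",
    weight_type=QuantType.QInt8,
)
```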
Ah yes, and I'm yet to update the documentation accordingly, but adding the …
The M-LSD's `pred_lines` takes longer than I expected, about ~6 Hz (including other stuff; M-LSD-tiny alone only seems to be about 10 Hz), and it takes about 2 GB of GPU memory.
Is there a way to reduce the GPU memory usage and enhance the inference speed? (including TensorRT, etc.)
Please give me some advice, as I'm not an expert on this.
Thanks!
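For quick wins on the PyTorch side before reaching for TensorRT, a rough sketch is shown below: disable autograd bookkeeping and run the forward pass in half precision. Here `model` and `image_tensor` are placeholders for however M-LSD is loaded and the frame is preprocessed, not code from this repo:

```python
import torch

model = model.cuda().eval()

with torch.inference_mode():  # no autograd graph or gradient buffers
    # FP16 forward pass: roughly halves activation memory and is
    # faster on GPUs with tensor cores.
    with torch.cuda.amp.autocast(dtype=torch.float16):
        lines = model(image_tensor.cuda())
```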