Segmentation fault #3

Open
bosmart opened this issue Jan 6, 2017 · 9 comments

@bosmart

bosmart commented Jan 6, 2017

I'm getting the following segmentation fault when running "make runtest". It works fine in the case of the original caffe-segnet (with cuDNN 3.0.8).

[ RUN ] LayerFactoryTest/2.TestCreateLayer
*** Aborted at 1483730734 (unix time) try "date -d @1483730734" if you are using GNU date ***
PC: @ 0x7fe5c0d9cf25 caffe::BasePrefetchingDataLayer<>::~BasePrefetchingDataLayer()
*** SIGSEGV (@0x208) received by PID 8650 (TID 0x7fe5c15d5ac0) from PID 520; stack trace: ***
@ 0x7fe5c033a390 (unknown)
@ 0x7fe5c0d9cf25 caffe::BasePrefetchingDataLayer<>::~BasePrefetchingDataLayer()
@ 0x7fe5c0e55099 caffe::DataLayer<>::~DataLayer()
@ 0xb49c08 caffe::LayerFactoryTest_TestCreateLayer_Test<>::TestBody()
@ 0xde7453 testing::internal::HandleExceptionsInMethodIfSupported<>()
@ 0xde038a testing::Test::Run()
@ 0xde04d8 testing::TestInfo::Run()
@ 0xde05e5 testing::TestCase::Run()
@ 0xde217f testing::internal::UnitTestImpl::RunAllTests()
@ 0xde24a3 testing::UnitTest::Run()
@ 0x8905cd main
@ 0x7fe5ba028830 __libc_start_main
@ 0x8973a9 _start
@ 0x0 (unknown)
Segmentation fault (core dumped)
src/caffe/test/CMakeFiles/runtest.dir/build.make:57: recipe for target 'src/caffe/test/CMakeFiles/runtest' failed
make[3]: *** [src/caffe/test/CMakeFiles/runtest] Error 139
CMakeFiles/Makefile2:328: recipe for target 'src/caffe/test/CMakeFiles/runtest.dir/all' failed
make[2]: *** [src/caffe/test/CMakeFiles/runtest.dir/all] Error 2
CMakeFiles/Makefile2:335: recipe for target 'src/caffe/test/CMakeFiles/runtest.dir/rule' failed
make[1]: *** [src/caffe/test/CMakeFiles/runtest.dir/rule] Error 2
Makefile:240: recipe for target 'runtest' failed
make: *** [runtest] Error 2
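
A note on the "Error 139" in the make output above: shells conventionally report a process killed by a signal as 128 plus the signal number, so 139 corresponds to signal 11 (SIGSEGV), which matches the "Segmentation fault (core dumped)" line. The following is a minimal sketch of that arithmetic; the helper name `describe_exit_status` is made up for illustration and is not part of Caffe or make.

```python
import signal


def describe_exit_status(status: int) -> str:
    """Interpret a shell-style exit status such as the N in make's "Error N".

    Assumes the common shell convention that a status above 128 means
    "terminated by signal (status - 128)".
    """
    if status > 128:
        signum = status - 128
        try:
            name = signal.Signals(signum).name
        except ValueError:
            # Not a signal number known on this platform.
            name = f"signal {signum}"
        return f"killed by {name}"
    return f"exited with code {status}"


print(describe_exit_status(139))  # killed by SIGSEGV
print(describe_exit_status(2))    # exited with code 2
```

The "Error 2" lines further up the make call chain are make reporting that a sub-make failed, not a separate crash.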

@bosmart
Author

bosmart commented Jan 6, 2017

I have just noticed issue #2.
My setup is Ubuntu 16.04.1 LTS, CUDA 8.0, GeForce 980Ti.

Interestingly enough, on my second machine (Ubuntu 16.04.1 LTS, CUDA 8.0, Tesla K40) it works without any issues.

@TimoSaemann
Owner

I cannot reproduce that error.
I tried it on 3 different machines and no error occurred:

  1. Ubuntu 14.04, CUDA 8.0, Titan X (Pascal), cuDNN v4/v5/v5.1, compiled with cmake and make
  2. Ubuntu 14.04, CUDA 7.5, Titan X (Maxwell), cuDNN v4/v5/v5.1, compiled with cmake and make
  3. Ubuntu 16, CUDA 8.0, GTX 980, cuDNN v5.1, compiled with cmake

Did you compile it with cmake or make? Did you change anything in your Makefile.config other than uncommenting the cuDNN flag?
Can you test and train SegNet anyway? If not, which errors do you encounter?

@bosmart
Author

bosmart commented Jan 11, 2017

I compiled with cmake in both cases, i.e.:

  1. Ubuntu 16.04.1 - CUDA 8.0 - Tesla K40 (works fine)
  2. Ubuntu 16.04.1 - CUDA 8.0 - GeForce 980Ti or Titan X (produces the fault)

Interestingly enough, the fault only happens when the caffe process is terminating. It completes the given number of iterations, saves the snapshot, etc., and only then throws the fault on exit.
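
Since the crash described here happens only at process exit, a script that drives training may want to tell "crashed during teardown, snapshot already written" apart from a run that actually failed. The following is a rough sketch of that check, not a fix for the underlying bug. The function name `run_and_classify` and the idea of checking for a snapshot file are assumptions made for illustration; it relies on Python's `subprocess` reporting a child killed by a signal as a negative return code on POSIX systems.

```python
import signal
import subprocess
from pathlib import Path


def run_and_classify(cmd, snapshot_path):
    """Run cmd and describe how it ended (POSIX only).

    snapshot_path is a hypothetical path to the file the run is expected
    to write before exiting. A stale file from an earlier run would be
    counted as well, so delete old snapshots first.
    """
    result = subprocess.run(cmd)
    saved = Path(snapshot_path).exists()
    # On POSIX, subprocess reports "killed by signal N" as returncode -N.
    segfault = result.returncode == -signal.SIGSEGV
    if result.returncode == 0:
        return "clean exit"
    if segfault and saved:
        return "segfault after snapshot was written"
    if segfault:
        return "segfault before snapshot was written"
    return f"failed with return code {result.returncode}"
```

Here `cmd` would be the usual training command as an argument list, and `snapshot_path` the solver's final snapshot file; both depend on the local setup.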

@jgorgenucsd

I also get this segfault, with cuDNN 5.05.
As @bosmart mentioned, SegNet trains and saves the solver state, and then Caffe's BasePrefetchingDataLayer apparently dies while the model is being destroyed:

I0213 09:10:14.745064 29461 solver.cpp:322] Optimization Done.
I0213 09:10:14.745074 29461 caffe.cpp:254] Optimization Done.
*** Aborted at 1487005814 (unix time) try "date -d @1487005814" if you are using GNU date ***
PC: @ 0x7f6497727d1c (unknown)
*** SIGSEGV (@0xfffffff7) received by PID 29461 (TID 0x7f6499c259c0) from PID 18446744073709551607; stack trace: ***
@ 0x7f64976dbcb0 (unknown)
@ 0x7f6497727d1c (unknown)
@ 0x7f649951c68b caffe::BasePrefetchingDataLayer<>::~BasePrefetchingDataLayer()
@ 0x7f64995eeb5b caffe::DenseImageDataLayer<>::~DenseImageDataLayer()
@ 0x7f64995eedb2 boost::detail::sp_counted_impl_p<>::dispose()
@ 0x40fcd1 caffe::Net<>::~Net()
@ 0x7f64994459e2 boost::detail::sp_counted_impl_p<>::dispose()
@ 0x7f64994ad4b1 caffe::SGDSolver<>::~SGDSolver()
@ 0x40dd59 boost::detail::shared_count::~shared_count()
@ 0x40b5d1 train()
@ 0x408363 main
@ 0x7f64976c6f45 (unknown)
@ 0x408ce1 (unknown)
@ 0x0 (unknown)
Segmentation fault (core dumped)

@ilia-nikiforov

ilia-nikiforov commented Feb 21, 2017

Very similar issue here: cuDNN 5.1, CUDA 8.0, GeForce GTX 860M, Ubuntu 16.04. Various tests fail on runtest with both cmake and make, but SegNet runs and trains fine. However, if I use an LMDB data layer, I get a segmentation fault at the end of every run, after everything is calculated and saved. If I put a "del net" command in any Python script after initializing net, I get a segmentation fault.
DenseImageData works fine, however. @bosmart @jgorgenucsd, are you using DenseImageData input or some other type of input layer?
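
The "del net" observation suggests a small reproduction that can be run in a child process, so the crash does not take down the calling script. The following is a sketch under assumptions: `segfaults_in_child` is a made-up helper, "deploy.prototxt" is a placeholder for a network definition that uses the LMDB data layer, and the check relies on POSIX return codes. The child is started with Python's `-X faulthandler` option so that a Python-level traceback is printed if it crashes.

```python
import signal
import subprocess
import sys


def segfaults_in_child(snippet: str) -> bool:
    """Run a Python snippet in a child interpreter; True if it died of SIGSEGV."""
    result = subprocess.run(
        [sys.executable, "-X", "faulthandler", "-c", snippet],
        capture_output=True,
        text=True,
    )
    if result.stderr:
        # Forward the child's diagnostics, including any faulthandler traceback.
        print(result.stderr, file=sys.stderr)
    # On POSIX, subprocess reports "killed by signal N" as returncode -N.
    return result.returncode == -signal.SIGSEGV


# Hypothetical reproduction of the report above. The prototxt path is a
# placeholder and must point at a real network definition.
REPRO = """
import caffe
net = caffe.Net("deploy.prototxt", caffe.TEST)
del net
"""

# Example call (requires pycaffe and a valid prototxt):
# print(segfaults_in_child(REPRO))
```

Running the same snippet against a DenseImageData network and an LMDB network would show whether the input layer type makes the difference on a given machine.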

@xiaozai

xiaozai commented Sep 12, 2017

Hi, I have exactly the same error. How did you solve it? Thanks.

@drewbo

drewbo commented Apr 27, 2018

Having this same error (trains, saves solver state, fails). Were you able to reproduce it, @TimoSaemann? I can send along my full workflow shortly if that helps.

@vsuryamurthy

I am having the same error when I use LMDB. Does anyone know the reason for the segmentation fault?

@ilia-nikiforov

As with others here, my problem disappeared when I switched machines. My particular switch was from a laptop with a GTX860M to a desktop with a GTX1070.
