Segmentation fault on nightly and beta when running with --release #49010

Closed
joe-hauns opened this issue Mar 14, 2018 · 37 comments
Assignees
Labels
A-LLVM Area: Code generation parts specific to LLVM. Both correctness bugs and optimization-related issues. C-bug Category: This is a bug. I-unsound Issue: A soundness hole (worst kind of bug), see: https://en.wikipedia.org/wiki/Soundness P-medium Medium priority regression-from-stable-to-stable Performance or correctness regression from one stable version to another. T-compiler Relevant to the compiler team, which will review and decide on the PR/issue.

Comments

@joe-hauns

joe-hauns commented Mar 14, 2018

I encountered a segmentation fault when running a program on nightly or beta in release mode.
The segfault occurs somewhere in another crate but within safe code and only if --release is passed to cargo.
I created a crate that is a minimal setup for reproducing this error:
https://github.com/joeschman/tokio-timer-segfault
A more detailed description can be found in the readme.

@Centril Centril added I-nominated I-unsound Issue: A soundness hole (worst kind of bug), see: https://en.wikipedia.org/wiki/Soundness C-bug Category: This is a bug. labels Mar 16, 2018
@fu5ha
Contributor

fu5ha commented Mar 16, 2018

I am also experiencing this in https://github.com/termhn/nano-rs on nightly-x86_64-pc-windows-msvc, but not on my Mac, which is also on nightly.

@fu5ha
Contributor

fu5ha commented Mar 16, 2018

This fixes it: tokio-rs/tokio-timer#40 ... I'm not sure what that means or whether it is helpful (maybe some optimization involving continue is breaking things?)

@michaelwoerister
Member

cc @rust-lang/compiler

@michaelwoerister michaelwoerister added the T-compiler Relevant to the compiler team, which will review and decide on the PR/issue. label Mar 16, 2018
@nikomatsakis nikomatsakis added regression-from-stable-to-nightly Performance or correctness regression from stable to nightly. and removed regression-from-stable-to-nightly Performance or correctness regression from stable to nightly. labels Mar 16, 2018
@nikomatsakis
Contributor

Do we know if this is a regression? Can one of you try to reproduce on older nightly builds and see if you can find some point that worked?

@nikomatsakis
Contributor

My guess would be that it is some kind of optimization gone wrong in LLVM, but whether that's LLVM's fault or ours is sort of unclear.

@joe-hauns
Author

> Can one of you try to reproduce on older nightly builds and see if you can find some point that worked?

I'd test that, but unfortunately I could not find out how to revert to older nightly/beta versions using rustup :/

@oli-obk
Contributor

oli-obk commented Mar 19, 2018

You can use rustup override set nightly-2018-01-15 to pin the toolchain to, e.g., the 15 January nightly.

@joe-hauns
Author

I checked older nightly versions, and the segfault first appears with nightly-2018-02-11-x86_64-apple-darwin.

@nikomatsakis
Contributor

@joeschman Thanks! Btw, jfyi there is a new shiny tool by @Mark-Simulacrum that can do the bisection for you:

https://github.com/rust-lang-nursery/cargo-bisect-rustc

It will even find the problem down to the PR level if it arose in the last 90 days.

In any case, these are the diffs between those two nightlies, if I'm not mistaken:

3bcda48...45fba43
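
The bisection idea discussed above can be sketched in a few lines of Rust. This is only an illustration of binary search over an ordered list of toolchains, not how cargo-bisect-rustc is implemented. The `first_bad` helper and the closure are hypothetical; in practice the closure would stand for "build and run the reproducer in release mode with this toolchain and report whether it crashed". The sketch assumes every toolchain before the regression is good and every one after it is bad.

```rust
/// Returns the index of the first item for which `is_bad` is true,
/// assuming all good items come before all bad ones.
/// Returns `None` if no item is bad.
fn first_bad<T>(items: &[T], mut is_bad: impl FnMut(&T) -> bool) -> Option<usize> {
    let (mut lo, mut hi) = (0, items.len());
    // Invariant: everything before `lo` is good, everything from `hi` on is bad.
    while lo < hi {
        let mid = lo + (hi - lo) / 2;
        if is_bad(&items[mid]) {
            hi = mid;
        } else {
            lo = mid + 1;
        }
    }
    if lo < items.len() { Some(lo) } else { None }
}

fn main() {
    let nightlies = [
        "nightly-2018-02-08",
        "nightly-2018-02-09",
        "nightly-2018-02-10",
        "nightly-2018-02-11",
        "nightly-2018-02-12",
    ];
    // Stand-in predicate (an assumption for this sketch): pretend every
    // toolchain from 2018-02-11 onward reproduces the crash.
    let first = first_bad(&nightlies, |name| *name >= "nightly-2018-02-11");
    assert_eq!(first, Some(3));
    println!("first bad toolchain: {}", nightlies[3]);
}
```

Each probe halves the remaining range, so a 90-day window needs about seven builds rather than ninety.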

@nikomatsakis
Contributor

One thing jumps out at me:

6b7b6b6

Upgrade to LLVM 6 =)

@fu5ha
Contributor

fu5ha commented Mar 19, 2018

Indeed. Definitely seems like the kind of bug a new LLVM version could introduce.

@nikomatsakis
Contributor

cc @rust-lang/compiler -- anybody want to try to track down a potential LLVM bug introduced by the LLVM 6 transition?

@nagisa seems like your speciality :)

@nagisa
Member

nagisa commented Mar 20, 2018

Okay, I’ll try to take a look at this tomorrow.

@nagisa
Member

nagisa commented Mar 21, 2018

Ugh, I cannot reproduce the fault on Linux, so it is very hard for me to even start investigating this more seriously.

The supposed error location is within a huge inlined 1.5k-line blob of IR, and the function with and without the fix at tokio-rs/tokio-timer#40 appears to be optimised down to identical code, at least for that specific function.

@fu5ha
Contributor

fu5ha commented Mar 21, 2018 via email

@nagisa
Member

nagisa commented Mar 21, 2018

Can you see if the issue is reproducible on Windows with the minimal sample provided in the first comment?

@joe-hauns
Author

If there's any more information I can provide for the problem on macOS, just tell me what ;)

@nagisa
Member

nagisa commented Mar 22, 2018

Wrote this comment yesterday but forgot to hit submit.


So here’s the thing that sticks out like a sore thumb to me in the optimised IR, as far as differences between the "bad" and "fixed" versions go:

  %min.0.ph32.i.i.i = phi i64* [ null, %bb8.lr.ph.i.lr.ph.lr.ph.i.i.i ], [ %.lcssa23.i164.pre-phi.i.pre-phi.i.pre-phi, %bb11.i165.i.i ]
  ; ...
  %328 = icmp ne i64* %min.0.ph32.i.i.i, null
  ; a while later
  call void @llvm.assume(i1 %328) #13, !noalias !213

I still haven’t checked the IR closely enough to confirm whether the CFG allows this assume to ever be called on a false value, but this is really the only notable difference between the fixed and non-fixed versions anyway.

Surprisingly, this %328 is never ever used again.
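
As a reading aid for the IR above, here is a hedged Rust sketch of source that could lower to a pointer phi starting at null followed by an `icmp ne ..., null` check. This is a guess at the general shape, not the actual tokio-timer code; `min_ref` is a hypothetical helper. It relies on one documented guarantee: `Option<&T>` has the same size as `&T`, with `None` represented as a null pointer, so "is `min` a `Some`?" becomes a comparison against null. `llvm.assume` then tells the optimizer that the condition holds, and reaching it with a false condition is undefined behaviour at the LLVM level.

```rust
use std::mem::size_of;

// Hypothetical helper: track the smallest element by reference.
// `min` starts as `None`, which is represented as a null pointer.
fn min_ref(values: &[u64]) -> Option<&u64> {
    let mut min: Option<&u64> = None;
    for v in values {
        match min {
            // Keep the current minimum and move on to the next element.
            Some(m) if *m <= *v => continue,
            // First element, or a new smaller one.
            _ => min = Some(v),
        }
    }
    min
}

fn main() {
    // Null-pointer optimization: no separate discriminant is stored.
    assert_eq!(size_of::<Option<&u64>>(), size_of::<&u64>());
    assert_eq!(min_ref(&[3, 1, 2]), Some(&1));
    assert_eq!(min_ref(&[]), None);
}
```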

@nagisa
Member

nagisa commented Mar 22, 2018

Things that would be helpful and make it easier to debug this:

  1. A single-file minimal reproducer with as few dependencies on system utilities (such as threads) as possible;
  2. Backtrace from a debugger, register dump at the time of the fault and disassembly of the function.

A minimal reproducer that works on Windows would be better, as I have access to a Windows VM, though perhaps it is as good a time as ever to figure out how to make a Mac VM… hmm.

@nikomatsakis nikomatsakis added the regression-from-stable-to-nightly Performance or correctness regression from stable to nightly. label Mar 22, 2018
@nagisa nagisa added A-LLVM Area: Code generation parts specific to LLVM. Both correctness bugs and optimization-related issues. regression-from-stable-to-beta Performance or correctness regression from stable to beta. and removed regression-from-stable-to-nightly Performance or correctness regression from stable to nightly. labels Mar 22, 2018
@nikomatsakis
Contributor

Assigning to nagisa, but it seems like they are stuck without some way to reproduce.

@nikomatsakis
Contributor

Assigning to @pnkfelix to try and reproduce on mac

@joe-hauns
Author

I have now minimized the example. All the code is in the repository, with no dependencies on crates outside it. Unfortunately, a one-file reproducer was not possible: the bug seems to occur only when the code is called from another crate. I minimized that crate to 20 lines of code. When the same code is run from an internal module, no segfault occurs.
Both variants (calling the module function and calling the crate function) are included in the last commit of the crate I linked above.

Furthermore, I found that with yesterday's nightly build (nightly-2018-03-21-x86_64-apple-darwin) the segfault no longer occurs. I get the segfault with nightly-2018-02-11-x86_64-apple-darwin to nightly-2018-02-15-x86_64-apple-darwin. (I didn't test *-03-16-* to *-03-20-*.)

> Backtrace from a debugger, register dump at the time of the fault and disassembly of the function.

Any recommendations for a debugger for rust on a mac? I have hardly any experience with using debuggers at all.

@fu5ha
Contributor

fu5ha commented Mar 24, 2018

Alright so I was testing on my Windows machine and confirmed the same behavior as @joeschman has on mac. I also pinpointed that it is fixed in nightly-2018-03-03 and later, but broken in nightly-2018-03-02 and earlier.

@pnkfelix
Member

I have reproduced the problem on my mac, using the joeschman/tokio-timer-segfault repo and the nightly-2018-03-02 rustc. I'll report more as I dissect further.

@pnkfelix
Member

Here is a gist https://gist.github.com/pnkfelix/fdab1b374d49e8850073a357d4f492f4 with the things that @nagisa had asked for (a stack backtrace, dump of the register file, and a disassembly of the function).

@pnkfelix
Member

@joeschman by the way, the way I got the info in the above gist is I just ran the binary under lldb, like so:

% lldb target/release/tokio-timer-segfault

and then when I got the debugger prompt ((lldb)), I ran the program with r:

(lldb) r

and then, after the program exception occurred, I used the following commands (their output is included in the gist above): bt, register read, and dis.

lldb has an online help mechanism you can use to explore its set of commands; or you can google for cheat sheets online.

@pnkfelix
Member

@nagisa pointed out that I was using an old head of the source repo.

I updated and re-ran the same lldb commands, and updated the gist (which I'll relink for convenience: https://gist.github.com/pnkfelix/fdab1b374d49e8850073a357d4f492f4 )

@pnkfelix
Member

Experimentation and discussion with @nagisa and @alexcrichton have led us to the hypothesis that this is a bug injected by ThinLTO, which is on by default for --release builds ...

@alexcrichton alexcrichton added regression-from-stable-to-stable Performance or correctness regression from one stable version to another. and removed regression-from-stable-to-beta Performance or correctness regression from stable to beta. labels Apr 5, 2018
@nikomatsakis
Contributor

commit range from nightly-2018-03-02 to nightly-2018-03-03 is 3eeb5a6 .. 9cb18a9

@nikomatsakis
Contributor

git log 3eeb5a665..9cb18a92a --author=bors --oneline yields:

9cb18a92ad Auto merge of #48653 - Manishearth:rollup2, r=Manishearth
ddfbf2b0f4 Auto merge of #47861 - sgrif:sg-rebase-chalkify-universe-refactorings, r=nikomatsakis

@nikomatsakis
Contributor

I doubt these are the culprits, but skimming the commit log shows @alexcrichton's "rustc: Tweak funclet cleanups of ffi functions", which seems ... at least in the neighborhood.

@nagisa
Member

nagisa commented Apr 12, 2018 via email

@pnkfelix
Member

We may need to appoint someone to be the official "LTO debugging" expert based on the number of bugs we currently have that have sort of stalled at the point of determining "oh, this appears to be injected by [Thin]LTO."

@nikomatsakis
Contributor

triage: P-medium

Next steps are to diagnose the LLVM problem. Filing under #50422.

@nikomatsakis nikomatsakis added P-medium Medium priority and removed P-high High priority labels May 3, 2018
@steveklabnik
Member

Triage: looks like the energy to reproduce this one kinda petered out. Can anyone reproduce this today?

@pnkfelix pnkfelix self-assigned this Oct 3, 2019
@Elinvynia Elinvynia added the I-prioritize Issue: Indicates that prioritization has been requested for this issue. label Jun 9, 2020
@LeSeulArtichaut LeSeulArtichaut removed the I-prioritize Issue: Indicates that prioritization has been requested for this issue. label Jun 9, 2020
@steveklabnik
Member

One year later, with no reproduction instructions, I'm going to give this one a close. If anyone can make a reproduction, please let me know and we can re-open, thanks!
