Bug 13012 - 2.6.28.9 causes init to segfault on Debian etch; 2.6.28.8 OK
Summary: 2.6.28.9 causes init to segfault on Debian etch; 2.6.28.8 OK
Status: CLOSED CODE_FIX
Alias: None
Product: Other
Classification: Unclassified
Component: Other (show other bugs)
Hardware: All Linux
: P1 high
Assignee: other_other
URL:
Keywords:
Depends on:
Blocks:
 
Reported: 2009-04-05 13:04 UTC by Barry K. Nathan
Modified: 2010-01-25 15:00 UTC (History)
3 users (show)

See Also:
Kernel Version: 2.6.28.9
Tree: Mainline
Regression: Yes


Attachments
.config from 2.6.28.9 (same as on 2.6.28.8) (47.03 KB, text/plain)
2009-04-05 14:07 UTC, Barry K. Nathan
Details
kernel config: 2.6.27.27 - hangs when compiled with gcc-4.2 (40.03 KB, text/plain)
2009-07-20 11:31 UTC, Krzysztof Oledzki
Details

Description Barry K. Nathan 2009-04-05 13:04:01 UTC
(I haven't had a chance to collect all that data that I wanted to collect before reporting this bug, but this might be the best I can do for the next few days, so I think it's best for me to submit what I have so far instead of sitting on it.)

The system in question is a Debian etch system which has a static /dev (no udev) and does not use an initrd or initramfs. With 2.6.28.8, it works fine. With 2.6.28.9, init segfaults during boot, so the system fails to finish booting. (Same problem happens on 2.6.27.21 and does not happen on 2.6.27.20, but I have not had time to investigate that version in depth.)

git bisect says "92db6956ecd01ceb7934be0252b3b184a82ebb64 is first bad commit". That's "Add '-fwrapv' to gcc CFLAGS". If I take 2.6.28.9 and I revert "Move cc-option to below arch-specific setup" and "Add '-fwrapv' to gcc CFLAGS" then the resulting kernel works again.

I also noticed that the problem only happens with some gcc's:

Problem occurs:
gcc (GCC) 4.1.2 20061115 (prerelease) (Debian 4.1.1-21)
gcc-4.1 (GCC) 4.1.3 20080704 (prerelease) (Debian 4.1.2-25)

Problem does not occur (i.e. 2.6.28.9 works and I don't have to revert anything):
gcc-4.2 (GCC) 4.2.4 (Debian 4.2.4-6)
gcc (Debian 4.3.2-1.1) 4.3.2

(BTW, on 2.6.29, I'm having a problem where if I'm running 2.6.29 then make instantly blows up in a strange way, and the same pattern of working/failing compilers is showing up there too.)

I have not yet had a chance to try vanilla gcc 4.1.2. That's what I intend to try next.

I'll attach my .config within the next 24 hours, and I'll try to post serial console captures by sometime Tuesday or so. If there's any other information which would help, feel free to ask...


My gut feeling is that this is truly a gcc bug rather than a kernel bug, but it's still disconcerting to see this kind of breakage suddenly pop up in a -stable series, so IMO it may be worth considering reverting the -fwrapv changes for 2.6.2[78].y.
Comment 1 Barry K. Nathan 2009-04-05 14:07:13 UTC
Created attachment 20816 [details]
.config from 2.6.28.9 (same as on 2.6.28.8)
Comment 2 Barry K. Nathan 2009-04-05 14:22:33 UTC
It looks like vanilla gcc 4.1.2 doesn't work either:
gcc (GCC) 4.1.2

So, on my Debian etch system anyway, the -fwrapv change means I can no longer compile 2.6.28.y with gcc 4.1.x, for *all* values of x, or it won't boot. Hmmm...
Comment 3 Barry K. Nathan 2009-04-05 21:07:19 UTC
I tried disabling the following two options (which were enabled in the .config I attached) but it made no difference (with gcc 4.1.2), the problem still occurred:

CONFIG_CC_OPTIMIZE_FOR_SIZE
CONFIG_OPTIMIZE_INLINING

BTW, root filesystem is reiserfs (3.6) in case that is at all relevant.
Comment 4 Andrew Morton 2009-04-09 21:41:04 UTC
(switched to email.  Please respond via emailed reply-to-all, not via the
bugzilla web interface).

-fwrapv killed Barry's gcc-4.1.2-compiled kernel in 2.6.27.x, 2.6.28.x
and presumably 2.6.29, 2.6.30.


On Sun, 5 Apr 2009 13:04:02 GMT
bugzilla-daemon@bugzilla.kernel.org wrote:

> http://bugzilla.kernel.org/show_bug.cgi?id=13012
> 
>            Summary: 2.6.28.9 causes init to segfault on Debian etch;
>                     2.6.28.8 OK
>            Product: Other
>            Version: 2.5
>     Kernel Version: 2.6.28.9
>           Platform: All
>         OS/Version: Linux
>               Tree: Mainline
>             Status: NEW
>           Severity: high
>           Priority: P1
>          Component: Other
>         AssignedTo: other_other@kernel-bugs.osdl.org
>         ReportedBy: barryn@pobox.com
>         Regression: Yes
> 
> 
> (I haven't had a chance to collect all that data that I wanted to collect
> before reporting this bug, but this might be the best I can do for the next
> few
> days, so I think it's best for me to submit what I have so far instead of
> sitting on it.)
> 
> The system in question is a Debian etch system which has a static /dev (no
> udev) and does not use an initrd or initramfs. With 2.6.28.8, it works fine.
> With 2.6.28.9, init segfaults during boot, so the system fails to finish
> booting. (Same problem happens on 2.6.27.21 and does not happen on 2.6.27.20,
> but I have not had time to investigate that version in depth.)
> 
> git bisect says "92db6956ecd01ceb7934be0252b3b184a82ebb64 is first bad
> commit".
> That's "Add '-fwrapv' to gcc CFLAGS". If I take 2.6.28.9 and I revert "Move
> cc-option to below arch-specific setup" and "Add '-fwrapv' to gcc CFLAGS"
> then
> the resulting kernel works again.
> 
> I also noticed that the problem only happens with some gcc's:
> 
> Problem occurs:
> gcc (GCC) 4.1.2 20061115 (prerelease) (Debian 4.1.1-21)
> gcc-4.1 (GCC) 4.1.3 20080704 (prerelease) (Debian 4.1.2-25)
> 
> Problem does not occur (i.e. 2.6.28.9 works and I don't have to revert
> anything):
> gcc-4.2 (GCC) 4.2.4 (Debian 4.2.4-6)
> gcc (Debian 4.3.2-1.1) 4.3.2
> 
> (BTW, on 2.6.29, I'm having a problem where if I'm running 2.6.29 then make
> instantly blows up in a strange way, and the same pattern of working/failing
> compilers is showing up there too.)
> 
> I have not yet had a chance to try vanilla gcc 4.1.2. That's what I intend to
> try next.
> 
> I'll attach my .config within the next 24 hours, and I'll try to post serial
> console captures by sometime Tuesday or so. If there's any other information
> which would help, feel free to ask...
> 
> 
> My gut feeling is that this is truly a gcc bug rather than a kernel bug, but
> it's still disconcerting to see this kind of breakage suddenly pop up in a
> -stable series, so IMO it may be worth considering reverting the -fwrapv
> changes for 2.6.2[78].y.
>
Comment 5 Linus Torvalds 2009-04-09 21:55:44 UTC
On Thu, 9 Apr 2009, Andrew Morton wrote:
> 
> -fwrapv killed Barry's gcc-4.1.2-compiled kernel in 2.6.27.x, 2.6.28.x
> and presumably 2.6.29, 2.6.30.

Auughh. I hate compiler bugs. They're horrible to debug.

I _think_ 'fwrapv' only really matters with gcc-4.3, so maybe we could 
just enable it for new versions.

HOWEVER, I also wonder if we could instead of "-fwrapv" use 
"-fno-strict-overflow". They are apparently subtly different, and maybe 
the bug literally only happens with -fwrapv.

Barry, can you see if that simple "replace -fwrapv with 
-fno-strict-overflow" works for you?

Or just go with Barry's helpful debugging:

> > I also noticed that the problem only happens with some gcc's:
> > 
> > Problem occurs:
> > gcc (GCC) 4.1.2 20061115 (prerelease) (Debian 4.1.1-21)
> > gcc-4.1 (GCC) 4.1.3 20080704 (prerelease) (Debian 4.1.2-25)
> > 
> > Problem does not occur (i.e. 2.6.28.9 works and I don't have to revert
> > anything):
> > gcc-4.2 (GCC) 4.2.4 (Debian 4.2.4-6)
> > gcc (Debian 4.3.2-1.1) 4.3.2

and consider 4.2 to be the point where it's ok.

Do we have some gcc developer who 
 (a) knows what the rules are
and
 (b) might even help us figure out where the bug occurs?

		Linus
Comment 6 Frans Pop 2009-07-10 07:28:11 UTC
On Thu, 9 Apr 2009, Linus Torvalds wrote:
> On Thu, 9 Apr 2009, Andrew Morton wrote:
> > -fwrapv killed Barry's gcc-4.1.2-compiled kernel in 2.6.27.x,
> > 2.6.28.x and presumably 2.6.29, 2.6.30.
>
> Auughh. I hate compiler bugs. They're horrible to debug.
>
> I _think_ 'fwrapv' only really matters with gcc-4.3, so maybe we could
> just enable it for new versions.
>
> HOWEVER, I also wonder if we could instead of "-fwrapv" use
> "-fno-strict-overflow". They are apparently subtly different, and maybe
> the bug literally only happens with -fwrapv.
>
> Barry, can you see if that simple "replace -fwrapv with
> -fno-strict-overflow" works for you?
>
> Or just go with Barry's helpful debugging:
> > > I also noticed that the problem only happens with some gcc's:
> > >
> > > Problem occurs:
> > > gcc (GCC) 4.1.2 20061115 (prerelease) (Debian 4.1.1-21)
> > > gcc-4.1 (GCC) 4.1.3 20080704 (prerelease) (Debian 4.1.2-25)
> > >
> > > Problem does not occur (i.e. 2.6.28.9 works and I don't have to
> > > revert anything):
> > > gcc-4.2 (GCC) 4.2.4 (Debian 4.2.4-6)
> > > gcc (Debian 4.3.2-1.1) 4.3.2
>
> and consider 4.2 to be the point where it's ok.
>
> Do we have some gcc developer who
>  (a) knows what the rules are
> and
>  (b) might even help us figure out where the bug occurs?

The discussion on issue looks to have died, but it has bitten Debian 
stable ("Lenny") [1] as it causes init to die on s390 after a kernel 
update.

Here's a possible patch. The exact gcc version to check for is still a bit 
open I guess. For the s390 issue I've confirmed that 4.2.4 is OK, but for 
safety and because of Andrew's comment above I've set the test for 4.3 in 
the patch.

Cheers,
FJP

[1] http://bugs.debian.org/536354

---
From: Frans Pop <elendil@planet.nl>
Subject: Only add '-fwrapv' to gcc CFLAGS for gcc 4.3 and later

This flag has been shown to cause init to segfault for kernels
compiled with gcc-4.1. gcc version 4.2.4 has been shown to be OK,
but as there is some uncertainty the flag is only added for 4.3
and later.

This fixes http://bugzilla.kernel.org/show_bug.cgi?id=13012.

Reported-by: Barry K. Nathan <barryn@pobox.com>
Signed-off-by: Frans Pop <elendil@planet.nl>

diff --git a/Makefile b/Makefile
index 0aeec59..2f8756e 100644
--- a/Makefile
+++ b/Makefile
@@ -565,7 +565,8 @@ KBUILD_CFLAGS += $(call 
cc-option,-Wdeclaration-after-statement,)
 KBUILD_CFLAGS += $(call cc-option,-Wno-pointer-sign,)
 
 # disable invalid "can't wrap" optimizations for signed / pointers
-KBUILD_CFLAGS	+= $(call cc-option,-fwrapv)
+KBUILD_CFLAGS  += $(shell if [ $(call cc-version) -ge 0430 ]; then \
+		    echo $(call cc-option,-fwrapv); fi ;)
 
 # revert to pre-gcc-4.4 behaviour of .eh_frame
 KBUILD_CFLAGS	+= $(call cc-option,-fno-dwarf2-cfi-asm)
Comment 7 Frans Pop 2009-07-10 14:59:38 UTC
On Friday 10 July 2009, Frans Pop wrote:
> On Thu, 9 Apr 2009, Linus Torvalds wrote:
> > On Thu, 9 Apr 2009, Andrew Morton wrote:
> > > -fwrapv killed Barry's gcc-4.1.2-compiled kernel in 2.6.27.x,
> > > 2.6.28.x and presumably 2.6.29, 2.6.30.
> >
> > Auughh. I hate compiler bugs. They're horrible to debug.
> >
> > I _think_ 'fwrapv' only really matters with gcc-4.3, so maybe we
> > could just enable it for new versions.
> >
> > HOWEVER, I also wonder if we could instead of "-fwrapv" use
> > "-fno-strict-overflow". They are apparently subtly different, and
> > maybe the bug literally only happens with -fwrapv.
> >
> > Barry, can you see if that simple "replace -fwrapv with
> > -fno-strict-overflow" works for you?

Prompted by the same suggestion from Ben Hutchings I checked this too, 
but -fno-strict-overflow was only introduced in gcc 4.2.
So using it instead of -fwrapv *would* fix the problem for gcc 4.1, but 
*only* because it would effectively do the same as the patch I proposed: 
not add an option at all for gcc 4.1.

So that change seems illogical unless there are other reasons to 
prefer -fno-strict-overflow over -fwrapv (well, it would avoid the
gcc version check).

It does however make it somewhat more logical to change the test in my 
proposed patch to also allow -fwrapv for gcc 4.2.

Cheers,
FJP
Comment 8 Frans Pop 2009-07-10 20:44:46 UTC
On Friday 10 July 2009, Frans Pop wrote:
> The discussion on issue looks to have died, but it has bitten Debian
> stable ("Lenny") [1] as it causes init to die on s390 after a kernel
> update.
>
> Here's a possible patch. The exact gcc version to check for is still a
> bit open I guess. For the s390 issue I've confirmed that 4.2.4 is OK,
> but for safety and because of Andrew's comment above I've set the test
> for 4.3 in the patch.

Here's an updated patch as I found the gcc version check was incorrect 
(0430 should have been 0403; sorry).

I've now changed the check to allow -fwrapv for gcc 4.2 as that has been 
shown to work and because of the consideration mentioned in my previous 
mail.

---
From: Frans Pop <elendil@planet.nl>
Subject: Only add '-fwrapv' to gcc CFLAGS for gcc 4.2 and later

This flag has been shown to cause init to segfault for kernels
compiled with gcc-4.1. gcc version 4.2.4 has been shown to be OK.

This fixes http://bugzilla.kernel.org/show_bug.cgi?id=13012.

Reported-by: Barry K. Nathan <barryn@pobox.com>
Signed-off-by: Frans Pop <elendil@planet.nl>

diff --git a/Makefile b/Makefile
index 0aeec59..2519fde 100644
--- a/Makefile
+++ b/Makefile
@@ -565,7 +565,8 @@ KBUILD_CFLAGS += $(call 
cc-option,-Wdeclaration-after-statement,)
 KBUILD_CFLAGS += $(call cc-option,-Wno-pointer-sign,)
 
 # disable invalid "can't wrap" optimizations for signed / pointers
-KBUILD_CFLAGS	+= $(call cc-option,-fwrapv)
+KBUILD_CFLAGS  += $(shell if [ $(call cc-version) -ge 0402 ]; then \
+		    echo $(call cc-option,-fwrapv); fi ;)
 
 # revert to pre-gcc-4.4 behaviour of .eh_frame
 KBUILD_CFLAGS	+= $(call cc-option,-fno-dwarf2-cfi-asm)
Comment 9 Linus Torvalds 2009-07-12 17:58:25 UTC
On Fri, 10 Jul 2009, Frans Pop wrote:
> 
> Prompted by the same suggestion from Ben Hutchings I checked this too, 
> but -fno-strict-overflow was only introduced in gcc 4.2.
> So using it instead of -fwrapv *would* fix the problem for gcc 4.1, but 
> *only* because it would effectively do the same as the patch I proposed: 
> not add an option at all for gcc 4.1.
> 
> So that change seems illogical unless there are other reasons to 
> prefer -fno-strict-overflow over -fwrapv (well, it would avoid the
> gcc version check).
>
> It does however make it somewhat more logical to change the test in my 
> proposed patch to also allow -fwrapv for gcc 4.2.

Hmm. It all really makes me suspect that we should really be using
-fno-strict-overflow instead.

That not only apparently avoids the unnecessary gcc version check (by 
virtue of the option only existing in compilers that don't have the 
problem), but qutie frankly, one of the core people involved with the 
whole thing (Ian Lance Taylor) seems to think it's the better option.

See for example

	http://www.airs.com/blog/archives/120

on how gcc actually generates better code with -fno-strict-overflow.

I added Ian to the cc.

Ian: we generally do try to be careful and use explicit unsigned types for 
code that cares about overflow, but we use -fwrapv because there have been 
some cases where we didn't (and used pointer comparisons or signed 
integers).

The problem is that apparently gcc-4.1.x was literally generating buggy 
code with -fwrapv. So now the choice for us is between switching to an 
explicit version test:

	-KBUILD_CFLAGS  += $(call cc-option,-fwrapv)
	+KBUILD_CFLAGS  += $(shell if [ $(call cc-version) -ge 0402 ]; then \
	+                   echo $(call cc-option,-fwrapv); fi ;)

or just switching to -fno-strict-overflow instead:

	-KBUILD_CFLAGS  += $(call cc-option,-fwrapv)
	+KBUILD_CFLAGS  += $(call cc-option,-fno-strict-overflow)

which avoids the buggy gcc versions because it's simply not even supported 
by gcc-4.1.x (and even if that wasn't the case, possibly because only 
'wrapv' is the problematic case - apparently the difference _does_ 
matter to gcc).

From everything I have been able to find, I really prefer the second 
version. Not only is the patch cleaner, but it looks like code generation 
is better too (for some inexplicable reason, but I suspect it's because 
-fno-strict-overflow is just saner).

		Linus
Comment 10 Linus Torvalds 2009-07-12 18:24:37 UTC
On Sun, 12 Jul 2009, Linus Torvalds wrote:
> 
> From everything I have been able to find, I really prefer the second 
> version. Not only is the patch cleaner, but it looks like code generation 
> is better too (for some inexplicable reason, but I suspect it's because 
> -fno-strict-overflow is just saner).

Hmm. I just checked. The file that caused us to do this thing in the first 
place (fs/open.c, around like 415, which does:

	/* Check for wrap through zero too */
	if (((offset + len) > inode->i_sb->s_maxbytes) || ((offset + len) < 0))
		goto out_fput;

to check that the resulting 'loffset_t' type is all good) has interesting 
behaviour with my version of gcc (gcc version 4.4.0 20090506 (Red Hat 
4.4.0-4) (GCC)).

 - Without any options:
        leaq    (%rcx,%rdx), %rdi       #, tmp73
        movq    256(%rbx), %rsi # <variable>.i_sb, <variable>.i_sb
        movl    $-27, %eax      #, D.29131
        cmpq    40(%rsi), %rdi  # <variable>.s_maxbytes, tmp73
        ja      .L148   #,

 - With -fno-strict-overflow:
        leaq    (%rcx,%rdx), %rax       #, D.29157
        movq    256(%rbx), %rsi # <variable>.i_sb, <variable>.i_sb
        cmpq    40(%rsi), %rax  # <variable>.s_maxbytes, D.29157
        ja      .L154   #,
        testq   %rax, %rax      # D.29157
        js      .L154   #,

 - With -fwrapv:
        leaq    (%rcx,%rdx), %rax       #, D.29158
        movq    256(%rbx), %rsi # <variable>.i_sb, <variable>.i_sb
        cmpq    40(%rsi), %rax  # <variable>.s_maxbytes, D.29158
        ja      .L154   #,
        testq   %rax, %rax      # D.29158
        js      .L154   #,

and from this it would look like:

 - gcc on its own is actually the best version (the first comparison is 
   unsigned because s_maxbytes is actually 'unsigned long long', so it 
   actually does the right thing!)

   In other words, the whole '< 0' was unnecessary, but does make the 
   source code way more readable, and makes the source code _correct_ 
   regardless of any type issues!

 - From a cursory inspection, -fno-strict-overflow  and -fwrapv are both 
   equivalent in this code, and both do the stupid thing (but for good 
   reason - gcc doesn't know that 's_maxbytes' might not be 'negative in a 
   loffset_t', so technically speaking the extraneous 'js' is not 
   extraneous, because it can actually trigger some "more negative" 
   entries than s_maxbyes is.

 - HOWEVER:

	[torvalds@nehalem ~]$ git diff --stat open.s open.s-fno-strict-overflow 
	 open.s => open.s-fno-strict-overflow |   22 +++++++++++++---------
	 1 files changed, 13 insertions(+), 9 deletions(-)
	[torvalds@nehalem ~]$ git diff --stat open.s open.s-fwrapv
	 open.s => open.s-fwrapv |  296 ++++++++++++++++++++++++-----------------------
	 1 files changed, 150 insertions(+), 146 deletions(-)

  where the _only_ difference that '-fno-strict-overflow' introduces is 
  that one small area (it's saying 22 lines changed, but that's because 
  there's also the compiler option listing at the top of the file etc)

  In contrast, -fwrapv has done a lot of other changes too. Now, in both 
  cases, it really only added the same four instructions (testq + js +
  branchtarget + jumparound).

It looks like 'fwrapv' generates more temporaries (possibly for the code 
that treies to enforce the exact twos-complement behavior) that then all 
get optimized back out again. The differences seem to be in the temporary 
variable numbers etc, not in the actual code.

So fwrapv really _is_ different from fno-strict-pverflow, and disturbs the 
code generation more.

IOW, I'm convinced we should never use fwrapv. It's clearly a buggy piece 
of sh*t, as shown by our 4.1.x experiences. We should use 
-fno-strict-overflow.

Will commit the following (which also fits naming-wise with our use of 
'-fno-strict-aliasing').

		Linus

---
 Makefile |    2 +-
 1 files changed, 1 insertions(+), 1 deletions(-)

diff --git a/Makefile b/Makefile
index 0aeec59..bbe8453 100644
--- a/Makefile
+++ b/Makefile
@@ -565,7 +565,7 @@ KBUILD_CFLAGS += $(call cc-option,-Wdeclaration-after-statement,)
 KBUILD_CFLAGS += $(call cc-option,-Wno-pointer-sign,)
 
 # disable invalid "can't wrap" optimizations for signed / pointers
-KBUILD_CFLAGS	+= $(call cc-option,-fwrapv)
+KBUILD_CFLAGS	+= $(call cc-option,-fno-strict-overflow)
 
 # revert to pre-gcc-4.4 behaviour of .eh_frame
 KBUILD_CFLAGS	+= $(call cc-option,-fno-dwarf2-cfi-asm)
Comment 11 Anonymous Emailer 2009-07-13 05:30:07 UTC
Reply-To: iant@google.com

Linus Torvalds <torvalds@linux-foundation.org> writes:

> Ian: we generally do try to be careful and use explicit unsigned types for 
> code that cares about overflow, but we use -fwrapv because there have been 
> some cases where we didn't (and used pointer comparisons or signed 
> integers).
>
> The problem is that apparently gcc-4.1.x was literally generating buggy 
> code with -fwrapv. So now the choice for us is between switching to an 
> explicit version test:
>
>       -KBUILD_CFLAGS  += $(call cc-option,-fwrapv)
>       +KBUILD_CFLAGS  += $(shell if [ $(call cc-version) -ge 0402 ]; then \
>       +                   echo $(call cc-option,-fwrapv); fi ;)
>
> or just switching to -fno-strict-overflow instead:
>
>       -KBUILD_CFLAGS  += $(call cc-option,-fwrapv)
>       +KBUILD_CFLAGS  += $(call cc-option,-fno-strict-overflow)
>
> which avoids the buggy gcc versions because it's simply not even supported 
> by gcc-4.1.x (and even if that wasn't the case, possibly because only 
> 'wrapv' is the problematic case - apparently the difference _does_ 
> matter to gcc).

My instinctive advice is that y'all should track down and fix the cases
where the program relies on signed overflow being defined.  However, if
that is difficult--and it is--then I agree that -fno-strict-overflow is
preferable when using a compiler which supports it (gcc 4.2.0 and
later).

(The gcc 4.2 and later option -Wstrict-overflow=N can help find the
cases where a program relies on defined signed overflow, but only if
somebody is patient enough to wade through all the false positives.)

Ian
Comment 12 Krzysztof Oledzki 2009-07-20 11:19:09 UTC
OK, this patch is in 2.6.27.27 and now gcc-4.2.4 produces unbootable kernel on both i386 and x86-64 for my Dell PoweEdge 1950 servers.

Reverting this patch:

http://git.kernel.org/?p=linux/kernel/git/stable/linux-2.6-stable.git;a=commitdiff;h=7d317ec29c587f76bbfa375ef3b832204f5bffe5

solves my problem.

The same problem exists if 2.6.28.10 is manually patched with this "fix".

Bootable kernel (2.6.28.10 / 2.6.27.25 / 2.6.27.27 + 7d317ec29c587f76bbfa375ef3b832204f5bffe5 revert):
(...)
radeonfb 0000:0e:0d.0: PCI INT A -> GSI 19 (level, low) -> IRQ 19
radeonfb: Found Intel x86 BIOS ROM Image
radeonfb: Retrieved PLL infos from BIOS
radeonfb: Reference=27.00 MHz (RefDiv=12) Memory=200.00 Mhz, System=200.00 MHz
radeonfb: PLL min 20000 max 50000
Switched to high resolution mode on CPU 2
Switched to high resolution mode on CPU 3
Switched to high resolution mode on CPU 1
Switched to high resolution mode on CPU 0
radeonfb: Monitor 1 type CRT found
radeonfb: EDID probed
radeonfb: Monitor 2 type no found
Console: switching to colour frame buffer device 128x48
radeonfb (0000:0e:0d.0): ATI Radeon 515e "Q^"
(...)


Unbootable kernel (2.6.28.10 + 7d317ec29c587f76bbfa375ef3b832204f5bffe5 / 2.6.27.27):
(...)
radeonfb 0000:0e:0d.0: PCI INT A -> GSI 19 (level, low) -> IRQ 19
radeonfb: Found Intel x86 BIOS ROM Image
radeonfb: Retrieved PLL infos from BIOS
radeonfb: Reference=27.00 MHz (RefDiv=12) Memory=200.00 Mhz, System=200.00 MHz
radeonfb: PLL min 20000 max 50000
radeonfb: Monitor 1 type CRT found
radeonfb: EDID probed
radeonfb: Monitor 2 type no found
(HANG)

So instead of gcc-4.1+wrapv hang we now have gcc-4.2+strict-overflow hang. :(
Comment 13 Krzysztof Oledzki 2009-07-20 11:31:56 UTC
Created attachment 22412 [details]
kernel config: 2.6.27.27 - hangs when compiled with gcc-4.2
Comment 15 Anonymous Emailer 2009-07-25 03:23:54 UTC
Reply-To: davej@redhat.com

On Sun, Jul 12, 2009 at 10:29:50PM -0700, Ian Lance Taylor wrote:

 > (The gcc 4.2 and later option -Wstrict-overflow=N can help find the
 > cases where a program relies on defined signed overflow, but only if
 > somebody is patient enough to wade through all the false positives.)
 
I got curious and wondered just how many warnings this would trigger,
so I did a build with -Wstrict-overflow=5.  I was pleasantly surprised
to see that it isn't that many from an allmodconfig build..

crypto/algboss.c:173: warning: assuming signed overflow does not occur when simplifying conditional to constant
net/ipv4/igmp.c:773: warning: assuming signed overflow does not occur when simplifying conditional to constant
drivers/ata/libata-eh.c:3076: warning: assuming signed overflow does not occur when simplifying conditional to constant
sound/core/seq/seq_queue.c:195: warning: assuming signed overflow does not occur when simplifying conditional to constant
drivers/block/ub.c:2293: warning: assuming signed overflow does not occur when simplifying conditional to constant
sound/pci/es1968.c:1545: warning: assuming signed overflow does not occur when simplifying conditional to constant
sound/pci/es1968.c:1582: warning: assuming signed overflow does not occur when simplifying conditional to constant
fs/ext3/inode.c:658: warning: assuming signed overflow does not occur when simplifying conditional to constant
fs/ext4/inode.c:797: warning: assuming signed overflow does not occur when simplifying conditional to constant
drivers/edac/amd64_edac.c:726: warning: assuming signed overflow does not occur when simplifying conditional to constant
sound/soc/codecs/wm8988.c:665: warning: assuming signed overflow does not occur when simplifying conditional to constant
net/bluetooth/cmtp/core.c:234: warning: assuming signed overflow does not occur when simplifying conditional to constant
net/mac80211/rc80211_minstrel.c:193: warning: assuming signed overflow does not occur when simplifying conditional to constant
drivers/isdn/hardware/eicon/debug.c:419: warning: assuming signed overflow does not occur when simplifying conditional to constant
drivers/uio/uio.c:707: warning: assuming signed overflow does not occur when simplifying conditional to constant
drivers/uio/uio.c:671: warning: assuming signed overflow does not occur when simplifying conditional to constant
drivers/uio/uio.c:642: warning: assuming signed overflow does not occur when simplifying conditional to constant
drivers/net/usb/hso.c:2308: warning: assuming signed overflow does not occur when simplifying conditional to constant
drivers/video/matrox/matroxfb_g450.c:214: warning: assuming signed overflow does not occur when simplifying conditional to constant
drivers/video/matrox/matroxfb_g450.c:160: warning: assuming signed overflow does not occur when simplifying conditional to constant
drivers/video/matrox/matroxfb_maven.c:1125: warning: assuming signed overflow does not occur when simplifying conditional to constant
drivers/video/matrox/matroxfb_maven.c:1044: warning: assuming signed overflow does not occur when simplifying conditional to constant
drivers/net/wireless/zd1211rw/zd_rf_uw2453.c:417: warning: assuming signed overflow does not occur when simplifying conditional to constant

Some of these appear to be obvious cases where we don't care about
the sign (loop counters for eg)

Is it worth fixing these up, with diffs like the below ?

	Dave

diff --git a/fs/ext3/inode.c b/fs/ext3/inode.c
index 5f51fed..3bb95de 100644
--- a/fs/ext3/inode.c
+++ b/fs/ext3/inode.c
@@ -595,7 +595,7 @@ static int ext3_alloc_branch(handle_t *handle, struct inode *inode,
 			int *offsets, Indirect *branch)
 {
 	int blocksize = inode->i_sb->s_blocksize;
-	int i, n = 0;
+	unsigned int i, n = 0;
 	int err = 0;
 	struct buffer_head *bh;
 	int num;
Comment 16 Linus Torvalds 2009-07-25 16:49:44 UTC
On Fri, 24 Jul 2009, Dave Jones wrote:
> 
> Is it worth fixing these up, with diffs like the below ?

Historically, when we fix up bugs like this, it causes more bugs than it 
fixes.

It needs _really_ careful people, and people who really understand how C 
type rules work. Stuff that looks obvious often is not, and basing a large 
part of the patch on a compiler warning is a _very_ weak reason for 
something that is more than just syntactic.

And the problem is, nobody can judge the patch from the diff. So it gets 
absolutely zero review, until the day when somebody notices that the 
unsigned version is a bug.

I'd suggest that the rule should be:

 - if you can use -U30 and show all uses, so that people can actually look 
   at the patch and see what it _causes_ (ie not just the "we change 'i' 
   to 'unsigned'", but also the "this is where 'i' gets used, and 
   'unsigned' is right"), then we can apply it.

 - none of the patches go through a 'trivial' tree, or come from newbies 
   that think this is a good way to get involved.

Is it worth it at that point? I dunno.

		Linus

Note You need to log in before you can comment on or make changes to this bug.