Bug 6769 - Linux 2.6.18-rc4 Fedora Core 5 x86_64 "init[1] trap divide error" -- it's md/dm, see results of git-bisect in the ticket
Status: REJECTED INSUFFICIENT_DATA
Alias: None
Product: IO/Storage
Classification: Unclassified
Component: MD
Hardware: i386 Linux
Importance: P2 blocking
Assignee: io_md
URL:
Keywords:
Depends on:
Blocks:
 
Reported: 2006-06-29 14:07 UTC by Scott Doty
Modified: 2008-09-22 16:43 UTC

See Also:
Kernel Version: 2.6.18-rc4
Subsystem:
Regression: ---
Bisected commit-id:


Attachments
Link to directory of files regarding the system (156 bytes, text/html)
2006-06-29 14:08 UTC, Scott Doty

Description Scott Doty 2006-06-29 14:07:01 UTC
Most recent kernel where this bug did not occur: 2.6.16.22
Distribution: Fedora Core 5 x86_64
Hardware Environment: x86_64 Tyan Tiger mobo w/two dual-Opteron CPUs, 4GB
Software Environment: kernel starting init -- compiler both FC5 and gcc-3.4.4
Problem Description: init[1] trap divide error rip: 4296d17 rsp:7fff24a2fe10

Steps to reproduce:
Boot vanilla 2.6.17 or 2.6.17.1 on AMD Opteron system
See http://scott.user.sonic.net/linux/2.6.17.1/ for all the nitty-gritty
details (gcc -v's, lspci, ver_linux output, etc...)
Comment 1 Scott Doty 2006-06-29 14:08:56 UTC
Created attachment 8459
Link to directory of files regarding the system
Comment 2 Andi Kleen 2006-06-29 14:29:52 UTC
Does some other program work? E.g., can you boot with init=/bin/sash?
And when you go back to the older kernel does the userland work again?

Also, to be absolutely sure, save your .config, do make distclean and compile
again. Kernels sometimes get miscompiled in weird ways.
Comment 3 Scott Doty 2006-06-30 05:06:31 UTC
Three things:

a) the "rip" in my previous note was "4296d7" (no "1" in it)

b) booting with init=/bin/bash changes the rsp: number, but otherwise no joy

c) booting 2.6.16.22 (or other kernels that have worked) works just fine, and
userland is happy.

Also, I did an fgrep -R for the error, which implicated
arch/x86_64/kernel/traps.c -- and it shows substantial changes (or so says diff)

Also, I unpacked the source for 2.6.17 and 2.6.17.1, loaded my config, and ran
my "make" on that, and I think it unlikely for compilation to fail twice. 
Nevertheless, I'll try "make distclean" and go from there.


Comment 4 Scott Doty 2006-06-30 14:33:07 UTC
No joy with the "distclean" -- also, google shows someone else has run across
this...

I guess my next move would be to back out the changes to traps.c until it works....?
Comment 5 Andi Kleen 2006-07-01 07:19:23 UTC
I don't know what it could be. The traps changes don't change division
by zero handling. I also test booted your configuration and it worked.

One thing you could check: if you objdump -S your init, is there really
a DIV or IDIV instruction at 4296d7?
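
A minimal sketch of that check, assuming GNU objdump from binutils and assuming the faulting address lies in the init binary's own text rather than in a shared library. The helper name is_div_at is hypothetical; it only picks the line for one address out of objdump -d output and tests whether the mnemonic is DIV or IDIV.

```shell
# Hypothetical helper: reads `objdump -d` output on stdin and succeeds
# if the instruction at hex address $1 (no 0x prefix) is DIV or IDIV.
# SSE/x87 divides (divsd, fdiv, ...) are deliberately not matched.
is_div_at() {
    grep -E "^ *$1:" | grep -Eq '[[:space:]]i?div[bwlq]?[[:space:]]'
}

# Intended use (the 16-byte window is an arbitrary choice):
#   objdump -d --start-address=0x4296d7 --stop-address=0x4296e7 /sbin/init \
#       | is_div_at 4296d7 && echo "DIV/IDIV at 4296d7"
```

On x86 a divide error is reported as a fault, so the rip in the message should be the address of the dividing instruction itself, not the one after it.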

If all else fails, you can do a full binary search between
.16 and .17 using git bisect.
Just reverting traps alone will probably not compile or boot.
Or revert all of the x86-64 changes and then bisect them individually
(but that might also not compile without tweaks).
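
The mechanics of such a binary search can be tried on a throwaway repository first. The sketch below is a hypothetical demo, not the kernel procedure itself: the repository, the file names and the "bug" (a zero written to a file) are invented for illustration. For the kernel, the endpoints would be the .16 and .17 release tags and each test step would be a build plus a boot attempt.

```shell
# Hypothetical demo repository: ten commits, the "bug" appears in commit 7.
repo=$(mktemp -d)
git init -q "$repo"
for i in 1 2 3 4 5 6 7 8 9 10; do
    if [ "$i" -ge 7 ]; then echo 0 > "$repo/divisor"; else echo 1 > "$repo/divisor"; fi
    echo "$i" > "$repo/counter"
    git -C "$repo" add divisor counter
    git -C "$repo" -c user.name=demo -c user.email=demo@example.invalid \
        commit -q -m "commit $i"
done

# Arguments to `bisect start` are the known-bad commit, then the known-good one.
git -C "$repo" bisect start HEAD HEAD~9 > /dev/null
# The test command exits 0 for "good" and 1 for "bad"; an exit status of
# 125 would tell bisect to skip a commit that cannot be tested at all.
git -C "$repo" bisect run sh -c '[ "$(cat divisor)" -ne 0 ]' > /dev/null
first_bad=$(git -C "$repo" log -1 --format=%s refs/bisect/bad)
git -C "$repo" bisect reset > /dev/null 2>&1
echo "first bad: $first_bad"
```

When the test cannot be scripted, as with a kernel that has to be booted by hand, the same search is driven manually by running git bisect good or git bisect bad after each attempt.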
Comment 6 Scott Doty 2006-08-16 18:41:13 UTC
I will do the bisecting immediately, as this bug is present in 2.6.18-rc4.
The latest "production" kernel that will boot with it is 2.6.16.27.  2.6.17 does
not work.

No joy with the objdump.

I'll try to do the binary search right now, first starting with the release
candidates...

 -Scott
p.s. my new email address for kernel work is <kernel@ponzo.net>, but
<cr-kernel@sonic.net> will still continue to work.

Comment 7 Scott Doty 2006-08-17 00:17:31 UTC
My git-bisection is down to the neighborhood.  I'm at:

bisect/bad c5a10f62c5c496c49db749af103b991873b7e2dc

bisect/current 41dc636b0475582e48584340b774bd1e90d40d9

bisect/good 7e51f257e87297a5b6fe6d136a8ef67206aaf3a8

However, the bisect/current kernel doesn't compile clean:

   fs/block_dev.c:726: error: 
Comment 8 Scott Doty 2006-08-17 08:05:43 UTC
Well, I'm so tired now that I'm not sure if this is the problem or not, but:
take a look at patches:



6a4d44c1f1108d6c9e8850e8cf166aaba0e56eae
and
100873687d81d4ce7b1299b447d33e87ba1e9583

There seems to be a discrepancy that would show up if CONFIG_SYSFS isn't enabled...

3ac51e741a46af7a20f55e79d3e3aeaa93c6c544 is definitely "good", and
100873687d81d4ce7b1299b447d33e87ba1e9583 is definitely "bad".  I'll track this
down some more tomorrow; there is one patch in between these that I haven't
checked yet (6a4d44c1f1108d6c9e8850e8cf166aaba0e56eae).

(Thank goodness for gitk/git-bisect visualize!)

 -Scott
Comment 9 Scott Doty 2006-08-17 09:50:17 UTC
6a4d44c1f1108d6c9e8850e8cf166aaba0e56eae is "good"

so something about

100873687d81d4ce7b1299b447d33e87ba1e9583

...is making my system choke.

Add to this the fact that my lvm setup isn't set up right, and maybe is tickling
a bug.

I'll try reverting 100873687d81d4ce7b1299b447d33e87ba1e9583 I guess...
Comment 10 Scott Doty 2006-08-17 09:55:29 UTC
_ _ _ _ _ _ _
_[/home/scott/wk]_(scott@frenzy)_
$ git branch
  bisect
* exp
  master
  origin

_[/home/scott/wk]_(scott@frenzy)_
$ git-revert --no-edit 100873687d81d4ce7b1299b447d33e87ba1e9583
First trying simple merge strategy to revert.
Simple revert fails; trying Automatic revert.
Auto-merging fs/partitions/check.c
merge: warning: conflicts during merge
ERROR: Merge conflict in fs/partitions/check.c
Auto-merging include/linux/genhd.h
fatal: merge program failed
Automatic revert failed.  After resolving the conflicts,
mark the corrected paths with 'git-update-index <paths>'
and commit with 'git commit -F .msg'
 _ _ _ _ _ _ _

Alas, I'm quite unfamiliar with git -- what would be the next step to unwind
that patch?

 -Scott
Comment 11 Andi Kleen 2006-08-17 10:01:31 UTC
I don't know either. I'm not a git user.

What I would do is get the respective patch from gitweb
(http://www.kernel.org/git/linus) and then revert it manually
with patch -R 

Also make sure you retest with it applied again. In my experience,
complex binary searches sometimes go wrong.
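
A small sketch of what patch -R does, using invented files in a temporary directory rather than the kernel tree. For the real case the diff would be the one saved from gitweb, reversed from the top of the source tree with something like patch -p1 -R; the -p1 strip level is an assumption about how that diff's paths are written.

```shell
# Throwaway demo of reversing and then re-applying a unified diff.
work=$(mktemp -d)
printf 'a\nb\nc\n' > "$work/old.txt"
printf 'a\nB\nc\n' > "$work/new.txt"

# diff exits 1 when the files differ, which is expected here.
diff -u "$work/old.txt" "$work/new.txt" > "$work/change.patch" || true

# Start from the patched state, as a tree containing the suspect change would be.
cp "$work/new.txt" "$work/file.txt"

reverted=no
reapplied=no

# Back the change out ...
patch -R "$work/file.txt" < "$work/change.patch" > /dev/null
cmp -s "$work/file.txt" "$work/old.txt" && reverted=yes

# ... and put it back, so the failure can be retested with it applied again.
patch "$work/file.txt" < "$work/change.patch" > /dev/null
cmp -s "$work/file.txt" "$work/new.txt" && reapplied=yes

echo "reverted=$reverted reapplied=$reapplied"
```

If the reverse application reports rejected hunks, later commits have touched the same lines, which is the same situation as the merge conflict in fs/partitions/check.c in comment 10.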
Comment 12 Scott Doty 2006-08-17 10:13:55 UTC
This is reported to be a Fedora bug here:

   http://www.fedoraforum.org/forum/showthread.php?t=115592&page=2

There does appear to be an md mirror volume set up on this system, which
I've been meaning to track down... (the guy who set it up was probably
trying to get the nv_sata raid working...) I have this in my rc.local:

/sbin/dmsetup remove_all
/sbin/partprobe
/bin/mount -v /v
/bin/mount -v /w

Which is an ugly workaround.

I am (clearly) using a non-fedora kernel though, yet I'm somehow tickling
this bug...I'm going to make sure there's no md raid configured, and try
again...

 -Scott
Comment 13 Scott Doty 2006-08-17 11:25:09 UTC
I disabled the entire md/dm subsystem in "make xconfig" -- now I can boot both
2.6.18-rc4 and 2.6.17.8 (which I'm currently running).

I'll try to get the info on my dm setup for bug tracking purposes...
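
For the record, the switches involved can be read back from the saved .config. A minimal sketch, assuming the option names of that era: CONFIG_MD for the top-level "Multi-device support (RAID and LVM)" switch, CONFIG_BLK_DEV_MD for RAID and CONFIG_BLK_DEV_DM for device mapper. Which of these were actually turned off is not stated in the comment, and the function name md_dm_status is hypothetical.

```shell
# Hypothetical helper: print the md/dm switches recorded in a kernel .config.
# Matches both "CONFIG_X=y" and "# CONFIG_X is not set" lines, and nothing
# that merely starts with the same prefix (e.g. CONFIG_MD_RAID1).
md_dm_status() {
    grep -E '^(# )?CONFIG_(MD|BLK_DEV_MD|BLK_DEV_DM)[= ]' "$1"
}

# Usage: md_dm_status path/to/.config
```

When the top-level switch is off, the dependent options normally do not appear in .config at all, so a single "# CONFIG_MD is not set" line would be the expected output.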

Comment 14 Andi Kleen 2006-08-17 11:38:16 UTC
The MD/DM maintainers should look at it, then.

Assigning it to MD for now; feel free to reassign it if it turns out to be DM.
Comment 15 Neil Brown 2006-08-17 17:54:39 UTC
That patch you identified is essentially a no-op.
It just removes some "#ifdef CONFIG_SYSFS" but as you have
CONFIG_SYSFS=y, that won't affect the generated code at all.


Do you have a stack trace at all of where the divide is happening?
Even a digital photo would do.
Comment 16 Anonymous Emailer 2006-08-17 18:33:27 UTC
Reply-To: scott@ponzo.net

On Thu, Aug 17, 2006 at 06:01:02PM -0700, bugme-daemon@bugzilla.kernel.org wrote:
> http://bugzilla.kernel.org/show_bug.cgi?id=6769
> 
> ------- Additional Comments From neilb@suse.de  2006-08-17 17:54 -------
> That patch you identified is essentially a no-op.
> It just removes some "#ifdef CONFIG_SYSFS" but as you have
> CONFIG_SYSFS=y, that won't affect the generated code at all.
> 
> 
> Do you have a stack trace at all of where the divide is happening?
> Even a digital photo would do.

Alas, no, this happens during the boot (right when control gets passed to
init).

Maybe I said "git-bisect bad" when I meant "good" (or vice versa) toward the
end there, I was getting pretty punchy.  (12 or 13 compile/reboot cycles,
iirc...)

_Just after_ the sysfs stuff, there are several major changes to dm/md, so I
would suspect the bug is in there...I can try bisecting that area, but I gotta
take a break from computers for a few hours first...

I also owe you some info about the raid that I'd not been using:

_[/etc]_(root@frenzy)_
# dmraid -r
/dev/sda: nvidia, "nvidia_daaejdec", mirror, ok, 781422766 sectors, data@ 0
/dev/sdb: nvidia, "nvidia_daaejdec", mirror, ok, 781422766 sectors, data@ 0

_[/etc]_(root@frenzy)_
# dmraid -v -s
*** Set
name   : nvidia_daaejdec
size   : 781422766
stride : 128
type   : mirror
status : ok
subsets: 0
devs   : 2
spares : 0

BTW, the onboard raid controller is disabled.

 -Scott

Comment 17 Natalie Protasevich 2008-03-25 21:53:51 UTC
Any update on this problem? Thanks.
