Most recent kernel where this bug did not occur: 2.6.16.22
Distribution: Fedora Core 5 x86_64
Hardware Environment: x86_64 Tyan Tiger mobo w/two dual-Opteron CPUs, 4GB
Software Environment: kernel starting init -- compiler both FC5 and gcc-3.4.4
Problem Description: init[1] trap divide error rip: 4296d17 rsp:7fff24a2fe10
Steps to reproduce: Boot vanilla 2.6.17 or 2.6.17.1 on AMD Opteron system

See http://scott.user.sonic.net/linux/2.6.17.1/ for all the nitty-gritty details (gcc -v's, lspci, ver_linux output, etc...)
Created attachment 8459: Link to directory of files regarding the system
Does some other program work, e.g. can you boot with init=/bin/sash? And when you go back to the older kernel, does the userland work again?

Also, to be absolutely sure, save your .config, do make distclean and compile again. Kernels sometimes get miscompiled in weird ways.
Three things:
a) the "rip" in my previous note was "4296d7" (no "1" in it)
b) booting with init=/bin/bash changes the rsp: number, but otherwise no joy
c) booting 2.6.16.22 (or other kernels that have worked) works just fine, and userland is happy.

Also, I did an fgrep -R for the error, which implicated arch/x86_64/kernel/traps.c -- and it shows substantial changes (or so says diff).

Also, I unpacked the source for 2.6.17 and 2.6.17.1, loaded my config, and ran my "make" on that, and I think it unlikely for compilation to fail twice. Nevertheless, I'll try "make distclean" and go from there.
No joy with the "distclean" -- also, google shows someone else has run across this... I guess my next move would be to back out the changes to traps.c until it works....?
I don't know what it could be. The traps changes don't change division-by-zero handling. I also test-booted your configuration and it worked.

One thing you could check: run objdump -S on your init and see whether there is really a DIV or IDIV instruction at 4296d7.

If all else fails, you can do a full binary search between .16 and .17 using git bisect. Just reverting traps alone will probably not compile or boot. Or revert all of the x86-64 changes and then bisect them individually (but that might also not compile without tweaks).
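The objdump check suggested above can be sketched as a small shell helper. This is only an illustration, not something posted in the thread: the function name has_div_at is made up here, and it assumes the usual GNU objdump line layout of "address: bytes mnemonic operands".

```shell
# has_div_at ADDR
# Reads objdump disassembly on stdin and succeeds if the instruction at
# hex address ADDR is an integer DIV or IDIV (optionally with a b/w/l/q
# size suffix). SSE divides such as divsd are deliberately not matched.
# NOTE: hypothetical helper; the line layout is an assumption about
# GNU objdump output.
has_div_at() {
    addr=$1
    # Keep only the line for ADDR, then look for a div/idiv mnemonic
    # surrounded by whitespace.
    grep -E "^[[:space:]]*0*${addr}:" |
        grep -Eq "[[:space:]]i?div[bwlq]?[[:space:]]"
}
```

It would be used along the lines of "objdump -S /sbin/init | has_div_at 4296d7 && echo found". The path /sbin/init is an assumption; the thread does not say which binary is running as init.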
I will do the bisecting immediately, as this bug is present in 2.6.18-rc4. The latest "production" kernel that will boot with it is 2.6.16.27. 2.6.17 does not work. No joy with the objdump. I'll try to do the binary search right now, first starting with the release candidates... -Scott p.s. my new email address for kernel work is <kernel@ponzo.net>, but <cr-kernel@sonic.net> will still continue to work.
My git-bisection is down to the neighborhood. I'm at:

bisect/bad c5a10f62c5c496c49db749af103b991873b7e2dc
bisect/current 41dc636b0475582e48584340b774bd1e90d40d9
bisect/good 7e51f257e87297a5b6fe6d136a8ef67206aaf3a8

However, the bisect/current kernel doesn't compile clean:

fs/block_dev.c:726: error:
Well, I'm so tired now that I'm not sure if this is the problem or not, but take a look at patches:

6a4d44c1f1108d6c9e8850e8cf166aaba0e56eae
and
100873687d81d4ce7b1299b447d33e87ba1e9583

There seems to be a discrepancy that would show up if CONFIG_FS isn't enabled...

3ac51e741a46af7a20f55e79d3e3aeaa93c6c544 is definitely "good", and 100873687d81d4ce7b1299b447d33e87ba1e9583 is definitely "bad".

I'll track this down some more tomorrow; there is one patch in between these that I haven't checked yet (6a4d44c1f1108d6c9e8850e8cf166aaba0e56eae).

(Thank goodness for gitk/git-bisect visualize!)
-Scott
6a4d44c1f1108d6c9e8850e8cf166aaba0e56eae is "good" so something about 100873687d81d4ce7b1299b447d33e87ba1e9583 ...is making my system choke. Add to this the fact that my lvm setup isn't set up right, and maybe is tickling a bug. I'll try reverting 100873687d81d4ce7b1299b447d33e87ba1e9583 I guess...
_ _ _ _ _ _ _
_[/home/scott/wk]_(scott@frenzy)_
$ git branch
  bisect
* exp
  master
  origin
_[/home/scott/wk]_(scott@frenzy)_
$ git-revert --no-edit 100873687d81d4ce7b1299b447d33e87ba1e9583
First trying simple merge strategy to revert.
Simple revert fails; trying Automatic revert.
Auto-merging fs/partitions/check.c
merge: warning: conflicts during merge
ERROR: Merge conflict in fs/partitions/check.c
Auto-merging include/linux/genhd.h
fatal: merge program failed
Automatic revert failed. After resolving the conflicts,
mark the corrected paths with 'git-update-index <paths>'
and commit with 'git commit -F .msg'
_ _ _ _ _ _ _

Alas, I'm quite unfamiliar with git -- what would be the next step to unwind that patch?
-Scott
I don't know either; I'm not a git user. What I would do is get the respective patch from gitweb (http://www.kernel.org/git/linus) and then revert it manually with patch -R.

Also make sure you retest with it applied again. In my experience complex binary searches sometimes go wrong.
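A minimal sketch of the manual revert described above, assuming the patch has already been saved from gitweb as a unified diff with the usual a/ and b/ path prefixes (hence -p1) and that GNU patch is installed. The function name revert_patch is made up for illustration.

```shell
# revert_patch PATCHFILE
# Run from the top of the kernel source tree. Reverses PATCHFILE, but
# only after a dry run shows that the reverse patch applies cleanly.
# NOTE: hypothetical helper; -p1 assumes a/ and b/ prefixes in the diff.
revert_patch() {
    patchfile=$1
    # Dry run first so a failed revert leaves the tree untouched.
    patch -R -p1 --dry-run < "$patchfile" > /dev/null || return 1
    patch -R -p1 < "$patchfile"
}
```

For example, "cd linux-2.6.17 && revert_patch ../suspect.patch", where both the directory and the file name are placeholders. To retest with the change applied again, the same file can be applied forward with "patch -p1 < ../suspect.patch".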
This is reported to be a Fedora bug here:
http://www.fedoraforum.org/forum/showthread.php?t=115592&page=2

There does appear to be an md mirror volume set up on this system, which I've been meaning to track down... (the guy who set it up was probably trying to get the nv_sata raid working...)

I have this in my rc.local:

/sbin/dmsetup remove_all
/sbin/partprobe
/bin/mount -v /v
/bin/mount -v /w

Which is an ugly workaround. I am (clearly) using a non-Fedora kernel though, yet I'm somehow tickling this bug... I'm going to make sure there's no md raid configured, and try again...
-Scott
I disabled the entire md/dm subsystem in "make xconfig" -- now I can boot both 2.6.18-rc4 and 2.6.17.8 (which I'm currently running). I'll try to get the info on my dm setup for bug tracking purposes...
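To confirm that a given .config really has md/dm switched off, a check like the following could be used. It is a sketch only: the function name md_dm_enabled is invented here, and it looks at just three Kconfig symbols (CONFIG_MD, CONFIG_BLK_DEV_MD, CONFIG_BLK_DEV_DM), so other md/dm sub-options are not covered.

```shell
# md_dm_enabled [CONFIG_FILE]
# Prints every md/dm symbol that is built in (=y) or modular (=m) and
# succeeds if at least one is found. Defaults to ./.config.
# NOTE: hypothetical helper; only three symbols are checked.
md_dm_enabled() {
    config=${1:-.config}
    grep -E '^CONFIG_(MD|BLK_DEV_MD|BLK_DEV_DM)=[ym]$' "$config"
}
```

Running "md_dm_enabled .config || echo 'md/dm is off'" in the build directory would then show whether the kernel about to be built still includes either subsystem.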
The MD/DM maintainers should look at it then. Assigning it to MD for now; feel free to reassign it if it was DM.
That patch you identified is essentially a no-op. It just removes some "#ifdef CONFIG_SYSFS" but as you have CONFIG_SYSFS=y, that won't affect the generated code at all.

Do you have a stack trace at all of where the divide is happening? Even a digital photo would do.
Reply-To: scott@ponzo.net

On Thu, Aug 17, 2006 at 06:01:02PM -0700, bugme-daemon@bugzilla.kernel.org wrote:
> http://bugzilla.kernel.org/show_bug.cgi?id=6769
>
> ------- Additional Comments From neilb@suse.de 2006-08-17 17:54 -------
> That patch you identified is essentially a no-op.
> It just removes some "#ifdef CONFIG_SYSFS" but as you have
> CONFIG_SYSFS=y, that won't affect the generated code at all.
>
> Do you have a stack trace at all of where the divide is happening?
> Even a digital photo would do.

Alas, no, this happens during the boot (right when control gets passed to init).

Maybe I said "git-bisect bad" when I meant "good" (or vice versa) toward the end there, I was getting pretty punchy. (12 or 13 compile/reboot cycles, iirc...)

_Just after_ the sysfs stuff, there are several major changes to dm/md, so I would suspect the bug is in there... I can try bisecting that area, but I gotta take a break from computers for a few hours first...

I also owe you some info about the raid that I'd not been using:

_[/etc]_(root@frenzy)_
# dmraid -r
/dev/sda: nvidia, "nvidia_daaejdec", mirror, ok, 781422766 sectors, data@ 0
/dev/sdb: nvidia, "nvidia_daaejdec", mirror, ok, 781422766 sectors, data@ 0
_[/etc]_(root@frenzy)_
# dmraid -v -s
*** Set
name   : nvidia_daaejdec
size   : 781422766
stride : 128
type   : mirror
status : ok
subsets: 0
devs   : 2
spares : 0

BTW, the onboard raid controller is disabled.
-Scott
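The 12 or 13 compile/reboot cycles mentioned above are roughly what a binary search predicts, since each step about halves the remaining range of commits. The helper below is only a back-of-the-envelope sketch: bisect_steps is a made-up name, and real git bisect works on a commit graph rather than a straight line, so the actual count can differ a little.

```shell
# bisect_steps N
# Prints the number of halvings needed to narrow N candidate commits
# down to one, i.e. the ceiling of log2(N).
# NOTE: hypothetical helper; an estimate, not what git itself computes.
bisect_steps() {
    n=$1
    steps=0
    span=1
    # Double the covered span until it reaches N, counting the doublings.
    while [ "$span" -lt "$n" ]; do
        span=$((span * 2))
        steps=$((steps + 1))
    done
    echo "$steps"
}
```

As an illustration with an arbitrary number, "bisect_steps 5000" prints 13, so any range between 4097 and 8192 commits would take about 13 test boots.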
Any update on this problem? thanks.