Subject    : x86 Geode issues in kernel >= 2.6.23 and >= 2.6.31-rc4
Submitter  : "Martin-Éric Racine" <q-funk@iki.fi>
Date       : 2009-08-03 12:58
References : http://marc.info/?l=linux-kernel&m=124930434732481&w=4

This entry is being used for tracking a regression from 2.6.29. Please don't close it until the problem is fixed in the mainline.
http://launchpadlibrarian.net/30267494/2.6.31-5.24.jpg shows a snapshot of the kernel panic in action. FYI Ubuntu's 2.6.31-5.24 package is based on upstream 2.6.31-rc5.
http://launchpadlibrarian.net/30414918/ubuntu_kernel_2.6.31-6.25.jpg shows an updated snapshot of the kernel panic, this time against 2.6.31-rc6. PS: https://bugs.launchpad.net/linux/+bug/396286 is where the thread about this issue started.
PS: contrary to what Rafael said above, this tracks a regression starting with 2.6.31-rcX. Kernels up to and including 2.6.30 are not affected.
Yes, that was a mistake, sorry. The right meta-bug is linked, though.
Working with Martin-Éric in the Launchpad bug, a bisect seemed to narrow it down to the following commit:

f19d4a8fa6f9b6ccf54df0971c97ffcaa390b7b0 is first bad commit
commit f19d4a8fa6f9b6ccf54df0971c97ffcaa390b7b0
Author: Al Viro <viro@zeniv.linux.org.uk>
Date:   Mon Jun 8 19:50:45 2009 -0400

    add caching of ACLs in struct inode

    No helpers, no conversions yet.

    Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
On Thursday 20 August 2009, Martin-Éric Racine wrote:
> Yes, it's still valid.
>
> Screenshots of the crash have been provided. Is anything else missing
> for the LKML to be able to debug and fix this?
>
> On Wed, Aug 19, 2009 at 11:26 PM, Rafael J. Wysocki<rjw@sisk.pl> wrote:
> > This message has been generated automatically as a part of a report
> > of recent regressions.
> >
> > The following bug entry is on the current list of known regressions
> > from 2.6.30. Please verify if it still should be listed and let me know
> > (either way).
> >
> >
> > Bug-Entry : http://bugzilla.kernel.org/show_bug.cgi?id=13941
> > Subject : x86 Geode issue
> > Submitter : Martin-Éric Racine <q-funk@iki.fi>
> > Date : 2009-08-03 12:58 (17 days old)
> > References : http://marc.info/?l=linux-kernel&m=124930434732481&w=4
On Wednesday 26 August 2009, Martin-Éric Racine wrote:
> Yes, it still is valid.
>
> On Tue, Aug 25, 2009 at 11:34 PM, Rafael J. Wysocki<rjw@sisk.pl> wrote:
> > This message has been generated automatically as a part of a report
> > of recent regressions.
> >
> > The following bug entry is on the current list of known regressions
> > from 2.6.30. Please verify if it still should be listed and let me know
> > (either way).
> >
> >
> > Bug-Entry : http://bugzilla.kernel.org/show_bug.cgi?id=13941
> > Subject : x86 Geode issue
> > Submitter : Martin-Éric Racine <q-funk@iki.fi>
> > Date : 2009-08-03 12:58 (23 days old)
> > References : http://marc.info/?l=linux-kernel&m=124930434732481&w=4
On Sunday 06 September 2009, Martin-Éric Racine wrote:
> Yes, it should still be listed, for as long as it hasn't been resolved.
>
> On Sun, Sep 6, 2009 at 8:24 PM, Rafael J. Wysocki<rjw@sisk.pl> wrote:
> > This message has been generated automatically as a part of a report
> > of recent regressions.
> >
> > The following bug entry is on the current list of known regressions
> > from 2.6.30. Please verify if it still should be listed and let me know
> > (either way).
> >
> >
> > Bug-Entry : http://bugzilla.kernel.org/show_bug.cgi?id=13941
> > Subject : x86 Geode issue
> > Submitter : Martin-Éric Racine <q-funk@iki.fi>
> > Date : 2009-08-03 12:58 (35 days old)
> > References : http://marc.info/?l=linux-kernel&m=124930434732481&w=4
Erm... What is comment #8 supposed to be about?
This is not complete yet, but here is the current status; maybe someone has an idea about the origins.

Both bisecting and looking at the panic show that the code added with the ACL caching framework is the place where the panic happens. The bug happens when __destroy_inode tries to free the i_acl structure. From the disassembly and the panic it can be seen that 0xffffb4ff is the content of i_acl. This looks like the value for ACL_NOT_CACHED (0xffffffff) was partially overwritten by a single byte value.

Experimentally swapping around the order of the acl pointers behind the previously existing private pointer lets the system boot. So this might be either an issue with the Geode which has always been around but went unnoticed because the private pointer was not used, or something accesses the inode structure directly with an offset, hoping to find the private pointer at that place.
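To make the "single byte" observation concrete, the two values quoted above can be compared bit by bit. This is only a minimal sketch using plain shell arithmetic; the variable names are illustrative, and the two constants are the ones from the comment (ACL_NOT_CACHED as a 32-bit value and the observed i_acl content).

```shell
#!/bin/sh
# Values quoted in the comment above (32-bit pointers on the Geode).
acl_not_cached=$(( 0xffffffff ))   # ACL_NOT_CACHED sentinel
observed=$(( 0xffffb4ff ))         # i_acl content seen in the panic

# XOR leaves a 1 bit wherever the two values disagree.
delta=$(( acl_not_cached ^ observed ))
printf 'differing bits: 0x%08x\n' "$delta"

# Walk the four bytes, least significant first, and record the ones that changed.
changed=""
i=0
while [ "$i" -lt 4 ]; do
    if [ $(( ($delta >> (8 * $i)) & 0xff )) -ne 0 ]; then
        printf 'byte %d changed: 0xff -> 0x%02x\n' "$i" $(( ($observed >> (8 * $i)) & 0xff ))
        changed="$changed$i"
    fi
    i=$(( $i + 1 ))
done
```

Only byte 1 (counting from the least significant byte) differs. On little-endian x86 that is the byte at offset +1 inside the pointer, which is consistent with a single-byte store landing in the middle of i_acl rather than a whole pointer being overwritten.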
On Monday 26 October 2009, Martin-Éric Racine wrote:
> I do not recall anyone on LKML actually ever doing any work towards
> fixing this issue so, yes, it is still open.
>
> On Mon, Oct 26, 2009 at 9:31 PM, Rafael J. Wysocki <rjw@sisk.pl> wrote:
> > This message has been generated automatically as a part of a report
> > of regressions introduced between 2.6.30 and 2.6.31.
> >
> > The following bug entry is on the current list of known regressions
> > introduced between 2.6.30 and 2.6.31. Please verify if it still should
> > be listed and let me know (either way).
> >
> >
> > Bug-Entry : http://bugzilla.kernel.org/show_bug.cgi?id=13941
> > Subject : x86 Geode issue
> > Submitter : Martin-Éric Racine <q-funk@iki.fi>
> > Date : 2009-08-03 12:58 (85 days old)
> > References : http://marc.info/?l=linux-kernel&m=124930434732481&w=4
It should be noted that Stefan Bader has done a significant amount of work to help me narrow down the cause of this bug. He has produced a number of patches that improve the amount and quality of debug info sent by this new ACL caching code and I have routinely submitted a number of dmesg outputs to match. At this point, the collaboration of the LKML to analyze the results and fix the regression introduced by the addition of ACL caching would be extremely desirable.
Created attachment 23792 [details] Debug patch that shows the corruption With this patch added to the crashing kernel on the Geode, the invalid pointers get detected and reported. The debug logs show a vast majority (I believe I only saw one single different case) of detections on destroy.
Created attachment 23793 [details]
Debug v2 to v5 diff

This is the diff between v2 and v5 of my debugging patches. The interesting part is that solely by immediately testing (reading back) the values of the acl pointers, the problem seems to go away. So really, this sounds more like some strange interaction of processor and caches than a race condition, especially as there have been runs with SMP set to off where corruption was still observed.
I can't reproduce it on the hardware I have here. I use 2.6.31.6 with the following patches applied on top of it:
- BFS
- AUFS2
and it works well on all the LX and GX hardware I have with me.
OK, closing as unreproducible. If anyone can reproduce it with 2.6.31.6, please feel free to reopen.
I can still reproduce it with 2.6.31.6 and with 2.6.32, so, no, you cannot close it.
PS: and as it seems that I cannot re-open it myself, please do it for me.
I'm unable to reproduce this bug w/ 2.6.33-rc2, built with gcc version 4.3.2 (Debian 4.3.2-1.1). This really needs a .config supplied in order to reproduce it (I'm starting to suspect that this is Ubuntu-specific). Martin or Stefan, can you please supply the .config that you're using?
I should also note that I am using ext3 w/ POSIX_ACLs enabled, on an LX w/ the following root: /dev/hda1 / ext3 rw,noatime,errors=continue,data=writeback 0 0
Created attachment 24319 [details] kernel config This is the config used on Ubuntu kernels.
Here, what I have is the following (replace /dev/hda1 with an UUID statement): /dev/hda1 / ext4 defaults,relatime,errors=remount-ro 0 1
Can you please provide a newer .config that the problem occurs with? oldconfig wants to make some large changes to the config when I attempt to use 2.6.32 (and 2.6.31-rc5 fails to build).
Created attachment 24324 [details] This is Ubuntu's 2.6.32 config.
With that .config and 2.6.32 on ext3, I'm afraid that I'm still unable to reproduce the problem. Things to check:
- Are you positive that you're using ext3 and not ext4?
- Try building a vanilla 2.6.32 kernel; this may be broken due to a patch that Ubuntu has added.
- Try building the Ubuntu kernel on an older compiler (I'm using 4.3.2 (Debian 4.3.2-1.1)). It could be a compiler bug.
Could this be a change in the host compiler to default to -march=i686? This has been reported on Fedora 12, that building the decompression code uses the default arch setting rather than what was set in Kconfig. See: http://git.kernel.org/tip/17a2a9b57a9a7d2fd8f97df951b5e63e0bd56ef5
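One way to check this hypothesis on the build host is to ask the compiler which target options it uses when no -march is given on the command line. A minimal sketch, assuming a GCC recent enough to support `-Q --help=target` (4.3 or later, as far as I know); the function name is only illustrative:

```shell
#!/bin/sh
# Print the -march/-mtune values the given compiler uses for 32-bit code.
# Extra arguments (for example -march=i386) are passed through, so the
# effect of an explicit flag can be compared with the built-in default.
show_default_march() {
    cc="${1:-gcc}"
    [ "$#" -gt 0 ] && shift
    "$cc" -m32 -Q --help=target "$@" 2>/dev/null | grep -E -- '-m(arch|tune)='
}
```

Running `show_default_march gcc` and then `show_default_march gcc -march=i386` on the same host should show whether the distribution's compiler defaults to something other than what Kconfig asks for.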
I tried both ext3 and ext4. Same crash regardless.

The Ubuntu guys maintain builds of stock kernels, i.e. upstream vanilla kernels with the default configuration, packaged as a .deb for convenience, as reference material to test for precisely this sort of regression. I got the same crash when testing with those.

Given Peter's assumption and Andres' third idea, I'd be tempted to blame it on recent changes in the host compiler defaults. Could Leann or Stefan please point me to instructions on building their stock 2.6.32-9-generic kernel? It's gonna take forever to complete, but I could launch a build of that on the Geode host itself, install it and see if that one boots.

Peter: given how recent GCC produces broken i386 code anyhow, would -march=i486 or -march=i586 make more sense as a default base level for x86 kernels?
Not sure how recent gcc produces broken i386 code; example, please. Either way, there really isn't any point in deviating from -march=i386; even with -march=i686 the differences are minimal (a handful of CMOV).
Trying "apt-get --compile source linux-image-2.6.32-9-generic" on the Geode now. I'll report ASAP on whether the resulting package boots any better or not.
Repeatedly trying to build 2.6.32 with Peter Anvin's patch following instructions at https://help.ubuntu.com/community/Kernel/Compile consistently fails:

CC arch/x86/kernel/alternative.o
CC arch/x86/kernel/i8253.o
CC arch/x86/kernel/pci-nommu.o
CC arch/x86/kernel/tsc.o
CC arch/x86/kernel/io_delay.o
CC arch/x86/kernel/rtc.o
CC arch/x86/kernel/trampoline.o
CC arch/x86/kernel/process.o
arch/x86/kernel/process.o: final close failed: File truncated
make[5]: *** [arch/x86/kernel/process.o] Error 1
make[4]: *** [arch/x86/kernel] Error 2
make[3]: *** [arch/x86] Error 2
make[2]: *** [sub-make] Error 2
make[1]: *** [/home/q-funk/Projektit/linux-2.6.32/debian/stamps/stamp-build-generic] Error 2
make: *** [binary-generic] Error 2

This is on a system with 1 GB of RAM and plenty of hard-disk space, so I'm really not sure how this "file truncated" keeps on showing up.
Stefan and I built new packages with Peter's patch, but this did not fix it either. Rather, the problem very much seems related to the new ACL caching code.
I would not say it is so much related as that, on this specific hw, it shows up with a high chance as a bug in this code. But as testing shows, the difference between a working kernel and one that shows the problem is not what the code does, but how long it takes. The only things added were printk's of the cached acl values while allocating the inode, and all of a sudden there is no corruption.

The same code works on other platforms. At least I don't know of any report that looks like this. Also, we cannot say it did not happen before; it is just that with the acl code there is now code in place which visibly crashes at that point. If other structures or fields are affected, this might "just" lead to weird effects. We had tests with smp disabled and still got the same issue.

Up to now I cannot recall a completely positive reproduction by someone else with exactly the same hw and kernel. So a very, very odd hw-specific issue cannot be ruled out 100%. The only other option would be an architecture-specific problem, for example running at a speed that allows the allocation to be interrupted at an inconvenient time, leaving register values incorrect. That might be some SMI interrupt. On the other hand, I could not really explain why this changes with the additional printks, as the occurrence of an interrupt is unlikely to be related to the speed of the running main thread. Would it be possible that such an interrupt disrupts something like the transfer between cache and memory, and that an added read for some reason also causes that value to be written out?
Created attachment 25228 [details] dmesg -r It seems that we have some progress. In an attempt to debug this issue, I compared notes with someone on Fedora for whom the same hardware works. As a test, I used their kernel 2.6.31 config (with a couple of small modifications to build specific drivers as built-in) to build my own kernel. Much to my amazement, this kernel boots fine, as long as I specify root=/dev/sda1 on the GRUB cmdline instead of the usual root=UUID=unique-filesystem-hash. However, for some reason, an initrd.img was not automatically created upon installing this custom kernel package. Yet, as soon as I created one using "sudo update-initramfs -k 2.6.31.12-geodelx -c" and rebooted, the kernel failed to boot as before. Just to be safe, I deleted the initrd and rebooted again, letting "udev" perform its work after /sbin/init has been launched by the kernel. Lo and behold, it worked again! As such, it seems that something that gets included in the initramfs image is what messes with the ACL and destroys some inodes and makes the kernel crash in a non-recoverable way. Interestingly enough, we still get the previous error messages about destroyed inodes when booting with this barebone kernel, without an initrd.img, but the error is non-fatal. The output of dmesg -r is attached.
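Since the crash only appears when an initramfs is present, looking at what the image actually contains may help narrow down the culprit. A rough sketch, assuming the `lsinitramfs` tool from initramfs-tools is installed and the image lives at the usual /boot/initrd.img-<version> path (both are assumptions; the function name is hypothetical):

```shell
#!/bin/sh
# List the kernel modules packed into the initramfs for a given kernel version.
list_initramfs_modules() {
    version="$1"
    lsinitramfs "/boot/initrd.img-$version" | grep '\.ko$' | sed 's|.*/||' | sort
}
```

For the kernel above this would be called as `list_initramfs_modules 2.6.31.12-geodelx`. The image also carries scripts and userspace tools, so the module list is only one part of what differs between the two boot paths.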
Created attachment 25303 [details]
dmesg -r on kernel 2.6.32

This is what shows on Ubuntu's stock kernel for their upcoming Lucid 10.04 release, if I purposely purge the initrd.img and let the kernel boot using whatever drivers it has built-in. As you can see, we indeed succeed at booting; however, vcons support doesn't work and, for obvious reasons, whatever goodies we expect to find in the initramfs image are also missing. The advantage is that the error I've mentioned appears as usual, but in a non-fatal way. Hopefully, the attached dmesg output can help the LKML developers determine what messes with sysfs on this host.
Kernel 2.6.36 magically fixed this. The fix would need to be backported to kernels newer than 2.6.30.
Thanks for the heads-up. Regarding the backport... can you use git-bisect to find the fix? Else we have nothing to backport. I'm closing this as unreproducible in the meantime. If you happen to find the fix, please post that here!
The issue returned as of 2.6.38-rc1 on the same hardware.
Is it still somehow related to the initrd? I.e. can you get 2.6.38-rc7 (or later) to boot without an initrd while with the initrd it does not?
2.6.38 final seems to work again as intended, with or without initrd.
Nice. Thanks for the update. I'm closing this as unreproducible, since we don't know which commit fixed it.
Reopening: http://bugs.debian.org/677655
If I understand correctly, each kernel version reliably works or doesn't work, but this symptom comes and goes from one kernel version to the next. What kernels have you tested, and what happened with each? Do you think this is a timing-related or memory layout related problem? Any ideas for distinguishing between the two?
Ping? I really am curious about that list of kernels you have tested and results, if you happen to have that information available.
The issue comes and goes. As I recall, this happens whenever someone changes something in the inode ACL code. I don't have a clue about what causes it. Kernel 3.2 is the last one that works for me.
I can reproduce this with 3.4.7, 3.5 and 3.5.1.
This bug is old. Please test against a newer kernel to see if this is still a valid kernel bug.
Cheers
Nick
Hello to all of you!

I can confirm that the bug is there in 3.14.0 and > 2.6.26.

I tried:
2.6.30.10
2.6.32.63
2.6.36
2.6.38.8
None of them works.

I recently upgraded one such Geode GX-MMX machine from lenny to wheezy and tried to compile a more recent kernel a few months ago; however, all of this failed, and I finally came across this thread. To verify the source of the issue I downloaded and rebuilt almost the same 2.6.26 that has been working since 2008. This works (despite some issues with gcc-4.7 and with update-initramfs). All the rest hang with some mangled output on the console.

cat /proc/cpuinfo
processor : 0
vendor_id : Geode by NSC
cpu family : 5
model : 5
model name : Geode(TM) Integrated Processor by National Semi
flags : fpu de pse tsc msr cx8 pge cmov mmx mmxext 3dnowext 3dnow
bogomips : 665.94

thanks in advance
I don't actually have any Geode hardware (haven't for some years), but if you know which was the last release that worked and which was the first that didn't, then it may still be possible to chase this down.
Can you give some short instructions on how to proceed? I did not find any patches for 2.6.26. So we know that 2.6.26 and 2.6.26.2 are working and 2.6.30 is not. Actually I'm more interested in getting something above 2.6.32 working. Is there a chance that this testing can be done on a newer version?
The absolutely best would be if you could do a "git bisect" between 2.6.26 and 2.6.30.
I want to establish the baseline for this testing:
- compiling is done on 64-bit Intel for a Geode GX-MMX
- compiler is gcc-4.7 on Debian wheezy
- the working config for 2.6.26 is used as the baseline: make ARCH=i386 oldconfig
- compiling with: make ARCH=i386 -j4 all
- added some additional options to gcc in the Makefile:

  HOSTCC = gcc -m32 -m3dnow -maes -mmmx
  HOSTCXX = g++ -m32 -m3dnow -maes -mmmx

Starting with kernel 2.6.27.1 (and in 2.6.26):

1. In arch/x86/vdso/Makefile I have to remove "-m elf_i386" (compiling in a chrooted i386 on 64-bit):

   CPPFLAGS_vdso32.lds = $(CPPFLAGS_vdso.lds)
  -VDSO_LDFLAGS_vdso32.lds = -m elf_i386 -Wl,-soname=linux-gate.so.1
  +VDSO_LDFLAGS_vdso32.lds = -Wl,-soname=linux-gate.so.1

This solves the following error:

  VDSO arch/x86/vdso/vdso32-int80.so.dbg
  gcc: error: unrecognized command line option ‘-m’
  gcc: error: elf_i386: No such file or directory
  make[3]: *** [arch/x86/vdso/vdso32-int80.so.dbg] Error 1

2. The mutex_lock/_unlock should be marked as __used, as described here:
http://www.brunni.de/linux_kernel_2.6.27.62_with_gcc_4.6.3.html

  LD .tmp_vmlinux1
  kernel/built-in.o: In function `mutex_lock':
  (.sched.text+0x726): undefined reference to `__mutex_lock_slowpath'
  kernel/built-in.o: In function `mutex_unlock':
  (.sched.text+0x730): undefined reference to `__mutex_unlock_slowpath'
  make[2]: *** [.tmp_vmlinux1] Error 1

Is there a more intelligent way to overcome the above issues? Is this setup useful? Do I have to add some more debugging in the kernel?

I'll compile some kernels for testing and send you feedback next week.

thanks in advance
regards
Hi,

from the official download site https://www.kernel.org/pub/linux/kernel/v2.6/ all kernels up to 2.6.30.10 worked. 2.6.31 does not work.

I tested the following:
config-2.6.26
config-2.6.27.1
config-2.6.27.30
config-2.6.27.50
config-2.6.28.10
config-2.6.29.6
config-2.6.30.10
config-2.6.30.3
config-2.6.31
config-2.6.31.14

Does this help?
Created attachment 148631 [details]
attachment-17854-0.html

How do I do "git bisect" between the last working 2.6.30.10 and 2.6.31?

thanks

On Saturday, August 16, 2014 4:12 AM, "bugzilla-daemon@bugzilla.kernel.org" <bugzilla-daemon@bugzilla.kernel.org> wrote:

https://bugzilla.kernel.org/show_bug.cgi?id=13941

--- Comment #50 from H. Peter Anvin <hpa@zytor.com> ---
The absolutely best would be if you could do a "git bisect" between 2.6.26 and 2.6.30.
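For reference, a bisect between those two releases could look roughly like the sketch below. It assumes an existing clone of the kernel tree that contains the v2.6.30 and v2.6.31 tags, and that mainline v2.6.30 behaves like the tested 2.6.30.10; the function names and the directory argument are only illustrative.

```shell
#!/bin/sh
# Start a bisect in an existing kernel clone: v2.6.31 is known bad,
# v2.6.30 is assumed good (2.6.30.10 was the version actually tested).
start_geode_bisect() {
    cd "$1" || return 1
    git bisect start v2.6.31 v2.6.30
}

# After building and booting the revision git checked out, record the result.
# Use "skip" if the revision does not build or cannot be tested.
mark_result() {
    case "$1" in
        good|bad|skip) git bisect "$1" ;;
        *) echo "usage: mark_result good|bad|skip" >&2; return 2 ;;
    esac
}

# Once git prints "<sha> is the first bad commit", save the log and clean up.
finish_geode_bisect() {
    git bisect log > "${1:-bisect.log}" && git bisect reset
}
```

Each step means building the checked-out revision with the same config and make commands as the other test kernels, booting it on the Geode, and calling `mark_result good` or `mark_result bad`. Git halves the remaining range every time, so the number of kernels to build grows only with the logarithm of the number of commits between the two tags.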
Hi, thanks for answering. I'm not too familiar with git etc, but I'll try my best. I found this document which seems to be good for the start https://www.kernel.org/pub/software/scm/git/docs/git-bisect-lk2009.html Just for the record I noticed that 2.6.30 has issue with udev in debian wheezy. Precisely it gets permission denied on locking /run/network/ifstate and thus can not bring up the network. There were other issues with /run as well. In wheezy udev is not starting because kernel higher than 2.6.32 is expected.
Hi again,

could you please provide instructions (or a link to such) that describe the process of bisecting? It seems I cannot do a git clone for some reason. I tried this at home and in the office, but no luck. All I get is:

git clone git://git.kernel.org/pub/scm/linux/kernel/git/stable/linux-2.6.git
Cloning into 'linux-2.6'...
fatal: unable to connect to git.kernel.org:
git.kernel.org[0: 199.204.44.194]: errno=Die Wartezeit für die Verbindung ist abgelaufen
git.kernel.org[1: 198.145.20.140]: errno=Die Wartezeit für die Verbindung ist abgelaufen
git.kernel.org[2: 149.20.4.72]: errno=Die Wartezeit für die Verbindung ist abgelaufen
git.kernel.org[3: 2001:4f8:1:10:0:1991:8:25]: errno=Das Netzwerk ist nicht erreichbar

git clone git://git.kernel.org/pub/scm/linux/kernel/git/stable/linux-stable.git
Cloning into 'linux-stable'...
fatal: unable to connect to git.kernel.org:
git.kernel.org[0: 199.204.44.194]: errno=Die Wartezeit für die Verbindung ist abgelaufen
git.kernel.org[1: 198.145.20.140]: errno=Die Wartezeit für die Verbindung ist abgelaufen
git.kernel.org[2: 149.20.4.72]: errno=Die Wartezeit für die Verbindung ist abgelaufen
git.kernel.org[3: 2001:4f8:1:10:0:1991:8:25]: errno=Das Netzwerk ist nicht erreichbar

PS: the German messages say "Connection timed out" (for the IPv4 addresses) and "Network is unreachable" (for the IPv6 address).

traceroute git.kernel.org
traceroute to git.kernel.org (149.20.4.72), 30 hops max, 60 byte packets
 1 10.146.8.4 (10.146.8.4) 0.414 ms 0.437 ms 0.487 ms
 2 192.168.192.4 (192.168.192.4) 0.460 ms 0.599 ms 0.738 ms
 3 80.109.255.1 (80.109.255.1) 2.979 ms 2.949 ms 2.868 ms
 4 at-vie-sk11-pe04-vl-2035.upc.at (84.116.228.98) 104.393 ms 104.270 ms 104.380 ms
 5 at-vie-sk11-pe02-vl-2028.upc.at (84.116.228.69) 104.371 ms 104.496 ms 104.501 ms
 6 at-vie15a-rd1-vl-2043.aorta.net (84.116.228.129) 103.672 ms 102.080 ms 102.733 ms
 7 uk-lon01b-rd1-xe-1-0-1.aorta.net (84.116.132.37) 102.424 ms 103.347 ms 102.573 ms
 8 de-fra01a-ri2-xe-4-0-0.aorta.net (84.116.130.138) 103.141 ms 84-116-130-149.aorta.net (84.116.130.149) 102.154 ms 101.613 ms
 9 84.116.137.182 (84.116.137.182) 105.509 ms 84.116.137.190 (84.116.137.190) 157.893 ms 84.116.137.186 (84.116.137.186) 105.663 ms
10 nyiix.r1.lga1.isc.org (198.32.160.95) 105.398 ms 104.632 ms 104.637 ms
11 int-0-0-0-7.r1.pao1.isc.org (149.20.65.137) 175.873 ms 175.884 ms 172.800 ms
12 git2.kernel.org (149.20.4.72) 172.420 ms 172.626 ms 172.640 ms

thanks in advance
regards
I think there is some IPv6 issue here
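If the timeouts come from the git:// port (9418) being filtered somewhere along the path, which is only a guess, fetching the same repository over HTTPS may get around it. A sketch, assuming kernel.org serves the repository at the same path over https://; the variable, the function name and the default target directory are illustrative:

```shell
#!/bin/sh
# Same repository path as in the failed attempts, but over HTTPS (port 443)
# instead of the native git protocol.
stable_url="https://git.kernel.org/pub/scm/linux/kernel/git/stable/linux-stable.git"

clone_stable_https() {
    dest="${1:-linux-stable}"
    git clone "$stable_url" "$dest"
}
```

Calling `clone_stable_https` with no argument clones into ./linux-stable. Since traceroute reaches the server over IPv4 while the clone times out on the same addresses, a blocked port looks at least as plausible as an IPv6 problem, but neither has been confirmed here.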