Bug 15836
Summary: | Untraceble race condition (badblocks) on SD card reader [197b:2381] | ||
---|---|---|---|
Product: | Memory Management | Reporter: | Stephan Sürken (absurd) |
Component: | Other | Assignee: | Andrew Morton (akpm) |
Status: | RESOLVED OBSOLETE | ||
Severity: | normal | CC: | andi-bz, rjw |
Priority: | P1 | ||
Hardware: | All | ||
OS: | Linux | ||
Kernel Version: | 2.6.32-2.6.36,2.6.39 | Subsystem: | |
Regression: | Yes | Bisected commit-id: | |
Bug Depends on: | |||
Bug Blocks: | 7216 | ||
Attachments: |
revert.patch
config-2.6.32-rc7.git6ad696d |
Description
Stephan Sürken
2010-04-22 18:50:49 UTC
Created attachment 26097 [details]
revert.patch
OK, now that's weird. Can you please apply the above revert patch to 2.6.33, see if that indeed makes the failures go away? Thanks. Stephan, can you attach your .config, please? (In reply to comment #3) > Stephan, can you attach your .config, please? Sure. JFTR, all builds and tests where done on a Debian lenny/amd64 system. All builds were done on a clean tree with a "near" Debian standard kernels's config as template (using all defaults for the few questions by oldconfig). Attaching the exact config for the "6ad696d" (aka "first bad") commit build. (In reply to comment #2) > OK, now that's weird. Can you please apply the above revert patch to 2.6.33, > see if that indeed makes the failures go away? Ok. As soon as I find time... Thx, Stephan Created attachment 26109 [details]
config-2.6.32-rc7.git6ad696d
Ok, I had some headaches [1] testing this, but if i did not do any stupid mistakes, these are the results: 2.6.33.2 native (git19f00f0): bad 2.6.33.2 revert (git63c34d8): good That's the same as for 2.6.32; however 2.6.34-rc5 native (gitb91ce4d): bad 2.6.34-rc5 revert (git9a34047): bad Hmm (?). Does this help? I have no clue really ;) Maybe the 6ad69 commit is only making some race condition more likely that has always existed? Thx, Stephan [1] Your patch missed calls to [un]lock_system_sleep in mm/memory-failure.c, so I happily just removed these two calls. Also, git made me look stupid again as "git apply <your patch>" barked at me on the 2.6.33 tree (while "patch -p1" worked fine); just wondering if that's really the correct way to apply patch? (switched to email. Please respond via emailed reply-to-all, not via the bugzilla web interface). On Mon, 26 Apr 2010 20:48:47 GMT bugzilla-daemon@bugzilla.kernel.org wrote: > https://bugzilla.kernel.org/show_bug.cgi?id=15836 > OK, I give up. Can anyone suggest how 6ad696d2cf535772dff659298ec7e7260e344595 ("mm: allow memory hotplug and hibernation in the same kernel") could have caused the mmc failures which Stephan is reporting in the above bugzilla record? Thanks. On Mon, 2010-04-26 at 13:58 -0700, Andrew Morton wrote: (...) > > https://bugzilla.kernel.org/show_bug.cgi?id=15836 > > > > OK, I give up. Ok, I guess it would be very helpful, first of all, if someone could affirm the buggy behavior on the same or similar hardware with kernels >=2.6.32. The bug occurs on a "HP HDX 16" laptop; attaching some hardware information... Thanks! Stephan -- Stephan Sürken <absurd at olurdix.de> On Tue, 27 Apr 2010 12:59:17 +0200, Stephan Suerken <absurd@olurdix.de> wrote: > On Mon, 2010-04-26 at 13:58 -0700, Andrew Morton wrote: > (...) > > > https://bugzilla.kernel.org/show_bug.cgi?id=15836 > > > > > > > OK, I give up. > > Ok, I guess it would be very helpful, first of all, if someone could > affirm the buggy behavior on the same or similar hardware with kernels > >=2.6.32. > > The bug occurs on a "HP HDX 16" laptop; attaching some hardware > information... Stephan, this one definitely looks strange. Have you tried running other diagnostic tools on you SD card, e.g. fsck? How about on your hard disks or other mass storage devices? Is it always the same block that's reported as bad? Were you doing any suspend/hibernation during these tests? It might also be an idea to turn on CONFIG_DEBUG_PAGEALLOC and run badblocks again. If you turn off CONFIG_HOTPLUG and CONFIG_HIBERNATION on a recent kernel does it work then? I'm pretty stumped as to what is going on here... Reply-To: andi@firstfloor.org > Can anyone suggest how 6ad696d2cf535772dff659298ec7e7260e344595 ("mm: > allow memory hotplug and hibernation in the same kernel") could have > caused the mmc failures which Stephan is reporting in the above > bugzilla record? The mission information is which CONFIG_* option changes the behaviour. (or maybe it's in the bugzilla, i just skimmed it very quickly) presumably it's either hibernation or memory hotadd that introduces the problem. My patch just allowed to have both together. Once it's determined which one it is one could do a second bisect with that option always on (and the other off)? Or maybe it's not a regression. -Andi Matt, On Wed, 2010-04-28 at 23:38 +0000, bugzilla-daemon@bugzilla.kernel.org wrote: > https://bugzilla.kernel.org/show_bug.cgi?id=15836 (...) > this one definitely looks strange. ok, I am desperate. Maybe its time to just throw out any information I can think of right now, hoping for some educated guess ;). > Have you tried running other diagnostic tools on you SD card, e.g. fsck? Actually, s.th. like "mkfs;fsck --force" was my initial test before I switched to badblocks. As for bad media, the laptop is my working horse. I work on it every day; when I work on it, the sd card is in use. I had no issues with it whatsoever until I switched to 2.6.32, and I continue to have no issues (sticking to a 2.6.30 kernel) with the very same card. And yes, after I found the kernel upgrade broke it, I tested with a different sd card with the same results. So it's definitely not a media problem. > How about on your hard disks or other mass storage devices? Ok, good point (although I booted from the hard disk for the tests). I run the same badblocks test (with a "bad" kernel) on a hard disk partition and on an usb stick without issues. So it's not a general mass storage problem. > Is it always the same block that's reported as bad? Ok, good point. I noted that they seem to be different, but I did not care much yet. Now I have retested on a bad kernel (actually stock 2.6.33.2), running the test 5 times (0-4) in a row, and this three times on three bootups; the number is the first found bad blocks: Boot 0: 0 14144 1-4 64 Boot 1 (warm boot): 0 64 1-4 64 Boot 2 (cold boot): 0 140 1-4 76 Fwiw, read-only badblock tests always work fine. > Were you doing any suspend/hibernation during these tests? No, all tests after fresh reboot. > It might also be an idea to turn on > CONFIG_DEBUG_PAGEALLOC and run badblocks again. > > If you turn off CONFIG_HOTPLUG and CONFIG_HIBERNATION on a recent kernel > does it work then? Ok, later ;). Thx, Stephan -- Stephan Sürken <absurd at olurdix.de> On Fri, 2010-04-30 at 19:06 +0200, Andi Kleen wrote:
> The mission information is which CONFIG_* option changes the behaviour.
> (or maybe it's in the bugzilla, i just skimmed it very quickly)
>
> presumably it's either hibernation or memory hotadd that introduces
> the problem.
>
> My patch just allowed to have both together.
>
> Once it's determined which one it is one could do a second bisect
> with that option always on (and the other off)? Or maybe it's not a
> regression.
Thx Andi. I now get the obvious: the way I built the kernels (using
standard Debian configs as template, having both CONFIG_MEMORY_HOTPLUG
and CONIG_HIBERNATION enabled), without your commit,
CONFIG_MEMORY_HOTPLUG was implicitly turned off. Bummer.
Enabling CONFIG_MEMORY_HOTPLUG actually enables the bug.
I explicitly confirmed this on top of 2.6.33.2:
config-2.6.33.2.git19f00f0-config-2.6.33-2-amd64.nohiber:CONFIG_MEMORY_HOTPLUG=y
=>"bad"
config-2.6.33.2.git19f00f0-config-2.6.33-2-amd64.nohotplug:# CONFIG_MEMORY_HOTPLUG is not set
=>"good"
So any code enabled by CONFIG_MEMORY_HOTPLUG is the winner, presumable
somewhere in the mmc code (?).
I am not sure how to proceed, now. To bisect I would need to guess/find
a "good" kernel first, if there is one. Maybe some hints how to best
debug the code in question would be more helpful...
Thx,
Stephan
--
Stephan Sürken <absurd at olurdix.de>
Reply-To: andi@firstfloor.org > I explicitly confirmed this on top of 2.6.33.2: Thanks that makes a bit more sense. > > > config-2.6.33.2.git19f00f0-config-2.6.33-2-amd64.nohiber:CONFIG_MEMORY_HOTPLUG=y > =>"bad" > config-2.6.33.2.git19f00f0-config-2.6.33-2-amd64.nohotplug:# > CONFIG_MEMORY_HOTPLUG is not set > =>"good" > > So any code enabled by CONFIG_MEMORY_HOTPLUG is the winner, presumable > somewhere in the mmc code (?). I doubt the mmc code really knows anything about memory hotplug. Must be some subtle side effect. Can send the diff of two boot logs with memory hotplug on and off? I wonder if you have the same memory layout in both cases. -Andi On Sun, 2010-05-02 at 01:04 +0200, Andi Kleen wrote: (..) > > > config-2.6.33.2.git19f00f0-config-2.6.33-2-amd64.nohiber:CONFIG_MEMORY_HOTPLUG=y > > =>"bad" > > config-2.6.33.2.git19f00f0-config-2.6.33-2-amd64.nohotplug:# > CONFIG_MEMORY_HOTPLUG is not set > > =>"good" (..) > Can send the diff of two boot logs with memory hotplug on and off? > I wonder if you have the same memory layout in both cases. Attached, from the above two good/bad kernel builds; I stripped the stamps to make it more readable. Also attaching the full logs too just in case... Hope this helps, Stephan -- Stephan Sürken <absurd at olurdix.de> Reply-To: andi@firstfloor.org On Mon, May 03, 2010 at 08:47:21PM +0200, Stephan Sürken wrote: > On Sun, 2010-05-02 at 01:04 +0200, Andi Kleen wrote: > (..) > > > > config-2.6.33.2.git19f00f0-config-2.6.33-2-amd64.nohiber:CONFIG_MEMORY_HOTPLUG=y > > > =>"bad" > > > config-2.6.33.2.git19f00f0-config-2.6.33-2-amd64.nohotplug:# > CONFIG_MEMORY_HOTPLUG is not set > > > =>"good" > (..) > > Can send the diff of two boot logs with memory hotplug on and off? > > I wonder if you have the same memory layout in both cases. > > Attached, from the above two good/bad kernel builds; I stripped the > stamps to make it more readable. Thanks, I was hoping for a comparison between memory hotplug on and off with hibernation always off, this seems to be between memory hotplug and hibernation. Anyways looking at this log. The only significant difference I can see is that in the one case a lot of stuff including sdhci get initialized earlier. Presumably the timing is somewhat different related to the USB scanning and the memory hotadd kernel being a bit slower to boot (perhaps due to the increased printks) Perhaps this exposes some races that cause the problem. What these races are I don't know. I would suggest to check with the mmc maintainer. -Andi Hi, sorry for the delay, but I am still on it ;) On Tue, 2010-05-04 at 22:13 +0200, Andi Kleen wrote: > I would suggest to check with the mmc maintainer. linux-mmc is on CC, I hope that will suffice. Ok, I spent some more time trying to debug this, but the only really predictable thing seems to be that it's designed to drive me mad ;). 1st, it's definitely not just some code enabled by CONFIG_MEMORY_HOTPLUG. I did a new bisect as suggested by Andi earlier (using a configuration with CONFIG_HIBERNATION off and CONFIG_MEMORY_HOTPLUG on); the result however was unusable as I had to change some other config in-between. So, yes, disabling CONFIG_MEMORY_HOTPLUG seemingly consistent "fixes" the bug, at least for kernels 2.6.30-33 using a else unchanged Debian config. But just changing other config items (like just leaving out some "unrelated" modules) also does the trick... > get initialized earlier. Presumably the timing is somewhat > different related to the USB scanning and the memory hotadd > kernel being a bit slower to boot (perhaps due to the increased printks) Then (assuming a strange race condition) I checked that 2.6.34 (I couldn't until then build a "non-bad" 2.6.34 kernel) behaves the same; and yes, after removing some more unrelated modules, it suddenly works. So I did more testing with a stock 2.6.34 "Debian" kernel using a minimal initrd and blacklisting as many modules as possible (list of involved modules attached). With this setup, testing results became even more random ;(. The only maybe worthy observation: * Most (like 4 out of 5) bootups boot in a "good state" -- meaning I can run my test again and again without failure. * When in such a "good state", modprobing "ehci-hcd" (usb on pci pulling in quite some internal usb devices) always triggers the "bad state". * When in such a "good state", modprobing "r8169" (net on pci) often (but not always) triggers the "bad state". Ok, this all sounds a bit whacky -- but the race may be related to (other) data on the pci bus? This would at least explain the randomness. Anyway, I am still stumped, and lacking in-depth kernel knowledge to further debug this -- so I'd appreciate any hints to deliver better information. Thx, Stephan Fwiw, this problem is mysteriously fixed with kernels >=2.6.37. I am setting this to resolved/obsolete, and versions to 2.6.32-2.6.36, hope that's correct way. It's fine with me to just close this issue. Thx! Stephan Fwiw, I just noticed that 2.6.39 re-introduces this bug, while however 3.0.0 "fixes" it again ;(. 2.6.32-36: badblocks 2.6.37-38: ok 2.6.39: badblocks 3.0: ok Updated the versions in the bug, and some more correct summary. If it will come back in 3.x I might consider reopening. MfG, Stephan |