Bug 44771
Summary: | [REGRESSION] a7a20d103994fd760766e6c9d494daa569cbfe06 makes kernel 3.5 unbootable on an Intel chipset based motherboard | ||
---|---|---|---|
Product: | IO/Storage | Reporter: | Artem S. Tashkinov (aros) |
Component: | Block Layer | Assignee: | Jens Axboe (axboe) |
Status: | RESOLVED CODE_FIX | ||
Severity: | blocking | CC: | dan.j.williams, dan.j.williams, jgarzik, rjw, stern, torvalds, yoshi314 |
Priority: | P1 | ||
Hardware: | All | ||
OS: | Linux | ||
Kernel Version: | 3.5-rc7 | Subsystem: | |
Regression: | Yes | Bisected commit-id: | |
Attachments: |
Linux 3.5 .config
Linux 3.3 .config (working) Image of panic on boot dmesg under Linux 3.3 and lspci -vvv output RESEND-Fix-a-dead-loop-in-async_synchronize_full.patch dmesg with "rootdelay=10" dmesg with "rootwait" unbootable shot with timestamps Trial patch Alternate scsi_complete_async_scans() patch |
Created attachment 75341 [details]
Linux 3.3 .config (working)
Created attachment 75351 [details]
Image of panic on boot
VFS: Cannot open root device "sda2" or unknown-block(0,0): error -6
Please append a correct "root=" boot option; here are the available partitions: <EMPTY LIST>
Kernel panic - not syncing: VFS: Unable to mount root fs on unknown-block(0,0)
Seems to fixed in 3.5.0-rc7. Weird. Wow, wow, wow! It's a perfect example of non-deterministic kernel behavior. This morning I could boot my PC into this kernel and everything worked fine. I now tried to boot again after powering off my computer (not completely - it was still plugged into the outlet) and it again cannot read the partition table thus the kernel doesn't boot. Don't what what fluke was that, but I was able to boot into that kernel just once. Created attachment 75561 [details]
dmesg under Linux 3.3 and lspci -vvv output
git bisect bad a7a20d103994fd760766e6c9d494daa569cbfe06 is the first bad commit commit a7a20d103994fd760766e6c9d494daa569cbfe06 Author: Dan Williams <dan.j.williams@intel.com> Date: Thu Mar 22 17:05:11 2012 -0700 [SCSI] sd: limit the scope of the async probe domain sd injects and synchronizes probe work on the global kernel-wide domain. This runs into conflict with PM that wants to perform resume actions in async context: [ 494.237079] INFO: task kworker/u:3:554 blocked for more than 120 seconds. [ 494.294396] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message. [ 494.360809] kworker/u:3 D 0000000000000000 0 554 2 0x00000000 [ 494.420739] ffff88012e4d3af0 0000000000000046 ffff88013200c160 ffff88012e4d3fd8 [ 494.484392] ffff88012e4d3fd8 0000000000012500 ffff8801394ea0b0 ffff88013200c160 [ 494.548038] ffff88012e4d3ae0 00000000000001e3 ffffffff81a249e0 ffff8801321c5398 [ 494.611685] Call Trace: [ 494.632649] [<ffffffff8149dd25>] schedule+0x5a/0x5c [ 494.674687] [<ffffffff8104b968>] async_synchronize_cookie_domain+0xb6/0x112 [ 494.734177] [<ffffffff810461ff>] ? __init_waitqueue_head+0x50/0x50 [ 494.787134] [<ffffffff8131a224>] ? scsi_remove_target+0x48/0x48 [ 494.837900] [<ffffffff8104b9d9>] async_synchronize_cookie+0x15/0x17 [ 494.891567] [<ffffffff8104ba49>] async_synchronize_full+0x54/0x70 <-- here we wait for async contexts to complete [ 494.943783] [<ffffffff8104b9f5>] ? async_synchronize_full_domain+0x1a/0x1a [ 495.002547] [<ffffffffa00114b1>] sd_remove+0x2c/0xa2 [sd_mod] [ 495.051861] [<ffffffff812fe94f>] __device_release_driver+0x86/0xcf [ 495.104807] [<ffffffff812fe9bd>] device_release_driver+0x25/0x32 <-- here we take device_lock() [ 853.511341] INFO: task kworker/u:4:549 blocked for more than 120 seconds. [ 853.568693] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message. [ 853.635119] kworker/u:4 D ffff88013097b5d0 0 549 2 0x00000000 [ 853.695129] ffff880132773c40 0000000000000046 ffff880130790000 ffff880132773fd8 [ 853.758990] ffff880132773fd8 0000000000012500 ffff88013288a0b0 ffff880130790000 [ 853.822796] 0000000000000246 0000000000000040 ffff88013097b5c8 ffff880130790000 [ 853.886633] Call Trace: [ 853.907631] [<ffffffff8149dd25>] schedule+0x5a/0x5c [ 853.949670] [<ffffffff8149cc44>] __mutex_lock_common+0x220/0x351 [ 854.001225] [<ffffffff81304bd7>] ? device_resume+0x58/0x1c4 [ 854.049082] [<ffffffff81304bd7>] ? device_resume+0x58/0x1c4 [ 854.097011] [<ffffffff8149ce48>] mutex_lock_nested+0x2f/0x36 <-- here we wait for device_lock() [ 854.145591] [<ffffffff81304bd7>] device_resume+0x58/0x1c4 [ 854.192066] [<ffffffff81304d61>] async_resume+0x1e/0x45 [ 854.237019] [<ffffffff8104bc93>] async_run_entry_fn+0xc6/0x173 <-- ...while running in async context Provide a 'scsi_sd_probe_domain' so that async probe actions actions can be flushed without regard for the state of PM, and allow for the resume path to handle devices that have transitioned from SDEV_QUIESCE to SDEV_DEL prior to resume. Acked-by: Alan Stern <stern@rowland.harvard.edu> [alan: uplevel scsi_sd_probe_domain, clarify scsi_device_resume] Signed-off-by: Dan Williams <dan.j.williams@intel.com> [jejb: remove unneeded config guards in include file] Signed-off-by: James Bottomley <JBottomley@Parallels.com> :040000 040000 4e59ccb852f261f97701a245e637a690dfce9d20 fc73ca0da1288a7f30b81a8593ddad2146d7bfb5 M drivers Just in case here's a complete `git bisect` history: $ git bisect log git bisect start # bad: [bd0a521e88aa7a06ae7aabaed7ae196ed4ad867a] Linux 3.5-rc6 git bisect bad bd0a521e88aa7a06ae7aabaed7ae196ed4ad867a # good: [76e10d158efb6d4516018846f60c2ab5501900bc] Linux 3.4 git bisect good 76e10d158efb6d4516018846f60c2ab5501900bc # bad: [05f144a0d5c2207a0349348127f996e104ad7404] mm: mempolicy: Let vma_merge and vma_split handle vma->vm_policy linkages git bisect bad 05f144a0d5c2207a0349348127f996e104ad7404 # bad: [8dca6010d44cc722a94dc6da96560f9083dac782] Merge tag 'fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/arm/arm-soc git bisect bad 8dca6010d44cc722a94dc6da96560f9083dac782 # good: [cb62ab71fe2b16e8203a0f0a2ef4eda23d761338] Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net-next git bisect good cb62ab71fe2b16e8203a0f0a2ef4eda23d761338 # good: [ba01a87e37d3ca9efe141e2907c2ec3f89490b4f] Merge branch 'usb-target-merge' of git://git.kernel.org/pub/scm/linux/kernel/git/nab/target-pending git bisect good ba01a87e37d3ca9efe141e2907c2ec3f89490b4f # bad: [226da0dbc84ed97f448523e2a4cb91c27fa68ed9] Merge branch 'core-rcu-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip git bisect bad 226da0dbc84ed97f448523e2a4cb91c27fa68ed9 # bad: [da4f58ffa08a7b7012fab9c205fa0f6ba40fec42] Merge tag 'scsi-misc' of git://git.kernel.org/pub/scm/linux/kernel/git/jejb/scsi git bisect bad da4f58ffa08a7b7012fab9c205fa0f6ba40fec42 # bad: [794c10fa0fa4d1781c5651c31e3d4d0b71629128] [SCSI] sg: remove while (1) non-loop git bisect bad 794c10fa0fa4d1781c5651c31e3d4d0b71629128 # good: [42e22cac4e57f3e0b4b631c9489effe97f7d7d6c] [SCSI] storvsc: Properly handle errors from the host git bisect good 42e22cac4e57f3e0b4b631c9489effe97f7d7d6c # good: [89a342ca6bfc1a014ff50cce5659abc58e11ecfc] [SCSI] scsi_transport_spi: fix for unbalanced reference counting git bisect good 89a342ca6bfc1a014ff50cce5659abc58e11ecfc # good: [6cba3f1972de14421ef4cf4b46a15cc5d604e791] [SCSI] hpsa: do aborts two ways git bisect good 6cba3f1972de14421ef4cf4b46a15cc5d604e791 # good: [d82357eaaa4c9b9cb16cbc1b95cb015801506a33] [SCSI] hpsa: add new RAID level "1(ADM)" git bisect good d82357eaaa4c9b9cb16cbc1b95cb015801506a33 # good: [e85c59746957fd6e3595d02cf614370056b5816e] [SCSI] hpsa: dial down lockup detection during firmware flash git bisect good e85c59746957fd6e3595d02cf614370056b5816e # bad: [dddbf8d908e89afb14a49502f568d2a2b814436d] [SCSI] sg: remove unnecessary indentation git bisect bad dddbf8d908e89afb14a49502f568d2a2b814436d # bad: [a7a20d103994fd760766e6c9d494daa569cbfe06] [SCSI] sd: limit the scope of the async probe domain git bisect bad a7a20d103994fd760766e6c9d494daa569cbfe06 Wow, I'm not the first person to hit this problem: http://help.lockergnome.com/linux/SCSI-sd-limit-scope-async-probe-domain-breaks-booting--ftopict555976.html > Dudes, > so I've been testing latest linus > (731a7378b81c2f5fa88ca1ae20b83d548d5613dc) here and my box fails booting > because it can't find the root partition, see message below. > > I did a bisect run (also below) and pointed me to the first bad commit > (see below too). > > Reverting the commit in question fixes booting. > > Let me know what other info you'd need. > > Posted: Wed May 30, 2012 2:10 pm > > Borislav Petkov Yet another person: https://lkml.org/lkml/2012/5/25/271 > Dan Williams > > Ok, I'll take a look. > > Thanks for the help! And no fix whatsoever. This *should* have been fixed by commit 43a8d39d0137 ("fix async probe regression"). It may have been an incomplete fix, or you're actually seeing a different bug (and then bisection mixes the bugs up). (In reply to comment #10) > This *should* have been fixed by commit 43a8d39d0137 ("fix async probe > regression"). > > It may have been an incomplete fix, or you're actually seeing a different bug > (and then bisection mixes the bugs up). If kernel 3.5-rc7 contains this fix, then this issue is not fixed. And it looks like the LKML discussion about this bug yielded no positive resolution. On Tue, Jul 17, 2012 at 12:56 PM, <bugzilla-daemon@bugzilla.kernel.org> wrote: > > If kernel 3.5-rc7 contains this fix, then this issue is not fixed. And it > looks > like the LKML discussion about this bug yielded no positive resolution. Sure it did, just in another thread. See the thread [PATCH v2 0/4] fix / cleanup async scsi scanning instead. Where Borislav and Meelis Roos both said it fixed things for them. That said, there's this: http://marc.info/?l=linux-kernel&m=133833942517359&w=2 which looks really odd. And may indicate that there is still something else going on. But Walt never replied to Dan's quesiton, afaik. That said, there's *another* patch that never got applied which was also talked about in a thread about that same "sd: limit the scope of the async probe domain" patch, see https://lkml.org/lkml/2012/7/2/32 so I'm adding Paul and Li Zhong to the cc here. Does that patch make any difference to you? Linus On Tue, Jul 17, 2012 at 01:08:06PM -0700, Linus Torvalds wrote: > On Tue, Jul 17, 2012 at 12:56 PM, <bugzilla-daemon@bugzilla.kernel.org> > wrote: > > > > If kernel 3.5-rc7 contains this fix, then this issue is not fixed. And it > looks > > like the LKML discussion about this bug yielded no positive resolution. > > Sure it did, just in another thread. > > See the thread > > [PATCH v2 0/4] fix / cleanup async scsi scanning > > instead. Where Borislav and Meelis Roos both said it fixed things for them. > > That said, there's this: > > http://marc.info/?l=linux-kernel&m=133833942517359&w=2 > > which looks really odd. And may indicate that there is still something > else going on. But Walt never replied to Dan's quesiton, afaik. > > That said, there's *another* patch that never got applied which was > also talked about in a thread about that same "sd: limit the scope of > the async probe domain" patch, see > > https://lkml.org/lkml/2012/7/2/32 > > so I'm adding Paul and Li Zhong to the cc here. Does that patch make > any difference to you? The https://lkml.org/lkml/2012/7/2/32 fixes the problem in my testing. Thanx, Paul Created attachment 75591 [details] RESEND-Fix-a-dead-loop-in-async_synchronize_full.patch (In reply to comment #12) > On Tue, Jul 17, 2012 at 12:56 PM, <bugzilla-daemon@bugzilla.kernel.org> > wrote: > > > > If kernel 3.5-rc7 contains this fix, then this issue is not fixed. And it > looks > > like the LKML discussion about this bug yielded no positive resolution. > > Sure it did, just in another thread. > > See the thread > > [PATCH v2 0/4] fix / cleanup async scsi scanning > > instead. Where Borislav and Meelis Roos both said it fixed things for them. > > That said, there's this: > > http://marc.info/?l=linux-kernel&m=133833942517359&w=2 > > which looks really odd. And may indicate that there is still something > else going on. But Walt never replied to Dan's quesiton, afaik. > > That said, there's *another* patch that never got applied which was > also talked about in a thread about that same "sd: limit the scope of > the async probe domain" patch, see > > https://lkml.org/lkml/2012/7/2/32 > > so I'm adding Paul and Li Zhong to the cc here. Does that patch make > any difference to you? > > Linus I've applied it on top of 3.5-rc7: # patch -b -p1 < RESEND-Fix-a-dead-loop-in-async_synchronize_full.patch patching file kernel/async.c Hunk #1 succeeded at 86 with fuzz 1. The issue persists. Artem, what happens if you enable CONFIG_PRINTK_TIME and add "rootdelay=10" to the boot command line? If the system still fails to boot, please attach another screenshot. (In reply to comment #15) > Artem, what happens if you enable CONFIG_PRINTK_TIME and add "rootdelay=10" > to > the boot command line? If the system still fails to boot, please attach > another screenshot. Should I check it with or without Fix-a-dead-loop-in-async_synchronize_full.patch? I presume you want me to run Linux 3.5-rc7, right? I don't know if that patch will make any difference. Try it both ways if you want. Yes, use 3.5-rc7. Created attachment 75661 [details] dmesg with "rootdelay=10" (In reply to comment #15) > Artem, what happens if you enable CONFIG_PRINTK_TIME and add "rootdelay=10" > to > the boot command line? If the system still fails to boot, please attach > another screenshot. The system now boots successfully even without this patch albeit with a ten seconds delay. Okay, that's progress. Next, can you attach a screenshot showing the hang when you boot the same kernel without the "rootdelay=10" option? Comparison of the timestamps may turn out to be helpful. One more thing. What happens with that same kernel when you specify "rootwait" instead of "rootdelay=10"? Created attachment 75671 [details] dmesg with "rootwait" (In reply to comment #20) > One more thing. What happens with that same kernel when you specify > "rootwait" > instead of "rootdelay=10"? rootwait is also perfectly usable. Created attachment 75681 [details] unbootable shot with timestamps (In reply to comment #19) > Okay, that's progress. Next, can you attach a screenshot showing the hang > when > you boot the same kernel without the "rootdelay=10" option? Comparison of > the > timestamps may turn out to be helpful. Here it is. Created attachment 75721 [details]
Trial patch
Ugh. There's something wrong in our async handling.
wait_for_device_probe() does an "async_synchronize_full()" because it doesn't want any async stuff pending.
But despite the name, "async_synchronize_full()" really only synchronizes the "global domain". And commit a7a20d103994 very much on purpose uses a different async domain for this. So wait_for_device_probe() basically does not do what it is supposed to do.
And commit 43a8d39d0137 really only fixed the case of people using the scsi_wait_scan module, but that is only useful for people who mount an initrd that loads that module. Artem doesn't do that.
So we really need to fix wait_for_device_probe() itself, methinks. Perhaps like this patch.
Alternatively, we should just put the scsi_complete_async_scans() into wait_for_device_probe() itself. Comments?
Ok, I attached a patch to the bugzilla entry that may or may not fix this. The wait_for_device_probe() cases seem to need "scsi_complete_async_scans()". I wonder if we should just put that call to scsi_complete_async_scans() _into_ the wait_for_device_probe() function, since most callers really do seem to want it. The resume code already did it anyway, the hibernate code did not. Ho humm. Linus Note: James isn't on the CC list for this bug report. If Dan Williams' async changes go into -stable, they will fix the problem. Otherwise something like your patch is needed. James is the person to ask. (In reply to comment #23) > Created an attachment (id=75721) [details] > Trial patch I have tested the patch and it fixes this issue. On Wednesday, July 18, 2012, Linus Torvalds wrote:
> Ok, I attached a patch to the bugzilla entry that may or may not fix this.
>
> The wait_for_device_probe() cases seem to need
> "scsi_complete_async_scans()". I wonder if we should just put that
> call to scsi_complete_async_scans() _into_ the wait_for_device_probe()
> function, since most callers really do seem to want it. The resume
> code already did it anyway, the hibernate code did not. Ho humm.
Thanks for the heads up!
If I remeber correctly, the reason why the resume code path does the
scsi_complete_async_scans() is that it didn't work without it at least for
some people, which was reported then.
The hibernate code path checks if the device is available to it when it
tries to save the image, which is after executing all of the "thaw"
callbacks.
On Wed, Jul 18, 2012 at 3:02 PM, Rafael J. Wysocki <rjw@sisk.pl> wrote: > > If I remeber correctly, the reason why the resume code path does the > scsi_complete_async_scans() is that it didn't work without it at least for > some people, which was reported then. > > The hibernate code path checks if the device is available to it when it > tries to save the image, which is after executing all of the "thaw" > callbacks. Actually, I didn't look closely enough. The hibernate code does the scsi_complete_async_scans() too, it just has the "resume_wait" hackery going on before that. And it's entirely possible that the resume_wait hackery is because it didn't always wait for scsi_complete_async_scans(). So I am more and more strongly starting to suspect that we should just move scsi_complete_async_scans() into wait_for_device_probe(), because just about every user wants it. Some users that don't do it (the IBM dasd drivers etc) look like they really should want it too, they just didn't understand that. There's some odd v4l user of wait_for_device_probe() too, but it won't hurt. So I'll attach a new test patch that does that instead. Linus Created attachment 75731 [details]
Alternate scsi_complete_async_scans() patch
This just moves the scsi_complete_async_scans() call into wait_for_device_probe(), since that's what basically all the callers of that function really expect anyway.
We can thus remove a number of calls to scsi_complete_async_scans(), and also one extraneous declaration of wait_for_device_probe() (there were two identical ones in <linux/device.h>
(In reply to comment #29) > Created an attachment (id=75731) [details] > Alternate scsi_complete_async_scans() patch This one works here too. "Tested by" and thank you! does this problem manifest itself as kernel not detecting certain block devices entirely? i am not sure if this is the bug report i am looking for, or should i post a new bug. i have ssd and 2tb hdd on my pc. and while -rc5 works fine, rc7 only detects the ssd. reverting to rc5 resumes normal behavior. both devices are sata on controller handled by ahci driver. On Thursday, July 19, 2012, Linus Torvalds wrote:
> On Wed, Jul 18, 2012 at 3:02 PM, Rafael J. Wysocki <rjw@sisk.pl> wrote:
> >
> > If I remeber correctly, the reason why the resume code path does the
> > scsi_complete_async_scans() is that it didn't work without it at least for
> > some people, which was reported then.
> >
> > The hibernate code path checks if the device is available to it when it
> > tries to save the image, which is after executing all of the "thaw"
> > callbacks.
>
> Actually, I didn't look closely enough. The hibernate code does the
> scsi_complete_async_scans() too, it just has the "resume_wait" hackery
> going on before that.
>
> And it's entirely possible that the resume_wait hackery is because it
> didn't always wait for scsi_complete_async_scans().
This was added because of MMC being too slow to get ready sometimes.
But yes, it looks like all callers of wait_for_device_probe() need the
scsi_complete_async_scans() anyway.
(In reply to comment #32) > On Thursday, July 19, 2012, Linus Torvalds wrote: > > On Wed, Jul 18, 2012 at 3:02 PM, Rafael J. Wysocki <rjw@sisk.pl> wrote: > > > > > > If I remeber correctly, the reason why the resume code path does the > > > scsi_complete_async_scans() is that it didn't work without it at least > for > > > some people, which was reported then. > > > > > > The hibernate code path checks if the device is available to it when it > > > tries to save the image, which is after executing all of the "thaw" > > > callbacks. > > > > Actually, I didn't look closely enough. The hibernate code does the > > scsi_complete_async_scans() too, it just has the "resume_wait" hackery > > going on before that. > > > > And it's entirely possible that the resume_wait hackery is because it > > didn't always wait for scsi_complete_async_scans(). > > This was added because of MMC being too slow to get ready sometimes. > > But yes, it looks like all callers of wait_for_device_probe() need the > scsi_complete_async_scans() anyway. Which is what the pending async patches also achieve since they make async_synchronize_full() include async scsi scanning work. This makes scsi_complete_async_scans() fully internal to scsi. (sorry I missed this bugzilla. I no longer have access to my intel.com account) On Tue, 17 Jul 2012 at 13:20, Paul E. McKenney wrote: > > That said, there's *another* patch that never got applied which was > > also talked about in a thread about that same "sd: limit the scope of > > the async probe domain" patch, see > > > > https://lkml.org/lkml/2012/7/2/32 > > > > so I'm adding Paul and Li Zhong to the cc here. Does that patch make > > any difference to you? > > The https://lkml.org/lkml/2012/7/2/32 fixes the problem in my testing. With today's -rc7, this patch is still needed to boot my powerpc[0] box. Will it be included before 3.5 gets released? Thanks, Christian. [0] https://lists.ozlabs.org/pipermail/linuxppc-dev/2012-July/098922.html http://nerdbynature.de/bits/3.5.0-rc5/soft_lockup/ (In reply to comment #34) > With today's -rc7, this patch is still needed to boot my powerpc[0] box. > Will it be included before 3.5 gets released? > Right, we at least needs Li Zhong's fix for the soft lockup if not the whole conversion to make async domains globally flushable by default [1]. -- Dan http://marc.info/?l=linux-kernel&m=134247994811985&w=2 Fixed in 3.5: Commit eea03c20ae38a55405c0865ed9adfccc400e4c8e Make wait_for_device_probe() also do scsi_complete_async_scans() On Fri, 20 Jul 2012 at 00:03, Christian Kujau wrote: > On Tue, 17 Jul 2012 at 13:20, Paul E. McKenney wrote: > > > That said, there's *another* patch that never got applied which was > > > also talked about in a thread about that same "sd: limit the scope of > > > the async probe domain" patch, see > > > > > > https://lkml.org/lkml/2012/7/2/32 > > > > > > so I'm adding Paul and Li Zhong to the cc here. Does that patch make > > > any difference to you? > > > > The https://lkml.org/lkml/2012/7/2/32 fixes the problem in my testing. > > With today's -rc7, this patch is still needed to boot my powerpc[0] box. > Will it be included before 3.5 gets released? Sadly it got not - which makes it the first kernel to be released which will not boot my beloved but somewhat old PowerBook :-( Can we please get Li's patch into mainline? Thank you, Christian. > [0] https://lists.ozlabs.org/pipermail/linuxppc-dev/2012-July/098922.html > http://nerdbynature.de/bits/3.5.0-rc5/soft_lockup/ At this (In reply to comment #37) > > > The https://lkml.org/lkml/2012/7/2/32 fixes the problem in my testing. > > > > With today's -rc7, this patch is still needed to boot my powerpc[0] box. > > Will it be included before 3.5 gets released? > > Sadly it got not - which makes it the first kernel to be released > which will not boot my beloved but somewhat old PowerBook :-( > > Can we please get Li's patch into mainline? > The async rework is pending acceptance into mainline: http://git.kernel.org/?p=linux/kernel/git/jejb/scsi.git;a=tag;h=72e9dc4b44d8 Once those commits land and bake for a bit I think it would be reasonable to request the fix for 3.5-stable. I.e. commit a4683487f90b "[SCSI] async: make async_synchronize_full() flush all work regardless of domain" with commit 2955b47d2c19 "[SCSI] async: introduce 'async_domain' type" as a pre-requisite. http://git.kernel.org/?p=linux/kernel/git/jejb/scsi.git;a=commit;h=2955b47d2c19 http://git.kernel.org/?p=linux/kernel/git/jejb/scsi.git;a=commit;h=a4683487f90b It's been merged and is in current -git right now. Please test, and if things are ok, I guess we should mark it for stable. ping?
On Tue, 24 Jul 2012 at 16:43, Christian Kujau wrote:
> On Fri, 20 Jul 2012 at 00:03, Christian Kujau wrote:
> > On Tue, 17 Jul 2012 at 13:20, Paul E. McKenney wrote:
> > > > That said, there's *another* patch that never got applied which was
> > > > also talked about in a thread about that same "sd: limit the scope of
> > > > the async probe domain" patch, see
> > > >
> > > > https://lkml.org/lkml/2012/7/2/32
> > > >
> > > > so I'm adding Paul and Li Zhong to the cc here. Does that patch make
> > > > any difference to you?
> > >
> > > The https://lkml.org/lkml/2012/7/2/32 fixes the problem in my testing.
> >
> > With today's -rc7, this patch is still needed to boot my powerpc[0] box.
> > Will it be included before 3.5 gets released?
>
> Sadly it got not - which makes it the first kernel to be released
> which will not boot my beloved but somewhat old PowerBook :-(
>
> Can we please get Li's patch into mainline?
>
> Thank you,
> Christian.
>
> > [0] https://lists.ozlabs.org/pipermail/linuxppc-dev/2012-July/098922.html
> > http://nerdbynature.de/bits/3.5.0-rc5/soft_lockup/
Christ, ping yourself! We asked you to test mainline. You didn't. On Mon, 30 Jul 2012 at 17:03, Christian Kujau wrote: > > ping? > I'm terribly sorry for the noise, I just saw the replies in #44771 (but not in my mailbox) and the the fixes are now in mainline. Unfortunately another bug keeps me from testing mainline on this box now so I'm still on 3.5 + Li's patch. Again, sorry for the noise & thanks to all involved for the fix. Christian. > On Tue, 24 Jul 2012 at 16:43, Christian Kujau wrote: > > On Fri, 20 Jul 2012 at 00:03, Christian Kujau wrote: > > > On Tue, 17 Jul 2012 at 13:20, Paul E. McKenney wrote: > > > > > That said, there's *another* patch that never got applied which was > > > > > also talked about in a thread about that same "sd: limit the scope of > > > > > the async probe domain" patch, see > > > > > > > > > > https://lkml.org/lkml/2012/7/2/32 > > > > > > > > > > so I'm adding Paul and Li Zhong to the cc here. Does that patch make > > > > > any difference to you? > > > > > > > > The https://lkml.org/lkml/2012/7/2/32 fixes the problem in my testing. > > > > > > With today's -rc7, this patch is still needed to boot my powerpc[0] box. > > > Will it be included before 3.5 gets released? > > > > Sadly it got not - which makes it the first kernel to be released > > which will not boot my beloved but somewhat old PowerBook :-( > > > > Can we please get Li's patch into mainline? > > > > Thank you, > > Christian. > > > > > [0] https://lists.ozlabs.org/pipermail/linuxppc-dev/2012-July/098922.html > > > http://nerdbynature.de/bits/3.5.0-rc5/soft_lockup/ > -- > BOFH excuse #246: > > It must have been the lightning storm we had (yesterday) (last week) (last > month) > |
Created attachment 75331 [details] Linux 3.5 .config I have a standard x86 PC, with a standard 1TB HDD partitioned using fdisk, i.e. I have a standard MS-DOS based partition layout. Kernel 3.4-rc6 fails to parse the partition table even though I explicitly enabled: CONFIG_PARTITION_ADVANCED=y CONFIG_MSDOS_PARTITION=y which wasn't necessary with kernels 3.3 and 3.4 which work fine here. # fdisk -l Disk /dev/sda: 1000.2 GB, 1000203804160 bytes 255 heads, 63 sectors/track, 121601 cylinders, total 1953523055 sectors Units = sectors of 1 * 512 = 512 bytes Sector size (logical/physical): 512 bytes / 512 bytes I/O size (minimum/optimal): 512 bytes / 512 bytes Disk identifier: 0xXXXXXXXX Device Boot Start End Blocks Id System /dev/sda1 * 63 XXXXXXXX XXXXXXXX 83 Linux /dev/sda2 XXXXXXXX XXXXXXXX XXXXXXXX 83 Linux ...