Bug 44771

Summary: [REGRESSION] a7a20d103994fd760766e6c9d494daa569cbfe06 makes kernel 3.5 unbootable on an Intel chipset based motherboard
Product: IO/Storage Reporter: Artem S. Tashkinov (aros)
Component: Block LayerAssignee: Jens Axboe (axboe)
Status: RESOLVED CODE_FIX    
Severity: blocking CC: dan.j.williams, dan.j.williams, jgarzik, rjw, stern, torvalds, yoshi314
Priority: P1    
Hardware: All   
OS: Linux   
Kernel Version: 3.5-rc7 Subsystem:
Regression: Yes Bisected commit-id:
Attachments: Linux 3.5 .config
Linux 3.3 .config (working)
Image of panic on boot
dmesg under Linux 3.3 and lspci -vvv output
RESEND-Fix-a-dead-loop-in-async_synchronize_full.patch
dmesg with "rootdelay=10"
dmesg with "rootwait"
unbootable shot with timestamps
Trial patch
Alternate scsi_complete_async_scans() patch

Description Artem S. Tashkinov 2012-07-13 20:23:51 UTC
Created attachment 75331 [details]
Linux 3.5 .config

I have a standard x86 PC, with a standard 1TB HDD partitioned using fdisk, i.e. I have a standard MS-DOS based partition layout.

Kernel 3.4-rc6 fails to parse the partition table even though I explicitly enabled:

CONFIG_PARTITION_ADVANCED=y
CONFIG_MSDOS_PARTITION=y

which wasn't necessary with kernels 3.3 and 3.4 which work fine here.

# fdisk -l

Disk /dev/sda: 1000.2 GB, 1000203804160 bytes
255 heads, 63 sectors/track, 121601 cylinders, total 1953523055 sectors
Units = sectors of 1 * 512 = 512 bytes
Sector size (logical/physical): 512 bytes / 512 bytes
I/O size (minimum/optimal): 512 bytes / 512 bytes
Disk identifier: 0xXXXXXXXX

   Device Boot      Start         End      Blocks   Id  System
/dev/sda1   *          63    XXXXXXXX    XXXXXXXX   83  Linux
/dev/sda2        XXXXXXXX    XXXXXXXX    XXXXXXXX   83  Linux
...
Comment 1 Artem S. Tashkinov 2012-07-13 20:24:16 UTC
Created attachment 75341 [details]
Linux 3.3 .config (working)
Comment 2 Artem S. Tashkinov 2012-07-13 20:31:42 UTC
Created attachment 75351 [details]
Image of panic on boot

VFS: Cannot open root device "sda2" or unknown-block(0,0): error -6
Please append a correct "root=" boot option; here are the available partitions: <EMPTY LIST>
Kernel panic - not syncing: VFS: Unable to mount root fs on unknown-block(0,0)
Comment 3 Artem S. Tashkinov 2012-07-15 09:09:01 UTC
Seems to fixed in 3.5.0-rc7.

Weird.
Comment 4 Artem S. Tashkinov 2012-07-15 18:54:16 UTC
Wow, wow, wow!

It's a perfect example of non-deterministic kernel behavior.

This morning I could boot my PC into this kernel and everything worked fine.

I now tried to boot again after powering off my computer (not completely - it was still plugged into the outlet) and it again cannot read the partition table thus the kernel doesn't boot.
Comment 5 Artem S. Tashkinov 2012-07-17 09:25:59 UTC
Don't what what fluke was that, but I was able to boot into that kernel just once.
Comment 6 Artem S. Tashkinov 2012-07-17 09:31:15 UTC
Created attachment 75561 [details]
dmesg under Linux 3.3 and lspci -vvv output
Comment 7 Artem S. Tashkinov 2012-07-17 19:06:06 UTC
git bisect bad
a7a20d103994fd760766e6c9d494daa569cbfe06 is the first bad commit
commit a7a20d103994fd760766e6c9d494daa569cbfe06
Author: Dan Williams <dan.j.williams@intel.com>
Date:   Thu Mar 22 17:05:11 2012 -0700

    [SCSI] sd: limit the scope of the async probe domain

    sd injects and synchronizes probe work on the global kernel-wide domain.
    This runs into conflict with PM that wants to perform resume actions in
    async context:

    [  494.237079] INFO: task kworker/u:3:554 blocked for more than 120 seconds.
    [  494.294396] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
    [  494.360809] kworker/u:3     D 0000000000000000     0   554      2 0x00000000
    [  494.420739]  ffff88012e4d3af0 0000000000000046 ffff88013200c160 ffff88012e4d3fd8
    [  494.484392]  ffff88012e4d3fd8 0000000000012500 ffff8801394ea0b0 ffff88013200c160
    [  494.548038]  ffff88012e4d3ae0 00000000000001e3 ffffffff81a249e0 ffff8801321c5398
    [  494.611685] Call Trace:
    [  494.632649]  [<ffffffff8149dd25>] schedule+0x5a/0x5c
    [  494.674687]  [<ffffffff8104b968>] async_synchronize_cookie_domain+0xb6/0x112
    [  494.734177]  [<ffffffff810461ff>] ? __init_waitqueue_head+0x50/0x50
    [  494.787134]  [<ffffffff8131a224>] ? scsi_remove_target+0x48/0x48
    [  494.837900]  [<ffffffff8104b9d9>] async_synchronize_cookie+0x15/0x17
    [  494.891567]  [<ffffffff8104ba49>] async_synchronize_full+0x54/0x70  <-- here we wait for async contexts to complete
    [  494.943783]  [<ffffffff8104b9f5>] ? async_synchronize_full_domain+0x1a/0x1a
    [  495.002547]  [<ffffffffa00114b1>] sd_remove+0x2c/0xa2 [sd_mod]
    [  495.051861]  [<ffffffff812fe94f>] __device_release_driver+0x86/0xcf
    [  495.104807]  [<ffffffff812fe9bd>] device_release_driver+0x25/0x32  <-- here we take device_lock()

    [  853.511341] INFO: task kworker/u:4:549 blocked for more than 120 seconds.
    [  853.568693] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
    [  853.635119] kworker/u:4     D ffff88013097b5d0     0   549      2 0x00000000
    [  853.695129]  ffff880132773c40 0000000000000046 ffff880130790000 ffff880132773fd8
    [  853.758990]  ffff880132773fd8 0000000000012500 ffff88013288a0b0 ffff880130790000
    [  853.822796]  0000000000000246 0000000000000040 ffff88013097b5c8 ffff880130790000
    [  853.886633] Call Trace:
    [  853.907631]  [<ffffffff8149dd25>] schedule+0x5a/0x5c
    [  853.949670]  [<ffffffff8149cc44>] __mutex_lock_common+0x220/0x351
    [  854.001225]  [<ffffffff81304bd7>] ? device_resume+0x58/0x1c4
    [  854.049082]  [<ffffffff81304bd7>] ? device_resume+0x58/0x1c4
    [  854.097011]  [<ffffffff8149ce48>] mutex_lock_nested+0x2f/0x36   <-- here we wait for device_lock()
    [  854.145591]  [<ffffffff81304bd7>] device_resume+0x58/0x1c4
    [  854.192066]  [<ffffffff81304d61>] async_resume+0x1e/0x45
    [  854.237019]  [<ffffffff8104bc93>] async_run_entry_fn+0xc6/0x173  <-- ...while running in async context

    Provide a 'scsi_sd_probe_domain' so that async probe actions actions can
    be flushed without regard for the state of PM, and allow for the resume
    path to handle devices that have transitioned from SDEV_QUIESCE to
    SDEV_DEL prior to resume.

    Acked-by: Alan Stern <stern@rowland.harvard.edu>
    [alan: uplevel scsi_sd_probe_domain, clarify scsi_device_resume]
    Signed-off-by: Dan Williams <dan.j.williams@intel.com>
    [jejb: remove unneeded config guards in include file]
    Signed-off-by: James Bottomley <JBottomley@Parallels.com>

:040000 040000 4e59ccb852f261f97701a245e637a690dfce9d20 fc73ca0da1288a7f30b81a8593ddad2146d7bfb5 M      drivers
Comment 8 Artem S. Tashkinov 2012-07-17 19:10:10 UTC
Just in case here's a complete `git bisect` history:

$ git bisect log
git bisect start
# bad: [bd0a521e88aa7a06ae7aabaed7ae196ed4ad867a] Linux 3.5-rc6
git bisect bad bd0a521e88aa7a06ae7aabaed7ae196ed4ad867a
# good: [76e10d158efb6d4516018846f60c2ab5501900bc] Linux 3.4
git bisect good 76e10d158efb6d4516018846f60c2ab5501900bc
# bad: [05f144a0d5c2207a0349348127f996e104ad7404] mm: mempolicy: Let vma_merge and vma_split handle vma->vm_policy linkages
git bisect bad 05f144a0d5c2207a0349348127f996e104ad7404
# bad: [8dca6010d44cc722a94dc6da96560f9083dac782] Merge tag 'fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/arm/arm-soc
git bisect bad 8dca6010d44cc722a94dc6da96560f9083dac782
# good: [cb62ab71fe2b16e8203a0f0a2ef4eda23d761338] Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net-next
git bisect good cb62ab71fe2b16e8203a0f0a2ef4eda23d761338
# good: [ba01a87e37d3ca9efe141e2907c2ec3f89490b4f] Merge branch 'usb-target-merge' of git://git.kernel.org/pub/scm/linux/kernel/git/nab/target-pending
git bisect good ba01a87e37d3ca9efe141e2907c2ec3f89490b4f
# bad: [226da0dbc84ed97f448523e2a4cb91c27fa68ed9] Merge branch 'core-rcu-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
git bisect bad 226da0dbc84ed97f448523e2a4cb91c27fa68ed9
# bad: [da4f58ffa08a7b7012fab9c205fa0f6ba40fec42] Merge tag 'scsi-misc' of git://git.kernel.org/pub/scm/linux/kernel/git/jejb/scsi
git bisect bad da4f58ffa08a7b7012fab9c205fa0f6ba40fec42
# bad: [794c10fa0fa4d1781c5651c31e3d4d0b71629128] [SCSI] sg: remove while (1) non-loop
git bisect bad 794c10fa0fa4d1781c5651c31e3d4d0b71629128
# good: [42e22cac4e57f3e0b4b631c9489effe97f7d7d6c] [SCSI] storvsc: Properly handle errors from the host
git bisect good 42e22cac4e57f3e0b4b631c9489effe97f7d7d6c
# good: [89a342ca6bfc1a014ff50cce5659abc58e11ecfc] [SCSI] scsi_transport_spi: fix for unbalanced reference counting
git bisect good 89a342ca6bfc1a014ff50cce5659abc58e11ecfc
# good: [6cba3f1972de14421ef4cf4b46a15cc5d604e791] [SCSI] hpsa: do aborts two ways
git bisect good 6cba3f1972de14421ef4cf4b46a15cc5d604e791
# good: [d82357eaaa4c9b9cb16cbc1b95cb015801506a33] [SCSI] hpsa: add new RAID level "1(ADM)"
git bisect good d82357eaaa4c9b9cb16cbc1b95cb015801506a33
# good: [e85c59746957fd6e3595d02cf614370056b5816e] [SCSI] hpsa: dial down lockup detection during firmware flash
git bisect good e85c59746957fd6e3595d02cf614370056b5816e
# bad: [dddbf8d908e89afb14a49502f568d2a2b814436d] [SCSI] sg: remove unnecessary indentation
git bisect bad dddbf8d908e89afb14a49502f568d2a2b814436d
# bad: [a7a20d103994fd760766e6c9d494daa569cbfe06] [SCSI] sd: limit the scope of the async probe domain
git bisect bad a7a20d103994fd760766e6c9d494daa569cbfe06
Comment 9 Artem S. Tashkinov 2012-07-17 19:15:46 UTC
Wow, I'm not the first person to hit this problem:

http://help.lockergnome.com/linux/SCSI-sd-limit-scope-async-probe-domain-breaks-booting--ftopict555976.html

> Dudes,

> so I've been testing latest linus
> (731a7378b81c2f5fa88ca1ae20b83d548d5613dc) here and my box fails booting
> because it can't find the root partition, see message below.
> 
> I did a bisect run (also below) and pointed me to the first bad commit
> (see below too).
> 
> Reverting the commit in question fixes booting.
> 
> Let me know what other info you'd need. 
> 
> Posted: Wed May 30, 2012 2:10 pm
>
> Borislav Petkov

Yet another person:

https://lkml.org/lkml/2012/5/25/271

> Dan Williams
> 
> Ok, I'll take a look.
> 
> Thanks for the help!

And no fix whatsoever.
Comment 10 Linus Torvalds 2012-07-17 19:49:56 UTC
This *should* have been fixed by commit 43a8d39d0137 ("fix async probe regression").

It may have been an incomplete fix, or you're actually seeing a different bug (and then bisection mixes the bugs up).
Comment 11 Artem S. Tashkinov 2012-07-17 19:56:29 UTC
(In reply to comment #10)
> This *should* have been fixed by commit 43a8d39d0137 ("fix async probe
> regression").
> 
> It may have been an incomplete fix, or you're actually seeing a different bug
> (and then bisection mixes the bugs up).

If kernel 3.5-rc7 contains this fix, then this issue is not fixed. And it looks like the LKML discussion about this bug yielded no positive resolution.
Comment 12 Linus Torvalds 2012-07-17 20:08:30 UTC
On Tue, Jul 17, 2012 at 12:56 PM,  <bugzilla-daemon@bugzilla.kernel.org> wrote:
>
> If kernel 3.5-rc7 contains this fix, then this issue is not fixed. And it
> looks
> like the LKML discussion about this bug yielded no positive resolution.

Sure it did, just in another thread.

See the thread

  [PATCH v2 0/4] fix / cleanup async scsi scanning

instead. Where Borislav and Meelis Roos both said it fixed things for them.

That said, there's this:

   http://marc.info/?l=linux-kernel&m=133833942517359&w=2

which looks really odd. And may indicate that there is still something
else going on. But Walt never replied to Dan's quesiton, afaik.

That said, there's *another* patch that never got applied which was
also talked about in a thread about that same "sd: limit the scope of
the async probe domain" patch, see

    https://lkml.org/lkml/2012/7/2/32

so I'm adding Paul and Li Zhong to the cc here. Does that patch make
any difference to you?

             Linus
Comment 13 Paul E. McKenney 2012-07-17 20:21:09 UTC
On Tue, Jul 17, 2012 at 01:08:06PM -0700, Linus Torvalds wrote:
> On Tue, Jul 17, 2012 at 12:56 PM,  <bugzilla-daemon@bugzilla.kernel.org>
> wrote:
> >
> > If kernel 3.5-rc7 contains this fix, then this issue is not fixed. And it
> looks
> > like the LKML discussion about this bug yielded no positive resolution.
> 
> Sure it did, just in another thread.
> 
> See the thread
> 
>   [PATCH v2 0/4] fix / cleanup async scsi scanning
> 
> instead. Where Borislav and Meelis Roos both said it fixed things for them.
> 
> That said, there's this:
> 
>    http://marc.info/?l=linux-kernel&m=133833942517359&w=2
> 
> which looks really odd. And may indicate that there is still something
> else going on. But Walt never replied to Dan's quesiton, afaik.
> 
> That said, there's *another* patch that never got applied which was
> also talked about in a thread about that same "sd: limit the scope of
> the async probe domain" patch, see
> 
>     https://lkml.org/lkml/2012/7/2/32
> 
> so I'm adding Paul and Li Zhong to the cc here. Does that patch make
> any difference to you?

The https://lkml.org/lkml/2012/7/2/32 fixes the problem in my testing.

							Thanx, Paul
Comment 14 Artem S. Tashkinov 2012-07-17 20:24:11 UTC
Created attachment 75591 [details]
RESEND-Fix-a-dead-loop-in-async_synchronize_full.patch

(In reply to comment #12)
> On Tue, Jul 17, 2012 at 12:56 PM,  <bugzilla-daemon@bugzilla.kernel.org>
> wrote:
> >
> > If kernel 3.5-rc7 contains this fix, then this issue is not fixed. And it
> looks
> > like the LKML discussion about this bug yielded no positive resolution.
> 
> Sure it did, just in another thread.
> 
> See the thread
> 
>   [PATCH v2 0/4] fix / cleanup async scsi scanning
> 
> instead. Where Borislav and Meelis Roos both said it fixed things for them.
> 
> That said, there's this:
> 
>    http://marc.info/?l=linux-kernel&m=133833942517359&w=2
> 
> which looks really odd. And may indicate that there is still something
> else going on. But Walt never replied to Dan's quesiton, afaik.
> 
> That said, there's *another* patch that never got applied which was
> also talked about in a thread about that same "sd: limit the scope of
> the async probe domain" patch, see
> 
>     https://lkml.org/lkml/2012/7/2/32
> 
> so I'm adding Paul and Li Zhong to the cc here. Does that patch make
> any difference to you?
> 
>              Linus

I've applied it on top of 3.5-rc7:

# patch -b -p1 < RESEND-Fix-a-dead-loop-in-async_synchronize_full.patch
patching file kernel/async.c
Hunk #1 succeeded at 86 with fuzz 1.

The issue persists.
Comment 15 Alan Stern 2012-07-18 15:25:30 UTC
Artem, what happens if you enable CONFIG_PRINTK_TIME and add "rootdelay=10" to the boot command line?  If the system still fails to boot, please attach another screenshot.
Comment 16 Artem S. Tashkinov 2012-07-18 15:39:00 UTC
(In reply to comment #15)
> Artem, what happens if you enable CONFIG_PRINTK_TIME and add "rootdelay=10"
> to
> the boot command line?  If the system still fails to boot, please attach
> another screenshot.

Should I check it with or without Fix-a-dead-loop-in-async_synchronize_full.patch?

I presume you want me to run Linux 3.5-rc7, right?
Comment 17 Alan Stern 2012-07-18 16:09:12 UTC
I don't know if that patch will make any difference.  Try it both ways if you want.  Yes, use 3.5-rc7.
Comment 18 Artem S. Tashkinov 2012-07-18 18:41:55 UTC
Created attachment 75661 [details]
dmesg with "rootdelay=10"

(In reply to comment #15)
> Artem, what happens if you enable CONFIG_PRINTK_TIME and add "rootdelay=10"
> to
> the boot command line?  If the system still fails to boot, please attach
> another screenshot.

The system now boots successfully even without this patch albeit with a ten seconds delay.
Comment 19 Alan Stern 2012-07-18 18:56:15 UTC
Okay, that's progress.  Next, can you attach a screenshot showing the hang when you boot the same kernel without the "rootdelay=10" option?  Comparison of the timestamps may turn out to be helpful.
Comment 20 Alan Stern 2012-07-18 19:02:05 UTC
One more thing.  What happens with that same kernel when you specify "rootwait" instead of "rootdelay=10"?
Comment 21 Artem S. Tashkinov 2012-07-18 19:24:41 UTC
Created attachment 75671 [details]
dmesg with "rootwait"

(In reply to comment #20)
> One more thing.  What happens with that same kernel when you specify
> "rootwait"
> instead of "rootdelay=10"?

rootwait is also perfectly usable.
Comment 22 Artem S. Tashkinov 2012-07-18 19:29:24 UTC
Created attachment 75681 [details]
unbootable shot with timestamps

(In reply to comment #19)
> Okay, that's progress.  Next, can you attach a screenshot showing the hang
> when
> you boot the same kernel without the "rootdelay=10" option?  Comparison of
> the
> timestamps may turn out to be helpful.

Here it is.
Comment 23 Linus Torvalds 2012-07-18 20:30:33 UTC
Created attachment 75721 [details]
Trial patch

Ugh. There's something wrong in our async handling. 

wait_for_device_probe() does an "async_synchronize_full()" because it doesn't want any async stuff pending.

But despite the name, "async_synchronize_full()" really only synchronizes the "global domain". And commit a7a20d103994 very much on purpose uses a different async domain for this. So wait_for_device_probe() basically does not do what it is supposed to do.

And commit 43a8d39d0137 really only fixed the case of people using the scsi_wait_scan module, but that is only useful for people who mount an initrd that loads that module. Artem doesn't do that.

So we really need to fix wait_for_device_probe() itself, methinks.  Perhaps like this patch.

Alternatively, we should just put the scsi_complete_async_scans() into wait_for_device_probe() itself. Comments?
Comment 24 Linus Torvalds 2012-07-18 20:34:41 UTC
Ok, I attached a patch to the bugzilla entry that may or may not fix this.

The wait_for_device_probe() cases seem to need
"scsi_complete_async_scans()".  I wonder if we should just put that
call to scsi_complete_async_scans() _into_ the wait_for_device_probe()
function, since most callers really do seem to want it. The resume
code already did it anyway, the hibernate code did not. Ho humm.

                   Linus
Comment 25 Alan Stern 2012-07-18 21:08:04 UTC
Note: James isn't on the CC list for this bug report.

If Dan Williams' async changes go into -stable, they will fix the problem.  Otherwise something like your patch is needed.  James is the person to ask.
Comment 26 Artem S. Tashkinov 2012-07-18 21:12:24 UTC
(In reply to comment #23)
> Created an attachment (id=75721) [details]
> Trial patch

I have tested the patch and it fixes this issue.
Comment 27 Rafael J. Wysocki 2012-07-18 21:56:32 UTC
On Wednesday, July 18, 2012, Linus Torvalds wrote:
> Ok, I attached a patch to the bugzilla entry that may or may not fix this.
> 
> The wait_for_device_probe() cases seem to need
> "scsi_complete_async_scans()".  I wonder if we should just put that
> call to scsi_complete_async_scans() _into_ the wait_for_device_probe()
> function, since most callers really do seem to want it. The resume
> code already did it anyway, the hibernate code did not. Ho humm.

Thanks for the heads up!

If I remeber correctly, the reason why the resume code path does the
scsi_complete_async_scans() is that it didn't work without it at least for
some people, which was reported then.

The hibernate code path checks if the device is available to it when it
tries to save the image, which is after executing all of the "thaw"
callbacks.
Comment 28 Linus Torvalds 2012-07-18 22:33:56 UTC
On Wed, Jul 18, 2012 at 3:02 PM, Rafael J. Wysocki <rjw@sisk.pl> wrote:
>
> If I remeber correctly, the reason why the resume code path does the
> scsi_complete_async_scans() is that it didn't work without it at least for
> some people, which was reported then.
>
> The hibernate code path checks if the device is available to it when it
> tries to save the image, which is after executing all of the "thaw"
> callbacks.

Actually, I didn't look closely enough. The hibernate code does the
scsi_complete_async_scans() too, it just has the "resume_wait" hackery
going on before that.

And it's entirely possible that the resume_wait hackery is because it
didn't always wait for scsi_complete_async_scans().

So I am more and more strongly starting to suspect that we should just
move scsi_complete_async_scans() into wait_for_device_probe(), because
just about every user wants it. Some users that don't do it (the IBM
dasd drivers etc) look like they really should want it too, they just
didn't understand that.

There's some odd v4l user of wait_for_device_probe() too, but it won't
hurt. So I'll attach a new test patch that does that instead.

                Linus
Comment 29 Linus Torvalds 2012-07-18 22:36:04 UTC
Created attachment 75731 [details]
Alternate scsi_complete_async_scans() patch

This just moves the scsi_complete_async_scans() call into wait_for_device_probe(), since that's what basically all the callers of that function really expect anyway. 

We can thus remove a number of calls to scsi_complete_async_scans(), and also one extraneous declaration of wait_for_device_probe() (there were two identical ones in <linux/device.h>
Comment 30 Artem S. Tashkinov 2012-07-19 07:07:38 UTC
(In reply to comment #29)
> Created an attachment (id=75731) [details]
> Alternate scsi_complete_async_scans() patch

This one works here too. "Tested by" and thank you!
Comment 31 kowalski marcin 2012-07-19 07:27:52 UTC
does this problem manifest itself as kernel not detecting certain block devices entirely? i am not sure if this is the bug report i am looking for, or should i post a new bug.

i have ssd and 2tb hdd on my pc. and while -rc5 works fine, rc7 only detects the ssd.  reverting to rc5 resumes normal behavior.

both devices are sata on controller handled by ahci driver.
Comment 32 Rafael J. Wysocki 2012-07-19 08:56:19 UTC
On Thursday, July 19, 2012, Linus Torvalds wrote:
> On Wed, Jul 18, 2012 at 3:02 PM, Rafael J. Wysocki <rjw@sisk.pl> wrote:
> >
> > If I remeber correctly, the reason why the resume code path does the
> > scsi_complete_async_scans() is that it didn't work without it at least for
> > some people, which was reported then.
> >
> > The hibernate code path checks if the device is available to it when it
> > tries to save the image, which is after executing all of the "thaw"
> > callbacks.
> 
> Actually, I didn't look closely enough. The hibernate code does the
> scsi_complete_async_scans() too, it just has the "resume_wait" hackery
> going on before that.
> 
> And it's entirely possible that the resume_wait hackery is because it
> didn't always wait for scsi_complete_async_scans().

This was added because of MMC being too slow to get ready sometimes.

But yes, it looks like all callers of wait_for_device_probe() need the
scsi_complete_async_scans() anyway.
Comment 33 Dan Williams 2012-07-19 14:34:48 UTC
(In reply to comment #32)
> On Thursday, July 19, 2012, Linus Torvalds wrote:
> > On Wed, Jul 18, 2012 at 3:02 PM, Rafael J. Wysocki <rjw@sisk.pl> wrote:
> > >
> > > If I remeber correctly, the reason why the resume code path does the
> > > scsi_complete_async_scans() is that it didn't work without it at least
> for
> > > some people, which was reported then.
> > >
> > > The hibernate code path checks if the device is available to it when it
> > > tries to save the image, which is after executing all of the "thaw"
> > > callbacks.
> > 
> > Actually, I didn't look closely enough. The hibernate code does the
> > scsi_complete_async_scans() too, it just has the "resume_wait" hackery
> > going on before that.
> > 
> > And it's entirely possible that the resume_wait hackery is because it
> > didn't always wait for scsi_complete_async_scans().
> 
> This was added because of MMC being too slow to get ready sometimes.
> 
> But yes, it looks like all callers of wait_for_device_probe() need the
> scsi_complete_async_scans() anyway.

Which is what the pending async patches also achieve since they make async_synchronize_full() include async scsi scanning work.  This makes scsi_complete_async_scans() fully internal to scsi.

(sorry I missed this bugzilla.  I no longer have access to my intel.com account)
Comment 34 Christian Kujau 2012-07-20 07:03:55 UTC
On Tue, 17 Jul 2012 at 13:20, Paul E. McKenney wrote:
> > That said, there's *another* patch that never got applied which was
> > also talked about in a thread about that same "sd: limit the scope of
> > the async probe domain" patch, see
> > 
> >     https://lkml.org/lkml/2012/7/2/32
> > 
> > so I'm adding Paul and Li Zhong to the cc here. Does that patch make
> > any difference to you?
> 
> The https://lkml.org/lkml/2012/7/2/32 fixes the problem in my testing.

With today's -rc7, this patch is still needed to boot my powerpc[0] box. 
Will it be included before 3.5 gets released?

Thanks,
Christian.

[0] https://lists.ozlabs.org/pipermail/linuxppc-dev/2012-July/098922.html
    http://nerdbynature.de/bits/3.5.0-rc5/soft_lockup/
Comment 35 Dan Williams 2012-07-20 15:49:18 UTC
(In reply to comment #34)
> With today's -rc7, this patch is still needed to boot my powerpc[0] box. 
> Will it be included before 3.5 gets released?
> 

Right, we at least needs Li Zhong's fix for the soft lockup if not the whole conversion to make async domains globally flushable by default [1].

--
Dan

http://marc.info/?l=linux-kernel&m=134247994811985&w=2
Comment 36 Artem S. Tashkinov 2012-07-22 12:13:33 UTC
Fixed in 3.5:

Commit eea03c20ae38a55405c0865ed9adfccc400e4c8e

Make wait_for_device_probe() also do scsi_complete_async_scans()
Comment 37 Christian Kujau 2012-07-24 23:44:04 UTC
On Fri, 20 Jul 2012 at 00:03, Christian Kujau wrote:
> On Tue, 17 Jul 2012 at 13:20, Paul E. McKenney wrote:
> > > That said, there's *another* patch that never got applied which was
> > > also talked about in a thread about that same "sd: limit the scope of
> > > the async probe domain" patch, see
> > > 
> > >     https://lkml.org/lkml/2012/7/2/32
> > > 
> > > so I'm adding Paul and Li Zhong to the cc here. Does that patch make
> > > any difference to you?
> > 
> > The https://lkml.org/lkml/2012/7/2/32 fixes the problem in my testing.
> 
> With today's -rc7, this patch is still needed to boot my powerpc[0] box. 
> Will it be included before 3.5 gets released?

Sadly it got not - which makes it the first kernel to be released 
which will not boot my beloved but somewhat old PowerBook :-(

Can we please get Li's patch into mainline? 

Thank you,
Christian.

> [0] https://lists.ozlabs.org/pipermail/linuxppc-dev/2012-July/098922.html
>     http://nerdbynature.de/bits/3.5.0-rc5/soft_lockup/
Comment 38 Dan Williams 2012-07-25 00:51:28 UTC
At this (In reply to comment #37)
> > > The https://lkml.org/lkml/2012/7/2/32 fixes the problem in my testing.
> > 
> > With today's -rc7, this patch is still needed to boot my powerpc[0] box. 
> > Will it be included before 3.5 gets released?
> 
> Sadly it got not - which makes it the first kernel to be released 
> which will not boot my beloved but somewhat old PowerBook :-(
> 
> Can we please get Li's patch into mainline? 
> 

The async rework is pending acceptance into mainline:

http://git.kernel.org/?p=linux/kernel/git/jejb/scsi.git;a=tag;h=72e9dc4b44d8

Once those commits land and bake for a bit I think it would be reasonable to request the fix for 3.5-stable.  I.e. commit a4683487f90b "[SCSI] async: make async_synchronize_full() flush all work regardless of domain"  with commit 2955b47d2c19 "[SCSI] async: introduce 'async_domain' type" as a pre-requisite.

http://git.kernel.org/?p=linux/kernel/git/jejb/scsi.git;a=commit;h=2955b47d2c19
http://git.kernel.org/?p=linux/kernel/git/jejb/scsi.git;a=commit;h=a4683487f90b
Comment 39 Linus Torvalds 2012-07-25 01:17:42 UTC
It's been merged and is in current -git right now. Please test, and if things are ok, I guess we should mark it for stable.
Comment 40 Christian Kujau 2012-07-31 00:03:09 UTC
ping?

On Tue, 24 Jul 2012 at 16:43, Christian Kujau wrote:
> On Fri, 20 Jul 2012 at 00:03, Christian Kujau wrote:
> > On Tue, 17 Jul 2012 at 13:20, Paul E. McKenney wrote:
> > > > That said, there's *another* patch that never got applied which was
> > > > also talked about in a thread about that same "sd: limit the scope of
> > > > the async probe domain" patch, see
> > > > 
> > > >     https://lkml.org/lkml/2012/7/2/32
> > > > 
> > > > so I'm adding Paul and Li Zhong to the cc here. Does that patch make
> > > > any difference to you?
> > > 
> > > The https://lkml.org/lkml/2012/7/2/32 fixes the problem in my testing.
> > 
> > With today's -rc7, this patch is still needed to boot my powerpc[0] box. 
> > Will it be included before 3.5 gets released?
> 
> Sadly it got not - which makes it the first kernel to be released 
> which will not boot my beloved but somewhat old PowerBook :-(
> 
> Can we please get Li's patch into mainline? 
> 
> Thank you,
> Christian.
> 
> > [0] https://lists.ozlabs.org/pipermail/linuxppc-dev/2012-July/098922.html
> >     http://nerdbynature.de/bits/3.5.0-rc5/soft_lockup/
Comment 41 Linus Torvalds 2012-07-31 00:31:11 UTC
Christ, ping yourself! We asked you to test mainline. You didn't.
Comment 42 Christian Kujau 2012-07-31 05:20:30 UTC
On Mon, 30 Jul 2012 at 17:03, Christian Kujau wrote:
> 
> ping?
>

I'm terribly sorry for the noise, I just saw the replies in #44771 (but 
not in my mailbox) and the the fixes are now in mainline. Unfortunately
another bug keeps me from testing mainline on this box now so I'm
still on 3.5 + Li's patch.

Again, sorry for the noise & thanks to all involved for the fix.

Christian.

> On Tue, 24 Jul 2012 at 16:43, Christian Kujau wrote:
> > On Fri, 20 Jul 2012 at 00:03, Christian Kujau wrote:
> > > On Tue, 17 Jul 2012 at 13:20, Paul E. McKenney wrote:
> > > > > That said, there's *another* patch that never got applied which was
> > > > > also talked about in a thread about that same "sd: limit the scope of
> > > > > the async probe domain" patch, see
> > > > > 
> > > > >     https://lkml.org/lkml/2012/7/2/32
> > > > > 
> > > > > so I'm adding Paul and Li Zhong to the cc here. Does that patch make
> > > > > any difference to you?
> > > > 
> > > > The https://lkml.org/lkml/2012/7/2/32 fixes the problem in my testing.
> > > 
> > > With today's -rc7, this patch is still needed to boot my powerpc[0] box. 
> > > Will it be included before 3.5 gets released?
> > 
> > Sadly it got not - which makes it the first kernel to be released 
> > which will not boot my beloved but somewhat old PowerBook :-(
> > 
> > Can we please get Li's patch into mainline? 
> > 
> > Thank you,
> > Christian.
> > 
> > > [0] https://lists.ozlabs.org/pipermail/linuxppc-dev/2012-July/098922.html
> > >     http://nerdbynature.de/bits/3.5.0-rc5/soft_lockup/
> -- 
> BOFH excuse #246:
> 
> It must have been the lightning storm we had (yesterday) (last week) (last
> month)
>