Created attachment 20839 [details] dmesg I've updated from 2.6.29-rc4 to current git, and after reboot (1) I do see disks attached to the PM connected to sata_sil24, but (2) they are not visible to the block layer, and (3) kernel complains that couple of async threads were blocked for more than 480 seconds... I'm not sure whether problem is with async code, or with the way SCSI host probe waits for async operations to complete, or with async_port_probe in SATA for ports != 0. As I see two threads blocked in async_port_probe, which nicely correlates with harddisks connected to port 1 & 2 of PM, I'll start with SATA. ata4: SATA link up 3.0 Gbps (SStatus 123 SControl 0) ata4.15: Port Multiplier 1.1, 0x1095:0x3726 r23, 6 ports, feat 0x1/0x9 ata4.00: hard resetting link ata4.00: SATA link up 3.0 Gbps (SStatus 123 SControl 320) ata4.01: hard resetting link ata4.01: SATA link up 3.0 Gbps (SStatus 123 SControl 300) ata4.02: hard resetting link ata4.02: SATA link up 3.0 Gbps (SStatus 123 SControl 300) ata4.03: hard resetting link ata4.03: SATA link down (SStatus 0 SControl 320) ata4.04: hard resetting link ata4.04: SATA link down (SStatus 0 SControl 320) ata4.05: hard resetting link ata4.05: SATA link up 1.5 Gbps (SStatus 113 SControl 320) ata4.00: ATA-7: ST3750640AS, 3.AAE, max UDMA/133 ata4.00: 1465149168 sectors, multi 0: LBA48 NCQ (depth 31/32) ata4.00: configured for UDMA/100 ata4.01: ATA-7: Hitachi HDS721010KLA330, GKAOA70F, max UDMA/133 ata4.01: 1953525168 sectors, multi 0: LBA48 NCQ (depth 31/32) ata4.01: configured for UDMA/100 ata4.02: ATA-7: Hitachi HDS721010KLA330, GKAOA70F, max UDMA/133 ata4.02: 1953525168 sectors, multi 0: LBA48 NCQ (depth 31/32) ata4.02: configured for UDMA/100 ata4: EH complete scsi: waiting for bus probes to complete ... ... INFO: task async/1:3416 blocked for more than 480 seconds. "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message. async/1 D 0000000000000000 0 3416 2 ffff88011dd97d40 0000000000000046 ffff88011dd97c70 0000000000000046 00000000001d33c0 000000000000e568 0000000000000000 ffff88011ea2c5a0 ffffffff807b34a0 ffff88011ea2c968 000000001dd97d00 00000000ffff9233 Call Trace: [<ffffffff805d25b5>] ? _spin_unlock_irqrestore+0x65/0x80 [<ffffffff805cf393>] schedule+0x13/0x40 [<ffffffff8026fde5>] async_synchronize_cookie_domain+0x75/0x130 [<ffffffff80268fa0>] ? autoremove_wake_function+0x0/0x40 [<ffffffff8027a83d>] ? trace_hardirqs_on+0xd/0x10 [<ffffffff8026feb0>] async_synchronize_cookie+0x10/0x20 [<ffffffff8026fed4>] async_synchronize_full+0x14/0x30 [<ffffffff804b98b9>] scsi_complete_async_scans+0x79/0x130 [<ffffffff804ba0b5>] __scsi_add_device+0xe5/0x110 [<ffffffff804d0ed2>] ata_scsi_scan_host+0xb2/0x1e0 [<ffffffff804cde07>] async_port_probe+0xb7/0xf0 [<ffffffff8026fc07>] async_thread+0x107/0x270 [<ffffffff8024b4c0>] ? default_wake_function+0x0/0x10 [<ffffffff8026fb00>] ? async_thread+0x0/0x270 [<ffffffff80268ad9>] kthread+0x49/0x80 [<ffffffff8020cfba>] child_rip+0xa/0x20 [<ffffffff80249f09>] ? finish_task_switch+0x89/0x110 [<ffffffff805d2526>] ? _spin_unlock_irq+0x36/0x60 [<ffffffff8020c980>] ? restore_args+0x0/0x30 [<ffffffff80268a90>] ? kthread+0x0/0x80 [<ffffffff8020cfb0>] ? child_rip+0x0/0x20 INFO: lockdep is turned off. INFO: task async/2:3417 blocked for more than 480 seconds. "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message. async/2 D 0000000000000001 0 3417 2 ffff88011dc8fe00 0000000000000046 0000000000000000 0000000000000046 00000000001d33c0 000000000000e568 0000000000000001 ffff88011ea28000 ffff88011dd245a0 ffff88011ea283c8 000000011dc8fdc0 0000000000000046 Call Trace: [<ffffffff805d25b5>] ? _spin_unlock_irqrestore+0x65/0x80 [<ffffffff805cf393>] schedule+0x13/0x40 [<ffffffff8026fde5>] async_synchronize_cookie_domain+0x75/0x130 [<ffffffff80268fa0>] ? autoremove_wake_function+0x0/0x40 [<ffffffff8026feb0>] async_synchronize_cookie+0x10/0x20 [<ffffffff804cde20>] async_port_probe+0xd0/0xf0 [<ffffffff8026fc07>] async_thread+0x107/0x270 [<ffffffff8024b4c0>] ? default_wake_function+0x0/0x10 [<ffffffff8026fb00>] ? async_thread+0x0/0x270 [<ffffffff80268ad9>] kthread+0x49/0x80 [<ffffffff8020cfba>] child_rip+0xa/0x20 [<ffffffff80249f09>] ? finish_task_switch+0x89/0x110 [<ffffffff805d2526>] ? _spin_unlock_irq+0x36/0x60 [<ffffffff8020c980>] ? restore_args+0x0/0x30 [<ffffffff80268a90>] ? kthread+0x0/0x80 [<ffffffff8020cfb0>] ? child_rip+0x0/0x20 INFO: lockdep is turned off. INFO: task async/3:3418 blocked for more than 480 seconds. "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message. async/3 D 0000000000000001 0 3418 2 ffff88011e44fe00 0000000000000046 0000000000000000 0000000000000046 00000000001d33c0 000000000000e568 0000000000000001 ffff88011dd245a0 ffff88011fa6c5a0 ffff88011dd24968 000000011e44fdc0 00000000ffff9237 Call Trace: [<ffffffff805d25b5>] ? _spin_unlock_irqrestore+0x65/0x80 [<ffffffff805cf393>] schedule+0x13/0x40 [<ffffffff8026fde5>] async_synchronize_cookie_domain+0x75/0x130 [<ffffffff80268fa0>] ? autoremove_wake_function+0x0/0x40 [<ffffffff8026feb0>] async_synchronize_cookie+0x10/0x20 [<ffffffff804cde20>] async_port_probe+0xd0/0xf0 [<ffffffff8026fc07>] async_thread+0x107/0x270 [<ffffffff8024b4c0>] ? default_wake_function+0x0/0x10 [<ffffffff8026fb00>] ? async_thread+0x0/0x270 [<ffffffff80268ad9>] kthread+0x49/0x80 [<ffffffff8020cfba>] child_rip+0xa/0x20 [<ffffffff80249f09>] ? finish_task_switch+0x89/0x110 [<ffffffff805d2526>] ? _spin_unlock_irq+0x36/0x60 [<ffffffff8020c980>] ? restore_args+0x0/0x30 [<ffffffff80268a90>] ? kthread+0x0/0x80 [<ffffffff8020cfb0>] ? child_rip+0x0/0x20 INFO: lockdep is turned off.
Created attachment 20840 [details] kernel config file
Reverting Arjan's commit 9710794383ee - 'async: remove the temporary (2.6.29) "async is off by default" code' "fixes" a problem.
can you boot with "initcall_debug" ? this will make the async code print more diagnostics...
Created attachment 20845 [details] dmesg with initcall_debug enabled Dmesg with debugging enabled attached.
Now when I look at initcall trace... sata_sil24 driver is built as a module, so it won't show in initcall_debug (except as far as async.c code is concerned).
Created attachment 20846 [details] dmesg with initcall_debug enabled This one comes from kernel which has SYSTEM_BOOTING in async.c removed... initcall sil24_init+0x0/0x20 [sata_sil24] returned 0 after 5174365 usecs calling 7_async_port_probe+0x0/0xf0 @ 3432 async_waiting @ 3432 calling 8_async_port_probe+0x0/0xf0 @ 3433 async_waiting @ 3433 calling 9_async_port_probe+0x0/0xf0 @ 3434 async_waiting @ 3434 ... async_continuing @ 3432 after 2059143 usec ... async_waiting @ 3432 async_continuing @ 3432 after 2 usec ... async_waiting @ 3432 ... INFO: task async/1:3432 blocked for more than 120 seconds. "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message. async/1 D 0000000000000000 0 3432 2
The interesting part is that this driver doesn't enable parallel scanning ;(
Created attachment 21078 [details] Hung async on 2.6.30-rc2 Rafael's script asked whether it still happens. Yes, it still dies unless fastboot patch is reverted. Only difference is that now box does not boot anymore - beside async/1,2,3 being D, also modprobes are now dead as they wait for async in sys_init_module, so box now even does not finish booting. And also khungtaskd now hits BUG about using smp_processor_id() in preemptible context :-(
bugzilla-daemon@bugzilla.kernel.org wrote: it would be interesting to do a quick check on -rc3; a potential source of deadlock in this stuff has been fixed there.
Aargh, I just did 'git pull' today afternoon :-( Yep, with your commit d4d5291c8cd499b1 box comes up nicely. sata_sil24 now complains about misclassified device it never complained before, but it seems to be a feature, and on retry it gets things right... ata3: softreset failed (timeout) ata3: SATA link up 1.5 Gbps (SStatus 113 SControl 0) ata3: link online but device misclassified, retrying ata3: SATA link up 1.5 Gbps (SStatus 113 SControl 0) ata3.00: ATA-6: Sans Digital V.36.A0B, V.36.A0B, max UDMA/133