Bug 13031

Summary: Deadlock/hang in SATA probe
Product: IO/Storage Reporter: Petr Vandrovec (petr)
Component: Serial ATAAssignee: Jeff Garzik (jgarzik)
Status: CLOSED CODE_FIX    
Severity: normal CC: arjan, rjw
Priority: P1    
Hardware: All   
OS: Linux   
Kernel Version: 2.6.29-ge0724bf6 Subsystem:
Regression: Yes Bisected commit-id:
Bug Depends on:    
Bug Blocks: 13070    
Attachments: dmesg
kernel config file
dmesg with initcall_debug enabled
dmesg with initcall_debug enabled
Hung async on 2.6.30-rc2

Description Petr Vandrovec 2009-04-06 23:33:35 UTC
Created attachment 20839 [details]
dmesg

I've updated from 2.6.29-rc4 to current git, and after reboot (1) I do see disks attached to the PM connected to sata_sil24, but (2) they are not visible to the block layer, and (3) kernel complains that couple of async threads were blocked for more than 480 seconds...

I'm not sure whether problem is with async code, or with the way SCSI host probe waits for async operations to complete, or with async_port_probe in SATA for ports != 0.  As I see two threads blocked in async_port_probe, which nicely correlates with harddisks connected to port 1 & 2 of PM, I'll start with SATA.

ata4: SATA link up 3.0 Gbps (SStatus 123 SControl 0)
ata4.15: Port Multiplier 1.1, 0x1095:0x3726 r23, 6 ports, feat 0x1/0x9
ata4.00: hard resetting link
ata4.00: SATA link up 3.0 Gbps (SStatus 123 SControl 320)
ata4.01: hard resetting link
ata4.01: SATA link up 3.0 Gbps (SStatus 123 SControl 300)
ata4.02: hard resetting link
ata4.02: SATA link up 3.0 Gbps (SStatus 123 SControl 300)
ata4.03: hard resetting link
ata4.03: SATA link down (SStatus 0 SControl 320)
ata4.04: hard resetting link
ata4.04: SATA link down (SStatus 0 SControl 320)
ata4.05: hard resetting link
ata4.05: SATA link up 1.5 Gbps (SStatus 113 SControl 320)
ata4.00: ATA-7: ST3750640AS, 3.AAE, max UDMA/133
ata4.00: 1465149168 sectors, multi 0: LBA48 NCQ (depth 31/32)
ata4.00: configured for UDMA/100
ata4.01: ATA-7: Hitachi HDS721010KLA330, GKAOA70F, max UDMA/133
ata4.01: 1953525168 sectors, multi 0: LBA48 NCQ (depth 31/32)
ata4.01: configured for UDMA/100
ata4.02: ATA-7: Hitachi HDS721010KLA330, GKAOA70F, max UDMA/133
ata4.02: 1953525168 sectors, multi 0: LBA48 NCQ (depth 31/32)
ata4.02: configured for UDMA/100
ata4: EH complete
scsi: waiting for bus probes to complete ...
...
INFO: task async/1:3416 blocked for more than 480 seconds.
"echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
async/1       D 0000000000000000     0  3416      2
 ffff88011dd97d40 0000000000000046 ffff88011dd97c70 0000000000000046
 00000000001d33c0 000000000000e568 0000000000000000 ffff88011ea2c5a0
 ffffffff807b34a0 ffff88011ea2c968 000000001dd97d00 00000000ffff9233
Call Trace:
 [<ffffffff805d25b5>] ? _spin_unlock_irqrestore+0x65/0x80
 [<ffffffff805cf393>] schedule+0x13/0x40
 [<ffffffff8026fde5>] async_synchronize_cookie_domain+0x75/0x130
 [<ffffffff80268fa0>] ? autoremove_wake_function+0x0/0x40
 [<ffffffff8027a83d>] ? trace_hardirqs_on+0xd/0x10
 [<ffffffff8026feb0>] async_synchronize_cookie+0x10/0x20
 [<ffffffff8026fed4>] async_synchronize_full+0x14/0x30
 [<ffffffff804b98b9>] scsi_complete_async_scans+0x79/0x130
 [<ffffffff804ba0b5>] __scsi_add_device+0xe5/0x110
 [<ffffffff804d0ed2>] ata_scsi_scan_host+0xb2/0x1e0
 [<ffffffff804cde07>] async_port_probe+0xb7/0xf0
 [<ffffffff8026fc07>] async_thread+0x107/0x270
 [<ffffffff8024b4c0>] ? default_wake_function+0x0/0x10
 [<ffffffff8026fb00>] ? async_thread+0x0/0x270
 [<ffffffff80268ad9>] kthread+0x49/0x80
 [<ffffffff8020cfba>] child_rip+0xa/0x20
 [<ffffffff80249f09>] ? finish_task_switch+0x89/0x110
 [<ffffffff805d2526>] ? _spin_unlock_irq+0x36/0x60
 [<ffffffff8020c980>] ? restore_args+0x0/0x30
 [<ffffffff80268a90>] ? kthread+0x0/0x80
 [<ffffffff8020cfb0>] ? child_rip+0x0/0x20
INFO: lockdep is turned off.
INFO: task async/2:3417 blocked for more than 480 seconds.
"echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
async/2       D 0000000000000001     0  3417      2
 ffff88011dc8fe00 0000000000000046 0000000000000000 0000000000000046
 00000000001d33c0 000000000000e568 0000000000000001 ffff88011ea28000
 ffff88011dd245a0 ffff88011ea283c8 000000011dc8fdc0 0000000000000046
Call Trace:
 [<ffffffff805d25b5>] ? _spin_unlock_irqrestore+0x65/0x80
 [<ffffffff805cf393>] schedule+0x13/0x40
 [<ffffffff8026fde5>] async_synchronize_cookie_domain+0x75/0x130
 [<ffffffff80268fa0>] ? autoremove_wake_function+0x0/0x40
 [<ffffffff8026feb0>] async_synchronize_cookie+0x10/0x20
 [<ffffffff804cde20>] async_port_probe+0xd0/0xf0
 [<ffffffff8026fc07>] async_thread+0x107/0x270
 [<ffffffff8024b4c0>] ? default_wake_function+0x0/0x10
 [<ffffffff8026fb00>] ? async_thread+0x0/0x270
 [<ffffffff80268ad9>] kthread+0x49/0x80
 [<ffffffff8020cfba>] child_rip+0xa/0x20
 [<ffffffff80249f09>] ? finish_task_switch+0x89/0x110
 [<ffffffff805d2526>] ? _spin_unlock_irq+0x36/0x60
 [<ffffffff8020c980>] ? restore_args+0x0/0x30
 [<ffffffff80268a90>] ? kthread+0x0/0x80
 [<ffffffff8020cfb0>] ? child_rip+0x0/0x20
INFO: lockdep is turned off.
INFO: task async/3:3418 blocked for more than 480 seconds.
"echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
async/3       D 0000000000000001     0  3418      2
 ffff88011e44fe00 0000000000000046 0000000000000000 0000000000000046
 00000000001d33c0 000000000000e568 0000000000000001 ffff88011dd245a0
 ffff88011fa6c5a0 ffff88011dd24968 000000011e44fdc0 00000000ffff9237
Call Trace:
 [<ffffffff805d25b5>] ? _spin_unlock_irqrestore+0x65/0x80
 [<ffffffff805cf393>] schedule+0x13/0x40
 [<ffffffff8026fde5>] async_synchronize_cookie_domain+0x75/0x130
 [<ffffffff80268fa0>] ? autoremove_wake_function+0x0/0x40
 [<ffffffff8026feb0>] async_synchronize_cookie+0x10/0x20
 [<ffffffff804cde20>] async_port_probe+0xd0/0xf0
 [<ffffffff8026fc07>] async_thread+0x107/0x270
 [<ffffffff8024b4c0>] ? default_wake_function+0x0/0x10
 [<ffffffff8026fb00>] ? async_thread+0x0/0x270
 [<ffffffff80268ad9>] kthread+0x49/0x80
 [<ffffffff8020cfba>] child_rip+0xa/0x20
 [<ffffffff80249f09>] ? finish_task_switch+0x89/0x110
 [<ffffffff805d2526>] ? _spin_unlock_irq+0x36/0x60
 [<ffffffff8020c980>] ? restore_args+0x0/0x30
 [<ffffffff80268a90>] ? kthread+0x0/0x80
 [<ffffffff8020cfb0>] ? child_rip+0x0/0x20
INFO: lockdep is turned off.
Comment 1 Petr Vandrovec 2009-04-06 23:34:15 UTC
Created attachment 20840 [details]
kernel config file
Comment 2 Petr Vandrovec 2009-04-07 01:48:03 UTC
Reverting Arjan's commit 9710794383ee - 'async: remove the temporary (2.6.29) "async is off by default" code' "fixes" a problem.
Comment 3 Arjan van de Ven 2009-04-07 02:09:13 UTC
can you boot with "initcall_debug" ? this will make the async code print more diagnostics...
Comment 4 Petr Vandrovec 2009-04-07 03:34:45 UTC
Created attachment 20845 [details]
dmesg with initcall_debug enabled

Dmesg with debugging enabled attached.
Comment 5 Petr Vandrovec 2009-04-07 03:37:51 UTC
Now when I look at initcall trace... sata_sil24 driver is built as a module, so it won't show in initcall_debug (except as far as async.c code is concerned).
Comment 6 Petr Vandrovec 2009-04-07 03:54:06 UTC
Created attachment 20846 [details]
dmesg with initcall_debug enabled

This one comes from kernel which has SYSTEM_BOOTING in async.c removed...

initcall sil24_init+0x0/0x20 [sata_sil24] returned 0 after 5174365 usecs
calling  7_async_port_probe+0x0/0xf0 @ 3432
async_waiting @ 3432
calling  8_async_port_probe+0x0/0xf0 @ 3433
async_waiting @ 3433
calling  9_async_port_probe+0x0/0xf0 @ 3434
async_waiting @ 3434
...
async_continuing @ 3432 after 2059143 usec
...
async_waiting @ 3432
async_continuing @ 3432 after 2 usec
...
async_waiting @ 3432
...
INFO: task async/1:3432 blocked for more than 120 seconds.
"echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
async/1       D 0000000000000000     0  3432      2
Comment 7 Arjan van de Ven 2009-04-07 03:58:46 UTC
The interesting part is that this driver doesn't enable parallel scanning ;(
Comment 8 Petr Vandrovec 2009-04-22 05:25:42 UTC
Created attachment 21078 [details]
Hung async on 2.6.30-rc2

Rafael's script asked whether it still happens.  Yes, it still dies unless fastboot patch is reverted.

Only difference is that now box does not boot anymore - beside async/1,2,3 being D, also modprobes are now dead as they wait for async in sys_init_module, so box now even does not finish booting.  And also khungtaskd now hits BUG about using smp_processor_id() in preemptible context :-(
Comment 9 Arjan van de Ven 2009-04-22 05:41:31 UTC
bugzilla-daemon@bugzilla.kernel.org wrote:

it would be interesting to do a quick check on -rc3;
a potential source of deadlock in this stuff has been fixed there.
Comment 10 Petr Vandrovec 2009-04-22 06:28:15 UTC
Aargh, I just did 'git pull' today afternoon :-(

Yep, with your commit d4d5291c8cd499b1 box comes up nicely.  sata_sil24 now complains about misclassified device it never complained before, but it seems to be a feature, and on retry it gets things right...

ata3: softreset failed (timeout)
ata3: SATA link up 1.5 Gbps (SStatus 113 SControl 0)
ata3: link online but device misclassified, retrying
ata3: SATA link up 1.5 Gbps (SStatus 113 SControl 0)
ata3.00: ATA-6: Sans Digital V.36.A0B, V.36.A0B, max UDMA/133