Bug 33802

Summary: list_del corruption in sd driver since 2.6.39-rc4
Product: IO/Storage Reporter: Christian Casteyde (casteyde.christian)
Component: SCSIAssignee: linux-scsi (linux-scsi)
Status: CLOSED CODE_FIX    
Severity: normal CC: florian, maciej.rutecki, rjw, stefanr
Priority: P1    
Hardware: All   
OS: Linux   
Kernel Version: 2.6.39-rc4 Subsystem:
Regression: Yes Bisected commit-id:
Bug Depends on:    
Bug Blocks: 32012    

Description Christian Casteyde 2011-04-21 21:10:47 UTC
Acer Aspire 1511LMi
Athlon 64 3GHz in 64bits mode
Slackware64 13.1
PREEMPT+slub+kmemcheck

I've just got (once, very difficult to reproduce) the following warning at boot on -rc4 with kmemcheck and list check debugging options:

Intel ICH 0000:00:06.0: PCI INT A -> Link[LACI] -> GSI 22 (level, low) -> IRQ 22
Intel ICH 0000:00:06.0: setting latency timer to 64
firewire_net: firewire0: IPv4 over FireWire on device 00c09f000019adc5
firewire_core: refreshed device fw0
scsi 2:0:0:0: Direct-Access     Generic  Flash R/W        2002 PQ: 0 ANSI: 2
intel8x0_measure_ac97_clock: measured 52049 usecs (2528 samples)
intel8x0: clocking to 47437
sd 2:0:0:0: Attached scsi generic sg2 type 0
------------[ cut here ]------------
WARNING: at lib/kref.c:34 kref_get+0x2c/0x30()
Hardware name: Aspire 1510  
Modules linked in:
Pid: 310, comm: kworker/0:1 Not tainted 2.6.39-rc4 #5
Call Trace:
 [<ffffffff810379ea>] warn_slowpath_common+0x7a/0xb0
 [<ffffffff81037a35>] warn_slowpath_null+0x15/0x20
 [<ffffffff8125bffc>] kref_get+0x2c/0x30
 [<ffffffff8125ac9a>] kobject_get+0x1a/0x30
 [<ffffffff812ec704>] get_device+0x14/0x20
 [<ffffffff81307a47>] scsi_request_fn+0x37/0x4e0
 [<ffffffff812470d0>] __blk_run_queue+0x30/0x60
 [<ffffffff81248f14>] blk_delay_work+0x24/0x40
 [<ffffffff810507d1>] process_one_work+0x1a1/0x440
 [<ffffffff81050771>] ? process_one_work+0x141/0x440
 [<ffffffff815e7ae9>] ? error_exit+0x29/0xb0
 [<ffffffff81248ef0>] ? blk_requeue_request+0x60/0x60
 [<ffffffff81050f15>] worker_thread+0x2a5/0x380
 [<ffffffff81050c70>] ? rescuer_thread+0x200/0x200
 [<ffffffff810550be>] kthread+0xae/0xc0
 [<ffffffff81067675>] ? trace_hardirqs_on_caller+0x135/0x180
 [<ffffffff815e8a34>] kernel_thread_helper+0x4/0x10
 [<ffffffff810335ec>] ? T.1044+0x7c/0xd0
 [<ffffffff815e7682>] ? retint_restore_args+0xe/0xe
 [<ffffffff81055010>] ? __init_kthread_worker+0x70/0x70
 [<ffffffff815e8a30>] ? gs_change+0xb/0xb
---[ end trace cee58f33c54e791a ]---
sd 2:0:0:0: [sdb] Attached SCSI removable disk
------------[ cut here ]------------
WARNING: at lib/list_debug.c:47 __list_del_entry+0x63/0xd0()
Hardware name: Aspire 1510  
list_del corruption, ffff88001d73a010->next is LIST_POISON1 (dead000000100100)
Modules linked in:
Pid: 310, comm: kworker/0:1 Tainted: G        W   2.6.39-rc4 #5
Call Trace:
 [<ffffffff810379ea>] warn_slowpath_common+0x7a/0xb0
 [<ffffffff81037ac1>] warn_slowpath_fmt+0x41/0x50
 [<ffffffff81269b83>] __list_del_entry+0x63/0xd0
 [<ffffffff81269c01>] list_del+0x11/0x40
 [<ffffffff8130ce4d>] scsi_device_dev_release_usercontext+0x4d/0x140
 [<ffffffff815e7ae9>] ? error_exit+0x29/0xb0
 [<ffffffff8130ce00>] ? scsi_device_cls_release+0x20/0x20
 [<ffffffff810504e6>] execute_in_process_context+0x86/0x90
 [<ffffffff8130cdd7>] scsi_device_dev_release+0x17/0x20
 [<ffffffff812ecdf2>] device_release+0x22/0x90
 [<ffffffff8125ac37>] kobject_release+0x47/0x90
 [<ffffffff8125abf0>] ? kobject_del+0x40/0x40
 [<ffffffff8125bf96>] kref_put+0x36/0x70
 [<ffffffff8125aaf7>] kobject_put+0x27/0x60
 [<ffffffff812ec732>] put_device+0x12/0x20
 [<ffffffff81307ab8>] scsi_request_fn+0xa8/0x4e0
 [<ffffffff812470d0>] __blk_run_queue+0x30/0x60
 [<ffffffff81248f14>] blk_delay_work+0x24/0x40
 [<ffffffff810507d1>] process_one_work+0x1a1/0x440
 [<ffffffff81050771>] ? process_one_work+0x141/0x440
 [<ffffffff815e7ae9>] ? error_exit+0x29/0xb0
 [<ffffffff81248ef0>] ? blk_requeue_request+0x60/0x60
 [<ffffffff81050f15>] worker_thread+0x2a5/0x380
 [<ffffffff81050c70>] ? rescuer_thread+0x200/0x200
 [<ffffffff810550be>] kthread+0xae/0xc0
 [<ffffffff81067675>] ? trace_hardirqs_on_caller+0x135/0x180
 [<ffffffff815e8a34>] kernel_thread_helper+0x4/0x10
 [<ffffffff810335ec>] ? T.1044+0x7c/0xd0
 [<ffffffff815e7682>] ? retint_restore_args+0xe/0xe
 [<ffffffff81055010>] ? __init_kthread_worker+0x70/0x70
 [<ffffffff815e8a30>] ? gs_change+0xb/0xb
---[ end trace cee58f33c54e791b ]---
------------[ cut here ]------------
WARNING: at lib/list_debug.c:47 __list_del_entry+0x63/0xd0()
Hardware name: Aspire 1510  
list_del corruption, ffff88001d73a020->next is LIST_POISON1 (dead000000100100)
Modules linked in:
Pid: 310, comm: kworker/0:1 Tainted: G        W   2.6.39-rc4 #5
Call Trace:
 [<ffffffff810379ea>] warn_slowpath_common+0x7a/0xb0
 [<ffffffff81037ac1>] warn_slowpath_fmt+0x41/0x50
 [<ffffffff81269b83>] __list_del_entry+0x63/0xd0
 [<ffffffff81269c01>] list_del+0x11/0x40
 [<ffffffff8130ce57>] scsi_device_dev_release_usercontext+0x57/0x140
 [<ffffffff815e7ae9>] ? error_exit+0x29/0xb0
 [<ffffffff8130ce00>] ? scsi_device_cls_release+0x20/0x20
 [<ffffffff810504e6>] execute_in_process_context+0x86/0x90
 [<ffffffff8130cdd7>] scsi_device_dev_release+0x17/0x20
 [<ffffffff812ecdf2>] device_release+0x22/0x90
 [<ffffffff8125ac37>] kobject_release+0x47/0x90
 [<ffffffff8125abf0>] ? kobject_del+0x40/0x40
 [<ffffffff8125bf96>] kref_put+0x36/0x70
 [<ffffffff8125aaf7>] kobject_put+0x27/0x60
 [<ffffffff812ec732>] put_device+0x12/0x20
 [<ffffffff81307ab8>] scsi_request_fn+0xa8/0x4e0
 [<ffffffff812470d0>] __blk_run_queue+0x30/0x60
 [<ffffffff81248f14>] blk_delay_work+0x24/0x40
 [<ffffffff810507d1>] process_one_work+0x1a1/0x440
 [<ffffffff81050771>] ? process_one_work+0x141/0x440
 [<ffffffff815e7ae9>] ? error_exit+0x29/0xb0
 [<ffffffff81248ef0>] ? blk_requeue_request+0x60/0x60
 [<ffffffff81050f15>] worker_thread+0x2a5/0x380
 [<ffffffff81050c70>] ? rescuer_thread+0x200/0x200
 [<ffffffff810550be>] kthread+0xae/0xc0
 [<ffffffff81067675>] ? trace_hardirqs_on_caller+0x135/0x180
 [<ffffffff815e8a34>] kernel_thread_helper+0x4/0x10
 [<ffffffff810335ec>] ? T.1044+0x7c/0xd0
 [<ffffffff815e7682>] ? retint_restore_args+0xe/0xe
 [<ffffffff81055010>] ? __init_kthread_worker+0x70/0x70
 [<ffffffff815e8a30>] ? gs_change+0xb/0xb
---[ end trace cee58f33c54e791c ]---
------------[ cut here ]------------
WARNING: at lib/list_debug.c:47 __list_del_entry+0x63/0xd0()
Hardware name: Aspire 1510  
list_del corruption, ffff88001d73a080->next is LIST_POISON1 (dead000000100100)
Modules linked in:
Pid: 310, comm: kworker/0:1 Tainted: G        W   2.6.39-rc4 #5
Call Trace:
 [<ffffffff810379ea>] warn_slowpath_common+0x7a/0xb0
 [<ffffffff81037ac1>] warn_slowpath_fmt+0x41/0x50
 [<ffffffff81269b83>] __list_del_entry+0x63/0xd0
 [<ffffffff81269c01>] list_del+0x11/0x40
 [<ffffffff8130ce64>] scsi_device_dev_release_usercontext+0x64/0x140
 [<ffffffff815e7ae9>] ? error_exit+0x29/0xb0
 [<ffffffff8130ce00>] ? scsi_device_cls_release+0x20/0x20
 [<ffffffff810504e6>] execute_in_process_context+0x86/0x90
 [<ffffffff8130cdd7>] scsi_device_dev_release+0x17/0x20
 [<ffffffff812ecdf2>] device_release+0x22/0x90
 [<ffffffff8125ac37>] kobject_release+0x47/0x90
 [<ffffffff8125abf0>] ? kobject_del+0x40/0x40
 [<ffffffff8125bf96>] kref_put+0x36/0x70
 [<ffffffff8125aaf7>] kobject_put+0x27/0x60
 [<ffffffff812ec732>] put_device+0x12/0x20
 [<ffffffff81307ab8>] scsi_request_fn+0xa8/0x4e0
 [<ffffffff812470d0>] __blk_run_queue+0x30/0x60
 [<ffffffff81248f14>] blk_delay_work+0x24/0x40
 [<ffffffff810507d1>] process_one_work+0x1a1/0x440
 [<ffffffff81050771>] ? process_one_work+0x141/0x440
 [<ffffffff815e7ae9>] ? error_exit+0x29/0xb0
 [<ffffffff81248ef0>] ? blk_requeue_request+0x60/0x60
 [<ffffffff81050f15>] worker_thread+0x2a5/0x380
 [<ffffffff81050c70>] ? rescuer_thread+0x200/0x200
 [<ffffffff810550be>] kthread+0xae/0xc0
 [<ffffffff81067675>] ? trace_hardirqs_on_caller+0x135/0x180
 [<ffffffff815e8a34>] kernel_thread_helper+0x4/0x10
 [<ffffffff810335ec>] ? T.1044+0x7c/0xd0
 [<ffffffff815e7682>] ? retint_restore_args+0xe/0xe
 [<ffffffff81055010>] ? __init_kthread_worker+0x70/0x70
 [<ffffffff815e8a30>] ? gs_change+0xb/0xb
---[ end trace cee58f33c54e791d ]---

=============================================
[ INFO: possible recursive locking detected ]
2.6.39-rc4 #5
---------------------------------------------
kworker/0:1/310 is trying to acquire lock:
 ((&(&q->delay_work)->work)){+.+...}, at: [<ffffffff8104f540>] wait_on_work+0x0/0x90

but task is already holding lock:
 ((&(&q->delay_work)->work)){+.+...}, at: [<ffffffff81050771>] process_one_work+0x141/0x440

other info that might help us debug this:
2 locks held by kworker/0:1/310:
 #0:  (kblockd){.+.+..}, at: [<ffffffff81050771>] process_one_work+0x141/0x440
 #1:  ((&(&q->delay_work)->work)){+.+...}, at: [<ffffffff81050771>] process_one_work+0x141/0x440

stack backtrace:
Pid: 310, comm: kworker/0:1 Tainted: G        W   2.6.39-rc4 #5
Call Trace:
 [<ffffffff8106a19b>] __lock_acquire+0x151b/0x1c60
 [<ffffffff8126418d>] ? trace_hardirqs_off_thunk+0x3a/0x3c
 [<ffffffff8126418d>] ? trace_hardirqs_off_thunk+0x3a/0x3c
 [<ffffffff8106a939>] lock_acquire+0x59/0x70
 [<ffffffff8104f540>] ? wait_on_cpu_work+0xd0/0xd0
 [<ffffffff8104f584>] wait_on_work+0x44/0x90
 [<ffffffff8104f540>] ? wait_on_cpu_work+0xd0/0xd0
 [<ffffffff815e7206>] ? _raw_spin_unlock_irq+0x36/0x60
 [<ffffffff8104f6b9>] __cancel_work_timer+0xe9/0x1a0
 [<ffffffff8104f77d>] cancel_delayed_work_sync+0xd/0x10
 [<ffffffff81249514>] blk_sync_queue+0x24/0x30
 [<ffffffff81249536>] blk_cleanup_queue+0x16/0x60
 [<ffffffff81307859>] scsi_free_queue+0x9/0x10
 [<ffffffff8130ceeb>] scsi_device_dev_release_usercontext+0xeb/0x140
 [<ffffffff815e7ae9>] ? error_exit+0x29/0xb0
 [<ffffffff8130ce00>] ? scsi_device_cls_release+0x20/0x20
 [<ffffffff810504e6>] execute_in_process_context+0x86/0x90
 [<ffffffff8130cdd7>] scsi_device_dev_release+0x17/0x20
 [<ffffffff812ecdf2>] device_release+0x22/0x90
 [<ffffffff8125ac37>] kobject_release+0x47/0x90
 [<ffffffff8125abf0>] ? kobject_del+0x40/0x40
 [<ffffffff8125bf96>] kref_put+0x36/0x70
 [<ffffffff8125aaf7>] kobject_put+0x27/0x60
 [<ffffffff812ec732>] put_device+0x12/0x20
 [<ffffffff81307ab8>] scsi_request_fn+0xa8/0x4e0
 [<ffffffff812470d0>] __blk_run_queue+0x30/0x60
 [<ffffffff81248f14>] blk_delay_work+0x24/0x40
 [<ffffffff810507d1>] process_one_work+0x1a1/0x440
 [<ffffffff81050771>] ? process_one_work+0x141/0x440
 [<ffffffff815e7ae9>] ? error_exit+0x29/0xb0
 [<ffffffff81248ef0>] ? blk_requeue_request+0x60/0x60
 [<ffffffff81050f15>] worker_thread+0x2a5/0x380
 [<ffffffff81050c70>] ? rescuer_thread+0x200/0x200
 [<ffffffff810550be>] kthread+0xae/0xc0
 [<ffffffff81067675>] ? trace_hardirqs_on_caller+0x135/0x180
 [<ffffffff815e8a34>] kernel_thread_helper+0x4/0x10
 [<ffffffff810335ec>] ? T.1044+0x7c/0xd0
 [<ffffffff815e7682>] ? retint_restore_args+0xe/0xe
 [<ffffffff81055010>] ? __init_kthread_worker+0x70/0x70
 [<ffffffff815e8a30>] ? gs_change+0xb/0xb
Intel ICH Modem 0000:00:06.1: PCI INT B -> Link[LMCI] -> GSI 21 (level, low) -> IRQ 21
Intel ICH Modem 0000:00:06.1: setting latency timer to 64
usbcore: registered new interface driver snd-usb-audio
ALSA device list:
  #0: NVidia nForce3 with ALC250 at irq 22
  #1: NVidia nForce3 Modem at irq 21
Netfilter messages via NETLINK v0.30.
nf_conntrack version 0.5.0 (3843 buckets, 15372 max)
ip_tables: (C) 2000-2006 Netfilter Core Team
...
Comment 1 Stefan Richter 2011-04-22 09:14:37 UTC
I saw this already with 2.6.39-rc3 (which was the first 2.6.39-rc that I tried after 2.6.38, to which I have reverted for now).

Reported in "2.6.39-rc3: usb-storage hotplug kref corruption/ list corruption" at linux-scsi + linux-usb on April 14: http://thread.gmane.org/gmane.linux.usb.general/45456
Comment 2 Stefan Richter 2011-04-22 09:25:42 UTC
PS,
in my case the bug happened not during boot but during power-up of a monitor with built-in USB card reader.  Such a power-up happened circa 10 times in that 2.6.39-rc3 session --- plus a few other plug-in events of USB and FireWire disks --- without the bug hitting.
Comment 3 Anonymous Emailer 2011-04-22 14:27:46 UTC
Reply-To: James.Bottomley@suse.de

On Fri, 2011-04-22 at 09:25 +0000, from Stefan Richter
<stefanr@s5r6.in-berlin.de>  2011-04-22 09:25:42 ---
> PS,
> in my case the bug happened not during boot but during power-up of a monitor
> with built-in USB card reader.  Such a power-up happened circa 10 times in
> that
> 2.6.39-rc3 session --- plus a few other plug-in events of USB and FireWire
> disks --- without the bug hitting.

The taxonomy also looks like it may be related to this

http://marc.info/?t=130253369300004

The fix we'll be pushing shortly to git head is

http://marc.info/?l=linux-kernel&m=130271409412095

James
Comment 4 Rafael J. Wysocki 2011-04-30 19:33:01 UTC
Handled-By : James Bottomley <James.Bottomley@suse.de>
Patch : http://marc.info/?l=linux-kernel&m=130271409412095
Comment 5 Christian Casteyde 2011-05-15 11:00:17 UTC
I didn't reproduced on recent kernels.
Comment 6 Stefan Richter 2011-05-15 11:14:00 UTC
Fixed by https://git.kernel.org/linus/86cbfb5607d4b81b1a993ff689bbd2addd5d3a9b in 2.6.39-rc6, with an additional required fix https://git.kernel.org/linus/c055f5b2614b4f758ae6cc86733f31fa4c2c5844 in 2.6.39-rc7.