Bug 50491

Summary: rmmod/modprobe i7core_edac leads to kernel oops
Product: Drivers
Component: EDAC
Status: CLOSED CODE_FIX
Severity: low
Priority: P1
Hardware: All
OS: Linux
Kernel Version: 3.0.39, 3.5.0, 3.7-rc5
Subsystem:
Regression: No
Bisected commit-id:
Reporter: Jean Delvare (jdelvare)
Assignee: Jean Delvare (jdelvare)
CC: alan, mchehab
Attachments: i7core_edac: Fix PCI device reference count

Description Jean Delvare 2012-11-13 13:55:45 UTC
If I run the following on my system:

# while rmmod i7core_edac ; do modprobe i7core_edac ; done

I quickly get a flood of warning and error messages in the kernel logs. First I get 3 correct cycles:

EDAC PCI: Removed device 0 for i7core_edac EDAC PCI controller: DEV 0000:ff:03.0
EDAC MC: Removed device 0 for i7core_edac.c i7 core #0: DEV 0000:ff:03.0
EDAC MC0: Giving out device to 'i7core_edac.c' 'i7 core #0': DEV 0000:ff:03.0
EDAC PCI1: Giving out device to module 'i7core_edac' controller 'EDAC PCI controller': DEV '0000:ff:03.0' (POLLED)
EDAC i7core: Driver loaded, 1 memory controller(s) found.

but then a first failure occurs:

EDAC PCI: Removed device 3 for i7core_edac EDAC PCI controller: DEV 0000:ff:03.0
EDAC MC: Removed device 0 for i7core_edac.c i7 core #0: DEV 0000:ff:03.0
EDAC i7core: Device not found: dev 00.0 PCI ID 8086:2c41

This last message is repeated a hundred times or so, and is finally followed by a WARNING and then a BUG:

------------[ cut here ]------------
WARNING: at include/linux/kref.h:42 klist_next+0xba/0x110()
Hardware name: System Product Name
Modules linked in: i7core_edac(+) fuse xt_tcpudp xt_pkttype xt_physdev xt_LOG xt_limit nfsd lockd nfs_acl auth_rpcgss sunrpc bridge stp llc ip6t_REJECT nf_conntrack_ipv6 nf_defrag_ipv6 ip6table_raw ipt_REJECT iptable_raw iptable_filter ip6table_mangle nf_conntrack_netbios_ns nf_conntrack_broadcast nf_conntrack_ipv4 nf_defrag_ipv4 ip_tables xt_conntrack nf_conntrack ip6table_filter ip6_tables x_tables cpufreq_conservative cpufreq_userspace cpufreq_powersave acpi_cpufreq mperf nls_utf8 loop snd_hda_codec_hdmi mt2060 snd_hda_codec_realtek snd_hda_intel snd_hda_codec e1000e container snd_hwdep snd_pcm button edac_core sr_mod dvb_usb_dib0700 dib0090 dib7000p dib7000m dib0070 dvb_usb dib8000 dvb_core dib3000mc rc_core sg iTCO_wdt iTCO_vendor_support coretemp kvm_intel kvm i2c_mux_gpio i2c_mux gpio_ich lpc_ich i2c_i801 crc32c_intel dibx000_common cdrom pcspkr microcode snd_timer snd edd soundcore snd_page_alloc autofs4 radeon ttm rtc_cmos drm_kms_helper drm i2c_algo_bit processor thermal_sys ata_generic [last unloaded: i7core_edac]
Pid: 10962, comm: modprobe Not tainted 3.7.0-rc5 #12
Call Trace:
 [<ffffffff8104501c>] warn_slowpath_common+0x8c/0xc0
 [<ffffffff81045065>] warn_slowpath_null+0x15/0x20
 [<ffffffff815c34ea>] klist_next+0xba/0x110
 [<ffffffff81317ac0>] ? pci_do_find_bus+0x60/0x60
 [<ffffffff813be319>] next_device+0x9/0x30
 [<ffffffff813beaf9>] bus_find_device+0x69/0x90
 [<ffffffff81317b9b>] pci_get_dev_by_id+0x6b/0xb0
 [<ffffffff81317ce0>] pci_get_subsys+0x30/0x40
 [<ffffffff81317d03>] pci_get_device+0x13/0x20
 [<ffffffffa0331448>] i7core_get_onedevice+0x3c/0x28e [i7core_edac]
 [<ffffffffa033177e>] i7core_probe+0x8e/0x16e [i7core_edac]
 [<ffffffff81316524>] local_pci_probe+0x74/0x100
 [<ffffffff81316679>] __pci_device_probe+0xc9/0xf0
 [<ffffffff81316c05>] pci_device_probe+0x35/0x60
 [<ffffffff813c022f>] really_probe+0x10f/0x2f0
 [<ffffffff813c05bb>] driver_probe_device+0x7b/0xa0
 [<ffffffff813c063f>] __driver_attach+0x5f/0x90
 [<ffffffff813c05e0>] ? driver_probe_device+0xa0/0xa0
 [<ffffffff813be609>] bus_for_each_dev+0x49/0x80
 [<ffffffff813bfe69>] driver_attach+0x19/0x20
 [<ffffffff813bf83b>] bus_add_driver+0xdb/0x260
 [<ffffffffa038d000>] ? 0xffffffffa038cfff
 [<ffffffff813c0c68>] driver_register+0xa8/0x150
 [<ffffffffa038d000>] ? 0xffffffffa038cfff
 [<ffffffff81316d1c>] __pci_register_driver+0x5c/0x70
 [<ffffffffa038d093>] i7core_init+0x93/0x1000 [i7core_edac]
 [<ffffffff8100024a>] do_one_initcall+0x8a/0x160
 [<ffffffff810a99b5>] sys_init_module+0xc5/0x220
 [<ffffffff815ff829>] system_call_fastpath+0x16/0x1b
---[ end trace e1e91b09c80c8edb ]---
BUG: unable to handle kernel NULL pointer dereference at 0000000000000030
IP: [<ffffffff815c30b8>] klist_put+0x28/0xa0
PGD 13699d067 PUD 137f4d067 PMD 0 
Oops: 0000 [#1] PREEMPT SMP 
Modules linked in: i7core_edac(+) fuse xt_tcpudp xt_pkttype xt_physdev xt_LOG xt_limit nfsd lockd nfs_acl auth_rpcgss sunrpc bridge stp llc ip6t_REJECT nf_conntrack_ipv6 nf_defrag_ipv6 ip6table_raw ipt_REJECT iptable_raw iptable_filter ip6table_mangle nf_conntrack_netbios_ns nf_conntrack_broadcast nf_conntrack_ipv4 nf_defrag_ipv4 ip_tables xt_conntrack nf_conntrack ip6table_filter ip6_tables x_tables cpufreq_conservative cpufreq_userspace cpufreq_powersave acpi_cpufreq mperf nls_utf8 loop snd_hda_codec_hdmi mt2060 snd_hda_codec_realtek snd_hda_intel snd_hda_codec e1000e container snd_hwdep snd_pcm button edac_core sr_mod dvb_usb_dib0700 dib0090 dib7000p dib7000m dib0070 dvb_usb dib8000 dvb_core dib3000mc rc_core sg iTCO_wdt iTCO_vendor_support coretemp kvm_intel kvm i2c_mux_gpio i2c_mux gpio_ich lpc_ich i2c_i801 crc32c_intel dibx000_common cdrom pcspkr microcode snd_timer snd edd soundcore snd_page_alloc autofs4 radeon ttm rtc_cmos drm_kms_helper drm i2c_algo_bit processor thermal_sys ata_generic [last unloaded: i7core_edac]
CPU 6 
Pid: 10962, comm: modprobe Tainted: G        W    3.7.0-rc5 #12 System manufacturer System Product Name/Z8NA-D6(C)
[26130.232897] RIP: 0010:[<ffffffff815c30b8>]  [<ffffffff815c30b8>] klist_put+0x28/0xa0
RSP: 0018:ffff880125389af8  EFLAGS: 00010246
RAX: 0000000000000000 RBX: 0000000000000000 RCX: 00000000cb2dcb2c
RDX: 00000000000035f0 RSI: 0000000000000000 RDI: 0000000000000000
RBP: ffff880125389b18 R08: ffffffff81e41cdc R09: ffffffff81e61e20
R10: 0000000000000044 R11: 000000000001fef0 R12: ffff8801391b24b8
R13: 0000000000000000 R14: 0000000000000000 R15: 0000000000000001
FS:  00007fca8695c700(0000) GS:ffff88013f2c0000(0000) knlGS:0000000000000000
CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
CR2: 0000000000000030 CR3: 00000001357d5000 CR4: 00000000000007e0
DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
Process modprobe (pid: 10962, threadinfo ffff880125388000, task ffff880037a68700)
Stack:
 ffff880125389b48 ffffffff81317ac0 0000000000000000 ffffffffa0332640
 ffff880125389b38 ffffffff815c314c ffff880125389b38 ffff880125389bb8
 ffff880125389b78 ffffffff813beb0a ffff8801397a68d8 ffff8801391b24b8
Call Trace:
 [<ffffffff81317ac0>] ? pci_do_find_bus+0x60/0x60
 [<ffffffff815c314c>] klist_iter_exit+0x1c/0x30
 [<ffffffff813beb0a>] bus_find_device+0x7a/0x90
 [<ffffffff81317b9b>] pci_get_dev_by_id+0x6b/0xb0
 [<ffffffff81317ce0>] pci_get_subsys+0x30/0x40
 [<ffffffff81317d03>] pci_get_device+0x13/0x20
 [<ffffffffa0331448>] i7core_get_onedevice+0x3c/0x28e [i7core_edac]
 [<ffffffffa033177e>] i7core_probe+0x8e/0x16e [i7core_edac]
 [<ffffffff81316524>] local_pci_probe+0x74/0x100
 [<ffffffff81316679>] __pci_device_probe+0xc9/0xf0
 [<ffffffff81316c05>] pci_device_probe+0x35/0x60
 [<ffffffff813c022f>] really_probe+0x10f/0x2f0
 [<ffffffff813c05bb>] driver_probe_device+0x7b/0xa0
 [<ffffffff813c063f>] __driver_attach+0x5f/0x90
 [<ffffffff813c05e0>] ? driver_probe_device+0xa0/0xa0
 [<ffffffff813be609>] bus_for_each_dev+0x49/0x80
 [<ffffffff813bfe69>] driver_attach+0x19/0x20
 [<ffffffff813bf83b>] bus_add_driver+0xdb/0x260
 [<ffffffffa038d000>] ? 0xffffffffa038cfff
 [<ffffffff813c0c68>] driver_register+0xa8/0x150
 [<ffffffffa038d000>] ? 0xffffffffa038cfff
 [<ffffffff81316d1c>] __pci_register_driver+0x5c/0x70
 [<ffffffffa038d093>] i7core_init+0x93/0x1000 [i7core_edac]
 [<ffffffff8100024a>] do_one_initcall+0x8a/0x160
 [<ffffffff810a99b5>] sys_init_module+0xc5/0x220
 [<ffffffff815ff829>] system_call_fastpath+0x16/0x1b
Code: 00 00 00 55 48 89 e5 48 83 ec 20 4c 89 65 e8 49 89 fc 4c 89 75 f8 41 89 f6 48 89 5d e0 4c 89 6d f0 48 8b 1f 48 83 e3 fe 48 89 df <4c> 8b 6b 30 e8 4f 51 03 00 45 84 f6 74 2a 49 8b 04 24 a8 01 74 
RIP  [<ffffffff815c30b8>] klist_put+0x28/0xa0
 RSP <ffff880125389af8>
CR2: 0000000000000030
---[ end trace e1e91b09c80c8edc ]---

A further call to lspci shows the following error:
lspci: Cannot open /sys/bus/pci/devices/0000:ff:00.0/resource: No such file or directory
PCI device 0000:ff:00.0 has simply disappeared (the symbolic link /sys/bus/pci/devices/0000:ff:00.0 points to a nonexistent directory).

Comment 1 Alan 2012-11-14 12:47:34 UTC
Looks like it corrupts the PCI device list and ends up over-'put'ting a PCI device.

Comment 2 Jean Delvare 2012-11-20 19:52:51 UTC
Created attachment 86751
i7core_edac: Fix PCI device reference count

Thanks for the hint, Alan. You were right, the bug is caused by PCI device over-putting. This should fix it. I think more work is needed to get the error paths right though.
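
To illustrate what over-putting means here, the following is a minimal user-space sketch. It is a toy model only, not the attached patch and not the driver's code: toy_dev, toy_get_device, lookup_buggy, lookup_balanced and final_refcount are hypothetical names made up for the example, and the alternate ID is a placeholder. The one kernel behaviour it assumes is the documented contract of pci_get_device(): the reference on the device passed as the search start is always dropped, and the returned device carries a new reference.

```c
#include <assert.h>
#include <stddef.h>

#define TOY_ID     0x2c41u  /* ID tried first (the one named in the log) */
#define TOY_ID_ALT 0x2c40u  /* placeholder alternate ID tried on failure */

/* Toy stand-in for a PCI device: just an ID and a reference count. */
struct toy_dev {
    unsigned int id;
    int refcount;           /* starts at 1: the reference held by the bus */
};

/* Take one reference; NULL is accepted and ignored. */
static struct toy_dev *toy_dev_get(struct toy_dev *dev)
{
    if (dev)
        dev->refcount++;
    return dev;
}

/* Drop one reference; NULL is accepted and ignored. */
static void toy_dev_put(struct toy_dev *dev)
{
    if (dev)
        dev->refcount--;
}

/*
 * Return the next device matching 'id' after 'from' (or from the start of
 * the bus if 'from' is NULL). Modelled on the pci_get_device() contract:
 * the caller's reference on 'from' is always dropped, and the returned
 * device has one extra reference.
 */
static struct toy_dev *toy_get_device(struct toy_dev *bus, size_t n,
                                      unsigned int id, struct toy_dev *from)
{
    struct toy_dev *found = NULL;
    size_t start = 0;

    assert(from == NULL || (from >= bus && from < bus + n));
    if (from)
        start = (size_t)(from - bus) + 1;

    for (size_t i = start; i < n; i++) {
        if (bus[i].id == id) {
            found = toy_dev_get(&bus[i]);
            break;
        }
    }
    toy_dev_put(from);
    return found;
}

/* Retry with the alternate ID, reusing 'prev': 'prev' is put twice. */
static struct toy_dev *lookup_buggy(struct toy_dev *bus, size_t n,
                                    struct toy_dev *prev)
{
    struct toy_dev *dev = toy_get_device(bus, n, TOY_ID, prev);

    if (!dev)
        dev = toy_get_device(bus, n, TOY_ID_ALT, prev);
    return dev;
}

/* Same retry, but take a reference first because the retry will put it. */
static struct toy_dev *lookup_balanced(struct toy_dev *bus, size_t n,
                                       struct toy_dev *prev)
{
    struct toy_dev *dev = toy_get_device(bus, n, TOY_ID, prev);

    if (!dev) {
        toy_dev_get(prev);
        dev = toy_get_device(bus, n, TOY_ID_ALT, prev);
    }
    return dev;
}

/* Scan a one-device bus until no match is left; return its final refcount. */
static int final_refcount(int balanced)
{
    struct toy_dev bus[1] = { { TOY_ID_ALT, 1 } };
    struct toy_dev *dev = NULL;

    do {
        dev = balanced ? lookup_balanced(bus, 1, dev)
                       : lookup_buggy(bus, 1, dev);
    } while (dev);
    return bus[0].refcount;
}
```

In this model final_refcount(0) ends at 0, meaning the bus's own reference has been dropped by the scan, while final_refcount(1) ends at 1 as expected. In the kernel, dropping the last reference releases the device, which would be consistent with a device vanishing from the bus list after a few load/unload cycles.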

Comment 3 Jean Delvare 2012-11-21 10:19:14 UTC
After investigation, the error paths are actually correct.