Bug 9696 - rmmod capidrv makes kernel oops and never returns
Summary: rmmod capidrv makes kernel oops and never returns
Status: CLOSED CODE_FIX
Alias: None
Product: Drivers
Classification: Unclassified
Component: ISDN
Hardware: All Linux
Importance: P1 normal
Assignee: Jike Song
URL:
Keywords:
Depends on:
Blocks:
 
Reported: 2008-01-05 14:03 UTC by Roland Kletzing
Modified: 2008-04-29 11:52 UTC
CC List: 4 users

See Also:
Kernel Version: 2.6.24-rc6
Subsystem:
Regression: ---
Bisected commit-id:


Attachments
revert b1b2e7cf4a9742f61d76fcb419b1fd13159876a5 (618 bytes, patch)
2008-01-07 20:46 UTC, Jike Song
proposed fix for capidrv (910 bytes, patch)
2008-01-24 08:42 UTC, Gerd von Egidy

Description Roland Kletzing 2008-01-05 14:03:49 UTC
Most recent kernel where this bug did not occur: don't know

Distribution: openSUSE 10.3 + vanilla kernel 2.6.24-rc6

Hardware Environment: FSC Lifebook E-series (happens both on real hardware and inside VMware)

Software Environment: ???

Problem Description:

modprobe capidrv;rmmod capidrv

rmmod never returns

[ 7358.883208] capidrv: Rev 1.1.2.2 : unloaded
[ 7358.895698] BUG: unable to handle kernel NULL pointer dereference at virtual address 00000009
[ 7358.896132] printing eip: c020ceea *pde = 00000000
[ 7358.896512] Oops: 0000 [#1] SMP
[ 7358.896844] Modules linked in: capidrv kernelcapi isdn slhc edd iptable_filter ip_tables ip6table_filter ip6_tables x_tables ipv6 af_packet microcode firmware_class fuse loop dm_mod ide_cd cdrom pata_acpi 8250_pnp ata_piix ahci ata_generic libata parport_pc parport floppy 8250 rtc_cmos serial_core rtc_core rtc_lib pcnet32 mii pcspkr hci_usb piix bluetooth generic i2c_piix4 ide_core i2c_core container shpchp pci_hotplug ac thermal power_supply button processor intel_agp agpgart sg mousedev evdev ext3 jbd mbcache sd_mod mptspi mptscsih mptbase ehci_hcd uhci_hcd scsi_transport_spi scsi_mod usbcore
[ 7358.903430]
[ 7358.903462] Pid: 4466, comm: kstopmachine Not tainted (2.6.24-rc6 #1)
[ 7358.905436] EIP: 0060:[<c020ceea>] EFLAGS: 00010092 CPU: 0
[ 7358.906712] EIP is at list_del+0xa/0x61
[ 7358.906769] EAX: e0b39704 EBX: 00000009 ECX: 00000000 EDX: de4e4e10
[ 7358.906819] ESI: df6b8ef0 EDI: 00000000 EBP: de4a2fb4 ESP: de4a2fa4
[ 7358.906863]  DS: 007b ES: 007b FS: 00d8 GS: 0000 SS: 0068
[ 7358.907431] Process kstopmachine (pid: 4466, ti=de4a2000 task=de4e4e10 task.ti=de4a2000)
[ 7358.907463] Stack: c03f80ac 00000046 00000000 00000008 de4a2fbc c0153506 de4a2fd0 c015fb58
[ 7358.907492]        df6b8ef0 c015fa9c 00000000 de4a2fe0 c013fe7e c013fe43 00000000 00000000
[ 7358.909431]        c0108d77 df6b8e6c 00000000 00000000 00000000 00000000 00000000
[ 7358.909649] Call Trace:
[ 7358.911432]  [<c01091cd>] show_trace_log_lvl+0x1a/0x2f
[ 7358.911492]  [<c010927d>] show_stack_log_lvl+0x9b/0xa3
[ 7358.911539]  [<c010932c>] show_registers+0xa7/0x179
[ 7358.911575]  [<c0109538>] die+0x13a/0x225
[ 7358.911602]  [<c02ed731>] do_page_fault+0x554/0x632
[ 7358.915172]  [<c02ebd72>] error_code+0x72/0x78
[ 7358.915172]  [<c0153506>] __unlink_module+0xb/0xf
[ 7358.915172]  [<c015fb58>] do_stop+0xbc/0x110
[ 7358.915172]  [<c013fe7e>] kthread+0x3b/0x61
[ 7358.915172]  [<c0108d77>] kernel_thread_helper+0x7/0x10
[ 7358.915172]  =======================
[ 7358.915432] Code: 00 00 8b 53 10 8d 4b 0c 8d 46 0c e8 72 00 00 00 89 f8 e8 87 fe ff ff 83 c4 10 5b 5e 5f 5d c3 90 90 55 89 e5 53 83 ec 0c 8b 58 04 <8b> 0b 39 c1 74 18 89 4c 24 08 89 44 24 04 c7 04 24 a7 43 39 c0
[ 7358.917433] EIP: [<c020ceea>] list_del+0xa/0x61 SS:ESP 0068:de4a2fa4
[ 7358.919433] ---[ end trace a35f9be43025b578 ]---

Steps to reproduce:
Boot a vanilla kernel built with allmodconfig, or use the SUSE distro kernel from http://ftp.suse.com/pub/projects/kernel/kotd/HEAD/i386/

Then run modprobe capidrv;rmmod capidrv
(no ISDN hardware is needed for that)
Comment 1 Jike Song 2008-01-06 18:56:55 UTC
Same problem on my Fedora 8 i386 system with vanilla kernel v2.6.24-rc3, but I cannot even save the oops message because Linux hangs immediately after the oops.

Besides, I cannot reproduce this oops within VMware.
Comment 2 Jike Song 2008-01-07 20:43:16 UTC
Hi Roland,

I ran a git bisect (hmm, it took a whole day) between a bad kernel, 2.6.24-rc1, and a good kernel, 2.6.23. It points to commit b1b2e7cf4a9742f61d76fcb419b1fd13159876a5 as the culprit. Here is a patch to revert it; please apply it and check whether that fixes the problem.

thanks!
Comment 3 Jike Song 2008-01-07 20:46:20 UTC
Created attachment 14346
revert b1b2e7cf4a9742f61d76fcb419b1fd13159876a5
Comment 4 Karsten Keil 2008-01-08 04:25:43 UTC
No, this cannot be the reason for the Oops and it is completely unrelated to
module unload.
Comment 5 Jike Song 2008-01-09 01:11:22 UTC
(In reply to comment #4)
> No, this cannot be the reason for the Oops and it is completely unrelated to
> module unload.
> 

Yep, checking the return value of alloc_skb() should never introduce new bugs.
Comment 6 Gerd von Egidy 2008-01-17 10:59:10 UTC
I can reproduce the same (or similar) problem on 2.6.23.14.

modprobe capidrv;rmmod capidrv

produces:

capidrv: Rev 1.1.2.2 : unloaded
BUG: unable to handle kernel NULL pointer dereference at virtual address 00000007
 printing eip:
c012f3e1
*pde = 2d0b3067
*pte = 00000000
Oops: 0002 [#1]
Modules linked in: capidrv kernelcapi isdn deflate zlib_deflate twofish twofish_common camellia serpent blowfish des cbc ecb blkcipher aes xcbc sha256 crypto_null xfrm_user xfrm4_tunnel tunnel4 ipcomp esp4 ah4 af_key forcedeth e1000 iptable_mangle ipt_iprange xt_CONNMARK ipt_REDIRECT ipt_MASQUERADE iptable_nat xt_conntrack xt_connmark ipt_LOG xt_limit xt_TCPMSS ipt_REJECT ipt_recent nf_conntrack_ipv4 xt_state ipt_ACCOUNT xt_tcpudp xt_condition xt_policy xt_multiport iptable_filter ip_tables x_tables nf_nat_tftp nf_nat_irc nf_nat_pptp nf_nat_proto_gre nf_nat_ftp nf_nat nf_conntrack_tftp nf_conntrack_socks nf_conntrack_irc nf_conntrack_pptp nf_conntrack_proto_gre nf_conntrack_ftp nf_conntrack nfnetlink ext3 jbd mbcache reiserfs sg ahci libata amd74xx generic ehci_hcd ohci_hcd
CPU:    0
EIP:    0060:[__unlink_module+0xa/0x21]    Not tainted VLI
EFLAGS: 00010046   (2.6.23-1.i2n #1)
EIP is at __unlink_module+0xa/0x21
eax: f8b80b00   ebx: f8b80b04   ecx: 00000003   edx: 00000012
esi: 00000000   edi: bfc90e90   ebp: ed332000   esp: ed333f50
ds: 007b   es: 007b   fs: 0000  gs: 0000  ss: 0068
Process rmmod (pid: 2896, ti=ed332000 task=efac5550 task.ti=ed332000)
Stack: f8b80b00 c0130279 f8b80b00 c0131bc6 69706163 00767264 00000000 ed07a320
       c0147e46 ee61c580 ed07a320 c014876e ffffffff b7f25000 b7f24000 b7f25000
       ed07a5cc ed07a5c0 0061c580 f8b80b80 00000880 ed333fa8 00000000 bfc90e90
Call Trace:
 [free_module+0x9/0x9e] free_module+0x9/0x9e
 [sys_delete_module+0x165/0x17b] sys_delete_module+0x165/0x17b
 [remove_vma+0x31/0x36] remove_vma+0x31/0x36
 [do_munmap+0x193/0x1ac] do_munmap+0x193/0x1ac
 [syscall_call+0x7/0x0b] syscall_call+0x7/0xb
 =======================
Code: 94 00 00 00 00 0f 95 c0 0f b6 c0 c3 83 b8 98 00 00 00 00 0f 95 c0 0f b6 c0 c3 8b 80 80 01 00 00 c3 53 8b 48 04 8d 58 04 8b 53 04 <89> 51 04 89 0a c7 43 04 00 02 20 00 c7 40 04 00 01 10 00 31 c0
EIP: [__unlink_module+0xa/0x21] __unlink_module+0xa/0x21 SS:ESP 0068:ed333f50

I tried to nail it down but it's not easy: as soon as I add some printks to capidrv.c the bug does not cause immediate trouble anymore (but I think it's not gone). I think this is why removing the alloc_skb check helped Jike.

On my system the local variable "mod" of sys_delete_module is modified during the call to mod->exit(), which causes problems later on. I don't think anything deliberately targets mod, though; it looks like random stack or memory corruption. In my case the problem disappears if I remove the remove_proc_entry() call from capidrv's proc_exit, but I'm not sure that call has anything to do with it either.
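
To illustrate the suspicion in isolation, here is a minimal user-space sketch in C (not kernel code) of how a copy that runs one byte past a small local buffer can silently change whatever lives next to it. The frame layout, the sizes and all names (fake_frame, BUF_SIZE, frame_unchecked_copy) are assumptions made up for this demonstration; nothing here is taken from capidrv.c or sys_delete_module.

```c
#include <stddef.h>
#include <string.h>

/*
 * Hypothetical layout: a 10-byte local buffer at offset 0, followed
 * directly by the bytes of a neighbouring local variable. Everything is
 * modelled as one flat array so the "overflow" stays well defined here.
 */
enum { BUF_SIZE = 10, FRAME_SIZE = 16 };

struct fake_frame {
	unsigned char bytes[FRAME_SIZE];
};

/* Fill the frame with a marker so any overwritten byte is visible. */
static void frame_init(struct fake_frame *f)
{
	memset(f->bytes, 0xAA, sizeof(f->bytes));
}

/*
 * Copy src (including its terminating NUL) to offset 0 without honouring
 * BUF_SIZE, the way strcpy() into a too-small array would.
 * Returns the number of bytes written past the 10-byte buffer.
 */
static size_t frame_unchecked_copy(struct fake_frame *f, const char *src)
{
	size_t n = strlen(src) + 1;

	if (n > FRAME_SIZE)
		n = FRAME_SIZE;	/* keep the demonstration itself in bounds */
	memcpy(f->bytes, src, n);
	return n > BUF_SIZE ? n - BUF_SIZE : 0;
}
```

A 9-character string plus its NUL fits exactly; a 10-character string puts the NUL into the first byte of the neighbour. Whether such a stray byte matters in a real stack frame depends on what the compiler placed there, which would fit the observation that adding printks makes the symptom come and go.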
Comment 7 Roland Kletzing 2008-01-17 16:56:08 UTC
I cannot reproduce this anymore with http://ftp.suse.com/pub/projects/kernel/kotd/HEAD/i386/kernel-default-2.6.24_rc8-20080117182846.i586.rpm
Comment 8 Gerd von Egidy 2008-01-24 08:42:12 UTC
Created attachment 14564
proposed fix for capidrv

I found the bug: it is a long-standing bug that overwrites the stack. It only crashes the kernel when the memory happens to be laid out in a particular way, which is why you see it with some kernel versions and not with others.
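
The attached patch is not reproduced here. As a general illustration of the usual remedy for this class of bug, the following user-space C sketch copies a string into a fixed-size buffer without ever writing past its end. The helper name copy_bounded and its return convention are hypothetical and chosen only for this example; they are not the code from the patch.

```c
#include <stddef.h>
#include <string.h>

/*
 * Hypothetical helper: copy src into dst without ever writing more than
 * dst_size bytes. The result is always NUL-terminated when dst_size > 0.
 * Returns 1 if src did not fit completely, 0 if it was copied whole.
 */
static int copy_bounded(char *dst, size_t dst_size, const char *src)
{
	size_t len = strlen(src);

	if (dst_size == 0)
		return 1;		/* no room for anything, not even the NUL */
	if (len >= dst_size) {
		/* Keep the first dst_size - 1 characters and terminate. */
		memcpy(dst, src, dst_size - 1);
		dst[dst_size - 1] = '\0';
		return 1;
	}
	memcpy(dst, src, len + 1);	/* string plus its NUL fits */
	return 0;
}
```

With a 10-byte destination, a 10-character source is cut to 9 characters plus the NUL instead of spilling one byte onto the stack. Sizing the buffer generously as well means a longer string later on gets truncated rather than corrupting memory.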
Comment 9 Karsten Keil 2008-01-25 03:00:01 UTC
Yes, this makes a lot of sense. It was fixed in the init part, but it was overlooked that the exit code has a similar issue. I acked the patch. Thanks.
I will close it when the code is in the tree.
Comment 10 Roland Kletzing 2008-03-02 09:58:50 UTC
I found that the fix went upstream (http://git.kernel.org/?p=linux/kernel/git/torvalds/linux-2.6.git;a=commit;h=eb36f4fc019835cecf0788907f6cab774508087b ), so I'm closing this one, OK?
Comment 11 Thomas Jarosch 2008-04-29 06:15:13 UTC
Gerd's fix is not included in 2.6.24.x and therefore the kernel still oopses. I'm wondering what happened, as it was already part of 2.6.23.15 in February?

If you are looking for the patch in 2.6.23.15,
it's filed under Karsten Keil as author...
Comment 12 Roland Kletzing 2008-04-29 11:04:12 UTC
2.6.24 was tagged on 24.1.2008.

The patch went in just one day later, see http://git.kernel.org/?p=linux/kernel/git/torvalds/linux-2.6.git;a=commit;h=eb36f4fc019835cecf0788907f6cab774508087b
Comment 13 Roland Kletzing 2008-04-29 11:17:43 UTC
>I'm wondering what happened as it was already part of 2.6.23.15 in February?

2.6.23.x is a stable series, and so is 2.6.24.x. They are separate branches, each maintained for a longer time. What gets merged into the development kernel after their release does not automatically get into them, AFAIK.
Comment 14 Karsten Keil 2008-04-29 11:52:20 UTC
Yes, the stable branch for 2.6.24 was not open at that time, so the patch got lost. Greg now has it in his queue for the next 2.6.24 stable release. Thanks for spotting this.
