Bug 9696
Summary: | rmmod capidrv makes kernel oops and never returns | ||
---|---|---|---|
Product: | Drivers | Reporter: | Roland Kletzing (devzero) |
Component: | ISDN | Assignee: | Jike Song (albcamus) |
Status: | CLOSED CODE_FIX | ||
Severity: | normal | CC: | albcamus, gerd.von.egidy, kernel, thomas.jarosch |
Priority: | P1 | ||
Hardware: | All | ||
OS: | Linux | ||
Kernel Version: | 2.6.24-rc6 | Subsystem: | |
Regression: | --- | Bisected commit-id: | |
Attachments: | revert b1b2e7cf4a9742f61d76fcb419b1fd13159876a5; proposed fix for capidrv | ||
Description
Roland Kletzing
2008-01-05 14:03:49 UTC
Same problem with my Fedora 8, i386, vanilla kernel v2.6.24-rc3, but I cannot even save the oops message because Linux hangs immediately after the oops. Also, I cannot reproduce this oops within VMware.

Hi Roland, I ran a git bisect (hmm, it took a whole day) between a bad kernel, 2.6.24-rc1, and a good kernel, 2.6.23. It seems commit b1b2e7cf4a9742f61d76fcb419b1fd13159876a5 is the lucky one. Here is the patch to revert it; please apply it and check whether it is OK. Thanks!

Created attachment 14346: revert b1b2e7cf4a9742f61d76fcb419b1fd13159876a5
No, this cannot be the reason for the oops, and it is completely unrelated to module unload.

(In reply to comment #4)
> No, this cannot be the reason for the Oops and it is completely unrelated to
> module unload.

Yep, checking the return value of alloc_skb() should never introduce new BUGs.

I can reproduce the same (or a similar) problem on 2.6.23.14. Running "modprobe capidrv; rmmod capidrv" produces:

capidrv: Rev 1.1.2.2 : unloaded
BUG: unable to handle kernel NULL pointer dereference at virtual address 00000007
 printing eip:
c012f3e1
*pde = 2d0b3067
*pte = 00000000
Oops: 0002 [#1]
Modules linked in: capidrv kernelcapi isdn deflate zlib_deflate twofish twofish_common camellia serpent blowfish des cbc ecb blkcipher aes xcbc sha256 crypto_null xfrm_user xfrm4_tunnel tunnel4 ipcomp esp4 ah4 af_key forcedeth e1000 iptable_mangle ipt_iprange xt_CONNMARK ipt_REDIRECT ipt_MASQUERADE iptable_nat xt_conntrack xt_connmark ipt_LOG xt_limit xt_TCPMSS ipt_REJECT ipt_recent nf_conntrack_ipv4 xt_state ipt_ACCOUNT xt_tcpudp xt_condition xt_policy xt_multiport iptable_filter ip_tables x_tables nf_nat_tftp nf_nat_irc nf_nat_pptp nf_nat_proto_gre nf_nat_ftp nf_nat nf_conntrack_tftp nf_conntrack_socks nf_conntrack_irc nf_conntrack_pptp nf_conntrack_proto_gre nf_conntrack_ftp nf_conntrack nfnetlink ext3 jbd mbcache reiserfs sg ahci libata amd74xx generic ehci_hcd ohci_hcd
CPU: 0
EIP: 0060:[__unlink_module+0xa/0x21] Not tainted VLI
EFLAGS: 00010046 (2.6.23-1.i2n #1)
EIP is at __unlink_module+0xa/0x21
eax: f8b80b00 ebx: f8b80b04 ecx: 00000003 edx: 00000012
esi: 00000000 edi: bfc90e90 ebp: ed332000 esp: ed333f50
ds: 007b es: 007b fs: 0000 gs: 0000 ss: 0068
Process rmmod (pid: 2896, ti=ed332000 task=efac5550 task.ti=ed332000)
Stack: f8b80b00 c0130279 f8b80b00 c0131bc6 69706163 00767264 00000000 ed07a320
       c0147e46 ee61c580 ed07a320 c014876e ffffffff b7f25000 b7f24000 b7f25000
       ed07a5cc ed07a5c0 0061c580 f8b80b80 00000880 ed333fa8 00000000 bfc90e90
Call Trace:
 [free_module+0x9/0x9e] free_module+0x9/0x9e
 [sys_delete_module+0x165/0x17b] sys_delete_module+0x165/0x17b
 [remove_vma+0x31/0x36] remove_vma+0x31/0x36
 [do_munmap+0x193/0x1ac] do_munmap+0x193/0x1ac
 [syscall_call+0x7/0x0b] syscall_call+0x7/0xb
=======================
Code: 94 00 00 00 00 0f 95 c0 0f b6 c0 c3 83 b8 98 00 00 00 00 0f 95 c0 0f b6 c0 c3 8b 80 80 01 00 00 c3 53 8b 48 04 8d 58 04 8b 53 04 <89> 51 04 89 0a c7 43 04 00 02 20 00 c7 40 04 00 01 10 00 31 c0
EIP: [__unlink_module+0xa/0x21] __unlink_module+0xa/0x21 SS:ESP 0068:ed333f50

I tried to nail it down, but it's not easy: as soon as I add some printks to capidrv.c, the bug no longer causes immediate trouble (but I think it's not gone). I think this is why removing the alloc_skb check helped Jike. On my system the local variable "mod" of sys_delete_module is modified during the call to mod->exit() and thus causes problems later on. I don't think the corruption targets mod specifically, though; it looks like some random stack or memory corruption. In my case the problem disappears if I remove the remove_proc_entry() call from capidrv's proc_exit, but I'm not sure whether that has anything to do with it either.

I cannot reproduce this anymore with http://ftp.suse.com/pub/projects/kernel/kotd/HEAD/i386/kernel-default-2.6.24_rc8-20080117182846.i586.rpm

Created attachment 14564: proposed fix for capidrv
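
A side note on the oops quoted above, offered as one possible reading rather than something stated in this thread: the instruction marked in the Code: line (bytes 89 51 04) is a 32-bit store of edx to the address ecx+4. With ecx = 00000003 that address is 00000007, which matches the reported fault address. This is what the first step of unlinking a doubly linked list node ("next->prev = prev") would do if the node's next pointer had been overwritten with garbage. The sketch below only models that address arithmetic in user space; struct list_head32 and first_unlink_store_address() are hypothetical names, and the 4-byte pointer layout is an assumption taken from the i386 register dump.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical 32-bit model of a doubly linked list node (i386 layout). */
struct list_head32 {
    uint32_t next;  /* address of the next node */
    uint32_t prev;  /* address of the previous node */
};

/*
 * Unlinking a node begins with "next->prev = prev". Return the address
 * that store would touch for a given (possibly corrupted) next pointer.
 * Nothing is dereferenced here; this is pure address arithmetic.
 */
static uint32_t first_unlink_store_address(const struct list_head32 *entry)
{
    assert(entry != NULL);
    return entry->next + (uint32_t)offsetof(struct list_head32, prev);
}
```

Feeding in the register values from the dump (next = 0x00000003, prev = 0x00000012) yields 0x00000007, consistent with the idea that the list pointers were corrupted before the unlink ran.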
I found the bug: it is a long-standing bug that overwrites the stack. If the memory is aligned in some special way, it crashes the kernel; this is why you can see it with some kernel versions and not with others.
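
The patch itself is not quoted in this thread, so the following is only a generic user-space illustration of the class of bug described (a string copied into a fixed-size stack buffer without a length bound), not the capidrv code. The helper name copy_revision() and the RCS-style "$Revision: ... $" input are assumptions, chosen because the unload message prints "Rev 1.1.2.2". An unbounded strcpy() into a buffer that is too small would write past its end and clobber neighbouring stack data; the bounded variant truncates instead.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/*
 * Hypothetical helper, not taken from capidrv: extract the revision
 * number from a keyword string such as "$Revision: 1.1.2.2 $" into a
 * fixed-size buffer without ever writing past dst[dst_size - 1].
 * Returns the number of characters stored (excluding the terminator).
 */
static size_t copy_revision(char *dst, size_t dst_size, const char *keyword)
{
    const char *start;
    size_t len;

    assert(dst != NULL && dst_size > 0 && keyword != NULL);
    dst[0] = '\0';

    start = strchr(keyword, ':');
    if (start == NULL)
        return 0;               /* no keyword separator found */
    start++;
    while (*start == ' ')
        start++;                /* skip blanks after the colon */

    /* The revision text ends at the next space or '$'. */
    len = strcspn(start, " $");
    if (len >= dst_size)
        len = dst_size - 1;     /* truncate instead of overflowing */

    memcpy(dst, start, len);
    dst[len] = '\0';
    return len;
}
```

With a 32-byte buffer the full revision is kept; with a deliberately small buffer the result is cut short, but the bytes after the buffer are never touched.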
Yes, this makes a lot of sense: it was fixed in the init part, but it was overlooked that the exit code has a similar issue. I acked the patch. Thanks.

I will close it when the code is in the tree.

I found that the fix went upstream (http://git.kernel.org/?p=linux/kernel/git/torvalds/linux-2.6.git;a=commit;h=eb36f4fc019835cecf0788907f6cab774508087b ), so I'm closing this one, OK?

Gerd's fix is not included in 2.6.24.x, and therefore the kernel still oopses. I'm wondering what happened, as it was already part of 2.6.23.15 in February? If you are looking for the patch in 2.6.23.15, it's filed under Karsten Keil as author...

2.6.24 was tagged on 24.1.2008; the patch went in just one day later, see http://git.kernel.org/?p=linux/kernel/git/torvalds/linux-2.6.git;a=commit;h=eb36f4fc019835cecf0788907f6cab774508087b

> I'm wondering what happened as it was already part of 2.6.23.15 in February?

2.6.23.x is a stable series, and so is 2.6.24.x. They are different branches, each maintained for a longer time; what gets merged into the development kernel after their release doesn't automatically get into them, AFAIK.

Yes, the stable branch for 2.6.24 was not open at that time, so it got lost. Now Greg has it in his queue for the next 2.6.24 stable patch. Thanks for spotting this.