Discovered this on one of the kernel.org machines - it's been happening semi-consistently on a pair of boxes. Seems like a double free somewhere and at that point the whole box falls over dead basically. ------------[ cut here ]------------ kernel BUG at mm/slub.c:2835! invalid opcode: 0000 [#1] SMP last sysfs file: /sys/kernel/mm/ksm/run CPU 1 Modules linked in: ocfs2 mptctl mptbase drbd lru_cache nfsd lockd nfs_acl auth_rpcgss sunrpc ocfs2_dlmfs ocfs2_stack_o2cb ocfs2_dlm ocfs2_nodemanager ocfs2_stackglue configfs cpufreq_ondemand powernow_k8 freq_table 8021q garp stp llc ip6t_REJECT nf_conntrack_ipv6 ip6table_filter ip6_tables ipv6 xfs exportfs tg3 hpwdt amd64_edac_mod i2c_amd756 i2c_core edac_core shpchp k8temp amd_rng edac_mce_amd microcode pata_acpi ata_generic cciss pata_amd [last unloaded: scsi_wait_scan] Pid: 1713, comm: snmpd Not tainted 2.6.34.7-56.fc13.x86_64 #1 /ProLiant DL385 G1 RIP: 0010:[<ffffffff811006d6>] [<ffffffff811006d6>] kfree+0x5e/0xcb RSP: 0018:ffff8801f6433df8 EFLAGS: 00010246 RAX: 0040000000000400 RBX: ffff8803ed0eb9b0 RCX: ffff8803e9c92340 RDX: ffffea0000000000 RSI: ffffea0003800000 RDI: ffff880100000002 RBP: ffff8801f6433e18 R08: ffff8803e9c92958 R09: 0000000000000000 R10: 0000000000000011 R11: 0000000000000246 R12: ffff880100000002 R13: ffffffff81125e27 R14: ffffffff8115dbdc R15: ffff8803dd19ea80 FS: 00007ff4a31917a0(0000) GS:ffff880207400000(0000) knlGS:00000000f76fa6d0 CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 CR2: 00007ff4a31b2000 CR3: 00000001f6455000 CR4: 00000000000006e0 DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000 DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400 Process snmpd (pid: 1713, threadinfo ffff8801f6432000, task ffff8801f5688000) Stack: ffff8801f6433e18 ffff8803ed0eb9b0 ffff8803f61ec480 ffff8803ed0eb9b0 <0> ffff8801f6433e48 ffffffff81125e27 ffff8801f6433e38 ffff8803dd19ea80 <0> ffff8803ed0eb9b0 ffff8803e9c92940 ffff8801f6433e78 ffffffff8115dc10 Call Trace: [<ffffffff81125e27>] seq_release_private+0x28/0x44 [<ffffffff8115dc10>] seq_release_net+0x34/0x3d [<ffffffff81155ada>] proc_reg_release+0xd3/0xf0 [<ffffffff8110efbb>] __fput+0x12a/0x1dc [<ffffffff8110f087>] fput+0x1a/0x1c [<ffffffff8110c0f7>] filp_close+0x68/0x72 [<ffffffff8110c19e>] sys_close+0x9d/0xd2 [<ffffffff81009c72>] system_call_fastpath+0x16/0x1b Code: ef ff 13 48 83 c3 08 48 83 3b 00 eb ec 49 83 fc 10 76 7d 4c 89 e7 e8 67 e4 ff ff 48 89 c6 48 8b 00 84 c0 78 14 66 a9 00 c0 75 04 <0f> 0b eb fe 48 89 f7 e8 66 36 fd ff eb 57 48 8b 4d 08 48 8b 7e RIP [<ffffffff811006d6>] kfree+0x5e/0xcb RSP <ffff8801f6433df8> ---[ end trace 1a4b1fd758dd1fdb ]--- general protection fault: 0000 [#2] SMP last sysfs file: /sys/devices/pci0000:00/0000:00:19.3/name CPU 1 Modules linked in: ocfs2 mptctl mptbase drbd lru_cache nfsd lockd nfs_acl auth_rpcgss sunrpc ocfs2_dlmfs ocfs2_stack_o2cb ocfs2_dlm ocfs2_nodemanager ocfs2_stackglue configfs cpufreq_ondemand powernow_k8 freq_table 8021q garp stp llc ip6t_REJECT nf_conntrack_ipv6 ip6table_filter ip6_tables ipv6 xfs exportfs tg3 hpwdt amd64_edac_mod i2c_amd756 i2c_core edac_core shpchp k8temp amd_rng edac_mce_amd microcode pata_acpi ata_generic cciss pata_amd [last unloaded: scsi_wait_scan] Pid: 16274, comm: snmpd Tainted: G D 2.6.34.7-56.fc13.x86_64 #1 /ProLiant DL385 G1 RIP: 0010:[<ffffffff8110136b>] [<ffffffff8110136b>] __kmalloc_track_caller+0xe3/0x14c RSP: 0018:ffff8800380dbcc8 EFLAGS: 00010006 RAX: 0000000000000000 RBX: 0003000000000000 RCX: 000000000000000b RDX: 0000000100000000 RSI: 00000000000006a9 RDI: ffffffff8177d901 RBP: ffff8800380dbd18 R08: ffff880207412570 R09: ffff8800380dbe88 R10: ffff8800380dbf28 R11: 0000000000000000 R12: ffffffff81a28520 R13: 000000000000000b R14: 00000000000000d0 R15: 00000000000000d0 FS: 00007f4afe3857a0(0000) GS:ffff880207400000(0000) knlGS:00000000f6d92b70 CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 CR2: 00007f4afc38cd50 CR3: 00000000f1fe8000 CR4: 00000000000006e0 DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000 DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400 Process snmpd (pid: 16274, threadinfo ffff8800380da000, task ffff8801ccdcddc0) Stack: ffff8800380dbcf8 0000000000000003 ffffffff81117373 0000000000000246 <0> ffff8800380dbd18 ffff8803ed0178e0 ffff8803ed017840 ffff8801f0938a50 <0> ffff8801f09046c0 0000000000000000 ffff8800380dbd48 ffffffff810dd233 Call Trace: [<ffffffff81117373>] ? vfs_rename+0xb2/0x3e1 [<ffffffff810dd233>] kstrdup+0x31/0x49 [<ffffffff81117373>] vfs_rename+0xb2/0x3e1 [<ffffffff81116523>] ? __lookup_hash+0x55/0xf1 [<ffffffff811ce973>] ? security_inode_permission+0x21/0x23 [<ffffffff81118cfd>] sys_renameat+0x193/0x20c [<ffffffff810e4e5a>] ? handle_mm_fault+0x452/0x97b [<ffffffff81450195>] ? do_page_fault+0x28e/0x2bb [<ffffffff81118d91>] sys_rename+0x1b/0x1d [<ffffffff81009c72>] system_call_fastpath+0x16/0x1b Code: 90 66 90 48 89 45 c8 fa 66 66 90 66 66 90 65 4c 8b 04 25 90 e8 00 00 49 8b 04 24 49 01 c0 49 8b 18 48 85 db 74 0e 49 63 44 24 18 <48> 8b 04 03 49 89 00 eb 15 48 8b 4d c0 83 ca ff 44 89 fe 4c 89 RIP [<ffffffff8110136b>] __kmalloc_track_caller+0xe3/0x14c RSP <ffff8800380dbcc8> ---[ end trace 1a4b1fd758dd1fdc ]--- block drbd1: write: error=-95 s=1658s block drbd1: Method to ensure write ordering: flush block drbd1: local disk flush failed with status -95 block drbd1: Method to ensure write ordering: drain o2net: accepted connection from node demeter2.kernel.org (num 2) at 172.20.0.20:7777 ocfs2_dlm: Node 2 joins domain FC86A681BA714C7AA126836FFC1D4C8C ocfs2_dlm: Nodes in domain ("FC86A681BA714C7AA126836FFC1D4C8C"): 1 2 general protection fault: 0000 [#3] SMP last sysfs file: /sys/kernel/mm/ksm/run CPU 1 Modules linked in: ocfs2 mptctl mptbase drbd lru_cache nfsd lockd nfs_acl auth_rpcgss sunrpc ocfs2_dlmfs ocfs2_stack_o2cb ocfs2_dlm ocfs2_nodemanager ocfs2_stackglue configfs cpufreq_ondemand powernow_k8 freq_table 8021q garp stp llc ip6t_REJECT nf_conntrack_ipv6 ip6table_filter ip6_tables ipv6 xfs exportfs tg3 hpwdt amd64_edac_mod i2c_amd756 i2c_core edac_core shpchp k8temp amd_rng edac_mce_amd microcode pata_acpi ata_generic cciss pata_amd [last unloaded: scsi_wait_scan] Pid: 29724, comm: lsof Tainted: G D 2.6.34.7-56.fc13.x86_64 #1 /ProLiant DL385 G1 RIP: 0010:[<ffffffff81101615>] [<ffffffff81101615>] __kmalloc+0xeb/0x150 RSP: 0018:ffff8801c9cd5be8 EFLAGS: 00010006 RAX: 0000000000000000 RBX: 0003000000000000 RCX: ffffffff811264ff RDX: 0000000100000000 RSI: 00000000000006a9 RDI: ffffffff8177d901 RBP: ffff8801c9cd5c28 R08: ffff880207412570 R09: ffff8803ea9aed80 R10: ffff8803ce0767c0 R11: 0000000000000000 R12: ffffffff81a28520 R13: 0000000000000010 R14: 00000000000080d0 R15: 00000000000080d0 FS: 00007f364411c7a0(0000) GS:ffff880207400000(0000) knlGS:00000000f5990b70 CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 CR2: 00000000004091c0 CR3: 000000019f488000 CR4: 00000000000006e0 DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000 DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400 Process lsof (pid: 29724, threadinfo ffff8801c9cd4000, task ffff8801cd6c8000) Stack: ffffffff811264ff 0000000000000246 ffffc900013ed638 ffff8803ee7dfc00 <0> ffffffff81666660 ffff8803ce0765b0 ffff8801f6bed1f8 ffff8803ee7dfc00 <0> ffff8801c9cd5c58 ffffffff811264ff ffff8801c9cd5c58 ffffffff81e074e0 Call Trace: [<ffffffff811264ff>] ? __seq_open_private+0x25/0x5f [<ffffffff811264ff>] __seq_open_private+0x25/0x5f [<ffffffff8115dcf9>] seq_open_net+0x65/0x8c [<ffffffff814217b6>] unix_seq_open+0x1a/0x1c [<ffffffff81155f64>] proc_reg_open+0xd7/0x163 [<ffffffff8115dbdc>] ? seq_release_net+0x0/0x3d [<ffffffff81155e8d>] ? proc_reg_open+0x0/0x163 [<ffffffff8110c4a1>] __dentry_open+0x173/0x2aa [<ffffffff811ce973>] ? security_inode_permission+0x21/0x23 [<ffffffff8110c6a7>] nameidata_to_filp+0x3f/0x50 [<ffffffff81117e21>] do_last+0x447/0x5b8 [<ffffffff81119868>] do_filp_open+0x217/0x5fe [<ffffffff81214e23>] ? might_fault+0x21/0x23 [<ffffffff811225ca>] ? alloc_fd+0x7b/0x124 [<ffffffff8110c236>] do_sys_open+0x63/0x10f [<ffffffff8110c315>] sys_open+0x20/0x22 [<ffffffff81009c72>] system_call_fastpath+0x16/0x1b Code: 90 66 90 48 89 45 c8 fa 66 66 90 66 66 90 65 4c 8b 04 25 90 e8 00 00 49 8b 04 24 49 01 c0 49 8b 18 48 85 db 74 0e 49 63 44 24 18 <48> 8b 04 03 49 89 00 eb 11 83 ca ff 44 89 fe 4c 89 e7 e8 16 eb RIP [<ffffffff81101615>] __kmalloc+0xeb/0x150 RSP <ffff8801c9cd5be8> ---[ end trace 1a4b1fd758dd1fdd ]---
(switched to email. Please respond via emailed reply-to-all, not via the bugzilla web interface). On Mon, 18 Oct 2010 19:05:30 GMT bugzilla-daemon@bugzilla.kernel.org wrote: > https://bugzilla.kernel.org/show_bug.cgi?id=20702 > > Summary: Kernel bug, possible double free, effecting kernel.org > machines > Product: Memory Management > Version: 2.5 > Platform: All > OS/Version: Linux > Tree: Mainline > Status: NEW > Severity: normal > Priority: P1 > Component: Slab Allocator > AssignedTo: akpm@linux-foundation.org > ReportedBy: warthog9@kernel.org > Regression: No > > > Discovered this on one of the kernel.org machines - it's been happening > semi-consistently on a pair of boxes. Seems like a double free somewhere and > at that point the whole box falls over dead basically. > > ------------[ cut here ]------------ > kernel BUG at mm/slub.c:2835! Gee. That's : void kfree(const void *x) : { : struct page *page; : void *object = (void *)x; : : trace_kfree(_RET_IP_, x); : : if (unlikely(ZERO_OR_NULL_PTR(x))) : return; : : page = virt_to_head_page(x); : if (unlikely(!PageSlab(page))) { : BUG_ON(!PageCompound(page)); <<---- : kmemleak_free(x); : put_page(page); : return; : } : slab_free(page->slab, page, object, _RET_IP_); : } and yes, I'd assume that we fed garbage to kfree(). > invalid opcode: 0000 [#1] SMP > last sysfs file: /sys/kernel/mm/ksm/run Question is: what is /sys/kernel/mm/ksm/run? I can't immediately locate any code in 2.6.34.7 which could create that. Is it something which RH added to that kernel? > CPU 1 > Modules linked in: ocfs2 mptctl mptbase drbd lru_cache nfsd lockd nfs_acl > auth_rpcgss sunrpc ocfs2_dlmfs ocfs2_stack_o2cb ocfs2_dlm ocfs2_nodemanager > ocfs2_stackglue configfs cpufreq_ondemand powernow_k8 freq_table 8021q garp > stp > llc ip6t_REJECT nf_conntrack_ipv6 ip6table_filter ip6_tables ipv6 xfs > exportfs > tg3 hpwdt amd64_edac_mod i2c_amd756 i2c_core edac_core shpchp k8temp amd_rng > edac_mce_amd microcode pata_acpi ata_generic cciss pata_amd [last unloaded: > scsi_wait_scan] > > Pid: 1713, comm: snmpd Not tainted 2.6.34.7-56.fc13.x86_64 #1 /ProLiant DL385 > G1 > RIP: 0010:[<ffffffff811006d6>] [<ffffffff811006d6>] kfree+0x5e/0xcb > RSP: 0018:ffff8801f6433df8 EFLAGS: 00010246 > RAX: 0040000000000400 RBX: ffff8803ed0eb9b0 RCX: ffff8803e9c92340 > RDX: ffffea0000000000 RSI: ffffea0003800000 RDI: ffff880100000002 > RBP: ffff8801f6433e18 R08: ffff8803e9c92958 R09: 0000000000000000 > R10: 0000000000000011 R11: 0000000000000246 R12: ffff880100000002 > R13: ffffffff81125e27 R14: ffffffff8115dbdc R15: ffff8803dd19ea80 > FS: 00007ff4a31917a0(0000) GS:ffff880207400000(0000) knlGS:00000000f76fa6d0 > CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 > CR2: 00007ff4a31b2000 CR3: 00000001f6455000 CR4: 00000000000006e0 > DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000 > DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400 > Process snmpd (pid: 1713, threadinfo ffff8801f6432000, task ffff8801f5688000) > Stack: > ffff8801f6433e18 ffff8803ed0eb9b0 ffff8803f61ec480 ffff8803ed0eb9b0 > <0> ffff8801f6433e48 ffffffff81125e27 ffff8801f6433e38 ffff8803dd19ea80 > <0> ffff8803ed0eb9b0 ffff8803e9c92940 ffff8801f6433e78 ffffffff8115dc10 > Call Trace: > [<ffffffff81125e27>] seq_release_private+0x28/0x44 > [<ffffffff8115dc10>] seq_release_net+0x34/0x3d > [<ffffffff81155ada>] proc_reg_release+0xd3/0xf0 > [<ffffffff8110efbb>] __fput+0x12a/0x1dc > [<ffffffff8110f087>] fput+0x1a/0x1c > [<ffffffff8110c0f7>] filp_close+0x68/0x72 > [<ffffffff8110c19e>] sys_close+0x9d/0xd2 > [<ffffffff81009c72>] system_call_fastpath+0x16/0x1b > Code: ef ff 13 48 83 c3 08 48 83 3b 00 eb ec 49 83 fc 10 76 7d 4c 89 e7 e8 67 > e4 ff ff 48 89 c6 48 8b 00 84 c0 78 14 66 a9 00 c0 75 04 <0f> 0b eb fe 48 89 > f7 > e8 66 36 fd ff eb 57 48 8b 4d 08 48 8b 7e > RIP [<ffffffff811006d6>] kfree+0x5e/0xcb > RSP <ffff8801f6433df8> > ---[ end trace 1a4b1fd758dd1fdb ]--- But then again, that's a procfs trace, not a sysfs trace. proc_reg_release() makes my brain hurt.
On Mon, Oct 18, 2010 at 01:48:56PM -0700, Andrew Morton wrote: > > Modules linked in: ocfs2 mptctl mptbase drbd lru_cache nfsd lockd nfs_acl > > auth_rpcgss sunrpc ocfs2_dlmfs ocfs2_stack_o2cb ocfs2_dlm ocfs2_nodemanager > > ocfs2_stackglue configfs cpufreq_ondemand powernow_k8 freq_table 8021q garp > stp > > llc ip6t_REJECT nf_conntrack_ipv6 ip6table_filter ip6_tables ipv6 xfs > exportfs > > tg3 hpwdt amd64_edac_mod i2c_amd756 i2c_core edac_core shpchp k8temp > amd_rng > > edac_mce_amd microcode pata_acpi ata_generic cciss pata_amd [last unloaded: > > scsi_wait_scan] > > > > Pid: 1713, comm: snmpd Not tainted 2.6.34.7-56.fc13.x86_64 #1 /ProLiant > DL385 > > G1 > > RIP: 0010:[<ffffffff811006d6>] [<ffffffff811006d6>] kfree+0x5e/0xcb > > RSP: 0018:ffff8801f6433df8 EFLAGS: 00010246 > > RAX: 0040000000000400 RBX: ffff8803ed0eb9b0 RCX: ffff8803e9c92340 > > RDX: ffffea0000000000 RSI: ffffea0003800000 RDI: ffff880100000002 > > RBP: ffff8801f6433e18 R08: ffff8803e9c92958 R09: 0000000000000000 > > R10: 0000000000000011 R11: 0000000000000246 R12: ffff880100000002 > > R13: ffffffff81125e27 R14: ffffffff8115dbdc R15: ffff8803dd19ea80 > > FS: 00007ff4a31917a0(0000) GS:ffff880207400000(0000) > knlGS:00000000f76fa6d0 > > CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 > > CR2: 00007ff4a31b2000 CR3: 00000001f6455000 CR4: 00000000000006e0 > > DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000 > > DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400 > > Process snmpd (pid: 1713, threadinfo ffff8801f6432000, task > ffff8801f5688000) > > Stack: > > ffff8801f6433e18 ffff8803ed0eb9b0 ffff8803f61ec480 ffff8803ed0eb9b0 > > <0> ffff8801f6433e48 ffffffff81125e27 ffff8801f6433e38 ffff8803dd19ea80 > > <0> ffff8803ed0eb9b0 ffff8803e9c92940 ffff8801f6433e78 ffffffff8115dc10 > > Call Trace: > > [<ffffffff81125e27>] seq_release_private+0x28/0x44 > > [<ffffffff8115dc10>] seq_release_net+0x34/0x3d > > [<ffffffff81155ada>] proc_reg_release+0xd3/0xf0 > > [<ffffffff8110efbb>] __fput+0x12a/0x1dc > > [<ffffffff8110f087>] fput+0x1a/0x1c > > [<ffffffff8110c0f7>] filp_close+0x68/0x72 > > [<ffffffff8110c19e>] sys_close+0x9d/0xd2 > > [<ffffffff81009c72>] system_call_fastpath+0x16/0x1b > > Code: ef ff 13 48 83 c3 08 48 83 3b 00 eb ec 49 83 fc 10 76 7d 4c 89 e7 e8 > 67 > > e4 ff ff 48 89 c6 48 8b 00 84 c0 78 14 66 a9 00 c0 75 04 <0f> 0b eb fe 48 > 89 f7 > > e8 66 36 fd ff eb 57 48 8b 4d 08 48 8b 7e > > RIP [<ffffffff811006d6>] kfree+0x5e/0xcb > > RSP <ffff8801f6433df8> > > ---[ end trace 1a4b1fd758dd1fdb ]--- > > But then again, that's a procfs trace, not a sysfs trace. > > proc_reg_release() makes my brain hurt. That complex code triggers only when module unloads and removes its /proc entries. Just to confirm, rmmod wasn't executed, last module still is scsi_wait_scan from boot sequence?
> Question is: what is /sys/kernel/mm/ksm/run? I can't immediately > locate any code in 2.6.34.7 which could create that. Is it something > which RH added to that kernel? That's Kernel Samepage Merging, the virtualization thing. It's there since 2.6.32, it lives at mm/ksm.c:ksm_init.
(In reply to comment #2) > On Mon, Oct 18, 2010 at 01:48:56PM -0700, Andrew Morton wrote: > > > Modules linked in: ocfs2 mptctl mptbase drbd lru_cache nfsd lockd nfs_acl > > > auth_rpcgss sunrpc ocfs2_dlmfs ocfs2_stack_o2cb ocfs2_dlm > ocfs2_nodemanager > > > ocfs2_stackglue configfs cpufreq_ondemand powernow_k8 freq_table 8021q > garp stp > > > llc ip6t_REJECT nf_conntrack_ipv6 ip6table_filter ip6_tables ipv6 xfs > exportfs > > > tg3 hpwdt amd64_edac_mod i2c_amd756 i2c_core edac_core shpchp k8temp > amd_rng > > > edac_mce_amd microcode pata_acpi ata_generic cciss pata_amd [last > unloaded: > > > scsi_wait_scan] > > > > > > Pid: 1713, comm: snmpd Not tainted 2.6.34.7-56.fc13.x86_64 #1 /ProLiant > DL385 > > > G1 > > > RIP: 0010:[<ffffffff811006d6>] [<ffffffff811006d6>] kfree+0x5e/0xcb > > > RSP: 0018:ffff8801f6433df8 EFLAGS: 00010246 > > > RAX: 0040000000000400 RBX: ffff8803ed0eb9b0 RCX: ffff8803e9c92340 > > > RDX: ffffea0000000000 RSI: ffffea0003800000 RDI: ffff880100000002 > > > RBP: ffff8801f6433e18 R08: ffff8803e9c92958 R09: 0000000000000000 > > > R10: 0000000000000011 R11: 0000000000000246 R12: ffff880100000002 > > > R13: ffffffff81125e27 R14: ffffffff8115dbdc R15: ffff8803dd19ea80 > > > FS: 00007ff4a31917a0(0000) GS:ffff880207400000(0000) > knlGS:00000000f76fa6d0 > > > CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 > > > CR2: 00007ff4a31b2000 CR3: 00000001f6455000 CR4: 00000000000006e0 > > > DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000 > > > DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400 > > > Process snmpd (pid: 1713, threadinfo ffff8801f6432000, task > ffff8801f5688000) > > > Stack: > > > ffff8801f6433e18 ffff8803ed0eb9b0 ffff8803f61ec480 ffff8803ed0eb9b0 > > > <0> ffff8801f6433e48 ffffffff81125e27 ffff8801f6433e38 ffff8803dd19ea80 > > > <0> ffff8803ed0eb9b0 ffff8803e9c92940 ffff8801f6433e78 ffffffff8115dc10 > > > Call Trace: > > > [<ffffffff81125e27>] seq_release_private+0x28/0x44 > > > [<ffffffff8115dc10>] seq_release_net+0x34/0x3d > > > [<ffffffff81155ada>] proc_reg_release+0xd3/0xf0 > > > [<ffffffff8110efbb>] __fput+0x12a/0x1dc > > > [<ffffffff8110f087>] fput+0x1a/0x1c > > > [<ffffffff8110c0f7>] filp_close+0x68/0x72 > > > [<ffffffff8110c19e>] sys_close+0x9d/0xd2 > > > [<ffffffff81009c72>] system_call_fastpath+0x16/0x1b > > > Code: ef ff 13 48 83 c3 08 48 83 3b 00 eb ec 49 83 fc 10 76 7d 4c 89 e7 > e8 67 > > > e4 ff ff 48 89 c6 48 8b 00 84 c0 78 14 66 a9 00 c0 75 04 <0f> 0b eb fe 48 > 89 f7 > > > e8 66 36 fd ff eb 57 48 8b 4d 08 48 8b 7e > > > RIP [<ffffffff811006d6>] kfree+0x5e/0xcb > > > RSP <ffff8801f6433df8> > > > ---[ end trace 1a4b1fd758dd1fdb ]--- > > > > But then again, that's a procfs trace, not a sysfs trace. > > > > proc_reg_release() makes my brain hurt. > > That complex code triggers only when module unloads and removes > its /proc entries. > > Just to confirm, rmmod wasn't executed, last module still is > scsi_wait_scan from boot sequence? I'm not sure that scsi_wait_scan was the last module pushed into the kernel at all. I've got two boxes (one of which is the box with the error above), nearly identical software wise, and neither has scsi_wait_scan even loaded (they are up and running right now). So I'm not sure where that came from, or what depended on that. The last module, on each machine, that got loaded during boot is ocfs2. This got loaded from the init script for it. scsi_wait_scan isn't even listed according to lsmod. I did not explicitly execute rmmod, restart ocfs2 or drbd (both of which mess with their respective modules being loaded), but those shouldn't have been doing much with their modules since they were up and running
(switched to email. Please respond via emailed reply-to-all, not via the bugzilla web interface). ^^ See this? Please don't use the bugzilla interface for this bug! Nobody saw your reply. On Mon, 18 Oct 2010 23:56:22 GMT bugzilla-daemon@bugzilla.kernel.org wrote: > https://bugzilla.kernel.org/show_bug.cgi?id=20702 > > > > > > --- Comment #4 from John 'Warthog9' Hawley <warthog9@kernel.org> 2010-10-18 > 23:56:20 --- > (In reply to comment #2) > > On Mon, Oct 18, 2010 at 01:48:56PM -0700, Andrew Morton wrote: > > > > Modules linked in: ocfs2 mptctl mptbase drbd lru_cache nfsd lockd > nfs_acl > > > > auth_rpcgss sunrpc ocfs2_dlmfs ocfs2_stack_o2cb ocfs2_dlm > ocfs2_nodemanager > > > > ocfs2_stackglue configfs cpufreq_ondemand powernow_k8 freq_table 8021q > garp stp > > > > llc ip6t_REJECT nf_conntrack_ipv6 ip6table_filter ip6_tables ipv6 xfs > exportfs > > > > tg3 hpwdt amd64_edac_mod i2c_amd756 i2c_core edac_core shpchp k8temp > amd_rng > > > > edac_mce_amd microcode pata_acpi ata_generic cciss pata_amd [last > unloaded: > > > > scsi_wait_scan] > > > > > > > > Pid: 1713, comm: snmpd Not tainted 2.6.34.7-56.fc13.x86_64 #1 /ProLiant > DL385 > > > > G1 > > > > RIP: 0010:[<ffffffff811006d6>] [<ffffffff811006d6>] kfree+0x5e/0xcb > > > > RSP: 0018:ffff8801f6433df8 EFLAGS: 00010246 > > > > RAX: 0040000000000400 RBX: ffff8803ed0eb9b0 RCX: ffff8803e9c92340 > > > > RDX: ffffea0000000000 RSI: ffffea0003800000 RDI: ffff880100000002 > > > > RBP: ffff8801f6433e18 R08: ffff8803e9c92958 R09: 0000000000000000 > > > > R10: 0000000000000011 R11: 0000000000000246 R12: ffff880100000002 > > > > R13: ffffffff81125e27 R14: ffffffff8115dbdc R15: ffff8803dd19ea80 > > > > FS: 00007ff4a31917a0(0000) GS:ffff880207400000(0000) > knlGS:00000000f76fa6d0 > > > > CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 > > > > CR2: 00007ff4a31b2000 CR3: 00000001f6455000 CR4: 00000000000006e0 > > > > DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000 > > > > DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400 > > > > Process snmpd (pid: 1713, threadinfo ffff8801f6432000, task > ffff8801f5688000) > > > > Stack: > > > > ffff8801f6433e18 ffff8803ed0eb9b0 ffff8803f61ec480 ffff8803ed0eb9b0 > > > > <0> ffff8801f6433e48 ffffffff81125e27 ffff8801f6433e38 ffff8803dd19ea80 > > > > <0> ffff8803ed0eb9b0 ffff8803e9c92940 ffff8801f6433e78 ffffffff8115dc10 > > > > Call Trace: > > > > [<ffffffff81125e27>] seq_release_private+0x28/0x44 > > > > [<ffffffff8115dc10>] seq_release_net+0x34/0x3d > > > > [<ffffffff81155ada>] proc_reg_release+0xd3/0xf0 > > > > [<ffffffff8110efbb>] __fput+0x12a/0x1dc > > > > [<ffffffff8110f087>] fput+0x1a/0x1c > > > > [<ffffffff8110c0f7>] filp_close+0x68/0x72 > > > > [<ffffffff8110c19e>] sys_close+0x9d/0xd2 > > > > [<ffffffff81009c72>] system_call_fastpath+0x16/0x1b > > > > Code: ef ff 13 48 83 c3 08 48 83 3b 00 eb ec 49 83 fc 10 76 7d 4c 89 e7 > e8 67 > > > > e4 ff ff 48 89 c6 48 8b 00 84 c0 78 14 66 a9 00 c0 75 04 <0f> 0b eb fe > 48 89 f7 > > > > e8 66 36 fd ff eb 57 48 8b 4d 08 48 8b 7e > > > > RIP [<ffffffff811006d6>] kfree+0x5e/0xcb > > > > RSP <ffff8801f6433df8> > > > > ---[ end trace 1a4b1fd758dd1fdb ]--- > > > > > > But then again, that's a procfs trace, not a sysfs trace. > > > > > > proc_reg_release() makes my brain hurt. > > > > That complex code triggers only when module unloads and removes > > its /proc entries. > > > > Just to confirm, rmmod wasn't executed, last module still is > > scsi_wait_scan from boot sequence? > > I'm not sure that scsi_wait_scan was the last module pushed into the kernel > at > all. I've got two boxes (one of which is the box with the error above), > nearly > identical software wise, and neither has scsi_wait_scan even loaded (they are > up and running right now). So I'm not sure where that came from, or what > depended on that. > > The last module, on each machine, that got loaded during boot is ocfs2. This > got loaded from the init script for it. scsi_wait_scan isn't even listed > according to lsmod. > > I did not explicitly execute rmmod, restart ocfs2 or drbd (both of which mess > with their respective modules being loaded), but those shouldn't have been > doing much with their modules since they were up and running > > -- > Configure bugmail: https://bugzilla.kernel.org/userprefs.cgi?tab=email > ------- You are receiving this mail because: ------- > You are the assignee for the bug.
On 10/18/2010 05:00 PM, Andrew Morton wrote: > > (switched to email. Please respond via emailed reply-to-all, not via the > bugzilla web interface). > > ^^ See this? Please don't use the bugzilla interface for this bug! Nobody > saw your > reply. No I hadn't seen that, my apologies.
*** Bug 20672 has been marked as a duplicate of this bug. ***
*** Bug 20682 has been marked as a duplicate of this bug. ***
On 10/18/2010 02:11 PM, Alexey Dobriyan wrote: > On Mon, Oct 18, 2010 at 01:48:56PM -0700, Andrew Morton wrote: >>> Modules linked in: ocfs2 mptctl mptbase drbd lru_cache nfsd lockd nfs_acl >>> auth_rpcgss sunrpc ocfs2_dlmfs ocfs2_stack_o2cb ocfs2_dlm ocfs2_nodemanager >>> ocfs2_stackglue configfs cpufreq_ondemand powernow_k8 freq_table 8021q garp >>> stp >>> llc ip6t_REJECT nf_conntrack_ipv6 ip6table_filter ip6_tables ipv6 xfs >>> exportfs >>> tg3 hpwdt amd64_edac_mod i2c_amd756 i2c_core edac_core shpchp k8temp >>> amd_rng >>> edac_mce_amd microcode pata_acpi ata_generic cciss pata_amd [last unloaded: >>> scsi_wait_scan] >>> >>> Pid: 1713, comm: snmpd Not tainted 2.6.34.7-56.fc13.x86_64 #1 /ProLiant >>> DL385 >>> G1 >>> RIP: 0010:[<ffffffff811006d6>] [<ffffffff811006d6>] kfree+0x5e/0xcb >>> RSP: 0018:ffff8801f6433df8 EFLAGS: 00010246 >>> RAX: 0040000000000400 RBX: ffff8803ed0eb9b0 RCX: ffff8803e9c92340 >>> RDX: ffffea0000000000 RSI: ffffea0003800000 RDI: ffff880100000002 >>> RBP: ffff8801f6433e18 R08: ffff8803e9c92958 R09: 0000000000000000 >>> R10: 0000000000000011 R11: 0000000000000246 R12: ffff880100000002 >>> R13: ffffffff81125e27 R14: ffffffff8115dbdc R15: ffff8803dd19ea80 >>> FS: 00007ff4a31917a0(0000) GS:ffff880207400000(0000) >>> knlGS:00000000f76fa6d0 >>> CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 >>> CR2: 00007ff4a31b2000 CR3: 00000001f6455000 CR4: 00000000000006e0 >>> DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000 >>> DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400 >>> Process snmpd (pid: 1713, threadinfo ffff8801f6432000, task >>> ffff8801f5688000) >>> Stack: >>> ffff8801f6433e18 ffff8803ed0eb9b0 ffff8803f61ec480 ffff8803ed0eb9b0 >>> <0> ffff8801f6433e48 ffffffff81125e27 ffff8801f6433e38 ffff8803dd19ea80 >>> <0> ffff8803ed0eb9b0 ffff8803e9c92940 ffff8801f6433e78 ffffffff8115dc10 >>> Call Trace: >>> [<ffffffff81125e27>] seq_release_private+0x28/0x44 >>> [<ffffffff8115dc10>] seq_release_net+0x34/0x3d >>> [<ffffffff81155ada>] proc_reg_release+0xd3/0xf0 >>> [<ffffffff8110efbb>] __fput+0x12a/0x1dc >>> [<ffffffff8110f087>] fput+0x1a/0x1c >>> [<ffffffff8110c0f7>] filp_close+0x68/0x72 >>> [<ffffffff8110c19e>] sys_close+0x9d/0xd2 >>> [<ffffffff81009c72>] system_call_fastpath+0x16/0x1b >>> Code: ef ff 13 48 83 c3 08 48 83 3b 00 eb ec 49 83 fc 10 76 7d 4c 89 e7 e8 >>> 67 >>> e4 ff ff 48 89 c6 48 8b 00 84 c0 78 14 66 a9 00 c0 75 04 <0f> 0b eb fe 48 >>> 89 f7 >>> e8 66 36 fd ff eb 57 48 8b 4d 08 48 8b 7e >>> RIP [<ffffffff811006d6>] kfree+0x5e/0xcb >>> RSP <ffff8801f6433df8> >>> ---[ end trace 1a4b1fd758dd1fdb ]--- >> >> But then again, that's a procfs trace, not a sysfs trace. >> >> proc_reg_release() makes my brain hurt. > > That complex code triggers only when module unloads and removes > its /proc entries. > > Just to confirm, rmmod wasn't executed, last module still is > scsi_wait_scan from boot sequence? Bumping on this as I'm still seeing this, and hit it again today. Since this seems to be happening more, is there anything I can add for debugging to help with this? Getting back to Alexey's comments. No there were not modules loaded or unloaded, and I have no idea why scsi_wait_scan would have been loaded or unloaded. Is today's trace (From D2): ------------[ cut here ]------------ kernel BUG at mm/slub.c:2835! invalid opcode: 0000 [#1] SMP last sysfs file: /sys/kernel/mm/ksm/run CPU 0 Modules linked in: ocfs2 mptctl mptbase ipmi_devintf drbd lru_cache nfsd lockd nfs_acl auth_rpcgss sunrpc ocfs2_dlmfs ocfs2_stack_o2cb ocfs2_dlm ocfs2_nodemanager ocfs2_stackglue configfs 8021q garp stp llc ip6t_REJECT nf_conntrack_ipv6 ip6table_filter ip6_tables ipv6 xfs exportfs bnx2 ipmi_si ipmi_msghandler iTCO_wdt power_meter iTCO_vendor_support microcode hpwdt serio_raw cciss hpsa radeon ttm drm_kms_helper drm i2c_algo_bit i2c_core [last unloaded: speedstep_lib] Pid: 2375, comm: snmpd Not tainted 2.6.34.7-61.korg.fc13.x86_64 #1 /ProLiant DL380 G6 RIP: 0010:[<ffffffff811006d6>] [<ffffffff811006d6>] kfree+0x5e/0xcb RSP: 0018:ffff88080c12fdf8 EFLAGS: 00010246 RAX: 004000000010003c RBX: ffff8803e5a26e30 RCX: ffff880400d40c40 RDX: ffffea0000000000 RSI: ffffea000e000000 RDI: ffff880400000001 RBP: ffff88080c12fe18 R08: ffff880400d405d8 R09: 0000000000000000 R10: 0000000000000013 R11: 0000000000000246 R12: ffff880400000001 R13: ffffffff81125daf R14: ffffffff8115db64 R15: ffff8803f9da4b40 FS: 00007fd3dda897a0(0000) GS:ffff880002000000(0000) knlGS:0000000000000000 CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 CR2: 00007fd3ddaab000 CR3: 000000080ce01000 CR4: 00000000000006f0 DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000 DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400 Process snmpd (pid: 2375, threadinfo ffff88080c12e000, task ffff88080b58ddc0) Stack: ffff88080c12fe18 ffff8803e5a26e30 ffff88040c4a3e80 ffff8803e5a26e30 <0> ffff88080c12fe48 ffffffff81125daf ffff88080c12fe38 ffff8803f9da4b40 <0> ffff8803e5a26e30 ffff880400d405c0 ffff88080c12fe78 ffffffff8115db98 Call Trace: [<ffffffff81125daf>] seq_release_private+0x28/0x44 [<ffffffff8115db98>] seq_release_net+0x34/0x3d [<ffffffff81155a62>] proc_reg_release+0xd3/0xf0 [<ffffffff8110efb3>] __fput+0x12a/0x1d4 [<ffffffff8110f077>] fput+0x1a/0x1c [<ffffffff8110c0f7>] filp_close+0x68/0x72 [<ffffffff8110c19e>] sys_close+0x9d/0xd2 [<ffffffff81009c72>] system_call_fastpath+0x16/0x1b Code: ef ff 13 48 83 c3 08 48 83 3b 00 eb ec 49 83 fc 10 76 7d 4c 89 e7 e8 67 e4 ff ff 48 89 c6 48 8b 00 84 c0 78 14 66 a9 00 c0 75 04 <0f> 0b eb fe 48 89 f7 e8 66 36 fd ff eb 57 48 8b 4d 08 48 8b 7e RIP [<ffffffff811006d6>] kfree+0x5e/0xcb RSP <ffff88080c12fdf8> ---[ end trace 60a902368ad4d4fe ]--- It seems to claim speedstep_lib was the last unloaded, which I wouldn't be able to explain why that would have been unloaded at all, the running system itself doesn't seem to be using it, so I can only assume it got loaded on startup somewhere and was subsequently removed. On D1 yesterday I saw the following: ------------[ cut here ]------------ kernel BUG at mm/slub.c:2835! invalid opcode: 0000 [#1] SMP last sysfs file: /sys/kernel/mm/ksm/run CPU 1 Modules linked in: ocfs2 mptctl mptbase drbd lru_cache nfsd lockd nfs_acl auth_rpcgss sunrpc ocfs2_dlmfs ocfs2_stack_o2cb ocfs2_dlm ocfs2_nodemanager ocfs2_stackglue configfs cpufreq_ondemand powernow_k8 freq_table 8021q garp stp llc ip6t_REJECT nf_conntrack_ipv6 ip6table_filter ip6_tables ipv6 xfs exportfs amd64_edac_mod tg3 edac_core i2c_amd756 i2c_core hpwdt amd_rng k8temp shpchp edac_mce_amd microcode pata_acpi ata_generic pata_amd cciss [last unloaded: scsi_wait_scan] Nov 23 06:50:04 demeter kernel: Pid: 1737, comm: snmpd Not tainted 2.6.34.7-61.korg.fc13.x86_64 #1 /ProLiant DL385 G1 RIP: 0010:[<ffffffff811006d6>] [<ffffffff811006d6>] kfree+0x5e/0xcb RSP: 0018:ffff8801eef2ddf8 EFLAGS: 00010246 RAX: 0040000000000400 RBX: ffff8803effe74b0 RCX: ffff8803e18ec6c0 RDX: ffffea0000000000 RSI: ffffea0003800000 RDI: ffff880100000002 RBP: ffff8801eef2de18 R08: ffff8803e18ec918 R09: 0000000000000000 R10: 0000000000000011 R11: 0000000000000246 R12: ffff880100000002 R13: ffffffff81125daf R14: ffffffff8115db64 R15: ffff880044570540 FS: 00007f0fadd727a0(0000) GS:ffff880207400000(0000) knlGS:00000000f4f9bb70 CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 CR2: 00007f0fadd93000 CR3: 00000001efbb6000 CR4: 00000000000006e0 DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000 DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400 Process snmpd (pid: 1737, threadinfo ffff8801eef2c000, task ffff8801ecf2c650) Stack: ffff8801eef2de18 ffff8803effe74b0 ffff8803d645ac00 ffff8803effe74b0 <0> ffff8801eef2de48 ffffffff81125daf ffff8801eef2de38 ffff880044570540 <0> ffff8803effe74b0 ffff8803e18ec900 ffff8801eef2de78 ffffffff8115db98 Call Trace: [<ffffffff81125daf>] seq_release_private+0x28/0x44 [<ffffffff8115db98>] seq_release_net+0x34/0x3d [<ffffffff81155a62>] proc_reg_release+0xd3/0xf0 [<ffffffff8110efb3>] __fput+0x12a/0x1d4 [<ffffffff8110f077>] fput+0x1a/0x1c [<ffffffff8110c0f7>] filp_close+0x68/0x72 [<ffffffff8110c19e>] sys_close+0x9d/0xd2 [<ffffffff81009c72>] system_call_fastpath+0x16/0x1b Code: ef ff 13 48 83 c3 08 48 83 3b 00 eb ec 49 83 fc 10 76 7d 4c 89 e7 e8 67 e4 ff ff 48 89 c6 48 8b 00 84 c0 78 14 66 a9 00 c0 75 04 <0f> 0b eb fe 48 89 f7 e8 66 36 fd ff eb 57 48 8b 4d 08 48 8b 7e RIP [<ffffffff811006d6>] kfree+0x5e/0xcb RSP <ffff8801eef2ddf8> ---[ end trace 34b1e268e13a43a5 ]--- Which claims the last unloaded module was scsi_wait_scan. The module unloads I'm going to assume are coming from random normal boot code, and the errors are happening a significant time later anywhere from hours to days after boot. Considering I'm seeing this on completely different hardware, completely different CPU's (AMD vs. Intel) It's been suggested that I setup kmemleak on the boxes, despite the overhead that will incur, but it might help point out the problem. Is there anything else? Note: I'm not seeing this on any other box I have deployed, just these. Now the load patterns are completely different here, and these are the only two boxes with drbd & ocfs2 on them as well. Not trying to throw those under the bus, but it's worth pointing out any differences in my setup that might be helpful for debugging. - John 'Warthog9' Hawley
On Wed, 24 Nov 2010 15:25:16 -0800 "J.H." <warthog9@kernel.org> wrote: > On 10/18/2010 02:11 PM, Alexey Dobriyan wrote: > > On Mon, Oct 18, 2010 at 01:48:56PM -0700, Andrew Morton wrote: > >>> Modules linked in: ocfs2 mptctl mptbase drbd lru_cache nfsd lockd nfs_acl > >>> auth_rpcgss sunrpc ocfs2_dlmfs ocfs2_stack_o2cb ocfs2_dlm > ocfs2_nodemanager > >>> ocfs2_stackglue configfs cpufreq_ondemand powernow_k8 freq_table 8021q > garp stp > >>> llc ip6t_REJECT nf_conntrack_ipv6 ip6table_filter ip6_tables ipv6 xfs > exportfs > >>> tg3 hpwdt amd64_edac_mod i2c_amd756 i2c_core edac_core shpchp k8temp > amd_rng > >>> edac_mce_amd microcode pata_acpi ata_generic cciss pata_amd [last > unloaded: > >>> scsi_wait_scan] > >>> > >>> Pid: 1713, comm: snmpd Not tainted 2.6.34.7-56.fc13.x86_64 #1 /ProLiant > DL385 > >>> G1 > >>> RIP: 0010:[<ffffffff811006d6>] [<ffffffff811006d6>] kfree+0x5e/0xcb > >>> RSP: 0018:ffff8801f6433df8 EFLAGS: 00010246 > >>> RAX: 0040000000000400 RBX: ffff8803ed0eb9b0 RCX: ffff8803e9c92340 > >>> RDX: ffffea0000000000 RSI: ffffea0003800000 RDI: ffff880100000002 > >>> RBP: ffff8801f6433e18 R08: ffff8803e9c92958 R09: 0000000000000000 > >>> R10: 0000000000000011 R11: 0000000000000246 R12: ffff880100000002 > >>> R13: ffffffff81125e27 R14: ffffffff8115dbdc R15: ffff8803dd19ea80 > >>> FS: 00007ff4a31917a0(0000) GS:ffff880207400000(0000) > knlGS:00000000f76fa6d0 > >>> CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 > >>> CR2: 00007ff4a31b2000 CR3: 00000001f6455000 CR4: 00000000000006e0 > >>> DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000 > >>> DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400 > >>> Process snmpd (pid: 1713, threadinfo ffff8801f6432000, task > ffff8801f5688000) > >>> Stack: > >>> ffff8801f6433e18 ffff8803ed0eb9b0 ffff8803f61ec480 ffff8803ed0eb9b0 > >>> <0> ffff8801f6433e48 ffffffff81125e27 ffff8801f6433e38 ffff8803dd19ea80 > >>> <0> ffff8803ed0eb9b0 ffff8803e9c92940 ffff8801f6433e78 ffffffff8115dc10 > >>> Call Trace: > >>> [<ffffffff81125e27>] seq_release_private+0x28/0x44 > >>> [<ffffffff8115dc10>] seq_release_net+0x34/0x3d > >>> [<ffffffff81155ada>] proc_reg_release+0xd3/0xf0 > >>> [<ffffffff8110efbb>] __fput+0x12a/0x1dc > >>> [<ffffffff8110f087>] fput+0x1a/0x1c > >>> [<ffffffff8110c0f7>] filp_close+0x68/0x72 > >>> [<ffffffff8110c19e>] sys_close+0x9d/0xd2 > >>> [<ffffffff81009c72>] system_call_fastpath+0x16/0x1b > >>> Code: ef ff 13 48 83 c3 08 48 83 3b 00 eb ec 49 83 fc 10 76 7d 4c 89 e7 > e8 67 > >>> e4 ff ff 48 89 c6 48 8b 00 84 c0 78 14 66 a9 00 c0 75 04 <0f> 0b eb fe 48 > 89 f7 > >>> e8 66 36 fd ff eb 57 48 8b 4d 08 48 8b 7e > >>> RIP [<ffffffff811006d6>] kfree+0x5e/0xcb > >>> RSP <ffff8801f6433df8> > >>> ---[ end trace 1a4b1fd758dd1fdb ]--- > >> > >> But then again, that's a procfs trace, not a sysfs trace. > >> > >> proc_reg_release() makes my brain hurt. > > > > That complex code triggers only when module unloads and removes > > its /proc entries. > > > > Just to confirm, rmmod wasn't executed, last module still is > > scsi_wait_scan from boot sequence? > > Bumping on this as I'm still seeing this, and hit it again today. Since > this seems to be happening more, is there anything I can add for > debugging to help with this? > > Getting back to Alexey's comments. No there were not modules loaded or > unloaded, and I have no idea why scsi_wait_scan would have been loaded > or unloaded. > > Is today's trace (From D2): > > ------------[ cut here ]------------ > kernel BUG at mm/slub.c:2835! That's BUG_ON(!PageCompound(page)); ? > invalid opcode: 0000 [#1] SMP > last sysfs file: /sys/kernel/mm/ksm/run > CPU 0 > Modules linked in: ocfs2 mptctl mptbase ipmi_devintf drbd lru_cache nfsd > lockd nfs_acl auth_rpcgss sunrpc ocfs2_dlmfs ocfs2_stack_o2cb ocfs2_dlm > ocfs2_nodemanager ocfs2_stackglue configfs 8021q garp stp llc > ip6t_REJECT nf_conntrack_ipv6 ip6table_filter ip6_tables ipv6 xfs > exportfs bnx2 ipmi_si ipmi_msghandler iTCO_wdt power_meter > iTCO_vendor_support microcode hpwdt serio_raw cciss hpsa radeon ttm > drm_kms_helper drm i2c_algo_bit i2c_core [last unloaded: speedstep_lib] > > Pid: 2375, comm: snmpd Not tainted 2.6.34.7-61.korg.fc13.x86_64 #1 > /ProLiant DL380 G6 > RIP: 0010:[<ffffffff811006d6>] [<ffffffff811006d6>] kfree+0x5e/0xcb > RSP: 0018:ffff88080c12fdf8 EFLAGS: 00010246 > RAX: 004000000010003c RBX: ffff8803e5a26e30 RCX: ffff880400d40c40 > RDX: ffffea0000000000 RSI: ffffea000e000000 RDI: ffff880400000001 > RBP: ffff88080c12fe18 R08: ffff880400d405d8 R09: 0000000000000000 > R10: 0000000000000013 R11: 0000000000000246 R12: ffff880400000001 > R13: ffffffff81125daf R14: ffffffff8115db64 R15: ffff8803f9da4b40 > FS: 00007fd3dda897a0(0000) GS:ffff880002000000(0000) knlGS:0000000000000000 > CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 > CR2: 00007fd3ddaab000 CR3: 000000080ce01000 CR4: 00000000000006f0 > DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000 > DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400 > Process snmpd (pid: 2375, threadinfo ffff88080c12e000, task > ffff88080b58ddc0) > Stack: > ffff88080c12fe18 ffff8803e5a26e30 ffff88040c4a3e80 ffff8803e5a26e30 > <0> ffff88080c12fe48 ffffffff81125daf ffff88080c12fe38 ffff8803f9da4b40 > <0> ffff8803e5a26e30 ffff880400d405c0 ffff88080c12fe78 ffffffff8115db98 > Call Trace: > [<ffffffff81125daf>] seq_release_private+0x28/0x44 > [<ffffffff8115db98>] seq_release_net+0x34/0x3d > [<ffffffff81155a62>] proc_reg_release+0xd3/0xf0 > [<ffffffff8110efb3>] __fput+0x12a/0x1d4 > [<ffffffff8110f077>] fput+0x1a/0x1c > [<ffffffff8110c0f7>] filp_close+0x68/0x72 > [<ffffffff8110c19e>] sys_close+0x9d/0xd2 > [<ffffffff81009c72>] system_call_fastpath+0x16/0x1b > Code: ef ff 13 48 83 c3 08 48 83 3b 00 eb ec 49 83 fc 10 76 7d 4c 89 e7 > e8 67 e4 ff ff 48 89 c6 48 8b 00 84 c0 78 14 66 a9 00 c0 75 04 <0f> 0b > eb fe 48 89 f7 e8 66 36 fd ff eb 57 48 8b 4d 08 48 8b 7e > RIP [<ffffffff811006d6>] kfree+0x5e/0xcb > RSP <ffff88080c12fdf8> > ---[ end trace 60a902368ad4d4fe ]--- > > It seems to claim speedstep_lib was the last unloaded, which I wouldn't > be able to explain why that would have been unloaded at all, the running > system itself doesn't seem to be using it, so I can only assume it got > loaded on startup somewhere and was subsequently removed. > > On D1 yesterday I saw the following: > > ------------[ cut here ]------------ > kernel BUG at mm/slub.c:2835! > invalid opcode: 0000 [#1] SMP > last sysfs file: /sys/kernel/mm/ksm/run > CPU 1 > Modules linked in: ocfs2 mptctl mptbase drbd lru_cache nfsd lockd > nfs_acl auth_rpcgss sunrpc ocfs2_dlmfs ocfs2_stack_o2cb ocfs2_dlm > ocfs2_nodemanager ocfs2_stackglue configfs cpufreq_ondemand powernow_k8 > freq_table 8021q garp stp llc ip6t_REJECT nf_conntrack_ipv6 > ip6table_filter ip6_tables ipv6 xfs exportfs amd64_edac_mod tg3 > edac_core i2c_amd756 i2c_core hpwdt amd_rng k8temp shpchp edac_mce_amd > microcode pata_acpi ata_generic pata_amd cciss [last unloaded: > scsi_wait_scan] > Nov 23 06:50:04 demeter kernel: > Pid: 1737, comm: snmpd Not tainted 2.6.34.7-61.korg.fc13.x86_64 #1 > /ProLiant DL385 G1 > RIP: 0010:[<ffffffff811006d6>] [<ffffffff811006d6>] kfree+0x5e/0xcb > RSP: 0018:ffff8801eef2ddf8 EFLAGS: 00010246 > RAX: 0040000000000400 RBX: ffff8803effe74b0 RCX: ffff8803e18ec6c0 > RDX: ffffea0000000000 RSI: ffffea0003800000 RDI: ffff880100000002 > RBP: ffff8801eef2de18 R08: ffff8803e18ec918 R09: 0000000000000000 > R10: 0000000000000011 R11: 0000000000000246 R12: ffff880100000002 > R13: ffffffff81125daf R14: ffffffff8115db64 R15: ffff880044570540 > FS: 00007f0fadd727a0(0000) GS:ffff880207400000(0000) knlGS:00000000f4f9bb70 > CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 > CR2: 00007f0fadd93000 CR3: 00000001efbb6000 CR4: 00000000000006e0 > DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000 > DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400 > Process snmpd (pid: 1737, threadinfo ffff8801eef2c000, task > ffff8801ecf2c650) > Stack: > ffff8801eef2de18 ffff8803effe74b0 ffff8803d645ac00 ffff8803effe74b0 > <0> ffff8801eef2de48 ffffffff81125daf ffff8801eef2de38 ffff880044570540 > <0> ffff8803effe74b0 ffff8803e18ec900 ffff8801eef2de78 ffffffff8115db98 > Call Trace: > [<ffffffff81125daf>] seq_release_private+0x28/0x44 > [<ffffffff8115db98>] seq_release_net+0x34/0x3d > [<ffffffff81155a62>] proc_reg_release+0xd3/0xf0 > [<ffffffff8110efb3>] __fput+0x12a/0x1d4 > [<ffffffff8110f077>] fput+0x1a/0x1c > [<ffffffff8110c0f7>] filp_close+0x68/0x72 > [<ffffffff8110c19e>] sys_close+0x9d/0xd2 > [<ffffffff81009c72>] system_call_fastpath+0x16/0x1b > Code: ef ff 13 48 83 c3 08 48 83 3b 00 eb ec 49 83 fc 10 76 7d 4c 89 e7 > e8 67 e4 ff ff 48 89 c6 48 8b 00 84 c0 78 14 66 a9 00 c0 75 04 <0f> 0b > eb fe 48 89 f7 e8 66 36 fd ff eb 57 48 8b 4d 08 48 8b 7e > RIP [<ffffffff811006d6>] kfree+0x5e/0xcb > RSP <ffff8801eef2ddf8> > ---[ end trace 34b1e268e13a43a5 ]--- > > Which claims the last unloaded module was scsi_wait_scan. The module > unloads I'm going to assume are coming from random normal boot code, and > the errors are happening a significant time later anywhere from hours to > days after boot. > > Considering I'm seeing this on completely different hardware, completely > different CPU's (AMD vs. Intel) > > It's been suggested that I setup kmemleak on the boxes, despite the > overhead that will incur, but it might help point out the problem. Is > there anything else? > > Note: I'm not seeing this on any other box I have deployed, just these. > Now the load patterns are completely different here, and these are the > only two boxes with drbd & ocfs2 on them as well. Not trying to throw > those under the bus, but it's worth pointing out any differences in my > setup that might be helpful for debugging. > I can't begin to think what could have caused this :(
OK, nobody has ideas. seq_release_private() does kfree(seq->private); which is void * 1. what files snmpd opens? 2. If Fedora ships with SLUB but with cache merging, what caches alias with 1) -- ls -l /sys/kernel/slab
With the exception of one instance of a lsof process opening a file, snmpd is always the process when the bug happens and it's always the sys_close system call. Insight into the kernel bug might be obtained by auditing snmpd's use of fdopen(), fclose(), and close(). Specifically: 1. not close()-ing an fdopen()-ed fd after fclose()-ing the FILE * (See man fdopen) 2. not fclose()-ing a NULL pointer. (See man fclose) Here's an actual close() after fdopen() & fclose() error in agent/snmpd.c:main(): if ((PID = fdopen(fd, "w")) == NULL) { snmp_log_perror(pid_file); exit(1); } else { fprintf(PID, "%d\n", (int) getpid()); fclose(PID); } close(fd); snmpd is huge, ~65 source files call one of these functions, and most instances in the first 20 source files I checked were OK. Maybe auditing lsof would be easier. Regards, Andy