Bug 13341

Summary: BUG: unable to handle kernel paging request at 000[cd]?a0b (or 000[cd]?a1f)
Product: Other Reporter: patrick
Component: module-init-toolsAssignee: Alexey Dobriyan (adobriyan)
Status: CLOSED OBSOLETE    
Severity: normal CC: adobriyan, akpm, alan, rjw, rusty, tonywhite100, torvalds
Priority: P1    
Hardware: All   
OS: Linux   
Kernel Version: 2.6.30-rc6 Subsystem:
Regression: Yes Bisected commit-id:
Bug Depends on:    
Bug Blocks: 13070    
Attachments: Diagnostic printing patch.
complete dmesg output
YA diagnostic patch; if it triggers, implies not module.c error.
Actual used modification to module.c, see below

Description patrick 2009-05-19 09:08:46 UTC
The Distribution is Gentoo, using a vanilla kernel 2.6.30-rc6. While booting and executing the ip6tables rule loading script an Oops occurs in about half of the cases. The module-init-tools version is 3.5. The problem did never occur in Kernel 2.6.29.

dmesg extract:

[  374.119788] BUG: unable to handle kernel paging request at 000cba0b
[  374.119970] IP: [<c015981c>] m_show+0x9c/0x1a0
[  374.120004] *pde = 00000000 
[  374.120004] Oops: 0000 [#1] PREEMPT 
[  374.120004] last sysfs file: /sys/block/dm-0/size
[  374.120004] Modules linked in: snd_opl3_synth snd_seq_midi_emul snd_seq_midi snd_seq_dummy snd_seq_oss snd_seq_midi_event snd_seq snd_pcm_oss snd_mixer_oss fuse tun asb100 hwmon_vid iptable_mangle nf_conntrack_netbios_ns nf_conntrack_irc nf_conntrack_ftp usbhid cryptoloop 8139cp tuner_simple tuner_types tuner arc4 cx8800 cx88xx rt61pci rt2x00pci rt2x00lib ir_common led_class i2c_algo_bit snd_cmipci input_polldev usblp tveeprom snd_pcm v4l2_common mac80211 videodev snd_page_alloc v4l1_compat videobuf_dma_sg snd_opl3_lib videobuf_core btcx_risc eeprom_93cx6 skge snd_timer snd_hwdep snd_mpu401 forcedeth snd_mpu401_uart i2c_nforce2 snd_rawmidi snd_seq_device nvidia_agp agpgart snd soundcore floppy
[  374.120004] 
[  374.120004] Pid: 5953, comm: modprobe Not tainted (2.6.30-rc6 #3) A7N8X-E
[  374.120004] EIP: 0060:[<c015981c>] EFLAGS: 00010296 CPU: 0
[  374.120004] EIP is at m_show+0x9c/0x1a0
[  374.120004] EAX: 00000000 EBX: 000cba0b ECX: 00000000 EDX: 00000000
[  374.120004] ESI: f81b0c48 EDI: c7f88ec0 EBP: f81b0b40 ESP: cac88ecc
[  374.120004]  DS: 007b ES: 007b FS: 0000 GS: 00e0 SS: 0068
[  374.120004] Process modprobe (pid: 5953, ti=cac88000 task=c7f967c0 task.ti=cac88000)
[  374.120004] Stack:
[  374.120004]  c7f88ec0 c0640c5a f8344e4c 00009100 f81b0b44 00000060 00010246 35a83966
[  374.120004]  00000000 c7f88ec0 f81b0b44 000002af c01bf296 cac88f90 000003ff b8016001
[  374.120004]  cac5c280 c7f88ee8 00000001 00000027 00000000 00000026 00000000 35a83966
[  374.120004] Call Trace:
[  374.120004]  [<c01bf296>] ? seq_read+0x236/0x3a0
[  374.120004]  [<c01bf060>] ? seq_read+0x0/0x3a0
[  374.120004]  [<c01e5384>] ? proc_reg_read+0x84/0xf0
[  374.120004]  [<c01a0c07>] ? vfs_read+0xc7/0x170
[  374.120004]  [<c01e5300>] ? proc_reg_read+0x0/0xf0
[  374.120004]  [<c01a0de0>] ? sys_read+0x60/0xc0
[  374.120004]  [<c010353b>] ? sysenter_do_call+0x12/0x26
[  374.120004] Code: 31 c0 81 c6 04 01 00 00 39 f3 74 2e 8d 76 00 8b 43 08 89 3c 24 c7 44 24 04 5a 0c 64 c0 83 c0 0c 89 44 24 08 e8 66 56 06 00 8b 1b <8b> 03 0f 18 00 90 39 de 75 da b8 01 00 00 00 8b b5 bc 00 00 00 
[  374.120004] EIP: [<c015981c>] m_show+0x9c/0x1a0 SS:ESP 0068:cac88ecc
[  374.120004] CR2: 00000000000cba0b
[  374.130919] ---[ end trace 560ced3d4b382f16 ]---


$ cat /proc/cpuinfo 
processor	: 0
vendor_id	: AuthenticAMD
cpu family	: 6
model		: 10
model name	: AMD Athlon(tm) XP 2500+
stepping	: 0
cpu MHz		: 1837.534
cache size	: 512 KB
fdiv_bug	: no
hlt_bug		: no
f00f_bug	: no
coma_bug	: no
fpu		: yes
fpu_exception	: yes
cpuid level	: 1
wp		: yes
flags		: fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 mmx fxsr sse syscall mmxext 3dnowext 3dnow
bogomips	: 3675.06
clflush size	: 32
power management: ts
Comment 1 Linus Torvalds 2009-05-25 03:10:43 UTC
That decodes to


   8:	39 f3                	cmp    %esi,%ebx
   a:	74 2e                	je     0x3a
   c:	8d 76 00             	lea    0x0(%esi),%esi
   f:	8b 43 08             	mov    0x8(%ebx),%eax
  12:	89 3c 24             	mov    %edi,(%esp)
  15:	c7 44 24 04 5a 0c 64 	movl   $0xc0640c5a,0x4(%esp)
  1c:	c0 
  1d:	83 c0 0c             	add    $0xc,%eax
  20:	89 44 24 08          	mov    %eax,0x8(%esp)
  24:	e8 66 56 06 00       	call   0x6568f
  29:	8b 1b                	mov    (%ebx),%ebx
  2b:*	8b 03                	mov    (%ebx),%eax     <-- trapping instruction
  2d:	0f 18 00             	prefetchnta (%eax)
  30:	90                   	nop
  31:	39 de                	cmp    %ebx,%esi
  33:	75 da                	jne    0xf
  35:	b8 01 00 00 00       	mov    $0x1,%eax
  3a:	8b b5 bc 00 00 00    	mov    0xbc(%ebp),%esi

Which would _seem_ to be part of this loop:

        /* Always include a trailing , so userspace can differentiate
           between this and the old multi-field proc format. */
        list_for_each_entry(use, &mod->modules_which_use_me, list) {
                printed_something = 1;
                seq_printf(m, "%s,", use->module_which_uses->name);
        }

in print_unload_info().

Does the problem go away if you disable CONFIG_MODULE_UNLOAD?

Also, is the oops always in the same spot? This looks like something has seriously corrupted some module list information, which in turn sounds like a memory corruption issue (and then it's obviously very interesting if it's always just about ip6tables!)
Comment 2 patrick 2009-05-26 07:30:34 UTC
> Does the problem go away if you disable CONFIG_MODULE_UNLOAD?

Yes, it seems to go away. I found no problems in 20 reboots with CONFIG_MODULE_UNLOAD unset, while with CONFIG_MODULE_UNLOAD=y the problem occured again at the fifth boot. Maybe 50% is a bit exaggerated.

> Also, is the oops always in the same spot?

Yes, as far as I observed it.

The modules, that are normally loaded but missing above are: af_packet, bridge, drm, ip6table_filter, ip6_tables, iptable_filter, iptable_nat, ip_tables, ipt_LOG, ipt_REJECT, llc, nf_conntrack, nf_conntrack_ipv4, nf_defrag_ipv4, nf_nat, radeon, stp, ts_bm, x_tables, xt_mac, xt_multiport, xt_state, xt_string, xt_tcpudp

I attached another oops trace (I don't know, if it helps). If I can provide any other useful information, let me know. Thank you!

[   43.368473] BUG: unable to handle kernel paging request at 000cba0b
[   43.368654] IP: [<c0152d0c>] m_show+0x9c/0x1a0
[   43.368787] *pde = 00000000 
[   43.368911] Oops: 0000 [#1] 
[   43.369036] last sysfs file: /sys/block/dm-0/size
[   43.369120] Modules linked in: snd_opl3_synth snd_seq_midi_emul snd_seq_midi snd_seq_dummy snd_seq_oss snd_seq_midi_event snd_seq snd_pcm_oss snd_mixer_oss fuse tun asb100 hwmon_vid iptable_mangle ip_tables x_tables nf_conntrack_netbios_ns nf_conntrack_irc nf_conntrack_ftp nf_conntrack cryptoloop 8139cp tuner_simple tuner_types tuner arc4 cx8800 rt61pci rt2x00pci cx88xx rt2x00lib ir_common led_class input_polldev i2c_algo_bit snd_cmipci tveeprom mac80211 v4l2_common snd_pcm usbhid videodev v4l1_compat snd_page_alloc videobuf_dma_sg snd_opl3_lib videobuf_core eeprom_93cx6 skge snd_timer btcx_risc snd_hwdep forcedeth snd_mpu401 snd_mpu401_uart snd_rawmidi snd_seq_device i2c_nforce2 snd soundcore floppy nvidia_agp agpgart usblp
[   43.372320] 
[   43.372320] Pid: 5993, comm: modprobe Not tainted (2.6.30-rc7 #1) A7N8X-E
[   43.372320] EIP: 0060:[<c0152d0c>] EFLAGS: 00010282 CPU: 0
[   43.372320] EIP is at m_show+0x9c/0x1a0
[   43.372320] EAX: 00000000 EBX: 000cba0b ECX: 00000000 EDX: 00000000
[   43.372320] ESI: f819b988 EDI: f55517c0 EBP: f819b880 ESP: f5621ed4
[   43.372320]  DS: 007b ES: 007b FS: 0000 GS: 00e0 SS: 0068
[   43.372320] Process modprobe (pid: 5993, ti=f5621000 task=f5630350 task.ti=f5621000)
[   43.372320] Stack:
[   43.372320]  f55517c0 c05eccc5 f83a1b4c 000049a0 f819b884 6b2d2eea c055d966 6b2d2eea
[   43.372320]  00000000 f55517c0 f819b884 00000040 c01af4d6 f5621f90 000003dc b7f52024
[   43.372320]  f5628880 f55517e8 00000024 0000002e 00000000 0000002d 00000000 6b2d2eea
[   43.372320] Call Trace:
[   43.372320]  [<c055d966>] ? mutex_lock+0x16/0x40
[   43.372320]  [<c01af4d6>] ? seq_read+0x236/0x3a0
[   43.372320]  [<c01af2a0>] ? seq_read+0x0/0x3a0
[   43.372320]  [<c01d2cac>] ? proc_reg_read+0x5c/0x90
[   43.372320]  [<c0194527>] ? vfs_read+0xc7/0x170
[   43.372320]  [<c01d2c50>] ? proc_reg_read+0x0/0x90
[   43.372320]  [<c0194700>] ? sys_read+0x60/0xc0
[   43.372320]  [<c010340f>] ? sysenter_do_call+0x12/0x26
[   43.372320] Code: 31 c0 81 c6 04 01 00 00 39 f3 74 2e 8d 76 00 8b 43 08 89 3c 24 c7 44 24 04 c5 cc 5e c0 83 c0 0c 89 44 24 08 e8 b6 c3 05 00 8b 1b <8b> 03 0f 18 00 90 39 de 75 da b8 01 00 00 00 8b b5 bc 00 00 00 
[   43.372320] EIP: [<c0152d0c>] m_show+0x9c/0x1a0 SS:ESP 0068:f5621ed4
[   43.372320] CR2: 00000000000cba0b
[   43.378955] ---[ end trace 6d4e1d01893fe12a ]---
Comment 3 Linus Torvalds 2009-05-26 16:10:40 UTC
On Tue, 26 May 2009, bugzilla-daemon@bugzilla.kernel.org wrote:
>
> Yes, it seems to go away. I found no problems in 20 reboots with
> CONFIG_MODULE_UNLOAD unset, while with CONFIG_MODULE_UNLOAD=y the problem
> occured again at the fifth boot. Maybe 50% is a bit exaggerated.

Ok. The more I look, the less likely this looks like anything but a module 
loading bug. The fact that it is so consistent for you means that it's 
almost certainly not some random memory corruption by somebody else that 
corrupts the module lists.

And if it isn't somebody else, then it's the module loading (or unloading) 
itself that corrupts them. 

And it's entirely possible that it's not unloading per se that matters: 
CONFIG_MODULE_UNLOAD=y doesn't just mean that we allow unloading, it also 
ends up causing a lot of other code to be accessed, particularly in the 
path you oops in. 'm_show()' is an almost trivial function if unloading 
isn't supported, while with CONFIG_MODULE_UNLOAD=y it does a lot more 
following of module pointers (which is why I asked you to test it).

My personal gut feel (which I admittedly have absolutely nothing to back 
up with), is that it's either:

 - commit e91defa26c527ceeaff6266c55cdc7e17c9081a2: "module: don't use 
   stop_machine on module load", which changes loading a lot.

 - fbf59bc9d74d1fb30b8e0630743aff2806eafcea: "percpu: implement new 
   dynamic percpu allocator", since it could easily be the percpu 
   allocations too (and in particular the unloading), and that may well 
   explain why it happens with certain modules in particular.

 - or possibly it's just a timing difference (perhaps even brought on by
   totally unrelated changes, or the async stuff), and it goes back to 
   earlier than 2.6.29, and was just exposed recently.

You probably have something touching /proc/modules at just the same time 
something else happens, and it's quite possible that it's very 
timing-dependent and that would explain why you see it but others don't.

Anyway, it really looks like a bug in handling "modules_which_use_me", and 
none of that code itself has really changed since 2.6.29 afaik. Which is 
why my primary suspect would be that locking change/synchronization thing 
(if this was some unrelated timing difference that wasn't about locking, 
I'd have expected it to have happened before - at least for somebody 
else).

Hmm. Rusty - what protects that 'modules_which_use_me' list? It looks like 
it's always protected module_mutex, but we have some things like 
'use_module()' that are exported - why? If somebody were to use that 
without holding the lock, they'd be buggy.

Patrick, do you have any external modules built? I'm not seeing any in 
your list, but I'm double-checking..

		Linus
Comment 4 patrick 2009-05-26 16:46:00 UTC
> Patrick, do you have any external modules built?

I had the Virtualbox modules "vboxdrv" and "vboxnetflt" built for 2.6.30-rc6, but none for 2.6.30-rc7.
Comment 5 Rusty Russell 2009-05-27 03:11:25 UTC
The module_mutex is also exported: it's for the out-of-tree ksplice guys to play with module usage.  That's not the cuplrit here.

But I don't expect modules to be unloaded during boot.  Let's make sure that is what is happening.  Could you apply this by hand or see attachment, then report what's in the logs before the oops?

Thanks!
Rusty.

diff --git a/kernel/module.c b/kernel/module.c                                             
--- a/kernel/module.c                                                                      
+++ b/kernel/module.c                                                                      
@@ -784,6 +784,8 @@ SYSCALL_DEFINE2(delete_module, const cha                               
                return -EFAULT;                                                            
        name[MODULE_NAME_LEN-1] = '\0';

+       printk("delete_module: %s\n", name);
+
        /* Create stop_machine threads since free_module relies on
         * a non-failing stop_machine call. */
        ret = stop_machine_create();
@@ -2295,6 +2297,7 @@ static noinline struct module *load_modu
        return mod;

  unlink:
+       printk("%s: unlink\n", mod->name);
        /* Unlink carefully: kallsyms could be walking list. */
        list_del_rcu(&mod->list);
        synchronize_sched();
@@ -2368,6 +2371,7 @@ SYSCALL_DEFINE3(init_module, void __user
                blocking_notifier_call_chain(&module_notify_list,
                                             MODULE_STATE_GOING, mod);
                mutex_lock(&module_mutex);
+               printk("Freeing failed init module %s\n", mod->name);
                free_module(mod);
                mutex_unlock(&module_mutex);
                wake_up(&module_wq);
Comment 6 Rusty Russell 2009-05-27 03:12:05 UTC
Created attachment 21576 [details]
Diagnostic printing patch.
Comment 7 Linus Torvalds 2009-05-27 03:24:40 UTC
On Wed, 27 May 2009, bugzilla-daemon@bugzilla.kernel.org wrote:
> 
> But I don't expect modules to be unloaded during boot.  Let's make sure that
> is
> what is happening.  Could you apply this by hand or see attachment, then
> report
> what's in the logs before the oops?

Rusty, look more closely.

The reason CONFIG_MODULE_UNLOAD is relevant is not because modules 
necessarily get unloaded, it's because it enables code that traverses the 
"modules_which_use_me" list in "print_unload_info()".

If MODULE_UNLOAD is not enabled, that code doesn't exist at all, and it 
seems to be that list that is not protected.

It could easily be "not protected" over a concurrent module _load_, rather 
than an unload. 

So it's not really all that interesting whether an unload event happens or 
not. That's not necessarily (or even likely) what triggers the problem.

		Linus
Comment 8 Rusty Russell 2009-05-27 05:42:02 UTC
> (In reply to comment #7)
> Rusty, look more closely.
> 
> The reason CONFIG_MODULE_UNLOAD is relevant is not because modules 
> necessarily get unloaded, it's because it enables code that traverses the 
> "modules_which_use_me" list in "print_unload_info()".
> 
> If MODULE_UNLOAD is not enabled, that code doesn't exist at all, and it 
> seems to be that list that is not protected.
> 
> It could easily be "not protected" over a concurrent module _load_, rather 
> than an unload. 

Except it is protected!  Even if it weren't, I can't see a racy add likely to cause this kind of oops.

Since I have no idea, I want to look harder at the corner cases; where we remove from the list because of module unload or load failure (the latter might explain why Patrick sees this and noone else does).

Cheers,
Rusty.
Comment 9 Andrew Morton 2009-05-27 23:24:23 UTC
Note that in both cases we have

> [  374.120004] last sysfs file: /sys/block/dm-0/size

so there may be a problem in the sysfs read-side handler for that sysfs file.

(This is why I added that "last sysfs file" thing originally - the backtrace
doesn't tell us which file is being read from)
Comment 10 patrick 2009-05-28 01:14:25 UTC
> Could you apply this by hand or see attachment, then report
what's in the logs before the oops?

Done, but none of the three messages appears, not at a normal boot and neither, when the oops occurs.
Comment 11 patrick 2009-05-28 01:15:38 UTC
Created attachment 21593 [details]
complete dmesg output
Comment 12 Rusty Russell 2009-05-28 06:13:00 UTC
OK, my guess is that someone is doing use-after-free or overrunning the end of their alloc and hitting our little "use" alloc.

One more patch should eliminate module.c as the culprit.  Hopefully won't perturb things enough to remove the bug.

(BTW, this will deliberately crash if a module load fails or is removed, but your boot should be fine).

Thanks,
Rusty.
Comment 13 Rusty Russell 2009-05-28 06:14:29 UTC
Created attachment 21598 [details]
YA diagnostic patch; if it triggers, implies not module.c error.
Comment 14 Rafael J. Wysocki 2009-05-30 21:43:25 UTC
Handled-By : Rusty Russell <rusty@rustcorp.com.au>
Comment 15 patrick 2009-06-01 09:02:33 UTC
After up to 50 retries I was not able to reproduce the oops with the patch applied.
Comment 16 Linus Torvalds 2009-06-01 14:59:00 UTC
On Mon, 1 Jun 2009, bugzilla-daemon@bugzilla.kernel.org wrote:
>
> After up to 50 retries I was not able to reproduce the oops with the patch
> applied.

Grr. Either it's a subtle timing bug, and the patch just ends up hiding 
it, or it's a compiler bug, and the patch ends up hiding it due to 
different code generation.

Bt your dmesg says "gcc version 4.3.2 (Gentoo 4.3.2-r3 p1.6, pie-10.1.5)", 
which doesn't sound bad. Neither bleeding-edge nor known-bad compiler, 
afaik.

I have no ideas. Maybe you can delete parts of Rusty's patch until you can 
make it trigger again, and then we can see exactly _which_ change makes 
the oops no longer happen.

		Linus
Comment 17 Rusty Russell 2009-06-01 23:27:52 UTC
Two additional ideas:
1) Add a BUG() at the bottom of dump_use_list(); if only the magic is corrupted it won't crash with that patch.
2) Remove the wmb() in use_module().  Does it trigger now?

But basically, I'm as confused as Linus :(
Rusty.
Comment 18 patrick 2009-06-07 16:45:44 UTC
Created attachment 21793 [details]
Actual used modification to module.c, see below
Comment 19 patrick 2009-06-07 16:48:40 UTC
Hello again,

after some days with very few time I experimented again a bit with 2.6.30-rc7 and modifications of module.c. In difference to the "YA diagnostic patch" there is
- a BUG() at the bottom of dump_use_list()
- no modification of already_uses()
- a pushed up test for a bad magic in print_unload_info()
(see Comment #18 above)

The BUG() call in dump_use_list() triggered and dmesg contained the following. The bad value in "magic" is the same as it was each time in the "next" pointer, whose place it took (?).

In normal operation there are no more than the mentioned three modules using "videobuf_core".

[  128.050410] print_unload_info: videobuf_core: bad magic in use list f8161988: next=f6a532e4 prev=f6976234
[  128.050416] f6a532e4: magic cba0b next = f6976fe4 prev = f8161988 user=f834cb40 (cx8800)
[  128.050420] f6976fe4: magic badc0ff next = f6976234 prev = f6a532e4 user=f830ce40 (cx88xx)
[  128.050424] f6976234: magic badc0ff next = f8161988 prev = f6976fe4 user=f818a260 (videobuf_dma_sg)
[  128.050440] ------------[ cut here ]------------
[  128.050531] kernel BUG at kernel/module.c:638!
[  128.050616] invalid opcode: 0000 [#1] 
[  128.050742] last sysfs file: /sys/block/sr1/capability
[  128.050827] Modules linked in: snd_opl3_synth snd_seq_midi_emul snd_seq_midi snd_seq_dummy snd_seq_oss snd_seq_midi_event snd_seq sn
d_pcm_oss snd_mixer_oss fuse tun asb100 hwmon_vid iptable_mangle ip_tables x_tables nf_conntrack_netbios_ns nf_conntrack_irc nf_conntra
ck_ftp nf_conntrack cryptoloop 8139cp tuner_simple tuner_types tuner arc4 cx8800 rt61pci rt2x00pci rt2x00lib cx88xx led_class ir_common
 input_polldev mac80211 i2c_algo_bit tveeprom snd_cmipci v4l2_common videodev usbhid snd_pcm v4l1_compat videobuf_dma_sg snd_page_alloc
 snd_opl3_lib videobuf_core eeprom_93cx6 btcx_risc skge snd_timer snd_hwdep snd_mpu401 forcedeth snd_mpu401_uart i2c_nforce2 snd_rawmid
i snd_seq_device nvidia_agp agpgart snd soundcore floppy usblp
[  128.054065] 
[  128.054147] Pid: 5964, comm: modprobe Not tainted (2.6.30-rc7bugtest #24) A7N8X-E
[  128.054266] EIP: 0060:[<c0151d5b>] EFLAGS: 00010246 CPU: 0
[  128.054266] EIP is at m_show+0x1fb/0x250
[  128.054266] EAX: f8161988 EBX: f8161984 ECX: f6a532e4 EDX: f8161988
[  128.054266] ESI: f54b43c0 EDI: f8161988 EBP: f8161880 ESP: f5696ec8
[  128.054266]  DS: 007b ES: 007b FS: 0000 GS: 00e0 SS: 0068
[  128.054266] Process modprobe (pid: 5964, ti=f5696000 task=f55ab1b0 task.ti=f5696000)
[  128.054266] Stack:
[  128.054266]  c060a6f0 f818a26c 0badc0ff f8161988 f6976fe4 f818a260 f8161884 f816188c
[  128.054266]  36933e7f c0557006 36933e7f 00000000 f54b43c0 f8161884 00000040 c01adf56
[  128.054266]  f5696f90 000003dc b7fee024 f7353480 f54b43e8 00000024 0000002e 00000000
[  128.054266] Call Trace:
[  128.054266]  [<c0557006>] ? mutex_lock+0x16/0x40
[  128.054266]  [<c01adf56>] ? seq_read+0x236/0x3a0
[  128.054266]  [<c01add20>] ? seq_read+0x0/0x3a0
[  128.054266]  [<c01d17bc>] ? proc_reg_read+0x5c/0x90
[  128.054266]  [<c0193707>] ? vfs_read+0xc7/0x170
[  128.054266]  [<c01d1760>] ? proc_reg_read+0x0/0x90
[  128.054266]  [<c01938e0>] ? sys_read+0x60/0xc0
[  128.054266]  [<c010310f>] ? sysenter_do_call+0x12/0x26
[  128.054266] Code: c0 89 44 24 14 89 54 24 08 c7 04 24 2c 61 5e c0 e8 db 3f 40 00 8b 85 08 01 00 00 8d 58 fc 8b 4b 04 0f 18 01 90 39 c7 89 c2 75 04 <0f> 0b eb fe 8b 43 0c 89 44 24 14 8b 43 08 89 4c 24 0c 89 44 24 
[  128.054266] EIP: [<c0151d5b>] m_show+0x1fb/0x250 SS:ESP 0068:f5696ec8
[  128.060624] ---[ end trace fce10aafebf4c8a9 ]---
Comment 20 Linus Torvalds 2009-06-08 00:01:30 UTC
On Sun, 7 Jun 2009, bugzilla-daemon@bugzilla.kernel.org wrote:
> 
> The BUG() call in dump_use_list() triggered and dmesg contained the
> following.
> The bad value in "magic" is the same as it was each time in the "next"
> pointer,
> whose place it took (?).

Ok, that's definitely a smoking gun.

Do you have CONFIG_DEBUG_PAGEALLOC and CONFIG_SLUB_DEBUG enabled? Oh, and 
please either do CONFIG_SLUB_DEBUG_ON=y, or boot with the "slubdebug" 
kernel command line (you need to have CONFIG_SLUB_DEBUG enabled for this 
to do anything).

It does sound like something else is corrupting that first word. But that 
cba0b value doesn't trigger any responses in my brain.

		Linus
Comment 21 patrick 2009-06-18 16:39:33 UTC
Hello,

after many tests with rc7 and the released 2.6.30, the error did not occur with these settings. I'll keep trying and report any new finding...

Patrick
Comment 22 patrick 2009-08-20 14:19:02 UTC
Hello,

the Oops occured at least with a 2.6.30.4. It is very strange, as the occurence varied extremely in frequency.

However, I got a new computer. It is not likely, that I will be able to reproduce it again in the next time. I am sorry. Maybe the bug should be closed?

Patrick
Comment 23 Alexey Dobriyan 2009-11-14 21:38:45 UTC
> BUG: unable to handle kernel paging request at 000??a0b
> EIP is at m_show()

this appears to be pretty common signature according to kerneloops.org
Comment 24 Alexey Dobriyan 2009-11-14 22:13:28 UTC
> [last unloaded: scsi_wait_scan]
Comment 25 Alexey Dobriyan 2009-11-15 00:19:27 UTC
There are at least two corruption modes:

1. BUG: unable to handle kernel paging request at 000XYa0b
2. BUG: unable to handle kernel paging request at 000XYa1f

where X is c or d.
Comment 26 Alexey Dobriyan 2009-11-15 00:33:37 UTC
List of modules which were loaded from all traces:

snd
snd_page_alloc
snd_pcm
snd_timer
soundcore
Comment 27 Alexey Dobriyan 2009-11-15 00:51:09 UTC
Modules which were loaded with 000??a0b _and_ 000??a1f corruptions:

arc4
crc_itu_t
ecb
input_polldev
rt2x00lib
snd
snd_page_alloc
snd_pcm
snd_timer
soundcore
Comment 28 Alexey Dobriyan 2009-11-15 12:13:07 UTC
*** Bug 13975 has been marked as a duplicate of this bug. ***