Bug 31572

Summary: BUG in vb_alloc() - firewire crash at boot
Product: Memory Management Reporter: Pavel Kysilka (goldenfish)
Component: OtherAssignee: Andrew Morton (akpm)
Status: CLOSED CODE_FIX    
Severity: normal CC: florian, selk, stefanr
Priority: P1    
Hardware: All   
OS: Linux   
Kernel Version: 2.6.38-rc2 Subsystem:
Regression: Yes Bisected commit-id:
Bug Depends on:    
Bug Blocks: 27352    
Attachments: photo of oops
komplete boot log with oops - kernel 2.6.39-rc3+
debugging output in vb_alloc()
fix vmap power-of-2 NR_CPUS assumption

Description Pavel Kysilka 2011-03-21 12:40:32 UTC
Created attachment 51502 [details]
photo of oops

Computer crash randomly at boot with one flashdisk or usb cdrom connected.
Tested with kernel 2.6.38-rc2 and 2.6.38.

Computer: AMD Phenom x6, M4A88TD-V EVO/USB3 motherboard, 8GB RAM.

lspci:

00:00.0 Host bridge: Advanced Micro Devices [AMD] RS880 Host Bridge
00:01.0 PCI bridge: ASUSTeK Computer Inc. RS880 PCI to PCI bridge (int gfx)
00:09.0 PCI bridge: Advanced Micro Devices [AMD] RS780/RS880 PCI to PCI bridge (PCIE port 4)
00:0a.0 PCI bridge: Advanced Micro Devices [AMD] RS780/RS880 PCI to PCI bridge (PCIE port 5)
00:11.0 SATA controller: ATI Technologies Inc SB700/SB800 SATA Controller [IDE mode] (rev 40)
00:12.0 USB Controller: ATI Technologies Inc SB700/SB800 USB OHCI0 Controller
00:12.2 USB Controller: ATI Technologies Inc SB700/SB800 USB EHCI Controller
00:13.0 USB Controller: ATI Technologies Inc SB700/SB800 USB OHCI0 Controller
00:13.2 USB Controller: ATI Technologies Inc SB700/SB800 USB EHCI Controller
00:14.0 SMBus: ATI Technologies Inc SBx00 SMBus Controller (rev 42)
00:14.1 IDE interface: ATI Technologies Inc SB700/SB800 IDE Controller (rev 40)
00:14.2 Audio device: ATI Technologies Inc SBx00 Azalia (Intel HDA) (rev 40)
00:14.3 ISA bridge: ATI Technologies Inc SB700/SB800 LPC host controller (rev 40)
00:14.4 PCI bridge: ATI Technologies Inc SBx00 PCI to PCI Bridge (rev 40)
00:14.5 USB Controller: ATI Technologies Inc SB700/SB800 USB OHCI2 Controller
00:15.0 PCI bridge: ATI Technologies Inc Device 43a0
00:15.1 PCI bridge: ATI Technologies Inc Device 43a1
00:16.0 USB Controller: ATI Technologies Inc SB700/SB800 USB OHCI0 Controller
00:16.2 USB Controller: ATI Technologies Inc SB700/SB800 USB EHCI Controller
00:18.0 Host bridge: Advanced Micro Devices [AMD] Family 10h Processor HyperTransport Configuration
00:18.1 Host bridge: Advanced Micro Devices [AMD] Family 10h Processor Address Map
00:18.2 Host bridge: Advanced Micro Devices [AMD] Family 10h Processor DRAM Controller
00:18.3 Host bridge: Advanced Micro Devices [AMD] Family 10h Processor Miscellaneous Control
00:18.4 Host bridge: Advanced Micro Devices [AMD] Family 10h Processor Link Control
01:05.0 VGA compatible controller: ATI Technologies Inc RS880 [Radeon HD 4250]
01:05.1 Audio device: ATI Technologies Inc RS880 Audio Device [Radeon HD 4200]
02:00.0 FireWire (IEEE 1394): VIA Technologies, Inc. VT6315 Series Firewire Controller
02:00.1 IDE interface: VIA Technologies, Inc. VT6415 PATA IDE Host Controller (rev a0)
03:00.0 USB Controller: NEC Corporation uPD720200 USB 3.0 Host Controller (rev 03)
06:00.0 Ethernet controller: Realtek Semiconductor Co., Ltd. RTL8111/8168B PCI Express Gigabit Ethernet controller (rev 06)
Comment 1 Stefan Richter 2011-03-21 13:42:40 UTC
EIP is at vm_map_ram.  firewire-ohci only calls this since 2.6.38-rc1 commit 7a39d8b82165462729d09066bddb395a19025acd, hence this bug seems to be a regression after 2.6.37.
Comment 2 Stefan Richter 2011-03-27 20:37:19 UTC
Date: Tue, 22 Mar 2011 08:24:44 +0100
From: Clemens Ladisch <clemens@ladisch.de>

Stefan Richter wrote:
> > > https://bugzilla.kernel.org/show_bug.cgi?id=31572
> > > Created an attachment (id=51502)  
> > >  --> (https://bugzilla.kernel.org/attachment.cgi?id=51502)  
> > > photo of oops  
> 
> EIP is at vm_map_ram+0xff/0x363.  

This is in some inlined part of vb_alloc (which means that the FireWire
code is not directly at fault, it's just the first one that happens to
use this code).

> Clemens, does the hex dump tell you anything?  

Half of it is missing.  (What's going on with that video output?
This GPU works fine in my machine, with a 64-bit kernel.  (And why
is an 8 GB machine using a 32-bit kernel?))

Anyway, the part immediately before the crashing instruction is:
c109c993:   31 d2                   xor    %edx,%edx
c109c995:   f7 f1                   div    %ecx
c109c997:   31 d2                   xor    %edx,%edx
c109c999:   89 c7                   mov    %eax,%edi
c109c99b:   8b 45 cc                mov    -0x34(%ebp),%eax
c109c99e:   f7 f1                   div    %ecx
c109c9a0:   39 c7                   cmp    %eax,%edi
c109c9a2:   74 04                   je     0xc109c9a8
c109c9a4:   ??...                   ???                   <-- crash here

This looks as if this check in vb_alloc triggered:

                BUG_ON(addr_to_vb_idx(addr) !=
                                addr_to_vb_idx(vb->va->va_start));

On x86, we call vm_map_ram() with 8+2 pages, so the parameters here
are vb_alloc(40960, GFP_KERNEL).

I've never tested this code during bootup; I always loaded firewire-ohci
later.


Regards,
Clemens
Comment 3 Stefan Richter 2011-04-09 11:24:32 UTC
*** Bug 32842 has been marked as a duplicate of this bug. ***
Comment 4 Stefan Richter 2011-04-09 12:18:06 UTC
On Mar 22 Clemens Ladisch wrote:
> Stefan Richter wrote:
> > > > https://bugzilla.kernel.org/show_bug.cgi?id=31572
> > > > Created an attachment (id=51502)
> > > >  --> (https://bugzilla.kernel.org/attachment.cgi?id=51502)
> > > > photo of oops
> > 
> > EIP is at vm_map_ram+0xff/0x363.
> 
> This is in some inlined part of vb_alloc (which means that the FireWire
> code is not directly at fault, it's just the first one that happens to
> use this code).
> 
> > Clemens, does the hex dump tell you anything?
> 
> Half of it is missing.  (What's going on with that video output?
> This GPU works fine in my machine, with a 64-bit kernel.  (And why
> is an 8 GB machine using a 32-bit kernel?))
> 
> Anyway, the part immediately before the crashing instruction is:
> c109c993:   31 d2                   xor    %edx,%edx
> c109c995:   f7 f1                   div    %ecx
> c109c997:   31 d2                   xor    %edx,%edx
> c109c999:   89 c7                   mov    %eax,%edi
> c109c99b:   8b 45 cc                mov    -0x34(%ebp),%eax
> c109c99e:   f7 f1                   div    %ecx
> c109c9a0:   39 c7                   cmp    %eax,%edi
> c109c9a2:   74 04                   je     0xc109c9a8
> c109c9a4:   ??...                   ???                   <-- crash here
> 
> This looks as if this check in vb_alloc triggered:
> 
>                 BUG_ON(addr_to_vb_idx(addr) !=
>                                 addr_to_vb_idx(vb->va->va_start));

This is it indeed.  Matias posted at
https://bugzilla.kernel.org/show_bug.cgi?id=32842 :

>>>
Sometimes I receive this error while I am booting:

[    7.227814] kernel BUG at mm/vmalloc.c:893!
[    7.227909] invalid opcode: 0000 [#1] SMP 
[    7.228001] last sysfs file: /sys/devices/pci0000:00/0000:00:04.0/host6/target6:0:0/6:0:0:0/type
[    7.228001] Modules linked in: i2c_nforce2(+) firewire_ohci(+) ehci_hcd(+) processor pcmcia_core forcedeth(+) firewire_core soundcore rtc_lib i2c_core evdev fan button snd_page_alloc k10temp thermal asus_atk0110 sg serio_raw thermal_sys pcspkr hwmon
[    7.228001] 
[    7.228001] Pid: 1176, comm: modprobe Not tainted 2.6.38.2-smp #1 System manufacturer System Product Name/M2N-SLI DELUXE
[    7.228001] EIP: 0060:[<c10b3033>] EFLAGS: 00010202 CPU: 0
[    7.228001] EIP is at vm_map_ram+0xd9/0x312
[    7.228001] EAX: 00000017 EBX: f4a0a8a0 ECX: 00055000 EDX: 00054000
[    7.228001] ESI: f6003318 EDI: f87b1000 EBP: f4bd1d94 ESP: f4bd1d64
[    7.228001]  DS: 007b ES: 007b FS: 00d8 GS: 00e0 SS: 0068
[    7.228001] Process modprobe (pid: 1176, ti=f4bd0000 task=f44cb670 task.ti=f4bd0000)
[    7.228001] Stack:
[    7.228001]  f87a1000 f4bd1db0 00000000 0000000a 00000010 00000004 0000a000 00000018
[    7.228001]  f4a0a8cc 34479000 f44805d4 00000008 f4bd1de4 f8772bdb 00000163 f6e85f20
[    7.228001]  f5227060 00000400 f4480000 f6e86de0 f6e86e20 f6e94180 f6e941a0 f6e85880
[    7.228001] Call Trace:
[    7.228001]  [<f8772bdb>] ar_context_init+0xee/0x15e [firewire_ohci]
[    7.228001]  [<f877469b>] pci_probe+0x1bf/0x42d [firewire_ohci]
[    7.228001]  [<c12f6754>] ? pm_runtime_enable+0x55/0x5d
[    7.228001]  [<c1294663>] local_pci_probe+0x34/0x5f
[    7.228001]  [<c1294b07>] pci_device_probe+0x4a/0x6d
[    7.228001]  [<c12f18cf>] driver_probe_device+0x9b/0x11f
[    7.228001]  [<c12f199b>] __driver_attach+0x48/0x64
[    7.228001]  [<c12f0d56>] bus_for_each_dev+0x42/0x65
[    7.228001]  [<c12f1636>] driver_attach+0x19/0x1b
[    7.228001]  [<c12f1953>] ? __driver_attach+0x0/0x64
[    7.228001]  [<c12f130d>] bus_add_driver+0x8d/0x1c5
[    7.228001]  [<c12f1b75>] driver_register+0x7c/0xdb
[    7.228001]  [<c107a9b8>] ? tracepoint_module_notify+0x22/0x26
[    7.228001]  [<c1294ce1>] __pci_register_driver+0x3d/0x9a
[    7.228001]  [<f8779017>] fw_ohci_init+0x17/0x19 [firewire_ohci]
[    7.228001]  [<c100123f>] do_one_initcall+0x76/0x11b
[    7.228001]  [<f8779000>] ? fw_ohci_init+0x0/0x19 [firewire_ohci]
[    7.228001]  [<c105eab0>] sys_init_module+0x12ed/0x14eb
[    7.228001]  [<c144fdf4>] syscall_call+0x7/0xb
[    7.228001] Code: 58 0f 0b 8b 53 04 89 c7 c1 e7 0c 8b 12 01 d7 89 f8 89 55 d0 e8 79 f0 ff ff 8b 55 d0 89 45 ec 89 d0 e8 6c f0 ff ff 39 45 ec 74 02 <0f> 0b 8b 43 0c 2b 45 e0 85 c0 89 43 0c 75 1b 89 f0 e8 7e c9 39 
[    7.228001] EIP: [<c10b3033>] vm_map_ram+0xd9/0x312 SS:ESP 0068:f4bd1d64
<<<

> On x86, we call vm_map_ram() with 8+2 pages, so the parameters here
> are vb_alloc(40960, GFP_KERNEL).
> 
> I've never tested this code during bootup; I always loaded firewire-ohci
> later.

Today I tried an i686 alias x86-32 kernel v2.6.39-rc2 on a Core 2 Duo with 
2 GB RAM with firewire-ohci statically linked in, booting it twenty times
or so.  Could not reproduce it so far.

I also got an x86-64 box with 8 GB RAM but reboot it rarely.  (It still
runs 2.6.38-rc7.)  I do unload and reload modular firewire-ohci on it very
often, but this did not reproduce the bug either.
Comment 5 Pavel Kysilka 2011-04-13 20:05:59 UTC
Created attachment 54312 [details]
komplete boot log with oops - kernel 2.6.39-rc3+
Comment 6 Clemens Ladisch 2011-06-21 07:33:07 UTC
Created attachment 63062 [details]
debugging output in vb_alloc()

Please try running with this patch.
It doesn't prevent the crash, put prints more information about the internal state of vb_alloc()'s data structures to the log.
Comment 7 Clemens Ladisch 2011-06-22 06:53:54 UTC
Created attachment 63132 [details]
fix vmap power-of-2 NR_CPUS assumption

Pavel, Matias, please try this patch, which should fix the crash.
(Alternatively, you could set CONFIG_NR_CPUS to a power of two.)
Comment 8 Matias A. Fonzo 2011-06-22 18:42:25 UTC
Hi Clemens,

I've updated (some time ago) the libraries on my GNU/Linux:

libavc1394-0.5.4
libffado-2.0.1
libiec61883-1.2.0
libraw1394-2.0.7

I have not seen the error message. Now, I am using the kernel 2.6.38.8.
Comment 9 Stefan Richter 2011-08-14 12:39:27 UTC
The bug is 100% repeatable here.  I tried kernel 3.0.1, x86-32 dual-core, CONFIG_NR_CPUS=6, CONFIG_FIREWIRE_OHCI=y.  The patch in attachment 63062 [details] fixes it for me.
Comment 10 Stefan Richter 2011-08-14 12:40:31 UTC
Correction, I actually tested the patch in attachment 63132 [details].
Comment 11 Florian Mickler 2011-08-15 09:13:43 UTC
A patch referencing this bug report has been merged in Linux v3.1-rc2:

commit f982f91516fa4cfd9d20518833cd04ad714585be
Author: Clemens Ladisch <clemens@ladisch.de>
Date:   Tue Jun 21 22:09:50 2011 +0200

    mm: fix wrong vmap address calculations with odd NR_CPUS values
Comment 12 Florian Mickler 2011-08-15 09:19:13 UTC
So I close this as fixed. Please shout if that is a mistake.