Bug 14277

Summary: Caught 8-bit read from freed memory in b43 driver at association
Product: Drivers Reporter: Christian Casteyde (casteyde.christian)
Component: network-wirelessAssignee: drivers_network-wireless (drivers_network-wireless)
Status: CLOSED CODE_FIX    
Severity: normal CC: johannes, linville, mb, rjw
Priority: P1    
Hardware: All   
OS: Linux   
Kernel Version: 2.6.32-rc1 Subsystem:
Regression: Yes Bisected commit-id:
Bug Depends on:    
Bug Blocks: 14230    
Attachments: dma.o object file (code without the commented check on ring)
dma.o object file (code with the check on ring)
config file (for debug options I selected)
possible fix

Description Christian Casteyde 2009-09-30 18:06:51 UTC
Bluewhite64 13.0
Athlon 64 3000 in 64 bits mode
Acer Aspire 1511LMi laptop

Since 2.6.32-rc1 compiled with memcheck, I got the following error:
b43-phy0: Loading firmware version 410.2160 (2007-05-26 15:32:10)
ADDRCONF(NETDEV_UP): eth1: link is not ready
WARNING: kmemcheck: Caught 8-bit read from freed memory (ffff8800429bb738)
0000000000000000000000000000000000000000000000001f20010000000000
 f f f f f f f f f f f f f f f f f f f f f f f f f f f f f f f f
                                                 ^

Modules linked in:
Pid: 5, comm: events/0 Not tainted 2.6.32-rc2 #3 Aspire 1510
RIP: 0010:[<ffffffff8133c7ce>]  [<ffffffff8133c7ce>] b43_dma_tx+0x64e/0xbc0
RSP: 0018:ffff88004bbd7ca0  EFLAGS: 00010202
RAX: 0000000040000000 RBX: 0000000000c14020 RCX: 0000000000000001
RDX: 0000000000000000 RSI: 000000006000002a RDI: ffff88004f2d9930
RBP: ffff88004bbd7da0 R08: 0000000040c14020 R09: 0000000000000001
R10: 0000000000000001 R11: 0000000000000000 R12: 0000000000000020
R13: ffff88004255c720 R14: ffff8800429bbd00 R15: ffff8800429bb700
FS:  0000000001b1c860(0000) GS:ffffffff81783000(0000) knlGS:0000000000000000
CS:  0010 DS: 0018 ES: 0018 CR0: 000000008005003b
CR2: ffff88004f832550 CR3: 0000000040dec000 CR4: 00000000000006f0
DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
DR3: 0000000000000000 DR6: 00000000ffff4ff0 DR7: 0000000000000400
 [<ffffffff813242ce>] b43_tx_work+0x5e/0x90
 [<ffffffff8104dd42>] worker_thread+0x1d2/0x320
 [<ffffffff81051f9e>] kthread+0x8e/0xa0
 [<ffffffff8100c1ca>] child_rip+0xa/0x20
 [<ffffffffffffffff>] 0xffffffffffffffff
eth1: deauthenticating by local choice (reason=3)
eth1: direct probe to AP 00:07:cb:55:12:b0 (try 1)
eth1: direct probe to AP 00:07:cb:55:12:b0 (try 1)
eth1 direct probe responded
eth1: authenticate with AP 00:07:cb:55:12:b0 (try 1)
eth1: authenticated
eth1: associate with AP 00:07:cb:55:12:b0 (try 1)
eth1: RX AssocResp from 00:07:cb:55:12:b0 (capab=0x411 status=0 aid=2)
eth1: associated
ADDRCONF(NETDEV_CHANGE): eth1: link becomes ready

After that, everything works nice.
Comment 1 Michael Buesch 2009-10-02 21:41:31 UTC
A little bit more information is required. Does this happen on interface up/down only or does it happen in operation?
Is it possible for you to track down the b43_dma_tx+0x64e/0xbc0 to the line of C code where this happens?
Comment 2 Christian Casteyde 2009-10-04 21:45:55 UTC
It's just at the first association. After that, the warning doesn't occur anymore (the variable may be initialized).

I added some printk and found the faulty line was the last test in b43_dma_tx, that is:

        printk(KERN_INFO "b43_dma_tx 9\n");
        ring->nr_tx_packets++;
        printk(KERN_INFO "b43_dma_tx A\n");
**** This line ****
        if ((free_slots(ring) < TX_SLOTS_PER_FRAME) ||
            should_inject_overflow(ring)) {
**** End of line ****
                printk(KERN_INFO "b43_dma_tx A1\n");
                /* This TX ring is full. */
                ieee80211_stop_queue(dev->wl->hw, skb_get_queue_mapping(skb));
                printk(KERN_INFO "b43_dma_tx A2\n");
                ring->stopped = 1;
                printk(KERN_INFO "b43_dma_tx A3\n");
                if (b43_debug(dev, B43_DBG_DMAVERBOSE)) {
                        printk(KERN_INFO "b43_dma_tx A4\n");
                        b43dbg(dev->wl, "Stopped TX ring %d\n", ring->index);
                        printk(KERN_INFO "b43_dma_tx A5\n");
                }
        }
        printk(KERN_INFO "b43_dma_tx B\n");
out:

In my logs, I get:
b43_dma_tx A
b43_dma_tx B

but no A{1-n}

so ring may be the faulty variable
Comment 3 Michael Buesch 2009-10-04 22:43:21 UTC
Are you sure this isn't a bug in kmemcheck?
It's reporting an 8bit-read from feed memory, but ring is only freed on interface shutdown.
And free_slots() checks two 32bit variables, so I think this can't trigger an 8bit read error. Do you have CONFIG_B43_DEBUG enabled? If not, should_inject_overflow() compiles to nothing. But even if you have CONFIG_B43_DEBUG defined, I don't see an 8bit read in that function.
How does kmemcheck detect whether it's an 8bit read?
Comment 4 Michael Buesch 2009-10-04 22:46:28 UTC
Uhm, wait a second (I just read your message again), what did you actually check in your test? The
        if ((free_slots(ring) < TX_SLOTS_PER_FRAME) ||
            should_inject_overflow(ring)) {
check is _not_ supposed to trigger often. It's an unlikely branch. You won't hit it in practice very often, so it's expected that you don't see a A1/2/3/4/5 message.

You need to check what triggers the 8-bit read error.
Comment 5 Christian Casteyde 2009-10-05 17:24:42 UTC
I'm not sure of anything about kmemcheck. It may have bugs of course, I don't know how it works: I only activated it and tried to see which line was faulty.

I was looking for the line which was triggering the problem. I added printk before and after each line of b43_dma_tx, and found the problem was in the test (I didn't cut in two subtests to see if free_slots or should_inject_overflow was responsible of the problem).

Indeed, I added A1 etc. in the if branch only to see if the warning was issued by a line in the test. As you said it, the flow doesn't go inside the block controlled by the if, since A1 doesn't occur. So we know the warning is preceded by A and followed by B (which is after the test in the execution flow). So this is really the code in the test that triggers the problem.

Now, I don't know what it does, however the problem occurs between the call to free_slots and the end of shoud_inject (since this is an "or"). These two functions may do anything they do, and I don't know what ring is.

However, something annoys me semantically speaking: if "free_slots" is supposed to free something in ring from its name, shoud_inject_overflow should'nt use ring anymore as it is executed second. That is coherent with kmemcheck warning. But of course, once again, I don't know at all the b43 code, just trying to help.
Comment 6 Michael Buesch 2009-10-05 21:05:41 UTC
(In reply to comment #5)
> However, something annoys me semantically speaking: if "free_slots" is
> supposed
> to free something in ring

No it isn't. It returns the number of free slots in the ring. It's pretty obvious, if you look at the trivial implementation.

> from its name, shoud_inject_overflow should'nt use
> ring anymore as it is executed second.

???
should_inject_overflow is a debugging function that is completely optimized out, if you disable debugging.

> That is coherent with kmemcheck warning.
> But of course, once again, I don't know at all the b43 code, just trying to
> help.

I'm pretty sure the b43 code is ok.
Comment 7 Christian Casteyde 2009-10-05 21:25:58 UTC
OK, so I'll redo this test with laters -rc, to see if this is a kmemcheck false positive or something that get fixed elsewhere, since I cannot do anything more for this.

For should_inject_overflow, if I remove kernel debugging (at least kmemcheck) I cannot see the problem ;-) That's the goal of my checks!

The worse that could happen is a block freed erratically by another subsystem (and in this case other people will experience real troubles).
Comment 8 Michael Buesch 2009-10-05 21:32:49 UTC
(In reply to comment #7)
> For should_inject_overflow, if I remove kernel debugging (at least kmemcheck)
> I
> cannot see the problem ;-) That's the goal of my checks!

Well, my suggestion was to disable b43-debugging, not kmemcheck debugging.
b43-debugging is CONFIG_B43_DEBUG. If the oops still happens with b43-debugging disabled (and kmemcheck enabled), it narrows it down to two 32-bit reads. However, I can't see how they could be freed from within full operation, except of course, if another unrelated driver is freeing dangling pointers...
The two 32bit integers are members of the ring structure. If we'd free that in operation, we'd have _much_ bigger problems than an 8bit read ;)
Comment 9 Christian Casteyde 2009-10-05 21:54:09 UTC
CONFIG_B43_DEBUG is already set to off in my .config.
The only debug option I use when I check a new kernel are in the kernel hacking menu, so it's not should_inject_overflow.

I'll try to narrow the problem in free_slots, but I've just finished a bisect on another one and I'm tired. Will check on next -rc now I think, next week.
Comment 10 Michael Buesch 2009-10-05 21:57:41 UTC
How do you narrow it down, btw?
I'm not sure if kmemcheck is synchronous. Does it work through trapping each mem access? Or is it some asynchronous timer based mechanism? I don't know.
Comment 11 Christian Casteyde 2009-10-05 22:25:15 UTC
I had the feeling it was synchronous, so putting some printk should be sufficient.

I guess kmemcheck use the paging mechanisms to trap memory problems, like electric fence for user space software (or gflags under windows, to which I'm more accustomed at work). So the warning must be issued by a signal, handled by the VMM layer, and I'm quite sure paging this is synchronous (otherwise I don't understand how it can work reliably). But all this is still guess, kernel dev is not my party at all.

I guess it catch only once the error by default, but there is a kernel option that seems to make it catch everytime (at each access). However when I tried this option, I didn't see any difference, so either I haven't understood what it was for, or the problem occurs only once (at boot apparently). When I tried to reassociate, I couldn't reproduce the problem: I get it only at boot (I should check with module loading/unloading too I think).
Comment 12 Johannes Berg 2009-10-06 06:48:19 UTC
Can you provide a new warning (since you've changed the kernel) along with disassembly of the matching version of the function in question?
Comment 13 Christian Casteyde 2009-10-09 15:03:45 UTC
After some more tests, I'm quite confused and think also it's a kmemcheck bug. First, when trying to reproduce, I noticed I misinterpreted my printks results: the warning occurs ***after*** B, that is, the faulty line is the "out:" label!
This is not the access to ring, but something else the compiler generated, or a bug in kmemcheck.

To be sure ring is OK, I redid the test with the following code on -rc3:

My printk are now the following at the end of the function:

       printk(KERN_INFO "b43_dma_tx 9\n");
        ring->nr_tx_packets++;
        printk(KERN_INFO "b43_dma_tx A\n");
/*
        int nr_slots = ring->nr_slots;
        printk(KERN_INFO "b43_dma_tx A01\n");
        int used_slots = ring->used_slots;
        printk(KERN_INFO "b43_dma_tx A02\n");
        int ifs = free_slots(ring);
        printk(KERN_INFO "b43_dma_tx A02, %d=%d-%d\n", ifs, nr_slots, used_slots);
*/
        if ((free_slots(ring) < TX_SLOTS_PER_FRAME) ||
            should_inject_overflow(ring)) {
                printk(KERN_INFO "b43_dma_tx A1\n");
                /* This TX ring is full. */
                ieee80211_stop_queue(dev->wl->hw, skb_get_queue_mapping(skb));
                ring->stopped = 1;
                if (b43_debug(dev, B43_DBG_DMAVERBOSE)) {
                        b43dbg(dev->wl, "Stopped TX ring %d\n", ring->index);
                }
        }
        printk(KERN_INFO "b43_dma_tx B\n");
out:

        return err;

I still get A, then B:
b43-phy0: Loading firmware version 410.2160 (2007-05-26 15:32:10)
ADDRCONF(NETDEV_UP): eth1: link is not ready
b43_dma_tx 9
b43_dma_tx A
b43_dma_tx B
WARNING: kmemcheck: Caught 8-bit read from freed memory (ffff88004e754e38)
0000000000000000000000000000000000000000000000001f20010000000000
 f f f f f f f f f f f f f f f f f f f f f f f f f f f f f f f f
                                                 ^

Modules linked in:
Pid: 5, comm: events/0 Not tainted 2.6.32-rc3 #4 Aspire 1510
RIP: 0010:[<ffffffff813366ab>]  [<ffffffff813366ab>] b43_dma_tx+0x57b/0x930
RSP: 0018:ffff88004e961cd0  EFLAGS: 00010202
RAX: 0000000040000000 RBX: 0000000000c14020 RCX: 0000000000000001
RDX: 0000000000000000 RSI: 000000006000002a RDI: ffff88004fa02130
etc.

If I uncomment the lines (OK, this is in C++ declaration style...), I get this:

b43-phy0: Loading firmware version 410.2160 (2007-05-26 15:32:10)
ADDRCONF(NETDEV_UP): eth1: link is not ready
b43_dma_tx 9
b43_dma_tx A
b43_dma_tx A01
b43_dma_tx A02
b43_dma_tx A02, 254=256-2
b43_dma_tx B
WARNING: kmemcheck: Caught 8-bit read from freed memory (ffff88004d516838)
0000000000000000000000000000000000000000000000001f20010000000000
 f f f f f f f f f f f f f f f f f f f f f f f f f f f f f f f f
                                                 ^

Modules linked in:
Pid: 3117, comm: irq/19-b43 Not tainted 2.6.32-rc3 #5 Aspire 1510
RIP: 0010:[<ffffffff813366ab>]  [<ffffffff813366ab>] b43_dma_tx+0x57b/0x960
RSP: 0018:ffff88004e961cd0  EFLAGS: 00010202
RAX: 0000000040000000 RBX: 0000000000c14020 RCX: 0000000000000001
RDX: 0000000000000000 RSI: 000000006000002a RDI: ffff88004fa02130
RBP: ffff88004e961da0 R08: 0000000040c14020 R09: 0000000000000001
R10: 0000000000000001 R11: 0000000000000000 R12: ffff880000c14020
R13: ffff88004e7db240 R14: ffff88004d516100 R15: ffff88004d516800
FS:  0000000002142860(0000) GS:ffffffff8176d000(0000) knlGS:0000000000000000
CS:  0010 DS: 0018 ES: 0018 CR0: 000000008005003b
etc.

So I always get "B" before the warning, and there is stricly no code after B.

I tried to disassemble dma.o but I'm not sure it is the right solution. Therefore I prefer giving to you the dma.o file (with and without the commented code portion), you may be more used to get faulty asm code.

Ah, since this can be a bad code generation, I'm on Bluewhite64 13, that uses:
../gcc-4.3.3/configure --prefix=/usr --libdir=/usr/lib --enable-shared --enable-bootstrap --enable-languages=ada,c,c++,fortran,java,objc --enable-threads=posix --enable-checking=release --with-system-zlib --disable-libunwind-exceptions --enable-__cxa_atexit --enable-libssp --with-gnu-ld --verbose --disable-multilib --target=x86_64-pc-linux --build=x86_64-pc-linux --host=x86_64-pc-linux

and:
GNU ld (Linux/GNU Binutils) 2.18.50.0.9.20080822

Quite strange bug indeed.
Comment 14 Christian Casteyde 2009-10-09 15:05:59 UTC
Created attachment 23318 [details]
dma.o object file (code without the commented check on ring)
Comment 15 Christian Casteyde 2009-10-09 15:07:03 UTC
Created attachment 23319 [details]
dma.o object file (code with the check on ring)
Comment 16 Christian Casteyde 2009-10-09 15:08:09 UTC
Created attachment 23320 [details]
config file (for debug options I selected)
Comment 17 Michael Buesch 2009-10-26 20:45:01 UTC
Created attachment 23539 [details]
possible fix

Please test
Comment 18 Christian Casteyde 2009-10-28 20:41:06 UTC
This fix seems to work.
I've not managed to get the warning after several consecutive reboots.
Thanks
Comment 19 Christian Casteyde 2009-11-16 23:19:29 UTC
fixed in 32-rc7
Comment 20 Rafael J. Wysocki 2009-11-16 23:24:52 UTC
Closing.