Bug 14200

Summary: kernel oops with latest tg3 changes
Product: Drivers
Reporter: Daniel Vetter (daniel)
Component: Network
Assignee: drivers_network (drivers_network)
Status: CLOSED CODE_FIX
Severity: normal
CC: mcarlson, rjw
Priority: P1
Hardware: All
OS: Linux
Kernel Version: 2.6.31-git
Subsystem:
Regression: Yes
Bisected commit-id:
Bug Depends on:
Bug Blocks: 14230
Attachments: dmesg of working kernel 2.6.31-01217-g483e3cd
             Fix return ring size breakage

Description Daniel Vetter 2009-09-21 14:33:07 UTC
Some time after booting up, the kernel oopses with the following output (captured via serial console, everything else is dead):

[  355.736310] BUG: unable to handle kernel paging request at 0000006600000002
[  355.736390] IP: [<ffffffff810a87ce>] put_page+0x16/0x127
[  355.736390] PGD 0
[  355.736390] Oops: 0000 [#1] PREEMPT SMP
[  355.736390] last sysfs file: /sys/devices/system/cpu/cpu0/cpufreq/scaling_available_governors
[  355.736390] CPU 3
[  355.736390] Modules linked in: radeon ttm drm i2c_algo_bit bnep sco l2cap bluetooth binfmt_misc w83627hf_wdt dm_snapshot dm_mirror dm_region_hash dm_log fuse nfsd nfs lockd fscache nfs_acl auth_rpcgss sunrpc ipv6 xfs exportfs ext3 jbd eeprom lm85 hwmon_vid powernow_k8 sbp2 loop raid1 md_mod snd_intel8x0 snd_ac97_codec ac97_bus snd_pcm snd_seq snd_timer snd_seq_device psmouse i2c_amd756 i2c_amd8111 snd rtc_cmos soundcore serio_raw pcspkr evdev i2c_core rtc_core amd_rng snd_page_alloc button rtc_lib processor rng_core k8temp ext4 mbcache jbd2 crc16 sha256_generic aes_x86_64 aes_generic cbc dm_crypt dm_mod sg sd_mod crc_t10dif sr_mod cdrom usbhid hid pata_amd sata_sil ohci_hcd ohci1394 libata tg3 libphy ehci_hcd ieee1394 scsi_mod usbcore thermal fan thermal_sys [last unloaded: scsi_wait_scan]
[  355.736390] Pid: 13, comm: ksoftirqd/3 Not tainted 2.6.31-rc5-01932-g882e979 #88 To Be Filled By O.E.M.
[  355.845876] RIP: 0010:[<ffffffff810a87ce>]  [<ffffffff810a87ce>] put_page+0x16/0x127
[  355.845876] RSP: 0018:ffffc90000603cc8  EFLAGS: 00010292
[  355.845876] RAX: 0000000000000000 RBX: 0000000000000001 RCX: 0000000000000870
[  355.845876] RDX: ffff88007dca8680 RSI: ffff88007fc027e0 RDI: 0000006600000002
[  355.845876] RBP: ffffc90000603cf8 R08: ffffc90000603d38 R09: ffff8800519814be
[  355.845876] R10: 000000000000000e R11: ffff88007e592600 R12: ffff880051981480
[  355.845876] R13: ffffffff81289db3 R14: ffff88007e550400 R15: ffff880051981554
[  355.845876] FS:  00007f2e24a956f0(0000) GS:ffffc90000600000(0000) knlGS:0000000000000000
[  355.845876] CS:  0010 DS: 0018 ES: 0018 CR0: 000000008005003b
[  355.845876] CR2: 0000006600000002 CR3: 0000000001001000 CR4: 00000000000006e0
[  355.845876] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[  355.845876] DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
[  355.845876] Process ksoftirqd/3 (pid: 13, threadinfo ffff88007ffe4000, task ffff88007ffe2380)
[  355.845876] Stack:
[  355.845876]  00000000ffffffff 0000000000000001 ffff880051981480 ffffffff81289db3
[  355.845876] <0> ffff88007e550400 ffff880051981554 ffffc90000603d18 ffffffff81224780
[  355.845876] <0> ffff880051981480 ffff880051981480 ffffc90000603d38 ffffffff8122446f
[  355.845876] Call Trace:
[  355.845876]  <IRQ>
[  355.845876]  [<ffffffff81289db3>] ? packet_rcv_spkt+0xcc/0xd7
[  355.845876]  [<ffffffff81224780>] skb_release_data+0x73/0xd0
[  355.845876]  [<ffffffff8122446f>] __kfree_skb+0x1e/0x8b
[  355.845876]  [<ffffffff812245b1>] kfree_skb+0xa3/0xab
[  355.845876]  [<ffffffff81289db3>] packet_rcv_spkt+0xcc/0xd7
[  355.845876]  [<ffffffff8122bb25>] netif_receive_skb+0x28a/0x2d0
[  355.845876]  [<ffffffff8122bcd4>] napi_skb_finish+0x2d/0x44
[  355.845876]  [<ffffffff8122c17c>] napi_gro_receive+0x2f/0x34
[  355.845876]  [<ffffffffa00bee4a>] tg3_poll+0x6ec/0x940 [tg3]
[  355.845876]  [<ffffffff8122c291>] net_rx_action+0x82/0x1c6
[  355.845876]  [<ffffffff8104ae12>] __do_softirq+0x118/0x225
[  355.845876]  [<ffffffff8104a385>] ? ksoftirqd+0x0/0x164
[  355.845876]  [<ffffffff8100cd9c>] call_softirq+0x1c/0x28
[  355.845876]  <EOI>
[  355.845876]  [<ffffffff8100e842>] do_softirq+0x3e/0x8f
[  355.845876]  [<ffffffff8104a40e>] ksoftirqd+0x89/0x164
[  355.845876]  [<ffffffff8105af79>] kthread+0x8d/0x95
[  355.845876]  [<ffffffff8100cc9a>] child_rip+0xa/0x20
[  355.845876]  [<ffffffff81039d44>] ? finish_task_switch+0x56/0xe3
[  355.845876]  [<ffffffff8100c62d>] ? restore_args+0x0/0x30
[  355.845876]  [<ffffffff8105aeec>] ? kthread+0x0/0x95
[  355.845876]  [<ffffffff8100cc90>] ? child_rip+0x0/0x20
[  355.845876] Code: fe ff eb 08 e8 db d7 fe ff 41 54 9d 41 5b 5b 41 5c 41 5d c9 c3 55 48 89 e5 41 57 41 56 41 55 41 54 53 48 83 ec 08 0f 1f 44 00 00 <48> 8b 07 48 89 fb 66 a9 00 c0 74 26 66 85 c0 79 04 48 8b 5f 10
[  355.845876] RIP  [<ffffffff810a87ce>] put_page+0x16/0x127
[  355.845876]  RSP <ffffc90000603cc8>
[  355.845876] CR2: 0000006600000002
[  356.143175] ---[ end trace 121ba4541c000396 ]---
[  356.148089] Kernel panic - not syncing: Fatal exception in interrupt
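
As a reading aid for the oops above, here is a minimal sketch (an editorial assumption, not part of the original analysis) that checks what kind of value ended up in RDI. CR2 equals RDI, and the marked bytes `<48> 8b 07` in the Code: line decode to `mov (%rdi),%rax`, so put_page appears to have faulted on the first dereference of its argument. The helper names below are hypothetical.

```python
# Values copied from the register dump above; the interpretation is an assumption.
FAULT_ADDR = 0x0000006600000002  # CR2, identical to RDI
KERNEL_PTR = 0xFFFF880051981480  # R12, a kernel pointer from the same dump


def is_canonical(addr: int) -> bool:
    """x86-64 canonical form: bits 63..47 are either all 0 or all 1."""
    top = addr >> 47
    return top == 0 or top == 0x1FFFF


def is_kernel_half(addr: int) -> bool:
    """True for the upper canonical half, where kernel mappings live."""
    return (addr >> 47) == 0x1FFFF


def split_halves(addr: int) -> tuple:
    """Return the (high, low) 32-bit halves of a 64-bit value."""
    return addr >> 32, addr & 0xFFFFFFFF


if __name__ == "__main__":
    print("canonical:", is_canonical(FAULT_ADDR))
    print("kernel half:", is_kernel_half(FAULT_ADDR))
    print("halves:", [hex(h) for h in split_halves(FAULT_ADDR)])
```

The faulting address is canonical but lies in the lower half, and it splits into two small 32-bit values (0x66 and 0x2). That would be consistent with non-pointer data being treated as a page pointer, although the trace alone does not show where that data came from.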


I'm in the middle of bisecting; the current status is:
good: 2.6.31-rc5-01926-gb6080e1
bad: 2.6.31-rc5-01932-g882e979

I hope I can finish the rest of this bisect run sometime later today. I'll also attach the dmesg of a working kernel shortly.
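
For a rough idea of how much bisecting is left, the two version strings above can be compared. The sketch below assumes the numeric field is the count of commits since the base version, as in git-describe-style suffixes, and that the history between the two points is close to linear; with merges the real candidate range may differ. The function name is hypothetical.

```python
import math
import re

# "<base>-<commits since base>-g<abbreviated sha>"
DESCRIBE_RE = re.compile(r"^(?P<base>.+)-(?P<count>\d+)-g(?P<sha>[0-9a-f]+)$")


def parse_describe(version: str) -> tuple:
    """Split a version string into (base, commit count, abbreviated sha)."""
    match = DESCRIBE_RE.match(version)
    if match is None:
        raise ValueError(f"unrecognised version string: {version!r}")
    return match.group("base"), int(match.group("count")), match.group("sha")


good = parse_describe("2.6.31-rc5-01926-gb6080e1")
bad = parse_describe("2.6.31-rc5-01932-g882e979")

# Commits after the last good one, up to and including the first known bad one.
candidates = bad[1] - good[1]
# A binary search over n candidates needs about ceil(log2(n)) test builds.
steps = math.ceil(math.log2(candidates))

if __name__ == "__main__":
    print(f"{candidates} candidate commits, about {steps} more builds")
```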

# lspci -nn
00:06.0 PCI bridge [0604]: Advanced Micro Devices [AMD] AMD-8111 PCI [1022:7460] (rev 07)
00:07.0 ISA bridge [0601]: Advanced Micro Devices [AMD] AMD-8111 LPC [1022:7468] (rev 05)
00:07.1 IDE interface [0101]: Advanced Micro Devices [AMD] AMD-8111 IDE [1022:7469] (rev 03)
00:07.2 SMBus [0c05]: Advanced Micro Devices [AMD] AMD-8111 SMBus 2.0 [1022:746a] (rev 02)
00:07.3 Bridge [0680]: Advanced Micro Devices [AMD] AMD-8111 ACPI [1022:746b] (rev 05)
00:07.5 Multimedia audio controller [0401]: Advanced Micro Devices [AMD] AMD-8111 AC97 Audio [1022:746d] (rev 03)
00:0a.0 PCI bridge [0604]: Advanced Micro Devices [AMD] AMD-8131 PCI-X Bridge [1022:7450] (rev 12)
00:0a.1 PIC [0800]: Advanced Micro Devices [AMD] AMD-8131 PCI-X IOAPIC [1022:7451] (rev 01)
00:0b.0 PCI bridge [0604]: Advanced Micro Devices [AMD] AMD-8131 PCI-X Bridge [1022:7450] (rev 12)
00:0b.1 PIC [0800]: Advanced Micro Devices [AMD] AMD-8131 PCI-X IOAPIC [1022:7451] (rev 01)
00:18.0 Host bridge [0600]: Advanced Micro Devices [AMD] K8 [Athlon64/Opteron] HyperTransport Technology Configuration [1022:1100]
00:18.1 Host bridge [0600]: Advanced Micro Devices [AMD] K8 [Athlon64/Opteron] Address Map [1022:1101]
00:18.2 Host bridge [0600]: Advanced Micro Devices [AMD] K8 [Athlon64/Opteron] DRAM Controller [1022:1102]
00:18.3 Host bridge [0600]: Advanced Micro Devices [AMD] K8 [Athlon64/Opteron] Miscellaneous Control [1022:1103]
00:19.0 Host bridge [0600]: Advanced Micro Devices [AMD] K8 [Athlon64/Opteron] HyperTransport Technology Configuration [1022:1100]
00:19.1 Host bridge [0600]: Advanced Micro Devices [AMD] K8 [Athlon64/Opteron] Address Map [1022:1101]
00:19.2 Host bridge [0600]: Advanced Micro Devices [AMD] K8 [Athlon64/Opteron] DRAM Controller [1022:1102]
00:19.3 Host bridge [0600]: Advanced Micro Devices [AMD] K8 [Athlon64/Opteron] Miscellaneous Control [1022:1103]
01:00.0 USB Controller [0c03]: Advanced Micro Devices [AMD] AMD-8111 USB [1022:7464] (rev 0b)
01:00.1 USB Controller [0c03]: Advanced Micro Devices [AMD] AMD-8111 USB [1022:7464] (rev 0b)
01:0a.0 USB Controller [0c03]: NEC Corporation USB [1033:0035] (rev 43)
01:0a.1 USB Controller [0c03]: NEC Corporation USB [1033:0035] (rev 43)
01:0a.2 USB Controller [0c03]: NEC Corporation USB 2.0 [1033:00e0] (rev 04)
01:0b.0 Mass storage controller [0180]: Silicon Image, Inc. SiI 3114 [SATALink/SATARaid] Serial ATA Controller [1095:3114] (rev 02)
01:0c.0 FireWire (IEEE 1394) [0c00]: Texas Instruments TSB43AB22/A IEEE-1394a-2000 Controller (PHY/Link) [104c:8023]
02:09.0 Ethernet controller [0200]: Broadcom Corporation NetXtreme BCM5703X Gigabit Ethernet [14e4:16a7] (rev 02)
04:00.0 Host bridge [0600]: Advanced Micro Devices [AMD] AMD-8151 System Controller [1022:7454] (rev 14)
04:01.0 PCI bridge [0604]: Advanced Micro Devices [AMD] AMD-8151 AGP Bridge [1022:7455] (rev 14)
05:00.0 VGA compatible controller [0300]: ATI Technologies Inc RV570 [Radeon X1950 Pro] [1002:7280] (rev 9a)
05:00.1 Display controller [0380]: ATI Technologies Inc RV570 [Radeon X1950 Pro] (secondary) [1002:72a0] (rev 9a)
Comment 1 Daniel Vetter 2009-09-21 14:34:42 UTC
Created attachment 23130 [details]
dmesg of working kernel 2.6.31-01217-g483e3cd
Comment 2 Daniel Vetter 2009-09-21 14:39:30 UTC
Adding Matt Carlson because he's the author of all relevant patches.
Comment 3 Matt Carlson 2009-09-21 17:45:48 UTC
I scanned over those patches.  So far I don't see the problem.  I tried reproducing the problem with a 5703 B0 here.  I failed there too.  I'll be interested to see what the bisection yields.

Please be advised that there is a bug in one of the later patches in this patchset that may reduce the effectiveness of the bisection.  The good news is that the bug happens in the last patch that might be relevant to the 5703.  Hopefully the bisection will point to an earlier patch, but if not, the fix is small and should apply cleanly to any other bisection point.
Comment 4 Daniel Vetter 2009-09-21 19:26:27 UTC
On Mon, Sep 21, 2009 at 05:45:49PM +0000, bugzilla-daemon@bugzilla.kernel.org wrote:
> --- Comment #3 from Matt Carlson <mcarlson@broadcom.com>  2009-09-21 17:45:48 ---
> I scanned over those patches.  So far I don't see the problem.  I tried
> reproducing the problem with a 5703 B0 here.  I failed there too.  I'll be
> interested to see what the bisection yields.

Thanks for the fast turnaround.

> Please be advised that there is a bug in one of the later patches in this
> patchset that may reduce the effectiveness of the bisection.  The good news
> is that the bug happens in the last patch that might be relevant to the 5703.
> Hopefully the bisection will point to an earlier patch, but if not, the fix
> is small and should apply cleanly to any other bisection point.

Bisecting points at

commit f6eb9b1fc1411d22c073f5264e5630a541d0f7df
Author: Matt Carlson <mcarlson@broadcom.com>
Date:   Tue Sep 1 13:19:53 2009 +0000

    tg3: Add 5717 asic rev

    This patch adds the 5717 asic rev.

    Signed-off-by: Matt Carlson <mcarlson@broadcom.com>
    Reviewed-by: Benjamin Li <benli@broadcom.com>
    Signed-off-by: David S. Miller <davem@davemloft.net>

I've double-checked both first-bad and last-good. I've also tried to
revert this on top of latest -linus, but that failed due to conflicts.

-Daniel
Comment 5 Matt Carlson 2009-09-21 20:50:15 UTC
Created attachment 23132 [details]
Fix return ring size breakage

Drat.  That was the problematic patch.  If you apply the attached patch at each step, does the bisection still point to that commit as the culprit?

FYI, this patch represents the last patch that was integrated and is already included in the current upstream kernel (and the snapshot you are working from).
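
The suggestion above, re-running the bisection with the known fix applied at every step, can be illustrated with a toy model. This is a sketch under stated assumptions: the ten-commit history, the commit labels, and the positions of the two bugs are all hypothetical and do not correspond to real kernel commits.

```python
# Hypothetical linear history, oldest first; the last commit is known to crash.
HISTORY = [f"c{i}" for i in range(10)]
KNOWN_BUG_FROM = 3  # hypothetical: a known, later-fixed breakage appears here
KNOWN_FIX_AT = 8    # hypothetical: the fix for that breakage lands here
REAL_BUG_FROM = 6   # hypothetical: the regression actually being hunted


def crashes(commit: str, apply_fix: bool = False) -> bool:
    """Model of one bisect test: does this revision crash?"""
    i = HISTORY.index(commit)
    known = KNOWN_BUG_FROM <= i < KNOWN_FIX_AT and not apply_fix
    real = i >= REAL_BUG_FROM
    return known or real


def first_bad(commits, is_bad) -> str:
    """Binary search for the first bad commit; the last commit must be bad."""
    lo, hi = 0, len(commits) - 1
    while lo < hi:
        mid = (lo + hi) // 2
        if is_bad(commits[mid]):
            hi = mid       # first bad commit is at mid or earlier
        else:
            lo = mid + 1   # first bad commit is after mid
    return commits[lo]


# Plain bisect stops at the known breakage, which masks the real regression.
masked = first_bad(HISTORY, crashes)
# With the fix applied to every tested revision, the real regression shows up.
unmasked = first_bad(HISTORY, lambda c: crashes(c, apply_fix=True))

if __name__ == "__main__":
    print("without fix:", masked)
    print("with fix applied at each step:", unmasked)
```

In a real tree the same idea amounts to applying the fix patch on top of each revision that git bisect checks out before building and testing it, then discarding that change before marking the revision good or bad.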
Comment 6 Matt Carlson 2009-09-22 20:26:51 UTC
Are you using jumbo frames by any chance?
Comment 7 Daniel Vetter 2009-09-23 07:45:19 UTC
I've tested your patch and it fixed the problem. I've also checked the serial console output, and it looks like I'm hitting another problem. By bad luck, the bisection then pointed to this unrelated tg3 problem.

I'm closing this bug now. Thanks for your help in tracking this down.

-Daniel
Comment 8 Matt Carlson 2009-09-23 17:57:56 UTC
I don't think we are done though.  That patch has been (or should have been) applied to the tip of your tree.  That is where you discovered the bug though, right?  If so, then the real bug should still be lurking between the current bisection point and the tip of the tree (a few patches away).
Comment 9 Daniel Vetter 2009-09-25 12:51:46 UTC
> --- Comment #8 from Matt Carlson <mcarlson@broadcom.com>  2009-09-23 17:57:56 ---
> I don't think we are done though.  That patch has been (or should have been)
> applied to the tip of your tree.  That is where you discovered the bug though,
> right?  If so, then the real bug should still be lurking between the current
> bisection point and the tip of the tree (a few patches away).

As I've said, there is still another problem. But I looked at the traces
via serial console, and it is _definitely_ something else (somewhere in
the block layer). Thanks to your patch, I can make sure that I track down
the right problem when I bisect again. I was fooled only because I checked
the serial console only after bisecting a few revisions, which led me to
think that the original problem was tg3-related. It is not.

I'll post a link here to the new bug report, as soon as I find time to
bisect it and prepare a report.

Thanks for your help, Daniel
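
One way to avoid mixing up two different crashes during a bisection is to compare a short signature of each oops before marking a revision bad. The following is a minimal sketch, assuming the oops format shown in the description; the function name and the choice of signature (final RIP symbol plus the modules named in the call trace) are hypothetical. The sample lines are copied from the trace in this report.

```python
import re

# Final "RIP  [<addr>] symbol+0xoff/0xlen" line of the oops.
RIP_RE = re.compile(r"RIP\s+\[<[0-9a-f]+>\]\s+(\w+)\+0x")
# Call-trace frame, optionally "? "-prefixed, optionally followed by "[module]".
FRAME_RE = re.compile(
    r"\[<[0-9a-f]+>\]\s+(?:\?\s+)?(\w+)\+0x[0-9a-f]+/0x[0-9a-f]+(?:\s+\[(\w+)\])?"
)


def oops_signature(log: str) -> tuple:
    """Return (faulting symbol, modules seen in the trace) for an oops log."""
    rip_symbol = None
    modules = set()
    for line in log.splitlines():
        rip = RIP_RE.search(line)
        if rip:
            rip_symbol = rip.group(1)
        frame = FRAME_RE.search(line)
        if frame and frame.group(2):
            modules.add(frame.group(2))
    return rip_symbol, frozenset(modules)


# Lines copied from the oops in the description.
REFERENCE = """\
[  355.845876]  [<ffffffff81224780>] skb_release_data+0x73/0xd0
[  355.845876]  [<ffffffffa00bee4a>] tg3_poll+0x6ec/0x940 [tg3]
[  355.845876] RIP  [<ffffffff810a87ce>] put_page+0x16/0x127
"""

if __name__ == "__main__":
    print(oops_signature(REFERENCE))
```

A revision would then be marked bad only when its captured console output yields the same signature as the reference crash; anything else is a different failure and deserves a separate look.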
Comment 10 Rafael J. Wysocki 2009-09-29 23:24:46 UTC
Handled-By : Matt Carlson <mcarlson@broadcom.com>
Patch : http://bugzilla.kernel.org/attachment.cgi?id=23132
Comment 11 Rafael J. Wysocki 2009-09-30 20:36:15 UTC
Fixed by commit 5ea1c50662d447de344812054175d7151783ea25.
Comment 12 Daniel Vetter 2009-10-01 17:53:06 UTC
For reference, as promised, the new bug report (with my real problem):

http://bugzilla.kernel.org/show_bug.cgi?id=14290