Latest working kernel version: 2.6.24
Earliest failing kernel version: 2.6.24-git18
Distribution: Debian/testing
Hardware Environment:
Software Environment:
Problem Description:

Feb 11 13:11:52 www kernel: [ 12.015569] tg3: eth0: Link is up at 100 Mbps, full duplex.
Feb 11 13:11:52 www kernel: [ 12.015633] tg3: eth0: Flow control is on for TX and on for RX.
Feb 11 13:33:44 www kernel: [ 1328.538204] tg3: eth0: The system may be re-ordering memory-mapped I/O cycles to the network device, attempting to recover. Please report the problem to the driver maintainer and include system chipset information.
Feb 11 13:33:44 www kernel: [ 1328.667255] tg3: eth0: Link is down.
Feb 11 13:33:46 www kernel: [ 1330.560734] tg3: eth0: Link is up at 100 Mbps, full duplex.
Feb 11 13:33:46 www kernel: [ 1330.560734] tg3: eth0: Flow control is on for TX and on for RX.

After that, the machine rebooted (panic?)

Feb 11 13:35:14 www kernel: klogd 1.5.0#1.1, log source = /proc/kmsg started.

lspci -vvv info:

02:02.0 Ethernet controller: Broadcom Corporation NetXtreme BCM5704 Gigabit Ethernet (rev 10)
	Subsystem: Compaq Computer Corporation NC7782 Gigabit Server Adapter (PCI-X, 10,100,1000-T)
	Control: I/O- Mem+ BusMaster+ SpecCycle- MemWINV- VGASnoop- ParErr+ Stepping- SERR+ FastB2B- DisINTx-
	Status: Cap+ 66MHz+ UDF- FastB2B+ ParErr- DEVSEL=medium >TAbort- <TAbort- <MAbort- >SERR- <PERR- INTx-
	Latency: 64 (16000ns min), Cache Line Size: 64 bytes
	Interrupt: pin A routed to IRQ 19
	Region 0: Memory at fdf70000 (64-bit, non-prefetchable) [size=64K]
	[virtual] Expansion ROM at 88140000 [disabled] [size=64K]
	Capabilities: [40] PCI-X non-bridge device
		Command: DPERE- ERO- RBC=2048 OST=1
		Status: Dev=02:02.0 64bit+ 133MHz+ SCD- USC- DC=simple DMMRBC=2048 DMOST=1 DMCRS=16 RSCEM- 266MHz- 533MHz-
	Capabilities: [48] Power Management version 2
		Flags: PMEClk- DSI- D1- D2- AuxCurrent=0mA PME(D0-,D1-,D2-,D3hot+,D3cold+)
		Status: D0 PME-Enable+ DSel=0 DScale=1 PME-
	Capabilities: [50] Vital Product Data <?>
	Capabilities: [58] Message Signalled Interrupts: Mask- 64bit+ Queue=0/3 Enable-
		Address: fd7ffd6fdf7deeb8 Data: bdfd
	Kernel driver in use: tg3
	Kernel modules: tg3

02:02.1 Ethernet controller: Broadcom Corporation NetXtreme BCM5704 Gigabit Ethernet (rev 10)
	Subsystem: Compaq Computer Corporation NC7782 Gigabit Server Adapter (PCI-X, 10,100,1000-T)
	Control: I/O- Mem+ BusMaster+ SpecCycle- MemWINV- VGASnoop- ParErr+ Stepping- SERR+ FastB2B- DisINTx-
	Status: Cap+ 66MHz+ UDF- FastB2B+ ParErr- DEVSEL=medium >TAbort- <TAbort- <MAbort- >SERR- <PERR- INTx-
	Latency: 64 (16000ns min), Cache Line Size: 64 bytes
	Interrupt: pin B routed to IRQ 20
	Region 0: Memory at fdf60000 (64-bit, non-prefetchable) [size=64K]
	Capabilities: [40] PCI-X non-bridge device
		Command: DPERE- ERO+ RBC=512 OST=1
		Status: Dev=02:02.1 64bit+ 133MHz+ SCD- USC- DC=simple DMMRBC=2048 DMOST=1 DMCRS=16 RSCEM- 266MHz- 533MHz-
	Capabilities: [48] Power Management version 2
		Flags: PMEClk- DSI- D1- D2- AuxCurrent=0mA PME(D0-,D1-,D2-,D3hot+,D3cold+)
		Status: D0 PME-Enable- DSel=0 DScale=1 PME-
	Capabilities: [50] Vital Product Data <?>
	Capabilities: [58] Message Signalled Interrupts: Mask- 64bit+ Queue=0/3 Enable-
		Address: f73feeefffffe7f8 Data: 9bcd
	Kernel driver in use: tg3
	Kernel modules: tg3

Steps to reproduce:
Reply-To: akpm@linux-foundation.org

On Thu, 14 Feb 2008 01:59:12 -0800 (PST) bugme-daemon@bugzilla.kernel.org wrote:

> http://bugzilla.kernel.org/show_bug.cgi?id=9990
>
>            Summary: tg3: eth0: The system may be re-ordering memory-mapped I/O cycles
>            Product: Drivers
>            Version: 2.5
>      KernelVersion: 2.6.24-git18
>           Platform: All
>         OS/Version: Linux
>               Tree: Mainline
>             Status: NEW
>           Severity: normal
>           Priority: P1
>          Component: Network
>         AssignedTo: jgarzik@pobox.com
>         ReportedBy: ralf.hildebrandt@charite.de
Can you provide more info about this PCI bridge the tg3 in that system? It looks like we need to add it to the list of devices that need re-ordering in tg3_get_invariants.
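The "list of devices" idea in this comment amounts to a table lookup on the host bridge's vendor and device IDs. Below is a minimal userspace sketch of that pattern. It is not the driver's code: the struct, the function name, and the ID values are all hypothetical placeholders, and the real driver works with the kernel's PCI structures instead.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* One (vendor, device) pair identifying a host bridge.
 * Hypothetical type for this sketch, not a kernel structure. */
struct bridge_id {
	uint16_t vendor;
	uint16_t device;
};

/* Placeholder IDs for illustration only; these are NOT real chipsets. */
const struct bridge_id example_reorder_bridges[] = {
	{ 0xAAAA, 0x0001 },
	{ 0xAAAA, 0x0002 },
	{ 0xBBBB, 0x0010 },
};

/* Return true if (vendor, device) appears in the table, i.e. the
 * write-reordering workaround would be enabled for that bridge. */
bool bridge_in_table(const struct bridge_id *table, size_t n,
		     uint16_t vendor, uint16_t device)
{
	size_t i;

	assert(table != NULL || n == 0);
	for (i = 0; i < n; i++) {
		if (table[i].vendor == vendor && table[i].device == device)
			return true;
	}
	return false;
}
```

Adding a bridge to such a list is then a one-line table change, which is why the fix is described as simple once the bridge's IDs are known.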
On Thu, Feb 14, 2008 at 10:24:25AM -0800, Andrew Morton wrote:
> On Thu, 14 Feb 2008 01:59:12 -0800 (PST) bugme-daemon@bugzilla.kernel.org wrote:
>
> > http://bugzilla.kernel.org/show_bug.cgi?id=9990
> >
> >            Summary: tg3: eth0: The system may be re-ordering memory-mapped I/O cycles

That should be a simple matter of adding the right pci-ids to tg3_get_invariants -- hopefully Ralf will respond and we can get that knocked out quickly.
> Can you provide more info about this PCI bridge the tg3 in that system?

What info do you need? Which commands should I run?
On Thu, 2008-02-14 at 13:56 -0500, Andy Gospodarek wrote:
> On Thu, Feb 14, 2008 at 10:24:25AM -0800, Andrew Morton wrote:
> > On Thu, 14 Feb 2008 01:59:12 -0800 (PST) bugme-daemon@bugzilla.kernel.org wrote:
> >
> > > http://bugzilla.kernel.org/show_bug.cgi?id=9990
> > >
> > >            Summary: tg3: eth0: The system may be re-ordering memory-mapped I/O cycles
>
> That should be a simple matter of adding the right pci-ids to
> tg3_get_invariants -- hopefully Ralf will respond and we can get that
> knocked out quickly.

It doesn't look like it was re-ordered IO. If it was, it should have self-recovered without hitting the BUG().

One possibility is that the nr_frags in the SKB got corrupted before the TX SKB was freed. The driver relies on the nr_frags in the SKB to find the packet boundaries in the TX ring. If it cannot find the packet boundaries, it will exhibit the same symptom as re-ordered IO, except that it cannot self-recover.

Ralf, please try this debug patch under the same traffic conditions you ran before. This patch stores the nr_frags when transmitting an SKB. During tx completion, it compares the stored nr_frags with the one in the SKB and prints a message in dmesg if they don't match.
diff --git a/drivers/net/tg3.c b/drivers/net/tg3.c
index db606b6..73f1ddd 100644
--- a/drivers/net/tg3.c
+++ b/drivers/net/tg3.c
@@ -3324,12 +3324,20 @@ static void tg3_tx(struct tg3 *tp)
 		struct tx_ring_info *ri = &tp->tx_buffers[sw_idx];
 		struct sk_buff *skb = ri->skb;
 		int i, tx_bug = 0;
+		unsigned short nr_frags = ri->nr_frags;
 
 		if (unlikely(skb == NULL)) {
 			tg3_tx_recover(tp);
 			return;
 		}
 
+		if (nr_frags != skb_shinfo(skb)->nr_frags) {
+			printk(KERN_ALERT "tg3: %s: Tx skb->nr_frags corrupted "
+			       "before skb is freed. Expected nr_frags %d, "
+			       "corrupted nr_frags %d\n", tp->dev->name,
+			       nr_frags, skb_shinfo(skb)->nr_frags);
+		}
+
 		pci_unmap_single(tp->pdev,
 				 pci_unmap_addr(ri, mapping),
 				 skb_headlen(skb),
@@ -3339,7 +3347,7 @@ static void tg3_tx(struct tg3 *tp)
 
 		sw_idx = NEXT_TX(sw_idx);
 
-		for (i = 0; i < skb_shinfo(skb)->nr_frags; i++) {
+		for (i = 0; i < nr_frags; i++) {
 			ri = &tp->tx_buffers[sw_idx];
 			if (unlikely(ri->skb != NULL || sw_idx == hw_idx))
 				tx_bug = 1;
@@ -4105,6 +4113,7 @@ static int tigon3_dma_hwbug_workaround(struct tg3 *tp, struct sk_buff *skb,
 					  len, PCI_DMA_TODEVICE);
 		if (i == 0) {
 			tp->tx_buffers[entry].skb = new_skb;
+			tp->tx_buffers[entry].nr_frags = 0;
 			pci_unmap_addr_set(&tp->tx_buffers[entry], mapping, new_addr);
 		} else {
 			tp->tx_buffers[entry].skb = NULL;
@@ -4211,6 +4220,7 @@ static int tg3_start_xmit(struct sk_buff *skb, struct net_device *dev)
 	mapping = pci_map_single(tp->pdev, skb->data, len, PCI_DMA_TODEVICE);
 
 	tp->tx_buffers[entry].skb = skb;
+	tp->tx_buffers[entry].nr_frags = skb_shinfo(skb)->nr_frags;
 	pci_unmap_addr_set(&tp->tx_buffers[entry], mapping, mapping);
 
 	tg3_set_txd(tp, entry, mapping, len, base_flags,
@@ -4388,6 +4398,7 @@ static int tg3_start_xmit_dma_bug(struct sk_buff *skb, struct net_device *dev)
 	mapping = pci_map_single(tp->pdev, skb->data, len, PCI_DMA_TODEVICE);
 
 	tp->tx_buffers[entry].skb = skb;
+	tp->tx_buffers[entry].nr_frags = skb_shinfo(skb)->nr_frags;
 	pci_unmap_addr_set(&tp->tx_buffers[entry], mapping, mapping);
 
 	would_hit_hwbug = 0;
diff --git a/drivers/net/tg3.h b/drivers/net/tg3.h
index 3938eb3..d4a3aca 100644
--- a/drivers/net/tg3.h
+++ b/drivers/net/tg3.h
@@ -2098,6 +2098,7 @@ struct tx_ring_info {
 	struct sk_buff                  *skb;
 	DECLARE_PCI_UNMAP_ADDR(mapping)
 	u32                             prev_vlan_tag;
+	unsigned short			nr_frags;
 };
 
 struct tg3_config_info {
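To see why a changed nr_frags makes the driver lose track of packet boundaries, and why keeping a private copy avoids that, here is a small userspace model of the ring walk. It is only a sketch under simplifying assumptions: `model_skb`, `model_ring_info`, `model_xmit`, `model_tx_complete` and the 8-entry ring are hypothetical stand-ins, not the driver's types, and the model bails out at the first inconsistency instead of setting a flag and finishing the loop.

```c
#include <assert.h>
#include <stddef.h>

#define RING_SIZE 8				/* must be a power of two */
#define NEXT_TX(i) (((i) + 1) & (RING_SIZE - 1))

/* Stand-in for struct sk_buff: only the fragment count matters here. */
struct model_skb {
	unsigned short nr_frags;
};

/* Stand-in for the patched tx_ring_info: a packet's head slot holds the
 * skb pointer plus a private copy of nr_frags; fragment slots hold NULL. */
struct model_ring_info {
	struct model_skb *skb;
	unsigned short nr_frags;
};

/* Queue one packet: one head slot followed by skb->nr_frags fragment
 * slots. Returns the next free ring index. */
unsigned int model_xmit(struct model_ring_info *ring, unsigned int entry,
			struct model_skb *skb)
{
	unsigned short i;

	assert(skb != NULL);
	ring[entry].skb = skb;
	ring[entry].nr_frags = skb->nr_frags;	/* what the patch adds */
	entry = NEXT_TX(entry);
	for (i = 0; i < skb->nr_frags; i++) {
		ring[entry].skb = NULL;
		ring[entry].nr_frags = 0;
		entry = NEXT_TX(entry);
	}
	return entry;
}

/* Reclaim packets in [sw_idx, hw_idx). With use_saved != 0 the walk uses
 * the stored count (patched behaviour), otherwise the live skb count
 * (unpatched behaviour). Returns the number of packets reclaimed, or -1
 * if a packet boundary could not be found. *mismatches counts head slots
 * whose stored and live counts differed. */
int model_tx_complete(struct model_ring_info *ring, unsigned int sw_idx,
		      unsigned int hw_idx, int use_saved, int *mismatches)
{
	int reclaimed = 0;

	*mismatches = 0;
	while (sw_idx != hw_idx) {
		struct model_ring_info *ri = &ring[sw_idx];
		struct model_skb *skb = ri->skb;
		unsigned short n, i;

		if (skb == NULL)
			return -1;	/* no skb where a head slot was expected */
		if (ri->nr_frags != skb->nr_frags)
			(*mismatches)++;	/* the patch prints a message here */

		n = use_saved ? ri->nr_frags : skb->nr_frags;
		ri->skb = NULL;
		sw_idx = NEXT_TX(sw_idx);
		for (i = 0; i < n; i++) {
			/* A fragment slot must be empty and must not run past
			 * what the hardware has completed. */
			if (ring[sw_idx].skb != NULL || sw_idx == hw_idx)
				return -1;
			sw_idx = NEXT_TX(sw_idx);
		}
		reclaimed++;
	}
	return reclaimed;
}
```

In this model, queue two packets with no fragments and then change the first skb's nr_frags to 2 before completion. The walk that trusts the live count (`use_saved` = 0) treats the second packet's head slot as a fragment and returns -1. The walk that uses the stored count (`use_saved` = 1) reclaims both packets and reports one mismatch.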
On Thu, Feb 14, 2008 at 01:25:27PM -0800, Michael Chan wrote:
> On Thu, 2008-02-14 at 13:56 -0500, Andy Gospodarek wrote:
> > That should be a simple matter of adding the right pci-ids to
> > tg3_get_invariants -- hopefully Ralf will respond and we can get that
> > knocked out quickly.
>
> It doesn't look like it was re-ordered IO. If it was, it should have
> self-recovered without hitting the BUG().

Good catch, Michael! I missed that it panicked, since I expect to see some sort of backtrace when that happens. We should still try to get that bridge added to the list, though, to avoid repeated complaints that there is a tg3 bug.
(In reply to comment #4)
> > Can you provide more info about this PCI bridge the tg3 in that system?
>
> What info do you need? Which commands should I run?

'lspci -vvv' and 'lspci -t' should be fine.
Created attachment 14843: 'lspci -vvv' and 'lspci -t'
On Thu, 2008-02-14 at 17:12 -0500, Andy Gospodarek wrote:
> On Thu, Feb 14, 2008 at 01:25:27PM -0800, Michael Chan wrote:
> > On Thu, 2008-02-14 at 13:56 -0500, Andy Gospodarek wrote:
> > > That should be a simple matter of adding the right pci-ids to
> > > tg3_get_invariants -- hopefully Ralf will respond and we can get that
> > > knocked out quickly.
> >
> > It doesn't look like it was re-ordered IO. If it was, it should have
> > self-recovered without hitting the BUG().
>
> Good catch, Michael! I missed that it panicked since I expect to see
> some sort of backtrace when that happens. We should try and get that
> bridge added to the list though, to avoid repeated complaints that there
> is a tg3 bug.

Andy, I think you still missed my point. I don't believe this problem was caused by the bridge or the chipset at all. Some corruption caused us to not find the SKB in the TX ring where it was expected. So the driver assumed it was the bridge re-ordering I/O, printed that warning message, and took recovery action. The recovery action had no effect in this case, since the problem was apparently caused by something else, and the corruption happened again later. This second time, we hit the BUG_ON(), seeing that the recovery action did not work.
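The sequence described here (warn and recover once, then treat a repeat as fatal) can be modelled as a two-state check. The following is a simplified userspace sketch of that logic as stated in this comment; `model_nic`, `model_tx_recover` and the field names are hypothetical and are not taken from tg3.c.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

enum model_outcome {
	MODEL_RECOVERING,	/* warning printed, chip reset with workaround */
	MODEL_FATAL		/* where the real driver would hit BUG_ON() */
};

/* Hypothetical per-device state for this sketch. */
struct model_nic {
	bool reorder_workaround_on;	/* set once recovery has been attempted */
	unsigned int warnings;		/* "may be re-ordering" messages printed */
};

/* Called when TX completion cannot find a packet where it expects one.
 * The first call assumes I/O re-ordering and enables the workaround.
 * If the same failure shows up with the workaround already on, the
 * assumption was wrong and there is nothing left to try. */
enum model_outcome model_tx_recover(struct model_nic *nic)
{
	assert(nic != NULL);
	if (nic->reorder_workaround_on)
		return MODEL_FATAL;

	nic->warnings++;
	nic->reorder_workaround_on = true;
	return MODEL_RECOVERING;
}
```

This matches the reported log: one warning and a link down/up cycle, then a reboot with no second warning.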
On Thu, Feb 14, 2008 at 02:48:09PM -0800, Michael Chan wrote:
> Andy, I think you still missed my point. I don't believe this problem
> was caused by the bridge or the chipset at all. Some corruption caused
> us to not find the SKB in the TX ring where it was expected. So the
> driver assumed it was the bridge re-ordering I/O and printed that
> warning message and took recovery action. The recovery action had no
> effect in this case since apparently it was caused by something else and
> the corruption happened again later. This 2nd time, we hit the BUG_ON()
> seeing that the recovery action did not work.

Ah, I see. Due to at least a 2 second delay between the message and the panic, I figured it would be good data to gather....
On Thu, 2008-02-14 at 18:21 -0500, Andy Gospodarek wrote:
> On Thu, Feb 14, 2008 at 02:48:09PM -0800, Michael Chan wrote:
> > Andy, I think you still missed my point. I don't believe this problem
> > was caused by the bridge or the chipset at all. Some corruption caused
> > us to not find the SKB in the TX ring where it was expected. So the
> > driver assumed it was the bridge re-ordering I/O and printed that
> > warning message and took recovery action. The recovery action had no
> > effect in this case since apparently it was caused by something else and
> > the corruption happened again later. This 2nd time, we hit the BUG_ON()
> > seeing that the recovery action did not work.
>
> Ah, I see. Due to at least a 2 second delay between the message and the
> panic, I figured it would be good data to gather....

Yeah, 2 seconds for the link to come up after chip reset to recover. It then panicked sometime later and rebooted about 90 seconds after the initial warning message.

It was also running at the slower 100Mbps link speed. Tx packets stay longer in the TX ring at this slower speed, increasing the window of time that the nr_frags in the SKB can be corrupted.

Ralf, please try the debug patch that I sent out earlier. Thanks.
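As a rough illustration of the link-speed point, the time a frame spends being serialized onto the wire scales inversely with link speed. The helper below is a hypothetical back-of-the-envelope sketch: the function name and the 1500-byte frame size are assumptions, and it ignores preamble, inter-frame gap, queueing behind other packets, and completion latency, all of which add to the time a packet really stays in the ring.

```c
#include <assert.h>
#include <stdint.h>

/* Serialization time of one frame in microseconds.
 * 1 Mbps is 1 bit per microsecond, so time_us = bits / Mbps. */
uint32_t wire_time_us(uint32_t frame_bytes, uint32_t link_mbps)
{
	assert(link_mbps > 0);
	return (frame_bytes * 8u) / link_mbps;
}
```

For an assumed 1500-byte frame this gives 120 microseconds at 100 Mbps against 12 microseconds at 1000 Mbps, so every packet queued ahead in the ring costs ten times longer at the slower speed.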
OK, I rebuilt 2.6.25-rc1-git4 with your debug patch. Let's see what happens.
Yay! Got some crashes:

...
Feb 15 14:58:24 www kernel: [ 53.233264] Time: acpi_pm clocksource has been installed.
Feb 15 14:58:38 www kernel: [ 68.093613] warning: `vsftpd' uses 32-bit capabilities (legacy support in use)
Feb 16 08:45:58 www kernel: [64211.034406] tg3: eth0: Tx skb->nr_frags corrupted before skb is freed. Expected nr_frags 0, corrupted nr_frags 2
Feb 16 08:45:58 www kernel: [64211.036946] tg3: eth0: Tx skb->nr_frags corrupted before skb is freed. Expected nr_frags 0, corrupted nr_frags 1

Reboot (by panic I guess)

Feb 16 10:18:42 www kernel: klogd 1.5.0#1.1, log source = /proc/kmsg started.
...
OK, I rebuilt 2.6.25-rc2-git1 with your debug patch. Let's see what happens.
OK. Thanks a lot. So far, we know that there is definitely skb->nr_frags corruption and not IO reordering. I have a suspicion that these SKBs got reused before they were freed. They were corrupted with "legitimate" values instead of random garbage. I don't know why it still panicked. The patch should have prevented hitting the BUG_ON() assuming that was the cause of the original panic after the original "IO reordered" warning message. Can you configure a serial console or something so we can see the actual panic message?
Closing out stale bugs