Bug 8455

Summary: panic with e1000 driver on HP Integrity servers
Product: Drivers Reporter: Doug Chapman (doug.chapman)
Component: Network Assignee: Jeff Garzik (jgarzik)
Status: CLOSED CODE_FIX    
Severity: blocking CC: auke-jan.h.kok, jbrandeb, michal.k.k.piotrowski, shawn.starr
Priority: P2    
Hardware: i386   
OS: Linux   
Kernel Version: 2.6.21-git Subsystem:
Regression: --- Bisected commit-id:
Attachments: stack trace from panic on HP Integrity ia64 server
.config from kernel build
lspci -vv output
ethtool -e eth0 output
dmesg output

Description Doug Chapman 2007-05-08 12:23:48 UTC
Most recent kernel where this bug did *NOT* occur:
worked in 2.6.21
broken when e0aac5a289b1dacbc94bd9ae8c449bcdf9ab508c was merged into git.

Distribution:
Fedora7/RHEL5

Hardware Environment:
HP rx4640, rx6600, and rx2620

Software Environment:

Problem Description:
Panic when bringing up eth0 (which is an e1000 device).  I will attach the full
stack trace as a separate file.

I have narrowed this down to this commit.  I can revert just this commit from
the current head and the problem goes away.

commit e0aac5a289b1dacbc94bd9ae8c449bcdf9ab508c
Author: Auke Kok <auke-jan.h.kok@intel.com>
Date:   Tue Mar 6 08:57:21 2007 -0800

    e1000: FIX: be ready for incoming irq at pci_request_irq

    DEBUG_SHIRQ code exposed that e1000 was not ready for incoming interrupts
    after having called pci_request_irq. This obviously requires us to finish
    our software setup which assigns the irq handler before we request the
    irq.

    Signed-off-by: Auke Kok <auke-jan.h.kok@intel.com>
    Signed-off-by: Jeff Garzik <jeff@garzik.org>
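
The ordering problem that commit message describes can be sketched in user space. This is a simplified model, not the driver's code: with DEBUG_SHIRQ enabled the handler may be invoked as soon as the IRQ is requested, so anything the handler depends on has to be set up first. Every name below (`model_adapter`, `model_request_irq`, `clean_rx`, and so on) is hypothetical and chosen only for illustration.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-in for the driver's private data. */
struct model_adapter {
    int (*clean_rx)(struct model_adapter *a); /* assigned during software setup */
    int handled;                              /* counts serviced interrupts */
};

typedef int (*model_irq_handler)(void *data);

/* Models an IRQ request with DEBUG_SHIRQ-like behaviour:
 * the handler is invoked once, immediately, at registration time. */
static int model_request_irq(model_irq_handler handler, void *data)
{
    return handler(data);
}

static int model_clean_rx(struct model_adapter *a)
{
    a->handled++;
    return 0;
}

/* Interrupt handler: returns -1 where a real handler would have
 * dereferenced a pointer that has not been set up yet. */
static int model_intr(void *data)
{
    struct model_adapter *a = data;

    if (a->clean_rx == NULL)
        return -1;
    return a->clean_rx(a);
}

/* Old ordering: request the IRQ, then finish software setup. */
static int model_open_irq_first(void)
{
    struct model_adapter a = { 0 };
    int ret = model_request_irq(model_intr, &a);

    a.clean_rx = model_clean_rx; /* too late, the handler already ran */
    return ret;
}

/* Ordering the commit aims for: finish setup, then request the IRQ. */
static int model_open_setup_first(void)
{
    struct model_adapter a = { 0 };

    a.clean_rx = model_clean_rx;
    return model_request_irq(model_intr, &a);
}
```

In this model `model_open_irq_first()` reports the failure and `model_open_setup_first()` does not. The model says nothing about NAPI; it only covers the setup ordering.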



Steps to reproduce:
Boot the latest kernel on HP ia64 server with e1000 device.
Comment 1 Doug Chapman 2007-05-08 12:24:44 UTC
Created attachment 11437
stack trace from panic on HP Integrity ia64 server
Comment 2 Doug Chapman 2007-05-08 12:52:42 UTC
Created attachment 11438
.config from kernel build
Comment 3 Doug Chapman 2007-05-08 13:02:52 UTC
If I compile without CONFIG_E1000_NAPI the problem goes away.  This should help
narrow down the defect considerably.

Comment 4 Andrew Morton 2007-05-08 13:08:22 UTC
Auke, this one appears to be a post-2.6.21 regression.
Comment 5 Auke Kok 2007-05-10 07:39:59 UTC
I need some more information: `lspci -vvv`, `dmesg`, `ethtool -e ethX`
Comment 6 Doug Chapman 2007-05-10 07:54:22 UTC
Created attachment 11458
lspci -vv output

Taken from original 2.6.21 kernel before defect was introduced.
Comment 7 Doug Chapman 2007-05-10 07:54:51 UTC
Created attachment 11459
ethtool -e eth0 output
Comment 8 Doug Chapman 2007-05-10 07:55:19 UTC
Created attachment 11460
dmesg output
Comment 9 Doug Chapman 2007-05-11 11:23:26 UTC
FYI,

I tried the same model card in an HP dl380 (x86_64) and I did _not_ see the
panic.  Appears this panic is just on ia64.

Comment 10 Jesse Brandeburg 2007-05-17 19:56:32 UTC
Status: we are still unable to reproduce.  We are also having lots of machine
issues, which is affecting our ability to reproduce.  Work continues.
Comment 11 Doug Chapman 2007-05-18 10:29:18 UTC
Here is a little more info if it helps.  The panic happens at
include/linux/netdevice.h:923 in netif_rx_complete

918     static inline void netif_rx_complete(struct net_device *dev)
919     {
920             unsigned long flags;
921
922             local_irq_save(flags);
923             BUG_ON(!test_bit(__LINK_STATE_RX_SCHED, &dev->state));


This appears to have been called from e1000_main.c:3962 in e1000_clean

   3956         /* If no Tx and not enough Rx work done, exit the polling mode */
   3957         if ((!tx_cleaned && (work_done == 0)) ||
   3958            !netif_running(poll_dev)) {
   3959 quit_polling:
   3960                 if (likely(adapter->itr_setting & 3))
   3961                         e1000_set_itr(adapter);
   3962                 netif_rx_complete(poll_dev);
   3963                 e1000_irq_enable(adapter);
   3964                 return 0;


Please let me know if there is any more info I can provide.  I can reproduce
this quite easily but I don't have the background to really debug it.

Comment 12 Auke Kok 2007-05-21 09:20:45 UTC
Tentative patch below, please test this patch:

---
Herbert Xu wrote:
"netif_poll_enable can only be called if you've previously called
netif_poll_disable.  Otherwise a poll might already be in action
and you may get a crash like this."

Removing the call to netif_poll_enable in e1000_open should fix this issue,
the only other call to netif_poll_enable is in e1000_up() which is only
reached after a device reset or resume.

Signed-off-by: Auke Kok <auke-jan.h.kok@intel.com>
Cc: Herbert Xu <herbert@gondor.apana.org.au>
---

 drivers/net/e1000/e1000_main.c |    4 ----
 1 files changed, 0 insertions(+), 4 deletions(-)

diff --git a/drivers/net/e1000/e1000_main.c b/drivers/net/e1000/e1000_main.c
index 49be393..cbc7feb 100644
--- a/drivers/net/e1000/e1000_main.c
+++ b/drivers/net/e1000/e1000_main.c
@@ -1431,10 +1431,6 @@ e1000_open(struct net_device *netdev)
 	/* From here on the code is the same as e1000_up() */
 	clear_bit(__E1000_DOWN, &adapter->flags);
 
-#ifdef CONFIG_E1000_NAPI
-	netif_poll_enable(netdev);
-#endif
-
 	e1000_irq_enable(adapter);
 
 	/* fire a link status change interrupt to start the watchdog */
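
For readers following the reasoning, the sequence Herbert Xu describes can be modelled in user space. This is a minimal sketch under stated assumptions, not kernel code: a single flag stands in for `__LINK_STATE_RX_SCHED`, the `model_*` functions are hypothetical and only loosely mirror `netif_rx_schedule`, `netif_poll_disable`, `netif_poll_enable` and `netif_rx_complete`, and a `false` return stands in for the `BUG_ON` in `netif_rx_complete`.

```c
#include <assert.h>
#include <stdbool.h>

#define MODEL_RX_SCHED 1u /* stands in for __LINK_STATE_RX_SCHED */

/* Hypothetical stand-in for the state word of struct net_device. */
struct model_dev {
    unsigned int state;
};

/* Interrupt path: claim the poll slot if nobody holds it. */
static bool model_rx_schedule(struct model_dev *dev)
{
    if (dev->state & MODEL_RX_SCHED)
        return false; /* a poll is already pending, or polling is disabled */
    dev->state |= MODEL_RX_SCHED;
    return true;
}

/* Disable polling by taking the same flag. The kernel waits until the
 * flag is free; this model just reports whether it got it. */
static bool model_poll_disable(struct model_dev *dev)
{
    return model_rx_schedule(dev);
}

/* Enable polling: clears the flag unconditionally. */
static void model_poll_enable(struct model_dev *dev)
{
    dev->state &= ~MODEL_RX_SCHED;
}

/* Poll completion: returns false where the kernel would hit BUG_ON. */
static bool model_rx_complete(struct model_dev *dev)
{
    if (!(dev->state & MODEL_RX_SCHED))
        return false;
    dev->state &= ~MODEL_RX_SCHED;
    return true;
}

/* Open path with the unpaired enable that the patch removes. */
static bool model_open_with_unpaired_enable(void)
{
    struct model_dev dev = { 0 };

    model_rx_schedule(&dev);        /* interrupt arrives, poll is queued */
    model_poll_enable(&dev);        /* flag cleared underneath the poll */
    return model_rx_complete(&dev); /* poll finishes and finds it cleared */
}

/* Open path after the patch: nothing clears the flag behind the poll. */
static bool model_open_without_enable(void)
{
    struct model_dev dev = { 0 };

    assert(model_rx_schedule(&dev));
    return model_rx_complete(&dev);
}
```

In this model the unpaired enable makes completion fail, which corresponds to the `BUG_ON` at netdevice.h:923 reported in comment 11. A disable followed by an enable is harmless, because the disable owns the flag until the enable releases it.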
Comment 13 Doug Chapman 2007-05-21 09:38:35 UTC
I have tested the patch and it does fix the panic.

Comment 14 Shawn Starr 2007-05-22 00:21:00 UTC
I believe I hit this also on my e1000 that came with this T42 laptop. Panic on 
link up. Noticed in 2.6.22-rc1. 
Comment 15 Shawn Starr 2007-05-22 00:46:50 UTC
Yes, I can also confirm the patch below fixes this issue. 

Linux segfault 2.6.22-rc2 #1 Tue May 22 03:34:35 EDT 2007 i686 GNU/Linux
Comment 16 Doug Chapman 2007-05-22 13:05:35 UTC
I have tested the latest git tree which includes this patch and it resolves the
issue.

thanks,

- Doug