Most recent kernel where this bug did *NOT* occur: worked in 2.6.21; broken when e0aac5a289b1dacbc94bd9ae8c449bcdf9ab508c was merged into git.
Distribution: Fedora7/RHEL5
Hardware Environment: HP rx4640, rx6600, and rx2620
Software Environment:
Problem Description:

Panic when bringing up eth0 (which is an e1000 device). I will attach the full stack trace as a separate file.

I have narrowed this down to this commit. I can revert just this commit from the current head and the problem goes away.

commit e0aac5a289b1dacbc94bd9ae8c449bcdf9ab508c
Author: Auke Kok <auke-jan.h.kok@intel.com>
Date:   Tue Mar 6 08:57:21 2007 -0800

    e1000: FIX: be ready for incoming irq at pci_request_irq

    DEBUG_SHIRQ code exposed that e1000 was not ready for incoming interrupts
    after having called pci_request_irq. This obviously requires us to finish
    our software setup which assigns the irq handler before we request the
    irq.

    Signed-off-by: Auke Kok <auke-jan.h.kok@intel.com>
    Signed-off-by: Jeff Garzik <jeff@garzik.org>

Steps to reproduce:

Boot the latest kernel on HP ia64 server with e1000 device.
Created attachment 11437 [details] stack trace from panic on HP Integrity ia64 server
Created attachment 11438 [details] .config from kernel build
If I compile without CONFIG_E1000_NAPI the problem goes away. This should help narrow down the defect considerably.
Auke, this one appears to be a post-2.6.21 regression.
I need some more information: `lspci -vvv`, `dmesg`, `ethtool -e ethX`
Created attachment 11458 [details] lspci -vv output Taken from original 2.6.21 kernel before defect was introduced.
Created attachment 11459 [details] ethtool -e eth0 output
Created attachment 11460 [details] dmesg output
FYI, I tried the same model card in an HP dl380 (x86_64) and I did _not_ see the panic. Appears this panic is just on ia64.
Status: we are still unable to reproduce. We are also having lots of machine issues, which is affecting our ability to reproduce. Work continues.
Here is a little more info if it helps. The panic happens at include/linux/netdevice.h:923 in netif_rx_complete:

    918 static inline void netif_rx_complete(struct net_device *dev)
    919 {
    920         unsigned long flags;
    921
    922         local_irq_save(flags);
    923         BUG_ON(!test_bit(__LINK_STATE_RX_SCHED, &dev->state));

This appears to have been called from e1000_main.c:3962 in e1000_clean:

    3956         /* If no Tx and not enough Rx work done, exit the polling mode */
    3957         if ((!tx_cleaned && (work_done == 0)) ||
    3958            !netif_running(poll_dev)) {
    3959 quit_polling:
    3960                 if (likely(adapter->itr_setting & 3))
    3961                         e1000_set_itr(adapter);
    3962                 netif_rx_complete(poll_dev);
    3963                 e1000_irq_enable(adapter);
    3964                 return 0;

Please let me know if there is any more info I can provide. I can reproduce this quite easily but I don't have the background to really debug it.
Tentative patch below, please test this patch:

---

Herbert Xu wrote:

"netif_poll_enable can only be called if you've previously called
netif_poll_disable. Otherwise a poll might already be in action
and you may get a crash like this."

Removing the call to netif_poll_enable in e1000_open should fix this
issue, the only other call to netif_poll_enable is in e1000_up() which
is only reached after a device reset or resume.

Signed-off-by: Auke Kok <auke-jan.h.kok@intel.com>
Cc: Herbert Xu <herbert@gondor.apana.org.au>
---
 drivers/net/e1000/e1000_main.c |    4 ----
 1 files changed, 0 insertions(+), 4 deletions(-)

diff --git a/drivers/net/e1000/e1000_main.c b/drivers/net/e1000/e1000_main.c
index 49be393..cbc7feb 100644
--- a/drivers/net/e1000/e1000_main.c
+++ b/drivers/net/e1000/e1000_main.c
@@ -1431,10 +1431,6 @@ e1000_open(struct net_device *netdev)
 	/* From here on the code is the same as e1000_up() */
 	clear_bit(__E1000_DOWN, &adapter->flags);
 
-#ifdef CONFIG_E1000_NAPI
-	netif_poll_enable(netdev);
-#endif
-
 	e1000_irq_enable(adapter);
 
 	/* fire a link status change interrupt to start the watchdog */
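For anyone following along, here is a rough user-space model of the race Herbert Xu describes above. It is only a sketch under assumptions: the `model_*` names and `struct model_dev` are hypothetical, a single `bool` stands in for the `__LINK_STATE_RX_SCHED` bit, and it assumes `netif_poll_enable` simply clears that bit while `netif_poll_disable` takes it. The interleaving shown (interrupt schedules a poll, then the open path calls enable) is one plausible ordering, not something confirmed from the trace.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical stand-in for the __LINK_STATE_RX_SCHED bit in dev->state. */
struct model_dev {
	bool rx_sched;	/* true while a poll is pending or polling is disabled */
};

/* Models the IRQ handler scheduling a poll; returns true if it took the bit. */
static bool model_rx_schedule(struct model_dev *dev)
{
	if (dev->rx_sched)
		return false;
	dev->rx_sched = true;
	return true;
}

/* Models netif_poll_disable(): takes the bit (the real one sleeps until free). */
static void model_poll_disable(struct model_dev *dev)
{
	assert(!dev->rx_sched);	/* this model does not simulate waiting */
	dev->rx_sched = true;
}

/* Models netif_poll_enable(): assumed to clear the bit unconditionally. */
static void model_poll_enable(struct model_dev *dev)
{
	dev->rx_sched = false;
}

/* Models netif_rx_complete(): returns false where the kernel would BUG_ON. */
static bool model_rx_complete(struct model_dev *dev)
{
	if (!dev->rx_sched)
		return false;
	dev->rx_sched = false;
	return true;
}

/* Open path with the unpaired enable: the pending poll loses its bit. */
bool model_open_with_stray_enable(void)
{
	struct model_dev dev = { .rx_sched = false };

	model_rx_schedule(&dev);	/* interrupt arrives, poll is scheduled */
	model_poll_enable(&dev);	/* open path clears the bit underneath it */
	return model_rx_complete(&dev);	/* poll finishes and finds the bit gone */
}

/* Open path with the enable removed, as in the patch. */
bool model_open_without_enable(void)
{
	struct model_dev dev = { .rx_sched = false };

	model_rx_schedule(&dev);
	return model_rx_complete(&dev);
}

/* Enable paired with a preceding disable, as on the reset/resume path. */
bool model_paired_disable_enable(void)
{
	struct model_dev dev = { .rx_sched = false };

	model_poll_disable(&dev);
	model_poll_enable(&dev);
	model_rx_schedule(&dev);
	return model_rx_complete(&dev);
}
```

In this model only the first scenario ends with the completion check failing, which would correspond to the BUG_ON in netif_rx_complete. The other two complete normally, which matches the reasoning for dropping the call from e1000_open while keeping the one in e1000_up().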
I have tested the patch and it does fix the panic.
I believe I hit this also on my e1000 that came with this T42 laptop. Panic on link up. Noticed in 2.6.22-rc1.
Yes, I can also confirm the patch below fixes this issue.

Linux segfault 2.6.22-rc2 #1 Tue May 22 03:34:35 EDT 2007 i686 GNU/Linux
I have tested the latest git tree which includes this patch and it resolves the issue. thanks, - Doug