Bug 5918

Summary: e100 still pulls IRQ after RESET: irq 11: nobody cared
Product: ACPI Reporter: Nils Rennebarth (nils.rennebarth)
Component: Config-InterruptsAssignee: Jesse Brandeburg (jbrandeb)
Status: CLOSED CODE_FIX    
Severity: high CC: acpi-bugzilla, bjorn.helgaas, jbrandeb, stern
Priority: P2    
Hardware: i386   
OS: Linux   
Kernel Version: 2.6.15 Subsystem:
Regression: --- Bisected commit-id:
Attachments: dmesg from a boot that issues the nobody cared message
dmesg from a boot without "nobody cared".
dmesg from 2.6.16-rc4 exhibiting the bug
Output of lspci -vvv
content of /proc/interrupts
test driver
quirk
dmesg from 2.6.16.1 with the pci quirk patch
quirk2
very verbose driver exit
syslog around a reboot with patches 7490 and 7765
proposed upstream quirk
debug function entry / exit

Description Nils Rennebarth 2006-01-18 06:10:20 UTC
Most recent kernel where this bug did not occur: 
Distribution: self compiled vanilla kernel
Hardware Environment:
Supermicro mainboard, 2 onboard e100 network adapters
Software Environment:
Problem Description:
Sometimes during boot, I get a irq 11: nobody cared together with a backtrace.
The irq is then disabled which renders the network card non-functional.

Apparently this depends on the exact order the hardware is detected. I never got
the problem with a 2.6.14.3 kernel but that may be pure luck.


Steps to reproduce:
This is difficult to reproduce. Only entering the BIOS and
setting "Reset Config Data" to yes which is supposed to clear PCI/PnP
Configuration data stored in Flash on next boot makes the problem go away reliably.
Comment 1 Nils Rennebarth 2006-01-18 06:12:10 UTC
Created attachment 7062 [details]
dmesg from a boot that issues the nobody cared message
Comment 2 Nils Rennebarth 2006-01-18 06:14:06 UTC
Created attachment 7063 [details]
dmesg from a boot without "nobody cared".

I made no changes to hard or software, only entering the BIOS setup and
setting "Reset config data" to Yes
Comment 3 Nils Rennebarth 2006-02-15 05:19:25 UTC
Problem can be reliably triggered by traffic on any interface during boot, or
avoided by pulling the network plugs during boot.

2.6.14 exhibits the same problem
Comment 4 Andrew Morton 2006-02-15 11:48:24 UTC
(Re-added bugzilla to cc)

Nils Rennebarth <nils.rennebarth@packetalarm.com> wrote:
>
> Andrew Morton wrote:
> > Gee.  I guess it'd be useful to identify whic two subsystems have to be
> > initialised in which order for the problem to occur.
> Turned out now, that the problem appears to be somewhere else.
> 
> To recall: the problem was:
> "irq 11: nobody cared", and later: "Disabling IRQ" during booting, which stops 
> my network adapters from getting any irqs, rendering them quasi dead.
> 
> - I can reliably trigger the problem now, by creating traffic on any of the 
> interfaces of the machine during boot. If on the other hand I power the machine 
> down, unplug both cables and power on again, I reliably get a working machine.
> - Testing with vanilla 2.4.14 exhibits the same problem
> 
> Does that mean, e100 is the culprit?
> Should I try to go back to see where the problem starts?
> 
> I won't be available for replys/further examination before Mo 20.2. though.

Seems that your USB host controller and both e100's are all sharing the
IRQ11 line.  Somehow, an e100 is asserting that line before it has
registered its interrupt handler.  I can't see how that can happen from the
driver...

Could you please test 2.6.16-rc3, or whatever's the latest released kernel?

Also, this patch might help:


diff -puN drivers/net/e100.c~a drivers/net/e100.c
--- devel/drivers/net/e100.c~a	2006-02-15 11:44:09.000000000 -0800
+++ devel-akpm/drivers/net/e100.c	2006-02-15 11:44:49.000000000 -0800
@@ -2055,6 +2055,10 @@ static int e100_up(struct nic *nic)
 {
 	int err;
 
+	printk("%s\n", __FUNCTION__);
+	if((err = request_irq(nic->pdev->irq, e100_intr, SA_SHIRQ,
+		nic->netdev->name, nic->netdev)))
+		goto err_no_irq;
 	if((err = e100_rx_alloc_list(nic)))
 		return err;
 	if((err = e100_alloc_cbs(nic)))
@@ -2064,9 +2068,6 @@ static int e100_up(struct nic *nic)
 	e100_set_multicast_list(nic->netdev);
 	e100_start_receiver(nic, NULL);
 	mod_timer(&nic->watchdog, jiffies);
-	if((err = request_irq(nic->pdev->irq, e100_intr, SA_SHIRQ,
-		nic->netdev->name, nic->netdev)))
-		goto err_no_irq;
 	netif_wake_queue(nic->netdev);
 	netif_poll_enable(nic->netdev);
 	/* enable ints _after_ enabling poll, preventing a race between
_

Comment 5 Nils Rennebarth 2006-02-20 02:25:03 UTC
Andrew Morton wrote:
> Seems that your USB host controller and both e100's are all sharing the
> IRQ11 line.
They do.


> Somehow, an e100 is asserting that line before it has
> registered its interrupt handler.  I can't see how that can happen from the
> driver...
Browsing the various dmesg of failed and non failed boots suggests, it might 
have to do with e100_probe more than with e100_up. Is that possible?

> 
> Could you please test 2.6.16-rc3, or whatever's the latest released kernel?
Currently compiling. Will report success/failure.

> 
> Also, this patch might help:
That patch doesn't make any difference.

However once I trigger the bug, if found out it does not go away as easyily as I 
described, seems a powerdown is required. Sigh.


-- 

Mit freundlichen Gr
Comment 6 Nils Rennebarth 2006-02-20 03:55:32 UTC
Created attachment 7416 [details]
dmesg from 2.6.16-rc4 exhibiting the bug
Comment 7 Nils Rennebarth 2006-02-20 03:57:31 UTC
Andrew Morton wrote:
> Could you please test 2.6.16-rc3, or whatever's the latest released kernel?
2.6.16-rc4 unfortunately doesn't make a difference either.

To the contrary: once I booted, saw the nobody cared message together with the 
traceback, then the rest of the boot messages and finally at the login prompt 
the machine hang hard: no Shift-PgUp, no SysRq, nothing. No console message 
appeared either. Could not reproduce that hang though.

The hardware ran fine for years, so if it is a hardware flaw it must have 
developed in the last few weeks which I deem rather unlikely.

I attached a new dmesg output from 2.6.16-rc4 exhibiting the bug to the bugzilla 
page.


-- 

Mit freundlichen Gr
Comment 8 Nils Rennebarth 2006-02-22 02:10:57 UTC
Andrew Morton wrote:
> Seems that your USB host controller and both e100's are all sharing the
> IRQ11 line.  Somehow, an e100 is asserting that line before it has
> registered its interrupt handler.  I can't see how that can happen from the
> driver...

A few more datapoints:
- I reproduced the problem on another identical machine, so the chances of just 
misbehaving hardware are really slim now.
- I can only reproduce the problem with udev, and it always occurs at the point 
where it is synthesizing hotplug events
- blacklisting uhci-hcd makes the problem go away.

The latter seems to point to the uhcd-hcd driver. Who could shed more light on this?


-- 

Mit freundlichen Gr
Comment 9 Jesse Brandeburg 2006-02-22 10:37:59 UTC
can you please test with the e100 driver from 2.6.14.3 (or whatever kernel
worked okay for you)

e100 part was known for asserting its interrupt line at inopportune times, but I
thought we had that under control.  From the "dmesg ... nobody cared"
attachment, it appears that e100 hadn't even loaded yet.
Comment 10 Nils Rennebarth 2006-02-27 02:05:31 UTC
short bug description:
During boot I get a "irq 11: nobody cared!" message, followed by a backtrace, 
followed by "Disabling IRQ #11", while the uhci-hcd driver is loaded by insmod. 
This is on a machine with two network adapters and a USB host controller sharing 
  IRQ line 11. With a disabled IRQ, the network card of course are dead afterwards.

bugme-daemon@bugzilla.kernel.org wrote:
> ------- Additional Comments From jesse.brandeburg@intel.com  2006-02-22 10:37 -------
> can you please test with the e100 driver from 2.6.14.3 (or whatever kernel
> worked okay for you)
> 
> e100 part was known for asserting its interrupt line at inopportune times, but I
> thought we had that under control.  From the "dmesg ... nobody cared"
> attachment, it appears that e100 hadn't even loaded yet.
Unfortunately I have not yet found a 2.6 kernel that worked ok. I went as far 
back as 2.6.12.6 but could still reproduce the bug. I even tried to prevent the 
e100 driver to load at all during boot and could still get the "irq 11: nobody 
cared" message, but only when I create network traffic (dmesg attached, note 
that no e100 driver was loaded)

Is the network adapter supposed to create interrupts when receiving packets 
destined for its hardware address, even though its driver is not yet loaded?

I'am not excluding a BIOS bug, there are newer ones available, but at least 2.4 
kernels never trigger the bug and as identical hardware is at many customer 
locations, working as headless servers, a BIOS upgrade is no real option.

-- 

Mit freundlichen Gr
Comment 11 Alan Stern 2006-02-27 07:10:35 UTC
It's likely that uhci-hcd is just an innocent bystander here.  The Ethernet
controller generates an interrupt request on IRQ 11 at a time when it should be
quiescent.  Nothing happens until uhci-hcd loads and enables that IRQ line, at
which time the CPU receives the interrupt from the e100 device.  Of course the
uhci-hcd interrupt routine doesn't handle it, and the e100 driver hasn't been
loaded yet.  Thus you get the "Nobody cared!" message and the IRQ is disabled.

It's possible that under 2.4, the e100 driver loads before uhci-hcd.  Obviously
that would prevent the problem from occurring.  It's also possible that there's
something different about the way 2.4 and 2.6 initialize the e100 hardware, as
part of the PCI startup or even the ACPI startup.
Comment 12 Nils Rennebarth 2006-02-28 06:49:09 UTC
Hi,

Here are a few more datapoints for the bug:

The problem starts with linux 2.6.4, which is probably of little help to you, as 
the 2.6.4 changelog talks about a complete rewrite of the e100 driver.

What happens exactly is, that during reboot the driver leaves the network 
hardware in a non-quiet state, and may raise interrupts.
If then, while booting, another driver (uhci-hcd here) loads which shares the 
IRQ line (11 here) with the EtherExpress PRO/100 cards, its interrupt routine 
will of course not handle the irq and the "nobody cared" message follows.
(See also explanation from Alan Stern <stern@rowland.harvard.edu> in bugzilla)

Any idea/patches?

-- 

Mit freundlichen Gr
Comment 13 Jesse Brandeburg 2006-02-28 14:58:04 UTC
On Tue, 28 Feb 2006, Nils Rennebarth wrote:

> Hi,
>
> Here are a few more datapoints for the bug:
>
> The problem starts with linux 2.6.4, which is probably of little help to you, 
> as the 2.6.4 changelog talks about a complete rewrite of the e100 driver.
>
> What happens exactly is, that during reboot the driver leaves the network 
> hardware in a non-quiet state, and may raise interrupts.
> If then, while booting, another driver (uhci-hcd here) loads which shares the 
> IRQ line (11 here) with the EtherExpress PRO/100 cards, its interrupt routine 
> will of course not handle the irq and the "nobody cared" message follows.
> (See also explanation from Alan Stern <stern@rowland.harvard.edu> in 
> bugzilla)
>
> Any idea/patches?

I spent a lot of time looking through the driver for an place that might 
be allowing the driver to reset and then exit without disabling 
interrupts, I can't find one.  We may need to add lots of printks so we 
can figure out what the flow of events at shutdown might be.  You don't 
happen to be using kexec do you?  How do you reboot, exactly? init 6? 
reboot?

I flooded my interface with packets while booting, with no luck, and I 
have a system where the interrupt is being shared on IRQ11 with 
uhci-hcd:usb5. I can't reproduce this locally, so here is a patch that 
prints the interrupt status register in a few places, please give it a 
shot.

I actually have a supermicro motherboard with dual PIII, and dual e100 
interfaces.  Do you think they are similar? whats your motherboard 
exactly?

please attach cat /proc/interrupts to the bug as well as lspci -vvv

[root@ricochet ~]# cat /proc/interrupts
            CPU0
   0:    1200529          XT-PIC  timer
   1:       6249          XT-PIC  i8042
   2:          0          XT-PIC  cascade
   5:          0          XT-PIC  uhci_hcd:usb4
   8:          1          XT-PIC  rtc
   9:         67          XT-PIC  ehci_hcd:usb1, uhci_hcd:usb2
  10:      10025          XT-PIC  libata, uhci_hcd:usb3
  11:       6490          XT-PIC  uhci_hcd:usb5, eth1
  12:       1029          XT-PIC  i8042
  14:         75          XT-PIC  ide0
NMI:          0
LOC:    1200521
ERR:          0
MIS:          0


--- linux-2.6.16-rc5.orig/drivers/net/e100.c	2006-02-28 13:46:06.000000000 -0800
+++ linux-2.6.16-rc5/drivers/net/e100.c	2006-02-28 13:55:43.000000000 -0800
@@ -2064,6 +2064,10 @@ static int e100_up(struct nic *nic)
  	e100_set_multicast_list(nic->netdev);
  	e100_start_receiver(nic, NULL);
  	mod_timer(&nic->watchdog, jiffies);
+	{
+	u8 stat_ack = readb(&nic->csr->scb.stat_ack);
+	DPRINTK(PROBE, ERR, "stat_ack: %02X\n", stat_ack);
+	}
  	if((err = request_irq(nic->pdev->irq, e100_intr, SA_SHIRQ,
  		nic->netdev->name, nic->netdev)))
  		goto err_no_irq;
@@ -2623,6 +2627,11 @@ static int __devinit e100_probe(struct p
  	spin_lock_init(&nic->cmd_lock);
  	spin_lock_init(&nic->mdio_lock);

+	{
+	u8 stat_ack = readb(&nic->csr->scb.stat_ack);
+	DPRINTK(PROBE, ERR, "stat_ack before reset: %02X\n", stat_ack);
+	}
+
  	/* Reset the device before pci_set_master() in case device is in some
  	 * funky state and has an interrupt pending - hint: we don't have the
  	 * interrupt handler registered yet. */
@@ -2681,6 +2690,10 @@ static int __devinit e100_probe(struct p
  		netdev->dev_addr[0], netdev->dev_addr[1], netdev->dev_addr[2],
  		netdev->dev_addr[3], netdev->dev_addr[4], netdev->dev_addr[5]);

+	{
+	u8 stat_ack = readb(&nic->csr->scb.stat_ack);
+	DPRINTK(PROBE, ERR, "stat_ack end: %02X\n", stat_ack);
+	}
  	return 0;

  err_out_free:
@@ -2775,6 +2788,11 @@ static void e100_shutdown(struct pci_dev
  #endif
  	if (retval)
  		DPRINTK(PROBE,ERR, "Error enabling wake\n");
+
+	{
+	u8 stat_ack = readb(&nic->csr->scb.stat_ack);
+	DPRINTK(PROBE, ERR, "stat_ack: %02X\n", stat_ack);
+	}
  }


Comment 14 Nils Rennebarth 2006-03-01 02:19:33 UTC
Created attachment 7488 [details]
Output of lspci -vvv
Comment 15 Nils Rennebarth 2006-03-01 02:20:31 UTC
Created attachment 7489 [details]
content of /proc/interrupts
Comment 16 Nils Rennebarth 2006-03-01 02:54:32 UTC
Jesse Brandeburg wrote:
> I spent a lot of time looking through the driver for an place that might 
> be allowing the driver to reset and then exit without disabling 
> interrupts, I can't find one.  We may need to add lots of printks so we 
> can figure out what the flow of events at shutdown might be.  You don't 
> happen to be using kexec do you?
No, never did.

>  How do you reboot, exactly? init 6? 
> reboot?
Reboot. This is a debian sarge system, sysvinit version 2.86.ds1 After all stop 
scripts in /etc/rc6.d/ are run, the last one just does reboot -d -f -i
-d prevents a wtmp record to be written, -f prevents all stop scripts from 
beeing called again and (most importantly I suspect) -i will shut down all 
network interfaces. Here is how the latter is done in the source:
    ifr[i].ifr_flags &= ~(IFF_UP);
    if (ioctl(fd, SIOCSIFFLAGS, &ifr[i]) < 0) {
        fprintf(stderr, "ifdown: shutdown ");
        perror(ifr[i].ifr_name);
    }

> 
> I flooded my interface with packets while booting, with no luck, and I 
> have a system where the interrupt is being shared on IRQ11 with 
> uhci-hcd:usb5. I can't reproduce this locally, so here is a patch that 
> prints the interrupt status register in a few places, please give it a 
> shot.
Ok, I'll run it soon. Btw, the patch was whitespace damaged, probably the 
mailer, could you please send it safely (not this one of course, corrected 
manually here)

It is easy, of course to save the messages at boot time, but what about those 
before the reboot, do you need those as well? I need to create a serial console 
for that.

> 
> I actually have a supermicro motherboard with dual PIII, and dual e100 
> interfaces.  Do you think they are similar? whats your motherboard exactly?
It is a supermicro P3TSSE, for a single PIII, with a 815E chipset, providing 
dual onboard ethernet. As apparently no one else sees this bug, it might have to 
do with the fact, that it has two e100 interfaces.

> 
> please attach cat /proc/interrupts to the bug as well as lspci -vvv
Done.


-- 

Mit freundlichen Gr
Comment 17 Nils Rennebarth 2006-03-01 03:26:28 UTC
Hi,

attached is the relevant part of a dmesg with your printks enabled. And yes, I 
have been unable to trigger the bug with only one network cable (the flooded one 
of course) attached, which might explain why you didn't see it.

I also attached an extract of dmesg from a boot without the "nobody cared", as a 
second attachment. That one actually is with the second cable removed.


-- 

Mit freundlichen Gr
Comment 18 Jesse Brandeburg 2006-03-01 08:47:29 UTC
Created attachment 7490 [details]
test driver

as was requested, this is the patch I originally sent
Comment 19 Jesse Brandeburg 2006-03-03 17:33:36 UTC
do you happen to have PXE boot roms on your adapter?  Are they enabled and
possibly turning on the NIC at boot and reboot time?  Does your machine have a BMC?

The debug attachments show an RX interrupt pending before we reset in probe.
(stat ack = 0x50)  which should not be happenning.

at the very least, I am surprised that the interrupt register shows an actual
value, indicating we are truly the cause of the interrupt storm.

Comment 20 Bjorn Helgaas 2006-03-08 12:31:34 UTC
Comment #10 notes that Nils reproduced the problem without 
even loading the e100 driver.   
 
It may be that this has something to do with the way the e100 
driver shut down the device on a previous boot, but it means 
we can't reliably fix the problem in the driver.  We have to 
be able to boot regardless of what a previous kernel did. 
 
It sounds to me like a BIOS bug (that it is handing off a 
device with interrupts enabled), and my suggestion is a PCI 
quirk to disable e100 interrupts and clear out any pending 
interrupts before we start attaching drivers to devices. 
 
The fact that it doesn't occur with 2.4 kernels or prior to 
2.6.4 is interesting, but I suspect it's really because the 
driver load order changed.  I bet you could reproduce it on 
a 2.4 kernel by removing the e100 driver. 
 
 
Comment 21 Nils Rennebarth 2006-04-03 00:52:22 UTC
Sorry for coming back so late on this issue.

bugme-daemon@bugzilla.kernel.org wrote:

> ------- Additional Comments From jesse.brandeburg@intel.com  2006-03-03 17:33 -------
> do you happen to have PXE boot roms on your adapter?
Yes
>  Are they enabled and
> possibly turning on the NIC at boot and reboot time?
Every Option concerning the network adapters is disabled in the BIOS:
- PXE ROM
- Remote Access (whatever that means)
- LAN Wake-Up

>  Does your machine have a BMC?
A Baseboard Management Controller? No.

It may well be a BIOS Bug, but somehow the previous version of the driver (in 
2.6.3) managed to work around that bug.

-- 

Mit freundlichen Gr
Comment 22 Nils Rennebarth 2006-04-03 01:16:03 UTC
bugme-daemon@bugzilla.kernel.org wrote:
> ------- Additional Comments From bjorn.helgaas@hp.com  2006-03-08 12:31 -------
> Comment #10 notes that Nils reproduced the problem without 
> even loading the e100 driver.   
>  
> It may be that this has something to do with the way the e100 
> driver shut down the device on a previous boot, but it means 
> we can't reliably fix the problem in the driver.
It must have, as I was never able to hit the problem when booting directly after 
Power-Up or after a non affected kernel ran the last time.


>  We have to 
> be able to boot regardless of what a previous kernel did. 
>  
> It sounds to me like a BIOS bug (that it is handing off a 
> device with interrupts enabled), and my suggestion is a PCI 
> quirk to disable e100 interrupts and clear out any pending 
> interrupts before we start attaching drivers to devices. 
It probably *is* a BIOS bug, as we seem to be the only ones seeing this 
particular problem. The 2.6.3  or earlier e100 driver appears however to be able 
to work around that bug and it would be nice to have that workaround in a newer 
kernel as well. (This type of mainboard is deployed to quite a number of our 
customers and used as a headless machine, so a BIOS upgrade for those is out of 
question)


>  
> The fact that it doesn't occur with 2.4 kernels or prior to 
> 2.6.4 is interesting, but I suspect it's really because the 
> driver load order changed.  I bet you could reproduce it on 
> a 2.4 kernel by removing the e100 driver. 
It cannot be reproduced by a 2.4 kernel because the "nobody cared" and 
"Disabling IRQ" does not exist in the 2.4 series. Something changed there which 
enables the (probably BIOS) bug to surface at all.

Both USB and network is compiled as modules. I can trigger the bug by loading 
only the uhci-hcd module, but only if I boot after a 2.6.4 or later driver ran 
before booting. How can "load order" be relevant?

-- 

Mit freundlichen Gr
Comment 23 Alan Stern 2006-04-03 07:42:12 UTC
Load order matters because if uhci-hcd is loaded first then it will enable the
IRQ and you'll get flooded with unhandled interrupts.  If e100 is loaded first,
then when it enables the IRQ line its handler will be ready to take care of the
interrupts.

It could be that something as simple as turning off the network interface before
rebooting would fix the problem.  Or perhaps a shutdown method should be added
to e100.
Comment 24 Bjorn Helgaas 2006-04-03 09:50:15 UTC
Created attachment 7753 [details]
quirk

Nils, can you try to reproduce the problem with this patch?  Please
post the extra dmesg output (the "DEBUG" messages) and the results
(whether the IRQ problem still occurs).

This is a gross hack, but I'm just trying to figure out whether this
is an approach worth pursuing.
Comment 25 Nils Rennebarth 2006-04-04 01:20:50 UTC
Created attachment 7755 [details]
dmesg from 2.6.16.1 with the pci quirk patch
Comment 26 Nils Rennebarth 2006-04-04 01:43:26 UTC
> ------- Additional Comments From bjorn.helgaas@hp.com  2006-04-03 09:50
> -------
> Created an attachment (id=7753)
>  --> (http://bugzilla.kernel.org/attachment.cgi?id=7753&action=view)
> quirk
> 
> Nils, can you try to reproduce the problem with this patch?  Please
> post the extra dmesg output (the "DEBUG" messages) and the results
> (whether the IRQ problem still occurs).
Sorry, the IRQ problem persists, dmesg uploaded to bugzilla.

I tried it two times and although the order of messages does not appear to be 
strictly synchronous, the "nobody cared" message came directly after
"PCI: Enabling device 0000:01:04.0" both times. This device is the second 
network card (i.e. eth1)

-- 

Mit freundlichen Gr
Comment 27 Alan Stern 2006-04-04 07:23:26 UTC
It looks like the call to pci_enable_device was enough to cause eth1 to begin
issuing interrupt requests.  Can the driver be changed to disable interrupt
generation first?
Comment 28 Bjorn Helgaas 2006-04-04 11:54:06 UTC
Created attachment 7761 [details]
quirk2

> Can the driver be changed to disable interrupt generation first?

I'm sure it could, but I think we would still have a problem if
the kernel doesn't include the e100 driver.  I expect that the
first driver that requested IRQ 11 would get the flood.

Nils, thanks for testing this.	I learned that:

  (a) The BIOS hands off the e100 device with bus mastering,
      memory space, and I/O space enabled (command register =
      0x0117).	It looks like this is typical, because the
      "PCI: Enabling device" message is unusual.  But it was
      new for me.

  (b) Disabling the e100 device is enough to prevent it from
      asserting the interrupt (the "nobody cared" message used
      to occur as soon as any driver requested IRQ 11; now it
      doesn't occur until we re-enable the offending e100).
      There is no standard PCI config bit for enabling interrupts,
      so this is useful information about the e100.

So I think we could fix this by disabling the device with a quirk,
but then we'd have to also change the e100 driver to clear out
interrupts before enabling the device. I think it might be better
to just mask interrupts directly in the quirk, which shouldn't
require a driver change.

Can you try this updated quirk please?
Comment 29 Jesse Brandeburg 2006-04-04 17:23:27 UTC
Created attachment 7765 [details]
very verbose driver exit

this patch builds on patch 7490, and adds lots of logging (beware!!!) as well
as printing interrupt state at various parts of unload

I'd like to get a last view after the driver is unloaded but that is really
hard.

I'm particularly interested if the adapter is being put into D3 upon unload. 
maybe we should comment out the set_power_state calls.	D3/D0 transition could
cause interrupts to be re-enabled.

Jesse
Comment 30 Bjorn Helgaas 2006-04-04 20:39:56 UTC
Jesse, can you comment on my assertion in comment #20 that we cannot
reliably fix this in the e100 driver?

Here's what I mean: assume you change the e100 driver to leave the
device in a "better" state at shutdown time, and this change appears
in 2.6.17.  Now, assume the user boots 2.6.16, which does not have
your driver update.  He shuts down 2.6.16 and boots 2.6.17, which *does*
contain your fix.  I assert that the 2.6.17 boot may fail, because
2.6.16 left the device in the "bad" state.

That's why I think the fix needs to be a quirk rather than an e100
update.  With a quirk that masks the e100 interrupt before any drivers
claim devices, we can reliably fix the problem, no matter what state
a previous kernel left the device in.

Am I missing something?

I'm not opposed to changing the driver *in addition* to a quirk, but
I don't think a driver change can be the complete solution.
Comment 31 Nils Rennebarth 2006-04-05 04:42:39 UTC
bugme-daemon@bugzilla.kernel.org wrote:
> ------- Additional Comments From bjorn.helgaas@hp.com  2006-04-04 11:54 -------
> Can you try this updated quirk please?
Ahhh. Light dawns...

Yes, this patch fixes the irq problem, making things much easier for us. Thanks 
a lot.

I'll try the e100 driver debug patch next, Jesse, if I understand you right, you 
  need a dmesg from after doing
  # ifconfig eth0 down; ifconfig eth1 down
right?


-- 

Mit freundlichen Gr
Comment 32 Nils Rennebarth 2006-04-05 05:36:32 UTC
Created attachment 7769 [details]
syslog around a reboot with patches 7490 and 7765
Comment 33 Nils Rennebarth 2006-04-05 05:37:00 UTC
bugme-daemon@bugzilla.kernel.org wrote:
> I'd like to get a last view after the driver is unloaded but that is really
> hard.
I'm not sure if this is what you want. Created an/var/log/syslog around a
   ifconfig eth0 down; ifconfig eth1 down; reboot
It *is* large although compression with bzip is efficient

-- 

Mit freundlichen Gr
Comment 34 Bjorn Helgaas 2006-04-05 08:47:57 UTC
Created attachment 7776 [details]
proposed upstream quirk

Here's what I propose we send upstream.  This is basically
just a cleaned-up generic version of what Nils tried recently.

Jesse, any objection?

There are a ton of e100 device IDs, and I hate to duplicate
all of them between e100.c and the quirk, so if you have a
better idea, I'm all ears.
Comment 35 Jesse Brandeburg 2006-04-05 10:33:47 UTC
I don't have any objection to the quirk patch, but I think its glossing over
another problem.  BTW I also don't have any better solution than using lots of
device ID's for the e100 quirk.  The good news is that the incoming rate of new
device ID's is very low.

So here is the problem that I see in Nils' verbose dump (thanks Nils!) search
for **************** for my comments inline

Apr  5 15:11:36 packetalarm kernel: e100: eth0: e100_intr: stat_ack = 0x40
Apr  5 15:11:36 packetalarm kernel: e100: eth1: e100_intr: stat_ack = 0x00
Apr  5 15:11:36 packetalarm kernel: e100: eth0: e100_rx_indicate: status=0xA020
Apr  5 15:11:36 packetalarm kernel: e100: eth0: e100_rx_indicate: status=0x0000
Apr  5 15:11:36 packetalarm kernel: e100: eth0: e100_tx_clean: cb->status = 0xA000
Apr  5 15:11:36 packetalarm kernel: e100: eth0: e100_rx_indicate: status=0x0000
Apr  5 15:11:36 packetalarm kernel: e100: eth0: e100_tx_clean: cb->status = 0x0000
Apr  5 15:11:36 packetalarm kernel: e100: eth0: e100_intr: stat_ack = 0x20
Apr  5 15:11:36 packetalarm kernel: e100: eth1: e100_intr: stat_ack = 0x00
Apr  5 15:11:36 packetalarm kernel: e100: eth0: e100_rx_indicate: status=0x0000
Apr  5 15:11:36 packetalarm kernel: e100: eth0: e100_tx_clean: cb->status = 0x0000
Apr  5 15:11:36 packetalarm kernel: e100: eth0: e100_down: post free_irq
stat_ack: 00
Apr  5 15:11:36 packetalarm kernel: e100: eth0: e100_close: post down stat_ack: 00
Apr  5 15:11:36 packetalarm kernel: e100: eth1: e100_set_multicast_list:
mc_count=0, flags=0x1003
Apr  5 15:11:36 packetalarm kernel: e100: eth1: e100_configure:
[00-07]=16:08:00:01:00:00:26:07
Apr  5 15:11:36 packetalarm kernel: e100: eth1: e100_configure:
[08-15]=01:00:2E:00:60:00:F2:48
Apr  5 15:11:36 packetalarm kernel: e100: eth1: e100_configure:
[16-23]=00:40:FA:86:3F:05:00:00
Apr  5 15:11:36 packetalarm kernel: e100: eth1: e100_down: post free_irq
stat_ack: 00
Apr  5 15:11:36 packetalarm kernel: e100: eth1: e100_close: post down stat_ack: 00
Apr  5 15:11:38 packetalarm init: Switching to runlevel: 6

****************
Okay, things are looking good here, but check out this next bit..
****************

Apr  5 15:11:44 packetalarm kernel: e100: eth1: e100_hw_init: e100_hw_init
Apr  5 15:11:44 packetalarm kernel: e100: eth1: mdio_ctrl: READ:addr=1, reg=0,
data_in=0x0000, data_out=0x18203000
...
Apr  5 15:11:44 packetalarm kernel: e100: eth1: e100_phy_init: phy ID = 0x015402A8
Apr  5 15:11:44 packetalarm kernel: e100: eth1: e100_configure:
[00-07]=16:08:00:01:00:00:26:07
Apr  5 15:11:44 packetalarm kernel: e100: eth1: e100_configure:
[08-15]=01:00:2E:00:60:00:F2:48
Apr  5 15:11:44 packetalarm kernel: e100: eth1: e100_configure:
[16-23]=00:40:FA:86:3F:05:00:00
...
Apr  5 15:11:44 packetalarm kernel: e100: eth1: e100_up: stat_ack: 00

****************
What the heck is going on here? The nic just came back up in runlevel 6!!  Not
only that, just below here, we start transmitting and receiving.
****************

Apr  5 15:11:44 packetalarm kernel: e100: eth1: e100_set_multicast_list:
mc_count=0, flags=0x1003
Apr  5 15:11:44 packetalarm kernel: e100: eth1: e100_configure:
[00-07]=16:08:00:01:00:00:26:07
Apr  5 15:11:44 packetalarm kernel: e100: eth1: e100_configure:
[08-15]=01:00:2E:00:60:00:F2:48
Apr  5 15:11:44 packetalarm kernel: e100: eth1: e100_configure:
[16-23]=00:40:FA:86:3F:05:00:00
Apr  5 15:11:44 packetalarm kernel: e100: eth1: e100_set_multicast_list:
mc_count=1, flags=0x1003
Apr  5 15:11:44 packetalarm kernel: e100: eth1: e100_configure:
[00-07]=16:08:00:01:00:00:26:07
Apr  5 15:11:44 packetalarm kernel: e100: eth1: e100_configure:
[08-15]=01:00:2E:00:60:00:F2:48
Apr  5 15:11:44 packetalarm kernel: e100: eth1: e100_configure:
[16-23]=00:40:FA:86:3F:05:00:00
Apr  5 15:11:44 packetalarm kernel: e100: eth1: e100_intr: stat_ack = 0x20
Apr  5 15:11:44 packetalarm kernel: e100: eth1: e100_rx_indicate: status=0x0000
Apr  5 15:11:44 packetalarm kernel: e100: eth1: e100_tx_clean: cb->status = 0xA000
Apr  5 15:11:44 packetalarm kernel: e100: eth1: e100_set_multicast_list:
mc_count=1, flags=0x1003
Apr  5 15:11:44 packetalarm kernel: e100: eth1: e100_configure:
[00-07]=16:08:00:01:00:00:26:07
Apr  5 15:11:44 packetalarm kernel: e100: eth1: e100_configure:
[08-15]=01:00:2E:00:60:00:F2:48
Apr  5 15:11:44 packetalarm kernel: e100: eth1: e100_configure:
[16-23]=00:40:FA:86:3F:05:00:00
Apr  5 15:11:44 packetalarm kernel: e100: eth1: e100_intr: stat_ack = 0x20
Apr  5 15:11:44 packetalarm kernel: e100: eth1: e100_rx_indicate: status=0x0000
Apr  5 15:11:44 packetalarm kernel: e100: eth1: e100_tx_clean: cb->status = 0xA000
Apr  5 15:11:44 packetalarm kernel: e100: eth1: e100_watchdog: right now = 49651
Apr  5 15:11:44 packetalarm kernel: e100: eth1: mdio_ctrl: READ:addr=1, reg=4,
data_in=0x0000, data_out=0x182405E1
Apr  5 15:11:44 packetalarm kernel: e100: eth1: mdio_ctrl: READ:addr=1, reg=0,
data_in=0x0000, data_out=0x18203000
Apr  5 15:11:44 packetalarm kernel: e100: eth1: mdio_ctrl: READ:addr=1, reg=5,
data_in=0x0000, data_out=0x18250021
Apr  5 15:11:44 packetalarm kernel: e100: eth1: mdio_ctrl: READ:addr=1, reg=1,
data_in=0x0000, data_out=0x1821782D
Apr  5 15:11:44 packetalarm kernel: e100: eth1: mdio_ctrl: READ:addr=1, reg=1,
data_in=0x0000, data_out=0x1821782D
Apr  5 15:11:44 packetalarm kernel: e100: eth1: e100_watchdog: link up, 10Mbps,
half-duplex
Apr  5 15:11:44 packetalarm kernel: e100: eth1: mdio_ctrl: READ:addr=1, reg=1,
data_in=0x0000, data_out=0x1821782D
Apr  5 15:11:44 packetalarm kernel: e100: eth1: mdio_ctrl: READ:addr=1, reg=1,
data_in=0x0000, data_out=0x1821782D
Apr  5 15:11:44 packetalarm kernel: e100: eth1: e100_intr: stat_ack = 0x04
Apr  5 15:11:44 packetalarm kernel: e100: eth1: e100_rx_indicate: status=0x0000
Apr  5 15:11:44 packetalarm kernel: e100: eth1: e100_tx_clean: cb->status = 0x0000
Apr  5 15:11:44 packetalarm kernel: e100: eth1: e100_set_multicast_list:
mc_count=1, flags=0x1003
Apr  5 15:11:44 packetalarm kernel: e100: eth1: e100_configure:
[00-07]=16:08:00:01:00:00:26:07
Apr  5 15:11:44 packetalarm kernel: e100: eth1: e100_configure:
[08-15]=01:00:2E:00:60:00:F2:48
Apr  5 15:11:44 packetalarm kernel: e100: eth1: e100_configure:
[16-23]=00:40:FA:86:3F:05:00:00
Apr  5 15:11:44 packetalarm kernel: e100: eth1: e100_intr: stat_ack = 0x20
Apr  5 15:11:44 packetalarm kernel: e100: eth1: e100_rx_indicate: status=0x0000
Apr  5 15:11:44 packetalarm kernel: e100: eth1: e100_tx_clean: cb->status = 0xA000
Apr  5 15:11:45 packetalarm pa-configwrap[3644]: Unable to set sensor status:
Invalid response from manager
Apr  5 15:11:45 packetalarm ntpd[3399]: ntpd exiting on signal 15
Apr  5 15:11:45 packetalarm kernel: e100: eth1: e100_intr: stat_ack = 0x40
Apr  5 15:11:45 packetalarm kernel: e100: eth1: e100_rx_indicate: status=0xA022
Apr  5 15:11:45 packetalarm kernel: e100: eth1: e100_rx_indicate: status=0x0000
Apr  5 15:11:45 packetalarm kernel: e100: eth1: e100_tx_clean: cb->status = 0x0000
Apr  5 15:11:45 packetalarm kernel: e100: eth1: e100_rx_indicate: status=0x0000
Apr  5 15:11:45 packetalarm kernel: e100: eth1: e100_tx_clean: cb->status = 0x0000
Apr  5 15:11:46 packetalarm kernel: Kernel logging (proc) stopped.
Apr  5 15:11:46 packetalarm kernel: Kernel log daemon terminating.
Apr  5 15:11:46 packetalarm exiting on signal 15
Apr  5 15:12:52 packetalarm syslogd 1.4.1#17: restart

****************
uh, crap, we just rebooted, and there were no calls to 'down', I suppose that
the kernel logger could have been killed before the last messages came in.  But
according to what we see here, something starts communicating using the NIC
after entry into runlevel 6.  what might that be? netconsole? stray dhcpcd? It
really seems like a bug that anything is allowed to reactivate the NIC during
shutdown.  It may not be a kernel bug, however.

A serial console would tell us a little more.  However I'd not suggest running a
serial console with the verbose debugging enabled.  I'll try to whip up a driver
today that prints a string for each entry point entered for down/up open/close
(or Nils you can do if you want)
Comment 36 Alan Stern 2006-04-05 11:11:37 UTC
Suggestions: It might help to print out the name of the current process for each
up/down open/close.  And under the assumption that the process will be running
ifconfig, it might help to replace ifconfig with a wrapper that prints out some
"ps -Af" information before running the real ifconfig.
Comment 37 Bjorn Helgaas 2006-04-05 11:18:17 UTC
> I don't have any objection to the quirk patch, but I think its glossing over 
> another problem. 
 
What other problem?  If a device comes up in a bad state after a reboot, 
I hardly think you can blame the last kernel that was running.  The BIOS 
driver should have left the device with interrupts disabled. 
 
If you disagree, please outline how you can resolve my question from 
comment #30.  That scenario seems like a problem no matter what you 
do to the e100 driver. 
Comment 38 Jesse Brandeburg 2006-04-05 13:45:17 UTC

On Wed, 5 Apr 2006, bugme-daemon@bugzilla.kernel.org wrote:
> ------- Additional Comments From bjorn.helgaas@hp.com  2006-04-05 11:18 -------
> > I don't have any objection to the quirk patch, but I think its glossing over 
> > another problem. 
>  
> What other problem?  If a device comes up in a bad state after a reboot, 
> I hardly think you can blame the last kernel that was running.  The BIOS 
> driver should have left the device with interrupts disabled. 

The other problem I'm talking about is the driver / OS leaving the device 
in an initialized state.  I agree with your assertion that the current 
kernel should be responsible for getting things initialized right, 
regardless of the previous state.  The complete solution includes both 
root causing the reason the hardware is left in the bad state (and maybe 
fixing it) as well as making the quirk.

I'm not sure what you mean by BIOS driver.  It appears that this system 
doesn't reset the PCI bus devices at reboot (by pulsing Power Good line) 
which causes the device state to be partially or fully initialized upon 
pci_enable_device, when the previous OS didn't disable interrupts at 
reboot.

> If you disagree, please outline how you can resolve my question from 
> comment #30.  That scenario seems like a problem no matter what you 
> do to the e100 driver. 
> 

The quirk is good, as it works around the somewhat unfortunate design 
decision on the PRO/100 to re-enable interrupts on reset, as well as the 
odd machine that doesn't initialize PCI devices at warm reboot.

I think we're getting very close to a better understanding of what order 
of things cause this to happen.  It seems to me that the order of things 
happening is not how it was designed to work, but that may be another bug.

Comment 39 Jesse Brandeburg 2006-04-05 13:52:42 UTC
Bjorn, I have to say thanks for your work on this too.
Comment 40 Bjorn Helgaas 2006-04-05 14:02:25 UTC
> I'm not sure what you mean by BIOS driver.  
  
Sorry, I just meant whatever firmware (if any) touched the device before  
the OS loads.  
  
Anyway, it sounds like we're in agreement as far as the quirk, so 
I'll post that today. 
 
I don't understand the rest of the stuff you're working on in the 
driver, so I won't bug you about that :-) 
Comment 41 Jesse Brandeburg 2006-04-05 14:44:14 UTC
Created attachment 7780 [details]
debug function entry / exit

you can replace patch 7765 with this
please load driver with 
# dmesg -n 8
# insmod/modprobe e100 debug=6
# ifconfig eth0 ... up
# reboot

I'm not sure if you're able, but setting up a serial console to log output
through reboot would be very useful.

I saw this when running with 7490 and this patch and rebooting.
e100: 0000:01:01.0: e100_probe: stat_ack before reset: 00
e100: eth0: e100_probe: stat_ack end: 00
e100: eth0: e100_open: enter
e100: eth0: e100_up: stat_ack: 00
e100: eth0: e100_watchdog: right now = 4295219869
e100: eth0: e100_watchdog: right now = 4295228386
e100: eth0: e100_close: enter
md: stopping all md devices.
md: md0 switched to read-only mode.
Synchronizing SCSI cache for disk sdb:
Synchronizing SCSI cache for disk sda:
e100: eth0: e100_shutdown: enter
e100: eth0: e100_shutdown: stat_ack: 00
Restarting system.
.
machine restart
Comment 42 Nils Rennebarth 2006-04-10 08:58:39 UTC
After setting up a serial console and capturing messages during reboot, I found 
out that the e100 driver already does as best as possible. If all interfaces are 
taken down during reboot, the irq problem does not show up.

However the sysvinit I have (debian, 2.86.ds1-1) only detects and downs 
interfaces that have an ip address. Now on our systems that turns out not to be 
the case: One interface is used solely for network sniffing and for security 
reasons it never gets an ip address, but it is up and will be set to promiscous 
mode by the sniffing program.

So the upstream pci quirk is still necessary, because we cannot guarantee that 
all interfaces are taken down before boot.

The e100 driver however turned out to IMHO do the best it can. It is not its 
fault if interfaces stay up during reboot.

So I think this bug can be finally marked as resolved (And yes, I tested again, 
that the final proposed patch really fixes my original problem. It is also quite 
nice that it warns when the problem would have appeared without the patch and is 
completely silent otherwise.)

A big thanks to all the people involved.
Comment 43 Len Brown 2006-07-25 22:18:41 UTC
I expect this is technically unrelated to ACPI (ie. it likely still happens  
with "acpi=off"), but I don't see the interrupt quirk patch from 
comment #34 upstream, so I'll add it to the acpi test tree. 
 
Comment 44 Len Brown 2006-08-27 19:34:15 UTC
shipped post linux-2.6.18-rc4. 
closed.