Most recent kernel where this bug did not occur: Distribution: self compiled vanilla kernel Hardware Environment: Supermicro mainboard, 2 onboard e100 network adapters Software Environment: Problem Description: Sometimes during boot, I get a irq 11: nobody cared together with a backtrace. The irq is then disabled which renders the network card non-functional. Apparently this depends on the exact order the hardware is detected. I never got the problem with a 2.6.14.3 kernel but that may be pure luck. Steps to reproduce: This is difficult to reproduce. Only entering the BIOS and setting "Reset Config Data" to yes which is supposed to clear PCI/PnP Configuration data stored in Flash on next boot makes the problem go away reliably.
Created attachment 7062 [details] dmesg from a boot that issues the nobody cared message
Created attachment 7063 [details] dmesg from a boot without "nobody cared". I made no changes to hard or software, only entering the BIOS setup and setting "Reset config data" to Yes
Problem can be reliably triggered by traffic on any interface during boot, or avoided by pulling the network plugs during boot. 2.6.14 exhibits the same problem
(Re-added bugzilla to cc) Nils Rennebarth <nils.rennebarth@packetalarm.com> wrote: > > Andrew Morton wrote: > > Gee. I guess it'd be useful to identify whic two subsystems have to be > > initialised in which order for the problem to occur. > Turned out now, that the problem appears to be somewhere else. > > To recall: the problem was: > "irq 11: nobody cared", and later: "Disabling IRQ" during booting, which stops > my network adapters from getting any irqs, rendering them quasi dead. > > - I can reliably trigger the problem now, by creating traffic on any of the > interfaces of the machine during boot. If on the other hand I power the machine > down, unplug both cables and power on again, I reliably get a working machine. > - Testing with vanilla 2.4.14 exhibits the same problem > > Does that mean, e100 is the culprit? > Should I try to go back to see where the problem starts? > > I won't be available for replys/further examination before Mo 20.2. though. Seems that your USB host controller and both e100's are all sharing the IRQ11 line. Somehow, an e100 is asserting that line before it has registered its interrupt handler. I can't see how that can happen from the driver... Could you please test 2.6.16-rc3, or whatever's the latest released kernel? Also, this patch might help: diff -puN drivers/net/e100.c~a drivers/net/e100.c --- devel/drivers/net/e100.c~a 2006-02-15 11:44:09.000000000 -0800 +++ devel-akpm/drivers/net/e100.c 2006-02-15 11:44:49.000000000 -0800 @@ -2055,6 +2055,10 @@ static int e100_up(struct nic *nic) { int err; + printk("%s\n", __FUNCTION__); + if((err = request_irq(nic->pdev->irq, e100_intr, SA_SHIRQ, + nic->netdev->name, nic->netdev))) + goto err_no_irq; if((err = e100_rx_alloc_list(nic))) return err; if((err = e100_alloc_cbs(nic))) @@ -2064,9 +2068,6 @@ static int e100_up(struct nic *nic) e100_set_multicast_list(nic->netdev); e100_start_receiver(nic, NULL); mod_timer(&nic->watchdog, jiffies); - if((err = request_irq(nic->pdev->irq, e100_intr, SA_SHIRQ, - nic->netdev->name, nic->netdev))) - goto err_no_irq; netif_wake_queue(nic->netdev); netif_poll_enable(nic->netdev); /* enable ints _after_ enabling poll, preventing a race between _
Andrew Morton wrote: > Seems that your USB host controller and both e100's are all sharing the > IRQ11 line. They do. > Somehow, an e100 is asserting that line before it has > registered its interrupt handler. I can't see how that can happen from the > driver... Browsing the various dmesg of failed and non failed boots suggests, it might have to do with e100_probe more than with e100_up. Is that possible? > > Could you please test 2.6.16-rc3, or whatever's the latest released kernel? Currently compiling. Will report success/failure. > > Also, this patch might help: That patch doesn't make any difference. However once I trigger the bug, if found out it does not go away as easyily as I described, seems a powerdown is required. Sigh. -- Mit freundlichen Gr
Created attachment 7416 [details] dmesg from 2.6.16-rc4 exhibiting the bug
Andrew Morton wrote: > Could you please test 2.6.16-rc3, or whatever's the latest released kernel? 2.6.16-rc4 unfortunately doesn't make a difference either. To the contrary: once I booted, saw the nobody cared message together with the traceback, then the rest of the boot messages and finally at the login prompt the machine hang hard: no Shift-PgUp, no SysRq, nothing. No console message appeared either. Could not reproduce that hang though. The hardware ran fine for years, so if it is a hardware flaw it must have developed in the last few weeks which I deem rather unlikely. I attached a new dmesg output from 2.6.16-rc4 exhibiting the bug to the bugzilla page. -- Mit freundlichen Gr
Andrew Morton wrote: > Seems that your USB host controller and both e100's are all sharing the > IRQ11 line. Somehow, an e100 is asserting that line before it has > registered its interrupt handler. I can't see how that can happen from the > driver... A few more datapoints: - I reproduced the problem on another identical machine, so the chances of just misbehaving hardware are really slim now. - I can only reproduce the problem with udev, and it always occurs at the point where it is synthesizing hotplug events - blacklisting uhci-hcd makes the problem go away. The latter seems to point to the uhcd-hcd driver. Who could shed more light on this? -- Mit freundlichen Gr
can you please test with the e100 driver from 2.6.14.3 (or whatever kernel worked okay for you) e100 part was known for asserting its interrupt line at inopportune times, but I thought we had that under control. From the "dmesg ... nobody cared" attachment, it appears that e100 hadn't even loaded yet.
short bug description: During boot I get a "irq 11: nobody cared!" message, followed by a backtrace, followed by "Disabling IRQ #11", while the uhci-hcd driver is loaded by insmod. This is on a machine with two network adapters and a USB host controller sharing IRQ line 11. With a disabled IRQ, the network card of course are dead afterwards. bugme-daemon@bugzilla.kernel.org wrote: > ------- Additional Comments From jesse.brandeburg@intel.com 2006-02-22 10:37 ------- > can you please test with the e100 driver from 2.6.14.3 (or whatever kernel > worked okay for you) > > e100 part was known for asserting its interrupt line at inopportune times, but I > thought we had that under control. From the "dmesg ... nobody cared" > attachment, it appears that e100 hadn't even loaded yet. Unfortunately I have not yet found a 2.6 kernel that worked ok. I went as far back as 2.6.12.6 but could still reproduce the bug. I even tried to prevent the e100 driver to load at all during boot and could still get the "irq 11: nobody cared" message, but only when I create network traffic (dmesg attached, note that no e100 driver was loaded) Is the network adapter supposed to create interrupts when receiving packets destined for its hardware address, even though its driver is not yet loaded? I'am not excluding a BIOS bug, there are newer ones available, but at least 2.4 kernels never trigger the bug and as identical hardware is at many customer locations, working as headless servers, a BIOS upgrade is no real option. -- Mit freundlichen Gr
It's likely that uhci-hcd is just an innocent bystander here. The Ethernet controller generates an interrupt request on IRQ 11 at a time when it should be quiescent. Nothing happens until uhci-hcd loads and enables that IRQ line, at which time the CPU receives the interrupt from the e100 device. Of course the uhci-hcd interrupt routine doesn't handle it, and the e100 driver hasn't been loaded yet. Thus you get the "Nobody cared!" message and the IRQ is disabled. It's possible that under 2.4, the e100 driver loads before uhci-hcd. Obviously that would prevent the problem from occurring. It's also possible that there's something different about the way 2.4 and 2.6 initialize the e100 hardware, as part of the PCI startup or even the ACPI startup.
Hi, Here are a few more datapoints for the bug: The problem starts with linux 2.6.4, which is probably of little help to you, as the 2.6.4 changelog talks about a complete rewrite of the e100 driver. What happens exactly is, that during reboot the driver leaves the network hardware in a non-quiet state, and may raise interrupts. If then, while booting, another driver (uhci-hcd here) loads which shares the IRQ line (11 here) with the EtherExpress PRO/100 cards, its interrupt routine will of course not handle the irq and the "nobody cared" message follows. (See also explanation from Alan Stern <stern@rowland.harvard.edu> in bugzilla) Any idea/patches? -- Mit freundlichen Gr
On Tue, 28 Feb 2006, Nils Rennebarth wrote: > Hi, > > Here are a few more datapoints for the bug: > > The problem starts with linux 2.6.4, which is probably of little help to you, > as the 2.6.4 changelog talks about a complete rewrite of the e100 driver. > > What happens exactly is, that during reboot the driver leaves the network > hardware in a non-quiet state, and may raise interrupts. > If then, while booting, another driver (uhci-hcd here) loads which shares the > IRQ line (11 here) with the EtherExpress PRO/100 cards, its interrupt routine > will of course not handle the irq and the "nobody cared" message follows. > (See also explanation from Alan Stern <stern@rowland.harvard.edu> in > bugzilla) > > Any idea/patches? I spent a lot of time looking through the driver for an place that might be allowing the driver to reset and then exit without disabling interrupts, I can't find one. We may need to add lots of printks so we can figure out what the flow of events at shutdown might be. You don't happen to be using kexec do you? How do you reboot, exactly? init 6? reboot? I flooded my interface with packets while booting, with no luck, and I have a system where the interrupt is being shared on IRQ11 with uhci-hcd:usb5. I can't reproduce this locally, so here is a patch that prints the interrupt status register in a few places, please give it a shot. I actually have a supermicro motherboard with dual PIII, and dual e100 interfaces. Do you think they are similar? whats your motherboard exactly? please attach cat /proc/interrupts to the bug as well as lspci -vvv [root@ricochet ~]# cat /proc/interrupts CPU0 0: 1200529 XT-PIC timer 1: 6249 XT-PIC i8042 2: 0 XT-PIC cascade 5: 0 XT-PIC uhci_hcd:usb4 8: 1 XT-PIC rtc 9: 67 XT-PIC ehci_hcd:usb1, uhci_hcd:usb2 10: 10025 XT-PIC libata, uhci_hcd:usb3 11: 6490 XT-PIC uhci_hcd:usb5, eth1 12: 1029 XT-PIC i8042 14: 75 XT-PIC ide0 NMI: 0 LOC: 1200521 ERR: 0 MIS: 0 --- linux-2.6.16-rc5.orig/drivers/net/e100.c 2006-02-28 13:46:06.000000000 -0800 +++ linux-2.6.16-rc5/drivers/net/e100.c 2006-02-28 13:55:43.000000000 -0800 @@ -2064,6 +2064,10 @@ static int e100_up(struct nic *nic) e100_set_multicast_list(nic->netdev); e100_start_receiver(nic, NULL); mod_timer(&nic->watchdog, jiffies); + { + u8 stat_ack = readb(&nic->csr->scb.stat_ack); + DPRINTK(PROBE, ERR, "stat_ack: %02X\n", stat_ack); + } if((err = request_irq(nic->pdev->irq, e100_intr, SA_SHIRQ, nic->netdev->name, nic->netdev))) goto err_no_irq; @@ -2623,6 +2627,11 @@ static int __devinit e100_probe(struct p spin_lock_init(&nic->cmd_lock); spin_lock_init(&nic->mdio_lock); + { + u8 stat_ack = readb(&nic->csr->scb.stat_ack); + DPRINTK(PROBE, ERR, "stat_ack before reset: %02X\n", stat_ack); + } + /* Reset the device before pci_set_master() in case device is in some * funky state and has an interrupt pending - hint: we don't have the * interrupt handler registered yet. */ @@ -2681,6 +2690,10 @@ static int __devinit e100_probe(struct p netdev->dev_addr[0], netdev->dev_addr[1], netdev->dev_addr[2], netdev->dev_addr[3], netdev->dev_addr[4], netdev->dev_addr[5]); + { + u8 stat_ack = readb(&nic->csr->scb.stat_ack); + DPRINTK(PROBE, ERR, "stat_ack end: %02X\n", stat_ack); + } return 0; err_out_free: @@ -2775,6 +2788,11 @@ static void e100_shutdown(struct pci_dev #endif if (retval) DPRINTK(PROBE,ERR, "Error enabling wake\n"); + + { + u8 stat_ack = readb(&nic->csr->scb.stat_ack); + DPRINTK(PROBE, ERR, "stat_ack: %02X\n", stat_ack); + } }
Created attachment 7488 [details] Output of lspci -vvv
Created attachment 7489 [details] content of /proc/interrupts
Jesse Brandeburg wrote: > I spent a lot of time looking through the driver for an place that might > be allowing the driver to reset and then exit without disabling > interrupts, I can't find one. We may need to add lots of printks so we > can figure out what the flow of events at shutdown might be. You don't > happen to be using kexec do you? No, never did. > How do you reboot, exactly? init 6? > reboot? Reboot. This is a debian sarge system, sysvinit version 2.86.ds1 After all stop scripts in /etc/rc6.d/ are run, the last one just does reboot -d -f -i -d prevents a wtmp record to be written, -f prevents all stop scripts from beeing called again and (most importantly I suspect) -i will shut down all network interfaces. Here is how the latter is done in the source: ifr[i].ifr_flags &= ~(IFF_UP); if (ioctl(fd, SIOCSIFFLAGS, &ifr[i]) < 0) { fprintf(stderr, "ifdown: shutdown "); perror(ifr[i].ifr_name); } > > I flooded my interface with packets while booting, with no luck, and I > have a system where the interrupt is being shared on IRQ11 with > uhci-hcd:usb5. I can't reproduce this locally, so here is a patch that > prints the interrupt status register in a few places, please give it a > shot. Ok, I'll run it soon. Btw, the patch was whitespace damaged, probably the mailer, could you please send it safely (not this one of course, corrected manually here) It is easy, of course to save the messages at boot time, but what about those before the reboot, do you need those as well? I need to create a serial console for that. > > I actually have a supermicro motherboard with dual PIII, and dual e100 > interfaces. Do you think they are similar? whats your motherboard exactly? It is a supermicro P3TSSE, for a single PIII, with a 815E chipset, providing dual onboard ethernet. As apparently no one else sees this bug, it might have to do with the fact, that it has two e100 interfaces. > > please attach cat /proc/interrupts to the bug as well as lspci -vvv Done. -- Mit freundlichen Gr
Hi, attached is the relevant part of a dmesg with your printks enabled. And yes, I have been unable to trigger the bug with only one network cable (the flooded one of course) attached, which might explain why you didn't see it. I also attached an extract of dmesg from a boot without the "nobody cared", as a second attachment. That one actually is with the second cable removed. -- Mit freundlichen Gr
Created attachment 7490 [details] test driver as was requested, this is the patch I originally sent
do you happen to have PXE boot roms on your adapter? Are they enabled and possibly turning on the NIC at boot and reboot time? Does your machine have a BMC? The debug attachments show an RX interrupt pending before we reset in probe. (stat ack = 0x50) which should not be happenning. at the very least, I am surprised that the interrupt register shows an actual value, indicating we are truly the cause of the interrupt storm.
Comment #10 notes that Nils reproduced the problem without even loading the e100 driver. It may be that this has something to do with the way the e100 driver shut down the device on a previous boot, but it means we can't reliably fix the problem in the driver. We have to be able to boot regardless of what a previous kernel did. It sounds to me like a BIOS bug (that it is handing off a device with interrupts enabled), and my suggestion is a PCI quirk to disable e100 interrupts and clear out any pending interrupts before we start attaching drivers to devices. The fact that it doesn't occur with 2.4 kernels or prior to 2.6.4 is interesting, but I suspect it's really because the driver load order changed. I bet you could reproduce it on a 2.4 kernel by removing the e100 driver.
Sorry for coming back so late on this issue. bugme-daemon@bugzilla.kernel.org wrote: > ------- Additional Comments From jesse.brandeburg@intel.com 2006-03-03 17:33 ------- > do you happen to have PXE boot roms on your adapter? Yes > Are they enabled and > possibly turning on the NIC at boot and reboot time? Every Option concerning the network adapters is disabled in the BIOS: - PXE ROM - Remote Access (whatever that means) - LAN Wake-Up > Does your machine have a BMC? A Baseboard Management Controller? No. It may well be a BIOS Bug, but somehow the previous version of the driver (in 2.6.3) managed to work around that bug. -- Mit freundlichen Gr
bugme-daemon@bugzilla.kernel.org wrote: > ------- Additional Comments From bjorn.helgaas@hp.com 2006-03-08 12:31 ------- > Comment #10 notes that Nils reproduced the problem without > even loading the e100 driver. > > It may be that this has something to do with the way the e100 > driver shut down the device on a previous boot, but it means > we can't reliably fix the problem in the driver. It must have, as I was never able to hit the problem when booting directly after Power-Up or after a non affected kernel ran the last time. > We have to > be able to boot regardless of what a previous kernel did. > > It sounds to me like a BIOS bug (that it is handing off a > device with interrupts enabled), and my suggestion is a PCI > quirk to disable e100 interrupts and clear out any pending > interrupts before we start attaching drivers to devices. It probably *is* a BIOS bug, as we seem to be the only ones seeing this particular problem. The 2.6.3 or earlier e100 driver appears however to be able to work around that bug and it would be nice to have that workaround in a newer kernel as well. (This type of mainboard is deployed to quite a number of our customers and used as a headless machine, so a BIOS upgrade for those is out of question) > > The fact that it doesn't occur with 2.4 kernels or prior to > 2.6.4 is interesting, but I suspect it's really because the > driver load order changed. I bet you could reproduce it on > a 2.4 kernel by removing the e100 driver. It cannot be reproduced by a 2.4 kernel because the "nobody cared" and "Disabling IRQ" does not exist in the 2.4 series. Something changed there which enables the (probably BIOS) bug to surface at all. Both USB and network is compiled as modules. I can trigger the bug by loading only the uhci-hcd module, but only if I boot after a 2.6.4 or later driver ran before booting. How can "load order" be relevant? -- Mit freundlichen Gr
Load order matters because if uhci-hcd is loaded first then it will enable the IRQ and you'll get flooded with unhandled interrupts. If e100 is loaded first, then when it enables the IRQ line its handler will be ready to take care of the interrupts. It could be that something as simple as turning off the network interface before rebooting would fix the problem. Or perhaps a shutdown method should be added to e100.
Created attachment 7753 [details] quirk Nils, can you try to reproduce the problem with this patch? Please post the extra dmesg output (the "DEBUG" messages) and the results (whether the IRQ problem still occurs). This is a gross hack, but I'm just trying to figure out whether this is an approach worth pursuing.
Created attachment 7755 [details] dmesg from 2.6.16.1 with the pci quirk patch
> ------- Additional Comments From bjorn.helgaas@hp.com 2006-04-03 09:50 > ------- > Created an attachment (id=7753) > --> (http://bugzilla.kernel.org/attachment.cgi?id=7753&action=view) > quirk > > Nils, can you try to reproduce the problem with this patch? Please > post the extra dmesg output (the "DEBUG" messages) and the results > (whether the IRQ problem still occurs). Sorry, the IRQ problem persists, dmesg uploaded to bugzilla. I tried it two times and although the order of messages does not appear to be strictly synchronous, the "nobody cared" message came directly after "PCI: Enabling device 0000:01:04.0" both times. This device is the second network card (i.e. eth1) -- Mit freundlichen Gr
It looks like the call to pci_enable_device was enough to cause eth1 to begin issuing interrupt requests. Can the driver be changed to disable interrupt generation first?
Created attachment 7761 [details] quirk2 > Can the driver be changed to disable interrupt generation first? I'm sure it could, but I think we would still have a problem if the kernel doesn't include the e100 driver. I expect that the first driver that requested IRQ 11 would get the flood. Nils, thanks for testing this. I learned that: (a) The BIOS hands off the e100 device with bus mastering, memory space, and I/O space enabled (command register = 0x0117). It looks like this is typical, because the "PCI: Enabling device" message is unusual. But it was new for me. (b) Disabling the e100 device is enough to prevent it from asserting the interrupt (the "nobody cared" message used to occur as soon as any driver requested IRQ 11; now it doesn't occur until we re-enable the offending e100). There is no standard PCI config bit for enabling interrupts, so this is useful information about the e100. So I think we could fix this by disabling the device with a quirk, but then we'd have to also change the e100 driver to clear out interrupts before enabling the device. I think it might be better to just mask interrupts directly in the quirk, which shouldn't require a driver change. Can you try this updated quirk please?
Created attachment 7765 [details] very verbose driver exit this patch builds on patch 7490, and adds lots of logging (beware!!!) as well as printing interrupt state at various parts of unload I'd like to get a last view after the driver is unloaded but that is really hard. I'm particularly interested if the adapter is being put into D3 upon unload. maybe we should comment out the set_power_state calls. D3/D0 transition could cause interrupts to be re-enabled. Jesse
Jesse, can you comment on my assertion in comment #20 that we cannot reliably fix this in the e100 driver? Here's what I mean: assume you change the e100 driver to leave the device in a "better" state at shutdown time, and this change appears in 2.6.17. Now, assume the user boots 2.6.16, which does not have your driver update. He shuts down 2.6.16 and boots 2.6.17, which *does* contain your fix. I assert that the 2.6.17 boot may fail, because 2.6.16 left the device in the "bad" state. That's why I think the fix needs to be a quirk rather than an e100 update. With a quirk that masks the e100 interrupt before any drivers claim devices, we can reliably fix the problem, no matter what state a previous kernel left the device in. Am I missing something? I'm not opposed to changing the driver *in addition* to a quirk, but I don't think a driver change can be the complete solution.
bugme-daemon@bugzilla.kernel.org wrote: > ------- Additional Comments From bjorn.helgaas@hp.com 2006-04-04 11:54 ------- > Can you try this updated quirk please? Ahhh. Light dawns... Yes, this patch fixes the irq problem, making things much easier for us. Thanks a lot. I'll try the e100 driver debug patch next, Jesse, if I understand you right, you need a dmesg from after doing # ifconfig eth0 down; ifconfig eth1 down right? -- Mit freundlichen Gr
Created attachment 7769 [details] syslog around a reboot with patches 7490 and 7765
bugme-daemon@bugzilla.kernel.org wrote: > I'd like to get a last view after the driver is unloaded but that is really > hard. I'm not sure if this is what you want. Created an/var/log/syslog around a ifconfig eth0 down; ifconfig eth1 down; reboot It *is* large although compression with bzip is efficient -- Mit freundlichen Gr
Created attachment 7776 [details] proposed upstream quirk Here's what I propose we send upstream. This is basically just a cleaned-up generic version of what Nils tried recently. Jesse, any objection? There are a ton of e100 device IDs, and I hate to duplicate all of them between e100.c and the quirk, so if you have a better idea, I'm all ears.
I don't have any objection to the quirk patch, but I think its glossing over another problem. BTW I also don't have any better solution than using lots of device ID's for the e100 quirk. The good news is that the incoming rate of new device ID's is very low. So here is the problem that I see in Nils' verbose dump (thanks Nils!) search for **************** for my comments inline Apr 5 15:11:36 packetalarm kernel: e100: eth0: e100_intr: stat_ack = 0x40 Apr 5 15:11:36 packetalarm kernel: e100: eth1: e100_intr: stat_ack = 0x00 Apr 5 15:11:36 packetalarm kernel: e100: eth0: e100_rx_indicate: status=0xA020 Apr 5 15:11:36 packetalarm kernel: e100: eth0: e100_rx_indicate: status=0x0000 Apr 5 15:11:36 packetalarm kernel: e100: eth0: e100_tx_clean: cb->status = 0xA000 Apr 5 15:11:36 packetalarm kernel: e100: eth0: e100_rx_indicate: status=0x0000 Apr 5 15:11:36 packetalarm kernel: e100: eth0: e100_tx_clean: cb->status = 0x0000 Apr 5 15:11:36 packetalarm kernel: e100: eth0: e100_intr: stat_ack = 0x20 Apr 5 15:11:36 packetalarm kernel: e100: eth1: e100_intr: stat_ack = 0x00 Apr 5 15:11:36 packetalarm kernel: e100: eth0: e100_rx_indicate: status=0x0000 Apr 5 15:11:36 packetalarm kernel: e100: eth0: e100_tx_clean: cb->status = 0x0000 Apr 5 15:11:36 packetalarm kernel: e100: eth0: e100_down: post free_irq stat_ack: 00 Apr 5 15:11:36 packetalarm kernel: e100: eth0: e100_close: post down stat_ack: 00 Apr 5 15:11:36 packetalarm kernel: e100: eth1: e100_set_multicast_list: mc_count=0, flags=0x1003 Apr 5 15:11:36 packetalarm kernel: e100: eth1: e100_configure: [00-07]=16:08:00:01:00:00:26:07 Apr 5 15:11:36 packetalarm kernel: e100: eth1: e100_configure: [08-15]=01:00:2E:00:60:00:F2:48 Apr 5 15:11:36 packetalarm kernel: e100: eth1: e100_configure: [16-23]=00:40:FA:86:3F:05:00:00 Apr 5 15:11:36 packetalarm kernel: e100: eth1: e100_down: post free_irq stat_ack: 00 Apr 5 15:11:36 packetalarm kernel: e100: eth1: e100_close: post down stat_ack: 00 Apr 5 15:11:38 packetalarm init: Switching to runlevel: 6 **************** Okay, things are looking good here, but check out this next bit.. **************** Apr 5 15:11:44 packetalarm kernel: e100: eth1: e100_hw_init: e100_hw_init Apr 5 15:11:44 packetalarm kernel: e100: eth1: mdio_ctrl: READ:addr=1, reg=0, data_in=0x0000, data_out=0x18203000 ... Apr 5 15:11:44 packetalarm kernel: e100: eth1: e100_phy_init: phy ID = 0x015402A8 Apr 5 15:11:44 packetalarm kernel: e100: eth1: e100_configure: [00-07]=16:08:00:01:00:00:26:07 Apr 5 15:11:44 packetalarm kernel: e100: eth1: e100_configure: [08-15]=01:00:2E:00:60:00:F2:48 Apr 5 15:11:44 packetalarm kernel: e100: eth1: e100_configure: [16-23]=00:40:FA:86:3F:05:00:00 ... Apr 5 15:11:44 packetalarm kernel: e100: eth1: e100_up: stat_ack: 00 **************** What the heck is going on here? The nic just came back up in runlevel 6!! Not only that, just below here, we start transmitting and receiving. **************** Apr 5 15:11:44 packetalarm kernel: e100: eth1: e100_set_multicast_list: mc_count=0, flags=0x1003 Apr 5 15:11:44 packetalarm kernel: e100: eth1: e100_configure: [00-07]=16:08:00:01:00:00:26:07 Apr 5 15:11:44 packetalarm kernel: e100: eth1: e100_configure: [08-15]=01:00:2E:00:60:00:F2:48 Apr 5 15:11:44 packetalarm kernel: e100: eth1: e100_configure: [16-23]=00:40:FA:86:3F:05:00:00 Apr 5 15:11:44 packetalarm kernel: e100: eth1: e100_set_multicast_list: mc_count=1, flags=0x1003 Apr 5 15:11:44 packetalarm kernel: e100: eth1: e100_configure: [00-07]=16:08:00:01:00:00:26:07 Apr 5 15:11:44 packetalarm kernel: e100: eth1: e100_configure: [08-15]=01:00:2E:00:60:00:F2:48 Apr 5 15:11:44 packetalarm kernel: e100: eth1: e100_configure: [16-23]=00:40:FA:86:3F:05:00:00 Apr 5 15:11:44 packetalarm kernel: e100: eth1: e100_intr: stat_ack = 0x20 Apr 5 15:11:44 packetalarm kernel: e100: eth1: e100_rx_indicate: status=0x0000 Apr 5 15:11:44 packetalarm kernel: e100: eth1: e100_tx_clean: cb->status = 0xA000 Apr 5 15:11:44 packetalarm kernel: e100: eth1: e100_set_multicast_list: mc_count=1, flags=0x1003 Apr 5 15:11:44 packetalarm kernel: e100: eth1: e100_configure: [00-07]=16:08:00:01:00:00:26:07 Apr 5 15:11:44 packetalarm kernel: e100: eth1: e100_configure: [08-15]=01:00:2E:00:60:00:F2:48 Apr 5 15:11:44 packetalarm kernel: e100: eth1: e100_configure: [16-23]=00:40:FA:86:3F:05:00:00 Apr 5 15:11:44 packetalarm kernel: e100: eth1: e100_intr: stat_ack = 0x20 Apr 5 15:11:44 packetalarm kernel: e100: eth1: e100_rx_indicate: status=0x0000 Apr 5 15:11:44 packetalarm kernel: e100: eth1: e100_tx_clean: cb->status = 0xA000 Apr 5 15:11:44 packetalarm kernel: e100: eth1: e100_watchdog: right now = 49651 Apr 5 15:11:44 packetalarm kernel: e100: eth1: mdio_ctrl: READ:addr=1, reg=4, data_in=0x0000, data_out=0x182405E1 Apr 5 15:11:44 packetalarm kernel: e100: eth1: mdio_ctrl: READ:addr=1, reg=0, data_in=0x0000, data_out=0x18203000 Apr 5 15:11:44 packetalarm kernel: e100: eth1: mdio_ctrl: READ:addr=1, reg=5, data_in=0x0000, data_out=0x18250021 Apr 5 15:11:44 packetalarm kernel: e100: eth1: mdio_ctrl: READ:addr=1, reg=1, data_in=0x0000, data_out=0x1821782D Apr 5 15:11:44 packetalarm kernel: e100: eth1: mdio_ctrl: READ:addr=1, reg=1, data_in=0x0000, data_out=0x1821782D Apr 5 15:11:44 packetalarm kernel: e100: eth1: e100_watchdog: link up, 10Mbps, half-duplex Apr 5 15:11:44 packetalarm kernel: e100: eth1: mdio_ctrl: READ:addr=1, reg=1, data_in=0x0000, data_out=0x1821782D Apr 5 15:11:44 packetalarm kernel: e100: eth1: mdio_ctrl: READ:addr=1, reg=1, data_in=0x0000, data_out=0x1821782D Apr 5 15:11:44 packetalarm kernel: e100: eth1: e100_intr: stat_ack = 0x04 Apr 5 15:11:44 packetalarm kernel: e100: eth1: e100_rx_indicate: status=0x0000 Apr 5 15:11:44 packetalarm kernel: e100: eth1: e100_tx_clean: cb->status = 0x0000 Apr 5 15:11:44 packetalarm kernel: e100: eth1: e100_set_multicast_list: mc_count=1, flags=0x1003 Apr 5 15:11:44 packetalarm kernel: e100: eth1: e100_configure: [00-07]=16:08:00:01:00:00:26:07 Apr 5 15:11:44 packetalarm kernel: e100: eth1: e100_configure: [08-15]=01:00:2E:00:60:00:F2:48 Apr 5 15:11:44 packetalarm kernel: e100: eth1: e100_configure: [16-23]=00:40:FA:86:3F:05:00:00 Apr 5 15:11:44 packetalarm kernel: e100: eth1: e100_intr: stat_ack = 0x20 Apr 5 15:11:44 packetalarm kernel: e100: eth1: e100_rx_indicate: status=0x0000 Apr 5 15:11:44 packetalarm kernel: e100: eth1: e100_tx_clean: cb->status = 0xA000 Apr 5 15:11:45 packetalarm pa-configwrap[3644]: Unable to set sensor status: Invalid response from manager Apr 5 15:11:45 packetalarm ntpd[3399]: ntpd exiting on signal 15 Apr 5 15:11:45 packetalarm kernel: e100: eth1: e100_intr: stat_ack = 0x40 Apr 5 15:11:45 packetalarm kernel: e100: eth1: e100_rx_indicate: status=0xA022 Apr 5 15:11:45 packetalarm kernel: e100: eth1: e100_rx_indicate: status=0x0000 Apr 5 15:11:45 packetalarm kernel: e100: eth1: e100_tx_clean: cb->status = 0x0000 Apr 5 15:11:45 packetalarm kernel: e100: eth1: e100_rx_indicate: status=0x0000 Apr 5 15:11:45 packetalarm kernel: e100: eth1: e100_tx_clean: cb->status = 0x0000 Apr 5 15:11:46 packetalarm kernel: Kernel logging (proc) stopped. Apr 5 15:11:46 packetalarm kernel: Kernel log daemon terminating. Apr 5 15:11:46 packetalarm exiting on signal 15 Apr 5 15:12:52 packetalarm syslogd 1.4.1#17: restart **************** uh, crap, we just rebooted, and there were no calls to 'down', I suppose that the kernel logger could have been killed before the last messages came in. But according to what we see here, something starts communicating using the NIC after entry into runlevel 6. what might that be? netconsole? stray dhcpcd? It really seems like a bug that anything is allowed to reactivate the NIC during shutdown. It may not be a kernel bug, however. A serial console would tell us a little more. However I'd not suggest running a serial console with the verbose debugging enabled. I'll try to whip up a driver today that prints a string for each entry point entered for down/up open/close (or Nils you can do if you want)
Suggestions: It might help to print out the name of the current process for each up/down open/close. And under the assumption that the process will be running ifconfig, it might help to replace ifconfig with a wrapper that prints out some "ps -Af" information before running the real ifconfig.
> I don't have any objection to the quirk patch, but I think its glossing over > another problem. What other problem? If a device comes up in a bad state after a reboot, I hardly think you can blame the last kernel that was running. The BIOS driver should have left the device with interrupts disabled. If you disagree, please outline how you can resolve my question from comment #30. That scenario seems like a problem no matter what you do to the e100 driver.
On Wed, 5 Apr 2006, bugme-daemon@bugzilla.kernel.org wrote: > ------- Additional Comments From bjorn.helgaas@hp.com 2006-04-05 11:18 ------- > > I don't have any objection to the quirk patch, but I think its glossing over > > another problem. > > What other problem? If a device comes up in a bad state after a reboot, > I hardly think you can blame the last kernel that was running. The BIOS > driver should have left the device with interrupts disabled. The other problem I'm talking about is the driver / OS leaving the device in an initialized state. I agree with your assertion that the current kernel should be responsible for getting things initialized right, regardless of the previous state. The complete solution includes both root causing the reason the hardware is left in the bad state (and maybe fixing it) as well as making the quirk. I'm not sure what you mean by BIOS driver. It appears that this system doesn't reset the PCI bus devices at reboot (by pulsing Power Good line) which causes the device state to be partially or fully initialized upon pci_enable_device, when the previous OS didn't disable interrupts at reboot. > If you disagree, please outline how you can resolve my question from > comment #30. That scenario seems like a problem no matter what you > do to the e100 driver. > The quirk is good, as it works around the somewhat unfortunate design decision on the PRO/100 to re-enable interrupts on reset, as well as the odd machine that doesn't initialize PCI devices at warm reboot. I think we're getting very close to a better understanding of what order of things cause this to happen. It seems to me that the order of things happening is not how it was designed to work, but that may be another bug.
Bjorn, I have to say thanks for your work on this too.
> I'm not sure what you mean by BIOS driver. Sorry, I just meant whatever firmware (if any) touched the device before the OS loads. Anyway, it sounds like we're in agreement as far as the quirk, so I'll post that today. I don't understand the rest of the stuff you're working on in the driver, so I won't bug you about that :-)
Created attachment 7780 [details] debug function entry / exit you can replace patch 7765 with this please load driver with # dmesg -n 8 # insmod/modprobe e100 debug=6 # ifconfig eth0 ... up # reboot I'm not sure if you're able, but setting up a serial console to log output through reboot would be very useful. I saw this when running with 7490 and this patch and rebooting. e100: 0000:01:01.0: e100_probe: stat_ack before reset: 00 e100: eth0: e100_probe: stat_ack end: 00 e100: eth0: e100_open: enter e100: eth0: e100_up: stat_ack: 00 e100: eth0: e100_watchdog: right now = 4295219869 e100: eth0: e100_watchdog: right now = 4295228386 e100: eth0: e100_close: enter md: stopping all md devices. md: md0 switched to read-only mode. Synchronizing SCSI cache for disk sdb: Synchronizing SCSI cache for disk sda: e100: eth0: e100_shutdown: enter e100: eth0: e100_shutdown: stat_ack: 00 Restarting system. . machine restart
After setting up a serial console and capturing messages during reboot, I found out that the e100 driver already does as best as possible. If all interfaces are taken down during reboot, the irq problem does not show up. However the sysvinit I have (debian, 2.86.ds1-1) only detects and downs interfaces that have an ip address. Now on our systems that turns out not to be the case: One interface is used solely for network sniffing and for security reasons it never gets an ip address, but it is up and will be set to promiscous mode by the sniffing program. So the upstream pci quirk is still necessary, because we cannot guarantee that all interfaces are taken down before boot. The e100 driver however turned out to IMHO do the best it can. It is not its fault if interfaces stay up during reboot. So I think this bug can be finally marked as resolved (And yes, I tested again, that the final proposed patch really fixes my original problem. It is also quite nice that it warns when the problem would have appeared without the patch and is completely silent otherwise.) A big thanks to all the people involved.
I expect this is technically unrelated to ACPI (ie. it likely still happens with "acpi=off"), but I don't see the interrupt quirk patch from comment #34 upstream, so I'll add it to the acpi test tree.
shipped post linux-2.6.18-rc4. closed.