Bug 8472
Summary: | atl1 module APIC error when MSI enabled in kernel 2.6.21 | ||
---|---|---|---|
Product: | Drivers | Reporter: | Gregory Krzystek (ninex) |
Component: | Network | Assignee: | Tejun Heo (htejun) |
Status: | CLOSED CODE_FIX | ||
Severity: | high | CC: | dj, jacliburn |
Priority: | P2 | ||
Hardware: | i386 | ||
OS: | Linux | ||
Kernel Version: | 2.6.21 | Subsystem: | |
Regression: | --- | Bisected commit-id: | |
Attachments: | .config, dmesg, lspci, lsmod | ||
Description
Gregory Krzystek
2007-05-12 14:46:13 UTC
Created attachment 11484 [details]
.config
Created attachment 11485 [details]
dmesg
Created attachment 11486 [details]
lspci
Created attachment 11487 [details]
lsmod
My motherboard is Asus M2V

Andrew Morton <akpm@linux-foundation.org> writes:

> On Sat, 12 May 2007 14:46:17 -0700 bugme-daemon@bugzilla.kernel.org wrote:
>
>> http://bugzilla.kernel.org/show_bug.cgi?id=8472
>>
>> Summary: atl1 module APIC error when MSI enabled in kernel 2.6.21
>> Kernel Version: 2.6.21
>> Status: NEW
>> Severity: high
>> Owner: jgarzik@pobox.com
>> Submitter: ninex@o2.pl
>>
>> Distribution: Gentoo/amd64
>> Problem Description: when I try to bring up eth0, the kernel produces errors:
>>
>> after /etc/init.d/net.eth0 start
>> I can't establish any connection
>> in /var/log/messages I see:
>> May 13 19:16:49 localhost APIC error on CPU0: 04(08)
>> May 13 19:16:49 localhost atl1: eth0 link is up 100 Mbps full duplex
>> May 13 19:16:49 localhost APIC error on CPU0: 08(08)
>> May 13 19:16:49 localhost APIC error on CPU0: 08(08)
>> ...
>> sometimes the system reboots without any error in the logs
>> there is a workaround:
>> add pci=nomsi to the kernel boot command line.

Interesting...

> argh, MSI again. Seems more trouble than it's worth.
> Eric, Andi: do we have anything in the pipeline which is likely to
> address this?

Not exactly. I actually haven't heard of this symptom before. If I can ever log into bugzilla.kernel.org and get the full bug report I might have to see if I can understand what hardware is at work.

The practical problem (I think) is that we assume that msi works and then go about blacklisting things, instead of just enabling MSI when the hardware supports it and is set up properly. Although this instance actually looks like msi is at least half working, so it may be something generic going wrong.

I'm really confused because at least in Intel's documentation of the local apic the reported error only ever happens if you are using the old serial apic bus, which nothing has used since the P4 and the K8 were introduced. All current products are front side bus based.
Eric

Andrew Morton <akpm@linux-foundation.org> writes:

> On Sun, 13 May 2007 00:59:10 -0600 ebiederm@xmission.com (Eric W. Biederman)
> wrote:
>
>> If
>> I can ever log into bugzilla.kernel.org and get the full bug report I
>> might have to see if I can understand what hardware is at work.
>
> I think the DNS thingy broke. bugme.osdl.org works.

Thanks.

Of the full boot trace this appears to be the interesting bit.

> Attansic L1 Ethernet Network Driver - version 2.0.7
> Copyright(c) 2005-2006 Attansic Corporation.
> ACPI: PCI Interrupt 0000:04:00.0[A] -> GSI 36 (level, low) -> IRQ 36
> PCI: Setting latency timer of device 0000:04:00.0 to 64
> APIC error on CPU0: 04(08)
> APIC error on CPU0: 08(08)
> APIC error on CPU0: 08(08)
> APIC error on CPU0: 08(08)
> APIC error on CPU0: 08(08)
> APIC error on CPU0: 08(08)
> APIC error on CPU0: 08(08)
> APIC error on CPU0: 08(08)
> APIC error on CPU0: 08(08)
> APIC error on CPU0: 08(08)
> do_IRQ: 0.213 No irq handler for vector
> Uhhuh. NMI received for unknown reason 3c.
> Do you have a strange power saving mode enabled?
> Dazed and confused, but trying to continue
> APIC error on CPU0: 08(08)

The no irq handler for vector is worrisome.

The atl1 network driver appears new in 2.6.21. Has anyone gotten msi working on this network adapter, and in which situations?

The chipset is a via hypertransport chipset and it does have a msi mapping capability. If I have decoded the bus layout properly we don't have a msi to hypertransport interrupt mapping capability on the direct path from this pci-express nic to the hypertransport bus. We seem to have the mapping only for bus 0x80 and 0x07. Which could explain why msi doesn't work on this hardware.

IRQ 32 is our non msi irq number.
We don't print the msi irq number.

Since I can't see how a msi would properly transfer I wonder if VIA saw that their msi mapping capability was borked, and disabled it, and we are seeing how it is borked when we try and use it unconditionally.
Does this system have a different pci-express slot that could correspond to the mysterious bus 7?

Eric

Andrew Morton wrote:
> (add Jay to cc)

Thanks Andrew.

I've been whining about this problem for some time now.

http://marc.info/?l=linux-netdev&m=117469855329113&w=2

I'm *really* happy to have some help! :) Thank you

> On Sun, 13 May 2007 02:25:56 -0600 ebiederm@xmission.com (Eric W. Biederman) wrote:
>
>> Andrew Morton <akpm@linux-foundation.org> writes:
>>
>>> On Sun, 13 May 2007 00:59:10 -0600 ebiederm@xmission.com (Eric W. Biederman)
>>> wrote:
>>>
>>>> If
>>>> I can ever log into bugzilla.kernel.org and get the full bug report I
>>>> might have to see if I can understand what hardware is at work.
>>> I think the DNS thingy broke. bugme.osdl.org works.
>> Thanks.
>>
>> Of the full boot trace this appears to be the interesting bit.
>>
>>> Attansic L1 Ethernet Network Driver - version 2.0.7
>>> Copyright(c) 2005-2006 Attansic Corporation.
>>> ACPI: PCI Interrupt 0000:04:00.0[A] -> GSI 36 (level, low) -> IRQ 36
>>> PCI: Setting latency timer of device 0000:04:00.0 to 64
>>> APIC error on CPU0: 04(08)
>>> APIC error on CPU0: 08(08)
>>> APIC error on CPU0: 08(08)
>>> APIC error on CPU0: 08(08)
>>> APIC error on CPU0: 08(08)
>>> APIC error on CPU0: 08(08)
>>> APIC error on CPU0: 08(08)
>>> APIC error on CPU0: 08(08)
>>> APIC error on CPU0: 08(08)
>>> APIC error on CPU0: 08(08)
>>> do_IRQ: 0.213 No irq handler for vector
>>> Uhhuh. NMI received for unknown reason 3c.
>>> Do you have a strange power saving mode enabled?
>>> Dazed and confused, but trying to continue
>>> APIC error on CPU0: 08(08)
>> The no irq handler for vector is worrisome,
>>
>> The atl1 network driver appears new in 2.6.21. Has anyone gotten msi
>> working on this network adapter, and in which situations?

Yes, MSI on this network adapter works flawlessly on my Intel-based ASUS P5B-MX board. Luca Tattamanti has a P5B-E board, and MSI works for him there, too.
I also have an Asus M2V (like the reporter in the instant case), and MSI works for me there so long as I stick with a 64-bit kernel. I get the APIC error only when I run a 32-bit kernel. However, I do have one user report of the error on the M2V under a Debian x86_64 kernel, so I might just be lucky in having avoided it for so long.

>> The chipset is a via hypertransport chipset and it does have a msi
>> mapping capability. If I have decoded the bus layout properly we
>> don't have a msi to hypertransport interrupt mapping capability on the
>> direct path from this pci-express nic to the hypertransport bus.
>> We seem to have the mapping only for bus 0x80 and 0x07. Which
>> could explain why msi doesn't work on this hardware.
>>
>> IRQ 32 is our non msi irq number.
>> We don't print the msi irq number.
>>
>> Since I can't see how a msi would properly transfer I wonder if VIA
>> saw that their msi mapping capability was borked, and disabled it,
>> and we are seeing how it is borked when we try and use it
>> unconditionally.
>>
>> Does this system have a different pci-express slot that could
>> correspond to the mysterious bus 7?

The only other PCIe device on this board is the video adapter.

>> Eric

I've pretty much concluded that MSI is broken on this board, and I tested a pci quirk patch last night that simply turns it off globally when it discovers the VIA VT3351 bridge. I was going to submit it atop Tejun's similar patch here: http://lkml.org/lkml/2007/5/9/213.

If there's something else I need to try, please let me know.

Jay

Reply-To: ninex@NineX.eu.org

Jay Cliburn wrote:
> [...]
> I also have an Asus M2V (like the reporter in the
> instant case), and MSI works for me there so long as I stick with a
> 64-bit kernel. I get the APIC error only when I run a 32-bit kernel.
> [...]
> If there's something else I need to try, please let me know.

hmm but i see this bug when running 64bit kernel....

> Jay

--
GRZEGORZ {NineX} KRZYSTEK
NineX Inc.
ninex@ninex.eu.org
Krak

Grzegorz Krzystek wrote:
> hmm but i see this bug when running 64bit kernel....
Then you're the second person to see it under x86_64. Congratulations! :)
I vote for quirking MSI off on this board, unless Andi, Eric, and Andrew have
alternate ideas.
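
The quirk Jay proposes (turn MSI off globally once a VIA VT3351 bridge is found) can be modelled in userspace. What follows is a minimal Python sketch of that decision, not the kernel patch itself. The function names are hypothetical, and the vendor/device IDs (0x1106 for VIA, 0x0351 for the VT3351 bridge) are assumptions that are not stated in this thread; check them with `lspci -nn` on an affected board.

```python
from pathlib import Path

# Assumed IDs, not taken from this thread: verify with `lspci -nn`.
VIA_VENDOR_ID = 0x1106
VT3351_DEVICE_ID = 0x0351

# Hypothetical blacklist of bridges on which MSI is treated as broken.
MSI_BROKEN_BRIDGES = {(VIA_VENDOR_ID, VT3351_DEVICE_ID)}


def msi_allowed(devices):
    """Return False if any (vendor, device) pair is a blacklisted bridge."""
    return not any(dev in MSI_BROKEN_BRIDGES for dev in devices)


def read_sysfs_ids(root="/sys/bus/pci/devices"):
    """Collect (vendor, device) pairs from sysfs; empty list if unavailable."""
    ids = []
    base = Path(root)
    if not base.is_dir():
        return ids
    for dev in sorted(base.iterdir()):
        try:
            vendor = int((dev / "vendor").read_text().strip(), 16)
            device = int((dev / "device").read_text().strip(), 16)
        except (OSError, ValueError):
            continue  # skip entries that cannot be read or parsed
        ids.append((vendor, device))
    return ids


if __name__ == "__main__":
    if msi_allowed(read_sysfs_ids()):
        print("no blacklisted bridge found; MSI left enabled")
    else:
        print("blacklisted bridge found; boot with pci=nomsi")
```

The sketch only reports a recommendation; the `pci=nomsi` boot option it prints is the workaround already given in the original report.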
Jay Cliburn <jacliburn@bellsouth.net> writes:

> Grzegorz Krzystek wrote:
>
>> hmm but i see this bug when running 64bit kernel....
>
> Then you're the second person to see it under x86_64. Congratulations! :)

Interesting. Is the failure mode really apic errors on all kernels?

The apic error feels like we tried to send an invalid apic interrupt to ourselves, and things croaked.

> I vote for quirking MSI off on this board, unless Andi, Eric, and Andrew have
> alternate ideas.

I have a hypothesis that the appropriate msi mapping capability is simply programmed wrong. If that is the case we can really fix this issue.

On the non-failing instances of this board can we use lspci to find the working msi mapping capability that is on the path between the pci-express bus and the upstream hypertransport bus?

Can we then get a complete register dump of the msi-mapping capability?

Can we then please repeat the process on a failing instance of the board in question?

Can we also please compare the pci revision fields in the chipset between working and non-working versions of this chipset?

I'm also curious about these apic errors, and the "no irq for vector" error that happened on x86_64. Even the bit about triggering an NMI I'm curious about. This suggests we may actually have bad hardware somewhere in the mix that we need to weed out.

So getting details on other similar failures would also be interesting.

If we are actually going to spend some time on this I am inclined to figure out how our quirk layer works for x86 msi and turn it inside out: default it to off, with a quirk that says if you are an intel pci-express chipset turn it on, and another quirk that says if you have a properly set up msi mapping capability turn it on. It can't be that hard and at least then it will be safe to enable MSI support in the kernel by default.
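
One way to gather the register dump and revision field requested above is to read a device's PCI configuration space and walk its capability list. The Python sketch below is an illustration under stated assumptions, not a tool used in this thread: the function names are hypothetical, and the constants (HyperTransport capability ID 0x08, MSI mapping type 0xA8 in the byte at offset +3, enable bit 0x01 in the byte at offset +2) follow the standard PCI and HyperTransport layouts as I understand them. On Linux the raw bytes are exposed as `/sys/bus/pci/devices/<address>/config`; reads without root typically return only the first 64 bytes, which is not enough to see the capabilities.

```python
HT_CAP_ID = 0x08            # PCI capability ID for HyperTransport
HT_MSI_MAPPING_TYPE = 0xA8  # top five bits of the byte at cap offset + 3
HT_MSI_ENABLE = 0x01        # enable bit in the flags byte at cap offset + 2


def read_config(address, root="/sys/bus/pci/devices"):
    """Read raw config space for an address such as '0000:04:00.0'."""
    with open(f"{root}/{address}/config", "rb") as fh:
        return fh.read()


def revision_id(config):
    """The PCI revision ID lives at offset 0x08 of the header."""
    return config[0x08]


def walk_capabilities(config):
    """Yield (offset, capability_id) for each capability list entry."""
    # Status register bit 4 (offset 0x06) says whether a list exists.
    if len(config) < 0x40 or not (config[0x06] & 0x10):
        return
    pos = config[0x34] & 0xFC   # capabilities pointer, low bits reserved
    seen = set()                # guard against malformed, looping lists
    while pos >= 0x40 and pos + 1 < len(config) and pos not in seen:
        seen.add(pos)
        yield pos, config[pos]
        pos = config[pos + 1] & 0xFC


def ht_msi_mappings(config):
    """Return [(offset, enabled)] for each HT MSI mapping capability."""
    found = []
    for pos, cap_id in walk_capabilities(config):
        if cap_id != HT_CAP_ID or pos + 3 >= len(config):
            continue
        if config[pos + 3] & 0xF8 == HT_MSI_MAPPING_TYPE:
            found.append((pos, bool(config[pos + 2] & HT_MSI_ENABLE)))
    return found
```

Running `ht_msi_mappings(read_config(addr))` for each bridge between the NIC and the host bridge, on both a working and a failing board, would give the side-by-side comparison being asked for.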
The practical question for me is can we use sweeping generalizations (like the presence of the standard hypertransport msi mapping capability) when turning things on, or do we need to go chipset by chipset? Or can some combination of both work?

Andrew, my current feeling on this part of the MSI code is that we are currently violating our current best practices for enabling a hardware capability. We try and use MSI even without knowing it works, instead of enabling MSI only in known good configurations. As far as I can tell we have never seriously tried a MSI whitelist approach.

Now that the MSI core is pretty solid I think I can spare a cycle or two so we can straighten out the MSI enable criteria, at which point MSI should be much less hassle.

Eric

On Sun, 13 May 2007 09:25:41 -0600 ebiederm@xmission.com (Eric W. Biederman) wrote:

> Jay Cliburn <jacliburn@bellsouth.net> writes:
>
> > Grzegorz Krzystek wrote:
> >
> >> hmm but i see this bug when running 64bit kernel....
> >
> > Then you're the second person to see it under x86_64.
> > Congratulations! :)
>
> Interesting. Is the failure mode really apic errors on all kernels?

I have results from a working x86_64 kernel. Again, unlike Grzegorz, it works for me under x86_64, but fails under i386 kernels.

> On the non-failing instances of this board can we use lspci
> to find the working msi mapping capability that is on the path
> between the pci-express bus and the upstream hypertransport bus.
Here's a pointer to the files for working and non-working instances:

x86_64 with working MSI:
ftp://ftp.hogchain.net/pub/linux/m2v/apic-problem/m2v-x86_64-msi-working/
m2v-x86_64-dmesg.txt
m2v-x86_64-dmidecode.txt
m2v-x86_64-lsmod.txt
m2v-x86_64-lspci.txt
m2v-x86_64-ping.txt
m2v-x86_64-proc-interrupts.txt
m2v-x86_64-uname.txt

i386 with failing MSI:
ftp://ftp.hogchain.net/pub/linux/m2v/apic-problem/m2v-i386-msi-failing/
m2v-i386-dmesg.txt
m2v-i386-dmidecode.txt
m2v-i386-lsmod.txt
m2v-i386-lspci.txt
m2v-i386-proc-interrupts.txt
m2v-i386-uname.txt

> Can we then get a complete register dump of the msi-mapping
> capability.

How do I obtain a register dump?

> Can we then please repeat the process on a failing instance of
> the board in question.
>
> Can we also please compare the pci revision fields in the chipset
> between working and non-working versions of this chipset.

How do I obtain the pci revision fields?

> I'm also curious about these apic errors, and the "no irq for vector"
> error that happened on x86_64. Even the bit about trigger an NMI
> I'm curious about. This suggests we may actually have bad hardware
> somewhere in the mix that we need to weed out.
>
> So getting details on other similar failures would also be
> interesting.

Specifically, what additional info do you need? I'll be glad to get it.

One interesting tidbit here is the comparison of /proc/interrupts between the x86_64 (working) kernel and the i386 (non-working) kernel. The x86_64 version shows the MSI mapping, but the i386 doesn't; eth0 doesn't show up at all under i386 (with MSI enabled). The atl1 module is definitely loaded, but the network isn't started (because it kills the box).
[jcliburn@osprey ~]$ cat m2v-x86_64-proc-interrupts.txt
           CPU0       CPU1
   0:    391098          0   IO-APIC-edge      timer
   1:        27        371   IO-APIC-edge      i8042
   6:         5          0   IO-APIC-edge      floppy
   7:         0          0   IO-APIC-edge      parport0
   8:         0          0   IO-APIC-edge      rtc
   9:         0          0   IO-APIC-fasteoi   acpi
  12:         4          0   IO-APIC-edge      i8042
  14:      2521      16075   IO-APIC-edge      libata
  15:         0          0   IO-APIC-edge      libata
  17:       195          0   IO-APIC-fasteoi   HDA Intel
  20:         0          0   IO-APIC-fasteoi   uhci_hcd:usb1
  21:       138        104   IO-APIC-fasteoi   libata, ehci_hcd:usb2, uhci_hcd:usb4
  22:        33        263   IO-APIC-fasteoi   uhci_hcd:usb3
  23:         0          0   IO-APIC-fasteoi   uhci_hcd:usb5
2298:        68        120   PCI-MSI-edge      eth0
 NMI:         0          0
 LOC:    390899     390790
 ERR:         0

[jcliburn@osprey ~]$ cat m2v-i386-proc-interrupts.txt
           CPU0       CPU1
   0:       260          0   IO-APIC-edge      timer
   1:         5        692   IO-APIC-edge      i8042
   4:         0         10   IO-APIC-edge      serial
   6:         1          4   IO-APIC-edge      floppy
   7:         0          0   IO-APIC-edge      parport0
   8:         0          1   IO-APIC-edge      rtc
   9:         0          0   IO-APIC-fasteoi   acpi
  12:         1          3   IO-APIC-edge      i8042
  14:      1981         47   IO-APIC-edge      ide0
  21:         0          0   IO-APIC-fasteoi   uhci_hcd:usb1
  22:       489         30   IO-APIC-fasteoi   uhci_hcd:usb2
  23:        21      13873   IO-APIC-fasteoi   uhci_hcd:usb3, ehci_hcd:usb5, sata_via
  24:         0          0   IO-APIC-fasteoi   uhci_hcd:usb4
  26:         0        166   IO-APIC-fasteoi   HDA Intel
 NMI:         0          0
 LOC:    253317     249720
 ERR:         0
 MIS:         0

Jay

On Sun, 13 May 2007 20:44:07 -0500 Jay Cliburn <jacliburn@bellsouth.net> wrote:

> One interesting tidbit here is the comparison of /proc/interrupts
> between the x86_64 (working) kernel and the i386 (non-working) kernel.

Here's another interesting thing...

The attached png file shows the lspci outputs for the L1 driver side-by-side. The working version is on the left, and the non-working version is on the right. Note the Address line; it's all zeroes on the non-working side. That doesn't seem right.

Reply-To: ninex@NineX.eu.org

Jay Cliburn wrote:
> On Sun, 13 May 2007 09:25:41 -0600
> ebiederm@xmission.com (Eric W. Biederman) wrote:
>
>> Jay Cliburn <jacliburn@bellsouth.net> writes:
>>
>>> Grzegorz Krzystek wrote:
>>>
>>>> hmm but i see this bug when running 64bit kernel....
>>> Then you're the second person to see it under x86_64.
>>> Congratulations! :)
>>
>> Interesting. Is the failure mode really apic errors on all kernels?
>
> I have results from a working x86_64 kernel. Again, unlike Grzegorz, it
> works for me under x86_64, but fails under i386 kernels.

So maybe there is something in the BIOS settings/version? I have the latest version on my board. Can you make a BIOS settings profile dump via the Asus a.o.c profile? If yes, I will try to use your profile and check if that works...

> [...]

--
GRZEGORZ {NineX} KRZYSTEK
NineX Inc.
ninex@ninex.eu.org
Krak

Reply-To: ninex@NineX.eu.org

bugme-daemon@bugzilla.kernel.org wrote:
> http://bugzilla.kernel.org/show_bug.cgi?id=8472
>
> ------- Additional Comments From anonymous@kernel-bugs.osdl.org 2007-05-13 22:59 -------
> Reply-To: ninex@NineX.eu.org
>
> [...]
>> Here's a pointer to the files for working and non-working instances:
>>
>> x86_64 with working MSI:
>> ftp://ftp.hogchain.net/pub/linux/m2v/apic-problem/m2v-x86_64-msi-working/
>> [...]

And take into consideration that the MSI/APIC error appears when you try to bring up the eth interface, not when the driver is loading... So whoever created these logs should boot the x86_64 kernel, try to bring up the interface, and see in the logs what happens...

> [...]

--
GRZEGORZ {NineX} KRZYSTEK
NineX Inc.
ninex@ninex.eu.org
Kraków, Idzikowskiego 17a
tel. +48 602135796
_____________________________________________________________
The information transmitted is intended only for the person or entity to which it is addressed and may contain confidential and/or privileged material. Any review, retransmission, dissemination or other use of, or taking of any action in reliance upon, this information by persons or entities other than the intended recipient is prohibited. If you received this in error, please contact the sender and delete the material from any computer.
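
The /proc/interrupts comparison that Jay posted earlier in the thread can be automated. The Python sketch below parses the text of /proc/interrupts and reports which devices are serviced through MSI. The function name is hypothetical, and the parser assumes the 2.6.21-era layout shown in the thread, where the controller column reads PCI-MSI-edge for MSI interrupts. The sample data is excerpted from the two listings in the thread.

```python
def msi_users(interrupts_text):
    """Map device name -> IRQ number for lines whose chip type mentions MSI."""
    lines = interrupts_text.strip().splitlines()
    if not lines:
        return {}
    ncpus = len(lines[0].split())        # header row: CPU0 CPU1 ...
    result = {}
    for line in lines[1:]:
        fields = line.split()
        label = fields[0].rstrip(":")
        if not label.isdigit():          # skip NMI:, LOC:, ERR:, MIS:
            continue
        rest = fields[1 + ncpus:]        # chip type, then device names
        if rest and "MSI" in rest[0]:
            result[" ".join(rest[1:])] = int(label)
    return result


# Excerpt from the working x86_64 listing.
WORKING = """\
           CPU0       CPU1
   0:    391098          0   IO-APIC-edge      timer
  21:       138        104   IO-APIC-fasteoi   libata, ehci_hcd:usb2, uhci_hcd:usb4
2298:        68        120   PCI-MSI-edge      eth0
 NMI:         0          0
 ERR:         0
"""

# Excerpt from the failing i386 listing.
FAILING = """\
           CPU0       CPU1
   0:       260          0   IO-APIC-edge      timer
  26:         0        166   IO-APIC-fasteoi   HDA Intel
 NMI:         0          0
 ERR:         0
"""

print(msi_users(WORKING))   # {'eth0': 2298}
print(msi_users(FAILING))   # {}
```

An empty result on a kernel with MSI enabled matches what Jay observed on i386: the atl1 module is loaded, but no MSI vector has been registered for eth0 because the interface was never brought up.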
</pre> </body> </html> Reply-To: ninex@NineX.eu.org >> Here's a pointer to the files for working and non-working instances: >> >> x86_64 with working MSI: >> ftp://ftp.hogchain.net/pub/linux/m2v/apic-problem/m2v-x86_64-msi-working/ >> m2v-x86_64-dmesg.txt >> m2v-x86_64-dmidecode.txt >> m2v-x86_64-lsmod.txt >> m2v-x86_64-lspci.txt >> m2v-x86_64-ping.txt >> m2v-x86_64-proc-interrupts.txt >> m2v-x86_64-uname.txt >> >> i386 with failing MSI: >> ftp://ftp.hogchain.net/pub/linux/m2v/apic-problem/m2v-i386-msi-failing/ >> m2v-i386-dmesg.txt >> m2v-i386-dmidecode.txt >> m2v-i386-lsmod.txt >> m2v-i386-lspci.txt >> m2v-i386-proc-interrupts.txt >> m2v-i386-uname.txt >> >> please take a consideration that MSI bug apears not on boot in my case, but when i trye tu up interface.... let some one who created logs on x86_64 kernel trye tu up interface and ping some host... and see into log again .... sorry for format of last message - my thunderbird freakout ;) On Mon, 14 May 2007 09:24:21 +0200 Grzegorz Krzystek <ninex@NineX.eu.org> wrote: > please take a consideration that MSI bug apears not on boot in my > case, but when i trye tu up interface.... let some one who created > logs on x86_64 kernel trye tu up interface and ping some host... and > see into log again .... > sorry for format of last message - my thunderbird freakout ;) I see the same behavior; the APIC errors start when the network interface is brought up. The x86_64 logs were captured on my M2V system with the interface up, and I included a ping command's output in the list of files. Jay Cliburn <jacliburn@bellsouth.net> writes: > On Mon, 14 May 2007 09:24:21 +0200 > Grzegorz Krzystek <ninex@NineX.eu.org> wrote: > >> please take a consideration that MSI bug apears not on boot in my >> case, but when i trye tu up interface.... let some one who created >> logs on x86_64 kernel trye tu up interface and ping some host... and >> see into log again .... 
>> sorry for format of last message - my thunderbird freakout ;)
>
> I see the same behavior; the APIC errors start when the network
> interface is brought up. The x86_64 logs were captured on my M2V system
> with the interface up, and I included a ping command's output in the
> list of files.

When you bring up the interface is when we call pci_enable_msi and allocate the msi.

Eric

On Mon, 14 May 2007 05:26:00 -0600 ebiederm@xmission.com (Eric W. Biederman) wrote:
> When you bring up the interface is when we call pci_enable_msi
> and allocate the msi.

Based upon Eric's statement here, I went back to the failing case and recaptured some files while the interface was up. My previous data collections were when the interface was down.

The new files are at:

ftp://ftp.hogchain.net/pub/linux/m2v/apic-problem/m2v-i386-msi-failing

In particular, /proc/interrupts now shows an interrupt assigned to the device, whereas it didn't before because the driver wasn't started. This is from the failing i386 instance. APIC errors are pouring out of the system when this file is captured.

           CPU0       CPU1
  0:        260          0   IO-APIC-edge      timer
  1:         41        532   IO-APIC-edge      i8042
  4:          0         10   IO-APIC-edge      serial
  6:          0          5   IO-APIC-edge      floppy
  7:          0          0   IO-APIC-edge      parport0
  8:          0          1   IO-APIC-edge      rtc
  9:          0          0   IO-APIC-fasteoi   acpi
 12:          1          3   IO-APIC-edge      i8042
 14:       1477         47   IO-APIC-edge      ide0
 21:          0          0   IO-APIC-fasteoi   uhci_hcd:usb1
 22:       1210         28   IO-APIC-fasteoi   uhci_hcd:usb2
 23:         18      10754   IO-APIC-fasteoi   uhci_hcd:usb3, ehci_hcd:usb5, sata_via
 24:          0          0   IO-APIC-fasteoi   uhci_hcd:usb4
 26:          0        167   IO-APIC-fasteoi   HDA Intel
218:          0          0   PCI-MSI-edge      eth0
NMI:          0          0
LOC:     195484     191531
ERR:        234
MIS:          0

Also, please ignore the previous png file; the atl1 driver wasn't started, so lspci showed zeros for the MSI address.

Jay

Jay Cliburn <jacliburn@bellsouth.net> writes:
> On Mon, 14 May 2007 05:26:00 -0600
> ebiederm@xmission.com (Eric W. Biederman) wrote:
>
>> When you bring up the interface is when we call pci_enable_msi
>> and allocate the msi.
>
> Based upon Eric's statement here, I went back to the failing case and
> recaptured some files while the interface was up. My previous data
> collections were when the interface was down.
>
> The new files are at:
>
> ftp://ftp.hogchain.net/pub/linux/m2v/apic-problem/m2v-i386-msi-failing

Hmm. Your 64bit working case is from a 2.6.20 kernel.

Interesting we are using a different delivery mode in the 32bit and the 64bit kernel.

64bit
Address: 00000000fee01000 Data: 40d9
fixed delivery mode.
dest:

32bit
Address: 00000000fee0300c Data: 416a
lowest priority delivery mode

On your 32bit system could you try the patch below. I want to see if things work properly when you are not in lowest priority delivery mode.

The other truly odd thing is the two MSI mapping capabilities that lspci found were not enabled. So I am puzzled how things are working in this case. I'm guessing it is chipset internal magic not using the standard capabilities.

Thanks,
Eric

diff --git a/include/asm-i386/mach-default/mach_apic.h b/include/asm-i386/mach-default/mach_apic.h
index 6db1c3b..f72c307 100644
--- a/include/asm-i386/mach-default/mach_apic.h
+++ b/include/asm-i386/mach-default/mach_apic.h
@@ -19,8 +19,8 @@ static inline cpumask_t target_cpus(void)
 #define NO_BALANCE_IRQ (0)
 #define esr_disable (0)

-#define INT_DELIVERY_MODE dest_LowestPrio
-#define INT_DEST_MODE 1 /* logical delivery broadcast to all procs */
+#define INT_DELIVERY_MODE (dest_Fixed)
+#define INT_DEST_MODE (0) /* phys delivery to target proc */

 static inline unsigned long check_apicid_used(physid_mask_t bitmap, int apicid)
 {

On Mon, 14 May 2007 10:38:01 -0600 ebiederm@xmission.com (Eric W. Biederman) wrote:
> Hmm. Your 64bit working case is from a 2.6.20 kernel.

Actually, it's a 2.6.21-rcX kernel, unless I'm mistaken. It's the Fedora 7 Test 4 Live version, and Fedora's kernel numbers don't track with the vanilla kernel numbering scheme.
> > > Interesting we are using a different delivery mode in > the 32bit and the 64bit kernel. > 64bit > Address: 00000000fee01000 Data: 40d9 > fixed delivery mode. > dest: > > 32bit > Address: 00000000fee0300c Data: 416a > lowest priority delivery mode > > On your 32bit system could you try the patch below. I want to see if > things work properly with when you are not in lowest priority delivery > mode. I'll get back to you soon with the result... > > The other truly odd thing is the two MSI mapping capabilities that > lpsci found were not enabled. So I am puzzled how things are working > in this case. I'm guessing it is chipset internal magic not using > the standard capabilities. > > Thanks, > Eric > > > diff --git a/include/asm-i386/mach-default/mach_apic.h > b/include/asm-i386/mach-default/mach_apic.h index 6db1c3b..f72c307 > 100644 --- a/include/asm-i386/mach-default/mach_apic.h > +++ b/include/asm-i386/mach-default/mach_apic.h > @@ -19,8 +19,8 @@ static inline cpumask_t target_cpus(void) > #define NO_BALANCE_IRQ (0) > #define esr_disable (0) > > -#define INT_DELIVERY_MODE dest_LowestPrio > -#define INT_DEST_MODE 1 /* logical delivery broadcast to all > procs */ +#define INT_DELIVERY_MODE (dest_Fixed) > +#define INT_DEST_MODE (0) /* phys delivery to target proc */ > > static inline unsigned long check_apicid_used(physid_mask_t bitmap, > int apicid) { > On Mon, 14 May 2007 10:38:01 -0600 ebiederm@xmission.com (Eric W. Biederman) wrote: > On your 32bit system could you try the patch below. 
> diff --git a/include/asm-i386/mach-default/mach_apic.h > b/include/asm-i386/mach-default/mach_apic.h index 6db1c3b..f72c307 > 100644 --- a/include/asm-i386/mach-default/mach_apic.h > +++ b/include/asm-i386/mach-default/mach_apic.h > @@ -19,8 +19,8 @@ static inline cpumask_t target_cpus(void) > #define NO_BALANCE_IRQ (0) > #define esr_disable (0) > > -#define INT_DELIVERY_MODE dest_LowestPrio > -#define INT_DEST_MODE 1 /* logical delivery broadcast to all > procs */ +#define INT_DELIVERY_MODE (dest_Fixed) > +#define INT_DEST_MODE (0) /* phys delivery to target proc */ > > static inline unsigned long check_apicid_used(physid_mask_t bitmap, > int apicid) { > Panics the kernel even before I can grab serial console output. Jpeg attached... On Mon, 14 May 2007 10:38:01 -0600 ebiederm@xmission.com (Eric W. Biederman) wrote: > The other truly odd thing is the two MSI mapping capabilities that > lpsci found were not enabled. So I am puzzled how things are working > in this case. I'm guessing it is chipset internal magic not using > the standard capabilities. I pulled from Linus' tree today, built a current git 2.6.22-rc1 kernel, booted with apic=debug, started the atl1 driver, and produced the attached dmesg. This is under the failing i386 instance. I've figured out that if I leave the network cable disconnected, I can start the driver with MSI enabled without crashing the system. That's probably because the NIC doesn't generate any interrupts so long as the cable is disconnected. Hope the attached dmesg helps. If there's anything else I can provide, please let me know. Jay Jay Cliburn <jacliburn@bellsouth.net> writes: > Panics the kernel even before I can grab serial console output. Jpeg > attached... Bother then I missed something on the testing patch. I will take a look a little later and see if I can come up with something that actually works. Eric Jay Cliburn <jacliburn@bellsouth.net> writes: > On Mon, 14 May 2007 10:38:01 -0600 > ebiederm@xmission.com (Eric W. 
Biederman) wrote: > >> The other truly odd thing is the two MSI mapping capabilities that >> lpsci found were not enabled. So I am puzzled how things are working >> in this case. I'm guessing it is chipset internal magic not using >> the standard capabilities. > > I pulled from Linus' tree today, built a current git 2.6.22-rc1 kernel, > booted with apic=debug, started the atl1 driver, and produced the > attached dmesg. This is under the failing i386 instance. > > I've figured out that if I leave the network cable disconnected, I can > start the driver with MSI enabled without crashing the system. That's > probably because the NIC doesn't generate any interrupts so long as the > cable is disconnected. > > Hope the attached dmesg helps. If there's anything else I can provide, > please let me know. I'm still trying to figure out (without trying to hard) if this is a case where MSI interrupts only work in physical mode, and not in lowest priority deliver mode. So since I can't easily switch the i386 kernel to use physical mode. Here is my attempt to break your 64bit kernel with lowest priority delivery mode. Think you could try this and tell me if MSI continues to work with this patch applied on your 64bit kernel. All this patch does is override the selection logic in genapic.c diff --git a/arch/x86_64/kernel/genapic.c b/arch/x86_64/kernel/genapic.c index 47496a4..92f4925 100644 --- a/arch/x86_64/kernel/genapic.c +++ b/arch/x86_64/kernel/genapic.c @@ -55,6 +55,10 @@ void __init setup_apic_routing(void) else genapic = &apic_physflat; +#if 1 + genapic = &apic_flat; +#endif + printk(KERN_INFO "Setting APIC routing to %s\n", genapic->name); } Eric Eric W. Biederman wrote:
> Jay Cliburn <jacliburn@bellsouth.net> writes:
>
>> On Mon, 14 May 2007 10:38:01 -0600
>> ebiederm@xmission.com (Eric W. Biederman) wrote:
>>
>>> The other truly odd thing is the two MSI mapping capabilities that
>>> lspci found were not enabled. So I am puzzled how things are working
>>> in this case. I'm guessing it is chipset internal magic not using
>>> the standard capabilities.
>> I pulled from Linus' tree today, built a current git 2.6.22-rc1 kernel,
>> booted with apic=debug, started the atl1 driver, and produced the
>> attached dmesg. This is under the failing i386 instance.
>>
>> I've figured out that if I leave the network cable disconnected, I can
>> start the driver with MSI enabled without crashing the system. That's
>> probably because the NIC doesn't generate any interrupts so long as the
>> cable is disconnected.
>>
>> Hope the attached dmesg helps. If there's anything else I can provide,
>> please let me know.
>
> I'm still trying to figure out (without trying too hard) if this
> is a case where MSI interrupts only work in physical mode, and
> not in lowest priority delivery mode.
>
> Since I can't easily switch the i386 kernel to use physical
> mode, here is my attempt to break your 64bit kernel with
> lowest priority delivery mode.
>
> Think you could try this and tell me if MSI continues to
> work with this patch applied on your 64bit kernel.
>
> All this patch does is override the selection logic in genapic.c
>
> diff --git a/arch/x86_64/kernel/genapic.c b/arch/x86_64/kernel/genapic.c
> index 47496a4..92f4925 100644
> --- a/arch/x86_64/kernel/genapic.c
> +++ b/arch/x86_64/kernel/genapic.c
> @@ -55,6 +55,10 @@ void __init setup_apic_routing(void)
>          else
>                  genapic = &apic_physflat;
>
> +#if 1
> +        genapic = &apic_flat;
> +#endif
> +
>          printk(KERN_INFO "Setting APIC routing to %s\n", genapic->name);
>  }
Unfortunately, my day job has gotten in the way of my netdev hacking hobby.
I'll try your patch as soon as I can, probably in the next couple of nights.
Jay
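For anyone following the delivery-mode discussion, the MSI address/data pairs Eric quoted earlier in the thread (fee01000/40d9 on 64-bit, fee0300c/416a on 32-bit) can be decoded by hand. What follows is a minimal user-space sketch, not kernel code. The names `msi_fields`, `msi_decode`, `msi_delivery_name` and `msi_print` are made up for illustration, and the bit positions assume the standard x86 MSI message layout described in the Intel SDM.

```c
#include <assert.h>
#include <stdint.h>
#include <stdio.h>

/* Decoded view of an x86 MSI message (hypothetical helper type). */
struct msi_fields {
    unsigned dest_id;          /* address bits 19:12, destination APIC ID */
    unsigned redirection_hint; /* address bit 3, set for lowest priority */
    unsigned logical_dest;     /* address bit 2, 1 = logical, 0 = physical */
    unsigned delivery_mode;    /* data bits 10:8, 0 = fixed, 1 = lowest prio */
    unsigned vector;           /* data bits 7:0 */
};

/* Split the low 32 bits of the MSI address and the 16-bit data word. */
struct msi_fields msi_decode(uint32_t addr_lo, uint16_t data)
{
    struct msi_fields f;

    /* x86 MSI writes target the 0xFEExxxxx window. */
    assert((addr_lo >> 20) == 0xFEEu);

    f.dest_id          = (addr_lo >> 12) & 0xFFu;
    f.redirection_hint = (addr_lo >> 3) & 1u;
    f.logical_dest     = (addr_lo >> 2) & 1u;
    f.delivery_mode    = ((unsigned)data >> 8) & 0x7u;
    f.vector           = (unsigned)data & 0xFFu;
    return f;
}

/* Only the two modes discussed in this thread are named. */
const char *msi_delivery_name(unsigned mode)
{
    switch (mode) {
    case 0:  return "fixed";
    case 1:  return "lowest priority";
    default: return "other";
    }
}

/* Print one decoded message in a single line. */
void msi_print(const char *label, uint32_t addr_lo, uint16_t data)
{
    struct msi_fields f = msi_decode(addr_lo, data);

    printf("%s: dest=%u %s delivery=%s vector=0x%02x\n",
           label, f.dest_id,
           f.logical_dest ? "logical" : "physical",
           msi_delivery_name(f.delivery_mode), f.vector);
}
```

Decoding the two pairs this way gives physical destination mode with fixed delivery for the 64-bit values and logical destination mode with lowest priority delivery for the 32-bit values, which agrees with Eric's reading of the lspci output.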
Jay Cliburn <jacliburn@bellsouth.net> writes: > Unfortunately, my day job has gotten in the way of my netdev hacking hobby. I'll > try your patch as soon as I can, probably in the next couple of nights. No problem. Eric On Tue, 15 May 2007 06:28:39 -0600 ebiederm@xmission.com (Eric W. Biederman) wrote: > Think you could try this and tell me if MSI continues to > work with this patch applied on your 64bit kernel. > > All this patch does is override the selection logic in genapic.c > > diff --git a/arch/x86_64/kernel/genapic.c > b/arch/x86_64/kernel/genapic.c index 47496a4..92f4925 100644 > --- a/arch/x86_64/kernel/genapic.c > +++ b/arch/x86_64/kernel/genapic.c > @@ -55,6 +55,10 @@ void __init setup_apic_routing(void) > else > genapic = &apic_physflat; > > +#if 1 > + genapic = &apic_flat; > +#endif > + > printk(KERN_INFO "Setting APIC routing to %s\n", > genapic->name); } I installed Fedora 7 Test 4 x86_64 to a hard disk this evening; (for the past few weeks I've been using a Fedora Live x86_64 distribution with the pci=msi kernel command line option to enable MSI). I cloned Linus' git tree and built a 64-bit current kernel. The resulting kernel spews the apic error when the network driver is started, just like the 32-bit kernel does. This is different from prior behavior (at least for me). I can't explain it. No need to apply your patch Eric. Neither 64-bit nor 32-bit kernels work reliably with MSI on this board, apparently. Jay Jay Cliburn <jacliburn@bellsouth.net> writes: > On Tue, 15 May 2007 06:28:39 -0600 > ebiederm@xmission.com (Eric W. Biederman) wrote: > > >> Think you could try this and tell me if MSI continues to >> work with this patch applied on your 64bit kernel. 
>> >> All this patch does is override the selection logic in genapic.c >> >> diff --git a/arch/x86_64/kernel/genapic.c >> b/arch/x86_64/kernel/genapic.c index 47496a4..92f4925 100644 >> --- a/arch/x86_64/kernel/genapic.c >> +++ b/arch/x86_64/kernel/genapic.c >> @@ -55,6 +55,10 @@ void __init setup_apic_routing(void) >> else >> genapic = &apic_physflat; >> >> +#if 1 >> + genapic = &apic_flat; >> +#endif >> + >> printk(KERN_INFO "Setting APIC routing to %s\n", >> genapic->name); } > > I installed Fedora 7 Test 4 x86_64 to a hard disk this evening; (for the > past few weeks I've been using a Fedora Live x86_64 distribution > with the pci=msi kernel command line option to enable MSI). I cloned > Linus' git tree and built a 64-bit current kernel. The resulting > kernel spews the apic error when the network driver is started, just > like the 32-bit kernel does. This is different from prior behavior (at > least for me). I can't explain it. > > No need to apply your patch Eric. Neither 64-bit nor 32-bit kernels > work reliably with MSI on this board, apparently. Jay can you please try the opposite of my patch. genapic = &apic_physflat. I am still curious to know if the apic mode makes the working versus non-working difference. I am concerned that this may be an Opteron thing and not a chipset thing. I seem to recall Ingo having some problem with Opterons in lowest priority delivery mode. Thanks, Eric Eric W. Biederman wrote:
> Jay can you please try the opposite of my patch.
>
> genapic = &apic_physflat.
>
> I am still curious to know if the apic mode makes the working versus
> non-working difference.
>
> I am concerned that this may be an Opteron thing and not a chipset thing.
> I seem to recall Ingo having some problem with Opterons in lowest
> priority delivery mode.
I'll try and do it tonight. Be advised I'm using Athlons, not Opterons.
Jay Cliburn <jacliburn@bellsouth.net> writes: > I'll try and do it tonight. Be advised I'm using Athlons, not Opterons. Interesting. A dual core Athlon64. Regardless my practical curiosity is if the delivery mode affects how well this works. Eric Eric W. Biederman wrote:
> Jay Cliburn <jacliburn@bellsouth.net> writes:
>
>> I'll try and do it tonight. Be advised I'm using Athlons, not Opterons.
>
> Interesting. A dual core Athlon64.
One socket, dual core. Socket AM2.
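To make concrete what switching between apic_flat and apic_physflat changes for an MSI device, here is a rough user-space sketch of how the message would be composed under each routing choice. It assumes, as the thread describes, that flat routing means logical destination with lowest priority delivery and that physflat means physical destination with fixed delivery. The names `apic_routing`, `msi_msg` and `msi_compose` are hypothetical and this is not the kernel's own compose routine.

```c
#include <assert.h>
#include <stdint.h>

/* The two routing choices discussed in the thread (illustrative enum). */
enum apic_routing { ROUTING_FLAT, ROUTING_PHYSFLAT };

/* Low address word and data word of an MSI message. */
struct msi_msg {
    uint32_t address_lo;
    uint16_t data;
};

/* Build an edge-triggered MSI message for the given routing mode. */
struct msi_msg msi_compose(enum apic_routing routing,
                           unsigned dest_id, unsigned vector)
{
    struct msi_msg msg;
    int lowprio = (routing == ROUTING_FLAT);
    unsigned data = (1u << 14) | vector;   /* level assert + vector */

    assert(dest_id <= 0xFFu && vector <= 0xFFu);

    /* Fixed 0xFEE window plus destination APIC ID in bits 19:12. */
    msg.address_lo = 0xFEE00000u | ((uint32_t)dest_id << 12);

    if (lowprio) {
        /* Redirection hint (bit 3) and logical destination (bit 2). */
        msg.address_lo |= (1u << 3) | (1u << 2);
        /* Delivery mode 001b = lowest priority, in data bits 10:8. */
        data |= (1u << 8);
    }

    msg.data = (uint16_t)data;
    return msg;
}
```

With destination 1 and vector 0xd9 in physflat mode this reproduces the fee01000/40d9 pair from the working 64-bit capture, and with destination 3 and vector 0x6a in flat mode it reproduces the fee0300c/416a pair from the failing 32-bit capture.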
On Wed, 16 May 2007 11:53:24 -0600 ebiederm@xmission.com (Eric W. Biederman) wrote:
> Jay can you please try the opposite of my patch.
>
> genapic = &apic_physflat.
>
> I am still curious to know if the apic mode makes the working versus
> non-working difference.

Applying this patch makes it work -- no apic errors. Grzegorz, can you try it?

I've attached /proc/interrupts and dmesg, too. What next? (By the way Eric, thanks a lot for your help.)

diff --git a/arch/x86_64/kernel/genapic.c b/arch/x86_64/kernel/genapic.c
index 47496a4..82c5340 100644
--- a/arch/x86_64/kernel/genapic.c
+++ b/arch/x86_64/kernel/genapic.c
@@ -54,6 +54,9 @@ void __init setup_apic_routing(void)
                 genapic = &apic_flat;
         else
                 genapic = &apic_physflat;
+#if 1
+        genapic = &apic_physflat;
+#endif

         printk(KERN_INFO "Setting APIC routing to %s\n", genapic->name);
 }

Jay Cliburn <jacliburn@bellsouth.net> writes:
> On Wed, 16 May 2007 11:53:24 -0600
> ebiederm@xmission.com (Eric W. Biederman) wrote:
>
>> Jay can you please try the opposite of my patch.
>>
>> genapic = &apic_physflat.
>>
>> I am still curious to know if the apic mode makes the working versus
>> non-working difference.
>
> Applying this patch makes it work -- no apic errors. Grzegorz, can you
> try it?
>
> I've attached /proc/interrupts and dmesg, too. What next? (By the way
> Eric, thanks a lot for your help.)

Ok. So it looks like we have a problem with lowest priority delivery mode and msi and your chipset.

Your chipset does not have an active hypertransport msi mapping.

So I think it is time to step back and see if we can come up with a reasonably maintainable set of MSI enable quirks that will be usable.

Eric

Reply-To: ninex@NineX.eu.org

Jay Cliburn wrote:
> On Wed, 16 May 2007 11:53:24 -0600
> ebiederm@xmission.com (Eric W. Biederman) wrote:
>
>
>> Jay can you please try the opposite of my patch.
>>
>> genapic = &apic_physflat.
>>
>> I am still curious to know if the apic mode makes the working versus
>> non-working difference.
>> > > Applying this patch makes it work -- no apic errors. Grzegorz, can you > try it? > > sure! :) i will thest this today,when i back home from work... and report results > I've attached /proc/interrupts and dmesg, too. What next? (By the way > Eric, thanks a lot for your help.) > > diff --git a/arch/x86_64/kernel/genapic.c b/arch/x86_64/kernel/genapic.c > index 47496a4..82c5340 100644 > --- a/arch/x86_64/kernel/genapic.c > +++ b/arch/x86_64/kernel/genapic.c > @@ -54,6 +54,9 @@ void __init setup_apic_routing(void) > genapic = &apic_flat; > else > genapic = &apic_physflat; > +#if 1 > + genapic = &apic_physflat; > +#endif > > printk(KERN_INFO "Setting APIC routing to %s\n", genapic->name); > } -- GRZEGORZ {NineX} KRZYSTEK NineX Inc. ninex@ninex.eu.org Kraków, Idzikowskiego 17a tel. +48 602135796 _____________________________________________________________ The information transmitted is intended only for the person or entity to which it is addressed and may contain confidential and/or privileged material. Any review, retransmission, dissemination or other use of, or taking of any action in reliance upon, this information by persons or entities other than the intended recipient is prohibited. If you received this in error, please contact the sender and delete the material from any computer. Reply-To: ninex@NineX.eu.org Jay Cliburn pisze: > On Wed, 16 May 2007 11:53:24 -0600 > ebiederm@xmission.com (Eric W. Biederman) wrote: > > >> Jay can you please try the opposite of my patch. >> >> genapic = &apic_physflat. >> >> I am still curious to know if the apic mode makes the working versus >> non-working difference. >> > > Applying this patch makes it work -- no apic errors. Grzegorz, can you > try it? > > I've attached /proc/interrupts and dmesg, too. What next? (By the way > Eric, thanks a lot for your help.) 
> > diff --git a/arch/x86_64/kernel/genapic.c b/arch/x86_64/kernel/genapic.c > index 47496a4..82c5340 100644 > --- a/arch/x86_64/kernel/genapic.c > +++ b/arch/x86_64/kernel/genapic.c > @@ -54,6 +54,9 @@ void __init setup_apic_routing(void) > genapic = &apic_flat; > else > genapic = &apic_physflat; > +#if 1 > + genapic = &apic_physflat; > +#endif > > printk(KERN_INFO "Setting APIC routing to %s\n", genapic->name); > } this patch don't work with my 2.6.21.1 kernel :( Grzegorz Krzystek <ninex@NineX.eu.org> writes: > Jay Cliburn pisze: >> On Wed, 16 May 2007 11:53:24 -0600 >> ebiederm@xmission.com (Eric W. Biederman) wrote: >> >> >>> Jay can you please try the opposite of my patch. >>> >>> genapic = &apic_physflat. >>> >>> I am still curious to know if the apic mode makes the working versus >>> non-working difference. >>> >> >> Applying this patch makes it work -- no apic errors. Grzegorz, can you >> try it? >> >> I've attached /proc/interrupts and dmesg, too. What next? (By the way >> Eric, thanks a lot for your help.) >> >> diff --git a/arch/x86_64/kernel/genapic.c b/arch/x86_64/kernel/genapic.c >> index 47496a4..82c5340 100644 >> --- a/arch/x86_64/kernel/genapic.c >> +++ b/arch/x86_64/kernel/genapic.c >> @@ -54,6 +54,9 @@ void __init setup_apic_routing(void) >> genapic = &apic_flat; >> else >> genapic = &apic_physflat; >> +#if 1 >> + genapic = &apic_physflat; >> +#endif >> >> printk(KERN_INFO "Setting APIC routing to %s\n", genapic->name); >> } > this patch don't work with my 2.6.21.1 kernel :( It doesn't apply or it doesn't fix your apic errors? Eric Reply-To: ninex@NineX.eu.org bugme-daemon@bugzilla.kernel.org pisze: > http://bugzilla.kernel.org/show_bug.cgi?id=8472 > > > > > > ------- Additional Comments From ebiederm@xmission.com 2007-05-17 15:02 ------- > Grzegorz Krzystek <ninex@NineX.eu.org> writes: > > >> Jay Cliburn pisze: >> >>> On Wed, 16 May 2007 11:53:24 -0600 >>> ebiederm@xmission.com (Eric W. 
Biederman) wrote: >>> >>> >>> >>>> Jay can you please try the opposite of my patch. >>>> >>>> genapic = &apic_physflat. >>>> >>>> I am still curious to know if the apic mode makes the working versus >>>> non-working difference. >>>> >>>> >>> Applying this patch makes it work -- no apic errors. Grzegorz, can you >>> try it? >>> >>> I've attached /proc/interrupts and dmesg, too. What next? (By the way >>> Eric, thanks a lot for your help.) >>> >>> diff --git a/arch/x86_64/kernel/genapic.c b/arch/x86_64/kernel/genapic.c >>> index 47496a4..82c5340 100644 >>> --- a/arch/x86_64/kernel/genapic.c >>> +++ b/arch/x86_64/kernel/genapic.c >>> @@ -54,6 +54,9 @@ void __init setup_apic_routing(void) >>> genapic = &apic_flat; >>> else >>> genapic = &apic_physflat; >>> +#if 1 >>> + genapic = &apic_physflat; >>> +#endif >>> >>> printk(KERN_INFO "Setting APIC routing to %s\n", genapic->name); >>> } >>> >> this patch don't work with my 2.6.21.1 kernel :( >> > > it dosnt aplcy cause there are: print: printk(KERN_INFO "Setting APIC routing to %s\n", genapic->name); } i was fixed patch for this but it dosn't fix APIC problem > It doesn't apply or it doesn't fix your apic errors? > > Eric > > > > ------- You are receiving this mail because: ------- > You reported the bug, or are watching the reporter. > > -- GRZEGORZ {NineX} KRZYSTEK NineX Inc. ninex@ninex.eu.org Kraków, Idzikowskiego 17a tel. +48 602135796 _____________________________________________________________ The information transmitted is intended only for the person or entity to which it is addressed and may contain confidential and/or privileged material. Any review, retransmission, dissemination or other use of, or taking of any action in reliance upon, this information by persons or entities other than the intended recipient is prohibited. If you received this in error, please contact the sender and delete the material from any computer. As I know, L1 does not support MSI, so the initial driver to enable msi is wrong. 
Could someone remove pci_enable_msi and retry?

best regards
xiong

Reply-To: ninex@NineX.eu.org

can you prepare a patch? i'm not a programmer.... but i found this in atl1_main.c:

        err = pci_enable_msi(adapter->pdev);
        if (err) {
                dev_info(&adapter->pdev->dev, "Unable to enable MSI: %d\n", err);
                irq_flags |= IRQF_SHARED;
        }

how should i modify it?

bugme-daemon@bugzilla.kernel.org wrote:
> http://bugzilla.kernel.org/show_bug.cgi?id=8472
>
> ------- Additional Comments From huang.xiong@gmail.com 2007-05-20 06:11 -------
> As I know, L1 does not support MSI, so the initial driver to enable msi is
> wrong.
>
> Could someone remove pci_enable_msi and retry?
>
> best regards
> xiong
>
> ------- You are receiving this mail because: -------
> You reported the bug, or are watching the reporter.

--
GRZEGORZ {NineX} KRZYSTEK

Really sorry, I made a mistake. Today I watched the TLPs with a protocol analyzer, and L1 does support MSI.

best regards
xiong

On Wed, 16 May 2007 18:31:37 -0600 ebiederm@xmission.com (Eric W. Biederman) wrote:
> Ok. So it looks like we have a problem with lowest priority delivery
> mode and msi and your chipset.
>
> Your chipset does not have an active hypertransport msi mapping.
>
> So I think it is time to step back and see if we can come up with
> a reasonably maintainable set of MSI enable quirks that will be usable.

Where are we on this? I need advice from the more experienced kernel developers here.
Recap: - MSI is enabled by default in the atl1 network device driver. - The atl1 driver with MSI enabled results in debilitating APIC errors on the Asus M2V mainboard (VIA K8T890 chipset, VT3351 host bridge/ioapic). - Under vanilla 2.6.22 x86_64, forcing apic routing to physflat seems to fix it for me, but does /not/ fix it for the OP under gentoo 2.6.21 x86_64. - Booting with pci=nomsi provides a workaround for the APIC errors. - The atl1 driver with MSI enabled works fine on Intel chipset mainboards. Should I: (a) sit back, shutup, and wait for a fix from Eric et al.; (b) propose a quirk in drivers/pci/quirks.c that disables MSI altogether when it sees this chipset; (c) remove MSI from the atl1 driver; (d) other? I'd really like to avoid foisting this APIC error stuff on an unsuspecting user base. And this driver is in -stable now. Please. Guide me. I'm pretty new at the netdev maintenance thing. Jay Jay Cliburn <jacliburn@bellsouth.net> writes: > On Wed, 16 May 2007 18:31:37 -0600 > ebiederm@xmission.com (Eric W. Biederman) wrote: > >> Ok. So it looks like we have a problem with lowest priority delivery >> mode and msi and your chipset. >> >> You chipset does not have a active hypertransport msi mapping. >> >> So I think it is time to step back and see if we can come up with >> a reasonably maintainble MSI enable quirks that will be useable. > > Where are we on this? I need advice from the more experienced kernel > developers here. > > Recap: > > - MSI is enabled by default in the atl1 network device driver. > > - The atl1 driver with MSI enabled results in debilitating APIC errors > on the Asus M2V mainboard (VIA K8T890 chipset, VT3351 host > bridge/ioapic). > > - Under vanilla 2.6.22 x86_64, forcing apic routing to physflat seems to > fix it for me, but does /not/ fix it for the OP under gentoo 2.6.21 > x86_64. > > - Booting with pci=nomsi provides a workaround for the APIC errors. 
> > - The atl1 driver with MSI enabled works fine on Intel chipset > mainboards. > > > Should I: > > (a) sit back, shutup, and wait for a fix from Eric et al.; > (b) propose a quirk in drivers/pci/quirks.c that disables MSI > altogether when it sees this chipset; > (c) remove MSI from the atl1 driver; > (d) other? > > I'd really like to avoid foisting this APIC error stuff on an > unsuspecting user base. And this driver is in -stable now. > > Please. Guide me. I'm pretty new at the netdev maintenance thing. Thanks for asking (sorry for not replying sooner), I missed this in the deluge in my inbox. So the practical problem is how do we avoid foisting the APIC error and other non-functioning MSI problems on an unsuspecting user base. I am kind of Mr. MSI because be default more then by choice. So from the information that I have available. We need to figure out how to disable MSI by default on everything at the bus level, and then we can add in quirks to turn MSI on as appropriate. Anything else seems to be just looking for trouble. My hypothesis that there would be an MSI irq mapping capability appears incorrect, so even on hypertransport case it appears that you will have to know the chipset to be able to enable MSI safely. Although if you have a msi mapping capability and if it is enabled that seems to be a reasonable default. So it looks like we need to go down the only enable MSI on known chipsets path to make this safe. Hmm... This looks like a fairly small and simple patch. I will see if I can post a patchset later today. Eric Subject: Re: [Bugme-new] New: atl1 module APIC error when MSI enabled in kernel 2.6.21 On Thu, 24 May 2007 13:48:14 -0600 ebiederm@xmission.com (Eric W. Biederman) wrote: > Jay Cliburn <jacliburn@bellsouth.net> writes: > > Where are we on this? I need advice from the more experienced > > kernel developers here. > > > > Recap: > > > > - MSI is enabled by default in the atl1 network device driver. 
> > > > - The atl1 driver with MSI enabled results in debilitating APIC > > errors on the Asus M2V mainboard (VIA K8T890 chipset, VT3351 host > > bridge/ioapic). > > > > - Under vanilla 2.6.22 x86_64, forcing apic routing to physflat > > seems to fix it for me, but does /not/ fix it for the OP under > > gentoo 2.6.21 x86_64. > > > > - Booting with pci=nomsi provides a workaround for the APIC errors. > > > > - The atl1 driver with MSI enabled works fine on Intel chipset > > mainboards. > > > > > > Should I: > > > > (a) sit back, shutup, and wait for a fix from Eric et al.; > > (b) propose a quirk in drivers/pci/quirks.c that disables MSI > > altogether when it sees this chipset; > > (c) remove MSI from the atl1 driver; > > (d) other? > > > > I'd really like to avoid foisting this APIC error stuff on an > > unsuspecting user base. And this driver is in -stable now. Option (b) was implemented. Please close bugzilla 8472 owing to the following commit: commit 184b812f7da6726d7ea4ca409c7a8762ff6c6df6 Author: Jay Cliburn <jacliburn@bellsouth.net> Date: Sat May 26 17:01:04 2007 -0500 PCI: quirk disable MSI on via vt3351 The Via VT3351 APIC does not play well with MSI and unleashes a flood of APIC errors when MSI is used to deliver interrupts. The problem was recently exposed when the atl1 network device driver, which enables MSI by default, stimulated APIC errors on an Asus M2V mainboard, which employs the Via VT3351. See http://bugzilla.kernel.org/show_bug.cgi?id=8472 for additional details on this bug. Signed-off-by: Jay Cliburn <jacliburn@bellsouth.net> Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de> Jay Alright, thanks. |
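The quirk approach that closed this bug (option (b) above) amounts to a lookup: if a known-bad host bridge is present, MSI is turned off for the whole system. The commit itself lives in the kernel's PCI quirk code and is not reproduced here. The following is only a user-space sketch of the idea, with hypothetical names (`pci_id`, `msi_blacklist`, `msi_allowed`); the VT3351 device ID of 0x0351 is an assumption made for the example, while 0x1106 is VIA's PCI vendor ID.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define VENDOR_ID_VIA     0x1106u
#define DEVICE_ID_VT3351  0x0351u  /* assumed ID, for illustration only */

/* Vendor/device pair identifying a PCI function. */
struct pci_id {
    uint16_t vendor;
    uint16_t device;
};

/* Host bridges on which MSI should be disabled (hypothetical table). */
static const struct pci_id msi_blacklist[] = {
    { VENDOR_ID_VIA, DEVICE_ID_VT3351 },
};

/* Return 0 if MSI must stay off for this host bridge, 1 otherwise. */
int msi_allowed(uint16_t vendor, uint16_t device)
{
    size_t n = sizeof msi_blacklist / sizeof msi_blacklist[0];
    size_t i;

    for (i = 0; i < n; i++) {
        if (msi_blacklist[i].vendor == vendor &&
            msi_blacklist[i].device == device)
            return 0;
    }
    return 1;
}
```

A driver in the position of atl1 would then fall back to a shared legacy interrupt whenever a check like `msi_allowed()` fails, which is the same path the driver already takes when pci_enable_msi returns an error.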