Hi, we're running HP 5730 thin clients with the following Broadcom chipset:

02:00.0 Ethernet controller: Broadcom Corporation NetLink BCM5787M Gigabit Ethernet PCI Express (rev 02)
        Subsystem: Broadcom Corporation Unknown device 9693
        Flags: bus master, fast devsel, latency 0, IRQ 16
        Memory at fdff0000 (64-bit, non-prefetchable) [size=64K]
        Expansion ROM at <ignored> [disabled]
        Capabilities: [48] Power Management version 3
        Capabilities: [50] Vital Product Data
        Capabilities: [58] Vendor Specific Information
        Capabilities: [e8] Message Signalled Interrupts: Mask- 64bit+ Queue=0/0 Enable-
        Capabilities: [d0] Express Endpoint IRQ 0
        Capabilities: [100] Advanced Error Reporting
        Capabilities: [13c] Virtual Channel
        Capabilities: [160] Device Serial Number 67-ab-6f-fe-ff-5a-21-00
        Capabilities: [16c] Power Budgeting

The kernel used in the PXE boot image uses statically compiled-in network drivers. The firmware is statically linked into the kernel image. I'm attaching the .config file.

Our previous thin client kernel was based on 2.6.28.2 and everything works fine with that version. However, after the upgrade to 2.6.30.1 the system freezes during the initialisation of the bnx2 driver.

It also leads to some form of memory corruption affecting the thin client's internal PXE boot code/BIOS memory area: if the faulty kernel has been booted once, the thin client no longer starts the internal PXE boot code, but only offers the choice to press F10 for setup or F12 for network boot and accepts no input. Even shutting the system down with the power button doesn't reset the state; only pulling the power cord makes the system bootable again.

Is this a known issue? What further information do you need?
There's a thread on linux-kernel that mentions problems if the Broadcom firmware is compiled in:

http://lkml.org/lkml/2009/4/27/182

However, in our case reverting 57579f7629a3d46c307405fbd2ea6bdb650d692f didn't fix the error.
Created attachment 23038 [details] .config
I marked this as a regression.
BCM5787M uses the tg3 driver. Please confirm the chip and driver that are having the problem.
What version of the BIOS are you using? The following URL seemed to suggest BIOS problems. http://altirigos.com/vbulletin/imaging-bootworks-rdeploy/9428-t5730-imaging-deploying.html
(In reply to comment #4)
> BCM5787M uses the tg3 driver. Please confirm the chip and driver that are
> having the problem.

You are right, the tg3 driver is used, I just double-checked. The chipset is the following:

02:00.0 Ethernet controller: Broadcom Corporation NetLink BCM5787M Gigabit Ethernet PCI Express (rev 02)
        Subsystem: Broadcom Corporation Unknown device 9693
        Flags: bus master, fast devsel, latency 0, IRQ 16
        Memory at fdff0000 (64-bit, non-prefetchable) [size=64K]
        Expansion ROM at <ignored> [disabled]
        Capabilities: [48] Power Management version 3
        Capabilities: [50] Vital Product Data
        Capabilities: [58] Vendor Specific Information
        Capabilities: [e8] Message Signalled Interrupts: Mask- 64bit+ Queue=0/0 Enable-
        Capabilities: [d0] Express Endpoint IRQ 0
        Capabilities: [100] Advanced Error Reporting
        Capabilities: [13c] Virtual Channel
        Capabilities: [160] Device Serial Number 67-ab-6f-fe-ff-5a-21-00
        Capabilities: [16c] Power Budgeting
(In reply to comment #5)
> What version of the BIOS are you using? The following URL seemed to suggest
> BIOS problems.
>
> http://altirigos.com/vbulletin/imaging-bootworks-rdeploy/9428-t5730-imaging-deploying.html

Are you referring to the PXE boot code? It says "Broadcom UNDI PXE 2.1 v10.0.9". The thin client itself uses a Phoenix BIOS which doesn't print any version number.

I don't think this is related to the BIOS, though. The system worked fine with 2.6.28 and one major change which happened post-2.6.28 was the change to request_firmware(). Since the failure happens so early and leads to memory corruption in the initial boot loader requiring a complete cut-off from the power line, I suspect the bug is coming from there.

(Comment #1 can be discarded since it was patched against the wrong driver.)
(In reply to comment #7)
> (In reply to comment #5)
> > What version of the BIOS are you using? The following URL seemed to suggest
> > BIOS problems.
> >
> > http://altirigos.com/vbulletin/imaging-bootworks-rdeploy/9428-t5730-imaging-deploying.html
>
> Are you referring to the PXE boot code? It says "Broadcom UNDI PXE 2.1 v10.0.9"
> The thin client itself uses a Phoenix BIOS which doesn't print any version
> number.

The URL talks about the BIOS. Knowing the PXE firmware you provided above might be useful though. Is the BIOS revision disclosed in the BIOS setup menus?

> I don't think this is related to the BIOS, though. The system worked fine with
> 2.6.28 and one major change which happened post-2.6.28 was the change to
> request_firmware().

The 5787 does not use firmware though.

> Since the failure happens so early and leads to memory corruption in the
> initial boot loader requiring a complete cut-off from the power line, I
> suspect the bug is coming from there.

This all sounds like bad interactions between the BIOS and the PXE firmware to me. Kernel layouts can change. If PXE is left enabled when the OS boots, it might corrupt system memory like this.
(In reply to comment #8)
> (In reply to comment #7)
> > (In reply to comment #5)
> > > What version of the BIOS are you using? The following URL seemed to suggest
> > > BIOS problems.
> > >
> > > http://altirigos.com/vbulletin/imaging-bootworks-rdeploy/9428-t5730-imaging-deploying.html
> >
> > Are you referring to the PXE boot code? It says "Broadcom UNDI PXE 2.1 v10.0.9"
> > The thin client itself uses a Phoenix BIOS which doesn't print any version
> > number.
>
> The URL talks about the BIOS. Knowing the PXE firmware you provided above
> might be useful though. Is the BIOS revision disclosed in the BIOS setup
> menus?

Unfortunately not. I checked all menus and no version number is displayed. I also attached a CRT screen to catch possible startup messages hidden by the delay the TFT screen requires to wake up, but none are displayed.

> > I don't think this is related to the BIOS, though. The system worked fine with
> > 2.6.28 and one major change which happened post-2.6.28 was the change to
> > request_firmware().
>
> The 5787 does not use firmware though.

Ok, this makes it unlikely, then :-)

> > Since the failure happens so early and leads to memory corruption in the
> > initial boot loader requiring a complete cut-off from the power line, I
> > suspect the bug is coming from there.
>
> This all sounds like bad interactions between the BIOS and the PXE firmware to
> me. Kernel layouts can change. If PXE is left enabled when the OS boots, it
> might corrupt system memory like this.

But the kernel boots for a few seconds; I'm seeing the printks of the first three seconds of the boot. It only locks up during initialisation of the network card. If the problem were caused by the PXE firmware, the lockup should happen earlier, or am I misunderstanding you?
> > The URL talks about the BIOS. Knowing the PXE firmware you provided above
> > might be useful though. Is the BIOS revision disclosed in the BIOS setup
> > menus?
>
> Unfortunately not. I checked all menus and no version number is displayed. I
> also attached a CRT screen to catch possible startup messages hidden by the
> delay the TFT screen requires to wake up, but none are displayed.

Hmmm. O.K. With the booting kernel, can you check /sys/class/dmi/id/bios_version?

> > > Since the failure happens so early and leads to memory corruption in the
> > > initial boot loader requiring a complete cut-off from the power line, I
> > > suspect the bug is coming from there.
> >
> > This all sounds like bad interactions between the BIOS and the PXE firmware to
> > me. Kernel layouts can change. If PXE is left enabled when the OS boots, it
> > might corrupt system memory like this.
>
> But the kernel boots for a few seconds, I'm seeing the printks of the first
> three seconds of the boot. It only locks up during initialisation of the
> network card. If the problem were caused by the PXE firmware, the lockup
> should be earlier, or am I misunderstanding you?

No, we're on the same page. My theory was that if the device is left enabled, rx packets could overwrite critical areas of kernel memory. The system would only be susceptible to corruption right up until the chip was reset by the driver for the first time.

There were four patchsets integrated between these two kernel versions. I don't see any patch that jumps out as a possible culprit. It would help if we could narrow the range of kernels between the working and non-working versions. That'll reduce the number of suspect patches.

In the meantime, I'll see if I can reproduce your problem with a NIC setup.
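For anyone who wants to script the DMI check requested above, here is a minimal sketch in Python. It assumes a Linux system that exposes the DMI identification files under /sys/class/dmi/id; the function name read_dmi and the choice of fields are illustrative and not part of any tool mentioned in this report.

```python
from pathlib import Path

# Standard sysfs location for DMI identification data on Linux.
DMI_DIR = Path("/sys/class/dmi/id")
# Fields chosen for illustration: the BIOS vendor, version and date.
FIELDS = ("bios_vendor", "bios_version", "bios_date")


def read_dmi(base=DMI_DIR, fields=FIELDS):
    """Return {field: value} for the given DMI files; None if unreadable."""
    info = {}
    for name in fields:
        try:
            info[name] = (Path(base) / name).read_text().strip()
        except OSError:
            # File missing or not readable (e.g. no DMI support).
            info[name] = None
    return info


if __name__ == "__main__":
    for key, value in read_dmi().items():
        print(f"{key}: {value if value is not None else '(not available)'}")
```

The base directory is a parameter only so the function can be pointed at a copy of the files taken from another machine.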
(In reply to comment #10)
> > > The URL talks about the BIOS. Knowing the PXE firmware you provided above
> > > might be useful though. Is the BIOS revision disclosed in the BIOS setup
> > > menus?
> >
> > Unfortunately not. I checked all menus and no version number is displayed. I
> > also attached a CRT screen to catch possible startup messages hidden by the
> > delay the TFT screen requires to wake up, but none are displayed.
>
> Hmmm. O.K. With the booting kernel, can you check :
>
> /sys/class/dmi/id/bios_version?

The BIOS version stored in DMI is 786R4 v1.03, the vendor is Phoenix Technologies and the BIOS date is 3/10/2008.

> > But the kernel boots for a few seconds, I'm seeing the printks of the first
> > three seconds of the boot. It only locks up during initialisation of the
> > network card. If the problem were caused by the PXE firmware, the lockup
> > should be earlier, or am I misunderstanding you?
>
> No, we're on the same page. My theory was that if the device is left enabled,
> rx packets could overwrite critical areas of kernel memory. The system would
> only be susceptible to corruption right up until the chip was reset by the
> driver for the first time.
>
> There were four patchsets integrated between these two kernel versions. I
> don't see any patch that jumps out as a possible culprit. It would help if we
> could narrow the range of kernels between the working and non-working versions.
> That'll reduce the number of suspect patches.

A git bisect is not really possible, since each build takes quite some time and needs additional changes to roll out the images. I could try 2.6.29 if that helps to get a better picture of the problem.
> > Hmmm. O.K. With the booting kernel, can you check :
> >
> > /sys/class/dmi/id/bios_version?
>
> The BIOS version stored in DMI is 786R4 v1.03, the vendor is Phoenix
> Technologies and the BIOS date is 3/10/2008

O.K. The website says that is the version that fixed the problem presented in that thread. We look elsewhere then.

> > > But the kernel boots for a few seconds, I'm seeing the printks of the first
> > > three seconds of the boot. It only locks up during initialisation of the
> > > network card. If the problem were caused by the PXE firmware, the lockup
> > > should be earlier, or am I misunderstanding you?
> >
> > No, we're on the same page. My theory was that if the device is left enabled,
> > rx packets could overwrite critical areas of kernel memory. The system would
> > only be susceptible to corruption right up until the chip was reset by the
> > driver for the first time.
> >
> > There were four patchsets integrated between these two kernel versions. I
> > don't see any patch that jumps out as a possible culprit. It would help if we
> > could narrow the range of kernels between the working and non-working versions.
> > That'll reduce the number of suspect patches.
>
> A git bisect is not really possible, since each build takes quite some time and
> needs additional changes to roll out the images. I could try 2.6.29 if that
> helps to get a better picture of the problem.

I had it in my mind that we could do a limited bisection by targeting the bounds of the four patch submissions. It looks like most of the changes went in from 2.6.28 => 2.6.29 though, so using released kernels as our bisection points won't be as useful as I originally thought. Knowing whether or not 2.6.29 works does eliminate some patches as culprits though, so that would be useful.
(In reply to comment #12)
> > A git bisect is not really possible, since each build takes quite some time and
> > needs additional changes to roll out the images. I could try 2.6.29 if that
> > helps to get a better picture of the problem.
>
> I had it in my mind that we could do a limited bisection by targeting the
> bounds of the four patch submissions. It looks like most of the changes went
> in from 2.6.28 => 2.6.29 though, so using released kernels as our bisection
> points won't be as useful as I originally thought. Knowing whether or not
> 2.6.29 works does eliminate some patches as culprits though, so that would be
> useful.

I compiled a 2.6.29 kernel (with the .config from the previous 2.6.28.2 kernel; for all the new Kconfig options I just took the defaults) and it boots fine.
I also tested a 2.6.31 kernel; it fails just like 2.6.30.

I'm out of office until the 29th, so I won't be able to follow up or test until then. I don't know yet if a colleague of mine will pick up the tests in the meantime.
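The release-level narrowing done so far can be written down as a small sketch in Python. It only summarises the four results reported in this thread; the helper names version_key and suspect_range are illustrative, and the sketch assumes the usual bisection premise that every kernel after the first bad one is also bad.

```python
def version_key(version):
    """Turn '2.6.30.1' into (2, 6, 30, 1) so versions sort numerically."""
    return tuple(int(part) for part in version.split("."))


def suspect_range(results):
    """results maps version -> True (boots) or False (hangs).

    Returns (newest_good, oldest_bad), the range that still needs bisecting.
    """
    ordered = sorted(results, key=version_key)
    good = [v for v in ordered if results[v]]
    bad = [v for v in ordered if not results[v]]
    if not good or not bad:
        raise ValueError("need at least one good and one bad version")
    newest_good, oldest_bad = good[-1], bad[0]
    if version_key(newest_good) > version_key(oldest_bad):
        # A good kernel newer than a bad one breaks the bisection premise.
        raise ValueError("results are not monotonic")
    return newest_good, oldest_bad


# Results reported in this thread.
results = {
    "2.6.28.2": True,
    "2.6.29": True,
    "2.6.30.1": False,
    "2.6.31": False,
}
print(suspect_range(results))  # ('2.6.29', '2.6.30.1')
```

With these results the change that triggers the hang lies somewhere after 2.6.29 and no later than 2.6.30.1.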
> > I had it in my mind that we could do a limited bisection by targeting the
> > bounds of the four patch submissions. It looks like most of the changes went
> > in from 2.6.28 => 2.6.29 though, so using released kernels as our bisection
> > points won't be as useful as I originally thought. Knowing whether or not
> > 2.6.29 works does eliminate some patches as culprits though, so that would be
> > useful.
>
> I compiled a 2.6.29 kernel (with the .config from the previous 2.6.28.2 kernel,
> for all the new Kconfig options I just took the defaults) and it boots fine.

Thanks for the info. This removes a large swath of patches as causes.

I'm just curious. Does turning off ASPM help (pcie_aspm=off on the kernel command line)?
(In reply to comment #15)
> I'm just curious. Does turning off ASPM help (pcie_aspm=off on the kernel
> command line)?

No, it doesn't help. These are the last lines printed during boot:

tg3.c:v3.98 (February 25, 2009)
tg3 0000:02:00.0: PCI INT A -> GSI 0 (level, low) -> IRQ 0
Wait. IRQ 0? That doesn't sound right. Andrew, I'm not sure who to contact about this. Can you CC someone qualified to look more deeply into the interrupt assignment?
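A hedged sketch in Python of how such an assignment could be spotted automatically in a boot log. The regular expression is written against the exact message format quoted in this thread and may not match other kernel versions. The function name suspicious_irq_lines is illustrative. Treating IRQ 0 as suspicious assumes an x86 PC, where IRQ 0 normally belongs to the system timer rather than to a PCI device.

```python
import re

# Matches messages of the form quoted in this thread, e.g.:
#   tg3 0000:02:00.0: PCI INT A -> GSI 0 (level, low) -> IRQ 0
PCI_INT_RE = re.compile(
    r"(?P<driver>\S+) "
    r"(?P<bdf>[0-9a-f]{4}:[0-9a-f]{2}:[0-9a-f]{2}\.[0-7]): "
    r"PCI INT (?P<pin>[A-D]) -> GSI (?P<gsi>\d+) "
    r"\((?P<trigger>\w+), (?P<polarity>\w+)\) -> IRQ (?P<irq>\d+)"
)


def suspicious_irq_lines(log_text):
    """Return (driver, pci_address, gsi) for PCI devices routed to IRQ 0."""
    hits = []
    for line in log_text.splitlines():
        match = PCI_INT_RE.search(line)
        if match and int(match.group("irq")) == 0:
            hits.append(
                (match.group("driver"), match.group("bdf"), int(match.group("gsi")))
            )
    return hits


log = """tg3.c:v3.98 (February 25, 2009)
tg3 0000:02:00.0: PCI INT A -> GSI 0 (level, low) -> IRQ 0"""
print(suspicious_irq_lines(log))  # [('tg3', '0000:02:00.0', 0)]
```

For comparison, the lspci output at the top of this report shows the same device with IRQ 16.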
(In reply to comment #17)
> Wait. IRQ 0? That doesn't sound right.

Yes, after setting the kernel parameter "acpi=noirq" the kernel boots and the network card is available. But in this case I can't use the USB devices.
I reassigned this to acpi/config-interrupts.
Will you please attach the output of acpidump/lspci -vxxx?

Please also attach the output of dmesg by adding the following boot options.
a. acpi=noirq
b. noapic

Thanks.
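When collecting one dmesg per boot option, it can help to record which options were actually in effect for each boot. A minimal sketch in Python, assuming a Linux system where the active command line is readable from /proc/cmdline; the helper names are illustrative, and the parser deliberately ignores quoted values containing spaces.

```python
from pathlib import Path


def boot_options(cmdline):
    """Split a kernel command line into {name: value}; flags map to None."""
    options = {}
    for token in cmdline.split():
        name, sep, value = token.partition("=")
        options[name] = value if sep else None
    return options


def read_cmdline(path="/proc/cmdline"):
    """Return the command line the running kernel was booted with."""
    return Path(path).read_text().strip()


# Example with the options discussed in this report (hypothetical command line).
opts = boot_options("acpi=noirq noapic pcie_aspm=off")
print(opts.get("acpi"))   # noirq
print("noapic" in opts)   # True
```

Calling boot_options(read_cmdline()) on the thin client and saving the result next to each dmesg file keeps the logs and their boot options together.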
Created attachment 23171 [details] lspci-vvv.txt
Created attachment 23172 [details] dmesg-acpi_noirq.txt
(In reply to comment #20)
> Will you please attach the output of acpidump/lspci -vxxx?

lspci-vvv.txt

> Please also attach the output of dmesg by adding the following boot options.
> a. acpi=noirq

dmesg-acpi_noirq.txt

> b. noapic

With the noapic parameter the boot stops with:

tg3.c:v3.98 (February 25, 2009)
tg3 0000:02:00.0: PCI INT A -> GSI 0 (level, low) -> IRQ 0

(In reply to comment #18)
> Yes, after setting the kernel parameter "acpi=noirq" the kernel boots and the
> network card is available. But in this case I can't use the usb devices.

That was my fault; the kernel was built without USB HID. Sorry for the confusion.
Hi, Stefan
Please attach the output of acpidump and lspci -vxxx.
Thanks.
Created attachment 23201 [details] lspci-vxxx.txt
Hi, Stefan
Sorry for the late response.
Will you please confirm whether the BCM5787M driver works if the boot option "acpi=noirq" is added?
Will you please also attach the output of acpidump?
Thanks.
ping ...
(In reply to comment #26)
> Hi, Stefan
> Sorry for the late response.
> Will you please confirm whether the BCM5787M driver can work if the boot
> option of "acpi=noirq" is added?
> Will you please also attach the output of acpidump?
> thanks.

In the meantime we've changed some kernel options for our thin client kernel and with the current setup the IRQ problem no longer occurs, i.e. we cannot reproduce the problem any longer.

I've attached the acpidump output anyway. Maybe it helps to track down the original problem.
Created attachment 24017 [details] acpidump for HP thin client 5735