14145 – Regression with Broadcom BCM5787M driver in 2.6.30/2.6.31

Summary: Regression with Broadcom BCM5787M driver in 2.6.30/2.6.31

Status:	REJECTED UNREPRODUCIBLE

Alias:	None

Product:	ACPI
Classification:	Unclassified
Component:	Config-Interrupts
Hardware:	All Linux

Importance:	P1 normal
Assignee:	acpi_config-interrupts

URL:
Keywords:

Depends on:
Blocks:

Reported:	2009-09-08 11:55 UTC by Muehlenhoff
Modified:	2009-12-07 05:31 UTC
CC List:	6 users

See Also:
Kernel Version:
Subsystem:
Regression:	Yes
Bisected commit-id:

Attachments
.config (52.92 KB, application/octet-stream) 2009-09-08 13:08 UTC, Muehlenhoff
lspci-vvv.txt (12.37 KB, text/plain) 2009-09-24 12:28 UTC, Stefan Gohmann
dmesg-acpi_noirq.txt (26.60 KB, text/plain) 2009-09-24 12:29 UTC, Stefan Gohmann
lspci-vxxx.txt (22.20 KB, text/plain) 2009-09-29 06:47 UTC, Stefan Gohmann
acpidump for HP thin client 5735 (79.51 KB, application/octet-stream) 2009-12-04 10:59 UTC, Muehlenhoff

Description Muehlenhoff 2009-09-08 11:55:18 UTC

Hi,
we're running HP 5730 thin clients with the following Broadcom chipset:

02:00.0 Ethernet controller: Broadcom Corporation NetLink BCM5787M Gigabit Ethernet PCI Express (rev 02)
        Subsystem: Broadcom Corporation Unknown device 9693
        Flags: bus master, fast devsel, latency 0, IRQ 16
        Memory at fdff0000 (64-bit, non-prefetchable) [size=64K]
        Expansion ROM at <ignored> [disabled]
        Capabilities: [48] Power Management version 3
        Capabilities: [50] Vital Product Data
        Capabilities: [58] Vendor Specific Information
        Capabilities: [e8] Message Signalled Interrupts: Mask- 64bit+ Queue=0/0 Enable-
        Capabilities: [d0] Express Endpoint IRQ 0
        Capabilities: [100] Advanced Error Reporting
        Capabilities: [13c] Virtual Channel
        Capabilities: [160] Device Serial Number 67-ab-6f-fe-ff-5a-21-00
        Capabilities: [16c] Power Budgeting

The kernel used in the PXE boot image uses statically compiled-in network drivers. The firmware is statically linked into the kernel image. I'm attaching the .config file.

Our previous thin client kernel was based on 2.6.28.2 and everything works fine with that version. However, after the upgrade to 2.6.30.1 the system freezes during the initialisation of the bnx2 driver. 
It also leads to some form of memory corruption affecting the thin client's internal PXE boot code/BIOS memory area: 
If the faulty kernel has been booted once, the thin client no longer starts the internal PXE boot code, but only offers the choice to press F10 for setup or F12 for network boot and accepts no input. Even shutting the system down with the power button doesn't reset the state; only pulling the power cord reverts the system to being able to boot again.

Is this a known issue? What further information do you need?

Comment 1 Muehlenhoff 2009-09-08 11:58:26 UTC

There's a thread on linux-kernel, which mentions problems if the Broadcom firmware is compiled-in: http://lkml.org/lkml/2009/4/27/182

However in our case reverting 57579f7629a3d46c307405fbd2ea6bdb650d692f didn't fix the error.

Comment 2 Muehlenhoff 2009-09-08 13:08:49 UTC

Created attachment 23038 [details]
.config

Comment 3 Andrew Morton 2009-09-11 19:53:30 UTC

I marked this as a regression.

Comment 4 Michael Chan 2009-09-11 20:03:31 UTC

BCM5787M uses the tg3 driver.  Please confirm the chip and driver that are having the problem.

Comment 5 Matt Carlson 2009-09-11 21:09:11 UTC

What version of the BIOS are you using?  The following URL seemed to suggest BIOS problems.

http://altirigos.com/vbulletin/imaging-bootworks-rdeploy/9428-t5730-imaging-deploying.html

Comment 6 Muehlenhoff 2009-09-14 12:32:31 UTC

(In reply to comment #4)
> BCM5787M uses the tg3 driver.  Please confirm the chip and driver that are
> having the problem.

You are right, the tg3 driver is used, I just double-checked.

The chipset is the following:

02:00.0 Ethernet controller: Broadcom Corporation NetLink BCM5787M Gigabit Ethernet PCI Express (rev 02)
        Subsystem: Broadcom Corporation Unknown device 9693
        Flags: bus master, fast devsel, latency 0, IRQ 16
        Memory at fdff0000 (64-bit, non-prefetchable) [size=64K]
        Expansion ROM at <ignored> [disabled]
        Capabilities: [48] Power Management version 3
        Capabilities: [50] Vital Product Data
        Capabilities: [58] Vendor Specific Information
        Capabilities: [e8] Message Signalled Interrupts: Mask- 64bit+ Queue=0/0 Enable-
        Capabilities: [d0] Express Endpoint IRQ 0
        Capabilities: [100] Advanced Error Reporting
        Capabilities: [13c] Virtual Channel
        Capabilities: [160] Device Serial Number 67-ab-6f-fe-ff-5a-21-00
        Capabilities: [16c] Power Budgeting

Comment 7 Muehlenhoff 2009-09-14 12:38:14 UTC

(In reply to comment #5)
> What version of the BIOS are you using?  The following URL seemed to suggest
> BIOS problems.
> 
> http://altirigos.com/vbulletin/imaging-bootworks-rdeploy/9428-t5730-imaging-deploying.html

Are you referring to the PXE boot code? It says "Broadcom UNDI PXE 2.1 v10.0.9"
The thin client itself uses a Phoenix BIOS which doesn't print any version number.

I don't think this is related to the BIOS, though. The system worked fine with 2.6.28 and one major change which happened post-2.6.28 was the change to request_firmware(). Since the failure happens so early and leads to memory corruption in the initial boot loader requiring a complete cut-off from the power line, I suspect the bug is coming from there.

(Comment #1 can be discarded since it was patched against the wrong driver.)

Comment 8 Matt Carlson 2009-09-14 16:33:08 UTC

(In reply to comment #7)
> (In reply to comment #5)
> > What version of the BIOS are you using?  The following URL seemed to suggest
> > BIOS problems.
> > 
> > http://altirigos.com/vbulletin/imaging-bootworks-rdeploy/9428-t5730-imaging-deploying.html
> 
> Are you referring to the PXE boot code? It says "Broadcom UNDI PXE 2.1 v10.0.9"
> The thin client itself uses a Phoenix BIOS which doesn't print any version
> number.

The URL talks about the BIOS.  Knowing the PXE firmware you provided above might be useful though.  Is the BIOS revision disclosed in the BIOS setup menus?

> I don't think this is related to the BIOS, though. The system worked fine with
> 2.6.28 and one major change which happened post-2.6.28 was the change to
> request_firmware().

The 5787 does not use firmware though.

> Since the failure happens so early and leads to memory corruption in the
> initial boot loader requiring a complete cut-off from the power line, I
> suspect the bug is coming from there.

This all sounds like bad interactions between the BIOS and the PXE firmware to me.  Kernel layouts can change.  If PXE is left enabled when the OS boots, it might corrupt system memory like this.

Comment 9 Muehlenhoff 2009-09-15 07:23:40 UTC

(In reply to comment #8)
> (In reply to comment #7)
> > (In reply to comment #5)
> > > What version of the BIOS are you using?  The following URL seemed to suggest
> > > BIOS problems.
> > > 
> > > http://altirigos.com/vbulletin/imaging-bootworks-rdeploy/9428-t5730-imaging-deploying.html
> > 
> > Are you referring to the PXE boot code? It says "Broadcom UNDI PXE 2.1 v10.0.9"
> > The thin client itself uses a Phoenix BIOS which doesn't print any version
> > number.
> 
> The URL talks about the BIOS.  Knowing the PXE firmware you provided above
> might be useful though.  Is the BIOS revision disclosed in the BIOS setup
> menus?

Unfortunately not. I checked all menus and no version number is displayed. I also attached a CRT screen to catch possible startup messages hidden by the delay the TFT screen requires to wake up, but none are displayed.
 
> > I don't think this is related to the BIOS, though. The system worked fine with
> > 2.6.28 and one major change which happened post-2.6.28 was the change to
> > request_firmware().
> 
> The 5787 does not use firmware though.

Ok, this makes it unlikely, then :-)
 
> > Since the failure happens so early and leads to memory corruption in the
> > initial boot loader requiring a complete cut-off from the power line, I
> > suspect the bug is coming from there.
> 
> This all sounds like bad interactions between the BIOS and the PXE firmware to
> me.  Kernel layouts can change.  If PXE is left enabled when the OS boots, it
> might corrupt system memory like this.

But the kernel boots for a few seconds, I'm seeing the printks of the first three seconds of the boot. It only locks up during initialisation of the network card. If the problem would be caused by the PXE firmware, the lockup should be earlier, or am I misunderstanding you?

Comment 10 Matt Carlson 2009-09-15 17:57:34 UTC

> > The URL talks about the BIOS.  Knowing the PXE firmware you provided above
> > might be useful though.  Is the BIOS revision disclosed in the BIOS setup
> > menus?
> 
> Unfortunately not. I checked all menus and no version number is displayed. I
> also attached a CRT screen to catch possible startup messages hidden by the
> delay the TFT screen requires to wake up, but none are displayed.

Hmmm.  O.K.  With the booting kernel, can you check :

/sys/class/dmi/id/bios_version?

> > > Since the failure happens so early and leads to memory corruption in the
> > > initial boot loader requiring a complete cut-off from the power line, I
> > > suspect the bug is coming from there.
> > 
> > This all sounds like bad interactions between the BIOS and the PXE firmware to
> > me.  Kernel layouts can change.  If PXE is left enabled when the OS boots, it
> > might corrupt system memory like this.
> 
> But the kernel boots for a few seconds, I'm seeing the printks of the first
> three seconds of the boot. It only locks up during initialisation of the
> network card. If the problem would be caused by the PXE firmware, the lockup
> should be earlier, or am I misunderstanding you?

No, we're on the same page.  My theory was that if the device is left enabled, rx packets could overwrite critical areas of kernel memory.  The system would only be susceptible to corruption right up until the chip was reset by the driver for the first time.

There were four patchsets integrated between these two kernel versions.  I don't see any patch that jumps out as a possible culprit.  It would help if we could narrow the range of kernels between the working and non-working versions.  That'll reduce the number of suspect patches.

In the meantime, I'll see if I can reproduce your problem with a NIC setup.

Comment 11 Muehlenhoff 2009-09-16 11:29:07 UTC

(In reply to comment #10)
> > > The URL talks about the BIOS.  Knowing the PXE firmware you provided above
> > > might be useful though.  Is the BIOS revision disclosed in the BIOS setup
> > > menus?
> > 
> > Unfortunately not. I checked all menus and no version number is displayed. I
> > also attached a CRT screen to catch possible startup messages hidden by the
> > delay the TFT screen requires to wake up, but none are displayed.
> 
> Hmmm.  O.K.  With the booting kernel, can you check :
> 
> /sys/class/dmi/id/bios_version?

The BIOS version stored in DMI is 786R4 v1.03, the vendor is Phoenix Technologies and the BIOS date is 3/10/2008
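
For anyone repeating this check, a minimal sketch of reading the BIOS fields from sysfs on a running kernel follows. The function name print_bios_info and its optional directory argument are illustrative only (the argument exists so the function can be pointed at a test directory); it assumes the kernel exports DMI data under /sys/class/dmi/id.

```sh
#!/bin/sh
# Print the BIOS identification fields exported by the kernel's DMI support.
# $1 (optional, hypothetical): directory to read from; defaults to the
# sysfs location mentioned in comment 10.
print_bios_info() {
    dmi_dir="${1:-/sys/class/dmi/id}"
    for field in bios_vendor bios_version bios_date; do
        if [ -r "$dmi_dir/$field" ]; then
            printf '%s: %s\n' "$field" "$(cat "$dmi_dir/$field")"
        else
            # Field missing, e.g. kernel built without DMI sysfs support
            printf '%s: <not available>\n' "$field"
        fi
    done
}

# Usage on the thin client: print_bios_info
```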
 
> > But the kernel boots for a few seconds, I'm seeing the printks of the first
> > three seconds of the boot. It only locks up during initialisation of the
> > network card. If the problem would be caused by the PXE firmware, the
> lockup
> > should be earlier, or am I misunderstanding you?
> 
> No, we're on the same page.  My theory was that if the device is left
> enabled,
> rx packets could overwrite critical areas of kernel memory.  The system would
> only be susceptible to corruption right up until the chip was reset by the
> driver for the first time.
> 
> There were four patchsets integrated between these two kernel versions.  I
> don't see any patch that jumps out as a possible culprit.  It would help if
> we
> could narrow the range of kernels between the working and non-working
> versions.
>  That'll reduce the number of suspect patches.

A git bisect is not really possible, since each build takes quite some time and needs additional changes to roll out the images. I could try 2.6.29 if that helps to get a better picture of the problem.

Comment 12 Matt Carlson 2009-09-16 17:04:27 UTC

> > Hmmm.  O.K.  With the booting kernel, can you check :
> > 
> > /sys/class/dmi/id/bios_version?
> 
> The BIOS version stored in DMI is 786R4 v1.03, the vendor is Phoenix
> Technologies and the BIOS date is 3/10/2008

O.K.  The website says that is the version that fixed the problem presented in that thread.  We look elsewhere then.

> > > But the kernel boots for a few seconds, I'm seeing the printks of the first
> > > three seconds of the boot. It only locks up during initialisation of the
> > > network card. If the problem would be caused by the PXE firmware, the lockup
> > > should be earlier, or am I misunderstanding you?
> > 
> > No, we're on the same page.  My theory was that if the device is left enabled,
> > rx packets could overwrite critical areas of kernel memory.  The system would
> > only be susceptible to corruption right up until the chip was reset by the
> > driver for the first time.
> > 
> > There were four patchsets integrated between these two kernel versions.  I
> > don't see any patch that jumps out as a possible culprit.  It would help if we
> > could narrow the range of kernels between the working and non-working versions.
> > That'll reduce the number of suspect patches.
> 
> A git bisect is not really possible, since each build takes quite some time and
> needs additional changes to roll out the images. I could try 2.6.29 if that
> helps to get a better picture of the problem.

I had it in my mind that we could do a limited bisection by targeting the bounds of the four patch submissions.  It looks like most of the changes went in from 2.6.28 => 2.6.29 though, so using released kernels as our bisection points won't be as useful as I originally thought.  Knowing whether or not 2.6.29 works does eliminate some patches as culprits though, so that would be useful.

Comment 13 Muehlenhoff 2009-09-17 11:24:09 UTC

(In reply to comment #12)

> > A git bisect is not really possible, since each build takes quite some time and
> > needs additional changes to roll out the images. I could try 2.6.29 if that
> > helps to get a better picture of the problem.
> 
> I had it in my mind that we could do a limited bisection by targeting the
> bounds of the four patch submissions.  It looks like most of the changes went
> in from 2.6.28 => 2.6.29 though, so using released kernels as our bisection
> points won't be as useful as I originally thought.  Knowing whether or not
> 2.6.29 works does eliminate some patches as culprits though, so that would be
> useful.

I compiled a 2.6.29 kernel (with the .config from the previous 2.6.28.2 kernel, for all the new Kconfig options I just took the defaults) and it boots fine.

Comment 14 Muehlenhoff 2009-09-17 14:22:27 UTC

I also tested a 2.6.31 kernel, it fails just like 2.6.30.

I'm out of office until the 29th, so I won't be able to follow up/test until then. I don't know yet if a colleague of mine will pick up the tests in the meantime.

Comment 15 Matt Carlson 2009-09-17 19:56:40 UTC

> > I had it in my mind that we could do a limited bisection by targeting the
> > bounds of the four patch submissions.  It looks like most of the changes went
> > in from 2.6.28 => 2.6.29 though, so using released kernels as our bisection
> > points won't be as useful as I originally thought.  Knowing whether or not
> > 2.6.29 works does eliminate some patches as culprits though, so that would be
> > useful.
> 
> I compiled a 2.6.29 kernel (with the .config from the previous 2.6.28.2 kernel,
> for all the new Kconfig options I just took the defaults) and it boots fine.

Thanks for the info.  This removes a large swath of patches as causes.

I'm just curious.  Does turning off ASPM help (pcie_aspm=off on the kernel command line)?
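
When testing a boot option like this over PXE, it may be worth confirming after boot that the parameter actually reached the kernel. A small sketch, assuming /proc/cmdline is available; the helper name cmdline_has is made up for illustration, and the optional second argument exists only so the check can be run against a sample string.

```sh
#!/bin/sh
# Return 0 if the kernel command line contains the exact parameter $1.
# $2 (optional, hypothetical): command line text to inspect instead of
# the running kernel's /proc/cmdline.
cmdline_has() {
    param="$1"
    cmdline="${2:-$(cat /proc/cmdline)}"
    # Pad with spaces so only whole, space-separated words match
    case " $cmdline " in
        *" $param "*) return 0 ;;
        *) return 1 ;;
    esac
}

# Usage: cmdline_has pcie_aspm=off && echo "ASPM disabled on command line"
```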

Comment 16 Stefan Gohmann 2009-09-18 05:18:06 UTC

(In reply to comment #15)
> I'm just curious.  Does turning off ASPM help (pcie_aspm=off on the kernel
> command line)?

No, it doesn't work.

These are the latest lines during the boot:

tg3.c:v3.98 (February 25, 2009)
tg3 0000:02:00.0: PCI INT A -> GSI 0 (level, low) -> IRQ 0

Comment 17 Matt Carlson 2009-09-18 17:23:29 UTC

Wait.  IRQ 0?  That doesn't sound right.

Andrew, I'm not sure who to contact about this.  Can you CC someone qualified to look more deeply into the interrupt assignment?
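
To spot this kind of routing in a saved boot log, something like the following could be used. It is only a sketch: find_irq0_routes is a hypothetical helper, and the pattern is modelled on the single message format quoted in comment 16. On x86, IRQ 0 is normally the system timer, which is why a PCI device landing there looks wrong.

```sh
#!/bin/sh
# Read kernel log text on stdin and print every PCI interrupt routing
# message that ends up on IRQ 0. Exit status is grep's: 0 if at least
# one such line was found, 1 otherwise.
find_irq0_routes() {
    grep -E 'PCI INT [A-D] -> .*-> IRQ 0[[:space:]]*$'
}

# Usage: dmesg | find_irq0_routes
```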

Comment 18 Stefan Gohmann 2009-09-21 06:40:14 UTC

(In reply to comment #17)
> Wait.  IRQ 0?  That doesn't sound right.

Yes, after setting the kernel parameter "acpi=noirq" the kernel boots and the network card is available. But in this case I can't use the usb devices.

Comment 19 Andrew Morton 2009-09-21 19:48:47 UTC

I reassigned this to acpi/config-interrupts.

Comment 20 ykzhao 2009-09-24 03:39:30 UTC

Will you please attach the output of acpidump/lspci -vxxx?

Please also attach the output of dmesg by adding the following boot options.
   a. acpi=noirq
   b. noapic

Thanks.

Comment 21 Stefan Gohmann 2009-09-24 12:28:57 UTC

Created attachment 23171 [details]
lspci-vvv.txt

Comment 22 Stefan Gohmann 2009-09-24 12:29:25 UTC

Created attachment 23172 [details]
dmesg-acpi_noirq.txt

Comment 23 Stefan Gohmann 2009-09-24 12:33:13 UTC

(In reply to comment #20)
> Will you please attach the output of acpidump/lspci -vxxx?

lspci-vvv.txt

> Please also attach the output of dmesg by adding the following boot options.
>    a. acpi=noirq

dmesg-acpi_noirq.txt

>    b. noapic

With the noapic parameter the boot will stop with:

tg3.c:v3.98 (February 25, 2009)
tg3 0000:02:00.0: PCI INT A -> GSI 0 (level, low) -> IRQ 0


(In reply to comment #18)
> Yes, after setting the kernel parameter "acpi=noirq" the kernel boots and the
> network card is available. But in this case I can't use the usb devices.

That was my fault, the kernel was built without USB HID. Sorry for the confusion.

Comment 24 ykzhao 2009-09-29 06:33:25 UTC

Hi, Stefan
   Please attach the output of acpidump, lspci -vxxx.
   Thanks.

Comment 25 Stefan Gohmann 2009-09-29 06:47:28 UTC

Created attachment 23201 [details]
lspci-vxxx.txt

Comment 26 ykzhao 2009-11-18 09:35:41 UTC

Hi, Stefan
   Sorry for the late response.
   Will you please confirm whether the BCM5787M driver can work if the boot option of "acpi=noirq" is added?
   Will you please also attach the output of acpidump?
   thanks.

Comment 27 Zhang Rui 2009-12-04 03:48:29 UTC

ping ...

Comment 28 Muehlenhoff 2009-12-04 10:57:54 UTC

(In reply to comment #26)
> Hi, Stefan
>    Sorry for the late response.
>    Will you please confirm whether the BCM5787M driver can work if the boot
> option of "acpi=noirq" is added?
>    Will you please also attach the output of acpidump?
>    thanks.

In the meantime we've changed some kernel options for our thin client kernel and with the current setup the IRQ problem no longer occurs, i.e. we cannot reproduce the problem any longer.

I've attached the acpidump output anyway. Maybe it helps to track down the original problem.

Comment 29 Muehlenhoff 2009-12-04 10:59:03 UTC

Created attachment 24017 [details]
acpidump for HP thin client 5735
