Bug 15822 - Host bridge reported broken by .34-rc5 kernel
Product: Drivers
Classification: Unclassified
Component: PCI
Hardware: All Linux
Importance: P1 normal
Assigned To: Bjorn Helgaas
Reported: 2010-04-20 21:21 UTC by Sten Heinze
Modified: 2010-10-05 20:03 UTC

Kernel Version: 2.6.34-rc5
Tree: Mainline
Regression: Yes

dmesg output from 2.6.34-rc5 (37.95 KB, application/octet-stream)
2010-04-20 21:21 UTC, Sten Heinze
debug patch (1.46 KB, patch)
2010-04-21 17:49 UTC, Bjorn Helgaas
patch to revert warning (1.96 KB, patch)
2010-04-22 15:11 UTC, Bjorn Helgaas

Description Sten Heinze 2010-04-20 21:21:07 UTC
Created attachment 26072
dmesg output from 2.6.34-rc5

dmesg reports on booting a 2.6.34-rc5 kernel:
pci 0000:00:00.0: reg 10: invalid size (l 0x0 sz 0x8 mask 0xfffffff0); broken device?
(Complete dmesg is attached.)

The device is:
00:00.0 Host bridge: Intel Corporation 82852/82855 GM/GME/PM/GMV Processor to I/O Controller (rev 02)
	Subsystem: IBM ThinkPad R50e
	Control: I/O- Mem+ BusMaster+ SpecCycle- MemWINV- VGASnoop- ParErr- Stepping- SERR+ FastB2B- DisINTx-
	Status: Cap+ 66MHz- UDF- FastB2B+ ParErr- DEVSEL=fast >TAbort- <TAbort- <MAbort+ >SERR- <PERR- INTx-
	Latency: 0
	Region 0: Memory at <unassigned> (32-bit, prefetchable)
	Capabilities: <access denied>
	Kernel driver in use: agpgart-intel
(The hardware is actually a Thinkpad X40.)

Since everything appears to work normally, is this a bug? What can I do to verify whether the device is actually broken or whether this is a bug in the kernel?

This message is not printed using a 2.6.33 kernel.
Comment 1 Andrew Morton 2010-04-20 21:32:56 UTC
Mutter.  Bjorn, that was a fairly unuseful message you added.  Could we have made it a bit more helpful to the testers?  Like, tell them to mail bjorn.helgaas@hp.com :)

I fear that I'm going to get showered with reports of this message coming out. I'll dutifully direct these emails to you and Jesse, nothing will happen, and the reports will keep coming :( Am I wrong?

In a way I guess this should really go into the post-2.6.33 regressions bucket - those warnings didn't come out in 2.6.33.  But if we really want these messages coming out of 2.6.34 then there's no point in treating it as a regression.
Comment 2 Bjorn Helgaas 2010-04-21 03:49:43 UTC
Thanks for the report, Sten.  I added that message after spending a few days debugging a device that turned out to be physically broken, so that it might be easier to debug next time.  Obviously in this case, the device is NOT broken, so I need to make that test smarter or remove the message altogether.
Comment 3 Sten Heinze 2010-04-21 10:56:32 UTC
Thanks for the replies. Good to know the device is not broken. Let me know if I can help and test a new patch.
Comment 4 Bjorn Helgaas 2010-04-21 17:49:01 UTC
Created attachment 26083
debug patch

Sten, would you mind trying this patch?  If you just respond with the output of "dmesg | grep 0000:00:00.0", that should be enough.

I think what probably happened is that we read 0x8 from the BAR when sizing it.  The spec says we calculate the size by clearing the encoding bits (0xf in this case), inverting what's left, and incrementing by one.  In 32-bit arithmetic that would be (~(0x8 & ~0xf)) + 1 == 0, so that seems sensible (a 32-bit prefetchable memory BAR of size zero).
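
As a rough illustration of the sizing arithmetic described above, here is a minimal C sketch. The helper name bar_size_from_readback and the macro MEM_BAR_ENCODING_BITS are hypothetical and are not taken from the kernel source; the sketch assumes a 32-bit memory BAR whose low four bits are encoding bits.

```c
#include <stdint.h>

/* Assumption: low four bits of a 32-bit memory BAR are encoding bits. */
#define MEM_BAR_ENCODING_BITS 0xfu

/*
 * Hypothetical helper (not the kernel's function): compute the size
 * implied by the value read back from a memory BAR during sizing.
 */
static uint32_t bar_size_from_readback(uint32_t readback)
{
    /* Clear the encoding bits, keeping only the address bits. */
    uint32_t addr_bits = readback & ~(uint32_t)MEM_BAR_ENCODING_BITS;

    /* Invert and add one; this wraps to 0 when no address bits are set. */
    return (uint32_t)(~addr_bits + 1u);
}
```

With the 0x8 read back in this report, the address bits are all zero, so the result wraps around to a size of zero.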

Assuming your test results confirm this, I think I'll just revert that patch for now and revisit it after 2.6.34.  The warning was useful for a device where we read 0x7fffe000 from the BAR.  That's clearly invalid (because the MSB is not set), and we should be able to distinguish that from the case you're seeing, but it will require a little too much work for this stage of the 2.6.34 release.
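
One possible way to tell these two cases apart is sketched below in C. This is only an assumption about how such a check could look, not the test the kernel uses or will use; the enum and the function classify_bar_readback are hypothetical names. The idea is that a valid sizing value has its address bits set contiguously from the MSB downward, which makes the computed size a power of two, while a value with no address bits at all yields a size of zero.

```c
#include <stdint.h>

/* Hypothetical classification of a 32-bit memory BAR sizing value. */
enum bar_readback_kind {
    BAR_SIZE_ZERO,  /* no address bits set, e.g. 0x8 */
    BAR_PLAUSIBLE,  /* address bits contiguous from the MSB down */
    BAR_SUSPECT     /* gaps in the address bits, e.g. 0x7fffe000 */
};

static enum bar_readback_kind classify_bar_readback(uint32_t readback)
{
    /* Assumption: low four bits are encoding bits, as for a memory BAR. */
    uint32_t addr_bits = readback & ~(uint32_t)0xfu;
    uint32_t size = (uint32_t)(~addr_bits + 1u);

    if (size == 0u)
        return BAR_SIZE_ZERO;

    /* A power of two has exactly one bit set. */
    if ((size & (size - 1u)) == 0u)
        return BAR_PLAUSIBLE;

    return BAR_SUSPECT;
}
```

Under this sketch, 0x8 is classified as a zero-sized BAR and 0x7fffe000 as suspect, because its MSB is clear while lower address bits are set.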
Comment 5 Sten Heinze 2010-04-22 14:54:04 UTC
dmesg | grep 0000:00:00.0
[    0.275450] pci 0000:00:00.0: reg 10: l 0x8 (original)
[    0.275456] pci 0000:00:00.0: reg 10: sz 0x8 (original)
[    0.275464] pci 0000:00:00.0: reg 10: type 2 flags 0x42208 l 0x0 mask 0xfffffff0
[    0.275471] pci 0000:00:00.0: reg 14: l 0x0 (original)
[    0.275477] pci 0000:00:00.0: reg 14: sz 0x0 (original)
[    0.275483] pci 0000:00:00.0: reg 18: l 0x0 (original)
[    0.275489] pci 0000:00:00.0: reg 18: sz 0x0 (original)
[    0.275495] pci 0000:00:00.0: reg 1c: l 0x0 (original)
[    0.275502] pci 0000:00:00.0: reg 1c: sz 0x0 (original)
[    0.275508] pci 0000:00:00.0: reg 20: l 0x0 (original)
[    0.275514] pci 0000:00:00.0: reg 20: sz 0x0 (original)
[    0.275520] pci 0000:00:00.0: reg 24: l 0x0 (original)
[    0.275526] pci 0000:00:00.0: reg 24: sz 0x0 (original)
[    0.275532] pci 0000:00:00.0: reg 30: l 0x0 (original)
[    0.275539] pci 0000:00:00.0: reg 30: sz 0x0 (original)
[    1.954969] agpgart-intel 0000:00:00.0: Intel 855GM Chipset
[    1.955435] agpgart-intel 0000:00:00.0: detected 8060K stolen memory
[    1.964026] agpgart-intel 0000:00:00.0: AGP aperture is 128M @ 0xe0000000

Hope that helps. Let me know if you need me to try more.
Comment 6 Bjorn Helgaas 2010-04-22 15:11:00 UTC
Created attachment 26095
patch to revert warning

Here's the patch I just posted to revert the warning for now.
Comment 7 Bjorn Helgaas 2010-10-05 20:03:46 UTC
This warning was removed by commit 45aa23b4cb, which was included in 2.6.34.
