Bug 12168
Description
Alex Villacis Lasso
2008-12-05 14:03:20 UTC
On Fri, Dec 05, 2008 at 02:03:23PM -0800, bugme-daemon@bugzilla.kernel.org wrote: > Latest working kernel version: > Earliest failing kernel version: > > 2.6.26.6-49.fc8 for these examples, also seen with 2.6.28-rc6 So this has never worked? Please bring this up on the linux-pci mailing list, that is the proper place for it. Any chance this is a BIOS option that can be configured (or not)? e.g. hit F2 or F12 to enable the SATA device as a Boot device. I had to do this recently so linux would find some SATA drives I added to my desktop machine. I don't believe the problem with the SATA controller is the same thing as having a problem reading the Expansion ROM from a second Video card. The difference is VGA gets special MMIO routing by Bridges. We can only enable VGA routing for one device (enabled in one PCI host bridge) and I expect that one gets the Expansion ROM enabled. I'm not 100% sure about it but it would be worth checking. Is that one set of VGA routing per bridge, or per machine? Well, what I'm really getting at is, my video cards are: (--) PCI: (0@1:0:0) nVidia Corporation NV17 [GeForce4 MX 440] rev 163, Mem @ 0xf0000000/0, 0xe0000000/0, 0xe8000000/0, BIOS @ 0x?? ??????/131072 (--) PCI:*(0@2:2:0) Silicon Integrated Systems [SiS] 86C326 5598/6326 rev 11, Mem @ 0xf4000000/0, 0xf3000000/0, I/O @ 0x00009800/0 , BIOS @ 0x????????/65536 ...and they both used to work in this exact hardware configuration; all I did to stop it working was upgrade the software. It seems to me that they're on different PCI busses. Does that make a difference? Also (in case it helps), if I understand correctly, the ROM-reading code only gets called on the secondary video card, not the primary. HTH (In reply to comment #3) > Is that one set of VGA routing per bridge, or per machine? Well, what I'm > really getting at is, my video cards are: > > (--) PCI: (0@1:0:0) nVidia Corporation NV17 [GeForce4 MX 440] rev 163, Mem @ > 0xf0000000/0, 0xe0000000/0, 0xe8000000/0, BIOS @ 0x?? > ??????/131072 > (--) PCI:*(0@2:2:0) Silicon Integrated Systems [SiS] 86C326 5598/6326 rev 11, > Mem @ 0xf4000000/0, 0xf3000000/0, I/O @ 0x00009800/0 > , BIOS @ 0x????????/65536 > > ...and they both used to work in this exact hardware configuration; all I did > to stop it working was upgrade the software. If you are using the Xorg xserver, we switched from using X's PCI implementation to the kernel's when we switched to libpciaccess. For additional examples, I compiled 2.6.28-rc7 at home. This machine has no SATA interfaces, but it has an integrated AGP chipset (ProSavage-DDR-K), and also a PCI VGA card plugged in (Oak Spitfire 64111). For these tests, I prepared a pair of patches. The first patch for drivers/pci/pci-sysfs.c makes the kernel fail the read with EIO, instead of returning a "success" of 0 bytes, if either the ROM mapping is NULL, or its detected size is zero. The second patch adds debugging to drivers/pci/rom.c in order to locate the source of failure. What I conclude is that the sysfs implementation is only able to read the ROM from the VGA card that is selected as the primary boot-time card. In addition, the failure mode is that the mapping of the ROM apparently succeeds, but the contents are invalid, and fail the test to determine the size of the ROM. I performed the tests by selecting the BIOS option "Initialize video card first" with options of AGP (Savage) or PCI (Oak Spitfire). In either case, the selected card showed the boot process, as expected, and remained the only card on which the ROM could be read. Created attachment 19210 [details]
perl script used to read ROM to file, as documented in kernel docs
This program will report the reading status in addition to saving the ROM file to a specified destination.
Created attachment 19211 [details]
Shell script with paths to corresponding PCI devices for tests
Created attachment 19212 [details]
Add validation to ROM reading operation, fail with -EIO if validation fails
This patch lets userspace know that something wrong happened with the ROM reading.
Created attachment 19213 [details]
Add debugging information to PCI ROM mapping
This patch adds debugging to know where in the ROM mapping process the failure occurs.
Created attachment 19214 [details]
dmesg output for 2.6.28-rc7 with patches applied - AGP/savage as primary, PCI/spitfire as secondary
These are the results on dmesg when trying to use the supplied scripts to read both ROMs, with savage as primary and spitfire as secondary.
Created attachment 19215 [details]
Script output - AGP/savage as primary, PCI/spitfire as secondary
Created attachment 19216 [details]
dmesg output for 2.6.28-rc7 with patches applied - AGP/savage as secondary, PCI/spitfire as primary
Created attachment 19217 [details]
Script output - AGP/savage as secondary, PCI/spitfire as primary
Alex: I'm aware of the PCI switch; my point (which I didn't make well, sorry), was that it's hardware-possible, assuming appropriate software :).
Bill Crawford made a point which I think is relevant in response to Alex on the xorg list; I'll paste the entire comment below.
Bill Crawford wrote:
> On Friday 05 December 2008 22:08:13 Alex Villac�s Lasso wrote:
>
> > The failure to read the PCI ROM as documented seems like a kernel bug. I
> > opened http://bugzilla.kernel.org/show_bug.cgi?id=12168 for it.
>
> This affects kernel modesetting as well, since the cards aren't POSTed
> properly in these situations.
HTH
(The Alex addressed in the message above was supposed to be Alex Deucher, but the rest of the message is for everyone; sorry). So I guess you're not hitting the assign_resource path on the ROM that can't be read? Maybe the rom_size code is getting things wrong somehow (image - rom == 0 possibly)? Assuming it has I/O space allocated and it's being enabled properly, I don't see why else it would fail. Am I right in assuming that error-checking patches such as attachment 19212 [details] above are welcome? And would it be useful to have most of the "break" statements in pci_get_rom_size instead be a "return -EIO"? Or is that going to cause problems somewhere else? Or should some more specific error be returned?
Created attachment 19513 [details]
Example of patch for error checking on rom size
This is a patch that is an example of what I was talking about above.
Yeah both are reasonable. Can you send them out to the linux-pci@vger.kernel.org list for review? Created attachment 19667 [details]
Patch that helps debugging, but does not solve the problem in any way
Patch that helps debugging, but does not solve the problem in any way
...will be discussed in future comment
Here's the output from the patch in my previous comment. pci_map_rom: Trace A pci_map_rom: Trace C pci_map_rom: Trace E pci_map_rom: Trace F pci_get_rom_size: Trace pci_get_rom_size; breaking because 0x55 failed pci_get_rom_size: Trace pci_get_rom_size; image: -121896960, rom: -121896960; size: 131072 pci_map_rom: pci_get_rom_size() returned 0 at 0xf8bc0000 pci_map_rom: Trace G This is naturally on the secondary video card, which is an AGP card. The first video card works fine. As you can see, there's a bug in my patch that causes the image and rom sizes to be reported negatively, but the real problem is that the byte that is supposed to be 0x55 isn't. Unwisely, I didn't include code to print out what it actually is. Maybe next time. I'll keep fiddling with this, and, while I may push us a bit further along the road to understanding, I don't expect to come up with the final solution to the problem (as I have no real understanding of PCI, or the code outside the specific function I'm debugging); hopefully others can chime in usefully. Created attachment 19668 [details]
Another possible patch for rom size?
This patch would mean that pci_get_rom_size would return something "usable" even when the content of the ROM appears to be wrong in some way. However, I suspect the error is caused by some underlying problem, and that we should solve the problem instead of using this patch, but I attach it as an idea for discussion.
More results: pci_map_rom: Trace A pci_map_rom: Trace C pci_map_rom: Trace E pci_map_rom: Trace F pci_get_rom_size: Trace pci_get_rom_size; breaking because 0x55 failed (got 255) pci_get_rom_size: Trace pci_get_rom_size; image: 4173332480, rom: 4173332480; size: 131072 pci_map_rom: pci_get_rom_size() returned 0 at 0xf8c00000 pci_map_rom: Trace G I apologies that the 0x55 result was reported as 255 instead of 0xFF, but you get that. Anyway, my point is, we now know what we're getting instead of 0x55, we just don't know why. I thought something that might also be useful is a trace from a working video card. pci_map_rom: Trace A pci_map_rom: Trace B pci_map_rom: Trace F pci_get_rom_size: Trace pci_get_rom_size; last_image is 128 pci_get_rom_size: Trace pci_get_rom_size; image: 3222044672, rom: 3222011904; size: 131072 pci_map_rom: pci_get_rom_size() returned 32768 at 0xc00c0000 pci_map_rom: Trace G (and many more repetitions of similar things; the thing I find interesting is the ABF pattern instead of ACEF). Btw, I'm stuck at the moment on this, because I don't really understand the code. If someone could compare the patch in comment #20 and the output in comment #23 (which admittedly has added a little to one of the patch messages), and suggest further avenues for exploration (ie. as to why we're getting 0xFF instead of 0x55, or anything else you may deem useful), that would be helpful. I'm happy to dig further into things and find out what values they're returning and that sort of thing, but I have no idea what could possibly be going wrong. On the particular case of my home machine, in which a ProSavage DDR-K AGP chipset is the primary video device, and an old OAK OTI64111 PCI video card is the secondary device, the command "setpci -d VENDORID:DEVICEID COMMAND" shows the value 0007 for the ProSavage and 0000 for the OAK. After manually setting the value of COMMAND to 0007 for the OAK, the ROM reading from the OAK PCI card succeeds. Unfortunately this is not a general solution for all PCI cards, since the SATA chipsets in my work machine already have COMMAND=0007 and still cannot be read. Both work and home machines now run custom-built vanilla 2.6.28 from kernel.org. Works for me also with 2 video cards. I agree it's disappointing that it's not a general solution, but it's better than nothing. As per Bjorn Helgaas' message to linux-pci on 15 December (which I now actually understand how to apply, thanks to Alex VL), I found that it can be set to 2 instead of 7 to enable this. It seems to me that this bit (bit 1) should be permanently turned on in the case of all ROMs. My questions are: 1. Should this be done in pci_enable_rom(), or somewhere else (maybe pci_bus_assign_resources() in setup-bus.c)? 2. Does anyone know whether bits 2 and 0 should also be turned on? Hmm. After further examination, I presume pci_enable_device (in pci.c) is not getting run. This kind of answers question 2 above, but I still need to know the answer to question 1. It also seems like is_enabled_store() in pci-sysfs.c is trying to call pci_enable_device. *(&^#%&( Ok, all, I've just discovered that, in addition to echoing 1 into the ROM before reading it, we also have to echo 1 into the /sys/bus/pci/devices/<deviceID>/enable as well. Created attachment 19737 [details]
Fixed version of readrom.pl
I've attached a fixed version of readrom.pl
(In reply to comment #29) > Created an attachment (id=19737) [details] > Fixed version of readrom.pl > > I've attached a fixed version of readrom.pl > Nice one, only I would not test it on the primary chipset (the one used for the current display) - it unconditionally disables the PCI chipset after reading the ROM. It disables the rom, but not the device. And even if I disabled the device, the file called "enable" works like a counter, unlike the file "rom". HTH. Anyway, in my opinion, what needs to happen to close the display portion of this bug is: - The patch from comment 8 needs to be applied - The pci_enable_rom() function needs to printk an error if the card is not enabled The other problems related to reading the VGA ROM can be dealt with in the Xorg code. The problems with the SATA ROMs aren't something I'll be in a position to spend time on. Hopefully I'll get some time tomorrow or so to make up a combined patch to solve the VGA-related problems. Anything else? Created attachment 19757 [details]
Makes pci-sysfs fail when it can't read rom, and emit warnings too
This patch should solve the problem, and give warnings to others to help them not waste the time that Alex and I did on this.
Can you send this patch out to linux-pci (if you haven't already, haven't checked that mailbox yet). We may even go one step further and check whether the device is enabled directly. Created attachment 20020 [details]
Patch that solves the problem; I know of no problems with it
This patch appears to do exactly what I want, and I think also what Jesse requested as well.
Oops, doesn't comply with coding standards. I'll try again. Created attachment 20023 [details]
pci: Return/dmesg errors when reading PCI ROMs
Above patch, corrected for coding style, etc.
Created attachment 20039 [details]
[PATCH] pci: Return/dmesg errors when reading PCI ROMs (take 2)
Corrected according to Greg-KH's comments on the mailing list.
|