Most recent kernel where this bug did not occur: 2.6.15 Distribution: SuSE Hardware Environment: HP ze1115 notebook with AMD Duron CPU Software Environment: gcc 4.0.2 Problem Description: Cardbus peripherals do not work. The dmesg output shows the following: ACPI: PCI Interrupt 0000:00:0a.0[A] -> Link [LNKA] -> GSI 11 (level, low) -> IRQ 11 ACPI: PCI interrupt for device 0000:00:0a.0 disabled yenta_cardbus: probe of 0000:00:0a.0 failed with error -12 I used 'git bisect' to localize the problem, which turned out to be the changes in commit b408cbc704352eccee301e1103b23203ba1c3a0e, which was made on March 23. When I revert this patch, my cardbus system behaves normally and the cardbus controller initialization is ACPI: PCI Interrupt 0000:00:0a.0[A] -> Link [LNKA] -> GSI 11 (level, low) -> IRQ 11 Yenta: CardBus bridge found at 0000:00:0a.0 [103c:0022] Yenta O2: res at 0x94/0xD4: 00/ea Yenta O2: enabling read prefetch/write burst Yenta: ISA IRQ mask 0x0038, PCI irq 11 Socket status: 30000821 The memory map at the beginning of the dmesg buffer is BIOS-provided physical RAM map: BIOS-e820: 0000000000000000 - 000000000009fc00 (usable) BIOS-e820: 000000000009fc00 - 00000000000a0000 (reserved) BIOS-e820: 00000000000e0000 - 0000000000100000 (reserved) BIOS-e820: 0000000000100000 - 0000000023ff0000 (usable) BIOS-e820: 0000000023ff0000 - 0000000023ffffc0 (ACPI data) BIOS-e820: 0000000023ffffc0 - 0000000024000000 (ACPI NVS) BIOS-e820: 00000000fff80000 - 0000000100000000 (reserved) 575MB LOWMEM available. On node 0 totalpages: 147440 DMA zone: 4096 pages, LIFO batch:0 Normal zone: 143344 pages, LIFO batch:31 Steps to reproduce: Happens every reboot.
(CCing gregkh, who signs the commit)
Yeah, I signed off on the commit, but it was Linus that wrote the patch. Reassigning to him.
Larry, to help debug this, can you do the following: - with and without that patch, do a lspci -vvxxx and attach both ouputs (clearly saying which is which) to the bugzilla entry - same deal, with full "dmesg" output for both working and nonworking case. The latter preferably with CONFIG_PCI_DEBUG enabled _and_ with DEBUG enabled manually at the top of arch/i386/pci/pci.h, which gives a lot more output. Linus
Created attachment 7820 [details] dmesg output _with_ b408cbc7
Created attachment 7821 [details] dmesg output _without_ b408cbc7
Created attachment 7822 [details] lspci -vvxxx output _with_ b408cbc7
Created attachment 7823 [details] lspci -vvxxx output _without_ b408cbc7
Linus, The outputs of dmesg and lspci -vvxxx with and without patch b408cbc7 have been attached. Each was run with CONFIG_PCI_DEBUG=y, and the #undef DEBUG at the top of arch/i386/pci/pci.h changed to #define DEBUG. If any further output will be of help, please let me know. Thanks for the help. Larry
On Sat, 8 Apr 2006, bugme-daemon@bugzilla.kernel.org wrote: > > The outputs of dmesg and lspci -vvxxx with and without patch b408cbc7 have been > attached. Each was run with CONFIG_PCI_DEBUG=y, and the #undef DEBUG at the top > of arch/i386/pci/pci.h changed to #define DEBUG. > > If any further output will be of help, please let me know. Thanks, it's all clear. The BIOS has set up the cardbus controller registers at a bad area, and the change in commit b408cbc7.. basically assumes that any PCI register setup by the BIOS is good, and in this case it clearly just isn't. The whole reason for that commit is that the BIOS has set up the e820 "reserved region" to overlap with a _valid_ PCI allocation, but in your case, a e820 reserved region overlaps with an _invalid_ PCI allocation, and the fact that the kernel now accepts it (because it was the right thing to do on another machine) is wrong on your machine. I suspect we will need to just revert that change. Can you try this trivial patch which should effectively do exactly that (but without undoing the cleanups that were done in b408cbc7..) This will break the fbcon setup that caused the original patch ;/ Linus
Created attachment 7826 [details] Trial patch to test
Created attachment 7827 [details] dmesg output with commit b408cbc7 and the trial patch of 4/09 The trial patch fixes the problem. It was run with commit b408cbc7 and DEBUG set in arch/i386/pci/pci.h. Larry
Linus Torvalds wrote: > > On Sat, 8 Apr 2006, bugme-daemon@bugzilla.kernel.org wrote: > > Thanks, it's all clear. The BIOS has set up the cardbus controller > registers at a bad area, and the change in commit b408cbc7.. basically > assumes that any PCI register setup by the BIOS is good, and in this case > it clearly just isn't. > > The whole reason for that commit is that the BIOS has set up the e820 > "reserved region" to overlap with a _valid_ PCI allocation, but in your > case, a e820 reserved region overlaps with an _invalid_ PCI allocation, > and the fact that the kernel now accepts it (because it was the right > thing to do on another machine) is wrong on your machine. > > I suspect we will need to just revert that change. Can you try this > trivial patch which should effectively do exactly that (but without > undoing the cleanups that were done in b408cbc7..) > > This will break the fbcon setup that caused the original patch ;/ The trial patch fixes the problem on my machine. I used "git checkout -f" to reset the source to that of the git repository for 2.6.17-rc1, then did the new patch, followed by setting DEBUG in arch/i386/pci/pci.h. As before, I am willing to provide any further output, or test any more patches. Larry
Created attachment 7828 [details] dmesg, lspci and fbset Framebuffer wont work without mem=.
Can you guys test this trivial one-liner? In arch/i386/pci/common.c, just change the line that says subsys_initcall(pcibios_init); into arch_initcall(pcibios_init); to move the "pcibios_init()" code sequence up. In particular, it should mean that we do the bus resource traversal early, but leave the actual device resources for later. That might make everybody happy: it means that the bus traversal will be done before the e820 resources have been set up, and that we trust the PCI _bus_ information more than we trust the BIOS tables. But when individual devices are scanned, we'll have scanned the bus and the e820 information first. This is inevitably a balance of "whom do you trust?" Do we trust the BIOS "reserved memory" tables over the actual device setup? We have two cases where one is "more correct" than the other, but sadly, it's a different one in those two cases. We could also have an even more involved setup: - we trust the "real RAM" e820 table information the most (because we already very fundamentally rely on that: it's what we use to initialize our regular free memory setup) - then we trust PCI bus scanning - then we trust "reserved RAM" etc information e820 table info - and finally, we trust regular PCI device resource scanning I'll send out a patch that tries to do all of this, and see how that works out.. Linus
Created attachment 7829 [details] Another patch to test This goes on top of the previou patch, but INSTEAD of the change I described manually (the subsys_initcall->arch_initcall one). It sets up resource allocation to trust the "real RAM" resources more than anything else.
The second patch to test has some offsets in arch/i386/kernel/setup.c and generates an error as follows: finger@larrylap:~/linux-2.6> patch -p1 < linus_patch2 patching file arch/i386/kernel/efi.c patching file arch/i386/kernel/setup.c Hunk #1 succeeded at 1281 (offset -30 lines). Hunk #2 succeeded at 1329 (offset -30 lines). Hunk #3 succeeded at 1342 (offset -30 lines). patching file arch/ia64/kernel/efi.c patching file arch/ia64/kernel/setup.c patching file include/linux/efi.h finger@larrylap:~/linux-2.6> make CHK include/linux/version.h CC init/main.o CHK include/linux/compile.h LD init/built-in.o CHK usr/initramfs_list CC arch/i386/kernel/time.o CC arch/i386/kernel/setup.o arch/i386/kernel/setup.c:1355: error:
On Sun, 9 Apr 2006, bugme-daemon@bugzilla.kernel.org wrote: > > The second patch to test has some offsets in arch/i386/kernel/setup.c and > generates an error as follows: It should apply cleanly against the current -git, but I think you can ignore the offsets against your older kernel too. > arch_initcall(request_standard_resources_early); > fs_initcall(request_standard_resources_late); > > Clearly, the arch_initcall and the fs_initcall belong in a function, but which one? Actually, no. They should be outside. They generate magic inline assembly that adds those two functions to the list of initcalls, and they should _not_ be inside a function, but at the top level. So the thing you cut-and-pasted looked fine. Linus
Created attachment 7830 [details] Revised version of second patch The second patch works with 2 changes: (1) the routine is request_standard_resources, not request_standard_resources_early, and (2) request_standard_resources_late needs to return a value (presumably 0). The revised diff is attached. BTW, both patches are probably needed for the stable series 2.6.16.y. Larry
Larry, is this fixed now?
bugme-daemon@bugzilla.kernel.org wrote: > http://bugzilla.kernel.org/show_bug.cgi?id=6337 > > > > > > ------- Additional Comments From akpm@osdl.org 2006-06-01 21:45 ------- > Larry, is this fixed now? > Yes. The code changes that Linus made fixed it completely. I just marked the bug closed. Sorry that I didn't do that earlier. Larry
On Fri, 2 Jun 2006, bugme-daemon@bugzilla.kernel.org wrote: > > > Larry, is this fixed now? > > > Yes. The code changes that Linus made fixed it completely. I just marked > the bug closed. Sorry that I didn't do that earlier. Did I merge the proposed fix too? Remind me. I have the attention span of a rodent on crack. Linus
bugme-daemon@bugzilla.kernel.org wrote: > Did I merge the proposed fix too? > Remind me. I have the attention span of a rodent on crack. > > Linus > I'm not surprised given all that you do. IIRC (not a certainty), you revised the patch that had broken my machine. I think the new version does things in a different order. In any case, my computer works OK and it doesn't seem that anyone else has complained. Larry