Bug 26732
Summary: | Problem: PCIE hot-plug resource assignments hangs kernel during boot | ||
---|---|---|---|
Product: | Drivers | Reporter: | Kushal Koolwal (kushalkoolwal) |
Component: | PCI | Assignee: | Bjorn Helgaas (bjorn) |
Status: | RESOLVED INVALID | ||
Severity: | normal | CC: | bjorn, ebiederm, kushalkoolwal |
Priority: | P1 | ||
Hardware: | All | ||
OS: | Linux | ||
Kernel Version: | 2.6.32 onwards including 2.6.37 | Subsystem: | |
Regression: | Yes | Bisected commit-id: | |
Attachments: |
dmesg 2.6.32 - kernel hangs during early boot
dmesg 2.6.31 - kernel boots fine
dmesg 2.6.32 - suspected commit reverted kernel boots fine now
2.6.31 kernel config file
2.6.32 kernel config file
lspci -vvvxxx output for 2.6.31 kernel
2.6.39 dmesg showing kernel hang
2.6.39 dmesg log with "ignore_loglevel" |
Description
Kushal Koolwal
2011-01-14 19:39:43 UTC
Created attachment 43582 [details]
dmesg 2.6.31 - kernel boots fine
Created attachment 43592 [details]
dmesg 2.6.32 - suspected commit reverted kernel boots fine now
Created attachment 43602 [details]
2.6.31 kernel config file
Created attachment 43612 [details]
2.6.32 kernel config file
Both config files are essentially the same. However, the make command during kernel compilation explicitly asked me to take action on certain new items that were introduced in the 2.6.32 kernel. I selected the default action for most of those items.
Created attachment 43622 [details]
lspci -vvvxxx output for 2.6.31 kernel
I forgot to mention one more piece of information. We do not see this problem on Windows XP and Windows 7.

I'm sorry this report has been neglected. I assume it's still an issue? If so, would it be possible to attach a serial console log from a current kernel, e.g., 2.6.39?

It seems that the conflicting IO address range was 0x1000-0x1fff. For some reason the Linux kernel seems to hang upon discovering this IO range. To resolve the issue we modified our BIOS/firmware code to move the I/O base addresses for the ACPI Power Management Block and the SM Bus controller, which were defined at 0x1000 and 0x2000 respectively. If you would like to debug this issue further at the Linux kernel level, I would be more than happy to attach the serial console output from 2.6.39. Let me know. Thanks for checking back.

Also, it seems that this issue might be related to: https://bugzilla.kernel.org/show_bug.cgi?id=36462

Heh, that's really funny that you found bug #36362 already. I bet it is related, especially since you mention the fixed hardware that you had in the 0x1000 range, which got assigned to the 1c.0 bridge. It would be useful if you could attach the console output from 2.6.39, booted with "ignore_loglevel". That will show more details about the ACPI/PNP devices we find and the PCI resource assignment.

Created attachment 60522 [details]
2.6.39 dmesg showing kernel hang
Attached is the full 2.6.39 dmesg log from the serial output, showing the Linux kernel hang with the unmodified BIOS.
Also, there was a typo in my previous comment. To solve the problem we moved the I/O base address of the ACPI Power Management Block, which was initially defined at 0x1000, to its new location, i.e. 0x2000, to make the Linux kernel happy.
It seems that there is no way for the BIOS to tell the Linux kernel that certain I/O space (0x1000-0x1fff in this case) is reserved.
You didn't use the "ignore_loglevel" kernel parameter, so we don't see the PNP resources ... we see that you have 7 devices, but not the resources they use.

Moving the PM block to 0x2000 manages to avoid the problem for now, but it doesn't actually *solve* anything, it just moves the landmine elsewhere. If we do any more PCI allocation, we could still step on it.

ACPI is the mechanism the BIOS is supposed to use to tell the kernel that I/O space like this is reserved. It's just a bad Linux bug that we happen to ignore most of that information. I think most of the time we're lucky because hardware like this is below 0x1000, and we have "#define PCIBIOS_MIN_IO 0x1000" that keeps PCI from allocating anything down there. It happens to be PCI that trips over this, but it's really a PNP bug.

Created attachment 60562 [details]
2.6.39 dmesg log with "ignore_loglevel"
My bad. Attached is the 2.6.39 dmesg log with "ignore_loglevel" option. Please let me know if you need any more information.
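
For readers following the PCIBIOS_MIN_IO remark above, here is a minimal Python sketch of a lowest-address-first allocator. It is not the kernel's actual allocation code: the function name `first_fit`, its parameters, and the 64 KB port-space limit are assumptions made for illustration, and the bridge window is assumed to be 4 KB in size and alignment. It only shows why a fixed-function block at 0x1000 that nothing has claimed ends up inside the first window handed out, while a claimed one is skipped.

```python
PCIBIOS_MIN_IO = 0x1000  # allocation floor quoted in the comment above


def first_fit(size, align, reserved, limit=0xFFFF):
    """Toy allocator (hypothetical, not kernel code).

    Returns the lowest aligned base >= PCIBIOS_MIN_IO such that the window
    [base, base + size - 1] overlaps none of the inclusive (start, end)
    ranges in `reserved`, or None if nothing fits below `limit`.
    `align` must be a power of two.
    """
    base = (PCIBIOS_MIN_IO + align - 1) & ~(align - 1)
    while base + size - 1 <= limit:
        end = base + size - 1
        clash = [r for r in reserved if r[0] <= end and base <= r[1]]
        if not clash:
            return base
        # Jump past the highest conflicting range, then re-align.
        base = (max(r[1] for r in clash) + 1 + align - 1) & ~(align - 1)
    return None


# Nothing claimed: the window lands on 0x1000, on top of the PM block.
print(hex(first_fit(0x1000, 0x1000, [])))                  # 0x1000
# PM block claimed at 0x1000-0x107f: the window moves to 0x2000.
print(hex(first_fit(0x1000, 0x1000, [(0x1000, 0x107F)])))  # 0x2000
```

The second call is the situation a correctly described PM block would produce: the allocator sees the claimed range and places the bridge window elsewhere.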
Wait, you said the problem happens when you have the ACPI PM block at 0x1000, didn't you? That PM block would be described in the FADT. My Lenovo laptop also has a PNP0C02 device that describes it, so my /proc/ioports looks like this:

    1000-107f : pnp 00:03               <-- this comes from the PNP0C02 device
      1000-1003 : ACPI PM1a_EVT_BLK     <-- these come from the FADT fields
      1004-1005 : ACPI PM1a_CNT_BLK
      1008-100b : ACPI PM_TMR
      1010-1015 : ACPI CPU throttle
      1020-102f : ACPI GPE0_BLK
      1050-1050 : ACPI PM2_CNT_BLK

I don't know if it's actually a spec requirement to have a PNP0C02 device for the PM areas or not, but if you can add one (or just add the FADT fields to the one you already have), I think it will prevent the problem. Linux has a special case for PNP0C02 devices -- we bind the driver earlier than normal, and the driver claims the resources. This is done early enough that PCI allocations will see the already-claimed PNP0C02 resources and avoid them.

I think you are right. It seems that we have two PNP0C02 devices in our ACPI DSDT, but the conflicting IO base address was not included in their resource lists, which it probably should be. Currently, I do not have any timeline as to when we will be able to test this fix in our BIOS, so meanwhile you can consider this issue low priority (or maybe even close it?). If this fix does not work, I can just re-open this issue. Thanks for all your help!

OK, I'm going to close this as "invalid" on the assumption that a BIOS fix will resolve it. If it doesn't, please re-open and we'll take another look. |
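
One way to check whether a BIOS change has the intended effect is to look for claimed entries inside the I/O window in question. The following is a rough Python sketch, not an official tool: `parse_ioports` and `overlaps` are hypothetical helper names, and `SAMPLE` is the Lenovo listing quoted in this thread with the explanatory arrows removed. On a live system the same text could be read from /proc/ioports.

```python
import re

# One /proc/ioports-style line: optional indent, "start-end : name" (hex).
LINE = re.compile(r"^\s*([0-9a-fA-F]+)-([0-9a-fA-F]+)\s*:\s*(.*)$")


def parse_ioports(text):
    """Return a list of (start, end, name) tuples, one per matching line."""
    ranges = []
    for line in text.splitlines():
        m = LINE.match(line)
        if m:
            ranges.append((int(m.group(1), 16), int(m.group(2), 16),
                           m.group(3).strip()))
    return ranges


def overlaps(ranges, start, end):
    """Return the entries that intersect the inclusive window [start, end]."""
    return [r for r in ranges if r[0] <= end and start <= r[1]]


# Listing quoted earlier in this thread (annotations stripped).
SAMPLE = """\
1000-107f : pnp 00:03
  1000-1003 : ACPI PM1a_EVT_BLK
  1004-1005 : ACPI PM1a_CNT_BLK
  1008-100b : ACPI PM_TMR
  1010-1015 : ACPI CPU throttle
  1020-102f : ACPI GPE0_BLK
  1050-1050 : ACPI PM2_CNT_BLK
"""

# Every entry in the sample sits inside the 0x1000-0x1fff window.
for start, end, name in overlaps(parse_ioports(SAMPLE), 0x1000, 0x1FFF):
    print(f"{start:04x}-{end:04x} : {name}")
```

If a "pnp" entry covering the PM block shows up in the window, the PNP0C02 resources were claimed and PCI allocation should steer around them; if only the "ACPI ..." FADT entries appear, the region was probably not reserved early enough.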