Bug 8346
Summary: | boot crash + hang during modprobe processor unless "processor.nocst" - P4/HT | ||
---|---|---|---|
Product: | ACPI | Reporter: | Olaf Kirch (okir) |
Component: | BIOS | Assignee: | Len Brown (lenb) |
Status: | CLOSED CODE_FIX | ||
Severity: | high | CC: | acpi-bugzilla |
Priority: | P2 | ||
Hardware: | i386 | ||
OS: | Linux | ||
Kernel Version: | v2.6.21-rc7 | Subsystem: | |
Regression: | --- | Bisected commit-id: | |
Attachments: |
acpidump
dmidecode
Possible fix
Snapshot of an oops
Another oops
patch vs 2.6.21-rc7
patch for 2.6.22-git1 |
Description
Olaf Kirch
2007-04-17 01:01:41 UTC
Created attachment 11217 [details]
acpidump
Created attachment 11218 [details]
dmidecode
I narrowed it down a little. The problem was introduced by the ACPICA merge
    good: e47fddf2470feb228e1d3ff41fc78dad4cfbbcc6
    bad:  15a58ed12142939d51076380e6e58af477ad96ec
I'll try to drill into this a little more, but unfortunately a lot of
changes in this set of patches don't compile, which makes bisect a little
tedious.

It seems to boil down to one of these:
    2502fffb1958da66fa50a475081cb6827acdd9f3 ACPICA: Add support for DMAR table
    ad71860a17ba33eb0e673e9e2cf5ba0d8e3e3fdd ACPICA: minimal patch to integrate new tables into Linux
    a4bbb810dedaecf74d54b16b6dd3c33e95e1024c ACPICA: Lint changes
    4bf273939c99fae5bae399f51c417a552d74b97f ACPICA: Fix for FADT conversion in 64-bit mode
    8f34890dce60f7df6dd23a0d04977c6572adaab8 ACPICA: Update comments for individual table fields
    c5fc42ac4d4d6d3e3f619290b86890cb3725d2f8 ACPICA: misc fixes for new Table Manager:
    f3d2e7865c816258c699ff965768e46b50d536d3 ACPICA: Implement simplified Table Manager
Unfortunately, these are rather big. And I'm unable to bisect further, I
keep getting compile errors in misc places (mostly es7000 code, but
elsewhere too - eg
    In file included from arch/i386/kernel/setup.c:31:
    include/linux/acpi.h:59: error: redefinition of ‘struct acpi_table_rsdp’
    include/linux/acpi.h:86: error: redefinition of ‘struct acpi_table_rsdt’
    include/linux/acpi.h:93: error: redefinition of ‘struct acpi_table_xsdt’
    include/linux/acpi.h:100: error: redefinition of ‘struct acpi_table_fadt’
)

I enabled ACPI debugging and noticed that the BIOS doesn't have a _CST.
However, the following if() is executed:
    if (acpi_fadt.cst_control && !nocst) {
            status = acpi_os_write_port(acpi_fadt.smi_command,
                                        acpi_fadt.cst_control, 8);
            [...]
    }
If I set nocst, the boot seems to proceed normally.
cst_control has a value of 0xe3.

So the problem goes away if you boot with "processor.nocst"?

Yes. FWIW, if I just skip writing cst_control if no _CST was found, this
will work as well.
The patch below calls acpi_processor_get_power_info first, and writes
CST_CNT to the SMI_CMD register only if pr->flags.has_cst is true.
Created attachment 11229 [details]
Possible fix
I'm not entirely sure if this is the right approach. I understand that
the _CST object is supposed to be dynamic, but can the BIOS hide it
completely prior to the OSPM sending CST_CNT?
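The ordering proposed in this comment can be sketched in user-space C. This is not the attached patch (its contents are not reproduced in this report); it is a minimal model of the idea, assuming hypothetical stand-in types (fadt_model, processor_model, port_log) and a logging stub in place of acpi_os_write_port. Only the decision logic, "write CST_CNT only when a _CST was actually found", follows the comment.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical stand-ins for the kernel structures; only the fields
 * discussed in this report are modelled. */
struct fadt_model {
    uint32_t smi_command;   /* SMI command port taken from the FADT */
    uint8_t  cst_control;   /* CST_CNT value taken from the FADT */
};

struct processor_model {
    bool has_cst;           /* true once a _CST object has been found */
};

/* Records port writes instead of touching real hardware. */
struct port_log {
    int      writes;
    uint32_t port;
    uint8_t  value;
};

static void model_write_port(struct port_log *log, uint32_t port, uint8_t value)
{
    log->writes++;
    log->port = port;
    log->value = value;
}

/* Returns true if CST_CNT was written to the SMI command port.
 * The caller is assumed to have probed for _CST already, so that
 * pr->has_cst is valid before this decision is made. */
static bool notify_bios_of_cst_support(const struct fadt_model *fadt,
                                       const struct processor_model *pr,
                                       bool nocst,
                                       struct port_log *log)
{
    assert(fadt && pr && log);

    if (!fadt->cst_control || nocst)
        return false;       /* nothing to announce, or user opted out */
    if (!pr->has_cst)
        return false;       /* no _CST found: skip the SMI write */

    model_write_port(log, fadt->smi_command, fadt->cst_control);
    return true;
}
```

With cst_control set to 0xe3 and has_cst false, as on the reporter's machine, the function returns false and no port write is logged. The open question raised in this comment still applies: a BIOS that only exposes _CST after receiving CST_CNT would never be notified under this ordering.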
Created attachment 11230 [details]
Snapshot of an oops
The longer I look at this, the more I suspect something else is
going wrong here. Look at the oops - it's strange in many ways.
- It's the swapper task, but the stack looks like modprobe's
There's a mismatch between ti and task.ti - current points
at the swapper task_struct but %esp is that of the modprobe
task
- The EIP shown is bogus: 7E70:[0000001e] is crap. 0x1e matches %edi,
and 7e70 is the lower 16bit of %ebp
- ds and es segment registers are crap too; these correspond to the
topmost two words on the stack shown
Note this is oops #2, the first one had scrolled off the top of the
screen and scrollback was not possible.
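For readers checking the "%esp belongs to modprobe" observation: on i386 kernels of this era, thread_info sits at the bottom of the kernel stack and is located by masking the stack pointer, so an %esp value can be attributed to a task independently of what "current" says. A minimal sketch of that arithmetic follows, assuming 8 KiB kernel stacks (4 KiB stacks were also a build option) and hypothetical helper names.

```c
#include <stdint.h>

/* Assumption: 8 KiB kernel stacks. With the 4 KiB stack option the
 * mask would be 4096 instead. */
#define THREAD_SIZE_MODEL 8192u

/* Base address of the stack (and thus of thread_info) that a given
 * kernel stack pointer falls into. */
static uint32_t thread_info_base(uint32_t esp)
{
    return esp & ~(THREAD_SIZE_MODEL - 1u);
}

/* Non-zero if both stack pointers lie within the same kernel stack. */
static int same_kernel_stack(uint32_t esp_a, uint32_t esp_b)
{
    return thread_info_base(esp_a) == thread_info_base(esp_b);
}
```

Comparing the base derived from the oops %esp with the thread_info address of the task that "current" points at is one way to spot the kind of mismatch described above.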
Created attachment 11231 [details]
Another oops
Here's another oops, with similar characteristics
Again it's the swapper task, the pt_regs shown look rather corrupted,
and the stack shown is modprobe's.
The most peculiar thing about this oops is EIP.
It says:
printing eip
c16a9080
...
EIP: 53b8:[<c0243b8>]
...
Code: 00 00 00 00
EIP: [<00000000>]
So it shows three different EIP values in one single oops. It appears as if
the stack is still active while being printed. Which isn't that surprising
given that we're dumping the stack of another task, which may still be
running on the other sibling CPU.
This bug is a bit frustrating :-)
I spent another day on trying to hunt down exactly what is happening,
with moderate success.
Here's what I could establish for a fact
- the bug is triggered by the call to
      acpi_os_write_port(acpi_gbl_FADT.smi_command,
                         acpi_gbl_FADT.cst_control, 8);
  If I make it skip this outb, everything works like a charm
- The symptoms look like some strange memory corruption to me, but
  without a fixed pattern. However, many of the oopses I recorded
  had a mismatch between %esp (which belonged to modprobe) and
  struct current (which pointed at some other task, probably whatever
  was currently active on the other CPU).
  I actually added a check to the oops code which would chase
  current_thread_info()->task, and it would always show the modprobe
  task in state TASK_RUNNING.
- In the oops handler, I compared smp_processor_id() to
  safe_smp_processor_id(), which I *think* should trigger if we're
  running with %fs pointing to the wrong kernel PDA. But this
  condition never triggered.
- I added a printk to acpi_ev_sci_xrupt_handler and a few exception
  handlers to see if there was some bad interaction going on, but
  that printk never triggered.
So I'm back to suspecting the ACPI BIOS doing something stupid - which
would be a pretty good assumption if it wasn't for the fact that the
git bisect showed that this particular problem seems to have been
introduced by the new ACPI table manager code.
Any other ideas to try?

This system has a revision 2 FADT, but it is populating FADT.CST_CONTROL.
However, the FADT.CST_CONTROL field was RESERVED until FADT revision 3
and ACPI 2.0 -- so a non-zero value here is technically a BIOS bug.
Linux is believing the BIOS that it should send these bits to the
SMI_COMMAND register in order to tell the BIOS that Linux knows how to
handle a _CST. As this system doesn't have a _CST, it is technically a
2nd BIOS bug that this field is populated -- even if the BIOS were up
to date with ACPI 2.0.

Which brings us to the actual (intermittent) failure. We've had problems
in the past with tickling SMM from a thread other than CPU0. My guess is
that on some boots, we are writing this value from cpu1 and that
confuses SMM. It is likely that booting with HT disabled in the BIOS or
with "maxcpus=1" or a UP kernel will make the failure go away.
(give it a try)

But why did 2.6.20 work while 2.6.21-rc7 fails? It turns out that for
FADT revisions before r3, 2.6.20 would explicitly clear this field when
converting the BIOS-supplied FADT into Linux's in-memory ACPI 2.0
format. With the table manager rewrite in 2.6.21, we simply bcopy the
entire BIOS-supplied FADT into a buffer, preserving this field, and
exposing this series of BIOS bugs.
Created attachment 11275 [details]
patch vs 2.6.21-rc7
Please test this patch.
On this system, it should fix the boot issue
and result in an additional line in dmesg.
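The behaviour difference described in this comment can be illustrated with a small user-space model. This is not the attached patch (its contents are not reproduced in this report). It is a sketch of the 2.6.20-style rule, "copy the firmware FADT, but zero CST_CNT when the table revision is below 3", using a hypothetical fadt_sketch struct that models only three fields.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical, heavily reduced model of a FADT. */
struct fadt_sketch {
    uint8_t  revision;      /* FADT revision reported by the BIOS */
    uint32_t smi_command;   /* SMI command port */
    uint8_t  cst_control;   /* CST_CNT; reserved before revision 3 */
};

/* Copies the BIOS-supplied table into dst. For tables older than
 * revision 3, a non-zero CST_CNT is treated as a BIOS bug and cleared.
 * Returns 1 if the field had to be cleared (the place where a kernel
 * would log a message), 0 otherwise. */
static int convert_fadt(struct fadt_sketch *dst, const struct fadt_sketch *bios)
{
    assert(dst && bios);

    memcpy(dst, bios, sizeof(*dst));    /* plain copy, as in 2.6.21 */

    if (dst->revision < 3 && dst->cst_control != 0) {
        dst->cst_control = 0;           /* ignore the reserved field */
        return 1;
    }
    return 0;
}
```

For the machine in this report (revision 2, cst_control 0xe3) the sketch returns 1 and leaves cst_control at zero, so the later "cst_control && !nocst" check never reaches the SMI write.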
Thanks, Len - this patch fixes the crashes I've been seeing, and the
printk triggers.
Created attachment 11318 [details]
patch for 2.6.22-git1
Thanks for testing, Olaf.
patch in comment #13 sent to 2.6.21.stable
Attached is a slightly expanded patch w/o printk for 2.6.22.
patch in comment #15 shipped post 2.6.22-git3, expected -git4
closed.