Bug 56051 - IRQ remap causes freeze on Thinkpad T430
Summary: IRQ remap causes freeze on Thinkpad T430
Status: NEEDINFO
Alias: None
Product: Platform Specific/Hardware
Classification: Unclassified
Component: x86-64
Hardware: All Linux
Importance: P1 normal
Assignee: platform_x86_64@kernel-bugs.osdl.org
URL:
Keywords:
Depends on:
Blocks:
 
Reported: 2013-04-01 01:11 UTC by Kevin Locke
Modified: 2016-07-12 18:43 UTC
CC List: 5 users

See Also:
Kernel Version: master
Subsystem:
Regression: No
Bisected commit-id:


Attachments
acpidump in "Discrete Graphics" mode (348.40 KB, text/plain)
2013-04-01 01:11 UTC, Kevin Locke
acpidump in "nVidia Optimus" mode (348.68 KB, text/plain)
2013-04-01 01:11 UTC, Kevin Locke
kernel config (140.26 KB, text/plain)
2013-04-01 01:12 UTC, Kevin Locke
dmesg in "Discrete Graphics" mode (38.95 KB, text/plain)
2013-04-01 01:12 UTC, Kevin Locke
dmesg in "nVidia Optimus" mode (40.55 KB, text/plain)
2013-04-01 01:12 UTC, Kevin Locke
/proc/iomem in "Discrete Graphics" mode (2.25 KB, text/plain)
2013-04-01 01:12 UTC, Kevin Locke
/proc/iomem in "nVidia Optimus" mode (2.43 KB, text/plain)
2013-04-01 01:12 UTC, Kevin Locke
/proc/ioports in "Discrete Graphics" mode (1.04 KB, text/plain)
2013-04-01 01:13 UTC, Kevin Locke
/proc/ioports in "nVidia Optimus" mode (1.06 KB, text/plain)
2013-04-01 01:13 UTC, Kevin Locke
lspci -vvvxxx in "Discrete Graphics" mode (44.05 KB, text/plain)
2013-04-01 01:13 UTC, Kevin Locke
lspci -vvvxxx in "nVidia Optimus" mode (43.13 KB, text/plain)
2013-04-01 01:14 UTC, Kevin Locke
ASL variables in "Discrete Graphics" mode (4.00 KB, application/octet-stream)
2013-04-17 22:17 UTC, Kevin Locke
ASL variables in "Discrete Graphics" mode with acpi_osi="!Windows 2012" (4.00 KB, application/octet-stream)
2013-04-17 22:17 UTC, Kevin Locke
ASL variables in "nVidia Optimus" mode (4.00 KB, application/octet-stream)
2013-04-17 22:18 UTC, Kevin Locke
ASL variables in "nVidia Optimus" mode (corrected offset) (4.00 KB, application/octet-stream)
2013-04-18 13:46 UTC, Kevin Locke
dmesg in "Discrete Graphics" mode with kernel e8df390f (from aaronlu/acpi_video) (39.78 KB, text/plain)
2013-04-22 14:07 UTC, Kevin Locke
dmesg in "Discrete Graphics" mode with kernel e8df390f (from aaronlu/acpi_video) (72.93 KB, text/plain)
2013-04-23 00:27 UTC, Kevin Locke
Add some debug statement to dsdt table (57.21 KB, application/octet-stream)
2013-04-24 01:32 UTC, Aaron Lu
dmesg in "Discrete Graphics" mode with instrumented DSDT and kernel e8df390f (from aaronlu/acpi_video) (77.26 KB, text/plain)
2013-04-24 04:47 UTC, Kevin Locke
dmesg after LID event with instrumented DSDT and kernel e8df390f (from aaronlu/acpi_video) (67.82 KB, text/plain)
2013-04-24 14:10 UTC, Kevin Locke
dmesg in "Discrete Graphics" mode with apic=debug (63.68 KB, text/plain)
2013-07-18 13:55 UTC, Kevin Locke
dmesg in "Discrete Graphics" mode with apic=debug intremap=off (59.45 KB, text/plain)
2013-07-18 13:56 UTC, Kevin Locke
dmesg in "Discrete Graphics" mode with x2apic_phys (39.34 KB, text/plain)
2013-07-23 04:22 UTC, Kevin Locke
dmesg in "Discrete Graphics" mode with apic=debug, kernel 3.10.1, bios 2.54 (63.68 KB, text/plain)
2013-07-31 15:47 UTC, Kevin Locke
/proc/interrupts in "Discrete Graphics" mode with apic=debug, kernel 3.10.1, bios 2.54 (2.20 KB, text/plain)
2013-07-31 15:48 UTC, Kevin Locke
lspci -vvvnnn in "Discrete Graphics" mode with apic=debug, kernel 3.10.1, bios 2.54 (30.15 KB, text/plain)
2013-07-31 15:49 UTC, Kevin Locke
dmesg in "Discrete Graphics" mode with intremap=off, kernel 3.10.1 with patch from comment 70, bios 2.54 (51.20 KB, text/plain)
2013-07-31 15:57 UTC, Kevin Locke
lspci -xxxvvvnnn in "Discrete Graphics" mode with apic=debug, kernel 3.10.1 with patch from comment 70, bios 2.54, with nouveau loaded (44.81 KB, text/plain)
2013-07-31 16:40 UTC, Kevin Locke
dmesg in "Discrete Graphics" mode with apic=debug, kernel 3.10.1 with patch from comment 70, bios 2.54, after video loaded and unloaded (63.94 KB, text/plain)
2013-07-31 16:42 UTC, Kevin Locke

Description Kevin Locke 2013-04-01 01:11:12 UTC
Created attachment 96771 [details]
acpidump in "Discrete Graphics" mode

On a Lenovo ThinkPad T430 with the latest BIOS (G1ET93WW - 2.53), when
configured to use the nVidia graphics card in the BIOS (by setting
"Graphics Device" to "Discrete Graphics"), modprobing the ACPI video
module causes the system to freeze hard either immediately or within a
few seconds.  The video mode is not changed and the display is not
corrupted compared to the pre-modprobe state, the cursor continues to
blink, but the system ceases responding to input (SysRq does not work)
and no additional messages (except rarely the "ACPI: Video Device
[VID] [...]" messages when the freeze occurs a second or two later)
are produced to screen or netconsole.

This occurs on all kernel versions that I have tested, which includes
3.2 from the Debian archives, self-compiled 3.8.2, and self-compiled
linus master from earlier today (46a1f21a).

I have also confirmed that the system boots and functions correctly in
"Discrete Graphics" mode in Windows 7.

The freeze may or may not be related to the following messages (which
are not produced if the video module is loaded manually early in the
boot process - presumably because they are produced by the loading of
another module):

ACPI Warning: 0x0000000000000428-0x000000000000042f SystemIO conflicts with Region \_SB_.PCI0.LPC_.PMIO 1 (20130117/utaddress-251)
ACPI: If an ACPI driver is available for this device, you should use it instead of the native driver
ACPI Warning: 0x0000000000000530-0x000000000000053f SystemIO conflicts with Region \_SB_.PCI0.LPC_.LPIO 1 (20130117/utaddress-251)
ACPI: If an ACPI driver is available for this device, you should use it instead of the native driver
ACPI Warning: 0x0000000000000500-0x000000000000052f SystemIO conflicts with Region \_SB_.PCI0.LPC_.LPIO 1 (20130117/utaddress-251)
ACPI: If an ACPI driver is available for this device, you should use it instead of the native driver
lpc_ich: Resource conflict(s) found affecting gpio_ich

Also possibly relevant:  acpidump produces the message "Wrong checksum for
FADT!" when run in either BIOS configuration.

I'll attach the acpidump, config, dmesg, /proc/iomem, /proc/ioports, and lspci output to the report.  Please let me know if there is anything else which I can do to help troubleshoot/fix the issue.

Thanks for reading,
Kevin
Comment 1 Kevin Locke 2013-04-01 01:11:40 UTC
Created attachment 96781 [details]
acpidump in "nVidia Optimus" mode
Comment 2 Kevin Locke 2013-04-01 01:12:01 UTC
Created attachment 96791 [details]
kernel config
Comment 3 Kevin Locke 2013-04-01 01:12:15 UTC
Created attachment 96801 [details]
dmesg in "Discrete Graphics" mode
Comment 4 Kevin Locke 2013-04-01 01:12:29 UTC
Created attachment 96811 [details]
dmesg in "nVidia Optimus" mode
Comment 5 Kevin Locke 2013-04-01 01:12:45 UTC
Created attachment 96821 [details]
/proc/iomem in "Discrete Graphics" mode
Comment 6 Kevin Locke 2013-04-01 01:12:58 UTC
Created attachment 96831 [details]
/proc/iomem in "nVidia Optimus" mode
Comment 7 Kevin Locke 2013-04-01 01:13:17 UTC
Created attachment 96841 [details]
/proc/ioports in "Discrete Graphics" mode
Comment 8 Kevin Locke 2013-04-01 01:13:31 UTC
Created attachment 96851 [details]
/proc/ioports in "nVidia Optimus" mode
Comment 9 Kevin Locke 2013-04-01 01:13:47 UTC
Created attachment 96861 [details]
lspci -vvvxxx in "Discrete Graphics" mode
Comment 10 Kevin Locke 2013-04-01 01:14:02 UTC
Created attachment 96871 [details]
lspci -vvvxxx in "nVidia Optimus" mode
Comment 11 Aaron Lu 2013-04-01 05:44:00 UTC
Hi Kevin,

So the system will freeze when you:
# modprobe video

May I know if i915 module is already loaded?

I think the video module should be auto-loaded when you boot the system, so why do you need to load it manually? Thanks.
Comment 12 Kevin Locke 2013-04-01 12:31:21 UTC
Hi Aaron,

Yes, exactly, the system freezes when I
# modprobe video

You are correct, the video module is auto-loaded when I boot the system.  During a normal boot, the system freezes during `udevadm trigger --action=add`.  To troubleshoot further, I had started dropping to a shell before `udevadm trigger --action=add` and loading modules individually to attempt to determine an exact cause.

During testing, the i915 module was not yet loaded (I had assumed that loading it would load the video module as a dependency and cause the same behavior - I can test this if there is any doubt).

Thanks.
Comment 13 Kevin Locke 2013-04-01 13:04:02 UTC
I can confirm that loading either i915 or nouveau will cause the system to freeze as well.

In testing this morning, the behavior was slightly less reliable (I actually got to the desktop a few times, perhaps 2/15 attempts) and if the system did not freeze within a few seconds, it was stable until I shut it down (even under heavy 3D video load).

However, even in the cases where the system did not freeze immediately, I could provoke a freeze by `rmmod video && modprobe video` a few times.
Comment 14 Aaron Lu 2013-04-02 05:26:36 UTC
(In reply to comment #13)
> I can confirm that loading either i915 or nouveau will cause the system to
> freeze as well.
> 
> In testing this morning, the behavior was slightly less reliable (I actually
> got to the desktop a few times, perhaps 2/15 attempts) and if the system did
> not freeze within a few seconds, it was stable until I shut it down (even
> under
> heavy 3D video load).
> 
> However, even in the cases where the system did not freeze immediately, I
> could
> provoke a freeze by `rmmod video && modprobe video` a few times.

Is it possible to get the backtrace when it panics?
Comment 15 Kevin Locke 2013-04-02 13:54:36 UTC
(In reply to comment #14)
> Is it possible to get the backtrace when it panics?

I'm not sure.  As I said, the system doesn't respond to SysRq.  I haven't used the kernel debuggers before, but I'll look into it.  If you have any recommendations for how to get the backtrace, that'd be appreciated.
Comment 16 Aaron Lu 2013-04-02 14:02:00 UTC
So netconsole doesn't print anything?

Is serial console possible?
Comment 17 Kevin Locke 2013-04-02 14:31:16 UTC
Right, netconsole doesn't print anything additional compared to what is displayed on screen (so nothing about the freeze/crash).  Unfortunately, there are no serial ports on the machine.  I was just looking to see if it was possible to run a serial console over USB.  I'll keep digging and see if I can figure out a way to get some information out of it after it freezes/crashes.
Comment 18 Kevin Locke 2013-04-02 18:34:07 UTC
I'm at a loss for ideas.  The machine has no serial or IR ports, either on the laptop or in the docking stations.  I could get a USB-serial or ExpressCard-serial adapter and run the serial console on that, but I'm guessing either would require interrupts to be enabled for it to function (so would behave similarly to netconsole).  Think it's worth it?

I have also been unable to get CONFIG_HARDLOCKUP_DETECTOR to do anything useful.  Passing nmi_watchdog=panic on the kernel command line does cause "NMI watchdog: enabled on all CPUs, permanently consumes one hw-PMU counter." to be printed, but doesn't cause a reboot (or anything AFAICT). I'm not sure if that is symptomatic of the problem, or of how I am using it.

I've also recompiled with some ftrace tracers and tried "ftrace=function ftrace_dump_on_oops" without success.  Which may be expected, since I'm not sure if it is oopsing or looping or locking or just going out to lunch.

Any suggestions for other things to try, or perhaps some strategic printks that could be added to the video module?
Comment 19 Kevin Locke 2013-04-04 16:55:08 UTC
For what it's worth, I just tested the nvidia binary module and it worked reliably (with both the video and nouveau modules blacklisted).
Comment 20 Kevin Locke 2013-04-05 05:00:08 UTC
Through a bit of serendipity I also discovered that the system does not freeze when the kernel parameter "nox2apic" is added.
Comment 21 Aaron Lu 2013-04-07 01:29:52 UTC
(In reply to comment #19)
> For what it's worth, I just tested the nvidia binary module and it worked
> reliably (with both the video and nouveau modules blacklisted).

Do I understand correctly that, no matter which combination you use, the system will freeze as long as you load the video module?
Comment 22 Kevin Locke 2013-04-07 02:38:53 UTC
(In reply to comment #21)
> Do I understand correctly that, no matter which combination you use, the
> system will freeze as long as you load the video module?

Yes, that is correct.  As long as "nox2apic" is not present on the kernel cmdline, loading the video module will freeze the system in all cases tested so far about 90% of the time (in the other 10%, reloading it once or twice will cause it to freeze).
Comment 23 Aaron Lu 2013-04-07 02:44:50 UTC
If you do not load the video module, does the kernel freeze during normal use when you plug/unplug the AC adapter or close/open the lid?
Comment 24 Kevin Locke 2013-04-07 03:43:58 UTC
Good question.  I hadn't tried it.  Yes.  Opening and closing the lid with the nvidia binary driver loaded and X running will cause the system to freeze pretty reliably (seems to be a certainty if something like glxgears is running).
Comment 25 Aaron Lu 2013-04-07 04:18:42 UTC
I was wondering whether this freeze is related to the SCI interrupt (which is used for all ACPI events), since you mentioned that x2apic plays a role here :-)

I have checked your ACPI table; there should be an SCI interrupt after you plug/unplug the AC adapter, and if that is the problem, the system should also freeze.

If the system doesn't freeze, see whether /sys/firmware/acpi/interrupts/gpe11 increases after you plug/unplug the adapter.
Comment 26 Kevin Locke 2013-04-07 04:47:04 UTC
Yes /sys/firmware/acpi/interrupts/gpe11 does increase.  Testing just now it went from 3447 before unplugging up to 3660 after a few seconds.  Then after plugging it went up to 3927 after a few seconds.  Still no freezes while plugging/unplugging.
Comment 27 Aaron Lu 2013-04-07 04:49:09 UTC
And does the number in /proc/interrupts for acpi interrupts increase?
Comment 28 Kevin Locke 2013-04-07 05:06:54 UTC
Yes.  Testing just now it started at (2369 2582 695 457), went up to (2458 2664 724 470) after unplugging and finished at:

9:       2569       2792        756        488  IR-IO-APIC-fasteoi   acpi
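
For comparing counts like these before and after a plug/unplug event, the per-CPU columns of a /proc/interrupts row can be split out and summed. A minimal Python sketch, assuming the row layout shown above; the function name `parse_irq_row` is made up for illustration:

```python
def parse_irq_row(row):
    """Split one /proc/interrupts row into (irq, per_cpu_counts, description)."""
    irq, rest = row.split(":", 1)
    fields = rest.split()
    counts = []
    # The leading all-digit fields are the per-CPU counters; everything
    # after them (chip name, handler names) is kept as the description.
    while fields and fields[0].isdigit():
        counts.append(int(fields.pop(0)))
    return irq.strip(), counts, " ".join(fields)


# Row copied from the comment above.
row = "9:       2569       2792        756        488  IR-IO-APIC-fasteoi   acpi"
irq, counts, desc = parse_irq_row(row)
print(f"IRQ {irq} ({desc}): per-CPU {counts}, total {sum(counts)}")
```

On a live system the same function could be applied to each line of /proc/interrupts after the header row.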
Comment 29 Aaron Lu 2013-04-07 05:37:16 UTC
I don't know much about x2apic (maybe I need to learn about it now). Will it affect the graphics controller's interrupt routing?
Comment 30 Kevin Locke 2013-04-07 14:33:21 UTC
I'm not sure.  The only difference I see in /proc/interrupts is that ehci_hcd:usb3 is changed to ehci_hcd:usb1 and ehci_hcd:usb4 is changed to ehci_hcd:usb2 (they are on the same interrupts, just the names changed).  Also snd_hda_intel and iwlwifi swapped between IRQ 46 and 47.  If it would be helpful, I could upload a copy of /proc/interrupts (with/without x2apic, with/without X, with/without Discrete Graphics mode).  Let me know what would be useful.
Comment 31 Kevin Locke 2013-04-07 21:08:07 UTC
I hesitate to mention it, because it is probably unrelated, but on the off-chance that it may be useful:  When running with the BIOS configured for "nVidia Optimus" mode, which typically works quite well using the i915 driver (and with both the video and nouveau modules loaded), if I use vga_switcheroo to disable the nVidia card, sometimes processes will start to hang and the kernel reports "INFO: rcu_sched self-detected stall on CPU { 3}" with a backtrace that goes through nouveau_connector_detect, nv50_dac_power, and ends in nouveau_timer_wait_eq.

If it would be useful, I can narrow-down how to trigger the behavior and attach the logs and backtraces.  If not, I'll investigate further if it still occurs after this issue is resolved.
Comment 32 Aaron Lu 2013-04-15 05:53:35 UTC
Please try booting with acpi_osi="!Windows 2012" in the kernel command line, thanks.
Comment 33 Kevin Locke 2013-04-15 06:09:43 UTC
That works!  Mostly.  With acpi_osi="!Windows 2012" loading the video module (and nouveau module) never causes a freeze the first time it is loaded and it seems to work reliably.  I've tried it 5 or 10 times now and it never freezes during a normal boot.  If I drop to a shell early in boot and modprobe video && rmmod video it tends to freeze on the second load.  But I don't do that normally, so I'm ok with it.  ;)
Comment 34 Kevin Locke 2013-04-15 13:55:22 UTC
Ugh.  Looks like the behavior is not quite that simple because I was screwing up the test.

Correction:  With acpi_osi="!Windows 2012", if I boot and load the binary nvidia module, I can then reboot and video/nouveau will load without freezing pretty reliably (this is not 100%, probably 80 or 90) and keep rebooting with the same reliability.  However, if I shut down and give it a cold-boot, loading video/nouveau still freezes with the same frequency as before.

So it does appear to make a difference, but the problem's not quite solved yet.  Thanks for all of your effort on this!
Comment 35 Aaron Lu 2013-04-17 04:29:39 UTC
Hi Kevin,

Please attach the output of the following command:
# dd if=/dev/mem of=memdump bs=4096 count=1 skip=915357

I need to check some variables in the BIOS ASL table, thanks.
Comment 36 Kevin Locke 2013-04-17 22:17:31 UTC
Created attachment 99071 [details]
ASL variables in "Discrete Graphics" mode
Comment 37 Kevin Locke 2013-04-17 22:17:58 UTC
Created attachment 99081 [details]
ASL variables in "Discrete Graphics" mode with acpi_osi="!Windows 2012"
Comment 38 Kevin Locke 2013-04-17 22:18:24 UTC
Created attachment 99091 [details]
ASL variables in "nVidia Optimus" mode
Comment 39 Kevin Locke 2013-04-17 22:20:56 UTC
Sorry for the slow reply.  I've attached the output of the command you requested in both the "Discrete Graphics" and "nVidia Optimus" BIOS configurations and additionally the output with acpi_osi="!Windows 2012" (since I wasn't sure what was relevant).  All dumps were made without video/nouveau/intel/nvidia modules loaded.  If you'd like it run in any other state, let me know.
Comment 40 Aaron Lu 2013-04-18 07:47:11 UTC
Thanks Kevin, and actually, you are responding pretty fast, thanks a lot for the cooperation!

From the dump, it looks to me like:
VIGD is the variable that indicates whether the integrated graphics controller is in use;
VDSC is the variable that indicates whether the discrete graphics controller is in use;
So we get VIGD 0, VDSC 1 in "Discrete Graphics" mode whether or not !win8 is set. For "nVidia Optimus" mode, the base address is different, so that dump is of no use; you will need to change the skip value to skip=765853.

According to the table, the BIOS does not support backlight brightness control in win8 mode, and in win7 mode the control methods for the discrete graphics controller are all redirected to the control methods for the integrated graphics controller. Perhaps we need to disable win7 too and use vista mode to control the brightness level. Please add the following to the kernel command line:
acpi_osi="!Windows 2012" acpi_osi="!Windows 2009"
and see if it makes any difference, thanks.
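
For reference, the dd skip counts translate directly into physical addresses, because dd skips `skip` blocks of `bs` bytes before reading. A small Python sketch of that arithmetic; the helper name `skip_to_phys` is illustrative, and treating the result as the base of the region holding the ASL variables is an assumption drawn from this comment, not something checked against the DSDT:

```python
PAGE_SIZE = 4096  # matches bs=4096 in the dd command


def skip_to_phys(skip_blocks, block_size=PAGE_SIZE):
    """Return the physical address at which dd starts reading /dev/mem."""
    return skip_blocks * block_size


# skip values taken from comments 35 and 40
for mode, skip in (("Discrete Graphics", 915357), ("nVidia Optimus", 765853)):
    print(f"{mode}: skip={skip} -> {skip_to_phys(skip):#010x}")
```

This gives 0xdf79d000 for "Discrete Graphics" mode and 0xbaf9d000 for "nVidia Optimus" mode.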
Comment 41 Kevin Locke 2013-04-18 13:46:27 UTC
Created attachment 99231 [details]
ASL variables in "nVidia Optimus" mode (corrected offset)

Here's a dump of the memory with the corrected offset for "nVidia Optimus" mode.

I tried booting with acpi_osi="!Windows 2012" acpi_osi="!Windows 2009" and of my 5 attempts (all from a cold-boot this time) loading the video module caused a freeze 4 times and it succeeded once.
Comment 42 Aaron Lu 2013-04-19 10:15:56 UTC
When it freezes, can you still see the log, or is the screen black?
Comment 43 Kevin Locke 2013-04-19 12:55:58 UTC
When it freezes I can still see everything that was printed to the screen before it froze.
Comment 44 Aaron Lu 2013-04-22 05:52:30 UTC
Please try this kernel tree:
git://github.com/aaronlu/linux.git acpi_video

And boot kernel with: acpi.debug_layer=0x10000010 acpi.debug_level=0x4,
let's see if we can find something, thanks.
Comment 45 Kevin Locke 2013-04-22 14:07:21 UTC
Created attachment 99641 [details]
dmesg in "Discrete Graphics" mode with kernel e8df390f (from aaronlu/acpi_video)

I gave it a whirl and unfortunately I didn't see any differences or additional output (although the system froze a bit less frequently, I assume it was random chance).  I've attached the dmesg output produced after successfully loading the video module on one of the occasions when the system did not freeze.  When the system does freeze, no additional output is shown.
Comment 46 Aaron Lu 2013-04-22 23:47:41 UTC
Hi Kevin,

CONFIG_ACPI_DEBUG needs to be set for these debug messages to appear, sorry I didn't mention this. The document is here: Documentation/acpi/debug.txt.
Comment 47 Kevin Locke 2013-04-23 00:27:25 UTC
Created attachment 99801 [details]
dmesg in "Discrete Graphics" mode with kernel e8df390f (from aaronlu/acpi_video)

Ha!  My mistake.  I should have had that enabled from the start, I guess it just got overlooked.  My bad.

Here's the dmesg after one of the successful attempts to load the module in "Discrete Graphics" mode.  With the debugging options enabled on the kernel command line, it is surprisingly difficult to make the system freeze.  After ~6 attempts from a cold boot, none of them froze, so I reverted to loading and unloading the module in a loop which provoked a freeze after maybe 10 load/unload cycles (I can count exactly if it would be useful).  When it did freeze, the last lines printed to the screen were:

   video-0749 [00] video_init_brightness : max_level=100
ACPI: Execute Method [\_SB_.PCI0.PEG_.VID_.LCD0._BQC] (Node ffff88011822c538)
   video-0806 [00] video_init_brightness : initial backlight level is 100
ACPI: Execute Method [\_SB_.PCI0.PEG_.VID_.LCD0._BCM] (Node ffff88011822c560)
ACPI: Execute Method [\_SB_.PCI0.PEG_.VID_.LCD0._BQC] (Node ffff88011822c538)
   video-0668 [00] video_bqc_quirk       : test_level=0, returned_level=100
   video-0684 [00] video_bqc_quirk       : clear bqc cap
ACPI: Execute Method [\_SB_.PCI0.PEG_.VID_.LCD0._BCM] (Node ffff88011822c560)

Without the debug options on the kernel command line, the system still freezes fairly reliably, so printing to the screen does appear to be a relevant symptom.  I can collect more statistics on it if it would be useful or if you need the debugging output from a freeze after cold-boot.

Thanks Aaron!
Comment 48 Aaron Lu 2013-04-24 01:32:10 UTC
Created attachment 99871 [details]
Add some debug statement to dsdt table

I suspect the freeze occurs in an SMI call in the BIOS-provided ASL code: in integrated graphics mode, _BCM uses the AINT control method, while in discrete mode, _BCM uses SMI.

I've added some debug statements to the DSDT table; please override the system DSDT table with this one. I prefer to override the DSDT in the initrd, as described in Documentation/acpi/initrd_table_override.txt.

$ mkdir acpi_initrd
$ cd acpi_initrd
$ mkdir -p kernel/firmware/acpi
$ cp dsdt.aml kernel/firmware/acpi
$ find kernel | cpio -H newc --create > /boot/instrumented_initrd
$ cat /boot/your_original_initrd >>/boot/instrumented_initrd
Done. Boot with instrumented_initrd as the initrd.

Please keep using the kernel you built for your last test and keep setting acpi.debug_layer and acpi.debug_level, and add a new option: acpi.aml_debug_output=1.
Let's see if SMI is the culprit.
Comment 49 Kevin Locke 2013-04-24 04:47:23 UTC
Created attachment 99881 [details]
dmesg in "Discrete Graphics" mode with instrumented DSDT and kernel e8df390f (from aaronlu/acpi_video)

Here's the dmesg after one of the successful attempts to load the video module in "Discrete Graphics" mode booting with the modified DSDT and acpi.aml_debug_output=1 added to the kernel command line.  Note that this is not the same binary kernel as the previous test, since I needed to recompile with CONFIG_ACPI_INITRD_TABLE_OVERRIDE, but that should be the only change.

As before, with the debugging options enabled on the kernel
command line, it is surprisingly difficult to make the system freeze.  However, this time it did freeze from a cold boot on the 7th attempt.  When it froze, the last messages printed to screen were:

ACPI: Execute Method [\_SB_.PCI0.PEG_.VID_.LCD0._BCM] (Node ffff8800c7e2c560)
[ACPI Debug]  String [0x03] "SMI"
[ACPI Debug]  Integer 0x00000001
[ACPI Debug]  Integer 0x0000000A
[ACPI Debug]  Integer 0x0000000F
[ACPI Debug]  String [0x0D] "MSMI acquired"
Comment 50 Aaron Lu 2013-04-24 05:39:10 UTC
    Method (SMI, 5, NotSerialized)
    {
        Store ("SMI", debug)
        Store (Arg0, debug)
        Store (Arg1, debug)
        Store (Arg2, debug)
        Acquire (MSMI, 0xFFFF)
        Store ("MSMI acquired", debug)
        Store (Arg0, CMD)
        Store (0x01, ERR)
        Store (Arg1, PAR0)
        Store (Arg2, PAR1)
        Store (Arg3, PAR2)
        Store (Arg4, PAR3)
        Store (0xF5, APMC)
        Store ("Begin while", debug)
        While (LEqual (ERR, 0x01))
        {
            Sleep (0x01)
            Store (0xF5, APMC)
        }
        Store ("End while", debug)

        Store (PAR0, Local0)
        Release (MSMI)
        Store ("MSMI released", debug)
        Return (Local0)
    }

So most likely, it failed in Store (0xF5, APMC): "MSMI acquired" was printed but "Begin while" was not.
What that instruction does is probably trigger the SMI handler, which then acts on the values in CMD/ERR/PAR0/etc. The handler runs without the OS being aware of it.

BTW, I think the hang on closing the LID is also caused by an SMI call; you can test that too.
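
The reasoning above can be made mechanical: given the last string marker seen on screen, the statements between it and the next marker are the candidates for the hang. A rough Python sketch, with the method body transcribed from the listing in this comment; the integer debug stores for Arg0..Arg2 are left out, and `suspects` is a made-up helper name:

```python
# Instrumented SMI method in execution order. "marker" entries are the
# Store ("...", debug) strings; "stmt" entries are the real statements.
SMI_BODY = [
    ("marker", "SMI"),
    ("stmt", "Acquire (MSMI, 0xFFFF)"),
    ("marker", "MSMI acquired"),
    ("stmt", "Store (Arg0, CMD)"),
    ("stmt", "Store (0x01, ERR)"),
    ("stmt", "Store (Arg1, PAR0)"),
    ("stmt", "Store (Arg2, PAR1)"),
    ("stmt", "Store (Arg3, PAR2)"),
    ("stmt", "Store (Arg4, PAR3)"),
    ("stmt", "Store (0xF5, APMC)"),
    ("marker", "Begin while"),
    ("stmt", "While (LEqual (ERR, 0x01)) { Sleep (0x01); Store (0xF5, APMC) }"),
    ("marker", "End while"),
    ("stmt", "Store (PAR0, Local0)"),
    ("stmt", "Release (MSMI)"),
    ("marker", "MSMI released"),
]


def suspects(last_marker):
    """Return the statements executed after last_marker and before the next marker."""
    body = iter(SMI_BODY)
    for kind, text in body:
        if kind == "marker" and text == last_marker:
            break
    else:
        raise ValueError(f"unknown marker: {last_marker!r}")
    found = []
    for kind, text in body:  # continues right after the matched marker
        if kind == "marker":
            break
        found.append(text)
    return found


for stmt in suspects("MSMI acquired"):
    print(stmt)
```

For "MSMI acquired" this lists the six field stores plus the APMC write. Of those, only the APMC write is expected to trigger the SMI handler, which is why it is the prime suspect.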
Comment 51 Kevin Locke 2013-04-24 14:10:15 UTC
Created attachment 99911 [details]
dmesg after LID event with instrumented DSDT and kernel e8df390f (from aaronlu/acpi_video)

That certainly sounds plausible.  Is there anything we can do about it?

To test the hang on LID event, I booted with the same options and opened/closed the lid a few times without loading the video module.  I have attached the dmesg after the first LID event.  After ~5 tries, it froze.  The last messages it printed to screen were:

[ACPI Debug]  String [0x0D] "MSMI released"
[ACPI Debug]  String [0x03] "SMI"
[ACPI Debug]  Integer 0x00000001
[ACPI Debug]  Integer 0x00000008
[ACPI Debug]  Integer 0x00000002
[ACPI Debug]  String [0x0D] "MSMI acquired"

I did this twice, and both times it froze with the same lines visible on screen.
Comment 52 Aaron Lu 2013-04-25 01:26:34 UTC
Unfortunately, I don't see anything we can do about it. It is difficult, if not impossible, for us to debug this without knowing what happens inside the SMI handler.

One thing I noticed is that other system models, like the x230, have similar ASL code, but no such problem has been reported for them.

Does the LCD hang happen in discrete mode only, or in both modes?
Comment 53 Kevin Locke 2013-04-25 01:37:10 UTC
That's quite alright.  It only hangs in discrete mode and I can work around it by passing nox2apic on the kernel command line or using the nvidia binary driver and not closing the lid.  So it's not a big deal for me.  The only time I need discrete mode is to do graphics-intensive stuff, which isn't particularly frequent anyway.

Feel free to leave the bug open.  Perhaps Lenovo will be kind enough to fix their firmware.
Comment 54 Aaron Lu 2013-07-05 02:25:11 UTC
I came across a patch today:
https://lkml.org/lkml/2011/7/29/414
which says:

Some of the OEM platforms are running into issues because of this, as their
bios is not x2apic aware. For example, this was resulting in interrupt migration
issues on one of the platforms. Also if the BIOS SMI handling uses APIC
interface to send SMI's, then the BIOS need to be aware of x2apic mode
that OS has enabled.

I suppose the BIOS's SMI handler here is not aware of x2apic mode, and yet the BIOS doesn't set the opt_out flag. BTW, did you enable the IOMMU driver in your kernel config? The opt_out logic is implemented in the IOMMU driver. I suppose you have; at least the 3.2 kernel from Debian should have that config.

Also, I guess Windows doesn't enable x2apic. Is there any way to verify this? Thanks.
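
Whether the BIOS sets the opt-out flag can be read from the flags byte of the ACPI DMAR table. A minimal Python sketch, assuming the DMAR layout as I understand it from the VT-d specification (36-byte ACPI header, host address width at offset 36, flags at offset 37, bit 0 = INTR_REMAP, bit 1 = X2APIC_OPT_OUT); `dmar_flags` and `fake_dmar` are illustrative names, and the offsets should be checked against the specification before relying on them:

```python
import struct

DMAR_FLAGS_OFFSET = 37   # assumed: 36-byte ACPI header + 1-byte host address width
DMAR_MIN_LENGTH = 48     # header + width + flags + 10 reserved bytes
INTR_REMAP = 0x1         # bit 0: platform supports interrupt remapping
X2APIC_OPT_OUT = 0x2     # bit 1: firmware asks the OS not to enable x2APIC


def dmar_flags(table):
    """Decode the flags byte of a raw DMAR table."""
    if len(table) < DMAR_MIN_LENGTH or table[:4] != b"DMAR":
        raise ValueError("not a DMAR table")
    flags = table[DMAR_FLAGS_OFFSET]
    return {
        "intr_remap": bool(flags & INTR_REMAP),
        "x2apic_opt_out": bool(flags & X2APIC_OPT_OUT),
    }


def fake_dmar(flags):
    """Build a synthetic, otherwise empty DMAR table for demonstration."""
    header = b"DMAR" + struct.pack("<I", DMAR_MIN_LENGTH) + bytes(28)
    return header + bytes([39, flags]) + bytes(10)


# On a real system the table would be read (as root) from
# /sys/firmware/acpi/tables/DMAR instead of being synthesized.
print(dmar_flags(fake_dmar(INTR_REMAP)))
print(dmar_flags(fake_dmar(INTR_REMAP | X2APIC_OPT_OUT)))
```

A table that reports interrupt remapping support without the opt-out bit would match the situation suspected here.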
Comment 55 Kevin Locke 2013-07-05 02:49:43 UTC
Great find!  That could explain why it works with "nox2apic".  I do have the kernel configured with CONFIG_INTEL_IOMMU=y, but CONFIG_INTEL_IOMMU_DEFAULT_ON is not set and I wasn't specifying intel_iommu=on during the testing (although I do in my usual configuration).  Could that make a difference?

This bug is probably due for some re-testing on my part (since there has been at least one BIOS update since I last tested).  I'll try to get to it soon.

Microsoft KB2303458 (and the doc for the x2apicpolicy option of bcdedit) suggests that Server 2008 R2 uses x2apic, but I'm not sure about Windows 7.  I can see if bcdedit supports that option on Windows 7.
Comment 56 Aaron Lu 2013-07-05 03:06:07 UTC
(In reply to Kevin Locke from comment #55)
> Great find!  That could explain why it works with "nox2apic".  I do have the
> kernel configured with CONFIG_INTEL_IOMMU=y, but
> CONFIG_INTEL_IOMMU_DEFAULT_ON is not set and I wasn't specifying
> intel_iommu=on during the testing (although I do in my usual configuration).
> Could that make a difference?

It shouldn't make any difference.
I should have asked: is CONFIG_IRQ_REMAP set? From looking at the source code, the logic that detects whether the BIOS has set the x2apic flag is in the intel_enable_irq_remapping function, which is compiled in only if CONFIG_IRQ_REMAP is set.

> Microsoft KB2303458 (and the doc for the x2apicpolicy option of bcdedit)
> suggests that Server 2008 R2 uses x2apic, but I'm not sure about Windows 7. 
> I can see if bcdedit supports that option on Windows 7.

OK, thanks.
Comment 57 Kevin Locke 2013-07-13 21:07:25 UTC
Ok.  I finally got around to re-testing.  I can confirm that CONFIG_IRQ_REMAP=y is set in the config and that the symptoms are unchanged as far as I can tell when testing with 3.10.1 using BIOS G1ET94WW (2.54).  I haven't tested with the modified DSDT, but I could if it might be useful.
Comment 58 Aaron Lu 2013-07-15 06:47:29 UTC
Thanks. So the problem is that Linux enables x2apic because interrupt remapping is enabled for the IOMMU, and that causes the hang in the SMI handler. Did you manage to check whether Windows has enabled x2apic on your system?
Comment 59 Aaron Lu 2013-07-16 02:45:40 UTC
Adding Youquan and David.

Kevin has reported that loading the ACPI video module causes a system hang, and that adding nox2apic works around the problem. I found that the hang occurs in the SMI handler. Any ideas? Thanks.
Comment 60 Song Youquan 2013-07-17 15:44:47 UTC
Fine.

There are similar issues on the Lenovo ThinkPad T420, W520 and L520. I have written a patch that disables x2APIC on these platforms to work around the issue. Refer to:
https://lkml.org/lkml/2012/12/18/1 [PATCH] x86,apic: Blacklist x2APIC on some platforms

Admittedly, the patch is not an elegant solution.

I plan to look for a complete fix for this issue, since a whole series of platforms will be affected.

Unfortunately, I do not have any affected platform to debug on.

Hi Kevin Locke, can you attach two dmesg logs: one booted with the kernel options "apic=debug intremap=off" and one booted with "apic=debug"?

In addition, I found a possibly related issue in the ACPI BIOS. After parsing the file "acpidump in "Discrete Graphics" mode" attached to the description, I see that the ACPI DMAR table includes only one DMAR engine, but as far as I can tell the platform's processor includes at least two DMAR engines.


Thanks
-Youquan
Comment 61 Kevin Locke 2013-07-18 13:55:56 UTC
Created attachment 106928 [details]
dmesg in "Discrete Graphics" mode with apic=debug

Aaron, I haven't found a way to test if x2apic is enabled in Windows yet, but I'm still looking.  My documentation search has turned up nothing.  I need to ask some knowledgeable people if they have any ideas for how to test.

Hello Youquan.  Here's the first of the two dmesg logs.  I'll attach the other shortly.

Thanks!
Comment 62 Kevin Locke 2013-07-18 13:56:28 UTC
Created attachment 106929 [details]
dmesg in "Discrete Graphics" mode with apic=debug intremap=off
Comment 63 Aaron Lu 2013-07-19 01:51:15 UTC
Perhaps force-enable x2apic and see whether Windows still works? This seems to be a way to force-enable x2apic:
bcdedit /set x2apicpolicy enable
But I don't know if it applies to all Windows versions or just some server editions. Also, I'm not sure how Windows handles the backlight; it may or may not use ACPI to change brightness.
Comment 64 Kevin Locke 2013-07-19 13:01:57 UTC
In Windows, I installed the IBM Performance Inspector and ran msr <http://perfinsp.sourceforge.net/msr.html> which produced the following output:

C:\> msr -r APIC_BASE

***** msr v2.0.7 for x64 *****
CPU0  msr 0x1B = 0x00000000:FEE00900 (4276095232)
CPU1  msr 0x1B = 0x00000000:FEE00800 (4276094976)
CPU2  msr 0x1B = 0x00000000:FEE00800 (4276094976)
CPU3  msr 0x1B = 0x00000000:FEE00800 (4276094976)

If I read the x2APIC Spec correctly, it looks like xAPIC enable is 0x800 and x2APIC enable is 0x400.  So from the above values, xAPIC is enabled and x2APIC is disabled.  If you think it might be useful, I can try setting x2apicpolicy to enabled to force x2APIC in Windows.  But it looks like Windows 7 does not enable it by default on this machine.
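
The reading above can be double-checked with a small sketch that decodes the APIC_BASE values from the msr output. The bit positions are the ones cited above (0x800 enable, 0x400 x2APIC) plus the BSP flag at 0x100, which explains why CPU0 reads 0x900; the function names are only illustrative.

```python
APIC_BSP = 1 << 8             # 0x100: this CPU is the bootstrap processor
APIC_X2APIC_ENABLE = 1 << 10  # 0x400: x2APIC mode enabled
APIC_GLOBAL_ENABLE = 1 << 11  # 0x800: APIC globally enabled (xAPIC)


def decode_apic_base(value: int) -> dict:
    """Split an APIC_BASE (MSR 0x1B) value into its flag bits and base address."""
    return {
        "bsp": bool(value & APIC_BSP),
        "x2apic": bool(value & APIC_X2APIC_ENABLE),
        "enabled": bool(value & APIC_GLOBAL_ENABLE),
        "base": value & ~0xFFF,
    }


def apic_mode(value: int) -> str:
    """Name the APIC mode implied by the enable bits."""
    fields = decode_apic_base(value)
    if fields["enabled"]:
        return "x2APIC" if fields["x2apic"] else "xAPIC"
    # x2APIC bit set without the global enable bit is not a valid combination.
    return "invalid" if fields["x2apic"] else "disabled"


# Values reported by msr above: CPU0 and CPU1..CPU3.
for cpu, value in ((0, 0xFEE00900), (1, 0xFEE00800)):
    print(cpu, apic_mode(value), decode_apic_base(value))
```

Both values decode to xAPIC mode with base 0xFEE00000; only CPU0 has the BSP flag set.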
Comment 65 Aaron Lu 2013-07-22 00:56:20 UTC
Thanks Kevin.

So this suggests we should not enable x2APIC mode on this system. Youquan, what do you think? Should we simply blacklist it, or do you have other ideas?
Comment 66 Aaron Lu 2013-07-23 03:16:05 UTC
Hi Youquan,

I think the problem is in the interrupt remapping code: it enables x2apic on some systems where it shouldn't.
Comment 67 Song Youquan 2013-07-23 03:43:05 UTC
Hi Kevin,

Please add kernel option "x2apic_phys" and try it.

Hi Aaron,

To my knowledge, this is a hardware/BIOS-specific issue. Possible reasons: 1) the BIOS, or the BIOS SMI handler, is not x2APIC compatible (for example, with respect to serializing access semantics); 2) the SMI is routed to the CPU as a Virtual Legacy Wire (VLW) message over DMI, while VT-d interrupt remapping is unaware of the SMI message type and the SMI is not in x2APIC format.

Thanks
-Youquan
Comment 68 Kevin Locke 2013-07-23 04:22:30 UTC
Created attachment 106993 [details]
dmesg in "Discrete Graphics" mode with x2apic_phys

Hi Youquan,

When booting with kernel option "x2apic_phys" the behavior appears to be the same as when booting without any kernel options.  The system will reliably freeze if the video module is loaded and unloaded a few times (although in my few tests it didn't freeze the first time the video module was loaded).  I've attached the dmesg for reference.  Let me know if there's anything else you'd like me to test.

Thanks,
Kevin
Comment 69 Aaron Lu 2013-07-29 06:58:56 UTC
A similar problem on DELL XPS 15z is reported:
https://lkml.org/lkml/2013/7/27/102
Comment 70 Song Youquan 2013-07-30 01:56:33 UTC
Thanks a lot Kevin for your support!

1.
Can you provide the output of "cat /proc/interrupts", "lspci -vvvnnn" and "dmesg" in "Discrete Graphics" mode with the apic=debug kernel option?


2.

Apply the patch below and build the kernel. When booting it, add the kernel option "intremap=off" and check whether the issue still happens. Also, please provide the "dmesg" output.

diff --git a/arch/x86/kernel/apic/apic.c b/arch/x86/kernel/apic/apic.c
index 904611b..51a065a 100644
--- a/arch/x86/kernel/apic/apic.c
+++ b/arch/x86/kernel/apic/apic.c
@@ -1603,11 +1603,8 @@ void __init enable_IR_x2apic(void)
                goto skip_x2apic;

        if (ret < 0) {
-               /* IR is required if there is APIC ID > 255 even when running
-                * under KVM
-                */
-               if (max_physical_apicid > 255 ||
-                   !hypervisor_x2apic_available()) {
+               /* IR is required if there is APIC ID > 255 */
+               if (max_physical_apicid > 255) {
                        if (x2apic_preenabled)
                                disable_x2apic();
                        goto skip_x2apic;
--
1.7.7.4
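
As a reading aid only, the condition changed by the hunk above can be modelled in a few lines. This sketch covers just the `ret < 0` branch shown in the diff (interrupt remapping could not be enabled) and ignores the rest of enable_IR_x2apic(); the Python function names are invented for the example.

```python
def skip_x2apic_original(ir_failed: bool, max_physical_apicid: int,
                         hypervisor_x2apic_available: bool) -> bool:
    """Unpatched condition: without IR, x2apic is skipped unless a hypervisor offers it."""
    return ir_failed and (max_physical_apicid > 255
                          or not hypervisor_x2apic_available)


def skip_x2apic_patched(ir_failed: bool, max_physical_apicid: int) -> bool:
    """Patched condition: without IR, x2apic is skipped only for APIC IDs above 255."""
    return ir_failed and max_physical_apicid > 255


# Bare metal (no hypervisor), IR not enabled, small APIC IDs:
print(skip_x2apic_original(True, 3, False))  # True: x2apic is skipped
print(skip_x2apic_patched(True, 3))          # False: the x2apic path continues
```

Read this way, the patch appears intended to let x2apic be tried without interrupt remapping on bare metal, which would separate the effect of x2apic from that of interrupt remapping.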


Thanks
-Youquan
Comment 71 Kevin Locke 2013-07-31 15:47:51 UTC
Created attachment 107054 [details]
dmesg in "Discrete Graphics" mode with apic=debug, kernel 3.10.1, bios 2.54
Comment 72 Kevin Locke 2013-07-31 15:48:40 UTC
Created attachment 107055 [details]
/proc/interrupts in "Discrete Graphics" mode with apic=debug, kernel 3.10.1, bios 2.54
Comment 73 Kevin Locke 2013-07-31 15:49:06 UTC
Created attachment 107056 [details]
lspci -vvvnnn in "Discrete Graphics" mode with apic=debug, kernel 3.10.1, bios 2.54
Comment 74 Kevin Locke 2013-07-31 15:57:18 UTC
Created attachment 107057 [details]
dmesg in "Discrete Graphics" mode with intremap=off, kernel 3.10.1 with patch from comment 70, bios 2.54

Here's the dmesg with the patch from comment 70 applied.  Unfortunately, it did not solve the problem.  I could still provoke a freeze by modprobe video && rmmod video repeatedly.  It did seem a bit more stable; it would loop 10-20 times before freezing, rather than the usual 5-10, although my sample size is probably too small to say that with any certainty.

Thanks again, much appreciated!
Comment 75 Song Youquan 2013-07-31 16:04:45 UTC
Thanks a lot Kevin!

While the Nvidia graphics driver is loaded and before the system freezes, can you provide lspci -xxxvvvnnn?  The previous lspci output does not show the Nvidia graphics driver loaded correctly.

Thanks
-Youquan
Comment 76 Kevin Locke 2013-07-31 16:08:20 UTC
Sure.  Would you prefer the nouveau driver loaded (may or may not freeze) or the nvidia binary driver (always loads without freezing)?
Comment 77 Song Youquan 2013-07-31 16:17:06 UTC
I prefer the nouveau driver loaded, since it may freeze.
It would be better if you also provided the whole dmesg, captured after you run the loop and before the freeze, with the "acpi=debug" option.

Thanks
-Youquan
Comment 78 Kevin Locke 2013-07-31 16:22:10 UTC
Ok.  To confirm:  I'll get lspci -xxxvvvnnn after nouveau has been loaded, and dmesg after nouveau and video have been unloaded.  Also, did you want "acpi=debug" or "apic=debug" (as before)?
Comment 79 Kevin Locke 2013-07-31 16:40:05 UTC
Created attachment 107058 [details]
lspci -xxxvvvnnn in "Discrete Graphics" mode with apic=debug, kernel 3.10.1 with patch from comment 70, bios 2.54, with nouveau loaded
Comment 80 Kevin Locke 2013-07-31 16:42:04 UTC
Created attachment 107059 [details]
dmesg in "Discrete Graphics" mode with apic=debug, kernel 3.10.1 with patch from comment 70, bios 2.54, after video loaded and unloaded

Here's the dmesg with apic=debug (I can get acpi=debug as well if it would be useful).  I got it after the video module was loaded and unloaded since if I load nouveau it can not be unloaded (or I can't find the right module to unload first to allow it to be unloaded).
Comment 81 Song Youquan 2013-08-01 14:03:50 UTC
From the last two comments, the discrete graphics mode works and shows the desktop without issue. Right?

Can you also provide "cat /proc/interrupts" in that situation?

What are the steps to reproduce the hang caused by "modprobe video"? I will put an Nvidia card in my desktop to check whether I can reproduce the issue.

Thanks
-Youquan
Comment 82 Kevin Locke 2013-08-01 14:28:21 UTC
The issue is that the system hangs on startup during `udevadm trigger --action=add`.  It does not hang every time; sometimes the system will boot without hanging, but when I last tried it the system would freeze ~90% of the time.  The problem is alleviated if the video and nouveau modules are blacklisted.

If I drop to a shell before `udevadm trigger --action=add` is run, I can usually load the video module without causing the system to hang (presumably the issue is aggravated by multiple modules being loaded/initialized in short succession).  I can always provoke a freeze by running "for i in `seq 1 100` ; do modprobe video ; rmmod video ; done", usually after a few iterations, sometimes after as many as 30.
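
For anyone who wants to script the reproduction and see how far it got, the shell loop above can be sketched roughly as follows. The function name stress_module and its run parameter are made up for this example; it must be run as root, and it prints each iteration as it goes so that the last completed cycle is visible on the console if the machine freezes.

```python
import subprocess


def stress_module(module="video", iterations=100, run=subprocess.run):
    """Load and unload a kernel module repeatedly; return the cycles completed.

    `run` defaults to subprocess.run and can be replaced for a dry run.
    """
    completed = 0
    for i in range(1, iterations + 1):
        run(["modprobe", module], check=True)
        run(["rmmod", module], check=True)
        completed = i
        # Flush immediately: buffered output would be lost in a freeze.
        print(f"cycle {i} of {iterations} completed", flush=True)
    return completed
```

Calling stress_module() with no arguments matches the 100 iterations of the shell loop.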

I think the issue is specific to a small set of machines including the T430 (or perhaps specific to the T430), otherwise all nVidia users would be experiencing intermittent freezes during boot.

I will get a copy of /proc/interrupts after a successful boot as soon as I have some spare time.

Thanks,
Kevin
Comment 83 Aaron Lu 2013-09-05 05:57:09 UTC
Hi Youquan,

May I know your opinion about this bug?
Comment 84 Kevin Locke 2015-12-01 23:04:13 UTC
Hi Aaron and Youquan,

My apologies for letting this bug linger for so long.  I had forgotten that you were waiting on me for a copy of /proc/interrupts in addition to Youquan's thoughts.  Since I had an effective workaround (adding "nox2apic") I had simply forgotten to check up on this issue.

I can confirm that the issue is still present in Linux 4.3 with the latest BIOS (G1ETA9WW - 2.69) as of today.

A quick synopsis of the issue as I understand it:

My Lenovo ThinkPad T430 will freeze when the video module is loaded if the BIOS is configured in "Discrete Graphics" mode.  The freeze does not occur immediately on every boot, but repeatedly loading and unloading the video module provokes it reliably.

The freeze appears to be related to x2APIC (which is not enabled by Windows 7).  Disabling x2APIC with the "nox2apic" command-line option reliably avoids the issue.  Aaron has a more detailed synopsis in comment 58.

It looks like Youquan's (non-ideal) patch from comment 60 <https://lkml.org/lkml/2012/12/18/1> was never merged, nor the patch that it was waiting on <https://lkml.org/lkml/2013/3/18/892>.

There are a couple bugs with different machines/models that appear to be the same (or similar) issue:
https://bugzilla.kernel.org/show_bug.cgi?id=34262
https://bugzilla.kernel.org/show_bug.cgi?id=43054

Since it has been a few years and I've changed both BIOS and kernel versions significantly since then, would it be helpful to get fresh comparable copies of some/all of the previously attached files?  What would be most useful?  How else can I help?

Thanks and sorry for the very long delay,
Kevin
Comment 85 Neil McCirrus 2016-07-12 18:43:04 UTC
Apologies if I am posting this in the wrong thread.

>inxi -S
System: Host: hawker64 Kernel: 4.6.4-1-ck x86_64 (64 bit) Desktop: dwm 6.1 Distro: Arch Linux
>inxi -M 
Machine Mobo: ASUSTeK model: P6T SE v: Rev 1.xx Bios: American Megatrends v: 0908 date: 09/21/2010
>inxi -G
Graphics: Card: Advanced Micro Devices [AMD/ATI] RV770 [Radeon HD 4870]
           Display Server: X.Org 1.18.3 driver: radeon Resolution: 1920x1080@60.00hz, 1920x1080@60.00hz
           GLX Renderer: Gallium 0.4 on AMD RV770 (DRM 2.43.0, LLVM 3.8.0) GLX Version: 3.0 Mesa 11.2.2

I put it down to the Intel 55x0 chipset errata, the interrupt remapping issue (Intel 5500/5520/X58 chipset revisions 0x13 and 0x22 have errata (#47 and #53) which make the IOMMU interrupt remapping unit unreliable. This erratum causes interruptions, and the interrupt remapping invalidations become unresponsive) https://forums.gentoo.org/viewtopic-t-1030102-start-0.html?sid=59c8eddb43e0553296f93355ea10b42d
Below are some snippets from logs that I *think* may be relevant. Occasionally I get hard lockups where only a hard reboot will suffice; on other occasions I just lose network connectivity. This started happening for me with the 4.* kernels; if I roll back to, say, kernel 3.19-1, all is well. So again I *guess* it's a kernel regression, either that or my issue is a mixture of this and the IOMMU thing. (It happens when using linux-ck or stock Arch Linux kernels, with kvm disabled in the BIOS.)

perf: interrupt took too long (2711 > 2500), lowering kernel.perf_event_max_sample_rate to 73000
 perf: interrupt took too long (3512 > 3388), lowering kernel.perf_event_max_sample_rate to 56000
 perf: interrupt took too long (4459 > 4390), lowering kernel.perf_event_max_sample_rate to 44000
 perf: interrupt took too long (5613 > 5573), lowering kernel.perf_event_max_sample_rate to 35000
hawker64 kernel: [drm:radeon_cs_ioctl [radeon]] *ERROR* Failed to schedule IB !
hawker64 kernel: [drm:radeon_cs_ioctl [radeon]] *ERROR* Failed to schedule IB !
hawker64 kernel: radeon 0000:02:00.0: scheduling IB failed (-2).
hawker64 kernel: [drm:radeon_cs_ioctl [radeon]] *ERROR* Failed to schedule IB !
hawker64 kernel: r8169 0000:05:00.0 enp5s0: rtl_counters_cond == 1 (loop: 1000, delay: 10).
hawker64 kernel: r8169 0000:05:00.0 enp5s0: rtl_counters_cond == 1 (loop: 1000, delay: 10).
hawker64 kernel: INFO: rcu_preempt detected stalls on CPUs/tasks:
hawker64 kernel:         1-...: (0 ticks this GP) idle=9b2/0/0 softirq=4017531/4017531 fqs=0 
hawker64 kernel:         2-...: (6 GPs behind) idle=dd2/0/0 softirq=2426637/2426637 fqs=0 
hawker64 kernel:         3-...: (3 GPs behind) idle=e78/0/0 softirq=1361775/1361777 fqs=0 
hawker64 kernel:         4-...: (38 GPs behind) idle=3e0/0/0 softirq=440832/440833 fqs=0 
hawker64 kernel:         6-...: (1 GPs behind) idle=efe/0/0 softirq=305520/305520 fqs=0 
hawker64 kernel:         7-...: (1 GPs behind) idle=a2a/0/0 softirq=204822/204822 fqs=0 
hawker64 kernel:         (detected by 0, t=127647 jiffies, g=2627574, c=2627573, q=9503)
hawker64 kernel: rcu_preempt kthread starved for 127647 jiffies! g2627574 c2627573 f0x0 RCU_GP_WAIT_FQS(3) ->state=0x0
Jul 01 14:13:57 hawker64 login[541]: pam_systemd(login:session): Failed to release session: Connection reset by peer
Jul 01 14:13:57 hawker64 systemd-logind[507]: Failed to abandon session scope: Transport endpoint is not connected
-- Reboot --
