Bug 56051
Created attachment 96781 [details]
acpidump in "nVidia Optimus" mode
Created attachment 96791 [details]
kernel config
Created attachment 96801 [details]
dmesg in "Discrete Graphics" mode
Created attachment 96811 [details]
dmesg in "nVidia Optimus" mode
Created attachment 96821 [details]
/proc/iomem in "Discrete Graphics" mode
Created attachment 96831 [details]
/proc/iomem in "nVidia Optimus" mode
Created attachment 96841 [details]
/proc/ioports in "Discrete Graphics" mode
Created attachment 96851 [details]
/proc/ioports in "nVidia Optimus" mode
Created attachment 96861 [details]
lspci -vvvxxx in "Discrete Graphics" mode
Created attachment 96871 [details]
lspci -vvvxxx in "nVidia Optimus" mode
Hi Kevin,

So the system will freeze when you:
# modprobe video

May I know if i915 module is already loaded? I think the video module should be auto loaded when you boot the system, why you would need to manually load it?

Thanks.

Hi Aaron,

Yes, exactly, the system freezes when I
# modprobe video

You are correct, the video module is auto-loaded when I boot the system. During a normal boot, the system freezes during `udevadm trigger --action=add`. To troubleshoot further, I had started dropping to a shell before `udevadm trigger --action=add` and loading modules individually to attempt to determine an exact cause. During testing, the i915 module was not yet loaded (I had assumed that loading it would load the video module as a dependency and cause the same behavior - I can test this if there is any doubt).

Thanks.

I can confirm that loading either i915 or nouveau will cause the system to freeze as well.

In testing this morning, the behavior was slightly less reliable (I actually got to the desktop a few times, perhaps 2/15 attempts) and if the system did not freeze within a few seconds, it was stable until I shut it down (even under heavy 3D video load).

However, even in the cases where the system did not freeze immediately, I could provoke a freeze by `rmmod video && modprobe video` a few times.

(In reply to comment #13)
> I can confirm that loading either i915 or nouveau will cause the system to
> freeze as well.
>
> In testing this morning, the behavior was slightly less reliable (I actually
> got to the desktop a few times, perhaps 2/15 attempts) and if the system did
> not freeze within a few seconds, it was stable until I shut it down (even
> under heavy 3D video load).
>
> However, even in the cases where the system did not freeze immediately, I
> could provoke a freeze by `rmmod video && modprobe video` a few times.

Is it possible to get the backtrace when it panic?

(In reply to comment #14)
> Is it possible to get the backtrace when it panic?
I'm not sure. As I said, the system doesn't respond to SysRq. I haven't used the kernel debuggers before, but I'll look into it. If you have any recommendations for how to get the backtrace, that'd be appreciated. So netconsole doesn't print anything? Is serial console possible? Right, netconsole doesn't print anything additional compared to what is displayed on screen (so nothing about the freeze/crash). Unfortunately, there are no serial ports on the machine. I was just looking to see if it was possible to run a serial console over USB. I'll keep digging and see if I can figure out a way to get some information out of it after it freezes/crashes. I'm at a loss for ideas. The machine has no serial or IR ports, either on the laptop or in the docking stations. I could get a USB-serial or ExpressCard-serial adapter and run the serial console on that, but I'm guessing either would require interrupts to be enabled for it to function (so would behave similarly to netconsole). Think it's worth it? I have also been unable to get CONFIG_HARDLOCKUP_DETECTOR to do anything useful. Passing nmi_watchdog=panic on the kernel command line does cause "NMI watchdog: enabled on all CPUs, permanently consumes one hw-PMU counter." to be printed, but doesn't cause a reboot (or anything AFAICT). I'm not sure if that is symptomatic of the problem, or of how I am using it. I've also recompiled with some ftrace tracers and tried "ftrace=function ftrace_dump_on_oops" without success. Which may be expected, since I'm not sure if it is oopsing or looping or locking or just going out to lunch. Any suggestions for other things to try, or perhaps some strategic printks that could be added to the video module? For what it's worth, I just tested the nvidia binary module and it worked reliably (with both the video and nouveau modules blacklisted). Through a bit of serendipity I also discovered that the system does not freeze when the kernel parameter "nox2apic" is added. 
(In reply to comment #19) > For what it's worth, I just tested the nvidia binary module and it worked > reliably (with both the video and nouveau modules blacklisted). Do I understand correctly that no matter in which combinations, as long as you load video module, system will freeze. Is this correct? (In reply to comment #21) > Do I understand correctly that no matter in which combinations, as long as > you > load video module, system will freeze. Is this correct? Yes, that is correct. As long as "nox2apic" is not present on the kernel cmdline, loading the video module will freeze the system in all cases tested so far about 90% of the time (in the other 10%, reloading it once or twice will cause it to freeze). If you do not load video module, then during normal use, will the kernel freeze when you plug/unplug the ac adapter or close/open the LID? Good question. I hadn't tried it. Yes. Opening and closing the lid with the nvidia binary driver loaded and X running will cause the system to freeze pretty reliably (seems to be a certainty if something like glxgears is running). I was just guessing if this freeze is related to SCI interrupt(this interrupt is used for all ACPI events), since you mentioned x2apic played a role here :-) I have checked your acpi table, there should be a SCI interrupt after you plug/unplug the ac adapter, and if that is the problem, the system should also freeze. See if /sys/firmware/acpi/interrupts/gpe11 increases after you plug/unplug the adapter, it the systems doesn't freeze. Yes /sys/firmware/acpi/interrupts/gpe11 does increase. Testing just now it went from 3447 before unplugging up to 3660 after a few seconds. Then after plugging it went up to 3927 after a few seconds. Still no freezes while plugging/unplugging. And does the number in /proc/interrupts for acpi interrupts increase? Yes. 
Testing just now it started at (2369 2582 695 457), went up to (2458 2664 724 470) after unplugging and finished at: 9: 2569 2792 756 488 IR-IO-APIC-fasteoi acpi I don't know much about x2apic(maybe I need to learn it now), will that affect the graphics controller's interrupt routing? I'm not sure. The only difference I see in /proc/interrupts is that ehci_hcd:usb3 is changed to ehci_hcd:usb1 and ehci_hcd:usb4 is changed to ehci_hcd:usb2 (they are on the same interrupts, just the names changed). Also snd_hda_intel and iwlwifi swapped between IRQ 46 and 47. If it would be helpful, I could upload a copy of /proc/interrupts (with/without x2apic, with/without X, with/without Discrete Graphics mode). Let me know what would be useful. I hesitate to mention it, because it is probably unrelated, but on the off-chance that it may be useful: When running with the BIOS configured for "nVidia Optimus" mode, which typically works quite well using the i915 driver (and with both the video and nouveau modules loaded), if I use vga_switcheroo to disable the nVidia card, sometimes processes will start to hang and the kernel reports "INFO: rcu_sched self-detected stall on CPU { 3}" with a backtrace that goes through nouveau_connector_detect, nv50_dac_power, and ends in nouveau_timer_wait_eq. If it would be useful, I can narrow-down how to trigger the behavior and attach the logs and backtraces. If not, I'll investigate further if it still occurs after this issue is resolved. Please try booting with acpi_osi="!Windows 2012" in the kernel command line, thanks. That works! Mostly. With acpi_osi="!Windows 2012" loading the video module (and nouveau module) never causes a freeze the first time it is loaded and it seems to work reliably. I've tried it 5 or 10 times now and it never freezes during a normal boot. If I drop to a shell early in boot and modprobe video && rmmod video it tends to freeze on the second load. But I don't do that normally, so I'm ok with it. ;) Ugh. 
Looks like the behavior is not quite that simple because I was screwing up the test. Correction: With acpi_osi="!Windows 2012", if I boot and load the binary nvidia module, I can then reboot and video/nouveau will load without freezing pretty reliably (this is not 100%, probably 80 or 90) and keep rebooting with the same reliability. However, if I shut down and give it a cold-boot, loading video/nouveau still freezes with the same frequency as before. So it does appear to make a difference, but the problem's not quite solved yet. Thanks for all of your effort on this! Hi Kevin, Please attach the output of the following command: # dd if=/dev/mem of=memdump bs=4096 count=1 skip=915357 I need to check some variable in the BIOS asl table, thanks. Created attachment 99071 [details]
ASL variables in "Discrete Graphics" mode
Created attachment 99081 [details]
ASL variables in "Discrete Graphics" mode with acpi_osi="!Windows 2012"
Created attachment 99091 [details]
ASL variables in "nVidia Optimus" mode
Sorry for the slow reply. I've attached the output of the command you requested in both the "Discrete Graphics" and "nVidia Optimus" BIOS configurations and additionally the output with acpi_osi="!Windows 2012" (since I wasn't sure what was relevant). All dumps were made without video/nouveau/intel/nvidia modules loaded. If you'd like it run in any other state, let me know. Thanks Kevin, and actually, you are responding pretty fast, thanks a lot for the cooperation! From the dump, it looks to me: VIGD is the variable to represent if integrated graphics controller is in use; VDSC is the variable to represent if discrete graphics controller is in use; So we get VIGD 0, VDSC 1 in the "Discrete Graphics" mode no matter if !win8 is set. And for the "optimus mode", the base address is different, so the dump are no use, you will need to change the skip=765853. According to the table, BIOS does not support backlight brightness control in win8 mode, and in win7 mode, the control method for the discrete graphics controller are all redirected to the control method for the integrated graphics controller. Perhaps we need disable win7 too, and using vista mode to control brightness level. Please add the following kernel command line: acpi_osi="!Windows 2012" acpi_osi="!Windows 2009" and see if it makes any difference, thanks. Created attachment 99231 [details]
ASL variables in "nVidia Optimus" mode (corrected offset)
Here's a dump of the memory with the corrected offset for "nVidia Optimus" mode.
I tried booting with acpi_osi="!Windows 2012" acpi_osi="!Windows 2009" and of my 5 attempts (all from a cold-boot this time) loading the video module caused a freeze 4 times and it succeeded once.
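The skip values in the dd commands above are counts of 4096-byte blocks, so each one corresponds to a byte offset of skip * 4096 into /dev/mem. A minimal Python sketch of that arithmetic follows. The function name is hypothetical, and reading the results as the base of the BIOS ASL variable region is taken from Aaron's description, not verified independently.

```python
BLOCK_SIZE = 4096  # matches bs=4096 in the dd command


def dd_skip_to_phys_addr(skip_blocks: int, block_size: int = BLOCK_SIZE) -> int:
    """Return the byte offset into /dev/mem at which dd starts reading."""
    return skip_blocks * block_size


# skip values quoted in this report for the two BIOS graphics modes
for mode, skip in (("Discrete Graphics", 915357), ("nVidia Optimus", 765853)):
    print(f"{mode}: skip={skip} -> 0x{dd_skip_to_phys_addr(skip):08X}")
```

This makes it easier to compare the two base addresses directly against addresses that appear in the disassembled ACPI tables.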
When it freezes, can you see the log or the screen is black?

When it freezes I can still see everything that was printed to the screen before it froze.

Please try this kernel tree:
git://github.com/aaronlu/linux.git acpi_video
And boot kernel with: acpi.debug_layer=0x10000010 acpi.debug_level=0x4, let's see if we can find something, thanks.

Created attachment 99641 [details]
dmesg in "Discrete Graphics" mode with kernel e8df390f (from aaronlu/acpi_video)
I gave it a whirl and unfortunately I didn't see any differences or additional output (although the system froze a bit less frequently, I assume it was random chance). I've attached the dmesg output produced after successfully loading the video module on one of the occasions when the system did not freeze. When the system does freeze, no additional output is shown.
Hi Kevin, CONFIG_ACPI_DEBUG needs to be set for these debug messages to appear, sorry I didn't mention this. The document is here: Documentation/acpi/debug.txt. Created attachment 99801 [details]
dmesg in "Discrete Graphics" mode with kernel e8df390f (from aaronlu/acpi_video)
Ha! My mistake. I should have had that enabled from the start, I guess it just got overlooked. My bad.
Here's the dmesg after one of the successful attempts to load the module in "Discrete Graphics" mode. With the debugging options enabled on the kernel command line, it is surprisingly difficult to make the system freeze. After ~6 attempts from a cold boot, none of them froze, so I reverted to loading and unloading the module in a loop which provoked a freeze after maybe 10 load/unload cycles (I can count exactly if it would be useful). When it did freeze, the last lines printed to the screen were:
video-0749 [00] video_init_brightness : max_level=100
ACPI: Execute Method [\_SB_.PCI0.PEG_.VID_.LCD0._BQC] (Node ffff88011822c538)
video-0806 [00] video_init_brightness : initial backlight level is 100
ACPI: Execute Method [\_SB_.PCI0.PEG_.VID_.LCD0._BCM] (Node ffff88011822c560)
ACPI: Execute Method [\_SB_.PCI0.PEG_.VID_.LCD0._BQC] (Node ffff88011822c538)
video-0668 [00] video_bqc_quirk : test_level=0, returned_level=100
video-0684 [00] video_bqc_quirk : clear bqc cap
ACPI: Execute Method [\_SB_.PCI0.PEG_.VID_.LCD0._BCM] (Node ffff88011822c560)
Without the debug options on the kernel command line, the system still freezes fairly reliably, so printing to the screen does appear to be a relevant symptom. I can collect more statistics on it if it would be useful or if you need the debugging output from a freeze after cold-boot.
Thanks Aaron!
Created attachment 99871 [details]
Add some debug statement to dsdt table
I suspect the freeze occurs in SMI call in the BIOS provided asl code: in integrated graphics mode, _BCM will use AINT control method; while in discrete mode, _BCM will use SMI.
I've added some debug statement to the dsdt table, please override the system dsdt table with this one. I prefer to override dsdt in initrd: Documentation/acpi/initrd_table_override.txt.
$ mkdir acpi_initrd
$ cd acpi_initrd
$ mkdir -p kernel/firmware/acpi
$ cp dsdt.aml kernel/firmware/acpi
$ find kernel | cpio -H newc --create > /boot/instrumented_initrd
$ cat /boot/your_original_initrd >>/boot/instrumented_initrd
Done. Boot with /boot/instrumented_initrd in place of the original initrd.
Please still use the kernel you built in your last test, and keep setting acpi.debug_layer and acpi.debug_level, add a new one: acpi.aml_debug_output=1.
Let's see if SMI is the culprit.
Created attachment 99881 [details]
dmesg in "Discrete Graphics" mode with instrumented DSDT and kernel e8df390f (from aaronlu/acpi_video)
Here's the dmesg after one of the successful attempts to load the video module in "Discrete Graphics" mode booting with the modified DSDT and acpi.aml_debug_output=1 added to the kernel command line. Note that this is not the same binary kernel as the previous test, since I needed to recompile with CONFIG_ACPI_INITRD_TABLE_OVERRIDE, but that should be the only change.
As before, with the debugging options enabled on the kernel
command line, it is surprisingly difficult to make the system freeze. However, this time it did freeze from a cold boot on the 7th attempt. When it froze, the last messages printed to screen were:
ACPI: Execute Method [\_SB_.PCI0.PEG_.VID_.LCD0._BCM] (Node ffff8800c7e2c560)
[ACPI Debug] String [0x03] "SMI"
[ACPI Debug] Integer 0x00000001
[ACPI Debug] Integer 0x0000000A
[ACPI Debug] Integer 0x0000000F
[ACPI Debug] String [0x0D] "MSMI acquired"
Method (SMI, 5, NotSerialized)
{
    Store ("SMI", debug)
    Store (Arg0, debug)
    Store (Arg1, debug)
    Store (Arg2, debug)
    Acquire (MSMI, 0xFFFF)
    Store ("MSMI acquired", debug)
    Store (Arg0, CMD)
    Store (0x01, ERR)
    Store (Arg1, PAR0)
    Store (Arg2, PAR1)
    Store (Arg3, PAR2)
    Store (Arg4, PAR3)
    Store (0xF5, APMC)
    Store ("Begin while", debug)
    While (LEqual (ERR, 0x01))
    {
        Sleep (0x01)
        Store (0xF5, APMC)
    }
    Store ("End while", debug)
    Store (PAR0, Local0)
    Release (MSMI)
    Store ("MSMI released", debug)
    Return (Local0)
}

So most likely, it failed in Store (0xF5, APMC). What that instruction did is probably to trigger SMI handler, and then that handler will do things according to the values in CMD/ERR/PAR0/etc, and that handler runs without OS' awareness.

BTW, I think the cause of hang by closing LID is also SMI call, you can test that too.

Created attachment 99911 [details]
dmesg after LID event with instrumented DSDT and kernel e8df390f (from aaronlu/acpi_video)
That certainly sounds plausible. Is there anything we can do about it?
To test the hang on LID event, I booted with the same options and opened/closed the lid a few times without loading the video module. I have attached the dmesg after the first LID event. After ~5 tries, it froze. The last messages it printed to screen were:
[ACPI Debug] String [0x0D] "MSMI released"
[ACPI Debug] String [0x03] "SMI"
[ACPI Debug] Integer 0x00000001
[ACPI Debug] Integer 0x00000008
[ACPI Debug] Integer 0x00000002
[ACPI Debug] String [0x0D] "MSMI acquired"
I did this twice, and both times it froze with the same lines visible on screen.
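Both freezes end with "MSMI acquired" and no later "MSMI released", which is what points at the SMI call itself. As a rough illustration, the Python sketch below scans [ACPI Debug] lines like the ones quoted above and reports whether the most recent acquire was followed by a release. The function name and the matching pattern are assumptions based only on the output shown in this report; they are not a documented log format.

```python
import re

# Matches lines such as: [ACPI Debug] String [0x0D] "MSMI acquired"
# (pattern inferred from the excerpts in this report)
DEBUG_STRING = re.compile(
    r'\[ACPI Debug\]\s+String\s+\[0x[0-9A-Fa-f]+\]\s+"([^"]*)"'
)


def last_smi_call_completed(lines):
    """Return False if the most recent "MSMI acquired" has no later
    "MSMI released", True if it does, and None if neither marker appears."""
    held = None
    for line in lines:
        match = DEBUG_STRING.search(line)
        if not match:
            continue
        if match.group(1) == "MSMI acquired":
            held = True
        elif match.group(1) == "MSMI released":
            held = False
    if held is None:
        return None
    return not held


# Last lines on screen after the LID-event freeze, as quoted above
lid_freeze_log = [
    '[ACPI Debug] String [0x0D] "MSMI released"',
    '[ACPI Debug] String [0x03] "SMI"',
    '[ACPI Debug] Integer 0x00000001',
    '[ACPI Debug] Integer 0x00000008',
    '[ACPI Debug] Integer 0x00000002',
    '[ACPI Debug] String [0x0D] "MSMI acquired"',
]
print(last_smi_call_completed(lid_freeze_log))
```

A result of False only shows that the method stopped somewhere between the acquire and the release; it cannot distinguish a hang in the SMI handler from a hang in the polling loop.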
Unfortunately, I don't see anything we can do about it, it's not easy, if not impossible, for us to debug this without knowing what happened. One thing I noticed is, other system models have similar asl code, like the x230, but they didn't mention such problem. The LCD hang happens with discrete mode only or both? That's quite alright. It only hangs in discrete mode and I can work around it by passing nox2apic on the kernel command line or using the nvidia binary driver and not closing the lid. So it's not a big deal for me. The only time I need discrete mode is to do graphics-intensive stuff, which isn't particularly frequent anyway. Feel free to leave the bug open. Perhaps Lenovo will be kind enough to fix their firmware. I came across a patch today: https://lkml.org/lkml/2011/7/29/414 which has words: Some of the OEM platforms are running into issues because of this, as their bios is not x2apic aware. For example, this was resulting in interrupt migration issues on one of the platforms. Also if the BIOS SMI handling uses APIC interface to send SMI's, then the BIOS need to be aware of x2apic mode that OS has enabled. I suppose the BIOS's SMI here is not aware of x2apic mode and yet it doesn't set the opt_out flag. BTW, did you configure IOMMU driver in your kernel config as that opt_out logic is done in IOMMU driver. I suppose you have, at least the 3.2 kernel from Debian should have that config. Also, I guess Windows doesn't enable x2apic, anyway to verify this? Thanks. Great find! That could explain why it works with "nox2apic". I do have the kernel configured with CONFIG_INTEL_IOMMU=y, but CONFIG_INTEL_IOMMU_DEFAULT_ON is not set and I wasn't specifying intel_iommu=on during the testing (although I do in my usual configuration). Could that make a difference? This bug is probably due for some re-testing on my part (since there has been at least one BIOS update since I last tested). I'll try to get to it soon. 
Microsoft KB2303458 (and the doc for the x2apicpolicy option of bcdedit) suggests that Server 2008 R2 uses x2apic, but I'm not sure about Windows 7. I can see if bcdedit supports that option on Windows 7. (In reply to Kevin Locke from comment #55) > Great find! That could explain why it works with "nox2apic". I do have the > kernel configured with CONFIG_INTEL_IOMMU=y, but > CONFIG_INTEL_IOMMU_DEFAULT_ON is not set and I wasn't specifying > intel_iommu=on during the testing (although I do in my usual configuration). > Could that make a difference? It shouldn't make any difference. I should ask does the CONFIG_IRQ_REMAP get set? From looking at the source code, the logic to detect if BIOS has set the x2apic flag is done in intel_enable_irq_remapping function, which is compiled in if CONFIG_IRQ_REMAP is set. > Microsoft KB2303458 (and the doc for the x2apicpolicy option of bcdedit) > suggests that Server 2008 R2 uses x2apic, but I'm not sure about Windows 7. > I can see if bcdedit supports that option on Windows 7. OK, thanks. Ok. I finally got around to re-testing. I can confirm that CONFIG_IRQ_REMAP=y in the config and the symptoms are unchanged as far as I can tell when testing with 3.10.1 using BIOS G1ET94WW (2.54 ). I haven't tested with the modified DSDT, but I could if it might be useful. Thanks. So the problem is, Linux enabled x2apic due to enabling interrupt remap for IOMMU, which caused hang in SMI handler. Did you manage to see if Windows has enabled x2apic on your system? Add Youquan and David. Kevin has reported that loading of the ACPI video module would cause system hang, and adding nox2apic would solve the problem. I found that the hang occurs in SMI handler. Any ideas? Thanks. Fine. There are similar issues on Lenovo ThinkPad T420, W520 and L520, I have written a patch to disable x2APIC on these platform to workround the issue. 
Refer to: https://lkml.org/lkml/2012/12/18/1 [PATCH] x86,apic: Blacklist x2APIC on some platforms
Admittedly, the patch is not an elegant solution. I plan to look for a complete fix for this issue, since a series of platforms will be affected. Unfortunately, I do not have an affected platform to debug on.

Hi Kevin Locke,
Can you attach the two dmesg outputs after adding the kernel options "apic=debug intremap=off" and "apic=debug"?
In addition, I found an issue in the ACPI BIOS that is possibly related. After parsing the file "acpidump in "Discrete Graphics" mode" attached to the description, the ACPI DMAR table only includes one DMAR engine, but as far as I can tell the processor of this platform includes at least two DMAR engines.
Thanks
-Youquan

Created attachment 106928 [details]
dmesg in "Discrete Graphics" mode with apic=debug
Aaron, I haven't found a way to test if x2apic is enabled in Windows yet, but I'm still looking. My documentation search has turned up nothing. I need to ask some knowledgeable people if they have any ideas for how to test.
Hello Youquan. Here's the first of the two dmesg logs. I'll attach the other shortly.
Thanks!
Created attachment 106929 [details]
dmesg in "Discrete Graphics" mode with apic=debug intremap=off
Perhaps force enable x2apic and see if Windows can work? This seems to be a way to force enable x2apic:
bcdedit /set x2apicpolicy enable
But I don't know if it applies to all Windows version or just some server edition. Also, I'm not sure how Windows handle backlight, it may or may not use ACPI to change brightness.

In Windows, I installed the IBM Performance Inspector and ran msr <http://perfinsp.sourceforge.net/msr.html> which produced the following output:

C:\> msr -r APIC_BASE
***** msr v2.0.7 for x64 *****
CPU0 msr 0x1B = 0x00000000:FEE00900 (4276095232)
CPU1 msr 0x1B = 0x00000000:FEE00800 (4276094976)
CPU2 msr 0x1B = 0x00000000:FEE00800 (4276094976)
CPU3 msr 0x1B = 0x00000000:FEE00800 (4276094976)

If I read the x2APIC Spec correctly, it looks like xAPIC enable is 0x800 and x2APIC enable is 0x400. So from the above values, xAPIC is enabled and x2APIC is disabled. If you think it might be useful, I can try setting x2apicpolicy to enabled to force x2APIC in Windows. But it looks like Windows 7 does not enable it by default on this machine.

Thanks Kevin. So this just suggests we should not enable x2APIC mode for this system. Youquan, what do you think? Should we simply blacklist it or you have other ideas?

Hi Youquan, I think the problem is in interrupt remap code, it enabled x2apic for some systems that it shouldn't.

Hi Kevin,
Please add kernel option "x2apic_phys" and try it.

Hi Aaron,
In my knowledge, this issue is a specific hardware/BIOS issue. Possible reasons:
1) BIOS is not x2APIC compatible or BIOS SMI handler is not x2APIC compatible like serializing access semantics
2) SMI routing to CPU in Virtual Legacy Wire (VLW) over DMI while the VT-d interrupt remapping unaware the SMI type message and SMI is not in x2APIC format.
Thanks
-Youquan

Created attachment 106993 [details]
dmesg in "Discrete Graphics" mode with x2apic_phys
Hi Youquan,
When booting with kernel option "x2apic_phys" the behavior appears to be the same as when booting without any kernel options. The system will reliably freeze if the video module is loaded and unloaded a few times (although in my few tests it didn't freeze the first time the video module was loaded). I've attached the dmesg for reference. Let me know if there's anything else you'd like me to test.
Thanks,
Kevin
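For reference, the APIC_BASE values read under Windows earlier in this report can be decoded bit by bit. The sketch below assumes the IA32_APIC_BASE (MSR 0x1B) layout described in the Intel SDM: bit 8 is the bootstrap-processor flag, bit 10 enables x2APIC mode, and bit 11 is the APIC global enable. The function and constant names are hypothetical.

```python
# Bit positions in IA32_APIC_BASE (MSR 0x1B), assumed from the Intel SDM
BSP_FLAG = 1 << 8        # this CPU is the bootstrap processor
X2APIC_ENABLE = 1 << 10  # EXTD: x2APIC mode enabled
XAPIC_ENABLE = 1 << 11   # EN: APIC globally enabled


def decode_apic_base(value: int) -> dict:
    """Split a raw IA32_APIC_BASE value into its base address and flags."""
    return {
        "base": value & ~0xFFF,  # flags live in the low 12 bits
        "bsp": bool(value & BSP_FLAG),
        "xapic_enabled": bool(value & XAPIC_ENABLE),
        "x2apic_enabled": bool(value & X2APIC_ENABLE),
    }


# Raw values reported by "msr -r APIC_BASE" for CPU0..CPU3
for cpu, raw in enumerate((0xFEE00900, 0xFEE00800, 0xFEE00800, 0xFEE00800)):
    info = decode_apic_base(raw)
    print(
        f"CPU{cpu}: base=0x{info['base']:08X} bsp={info['bsp']} "
        f"xAPIC={info['xapic_enabled']} x2APIC={info['x2apic_enabled']}"
    )
```

Under that layout, the extra 0x100 on CPU0 is just the bootstrap-processor flag, and none of the four values has the x2APIC bit set.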
A similar problem on DELL XPS 15z is reported: https://lkml.org/lkml/2013/7/27/102

Thanks a lot Kevin for your support!
1. Can you provide "cat /proc/interrupts" and "lspci -vvvnnn" and "dmesg" information when "Discrete Graphics" mode with apic=debug kernel option?
2. Apply the below patch and build kernel, when boot kernel add kernel option "intremap=off"; To check if the issue happen. Also, please provide me "dmesg" information?

diff --git a/arch/x86/kernel/apic/apic.c b/arch/x86/kernel/apic/apic.c
index 904611b..51a065a 100644
--- a/arch/x86/kernel/apic/apic.c
+++ b/arch/x86/kernel/apic/apic.c
@@ -1603,11 +1603,8 @@ void __init enable_IR_x2apic(void)
 		goto skip_x2apic;
 
 	if (ret < 0) {
-		/* IR is required if there is APIC ID > 255 even when running
-		 * under KVM
-		 */
-		if (max_physical_apicid > 255 ||
-		    !hypervisor_x2apic_available()) {
+		/* IR is required if there is APIC ID > 255 */
+		if (max_physical_apicid > 255) {
 			if (x2apic_preenabled)
 				disable_x2apic();
 			goto skip_x2apic;
-- 
1.7.7.4

Thanks
-Youquan

Created attachment 107054 [details]
dmesg in "Discrete Graphics" mode with apic=debug, kernel 3.10.1, bios 2.54
Created attachment 107055 [details]
/proc/interrupts in "Discrete Graphics" mode with apic=debug, kernel 3.10.1, bios 2.54
Created attachment 107056 [details]
lspci -vvvnnn in "Discrete Graphics" mode with apic=debug, kernel 3.10.1, bios 2.54
Created attachment 107057 [details]
dmesg in "Discrete Graphics" mode with intremap=off, kernel 3.10.1 with patch from comment 70, bios 2.54
Here's the dmesg with the patch from comment 70 applied. Unfortunately, it did not solve the problem. I could still provoke a freeze by modprobe video && rmmod video repeatedly. It did seem a bit more stable; it would loop 10-20 times before freezing, rather than the usual 5-10, although my sample size is probably too small to say that with any certainty.
Thanks again, much appreciated!

Thanks a lot Kevin! When the Nvidia graphics driver is loaded and the system has not frozen, can you provide lspci -xxxvvvnnn? The previous lspci output does not show the Nvidia graphics driver loaded correctly.
Thanks
-Youquan

Sure. Would you prefer the nouveau driver loaded (may or may not freeze) or the nvidia binary driver (always loads without freezing)?

I prefer the nouveau driver loaded, which will possibly freeze. It would be better if you also provide the whole dmesg after you run the loop, before the freeze, with the "acpi=debug" option.
Thanks
-Youquan

Ok. To confirm: I'll get lspci -xxxvvvnnn after nouveau has been loaded, dmesg after nouveau and video have been unloaded. Just to confirm, did you want "acpi=debug" or "apic=debug" (as before)?

Created attachment 107058 [details]
lspci -xxxvvvnnn in "Discrete Graphics" mode with apic=debug, kernel 3.10.1 with patch from comment 70, bios 2.54, with nouveau loaded
Created attachment 107059 [details]
dmesg in "Discrete Graphics" mode with apic=debug, kernel 3.10.1 with patch from comment 70, bios 2.54, after video loaded and unloaded
Here's the dmesg with apic=debug (I can get acpi=debug as well if it would be useful). I got it after the video module was loaded and unloaded since if I load nouveau it can not be unloaded (or I can't find the right module to unload first to allow it to be unloaded).

From the last two comments, the discrete graphics mode works and shows the desktop without issue. Right?
can you also provide "cat /proc/interrupts" in the situation? What's the steps to reproduce the "modprobe video" cause hang issue? I will put a Nvidia card to my desktop to check if I can reproduce the issue or not? Thanks -Youquan The issue is that the system hangs on startup during `udevadm trigger --action=add`. It is not completely reliable, sometimes the system will boot without hanging, but when last I tried it the system would freeze ~90% of the time. The problem is alleviated if the video and nouveau modules are blacklisted. If I drop to a shell before `udevadm trigger --action=add` is run, I can usually load the video module without causing the system to hang (presumably the issue is aggravated by multiple modules being loaded/initialized in short succession). I can always provoke a freeze by running "for i in `seq 1 100` ; do modprobe video ; rmmod video ; done", usually after a few iterations sometimes as many as 30. I think the issue is specific to a small set of machines including the T430 (or perhaps specific to the T430), otherwise all nVidia users would be experiencing intermittent freezes during boot. I will get a copy of /proc/interrupts after a successful boot as soon as I have some spare time. Thanks, Kevin Hi Youquan, May I know your opinion about this bug? Hi Aaron and Youquan, My apologies for letting this bug linger for so long. I had forgotten that you were waiting on me for a copy of /proc/interrupts in addition to Youquan's thoughts. Since I had an effective workaround (adding "nox2apic") I had simply forgotten to check up on this issue. I can confirm that the issue is still present in Linux 4.3 with the latest BIOS (G1ETA9WW - 2.69) as of today. A quick synopsis of the issue as I understand it: My Lenovo ThinkPad T430 will freeze when the video module is loaded if the BIOS is configured in "Discrete Graphics" mode. 
The freeze does not occur immediately on every boot, but repeatedly loading and unloading the video module provokes it reliably. The freeze appears to be related to x2APIC (which is not enabled by Windows 7). Disabling x2APIC with the "nox2apic" command-line option reliably avoids the issue. Aaron has a more detailed synopsis in comment 58. It looks like Youquan's (non-ideal) patch from comment 60 <https://lkml.org/lkml/2012/12/18/1> was never merged, nor the patch that it was waiting on <https://lkml.org/lkml/2013/3/18/892>. There are a couple bugs with different machines/models that appear to be the same (or similar) issue: https://bugzilla.kernel.org/show_bug.cgi?id=34262 https://bugzilla.kernel.org/show_bug.cgi?id=43054 Since it has been a few years and I've changed both BIOS and kernel versions significantly since then, would it be helpful to get fresh comparable copies of some/all of the previously attached files? What would be most useful? How else can I help? Thanks and sorry for the very long delay, Kevin Apologies if i post this in wrong thread.. >inxi -S System: Host: hawker64 Kernel: 4.6.4-1-ck x86_64 (64 bit) Desktop: dwm 6.1 Distro: Arch Linux >inxi -M Machine Mobo: ASUSTeK model: P6T SE v: Rev 1.xx Bios: American Megatrends v: 0908 date: 09/21/2010 >inxi -G Graphics: Card: Advanced Micro Devices [AMD/ATI] RV770 [Radeon HD 4870] Display Server: X.Org 1.18.3 driver: radeon Resolution: 1920x1080@60.00hz, 1920x1080@60.00hz GLX Renderer: Gallium 0.4 on AMD RV770 (DRM 2.43.0, LLVM 3.8.0) GLX Version: 3.0 Mesa 11.2.2 I put it down to the Intel 55x0 chipset errata - Interrupt remapping issue (Intel 5500/5520/X58 chipset revision 0x13 and 0x22 have an errata (#47 and #53) which makes the IOMMU interrupt remapping unit unreliable. 
This erratum causes interruptions and the interrupt remapping invalidations become unresponsive) https://forums.gentoo.org/viewtopic-t-1030102-start-0.html?sid=59c8eddb43e0553296f93355ea10b42d below are some snippets from logs that i *think* may be relevant. Ocaasionaly i get hard lockups where only a hard reboot will suffice,on other occasion i just lose network connectivity. This started happening for me since kernels 4.* if i rollback to say kernel 3.19-1 all is well. So again i *guess* it's a kernel regression, either that or my issue is a mixture of this and the IOMMU thing. (happens when using linux-ck or stock archlinux kernels, kvm disabled in BIOS) perf: interrupt took too long (2711 > 2500), lowering kernel.perf_event_max_sample_rate to 73000 perf: interrupt took too long (3512 > 3388), lowering kernel.perf_event_max_sample_rate to 56000 perf: interrupt took too long (4459 > 4390), lowering kernel.perf_event_max_sample_rate to 44000 perf: interrupt took too long (5613 > 5573), lowering kernel.perf_event_max_sample_rate to 35000 hawker64 kernel: [drm:radeon_cs_ioctl [radeon]] *ERROR* Failed to schedule IB ! hawker64 kernel: [drm:radeon_cs_ioctl [radeon]] *ERROR* Failed to schedule IB ! hawker64 kernel: radeon 0000:02:00.0: scheduling IB failed (-2). hawker64 kernel: [drm:radeon_cs_ioctl [radeon]] *ERROR* Failed to schedule IB ! hawker64 kernel: r8169 0000:05:00.0 enp5s0: rtl_counters_cond == 1 (loop: 1000, delay: 10). hawker64 kernel: r8169 0000:05:00.0 enp5s0: rtl_counters_cond == 1 (loop: 1000, delay: 10). 
hawker64 kernel: INFO: rcu_preempt detected stalls on CPUs/tasks:
hawker64 kernel: 1-...: (0 ticks this GP) idle=9b2/0/0 softirq=4017531/4017531 fqs=0
hawker64 kernel: 2-...: (6 GPs behind) idle=dd2/0/0 softirq=2426637/2426637 fqs=0
hawker64 kernel: 3-...: (3 GPs behind) idle=e78/0/0 softirq=1361775/1361777 fqs=0
hawker64 kernel: 4-...: (38 GPs behind) idle=3e0/0/0 softirq=440832/440833 fqs=0
hawker64 kernel: 6-...: (1 GPs behind) idle=efe/0/0 softirq=305520/305520 fqs=0
hawker64 kernel: 7-...: (1 GPs behind) idle=a2a/0/0 softirq=204822/204822 fqs=0
hawker64 kernel: (detected by 0, t=127647 jiffies, g=2627574, c=2627573, q=9503)
hawker64 kernel: rcu_preempt kthread starved for 127647 jiffies! g2627574 c2627573 f0x0 RCU_GP_WAIT_FQS(3) ->state=0x0

Jul 01 14:13:57 hawker64 login[541]: pam_systemd(login:session): Failed to release session: Connection reset by peer
Jul 01 14:13:57 hawker64 systemd-logind[507]: Failed to abandon session scope: Transport endpoint is not connected
-- Reboot --
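Regarding the "nox2apic" workaround mentioned earlier in this thread: when comparing boots, it can help to confirm that the option actually reached the running kernel. The following is only a minimal sketch, assuming a Linux system where /proc/cmdline is readable. The function names `has_kernel_param` and `running_kernel_has_param` and the example command line are made up for illustration and do not come from any of the attachments.

```python
from pathlib import Path


def has_kernel_param(cmdline: str, name: str) -> bool:
    """Return True if `name` appears as a bare flag or as `name=value`."""
    for token in cmdline.split():
        # "foo=bar" and "foo" both have the key "foo".
        key = token.split("=", 1)[0]
        if key == name:
            return True
    return False


def running_kernel_has_param(name: str, path: str = "/proc/cmdline") -> bool:
    """Check the command line of the currently running kernel (Linux only)."""
    return has_kernel_param(Path(path).read_text(), name)


if __name__ == "__main__":
    # Hypothetical command line, used only to demonstrate the helper.
    example = "root=/dev/sda1 ro nox2apic"
    print(has_kernel_param(example, "nox2apic"))
```

This only shows that the option was passed on the command line; it does not by itself prove which APIC mode the kernel ended up using, so the dmesg output is still worth attaching.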
Created attachment 96771 [details]
acpidump in "Discrete Graphics" mode

On a Lenovo ThinkPad T430 with the latest BIOS (G1ET93WW - 2.53), when configured to use the nVidia graphics card in the BIOS (by setting "Graphics Device" to "Discrete Graphics"), modprobing the ACPI video module causes the system to freeze hard either immediately or within a few seconds. The video mode is not changed and the display is not corrupted compared to the pre-modprobe state, the cursor continues to blink, but the system ceases responding to input (SysRq does not work) and no additional messages (except rarely the "ACPI: Video Device [VID] [...]" messages when the freeze occurs a second or two later) are produced to screen or netconsole.

This occurs on all kernel versions that I have tested, which includes 3.2 from the Debian archives, self-compiled 3.8.2, and self-compiled linus master from earlier today (46a1f21a). I have also confirmed that the system boots and functions correctly in "Discrete Graphics" mode in Windows 7.
The freeze may or may not be related to the following messages (which are not produced if the video module is loaded manually early in the boot process - presumably because they are produced by the loading of another module):

ACPI Warning: 0x0000000000000428-0x000000000000042f SystemIO conflicts with Region \_SB_.PCI0.LPC_.PMIO 1 (20130117/utaddress-251)
ACPI: If an ACPI driver is available for this device, you should use it instead of the native driver
ACPI Warning: 0x0000000000000530-0x000000000000053f SystemIO conflicts with Region \_SB_.PCI0.LPC_.LPIO 1 (20130117/utaddress-251)
ACPI: If an ACPI driver is available for this device, you should use it instead of the native driver
ACPI Warning: 0x0000000000000500-0x000000000000052f SystemIO conflicts with Region \_SB_.PCI0.LPC_.LPIO 1 (20130117/utaddress-251)
ACPI: If an ACPI driver is available for this device, you should use it instead of the native driver
lpc_ich: Resource conflict(s) found affecting gpio_ich

Also possibly relevant: acpidump produces the message "Wrong checksum for FADT!" when run in either BIOS configuration.

I'll attach the acpidump, config, dmesg, /proc/iomem, /proc/ioports, and lspci output to the report. Please let me know if there is anything else which I can do to help troubleshoot/fix the issue.

Thanks for reading,
Kevin
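To make the "SystemIO conflicts" warnings above easier to compare against the attached /proc/ioports files, here is a rough sketch that pulls the port ranges and region names out of dmesg text. It assumes the warning format is exactly as quoted in this report. The names `parse_acpi_io_conflicts` and `ranges_overlap` are hypothetical helpers written for this comment, not part of any kernel tool.

```python
import re
from typing import List, Tuple

# Matches lines such as:
# ACPI Warning: 0x...0428-0x...042f SystemIO conflicts with Region \_SB_.PCI0.LPC_.PMIO 1 (...)
_CONFLICT_RE = re.compile(
    r"ACPI Warning: (0x[0-9a-fA-F]+)-(0x[0-9a-fA-F]+) "
    r"SystemIO conflicts with Region (\S+)"
)


def parse_acpi_io_conflicts(log_text: str) -> List[Tuple[int, int, str]]:
    """Return (start, end, region) for every SystemIO conflict warning."""
    conflicts = []
    for match in _CONFLICT_RE.finditer(log_text):
        start, end, region = match.groups()
        conflicts.append((int(start, 16), int(end, 16), region))
    return conflicts


def ranges_overlap(a_start: int, a_end: int, b_start: int, b_end: int) -> bool:
    """True if the inclusive ranges [a_start, a_end] and [b_start, b_end] intersect."""
    return a_start <= b_end and b_start <= a_end


if __name__ == "__main__":
    # One of the warnings quoted in this report.
    sample = (
        "ACPI Warning: 0x0000000000000428-0x000000000000042f SystemIO "
        r"conflicts with Region \_SB_.PCI0.LPC_.PMIO 1 (20130117/utaddress-251)"
    )
    for start, end, region in parse_acpi_io_conflicts(sample):
        print(f"{start:#06x}-{end:#06x} {region}")
```

Each parsed range can then be checked with `ranges_overlap` against the ranges listed in /proc/ioports to see which native driver claimed the ports. Whether any of these conflicts has anything to do with the freeze is still an open question.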