Bug 40332

Summary: Lenovo X201s: Kernel panic - not syncing: DMAR hardware is malfunctioning
Product: Drivers
Reporter: Yves-Alexis Perez (corsac)
Component: Video(DRI - Intel)
Assignee: drivers_video-dri-intel (drivers_video-dri-intel)
Status: RESOLVED INVALID
Severity: normal
CC: chris
Priority: P1
Hardware: All
OS: Linux
Kernel Version: 3.0
Subsystem:
Regression: No
Bisected commit-id:
Attachments:
  dmesg with no I/O MMU enabled
  dmesg with intel_iommu=on
  dmesg with intel_iommu=igfx_off
  disable iommu for graphics on ironlake too

Description Yves-Alexis Perez 2011-07-29 13:52:38 UTC
Hey,

I just bought a Lenovo X201s (CPU is Core i7 LM640) and I have hard lockups in X when using intel_iommu=on.

I'm reporting against the DRI driver since the involved process is X and I recall some other issues with DMAR and graphics cards, but feel free to reassign if it's not the correct component.

Basically, after some time using the laptop, there's a freeze. Using netconsole I managed to get a trace (which doesn't get saved in kern.log afaict):


[  414.698653] Kernel panic - not syncing: DMAR hardware is malfunctioning
[  414.698657] 
[  414.698817] Pid: 1482, comm: Xorg Not tainted 3.0.0-1-amd64 #1
[  414.698914] Call Trace:
[  414.698964]  <IRQ>  [<ffffffff81334c28>] ? panic+0x92/0x1a1
[  414.699075]  [<ffffffff81038189>] ? test_tsk_need_resched+0xe/0x17
[  414.699174]  [<ffffffff811ca67b>] ? __iommu_flush_iotlb+0x131/0x178
[  414.699270]  [<ffffffff811caa11>] ? flush_unmaps+0x66/0x11d
[  414.699357]  [<ffffffff811caadd>] ? flush_unmaps_timeout+0x15/0x25
[  414.699453]  [<ffffffff81052b04>] ? run_timer_softirq+0x1bf/0x28a
[  414.699549]  [<ffffffff8104c0ad>] ? raise_softirq_irqoff+0x9/0x2e
[  414.699642]  [<ffffffff811caac8>] ? flush_unmaps+0x11d/0x11d
[  414.699732]  [<ffffffff81066feb>] ? timekeeping_get_ns+0xd/0x2a
[  414.699823]  [<ffffffff8104bdd4>] ? __do_softirq+0xb9/0x178
[  414.699911]  [<ffffffff8133cc9c>] ? call_softirq+0x1c/0x30
[  414.699997]  [<ffffffff8100a9ef>] ? do_softirq+0x3f/0x84
[  414.700080]  [<ffffffff8104c040>] ? irq_exit+0x3f/0xa3
[  414.700163]  [<ffffffff8101f51e>] ? smp_apic_timer_interrupt+0x76/0x86
[  414.700263]  [<ffffffff8133c453>] ? apic_timer_interrupt+0x13/0x20
[  414.700354]  <EOI>  [<ffffffff8133ba92>] ? system_call_fastpath+0x16/0x1b
[  415.884326] panic occurred, switching back to text console
[  415.884413] BUG: scheduling while atomic: Xorg/1482/0x10000100
[  415.884500] Modules linked in: thinkpad_acpi nvram netconsole configfs nf_conntrack_ipv6 nf_defrag_ipv6 ip6table_filter ip6_tables nf_conntrack_ipv4 nf_defrag_ipv4 xt_state nf_conntrack iptable_filter ip_tables x_tables ext2 acpi_cpufreq mperf snd_hda_codec_hdmi snd_hda_codec_conexant arc4 iwlagn snd_hda_intel mac80211 snd_hda_codec snd_hwdep cfg80211 snd_pcm snd_timer btusb bluetooth i2c_i801 pcspkr psmouse snd tpm_tis rfkill battery serio_raw ac soundcore power_supply evdev snd_page_alloc tpm tpm_bios intel_ips wmi processor ext4 mbcache jbd2 crc16 aesni_intel cryptd aes_x86_64 aes_generic xts gf128mul dm_crypt dm_mod sg sd_mod crc_t10dif i915 drm_kms_helper drm i2c_algo_bit thermal ahci libahci e1000e ehci_hcd libata scsi_mod usbcore i2c_core video thermal_sys button [last unloaded: nvram]
[  415.886115] CPU 0 
[  415.886149] Modules linked in: thinkpad_acpi nvram netconsole configfs nf_conntrack_ipv6 nf_defrag_ipv6 ip6table_filter ip6_tables nf_conntrack_ipv4 nf_defrag_ipv4 xt_state nf_conntrack iptable_filter ip_tables x_tables ext2 acpi_cpufreq mperf snd_hda_codec_hdmi snd_hda_codec_conexant arc4 iwlagn snd_hda_intel mac80211 snd_hda_codec snd_hwdep cfg80211 snd_pcm snd_timer btusb bluetooth i2c_i801 pcspkr psmouse snd tpm_tis rfkill battery serio_raw ac soundcore power_supply evdev snd_page_alloc tpm tpm_bios intel_ips wmi processor ext4 mbcache jbd2 crc16 aesni_intel cryptd aes_x86_64 aes_generic xts gf128mul dm_crypt dm_mod sg sd_mod crc_t10dif i915 drm_kms_helper drm i2c_algo_bit thermal ahci libahci e1000e ehci_hcd libata scsi_mod usbcore i2c_core video thermal_sys button [last unloaded: nvram]
[  415.887749] 
[  415.887780] Pid: 1482, comm: Xorg Not tainted 3.0.0-1-amd64 #1 LENOVO 51434JG/51434JG
[  415.887918] RIP: 0033:[<00007fe14fbc261e>]  [<00007fe14fbc261e>] 0x7fe14fbc261d
[  415.888039] RSP: 002b:00007fff5bdc7d50  EFLAGS: 00003246
[  415.888119] RAX: 0000000000000000 RBX: ffffffff8133ba92 RCX: 0000000000000000
[  415.888225] RDX: 000000000205a008 RSI: 0000000000000000 RDI: 000000000205a008
[  415.888329] RBP: 0000000003e7b7b0 R08: 00000000039be000 R09: 00000000039be000
[  415.888435] R10: 0000000002059f20 R11: 0000000000003246 R12: ffffffff8133c44e
[  415.888540] R13: 0000000040406469 R14: 00007fff5bdc7d80 R15: 0000000000000000
[  415.888646] FS:  00007fe150874880(0000) GS:ffff880137c00000(0000) knlGS:0000000000000000
[  415.888764] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[  415.888851] CR2: 0000000003b9c011 CR3: 0000000131e67000 CR4: 00000000000006f0
[  415.888955] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[  415.889060] DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
[  415.889166] Process Xorg (pid: 1482, threadinfo ffff880132142000, task ffff88012f1a49f0)
[  415.889282] 
[  415.889310] Call Trace:

and it ends here.

This is on Debian sid with Debian 3.0 kernel. I'm gonna drop the intel_iommu=on boot arg for now, and will try other args like igfx_off, but I'd like to use the I/O MMU when possible.

BIOS is up to date. As I just bought the laptop I can't say if it's a regression in the kernel (3.0 was the first kernel I installed, I'll try from a GRML just in case) or in the BIOS (I don't think I had a lockup before upgrading but since I upgraded to the latest BIOS pretty fast, it might not be meaningful).

I'll follow up on that bug with complete dmesg after boot (with and without intel_iommu=on), if you need any more information, please ask.
Comment 1 Yves-Alexis Perez 2011-07-29 13:53:46 UTC
And note that while:

[  415.884326] panic occurred, switching back to text console

is displayed, there's no switch, it stays in X.
Comment 2 Chris Wilson 2011-07-29 13:57:09 UTC
The ironlake iommu does not work with the igfx due to a silicon bug, hence why the kernel disables it...
Comment 3 Yves-Alexis Perez 2011-07-29 14:05:04 UTC
Hmh, I might be confused but I thought that it was just a matter of excluding igfx (setting up an identity mapping) from DMAR. Do you mean there's *no* way to use the I/O MMU on this laptop?
Comment 4 Yves-Alexis Perez 2011-07-29 14:39:09 UTC
Created attachment 67082 [details]
dmesg with no I/O MMU enabled
Comment 5 Yves-Alexis Perez 2011-07-29 14:40:49 UTC
Created attachment 67092 [details]
dmesg with intel_iommu=on
Comment 6 Yves-Alexis Perez 2011-07-29 14:41:46 UTC
Created attachment 67102 [details]
dmesg with intel_iommu=igfx_off

It seems that, for now, intel_iommu=igfx_off works fine (no lockup, though I've only been booted for half an hour).
Comment 7 Yves-Alexis Perez 2011-07-29 14:44:21 UTC
Hum, indeed, with igfx_off it seems that it doesn't really use DMAR:

[    0.939792] PCI-DMA: Using software bounce buffering for IO (SWIOTLB)
Comment 8 Yves-Alexis Perez 2011-07-29 15:08:52 UTC
Ok, I might be wrong but couldn't the same kind of fix as http://git.kernel.org/?p=linux/kernel/git/torvalds/linux-2.6.git;a=commit;h=2d9e667e be applied? AIUI, in that patch only the GFX DMAR is disabled, for the relevant device ([8086:2a40] rev07). In my case my device is:

00:02.0 VGA compatible controller [0300]: Intel Corporation Core Processor Integrated Graphics Controller [8086:0046] (rev 02)

I'll try to extend the patch and see if it leads somewhere.
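The idea in the comment above, matching the graphics device by PCI ID and turning off only its DMAR mapping, can be modeled in userspace roughly as follows. This is a sketch under assumptions, not the attached patch and not the kernel's quirk code: `struct pci_id`, `gfx_iommu_quirks` and `apply_gfx_quirk` are hypothetical names. Only the two PCI IDs and the `dmar_map_gfx` flag come from this report.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical userspace model; not a kernel structure. */
struct pci_id {
    uint16_t vendor;
    uint16_t device;
};

/* IDs taken from this report: 8086:2a40 (handled by the referenced commit,
 * per comment 8) and 8086:0046 (the reporter's integrated graphics). */
static const struct pci_id gfx_iommu_quirks[] = {
    { 0x8086, 0x2a40 },
    { 0x8086, 0x0046 },
};

/* Return the value dmar_map_gfx should have after the quirk check:
 * 0 if the device is on the list, otherwise the value passed in. */
int apply_gfx_quirk(uint16_t vendor, uint16_t device, int dmar_map_gfx)
{
    size_t n = sizeof gfx_iommu_quirks / sizeof gfx_iommu_quirks[0];

    assert(dmar_map_gfx == 0 || dmar_map_gfx == 1);
    for (size_t i = 0; i < n; i++) {
        if (gfx_iommu_quirks[i].vendor == vendor &&
            gfx_iommu_quirks[i].device == device)
            return 0; /* graphics bypasses the IOMMU */
    }
    return dmar_map_gfx; /* unlisted devices are left alone */
}
```

In this model only the graphics flag changes; whether DMAR is enabled for the rest of the system is a separate setting and is not touched.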
Comment 9 Yves-Alexis Perez 2011-07-30 07:44:13 UTC
Created attachment 67152 [details]
disable iommu for graphics on ironlake too

Ok, using this patch seems to fix the problem while still enabling DMAR.

Is it normal that when booting with intel_iommu=igfx_off the setup does:

dmar_map_gfx = 0;

but not

dmar_disabled = 0; ?

Debian doesn't have DMAR_DEFAULT_ON enabled (for good reasons) but I think (according to the documentation) that intel_iommu=igfx_off means “DMAR remapping enabled except for IGFX”:

		igfx_off [Default Off]
			By default, gfx is mapped as normal device. If a gfx
			device has a dedicated DMAR unit, the DMAR unit is
			bypassed by not enabling DMAR with this option. In
			this case, gfx device will use physical address for
			DMA.

What do you think? Should I prepare a patch adding dmar_disabled = 0; for intel_iommu=igfx_off?
Comment 10 Yves-Alexis Perez 2011-07-30 09:05:47 UTC
Ok, I was confused. I still think the quirk is valid (in order to prevent users like me from shooting themselves in the foot) but I didn't realize one could do:

intel_iommu=on,igfx_off

to have the same behavior as with the quirk. As Dave Airlie said on #intel-gfx though, it might make sense to extend the quirk to all Cantiga and Ironlake chipsets, where apparently the GFX I/O MMU is buggy while the main one does work.
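A rough userspace model of why intel_iommu=igfx_off alone leaves DMAR off on a kernel built without DMAR_DEFAULT_ON, while intel_iommu=on,igfx_off enables it for everything except graphics. It assumes each comma-separated token sets one flag independently and that unknown tokens are ignored. `struct iommu_opts` and `parse_intel_iommu` are hypothetical names; only `dmar_disabled` and `dmar_map_gfx` are taken from the comments above.

```c
#include <assert.h>
#include <string.h>

/* Userspace model of the two flags discussed in this report. */
struct iommu_opts {
    int dmar_disabled; /* 1: DMA remapping is off */
    int dmar_map_gfx;  /* 1: graphics device is mapped through the IOMMU */
};

/* Parse a comma-separated intel_iommu= value such as "on,igfx_off".
 * Defaults assume a build without DMAR_DEFAULT_ON, as on the reporter's
 * Debian kernel: remapping disabled, graphics mapped as a normal device. */
struct iommu_opts parse_intel_iommu(const char *str)
{
    struct iommu_opts o = { .dmar_disabled = 1, .dmar_map_gfx = 1 };

    assert(str != NULL);
    while (*str) {
        size_t len = strcspn(str, ","); /* length of the current token */

        if (len == 2 && strncmp(str, "on", len) == 0)
            o.dmar_disabled = 0;
        else if (len == 3 && strncmp(str, "off", len) == 0)
            o.dmar_disabled = 1;
        else if (len == 8 && strncmp(str, "igfx_off", len) == 0)
            o.dmar_map_gfx = 0;
        /* any other token is ignored in this sketch */

        str += len;
        while (*str == ',') /* skip the separator(s) */
            str++;
    }
    return o;
}
```

With this model, parse_intel_iommu("igfx_off") leaves dmar_disabled at 1, which matches the SWIOTLB fallback seen in comment 7, whereas parse_intel_iommu("on,igfx_off") yields dmar_disabled == 0 and dmar_map_gfx == 0.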
Comment 11 Yves-Alexis Perez 2012-01-17 14:49:16 UTC
Note that Intel-IOMMU.txt in Documentation says:


Graphics Problems?
------------------
If you encounter issues with graphics devices, you can try adding
option intel_iommu=igfx_off to turn off the integrated graphics engine.
If this fixes anything, please ensure you file a bug reporting the problem.


so this is exactly what I did. My understanding is that intel_iommu=igfx_off is a temporary workaround until a fix is committed to the code (whether it's a real fix or disabling the IGFX I/O MMU). If I'm wrong, maybe rephrasing the documentation would be a good idea?