Bug 51921
Summary: | [g45 dmar] Screen black, no action possible | ||
---|---|---|---|
Product: | Drivers | Reporter: | JP Pozzi (jp.pozzi) |
Component: | Video(DRI - Intel) | Assignee: | intel-gfx-bugs (intel-gfx-bugs) |
Status: | RESOLVED PATCH_ALREADY_AVAILABLE | ||
Severity: | high | CC: | daniel, intel-gfx-bugs, stathis |
Priority: | P1 | ||
Hardware: | All | ||
OS: | Linux | ||
Kernel Version: | 3.7.1 | Subsystem: | |
Regression: | Yes | Bisected commit-id: | |
Attachments: |
various dmesg files
disable DMAR on gen4 gfx
disable DMAR on gen4 gfx v2 |
Description
JP Pozzi
2012-12-22 18:38:52 UTC
Hello,

I get a second "freeze" with the same messages:

Dec 22 20:49:08 jport kernel: [ 5642.810093] [drm:i915_hangcheck_hung] *ERROR* Hangcheck timer elapsed... GPU hung
Dec 22 20:49:08 jport kernel: [ 5642.810105] [drm] capturing error event; look for more information in /debug/dri/0/i915_error_state
Dec 22 20:49:09 jport kernel: [ 5643.330121] [drm:i915_reset] *ERROR* Failed to reset chip.

Preceding messages involving i915:

Dec 22 19:15:35 jport kernel: [ 34.577466] i915 0000:00:02.0: irq 44 for MSI/MSI-X
Dec 22 19:15:35 jport kernel: [ 34.577477] [drm] Supports vblank timestamp caching Rev 1 (10.10.2010).
Dec 22 19:15:35 jport kernel: [ 34.577480] [drm] Driver supports precise vblank timestamp query.
Dec 22 19:15:35 jport kernel: [ 34.577518] vgaarb: device changed decodes: PCI:0000:00:02.0,olddecodes=io+mem,decodes=io+mem:owns=io+mem
Dec 22 19:15:35 jport kernel: [ 35.038989] [drm] initialized overlay support
Dec 22 19:33:59 jport kernel: [ 1130.380102] i915 0000:00:02.0: power state changed by ACPI to D3hot

All was OK with a 3.6.5 kernel.

Regards
JP P

Hi,

I think I am hit by the same bug in stable 3.7.1. My box shows the same behavior: it freezes completely, can only be hard reset, and then has to be booted back into a different kernel version. I managed to track down that the problem does not occur on 3.6.10 or 3.6.11. I looked at my logs, but there is nothing in the 3.7.1 log, as the system locks up before managing to write anything to disk or to the screen.
It does seem to lock up when it attempts to modeset, though, as it freezes partway through the kernel boot sequence. I do see some errors in the 3.6.10 log, but I don't think they are relevant:

Dec 17 08:42:11 commell kernel: [drm] Initialized drm 1.1.0 20060810
Dec 17 08:42:11 commell kernel: i915 0000:00:02.0: setting latency timer to 64
Dec 17 08:42:11 commell kernel: i915 0000:00:02.0: irq 49 for MSI/MSI-X
Dec 17 08:42:11 commell kernel: [drm] Supports vblank timestamp caching Rev 1 (10.10.2010).
Dec 17 08:42:11 commell kernel: [drm] Driver supports precise vblank timestamp query.
Dec 17 08:42:11 commell kernel: vgaarb: device changed decodes: PCI:0000:00:02.0,olddecodes=io+mem,decodes=io+mem:owns=io+mem
Dec 17 08:42:11 commell kernel: No connectors reported connected with modes
Dec 17 08:42:11 commell kernel: [drm] Cannot find any crtc or sizes - going 1024x768
Dec 17 08:42:11 commell kernel: fbcon: inteldrmfb (fb0) is primary device
Dec 17 08:42:11 commell kernel: Console: switching to colour frame buffer device 128x48
Dec 17 08:42:11 commell kernel: fb0: inteldrmfb frame buffer device
Dec 17 08:42:11 commell kernel: drm: registered panic notifier
Dec 17 08:42:11 commell kernel: [drm] Initialized i915 1.6.0 20080730 for 0000:00:02.0 on minor 0
..
.. <a bit further down in the same boot sequence>
..
Dec 17 08:42:11 commell kernel: No soundcards found.
Dec 17 08:42:11 commell kernel: [drm] capturing error event; look for more information in /debug/dri/0/i915_error_state
Dec 17 08:42:11 commell kernel: i915: render error detected, EIR: 0x00000010
Dec 17 08:42:11 commell kernel: i915: IPEIR: 0x00000000
Dec 17 08:42:11 commell kernel: i915: IPEHR: 0x11880000
Dec 17 08:42:11 commell kernel: i915: INSTDONE: 0xfffffffe
Dec 17 08:42:11 commell kernel: i915: INSTPS: 0x00000000
Dec 17 08:42:11 commell kernel: i915: INSTDONE1: 0xffffffff
Dec 17 08:42:11 commell kernel: i915: ACTHD: 0x00000000
Dec 17 08:42:11 commell kernel: i915: page table error
Dec 17 08:42:11 commell kernel: i915: PGTBL_ER: 0x00100000
Dec 17 08:42:11 commell kernel: [drm:i915_report_and_clear_eir] *ERROR* EIR stuck: 0x00000010, masking

This is a headless server system on which I very infrequently use the console or X directly. I tried passing the kernel parameter "nomodeset", which apparently boots me into 3.7.1 without a problem (only the console resolution is odd). It would be great if it can be fixed.

The system is based on a Commell LV-67AD2 Mini-ITX board ( http://www.commell.com.tw/Product/SBC/LV-67A.HTM )
CPU : Intel(R) Core(TM)2 Quad CPU Q9400 @ 2.66GHz
memory : 4GB
Distribution : Gentoo x64

Let me know if there is anything to test.

thanks,
stathis

You can avoid the gpu reset (which is usually the reason behind a hard-hang) with i915.reset=0 added to your kernel cmdline. That should hopefully allow you to grab the error state.

Hello,

I "tweaked" the .config file with "CONFIG_DRM_I915=y" and "CONFIG_ACPI_VIDEO=y" plus some other parameters, switching them from "m" to "y". Since the last compile the system has been up for 2 days 6 hours with a very nice screen showing all 1920 pixels and without any lock.

Regards
JP P

It's a no go for me. JP, I tried what you have suggested, but it still freezes upon modesetting.
Here are some of the relevant kernel config options of my 3.7.1 built kernel:

## grep AGP .config
CONFIG_AGP=y
CONFIG_AGP_INTEL=y

## grep ACPI .config
CONFIG_X86_64_ACPI_NUMA=y
# Power management and ACPI options
CONFIG_ACPI=y
CONFIG_ACPI_PROCFS=y
CONFIG_ACPI_PROCFS_POWER=y
CONFIG_ACPI_PROC_EVENT=y
CONFIG_ACPI_AC=y
CONFIG_ACPI_BATTERY=y
CONFIG_ACPI_BUTTON=y
CONFIG_ACPI_VIDEO=y
CONFIG_ACPI_FAN=y
CONFIG_ACPI_DOCK=y
CONFIG_ACPI_PROCESSOR=m
CONFIG_ACPI_THERMAL=m
CONFIG_ACPI_NUMA=y
CONFIG_ACPI_BLACKLIST_YEAR=0
# CONFIG_ACPI_DEBUG is not set
# CONFIG_ACPI_PCI_SLOT is not set
CONFIG_ACPI_CONTAINER=y
CONFIG_X86_ACPI_CPUFREQ=m
CONFIG_X86_ACPI_CPUFREQ_CPB=y
CONFIG_PNPACPI=y
CONFIG_ATA_ACPI=y
CONFIG_SENSORS_ACPI_POWER=m

## grep DRM .config
CONFIG_DRM=y
CONFIG_DRM_KMS_HELPER=y
CONFIG_DRM_I915=y
CONFIG_DRM_I915_KMS=y

Kernel command line: BOOT_IMAGE=linux ro root=801 raid=noautodetect i915.reset=0

It still freezes for me.

regards,
stathis

On the off-chance that this is the same OOPS as one we've recently fixed, please try latest drm-intel-fixes branch from http://cgit.freedesktop.org/~danvet/drm-intel

Otherwise I think it's time to bisect, since we seem to be unable to grab the logs. One last thing to try though is setting up netconsole logging.

I just tried: drm-intel-901593f2bf221659a605bdc1dcb11376ea934163, but it still freezes. I tried setting up netconsole, but I am uncertain if it is my setup, as it doesn't log anything on the remote machine. Doesn't netconsole require that a working interface is configured? Bear in mind that my box freezes very early...

I tried to use netconsole by supplying nomodeset, so that I can figure out what is going on with it. Turns out that drm is loaded before the netconsole at the boot sequence, and (probably due to misconfiguration??)
I see:

Jan 8 16:03:22 commell kernel: netpoll: netconsole: local port 6665
Jan 8 16:03:22 commell kernel: netpoll: netconsole: local IP 192.168.77.1
Jan 8 16:03:22 commell kernel: netpoll: netconsole: interface 'eth1'
Jan 8 16:03:22 commell kernel: netpoll: netconsole: remote port 6666
Jan 8 16:03:22 commell kernel: netpoll: netconsole: remote IP 192.168.77.100
Jan 8 16:03:22 commell kernel: netpoll: netconsole: remote ethernet address 00:25:22:a6:c0:e8
Jan 8 16:03:22 commell kernel: netpoll: netconsole: eth1 doesn't exist, aborting
Jan 8 16:03:22 commell kernel: netconsole: cleaning up

I read around about netconsole, but I ran out of time, so let me know if it is essential that I get netconsole to work in this case (since it appears to be loaded after KMS). I am reading in the netconsole.txt documentation:

"As a built-in, netconsole initializes immediately after NIC cards and will bring up the specified interface as soon as possible. While this doesn't allow capture of early kernel panics, it does capture most of the boot process."

so probably this is not going to help much here?

thanks,
stathis

(In reply to comment #7)
> "As a built-in, netconsole initializes immediately after NIC cards and will
> bring up the specified interface as soon as possible. While this doesn't allow
> capture of early kernel panics, it does capture most of the boot process."
>
> so probably this is not going to help much here?

Ime netconsole only really works if both netconsole and your ethernet driver are built-in. That also solves the neat problem with init ordering, since the i915.ko module will then certainly be loaded afterwards ...
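For readers following along, the netconsole parameter discussed above can be assembled as in the sketch below. It assumes the format given in the kernel's netconsole documentation, `netconsole=[src-port]@[src-ip]/[dev],[tgt-port]@<tgt-ip>/[tgt-mac]`; the helper name `netconsole_param` is made up for illustration, and the addresses are simply the ones from the log above.

```shell
#!/bin/sh
# Sketch only: build a netconsole= parameter string.
# netconsole_param is a hypothetical helper, not an existing tool.
# Arguments: src-port src-ip device tgt-port tgt-ip tgt-mac
netconsole_param() {
    src_port=$1
    src_ip=$2
    dev=$3
    tgt_port=$4
    tgt_ip=$5
    tgt_mac=$6
    printf 'netconsole=%s@%s/%s,%s@%s/%s\n' \
        "$src_port" "$src_ip" "$dev" "$tgt_port" "$tgt_ip" "$tgt_mac"
}

# Values copied from the netpoll messages in this thread.
netconsole_param 6665 192.168.77.1 eth1 6666 192.168.77.100 00:25:22:a6:c0:e8
```

The printed string would go on the kernel command line when netconsole is built in, or after `modprobe netconsole` when it is a module. On the receiving machine a UDP listener on the target port is needed, for example `nc -u -l 6666` or `nc -u -l -p 6666` depending on the netcat variant.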
Hello,

I ran the same greps as Stathis; here is the result:

grep AGP .config
CONFIG_AGP=y
CONFIG_AGP_INTEL=y

grep ACPI .config
# Power management and ACPI options
CONFIG_ACPI=y
CONFIG_ACPI_SLEEP=y
CONFIG_ACPI_AC=y
CONFIG_ACPI_BUTTON=y
CONFIG_ACPI_VIDEO=y
CONFIG_ACPI_FAN=m
CONFIG_ACPI_PROCESSOR=m
CONFIG_ACPI_THERMAL=m
CONFIG_ACPI_BLACKLIST_YEAR=0
CONFIG_ACPI_PCI_SLOT=m
CONFIG_ACPI_HED=y
CONFIG_ACPI_APEI=y
CONFIG_ACPI_APEI_GHES=y
CONFIG_ACPI_APEI_PCIEAER=y
CONFIG_X86_ACPI_CPUFREQ=m
CONFIG_PNPACPI=y
CONFIG_ATA_ACPI=y
# ACPI drivers
CONFIG_SENSORS_ACPI_POWER=m

grep DRM .config
CONFIG_DRM=y
CONFIG_DRM_KMS_HELPER=y
CONFIG_DRM_I915=y

If it can help, uptime is now: 3 days, 51 min, 2 users, load average: 0.41, 0.40, 0.49. The screen is "normal".

Regards
JP P

Ok, I've compiled netconsole and i915 as modules (I set i915 to not modeset by default!) and I have set up a working netconsole monitor where I actually receive the kernel messages. I boot without modesetting and I load both netconsole and i915 from the command line.

modprobe netconsole netconsole=6665@/eth1,6666@192.168.77.100/00:25:22:A6:C0:E8 // ignore this, just a reminder for me :)

I can load i915 without modesetting, with:

modprobe i915 reset=0

which printk's in the netconsole:

[drm] Supports vblank timestamp caching Rev 1 (10.10.2010).
[drm] No driver support for vblank timestamp query.
[drm] Initialized i915 1.6.0 20080730 for 0000:00:02.0 on minor 0

Next I rmmod i915 and reload it with:

modprobe i915 reset=0 modeset=1

which printks:

[drm] Supports vblank timestamp caching Rev 1 (10.10.2010).
[drm] Driver supports precise vblank timestamp query.
vgaarb: device changed decodes: PCI:0000:00:02.0,olddecodes=io+mem,decodes=io+mem:owns=io+mem

That's it. The screen goes blank and the box freezes, exactly as it does when I boot. I have to hard reset it.
I then booted into the working 3.6.11 and scanned for the relevant sections (note i915 is compiled in on this one):

Linux agpgart interface v0.103
agpgart-intel 0000:00:00.0: Intel Q45/Q43 Chipset
agpgart-intel 0000:00:00.0: detected gtt size: 2097152K total, 262144K mappable
agpgart-intel 0000:00:00.0: detected 131072K stolen memory
agpgart-intel 0000:00:00.0: AGP aperture is 256M @ 0xd0000000
[drm] Initialized drm 1.1.0 20060810
i915 0000:00:02.0: setting latency timer to 64
i915 0000:00:02.0: irq 49 for MSI/MSI-X
[drm] Supports vblank timestamp caching Rev 1 (10.10.2010).
[drm] Driver supports precise vblank timestamp query.
vgaarb: device changed decodes: PCI:0000:00:02.0,olddecodes=io+mem,decodes=io+mem:owns=io+mem
[drm] capturing error event; look for more information in /debug/dri/0/i915_error_state
i915: render error detected, EIR: 0x00000010
i915: IPEIR: 0x00000000
i915: IPEHR: 0x11880000
i915: INSTDONE: 0xfffffffe
i915: INSTPS: 0x00000000
i915: INSTDONE1: 0xffffffff
i915: ACTHD: 0x00000000
i915: page table error
i915: PGTBL_ER: 0x00100000
[drm:i915_report_and_clear_eir] *ERROR* EIR stuck: 0x00000010, masking
fbcon: inteldrmfb (fb0) is primary device
Console: switching to colour frame buffer device 240x67
fb0: inteldrmfb frame buffer device
drm: registered panic notifier
[drm] Initialized i915 1.6.0 20080730 for 0000:00:02.0 on minor 0

I then cat /debug/dri/0/i915_error_state and the output was:

Time: 1357675963 s 419826 us
PCI ID: 0x2e12
EIR: 0x00000010
IER: 0x02028c53
PGTBL_ER: 0x00100000
CCID: 0x00000000
fence[0] = 00000000
fence[1] = 00000000
fence[2] = 00000000
fence[3] = 00000000
fence[4] = 00000000
fence[5] = 00000000
fence[6] = 00000000
fence[7] = 00000000
fence[8] = 00000000
fence[9] = 00000000
fence[10] = 00000000
fence[11] = 00000000
fence[12] = 00000000
fence[13] = 00000000
fence[14] = 00000000
fence[15] = 00000000
render command stream:
HEAD: 0x00000000
TAIL: 0x00000000
ACTHD: 0x00000000
IPEIR: 0x00000000
IPEHR: 0x11880000
INSTDONE: 0xfffffffe
INSTDONE1: 0xffffffff
BBADDR: 0x00000000
INSTPS: 0x00000000
INSTPM: 0x00000000
FADDR: 0x00000000
seqno: 0x00000000
waiting: no
ring->head: 0x00000000
ring->tail: 0x00000000
bsd command stream:
HEAD: 0x00000000
TAIL: 0x00000000
ACTHD: 0x00000000
IPEIR: 0x00000000
IPEHR: 0x00000000
INSTDONE: 0x00000000
INSTPS: 0x00000000
INSTPM: 0x00000000
FADDR: 0x00000000
seqno: 0x00000000
waiting: no
ring->head: 0x00000000
ring->tail: 0x00000000
Active [0]:
Pinned [4]:
00000000 4096 0001 0001 00000000 P snooped (LLC)
00001000 131072 0040 0040 00000000 P dirty uncached
00021000 4096 0001 0001 00000000 P snooped (LLC)
00022000 131072 0040 0040 00000000 P dirty uncached
render ring --- ringbuffer = 0x00001000
00000000 : 00000000
00000004 : 00000000
.. <long list> : all values 00000000 here

Does this help at all? Shall I try anything else? I'll cross check if I (can) have exactly JP's config.

thanks for the help.
stathis

Was that netconsole with drm.debug=0xe? There should be more noise before we start touching the hw for real ...

I'm not sure. I haven't added a parameter for debugging drm.

(In reply to comment #8)
> (In reply to comment #7)
> > "As a built-in, netconsole initializes immediately after NIC cards and will
> > bring up the specified interface as soon as possible. While this doesn't allow
> > capture of early kernel panics, it does capture most of the boot process."
> >
> > so probably this is not going to help much here?
>
> Ime netconsole only really works if both netconsole and your ethernet driver
> are built-in. That also solves the neat problem with init ordering, since the
> i915.ko module will then certainly be loaded afterwards ...

Actually this doesn't seem to be the case. I compiled netconsole and my ethernet driver into the kernel and left i915 and drm as modules.
When I have nomodeset in my parameters, it all goes smoothly: I receive the kernel messages on the netconsole, but there seems to be a slight delay between when they appear on screen and when they show up on the monitoring machine. When I remove the nomodeset parameter, so that the i915/drm modules are loaded, there is no output on the netconsole at all. The reason is that the machine freezes, probably before it manages to dump the buffer over the wire, so I again receive nothing... this is very frustrating.

Created attachment 90821 [details]
various dmesg files

I have enabled drm.debug=0xe (see the cmdline of the kernel for details on all parameters)

- dmesg-3.6.11 is the log from the kernel configuration in which modesetting works (but there are still some errors with the drm)
- dmesg-3.8.0-rc2 is the log when I boot with the 3.8.0-rc2 (drm-intel-901593f2bf221659a605bdc1dcb11376ea934163), for which I have to set modeset=0 to get my system to boot.
- dmesg-3.8.0-rc2-igfx_off is the log from the exact same kernel, only now run with the kernel parameters "intel_iommu=igfx_off" and modeset=1. KMS works, however I don't really know what the consequences are. I saw this hint at: https://bugs.launchpad.net/ubuntu/+source/linux/+bug/911236

ok. I'm out of ideas. Just let me know if there is anything to do next.

Hm, g45 is known to blow up with dmar enabled, but we should automatically disable things (i.e. what you've done with the igfx_off option). Also strange is why 3.6 works for you, but 3.8 doesn't - we shouldn't have changed anything. Can you please compare the .config for CONFIG_DMAR and related options of these two kernels?

Created attachment 91401 [details]
disable DMAR on gen4 gfx
Please test this patch, thanks.
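Until a patched kernel is running, the intel_iommu=igfx_off workaround mentioned earlier in this thread can be confirmed on a booted system with something like the sketch below. The helper name `has_igfx_off` is made up for illustration, and the sketch assumes that intel_iommu= accepts a comma-separated list of values (so that a form like `intel_iommu=on,igfx_off` also counts).

```shell
#!/bin/sh
# Sketch only. has_igfx_off is a hypothetical helper: it succeeds when the
# given kernel command line contains an intel_iommu= option with igfx_off
# among its comma-separated values (assumption stated in the text above).
has_igfx_off() {
    for word in $1; do
        case $word in
            intel_iommu=*)
                # Wrap the value list in commas so every entry can be
                # matched as ",igfx_off," regardless of its position.
                case ",${word#intel_iommu=}," in
                    *,igfx_off,*) return 0 ;;
                esac
                ;;
        esac
    done
    return 1
}

# Report on the running kernel when /proc/cmdline is readable.
if [ -r /proc/cmdline ] && has_igfx_off "$(cat /proc/cmdline)"; then
    echo "intel_iommu=igfx_off is set"
else
    echo "intel_iommu=igfx_off is not set"
fi
```

The check only looks at the command line; it does not tell whether the kernel actually disabled DMAR for the graphics device, which is what the dmesg output would show.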
Checking the kernel configs:

# grep DMAR linux-3.7.1/.config
CONFIG_DMAR_TABLE=y
# grep DMAR linux-3.6.11/.config
CONFIG_DMAR_TABLE=y

No difference there. No other related options. I'll test the patch once I'm physically at the machine later on and let you know. Thanks.

Oops, they're actually called CONFIG_INTEL_IOMMU_DEFAULT_ON and CONFIG_INTEL_IOMMU. Can you please check those? The DMAR one is only the internal one we use in the code ...

I grepped both .configs and they have exactly the same:

CONFIG_INTEL_IOMMU=y
CONFIG_INTEL_IOMMU_DEFAULT_ON=y
CONFIG_INTEL_IOMMU_FLOPPY_WA=y

I applied the patch against 3.7.1, it still freezes ...

Created attachment 91491 [details]
disable DMAR on gen4 gfx v2
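The config comparison done by hand above can be repeated in one pass with a sketch like the following. The helper name `iommu_config` is made up for illustration; the two paths are the kernel trees named in this thread and are assumed to exist only on the reporter's machine, so the loop skips any that are missing.

```shell
#!/bin/sh
# Sketch only. iommu_config is a hypothetical helper: it prints the Intel
# IOMMU / DMAR related lines of a kernel .config, including entries of the
# form "# CONFIG_... is not set".
iommu_config() {
    grep -E '^(# )?CONFIG_(INTEL_IOMMU|DMAR)' "$1"
}

# Show the options for each kernel tree that is present.
for cfg in linux-3.6.11/.config linux-3.7.1/.config; do
    if [ -r "$cfg" ]; then
        echo "== $cfg"
        iommu_config "$cfg" || echo "(no IOMMU options found)"
    fi
done
```

Matching both the CONFIG_INTEL_IOMMU and CONFIG_DMAR prefixes covers the option names that came up in the discussion, whichever of them a given kernel version uses.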
this patch worked. thanks.

Fix merged to drm-intel-fixes, will send out a pull request rsn:

commit 9452618e7462181ed9755236803b6719298a13ce
Author: Daniel Vetter <daniel.vetter@ffwll.ch>
Date: Sun Jan 20 23:50:13 2013 +0100

    iommu/intel: disable DMAR for g4x integrated gfx