Bug 51921

Summary: [g45 dmar] Screen black, no action possible
Product: Drivers
Reporter: JP Pozzi (jp.pozzi)
Component: Video(DRI - Intel)
Assignee: intel-gfx-bugs (intel-gfx-bugs)
Status: RESOLVED PATCH_ALREADY_AVAILABLE
Severity: high
CC: daniel, intel-gfx-bugs, stathis
Priority: P1
Hardware: All
OS: Linux
Kernel Version: 3.7.1
Subsystem:
Regression: Yes
Bisected commit-id:
Attachments: various dmesg files
             disable DMAR on gen4 gfx
             disable DMAR on gen4 gfx v2

Description JP Pozzi 2012-12-22 18:38:52 UTC
Hello,

Today I tried a 3.7.1 kernel (self-compiled) and got what appears to be a complete freeze of the machine; at least the screen was all black and no actions were "visible".
I had to restart "wildly" with a hard reset.
The only thing I found in the logs was the following message in "kern.log", the last one before the system's restart:

Dec 22 19:13:34 jport kernel: [16760.610045] [drm:i915_hangcheck_hung] *ERROR* Hangcheck timer elapsed... GPU hung
Dec 22 19:13:34 jport kernel: [16760.610052] [drm] capturing error event; look for more information in /debug/dri/0/i915_error_state
Dec 22 19:13:34 jport kernel: [16761.130048] [drm:i915_reset] *ERROR* Failed to reset chip.

As I had to reset the system, I can't find the /debug/dri/0/i915_error_state file.

System :
Thinkpad 
CPU : Intel(R) Core(TM)2 Duo CPU     T8100  @ 2.10GHz
memory : 1025840 Kb
Distribution : Debian 6

Regards

JP P
Comment 1 JP Pozzi 2012-12-22 20:07:26 UTC
Hello,

I get a second "freeze" with the same messages :

Dec 22 20:49:08 jport kernel: [ 5642.810093] [drm:i915_hangcheck_hung] *ERROR* Hangcheck timer elapsed... GPU hung
Dec 22 20:49:08 jport kernel: [ 5642.810105] [drm] capturing error event; look for more information in /debug/dri/0/i915_error_state
Dec 22 20:49:09 jport kernel: [ 5643.330121] [drm:i915_reset] *ERROR* Failed to reset chip.

Preceding messages involving i915:


Dec 22 19:15:35 jport kernel: [   34.577466] i915 0000:00:02.0: irq 44 for MSI/MSI-X
Dec 22 19:15:35 jport kernel: [   34.577477] [drm] Supports vblank timestamp caching Rev 1 (10.10.2010).
Dec 22 19:15:35 jport kernel: [   34.577480] [drm] Driver supports precise vblank timestamp query.
Dec 22 19:15:35 jport kernel: [   34.577518] vgaarb: device changed decodes: PCI:0000:00:02.0,olddecodes=io+mem,decodes=io+mem:owns=io+mem
Dec 22 19:15:35 jport kernel: [   35.038989] [drm] initialized overlay support

Dec 22 19:33:59 jport kernel: [ 1130.380102] i915 0000:00:02.0: power state changed by ACPI to D3hot


All was OK with a 3.6.5 kernel.

Regards

JP P
Comment 2 stathis 2012-12-30 00:23:47 UTC
Hi,

I think I am hit by the same bug in stable 3.7.1. My box has the same behavior, it effectively freezes completely and can only be hard reset and boot back into a different kernel version. 

I managed to track down that the problem does not occur on 3.6.10 or 3.6.11.

I looked at my logs, but there is nothing in the 3.7.1 log, as the system locks up before managing to write anything to disk or to the screen. It does seem to lock up when it attempts to modeset though, as it freezes partway through the kernel boot sequence. I do see some errors in the 3.6.10 log, but I don't think they are relevant:

Dec 17 08:42:11 commell kernel: [drm] Initialized drm 1.1.0 20060810
Dec 17 08:42:11 commell kernel: i915 0000:00:02.0: setting latency timer to 64
Dec 17 08:42:11 commell kernel: i915 0000:00:02.0: irq 49 for MSI/MSI-X
Dec 17 08:42:11 commell kernel: [drm] Supports vblank timestamp caching Rev 1 (10.10.2010).
Dec 17 08:42:11 commell kernel: [drm] Driver supports precise vblank timestamp query.
Dec 17 08:42:11 commell kernel: vgaarb: device changed decodes: PCI:0000:00:02.0,olddecodes=io+mem,decodes=io+mem:owns=io+mem
Dec 17 08:42:11 commell kernel: No connectors reported connected with modes
Dec 17 08:42:11 commell kernel: [drm] Cannot find any crtc or sizes - going 1024x768  
Dec 17 08:42:11 commell kernel: fbcon: inteldrmfb (fb0) is primary device
Dec 17 08:42:11 commell kernel: Console: switching to colour frame buffer device 128x48
Dec 17 08:42:11 commell kernel: fb0: inteldrmfb frame buffer device
Dec 17 08:42:11 commell kernel: drm: registered panic notifier
Dec 17 08:42:11 commell kernel: [drm] Initialized i915 1.6.0 20080730 for 0000:00:02.0 on minor 0

..
.. <a bit further down in the same boot sequence>
.. 

Dec 17 08:42:11 commell kernel: No soundcards found.
Dec 17 08:42:11 commell kernel: [drm] capturing error event; look for more information in /debug/dri/0/i915_error_state
Dec 17 08:42:11 commell kernel: i915: render error detected, EIR: 0x00000010
Dec 17 08:42:11 commell kernel: i915:   IPEIR: 0x00000000
Dec 17 08:42:11 commell kernel: i915:   IPEHR: 0x11880000
Dec 17 08:42:11 commell kernel: i915:   INSTDONE: 0xfffffffe
Dec 17 08:42:11 commell kernel: i915:   INSTPS: 0x00000000
Dec 17 08:42:11 commell kernel: i915:   INSTDONE1: 0xffffffff
Dec 17 08:42:11 commell kernel: i915:   ACTHD: 0x00000000
Dec 17 08:42:11 commell kernel: i915: page table error
Dec 17 08:42:11 commell kernel: i915:   PGTBL_ER: 0x00100000
Dec 17 08:42:11 commell kernel: [drm:i915_report_and_clear_eir] *ERROR* EIR stuck: 0x00000010, masking
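
Editorial sketch: when comparing dumps like the one above across kernels, it can help to pull the register values out of the log text. The following is a minimal Python sketch, not part of the original report; the function name `parse_i915_registers` and the regular expression are illustrative assumptions, and it only handles lines shaped like the ones pasted here.

```python
import re

# Matches register lines such as "kernel: i915:   IPEHR: 0x11880000" or
# "i915: render error detected, EIR: 0x00000010". The pattern is an assumption
# based only on the log lines quoted in this report.
REGISTER_RE = re.compile(r"i915:.*?\b([A-Z_0-9]+): (0x[0-9a-fA-F]+)\s*$")


def parse_i915_registers(log_text):
    """Return {register name: integer value} for every i915 register line."""
    registers = {}
    for line in log_text.splitlines():
        match = REGISTER_RE.search(line)
        if match:
            registers[match.group(1)] = int(match.group(2), 16)
    return registers


sample = """\
Dec 17 08:42:11 commell kernel: i915: render error detected, EIR: 0x00000010
Dec 17 08:42:11 commell kernel: i915:   IPEHR: 0x11880000
Dec 17 08:42:11 commell kernel: i915: page table error
Dec 17 08:42:11 commell kernel: i915:   PGTBL_ER: 0x00100000
"""
regs = parse_i915_registers(sample)
print({name: hex(value) for name, value in regs.items()})
```

The sketch only extracts values; it does not attempt to interpret individual register bits.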


This is a headless server system on which I very infrequently use the console or X directly. I tried passing the kernel parameter "nomodeset", which apparently boots me into 3.7.1 without a problem (only the console resolution is odd). It would be great if this could be fixed.


The system is based on a Commell LV-67AD2 Mini-ITX board ( http://www.commell.com.tw/Product/SBC/LV-67A.HTM )

CPU : Intel(R) Core(TM)2 Quad CPU    Q9400  @ 2.66GHz
memory : 4GB
Distribution : Gentoo x64

Let me know if there is anything to test.

thanks,
stathis
Comment 3 Daniel Vetter 2013-01-07 15:25:11 UTC
You can avoid the gpu reset (which is usually the reason behind a hard-hang) with i915.reset=0 added to your kernel cmdline. That should hopefully allow you to grab the error state.
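
Editorial sketch: the error state lives in debugfs, so it is lost on reboot and has to be copied somewhere persistent while the machine is still responsive. A minimal Python sketch of that step follows; the helper name `save_error_state` and the output filename are hypothetical, and the source path is simply the one printed in the kernel messages of this report (the debugfs mount point may differ on other systems).

```python
from pathlib import Path


def save_error_state(source, destination):
    """Copy the error state dump to `destination`.

    Returns the number of bytes copied, or None if the source cannot be read
    (for example because debugfs is not mounted at that location).
    """
    try:
        data = Path(source).read_bytes()
    except OSError:
        return None
    Path(destination).write_bytes(data)
    return len(data)


if __name__ == "__main__":
    # Path as printed by the kernel in this report; treat it as an assumption.
    copied = save_error_state("/debug/dri/0/i915_error_state", "i915_error_state.txt")
    print("no error state found" if copied is None else f"saved {copied} bytes")
```

Reading the file usually requires root privileges.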
Comment 4 JP Pozzi 2013-01-08 00:26:42 UTC
Hello,

I "tweaked" the .config file, setting "CONFIG_DRM_I915=y" and "CONFIG_ACPI_VIDEO=y" and switching some other parameters from "m" to "y".

Since the last compile the system has been up for 2 days 6 hours, with a very nice screen showing all 1920 pixels and without any lockup.

Regards

JP P
Comment 5 stathis 2013-01-08 01:13:44 UTC
It's a no go for me. JP, I tried what you have suggested, but it still freezes upon modesetting. Here are some of the relevant kernel config options of my 3.7.1 built kernel:


## grep AGP .config

CONFIG_AGP=y
CONFIG_AGP_INTEL=y

## grep ACPI .config

CONFIG_X86_64_ACPI_NUMA=y
# Power management and ACPI options
CONFIG_ACPI=y
CONFIG_ACPI_PROCFS=y
CONFIG_ACPI_PROCFS_POWER=y
CONFIG_ACPI_PROC_EVENT=y
CONFIG_ACPI_AC=y
CONFIG_ACPI_BATTERY=y
CONFIG_ACPI_BUTTON=y
CONFIG_ACPI_VIDEO=y
CONFIG_ACPI_FAN=y
CONFIG_ACPI_DOCK=y
CONFIG_ACPI_PROCESSOR=m
CONFIG_ACPI_THERMAL=m
CONFIG_ACPI_NUMA=y
CONFIG_ACPI_BLACKLIST_YEAR=0
# CONFIG_ACPI_DEBUG is not set
# CONFIG_ACPI_PCI_SLOT is not set
CONFIG_ACPI_CONTAINER=y
CONFIG_X86_ACPI_CPUFREQ=m
CONFIG_X86_ACPI_CPUFREQ_CPB=y
CONFIG_PNPACPI=y
CONFIG_ATA_ACPI=y
CONFIG_SENSORS_ACPI_POWER=m

## grep DRM .config
CONFIG_DRM=y
CONFIG_DRM_KMS_HELPER=y
CONFIG_DRM_I915=y
CONFIG_DRM_I915_KMS=y

Kernel command line: BOOT_IMAGE=linux ro root=801 raid=noautodetect i915.reset=0

It still freezes for me.

regards,
stathis
Comment 6 Daniel Vetter 2013-01-08 09:03:18 UTC
On the off-chance that this is the same OOPS as one we've recently fixed, please try latest drm-intel-fixes branch from

http://cgit.freedesktop.org/~danvet/drm-intel

Otherwise I think it's time to bisect, since we seem to be unable to grab the logs. One last thing to try though is setting up netconsole logging.
Comment 7 stathis 2013-01-08 17:11:14 UTC
I just tried: drm-intel-901593f2bf221659a605bdc1dcb11376ea934163, but it still freezes. I tried setting up netconsole, but I am uncertain if it is my setup, as it doesn't log anything on the remote machine. 

Doesn't netconsole require that a working interface is configured? 
Bear in mind that my box freezes very early...

I tried to use netconsole by supplying nomodeset, so that I can figure out what is going on with it. Turns out that drm is loaded before the netconsole at the boot sequence, and (probably due to misconfiguration??) I see:

Jan  8 16:03:22 commell kernel: netpoll: netconsole: local port 6665
Jan  8 16:03:22 commell kernel: netpoll: netconsole: local IP 192.168.77.1   
Jan  8 16:03:22 commell kernel: netpoll: netconsole: interface 'eth1' 
Jan  8 16:03:22 commell kernel: netpoll: netconsole: remote port 6666
Jan  8 16:03:22 commell kernel: netpoll: netconsole: remote IP 192.168.77.100
Jan  8 16:03:22 commell kernel: netpoll: netconsole: remote ethernet address 00:25:22:a6:c0:e8   
Jan  8 16:03:22 commell kernel: netpoll: netconsole: eth1 doesn't exist, aborting
Jan  8 16:03:22 commell kernel: netconsole: cleaning up

I read around about netconsole, but I ran out of time, so let me know if it is essential that I get netconsole to work in this case (since it appears to be loaded after KMS). 

I am reading in the netconsole.txt documentation:

"As a built-in, netconsole initializes immediately after NIC cards and will bring up the specified interface as soon as possible. While this doesn't allow
capture of early kernel panics, it does capture most of the boot process."

so probably this is not going to help much here?

thanks,
stathis
Comment 8 Daniel Vetter 2013-01-08 18:37:09 UTC
(In reply to comment #7)
> "As a built-in, netconsole initializes immediately after NIC cards and will
> bring up the specified interface as soon as possible. While this doesn't
> allow capture of early kernel panics, it does capture most of the boot
> process."
> 
> so probably this is not going to help much here?

Ime netconsole only really works if both netconsole and your ethernet driver are built-in. That also solves the neat problem with init ordering, since the i915.ko module will then certainly be loaded afterwards ...
Comment 9 JP Pozzi 2013-01-08 18:55:42 UTC
Hello,

I ran the same greps as Stathis; here is the result:

grep AGP .config
CONFIG_AGP=y
CONFIG_AGP_INTEL=y

grep ACPI .config
# Power management and ACPI options
CONFIG_ACPI=y
CONFIG_ACPI_SLEEP=y
CONFIG_ACPI_AC=y
CONFIG_ACPI_BUTTON=y
CONFIG_ACPI_VIDEO=y
CONFIG_ACPI_FAN=m
CONFIG_ACPI_PROCESSOR=m
CONFIG_ACPI_THERMAL=m
CONFIG_ACPI_BLACKLIST_YEAR=0
CONFIG_ACPI_PCI_SLOT=m
CONFIG_ACPI_HED=y
CONFIG_ACPI_APEI=y
CONFIG_ACPI_APEI_GHES=y
CONFIG_ACPI_APEI_PCIEAER=y
CONFIG_X86_ACPI_CPUFREQ=m
CONFIG_PNPACPI=y
CONFIG_ATA_ACPI=y
# ACPI drivers
CONFIG_SENSORS_ACPI_POWER=m

grep DRM .config
CONFIG_DRM=y
CONFIG_DRM_KMS_HELPER=y
CONFIG_DRM_I915=y


If it can help, 
Uptime is now : 3 days, 51 min,  2 users,  load average: 0.41, 0.40, 0.49
the screen is "normal".

Regards

JP P
Comment 10 stathis 2013-01-08 20:31:31 UTC
Ok, I've compiled netconsole and i915 as modules (I set i915 to not modeset by default!) and I have set up a working netconsole monitor where I actually receive the kernel messages. I boot without modesetting and I load both netconsole and i915 from the command line.

modprobe netconsole netconsole=6665@/eth1,6666@192.168.77.100/00:25:22:A6:C0:E8
// ignore this, just a reminder for me :)
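
Editorial sketch: the netconsole= value above follows the layout described in the kernel's netconsole documentation, [src-port]@[src-ip]/[dev],[tgt-port]@<tgt-ip>/[tgt-macaddr]. The Python sketch below splits such a string into its fields so the individual parts of the command are easier to read. The function name `parse_netconsole` and the dictionary keys are illustrative assumptions, and the optional leading "+" of newer kernels is not handled.

```python
def parse_netconsole(value):
    """Split a netconsole= parameter into its fields; empty fields become None."""
    local, remote = value.split(",", 1)
    src_port, src_rest = local.split("@", 1)
    src_ip, _, dev = src_rest.partition("/")
    tgt_port, tgt_rest = remote.split("@", 1)
    tgt_ip, _, tgt_mac = tgt_rest.partition("/")
    return {
        "src_port": int(src_port) if src_port else None,
        "src_ip": src_ip or None,
        "dev": dev or None,
        "tgt_port": int(tgt_port) if tgt_port else None,
        "tgt_ip": tgt_ip or None,
        "tgt_mac": tgt_mac or None,
    }


# The value used in the modprobe command above.
cfg = parse_netconsole("6665@/eth1,6666@192.168.77.100/00:25:22:A6:C0:E8")
for field, val in cfg.items():
    print(f"{field}: {val}")
```

Here the source IP is left empty, the local interface is eth1, and messages go to UDP port 6666 on 192.168.77.100.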

I can load i915 without modesetting, with:

modprobe i915 reset=0

which printk's in the netconsole:

[drm] Supports vblank timestamp caching Rev 1 (10.10.2010).
[drm] No driver support for vblank timestamp query.
[drm] Initialized i915 1.6.0 20080730 for 0000:00:02.0 on minor 0


Next I rmmod i915 and reload it with:

modprobe i915 reset=0 modeset=1

which printks :

[drm] Supports vblank timestamp caching Rev 1 (10.10.2010).
[drm] Driver supports precise vblank timestamp query.
vgaarb: device changed decodes: PCI:0000:00:02.0,olddecodes=io+mem,decodes=io+mem:owns=io+mem

that's it. the screen goes blank and the box freezes, exactly as it does when I boot. I have to hard reset it.




I then booted into the working 3.6.11 and scanned for the relevant sections (note i915 is compiled in on this one):

Linux agpgart interface v0.103
agpgart-intel 0000:00:00.0: Intel Q45/Q43 Chipset
agpgart-intel 0000:00:00.0: detected gtt size: 2097152K total, 262144K mappable
agpgart-intel 0000:00:00.0: detected 131072K stolen memory
agpgart-intel 0000:00:00.0: AGP aperture is 256M @ 0xd0000000   
[drm] Initialized drm 1.1.0 20060810
i915 0000:00:02.0: setting latency timer to 64
i915 0000:00:02.0: irq 49 for MSI/MSI-X
[drm] Supports vblank timestamp caching Rev 1 (10.10.2010).
[drm] Driver supports precise vblank timestamp query.
vgaarb: device changed decodes: PCI:0000:00:02.0,olddecodes=io+mem,decodes=io+mem:owns=io+mem
[drm] capturing error event; look for more information in /debug/dri/0/i915_error_state
i915: render error detected, EIR: 0x00000010
i915:   IPEIR: 0x00000000 
i915:   IPEHR: 0x11880000
i915:   INSTDONE: 0xfffffffe
i915:   INSTPS: 0x00000000   
i915:   INSTDONE1: 0xffffffff
i915:   ACTHD: 0x00000000
i915: page table error
i915:   PGTBL_ER: 0x00100000
[drm:i915_report_and_clear_eir] *ERROR* EIR stuck: 0x00000010, masking
fbcon: inteldrmfb (fb0) is primary device
Console: switching to colour frame buffer device 240x67
fb0: inteldrmfb frame buffer device
drm: registered panic notifier
[drm] Initialized i915 1.6.0 20080730 for 0000:00:02.0 on minor 0

I then cat /debug/dri/0/i915_error_state and the output was:

Time: 1357675963 s 419826 us
PCI ID: 0x2e12
EIR: 0x00000010
IER: 0x02028c53
PGTBL_ER: 0x00100000
CCID: 0x00000000
  fence[0] = 00000000
  fence[1] = 00000000
  fence[2] = 00000000
  fence[3] = 00000000
  fence[4] = 00000000
  fence[5] = 00000000
  fence[6] = 00000000
  fence[7] = 00000000
  fence[8] = 00000000
  fence[9] = 00000000
  fence[10] = 00000000
  fence[11] = 00000000
  fence[12] = 00000000
  fence[13] = 00000000
  fence[14] = 00000000
  fence[15] = 00000000
render command stream:
  HEAD: 0x00000000
  TAIL: 0x00000000
  ACTHD: 0x00000000
  IPEIR: 0x00000000
  IPEHR: 0x11880000
  INSTDONE: 0xfffffffe
  INSTDONE1: 0xffffffff
  BBADDR: 0x00000000
  INSTPS: 0x00000000
  INSTPM: 0x00000000
  FADDR: 0x00000000
  seqno: 0x00000000
  waiting: no
  ring->head: 0x00000000
  ring->tail: 0x00000000
bsd command stream:
  HEAD: 0x00000000
  TAIL: 0x00000000
  ACTHD: 0x00000000
  IPEIR: 0x00000000
  IPEHR: 0x00000000
  INSTDONE: 0x00000000
  INSTPS: 0x00000000
  INSTPM: 0x00000000
  FADDR: 0x00000000
  seqno: 0x00000000
  waiting: no
  ring->head: 0x00000000
  ring->tail: 0x00000000
Active [0]:
Pinned [4]:
  00000000     4096 0001 0001 00000000 P snooped (LLC)
  00001000   131072 0040 0040 00000000 P dirty uncached
  00021000     4096 0001 0001 00000000 P snooped (LLC)
  00022000   131072 0040 0040 00000000 P dirty uncached
render ring --- ringbuffer = 0x00001000
00000000 :  00000000
00000004 :  00000000
..
<long list> :  all values 00000000 here




Does this help at all? Shall I try anything else?


I'll cross check if I (can) have exactly JPs config.


thanks for the help.

stathis
Comment 11 Daniel Vetter 2013-01-08 21:00:43 UTC
Was that netconsole with drm.debug=0xe? There should be more noise before we start touching the hw for real ...
Comment 12 stathis 2013-01-08 21:41:20 UTC
I'm not sure. I haven't added a parameter for debugging drm.
Comment 13 stathis 2013-01-08 21:57:42 UTC
(In reply to comment #8)
> (In reply to comment #7)
> > "As a built-in, netconsole initializes immediately after NIC cards and will
> > bring up the specified interface as soon as possible. While this doesn't
> > allow capture of early kernel panics, it does capture most of the boot
> > process."
> > 
> > so probably this is not going to help much here?
> 
> Ime netconsole only really works if both netconsole and your ethernet driver
> are built-in. That also solves the neat problem with init ordering, since the
> i915.ko module will then certainly be loaded afterwards ...

Actually this doesn't seem to be the case. I compiled netconsole and my ethernet driver into the kernel and left i915 and drm as modules. When I have nomodeset in my parameters, it all goes smoothly: I receive the kernel messages on the netconsole, but there seems to be a slight delay between when they appear on screen and when they show up on the monitoring machine.

When I remove the nomodeset parameter, so that the i915/drm modules are loaded, there is actually no output on the netconsole shown. The reason is that the machine freezes, probably before it manages to dump the buffer over the wire, so I again receive nothing....this is very frustrating.
Comment 14 stathis 2013-01-09 01:03:49 UTC
Created attachment 90821 [details]
various dmesg files

I have enabled drm.debug=0xe (see the cmdline of the kernel for details on all parameters)

- dmesg-3.6.11 is the log from the kernel configuration in which modesetting works (but there are still some errors with the drm)

- dmesg-3.8.0-rc2 is the log when I boot with 3.8.0-rc2 (drm-intel-901593f2bf221659a605bdc1dcb11376ea934163), for which I have to set modeset=0 to get my system to boot. 

- dmesg-3.8.0-rc2-igfx_off is the log from exactly the same kernel as above, only now run with the kernel parameters "intel_iommu=igfx_off" and modeset=1. KMS works, but I don't really know what the consequences are. I saw this hint at: https://bugs.launchpad.net/ubuntu/+source/linux/+bug/911236
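
Editorial sketch: to confirm which of the logs above were taken with the workaround active, the kernel command line recorded in each dmesg can be checked for intel_iommu=igfx_off. The Python sketch below is a rough illustration; the helper names are hypothetical, and the parsing is deliberately simple (it does not handle quoted values containing spaces).

```python
def parse_cmdline(cmdline):
    """Split a kernel command line into {key: value}; bare flags map to ''."""
    params = {}
    for token in cmdline.split():
        key, _, value = token.partition("=")
        params[key] = value
    return params


def igfx_dmar_disabled(cmdline):
    """True if intel_iommu= lists igfx_off among its comma-separated options."""
    options = parse_cmdline(cmdline).get("intel_iommu", "")
    return "igfx_off" in options.split(",")


# First line: the command line quoted in comment 5. Second line: the same with
# the workaround appended (an illustrative combination, not a quoted log).
without = "BOOT_IMAGE=linux ro root=801 raid=noautodetect i915.reset=0"
with_workaround = without + " intel_iommu=igfx_off"
print(igfx_dmar_disabled(without), igfx_dmar_disabled(with_workaround))
```

On a running system the same string can be read from /proc/cmdline.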

ok. I'm out of ideas. Just let me know if there is anything to do next.
Comment 15 Daniel Vetter 2013-01-16 14:25:00 UTC
Hm, g45 is known to blow up with dmar enabled, but we should automatically disable things (i.e. what you've done with the igfx_off option). Also strange is why 3.6 works for you, but 3.8 doesn't - we shouldn't have changed anything. Can you please compare the .config for CONFIG_DMAR and related options of these two kernels?
Comment 16 Daniel Vetter 2013-01-16 19:41:16 UTC
Created attachment 91401 [details]
disable DMAR on gen4 gfx

Please test this patch, thanks.
Comment 17 stathis 2013-01-17 07:15:00 UTC
Checking the kernel configs:

# grep DMAR linux-3.7.1/.config
CONFIG_DMAR_TABLE=y

# grep DMAR linux-3.6.11/.config
CONFIG_DMAR_TABLE=y

No difference there. No other related options.

I'll test the patch once I'm physically at the machine later on and let you know. Thanks.
Comment 18 Daniel Vetter 2013-01-17 09:11:15 UTC
Oops, they're actually called CONFIG_INTEL_IOMMU_DEFAULT_ON and CONFIG_INTEL_IOMMU. Can you please check those? The DMAR one is only the internal one we use in the code ...
Comment 19 stathis 2013-01-17 23:00:19 UTC
I grepped both .configs and they contain exactly the same options:

CONFIG_INTEL_IOMMU=y
CONFIG_INTEL_IOMMU_DEFAULT_ON=y
CONFIG_INTEL_IOMMU_FLOPPY_WA=y
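
Editorial sketch: comparing a group of options between two kernel .config files by eye is error-prone, so a small script can list only the options that differ. The Python sketch below is one possible way to do it; the function names are hypothetical, and options that are commented out ("# CONFIG_X is not set") are simply treated as absent.

```python
import re


def config_options(config_text, pattern):
    """Return {option: value} for set options whose name matches `pattern`."""
    options = {}
    for line in config_text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blanks, comments and "is not set" lines
        name, sep, value = line.partition("=")
        if sep and re.search(pattern, name):
            options[name] = value
    return options


def config_diff(old_text, new_text, pattern):
    """Return {option: (old value, new value)} for options that differ."""
    old = config_options(old_text, pattern)
    new = config_options(new_text, pattern)
    return {
        name: (old.get(name), new.get(name))
        for name in sorted(old.keys() | new.keys())
        if old.get(name) != new.get(name)
    }


# The values reported above, identical for both kernels.
reported = """\
CONFIG_INTEL_IOMMU=y
CONFIG_INTEL_IOMMU_DEFAULT_ON=y
CONFIG_INTEL_IOMMU_FLOPPY_WA=y
"""
print(config_diff(reported, reported, "INTEL_IOMMU"))
```

An empty result matches the finding in this comment that the IOMMU options are the same in both builds.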


I applied the patch against 3.7.1, it still freezes ...
Comment 20 Daniel Vetter 2013-01-19 16:30:22 UTC
Created attachment 91491 [details]
disable DMAR on gen4 gfx v2
Comment 21 stathis 2013-01-20 21:09:25 UTC
this patch worked. thanks.
Comment 22 Daniel Vetter 2013-01-24 10:14:51 UTC
Fix merged to drm-intel-fixes, will send out a pull request rsn:

commit 9452618e7462181ed9755236803b6719298a13ce
Author: Daniel Vetter <daniel.vetter@ffwll.ch>
Date:   Sun Jan 20 23:50:13 2013 +0100

    iommu/intel: disable DMAR for g4x integrated gfx