Bug 19582

Summary: Watching Sintel in 2K resolution via VAAPI results in GPU Hang and memory allocation issues
Product: Drivers
Reporter: Julian Andres Klode (jak)
Component: Video(DRI - Intel)
Assignee: drivers_video-dri-intel (drivers_video-dri-intel)
Status: RESOLVED INVALID
Severity: normal
CC: chris
Priority: P1
Hardware: All
OS: Linux
Kernel Version: 2.6.35.4, 2.6.35.7
Subsystem:
Regression: No
Bisected commit-id:

Description Julian Andres Klode 2010-10-02 18:15:35 UTC
I tried watching Sintel in 2K resolution using VAAPI; this caused a GPU hang. Debugging the GPU hang was not possible, because reading i915_error_state failed with a warning that not enough memory could be allocated.

The following dmesg logs show the hang and the trace from running cat on i915_error_state.

Result from Debian Kernel based on 2.6.35.4
============================================

[  501.423918] [drm:i915_hangcheck_elapsed] *ERROR* Hangcheck timer elapsed... GPU hung
[  501.425867] [drm:i915_do_wait_request] *ERROR* i915_do_wait_request returns -5 (awaiting 6477 at 6080)
[  506.641302] ------------[ cut here ]------------
[  506.641313] WARNING: at /build/mattems-linux-2.6_2.6.35-1~experimental.3-amd64-XReacf/linux-2.6-2.6.35/debian/build/source_amd64_none/mm/page_alloc.c:1968 __alloc_pages_nodemask+0x17c/0x70b()
[  506.641316] Hardware name: 03017VG
[  506.641317] Modules linked in: cbc hidp aes_x86_64 aes_generic ecryptfs parport_pc ppdev sco lp parport rfcomm bnep l2cap acpi_cpufreq mperf binfmt_misc cpufreq_stats cpufreq_userspace cpufreq_powersave kvm_intel cpufreq_conservative kvm uinput fuse loop snd_hda_codec_intelhdmi snd_hda_codec_realtek btusb joydev bluetooth arc4 uvcvideo thinkpad_acpi videodev snd_hda_intel ecb v4l1_compat v4l2_compat_ioctl32 iwlagn snd_hda_codec iwlcore snd_hwdep snd_seq snd_pcm serio_raw mac80211 snd_timer cfg80211 i2c_i801 pcspkr snd_seq_device rfkill snd_page_alloc tpm_tis snd soundcore led_class tpm tpm_bios psmouse battery nvram processor ac evdev ext4 mbcache jbd2 crc16 btrfs zlib_deflate crc32c libcrc32c sg sr_mod cdrom sd_mod usbhid crc_t10dif ata_generic usb_storage hid i915 ata_piix libata drm_kms_helper drm ehci_hcd i2c_algo_bit scsi_mod i2c_core usbcore thermal video output r8169 mii button thermal_sys nls_base [last unloaded: scsi_wait_scan]
[  506.641378] Pid: 3402, comm: cat Not tainted 2.6.35-trunk-amd64 #1
[  506.641380] Call Trace:
[  506.641387]  [<ffffffff81044307>] ? warn_slowpath_common+0x78/0x8c
[  506.641389]  [<ffffffff810b4866>] ? __alloc_pages_nodemask+0x17c/0x70b
[  506.641397]  [<ffffffff8100938e>] ? apic_timer_interrupt+0xe/0x20
[  506.641403]  [<ffffffff813064db>] ? _raw_spin_unlock_irqrestore+0xb/0x11
[  506.641407]  [<ffffffff810d90ba>] ? alloc_pages_current+0x9f/0xc2
[  506.641410]  [<ffffffff810b3bdb>] ? __get_free_pages+0x9/0x46
[  506.641414]  [<ffffffff810e17f2>] ? __kmalloc+0x3f/0x136
[  506.641418]  [<ffffffff811003fe>] ? seq_read+0x1f6/0x360
[  506.641420]  [<ffffffff810e96ba>] ? vfs_read+0xa1/0xfd
[  506.641422]  [<ffffffff810e97c9>] ? sys_read+0x45/0x6b
[  506.641425]  [<ffffffff810089c2>] ? system_call_fastpath+0x16/0x1b
[  506.641427] ---[ end trace ad0d7a981527aba9 ]---

Result from Kernel 2.6.35.7
=============================
[  276.102224] [drm:i915_hangcheck_elapsed] *ERROR* Hangcheck timer elapsed... GPU hung
[  276.104675] [drm:i915_do_wait_request] *ERROR* i915_do_wait_request returns -5 (awaiting 3735 at 3513)
[  279.986230] SysRq : Changing Loglevel
[  279.988670] Loglevel set to 0
[  295.823788] ------------[ cut here ]------------
[  295.823798] WARNING: at mm/page_alloc.c:1981 __alloc_pages_nodemask+0x17c/0x6f3()
[  295.823800] Hardware name: 03017VG
[  295.823802] Modules linked in: vboxnetadp vboxnetflt vboxdrv
[  295.823808] Pid: 2680, comm: cat Not tainted 2.6.35.7+ #1
[  295.823809] Call Trace:
[  295.823816]  [<ffffffff8106285b>] ? warn_slowpath_common+0x78/0x8c
[  295.823819]  [<ffffffff810ce9df>] ? __alloc_pages_nodemask+0x17c/0x6f3
[  295.823825]  [<ffffffff8102a38e>] ? apic_timer_interrupt+0xe/0x20
[  295.823828]  [<ffffffff8148c38b>] ? _raw_spin_unlock_irqrestore+0xb/0x11
[  295.823833]  [<ffffffff810f328a>] ? alloc_pages_current+0x9f/0xc2
[  295.823836]  [<ffffffff810cdeab>] ? __get_free_pages+0x9/0x46
[  295.823839]  [<ffffffff810fb9c6>] ? __kmalloc+0x3f/0x136
[  295.823842]  [<ffffffff81119f66>] ? seq_read+0x1f6/0x360
[  295.823846]  [<ffffffff8110370b>] ? vfs_read+0xa1/0xfd
[  295.823877]  [<ffffffff8110381a>] ? sys_read+0x45/0x6b
[  295.823880]  [<ffffffff810299c2>] ? system_call_fastpath+0x16/0x1b
[  295.823882] ---[ end trace 22ac58d95ef11a99 ]---
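
For readers comparing the two logs, here is a minimal sketch of pulling the numbers out of the i915_do_wait_request lines. It assumes that "awaiting N at M" means the driver was waiting for sequence number N while the GPU had only reached M, and that the return value is a negated errno (-5 being EIO). The function name parse_wait_error and the field names are made up for this illustration and are not part of any kernel or libva interface.

```python
import errno
import re

# Matches lines such as:
# [  501.425867] [drm:i915_do_wait_request] *ERROR* i915_do_wait_request returns -5 (awaiting 6477 at 6080)
WAIT_RE = re.compile(
    r"\[\s*(?P<ts>\d+\.\d+)\]\s+\[drm:i915_do_wait_request\]\s+\*ERROR\*\s+"
    r"i915_do_wait_request returns (?P<ret>-?\d+) "
    r"\(awaiting (?P<awaiting>\d+) at (?P<at>\d+)\)"
)


def parse_wait_error(line):
    """Return the fields of one wait-request error line, or None if it does not match."""
    match = WAIT_RE.search(line)
    if match is None:
        return None
    ret = int(match.group("ret"))
    awaiting = int(match.group("awaiting"))
    at = int(match.group("at"))
    return {
        "timestamp": float(match.group("ts")),
        # Assumption: the kernel reports a negated errno value.
        "errno_name": errno.errorcode.get(-ret, "unknown"),
        "awaiting": awaiting,
        "at": at,
        # How far the GPU was behind the request being waited on.
        "behind": awaiting - at,
    }


if __name__ == "__main__":
    sample = (
        "[  501.425867] [drm:i915_do_wait_request] *ERROR* "
        "i915_do_wait_request returns -5 (awaiting 6477 at 6080)"
    )
    print(parse_wait_error(sample))
```

Under these assumptions, both kernels report the same error (EIO) with the GPU stuck a few hundred requests behind the one being waited on.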
Comment 1 Chris Wilson 2010-12-16 13:42:19 UTC
Not a kernel bug, please inform the libva developers.
Comment 2 Julian Andres Klode 2010-12-19 11:48:28 UTC
(In reply to comment #1)
> Not a kernel bug, please inform the libva developers.

How is it not a kernel bug if a user space application running as a normal user can cause the GPU to hang? There might be a bug in libva causing this, but even once that is fixed, other user space code could still cause GPU hangs by taking the same path.
Comment 3 Chris Wilson 2010-12-19 11:56:15 UTC
Because userspace is performing undefined operations on the GPU. It is exactly as if the application tried to dereference *0, except that exception handling in the GPU is not as robust as in the CPU.
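
The CPU side of that analogy can be illustrated with a small sketch. It starts a child Python process that reads from address 0 through ctypes and reports how the child ended. On Linux the expected outcome is that the kernel kills the child with SIGSEGV while the parent keeps running, though the exact result depends on the platform. The helper names run_null_deref and describe are invented for this example; nothing here touches the GPU or libva.

```python
import signal
import subprocess
import sys

# Child program: read a C string from address 0, i.e. the CPU-side "*0".
CHILD_CODE = "import ctypes; ctypes.string_at(0)"


def run_null_deref():
    """Run the faulting code in a separate process and return its exit status."""
    proc = subprocess.run(
        [sys.executable, "-c", CHILD_CODE],
        stdout=subprocess.DEVNULL,
        stderr=subprocess.DEVNULL,
    )
    return proc.returncode


def describe(returncode):
    """Turn a subprocess return code into a short human-readable description."""
    if returncode < 0:
        # On POSIX, a negative return code means the child died from a signal.
        return "killed by " + signal.Signals(-returncode).name
    return "exited with status %d" % returncode


if __name__ == "__main__":
    # The parent survives the child's fault and can report on it.
    print(describe(run_null_deref()))
```

The faulting process is contained and the rest of the system is unaffected, which is the robustness the comment says the GPU lacks.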