Bug 59761

Summary: Kernel fails to reset AMD HD5770 GPU properly and encounters OOPS. GPU reset fails - system remains in unusable state.
Product: Drivers
Reporter: t3st3r
Component: Video(DRI - non Intel)
Assignee: drivers_video-dri
Status: NEW
Severity: blocking
CC: alexdeucher, szg00000
Priority: P1
Hardware: All
OS: Linux
Kernel Version: 3.10 RC5
Subsystem:
Regression: No
Bisected commit-id:

Description t3st3r 2013-06-15 17:24:02 UTC
Intro:
This is a really tricky bug. The GPU lockup itself is probably provoked by MESA and is out of scope here.
However, recovering from a GPU lockup is the kernel's job, and that is where the kernel fails in this case.

Configuration:
 Xubuntu 13.04 64-bit running the 3.10 RC5 Linux kernel. Similar problems occur with some older kernels as well (the recent GPU reset handling rework does not seem to help much).
 MESA should be a recent 9.1 or 9.2 git build to provoke the GPU lockup condition.
 GPU is an AMD HD5770 with 512 MB GDDR5.
 libtxc-dxtn-s2tc is installed to handle 

To reproduce:
 It is enough to run the Ryzom RPG (www.ryzom.com) with the 128 MB textures setting.
 I'm using the 64-bit version from the Launchpad PPA (https://launchpad.net/~kervala/+archive/ppa)

Basically, it looks like the following:
1) Launch the game and let it run for some time using the best (128 MB) textures on a GPU like mine.
2) You can notice that textures on some objects are garbled/broken and do not display properly. Maybe a data transfer error or similar.
   Note: MESA before 9.1 does not have this bug, so the problem will not occur there.
3) After some run time the GPU encounters a lockup (CP stall). Probably MESA does something wrong in code generation.
4) The kernel then attempts to reset the GPU, but the reset never works properly.
5) All graphics output locks up because the GPU driver has failed to reset the GPU properly.

This condition is quite fatal: the system still responds to Alt-SysRq commands but becomes completely unusable due to the lack of any graphics output.

Expected:
 The GPU is properly reset and the system recovers to a usable state. No kernel errors should happen during this process.

One of the logs with crash data follows:

Jun 15 04:47:12 compname kernel: [17564.696695] radeon 0000:01:00.0: GPU lockup CP stall for more than 10000msec
Jun 15 04:47:12 compname kernel: [17564.696706] radeon 0000:01:00.0: GPU lockup (waiting for 0x00000000004ae9a1 last fence id 0x00000000004ae9a0)
Jun 15 04:47:12 compname kernel: [17564.697787] radeon 0000:01:00.0: Saved 119 dwords of commands on ring 0.
Jun 15 04:47:12 compname kernel: [17564.697812] BUG: unable to handle kernel paging request at ffffc90012a9c418
Jun 15 04:47:12 compname kernel: [17564.697885] IP: [<ffffffffa03a2ace>] radeon_fence_process+0x8e/0x160 [radeon]
Jun 15 04:47:12 compname kernel: [17564.697985] PGD 41f00f067 PUD 41f020067 PMD 417586067 PTE 0
Jun 15 04:47:12 compname kernel: [17564.698045] Oops: 0000 [#1] SMP 
Jun 15 04:47:12 compname kernel: [17564.698080] Modules linked in: parport_pc ppdev bnep rfcomm bluetooth snd_hda_codec_hdmi kvm_amd kvm crc32_pclmul ghash_clmulni_intel mxm_wmi aesni_intel aes_x86_64 lrw gf128mul glue_helper ablk_helper cryptd snd_hda_codec_realtek microcode fam15h_power snd_hda_intel serio_raw snd_ca0106 amd64_edac_mod edac_core snd_ac97_codec edac_mce_amd snd_hda_codec k10temp ac97_bus snd_hwdep snd_pcm snd_seq_midi radeon snd_page_alloc joydev sp5100_tco snd_seq_midi_event i2c_piix4 snd_rawmidi snd_seq snd_seq_device snd_timer ttm drm_kms_helper drm snd i2c_algo_bit soundcore mac_hid wmi xfs it87 hwmon_vid lp parport btrfs xor zlib_deflate hid_generic usbhid hid raid6_pq libcrc32c usb_storage firewire_ohci firewire_core pata_acpi crc_itu_t r8169 ahci pata_atiixp libahci
Jun 15 04:47:12 compname kernel: [17564.698840] CPU: 6 PID: 2925 Comm: ryzom_client Not tainted 3.10.0-031000rc5-generic #201306082135
Jun 15 04:47:12 compname kernel: [17564.698914] Hardware name: Gigabyte Technology Co., Ltd. 
Jun 15 04:47:12 compname kernel: [17564.698991] task: ffff880414908000 ti: ffff880403afc000 task.ti: ffff880403afc000
Jun 15 04:47:12 compname kernel: [17564.699053] RIP: 0010:[<ffffffffa03a2ace>]  [<ffffffffa03a2ace>] radeon_fence_process+0x8e/0x160 [radeon]
Jun 15 04:47:12 compname kernel: [17564.699160] RSP: 0018:ffff880403afdc18  EFLAGS: 00010246
Jun 15 04:47:12 compname kernel: [17564.699205] RAX: ffffc90012a9c418 RBX: 0000000000000002 RCX: ffff880415134dc0
Jun 15 04:47:12 compname kernel: [17564.699264] RDX: 0000000000000041 RSI: 0000000000000000 RDI: ffff880415134000
Jun 15 04:47:12 compname kernel: [17564.699323] RBP: ffff880403afdc78 R08: ffffffff00000000 R09: ffff880415134208
Jun 15 04:47:12 compname kernel: [17564.699382] R10: 0000000000000000 R11: 0000000000000005 R12: 000000000000000c
Jun 15 04:47:12 compname kernel: [17564.699441] R13: ffff880415134e08 R14: 0000000000000002 R15: ffff880415134000
Jun 15 04:47:12 compname kernel: [17564.699501] FS:  00007f6231639780(0000) GS:ffff88042fd80000(0000) knlGS:0000000000000000
Jun 15 04:47:12 compname kernel: [17564.699567] CS:  0010 DS: 0000 ES: 0000 CR0: 000000008005003b
Jun 15 04:47:12 compname kernel: [17564.699615] CR2: ffffc90012a9c418 CR3: 00000003d049d000 CR4: 00000000000407e0
Jun 15 04:47:12 compname kernel: [17564.699674] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
Jun 15 04:47:12 compname kernel: [17564.699734] DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
Jun 15 04:47:12 compname kernel: [17564.699792] Stack:
Jun 15 04:47:12 compname kernel: [17564.699811]  ffff880417a9c848 ffffffffa042b3d0 ffff88041918b8a0 ffff880403afdc98
Jun 15 04:47:12 compname kernel: [17564.699886]  ffff880403afdc68 ffffffff8143357f 0000000000000001 ffff880415134000
Jun 15 04:47:12 compname kernel: [17564.699960]  0000000000000005 ffff880415134000 ffff880415134e38 ffff8804151345f8
Jun 15 04:47:12 compname kernel: [17564.700034] Call Trace:
Jun 15 04:47:12 compname kernel: [17564.700067]  [<ffffffff8143357f>] ? __dev_printk+0x5f/0xa0
Jun 15 04:47:12 compname kernel: [17564.700141]  [<ffffffffa03a38c3>] radeon_fence_count_emitted+0x23/0x70 [radeon]
Jun 15 04:47:12 compname kernel: [17564.700234]  [<ffffffffa03b9fcb>] radeon_ring_backup+0x4b/0x130 [radeon]
Jun 15 04:47:12 compname kernel: [17564.700314]  [<ffffffffa038e560>] radeon_gpu_reset+0x90/0x220 [radeon]
Jun 15 04:47:12 compname kernel: [17564.700402]  [<ffffffffa03b8d36>] radeon_gem_wait_idle_ioctl+0xd6/0x100 [radeon]
Jun 15 04:47:12 compname kernel: [17564.700486]  [<ffffffffa02c658a>] drm_ioctl+0x50a/0x650 [drm]
Jun 15 04:47:12 compname kernel: [17564.700568]  [<ffffffffa03b8c60>] ? radeon_gem_busy_ioctl+0x120/0x120 [radeon]
Jun 15 04:47:12 compname kernel: [17564.700632]  [<ffffffff81082401>] ? update_curr+0x141/0x1f0
Jun 15 04:47:12 compname kernel: [17564.700684]  [<ffffffff810810dd>] ? set_next_entity+0xad/0xd0
Jun 15 04:47:12 compname kernel: [17564.700738]  [<ffffffff811987c7>] do_vfs_ioctl+0x87/0x330
Jun 15 04:47:12 compname kernel: [17564.700787]  [<ffffffff816cab14>] ? __schedule+0x3d4/0x6b0
Jun 15 04:47:12 compname kernel: [17564.700837]  [<ffffffff81198b01>] SyS_ioctl+0x91/0xb0
Jun 15 04:47:12 compname kernel: [17564.700885]  [<ffffffff816d5506>] system_call_fastpath+0x1a/0x1f
Jun 15 04:47:12 compname kernel: [17564.700936] Code: 49 87 55 00 48 39 d0 73 50 48 89 c3 41 ba 01 00 00 00 41 80 bf a0 16 00 00 00 4d 8b b1 f8 0b 00 00 0f 84 8a 00 00 00 48 8b 41 10 <8b> 00 48 89 da 89 c0 4c 21 c2 48 09 d0 48 39 c3 76 0c 4c 89 f2 
Jun 15 04:47:12 compname kernel: [17564.701288] RIP  [<ffffffffa03a2ace>] radeon_fence_process+0x8e/0x160 [radeon]
Jun 15 04:47:12 compname kernel: [17564.701375]  RSP <ffff880403afdc18>
Jun 15 04:47:12 compname kernel: [17564.701406] CR2: ffffc90012a9c418
Jun 15 04:47:12 compname kernel: [17564.735862] ---[ end trace 5017208705d52fa8 ]---
Jun 15 04:47:17 compname kernel: [17570.145204] SysRq : Emergency Sync
Jun 15 04:47:17 compname kernel: [17570.153481] Emergency Sync complete
Comment 1 Alex Deucher 2013-06-17 12:56:22 UTC
Can you bisect mesa and find out what commit caused the breakage?
Comment 2 t3st3r 2013-06-19 14:36:31 UTC
That is quite hard and time-consuming for me at this point. But if no other options are left, I can probably try, since this bug is quite annoying.

Right now I know that MESA 9.0.x has been working perfectly with that HD5770, but MESA 9.1 and up (9.2 git, etc.) are broken. The breakage also seems to produce visible artifacts on textured objects in the mentioned "Ryzom" game, for example on "metallic" objects.

However, as for the kernel itself, the problem is that the kernel detects the lockup but then fails completely when trying to reset the GPU.

At the very least, there should be no "BUG: unable to handle kernel paging request". Beyond that, the kernel never manages to recover from this condition. Neither the old 3.8 kernel nor the recently changed GPU reset code in 3.9/3.10 works.
Comment 3 t3st3r 2013-08-02 18:50:20 UTC
Btw, it looks like the GPU lockup bug that provokes this problem is gone in MESA 9.2 (congrats to the MESA people on killing it!). I can still re-test the kernel's handling of GPU reset by using an older MESA, though :).