Bug 65761 - HD 7970M Hybrid - hangs and errors and rmmod causes crash
Summary: HD 7970M Hybrid - hangs and errors and rmmod causes crash
Status: NEW
Alias: None
Product: Drivers
Classification: Unclassified
Component: Video(DRI - non Intel) (show other bugs)
Hardware: x86-64 Linux
: P1 normal
Assignee: drivers_video-dri
URL:
Keywords:
: 69291 (view as bug list)
Depends on:
Blocks:
 
Reported: 2013-11-25 16:49 UTC by Christoph Haag
Modified: 2015-01-04 01:09 UTC (History)
9 users (show)

See Also:
Kernel Version: 3.13-rc1
Subsystem:
Regression: No
Bisected commit-id:


Attachments
dmesg with multiple issues (173.08 KB, text/plain)
2013-11-25 16:49 UTC, Christoph Haag
Details
dmesg from 3.13-rc2 with rmmod radeon in line 2095 (206.74 KB, text/plain)
2013-11-30 12:12 UTC, Christoph Haag
Details
kernel panic on reboot with runpm=1 (326.65 KB, image/jpeg)
2013-12-06 22:21 UTC, Christoph Haag
Details
dmesg with drm-fixes-3.12-radeon-poweroff branch (73.75 KB, application/octet-stream)
2013-12-07 01:06 UTC, Christoph Haag
Details
Same as "117761: dmesg with drm-fixes-3.12-radeon-poweroff branch", but with rmmod radeon and modprobe radeon (96.07 KB, text/plain)
2013-12-07 01:17 UTC, Christoph Haag
Details
rc5 with starting X (~line 1811) and trying to render something (149.38 KB, text/plain)
2013-12-24 15:45 UTC, Christoph Haag
Details
dmesg with acpiphp.disable=1 (66.58 KB, text/plain)
2013-12-24 21:40 UTC, Christoph Haag
Details
erratic dpm with window manager actions (104.18 KB, image/png)
2014-01-21 00:15 UTC, Christoph Haag
Details
dmesg (75.23 KB, text/plain)
2014-01-22 01:53 UTC, Hohahiu
Details
Xorg.0.log (30.69 KB, text/plain)
2014-01-22 01:53 UTC, Hohahiu
Details
quitting steam after some time (66.72 KB, image/png)
2014-01-31 00:10 UTC, Christoph Haag
Details
just some printk to easily see what's going on (2.41 KB, patch)
2014-01-31 20:00 UTC, Christoph Haag
Details | Diff
return IRQ_NONE if we don't have any interrupts (1.71 KB, patch)
2014-01-31 23:00 UTC, Alex Deucher
Details | Diff
sysprof with 2 xterm windows and kwin (666.64 KB, application/octet-stream)
2014-02-02 23:58 UTC, Christoph Haag
Details
callgrind output from X (2.40 MB, application/octet-stream)
2014-02-06 13:48 UTC, Christoph Haag
Details
possibly call chain that calls radeon stuff (212.70 KB, image/png)
2014-02-07 14:52 UTC, Christoph Haag
Details
radeon_cs_emit did not trigger, but ProcGetInputFocus causes some activity (4.79 KB, text/plain)
2014-02-12 09:41 UTC, Christoph Haag
Details
radeonsi: Short-circuit flushes with no preceding draw calls (1.46 KB, patch)
2014-02-13 02:53 UTC, Michel Dänzer
Details | Diff
r600g,radeonsi: Consolidate logic for short-circuiting flushes (4.08 KB, patch)
2014-02-14 02:59 UTC, Michel Dänzer
Details | Diff
report 0 for temp when dGPU is powered off (1.15 KB, patch)
2014-02-17 17:04 UTC, Alex Deucher
Details | Diff
dmesg (89.73 KB, text/plain)
2014-02-19 00:46 UTC, Hohahiu
Details
kenrel log (19.28 KB, application/octet-stream)
2014-03-03 00:20 UTC, saadnaji89
Details

Description Christoph Haag 2013-11-25 16:49:08 UTC
Created attachment 115961 [details]
dmesg with multiple issues

00:02.0 VGA compatible controller: Intel Corporation 3rd Gen Core processor Graphics Controller (rev 09)
01:00.0 VGA compatible controller: Advanced Micro Devices, Inc. [AMD/ATI] Wimbledon XT [Radeon HD 7970M]
muxless, so there's no display outputs on the radeon gpu.

With 3.13 and dpm and runtime power management enabled by default this should be getting usable.
I did not start X this time.
(when starting X on the intel gpu quickly after booting (presumably before radeon is initialized) the display output hangs, which is yet another problem).

Dmesg seems to contain several issues.

1. While booting it already produces these before even "dpm initialized" is logged:

[   20.120137] WARNING: CPU: 0 PID: 53 at drivers/gpu/drm/drm_mm.c:578 drm_mm_takedown+0x2e/0x30 [drm]()
[   20.120138] Memory manager not clean during takedown.

2. These are just periodically displayed in the tty:

[   79.179928] [drm:r600_ib_test] *ERROR* radeon: fence wait failed (-35).
[   79.181378] [drm:radeon_ib_ring_tests] *ERROR* radeon: failed testing IB on GFX ring (-35).
[   79.182790] [drm:radeon_resume_kms] *ERROR* ib ring test failed (-35).

3. After some time in a tty, I did "rmmod radeon" and got some other issues ending with

[  192.663059] Fixing recursive fault but reboot is needed!

and then, after a little while a lot of other errors happen, but they are not captured here in the dmesg.
Comment 1 Mike Lothian 2013-11-25 17:15:24 UTC
I believe this is related to https://bugs.freedesktop.org/show_bug.cgi?id=71930

I think the memory Warnings have been fixed by either:

https://git.kernel.org/cgit/linux/kernel/git/torvalds/linux.git/commit/drivers/gpu/drm?id=c58f009e01c918717379c206a63baa66f56a77f9

or

https://git.kernel.org/cgit/linux/kernel/git/torvalds/linux.git/commit/drivers/gpu/drm?id=0bc254257bfd9b25f64a68b719ee70a303b6d051

As a workaround you can boot with radeon.runpm=0 but that'll leave your discreet card powered up
Comment 2 Alex Deucher 2013-11-26 20:03:07 UTC
From the bisect in https://bugs.freedesktop.org/show_bug.cgi?id=71930 it looks like commit bbd34fcdd1b201e996235731a7c98fd5197d9e51 caused a regression in vgaswitcheroo and hence runpm.  Adding Rafael.  Also bug 65811 looks like a duplicate of this one.
Comment 3 Christoph Haag 2013-11-30 12:12:47 UTC
Created attachment 116871 [details]
dmesg from 3.13-rc2 with rmmod radeon in line 2095

Having enough time to test everything would be nice.

But in the meantime I have only tested 3.13-rc2. With defaults the issues are still there (I have not yet checked whether mentioned commits made it into rc2).

Now I have some more messsages in dmesg after rmmod'ing radeon. This time it didn't hang after all that, but rebooting didn't work for some reason. Maybe it's just me but I'd like to have proper unloading to have the option to switch between radeon and fglrx without rebooting. :)

Anyway with runpm=0 I don't get any errors so dpm does really work fine. On an unrelated note: WOW, has radeonsi improved in performance! Much better than fglrx with "official" hybrid support in some applications at least.

Unfortunately using runpm=0 and using vgaswitcheroo manually is not a viable workaround because of https://bugzilla.kernel.org/show_bug.cgi?id=51381

But with runpm=0 there still is a similar (?) problem when rmmod'ing radeon:

[ 1897.087151] [drm] radeon: finishing device.
[ 1897.518615] ------------[ cut here ]------------
[ 1897.518634] WARNING: CPU: 4 PID: 2539 at drivers/gpu/drm/drm_mm.c:578 drm_mm_takedown+0x2e/0x30 [drm]()
[ 1897.518635] Memory manager not clean during takedown.
[ 1897.518636] Modules linked in: bnep bluetooth iTCO_wdt iTCO_vendor_support arc4 iwldvm snd_hda_codec_hdmi mac80211 snd_hda_codec_realtek joydev x86_pkg_temp_thermal intel_powerclamp coretemp kvm_intel kvm crct10dif_pclmul crct10dif_common crc32_pclmul ghash_clmulni_intel aesni_intel aes_x86_64 lrw gf128mul glue_helper ablk_helper cryptd snd_hda_intel microcode snd_hda_codec iwlwifi snd_hwdep psmouse cfg80211 snd_pcm serio_raw pcspkr snd_page_alloc radeon(-) rtsx_pci_ms r8169 snd_timer memstick rfkill lpc_ich mii snd i2c_i801 soundcore ttm wmi thermal mei_me mei shpchp processor battery ac evdev nfs lockd sunrpc fscache fuse ext4 crc16 mbcache jbd2 sr_mod cdrom sd_mod hid_generic usbhid hid rtsx_pci_sdmmc mmc_core ahci i915 libahci crc32c_intel i2c_algo_bit libata intel_agp firewire_ohci intel_gtt
[ 1897.518659]  firewire_core crc_itu_t drm_kms_helper ehci_pci ehci_hcd xhci_hcd scsi_mod rtsx_pci drm usbcore i2c_core usb_common video button
[ 1897.518665] CPU: 4 PID: 2539 Comm: rmmod Not tainted 3.13.0-1-mainline #1
[ 1897.518666] Hardware name: CLEVO                             P170EM/P170EM, BIOS 4.6.5 08/22/2012
[ 1897.518667]  0000000000000009 ffff8807d4b47c20 ffffffff814f5570 ffff8807d4b47c68
[ 1897.518669]  ffff8807d4b47c58 ffffffff81061bad ffff8807fef616c0 ffff8807fef61768
[ 1897.518671]  ffff8808029a47b0 0000000000000000 0000000001ce5090 ffff8807d4b47cb8
[ 1897.518673] Call Trace:
[ 1897.518677]  [<ffffffff814f5570>] dump_stack+0x4d/0x6f
[ 1897.518681]  [<ffffffff81061bad>] warn_slowpath_common+0x7d/0xa0
[ 1897.518683]  [<ffffffff81061c1c>] warn_slowpath_fmt+0x4c/0x50
[ 1897.518688]  [<ffffffffa05644b8>] ? ttm_bo_man_takedown+0x48/0x70 [ttm]
[ 1897.518693]  [<ffffffffa00c2f6e>] drm_mm_takedown+0x2e/0x30 [drm]
[ 1897.518696]  [<ffffffffa05644a8>] ttm_bo_man_takedown+0x38/0x70 [ttm]
[ 1897.518699]  [<ffffffffa055ebf9>] ttm_bo_clean_mm+0x49/0x80 [ttm]
[ 1897.518709]  [<ffffffffa05eebcd>] radeon_ttm_fini+0xbd/0x190 [radeon]
[ 1897.518716]  [<ffffffffa05ef662>] radeon_bo_fini+0x12/0x20 [radeon]
[ 1897.518727]  [<ffffffffa06570c1>] si_fini+0xc1/0x100 [radeon]
[ 1897.518733]  [<ffffffffa05d404e>] radeon_device_fini+0x3e/0x120 [radeon]
[ 1897.518739]  [<ffffffffa05d5f7e>] radeon_driver_unload_kms+0x4e/0x70 [radeon]
[ 1897.518744]  [<ffffffffa00bed4c>] drm_dev_unregister+0x2c/0xe0 [drm]
[ 1897.518749]  [<ffffffffa00bee8b>] drm_put_dev+0x3b/0x70 [drm]
[ 1897.518754]  [<ffffffffa05d229d>] radeon_pci_remove+0x1d/0x20 [radeon]
[ 1897.518756]  [<ffffffff812c66cb>] pci_device_remove+0x3b/0xb0
[ 1897.518759]  [<ffffffff8138caaf>] __device_release_driver+0x7f/0xf0
[ 1897.518761]  [<ffffffff8138d438>] driver_detach+0xb8/0xc0
[ 1897.518763]  [<ffffffff8138c6e5>] bus_remove_driver+0x55/0xd0
[ 1897.518765]  [<ffffffff8138daac>] driver_unregister+0x2c/0x50
[ 1897.518767]  [<ffffffff812c60a9>] pci_unregister_driver+0x29/0x90
[ 1897.518772]  [<ffffffffa00c10e8>] drm_pci_exit+0x98/0xa0 [drm]
[ 1897.518778]  [<ffffffffa06a35af>] radeon_exit+0x17/0x1e [radeon]
[ 1897.518780]  [<ffffffff810d6682>] SyS_delete_module+0x172/0x240
[ 1897.518783]  [<ffffffff81014e5c>] ? do_notify_resume+0x8c/0xa0
[ 1897.518785]  [<ffffffff815039ed>] system_call_fastpath+0x1a/0x1f
[ 1897.518786] ---[ end trace ba1fe37dd4719714 ]---
[ 1897.518790] [TTM] Finalizing pool allocator
[ 1897.518793] [TTM] Finalizing DMA pool allocator
[ 1897.518797] ------------[ cut here ]------------
[ 1897.518799] WARNING: CPU: 4 PID: 2539 at drivers/gpu/drm/ttm/ttm_page_alloc_dma.c:534 ttm_dma_free_pool+0x12b/0x130 [ttm]()
[ 1897.518800] Modules linked in: bnep bluetooth iTCO_wdt iTCO_vendor_support arc4 iwldvm snd_hda_codec_hdmi mac80211 snd_hda_codec_realtek joydev x86_pkg_temp_thermal intel_powerclamp coretemp kvm_intel kvm crct10dif_pclmul crct10dif_common crc32_pclmul ghash_clmulni_intel aesni_intel aes_x86_64 lrw gf128mul glue_helper ablk_helper cryptd snd_hda_intel microcode snd_hda_codec iwlwifi snd_hwdep psmouse cfg80211 snd_pcm serio_raw pcspkr snd_page_alloc radeon(-) rtsx_pci_ms r8169 snd_timer memstick rfkill lpc_ich mii snd i2c_i801 soundcore ttm wmi thermal mei_me mei shpchp processor battery ac evdev nfs lockd sunrpc fscache fuse ext4 crc16 mbcache jbd2 sr_mod cdrom sd_mod hid_generic usbhid hid rtsx_pci_sdmmc mmc_core ahci i915 libahci crc32c_intel i2c_algo_bit libata intel_agp firewire_ohci intel_gtt
[ 1897.518829]  firewire_core crc_itu_t drm_kms_helper ehci_pci ehci_hcd xhci_hcd scsi_mod rtsx_pci drm usbcore i2c_core usb_common video button
[ 1897.518843] CPU: 4 PID: 2539 Comm: rmmod Tainted: G        W    3.13.0-1-mainline #1
[ 1897.518844] Hardware name: CLEVO                             P170EM/P170EM, BIOS 4.6.5 08/22/2012
[ 1897.518844]  0000000000000009 ffff8807d4b47c18 ffffffff814f5570 0000000000000000
[ 1897.518846]  ffff8807d4b47c50 ffffffff81061bad ffff880801a31e40 ffff8808032a2290
[ 1897.518848]  0000000000000008 0000000000000000 0000000001ce5090 ffff8807d4b47c60
[ 1897.518850] Call Trace:
[ 1897.518852]  [<ffffffff814f5570>] dump_stack+0x4d/0x6f
[ 1897.518854]  [<ffffffff81061bad>] warn_slowpath_common+0x7d/0xa0
[ 1897.518856]  [<ffffffff81061c8a>] warn_slowpath_null+0x1a/0x20
[ 1897.518861]  [<ffffffffa0564c9b>] ttm_dma_free_pool+0x12b/0x130 [ttm]
[ 1897.518863]  [<ffffffffa0565dee>] ttm_dma_page_alloc_fini+0x8e/0x104 [ttm]
[ 1897.518867]  [<ffffffffa055b549>] ttm_mem_global_release+0x19/0x90 [ttm]
[ 1897.518875]  [<ffffffffa05ed892>] radeon_ttm_mem_global_release+0x12/0x20 [radeon]
[ 1897.518879]  [<ffffffffa00d06c3>] drm_global_item_unref+0x53/0x80 [drm]
[ 1897.518887]  [<ffffffffa05eebfe>] radeon_ttm_fini+0xee/0x190 [radeon]
[ 1897.518894]  [<ffffffffa05ef662>] radeon_bo_fini+0x12/0x20 [radeon]
[ 1897.518904]  [<ffffffffa06570c1>] si_fini+0xc1/0x100 [radeon]
[ 1897.518910]  [<ffffffffa05d404e>] radeon_device_fini+0x3e/0x120 [radeon]
[ 1897.518917]  [<ffffffffa05d5f7e>] radeon_driver_unload_kms+0x4e/0x70 [radeon]
[ 1897.518922]  [<ffffffffa00bed4c>] drm_dev_unregister+0x2c/0xe0 [drm]
[ 1897.518926]  [<ffffffffa00bee8b>] drm_put_dev+0x3b/0x70 [drm]
[ 1897.518932]  [<ffffffffa05d229d>] radeon_pci_remove+0x1d/0x20 [radeon]
[ 1897.518934]  [<ffffffff812c66cb>] pci_device_remove+0x3b/0xb0
[ 1897.518936]  [<ffffffff8138caaf>] __device_release_driver+0x7f/0xf0
[ 1897.518938]  [<ffffffff8138d438>] driver_detach+0xb8/0xc0
[ 1897.518940]  [<ffffffff8138c6e5>] bus_remove_driver+0x55/0xd0
[ 1897.518942]  [<ffffffff8138daac>] driver_unregister+0x2c/0x50
[ 1897.518945]  [<ffffffff812c60a9>] pci_unregister_driver+0x29/0x90
[ 1897.518951]  [<ffffffffa00c10e8>] drm_pci_exit+0x98/0xa0 [drm]
[ 1897.518956]  [<ffffffffa06a35af>] radeon_exit+0x17/0x1e [radeon]
[ 1897.518958]  [<ffffffff810d6682>] SyS_delete_module+0x172/0x240
[ 1897.518960]  [<ffffffff81014e5c>] ? do_notify_resume+0x8c/0xa0
[ 1897.518962]  [<ffffffff815039ed>] system_call_fastpath+0x1a/0x1f
[ 1897.518963] ---[ end trace ba1fe37dd4719715 ]---
[ 1897.519005] [TTM] Zone  kernel: Used memory at exit: 13 kiB
[ 1897.519007] [TTM] Zone   dma32: Used memory at exit: 9 kiB
[ 1897.519009] [drm] radeon: ttm finalized
[ 1897.519012] vga_switcheroo: disabled
[ 1897.519212] [drm] Module unloaded

Fortunately all the other problems are not happening.



Another problem that also persists is starting X early (kdm with systemd). The display just hangs with the _ cursor in the upper left, no error messages anywhere to be found. Killing X leaves the display hanging there and only starting another X (presumably reinitializing the display output) releases the hanging and goes back to normal. This only happens with radeon enabled with and without runpm but it might rather be a problem in X or maybe even the intel driver. Anyway, not starting X directly at boot but waiting a few seconds before starting it works.
Comment 4 Christoph Haag 2013-12-06 22:21:24 UTC
Created attachment 117751 [details]
kernel panic on reboot with runpm=1

3.13-rc3, same problems.

Also, here is a crappy picture of a kernel panic that happens when I try to reboot when runpm is enabled.

When I have more time next week I'll figure out how to setup netconsole properly and get a full log.
Comment 5 Alex Deucher 2013-12-06 22:34:17 UTC
Does runpm work on my old runpm branch prior to merging with 3.13?

http://cgit.freedesktop.org/~agd5f/linux/log/?h=drm-fixes-3.12-radeon-poweroff
Comment 6 Christoph Haag 2013-12-07 01:06:23 UTC
Created attachment 117761 [details]
dmesg with drm-fixes-3.12-radeon-poweroff branch

Hm, I think it kind of works with drm-fixes-3.12-radeon-poweroff.

There seems to be no error in dmesg and X works fine.


I have monitored /sys/kernel/debug/vgaswitcheroo/switch a while.

Unfortunately most of the time it says
0:IGD:+:Pwr:0000:00:02.0
1:DIS: :DynPwr:0000:01:00.0

I have seen it say
0:IGD:+:Pwr:0000:00:02.0
1:DIS: :DynOff:0000:01:00.0
But I don't know really know how I got it to switch and whether it was something I did at all.

Then, unprovoked, it switched to on ("DynPwr") again.
(There's nothing running on the radeon GPU, only X with xcompmgr on intel and radeon is configured as provideroffloadsink)


I don't know how to coax it to power the gpu off. But at least it CAN work already.




Maybe this is useful, maybe not:
With the radeon in idle I have this:
$ cat /sys/kernel/debug/dri/1/radeon_pm_info
uvd    vclk: 0 dclk: 0
power level 0    sclk: 30000 mclk: 15000 vddc: 825 vddci: 850 pcie gen: 3

with glxgears I have this:
$ cat /sys/kernel/debug/dri/1/radeon_pm_info
uvd    vclk: 0 dclk: 0
power level 2    sclk: 85000 mclk: 120000 vddc: 1050 vddci: 975 pcie gen: 3
Comment 7 Christoph Haag 2013-12-07 01:17:24 UTC
Created attachment 117771 [details]
Same as "117761: dmesg with drm-fixes-3.12-radeon-poweroff branch", but with rmmod radeon and modprobe radeon

Okay, I exited X and from a tty did rmmod radeon. It produced the same error in dmesg than it does on 3.13, see attached file...

Then with modprobe radeon I got the GPU lockup CP stall for more than 10000msec, but interestingly only once and then it worked.

The radeon gpu was at DynOff then.

As soon as I started X, the gpu was powered on to DynPwr again. Not even configured as provideroffloadsink or so, just the X startup and presumably hardware detection powered it up and it does not go back to DynOff.

But again, knowing that runpm itself works without errors is encouraging.
Comment 8 Christoph Haag 2013-12-24 15:45:59 UTC
Created attachment 119501 [details]
rc5 with starting X (~line 1811) and trying to render something

rc5 with these patches now: http://lists.freedesktop.org/archives/dri-devel/2013-December/050902.html

I'm not sure what I can do more without getting into trying to debug kernel code of which I know nothing...

Not really a change with rc5, but I think when starting X the lockups go away (not really sure). When trying to render something with DRI_PRIME=1 however, it locks up a bit harder and can't recover apparently because when trying to start another opengl program, it'll just segfault.

rmmod radeon still creates problems that lock up the cpus and prevent proper rebooting etc.




With runpm=0 it's not really good either at the moment. For example the Distance alpha causes a lockup too and after that lockup glxgears says

$ DRI_PRIME=1 glxgears -info
radeon: Failed to allocate virtual address for buffer:
radeon:    size      : 4352 bytes
radeon:    alignment : 4096 bytes
radeon:    domains   : 4
radeon:    va        : 0x0000000000800000
radeon: Failed to allocate virtual address for buffer:
radeon:    size      : 4352 bytes
radeon:    alignment : 4096 bytes
radeon:    domains   : 4
radeon:    va        : 0x0000000000800000
[1]    2915 segmentation fault (core dumped)

I'm cramming way too much in here, but I just want to say that the HD 7970M with the 8970M are the top mobile consumer GPUs AMD currently has and it would be really cool if the developers could buy that hardware (hey, it's christmas!) and fix this mess directly. From what I have seen so far it is really close to working amazingly well. If there's something specific I could test, I'll gladly try.
Thanks
Comment 9 Alex Deucher 2013-12-24 19:43:53 UTC
This looks like a pci hotplug problem, not a radeon issue.  See:
https://bugs.freedesktop.org/show_bug.cgi?id=70687

"I seem to have discovered the root of the issue.

I've just built 3.13-rc5 kernel which has the dynamic powering of the discrete gpu and all hell broke loose.

I've narrowed the error down to the pci hotplug driver. My machine loads shpchp pci hotplug driver from what I can see in lsmod output. But the trick is, that there is another pci hotplug driver, acpi pci hotplug one, which seems to break all hell loose here. Disabling it seems to fix everything for me, at least on kernel 3.13.

# CONFIG_HOTPLUG_PCI_ACPI is not set

This kernel config option is the culprit for this, and that also can be seen from my backtrace:

[   22.731998]  [<ffffffff81343cb1>] ? acpiphp_check_bridge+0x72/0x88

So the trick behind this is that acpi pci hotplug driver conflicts with shpchp one that my machine uses. And since it is a builtin driver, and can't be built as module it is always loaded. The other possibility is that this machine doesn't support acpi hotplug, but does support shpc pci hotplug. We need a kernel workarround so that acpi pci hotplug is disabled and out of the way when shpc pci hotplug is enabled."
Comment 10 Alex Deucher 2013-12-24 19:45:56 UTC
See also:
https://bugzilla.kernel.org/show_bug.cgi?id=61891
Comment 11 Christoph Haag 2013-12-24 21:40:07 UTC
Created attachment 119551 [details]
dmesg with acpiphp.disable=1

A quick search also showed this: https://bugzilla.kernel.org/show_bug.cgi?id=67461

Good news: With acpiphp.disable=1 there are no errors with booting and using X. In a tty the gpu is in DynOff state.



Not so good news: When starting X it switches to DynPwr and after several minutes I don't think it ever powered down again. But when killing X it goes back to DynOff.

Still bad news: rmmod radeon.
Comment 12 Christoph Haag 2014-01-06 11:57:55 UTC
Okay, with 3.13 rc7 I can confirm that the acpiphp problem is fixed. Now with everything in the default configuration you can boot and use X without any error.

rmmod radeon still causes nasty crashes.
Comment 13 Jack 2014-01-06 21:52:22 UTC
Additionally vgaswitcheroo/switch remains at DynPwr whenever X is running, even when nothing is using DRI_PRIME=1. This might be an issue specifically with this card (I have a 7970M as well), as I've heard reports of users with earlier generation cards (6xxxM) have seen vgaswitcheroo report DynOff when DRI_PRIME=1 is not set by any running application.
Comment 14 Jack 2014-01-19 23:22:45 UTC
Still having issues in rc8, unfortunately.
Comment 15 Christoph Haag 2014-01-21 00:15:48 UTC
Created attachment 122811 [details]
erratic dpm with window manager actions

Okay, so we got 3.13 without crashes with default settings, so far so good, thanks so much. :)



The dynpm power management is wonky. It doesn't really fit in here, but I already dumped a lot of stuff in here anyway and this report has been linked to in relation to me mentioning that the power management isn't working well yet, so here is another (not so) small point.

Here is a screenshot from the tool radeon-profile. I believe over the whole period shown there I have not used the radeon GPU to render anything, but it is configured as offload slave. I also hope that this tool did not cause this behaviour, but from the fan noise I think it's happening without it too.

So there are some random "short" spikes that die off after a while of doing nothing in X. However there are these longer spikes (about 5-7 seconds) when doing some window management actions. An example of that is minimizing/unminizing a window or simply changing the focus from one window to another window via the mouse. I have tested this with kwin with compositing and openbox without compositing and I think it's pretty clear that this actually causes it. Changing tabs in chromium also seems to cause it.

The very quick spikes happened when I additionally ran glxgears with kwin (still everything on the Intel GPU). They died down but appeared sometimes again after doing something with the windows. With openbox on the other hand I don't see such quick spikes with glxgears.

Another funny effect in openbox is that a single click on the "gears" in glxgears doesn't have an effect, but a double click produces one of the longer spikes. If I single click in chromium here in the bugtracker on the bakground or in the textarea it produces a spike too. But simply typing in the textarea does not.

I don't really know what to make of those observations, but I would guess that the radeon gpu doesn't enter the dynoff state while X is running because of something related to this. A short google search didn't show anything new about radeonsi + dynoff, especially nobody who claims it's working, so maybe glamor is worth looking at?

The intel gpu uses SNA and radeon obviously glamor. If the Xorg.0.log or so would be helpful, just ask.

Maybe someone with a pre radeonsi gpu could test whether they have the same issue when using glamor over exa?
Comment 16 Alex Deucher 2014-01-21 00:23:28 UTC
I'm not familiar with radeon-profile, but if it does some sort of manual polling of the card, it might be waking up the card every time it polls or it might just be reading garbage because the card is powered down.  Does it help if you don't run radeon-profile?
Comment 17 Christoph Haag 2014-01-21 00:55:14 UTC
It reads /sys/kernel/debug/dri/0/radeon_pm_info once a second and lm_sensors I think.

Powered down in terms of vgaswitcheroo/dynoff, I don't think so. When X is running I don't read dynoff for the radeon gpu from /sys/kernel/debug/vgaswitcheroo/switch but when X is not running, it powers down.

I checked with radeon-profile because it's actually the fan that is sometimes annoying to me.
The fan only powers off for seconds before starting to run again, mostly only up to "level 1" for a while.
But sometimes it runs at high speed for seconds or even minutes for seemingly no reason and then powers down again.

When monitoring with
sudo sh -c 'while true; do cat /sys/kernel/debug/dri/0/radeon_pm_info; sleep 0.5; done'
I see the same thing. When clicking on an unfocussed window and focussing it with that, it goes from 

power level 0    sclk: 30000 mclk: 15000 vddc: 825 vddci: 850 pcie gen: 3
uvd    vclk: 0 dclk: 0

to

power level 2    sclk: 85000 mclk: 120000 vddc: 1050 vddci: 975 pcie gen: 3
uvd    vclk: 0 dclk: 0

stays there for ~6 seconds and then goes back to power level 0.


If it was garbage it would be pretty consistent garbage with the gpu temperature rising pretty much exactly on every of those spike...
If constant polling from /sys/kernel/debug/dri/0/radeon_pm_info or so is causing it, then the long periods of no spikes when actually doing nothing in X confuse me.

Currently it's just this situation that the fan keeps going on and off all the time and I am pretty sure it is on most of the time when I'm doing something, like clicking things and stuff and when I'm exiting X and see dynoff in /sys/kernel/debug/vgaswitcheroo/switch then the fan actually stays off (I think. Going to a tty while X is running in the background does not result in dynoff)

If there is a better way to find out why the fan doesn't shut off, I'd gladly investigate.

The code is here, by the way: https://github.com/marazmista/radeon-profile
Comment 18 Jack 2014-01-21 21:58:39 UTC
>If there is a better way to find out why the fan doesn't shut off

I'm pretty sure the reason the fans don't shut off is because they're reacting to the dGPU usage. In my incredibly un-scientific observation, I've noticed that the laptop gets warmer to the touch in Linux than it does in Windows. I'll be doing some actual testing on this later tonight, but it seems to me that the actual issue isn't the fans -- it's the fact that the dGPU is revving up and down on a pretty constant basis, and that's driving the fans to spin up and down with the dGPU.

One thing I noticed is that if I don't have a compositor running it only goes up to power level 1, instead of two, when doing things in the OS, ie, browsing in Chrome, etc. I still don't even think it should be anything but 0 unless using DRI_PRIME, but might the compositor be interacting with the dGPU through PRIME, accidentally? I say this without having any real technical knowledge of how PRIME works.
Comment 19 Hohahiu 2014-01-22 01:53:00 UTC
I have similar trouble with vgaswitcheroo. My GPUs are Intel hd4000 + Amd mobility radeon 7750M:
/sbin/lspci | grep VGA
00:02.0 VGA compatible controller: Intel Corporation 3rd Gen Core processor Graphics Controller (rev 09)
01:00.0 VGA compatible controller: Advanced Micro Devices, Inc. [AMD/ATI] Chelsea LP [Radeon HD 7730M] (rev ff)

If I don't set anything during the boot my dGPU is powered down successfully:
cat /sys/kernel/debug/vgaswitcheroo/switch
0:IGD:+:Pwr:0000:00:02.0
1:DIS: :DynOff:0000:01:00.0

But in the meantime:
xrandr --listproviders
Providers: number : 1
Provider 0: id: 0x4a cap: 0xb, Source Output, Sink Output, Sink Offload crtcs: 4 outputs: 7 associated providers: 0 name:Intel

So I cannot access my discrete GPU.

If after the boot I use
xrandr --setprovideroffloadsink radeon Intel
then dGPU never switches off.

Truly speaking runpm never worked for me even with Alex's early patches.
Comment 20 Hohahiu 2014-01-22 01:53:37 UTC
Created attachment 122961 [details]
dmesg
Comment 21 Hohahiu 2014-01-22 01:53:56 UTC
Created attachment 122971 [details]
Xorg.0.log
Comment 22 Hohahiu 2014-01-24 02:20:46 UTC
What does the order of GPUs in /sys/kernel/debug/vgaswitcheroo/switch mean?
Sometimes it is like this (dGPU is powered off):
0:IGD:+:Pwr:0000:00:02.0
1:DIS: :DynOff:0000:01:00.0
In other case it is like that (and dGPU is obviously on):
0:DIS: :PwrDyn:0000:01:00.0
1:IGD:+:Pwr:0000:00:02.0

Can it affect runpm?
Comment 23 Christoph Haag 2014-01-31 00:10:48 UTC
Created attachment 123891 [details]
quitting steam after some time

So maybe radeon-profile shows garbage, but in the steam client I have found one of the worse offenders. It doesn't happen always, but often, especially after quitting a game I think, but maybe it's just after some time.

When "it" happens, the fan runs one "level" higher than usual and after I noticed it I started radeon-profile and it showed what you see in the attached screenshot.

As soon as I quit steam, the temperature (read by lm_sensors I think) drops quickly, the voltages and frequencies (read from debugfs I think) are (mostly) stopping to being at the top all the time and only exhibit the spikes from my last comment, but most importantly the fan quickly drops to the usual behavior of switching between the lowest level and off instead of running on the "second lowest level" all the time like when steam was running.

I have had a quick look at the xf86-video-ati but I'm not sure what I'm looking for: Is there one single function or so I can set a break point on that would definitely tell me whether the radeon gpu is ordered to wake up/render something? I'd be happy to try it on the kernel side too, but maybe there's a chance to get a backtrace on the X side to see where it is coming from - all given that it actually happens like I naively imagine. But I don't see anything else: According to LIBGL_DEBUG=verbose steam doesn't open the radeonsi driver, but yet sometimes manages to keep the fan of the radeon gpu running on higher rpm. I would be happy to also test patches against whatever that could add debug logging to point out what the radeon gpu is doing and why.
Comment 24 Alex Deucher 2014-01-31 14:14:48 UTC
The kernel runtime pm infrastructure is used.  See radeon_drv.c in the kernel.  The driver registers a struct dev_pm_ops structure with the pm core.  The pm core will then suspend and resume the device on demand.
Comment 25 Christoph Haag 2014-01-31 20:00:59 UTC
Created attachment 124041 [details]
just some printk to easily see what's going on

Well, I don't really know how it is supposed to work. I figured the easiest thing I could do right now was adding some printk() to the code to have an overview of what is happening in dmesg -H | grep radeonpm.

After booting up with no X I see this:

[  +0,000016] radeonpm: runtime_idle
[  +4,000483] radeonpm: runtime_suspend

After starting X this is added:

[Jan31 19:04] radeonpm: runtime_resume
[  +1,126688] radeonpm: runtime_idle
[  +0,000203] radeonpm: runtime_idle
[  +0,000400] radeonpm: runtime_idle
[  +0,000232] radeonpm: runtime_idle

But then it never does anything again.

After this, I did the usual xrandr --setprovideroffloadsink radeon Intel and started and exited DRI_PRIME=1 glxgears and stuff, but nothing with radeonpm was printed to the log again.

After a while I tried exiting X and sure enough, another line:

[  +2,368895] radeonpm: runtime_suspend

And after starting X for the second time:

[ +19,119650] radeonpm: runtime_resume
[  +1,131037] radeonpm: runtime_idle

And then... nothing with radeonpm again. I have read a bit in the Documentation/power/runtime_pm.txt and maybe I did not catch it but I haven't really seen when you call pm_runtime_put_autosuspend whether it is supposed to eventually call back radeon_pmops_runtime_suspend but I would think that's what it does, right? So that just never happens until X is quit.




After boot with no X running I see this with for i in /sys/class/drm/card0/device/power/*; do echo "$i: $(cat $i)"; done

/sys/class/drm/card0/device/power/async: enabled
/sys/class/drm/card0/device/power/autosuspend_delay_ms: 5000
/sys/class/drm/card0/device/power/control: auto
/sys/class/drm/card0/device/power/runtime_active_kids: 0
/sys/class/drm/card0/device/power/runtime_active_time: 13260
/sys/class/drm/card0/device/power/runtime_enabled: enabled
/sys/class/drm/card0/device/power/runtime_status: suspended
/sys/class/drm/card0/device/power/runtime_suspended_time: 218426
/sys/class/drm/card0/device/power/runtime_usage: 0
[snipped some wakeup* stuff]

The GPU is off, fan is not running, etc.


After starting X I see this:

/sys/class/drm/card0/device/power/async: enabled
/sys/class/drm/card0/device/power/autosuspend_delay_ms: 5000
/sys/class/drm/card0/device/power/control: auto
/sys/class/drm/card0/device/power/runtime_active_kids: 0
/sys/class/drm/card0/device/power/runtime_active_time: 185230
/sys/class/drm/card0/device/power/runtime_enabled: enabled
/sys/class/drm/card0/device/power/runtime_status: active
/sys/class/drm/card0/device/power/runtime_suspended_time: 226856
/sys/class/drm/card0/device/power/runtime_usage: 0

runtime_status is always active in X...
Comment 26 Alex Deucher 2014-01-31 20:46:00 UTC
Can you see how far it's getting in runtime_idle()?  Does it get all the way to the end or does it think a crtc is active?
Comment 27 Christoph Haag 2014-01-31 21:24:20 UTC
It does get to the end of radeon_pmops_runtime_idle every time (checked right before the return) and crtc->enabled is never true.
Comment 28 Alex Deucher 2014-01-31 23:00:48 UTC
Created attachment 124051 [details]
return IRQ_NONE if we don't have any interrupts

Does the attached patch help?  Dave suggested it might be interrupts that causing the problem.
Comment 29 Christoph Haag 2014-02-01 00:08:53 UTC
No, it does not seem to help.
Comment 30 Christoph Haag 2014-02-02 23:58:35 UTC
Created attachment 124221 [details]
sysprof with 2 xterm windows and kwin

I still kind of want to try the opposite direction, to find out what X does that keeps the gpu awake so I'm trying to get a good dynamic call graph. The first thing that kind of worked is sysprof...

So I started recording this after starting X and two xterms and kwin on the intel gpu and just clicked a bit between the two because it seems this triggers whatever happens. The radeon gpu was not even configured as offload slave, just default X configuration.

If I saw that correctly none of the applications called functions from the radeon driver which is probably good, but in X you can see this (more detail is available in the file):

[/usr/bin/X]                                            0,00%  65,99%
  In file [heap]                                        0,00%  29,73%
    ioctl                                               0,18%  16,17%
      - - kernel - -                                    0,00%  15,99%
        system_call_fastpath                            0,00%  15,99%
          sys_ioctl                                     0,04%  15,99%
            do_vfs_ioctl                                0,00%  15,63%
              radeon_drm_ioctl                          0,11%  15,60%
                drm_ioctl                               0,43%  14,63%
                __pm_runtime_suspend                    0,00%   0,57%
                __pm_runtime_resume                     0,04%   0,14%
                copy_user_enhanced_fast_string          0,07%   0,07%
                irq_exit                                0,00%   0,04%
                _copy_from_user                         0,04%   0,04%
              drm_ioctl                                 0,04%   0,04%
            fget_light                                  0,25%   0,25%
            radeon_drm_ioctl                            0,04%   0,04%
            fput                                        0,04%   0,04%

I have only this GPU so I don't have any comparison to one where runpm works correctly, whether it should be this active... Maybe it's interesting that stuff like __pm_runtime_suspend and __pm_runtime_resume is called?
Comment 31 Christoph Haag 2014-02-06 13:48:14 UTC
Created attachment 124781 [details]
callgrind output from X

Not much news:
I tried to run X with valgrind --tool=callgrind to try to find out about what it does that results in the radeon gpu clocking higher and going to higher voltage when clicking around with windows like minimizing/restoring with kwin & plasma.


Unfortunately it does not happen when X runs in valgrind, the radeon gpu always keeps at the lowest frequency and voltage level. But the runpm status is still "active" all the time with or without valgrind.

With the exact same xorg and xf86* builds the clocking higher and raising the voltages happens when it runs not in valgrind...

I still post the callgrind log, even if it did not clock higher, because in the Callgraph Blockhandler() calls intel sna stuff but it also calls a lot of radeon stuff.

Whether this is something that is caused by valgrind or something that should happen with a radeon gpu that is not configured to be used for anything, I don't know.
Comment 32 Hohahiu 2014-02-07 00:49:46 UTC
I have the same messages in dmesg:
http://pastebin.com/vAsRRVW8

It seems like radeon driver tries to set clock and it couldn't do it. Therefore it tries again and again and GPU does not turn off.

Christoph, does your dmesg have the same messages?
Comment 33 Christoph Haag 2014-02-07 14:52:14 UTC
Created attachment 125111 [details]
possibly call chain that calls radeon stuff

@ Hohahiu

Do you mean those?
[ 8077.648324] [drm:si_dpm_set_power_state] *ERROR* si_set_sw_state failed

No, I don't see those at all.



By the way, I'm using linux 3.14-rc1 now with this patch: https://bugzilla.kernel.org/attachment.cgi?id=124621&action=diff

But I have an awful lot of this block in dmesg, I guess from every time starting X:

[Feb 6 13:39] [drm] Disabling audio 0 support
[  +0,000004] [drm] Disabling audio 1 support
[  +0,000002] [drm] Disabling audio 2 support
[  +0,000001] [drm] Disabling audio 3 support
[  +0,000001] [drm] Disabling audio 4 support
[  +0,000002] [drm] Disabling audio 5 support
[  +1,257992] [drm] probing gen 2 caps for device 8086:151 = 261ad01/e
[  +0,000005] [drm] PCIE gen 3 link speeds already enabled
[  +0,004127] [drm] PCIE GART of 1024M enabled (table at 0x0000000000478000).
[  +0,000102] radeon 0000:01:00.0: WB enabled
[  +0,000004] radeon 0000:01:00.0: fence driver on ring 0 use gpu addr 0x0000000080000c00 and cpu addr 0xffff8807fd9f8c00
[  +0,000003] radeon 0000:01:00.0: fence driver on ring 1 use gpu addr 0x0000000080000c04 and cpu addr 0xffff8807fd9f8c04
[  +0,000002] radeon 0000:01:00.0: fence driver on ring 2 use gpu addr 0x0000000080000c08 and cpu addr 0xffff8807fd9f8c08
[  +0,000002] radeon 0000:01:00.0: fence driver on ring 3 use gpu addr 0x0000000080000c0c and cpu addr 0xffff8807fd9f8c0c
[  +0,000002] radeon 0000:01:00.0: fence driver on ring 4 use gpu addr 0x0000000080000c10 and cpu addr 0xffff8807fd9f8c10
[  +0,000390] radeon 0000:01:00.0: fence driver on ring 5 use gpu addr 0x0000000000075a18 and cpu addr 0xffffc90008b35a18
[  +0,144839] [drm] ring test on 0 succeeded in 2 usecs
[  +0,000004] [drm] ring test on 1 succeeded in 1 usecs
[  +0,000003] [drm] ring test on 2 succeeded in 1 usecs
[  +0,000059] [drm] ring test on 3 succeeded in 2 usecs
[  +0,000007] [drm] ring test on 4 succeeded in 1 usecs
[  +0,187260] [drm] ring test on 5 succeeded in 2 usecs
[  +0,000004] [drm] UVD initialized successfully.
[  +0,001745] [drm] Enabling audio 0 support
[  +0,000001] [drm] Enabling audio 1 support
[  +0,000000] [drm] Enabling audio 2 support
[  +0,000001] [drm] Enabling audio 3 support
[  +0,000001] [drm] Enabling audio 4 support
[  +0,000001] [drm] Enabling audio 5 support
[  +0,000035] [drm] ib test on ring 0 succeeded in 0 usecs
[  +0,000029] [drm] ib test on ring 1 succeeded in 0 usecs
[  +0,000029] [drm] ib test on ring 2 succeeded in 0 usecs
[  +0,000019] [drm] ib test on ring 3 succeeded in 0 usecs
[  +0,000018] [drm] ib test on ring 4 succeeded in 0 usecs
[  +0,158619] [drm] ib test on ring 5 succeeded

But I wouldn't really thing that it's a problem. Just a bit bloat in the log.



Anyway, I have stared a bit longer at the callgraph and what I have attached seems suspicious I think. There's a whole lot of calls throughout the radeon driver but the kcachegrind gui is quite limited and the graph is very convoluted at this point so it's not clear to me whether this is during normal operation. I'm still kind of searching for a tool that could create a call graph and annotate it over time...
Where this is coming from is in dix/dixutils.c line 719 ("(*(cbr->proc)) (pcbl, cbr->data, call_data);") at least but to my eyes it looks like it could do anything and has a whole lot of callers that valgrind caught:

WriteToClient <cycle 19> (Xorg: io.c, ...)
FlushAllOutput (Xorg: io.c, ...)
XaceHook <cycle 19> (Xorg: xace.c, ...)
XaceHookPropertyAccess (Xorg: xace.c, ...)
XaceHookDispatch (Xorg: xace.c, ...)
CloseDownConnection (Xorg: connection.c, ...)
DeleteClientFromAnySelections (Xorg: selection.c, ...)
CloseDownClient (Xorg: dispatch.c, ...)
ProcSetSelectionOwner (Xorg: selection.c, ...)
SendConnSetup (Xorg: dispatch.c, ...)
NextAvailableClient (Xorg: dispatch.c, ...)

But maybe I'm just going in a totally wrong direction...
Comment 34 Hohahiu 2014-02-07 22:39:55 UTC
(In reply to Christoph Haag from comment #33)
> Created attachment 125111 [details]
> possibly call chain that calls radeon stuff
> 
> @ Hohahiu
> 
> Do you mean those?
> [ 8077.648324] [drm:si_dpm_set_power_state] *ERROR* si_set_sw_state failed
> 
> No, I don't see those at all.
> 
> 
> 
> By the way, I'm using linux 3.14-rc1 now with this patch:
> https://bugzilla.kernel.org/attachment.cgi?id=124621&action=diff
> 
> But I have an awful lot of this block in dmesg, I guess from every time
> starting X:
> 
> [Feb 6 13:39] [drm] Disabling audio 0 support
> [  +0,000004] [drm] Disabling audio 1 support
> [  +0,000002] [drm] Disabling audio 2 support
> [  +0,000001] [drm] Disabling audio 3 support
> [  +0,000001] [drm] Disabling audio 4 support
> [  +0,000002] [drm] Disabling audio 5 support
> [  +1,257992] [drm] probing gen 2 caps for device 8086:151 = 261ad01/e
> [  +0,000005] [drm] PCIE gen 3 link speeds already enabled
> [  +0,004127] [drm] PCIE GART of 1024M enabled (table at 0x0000000000478000).
> [  +0,000102] radeon 0000:01:00.0: WB enabled
> [  +0,000004] radeon 0000:01:00.0: fence driver on ring 0 use gpu addr
> 0x0000000080000c00 and cpu addr 0xffff8807fd9f8c00
> [  +0,000003] radeon 0000:01:00.0: fence driver on ring 1 use gpu addr
> 0x0000000080000c04 and cpu addr 0xffff8807fd9f8c04
> [  +0,000002] radeon 0000:01:00.0: fence driver on ring 2 use gpu addr
> 0x0000000080000c08 and cpu addr 0xffff8807fd9f8c08
> [  +0,000002] radeon 0000:01:00.0: fence driver on ring 3 use gpu addr
> 0x0000000080000c0c and cpu addr 0xffff8807fd9f8c0c
> [  +0,000002] radeon 0000:01:00.0: fence driver on ring 4 use gpu addr
> 0x0000000080000c10 and cpu addr 0xffff8807fd9f8c10
> [  +0,000390] radeon 0000:01:00.0: fence driver on ring 5 use gpu addr
> 0x0000000000075a18 and cpu addr 0xffffc90008b35a18
> [  +0,144839] [drm] ring test on 0 succeeded in 2 usecs
> [  +0,000004] [drm] ring test on 1 succeeded in 1 usecs
> [  +0,000003] [drm] ring test on 2 succeeded in 1 usecs
> [  +0,000059] [drm] ring test on 3 succeeded in 2 usecs
> [  +0,000007] [drm] ring test on 4 succeeded in 1 usecs
> [  +0,187260] [drm] ring test on 5 succeeded in 2 usecs
> [  +0,000004] [drm] UVD initialized successfully.
> [  +0,001745] [drm] Enabling audio 0 support
> [  +0,000001] [drm] Enabling audio 1 support
> [  +0,000000] [drm] Enabling audio 2 support
> [  +0,000001] [drm] Enabling audio 3 support
> [  +0,000001] [drm] Enabling audio 4 support
> [  +0,000001] [drm] Enabling audio 5 support
> [  +0,000035] [drm] ib test on ring 0 succeeded in 0 usecs
> [  +0,000029] [drm] ib test on ring 1 succeeded in 0 usecs
> [  +0,000029] [drm] ib test on ring 2 succeeded in 0 usecs
> [  +0,000019] [drm] ib test on ring 3 succeeded in 0 usecs
> [  +0,000018] [drm] ib test on ring 4 succeeded in 0 usecs
> [  +0,158619] [drm] ib test on ring 5 succeeded
> 
> But I wouldn't really thing that it's a problem. Just a bit bloat in the log.
> 
> 
> 
> Anyway, I have stared a bit longer at the callgraph and what I have attached
> seems suspicious I think. There's a whole lot of calls throughout the radeon
> driver but the kcachegrind gui is quite limited and the graph is very
> convoluted at this point so it's not clear to me whether this is during
> normal operation. I'm still kind of searching for a tool that could create a
> call graph and annotate it over time...
> Where this is coming from is in dix/dixutils.c line 719 ("(*(cbr->proc))
> (pcbl, cbr->data, call_data);") at least but to my eyes it looks like it
> could do anything and has a whole lot of callers that valgrind caught:
> 
> WriteToClient <cycle 19> (Xorg: io.c, ...)
> FlushAllOutput (Xorg: io.c, ...)
> XaceHook <cycle 19> (Xorg: xace.c, ...)
> XaceHookPropertyAccess (Xorg: xace.c, ...)
> XaceHookDispatch (Xorg: xace.c, ...)
> CloseDownConnection (Xorg: connection.c, ...)
> DeleteClientFromAnySelections (Xorg: selection.c, ...)
> CloseDownClient (Xorg: dispatch.c, ...)
> ProcSetSelectionOwner (Xorg: selection.c, ...)
> SendConnSetup (Xorg: dispatch.c, ...)
> NextAvailableClient (Xorg: dispatch.c, ...)
> 
> But maybe I'm just going in a totally wrong direction...

I meant these repeated messages not the error.
Comment 35 Alex Deucher 2014-02-07 22:44:12 UTC
(In reply to Christoph Haag from comment #33)
> But I have an awful lot of this block in dmesg, I guess from every time
> starting X:

Those are the messages from the driver each time the card is powered off or on.

> 
> [Feb 6 13:39] [drm] Disabling audio 0 support
> [  +0,000004] [drm] Disabling audio 1 support
> [  +0,000002] [drm] Disabling audio 2 support
> [  +0,000001] [drm] Disabling audio 3 support
> [  +0,000001] [drm] Disabling audio 4 support
> [  +0,000002] [drm] Disabling audio 5 support

Card is powered off.


> [  +1,257992] [drm] probing gen 2 caps for device 8086:151 = 261ad01/e
> [  +0,000005] [drm] PCIE gen 3 link speeds already enabled
> [  +0,004127] [drm] PCIE GART of 1024M enabled (table at 0x0000000000478000).
> [  +0,000102] radeon 0000:01:00.0: WB enabled
> [  +0,000004] radeon 0000:01:00.0: fence driver on ring 0 use gpu addr
> 0x0000000080000c00 and cpu addr 0xffff8807fd9f8c00
> [  +0,000003] radeon 0000:01:00.0: fence driver on ring 1 use gpu addr
> 0x0000000080000c04 and cpu addr 0xffff8807fd9f8c04
> [  +0,000002] radeon 0000:01:00.0: fence driver on ring 2 use gpu addr
> 0x0000000080000c08 and cpu addr 0xffff8807fd9f8c08
> [  +0,000002] radeon 0000:01:00.0: fence driver on ring 3 use gpu addr
> 0x0000000080000c0c and cpu addr 0xffff8807fd9f8c0c
> [  +0,000002] radeon 0000:01:00.0: fence driver on ring 4 use gpu addr
> 0x0000000080000c10 and cpu addr 0xffff8807fd9f8c10
> [  +0,000390] radeon 0000:01:00.0: fence driver on ring 5 use gpu addr
> 0x0000000000075a18 and cpu addr 0xffffc90008b35a18
> [  +0,144839] [drm] ring test on 0 succeeded in 2 usecs
> [  +0,000004] [drm] ring test on 1 succeeded in 1 usecs
> [  +0,000003] [drm] ring test on 2 succeeded in 1 usecs
> [  +0,000059] [drm] ring test on 3 succeeded in 2 usecs
> [  +0,000007] [drm] ring test on 4 succeeded in 1 usecs
> [  +0,187260] [drm] ring test on 5 succeeded in 2 usecs
> [  +0,000004] [drm] UVD initialized successfully.
> [  +0,001745] [drm] Enabling audio 0 support
> [  +0,000001] [drm] Enabling audio 1 support
> [  +0,000000] [drm] Enabling audio 2 support
> [  +0,000001] [drm] Enabling audio 3 support
> [  +0,000001] [drm] Enabling audio 4 support
> [  +0,000001] [drm] Enabling audio 5 support
> [  +0,000035] [drm] ib test on ring 0 succeeded in 0 usecs
> [  +0,000029] [drm] ib test on ring 1 succeeded in 0 usecs
> [  +0,000029] [drm] ib test on ring 2 succeeded in 0 usecs
> [  +0,000019] [drm] ib test on ring 3 succeeded in 0 usecs
> [  +0,000018] [drm] ib test on ring 4 succeeded in 0 usecs
> [  +0,158619] [drm] ib test on ring 5 succeeded

Card is powered back on.
Comment 36 Hohahiu 2014-02-07 22:47:59 UTC
(In reply to Alex Deucher from comment #35)
> (In reply to Christoph Haag from comment #33)
> > But I have an awful lot of this block in dmesg, I guess from every time
> > starting X:
> 
> Those are the messages from the driver each time the card is powered off or
> on.
> 
> > 
> > [Feb 6 13:39] [drm] Disabling audio 0 support
> > [  +0,000004] [drm] Disabling audio 1 support
> > [  +0,000002] [drm] Disabling audio 2 support
> > [  +0,000001] [drm] Disabling audio 3 support
> > [  +0,000001] [drm] Disabling audio 4 support
> > [  +0,000002] [drm] Disabling audio 5 support
> 
> Card is powered off.
> 
> 
> > [  +1,257992] [drm] probing gen 2 caps for device 8086:151 = 261ad01/e
> > [  +0,000005] [drm] PCIE gen 3 link speeds already enabled
> > [  +0,004127] [drm] PCIE GART of 1024M enabled (table at
> 0x0000000000478000).
> > [  +0,000102] radeon 0000:01:00.0: WB enabled
> > [  +0,000004] radeon 0000:01:00.0: fence driver on ring 0 use gpu addr
> > 0x0000000080000c00 and cpu addr 0xffff8807fd9f8c00
> > [  +0,000003] radeon 0000:01:00.0: fence driver on ring 1 use gpu addr
> > 0x0000000080000c04 and cpu addr 0xffff8807fd9f8c04
> > [  +0,000002] radeon 0000:01:00.0: fence driver on ring 2 use gpu addr
> > 0x0000000080000c08 and cpu addr 0xffff8807fd9f8c08
> > [  +0,000002] radeon 0000:01:00.0: fence driver on ring 3 use gpu addr
> > 0x0000000080000c0c and cpu addr 0xffff8807fd9f8c0c
> > [  +0,000002] radeon 0000:01:00.0: fence driver on ring 4 use gpu addr
> > 0x0000000080000c10 and cpu addr 0xffff8807fd9f8c10
> > [  +0,000390] radeon 0000:01:00.0: fence driver on ring 5 use gpu addr
> > 0x0000000000075a18 and cpu addr 0xffffc90008b35a18
> > [  +0,144839] [drm] ring test on 0 succeeded in 2 usecs
> > [  +0,000004] [drm] ring test on 1 succeeded in 1 usecs
> > [  +0,000003] [drm] ring test on 2 succeeded in 1 usecs
> > [  +0,000059] [drm] ring test on 3 succeeded in 2 usecs
> > [  +0,000007] [drm] ring test on 4 succeeded in 1 usecs
> > [  +0,187260] [drm] ring test on 5 succeeded in 2 usecs
> > [  +0,000004] [drm] UVD initialized successfully.
> > [  +0,001745] [drm] Enabling audio 0 support
> > [  +0,000001] [drm] Enabling audio 1 support
> > [  +0,000000] [drm] Enabling audio 2 support
> > [  +0,000001] [drm] Enabling audio 3 support
> > [  +0,000001] [drm] Enabling audio 4 support
> > [  +0,000001] [drm] Enabling audio 5 support
> > [  +0,000035] [drm] ib test on ring 0 succeeded in 0 usecs
> > [  +0,000029] [drm] ib test on ring 1 succeeded in 0 usecs
> > [  +0,000029] [drm] ib test on ring 2 succeeded in 0 usecs
> > [  +0,000019] [drm] ib test on ring 3 succeeded in 0 usecs
> > [  +0,000018] [drm] ib test on ring 4 succeeded in 0 usecs
> > [  +0,158619] [drm] ib test on ring 5 succeeded
> 
> Card is powered back on.

Thanks for explanation. In my case sometimes the card is powered on and off a lot of times without any explicit use.
Comment 37 Joshua M. Thompson 2014-02-11 01:58:48 UTC
I'd just like to chime in, since I've been having similar runpm problems with both 3.13 and 3.14 kernels on my laptop with built-in Intel HD4000 and a Radeon 7730M. The dGPU is constantly suspending and resuming every few minutes, even though nothing is using it (although xrandr offload has been configured). This is accompanied by the burst of kernel messages mentioned above, as well as several seconds where the mouse cursor moves but X is otherwise unresponsive. If I let the laptop sit idle for a while and come back it'll be frozen with a blank screen, presumably because it either froze up blanking the screen or shortly after.

Running with radeon.runpm=0 eliminates the freezes and the lockups, but it keeps the laptop running a lot hotter than I'd like.
Comment 38 Jani Nikula 2014-02-11 07:30:07 UTC
*** Bug 69291 has been marked as a duplicate of this bug. ***
Comment 39 Michel Dänzer 2014-02-12 07:55:11 UTC
(In reply to Christoph Haag from comment #33)
> 
> I'm still kind of searching for a tool that could create a call graph and
> annotate it over time...

I suspect profiling is just not the right tool for this.

While the Radeon card is supposed to be off, can you attach gdb to the X server process and set a breakpoint on radeon_cs_emit, and if that triggers, attach the output of bt full?
Comment 40 Christoph Haag 2014-02-12 09:41:18 UTC
Created attachment 125701 [details]
radeon_cs_emit did not trigger, but ProcGetInputFocus causes some activity

(In reply to Michel Dänzer from comment #39)
> (In reply to Christoph Haag from comment #33)
> > 
> > I'm still kind of searching for a tool that could create a call graph and
> > annotate it over time...
> 
> I suspect profiling is just not the right tool for this.

Yes, what I actually wanted was a tool that traces every single function call and outputs the call chain for every single function call but there doesn't seem to be much in that regard except gdb scripting or long abandoned tools.

> While the Radeon card is supposed to be off, can you attach gdb to the X
> server process and set a breakpoint on radeon_cs_emit, and if that triggers,
> attach the output of bt full?

It does not trigger. On the other hand it also does not trigger when running a 3d application like xonotic on the radeon gpu. So here's another maybe useless gdb log (radeon_cs_emit was Breakpoint 1)...
Comment 41 Michel Dänzer 2014-02-13 02:53:56 UTC
Created attachment 125821 [details]
radeonsi: Short-circuit flushes with no preceding draw calls

Does this Mesa patch help?
Comment 42 Christoph Haag 2014-02-13 09:00:37 UTC
0:DIS: :DynOff:0000:01:00.0
1:IGD:+:Pwr:0000:00:02.0

Nice. It now powers off properly in X.

I haven't extensively tried but with the latest mesa git and this patch this bug seems to have vanished too:
https://bugs.freedesktop.org/show_bug.cgi?id=69101
https://bugs.kde.org/show_bug.cgi?id=324864
Comment 43 Christoph Haag 2014-02-13 12:12:27 UTC
But it silently breaks opencl.

Before the patch:

opencl-example (git)-[master] % sudo ./hello_world
./hello_world: /usr/lib/libOpenCL.so.1: no version information available (required by ./hello_world)
There are 1 platforms.
There are 1 GPU devices.
clCreateContext() succeeded.
clCreateCommandQueue() succeeded.
clCreateProgramWithSource() suceeded.
clBuildProgram() suceeded.
clCreateKernel() suceeded.
clCreateBuffer() succeeded.
clSetKernelArg() succeeded.
clEnqueueNDRangeKernel() suceeded.
clFinish() succeeded.
clEnqueueReadBuffer() suceeded.
pi = 3.141590

After the patch:

opencl-example (git)-[master] % sudo ./hello_world
./hello_world: /usr/lib/libOpenCL.so.1: no version information available (required by ./hello_world)
There are 1 platforms.
There are 1 GPU devices.
clCreateContext() succeeded.
clCreateCommandQueue() succeeded.
clCreateProgramWithSource() suceeded.
clBuildProgram() suceeded.
clCreateKernel() suceeded.
clCreateBuffer() succeeded.
clSetKernelArg() succeeded.
clEnqueueNDRangeKernel() suceeded.
clFinish() succeeded.
clEnqueueReadBuffer() suceeded.
pi = -nan
Comment 44 Michel Dänzer 2014-02-14 02:59:54 UTC
Created attachment 126011 [details]
r600g,radeonsi: Consolidate logic for short-circuiting flushes

This patch doesn't break OpenCL for me.
Comment 45 Christoph Haag 2014-02-15 17:33:19 UTC
Yes, this is working pretty good, haven't noticed anything bad with it.

Even suspending works great, manual vgaswitcheroo was always a bit buggy after waking up, but this is working great.

Maybe a minor thing: When the GPU is off lm_sensors displays
radeon-pci-0100
Adapter: PCI adapter
temp1:       +511.0°C
Should maybe be 0 or -1 or the last detected temperature or something, 511°C could throw some tools off.



Anyway, in comment #42 I was not right, the glx compositing black window issue was not fixed, but it certainly had an influence on it. Might be worth checking it out.



It's starting to work better than fglrx now, so, very nice.

Now the only issue I have is unloading the radeon kernel module. But fortunately you don't really have to do that during normal usage so it's not a priority for anyone I guess.
Comment 46 Alex Deucher 2014-02-17 17:04:08 UTC
Created attachment 126451 [details]
report 0 for temp when dGPU is powered off

(In reply to Christoph Haag from comment #45)
> Yes, this is working pretty good, haven't noticed anything bad with it.
> 
> Even suspending works great, manual vgaswitcheroo was always a bit buggy
> after waking up, but this is working great.
> 
> Maybe a minor thing: When the GPU is off lm_sensors displays
> radeon-pci-0100
> Adapter: PCI adapter
> temp1:       +511.0°C
> Should maybe be 0 or -1 or the last detected temperature or something, 511°C
> could throw some tools off.
> 

Attached patch should report 0.
Comment 47 Hohahiu 2014-02-19 00:45:26 UTC
The problem is fixed now in kernel 3.14-rc3. Truly speaking the laptop now is very cold compared to what it was before. So thank you very much for your work, Alex and Michel!
However whenever my dGPU is powered on there are following error messages:
[   32.333940] [drm:uvd_v1_0_start] *ERROR* UVD not responding, trying to reset the VCPU!!!
[   33.357599] [drm:uvd_v1_0_start] *ERROR* UVD not responding, trying to reset the VCPU!!!
[   34.381217] [drm:uvd_v1_0_start] *ERROR* UVD not responding, trying to reset the VCPU!!!
[   35.404882] [drm:uvd_v1_0_start] *ERROR* UVD not responding, trying to reset the VCPU!!!
[   36.428483] [drm:uvd_v1_0_start] *ERROR* UVD not responding, trying to reset the VCPU!!!
[   37.452137] [drm:uvd_v1_0_start] *ERROR* UVD not responding, trying to reset the VCPU!!!
[   38.475777] [drm:uvd_v1_0_start] *ERROR* UVD not responding, trying to reset the VCPU!!!
[   39.499447] [drm:uvd_v1_0_start] *ERROR* UVD not responding, trying to reset the VCPU!!!
[   40.523078] [drm:uvd_v1_0_start] *ERROR* UVD not responding, trying to reset the VCPU!!!
[   41.546747] [drm:uvd_v1_0_start] *ERROR* UVD not responding, trying to reset the VCPU!!!
[   41.566826] [drm:uvd_v1_0_start] *ERROR* UVD not responding, giving up!!!
[   41.566879] [drm:si_startup] *ERROR* radeon: failed initializing UVD (-1).

Is it related to this issue or do you want me to open new bug report?
Comment 48 Hohahiu 2014-02-19 00:46:25 UTC
Created attachment 126711 [details]
dmesg
Comment 49 Alex Deucher 2014-02-19 01:26:00 UTC
(In reply to Hohahiu from comment #47)
> [   41.546747] [drm:uvd_v1_0_start] *ERROR* UVD not responding, trying to
> reset the VCPU!!!
> [   41.566826] [drm:uvd_v1_0_start] *ERROR* UVD not responding, giving up!!!
> [   41.566879] [drm:si_startup] *ERROR* radeon: failed initializing UVD (-1).
> 
> Is it related to this issue or do you want me to open new bug report?

Please open a different bug for this.
Comment 50 Joshua M. Thompson 2014-02-19 02:45:45 UTC
On Fedora Rawhide kernel 3.14.0-0.rc3.git0.2.fc21.x86_64 still doesn't fix this (maybe it doesn't have the necessary patch yet?)

With radeon.runpm=1 I get the "UVD not responding" messages, the machine locks up  frequently for 5-10 seconds to reinit the radeon, and the radeon performance is horrible (~400 fps in glxgears).

With radeon.runpm=0 I don't get "UVD not responding", the lockups are gone, and glxgears is at 3650 fps (still less than the 6-7k fps I got under 3.12 though.)
Comment 51 saadnaji89 2014-03-03 00:19:22 UTC
Hello,

I am runnig Manjaro (Arch based distro 64 bit) with Kernel 3.13.5 as right now and I am having the same error except that I am geting repetitive blocks in the kernel log like this ever certain amount of time:

03/02/14 06:59:12 PM		[drm] ring test on 0 succeeded in 2 usecs
03/02/14 06:59:12 PM		[drm] ring test on 1 succeeded in 1 usecs
03/02/14 06:59:12 PM		[drm] ring test on 2 succeeded in 1 usecs
03/02/14 06:59:12 PM		[drm] ring test on 3 succeeded in 2 usecs
03/02/14 06:59:12 PM		[drm] ring test on 4 succeeded in 1 usecs
03/02/14 06:59:12 PM		[drm] ring test on 5 succeeded in 2 usecs
03/02/14 06:59:12 PM		[drm] UVD initialized successfully.
03/02/14 06:59:12 PM		[drm] Enabling audio 0 support
03/02/14 06:59:12 PM		[drm] Enabling audio 1 support
03/02/14 06:59:12 PM		[drm] Enabling audio 2 support
03/02/14 06:59:12 PM		[drm] Enabling audio 3 support
03/02/14 06:59:12 PM		[drm] Enabling audio 4 support
03/02/14 06:59:12 PM		[drm] Enabling audio 5 support
03/02/14 06:59:12 PM		[drm] ib test on ring 0 succeeded in 0 usecs
03/02/14 06:59:12 PM		[drm] ib test on ring 1 succeeded in 0 usecs
03/02/14 06:59:12 PM		[drm] ib test on ring 2 succeeded in 0 usecs
03/02/14 06:59:12 PM		[drm] ib test on ring 3 succeeded in 0 usecs
03/02/14 06:59:12 PM		[drm] ib test on ring 4 succeeded in 0 usecs
03/02/14 06:59:12 PM		[drm] ib test on ring 5 succeeded
03/02/14 06:59:48 PM		[drm] Disabling audio 0 support
03/02/14 06:59:48 PM		[drm] Disabling audio 1 support
03/02/14 06:59:48 PM		[drm] Disabling audio 2 support
03/02/14 06:59:48 PM		[drm] Disabling audio 3 support
03/02/14 06:59:48 PM		[drm] Disabling audio 4 support
03/02/14 06:59:48 PM		[drm] Disabling audio 5 support

and this is causing freeze for my laptop for 2-3 seconds, as well as causing the temperature of the gpu to raise up by 10 C . This bug was not present in prior 3.13 kernel
Comment 52 saadnaji89 2014-03-03 00:20:07 UTC
Created attachment 127801 [details]
kenrel log
Comment 53 Michel Dänzer 2014-03-03 08:22:49 UTC
(In reply to saadnaji89 from comment #51)
> I am runnig Manjaro (Arch based distro 64 bit) with Kernel 3.13.5 as right
> now and I am having the same error except that I am geting repetitive blocks
> in the kernel log like this ever certain amount of time:

See comment #35.


> and this is causing freeze for my laptop for 2-3 seconds, as well as causing
> the temperature of the gpu to raise up by 10 C .

You probably need my Mesa patch to prevent the GPU from powering up needlessly: https://bugzilla.kernel.org/attachment.cgi?id=126011
Comment 54 saadnaji89 2014-03-03 23:16:42 UTC
(In reply to Michel Dänzer from comment #53)
> (In reply to saadnaji89 from comment #51)
> > I am runnig Manjaro (Arch based distro 64 bit) with Kernel 3.13.5 as right
> > now and I am having the same error except that I am geting repetitive
> blocks
> > in the kernel log like this ever certain amount of time:
> 
> See comment #35.
> 
> 
> > and this is causing freeze for my laptop for 2-3 seconds, as well as
> causing
> > the temperature of the gpu to raise up by 10 C .
> 
> You probably need my Mesa patch to prevent the GPU from powering up
> needlessly: https://bugzilla.kernel.org/attachment.cgi?id=126011

Thanks you.

Are we going to see this patch applied to fix the  problem in future kernel veriosn  ?. I don't know whether you work for AMD or just someone is contributing to fix the problem
Comment 55 Christoph Haag 2014-03-03 23:25:39 UTC
(In reply to saadnaji89 from comment #54)

> Are we going to see this patch applied to fix the  problem in future kernel
> veriosn  ?. I don't know whether you work for AMD or just someone is
> contributing to fix the problem

I wondered whether to reply, but now...

He works for AMD. It's even on Wikipedia.

This patch is not to the kernel, it is to mesa.

It has been in mesa for 14 days: http://cgit.freedesktop.org/mesa/mesa/commit/?id=cf0172d46ab940a691da6516057c81f28961482f

(Is it necessary to keep talking about already fixed stuff?)



I'd say the only reason this is not closed yet is because of the radeon module unloading problems. Thinking of it, I haven't tried in a while.. Should I try to get some more information for it or is the delay just until someone has time to look at it?
Comment 56 Mike Cloaked 2014-04-01 09:00:47 UTC
It is possible that this report is directly related:

https://bugzilla.kernel.org/show_bug.cgi?id=73291

So I guess at this time the bug is still awaiting the mesa fix to test further in arch linux at least?
Comment 57 Christoph Haag 2015-01-04 01:09:26 UTC
Long time, no update from me.

Using 3.19-rc2:

rmmod radeon works fine
modprobe radeon after rmmod stalls once:

[   53.031311] radeon 0000:01:00.0: ring 5 stalled for more than 10000msec
[   53.032230] radeon 0000:01:00.0: GPU lockup (current fence id 0x0000000000000000 last fence id 0x0000000000000002 on ring 5)
[   53.032322] [drm:uvd_v1_0_ib_test [radeon]] *ERROR* radeon: fence wait failed (-35).
[   53.033262] [drm:radeon_ib_ring_tests [radeon]] *ERROR* radeon: failed testing IB on ring 5 (-35).
[   53.037997] [drm] Radeon Display Connectors
[   53.041287] radeon 0000:01:00.0: No connectors reported connected with modes
[   53.041289] [drm] Cannot find any crtc or sizes - going 1024x768
[   53.041834] [drm] fb mappable at 0xE0479000
[   53.041835] [drm] vram apper at 0xE0000000
[   53.041835] [drm] size 3145728
[   53.041836] [drm] fb depth is 24
[   53.041837] [drm]    pitch is 4096
[   53.041941] radeon 0000:01:00.0: fb1: radeondrmfb frame buffer device
[   53.042347] [drm] Initialized radeon 2.40.0 20080528 for 0000:01:00.0 on minor 1

but it recovers and then works.

So rmmod radeon and modprobe radeon do not cause any critical errors anymore, it only is a bit annoying that modprobe radeon takes 10 seconds.

Note You need to log in before you can comment on or make changes to this bug.