Bug 16140

Summary: [RADEON:KMS:RV250:RESUME] suspend to RAM resume broken
Product: Drivers
Reporter: Michael Long (harn-solo)
Component: Video(DRI - non Intel)
Assignee: drivers_video-dri
Status: RESOLVED CODE_FIX
Severity: normal
CC: akpm, alan, alexdeucher, bjaglin, braney.bugzilla4kernel, bugs+kernel, cactus, dnax88, doodle62, enricus, florian, glisse, hramrach, k.moradi, lists, pebolle, razamatan, rjw, robert.de.rooy, sedat.dilek, tomka, vmatare+kernelbug, webmaster, wirawan0
Priority: P1
Hardware: All
OS: Linux
Kernel Version: 2.6.29 - 2.6.38-rc7-00163-gfb62c00
Subsystem:
Regression: Yes
Bisected commit-id:
Bug Depends on:
Bug Blocks: 15310
Attachments: dmesg output shortly before suspending and after resume
output of lspci
lspci output for 2.6.35-git12 kernel
dmesg for 2.6.35-git12 kernel
Output of 'LIBGL_VERBOSE=debug glxinfo 2>/dev/null > glxinfo.txt'
Video-BIOS dump of RV250 in an IBM T40p
2.6.36-rc4-git3 kern.log (pm-suspend + pm-resume)
2.6.36-rc4-git3 debug (pm-suspend + pm-resume)
2.6.36-rc4-git3 syslog (pm-suspend + pm-resume)
debug patch to see where we fail to set up the ring
Tarball of logs for Florian Mickler
Checksum for tarball of logs for Florian Mickler
Some more debugging output in r100_cp_init() + Fix typos in r100_ring_test()
kern.log.xz
debug.xz
messages.xz
Most recent resume trace

Description Michael Long 2010-06-06 09:52:33 UTC
Created attachment 26667 [details]
dmesg output shortly before suspending and after resume

On my ThinkPad T40p notebook (Radeon RV250, AGP), resuming from suspend to RAM has been broken since KMS was introduced into mainline. In newer kernel versions it is at least possible to switch from the garbled screen to a console and reboot cleanly.

With UMS, suspend and resume work as expected.

See attached dmesg output.
Comment 1 Michael Long 2010-06-06 09:55:14 UTC
Created attachment 26668 [details]
output of lspci
Comment 2 Daniele Napolitano 2010-06-21 21:03:32 UTC
I have this issue too, but with an RV280 chipset.

This does not happen with kernel 2.6.33; the regression is in 2.6.34. KMS is active in both versions.


lspci VGA line:
01:00.0 VGA compatible controller: ATI Technologies Inc RV280 [Radeon 9200] (rev 01)

dmesg log:
[  965.493155] [drm:radeon_ib_schedule] *ERROR* radeon: couldn't schedule IB(0).
[  965.493164] [drm:radeon_cs_ioctl] *ERROR* Faild to schedule IB !
[  965.493702] [drm:radeon_ib_schedule] *ERROR* radeon: couldn't schedule IB(1).
[  965.493707] [drm:radeon_cs_ioctl] *ERROR* Faild to schedule IB !
[  965.993269] [drm:radeon_ib_schedule] *ERROR* radeon: couldn't schedule IB(2).
[  965.993277] [drm:radeon_cs_ioctl] *ERROR* Faild to schedule IB !
[  965.993785] [drm:radeon_ib_schedule] *ERROR* radeon: couldn't schedule IB(3).
[  965.993790] [drm:radeon_cs_ioctl] *ERROR* Faild to schedule IB !
.....
[  968.318575] [drm:radeon_ib_schedule] *ERROR* radeon: couldn't schedule IB(15).
Comment 3 Alex Deucher 2010-06-23 17:20:22 UTC
Does disabling AGP work any better? Boot with radeon.agpmode=-1
Comment 4 Michael Long 2010-06-23 21:01:59 UTC
Hi,

this does work for me, tested on 2.6.31 and 2.6.34.
Comment 6 Alex Deucher 2010-06-23 21:17:36 UTC
If so, this bug and probably bug 16273 are dupes of bug 15969.
Comment 7 Michael Long 2010-06-24 17:19:53 UTC
After applying this to 2.6.34, I get this after resume:


Jun 24 18:57:23 T40p kernel: [  110.028385] PM: resume of devices complete after 1578.173 msecs
Jun 24 18:57:23 T40p kernel: [  110.028598] Restarting tasks ... 
Jun 24 18:57:23 T40p kernel: [  110.031390] BUG: unable to handle kernel paging request at f8f50000
Jun 24 18:57:23 T40p kernel: [  110.031659] IP: [<f8aad0da>] radeon_cs_update_pages+0x15a/0x190 [radeon]
Jun 24 18:57:23 T40p kernel: [  110.031955] *pde = 3654f067 *pte = 00000000 
Jun 24 18:57:23 T40p kernel: [  110.032008] Oops: 0002 [#1] PREEMPT 
Jun 24 18:57:23 T40p kernel: [  110.032008] last sysfs file: /sys/devices/LNXSYSTM:00/LNXSYBUS:00/PNP0A03:00/device:00/PNP0C09:00/PNP0C0A:00/power_supply/BAT0/uevent
Jun 24 18:57:23 T40p kernel: [  110.032008] Modules linked in: xts gf128mul bluetooth autofs4 nfsd lockd auth_rpcgss sunrpc snd_seq snd_usb_audio snd_hwdep snd_usb_lib snd_rawmidi snd_seq_device dm_crypt dm_mod acpi_cpufreq fuse radeon ttm snd_intel8x0 ath5k thinkpad_acpi drm_kms_helper snd_ac97_codec mac80211 drm ath ac97_bus hwmon snd_pcm cfg80211 snd_timer cfbcopyarea snd ehci_hcd cfbimgblt sr_mod rfkill soundcore uhci_hcd yenta_socket evdev led_class rtc cdrom pcmcia_core ac battery cfbfillrect snd_page_alloc sg nvram usbcore thermal processor button
Jun 24 18:57:23 T40p kernel: [  110.032008] 
Jun 24 18:57:23 T40p kernel: [  110.032008] Pid: 1530, comm: X Tainted: G        W  2.6.34 #3 2373G1G/2373G1G
Jun 24 18:57:23 T40p kernel: [  110.032008] EIP: 0060:[<f8aad0da>] EFLAGS: 00010202 CPU: 0
Jun 24 18:57:23 T40p kernel: [  110.032008] EIP is at radeon_cs_update_pages+0x15a/0x190 [radeon]
Jun 24 18:57:23 T40p kernel: [  110.032008] EAX: 00000000 EBX: 00000005 ECX: 000000bc EDX: f735481c
Jun 24 18:57:23 T40p kernel: [  110.032008] ESI: f63bb000 EDI: f8f50000 EBP: ee4b43c0 ESP: f6207d0c
Jun 24 18:57:23 T40p kernel: [  110.032008]  DS: 007b ES: 007b FS: 0000 GS: 0033 SS: 0068
Jun 24 18:57:23 T40p kernel: [  110.032008] Process X (pid: 1530, ti=f6206000 task=f64bb900 task.ti=f6206000)
Jun 24 18:57:23 T40p kernel: [  110.032008] Stack:
Jun 24 18:57:23 T40p kernel: [  110.032008]  00000000 00000001 000002f0 f6207da0 ee4b43c0 f6207de4 00000000 f8ab720d
Jun 24 18:57:23 T40p kernel: [  110.032008] <0> f73543f4 f8a97000 00000001 f8985a53 00000000 00000202 00000000 f6207de4
Jun 24 18:57:23 T40p kernel: [  110.032008] <0> f63b8000 f7354000 f7355574 f8ab7555 00000001 f36bf814 f89868a4 00000001
Jun 24 18:57:23 T40p kernel: [  110.032008] Call Trace:
Jun 24 18:57:23 T40p kernel: [  110.032008]  [<f8ab720d>] ? r100_cs_packet_parse+0x5d/0x1d0 [radeon]
Jun 24 18:57:23 T40p kernel: [  110.032008]  [<f8a97000>] ? radeon_bo_move+0x0/0x330 [radeon]
Jun 24 18:57:23 T40p kernel: [  110.032008]  [<f8985a53>] ? ttm_bo_handle_move_mem+0x263/0x330 [ttm]
Jun 24 18:57:23 T40p kernel: [  110.032008]  [<f8ab7555>] ? r100_cs_parse+0x35/0x680 [radeon]
Jun 24 18:57:23 T40p kernel: [  110.032008]  [<f89868a4>] ? ttm_bo_move_buffer+0x114/0x140 [ttm]
Jun 24 18:57:23 T40p kernel: [  110.032008]  [<f898695d>] ? ttm_bo_validate+0x8d/0x110 [ttm]
Jun 24 18:57:23 T40p kernel: [  110.032008]  [<f8a982b5>] ? radeon_bo_list_validate+0x55/0x90 [radeon]
Jun 24 18:57:23 T40p kernel: [  110.032008]  [<f8aacea1>] ? radeon_cs_ioctl+0x111/0x1f0 [radeon]
Jun 24 18:57:23 T40p kernel: [  110.032008]  [<f83d589b>] ? drm_ioctl+0x14b/0x390 [drm]
Jun 24 18:57:23 T40p kernel: [  110.032008]  [<f8aacd90>] ? radeon_cs_ioctl+0x0/0x1f0 [radeon]
Jun 24 18:57:23 T40p kernel: [  110.032008]  [<c100a2fb>] ? restore_i387_fxsave+0x6b/0x80
Jun 24 18:57:23 T40p kernel: [  110.032008]  [<f83d5750>] ? drm_ioctl+0x0/0x390 [drm]
Jun 24 18:57:23 T40p kernel: [  110.032008]  [<c10a50db>] ? vfs_ioctl+0x2b/0xb0
Jun 24 18:57:23 T40p kernel: [  110.032008]  [<c10a5959>] ? do_vfs_ioctl+0x79/0x600
Jun 24 18:57:23 T40p kernel: [  110.032008]  [<c10438bd>] ? __remove_hrtimer+0x2d/0x90
Jun 24 18:57:23 T40p kernel: [  110.032008]  [<c102ec74>] ? do_setitimer+0x154/0x1a0
Jun 24 18:57:23 T40p kernel: [  110.032008]  [<c102ed09>] ? sys_setitimer+0x49/0xb0
Jun 24 18:57:23 T40p kernel: [  110.032008]  [<c10a5f1d>] ? sys_ioctl+0x3d/0x70
Jun 24 18:57:23 T40p kernel: [  110.032008]  [<c1002c50>] ? sysenter_do_call+0x12/0x26
Jun 24 18:57:23 T40p kernel: [  110.032008] Code: ff ff 89 44 24 08 e9 63 ff ff ff f7 c7 01 00 00 00 75 1f f7 c7 02 00 00 00 75 30 f7 c7 04 00 00 00 75 19 89 c1 83 e0 03 c1 e9 02 <f3> a5 e9 76 ff ff ff 0f b6 16 48 46 88 17 47 eb d7 8b 16 83 e8 
Jun 24 18:57:23 T40p kernel: [  110.032008] EIP: [<f8aad0da>] radeon_cs_update_pages+0x15a/0x190 [radeon] SS:ESP 0068:f6207d0c
Jun 24 18:57:23 T40p kernel: [  110.032008] CR2: 00000000f8f50000
Jun 24 18:57:23 T40p kernel: [  110.032008] ---[ end trace 38fee5bfe123df75 ]---
Jun 24 18:57:23 T40p kernel: [  110.070427] done.
Jun 24 18:57:23 T40p kernel: [  110.155076] [drm:drm_release] *ERROR* Device busy: 1

Kernel 2.6.35-rc3, which already has those patches applied, resumes with a black screen; I can't do anything except trigger SysRq.
Comment 8 Sedat Dilek 2010-08-13 09:27:10 UTC
On a T40p with an RV250 I see the same issue with the 2.6.35-git12 Linux kernel:

[drm:radeon_ib_schedule] *ERROR* radeon: couldn't schedule IB(0).
[drm:radeon_cs_ioctl] *ERROR* Faild to schedule IB !

This happens after I force a suspend via:

   # /usr/sbin/pm-suspend

- Sedat -
Comment 9 Sedat Dilek 2010-08-13 09:35:01 UTC
Created attachment 27429 [details]
lspci output for 2.6.35-git12 kernel
Comment 10 Sedat Dilek 2010-08-13 09:43:20 UTC
Created attachment 27430 [details]
dmesg for 2.6.35-git12 kernel
Comment 11 Sedat Dilek 2010-08-13 09:48:10 UTC
Created attachment 27431 [details]
Output of 'LIBGL_VERBOSE=debug glxinfo 2>/dev/null > glxinfo.txt'
Comment 12 Sedat Dilek 2010-08-28 12:34:49 UTC
I tested again on 2.6.36-rc2-git4 plus pulled in drm-fixes from Dave's drm-2.6 tree.
The problem still remains.

Booting with "radeon.agpmode=-1" makes 'pm-suspend' successful.

I will add the video-bios dump if this helps.
Comment 13 Sedat Dilek 2010-08-28 12:35:59 UTC
Created attachment 28161 [details]
Video-BIOS dump of RV250 in an IBM T40p
Comment 14 Sedat Dilek 2010-08-28 12:43:26 UTC
How to dump the video BIOS (here on an upstream 2.6.36 kernel):

# lspci | grep "VGA compatible controller"
01:00.0 VGA compatible controller: ATI Technologies Inc Radeon RV250 [Mobility FireGL 9000] (rev 02)

# find /sys/ -name rom | grep 01:00.0
/sys/devices/pci0000:00/0000:00:01.0/0000:01:00.0/rom

# echo 1 > /sys/devices/pci0000:00/0000:00:01.0/0000:01:00.0/rom

# cat /sys/devices/pci0000:00/0000:00:01.0/0000:01:00.0/rom > /tmp/vbios_rv250.bin

# echo 0 > /sys/devices/pci0000:00/0000:00:01.0/0000:01:00.0/rom
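
A dump made this way can be sanity-checked before it is attached: a PCI expansion ROM image starts with the signature bytes 0x55 0xAA. The C sketch below tests only that signature. The function name looks_like_pci_rom is made up for illustration, and a passing check does not prove that the dump is complete.

```c
#include <stddef.h>

/* Hypothetical helper (not from the kernel or any library): returns 1 if
 * the buffer starts with the PCI expansion ROM signature 0x55 0xAA,
 * and 0 otherwise. */
int looks_like_pci_rom(const unsigned char *buf, size_t len)
{
    if (buf == NULL || len < 2)
        return 0;
    return buf[0] == 0x55 && buf[1] == 0xAA;
}
```

To use it, read the first bytes of /tmp/vbios_rv250.bin with fread() and pass the buffer and the number of bytes read.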
Comment 15 Sedat Dilek 2010-09-17 11:30:29 UTC
I did a debug run with 2.6.36-rc4-git3 + (radeon) backlight-type patches [1,2]:

Booted with radeon.modeset=1 and drm.debug=15.
Logs attached.

- Sedat -

[1] https://patchwork.kernel.org/patch/163971/
[2] https://patchwork.kernel.org/patch/182352/
Comment 16 Sedat Dilek 2010-09-17 11:34:47 UTC
Created attachment 30322 [details]
2.6.36-rc4-git3 kern.log (pm-suspend + pm-resume)
Comment 17 Sedat Dilek 2010-09-17 11:35:23 UTC
Created attachment 30332 [details]
2.6.36-rc4-git3 debug (pm-suspend + pm-resume)
Comment 18 Sedat Dilek 2010-09-17 11:36:18 UTC
Created attachment 30342 [details]
2.6.36-rc4-git3 syslog (pm-suspend + pm-resume)
Comment 19 Florian Mickler 2010-10-07 21:32:15 UTC
Comment on attachment 30332 [details]
2.6.36-rc4-git3 debug (pm-suspend + pm-resume)

A bit late over here... turns out this was "7zXZ" (an xz-compressed file)...
Comment 20 Sedat Dilek 2010-10-07 21:38:46 UTC
You can't read?
It was compressed with xz-utils from official Debian/sid.
Shall I attach it in another archive format?
Comment 21 Florian Mickler 2010-10-07 22:31:27 UTC
Created attachment 32762 [details]
debug patch to see where we fail to set up the ring

You have this in your kern.log file:
[drm] radeon: ring at 0x00000000D0000000

which doesn't look so good.
From there on it goes bad:

Sep 17 13:12:12 tbox kernel: [  381.304184] [drm:r100_ring_test] *ERROR* radeon: ring test failed (sracth(0x15E4)=0xCAFEDEAD)
Sep 17 13:12:12 tbox kernel: [  381.304187] [drm:r100_cp_init] *ERROR* radeon: cp isn't working (-22).
Sep 17 13:12:12 tbox kernel: [  381.304191] radeon 0000:01:00.0: failled initializing CP (-22).

which explains the 
"[drm:radeon_ib_schedule] *ERROR* radeon: couldn't schedule IB(0)." message. 

cp.ready is false, due to r100_cp_init failing.

Let's see if we can narrow down what fails. Can you apply this patch and post the resulting kern.log?
Comment 22 Florian Mickler 2010-10-07 22:32:46 UTC
(In reply to comment #20)
> You can't read?

I wondered that myself...
Comment 23 Sedat Dilek 2010-10-08 09:00:24 UTC
Thanks Florian for taking care of this BR.

I applied your patch from above against 2.6.36-rc7 and will attach the logs.

Here is the legend for my logs (messages, kern.log and debug):

1: 2010-10-08 10:35: boot-up into runlevel-3
2: 2010-10-08 10:36: startx
3: 2010-10-08 10:37: pm-suspend
4: 2010-10-08 10:38: power-on/pm-resume
5: 2010-10-08 10:39: poweroff

EXAMPLE:
So "3_debug.txt" covers all logs written to /var/log/debug after running the "pm-suspend" command.

I hope this helps narrow down the problem.
Comment 24 Sedat Dilek 2010-10-08 09:01:37 UTC
Created attachment 32812 [details]
Tarball of logs for Florian Mickler
Comment 25 Sedat Dilek 2010-10-08 09:02:45 UTC
Created attachment 32822 [details]
Checksum for tarball of logs for Florian Mickler
Comment 26 Sedat Dilek 2010-10-08 11:52:42 UTC
Created attachment 32852 [details]
Some more debugging output in r100_cp_init() + Fix typos in r100_ring_test()

With the attached patch, I now get the warning below. Does it look more like an ACPI issue?

- Sedat -

[ /var/log/kern.log ]
...
Oct  8 13:38:09 tbox kernel: [   94.308199] [drm:r100_ring_test] *ERROR* radeon: ring test failed (scratch(0x15E4)=0xCAFEDEAD)
Oct  8 13:38:09 tbox kernel: [   94.308202] ------------[ cut here ]------------
Oct  8 13:38:09 tbox kernel: [   94.308236] WARNING: at /home/sd/src/linux-2.6/linux-2.6.36-rc7/debian/build/source_i386_none/drivers/gpu/drm/radeon/r100.c:1028 r100_cp_init+0x5b6/0x5db [radeon]()
Oct  8 13:38:09 tbox kernel: [   94.308240] Hardware name: 2374SG6
Oct  8 13:38:09 tbox kernel: [   94.308242] Modules linked in: sco bnep rfcomm l2cap bluetooth aes_i586 acpi_cpufreq mperf aes_generic cpufreq_stats cpufreq_userspace cpufreq_conservative ppdev cpufreq_powersave lp dm_crypt binfmt_misc ext4 snd_intel8x0m snd_intel8x0 snd_ac97_codec jbd2 ac97_bus crc16 snd_pcm_oss radeon snd_mixer_oss thinkpad_acpi snd_pcm arc4 snd_seq_midi ecb ath5k snd_rawmidi snd_seq_midi_event mac80211 ttm snd_seq ath drm_kms_helper pcmcia drm cfg80211 i2c_algo_bit rfkill i2c_i801 nsc_ircc yenta_socket i2c_core pcmcia_rsrc snd_timer snd_seq_device joydev pcmcia_core tpm_tis irda shpchp snd tpm parport_pc crc_ccitt snd_page_alloc pci_hotplug soundcore serio_raw led_class psmouse tpm_bios video nvram processor battery ac parport rng_core pcspkr output button evdev fuse autofs4 ext3 jbd mbcache dm_mod usbhid hid sg sr_mod sd_mod crc_t10dif cdrom ata_generic ata_piix libata uhci_hcd ehci_hcd usbcore scsi_mod thermal e1000 floppy thermal_sys nls_base [last unloaded: scsi_wait_scan]
Oct  8 13:38:09 tbox kernel: [   94.308314] Pid: 2052, comm: kworker/u:5 Not tainted 2.6.36-rc7-686 #1
Oct  8 13:38:09 tbox kernel: [   94.308317] Call Trace:
Oct  8 13:38:09 tbox kernel: [   94.308327]  [<c102eff1>] ? warn_slowpath_common+0x6a/0x7b
Oct  8 13:38:09 tbox kernel: [   94.308345]  [<f8f5aceb>] ? r100_cp_init+0x5b6/0x5db [radeon]
Oct  8 13:38:09 tbox kernel: [   94.308349]  [<c102f00f>] ? warn_slowpath_null+0xd/0x10
Oct  8 13:38:09 tbox kernel: [   94.308369]  [<f8f5aceb>] ? r100_cp_init+0x5b6/0x5db [radeon]
Oct  8 13:38:09 tbox kernel: [   94.308388]  [<f8f5ba4b>] ? r100_startup+0x1f8/0x246 [radeon]
Oct  8 13:38:09 tbox kernel: [   94.308403]  [<f8f33378>] ? radeon_resume_kms+0x77/0xe6 [radeon]
Oct  8 13:38:09 tbox kernel: [   94.308407]  [<c114bcf5>] ? pci_legacy_resume+0x23/0x2c
Oct  8 13:38:09 tbox kernel: [   94.308411]  [<c114bdab>] ? pci_pm_resume+0x0/0x60
Oct  8 13:38:09 tbox kernel: [   94.308418]  [<c11c15b5>] ? pm_op+0x8f/0x13f
Oct  8 13:38:09 tbox kernel: [   94.308422]  [<c11c19aa>] ? device_resume+0x3a/0xb3
Oct  8 13:38:09 tbox kernel: [   94.308425]  [<c11c1cfd>] ? async_resume+0x13/0x33
Oct  8 13:38:09 tbox kernel: [   94.308430]  [<c1048df9>] ? async_run_entry_fn+0x8b/0x121
Oct  8 13:38:09 tbox kernel: [   94.308436]  [<c103fe46>] ? process_one_work+0x181/0x25e
Oct  8 13:38:09 tbox kernel: [   94.308440]  [<c1048d6e>] ? async_run_entry_fn+0x0/0x121
Oct  8 13:38:09 tbox kernel: [   94.308444]  [<c1041319>] ? worker_thread+0xf3/0x1ed
Oct  8 13:38:09 tbox kernel: [   94.308448]  [<c1041226>] ? worker_thread+0x0/0x1ed
Oct  8 13:38:09 tbox kernel: [   94.308452]  [<c1043926>] ? kthread+0x63/0x68
Oct  8 13:38:09 tbox kernel: [   94.308455]  [<c10438c3>] ? kthread+0x0/0x68
Oct  8 13:38:09 tbox kernel: [   94.308461]  [<c100357e>] ? kernel_thread_helper+0x6/0x10
Oct  8 13:38:09 tbox kernel: [   94.308464] ---[ end trace a248808f0af92caf ]---
Oct  8 13:38:09 tbox kernel: [   94.308467] [drm:r100_cp_init] *ERROR* radeon: cp isn't working (-22).
Oct  8 13:38:09 tbox kernel: [   94.308470] radeon 0000:01:00.0: failled initializing CP (-22).
Comment 27 Sedat Dilek 2010-10-08 11:56:53 UTC
Created attachment 32862 [details]
kern.log.xz
Comment 28 Sedat Dilek 2010-10-08 11:57:56 UTC
Created attachment 32872 [details]
debug.xz
Comment 29 Sedat Dilek 2010-10-08 11:59:54 UTC
Created attachment 32882 [details]
messages.xz
Comment 30 Florian Mickler 2010-10-08 17:48:12 UTC
No, that's one of the warnings I put in place with the debug patch.

It is definitely the radeon driver which is failing. Maybe because of some other failure in the system but I guess there is a bug somewhere in the suspend or resume phase of the radeon driver. 

I'm not really familiar with the radeon driver, but will look into it for some obvious or easy to spot errors.
Comment 31 Florian Mickler 2010-10-08 17:52:06 UTC
Oh, and please don't fix those typos! They are the only way to distinguish which code path is actually running...
You just threw me off there for a while, wondering why it's not the typoed scratch message...
Comment 32 Sedat Dilek 2010-10-10 13:40:58 UTC
(Patch see https://bugzilla.kernel.org/show_bug.cgi?id=16140#c26)

With applying my patch from above, it's this section (Line #1028 and following) from r100_cp_init() doing the problem:

 940 int r100_cp_init(struct radeon_device *rdev, unsigned ring_size)
...
1026         radeon_ring_start(rdev);
1027         r = radeon_ring_test(rdev);
1028         if (r) {
1029                 DRM_ERROR("radeon: cp isn't working (%d).\n", r);
1030                 return r;
1031         }
1032         rdev->cp.ready = true;
1033         return 0;
1034 }
...

Replacing "if (r) {" with "if (WARN_ON(r)) {" shows the above Call-trace.
I looked into r600.c source-code and put "rdev->cp.ready = true;" before Line "r = radeon_ring_test(rdev);", not helping.
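
The WARN_ON() substitution is safe because the kernel's WARN_ON() evaluates to the truth value of its condition, so wrapping the test in it does not change the control flow; it only adds the warning and the stack dump. A userspace imitation of that behaviour, as a rough sketch (the names warn_on_sketch, WARN_ON_SKETCH and cp_init_sketch are invented here, and the real macro prints a full call trace rather than one line):

```c
#include <stdio.h>

/* Userspace stand-in for the kernel's WARN_ON(): report where the
 * condition fired and hand the truth value back to the caller. */
static int warn_on_sketch(int cond, const char *file, int line)
{
    if (cond)
        fprintf(stderr, "WARNING: at %s:%d\n", file, line);
    return cond;
}

#define WARN_ON_SKETCH(cond) warn_on_sketch(!!(cond), __FILE__, __LINE__)

/* Mirrors the shape of the tail of r100_cp_init(): a non-zero ring-test
 * result is reported and returned unchanged, zero falls through. */
int cp_init_sketch(int ring_test_result)
{
    int r = ring_test_result;

    if (WARN_ON_SKETCH(r))
        return r;
    return 0;
}
```

Calling cp_init_sketch(-22) prints one warning line and returns -22, the same error path as without the macro.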

Again inspired from r600.c, I put Line #966 "r100_cp_load_microcode(rdev);" after "r = radeon_ring_init(rdev, ring_size);", this resulted in a not-so-garbled screen, after hanging:
pm-resume in X -> switching to vt-1 -> killing X -> restarting startx

This is doing no harm, see my logs.
-		DRM_ERROR("radeon: ring test failed (sracth(0x%04X)=0x%08X)\n",
+		DRM_ERROR("radeon: ring test failed (scratch(0x%04X)=0x%08X)\n",

I am not sure what you mean with "radeon driver": the one in the kernel or the DDX (xf86-video-ati).

One NOTE:
In Line #3728 there is a commented "r100_gpu_init(rdev);", it is nowhere "defined". I see in r600.c a *_gpu_init() and a *_cp_start() in case of resuming. Just a hint, if you wanna compare or dig into it.

IIRC it would make sense to interprete correctly the Call-trace, I am not that familiar with "the internals".

Sorry, I don't want to experiment with older Linux kernels: my graphics driver stack is mostly latest-stable or from Git, and I am not sure what would happen.

[1] http://git.kernel.org/?p=linux/kernel/git/torvalds/linux-2.6.git;a=blob;f=drivers/gpu/drm/radeon/r100.c;h=e151f16a8f86d73090ec6a4eb17a3590661868db;hb=HEAD#l1028
[2] http://git.kernel.org/?p=linux/kernel/git/torvalds/linux-2.6.git;a=blob;f=drivers/gpu/drm/radeon/r100.c;h=e151f16a8f86d73090ec6a4eb17a3590661868db;hb=HEAD#l966
[3] http://git.kernel.org/?p=linux/kernel/git/torvalds/linux-2.6.git;a=blob;f=drivers/gpu/drm/radeon/r100.c;h=e151f16a8f86d73090ec6a4eb17a3590661868db;hb=HEAD#l3728
Comment 33 Florian Mickler 2010-10-10 23:15:54 UTC
Hi!

I did find a rv280 card. On that card, the screen is garbled after resume, but the ring test doesn't fail. It is using the same code-paths as far as I see. 

So we can probably conclude: 
1. garbled screen and the ring setup failure are independent failures. 
2. the ring setup failure is something specific to your card / or chipset.

Do you see differences in lspci -vv output before and after suspend? 

(In reply to comment #32)
> (Patch see https://bugzilla.kernel.org/show_bug.cgi?id=16140#c26)
> 
> With applying my patch from above, it's this section (Line #1028 and
> following)
> from r100_cp_init() doing the problem:
> 
>  940 int r100_cp_init(struct radeon_device *rdev, unsigned ring_size)
> ...
> 1026         radeon_ring_start(rdev);
> 1027         r = radeon_ring_test(rdev);
> 1028         if (r) {
> 1029                 DRM_ERROR("radeon: cp isn't working (%d).\n", r);
> 1030                 return r;
> 1031         }
> 1032         rdev->cp.ready = true;
> 1033         return 0;
> 1034 }
> ...
> 
> Replacing "if (r) {" with "if (WARN_ON(r)) {" shows the above Call-trace.

Yes. This is also seen by the "radeon: cp isn't working (-22)." 
Line in your dmesg. But of course the callstack is handy to verify we are
looking at the right code.
I didn't put a WARN there, because we already knew it failed.
I wondered whether some tests without error messages failed, and put the WARNs there.
But in retrospect we would have seen that, because the above error message would not have been preceded by the ring-test error message.

> I looked into r600.c source-code and put "rdev->cp.ready = true;" before Line
> "r = radeon_ring_test(rdev);", not helping.

If you are interested how the driver works, have a look at http://www.botchco.com/agd5f/?p=50

The "ring" is a buffer where the driver writes commands and the gpu reads those commands and executes them. It's a ring buffer.
http://en.wikipedia.org/wiki/Circular_buffer
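
As a rough illustration of that idea (this is not the radeon code: the structure, its field names and the size of 8 entries are assumptions made for the sketch), a command ring can be modelled as an array plus a write index owned by the producer and a read index owned by the consumer:

```c
#include <stdint.h>

#define RING_ENTRIES 8u /* illustrative size; must be a power of two */

struct ring_sketch {
    uint32_t buf[RING_ENTRIES];
    unsigned int wptr; /* next slot the driver writes */
    unsigned int rptr; /* next slot the GPU reads */
};

/* Driver side: queue one command word. Returns 0 on success, -1 if full.
 * One slot is always left empty so that full and empty are distinguishable. */
int ring_write(struct ring_sketch *ring, uint32_t cmd)
{
    unsigned int next = (ring->wptr + 1) & (RING_ENTRIES - 1);

    if (next == ring->rptr)
        return -1;
    ring->buf[ring->wptr] = cmd;
    ring->wptr = next;
    return 0;
}

/* GPU side: fetch one command word. Returns 0 on success, -1 if empty. */
int ring_read(struct ring_sketch *ring, uint32_t *cmd)
{
    if (ring->rptr == ring->wptr)
        return -1;
    *cmd = ring->buf[ring->rptr];
    ring->rptr = (ring->rptr + 1) & (RING_ENTRIES - 1);
    return 0;
}
```

If the consumer stops advancing its read index, the producer eventually finds the ring full and cannot queue anything more.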

If you set cp.ready and the hardware isn't really ready, that won't help. 

The ring test works like so: The driver writes a value (0xCAFEDEAD) into the scratch register and instructs the gpu via the ringbuffer to overwrite it with "0xDEADBEEF". Then the driver checks whether the gpu does it. If after N udelay(1) calls the gpu has not written the expected value into that register, the test fails.
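
The procedure can be modelled in a few lines of plain C. This is a toy simulation under stated assumptions, not the driver: struct fake_gpu, fake_gpu_tick and ring_test_sketch are invented names, the "ring" is reduced to a single pending command, and one loop iteration stands in for one udelay(1).

```c
#include <stdint.h>

#define SCRATCH_SEED   0xCAFEDEADu /* written by the driver before the test */
#define SCRATCH_EXPECT 0xDEADBEEFu /* what the GPU is asked to write back */

/* Toy GPU model: `alive` says whether the command processor is fetching
 * commands; `pending` holds one queued "write this value to scratch". */
struct fake_gpu {
    int alive;
    int has_pending;
    uint32_t pending;
    uint32_t scratch;
};

/* One time step of the toy GPU: a live GPU executes the queued command. */
static void fake_gpu_tick(struct fake_gpu *gpu)
{
    if (gpu->alive && gpu->has_pending) {
        gpu->scratch = gpu->pending;
        gpu->has_pending = 0;
    }
}

/* Returns 0 if the scratch register changed within `timeout` polls,
 * -22 (-EINVAL, the value seen in the logs) otherwise. */
int ring_test_sketch(struct fake_gpu *gpu, unsigned int timeout)
{
    unsigned int i;

    gpu->scratch = SCRATCH_SEED;   /* direct register write */
    gpu->pending = SCRATCH_EXPECT; /* command queued on the ring */
    gpu->has_pending = 1;

    for (i = 0; i < timeout; i++) {
        fake_gpu_tick(gpu);        /* stands in for udelay(1) */
        if (gpu->scratch == SCRATCH_EXPECT)
            return 0;
    }
    return -22;
}
```

With alive set to 0 the function times out and leaves the scratch value at 0xCAFEDEAD, which is the pattern in the "ring test failed (...=0xCAFEDEAD)" log line: the seed is still there because the command was never executed.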

But of course, we are left to wonder as to why.

> Again inspired from r600.c, I put Line #966 "r100_cp_load_microcode(rdev);"
> after "r = radeon_ring_init(rdev, ring_size);", this resulted in a
> not-so-garbled screen, after hanging:
> pm-resume in X -> switching to vt-1 -> killing X -> restarting startx

That's interesting. Can you elaborate on the hanging?

> 
> This is doing no harm, see my logs.
> -        DRM_ERROR("radeon: ring test failed (sracth(0x%04X)=0x%08X)\n",
> +        DRM_ERROR("radeon: ring test failed (scratch(0x%04X)=0x%08X)\n",

True, but it is inconvenient. If you 'grep -r' on that error message you only get the r100 one. With the typo corrected, you get both, the r100 and the r600 one. I agree, not a big deal, but...

> I am not sure what you mean with "radeon driver": the one in the kernel or
> the
> DDX (xf86-video-ati).

Always the kernel one, at the moment.

> 
> One NOTE:
> In Line #3728 there is a commented "r100_gpu_init(rdev);", it is nowhere
> "defined". I see in r600.c a *_gpu_init() and a *_cp_start() in case of
> resuming. Just a hint, if you wanna compare or dig into it.
> 

Yes. I wondered about that too. 'git-blame' shows it is a leftover from:

commit 90aca4d2740255bd130ea71a91530b9920c70abe
Author: Jerome Glisse <jglisse@redhat.com>
Date:   Tue Mar 9 14:45:12 2010 +0000

    drm/radeon/kms: simplify & improve GPU reset V2

...

> IIRC it would make sense to interprete correctly the Call-trace, I am not
> that
> familiar with "the internals".

The call-trace is not complicated. The topmost function is the function that is currently executing. The second entry is the function it will return to. The third function is the function the second function will return to. and so on. 

see: http://en.wikipedia.org/wiki/Call_stack

I don't know about items 1 to 3 in that trace. But I guess they are just artifacts of the WARN_ON macro.

If you look into the code, you see that the call trace is to be expected. 
What has to be considered bad, is that the ring-test fails because the gpu doesn't process the ringbuffer in time. 


In comment #12 you said that turning off AGP would fix the suspend issue. Which one was that: the ring-test error message, the garbled screen, or both?

In my setup (rv280) it only worked once out of ten times. First time, it came back without garbled screen, but all subsequent suspend/resumes did garble the screen. 


On that screen garble I have a few thoughts. It is somewhat periodic and always follows a pattern for me. I can clear the corruption by changing consoles for example. Then it always scribbles in a predetermined pattern on the framebuffer where it stays (overwriting itself with a high frequency), till I change consoles.  Same for you?
Comment 34 Sedat Dilek 2010-10-29 18:00:36 UTC
Sorry, but I don't think I will follow this BR for a while.
Currently, busy with other stuff.
Comment 35 bjaglin 2010-11-16 22:55:44 UTC
I have an RV250 (T40p, same hardware as above) and I have been experiencing the same problems as Sedat for several months (on resume, garbled screen in X AND the ring setup failure). I still can't resume properly on 2.6.32-rc7. To follow up on the previous message: turning off AGP solves both problems, completely. Can I do something to help?
Comment 36 Cactus 2010-12-04 16:22:30 UTC
I have had the same problem for several months.
I disabled KMS with radeon.modeset=0, and everything worked fine until today.
But the latest upgrades (I use Arch Linux) broke that.
First, X didn't start, so I enabled KMS, but then I lost suspend-to-RAM (black screen at resume).
I tried radeon.agpmode=-1, but it doesn't work.

What can I do?

I really need s2ram for my laptop (it is too old; booting takes several minutes!)

Thanks for your help.

Cactus.

System information:
kernel 2.6.36.1-3, xf86-video-ati 6.13.2-2, ati-dri 7.9-1, libgl 7.9-1, mesa 7.9-1.
Older versions (no problem with radeon.modeset=0): kernel 2.6.35.8-1, xf86-video-ati 6.13.2-1, ati-dri 7.8.2-3, libgl 7.8.2-3, mesa 7.8.2-3
Comment 37 Keivan Moradi 2010-12-13 05:27:18 UTC
I've followed the Gentoo documentation to enable standby on my workstation. When I run the command "hibernate-ram" in the console, the system turns off, but it freezes on resume from Suspend/Standby/Sleep (the monitor shows a vertical rainbow, and alt + sysrq + b cannot reboot the system).

hardware: dual Xeon motherboard with only one 5540 CPU + very old PS/2 mouse and keyboard + old X300SE (RV370) ATI VGA
kernel: gentoo-sources (2.6.34 r12) + original kernel drivers.

When I reconfigured the kernel not to use the ATI drivers (I used the VESA drivers instead), suspend worked without problems. Now I'm sure there is a bug in the radeon drivers.

In my next experiment I configured the kernel to use only the ATI driver under Direct Rendering Manager (i.e. I removed the ATI framebuffer driver). This time the system suspends, and on resume there is a coloured screen with a blinking cursor but no working console. alt + sysrq + b can reboot my box.
Comment 38 Jérôme Glisse 2011-03-07 18:49:37 UTC
Michael, do you still have this issue with a more recent kernel?

Other people having issues: please open your own bug, but first try the latest kernel to check whether the issue is resolved for you.

Note also that the kernel framebuffer driver is outdated and shouldn't be used; we only actively support KMS for radeon.
Comment 39 Sedat Dilek 2011-03-07 20:09:43 UTC
I have tested with linux-next (next-20110307):

1. KMS enabled = NOPE
2. KMS enabled + radeon.agpmode=-1 = OK
   (dmesg: [drm] Forcing AGP to PCI mode)

My Xorg stack changed to libdrm-2.4.23, mesa-7.10.1, ddx-1:6.14.0 and xserver-1.10-rc3.

Killing the X server from runlevel 3 several times does not help; the GPU seems to hang.
Still, after I attached so much information to this BR, no one has told me how to force a GPU reset in runlevel 3.
Can you please enlighten me and others?

- Sedat -
Comment 40 Sedat Dilek 2011-03-07 20:23:45 UTC
Just in case it matters, here is my PM userspace:

# dpkg -l | egrep -i 'acpid|pm-utils'
ii  acpid                                                1:2.0.8-2                           Advanced Configuration and Power Interface event daemon
ii  pm-utils                                             1.4.1-6                             utilities and scripts for power management

I still do suspend via pm-suspend (pm-utils) command.

- Sedat -
Comment 41 Michael Long 2011-03-08 18:32:33 UTC
Jérôme, unfortunately I can only confirm what Sedat reported previously.
I was testing it with the most recent git-kernel.
Comment 42 Michael Long 2011-03-08 18:33:58 UTC
Created attachment 50332 [details]
Most recent resume trace
Comment 43 Robert de Rooy 2011-03-22 08:38:35 UTC
This same bug is also reported against Fedora:
https://bugzilla.redhat.com/show_bug.cgi?id=531825

There it was determined that it is not necessary to completely disable AGP, but that dropping from AGP 4X to AGP 1X is sufficient to let the system resume.

In other words, boot with: radeon.agpmode=1

There is, however, a serious performance penalty in doing so for certain applications (games) that copy large textures around.
Comment 44 Victor Mataré 2011-08-27 01:37:28 UTC
I'm still seeing this on Ubuntu 11.04 with 2.6.38-11-generic. Any hope that this might be fixed?
Comment 45 soren121 2012-02-09 04:08:53 UTC
No fix in sight, but there is a workaround: set a primary password in the BIOS. The BIOS will initialize the video card on wake-up and allow Linux to resume normally. Inconvenient? Yes, but it's better than bottlenecking your already out-of-date graphics card. I posted this workaround on the Canonical Launchpad bug report some time ago; I guess it hasn't been shared outside as of yet.
Comment 46 Paul Bolle 2012-02-09 13:54:32 UTC
(In reply to comment #45)
> I posted this workaround on the Canonical Launchpad
> bug report some time ago; I guess it hasn't been shared outside as of yet.

Could you add a link to that bug report?
Comment 47 Paul Bolle 2012-02-09 16:12:36 UTC
(In reply to comment #45)
> No fix in sight, but there is a workaround: set a primary password in the
> BIOS.
> The BIOS will initialize the video card on wake-up and allow Linux to resume
> normally.

0) This wasn't on a ThinkPad, was it? Because on a ThinkPad T41 that triggers this issue there's no "primary" BIOS password. Fiddling with other BIOS passwords doesn't seem to help: they're not even asked for on resume.

1) Could you please provide some further details?
Comment 48 Wirawan Purwanto 2012-02-09 16:41:59 UTC
The trick mentioned in comment #45 applies to the Dell Latitude D600 model. Here is the link to Canonical's bug report:

https://bugs.launchpad.net/ubuntu/+source/linux/+bug/559163

Paul Bolle is right that this trick depends on the primary BIOS password being requested on resume. It did not work for me on a Dell Inspiron 600m (which is a "consumer" sister model of the D600).

FWIW here are some related bug pages on Ubuntu/Launchpad:

  "resume broken on ATI radeon RV250"
  https://bugs.launchpad.net/ubuntu/+source/linux/+bug/557224
  (affecting Thinkpad T41)

  "[Dell Computer Corporation Inspiron 600m] suspend/resume failure"
  https://bugs.launchpad.net/linux/+bug/471872
  (affecting my Dell Inspiron 600m computer)
Comment 49 Sedat Dilek 2012-11-23 18:33:14 UTC
Just FYI: There is now an upstream fix (see [1]).

commit 45171002b01b2e2ec4f991eca81ffd8430fd0aec
"radeon: add AGPMode 1 quirk for RV250"

[1] http://git.kernel.org/?p=linux/kernel/git/torvalds/linux.git;a=commit;h=45171002b01b2e2ec4f991eca81ffd8430fd0aec

- Sedat -
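
The commit title suggests that the driver now selects AGP mode 1 by itself on affected machines, instead of relying on radeon.agpmode=1 being passed by hand. The patch is not reproduced here. The following is only a sketch of how such a quirk lookup can work; the structure, the function name and every ID value in the table are placeholders, not the entries from the commit.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical quirk entry: a host bridge / graphics chip pairing and
 * the AGP mode to force for it. */
struct agpmode_quirk_sketch {
    uint16_t bridge_vendor, bridge_device;
    uint16_t chip_vendor, chip_device;
    int default_mode;
};

/* Placeholder IDs for illustration only. */
static const struct agpmode_quirk_sketch quirks[] = {
    { 0x1111, 0x2222, 0x3333, 0x4444, 1 },
};

/* Returns the forced AGP mode for a matching pairing, or `requested`
 * when no quirk applies. */
int pick_agp_mode(uint16_t bridge_vendor, uint16_t bridge_device,
                  uint16_t chip_vendor, uint16_t chip_device, int requested)
{
    size_t i;

    for (i = 0; i < sizeof(quirks) / sizeof(quirks[0]); i++) {
        const struct agpmode_quirk_sketch *q = &quirks[i];

        if (q->bridge_vendor == bridge_vendor &&
            q->bridge_device == bridge_device &&
            q->chip_vendor == chip_vendor &&
            q->chip_device == chip_device)
            return q->default_mode;
    }
    return requested;
}
```

In this sketch a matching entry always wins over the requested mode; whether an explicit module parameter should take precedence is a design choice the sketch does not settle.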