Bug 17702

Summary: Stable 2.6.35.x kernel regression: KMS/xorg fails to start, xorg log ending as DRI2 starts
Product: Drivers
Reporter: Duncan (1i5t5.duncan)
Component: Video(DRI - non Intel)
Assignee: drivers_video-dri
Status: CLOSED DUPLICATE
Severity: normal
CC: maciej.rutecki, rjw
Priority: P1
Hardware: All
OS: Linux
Kernel Version: 2.6.35.4
Subsystem:
Regression: Yes
Bisected commit-id:
Bug Depends on:
Bug Blocks: 16444
Attachments: config 2.6.35, dmesg.bad

Description Duncan 2010-09-03 06:53:59 UTC
Created attachment 28912 [details]
config 2.6.35

This regression is quite interesting as it may have to do with the recent xorg related security fix, tho I've not fully bisected it yet.

Until a few days ago, I had been running a late (around the last rc) 2.6.35 git kernel.  I tried upgrading to 2.6.36-rc3, and everything was fine until I tried to start X.  It froze, but the kernel was still alive, and after a magic SysRq-R, I could get it to take a three-finger salute and reboot.  (When such things happened before KMS, I could often get a text terminal back, but I've not figured out how with KMS: blind-switching to VT-1 and blind-issuing a "reset" command doesn't work like it often used to when X was frozen but the kernel was still responsive.  Any hints?)  So, wanting a kernel safe from that security vuln, I tried the latest stable, 2.6.35.4, and it had a very similar problem!  So I tried reverting to 2.6.35 (which I had been running only a handful of commits behind for some weeks, but which has that security vuln), and IT WORKED FINE!

That means the regression is post-2.6.35, and that both the stable series (2.6.35.4 at least) and Linus/HEAD (2.6.36-rc3 at least) are affected!  I've not yet bisected beyond that: now that I know it's a stable series bug, I decided I should file it and get the process started.  I do plan to bisect (tho these things seem to invariably happen when I'm short on time; there's an important conference coming up this weekend and a full schedule next week, so just when I get to it is a question).

Hardware is a now older dual socket Opteron 290, AGP based graphics, Tyan s2885 mobo, running a fairly new Radeon hd4650 graphics card.  Note that this IS an AGP card (probably not a lot of folks are running that new a card on AGP), that the chipset is the original AMD 8xx series with the AGPGART based IOMMU, and that it's a dual-socket, dual-core CPU arrangement.  In case it matters (bandwidth?), there are dual displays as well, 1920x1080 (full HD) each, stacked for 1920x2160, connected via the card's dual DVI ports.

Software:  Gentoo/~amd64, native kernel/xorg freedomware Radeon driver, KMS.
Versions should be pretty close to current: xorg-server-1.9.0, xf86-video-ati-6.13.1, mesa-7.8.2, libdrm-2.4.21-r1.  kde-4.5.0 as the xorg desktop environment, all compiled with gcc-4.5.x (currently 4.5.1 tho I don't believe I recompiled the entire system after 4.5.1, as I did after 4.5.0).

Here's the lines surrounding the freeze from the xorg log:

[   501.005] (II) RADEON(0): mem size init: gart size :7dff000 vram size: s:40000000 visible:3f7d7000
[   501.005] (II) RADEON(0): EXA: Driver will allow EXA pixmaps in VRAM
[   501.010] (**) RADEON(0): Display dimensions: (480, 270) mm
[   501.010] (**) RADEON(0): DPI set to (101, 203)
[   501.010] (II) Loading sub module "fb"
[   501.010] (II) LoadModule: "fb"
[   501.011] (II) Loading /usr/lib64/xorg/modules/libfb.so
[   501.046] (II) Module fb: vendor="X.Org Foundation"
[   501.046]    compiled for 1.9.0, module version = 1.0.0
[   501.046]    ABI class: X.Org ANSI C Emulation, version 0.4
[   501.046] (II) Loading sub module "ramdac"
[   501.046] (II) LoadModule: "ramdac"
[   501.046] (II) Module "ramdac" already built-in
[   501.046] (II) RADEON(0): GPU accel disabled or not working, using shadowfb for KMS
[   501.046] (II) Loading sub module "shadow"
[   501.046] (II) LoadModule: "shadow"
[   501.047] (II) Loading /usr/lib64/xorg/modules/libshadow.so
[   501.055] (II) Module shadow: vendor="X.Org Foundation"
[   501.055]    compiled for 1.9.0, module version = 1.1.0
[   501.055]    ABI class: X.Org ANSI C Emulation, version 0.4
[   501.055] (--) Depth 24 pixmap format is 32 bpp

That's the end of the last error log, tho it often gets in two more lines, here filled in from the current (working) xorg launch on 2.6.35.

[   118.405] (II) RADEON(0): [DRI2] Setup complete
[   118.405] (II) RADEON(0): [DRI2]   DRI driver: r600

Here's the next few lines in the sequence from the working log, AFTER the point at which it freezes with the bad kernels.

[   118.406] (II) RADEON(0): Front buffer size: 17280K
[   118.406] (II) RADEON(0): VRAM usage limit set to 920646K
[   118.428] (==) RADEON(0): Backing store disabled
[   118.428] (II) RADEON(0): Direct rendering enabled
[   118.434] (II) RADEON(0): Setting EXA maxPitchBytes
[   118.434] (II) EXA(0): Driver allocated offscreen pixmaps
[   118.434] (II) EXA(0): Driver registered support for the following operations:
[   118.434] (II)         Solid
[   118.434] (II)         Copy
[   118.434] (II)         Composite (RENDER acceleration)
[   118.434] (II)         UploadToScreen
[   118.434] (II)         DownloadFromScreen
[   118.434] (II) RADEON(0): Acceleration enabled
[   118.434] (==) RADEON(0): DPMS enabled
[   118.434] (==) RADEON(0): Silken mouse enabled
[   118.435] (II) RADEON(0): Set up textured video
[   118.435] (II) RADEON(0): RandR 1.2 enabled, ignore the following RandR disabled message.

So it looks pretty clear to me that as soon as it inits DRI2, it freezes.  Unfortunately, neither Option "NoAccel" nor Option "DRI" "0" in the xorg.conf.d file containing the graphics device config seems to do anything, so I've not yet tested an X start with acceleration off.  I'll have to turn OpenGL effects off in KDE from a working kernel, then reboot to the bad kernel to see if it gets any farther without OpenGL immediately activated.
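
The comparison made above (find where the truncated log stops, then read what the working log does next) can be sketched in a few lines of Python. This is only an illustration, not part of any X.Org tooling: the function names are made up, the sample lines are copied from the excerpts in this report, and the timestamp on the first good-log line is hypothetical, since that line is not quoted from the working log.

```python
import re

# Leading "[   501.005] " timestamp, as seen in the log excerpts above.
_TIMESTAMP = re.compile(r"^\[\s*\d+\.\d+\]\s?")


def strip_timestamp(line):
    """Return the log line without its leading [seconds.millis] stamp."""
    return _TIMESTAMP.sub("", line.rstrip("\n"))


def lines_after_truncation(bad_log, good_log, count=3):
    """Locate the last line of the truncated log inside the good log and
    return up to `count` messages that follow it there."""
    bad = [strip_timestamp(l) for l in bad_log if l.strip()]
    good = [strip_timestamp(l) for l in good_log if l.strip()]
    if not bad:
        return []
    last = bad[-1]
    # Search from the end so a repeated message matches its latest occurrence.
    for i in range(len(good) - 1, -1, -1):
        if good[i] == last:
            return good[i + 1 : i + 1 + count]
    return []  # the final bad-log line never appears in the good log


bad_log = [
    "[   501.055]    ABI class: X.Org ANSI C Emulation, version 0.4",
    "[   501.055] (--) Depth 24 pixmap format is 32 bpp",
]
good_log = [
    # Hypothetical timestamp on this first line; the messages are from the report.
    "[   118.404] (--) Depth 24 pixmap format is 32 bpp",
    "[   118.405] (II) RADEON(0): [DRI2] Setup complete",
    "[   118.405] (II) RADEON(0): [DRI2]   DRI driver: r600",
    "[   118.406] (II) RADEON(0): Front buffer size: 17280K",
]

for message in lines_after_truncation(bad_log, good_log):
    print(message)
```

Matching on the message text rather than the whole line is deliberate: the timestamps differ between boots, so only the part after the stamp is comparable.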

config-2.6.35 attached.
Comment 1 Michel Dänzer 2010-09-03 07:19:12 UTC
(In reply to comment #0)
> [   501.046] (II) RADEON(0): GPU accel disabled or not working, using shadowfb for KMS

Why is acceleration not working? If in doubt, attach the dmesg output.


> [   501.055] (--) Depth 24 pixmap format is 32 bpp
> 
> That's the end of the last error log, tho it often gets in two more lines,
> here filled in from the current (working) xorg launch on 2.6.35.
> 
> [   118.405] (II) RADEON(0): [DRI2] Setup complete
> [   118.405] (II) RADEON(0): [DRI2]   DRI driver: r600

Whenever the X log file ends abruptly like that, one needs to look at the X server's stderr output, which should be captured in the kdm/gdm log file.


> I'll have to turn OpenGL effects off in KDE from a working kernel, then
> reboot to the bad kernel to see if it gets any farther without OpenGL
> immediately activated.

It won't, the X log ends before it processes any client requests.
Comment 2 Duncan 2010-09-03 21:24:44 UTC
Switching to composite (from OpenGL) effects in KDE didn't help, but turning OpenGL off a different way, in xorg.conf.d, did.

With...

Section "Module"
    Disable "dri"
EndSection

... in xorg.conf.d, xorg (and kde) started just fine on 2.6.35.4, where it wouldn't start without that.

And with that, glxinfo now says it's using the software rasterizer, so it's unaccelerated opengl.

Of course, 2.6.35 works without dri disabled, using hardware OpenGL.

Now to see if 2.6.35.1 is good or bad...
Comment 3 Duncan 2010-09-03 22:52:01 UTC
2.6.35.1 is also bad.  So the bad commit is definitely among those added in 2.6.35.1, given 2.6.35 is good.  There are several DRM/Radeon/KMS commits there, tho.  Bisect time!
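
For readers unfamiliar with it, a bisect is a binary search over the commits between a known-good and a known-bad kernel; `git bisect` drives it by checking out the midpoint for each test build. The Python below is only a sketch of that idea, not of git's implementation. The commit names and the `is_bad` test are hypothetical stand-ins for "build this kernel, boot it, and see whether X starts".

```python
def first_bad_commit(commits, is_bad):
    """Binary-search `commits` (oldest first) for the first bad one.

    Assumes the parent of commits[0] is good, commits[-1] is bad, and
    every commit after the first bad one is bad as well.
    Returns (commit, number_of_kernels_tested).
    """
    lo, hi = 0, len(commits) - 1
    tested = 0
    while lo < hi:
        mid = (lo + hi) // 2
        tested += 1
        if is_bad(commits[mid]):
            hi = mid      # the first bad commit is mid or earlier
        else:
            lo = mid + 1  # the first bad commit is after mid
    return commits[lo], tested


# Hypothetical history of 20 commits, c01 (oldest) to c20 (newest),
# with c13 standing in for the commit that introduced the regression.
commits = ["c%02d" % n for n in range(1, 21)]
culprit, builds = first_bad_commit(commits, lambda c: c >= "c13")
print(culprit, builds)
```

With git itself, the equivalent starting point here would presumably be `git bisect start`, then `git bisect good v2.6.35` and `git bisect bad v2.6.35.1`. Each step halves the remaining range, so even a few dozen commits need only a handful of test boots.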

>> [   501.046] (II) RADEON(0): GPU accel disabled or not working, using shadowfb for KMS
> 
> Why is acceleration not working? If in doubt, attach the dmesg output.

That would appear to be the question of the hour, since it's obviously the problem.  Thanks for pointing that out. Looks like I've some digging to do there, as well.
Comment 4 Duncan 2010-09-05 09:11:20 UTC
Created attachment 29022 [details]
dmesg.bad

OK, here's the results of the bisect

44437579efca258e3c4a09f59838c8f933611990 is the first bad commit
commit 44437579efca258e3c4a09f59838c8f933611990
Author: Alex Deucher <alexdeucher@gmail.com>
Date:   Mon Jul 26 18:51:53 2010 -0400

    drm/radeon/kms/r7xx: add workaround for hw issue with HDP flush
    
    commit 812d046915f48236657f02c06d7dc47140e9ceda upstream.
    
    Use of HDP_*_COHERENCY_FLUSH_CNTL can cause a hang in certain
    situations.  Add workaround.
    
    Signed-off-by: Alex Deucher <alexdeucher@gmail.com>
    Signed-off-by: Dave Airlie <airlied@redhat.com>
    Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>

:040000 040000 08710601d4da663008c68e8f4985a7efc21dac64 c306035b3979c94c24addce318f977a83ee0a2d5 M      drivers


As to why it's doing it... looks like that accel off thing was a red herring -- I had tried Option "NoAccel" in xorg.conf.d, but that apparently affects 2D only, and since the problem is in DRM/OpenGL only, it didn't cure the issue here.  But as I said above, disabling dri in the Module section did.

Meanwhile, dmesg says it's a kernel NULL pointer dereference in process X.  Here's the full bug dump, and I'm attaching the full (bad) dmesg output.  (The tg3 line happens at boot, earlier; the SysRq line happens as I hit SysRq-R to take the keyboard out of raw mode, after which I hit the three-finger-salute, triggering the dmesg dump as part of the reboot.)


tg3 0000:02:09.0: eth0: Flow control is on for TX and on for RX
BUG: unable to handle kernel NULL pointer dereference at (null)
IP: [<ffffffff8124b8ff>] 0xffffffff8124b8ff
PGD 1c653b067 PUD 1c2c33067 PMD 0.
Oops: 0000 [#1] SMP.
last sysfs file: /sys/devices/pci0000:04/0000:04:01.0/0000:05:00.0/boot_vga
CPU 3.
Pid: 2057, comm: X Not tainted 2.6.35-36-gc189900+ #35 S2885 Thunder K8W Mainboard/To Be Filled By O.E.M.
RIP: 0010:[<ffffffff8124b8ff>]  [<ffffffff8124b8ff>] 0xffffffff8124b8ff
RSP: 0018:ffff8801c6753d70  EFLAGS: 00010246
RAX: ffffc90001bc0000 RBX: ffff8801c7a89e00 RCX: 0000000000000000
RDX: 0000000000000000 RSI: ffff8801c7a89e00 RDI: ffff8801c7ac2000
RBP: ffff8801c6778300 R08: 0000000000000000 R09: 00000000c0086464
R10: 0000000000000001 R11: 0000000000003246 R12: ffff8801c7a89e48
R13: 0000000000000000 R14: ffff8801c6753dd8 R15: ffffffff81512c20
FS:  00007fe03103c860(0000) GS:ffff880149f00000(0000) knlGS:0000000000000000
CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
CR2: 0000000000000000 CR3: 00000001c6694000 CR4: 00000000000006e0
DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
Process X (pid: 2057, threadinfo ffff8801c6752000, task ffff8801c5d53840)
Stack:
 ffffffff81226651 ffff8801c6778300 ffff8801c5793780 ffff880146e58000
<0> 00000000fffffff2 00000000c0086464 ffffffff811da935 ffff880100000064
<0> 000000000000e200 ffff880100000001 ffffffff81071413 ffffffff8122656c
Call Trace:
 [<ffffffff81226651>] ? 0xffffffff81226651
 [<ffffffff811da935>] ? 0xffffffff811da935
 [<ffffffff81071413>] ? 0xffffffff81071413
 [<ffffffff8122656c>] ? 0xffffffff8122656c
 [<ffffffff8106c1e8>] ? 0xffffffff8106c1e8
 [<ffffffff81074665>] ? 0xffffffff81074665
 [<ffffffff81094a03>] ? 0xffffffff81094a03
 [<ffffffff810952d6>] ? 0xffffffff810952d6
 [<ffffffff81095346>] ? 0xffffffff81095346
 [<ffffffff81001e2b>] ? 0xffffffff81001e2b
Code: 97 50 03 00 00 48 8b 87 c8 00 00 00 76 0a 31 c9 89 88 34 2f 00 00 eb 13 b9 34 2f 00 00 89 08 31 c0 48 8b 8f c8 
RIP  [<ffffffff8124b8ff>] 0xffffffff8124b8ff
 RSP <ffff8801c6753d70>
CR2: 0000000000000000
---[ end trace f66d01ec94df04c1 ]---
[drm:drm_release] *ERROR* Device busy: 1
SysRq : Keyboard mode set to system default
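
The trace above shows raw addresses such as 0xffffffff8124b8ff instead of function names, presumably because this kernel was built without CONFIG_KALLSYMS. If the System.map from the same build is at hand, each address can be mapped to the nearest preceding symbol. The following Python is a minimal sketch of that lookup. The two symbol names and their addresses are hypothetical placeholders, not taken from this kernel; only the `address type name` line format is the real System.map layout.

```python
import bisect


def load_system_map(lines):
    """Parse 'address type name' lines into a sorted list of (address, name)."""
    symbols = []
    for line in lines:
        parts = line.split()
        if len(parts) < 3:
            continue  # skip blank or malformed lines
        symbols.append((int(parts[0], 16), parts[2]))
    symbols.sort()
    return symbols


def resolve(symbols, address):
    """Return 'name+0xoffset' for the nearest symbol at or below `address`,
    or None if the address lies below every known symbol."""
    addresses = [addr for addr, _ in symbols]
    i = bisect.bisect_right(addresses, address) - 1
    if i < 0:
        return None
    base, name = symbols[i]
    return "%s+0x%x" % (name, address - base)


# Hypothetical System.map excerpt; real names come from the matching build.
sample_map = [
    "ffffffff8124b8c0 t hypothetical_func_a",
    "ffffffff8124b950 t hypothetical_func_b",
]
symbols = load_system_map(sample_map)
print(resolve(symbols, 0xffffffff8124b8ff))  # the RIP value from the oops
```

The lookup is only meaningful against the System.map of the exact kernel that crashed, since addresses shift with every rebuild.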
Comment 5 Duncan 2010-09-05 10:36:34 UTC
Confirmed:  Reverting that commit restores X to a working condition, here.  I'm now running X on 2.6.36-rc3, with that single commit reverted. =:^)
Comment 6 Michel Dänzer 2010-09-05 10:43:37 UTC
Looks like a duplicate of bug 17201.
Comment 7 Rafael J. Wysocki 2010-09-05 22:09:58 UTC

*** This bug has been marked as a duplicate of bug 17201 ***