Bug 29022

Summary: [REGRESSION? 2.6.38-rc4] nouveau NV50/NVA8 screen freeze
Product: Drivers
Reporter: Marc B. (kernel.org)
Component: Video(DRI - non Intel)
Assignee: drivers_video-dri
Status: RESOLVED OBSOLETE
Severity: normal
CC: akpm, alan, dev, florian, maciej.rutecki, rjw
Priority: P1
Hardware: All
OS: Linux
Kernel Version: 2.6.38-rc4
Subsystem:
Regression: No
Bisected commit-id:
Attachments: 2.6.37 - no freeze
             2.6.38-rc4-dezzy-00482-ga0dc00b-dirty - freeze

Description Marc B. 2011-02-13 12:29:13 UTC
I don't know whether I am hitting a nouveau regression in 2.6.38-rc4.

It feels like the screen tends to freeze often in that version, but I cannot be
sure whether it is the 2.6.38-rc4 release or something else.

The current 2.6.37 kernel has this in Xorg.0.log:

[    71.783] (II) AutoAddDevices is off - not adding device.
[    78.808] (II) NOUVEAU(0): EDID vendor "LEN", prod id 16561
[    78.808] (II) NOUVEAU(0): Printing DDC gathered Modelines:
[    78.808] (II) NOUVEAU(0): Modeline "1600x900"x0.0  106.00  1600 1664 1706 1930  900 903 906 912 -hsync -vsync (54.9 kHz)
[    78.808] (II) NOUVEAU(0): Modeline "1600x900"x0.0  106.00  1600 1664 1706 2324  900 903 906 912 -hsync -vsync (45.6 kHz)
[    79.422] (II) NOUVEAU(0): EDID vendor "LEN", prod id 16561
[    79.422] (II) NOUVEAU(0): Printing DDC gathered Modelines:
[    79.423] (II) NOUVEAU(0): Modeline "1600x900"x0.0  106.00  1600 1664 1706 1930  900 903 906 912 -hsync -vsync (54.9 kHz)
[    79.423] (II) NOUVEAU(0): Modeline "1600x900"x0.0  106.00  1600 1664 1706 2324  900 903 906 912 -hsync -vsync (45.6 kHz)
[    80.115] (II) NOUVEAU(0): EDID vendor "LEN", prod id 16561
[    80.115] (II) NOUVEAU(0): Printing DDC gathered Modelines:
[    80.115] (II) NOUVEAU(0): Modeline "1600x900"x0.0  106.00  1600 1664 1706 1930  900 903 906 912 -hsync -vsync (54.9 kHz)
[    80.115] (II) NOUVEAU(0): Modeline "1600x900"x0.0  106.00  1600 1664 1706 2324  900 903 906 912 -hsync -vsync (45.6 kHz)
[    80.729] (II) NOUVEAU(0): EDID vendor "LEN", prod id 16561
[    80.729] (II) NOUVEAU(0): Printing DDC gathered Modelines:
[    80.729] (II) NOUVEAU(0): Modeline "1600x900"x0.0  106.00  1600 1664 1706 1930  900 903 906 912 -hsync -vsync (54.9 kHz)
[    80.729] (II) NOUVEAU(0): Modeline "1600x900"x0.0  106.00  1600 1664 1706 2324  900 903 906 912 -hsync -vsync (45.6 kHz)
[   507.789] (II) NOUVEAU(0): EDID vendor "LEN", prod id 16561
[   507.789] (II) NOUVEAU(0): Printing DDC gathered Modelines:
[   507.789] (II) NOUVEAU(0): Modeline "1600x900"x0.0  106.00  1600 1664 1706 1930  900 903 906 912 -hsync -vsync (54.9 kHz)
[   507.789] (II) NOUVEAU(0): Modeline "1600x900"x0.0  106.00  1600 1664 1706 2324  900 903 906 912 -hsync -vsync (45.6 kHz)
[   567.978] (II) NOUVEAU(0): EDID vendor "LEN", prod id 16561
[   567.978] (II) NOUVEAU(0): Printing DDC gathered Modelines:
[   567.978] (II) NOUVEAU(0): Modeline "1600x900"x0.0  106.00  1600 1664 1706 1930  900 903 906 912 -hsync -vsync (54.9 kHz)
[   567.978] (II) NOUVEAU(0): Modeline "1600x900"x0.0  106.00  1600 1664 1706 2324  900 903 906 912 -hsync -vsync (45.6 kHz)

The current 2.6.38-rc4 kernel has this in Xorg.0.log:

[    75.466] (II) AutoAddDevices is off - not adding device.
[    89.585] (II) AIGLX: Suspending AIGLX clients for VT switch
[    89.585] (II) NOUVEAU(0): NVLeaveVT is called.
[    97.577] (II) Open ACPI successful (/var/run/acpid.socket)
[    97.578] (II) AIGLX: Resuming AIGLX clients after VT switch
[    97.578] (II) NOUVEAU(0): NVEnterVT is called.
[    97.691] (II) NOUVEAU(0): EDID vendor "LEN", prod id 16561
[    97.691] (II) NOUVEAU(0): Printing DDC gathered Modelines:
[    97.691] (II) NOUVEAU(0): Modeline "1600x900"x0.0  106.00  1600 1664 1706 1930  900 903 906 912 -hsync -vsync (54.9 kHz)
[    97.691] (II) NOUVEAU(0): Modeline "1600x900"x0.0  106.00  1600 1664 1706 2324  900 903 906 912 -hsync -vsync (45.6 kHz)
[    98.322] (II) Configured Mouse: ps2EnableDataReporting: succeeded
[   957.921] (II) 3rd Button detected: disabling emulate3Button
[  2852.323] (II) AIGLX: Suspending AIGLX clients for VT switch
[  2852.323] (II) NOUVEAU(0): NVLeaveVT is called.
[  2855.850] (II) Open ACPI successful (/var/run/acpid.socket)
[  2855.850] (II) AIGLX: Resuming AIGLX clients after VT switch
[  2855.850] (II) NOUVEAU(0): NVEnterVT is called.
[  2855.966] (II) NOUVEAU(0): EDID vendor "LEN", prod id 16561
[  2855.966] (II) NOUVEAU(0): Printing DDC gathered Modelines:
[  2855.966] (II) NOUVEAU(0): Modeline "1600x900"x0.0  106.00  1600 1664 1706 1930  900 903 906 912 -hsync -vsync (54.9 kHz)
[  2855.966] (II) NOUVEAU(0): Modeline "1600x900"x0.0  106.00  1600 1664 1706 2324  900 903 906 912 -hsync -vsync (45.6 kHz)
[  2856.597] (II) Configured Mouse: ps2EnableDataReporting: succeeded
[  2863.361] (II) AIGLX: Suspending AIGLX clients for VT switch
[  2863.361] (II) NOUVEAU(0): NVLeaveVT is called.
[  2879.447] (II) Open ACPI successful (/var/run/acpid.socket)
[  2879.447] (II) AIGLX: Resuming AIGLX clients after VT switch
[  2879.447] (II) NOUVEAU(0): NVEnterVT is called.
[  2879.561] (II) NOUVEAU(0): EDID vendor "LEN", prod id 16561
[  2879.561] (II) NOUVEAU(0): Printing DDC gathered Modelines:
[  2879.561] (II) NOUVEAU(0): Modeline "1600x900"x0.0  106.00  1600 1664 1706 1930  900 903 906 912 -hsync -vsync (54.9 kHz)
[  2879.561] (II) NOUVEAU(0): Modeline "1600x900"x0.0  106.00  1600 1664 1706 2324  900 903 906 912 -hsync -vsync (45.6 kHz)
[  2880.192] (II) Configured Mouse: ps2EnableDataReporting: succeeded
[  2884.543] (II) AIGLX: Suspending AIGLX clients for VT switch
[  2884.543] (II) NOUVEAU(0): NVLeaveVT is called.
[  2890.337] (II) Open ACPI successful (/var/run/acpid.socket)
[  2890.337] (II) AIGLX: Resuming AIGLX clients after VT switch
[  2890.337] (II) NOUVEAU(0): NVEnterVT is called.
[  2890.451] (II) NOUVEAU(0): EDID vendor "LEN", prod id 16561
[  2890.451] (II) NOUVEAU(0): Printing DDC gathered Modelines:
[  2890.451] (II) NOUVEAU(0): Modeline "1600x900"x0.0  106.00  1600 1664 1706 1930  900 903 906 912 -hsync -vsync (54.9 kHz)
[  2890.451] (II) NOUVEAU(0): Modeline "1600x900"x0.0  106.00  1600 1664 1706 2324  900 903 906 912 -hsync -vsync (45.6 kHz)
[  2891.081] (II) Configured Mouse: ps2EnableDataReporting: succeeded
[  2895.780] (II) AIGLX: Suspending AIGLX clients for VT switch
[  2895.780] (II) NOUVEAU(0): NVLeaveVT is called.
[  2917.475] (II) Open ACPI successful (/var/run/acpid.socket)
[  2917.475] (II) AIGLX: Resuming AIGLX clients after VT switch
[  2917.475] (II) NOUVEAU(0): NVEnterVT is called.
[  2917.588] (II) NOUVEAU(0): EDID vendor "LEN", prod id 16561
[  2917.588] (II) NOUVEAU(0): Printing DDC gathered Modelines:
[  2917.588] (II) NOUVEAU(0): Modeline "1600x900"x0.0  106.00  1600 1664 1706 1930  900 903 906 912 -hsync -vsync (54.9 kHz)
[  2917.588] (II) NOUVEAU(0): Modeline "1600x900"x0.0  106.00  1600 1664 1706 2324  900 903 906 912 -hsync -vsync (45.6 kHz)
[  2918.219] (II) Configured Mouse: ps2EnableDataReporting: succeeded
[  2921.735] (II) AIGLX: Suspending AIGLX clients for VT switch
[  2921.735] (II) NOUVEAU(0): NVLeaveVT is called.
[  2988.993] (II) Open ACPI successful (/var/run/acpid.socket)
[  2988.993] (II) AIGLX: Resuming AIGLX clients after VT switch
[  2988.993] (II) NOUVEAU(0): NVEnterVT is called.
[  2989.107] (II) NOUVEAU(0): EDID vendor "LEN", prod id 16561
[  2989.107] (II) NOUVEAU(0): Printing DDC gathered Modelines:
[  2989.107] (II) NOUVEAU(0): Modeline "1600x900"x0.0  106.00  1600 1664 1706 1930  900 903 906 912 -hsync -vsync (54.9 kHz)
[  2989.107] (II) NOUVEAU(0): Modeline "1600x900"x0.0  106.00  1600 1664 1706 2324  900 903 906 912 -hsync -vsync (45.6 kHz)
[  2989.738] (II) Configured Mouse: ps2EnableDataReporting: succeeded
[  2997.111] (II) AIGLX: Suspending AIGLX clients for VT switch
[  2997.111] (II) NOUVEAU(0): NVLeaveVT is called.
[  2997.952] (II) Open ACPI successful (/var/run/acpid.socket)
[  2997.952] (II) AIGLX: Resuming AIGLX clients after VT switch
[  2997.952] (II) NOUVEAU(0): NVEnterVT is called.
[  2998.065] (II) NOUVEAU(0): EDID vendor "LEN", prod id 16561
[  2998.065] (II) NOUVEAU(0): Printing DDC gathered Modelines:
[  2998.065] (II) NOUVEAU(0): Modeline "1600x900"x0.0  106.00  1600 1664 1706 1930  900 903 906 912 -hsync -vsync (54.9 kHz)
[  2998.065] (II) NOUVEAU(0): Modeline "1600x900"x0.0  106.00  1600 1664 1706 2324  900 903 906 912 -hsync -vsync (45.6 kHz)
[  2998.700] (II) Configured Mouse: ps2EnableDataReporting: succeeded
[  2999.201] (II) AIGLX: Suspending AIGLX clients for VT switch
[  2999.201] (II) NOUVEAU(0): NVLeaveVT is called.
[  3045.403] (II) Open ACPI successful (/var/run/acpid.socket)
[  3045.404] (II) AIGLX: Resuming AIGLX clients after VT switch
[  3045.404] (II) NOUVEAU(0): NVEnterVT is called.
[  3045.519] (II) NOUVEAU(0): EDID vendor "LEN", prod id 16561
[  3045.519] (II) NOUVEAU(0): Printing DDC gathered Modelines:
[  3045.519] (II) NOUVEAU(0): Modeline "1600x900"x0.0  106.00  1600 1664 1706 1930  900 903 906 912 -hsync -vsync (54.9 kHz)
[  3045.519] (II) NOUVEAU(0): Modeline "1600x900"x0.0  106.00  1600 1664 1706 2324  900 903 906 912 -hsync -vsync (45.6 kHz)
[  3046.151] (II) Configured Mouse: ps2EnableDataReporting: succeeded
[  3118.308] (II) AIGLX: Suspending AIGLX clients for VT switch
[  3118.308] (II) NOUVEAU(0): NVLeaveVT is called.
[  3119.091] (II) Open ACPI successful (/var/run/acpid.socket)
[  3119.091] (II) AIGLX: Resuming AIGLX clients after VT switch
[  3119.091] (II) NOUVEAU(0): NVEnterVT is called.
[  3119.204] (II) NOUVEAU(0): EDID vendor "LEN", prod id 16561
[  3119.204] (II) NOUVEAU(0): Printing DDC gathered Modelines:
[  3119.204] (II) NOUVEAU(0): Modeline "1600x900"x0.0  106.00  1600 1664 1706 1930  900 903 906 912 -hsync -vsync (54.9 kHz)
[  3119.204] (II) NOUVEAU(0): Modeline "1600x900"x0.0  106.00  1600 1664 1706 2324  900 903 906 912 -hsync -vsync (45.6 kHz)
[  3119.838] (II) Configured Mouse: ps2EnableDataReporting: succeeded
[  3120.522] (II) AIGLX: Suspending AIGLX clients for VT switch
[  3120.522] (II) NOUVEAU(0): NVLeaveVT is called.
[  3125.951] (II) UnloadModule: "mouse"
[  3125.951] (II) UnloadModule: "kbd"
[  3125.961] (II) NOUVEAU(0): Closed GPU channel 2
[  3125.976] (WW) xf86CloseConsole: VT_WAITACTIVE failed: Interrupted system call

but I don't know if it's related to the screen freeze. Does anyone experience
similar behavior?

Regards,
Marc
Comment 1 Marc B. 2011-02-13 13:03:57 UTC
Created attachment 47612
2.6.37 - no freeze
Comment 2 Marc B. 2011-02-13 13:05:22 UTC
Created attachment 47622
2.6.38-rc4-dezzy-00482-ga0dc00b-dirty - freeze
Comment 3 Marc B. 2011-02-13 13:06:46 UTC
[mi] EQ overflowing. The server is probably stuck in an infinite loop.

This was the message that appeared on 2.6.38+ ...
Comment 4 Marc B. 2011-02-15 04:38:51 UTC
See also:

"Chris Clayton - System lockup with 2.6.38-rc4+"

on LKML
Comment 5 Marc B. 2011-02-25 10:14:44 UTC
I have a slight suspicion that the problem is triggered more often with the VirtualBox kernel modules loaded. I know that you guys consider this setup 'Tainted'; however, it's a real-world setup that used to work.
Comment 6 Marc B. 2011-03-05 09:46:50 UTC
(In reply to comment #5)
> I have the slight assumption that the problem is more often triggered with
> the
> VirtualBox kernel modules loaded. I know that you guys consider this setup
> 'Tainted', however, it's a real-world setup that used to work.

This was not true. The behavior persists with -rc7. The screen _always_ freezes after about 3 - 4 minutes with mesa-HEAD and xorg 1.10.0. 2.6.37 is fine.
Comment 7 Marc B. 2011-03-05 09:55:56 UTC
Hey, we have something in the logs:

Mar  5 09:57:44 marc kernel: [71218.224538] [drm] nouveau 0000:01:00.0: PGRAPH_TRAP - Ch 4/5 Class 0x8597 Mthd 0x15e0 Data 0x00000000:0x00000000
Mar  5 09:57:44 marc kernel: [71218.224554] [drm] nouveau 0000:01:00.0: PGRAPH_TRAP_MP_EXEC - TP 0 MP 0: INVALID_OPCODE at 080000 warp 0, opcode 000033cc 00ffffff
Mar  5 09:57:44 marc kernel: [71218.224705] [drm] nouveau 0000:01:00.0: PGRAPH_TRAP - Ch 4/5 Class 0x8597 Mthd 0x15e0 Data 0x00000000:0x00000000
Mar  5 09:57:44 marc kernel: [71218.224717] [drm] nouveau 0000:01:00.0: PGRAPH_TRAP_MP_EXEC - TP 0 MP 0: INVALID_OPCODE at 080000 warp 0, opcode 000033cc 00ffffff
Mar  5 09:57:44 marc kernel: [71218.224726] [drm] nouveau 0000:01:00.0: PGRAPH_TRAP_MP_EXEC - TP 0 MP 1: INVALID_OPCODE at 083008 warp 1, opcode 00f8f3e6 00f8f3e6
Mar  5 09:57:44 marc kernel: [71218.224899] [drm] nouveau 0000:01:00.0: PGRAPH_TRAP - Ch 4/5 Class 0x8597 Mthd 0x15e0 Data 0x00000000:0x00000000
Mar  5 09:57:44 marc kernel: [71218.224911] [drm] nouveau 0000:01:00.0: PGRAPH_TRAP_MP_EXEC - TP 0 MP 0: INVALID_OPCODE at 080000 warp 0, opcode 000033cc 00ffffff
Mar  5 09:57:44 marc kernel: [71218.224921] [drm] nouveau 0000:01:00.0: PGRAPH_TRAP_MP_EXEC - TP 0 MP 1: INVALID_OPCODE at 083008 warp 1, opcode 00f8f3e6 00f8f3e6
Mar  5 09:57:44 marc kernel: [71218.224992] [drm] nouveau 0000:01:00.0: PGRAPH_TRAP - Ch 4/5 Class 0x8597 Mthd 0x15e0 Data 0x00000000:0x00000000
Mar  5 09:57:44 marc kernel: [71218.225003] [drm] nouveau 0000:01:00.0: PGRAPH_TRAP_MP_EXEC - TP 0 MP 0: INVALID_OPCODE at 080000 warp 0, opcode 000033cc 00ffffff
Mar  5 09:57:44 marc kernel: [71218.225013] [drm] nouveau 0000:01:00.0: PGRAPH_TRAP_MP_EXEC - TP 0 MP 1: INVALID_OPCODE at 083008 warp 1, opcode 00f8f3e6 00f8f3e6
Mar  5 09:57:44 marc kernel: [71218.225162] [drm] nouveau 0000:01:00.0: PGRAPH_TRAP - Ch 4/5 Class 0x8597 Mthd 0x15e0 Data 0x00000000:0x00000000
Mar  5 09:57:44 marc kernel: [71218.225168] [drm] nouveau 0000:01:00.0: PGRAPH_TRAP_MP - TP0: Unhandled ustatus 0x00020000
Mar  5 09:58:11 marc kernel: [71246.831804] [drm] nouveau 0000:01:00.0: nouveau_channel_free: freeing fifo 4
Mar  5 09:58:24 marc kernel: [71259.463319] [drm] nouveau 0000:01:00.0: Allocating FIFO number 4
Mar  5 09:58:24 marc kernel: [71259.473828] [drm] nouveau 0000:01:00.0: nouveau_channel_alloc: initialised FIFO 4

This, however, seems to be something different. When I reported the bug there were no such logs. Maybe they just didn't get into kern.log or there are two bugs.

If you need more info on something, please ask.
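
The PGRAPH_TRAP_MP_EXEC lines above repeat the same few faults, so tallying them can make a longer kern.log easier to read. What follows is only a minimal Python sketch: the regular expression is inferred from the lines quoted in this comment and not from the nouveau source, and the names `MP_EXEC`, `summarize_traps`, `word0` and `word1` are hypothetical.

```python
import re
from collections import Counter

# Field layout inferred only from the log lines quoted in this comment.
# "word0"/"word1" are placeholder names for the two values after "opcode".
MP_EXEC = re.compile(
    r"PGRAPH_TRAP_MP_EXEC - TP (?P<tp>\d+) MP (?P<mp>\d+): "
    r"(?P<reason>\w+) at (?P<addr>[0-9a-f]+) warp (?P<warp>\d+), "
    r"opcode (?P<word0>[0-9a-f]+) (?P<word1>[0-9a-f]+)"
)


def summarize_traps(lines):
    """Count how often each distinct MP_EXEC fault occurs in the given lines."""
    counts = Counter()
    for line in lines:
        m = MP_EXEC.search(line)
        if m is None:
            continue  # not an MP_EXEC line, e.g. a plain PGRAPH_TRAP line
        key = (int(m["tp"]), int(m["mp"]), m["reason"],
               m["addr"], m["word0"], m["word1"])
        counts[key] += 1
    return counts


# A few lines copied from the excerpt above (syslog prefix dropped).
sample = [
    "[drm] nouveau 0000:01:00.0: PGRAPH_TRAP - Ch 4/5 Class 0x8597 Mthd 0x15e0 Data 0x00000000:0x00000000",
    "[drm] nouveau 0000:01:00.0: PGRAPH_TRAP_MP_EXEC - TP 0 MP 0: INVALID_OPCODE at 080000 warp 0, opcode 000033cc 00ffffff",
    "[drm] nouveau 0000:01:00.0: PGRAPH_TRAP_MP_EXEC - TP 0 MP 0: INVALID_OPCODE at 080000 warp 0, opcode 000033cc 00ffffff",
    "[drm] nouveau 0000:01:00.0: PGRAPH_TRAP_MP_EXEC - TP 0 MP 1: INVALID_OPCODE at 083008 warp 1, opcode 00f8f3e6 00f8f3e6",
]

for key, n in summarize_traps(sample).most_common():
    print(n, key)
```

To run it over a real log, pass an open file (for example kern.log) instead of `sample`; the function only needs an iterable of lines.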
Comment 8 Lucas Stach 2011-03-05 10:16:01 UTC
If this problem occurs on an NVA8, it's not a regression. NVA3+ cards have some random lockups, and nobody has figured out why. So it never worked okay, although some changes in the codebase could have made it more likely to trigger the lockup.

I strongly suspect the PGRAPH_TRAP is an unrelated bug, since we never saw anything in the logs after a random lockup.
Comment 9 Marc B. 2011-03-06 11:42:29 UTC
I updated the whole X11 stuff to current git and the problem seems to be gone.

I have had the freezing kernel running for about 15 minutes now and tried everything to make it freeze (composite WM, Firefox, several glxgears instances and fullscreen terminals printing a lot of text, like cat'ing dmesg in loops). Weird.

... but the workaround cannot be to upgrade X11 to HEAD. :)

Any hints on what one could do to track this down?
Comment 10 Marc B. 2011-03-06 12:01:33 UTC
OK, consider that unsaid. Just a second after I hit 'Commit' to send the post, the desktop froze. :/
Comment 11 Lucas Stach 2011-03-06 13:02:58 UTC
Do you see this freeze on NVA8 only or also on other cards?

If it is only NVA8, there is nothing you can do, as nobody knows how to tackle this bug. You may pray that Ben finds a way to reproduce it so he can hopefully fix this.
Comment 12 Marc B. 2011-03-08 08:42:13 UTC
I have no other card to test over here. :/

rc8 still fails to behave like 2.6.37. Is this really going to be rolled out?
Comment 13 Lucas Stach 2011-03-08 11:26:25 UTC
This _is_ already rolled out, as every nouveau version so far has locked up randomly on NVA8. It seems you were enormously lucky not to see these lockups on .37.
Comment 14 Marc B. 2011-03-08 12:46:01 UTC
To be honest, I doubt _this_ is rolled out. The behavior might be there in 2.6.37, but in 2.6.38 it seems to be worse. With 2.6.37 I get a random GPU-related lock-up about once a month (I consider this normal, since even NVidia itself has problems like this with its drivers). The recent 2.6.38-rc kernels actually freeze after a few minutes, and this can be reproduced.

But OK, since you seem to say that there's nothing we could do about it at this point - as nobody seems to know a solution - what do we do with the bug report? Close it? Sad to hear I won't be able to use the 2.6.38 release.
Comment 15 Florian Mickler 2011-03-08 15:36:50 UTC
If you find the commit that made this worse, it may hint at a solution, or maybe just reverting it fixes things... so if you can use git bisect (man git-bisect), please do so.
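
For readers unfamiliar with bisecting: the idea is a binary search between a known-good and a known-bad kernel. The Python sketch below only illustrates that search on a linear list. The names `first_bad` and `history` and the commit labels are hypothetical, and real git bisect walks the actual commit graph and relies on you building and testing each candidate kernel.

```python
def first_bad(commits, is_bad):
    """Binary-search for the first bad commit.

    Assumes commits[0] is known good, commits[-1] is known bad, and that
    every commit after the first bad one is also bad.
    Returns (first_bad_commit, number_of_tests_performed).
    """
    if len(commits) < 2:
        raise ValueError("need at least one good and one bad commit")
    good, bad = 0, len(commits) - 1
    steps = 0
    while bad - good > 1:
        mid = (good + bad) // 2
        steps += 1
        if is_bad(commits[mid]):   # here: "does the screen freeze?"
            bad = mid
        else:
            good = mid
    return commits[bad], steps


# Hypothetical linear history of 1000 commits; the freeze starts at "c613".
history = [f"c{i}" for i in range(1000)]
culprit, steps = first_bad(history, lambda c: int(c[1:]) >= 613)
print(culprit, steps)
```

The point of the exercise is the step count: each test halves the remaining range, so even a large range between 2.6.37 and 2.6.38-rc4 needs only a modest number of kernel builds.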
Comment 16 Lucas Stach 2011-03-08 15:40:44 UTC
No, please leave it open for now; just remove the regression flag. Also, you could add the _full_ dmesg output and the versions of your userspace components (xf86-video-nouveau, x-server, libdrm and, if installed, mesa).

The dmesg output after a crash may also be interesting (hopefully we find something this time, but I doubt it).
Comment 17 Marc B. 2011-03-08 20:38:46 UTC
Removed from 'regressions' as requested, and perhaps because it actually makes sense.
Comment 18 Marc B. 2011-03-08 20:47:36 UTC
At this point I'd even offer money to get this fixed. Damn, this thing seems to be so tricky... :/
Comment 19 Marc B. 2011-04-05 20:00:14 UTC
nouveau.noaccel=1 'fixes' the issue. I have an uptime of 4 days now with 2.6.38.2 without any problems. But performance-wise ... uhm.
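
To double-check that the workaround is actually in effect on a given boot, one can look for the parameter on the kernel command line. This is a rough Python sketch under stated assumptions: the helper name `nouveau_noaccel_requested` and the sample command line are hypothetical, only the literal value `1` is treated as "on", and it is assumed that the last occurrence wins if the parameter is given twice.

```python
def nouveau_noaccel_requested(cmdline: str) -> bool:
    """Return True if the command line asks for nouveau.noaccel=1.

    Simplification: only the exact spelling "nouveau.noaccel=1" counts.
    Assumption: if the parameter appears more than once, the last one wins.
    """
    requested = False
    for token in cmdline.split():
        key, sep, value = token.partition("=")
        if key == "nouveau.noaccel" and sep:
            requested = (value == "1")
    return requested


# Hypothetical command line for illustration only.
example = "root=/dev/sda1 ro nouveau.noaccel=1 quiet"
print(nouveau_noaccel_requested(example))
```

On a running Linux system the string would typically come from reading /proc/cmdline.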
Comment 20 Alan 2012-08-15 22:15:55 UTC
If this bug is still seen with 3.2/3.4+ kernels, please re-open.