Bug 50091

Summary: [BISECTED] GeForce 6150SE: system hangs on X-server start with garbled screen
Product: Drivers
Reporter: schaefer.frank
Component: Video(DRI - non Intel)
Assignee: drivers_video-dri
Status: NEW ---
Severity: high
CC: alan, cesarb, cpanceac, jfrieben, jwilk, kernel, mahatma, rmsc
Priority: P1
Hardware: All
OS: Linux
Kernel Version: 3.9.9
Subsystem:
Regression: Yes
Bisected commit-id:
Attachments: dmesg output with kernel 3.6.6
dmesg output with kernel 3.7-rc5+
fix
fix for bisection
debug patch
debug patch 2
another fix for bisection
patch: port from <3.7

Description schaefer.frank 2012-11-05 18:56:16 UTC
When booting my system (GeForce 6150SE nForce 430), it hangs on X-server start-up with a garbled screen:
http://imageshack.us/photo/my-images/705/img133py.jpg/

100% reproducible. The kernel and Xorg logs contain nothing unusual.

This is a regression from previous kernel versions; 3.6.6 is fine.
The rest of the system is a standard installation of openSUSE 12.2.
Comment 1 schaefer.frank 2012-11-14 13:50:31 UTC
Last good commit is 
5787640db6ae722aeadb394d480c7ca21b603e34
drm/nv04-nv40/instmem: remove use of nouveau_gpuobj_new_fake()

First bad commit is
ebb945a94bba2ce8dff7b0942ff2b3f2a52a0a69
drm/nouveau: port all engines to new engine module format

Can't test the 16 commits in between (70ee6f1cd6911098ddd4c11ee21b69dbe51fb3f9 to ac1499d9573f4aadd1d2beac11fe23af8ce90c24).
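The situation described here (a known-good commit, a known-bad commit, and 16 untestable commits in between) can be modelled with a short Python sketch. It only illustrates the bisection reasoning and is not part of the original report: the commit labels are hypothetical placeholders, the helper name `first_bad_candidates` is made up for this example, and it assumes the usual bisection premise that every commit after the first bad one is also bad.

```python
from typing import Dict, List


def first_bad_candidates(commits: List[str], verdict: Dict[str, str]) -> List[str]:
    """Return every commit that could be the first bad one.

    `commits` is ordered oldest to newest. `verdict` maps a commit to
    "good" or "bad"; commits without a verdict are treated as untestable.
    """
    good = [i for i, c in enumerate(commits) if verdict.get(c) == "good"]
    last_good = max(good) if good else -1
    bad = [i for i, c in enumerate(commits)
           if verdict.get(c) == "bad" and i > last_good]
    if not bad:
        raise ValueError("no bad commit after the last good one")
    # Everything after the last good commit, up to and including the
    # first known-bad commit, remains a suspect.
    return commits[last_good + 1:min(bad) + 1]


# Hypothetical labels standing in for the real commit ids.
history = (["last_good"]
           + [f"untestable_{n:02d}" for n in range(1, 17)]
           + ["first_known_bad"])
verdict = {"last_good": "good", "first_known_bad": "bad"}

candidates = first_bad_candidates(history, verdict)
print(len(candidates))  # 17
```

With 16 untestable commits the culprit can only be narrowed down to 17 candidates. In git itself, untestable commits are marked with `git bisect skip`.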
Comment 2 Marcin Slusarz 2012-11-14 20:28:08 UTC
Please attach dmesgs from both 3.6.x and 3.7-rcLatest.
What's the problem with bisection? If you can't continue because of a build error, please attach it too.

I hope ebb945a94bba2ce8dff7b0942ff2b3f2a52a0a69 is not the culprit... (146 files changed, 14219 insertions(+), 11099 deletions(-))
Comment 3 schaefer.frank 2012-11-15 14:59:45 UTC
Concerning bisection problems:
Starting with commit 70ee6f1cd6911098ddd4c11ee21b69dbe51fb3f9 (the first skipped commit), the system starts up in VESA modes. Somewhere in the middle of those 16 commits, this changes to a kernel panic (NULL pointer dereference) very early in the boot process. No compilation problems.
Comment 4 schaefer.frank 2012-11-15 15:01:27 UTC
Created attachment 86421
dmesg output with kernel 3.6.6
Comment 5 schaefer.frank 2012-11-15 15:05:02 UTC
The machine dies when hitting this bug with 3.7, so no dmesg output is possible. I can post the content of /var/log/messages, but there is nothing of interest.
Comment 6 schaefer.frank 2012-11-15 15:38:40 UTC
I've updated to today's 3.7-rc5, and on the first reboot the system started normally (!). I was able to retrieve the dmesg output before the machine died again after ~2 minutes (with the same garbled screen).
The next 5 reboots again failed on X-server start.
Comment 7 schaefer.frank 2012-11-15 15:39:50 UTC
Created attachment 86431
dmesg output with kernel 3.7-rc5+
Comment 8 Marcin Slusarz 2012-11-15 18:50:14 UTC
Created attachment 86461
fix

does this patch help?
Comment 9 schaefer.frank 2012-11-15 20:06:16 UTC
No change :(
Comment 10 Marcin Slusarz 2012-11-15 21:02:06 UTC
Created attachment 86471
fix for bisection

Can you apply this one on top of 70ee6f1cd6911098ddd4c11ee21b69dbe51fb3f9 (the first commit which fails), verify it helps, and continue with bisection?
Comment 11 schaefer.frank 2012-11-15 21:45:08 UTC
Are you sure that you attached the right patch? It seems to be the same as the last one...
Will test tomorrow.
Comment 12 Marcin Slusarz 2012-11-15 21:56:06 UTC
Yes, it's the same patch, just ported to mid-rework tree.
Comment 13 schaefer.frank 2012-11-16 12:10:49 UTC
Doesn't change anything. This is what happens with commit 70ee6f1cd6911098ddd4c11ee21b69dbe51fb3f9:

...
[    3.619558] [drm] Initialized drm 1.1.0 20060810
[    3.680948] nouveau 0000:00:0d.0: setting latency timer to 64
[    3.681173] ACPI: PCI Interrupt Link [AIGP] enabled at IRQ 23
[    3.681447] nouveau  [  DEVICE][0000:00:0d.0] BOOT0  : 0x04c000a2
[    3.681449] nouveau  [  DEVICE][0000:00:0d.0] Chipset: NV4C
[    3.681451] nouveau  [  DEVICE][0000:00:0d.0] Family : NV40
[    3.682383] nouveau  [   VBIOS][0000:00:0d.0] checking PRAMIN for image...
[    3.719160] nouveau  [   VBIOS][0000:00:0d.0] ... appears to be valid
[    3.719162] nouveau  [   VBIOS][0000:00:0d.0] using image from PRAMIN
[    3.719293] nouveau  [   VBIOS][0000:00:0d.0] BIT signature found
[    3.719297] nouveau  [   VBIOS][0000:00:0d.0] version 05.61.32.14
[    3.719433] nouveau  [     PFB][0000:00:0d.0] RAM type: unknown
[    3.719435] nouveau  [     PFB][0000:00:0d.0] RAM size: 256 MiB
[    3.719729] [drm] nouveau 0000:00:0d.0: Detected an NV40 generation card (0x04c000a2)
[    3.720676] [drm] nouveau 0000:00:0d.0: BIT BIOS found
[    3.720679] [drm] nouveau 0000:00:0d.0: Bios version 05.61.32.14
[    3.720681] [drm] nouveau 0000:00:0d.0: TMDS table version 1.1
[    3.720682] [drm] nouveau 0000:00:0d.0: TMDS table script pointers not stubbed
[    3.720684] [drm] nouveau 0000:00:0d.0: MXM: no VBIOS data, nothing to do
[    3.720686] [drm] nouveau 0000:00:0d.0: DCB version 3.0
[    3.720688] [drm] nouveau 0000:00:0d.0: DCB outp 00: 01000310 00000023
[    3.720690] [drm] nouveau 0000:00:0d.0: DCB outp 01: 00110204 974f0000
[    2.718650] ACPI: Invalid Power Resource to register!
[    3.720692] [drm] nouveau 0000:00:0d.0: DCB conn 00: 0000
[    3.721518] [TTM] Zone  kernel: Available graphics memory: 418854 kiB
[    3.721520] [TTM] Zone highmem: Available graphics memory: 2691562 kiB
[    3.721521] [TTM] Initializing pool allocator
[    3.724190] [drm] nouveau 0000:00:0d.0: 512 MiB GART (aperture)
[    3.728066] [drm] nouveau 0000:00:0d.0: Saving VGA fonts
[    3.762935] [drm] nouveau 0000:00:0d.0: DCB type 4 not known
[    3.762937] [drm] nouveau 0000:00:0d.0: Unknown-1 has no encoders, removing
[    3.764808] [drm] Supports vblank timestamp caching Rev 1 (10.10.2010).
[    3.764809] [drm] No driver support for vblank timestamp query.
[    3.770228] [drm] nouveau 0000:00:0d.0: 1 available performance level(s)
[    3.770231] [drm] nouveau 0000:00:0d.0: 0: core 425MHz shader 425MHz fanspeed 100%
[    3.770233] [drm] nouveau 0000:00:0d.0: c:
[    3.770358] [drm] nouveau 0000:00:0d.0: Failed to idle channel 0.
[    3.770447] [drm] nouveau 0000:00:0d.0: Setting dpms mode 3 on vga encoder (output 0)
[    3.791615] [drm] nouveau 0000:00:0d.0: Restoring VGA fonts
[    3.794408] [drm:drm_mm_takedown] *ERROR* Memory manager not clean. Delaying takedown
[    3.794440] [drm:drm_mm_takedown] *ERROR* Memory manager not clean. Delaying takedown
[    3.794586] [TTM] Finalizing pool allocator
[    3.794619] [TTM] Zone  kernel: Used memory at exit: 8 kiB
[    3.794621] [TTM] Zone highmem: Used memory at exit: 8 kiB
[    3.794885] [drm:drm_mm_takedown] *ERROR* Memory manager not clean. Delaying takedown
[    3.795205] ------------[ cut here ]------------
[    3.795238] WARNING: at drivers/gpu/drm/nouveau/nouveau_gpuobj.c:241 nouveau_gpuobj_takedown+0x103/0x110 [nouveau]()
[    3.795239] Hardware name: System Product Name
[    3.795240] Modules linked in: nouveau(+) ttm drm_kms_helper drm i2c_algo_bit mxm_wmi video wmi fan thermal button processor thermal_sys scsi_dh_alua scsi_dh_hp_sw scsi_dh_emc scsi_dh_rdac scsi_dh ata_generic pata_amd pata_jmicron sata_nv
[    3.795257] Pid: 32, comm: kworker/0:1 Not tainted 3.6.0-2.10-desktop+ #4
[    3.795258] Call Trace:
[    3.795266]  [<c023726d>] warn_slowpath_common+0x6d/0xa0
[    3.795290]  [<f99b3ae3>] ? nouveau_gpuobj_takedown+0x103/0x110 [nouveau]
[    3.795312]  [<f99b3ae3>] ? nouveau_gpuobj_takedown+0x103/0x110 [nouveau]
[    3.795316]  [<c02372bd>] warn_slowpath_null+0x1d/0x20
[    3.795339]  [<f99b3ae3>] nouveau_gpuobj_takedown+0x103/0x110 [nouveau]
[    3.795359]  [<f996ab64>] ? nv40_instmem_takedown+0x54/0x70 [nouveau]
[    3.795382]  [<f99af4fe>] nouveau_card_init+0x68e/0xec0 [nouveau]
[    3.795386]  [<c022d61a>] ? ioremap_nocache+0x1a/0x20
[    3.795409]  [<f99b02fd>] nouveau_load+0x49d/0x8f0 [nouveau]
[    3.795432]  [<f99ac3aa>] nouveau_drm_load+0x21a/0x250 [nouveau]
[    3.795445]  [<f7abd330>] ? drm_get_minor+0x1f0/0x2b0 [drm]
[    3.795455]  [<f7abf6a3>] drm_get_pci_dev+0x133/0x250 [drm]
[    3.795460]  [<c04a183e>] ? pcibios_set_master+0x7e/0xb0
[    3.795483]  [<f99f631c>] nouveau_pci_probe+0xd/0xf [nouveau]
[    3.795504]  [<f99f62f8>] nouveau_drm_probe+0x95/0xac [nouveau]
[    3.795507]  [<c04a367a>] local_pci_probe+0x5a/0xd0
[    3.795511]  [<c024f3ec>] work_for_cpu_fn+0xc/0x20
[    3.795513]  [<c0251095>] process_one_work+0x115/0x420
[    3.795516]  [<c024f6a0>] ? hweight_long+0x10/0x10
[    3.795519]  [<c024f3e0>] ? move_linked_works+0x80/0x80
[    3.795521]  [<c02516a1>] worker_thread+0x111/0x3b0
[    3.795524]  [<c0262239>] ? complete+0x49/0x60
[    3.795527]  [<c0251590>] ? rescuer_thread+0x1c0/0x1c0
[    3.795530]  [<c025678d>] kthread+0x6d/0x80
[    3.795532]  [<c0256720>] ? kthread_freezable_should_stop+0x50/0x50
[    3.795536]  [<c070caf6>] kernel_thread_helper+0x6/0xd
[    3.795538] ---[ end trace 7467d93187e1277b ]---
[    3.795945] nouveau: probe of 0000:00:0d.0 failed with error -12
...
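To pick the failure markers out of a long dmesg capture like the one above, a small filter can help. The following Python sketch is an illustration only: the pattern list is an assumption based on the lines seen in this log and is not exhaustive, and `failure_lines` is a hypothetical helper name.

```python
import errno
import re
from typing import List

# Patterns taken from the log above; illustrative, not exhaustive.
PROBE_FAILURE = re.compile(r"probe of \S+ failed with error -(\d+)")
FAILURE_PATTERNS = [
    re.compile(r"\*ERROR\*"),
    re.compile(r"WARNING: at "),
    re.compile(r"Failed to idle channel"),
    PROBE_FAILURE,
]


def failure_lines(dmesg_text: str) -> List[str]:
    """Return the lines that match any of the failure patterns."""
    return [line for line in dmesg_text.splitlines()
            if any(p.search(line) for p in FAILURE_PATTERNS)]


# A few lines copied from the log above.
SAMPLE = """\
[    3.770358] [drm] nouveau 0000:00:0d.0: Failed to idle channel 0.
[    3.791615] [drm] nouveau 0000:00:0d.0: Restoring VGA fonts
[    3.794408] [drm:drm_mm_takedown] *ERROR* Memory manager not clean. Delaying takedown
[    3.795945] nouveau: probe of 0000:00:0d.0 failed with error -12
"""

hits = failure_lines(SAMPLE)
for line in hits:
    print(line)

# Decode the probe error code; errno 12 is ENOMEM on Linux.
match = PROBE_FAILURE.search(SAMPLE)
code = int(match.group(1)) if match else None
if code is not None:
    print(code, errno.errorcode.get(code, "unknown"))
```

The final probe error in the log, -12, corresponds to ENOMEM (out of memory).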
Comment 14 Marcin Slusarz 2012-11-16 19:43:46 UTC
Can you confirm 70ee6f1cd6911098ddd4c11ee21b69dbe51fb3f9 really is the first commit which fails to initialize?
Comment 15 schaefer.frank 2012-11-16 20:27:45 UTC
Sorry, it's 5787640db6ae722aeadb394d480c7ca21b603e34 (the commit before it).
Comment 16 Marcin Slusarz 2012-11-17 13:26:00 UTC
Created attachment 86541
debug patch

does 5787640db6ae722aeadb394d480c7ca21b603e34 initialize with this patch?
Comment 17 schaefer.frank 2012-11-17 15:24:48 UTC
To make it short: please disregard comment #15. 
My original statement was right, 5787640db6ae722aeadb394d480c7ca21b603e34 is the last good commit and 70ee6f1cd6911098ddd4c11ee21b69dbe51fb3f9 doesn't start up.
No idea what went wrong yesterday; triple-checked now.

I applied your patch on top of 70ee6f but it doesn't help.
Comment 18 Marcin Slusarz 2012-11-17 17:31:18 UTC
Created attachment 86551
debug patch 2

try this one on top of 70ee6f1cd6911098ddd4c11ee21b69dbe51fb3f9
Comment 19 schaefer.frank 2012-11-17 18:34:47 UTC
No change :(
Comment 20 Marcin Slusarz 2012-11-17 19:15:33 UTC
Created attachment 86561
another fix for bisection

Ok, this should do the trick. Try it on top of 70ee6f1cd6911098ddd4c11ee21b69dbe51fb3f9 and continue bisection.
(Mainline has this fixed.)
Comment 21 schaefer.frank 2012-11-17 20:05:28 UTC
Still no change. Is there anything I can do to debug this further?
Comment 22 schaefer.frank 2012-12-26 17:34:22 UTC
UPDATE:
I've tested with 3.8.0-rc1, and now there is a good chance that the system starts.
The critical point is shortly before the KDE desktop appears (near the end of the KDE startup progress screen). If it survives this point, the machine runs stably for hours.

Maybe this gives us a hint of what could be going on?
Could this have something to do with memory allocation? Remember that this is an onboard device using a part of the system's RAM.
Comment 23 schaefer.frank 2012-12-26 17:35:26 UTC
Maybe I should add: x86 32bit system with 6 GB RAM.
Comment 24 Suloev Dmitry 2012-12-29 06:33:44 UTC
Do you use kexec for reboot, or does this happen on cold boot?
Comment 25 schaefer.frank 2012-12-29 10:34:25 UTC
Cold boot.
Comment 26 Cesar Eduardo Barros 2013-02-23 18:39:25 UTC
I am hitting a bug with very similar symptoms, probably the same bug. Reported at https://bugs.freedesktop.org/show_bug.cgi?id=61321 with bisection also pointing to 70ee6f1 as the first bad commit.
Comment 27 Dzianis Kahanovich 2013-03-02 16:45:05 UTC
I see the following trivial (but, to me, interesting) difference in the same situation:

Older (good) kernels show these messages on boot:
[    0.898887] [drm] nouveau 0000:00:0d.0: ======= misaligned reg 0x0060081D =======
[    0.898919] [drm] nouveau 0000:00:0d.0: ======= misaligned reg 0x0060081D =======

New (broken) kernels do not. I also could not find (after a quick search, sorry) anything even remotely similar in the code in nouveau_bios.c.

PS: In my case (Gentoo, openbox, tint2, feh - minimalistic) I get a guaranteed hang (broken screen and no messages) when Mozilla starts (a self-built SeaMonkey with GL usage strongly forced) and, I think, with mplayer/xv (but that was some time ago, so I'm unsure). Other cases have not been checked, but at least Gtk+ (in xfce's Terminal) is stable.
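A comparison like the one described in this comment (messages that a good kernel prints and a broken one does not) can be automated by stripping the timestamps and diffing the remaining text. The Python sketch below is only an illustration: the helper names `messages` and `only_in_first` are hypothetical, the "good" sample reuses lines quoted in this report, and the "broken" sample is made up for the example.

```python
import re
from typing import List, Set

# Matches a leading dmesg timestamp such as "[    0.898887] ".
TIMESTAMP = re.compile(r"^\[\s*\d+\.\d+\]\s*")


def messages(dmesg_text: str) -> Set[str]:
    """Return the set of log messages with their timestamps removed."""
    return {TIMESTAMP.sub("", line).strip()
            for line in dmesg_text.splitlines() if line.strip()}


def only_in_first(first: str, second: str) -> List[str]:
    """Messages that appear in `first` but not in `second`, sorted."""
    return sorted(messages(first) - messages(second))


GOOD = """\
[    0.898887] [drm] nouveau 0000:00:0d.0: ======= misaligned reg 0x0060081D =======
[    0.898919] [drm] nouveau 0000:00:0d.0: ======= misaligned reg 0x0060081D =======
[    3.619558] [drm] Initialized drm 1.1.0 20060810
"""

# Made-up sample standing in for a log from a broken kernel.
BROKEN = """\
[    3.619558] [drm] Initialized drm 1.1.0 20060810
"""

missing = only_in_first(GOOD, BROKEN)
for msg in missing:
    print(msg)
```

Because the messages are collected into a set, repeated lines that differ only in their timestamp are reported once.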
Comment 28 schaefer.frank 2013-03-02 17:29:00 UTC
(In reply to comment #27)
> I see next trivial (but interesting to me) differences in same case:
> 
> Older (good) kernels show messages on boot:
> [    0.898887] [drm] nouveau 0000:00:0d.0: ======= misaligned reg 0x0060081D
> =======
> [    0.898919] [drm] nouveau 0000:00:0d.0: ======= misaligned reg 0x0060081D
> =======
> 

That's a separate issue and has nothing to do with this bug.
See https://bugs.freedesktop.org/show_bug.cgi?id=47182
In most cases, it is no bug at all.
Comment 29 Dzianis Kahanovich 2013-03-04 09:53:24 UTC
(In reply to comment #28)
> That's a separate issue and has nothing to do with this bug.
> See https://bugs.freedesktop.org/show_bug.cgi?id=47182
> In most cases, it is no bug at all.
OK, I only suggested it because the new DRI code does not detect this misalignment, which could be a source of this bug; so far it is just a visible difference.
Comment 30 Joachim Frieben 2013-05-27 05:16:31 UTC
This is -still- an issue for all kernels >= 3.7. Only kernels up to 3.6.x avoid the locking issue which makes the later ones unusable on systems with this video device, here an on-board nVidia Corporation C61 [GeForce 6100 nForce 405] (rev a2).
Comment 31 cornel panceac 2013-07-12 17:58:59 UTC
This still affects 3.9.9 from Fedora 19.


$ lspci -nn | grep VGA
00:0d.0 VGA compatible controller [0300]: nVidia Corporation C61 [GeForce 6150SE nForce 430] [10de:03d0] (rev a2)

Indeed, all kernels > 3.6 are unusable. Sooner or later the system freezes.
Comment 32 Tom Wijsman 2013-08-12 21:58:08 UTC
Similar downstream report at https://bugs.gentoo.org/show_bug.cgi?id=472200
Comment 33 Dzianis Kahanovich 2014-04-11 12:33:10 UTC
Created attachment 131911
patch: port from <3.7

Try this patch. I have had some trouble-free uptime with it (I have this Gigabyte motherboard in my work desktop and used 3.4 before), but this bug can be too unpredictable to be sure. It is a port of the similar alignment handling from <3.7, just minimized. I also did not carry over the "misaligned reg" warning, as nobody has cared about it since 2012. I copied the old notes into the comments, so you can find the original corresponding places (was: if (reg & 0x1) reg&=...) by these comments.
Comment 34 Dzianis Kahanovich 2014-04-14 10:17:05 UTC
OOPS! It crashed now. Not a fix ;(