Bug 16376

Summary: random Radeon DRM KMS related freezes
Product: Drivers Reporter: Martin Steigerwald (Martin)
Component: Video(DRI - non Intel) Assignee: drivers_video-dri
Status: RESOLVED CODE_FIX    
Severity: normal CC: airlied, alexdeucher, dragos.delcea, maciej.rutecki, mrsteven, rjw
Priority: P1    
Hardware: All   
OS: Linux   
Kernel Version: 2.6.34 Subsystem:
Regression: Yes Bisected commit-id:
Bug Depends on:    
Bug Blocks: 15310    
Attachments: hardware of my ThinkPad T42, lspci -nvv
bisect cont'd - unbootable kernels ext4 / readahead backtrace
current state of bisection, please help with selecting next commits to test
Current status of git bisection, now testing patches mentioned in comment
no freezes so far with vmembase zero patch from Dave

Description Martin Steigerwald 2010-07-13 09:24:47 UTC
Affected kernel versions:
- 2.6.34-tp42-toi-3.1-04981-gb9a071a
- 2.6.34.1-tp42-toi-3.1.1.1-04990-g3a7d1f4

Last kernel that worked:
- 2.6.33.2-tp42-toi-3.1-lowmem-free-991-992-04964-gf00c7ec-dirty (including some patches to explicitly allow freeing lowmem pages on hibernation - I tested them for Nigel)

Currently running kernel:
- 2.6.33.6-tp42-toi-3.1.1.1-04982-g768d8a0

All from Nigel Cunningham's TuxOnIce trees, but hangs happened before any hibernation cycle took place.

So now I just report this although I do not have much information about the circumstances of these hangs.

With 2.6.34 I had two and with 2.6.34.1 I had one sudden freeze of at least the desktop on my ThinkPad T42 with Radeon graphics. Mouse pointer just froze, Ctrl-Alt-F1 did nothing and AFAIR also there was no disk I/O anymore. I am not completely sure about the last one. Since the freezes happened in quite unpleasant circumstances I did not bother to start up a second machine in order to try to SSH into my T42. With 2.6.33.2 I did not experience those freezes.

I used the tuxonice-2.6.34 tree from Nigel Cunningham, cause I prefer TuxOnIce over other hibernation methods. Since all of the freezes just happened after a fresh boot of the system without any snapshot cycle in between them, I believe this to be a mainline kernel bug.

All freezes happened while running a KDE 4.4.4 desktop with OpenGL compositing enabled. The first two times just shortly after logging in to the desktop. The third time while playing an AVI file from my photo SD card with Dragon Player. On the other hand I had hours of uptime with some TuxOnIce snapshot cycles without anything happening. I never had a freeze after the machine had done a snapshot cycle. The freezes rather happened shortly after a fresh boot of the machine.

I am not sure whether this is a Radeon DRM KMS related bug, but this is my best guess at the moment. Other activities involved in all three situations were:

- USB. On the first two an M-Audio Sonica Theater was connected. On the third the kernel was reading the AVI file from an SD card connected via USB card reader.

- eSATA harddisk. In all cases an external 500 GB harddisk was connected via eSATA.

But also here I had the kernel running for hours with USB or eSATA without anything happening.

In the further course I will attach some hardware and software details that might be helpful.

Currently I just downgraded to 2.6.33 again. I compiled myself a 2.6.33.6. Since I really want some stability at least during the week where I hold a Linux training, I want to stick with it for now. Currently my plans are to wait for 2.6.34.2 or .3 and try again. The laptop is used for production work and I want it to meet some basic stability requirements. But if need be and I manage to take time for it, I may do some guided testing.
Comment 1 Martin Steigerwald 2010-07-13 10:11:15 UTC
Created attachment 27083 [details]
hardware of my ThinkPad T42, lspci -nvv
Comment 2 Martin Steigerwald 2010-07-13 10:15:02 UTC
Userspace I use:

apt-show-versions | egrep "(xserver-xorg/|xserver-xorg-core/|xserver-xorg-video-radeon/|libgl1-mesa-dri/|libdrm2/|libdrm-radeon1/|kde-window-manager/)"
kde-window-manager/squeeze uptodate 4:4.4.4-1
libdrm-radeon1/experimental uptodate 2.4.21-1
libdrm2/experimental uptodate 2.4.21-1
libgl1-mesa-dri/experimental uptodate 7.8.2-1
xserver-xorg/squeeze uptodate 1:7.5+6
xserver-xorg-core/squeeze uptodate 2:1.7.7-2
xserver-xorg-video-radeon/sid uptodate 1:6.13.1-1
Comment 3 Martin Steigerwald 2010-07-13 10:24:05 UTC
I did some research on the internet and found this possibly related case - hot, since it also happens on a ThinkPad T42 with similar if not same graphics hardware:

------------------------------------------------------
Random freezes with kernel 2.6.34 and xorg 1.8

The other day I upgraded the kernel to version 2.6.34 and at the same time xorg-server to 1.8 (along with input drivers and video drivers). From that moment, I suffer from random freezes, the system is completely locked up; the screen doesn't blank, though. The system appears to be perfectly fine, but after some minutes it will freeze.

As far as I can remember, the freezes only occur when using a webbrowser. Chromium is my default browser, but also Firefox and Konqueror caused the system to freeze completely when I want to open a webpage. Other network programs such as irssi can run for hours and they didn't seem to cause any havoc.

At first I thought this might be caused by some instability in the latest xf86-video-ati driver, so I downgraded back to xorg-server 1.7. Still the same symptoms, so I am pretty sure now there's something in the kernel. So I went back to xorg-server 1.8 and downgraded the kernel to 2.6.33. So far, the system hasn't let me down, yet. Also, I uninstalled the madwifi packages on my system so I'm using the ath5 drivers shipped with the kernel. That change doesn't seem to make a difference.

I am not exactly sure what exactly could cause this, there's no trace in any log file to be found. My suspicion is that it's network related.

The hardware is a Thinkpad T42 with an ATI Radeon Mobility 9600 (r300) chip and an Atheros wireless card.

Anyone else with similar experiences with this hardware? I can't think of a way how to properly debug this. I know I can bisect, but that's time consuming and perhaps it's quicker to ask around first before I walk down that road. Any suggestions are welcome on how to track this down
------------------------------------------------------
http://bbs.archlinux.org/viewtopic.php?pid=789948

I do not use any wireless however, the ipw2200 radio is disabled here.


Unlikely to be related, because it seems to be easily reproducible:

------------------------------------------------------
seems to freeze on drm installation [2.6.34, 2× RV280]

See attached screenshot. While I can boot anything up to 2.6.33,
2.6.34 seems to reproducibly hang itself up when it starts DRM. This
machine has two Radeon RV280 cards.
------------------------------------------------------
http://bugs.debian.org/cgi-bin/bugreport.cgi?bug=586137
Comment 4 Martin Steigerwald 2010-07-13 12:43:23 UTC
While searching for traces of those freezes in syslog I came across backtraces that may or may not be related. See bug #16377. They were prior to the third freeze. I did not find any backtraces shortly before that freeze. Will look for the earlier two freezes now.
Comment 5 Alex Deucher 2010-07-13 13:32:15 UTC
Does s/r work ok without tuxonice?
Comment 6 Martin Steigerwald 2010-07-13 15:12:04 UTC
The bug does not appear to be suspend/resume related at all. The freezes did happen after a fresh boot, without any snapshot cycle in between. They never happened after the first TuxOnIce cycle which I guess is just a coincidence.

So the machine had no snapshot cycles when the freeze occurred. The only thing TuxOnIce does on a fresh boot is finding no TuxOnIce image and exiting, so I highly doubt the freezes are TuxOnIce related in any way.
Comment 7 Alex Deucher 2010-07-13 15:14:47 UTC
Does this happen with 2.6.35?
Comment 8 Martin Steigerwald 2010-07-16 18:15:33 UTC
Yes, after giving up on an issue of Debian's make-kpkg with Kernel 2.6.35 "+" sign in version number[1], I used make deb-pkg to compile 2.6.35-rc5-04995-g7441ae8 from TuxOnIce head.

About 5 minutes after booting it, shortly after starting KDE 4 OpenGL composited desktop session, the machine froze hard. This time I tried accessing it from another machine, my second ThinkPad, a T23 with SuperSavage chipset. The frozen T42 did not answer to any ping, "destination host unreachable". Thus I conclude the kernel was completely locked up. Again it was a fresh boot, no hibernation cycle in between.

The T23 also has a 2.6.34.1 which didn't yet lock up and has an uptime of 7 days, but only 3 TuxOnIce snapshot cycles. It appears to be stable. Thus I think it really could have to do with radeon KMS DRM code.

[1] http://bugs.debian.org/588178
Comment 9 Alex Deucher 2010-07-16 19:18:13 UTC
Any chance you could bisect this?
Comment 10 Martin Steigerwald 2010-07-16 21:08:55 UTC
Difficult, since it doesn't happen all the time, I have no clear pattern on reproducibility. When it happened it often happened within 5 minutes after starting the desktop, but when playing the AVI videos from SD card it took longer. And sometimes it just didn't freeze at all.

This also is the laptop for all my stuff, thus I have some stability requirements for it. I never did bisecting so far, but as far as I understand it usually includes testing about a dozen kernels and thus having a dozen freezes - quite some risk to lose unsaved data as I can't predict if and when it freezes.
Comment 11 Martin Steigerwald 2010-07-16 21:14:56 UTC
But is there no other way to get this one tracked down? I thought about net console, but when the kernel doesn't even respond to a ping anymore, I guess it won't send out anything over the net. Maybe in the moments before the freeze it sends something?
Comment 12 Alex Deucher 2010-07-23 14:30:10 UTC
Unfortunately GPU freezes are really hard to track down without a reliable test case.
Comment 13 Martin Steigerwald 2010-07-23 15:17:48 UTC
I don't have a reliable test case. It does not always happen, but when it happened, it usually happened quite shortly after the first boot. Unfortunately with bisecting that would mean booting a kernel three or five times and letting it run for about half an hour to be quite sure. And that for possibly a dozen kernels of which some might have other problems like eating my filesystems. Right now that's too much for me. I can test some selected kernels or patches as I manage to take time however.
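
The testing cost described above can be put into rough numbers. This is only a sketch: it assumes that every fresh boot of a bad kernel freezes independently with the same probability, and the value 0.5 used in the example is an assumption for illustration, not something measured in this report. The function names are hypothetical.

```python
def miss_probability(p_freeze_per_boot: float, boots: int) -> float:
    """Chance that a bad kernel survives `boots` independent test boots."""
    if not 0.0 <= p_freeze_per_boot <= 1.0:
        raise ValueError("p_freeze_per_boot must be between 0 and 1")
    if boots < 0:
        raise ValueError("boots must not be negative")
    return (1.0 - p_freeze_per_boot) ** boots


def boots_needed(p_freeze_per_boot: float, max_miss: float) -> int:
    """Smallest number of boots that keeps the miss probability <= max_miss."""
    if not 0.0 < p_freeze_per_boot <= 1.0:
        raise ValueError("p_freeze_per_boot must be in (0, 1]")
    if not 0.0 < max_miss < 1.0:
        raise ValueError("max_miss must be in (0, 1)")
    boots = 1
    while miss_probability(p_freeze_per_boot, boots) > max_miss:
        boots += 1
    return boots


# Assumed example: if half of all fresh boots of a bad kernel freeze,
# five clean boots still leave about a 3% chance of wrongly marking it good.
print(miss_probability(0.5, 5))
print(boots_needed(0.5, 0.05))
```

With a lower assumed freeze rate per boot, the number of boots needed per kernel grows quickly, which is why bisecting an intermittent freeze is so time consuming.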
Comment 14 Martin Steigerwald 2010-07-23 15:19:29 UTC
Another hint: I ran 2.6.34.1 on a Radeon based Dell workstation at work, but without KMS and there the desktop never froze. It had a problem with shutting down one SoftRAID on hibernation so I downgraded there too (also reported). The 2.6.34.1 on my SuperSavage based T23 is behaving well currently.
Comment 15 Dragos Delcea 2010-07-25 18:51:35 UTC
I think I might have stumbled over this as well. Hw is a Lenovo T60 Thinkpad (radeon mobility X1400 video card). I'm on 32 bit gentoo, everything 2.6.33 and below works; 2.6.34 - in either gentoo or vanilla flavours - has random freezes.

I'm using 2.6.33.6 currently and I am happy, 2.6.34.1 and 2.6.34.2 (and other gentoo 2.6.34 specific kernels as well) freeze on me. Notably, I'm not using KMS yet, but I am following the radeon master git; not sure whether the freezes are graphic related, though. My use case that seems to trigger this is KVM; having 2 VMs running at once is usually enough.
Comment 16 Dragos Delcea 2010-07-26 07:08:27 UTC
It should have read: 2.6.33.x works while 2.6.34 and 2.6.34.1 freeze on me. There's no 2.6.34.2 (yet) available.
Comment 17 Alex Deucher 2010-07-26 14:59:05 UTC
(In reply to comment #16)
> It should have read: 2.6.33.x works while 2.6.34 and 2.6.34.1 freeze on me.
> There's no 2.6.34.2 (yet) available.

Can you bisect to see what commit is causing the problem?
Comment 18 Dragos Delcea 2010-07-28 12:05:47 UTC
I can't reproduce it anymore with 2.6.35-rc6. I'm going to stay with the same kernel and switch to KMS and see if I can reproduce the reporter's problem.
Comment 19 Martin Steigerwald 2010-07-28 12:51:23 UTC
I started a compile of 2.6.35-rc6 and will test as well. Thanks.
Comment 20 Alex Deucher 2010-07-28 14:32:47 UTC
If rc6 is stable, can you bisect between rc5 and rc6 to see what fixed the issue?  It should be a much smaller change set.
Comment 21 Martin Steigerwald 2010-07-28 17:33:33 UTC
2.6.35-rc6 froze a few minutes after the second boot directly after opening an image in Gwenview from KDE 4.4.4. After the first boot it was stable for 25 minutes including playing some AVI movie from my digicam with Dragon Player - but well it seems to be random, I have no pattern to trigger it - the only observation I made is that if it doesn't happen in the first half hour after boot it likely doesn't happen anymore until I do another fresh boot:

martin@shambhala:~> uprecords -m200 | egrep "(rc6|#)"
     #               Uptime | System                                     Boot up
   127     0 days, 00:25:14 | Linux 2.6.35-rc6-tp42-to  Wed Jul 28 18:46:25 2010
   186     0 days, 00:00:53 | Linux 2.6.35-rc6-tp42-to  Wed Jul 28 19:12:13 2010
Comment 22 Rafael J. Wysocki 2010-08-01 14:16:56 UTC
Handled-By : Alex Deucher <alexdeucher@gmail.com>
Comment 23 Martin Steigerwald 2010-08-14 16:47:40 UTC
This bug still happens with 2.6.35.2. This time it hung while compiling virtualbox-ose kernel modules and playing music with Amarok. The music stopped playing.

I think I will have a try with 2.6.36-rc3 or 4 again. And if I manage to do it possibly at least try going back to some 2.6.34-rc's to see if the bug was introduced with some rc. I do not feel comfortable with bisecting at random versions, because I don't know whether there might have been short lived ext4 filesystem corruption bugs or whatnot.
Comment 24 Martin Steigerwald 2010-08-14 16:54:43 UTC
Dave, I hope you do not mind adding you to the CC list, but since this happens on a ThinkPad T42 your feedback might help.

I thought:

From: Michel Dänzer <daenzer@vmware.com>

commit e376573f7267390f4e1bdc552564b6fb913bce76 upstream.

This fixes a problem where on low VRAM cards we'd run out of space for validation.

[airlied: Tested on my M7, Thinkpad T42, compiz works with no problems.]

Signed-off-by: Michel Dänzer <daenzer@vmware.com>
Signed-off-by: Dave Airlie <airlied@redhat.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>

might have fixed this bug
Comment 25 Martin Steigerwald 2010-08-14 17:03:47 UTC
Dave, I hope you do not mind adding you to the CC list, but since this happens on a ThinkPad T42 your feedback might help.

I thought

----------------------------------------
From: Michel Dänzer <daenzer@vmware.com>

commit e376573f7267390f4e1bdc552564b6fb913bce76 upstream.

This fixes a problem where on low VRAM cards we'd run out of space for validation.

[airlied: Tested on my M7, Thinkpad T42, compiz works with no problems.]

Signed-off-by: Michel Dänzer <daenzer@vmware.com>
Signed-off-by: Dave Airlie <airlied@redhat.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
----------------------------------------

might have fixed this bug since I read something about this having been tested on a T42. I did not recognize that it was your comment, so I contacted Michel Dänzer about it.

He said it's unlikely that it will fix my issue, but I should try it anyway. Well as said above, I tried 2.6.35.2 and it still hung.

You say compiz works without problems for you. I read this as you do not get any hard freezes with any kernel since 2.6.34 on your ThinkPad T42. Did you try KDE 4 KWin compositing as well? What X.org / driver versions do you use? There must be a difference why it works for your ThinkPad, but not for mine.

Here are my current versions:

martin@shambhala:~> apt-show-versions | egrep "(xserver-xorg/|xserver-xorg-core/|xserver-xorg-video-radeon/|libgl1-mesa-dri/|libdrm2/|libdrm-radeon1/|kde-window-manager/|kdelibs5/)"
kde-window-manager/squeeze uptodate 4:4.4.5-1
kdelibs5/squeeze uptodate 4:4.4.5-1
libdrm-radeon1/experimental uptodate 2.4.21-1
libdrm2/experimental uptodate 2.4.21-1
libgl1-mesa-dri/experimental uptodate 7.8.2-2
xserver-xorg/squeeze uptodate 1:7.5+6
xserver-xorg-core/squeeze uptodate 2:1.7.7-3
xserver-xorg-video-radeon/squeeze uptodate 1:6.13.1-2

As noted above in the bug report: Kernel 2.6.33 works rock stable for me with Radeon KMS. But that said, I am not even sure whether it's a Radeon KMS issue. That it only happens after a KDE 4 compositing session has started points at it, but is no proof anyway.

I want to help to provide information to get this fixed, but I feel uneasy about doing a full bisect between 2.6.33 and 2.6.34. I might start trying out whether it starts happening at some rc, starting with rc3 and going forwards or backwards from there.
Comment 26 Martin Steigerwald 2010-08-23 12:39:21 UTC
Still happens with 2.6.35.3. This time I set up a netconsole connection. Nothing. So it's really a complete hard freeze. Next step probably is trying 2.6.34-rc2 and going back or forward until I at least know between which rcs the regression has been introduced. Or trying 2.6.36-rc2.
Comment 27 Alex Deucher 2010-08-23 14:20:40 UTC
Does disabling the lockup detect code fix the issue?

--- a/drivers/gpu/drm/radeon/r600.c
+++ b/drivers/gpu/drm/radeon/r600.c
@@ -1345,6 +1345,8 @@ bool r600_gpu_is_lockup(struct radeon_device *rdev)
        u32 grbm_status2;
        int r;
 
+       return false;
+
        srbm_status = RREG32(R_000E50_SRBM_STATUS);
        grbm_status = RREG32(R_008010_GRBM_STATUS);
        grbm_status2 = RREG32(R_008014_GRBM_STATUS2);
Comment 28 Alex Deucher 2010-08-23 14:22:32 UTC
updated patch:
diff --git a/drivers/gpu/drm/radeon/r100.c b/drivers/gpu/drm/radeon/r100.c
index c0c5cef..7d9cc47 100644
--- a/drivers/gpu/drm/radeon/r100.c
+++ b/drivers/gpu/drm/radeon/r100.c
@@ -1996,6 +1996,8 @@ bool r100_gpu_is_lockup(struct radeon_device *rdev)
        u32 rbbm_status;
        int r;
 
+       return false;
+
        rbbm_status = RREG32(R_000E40_RBBM_STATUS);
        if (!G_000E40_GUI_ACTIVE(rbbm_status)) {
                r100_gpu_lockup_update(&rdev->config.r100.lockup, &rdev->cp);
diff --git a/drivers/gpu/drm/radeon/r300.c b/drivers/gpu/drm/radeon/r300.c
index 3ba05d0..176f791 100644
--- a/drivers/gpu/drm/radeon/r300.c
+++ b/drivers/gpu/drm/radeon/r300.c
@@ -384,6 +384,8 @@ bool r300_gpu_is_lockup(struct radeon_device *rdev)
        u32 rbbm_status;
        int r;
 
+       return false;
+
        rbbm_status = RREG32(R_000E40_RBBM_STATUS);
        if (!G_000E40_GUI_ACTIVE(rbbm_status)) {
                r100_gpu_lockup_update(&rdev->config.r300.lockup, &rdev->cp);
diff --git a/drivers/gpu/drm/radeon/r600.c b/drivers/gpu/drm/radeon/r600.c
index 667c237..bba86d5 100644
--- a/drivers/gpu/drm/radeon/r600.c
+++ b/drivers/gpu/drm/radeon/r600.c
@@ -1345,6 +1345,8 @@ bool r600_gpu_is_lockup(struct radeon_device *rdev)
        u32 grbm_status2;
        int r;
 
+       return false;
+
        srbm_status = RREG32(R_000E50_SRBM_STATUS);
        grbm_status = RREG32(R_008010_GRBM_STATUS);
        grbm_status2 = RREG32(R_008014_GRBM_STATUS2);
Comment 30 Martin Steigerwald 2010-08-23 14:37:27 UTC
Should I try both patches together? Or each one separately? I am currently also compiling a 2.6.36-rc2.
Comment 31 Alex Deucher 2010-08-23 14:39:11 UTC
Separately please.
Comment 32 Martin Steigerwald 2010-08-23 19:44:44 UTC
2.6.35.3-tp42-toi-3.1.1.1-no-lockup-detect-05004-ga695ed8-dirty, the kernel with your patch to disable lockup detection, did look very promising. Just as I wanted to tell you that it works it locked up again. It survived three boots with 20-30 minutes of KDE 4 OpenGL compositing each. But after the last boot it locked up after one and a half hours.

Could it be that this now was a lockup due to *missing* lockup detection? I will try the other patch tomorrow. Now I want to use my laptop. ;)
Comment 33 Martin Steigerwald 2010-08-24 18:30:33 UTC
drm-bail-early-if-nothing-changes-2.6.git-4e186b2d6c878793587c35d7f06c94565d76e9b8.patch didn't work either. The machine crashed a few minutes after second boot in the middle of a window movement. I think I will try 2.6.36-rc2 next and if that doesn't work, start with 2.6.34-rc2 and go forward or backward from there, depending on the results. Or look whether I can git bisect with an emphasis of trying likely safe versions like rc's before going in the middle of critical changes.
Comment 34 Martin Steigerwald 2010-08-25 10:44:52 UTC
No luck with 2.6.36-rc2-tp42-00203-g502adf5 from Linus' git either. It hung after the second boot. That also proves that TuxOnIce is not involved in the bug. Back to 2.6.33.7 for now. This one works absolutely stable.
Comment 35 Martin Steigerwald 2010-08-26 19:18:39 UTC
So I am indeed bisecting this, but it is slow. Last time, after 5 reboots and hours of testing, the kernel froze only after I had marked it as good and started compiling the next one. Since I did not find a git bisect undo that would let me mark it as bad afterwards, I used git bisect reset and git bisect start with that commit as bad and v2.6.33 as good. And now I am stuck:

martin@shambhala:~/Computer/Shambhala/Kernel/2.6.33-2.6.34-bisect/linux-2.6> git bisect log
# bad: [64ba9926759792cf7b95f823402e2781edd1b5d4] Merge branch 'for-linus' of git://git.open-osd.org/linux-open-osd
# good: [60b341b778cc2929df16c0a504c91621b3c6a4ad] Linux 2.6.33
git bisect start '64ba9926759792cf7b95f823402e2781edd1b5d4' 'v2.6.33'
# skip: [d3ae8562d43fe2b97d605dd67dc67bf8fa9b956a] usb: host: ehci: fix missing kfree in remove path also
git bisect skip d3ae8562d43fe2b97d605dd67dc67bf8fa9b956a
# good: [8bd93a2c5d4cab2ae17d06350daa7dbf546a4634] rcu: Accelerate grace period if last non-dynticked CPU
git bisect good 8bd93a2c5d4cab2ae17d06350daa7dbf546a4634
# skip: [5be796f0b842c5852d7397a82f8ebd6be8451872] USB: ch341: use get_unaligned_le16 in break_ctl
git bisect skip 5be796f0b842c5852d7397a82f8ebd6be8451872
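
A verdict that turns out to be wrong can also be corrected by editing a saved `git bisect log` and replaying it, instead of starting over. The following is a minimal sketch of that editing step; the helper name `change_verdict` is hypothetical, and the commit used in the example is only taken from the log above for illustration (it is not the commit that was wrongly marked).

```python
VERDICTS = ("good", "bad", "skip")


def change_verdict(log_text: str, commit: str, new_verdict: str) -> str:
    """Return a copy of a `git bisect log` with one commit's verdict changed.

    `commit` may be a full hash or a prefix of it.
    """
    if new_verdict not in VERDICTS:
        raise ValueError(f"verdict must be one of {VERDICTS}")
    if not commit:
        raise ValueError("commit must not be empty")

    kept = []
    changed = False
    for line in log_text.splitlines():
        if line.startswith("#"):
            # Comment lines only describe the old verdict; drop the one for
            # this commit so the edited log does not contradict itself.
            if f"[{commit}" in line:
                continue
        else:
            parts = line.split()
            if (len(parts) == 4 and parts[:2] == ["git", "bisect"]
                    and parts[2] in VERDICTS
                    and parts[3].startswith(commit)):
                line = f"git bisect {new_verdict} {parts[3]}"
                changed = True
        kept.append(line)

    if not changed:
        raise ValueError(f"no verdict for {commit} found in log")
    return "\n".join(kept) + "\n"


example_log = (
    "git bisect start '64ba9926759792cf7b95f823402e2781edd1b5d4' 'v2.6.33'\n"
    "# good: [8bd93a2c5d4cab2ae17d06350daa7dbf546a4634] rcu: Accelerate\n"
    "git bisect good 8bd93a2c5d4cab2ae17d06350daa7dbf546a4634\n"
)
print(change_verdict(example_log, "8bd93a2c", "bad"), end="")
```

The edited text can be written to a file and passed to `git bisect replay <file>` after a `git bisect reset`, which re-applies all recorded verdicts in one go.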

I ran into a non booting kernel with d3ae8562d43fe2b97d605dd67dc67bf8fa9b956a, it oopsed with a backtrace in ext4 and readahead I don't want to think about deeply, so I skipped it. Then it put me just a few revisions after 2.6.33, a kernel that didn't make much sense for me to test - I did nonetheless, it was good as expected. Then it took me just a few revisions after the non working kernel. Consequently this one didn't boot either and I skipped it. After that git bisect took me prior to 2.6.33, git log from 5be796f0b842c5852d7397a82f8ebd6be8451872 gives me

    296 commit 6b7b284958d47b77d06745b36bc7f36dab769d9b
    297 Author: Linus Torvalds <torvalds@linux-foundation.org>
    298 Date:   Thu Dec 24 13:09:41 2009 -0800
    299 
    300     Linux 2.6.33-rc2

on line 300. But I told git that v2.6.33 was good. It can't be serious about asking me to test a kernel *prior* to 2.6.33!

Testing is really slow, as it seems that even after the fifth reboot and some hours of uptime it can still freeze. So I need about half a day to be quite sure that a kernel is good. I won't waste my time going prior to 2.6.33 which is known good to make this experience even more painful than it was already.

So I ask for advice on how to proceed.

Well at least it's clear now that the freeze was introduced way prior to 2.6.34-rc1. I would have liked it the other way around.

According to git it's still

martin@shambhala:~/Computer/Shambhala/Kernel/2.6.33-2.6.34-bisect/linux-2.6> git bisect skip
Bisecting: 2518 revisions left to test after this (roughly 11 steps)
Comment 36 Martin Steigerwald 2010-08-26 19:28:43 UTC
Should I just mark the current kernel 5be796f0b842c5852d7397a82f8ebd6be8451872, which is soon after 2.6.33-rc2 and which I am not interested in, as good without even compiling it, so that git hopefully takes me to another one around the one that I skipped because it didn't boot?
Comment 37 Martin Steigerwald 2010-09-05 07:26:50 UTC
Created attachment 29002 [details]
bisect cont'd - unbootable kernels ext4 / readahead backtrace

I asked on kernel-ml and got a hint that this may be due to merges. Thus I tested this kernel and continued. git bisect pointed me back to a lot of unbootable kernels with the ext4/readahead backtrace. They have all been in a USB merge. I am not booting from USB. I think that bug was introduced before that USB merge and fixed sometime after merging it, so after testing about five unbootable kernels from that range, I told git bisect to skip all the commits of that merge. After testing a few kernels including one that showed "Destination address too large" and "-- System halted" just after trying to boot it, I got one that's quite some revisions prior to the USB merge. I'll test it and hope that it boots.

I will post a complete bisect log of what I have so far in a moment.
Comment 38 Martin Steigerwald 2010-09-05 07:35:55 UTC
Created attachment 29012 [details]
current state of bisection, please help with selecting next commits to test

If it's indeed a Radeon KMS related freeze, and up to now everything points at this, I think it has been introduced in the first quarter of commits that git bisect visualize shows. I'd like to test just before these KMS/DRM related commits and just after them to speed up the bisection process, but there are branches and merges and also there are those unbootable kernels (see previous comment) not that much before these KMS/DRM related commits, so I am unsure which exact commit ids to choose.

Please help me select suitable commits - I had more than five unbootable kernels already and the range of commits to test thus is still above 1800. I would really like to shorten this. In the meanwhile I am just going to compile and test the next kernel.
Comment 39 Martin Steigerwald 2010-09-05 13:29:03 UTC
Well I now go with

martin@shambhala:~/Computer/Shambhala/Kernel/2.6.33-2.6.34-bisect/linux-2.6> git log | head
commit a27341cd5fcb7cf2d2d4726e9f324009f7162c00
Author: Linus Torvalds <torvalds@linux-foundation.org>
Date:   Tue Mar 2 08:36:46 2010 -0800

    Prioritize synchronous signals over 'normal' signals
    
    This makes sure that we pick the synchronous signals caused by a
    processor fault over any pending regular asynchronous signals sent to
    use by [t]kill().

When I interpret the graphical display of gitk correctly this one should 
be directly before the whole lot of KMS/DRM/Radeon/Intel commits being 
merged.

Now hopefully this one boots. And is good. Then I would have skipped a 
whole lot of the revisions to test. But even when it just boots, but is 
bad, I at least know that it's not in these merges.
Comment 40 Alex Deucher 2010-09-07 18:33:27 UTC
This might be a duplicate of:
https://bugs.freedesktop.org/show_bug.cgi?id=28402
Does the patch there help?
Comment 41 Martin Steigerwald 2010-09-08 16:36:29 UTC
Created attachment 29352 [details]
Current status of git bisection, now testing patches mentioned in comment

From looking at the backtrace, Alex hinted that this might be a duplicate of[1]:

Bug 28402 - Kernel 2.6.34 freezes randomly
https://bugs.freedesktop.org/show_bug.cgi?id=28402

I will now try the patch from comment 47 of that bug and then those mentioned by Da Fox in comment 43.

[1] See thread "Bisecting radeon kms freeze bug: Almost there, please help with choosing next commit": http://lkml.org/lkml/2010/9/7/199
Comment 42 Martin Steigerwald 2010-09-08 16:40:58 UTC
comment 44 instead of 43 that should have been.
Comment 43 Martin Steigerwald 2010-09-08 21:48:55 UTC
Created attachment 29382 [details]
no freezes so far with vmembase zero patch from Dave

The vmem align patch from Alex (comment #41 of freedesktop bug) didn't help. I now test the vmembase at zero patch from Dave (attached here and comment #41 of the fdo bug).

Looks very good so far. I will reboot this kernel several times tomorrow - as a freeze so far only ever happened *before* the first hibernation / snapshot cycle - but I watched some Star Trek Voyager without a freeze with:

martin@shambhala:~/Computer/Shambhala/Kernel/2.6.36> cat /proc/version 
Linux version 2.6.36-rc3-tp42-toi-3.2-rc1-vmembase-0-05032-g60140c1-dirty (martin@shambhala) (gcc version 4.4.5 20100728 (prerelease) (Debian 4.4.4-8) ) #2 PREEMPT Wed Sep 8 21:36:34 CEST 2010

Thanks.
Comment 44 Martin Steigerwald 2010-12-07 21:18:40 UTC
Fix is upstream:
http://git.kernel.org/?p=linux/kernel/git/torvalds/linux-2.6.git;a=commitdiff;h=b7d8cce5b558e0c0aa6898c9865356481598b46d

Tested in 2.6.37-rc3. 2.6.37-rc5 seems stable too.

Many thanks.