Bug 16376
Summary: | random Radeon DRM KMS related freezes | ||
---|---|---|---|
Product: | Drivers | Reporter: | Martin Steigerwald (Martin) |
Component: | Video(DRI - non Intel) | Assignee: | drivers_video-dri |
Status: | RESOLVED CODE_FIX | ||
Severity: | normal | CC: | airlied, alexdeucher, dragos.delcea, maciej.rutecki, mrsteven, rjw |
Priority: | P1 | ||
Hardware: | All | ||
OS: | Linux | ||
Kernel Version: | 2.6.34 | Subsystem: | |
Regression: | Yes | Bisected commit-id: | |
Bug Depends on: | |||
Bug Blocks: | 15310 | ||
Attachments: |
hardware of my ThinkPad T42, lspci -nvv
bisect cont'd - unbootable kernels ext4 / readahead backtrace
current state of bisection, please help with selecting next commits to test
Current status of git bisection, now testing patches mentioned in comment
no freezes so far with vmembase zero patch from Dave |
Description
Martin Steigerwald
2010-07-13 09:24:47 UTC
Created attachment 27083 [details]
hardware of my ThinkPad T42, lspci -nvv
Userspace I use:

apt-show-versions | egrep "(xserver-xorg/|xserver-xorg-core/|xserver-xorg-video-radeon/|libgl1-mesa-dri/|libdrm2/|libdrm-radeon1/|kde-window-manager/)"
kde-window-manager/squeeze uptodate 4:4.4.4-1
libdrm-radeon1/experimental uptodate 2.4.21-1
libdrm2/experimental uptodate 2.4.21-1
libgl1-mesa-dri/experimental uptodate 7.8.2-1
xserver-xorg/squeeze uptodate 1:7.5+6
xserver-xorg-core/squeeze uptodate 2:1.7.7-2
xserver-xorg-video-radeon/sid uptodate 1:6.13.1-1

I did some research on the internet and found this possibly related case - hot, since it also happens on a ThinkPad T42 with similar if not same graphics hardware:

------------------------------------------------------
Random freezes with kernel 2.6.34 and xorg 1.8

The other day I upgraded the kernel to version 2.6.34 and at the same time xorg-server to 1.8 (along with input drivers and video drivers). From that moment, I suffer from random freezes, the system is completely locked up; the screen doesn't blank, though. The system appears to be perfectly fine, but after some minutes it will freeze.

As far as I can remember, the freezes only occur when using a webbrowser. Chromium is my default browser, but also Firefox and Konqueror caused the system to freeze completely when I want to open a webpage. Other network programs such as irssi can run for hours and they didn't seem to cause any havoc.

At first I thought this might be caused by some instability in the latest xf86-video-ati driver, so I downgraded back to xorg-server 1.7. Still the same symptoms, so I am pretty sure now there's something in the kernel. So I went back to xorg-server 1.8 and downgraded the kernel to 2.6.33. So far, the system hasn't let me down, yet.

Also, I uninstalled the madwifi packages on my system so I'm using the ath5 drivers shipped with the kernel. That change doesn't seem to make a difference. I am not exactly sure what exactly could cause this, there's no trace in any log file to be found.
My suspicion is that it's network related. The hardware is a Thinkpad T42 with an ATI Radeon Mobility 9600 (r300) chip and an Atheros wireless card.

Anyone else with similar experiences with this hardware? I can't think of a way how to properly debug this. I know I can bisect, but that's time consuming and perhaps it's quicker to ask around first before I walk down that road. Any suggestions are welcome on how to track this down
------------------------------------------------------
http://bbs.archlinux.org/viewtopic.php?pid=789948

I do not use any wireless however, the ipw2200 radio is disabled here.

Unlikely, because it seems to be easily reproducible:

------------------------------------------------------
seems to freeze on drm installation [2.6.34, 2× RV280]

See attached screenshot. While I can boot anything up to 2.6.33, 2.6.34 seems to reproducibly hang itself up when it starts DRM. This machine has two Radeon RV280 cards.
------------------------------------------------------
http://bugs.debian.org/cgi-bin/bugreport.cgi?bug=586137

While searching for traces of those freezes in syslog I came across backtraces that may or may not be related. See bug #16377. They were prior to the third freeze. I did not find any backtraces shortly before that freeze. Will look for the earlier two freezes now.

Does s/r work ok without tuxonice?

The bug does not appear to be suspend/resume related at all. The freezes did happen after a fresh boot, without any snapshot cycle in between. They never happened after the first TuxOnIce cycle which I guess is just a coincidence. So the machine had no snapshot cycles as the freeze occurred. The only thing TuxOnIce does on a fresh boot is finding no TuxOnIce image and exiting, so I highly doubt the freezes are TuxOnIce related in any way.

Does this happen with 2.6.35?
Yes, after giving up on an issue of Debian's make-kpkg with Kernel 2.6.35 "+" sign in version number[1], I used make deb-pkg to compile 2.6.35-rc5-04995-g7441ae8 from TuxOnIce head. About 5 minutes after booting it, shortly after starting KDE 4 OpenGL composited desktop session, the machine froze hard. This time I tried accessing it from another machine, my second ThinkPad, a T23 with SuperSavage chipset. The frozen T42 did not answer any ping, "destination host unreachable". Thus I conclude the kernel was completely locked up. Again it was a fresh boot, no hibernation cycle in between.

The T23 also has a 2.6.34.1 which didn't yet lock up and has an uptime of 7 days, but only 3 TuxOnIce snapshot cycles. It appears to be stable. Thus I think it really could have to do with radeon KMS DRM code.

[1] http://bugs.debian.org/588178

Any chance you could bisect this?

Difficult, since it doesn't happen all the time, I have no clear pattern on reproducibility. When it happened it often happened within 5 minutes after starting the desktop, but when playing the AVI videos from SD card it took longer. And sometimes it just didn't freeze at all. This also is the laptop for all my stuff, thus I have some stability requirements for it. I never did bisecting so far, but as far as I understand it usually includes testing about a dozen kernels and thus having a dozen freezes - quite some risk to lose unsaved data as I can't predict if and when it freezes.

But there is no other way to get this one tracked? I thought about net console, but when the kernel doesn't even respond to a ping anymore, I guess it won't send out anything over the net. Maybe in the moments before the freeze it sends anything?

Unfortunately GPU freezes are really hard to track down without a reliable test case.

I don't have a reliable test case. It does not happen always, but when it happened, it usually happened quite shortly after the first boot.
Unfortunately with bisecting that would mean booting a kernel three or five times and letting it run for about half an hour to be quite sure. And that for possibly a dozen kernels of which some might have other problems like eating my filesystems. Right now that's too much for me. I can test some selected kernels or patches as I manage to take time however.

Another hint: I ran 2.6.34.1 on a Radeon based Dell workstation at work, but without KMS and there the desktop never froze. It had a problem with shutting down one SoftRAID on hibernation so I downgraded there too (also reported). The 2.6.34.1 on my SuperSavage based T23 is behaving well currently.

I think I might have stumbled over this as well. Hw is a Lenovo T60 Thinkpad (radeon mobility X1400 video card). I'm on 32 bit gentoo, everything 2.6.33 and below works; 2.6.34 - in either gentoo or vanilla flavours - has random freezes. I'm using 2.6.33.6 currently and I am happy, 2.6.34.1 and 2.6.34.2 (and other gentoo 2.6.34 specific kernels as well) freeze on me. Notably, I'm not using KMS yet, but I am following the radeon master git; not sure whether the freezes are graphic related, though. My use case that seems to trigger this is KVM; having 2 VMs running at once is usually enough.

It should have read: 2.6.33.x works while 2.6.34 and 2.6.34.1 freeze on me. There's no 2.6.34.2 (yet) available.

(In reply to comment #16)
> It should have read: 2.6.33.x works while 2.6.34 and 2.6.34.1 freeze on me.
> There's no 2.6.34.2 (yet) available.

Can you bisect to see what commit is causing the problem?

I can't reproduce it anymore with 2.6.35-rc6. I'm going to stay with the same kernel and switch to KMS and see if I can reproduce the reporter's problem.

I started a compile of 2.6.35-rc6 and will test as well.

Thanks. If rc6 is stable, can you bisect between rc5 and rc6 to see what fixed the issue? It should be a much smaller change set.
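A rough aid for the "three or five boots" estimate discussed in this thread: if a bad kernel froze in any single test boot with some fixed probability, independently of the other boots, the chance of wrongly marking it good would shrink geometrically with the number of clean boots. The sketch below is only an illustration under that assumption. `p_freeze` is a hypothetical per-boot freeze probability that nobody in this report has measured, and both function names are made up for the example.

```python
import math


def false_good_probability(p_freeze: float, clean_boots: int) -> float:
    """Chance that a bad kernel survives `clean_boots` test boots in a row.

    Assumes (hypothetically) that each boot freezes independently
    with probability `p_freeze`.
    """
    if not 0.0 < p_freeze < 1.0:
        raise ValueError("p_freeze must be strictly between 0 and 1")
    if clean_boots < 0:
        raise ValueError("clean_boots must not be negative")
    return (1.0 - p_freeze) ** clean_boots


def boots_needed(p_freeze: float, max_false_good: float) -> int:
    """Smallest number of clean boots n with (1 - p_freeze)**n <= max_false_good."""
    if not 0.0 < p_freeze < 1.0:
        raise ValueError("p_freeze must be strictly between 0 and 1")
    if not 0.0 < max_false_good < 1.0:
        raise ValueError("max_false_good must be strictly between 0 and 1")
    return math.ceil(math.log(max_false_good) / math.log(1.0 - p_freeze))


# Example with an assumed 50% freeze chance per test boot of a bad kernel
# and a tolerated 5% risk of a wrong "good" verdict.
print(boots_needed(0.5, 0.05))
print(false_good_probability(0.5, 5))
```

If the real per-boot freeze chance is lower than assumed, the number of clean boots needed per kernel rises quickly, which is one reason intermittent freezes make bisection so slow.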
2.6.35-rc6 froze a few minutes after the second boot directly after opening an image in Gwenview from KDE 4.4.4. After the first boot it was stable for 25 minutes including playing some AVI movie from my digicam with Dragon Player - but well it seems to be random, I have no pattern to trigger it - only observation I made is if it doesn't happen in the first half hour after boot it likely doesn't happen anymore until I do another fresh boot:

martin@shambhala:~> uprecords -m200 | egrep "(rc6|#)"
# Uptime | System Boot up
127 0 days, 00:25:14 | Linux 2.6.35-rc6-tp42-to Wed Jul 28 18:46:25 2010
186 0 days, 00:00:53 | Linux 2.6.35-rc6-tp42-to Wed Jul 28 19:12:13 2010

Handled-By : Alex Deucher <alexdeucher@gmail.com>

This bug still happens with 2.6.35.2. This time it hung while compiling virtualbox-ose kernel modules and playing music with Amarok. The music stopped playing.

I think I will have a try with 2.6.36-rc3 or 4 again. And if I manage to do it possibly at least try going back to some 2.6.34-rc's to see if the bug was introduced with some rc. I do not feel comfortable with bisecting at random versions, cause I don't know whether they might have had short lived ext4 filesystem corruption bugs or whatnot.

Dave, I hope you do not mind my adding you to the CC list, but since this happens on a ThinkPad T42 your feedback might help.
I thought

----------------------------------------
From: Michel Dänzer <daenzer@vmware.com>

commit e376573f7267390f4e1bdc552564b6fb913bce76 upstream.

This fixes a problem where on low VRAM cards we'd run out of space for validation.

[airlied: Tested on my M7, Thinkpad T42, compiz works with no problems.]

Signed-off-by: Michel Dänzer <daenzer@vmware.com>
Signed-off-by: Dave Airlie <airlied@redhat.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
----------------------------------------

might have fixed this bug since I read something about this having been tested on a T42. I did not recognize that it was your comment, so I contacted Michel Dänzer about it. He said it's unlikely that it will fix my issue, but I should try it anyway. Well as said above, I tried 2.6.35.2 and it still hung.

You say compiz works without problems for you. I read this as you do not get any hard freezes with any kernel since 2.6.34 on your ThinkPad T42. Did you try KDE 4 KWin compositing as well? What X.org / driver versions do you use? There must be a difference why it works for your ThinkPad, but not for mine.

Here are my current versions:

martin@shambhala:~> apt-show-versions | egrep "(xserver-xorg/|xserver-xorg-core/|xserver-xorg-video-radeon/|libgl1-mesa-dri/|libdrm2/|libdrm-radeon1/|kde-window-manager/|kdelibs5/)"
kde-window-manager/squeeze uptodate 4:4.4.5-1
kdelibs5/squeeze uptodate 4:4.4.5-1
libdrm-radeon1/experimental uptodate 2.4.21-1
libdrm2/experimental uptodate 2.4.21-1
libgl1-mesa-dri/experimental uptodate 7.8.2-2
xserver-xorg/squeeze uptodate 1:7.5+6
xserver-xorg-core/squeeze uptodate 2:1.7.7-3
xserver-xorg-video-radeon/squeeze uptodate 1:6.13.1-2

As noted above in the bug report: Kernel 2.6.33 works rock stable for me with Radeon KMS.

But that said, I am not even sure whether it's a Radeon KMS issue. That it only happens after a KDE 4 compositing session has started points at it, but is no proof anyway.
I want to help to provide information to get this fixed, but I feel uneasy about doing a full bisect between 2.6.33 and 2.6.34. I might start trying out whether it starts happening at some rc, starting with rc3 and going forwards or backwards from there.

Still happens with 2.6.35.3. This time I set up a netconsole connection. Nothing. So it's really a complete hard freeze. Next step probably is trying 2.6.34-rc2 and going back or forward until I at least know between which rc the regression has been introduced. Or trying 2.6.36-rc2.

Does disabling the lockup detect code fix the issue?

--- a/drivers/gpu/drm/radeon/r600.c
+++ b/drivers/gpu/drm/radeon/r600.c
@@ -1345,6 +1345,8 @@ bool r600_gpu_is_lockup(struct radeon_device *rdev)
 	u32 grbm_status2;
 	int r;
 
+	return false;
+
 	srbm_status = RREG32(R_000E50_SRBM_STATUS);
 	grbm_status = RREG32(R_008010_GRBM_STATUS);
 	grbm_status2 = RREG32(R_008014_GRBM_STATUS2);

updated patch:

diff --git a/drivers/gpu/drm/radeon/r100.c b/drivers/gpu/drm/radeon/r100.c
index c0c5cef..7d9cc47 100644
--- a/drivers/gpu/drm/radeon/r100.c
+++ b/drivers/gpu/drm/radeon/r100.c
@@ -1996,6 +1996,8 @@ bool r100_gpu_is_lockup(struct radeon_device *rdev)
 	u32 rbbm_status;
 	int r;
 
+	return false;
+
 	rbbm_status = RREG32(R_000E40_RBBM_STATUS);
 	if (!G_000E40_GUI_ACTIVE(rbbm_status)) {
 		r100_gpu_lockup_update(&rdev->config.r100.lockup, &rdev->cp);
diff --git a/drivers/gpu/drm/radeon/r300.c b/drivers/gpu/drm/radeon/r300.c
index 3ba05d0..176f791 100644
--- a/drivers/gpu/drm/radeon/r300.c
+++ b/drivers/gpu/drm/radeon/r300.c
@@ -384,6 +384,8 @@ bool r300_gpu_is_lockup(struct radeon_device *rdev)
 	u32 rbbm_status;
 	int r;
 
+	return false;
+
 	rbbm_status = RREG32(R_000E40_RBBM_STATUS);
 	if (!G_000E40_GUI_ACTIVE(rbbm_status)) {
 		r100_gpu_lockup_update(&rdev->config.r300.lockup, &rdev->cp);
diff --git a/drivers/gpu/drm/radeon/r600.c b/drivers/gpu/drm/radeon/r600.c
index 667c237..bba86d5 100644
--- a/drivers/gpu/drm/radeon/r600.c
+++ b/drivers/gpu/drm/radeon/r600.c
@@ -1345,6 +1345,8 @@ bool r600_gpu_is_lockup(struct radeon_device *rdev)
 	u32 grbm_status2;
 	int r;
 
+	return false;
+
 	srbm_status = RREG32(R_000E50_SRBM_STATUS);
 	grbm_status = RREG32(R_008010_GRBM_STATUS);
 	grbm_status2 = RREG32(R_008014_GRBM_STATUS2);

Please also try this patch:
http://git.kernel.org/?p=linux/kernel/git/airlied/drm-2.6.git;a=commitdiff;h=4e186b2d6c878793587c35d7f06c94565d76e9b8

Should I try both patches together? Or each one separately? I am currently also compiling a 2.6.36-rc2.

Separately please.

2.6.35.3-tp42-toi-3.1.1.1-no-lockup-detect-05004-ga695ed8-dirty, the kernel with your patch to disable lockup detection, did look very promising. Just as I wanted to tell you that it works it locked up again. It survived three boots with 20-30 minutes of KDE 4 OpenGL compositing each. But after the last boot it locked up after one and a half hours. Could it be that this now was a lockup due to *missing* lockup detection?

I will try the other patch tomorrow. Now I want to use my laptop. ;)

drm-bail-early-if-nothing-changes-2.6.git-4e186b2d6c878793587c35d7f06c94565d76e9b8.patch didn't work either. The machine crashed a few minutes after second boot in the middle of a window movement.

I think I will try 2.6.36-rc2 next and if that doesn't work, start with 2.6.34-rc2 and go forward or backward from there, depending on the results. Or look whether I can git bisect with an emphasis on trying likely safe versions like rc's before going into the middle of critical changes.

No luck with 2.6.36-rc2-tp42-00203-g502adf5 from Linus' git either. It hung after the second boot. That also proves that TuxOnIce is not involved in the bug.

Back to 2.6.33.7 for now. This one works absolutely stable.
So I am indeed bisecting this, but it is slow: last time, just after 5 reboots and hours of testing, the kernel froze after I had marked it as good and started compiling the next one. Since I did not find a git bisect undo to do git bisect bad after it, I used git bisect reset and git bisect start from that commit as bad to v2.6.33 as good - and now I am stuck:

martin@shambhala:~/Computer/Shambhala/Kernel/2.6.33-2.6.34-bisect/linux-2.6> git bisect log
# bad: [64ba9926759792cf7b95f823402e2781edd1b5d4] Merge branch 'for-linus' of git://git.open-osd.org/linux-open-osd
# good: [60b341b778cc2929df16c0a504c91621b3c6a4ad] Linux 2.6.33
git bisect start '64ba9926759792cf7b95f823402e2781edd1b5d4' 'v2.6.33'
# skip: [d3ae8562d43fe2b97d605dd67dc67bf8fa9b956a] usb: host: ehci: fix missing kfree in remove path also
git bisect skip d3ae8562d43fe2b97d605dd67dc67bf8fa9b956a
# good: [8bd93a2c5d4cab2ae17d06350daa7dbf546a4634] rcu: Accelerate grace period if last non-dynticked CPU
git bisect good 8bd93a2c5d4cab2ae17d06350daa7dbf546a4634
# skip: [5be796f0b842c5852d7397a82f8ebd6be8451872] USB: ch341: use get_unaligned_le16 in break_ctl
git bisect skip 5be796f0b842c5852d7397a82f8ebd6be8451872

I ran into a non booting kernel with d3ae8562d43fe2b97d605dd67dc67bf8fa9b956a, it oopsed with a backtrace in ext4 and readahead I don't want to think about deeply, so I skipped it. Then it put me just a few revisions after 2.6.33, a kernel that didn't make much sense for me to test - I did nonetheless, it was good as expected. Then it took me just a few revisions after the non working kernel. Consequently this one didn't boot as well and I skipped it. After that git bisect took me prior to 2.6.33, git log from 5be796f0b842c5852d7397a82f8ebd6be8451872 gives me

296 commit 6b7b284958d47b77d06745b36bc7f36dab769d9b
297 Author: Linus Torvalds <torvalds@linux-foundation.org>
298 Date: Thu Dec 24 13:09:41 2009 -0800
299
300 Linux 2.6.33-rc2

on line 300. But I told git that v2.6.33 was good.
It can't be serious about asking me to test a kernel *prior* to 2.6.33! Testing is really slow, as it seems even after the fifth reboot and some hours of uptime it can still freeze. So I need about half a day to be quite sure that a kernel is good. I won't waste my time going prior to 2.6.33 which is known good to make this experience even more painful than it was already. So I ask for advice on how to proceed.

Well at least it's clear now that the freeze was introduced way prior to 2.6.34-rc1. I would have liked it the other way around. According to git it's still

martin@shambhala:~/Computer/Shambhala/Kernel/2.6.33-2.6.34-bisect/linux-2.6> git bisect skip
Bisecting: 2518 revisions left to test after this (roughly 11 steps)

Should I just mark the current kernel 5be796f0b842c5852d7397a82f8ebd6be8451872, which is from soon after 2.6.33-rc2 and which I am not interested in, as good without even compiling it, so that git hopefully takes me to another one around the one that I skipped cause it didn't boot?

Created attachment 29002 [details]
bisect cont'd - unbootable kernels ext4 / readahead backtrace
I asked on kernel-ml and got a hint that this may be due to merges. Thus I tested this kernel and continued. git bisect pointed me back to a lot of unbootable kernels with the ext4/readahead backtrace. They have all been in a USB merge. I am not booting from USB. I think that bug has been introduced before that USB merge and been fixed sometime after merging it, so after testing about five unbootable kernels from that range, I told git bisect to skip all the commits of that merge. After testing a few kernels including one that showed "Destination address too large" and "-- System halted" just after trying to boot it, I got one that's quite some revisions prior to the USB merge. I'll test it and hope that it boots.
I will post a complete bisect log of what I have so far in a moment.
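On the missing "git bisect undo" mentioned earlier: git does provide `git bisect log` and `git bisect replay <logfile>`, so one possible way to correct a wrong verdict is to save the log, edit the offending entry, reset, and replay the edited file. The sketch below only performs the text edit on a saved log. The helper name `flip_bisect_mark` is hypothetical, and the example flips the verdict for commit 8bd93a2c purely for illustration (in this report that commit was in fact good).

```python
def flip_bisect_mark(log_text: str, commit: str, new_mark: str = "bad") -> str:
    """Return `git bisect log` text with the verdict for `commit` changed.

    `commit` may be an abbreviated hash; it is matched as a prefix.
    """
    marks = ("good", "bad", "skip")
    if new_mark not in marks:
        raise ValueError("new_mark must be one of: good, bad, skip")
    out = []
    changed = False
    for line in log_text.splitlines():
        parts = line.split()
        # Command lines look like: git bisect good <sha>
        if (len(parts) == 4 and parts[:2] == ["git", "bisect"]
                and parts[2] in marks and parts[3].startswith(commit)):
            line = f"git bisect {new_mark} {parts[3]}"
            changed = True
        # Comment lines look like: # good: [<sha>] subject
        elif (len(parts) >= 3 and parts[0] == "#"
                and parts[1].rstrip(":") in marks
                and parts[2].startswith("[" + commit)):
            line = " ".join(["#", new_mark + ":"] + parts[2:])
        out.append(line)
    if not changed:
        raise ValueError(f"no verdict found for commit {commit}")
    return "\n".join(out) + "\n"


# Excerpt of the log posted in this report, used as sample input.
SAMPLE_LOG = """\
# bad: [64ba9926759792cf7b95f823402e2781edd1b5d4] Merge branch 'for-linus' of git://git.open-osd.org/linux-open-osd
# good: [60b341b778cc2929df16c0a504c91621b3c6a4ad] Linux 2.6.33
git bisect start '64ba9926759792cf7b95f823402e2781edd1b5d4' 'v2.6.33'
# good: [8bd93a2c5d4cab2ae17d06350daa7dbf546a4634] rcu: Accelerate grace period if last non-dynticked CPU
git bisect good 8bd93a2c5d4cab2ae17d06350daa7dbf546a4634
"""

print(flip_bisect_mark(SAMPLE_LOG, "8bd93a2c"), end="")
```

The edited text would be written to a file and applied with `git bisect reset` followed by `git bisect replay` on that file, which keeps the other verdicts already collected.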
Created attachment 29012 [details]
current state of bisection, please help with selecting next commits to test
If it's indeed a Radeon KMS related freeze, and up to now everything points at this, I think it has been introduced in the first quarter of commits that git bisect visualize shows. I'd like to test just before these KMS/DRM related commits and just after them to speed up the bisection process, but there are branches and merges and also there are those unbootable kernels (see previous comment) not that much before these KMS/DRM related commits, so I am unsure which exact commit ids to choose.
Please help me select suitable commits - I had more than five unbootable kernels already and the range of commits to test thus is still above 1800. I would really like to shorten this. In the meantime I am just going to compile and test the next kernel.
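For a feel of how much testing the remaining range implies: bisection roughly halves the candidate range with each verdict, so the step count grows with log2 of the number of revisions left. The sketch below is a back-of-the-envelope estimate, not git's own algorithm, and the function names are made up. `hours_per_kernel=12.0` encodes the "about half a day" per kernel mentioned in this report as an assumption; skipped or unbootable commits would add steps on top.

```python
import math


def estimated_bisect_steps(revisions_left: int) -> int:
    """Approximate number of verdicts needed: floor(log2(revisions_left))."""
    if revisions_left < 1:
        raise ValueError("revisions_left must be at least 1")
    return math.floor(math.log2(revisions_left))


def estimated_days(revisions_left: int, hours_per_kernel: float = 12.0) -> float:
    """Rough wall-clock cost in days, assuming a fixed test time per kernel."""
    return estimated_bisect_steps(revisions_left) * hours_per_kernel / 24.0


# The two range sizes mentioned in this report.
for revisions in (2518, 1800):
    print(revisions, estimated_bisect_steps(revisions), estimated_days(revisions))
```

For 2518 revisions this gives 11 steps, in line with the "roughly 11 steps" that git printed, so even a modest reduction of the range saves at most a step or two; the time per kernel dominates.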
Well I now go with

martin@shambhala:~/Computer/Shambhala/Kernel/2.6.33-2.6.34-bisect/linux-2.6> git log | head
commit a27341cd5fcb7cf2d2d4726e9f324009f7162c00
Author: Linus Torvalds <torvalds@linux-foundation.org>
Date: Tue Mar 2 08:36:46 2010 -0800

Prioritize synchronous signals over 'normal' signals

This makes sure that we pick the synchronous signals caused by a
processor fault over any pending regular asynchronous signals sent to
use by [t]kill().

If I interpret the graphical display of gitk correctly, this one should be directly before the whole lot of KMS/DRM/Radeon/Intel commits being merged. Now hopefully this one boots. And is good. Then I would have skipped a whole lot of the revisions to test. But even when it just boots, but is bad, I at least know that it's not in these merges.

This might be a duplicate of:
https://bugs.freedesktop.org/show_bug.cgi?id=28402
Does the patch there help?

Created attachment 29352 [details]
Current status of git bisection, now testing patches mentioned in comment

From looking at the backtrace, Alex hinted that this might be a duplicate of[1]:

Bug 28402 - Kernel 2.6.34 freezes randomly
https://bugs.freedesktop.org/show_bug.cgi?id=28402

I will now try the patch from comment 47 of that bug and then those mentioned by Da Fox in comment 43.

[1] See thread "Bisecting radeon kms freeze bug: Almost there, please help with choosing next commit":
http://lkml.org/lkml/2010/9/7/199

That should have been comment 44 instead of 43.

Created attachment 29382 [details]
no freezes so far with vmembase zero patch from Dave

The vmem align patch from Alex (comment #41 of freedesktop bug) didn't help. I now test the vmembase at zero patch from Dave (attached here and comment #41 of the fdo bug). Looks very good so far.
I will reboot this kernel several times tomorrow - as a freeze so far only ever happened *before* the first hibernation / snapshot cycle - but I watched some Star Trek Voyager without a freeze with:

martin@shambhala:~/Computer/Shambhala/Kernel/2.6.36> cat /proc/version
Linux version 2.6.36-rc3-tp42-toi-3.2-rc1-vmembase-0-05032-g60140c1-dirty (martin@shambhala) (gcc version 4.4.5 20100728 (prerelease) (Debian 4.4.4-8) ) #2 PREEMPT Wed Sep 8 21:36:34 CEST 2010

Thanks. Fix is upstream:
http://git.kernel.org/?p=linux/kernel/git/torvalds/linux-2.6.git;a=commitdiff;h=b7d8cce5b558e0c0aa6898c9865356481598b46d

Tested in 2.6.37-rc3.

2.6.37-rc5 seems stable too. Many thanks. |