Bug 10011
Description
François Valenduc
2008-02-17 06:28:29 UTC
Created attachment 14876 [details]
kernel configuration file
Thanks for doing the bisect. Did you revert the commit on top of rc2? It looks harmless, but it seems to introduce subtle wreckage. I'm going to revert it.
Thanks,
tglx

Hmm, I compiled your config with the patch applied and with it reverted. There is no difference in the binary image.

Indeed, I don't really understand what's happening. I successfully started the computer with this patch applied 5 times in a row without any problem. However, this morning, it failed every time with this patch. So this bug seems to occur randomly, which is probably not good news.

Yeah, it seems to be something else. Please check your bisect log again and maybe restart at some point before the last steps.

It seems that the problem occurs only if I patch the kernel with tuxonice. In fact, with kernel 2.6.24 and tuxonice 3.0-rc5, the problem already occurs. I thought that some parts of the tuxonice patch had now been merged into the mainline kernel and thus that the bug also occurred with the release candidates of 2.6.25. That doesn't look correct.
What I find extremely strange is that the problem doesn't occur if I plug in a USB mouse. However, if I remove usbhid support, the problem still occurs. For me, this is totally incomprehensible!

> It seems that the problem occurs only if I patch the kernel with tuxonice. In
> fact, with kernel 2.6.24 and tuxonice 3.0-rc5, the problem already occurs. I
> thought that some parts of the tuxonice patch had now been merged in the
> mainline kernel and thus that the bug also occurs with release candidate of
> 2.6.25. That doesn't look correct.
> What I found extremely strange is that the problem doesn't occur if I plug an
> USB mouse. However, if I remove usbhid support, the problem still occurs. For
> me, this is totally incomprehensible !
So with plain 2.6.25-rc2 it does not happen, right? Only the
combination with tuxonice shows that?
If yes, then please poke the tuxonice folks.
Thanks,
tglx
For the moment, I would say that the problem doesn't happen anymore. But since it seems to occur randomly, I don't know what to think about it.

FWIW, I saw a similar thing a couple of times on an ASUS L5D with plain 2.6.25-rc1. That is, the box hung solid as soon as X was started (openSUSE 10.3 userland, 32-bit). Strangely enough, this doesn't seem to happen on the same box with a 64-bit kernel and 64-bit userland (openSUSE 10.2).
BTW, Thomas, there is 1.5 GB of RAM in this box and highmem is used, so that _might_ be related to the kmap_atomic() warning that you observed.

The original report is resolved (as a tuxonice issue).
Closing this bug since any followups are very, very unlikely to be the same issue; they need a separate bug. I'll try to close as "INVALID" since that's the closest to "outside patch caused" that we have.

Reply-To: akpm@linux-foundation.org
On Sun, 17 Feb 2008 06:28:29 -0800 (PST) bugme-daemon@bugzilla.kernel.org wrote:
> http://bugzilla.kernel.org/show_bug.cgi?id=10011
>
Rafael, one for the post-2.6.24 regression list, please.

The problem now occurs again with the official 2.6.25-rc2 kernel, but it occurs randomly. What is extremely strange is that, as I said in comment #6, there is no problem if I plug in a USB mouse. However, my Xorg config file doesn't contain any reference to an external mouse. Only the touchpad is listed as an input device.
Created attachment 14954 [details]
output of dmesg
I have removed some options from the kernel and now it seems to work correctly. These options are:
Kernel->user space relay support
CPU idle PM support
Let's hope it keeps working correctly.

I have also removed these options from the 2.6.24 kernel patched with tuxonice and it also works correctly. So one of these options must be the cause of the problem.

Regressions list annotation:
Handled-By : Thomas Gleixner <tglx@linutronix.de>

I finally found that the problematic option is "Kernel->user space relay support". I have built 2.6.25-rc3 with CPU idle enabled and "user space relay support" disabled. This kernel works correctly. Then I built a kernel with these 2 options enabled and it crashes when X is started. Maybe Xorg is not the direct cause of the problem, but the crash occurs when X is started. It's even impossible to connect via SSH to the PC. So not only X has crashed but the kernel as well.

(In reply to comment #18)
> I finally found that the problematic option is "Kernel->user space relay
> support". I have build 2.6.25-rc3 with CPU idle enabled and "user space relay
> support" disabled. This kernel works correctly. Then I build a kernel with
> these 2 options enabled and it crashes when X is started. Maybe Xorg is not
> the
> direct cause of the problem but the crash occurs when X is started. It's even
> impossible to connect via SSH to the PC. So, not only X is crashed but the
> kernel also.
>
I hadn't waited long enough for the problem to occur, but "CPU idle PM support" also causes a problem. If I enable it and wait a bit more than 30 seconds before typing my password in the KDM login window, the screen is also blocked. The computer has not crashed, since I can still make an SSH connection, but it's impossible to type the password. The cursor stops blinking and I can't use the mouse to click on the buttons of this window.
The only ways to power down my computer are via SSH or using the power button (which shows that ACPI events are still working and caught by acpid).

I have a problem that looks similar to the submitter's problem, but I'm not sure that it's the same, and I'm not sure this is the correct place for my posting.
When I play something using xine-lib (first noticed with xine-lib 1.1.8) while X (7.3) is running, my machine (the problem first appeared with kernel 2.6.22) can stop responding to any of my actions. I didn't test it with an SSH connection, but I can get control back only with a reset. It's noticeable that the file played with xine-lib plays normally to its end.
Now I'm running 2.6.24 with Ingo Molnar's RT patch. The system can freeze for some time, but it usually returns control back to me.
Latest working kernel version: 2.6.20 (I didn't test 2.6.21)
Earliest failing kernel version: 2.6.22 PREEMPT
Current failing kernel: 2.6.24 (I didn't test 2.6.23), PREEMPT, with RT patch by Ingo Molnar
Distribution: Gentoo
Hardware Environment: AMD Athlon XP 2600+, nForce 2, ATI Radeon 9200SE

Francois, is this problem still there in 2.6.25-rc5?
Thanks,
tglx

The problem is still present in 2.6.25-rc5. But if I remove CPU idle support, things work well. So maybe my computer doesn't support this option.

Hmmmm. No. CPU Idle is not some special feature supported by the computer. It is specific to the Linux kernel, just a cleaner way to handle CPU C-states.
Did 2.6.24 also fail when you had CPU IDLE enabled? Can you get the SYSRQ-t output? (You can possibly use /proc/sysrq-trigger, as you can ssh into the system in the hung state.)
I am concerned that this may be some timing-related issue that only happens once in a while, irrespective of the above config options. I mean, you said that even with CPU idle enabled, it does boot fine sometimes...

Francois,
What does the output of
# grep . /sys/devices/system/cpu/cpu*/cpuidle/*/*
look like with RC5 and CPU_IDLE configured?
Also, can you attach your acpidump output?
Thanks,
Venki

No, with CPU idle, the problem always happens. The output of grep . /sys/devices/system/cpu/cpu*/cpuidle/*/* is the following when the computer is blocked, which happens around 20 seconds after the start of KDM:
/sys/devices/system/cpu/cpu0/cpuidle/state0/desc:CPUIDLE CORE POLL IDLE
/sys/devices/system/cpu/cpu0/cpuidle/state0/latency:0
/sys/devices/system/cpu/cpu0/cpuidle/state0/name:C0
/sys/devices/system/cpu/cpu0/cpuidle/state0/power:4294967295
/sys/devices/system/cpu/cpu0/cpuidle/state0/time:0
/sys/devices/system/cpu/cpu0/cpuidle/state0/usage:0
/sys/devices/system/cpu/cpu0/cpuidle/state1/desc:<null>
/sys/devices/system/cpu/cpu0/cpuidle/state1/latency:0
/sys/devices/system/cpu/cpu0/cpuidle/state1/name:C1
/sys/devices/system/cpu/cpu0/cpuidle/state1/power:0
/sys/devices/system/cpu/cpu0/cpuidle/state1/time:2
/sys/devices/system/cpu/cpu0/cpuidle/state1/usage:1
/sys/devices/system/cpu/cpu0/cpuidle/state2/desc:<null>
/sys/devices/system/cpu/cpu0/cpuidle/state2/latency:1
/sys/devices/system/cpu/cpu0/cpuidle/state2/name:C2
/sys/devices/system/cpu/cpu0/cpuidle/state2/power:0
/sys/devices/system/cpu/cpu0/cpuidle/state2/time:17684
/sys/devices/system/cpu/cpu0/cpuidle/state2/usage:29
/sys/devices/system/cpu/cpu0/cpuidle/state3/desc:<null>
/sys/devices/system/cpu/cpu0/cpuidle/state3/latency:85
/sys/devices/system/cpu/cpu0/cpuidle/state3/name:C3
/sys/devices/system/cpu/cpu0/cpuidle/state3/power:0
/sys/devices/system/cpu/cpu0/cpuidle/state3/time:53347975
/sys/devices/system/cpu/cpu0/cpuidle/state3/usage:8187

Can you change the ACPI PROCESSOR config to built-in (*) from module (m), keep CPU_IDLE configured, and try booting with
- processor.max_cstate=1
- processor.max_cstate=2
and let me know whether either of them works?
Created attachment 15228 [details]
output of acpidump
The problem also occurs with kernel 2.6.24. Can you explain how to obtain the SYSRQ-t output ?
I also made the tests concerning the processor.max_cstate parameter and none of the two solves the problem.
Thanks for your help.
To make sure that the processor.max_cstate setting worked: with the setting, you should not see C3 (or C2 and C3 for max_cstate=1) in the above cpuidle output.
For sysrq, enable MAGIC_SYSRQ in your config and enable sysrq with
# echo 1 > /proc/sys/kernel/sysrq
Then you can use the commands in Documentation/sysrq.txt by pressing the magic key stroke or with
echo t > /proc/sysrq-trigger

It's strange: I have set processor.max_cstate to 2 and there are still lines related to state3 in the cpuidle output.
Created attachment 15229 [details]
output of dmesg after triggering sysrq commands
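The check suggested above (no C3 entry should be listed in the cpuidle sysfs output once processor.max_cstate=2 is in effect) could be scripted roughly as follows. This is only a sketch: the helper names parse_cpuidle and states_above are made up for illustration, the sample text is a trimmed copy of the grep output quoted earlier in this thread, and a single CPU is assumed.

```python
import re

# Trimmed sample of the `grep . /sys/devices/system/cpu/cpu*/cpuidle/*/*`
# output quoted earlier in this thread (only name and usage lines are kept).
SAMPLE = """\
/sys/devices/system/cpu/cpu0/cpuidle/state0/name:C0
/sys/devices/system/cpu/cpu0/cpuidle/state0/usage:0
/sys/devices/system/cpu/cpu0/cpuidle/state1/name:C1
/sys/devices/system/cpu/cpu0/cpuidle/state1/usage:1
/sys/devices/system/cpu/cpu0/cpuidle/state2/name:C2
/sys/devices/system/cpu/cpu0/cpuidle/state2/usage:29
/sys/devices/system/cpu/cpu0/cpuidle/state3/name:C3
/sys/devices/system/cpu/cpu0/cpuidle/state3/usage:8187
"""

# Matches ".../cpuidle/state<N>/<attribute>:<value>".
LINE_RE = re.compile(r"/cpuidle/state(\d+)/(\w+):(.*)$")


def parse_cpuidle(text):
    """Return {state_index: {attribute: value}} from `grep .` style output.

    Hypothetical helper; assumes one CPU, so state indexes do not collide.
    """
    states = {}
    for line in text.splitlines():
        match = LINE_RE.search(line)
        if match:
            index, attr, value = int(match.group(1)), match.group(2), match.group(3)
            states.setdefault(index, {})[attr] = value
    return states


def states_above(text, max_cstate):
    """Names of listed C-states deeper than max_cstate, in sysfs order."""
    names = [attrs.get("name", "") for _, attrs in sorted(parse_cpuidle(text).items())]
    return [n for n in names if re.fullmatch(r"C\d+", n) and int(n[1:]) > max_cstate]


if __name__ == "__main__":
    # A non-empty result means the max_cstate limit did not take effect.
    print(states_above(SAMPLE, 2))
```

A non-empty list for the limit that was passed on the kernel command line would indicate, as in the reply above, that the parameter was not applied.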
The max_cstate parameter probably did not work. Did you change ACPI PROCESSOR in the config from "m" to "y"? Your original config had it as "m".

Sorry, I had understood the reverse and had left processor configured as a module. If I compile it into the kernel instead, it works with processor.max_cstate set to 2 or 1. So it seems that the C3 state is problematic when CPU idle is enabled.

Does the behaviour change with and without CONFIG_NO_HZ?
> No, with CPU-idle, the problem always happen.
So it fails in 2.6.24 just as much as it fails in 2.6.25?
> ACPI: HPET 1FEEBFA0, 0038 (r1 ACER Kestrel 20020909 PTL 0)
> ACPI: BOOT 1FEEBFD8, 0028 (r1 ACER Kestrel 20020909 LTP 1)
> ACPI: DMI detected: Acer
> ACPI: PM-Timer IO Port: 0x1008
> ACPI: HPET id: 0x8086a201 base: 0x0 is invalid
Please try booting with "hpet=force" to see if that helps.
When the system is "hung", please collect the output from "cat /proc/timer_list" and "cat /proc/interrupts" and paste it here.
When the system is "hung", if you type "sleep 1" does it return?
Please attach the output from lspci.

Without CONFIG_NO_HZ, it's even worse. The computer is blocked as soon as X starts and, a few seconds later, the screen becomes black. It's also impossible to connect via SSH to the computer.
With CONFIG_NO_HZ enabled and hpet=force as a boot parameter, it also hangs when X is started, but not completely. I can make an SSH connection.
The output of cat /proc/interrupts is the following:
           CPU0
  0:      46976    XT-PIC-XT        timer
  1:         14    XT-PIC-XT        i8042
  2:          0    XT-PIC-XT        cascade
  3:          1    XT-PIC-XT
  4:          1    XT-PIC-XT
  5:          1    XT-PIC-XT
  6:       1604    XT-PIC-XT        uhci_hcd:usb1, uhci_hcd:usb2, uhci_hcd:usb3, eth0, radeon@pci:0000:01:00.0
  7:          1    XT-PIC-XT
  8:         32    XT-PIC-XT        rtc
  9:      52040    XT-PIC-XT        acpi
 10:       1765    XT-PIC-XT        yenta, ohci1394, ehci_hcd:usb4, Intel 82801DB-ICH4, ipw2200
 12:        752    XT-PIC-XT        i8042
 14:       4859    XT-PIC-XT        ide0
 15:        325    XT-PIC-XT        ide1
NMI:          0   Non-maskable interrupts
TRM:          0   Thermal event interrupts
SPU:          0   Spurious interrupts
ERR:          0

The output of "cat /proc/timer_list" is the following:
Timer List Version: v0.3
HRTIMER_MAX_CLOCK_BASES: 2
now at 210040366375 nsecs

cpu: 0
 clock 0:
  .index:      0
  .resolution: 1 nsecs
  .get_time:   ktime_get_real
  .offset:     1205431075401541167 nsecs
active timers:
 clock 1:
  .index:      1
  .resolution: 1 nsecs
  .get_time:   ktime_get
  .offset:     0 nsecs
active timers:
 #0: <dfaedee8>, tick_sched_timer, S:01
 # expires at 210041000000 nsecs [in 633625 nsecs]
 #1: <dfaedee8>, it_real_fn, S:01
 # expires at 210042483543 nsecs [in 2117168 nsecs]
  .expires_next   : 210041000000 nsecs
  .hres_active    : 1
  .nr_events      : 169713
  .nohz_mode      : 2
  .idle_tick      : 75726000000 nsecs
  .tick_stopped   : 0
  .idle_jiffies   : 4294743022
  .idle_calls     : 10360
  .idle_sleeps    : 7607
  .idle_entrytime : 75773696308 nsecs
  .idle_waketime  : 75781004081 nsecs
  .idle_exittime  : 75781012671 nsecs
  .idle_sleeptime : 47747011259 nsecs
  .last_jiffies   : 4294743069
  .next_jiffies   : 4294743077
  .idle_expires   : 75781000000 nsecs
jiffies: 4294877336

Tick Device: mode:     1
Clock Event Device: hpet
 max_delta_ns:   2147483647
 min_delta_ns:   3352
 mult:           61496110
 shift:          32
 mode:           3
 next_event:     210041000000 nsecs
 set_next_event: hpet_legacy_next_event
 set_mode:       hpet_legacy_set_mode
 event_handler:  hrtimer_interrupt

Also, "sleep 1" doesn't "unfreeze" X.
Created attachment 15251 [details]
output of lspci
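In the /proc/interrupts listing above, IRQ 6 and IRQ 10 are each shared by several devices, and the radeon card sits on IRQ 6. A small sketch like the following could pull the shared lines out of such a listing. The function name shared_irqs is hypothetical, and the parser assumes the single-CPU column layout shown in this thread.

```python
# Rows copied from the /proc/interrupts listing quoted above: the two shared
# lines plus two unshared ones for contrast.
SAMPLE = """\
  0:      46976    XT-PIC-XT        timer
  6:       1604    XT-PIC-XT        uhci_hcd:usb1, uhci_hcd:usb2, uhci_hcd:usb3, eth0, radeon@pci:0000:01:00.0
  9:      52040    XT-PIC-XT        acpi
 10:       1765    XT-PIC-XT        yenta, ohci1394, ehci_hcd:usb4, Intel 82801DB-ICH4, ipw2200
"""


def shared_irqs(text):
    """Map IRQ number -> device list for lines with more than one device.

    Hypothetical helper. Assumes a single-CPU layout:
    "<irq>: <count> <controller> <comma-separated devices>".
    """
    shared = {}
    for line in text.splitlines():
        head, sep, rest = line.partition(":")
        if not sep or not head.strip().isdigit():
            continue  # skip the CPU header and the NMI/ERR style rows
        fields = rest.split(None, 2)  # count, controller, device list
        if len(fields) < 3:
            continue  # no device registered on this line
        devices = [d.strip() for d in fields[2].split(",")]
        if len(devices) > 1:
            shared[int(head)] = devices
    return shared


if __name__ == "__main__":
    for irq, devices in sorted(shared_irqs(SAMPLE).items()):
        print(irq, devices)
```

Sharing an interrupt line is not a fault in itself; the sketch only makes it easier to see which drivers have handlers on the same line as the graphics card.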
Is the problem still there with 2.6.25-rc7?

Unfortunately, the problem still occurs in the same way with 2.6.25-rc7. It also occurs with the current git version (2.6.25-rc7-git5).

Can you please try the patch here on rc7-git5 and see whether that helps?
Patch: http://marc.info/?l=linux-kernel&m=120674502201007&w=4

Unfortunately, the patch doesn't change anything. The graphical interface is still frozen around 10 seconds after the start of X.

Running out of ideas on this one. Does the problem happen every time with the git kernel, or only once in several reboots? If it happens every time, it will help if you can try git bisect to narrow it down to a specific set of patches...

CPU idle has never worked on my computer, with any kernel version. It doesn't work with 2.6.24, where it was first introduced if I am not wrong. It doesn't work with any of the newer kernels or git versions either. So I think git-bisect won't be useful. Is it possible that my computer doesn't support CPU idle? Or maybe it doesn't support the C3 state (since it works if I set C2 as max_cstate).

OK. At least this is not a regression since 2.6.24. Having said that, if you see all 3 C-states in /proc/acpi/processor/*/power with a kernel that does not have CPU_IDLE configured, while with CPU_IDLE configured it only works with max_cstate=2, then there is some bug in CPU_IDLE.
I think we can reduce the priority of this one as it is not a regression. I will send a debug patch soon that should help us identify what is going wrong with CPU_IDLE. In the meantime, can you attach the output of
# cat /proc/acpi/processor/*/power
with CPU_IDLE not configured?

Unmarking the regression flag.
Here is the output of cat /proc/acpi/processor/*/power:
active state:            C2
max_cstate:              C8
bus master activity:     40020001
maximum allowed latency: 2000 usec
states:
    C1: type[C1] promotion[C2] demotion[--] latency[000] usage[00000010] duration[00000000000000000000]
   *C2: type[C2] promotion[C3] demotion[C1] latency[001] usage[00013445] duration[00000000000266258279]
    C3: type[C3] promotion[--] demotion[C2] latency[085] usage[00000225] duration[00000000000002117141]

(In reply to comment #46)
> Here is the output of cat /proc/acpi/processor/*/power:
>
> active state: C2
> max_cstate: C8
> bus master activity: 40020001
> maximum allowed latency: 2000 usec
> states:
> C1: type[C1] promotion[C2] demotion[--] latency[000]
> usage[00000010] duration[00000000000000000000]
> *C2: type[C2] promotion[C3] demotion[C1] latency[001]
> usage[00013445] duration[00000000000266258279]
> C3: type[C3] promotion[--] demotion[C2] latency[085]
> usage[00000225] duration[00000000000002117141]
>
The command was run with kernel 2.6.24.4 with CPU_IDLE disabled.

It is important to find out for sure whether this failure is specific to CPU_IDLE, or whether the root cause existed before CPU_IDLE but was just hidden.
Please boot a CONFIG_CPU_IDLE=n kernel with processor.bm_history=0 to see if the non-cpuidle kernel can also fail. This will make it more aggressive about entering C3, and that should be reflected in /proc/acpi/processor/*/power.
Or, if you have USB devices plugged in, you might try unplugging them to reduce the bus master history interference and enter C3 more often to stress it more.
The failure might not be in CPU_IDLE.

If I boot a kernel without CPU idle and with processor.bm_history=0 as you suggested, I also encounter the problem. The output of /proc/acpi/processor/*/power is the following.
active state:            C2
max_cstate:              C8
bus master activity:     24402041
maximum allowed latency: 2000 usec
states:
    C1: type[C1] promotion[C2] demotion[--] latency[000] usage[00000010] duration[00000000000000000000]
   *C2: type[C2] promotion[C3] demotion[C1] latency[001] usage[00008880] duration[00000000000099016541]
    C3: type[C3] promotion[--] demotion[C2] latency[085] usage[00003717] duration[00000000000026832004]
So the current C-state is C2, but C3 has been used more often. Your remark about USB devices also reminds me that if I use a kernel with CPU idle enabled, it works well (without setting the max cstate) if I use a USB mouse. Maybe this prevents the computer from switching too often to the C3 state. So, if I understand correctly, the C3 state seems to be problematic for my computer and CPU idle is not the real cause of the problem. It simply triggers another problem.

Yes. C3 being problematic was our suspicion as well. There is some interaction between the C3 state and the X driver. If we frequently use C3 during X startup, it can hang the driver. CPU_IDLE is aggressive about going into the C3 state (to save more power) and thus exposes the problem. Your observation with the USB mouse reinforces this theory.

Did this problem exist in 2.6.24?
Also, can you clarify which graphics driver you are using to talk to the Radeon, and whether different drivers have any effect?

I have already said that this problem occurs with kernel 2.6.24 and all release candidates and git snapshots of 2.6.25 when CPU_IDLE is enabled (see comment #43 for example). But if CPU_IDLE is disabled, the problem also occurs if I set processor.bm_history=0 as you suggested (see comment #48).
I am using the radeon DRM driver included in the kernel. I have also tried the proprietary fglrx driver and the problem occurs with this driver as well.

As the problem happens with the in-tree radeon DRM driver, I am thinking of adding a DMI check for this platform to disable the C3 state, unless someone who understands the radeon DRM driver can figure out where exactly and why it is hanging.
Can you attach the dmidecode info from this laptop? I can send a patch to disable C3 based on that.
Created attachment 15721 [details]
output of dmidecode
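To put numbers on "C3 is entered more often with processor.bm_history=0", the usage counters from the two /proc/acpi/processor/*/power dumps quoted above can be compared. The following is a rough sketch: the names usage_by_state and c3_share are invented for illustration, and the sample strings keep only the fields the code reads.

```python
import re

# The two "states:" blocks quoted above: the default run first, then the run
# with processor.bm_history=0. Only the fields used below are kept.
DEFAULT_RUN = """\
C1: type[C1] latency[000] usage[00000010]
*C2: type[C2] latency[001] usage[00013445]
C3: type[C3] latency[085] usage[00000225]
"""
BM_HISTORY_0_RUN = """\
C1: type[C1] latency[000] usage[00000010]
*C2: type[C2] latency[001] usage[00008880]
C3: type[C3] latency[085] usage[00003717]
"""

# One state per line; a leading "*" marks the currently active state.
STATE_RE = re.compile(r"^\s*\*?(C\d+):.*?usage\[(\d+)\]", re.MULTILINE)


def usage_by_state(text):
    """Return {state name: number of times the state was entered}."""
    return {name: int(count) for name, count in STATE_RE.findall(text)}


def c3_share(text):
    """Fraction of all idle-state entries that went to C3."""
    usage = usage_by_state(text)
    total = sum(usage.values())
    return usage.get("C3", 0) / total if total else 0.0


if __name__ == "__main__":
    print(f"default:      {c3_share(DEFAULT_RUN):.1%}")
    print(f"bm_history=0: {c3_share(BM_HISTORY_0_RUN):.1%}")
```

With the figures quoted in this thread, C3 accounts for roughly 1.6% of idle-state entries in the default run and roughly 29.5% with bm_history=0, which matches the observation that the hang follows C3 usage rather than CPU_IDLE itself.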
This is interesting. I have a T8100-based notebook that does something similar. It doesn't lock solid, but if I let it idle and keep interrupt chatter to an absolute minimum (no mouse, unplug network, disable chatty USB devices), it enters a "daydream" state where the machine pauses. Touch the mouse or keyboard and it comes back to life as if nothing happened, and I can't see any other ill effect. I'm wondering if it's related to this; booting with nohz=off prevents the problem.

The same problem occurs again with kernel 2.6.26-rc1, even when CPU_IDLE is not enabled. So this becomes extremely annoying!
Created attachment 16129 [details]
workaround patch
Patch to add this laptop to max_cstate blacklist. This auto-disables C3 on this model.
So, with this patch, my computer is no longer blocked when X starts. It seems to be a way to get rid of the problem rather than to really solve it. What I don't understand is that the problem now occurs with or without CPU IDLE.

Yes. The problem is with the C3 state and the X driver. It is just the timing and frequency of C3 invocation that kept this from happening without CPU_IDLE earlier.
Yes. The patch is just a workaround/band-aid for now. We need to know more about the X driver on this platform to narrow this down. Copying Dave, who may be able to help us with the radeon drm driver...

After further investigation, I think the problem appeared somewhere between 2.6.25-git7 and -git8. I have tried git-bisect but I end up with a kernel panic at each step, so it's difficult to be more precise. I also find it strange that the problem also appears with max_cstate=2 as a kernel parameter and ACPI processor support compiled into the kernel. Previously, setting the max_cstate parameter was a way to avoid the problem.

Can you try with
Option "DRI" "Off"
in your xorg.conf? This will rule out the drm kernel driver and at least place the issue with X itself.
Is this an AGP system?

When I disable DRI, X works correctly and doesn't freeze. So there is probably a problem in the drm or radeon driver. This is an AGP system.

X works correctly again with the latest git version of 2.6.26-rc3, without the workaround patch. It seems that commit 860da5e578c25d1ab4528c0d1ad13f9969e3490f (Merge branch 'drm-patches' of git://git./linux/kernel/git/airlied/drm-2.6) solves the problem. Even if CPU_IDLE is still problematic, X doesn't freeze anymore if I disable it (as with kernel 2.6.25).

This problem occurs again with the final release of 2.6.26 (and maybe already with the release candidates). The only sure way to avoid the problem for the moment is to apply the workaround patch. The commit I mentioned in comment #65 helps a bit.
However, with kernel 2.6.26 and without the workaround, the screen becomes black immediately after X startup and it seems impossible to unfreeze X. Is this problem impossible to solve?

I forgot to add that if I disable DPMS in xorg.conf, the screen doesn't become black after X startup, but X is blocked almost immediately after it starts.

With the latest git (2.6.26-git5), the output of cat /proc/acpi/processor/CPU0/power is the following:
active state:            C3
max_cstate:              C8
bus master activity:     00000000
maximum allowed latency: 2000000000 usec
states:
    C1: type[C1] promotion[C2] demotion[--] latency[000] usage[00000010] duration[00000000000000000000]
    C2: type[C2] promotion[C3] demotion[C1] latency[001] usage[00035816] duration[00000000000328423975]
   *C3: type[C3] promotion[--] demotion[C2] latency[085] usage[00000183] duration[00000000000001543805]
If I compare with the stats quoted in comment #49, the maximum allowed latency is 1000000 times higher (2000000000 usec instead of 2000 usec). Is this really normal?

Does anybody still care about this very annoying problem? Or do you plan to submit the workaround patch for inclusion in the official kernel if no other solution can be found?

I have retried with kernel 2.6.24.7 and the same problem still occurs. Maybe this problem has been there for a very long time. Previously, I didn't notice the problem because at that time I used a USB mouse. It's still true that X doesn't hang with 2.6.26 if I plug in a USB mouse, as explained in comment #51. However, without a USB mouse plugged in, X hangs every time it starts.

Does this still happen with 2.6.27 rc or git? If so, the output of dmesg after doing:
echo 1 > /sys/module/drm/parameters/debug && echo 0 > /sys/module/drm/parameters/debug
might be helpful (radeon might flood your log with the same message over and over, which is why activating debug for a short time is enough).

The problem still occurs with 2.6.27-rc6.
Doing echo 1 > /sys/module/drm/parameters/debug when the problem has occurred gives these messages in dmesg:
irq 6: nobody cared (try booting with the "irqpoll" option)
Pid: 6726, comm: X Not tainted 2.6.27-rc6 #8
 [<c044fd04>] __report_bad_irq+0x24/0x90
 [<e125f9af>] b44_interrupt+0x3f/0x100 [b44]
 [<c044ff9f>] note_interrupt+0x22f/0x260
 [<c044f2e5>] handle_IRQ_event+0x25/0x60
 [<c04508d0>] handle_level_irq+0x0/0xa0
 [<c045094c>] handle_level_irq+0x7c/0xa0
 [<c0405b7f>] do_IRQ+0x6f/0xc0
 [<c0403bf7>] common_interrupt+0x23/0x28
 [<c041ef0e>] __do_softirq+0x2e/0x90
 [<c041eee0>] __do_softirq+0x0/0x90
 [<c0405812>] call_on_stack+0x12/0x20
 [<c041eeb5>] irq_exit+0x45/0x70
 [<c0405b86>] do_IRQ+0x76/0xc0
 [<c0403bf7>] common_interrupt+0x23/0x28
 [<c04d0000>] kobject_release+0x30/0x80
 [<e17463c6>] radeon_do_wait_for_idle+0x86/0x160 [radeon]
 [<e1746d10>] radeon_cp_idle+0x0/0xc0 [radeon]
 [<e1746d10>] radeon_cp_idle+0x0/0xc0 [radeon]
 [<e171031a>] drm_ioctl+0x1ba/0x2f0 [drm]
 [<e1710160>] drm_ioctl+0x0/0x2f0 [drm]
 [<c047f669>] vfs_ioctl+0x69/0x70
 [<c047f6cc>] do_vfs_ioctl+0x5c/0x250
 [<c05bb0c2>] schedule+0x172/0x2b0
 [<c047f8fd>] sys_ioctl+0x3d/0x70
 [<c0403a29>] sysenter_do_call+0x12/0x25
 =======================
handlers:
[<e12b62e0>] (usb_hcd_irq+0x0/0x70 [usbcore])
[<e12b62e0>] (usb_hcd_irq+0x0/0x70 [usbcore])
[<e12b62e0>] (usb_hcd_irq+0x0/0x70 [usbcore])
[<e125f970>] (b44_interrupt+0x0/0x100 [b44])
[<e1753d70>] (radeon_driver_irq_handler+0x0/0x170 [radeon])
Disabling IRQ #6
Created attachment 17814 [details]
Set some bus state so that cpu c3 state doesn't lead to CP trouble
Attached is a patch which might help to fix this issue. If it doesn't, it will at
least provide some more debugging information in your kernel log (no need to
enable drm debug).
I have tried your patch and unfortunately it produces the following compile error:
  CC [M]  drivers/gpu/drm/radeon/radeon_cp.o
  CC [M]  drivers/gpu/drm/radeon/radeon_irq.o
drivers/gpu/drm/radeon/radeon_irq.c: In function 'radeon_acknowledge_irqs':
drivers/gpu/drm/radeon/radeon_irq.c:42: error: expected expression before '^' token
drivers/gpu/drm/radeon/radeon_irq.c:43: error: expected expression before '^' token
distcc[10266] ERROR: compile drivers/gpu/drm/radeon/radeon_irq.c on pc-francois failed
make[4]: *** [drivers/gpu/drm/radeon/radeon_irq.o] Error 1
make[3]: *** [drivers/gpu/drm/radeon] Error 2
make[2]: *** [drivers/gpu/drm] Error 2
make[1]: *** [drivers/gpu] Error 2
make[1]: *** Waiting for unfinished jobs....
make: *** [drivers] Error 2
Created attachment 17831 [details]
Set some bus state so that cpu c3 state doesn't lead to CP trouble
Sorry, once again I used the wrong operator; the attached patch should compile.
Unfortunately, your patch doesn't solve the problem. Here is what I get in dmesg:
[drm] Initialized drm 1.1.0 20060810
pci 0000:01:00.0: power state changed by ACPI to D0
pci 0000:01:00.0: PCI INT A -> Link[LNKA] -> GSI 6 (level, low) -> IRQ 6
[drm] Initialized radeon 1.29.0 20080528 on minor 0
agpgart-intel 0000:00:00.0: AGP 2.0 bridge
agpgart-intel 0000:00:00.0: putting AGP V2 device into 4x mode
pci 0000:01:00.0: putting AGP V2 device into 4x mode
[drm] Setting GART location based on new memory map
[drm] Loading R300 Microcode
[drm] initial BUS_CNTL : 0x5133A2A0
[drm] initial BUS_CNTL1 : 0x00004090
[drm] set BUS_CNTL : 0x5133A2A0
[drm] set BUS_CNTL1 : 0x00004090
[drm] Num pipes: 1
[drm] writeback test succeeded in 1 usecs
[drm] irq not acking : 0x00080026
evdev.c(EVIOCGBIT): Suspicious buffer size 511, limiting output to 64 bytes. See http://userweb.kernel.org/~dtor/eviocgbit-bug.html
I have never seen the message related to evdev before. Do you think it's related to the problem?

When you activated drm debug, was there any message beginning with:
wait idle failed status
in any of your log files?
Created attachment 17841 [details]
output of dmesg | grep drm
I didn't find any line beginning with "wait idle failed status". You can find the dmesg output containing the drm related messages with debug enabled in this log.
Created attachment 17852 [details]
Set some bus state so that cpu c3 state doesn't lead to CP trouble
Okay, here is another patch which sets some more bus state and adds some more
debugging information. Could you send the log with the echo 1 > /sys/./drm
stuff and grep drm in your log once the lockup happens, and attach it?
The log I sent yesterday was obtained after the problem had occurred, with the drm module loaded with debug enabled. Do you want another log obtained with the last patch?

Yes please, a log with the latest patch, as this patch adds some more debug information which might be insightful. This patch might also help fix this, but I don't have much faith in that.
Created attachment 17863 [details]
output of dmesg | grep drm
As you expected, this patch doesn't solve the problem. Furthermore, there are now several lines indicating "wait idle failed status" in dmesg.
Created attachment 17864 [details]
Print more debug information
Sorry, I misplaced the debug information. Could you run again and attach the log output
with this patch?
Created attachment 17976 [details]
output of dmesg | grep drm
So, after some delay, here is the output of dmesg | grep drm with your last patch.
Created attachment 18000 [details]
Log CP status and remove useless debug output.
Unfortunately this output is not interesting at all. It shows that it fails
to get a free buffer because the CP is stuck. It doesn't include the debug
output I was looking for. Attached is a patch which removes the useless debug output.
Also, the debug output might be verbose, so maybe the message I was looking for
was cut. For reference, I am looking for:
wait idle failed status :
followed by 3 values; these are the values I am interested in.
Created attachment 18006 [details]
output of dmesg | grep drm
So here is another dmesg output obtained with your last patch.
There are a lot of lines like the following:
[drm:radeon_do_cp_idle]
[drm:radeon_do_wait_for_idle] wait idle failed status : 0x80010140 0x00000000 0xC0002804
Is this what you were looking for ?
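Since the log contains many of these lines, a short script can pull out the three status values and show whether they ever change. This is a sketch under stated assumptions: the function name status_triples is made up, and the sample simply repeats the one line quoted above to stand in for a longer log. It does not attempt to decode what the register bits mean.

```python
import re
from collections import Counter

# The line quoted above, repeated to stand in for a longer dmesg capture.
SAMPLE = """\
[drm:radeon_do_cp_idle]
[drm:radeon_do_wait_for_idle] wait idle failed status : 0x80010140 0x00000000 0xC0002804
[drm:radeon_do_cp_idle]
[drm:radeon_do_wait_for_idle] wait idle failed status : 0x80010140 0x00000000 0xC0002804
"""

# "wait idle failed status :" followed by three hexadecimal values.
STATUS_RE = re.compile(
    r"wait idle failed status\s*:\s*"
    r"(0x[0-9A-Fa-f]+)\s+(0x[0-9A-Fa-f]+)\s+(0x[0-9A-Fa-f]+)"
)


def status_triples(log_text):
    """Count each distinct (value1, value2, value3) triple found in the log."""
    return Counter(
        tuple(int(value, 16) for value in match.groups())
        for match in STATUS_RE.finditer(log_text)
    )


if __name__ == "__main__":
    for triple, count in status_triples(SAMPLE).most_common():
        print(count, [f"0x{value:08X}" for value in triple])
```

A single triple with a high count would suggest the CP is stuck in one state, while several different triples would suggest it is still making some progress between polls.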
Created attachment 18015 [details]
Keep CPU busy for sometimes after ring commit
This was not the state I expected to see. Anyway, attached is a hack that might
help; basically, it keeps the CPU busy a bit longer after committing the ring.
Unfortunately, this patch gives the following compile error:
  CC [M]  drivers/gpu/drm/radeon/radeon_state.o
drivers/gpu/drm/radeon/radeon_cp.c: In function 'radeon_do_cp_idle':
drivers/gpu/drm/radeon/radeon_cp.c:414: error: 'for' loop initial declaration used outside C99 mode
drivers/gpu/drm/radeon/radeon_cp.c:421: error: 'for' loop initial declaration used outside C99 mode
drivers/gpu/drm/radeon/radeon_cp.c: In function 'radeon_do_cp_start':
drivers/gpu/drm/radeon/radeon_cp.c:439: error: 'for' loop initial declaration used outside C99 mode
drivers/gpu/drm/radeon/radeon_cp.c:450: error: 'for' loop initial declaration used outside C99 mode
distcc[12998] ERROR: compile drivers/gpu/drm/radeon/radeon_cp.c on localhost failed
make[4]: *** [drivers/gpu/drm/radeon/radeon_cp.o] Error 1
make[4]: *** Waiting for unfinished jobs....
drivers/gpu/drm/radeon/radeon_state.c: In function 'radeon_check_and_fixup_packets':
drivers/gpu/drm/radeon/radeon_state.c:174: error: 'for' loop initial declaration used outside C99 mode
drivers/gpu/drm/radeon/radeon_state.c: In function 'radeon_emit_clip_rect':
drivers/gpu/drm/radeon/radeon_state.c:434: error: 'for' loop initial declaration used outside C99 mode
drivers/gpu/drm/radeon/radeon_state.c: In function 'radeon_emit_state':
drivers/gpu/drm/radeon/radeon_state.c:466: error: 'for' loop initial declaration used outside C99 mode
drivers/gpu/drm/radeon/radeon_state.c:485: error: 'for' loop initial declaration used outside C99 mode
drivers/gpu/drm/radeon/radeon_state.c:492: error: 'for' loop initial declaration used outside C99 mode
drivers/gpu/drm/radeon/radeon_state.c:502: error: 'for' loop initial declaration used outside C99 mode
drivers/gpu/drm/radeon/radeon_state.c:512: error: 'for' loop initial declaration used outside C99 mode
drivers/gpu/drm/radeon/radeon_state.c:521: error: 'for' loop initial declaration used outside C99 mode
drivers/gpu/drm/radeon/radeon_state.c:533: error: 'for' loop initial declaration used outside C99 mode
drivers/gpu/drm/radeon/radeon_state.c:542: error: 'for' loop initial declaration used outside C99 mode
drivers/gpu/drm/radeon/radeon_state.c:555: error: 'for' loop initial declaration used outside C99 mode
drivers/gpu/drm/radeon/radeon_state.c:575: error: 'for' loop initial declaration used outside C99 mode
drivers/gpu/drm/radeon/radeon_state.c:595: error: 'for' loop initial declaration used outside C99 mode
drivers/gpu/drm/radeon/radeon_state.c: In function 'radeon_emit_state2':
drivers/gpu/drm/radeon/radeon_state.c:620: error: 'for' loop initial declaration used outside C99 mode
drivers/gpu/drm/radeon/radeon_state.c: In function 'radeon_clear_box':
drivers/gpu/drm/radeon/radeon_state.c:764: error: 'for' loop initial declaration used outside C99 mode
drivers/gpu/drm/radeon/radeon_state.c:770: error: 'for' loop initial declaration used outside C99 mode
drivers/gpu/drm/radeon/radeon_state.c: In function 'radeon_cp_dispatch_clear':
drivers/gpu/drm/radeon/radeon_state.c:879: error: 'for' loop initial declaration used outside C99 mode
drivers/gpu/drm/radeon/radeon_state.c:905: error: 'for' loop initial declaration used outside C99 mode
drivers/gpu/drm/radeon/radeon_state.c:927: error: 'for' loop initial declaration used outside C99 mode
drivers/gpu/drm/radeon/radeon_state.c:1003: error: 'for' loop initial declaration used outside C99 mode
drivers/gpu/drm/radeon/radeon_state.c:1035: error: 'for' loop initial declaration used outside C99 mode
drivers/gpu/drm/radeon/radeon_state.c:1058: error: 'for' loop initial declaration used outside C99 mode
drivers/gpu/drm/radeon/radeon_state.c:1086: error: 'for' loop initial declaration used outside C99 mode
drivers/gpu/drm/radeon/radeon_state.c:1109: error: 'for' loop initial declaration used outside C99 mode
drivers/gpu/drm/radeon/radeon_state.c:1197: error: 'for' loop initial declaration used outside C99 mode
drivers/gpu/drm/radeon/radeon_state.c:1226: error: 'for' loop initial declaration used outside C99 mode
drivers/gpu/drm/radeon/radeon_state.c:1273: error: 'for' loop initial declaration used outside C99 mode
drivers/gpu/drm/radeon/radeon_state.c:1297: error: 'for' loop initial declaration used outside C99 mode
drivers/gpu/drm/radeon/radeon_state.c:1333: error: 'for' loop initial declaration used outside C99 mode
drivers/gpu/drm/radeon/radeon_state.c: In function 'radeon_cp_dispatch_swap':
drivers/gpu/drm/radeon/radeon_state.c:1359: error: 'for' loop initial declaration used outside C99 mode
drivers/gpu/drm/radeon/radeon_state.c:1373: error: 'for' loop initial declaration used outside C99 mode
drivers/gpu/drm/radeon/radeon_state.c:1410: error: 'for' loop initial declaration used outside C99 mode
drivers/gpu/drm/radeon/radeon_state.c: In function 'radeon_cp_dispatch_flip':
drivers/gpu/drm/radeon/radeon_state.c:1437: error: 'for' loop initial declaration used outside C99 mode
drivers/gpu/drm/radeon/radeon_state.c:1457: error: 'for' loop initial declaration used outside C99 mode
drivers/gpu/drm/radeon/radeon_state.c: In function 'radeon_cp_dispatch_vertex':
drivers/gpu/drm/radeon/radeon_state.c:1525: error: 'for' loop initial declaration used outside C99 mode
drivers/gpu/drm/radeon/radeon_state.c: In function 'radeon_cp_discard_buffer':
drivers/gpu/drm/radeon/radeon_state.c:1551: error: 'for' loop initial declaration used outside C99 mode
drivers/gpu/drm/radeon/radeon_state.c: In function 'radeon_cp_dispatch_indirect':
drivers/gpu/drm/radeon/radeon_state.c:1583: error: 'for' loop initial declaration used outside C99 mode
drivers/gpu/drm/radeon/radeon_state.c: In function 'radeon_cp_dispatch_texture':
drivers/gpu/drm/radeon/radeon_state.c:1679: error: 'for' loop initial declaration used outside C99 mode
drivers/gpu/drm/radeon/radeon_state.c:1854: error: 'for' loop initial declaration used outside C99 mode
drivers/gpu/drm/radeon/radeon_state.c:1871: error: 'for' loop initial declaration used outside C99 mode
drivers/gpu/drm/radeon/radeon_state.c:1885: error: 'for' loop initial declaration used outside C99 mode
drivers/gpu/drm/radeon/radeon_state.c:1889: error: 'for' loop initial declaration used outside C99 mode
drivers/gpu/drm/radeon/radeon_state.c: In function 'radeon_cp_dispatch_stipple':
drivers/gpu/drm/radeon/radeon_state.c:1901: error: 'for' loop initial declaration used outside C99 mode
drivers/gpu/drm/radeon/radeon_state.c: In function 'radeon_cp_clear':
drivers/gpu/drm/radeon/radeon_state.c:2131: error: 'for' loop initial declaration used outside C99 mode
drivers/gpu/drm/radeon/radeon_state.c: In function 'radeon_do_init_pageflip':
drivers/gpu/drm/radeon/radeon_state.c:2144: error: 'for' loop initial declaration used outside C99 mode
drivers/gpu/drm/radeon/radeon_state.c: In function 'radeon_cp_flip':
drivers/gpu/drm/radeon/radeon_state.c:2179: error: 'for' loop initial declaration used outside C99 mode
drivers/gpu/drm/radeon/radeon_state.c: In function 'radeon_cp_swap':
drivers/gpu/drm/radeon/radeon_state.c:2199: error: 'for' loop initial declaration used outside C99 mode
drivers/gpu/drm/radeon/radeon_state.c: In function 'radeon_cp_vertex':
drivers/gpu/drm/radeon/radeon_state.c:2275: error: 'for' loop initial declaration used outside C99 mode
drivers/gpu/drm/radeon/radeon_state.c: In function 'radeon_cp_indices':
drivers/gpu/drm/radeon/radeon_state.c:2363: error: 'for' loop initial declaration used outside C99 mode
drivers/gpu/drm/radeon/radeon_state.c: In function 'radeon_cp_stipple':
drivers/gpu/drm/radeon/radeon_state.c:2409: error: 'for' loop initial declaration used outside C99 mode
drivers/gpu/drm/radeon/radeon_state.c: In function 'radeon_cp_indirect':
drivers/gpu/drm/radeon/radeon_state.c:2459: error: 'for' loop initial declaration used outside C99 mode
drivers/gpu/drm/radeon/radeon_state.c:2474: error: 'for' loop initial declaration used outside C99 mode
drivers/gpu/drm/radeon/radeon_state.c: In function 'radeon_cp_vertex2':
drivers/gpu/drm/radeon/radeon_state.c:2566: error: 'for' loop initial declaration used
outside C99 mode drivers/gpu/drm/radeon/radeon_state.c: In function 'radeon_emit_packets': drivers/gpu/drm/radeon/radeon_state.c:2596: error: 'for' loop initial declaration used outside C99 mode drivers/gpu/drm/radeon/radeon_state.c: In function 'radeon_emit_scalars': drivers/gpu/drm/radeon/radeon_state.c:2615: error: 'for' loop initial declaration used outside C99 mode drivers/gpu/drm/radeon/radeon_state.c: In function 'radeon_emit_scalars2': drivers/gpu/drm/radeon/radeon_state.c:2637: error: 'for' loop initial declaration used outside C99 mode drivers/gpu/drm/radeon/radeon_state.c: In function 'radeon_emit_vectors': drivers/gpu/drm/radeon/radeon_state.c:2657: error: 'for' loop initial declaration used outside C99 mode drivers/gpu/drm/radeon/radeon_state.c: In function 'radeon_emit_veclinear': drivers/gpu/drm/radeon/radeon_state.c:2683: error: 'for' loop initial declaration used outside C99 mode drivers/gpu/drm/radeon/radeon_state.c: In function 'radeon_emit_packet3': drivers/gpu/drm/radeon/radeon_state.c:2713: error: 'for' loop initial declaration used outside C99 mode drivers/gpu/drm/radeon/radeon_state.c: In function 'radeon_emit_packet3_cliprect': drivers/gpu/drm/radeon/radeon_state.c:2763: error: 'for' loop initial declaration used outside C99 mode drivers/gpu/drm/radeon/radeon_state.c:2770: error: 'for' loop initial declaration used outside C99 mode drivers/gpu/drm/radeon/radeon_state.c: In function 'radeon_emit_wait': drivers/gpu/drm/radeon/radeon_state.c:2792: error: 'for' loop initial declaration used outside C99 mode drivers/gpu/drm/radeon/radeon_state.c:2797: error: 'for' loop initial declaration used outside C99 mode drivers/gpu/drm/radeon/radeon_state.c:2802: error: 'for' loop initial declaration used outside C99 mode drivers/gpu/drm/radeon/radeon_state.c: In function 'radeon_cp_cmdbuf': drivers/gpu/drm/radeon/radeon_state.c:2967: error: 'for' loop initial declaration used outside C99 mode distcc[13031] ERROR: compile 
drivers/gpu/drm/radeon/radeon_state.c on pc-francois failed
Created attachment 18025 [details]
Keep CPU busy for sometimes after ring commit
Sorry, I only thought of it afterwards: i is way too common a name to be used in a macro :)
Here is another hack.
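The two failure modes discussed here (a loop counter declared inside the for(...) header, which the compiler rejects outside C99 mode, and a macro that uses a name as common as i) can both be avoided with a C89-style macro. The following is only a minimal sketch under stated assumptions: BUSY_SPIN, busy_spin_idx_, spin_total and spin_in_loop are hypothetical names chosen for illustration, and this is not the code from the attached patch.

```c
/* Hypothetical illustration only; not the code from the attached patch. */

/* Counts how often the macro body ran, so the effect can be observed. */
static unsigned long spin_total;

/*
 * C89-friendly busy loop:
 *  - the counter is declared at the top of its own block, not inside the
 *    for(...) header, so it compiles without C99 mode;
 *  - the counter has an unlikely name, so it cannot clash with a caller's i;
 *  - do { } while (0) makes the macro behave like a single statement.
 */
#define BUSY_SPIN(n)                                              \
	do {                                                      \
		unsigned long busy_spin_idx_;                     \
		for (busy_spin_idx_ = 0;                          \
		     busy_spin_idx_ < (unsigned long)(n);         \
		     busy_spin_idx_++)                            \
			spin_total++;                             \
	} while (0)

/* A caller that already uses i for its own loop. */
static unsigned long spin_in_loop(int rounds, int spins_per_round)
{
	int i;

	spin_total = 0;
	for (i = 0; i < rounds; i++)
		BUSY_SPIN(spins_per_round);
	return spin_total;
}
```

Because the macro opens its own block, the caller's i is untouched and the macro can be used in any function without adding a declaration there.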
Created attachment 18026 [details]
Keep CPU busy for sometimes after ring commit
My bad once again, sorry, I am always in a hurry.
This time, it gives the following error:
CC [M] drivers/gpu/drm/radeon/radeon_state.o
drivers/gpu/drm/radeon/radeon_state.c: In function ‘radeon_cp_clear’:
drivers/gpu/drm/radeon/radeon_state.c:2131: error: ‘zzi’ undeclared (first use in this function)
drivers/gpu/drm/radeon/radeon_state.c:2131: error: (Each undeclared identifier is reported only once
drivers/gpu/drm/radeon/radeon_state.c:2131: error: for each function it appears in.)
drivers/gpu/drm/radeon/radeon_state.c: In function ‘radeon_cp_flip’:
drivers/gpu/drm/radeon/radeon_state.c:2179: error: ‘zzi’ undeclared (first use in this function)
drivers/gpu/drm/radeon/radeon_state.c: In function ‘radeon_cp_swap’:
drivers/gpu/drm/radeon/radeon_state.c:2199: error: ‘zzi’ undeclared (first use in this function)
drivers/gpu/drm/radeon/radeon_state.c: In function ‘radeon_cp_vertex’:
drivers/gpu/drm/radeon/radeon_state.c:2275: error: ‘zzi’ undeclared (first use in this function)
drivers/gpu/drm/radeon/radeon_state.c: In function ‘radeon_cp_indices’:
drivers/gpu/drm/radeon/radeon_state.c:2363: error: ‘zzi’ undeclared (first use in this function)
drivers/gpu/drm/radeon/radeon_state.c: In function ‘radeon_cp_stipple’:
drivers/gpu/drm/radeon/radeon_state.c:2409: error: ‘zzi’ undeclared (first use in this function)
drivers/gpu/drm/radeon/radeon_state.c: In function ‘radeon_cp_vertex2’:
drivers/gpu/drm/radeon/radeon_state.c:2566: error: ‘zzi’ undeclared (first use in this function)
drivers/gpu/drm/radeon/radeon_state.c: In function ‘radeon_cp_cmdbuf’:
drivers/gpu/drm/radeon/radeon_state.c:2967: error: ‘zzi’ undeclared (first use in this function)
distcc[9973] ERROR: compile drivers/gpu/drm/radeon/radeon_state.c on localhost failed
make[4]: *** [drivers/gpu/drm/radeon/radeon_state.o] Error 1
make[4]: *** Waiting for unfinished jobs....
make[3]: *** [drivers/gpu/drm/radeon] Error 2
make[2]: *** [drivers/gpu/drm] Error 2
make[1]: *** [drivers/gpu] Error 2
make: *** [drivers] Error 2
make: *** Waiting for unfinished jobs....
zsh: exit 2 make "CC=distcc i686-pc-linux-gnu-gcc" -j3
Created attachment 18031 [details]
Keep CPU busy for sometimes after ring commit
I am very sorry, I don't have the infrastructure to build a kernel right now, so
I am writing these patches blind. Hopefully this one should compile.
Unfortunately, this produces yet another error:
CC [M] drivers/gpu/drm/radeon/radeon_state.o
drivers/gpu/drm/radeon/radeon_state.c: In function ‘radeon_cp_clear’:
drivers/gpu/drm/radeon/radeon_state.c:2116: warning: unused variable ‘ring’
drivers/gpu/drm/radeon/radeon_state.c:2116: warning: unused variable ‘mask’
drivers/gpu/drm/radeon/radeon_state.c:2116: warning: unused variable ‘_nr’
drivers/gpu/drm/radeon/radeon_state.c:2116: warning: unused variable ‘write’
drivers/gpu/drm/radeon/radeon_state.c: In function ‘radeon_cp_flip’:
drivers/gpu/drm/radeon/radeon_state.c:2169: warning: unused variable ‘ring’
drivers/gpu/drm/radeon/radeon_state.c:2169: warning: unused variable ‘mask’
drivers/gpu/drm/radeon/radeon_state.c:2169: warning: unused variable ‘_nr’
drivers/gpu/drm/radeon/radeon_state.c:2169: warning: unused variable ‘write’
drivers/gpu/drm/radeon/radeon_state.c: In function ‘radeon_cp_swap’:
drivers/gpu/drm/radeon/radeon_state.c:2189: warning: unused variable ‘ring’
drivers/gpu/drm/radeon/radeon_state.c:2189: warning: unused variable ‘mask’
drivers/gpu/drm/radeon/radeon_state.c:2189: warning: unused variable ‘_nr’
drivers/gpu/drm/radeon/radeon_state.c:2189: warning: unused variable ‘write’
drivers/gpu/drm/radeon/radeon_state.c: In function ‘radeon_cp_vertex’:
drivers/gpu/drm/radeon/radeon_state.c:2214: warning: unused variable ‘ring’
drivers/gpu/drm/radeon/radeon_state.c:2214: warning: unused variable ‘mask’
drivers/gpu/drm/radeon/radeon_state.c:2214: warning: unused variable ‘_nr’
drivers/gpu/drm/radeon/radeon_state.c:2214: warning: unused variable ‘write’
drivers/gpu/drm/radeon/radeon_state.c: In function ‘radeon_cp_indices’:
drivers/gpu/drm/radeon/radeon_state.c:2292: warning: unused variable ‘ring’
drivers/gpu/drm/radeon/radeon_state.c:2292: warning: unused variable ‘mask’
drivers/gpu/drm/radeon/radeon_state.c:2292: warning: unused variable ‘_nr’
drivers/gpu/drm/radeon/radeon_state.c:2292: warning: unused variable ‘write’
drivers/gpu/drm/radeon/radeon_state.c: In function ‘radeon_cp_stipple’:
drivers/gpu/drm/radeon/radeon_state.c:2404: error: conflicting types for ‘mask’
drivers/gpu/drm/radeon/radeon_state.c:2403: error: previous declaration of ‘mask’ was here
drivers/gpu/drm/radeon/radeon_state.c:2413: warning: passing argument 2 of ‘radeon_cp_dispatch_stipple’ makes pointer from integer without a cast
drivers/gpu/drm/radeon/radeon_state.c:2404: warning: unused variable ‘ring’
drivers/gpu/drm/radeon/radeon_state.c:2404: warning: unused variable ‘_nr’
drivers/gpu/drm/radeon/radeon_state.c:2404: warning: unused variable ‘write’
drivers/gpu/drm/radeon/radeon_state.c: In function ‘radeon_cp_vertex2’:
drivers/gpu/drm/radeon/radeon_state.c:2493: warning: unused variable ‘ring’
drivers/gpu/drm/radeon/radeon_state.c:2493: warning: unused variable ‘mask’
drivers/gpu/drm/radeon/radeon_state.c:2493: warning: unused variable ‘_nr’
drivers/gpu/drm/radeon/radeon_state.c:2493: warning: unused variable ‘write’
drivers/gpu/drm/radeon/radeon_state.c: In function ‘radeon_cp_cmdbuf’:
drivers/gpu/drm/radeon/radeon_state.c:2830: warning: unused variable ‘ring’
drivers/gpu/drm/radeon/radeon_state.c:2830: warning: unused variable ‘mask’
drivers/gpu/drm/radeon/radeon_state.c:2830: warning: unused variable ‘_nr’
drivers/gpu/drm/radeon/radeon_state.c:2830: warning: unused variable ‘write’
distcc[9601] ERROR: compile drivers/gpu/drm/radeon/radeon_state.c on localhost failed
make[4]: *** [drivers/gpu/drm/radeon/radeon_state.o] Error 1
make[4]: *** Waiting for unfinished jobs....
make[3]: *** [drivers/gpu/drm/radeon] Error 2
make[2]: *** [drivers/gpu/drm] Error 2
make[1]: *** [drivers/gpu] Error 2
make[1]: *** Waiting for unfinished jobs....
make: *** [drivers] Error 2
zsh: exit 2 make "CC=distcc i686-pc-linux-gnu-gcc" -j3
Created attachment 18035 [details]
Keep CPU busy for sometimes after ring commit
Is this one better ?
Your patch generates a reject in r300_cmdbuf.c. If I understood correctly, I simply needed to move the declaration of zzi (int zzi;). After moving that line, no compilation errors occurred. Unfortunately, the problem still occurs. There are plenty of lines like these in dmesg when X hangs:
[drm:radeon_cp_idle]
[drm:radeon_do_cp_idle]
[drm:radeon_do_wait_for_fifo] wait for fifo failed status : 0x80036100 0x00000000
[drm:drm_ioctl] ret = fffffff0
[drm:drm_ioctl] pid=7276, cmd=0x6444, nr=0x44, dev 0xe200, auth=1
Created attachment 18040 [details]
Keep CPU busy for sometimes after ring commit
Here is the patch I applied.
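For readers of the dmesg excerpt in this comment (status 0x80036100 and ret = fffffff0), the two hex values can be decoded with a few lines of C. This is a hedged sketch, not driver code: it assumes the first status word is the RBBM status register, and the bit layout (low 7 bits as the count of free command FIFO slots, bit 31 as the engine-active flag) is assumed from the radeon DRM header of that era. The macro and function names are hypothetical.

```c
#include <stdint.h>

/* Assumed bit layout of the RBBM status word (not confirmed in this thread). */
#define RBBM_FIFOCNT_MASK 0x0000007fu  /* free command FIFO slots */
#define RBBM_ACTIVE       (1u << 31)   /* engine busy flag */

/* Number of free FIFO slots reported by the status word. */
static unsigned int rbbm_fifo_free(uint32_t status)
{
	return status & RBBM_FIFOCNT_MASK;
}

/* Returns 1 if the engine-active bit is set, 0 otherwise. */
static int rbbm_engine_active(uint32_t status)
{
	return (status & RBBM_ACTIVE) != 0;
}

/*
 * The kernel prints the ioctl return value as unsigned hex; reinterpret it
 * as a signed 32-bit errno without relying on implementation-defined casts.
 */
static int32_t ioctl_ret_as_errno(uint32_t raw)
{
	if (raw >= 0x80000000u)
		return -(int32_t)((uint32_t)~raw) - 1;
	return (int32_t)raw;
}
```

Under these assumptions, 0x80036100 reads as "engine active, zero free FIFO slots", which would be consistent with the wait-for-fifo failure, and fffffff0 is -16, which is -EBUSY on Linux.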
Does it help?
Okay, let's try a different approach here. I need you to git clone this:
git clone git://people.freedesktop.org/~glisse/radeondump
cd radeondump
cmake .
make
Then, once the lockup happens, log in through ssh as root and do:
./radeondump -d lockup
Reboot a few times (3 to 5 dumps should do it) and do a dump each time it locks up. You should end up with several lockup-*- files. (Note: you don't need any of the previous patches.) Then install fglrx and do:
./radeondump -d fglrx
Again, do it a few times and do stuff between dumps (launch applications, browse the web); 3 to 5 dumps should do it. Make a tar of all these dumps and attach it to this bug. Basically, radeondump dumps several radeon config registers; dumping these registers with fglrx and in the lockup case will help to find out which part of the configuration we get wrong, if any. Also, does adding the option:
Option "BusType" "PCI"
to the device section for your card in xorg.conf help?
I forgot: you must run radeondump as root.
As you suggested, I have added the Option "BusType" "PCI" line in xorg.conf. I also removed the line setting the BusID, which was the following:
BusID "PCI:1:0:0"
Now it works perfectly well with CPU idle enabled and the workaround patch limiting the C-state to C2 reverted! So this bug was in fact a configuration problem.
This is still a bug; setting the bus type to PCI just hides it. Does it also work without this option? If the bug shows up, please do the series of dumps.
Created attachment 18064 [details]
output of radeondump
So I am reopening the bug. Adding the "BusType" option is one way to avoid it. Another way is to set the max C-state to C2.
I have added the output of 3 dumps.
What would be really useful too is 3-4 dumps with fglrx (sadly, it's still the easiest way to find out how to set up some of the regs).
I can't compile fglrx with the current version of the 2.6.27-rcX kernel. As I have said in comment #54, the problem occurs as well. Should I try a dump with fglrx and kernel 2.6.26?
Created attachment 18093 [details]
Set wptr delay (necessary on some AGP chipset) and disable host path pretech to avoid some hang conditions
If you can, give this patch a try.
A dump with fglrx on an older kernel would also be useful, even a stock distribution kernel. I am just really interested to know how fglrx sets up your card, so I can compare it with what we do and try to spot some difference which might help to fix your situation.
I also cannot compile fglrx with kernel 2.6.26. I tried to find some patches but none of them were successful. Should I go back to 2.6.25? Will the output still be relevant? As I have previously said, I am not sure fglrx will work without problems. I have also tried your patch (set wptr delay...) and it doesn't solve the problem. Why is it not enough to use the BusType option? I am getting tired of investigating this problem.
I have finally managed to compile fglrx with kernel 2.6.26. It also required enabling unused symbols (because it needs init_mm). Unfortunately, it also fails when CPU idle is enabled, so I don't think we will get valuable info from it. It even fails harder, since it's not possible to initiate an SSH connection when X hangs.
Option "BusType" "PCI" downgrades your AGP to PCI, so you will experience a severe performance loss. I know debugging is painful and you have been very helpful. The thing is, we have very few good testers like you with problematic hardware, while we have a lot of users with problematic hardware. So by helping to track this down you are helping more than yourself, and making a very valuable contribution to fixing others' problems :). I will dig more into some AGP docs to see what might be useful to test and come back to you with another simple patch. Also, could you attach the Xorg.0.log file from the lockup case?
So I have removed the BusType option and I also removed the line:
BusID "PCI:1:0:0"
After several minutes, the GDM login window is not yet blocked. Maybe the problematic line was the one indicating the BusID. I don't remember where this came from. Maybe it's really a configuration problem. CPU idle is enabled, the workaround patch to set the max C-state is not applied, and it has not yet failed.
Created attachment 18096 [details]
Tweak AGP (cripple & others workaround features)
Strange: according to your lspci, removing the line:
BusID "PCI:1:0:0"
shouldn't do anything. If you could attach your Xorg log, maybe there is
useful information in it. I am still attaching a patch which tweaks some
AGP features.
Created attachment 18097 [details]
Xorg.log obtained with your last patch Tweak AGP. ..
I don't understand anything now. I have retried the 2.6.27-rc7 kernel without any special patch and without setting the bus type to PCI, and the problem occurs again. I don't understand why it worked only once.
I have attached the Xorg.log file you asked for. Your last patch doesn't help and makes things worse. With this one, the screen remains black when X starts and I never see the GDM login window. I can't even reboot the PC cleanly. After some time, I have to use the SysRq keys to force a reboot.
Created attachment 18120 [details]
Another AGP tweak (cripple & others workaround features)
Okay, at least now we know that this is AGP related. I am attaching another try at
tweaking some of the AGP configuration.
Also, could you git pull the changes to radeondump and provide a new dump?
I want to look at some more config register values.
Unfortunately, your last patch doesn't compile and gives the following error:
drivers/gpu/drm/radeon/radeon_cp.c:1773:1: error: unterminated argument list invoking macro "RADEON_WRITE"
drivers/gpu/drm/radeon/radeon_cp.c: In function 'radeon_cp_init_ring_buffer':
drivers/gpu/drm/radeon/radeon_cp.c:588: error: 'RADEON_WRITE' undeclared (first use in this function)
drivers/gpu/drm/radeon/radeon_cp.c:588: error: (Each undeclared identifier is reported only once
drivers/gpu/drm/radeon/radeon_cp.c:588: error: for each function it appears in.)
drivers/gpu/drm/radeon/radeon_cp.c:588: error: expected ';' at end of input
drivers/gpu/drm/radeon/radeon_cp.c:588: error: expected declaration or statement at end of input
drivers/gpu/drm/radeon/radeon_cp.c:553: warning: unused variable 'tmp'
drivers/gpu/drm/radeon/radeon_cp.c:552: warning: unused variable 'cur_read_ptr'
distcc[14464] ERROR: compile drivers/gpu/drm/radeon/radeon_cp.c on localhost failed
make[4]: *** [drivers/gpu/drm/radeon/radeon_cp.o] Error 1
make[3]: *** [drivers/gpu/drm/radeon] Error 2
make[2]: *** [drivers/gpu/drm] Error 2
make[1]: *** [drivers/gpu] Error 2
Created attachment 18137 [details]
Tweak AGP (cripple & others workaround features)
Again, I am very sorry. This one should compile (there was a missing closing parenthesis). Could you also update
radeondump and provide a new dump?
So I tried your last patch. It doesn't solve the problem. With this one, the screen remains black when X starts. I never see the clock shown when GDM starts, and I never see the GDM window either. I also tried radeondump, but it freezes my computer when I use it. It also locks up the computer when I use it with kernel 2.6.26.
Created attachment 18152 [details]
Xorg.log obtained with C-State limited to C2
I have found something strange in the Xorg log file. When I don't use your patch and instead apply the workaround patch limiting the C-state, I see more lines after "(II) RADEON(0): no multimedia table present, disabling Rage Theatre.".
The following lines appear, which is not the case when I apply your patch (Tweak AGP features...):
(II) RADEON(0): RandR 1.2 enabled, ignore the following RandR disabled message.
(WW) RADEON(0): Option "AddARGBGLXVisuals" is not used
(--) RandR disabled
(II) Initializing built-in extension MIT-SHM
(II) Initializing built-in extension XInputExtension
(II) Initializing built-in extension XTEST
(II) Initializing built-in extension XKEYBOARD
(II) Initializing built-in extension XC-APPGROUP
(II) Initializing built-in extension XAccessControlExtension
(II) Initializing built-in extension SECURITY
(II) Initializing built-in extension XINERAMA
(II) Initializing built-in extension XFIXES
(II) Initializing built-in extension XFree86-Bigfont
(II) Initializing built-in extension RENDER
(II) Initializing built-in extension RANDR
(II) Initializing built-in extension COMPOSITE
(II) Initializing built-in extension DAMAGE
(II) Initializing built-in extension XEVIE
drmOpenDevice: node name is /dev/dri/card0
drmOpenDevice: open result is 8, (OK)
drmOpenByBusid: Searching for BusID pci:0000:01:00.0
drmOpenDevice: node name is /dev/dri/card0
drmOpenDevice: open result is 8, (OK)
drmOpenByBusid: drmOpenMinor returns 8
drmOpenByBusid: drmGetBusid reports pci:0000:01:00.0
(WW) AIGLX: 3D driver claims to not support visual 0x23
(WW) AIGLX: 3D driver claims to not support visual 0x24
(WW) AIGLX: 3D driver claims to not support visual 0x25
(WW) AIGLX: 3D driver claims to not support visual 0x26
(WW) AIGLX: 3D driver claims to not support visual 0x27
(WW) AIGLX: 3D driver claims to not support visual 0x28
(WW) AIGLX: 3D driver claims to not support visual 0x29
(WW) AIGLX: 3D driver claims to not support visual 0x2a
(WW) AIGLX: 3D driver claims to not support visual 0x2b
(WW) AIGLX: 3D driver claims to not support visual 0x2c
(WW) AIGLX: 3D driver claims to not support visual 0x2d
(WW) AIGLX: 3D driver claims to not support visual 0x2e
(WW) AIGLX: 3D driver claims to not support visual 0x2f
(WW) AIGLX: 3D driver claims to not support visual 0x30
(WW) AIGLX: 3D driver claims to not support visual 0x31
(WW) AIGLX: 3D driver claims to not support visual 0x32
(II) AIGLX: Loaded and initialized /usr/lib/dri/r300_dri.so
(II) GLX: Initialized DRI GL provider for screen 0
(II) RADEON(0): Setting screen physical size to 270 x 203
It seems nobody is interested in this bug or nobody has an idea of how to solve it. I retried the last patch (Tweak AGP (cripple & others workaround features)) on kernel 2.6.27.6 and the same problem occurs: the screen remains black when X starts. Furthermore, X takes 99% of the CPU resources and thus the load of the system increases constantly (from 2.87, 0.93, 0.33 on login via SSH to 3.79, 1.42, 0.52 three minutes later). Am I forced to use the workaround limiting the C-state to C2 forever?
Is this problem ever going to be solved? With the current version of 2.6.28-rc8, it still occurs. X still hangs using 99.8% of CPU resources at startup.
Sorry, I forgot about this one. Given that the BusType PCI option fixed it, it might just be one of those broken AGP hardware cases. Unfortunately, we don't have a reliable hardware bug list for either AGP chipsets or GPU chipsets. AGP is one of the worst things ever invented in computing; there are too many hardware bugs in it. Debugging this mostly needs someone working full time on the hardware to track down what the problem is. So I would be curious to know whether fglrx is enabling AGP on your card or not (given that I assume fglrx has a more reliable list of broken chipsets and GPUs). In the meantime, I will fix my radeondump stuff so you can provide me with some useful dumps.
I have the same problem on my (very) old notebook, which has a Mobility M6 graphics controller. In this case too, the problem is the DRI option used with the Xorg radeon driver. The problem doesn't exist on another PC, a desktop with an ATI Technologies Inc Radeon R100 QD [Radeon 7200]. I'm using the latest kernel, v2.6.29-rc3-12726-gf917b45 (wireless-testing).
I tried today to re-enable the C3 state on my computer and I was happily surprised to see that the problem doesn't occur anymore. I now use kernel 2.6.31.6 and KDE 4.3. In the meantime, there were also upgrades to Xorg. So I don't know what has solved the problem, but it seems to be gone.