Bug 12826
| Field | Value |
|---|---|
| Summary | cpufreq driver do not expose all data and configuration to /sys |
| Product | Power Management |
| Component | cpufreq |
| Status | CLOSED CODE_FIX |
| Severity | normal |
| Priority | P1 |
| Hardware | All |
| OS | Linux |
| Kernel Version | 2.6.29-rc7 |
| Subsystem | |
| Regression | Yes |
| Bisected commit-id | |
| Reporter | uzytkownik2 (uzytkownik2) |
| Assignee | cpufreq |
| CC | hmh, lenb, rjw |
| Bug Depends on | |
| Bug Blocks | 12398 |
| Attachments | dmesg |
Description
uzytkownik2@gmail.com
2009-03-06 05:21:49 UTC
In email I was asked to "write down in the bug, which files exactly you mean". The files present are listed above. According to the documentation (Documentation/cpu-freq/user-guide.txt), /sys/devices/system/cpu/cpu0/cpufreq/ should contain: cpuinfo_min_freq, cpuinfo_max_freq, scaling_driver, scaling_available_governors, scaling_governor, cpuinfo_cur_freq, scaling_available_frequencies, scaling_min_freq, scaling_max_freq, affected_cpus, related_cpus, scaling_driver and scaling_cur_freq (13 instead of 0).

Ah now I get it. I looked at the conservative files only...
This is indeed strange. What cpufreq driver is used?
(I hope p4-clockmod was not accidentally loaded there...).
Can you post dmesg, please.
If there is nothing obvious you might want to compile with:
CPU_FREQ_DEBUG=y
and add the cpufreq.debug=7 boot param.
Then boot and attach the dmesg of the freshly booted system (the dmesg buffer might run full after some time).

Created attachment 20447
dmesg

(In reply to comment #2)
> Ah now I get it. I looked at the conservative files only...
> This is indeed strange. What cpufreq driver is used?
> (I hope p4-clockmod was not accidentally loaded there...).

p4-clockmod. No other driver is working (see log). AFAIR it was written on thinkwiki that for some Intel Celeron M processors it is the only driver that works. If it is otherwise I will be happy to change.

> Can you post dmesg, please.
> If there is nothing obvious you might want to compile with:
> CPU_FREQ_DEBUG=y
> and add the cpufreq.debug=7 boot param.
> Then boot and attach the dmesg of the freshly booted system (the dmesg buffer
> might run full after some time).
>

Attached output for the "rescue" system (i.e. with /bin/sh as init).

Someone else has to help you out here.
IMO this driver should not exist, best do:
blacklist p4-clockmod
in /etc/modprobe.conf.
A Celeron M probably has better ways to save power.

Reply-To: akpm@linux-foundation.org

(switched to email. Please respond via emailed reply-to-all, not via the bugzilla web interface).
On Fri, 6 Mar 2009 05:21:50 -0800 (PST) bugme-daemon@bugzilla.kernel.org wrote: > http://bugzilla.kernel.org/show_bug.cgi?id=12826 > > Summary: cpufreq driver do not expose all data and configuration > to /sys > Product: Power Management > Version: 2.5 > KernelVersion: 2.6.29-rc7 > Platform: All > OS/Version: Linux > Tree: Mainline > Status: NEW > Severity: normal > Priority: P1 > Component: cpufreq > AssignedTo: cpufreq@vger.kernel.org > ReportedBy: uzytkownik2@gmail.com > > > Latest working kernel version: 2.6.28 > Earliest failing kernel version: 2.6.29-rc7 (also -rc5 and -rc6 but tested > only > with patchset) > Distribution: Gentoo > Hardware Environment: Thinkpad R51e > Software Environment: Standard stack (although reproduced with only /bin/sh > as > init) > Problem Description: > Contrary to documentation there are little files controlling the cpufreq > > # ls /sys/devices/system/cpu/cpu0/cpufreq/ > conservative stats > # ls /sys/devices/system/cpu/cpu0/cpufreq/conservative > down_threshold ignore_nice_load sampling_rate sampling_rate_min > freq_step sampling_down_factor sampling_rate_max up_threshold > > Steps to reproduce: > Boot I'd say that commit 8529154ec3f3ac20344c65b7a040c604c7af7651 Author: Herton Ronaldo Krzesinski <herton@mandriva.com.br> Date: Sat Nov 15 17:02:46 2008 -0200 [CPUFREQ] Add Celeron Core support to p4-clockmod. has a good chance of being the cause of this regression? On Fri, 2009-03-06 at 15:30 -0800, Andrew Morton wrote: > (switched to email. Please respond via emailed reply-to-all, not via the > bugzilla web interface). 
> > On Fri, 6 Mar 2009 05:21:50 -0800 (PST) > bugme-daemon@bugzilla.kernel.org wrote: > > > http://bugzilla.kernel.org/show_bug.cgi?id=12826 > > > > Summary: cpufreq driver do not expose all data and configuration > > to /sys > > Product: Power Management > > Version: 2.5 > > KernelVersion: 2.6.29-rc7 > > Platform: All > > OS/Version: Linux > > Tree: Mainline > > Status: NEW > > Severity: normal > > Priority: P1 > > Component: cpufreq > > AssignedTo: cpufreq@vger.kernel.org > > ReportedBy: uzytkownik2@gmail.com > > > > > > Latest working kernel version: 2.6.28 > > Earliest failing kernel version: 2.6.29-rc7 (also -rc5 and -rc6 but tested > only > > with patchset) > > Distribution: Gentoo > > Hardware Environment: Thinkpad R51e > > Software Environment: Standard stack (although reproduced with only /bin/sh > as > > init) > > Problem Description: > > Contrary to documentation there are little files controlling the cpufreq > > > > # ls /sys/devices/system/cpu/cpu0/cpufreq/ > > conservative stats > > # ls /sys/devices/system/cpu/cpu0/cpufreq/conservative > > down_threshold ignore_nice_load sampling_rate sampling_rate_min > > freq_step sampling_down_factor sampling_rate_max up_threshold > > > > Steps to reproduce: > > Boot > > I'd say that > > commit 8529154ec3f3ac20344c65b7a040c604c7af7651 > Author: Herton Ronaldo Krzesinski <herton@mandriva.com.br> > Date: Sat Nov 15 17:02:46 2008 -0200 > > [CPUFREQ] Add Celeron Core support to p4-clockmod. > > has a good chance of being the cause of this regression? Unfortunately I reverted this commit and it had no effect. However I found however commit: commit e088e4c9cdb618675874becb91b2fd581ee707e6 Author: Matthew Garrett <mjg@redhat.com> Date: Tue Nov 25 13:29:47 2008 -0500 [CPUFREQ] Disable sysfs ui for p4-clockmod. p4-clockmod has a long history of abuse. It pretends to be a CPU frequency scaling driver, even though it doesn't actually change the CPU frequency, but instead just modulates the frequency with wait-states. 
    The biggest misconception is that when running at the lower 'frequency'
    p4-clockmod is saving power. This isn't the case, as workloads running
    slower take longer to complete, preventing the CPU from entering
    deep C states.

    However p4-clockmod does have a purpose. It can prevent overheating.
    Having it hooked up to the cpufreq interfaces is the wrong way to
    achieve cooling however. It should instead be hooked up to ACPI.

    This diff introduces a means for a cpufreq driver to register with the
    cpufreq core, but not present a sysfs interface.

I guess the lack of a sysfs ui is the problem (at least AFAIU 'sysfs ui').
However, the lack of a sysfs ui prevents cpufreq from lowering the frequency
on overheat[1]. I'll try tomorrow (well, this morning) whether this commit
causes it.

While I understand that p4-clockmod shouldn't be used, no other driver is
working. p4-clockmod is recommended on thinkwiki[2]. All I found is a post
from 2006 mentioning a patch for the speedstep driver which the author was
going to try in his spare time - but p4-clockmod is working. So there are
no known replacements for it.

Regards

PS. Even if the commit is not reverted, the documentation should be
updated. For example, the change should be mentioned in the help text of
p4-clockmod.

[1] I'd find out the problem if the system didn't start to overheat.
Something is wrong, but lowering the 'frequency' drops the temperature
only by 20 degrees (from 9x to 7x).
[2] http://www.thinkwiki.org/wiki/Intel_Celeron_M#Speed_Step

Reply-To: akpm@linux-foundation.org

On Sat, 07 Mar 2009 03:39:31 +0100 Maciej Piechotka <uzytkownik2@gmail.com> wrote:

> On Fri, 2009-03-06 at 15:30 -0800, Andrew Morton wrote:
> > (switched to email. Please respond via emailed reply-to-all, not via the
> > bugzilla web interface).
> > > > On Fri, 6 Mar 2009 05:21:50 -0800 (PST) > > bugme-daemon@bugzilla.kernel.org wrote: > > > > > http://bugzilla.kernel.org/show_bug.cgi?id=12826 > > > > > > Summary: cpufreq driver do not expose all data and > configuration > > > to /sys > > > Product: Power Management > > > Version: 2.5 > > > KernelVersion: 2.6.29-rc7 > > > Platform: All > > > OS/Version: Linux > > > Tree: Mainline > > > Status: NEW > > > Severity: normal > > > Priority: P1 > > > Component: cpufreq > > > AssignedTo: cpufreq@vger.kernel.org > > > ReportedBy: uzytkownik2@gmail.com > > > > > > > > > Latest working kernel version: 2.6.28 > > > Earliest failing kernel version: 2.6.29-rc7 (also -rc5 and -rc6 but > tested only > > > with patchset) > > > Distribution: Gentoo > > > Hardware Environment: Thinkpad R51e > > > Software Environment: Standard stack (although reproduced with only > /bin/sh as > > > init) > > > Problem Description: > > > Contrary to documentation there are little files controlling the cpufreq > > > > > > # ls /sys/devices/system/cpu/cpu0/cpufreq/ > > > conservative stats > > > # ls /sys/devices/system/cpu/cpu0/cpufreq/conservative > > > down_threshold ignore_nice_load sampling_rate > sampling_rate_min > > > freq_step sampling_down_factor sampling_rate_max up_threshold > > > > > > Steps to reproduce: > > > Boot > > > > I'd say that > > > > commit 8529154ec3f3ac20344c65b7a040c604c7af7651 > > Author: Herton Ronaldo Krzesinski <herton@mandriva.com.br> > > Date: Sat Nov 15 17:02:46 2008 -0200 > > > > [CPUFREQ] Add Celeron Core support to p4-clockmod. > > > > has a good chance of being the cause of this regression? > > Unfortunately I reverted this commit and it had no effect. > > However I found however commit: > commit e088e4c9cdb618675874becb91b2fd581ee707e6 > Author: Matthew Garrett <mjg@redhat.com> > Date: Tue Nov 25 13:29:47 2008 -0500 > > [CPUFREQ] Disable sysfs ui for p4-clockmod. > > p4-clockmod has a long history of abuse. 
It pretends to be a CPU > frequency scaling driver, even though it doesn't actually change > the CPU frequency, but instead just modulates the frequency with > wait-states. > The biggest misconception is that when running at the lower > 'frequency' > p4-clockmod is saving power. This isn't the case, as workloads > running > slower take longer to complete, preventing the CPU from entering > deep C stat > es. > > However p4-clockmod does have a purpose. It can prevent > overheating. > Having it hooked up to the cpufreq interfaces is the wrong way to > achieve > cooling however. It should instead be hooked up to ACPI. > > This diff introduces a means for a cpufreq driver to register with > the > cpufreq core, but not present a sysfs interface. eh? So we deliberately added a regression? > > I guess lack of sysfs ui is the problem (at lest AFAIU 'sysfs ui'). > However lack of sysfs ui prevents the cpufreq from lowering frequency on > overheat[1]. I'll try tomorrow (well. today morning) if this commit > causes it. Thanks. > While I understend that the p4-clockmod shouldn't be used no other > driver is working. p4-clockmod is recommend on thinkwiki[2]. All I found > is a post from 2006 mentioning there is a patch for speedstep driver > which the author is going to try in a spare time - but p4-clockmod is > working. So there are no known replacements for them. > > Regards > > PS, Even if the commit will not be reverted documentation should be > updated. For example in help of p4-clockmod the change should be > mentioned. > > [1] I'd find out the problem if the system didn't started to overheat. > Something is wrong but lowering the 'frequency' drop the temperature > only by 20 degrees (from 9x to 7x). > [2] http://www.thinkwiki.org/wiki/Intel_Celeron_M#Speed_Step > Reply-To: mjg@redhat.com On Fri, Mar 06, 2009 at 06:50:42PM -0800, Andrew Morton wrote: > eh? So we deliberately added a regression? 
No, we removed functionality that doesn't save power from a subsystem that's designed to save power. Reply-To: akpm@linux-foundation.org On Sat, 7 Mar 2009 02:57:43 +0000 Matthew Garrett <mjg@redhat.com> wrote: > On Fri, Mar 06, 2009 at 06:50:42PM -0800, Andrew Morton wrote: > > > eh? So we deliberately added a regression? > > No, we removed functionality that doesn't save power from a subsystem > that's designed to save power. That's just stupid spin - stop wasting our time. The userspace interface to this driver changed in an incompatible fashion. This can lead to failure (and hence premature termination) of userspace configuration scripts and in this case (at least) it has led to CPU overheating. Please fix this driver so that existing userspace will continue to function in an unchanged manner. Reply-To: mjg@redhat.com On Fri, Mar 06, 2009 at 07:07:59PM -0800, Andrew Morton wrote: > On Sat, 7 Mar 2009 02:57:43 +0000 Matthew Garrett <mjg@redhat.com> wrote: > > > On Fri, Mar 06, 2009 at 06:50:42PM -0800, Andrew Morton wrote: > > > > > eh? So we deliberately added a regression? > > > > No, we removed functionality that doesn't save power from a subsystem > > that's designed to save power. > > That's just stupid spin - stop wasting our time. No, really. > The userspace interface to this driver changed in an incompatible > fashion. This can lead to failure (and hence premature termination) of > userspace configuration scripts and in this case (at least) it has led > to CPU overheating. Every single case of userspace using this interface is a bug. It's either wasting power or unnecessarily reducing performance. Reply-To: akpm@linux-foundation.org On Sat, 7 Mar 2009 03:16:29 +0000 Matthew Garrett <mjg@redhat.com> wrote: > > The userspace interface to this driver changed in an incompatible > > fashion. This can lead to failure (and hence premature termination) of > > userspace configuration scripts and in this case (at least) it has led > > to CPU overheating. 
> > Every single case of userspace using this interface is a bug. It's > either wasting power or unnecessarily reducing performance. I repeat: The userspace interface to this driver changed in an incompatible fashion. This can lead to failure (and hence premature termination) of userspace configuration scripts and in this case (at least) it has led to CPU overheating. I can promise you that if the changelog to e088e4c9cdb618675874becb91b2fd581ee707e6 had included the text "this patch will break existing scripts and will lead to CPU overheating" then it would not have been applied. Look, there are better ways of fixing things like this. Revert the patch, add some noisy printks (triggered by use of the sysfs interface) telling people that they are doing the wrong thing and telling them how to fix it. After 6-9 months, then we can make the kernel interface change. We shouldn't just rip the thing out without any warning and break stuff. Reply-To: mjg@redhat.com The low-level cpufreq drivers have no idea whether a speed request originated from userspace or the kernel, so we'd need to either special case p4-clockmod in the core or add an argument that everything other than p4-clockmod ignores. Or we could figure out why this computer overheats and fix that bug. Reply-To: akpm@linux-foundation.org On Sat, 7 Mar 2009 03:45:12 +0000 Matthew Garrett <mjg@redhat.com> wrote: > The low-level cpufreq drivers have no idea whether a speed request > originated from userspace or the kernel, so we'd need to either special > case p4-clockmod in the core or add an argument that everything other > than p4-clockmod ignores. Or we could figure out why this computer > overheats and fix that bug. Please stop deleting and then ignoring everything I say. The ONLY way of fully repairing this regression is to restore the sysfs files, and their 2.6.28 functionality. 
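The failure mode described above (configuration scripts terminating because the documented cpufreq files are gone) can be illustrated with a small userspace check. This is only a sketch under stated assumptions: the helper names `missing_cpufreq_files` and `list_cpufreq_dir` are hypothetical, the expected names are the ones the reporter cites from Documentation/cpu-freq/user-guide.txt (his list names scaling_driver twice, so there are 12 distinct names), and the sample listing is the `ls` output from the report.

```python
import os

# File names the reporter cites from Documentation/cpu-freq/user-guide.txt.
# His list mentions scaling_driver twice, so 12 distinct names remain.
EXPECTED = (
    "cpuinfo_min_freq",
    "cpuinfo_max_freq",
    "scaling_driver",
    "scaling_available_governors",
    "scaling_governor",
    "cpuinfo_cur_freq",
    "scaling_available_frequencies",
    "scaling_min_freq",
    "scaling_max_freq",
    "affected_cpus",
    "related_cpus",
    "scaling_cur_freq",
)

CPUFREQ_DIR = "/sys/devices/system/cpu/cpu0/cpufreq"


def missing_cpufreq_files(present):
    """Return the documented files absent from `present`, in documented order."""
    present = set(present)
    return [name for name in EXPECTED if name not in present]


def list_cpufreq_dir(path=CPUFREQ_DIR):
    """List the cpufreq directory, or return [] if it cannot be read."""
    try:
        return os.listdir(path)
    except OSError:
        return []


# Listing from the report on 2.6.29-rc7: only two entries were present.
reported = ["conservative", "stats"]
missing = missing_cpufreq_files(reported)
print(len(missing), "documented files missing:", ", ".join(missing))
```

A script written this way could call `missing_cpufreq_files(list_cpufreq_dir())` and log a warning instead of aborting when a file such as scaling_governor is absent.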
Reply-To: mjg@redhat.com On Fri, Mar 06, 2009 at 08:27:56PM -0800, Andrew Morton wrote: > On Sat, 7 Mar 2009 03:45:12 +0000 Matthew Garrett <mjg@redhat.com> wrote: > > > The low-level cpufreq drivers have no idea whether a speed request > > originated from userspace or the kernel, so we'd need to either special > > case p4-clockmod in the core or add an argument that everything other > > than p4-clockmod ignores. Or we could figure out why this computer > > overheats and fix that bug. > > Please stop deleting and then ignoring everything I say. > > The ONLY way of fully repairing this regression is to restore the sysfs > files, and their 2.6.28 functionality. Resulting in computers that run slower and consume more power. "My script that does something stupid now gives an error" isn't a regression. "My computer now overheats" is a bug that was being hidden in the first place. Why don't we just fix that bug? Reply-To: akpm@linux-foundation.org On Sat, 7 Mar 2009 04:38:57 +0000 Matthew Garrett <mjg@redhat.com> wrote: > On Fri, Mar 06, 2009 at 08:27:56PM -0800, Andrew Morton wrote: > > On Sat, 7 Mar 2009 03:45:12 +0000 Matthew Garrett <mjg@redhat.com> wrote: > > > > > The low-level cpufreq drivers have no idea whether a speed request > > > originated from userspace or the kernel, so we'd need to either special > > > case p4-clockmod in the core or add an argument that everything other > > > than p4-clockmod ignores. Or we could figure out why this computer > > > overheats and fix that bug. > > > > Please stop deleting and then ignoring everything I say. > > > > The ONLY way of fully repairing this regression is to restore the sysfs > > files, and their 2.6.28 functionality. > > Resulting in computers that run slower and consume more power. "My > script that does something stupid now gives an error" isn't a > regression. "My computer now overheats" is a bug that was being hidden > in the first place. Why don't we just fix that bug? 
Do the thing which I suggested, and which you deleted without comment. Or something else! Just don't break existing userspace code, and people's computers. I'm sure you can manage this. Reply-To: mjg@redhat.com On Fri, Mar 06, 2009 at 08:46:07PM -0800, Andrew Morton wrote: > On Sat, 7 Mar 2009 04:38:57 +0000 Matthew Garrett <mjg@redhat.com> wrote: > > Resulting in computers that run slower and consume more power. "My > > script that does something stupid now gives an error" isn't a > > regression. "My computer now overheats" is a bug that was being hidden > > in the first place. Why don't we just fix that bug? > > Do the thing which I suggested, and which you deleted without comment. As I said, we can't print a warning without either special casing p4-clockmod in the core or adding code to every driver that will only be relevant for p4-clockmod. It also means another 6 months of computers running slower and consuming more power. > Or something else! Just don't break existing userspace code, and > people's computers. I'm sure you can manage this. I'd be thrilled to avoid fixing people's computers, but that means they need to report the bug about their machine overheating. Breaking code that is doing something actively harmful is a feature rather than a bug, so I'm less concerned about that. Reply-To: akpm@linux-foundation.org On Sat, 7 Mar 2009 05:18:52 +0000 Matthew Garrett <mjg@redhat.com> wrote: > On Fri, Mar 06, 2009 at 08:46:07PM -0800, Andrew Morton wrote: > > On Sat, 7 Mar 2009 04:38:57 +0000 Matthew Garrett <mjg@redhat.com> wrote: > > > Resulting in computers that run slower and consume more power. "My > > > script that does something stupid now gives an error" isn't a > > > regression. "My computer now overheats" is a bug that was being hidden > > > in the first place. Why don't we just fix that bug? > > > > Do the thing which I suggested, and which you deleted without comment. 
> > As I said, we can't print a warning without either special casing > p4-clockmod in the core That patch already special-cased p4-clockmod in the core. > or adding code to every driver that will only be > relevant for p4-clockmod. It also means another 6 months of computers > running slower and consuming more power. Special-casing p4-clockmod will not affect other drivers. > > Or something else! Just don't break existing userspace code, and > > people's computers. I'm sure you can manage this. > > I'd be thrilled to avoid fixing people's computers, but that means they > need to report the bug about their machine overheating. Breaking code > that is doing something actively harmful is a feature rather than a bug, > so I'm less concerned about that. I don't know what that means. Please find a way to fix these regressions. On Sat, 2009-03-07 at 04:38 +0000, Matthew Garrett wrote: > On Fri, Mar 06, 2009 at 08:27:56PM -0800, Andrew Morton wrote: > > On Sat, 7 Mar 2009 03:45:12 +0000 Matthew Garrett <mjg@redhat.com> wrote: > > > > > The low-level cpufreq drivers have no idea whether a speed request > > > originated from userspace or the kernel, so we'd need to either special > > > case p4-clockmod in the core or add an argument that everything other > > > than p4-clockmod ignores. Or we could figure out why this computer > > > overheats and fix that bug. > > > > Please stop deleting and then ignoring everything I say. > > > > The ONLY way of fully repairing this regression is to restore the sysfs > > files, and their 2.6.28 functionality. > > Resulting in computers that run slower and consume more power. "My > script that does something stupid now gives an error" isn't a > regression. "My computer now overheats" is a bug that was being hidden > in the first place. Why don't we just fix that bug? > AFAIU the system from Dave Jones post I've found today p4-clockmod is suppose to work with ACPI evenets. 
I've found out that:

# cat /proc/acpi/thermal_zone/THM0/polling_frequency
<polling disabled>
# cat /proc/acpi/thermal_zone/THM0/state
state: ok
# cat /proc/acpi/thermal_zone/THM0/temperature
temperature: 81 C
# cat /proc/acpi/thermal_zone/THM0/trip_points
critical (S5): 99 C
passive: 95 C: tc1=5 tc2=4 tsp=600 devices= CPU

The problem is that it seems that the system halts at about 95 C - at the
moment when cooling should be applied. This might be an ACPI/ibm_acpi
bug.

Also, in one comment[1] it has been mentioned that on Celeron M, which
lacks SpeedStep[2] and C-states[3], there are power savings when
p4-clockmod is enabled. So maybe p4-clockmod should be used fully as a
cpufreq backend only for Celeron M (with appropriate renaming and
description) - and only Celeron M - and for the rest it should warn/not
expose the sysfs interface?

Regards

[1] http://www.codemonkey.org.uk/2009/01/18/forthcoming-p4clockmod/
[2] Confirmed on http://en.wikipedia.org/w/index.php?title=Celeron&oldid=272955598#Mobile_Celeron_and_Celeron_M
[3] Powertop however reports different C-states on my computer. Maybe it
varies from core to core.

Reply-To: mjg@redhat.com

On Sat, Mar 07, 2009 at 12:09:29PM +0100, Maciej Piechotka wrote:

> # cat /proc/acpi/thermal_zone/THM0/trip_points
> critical (S5): 99 C
> passive: 95 C: tc1=5 tc2=4 tsp=600 devices= CPU
>
> The problem is that it seems that the system halts at about 95 C - in
> moment when cooling should be applied. This might be an ACPI/ibm_acpi
> bug.

Halts as in shuts down, or halts as in stops running? The R51e seems to
have the 32/64-bit ACPI address issue - can you try booting with
acpi=rsdt as a kernel argument and see whether it behaves any better?
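The thermal readings quoted above can be turned into numbers with a few lines of parsing, which makes the margins under discussion explicit. This is a minimal sketch that assumes the text layout is exactly what the report shows for /proc/acpi/thermal_zone/THM0; the function names are hypothetical and the sample strings are copied from the report rather than read from a live system.

```python
import re

# Sample text copied from the report (THM0 on the Thinkpad R51e).
TRIP_POINTS = """critical (S5): 99 C
passive: 95 C: tc1=5 tc2=4 tsp=600 devices= CPU
"""
TEMPERATURE = "temperature: 81 C"

# "<name>[ (detail)]: <degrees> C", as in the two trip point lines above.
_TRIP = re.compile(r"^(\w+)[^:]*:\s*(\d+)\s*C")


def parse_trip_points(text):
    """Map each trip point name to its temperature in degrees Celsius."""
    trips = {}
    for line in text.splitlines():
        match = _TRIP.match(line.strip())
        if match:
            trips[match.group(1)] = int(match.group(2))
    return trips


def parse_temperature(text):
    """Extract the integer reading from a 'temperature: NN C' line."""
    match = re.search(r"temperature:\s*(\d+)\s*C", text)
    if match is None:
        raise ValueError("unrecognised temperature line: %r" % text)
    return int(match.group(1))


trips = parse_trip_points(TRIP_POINTS)
current = parse_temperature(TEMPERATURE)
margin_to_passive = trips["passive"] - current
passive_to_critical = trips["critical"] - trips["passive"]
print("margin to passive trip:", margin_to_passive, "C")
print("passive to critical gap:", passive_to_critical, "C")
```

With the reported values the passive trip point sits only 4 C below the critical one, which fits the observation that the machine halts at about the temperature where passive cooling should start.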
On Saturday 07 March 2009 04:00:16 pm Matthew Garrett wrote: > On Sat, Mar 07, 2009 at 12:09:29PM +0100, Maciej Piechotka wrote: > > # cat /proc/acpi/thermal_zone/THM0/trip_points > > critical (S5): 99 C > > passive: 95 C: tc1=5 tc2=4 tsp=600 devices= CPU > > > > The problem is that it seems that the system halts at about 95 C - in > > moment when cooling should be applied. This might be an ACPI/ibm_acpi > > bug. > > Halts as in shuts down, or halts as in stops running? The R51e seems to > have the 32/64-bit ACPI address issue - can you try booting with > acpi=rsdt as a kernel argument and see whether it behaves any better? Yes it has. I posted patches to fix this about three times, first time probably more than a year ago, last one should be these (on the acpi list, easy to google): [PATCH 1/3] ACPICA: Add acpi_gbl_force_rsdt variable [PATCH 2/3] Introduce acpi_root_table=rsdt boot param and dmi list to force rsdt [PATCH 3/3] Remove R40e c-state blacklist wrong the last time I sent them was on 19th of Oct. 2008: [RESEND] [PATCH 0/3] Blacklist broken ThinkPads to use 32 bit FADT addresses [RESEND] [PATCH 1/3] ACPICA: Add acpi_gbl_force_rsdt variable [RESEND] [PATCH 2/3] Introduce acpi_root_table=rsdt boot param and dmi list to force rsdt [RESEND] [PATCH 3/3] Remove R40e c-state blacklist No we have: - SUSE and mainline kernels are out of sync with boot params, acpi_root_table=rsdt vs acpi=rsdt - R51e and R40e are broken in mainline for years, even we know why and they still are While I agree with Len that we should try to find the root cause, such issues should get blacklisted first until the root cause has been found or we start debugging the same issues over and over again and loose an overview about which machines/BIOSes are broken. 
Thomas On Sat, 2009-03-07 at 17:33 +0100, Thomas Renninger wrote: > On Saturday 07 March 2009 04:00:16 pm Matthew Garrett wrote: > > On Sat, Mar 07, 2009 at 12:09:29PM +0100, Maciej Piechotka wrote: > > > # cat /proc/acpi/thermal_zone/THM0/trip_points > > > critical (S5): 99 C > > > passive: 95 C: tc1=5 tc2=4 tsp=600 devices= CPU > > > > > > The problem is that it seems that the system halts at about 95 C - in > > > moment when cooling should be applied. This might be an ACPI/ibm_acpi > > > bug. > > > > Halts as in shuts down, or halts as in stops running? Halts as in shut down. As after halt command but without syncing disks etc. > > The R51e seems to > > have the 32/64-bit ACPI address issue - can you try booting with > > acpi=rsdt as a kernel argument and see whether it behaves any better? > > Yes it has. > I posted patches to fix this about three times, first time probably more than > a year ago, last one should be these (on the acpi list, easy to google): > > [PATCH 1/3] ACPICA: Add acpi_gbl_force_rsdt variable > [PATCH 2/3] Introduce acpi_root_table=rsdt boot param and dmi list to force > rsdt > [PATCH 3/3] Remove R40e c-state blacklist > > wrong the last time I sent them was on 19th of Oct. 2008: > [RESEND] [PATCH 0/3] Blacklist broken ThinkPads to use 32 bit FADT addresses > [RESEND] [PATCH 1/3] ACPICA: Add acpi_gbl_force_rsdt variable > [RESEND] [PATCH 2/3] Introduce acpi_root_table=rsdt boot param and dmi list > to > force rsdt > [RESEND] [PATCH 3/3] Remove R40e c-state blacklist > > No we have: > - SUSE and mainline kernels are out of sync with boot params, > acpi_root_table=rsdt vs acpi=rsdt > - R51e and R40e are broken in mainline for years, even we know why and they > still are > > While I agree with Len that we should try to find the root cause, such issues > should get blacklisted first until the root cause has been found or we start > debugging the same issues over and over again and loose an overview about > which machines/BIOSes are broken. 
> > Thomas Unfortunately those patches cannot be applied to current kernel: Applying: ACPICA: Add acpi_gbl_force_rsdt variable error: drivers/acpi/tables/tbutils.c: does not exist in index error: drivers/acpi/utilities/utglobal.c: does not exist in index error: include/acpi/acglobal.h: does not exist in index Patch failed at 0001 ACPICA: Add acpi_gbl_force_rsdt variable I'll try the boot option. Regards Some comments...: On Saturday 07 March 2009 06:39:40 am Andrew Morton wrote: > On Sat, 7 Mar 2009 05:18:52 +0000 Matthew Garrett <mjg@redhat.com> wrote: > > On Fri, Mar 06, 2009 at 08:46:07PM -0800, Andrew Morton wrote: > > > On Sat, 7 Mar 2009 04:38:57 +0000 Matthew Garrett <mjg@redhat.com> wrote: > > > > Resulting in computers that run slower and consume more power. "My > > > > script that does something stupid now gives an error" isn't a > > > > regression. "My computer now overheats" is a bug that was being > > > > hidden in the first place. Why don't we just fix that bug? > > > > > > Do the thing which I suggested, and which you deleted without comment. > > > > As I said, we can't print a warning without either special casing > > p4-clockmod in the core > > That patch already special-cased p4-clockmod in the core. > > > or adding code to every driver that will only be > > relevant for p4-clockmod. It also means another 6 months of computers > > running slower and consuming more power. > > Special-casing p4-clockmod will not affect other drivers. > > > > Or something else! Just don't break existing userspace code, and > > > people's computers. I'm sure you can manage this. Any modification on the p4-clockmod driver is not much worth it as it is broken by design. If you really touch it, please do it properly: - move it to drivers/platforms/x86/native_throttling.c - let it register at the generic thermal zone code as a cooling device - make sure it only loads if acpi throttling does not get activated This is what you want. 
Going the "a cpufreq driver automatically is a cooling device, let's use
it" and then cut out the cpufreq capabilities:

---------
[CPUFREQ] Prevent p4-clockmod from auto-binding to the ondemand governor.

The latency of p4-clockmod sucks so hard that scaling on a regular basis
with ondemand is a really bad idea.

Signed-off-by: Matthew Garrett <mjg59@srcf.ucam.org>
---------

is making things worse not better.

Until this happened, please mark the p4-clockmod driver experimental,
throw out all references to the cpufreq list and best state at load time
that it should only be used on broken BIOSes.

Thomas

On Sat, 2009-03-07 at 15:00 +0000, Matthew Garrett wrote:
> On Sat, Mar 07, 2009 at 12:09:29PM +0100, Maciej Piechotka wrote:
>
> > # cat /proc/acpi/thermal_zone/THM0/trip_points
> > critical (S5): 99 C
> > passive: 95 C: tc1=5 tc2=4 tsp=600 devices= CPU
> >
> > The problem is that it seems that the system halts at about 95 C - in
> > moment when cooling should be applied. This might be an ACPI/ibm_acpi
> > bug.
>
> Halts as in shuts down, or halts as in stops running? The R51e seems to
> have the 32/64-bit ACPI address issue - can you try booting with
> acpi=rsdt as a kernel argument and see whether it behaves any better?
>
Somehow. It keeps the system at 70-80 C but on 2.6.29 it freezes the
computer when I log into Gnome (it seems that gdm is not enough and it
occurs after a few minutes) - i.e. I cannot move the pointer, change VT
nor even ping the system (I use the in-kernel radeon driver and a Radeon
Xpress 200M card RC410).
Regards
Reply-To: mjg@redhat.com

On Sun, Mar 08, 2009 at 02:06:33AM +0100, Maciej Piechotka wrote:
> Somehow. It keeps system at 70-80 C but on 2.6.29 it freezes the
> computer when I log into Gnome (it seems that gdm is not enough and it
> occurs after a few minutes) - i.e. I cannot move pointer, change VT nor
> even ping the system (I use in-kernel radeon driver and Radeon Xpress
> 200M card RC410).

Ok. That sounds like a separate bug. Can you try with this patch and no
kernel argument?

commit 546be50e225261e8379731008cdfec336348f048
Author: Matthew Garrett <mjg@redhat.com>
Date:   Sun Mar 8 01:34:03 2009 +0000

    Use 32-bit FADT values on X86

    The ACPI specification says that we should use the 64-bit address
    offsets contained within the FADT if they exist. However, Windows uses
    the legacy address. Various vendors have left incorrect values in the
    64-bit field which then causes problems later. Since the vast majority
    of machines have never been tested with an OS that uses the 64-bit
    value by default, we should behave like Windows and ignore the spec by
    only using the 64-bit address if it contains something that can't be
    represented in the legacy field. Since system io space is only 16 bits
    on x86, this should be entirely safe.

diff --git a/drivers/acpi/acpica/tbfadt.c b/drivers/acpi/acpica/tbfadt.c
index 3636e4f..ad0e858 100644
--- a/drivers/acpi/acpica/tbfadt.c
+++ b/drivers/acpi/acpica/tbfadt.c
@@ -361,9 +361,28 @@ static void acpi_tb_convert_fadt(void)
             ACPI_ADD_PTR(struct acpi_generic_address, &acpi_gbl_FADT,
                          fadt_info_table[i].address64);

-        /* Expand only if the 64-bit X target is null */
+        /*
+         * The ACPI specification says that we should use the
+         * 64-bit address offsets if they exists. However,
+         * Windows uses the legacy address. Various vendors
+         * have left incorrect values in the 64-bit field,
+         * which then causes problems later. Since the vast
+         * majority of machines have never been tested with an
+         * OS that uses the 64-bit value by default, we should
+         * behave like Windows and ignore the spec by only
+         * using the 64-bit address if it contains something
+         * that can't be represented in the legacy
+         * field. Since system io space is only 16 bits on
+         * x86, this should be entirely safe. We also extend
+         * the 32-bit value into the 64-bit one if no 64-bit
+         * address is provided.
+         */

-        if (!target64->address) {
+        if (!target64->address
+#ifdef CONFIG_X86
+            || (target64->space_id == ACPI_ADR_SPACE_SYSTEM_IO)
+#endif
+            ) {

             /* The space_id is always I/O for the 32-bit legacy address fields */

On Sun, 2009-03-08 at 01:37 +0000, Matthew Garrett wrote:
> On Sun, Mar 08, 2009 at 02:06:33AM +0100, Maciej Piechotka wrote:
>
> > Somehow. It keeps system at 70-80 C

And finally it turns off the computer, so I guess it only slowed the
process down.

> > but on 2.6.29 it freezes the
> > computer when I log into Gnome (it seems that gdm is not enough and it
> > occurs after a few minutes) - i.e. I cannot move pointer, change VT nor
> > even ping the system (I use in-kernel radeon driver and Radeon Xpress
> > 200M card RC410).
>
> Ok. That sounds like a separate bug. Can you try with this patch and no
> kernel argument?

The same problems.

Regards

Reply-To: mjg@redhat.com

On Sun, Mar 08, 2009 at 04:29:44AM +0100, Maciej Piechotka wrote:
> On Sun, 2009-03-08 at 01:37 +0000, Matthew Garrett wrote:
> > On Sun, Mar 08, 2009 at 02:06:33AM +0100, Maciej Piechotka wrote:
> >
> > > Somehow. It keeps system at 70-80 C
>
> And finally it turns off the computer, so I guess it only slowed the
> process down.

Interesting. The fact that it's shutting down hard is an indication that
the embedded controller is performing the shutdown and not the OS. I've
got a similar issue on a machine here - let me look into it.

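For readers following the 32/64-bit discussion, the selection rule that the patch quoted earlier in this exchange describes can be sketched in user-space C. This is only an illustration, not the ACPICA code: the `sketch_` types and functions are hypothetical stand-ins for `struct acpi_generic_address` and the logic in `acpi_tb_convert_fadt()`, and the `on_x86` flag is an assumed replacement for the `CONFIG_X86` compile-time check.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical stand-ins for the ACPI address space IDs (illustration only). */
enum sketch_space_id {
    SKETCH_SPACE_SYSTEM_MEMORY = 0,
    SKETCH_SPACE_SYSTEM_IO = 1
};

/* Simplified, hypothetical stand-in for struct acpi_generic_address. */
struct sketch_gas {
    uint8_t space_id;
    uint64_t address;
};

/*
 * Returns true when the legacy 32-bit FADT field should be used:
 * either no 64-bit address was provided, or (on x86 only) the 64-bit
 * entry points into system I/O space, which fits in the legacy field.
 */
static bool sketch_use_legacy_address(const struct sketch_gas *target64,
                                      bool on_x86)
{
    if (target64->address == 0)
        return true;
    if (on_x86 && target64->space_id == SKETCH_SPACE_SYSTEM_IO)
        return true;
    return false;
}

/* Picks the address that would end up being used for a register block. */
static uint64_t sketch_effective_address(const struct sketch_gas *target64,
                                         uint32_t legacy32, bool on_x86)
{
    if (sketch_use_legacy_address(target64, on_x86))
        return (uint64_t)legacy32;
    return target64->address;
}
```

With this rule, a firmware table that carries a bogus 64-bit I/O address next to a valid legacy one resolves to the legacy value on x86, while a 64-bit address in memory space is left alone.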
On Sunday 08 March 2009 02:06:33 am Maciej Piechotka wrote:
> On Sat, 2009-03-07 at 15:00 +0000, Matthew Garrett wrote:
> > On Sat, Mar 07, 2009 at 12:09:29PM +0100, Maciej Piechotka wrote:
> > > # cat /proc/acpi/thermal_zone/THM0/trip_points
> > > critical (S5): 99 C
> > > passive: 95 C: tc1=5 tc2=4 tsp=600 devices= CPU
> > >
> > > The problem is that it seems that the system halts at about 95 C - in
> > > moment when cooling should be applied. This might be an ACPI/ibm_acpi
> > > bug.
> >
> > Halts as in shuts down, or halts as in stops running? The R51e seems to
> > have the 32/64-bit ACPI address issue - can you try booting with
> > acpi=rsdt as a kernel argument and see whether it behaves any better?
>
> Somehow. It keeps system at 70-80 C but on 2.6.29 it freezes the
> computer when I log into Gnome (it seems that gdm is not enough and it
> occurs after a few minutes) - i.e. I cannot move pointer, change VT nor
> even ping the system (I use in-kernel radeon driver and Radeon Xpress
> 200M card RC410).

Hmm, I expect no PowerPlay things are implemented in the in-kernel radeon
implementation yet?
fglrx and aticonfig --list-powerstates and aticonfig --set-powerstate X
could help a lot.
This could at least explain the high temperature rates.
Like that you could find out how much could be saved by graphics power
savings.

> i.e. I cannot move pointer, change VT nor
> even ping the system (I use in-kernel radeon driver and Radeon Xpress
> 200M card RC410).

Sounds like the in-kernel radeon driver is not that stable yet?

Yang Zhao once implemented PowerPlay support in radeonhd userspace:
http://yangman.ca/git/xf86-video-radeonhd.git
diffing it with the official radeonhd driver and implementing the same in
the kernel's radeon driver might be necessary at some time?
I once played a bit with it and I could imagine as both should be atombios
based, it shouldn't be that hard to port and it got already some testing on
specific HW.

Matthew: Is it planned to add PowerPlay support to the radeon in-kernel
driver at some time? If yes, I can help a bit.

Hm, but all this has nothing to do with cpufreq which a Celeron is not
capable of.

Thomas

On Sun, 2009-03-08 at 08:01 +0100, Thomas Renninger wrote:
> On Sunday 08 March 2009 02:06:33 am Maciej Piechotka wrote:
> > On Sat, 2009-03-07 at 15:00 +0000, Matthew Garrett wrote:
> > > On Sat, Mar 07, 2009 at 12:09:29PM +0100, Maciej Piechotka wrote:
> > > > # cat /proc/acpi/thermal_zone/THM0/trip_points
> > > > critical (S5): 99 C
> > > > passive: 95 C: tc1=5 tc2=4 tsp=600 devices= CPU
> > > >
> > > > The problem is that it seems that the system halts at about 95 C - in
> > > > moment when cooling should be applied. This might be an ACPI/ibm_acpi
> > > > bug.
> > >
> > > Halts as in shuts down, or halts as in stops running? The R51e seems to
> > > have the 32/64-bit ACPI address issue - can you try booting with
> > > acpi=rsdt as a kernel argument and see whether it behaves any better?
> >
> > Somehow. It keeps system at 70-80 C but on 2.6.29 it freezes the
> > computer when I log into Gnome (it seems that gdm is not enough and it
> > occurs after a few minutes) - i.e. I cannot move pointer, change VT nor
> > even ping the system (I use in-kernel radeon driver and Radeon Xpress
> > 200M card RC410).
>
> Hmm, I expect no PowerPlay things are implemented in the in-kernel radeon
> implementation yet?
> fglrx and aticonfig --list-powerstates and aticonfig --set-powerstate X
> could help a lot.
> This could at least explain the high temperature rates.

If it might help. GPU temperature is lower than CPU. However it prevents
cooling as checked with rovclock. However AFAIR change 300 -> 50 is not
sufficient.

> Like that you could find out how much could be saved by graphics power
> savings.
>

Ok. I'll check.

>
> Hm, but all this has nothing to do with cpufreq which a Celeron is not
> capable
> of.
>

Should I move and rename the bug?
Where should it go (ACPI - Power-Other? Power Management - Other?)

Regards

> Should I move and rename the bug?

The 32 vs 64 bit is a duplicate of this one:
[Bug 8246] 32/64X address mismatch in "Gpe0Block" - IBM Thinkpad R51e
It is set to resolved because a boot param was added, which is IMO not
sufficient. But it's hard to convince Len to add dmi blacklists...

You might then want to recheck about:
> i.e. I cannot move pointer, change VT nor even ping the system
or other bugs which might be follow-ups, but are more likely independent.
You want to open new bugs for unrelated things.

> Should I move and rename the bug?

Hmm, the bug is valid IMO, I'd keep it open.
As long as the p4-clockmod is a cpufreq driver and not explicitly stated
broken or tainting the kernel (and even then) it must provide the sysfs
cpufreq interface userspace programs rely on, like every other cpufreq
driver does.

Fixed by commit 129f8ae9b1b5be94517da76009ea956e89104ce8 .

This bug was for the p4-clockmod API regression.
Per comment #30, that is reverted, and so this entry
should remain closed.

Thomas and Matthew are right, however, that the interesting
problem is why the r51e needs p4-clockmod to work around the
real bug -- that the r51e is over-heating.
If we can fix that, perhaps
we can finally clear the way to delete p4-clockmod...

Maciej, can you file a new bug against r51e over-heating?

re: RSDT
note that 2.6.29 shipped with "acpi=rsdt" -- so you
can test if that helps w/o needing to patch your kernel.

You'll probably get asked if windows over-heats on the
same laptop, and it would be ideal if you could try that.

Also, there are some more 32 vs 64-bit fixes from ACPICA
that are staged for 2.6.30-rc1
that may address that issue w/o a bootparam.
We can discuss those in the new bug report.

(In reply to comment #31)
> This bug was for the p4-clockmod API regression.
> Per comment #30, that is reverted, and so this entry
> should remain closed.
>
> Thomas and Matthew are right, however, that the interesting
> problem is why the r51e needs p4-clockmod to work around the
> real bug -- that the r51e is over-heating.
> If we can fix that, perhaps
> we can finally clear the way to delete p4-clockmod...
>
> Maciej, can you file a new bug against r51e over-heating?
> re: RSDT
> note that 2.6.29 shipped with "acpi=rsdt" -- so you
> can test if that helps w/o needing to patch your kernel.
>
> You'll probably get asked if windows over-heats on the
> same laptop, and it would be ideal if you could try that.
>

It was a long time ago that I used Windows on this computer. AFAIR it did
not overheat but I'll check it.

> Also, there are some more 32 vs 64-bit fixes from ACPICA
> that are staged for 2.6.30-rc1
> that may address that issue w/o a bootparam.
> We can discuss those in the new bug report.

Against what product should I file it?

Len, I don't think we should remove p4-clockmod at all for as long as we
support boxes that can use it. It clearly works well to avoid worse damage
on boxes that are suffering some sort of thermal trouble (it really doesn't
matter much why the box is overheating, that's orthogonal to the issue at
hand). That's a very good thing. It is a proven safety net.

Now, maybe thermal throttling should be made more obnoxious when it
activates, to make it clear to the user that the box is NOT operating under
normal conditions. Suitably informative rate-limited KERN_CRIT level
messages, for example (limit to once every hour, maybe?).
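
The "once every hour" rate limit suggested above could look roughly like the following user-space C sketch. It is not kernel code and not taken from any driver: the `sketch_` names, the message text and the use of `time_t` are assumptions made for illustration, and an in-kernel version would print at KERN_CRIT using the kernel's own timekeeping instead.

```c
#include <stdbool.h>
#include <stdio.h>
#include <time.h>

/* Assumed interval from the suggestion above: one message per hour. */
#define SKETCH_WARN_INTERVAL_SEC 3600

/* Hypothetical per-CPU (or per-zone) rate-limit state. */
struct sketch_ratelimit {
    time_t last_emit;   /* time of the last message that was printed */
    bool emitted_once;  /* false until the first message has gone out */
};

/*
 * Returns true if a warning may be printed at time "now", and records
 * that time. Suppressed calls do not push the next allowed time back.
 */
static bool sketch_should_warn(struct sketch_ratelimit *rl, time_t now)
{
    if (rl->emitted_once &&
        difftime(now, rl->last_emit) < SKETCH_WARN_INTERVAL_SEC)
        return false;
    rl->last_emit = now;
    rl->emitted_once = true;
    return true;
}

/* Prints a critical-level style message, at most once per interval. */
static void sketch_report_throttle(struct sketch_ratelimit *rl, time_t now,
                                   int temp_c)
{
    if (sketch_should_warn(rl, now))
        fprintf(stderr,
                "CRITICAL: CPU thermal throttling active at %d C\n",
                temp_c);
}
```

Keeping the state in a small struct rather than a global makes it possible to rate-limit each CPU or thermal zone separately, which is one possible reading of the proposal.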