Bug 42725 - bogus passive trip point throttles the processors - Asus N51V laptop
Summary: bogus passive trip point throttles the processors - Asus N51V laptop
Status: CLOSED INVALID
Alias: None
Product: ACPI
Classification: Unclassified
Component: Power-Thermal
Hardware: All Linux
Importance: P1 high
Assignee: Zhang Rui
URL:
Keywords:
Depends on:
Blocks:
 
Reported: 2012-02-04 01:51 UTC by Nate Cornell
Modified: 2013-03-11 02:04 UTC
CC List: 13 users

See Also:
Kernel Version: 2.6.38-8 - Current
Subsystem:
Regression: Yes
Bisected commit-id:


Attachments
contents of /proc/acpi/event and wakeup (764 bytes, text/plain)
2012-02-04 01:53 UTC, Nate Cornell
Battery (671 bytes, text/plain)
2012-02-04 01:54 UTC, Nate Cornell
output of lspci -vvv (40.93 KB, text/plain)
2012-02-04 01:54 UTC, Nate Cornell
contents of /proc/iomem (2.59 KB, text/plain)
2012-02-04 01:55 UTC, Nate Cornell
Contents of /proc/ioports (1.45 KB, text/plain)
2012-02-04 01:56 UTC, Nate Cornell
contents of /proc/modules (3.92 KB, text/plain)
2012-02-04 01:56 UTC, Nate Cornell
Contents of /proc/cpuinfo (1.52 KB, text/plain)
2012-02-04 01:57 UTC, Nate Cornell
Output of ver_linux script (1.48 KB, text/plain)
2012-02-04 01:58 UTC, Nate Cornell
Output of uname -a on distribution where problem was discovered. (145 bytes, text/plain)
2012-02-04 01:58 UTC, Nate Cornell
Output of dmesg after booting to 2.6.38-8 with nohz=off (64.76 KB, text/x-log)
2012-07-07 20:15 UTC, Nate Cornell
dmesg and timer_list from 2.6.38-8 after booting with nohz=off and hpet=off (38.72 KB, application/zip)
2012-07-07 20:19 UTC, Nate Cornell
cpufreq thermal and acpidump from 2.6.35-32 (74.43 KB, application/zip)
2012-07-12 02:16 UTC, Nate Cornell
debug patch to check why passive trip point is not valid (2.82 KB, patch)
2012-07-25 03:25 UTC, Zhang Rui
Output of patch command (2.51 KB, text/plain)
2012-07-26 02:07 UTC, Nate Cornell
Dmesg and timer_list output good and bad kernels for Feng (57.44 KB, application/zip)
2012-07-26 13:14 UTC, Nate Cornell
dmesg output after applying zangu rui's patch to 2.6.38 (16.39 KB, application/x-gzip)
2012-08-01 11:34 UTC, Carles
refreshed debug patch (4.72 KB, patch)
2012-08-15 06:41 UTC, Zhang Rui
dmesg output of the refreshed patch as requested by Zhang Rui (17.25 KB, application/x-gzip)
2012-08-16 20:02 UTC, Carles
dmesg output of v2.6.35 with Zhang Rui's patch (16.05 KB, application/binary)
2012-10-24 18:15 UTC, Carles
customized DSDT (449.72 KB, application/octet-stream)
2012-11-07 03:00 UTC, Zhang Rui
dmesg output on 3.2.0-33 after DSDT patch (69.29 KB, application/octet-stream)
2012-11-15 07:43 UTC, Nate Cornell
output of thermal grep on 3.2.0-33 after DSDT patch (900 bytes, application/octet-stream)
2012-11-15 07:46 UTC, Nate Cornell
dmesg output on 3.2.0-33 after DSDT patch with CONFIG_ACPI_DEBUG=y (71.21 KB, text/plain)
2012-11-16 03:41 UTC, Nate Cornell
grep of thermal on 3.2.0-33 after DSDT patch with CONFIG_ACPI_DEBUG=y (900 bytes, text/plain)
2012-11-16 03:44 UTC, Nate Cornell
custom DSDT (449.72 KB, application/octet-stream)
2012-11-28 10:17 UTC, Zhang Rui
Dmesg and thermal output of the patch found in #100 (15.69 KB, application/x-gzip)
2012-11-30 01:14 UTC, Carles

Description Nate Cornell 2012-02-04 01:51:59 UTC
When using any recent *ubuntu, Gentoo or Fedora releases with a 2.6.38-8+ kernel on an Asus N51V laptop, the entire system including the console is incredibly slow, almost to the point of inoperability.
"top" reports that most processes take almost 100% CPU.
The entire system is very slow at responding to even the most basic shell commands.

This has been tracked in the Ubuntu Launchpad under bug #793437 (https://bugs.launchpad.net/ubuntu/+source/linux/+bug/793437), but they have been unable to reach a solution.

Affects:
The problem affects all "stable" kernels since 2.6.38-8, although I personally have not tested any past 3.2.0-12. A few 2.6.39 merges/commits were tested in the previously mentioned Launchpad bug report; many of them failed.

Workarounds:
Booting with the "acpi=off" flag seems to eliminate the problem, but this is not ideal for a laptop. Some users have reported that booting, entering sleep mode, then waking the laptop also solves the problem; but it returns after rebooting.

Steps to reproduce the problem:
1) Boot any *Ubuntu, Gentoo, or Fedora distribution with 2.6.38-8+ kernel.
2) Attempt to use the system.
Comment 1 Nate Cornell 2012-02-04 01:53:22 UTC
Created attachment 72266 [details]
contents of /proc/acpi/event and wakeup
Comment 2 Nate Cornell 2012-02-04 01:54:23 UTC
Created attachment 72267 [details]
Battery
Comment 3 Nate Cornell 2012-02-04 01:54:58 UTC
Created attachment 72268 [details]
output of lspci -vvv
Comment 4 Nate Cornell 2012-02-04 01:55:34 UTC
Created attachment 72269 [details]
contents of /proc/iomem
Comment 5 Nate Cornell 2012-02-04 01:56:09 UTC
Created attachment 72270 [details]
Contents of /proc/ioports
Comment 6 Nate Cornell 2012-02-04 01:56:45 UTC
Created attachment 72271 [details]
contents of /proc/modules
Comment 7 Nate Cornell 2012-02-04 01:57:13 UTC
Created attachment 72272 [details]
Contents of /proc/cpuinfo
Comment 8 Nate Cornell 2012-02-04 01:58:03 UTC
Created attachment 72273 [details]
Output of ver_linux script
Comment 9 Nate Cornell 2012-02-04 01:58:53 UTC
Created attachment 72274 [details]
Output of uname -a on distribution where problem was discovered.
Comment 10 Len Brown 2012-02-07 03:51:58 UTC
please show the output from...

grep . /sys/class/thermal/*/*

which will tell us if you are seeing a thermal throttling issue.

Let's also see what cpufreq is doing:

grep . /sys/devices/system/cpu/cpu*/cpufreq/*

Has any version of Linux worked properly on this system?
Can you hear the fan operating?  Does Windows do
the same thing, or different?
Comment 11 Nate Cornell 2012-02-07 04:42:38 UTC
(In reply to comment #10)
> please show the output from...
> 
> grep . /sys/class/thermal/*/*

/sys/class/thermal/cooling_device0/cur_state:10
/sys/class/thermal/cooling_device0/max_state:10
/sys/class/thermal/cooling_device0/type:Processor
/sys/class/thermal/cooling_device1/cur_state:10
/sys/class/thermal/cooling_device1/max_state:10
/sys/class/thermal/cooling_device1/type:Processor
/sys/class/thermal/cooling_device2/cur_state:0
/sys/class/thermal/cooling_device2/max_state:15
/sys/class/thermal/cooling_device2/type:LCD
/sys/class/thermal/thermal_zone0/cdev0_trip_point:1
/sys/class/thermal/thermal_zone0/cdev1_trip_point:1
/sys/class/thermal/thermal_zone0/mode:enabled
/sys/class/thermal/thermal_zone0/temp:39000
/sys/class/thermal/thermal_zone0/trip_point_0_temp:110000
/sys/class/thermal/thermal_zone0/trip_point_0_type:critical
/sys/class/thermal/thermal_zone0/trip_point_1_temp:0
/sys/class/thermal/thermal_zone0/trip_point_1_type:passive
/sys/class/thermal/thermal_zone0/type:acpitz

> 
> Lets also see what cpufreq is doing:
> 
> grep . /sys/devices/system/cpu/cpu*/cpufreq/*

/sys/devices/system/cpu/cpu0/cpufreq/affected_cpus:0
/sys/devices/system/cpu/cpu0/cpufreq/bios_limit:2534000
/sys/devices/system/cpu/cpu0/cpufreq/cpuinfo_cur_freq:800000
/sys/devices/system/cpu/cpu0/cpufreq/cpuinfo_max_freq:2534000
/sys/devices/system/cpu/cpu0/cpufreq/cpuinfo_min_freq:800000
/sys/devices/system/cpu/cpu0/cpufreq/cpuinfo_transition_latency:10000
/sys/devices/system/cpu/cpu0/cpufreq/related_cpus:0
/sys/devices/system/cpu/cpu0/cpufreq/scaling_available_frequencies:2534000 2533000 1600000 800000 
/sys/devices/system/cpu/cpu0/cpufreq/scaling_available_governors:conservative ondemand userspace powersave performance 
/sys/devices/system/cpu/cpu0/cpufreq/scaling_cur_freq:800000
/sys/devices/system/cpu/cpu0/cpufreq/scaling_driver:acpi-cpufreq
/sys/devices/system/cpu/cpu0/cpufreq/scaling_governor:ondemand
/sys/devices/system/cpu/cpu0/cpufreq/scaling_max_freq:1013600
/sys/devices/system/cpu/cpu0/cpufreq/scaling_min_freq:800000
/sys/devices/system/cpu/cpu0/cpufreq/scaling_setspeed:<unsupported>
/sys/devices/system/cpu/cpu1/cpufreq/affected_cpus:1
/sys/devices/system/cpu/cpu1/cpufreq/bios_limit:2534000
/sys/devices/system/cpu/cpu1/cpufreq/cpuinfo_cur_freq:800000
/sys/devices/system/cpu/cpu1/cpufreq/cpuinfo_max_freq:2534000
/sys/devices/system/cpu/cpu1/cpufreq/cpuinfo_min_freq:800000
/sys/devices/system/cpu/cpu1/cpufreq/cpuinfo_transition_latency:10000
/sys/devices/system/cpu/cpu1/cpufreq/related_cpus:1
/sys/devices/system/cpu/cpu1/cpufreq/scaling_available_frequencies:2534000 2533000 1600000 800000 
/sys/devices/system/cpu/cpu1/cpufreq/scaling_available_governors:conservative ondemand userspace powersave performance 
/sys/devices/system/cpu/cpu1/cpufreq/scaling_cur_freq:800000
/sys/devices/system/cpu/cpu1/cpufreq/scaling_driver:acpi-cpufreq
/sys/devices/system/cpu/cpu1/cpufreq/scaling_governor:ondemand
/sys/devices/system/cpu/cpu1/cpufreq/scaling_max_freq:1013600
/sys/devices/system/cpu/cpu1/cpufreq/scaling_min_freq:800000
/sys/devices/system/cpu/cpu1/cpufreq/scaling_setspeed:<unsupported>

> 
> Has any version of Linux worked properly on this system?

Every version I tried prior to 2.6.38-8 for the past 3 years has worked very well. I currently use Ubuntu 10.10 with a 2.6.35-31 kernel. The system runs very smoothly on this kernel.

> Can you hear the fan operating?

It does not speed up unusually when I boot to the affected kernels, if that's what you are asking. The core temperatures appear stable.

> Does Windows do the same thing, or different?

I haven't run Windows on this laptop in a very long time, but it was never as slow as these newer Linux kernels (as much as it shames me to say it). 

Thanks for the attention. This is a real show-stopper for folks with this model laptop.
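
The thermal output above shows a passive trip point of 0 while the zone temperature is 39000, with both Processor cooling devices at cur_state 10 of max_state 10 and scaling_max_freq capped at 1013600. That pattern can be checked for automatically. The following is only a minimal sketch, assuming the sysfs layout shown above; the function name and the optional directory argument are made up for illustration and are not part of any kernel tool.

```shell
#!/bin/sh
# Report passive trip points that sit at or below the zone's current
# temperature, i.e. zones where passive cooling would always be active.
# The optional first argument (hypothetical) points at a copy of the
# thermal tree; it defaults to the real sysfs location.
check_passive_trips() {
    root="${1:-/sys/class/thermal}"
    for zone in "$root"/thermal_zone*; do
        [ -r "$zone/temp" ] || continue
        temp=$(cat "$zone/temp")
        for type_file in "$zone"/trip_point_*_type; do
            [ -r "$type_file" ] || continue
            [ "$(cat "$type_file")" = "passive" ] || continue
            # trip_point_N_type -> trip_point_N_temp
            temp_file="${type_file%_type}_temp"
            [ -r "$temp_file" ] || continue
            trip=$(cat "$temp_file")
            if [ "$trip" -le "$temp" ]; then
                echo "${zone##*/}: passive trip ${trip} <= current temp ${temp}"
            fi
        done
    done
}
```

Calling check_passive_trips with no argument reads the live system. With the values posted above it would flag thermal_zone0.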
Comment 12 Carles 2012-02-15 22:20:05 UTC
Should this be labelled a regression seeing it did work as intended in earlier versions?
Comment 13 Paul Ellenbogen 2012-02-25 02:18:48 UTC
I have an Asus n51vn (tried to install ubuntu) with this problem. I am only of intermediate technical skill, but I would be willing to help by offering my computer as a test machine.
Comment 14 Abilio Henrique 2012-03-20 23:03:10 UTC
I too have this issue.  It's horrible to deal with, honestly... As with others, ACPI=off or suspending the laptop and resuming it fixes the problem; however, the few minutes it takes to boot up to the point of being able to suspend it are frankly a nightmare.  I'm on an Asus N51Vn as well, and would appreciate anything that would solve this problem properly (without having to disable ACPI, which is critical for proper battery usage).
Comment 15 Frank 2012-04-19 21:11:04 UTC
I have the same issue.
@Carles : I think it should be labelled a regression since, as for others, this issue didn't exist with earlier versions in my case.

The bug still exists with version 3.3.2

What information is missing causing the NEEDINFO status of this bug, and, is this why the bug doesn't get addressed or is it simply too low priority?
I would be really happy if this could get fixed, too.
Comment 16 Abilio Henrique 2012-04-19 21:26:31 UTC
(In reply to comment #15)
> I have the same issue.
> @Carles : I think it should be named regression since as for others this
> issue
> didn't exist with earlier versions in my case.
> 
> The bug still exists with version 3.3.2
> 
> What information is missing causing the NEEDINFO status of this bug, and, is
> this why the bug doesn't get addressed or is it simply too low priority?
> I would be really happy if this could get fixed, too.

Agreed, something has absolutely gotta be done.  I'm more than happy to provide whatever info hasn't already been provided. I would like someone to at least specify what information is still needed.

I've been having to run ACPI=OFF for at least 2 months now (prior to that running an old kernel version that wasn't affected).  This means no CPU speed stepping, laptop always running hot, and a heap of power saving and thermal protection features disabled.
Comment 17 Abilio Henrique 2012-05-12 21:47:43 UTC
Is someone willing to even do something about this bug? It's been going on for almost 2 years.
Comment 18 Nate Cornell 2012-05-12 22:01:53 UTC
I have provided all the info requested, but cannot change the status.

What can I do to regain attention to this bug?
Comment 19 Abilio Henrique 2012-05-12 22:12:45 UTC
(In reply to comment #18)
> I have provided all the info requested, but cannot change the status.
> 
> What can I do to regain attention to this bug?

I'm not really sure, Nate. I do know something needs to be done about it, however. I've been running ACPI=off for about two months now (after having avoided upgrading the Linux kernel for quite some time), and it's not a good long-term solution as it'll likely cause the battery to wear out well before its time. I don't really fancy having to buy a new laptop to deal with this when my laptop is fine.
Comment 20 Jake 2012-05-14 01:28:13 UTC
I've been banging my head against my laptop all day trying to figure out why any distro I install runs horribly slow (kept thinking it was video card drivers, except even console was affected).  I just stumbled across this and have at least salvaged my sanity.  But, running without acpi is kind of a problem.  

Like others on this thread, I'm more than happy to help however I can.
Comment 21 d876808 2012-06-11 20:23:50 UTC
The info in this forum might be of some help. Especially for those of you who do not want to turn ACPI off, but you have to do it every time you restart the computer. 

https://bugs.launchpad.net/ubuntu/+source/linux/+bug/793437?comments=all

see post 102.

cheers!
Comment 22 Nate Cornell 2012-06-12 17:26:14 UTC
Thanks, I was actually following that thread before starting this one.

It's an interesting workaround that could offer some clues for those more savvy than myself, but I am too hung up on being able to use my laptop directly after booting it for that to work for me.
Comment 23 Abilio Henrique 2012-06-12 23:01:02 UTC
That workaround does not work for me.  My CPU still gets stuck at 800 MHz even after suspending. I've also attempted the SATA fixes etc. Nothing seems to fix it.

The kernel simply needs to be fixed. This is a clear regression bug.
Comment 24 Alex Shi 2012-06-13 08:52:35 UTC
(In reply to comment #23)
> That workaround does not work for me.  My CPU still gets stuck at 800mhz even
> after suspending. I've also attempted the SATA fixes etc. Nothing seems to
> fix
> it.
> 
> The kernel simply needs to be fixed. This is a clear regression bug.

Do you mean your laptop worked on some older kernel?
If so, could you do a bisect for this bug?
Actually, I would like to take a look myself, but I cannot get hold of this laptop.
Comment 25 Carles 2012-06-13 22:29:51 UTC
(In reply to comment #24)
> (In reply to comment #23)
> > That workaround does not work for me.  My CPU still gets stuck at 800mhz
> even
> > after suspending. I've also attempted the SATA fixes etc. Nothing seems to
> fix
> > it.
> > 
> > The kernel simply needs to be fixed. This is a clear regression bug.
> 
> Do you mean your laptop had worked on some old kernel?
> If so, could you like to do a bisect for this bug? 
> Actually, I like to take a look, but I can not get this laptop.

Yes I think this is a regression, my Asus N51Vn with an Intel Core 2 Duo T9400 cpu does work properly on Ubuntu 10.10 with the 2.6.35 kernel with ACPI enabled. However, any of the newer kernels in an ubuntu distribution show the aforementioned undesired behaviour when ACPI is not disabled.

I would love to help you, but I don't quite know what a bisect is or how to do it, could you point me in the direction of some documentation I could use or otherwise assist me in the process?

In any case, some of the people in this launchpad thread https://bugs.launchpad.net/ubuntu/+source/linux/+bug/793437 started a bisect back when this regression appeared, I however don't think they actually finished it.
Comment 26 Alex Shi 2012-06-14 03:21:07 UTC
(In reply to comment #25)
> (In reply to comment #24)
> > (In reply to comment #23)
> > > That workaround does not work for me.  My CPU still gets stuck at 800mhz
> even
> > > after suspending. I've also attempted the SATA fixes etc. Nothing seems
> to fix
> > > it.
> > > 
> > > The kernel simply needs to be fixed. This is a clear regression bug.
> > 
> > Do you mean your laptop had worked on some old kernel?
> > If so, could you like to do a bisect for this bug? 
> > Actually, I like to take a look, but I can not get this laptop.
> 
> Yes I think this is a regression, my Asus N51Vn with an Intel Core 2 Duo
> T9400
> cpu does work properly on Ubuntu 10.10 with the 2.6.35 kernel with ACPI
> enabled. However, any of the newer kernels in an ubuntu distribution show the
> aforementioned undesired behaviour when ACPI is not disabled.
> 
> I would love to help you, but I don't quite know what a bisect is or how to
> do
> it, could you point me in the direction of some documentation I could use or
> otherwise assist me in the process?

It is not difficult. Bisection finds the commit that introduced the bug in the kernel source code.

First, get a Linux kernel git repository:
$git clone git://github.com/torvalds/linux.git

Second, study the 'git bisect' command:
$man git bisect
or read a tutorial online, such as http://www.kernel.org/pub/software/scm/git/docs/v1.5.6/git-bisect.html

Third, run the bisection in your local git repository:
--
#go to the kernel code repository
$cd linux
#start, and set the good point at 2.6.35
---
$git bisect start
$git bisect good v2.6.35

#set the bad point at 2.6.38
$git bisect bad v2.6.38

#copy your kernel configuration from /boot/config-x.x.x to your repository,
#then compile and install the bisection kernel
---
$cp /boot/config-xxx .config
#you can compile all modules in the default config, or only the modules currently in use (make localmodconfig)
$make localmodconfig
$make -j 4 && sudo make install

# Reboot into the kernel you just compiled. I don't know which GRUB version you have, GRUB 1 or GRUB 2. With GRUB 1, edit /etc/grub.conf to boot your kernel. With GRUB 2, the new kernel is already selected; if that fails, check an online tutorial. Writing a /boot/grub/custom.cfg is one workable way.

# Now check whether the new kernel has this bug. If yes:
$git bisect bad
#or, if the bug is absent:
$git bisect good

#Then compile and reboot into the next kernel that git bisect checks out. Make sure you are really running your new kernel by checking with "uname -a" and "ls -lrt /boot/vmlinuz*". Usually fewer than twenty recompile/reboot cycles are enough to find the triggering commit. Use the 'git bisect skip' command for a kernel that does not compile or boot.

Enjoy bisecting! You will be a real kernel developer after doing this! :)
> 
> In any case, some of the people in this launchpad thread
> https://bugs.launchpad.net/ubuntu/+source/linux/+bug/793437 started a bisect
> back when this regression appeared, I however don't think they actually
> finished it.
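
The bisect steps described in this comment can be collected into two small shell functions. This is only a sketch, not a procedure verified on this laptop: the function names and the KERNEL_CONFIG variable are hypothetical, and the build line installs modules as well as the kernel image, which a plain "make install" does not do.

```shell
#!/bin/sh
# Start a bisect between a known-good and a known-bad revision.
# Usage: bisect_begin v2.6.35 v2.6.38   (run inside the kernel git tree)
bisect_begin() {
    good="$1"
    bad="$2"
    git bisect start &&
    git bisect good "$good" &&
    git bisect bad "$bad"
}

# Build and install whatever revision git bisect has checked out.
# KERNEL_CONFIG is an assumed variable naming a /boot/config-* file.
build_step() {
    cp "${KERNEL_CONFIG:?set KERNEL_CONFIG to a /boot/config-* file}" .config &&
    yes "" | make oldconfig &&      # accept defaults for options new to this revision
    make -j 4 &&
    sudo make modules_install install
}
```

After each build_step and reboot, the result is reported with "git bisect good", "git bisect bad" or "git bisect skip", and build_step is run again on the revision git checks out next.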
Comment 27 Alex Shi 2012-06-14 03:24:01 UTC
(In reply to comment #26)
> (In reply to comment #25)
> > (In reply to comment #24)
> > > (In reply to comment #23)
> > > > That workaround does not work for me.  My CPU still gets stuck at
> 800mhz even
> > > > after suspending. I've also attempted the SATA fixes etc. Nothing seems
> to fix
> > > > it.
> > > > 
> > > > The kernel simply needs to be fixed. This is a clear regression bug.
> > > 
> > > Do you mean your laptop had worked on some old kernel?
> > > If so, could you like to do a bisect for this bug? 
> > > Actually, I like to take a look, but I can not get this laptop.
> > 
> > Yes I think this is a regression, my Asus N51Vn with an Intel Core 2 Duo
> T9400
> > cpu does work properly on Ubuntu 10.10 with the 2.6.35 kernel with ACPI
> > enabled. However, any of the newer kernels in an ubuntu distribution show
> the
> > aforementioned undesired behaviour when ACPI is not disabled.
> > 
> > I would love to help you, but I don't quite know what a bisect is or how to
> do
> > it, could you point me in the direction of some documentation I could use
> or
> > otherwise assist me in the process?
> 
> Oh, It is not difficult. Bisection can find the bug patch in kernel source
> code.
> 
> First, to get a linux kernel git repository.
> $git clone git://github.com/torvalds/linux.git 
> 
> Second. study 'git bisect' command
> $man git bisect 
> or read tutorial online, like
> http://www.kernel.org/pub/software/scm/git/docs/v1.5.6/git-bisect.html
> 
> Third, to do bisection under your local git repository:
> --
> #go to kernel code repository
> $cd linux
> #start and set good point at 2.6.35
> ---
> $git bisect start
> $git bisect good v2.6.35
> 
> #set bad point at 2.6.38
> $git bisect bad v2.6.38
> 
> #copy your kernel configuration from /boot/config-x.x.x to your repository.
> #and then compile and installed bisection kernel
> ---
> $cp /boot/config-xxx .config
> #you can compile all modules in default config, or just installed modules
> $make localmodconfig 
> $make -j 4 && sudo make install
You may need a more precise command: make -j 4 && make modules -j 4 && sudo make modules_install install
> 
> # reboot to just compile kernel, I don't what your grub version, grub 1.0 or
> grub 2.0. If it is grub 1.0, change the /etc/grub.conf to boot your kernel.
> if
> grub2.0, it is already changed to new kernel, but if it fails check some
> online
> tutorial, write a /boot/grub/constom.cfg is a kind of workable way.
> 
> # now, you can check if the new kernel has this bug, if yes,
> $git bisect bad
> #or no this bug
> $git bisect good
> 
> #then continue compile and reboot to new kernel. make sure your is using your
> new kernel by checking with "uname -a" and "ls -lrt /boot/vmlinuz*". usually,
> only teens recompile/reboot can find out the trigger commit, using 'git
> bisect
> skip' command for uncompilable/unbootable kernel.
> 
> Enjoy bisection!, You will become a real kernel developer after done this! :)
> > 
> > In any case, some of the people in this launchpad thread
> > https://bugs.launchpad.net/ubuntu/+source/linux/+bug/793437 started a
> bisect
> > back when this regression appeared, I however don't think they actually
> > finished it.
Comment 28 Carles 2012-06-14 18:00:46 UTC
The first kernel I compiled in the bisect process just panics on boot, is this to be expected or did I do something wrong?
Comment 29 Alex Shi 2012-06-15 00:39:48 UTC
(In reply to comment #28)
> The first kernel I compiled in the bisect process just panics on boot, is
> this
> to be expected or did I do something wrong?

I am not surprised. :)
Some commits do not build or boot even in Linus' tree. As the 'git bisect' man page describes, you can skip such a commit. If the next commit is still unbootable, which often happens until the whole broken patch series has been passed, you can assume that the commit is 'good' or 'bad' and continue your bisection. If the assumption turns out to be wrong, go back and try the other direction. You can save what you have done with 'git bisect log &> abc'. That command is described in 'man git bisect'.

Oh, BTW, you had better try the latest kernel, 3.5-rc2, to make sure the bug is still there.

Go ahead! Waiting for your finding!
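
Saving the bisect state and going back after a wrong guess can be sketched with "git bisect log" and "git bisect replay", both of which are described in the git bisect man page. The helper names below are made up for illustration.

```shell
#!/bin/sh
# Save the current bisect state to a file.
bisect_save() {
    git bisect log > "$1"
}

# Restore a saved state. After a wrong good/bad guess, edit the saved
# file to delete the mistaken line(s) first, then call this function.
bisect_restore() {
    git bisect reset &&
    git bisect replay "$1"
}
```

For example, "bisect_save ~/bisect.log" before guessing, and "bisect_restore ~/bisect.log" if the guess has to be undone.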
Comment 30 Carles 2012-06-15 12:53:07 UTC
The first kernel did not boot, the second didn't even compile and the third didn't boot either.

If I understand this correctly, if I find a kernel that doesn't compile or boot what I need to do is "git bisect skip"?

There's another question I have, it might be a bit elemental, how can I remove the kernels I have already tested? It's starting to get crowded.
Comment 31 Alex Shi 2012-06-16 00:34:24 UTC
(In reply to comment #30)
> The first kernel did not boot, the second didn't even compile and the third
> didn't boot either.

I wonder whether it is a kernel configuration issue. You can try the complete Ubuntu default kernel configuration: cp /boot/config-xxx .config; make oldconfig; make -j 4; .....

> 
> If I understand this correctly, if I find a kernel that doesn't compile or
> boot
> what I need to do is "git bisect skip"?

Yes.
> 
> There's another question I have, it might be a bit elemental, how can I
> remove
> the kernels I have already tested? It's starting to get crowded.

Which GRUB version is on your laptop?
You can do $rm /boot/*kernel_version* -rf ; rm /lib/modules/kernel_version/ -rf
But for GRUB 1, you also need to edit /etc/grub.conf or /boot/grub/menu.lst and set 'default=' to the correct kernel entry number.
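
A slightly more careful variant of the removal commands above is sketched here. It is an illustration only: the function name and the optional directory arguments are hypothetical, and it assumes the files in /boot are named with a "-VERSION" suffix, as "make install" usually produces.

```shell
#!/bin/sh
# Remove the /boot files and the module directory of one test kernel.
# Usage: remove_test_kernel VERSION [BOOT_DIR] [MODULES_DIR]
remove_test_kernel() {
    version="$1"
    boot_dir="${2:-/boot}"
    modules_dir="${3:-/lib/modules}"
    if [ -z "$version" ]; then
        echo "usage: remove_test_kernel VERSION" >&2
        return 1
    fi
    # Never delete the kernel that is currently running.
    if [ "$version" = "$(uname -r)" ]; then
        echo "refusing to remove the running kernel $version" >&2
        return 1
    fi
    rm -f "$boot_dir"/*-"$version"
    rm -rf "${modules_dir:?}/$version"
}
```

On the real directories this has to run as root. On Ubuntu with GRUB 2, running "sudo update-grub" afterwards regenerates the boot menu.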
Comment 32 Alex Shi 2012-06-16 13:05:46 UTC
(In reply to comment #31)
> (In reply to comment #30)
> > The first kernel did not boot, the second didn't even compile and the third
> > didn't boot either.
> 

BTW, our target is to get a newer, working kernel for your laptop. So if the latest kernel, 3.5-rc2, works, we don't need to track down an old bug in the kernel.
Comment 33 Carles 2012-06-16 21:47:26 UTC
I'm afraid there must be something I'm doing wrong, every step of the bisect process gave me a non bootable kernel so I did as you suggested and tried the 2 last stable ones from kernel.org, they all fail in the same way:

https://dl.dropbox.com/u/2231399/panic.jpg

This is the first time I build a kernel from source and I'm sure there's something very elemental that I am missing, I'd appreciate any help. If someone else with a similar model can try the same bisect process to see if they get different results that would be immensely helpful.
Comment 34 Alex Shi 2012-06-18 07:27:38 UTC
(In reply to comment #33)
> I'm afraid there must be something I'm doing wrong, every step of the bisect
> process gave me a non bootable kernel so I did as you suggested and tried the
> 2
> last stable ones from kernel.org, they all fail in the same way:
> 
> https://dl.dropbox.com/u/2231399/panic.jpg
> 
> This is the first time I build a kernel from source and I'm sure there's
> something very elemental that I am missing, I'd appreciate any help. If
> someone
> else with a similar model can try the same bisect process to see if they get
> different results that would be immensely helpful.

I appreciate what you did, given that you had never compiled a kernel before.

According to your picture, your kernel is missing the hard disk controller driver. Actually, I could not tell from the modules list which controller you are using. You can check it on a working system from dmesg (with $dmesg) or from the PCI info (with $lspci), and then build the driver into the kernel directly (set this driver to 'y').

If you cannot find which driver you are using, give me the lspci or dmesg output and let me try to help you. :)
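
One way to find the disk controller driver on a working system is "lspci -k", which prints a "Kernel driver in use:" line under each device. The filter below is a rough sketch with a made-up function name; the list of device class keywords it matches is an assumption and may miss some controllers.

```shell
#!/bin/sh
# Read "lspci -k" output on stdin and print the kernel driver in use
# for each device whose class looks like a disk/storage controller.
storage_drivers() {
    awk '
        /^[0-9a-f]/ {
            # Device line: keep only the part before the first ": ",
            # i.e. the slot and the device class.
            class = $0
            sub(/: .*/, "", class)
            storage = (class ~ /SATA|IDE|RAID|SCSI|storage/)
        }
        storage && /Kernel driver in use:/ { print $NF }
    '
}
```

Usage: "lspci -k | storage_drivers". If this prints ahci, for example, the matching kernel option is CONFIG_SATA_AHCI, which can then be set to 'y' in .config.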
Comment 35 Carles 2012-06-18 09:31:04 UTC
Thank you for your help, I've managed to get past the problem and compile 3.5.0-rc3, it however still has the problem. I'm going to proceed with the bisect process but considering it takes me more than an hour to compile a kernel on this laptop it is going to take some days in the best case scenario.
Comment 36 Alex Shi 2012-06-18 13:35:55 UTC
(In reply to comment #35)
> Thank you for your help, I've managed to get past the problem and compile
> 3.5.0-rc3, it however still has the problem. I'm going to proceed with the
> bisect process but considering it takes me more than an hour to compile a
> kernel on this laptop it is going to take some days in the best case
> scenario.

That's great. Let's wait for your good news. :)
Comment 37 Carles 2012-06-19 21:29:32 UTC
This is going far slower than I had anticipated. For every step I test as either good or bad there are at least 4 I have to skip because they either don't compile or panic on boot, is this expected?

And still, I don't think I properly understand how the process works, for instance, 2.6.35-rc1 and rc3 panic on boot, however in the way I understand it they should be good because I'm using 2.6.35 stable and it doesn't show the bug, should I skip those or should I mark them as good without testing?
Comment 38 Alex Shi 2012-06-20 00:20:36 UTC
(In reply to comment #37)
> This is going far slower than I had anticipated. For every step I test as
> either good or bad there are at least 4 I have to skip because they either
> don't compile or panic on boot, is this expected?

Sometimes Linus' tree also has quality problems. :( You had better keep skipping until you are past the buggy patch series. I mean: don't test those, just repeat $git bisect skip

> 
> And still, I don't think I properly understand how the process works, for
> instance, 2.6.35-rc1 and rc3 panic on boot, however in the way I understand
> it
> they should be good because I'm using 2.6.35 stable and it doesn't show the
> bug, should I skip those or should I mark them as good without testing?

That assumption is incorrect. The .35-rcX kernels are released before the final .35 kernel; the rc versions are test kernels. In fact, git bisect is quite smart about finding the next midpoint commit in the tree. Could you send the output of $git bisect log? Let's see what you have got.
Comment 39 Carles 2012-06-20 09:41:58 UTC
Here's the output of git bisect log
https://dl.dropbox.com/u/2231399/log.tar.gz
Comment 40 Alex Shi 2012-06-21 00:36:08 UTC
(In reply to comment #39)
> Here's the output of git bisect log
> https://dl.dropbox.com/u/2231399/log.tar.gz

Oh, you did narrow the scope: the last good commit is 9ea2c4... and the last bad one is 31dfbc...

If the bug is in drivers/cpufreq, you can redo the bisection on just the directories drivers/cpufreq and drivers/acpi, using your last good and last bad commits.

You can also list all suspicious patches with the following command before bisecting:
$git log 9ea2c4be978d597076ddc6c..31dfbc93923c0aa drivers/cpufreq/ drivers/acpi drivers/cpuidle

But if the bug is not there, you can try bisecting on other source directories, like kernel, arch/x86, mm/.
The git bisect man page explains how to restrict the bisection to source directories. Here is an example: $git bisect start -- arch/i386 include/asm-i386

I think you are near the bug! :)
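
Restricting the search to a few directories can be sketched as two helpers. The function names are hypothetical; the git syntax used is "git log GOOD..BAD -- PATHS" and "git bisect start BAD GOOD -- PATHS".

```shell
#!/bin/sh
# List the commits between GOOD and BAD that touch the given paths.
# Usage: suspects GOOD BAD PATH...
suspects() {
    good="$1"
    bad="$2"
    shift 2
    git log --oneline "$good..$bad" -- "$@"
}

# Restart the bisect so that only commits touching the given paths
# are tested. Usage: bisect_in_paths GOOD BAD PATH...
bisect_in_paths() {
    good="$1"
    bad="$2"
    shift 2
    git bisect start "$bad" "$good" -- "$@"
}
```

For this bug that would be, for example, "bisect_in_paths 9ea2c4be978d597076ddc6c 31dfbc93923c0aa drivers/cpufreq drivers/acpi". If the culprit lies outside those directories, the restricted bisect will point at the wrong commit, so the result still needs to be confirmed.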
Comment 41 Alex Shi 2012-06-25 06:38:54 UTC
(In reply to comment #40)
> (In reply to comment #39)
> > Here's the output of git bisect log
> > https://dl.dropbox.com/u/2231399/log.tar.gz
> 
> Oh, you do narrowed the scope. found the last good : 9ea2c4... and last bad:
> 31dfbc...
> 
> If the bug is in drivers/cpufreq, you can redo the bisection just on the
> directory: drivers/cpufreq drivers/acpi with your last good and last bad.
> 
> You also can list all suspicious patches by following command before
> bisection.
> $git log 9ea2c4be978d597076ddc6c..31dfbc93923c0aa drivers/cpufreq/
> drivers/acpi
> drivers/cpuidle
> 
> but if the bug is not there, you can try just bisect on other source
> directories, like kernel, arch/x86, mm/
> in the git bisection man page, you can find how to narrow the source
> directory.here is a example:  $git bisect start -- arch/i386 include/asm-i386 
> 
> I think you are near the bug! :)

Any new update? :)
Comment 42 Carles 2012-06-26 09:46:55 UTC
Updated log just in case it helps:
https://dl.dropbox.com/u/2231399/log.tar.gz

It's been a very busy weekend at work also my ISP fucked up royally and left every customer in the province without a working phone line for the entirety of Monday.

I haven't had much to report on either, I have continued as usual, narrowed down to 3035 possible patches according to bisect, still a lot and the process seems to get slower, the amount of skips gets demoralizing.

I would love to narrow it down using your method but I haven't actually tried it yet. Mostly because I feel like I don't understand it. First, how do I know the last good and last (first?) bad? Are they just the last two patches listed in the log file? (could you check the file in case I mess up?)

If I understand this correctly, what I need to do is restart the bisect process, then start again by telling it to only look for changes between the last good and last bad and only in drivers/cpufreq and drivers/acpi?

I understand from your message that the bug might not be there and I should try to narrow it to x86, I don't know whether this is a valid concern, but the bug is also present in amd64 (or x86_64 or whatever is the current politically correct way to refer to it nowadays, actually, my testing system is 64bit).

I am also very worried about the amount of panics and compilation errors being somehow produced by me doing something wrong, could you for instance test compile the last skip to check if it's only me?

I'll keep with the usual method in the mean time.
Comment 43 Alex Shi 2012-06-28 02:22:12 UTC
(In reply to comment #42)
> Updated log just in case it helps:
> https://dl.dropbox.com/u/2231399/log.tar.gz
> 
> It's been a very busy weekend at work also my ISP fucked up royally and left
> every customer in the province without a working phone line for the entirety
> of
> Monday.

Oops!
> I haven't had much to report on either, I have continued as usual, narrowed
> down to 3035 possible patches according to bisect, still a lot and the
> process
> seems to get slower, the amount of skips gets demoralizing.
> 
> I would love to narrow it down using your method but I haven't actually tried
> it yet. Mostly because I feel like I don't understand it. First, how do I
> know
> the last good and last (first?) bad? Are they just the last two patches
> listed
> in the log file? (could you check the file in case I mess up?)

Sorry for the rather late response; I have had too many things to deal with.

Narrowing with the following command is very useful if you know where the problem is, or if there are too many reboot issues when bisecting the whole source tree.
$ git bisect start last_bad_commit last_good_commit -- drivers/cpufreq/ drivers/acpi/

You just need to restart the bisect as above; after that, everything is the same as in a full-source bisection.
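
As a rough sketch of the restart described above: 'git bisect start' takes the bad revision first, then the good one, and any paths after '--' limit the bisection to commits that touch those directories. The abbreviated hashes are the last bad and last good revisions mentioned earlier in this thread; the helper name bisect_start_cmd is hypothetical, and it only prints the command so that it can be checked before being run by hand in the kernel tree.

```shell
#!/bin/sh
# Hypothetical helper: print (do not run) a path-limited bisect command.
# Usage: bisect_start_cmd BAD GOOD PATH...
bisect_start_cmd() {
    bad=$1
    good=$2
    shift 2
    printf 'git bisect start %s %s --' "$bad" "$good"
    for path in "$@"; do
        printf ' %s' "$path"
    done
    printf '\n'
}

# Last bad / last good from this thread, limited to two directories.
bisect_start_cmd 31dfbc93923c0aa 9ea2c4be978d597076ddc6c drivers/cpufreq drivers/acpi
```

Adding 'arch/x86' or 'kernel' to the path list widens the search again without going back to the full tree.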

> 
> If I understand this correctly, what I need to do is restart the bisect
> process, then start again by telling it to only look for changes between the
> last good and last bad and only in drivers/cpufreq and drivers/acpi?

Sure
> 
> I understand from your message that the bug might not be there and I should
> try
> to narrow it to x86, I don't know whether this is a valid concern, but the
> bug
> is also present in amd64 (or x86_64 or whatever is the current politically
> correct way to refer to it nowadays, actually, my testing system is 64bit).

If so, add 'arch/x86 kernel' at the end of the above command.
> 
> I am also very worried about the amount of panics and compilation errors
> being
> somehow produced by me doing something wrong, could you for instance test
> compile the last skip to check if it's only me?

I wouldn't worry much about skipped commits if they are in unrelated drivers, or even if some of them are in generic code. :)
> 
> I'll keep with the usual method in the mean time.
Comment 44 Lv Zheng 2012-06-28 12:52:51 UTC
Have you tried the "noapic" kernel parameter?  Some platforms may suffer from PIT problems when ACPI is enabled.
Comment 45 Lv Zheng 2012-06-28 13:11:37 UTC
Sorry for my written English.  You can do a simple test by passing "noapic" to your kernel instead of "acpi=off".  If that works, your platform is probably one of the problematic ones.
According to Len Brown's documentation, when ACPI is off, the IO-APIC can act as a conventional 8259 PIC.  Linux always uses the 8254 PIT as its source of jiffies, and this interrupt is routed to IRQ0, pin 0 or 2, while Windows uses the RTC on IRQ8.  Some vendors ship platforms that pass the Windows tests but are not suitable for Linux, because they have not noticed the difference in PIT behaviour between ACPI on and ACPI off.
Comment 46 Nate Cornell 2012-06-28 14:09:00 UTC
I have been testing the drivers/acpi directory between the last good and bad you found, Carles (9ea2c4... 31dfbc...). It is slow going for me because this is still my primary homework pc, and I don't get much time to test (school + work = no free time). 
I've gotten through three bisections so far, but none have booted.

I'm going to give your suggestion a try this weekend, Lv. That's good information.
Comment 47 Carles 2012-06-30 21:49:24 UTC
Thank you Nate, I just wanted to make sure the panicking kernels were not due to something I was doing wrong. I'll keep at it, in case you are interested bisect tells me I've narrowed it down to 19 possible cases, here's the last log in case you feel like trying:

This is the one from the full source bisect https://dl.dropbox.com/u/2231399/lognormal.tar.gz
This is the one from my last drivers/acpi drivers/cpufreq session https://dl.dropbox.com/u/2231399/log-estret.tar.gz

I'll keep compiling, I hope you'll bring good news regarding Lv's suggestion, I'm not sure on how to test it myself.
Comment 48 Nate Cornell 2012-07-01 01:20:30 UTC
(In reply to comment #47)
> Thank you Nate, I just wanted to make sure the panicking kernels were not due
> to something I was doing wrong. 

I don't think you're doing anything wrong, but I just discovered today that changing the IDE mode to "compatible" within the BIOS made a bunch of kernels that were panicking bootable, although they do have other issues (no mouse or wifi, etc...). I have been simply holding enter through the module selection and selecting all the defaults; which could be a big reason why. 

I think at the very least we need to include the Intel ICH9M/M-E drivers for it to work, or else go with my BIOS suggestion. All part of learning how to compile a kernel, I guess. You and I will be pros by the end of this!

> I'll keep at it, in case you are interested
> bisect tells me I've narrowed it down to 19 possible cases, here's the last
> log
> in case you feel like trying:
> 
Thanks Carles! I'm almost done with this ACPI bisection, then I'm going to try a stable kernel with 'noapic' as Lv suggested. After that I'll take a look at your latest logs to see if there's anything else I can try.
Comment 49 Nate Cornell 2012-07-01 15:38:51 UTC
No good news today.

Booting to 2.6.38 with the noapic or nolapic options did nothing to solve the issue.

I also found nothing with the drivers/acpi bisect I was running. I didn't find a single bisection that displayed the bug, although two failed to compile. They hung on drivers/acpi/acpica/exstorob.o; not sure if that means anything.
Comment 50 Carles 2012-07-01 16:57:46 UTC
I'm still following the bisect process from my last log, so far every test kernel panics on boot (even with IDE set to compatible in BIOS) except for the first, which *had* the bug.

I have been using the configuration file for my current working ubuntu kernel (2.6.35-32), does it not include these Intel ICH9M/M-E drivers you talk about? I'll check it to be sure.

It's a bit frustrating, it feels very close but every test seems to fail! : )
Comment 51 Feng Tang 2012-07-02 03:03:34 UTC
Hi,

Can you try these 2 methods:
1) add "nohz=off" to cmdline
2) add "hpet=off" to cmdline

and post the full dmesg and "/proc/timer_list"?
Comment 52 Feng Tang 2012-07-02 03:19:19 UTC
(In reply to comment #11)

> > 
> > Has any version of Linux worked properly on this system?
> 
> Every version I tried prior to 2.6.38-8 for the past 3 years has worked very
> well. I currently use Ubuntu 10.10 with a 2.6.35-31 kernel. The system runs
> very smoothly on this kernel.
> 

Could you list the working kernel versions, like 2.6.36.x, 2.6.37.x? If that kernel works, pls provide the "dmesg" and "/proc/timer_list" to help narrow down the problem. thanks
Comment 53 Carles 2012-07-02 16:14:58 UTC
I have finished bisecting, I had to skip too much because I couldn't get the last kernels to boot, this is the result:
https://dl.dropbox.com/u/2231399/resultat.tar.gz
If someone with an n51v laptop is able to get any of these to boot and tests them we'll be able to further pinpoint the offending commit. Just in case someone needs it, this is my final bisect log https://dl.dropbox.com/u/2231399/finallog.tar.gz

> Could you list the working kernel versions, like 2.6.36.x, 2.6.37.x? If that
> kernel works, pls provide the "dmesg" and "/proc/timer_list" to help narrow
> down the problem. thanks

Does the linked file help you? In case it does not help you, as far as I have tested all stable releases up to and including 2.6.35 worked, but releases from (including) 2.6.36 up to the current one display this problem.
Do you need dmesg output and /proc/timer_list of kernels that display the problem or kernels that do not? If so, do you prefer one of the first builds that showed the problem or would you rather have the latest one?
Comment 54 Alex Shi 2012-07-03 01:48:21 UTC
(In reply to comment #53)
> I have finished bisecting, I had to skip too much because I couldn't get the
> last kernels to boot, this is the result:
> https://dl.dropbox.com/u/2231399/resultat.tar.gz
> If someone with an n51v laptop is able to get any of these to boot and tests
> them we'll be able to further pinpoint the offending commit. Just in case
> someone needs it, this is my final bisect log
> https://dl.dropbox.com/u/2231399/finallog.tar.gz
> 

Thanks a lot for your hard work!
Actually, your hard work has given us a very good result: just a few commits are left, and all of them belong to the same subsystem, ACPICA. I will ask the maintainers of this subsystem to look at what those changes did!

Thanks again for your great work!

[alexs@lkp-os linux]$ git log  e8eb6228094bcf..c172cb73bc79fe6991 | grep commit
commit c172cb73bc79fe69915b1a1a48e374aa4b1f8a59
commit 28f4f8a9def2b1f3a6066bae791c77043ec49524
commit a0d468718b9049f7396d101075a129a2d683ad66
commit 9ce10df8d83d0528e80cd319b35ac5f6812b4f62
commit 9874647ba1bdf3e1af25e079070a00676f60f2f0
commit e8e18c956152ec9c26c94c6401c174691a8f04e7
commit 9e6c3e996e3c80d00cf931538e17126efe45f45c
commit 09079250db4d470f75eddcce31e0229c92d6c3bf
commit 150dba38f0c3d2d5f5edc58145d202de08ed623c
commit de5668fe7549c0586c6f64fa5661604cf7029a99
commit ddcc6a037c0f9378f29658636a2c2b54c4238ec4
commit 9d8b5e7b28179784e2c6250086a44021fbb9c5a0
commit 546eb57695875712f676e5f729159b0779f1c0af
commit 3bd741bd0dfcc1845ae6892baa5192c91addc84c
commit 3784730b02b9f147a55b0e4623fcad671273e6e6
commit b63559f5ce08bc8f94ce144a8d06f7af607ecc53
commit a44061aa8b5d58b2729faca4c155a94a5bea2a09
commit e8b6f970107cfc9c00cdcdb12ec6c7e135cf379f
commit b76df673522d94e3eafcf16935b3d7e5dded3078
commit ccba77eb45c36cf1d8b22f241eb8a4a292c1362e
commit 4461cf546ec8c97b6b997b8e533d6de1960499d3
commit a9fc03125ea0001ff18bc29da9539b587fdbd1d7
commit 20d33aea7ae7ad858f3f91c834d9043cd8122d38
commit c45b5c097001480e66d4c523eb715ad317a4ef77
commit 5821f75421aa7c7bafdec291223153597f649934
commit b27d65975c252ff774edff8e01f0a9fd46d8ab62
commit 96b7b7ad79e4bd8a0ae67dd201f7532ef4abf1c1
commit 507f7d5e27015be1e5dda5c56bb5e10315b76f71
commit aa9d36060fb7480a5907660b7ba61c3fda20fc61
Comment 55 Alex Shi 2012-07-03 02:30:03 UTC
(In reply to comment #53)
> I have finished bisecting, I had to skip too much because I couldn't get the
> last kernels to boot, this is the result:
> https://dl.dropbox.com/u/2231399/resultat.tar.gz


Carles, 

Manual bisection is hard work and it is easy to make mistakes. If you are interested, you can write scripts to automate all the manual work. That will save you a lot of time, and it is also what computers are good at.

Recall what you did by hand and write it down as scripts. There are just 2 tricks you need to take care of.
1, let your script run automatically after the system reboots; you can set it up as a local system service.
2, install your kernel as the default boot kernel and handle boot panics. A boot panic breaks your script. Maybe 'grub' can handle this; grub may be able to fall back to an old good kernel. Your script can check which kernel is running with 'uname -a'.

Actually, that is not an easy job, and I have no similar scripts at hand to give you as a reference.
But if you finish it, you will have a good tool for finding bugs in any git-repository project. :)
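
One small piece of trick 2 can be sketched as follows, assuming the script knows the release string of the kernel it has just installed. The function name and the example release strings are made up for illustration. A boot that ended up on a different kernel (for example because grub fell back to the old one) is reported as "skip", so the commit can be passed to 'git bisect skip'; "tested" only means the test kernel is running and still has to be judged good or bad.

```shell
#!/bin/sh
# Hypothetical helper: compare the kernel that should have booted with
# the one that is actually running.
# Usage: classify_boot EXPECTED_RELEASE [RUNNING_RELEASE]
classify_boot() {
    expected=$1
    running=${2:-$(uname -r)}   # default: ask the running kernel
    if [ "$running" = "$expected" ]; then
        echo tested
    else
        echo skip
    fi
}

# Example with explicit, made-up release strings:
classify_boot "2.6.36-rc1-bisect" "2.6.35-32-generic"   # prints "skip"
```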
Comment 56 Carles 2012-07-03 20:08:06 UTC
I did make a couple of simple scripts to make the process easier, the main problem was that I did not know of a way to make the script continue on kernel panic and I did not bother investigating the matter because I thought (probably mistakenly, I realize) that it would be faster to just do it semi-manually than it would be to code it the proper way.

Back to the bug, as requested by Feng Tang this is the output of dmesg and timer_list on the latest git 3.5.0-rc4 kernel I compiled, I applied no special boot commands: https://dl.dropbox.com/u/2231399/dmesg-timer_list.tar.gz Please let me know if you need anything else or if this is not what you need.
Comment 57 Nate Cornell 2012-07-07 20:15:10 UTC
Created attachment 75001 [details]
Output of dmesg after booting to 2.6.38-8 with nohz=off

As requested by Feng
Comment 58 Nate Cornell 2012-07-07 20:19:21 UTC
Created attachment 75011 [details]
dmesg and timer_list from 2.6.38-8 after booting with nohz=off and hpet=off

Here are the rest of the files with the output of dmesg and /proc/timer_list, Feng. Let me know if you need anything else.
Comment 59 Feng Tang 2012-07-08 15:04:25 UTC
Thanks for the debug info. And sorry, I made a mistake: one of the options should be "hpet=disable" instead of "hpet=off".

And the info I want are:
1. with 2 options "nohz=off" or "hpet=disable", if the "slow" problem is solved or still there? 
2. the "dmesg" "/proc/timer_list" for the 2 options, if there is any warning or error message in kernel log, that'll be great
3. the "dmesg" "/proc/timer_list" for the last latest working kernel version, from previous records, it is 2.6.35.xxx?
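
A minimal sketch for gathering what is listed above, assuming a shell on the affected machine. The function names and the output directory are hypothetical. The first helper checks that an option such as "nohz=off" or "hpet=disable" really reached the kernel command line before anything is collected; reading /proc/timer_list may require root.

```shell
#!/bin/sh
# Hypothetical helper: succeed if OPTION appears as a whole word on the
# kernel command line (second argument overrides /proc/cmdline).
has_boot_option() {
    opt=$1
    cmdline=${2:-$(cat /proc/cmdline)}
    case " $cmdline " in
        *" $opt "*) return 0 ;;
        *) return 1 ;;
    esac
}

# Hypothetical helper: save the requested files into OUTDIR.
collect_timer_debug() {
    out=$1
    mkdir -p "$out"
    cat /proc/cmdline > "$out/cmdline.txt"
    dmesg > "$out/dmesg.txt"
    cat /proc/timer_list > "$out/timer_list.txt"
}

# Intended use on the test machine (directory name is an example):
#   has_boot_option nohz=off && collect_timer_debug ./nohz-off
```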
Comment 60 Feng Tang 2012-07-08 15:07:45 UTC
btw, since Nate Cornell already bisected the problem to be ACPICA related, then you can let ACPICA be the first suspect, and the investigation to "nohz" "hpet" is not that urgent now, so take your time.

Thanks,
Feng
Comment 61 Zhang Rui 2012-07-09 07:26:40 UTC
(In reply to comment #11)
> (In reply to comment #10)
> > please show the output from...
> > 
> > grep . /sys/class/thermal/*/*
> 
> /sys/class/thermal/cooling_device0/cur_state:10
> /sys/class/thermal/cooling_device0/max_state:10
> /sys/class/thermal/cooling_device0/type:Processor

the processor is put to deepest T-state, and this is why your machine is slow.

> /sys/class/thermal/cooling_device1/cur_state:10
> /sys/class/thermal/cooling_device1/max_state:10
> /sys/class/thermal/cooling_device1/type:Processor
> /sys/class/thermal/cooling_device2/cur_state:0
> /sys/class/thermal/cooling_device2/max_state:15
> /sys/class/thermal/cooling_device2/type:LCD
> /sys/class/thermal/thermal_zone0/cdev0_trip_point:1
> /sys/class/thermal/thermal_zone0/cdev1_trip_point:1
> /sys/class/thermal/thermal_zone0/mode:enabled
> /sys/class/thermal/thermal_zone0/temp:39000
> /sys/class/thermal/thermal_zone0/trip_point_0_temp:110000
> /sys/class/thermal/thermal_zone0/trip_point_0_type:critical
> /sys/class/thermal/thermal_zone0/trip_point_1_temp:0
> /sys/class/thermal/thermal_zone0/trip_point_1_type:passive

this is really bad.
the passive trip point temperature is 0C, so the processor is throttled from the beginning, which does not make sense.

could you please attach the same information in a working kernel?

could you please attach the acpidump output?
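
The check above can also be scripted. This is only a sketch, assuming the sysfs layout shown in the quoted output (trip_point_N_type and trip_point_N_temp under each thermal_zone directory); the function name is hypothetical. It prints every passive trip point whose temperature is 0 or lower.

```shell
#!/bin/sh
# Hypothetical helper: list passive trip points with a temperature <= 0.
# Usage: find_bogus_passive [BASE_DIR]   (default: /sys/class/thermal)
find_bogus_passive() {
    base=${1:-/sys/class/thermal}
    for type_file in "$base"/thermal_zone*/trip_point_*_type; do
        [ -r "$type_file" ] || continue
        [ "$(cat "$type_file")" = "passive" ] || continue
        # trip_point_N_type -> trip_point_N_temp
        temp_file=${type_file%_type}_temp
        [ -r "$temp_file" ] || continue
        temp=$(cat "$temp_file")
        if [ "$temp" -le 0 ] 2>/dev/null; then
            echo "$temp_file: $temp"
        fi
    done
}

# On the affected machine:
#   find_bogus_passive
```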
Comment 62 Nate Cornell 2012-07-12 02:16:25 UTC
Created attachment 75231 [details]
cpufreq thermal and acpidump from 2.6.35-32

Here are the cpufreq, thermal, and acpidump output logs.

There were a couple errors on the acpidump; I pasted them into the bottom of the report.
Comment 63 Nate Cornell 2012-07-12 02:17:48 UTC
(In reply to comment #61)
> 
> the processor is put to deepest T-state, and this is why your machine is
> slow.

That makes sense; I see that its maximum scaling frequency is also very low.

> 
> could you please attach the same information in a working kernel?
> could you please attach the acpidump output?

These are attached.

I ran a diff on the output of the thermal and cpufreq greps (the '.old' files are from the bad kernel, '.log' from a working one):

$ diff cpufreq.log cpufreq.old 
13c13
< /sys/devices/system/cpu/cpu0/cpufreq/scaling_max_freq:2534000
---
> /sys/devices/system/cpu/cpu0/cpufreq/scaling_max_freq:1013600
28c28
< /sys/devices/system/cpu/cpu1/cpufreq/scaling_max_freq:2534000
---
> /sys/devices/system/cpu/cpu1/cpufreq/scaling_max_freq:1013600

As I noted above, the maximum scaling frequency is super low. That appears to be the only difference here. As for the thermal output:

diff thermal.log thermal.old 
1c1
< /sys/class/thermal/cooling_device0/cur_state:0
---
> /sys/class/thermal/cooling_device0/cur_state:10
4c4
< /sys/class/thermal/cooling_device1/cur_state:0
---
> /sys/class/thermal/cooling_device1/cur_state:10
9a10,11
> /sys/class/thermal/thermal_zone0/cdev0_trip_point:1
> /sys/class/thermal/thermal_zone0/cdev1_trip_point:1
11,12c13
< /sys/class/thermal/thermal_zone0/passive:0
< /sys/class/thermal/thermal_zone0/temp:42000
---
> /sys/class/thermal/thermal_zone0/temp:39000
14a16,17
> /sys/class/thermal/thermal_zone0/trip_point_1_temp:0
> /sys/class/thermal/thermal_zone0/trip_point_1_type:passive

There are quite a few differences, most notably the 4 trip point entries in the bad kernel version that don't even exist in the good one, and as you noted, mark 0C or 1C as the tripping temp.
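
To put the two scaling_max_freq values from the cpufreq diff in proportion, here is a small arithmetic sketch. The function name is hypothetical; the numbers are the ones from the diff above, and the division is integer division.

```shell
#!/bin/sh
# Hypothetical helper: print PART as an integer percentage of WHOLE.
percent_of() {
    echo $(( $1 * 100 / $2 ))
}

# Bad kernel's limit relative to the good kernel's limit:
percent_of 1013600 2534000   # prints 40
```

So the affected kernel caps both cores at 40% of the frequency the working kernel allows.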
Comment 64 Zhang Rui 2012-07-25 02:54:54 UTC
There were no passive trip points before; this is kind of weird to me.

could you please revert 9bcb8118965ab4631a65ee0726e6518f75cda6c5 and see if the problem still exists?
Comment 65 Zhang Rui 2012-07-25 03:25:54 UTC
Created attachment 76061 [details]
debug patch to check why passive trip point is not valid

please apply the patch in 2.6.38 and attach the dmesg output after boot.
I need to check why passive trip point is not valid in 2.6.38.
Comment 66 Nate Cornell 2012-07-26 02:07:22 UTC
Created attachment 76121 [details]
Output of patch command

The patch failed for me; attached is the output.
Perhaps I ran the wrong command?
Comment 67 Nate Cornell 2012-07-26 02:29:57 UTC
(In reply to comment #64)
> 
> could you please revert 9bcb8118965ab4631a65ee0726e6518f75cda6c5 and see if
> the
> problem still exists?

I'm sorry, I am not sure how to do this. I read up on 'git revert' but it appears to be aimed at developers adding branches; I'm not sure what to do in my situation.
Comment 68 Nate Cornell 2012-07-26 13:14:32 UTC
Created attachment 76141 [details]
Dmesg and timer_list output good and bad kernels for Feng

(In reply to comment #59)
> 1. with 2 options "nohz=off" or "hpet=disable", if the "slow" problem is
> solved
> or still there? 

The problem still exists after booting to either of those two options.

> 2. the "dmesg" "/proc/timer_list" for the 2 options, if there is any warning
> or
> error message in kernel log, that'll be great
> 3. the "dmesg" "/proc/timer_list" for the last latest working kernel version,
> from previous records, it is 2.6.35.xxx?

These are included in the attached zip.

Sorry it took so long for me to get back to you.
Comment 69 Carles 2012-07-26 16:37:59 UTC
(In reply to comment #65)
> Created an attachment (id=76061) [details]
> debug patch to check why passive trip point is not valid
> 
> please apply the patch in 2.6.38 and attach the dmesg output after boot.
> I need to check why passive trip point is not valid in 2.6.38.

I got v2.6.38 from git and applied your patch, this is the result of running dmesg after boot:

https://dl.dropbox.com/u/2231399/arxiu_dmesg.tar.gz

Please tell me if I missed anything.

(In reply to comment #66)
> Created an attachment (id=76121) [details]
> Output of patch command
> 
> The patch failed for me; attached is the output.
> Perhaps I ran the wrong command?

I applied the patch to a git version and compiled, here are a couple packages in case you want to try it out:

https://dl.dropbox.com/u/2231399/linux-headers-2.6.38-custom_2.6.38-custom-10.00.Custom_amd64.deb

https://dl.dropbox.com/u/2231399/linux-image-2.6.38-custom_2.6.38-custom-10.00.Custom_amd64.deb

PS: I feel stupid for asking this but I don't know how to create attachments.
Comment 70 Zhang Rui 2012-08-01 06:52:27 UTC
(In reply to comment #69)
> (In reply to comment #65)
> > Created an attachment (id=76061) [details] [details]
> > debug patch to check why passive trip point is not valid
> > 
> > please apply the patch in 2.6.38 and attach the dmesg output after boot.
> > I need to check why passive trip point is not valid in 2.6.38.
> 
> I got v2.6.38 from git and applied your patch, this is the result of running
> dmesg after boot:
> 
> https://dl.dropbox.com/u/2231399/arxiu_dmesg.tar.gz
> 
> Please tell me if I missed anything.
> 
sorry, I can not access this page...

> 
> PS: I feel stupid for asking this but I don't know how to create attachments.

look for "Add an attachment" in this page, click it and upload your local arxiu_dmesg.tar.gz
Comment 71 Zhang Rui 2012-08-01 06:53:54 UTC
(In reply to comment #66)
> Created an attachment (id=76121) [details]
> Output of patch command
> 
> The patch failed for me; attached is the output.
> Perhaps I ran the wrong command?

what command did you run?
you just need to make sure you are under the kernel source root directory and run "patch -p1 < patchname"
Comment 72 Carles 2012-08-01 11:34:07 UTC
Created attachment 76631 [details]
dmesg output after applying zangu rui's patch to 2.6.38
Comment 73 Carles 2012-08-01 11:39:30 UTC
(In reply to comment #70)
> (In reply to comment #69)
> 
> sorry, I can not access this page...
> 
> > 
> > PS: I feel stupid for asking this but I don't know how to create
> attachments.
> 
> look for "Add an attachment" in this page, click it and upload your local
> arxiu_dmesg.tar.gz

Thank you Zhang Rui, I wasn't looking in the proper place. I guess some places block dropbox, I added it as an attachment instead.
Comment 74 Zhang Rui 2012-08-15 06:40:41 UTC
(In reply to comment #72)
> Created an attachment (id=76631) [details]
> dmesg output after applying zangu rui's patch to 2.6.38

weird, the dmesg shows that the passive trip point is valid in 2.6.38.
could you please apply the debug patch attached below and then attach both the dmesg output and the output of "grep . /sys/class/thermal/*/*"?
Comment 75 Zhang Rui 2012-08-15 06:41:33 UTC
Created attachment 77761 [details]
refreshed debug patch
Comment 76 Carles 2012-08-16 20:02:22 UTC
Created attachment 77841 [details]
dmesg output of the refreshed patch as requested by Zhang Rui

This should be it, I hope I did it right this time.
Comment 77 Carles 2012-08-16 20:10:49 UTC
If someone has this laptop model and wants deb packages of the patched kernel to try and see if the output is the same just ask and I'll upload them somewhere.
Comment 78 George B. 2012-08-29 10:49:00 UTC
Hello,

I think I also have this problem on my Dell Precision M4600 laptop.
---
processor       : 7
vendor_id       : GenuineIntel
cpu family      : 6
model           : 42
model name      : Intel(R) Core(TM) i7-2820QM CPU @ 2.30GHz
stepping        : 7
microcode       : 0x23
cpu MHz         : 800.000
---

(There are 8 cores in total, all limited to 800MHz).

cat /sys/devices/system/cpu/cpu*/cpufreq/cpuinfo_min_freq
---
800000
800000
800000
800000
800000
800000
800000
800000
---

cat /sys/devices/system/cpu/cpu*/cpufreq/cpuinfo_max_freq
---
2301000
2301000
2301000
2301000
2301000
2301000
2301000
2301000
---

cat /sys/devices/system/cpu/cpu*/cpufreq/cpuinfo_cur_freq
---
800000
800000
800000
800000
800000
800000
800000
800000
---

cpuinfo_cur_freq remains the same even when I put 100% load on the cores.

In my case I do not appear to have the thermal throttling mentioned in comment #61:
grep . /sys/class/thermal/*/* 2>/dev/null
/sys/class/thermal/cooling_device0/cur_state:0
/sys/class/thermal/cooling_device0/max_state:10
/sys/class/thermal/cooling_device0/type:Processor
/sys/class/thermal/cooling_device1/cur_state:0
/sys/class/thermal/cooling_device1/max_state:10
/sys/class/thermal/cooling_device1/type:Processor
/sys/class/thermal/cooling_device2/cur_state:0
/sys/class/thermal/cooling_device2/max_state:10
/sys/class/thermal/cooling_device2/type:Processor
/sys/class/thermal/cooling_device3/cur_state:0
/sys/class/thermal/cooling_device3/max_state:10
/sys/class/thermal/cooling_device3/type:Processor
/sys/class/thermal/cooling_device4/cur_state:0
/sys/class/thermal/cooling_device4/max_state:10
/sys/class/thermal/cooling_device4/type:Processor
/sys/class/thermal/cooling_device5/cur_state:0
/sys/class/thermal/cooling_device5/max_state:10
/sys/class/thermal/cooling_device5/type:Processor
/sys/class/thermal/cooling_device6/cur_state:0
/sys/class/thermal/cooling_device6/max_state:10
/sys/class/thermal/cooling_device6/type:Processor
/sys/class/thermal/cooling_device7/cur_state:0
/sys/class/thermal/cooling_device7/max_state:10
/sys/class/thermal/cooling_device7/type:Processor
/sys/class/thermal/cooling_device8/cur_state:0
/sys/class/thermal/cooling_device8/max_state:15
/sys/class/thermal/cooling_device8/type:LCD
/sys/class/thermal/thermal_zone0/mode:enabled
/sys/class/thermal/thermal_zone0/passive:0
/sys/class/thermal/thermal_zone0/temp:25000
/sys/class/thermal/thermal_zone0/trip_point_0_temp:107000
/sys/class/thermal/thermal_zone0/trip_point_0_type:critical
/sys/class/thermal/thermal_zone0/type:acpitz
---

Kernel version is 3.2.0-3-amd64 (Debian Sid).

Please let me know what additional information you need to troubleshoot and fix this - I will be happy to help.


Thanks,

George
Comment 79 George B. 2012-08-29 10:57:29 UTC
P.S. I have tried the 'sleep' trick (via s2ram) and it appears to work for me at least. :-)
Comment 80 Zhang Rui 2012-09-19 06:27:45 UTC
(In reply to comment #76)
> Created an attachment (id=77841) [details]
> dmesg output of the refreshed patch as requested by Zhang Rui
> 
> This should be it, I hope I did it right this time.

/sys/class/thermal/thermal_zone0/trip_point_1_temp:0
/sys/class/thermal/thermal_zone0/trip_point_1_type:passive

it shows that the passive trip point is also valid in 2.6.38 and it is 0C!
and the processors are also throttled.
which means the system should also be slow at this time.

Carles, is it true?
if yes, it does not seems to be a regression unless you can find out a kernel without this problem.
Comment 81 Zhang Rui 2012-09-19 06:29:55 UTC
George,
your problem is different. the processors are not put into deep T-state.

you'd better open a new bug report saying that all the processors are limited to the lowest frequency.
please attach the output of 
"grep . /sys/devices/system/cpu/cpu*/cpufreq/*"
in your bug report.
Comment 82 Carles 2012-09-20 18:11:48 UTC
(In reply to comment #80)
> (In reply to comment #76)
> > Created an attachment (id=77841) [details] [details]
> > dmesg output of the refreshed patch as requested by Zhang Rui
> > 
> > This should be it, I hope I did it right this time.
> 
> /sys/class/thermal/thermal_zone0/trip_point_1_temp:0
> /sys/class/thermal/thermal_zone0/trip_point_1_type:passive
> 
> it shows that the passive trip point is also valid in 2.6.38 and it is 0C!
> and the processors are also throttled.
> which means the system should also be slow at this time.
> 
> Carles, is it true?
> if yes, it does not seems to be a regression unless you can find out a kernel
> without this problem.

Yes the system was really slow when I got those logs of v2.6.38 with your patch applied.

I understand the values in the logs are unexpected, is it possible that this is due to me doing something wrong or missing a step in the process?

I'm sorry but I'm a bit lost and not very sure I understand properly, what do you mean it is not a regression? Do you mean that every previous version also has this problem?

That would be very odd because I'm running 2.6.35 right now and it doesn't show any of the symptoms. What can I do to find a kernel without this problem? Should I checkout v2.6.35 from git, apply your patch and give you the output?
Comment 83 Nate Cornell 2012-10-20 03:01:14 UTC
(In reply to comment #80)
> if yes, it does not seems to be a regression unless you can find out a kernel
> without this problem.

The 2.6.35 kernels do not have this problem,the trip point temp did not exist in those versions as I noted in comment #63:

> There are quite a few differences, most notably the 4 trip point entries in
> the
> bad kernel version that don't even exist in the good one
Comment 84 Nate Cornell 2012-10-20 03:28:08 UTC
(In reply to comment #83)
> The 2.6.35 kernels do not have this problem,the trip point temp did not exist
> in those versions

Just to clarify my statement; only the trip_point_1 files are missing in the working kernel versions. The trip_point_0 ones do exist and are valid (from kernel 2.6.35-32):
/sys/class/thermal/thermal_zone0/trip_point_0_temp:110000
/sys/class/thermal/thermal_zone0/trip_point_0_type:critical
Comment 85 Nate Cornell 2012-10-21 02:01:23 UTC
A possibly dangerous workaround for this bug is to add the boot option "thermal.off=1" to the grub menu entry for the affected kernel.

With the thermal module disabled the system could possibly run hotter, according to Len Brown, who implemented the feature (https://lkml.org/lkml/2007/8/14/48). Use with caution.

It looks like there is also a patch that would make the trip points writable (https://patchwork.kernel.org/patch/1215301/), but I haven't messed with that at all. Not sure if it has actually been included in any recent kernels. In theory, however, if they were writable one could just modify the trip_point_1_temp file to include the correct temperature, although this might have to be done at every boot, right?

How can we get the kernel to automatically set the thermal trip points correctly for this hardware? 

It may just be that the zone 1 trip point isn't detected correctly, which is why the zone 1 trip points didn't even exist in the unaffected kernels. Perhaps it could write a "safe" value if the detection fails? Anything is better than '0'.
Comment 86 Zhang Rui 2012-10-24 07:26:24 UTC
(In reply to comment #84)
> (In reply to comment #83)
> > The 2.6.35 kernels do not have this problem,the trip point temp did not
> exist
> > in those versions
> Just to clarify my statement; only the trip_point_1 files are missing in the
> working kernel versions. The trip_point_0 ones do exist and are valid (from
> kernel 2.6.35-32):
> /sys/class/thermal/thermal_zone0/trip_point_0_temp:110000
> /sys/class/thermal/thermal_zone0/trip_point_0_type:critical

then why not apply my debug patch and see why trip_point_1 is missing in 2.6.35?
Comment 87 Carles 2012-10-24 15:49:17 UTC
(In reply to comment #86)
> then why not apply my debug patch and see why trip_point_1 is missing in
> 2.6.35?

I intended to do this back in September but I've been busy and unable to, I'll do it as soon as possible. However, because of my dubious track record I think it would be helpful if someone else with access to the hardware did the same to be able to double check my results.
Comment 88 Nate Cornell 2012-10-24 16:10:13 UTC
(In reply to comment #86)
> then why not apply my debug patch and see why trip_point_1 is missing in
> 2.6.35?

I'm sorry, I had tried and failed when you gave us the first patch, and never tried it again because I thought Carles had been successful.

I've been very busy with work and school, but I will attempt to install the patch this weekend.
Just to verify, we need to use the '-p1' switch on this patch, correct? (I'm very new to patching)
Comment 89 Carles 2012-10-24 18:15:04 UTC
Created attachment 84681 [details]
dmesg output of v2.6.35 with Zhang Rui's patch

Here is the output of running both "dmesg" and "grep . /sys/class/thermal/*/*" on the patched v2.6.35 kernel, I can upload deb packages of the patched kernel if anyone is interested in trying it.
Comment 90 Zhang Rui 2012-11-07 02:43:31 UTC
Hah,

I see what the problem is.
1) the passive trip point temperature is a bogus value which always equals 0C.
2) _PSL uses alias name of processors
   a) the alias name is not supported in earlier kernels like 2.6.35, so Linux thinks the passive trip point is invalid.
   b) alias name is supported in later kernel, thus the bug in 1) is exposed to users.

so my suggestion is to use thermal.psv=-1 to get rid of the bogus passive trip point.
If you really want to use processors to cooling your system, you can 
"echo temp > /sys/class/thermal_zone0/passive", temp is the millidegree value that you want to start throttling at.
Comment 91 Abilio Henrique 2012-11-07 02:46:05 UTC
So what does this mean from a kernel point of view? Is there a fix for this?
Comment 92 Zhang Rui 2012-11-07 03:00:31 UTC
Created attachment 85731 [details]
customized DSDT

BTW, could you please set CONFIG_ACPI_DEBUG=y, apply the custom DSDT following steps 5, 6 and 7 at https://lesswatts.org/projects/acpi/overridingDSDT.php,
and boot with the kernel parameters "acpi.debug_layer=0xffffffff acpi.debug_level=0x2"?

After boot, please check if the problem still exists. And
please attach the dmesg output after boot.
please attach the output of "grep . /sys/class/thermal/*/*" after boot.
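
Before collecting the logs, it may be worth confirming that the debug parameters actually reached the kernel. This is a rough sketch only: the helper names has_boot_param and check_acpi_debug_params are invented for illustration, and the sketch assumes the parameters appear on /proc/cmdline exactly as typed above.

```shell
#!/bin/sh
# Succeed if the exact parameter ($1) appears on the kernel command line.
# $2 is the file to read; it defaults to /proc/cmdline.
has_boot_param() {
    # One word per line, then an exact fixed-string whole-line match.
    tr ' ' '\n' < "${2:-/proc/cmdline}" | grep -qxF -- "$1"
}

# Report on the two ACPI debug parameters requested above.
# Returns non-zero if either one is missing.
check_acpi_debug_params() {
    status=0
    for param in acpi.debug_layer=0xffffffff acpi.debug_level=0x2; do
        if has_boot_param "$param" "${1:-/proc/cmdline}"; then
            echo "present: $param"
        else
            echo "missing: $param"
            status=1
        fi
    done
    return "$status"
}
```

If check_acpi_debug_params reports both parameters as present, the dmesg output should contain the extra ACPI debug messages, provided the kernel was built with CONFIG_ACPI_DEBUG=y.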
Comment 93 Zhang Rui 2012-11-14 06:10:35 UTC
ping...

can you please try this custom DSDT to see if we can make the passive trip point work as we want?
Comment 94 Carles 2012-11-15 03:00:04 UTC
Sorry about the delay in answering, I am currently abroad and do not have access to the laptop. I will make sure to test this as soon as I get back home next week.
Thank you for your time and sorry again for delaying so much, I didn't expect to be away for so long.
Comment 95 Nate Cornell 2012-11-15 07:43:41 UTC
Created attachment 86391 [details]
dmesg output on 3.2.0-33 after DSDT patch
Comment 96 Nate Cornell 2012-11-15 07:46:56 UTC
Created attachment 86401 [details]
output of thermal grep on 3.2.0-33 after DSDT patch

(In reply to comment #92)

Here's the output of the grep command and the dmesg after applying the patch.

The problem still appears to exist, however.
Comment 97 Nate Cornell 2012-11-15 15:43:35 UTC
Hold on; I did not compile that with CONFIG_ACPI_DEBUG=y; I will try again this morning with that switch and re-post those files.
Comment 98 Nate Cornell 2012-11-16 03:41:47 UTC
Created attachment 86491 [details]
dmesg output on 3.2.0-33 after DSDT patch with CONFIG_ACPI_DEBUG=y

Here's the dmesg output after applying the DSDT patch and compiling with the CONFIG_ACPI_DEBUG=y option.
Comment 99 Nate Cornell 2012-11-16 03:44:07 UTC
Created attachment 86501 [details]
grep of thermal on 3.2.0-33 after DSDT patch with CONFIG_ACPI_DEBUG=y

This is the thermal grep output with the DSDT patch.
Comment 100 Zhang Rui 2012-11-28 10:15:54 UTC
Please try the patch here
https://patchwork.kernel.org/patch/1812481/
without customized DSDT.
Comment 101 Zhang Rui 2012-11-28 10:17:30 UTC
Created attachment 87491 [details]
custom DSDT

you can also try this new customized DSDT either w/ or w/o the patch and see if it helps.
please attach the dmesg output and output of "grep . /sys/class/thermal/*/*" after applying the new DSDT.
Comment 102 Carles 2012-11-30 01:14:16 UTC
Created attachment 87871 [details]
Dmesg and thermal output of the patch found in #100

I have patched the 3.7 kernel, but sadly it does not look like it completely fixes the problem. I have attached the dmesg output and the output of "grep . /sys/class/thermal/*/*".
Comment 103 Zhang Rui 2012-11-30 04:58:50 UTC
(In reply to comment #102)
> Created an attachment (id=87871) [details]
> Dmesg and thermal output of the patch found in #100
> 
> I have patched the 3.7 kernel but sadly it does not look like it completely
> fixes the problem, I have attached dmesg output and the output of "grep .
> /sys/class/thermal/*/*".

okay.
so does the custom DSDT in comment #101 help?
Comment 104 Zhang Rui 2012-11-30 05:28:58 UTC
anyway, this seems to be a BIOS problem to me.
Closing the bug now, but we can continue our discussion to see if we can find a workaround for this firmware issue.
Comment 105 Nate Cornell 2012-12-07 01:15:12 UTC
Fair enough. Thanks for all your help.

In the meantime, what can we do for a workaround? The thermal trip points are not writeable, or else we might be able to set them manually.

There is the thermal.off boot feature, but I'm not sure whether that might have negative or possibly dangerous side effects.
Comment 106 Abilio Henrique 2013-03-10 23:29:09 UTC
So what I don't get right now is what solution we have for this problem. This was not occurring in older kernels, and I'm not keen on having my laptop overheat constantly by having ACPI turned off, which I can tell has already had an impact on the CPU fans from prolonged use without any thermal throttling enabled due to ACPI=off.

Is there a practical solution we can implement to make this laptop work with newer kernels?
Comment 107 Zhang Rui 2013-03-11 02:04:04 UTC
First, you can use thermal.psv=-1 to invalidate the ACPI thermal passive trip point.
Then, if you really want to use the processor to keep your laptop cool, please run "echo xxx > /sys/class/thermal/thermal_zone0/passive", where xxx is the temperature in millidegrees Celsius at which you want passive cooling to start.
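
The second step can be wrapped in a small script so the value is always written in millidegrees. This is just a sketch under stated assumptions: the function names c_to_millic and set_passive_trip are hypothetical, the 85 in the usage comment is an arbitrary example rather than a recommended threshold, and writing to the sysfs file needs root.

```shell
#!/bin/sh
# Convert whole degrees Celsius to the millidegree value sysfs expects.
c_to_millic() {
    echo $(( $1 * 1000 ))
}

# Write the passive-cooling threshold for a thermal zone.
#   $1: temperature in whole degrees Celsius
#   $2: path of the "passive" file (defaults to thermal_zone0)
set_passive_trip() {
    target=${2:-/sys/class/thermal/thermal_zone0/passive}
    c_to_millic "$1" > "$target"
}

# Example (as root, after booting with thermal.psv=-1):
#   set_passive_trip 85    # 85 is an illustrative value only
```

The setting does not survive a reboot, so the call would have to be repeated at boot, for instance from a startup script.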
