Bug 11516 - severe performance degradation on x86_64 going from 2.6.26-rc9 -> 2.6.27-rc5
Summary: severe performance degradation on x86_64 going from 2.6.26-rc9 -> 2.6.27-rc5
Status: CLOSED CODE_FIX
Alias: None
Product: ACPI
Classification: Unclassified
Component: Power-Thermal (show other bugs)
Hardware: All Linux
: P1 normal
Assignee: Ingo Molnar
URL:
Keywords:
: 11651 11719 (view as bug list)
Depends on:
Blocks: Regressions-2.6.26
  Show dependency tree
 
Reported: 2008-09-07 13:59 UTC by Jason Vas Dias
Modified: 2009-04-30 12:59 UTC (History)
13 users (show)

See Also:
Kernel Version: 2.6.27-rc6
Subsystem:
Regression: Yes
Bisected commit-id:


Attachments
reboot log (38.45 KB, application/octet-stream)
2008-09-07 14:02 UTC, Jason Vas Dias
Details
kernel config for the 2.6.26-rc9 version (80.36 KB, application/octet-stream)
2008-09-07 14:04 UTC, Jason Vas Dias
Details
kernel config for the 2.6.27-rc5 version (82.94 KB, application/octet-stream)
2008-09-07 14:05 UTC, Jason Vas Dias
Details
boot log from 2.6.27-rc7-tip.jvd, which fails to set CPU freq above 800Khz (263.36 KB, application/octet-stream)
2008-09-27 18:35 UTC, Jason Vas Dias
Details
config from 2.6.27-rc7-tip.jvd (86.56 KB, application/octet-stream)
2008-09-27 18:35 UTC, Jason Vas Dias
Details
System.map from 2.6.27-rc7-tip.jvd (313.58 KB, application/x-bzip2)
2008-09-27 18:37 UTC, Jason Vas Dias
Details
boot log from 2.6.26.5.jvd, which does not have the problem (242.42 KB, application/octet-stream)
2008-09-27 18:38 UTC, Jason Vas Dias
Details
config from 2.6.26.5.jvd (83.72 KB, application/octet-stream)
2008-09-27 18:39 UTC, Jason Vas Dias
Details
System map from 2.6.26.5.jvd (293.01 KB, application/x-bzip2)
2008-09-27 18:39 UTC, Jason Vas Dias
Details
acpidump from 2.6.26.5.jvd (329.25 KB, application/octet-stream)
2008-09-27 18:43 UTC, Jason Vas Dias
Details
output of acpidump for 2.6.27-rc7-tip.jvd (80.09 KB, application/x-bzip2)
2008-09-27 20:28 UTC, Jason Vas Dias
Details
/proc/acpi files of 2.6.27-rc7-tip.jvd (104.84 KB, application/x-bzip2)
2008-09-27 20:30 UTC, Jason Vas Dias
Details
/proc/acpi files of 2.6.26.5.jvd (104.87 KB, application/x-bzip2)
2008-09-27 20:31 UTC, Jason Vas Dias
Details
dmidecode output under 2.6.26.5 (6.98 KB, application/octet-stream)
2008-10-04 11:57 UTC, Jason Vas Dias
Details
Patch to test (659 bytes, patch)
2008-10-05 12:07 UTC, Rafael J. Wysocki
Details | Diff
log from 2.6.27-tip (311.53 KB, text/plain)
2008-10-11 15:39 UTC, Jason Vas Dias
Details
Patch is against today's Linus' git (v2.6.27-3977-gc166ab7) (3.21 KB, patch)
2008-10-14 12:08 UTC, herrmann.der.user
Details | Diff
lspci -nnvxxxx under 2.6.27-jvd ( linus + tip + roland trees of 2008-10-15@21:00 ) (106.87 KB, text/plain)
2008-10-15 19:21 UTC, Jason Vas Dias
Details
boot log from 2.6.27-rc7-jvd (linus + tip + roland GIT trees 2008-10-15@21:00) (313.75 KB, text/plain)
2008-10-15 19:22 UTC, Jason Vas Dias
Details
remove superfluous DMI quirks (3.58 KB, patch)
2008-10-16 05:32 UTC, herrmann.der.user
Details | Diff
x86: call dmi-quirks for HP Laptops after early-quirks are executed (2.23 KB, patch)
2008-10-22 04:27 UTC, herrmann.der.user
Details | Diff

Description Jason Vas Dias 2008-09-07 13:59:03 UTC
Latest working kernel version:
2.6.26-rc9
(git://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux-2.6.git @2008-07-10)

Earliest failing kernel version:
2.6.27-rc9 
(git://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux-2.6.git @2008-09-07)

Distribution:
Gentoo / my Own + Fedora 10

Hardware Environment:
x86_64 AMD Turion TL-64 64-bit dual core

Software Environment:
gcc-4.2.4 , glibc-2.8.90, both latest from savannah CVS @ 2008-07-10

Problem Description:

After my monthly kernel update ( git-pull from Linus' tree + build )
the whole machine started crawling instead of sprinting - this was
on my "own distro" (originally Gentoo based).

Updating Fedora 10 from 2.6.26-0.115.rc9.git2.fc10 to 
2.6.27-0.305.rc5.git6.fc10 had the same effect 
on a bog-standard Fedora 10 installation.

Before I get out the toolbox and start investigating in-depth 
(which I don't really have time to do at the moment), I thought
I'd check here first if anyone has seen this issue and if there
is any quick configuration / boot arg fix ?

To quantify, normally boot-up takes less than 1 minute with the 2.6.26
kernel; with the 2.6.27 kernel, it takes over 10 minutes, and the screen
is flooded with these "hub event change" messages - and the terminal 
scrolls visibly slowly, and seems to be unable to keep up with even
my extremely slow typing rate.

I'm attaching the config-2.6.26-rc9.jvd and config-2.6.27-rc5.jvd files,
and klogd logs from a bootup of each kernel.

Any advice suggestions as to how to fix would be gratefully received.
 

Steps to reproduce:

build and install latest kernel from Linus' tree.
Comment 1 Jason Vas Dias 2008-09-07 14:02:07 UTC
Created attachment 17659 [details]
reboot log

kernel log messages (with debug enabled) showing a boot of 2.6.26-rc5 , then, following the line beginning "REBOOT", a boot of 2.6.27-rc9
Comment 2 Jason Vas Dias 2008-09-07 14:04:47 UTC
Created attachment 17660 [details]
kernel config for the 2.6.26-rc9 version
Comment 3 Jason Vas Dias 2008-09-07 14:05:32 UTC
Created attachment 17661 [details]
kernel config for the 2.6.27-rc5 version
Comment 4 Jason Vas Dias 2008-09-07 14:37:15 UTC
As seen from the <a href="http://bugzilla.kernel.org/attachment.cgi?id=17659&action=view"> Reboot Log attachment </a>, the time for the 2.6.26-rc9 
boot up was @ 10 seconds:
Sep  7 12:33:42 localhost kernel: Loaded 41213 symbols from /boot/System.map-2.6.26-rc9.jvd. 
Sep  7 12:33:53 localhost init: Entering runlevel: 3

The 2.6.27-rc5 initial boot up took 10 minutes, and the System.map file is not
found (even though it DOES exist and is identical to the System.map from the
build) and the log daemon is so flooded by these new "hub 3-0:1.0: state 7" 
messages that it gives up and exits - I did not kill it . 

Sep  7 15:56:33 localhost kernel: Inspecting /boot/System.map
Sep  7 15:56:33 localhost kernel: Cannot find map file.
Sep  7 15:56:42 localhost dhclient: bound to 135.207.58.230 -- renewal in 1721 seconds.
Sep  7 15:57:01 localhost kernel: [   50.204477] hub 3-0:1.0: state 7 ports 2 chg 0000 evt 0000
Sep  7 15:57:01 localhost kernel: [   50.454454] hub 1-0:1.0: state 7 ports 2 chg 0000 evt 0000
Sep  7 15:57:01 localhost kernel: [   50.454462] hub 3-0:1.0: state 7 ports 2 chg 0000 evt 0000
Sep  7 15:57:01 localhost kernel: Kernel logging (proc) stopped.
S


After @ 10mins of a single-user boot, a prompt was finally displayed.

I then tried to reboot, and after the machine entered the "HALT" state
(blank screen) it was unable to power-off - I had to manually reset 
using the power button twice before it was able to reboot.

Exactly the same behavior occurs with the latest Fedora 10 kernel, so I
don't think it's my config. 

I'm guessing it has something to do with 
CONFIG_TICK_ONESHOT=y
CONFIG_NO_HZ=y
CONFIG_HIGH_RES_TIMERS=y
...
CONFIG_HZ_1000=y
CONFIG_HZ=1000
CONFIG_SCHED_HRTICK=y

and that's where I'll start investigating - also I noticed
Arjan's "high res timer" tree and think I'll try that next -
but I thought this was already in Linus' tree?

Anyone got any other suggestions ?
Comment 5 Anonymous Emailer 2008-09-07 15:03:39 UTC
Reply-To: akpm@linux-foundation.org


(switched to email.  Please respond via emailed reply-to-all, not via the
bugzilla web interface).

On Sun,  7 Sep 2008 13:59:04 -0700 (PDT) bugme-daemon@bugzilla.kernel.org wrote:

> http://bugzilla.kernel.org/show_bug.cgi?id=11516
> 
>            Summary: severe performance degradation on x86_64 going from
>                     2.6.26-rc9 -> 2.6.27-rc5
>            Product: Other
>            Version: 2.5
>      KernelVersion: 2.6.27-rc9
>           Platform: All
>         OS/Version: Linux
>               Tree: Mainline
>             Status: NEW
>           Severity: normal
>           Priority: P1
>          Component: Other
>         AssignedTo: other_other@kernel-bugs.osdl.org
>         ReportedBy: jason.vas.dias@gmail.com
> 
> 
> Latest working kernel version:
> 2.6.26-rc9
> (git://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux-2.6.git
> @2008-07-10)
> 
> Earliest failing kernel version:
> 2.6.27-rc9 
> (git://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux-2.6.git
> @2008-09-07)
> 
> Distribution:
> Gentoo / my Own + Fedora 10
> 
> Hardware Environment:
> x86_64 AMD Turion TL-64 64-bit dual core
> 
> Software Environment:
> gcc-4.2.4 , glibc-2.8.90, both latest from savannah CVS @ 2008-07-10
> 
> Problem Description:
> 
> After my monthly kernel update ( git-pull from Linus' tree + build )
> the whole machine started crawling instead of sprinting - this was
> on my "own distro" (originally Gentoo based).
> 
> Updating Fedora 10 from 2.6.26-0.115.rc9.git2.fc10 to 
> 2.6.27-0.305.rc5.git6.fc10 had the same effect 
> on a bog-standard Fedora 10 installation.
> 
> Before I get out the toolbox and start investigating in-depth 
> (which I don't really have time to do at the moment), I thought
> I'd check here first if anyone has seen this issue and if there
> is any quick configuration / boot arg fix ?
> 
> To quantify, normally boot-up takes less than 1 minute with the 2.6.26
> kernel; with the 2.6.27 kernel, it takes over 10 minutes, and the screen
> is flooded with these "hub event change" messages - and the terminal 
> scrolls visibly slowly, and seems to be unable to keep up with even
> my extremely slow typing rate.
> 
> I'm attaching the config-2.6.26-rc9.jvd and config-2.6.27-rc5.jvd files,
> and klogd logs from a bootup of each kernel.
> 
> Any advice suggestions as to how to fix would be gratefully received.
> 
> 
> Steps to reproduce:
> 
> build and install latest kernel from Linus' tree.
> 

I've tentatively categorised this post-2.6.26 regression under USB.

Sep  7 15:56:33 localhost kernel: [   21.190720] device: 'vcsa11': device_unregister
Sep  7 15:56:33 localhost kernel: [   21.190721] PM: Removing info for No Bus:vcsa11
Sep  7 15:56:33 localhost kernel: [   21.190788] device: 'vcsa11': device_create_release
Sep  7 15:56:33 localhost kernel: [   21.205252] hub 1-0:1.0: state 7 ports 2 chg 0000 evt 0000
Sep  7 15:56:33 localhost kernel: [   21.205305] hub 3-0:1.0: state 7 ports 2 chg 0000 evt 0000
Sep  7 15:56:33 localhost kernel: [   21.460430] hub 1-0:1.0: state 7 ports 2 chg 0000 evt 0000
Sep  7 15:56:33 localhost kernel: [   21.460454] hub 3-0:1.0: state 7 ports 2 chg 0000 evt 0000
Sep  7 15:56:33 localhost kernel: [   21.704357] hub 1-0:1.0: state 7 ports 2 chg 0000 evt 0000
Sep  7 15:56:33 localhost kernel: [   21.704366] hub 3-0:1.0: state 7 ports 2 chg 0000 evt 0000
Sep  7 15:56:33 localhost kernel: [   21.954352] hub 1-0:1.0: state 7 ports 2 chg 0000 evt 0000
Sep  7 15:56:33 localhost kernel: [   21.954374] hub 3-0:1.0: state 7 ports 2 chg 0000 evt 0000
Sep  7 15:56:33 localhost kernel: [   22.204433] hub 1-0:1.0: state 7 ports 2 chg 0000 evt 0000
Sep  7 15:56:33 localhost kernel: [   22.204441] hub 3-0:1.0: state 7 ports 2 chg 0000 evt 0000
Sep  7 15:56:33 localhost kernel: [   22.454376] hub 1-0:1.0: state 7 ports 2 chg 0000 evt 0000
Sep  7 15:56:33 localhost kernel: [   22.454384] hub 3-0:1.0: state 7 ports 2 chg 0000 evt 0000
Sep  7 15:56:33 localhost kernel: [   22.704406] hub 1-0:1.0: state 7 ports 2 chg 0000 evt 0000
Sep  7 15:56:33 localhost kernel: [   22.704414] hub 3-0:1.0: state 7 ports 2 chg 0000 evt 0000
Sep  7 15:56:33 localhost kernel: [   22.960311] hub 1-0:1.0: state 7 ports 2 chg 0000 evt 0000
Sep  7 15:56:33 localhost kernel: [   22.960335] hub 3-0:1.0: state 7 ports 2 chg 0000 evt 0000
Sep  7 15:56:34 localhost kernel: [   23.204362] hub 1-0:1.0: state 7 ports 2 chg 0000 evt 0000
Sep  7 15:56:34 localhost kernel: [   23.204384] hub 3-0:1.0: state 7 ports 2 chg 0000 evt 0000
Sep  7 15:56:34 localhost kernel: [   23.454386] hub 1-0:1.0: state 7 ports 2 chg 0000 evt 0000
Sep  7 15:56:34 localhost kernel: [   23.454394] hub 3-0:1.0: state 7 ports 2 chg 0000 evt 0000
Sep  7 15:56:34 localhost kernel: [   23.704357] hub 1-0:1.0: state 7 ports 2 chg 0000 evt 0000
Sep  7 15:56:34 localhost kernel: [   23.704366] hub 3-0:1.0: state 7 ports 2 chg 0000 evt 0000
Sep  7 15:56:34 localhost kernel: [   23.954377] hub 1-0:1.0: state 7 ports 2 chg 0000 evt 0000
Sep  7 15:56:34 localhost kernel: [   23.954696] hub 3-0:1.0: state 7 ports 2 chg 0000 evt 0000
Sep  7 15:56:35 localhost kernel: [   24.204380] hub 1-0:1.0: state 7 ports 2 chg 0000 evt 0000
Sep  7 15:56:35 localhost kernel: [   24.204389] hub 3-0:1.0: state 7 ports 2 chg 0000 evt 0000
Sep  7 15:56:35 localhost kernel: [   24.454398] hub 1-0:1.0: state 7 ports 2 chg 0000 evt 0000
Sep  7 15:56:35 localhost kernel: [   24.454406] hub 3-0:1.0: state 7 ports 2 chg 0000 evt 0000
Sep  7 15:56:35 localhost kernel: [   24.704379] hub 1-0:1.0: state 7 ports 2 chg 0000 evt 0000
Sep  7 15:56:35 localhost kernel: [   24.704402] hub 3-0:1.0: state 7 ports 2 chg 0000 evt 0000
Sep  7 15:56:35 localhost kernel: [   24.954381] hub 1-0:1.0: state 7 ports 2 chg 0000 evt 0000

but it might be an x86 problem or anything, it's unclear.

Jason, perhaps you could test a CONFIG_USB=n kernel?
Comment 6 Jason Vas Dias 2008-09-07 15:48:37 UTC
Well, I'm now trying Arjan's hrtimer tree:
  http://www.kernel.org/pub/scm/linux/kernel/git/arjan/linux-2.6-hrtimer.git
- there were quite a few HR timer related diffs between this and Linus' tree -
and I've disabled the "verbose USB core debug messages" config .
It's building now - I'll post the results when done.
Thanks,
Jason
Comment 7 Jason Vas Dias 2008-09-07 18:06:06 UTC
Arjan's hrtimer stuff had no effect on this problem - and now I disabled USB core
and driver core debug messages I get less information as to why.
The boot up is just as quick as with 2.6.26-rc9 until the kernel begins 
processing udev events - then things just seem to grind to a halt.
I'll try disabling USB completely next - isn't there any way to do so 
from the boot command line ?
Comment 8 Jason Vas Dias 2008-09-07 22:41:58 UTC
OK, this definitely is not purely a USB issue.

I recompiled Arjans hrtimer tree 
(  http://git.kernel.org/?p=linux/kernel/git/arjan/linux-2.6-hrtimer.git;a=commit;h=96d2ab484e7a9bafdab44b8c7d1ef5944319b18c
)
with CONFIG_USB_SUPPORT=no (BTW, why not add a 'usb=no' boot argument?)
and the machine still crawls under the resulting 2.6.27-rc5 kernel.

True, without USB the machine becomes usably quick (just!), and the kernel
debug log is no longer flooded with those 'hub 1-0:1.0: state 7 ports 2 chg 0000 evt 0000' messages; but it still crawls.

To quantify "crawls" :

Under 2.6.26-rc9 (Linus' tree from @ 2008-07-10), here are some comand times:

$ $ time (dd if=/dev/urandom bs=1MB count=1 2>/dev/null | sha1sum -;dd if=/dev/urandom bs=1MB count=1 2>/dev/null | sha1sum -;dd if=/dev/urandom bs=1MB count=1 >/dev/null | sha1sum -)
e026089285fe644aa48aab7ef6e3cfe36d3215f3  -
c9657aedb1fa5b2687ea1117846265f6ddd2e153  -
da39a3ee5e6b4b0d3255bfef95601890afd80709  -
1+0 records in
1+0 records out
1000000 bytes (1.0 MB) copied, 0.206451 s, 4.8 MB/s

real    0m0.700s
user    0m0.013s
sys     0m0.628s


Under the 2.6.27-rc5.jvd_no_usb kernel (Arjan's hrtimer tree), the same
command :
$ time (dd if=/dev/urandom bs=1MB count=1 2>/dev/null | sha1sum -;dd if=/dev/urandom bs=1MB count=1 2>/dev/null | sha1sum -;dd if=/dev/urandom bs=1MB count=1 >/dev/null | sha1sum -)
01f07429c8019c0e6a9c8d6934d58dc9fa3a8fc7  -
c11bff188c7b3501e7cae62cda213a92c5cddf73  -
da39a3ee5e6b4b0d3255bfef95601890afd80709  -
1+0 records in
1+0 records out
1000000 bytes (1.0 MB) copied, 2.68622 s, 372 kB/s

real    0m8.039s
user    0m0.120s
sys     0m7.021s

That's @ ( 8 / 0.7 = 11.42 ) times slower !

Even worse is the behavior of the terminal driver - under a VT console, 
NOT an xterm, this command takes nearly 140 times slower under 2.6.27-rc5
than under 2.6.26-rc9 :

Under 2.6.26-rc9:
 $ time dd if=/dev/urandom bs=100K count=1 | openssl base64
... # lots of base64 text output
real    0m0.129s
user    0m0.007s
sys     0m0.036s

Under 2.6.27-rc5:
 $ time dd if=/dev/urandom bs=100K count=1 | openssl base64
... # lots of base64 text output
real    0m18.066s
user    0m0.062s
sys     0m0.449s

That's 18 / 0.13 = 138 times slower !

This really makes 2.6.27-rc9 rather unusable because typing or viewing text
on a terminal is too slow.

There seems to have been a lot of activity in the high-res-timer + scheduling
areas for  x86_64 recently, so that would be my prime suspect - eg. with
2.6.26-rc9, you finally fixed this issue:
dmesg.2.6.24-1:[   35.299325] Could not switch to high resolution mode on CPU 0
dmesg.2.6.24-1:[   35.299946] Could not switch to high resolution mode on CPU 1
so for the first time, multi-core x86_64 CPUs finally run in high resolution
mode - thanks! But what happened to the x86_64 scheduling + timer code between
2.6.26 and 2.6.27 ? ...

I will investigate this more fully when I have more time - likely not before
next weekend - unless someone fixes this first ( I hope! ). I also
raised a Fedora bug #461441 on this, because this issue also occurs with 
the latest FC 10 kernel ( 2.6.27-0.312.rc5.git7.fc10 ) : 
https://bugzilla.redhat.com//show_bug.cgi?id=461441
Comment 9 Jason Vas Dias 2008-09-21 15:08:05 UTC
This is still an issue with latest HEAD 2.6.27-rc6 (Linus' GIT Tree 2008-09-21)??

I've tracked down three primary issues with this kernel :

  1. All Switchable CPU Scaling Governors are DISABLED, even with powernow-k8
     installed as a module and or with kernel recompiled to load powernow-k8 
     into built-ins. 
  
  By default, powernow-k8 is NOT installed with 2.6.27-+ kernels, whereas it
  was with 2.6.26- kernels. This remained true before and after upgrade
  from udev-124 -> 126 -> 128 .
    
  After manual powernow-k8 module installed with modprobe, or is loaded as
  built-in (both of which succeed) , the 
   /sys/devices/system/cpu/cpu[01]/cpufreq/*
  files are all there - and all the correct  /sys/devices/system/cpu/cpu[01]/cpufreq/*available* files display corrent values - except governor is 
  not switchable (governor displays as userspace), the cpuinfo_cur_freq
  is at its lowest value (800000) as is scaling_max_freq even though available
  freqs go up tp 2200000 which is the default under 2.6.26-rc9 .

  Thus every moderately low-latency driver (such as the terminal, USB, keyboard,
  network) is continually timing out, producing more log messages on the 
  already slow vesa terminal driver which eventually causes rsyslogd to crash
  on timeout .

  PLEASE RESTORE 2.6.27-rc7+ CPU FREQUENCY SWITCHING ON x86_64 CPU!! 
  I'll try to figure out how it comes to be disabled first - nothing 
  in the configuration has changed, but the new kernel overloads itself
  simply by generating log messages on my AMD TL-64x2 2.2GHz SATA laptop,
  while the old kernel can at least install the powernow-k8 driver and
  comes with 2.2GHz as the default CPU clock frequency even without
  powernow-k8 installed.  This is beyond the limits of my both interested
  novice kernel programming skills and my spare time at the moment
  to explore more deeply - please could kernel developers advise - 
  more detailed logs can be collected. I have tried the latest fedora kernels
  as of 2008-09-21 booting a 100% up-to-date Fedora 10 system with the same
  result - only the latest RHEL-5.1 2.6.18  RedHat kernel does not suffer 
  this bug on my x86_64x2 / Dual-Core Athlon64 SMP + Xen enabled platform .



    
Comment 10 Jason Vas Dias 2008-09-21 15:21:11 UTC
Other Symptoms are:

1. Failure to honor the "reboot" or "poweroff" commands at all - the machine goes
   into the "HALT" state, but never powers off. 

2. It takes TWO manual RESET cycles with the power button to re-enable power-on .

3. No way to CPU frequency switch from LOWEST frequency (800000) on 2.2GHz 2 CPUs
Comment 11 Thomas Gleixner 2008-09-26 03:22:40 UTC
Can you please provide the output of dmesg right after boot for both kernels ? The boot log is missing the interesting parts.

Thanks,
          tglx
Comment 12 Alan 2008-09-26 07:30:45 UTC
*** Bug 11651 has been marked as a duplicate of this bug. ***
Comment 13 Rafael J. Wysocki 2008-09-26 15:56:20 UTC
On Tuesday, 23 of September 2008, Jason Vas Dias wrote:
> Hi - 
> Yes, this bug is still a problem with both the latest 2.6.27-rc6 kernel (from
> Linus' tree 2008-09-21)
> and with the latest fedora 10 kernel .
> 
> CPU Frequency switching is completely disabled both when powernow-k8 (the
> correct cpufreq module for my
> x86_64 AMD TL-64x2 2.2GHz CPU) is installed as a module or is built-in , and
> the CPU frequency remains
> at its lowest setting; attempts to modify
> /sys/devices/system/cpu/cpu0/cpufreq/scaling_max_freq
> and /sys/devices/system/cpu/cpu0/cpufreq/scaling_setspeed are not honored,
> even though 
> /sys/devices/system/cpu/cpu0/cpufreq/governor is "userspace" 
> and scaling_min_freq < scaling_setspeed > scaling_max_freq .
> 
> I see no messages from powernow-k8 indicating that it is aware it was unable
> to set the speed, though
> I do see a message if I attempt to set an invalid speed (eg 600000) .
> 
> With 2.6.26-rc9, I get a default CPU clock frequency of 2200000 ; with
> 2.6.27-rc6, it becomes 800000 and
> is not switchable. For some reason, powernow-k8 does not autoload with UDEV;
> but I don't really need it if
> the speed is already set to its highest level.
> 
> On 2.6.27-rc6. after it manages to boot, any low-latency drivers time out
> (eg. USB, Terminal, Keyboard, Network) 
> and the machine does not get through the boot-up sequence without becoming
> overloaded by the kernel's debugging log messages -
> neither the network , the terminal or the keyboard work usably. 
> 
> Building a kernel with USB completely disabled and turning off debug log
> messages allows the machine to boot 
> (after @ 15 minutes) but the speed is still at its lowest setting and cannot
> be changed.
> 
> Also, 2.6.27-rc6 is unable to reboot the machine: it can put the machine into
> the "HALT" state, with nothing displayed
> on the screen, but the machine does not power-off until manual reset with the
> power-button. Then, after the machine
> has powered-down, it cannot be powered up until the power-on button is
> depressed for at least two sections an released TWICE
> in a row.
> 
> Neither booting with 'acpi=0' or compilation without USB support has any
> effect; it happens the same with a totally
> standard Fedora 10 install with the latest 2.6.27 kernel as with my own
> heavily customized distro , and occurs when either
> udev-{124,126,128} are installed, and regardless of compiling the kernel with
> gcc-4.2.4 , gcc-3.4.6, or gcc-4.3.2 .
Comment 14 Jason Vas Dias 2008-09-27 18:35:10 UTC
Created attachment 18079 [details]
boot log from 2.6.27-rc7-tip.jvd, which fails to set CPU freq above 800Khz
Comment 15 Jason Vas Dias 2008-09-27 18:35:36 UTC
Created attachment 18080 [details]
config from 2.6.27-rc7-tip.jvd
Comment 16 Jason Vas Dias 2008-09-27 18:37:11 UTC
Created attachment 18081 [details]
System.map from 2.6.27-rc7-tip.jvd
Comment 17 Jason Vas Dias 2008-09-27 18:38:06 UTC
Created attachment 18082 [details]
boot log from 2.6.26.5.jvd, which does not have the problem
Comment 18 Jason Vas Dias 2008-09-27 18:39:23 UTC
Created attachment 18083 [details]
config from 2.6.26.5.jvd
Comment 19 Jason Vas Dias 2008-09-27 18:39:48 UTC
Created attachment 18084 [details]
System map from 2.6.26.5.jvd
Comment 20 Jason Vas Dias 2008-09-27 18:43:50 UTC
Created attachment 18085 [details]
acpidump from 2.6.26.5.jvd
Comment 21 Jason Vas Dias 2008-09-27 20:28:46 UTC
Created attachment 18087 [details]
output of acpidump for 2.6.27-rc7-tip.jvd
Comment 22 Jason Vas Dias 2008-09-27 20:30:22 UTC
Created attachment 18088 [details]
/proc/acpi files of 2.6.27-rc7-tip.jvd

Output of command:
$ (find /proc/acpi -type f | while read f; do if [[  "$f" != /proc/acpi/event ]]; then  echo '+++ '$f':';  if [[ $f =~ dt$ ]]; then od -cx $f; else cat $f; echo '---'; fi; fi; done) 2>&1 | tee 2.6.27-rc7-tip.jvd.proc.acpi.dump
Comment 23 Jason Vas Dias 2008-09-27 20:31:45 UTC
Created attachment 18089 [details]
/proc/acpi files of 2.6.26.5.jvd

Output of command:
$ (find /proc/acpi -type f | while read f; do if [[  "$f" != /proc/acpi/event ]]; then  echo '+++ '$f':';  if [[ $f =~ dt$ ]]; then od -cx $f; else cat $f; echo '---'; fi; fi; done) 2>&1 | tee 2.6.26.5.jvd.proc.acpi.dump
Comment 24 Jason Vas Dias 2008-09-27 21:03:01 UTC
Sorry for the delay in getting back to you all - finding spare time is difficult
for me at the moment.

Anyway, the problem still occurs with the latest linux-2.6-tip tree 
( 2008-09-27@22:00 - c4077bb0691bcd26c546daa52b76a47ba954fc49 ), and with
the latest Fedora 10 kernel ( kernel-2.6.27-0.352.rc7.git1.fc10 ), but
not with the latest stable kernel ( 2.6.26.5 ) nor with 2.6.26-rc9 .

With EVERY 2.6.27 based kernel I've tested, since 2.6.27-rc1 , cpu frequency
switching is completely disabled, the AMD TL-64 dual-core 2.2Ghz CPU is stuck at
800Khz, and cannot be changed by any governor switch or valid modification of
/sys/devices/system/cpu/cpu0/cpufreq/scaling_{max_freq,setspeed}, usbcore floods the dmesg log with event change messages when it seems no event has
changed, and the machine cannot reboot (it can only enter the "HALT" state).

I'm analyzing the logs now, but have not been able to find anything obvious -
anyone have any ideas suggestions ?

In response to your questions:

On Friday 26 September 2008 11:17:59 Linus Torvalds wrote:
> 
> On Fri, 26 Sep 2008, Jason Vas Dias wrote:
> > 
> > There appears to be a serious problem with every 2.6.27-X kernel I've 
> > tried - and I've tried quite a few from the linux-2.6, linux-acpi-2.6, 
> > and linux-2.6-tip GIT trees, and the latest Fedora 10 kernels over the 
> > past few weeks trying to solve this problem:
> 
> So which trees work, and which don't? 

Sorry for my lack of clarity - ANY 2.6.27 based tree from either your linux-2.6
tree, Len's linux-acpi tree, or Ingo's linux-2.6-tip tree, or the latest Fedora
10 Rawhide / Development kernels over the past three weeks have displayed the same problem - I've tried @ 8 trees over the past 3 weekends - sorry for not recording the master hash keys - I will in future. 

> Also, since it seems entirely 
> repeatable, this is a prime candidate for "git bisect" - it will be 
> eventually faster to do than trying many different trees at random, even 
> if you will likely need to reboot/compile about 12 times or so (assuming 
> 2.6.27-rc1 is broken, and 2.6.26 works, which it sounds like).
> 
> Bisecting really isn't that hard. Get the git tree, do
> 
>       git bisect start
>       git bisect bad v2.6.27-rc1
>       git bisect good v2.6.26
> 
> and off you go. You don't even need to know a lot about git, there's a few 
> quick tutorials out there if you haven't used it. See for example
> 
>       http://www.kernel.org/doc/local/git-quick.html
> 
> which has an example git bisect run, in addition to just initial clone 
> insutrctions.
> 
Thank you! I'll try this next. I'm starting to hack around the powernow-k8
code, which in any case has the bug that it is not detecting that the CPU
frequency is set successfully or not - it should at least emit some dmesg
in either case, IMHO .

> > CPU frequency switching is completely disabled, the 2.2Ghz AMD TL-64 
> > Dual Core CPU is stuck at 0.8Ghz, the machine cannot reboot or perform 
> > any ACPI actions.
> > 
> > There are no obvious failures indicated in the logs, only this:
> > 
> > [    0.000000] ACPI Error (tbfadt-0453): 32/64X address mismatch in
> "Pm2ControlBlock": [00008800] [0000000000008100], using 64X [20080609]
> 
> Hmm. I don't know if it's ACPI-related, but the fact that it cannot even 
> reboot etc sure seems to make it likely. Have you made the acpi lists 
> aware of it (linux-acpi@vger.kernel.org and lenb are listed in 
> MAINTAINERS)?
> 
>               Linus
> 
I've now attached all available acpi information - the opinion of the ACPI
maintainers (whose emails are now attached to this bug) would be gratefully received.

On Friday 26 September 2008 11:40:40 Zhao Yakui <yakui.zhao@intel.com> wrote:
> Hi, Jason
>    Will you please attach the output of acpidump, dmesg?
>    From the log it seems that the BIOS gives the incorrect ACPI table.
> For example: the 32X/64X address don't match in FADT table.
>    Of course you can open a new bug and attach the output of acpidump,
> dmesg.
>     http://bugzilla.kernel.org/acpi
> 
> thanks

Hi Zhao - thanks - this is now done.

I'm going to try to fix this - if no-one else does first! 


Dave Jones : 
I noticed your CPU Frequency tree from "7 weeks ago":
  linux/kernel/git/davej/cpufreq.git 	Unnamed repository; edit this ... 	
  ( http://git.kernel.org/?...h=621114980de1e236ffa4e72899e226de75031a5f )
Has this been merged into 2.6.27-rc7 or not ? I'm checking out and doing
a diff with latest -tip tree now. I guess this is as good a place to start
investigative hacking as any ...

Thanks & Regards,
Jason
Comment 25 Thomas Gleixner 2008-09-28 03:12:24 UTC
Can you please add "noapictimer" to the kernel command line ?
Comment 26 Thomas Gleixner 2008-09-28 03:50:12 UTC
> Can you please add "noapictimer" to the kernel command line ?

Wont help :) /me needs more coffee

Ok, I checked the proc/acpi outputs:

The limit entry is different and has the throttling limit set to T7
 +++ /proc/acpi/processor/C001/limit:
-active limit:            P0:T0
+active limit:            P0:T7
 user limit:              P0:T0
-thermal limit:           P0:T0
+thermal limit:           P0:T7

Which is the active state:

 +++ /proc/acpi/processor/C001/throttling:
 state count:             8
-active state:            T0
+active state:            T7
 state available: T0 to T7
 states:
-   *T0:                  100%
+    T0:                  100%
     T1:                  87%
     T2:                  75%
     T3:                  62%
     T4:                  50%
     T5:                  37%
     T6:                  25%
-    T7:                  12%
+   *T7:                  12%

Can you please add "acpi.debug_level=0x11" to the kernel command line
of both .25 and .27 so we can get the ACPI debug info of setting the
limits/throttling state.

Thanks,

	tglx
Comment 27 Jason Vas Dias 2008-09-28 16:22:55 UTC
Thanks for the ACPI debug suggestion - I see what the problem is now .
I tried gathering the complete boot logs with the ACPI debug argument
you gave, but the dmesg buffer was not big enough to allow a complete
log to be gathered, and the logs I could gather were huge - not suitable
inclusion in a bug report.

Basically, what is happening is that the drivers/acpi/thermal.c "thermal zone" 
module always thinks the processors are too hot, and ramps down their frequency until it reaches its lowest value. 

This is because for the 2.6.27 kernels, the thermal zone has a too low "passive"
temperature value - (from the /proc/acpi files of 2.2.27 : http://bugzilla.kernel.org/attachment.cgi?id=18088&action=view) :
+++ /proc/acpi/thermal_zone/TZ1/trip_points:
critical (S5):           105 C
passive:                 16 C: tc1=1 tc2=2 tsp=100 devices=C000 C001 
active[0]:               16 C: devices=C399 
active[1]:               16 C: devices=C39A 
active[2]:               16 C: devices=C39B 
active[3]:               16 C: devices=C39C 
---

16 degrees celsius is an unreasonably low value for the CPU, and while
the current temperature of the CPU is greater than this '_PSV' threshold, 
drivers/acpi/thermal.c will decrease the CPU frequency; but as the CPU
never gets colder than this value, the CPU frequency is always decreased to its 
lowest value within 70 seconds of booting (7 x the 10 second timer interval
of thermal.c's acpi_thermal_check()'s timer).


The 2.6.26 kernel's "trip_points" are set to reasonable values:
+++ /proc/acpi/thermal_zone/TZ1/trip_points:
critical (S5):           105 C
passive:                 95 C: tc1=1 tc2=2 tsp=100 devices=C000 C001
active[0]:               75 C: devices=C399
active[1]:               65 C: devices=C39A
active[2]:               50 C: devices=C39B
active[3]:               16 C: devices=C39C
---

The "passive" threshold is a module parameter, but was not specified
in either the 2.6.27 or 2.6.26 kernels:
$ cat /sys/module/thermal/parameters/psv
0

The default value for the psv parameter in the ACPI DSDT is 16 C (2981 deci-Kelvin) found by decompiling it with iasl.

Since the acpidump output under both kernels is identical, and the firmware
ACPI tables must be the same, so the 2.6.27 ACPI code must be interpreting
them slightly differently to the 2.6.26 code.

Here are the acpi.debug=0x11 log excerpts showing the above is occurring:

1. Retreiving initial PSV from ACPI:

[    4.905673]   evmisc-0128 [06] ev_queue_notify_reques: Dispatching Notify on [TZ1_] Node ffff880077b38680 Value 0x81 (**Device Specific**)
[    4.905683]   evmisc-0202 [06] ev_queue_notify_reques: No notify handler for Notify (TZ1_, 81) node ffff880077b38680
[    4.906506]    utils-0291 [4294967295] evaluate_integer      : Return value [3362]
[    4.906512]  thermal-0262 [4294967295] thermal_get_temperatur: Temperature is 3362 dK
[    4.906534]    utils-0291 [4294967295] evaluate_integer      : Return value [3782]
[    4.906540]  thermal-0383 [4294967295] thermal_trips_update  : Found critical threshold [3782]
[    4.906551]  nsutils-0869 [02] ns_get_node           : _HOT, AE_NOT_FOUND
[    4.906560]    utils-0047 [4294967295] util_eval_error       : Evaluate [\_TZ_.TZ1_._HOT]: AE_NOT_FOUND
[    4.906566]  thermal-0406 [4294967295] thermal_trips_update  : No hot threshold
[    4.906580] Execute Method: [\_TZ_.TZ1_._PSV] (Node ffff880077b387a0)
[    4.908052] exregion-0186 [14] ex_system_memory_space: System-Memory (width 8) R/W 0 Address=0000000077FE7DC6
[    4.908220] exregion-0289 [17] ex_system_io_space_han: System-IO (width 8) R/W 1 Address=0000000000000072
[    4.908234] exregion-0289 [16] ex_system_io_space_han: System-IO (width 8) R/W 0 Address=0000000000000073
[    4.908729]    utils-0291 [4294967295] evaluate_integer      : Return value [2891]
[    4.908750]    utils-0291 [4294967295] evaluate_integer      : Return value [1]
[    4.908771]    utils-0291 [4294967295] evaluate_integer      : Return value [2]
[    4.908792]    utils-0291 [4294967295] evaluate_integer      : Return value [100]
[    4.908826]    utils-0420 [4294967295] evaluate_reference    : Found reference [ffff8800778032a0]
[    4.908832]    utils-0420 [4294967295] evaluate_reference    : Found reference [ffff8800778032c0]

2. Throttling the processors down when their temperature exceeds 16 C:
[   24.757306]  thermal-0262 [4294967294] thermal_get_temperatur: Temperature is 3332 dK
[   24.826155]  thermal-0632 [4294967294] thermal_passive       : trend[852]=(tc1[1]*(tmp[3332]-last[3362]))+(tc2[2]*(tmp[3332]-psv[2891]))
[   24.897221] cpufreq-core: updating policy for CPU 0
[   24.897240] cpufreq-core: setting new policy for CPU 0: 800000 - 2200000 kHz
[   24.897246] freq-table: request for verification of policy (800000 - 2200000 kHz) for cpu 0
[   24.897253] freq-table: verification lead to (800000 - 2200000 kHz) for cpu 0
[   24.897271] freq-table: request for verification of policy (800000 - 1760000 kHz) for cpu 0
[   24.897277] freq-table: verification lead to (800000 - 1760000 kHz) for cpu 0
[   24.897285] cpufreq-core: new min and max freqs are 800000 - 1760000 kHz
...
[   25.171363] processor_throttling-0250 [4294967294] processor_throttling_n: PreChange Event:target T-state of CPU 1 is T1
[   25.171946] processor_throttling-0260 [4294967294] processor_throttling_n: PostChange Event:CPU 1 is switched to T1
[   25.171968] processor_thermal-0080 [4294967294] processor_apply_limit : Processor [1] limit set to (P0:T1)
[   25.171978] processor_thermal-0340 [4294967294] processor_set_thermal_: Thermal limit now (P0:T1)
[   25.171989]  thermal-0911 [4294967294] thermal_check         : TZ1: temperature[3332] sleep[10000]
...
[ 1436.691903]  thermal-0262 [4294967147] thermal_get_temperatur: Temperature is 3052 dK
[ 1436.692172]  thermal-0632 [4294967147] thermal_passive       : trend[322]=(tc1[1]*(tmp[3052]-last[3052]))+(tc2[2]*(tmp[3052]-psv[2891]))
[ 1436.693036] processor_thermal-0290 [4294967147] processor_set_thermal_: At maximum performance state
[ 1436.693638] processor_thermal-0295 [4294967147] processor_set_thermal_: At maximum throttling state
[ 1436.693874] processor_throttling-0250 [4294967147] processor_throttling_n: PreChange Event:target T-state of CPU 0 is T7
[ 1436.695743] processor_throttling-0260 [4294967147] processor_throttling_n: PostChange Event:CPU 0 is switched to T7
[ 1436.696473] processor_thermal-0080 [4294967147] processor_apply_limit : Processor [0] limit set to (P0:T7)
[ 1436.696772] processor_thermal-0340 [4294967147] processor_set_thermal_: Thermal limit now (P0:T7)
[ 1436.697472] processor_thermal-0290 [4294967147] processor_set_thermal_: At maximum performance state
[ 1436.697710] processor_thermal-0295 [4294967147] processor_set_thermal_: At maximum throttling state
[ 1436.697974] processor_throttling-0250 [4294967147] processor_throttling_n: PreChange Event:target T-state of CPU 1 is T7
[ 1436.704221] processor_throttling-0260 [4294967147] processor_throttling_n: PostChange Event:CPU 1 is switched to T7
[ 1436.704952] processor_thermal-0080 [4294967147] processor_apply_limit : Processor [1] limit set to (P0:T7)
[ 1436.705183] processor_thermal-0340 [4294967147] processor_set_thermal_: Thermal limit now (P0:T7)
[ 1436.705817]  thermal-0911 [4294967147] thermal_check         : TZ1: temperature[3052] sleep[10000]

And then the frequency is at it lowest, with the temperature throttle at its highest, because the ACPI "PSV" value is set too low.
Comment 28 Chuck Ebbert 2008-10-03 23:14:48 UTC
This should be reassigned to the ACPI component, but I can't do that.
Comment 29 Rafael J. Wysocki 2008-10-04 02:58:56 UTC
Well, not necessarily.

That still may be a result of the IO-APIC changes in .27.  I had a similar issue on my test box that has been blacklisted and now works well.

Jason, can you attach the output of dmidecode from your machine, please?
Comment 30 Jason Vas Dias 2008-10-04 11:57:22 UTC
Created attachment 18163 [details]
dmidecode output under 2.6.26.5

OK, here's the dmidecode output from kernel-2.6.26.5 .
Comment 31 Jason Vas Dias 2008-10-04 12:23:37 UTC
I have confirmed that this is still a problem under Linus' 2.6.27-rc8 tree
of 2008-09-29 ( 94aca1dac6f6d21f4b07e4864baf7768cabcc6e7 ) .
I'm now building the latest linux-2.6-tip tree ( 800be01f27c9542ea4640d731874986c60f521e5 ) and am going to enable ACPI
table parsing debug to try to understand how the two kernels can be 
parsing the same ACPI tables and get such different results .

Here is the relevant DSL from 'iasl -d /proc/acpi/dsdt' :
            ...
            Method (_PSV, 0, NotSerialized)
            {
                Return (C391 (C384, 0x00))
            }
            ...
        Method (C391, 2, Serialized)
        {
            Store (0x01, Local0)
            Store (Arg0, Local1)
            If (LLess (Arg0, DerefOf (Index (C38C, Arg1))))
            {
                Store (0x00, Local0)
                Add (Arg0, 0x01, Local1)
            }

            Store (DerefOf (Index (DerefOf (Index (DerefOf (Index (C37F, C392 (Arg1)
                )), Local0)), Local1)), Local2)
            If (LAnd (LEqual (\_SB.C03E, 0x01), LAnd (LEqual (Arg0, 0x00), LEqual (
                Arg1, 0x00))))
            {
                If (\_SB.C08B.C0FE.C172.C1D5 ())
                {
                    Store (0x0B4B, Local2)
                }
            }

            If (LEqual (\_SB.C08B.C09F.C0A0, 0x01))
            {
                Store (0x0B4B, Local2)
            }

            Return (Local2)
        }
        
Now 0x0B4B (2891 deci-Kelvin) is the infamous 16 degrees celsius default
_PSV value that the 2.6.27 kernels get ; under 2.6.26.5, the expression
  Store (DerefOf (Index (DerefOf (Index (DerefOf (Index (C37F, C392 (Arg1)
                )), Local0)), Local1)), Local2)
is evidently storing the correct 3682 deci-Kelvin ( 95 degrees celsius ) 
in "Local2" , and \_SB.C08B.C09F.C0A0 is not 1 -
ACPI debug output under 2.6.26.5:
[   11.902295] Execute Method: [\_TZ_.TZ1_._PSV] (Node ffff810077b3b7c0)
[   11.902295] exregion-0186 [15] ex_system_memory_space: System-Memory (width 8) R/W 0 Address=0000000077FE7DC6
[   11.902295] exregion-0289 [18] ex_system_io_space_han: System-IO (width 8) R/W 1 Address=0000000000000072
[   11.902295] exregion-0289 [17] ex_system_io_space_han: System-IO (width 8) R/W 0 Address=0000000000000073
[   11.902295]    utils-0291 [00] evaluate_integer      : Return value [3682]
[   11.902295]    utils-0291 [00] evaluate_integer      : Return value [1]
[   11.902295]    utils-0291 [00] evaluate_integer      : Return value [2]
[   11.902295]    utils-0291 [00] evaluate_integer      : Return value [100]
[   11.902295]    utils-0420 [00] evaluate_reference    : Found reference [ffff8100778082c0]
[   11.902295]    utils-0420 [00] evaluate_reference    : Found reference [ffff8100778082e0]

ACPI debug output under 2.6.27-rc7:
[    4.906580] Execute Method: [\_TZ_.TZ1_._PSV] (Node ffff880077b387a0)
[    4.908052] exregion-0186 [14] ex_system_memory_space: System-Memory (width 8) R/W 0 Address=0000000077FE7DC6
[    4.908220] exregion-0289 [17] ex_system_io_space_han: System-IO (width 8) R/W 1 Address=0000000000000072
[    4.908234] exregion-0289 [16] ex_system_io_space_han: System-IO (width 8) R/W 0 Address=0000000000000073
[    4.908729]    utils-0291 [4294967295] evaluate_integer      : Return value [2891]
[    4.908750]    utils-0291 [4294967295] evaluate_integer      : Return value [1]
[    4.908771]    utils-0291 [4294967295] evaluate_integer      : Return value [2]
[    4.908792]    utils-0291 [4294967295] evaluate_integer      : Return value [100]
[    4.908826]    utils-0420 [4294967295] evaluate_reference    : Found reference [ffff8800778032a0]
[    4.908832]    utils-0420 [4294967295] evaluate_reference    : Found reference [ffff8800778032c0]

I'm re-testing the above with the latest 2.6.27-rc8-tip kernel and will attach
the relevant output when done - at least I can give the "thermal.psv=95" 
parameter to the new kernel so that the CPU stays at 2.2Ghz and a boot/reboot
cycle takes less than one hour! (But the active cooling devices stay always on).
Comment 32 Linus Torvalds 2008-10-04 12:40:29 UTC
On Sat, 4 Oct 2008, bugme-daemon@bugzilla.kernel.org wrote:
<
> I have confirmed that this is still a problem under Linus' 2.6.27-rc8 tree
> of 2008-09-29 ( 94aca1dac6f6d21f4b07e4864baf7768cabcc6e7 ) .
> I'm now building the latest linux-2.6-tip tree

It really would be much better if you just tried to bisect it.

It won't be fixed until somebody has a clue what caused it, or we find it 
almost by mistake, and nothing since -rc8 (or, indeed -rc7) is very likely 
to have fixed it. So testing new kernels probably won't help, while 
testing old kernels (with bisect) should hopefully be able to show what 
caused the breakage.

		Linus
Comment 33 Rafael J. Wysocki 2008-10-05 12:07:02 UTC
Created attachment 18175 [details]
Patch to test

Please check if this patch helps.

[BTW, the '.jvd' extension is not really helpful, because it confuses some tools.  Please append '-jvd' to the file names instead.]
Comment 34 Jason Vas Dias 2008-10-05 16:07:37 UTC
Yes, thanks, that fixed it !

After the patch the trip points were back to normal and the CPU frequency
stays at 2.2GHz:
$ cat /proc/acpi/thermal_zone/TZ1/trip_points
critical (S5):           105 C
passive:                 95 C: tc1=1 tc2=2 tsp=100 devices=C000 C001
active[0]:               75 C: devices=C39B
active[1]:               65 C: devices=C39C
active[2]:               55 C: devices=C39D
active[3]:               16 C: devices=C39E
$ cat /sys/devices/system/cpu/cpu0/cpufreq/cpuinfo_cur_freq
2200000

But those multiple USB "event change" messages per-second when no event has 
changed remain, so I have to disable klogd logging to disk:
[  989.204090] hub 1-0:1.0: state 7 ports 2 chg 0000 evt 0000
[  989.204099] hub 3-0:1.0: state 7 ports 2 chg 0000 evt 0000
[  989.454584] hub 1-0:1.0: state 7 ports 2 chg 0000 evt 0000
[  989.454594] hub 3-0:1.0: state 7 ports 2 chg 0000 evt 0000
[  989.704080] hub 1-0:1.0: state 7 ports 2 chg 0000 evt 0000
[  989.704090] hub 3-0:1.0: state 7 ports 2 chg 0000 evt 0000
[  989.954084] hub 1-0:1.0: state 7 ports 2 chg 0000 evt 0000
[  989.954094] hub 3-0:1.0: state 7 ports 2 chg 0000 evt 0000
[  990.204091] hub 1-0:1.0: state 7 ports 2 chg 0000 evt 0000
[  990.204100] hub 3-0:1.0: state 7 ports 2 chg 0000 evt 0000
[  990.454112] hub 1-0:1.0: state 7 ports 2 chg 0000 evt 0000
[  990.454122] hub 3-0:1.0: state 7 ports 2 chg 0000 evt 0000
[  990.704089] hub 1-0:1.0: state 7 ports 2 chg 0000 evt 0000
[  990.704098] hub 3-0:1.0: state 7 ports 2 chg 0000 evt 0000
[  990.954088] hub 1-0:1.0: state 7 ports 2 chg 0000 evt 0000
[  990.954098] hub 3-0:1.0: state 7 ports 2 chg 0000 evt 0000
[  991.204334] hub 1-0:1.0: state 7 ports 2 chg 0000 evt 0000
[  991.204343] hub 3-0:1.0: state 7 ports 2 chg 0000 evt 0000
[  991.454094] hub 1-0:1.0: state 7 ports 2 chg 0000 evt 0000
[  991.454103] hub 3-0:1.0: state 7 ports 2 chg 0000 evt 0000
[  991.704083] hub 1-0:1.0: state 7 ports 2 chg 0000 evt 0000
[  991.704093] hub 3-0:1.0: state 7 ports 2 chg 0000 evt 0000
[  991.954089] hub 1-0:1.0: state 7 ports 2 chg 0000 evt 0000
[  991.954098] hub 3-0:1.0: state 7 ports 2 chg 0000 evt 0000

This is probably a different issue - shall I raise a new bug on this ?

Thanks!
Comment 35 Rafael J. Wysocki 2008-10-06 02:19:18 UTC
(In reply to comment #34)
> Yes, thanks, that fixed it !

Thanks for the confirmation.
 
> But those multiple USB "event change" messages per-second when no event has 
> changed remain, so I have to disable klogd logging to disk:
> [  989.204090] hub 1-0:1.0: state 7 ports 2 chg 0000 evt 0000
> [  989.204099] hub 3-0:1.0: state 7 ports 2 chg 0000 evt 0000
> [  989.454584] hub 1-0:1.0: state 7 ports 2 chg 0000 evt 0000
> [  989.454594] hub 3-0:1.0: state 7 ports 2 chg 0000 evt 0000
> [  989.704080] hub 1-0:1.0: state 7 ports 2 chg 0000 evt 0000
> [  989.704090] hub 3-0:1.0: state 7 ports 2 chg 0000 evt 0000
> [  989.954084] hub 1-0:1.0: state 7 ports 2 chg 0000 evt 0000
> [  989.954094] hub 3-0:1.0: state 7 ports 2 chg 0000 evt 0000
> [  990.204091] hub 1-0:1.0: state 7 ports 2 chg 0000 evt 0000
> [  990.204100] hub 3-0:1.0: state 7 ports 2 chg 0000 evt 0000
> [  990.454112] hub 1-0:1.0: state 7 ports 2 chg 0000 evt 0000
> [  990.454122] hub 3-0:1.0: state 7 ports 2 chg 0000 evt 0000
> [  990.704089] hub 1-0:1.0: state 7 ports 2 chg 0000 evt 0000
> [  990.704098] hub 3-0:1.0: state 7 ports 2 chg 0000 evt 0000
> [  990.954088] hub 1-0:1.0: state 7 ports 2 chg 0000 evt 0000
> [  990.954098] hub 3-0:1.0: state 7 ports 2 chg 0000 evt 0000
> [  991.204334] hub 1-0:1.0: state 7 ports 2 chg 0000 evt 0000
> [  991.204343] hub 3-0:1.0: state 7 ports 2 chg 0000 evt 0000
> [  991.454094] hub 1-0:1.0: state 7 ports 2 chg 0000 evt 0000
> [  991.454103] hub 3-0:1.0: state 7 ports 2 chg 0000 evt 0000
> [  991.704083] hub 1-0:1.0: state 7 ports 2 chg 0000 evt 0000
> [  991.704093] hub 3-0:1.0: state 7 ports 2 chg 0000 evt 0000
> [  991.954089] hub 1-0:1.0: state 7 ports 2 chg 0000 evt 0000
> [  991.954098] hub 3-0:1.0: state 7 ports 2 chg 0000 evt 0000
> 
> This is probably a different issue - shall I raise a new bug on this ?

Yes, these should be fixed by this commit:
http://git.kernel.org/?p=linux/kernel/git/torvalds/linux-2.6.git;a=commit;h=07454bfff151d2465ada809bbaddf3548cc1097c
Comment 36 Rafael J. Wysocki 2008-10-06 02:20:09 UTC
Handled-By : Rafael J. Wysocki <rjw@sisk.pl>
Patch : http://bugzilla.kernel.org/attachment.cgi?id=18175&action=view
Comment 37 Adrian Bunk 2008-10-07 07:16:21 UTC
fixed by commit e84956f92a846246b09b34f2a728329c386d250f
Comment 38 totateis 2008-10-07 17:26:10 UTC
Might want to add HP Compaq 6715s laptops to the blacklist; exactly the same issue on one of those (and probably a lot of HP Compaq 6xxx's with a Turion CPU).
Comment 39 Ingo Molnar 2008-10-07 22:32:25 UTC
> Might want to add HP Compaq 6715s laptops to the blacklist; exactly 
> the same issue on one of those (and probably a lot of HP Compaq 6xxx's 
> with a Turion CPU).

could you check tip/master:

  http://people.redhat.com/mingo/tip.git/README

it has a more generic, PCI ID based quirk, queued up for v2.6.28. Does 
that work fine on your HP Compaq 6715 laptop, out of box?

	Ingo
Comment 40 Ingo Molnar 2008-10-07 22:49:34 UTC
Reopened the bug to track other systems that are not yet covered.

Adrian Bunk, could you stop closing bugs you have zero involvement with?
Comment 41 Rafael J. Wysocki 2008-10-08 01:15:29 UTC
(In reply to comment #39)
>   http://people.redhat.com/mingo/tip.git/README
> 
> it has a more generic, PCI ID based quirk, queued up for v2.6.28. Does 
> that work fine on your HP Compaq 6715 laptop, out of box?

That will be -stable material too, I think?
Comment 42 Ingo Molnar 2008-10-08 02:06:41 UTC
> > it has a more generic, PCI ID based quirk, queued up for v2.6.28. 
> > Does that work fine on your HP Compaq 6715 laptop, out of box?
> 
> That will be -stable material too, I think?

Correct. These two are queued up for v2.6.28:

 8d89adf: x86: SB450: deprioritize DMI quirks
 33fb0e4: x86: SB450: skip IRQ0 override if it is not routed to INT2 of IOAPIC

and i already added a backport tag to 33fb0e4:

 Cc: <stable@kernel.org>

Dmitry reported success with 33fb0e4 + 8d89adf.

It will all work out fine in a natural way by the time .27.1 backport 
items are processed.

	Ingo
Comment 43 Matthias Tingelhoff 2008-10-09 12:18:20 UTC
Hello folks, Just wanted to note that mine Compaq 6715s with 32bit AMD Semperon CPU have the same problem. Adding it to the blacklist in arch/x86/kernel/acpi/boot.c solved (at least this) issue.
I guess the whole 6715x series is affected... buggy HP hardware.

greets, Matze
Comment 44 Jason Vas Dias 2008-10-10 22:17:05 UTC
Great thanks to everyone for the fixes for this! 
However, there still remains a niggle:

Having merged Linus' and Ingo's -tip trees of 2008-10-10@22:30 EDT :

$ git remote add linus \
   git://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux-2.6.git
$ git remote add tip \
   git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip.git
$ git remote update
$ git checkout -b jvd tip/master

and built the resulting kernel, I now get this warning:

[    0.000000] ------------[ cut here ]------------
[    0.000000] WARNING: at arch/x86/kernel/acpi/boot.c:1410 dmi_ignore_irq0_timer_override+0x30/0x60()
[    0.000000] ati_ixp4x0 quirk not complete.
[    0.000000] Modules linked in:
[    0.000000] Pid: 0, comm: swapper Not tainted 2.6.27-jvd #1
[    0.000000] Call Trace:
[    0.000000]  [<ffffffff8024bd87>] warn_slowpath+0xb7/0xe0
[    0.000000]  [<ffffffff80213500>] ? nommu_map_single+0x30/0x70
[    0.000000]  [<ffffffff8024d077>] ? printk+0x67/0x70
[    0.000000]  [<ffffffff809871ca>] ? __reserve_early+0x7a/0xe0
[    0.000000]  [<ffffffff806bbab3>] ? init_memory_mapping+0x633/0x720
[    0.000000]  [<ffffffff80980140>] ? early_idt_handler+0x0/0x72
[    0.000000]  [<ffffffff8098baf0>] dmi_ignore_irq0_timer_override+0x30/0x60
[    0.000000]  [<ffffffff805a69f4>] dmi_check_system+0x74/0xc0
[    0.000000]  [<ffffffff8098c545>] acpi_boot_table_init+0x15/0xf0
[    0.000000]  [<ffffffff809850b5>] setup_arch+0x3f5/0x840
[    0.000000]  [<ffffffff80980140>] ? early_idt_handler+0x0/0x72
[    0.000000]  [<ffffffff806dadb6>] ? _spin_unlock+0x26/0x30
[    0.000000]  [<ffffffff80980140>] ? early_idt_handler+0x0/0x72
[    0.000000]  [<ffffffff80980cb1>] start_kernel+0x71/0x410
[    0.000000]  [<ffffffff8098027c>] x86_64_start_reservations+0x7c/0xc0
[    0.000000]  [<ffffffff809803da>] x86_64_start_kernel+0xea/0x110
[    0.000000] ---[ end trace 4eaa2a86a8e2da22 ]---
[    0.000000] HP 6715b laptop detected: Ignoring BIOS IRQ0 pin2 override

Trying to follow the LKML thread
  "[PATCH] x86 ACPI: Blacklist two HP machines with buggy BIOSes (Re: 2.6.27-rc8+ - first impressions)"
it seems you've applied two fixes for this :

1) Adding an "early-quirk" to x86/kernel/early-quirks.c :
@181:
	{ PCI_VENDOR_ID_ATI, PCI_DEVICE_ID_ATI_IXP400_SMBUS,
	  PCI_CLASS_SERIAL_SMBUS, PCI_ANY_ID, 0, ati_bugs },

2) "Blacklisting" HP 6715b laptops so that they assert 
   acpi_skip_timer_override==1 IFF the early quirk did not succeed 
   in arch/x86/kernel/acpi/boot.c, which is now generating the warning :
@1403:
static int __init dmi_ignore_irq0_timer_override(const struct dmi_system_id *d)
{
	/*
	 * The ati_ixp4x0_rev() early PCI quirk should have set
	 * the acpi_skip_timer_override flag already:
	 */
	if (!acpi_skip_timer_override) {
		WARN(1, KERN_ERR "ati_ixp4x0 quirk not complete.\n");
		pr_notice("%s detected: Ignoring BIOS IRQ0 pin2 override\n",
			d->ident);
		acpi_skip_timer_override = 1;
	}
	return 0;
}


Evidently, the code in early-quirks.c did not recognize mine as a "problem"
system - my dsdt.dsl begins:
/*
 * Intel ACPI Component Architecture
 * AML Disassembler version 20080926
 *
 * Disassembly of dsdt, Mon Sep 29 21:40:26 2008
 *
 *
 * Original Table Header:
 *     Signature        "DSDT"
 *     Length           0x0001193C (71996)
 *     Revision         0x01 **** ACPI 1.0, no 64-bit math support
 *     Checksum         0x0C
 *     OEM ID           "HP    "
 *     OEM Table ID     "SB400"
 *     OEM Revision     0x00010000 (65536)
 *     Compiler ID      "MSFT"
 *     Compiler Version 0x03000001 (50331649)
 */

and my SMBus identifies as :
 $ lspci -nn | grep SMBus
00:14.0 SMBus [0c05]: ATI Technologies Inc SB600 SMBus [1002:4385] (rev 14)

So mine is an "SB600", not an "SB400" / "SB450" ??

The #define PCI_DEVICE_ID_ATI_IXP400_SMBUS is :
include/linux/pci_ids.h:369:#define PCI_DEVICE_ID_ATI_IXP400_SMBUS    0x4372
whereas my PCI ID is defined as:
include/linux/pci_ids.h:374:#define PCI_DEVICE_ID_ATI_SBX00_SMBUS     0x4385

So I'm trying this patch:
<PATCH>
--- early-quirks.c~     2008-10-10 22:32:34.000000000 -0400
+++ early-quirks.c      2008-10-11 01:07:49.000000000 -0400
@@ -178,6 +178,8 @@
          PCI_CLASS_BRIDGE_HOST, PCI_ANY_ID, 0, fix_hypertransport_config },
        { PCI_VENDOR_ID_ATI, PCI_DEVICE_ID_ATI_IXP400_SMBUS,
          PCI_CLASS_SERIAL_SMBUS, PCI_ANY_ID, 0, ati_bugs },
+       { PCI_VENDOR_ID_ATI, PCI_DEVICE_ID_ATI_SBX00_SMBUS,
+         PCI_CLASS_SERIAL_SMBUS, PCI_ANY_ID, 0, ati_bugs },
 #ifdef CONFIG_DMAR
        { PCI_VENDOR_ID_INTEL, 0x29c0,
          PCI_CLASS_BRIDGE_HOST, PCI_ANY_ID, 0, intel_g33_dmar },
</PATCH>

Perhaps you might consider also applying this ?

Also, RE: Comment #35:
> Yes, these should be fixed by this commit:
>
> http://git.kernel.org/?p=linux/kernel/git/torvalds/linux-2.6.git;a=commit;h=07454bfff151d2465ada809bbaddf3548cc1097c

No, those USB event change messages are still with me:

[ 2971.204312] hub 1-0:1.0: state 7 ports 2 chg 0000 evt 0000
[ 2971.204322] hub 3-0:1.0: state 7 ports 2 chg 0000 evt 0000
[ 2971.454090] hub 1-0:1.0: state 7 ports 2 chg 0000 evt 0000
[ 2971.454097] hub 3-0:1.0: state 7 ports 2 chg 0000 evt 0000
[ 2971.704074] hub 1-0:1.0: state 7 ports 2 chg 0000 evt 0000
[ 2971.704081] hub 3-0:1.0: state 7 ports 2 chg 0000 evt 0000
[ 2971.954075] hub 1-0:1.0: state 7 ports 2 chg 0000 evt 0000
[ 2971.954082] hub 3-0:1.0: state 7 ports 2 chg 0000 evt 0000
[ 2972.204314] hub 1-0:1.0: state 7 ports 2 chg 0000 evt 0000
[ 2972.204321] hub 3-0:1.0: state 7 ports 2 chg 0000 evt 0000
[ 2972.454326] hub 1-0:1.0: state 7 ports 2 chg 0000 evt 0000
[ 2972.454334] hub 3-0:1.0: state 7 ports 2 chg 0000 evt 0000
[ 2972.704318] hub 1-0:1.0: state 7 ports 2 chg 0000 evt 0000
[ 2972.704326] hub 3-0:1.0: state 7 ports 2 chg 0000 evt 0000
[ 2972.954073] hub 1-0:1.0: state 7 ports 2 chg 0000 evt 0000
[ 2972.954079] hub 3-0:1.0: state 7 ports 2 chg 0000 evt 0000

usbcore's hub.c seems to be vigorously reporting that it has nothing to report -
no USB devices are connected to any hub.

Any ideas what caused them to appear in 2.6.27-rc5 going from 2.6.26-rc9 ?

TIA, Jason
Comment 45 Jason Vas Dias 2008-10-10 22:40:15 UTC
Nope, that early-quirks.c patch did NOT work - I'm still getting the 
dmi_ignore_irq0_timer_override warning calltrace with the patch - it
did look too easy to be correct.
I'll look at this some more tomorrow .
Comment 46 Rafael J. Wysocki 2008-10-11 04:59:29 UTC
*** Bug 11719 has been marked as a duplicate of this bug. ***
Comment 47 Jason Vas Dias 2008-10-11 15:38:33 UTC
BUG: arch/x86/kernel/early_quirks.c's ati_bugs() is not being called early enough!:

yes that patch did cause my SMBus to berecognized, but it is not being called
BEFORE the warning is generated in arch/x86/kernel/acpi/boot.c, but after !

I think it would be a good idea if the new ati_bugs() function is simply
moved to acpi's boot.c - I try this next, if no-one else has already done so.
 
Comment 48 Jason Vas Dias 2008-10-11 15:39:53 UTC
Created attachment 18267 [details]
log from 2.6.27-tip
Comment 49 Fabio Pedretti 2008-10-13 08:12:08 UTC
The workaround in the kernel is good, but from what I understand the best solution would be to fix the BIOS.

Is HP aware of its buggy BIOS? Does someone reported the bug to HP?
Comment 50 herrmann.der.user 2008-10-14 07:08:56 UTC
Jason,

I am currently working on a fix for bug #11715
Your issue seems quite similar.

WRT comment #44:

> +       { PCI_VENDOR_ID_ATI, PCI_DEVICE_ID_ATI_SBX00_SMBUS,
> +         PCI_CLASS_SERIAL_SMBUS, PCI_ANY_ID, 0, ati_bugs },

This would have been too easy.
For SB600 some other bits have to be checked.

Can you please attach output of "lspci -nnvvxxxx"?
Thanks, Andreas
Comment 51 herrmann.der.user 2008-10-14 11:09:25 UTC
IMHO this is a duplicate of bug #11715
dmesg shows that your machine has an override for IRQ0. This is
obviously wrong as Linux needs to unmask IRQ0 in PIC to route timer
interrupts to IOAPIC:

[    0.000000] ACPI: INT_SRC_OVR (bus 0 bus_irq 0 global_irq 2 dfl dfl)
...
[    0.000000] ACPI: IRQ0 used by override.
[    0.000000] ACPI: IRQ2 used by override.
...
[    0.092533] ..TIMER: vector=0x30 apic1=0 pin1=2 apic2=-1 pin2=-1
[    0.093985] ..MP-BIOS bug: 8254 timer not connected to IO-APIC
[    0.093985] ...trying to set up timer (IRQ0) through the 8259A ...
[    0.093985] ..... (found apic 0 pin 2) ...
[    0.104133] ....... works.

And when Linux unmasks INT2 of IOAPIC it stumbles over
"HP's thermal trap point".
Linux should skip the bogus timer override.
Fix for testing follows.
Comment 52 herrmann.der.user 2008-10-14 12:08:18 UTC
Created attachment 18305 [details]
Patch is against today's Linus' git (v2.6.27-3977-gc166ab7)

Please give it a try! Thanks.
Comment 53 Jason Vas Dias 2008-10-15 19:08:08 UTC
Thanks Andreas! But niggle still remains:

boot.c's dmi_ignore_irq0_timer_override is still called BEFORE 
early-quirks.c, so still gets this calltrace on the warning:

[    0.000000] Call Trace:
[    0.000000]  [<ffffffff8024bd77>] warn_slowpath+0xb7/0xe0
[    0.000000]  [<ffffffff80213500>] ? nommu_map_single+0x30/0x70
[    0.000000]  [<ffffffff8024d067>] ? printk+0x67/0x70
[    0.000000]  [<ffffffff809871ca>] ? __reserve_early+0x7a/0xe0
[    0.000000]  [<ffffffff806bcfd3>] ? init_memory_mapping+0x633/0x720
[    0.000000]  [<ffffffff80980140>] ? early_idt_handler+0x0/0x72
[    0.000000]  [<ffffffff8098bb20>] dmi_ignore_irq0_timer_override+0x30/0x60
[    0.000000]  [<ffffffff805a6624>] dmi_check_system+0x74/0xc0
[    0.000000]  [<ffffffff8098c575>] acpi_boot_table_init+0x15/0xf0
[    0.000000]  [<ffffffff809850b5>] setup_arch+0x3f5/0x840
[    0.000000]  [<ffffffff80980140>] ? early_idt_handler+0x0/0x72
[    0.000000]  [<ffffffff806dc346>] ? _spin_unlock+0x26/0x30
[    0.000000]  [<ffffffff80980140>] ? early_idt_handler+0x0/0x72
[    0.000000]  [<ffffffff80980cb1>] start_kernel+0x71/0x410
[    0.000000]  [<ffffffff8098027c>] x86_64_start_reservations+0x7c/0xc0
[    0.000000]  [<ffffffff809803da>] x86_64_start_kernel+0xea/0x110
[    0.000000] ---[ end trace 4eaa2a86a8e2da22 ]---


I suggest this patch:
$ diff -U0 arch/x86/kernel/acpi/boot.c~ arch/x86/kernel/acpi/boot.c
--- arch/x86/kernel/acpi/boot.c~        2008-10-15 21:00:19.000000000 -0400
+++ arch/x86/kernel/acpi/boot.c 2008-10-15 21:52:18.000000000 -0400
@@ -1410 +1409,0 @@
-               WARN(1, KERN_ERR "ati_ixp4x0 quirk not complete.\n");

How can ati_ixp4x0 quirk be "complete" at this stage when ati_bugs()
has not been invoked ?

Your ati_bugs() does not seem to be being invoked for my SMBus - here's
the 'lspci -nnvxxxx' output for it:

00:14.1 IDE interface [0101]: ATI Technologies Inc SB600 IDE [1002:438c] (prog-if 82 [Master PriP])
        Subsystem: Hewlett-Packard Company Unknown device [103c:30c2]
        Control: I/O+ Mem- BusMaster+ SpecCycle- MemWINV- VGASnoop- ParErr- Stepping- SERR- FastB2B- DisINTx-
        Status: Cap- 66MHz+ UDF- FastB2B- ParErr- DEVSEL=medium >TAbort- <TAbort- <MAbort- >SERR- <PERR- INTx-
        Latency: 64, Cache Line Size: 64 bytes
        Interrupt: pin A routed to IRQ 16
        Region 0: I/O ports at 01f0 [size=8]
        Region 1: I/O ports at 03f4 [size=1]
        Region 2: I/O ports at 0170 [size=8]
        Region 3: I/O ports at 0374 [size=1]
        Region 4: I/O ports at 5040 [size=16]
        Kernel driver in use: ATIIXP_IDE
        Kernel modules: pata_acpi, pata_atiixp, atiixp, ide-pci-generic
00: 02 10 8c 43 05 00 20 02 00 82 01 01 10 40 00 00
10: 01 00 00 00 01 00 00 00 01 00 00 00 01 00 00 00
20: 41 50 00 00 00 00 00 00 00 00 00 00 3c 10 c2 30
30: 00 00 00 00 00 00 00 00 00 00 00 00 0a 01 00 00
40: 99 20 00 00 ff 20 00 00 00 01 04 00 00 00 00 00
50: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
60: 00 00 40 00 10 2c 01 07 01 00 00 00 ff ff 0f 00
70: 05 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
80: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
90: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
a0: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
b0: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
c0: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
d0: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
e0: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
f0: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00


Hmm, this is now identifying as "438c", not "4385" -
my SMBus PCI id changed from 2.6.27-@08-10-11 from pci_ids.h: 
#define PCI_DEVICE_ID_ATI_IXP600_IDE    0x438c
#define PCI_DEVICE_ID_ATI_IXP700_SATA   0x4390

But my system is 100% SATA, so why do I see the "IDE" flavor ?

So, 2nd suggested patch:

--- arch/x86/kernel/early-quirks.c~     2008-10-15 21:01:17.000000000 -0400
+++ arch/x86/kernel/early-quirks.c      2008-10-15 22:04:37.000000000 -0400
@@ -228 +228,9 @@
-       { PCI_VENDOR_ID_ATI, PCI_DEVICE_ID_ATI_SBX00_SMBUS,
+       { PCI_VENDOR_ID_ATI, PCI_DEVICE_ID_ATI_SBX00,
+         PCI_CLASS_SERIAL_SMBUS, PCI_ANY_ID, 0, ati_bugs_contd },
+       { PCI_VENDOR_ID_ATI, PCI_DEVICE_ID_ATI_IXP600_SATA,
+         PCI_CLASS_SERIAL_SMBUS, PCI_ANY_ID, 0, ati_bugs_contd },
+       { PCI_VENDOR_ID_ATI, PCI_DEVICE_ID_ATI_IXP600_IDE,
+         PCI_CLASS_SERIAL_SMBUS, PCI_ANY_ID, 0, ati_bugs_contd },
+       { PCI_VENDOR_ID_ATI, PCI_DEVICE_ID_ATI_IXP700_IDE,
+         PCI_CLASS_SERIAL_SMBUS, PCI_ANY_ID, 0, ati_bugs_contd },
+       { PCI_VENDOR_ID_ATI, PCI_DEVICE_ID_ATI_IXP700_SATA,


BTW, I am going to raise a NEW bug about this USB message flood, which
also appeared going from 2.6.26.x -> 2.6.27 , with a patch that does
"when last event is same as this event, don't emit printk" :
[ 1325.204074] hub 1-0:1.0: state 7 ports 2 chg 0000 evt 0000
[ 1325.204080] hub 3-0:1.0: state 7 ports 2 chg 0000 evt 0000
[ 1325.454075] hub 1-0:1.0: state 7 ports 2 chg 0000 evt 0000
[ 1325.454082] hub 3-0:1.0: state 7 ports 2 chg 0000 evt 0000
[ 1325.704101] hub 1-0:1.0: state 7 ports 2 chg 0000 evt 0000
[ 1325.704108] hub 3-0:1.0: state 7 ports 2 chg 0000 evt 0000
[ 1325.954073] hub 1-0:1.0: state 7 ports 2 chg 0000 evt 0000
[ 1325.954080] hub 3-0:1.0: state 7 ports 2 chg 0000 evt 0000
[ 1326.204075] hub 1-0:1.0: state 7 ports 2 chg 0000 evt 0000
[ 1326.204082] hub 3-0:1.0: state 7 ports 2 chg 0000 evt 0000
[ 1326.454319] hub 1-0:1.0: state 7 ports 2 chg 0000 evt 0000
[ 1326.454328] hub 3-0:1.0: state 7 ports 2 chg 0000 evt 0000
Comment 54 Jason Vas Dias 2008-10-15 19:21:21 UTC
Created attachment 18335 [details]
lspci -nnvxxxx under 2.6.27-jvd ( linus + tip + roland trees of 2008-10-15@21:00 )
Comment 55 Jason Vas Dias 2008-10-15 19:22:30 UTC
Created attachment 18336 [details]
boot log from 2.6.27-rc7-jvd (linus + tip + roland GIT trees 2008-10-15@21:00)
Comment 56 herrmann.der.user 2008-10-16 03:32:22 UTC
Jason, thanks for the lspci output.
It shows that timer interrupt is routed to INT0 of IOAPIC.
Thus the propagated override for IRQ0 on your Laptop is bogus.

(1) My patch from comment #52 should address and fix that problem.
Without the need to add every now and then new Laptop models to the DMI
list you are referring to. 

(2) All dmi-quirks dmi_ignore_irq0_timer_override should be removed (IMHO).
I'll post a patch for this asap.

(3) The usb-problem is another thing -- to open a new bugzilla is the right thing to do.

Maybe you should have a look at bug #11715 -- symptoms are very similar.

It just seems to be a slightly different laptop model than yours. This best
explains why a DMI based quirk sucks to solve this problem.

If you like to test my patch you should comment out the dmi quirk for
"HP 6715b laptop" in arch/x86/acpi/boot.c
Comment 57 herrmann.der.user 2008-10-16 05:32:09 UTC
Created attachment 18338 [details]
remove superfluous DMI quirks

This patch together with patch from attachment #18305 [details] should fix the problem
and avoid the warning.
Comment 58 Jason Vas Dias 2008-10-16 11:32:23 UTC
Thanks ! patch works as expected.
But are we POSITIVE that nothing can possibly depend on IRQ0 routing or thermal 
trip points BEFORE your code in 'early_quirks.c' gets run ? 
That's why I thought it might be better to move the early-quirks.c code to
acpi/boot.c, since before ACPI boots, nothing can possibly depend on it.
Regards,
Jason
Comment 59 Fabio Pedretti 2008-10-20 09:14:05 UTC
Is this supposed to be fixed in 2.6.27.2 with this patch?
http://git.kernel.org/?p=linux/kernel/git/stable/linux-2.6.27.y.git;a=commit;h=e87898fdba90f9a270ae6bdb8ce98da91338a951
Comment 60 Rafael J. Wysocki 2008-10-20 12:05:04 UTC
Yes, but not on every box having this issue.  You may need some more patches from the comments above.
Comment 61 herrmann.der.user 2008-10-22 02:59:19 UTC
Short summary.
I'd like to see following patches upstream and in 2.6.27-stable:

(1) x86: SB450: skip IRQ0 override if it is not routed to INT2 of IOAPIC
   (already in Linus' git and in stable tree.)
(2) x86: SB600: skip IRQ0 override if it is not routed to INT2 of IOAPIC
   (not yet in any git-tree but sent to Ingo)
(3) Remove dmi-quirks for HP Laptops that advertise bogus override for IRQ0
   (not yet in any git-tree but sent to Ingo, see attachement #18338)

(2) and (3) are still not upstream but already posted to Ingo/LKML.
Both should go into 2.6.27-stable, too.



Regarding comment #58:
> But are we POSITIVE that nothing can possibly depend
> on IRQ0 routing or thermal trip points BEFORE your
> code in 'early_quirks.c' gets run ?

Yes, what we need is that a quirk is run before ACPI MADT table is parsed.
Then we can set acpi_skip_timer_override if needed and the interrupt override
for IRQ0 is skipped. Calling sequence in setup_arch (x86) is as follows:

early_quirks()           (detect bogus IRQ0 override)
...
acpi_boot_init()         (e.g. setup IO APIC/parse MADT)

Thus early-quirks are executed before MADT is parsed.
Comment 62 herrmann.der.user 2008-10-22 04:24:31 UTC
Jason,
Ingo suggested to keep the dmi-quirks for at least 1 kernel version.
But then it must be moved after early-quirks are executed.
I'll attach a patch which does this.
Can you please test it on your machine.
Thanks, Andreas
Comment 63 herrmann.der.user 2008-10-22 04:27:18 UTC
Created attachment 18401 [details]
 x86: call dmi-quirks for HP Laptops after early-quirks are executed
Comment 64 Ingo Molnar 2008-10-22 05:19:19 UTC
>  --> (http://bugzilla.kernel.org/attachment.cgi?id=18401&action=view)
>  x86: call dmi-quirks for HP Laptops after early-quirks are executed

applied to tip/x86/urgent, thanks Andreas!

	Ingo
Comment 65 Rafael J. Wysocki 2008-10-24 14:18:09 UTC
Ingo, has that gone to Linus already?
Comment 66 Len Brown 2008-10-24 23:15:57 UTC
shipped in linux-2.6.28-rc1:

commit 26adcfbf00e0726b4469070aa2f530dcf963f484
Author: Andreas Herrmann <andreas.herrmann3@amd.com>
Date:   Tue Oct 14 21:01:15 2008 +0200

    x86: SB600: skip ACPI IRQ0 override if it is not routed to INT2 of IOAPIC
Comment 67 Fabio Pedretti 2008-10-27 05:24:58 UTC
Can this fix be applied on 2.6.27-stable?
Comment 68 Rafael J. Wysocki 2008-10-27 05:35:06 UTC
It's on its way to -stable from what I can tell.
Comment 69 Fabio Pedretti 2008-11-21 01:49:57 UTC
Still appears to not have been applied in 2.6.27.7.
Comment 70 herrmann.der.user 2008-11-21 05:02:34 UTC
Sorry for the delay.
I've just resend the fix for 2.6.27-stable.
I'll update the bugzilla as soon as the fix is integrated.

Note You need to log in before you can comment on or make changes to this bug.