Bug 6560

Summary: Engaged ehci_hcd raises CPU temperature. Prevents fan slowdown.
Product: Drivers
Reporter: Mats Johannesson (spamcan)
Component: USB
Assignee: David Brownell (dbrownell)
Status: RESOLVED WILL_NOT_FIX
Severity: normal
CC: alan, greg, protasnb, stern
Priority: P2
Hardware: i386
OS: Linux
Kernel Version: 2.6.22-rc5-git3
Subsystem:
Regression: No
Bisected commit-id:
Bug Depends on:
Bug Blocks: 5089
Attachments: experimental ehci unlink patch
usb-test-proper.txt
usb-test-patched.txt

Description Mats Johannesson 2006-05-15 13:02:54 UTC
I've also booted 2.6.[11,12,13,14,15,16] and they all behave identically.

_System_

Notebook is an Acer Aspire 1520 (1524) WLMi with an AMD Athlon64 Processor 3400+
on a VIA K8M800 (VT8237 PCI bridge [K8T800 South] according to lspci, but I've
read the actual chip markings and southbridge is a VT8235), 2 Gig memory.

Different lspci dumps can be found as attachments in bug 6072 and acpidump in
bug 5767. System nowadays is a from-scratch-ish pure 64bit thing where I'm in
full control/understanding, meaning no udev, hal, dbus etc with a static /dev.
No desktop, just a WM.

_Problem_

When an external USB 2.0 HD is plugged in, and ehci_hcd gets hold of it, the
core CPU temperature climbs 2-4 degrees centigrade. This temp. rise is not due
to any obvious processor activity, and not because of the increased power drain
through the USB ports (if uhci_hcd is in charge of the HD no anomaly exists).

Disconnecting the HD, the temp. remains at the high level until a "rmmod
ehci_hcd" is executed. Then the temp. immediately begins to drop to normal.

Related issue(?) - where ehci_hcd prevents "AMD K7 CPU Disconnect Control":
https://bugzilla.redhat.com/bugzilla/show_bug.cgi?id=172592

_Consequence_

My notebook's fan control is strongly tied to CPU temperature. Especially at
boot time. If the machine is cold - eg first time boot - there's no problem. But
a reboot (cold or warm) with a connected external HD is aggravating.

Fan has four states - Off, Low, Medium, High - and at boot time the CPU must
drop to 49C for the fan to enter Low speed. The different situations look like this:

Room temperature 26C (summer is closing in)

* No external HD connected *
Passing BIOS - High fan
Loading kernel - dropping to Medium fan.
Log in on text console - CPU 800MHz
After ca 2 minutes CPU temp. falls to 49C - dropping to Low fan.
CPU temp. climbs to 50C due to the lower fan speed.

* External HD connected *
Passing BIOS - High fan
Loading kernel - dropping to Medium fan.
Log in on text console - CPU 800MHz
After 10 minutes the CPU temp. shows 52C - still Medium fan.
Disconnecting HD
Another 10 minutes and still at 52C CPU temp. - Medium fan...
"rmmod ehci_hcd"
Immediate temp. drop.
After ca 1 minute CPU temp. has reached 49C - dropping to Low fan.
CPU temp. climbs to 50C due to the lower fan speed.

And a "Medium" fan speed on this notebook is serious noise!

_Thoughts_

The above temperatures came from a clean kernel.org 2.6.17-rc4, but even when I
undervolt the CPU (which I usually do) through a patch like
http://dev.gentoo.org/~morfic/powernow-k8-vcore_list-2.6.16-rc2-v2.diff it is a
close shave reaching the 49C boot temp. Today for example it failed once, and
when summer hits with full force it will be futile.

If I re-enable the C1-state linus stole - see bug 6072 for my patch - there is
an increase of ca 1250 switches from C0 to C1 per 10 minutes if the HD is
connected (not mounted) as opposed to disconnected, but that seems unrelated
since the unpatched temp. situation is identical. And older kernels like 2.6.11
equally had no C-state handling on this machine.

_Extra Info_

Connecting the HD it looks like this (why does it seem to connect twice?):

usb 1-3: new high speed USB device using ehci_hcd and address 4
usb 1-3: configuration #1 chosen from 1 choice
scsi0 : SCSI emulation for USB Mass Storage devices
usb-storage: device found at 4
usb-storage: waiting for device to settle before scanning
  Vendor: IC25N080  Model: ATMR04-0          Rev: MO4O
  Type:   Direct-Access                      ANSI SCSI revision: 00
SCSI device sda: 156301487 512-byte hdwr sectors (80026 MB)
sda: Write Protect is off
sda: Mode Sense: 03 00 00 00
sda: assuming drive cache: write through
SCSI device sda: 156301487 512-byte hdwr sectors (80026 MB)
sda: Write Protect is off
sda: Mode Sense: 03 00 00 00
sda: assuming drive cache: write through
 sda: sda1
sd 0:0:0:0: Attached scsi disk sda
usb-storage: device scan complete

And here's a cat of /proc/bus/usb/devices :

T:  Bus=04 Lev=00 Prnt=00 Port=00 Cnt=00 Dev#=  1 Spd=12  MxCh= 2
B:  Alloc=  0/900 us ( 0%), #Int=  0, #Iso=  0
D:  Ver= 1.10 Cls=09(hub  ) Sub=00 Prot=00 MxPS=64 #Cfgs=  1
P:  Vendor=0000 ProdID=0000 Rev= 2.06
S:  Manufacturer=Linux 2.6.17-rc4 uhci_hcd
S:  Product=UHCI Host Controller
S:  SerialNumber=0000:00:10.2
C:* #Ifs= 1 Cfg#= 1 Atr=c0 MxPwr=  0mA
I:  If#= 0 Alt= 0 #EPs= 1 Cls=09(hub  ) Sub=00 Prot=00 Driver=hub
E:  Ad=81(I) Atr=03(Int.) MxPS=   2 Ivl=255ms

T:  Bus=03 Lev=00 Prnt=00 Port=00 Cnt=00 Dev#=  1 Spd=12  MxCh= 2
B:  Alloc= 93/900 us (10%), #Int=  1, #Iso=  0
D:  Ver= 1.10 Cls=09(hub  ) Sub=00 Prot=00 MxPS=64 #Cfgs=  1
P:  Vendor=0000 ProdID=0000 Rev= 2.06
S:  Manufacturer=Linux 2.6.17-rc4 uhci_hcd
S:  Product=UHCI Host Controller
S:  SerialNumber=0000:00:10.1
C:* #Ifs= 1 Cfg#= 1 Atr=c0 MxPwr=  0mA
I:  If#= 0 Alt= 0 #EPs= 1 Cls=09(hub  ) Sub=00 Prot=00 Driver=hub
E:  Ad=81(I) Atr=03(Int.) MxPS=   2 Ivl=255ms

T:  Bus=03 Lev=01 Prnt=01 Port=01 Cnt=01 Dev#=  2 Spd=1.5 MxCh= 0
D:  Ver= 1.10 Cls=00(>ifc ) Sub=00 Prot=00 MxPS= 8 #Cfgs=  1
P:  Vendor=04b4 ProdID=0033 Rev= 1.00
S:  Product=RF Mouse
C:* #Ifs= 1 Cfg#= 1 Atr=a0 MxPwr=100mA
I:  If#= 0 Alt= 0 #EPs= 1 Cls=03(HID  ) Sub=01 Prot=02 Driver=usbhid
E:  Ad=81(I) Atr=03(Int.) MxPS=   4 Ivl=10ms

T:  Bus=02 Lev=00 Prnt=00 Port=00 Cnt=00 Dev#=  1 Spd=12  MxCh= 2
B:  Alloc=236/900 us (26%), #Int=  2, #Iso=  0
D:  Ver= 1.10 Cls=09(hub  ) Sub=00 Prot=00 MxPS=64 #Cfgs=  1
P:  Vendor=0000 ProdID=0000 Rev= 2.06
S:  Manufacturer=Linux 2.6.17-rc4 uhci_hcd
S:  Product=UHCI Host Controller
S:  SerialNumber=0000:00:10.0
C:* #Ifs= 1 Cfg#= 1 Atr=c0 MxPwr=  0mA
I:  If#= 0 Alt= 0 #EPs= 1 Cls=09(hub  ) Sub=00 Prot=00 Driver=hub
E:  Ad=81(I) Atr=03(Int.) MxPS=   2 Ivl=255ms

T:  Bus=02 Lev=01 Prnt=01 Port=00 Cnt=01 Dev#=  2 Spd=1.5 MxCh= 0
D:  Ver= 1.10 Cls=00(>ifc ) Sub=00 Prot=00 MxPS= 8 #Cfgs=  1
P:  Vendor=046d ProdID=c50c Rev=22.40
S:  Manufacturer=Logitech
S:  Product=USB Receiver
C:* #Ifs= 2 Cfg#= 1 Atr=a0 MxPwr= 98mA
I:  If#= 0 Alt= 0 #EPs= 1 Cls=03(HID  ) Sub=01 Prot=01 Driver=usbhid
E:  Ad=81(I) Atr=03(Int.) MxPS=   8 Ivl=10ms
I:  If#= 1 Alt= 0 #EPs= 1 Cls=03(HID  ) Sub=01 Prot=02 Driver=usbhid
E:  Ad=82(I) Atr=03(Int.) MxPS=   8 Ivl=10ms

T:  Bus=01 Lev=00 Prnt=00 Port=00 Cnt=00 Dev#=  1 Spd=480 MxCh= 6
B:  Alloc=  0/800 us ( 0%), #Int=  0, #Iso=  0
D:  Ver= 2.00 Cls=09(hub  ) Sub=00 Prot=01 MxPS=64 #Cfgs=  1
P:  Vendor=0000 ProdID=0000 Rev= 2.06
S:  Manufacturer=Linux 2.6.17-rc4 ehci_hcd
S:  Product=EHCI Host Controller
S:  SerialNumber=0000:00:10.3
C:* #Ifs= 1 Cfg#= 1 Atr=e0 MxPwr=  0mA
I:  If#= 0 Alt= 0 #EPs= 1 Cls=09(hub  ) Sub=00 Prot=00 Driver=hub
E:  Ad=81(I) Atr=03(Int.) MxPS=   2 Ivl=256ms

T:  Bus=01 Lev=01 Prnt=01 Port=02 Cnt=01 Dev#=  4 Spd=480 MxCh= 0
D:  Ver= 2.00 Cls=00(>ifc ) Sub=00 Prot=00 MxPS=64 #Cfgs=  1
P:  Vendor=067b ProdID=3507 Rev= 0.01
S:  Manufacturer=Prolific
S:  Product=PL-3507C USB Storage Device
S:  SerialNumber=013023EC
C:* #Ifs= 1 Cfg#= 1 Atr=c0 MxPwr=100mA
I:  If#= 0 Alt= 0 #EPs= 2 Cls=08(stor.) Sub=06 Prot=50 Driver=usb-storage
E:  Ad=01(O) Atr=02(Bulk) MxPS= 512 Ivl=0ms
E:  Ad=82(I) Atr=02(Bulk) MxPS= 512 Ivl=0ms
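
For reading a dump like the one above at a glance, here is a minimal sketch that prints bus number, speed and product name for each device. The function name list_usb_speeds is made up for illustration. It assumes the T: and S: line layout shown in this report, and it takes the file to read as an argument because /proc/bus/usb/devices may not exist on every system.

```shell
#!/bin/sh
# Hypothetical helper (not part of any USB tool): list bus, speed and
# product for every device in a /proc/bus/usb/devices style dump.
# Assumes the "T:" and "S:  Product=" line layout shown in this report.
list_usb_speeds() {
  awk '
    # Topology line: remember the bus number and speed of this device.
    /^T:/ {
      bus = ""; spd = ""
      for (i = 1; i <= NF; i++) {
        if ($i ~ /^Bus=/) bus = substr($i, 5)
        if ($i ~ /^Spd=/) spd = substr($i, 5)
      }
    }
    # Product string line: print it with the values remembered above.
    /^S:  Product=/ {
      sub(/^S:  Product=/, "")
      printf "bus %s  %s Mbit/s  %s\n", bus, spd, $0
    }
  ' "${1:-/proc/bus/usb/devices}"
}
```

Run as "list_usb_speeds saved-dump.txt" (file name is an example). A speed of 480 means the device sits on the high speed (ehci_hcd) bus, while 12 or 1.5 means it is handled by a full/low speed controller such as uhci_hcd.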

If more info is needed, you know where to reach me.
Comment 1 David Brownell 2006-06-13 14:01:03 UTC
Created attachment 8299 [details]
experimental ehci unlink patch

Yeech, VIA again.  We know there are hardware issues with this,
it doesn't issue some IRQs it's supposed to issue.  Try the patch
I've attached here, which changes how those hardware issues get
worked around ... maybe it will help, maybe not.

Also, with CONFIG_USB_DEBUG, when it's getting this overheat thing,
please look at /sys/class/usb_host/.../registers for that controller
(the file will say inside that it's EHCI).  Look at it several times,
see if its contents are changing during this overheat thing, and
please attach a copy of it (plus a description of any changes you
noticed).
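
One possible way to do the "look at it several times" step is sketched below: take a snapshot of the file, wait, take another, and report whether anything changed. The function name watch_file and its arguments are invented for this sketch, and nothing in it is specific to EHCI. It simply compares successive reads of whatever file it is given.

```shell
#!/bin/sh
# Hypothetical helper: read a file COUNT times, DELAY seconds apart, and
# report for each later sample whether it differs from the previous one.
# Usage: watch_file FILE [COUNT] [DELAY]
watch_file() {
  file=$1; count=${2:-5}; delay=${3:-2}
  prev=$(mktemp) || return 1
  cur=$(mktemp) || { rm -f "$prev"; return 1; }
  # First sample is the baseline for the comparisons below.
  cat "$file" >"$prev" || { rm -f "$prev" "$cur"; return 1; }
  i=1
  while [ "$i" -lt "$count" ]; do
    sleep "$delay"
    cat "$file" >"$cur"
    if cmp -s "$prev" "$cur"; then
      echo "sample $i: unchanged"
    else
      echo "sample $i: changed"
      # Show which lines differ; diff exits non-zero on differences.
      diff "$prev" "$cur" || true
    fi
    cp "$cur" "$prev"
    i=$((i + 1))
  done
  rm -f "$prev" "$cur"
}
```

For example, "watch_file /sys/class/usb_host/.../registers 10 2" (with the real path for the EHCI controller filled in) would take ten samples two seconds apart.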
Comment 2 Mats Johannesson 2006-06-14 07:37:35 UTC
Compiled two 2.6.17-rc6-git4 kernels with CONFIG_USB_DEBUG and used the patch on
one of them. Wrote a script to capture the data you requested - I'm in no
position to 'notice' changes in this area, except the temperature.

Unfortunately the temp didn't stay put with the patched kernel.

Testing procedure was: Cold boot. Wait for the core CPU temp to reach 49C. Start
script. Wait 1 minute. Plug in HD (no mount). Wait 1 minute. Unplug HD.

Search for "plugged" to find the crossover data points.

#!/bin/sh

if ! grep -q ehci /proc/modules; then
  modprobe ehci_hcd
fi

echo "" >usb-test.txt
echo "USB Test Begin" >>usb-test.txt
echo "**********" >>usb-test.txt

touch /root/.usb

while true; do
  if [ -e /root/.usb ]; then
    if grep -q Prolific /proc/bus/usb/devices; then
      rm -f /root/.usb
      echo "" >>usb-test.txt
      echo "**********" >>usb-test.txt
      echo "HD plugged!" >>usb-test.txt
      echo "**********" >>usb-test.txt
    fi
  fi
  echo "----------" >>usb-test.txt
  date >>usb-test.txt
  echo "----------" >>usb-test.txt
  cat /proc/acpi/thermal_zone/*/temperature >>usb-test.txt
  cat /sys/class/usb_host/usb_host1/registers >>usb-test.txt
  if ! [ -e /root/.usb ]; then
    if ! grep -q Prolific /proc/bus/usb/devices; then
      echo "" >>usb-test.txt
      echo "**********" >>usb-test.txt
      echo "HD unplugged!" >>usb-test.txt
      echo "**********" >>usb-test.txt
      for i in 1 2 3 4 5; do
        echo "----------" >>usb-test.txt
        date >>usb-test.txt
        echo "----------" >>usb-test.txt
        cat /proc/acpi/thermal_zone/*/temperature >>usb-test.txt
        cat /sys/class/usb_host/usb_host1/registers >>usb-test.txt
        sleep 2s
      done
      rmmod ehci_hcd
      echo "" >>usb-test.txt
      echo "**********" >>usb-test.txt
      echo "ehci driver unloaded..." >>usb-test.txt
      echo "**********" >>usb-test.txt
      for i in 1 2 3 4 5; do
        echo "----------" >>usb-test.txt
        date >>usb-test.txt
        echo "----------" >>usb-test.txt
        cat /proc/acpi/thermal_zone/*/temperature >>usb-test.txt
        sleep 2s
      done
      exit
    fi
  fi
  sleep 2s
done
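
A log written by the script above can be boiled down to the lowest and highest temperature seen in each phase. The following is only a sketch: the function name summarise_log is made up, and it assumes that temperature lines look like "temperature:             49 C", which is a guess at the /proc/acpi/thermal_zone output format. If several thermal zones exist, their readings are lumped together.

```shell
#!/bin/sh
# Hypothetical helper: summarise a usb-test.txt style log as min/max
# temperature per phase. Phases are split on the marker lines the test
# script writes ("HD plugged!", "HD unplugged!", "ehci driver unloaded...").
# Assumes temperature lines of the form "temperature:   49 C".
summarise_log() {
  awk '
    # Print the summary for the phase that just ended, then reset.
    function report() {
      if (n > 0)
        printf "%s: min %d C, max %d C (%d samples)\n", phase, min, max, n
      n = 0
    }
    BEGIN { phase = "before plug" }
    /^HD plugged!/          { report(); phase = "plugged" }
    /^HD unplugged!/        { report(); phase = "unplugged" }
    /^ehci driver unloaded/ { report(); phase = "driver unloaded" }
    /^temperature:/ {
      t = $2 + 0
      if (n == 0 || t < min) min = t
      if (n == 0 || t > max) max = t
      n++
    }
    END { report() }
  ' "$1"
}
```

Called as "summarise_log usb-test.txt", it prints one line per phase, which makes it easier to compare the patched and unpatched runs side by side.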
Comment 3 Mats Johannesson 2006-06-14 07:39:43 UTC
Created attachment 8303 [details]
usb-test-proper.txt

unpatched kernel
Comment 4 Mats Johannesson 2006-06-14 07:40:40 UTC
Created attachment 8304 [details]
usb-test-patched.txt

patched kernel
Comment 5 Mats Johannesson 2006-09-29 07:43:39 UTC
Still broken in 2.6.18 final
Comment 6 Natalie Protasevich 2007-06-13 12:35:17 UTC
Mats, any updates on the problem? How are the new releases working for you?
Thanks,
--Natalie
Comment 7 Mats Johannesson 2007-06-19 17:09:44 UTC
Natalie,
linux-2.6.22-rc5-git3 under Ubuntu 7.04. No change. Running my test-script above shows the core CPU temperature rising from 43C to 45C eight seconds after the HD was plugged in.

For me this is no longer a problem. I've done surgery on the notebook and installed passive cooling through various fins and plates on all hotspots, drilled extra ventilation holes and, most importantly, attached a variable resistor to the fan. The machine is whisper quiet.
Comment 8 Natalie Protasevich 2007-06-19 17:36:41 UTC
This is a great workaround, should be offered as a patch ;)
But seriously, this way the test system is no longer available, Mats! Can you please put it all back as it was before...

Are there known errata for this chipset? Maybe this problem needs to be brought to the attention of the ACPI people?
Comment 9 Mats Johannesson 2007-06-19 21:42:11 UTC
Eh... the test system is exactly as before in terms of _symptom_ (2 to 4 degrees core CPU temp rise on HD engagement through ehci_hcd), it's only the _consequence_ (high fan speed == noise) that has been mitigated through my hardware modifications.

The changes are irrevocable, unless you want me to desolder resistors and plug forty one mm ventilation holes etc (it's a work of engineering art ;-)

I don't know about chipset errata, but as you see in comment #1 the USB people know about VIA... From what I've seen on the net, VIA is not particularly friendly vis-à-vis open source developers. According to kernel sources, and other evidence, people worked under an NDA when eg doing IDE stuff for this southbridge (the VT8235).

As for involving ACPI, I can't see a future there. The fan speeds seem to be controlled purely through hardware, reacting to certain temperature thresholds (nothing in /proc/acpi controls it).
Comment 10 Natalie Protasevich 2008-03-24 00:15:00 UTC
Copying to Alan, to help sort out the problem with EHCI and overheating (and decide whether to keep this bug open).
Comment 11 Alan Stern 2008-03-24 07:53:14 UTC
I have no idea what's wrong, other than the fact that some VIA EHCI chips are known to configure themselves incorrectly.  It would be good to try 2.6.25-rc6; that kernel includes a fix for a problem known to affect lots of EHCI controllers (including VIA's).

Also, a patch was submitted last week to prevent some of them from hogging the PCI bus (not applicable to the vt8235, unfortunately).  Maybe something similar is needed to prevent the overheating.  FYI, the bus-hogging patch is <http://marc.info/?l=linux-usb&m=120599996404777&w=2>