Bug 75421

Summary: [pandaboard] cpuidle breaks USB stability
Product: Power Management Reporter: Tobias Jakobi (liquid.acid)
Component: cpuidleAssignee: Rafael J. Wysocki (rjw)
Status: CLOSED OBSOLETE    
Severity: high CC: kai.reichert, nm, santosh.shilimkar, tony
Priority: P1    
Hardware: ARM   
OS: Linux   
Kernel Version: 3.15.0-rc3 Subsystem:
Regression: No Bisected commit-id:
Attachments: kernel config
dmesg output
cpuidle states

Description Tobias Jakobi 2014-05-03 18:48:16 UTC
Hello,

previously I couldn't get the Pandaboard ES to even boot with cpuidle enabled. I retried with vanilla-3.15.0-rc3 and while booting works now, the system, more specifically the USB, is incredibly unstable now.

I get lots of these message in the kernel log:
gic_timer_retrigger: lost localtimer interrupt

SSH connections to the Panda are mostly unuseable due to lags/timeout. Pinging from my desktop system yields:
64 bytes from fragile.entropy (192.168.0.253): icmp_seq=37 ttl=64 time=17034 ms
64 bytes from fragile.entropy (192.168.0.253): icmp_seq=38 ttl=64 time=16027 ms
64 bytes from fragile.entropy (192.168.0.253): icmp_seq=39 ttl=64 time=15027 ms
64 bytes from fragile.entropy (192.168.0.253): icmp_seq=40 ttl=64 time=14027 ms
64 bytes from fragile.entropy (192.168.0.253): icmp_seq=41 ttl=64 time=13027 ms
64 bytes from fragile.entropy (192.168.0.253): icmp_seq=42 ttl=64 time=12027 ms
64 bytes from fragile.entropy (192.168.0.253): icmp_seq=43 ttl=64 time=11027 ms
64 bytes from fragile.entropy (192.168.0.253): icmp_seq=44 ttl=64 time=10028 ms

This is not unexpected, since the ethernet adapter of the Panda is connected via USB. I also have /home and /var on a USB thumbdrive, and even a bit of activity on /home then brings the system down:
usb 1-1.3: reset high-speed USB device number 4 using ehci-omap
usb 1-1.3: reset high-speed USB device number 4 using ehci-omap
usb 1-1.3: reset high-speed USB device number 4 using ehci-omap
usb 1-1.3: reset high-speed USB device number 4 using ehci-omap
usb 1-1.3: reset high-speed USB device number 4 using ehci-omap
usb 1-1.3: reset high-speed USB device number 4 using ehci-omap
sd 0:0:0:0: Device offlined - not ready after error recovery
[...etc...]

So cpuidle still seems pretty much broken on the Panda.

Greets,
Tobias
Comment 1 Tobias Jakobi 2014-05-03 19:23:32 UTC
Created attachment 134971 [details]
kernel config

kernel config that is working (with cpuidle disabled)
Comment 2 Tobias Jakobi 2014-05-03 19:23:58 UTC
Created attachment 134981 [details]
dmesg output

dmesg output from working kernel (with cpuidle disabled)
Comment 3 Len Brown 2014-05-16 04:55:07 UTC
What idle states does the hardware export?
You can see them this way:

dmesg | grep idle
grep . /sys/devices/system/cpu/cpu*/cpuidle/*/*
Comment 4 Tobias Jakobi 2014-05-17 20:33:15 UTC
I just applied the patch from the linux-omap ml:
http://www.spinics.net/lists/linux-omap/msg107002.html

Tested the board a bit with this, and it seems to fix the issue. Going to close this, once it hits mainline.
Comment 5 Tobias Jakobi 2014-05-17 23:05:45 UTC
Created attachment 136541 [details]
cpuidle states
Comment 6 Tobias Jakobi 2014-05-17 23:07:56 UTC
OK, scratch that. The patch seems to help, but after some time, I get a lot of "BUG: scheduling while atomic" in the kernel log.

But first things first.

dmesg | grep idle:
cpuidle: using governor ladder
cpuidle: using governor menu

grep . /sys/devices/system/cpu/cpu*/cpuidle/*/*:
(now attached as 'cpuidle states')

The 'BUG' looks like this:
BUG: scheduling while atomic: swapper/1/0/0xffff0000
Modules linked in: ctr ccm btrfs xor xor_neon zlib_inflate zlib_deflate omapfb cfbfillrect cfbimgblt cfbcopyarea raid6_pq arc4 wl12xx wlcore mac80211 cfg80211 rfkill usb_storage snd_soc_omap_hdmi snd_soc_omap_hdmi_card wlcore_sdio
CPU: 1 PID: 0 Comm: swapper/1 Tainted: G        W     3.15.0-rc5+ #1
Backtrace: 
[<c001152c>] (dump_backtrace) from [<c00116c8>] (show_stack+0x18/0x1c)
 r6:eb894000 r5:00000001 r4:00000000 r3:00000000
[<c00116b0>] (show_stack) from [<c0430a94>] (dump_stack+0x7c/0x98)
[<c0430a18>] (dump_stack) from [<c042ec10>] (__schedule_bug+0x48/0x60)
 r4:00000000 r3:00000153
[<c042ebc8>] (__schedule_bug) from [<c0432248>] (__schedule+0x41c/0x4b4)
 r4:2b9fe000 r3:00000000
[<c0431e2c>] (__schedule) from [<c04323d4>] (schedule+0x38/0x88)
 r10:eb894000 r9:eb894000 r8:00000000 r7:eb894000 r6:c05b0450 r5:c05b04a0
 r4:eb894000
[<c043239c>] (schedule) from [<c04326fc>] (schedule_preempt_disabled+0x10/0x14)
[<c04326ec>] (schedule_preempt_disabled) from [<c006d8f0>] (cpu_startup_entry+0xec/0x22c)
[<c006d804>] (cpu_startup_entry) from [<c00131a4>] (secondary_start_kernel+0xe8/0x100)
 r7:c05de240
[<c00130bc>] (secondary_start_kernel) from [<80008664>] (0x80008664)
 r4:ab87c06a r3:c000864c
Comment 7 Tony Lindgren 2014-05-19 17:49:29 UTC
(In reply to Tobias Jakobi from comment #6)
> OK, scratch that. The patch seems to help, but after some time, I get a lot
> of "BUG: scheduling while atomic" in the kernel log.

Tobias, have you tried the updated patch from Santosh in the
same thread? It's this one:

http://www.spinics.net/lists/linux-omap/msg107161.html
Comment 8 Tobias Jakobi 2014-05-19 18:57:11 UTC
Hello, this is exactly the patch I have applied at the moment. Sorry I wasn't clear about which one I did apply.
Comment 9 Tobias Jakobi 2014-05-19 19:38:09 UTC
Just for the record: In comment #4 I meant that I applied the most recent patch from the thread on the ml. At that time, the updated patch was already posted. Again sorry for the confusion.
Comment 10 Nishanth Menon 2014-10-15 16:15:18 UTC
Tobias: same issue with 3.17?
Comment 11 Tobias Jakobi 2014-10-15 16:30:23 UTC
Guess I forgot this one. Closing, since I no longer own the device.
Comment 12 Kai Reichert 2014-11-15 19:31:09 UTC
Yes, same issue with kernel 3.17.2.

This bug is not obsolete!