Bug 81701 - OOPS in cpufreq driver with AMD Kaveri CPU - AMD Athlon(tm) 5150 APU with Radeon(tm) R3
Status: CLOSED INVALID
Alias: None
Product: Power Management
Classification: Unclassified
Component: cpufreq
Hardware: x86-64 Linux
Importance: P1 high
Assignee: cpufreq
URL:
Keywords:
Depends on:
Blocks:
 
Reported: 2014-08-04 21:45 UTC by Oleksandr Natalenko
Modified: 2015-04-14 16:04 UTC
CC List: 2 users

See Also:
Kernel Version: 3.16.0
Subsystem:
Regression: No
Bisected commit-id:


Attachments
/proc/cpuinfo (4.29 KB, text/plain)
2014-08-08 14:49 UTC, Oleksandr Natalenko
lsb_release -rd (57 bytes, text/plain)
2014-08-08 14:50 UTC, Oleksandr Natalenko
Stacktrace via netconsole (4.68 KB, text/plain)
2014-08-08 14:51 UTC, Oleksandr Natalenko
lspci output (2.35 KB, text/plain)
2014-08-08 14:52 UTC, Oleksandr Natalenko
sudo lspci -vvv output (36.35 KB, text/plain)
2014-08-08 15:00 UTC, Oleksandr Natalenko
/proc/modules (5.10 KB, text/plain)
2014-08-08 15:00 UTC, Oleksandr Natalenko
.config (166.99 KB, text/plain)
2014-08-08 17:20 UTC, Oleksandr Natalenko
/proc/iomem (2.86 KB, text/plain)
2014-08-08 17:21 UTC, Oleksandr Natalenko
/proc/ioports (1.65 KB, text/plain)
2014-08-08 17:21 UTC, Oleksandr Natalenko
/proc (1.01 KB, text/plain)
2014-08-08 17:21 UTC, Oleksandr Natalenko
/proc/scsi/scsi (177 bytes, text/plain)
2014-08-08 17:22 UTC, Oleksandr Natalenko
Script to increase/decrease the cpu load (147 bytes, application/x-shellscript)
2014-08-12 09:20 UTC, Ayan

Description Oleksandr Natalenko 2014-08-04 21:45:01 UTC
Occasionally my machine hangs completely. Fortunately, I managed to capture and
save the oops listing via netconsole before the hang; here it is [1].

Here is a short excerpt of the oops from the link above:

===
[15051.270461] BUG: unable to handle kernel paging request at 00000000ff5ae8e4
[15051.271583] IP: [<ffffffff8109ae6e>] srcu_notifier_call_chain+0xe/0x20
…
[15051.956205] Call Trace:
[15051.980641]  [<ffffffff81606085>] ? __cpufreq_notify_transition+0x95/0x1e0
[15052.005640]  [<ffffffff816081ee>] cpufreq_notify_transition+0x3e/0x70
[15052.030240]  [<ffffffff816083d8>] cpufreq_freq_transition_begin+0xe8/0x130
[15052.054522]  [<ffffffff813b8940>] ? ucs2_strncmp+0x70/0x70
[15052.078208]  [<ffffffff816089bf>] __target_index+0xbf/0x1a0
[15052.101348]  [<ffffffff81608b9c>] __cpufreq_driver_target+0xfc/0x160
[15052.124250]  [<ffffffff8160b0d4>] od_check_cpu+0xa4/0xb0
[15052.146789]  [<ffffffff8160c9ec>] dbs_check_cpu+0x16c/0x1c0
[15052.168935]  [<ffffffff8160b4dd>] od_dbs_timer+0x11d/0x180
[15052.190607]  [<ffffffff8108e6ff>] process_one_work+0x17f/0x4c0
[15052.211825]  [<ffffffff8108f46b>] worker_thread+0x11b/0x3f0
[15052.232490]  [<ffffffff8108f350>] ? create_and_start_worker+0x80/0x80
[15052.253127]  [<ffffffff81096479>] kthread+0xc9/0xe0
[15052.273292]  [<ffffffff810963b0>] ? flush_kthread_worker+0xb0/0xb0
[15052.293487]  [<ffffffff81793efc>] ret_from_fork+0x7c/0xb0
[15052.313544]  [<ffffffff810963b0>] ? flush_kthread_worker+0xb0/0xb0
…
===

Here are my lspci [2] and cpuinfo [3] outputs as well.

Vanilla 3.15.8 and 3.16.0 are affected, as is the latest Ubuntu 3.13 kernel.

There is no visible trigger for the bug. After the hang, the machine does not
respond over the network, there is no disk I/O, and it does not react to
pressing the power button to perform a soft power-off.

[1] https://gist.github.com/085af9da81197faf6637
[2] https://gist.github.com/318ebda5576b099590b8
[3] https://gist.github.com/9c1307463c7ad6835b2d
Comment 1 Oleksandr Natalenko 2014-08-08 14:49:30 UTC
Created attachment 145781
/proc/cpuinfo
Comment 2 Oleksandr Natalenko 2014-08-08 14:50:17 UTC
Created attachment 145791
lsb_release -rd
Comment 3 Oleksandr Natalenko 2014-08-08 14:51:55 UTC
Created attachment 145801
Stacktrace via netconsole
Comment 4 Oleksandr Natalenko 2014-08-08 14:52:22 UTC
Created attachment 145811
lspci output
Comment 5 Oleksandr Natalenko 2014-08-08 14:54:43 UTC
Disabling the cpufreq section in the kernel configuration seems to work around this issue.

Using NOHZ_IDLE instead of NOHZ_FULL doesn't fix the issue (I had also suspected an RCU bug or something similar).
Comment 6 Oleksandr Natalenko 2014-08-08 15:00:05 UTC
Created attachment 145821
sudo lspci -vvv output
Comment 7 Oleksandr Natalenko 2014-08-08 15:00:24 UTC
Created attachment 145831
/proc/modules
Comment 8 Oleksandr Natalenko 2014-08-08 15:09:59 UTC
Launchpad bug report:

https://bugs.launchpad.net/ubuntu/+source/linux/+bug/1352567
Comment 9 Oleksandr Natalenko 2014-08-08 17:20:58 UTC
Created attachment 145851
.config
Comment 10 Oleksandr Natalenko 2014-08-08 17:21:15 UTC
Created attachment 145861
/proc/iomem
Comment 11 Oleksandr Natalenko 2014-08-08 17:21:30 UTC
Created attachment 145871
/proc/ioports
Comment 12 Oleksandr Natalenko 2014-08-08 17:21:46 UTC
Created attachment 145881
/proc
Comment 13 Oleksandr Natalenko 2014-08-08 17:22:02 UTC
Created attachment 145891
/proc/scsi/scsi
Comment 14 Oleksandr Natalenko 2014-08-08 17:24:19 UTC
Also, the issue occurs regardless of whether acpi-cpufreq is built in or compiled as a module.
Comment 15 Ayan 2014-08-12 09:19:28 UTC
I am trying to reproduce the issue on my system. I am running a 3.16 kernel on an x86 machine. I have written a small script that loads the CPU (to raise the CPU frequency) and then sleeps for some time (to lower it). My cpufreq governor is "ondemand".
Could you please run the same script, which I will upload, and send me the last log output of the script before the system hangs? I have been running the script for the last hour but have not observed any hang.
Comment 16 Ayan 2014-08-12 09:20:35 UTC
Created attachment 146301
Script to increase/decrease the cpu load
Comment 17 Oleksandr Natalenko 2014-08-12 19:01:32 UTC
@Ayan: I've added pr_info calls to the cpufreq transition notifiers, and I can see that the frequency changes often enough (several times per second), so there's no need for extra CPU load/relax cycles.

Also, the machine may hang after 2 hours of uptime or after 25 hours, with no obvious reason for the hang.

To answer your question, I've tried your script (stress-testing for several hours) and got nothing either, but that result is not significant.

You may follow the linux-pm thread where Viresh explains how to debug this issue:

http://marc.info/?l=linux-pm&m=140786965520720&w=2
Comment 18 Oleksandr Natalenko 2014-11-11 10:42:11 UTC
It seems that this bug has nothing to do with the acpi-cpufreq code, but rather
with some other area of ACPI.

With ACPI enabled, the kernel may hang within a day or within a week (it has
never survived more than approx. 2 weeks). With acpi=off it seems to work OK.
For instance, I had to boot the Ubuntu installer with ACPI disabled for it to
finish successfully.

Usually, the hang is not accompanied by a panic log. Only small vertical red
lines appear on the screen next to the characters (I tried using a plain-text
80x25 console without radeon and got the same issue).

Still observing this with the 3.16 kernel.
Comment 19 Len Brown 2015-04-14 15:47:53 UTC
Subject:    Re: [BUG] oops in cpufreq driver with AMD Kaveri CPU
From:       Oleksandr Natalenko <oleksandr () natalenko ! name>
Date:       2014-11-18 19:07:51

Neither acpi=off nor disabling ASPM and the NMI watchdog helped.

Now trying to update the BIOS.

P.S. Still affected with the 3.17.2 kernel.

ref. http://marc.info/?l=linux-acpi&m=141633767812764&w=2

Did the BIOS update help?
Are you certain that running an old kernel is stable?
Comment 20 Oleksandr Natalenko 2015-04-14 16:04:13 UTC
Definitely not a kernel bug.

I replaced the RAM module with another one and the issue went away.

I have no idea why the oopses pointed to ACPI, but they seem to have been the result of a simple hardware incompatibility.
