Occasionally my machine hangs completely. Fortunately, I managed to capture and save an oops listing via netconsole before one of the hangs; here it is [1].

Here is a small piece of the oops from the link above:

===
[15051.270461] BUG: unable to handle kernel paging request at 00000000ff5ae8e4
[15051.271583] IP: [<ffffffff8109ae6e>] srcu_notifier_call_chain+0xe/0x20
…
[15051.956205] Call Trace:
[15051.980641] [<ffffffff81606085>] ? __cpufreq_notify_transition+0x95/0x1e0
[15052.005640] [<ffffffff816081ee>] cpufreq_notify_transition+0x3e/0x70
[15052.030240] [<ffffffff816083d8>] cpufreq_freq_transition_begin+0xe8/0x130
[15052.054522] [<ffffffff813b8940>] ? ucs2_strncmp+0x70/0x70
[15052.078208] [<ffffffff816089bf>] __target_index+0xbf/0x1a0
[15052.101348] [<ffffffff81608b9c>] __cpufreq_driver_target+0xfc/0x160
[15052.124250] [<ffffffff8160b0d4>] od_check_cpu+0xa4/0xb0
[15052.146789] [<ffffffff8160c9ec>] dbs_check_cpu+0x16c/0x1c0
[15052.168935] [<ffffffff8160b4dd>] od_dbs_timer+0x11d/0x180
[15052.190607] [<ffffffff8108e6ff>] process_one_work+0x17f/0x4c0
[15052.211825] [<ffffffff8108f46b>] worker_thread+0x11b/0x3f0
[15052.232490] [<ffffffff8108f350>] ? create_and_start_worker+0x80/0x80
[15052.253127] [<ffffffff81096479>] kthread+0xc9/0xe0
[15052.273292] [<ffffffff810963b0>] ? flush_kthread_worker+0xb0/0xb0
[15052.293487] [<ffffffff81793efc>] ret_from_fork+0x7c/0xb0
[15052.313544] [<ffffffff810963b0>] ? flush_kthread_worker+0xb0/0xb0
…
===

My lspci [2] and cpuinfo [3] output are available as well.

Vanilla 3.15.8 and 3.16.0 are affected, as is the latest Ubuntu 3.13 kernel. There is no visible trigger for the bug. After the hang the machine doesn't respond over the network, there is no disk I/O, and it doesn't respond to the power button being pressed to perform a soft-off.

[1] https://gist.github.com/085af9da81197faf6637
[2] https://gist.github.com/318ebda5576b099590b8
[3] https://gist.github.com/9c1307463c7ad6835b2d
Created attachment 145781 [details] /proc/cpuinfo
Created attachment 145791 [details] lsb_release -rd
Created attachment 145801 [details] Stacktrace via netconsole
Created attachment 145811 [details] lspci output
Disabling the cpufreq section in the kernel configuration seems to work around this issue. Using NOHZ_IDLE instead of NOHZ_FULL doesn't fix it (I also suspected RCU bugs or something similar).
Created attachment 145821 [details] sudo lspci -vvv output
Created attachment 145831 [details] /proc/modules
Launchpad bugreport: https://bugs.launchpad.net/ubuntu/+source/linux/+bug/1352567
Created attachment 145851 [details] .config
Created attachment 145861 [details] /proc/iomem
Created attachment 145871 [details] /proc/ioports
Created attachment 145881 [details] /proc
Created attachment 145891 [details] /proc/scsi/scsi
Also, the issue happens regardless of whether acpi-cpufreq is built in or compiled as a module.
I am trying to reproduce the issue on my system. I am running a 3.16 kernel on an x86 machine. I have written a small script that loads the CPU (to increase the CPU frequency) and then sleeps for some time (to decrease the CPU frequency). My cpufreq governor is "ondemand". Could you please run the same script, which I will upload, and send me the last line it logs before the system hangs? I have been running the script for the last hour and have not observed any hang.
Created attachment 146301 [details] Script to increase/decrease the cpu load
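The attached script is not reproduced in this thread. As a rough illustration of the idea described above (alternate a load phase with an idle phase and log after each one, so the last line before a hang is identifiable), here is a minimal Python sketch. It is not the attached script: the function names, the phase durations, and the choice of cpu0 are assumptions made for this example. It reads the standard cpufreq sysfs node `scaling_cur_freq` and prints `None` for the frequency when that node is unavailable.

```python
import time
from datetime import datetime
from pathlib import Path

# Standard cpufreq sysfs node; using cpu0 is an arbitrary choice for this sketch.
CUR_FREQ = Path("/sys/devices/system/cpu/cpu0/cpufreq/scaling_cur_freq")


def read_cur_freq_khz(path=CUR_FREQ):
    """Return the current frequency in kHz, or None if it cannot be read."""
    try:
        return int(Path(path).read_text().strip())
    except (OSError, ValueError):
        return None


def busy(seconds):
    """Spin on one core for roughly `seconds`; return the iteration count."""
    deadline = time.monotonic() + seconds
    iterations = 0
    while time.monotonic() < deadline:
        iterations += 1
    return iterations


def log_phase(cycle, phase):
    """Print one timestamped line and flush it so it is not lost in a buffer."""
    stamp = datetime.now().isoformat()
    print(f"{stamp} cycle={cycle} phase={phase} "
          f"freq_khz={read_cur_freq_khz()}", flush=True)


def run_cycles(cycles, busy_s, idle_s):
    """Alternate load and idle phases, logging after each phase."""
    for cycle in range(cycles):
        busy(busy_s)          # load phase: should push the governor upwards
        log_phase(cycle, "busy")
        time.sleep(idle_s)    # idle phase: should let the frequency drop
        log_phase(cycle, "idle")


if __name__ == "__main__":
    # Short hypothetical defaults; a real stress run would use longer phases
    # and many more cycles.
    run_cycles(cycles=3, busy_s=0.2, idle_s=0.2)
```

Since the machine stops responding on hang, the output would need to go somewhere that survives it, for example a remote shell session or a file on another host.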
@Ayan: I've added pr_info calls to the cpufreq transition notifiers, and what I can see is that the frequency already changes often (several times per second), so there's no need for extra CPU load/relax cycles. Also, the machine may hang after 2 hours of uptime or after 25 hours, and there's no obvious reason for the hang. To answer your question: I've tried your script (stress-testing for several hours) and got nothing either, but that is not a significant result. You may follow the linux-pm thread where Viresh tells me how to debug this issue: http://marc.info/?l=linux-pm&m=140786965520720&w=2
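For anyone who wants to check the transition rate without patching the kernel, a userspace approximation is possible when the kernel is built with cpufreq statistics (CONFIG_CPU_FREQ_STAT), which exposes a `stats/total_trans` counter in sysfs. The sketch below is an assumption-laden alternative to the pr_info approach, not what was used in this report: the helper names and the one-second sampling window are made up for the example, and it only looks at cpu0.

```python
import time
from pathlib import Path

# Cumulative count of frequency transitions for cpu0's policy.
# Present only when cpufreq statistics are enabled in the kernel.
TOTAL_TRANS = Path("/sys/devices/system/cpu/cpu0/cpufreq/stats/total_trans")


def read_total_trans(path=TOTAL_TRANS):
    """Return the transition counter, or None if it cannot be read."""
    try:
        return int(Path(path).read_text().strip())
    except (OSError, ValueError):
        return None


def transitions_per_second(before, after, elapsed_s):
    """Average transition rate between two counter samples, or None."""
    if before is None or after is None or elapsed_s <= 0:
        return None
    return (after - before) / elapsed_s


if __name__ == "__main__":
    window_s = 1.0  # hypothetical sampling window
    first = read_total_trans()
    time.sleep(window_s)
    second = read_total_trans()
    print("transitions/s:", transitions_per_second(first, second, window_s))
```

A rate of several transitions per second on an otherwise idle desktop would match the observation above that artificial load cycles add little.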
It seems that this bug has nothing to do with the acpi-cpufreq code but with another ACPI area. With ACPI enabled the kernel may hang within a day or within a week (it has never survived more than approx. 2 weeks). With acpi=off it seems to work OK. For instance, I had to boot the Ubuntu installer with ACPI disabled to get it to finish successfully. Usually the hang is not accompanied by a panic log. Only small vertical red lines appear on the screen near the letters (I tried a plain-text 80x25 console without radeon and got the same issue). I am still observing this with the 3.16 kernel.
Subject: Re: [BUG] oops in cpufreq driver with AMD Kaveri CPU
From: Oleksandr Natalenko <oleksandr () natalenko ! name>
Date: 2014-11-18 19:07:51

acpi=off as well as disabling ASPM and NMI watchdog didn't help
Now trying to update BIOS.

P.S. Still affected while using 3.17.2 kernel.

ref. http://marc.info/?l=linux-acpi&m=141633767812764&w=2

Did the BIOS update help? Are you certain that running an old kernel is stable?
Definitely not a kernel bug. I replaced the RAM module with another one and the issue went away. No idea why the oopses referred to ACPI, but they seem to be the result of a simple hardware incompatibility.