This originally started as a GPU regression report against a change to the turbo logic. After much random walking, reporter consensus seems to have settled on max_cstate=1 as the one true workaround. See https://bugs.freedesktop.org/show_bug.cgi?id=88012 for all the glorious details.
For me setting max_cstate=1 didn't solve the bug. It improved the time to freeze from a couple of minutes to a couple of hours. But it is not a fully and universally working workaround.
Experienced on an Intel Celeron CPU J1900 (platform GB-BXBT-1900) on Arch Linux x64. I cannot upgrade to a kernel newer than 3.14; otherwise I get very frequent crashes when playing videos or browsing the web. By contrast, kernel 3.14 is extremely stable and the machine can stay up for weeks.
same here on an Asrock Q1900-ITX (Intel Celeron J1900): random freezes in X session.
Same here on 50+ ASRock IMB-150 mini-ITX (Intel Celeron J1900) boards: Random freezes, time to freeze ranging from some ten minutes to some hours, only when using X with conky + own QT based App (no freezes when not using X!), so it seems very likely that this problem is GPU related. I will test with kernel parameter intel_idle.max_cstate=1 to see if it is a working workaround for my case and report here later.
but intel_idle.max_cstate=1 would result in seriously increased power consumption?!
(In reply to raidyne from comment #5)
> but intel_idle.max_cstate=1 would result in seriously increased power
> consumption?!

Correct. And that is the reason why this bug needs to be fixed soon :-) intel_idle.max_cstate=1 is just a quick workaround, so your Bay Trail machine can live longer than just a few minutes.

I can confirm the same random machine freezes on an Acer notebook with a Celeron N2940. The freezes are really random, but usually more frequent when using the CPU or GFX heavily. Freezes occur on all 4.2.x kernels I've tested so far. I've been able to fix this by using either intel_idle.max_cstate=1 or intel_pstate=disable. Using one of these kernel parameters makes my machine usable again.
I've successfully tested longterm kernel 4.1.13. This one seems to work without a single freeze for the last 8 hours of uptime. I didn't need to use any of the intel_idle.max_cstate or intel_pstate kernel parameters with this kernel.
I'm happy to provide you with any logs. Unfortunately though, my system does not seem to be particularly verbose concerning this bug: I could not find any hints in dmesg, kern.log, syslog, or xorg.log.
I'm seeing this freeze on an Acer Notebook with a Celeron N2940 when running Fedora.

https://bugzilla.redhat.com/show_bug.cgi?id=1285895

No issues with older fedora 22 kernel
- kernel-4.1.6-200.fc22.x86_64

Still have issues with latest fedora kernel build
- kernel-4.2.6-301.fc23.x86_64

Is there an easy way to show the current cstate of the system?
(In reply to Steven Ellis from comment #9)
> I'm seeing this freeze on an Acer Notebook with a Celeron N2940 when running
> Fedora.
>
> https://bugzilla.redhat.com/show_bug.cgi?id=1285895
>
> No issues with older fedora 22 kernel
> - kernel-4.1.6-200.fc22.x86_64
>
> Still have issues with latest fedora kernel build
> - kernel-4.2.6-301.fc23.x86_64
>
> Is there an easy way to show the current cstate of the system?

Yes: PowerTOP (http://01.org/powertop/) and i7z (https://code.google.com/p/i7z/) can tell you this.

Example output from PowerTOP:

PowerTOP 2.8    Overview   Idle stats   Frequency stats   Device stats   Tunables

          Package   |            Core     |            CPU 0
                    |                     | C0 active   0.7%
                    |                     | POLL        0.1%    0.3 ms
                    | C1 (cc1)   99.3%    | C1-BYT     99.4%    4.2 ms
C2 (pc2)    0.0%    |                     |
C3 (pc3)    0.0%    |                     |
C6 (pc6)    0.0%    | C6 (cc6)    0.0%    |

                    |            Core     |            CPU 1
                    |                     | C0 active   0.5%
                    |                     | POLL        0.0%    0.0 ms
                    | C1 (cc1)   99.5%    | C1-BYT     99.4%   22.8 ms
                    | C6 (cc6)    0.0%    |

                    |            Core     |            CPU 2
                    |                     | C0 active   0.9%
                    |                     | POLL        0.0%    0.0 ms
                    | C1 (cc1)   98.9%    | C1-BYT     99.0%    8.5 ms
                    | C6 (cc6)    0.0%    |

                    |            Core     |            CPU 3
                    |                     | C0 active   1.3%
                    |                     | POLL        0.0%    0.0 ms
                    | C1 (cc1)   98.0%    | C1-BYT     98.0%   11.1 ms
                    | C6 (cc6)    0.0%    |

                    |            GPU      |
                    | Powered On  0.0%    |
                    | RC6       100.0%    |
                    | RC6p        0.0%    |
                    | RC6pp       0.0%    |
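For anyone who wants the C-state numbers without installing PowerTOP, the kernel exposes per-CPU idle-state counters under /sys/devices/system/cpu/cpuN/cpuidle/stateM/. The following is only a minimal sketch: the function names are made up for illustration, and the percentages are shares of accumulated idle time since boot (C0 is not counted), so they will not match PowerTOP's interval-based figures exactly.

```python
from pathlib import Path

# Standard sysfs location of the cpuidle counters on Linux.
CPU_ROOT = Path("/sys/devices/system/cpu")


def read_cstates(cpu=0, root=CPU_ROOT):
    """Return one dict per idle state of the given CPU (hypothetical helper)."""
    cpuidle = root / f"cpu{cpu}" / "cpuidle"
    # Directories are named state0, state1, ...; sort numerically, not lexically.
    state_dirs = sorted(cpuidle.glob("state[0-9]*"), key=lambda p: int(p.name[5:]))
    states = []
    for state_dir in state_dirs:
        states.append({
            "index": int(state_dir.name[5:]),
            "name": (state_dir / "name").read_text().strip(),
            "usage": int((state_dir / "usage").read_text()),    # times entered
            "time_us": int((state_dir / "time").read_text()),   # microseconds spent
        })
    return states


def residency_percent(states):
    """Share of total idle time spent in each state, keyed by state name."""
    total = sum(s["time_us"] for s in states)
    return {
        s["name"]: (100.0 * s["time_us"] / total if total else 0.0)
        for s in states
    }
```

Calling `residency_percent(read_cstates(0))` on a machine booted with intel_idle.max_cstate=1 should show essentially all idle time in the C1 state, which is one way to check that the boot parameter actually took effect.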
I can also confirm that this workaround works for me, running 4.3.0-2 now for about two weeks with intel_idle.max_cstate=1 and no freezes. Cheers for this, I was getting desperate with the constant hangs. Downgrading kernel to 3.16.7-29 makes this run fine without any boot parameters, but anything newer than that means frequent freezes. Using intel_pstate=disable does not work for this hardware / kernel combination either, it still hangs. Only limiting cstate seems to cure this. Acer B-115M Laptop with Intel(R) Pentium(R) CPU N3540 @ 2.16GHz If there is anything that I can do to help to trace this, let me know.
Good reading for better understanding of this issue:

1. C-states and P-states are very different (https://software.intel.com/en-us/blogs/2008/03/12/c-states-and-p-states-are-very-different)
2. Power Management States: P-States, C-States, and Package C-States (https://software.intel.com/en-us/articles/power-management-states-p-states-c-states-and-package-c-states)
3. (update) C-states, C-states and even more C-states (https://software.intel.com/en-us/blogs/2008/03/27/update-c-states-c-states-and-even-more-c-states)

Hope this helps!
(In reply to Wolfgang M. Reimer from comment #4)
> Same here on 50+ ASRock IMB-150 mini-ITX (Intel Celeron J1900) boards:
> Random freezes, time to freeze ranging from some ten minutes to some hours,
> only when using X with conky + own QT based App (no freezes when not using
> X!), so it seems very likely that this problem is GPU related.
>
> I will test with kernel parameter intel_idle.max_cstate=1 to see if it is a
> working workaround for my case and report here later.

I can confirm that kernel parameter intel_idle.max_cstate=1 is a working workaround for my case (50+ ASRock IMB-150 mini-ITX Intel Celeron J1900 boards running a 3.18.21-rt19 kernel).
I can confirm too: parameter "intel_idle.max_cstate=2" is required on two laptops (Medion Akoya E6239T and S6217T) with these CPUs:
- Intel Celeron CPU N2930 1.83GHz
- Intel Celeron CPU N2940 1.83GHz

The random freezes come back when setting max_cstate to 3.

Also, I don't need it on two other similar laptops (Medion Akoya E6239) with these CPUs:
- Intel Celeron CPU N2830 2.16GHz
- Intel Celeron CPU N2840 2.16GHz
I, too, can confirm this issue on systems that use an Intel Celeron N2930@1.83GHz or an Intel Celeron J1900@1.99GHz. While adding "intel_idle.max_cstate=1" to kernel command-line fixed the issue, the regression in GPU performance wasn't acceptable. Bumping it to "intel_idle.max_cstate=2" seems to make it run with adequate GPU performance while presenting no more hard lock-ups. I attached the output of lshw of both systems.
Created attachment 197611 [details] LG MP500 w/o fan
Created attachment 197621 [details] Advantech DS-370
Created attachment 197631 [details]
drm/i915/vlv: Take forcewake on media engine writes

Long shot, but could someone give this a spin.
Can anyone confirm that this problem is limited to Bay Trail and does not affect Braswell such as N3150 or N3700? Ran into this intermittent freeze-up problem after upgrading several J1900 and N2930 based boards to 3.19 kernel. [Had used 3.13 previously.] intel_idle.max_cstate=1 seems to solve the problem...all units up for 48hrs anyway. I much appreciated finding that this is a known/reported problem. We are moving to the Braswell based boards and wondering if there are any known stability problems. Thank you.
(In reply to G. Bremer from comment #19)
> Can anyone confirm that this problem is limited to Bay Trail and does not
> affect Braswell such as N3150 or N3700? Ran into this intermittent
> freeze-up problem after upgrading several J1900 and N2930 based boards to
> 3.19 kernel. [Had used 3.13 previously.] intel_idle.max_cstate=1 seems to
> solve the problem...all units up for 48hrs anyway. I much appreciated
> finding that this is a known/reported problem. We are moving to the
> Braswell based boards and wondering if there are any known stability
> problems. Thank you.

Confirming same issue on N3050 (Braswell/Cherry Trail/Airmont).
I can fully confirm that this issue is _not_ happening on Braswell N3150 and N3700 - both chips are perfectly fine without any patching. 3.19 is not even working on braswell.
(In reply to fritsch from comment #21)
> I can fully confirm that this issue is _not_ happening on Braswell N3150 and
> N3700 - both chips are perfectly fine without any patching.

If you make a statement like the one above, then please specify for which kernel revision(s) this is true. Older kernel revisions (e.g. 3.13.x) do not exhibit the issue on BayTrail processors. This thread is about (more or less) random freezes of BayTrail (and possibly newer) processors running NEWER kernel revisions (e.g. 3.18.x and newer) when used without kernel parameter intel_idle.max_cstate=1 (please do not confuse this with kernel patching).

> 3.19 is not even working on braswell.

What does that mean? Does the 3.19 kernel freeze on Braswell immediately at start-up? What happens when the kernel boot parameter intel_idle.max_cstate=1 is specified for this 3.19 kernel? How does that correlate with the above message that "this issue is _not_ happening on Braswell N3150 and N3700"? What is the exact kernel revision of the 3.19 kernel you tried (or did you test it on ALL 3.19.* kernels)?
I am the original submitter of the bug report. At the time of filing it, Braswell existed only on paper.

To get the GPU up and running on a Braswell system you need kernel 4.1 or later, or special parameters for older kernels to force GPU acceleration. Whatever kernel you run with 3.13 / 3.19 has no mainline GPU support. It won't work at all. Is this something Ubuntu patched?

My Braswell 3150 systems (minix / asrock) currently run with kernel 4.3 and 4.4-rc5 without issues.

Here are the kernel images if you want to verify:
http://fritsch.fruehberger.net/kernel/linux-image-4.3.0-pt-bt1+_4.3.0-pt-bt1+-10.00.Custom_amd64.deb
http://fritsch.fruehberger.net/kernel/linux-headers-4.3.0-pt-bt1+_4.3.0-pt-bt1+-10.00.Custom_amd64.deb
To avoid confusions, last post was done by me - but with wrong account - now happy testing.
Created attachment 197671 [details] drm/i915/vlv: [V4.3 backport] Take forcewake on media engine writes
(In reply to Mika Kuoppala from comment #18)
> Created attachment 197631 [details]
> drm/i915/vlv: Take forcewake on media engine writes
>
> Long shot, but could someone give this a spin.

Tested without success with kernel 4.3.2 on my two laptops (CPU N2930 and N2940). They froze in less than two hours.
Hi, I had the same problem with "Intel(R) Pentium(R) CPU N3520 @ 2.16GHz". With kernel 4.2.0-16.19 there were ~5-8 freezes/day. After upgrading to 4.3.3-040303-generic (ubuntu version) it was much better: 1/2 freezes/day. With cstate=1 there has not been one yet.
(In reply to Mika Kuoppala from comment #25)
> Created attachment 197671 [details]
> drm/i915/vlv: [V4.3 backport] Take forcewake on media engine writes

Also tested on kernel 4.3.3 on Arch Linux and it didn't work. I have an Asrock Q1900M (with Intel J1900). It froze after less than 1 hour of video playback, so no improvement compared to the default Arch Linux kernel without the patch (v4.2.5).
(In reply to Peter Fr from comment #23)
> To get the GPU up and running on a braswell system you need at least kernel
> 4.1 or later or special parameters for older kernels to force gpu
> acceleration. Whatever kernel you run with 3.13 / 3.19 has no mainline gpu
> support. It won't work at all. If this something Ubuntu patched?

Thanks for the info. Yes, Ubuntu 15.04 (Vivid) made some patches to the 3.19 kernel line for Braswell systems (Ubuntu kernel 3.19.0-20 through 3.19.0-42, see http://forum.kodi.tv/showthread.php?tid=227771&pid=2026016#pid2026016 and http://www.phoronix.com/scan.php?page=news_item&px=Intel-Braswell-Fedora-Ubuntu). It has support for Braswell systems; however, I don't know how complete this support is. The Ubuntu 15.10 (Wily) kernel 4.2.0-22 should also run on Braswell systems. The Vivid and the Wily kernels are both available for Ubuntu 14.04 LTS (Trusty, the Ubuntu release I use), too.
Then, please: Reproduce with mainline kernels. We cannot let the kernel devs debug ubuntu's picked together kernel ...
(In reply to fritsch from comment #30)
> Then, please: Reproduce with mainline kernels. We cannot let the kernel devs
> debug ubuntu's picked together kernel ...

My report does _NOT_ relate to the Ubuntu kernels _NOR_ does it relate to a Braswell system. See my comments https://bugzilla.kernel.org/show_bug.cgi?id=109051#c4 and https://bugzilla.kernel.org/show_bug.cgi?id=109051#c13 above.
No freeze on an Acer E11 (N2940) after "echo acpi_pm > /sys/bus/clocksource/devices/clocksource0/current_clocksource", but it did not hit me that often to begin with. Most of the time it happened after a reboot and not after standby/resume.
I will compile a mainline kernel and test it. I feel there is some connection with browsing. I get freezes while I watch online videos with firefox or open a new site with multimedia content. I have disabled hardware acceleration to see what will happen. I will report back.
There is no change. Freeze again and again. The only solution is "intel_idle.max_cstate=1". Does anybody know when this will be fixed? With the kernel parameter my CPU is noticeably warmer, which I don't think is good. I bought a notebook with an Intel Atom (N) CPU because it is energy efficient.
Freeze occurs on ASUS T100-CHI running Cinnamon Desktop on Mint17.3 or Manjaro15.12 with 64bit kernels after 3.16.7 including 4.3.x and 4.4-rcx. Until 4.2.6, capping GPU frequency greatly reduced the freeze rate for me. After 4.2.5 GPU frequency did not affect freeze rate (GPU hang fixed?) Freeze rate seems to depend on particulars of the distro, kernel and device it runs on. My setup freezes within a few minutes without a max_cstate below 2. I notice warmer system temperatures with cstate=0. YMMV.
(In reply to Markus Rehbach from comment #32)
> No freeze on a Acer E11 (N2940) after "echo acpi_pm >
> /sys/bus/clocksource/devices/clocksource0/current_clocksource" but it hit me
> not so often. Most of the time after reboot and not after standby/resume.

This seems to work on my Intel J1900. Can more people confirm that this works? To make the change permanent, you can add the option "clocksource=acpi_pm" to your kernel command line.

What is the drawback of using the acpi_pm clock? From what I have read (https://access.redhat.com/solutions/18627) it has a lower frequency, 3.58 MHz compared to the 2 GHz of my CPU clock.

We could just force the kernel to switch to the acpi_pm clock when available if the CPU is a Bay Trail / Braswell.
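To put the frequency comparison above into numbers, and to check which clocksource is actually in use before and after the echo, something like the following sketch could help. The helper names are hypothetical. It reads /sys/devices/system/clocksource/clocksource0/, which is the same device the /sys/bus/clocksource/... path in the quoted command points to. The 3,579,545 Hz figure is the nominal ACPI PM timer rate; the 2 GHz figure is just the round number from the comment above, not a measured TSC rate.

```python
from pathlib import Path

# Canonical sysfs directory for the system clocksource device.
CLOCKSOURCE_DIR = Path("/sys/devices/system/clocksource/clocksource0")


def clocksource_status(base=CLOCKSOURCE_DIR):
    """Return (current clocksource, list of available clocksources)."""
    current = (base / "current_clocksource").read_text().strip()
    available = (base / "available_clocksource").read_text().split()
    return current, available


def tick_ns(freq_hz):
    """Length of one counter tick in nanoseconds for a counter at freq_hz."""
    return 1e9 / freq_hz


ACPI_PM_HZ = 3_579_545        # nominal ACPI PM timer frequency
ASSUMED_TSC_HZ = 2_000_000_000  # assumption: the "2 GHz" quoted in the comment
```

With these numbers one acpi_pm tick is roughly 279 ns versus 0.5 ns for a 2 GHz counter, so the cost is mainly coarser timestamps (and slower reads of the timer), not a slower CPU.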
I was wrong, turns out it takes more time to freeze but it eventually does. The best option so far is the cstate option.
There's something strange with this bug...on my Q1900DC-ITX I tried every single version of the mainline kernel from 3.16 to 4.3. It still hangs on 4.0, it hangs on 4.2, but the whole 4.1 kernel series from 4.1.0 to 4.1.15 is very stable. No need for the cstate configuration or any patch published here or on the other thread. For some reason, it "seemed" to get fixed on 4.1-rc something and the bug came back on 4.2.0.

Now, I don't know much about how the whole i915 driver works, but it seems like a lot of the changes in 4.2 concern the cherryview chips, except these:

drm/i915: Use spinlocks for checking when to waitboost
drm/i915: Don't downclock whilst we have clients waiting for GPU results
drm/i915: Agressive downclocking on Baytrail
drm/i915: Fix computation of last_adjustment for RPS autotuning

Looks like they directly affect baytrail chips and they alter code changes introduced right before the 4.1 series. I also remember trying to revert the "drm/i915: Agressive downclocking on Baytrail" commit on 4.2, without success.
I tried the clocksource parameter without cstate. It froze within a few minutes (4.3.3/T100-CHI). So far my freeze is independent of GPU frequency and system clock source! 4.1 was more stable for me than 4.2.x, 4.3.x. But the rest of my hardware works better with newer kernels. Otherwise I could avoid the bleeding edge kernels.
Additional info: I have a Dell Venue 11 Pro with an Atom Z3770. I have been observing these freezes, as everybody does, since 3.17. After 4.1 the behaviour of the freezes changed slightly, but they still happen.

intel_idle.max_cstate=0 or switching to the acpi_idle driver on the latest kernel 4.4-rc6 doesn't solve this bug, so it's not the idle driver's fault. intel_idle.max_cstate=2 (also cstate=1) completely solves the freezes. The only difference between acpi_idle (freezes) and intel_idle with max_cstate=2 (doesn't freeze) is in this state: ACPI FFH INTEL MWAIT 0x64. I'll try with max_cstate=3, but I think it'll freeze too.

I can reproduce the freezes with html5 video in firefox. For 3.17-4.0 it happens within 10 minutes. After 4.1 it happens within 1 hour.
Tried every cstate till 6 and cannot reproduce this bug anymore... Even without parameter huge films over wifi and html5 video from firefox works without freezes. I'll continue testing.

cat /proc/cmdline
root=/dev/mmcblk0p6 ro init=/usr/lib/systemd/systemd rootfstype=ext4 tsc=reliable force_tsc_stable=1 clocksource=tsc clocksource_failover=tsc swap_zram zram.num_devices=4

uname -a
Linux venue11pro 4.4.0-rc6-dirty #200 SMP PREEMPT Thu Dec 24 15:23:06 MSK 2015 i686 Intel(R) Atom(TM) CPU Z3770 @ 1.46GHz GenuineIntel GNU/Linux

mesa 11.1
xf86-video-intel 2.99.917-r2 (gentoo version)
libdrm 2.4.65

P.S. Linux "dirty" because of ath6kl patch, soc_button_array patch and gcc native optimization patches.
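Since results in this thread depend heavily on which boot parameters were active, it may help to attach a parsed view of /proc/cmdline to each report. This is a rough sketch only: `parse_cmdline` and the `WATCHED` list are names invented here, and the simple whitespace split does not handle quoted values containing spaces.

```python
def parse_cmdline(cmdline):
    """Split a kernel command line into {parameter: value or None for bare flags}."""
    params = {}
    for token in cmdline.split():
        key, sep, value = token.partition("=")
        params[key] = value if sep else None
    return params


# Parameters that have been tried as workarounds in this thread.
WATCHED = ("intel_idle.max_cstate", "intel_pstate", "clocksource", "tsc")


def active_workarounds(cmdline):
    """Return only the watched parameters that are present on the command line."""
    params = parse_cmdline(cmdline)
    return {name: params[name] for name in WATCHED if name in params}
```

On a running system the input would come from reading /proc/cmdline; for the command line pasted above, `active_workarounds` reports the clocksource and tsc settings and shows that no intel_idle.max_cstate limit was set.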
Replacing cstate=1 with "clocksource=acpi_pm" my setup froze within a few minutes. Replacing cstate=1 with "tsc=reliable force_tsc_stable=1 clocksource_failover=tsc" gave me significantly more run time before freezing. I was able to run almost 2 hours (~20x) streaming a bald eagle cam. T100-CHI (intel 3775) with hardware specific patches, 4.3.3 (Manjaro)
(In reply to mazout360 from comment #38)
> There's something strange with this bug...on my Q1900DC-ITX I tried every
> single version of the mainline kernel from 3.16 to 4.3. It still hangs on
> 4.0, it hangs on 4.2, but the whole 4.1 kernel version from 4.1.0 to 4.1.15
> is very stable. No need for the cstate configuration or any patch publied
> here or on the other thread.

I also run kernel 4.1.5 (from Arch Linux, which doesn't include any patch) without any freeze on a Q1900-ITX. Current uptime is 10 hours, with Netflix video streaming, although it is stopped from time to time. I will need more time to be sure, but it seems to work so far. It is for now the best option without any major drawback like the video driver not working or power saving being disabled.

I will try to bisect between 4.0 and 4.2 to see exactly which commit fixed the regression and which one reintroduced it.
I confirm the random freezes on Acer TravelMate 115 (same as Juha Sievi-Korte). Intel(R) Pentium(R) CPU N3540 @ 2.16GHz, with Arch's linux-4.1.15-1-lts. The freezes occur mostly while watching videos, and are way sparser than the ones reported here (on specific days, I'd have 5 freezes, then it would be fine for a few weeks, and resume).
Random freezes happening on Fedora 23 - Kernel 4.2.8-300.fc23.x86_64. And before with Fedora 22. On Asrock Q1900-ITX, BIOS P1.40 (latest available). This has been happening for about a year now, at different freeze frequencies going from 2 minutes after boot up to a few weeks. It only happens intermittently when playing back video content with Kodi (this is an HTPC). It doesn't happen when compiling, playing music, or when the home server stays idle. I noticed that certain videos (but not a specific codec) are much more prone than others to trigger the bug. Disabling hardware acceleration does NOT solve the problem. It has been a very frustrating experience.
Same problem with ASUS ET2325IUK with J2900 @ 2.41GHz + Arch Linux 4.1.15-1 and 4.2.5-1 (videos, system upgrades,html5...) Freezing also systematically appears when closing gnome or cinnamon session (gdm).
No, not fixed. It froze just from scrolling in firefox without any video. max_cstate is still needed.
Same problem on Positivo ZX3040 http://lad.dsc.ufcg.edu.br/lad/pmwiki.php?n=Lad.Tablet, but occasional hard lock-ups even with intel_idle.max_cstate=1. Are the patches in https://github.com/hadess/rtl8723bs related in any way to this problem?
Last not having issue: 4.1.3
First to show issue: 4.2.0

I am on Ubuntu and have the issue. I tested the mainline kernels. From my testing, Ubuntu 4.1.0-3.3 is the last kernel known to me not having the issue; the successive kernel, Ubuntu 4.2.0-7.7, has the issue. To my knowledge these map to the 4.1.3 and 4.2.0 mainline kernels respectively. I am sharing this hoping somebody can find this information useful to make progress towards fixing the issue.

MAINLINE KERNELS

Ubuntu version   Ubuntu tag            Mainline      Result

vivid linux
3.19.0-32.37     Ubuntu-3.19.0-32.37   3.19.8-ckt7   kernel used before I upgraded to wily, does not have issue
3.19.0-33.38     Ubuntu-3.19.0-33.38   3.19.8-ckt7
3.19.0-37.42     Ubuntu-3.19.0-37.42   3.19.8-ckt9
3.19.0-39.44     Ubuntu-3.19.0-39.44   3.19.8-ckt9
3.19.0-41.46     Ubuntu-3.19.0-41.46   3.19.8-ckt10
3.19.0-42.48     Ubuntu-3.19.0-42.48   3.19.8-ckt10  (last Vivid kernel, not tested for issue)

Wily linux
3.19.0-20.20     Ubuntu-3.19.0-20.20   3.19.8
4.0.0-4.6        Ubuntu-4.0.0-4.6      4.0.7
4.0.0-4.7        Ubuntu-4.0.0-4.7      4.0.7         works fine, issue not found here
4.1.0-1.1        Ubuntu-4.1.0-1.1      4.1.0         works fine, issue not found here
4.1.0-2.2        Ubuntu-4.1.0-2.2      4.1.3
4.1.0-3.3        Ubuntu-4.1.0-3.3      4.1.3         last known to me not having issue
4.2.0-7.7        Ubuntu-4.2.0-7.7      4.2.0         has issue
4.2.0-10.11      Ubuntu-4.2.0-10.11    4.2.0
4.2.0-10.12      Ubuntu-4.2.0-10.12    4.2.0         has issue
4.2.0-11.13      Ubuntu-4.2.0-11.13    4.2.1         has issue, also at log-in reporting an error with /usr/bin/Xorg
4.2.0-12.14      Ubuntu-4.2.0-12.14    4.2.1
4.2.0-14.16      Ubuntu-4.2.0-14.16    4.2.2         has issue
4.2.0-15.18      Ubuntu-4.2.0-15.18    4.2.3
4.2.0-16.19      Ubuntu-4.2.0-16.19    4.2.3
4.2.0-17.21      Ubuntu-4.2.0-17.21    4.2.3
4.2.0-18.22      Ubuntu-4.2.0-18.22    4.2.3         has issue
4.2.0-19.23      Ubuntu-4.2.0-19.23    4.2.6
4.2.0-21.25      Ubuntu-4.2.0-21.25    4.2.6
4.2.0-22.27      Ubuntu-4.2.0-22.27    4.2.6

upstream kernel v4.3.0 has issue
upstream kernel v4.4.3 has issue
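The "last good / first bad" pair can be derived mechanically from test notes like the ones above, which is handy when several people pool their results. The sketch below uses only the Ubuntu kernels that were explicitly marked in this comment; the function names are illustrative. It assumes the bug appears once and then stays, which other reporters in this thread contradict for the 4.0 series, so treat the output as a starting window for bisection rather than a proof.

```python
def version_key(version):
    """Turn '4.2.0-10.12' into (4, 2, 0, 10, 12) so versions sort numerically."""
    return tuple(int(part) for part in version.replace("-", ".").split("."))


def bisect_window(results):
    """Return (last good, first bad) from {version: True if it freezes}.

    Assumes a single good-to-bad transition in version order.
    """
    last_good = None
    for version in sorted(results, key=version_key):
        if results[version]:
            return last_good, version
        last_good = version
    return last_good, None


# Results explicitly stated in the comment above (True = has the issue).
OBSERVED = {
    "3.19.0-32.37": False,
    "4.0.0-4.7": False,
    "4.1.0-1.1": False,
    "4.1.0-3.3": False,
    "4.2.0-7.7": True,
    "4.2.0-10.12": True,
    "4.2.0-11.13": True,
    "4.2.0-14.16": True,
    "4.2.0-18.22": True,
}
```

For the data above, `bisect_window(OBSERVED)` narrows the search to the range between Ubuntu 4.1.0-3.3 and 4.2.0-7.7, matching the 4.1.3 to 4.2.0 mainline window stated by the reporter.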
I have an Acer Aspire ES1-711. I am on Gentoo Linux with self-compiled kernels. Same problem on my Linux Mint partition, so it certainly is a kernel bug. The latest kernel to work fine is 4.1.12 (4.1.13 is reported to work as well; I haven't tested it), absolutely stable. Any 4.2 or 4.4 kernel freezes the system: no traces, no reproduction scenarios. I can't confirm the "intel_idle.max_cstate=1" workaround to be a solution. I tested it with kernel 4.4.0-rc6 and it froze after 3 days.
Created attachment 198961 [details] lspci -v Hostbridge and vga adapter output
I have the same problem with an Acer Aspire ES1-311 on Ubuntu. I am currently running 4.1.13 with the intel_idle.max_cstate=1 workaround. A fix would be much better!
Another long shot to try is to see if: 'intel_reg write 0xa168 0x0' has any effect on occurrence.
(In reply to Mika Kuoppala from comment #53)
> Another long shot to try is to see if:
>
> 'intel_reg write 0xa168 0x0'
>
> has any effect on occurrence.

I've had an issue with a Lenovo Yoga 2 where restarting GDM or switching to another vty would hang the system. This command fixed it and I haven't had a crash yet.
FYI. Here is another hang issue on Baytrail that is also fixed by limiting C states. https://lkml.org/lkml/2015/3/24/271 As far as I can tell these issues have not made it in to the kernel at all.
These patches have not made it in to the kernel, I meant.
(In reply to Mika Kuoppala from comment #53)
> Another long shot to try is to see if:
>
> 'intel_reg write 0xa168 0x0'
>
> has any effect on occurrence.

The command seems to be a correct work-around for GB-BXBT-1900. Thanks a lot! Mika, can you explain what this command does? Any problematic consequences (for power management ...)?
I will try intel_reg write 0xa168 0x0 on an Acer Aspire ES1-711 now and will give feedback as soon as the system freezes, or in a few days otherwise.
By the way, I just tried kernel 4.4.0 (latest stable git) without any parameters and without intel_reg write 0xa168 0x0. The system froze after ~1h. Now running the same kernel with intel_reg. Will report shortly....
Affected by this bug as well on a Jetway JBC311U93 Celeron N2930 (Bay Trail). The system was running perfectly fine for 6 months as a router until repurposed as an HTPC. Hard freezes always occur when playing back video (h264 with vaapi) under Kodi. I am running kernel 4.3.3. Will happily test any patch/solution.
Tried intel_reg write 0xa168 0x0 on Acer B-115M (Pentium N3540) with kernel 4.3.3; a hang happened within 20 mins after reboot, so I guess no change, the occurrence is random.

Question: Should I be able to read 0x0 out from that same register? I mean:

cardhu:~ # intel_reg read 0xa168
(0x0000a168): 0x0000007a
cardhu:~ # intel_reg write 0xa168 0x0
cardhu:~ # intel_reg read 0xa168
(0x0000a168): 0x0000007a
Also reported in (https://bugs.launchpad.net/ubuntu/+source/linux/+bug/1511002).
(In reply to Juha Sievi-Korte from comment #61)
> Tried intel_reg write 0xa168 0x0 on Acer B-115M (Pentium N3540) with kernel
> 4.3.3, hang happened within 20 mins after reboot, so I guess no change,
> occurence is random.
>
> Question: Should I be able to read 0x0 out from that same register? I mean:
> cardhu:~ # intel_reg read 0xa168
> (0x0000a168): 0x0000007a
> cardhu:~ # intel_reg write 0xa168 0x0
> cardhu:~ # intel_reg read 0xa168
> (0x0000a168): 0x0000007a

Yes, we should forget this crude hack as the register gets overwritten on boot and also on normal operation when frequencies are changed.

I will submit a patch to try.
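For anyone still experimenting with register pokes, it is worth checking the readback before drawing conclusions from an uptime test, since a value that does not stick tells you nothing. The sketch below only parses the text that intel_reg printed in the comment above; it does not run intel_reg itself, the function names are made up, and the assumed output format "(0xADDR): 0xVALUE" may differ between intel-gpu-tools versions.

```python
import re

# Matches lines like "(0x0000a168): 0x0000007a", as pasted in this thread.
_REG_LINE = re.compile(r"\((0x[0-9a-fA-F]+)\):\s*(0x[0-9a-fA-F]+)")


def parse_intel_reg_read(output):
    """Extract (address, value) from one line of 'intel_reg read' output."""
    match = _REG_LINE.search(output)
    if match is None:
        raise ValueError(f"unrecognised intel_reg output: {output!r}")
    return int(match.group(1), 16), int(match.group(2), 16)


def write_took_effect(expected_value, readback_output):
    """True if the value read back after a write equals what was written."""
    _, value = parse_intel_reg_read(readback_output)
    return value == expected_value
```

Applied to the transcript above, the readback after writing 0x0 still yields 0x7a, so `write_took_effect` returns False, consistent with the explanation that the register is rewritten by the driver.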
Created attachment 200381 [details] drm/i915/vlv: Always enable internal pm interrupts
(In reply to Mika Kuoppala from comment #64)
> Created attachment 200381 [details]
> drm/i915/vlv: Always enable internal pm interrupts

Concerning this:

> cardhu:~ # intel_reg write 0xa168 0x0
> cardhu:~ # intel_reg read 0xa168

I can read and write that register, but it is constantly overwritten, as Mika says. By my logic that means the "workaround" can't work, although my system hasn't frozen yet. So I am now compiling kernel 4.4.0 with Mika's patch applied (manually). I will post the result soon.
Mika, I tried your patches on the 4.4.0 kernel. The system hard-froze just the same :( Back to kernel 4.1.12....
I tried the "drm/i915/vlv: Always enable internal pm interrupts" patch and it froze within 3 minutes on my T100CHI... BUT, this did fix a bug of the CHI not remembering the backlight setting from the last session. With this patch, the T100CHI powered up and dimmed to the last session's level before launching the desktop. Without this patch, the brightness slider would show reduced backlight, but it would not go into effect until the brightness was adjusted manually. This patch is worth keeping, at least for the CHI, even if it has no effect on the freeze problem, which it doesn't.
After a bisect between 4.1 and 4.2-rc1, and running the kernel on a laptop with a N2930 CPU:

Last commit without freeze (after running for 6 hours, I will retry for 24h or more to be sure):

commit af2d94fddcf41e879908b35a8a5308fb94e989c5
Author: Ingo Molnar <mingo@kernel.org>
Date:   Thu Apr 23 17:34:20 2015 +0200

    x86/fpu: Use 'struct fpu' in fpu_reset_state()

    Migrate this function to pure 'struct fpu' usage.

The freezes happen (in less than one hour for each test) with the next commit:

commit cb8818b6acb45a4e0acc2308df216f36cc5b950c
Author: Ingo Molnar <mingo@kernel.org>
Date:   Thu Apr 23 17:39:04 2015 +0200

    x86/fpu: Use 'struct fpu' in switch_fpu_prepare()

    Migrate this function to pure 'struct fpu' usage.
Sorry for the misinformation in my previous comment, but after retesting af2d94f I got a freeze after 15 hours.
To clarify patch "drm/i915/vlv: Always enable internal pm interrupts" results. Tested with 4.3.3 and 4.4.0 with hardware specific patches. Backlight control is not yet available in the standard kernel for the ASUS T100 family. Without this patch, my ASUS T100CHI always boots to full screen brightness. With the patch, the backlight usually starts at the indicated setting. This patch does fix something for baytrail systems. Thanks for the patch.
(In reply to mazout360 from comment #38)
> There's something strange with this bug...on my Q1900DC-ITX I tried every
> single version of the mainline kernel from 3.16 to 4.3. It still hangs on
> 4.0, it hangs on 4.2, but the whole 4.1 kernel version from 4.1.0 to 4.1.15
> is very stable. No need for the cstate configuration or any patch publied
> here or on the other thread.

Recently re-installed Arch on my X205TA. I haven't come across any freezes on kernel 4.3.3-3 with the cstate param. But linux-lts 4.1.15-1 doesn't let me stay up beyond 2 minutes, with just a couple of tabs open in the browser and otherwise doing nothing.
Hi everybody.

Since December 2015 I have been following this bug, because I had system freezes (mostly while streaming), too. I use an ACER ES1-311 (Intel GPU inside ;-) with an up-to-date 4.3.3-3-ARCH. Unfortunately intel_idle.max_cstate=1 did not do the trick for me.

In the Arch wiki, I found an interesting hint that improved my situation tremendously. Before, I regularly had system freezes after five minutes of streaming. Sometimes the freezes occurred after one hour at most. With this hint, I have not had a freeze for a couple of days, streaming for hours! Possibly my system is even fixed completely with this?! I want to share this with you guys; perhaps it helps you find a solution or improvement too.

If interested, you can find the information here in the Arch wiki:

https://wiki.archlinux.org/index.php/Intel_graphics

Scroll down to the chapter "X freeze/crash with intel driver". (Funnily, this bug is linked there at the bottom of the chapter with the intel_idle.max_cstate=1 workaround.)

It describes how GPU acceleration can be disabled. I also disabled the DRI option, because I do not play games on my machine.

That did it, or at least improved things a lot.

On systems other than Arch there is probably a similar way to access and disable GPU acceleration.
(In reply to Johannes from comment #72) > Hi everybody. > > Since December 2015 I have been following this bug, because I had system > freezes (mostly while streaming), too. I use an ACER ES1-311 (intel GPU > inside;-) with an up to date 4.3.3-3-ARCH. Unfortunately the > intel_idle.max_cstate=1 did not do the trick for me. > In Arch-Wiki, I found an interesting hint, that improved my situation > tremendously. Before, I regularly had system freezes after five minutes > streaming. Sometimes the freezes occurred after maximum one hour. With this > hint, I have not had a freeze for a couple of days streaming for hours! > Possibly, my system is even fixed completely with this?! I want to share > this with you guys - probably this helps finding a solution or improvement > for you too. > > If interested, you find information here in the arch wiki: > > https://wiki.archlinux.org/index.php/Intel_graphics > > Scroll down to the chapter: "X freeze/crash with intel driver". (Funnily, > this bug is linked there at the bottom of the chapter with the > intel_idle.max_cstate=1 workaround.) > > Here is described how the GPU acceleration can be disabled. I also disabled > the DRI option, because I do not play games on my machine. > > That did it - or improved a lot. > > Probably on systems other than ARCH, there is a similar way to access and > disable GPU acceleration.
Thank you so much for this. I have a Dell Inspiron 3551, which has an Intel N3540 processor, running Ubuntu Gnome 15.10. I have been facing laptop freezing issues, and this just fixed it. The steps I followed are these:
1. Boot into recovery mode: https://wiki.ubuntu.com/RecoveryMode (make sure to run the two mount commands).
2. Generate the config file for X (while in recovery mode): http://askubuntu.com/questions/4662/where-is-the-x-org-config-file-how-do-i-configure-x-there
3. Change the following lines in /etc/X11/xorg.conf (you can use nano):
3a. #Option "NoAccel" -> Option "NoAccel" "true"
3b. #Option "DRI" -> Option "DRI" "false"
4. Reboot and it's done :).
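For reference, steps 3a and 3b amount to a Device section like the one the following Python sketch prints. The two Option lines come from the steps above; the Identifier string and the Driver "intel" line are assumptions (they presume the xf86-video-intel driver), and render_device_section is a hypothetical helper, not part of any existing tool.

```python
def render_device_section(options, identifier="Intel Graphics", driver="intel"):
    """Render an xorg.conf "Device" section as text.

    identifier and driver are illustrative defaults (assumptions); the
    options dict carries the settings from steps 3a and 3b.
    """
    lines = [
        'Section "Device"',
        f'    Identifier "{identifier}"',
        f'    Driver     "{driver}"',
    ]
    for name, value in options.items():
        lines.append(f'    Option     "{name}" "{value}"')
    lines.append("EndSection")
    return "\n".join(lines)


# Settings from steps 3a and 3b: acceleration off, DRI off.
WORKAROUND_OPTIONS = {"NoAccel": "true", "DRI": "false"}

if __name__ == "__main__":
    print(render_device_section(WORKAROUND_OPTIONS))
```

The printed text would go into /etc/X11/xorg.conf (or a file under /etc/X11/xorg.conf.d/). Note that this trades the freezes for software rendering, so it is only attractive on machines that do not need 3D.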
First of all, the intel_idle.max_cstate=1 solution didn't work for me either, but I said that earlier already. To you guys disabling hardware acceleration with the info from the Arch wiki: why disable hardware acceleration if you can just install any 3.1 kernel (I use 3.1.12) and use hardware acceleration and DRI at the same time, without any system freezes? That seems to me the much better solution.
Sorry, of course I meant the 4.1 kernel and not 3.1. I use 4.1.12.
I have also been having this issue on my Lenovo 11e laptop with an Intel N2940 baytrail-m. I am running Manjaro and have been having full system hangs (mouse stops moving, everything freezes, it doesn't even seem to dump any errors out in time) and application freezes (mostly vlc). It seems to happen on battery or plugged in when running a video. The only kernel that seems stable without limiting max_cstate to 1 is 4.1.16-1. Kernels that have given me issues: 4.4.0-4, 4.3.4-1, 4.2.8.2-1, 3.18.25-1. As a side note, hibernate does not seem to work on most kernels, but works on 4.4.0-4. Not sure if it's related.
Hi, I am also trying to bisect this issue. The best way I have found to trigger the freeze almost instantly (on my Acer Switch 10 with Intel Atom Z3735F, on Ubuntu Gnome 15.10) is to run glxgears (from package mesa-utils) on one half of the screen and a 42-minute, 415MB *x264*.mp4 video in VLC on the other half. The freeze usually occurred within 2-5 minutes; on a few occasions I had to wait 15-20 minutes. When I was running only VLC, I had to wait many hours until the freeze occurred, and sometimes it had not occurred after 8 hours, whereas with glxgears it was a matter of minutes. Can someone confirm that this method works for you? I can confirm that kernels 4.1 and 4.1.15 work without problems (I have not tested 4.1.16 yet). Kernel 4.2-rc1 first introduced the issue. I'm currently bisecting between 4.1 and 4.2-rc1, but I'm not sure I tested the merges right: when I executed "git bisect [good|bad]", the terminal showed the same output as for the "git bisect [good|bad]" one step back. Note that I'm not testing pure vanilla kernels: I'm applying patches from Adrian Hunter of Intel from https://github.com/hadess/rtl8723bs/tree/master/patches plus small patches for keyboard and sound.
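Since the machine usually hangs without leaving anything in the logs, one way to measure the time to freeze during such a glxgears + VLC run is a small heartbeat script running alongside. The following Python sketch is only an illustration; the function names, the file path and the interval are assumptions.

```python
import os
import time


def write_heartbeats(path, interval_s=5.0, count=None):
    """Append one timestamp per interval to path and ask the OS to sync it.

    After a hard freeze and reboot, the last line of the file shows roughly
    when the machine stopped. count=None means run until interrupted.
    Returns the number of heartbeats written.
    """
    written = 0
    with open(path, "a") as log:
        while count is None or written < count:
            log.write(f"{time.time():.0f}\n")
            log.flush()
            os.fsync(log.fileno())  # push the line to disk before a possible lockup
            written += 1
            if count is None or written < count:
                time.sleep(interval_s)
    return written


def seconds_survived(path):
    """Seconds between the first and the last heartbeat in the file."""
    with open(path) as log:
        stamps = [float(line) for line in log if line.strip()]
    return stamps[-1] - stamps[0] if stamps else 0.0
```

Start write_heartbeats("/var/tmp/freeze-test.log") (a hypothetical path) just before launching glxgears and VLC, and call seconds_survived on the same file after the reboot. Use a fresh file for each run, because the file is opened in append mode.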
git bisect bad
cf5d8a46a001c9421c7397699db55f962e0410fc is the first bad commit
However, reverting this small commit in v4.2-rc1 did not solve the issue. The bisected kernel from the previous step (which was git bisect good) has been running glxgears and VLC without problems for around 11 hours now.
(In reply to BzukTuk from comment #78) > git bisect bad > cf5d8a46a001c9421c7397699db55f962e0410fc is the first bad commit > > However reverting this small commit in v4.2-rc1 did not solve the issue. > Bisected kernel from previous step (was git bisect good) is running glxgears > and VLC without problem around 11 hours now That could very well be connected to the problem. My suspect was 099bfbfc7fbbe22356c02f0caf709ac32e1126ea given the amount of i915 changes that were merged into 4.2-rc1.
Is it confirmed that i915 is the problem? Although it is the most obvious candidate, I am just asking. It is a kernel problem, not an xf86-video-intel one: I tried all possible bridges there (sna, xaa and uxa), and I also tried the gallium driver with ilo-dri. I tried different accel methods, buffers, and boot-time module parameters for the i915 module such as framebuffer, enable_rc6 power-saving options, semaphores and pretty much every option there is for that module. I also used different versions of xf86-video-intel, all compiled by myself. The freezes still occurred. To do a meaningful bisect between kernel versions, it is necessary to know which is the last working kernel without the bug. Is it confirmed that all 4.1 kernels work and that 4.2-rc1 is the first faulty version? I will get the latest 4.1 stable kernel and test it over the next days. If someone wants me to do further testing, I am also available. I am using Gentoo Linux and therefore compile everything on the machine.
My ASUS T100-CHI has freeze problems with all version 4 kernels. The history suggests that 3.16.7 was the last freeze-free version (see also Freedesktop bug 88012). That said, starting with 4.2, freezes occur more often on the CHI (after a few minutes to hours, versus days). There definitely is an issue there (a new freeze, or something making the first one(s) worse)! BTW, the new DMA fix for 4.5 did not solve the CHI freeze problem when I attempted to back-port it to 4.4: it froze within 2 minutes w/o cstate limit. But 4.4.0 has numerous other hardware regressions on the CHI (stock kernel: no wifi, no touchscreen, flaky BT) so...
I think I can provide some insight into this bug, although not really a solution. I have an Acer V3-111P featuring an N3530 processor. I got this machine in July 2014, when it was just released on the German market, because it was the first fanless laptop available. The first thing I did on it was to install Fedora, and those random freezes started to appear. It drove me nuts, as my system ran no more than a couple of minutes at a time and never longer than 20 minutes. I searched for "linux random freezes" on Google and found this Phoronix thread where a guy had random freezes similar to mine, but nobody else on the Linux kernel mailing list could reproduce it at the time. In the thread, Linus Torvalds himself provided a patch he made on a hunch for the guy to test. So I applied the same patch to my 3.18 kernel, and to my own surprise the crashes/freezes became a lot more infrequent. Since then my laptop usually runs at least a couple of hours and occasionally can run even a couple of weeks, depending on the usage pattern I suppose. I can't find the patch in the Linux kernel mailing list thread anymore. Fortunately I saved a copy locally. Here's the Phoronix thread: https://www.phoronix.com/scan.php?page=news_item&px=MTg1MDc Unfortunately Linus's patch can't be applied to newer kernels, as the particular code was changed quite a bit or even rewritten. But I think it still might give a hint how the problem could be solved or mitigated. If I understand Linus's patch correctly (and I've only a superficial understanding of it), it's a hack (Linus's own words) that corrects goofy jumps that can happen between "timekeeping" cycles. Here's Linus's patch that I applied to the 3.18 kernel.
diff --git a/include/linux/timekeeper_internal.h b/include/linux/timekeeper_internal.h
index 95640dc..7b14fd3 100644
--- a/include/linux/timekeeper_internal.h
+++ b/include/linux/timekeeper_internal.h
@@ -32,6 +32,7 @@ struct tk_read_base {
 	cycle_t			(*read)(struct clocksource *cs);
 	cycle_t			mask;
 	cycle_t			cycle_last;
+	cycle_t			cycle_error;
 	u32			mult;
 	u32			shift;
 	u64			xtime_nsec;
diff --git a/kernel/time/timekeeping.c b/kernel/time/timekeeping.c
index ec1791f..1e2722f 100644
--- a/kernel/time/timekeeping.c
+++ b/kernel/time/timekeeping.c
@@ -140,6 +140,7 @@ static void tk_setup_internals(struct timekeeper *tk, struct clocksource *clock)
 	tk->tkr.read = clock->read;
 	tk->tkr.mask = clock->mask;
 	tk->tkr.cycle_last = tk->tkr.read(clock);
+	tk->tkr.cycle_error = 0;
 
 	/* Do the ns -> cycle conversion first, using original mult */
 	tmp = NTP_INTERVAL_LENGTH;
@@ -197,11 +198,17 @@ static inline s64 timekeeping_get_ns(struct tk_read_base *tkr)
 	s64 nsec;
 
 	/* read clocksource: */
-	cycle_now = tkr->read(tkr->clock);
+	cycle_now = tkr->read(tkr->clock) + tkr->cycle_error;
 
 	/* calculate the delta since the last update_wall_time: */
 	delta = clocksource_delta(cycle_now, tkr->cycle_last, tkr->mask);
 
+	/* Hmm? This is really not good, we're too close to overflowing */
+	if (unlikely(delta > (tkr->mask >> 3))) {
+		tkr->cycle_error = delta;
+		delta = 0;
+	}
+
 	nsec = delta * tkr->mult + tkr->xtime_nsec;
 	nsec >>= tkr->shift;
 
@@ -455,6 +462,16 @@ static void timekeeping_update(struct timekeeper *tk, unsigned int action)
 	update_fast_timekeeper(tk);
 }
 
+static void check_cycle_error(struct tk_read_base *tkr)
+{
+	cycle_t error = tkr->cycle_error;
+
+	if (unlikely(error)) {
+		tkr->cycle_error = 0;
+		pr_err("Clocksource %s had cycles off by %llu\n", tkr->clock->name, error);
+	}
+}
+
 /**
  * timekeeping_forward_now - update clock to the current time
  *
@@ -471,6 +488,7 @@ static void timekeeping_forward_now(struct timekeeper *tk)
 	cycle_now = tk->tkr.read(clock);
 	delta = clocksource_delta(cycle_now, tk->tkr.cycle_last, tk->tkr.mask);
 	tk->tkr.cycle_last = cycle_now;
+	check_cycle_error(&tk->tkr);
 
 	tk->tkr.xtime_nsec += delta * tk->tkr.mult;
 
@@ -1181,6 +1199,7 @@ static void timekeeping_resume(void)
 
 	/* Re-base the last cycle value */
 	tk->tkr.cycle_last = cycle_now;
+	tk->tkr.cycle_error = 0;
 	tk->ntp_error = 0;
 	timekeeping_suspended = 0;
 	timekeeping_update(tk, TK_MIRROR | TK_CLOCK_WAS_SET);
@@ -1528,11 +1547,15 @@ void update_wall_time(void)
 	if (unlikely(timekeeping_suspended))
 		goto out;
 
+	check_cycle_error(&real_tk->tkr);
+
 #ifdef CONFIG_ARCH_USES_GETTIMEOFFSET
 	offset = real_tk->cycle_interval;
 #else
 	offset = clocksource_delta(tk->tkr.read(tk->tkr.clock),
 				   tk->tkr.cycle_last, tk->tkr.mask);
+	if (unlikely(offset > (tk->tkr.mask >> 3)))
+		pr_err("Cutting it too close for %s in in update_wall_time (offset = %llu)\n", tk->tkr.clock->name, offset);
 #endif
 
 	/* Check if there's really nothing to do */
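To make the core test of the patch easier to follow, here is a small Python sketch of the delta check in timekeeping_get_ns(). It is a simplified illustration, not the kernel code: it assumes clocksource_delta() is the plain masked subtraction (now - last) & mask, uses a 32-bit mask purely as an example, and leaves out how cycle_error is carried into later reads.

```python
MASK_32 = 0xFFFFFFFF  # example counter mask (assumption for illustration)


def clocksource_delta(now, last, mask):
    """Cycles elapsed between two counter reads, modulo the counter width."""
    return (now - last) & mask


def checked_delta(now, last, mask):
    """Apply the patch's sanity check to one pair of reads.

    Returns (delta_to_use, cycle_error). A delta larger than mask >> 3
    (one eighth of the counter range) is treated as bogus: the delta is
    zeroed and the offending value is remembered as the error.
    """
    delta = clocksource_delta(now, last, mask)
    if delta > (mask >> 3):
        return 0, delta
    return delta, 0


if __name__ == "__main__":
    # Normal case: the counter advanced by 1000 cycles.
    print(checked_delta(2000, 1000, MASK_32))
    # Counter read 5 cycles *behind* the last value: the masked subtraction
    # turns this into a huge positive delta, which the check rejects.
    print(checked_delta(995, 1000, MASK_32))
```

The second case shows why such a check can help: without it, a counter that appears to step slightly backwards would be interpreted as almost a full wrap of the counter, i.e. a very large jump forward in time.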
(In reply to julio.borreguero@gmail.com from comment #75) > sorry of course i meant 4.1 kernel and not 3.1. i use 4.1.12 You are right, Julio. Downgrading the kernel works without disabling hardware acceleration. I managed to downgrade to 4.1.6-1 and have not had a freeze yet. Before, I was not able to downgrade the kernel; I did it wrong, because I am new to this. Anyway, a lot of people have posted kernel versions that freeze and kernel versions that work fine. I can add that my ACER Aspire ES1-311 seems to work with kernel 4.1.6-1.
Tested with the latest 4.5-rc2 kernel. Got a hard lockup after one hour, this time neither in a browser nor in a video player: I was emerging linux-firmware while looking through the Linux kernel nconfig. But this time I added console=/dev/ttyUSB0,115200 and got some (maybe) useful information.

1) Right after boot I ended up with refined-jiffies:

clocksource: timekeeping watchdog on CPU1: Marking clocksource 'tsc' as unstable because the skew is too large:
clocksource:'refined-jiffies' wd_now: fffb77c9 wd_last: fffb75d5 mask: ffffffff
clocksource:'tsc' cs_now: 2d666de2c cs_last: 29a343f3d mask: ffffffffffffffff
clocksource: Switched to clocksource refined-jiffies

And got this lockup:

NMI watchdog: Watchdog detected hard LOCKUP on cpu 0
Modules linked in: aesni_intel xts aes_i586 lrw ablk_helper cryptd pcspkr mac_hid snd_intel_sst_acpi crc32c_intel ath6kl_sdio
CPU: 0 PID: 0 Comm: swapper/0 Tainted: G W 4.5.0-rc2-dirty #322
Hardware name: Dell Inc. Venue 11 Pro 5130/05FF9P, BIOS A15 01/20/2016
 00000000 c12a9b9f 00000000 c1118b3d c1be49d4 00000000 edc0b400 c1118a00
 c112e367 00000003 96ac4d66 fffffffc 00000000 c1ce1dc0 00000000 f77afae0
 00000001 c1ce1efc c112ec02 c1ce1f5c c10622b6 00000000 00000000 22f3ec2c
Call Trace:
 [<c12a9b9f>] ? dump_stack+0x48/0x79
 [<c1118b3d>] ? watchdog_overflow_callback+0x13d/0x150
 [<c1118a00>] ? watchdog_enable_all_cpus+0xb0/0xb0
 [<c112e367>] ? __perf_event_overflow+0xb7/0x280
 [<c112ec02>] ? perf_event_overflow+0x12/0x20
 [<c10622b6>] ? intel_pmu_handle_irq+0x1e6/0x3e0
 [<c10b864f>] ? enqueue_entity+0x2ff/0xe80
 [<c10b9214>] ? enqueue_task_fair+0x44/0xd40
 [<c10b361e>] ? select_task_rq_fair+0x44e/0x850
 [<c1097399>] ? __send_signal+0x189/0x310
 [<c10a5c97>] ? raw_notifier_call_chain+0x17/0x20
 [<c10ec7bb>] ? timekeeping_update+0x11b/0x1b0
 [<c1915c5f>] ? _raw_write_unlock_irqrestore+0xf/0x30
 [<c10eead3>] ? update_wall_time+0x303/0xb70
 [<c1915c5f>] ? _raw_write_unlock_irqrestore+0xf/0x30
 [<c112a89e>] ? perf_event_task_tick+0x4e/0x2a0
 [<c1059696>] ? perf_event_nmi_handler+0x26/0x40
 [<c1049ec4>] ? nmi_handle+0x44/0xa0
 [<c15c9252>] ? poll_idle+0x32/0x70
 [<c104a443>] ? default_do_nmi+0x53/0x230
 [<c104a6bf>] ? do_nmi+0x9f/0xd0
 [<c1916ea7>] ? nmi_stack_correct+0x2f/0x34
 [<c10e00d8>] ? rcu_sync_func+0x38/0x90
 [<c15c9252>] ? poll_idle+0x32/0x70
 [<c15c8ce4>] ? cpuidle_enter_state+0x134/0x270
 [<c10c474c>] ? cpu_startup_entry+0x1ac/0x250
 [<c15626cd>] ? usb_find_interface+0x2d/0x50
 [<c1d57a92>] ? start_kernel+0x39d/0x3a4
perf interrupt took too long (3896 > 2500), lowering kernel.perf_event_max_sample_rate to 50000
clocksource: Switched to clocksource tsc

I had to switch to tsc manually in order to use the tablet at all.

2) Another bug:

------------[ cut here ]------------
WARNING: CPU: 2 PID: 3158 at drivers/base/power/common.c:150 dev_pm_domain_set+0x54/0x60()
PM domains can only be changed for unbound devices
Modules linked in: aesni_intel xts aes_i586 lrw ablk_helper cryptd pcspkr mac_hid snd_intel_sst_acpi crc32c_intel ath6kl_sdio(-)
CPU: 2 PID: 3158 Comm: rmmod Tainted: G W 4.5.0-rc2-dirty #322
Hardware name: Dell Inc. Venue 11 Pro 5130/05FF9P, BIOS A15 01/20/2016
 00000009 c12a9b9f ecd8dec8 c108d662 c1c4adf4 ecd8dee0 00000c56 c1c17ffd
 00000096 c14b1b54 c14b1b54 d0008604 00000000 00000000 bfc7cd88 c108d6c3
 00000009 ecd8dec8 c1c4adf4 ecd8dee0 c14b1b54 c1c17ffd 00000096 c1c4adf4
Call Trace:
 [<c12a9b9f>] ? dump_stack+0x48/0x79
 [<c108d662>] ? warn_slowpath_common+0x82/0xb0
 [<c14b1b54>] ? dev_pm_domain_set+0x54/0x60
 [<c14b1b54>] ? dev_pm_domain_set+0x54/0x60
 [<c108d6c3>] ? warn_slowpath_fmt+0x33/0x40
 [<c14b1b54>] ? dev_pm_domain_set+0x54/0x60
 [<c132b612>] ? acpi_dev_pm_detach+0x2d/0x6b
 [<c14b1a86>] ? dev_pm_domain_detach+0x16/0x20
 [<c15d6523>] ? sdio_bus_remove+0x83/0xf0
 [<c14a9ef8>] ? __device_release_driver+0x78/0x120
 [<c14aa67f>] ? driver_detach+0x8f/0xa0
 [<c14a9a68>] ? bus_remove_driver+0x38/0x90
 [<c10fdd78>] ? SyS_delete_module+0x158/0x220
 [<c11b163d>] ? mntput_no_expire+0xd/0x180
 [<c10a3a74>] ? task_work_run+0x74/0x90
 [<c100100b>] ? exit_to_usermode_loop+0x8b/0xc0
 [<c1001500>] ? do_fast_syscall_32+0x80/0x130
 [<c19161f8>] ? sysenter_past_esp+0x3d/0x5d
---[ end trace 969e6d42685aab80 ]---

I believe it's connected with ath6kl. I added my sdio card id to ath6kl_sdio.c. I'll try without any custom patches.

3) The last line before the complete hard lockup was:

perf interrupt took too long (5007 > 5000), lowering kernel.perf_event_max_sample_rate to 25000

Before it there were only cfg80211's regulatory domain changes, IPv6 link not ready and ath6kl's stuff. Also, I wasn't able to reboot using sysrq at all. The full log is here: http://pastebin.ca/3363040
Created attachment 202701: Kernel bisection between v4.2 v4.1 for sudden freezes
Hi, small update. My first bisect was from 4.1 to 4.2-rc1, and it flagged [cf5d8a46a001c9421c7397699db55f962e0410fc] as the first bad commit. But I was not sure that I had done the bisection properly, so today I made a second bisection: git bisect start v4.2 v4.1. This time the bisection process went without the problems, confusion and doubt of my first attempt. The last git bisect good was on commit cf5d8a46a001c9421c7397699db55f962e0410fc (after 90 minutes of glxgears and VLC). Git reported the first bad commit as: [8fb55197e64d5988ec57b54e973daeea72c3f2ff] drm/i915: Agressive downclocking on Baytrail. Then, on top of commit cf5d8a46.., I cherry-picked 8fb55197, and this kernel froze after 3 minutes. More cherry-picking/testing tomorrow. Sorry if my previous post caused confusion or unnecessary work. Today's 'git bisect log' is in the attachment.
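For readers unfamiliar with bisection, the following Python sketch shows the search that git bisect performs, under strong simplifying assumptions: the history is modelled as a linear list (real kernel history is a graph with merges), the abbreviated names other than cf5d8a46 and 8fb55197 are made-up placeholders, and the test is assumed to be reliable. With a random freeze, a "good" verdict only means "did not freeze within the test time", which is one way a bisect can be led astray.

```python
def first_bad(commits, is_bad):
    """Binary-search for the first bad commit.

    Assumes commits is ordered oldest to newest, commits[0] is known good,
    commits[-1] is known bad, and every commit after the first bad one is
    also bad.
    """
    good, bad = 0, len(commits) - 1
    while bad - good > 1:
        mid = (good + bad) // 2
        if is_bad(commits[mid]):
            bad = mid    # first bad commit is mid or earlier
        else:
            good = mid   # first bad commit is after mid
    return commits[bad]


# Hypothetical linear history; "c1", "c2" and "c5" are placeholders.
HISTORY = ["v4.1", "c1", "c2", "cf5d8a46", "8fb55197", "c5", "v4.2"]


def freezes(commit):
    """Stand-in for 'build, boot, run glxgears + VLC and wait for a freeze'."""
    return HISTORY.index(commit) >= HISTORY.index("8fb55197")


if __name__ == "__main__":
    print(first_bad(HISTORY, freezes))
```

Each loop iteration corresponds to one "git bisect good" or "git bisect bad" answer, so the number of kernels to build and test grows only with the logarithm of the number of commits between v4.1 and v4.2.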
Hello everybody, I use version 15.10 (x64), and the only way I have found to use my laptop, an Asus X751MJ-TY005H powered by an N3540, is to pass a kernel boot parameter: https://wiki.ubuntu.com/Kernel/KernelBootParameters I have now been using the laptop for two days with the stock kernel 4.2.0.27-generic without a freeze. I mostly listen to music, browse the web and read my mail.
Latest git kernel works for me without freezes. Tested for about a week with very heavy workloads (glxgears, YouTube in Firefox, mpv with 1080p and kernel compiling in 4 threads at the same time). There is only one flaw: I had to add my wifi (ath6kl_sdio with a custom patch adding a new ID) to the blocklist. Modprobing it leads to a freeze within minutes.
(In reply to Dnitry from comment #87) > Latest git kernel works for me without freezes. Tested for about a week and > very hard workflows (glxgears,youtube in firefox, mpv with 1080p and kernel > compiling in 4 threads at the same time). There is only one flaw: I had to > add my wifi(ath6kl_sdio with custom patch adding new ID) to blocklist. > Modprobing it leads to freeze in minutes. What commit is working fine for you? I'm very curious because 4.5-rc2 exhibited the issue for me and it would help in bisecting. Also I just compiled 4.5-rc3 and I'm testing the stability on the Celeron N2940.
$ uname -a
Linux shiva 4.5.0-rc1 #18 SMP Mon Feb 8 10:09:09 ART 2016 x86_64 Intel(R) Celeron(R) CPU N2940 @ 1.83GHz GenuineIntel GNU/Linux
I have been running the latest kernel, 4.5.0-rc1, on the N2940 for a few hours. I did some stress-testing, running vlc and glxgears in parallel, plus loads of other stuff at the same time. Could it really be that the bug is finally fixed? I will give feedback as soon as my system freezes, or in a few days otherwise.
short fun, it froze :(
:-) Today I tested linux-v4.5-rc[1-3] on the Acer Aspire Switch 10: freezes occurred on all of them. From my bisect, the last good commit seems to be [cf5d8a46a001c9421c7397699db55f962e0410fc]: glxgears and VLC ran for 18 hours 20 minutes without problems (then I got bored and powered it off). Commit [8fb55197e64d5988ec57b54e973daeea72c3f2ff] introduced the problem: vlc & glxgears froze the laptop in 3 minutes. Unfortunately my git & C skills are not good enough to revert this [8fb5519..] commit in whole releases (like 4.2-rc1 or 4.2) because of additional changes. The biggest problem with "git revert 8fb5519.." is in the file "drivers/gpu/drm/i915/intel_pm.c": there are over 30 commits (some of them merges) changing this file between 8fb5519 and the 4.2-rc1 or 4.2 kernel. Could someone look into that? Thanks
4.5.0-rc3: 4 hours of films, glxgears and browsing until the batteries were dead, without any hint of a freeze. For me, 4.5.0-rc2 and higher are much more stable than any other kernel, even the 4.1.y branch. Recently I got several freezes on the 4.1.17 kernel and then switched to latest git. On the cmdline I have only this: tsc=reliable clocksource=tsc. As for patches, I have a fix for asoc channels, an ath6kl enable patch and a soc_button_array patch. Nothing special related to i915 or cpu (cstate). Also, as I mentioned before, I blacklisted my wifi (I use a usb wifi stick and usb ethernet). My idea is that these freezes might be connected with either clock or power instability. For the baytrail platform we do not have a reliable hpet, and tsc also seems unstable. As for power, I observe freezes when there are changes in gpu or cpu states. While the tablet works on a task it works perfectly, but when the task ends there is a non-zero probability of a freeze; the same goes for when we start doing anything on the tablet after a pause. To me it looks like there is not enough voltage during frequency changes, like what we see with undervolting. That is plausible, because we see hard lockups, but it is just a guess. I do not know whether the baytrail platform has any ability to tune the voltage through a software api; Windows, at least, works stably regardless of workload. P.S. These freezes might also be connected with mmc. Wifi is connected through it, and bluetooth does not work for me at all. Only internal storage and external mmc card.
(In reply to BzukTuk from comment #91) > :-) > Today I tested on Acer Aspire Switch 10 linux-v4.5-rc[1-3] - freezes occured > on all of them. > > From my bisect last good commit seems to be > [cf5d8a46a001c9421c7397699db55f962e0410fc] - glxgears and VLC was running > for 18hours 20minutes without problem (then I got bored and powered it off). > > Commit [8fb55197e64d5988ec57b54e973daeea72c3f2ff] introduced the problem - > vlc&glxgears froze laptop in 3 minutes. > > Unfortunately my git&c skills are not good enough to revert this [8fb5519..] > commit in whole releases (like in 4.2-rc1 or 4.2) because of additional > changes. Biggest problem with "git revert 8fb5519.." is in file > "drivers/gpu/drm/i915/intel_pm.c" - there is over 30 commits (some of them > merges) changing this file between 8fb5519 and 4.2-rc1 or 4.2 kernel. Could > someone look into that? > Thanks The legacy-turbo patch does a fine job of disabling this commit. (https://github.com/OpenBricks/openbricks/blob/master/packages/system/linux/patches/4.0/linux-999-i915-use-legacy-turbo.patch) - edit as needed Since 4.2.6, my ASUS T100-CHI usually freezes within 5 minutes without max_cstate=1. Because of your bisect, I tried the legacy-turbo patch on 4.5-rc3. Before patching, my CHI ran 4.5-rc3 29 minutes before freezing (no cstate.), better than 5, but... After patching, I haven't had a freeze in over 5 hours so far (no cstate argument). So, there must be at least 2 freeze bugs! LPSS, aggressive down-clocking and another one still lurking somewhere around wifi/mmc. Not to mention the GPU hang previously fixed in 4.2.6. Usual disclaimers, YMMV. My kernels have a few T100 hardware specific patches. Still a bit early to declare success, but this is promising.
Eating crow already. My second 4.5 test without cstate froze after 5 hours. 5 minutes to 29 minutes to 5 hours is a huge improvement, but it is not the whole solution. There is still another one out there. (ASUS T100-CHI, kernel-4.5-rc3 + Legacy-turbo patch + T100 specific patches, intel_idle.max_cstate=1 does not freeze)
I tried 4.5.0-rc3 on the N2940 (ACER ES1-711) with the tsc=reliable clocksource=tsc cmdline => freeze after a few hours. I could try the turbo patch. intel_idle.max_cstate=1 never worked for me in any kernel that freezes.
I've run several kernel versions on a Jetway JBC311U93 Celeron N2930 (Bay Trail). In all cases I've had intermittent lockups after anything from 1 hour of runtime up to 2 weeks. I mostly run with both HDMI ports connected but little or no video acceleration in use. Kernels I've used include 3.19 (built by Yocto poky), 4.1.13, 4.2.6, 4.3.3 and 4.4.0. Currently I'm running 4.4.1. I've observed intermittent lockups on the abovementioned hardware on ALL of these kernels. I see no activity on USB other than the clock pulse, which I interpret (perhaps incorrectly) as no signs of life from the SoC. I've never gotten any useful core dumps or kernel panics when the lockup occurs; the system becomes completely unresponsive. Building mplayer2 and playing h.264 high profile videos continuously, I seem to get lockups far more consistently, usually within no more than 24 hours. Previous tests I've run to try to induce failure have all been fruitless. One of the symptoms of the lockup has been very high junction temperatures (up to 98C; the rated maximum junction temp is 110C) while it is in the hung state, and in some cases a reboot (via a Fintek chipset watchdog) does not clear the hung state. My previous efforts focused on stressing CPU load and the SSD disk device. However, exercising the GPU seems to yield higher failure rates. Running with intel_idle.max_cstate=1, I have gotten no lockups so far. It's too early to declare a victory, but this is definitely promising.
I've added Dnitry's tsc arguments to my custom kernel (4.5-rc3(w/LPSS) + legacyturbo & t100 patches). Best test run yet without cstate: > 21 hours and counting. May it keep running after this post. They may be interrelated, but there are still more freeze bugs. Since max_cstate=n doesn't avoid all freezes, at least one is outside of power-saving. Not all Atom platforms are affected by each possible freeze. The T100-CHI is sensitive to several, but cstate has been a reliable workaround. I can try other patches or kernel arguments, if they are posted here. 4.3.5 runs quite well, but freezes readily if I omit cstate. Now that the LPSS updates have been included in 4.5, typical freezing takes several times longer, making it less suitable for rapid testing. 4.4.x does not work well for me, hardware regressed - no wifi (w/o patching) or touchscreen.
(In reply to jbMacAZ from comment #93) > (In reply to BzukTuk from comment #91) > The legacy-turbo patch does a fine job of disabling this commit. > (https://github.com/OpenBricks/openbricks/blob/master/packages/system/linux/ > patches/4.0/linux-999-i915-use-legacy-turbo.patch) - edit as needed Thank you jbMacAZ, with this patch I had no freeze during 24+ hours of running glxgears, VLC, and youtube in firefox. Kernel 4.4.1 with mmc&pm-qos patches from (https://github.com/hadess/rtl8723bs/tree/master/patches), and linux-999-i915-use-legacy-turbo.patch + small change in snd drivers. No kernel parameters. During those 24 hours, tablet went few times to hibernation (low battery), and after resume, glxgears and vlc still worked. Wifi module need reload after resume from hibernations - Youtube started playing after F5 :)
No freeze observed (47hrs) with tsc arguments, but my bluetooth inactivity timeouts became erratic. On the T100-CHI, the keyboard is linked via bluetooth, so unreliable timeouts affect usability. I won't be using the tsc arguments as an alternate workaround to max_cstate. YMMV.
I may be running into this bug as well, using a Celeron N3150 (Braswell). I've tried:
* Ubuntu Server 15.10 with generic 4.2.0-* kernels
* Arch with 4.4.1-* kernels (console only, no X)
Both setups caused similar halts and spontaneous reboots, almost always without any logs generated except to the screen. I saw watchdog errors about stalled cores and some other errors that I can't recall offhand (but may have written down at home, will check tonight). So far, Arch with lts kernel 4.1.(17?) seems to be running better, although not without an occasional issue. I'm trying intel_idle.max_cstate=2 right now and can report back. I will be curious to see if it helps, as C2 isn't explicitly listed as a c-state for the N3150 (only C0, C1, C6, and C7 states). I'll try max_cstate=1 after this trial as well. My thanks to everyone tracking and reporting on this issue. It's been super informative and helpful as I've been trying to figure out what's happening with this box.
I'm seeing these freezes on a Z3745. While reading the comments I get the feeling that we are mixing up two problems. BayTrail-T in current kernels only has one real clocksource: the tsc. By default the kernel will compare this clocksource to the refined-jiffies clocksource. But as refined-jiffies is unreliable (at least on non-rt kernels), the kernel often gets the impression that it can't rely on the tsc. When this happens, the kernel switches to the refined-jiffies clocksource and starts to become sluggish. After a short time "sleep 1" will take forever, and you are lucky if you have an open root shell where you can set the clocksource back to tsc. The official fix in Intel's Android kernel is to set the tsc as reliable. It is definitely a bug that refined-jiffies results in this behaviour, but it is not related to the freezes we see on BayTrail.
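To check whether a machine has fallen back from tsc in the way described above, the kernel exposes the active and available clocksources in sysfs. The following Python sketch reads them; the function name and the base parameter are illustrative, and it assumes the usual sysfs location /sys/devices/system/clocksource/clocksource0.

```python
from pathlib import Path

# Usual sysfs location of the clocksource controls on Linux.
CLOCKSOURCE_DIR = Path("/sys/devices/system/clocksource/clocksource0")


def read_clocksources(base=CLOCKSOURCE_DIR):
    """Return (current, available) clocksource names read from sysfs."""
    base = Path(base)
    current = (base / "current_clocksource").read_text().strip()
    available = (base / "available_clocksource").read_text().split()
    return current, available


if __name__ == "__main__":
    if CLOCKSOURCE_DIR.is_dir():
        current, available = read_clocksources()
        print(f"current: {current}, available: {', '.join(available)}")
        if current != "tsc":
            print("Not running on tsc; this matches the sluggishness case.")
```

Switching back by hand is done by writing the name as root, for example: echo tsc > /sys/devices/system/clocksource/clocksource0/current_clocksource. Booting with tsc=reliable, as mentioned above, is intended to avoid the fallback in the first place.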
Thank you for the clarification on tsc. I have seen that sluggishness twice where the screen refreshes once every 20-30 seconds. 4.5rcx or 4.4.x needs to run overnight to get that bad. So my kernel args should be tsc=reliable and intel_idle.max_cstate={1,0}. Then nothing bad should happen (no excessive latency, no freezes)?
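A quick way to verify that these arguments actually reached the kernel is to inspect /proc/cmdline. The Python sketch below is a minimal illustration: the helper names and the WANTED table are assumptions, it does not handle quoted values containing spaces, and it only checks that the parameters are present, not that they prevent freezes.

```python
def parse_cmdline(cmdline):
    """Split a kernel command line into a dict; bare flags map to None."""
    params = {}
    for token in cmdline.split():
        key, sep, value = token.partition("=")
        params[key] = value if sep else None
    return params


# Arguments discussed above, each with its set of accepted values.
WANTED = {
    "tsc": {"reliable"},
    "intel_idle.max_cstate": {"0", "1"},
}


def missing_workarounds(cmdline, wanted=WANTED):
    """Return the sorted names of wanted parameters that are absent or differ."""
    params = parse_cmdline(cmdline)
    return sorted(key for key, accepted in wanted.items()
                  if params.get(key) not in accepted)


if __name__ == "__main__":
    example = "BOOT_IMAGE=/vmlinuz root=/dev/sda1 quiet tsc=reliable"
    print(missing_workarounds(example))
```

On a running system, pass open("/proc/cmdline").read() instead of the example string.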
(In reply to jbMacAZ from comment #102) > So my kernel args should be tsc=reliable and intel_idle.max_cstate={1,0}. > Then nothing bad should happen (no excessive latency, no freezes)? You should also apply the patches mentioned in comment 55.
Does Intel really completely ignore this issue? It was introduced in 3.16 and is still not fixed in the 4.5 kernel. Yes, there is a workaround, but no real solution. I doubt it will ever get fixed. Only a few people are trying to identify the issue in their free time. It would be awesome if they could find a permanent fix. But shouldn't Intel have done this a long time ago already? My computer freezes from time to time (about twice per week) even with the 3.13 kernel. So staying with the old kernel isn't an ideal solution either.
To be clear, the issue isn't in 3.16. I've apt-pinned to 3.16.0-4 and never had the freeze issue again. 3.16.7 is noted as the last freeze-free version. Which 3.16 do you mean? But yes, it's been very quiet from Intel on this thread. As I understand it, though, Adrian Hunter is from Intel and has done some patches on this: https://lkml.org/lkml/2015/3/24/271 (as mentioned in comment 55). These don't seem to have been merged for nearly a year now, though.
(In reply to Daniel Glöckner from comment #103) > (In reply to jbMacAZ from comment #102) > > So my kernel args should be tsc=reliable and intel_idle.max_cstate={1,0}. > > Then nothing bad should happen (no excessive latency, no freezes)? > > You should also apply the patches mentioned in comment 55. I have them in 4.3.5 and that is my best running recent kernel[EOL: way too soon]. I thought that the LPSS enhancements in 4.5 meant they were no longer needed there. Appreciate the guidance.
(In reply to Joe Burmeister from comment #105) > But yes, it's been very quiet from Intel on this thread, but as I understand > it, Adrian Hunter is from Intel and has done some patches on this: > https://lkml.org/lkml/2015/3/24/271 (as mentioned in comment 55). Though > these don't seem to have been merged for nearly a year now. I agree. In general, this seems to be a stability issue relevant to any Baytrail-based machine. That is why I believe there must be thousands of users fighting this bug on different Linux distros, probably unaware of this bug report. Would it help if somebody competent raised the importance of this bug here in Bugzilla? I don't feel that the importance "P1 Normal" is correct if this bug leads to certain freezes within tens of minutes on Baytrail machines. The status "NEW" is also misleading, as this bug is obviously CONFIRMED.
Just made an account here to confirm this Baytrail issue. Older kernels work fine, but are not optimal. On a new install of the latest kernel, simply moving the mouse or watching a terminal download from apt-get can cause graphical corruption, a reset or a freeze. Windows has zero issues with stability relating to power states or graphics, so it is not my hardware. I am using a Lenovo 11e with an Intel N2940 cpu.
(In reply to Michal Feix from comment #107) > (In reply to Joe Burmeister from comment #105) > > But yes, it's been very quiet from Intel on this thread, but as I > understand > > it, Adrian Hunter is from Intel and has done some patches on this: > > https://lkml.org/lkml/2015/3/24/271 (as mentioned in comment 55). Though > > these don't seem to have been merged for nearly a year now. > > I agree. In general, this seems to be a stability issue relevant with any > Baytrail based machine. That is why I believe there has to be thousands of > users fighting with this bug on different linux distros, probably unaware of > this bug report. Would it help if somebody competent raised the importance > of this bug here in Bugzilla? I don't feel that importance "P1 Normal" is > correct, if this bug leads to certain freezes in tens of minutes on Baytrail > machines. Also, status "NEW" is also misleading, as this bug is obviously > CONFIRMED. I think I am one of those who are struggling with this bug. Any distro other than Debian 8 (kernel 3.16) locks up after some use, which may vary from a few minutes to several hours, but it always crashes. A fix would be very important: machines like the Dell Inspiron 3000 Series Ubuntu Edition are Bay Trail based and very affordable, so many users could be running them (just like myself).
I installed Ubuntu 14.04.4 in a separate partition for experimentation. I am running Kernel 4.2.0-30. The only modification I made was the Cstate setting mentioned in this post and it locked up in fifteen minutes. I'll try something else tonight and post the results.
I can also confirm this bug. My HTPC is a Shuttle XS35V4 with a J1900. It is unusable on anything higher than kernel 3.16, exactly as Alejandro Morales Lepe explained it.
I can also confirm this bug (Acer ES-11, N2940); like many others, I am looking for the solution. Sorry I cannot add anything useful to the hunt.
I can also confirm this bug. ASRock Q1900TM-ITX xubuntu 3.19.0-51-generic x86_64
Greetings: I have just joined the forum to provide you feedback on my situation which seems to confirm your findings regarding 'intel_idle.max_cstate=2'.

Indeed, I have two mini-PC type low power consumption very recent boxes. The first one is an Intel NUC Model 5CPYH with a Dual Core Celeron N3050. The second is a Zotac Zbox CI320 Nano with a Quad Core Celeron N2930. Since the beginning I have been running Linux Mint 17.2 and now 17.3 on both boxes as host and guest OS as I run VirtualBox 5.0.14 to virtualize a tiny family web server only accessible from my LAN. I never installed or tried any other OS (notably no Windows) or any other flavors of Linux on these machines.

Several observations noteworthy:

1-Intel NUC couldn't display anything via VGA or HDMI when first installed with stock Linux Mint 17.2 (kernel 3.16 if I recall). I could remotely SSH and replace its kernel to 4.3.0 (picked randomly, and it was the most recent at that time), and everything started to work. Very well! Actually without any crash or anything for days.

2-I installed VirtualBox 5.0 and virtualized a basic server built on Linux Mint 17.2 desktop with wordpress, which has been in use for months on an AMD processor based computer (but needed to be replaced as it was a 200 Watt consuming old hardware). Everything went smoothly, but the virtual machine froze overnight. This has kept happening over and over for several weeks; the virtual machine would freeze within less than a day. Rebooting it would become an ordinary daily thing. But, the host would never freeze or crash on me!

3-Zotac CI320 on the other hand started to freeze the minute I installed Linux Mint 17.2. After each reboot it would work for a few minutes and freeze before my eyes while trying to select a WiFi access point, or changing screen resolution, or browsing with Firefox, or simply moving a window around. I upgraded its kernel to 4.3.0 and many different versions, but at best the frequency of failures changed, the problem never went away for good. Things seemed to get a bit better after upgrading to Linux Mint 17.3 with kernel 3.19 stock version, to the point that I wanted to test VirtualBox on it. I installed VirtualBox 5.0.14 and started to play around.

4-My first guest OS was FreeBSD 10.2 on Zotac's VirtualBox. Amazingly, this combo brought a new found stability to my hardware. So, Linux Mint 17.3 with kernel 3.19, VirtualBox 5.0.14 and FreeBSD 10.2 stock would work trouble free without any failure, for days.

5-Then I decided to move my little server to the Zotac platform as it looked stable as described in #4. Troubles started to show up again! But far worse than on the Intel NUC. It would actually crash the entire machine, host, all guest OS etc. whereas on Intel NUC it would only crash the guest OS.

6-I kept digging for info and eventually came across this posting and thought this might be the root cause of my problems. I have been running my Zotac box with intel_idle.max_cstate=2 for the last couple of days (both on the host and guest OS) and have even been bold to the point of doing some computer intensive things. Everything is holding up for now. Hopefully it will be ok for good.

I just wanted to share my experience with the hope that if someone with similar, more or better experience want to comment or suggest, it would be helpful for me but also for others. I am still on the edge because of these 2 almost brand new computers.

Also, I wanted to ask for advice regarding using 'intel_idle.max_cstate=2' on both the host and guest OS as I am doing right now. Does it make sense? or should I only run it on the host OS? Maybe one more question, although this might not be the right place to ask; is the FreeBSD 10.2 kernel known to work better with these processors with regards to this random freezing problem?
Thanks for your attention and sorry for the length of the post. Hal
FWIW, I had a freeze on 4.3.6 with tsc=reliable and intel_idle.max_cstate=1. I hadn't had any freezes since 4.2.5 when cstate limit was set. A new freeze bug perhaps?
I have been distro-hopping for a while now, and I can confirm: anything newer than 3.16 freezes. I installed Ubuntu 14.04.2 and it runs nicely, but if I install 14.04.3 the system freezes. intel_idle.max_cstate=1 sometimes seems to work and sometimes doesn't, but I haven't found any pattern. If there is something I can do to help solve this issue, tell me; otherwise I am going to be stuck on Ubuntu 14.04.2 or Debian 8 forever.
Hey everyone, same here! With the new kernels I have several freezes per day. Just writing and doing office stuff causes that bug only sometimes. But watching a DVD (with an external drive) or surfing the internet (especially flash I think) and the system freezes a lot. I use Lubuntu 15.10 with 4.2.0-30-generic. My PC is an Acer ES-1 311 laptop with Intel N3540 CPU. Would be nice to solve the problem. Can't we just go back to the old working kernel from Ubuntu 14.04 and delete the malicious code in the new one? I don't understand, why a kernel with such a heavy bug, that affects a lot of users, was released.
> Also, I wanted to ask for advice regarding using 'intel_idle.max_cstate=2' > on both the host and guest OS as I am doing right now. Does it make sense? > or should I only run it on the host OS? IMHO, it only makes sense on the host.
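To see whether the limit is actually in effect on a given system (host or guest), the running kernel's command line can be inspected. Below is a minimal Python sketch, assuming the standard /proc/cmdline interface; the helper name max_cstate_from_cmdline is made up for illustration, and keeping the last occurrence when the parameter is repeated is an assumption, not something confirmed in this thread.

```python
def max_cstate_from_cmdline(cmdline):
    """Return the intel_idle.max_cstate value found in a kernel command line, or None."""
    value = None
    for token in cmdline.split():
        key, sep, val = token.partition("=")
        if key == "intel_idle.max_cstate" and sep and val.isdigit():
            # Assumption: if the parameter is given more than once, keep the last one.
            value = int(val)
    return value


if __name__ == "__main__":
    try:
        with open("/proc/cmdline") as f:
            current = f.read()
    except OSError:
        # Not on Linux, or /proc is not mounted.
        current = ""
    print("intel_idle.max_cstate:", max_cstate_from_cmdline(current))
```

Running it inside the guest as well as on the host shows which of the two kernels was actually booted with the parameter.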
I am sorry, but this situation is a real comedy! Almost all Bay Trail devices have the problem. This means a huge number of modern PCs, tablets and laptops. This situation has lasted more than 4 months, and the developers don't care to fix it, only to add new features to the kernel! I have been stuck on kernel 3.16 for more than 4 months because of this serious bug, and I know more than 20 people in the same situation, all of them with different devices. I love Linux and I appreciate the kernel developers, but here we surely need a project manager to judge whether a bug is high priority or not.
Hello everyone. I have been facing the same issue. Recently I bought an Asrock N3150DC-ITX board, and the onboard N3150 was buggy with Linux kernels from 3.19 to 4.4: it froze, and sometimes X11 crashed randomly on it. A few days ago I installed the drm-intel-next kernel from the Ubuntu mainline repository: http://kernel.ubuntu.com/~kernel-ppa/mainline/drm-intel-next/2016-03-01-wily/ After installing the new kernel, the system seems to be stable without the cstate hack. Sidenote: after installing the Intel open source graphics driver from 01.org, hardware-based decoding is working perfectly on Ubuntu 15.10 (MATE Edition) (the installer also updates the vaapi packages to the latest version that supports Cherrytrail). Tested with 4K content and 1920p downscaling. No issues, no lag after 3-4 days uptime, running mostly with Kodi.
(In reply to Molnár Roland from comment #120) > Few days ago i installed the drm-intel-next kernel from the Ubuntu mainline repository: http://kernel.ubuntu.com/~kernel-ppa/mainline/drm-intel-next/2016-03-01-wily/ I just tried this on a n2940 system, a Thinkpad 11e. The screen flashed a lot, so I went back to 4.2.8 again, with cstate hack.
(In reply to jds from comment #121) > I just tried this on a n2940 system, a Thinkpad 11e. The screen flashed a lot, so I went back to 4.2.8 again, with cstate hack. Interesting, I tried the drm-intel-next kernel linked by Molnár Roland on my Thinkpad 11e for a little while last night, and I didn't see any of those issues. I tried it on a quick fresh ubuntu install (I usually use Manjaro) playing a twitch stream on Mpv, I'll have to find a way to unzip that deb and install the kernel myself, or get it building myself to test it longer. I don't believe any of the drm-intel-next stuff is going to get merged into the upcoming 4.5, as it's already on rc7, so maybe it will come with 4.6
(In reply to Travis Hall from comment #122) > Interesting, I tried the drm-intel-next kernel linked by Molnár Roland on my Thinkpad 11e for a little while last night, and I didn't see any of those issues. Interesting. Well, I installed this kernel over a Mint 17 setup (Ubuntu 14.04), so maybe there's some interaction between the new kernel and X?
(In reply to jds from comment #123) > Interesting. Well, I installed this kernel over a Mint 17 setup (Ubuntu > 14.04), so maybe there's some interaction between the new kernel and X? False alarm, my Ubuntu MATE install hung while running the drm-intel-next kernel from the Ubuntu repo. I also compiled a kernel from drm-next using an Arch User Repo package https://aur.archlinux.org/packages/linux-drm-intel-nightly/ on Manjaro, and it also hung within about 2 hours while running some youtube on loop, and a stream in mpv.
I tried the drm-intel-next kernel and also the Intel Linux drivers. Nothing works, so I am back to kernel 3.6.17.
I'm relatively happy that my system is stable now thanks to the intel_idle.max_cstate=1 flag; however, I agree with everything Dimitris Roussis wrote about this situation. I have had my machine since mid-2014, which means that this bug has plagued users for almost 2 years now. The number of users that have been burned by this issue must be staggering, and I assume most of them didn't file a bug report. I can't comprehend how this bug is rated "P1 normal", when it's clearly a critical bug preventing a huge number of Intel processors from being stable on Linux. Intel should really be embarrassed about this bug. Can we please get a statement from an Intel employee about what is being done?
(In reply to dertobi from comment #126) > Can we please get a statement from an Intel employee about what is being done? Most non-ARM Chromebooks use Bay Trail chips. Any sense of what the Chromium project may have done about this bug?
This bug is affecting me on an Acer Aspire E3-111. So far so good with intel_idle.max_cstate=1. I'll echo what others have said: it would be reassuring to hear from someone at Linux or Intel about progress towards solving this. Without a doubt, it has been "quietly" affecting a great many people for a long time, who had no way of knowing what the issue was. I spent quite a bit of money replacing the SSD, thinking that it was the culprit. :(
For about 2 months I have been using, on a daily basis, kernel 4.4.0 with the patch mentioned in Comments 48, 55, 77, 98, 103, 105 on an Atom Z3735G, without intel_idle.max_cstate. I do experience freezes, on average about every 10 hours of use. Occasionally a specific operation always causes a hard lockup; in these cases I reboot using intel_idle.max_cstate=0 and the freeze does not occur any more. Now I compiled kernel 4.5.0-rc7, but I was not able to apply the mentioned patch. It does not apply cleanly, and after trying to introduce the failing parts by hand I got an error message during boot. This kernel freezes within less than a minute after boot, even with intel_idle.max_cstate=0 in the boot command line. With tsc=reliable clocksource=tsc in the boot command line the freeze does not occur for at least 30 minutes, but comments seem to indicate that the tsc command line is not recommended. Is there an update of the mentioned patch?
I just wanted to update my post #114 after 2 weeks of testing as my Zotac system is now much more stable. First, for the host OS: intel_idle.max_cstate=2 definitely saved my Zotac computer. No more host crash, nor any VirtualBox system freeze. As for the guest OS freezing situation; I accidentally noticed that VirtualBox might have had a problem with 3 cores assigned to the guest LinuxMint OS. Changing it to 4 cores (the maximum available on my Zotac) seems to have stopped the freezing of the guest OS. In any event in the new configuration it has been running for over a week now with no hint of problems under medium to heavy load.
Replying to my own question from earlier, Chrome OS is on 3.10.18 (!). This is for version 48.0.2564.116 in the stable channel. I found this by checking on a Chromebook that uses the n2940. Note that there is an issue with this system: the wireless module craps out occasionally (logged bug). Seems to be related to the iwl* subsystem. jds
(In reply to Elmar Melcher from comment #129) ... > With tsc=reliable clocksource=tsc in the boot command line the freeze does > not occur for at least 30 minutes, but comments seem to indicate that tsc > command line is not recommended. > > Is there an update of the mentioned patch ? Bugzilla could benefit from the ability to append comments instead of forcing new ones. I had problems that I thought were associated with tsc arguments. But my device really does have issues with timeouts and connectivity using bluetooth with 4.5-rcx. I just hadn't noticed before trying the tsc arguments. FWIW, I'm following the guidance in comment #103. cstate and tsc only minimize one or more long-standing freeze problems. But for quite a few, they are sufficient. This link, https://github.com/hadess/rtl8723bs/tree/master/patches might help with your patch problem (last 3 are edits of same patch.) Also try the --dry-run flag to first test a patch without changing your source set.
I've managed to run the 4.4.5 kernel on Archlinux for more than a day on my laptop that has the Bay Trail 2930 cpu without any freezes after adding intel_idle.max_cstate=1 AND commenting out tlp's CPU_SCALING_GOVERNOR_ON_AC and CPU_SCALING_GOVERNOR_ON_BAT options. Maybe you guys could try setting the cpu governor to the default "powersave"? It worked for me.
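To confirm which governor each core ended up with after changing the tlp settings, the cpufreq sysfs files can be read back. This is a small Python sketch, assuming the usual /sys/devices/system/cpu/cpuN/cpufreq/scaling_governor layout; the function name read_governors is hypothetical.

```python
import glob
import os


def read_governors(sysfs_cpu_root="/sys/devices/system/cpu"):
    """Map each CPU directory name (e.g. 'cpu0') to its current cpufreq governor."""
    governors = {}
    pattern = os.path.join(sysfs_cpu_root, "cpu[0-9]*", "cpufreq", "scaling_governor")
    for path in sorted(glob.glob(pattern)):
        # path looks like <root>/cpu0/cpufreq/scaling_governor
        cpu = path.split(os.sep)[-3]
        with open(path) as f:
            governors[cpu] = f.read().strip()
    return governors


if __name__ == "__main__":
    for cpu, governor in read_governors().items():
        print(cpu, governor)
```

If every core reports "powersave", the setup matches the one described above; the result is an empty mapping on systems that expose no cpufreq directory.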
Hi! One more update to my posts #114 and #130. Zotac's host OS LinuxMint 17.3 with Kernel 3.19.0 with intel_idle.max_cstate=2 is definitely holding up. It has gone through 2 weeks+ worth of stress testing by now and it works very well. The box is a bit warmer than originally (it has no fan, just passive cooling) but it's by no means within the critical range. VirtualBox 5.0.16 is also holding up. I have a FreeBSD server which has worked on it for over 2 weeks under heavy load. But another virtual machine, based on LinuxMint 17.3 running kernel 3.19.0 and xfce, has been a bit more iffy. I thought that the processor core number was an issue, and I still believe that there is a problem along those lines: when I assign 3 cores the failure rate definitely goes up. But with 4 cores I also had a freeze, although it was after several days of good working! Anyway, we are not out of the woods yet! But as for the host, everything is now very stable. My question is about the intel_idle.max_cstate value of 2 vs 1. Can anyone enlighten me about the difference? How much of the power savings functionality is being allowed with 2 vs 1? Thanks for any info. Hal
For the last question by Hal, difference with max_cstate=2 and max_cstate=1 with Pentium N3540 at least is occasional freezes (encounter usually a full lock-up within a week or two of use) vs no freezes at all with cstate=1. So far this max_cstate=1 is the only workaround that works for me. But I'm glad there is this one. Running kernel 4.4.3 now and my laptop is still usable and stable. Sorry this doesn't answer question about the power usage. It is only aimed at the stability aspect. There are quite many comments indicating initial success and then updating that it did crash after all. The freezes (for me) have been all the time very inconsistent. Sometimes the hangs come within minutes of boot and sometimes I could get more than a week of uptime without the kernel parameter. But with max_cstate=1 this system is "rock solid", no freezes at all. Agree on comments about the bug priority/severity, can't really use the 3.x series kernel due to some driver issues and with this cstate limiting I lose a lot on a battery life on laptop. This must affect quite a huge number of users currently and at least in my case it took months to find out that it's actually a kernel bug and not some other software issue.
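On the 2 vs 1 question: which idle states a given limit leaves available depends on the driver's table for the specific CPU, so the simplest check is to read back what the running kernel exposes. Here is a hedged Python sketch, assuming the cpuidle sysfs layout (stateN directories with name and latency files under /sys/devices/system/cpu/cpu0/cpuidle); the function name list_idle_states is made up for this example.

```python
import glob
import os
import re


def list_idle_states(cpuidle_dir="/sys/devices/system/cpu/cpu0/cpuidle"):
    """Return (index, name, exit latency in microseconds) for each cpuidle state."""
    states = []
    for state_dir in glob.glob(os.path.join(cpuidle_dir, "state[0-9]*")):
        index = int(re.search(r"(\d+)$", state_dir).group(1))
        with open(os.path.join(state_dir, "name")) as f:
            name = f.read().strip()
        with open(os.path.join(state_dir, "latency")) as f:
            latency_us = int(f.read().strip())
        states.append((index, name, latency_us))
    return sorted(states)


if __name__ == "__main__":
    for index, name, latency_us in list_idle_states():
        print(f"state{index}: {name} (exit latency {latency_us} us)")
```

Booting once with each value and comparing the two listings shows which deeper states the higher limit allows on that particular machine.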
Guys, please try the latest kernel (4.4 or 4.5) with the intel-microcode package installed (only the latest version!), for example from here: https://packages.debian.org/sid/intel-microcode . With C6/C7 enabled in BIOS, kernel 4.5.0 and the intel-microcode package (latest version from sid), I tested my PC for 1.5 hours and everything was fine.
Sorry, the intel_ucode does not fix freezing. Manjaro(Arch derivative) already loads the micro-code (same version) each boot. It took less than 10 minutes to freeze Manjaro15.10-x86_64 linux-4.5.0 without max_cstate limit (Asus T100-CHI). However, I intend to add this debian package to my Ubuntu install, as this is still a good idea. Thanks for the link.
Today I tested my PC with the C7 state enabled in BIOS, the latest 4.5 kernel from http://kernel.ubuntu.com/~kernel-ppa/mainline/v4.5-wily/ and the latest intel-microcode package 3.20151106.1 from https://packages.debian.org/sid/intel-microcode Everything is fine with YouTube videos, as well as in the idle state (previously I had freezes within 10-30 minutes in 100% of cases), so I think that even without patching the kernel we at least have a solution with additional firmware (btw, the microcode can also be downloaded from the Intel site as a binary, but in that case you have to copy it manually to the firmware directory on your system; for me the Debian package is much more convenient).
Same problem here... Hardware: Intel NUC6i5SYH (Intel Skylake i5-6260U) Software: Debian Stretch with Kernel 4.3.0-1-amd64 Linux freezes after a few hours, no KernelCrashDump (crashkernel=256M nmi_watchdog=1) available. Workaround (intel_idle.max_cstate=1) seems to help for the moment.
Has anyone contacted Intel about this issue yet? We may need more help finding this bug.
Same problems here. I have an Acer Travelmate b115m with celeron n2940 and I was becoming mad until I found this topic. Crashes every 20 min - 2 hours. No way of getting crashdump info. cstate=1 seems to mitigate the problem, but the computer gets hot and runs kind of slower.
Adding myself to the Baytrail freeze party! Lenovo MIIX 3 1030 powered by an Atom Z3735F, running Arch with a vanilla 4.5.0 kernel. Tried vad1m's method (though my CPU doesn't seem to have any microcode updates) just to be sure, got a hang. Hal, I ran a couple of PowerTop draw tests on my machine. Interesting results: cstate=1 : 3.40W cstate=2 : 3.11W normal : 3.13W Taken while idle in an Openbox session, single terminal window open.
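The three PowerTop readings above can be put in relative terms with a few lines of arithmetic. A short Python sketch follows; the figures are the ones reported in the comment, while the helper relative_change is just an illustrative name.

```python
# Idle power draw in watts, as reported above (PowerTop, idle Openbox session).
measurements = {"cstate=1": 3.40, "cstate=2": 3.11, "normal": 3.13}


def relative_change(value, baseline):
    """Return the change of value relative to baseline, in percent."""
    return (value - baseline) / baseline * 100.0


baseline = measurements["normal"]
for label, watts in measurements.items():
    change = relative_change(watts, baseline)
    print(f"{label}: {watts:.2f} W ({change:+.1f}% vs. normal)")
```

On these numbers cstate=1 draws about 0.27 W more than the unrestricted case at idle, while cstate=2 differs from it by 0.02 W, which may well be within the measurement noise of a single run.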
(In reply to László Kara from comment #140) > Did anyone got contacted Intel about this issue yet? We may need more help > finding this bug. This bug is already assigned to Len Brown from Intel, who is also mentioned as a maintainer of Intel Idle kernel code. Initial reporter of this bug is also an Intel employee. Anyway, I will try to raise this bug on linux-pm mailing list tomorrow, as it seems there is very little awareness about the fatality of this bug among others.
Meanwhile I'm still trying to get @intelsupport attention on twitter. Feel free to RT: https://twitter.com/zcecc22/status/710222385430077440
Meanwhile I'm trying to reach out to @intelsupport on Twitter to see if we can get an official update. Feel free to retweet: https://twitter.com/zcecc22/status/710222385430077440
Thank you Juha (#135) and Chris (#142). I've been running both my Zotac and Intel boxes with intel_idle.max_cstate=1 for the last few days. Both with the value of 2 and 1 I got pretty good results, I have not seen any failures on either host. I've also done some power monitoring on the AC line and it turns out that between 2 and 1 there is less than a watt of difference, although temperature wise it seems to be noticeably different (hotter with 1). So, for all intents and purposes my boxes are now working trouble free. VirtualBox system has also been very stable (as witnessed by my FreeBSD virtual server), but my Linux Guest OS is periodically failing on both systems. I am now convinced that the failure on the Linux Guest OS is due to some video driver issue as opposed to the processor bug related to intel_idle.max_cstate thing. But, all in all I am very disappointed both by Intel, and Linux (maybe I should say Ubuntu and LinuxMint) as in their race to release the latest and greatest they put out there half baked products. This is as if we are back to the beginning of times and we are troubleshooting windows 3.1 systems... I don't know if the real issue is Intel hardware or Linux software but either way my disappointment is such that I couldn't recommend anyone to switch to Linux as I have advocated for over a decade. Here, I only voiced my own troubles with my own two machines. My friends and relatives who have bought inexpensive Bay Trail notebooks and got Ubuntu or Mint based on my recommendation and who are pissed because their machines freeze in the middle of a Netflix movie are not sophisticated enough to come to places like this to figure out what they are up to. They will only say that windows xp worked better than any sh*t we have had in a long time (that certainly includes Linux), and I almost agree with them.
I have the same bug on my HP Pavilion x360 with a Pentium N3520 CPU (Bay Trail architecture), running Ubuntu 15.10 with kernel version 4.2.0-30. I'm using the private drivers for the microprocessor. I haven't tried "intel_idle.max_cstate=1", because that costs a lot of battery... I want to use GNU/Linux again, but I can't work normally with this bug :( PS: English is not my first language.
(In reply to Hal from comment #146) I could not agree more.
I worked with Linux (Red Hat) more than 10 years ago (compiling kernels, day-to-day work, etc.) and after some years without touching it I had the idea of just using it on my new laptop. I'm a photo professional and I wanted to give Darktable and the current Gimp a try. I'm so, so disappointed. Linux is still not usable after all these years; it's even less stable now. Completely buggy for a normal average user. I can't recommend it to any friend sharing my hardware or similar, since no one will know what to do. I'm back to Windows 10 and the computer runs fast and with no problems at all. I will keep an eye on this, but for me, a casual user very interested in Linux, this operating system is just a toy to spend time with. Just a toy.
Running as a desktop is one thing, running as a server (as I do) is another :-( The same kernel on an Asus eeeBox B202 (Intel Atom) has no problems. I don't know if there is a connection, but since I started using the workaround (on Intel Skylake), I no longer see messages like "systemd-sysv-generator overwriting existing symlink" in dmesg.
I share the frustration as I have been using Linux for over 15 years and this is maybe the most serious bug in all that time (for me), which ironically seems to get extremely limited attention by Intel and Kernel developers. It seems like kernel developers don't usually use low end Intel Atom like hardware and therefore don't have to deal with the problems themselves a lot. (Admittedly that's speculation, but I can imagine that most kernel developers(or even pro users) prefer to use high end hardware (let alone for compile times)) I think it's unfair to say that because of that one (severe!) bug Linux is not recommendable anymore, since not everybody will want to use a Baytrail system, but at this time you could say Linux on baytrail isn't really advisable until this bug is fixed. I lay the blame on Intel, as they should be stress testing their CPUs against the latest Linux kernel and pro-actively try to fix eventual bugs.
I've been using Linux since 1997, and this is the first time I've come across a bug so serious. Luckily I have another laptop with a different CPU and I can use that, but my Baytrail machine is collecting dust. Really stupid idea to release newer and newer kernels just for the sake of adding new shiny numbers when so many people are affected by a massive bug like this, which makes windows 3.1 look like a dream. Linus has totally lost his marbles. I'd like to try BSD, but it's a pain in the neck and hardware support lags behind. Sigh.
Sorry if I gave too negative an impression, but I just can't avoid being disappointed and frustrated. I really wanted to jump to open-source software. And I know this is made with the collaboration of a lot of volunteers; thanks to them. But shame on Linux stability and support.
Some of you doomsday complainers need to calm down a bit! Microsoft and Apple (and even Google in some cases) don't even have a way to open bugs on their operating systems and get transparent feedback with a way to track progress. And the computer gods know how many hours I spent trying to debug system freezes and crashes on Windows... Just right now my employer, which distributed hundreds of MacBook Pros to its users, is experiencing a bug with major battery drain for all of them. We have no idea what is causing it yet. This particular bug was very tricky to pin down and the community did great in reporting it and patiently trying things out until we found a workaround. We provided a very good direction for Intel to look for the cause and find a fix. As with other major bugs, I'm certain the major distributions will backport the fix to their older kernels that are still under bugfix and security support. Software generally is terrible, but Linux is better than most. We have a good system in place to fix bugs and keep making the OS better. I don't think there's anything particularly wrong with how Linux handles quality control. That said, I would very much appreciate it if someone from Intel steps in and comments, even briefly, on this bug report. Hint, hint. :)
(In reply to László Kara from comment #140) > Did anyone got contacted Intel about this issue yet? We may need more help > finding this bug. I did raise the issue on Twitter with @intelsupport; they told me to download and use the Intel graphics driver at http://intel.ly/24TDt9F . I don't think @intelsupport is that useful after all... Does anyone have a direct contact there?
@Vincent Frentzel Some Intel guys are on IRC, for example on irc.freenode.org #intel-gfx. You might want to bug them about this bug. I know some on this channel are aware of the issue, but some have also been dismissive about it, to be honest. I know one guy who has some patches there and needs people like us to try them.
(In reply to Tal Liron from comment #153) > Some of you doomsday complainers need to calm down a bit! ???? > This particular bug was very tricky to pin down and the community did great > in reporting it and patiently trying things out until we found a workaround. > We provided a very good direction for Intel to look for the cause and find a > fix. As with other major bugs, I'm certain the major distributions will > backport the fix to their older kernels that are still under bugfix and > security support. As this link shows the problem was discovered 15 MONTHS AGO and it is yet to be fixed! https://bugs.freedesktop.org/show_bug.cgi?id=88012 > Software generally is terrible, but Linux is better than most. We have a > good system in place to fix bugs and keep making the OS better. I don't > think there's anything particularly wrong with how Linux handles quality > control. "Quality control" you say? Since this problem was discovered 15 MONTHS AGO the linux kernel went through tons of iterations. It could have been at least provided with an automatic detection and cstate switching mechanism! > That said, I would very much appreciate it if someone from Intel steps in a > comments, even briefly, on this bug report. Hint, hint. :) Sensible suggestion! But check this out: https://communities.intel.com/thread/60984?start=0&tstart=0 It doesn't sound like Joe_Intel is much of a listener, is he? I am afraid next year this time we will still be talking about this same bug, because if it didn't get fixed since January 2015 I don't see how and why it will be ever fixed.
Created attachment 209571 [details] Arch Linux 4.1.18 LTS panic #1 (photo 1 of 3) Attaching 3 photos of kernel panics I've seen that may be related to this. Two photos are from Arch 4.1.18 LTS with intel_idle.max_cstate=1 (plus other kernel params, mostly borrowed from Clear Linux's boot line), and one is from Arch 4.4.3 using intel_idle.max_cstate=1. System is a console-only mini-PC running a Celeron N3150 (Braswell) with 8GB RAM and 250GB mSATA SSD. Trying to use the mini-PC as a custom network router. Attaching photos since these panics don't write to logs, and often don't show anything at all, halting the machine or causing a spontaneous reboot. I'm going to try setting up a netconsole to capture goings-on next. All three instances seem to choke with some invocation of start_secondary(), if I'm reading the call trace correctly. Hoping these instances may help devs track the core issue down. Please let me know if additional info is required or if I can test anything.
Created attachment 209581 [details] Arch Linux 4.1.18 LTS panic #2 (photo 2 of 3) Second kernel panic photo with Arch 4.1.18 LTS on Celeron N3150 (Braswell) system using max_cstate=1. Please see the first photo for more info.
Created attachment 209591 [details] Arch Linux 4.4.3 panic (photo 3 of 3) Third/last photo of panics, this time with Arch 4.4.3 on Celeron N3150 (Braswell) using max_cstate=1. Please see first photo for more information.
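Regarding the netconsole plan mentioned with the first photo: netconsole sends kernel messages as UDP datagrams to another machine, so the receiving side only needs something that listens on a UDP port and writes what arrives to disk. Below is a minimal Python sketch of such a receiver. The function names and the log file name are made up for illustration, and port 6666 is used on the assumption that the sender keeps netconsole's default target port; adjust it to whatever the netconsole= boot parameter specifies.

```python
import socket
from datetime import datetime


def format_record(payload: bytes, sender: str, when: datetime) -> str:
    """Turn one netconsole UDP datagram into a timestamped log line."""
    text = payload.decode("utf-8", errors="replace").rstrip("\n")
    return f"{when.isoformat(timespec='seconds')} {sender} {text}"


def listen(port: int = 6666, logfile: str = "netconsole.log") -> None:
    """Append every datagram received on `port` to `logfile` until interrupted.

    The port and file name are assumptions for this sketch, not values taken
    from the bug report.
    """
    with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as sock, \
            open(logfile, "a", buffering=1) as log:  # line-buffered: each record hits the file promptly
        sock.bind(("", port))
        while True:
            payload, (host, _port) = sock.recvfrom(65535)
            log.write(format_record(payload, host, datetime.now()) + "\n")


# To run the receiver on the logging machine, call listen() and stop it with Ctrl-C.
```

Because each record is written as soon as it arrives, the last lines before a freeze or spontaneous reboot should survive on the receiving machine even when nothing reaches the local logs.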
(In reply to Tal Liron from comment #153) > Software generally is terrible, but Linux is better than most. I think you're not quite entertaining the level of failure that's involved here. I totally appreciate your point. "Software is generally terrible". True. But from the perspective of users here, is Linux really better than most? My Mac at work has an uptime of 172 days. Every night I sleep it when I go home. I haven't had to reboot it in six months. BTW it's a laptop. This ThinkPad running Linux I have here crashes after 1-2 hours of sitting idle. I don't have to do anything. Turn it on; let it sit; crash! That's much worse than Windows 3.1, which was sensitive to rogue applications, but didn't simply splat into smithereens on its own. So let's not sentimentalize and pretend this isn't a total fuck-up. Where is Intel?
I use an N3150 (Braswell) too. I set "max_cstate=1" but still got a freeze. Looking at coretemp, I noticed that the temperature of cpu2 and cpu3 was a little high. So I came up with the idea of trying "maxcpus=2", and then it did not freeze. cpu0 and cpu1 are no problem, but bringing cpu2 or cpu3 online leads to a freeze. I hope this is useful.
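With this many workaround parameters in play (intel_idle.max_cstate, maxcpus, idle, tsc, intel_pstate), it helps to record which ones a given boot actually used before reporting a result. The following is a minimal Python sketch that picks them out of a kernel command line. The function names and the sample command line are made up for illustration, and the plain whitespace split assumes no quoted values containing spaces.

```python
# Parameters mentioned in this bug report; the list is illustrative, not exhaustive.
WORKAROUND_PARAMS = ("intel_idle.max_cstate", "maxcpus", "idle", "tsc", "intel_pstate")


def parse_cmdline(cmdline: str) -> dict:
    """Split a kernel command line into {name: value}; bare flags map to ''."""
    params = {}
    for token in cmdline.split():
        name, _, value = token.partition("=")
        params[name] = value
    return params


def active_workarounds(cmdline: str) -> dict:
    """Return only the workaround-related parameters that are present."""
    params = parse_cmdline(cmdline)
    return {name: params[name] for name in WORKAROUND_PARAMS if name in params}


# Made-up example command line, not taken from any machine in this thread.
sample = "BOOT_IMAGE=/vmlinuz root=/dev/sda1 ro quiet intel_idle.max_cstate=1 maxcpus=2"
print(active_workarounds(sample))
```

On a running system the string would come from /proc/cmdline, for example active_workarounds(open("/proc/cmdline").read()).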
What is really bad about this bug is the fact that it used to work until kernel 3.16. I bought my HTPC with Bay Trail because it supported Linux. And now I have unsupported hardware with a non-replaceable motherboard that freezes even with kernel 3.13 (there are more bugs than this one; I believe it is related to WiFi, which also often loses its connection and is very slow). This chipset is in fact not supported by Linux now. I bought it because I heard everywhere that Intel has the best Linux support. Now I want to show them the same sign of respect Linus Torvalds showed NVIDIA some years ago. I should have known, since Intel did the same to me with GMA500 graphics. I thought it was just a single mistake and they would not repeat it. But they did. :-(
I've started getting occasional freezes again with 4.4 and 4.5. That's even with cstate and tsc and a bunch of good but cast off freeze patches. So, I'll try fewer CPU cores. I can't risk having anything important on my system anyway, so who cares if nothing important takes longer. Next cycle, I'll just get AMD systems.
Well done on turning this into a forum thread. I wouldn't touch this bug with a 10-foot pole and I'm sure the Intel developers feel the same.
I got the same issues after 4-5 days on Ubuntu 15.10 with the 4.5 kernel and the intel driver. After that, it froze again within 5-6 hours; sorry for the false hope :) Now I'm trying the upcoming Ubuntu LTS release (16.04 nightly) with the following kernel: 4.4.0-13-generic #29-Ubuntu SMP Fri Mar 11 19:31:18 UTC 2016 x86_64 x86_64 x86_64 GNU/Linux
It seems stable right now after 2 days and 13 hours of uptime, with no cstate fix needed so far. The kernel is not the newest, but the mentioned microcode package and the Intel drivers are up to date, as are the va packages, so hw decoding works nicely on it. powertop shows the following idle stats:
Package           | Core              | CPU 0
                  |                   | C0 active    4,1%
                  |                   | POLL         0,0%    0,0 ms
                  | C1 (cc1)    0,0%  | C1-CHT       0,0%    0,1 ms
C6 (pc6)   81,9%  | C6 (cc6)   95,4%  | C6S-CHT      2,0%    2,9 ms
                  |                   | C7S-CHT     31,2%   86,9 ms
                  | Core              | CPU 1
                  |                   | C0 active    1,2%
                  |                   | POLL         0,0%    0,0 ms
                  | C1 (cc1)    0,2%  | C1-CHT       0,2%    0,8 ms
                  | C6 (cc6)   97,5%  | C6S-CHT      4,7%    3,2 ms
                  |                   | C7S-CHT     46,4%   37,6 ms
                  | Core              | CPU 2
                  |                   | C0 active    0,3%
                  |                   | POLL         0,0%    0,0 ms
                  | C1 (cc1)    0,0%  | C1-CHT       0,0%    0,4 ms
                  | C6 (cc6)   99,0%  | C6S-CHT      2,3%    4,2 ms
                  |                   | C7S-CHT     91,4%   89,3 ms
                  | Core              | CPU 3
                  |                   | C0 active    2,1%
                  |                   | POLL         0,0%    0,0 ms
                  | C1 (cc1)    0,0%  | C1-CHT       0,0%    0,0 ms
                  | C6 (cc6)   94,1%  | C6S-CHT      0,3%    2,1 ms
                  |                   | C7S-CHT     85,7%   39,4 ms
                  | GPU               |
                  | Powered On  0,2%  |
                  | RC6        99,8%  |
                  | RC6p        0,0%  |
                  | RC6pp       0,0%  |
If I understand it correctly, the CPU cores are mostly in the C6/C7 states. One more thing I did with this config: sudo powertop --auto-tune
With that command I set all items under Tunables to Good. I will write about my experience after a few more days...
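For comparing idle-state usage between machines without powertop, the kernel exposes cumulative per-state residency in sysfs under /sys/devices/system/cpu/cpuN/cpuidle/stateM/ (the name file and the time file, the latter in microseconds). Here is a minimal Python sketch that turns those counters into percentages. The function names are made up for illustration, and the result is the share of idle time since boot, so it will not match powertop's interval-based figures exactly.

```python
from pathlib import Path


def read_cpuidle(cpu_dir: Path) -> dict:
    """Map idle-state name -> cumulative residency in microseconds for one CPU.

    cpu_dir is e.g. /sys/devices/system/cpu/cpu0; each cpuidle/stateN directory
    holds a 'name' file and a 'time' file.
    """
    states = {}
    for state in sorted((cpu_dir / "cpuidle").glob("state[0-9]*")):
        name = (state / "name").read_text().strip()
        states[name] = int((state / "time").read_text())
    return states


def residency_share(states: dict) -> dict:
    """Convert cumulative times into percentages of total idle time."""
    total = sum(states.values())
    if total == 0:
        return {name: 0.0 for name in states}
    return {name: 100.0 * t / total for name, t in states.items()}


def report(sys_cpu: Path = Path("/sys/devices/system/cpu")) -> None:
    """Print the per-CPU share of idle time spent in each state."""
    for cpu_dir in sorted(sys_cpu.glob("cpu[0-9]*")):
        if (cpu_dir / "cpuidle").is_dir():
            shares = residency_share(read_cpuidle(cpu_dir))
            line = ", ".join(f"{n} {p:.1f}%" for n, p in shares.items())
            print(f"{cpu_dir.name}: {line}")
```

Calling report() on the affected machine prints one line per CPU. With intel_idle.max_cstate=1 in effect, the deeper states (C6S-CHT and C7S-CHT on the system above) should stop accumulating time.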
I've been scavenging for more information about this intel_idle software module and I came across this interesting slide presentation from Len Brown (the Intel engineer in charge of the power saving scheme, if I understood right). It's dated October 2015 and was apparently used at his LinuxCon Dublin talk. Many pages refer to troubles with the idle code, how they track it, measurements, etc. Several slides, like #31, 32 and 33 under "Things may go wrong", mention Linux kernel versions which are buggy and not yet fixed. http://events.linuxfoundation.org/sites/events/files/slides/Brown-Linux-Suspend-at-Speed-of-Light-LC-EU-2015.pdf
Hi, all, I think we have a T100 in the lab, I'll have a try. BTW, could someone please tell me is it reproduced easily by playing videos?
(In reply to fao66134 from comment #161) > I use N3150(braswell) too. > I was set to "max_cstate=1". but, got freeze. > > Looking at the coretemp, temperature of cpu2 and cpu3 was noticed that a > little high. > So, i come up with to try "maxcpus=2". And then, it did not freeze. > > cpu0 and cpu1 is no problem. but, cpu2 or cpu3 online to got freeze. > > If this thing is useful, I'm happy. This seems interesting... I think my N3150's issues tend to be with CPU2 most of the time too. (I'll recheck the panics I posted last night.) I wondered about possible CPU heat issues as well as I'm using a fanless aluminum case, but haven't been watching it closely. I'll start doing that. Thanks for the maxcpus=2 workaround. I may also try that. Though I'd prefer to use all 4 cores :) Possible related note: I tried using the latest OPNSense FreeBSD 10.2-based router/firewall distro, and ran into a CPU panic there too after a few hours. I didn't get a photo of it, but if I try it again I'll be sure to capture it.
(In reply to Chen Yu from comment #167) > BTW, could someone please tell me is it reproduced easily by playing videos? Yes, it is. I'm using Firefox with HTML5 videos on YouTube to test for this bug. I always had at least one freeze within 4 hours when not restricting max_cstate.
(In reply to jbMacAZ from comment #163) > I've started getting occasional freezes again with 4.4 and 4.5. That's even > with cstate and tsc and a bunch of good but cast off freeze patches. So, > I'll try fewer CPU cores. I can't risk having anything important on my > system anyway, so who cares if nothing important takes longer. Have you tried just using max_cstate without the tsc parameters and the patches? When I added tsc params to my boot line it seemed to cause more instability/chances for halts and panics. That makes me wonder if tsc is somehow counterproductive to max_cstate.
(In reply to Chen Yu from comment #167) > Hi, all, I think we have a T100 in the lab, I'll have a try. BTW, could > someone please tell me is it reproduced easily by playing videos? I have an old SSD that I use to move things around and I have LinuxMint 17.3 stock kernel 3.19.0 on it. I can use it through SATA or with USB2.0 or USB 3.0 adaptors. I just plugged it into my CI320 via SATA for a quick test. Freezing was as quick as moving the firefox window. There is no special software on this SSD, no Virtualbox, no wine emulator, nothing. Pure stock linuxmint. After rebooting, I tried to plug a USB flash disk, before the directory was read into thunar again the whole machine froze. So, it is indeed quick to repeat the failure.
So I just got a reply from the linux-pm kernel mailing list. People there are aware of this bug, but I've been told that it is quite hard to find the root cause. I've been asked to check if the kernel parameter idle=nomwait makes the problems go away. Obviously, CPUs might get warmer when trying this. It is just a step to pinpoint the source. Can you test this parameter and post results? Especially if you are one of those not lucky with the intel_idle.max_cstate=1 parameter as a workaround.
(In reply to John A. from comment #170) > (In reply to jbMacAZ from comment #163) > > I've started getting occasional freezes again with 4.4 and 4.5. ... > > Have you tried just using max_cstate without the tsc parameters and the > patches? When I added tsc params to my boot line it seemed to cause more > instability/chances for halts and panics. That makes me wonder if tsc is > somehow counterproductive to max_cstate. tsc is recent, I ran 4.2.x for months with relatively little trouble with just cstate and necessary patches. Frankly, 4.3.x seems to run the same with or without tsc as long as cstate is set. My gut is that there is a new instability in 4.4 and 4.5. I can't jettison all my old patches because my T100 will have sdhci and prmb issues and other bits of T100 hardware will stop working. The lack of a crash log could be partially addressed by allocating a second dmesg buffer and alternating between them at boot. The prior dmesg log would be preserved at next startup. This should probably be a new .config option. Alternatively, just save the last few K of the old dmesg before initializing dmesg at boot time.
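The "save the last few K of the old log" idea can be approximated in userspace when the kernel log is already being written to a file somewhere (for example by a remote logger). The Python sketch below is only an illustration of that rotation step under stated assumptions: the function names, file names and the 8 KiB limit are hypothetical, and it is not the in-kernel .config option proposed above.

```python
import os


def keep_tail(data: bytes, limit: int = 8192) -> bytes:
    """Return at most the last `limit` bytes, starting at a line boundary."""
    if len(data) <= limit:
        return data
    tail = data[-limit:]
    if data[-limit - 1:-limit] == b"\n":
        return tail  # the cut already falls on a line boundary
    newline = tail.find(b"\n")
    # Drop the partial first line so the saved log starts cleanly.
    return tail[newline + 1:] if newline != -1 else tail


def rotate(current: str, previous: str, limit: int = 8192) -> None:
    """At boot: preserve the tail of the last session's log, then start fresh."""
    if os.path.exists(current):
        with open(current, "rb") as src:
            saved = keep_tail(src.read(), limit)
        with open(previous, "wb") as dst:
            dst.write(saved)
    open(current, "wb").close()  # truncate for the new session
```

Run once early at startup, rotate("current.log", "previous.log") leaves the end of the previous session's log available after a freeze and reboot, which is the part most likely to show what happened last.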
(In reply to John A. from comment #168) > Possible related note: I tried using the latest OPNSense FreeBSD 10.2-based > router/firewall distro, and ran into a CPU panic there too after a few > hours. I didn't get a photo of it, but if I try it again I'll be sure to > capture it. That might be an OPNsense issue, as they seem to have introduced lots of regressions when they tried to rewrite some of the code (or tried to clean it up). When I tried to run it on an Intel mobo a few weeks back it just kept crashing. On the same hardware pfSense ran without problems. Any particular reason you would favor OPNsense over pfSense? One little consolation I have about Bay Trail and Braswell is that FreeBSD (and PC-BSD) and pfSense both work flawlessly on the same hardware where I experience the Linux freezing circus.
(In reply to Michal Feix from comment #172) > [...] > I've been asked to check if kernel parameter idle=nomwait is making the > problems go away. Obviously, CPU's might get warmer when trying this. It is > just a step to pinpoint the source. > > Can you test this parameter and post results? Especially if you are one of > those not lucky with intel_idle.max_cstate=1 parameter as a workaround. Using vanilla kernel 4.5.0 I tried to boot with the options tsc=reliable idle=nomwait The system crashed after the "usual" amount of time (about an hour surfing the web). I did not set cstate or anything else.
(In reply to John A. from comment #168) > This seems interesting... I think my N3150's issues tend to be with CPU2 > most of the time too. (I'll recheck the panics I posted last night.) I > wondered about possible CPU heat issues as well as I'm using a fanless > aluminum case, but haven't been watching it closely. I'll start doing that. > > Thanks for the maxcpus=2 workaround. I may also try that. Though I'd prefer > to use all 4 cores :) I found a new way. "echo 0 > /sys/kernel/debug/x86/tlb_single_page_flush_ceiling" I am using all cores and have not had a freeze with this setting yet. In my case, I still got a freeze even with intel_idle, intel_pstate and i915 disabled. So I compared the other CPU configuration changes between kernel 3.16 and kernel 4.5 and noticed the change in the TLB flush setting. (The intel_tlb_flushall_shift_set function was removed from "arch/x86/kernel/cpu/intel.c", and tlb_single_page_flush_ceiling was added to "arch/x86/mm/tlb.c".)
(In reply to fao66134 from comment #161) > I use N3150(braswell) too. > I was set to "max_cstate=1". but, got freeze. > > Looking at the coretemp, temperature of cpu2 and cpu3 was noticed that a > little high. > So, i come up with to try "maxcpus=2". And then, it did not freeze. > > cpu0 and cpu1 is no problem. but, cpu2 or cpu3 online to got freeze. > > If this thing is useful, I'm happy. Thanks for another workaround :) Running glxgears and x264 video on an Intel® Atom™ Z3735F processor (4 cores) with vanilla kernel v4.5.0:
maxcpus=1: no freeze (running 90 minutes)
maxcpus=2: no freeze (running 90 minutes)
maxcpus=3: no freeze (running over 4 hours)
no command line parameters: freeze occurred after 5 minutes (as usual).
(In reply to fao66134 from comment #176) > I found a new way. > > "echo 0 > /sys/kernel/debug/x86/tlb_single_page_flush_ceiling" > > I using full core, but have not yet acquired the frozen from this setting. Sorry, I got a freeze now. The running time is longer, but it is not a perfect fix.
(In reply to Michal Feix from comment #172) > I've been asked to check if kernel parameter idle=nomwait is making the > problems go away. Obviously, CPU's might get warmer when trying this. It is > just a step to pinpoint the source. > > Can you test this parameter and post results? Especially if you are one of > those not lucky with intel_idle.max_cstate=1 parameter as a workaround. I also tried it, but the system froze within 5 minutes. This is about the same as when no parameters are specified.
(In reply to fao66134 from comment #179) > (In reply to Michal Feix from comment #172) > > I've been asked to check if kernel parameter idle=nomwait is making the > > problems go away. Obviously, CPU's might get warmer when trying this. It > is > > just a step to pinpoint the source. > > > > Can you test this parameter and post results? Especially if you are one of > > those not lucky with intel_idle.max_cstate=1 parameter as a workaround. > > I also tried, but was frozen in 5 minutes. > This is about the same as when you do not specify anything. So, setting idle=nomwait is not helping you. Fine. If intel_idle.max_cstate=1 is a working solution for you, could you please try with intel_idle.max_cstate=0 and post back result?
I want to give my latest feedback on this issue to this forum thread :-D My N2940 Bay Trail system has been running stable on all 4 cores for 2 days now. It runs the latest stable kernel 4.5.0 from the git repo on Gentoo Linux, with the latest microcode firmware from Intel, microcode-20151106.tgz.
uname -a: Linux shiva 4.5.0 #20 SMP Tue Mar 15 19:07:39 ART 2016 x86_64 Intel(R) Celeron(R) CPU N2940 @ 1.83GHz GenuineIntel GNU/Linux
kernel parameters: i915.enable_rc6=1 tsc=reliable clocksource=tsc
I don't know if it is the kernel or the microcode that makes this system run stable, and of course I hope it stays stable. Playing videos, listening to music, compiling packages: no freezes yet. I hope it remains like this. And please, this is a bug-report thread, not a discussion platform.
"nomwait" may be device dependent. I ran it overnight (tsc=reliable and idle=nomwait w/o cstate) and it was still running after 10 hours. I saw other results here - restarted without tsc and my system has already run five times longer than no arguments.) I'll keep testing. I'll need to repeat with 4.4+ since the newest kernels are less stable than 4.3 on my system. Passive cooled Atom baytrail Z3775: cstate=1 runs nearly normal temp, cstate=0 runs slightly warmer. "nomwait" runs about the same temp as cstate=1. Asus T100-CHI - Ubuntu15.10-i386, kernel-4.3.6, microcode, T100 patches, hunter patches, legacy-turbo patch. Normally freezes well under 10 minutes without kernel arguments.
(In reply to julio.borreguero@gmail.com from comment #181) > i want to give my latest feedback on this issue to this forum thread :-D > N2940 Baytrail System running stable on all 4 cores for 2 days now. > Running latest stable kernel 4.5.0 from git repo on gentoo linux. > With latest microcode firmware from intel microcode-20151106.tgz > > kernel parameters: > i915.enable_rc6=1 tsc=reliable clocksource=tsc > > i dont know if it is the kernel or the microcode that makes this system run > stable, and of course i hope it stays stable. > Playing videos, listening to music, compiling packages no freezes yet. Microcode update 20151106 only updates the 2MB cache version of N2940. If you have 1MB cache variant of N2940, the microcode update was not the cure. If you can test the 4.5 kernel version without any kernel parameters, it would help to understand whether it has been fixed in the meantime.
> > Microcode update 20151106 only updates the 2MB cache version of N2940. If > you have 1MB cache variant of N2940, the microcode update was not the cure. > > If you can test the 4.5 kernel version without any kernel parameters, it > would help to understand whether it has been fixed in the meantime. ok, thank you for that information. And yes, the cache is only 1MB but i guess you know that anyway from the attachment i posted at some earlier stage with system-specific info. i just rebooted my machine, this time without extra kernel parameters. my guess is that the kernel has been fixed for my architecture at least, as i was running those tsc-parameters in my last test (4.5.0-rc3) and that definitely froze. i will be posting a hardware freeze as soon as it happens, otherwise i will let everyone know in 2-3 days that the system is still running stable. hopefully
I certainly don't want to destroy anyone's hopes, but I've had instances where my notebook ran stable for up to two weeks and then froze. Doesn't mean it has to happen, I'm just saying the absence of crashes overnight, within 10 hours, or even in 3-4 days is not a sure sign that the issue has been fixed.
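To put a rough number on this caution: if freezes are assumed to arrive at a constant random rate (an exponential model, which is an assumption made here for illustration and not something established in this thread), the chance of seeing no freeze during an observation window of t hours is exp(-t / T), where T is the mean time between freezes. A short Python sketch, with a hypothetical function name:

```python
import math


def p_no_freeze(hours_observed: float, mean_hours_between_freezes: float) -> float:
    """Chance of zero freezes in the window, assuming a constant random freeze rate."""
    return math.exp(-hours_observed / mean_hours_between_freezes)


# Illustrative case: a machine that freezes on average once every two weeks (336 h).
for label, hours in [("overnight (10 h)", 10), ("3 days", 72), ("2 weeks", 336)]:
    print(f"{label}: {p_no_freeze(hours, 336):.0%} chance of no freeze")
```

Under that assumption, a machine that still freezes about once every two weeks would get through a clean 3-day test roughly four times out of five, so a few quiet days say very little on their own.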
Among the posts there are several mentioning that kernel 3.16 is freeze-free without any additional parameter like cstate or tsc. I am curious to know if those are distro-provided versions or custom-compiled ones. Today I ran some tests with Linux Mint 17.2, which comes with kernel 3.16.0 as its standard and recommended kernel. On a Zotac Nano CI320 (N2930) it worked for about 4 hours, then froze. I actually used it for only 35 minutes; the machine was on but simply idling for the remaining 3.5 hours. I know precisely when it froze, as the frozen clock at the bottom of the screen was visible. Is there any consensus on a kernel version that reliably works on Bay Trail?
Tested with 4.5.0 and glxgears on T100, without any boot params. So far we have not reproduced this problem, although BzukTuk told me this method should freeze the system within 1-10 minutes. Anyway, I'll keep up this stress testing.
(In reply to dertobi from comment #185) > I certainly don't want to destroy anyone's hopes, but I've had instances > where my notebook ran stable for up to two weeks and then froze. <snip> I'm just assessing the workarounds, while waiting for real fixes. My nomwait solo test did freeze after about 4 hours - but then resumed by itself about 2 hours later without bluetooth and wifi working. Rebooting restored communications. (In reply to Chen Yu from comment #187) > Tested with 4.5.0 and glxgears on T100, without any boot params, so far we > have not reproduce this problem yet, as BzukTuk told me this method should > freeze the system within 1-10 minutes. Anyway I'll keep up this stress > testing. There are several T100 models, which vary in how fast they freeze. The T100T* models are more stable than the T100CHI. Also, the very first freeze often takes longer than subsequent freezes.
(In reply to Chen Yu from comment #187) > Tested with 4.5.0 and glxgears on T100, without any boot params, so far we > have not reproduce this problem yet, as BzukTuk told me this method should > freeze the system within 1-10 minutes. Anyway I'll keep up this stress > testing. I think kernel 4.5.0 has a fix. I have been running it for several days now, on an N2940, with no freezing, and since yesterday without any kernel boot parameters. Anything prior to this kernel (any 4.4 kernel if you want to have a go) freezes for sure. Also, there is a difference between N2940 and N2930. For me, on an N2940, intel_idle.max_cstate never worked as a workaround, but it works on N2930 (deduced from posts in this thread). I know it is still too early to say that 4.5.0 is fixed, but to me it certainly looks that way. Freezes on my system always occurred within 12h.
(In reply to julio.borreguero@gmail.com from comment #189) > > i think kernel 4.5.0 has a fix. > I am running it for several days now, but on a N2940. No freezing. > Since yesterday without any kernel boot parameters. > Anything prior to this kernel (any 4.4 kernel if you want to have a go) > freezes for sure. > My lucky version for both N2930 and N3050 seems to be 4.4.6. 4.5.0 has brought up unrelated instabilities (mostly with VGA and wireless) on my systems, so I can't even thoroughly test it. 4.4.6 on the other hand has been pretty good without cstates or tsc, up to a point (a much, much longer time before freezing). Interestingly, on my Zotac box 4.4.6 spends much more time in the C1 state than in C6 or C7 according to powertop. That said, the behavior of these different versions is quite wild. I tried to build a chart of hardware (2 separate computers, one with N3050, the other with N2930) vs kernels (I have tested 3.16.0, 3.19.0, 4.0.0, 4.3.0, 4.4.0, 4.4.4, 4.4.5, 4.4.6, 4.5.0) and captured the freeze timing and conditions (like with or without video loss at freeze time), and the chart is full of inconsistencies. Repeat tests yield contradictory results most of the time. But all in all, 4.4.6 looks the best, with the longest time before freezing. 4.3.0 seems to be the worst. With cstates=2 freezing is almost nonexistent (it only happened once in more than 40 sessions). With cstates=1 I never got a freeze in any hardware/kernel combination, with some of the tests lasting more than 2 weeks. I never used a patch nor the tsc parameter.
I have an N3540 system that freezes at most a couple of times a month without any arguments; the kernel version doesn't seem to matter. .max_cstate {0,1} stabilized it. Looking at the recent posts, the N-series appears to be the processor benefiting most from the new suggestions. But the more smoke that gets cleared, the sooner the rest of the problems can be found. On my Z3775 system (T100CHI), kernel 4.5.0 without arguments didn't last 2 minutes before freezing. With idle=nomwait it ran 2 hours before the time display froze (frozen seconds), while the mouse cursor still moved. Keyboard keys or mouse clicks were accepted about once every 90 seconds. Next, maxcpus=2 and idle=nomwait produced a block of "serial8250: too much work for irq191" errors in dmesg. Raising maxcpus to 3 got rid of them. maxcpus= {2,3} yielded no obvious degradation when just browsing, etc., so I'll leave this running... tsc may be destabilizing for some systems like mine.
My Dell laptop has an N3540. It freezes on both Xubuntu 15.10 and 16.04 (still a beta version) within 30 minutes, especially when I use the Chrome browser, but it works well with intel_idle.max_cstate=1 on both versions. Kernel 4.4.6 (linux-headers-4.4.6-040406-generic_4.4.6-040406.201603161231_amd64.deb), which I downloaded from http://kernel.ubuntu.com/~kernel-ppa/mainline , does not work without the cstate flag. So I downloaded the newer 4.5.0-rc7 (linux-headers-4.5.0-040500rc7_4.5.0-040500rc7.201603061830_all.deb) and it has been working well without the cstate flag for half a day. I will update the status in one or two days.
update: system freeze on 4.5.0 kernel on N2940 no kernel parameters. it took many hours (~40) but finally it happened. back to kernel 4.1.12....
I gave it a try with Ubuntu 15.10 and kernel 4.5. I also installed the Intel microdrivers. I was able to play a full 50 min video, but then the computer froze on the desktop without any CPU/GPU-intensive operation (that I'm aware of).
(In reply to cororok from comment #192) > My dell laptop has N3540. It freezes on both xubuntu 15.10 and 16.04(still > beta version) in 30m especially when I use chrome browser. > but it works well with intel_idle.max_cstate=1 on both version. > > kernel > 4.4.6(linux-headers-4.4.6-040406-generic_4.4.6-040406.201603161231_amd64.deb > ) that I download from http://kernel.ubuntu.com/~kernel-ppa/mainline does > not work without cstate flag. > > So I downloaded newer 4.5.0-rc7 > (linux-headers-4.5.0-040500rc7_4.5.0-040500rc7.201603061830_all.deb) and it > is working well without cstate flag for half day. I will update the status > after one or two days later. 4.5.0-rc7, even though it is better than the others, also froze.
A combo bandaid for the Z3775 is idle=nomwait tsc=reliable maxcpus=3. Test still running at 24 hours. Better than 2 minutes without any kernel arguments... (Kernel 4.5.0.)
(In reply to Michal Feix from comment #180) > So, setting idle=nomwait is not helping you. Fine. If > intel_idle.max_cstate=1 is a working solution for you, could you please try > with intel_idle.max_cstate=0 and post back result? Only maxcpus avoids the freeze. My results are as follows.
running time (1st, 2nd)   # parameters
30m, 1h30m                # none
10m, 40m                  # idle=nomwait
1h, 2h                    # intel_idle.max_cstate=0
2h, 1h                    # intel_idle.max_cstate=0 idle=nomwait
30m, 1h                   # intel_idle.max_cstate=1
N3150 Gentoo drm-intel-nightly_kernel-4.5.0+
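Time-to-freeze figures like the ones in the comment above are easier to compare once they are converted to minutes and averaged per parameter set. The Python sketch below does that for the five rows reported there. The function names are made up for illustration, and with only two runs per configuration the ranking is indicative at best.

```python
import re


def to_minutes(duration: str) -> int:
    """Convert strings like '1h30m', '40m' or '2h' to minutes."""
    match = re.fullmatch(r"(?:(\d+)h)?(?:(\d+)m)?", duration.strip())
    if not match or not any(match.groups()):
        raise ValueError(f"unrecognised duration: {duration!r}")
    hours, minutes = (int(g) if g else 0 for g in match.groups())
    return 60 * hours + minutes


# (parameters, [time to freeze in run 1, run 2]) as reported in the comment above
RESULTS = [
    ("none", ["30m", "1h30m"]),
    ("idle=nomwait", ["10m", "40m"]),
    ("intel_idle.max_cstate=0", ["1h", "2h"]),
    ("intel_idle.max_cstate=0 idle=nomwait", ["2h", "1h"]),
    ("intel_idle.max_cstate=1", ["30m", "1h"]),
]


def mean_minutes(runs):
    """Average time to freeze, in minutes, over the given runs."""
    return sum(to_minutes(r) for r in runs) / len(runs)


# Longest average time to freeze first.
ranked = sorted(RESULTS, key=lambda item: mean_minutes(item[1]), reverse=True)
for params, runs in ranked:
    print(f"{mean_minutes(runs):6.1f} min  {params}")
```

The spread between the first and second run of the same configuration is as large as the spread between configurations, which fits the observation elsewhere in this thread that repeat tests often contradict each other.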
Interesting findings today:
1) Came across a 2-year-old system with a Bay Trail N2807 processor. Upgraded Ubuntu on it to kernel 4.4.6 with no parameter. It has been running for more than 12 hours without a glitch. What gives?! So, not all Bay Trail processors are afflicted by this problem?
2) I was given an Intel NUC box for testing which turned out to be identical to mine, with the same N3050. I duplicated my drive with dd and removed intel_idle.max_cstate=1. It kept working all day without missing a beat! When I remove cstate from my own machine, it freezes within the hour. So bizarre...
3) As I was digging into the VirtualBox log files after my guest OS froze once again on my Zotac, I noticed that there is nothing noteworthy until the moment of failure except for the message "28:28:41.009623 VMMDev: vmmDevHeartbeatFlatlinedTimer: Guest seems to be unresponsive. Last heartbeat received 4 seconds ago". Then, when I shut down the guest OS window, VirtualBox adds a very extensive report about the state of the machine at the time it froze or became unresponsive. So, this might be a good tool to help investigate how the failure is taking place. My thinking is that this OS freezing problem occurs in the same way whether it is on a host (physical) machine or a guest (virtual) machine. It has been found that with intel_idle.max_cstate=1 or alternative special kernel parameters we can get the kernel to behave differently and avoid failure. But that doesn't work with a virtual machine, and whatever is causing the failure is making the virtual machine fail unrestrained. But that also indicates that the kernel software is falling apart, not the microprocessor (or the microprocessor's microcode); otherwise, when the virtual machine fails, the host should also fail. The dump in the VirtualBox log file on closing is very rich in info; unfortunately it's way above my knowledge base. So, if anyone would be interested in analyzing it I could furnish it, although I think it is very easy to make the failure occur in VirtualBox the same as on the host.
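For anyone who wants to locate the moment of failure in such a log before reading the long state dump, the heartbeat message quoted above can be searched for mechanically. This is a minimal Python sketch: the pattern is built only from the single log line quoted in this comment, so other VirtualBox versions may word or format the message differently, and the function name is made up for illustration.

```python
import re

# Built from the one quoted log line; treat the exact wording as an assumption.
FLATLINE = re.compile(
    r"^(?P<uptime>\d+:\d{2}:\d{2}\.\d+)\s+VMMDev: vmmDevHeartbeatFlatlinedTimer: "
    r".*Last heartbeat received (?P<seconds>\d+) seconds ago"
)


def find_flatlines(lines):
    """Yield (uptime, seconds_since_last_heartbeat) for each flatline message."""
    for line in lines:
        match = FLATLINE.match(line)
        if match:
            yield match.group("uptime"), int(match.group("seconds"))


# The log line quoted in the comment above, used as sample input.
quoted = ("28:28:41.009623 VMMDev: vmmDevHeartbeatFlatlinedTimer: Guest seems to be "
          "unresponsive. Last heartbeat received 4 seconds ago")
for uptime, seconds in find_flatlines([quoted]):
    print(f"guest unresponsive at VM uptime {uptime} ({seconds} s since last heartbeat)")
```

Passing an open log file instead of the sample list works the same way, since the function only iterates over lines; the uptime it reports marks where the state dump that follows becomes relevant.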
Does anybody know why these patches haven't been upstreamed? https://github.com/hadess/rtl8723bs/tree/master/patches_4.5
I get that we shouldn't turn this bug report into a forum discussion, but what I just don't understand is why this bug isn't considered absolutely critical. Personally it doesn't affect me that much -- work gives me a very nice macbook pro -- but this bug gives the lie to decades of making fun of Windows BSODs. A system that can't stay up for 30 minutes? For millions and millions of users -- all on the lower-end of the performance spectrum? For kernels that go back 2 years? It's a massive pie in the face. I've added the cstate kernel parameter. The machine is more stable but battery life has gone to hell. Such is Linux, today. On Mon, Mar 21, 2016 at 11:47 PM, <bugzilla-daemon@bugzilla.kernel.org> wrote: > https://bugzilla.kernel.org/show_bug.cgi?id=109051 > > --- Comment #200 from RussianNeuroMancer <russianneuromancer@ya.ru> --- > Anybody know why this patches doesn't upstreamed? > > https://github.com/hadess/rtl8723bs/tree/master/patches_4.5
https://cgit.freedesktop.org/~miku/drm-intel/log/?h=rc6_test 3 _tentative_ patches on that tree. Please try.
What desktop are you all running? For me it's gnome-shell. Maybe there's some connection between software, hardware and that freeze that we've been missing so far.
(In reply to Mika Kuoppala from comment #202) > https://cgit.freedesktop.org/~miku/drm-intel/log/?h=rc6_test > > 3 _tentative_ patches on that tree. Please try. I am running 4.5.0 with the 3 tentative patches from Mika ;-) Already stress-testing for about 5h now. I will post any results here.
(In reply to Mika Kuoppala from comment #202) > https://cgit.freedesktop.org/~miku/drm-intel/log/?h=rc6_test > > 3 _tentative_ patches on that tree. Please try. I got the hang after 7 and a half hours of letting my N2940 run youtube and a twitch stream.
(In reply to dertobi from comment #203) > What desktop are you all running? For me it's gnome-shell. Maybe there's > some connection between software, hardware and that freeze that we've been > missing so far. I don't think so. I've tried with Cinnamon and Gnome 3.
(In reply to jds from comment #206) > (In reply to dertobi from comment #203) > > What desktop are you all running? For me it's gnome-shell. Maybe there's > > some connection between software, hardware and that freeze that we've been > > missing so far. > > I don't think so. I've tried with Cinnamon and Gnome 3. Cinnamon is a Gnome 3 fork though.
(In reply to dertobi from comment #207) > (In reply to jds from comment #206) > > (In reply to dertobi from comment #203) > > > What desktop are you all running? For me it's gnome-shell. Maybe there's > > > some connection between software, hardware and that freeze that we've > been > > > missing so far. > > > > I don't think so. I've tried with Cinnamon and Gnome 3. > > Cinnamon is a Gnome 3 fork though. Ah, you're right. I did try MATE too briefly, which I think is a Gnome 2 fork, and it crashed -- at the time I suspected Chrome/Chromium as the issue, so I didn't connect it with this bug.
Update: Grabbed 4.5.0 for testing on the affected system (Acer B-115M, N3540). This time it was downloaded from the openSUSE repos, exact version: Linux cardhu 4.5.0-58.gb2c9ae5-default #1 SMP PREEMPT Wed Mar 16 17:30:21 UTC 2016 (b2c9ae5) x86_64 x86_64 x86_64 GNU/Linux It has been running without a freeze for a week now in my normal use, and I have been stress-testing it since this morning with HD videos. I'll report back if it freezes. Someone asked about the desktop: I use Xfce (with some GNOME services running, though). I have verified the freezes with two distributions, Ubuntu and openSUSE.
It definitely is a kernel bug. Read the old posts in this thread. I have verified this bug on 2 distributions and am running Gentoo now, where everything is compiled. I am running 4.5.0 from the GitHub kernel stable repo plus Mika's 3 patches, for the second day under full load, and no freeze yet.
I think the problem happens when the C-state changes. If that is right, then testing it needs a workload that moves the CPU load up and down, so that the CPU can reach the situation where it gets stuck. In my case it happened when I used the Chrome browser on Xubuntu, so I guessed it is related to the GPU, but I don't have any knowledge about that.
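A load that goes up and down, as described in the comment above, could be produced with something like the following Python sketch. It only alternates busy-spinning and sleeping on one core; whether that actually provokes the freeze is an untested assumption, and the function name and durations are hypothetical.

```python
import time


def oscillate_load(cycles, busy_s, idle_s):
    """Alternate busy-spinning and sleeping.

    Returns the number of seconds actually spent spinning in each cycle.
    The sleep phase is where the kernel gets a chance to enter an idle
    state; which C-state it picks is decided by the kernel, not by this code.
    """
    busy_times = []
    for _ in range(cycles):
        start = time.monotonic()
        while time.monotonic() - start < busy_s:
            pass  # burn CPU until the busy interval has elapsed
        busy_times.append(time.monotonic() - start)
        time.sleep(idle_s)  # let the core go idle
    return busy_times
```

Something like `oscillate_load(100000, 0.05, 0.05)` would keep one core flipping between busy and idle for a few hours; one process per core could be started to cover all cores.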
That's what I thought too at first -- and that sent me scrambling to look at Chrome flags etc. But then I observed two different systems lock up even when no browser at all was running. (In reply to cororok from comment #211) > I think the problem happens when C-state is changed. If it is right in order > to test it needs a condition which changes CPU load up and down so that it > can reach a certain situation where the CPU can get stuck. > > In my case it happened when I use Chromebrowser on Xubuntu so I guessed it > is related to GPU but I don't have any knowledge about that.
(In reply to jds from comment #212) > That's what I thought too at first -- and that sent me scramblingly looking > at chrome flags etc. But then I observed two different systems lock up even > when no browser at all was running. > > (In reply to cororok from comment #211) > > I think the problem happens when C-state is changed. If it is right in > order > > to test it needs a condition which changes CPU load up and down so that it > > can reach a certain situation where the CPU can get stuck. > > > > In my case it happened when I use Chromebrowser on Xubuntu so I guessed it > > is related to GPU but I don't have any knowledge about that. I can confirm that my Acer ES1-311 with its Intel 3540 CPU crashes not only while using the Chromium browser. But I noticed that it crashes more often with Chromium than with Firefox. Mostly it happens when I play a movie on YouTube or scroll the Facebook timeline. If I'm working on the PC without using any browser, the system seems stable. Writing with LibreOffice and graphics manipulation with GIMP or RawTherapee work pretty well with the 4.2.0-34 kernel (Lubuntu), and I do not get as many freezes as before. But watching a DVD from an external drive is not possible: the system freezes within minutes. Strangely, the freeze also occurs pretty often when I'm just reading PDF documents with Evince.
Hi, I own an Asus Chromebox (Haswell Intel Celeron 2955U / 1.4 GHz) and I've always experienced full system freezes in every Linux distro I've tested, including Kodibuntu and OpenELEC, BUT never had such issues with Windows 8/8.1/10 (currently booting off an external HDD). Currently I'm running GalliumOS, based on Ubuntu 15.04 with Xfce, off the internal SSD; it came with kernel 4.1.14 by default. What I've tried: GRUB_CMDLINE_LINUX_DEFAULT="quiet splash intel_idle.max_cstate=1 tpm_tis.interrupts=0 i915.enable_ips=0" GRUB_CMDLINE_LINUX_DEFAULT="quiet splash intel_idle.max_cstate=2 tpm_tis.interrupts=0 i915.enable_ips=0" NOTES: * I just added the intel_idle.max_cstate argument after "splash"; the rest is the default in /etc/default/grub. * Neither worked: intel_idle.max_cstate=1 froze in less than 10m while working in Terminal, and intel_idle.max_cstate=2 froze in less than 15m while watching Netflix in the Chrome browser. Currently testing: kernel 4.1.12 with no args from http://kernel.ubuntu.com/~kernel-ppa/mainline/v4.1.12-wily/ as some users suggested. Will report back if it freezes, otherwise after 2+ days. Anything else I could try? I can't compile: I just have the Chromebox for now, plus I'm not that advanced. T.I.A
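For those who edit /etc/default/grub by hand as in the comment above, the line manipulation can be sketched in Python as follows. This is only an illustration of how the parameter ends up in GRUB_CMDLINE_LINUX_DEFAULT; the helper name `add_kernel_param` is invented here, and it assumes the value is wrapped in double quotes as in the lines quoted above.

```python
def add_kernel_param(grub_line, param):
    """Return the GRUB_CMDLINE_LINUX_DEFAULT line with `param` set.

    An existing setting of the same parameter (e.g. max_cstate=2) is
    dropped so that the new value is the only one on the command line.
    """
    key, sep, value = grub_line.partition("=")
    if not sep or key.strip() != "GRUB_CMDLINE_LINUX_DEFAULT":
        raise ValueError("not a GRUB_CMDLINE_LINUX_DEFAULT line")
    params = value.strip().strip('"').split()
    name = param.split("=", 1)[0]
    # Keep every parameter except an older setting of the same name.
    params = [p for p in params if p.split("=", 1)[0] != name]
    params.append(param)
    return 'GRUB_CMDLINE_LINUX_DEFAULT="' + " ".join(params) + '"'
```

On Ubuntu-style systems the edited file only takes effect after the GRUB configuration is regenerated (typically with update-grub) and the machine is rebooted.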
Update: System just froze with kernel 4.1.12; this is very frustrating.
Forgot to mention that Ubuntu Server 14.04.x is the only distro that has worked reliably for me, ran it for several months without issues then needed full OS so uninstalled.
I've been watching the posts on this bug report for several days now and thought I would post my own personal experience. I just bought a laptop with an N3540 chip in it and have also been experiencing random system lockups with the 4.x series kernels. But I wanted to mention that for some reason the stock kernel that comes with Debian "Jessie" gives me no problems whatsoever crash-wise. In fact, the only reason I've been trying to use a 4.x kernel is because my graphics performance seems to improve drastically with them, especially in OpenGL applications (I've seen much higher FPS in apps). I did try the patches Mika Kuoppala posted on the stable 4.4.6 kernel from kernel.org but had a lockup after about an hour of use. It seems to me the crashes happen most when browsing websites in Chrome, but I have had lockups doing other things. I'm going to try the stable git of 4.5.0 and see what happens. If 4.5 still locks up with Mika's patches then I don't know what to do other than go back to Debian's 3.16 kernel. I'll take a performance hit, but at least the computer will run without crashing.
(In reply to Mika Kuoppala from comment #202) > https://cgit.freedesktop.org/~miku/drm-intel/log/?h=rc6_test > > 3 _tentative_ patches on that tree. Please try. system freeze after ~2 days (In reply to Veronica from comment #215) > Update: System just froze with kernel 4.1.12 , this is very frustating. i think you are the only one with a freeze on 4.1.[12-15] so far. but then i haven't seen anyone posting with a 2955U unit in this thread. please double-check you are running the correct kernel with uname -a or try 3.16 as suggested by brent davis and others
Veronica and Brent, please check whether the workaround mentioned in the bug report title at least makes the system hang much later (or not at all). If that is the case, then it's worth trying the patches from comment #202 instead of the workaround.
(In reply to julio.borreguero@gmail.com from comment #218) > i think you are the only one with a freeze on 4.1.[12-15] so far. No, not the only one. I use 4.1.* kernels (now 4.1.20) every day on a Bay Trail Z3770 tablet and get rare freezes, with the MMC PM QoS patches applied, of course. max_cstate=1 helps, but at the cost of much higher power consumption. I also hit another mysterious bug where my tablet just turns off. It looks like overheating, but I don't know for sure. The latest kernel git also has a bug with display blinking and corruption, so I can't use it long enough to see a hang. P.S. I hit a hang once when I was reading a book with fbreader. Nothing more, just fbreader.
(In reply to julio.borreguero@gmail.com from comment #218) > (In reply to Mika Kuoppala from comment #202) > > https://cgit.freedesktop.org/~miku/drm-intel/log/?h=rc6_test > > > > 3 _tentative_ patches on that tree. Please try. > > system freeze after ~2 days > > (In reply to Veronica from comment #215) > > Update: System just froze with kernel 4.1.12 , this is very frustating. > > i think you are the only one with a freeze on 4.1.[12-15] so far. > but then i haven't seen anyone posting with a 2955U unit in this thread. > please double-check you are running the correct kernel with uname -a > or try 3.16 as suggested by brent davis and others Yes, I did verify. I'm very cautious when testing. What I did was press the Shift key while booting, go to the advanced options and select kernel 4.1.12 generic. I know I'm the first with a Haswell to report, but my Chromebox is having the exact same symptoms people in here are having.
(In reply to RussianNeuroMancer from comment #219) > Veronica and Brent, please check if workaround mentioned in bugreport title > at least make system hang much later (or doesn't hang at all). If that the > case, then it's worth a try patches from comment #203 instead of workaround. Hi, as I mentioned in post #214 cstate=1 and cstate=2 didn't work for me. The first one froze in less than 10m and the second in less than 15m.
Hello, I have the same freezing with Ubuntu 14.04.3 and kernel 3.19 on my ASRock Q1900DC-ITX. Following the information here, I reinstalled the system and downgraded to 14.04.2 with the 3.16.0.30 kernel, but today it got stuck again :-( The only version of Linux that worked fine was Oracle Linux 7.2 with 3.10.
Maybe related patch? http://www.spinics.net/lists/intel-gfx/msg90977.html
(In reply to Ernst Herzberg from comment #224) > Maybe related patch? > > http://www.spinics.net/lists/intel-gfx/msg90977.html Looks interesting. It seems to be for a different kernel version than 4.5.0 though: 2 out of 3 hunks fail. But I hopefully managed to adapt the patch and am compiling a new test kernel just now, and I will post any positive results. Definitely worth a try, it looks promising from the description. Thanks
that patch looks indeed promising, I'm compiling the latest drm-intel kernel from git with that patch now (no hunks failing). Will report much later, as I expect this compilation process to take a long time on this hardware. :-)
Created attachment 210771 [details] drm/i915: Prevent machine death on Ivybridge context switching for kernel 4.5.0 from kernel archive This is Chris Wilson's patch for the latest drm-intel kernel, slightly modified for the latest kernel v4.5.0 from the stable kernel archive repo tree https://git.kernel.org/pub/scm/linux/kernel/git/stable/linux-stable.git
(In reply to julio.borreguero@gmail.com from comment #227) > Created attachment 210771 [details] > drm/i915: Prevent machine death on Ivybridge context switching for kernel > 4.5.0 from kernel archive > > this is Chris Wilsons patch for latest drm-intel kernel slightly modified > for latest kernel v4.5.0 from stable kernel archive repo tree > https://git.kernel.org/pub/scm/linux/kernel/git/stable/linux-stable.git It froze within 2h. Probably not worth trying for anyone else. Anyway, let's see what dertobi's test with the original patch on the drm-intel kernel leaves us with.
The bug is still "P1 normal"!!! This bug affects 30% of all laptops on the market at this moment. It is one of the most serious bugs never investigated!! There are thousands of disappointed Linux users. How do we reach the kernel developers in the highest positions to tell them how serious this situation is?
I hate to break it but the Chris Wilson patch is not the fix. My laptop froze within an hour.
(In reply to Dimitris Roussis from comment #229) > The bug is still "P1 normal"!!! > > This bug affect the 30% of all laptops this moment in the market.It is one > of the most serious bug never explored!! There are thousands of Linux users > dissapointed. > > How we communicate with developers of kernel that have most high position > to tell them about how serious is this situation? You're absolutely right. It is a very serious bug because it freezes the computer. Bay Trail machines are low-end computers, and many of their users are probably non-technical people who want to try a light OS because Windows 10 is not happy with limited memory. But they will be very disappointed.
Just wanted to give a quick update, since the last post I made stated I was going to try the latest stable git with Mika's 3 tentative patches. So far it's been 24 hours and I have not experienced a crash. Not sure yet if this is just luck or if a real difference has been made. But I can definitely say my stability has vastly improved coming from 4.4.6. I've been doing everything I can to break this thing: YouTube, H.264 video, OpenGL games, etc.
(In reply to Brent Davis from comment #232) > Just wanted to give a quick update since the last post I made stated I was > gonna try the latest stable GIT with Mika's 3 tentative patches. So far it's > been 24 hours and I have not experienced a crash. Not sure yet if this is > just luck or if a real difference has been made. But I can definitely say my > stability coming from 4.4.6 has vastly improved. Been doing everything I can > to break this thing. Youtbe, h264 video, opengl games, etc. I just also tried Mika's three tentative patches applied to latest drm-intel as well as Chris Wilson's patch, and within an hour my system crashed yet again. Brent, are you making sure you don't have the usual workaround parameters in the command prompt while testing the patches (happened to me before, you can check with #cat /proc/cmdline)?
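The check against /proc/cmdline suggested above can also be scripted, which helps when juggling several test kernels. A minimal Python sketch follows; it looks only for the two parameters named in this thread (intel_idle.max_cstate and intel_pstate), and the function name is made up for illustration.

```python
# Workaround parameters mentioned in this thread; extend as needed.
WORKAROUNDS = ("intel_idle.max_cstate", "intel_pstate")


def active_workarounds(cmdline):
    """Return {parameter: value} for every workaround found on a kernel command line."""
    found = {}
    for token in cmdline.split():
        name, _, value = token.partition("=")
        if name in WORKAROUNDS:
            found[name] = value
    return found
```

On a running system the input would be the contents of /proc/cmdline, e.g. `active_workarounds(open("/proc/cmdline").read())`; an empty result means the patches are being tested without either workaround.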
I have an ASUS motherboard with Celeron J1900 cpu. For me, kernel 3.19.0-47 from Ubuntu 14.04.3 is stable with options intel_idle.max_cstate=1 nox2apic loglevel=7 debug . System is used for web browsing and openvpn client . Crashes were usually happening while scrolling a large web page with mouse wheel (such as wsj.com or nytimes.com front page).
We have about 50 mainboards with the J1900 and some samples with the J1800, N3050 and N3150, and we had to go back to the original Ubuntu 14.04 kernel 3.13, as even the lts-utopic kernel 3.16 froze on a few mainboards: rarely, but sometimes.
The last stable kernel without this horrible bug is 3.16.7. Canonical provides extended support for this kernel until April 2016!!! I hope this bug will have been fixed by then.
(In reply to dertobi from comment #233) > (In reply to Brent Davis from comment #232) > > Just wanted to give a quick update since the last post I made stated I was > > gonna try the latest stable GIT with Mika's 3 tentative patches. So far > it's > > been 24 hours and I have not experienced a crash. Not sure yet if this is > > just luck or if a real difference has been made. But I can definitely say > my > > stability coming from 4.4.6 has vastly improved. Been doing everything I > can > > to break this thing. Youtbe, h264 video, opengl games, etc. > > I just also tried Mika's three tentative patches applied to latest drm-intel > as well as Chris Wilson's patch, and within an hour my system crashed yet > again. > > Brent, are you making sure you don't have the usual workaround parameters in > the command prompt while testing the patches (happened to me before, you can > check with #cat /proc/cmdline)? I haven't touched my boot command with the cstate flags or anything. I've just been testing kernels and patches. I didn't see any reason to, because I'm looking for a permanent solution as opposed to a workaround. But yeah, for all I know mine might crash too. I just wish there was some foolproof way to reproduce it instead of just waiting for it to happen.
(In reply to Dimitris Roussis from comment #236) > The last stable kernel without this horrible bug is 3.16.7. > Is this your own experience on your own computer or some information from an authoritative source? Because, Linux Mint 17.2 comes with 3.16.0 and it is prone to freeze on many N3050 and N2930 machines that I tested. Also, based on your statement, I installed 3.16.7 from the ubuntu repo http://kernel.ubuntu.com/~kernel-ppa/mainline/v3.16.7-ckt19-utopic/linux-headers-3.16.7-031607-generic_3.16.7-031607.201510301030_amd64.deb It only worked for 1 hr on one machine and 3.5 hrs on the other. So, if I may suggest, please do not make such authoritative, blanket statements, unless you can cite an authoritative source. Otherwise, simply say that this applies to your own equipment. Also, on my own two computers (if you lookup this long thread you'll see my config info) several 3.16.n versions have been tested. They absolutely all eventually froze. The only good thing is that cstate=2 reduces the failure rate significantly, and cstate=1 literally eliminates freezing on my computers.
(In reply to Hal from comment #238) > (In reply to Dimitris Roussis from comment #236) > > The last stable kernel without this horrible bug is 3.16.7. > > > > Is this your own experience on your own computer or some information from an > authoritative source? > Because, Linux Mint 17.2 comes with 3.16.0 and it is prone to freeze on many > N3050 and N2930 machines that I tested. > > Also, based on your statement, I installed 3.16.7 from the ubuntu repo > http://kernel.ubuntu.com/~kernel-ppa/mainline/v3.16.7-ckt19-utopic/linux- > headers-3.16.7-031607-generic_3.16.7-031607.201510301030_amd64.deb > > It only worked for 1 hr on one machine and 3.5 hrs on the other. > > So, if I may suggest, please do not make such authoritative, blanket > statements, unless you can cite an authoritative source. Otherwise, simply > say that this applies to your own equipment. > > Also, on my own two computers (if you lookup this long thread you'll see my > config info) several 3.16.n versions have been tested. They absolutely all > eventually froze. > The only good thing is that cstate=2 reduces the failure rate significantly, > and cstate=1 literally eliminates freezing on my computers. If you read the comments, almost all the users said that kernel 3.16.7 works without problems (comments 11, 35, 81, 105 etc.). It is the same in my situation with an N3050 machine. Unfortunately, everything above this kernel does not work. If there are machines that have the problem even with this kernel or below, the bug is more serious than we think.
On my netbook, an Acer Aspire ES1-111/R2 (BIOS V1.16 10/20/2015, Celeron N2940, 4GB), I'm back to CentOS 7 (kernel 3.10 whatever) and the power drain is acceptable for me. No freeze problems so far, in contrast to Ubuntu 15.10. I tried 4.5 mainline from elrepo and this one is working stably for me, too. 4.4.4 mainline was unstable. I will try 4.4.6 now and will report MY (!) results.
(In reply to Dimitris Roussis from comment #239) > (In reply to Hal from comment #238) > > (In reply to Dimitris Roussis from comment #236) > > > The last stable kernel without this horrible bug is 3.16.7. > > > > > > > Is this your own experience on your own computer or some information from > an authoritative source? ... > > Almost all the users if you read the comments said that kernel 3.6.17 works > without problem (comment 11,35,81,105 etc). > Not quite accurate though. Please note: 1) Vladmir Jicha, in comments #104 and #162, mentioned that his computer froze about twice a week even with kernel 3.13. 2) Ladiko, in #235, indicates that some of their 50 boards freeze with 3.16 (and also 3.13). I know that many people mentioned that they believed 3.16 worked well, without patching, on their specific hardware. And that's fine. But that can't be generalized and turned into an authoritative statement that 3.16 was freeze-free on the microprocessors that this thread is focusing on. I can concur with the few unlucky people here that freezing problems are occurring even on version 3.13. There are so many versions of the 3.x and 4.x kernels out there, and so many compilation sources, that performing a test matrix worth drawing conclusions from is almost impossible at this time. Between standard-issue distro kernels, what can be downloaded and compiled from kernel.org, and what's on Ubuntu's prolific mainline kernel-ppa, I am quite convinced that when two people refer to a particular version they are not necessarily talking about the same binary, far from it. And I am not even talking about the privately patched derivatives... So, no: unfortunately Linux has taken a very bad turn this time, and troubleshooting this issue is going to be a miserable experience. (And that's probably why this bug is still not fixed 15 months after its first discovery.) And I am not even talking about retrofitting the fix into all these versions. Although, I would personally be very happy if I had only one version with a fix, like 4.5! Oh, but wait! There is a release candidate of version 4.6 already! WHAT A JOKE!
regarding 2): we have issues with 3.16, but not with 3.13 which we use right now.
For me, Mr. Len Brown bears a big responsibility for this situation. How is it possible to be assigned such a serious bug, one that affects 30% or more of the new laptops on the market, and for the importance to still be "P1 normal", without informing the developers above you? I think Mr. Len didn't understand the effect of this bug on the Linux world!! Just go to a computer shop. Half of the laptops use these CPUs!! What can we say to all these people? Don't use Linux, wait 2 more years until someone is interested in fixing the bug!!! This is the worst situation I have seen in Linux in the last 10 years.
And also, what do you think? Are all people Linux experts who can try different kernels like us? It works exactly this way: somebody goes to the shop, buys a new laptop and installs the latest Ubuntu. After 20 minutes the system freezes and he says "Linux sucks, I'll never use it again!!" That's it!!
(In reply to ladiko from comment #242) > regarding 2): we have issues with 3.16, but not with 3.13 which we use right > now. My apologies. The parenthetical was left over from an edit to the sentence, made after I realized that you were referring to 3.16 as freezing, but not 3.13. Thank you for pointing it out. In any event, my point to this group is that I personally do not believe that a solution to this problem will be found anytime soon, as we cannot even identify the turning point beyond which this problem started to show up. And kernel version proliferation is certainly one reason for that. For those for whom, like me, this issue has serious ramifications (beyond the fun of using Linux distros at home, with friends and family, etc.), like losing credibility in front of a prospective customer because you can't even give a presentation with your beautiful lightweight laptop computer without rebooting twice in one hour, I have a word of advice: go buy yourself an entry-level Mac laptop, because it's time for evasive action. It was fun riding the Linux wave, a little over 12 years for me. But now it's time to move on.
This is on my work macbook, which I sleep at the end of every workday: $ uptime 17:11 up 154 days, 3:13, 6 users, load averages: 1.34 1.83 1.90 (In reply to Hal from comment #245) > (In reply to ladiko from comment #242) > > regarding 2): we have issues with 3.16, but not with 3.13 which we use > right > > now. > > My apology. The parentheses was a left over from the edit to the sentence > after I realized that you were referring to 3.16 as freezing, but not 3.13. > Thank you for pointing it out. > > In any event my point to this group is that, I personally do not believe > that a solution to this problem will be found anytime soon, as we cannot > even identify the turning point beyond which this problem started to show > up. And kernel version proliferation is certainly one reason for that. > > For those who like me, this issue has serious ramifications (beyond the fun > of using linux distros at home, with friends and family, etc.), like losing > credibility in front of a prospective customer because you can't even give a > presentation with your beautiful lightweight laptop computer without > rebooting twice in one hour, I have a word of advice: Go buy yourself an > entry level Mac laptop because it's time for evasive action. > > It was fun riding the Linux wave - a little over 12 years for me. But now > it's time to move on.
Weirdly enough, this thread is giving me back some hope! I bought an Asus F551 (Intel N2930) laptop last year in February which came with a pre-installed Windows 8.1 64bit and which was running flawlessly, until I updated to Windows 10. Right after the update the laptop started to freeze randomly. Since I spend most of the time editing PHP code and watching the result in Firefox, I'm not really bringing the machine to its limits. And maybe that's the reason those freezes didn't happen so often. Sometimes it took 3 days, sometimes I got 3 crashes within half an hour. Absolutely unpredictable. Except for my unsaved program changes there was no data loss, and not a single hint in the system log. I filed a detailed report with Asus then, but the only suggestion was restoring the machine to its shipping state. Poor, isn't it? And that's why I decided to give Linux a try instead. To keep it short: my Linux (Ubuntu Studio) 4.2.0-34.lowlatency #39-Ubuntu SMP PREEMPT is freezing, too. To me this looked very much like a hardware defect, and my next idea was running memtest86. Strangely enough, I got not a single error when running the test with just ONE CPU, but hundreds of errors (all at address 0) with multiple processors involved. Yeah, and so I was almost giving up hope on this laptop until I came across this thread today. My first test was running 2 glxgears plus watching a video in Firefox: freeze after about 10 minutes. After a reboot
(In reply to micha from comment #247) > Weird enough, but this thread is giving me back some hope! > > I bought an Asus F551 (Intel N2930) laptop last year in February which came > with a pre-installed Windows 8.1 64bit and which was running flawlessly - > until I updated to Windows 10. Right after the update the laptop started to > freeze randomly. Since I spend most of the time editing PHP code and > watching the result in Firefox, I'm not really bringing the machine to its > limits. And maybe that's the reason those freezes didn't happen so often. > Sometimes it took 3 days, sometimes I got 3 crashed within half an hour. > Absolutely unpredictable. Except of my unsaved program changes: no data loss > - not a single hint in the system log. > > I filed a detailed report to Asus then, but the only suggestion was > restoring the machine to its shipping state. Poor, isn't it? And that's why > I decided to give Linux a try instead. To keep it short: My Linux (Ubuntu > studio) 4.2.0-34.lowlatency #39-Ubuntu SMP PREEMPT is freezing, too. > > To me this looked very much like a hardware defect and my next idea was > running memtest86. Strange enough I got not a single error when running the > test with just ONE cpu, but hundreds of errors (all at address 0) with > multiple processors involved. > > Yeah, and so I was almost giving up hope on this laptop until I came across > this thread today. > My first test was running 2 glxgears plus watching a video in firefox: > Freeze after about 10 minutes. > After a reboot Sorry, I accidentally hit the wrong key ... After the reboot 4 hours ago I added cstate=1 to the boot parameters, and the system is still alive, continuously running 2 glxgears and playing videos.
I guess Intel already knew about the bug, but I wonder why they don't fix it. The user experience is supposed to differ between the expensive Core and the cheap Bay Trail, so that Intel makes a huge profit on Core CPUs. Windows 10 fits this strategy because it is slow with little memory (the Intel Compute Stick with Linux has 1 GB of RAM, compared to 2 GB for Windows 10). Is Intel happy with this situation, which restrains Bay Trail under Windows 10, like netbooks were limited to a 10-inch size?
(In reply to cororok from comment #249) > I guess Intel already knew the bug but wonder why they don't fix it. > > User experience should be different between expensive Core and cheap Bay > trail so that Intel make a huge profit in Core cpus. Windows 10 meets this > strategy because it is slow on low memory (Intel computer stick with linux > has 1gb ram compared to 2gb for windows 10). > > Is Intel happy with this situation which restrains Bay trail in windows 10? > like netbook was limited in 10 inch size? Let's not devolve into conspiracy theories. For me this looks like incompetency paired with negligence. Still bad.
In case anyone needs them, there are amd64 deb packages with the patches from comment #202: https://github.com/milikhin/z3735-linux-patches https://drive.google.com/folderview?id=0BzIRxogf-cVkLWdiMTRoenU5amM The Linux 4.6rc1 package also includes a workaround for bug 112571.
An article on the internet that points to this bug: http://www.phoronix.com/scan.php?page=news_item&px=Intel-Linux-Bay-Trail-Fail
(In reply to micha from comment #248 & #247) The Asus F551 is a decent machine (not a great machine for the intended use of Ubuntu Studio though). I reconfigured one with Linux Mint several months ago for a relative of mine. I remember having tried a low latency kernel (I can't recall the exact version) but the performance was terrible. Generally speaking processors in the N2930 class are not good candidates for low latency versions of the kernel. I eventually set that machine with Linux Mint 17.2 kernel 3.16, and a few months later upgraded the OS to 17.3 with kernel 3.19. Of course the processor freezing problems came along and intel_idle.max_cstate=2 or 1, as discovered by many people by then, became the life saver for the machine. Especially with cstate=1 you can expect your machine to be very stable and run flawlessly. On my laptop (although not a F551) I sometimes set it to 2 and take the risk of seeing my machine freeze as the battery lasts significantly longer with cstate set to 2 rather than to 1. If you want to try a non-low-latency kernel, rather than installing a standard kernel on Ubuntu Studio, try a different flavor of Linux (maybe Linux Mint) with a standard kernel. Because Ubuntu Studio is a bit touchy (at least in my experience) and you may start seeing seemingly unrelated problems as soon as you replace its kernel.
(In reply to Dimitris Roussis from comment #243) > For me mr, Len Brown has a big responsibility of this situation. > You're making an excellent point. It's quite extraordinary that the gentleman in charge of fixing this bug has not posted a single line here on this thread, sharing his thoughts, or providing insight about his efforts on this matter. Quite extraordinary...
(In reply to julio.borreguero@gmail.com from comment #218) > (In reply to Mika Kuoppala from comment #202) > > https://cgit.freedesktop.org/~miku/drm-intel/log/?h=rc6_test > > > > 3 _tentative_ patches on that tree. Please try. > > system freeze after ~2 days > Did that set affect the rate/time of hangs? I am now at 6days of uptime. Workload is glxgears + vlc with vaapi
(In reply to Mika Kuoppala from comment #255) > (In reply to julio.borreguero@gmail.com from comment #218) > > (In reply to Mika Kuoppala from comment #202) > > > https://cgit.freedesktop.org/~miku/drm-intel/log/?h=rc6_test > > > > > > 3 _tentative_ patches on that tree. Please try. > > > > system freeze after ~2 days > > > > Did that set affect the rate/time of hangs? > > I am now at 6days of uptime. Workload is glxgears + vlc with vaapi i stressed the system more than usual. had a big glxgears on one workspace and i was playing nonstop movies from shell with mplayer (-nosound). no vaapi. Plus listening to music with clementine and compiling a lot of packages (gentoo upgrading packages). Hard to say if it improved with random freezes that can occur at any time. what i can say is that chris wilsons patch only took max 2h in freezing, although i applied it to 4.5.0 kernel. I can try more patches or use vaapi or whatever, just let me know.
(In reply to julio.borreguero@gmail.com from comment #256) > (In reply to Mika Kuoppala from comment #255) > > (In reply to julio.borreguero@gmail.com from comment #218) > > > (In reply to Mika Kuoppala from comment #202) > > > > https://cgit.freedesktop.org/~miku/drm-intel/log/?h=rc6_test > > > > > > > > 3 _tentative_ patches on that tree. Please try. > > > > > > system freeze after ~2 days > > > > > > > Did that set affect the rate/time of hangs? > > > > I am now at 6days of uptime. Workload is glxgears + vlc with vaapi > > i stressed the system more than usual. had a big glxgears on one workspace > and i was playing nonstop movies from shell with mplayer (-nosound). no > vaapi. > Plus listening to music with clementine and compiling a lot of packages > (gentoo upgrading packages). > Hard to say if it improved with random freezes that can occur at any time. > what i can say is that chris wilsons patch only took max 2h in freezing, > although i applied it to 4.5.0 kernel. > I can try more patches or use vaapi or whatever, just let me know. Pardon my intrusion. Although I am no longer testing anything related to this issue I thought sharing some of my findings might interest you. The freezing is more prone to happen when the workload on the processor cores is light to medium as the power controller takes a more active role in switching the states. When you heavily load your processor with tasks it goes into low power or power saving states much less. If there is failure, more likely it's another cause than this bug at hand. Of course keep testing everything under heavy load too, but light load will probably cause this problem show up more quickly and frequently. (When I was doing serious structured testing I noticed that actually with no "user" load, just the internal system calls were causing enough/more frequent cstate flip flops than when running videos etc)
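The observation above, that light load causes more frequent cstate transitions than heavy load, can be checked against the kernel's own counters. The following is a minimal sketch, assuming the standard Linux cpuidle sysfs layout (`/sys/devices/system/cpu/cpu0/cpuidle/stateN/` with `name`, `usage` and `time` files); `show_cstate_usage` is a hypothetical helper name introduced here for illustration, not something used in this thread.

```shell
#!/bin/bash
# Print the name, entry count and total residency (microseconds) of every
# idle state found under a cpuidle directory.
# show_cstate_usage is a hypothetical helper name; the default path assumes
# the standard cpuidle sysfs location for CPU 0.
show_cstate_usage() {
    dir="${1:-/sys/devices/system/cpu/cpu0/cpuidle}"
    for state in "$dir"/state*; do
        # Skip silently if the directory does not exist or is unreadable.
        [ -r "$state/name" ] || continue
        printf '%s entries=%s time_us=%s\n' \
            "$(cat "$state/name")" "$(cat "$state/usage")" "$(cat "$state/time")"
    done
}

# Show the counters for CPU 0 of the running system (prints nothing if the
# cpuidle directory is not available).
show_cstate_usage
```

Running it twice, a minute apart, once on an idle desktop and once under heavy load, and comparing how much the `entries` values grew should give a rough idea of how often each state is entered in both situations.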
(In reply to Hal from comment #257) > (In reply to julio.borreguero@gmail.com from comment #256) > > (In reply to Mika Kuoppala from comment #255) > > > (In reply to julio.borreguero@gmail.com from comment #218) > > > > (In reply to Mika Kuoppala from comment #202) > > > > > https://cgit.freedesktop.org/~miku/drm-intel/log/?h=rc6_test > > > > > > > > > > 3 _tentative_ patches on that tree. Please try. > > > > > > > > system freeze after ~2 days > > > > > > > > > > Did that set affect the rate/time of hangs? > > > > > > I am now at 6days of uptime. Workload is glxgears + vlc with vaapi > > > > i stressed the system more than usual. had a big glxgears on one workspace > > and i was playing nonstop movies from shell with mplayer (-nosound). no > > vaapi. > > Plus listening to music with clementine and compiling a lot of packages > > (gentoo upgrading packages). > > Hard to say if it improved with random freezes that can occur at any time. > > what i can say is that chris wilsons patch only took max 2h in freezing, > > although i applied it to 4.5.0 kernel. > > I can try more patches or use vaapi or whatever, just let me know. > > Pardon my intrusion. Although I am no longer testing anything related to > this issue I thought sharing some of my findings might interest you. > The freezing is more prone to happen when the workload on the processor > cores is light to medium as the power controller takes a more active role in > switching the states. When you heavily load your processor with tasks it > goes into low power or power saving states much less. If there is failure, > more likely it's another cause than this bug at hand. Of course keep testing > everything under heavy load too, but light load will probably cause this > problem show up more quickly and frequently. 
> (When I was doing serious structured testing I noticed that actually with no > "user" load, just the internal system calls were causing enough/more > frequent cstate flip flops than when running videos etc) Well thank you for your intrusion. That indeed sounds logical, good point. Interestingly enough the system finally froze after i closed those glxgears and ever-looping movies, now that you are saying, which absolutely confirms your theory. i was at that point just watching a movie (low-res) without stressing the machine in any other way. nonetheless the cstate workaround doesn't work for me, although i haven't tried cstate=2, only cstate=1 (on my N2940) and that seems to be hardware depending.
Quick observation: Once the system is frozen I can still unplug/plug the HDMI cable and the frozen screen will reappear on my external monitor. Maybe that means nothing, but wouldn't that suggest that some level of kernel activity is still occurring? I can also use the FN keys of my laptop to disable/enable the laptop screen, but that could be happening purely on the firmware/BIOS level.
(In reply to Hal from comment #257) > > Pardon my intrusion. Although I am no longer testing anything related to > this issue I thought sharing some of my findings might interest you. > The freezing is more prone to happen when the workload on the processor > cores is light to medium as the power controller takes a more active role in > switching the states. When you heavily load your processor with tasks it > goes into low power or power saving states much less. If there is failure, > more likely it's another cause than this bug at hand. Of course keep testing > everything under heavy load too, but light load will probably cause this > problem show up more quickly and frequently. > (When I was doing serious structured testing I noticed that actually with no > "user" load, just the internal system calls were causing enough/more > frequent cstate flip flops than when running videos etc) My findings are similar to yours, Hal: freezes almost always happened when doing "nothing much", i.e. loading and scrolling a web page. Sometimes a hang happened just after reboot, when everything was loaded and the system started idling. I think it is almost always when the load changes from 'high' to 'low' or 'idle'. When these problems started for me with some kernel version (after a distribution upgrade from Ubuntu 14.10 to 15.04 (kernel 3.19)), the hangs at first always happened when I tried to put the laptop to sleep by closing the lid. A bit later (perhaps after a further distribution upgrade when I got sick of the "buggy 15.04") came the full system lock-ups during 'daily use'. But I was also wondering whether there are now two (or even more) freeze issues in this same report that different users are experiencing, as cstate limiting doesn't help for everyone and systems other than Bay Trail are now included as well (even if they are likely related from a design perspective). Btw, I'm still running the same 4.5.0 session, hitting the two-week mark in a couple of days.
Linux cardhu 4.5.0-58.gb2c9ae5-default #1 SMP PREEMPT Wed Mar 16 17:30:21 UTC 2016 (b2c9ae5) x86_64 x86_64 x86_64 GNU/Linux. No other patches, no cstate limiting. Stress-tested with videos for one full day, otherwise it's been just my daily usage pattern with web-browsing, streaming, occasional gaming, etc.
Just a simple question/thought: wouldn't it be quite easy to write a program that constantly switches between those cstates, to get a solid test program and finally be able to nail down the bug and make it reproducible? Or one that does the following, if that is what causes the problem: > Two concurrent writes into the same register cacheline has the chance of > killing the machine on Ivybridge and other gen7. (quoted from Chris Wilson's patch description) A reliable "freeze program" would help tremendously, I think.
(In reply to julio.borreguero@gmail.com from comment #261) > just a simple question/thought: > > wouldn't it be quite easy to just write a program to change bewtween those > cstates constantly to make a solid test program and finally be able to nail > down the bug and make it reproducable ? > or to do this, if that is causing the problem: > >Two concurrent writes into the same register cacheline has the chance of > >killing the machine on Ivybridge and other gen7. > (citation from chris wilsons patch description) > > a reliable "freeze program" would help tremendously, i think. Probably not easy, because even though you could force the cstate with your own procedure you can't prevent the microprocessor's microcode from interacting with it (unless of course you are an intel guy and have access to the nitty-gritty of the microcode and you know how to throw that control code in the dustbin and overwrite it with your own) When I first ran into this freezing problem on My Zotac, I didn't know anything about this thread or the earlier one on the freedesktop site. As I tried to do quick and dirty troubleshooting I wrote a little program with a bunch of loops (some in sequence, some in parallel) stressing different parts of the processor and computer hardware as I thought it would help me isolate the area this problem was originating from. That kept the processor cores quite busy, but it also increased the longevity of the linux session. Without that dingy program the machine would freeze within 5-10 minutes after booting, with the software running freezing would only occur an hour or two later. So, that gave me more time to look around into the system. That also gave me a hint that probably the power saving mechanism was the culprit as it was kicking in during light loads on the cpu. 
So, yes - it may be possible to come up with a Micky-Mouse solution to alleviate the negative impact of the problem and save the day, but a real solution by competent people who understand the root-cause of the problem is more desirable - especially after 15 months of this saga ...
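The "freeze program" idea discussed above could be approximated from user space, with the caveat already raised in the reply: the hardware and microcode still decide what actually happens, so this is only a rough sketch and not a proven reproducer. It assumes the standard cpuidle sysfs interface, where writing 1 to `stateN/disable` stops the kernel from selecting that idle state and writing 0 allows it again. `set_deep_states` and `toggle_cstates` are hypothetical names, and the choice to toggle every state except `state0` is an assumption made for illustration.

```shell
#!/bin/bash
# set_deep_states VALUE [CPU_ROOT]
# Write VALUE (0 = allowed, 1 = disabled) to the "disable" file of every idle
# state except state0, on every CPU found under CPU_ROOT.
# Hypothetical helper; the default root assumes the standard sysfs location.
set_deep_states() {
    value="$1"
    root="${2:-/sys/devices/system/cpu}"
    for f in "$root"/cpu[0-9]*/cpuidle/state[1-9]*/disable; do
        # Only touch files that exist and are writable (needs root on real sysfs).
        [ -w "$f" ] && echo "$value" > "$f"
    done
    return 0
}

# toggle_cstates ROUNDS PAUSE_SECONDS [CPU_ROOT]
# Alternate between "only state0 allowed" and "all states allowed".
# Always finishes with all states allowed again.
toggle_cstates() {
    rounds="$1"
    pause="$2"
    root="$3"
    i=0
    while [ "$i" -lt "$rounds" ]; do
        set_deep_states 1 "$root"
        sleep "$pause"
        set_deep_states 0 "$root"
        sleep "$pause"
        i=$((i + 1))
    done
}

# Example (as root, on a test machine): toggle_cstates 1000 5
```

The functions are deliberately not called automatically, since the script changes power management settings on the running system. Whether forcing transitions this way triggers the freeze any faster than ordinary light load is untested here.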
Instead of running a constant full load of tasks, how about doing something like the script below, so that the CPU/GPU load keeps going up and down?

#!/bin/bash
function callCpuGpu() {
    # close any running browser, then start video playback again
    killall -w firefox
    xdg-open https://www.youtube.com/xyz
}
for i in {1..1000}
do
    callCpuGpu
    sleep 180s
done
Sorry for the wrong one above. This version adds an idle phase:

#!/bin/bash
function callCpuGpu() {
    # close any running browser, then leave the machine idle for a while
    killall -w firefox
    sleep 60s # idle time
    xdg-open https://www.youtube.com/xyz
}
for i in {1..1000}
do
    callCpuGpu
    sleep 60s # running time
done
Well on a lighter note, 4.6-rc1-next.29 seems to have fixed two new failure modes since 4.4.x. On my system both occur after about 10 hours (cstate=1.) One was a semi-freeze, where the clock seconds field stops, but the mouse/touchscreen cursor still moves freely, and the user interface was checked/updated less than once a minute. The other failure was the screen going black w/o warning, apparently frozen. The newest patches didn't affect these failures. Without cstate, my system freezes within minutes per usual, the patches had no obvious effect. (uP=Z3775)
(In reply to Hal from comment #253) > (In reply to micha from comment #248 & #247) > > The Asus F551 is a decent machine (not a great machine for the intended use > of Ubuntu Studio though). I reconfigured one with Linux Mint several months > ago for a relative of mine. I remember having tried a low latency kernel (I > can't recall the exact version) but the performance was terrible. Thanks Hal for your hints. Actually, installing Linux on this laptop wasn't meant to produce a powerful multimedia device in the end - it was meant to be a cross check. Windows 8.1 was running flawlessly for several months - and after updating to Windows 10 the machine started to freeze randomly. Right now, the most surprising and interesting aspect is this coincidence: Older kernels seem to work correctly, while the newer ones don't. Thus, to me it looks like both parties have "optimized" their kernels up to a point these CPUs/architectures can't cope with any more. All I can report so far is: After setting cstate=1, my laptop has been running for more than 48 hours without any freeze. The first day with 2 glxgears and endless videos, today back to normal with just one Firefox and a little editing. And I wouldn't be surprised if Windows 10 ran correctly with a similar boot option. Unfortunately I haven't found a switch like that up to now.
(In reply to micha from comment #266) > Thanks Hal for your hints. You are welcome. > Older kernels seems to work correctly, while the newer ones don't. You are correct. There was a time I kept my system at least a couple of steps behind the "cutting edge" as to me reliability is key. But as I replaced some of my older, high power eating, machines with tiny, low power consuming ones with Bay Trail or Braswell family CPUs, I also had to step up the kernel versions. Because, for instance on the entry level Intel NUC the integrated video circuitry (HDMI part) is not properly handled by kernel 3.16.0. Most Display Port interfaces are prone to random problems with older kernels even when they are supported. Frankly if I could, I would stick to Linux Mint 17.2 and not even upgrade to 17.3 as I had Mint 17.2 running for over a year (without powering it down) on a home built AMD machine without a hiccup. > ... if Windows10 would run correctly with a similar booting > option. Unfortunately I haven't found a switch like that up to now. I doubt that in the Windows case it's a kernel issue. It's probably a device driver issue on the integrated video hardware that needs to be fixed. Also check Intel's website for any newer microcode versions for your microprocessor. But for professional use I am now switching (back) to Mac. In the 80's and 90's people used to say "nobody got fired for buying an IBM computer". I think that applies to Apple nowadays ...
http://www.phoronix.com/scan.php?page=news_item&px=Intel-Linux-Bay-Trail-Fail
(In reply to Dimitris Roussis from comment #268) > http://www.phoronix.com/scan.php?page=news_item&px=Intel-Linux-Bay-Trail-Fail There was interesting point in the article comments section, that upgrading to xorg 1.18 had solved some freezes that had happened with chromium (but no specifics on hardware, other than mention of atom). I checked my installation log back, and I've definitely verified a freeze with 1.18.0, but not with 1.18.1 - which I am running now with the 4.5.0. Perhaps unrelated noise, but caught my eye.
Interesting info: I had similar freezes running Android x86 (64bit version, UEFI) on the same machine. So it might really be Linux-specific and unrelated to the graphics stack.
I have no X-Server running, just a plain/headless Debian without monitor, keyboard, etc.
(In reply to Juha Sievi-Korte from comment #269) > (In reply to Dimitris Roussis from comment #268) > > > http://www.phoronix.com/scan.php?page=news_item&px=Intel-Linux-Bay-Trail-Fail > > There was interesting point in the article comments section, that upgrading > to xorg 1.18 had solved some freezes that had happened with chromium (but no > specifics on hardware, other than mention of atom). > > I checked my installation log back, and I've definitely verified a freeze > with 1.18.0, but not with 1.18.1 - which I am running now with the 4.5.0. > Perhaps unrelated noise, but caught my eye. i just upgraded xorg-server to 1.18.2 from 1.17.4. kernel 4.5.0 no patches no boot parameters. it froze within minutes [N2940]
(In reply to kossmann from comment #271) > I have no X-Server running, just a plain/headless Debian without monitor, > keyboard, etc. It would be interesting to see a hypothesis as to how that bug can occur in a headless setup. Can it still be the fault of the i915 driver in that case? Maybe the actual x86-64 CPU architecture code in Linux has some unexpected side effects on Bay Trail CPUs?
I had two freezes today with kernel 4.6 that could be a different bug, but there's no way to know this for sure (yet). This occurred with intel_idle.max_cstate=1. The good news is that this time it's at least partially reproducible; I say "partially" because I don't know if others will be able to reproduce it, too.

1) I have my smartphone connected to one of the USB ports to keep it charged.
2) I try to reboot the phone.
3) Instead of rebooting, the phone shuts off. (Probably not enough juice)
4) Then I try to force a boot by holding the power button of the phone. (The USB cable stays connected to my laptop while I'm doing all of this)
5) Just at the moment when the phone starts to boot, my desktop freezes in apparently exactly the same way people on this bug report already know all too well.

My conclusions from this:

1) The phone is connected for charging, so it's not unlikely that it's messing with the power management of the laptop by draining power and through sudden shifts in that power drain. (Although it shouldn't)
2) There could be a bug in the USB subsystem.
3) My particular laptop might have a serious hardware defect.
4) ???

Anyone else, please feel free to speculate what that means.
(In reply to dertobi from comment #273) > (In reply to kossmann from comment #271) > > I have no X-Server running, just a plain/headless Debian without monitor, > > keyboard, etc. > > It would be interesting to see a hypothesis as to how that bug in can occur > in a headless setup. Can it still be the fault of the i915 driver in that > case? Maybe the actual x86-64 cpu architecture linux code has some > unexpected sideeffects with baytrail cpus? J1900 Asrock Q1900DC-itx. I could easily lock it up with Kodi and tested a lot in the early days of the FDO bug. On some kernels a patch from that bug seemed to prevent the lockups for me, and it is probably still in OpenELEC (but I never ran Kodi for more than 15 hours). The patch - just option 2. https://bugs.freedesktop.org/show_bug.cgi?id=88012#c33 Newer kernels did seem to gain a new issue in my testing. As I bought the Q1900DC-itx to be a headless router/NAS/PVR, that's what I did with it. Vanilla 4.1.1, no patch or workarounds (being headless there are no i915 IRQs, so the patch would be pointless). It ran 100 days OK; after updating the kernel to 4.1.10 it locked up after 7 days, then again the next day. Booted back to 4.1.1, which again ran OK for 127 days - then updated to 4.1.18, no lockup so far (up 37 days). I don't know what was wrong with 4.1.10 or if it's just luck (but that seems unlikely). Looking at the stable commits there is a Bay Trail one just before 4.1.13 which fixes GPIO register access - maybe that is helping me now in 4.1.18. One other change I made initially - though not because of lockups - is that I disabled USB3 in the BIOS, as I have 2 USB DVB-T2 tuners and I was getting low-level packet loss on the links. It seemed to be power related, as spinning a CPU would fix it - but so did (well, 99%) avoiding xhci by turning off USB3.
Two years of problems and three lengthy and painful bisects later I finally arrived at commit 8fb55197e64d5988ec57b54e973daeea72c3f2ff (drm/i915: Agressive downclocking on Baytrail). A simple search for these terms brought me to this bug and I now know I'm not alone! Will try read-up on all comments on this bug later. Meanwhile I manually reverted the changes in mentioned commit in 4.5 and have yet to see a freeze. Will try max_cstate=1 later. HW: ASRock Q1900-ITX.
(In reply to dertobi from comment #273) > It would be interesting to see a hypothesis as to how that bug in can occur > in a headless setup. Can it still be the fault of the i915 driver in that > case? Maybe the actual x86-64 cpu architecture linux code has some > unexpected sideeffects with baytrail cpus? I updated to kernel 4.4.0-1-amd64 and - this could be what did the trick for me - installed a newly released BIOS update, which includes a microcode update (NUC6i5SYH). The uptime of my NUC is now 2 days and 14 hours without max_cstate.
Sorry... forget my post.

root@nuc:~# cat /proc/cmdline
BOOT_IMAGE=/vmlinuz-4.4.0-1-amd64 root=UUID=b4b6a796-e6c4-44b5-8d3b-0e34a2cae5c6 ro quiet crashkernel=256M nmi_watchdog=1 intel_idle.max_cstate=1

max_cstate is still set.
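Since a forgotten boot parameter can invalidate a test run, as happened above, it may help to check the command line before reporting an uptime. The following is a small sketch; `get_max_cstate` is a hypothetical helper name introduced here. It prints the last `intel_idle.max_cstate=` value on a given kernel command line, or `unset` if the parameter is absent.

```shell
#!/bin/bash
# get_max_cstate CMDLINE
# Print the value of the last intel_idle.max_cstate= parameter found in the
# given kernel command line, or "unset" if it does not appear.
# Hypothetical helper; runs in a subshell so "set -f" does not leak out.
get_max_cstate() (
    set -f                      # no filename globbing while word-splitting
    result="unset"
    for word in $1; do
        case "$word" in
            intel_idle.max_cstate=*) result="${word#intel_idle.max_cstate=}" ;;
        esac
    done
    printf '%s\n' "$result"
)

# Inspect the running kernel's command line, if it is readable.
if [ -r /proc/cmdline ]; then
    echo "intel_idle.max_cstate: $(get_max_cstate "$(cat /proc/cmdline)")"
fi
```

On systems where the intel_idle driver is loaded, `/sys/module/intel_idle/parameters/max_cstate` should show the value the driver is actually using, which is a useful cross check.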
Created attachment 211641 [details] Reverted commit 8fb55197e64... for 4.5.0 OK, I have done my homework and read the whole thread. My experiences with the BayTrail issue: HW: ASRock Q1900-ITX with J1900 onboard, like I said before. Load: HTPC with MythTV recording/showing both SD and HD DVB-C material. Noteworthy: I use an out-of-kernel compiled ddbridge module that comes with its own dvb-core code. In my experience the problems began when I started compiling 4.2.*, and I immediately blamed the non-standard ddbridge module. There may have been problems with 4.1.* that I don't remember, but I'm VERY confident the latest iterations in the 4.1.* series are rock-solid. The last stable kernel I used before venturing on my latest bisect was 4.1.20 without patches or work-arounds. Since my problems started with 4.2.0 and the 4.1 series seemed stable, I bisected between 4.1 and 4.2. This led me without a shadow of a doubt to Chris Wilson's commit 8fb55197e... Freezes tended to occur much faster as I approached this commit. On 4.2.0 and above it can take hours if not days; on 8fb55197e... it's a matter of minutes. I was surprised to end up on a commit that was not related to dvb/device code, but relieved that it precisely matched the other hardware I use, whose stability I never doubted. My HTPC is watching HD DVB-C content as we speak on 4.5.0 using the attached patch, which is a manual reversal of 8fb55197e... to the best of my knowledge. It's been up since yesterday and hasn't crashed since, but I'm sceptical since freezes took longer on later kernels anyway. So far, so good.
I lied! I've found an old mail conversation about the problem and indeed I started seeing the freezes on 3.17 like many others. I tried bisecting between 3.16 and 3.17 back then and never convincingly arrived at a commit I could blame due to the unpredictable nature. So it does seem we are looking at different bugs that (partially) got fixed somewhere in the 4.1 branch. Patched 4.5 still going strong btw.
(In reply to Juha Sievi-Korte from comment #209) > Update: Grabbed 4.5.0 for testing on affected system (Acer B-115M, N3540). > This is downloaded from opensuse repos this time, exact version: > > Linux cardhu 4.5.0-58.gb2c9ae5-default #1 SMP PREEMPT Wed Mar 16 17:30:21 > UTC 2016 (b2c9ae5) x86_64 x86_64 x86_64 GNU/Linux > > Running withtout a freeze for a week now in my normal use and stress-testing > since this morning with HD videos. I'll report back if it freezes. > > Someone asked about the desktop, I use xfce (some gnome-services running > though). Have verified the freezes with two distributions, Ubuntu and > Opensuse.

juhas@cardhu:~> uptime
 21:32pm up 19 days 13:22, 5 users, load average: 1.44, 1.59, 1.51
juhas@cardhu:~> uname -a
Linux cardhu 4.5.0-58.gb2c9ae5-default #1 SMP PREEMPT Wed Mar 16 17:30:21 UTC 2016 (b2c9ae5) x86_64 x86_64 x86_64 GNU/Linux
juhas@cardhu:~> cat /proc/cmdline
BOOT_IMAGE=/boot/vmlinuz-4.5.0-58.gb2c9ae5-default root=UUID=4e634188-9fb6-40f9-87ae-487fd31414f3 resume=/dev/disk/by-uuid/5daad161-5400-48d2-a6e5-8cc5e0f08c20 splash=silent quiet showopts

Never before have I had an uptime this long without boot parameters. It seems I'm unable to make this crash now. Anyone else this lucky with N3540 and 4.5.0? Am I forever stuck with this particular kernel version? :)
Just rolled back to vanilla 4.5 to see if I could get a stable system out of 4.5 without my patch, as Juha reports. I can't. After 2.5 hours of watching it froze, just as it always has since 4.2. So it seems that, for me at least, I need to revert 8fb55197e... With that revert, by the way, the system is so stable that I haven't seen a freeze in a week of regularly watching television.
I have known about this bug for well over a year, mostly ignored it and was content on 3.16 and 3.13. Popped in a few days ago to see the state of things and read up. I can't believe a bug that locks up the system within a few minutes or few hours has got no love. I have a J1900 as a HTPC running Ubuntu 14.04, around a year or more ago the transition from 3.16 to something higher introduced me to this issue. I was able to work around it on higher kernel versions through the BIOS settings for C-state. Never used the kernel flag. Anyway I didn't like that option so I went back to 3.16. In the interest of upgrading the system to Ubuntu 16.04 in the near future and the higher kernel version used I thought I should look in to this again. I grabbed the 4.6-RC3 source and manually by hand reverted the "Aggressive Downclocking of Baytrail" patch. I was kind of depressed to look at all the new .config changes, sure have added a lot of stuff to a non working kernel... Anyway I have seen really positive results over the past 36 hours. No lockups. While I will be the first to admit I haven't ever tried anything past 4.1 when it was in RC status a long time ago. I was usually able to lock the system up within 30 min, of bouncing between browsing and scrolling busy web pages in firefox, and Kodi starting and stopping videos(anything I could think to make the GPU up/down threshold shift). I can't seem to make it lockup at all now. I know from reading many people have thought they had it licked, only to post back a few days later that it didn't. Probably the case here too, but something has changed over the last year because like I said previously I was able to lock it up within minutes from anything over 3.16 to 4.1-RC when I messed with this bug last time, and it's lasting days+? so far.
I've read all/most related threads and to me this appears to be the status quo.

'quick' fixes:

# intel_idle.max_cstate

Adding intel_idle.max_cstate=1 OR intel_idle.max_cstate=0 to the kernel parameters seems to work for most people, but leaves the processor running even when it should be idle (not energy efficient and causes more heat).

# Kernel 4.5+ with commit reversal

Using kernel 4.5+ without commit 8fb55197e64d5988ec57b54e973daeea72c3f2ff (drm/i915: Agressive downclocking on Baytrail). Some people mention positive results when reverting this commit on earlier kernel versions as well.

# intel_pstate=disable

Some have mentioned that setting the intel_pstate=disable kernel parameter helps, but others confirmed it did not help in their case.

Problem background:

# Irregular

The issue does not appear on a regular basis; some have reported a working system for over a day (+1 for me) and then it crashes twice in an hour.

# Confirming

There are no/limited logs and as such it is difficult to tell whether everyone in these threads is actually experiencing the same issue.

# cstate & pstate information from Intel (posted by Chris Rainey)

1. C-states and P-states are very different (https://software.intel.com/en-us/blogs/2008/03/12/c-states-and-p-states-are-very-different)
2. Power Management States: P-States, C-States, and Package C-States (https://software.intel.com/en-us/articles/power-management-states-p-states-c-states-and-package-c-states)
3. (update) C-states, C-states and even more C-states (https://software.intel.com/en-us/blogs/2008/03/27/update-c-states-c-states-and-even-more-c-states)

Real fix?

A real fix has yet to be found... In the commit which some people have reverted (https://patchwork.freedesktop.org/patch/45755/) Wilson and Deepak (from Intel) are named, and in a later message Wilson states "Why those vlv_punit_read() result in a machine hang was never understood." (https://lists.freedesktop.org/archives/intel-gfx/2016-January/084206.html). I'll CC both of them to this thread.

To-do:

This issue affects kernels up to 4.5 (as far as I can tell from the discussion). 4.4 for sure (experiencing the issue on the latest 4.4 myself now).
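For readers who want to try the first 'quick' fix from the summary above, this is roughly what it looks like on a GRUB 2 based distribution. It is a sketch under assumptions: the file location `/etc/default/grub` and the `quiet` option are typical defaults, not taken from this thread, and any options already present on your system should be kept.

```shell
# /etc/default/grub (fragment; assumes a GRUB 2 based distribution)
# Keep the options that are already there and append the workaround.
GRUB_CMDLINE_LINUX_DEFAULT="quiet intel_idle.max_cstate=1"
```

After editing, the GRUB configuration has to be regenerated (`update-grub` on Debian, Ubuntu and Mint; `grub-mkconfig -o /boot/grub/grub.cfg` on many other distributions) and the machine rebooted. `cat /proc/cmdline` then shows whether the parameter is active.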
(In reply to Koen L from comment #284) > I've read all/most related threads and to me this appears to be the status > quo. > > 'quick' fixes: > > # intel_idle.max_cstate > > Adding intel_idle.max_cstate=1 OR intel_idle.max_cstate=0 to the kernel > parameters seems to work for most people, but leaves the processor running > even when it should be idle (not energy efficient and causes more heat). > > # Kernel 4.5+ with commit reversal > > Using kernel 4.5+ without commit 8fb55197e64d5988ec57b54e973daeea72c3f2ff > (drm/i915: Agressive downclocking on Baytrail). Some people mention positive > results when reverting this commits on earlier kernel versions as well. > > # intel_pstate=disabled > > Some have mentioned that setting the intel_pstate=disabled kernel parameter > helps, but others confirmed it did not help in their case. > > Problem background: > > # Irregular > > The issue does not appear on a regular basis, some have reported a working > system for over a day (+1 for me) and then it crashes twice in an hour. > > # Confirming > > There are no/limited logs and as such it is difficult to tell whether > everyone in these threads is actually experiencing the same issue. > > # cstate & pstate information from Intel (posted by Chris Rainey) > > 1. C-states and P-states are very > different(https://software.intel.com/en-us/blogs/2008/03/12/c-states-and-p- > states-are-very-different) > > 2. Power Management States: P-States, C-States, and Package > C-States(https://software.intel.com/en-us/articles/power-management-states-p- > states-c-states-and-package-c-states) > > 3. (update) C-states, C-states and even more > C-states(https://software.intel.com/en-us/blogs/2008/03/27/update-c-states-c- > states-and-even-more-c-states) > > Real fix? > > A real fix has yet to be found... 
In the commit which some people have > reverted (https://patchwork.freedesktop.org/patch/45755/) Wilson and Deepak > (from Intel) are named and in a later message Wilson states "Why those > vlv_punit_read() result in a machine hang was never understood." > (https://lists.freedesktop.org/archives/intel-gfx/2016-January/084206.html). > > I'll CC both of them to this thread. > > To-do: > > This issue affects kernels up to 4.5 (as far as I can tell from the > discussion). 4.4 for sure (experiencing the issue on latest 4.4 myself now). Thanks for that comprehensive summary! The only thing that I want to add is: # The bug is still occurring on the latest kernel 4.6rc3 and git.
No problem, we all want to get this fixed! I actually ended up CC-ing every person mentioned in the 'signed-off-by' of this patch. > Signed-off-by: Chris Wilson <chris@chris-wilson.co.uk> > Cc: Deepak S <deepak.s@linux.intel.com> > Cc: Ville Syrjälä <ville.syrjala@linux.intel.com> > Cc: Rodrigo Vivi <rodrigo.vivi@intel.com> > Cc: Daniel Vetter <daniel.vetter@ffwll.ch> Fairly certain they should be able to give us some more pointers as to how to properly fix this issue.
I have to add that the regression presents on all kernel versions after 3.16, so commit 8fb55197e64d5988ec57b54e973daeea72c3f2ff drm/i915: Agressive downclocking on Baytrail was not the true cause, at least not for me. Since it was merged only after 4.2RC, but I experienced the freeze on 4.1 as well, easily in hours to days (someone above mentioned it already happens on 3.17 as well). Otherwise it could two different issues we're talking about in this thread. For kernel freezes starting from 3.16, "the commit" was https://git.kernel.org/cgit/linux/kernel/git/torvalds/linux.git/commit/?id=31685c258e0b0ad6aa486c5ec001382cf8a64212 drm/i915/vlv: WA for Turbo and RC6 to work together. as far as I can tell, reverting it or simply applying this patch https://github.com/OpenBricks/openbricks/blob/master/packages/system/linux/patches/4.0/linux-999-i915-use-legacy-turbo.patch seem to do the trick.
I'm on an Acer Aspire E 15 E5-511-P7AT laptop with an Intel N3540. I had to wait for kernel 4.5 to solve the hanging issue on shutdown/reboot, but the random freezing issue still isn't fixed. I'm using Mint 17.3. I could downgrade the kernel, but that would bring back the hanging issue on shutdown/reboot. The freezing issue only happens randomly: sometimes after a few hours, sometimes after a few days or even a week, but always when the graphics are in use. I tried the latest Intel graphics drivers, but the issue still isn't solved. The laptop is 16 months old right now, and it's the first time I've run into a bug that wasn't solved in that time. It seems to happen only when the graphics are in extended use, like HD viewing in VLC or streaming HD videos on YouTube at more than 720p.
I'm still seeing rapid freeze if I remove cstate=1 (4.3.6 - 4.6.-rx). I am also seeing a new unrelated freeze with 4.6-rc3->next20160413 sometimes when I plug or unplug a flash drive from a USB hub. This USB problem has never occurred with older kernels I've used. Same symptom, cstate ineffective... (Asus T100-CHI, Z3775)
@Mort Yao Thanks for pointing this out! I'm trying out the intel_idle.max_cstate=1 option first now (seems to work) because we've got some of these systems in production. Afterwards I will use our test-system to check whether a kernel patch completely fixes the issue.
(In reply to jbMacAZ from comment #289) > I'm still seeing rapid freeze if I remove cstate=1 (4.3.6 - 4.6.-rx). I am > also seeing a new unrelated freeze with 4.6-rc3->next20160413 sometimes when > I plug or unplug a flash drive from a USB hub. This USB problem has never > occurred with older kernels I've used. Same symptom, cstate ineffective... > (Asus T100-CHI, Z3775) I already wrote about how my phone rebooting while connected to the usb port causes my system to freeze, while that's not exactly the same thing you're reporting I feel it could be related. Question for you and experts in USB: - Is there a sudden drop/surge in power when plugging/unplugging a flash drive? Because that's what I think is happening with my phone, it normally gets a charge from the port, then no charge, and then when the booting (of the phone) starts an unusual increase in the power the phone draws from the usb port, which then somehow or another influences the CPU or other components to cause that dreadful freeze. Recently I had a freeze (caused by my phone) with the usual symptoms but the audio that was currently running was in a weird 1 second loop going on seemingly forever.
(In reply to dertobi from comment #291) > (In reply to jbMacAZ from comment #289) > > I'm still seeing rapid freeze if I remove cstate=1 (4.3.6 - 4.6.-rx). I am > > also seeing a new unrelated freeze with 4.6-rc3->next20160413 sometimes > when > > I plug or unplug a flash drive from a USB hub. This USB problem has never > > occurred with older kernels I've used. Same symptom, cstate ineffective... > > (Asus T100-CHI, Z3775) > > I already wrote about how my phone rebooting while connected to the usb port > causes my system to freeze, while that's not exactly the same thing you're > reporting I feel it could be related. > > Question for you and experts in USB: > - Is there a sudden drop/surge in power when plugging/unplugging a flash > drive? A rebooting phone could certainly provoke lots of otherwise latent bugs in a USB handler. It would be a worthy test case for both hardware and firmware Q/A. My hub is (externally) powered. So USB power draw shouldn't be affecting my device. Since I build my kernels elsewhere, this problem is unmistakeably recent. And it is easily avoided by using an older kernel. Ultimately, these two freezes probably merit their own bug reports.
I'm evaluating Mort Yao's idea of reverting 31685c258e0b0a..... and so far have not seen freezes on watching DVB-C HD content. I have however experienced two crashes while watching flash content (in Chrome). The problem is, I don't trust the flashplayer anymore, so I'm reluctant to say the patch isn't valid (for me).
Unfortunately I experienced another hang yesterday (after one week's stable use), so the patch I mentioned in the last comment isn't valid for me anymore. On the other hand, reverting the complete commit 31685c2 isn't really easy -- the old revision of the module won't compile together with the current 4.x kernel codebase. I'd like to hear if anyone had any success doing that. However, a proper fix is yet to be found.
My patch in comment 279 applies cleanly against 4.5.0 and 4.5.1 and resolves the problems for me (at least for a very, very long time). You could give that a try? I agree it's not as clean as your second patch, but like others I suspect we're looking at different problems that originated in the 3.17 and 4.2 branches. For me, the 3.17 problem seems solved in 4.1.x (for recent x) and my patch reverts whatever causes problems in the 4.2 and up series.
@Martin thanks for bringing this to my attention. Yesterday's freeze was on a 4.5 kernel (with only the legacy-turbo patch applied). It seems I should try to revert 8fb55197e64 also, since there seem to be two very different causes of freezes! I'm currently on vanilla 4.1.6 (for me it's known to freeze once or twice a month; that's already much better than all kernels 4.2+), but I'm planning to try both: 1. Apply both the 8fb55197e revert patch and the legacy-turbo patch on 4.5. 2. Apply the legacy-turbo patch on 4.1.x. Will see how it goes.
Same problem here.. I can't do nothing when the system freezes. With the intel_idle.max_cstate=1 flag it's ok but consumes more power :/
*I can't do anything
Hi, I'm using an ASRock Q1900 (Intel Celeron J1900 Baytrail) board with an Nvidia GT720 GPU and I don't get any hangs at all with Arch Linux (kernel 4.4.1-2-ARCH) + Kodi 16.1. I'm wondering if you guys are using the onboard GPU? I guess I could switch GPUs to try. On my Intel Atom Z3770 Baytrail tablet (HP Omni10) the only way to get it even booted to Android-x86 is with intel_idle.max_cstate=1.
(In reply to Hal from comment #199) > Interesting findings today: > > 2) I was given an Intel Nuc box for testing which turned out to be identical > to mine, with same N3050. Duplicated my drive with DD and removed > intel_idle.max_cstate=1. It kept working all day without missing a beat! I > remove cstate from my own machine it freezes within the hour. So bizarre... > This is something I posted several weeks ago. Since then I have been using both boxes in parallel, for same type of daily tasks (some web browsing, occasional video playback, some Netflix, lots of background music playing). One of these boxes has a processor (N3050) with a stepping older than the other. That one doesn't show any freezing symptoms. The newer box (with the same processor but more recent stepping) needs intel_idle.max_cstate=1 to run without freezing, otherwise it fails quite regularly, within a couple of hours after booting. Hal
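For anyone who wants to compare two boxes the same way, the model name and stepping can be read from /proc/cpuinfo. Below is a minimal sketch, assuming the usual x86 layout of that file ("key : value" lines, one block per logical CPU separated by a blank line). The function name is made up for illustration.

```python
def cpu_identity(cpuinfo_text):
    """Return (model name, stepping) for the first CPU block in /proc/cpuinfo text.

    Either element is None if the corresponding field is not present.
    """
    fields = {}
    for line in cpuinfo_text.splitlines():
        if not line.strip():
            break  # a blank line ends the first processor block
        key, sep, value = line.partition(":")
        if sep:
            fields[key.strip()] = value.strip()
    return fields.get("model name"), fields.get("stepping")


def local_cpu_identity(path="/proc/cpuinfo"):
    """Convenience wrapper that reads the running system's cpuinfo."""
    with open(path) as handle:
        return cpu_identity(handle.read())
```

Running local_cpu_identity() on both machines and posting the two results alongside the freeze behaviour would make stepping-related reports easier to compare.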
We have 50 ASRock Q1900-ITX boards. Some work without issues, some work with cstate=1, some freeze anyway and need kernel 3.16. Only kernel 3.16 made all of them work without this issue. For another issue on another mainboard type we went back to 3.13, but some of those boards don't support a resolution of 1280x1024 via VGA this way. So we had to differentiate: this CPU = this kernel, that CPU = that kernel. Now we run Baytrail with 3.16, but I plan to compile a custom 4.4 or 4.5 kernel as explained before.
Did this patchset ever get merged? sounds suspiciously similar. http://lkml.iu.edu/hypermail/linux/kernel/1503.3/00271.html
@Dan, I tested these patches. There is a slight improvement but the system still hangs at some point. At least the mmc bus does, that is.
I was told to try the patches from here: (for 4.4) https://github.com/fritsch/linux/commit/8b48465bd197e2f4891a3f9c5737bb13981d1c94 and here: (for 4.5) https://bugs.freedesktop.org/show_bug.cgi?id=88012#c33 Which I will try later, but I want to encourage others to do the same.
I can confirm this issue to a certain degree. Using a Dell Inspiron, with an Intel N3050 (stepping 3), I can boot and run kernel 4.2.0-35-generic (from Ubuntu 14.04), and also the 4.2 kernel shipped with Fedora 23, but no version higher than 4.2. I get a black screen immediately after booting any kernel >4.2 using any distro available (fedora, opensuse, ubuntu, etc). Disabling pstates did not help in this case (running kernel 4.5). If there's anything I can do to help debug this issue further, please let me know.
(In reply to lewexeki from comment #27) > Hi, > > I had the same problem with "Intel(R) Pentium(R) CPU N3520 @ 2.16GHz". > With kernel 4.2.0-16.19 there were ~5-8 freezes/day. After upgrading to > 4.3.3-040303-generic (ubuntu version) it was much better: 1/2 freezes/day. > With cstate=1 there has not been one yet. I have an N3520 BayTrail and I am using kernel 4.0.6 with cstate=1 as well right now. Since I set cstate=1 my Asus notebook hasn't frozen (it's been about 10 days already).
I have an Intel(R) Pentium(R) CPU N3520 @ 2.16GHz BayTrail in my Asus notebook. I tested cstate=1 with kernel 4.0.9 and it hasn't frozen for about 10 days now. Can somebody tell me in which kernel version the bug will be completely fixed?
My glass ball says kernel 6.6.6 will be useable.
When all computers in the market have an intel processor bay trail. For now ( only ) affects 40%!! of all PC's in the market.
Hi all, for the ASRock Q1900ITX-DC I have found a workaround: I turned off C-states in the BIOS (UEFI). Uptime is more than a week now. I cannot say that I'm happy with this solution, but it lets me wait until this bug is fixed.
(In reply to GConst from comment #312) > Hi all, for asrock q1900itx-dc I have found workaround: i turned off cstate > in BIOS (UEFI). uptime more than week, i cannot say that im happy with this > solution, but it allowed me wait untill this bug will be fixed. That's pretty much the same workaround we already have, except you're doing it in the BIOS instead of the kernel boot command line. And that makes total sense.
Not sure if this has anything to do with our problem: https://www.dragonflydigest.com/2016/04/04/17888.html It says: If you remember this Baytrail problem, Daniel Bilik has gone and found a fix, as this appears to be a cross-platform bug, and he has patches for DragonFly. http://lists.dragonflybsd.org/pipermail/users/2016-April/228682.html http://gitweb.dragonflybsd.org/dragonfly.git/commitdiff/5d8e0f49ad2ab6201288c8b4f5ebb966f27e5779 http://lists.dragonflybsd.org/pipermail/users/2016-March/228645.html Perhaps it helps. Good luck.
Sorry, it seems to be just the fix from Mika (https://bugzilla.kernel.org/show_bug.cgi?id=109051#c203). So nothing new, I guess....
I believe that I have run into this problem running Ubuntu on an NUC clone with Intel's J1900 processor. Does anyone know if the freezing issue is confined to machines based on Bay Trail chips or is it more widespread than this?
I recently experienced unprecedented hangs while idle (HTPC unreachable in the morning, so without GPU involvement) and while watching Flash video. Although watching DVB-C content was very stable with 8fb55197e... reverted, I am now back to vanilla 4.5.2 (soon .3) using intel_idle.max_cstate=1. That does seem to be the magic bullet after all. Apparently I don't quite understand the *states very well, because with this option set, I still see the GPU enter rc6 in powertop? Is that less efficient than c*? I do see that the package is not going into pc2-7.
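On the *states question: as far as I understand it, intel_idle.max_cstate only limits the CPU core C-states, while RC6 is a GPU power state handled by the i915 driver, so seeing RC6 in powertop with the option set would be expected. To check which core idle states a CPU is actually entering without powertop, the kernel's cpuidle counters can be read from sysfs. This is a minimal sketch, assuming the standard cpuidle sysfs layout (stateN directories containing name, usage and time files). The function name is my own.

```python
from pathlib import Path


def read_cpuidle_states(cpu_dir="/sys/devices/system/cpu/cpu0/cpuidle"):
    """Return a list of (name, usage_count, time_us) tuples, one per idle state.

    States are returned in numeric order (state0, state1, ...).
    """
    state_dirs = sorted(
        Path(cpu_dir).glob("state[0-9]*"),
        key=lambda p: int(p.name[len("state"):]),
    )
    states = []
    for state in state_dirs:
        name = (state / "name").read_text().strip()
        usage = int((state / "usage").read_text())   # times the state was entered
        time_us = int((state / "time").read_text())  # total residency, microseconds
        states.append((name, usage, time_us))
    return states
```

If the workaround is active, the counters of the deeper states should stay flat between two calls, or those states should not be listed at all.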
re: comment316 MarkB, this sighting is specific to baytrail.
I think I have run into this issue on my NUC with an Intel(R) Celeron(R) CPU N3150 @ 1.60GHz (Braswell). Basically, under any significant load (a gcc compile of the Linux kernel, for example) the system reboots after a random amount of time. I've tried several 4.x series kernels and have been able to reproduce the bug on all of them so far. Adding intel_idle.max_cstate=1 as suggested in this thread seems to mitigate the bug, although it locks the CPU at 2167 MHz. I am not using X; I'm running Arch Linux with only a couple of services enabled (dhcpd and hostapd), as I'm using it mainly as a firewall/AP.
I have a problem with random freezing on kernel 4.2. intel_idle.max_cstate=1 didn't help me.
Your processor is an Intel BayTrail? Did you update-grub after the changes? Can you try kernel 3.13.0-85-generic instead?
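To verify that the option actually reached the running kernel (for instance after editing the bootloader config without regenerating it), /proc/cmdline shows the command line the kernel was booted with. This is a minimal sketch and the helper names are just illustrative.

```python
def kernel_param(cmdline, name):
    """Return the value of `name` in a kernel command line string, or None.

    If the parameter appears more than once, the last value seen is returned
    (an assumption about which one takes effect, not something verified here).
    """
    value = None
    for token in cmdline.split():
        key, sep, val = token.partition("=")
        if sep and key == name:
            value = val
    return value


def booted_max_cstate(path="/proc/cmdline"):
    """Read intel_idle.max_cstate from the running kernel's command line."""
    with open(path) as handle:
        return kernel_param(handle.read(), "intel_idle.max_cstate")
```

If booted_max_cstate() returns None after a reboot, the parameter never made it onto the command line, regardless of which bootloader is in use.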
Gabriel7340: I think the N3150 is Braswell which is a refresh of Baytrail? [1,2] Should I file a separate bug? I am not using grub I'm using systemd-boot. [1] http://www.extremetech.com/extreme/202389-intel-quietly-launches-14nm-braswell-bay-trails-successor [2] http://www.cnx-software.com/2015/04/01/intel-introduces-celeron-n3000-n-3050-n3150-and-pentium-n3700-low-power-braswell-processors/
(In reply to Gabriel7340 from comment #321) > Your processor is an Intel BayTrail? Did you update-grub after the changes? > Can you try kernel 3.13.0-85-generic instead? Sorry, comment #320 is not valid. I have now used intel_idle.max_cstate=2. Right now it's at 6 hours of uptime without freezes. Are there any patches that fix this issue?
The freeze recurred for me today on 4.1 with the legacy-turbo patch, for no apparent reason (no GPU- or CPU-intensive processes were running). So I would say no, there is no valid patch that completely fixes the issue at this point. (4.1 indeed performs better than 4.2+, so that would be another lockup issue on 4.2+.)
(In reply to yuriy from comment #323) > (In reply to Gabriel7340 from comment #321) > > Your processor is an Intel BayTrail? Did you update-grub after the changes? > > Can you try kernel 3.13.0-85-generic instead? > > Sorry comment #320 is no valid. I have used intel_idle.max_cstate=2. Right > now its 6 hours up-time without freezes. > > Is there some patches that fix this issue? http://www.hardwaresecrets.com/celeron-n3150-cpu-review/ "They come to replace the Bay Trail-D CPUs, actually using the same microarchitecture, ..." I think you are right. For now the best solution is the intel_idle.max_cstate workaround. You can find some patches but I'm not sure if they work. Another solution could be to compile the kernel without some commits like "drm/i915: Agressive downclocking on Baytrail".
I had freezes on an Acer laptop with a Pentium N3540 (Bay Trail). Now I'm using 3.13.0.85 and I can confirm no freezes at all. From 3.16 and above, 4.1.12 is the one that works best for me, freezing after many hours. With 4.2 and above the problem gets worse, with the system freezing a few minutes after switching on.
I figured I'd post my experience with this bug and how I avoid it. Run into this problem ever since I bought my Inspiron 3000 with N3540 cpu. Running opensuse tumbleweed, always with latest Kernel. System would freeze up, usually within 15 minutes of booting. I originally thought it was my SSD as it would often happen when accessing the disk, but it got worse and would eventually happen even when sitting idle under little load. Also, fan would also run at full speed from boot to crash. If I suspend to ram as soon as the system has booted to the desktop, then bring the system out of suspend, the problem nearly goes away. The fan will work normally (it rarely kicks on unless I'm doing something crazy) and I can go long stretches without a crash. The problem is not completely gone...I'll crash maybe once a week. But it is much better than every 15 minutes. And if I forget to do the suspend trick after a reboot, I'm reminded quickly as it will crash within minutes EVERY time. I have never modified the idle_cstate as others have suggested. Perhaps my experience can help someone.
(In reply to Austin from comment #327) > I figured I'd post my experience with this bug and how I avoid it. > > Run into this problem ever since I bought my Inspiron 3000 with N3540 cpu. > Running opensuse tumbleweed, always with latest Kernel. > > System would freeze up, usually within 15 minutes of booting. I originally > thought it was my SSD as it would often happen when accessing the disk, but > it got worse and would eventually happen even when sitting idle under little > load. Also, fan would also run at full speed from boot to crash. <snip> I also have your model of Dell laptop. intel_idle.max_cstate=1 does work on my Dell. I've run various kernels from 3.19 - 4.5 with Mint, Manjaro and Cubuntu. Without ..cstate, I experience the screen freeze and runaway fan speed.
Like others, I've been also fighting this for several months. But it seems that _combination_ of "tentative" patches from Mika Kuoppala (see comment #c202) _and_ "legacy turbo" patch (comments #c93, #c98 and #c287) has finally stabilized i915 driver on my system (Asrock Q1900-ITX) to run it with deeper C-states enabled. See this post... http://lists.dragonflybsd.org/pipermail/users/2016-May/249603.html ... so that I don't repeat myself. :) HTH.
(In reply to Daniel Bilik from comment #329) > Like others, I've been also fighting this for several months. But it seems > that _combination_ of "tentative" patches from Mika Kuoppala (see comment > #c202) _and_ "legacy turbo" patch (comments #c93, #c98 and #c287) has > finally stabilized i915 driver on my system (Asrock Q1900-ITX) to run it > with deeper C-states enabled. See this post... > > http://lists.dragonflybsd.org/pipermail/users/2016-May/249603.html > > ... so that I don't repeat myself. :) > > HTH. Thanks for your research Daniel, that looks promising.
(In reply to Daniel Bilik from comment #329) > Like others, I've been also fighting this for several months. But it seems > that _combination_ of "tentative" patches from Mika Kuoppala (see comment > #c202) _and_ "legacy turbo" patch (comments #c93, #c98 and #c287) has > finally stabilized i915 driver on my system (Asrock Q1900-ITX) to run it > with deeper C-states enabled. See this post... Can I ask which kernel and processor family you are running? I can't seem to replicate your success on my setup (various patched kernels 4.2 - 4.6rc, Atom Z3775). While I can't definitively rule out a hardware platform issue, I am freeze free with ..cstate=1. Newer kernels do take longer before freezing than older ones.
(In reply to jbMacAZ from comment #331) > Can I ask which kernel and processor family you are running? I run Dragonfly BSD on an Asrock Q1900-ITX with an Intel Celeron J1900. Dragonfly has its drm infrastructure imported from the linux kernel, with both intel and amd drivers being regularly updated. I started to experience machine freezes when the i915 driver in Dragonfly was synced to what's in linux 4.0, and I had to limit the CPU to C1. When Dragonfly synced i915 to linux 4.1, it made my system stable again, even with deeper C-states. But with the update to a version from linux 4.2, freezes were back again. I was struggling with this for months, but with the two patches I mentioned in my previous post my system has been running stable for several weeks now, with deeper C-states enabled. In the meantime, the i915 driver in Dragonfly was synced with linux 4.3, so I've updated my system this week, keeping the patches, and it still runs stable (it's been just a few days, but without the patches I was experiencing a freeze practically each day). > While I can't definitively rule out a hardware platform issue, I am freeze > free with ..cstate=1. Well, because I use the i915 driver in Dragonfly, I can't really confirm that the patches solve freezes on linux completely. But so far, they seem to be sufficient to make my system stable with deeper C-states, so I can definitely say the patches positively influence the stability of the i915 driver on Baytrail. > Newer kernels do take longer before freezing than older ones. In my experience, the system uptime and/or load doesn't seem to matter. Sometimes the system was running stable for two days, sometimes it froze after two hours. In fact, it always froze when the system was "doing nothing" and I just moved the mouse pointer or scrolled an already loaded page in firefox.
(In reply to Daniel Bilik from comment #332) > (In reply to jbMacAZ from comment #331) > I appreciate the information and insights. Perhaps there are additional factors affecting freezes from outside the drm code that aren't present in dragonfly. ---- On a different subject, is anyone getting a blank screen lockup starting with 4.6-rc7 and 4.5.4? System runs for a while, seems fine and then suddenly the screen goes black, locked up. I think maybe some of the bug fixes for this freeze bug may be almost right but now the symptom has changed from a static display to a black screen. Just a feeling so far, but it needs the same hard reset to recover, so no dmesg to inspect. Less recent kernels are still stable with cstate, so I don't think it's a hardware fault. Are the hunter patches now obsolete in 4.6-rc7/4.5.4? My tests still use 2 of them that I had to use in earlier kernels. If they aren't needed anymore, using them could explain this new issue.
I installed the CloudReady distro (http://www.neverware.com/) and there are no freezes anymore. It is so strange, because this version uses Linux kernel 4.0.5! In all the Linux distros I have used, I get freezes with any kernel above 3.16.7. What is the difference in the Chromium OS Linux kernel?
Anyone tried to install a linux-libre kernel and see if it would work better? I'm planning on trying the one from here: http://linux-libre.fsfla.org/pub/linux-libre/freesh/pool/main/l/linux-4.5.4-gnu/ But prior to doing it I would like any feedback you might provide, as I have no experience with linux-libre kernels and what I will be missing (understand breaking in my system) once I install it. Regarding my systems' freeze status since I applied intel_idle.max_cstate=1; well no more freezing but both machines run noticeably warmer. Those boxes are small and cramped. They only have passive cooling. One other thing I noticed and which has alarmed me is that on both machines one of the cores runs at 100% for long periods (tens of minutes), then falls to normal levels for a minute or so and then goes up again creating a cycle. I don't remember having seen that when I first started to use cstate=1, so I am not sure if the two are connected. But, I am certain that something is wrong with this behaviour. Hal
I spent more than 4 months trying to solve the freezing problem on my laptop with an Intel Pentium N3540 (Bay Trail). Reading this thread, I found the solution to my problem: setting intel_idle.max_cstate=1. Since then I have had no further problems. However, I do not quite understand what I have changed by setting that option, or what long-term problems it may cause on my laptop.
I applied Daniel's patches (comment 329) to 4.5.4 but alas, freezes at last. Back to max_cstate=1 again.
Hope I'm not too optimistic, but I'm trying 4.6.0-rc7-g44549e8 (I'm using Arch Linux and this is what was available in the AUR 2 days ago) and so far I've experienced no crashes (almost 2 days of continuous uptime with normal use of the system). Previously the RC3 version crashed as well.
On my Dell 3531 with the Baytrail processor I have linux-image-4.6.0-rc7-amd64 installed from Debian experimental (https://packages.debian.org/experimental/kernel/linux-image-4.6.0-rc7-amd64). I still get the crashes without intel_idle.max_cstate=1. With intel_idle.max_cstate=1, no crashes.
(In reply to Hal from comment #335) > Anyone tried to install a linux-libre kernel and see if it would work > better? > > I'm planning on trying the one from here: > http://linux-libre.fsfla.org/pub/linux-libre/freesh/pool/main/l/linux-4.5.4-gnu/ > linux-libre v4.5.4 from the above repo installed well and worked on Linux Mint 17.3 but froze eventually. So the binary-free version of the kernel is not any better than the regular kernel. Hal
Well I have now 3 solid days of uptime... so for me 4.6.0-rc7-g44549e8 seems to work pretty well (my CPU is a celeron N2930). I'm not really an expert so I would assume that g44etc is the commit. Worth checking what is the one used for the debian RC7 build... and what has been done in between (assuming it's not a later one and they broke it again :) )
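On the "g44etc" question: in git-describe style version strings, a trailing -g followed by hex digits is the abbreviated commit hash, so 44549e8 would be the commit to look up. Here is a small sketch for pulling that out of a release string such as the output of uname -r. The helper name is hypothetical and the pattern assumes an abbreviation of at least 7 hex digits.

```python
import re

# "-g" followed by 7 to 40 hex digits, ending the string or followed by "-"
_COMMIT_SUFFIX = re.compile(r"-g([0-9a-f]{7,40})(?:-|$)")


def commit_from_release(release):
    """Return the abbreviated commit hash embedded in a release string, or None."""
    match = _COMMIT_SUFFIX.search(release)
    return match.group(1) if match else None
```

Distro builds such as 4.6.0-040600rc7-generic carry no such suffix, so for those the packaging changelog would be the place to look instead.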
(In reply to Maurizio from comment #341) > Well I have now 3 solid days of uptime... so for me 4.6.0-rc7-g44549e8 seems > to work pretty well (my CPU is a celeron N2930). > > I'm not really an expert so I would assume that g44etc is the commit. Worth > checking what is the one used for the debian RC7 build... and what has been > done in between (assuming it's not a later one and they broke it again :) ) Also wanted to add that sensors now report a cpu core temperature 10 degree lower than with a 4.5 kernel with max_cstate=1 ...
Digging around a little, I am seeing many people use the word 'latency' and suggest that one cause of the problem may be that the interrupts issued to wake up the CPU from a deeper idle state are somehow causing the freezing issue. Coupled with the talk above about alternative kernels, I wanted to ask whether anyone has tried any of the alternative Ubuntu kernels: low latency, real time, etc.?
(In reply to Maurizio from comment #341) > Well I have now 3 solid days of uptime... so for me 4.6.0-rc7-g44549e8 seems > to work pretty well (my CPU is a celeron N2930). Indeed, looking through changes merged into 4.6-rc7 in past weeks, there are several commits to i915 driver claiming to solve hangs. And specifically for Baytrail, these two are interesting the most: https://git.kernel.org/cgit/linux/kernel/git/torvalds/linux.git/commit?id=4ea3959018d09edfa36a9e7b5ccdbd4ec4b99e49 https://git.kernel.org/cgit/linux/kernel/git/torvalds/linux.git/commit?id=1b3e885a05d4f0a35dde035724e7c6453d2cbe71 If I read it correctly, the first one fixes the same problem with rps thresholds that one of Mika's "tentative" patches was trying to. And second one fixes "timing cruical" ringbuffer issue that IMHO could be causing hangs at random places. BTW, timing may be that additional factor mentioned by jbMacAZ in comment #c333, and it would explain why "legacy turbo" + "tentative" patches work for me and not for others - timing on Linux vs. Dragonfly definitely is different. Anyway, I've swapped my "combo" patch for those commits mentioned above, and I'm currently testing it. Because, to be honest, the patches I've been using so far, despite making my system stable with deep C-states, smell a little "hackish". And those commits, besides being "official", look more like "the proper solution".
I'm up to 22 hours uptime now with 4.6 vanilla without intel_idle.max_cstate=1. I'm using the ubuntu built packages on debian 8 (afaik there are no external patches). I have a lenovo ideapad 100s with an atom Z3735F http://kernel.ubuntu.com/~kernel-ppa/mainline/v4.6-yakkety/
4.6.0-1 from the Arch testing repo just brought down my Z3735F.
(In reply to Daniel Bilik from comment #344) > (In reply to Maurizio from comment #341) > ... and I'm currently testing it. No luck, got freeze in less than a day. Commits 4ea3959+ and 1b3e885+ alone do not seem to be enough to prevent hangs. Back to my (somewhat dirty but working) patchset.
Vanilla 4.6.0+ without max_cstate=1 still freezes up for me too, sticking to max_cstate=1 for now.
Well, it was too good to be true :) After 4 full days of uptime I got a crash a few minutes ago. Just rebooted the machine; I will let it run again to see whether it was luck or whether I can at least get some days of uptime now. I will also update the kernel to the latest the next time it crashes.
I am having issues with 4.6. ..cstate=1 no longer prevents ordinary display freeze (GUI locked, CPU activity = 0%.) intel_idle.max_cstate=1 had been a reliable workaround (for me) since 4.2.6. Asus T100CHI (Z3775) Ubuntu 16.04. Kernel minimally patched for Bluetooth device ID and other hardware bits not yet supported by main-stream. Patches proven in earlier kernels, pruned as necessary with each new kernel releases. 4.6-rc5 was still freeze free with cstate, unsure of rc6, rc7 also froze with black screen or soft freeze (mouse cursor freely moved, display updating only once every 1-2 minutes). I choose to be optimistic, that the freeze bugs are being worked on now and another edit or two will finish fixing them. It sounds like the current changes work, as is, for other systems.
Do the people committing the fixes on Linux now know about the testing we are doing here? Could someone here with some authority notify them?
Please everyone, keep this on topic. If your cursor updates when you move your mouse, this is not your bug. If your screen turns black, this is not your bug. If you can still SSH into or ping your device, this is not your bug. If it's just some application dying, this is not your bug. If you still see freezes after max_cstate=1, this is not your bug. There may be other problems with Bay Trail that might show some of these symptoms, but this is not the correct Bugzilla entry to discuss them.
I'm using now version: 4.6.0-040600rc1-generic #201603261930 SMP Sat Mar 26 23:32:43 UTC 2016 x86_64 x86_64 x86_64 GNU/Linux Kernel without any crash! :-)
(In reply to Gabriel7340 from comment #353) > I'm using now version: > 4.6.0-040600rc1-generic #201603261930 SMP Sat Mar 26 23:32:43 UTC 2016 > x86_64 x86_64 x86_64 GNU/Linux > > Kernel without any crash! :-) Tried 4.6 RC7 from the Ubuntu kernel release page. 4.6.0-040600rc7-generic #201605081830 SMP Sun May 8 22:32:57 UTC 2016 x86_64 x86_64 x86_64 GNU/Linux Full lock-up after a few minutes. Reverted to max_cstate=2.
Updated to 4.6.0-g2dcd0af, no luck: it froze after half a day. Green screen, completely hung. Either they broke it again or the 4 days of uptime were just a lucky shot. Anyway, is the maintainer of this component actually aware of the bug? This is still in "NEW" state with no official updates (it also lists kernels up to 4.2, while 4.6 is also affected). The max_cstate setting is not really a proper workaround: the power consumption as well as the temperature goes up dramatically.
> Anyway, is actually the maintainer of this component aware of the bug? This This bug is assigned to Len Brown and he has commented here, so *he* at least is aware of this. However, I fear (and this has already been mentioned in earlier comments) that this bug report has long since lost any usefulness it might once have had and has just turned into a dumping ground for random comments and updates, and now reads like some web forum thread.
(In reply to Andrew Clayton from comment #356) > However, I fear (and has already been mentioned in earlier comments) this > bug report has long since lost any usefulness it might have once had and has > just turned into a dumping ground for random comments and updates and now > reads like some web forum thread, Not so. The bug discourse may have become a bit ragged due to the age of the bug and the near-total non-response by the owner or by kernel people. But there's a perfectly clear thread: every kernel from at least 3.19 through 4.6 locks up hard on BayTrail and Broadwell systems after minutes or hours. jds
(In reply to Andrew Clayton from comment #356) > > Anyway, is actually the maintainer of this component aware of the bug? This > > This bug is assigned to Len Brown and he has commented here, so *he* at > least is aware of this. > > However, I fear (and has already been mentioned in earlier comments) this > bug report has long since lost any usefulness it might have once had and has > just turned into a dumping ground for random comments and updates and now > reads like some web forum thread, Well, most of the people experiencing the problem refers to this thread. It should be the best place to source for information, and why not ask for some cooperation? The bug is open since December with no status change whatsoever but reports of baytrail hanging date back to Oct 2014 when 3.17 has been released. This is a pretty serious problem as it is preventing linux to run properly on a very large number of systems... it just doesn't look that it is getting the right attention.
I tried drm-intel kernel 4.6.0-rc7 on an N2940: the system froze after 2 days. I understand people's frustration too: a serious bug with a clearly defined thread, and only some frustrated users commenting. Still, the vast majority here are helping by testing and reporting for their platform. Feedback is nonexistent and the bug is not getting any attention; we don't even know if the maintainer is still alive ;) How is it not going to read like a forum thread after more than 2 years and many major kernel versions since its appearance?
(In reply to Daniel Glöckner from comment #352) > Please everyone, keep this on topic. > > If your cursor updates when you move your mouse, this is not your bug. > If your screen turns black, this is not your bug. <snip>> > There may be other problems with Bay Trail that might show some of these > symptoms, but this is not the correct Bugzilla entry to discuss them. Since removing the Hunter patches(see comments #55, #103) from my 4.6 build, I have not had a recurrence of those alternate freezes.
@ Len Brown: Any chance you could give us an update on this issue? It would be much appreciated. Regards
(In reply to Gabriel7340 from comment #353) > I'm using now version: > 4.6.0-040600rc1-generic #201603261930 SMP Sat Mar 26 23:32:43 UTC 2016 > x86_64 x86_64 x86_64 GNU/Linux > > Kernel without any crash! :-) Sorry the bug comes back :/ I went back to kernel 3.13.0-85-generic again :/
I'm now using kernel 4.6.0-040600-generic from Ubuntu, since my laptop (Acer Aspire E5-511-P7AT with Pentium N3540) runs on Mint 17.3 On kernel 4.5 with cstate=1 it worked flawlessly. After approx. 1 hour on 4.6 without cstate=1 it froze again during playback of an HD movie on VLC. Trying 4.6 now with cstate=1 Also I can't downgrade the kernel lower than 4.5 because then the shutdown/reboot hanging/freezing issue on this machine would be back again. :-(
4.5.5 has the same broken cstate bandaid as 4.6. In other words, both kernels freeze and ..cstate=1 no longer stops it. 4.6-rc7 did the same thing. Anyone have a new workaround? I can always use 4.4.11 which is actually running pretty well now. It just doesn't support all of my hardware (eg sound.)
intel_idle.max_cstate=4 Appears to work with rc7. So we can get more of the power savings.
Just wondering if this is not the droid we're looking for... On an unrelated development - saw a lot of jitter across different BYT platforms, excessively so, and not just on J1900, but also on the Z3537G. Digging into things, and pulling out an older IVB based 1037U box, saw the same thing. Setting intel_idle.max_cstate=1 sort of solved the problem for the most part - this is with ubuntu-server 4.4.0-22 - and by "sort of" I mean it worked around it. Reverting the valleyview change that went into 3.16 kind of fixed it - e.g. no more freezes on the BYT devices - but the IVB never had the freezes in the first place with 4.4.0. Hmmm... something tells me there's more to this problem than just the graphics driver. I just don't have handy the gear needed to get deeper into the HW - e.g. JTAG and a protocol analyzer these days - but I'm suspecting that there is something going on with the timing on both BYT and IVB, and I suspect Haswell, Braswell, and later..
Hokay - did some more debug/testing - pthread crashes are inconsistent when looking at stack dumps. Munged the UEFI to keep the BYT running at a constant speed, and things are fine. Same with IVB - weird... with constant clocks, it's all good. Let the cores sleep a bit too much - boom - worrisome, as this could lead to data corruption that folks wouldn't see immediately. While I can't dig into the HW - a long time back on ARM with an RTOS, we found that dynamic clocks could lead to issues with CPU clocks and memory states, with memory reads specifically - the CPU would read before memory was ready. Since BYT and IVB use the uncore/system agent for both CPU and GPU, this is the area of interest, as the uncore controls timing for everything. Probably need someone from Intel systems to sort this out, as this is all their stuff inside.
Thanks for the debug! Good work. I'm still analysing the code, but, as you suggest, someone from intel can see more accurately and quickly what is really going wrong.
With kernel 4.4.9 I am now seeing lockups fairly frequently. I had started to see some with FC22 and thought it was related to the nouveau driver, so I updated to FC23 on May 1; it was due anyway. Lockups were apparently resolved. That was with kernel 4.4.6. With kernel 4.4.9 I am now seeing lockups on a daily basis. Just prior to one such lockup I noticed a log entry: "May 24 19:52:50 xps8700.durand8450.info kernel: NMI watchdog: Watchdog detected hard LOCKUP on cpu 0 -- Reboot -- " So I found this thread. Now we are talking about something with kernel and cstate or pci/msi interaction? I have a Dell XPS8700 desktop; I looked in my BIOS setup and see nothing related to cstate. I tried pci=nomsi in the GRUB entry and got no relief. I thought perhaps my BIOS is too old (I'm at A07, Dell's latest is A11). I tried flashing it from FreeDOS but it fails to burn. Anyhow, I am now running kernel 4.4.6-201, which seems stable.
... finally read carefully enough to figure out the other method to set cstate is to qualify the kernel invocation in GRUB. I am now running kernel 4.4.9 with cstate set to 1. It hasn't locked up in half an hour.
... but not really much longer. log reports that cstate = 1 was reached shortly after reboot, then log records: 'NMI watchdog: Watchdog detected hard LOCKUP on cpu 0' about two hours after boot. hmm. I think I've tried the two work arounds. I am apparently able to run with kernel 4.4.6
... and I could have mentioned that about 14 minutes after the watchdog notice above I see the following in the log:
kernel: INFO: rcu_sched detected stalls on CPUs/tasks:
kernel: 0-...: (1 GPs behind) idle=643/1/0 softirq=79517/79517 fqs=320010
kernel: (detected by 3, t=960034 jiffies, g=113164, c=113163, q=0)
kernel: Task dump for CPU 0:
kernel: swapper/0 R running task 0 0 0 0x00000008
kernel: ffffffff8163d5af 000000001ec13d60 0000066d9d796809 ffffffff81d3b0c0
kernel: ffffffff81c04000 ffff88021ec1fb00 ffffffff81cc1040 ffffffff81c00000
kernel: ffffffff81c03ec0 ffffffff8163d797 ffffffff81c03ed8 ffffffff810e6752
kernel: Call Trace:
kernel: [<ffffffff8163d5af>] ? cpuidle_enter_state+0xff/0x2b0
kernel: [<ffffffff8163d797>] ? cpuidle_enter+0x17/0x20
kernel: [<ffffffff810e6752>] ? call_cpuidle+0x32/0x60
kernel: [<ffffffff8163d773>] ? cpuidle_select+0x13/0x20
kernel: [<ffffffff810e6a10>] ? cpu_startup_entry+0x290/0x350
kernel: [<ffffffff8179513c>] ? rest_init+0x7c/0x80
kernel: [<ffffffff81d6201e>] ? start_kernel+0x498/0x4b9
kernel: [<ffffffff81d61120>] ? early_idt_handler_array+0x120/0x120
kernel: [<ffffffff81d61339>] ? x86_64_start_reservations+0x2a/0x2c
kernel: [<ffffffff81d61485>] ? x86_64_start_kernel+0x14a/0x16d
at which point the log stops. And I'm still working on why my BIOS update won't actually update.
Hi guys! I have the same bug on a thinkpad e11 N2930. This bug is similar: https://bugs.launchpad.net/ubuntu/+source/linux/+bug/1575467 And the people at Canonical have built a test kernel with a possible solution. Can you test it? http://kernel.ubuntu.com/~jsalisbury/lp1575467 Thanks.
just for the record my current state is running on kernel 4.4.6 out of the box, no tweaks in GRUB. I was too hasty when I inferred I was running well with kernel 4.4.6 with limit on cstate.
The only Linux-based commercial product that used BYT was based on the Android snapshot/fork of the Linux kernel, not the upstream Linux kernel. Nobody knows why the Android version of Linux is stable on this hardware, while upstream Linux is not. There have been several de-bunked theories. No, it isn't a bug in the intel_idle driver -- you'll have the same results with "intel_idle.max_cstate=0", which will run the acpi_idle driver. The cause is likely due to an SOC device other than the CPU.
(In reply to Javier Antonio Nisa Avila from comment #373) > Hi Guys!!! > > I Have the same bud in thinkpad e11 N2930 > > This bug is similar > https://bugs.launchpad.net/ubuntu/+source/linux/+bug/1575467 > > And people of canonical build a test kernel with possible solution. > > Can you probe?? > > http://kernel.ubuntu.com/~jsalisbury/lp1575467 > > Thanks. If I understand the various comments correctly, they are bisecting to find out which commit caused the issue, but it's not (yet) a possible solution. Of course, the problem is so widespread that a lot of duplicated effort is being made.
After more watching, perhaps I'm on the wrong thread. I can now see that I get the system freezes regardless of the cstate workaround or the pci=nomsi workaround, and regardless of which of the installed kernels I select from 4.4.6, 4.4.8 and 4.4.9. I've tried disabling the watchdog, not because I think there is a causal relationship but because the watchdog often logs an alert just before a lockup event. I was momentarily optimistic that disabling the watchdog might change my system event from a system freeze to a crash, which I would have preferred. I have been running with intel_idle.max_cstate=0 for the past several days. I am noticing a lot of "(tracker-miner-fs:1951): Tracker-CRITICAL" events in the log. I plan to go back to the beginning of my searching to see if I can make a better match to what I'm seeing.
Hi again, Kernel 4.6 + Mika Kuoppala's 3 _tentative_ patches + linux-999-i915-use-legacy-turbo.patch = over 120h in one single session (without reboot/sleep) and another 20 or so hours in a few 3-4 hour sessions without a single freeze. Still counting... (some of Adrian Hunter's patches for pm/mmc were also applied, but I don't think (hope) this matters) Kernel 4.5.x + Mika's 3 _tentative_ patches + linux-999-i915-use-legacy-turbo.patch = first freeze after about 8 hours; the second freeze came a few minutes after boot (all of the above was without any intel_idle.max_cstate parameter) Thanks to Daniel Bilik for this "combo". Maybe something new in v4.6 fixed the last hole... Or maybe I'm just lucky.
I can confirm that this fix worked for me on Ubuntu 14.06 (kernel: 4.4.0-22-generic) on an Acer Aspire E15 laptop, dual-booting with Windows 8. There have been no irrecoverable freezes since applying the fix yesterday, but there have been a few times where it slowed down to almost freeze. Thankfully, it saved itself.
I believe I have confirmed my issue is not related to the subject of this thread. I re-initialized tracker and all seems to have cleared up. I thought the freezes I was seeing were right in line with descriptions above but unlike other reports I was getting no relief from the work-arounds that reportedly helped others. Changing cstate to zero and disabling watchdog did help me focus on the real problem. Thank you for your patience.
From Documentation/kernel-parameters.txt:
>> intel_idle.max_cstate= [KNL,HW,ACPI,X86]
>> 0 disables intel_idle and fall back on acpi_idle.
>> 1 to 6 specify maximum depth of C-state.
acpi_idle is a different idle driver and could disable all C-states as well. It depends on the ACPI tables. So you need to try this and check which states are enabled:
$ uname -a
$ cat /sys/devices/system/cpu/cpuidle/current_driver
$ cat /sys/devices/system/cpu/cpu0/cpuidle/state*/name
For me it's like this:
>> Linux venue11pro 4.1.25-dirty #337 SMP PREEMPT Wed May 25 01:53:43 MSK 2016 i686 Intel(R) Atom(TM) CPU Z3770 @ 1.46GHz GenuineIntel GNU/Linux
>> intel_idle
>> POLL
>> C1-BYT
>> C6N-BYT
>> C6S-BYT
>> C7-BYT
>> C7S-BYT
P.S. I should also mention that even with kernel 4.1.25, the mmc PM QoS patches and the legacy turbo patch I still get freezes. Disabling the SDIO wifi (ath6kl driver) prevents any lockup at all.
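For anyone collecting this information from several machines, the sysfs checks above can also be scripted. This is only a minimal sketch in Python, assuming the standard cpuidle sysfs layout used by the commands above; the function name `read_cpuidle` and its `sysfs_root` parameter are made up for illustration and are not part of any kernel tooling.

```python
from pathlib import Path


def read_cpuidle(sysfs_root="/sys/devices/system/cpu", cpu=0):
    """Return (idle driver name, list of C-state names) for one CPU.

    sysfs_root is a parameter only so the function can be pointed at a
    copy of the tree; on a live system the default is the real path.
    """
    root = Path(sysfs_root)

    # Same file as: cat /sys/devices/system/cpu/cpuidle/current_driver
    driver_file = root / "cpuidle" / "current_driver"
    driver = driver_file.read_text().strip() if driver_file.exists() else None

    # Same files as: cat .../cpu0/cpuidle/state*/name, sorted numerically
    # so that state10 would come after state9.
    state_dirs = sorted(
        (root / f"cpu{cpu}" / "cpuidle").glob("state[0-9]*"),
        key=lambda p: int(p.name[len("state"):]),
    )
    names = [(d / "name").read_text().strip() for d in state_dirs]
    return driver, names


if __name__ == "__main__":
    driver, names = read_cpuidle()
    print("idle driver:", driver)
    for name in names:
        print("state:", name)
```

If the driver comes back as something other than intel_idle, or the list of states is shorter than expected, that is worth stating in a report, since it changes what a given max_cstate value actually allows.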
I'm baaack. I've had additional events. The best solution I could get to with the work arounds was to set cstate=0 and switch from gnome to xfce desktop. And make sure I close firefox and thunderbird. That combination really stretched out the events but still at least one a day. I also tried disabling power management for the monitor, I had it set to turn the monitor off after 45 minutes of inactivity. My next approach is to flash the computer with the latest release posted by Dell of AMI BIOS, A11. If I haven't mentioned I'm running workstation fedora on a Dell XPS 8700 which I bought new two years ago. It was still running BIOS A07 which it came with. I completed the re-flash this afternoon. I've been having some difficulty sorting out what pieces are really in play. The symptoms I see sound like what is described above but I have not seen the relief others have reported from the work arounds. Also, kernel seems to be implicated in the discussion but my events did not seem to be associated with a kernel upgrade (I didn't realize that until more recently. I was running 4.4.8 for ten days until my events started.)
@joev.mi This bugzilla entry is about Baytrail processors. Your computer does not have one of those -- it uses "4th generation Intel Core processors". Please start a separate bugzilla. This one is confusing enough already. I think that reports of Braswell/Cherrytrail problems are likely relevant.
Examples of Baytrail reported (above) as having the bug: Atom Z3735G, Atom Z3770, Celeron J1900, Celeron N2930, Celeron N2940, Pentium J2900, Pentium N3520, Pentium N3540
Examples of Baytrail reported (above) without seeming to have the bug: Celeron CPU N2830, Celeron CPU N2840
Examples of Braswell/Cherrytrail reported (above) as having the bug: Celeron N3050
Examples of Braswell/Cherrytrail reported (above) without seeming to have the bug: N3150, N3700
(I stopped reading at about comment 150)
Correction to the list: I have an N3150 that has the bug. With the cstate=1 workaround I have seen it freeze only once. 2016-06-07 5:08 GMT+02:00 <bugzilla-daemon@bugzilla.kernel.org>: > https://bugzilla.kernel.org/show_bug.cgi?id=109051 > > --- Comment #383 from D. Hugh Redelmeier <hugh@mimosa.com> --- > Examples of Braswell/Cherrytrail reported (above) without seeming to have the bug: N3150, N3700
Since my last post and cstate=1 on kernel 4.6.0-040600-generic from Ubuntu, my laptop (Acer Aspire E5-511-P7AT with Pentium N3540) running Mint 17.3 didn't freeze once. I tried every usual cause possible (Fullscreen HD videos on youtube or in VLC. Browsing content loaded websites in chrome and firefox. Batch HD conversion in Handbrake, etc.). No freezing or hanging so far. The cstate=1 workaround seems to work for me so far.
I'm really confused now. I switched from Arch to Debian stable (3.16.0 kernel), which didn't freeze once, as I was expecting (I've read the bug started with 3.17, not 3.16). Then I reinstalled Arch (standard) with kernel 4.5.4, and so far I haven't experienced a single freeze (I now have 48 hours of uptime). I don't really have the expertise to tell whether any patch has been applied to the 4.5.4 kernel by the Arch team in the last couple of months. The only thing I did differently this time is disabling in the BIOS all components I really don't need (like the serial port, for example)... I will let it run for a couple more days to check if it keeps running, then I'll start playing with the BIOS, turning devices on or off again. Does this make any sense? If I understand correctly what Len said, it's a problem with a device driver rather than with intel_idle?
Did you disable cstates in bios?
(In reply to Gabriel7340 from comment #387) > Did you disable cstates in bios? No... I've just disabled some devices I don't use but I didn't do it with the bug in mind. I will check it and take some notes to see if the bios settings(with the exception of cstates) can actually change something.
Does the freeze only happen when using X11 or a Desktop Environment? Am I safe if I only use my hardware without any intel driver or X11? I want to use my Q1900 just as a server, in console mode.
I use an asrock q1900-itx without X and kernel 3.19, 4.2 and now 4.4 and no special settings. Running for a year without issues.
Ahh and i forgot to say that it was sorted out because it had all other known issues when running as a kiosk system. By the way, when running as kiosk system we had the problem that the USB ports started to stop working after a random time. The devices dont even disappear when unplugged. The exact same imaged installation has no issues on AMD kabini or older intel core2duo or celeron 847. Is there anything known regarding this issue? Because of all this trouble we moved to AMDs Kabini which works without any issues.
The latest 2 maintenance releases of 4.5 & 4.6 seem to have restored the cstate work-around. My T100CHI is running again without the classic freeze described here. Many thanks to whoever restored the cstate work-around.
(In reply to jbMacAZ from comment #392) > The lastest 2 maintenance releases of 4.5 & 4.6 seem to have restored the > cstate work-around. My T100CHI is running again without the classic freeze > described here. Many thanks to whoever restored the cstate work-around. This is my impression too ... I've upgraded yesterday to 4.6 kernel and no crashes for 15 hours so far. Before I had 4 days of up-time with 4.5.4. Would be nice to have a confirmation, also to avoid that one of the next patches bring everything back.
so what's the status now?
On an Acer TravelMate B115-M (N2940 @ 1.83 GHz) with the latest BIOS, it still hangs occasionally with kernel 4.6.2. But it's definitely way better than with previous kernels. I removed the max_cstate=1 workaround about a week ago and the machine crashed only once or twice during the past 7 days. So to sum it up - still not 100% perfect, but definitely a huge improvement. BTW - I've enabled the HW watchdog in the systemd configuration. When the machine hangs (display hangs, network hangs, mouse and keyboard not reacting, etc.), it is still automatically rebooted by the HW watchdog. If I understand correctly, this reboot watchdog is independent of the kernel and should always be able to automatically reboot a machine with a hung kernel. As these crashes became less frequent, I started to use this HW watchdog as a new temporary workaround to keep my machine up when it is being used remotely.
With the workaround cstate=1 on kernel 4.6.0-040600-generic from Ubuntu, my laptop (Acer Aspire E5-511-P7AT with Pentium N3540) is still running so far. No freezing or hanging. The only difference from kernels before 4.4 is that I disabled the onboard Broadcom Wifi/Bluetooth card/chip. I'm using a Ralink USB dongle instead.
With Arch linux kernel 4.6.2 - and clearly no max_cstate - I've experienced some occasional crashes but it was always during some heavy load of the machine (streaming hd videos from the network), while previously it crashed randomly after few hours even with the machine completely idle. So big improvement, I will try to stress the machine with max_cstate=1 to check if the crashes are due to the same problem or something else.
I have the same bug with an Intel Xeon E5-1620 v3 CPU, NVIDIA Quadro K620 and 256 GB NVMe SSD. I was wondering why both PCIe cards were affected. On the NVMe card I've seen an XFS file system corruption from time to time. "intel_idle.max_cstate=1" fixed the problem with openSUSE Leap 42.1 (4.1 kernel). C-States are broken with Haswell CPUs affecting the PCIe cards! See:
http://www.intel.de/content/dam/www/public/us/en/documents/specification-updates/xeon-e5-v3-spec-update.pdf
http://www.intel.com/content/www/us/en/processors/xeon/xeon-e7-v3-spec-update.html
http://www.intel.com/content/www/us/en/processors/core/4th-gen-core-family-desktop-specification-update.html
HSX54: "A P-State or C-State Transition May Lead to a System Hang"
HSD38: "TSC May be Incorrect After a Deep C-State Exit"
HSD44: "Display May Flicker When Package C-States Are Enabled"
HSD50: "Throttling and Refresh Rate Maybe be Incorrect After Exiting Package C-State"
HSD60: "Processor May Not Enter Package C6 or Deeper C-states When PCIe* Links Are Disabled"
HSD77: "Graphics Processor Ratio And C-State Transitions May Cause a System Hang"
HSD104: "PCIe* Device’s SVID is Not Preserved Across The Package C7 C-State"
@Sebastian Parschauer #398: This is NOT the same bug. Your system's processor is neither Baytrail nor Cherrytrail. Please start a different bugzilla entry. Certainly it would interest many people if C-states are broken in Haswell. If you think that there is something relevant to bug 109051, add a comment here pointing to your new bugzilla bug.
Hello everyone. I just came to send some feedback. I'm using a thin ITX N3150 board by SOYO and my OS is Arch Linux. I ran into the same bug several days ago. I changed my kernel to 4.4.13-1-lts (it's 4.4.14 now, but that should also work) and did nothing with the kernel parameters or the X configuration file, and I have not encountered a screen freeze any more (for more than 1 hour). I then changed my kernel to 4.7-rc4, and the computer also works properly. I have tried to add "intel_idle.max_cstate=1" by using efibootmgr:
efibootmgr -d /dev/sdb -p 1 -c -L "Arch Linux FallBack" -l /vmlinuz-linux -u "root=/dev/sdb2 rw initrd=/initramfs-linux.img i915.semaphores=1 intel_idle.max_cstate=1"
but it does not work; I ran into a screen freeze after just about 5 minutes. Maybe I did not add the parameter the right way? Sorry for my poor English.
I would install rEFInd and then make that the primary boot target; it's much easier to configure rEFInd to boot linux with the desired parameters.
Run cat /proc/cmdline to see the parameters the running kernel was actually booted with (uname -r gives the running kernel version).
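Since several reports here are unsure whether the parameter actually reached the kernel, the check can be automated. The following is just a sketch in Python, assuming a command line without quoted values containing spaces; the helper names `parse_cmdline` and `max_cstate` are invented for this example.

```python
def parse_cmdline(text):
    """Split a kernel command line into a {name: value} dict.

    Bare flags such as "rw" map to None. Quoted values containing spaces
    are not handled; that is a deliberate simplification.
    """
    params = {}
    for token in text.split():
        # Split on the first "=" only, so root=UUID=... keeps its value intact.
        name, sep, value = token.partition("=")
        params[name] = value if sep else None
    return params


def max_cstate(text):
    """Return intel_idle.max_cstate as an int, or None if it was not passed."""
    value = parse_cmdline(text).get("intel_idle.max_cstate")
    return int(value) if value else None


if __name__ == "__main__":
    with open("/proc/cmdline") as f:
        print("intel_idle.max_cstate =", max_cstate(f.read()))
```

A result of None means the workaround is not in effect for the running kernel, whatever the boot loader configuration says.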
System: Host: hawker64 Kernel: 4.6.3-1-ARCH x86_64 (64 bit) Desktop: dwm 6.1 Distro: Arch Linux
Machine: Mobo: ASUSTeK model: P6T SE v: Rev 1.xx Bios: American Megatrends v: 0908 date: 09/21/2010
CPU: Quad core Intel Core i7 920 (-HT-MCP-) cache: 8192 KB clock speeds: max: 2672 MHz 1: 2672 MHz 2: 2672 MHz 3: 2672 MHz 4: 2672 MHz 5: 2672 MHz 6: 2672 MHz 7: 2672 MHz 8: 2672 MHz
Graphics: Card: Advanced Micro Devices [AMD/ATI] RV770 [Radeon HD 4870] Display Server: X.Org 1.18.3 driver: radeon Resolution: 1920x1080@60.00hz, 1920x1080@60.00hz GLX Renderer: Gallium 0.4 on AMD RV770 (DRM 2.43.0, LLVM 3.8.0) GLX Version: 3.0 Mesa 11.2.2
Audio: Card-1 Advanced Micro Devices [AMD/ATI] RV770 HDMI Audio [Radeon HD 4850/4870] driver: snd_hda_intel Card-2 Intel 82801JI (ICH10 Family) HD Audio Controller driver: snd_hda_intel Card-3 Hewlett-Packard driver: USB Audio Card-4 Logitech QuickCam Pro 9000 driver: USB Audio Sound: Advanced Linux Sound Architecture v: k4.6.3-1-ARCH
Network: Card: Realtek RTL8111/8168/8411 PCI Express Gigabit Ethernet Controller driver: r8169 IF: enp5s0 state: up speed: 100 Mbps duplex: full mac: 00:26:18:97:7b:40
Drives: HDD Total Size: 1388.6GB (0.1% used) ID-1: /dev/sdc model: SAMSUNG_HM250HI size: 250.1GB ID-2: /dev/sdb model: Hitachi_HTS54164 size: 40.0GB ID-3: /dev/sda model: HDS728080PLA380 size: 82.3GB ID-4: USB /dev/sdd model: Cruzer_Blade size: 16.0GB ID-5: /dev/sde model: WDC_WD2500AAKS size: 250.1GB ID-6: /dev/sdf model: Hitachi_HDS72107 size: 750.2GB
Partition: ID-1: swap-1 size: 2.05GB used: 0.00GB (0%) fs: swap dev: /dev/sdf2
Sensors: System Temperatures: cpu: 54.5C mobo: 48.0C gpu: 78.0 Fan Speeds (in rpm): cpu: 2500 psu: 0 sys-1: 0 sys-2: 0
Info: Processes: 207 Uptime: 44 min Memory: 1104.5/5962.6MB Client: Shell (zsh) inxi: 2.3.0
I too have been experiencing these hard lockups since 4.1 on Arch Linux x86_64. I can go for 12 hours without a lockup, though sometimes they happen quicker. I don't want to use the kernel parameter, as I'm a tight-assed Scot whose electric bill is high enough ;) The log is hard to get, but the output looks very similar to https://bugzilla.kernel.org/attachment.cgi?id=209581 I plan to revert to an older stable kernel or maybe LTS. I hope to report back with relevant logs. Thanks to all who are working on this.
OK, like others I do not run Baytrail; however, I'm confident this is a similar kernel regression regardless of CPU codename. I will nevertheless look around for a bug specific to my hardware.
Same problem in Acer aspire switch 10
Unfortunately it seems to be clear now that as I expected this bug will never get fixed. I can only see more and more people posting here that they are affected too. But nobody even cared to change the bug state to critical from normal, confirmed from new or update the affected kernel versions up to 4.6 (and most likely any future).
Very disappointed by Intel's nonexistent product support :S
I am fairly sure Intel never sold the Baytrail processor for the Linux platform except in a very limited capacity (the computer stick is the only one as far as I know); the Z37xx series was only sold commercially for Windows. I don't think any Chromebooks used it at all; they used the N28xx variants. So, really, we can't expect much from Intel.
> Intel never sold the Baytrail process for the Linux platform Despite that, Acer did (and still does) sell their Aspire E5-511 with Linpus (a distribution of Linux), which I considered fair proof that those laptops would be fine with Ubuntu as well. Apparently, I was wrong.
Anyway, officially or unofficially, the problem has been greatly reduced in the latest kernel version. I'm running 4.6.3 on a Celeron N2930 and I can get several days of uptime without any problems. Every now and then I experience a crash when streaming video, but it is way better than before, when the machine crashed after a few hours of sitting idle.
I have found that 4.5.7 with the patch set from John Brodie on the Asus T100 Ubuntu Google+ group is very stable with cstate=1 and sdio wifi - I can achieve days of uptime. See the files section linked from here: https://plus.google.com/communities/117853703024346186936
My Shuttle XS35V4 was offered as Linux compatible. And that is not the only bay-trail computer with declared Linux support.
I tested kernel 4.6.3 (CPU J1900); it took about 3 hours to crash (Chromium browser, no video). Setting cstate to 1 or 2 still fixes the problem. @vladimir jicha I also have a Shuttle XS35V4. Please set the kernel parameter intel_idle.max_cstate. For GRUB:
- vi /etc/default/grub: GRUB_CMDLINE_LINUX_DEFAULT="quiet splash intel_idle.max_cstate=2"
- update-grub
- reboot
Absolutely no crashes anymore, and power consumption is only 8W.
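For those rolling this out to many boards, the edit to /etc/default/grub described above can be sketched as a small text transformation. This is an illustration in Python only, assuming the variable is set on a single line with a double-quoted value as in the example above; `add_grub_param` is a hypothetical helper, not a GRUB tool, and update-grub plus a reboot are still needed afterwards.

```python
import re


def add_grub_param(config_text, param, var="GRUB_CMDLINE_LINUX_DEFAULT"):
    """Return config_text with param appended to var's quoted value.

    An existing setting of the same parameter name is replaced, so the
    function can be run repeatedly. Only double-quoted values are handled.
    """
    name = param.split("=", 1)[0]
    pattern = re.compile(rf'^({re.escape(var)}=")([^"]*)(")', re.MULTILINE)

    def rewrite(match):
        # Drop any older value of the same parameter, keep everything else.
        kept = [t for t in match.group(2).split()
                if t.split("=", 1)[0] != name]
        return match.group(1) + " ".join(kept + [param]) + match.group(3)

    new_text, count = pattern.subn(rewrite, config_text)
    if count == 0:
        raise ValueError(f"{var} not found in config")
    return new_text
```

Called as add_grub_param(text, "intel_idle.max_cstate=2") on the contents of /etc/default/grub, it returns the new file contents; writing them back is left to the caller so that a backup can be made first.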
I'm running an Inspiron 3551 laptop with a Pentium N3540. I've been having this same bug since Ubuntu 15.10. I recently installed Linux Mint 18 MATE and no longer had the freezing caused by this bug. Yesterday I installed some updates, and ever since, the freezing has returned. I am not sure if the updates included kernel updates, but I still have 2 kernels on my PC: the older version, 4.4.0-21-generic, is the one that did not cause any freezing for over a week after installing Mint 18. The newer version, 4.4.0-28-generic, caused freezing. I have rolled back to and tested 4.4.0-21-generic and can watch YouTube videos at 720 60fps with no freezing; with 4.4.0-28-generic, YouTube videos cause the freezing. 4.4.0-21-generic seems to fix this issue for me. Has anyone else tried this?
On 4.6.0, max_cstate=2 is not an option here. Will try 4.6.3 later.
Is there any way to check in the code what changed for Baytrail CPUs between kernel 3.16 and 3.17? Was there a patch specific to Baytrail that is causing the issue? Or some patch for C-states? There should be an active effort to fix this bug, because it affects multiple machines with Ubuntu preinstalled, and Ubuntu is retiring support for kernel 3.16, so people will either be stuck with a very old kernel or will experience freezes with 16.04, especially on machines with Ubuntu pre-installed like the Dell Inspiron 3551 Ubuntu Edition. This will really hurt the public image of Linux distros on consumer computers.
I'm fairly sure I have had this in 3.16 on my media machine. Or at least some other complete system freeze. I think it's just very rare under 3.16. So I'm not convinced the answer will fall out of bisection. :-( Seems graphics related from what has been said. Unless it does happen on headless machines, in which case that is clearly not true. Guessing at an interaction between the power states of CPU and GPU: one changed in just the wrong place for the other. But that's loads of speculation I don't have time to dig into. I'm getting tempted to just bin the board when being stuck on 3.16 becomes an issue. Or use it as headless with another Pi media machine. On 8 Jul 2016 18:36, bugzilla-daemon@bugzilla.kernel.org wrote: > https://bugzilla.kernel.org/show_bug.cgi?id=109051 > > --- Comment #417 from Alejandro Morales Lepe --- > Is there any way to check in code what changed for Baytrail CPUs between kernel 3.16 and 3.17?
(In reply to Joe Burmeister from comment #418) I have yet to experience a complete lockup in 3.16. The lockups I have had happen when I fill my Inspiron 3551's RAM by running a lot of stuff, but then I am able to reboot the computer with some SysRq magic, while on newer kernels the lockup prevents me from doing even that. Could it be a different thing? At least you can run headless :( I am suffering from this problem on my daily driver, and I pretty much need it for everything. I got this machine for the price and the idea that I would be getting a Linux-ready computer, oh boy... I am not an expert, but if there is some way I can help debug this, anybody, please let me know.
https://git.kernel.org/cgit/linux/kernel/git/stable/linux-stable.git/log/?id=refs/tags/v3.17
From what I am seeing, the option to set intel_idle.max_cstate=1 was itself added in kernel 3.17. Does that have any relation to the problem, or am I getting lost in my ignorance? Is there any problem with the default value? Where is that value used? Excuse me if this is not of much help; I am trying to make some sense of this, but since I am not familiar with kernel development I may just be running in circles. https://git.kernel.org/cgit/linux/kernel/git/stable/linux-stable.git/commit/?id=2e92c7ad8f269c2b5b7f2a4763675f55f00b75f5
It could be a different complete freeze. Without pouring time into it, I can't know. There is no "the fix" as far as I know. The only workaround is to limit the C-state, either in the BIOS or via the kernel argument. Both do the same thing, and that sucks for power usage. I'd love some good news too.
(In reply to Alejandro Morales Lepe from comment #421)
> https://git.kernel.org/cgit/linux/kernel/git/stable/linux-stable.git/commit/?id=2e92c7ad8f269c2b5b7f2a4763675f55f00b75f5
That's just adding documentation. The intel_idle driver was added back in 2010 and Bay Trail support was added in 2015 by 718987d695adc991eb94501209fe5353136c8c16 ("intel_idle: support Bay Trail"), and possibly last touched by d7ef76717322c8e2df7d4360b33faa9466cb1a0d ("intel_idle: Update support for Silvermont Core in Baytrail SOC"). IIRC J1900 is a Silvermont.
Yes, a J1900 is an Atom and has Silvermont cores. But so does a Baytrail https://en.wikipedia.org/wiki/Silvermont
It seems some Intel CPUs have design bugs: http://www.intel.com/content/dam/www/public/us/en/documents/specification-updates/pentium-n3520-j2850-celeron-n2920-n2820-n2815-n2806-j1850-j1750-spec-update.pdf
I have read the PDF and, according to Intel, there is a C6 state hardware bug in the CPU numbers listed on pages 9 and 10:
----
VLP52: EOI Transactions May Not be Sent if Software Enters Core C6 During an Interrupt Service Routine.
Problem: If core C6 is entered after the start of an interrupt service routine but before a write to the APIC EOI (End of Interrupt) register, and the core is woken up by an event other than a fixed interrupt source, the core may drop the EOI transaction the next time the APIC EOI register is written, and further interrupts from the same or lower priority level will be blocked.
Implication: EOI transactions may be lost and interrupts may be blocked when core C6 is used during interrupt service routines.
Workaround: It is possible for the firmware to contain a workaround for this erratum.
----
(In reply to André Hoogendoorn from comment #426)
> I have read the pdf and according to Intel, there is a C6 state hardware bug in the CPU numbers as listed on page 9 and 10
Interesting. I have been running with intel_idle.max_cstate=5 (changed from 2, which was fine) under the Fedora 24 4.6.3 kernel on a J1900 CPU for 14+ hours now. IIRC I would have had a lockup by now...
(In reply to Andrew Clayton from comment #427)
> Interesting. I have been running with intel_idle.max_cstate=5 (changed from 2, which was fine) under the Fedora 24 4.6.3 kernel on a J1900 CPU for 14+ hours now.
> IIRC I would have had a lockup by now...
According to "cpupower idle-info", a J1900 CPU has
Available idle states: POLL C1-BYT C6N-BYT C6S-BYT C7-BYT C7S-BYT
So running a J1900 CPU with intel_idle.max_cstate=5 is basically THE SAME AS running it with intel_idle.max_cstate=1, intel_idle.max_cstate=2, intel_idle.max_cstate=3, or intel_idle.max_cstate=4. If it ran stably with any of those settings, it will also run stably with intel_idle.max_cstate=5.
My J1900 reliably freezes with any intel_idle.max_cstate > 1 (kernel 4.6.0) but I see why you would expect otherwise.
(In reply to Wolfgang M. Reimer from comment #428)
> So running a J1900 CPU with intel_idle.max_cstate=5 is basically THE SAME AS running it with intel_idle.max_cstate=1, intel_idle.max_cstate=2, intel_idle.max_cstate=3, or intel_idle.max_cstate=4. If it ran stably with either of the latter settings it will also run stably with intel_idle.max_cstate=5.
But something must be different. I also use a J1900 mainboard, and there is a difference in power consumption between running with max_cstate=1 and max_cstate=2. For me, that's:
max_cstate=1: 17.2W
max_cstate=2: 16.5W
no max_cstate: 15.9W
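To put the three wattage readings above in proportion, here is a small sketch that computes the extra draw of each setting relative to the unrestricted case. The function name and the dictionary layout are illustrative only; the three wattage figures are the ones reported above, from a single board, so the percentages should not be generalised.

```python
def extra_power(readings_w, baseline_key):
    """Return {setting: (extra_watts, extra_percent)} relative to the baseline."""
    base = readings_w[baseline_key]
    return {
        key: (round(watts - base, 2), round(100.0 * (watts - base) / base, 1))
        for key, watts in readings_w.items()
        if key != baseline_key
    }


# Wall-power readings reported in the comment above (watts).
readings = {"max_cstate=1": 17.2, "max_cstate=2": 16.5, "no max_cstate": 15.9}

for setting, (watts, pct) in extra_power(readings, "no max_cstate").items():
    print(f"{setting}: +{watts} W ({pct}% over baseline)")
```

On these numbers, limiting to max_cstate=1 costs roughly 1.3 W over the unrestricted setting, and max_cstate=2 roughly 0.6 W.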
Just found this bug report. I used to be getting freezes on my Dell XPS 15 9550 with Skylake Intel(R) Core(TM) i7-6700HQ CPU @ 2.60GHz up until 4.7.0-0.rc7.git1.2.fc25.x86_64 kernel. I don't have max_cstate set at the moment. Something has changed since rc7.git0 kernel build. I am running fedora 24 with https://fedoraproject.org/wiki/RawhideKernelNodebug repo enabled.
@Vaidas Jablonskis #431: this bug report is about Baytrail CPUs. Skylake is quite different.
(In reply to D. Hugh Redelmeier from comment #432) > @Vaidas Jablonskis #431: this bug report is about Baytrail CPUs. Skylake is > quite different. Oops. My apologies for not reading the title.
Created attachment 223851: Disable all C6 states, enable all C7 core states for Baytrail CPUs. This script is meant to verify whether erratum VLP52 is the root cause of this bug. Must be run as root.
Created attachment 223861: Shows all core states (C-states) plus some related info as a formatted table. The intel_idle.max_cstate boot parameter refers to the enumeration done by the Linux kernel (the number in column State), not to the Intel notation of core states C0, C1, C2, C3, C6, C7, etc. Latency, Residency, and Time are given in microseconds.
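As a rough illustration of that distinction, the sketch below treats intel_idle.max_cstate as an upper bound on the kernel's state index, using the J1900 state names reported by "cpupower idle-info" in this thread. The BYT_STATES list and the enabled_states helper are made up for this example and are not kernel code. The special value 0, which as far as I know disables the intel_idle driver altogether, is deliberately not modelled.

```python
# State names in kernel enumeration order, as reported for a J1900 in this thread.
BYT_STATES = ["POLL", "C1-BYT", "C6N-BYT", "C6S-BYT", "C7-BYT", "C7S-BYT"]


def enabled_states(max_cstate, states=BYT_STATES):
    """Return the states that stay available for a given max_cstate.

    max_cstate is interpreted as the highest kernel state index allowed,
    not as an Intel C-state number.
    """
    if max_cstate < 1:
        # Assumption: 0 is a special case (driver disabled), not handled here.
        raise ValueError("this sketch only models max_cstate >= 1")
    return states[: max_cstate + 1]


for limit in (1, 2, 5):
    print(limit, enabled_states(limit))
```

Read this way, max_cstate=2 already permits the first Intel C6 state (C6N-BYT), even though the number 2 suggests something much shallower.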
(In reply to Martin from comment #429)
> My J1900 reliably freezes with any intel_idle.max_cstate > 1 (kernel 4.6.0) but I see why you would expect otherwise.
(In reply to Max Stegmeyer from comment #430)
> But something must be different. I also use a J1900 mainboard and there is a difference in power consumption between running with max_cstate=1 and max_cstate=2.
> For me, that's
> max_cstate=1: 17.2W
> max_cstate=2: 16.5W
> no max_cstate: 15.9W
OK, you are right, and I found out what the problem is. The Linux kernel enumerates the states for the J1900 as follows:
0 POLL
1 C1-BYT
2 C6N-BYT
3 C6S-BYT
4 C7-BYT
5 C7S-BYT
The parameter intel_idle.max_cstate refers to that enumeration and does _NOT_ conform to the Intel notation of the C-states (which confused me). So "intel_idle.max_cstate=2" means that POLL, C1-BYT, and C6N-BYT (the first of the Intel C6 states) are enabled, and all other states (C6S-BYT, C7-BYT, C7S-BYT) are disabled and _CANNOT_ be enabled after boot time.
Fortunately, the /sys interface of the kernel allows fine-grained tweaking at run time: one can turn the states off and on individually (unless they were disabled at boot time via intel_idle.max_cstate=<number>). To investigate whether erratum VLP52 is the root cause of this kernel bug (109051), I attached two shell scripts to this bug. The first (c6off+c7on.sh) disables all Intel C6 core states for Baytrail processors (C6N-BYT and C6S-BYT) and enables all C7 core states (C7-BYT and C7S-BYT). The second script can be used to verify that the C6 states are disabled: column "Disabled" should show a "1" for the disabled states, and the counts in columns "Time" and "Usage" should no longer change for the disabled C6*-BYT states.
The "c6off+c7on.sh" script should be started at system boot. If erratum VLP52 is the root cause of this bug, then Baytrail systems with the processors mentioned in https://bugzilla.kernel.org/show_bug.cgi?id=109051#c425 (J2850, J1850, J1750, N3510, N2810, N2805, N2910, N3520, N2920, N2820, N2806, N2815, J2900, J1900, J1800, N3530, N2930, N2830, N2807, N3540, N2940, N2840, N2808) should run stably again. Baytrail-based systems with low average load in particular (e.g. tablets and notebooks) should consume considerably less power with the C7*-BYT states enabled. Please give feedback (stability, power consumption, etc.)!
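For readers who want to see what such run-time tweaking looks like, here is a minimal Python sketch of the same idea. It is not the attached c6off+c7on.sh script, whose contents are not reproduced in this thread. It assumes the usual cpuidle sysfs layout, where each /sys/devices/system/cpu/cpuN/cpuidle/stateM directory has a "name" file and a writable "disable" file. The function name, its parameters, and the prefix matching on "C6" and "C7" are choices made for this example.

```python
import glob
import os


def set_cstates(disable_prefix="C6", enable_prefix="C7",
                cpu_root="/sys/devices/system/cpu"):
    """Disable idle states whose name starts with disable_prefix and enable
    those whose name starts with enable_prefix.

    Returns a list of (cpu, state_name, value_written) tuples.
    """
    changes = []
    # Assumed layout: <cpu_root>/cpuN/cpuidle/stateM/{name,disable}
    pattern = os.path.join(cpu_root, "cpu[0-9]*", "cpuidle", "state[0-9]*")
    for state_dir in sorted(glob.glob(pattern)):
        with open(os.path.join(state_dir, "name")) as f:
            name = f.read().strip()
        if name.startswith(disable_prefix):
            value = "1"  # 1 = state disabled
        elif name.startswith(enable_prefix):
            value = "0"  # 0 = state enabled
        else:
            continue  # leave POLL, C1-BYT, etc. untouched
        with open(os.path.join(state_dir, "disable"), "w") as f:
            f.write(value)
        cpu = state_dir.split(os.sep)[-3]
        changes.append((cpu, name, value))
    return changes
```

Calling set_cstates() with the default cpu_root needs root privileges and only affects states that were not already excluded at boot by intel_idle.max_cstate. The cpu_root parameter exists so the function can be tried against a scratch directory first.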
Running my submitted scripts
https://bugzilla.kernel.org/attachment.cgi?id=223851
https://bugzilla.kernel.org/attachment.cgi?id=223861
on a J1900 system should produce output similar to this:
$ sudo $HOME/bin/c6off+c7on.sh
DISABLED state C6N-BYT for cpu0.
DISABLED state C6S-BYT for cpu0.
DISABLED state C6N-BYT for cpu1.
DISABLED state C6S-BYT for cpu1.
DISABLED state C6N-BYT for cpu2.
DISABLED state C6S-BYT for cpu2.
DISABLED state C6N-BYT for cpu3.
DISABLED state C6S-BYT for cpu3.
$ $HOME/bin/cstateInfo.sh
cpu0
State  Name     Disabled  Latency  Residency        Time  Usage
    0  POLL            0        0          0       77432    267
    1  C1-BYT          0        1          1    13849382  21986
    2  C6N-BYT         1      300        275      891290   1491
    3  C6S-BYT         1      500        560     1340774   1078
    4  C7-BYT          0     1200       4000     3190476    380
    5  C7S-BYT         0    10000      20000   255687727   1025
cpu1
State  Name     Disabled  Latency  Residency        Time  Usage
    0  POLL            0        0          0       10256    160
    1  C1-BYT          0        1          1    12134067  10470
    2  C6N-BYT         1      300        275      897517    514
    3  C6S-BYT         1      500        560     2742364    688
    4  C7-BYT          0     1200       4000     3223395    312
    5  C7S-BYT         0    10000      20000   256625325    886
cpu2
State  Name     Disabled  Latency  Residency        Time  Usage
    0  POLL            0        0          0       58350    205
    1  C1-BYT          0        1          1    14738863  26297
    2  C6N-BYT         1      300        275      974127   1195
    3  C6S-BYT         1      500        560     2688385    879
    4  C7-BYT          0     1200       4000    25533926   1768
    5  C7S-BYT         0    10000      20000   231166600   1894
cpu3
State  Name     Disabled  Latency  Residency        Time  Usage
    0  POLL            0        0          0        9249    232
    1  C1-BYT          0        1          1    14294725  24977
    2  C6N-BYT         1      300        275     1678518   2863
    3  C6S-BYT         1      500        560     2531238   1394
    4  C7-BYT          0     1200       4000     7240420    693
    5  C7S-BYT         0    10000      20000   250630919   2281
Running cstateInfo.sh again should show no changes in the lines for the disabled C6 states (C6N-BYT and C6S-BYT):
$ $HOME/bin/cstateInfo.sh
cpu0
State  Name     Disabled  Latency  Residency        Time  Usage
    0  POLL            0        0          0       77497    277
    1  C1-BYT          0        1          1    17466806  23676
    2  C6N-BYT         1      300        275      891290   1491
    3  C6S-BYT         1      500        560     1340774   1078
    4  C7-BYT          0     1200       4000     4231024    429
    5  C7S-BYT         0    10000      20000  1113610759   3134
cpu1
State  Name     Disabled  Latency  Residency        Time  Usage
    0  POLL            0        0          0       10292    168
    1  C1-BYT          0        1          1    20242967  12191
    2  C6N-BYT         1      300        275      897517    514
    3  C6S-BYT         1      500        560     2742364    688
    4  C7-BYT          0     1200       4000     4398584    346
    5  C7S-BYT         0    10000      20000  1109869872   2675
cpu2
State  Name     Disabled  Latency  Residency        Time  Usage
    0  POLL            0        0          0       58662    277
    1  C1-BYT          0        1          1    24698671  33431
    2  C6N-BYT         1      300        275      974127   1195
    3  C6S-BYT         1      500        560     2688385    879
    4  C7-BYT          0     1200       4000    94027530   3711
    5  C7S-BYT         0    10000      20000  1014763708   6407
cpu3
State  Name     Disabled  Latency  Residency        Time  Usage
    0  POLL            0        0          0        9448    277
    1  C1-BYT          0        1          1    29230274  30522
    2  C6N-BYT         1      300        275     1678518   2863
    3  C6S-BYT         1      500        560     2531238   1394
    4  C7-BYT          0     1200       4000    14492087   1315
    5  C7S-BYT         0    10000      20000  1090072439   7878
As one can see, in my case most of the cores' idle time is now spent in state C7S-BYT.
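The comparison between the two runs can also be done mechanically. The sketch below takes the cumulative Time values for cpu0 from the two cstateInfo.sh runs above and computes how the idle time accumulated between the runs is split across states. The function name and the dictionary layout are invented for this example; only the numbers come from the output above.

```python
def idle_time_share(before, after):
    """Share of idle time accumulated in each state between two snapshots.

    before/after map state name -> cumulative time in microseconds.
    """
    deltas = {name: after[name] - before[name] for name in before}
    total = sum(deltas.values())
    return {name: (d / total if total else 0.0) for name, d in deltas.items()}


# cpu0 "Time" column from the two cstateInfo.sh runs above.
cpu0_before = {"POLL": 77432, "C1-BYT": 13849382, "C6N-BYT": 891290,
               "C6S-BYT": 1340774, "C7-BYT": 3190476, "C7S-BYT": 255687727}
cpu0_after = {"POLL": 77497, "C1-BYT": 17466806, "C6N-BYT": 891290,
              "C6S-BYT": 1340774, "C7-BYT": 4231024, "C7S-BYT": 1113610759}

for name, share in idle_time_share(cpu0_before, cpu0_after).items():
    print(f"{name:8s} {share:8.3%}")
```

A share of exactly zero for C6N-BYT and C6S-BYT is what one would expect if the disable took effect, and for cpu0 nearly all of the new idle time lands in C7S-BYT.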
Nice script. FWIW, and maybe by luck, it seems that being headless on a J1900 helps a lot. I would surely lock up with an unpatched kernel + graphics. Note the lack of i915 interrupts (those shown were there at boot). So: 99 days of uptime (it would be longer, but I had to power off) on vanilla 4.1.18.
asr[~]$ sh cstateInfo.sh
cpu0
State  Name     Disabled  Latency  Residency           Time       Usage
    0  POLL            0        0          0      352565556      538707
    1  C1-BYT          0        1          1   130181110251   755499147
    2  C6N-BYT         0      300        275   168721715688   321645308
    3  C6S-BYT         0      500        560  2679566473195  1081712423
    4  C7-BYT          0     1200       4000  5201809523872   677055949
    5  C7S-BYT         0    10000      20000   232672548010     6063953
cpu1
State  Name     Disabled  Latency  Residency           Time       Usage
    0  POLL            0        0          0       66553174      100721
    1  C1-BYT          0        1          1    21321194555    95022167
    2  C6N-BYT         0      300        275    59708872499    80912844
    3  C6S-BYT         0      500        560  1699545542740   616884568
    4  C7-BYT          0     1200       4000  6157806454862   674802503
    5  C7S-BYT         0    10000      20000   528940757115    15010441
cpu2
State  Name     Disabled  Latency  Residency           Time       Usage
    0  POLL            0        0          0       52182600       54992
    1  C1-BYT          0        1          1    11577684031    44781333
    2  C6N-BYT         0      300        275    30691974207    38857448
    3  C6S-BYT         0      500        560   926619750261   332837818
    4  C7-BYT          0     1200       4000  5605371769375   533458885
    5  C7S-BYT         0    10000      20000  1938187552261    60665722
cpu3
State  Name     Disabled  Latency  Residency           Time       Usage
    0  POLL            0        0          0       87403241       51053
    1  C1-BYT          0        1          1    10016724016    38039691
    2  C6N-BYT         0      300        275    28416307863    35148851
    3  C6S-BYT         0      500        560   827491037749   293247527
    4  C7-BYT          0     1200       4000  5475692994922   503244246
    5  C7S-BYT         0    10000      20000  2176064852237    65760515
asr[~]$ uptime
00:27:42 up 99 days, 11:27, 1 user, load average: 0.01, 0.02, 0.05
asr[~]$ uname -a
Linux asr 4.1.18 #1 SMP Mon Feb 22 23:38:21 GMT 2016 x86_64 GNU/Linux
asr[~]$ cat /proc/interrupts
           CPU0        CPU1        CPU2        CPU3
  0:         40           0           0           0  IO-APIC-edge  timer
  1:          3           0           0           0  IO-APIC-edge  i8042
  7:          1           0           0           0  IO-APIC-edge
  8:          2           0           0           0  IO-APIC-fasteoi  rtc0
  9:          0           0           0           0  IO-APIC-fasteoi  acpi
 12:          4           0           0           0  IO-APIC-edge  i8042
 23:   97932462           0           0           0  IO-APIC 23-fasteoi  ehci_hcd:usb1
 87:         38           0           0           0  PCI-MSI-edge  i915
 88:   12913831           0           0           0  PCI-MSI-edge  0000:04:00.0
 89: 1013520070           0           0           0  PCI-MSI-edge  eth0
NMI:          0           0           0           0  Non-maskable interrupts
LOC: 1613399563  1431968881   931951991   935213869  Local timer interrupts
SPU:          0           0           0           0  Spurious interrupts
PMI:          0           0           0           0  Performance monitoring interrupts
IWI:          1           0           0           0  IRQ work interrupts
RTR:          0           0           0           0  APIC ICR read retries
RES:   26307754    50297144    12568066    12293478  Rescheduling interrupts
CAL:       2124        1472     1663888     1463719  Function call interrupts
TLB:      25130       19257       11500       11783  TLB shootdowns
TRM:          0           0           0           0  Thermal event interrupts
THR:          0           0           0           0  Threshold APIC interrupts
MCE:          0           0           0           0  Machine check exceptions
MCP:      28651       28651       28651       28651  Machine check polls
ERR:          1
MIS:          0
I have a machine that was taken out of service because of the freezes and because of another issue on this platform, where USB devices randomly disappear. Later on I used it as a headless Asterisk server and never had a single freeze with Ubuntu 14.04 and kernel 3.16, 3.19, 4.1 or 4.4. So without a running X server, it seems to work without freezes.
Guys, posting this again. Not really sure if this helps, but my system has been up for 7 days without a crash, and of course with NO max_cstate parameter set. I'm running Arch Linux with kernel 4.6.3. My CPU is a Celeron N2930. I have X constantly running, as the PC runs Kodi by default.
processor : 2
vendor_id : GenuineIntel
cpu family : 6
model : 55
model name : Intel(R) Celeron(R) CPU N2930 @ 1.83GHz
stepping : 8
microcode : 0x829
Linux zotac 4.6.3-1-ARCH #1 SMP PREEMPT Fri Jun 24 21:19:13 CEST 2016 x86_64 GNU/Linux
10:06:41 up 7 days, 13:34, 2 users, load average: 0,08, 0,06, 0,01
[ 0.000000] Linux version 4.6.3-1-ARCH (builduser@tobias) (gcc version 6.1.1 20160602 (GCC) ) #1 SMP PREEMPT Fri Jun 24 21:19:13 CEST 2016
[ 0.000000] Command line: initrd=\initramfs-linux.img root=/dev/sda2 rw
[ 0.000000] x86/fpu: Legacy x87 FPU detected.
[ 0.000000] x86/fpu: Using 'eager' FPU context switches.
Hi all, I've been following this bug for a long time as my Bay Trail tablet HP Pavilion X2 with an Atom Z3736F kept freezing within 1-2 hours after booting. I've tried all major kernel versions since 4.1. I must mention that they all included the following mmc patches: https://github.com/hadess/rtl8723bs They also included the intel patch suggested at Debian's wiki: https://wiki.debian.org/InstallingDebianOn/HP/Pavilion%20x2%2010%20%282015%20model%29/Jessie?action=AttachFile&do=view&target=intel_display.patch Other than those, I tried various patches I found around hoping to cure the freezes. Even max_cstate=0 did not help. Finally, with 4.6.3 + Mika Kuoppala's 3 patches the situation got somewhat better but I never exceeded 4 hours without a freeze. Then I came across Daniel Bilik's patches elsewhere (for some reason I had overlooked his posts on this page!). With his patches applied, I made several reboots, also playing around with Wolfgang's scripts and so far I never had a regular freeze[1]. Now my tablet's uptime has reached 24 hours for the first time ever, being booted without any max_cstate arguments and the C6 states being active. It's perhaps too soon to declare this a success but apparently Daniel's addition to Mika's patches has made a huge difference here. [1] What I mean is this: Without max_cstate=0 (or 1), the tablet can freeze during boot. Most of the freezes are when the disks are mounted/fsck'ed and others when the drm framebuffer is initialized. And occasionally the screen goes blank when drm takes over, but I can recover with magic SysRq. I don't know if any of these problems might be due to other factors than the Intel Bay Trail bug. But once it gets past the booting stage, (with these latest patches) it seems to survive the rest.
I'm using a Bay Trail NUC (DN2820FYKH), but I don't remember ever encountering this bug [109051]. I'm posting just to let you know that, for whatever reason, this NUC seems not to be affected. I'm using the latest Arch (through Antergos); here is some info. I booted into the desktop and didn't do anything else (or how should I reproduce this bug?).
$ uname -a
Linux *** 4.6.4-1-ARCH #1 SMP PREEMPT Mon Jul 11 19:12:32 CEST 2016 x86_64 GNU/Linux
$ cat /proc/cpuinfo
processor : 0
vendor_id : GenuineIntel
cpu family : 6
model : 55
model name : Intel(R) Celeron(R) CPU N2820 @ 2.13GHz
stepping : 3
microcode : 0x324
cpu MHz : 533.116
cache size : 1024 KB
...
$ ./cstateInfo.sh
cpu0
State  Name     Disabled  Latency  Residency        Time   Usage
    0  POLL            0        0          0    12418075     717
    1  C1-BYT          0        1          1    93279269   79577
    2  C6N-BYT         0      300        275    20566295   34104
    3  C6S-BYT         0      500        560   949015145  355284
    4  C7-BYT          0     1200       4000  8810179482  634227
    5  C7S-BYT         0    10000      20000  8042123989   99746
cpu1
State  Name     Disabled  Latency  Residency        Time   Usage
    0  POLL            0        0          0     6130914     716
    1  C1-BYT          0        1          1   391648635   73352
    2  C6N-BYT         0      300        275    17716302   30140
    3  C6S-BYT         0      500        560   846101843  326140
    4  C7-BYT          0     1200       4000  8388706626  596330
    5  C7S-BYT         0    10000      20000  8278801097   97725
$ uptime
14:29:40 up 5:04, 2 users, load average: 0,17, 0,07, 0,01
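Since reports from other CPU generations have crept into this thread, a quick check of /proc/cpuinfo may help reporters confirm that their machine is in scope. Both cpuinfo dumps posted here (N2930 and N2820) show cpu family 6, model 55. The sketch below assumes that this family/model pair is what identifies a Bay Trail part, which has not been checked against every SKU listed earlier. The helper names are made up for this example.

```python
def parse_cpuinfo(text):
    """Parse the first processor block of /proc/cpuinfo text into a dict."""
    info = {}
    for line in text.splitlines():
        if not line.strip():
            if info:
                break  # a blank line ends the first processor block
            continue
        key, sep, value = line.partition(":")
        if sep:
            info[key.strip()] = value.strip()
    return info


def looks_like_bay_trail(info):
    """Assumption: GenuineIntel, family 6, model 55 means Bay Trail."""
    return (info.get("vendor_id") == "GenuineIntel"
            and info.get("cpu family") == "6"
            and info.get("model") == "55")
```

A typical call would be looks_like_bay_trail(parse_cpuinfo(open("/proc/cpuinfo").read())). A False result only means the machine does not match the two examples seen in this thread.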
I have a Shuttle XS35V4 with Intel Celeron and Z36xxx/Z37xxx Series Graphics. It would rather quickly hang when graphical capabilities were used. Watching youtube or a video would hang it in minutes. Ever since I set intel_idle.max_cstate=1, I could not reproduce any hangs, despite continuously playing Idiocracy in a while loop, and playing a stream of youtube videos at the same time, the computer stays up for days.
Fascinating 2 hours reading the 443 comments above ^^^ Dell Inspiron 17R SE 7720. Intel i7 3630QM. Nvidia GT650M. Observations based on YouTube streaming and 10 Chrome tabs. Had been running 3.13.092 / Ubuntu 14.04.1 for a month. Conky monitor: heat 50C. CPU frequency bounces between 1200 and 2400 MHz. 8 CPUs run around 15%. No real problems other than Turbo Boost appearing inactive. Yesterday: upgraded to 4.4.0 / Ubuntu 16.04, and closing the lid doesn't suspend. Updated to 4.6.3 and patched the systemd config to make suspend work. Conky monitor: heat now 70C and occasional 5-10 second keyboard lag. CPU scaling now goes over the 3200 MHz turbo boost limit, but is usually around 2000 MHz. 8 CPUs now running at 7% with more even balancing. When wifi gets patchy the CPU really races. Also have a smartphone plugged into a USB powered hub, and an external TV via HDMI. Will try Yakety Yak (Ubuntu 16.10) soon, because sound switches from the TV to the laptop during suspend, and PulseAudio 9 (in Yakety Yak) fixes this PulseAudio 8 "undocumented feature" (in Xenial X-thingy). No system lockups, but running 20C hotter and the occasional 5-10 second keyboard freezes are concerning. HTH. Please don't flame me for not saying BayTrail.
Like a few other posters, I spoke too soon. As I was writing the last post, YouTube was auto-running at 144p. Under 4.6.3, with YouTube at 1080p, the numbers are: heat 80C, average 3000 MHz, 8 CPUs at 18% utilization (average estimated visually). It took many months studying udev and now I fear it will be the same with systemd.
Tried new Kernel 4.7.0 and removed intel_idle.max_cstate=1 (cpu J1900). Crashed after a few hours. Still not solved.
Caught today:
Aug 8 06:07:44 HP-Mini kernel: [ 10.104246] ------------[ cut here ]------------
Aug 8 06:07:44 HP-Mini kernel: [ 10.104266] WARNING: CPU: 0 PID: 21 at /build/linux-7z1rSb/linux-3.16.7-ckt25/include/linux/kref.h:47 kobject_get+0x3a/0x50()
Aug 8 06:07:44 HP-Mini kernel: [ 10.104270] Modules linked in: acpi_cpufreq(+) processor fuse autofs4 ext4 crc16 mbcache jbd2 ums_realtek sg sd_mod crc_t10dif crct10dif_generic crct10dif_common usb_storage ahci libahci ehci_pci uhci_hcd ehci_hcd libata psmouse scsi_mod usbcore usb_common r8169 mii fan thermal thermal_sys
Aug 8 06:07:44 HP-Mini kernel: [ 10.104319] CPU: 0 PID: 21 Comm: kworker/0:1 Not tainted 3.16.0-4-amd64 #1 Debian 3.16.7-ckt25-2+deb8u3
Aug 8 06:07:44 HP-Mini kernel: [ 10.104323] Hardware name: Hewlett-Packard HP Mini 210-3000/3594, BIOS F.13 11/10/2011
Aug 8 06:07:44 HP-Mini kernel: [ 10.104332] Workqueue: kacpi_notify acpi_os_execute_deferred
Aug 8 06:07:44 HP-Mini kernel: [ 10.104336] 0000000000000000 ffffffff8150e08f 0000000000000000 0000000000000009
Aug 8 06:07:44 HP-Mini kernel: [ 10.104343] ffffffff81067777 ffff880036961c00 0000000000000202 0000000000000003
Aug 8 06:07:44 HP-Mini kernel: [ 10.104349] 0000000000000003 ffff880036e422f0 ffffffff812acbfa ffff880036961d28
Aug 8 06:07:44 HP-Mini kernel: [ 10.104355] Call Trace:
Aug 8 06:07:44 HP-Mini kernel: [ 10.104367] [<ffffffff8150e08f>] ? dump_stack+0x5d/0x78
Aug 8 06:07:44 HP-Mini kernel: [ 10.104376] [<ffffffff81067777>] ? warn_slowpath_common+0x77/0x90
Aug 8 06:07:44 HP-Mini kernel: [ 10.104383] [<ffffffff812acbfa>] ? kobject_get+0x3a/0x50
Aug 8 06:07:44 HP-Mini kernel: [ 10.104391] [<ffffffff813d94f0>] ? cpufreq_cpu_get+0x70/0xc0
Aug 8 06:07:44 HP-Mini kernel: [ 10.104398] [<ffffffff813d9f2a>] ? cpufreq_update_policy+0x1a/0x1d0
Aug 8 06:07:44 HP-Mini kernel: [ 10.104406] [<ffffffff813da0e0>] ? cpufreq_update_policy+0x1d0/0x1d0
Aug 8 06:07:44 HP-Mini kernel: [ 10.104421] [<ffffffffa018b566>] ? cpufreq_set_cur_state.part.3+0x83/0x8a [processor]
Aug 8 06:07:44 HP-Mini kernel: [ 10.104430] [<ffffffffa018b666>] ? processor_set_cur_state+0x97/0xd1 [processor]
Aug 8 06:07:44 HP-Mini kernel: [ 10.104444] [<ffffffffa0000e05>] ? thermal_cdev_update+0xa5/0x110 [thermal_sys]
Aug 8 06:07:44 HP-Mini kernel: [ 10.104453] [<ffffffffa0003729>] ? step_wise_throttle+0x49/0x80 [thermal_sys]
Aug 8 06:07:44 HP-Mini kernel: [ 10.104462] [<ffffffffa000161c>] ? handle_thermal_trip+0x4c/0x150 [thermal_sys]
Aug 8 06:07:44 HP-Mini kernel: [ 10.104471] [<ffffffffa000179d>] ? thermal_zone_device_update+0x7d/0xd0 [thermal_sys]
Aug 8 06:07:44 HP-Mini kernel: [ 10.104479] [<ffffffff813319a1>] ? acpi_ev_notify_dispatch+0x3c/0x51
Aug 8 06:07:44 HP-Mini kernel: [ 10.104485] [<ffffffff8131e457>] ? acpi_os_execute_deferred+0x10/0x1a
Aug 8 06:07:44 HP-Mini kernel: [ 10.104492] [<ffffffff81081742>] ? process_one_work+0x172/0x420
Aug 8 06:07:44 HP-Mini kernel: [ 10.104499] [<ffffffff81081dd3>] ? worker_thread+0x113/0x4f0
Aug 8 06:07:44 HP-Mini kernel: [ 10.104505] [<ffffffff815105c1>] ? __schedule+0x2b1/0x700
Aug 8 06:07:44 HP-Mini kernel: [ 10.104511] [<ffffffff81081cc0>] ? rescuer_thread+0x2d0/0x2d0
Aug 8 06:07:44 HP-Mini kernel: [ 10.104519] [<ffffffff8108800d>] ? kthread+0xbd/0xe0
Aug 8 06:07:44 HP-Mini kernel: [ 10.104526] [<ffffffff81087f50>] ? kthread_create_on_node+0x180/0x180
Aug 8 06:07:44 HP-Mini kernel: [ 10.104533] [<ffffffff81514158>] ? ret_from_fork+0x58/0x90
Aug 8 06:07:44 HP-Mini kernel: [ 10.104540] [<ffffffff81087f50>] ? kthread_create_on_node+0x180/0x180
Aug 8 06:07:44 HP-Mini kernel: [ 10.104544] ---[ end trace 6a04776659b650d3 ]---
and:
Aug 8 06:10:14 HP-Mini kernel: [ 169.942572] general protection fault: 0000 [#1] SMP
Aug 8 06:10:14 HP-Mini kernel: [ 169.942761] Modules linked in: bnep ctr ccm nfsd auth_rpcgss oid_registry nfs_acl nfs lockd fscache sunrpc arc4 snd_hda_codec_idt ath9k ath9k_common snd_hda_codec_generic ath9k_hw uvcvideo videobuf2_vmalloc videobuf2_memops videobuf2_core v4l2_common coretemp videodev snd_hda_intel media ath3k btusb snd_hda_controller kvm ath hp_wmi snd_hda_codec bluetooth mac80211 i915 iTCO_wdt drm_kms_helper cfg80211 drm 6lowpan_iphc iTCO_vendor_support sparse_keymap snd_hwdep snd_pcm rfkill ac shpchp wmi i2c_i801 i2c_algo_bit joydev evdev snd_timer serio_raw pcspkr lpc_ich mfd_core i2c_core snd video battery soundcore button acpi_cpufreq processor fuse autofs4 ext4 crc16 mbcache jbd2 ums_realtek sg sd_mod crc_t10dif crct10dif_generic crct10dif_common usb_storage ahci libahci ehci_pci uhci_hcd ehci_hcd libata psmouse scsi_mod usbcore usb_common r8169 mii fan thermal thermal_sys
Aug 8 06:10:14 HP-Mini kernel: [ 169.944026] CPU: 2 PID: 836 Comm: Xorg Tainted: G W 3.16.0-4-amd64 #1 Debian 3.16.7-ckt25-2+deb8u3
Aug 8 06:10:14 HP-Mini kernel: [ 169.944026] Hardware name: Hewlett-Packard HP Mini 210-3000/3594, BIOS F.13 11/10/2011
Aug 8 06:10:14 HP-Mini kernel: [ 169.944026] task: ffff88007b236d20 ti: ffff880079294000 task.ti: ffff880079294000
Aug 8 06:10:14 HP-Mini kernel: [ 169.944026] RIP: 0010:[<ffffffff812bada0>] [<ffffffff812bada0>] sg_next+0x0/0x30
Aug 8 06:10:14 HP-Mini kernel: [ 169.944026] RSP: 0018:ffff880079297b80 EFLAGS: 00010202
Aug 8 06:10:14 HP-Mini kernel: [ 169.944026] RAX: ea00011d51829182 RBX: 0000000000000001 RCX: ffff880036b38880
Aug 8 06:10:14 HP-Mini kernel: [ 169.944026] RDX: ffff880036b38700 RSI: 0000000000000000 RDI: ea00011d51829182
Aug 8 06:10:14 HP-Mini kernel: [ 169.944026] RBP: 000000000000ffff R08: 0000000007637000 R09: 0000000000000000
Aug 8 06:10:14 HP-Mini kernel: [ 169.944026] R10: 0000000007800000 R11: 0000000000000000 R12: ea00011d51829182
Aug 8 06:10:14 HP-Mini kernel: [ 169.944026] R13: ffff88007c7ee098 R14: ffffffff8181f660 R15: ffff8800628d1900
Aug 8 06:10:14 HP-Mini kernel: [ 169.944026] FS: 00007f4281a3c980(0000) GS:ffff88007f300000(0000) knlGS:0000000000000000
Aug 8 06:10:14 HP-Mini kernel: [ 169.944026] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
Aug 8 06:10:14 HP-Mini kernel: [ 169.944026] CR2: 00007f428157e000 CR3: 000000007b2ff000 CR4: 00000000000007e0
Aug 8 06:10:14 HP-Mini kernel: [ 169.944026] Stack:
Aug 8 06:10:14 HP-Mini kernel: [ 169.944026] ffffffffa03b7b7b ffff88007c1b7a08 0000000000001000 0000000000001000
Aug 8 06:10:14 HP-Mini kernel: [ 169.944026] ffff880079e0b800 0000000000000000 ffffffffa03bdf8f 0000000020000000
Aug 8 06:10:14 HP-Mini kernel: [ 169.944026] 0000000000000000 ffff880000000000 0000000000000000 ffff88007c1b0000
Aug 8 06:10:14 HP-Mini kernel: [ 169.944026] Call Trace:
Aug 8 06:10:14 HP-Mini kernel: [ 169.944026] [<ffffffffa03b7b7b>] ? i915_gem_gtt_prepare_object+0x6b/0xb0 [i915]
Aug 8 06:10:14 HP-Mini kernel: [ 169.944026] [<ffffffffa03bdf8f>] ? i915_gem_object_pin+0x57f/0x780 [i915]
Aug 8 06:10:14 HP-Mini kernel: [ 169.944026] [<ffffffffa0421559>] ? i915_gem_execbuffer_reserve_vma.isra.16+0x95/0x11a [i915]
Aug 8 06:10:14 HP-Mini kernel: [ 169.944026] [<ffffffffa042182a>] ? i915_gem_execbuffer_reserve+0x24c/0x2dc [i915]
Aug 8 06:10:14 HP-Mini kernel: [ 169.944026] [<ffffffffa03b358d>] ? i915_gem_do_execbuffer.isra.24+0x89d/0x13f0 [i915]
Aug 8 06:10:14 HP-Mini kernel: [ 169.944026] [<ffffffffa03bcc8b>] ? i915_gem_object_put_fence+0x1b/0xc0 [i915]
Aug 8 06:10:14 HP-Mini kernel: [ 169.944026] [<ffffffffa03b459f>] ? i915_gem_execbuffer2+0xaf/0x2b0 [i915]
Aug 8 06:10:14 HP-Mini kernel: [ 169.944026] [<ffffffffa02db8a7>] ? drm_ioctl+0x1c7/0x5b0 [drm]
Aug 8 06:10:14 HP-Mini kernel: [ 169.944026] [<ffffffff811be12e>] ? dput+0x9e/0x170
Aug 8 06:10:14 HP-Mini kernel: [ 169.944026] [<ffffffff811ba9af>] ? do_vfs_ioctl+0x2cf/0x4b0
Aug 8 06:10:14 HP-Mini kernel: [ 169.944026] [<ffffffff81085261>] ? task_work_run+0x91/0xb0
Aug 8 06:10:14 HP-Mini kernel: [ 169.944026] [<ffffffff811bac11>] ? SyS_ioctl+0x81/0xa0
Aug 8 06:10:14 HP-Mini kernel: [ 169.944026] [<ffffffff815144ca>] ? int_signal+0x12/0x17
Aug 8 06:10:14 HP-Mini kernel: [ 169.944026] [<ffffffff8151420d>] ? system_call_fast_compare_end+0x10/0x15
Aug 8 06:10:14 HP-Mini kernel: [ 169.944026] Code: 27 fa ff ff 0f 1f 80 00 00 00 00 c7 47 10 00 00 00 00 89 57 0c 48 89 37 89 4f 08 c3 66 66 66 66 66 66 2e 0f 1f 84 00 00 00 00 00 <f6> 07 02 75 13 48 8b 57 20 48 8d 47 20 f6 c2 01 75 09 f3 c3 0f
Aug 8 06:10:14 HP-Mini kernel: [ 169.944026] RSP <ffff880079297b80>
Aug 8 06:10:14 HP-Mini kernel: [ 170.030137] ---[ end trace 6a04776659b650d4 ]---
and also:
Aug 8 06:07:44 HP-Mini kernel: [ 11.789183] ACPI Error: Field [D128] at 1040 exceeds Buffer [NULL] size 160 (bits) (20140424/dsopcode-236)
Aug 8 06:07:44 HP-Mini kernel: [ 11.789203] ACPI Error: Method parse/execution failed [\_SB_.WMID.HWMC] (Node ffff88007e852f40), AE_AML_BUFFER_LIMIT (20140424/psparse-536)
Aug 8 06:07:44 HP-Mini kernel: [ 11.789228] ACPI Error: Method parse/execution failed [\_SB_.WMID.WMAD] (Node ffff88007e852d10), AE_AML_BUFFER_LIMIT (20140424/psparse-536)
Aug 8 06:07:44 HP-Mini kernel: [ 11.789420] ACPI Error: Field [D128] at 1040 exceeds Buffer [NULL] size 160 (bits) (20140424/dsopcode-236)
Aug 8 06:07:44 HP-Mini kernel: [ 11.789437] ACPI Error: Method parse/execution failed [\_SB_.WMID.HWMC] (Node ffff88007e852f40), AE_AML_BUFFER_LIMIT (20140424/psparse-536)
Aug 8 06:07:44 HP-Mini kernel: [ 11.789461] ACPI Error: Method parse/execution failed [\_SB_.WMID.WMAD] (Node ffff88007e852d10), AE_AML_BUFFER_LIMIT (20140424/psparse-536)
Aug 8 06:07:44 HP-Mini kernel: [ 11.789643] ACPI Error: Field [D128] at 1040 exceeds Buffer [NULL] size 160 (bits) (20140424/dsopcode-236)
Aug
8 06:07:44 HP-Mini kernel: [ 11.789659] ACPI Error: Method parse/execution failed [\_SB_.WMID.HWMC] (Node ffff88007e852f40), AE_AML_BUFFER_LIMIT (20140424/psparse-536) With the freeze effect (hard-boot required). Aren't the dumps related?
@Maciej Hrebien #444 You don't explain much about your system. Buried in the log is "Hardware name: Hewlett-Packard HP Mini 210-3000/3594, BIOS F.13 11/10/2011". This seems to be a notebook with a Pineview or earlier Atom processor. That is not the subject of this Bugzilla entry.
Yes, it's an N570 chip, and I can share more details if needed. I thought the dumps were related, as setting the max cstate to 1 makes the device usable (it has been running ~12h now without a freeze). The 3.8.2 kernel seems to work fine for me, that is, without any freezes or workarounds.
Hello, my computer is an Acer E5-511P. It always crashed randomly under Linux (the problem above). I was able to resolve the issue with "intel_idle.max_cstate=1" and by blacklisting dw_dmac and dw_dmac_core. A fun fact: I'm now using Windows 10 with the "Windows Subsystem for Linux" (Ubuntu 14.04, Linux kernel version 3.4), and my computer has crashed twice since then (the UI froze and the CPU fan spun at top speed).
(In reply to Kevin from comment #450) > and blacklisting dw_dmac and dw_dmac_core. That should be solved in v4.5. So if you have kernel v4.5+, please try again without blacklisting dw_dmac. You may refer to bug #101271 for the details.
(In reply to Wolfgang M. Reimer from comment #437) > Running my submitted scripts > > https://bugzilla.kernel.org/attachment.cgi?id=223851 > https://bugzilla.kernel.org/attachment.cgi?id=223861 > > on a J1900 system should produce a similar output: > > As one can see in my case most of the core's idle time is now spent in state > C7S-BYT. Thanks a lot! I've had my system running since then with the C6 state disabled and no freezes. I've done one reboot to update the kernel, and the current session has 21 days of uptime now. This is on an N3540 laptop on which I've had quite random but steady occurrences of freezes over the last year. So it might be too early to declare success, but it seems promising for now.
I tried the C6-disabling script as per Wolfgang M. Reimer's scripts (C6 events didn't seem to be increasing afterwards) and got a lockup within 2 hours.
Linux htpc 4.4.0-34-generic #53-Ubuntu SMP Wed Jul 27 16:06:39 UTC 2016 x86_64 x86_64 x86_64 GNU/Linux
model name : Intel(R) Celeron(R) CPU J1900 @ 1.99GHz
stepping : 8
microcode : 0x831
Board is Asrock Mod Q1900ITX
My use case: launch Kodi, start playing a DVD for 20 minutes, pause the DVD for 30 minutes, resume playing for perhaps 10 minutes, then pause again for 20 minutes. When I tried to resume playback, the system was unresponsive. Even the reset button didn't respond. I just booted with intel_idle.max_cstate=1 and will report if this has the same issue.
I disabled the C6 states as described by Wolfgang M. Reimer. No crashes for two days now, whereas before this J1900-based system reliably crashed in less than an hour of uptime. The board is an ASRock Q1900-ITX.
processor : 0
vendor_id : GenuineIntel
cpu family : 6
model : 55
model name : Intel(R) Celeron(R) CPU J1900 @ 1.99GHz
stepping : 3
microcode : 0x320
cpu MHz : 1521.891
cache size : 1024 KB
physical id : 0
siblings : 4
core id : 0
cpu cores : 4
apicid : 0
initial apicid : 0
fpu : yes
fpu_exception : yes
cpuid level : 11
wp : yes
flags : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm pbe syscall nx rdtscp lm constant_tsc arch_perfmon pebs bts rep_good nopl xtopology nonstop_tsc aperfmperf pni pclmulqdq dtes64 monitor ds_cpl vmx est tm2 ssse3 cx16 xtpr pdcm sse4_1 sse4_2 movbe popcnt tsc_deadline_timer rdrand lahf_lm 3dnowprefetch epb tpr_shadow vnmi flexpriority ept vpid tsc_adjust smep erms dtherm ida arat
bugs :
bogomips : 3993.60
clflush size : 64
cache_alignment : 64
address sizes : 36 bits physical, 48 bits virtual
power management:
Thanks to Wolfgang M. Reimer, I am now running 4.7.2 with no problems. As per https://bugzilla.kernel.org/show_bug.cgi?id=109051#c437
Here is my relevant information;
lscpu:
Architecture: x86_64
CPU op-mode(s): 32-bit, 64-bit
Byte Order: Little Endian
CPU(s): 4
On-line CPU(s) list: 0-3
Thread(s) per core: 1
Core(s) per socket: 4
Socket(s): 1
NUMA node(s): 1
Vendor ID: GenuineIntel
CPU family: 6
Model: 55
Model name: Intel(R) Pentium(R) CPU N3540 @ 2.16GHz
Stepping: 8
CPU MHz: 499.677
CPU max MHz: 2665.6001
CPU min MHz: 499.8000
BogoMIPS: 4328.66
Virtualization: VT-x
L1d cache: 24K
L1i cache: 32K
L2 cache: 1024K
NUMA node0 CPU(s): 0-3
Flags: fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm pbe syscall nx rdtscp lm constant_tsc arch_perfmon pebs bts rep_good nopl xtopology nonstop_tsc aperfmperf eagerfpu pni pclmulqdq dtes64 monitor ds_cpl vmx est tm2 ssse3 cx16 xtpr pdcm sse4_1 sse4_2 movbe popcnt tsc_deadline_timer rdrand lahf_lm 3dnowprefetch epb tpr_shadow vnmi flexpriority ept vpid tsc_adjust smep erms dtherm ida arat
Base Board Information
Manufacturer: Dell Inc.
Product Name: 0H4MK6
Version: A00
Serial Number: .HS1LK52.CN7620657U0002.
How does disabling C6 while leaving C7 enabled affect battery life?
(In reply to cscs from comment #455) > Thanks to Wolfgang M. Reimer, I am now running 4.7.2 with no problems. > As per https://bugzilla.kernel.org/show_bug.cgi?id=109051#c437 > [...] As far as I can tell, I too am running 4.7.2 with no problem - Arch Linux, Acer Travelmate B115M
Hello, on a Lenovo E50-00 with an Intel Pentium J2900 CPU I had these random freezes. In addition, I also get a freeze when RESUMING AFTER SUSPEND. Therefore, when I tested more distributions (kernel versions), I only tested resume after suspend. (Hibernate does work fine.) I had the bug with:
- Linux Mint 18.0 (based on Ubuntu 16.04) with kernel 4.4
- Ubuntu 14.04
- Ubuntu 12.04 with kernel 3.13
I also had the bug when I changed the CPU's BIOS setting to C1 only.
The c6-off/c7-on script is also effective on Z3775 baytrail processor in my ASUS T100CHI (and is reported effective for other ASUS T100T* models.) Listed cstates for the Z3775 are (POLL,C1-BYT,C6N-BYT,C6S-BYT,C7-BYT,C7S-BYT) Now my only kernel arguments are "tsc=reliable clocksource=tsc". I no longer need intel_idle.max_cstate={0,1}. Even with recent kernels, the T100CHI would rarely go more than 30 minutes without a freeze unless cstate was limited. I also had freezes when trying max_cstate=3. Many thanks for tracking this one down.
Tried the C6-disabling script on an ASRock q1900itx-m. It ran for a whole night, whereas otherwise it would freeze within 1 or 2 hours and I had to use the max_cstate=1 fix. I will roll it out to 50 more machines until December. Not a fix, but an OK workaround for this issue.
I've also tried the C6-disabling script; I made a startup service for it on openSUSE Tumbleweed with the latest updates. It's now running smoothly, whereas at first I had disabled the C6 and C7 states in the UEFI BIOS. Now I've re-enabled the states and everything seems to run OK. No freezes for 4 hours now.
Output from cstateInfo.sh:
cpu0
State  Name     Disabled  Latency  Residency  Time        Usage
0      POLL     0         0        0          29315450    16010
1      C1-BYT   0         1        1          2075922404  2720073
2      C6N-BYT  1         300      275        1298175     459
3      C6S-BYT  1         500      560        6214377     1612
4      C7-BYT   0         1200     4000       1228124502  139642
5      C7S-BYT  0         10000    20000      474724588   20423
cpu1
State  Name     Disabled  Latency  Residency  Time        Usage
0      POLL     0         0        0          34446485    17262
1      C1-BYT   0         1        1          2026994454  2604049
2      C6N-BYT  1         300      275        1339377     414
3      C6S-BYT  1         500      560        5097170     895
4      C7-BYT   0         1200     4000       1215493749  130038
5      C7S-BYT  0         10000    20000      554333717   22140
cpu2
State  Name     Disabled  Latency  Residency  Time        Usage
0      POLL     0         0        0          32861042    15934
1      C1-BYT   0         1        1          2994741581  5739338
2      C6N-BYT  1         300      275        958074      269
3      C6S-BYT  1         500      560        5137562     1061
4      C7-BYT   0         1200     4000       533053353   77172
5      C7S-BYT  0         10000    20000      111720079   5108
cpu3
State  Name     Disabled  Latency  Residency  Time        Usage
0      POLL     0         0        0          26658663    12680
1      C1-BYT   0         1        1          2232165052  3062867
2      C6N-BYT  1         300      275        900500      238
3      C6S-BYT  1         500      560        4047949     844
4      C7-BYT   0         1200     4000       1198658992  148698
5      C7S-BYT  0         10000    20000      307666394   14599
And lscpu:
Architecture: x86_64
CPU op-mode(s): 32-bit, 64-bit
Byte Order: Little Endian
CPU(s): 4
On-line CPU(s) list: 0-3
Thread(s) per core: 1
Core(s) per socket: 4
Socket(s): 1
NUMA node(s): 1
Vendor ID: GenuineIntel
CPU family: 6
Model: 55
Model name: Intel(R) Celeron(R) CPU J1900 @ 1.99GHz
Stepping: 8
CPU MHz: 1332.718
CPU max MHz: 2415.7000
CPU min MHz: 1332.8000
BogoMIPS: 3993.60
Virtualization: VT-x
L1d cache: 24K
L1i cache: 32K
L2 cache: 1024K
NUMA node0 CPU(s): 0-3
Flags: fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm pbe syscall nx rdtscp lm constant_tsc arch_perfmon pebs bts rep_good nopl xtopology nonstop_tsc aperfmperf eagerfpu pni pclmulqdq dtes64 monitor ds_cpl vmx est tm2 ssse3 cx16 xtpr pdcm sse4_1 sse4_2 movbe popcnt tsc_deadline_timer rdrand lahf_lm 3dnowprefetch epb tpr_shadow vnmi flexpriority ept vpid tsc_adjust smep erms dtherm ida arat
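The attached cstateInfo.sh itself is not reproduced in this thread. As a rough idea of where the columns above come from, here is a minimal sketch that prints a similar table from the cpuidle sysfs attributes (name, disable, latency, residency, time, usage). The function names and the optional root-directory argument are assumptions made for illustration; this is not the attached script.

```shell
#!/bin/sh
# Hypothetical stand-in for cstateInfo.sh (NOT the attachment from this bug).
# Prints one table per CPU from the cpuidle sysfs attributes.

read_attr() {
    # Print the content of a sysfs attribute, or "-" if it is missing.
    if [ -r "$1" ]; then cat "$1"; else echo "-"; fi
}

print_cstate_info() {
    # $1: CPU sysfs root (assumed default: /sys/devices/system/cpu)
    cpu_root="${1:-/sys/devices/system/cpu}"
    for cpu in "$cpu_root"/cpu[0-9]*; do
        [ -d "$cpu/cpuidle" ] || continue
        echo "${cpu##*/}"
        printf '%-6s %-8s %-9s %-8s %-10s %-13s %s\n' \
            State Name Disabled Latency Residency Time Usage
        for state in "$cpu"/cpuidle/state[0-9]*; do
            [ -r "$state/name" ] || continue
            printf '%-6s %-8s %-9s %-8s %-10s %-13s %s\n' \
                "${state##*state}" \
                "$(read_attr "$state/name")" \
                "$(read_attr "$state/disable")" \
                "$(read_attr "$state/latency")" \
                "$(read_attr "$state/residency")" \
                "$(read_attr "$state/time")" \
                "$(read_attr "$state/usage")"
        done
    done
}

# Usage: print_cstate_info            (reads the live sysfs tree)
```

A Disabled value of 1 for the C6 rows, together with Usage counters that no longer grow, is what the reports in this thread use to confirm that the workaround is active.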
(In reply to Paul Nijenhuis from comment #461) > I've also tried the c6-disabling script, i made a startup service for it > on OpenSuse Tumbleweed with the latest updates. It's now running smoothly > while first i did disable the c6 and c7 state in the UEFI BIOS. > Now i've re-enabled the states and everything seems to run ok. > No freezes for 4 hours now. > [...] Ok, freeze again after 4,5 hours :-( Back to enable only C1 in BIOS.....
Newest Kernel 4.7.2 crashed after 3 hours. Going back to intel_idle.max_cstate=1. Who cares? More than 400 comments and the status is still NEW? I might give up reporting to this thread...
On my Celeron N2930 system I switched from intel_idle.max_cstate=1 to using c6off+c7on.sh a couple of days ago. Everything works quite well so far. cstateInfo.sh confirms that there is no C6N-BYT or C6S-BYT being used. As this is a very active machine running 3 virtual machines (light load) the CPU is mostly in the C1E-BYT state. C7-BYT and C7S-BYT also get a good hit. So, my experience so far is very positive. The only surprising thing is that my box is not running much cooler than before this change. I guess the active CPU load explains that. Hal
(In reply to Hal from comment #464) > On my Celeron N2930 system I switched from intel_idle.max_cstate=1 to using > c6off+c7on.sh a couple of days ago. Everything works quite well so far. > cstateInfo.sh confirms that there is no C6N-BYT or C6S-BYT being used. As > this is a very active machine running 3 virtual machines (light load) the > CPU is mostly in the C1E-BYT state. C7-BYT and C7S-BYT also get a good hit. > So, my experience so far is very positive. The only surprising thing is that > my box is not running much cooler than before this change. I guess the > active CPU load explains that. > Hal I also wanted to add that I've been using kernel version 4.5.7. Finally, a question too: what would be the best way to launch c6off+c7on.sh? I currently have it in a "session and startup" entry in xfce4. But it would probably make sense to start it before xfce or even xorg is launched. Hal
The following would add the function as a system service on Ubuntu 14.04. Later versions use systemd services, so it's different there, but I don't have the files here.
echo -e 'for state in /sys/devices/system/cpu/cpu*/cpuidle/state* ; do
case "$(< "${state}/name")" in
C6*-BYT|C6*-CHT) echo "1" > "${state}/disable" ;;
C7*-BYT|C7*-CHT) echo "0" > "${state}/disable" ;;
esac
done' > /etc/init.d/c6off+c7on.sh
chown root:root /etc/init.d/c6off+c7on.sh
chmod 755 /etc/init.d/c6off+c7on.sh
update-rc.d -f c6off+c7on.sh start 90 2 .
I think the bug also exists on CherryTrail, so I added |C6*-CHT and |C7*-CHT. If the bug doesn't affect it, just remove them.
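For readability, the same loop can be written as a standalone function. This is only a sketch of the approach in the comment above: the function name and the optional root-directory argument (handy for a dry run against a copy of the tree) are assumptions, and on a live system it must run as root.

```shell
#!/bin/sh
# Sketch: disable the C6 states and keep the C7 states enabled on every CPU.
# The function name and the optional root argument are illustrative only.

set_c6off_c7on() {
    # $1: CPU sysfs root (assumed default: /sys/devices/system/cpu)
    cpu_root="${1:-/sys/devices/system/cpu}"
    for state in "$cpu_root"/cpu*/cpuidle/state*; do
        # Skip paths that are not cpuidle state directories
        # (for example an unexpanded glob).
        [ -r "$state/name" ] || continue
        case "$(cat "$state/name")" in
            C6*-BYT|C6*-CHT) echo 1 > "$state/disable" ;;  # 1 = state disabled
            C7*-BYT|C7*-CHT) echo 0 > "$state/disable" ;;  # 0 = state enabled
        esac
    done
}

# Usage (as root): set_c6off_c7on
```

The setting does not survive a reboot, which is why the comments here run it from an init script, a startup service, or /etc/rc.local.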
I've been running the previously unstable kernel 4.5.4 without max_cstate=1, using the c6off+c7on script, for more than 3 days now and have yet to see the dreaded freeze.
$ uptime
10:02:05 up 3 days, 13:13, 2 users, load average: 0,16, 0,13, 0,14
$ cat /proc/cpuinfo
processor : 0
vendor_id : GenuineIntel
cpu family : 6
model : 55
model name : Intel(R) Celeron(R) CPU J1900 @ 1.99GHz
stepping : 3
...
Seems to be the holy grail for me! Thx a lot! Too bad Intel never chipped in with the addendum before. Shame! I put the c6off+c7on script in /etc/rc.local.
(In reply to Juha Sievi-Korte from comment #452) > (In reply to Wolfgang M. Reimer from comment #437) > > Running my submitted scripts > > > > https://bugzilla.kernel.org/attachment.cgi?id=223851 > > https://bugzilla.kernel.org/attachment.cgi?id=223861 > > > > on a J1900 system should produce a similar output: > > > > As one can see in my case most of the core's idle time is now spent in > state > > C7S-BYT. > > Thanks a lot! > > I've had my system running since then with disabled C6 state and no freezes. > I've done one reboot to update kernel, 21 days uptime on current session > now. This is on N3540 laptop with which I've had quite random but steady > occurences of freezes over last year. > > So might be too early to declare success, but it seems promising for now. And today two crashes with c6 disabled by this script, so this wasn't the root cause either. Phew. Back to two hour battery life it is...
Crashed or freezes?
(In reply to ladiko from comment #469) > Crashed or freezes? Sorry about that: it froze. The first freeze happened during the 'long' uptime session and the next within an hour of a reboot, so it seems just as random to me as before. The cstate script was run early in boot-up.
It's fine. I just wanted to be sure what we're talking about. I haven't pushed it to the other 50 machines yet; only the one I tested it on ran stable. So you checked the C6 state after boot? I will run a long-term test of several days before I roll it out to the other machines.
(In reply to ladiko from comment #471) > It's fine. Just wanted to be sure what we talk about. Didn't yet pushed it > to the other 50 machines. Just the one I tested it on ran stable. So you > checked the c6 state after boot? I will run a long term test. Like several > days before I roll it out to the other machines. Yep, I checked that the script ran OK and that C6 wasn't active after the last reboot. As some folks still seem to have promising results with this script, I think I'll let it run for a while longer to see the effects in the long run. Perhaps it makes some of the events that cause this issue happen less frequently. I checked that the last log entry was ~40 minutes after the reboot on the last attempt, when it froze. The system was just sitting idle by itself when it happened, with one SSH session open on the desktop plus a few browser tabs.
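One way to check that C6 "wasn't active" is to sum the usage counters of all C6 states and compare two samples taken some time apart. A minimal sketch, assuming the cpuidle sysfs layout used by the scripts in this thread; the function name and the optional root argument are made up for illustration.

```shell
#!/bin/sh
# Sketch: total number of times any CPU was asked to enter a C6 state,
# according to the per-state "usage" counters in sysfs.

c6_usage_total() {
    # $1: CPU sysfs root (assumed default: /sys/devices/system/cpu)
    cpu_root="${1:-/sys/devices/system/cpu}"
    total=0
    for state in "$cpu_root"/cpu[0-9]*/cpuidle/state[0-9]*; do
        [ -r "$state/name" ] || continue
        case "$(cat "$state/name")" in
            C6*) total=$((total + $(cat "$state/usage"))) ;;
        esac
    done
    echo "$total"
}

# Usage: before=$(c6_usage_total); sleep 60; after=$(c6_usage_total)
#        [ "$before" = "$after" ] && echo "no C6 entries in the last minute"
```

If the two samples are equal, no C6 state was entered in between, which is consistent with the states being disabled.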
I have a computer where I had constant crashes. I set "intel_idle.max_cstate=1" and now it stays up for weeks and never crashes.
Ran 44 hours with c6off+c7on.sh before freezing hard. It would usually freeze within 30 minutes without any cstate limits. The Z3775 might have an issue with the C7 states. The system was idle when it locked up. Will there be newer microcode that would help my Bay Trail?
I have had 0 lockups (3 days) using 4.8-rc5 from ubuntu mainline archives. No patches, max_cstate settings or anything. Before that I was using 3.16 kernel which was the last stable one for me (baytrail). Unfortunately my hdmi is not working with this kernel. Anyone else wanna try and report ?
(In reply to Martin from comment #467) > I've been running the previously unstable kernel 4.5.4 without max_cstate=1 > using the c6off+c7on script for more than 3 days now and have yet see the > dreaded freeze. > > $ uptime > 10:02:05 up 3 days, 13:13, 2 users, load average: 0,16, 0,13, 0,14 > > $ cat /proc/cpuinfo > processor : 0 > vendor_id : GenuineIntel > cpu family : 6 > model : 55 > model name : Intel(R) Celeron(R) CPU J1900 @ 1.99GHz > stepping : 3 > ... > > Seems to be the holy grail for me! Thx a lot! > Too bad intel never chipped in with the addendum before. Shame! > > I put the c6off+c7on script in /etc/rc.local. I regret having to walk back my statement above. After many days of stable TV watching, our HTPC was unresponsive and I had to power-cycle it to get back in business.
Follow-up on my week-old post: c6off+c7on still works well for my Zotac system. Here are my stats:
Thu Sep 8 15:18:37 EDT 2016
15:18:37 up 7 days, 6:46, 2 users, load average: 6.95, 7.15, 7.28
*-cpu
description: CPU
product: Intel(R) Celeron(R) CPU N2930 @ 1.83GHz
vendor: Intel Corp.
physical id: 34
bus info: cpu@0
version: Intel(R) Celeron(R) CPU N2930 @ 1.83GHz
slot: SOCKET 0
size: 2165MHz
capacity: 2400MHz
width: 64 bits
clock: 83MHz
capabilities: x86-64 fpu fpu_exception wp vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm pbe syscall nx rdtscp constant_tsc arch_perfmon pebs bts rep_good nopl xtopology nonstop_tsc aperfmperf pni pclmulqdq dtes64 monitor ds_cpl vmx est tm2 ssse3 cx16 xtpr pdcm sse4_1 sse4_2 movbe popcnt tsc_deadline_timer rdrand lahf_lm 3dnowprefetch ida arat epb dtherm tpr_shadow vnmi flexpriority ept vpid tsc_adjust smep erms cpufreq
configuration: cores=4 enabledcores=4 threads=4
cpu0
State  Name     Disabled  Latency  Residency  Time          Usage
0      POLL     0         0        0          22302321      2554713
1      C1-BYT   0         1        1          15757596319   262145340
2      C1E-BYT  0         15       30         119156258736  521385788
3      C6N-BYT  1         40       275        456080        865
4      C6S-BYT  1         140      560        855986        870
5      C7-BYT   0         1200     1500       3704252148    5761453
6      C7S-BYT  0         10000    20000      61455590      3197
cpu1
State  Name     Disabled  Latency  Residency  Time          Usage
0      POLL     0         0        0          208494519     3428799
1      C1-BYT   0         1        1          17097828007   280365124
2      C1E-BYT  0         15       30         117773052639  522674956
3      C6N-BYT  1         40       275        376329        506
4      C6S-BYT  1         140      560        784219        596
5      C7-BYT   0         1200     1500       3331593582    4844889
6      C7S-BYT  0         10000    20000      59530341      2921
cpu2
State  Name     Disabled  Latency  Residency  Time          Usage
0      POLL     0         0        0          21634146      2606774
1      C1-BYT   0         1        1          16503086447   284253332
2      C1E-BYT  0         15       30         122405265787  542914915
3      C6N-BYT  1         40       275        537565        835
4      C6S-BYT  1         140      560        626723        544
5      C7-BYT   0         1200     1500       3845541641    5968789
6      C7S-BYT  0         10000    20000      43528414      2486
cpu3
State  Name     Disabled  Latency  Residency  Time          Usage
0      POLL     0         0        0          22154717      2630336
1      C1-BYT   0         1        1          16894582028   282524229
2      C1E-BYT  0         15       30         123929531803  549088123
3      C6N-BYT  1         40       275        313412        440
4      C6S-BYT  1         140      560        722191        486
5      C7-BYT   0         1200     1500       4109133756    6219843
6      C7S-BYT  0         10000    20000      52241428      2709
Hal
I haven't written here for some time. The latest kernels (~4.8-rc5) on the Z3770 aren't stable without the max_cstate parameter. I hit this bug even during the initramfs root switch. See bug 150881. The kernel freezes within tens of seconds of entering the idle state.
On my tablet (Z3735F CPU, kernels 4.4 to 4.8), "stability" is guaranteed to last less than 15 minutes. With intel_idle.max_cstate=1 and clocksource=tsc added to grub.cfg instead, I can use it for days. P.S. I suspect the "refined-jiffies" clocksource.
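After editing the boot configuration it is worth confirming that the parameters actually reached the running kernel. The sketch below checks /proc/cmdline for an exact parameter and reads the active clocksource from sysfs; the function names and the optional file arguments are assumptions for illustration.

```shell
#!/bin/sh
# Sketch: verify boot parameters and the active clocksource.
# Function names and the optional file arguments are illustrative only.

has_kernel_param() {
    # $1: exact parameter, e.g. intel_idle.max_cstate=1
    # $2: command-line file (assumed default: /proc/cmdline)
    # Returns 0 if the parameter is present as a whole word, 1 otherwise.
    tr -s ' \t' '\n' < "${2:-/proc/cmdline}" | grep -Fxq -- "$1"
}

current_clocksource() {
    # $1: sysfs file holding the clocksource currently in use
    cat "${1:-/sys/devices/system/clocksource/clocksource0/current_clocksource}"
}

# Usage:
#   has_kernel_param intel_idle.max_cstate=1 && echo "cstate limit active"
#   has_kernel_param clocksource=tsc && echo "tsc requested"
#   current_clocksource
```

If current_clocksource prints something other than tsc even though clocksource=tsc was requested, the kernel did not end up using the requested clocksource.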
If kernel is unstable when cpu is in idle, then jiffies are no good. For me, a zero response from Intel is evidence that there is a huge mistake in hardware. But I don't believe that it can't be overcome by any software changes.
(In reply to Dmitry from comment #480) > If kernel is unstable when cpu is in idle, then jiffies are no good. > > For me, a zero response from Intel is evidence that there is a huge mistake > in hardware. But I don't believe that it can't be overcome by any software > changes. It does not crash on Windows. Some software change must work.
(In reply to Dmitry from comment #480) > For me, a zero response from Intel is evidence that there is a huge mistake > in hardware. But I don't believe that it can't be overcome by any software > changes. > > It does not crash on Windows. Some software change must work. Not only Windows XP, NT, Vista, 7, 8, 10 work without problem but also BSD distributions work well (including pfsense). Linux kernels 2.x and 3.0 kernels (probably all the way up to 3.16 - although on one of my systems 3.16 freezes) seem to work too. There is no doubt there is a bug related to the cstates in some of those Intel processors but evidently software workarounds or fixes are possible. I think the kernel team should be equally accountable for a definitive fix as is Intel (with no disrespect intended to either group of engineers - actually eternal gratitude to all Linux/GNU people for an outstanding platform). Now, is the CPU bug a design problem or due to 14nm manufacturing process issues? that's an intriguing question in my mind; because I happen to own 2 identical boxes (Intel NUC) manufactured (by Intel) just a few months apart - one with the freezing problem and the other without. The steppings on the CPU and BIOS versions are unfortunately not identical. So whether the 'fix' is part of a newer CPU stepping or part of a remediation microcode loaded through the BIOS at start up, it looks like the CPU is able to run all cstates. That tells me that a post-manufacturing CPU microcode fix is always possible. Hal
I am inclined to agree with Hal. There's a group of three of us with Z3735F-based Toshiba Click Minis. Despite all running the same UEFI firmware versions and dock firmware versions, mine seems to be less stable. It could also be down to the other choices we've made in terms of root partitions on memory cards, but we have no solid proof. If we could attach a simple serial console we might have some hope. I'm not sure if everybody's seen the Shark's Cove reference design http://www.cnx-software.com/2014/07/30/sharks-cove-intel-atom-bay-trail-t-development-board-for-windows-8-1-is-now-available-for-299/ the links from that are broken, but I managed to track down the technical docs for it: https://firmware.intel.com/sites/default/files/Sharks_Cove_Schematic.pdf http://composter.com.ua/documents/Sharks-Cove-Technical-Specifications.pdf maybe it will help someone who understands drivers to fix the situation.
If this bug depends on firmware, then I have bad news. Atom Baytrail doesn't support microcode loading. First I tried three different ways of loading microcode, and none of them succeeded. Then I found a list of CPUs which support microcode loading, and the Z3770 is not on it. My microcode: sig=0x30673, pf=0x2, revision=0x324 P.S. There is no microcode for 06-37-03 (Family-6,Model-55,Stepping-3).
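The 06-37-03 name is simply family, model and stepping written as two-digit hexadecimal numbers (55 decimal is 0x37). A small sketch that derives this name from the first CPU entry in /proc/cpuinfo; the function name and the optional file argument are assumptions for illustration.

```shell
#!/bin/sh
# Sketch: build the family-model-stepping name (e.g. 06-37-03) used to
# identify a microcode file, from the first CPU listed in /proc/cpuinfo.

microcode_file_name() {
    # $1: cpuinfo file (assumed default: /proc/cpuinfo)
    awk -F': *' '
        /^cpu family/     && !f { fam = $2;  f = 1 }
        /^model[ \t]*:/   && !m { mod = $2;  m = 1 }   # not "model name"
        /^stepping/       && !s { step = $2; s = 1 }
        END { printf "%02x-%02x-%02x\n", fam, mod, step }
    ' "${1:-/proc/cpuinfo}"
}

# Usage: microcode_file_name
```

With the values quoted above (family 6, model 55, stepping 3) this prints 06-37-03.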
the only one being able to fix this would be intel, at least in a proper way. obviously they don't give a shit. baytrail is old, on top low-cost. they want to sell new stuff. let's face it, if you have a workaround for this bug (like me running kernel 4.1.12) then you are lucky. i assume we are left alone with this and nothing is going to happen from intel's side. if you are clever never buy intel or nvidia again. i probably won't. that is all i can do.
Hi, I have just found the following link where an Intel employee seems to have found a problem and has already developed some kind of a solution. However I have not been able to find out if this has been added to the kernel. Here is the link: https://lkml.org/lkml/2015/3/24/271 Perhaps somebody with more experience than myself could have a look at this?
(In reply to Martin Brand from comment #486) > Hi, I have just found the following link where an Intel employee seems to > have found a problem and has already developed some kind of a solution. See Comment 55. This patch is not enough to fix the problem.
(In reply to BzukTuk from comment #378) > Hi again, > Kernel 4.6 + Mika Kuoppalas 3 _tentative_ patches + > linux-999-i915-use-legacy-turbo.patch = over 120h in one single session > (without reboot/sleep..) and another 20+/- hours in few 3-4hour long > sessions without single freeze. Still counting... > > (some of Adrian Hunters patches for pm/mmc were also applied, but I dont > think (hope) this matters) > ...
Hi there again, just so you know, I have not experienced a !single! freeze on fresh kernels (>=4.6) since I started using Mika's patches + the legacy turbo patch together (as mentioned above). Could anyone with freeze issues and a non-laptop machine give it a long stress test? I only have a tablet/laptop and I don't want to wreck the battery/LCD (I also can't turn the display off, but that's another story). I did not count exactly, but I think I have more than 500 hours of uptime without any freeze on this device.
Mika's tentative patches:
https://cgit.freedesktop.org/~miku/drm-intel/commit/?h=rc6_test&id=e564271291fa70265b53fa34c01cbb0ae6282e81
https://cgit.freedesktop.org/~miku/drm-intel/commit/?h=rc6_test&id=7e6c3f36563d133cff5b700d9c36b12ac2a0c643
https://cgit.freedesktop.org/~miku/drm-intel/commit/?h=rc6_test&id=b2f08adb19fcb18fea7cda9908fa52e2b9db5e7f
Legacy turbo:
https://github.com/OpenELEC/OpenELEC.tv/blob/master/packages/linux/patches/4.7.3/linux-999-i915-use-legacy-turbo.patch
(In reply to Martin Brand from comment #486) > Hi, I have just found the following link where an Intel employee seems to > have found a problem and has already developed some kind of a solution. > However I have not been able to find out if this has been added to the > kernel. > Here is the link: > https://lkml.org/lkml/2015/3/24/271 > Perhaps somebody with more experience than myself could have a look at this? That message is 18 months old... It hardly qualifies as "hot off the press" by the standards of Intel's Moore's law.
(In reply to BzukTuk from comment #488) > (In reply to BzukTuk from comment #378) > > Hi again, > > Kernel 4.6 + Mika Kuoppalas 3 _tentative_ patches + > > linux-999-i915-use-legacy-turbo.patch [...] Can you please point out how to apply these patches? Thanks in advance.
Hello, I've got a Biostar J1900 with a Celeron quad core. The freeze occurs with all Ubuntu 16.04 based distributions. I've tested it with Ubuntu 16.04, Lubuntu 16.04 and Ubuntu Mate 16.04. Even Ubuntu 14.04.5 freezes. I am now running Zorin 9 and Ubuntu 14.04.4; the kernel in use is 3.13.0-95-generic. I would volunteer as a tester. Is there perhaps a support person from Germany here? Regards, Christian
Could it be because this is missing? Atom PMC platform clocks: drivers/clk/x86/clk-byt-plt.c: https://patchwork.kernel.org/patch/9286345/
Testing the c6off/c7on script with a Z3775 system at idle. The first freeze took 44 hours. Subsequent freezes take about 3 hours. Turning off C7-BYT extends this to 4 hours of idling before freezing. Turning off C7S-BYT on just one core gets me running again without freezing. Effectively, one core is set to intel_idle.max_cstate=1 while the others could allow C7S-BYT. Setting intel_idle.max_cstate=2 without the c6off/c7on script yields less than 30 minutes of run time before freezing, often within a few minutes. I'm not sure that it's permissible to control power saving on a per-core basis. What I did may just be another way to set ..max_cstate=1. FWIW - I have had one or two identical freezes in Windows, but this is quite rare in comparison to Linux.
(Follow up to own message #477) 12 days into using c6off+c7on I decided to go back to the intel_idle.max_cstate=1 workaround. The reason is that although I did not experience any freeze or crash on my host OS, I started to see some very awkward SSD access problems. Swapping the SSD drive for a new one did not alleviate the problem, but going back to max_cstate=1 definitely eliminated it. The awkwardness of the SSD problem was that it would tie up data retrieval from the SSD for many tens of seconds but the host OS wouldn't fail. The mouse, internet access etc. would all work. Many drive access retry messages would pop up but without causing a system crash. On the other hand, as I ran VirtualBox and several virtual machines, those would partially freeze. For instance, their GUIs would not respond to keyboard or mouse actions but I could still SSH into them from the host computer or a remote computer. Eventually I would get serious data corruption in the guest machines. The problem with the guest machines didn't happen very often but happened on different virtual machines running different GNU flavors and different Linux kernels. Hal
(In reply to Paul Nijenhuis from comment #490) > can you please point out how to apply these patches? > Thanx in advance I found out how to apply the patches and I'm building a 4.7.3 kernel on openSUSE Tumbleweed. I'll post the results later.
(In reply to Paul Nijenhuis from comment #495) > I found out how to apply the patches and i'm building a 4.7.3 kernel on > OpenSuse Tumbleweed... > I'll post the results later I'm using an ASRock Q2900 with a J2900 CPU. Until now I was running an old 3.14-lts kernel because all newer ones froze after some time. I tried a custom 4.7.2 kernel on Arch Linux with the 4 patches mentioned above; the system still froze after idling for around 30 hours, so I'm back on 3.14. See the PKGBUILD here in case anybody wants to try: https://dl.dropboxusercontent.com/u/9188780/linux-baytrail.zip
(In reply to Paul Nijenhuis from comment #495) > I found out how to apply the patches and i'm building a 4.7.3 kernel on > OpenSuse Tumbleweed... > I'll post the results later Unfortunately, it froze after 1.5 days... :-( Back to C1 only in the BIOS.
Wolfgang Reimer's script seems to be a good workaround so far. A way to install it permanently is described here: https://forum.manjaro.org/t/intel-baytrail-freezes-the-linux-kernel/1931/10 It works for Manjaro and Ubuntu.
Additional thoughts: Wolfgang had the idea to write a test routine to verify whether erratum VLP52 is the root cause of this bug. I found an erratum for another CPU (Z670), http://www.intel.de/content/dam/www/public/us/en/documents/specification-updates/atom-z6xx-specification-update.pdf, that has the same description (its number there is BN38, page 25): "EOI Transaction May Not be Sent if Software Enters Core C6 During an Interrupt Service Routine." Here a workaround is given by Intel! "Software should check the ISR register and enter C1 only if any interrupt is in service." Perhaps this is helpful for finding an even more effective method to avoid this error without blocking C6 generally. There might even already be a fix for the Z6xx CPU in the kernel.
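To make the quoted workaround concrete, here is a minimal Python sketch of the decision it describes. It only illustrates the logic and is not kernel code: the function name, the state names, and the representation of the in-service register as eight 32-bit words are assumptions made for this example.

```python
def pick_idle_state(isr_words, requested="C6"):
    """Return the idle state to enter, following the erratum workaround.

    isr_words: the local APIC in-service register, given as eight 32-bit
               words (an assumed representation for this sketch).
    requested: the state the idle governor wanted to enter, e.g. "C6".
    """
    # Any set bit means an interrupt is still being serviced.
    interrupt_in_service = any(word != 0 for word in isr_words)
    if interrupt_in_service and requested != "C1":
        # The EOI for that interrupt has not been sent yet, so stay shallow.
        return "C1"
    return requested
```

With an all-zero ISR the requested deep state is kept; as soon as any bit is set the function falls back to C1, which is the behaviour the workaround asks for.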
I also came across a patch that was created for SUSE and that seems to address the mentioned erratum in pre-4.X kernels: https://build.opensuse.org/package/view_file?file=22160-Intel-C6-EOI.patch&package=xen&project=home%3Acharlesa%3AopenSUSE11.3&rev=7
(In reply to Michal Feix from comment #501) > I also came across a patch that was created for SUSE and that seems to be > adressing mentioned erratum in pre 4.X kernels: > > https://build.opensuse.org/package/view_file?file=22160-Intel-C6-EOI.patch&package=xen&project=home%3Acharlesa%3AopenSUSE11.3&rev=7 Wow, if this works, that will be absolutely fantastic. I'll be compiling with this patch as soon as I get home. Now to get it merged into mainline...
(In reply to Travis Hall from comment #502) > (In reply to Michal Feix from comment #501) > > I also came across a patch that was created for SUSE and that seems to be > > adressing mentioned erratum in pre 4.X kernels: > > > > https://build.opensuse.org/package/view_file?file=22160-Intel-C6-EOI.patch&package=xen&project=home%3Acharlesa%3AopenSUSE11.3&rev=7 > > Wow, if this works, that will be absolutely fantastic. I'll be compiling > with this patch as soon as I get home. Now to get it merged into mainline... You might want to check your CPU model number. If I'm reading that patch right, it won't for example have any effect on my J1900 CPU with a model number of 55 (0x37) (assuming boot_cpu_data.x86_model is what is displayed in /proc/cpuinfo as "Model").
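For anyone who wants to do the model-number check described above, here is a rough Python sketch. It parses the text of /proc/cpuinfo and compares the model against a set supplied by the caller. The function names are made up for this example, the family-6 check is an assumption, and the set of models is deliberately left to the caller: it has to be read from the patch itself and is not reproduced here.

```python
def parse_cpu_model(cpuinfo_text):
    """Return (family, model) of the first CPU listed in /proc/cpuinfo text."""
    family = model = None
    for line in cpuinfo_text.splitlines():
        key, sep, value = line.partition(":")
        if not sep:
            continue
        key = key.strip()
        # "model name" is a different key, so match "model" exactly.
        if key == "cpu family" and family is None:
            family = int(value)
        elif key == "model" and model is None:
            model = int(value)
    return family, model


def patch_would_apply(cpuinfo_text, patched_models):
    """True if this CPU's model is in patched_models (hypothetical set, taken from the patch)."""
    family, model = parse_cpu_model(cpuinfo_text)
    # Assumption for this sketch: the patch only concerns family 6 CPUs.
    return family == 6 and model in patched_models
```

Typical use would be `parse_cpu_model(open("/proc/cpuinfo").read())`; on the J1900 mentioned above this should give model 55, which is 0x37 in hex.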
Strange, the latest git kernel works without any cmdline parameter or scripts. The system has been running for 3 days, with reboots, without freezes. I recompiled the kernel with PREEMPT_VOLUNTARY, NO_HZ, RCU_FAST_NO_HZ and IRQ_TIME_ACCOUNTING. In cmdline I have: tsc=reliable clocksource=tsc pcie_aspm=force nmi_watchdog=0.

cpu0
State  Name     Disabled  Latency  Residency  Time        Usage
0      POLL     0         0        0          345670      157
1      C1-BYT   0         1        1          68407851    147097
2      C6N-BYT  0         300      275        48677058    52359
3      C6S-BYT  0         500      560        680803055   270719
4      C7-BYT   0         1200     4000       1337235518  180699
5      C7S-BYT  0         10000    20000      771972999   31738

cpu1
State  Name     Disabled  Latency  Residency  Time        Usage
0      POLL     0         0        0          344769      200
1      C1-BYT   0         1        1          365963607   701346
2      C6N-BYT  0         300      275        88538699    99895
3      C6S-BYT  0         500      560        1131180391  481825
4      C7-BYT   0         1200     4000       1097908939  191670
5      C7S-BYT  0         10000    20000      189777207   20370

cpu2
State  Name     Disabled  Latency  Residency  Time        Usage
0      POLL     0         0        0          223966      125
1      C1-BYT   0         1        1          82220646    205674
2      C6N-BYT  0         300      275        81845726    99118
3      C6S-BYT  0         500      560        1150791746  496208
4      C7-BYT   0         1200     4000       1226103249  198451
5      C7S-BYT  0         10000    20000      313530989   24010

cpu3
State  Name     Disabled  Latency  Residency  Time        Usage
0      POLL     0         0        0          146758      132
1      C1-BYT   0         1        1          68183419    163110
2      C6N-BYT  0         300      275        56932066    64232
3      C6S-BYT  0         500      560        846271647   338258
4      C7-BYT   0         1200     4000       1344663248  198891
5      C7S-BYT  0         10000    20000      564638850   27960
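Per-CPU idle-state statistics like the ones in this comment can be read from the cpuidle directory in sysfs. The following Python sketch shows one way to collect them; it is not the tool the commenter used, and the function name and the `sysfs_root` parameter (added so the code can be pointed at a test directory) are assumptions of this example.

```python
from pathlib import Path

# Files read from each /sys/devices/system/cpu/cpuN/cpuidle/stateM/ directory.
FIELDS = ("name", "disable", "latency", "residency", "time", "usage")


def read_cpuidle_states(cpu=0, sysfs_root="/sys/devices/system/cpu"):
    """Return one dict per idle state of the given CPU, ordered by state index."""
    base = Path(sysfs_root) / f"cpu{cpu}" / "cpuidle"
    # Sort numerically so that state10 comes after state9, not after state1.
    state_dirs = sorted(
        base.glob("state[0-9]*"),
        key=lambda p: int(p.name[len("state"):]),
    )
    states = []
    for state_dir in state_dirs:
        row = {"state": int(state_dir.name[len("state"):])}
        for field in FIELDS:
            path = state_dir / field
            # Not every kernel exposes every file; missing ones become None.
            row[field] = path.read_text().strip() if path.exists() else None
        states.append(row)
    return states
```

Calling `read_cpuidle_states(0)` on a running system returns the rows for cpu0 as strings; comparing the `usage` and `time` values before and after a test run shows which states were actually entered.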
Still no freezes. Could somebody please try kernel 4.8-rc8? Otherwise it could be that I have found another workaround which isn't related to max_cstate.
(In reply to Dmitry from comment #505) > Still no freezes. Please, try somebody kernel 4.8-rc8 or it could be another > workaround which doesn't relate to max_cstate. I'm still seeing the freezes on 4.8-rc8: I ran a YouTube playlist overnight and woke up to a freeze.
(In reply to Dmitry from comment #505) > Please, try somebody kernel 4.8-rc8 or it could be another > workaround which doesn't relate to max_cstate. For all of you who still hope that this could (and will) be fixed, let me direct your attention to commit a7b4667+ (drm/i915: Never fully mask the the EI up rps interrupt on SNB/IVB): https://git.kernel.org/cgit/linux/kernel/git/torvalds/linux.git/commit?id=a7b4667a00025ac28300737c868bd4818b6d8c4d I guess that this commit specifically has stabilized i915 driver behaviour on less powerful CPUs (i.e. our Baytrail Atoms and Celerons), so that some people have found their systems to run stable with Linux 4.8 (the commit was merged in 4.8-rc1). I've applied this one-liner to the i915 driver from Linux 4.4 (vanilla, no other "stabilization" patches), and had a similar experience to Dmitry's, i.e. a desktop system running on a J1900 with no C-state limiting, used almost daily for several hours per session, with just regular shutdowns, and working stably for weeks now. Though it may not solve stability issues for everyone completely, the commit does seem to hit the right nail.
With this commit (included in kernel 4.4.20) my Bay Trail tablet can finally run stable without limiting c-states. a3043e mmc: sdhci-acpi: Reduce Baytrail eMMC/SD/SDIO hangs https://git.kernel.org/cgit/linux/kernel/git/stable/linux-stable.git/commit/?id=a3043ecef71f5b880fe1b1d2aa77b3a896b86a0c
Which (newer) kernel versions include the patch? Ubuntu 16.04 is using 4.4.0, so it's not included. Do we have to wait for 18.04 or an LTS/HWE kernel? I am not sure if I want to go for the mainline kernel, so I'll stay with the script which disables C6, at least for the moment.
By the way - the patch is named "Reduce Baytrail eMMC/SD/SDIO hangs" - is this MMC/SD patch really related to CPU/GPU hangs?
Created attachment 240861 [details] attachment-3924-0.html I searched the Ubuntu kernels for "drm/i915: Never fully mask the the EI up rps interrupt on SNB/IVB" and found, in the full text of the fix at https://bugs.launchpad.net/ubuntu/+source/linux/+bug/1615620, that this fix is applied in 4.4.0-38.57, which is the currently released kernel version for a standard 16.04 installation.
(In reply to ladiko from comment #509) > In which (newer) Kernel versions the patch is included? Ubuntu 16.04 is > using 4.4.0, so it's not included. Do we have to wait for 18.04 or an > LTS/HWE kernel? I am not sure if i want to go for the mainstream kernel and > stay with the script which disables C6 - at least for the moment. It's included in the longterm vanilla kernel 4.4.20 and up. You can install it manually or via the package manager I guess. http://kernel.ubuntu.com/~kernel-ppa/mainline/v4.4.23/ The i915 patch mentioned by Daniel is also in there. (In reply to ladiko from comment #510) > By the way - the patch is named "Reduce Baytrail eMMC/SD/SDIO hangs" - is > this MMC/SD patch really related to CPU/GPU hangs? It's related to the subject of this thread, not directly to a hanging CPU/GPU. In my case I had a hanging MMC bus related to c-states. No GPU issues for me though :).
> It's included in the longterm vanilla kernel 4.4.20 and up. You can install > it manually or via the package manager I guess. > > http://kernel.ubuntu.com/~kernel-ppa/mainline/v4.4.23/ I installed 4.4.24 from http://kernel.ubuntu.com/~kernel-ppa/mainline/v4.4.24/ and this didn't fix the freezing on an N3700 running Ubuntu 14.04.
So is this an SD/eMMC-only bug, with no issue on N3510/J1900 motherboards with a SATA HDD? Is the fix only included in the 4.4.20+ kernel line, or also in later ones (4.7 etc.)?
Finally, I don't have any freezes after installing the 4.8.0-997-generic kernel from http://kernel.ubuntu.com/~kernel-ppa/mainline/drm-intel-next/current/ This is a desktop motherboard with a J1900 CPU and a SATA HDD. All C-states, including C7, seem to be working well (I haven't checked power consumption, but there were no freezes at all during about one week of continuous tests).
Indeed. I haven't experienced any freezes for nearly a week now. But I installed the normal kernel 4.8.0-040800-generic from http://kernel.ubuntu.com/~kernel-ppa/mainline/v4.8/ and removed intel_idle.max_cstate=1 from the grub config. So far, no problems. The laptop is an Acer Aspire E15 E5-511-P7AT with a Pentium N3540 (up to 2.66 GHz). The only thing I noticed during the kernel upgrade was a "nagging" about missing intel-drm-i915 firmware. HD playback on YouTube or in VLC works flawlessly. Also no trouble with Steam.
Is there a separate bug report for the SD/eMMC issue?
I patched my current 4.5.4 with the mentioned patch but didn't have any success. So I suspect (hope) there's more to it than only those two lines. I will try to upgrade to 4.8 soon to see if that helps.
After some months I tried the latest kernel, 4.8.1 from kernel.org. It is still freezing my system: https://bugzilla.kernel.org/attachment.cgi?id=198961
I just changed from a non-crashing N3510 to a J1900 and I had a freeze after 5 minutes. I disabled the C-states in the BIOS, but that is hardly an energy-saving solution.
(In reply to vad1m from comment #515) > finally, I don't have any freezes after installing 4.8.0-997-generic kernel > from http://kernel.ubuntu.com/~kernel-ppa/mainline/drm-intel-next/current/ > Desktop motherboard with J1900 CPU and SATA HDD. Same for me on my N3700. I installed this one two days ago. Previously it usually froze a few minutes after starting Firefox.
I've been plagued by this bug ever since I got my laptop, about 1 1/2 - 2 years ago. Just saying. My PC is an Acer E5-511p. The CPU: Intel(R) Pentium(R) CPU N3530 @ 2.16GHz. That being said, I went months without a hangup/freeze using Ubuntu 16.04 LTS with various 4.4.0-X-generic kernels, using SELinux, xscreensaver (without gnome-screensaver installed), and intel_idle.max_cstate=1. I only ever rebooted when a new kernel came out. (I never tried without max_cstate; I should have.) I got tired of trying to get SELinux working on Ubuntu and decided to go back to apparmor and gnome-screensaver, as well as upgrading to 4.4.0-42-generic (previous was 4.4.0-38-generic). Since switching to apparmor as LSM, gnome-screensaver, and 4.4.0-42 (yesterday), I get freezes with the periodic spinning fans again even with intel_idle.max_cstate=1, though it seems only when I have not used the PC for 20+ minutes. I switched to max_cstate=0 and am using this now:
$ cat /proc/cmdline
BOOT_IMAGE=/boot/vmlinuz-4.4.0-42-generic.efi.signed root=UUID=cf4dc10b-511a-4369-ad5c-637833244929 ro apparmor=1 security=apparmor intel_idle.max_cstate=0
I will switch back, one by one, the things I have changed and see if that stops the crashes. I have a hunch that it was xscreensaver preventing the CPU from going idle that was preventing the hangups/freezes. Maybe no new info there. I've had UEFI enabled and use the signed kernels; not sure if that matters, as I have this issue even in BIOS mode. Hope that helps.
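When comparing reports like this one it helps to confirm which max_cstate value a machine was actually booted with. Below is a small Python sketch that extracts it from the contents of /proc/cmdline. The function name is made up for this example, and keeping the last occurrence when the option is repeated is a choice of this sketch, not a statement about how the kernel resolves duplicates. Note also that, as far as I know, intel_idle.max_cstate=0 disables the intel_idle driver altogether (so another idle driver such as acpi_idle may take over) rather than forbidding all C-states.

```python
def max_cstate_from_cmdline(cmdline):
    """Return the intel_idle.max_cstate value from a kernel command line, or None."""
    value = None
    for token in cmdline.split():
        key, sep, val = token.partition("=")
        if key == "intel_idle.max_cstate" and sep:
            # This sketch keeps the last occurrence if the option is repeated.
            value = int(val)
    return value
```

On a live system this would be called as `max_cstate_from_cmdline(open("/proc/cmdline").read())`.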
Created attachment 241641 [details] attachment-14281-0.html Do you know whether the new Ubuntu version solves the bug?
I haven't experienced any freezes since v4.8 from http://kernel.ubuntu.com/~kernel-ppa/mainline/ (generic version) on a Lenovo Yoga 300 (Intel Celeron N2940).
(In reply to Javier Antonio Nisa Avila from comment #523) > You know if with the New Ubuntu versión solve the bug? I'll try 16.10 out as well, no problem. I see it's running 4.8; thanks for the heads up on the release ;).
All, I have experienced the hard lockups on a NDis b324 using kernel version 3.13 (all minor variants from ubuntu). The time scale is order of several hours to several weeks; the result is always a hard lockup. With 4.8.1 the hard lockup occurs after around 5 minutes. With the cstate restriction I have yet to see it crash (in 48hrs of testing with 4.8.1 and the cstate restriction). I hate to say it, but I don't think this bug is going to get fixed, and that the workaround is the fix! What it will take is engineering time from Intel with fully-instrumented dev boards to analyse this, and the wherewithal to do the root cause analysis. We can merely speculate from the sidelines, and so I will speculate that this bug affects all operating systems, and because of code timing variations some OSes get lucky and others do not. There may be no other fix than to disable the cstate management and suck the power loss. -bms
I suspect there are different bugs in different Intel chipsets / processors, since some fixes work for some people but not all. It is also possible that some of these hardware bugs are impossible to fix in software. It has happened before in both Intel and AMD hardware. However, I have a stable system now and I wanted to share my findings to maybe help other users with the same hardware. My processor is: Intel(R) Celeron(R) CPU N2930 @ 1.83GHz (Baytrail). The machine is an Acer_Extensa_EX2508-C66M laptop. I use Gentoo on this system so I always build my own kernel. I had freezes right from the beginning (kernel 4.1), but found a workaround that worked for this system. I had a rock solid system with kernels 4.1 - 4.4 when I:
- chose "Intel P State Control" to be built into the kernel
- chose "Default CPU Frequency Governor" to be "Performance"
- booted the system with kernel option: intel_idle.max_cstate=0
This resulted in a constant processor frequency and an idle processor temperature between 48 and 50 degrees Celsius. I use this laptop for recording multitrack audio, so the stable processor frequency was a bonus. Heavy audio processing is more reliable when the processor speed is constant, because a stable processor frequency leads to predictable latencies and multitrack audio software likes that. I don't much care about power consumption since my laptop is always plugged in, so I don't know what effect this might have had on battery life. When I upgraded to kernel 4.7 this changed and I began to get the freezes again. I also noticed that my processor speed had begun to fluctuate even though I used the same kernel options I used with kernels 4.1 - 4.4. It seems that the constant processor speed I got with my settings in kernels 4.1 - 4.4 was a bug and Intel had "fixed" this for 4.7. I have now found new settings that work for me with kernel 4.7.
I did:
- choose "Intel P State Control" to be built into the kernel
- choose "Default CPU Frequency Governor" to be "Performance"
- boot the system with kernel option: intel_idle.max_cstate=0
- disable turbo boost with: echo 1 > /sys/devices/system/cpu/intel_pstate/no_turbo
- disable idle state 3 in all processor cores (example for core 0): echo 1 > /sys/devices/system/cpu/cpu0/cpuidle/state3/disable
The last option leaves the other idle states (0, 1 and 2) on and only disables state 3. (I have been calling these "pstates", but the cpuidle entries are idle states, i.e. C-states.) These settings result in behaviour similar to that of the previous kernels, meaning a stable processor frequency (1.8 GHz) and an idle core temperature of about 48 - 50 degrees Celsius. I've used kernel 4.7 for 8 days now with no freezes. If one occurs I will disable idle state 2 for all processor cores, and so on until I have a stable system. I have sometimes had turbo boost enabled without any freezes, so once it becomes evident in a couple of weeks that these settings really do work, I will enable turbo boost again and see if that has any effect on stability. I use a script to disable state 3 on all cores and turbo boost. It is based on another script talked about on this forum. You can download the script here: https://dl.dropboxusercontent.com/u/2071830/disable_intel_processor_pstates.sh or here: http://pastebin.com/egTKmkwX
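The two runtime steps from this comment (setting no_turbo and disabling one cpuidle state on every core) could be scripted roughly as follows. This is a Python sketch written for illustration, not the script linked above; the function names and the `sysfs_root` parameter are assumptions of this example, and on a real system the writes need root privileges.

```python
from pathlib import Path


def disable_idle_state(state, sysfs_root="/sys/devices/system/cpu"):
    """Write 1 to cpuidle/state<N>/disable for every CPU; return the files written."""
    pattern = f"cpu[0-9]*/cpuidle/state{state}/disable"
    written = []
    for disable_file in sorted(Path(sysfs_root).glob(pattern)):
        disable_file.write_text("1")
        written.append(disable_file)
    return written


def disable_turbo(sysfs_root="/sys/devices/system/cpu"):
    """Write 1 to intel_pstate/no_turbo; return False if the file does not exist."""
    no_turbo = Path(sysfs_root) / "intel_pstate" / "no_turbo"
    if not no_turbo.exists():
        # intel_pstate is not the active driver, so there is nothing to set.
        return False
    no_turbo.write_text("1")
    return True
```

Calling `disable_turbo()` followed by `disable_idle_state(3)` mirrors the two echo commands given in the comment. The settings do not survive a reboot, so the calls would have to run at every boot.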
Thanks for the advice. I applied your settings to Gentoo (vanilla-sources-4.8.1). However, I additionally disabled SpeedStep in the BIOS. It seems that my N3150 is much more stable than my J1900. I had a freeze within minutes on the J1900, even though it runs headless, with no X. The case has a 12 cm fan next to the CPU, so it never goes higher than 30°C. However:
cpu MHz : 479.980
cat /sys/bus/cpu/devices/cpu0/cpufreq/scaling_available_governors
performance powersave
cat /sys/bus/cpu/devices/cpu0/cpufreq/scaling_governor
performance
cat /sys/bus/cpu/devices/cpu0/cpufreq/scaling_cur_freq
479980
cat /sys/bus/cpu/devices/cpu0/cpufreq/scaling_max_freq
1600000
Weird. With the performance governor, the CPU should never clock down. Is there a microcode update for these CPUs? Does it make a difference?
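The check done by hand in this comment (performance governor selected, yet the current frequency is far below the maximum) can be automated with a few lines of Python. This is only a sketch: the function names, the 5% tolerance, and the `sysfs_root` parameter are assumptions of this example. The files read are the same cpufreq files queried with cat above.

```python
from pathlib import Path


def read_cpufreq(cpu=0, sysfs_root="/sys/bus/cpu/devices"):
    """Read governor plus current and maximum frequency (in kHz) for one CPU."""
    base = Path(sysfs_root) / f"cpu{cpu}" / "cpufreq"
    return {
        "governor": (base / "scaling_governor").read_text().strip(),
        "cur_khz": int((base / "scaling_cur_freq").read_text()),
        "max_khz": int((base / "scaling_max_freq").read_text()),
    }


def clocked_down_under_performance(info, tolerance=0.05):
    """True if the governor is 'performance' but the CPU runs clearly below max."""
    below_max = info["cur_khz"] < info["max_khz"] * (1 - tolerance)
    return info["governor"] == "performance" and below_max
```

With the values quoted in the comment (479980 kHz current, 1600000 kHz maximum, governor "performance") the second function returns True, which matches the "weird" observation.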
Adding "intel_pstate=disable" seems to disable any frequency variation. The CPU now sits at 1.6 GHz.
Created attachment 241811 [details] Patch to disable c-states at boot
The following errata might be relevant:
VLP52: EOI Transactions May Not be Sent if Software Enters Core C6 During an Interrupt Service Routine
AAU36: EOI Transaction May Not be Sent if Software Enters Core C6 During an Interrupt Service Routine
AAN42: EOI Transaction May Not be Sent if Software Enters Core C6 During an Interrupt Service Routine
BN38: EOI Transaction May Not be Sent if Software Enters Core C6 During an Interrupt Service Routine
AAK76: EOI Transaction May Not be Sent if Software Enters Core C6 During an Interrupt Service Routine
BA106: EOI Transaction May Not be Sent if Software Enters Core C6 During an Interrupt Service Routine
AAJ72: EOI Transaction May Not be Sent if Software Enters Core C6 During an Interrupt Service Routine
As an experiment I've set up a Google spreadsheet in the hope that you will enter details about your system(s), the configurations you have tried, and the length of time that your test ran prior to failure. The goal is simply to be able to mine the data. The spreadsheet is (or rather should be) fully editable, so please don't abuse it; I think we all want this resolved. https://docs.google.com/spreadsheets/d/1oajcMYL9oSt0O6VTpaIj0osGJxKGKSPSYtLnqr3UHNk/edit?usp=sharing Here are some suggestions on how you should fill in the entries:
Column A: Did your system end up in a locked-up state?
Column B: How long did your test run for? For example, if your test ended in a lockup, was it several hours or just a few minutes? If you answered yes for column A, then enter the amount of time your computer ran prior to you rebooting or powering down the system.
Column C: The name / make of your machine. We want to know who made the motherboard.
Column D: The model name of your CPU.
Column E: The code name for your CPU. Naturally, non-Baytrail CPUs that show similar failures will be interesting information to know.
Column F, G, H: Enter the details from the result of "cat /proc/cpuinfo".
Column I, J: Use dmidecode to obtain the BIOS information; enter the version and vendor.
Column K: The Linux kernel version: use "uname -a".
Column L: Did you modify the kernel boot parameters? If so, record them.
Column M: Additional notes: What other configurations did you do? Did you use the c6 off script, etc...
Do add columns if you think the data is relevant. ... just trying to get to the bottom of this...
Kernel 4.8 seems to have some bugs in cpufreq. Intel has recently added a fix for these for kernel 4.9, so I will skip kernel 4.8 completely and use 4.9 when it comes out. Here is the message telling about the Intel regression fixes for 4.9: https://lkml.org/lkml/2016/10/14/288 Here is the Phoronix article mentioning it: https://www.phoronix.com/scan.php?page=news_item&px=Linux-4.9-Atom-P-State-Algo
BMS: Great idea :) There are a couple of other things to consider. On newer kernels (4.7) you have the option of controlling processor performance either with the ACPI P-State driver or with the Intel P-State driver. I had lockups when using the ACPI driver; the Intel version works for me with no lockups. Also it might be important to know which governor (powersave, ondemand, performance) people use, since it deeply affects how the processor uses power saving states. I also added a column called "Reporter"; I hope this is all right. It helps when some additional information needs to be asked from the reporter.
I have updated my home server to Ubuntu 16.10 which contains 4.8.0-22-generic, now I have 53 hours uptime. Before that I used Ubuntu 16.04.1 LTS with 4.4 and I had to disable C-states in the bios to get a stable system. I hope this kernel solved the bug, I will keep the spreadsheet updated with my longtime results. My system is ASRock Q1900-ITX with a J1900 CPU.
(In reply to jbMacAZ from comment #191) > I have a N3540 system that freezes at most a couple times a month without > any arguments, kernel version doesn't seem to matter. .max_cstate {0,1} > stabilized it. Looking at the recent posts, the N-series appears to be the > processor benefiting most from the new suggestions. But the more smoke that > gets cleared, the sooner the rest of the problems can be found. > > On my Z3775 system (T100CHI), kernel 4.5.0 without arguments didn't last 2 > minutes before freezing. With idle=nomwait and it ran 2 hours before the > time display froze (frozen seconds), the mouse cursor still moved. Keyboard > keys or mouse clicks were accepted about once every 90 seconds. > > Next, maxcpus=2 and idle=nomwait produced a block of "serial8250: too much > work for irq191" errors in dmesg. Raising maxcpus to 3 got rid of them. > maxcpus= {2,3} yielded no obvious degradation when just browsing, etc, so > I'll leave this running... tsc may be destabilizing for some systems like > mine. I compiled a 4.9-rc1 kernel, and dmesg shows "serial8250: too much work for irq191". 4.8 does not have this problem.
I have been suffering from the same issue, but on a Broadwell system (Dell XPS 9343, i5-5200). Restricting max_cstate to 1 helps. c6off+c7on (after modifying it to work on BDW instead of BYT) does not. It works only when I disable all C-states except C1 and C1E (which is equivalent to max_cstate=1). Though I have been following this thread for a long time, I never posted. Lately I have been wondering whether any other Broadwell users are facing the same problem, and whether there is a separate bug for them. Although the symptoms are exactly the same, I am not 100% sure the bug is the same. Also, like most other users here, I have no logs anywhere (syslog/kern.log) that would help in raising a separate bug report. Summarising: are there any Broadwell co-sufferers here? Am I safe to assume this is the same bug as mine?
I've also added my machine to the Google spreadsheet. "serial8250: too much work for irq191": I also see this when I try to turn Bluetooth on. I've never managed to get it working though.
(In reply to Jochen Hein from comment #531)
> There might be the following errata affected:
> VLP52 EOI Transactions May Not be Sent if Software Enters Core C6 ...
> AAJ72. EOI Transaction May Not be Sent if Software Enters Core C6 During an
> Interrupt Service Routine

Thanks Jochen, I started to dig and found out that a lot of Intel processors suffer from the erratum "EOI Transaction May Not be Sent if Software Enters Core C6 During an Interrupt Service Routine". Here is the list (with links to the docs) I found so far:

[1] AAJ72: Intel Core i7-900 Desktop Processor Extreme Edition Series and Intel Core i7-900 Desktop Processor Series
    http://www.intel.com/content/dam/www/public/us/en/documents/specification-updates/core-i7-900-ee-and-desktop-processor-series-spec-update.pdf
[2] AAK76: Intel Xeon Processor 5500 Series
    http://www.intel.com/content/dam/www/public/us/en/documents/specification-updates/xeon-5500-specification-update.pdf
[3] AAM73: Intel Xeon Processor 3500 Series
    http://www.intel.com/Assets/en_US/PDF/specupdate/321333.pdf
[4] AAN42: Intel Core i7-800 and i5-700 Desktop Processor Series
    http://www.intel.com/content/dam/www/public/us/en/documents/specification-updates/core-i7-800-i5-700-spec-update.pdf
[5] AAO42: Intel Xeon Processor 3400 Series
    http://www.intel.com/content/dam/www/public/us/en/documents/specification-updates/xeon-3400-specification-update.pdf
[6] AAP41: Intel Core i7-900 Mobile Processor Extreme Edition Series, Intel Core i7-800 and i7-700 Mobile Processor Series
    http://www.intel.com/content/dam/www/public/us/en/documents/specification-updates/core-i7-900-mobile-ee-and-mobile-processor-series-spec-update.pdf
[7] AAT32: Intel Core i7-600, i5-500, i5-400 and i3-300 Mobile Processor Series
    http://www.intel.com/content/dam/www/public/us/en/documents/specification-updates/core-mobile-spec-update.pdf
[8] AAU36: Intel Core i5-600, i3-500 Desktop Processor Series and Intel Pentium Desktop Processor 6000 Series
    http://www.intel.com/content/dam/www/public/us/en/documents/specification-updates/core-i5-600-i3-500-pentium-6000-spec-update.pdf
[9] AAY38: Intel Xeon Processor 3600 Series
    http://www.intel.com/content/dam/www/public/us/en/documents/specification-updates/xeon-3600-specification-update.pdf
[10] BA106: Intel Xeon Processor 7500 Series
    http://www.intel.com/content/dam/www/public/us/en/documents/specification-updates/xeon-processor-7500-series-specification-update.pdf
[11] BB38: Intel Atom Processor Z6xx Series
    http://www.intel.com/content/dam/doc/specification-update/atom-z6xx-specification-update.pdf
[12] BC38: Intel Core i7-900 Desktop Processor Extreme Edition Series and Intel Core i7-900 Desktop Processor Series on 32-nm Process
    http://www.intel.com/content/dam/www/public/us/en/documents/specification-updates/core-i7-900-ee-and-desktop-processor-series-32nm-spec-update.pdf
[13] BD40: Intel Xeon Processor 5600 Series Specification Update
    http://www.intel.de/content/dam/www/public/us/en/documents/specification-updates/xeon-5600-specification-update.pdf
[14] BF41: Intel Xeon Processor C5500/C3500 Series
    http://www.intel.com/content/dam/www/public/us/en/documents/specification-updates/xeon-c5500-c3500-spec-update.pdf
[15] BG31: Intel Pentium P6000 and U5000 Mobile Processor Series
    http://www.intel.com/content/dam/www/public/us/en/documents/specification-updates/pentium-p6000-u5000-mobile-specification-update.pdf
[16] BI46: Intel Atom Processor E6xx Series
    http://www.intel.com/content/dam/www/public/us/en/documents/specification-updates/atom-e6xx-spec-update.pdf
[17] BN38: Intel Atom Processor Z600 Series
    http://www.intel.com/content/dam/www/public/us/en/documents/specification-updates/atom-z6xx-specification-update.pdf
[18] BP37: Intel Xeon Processor E7-8800/4800/2800 Product Families
    http://www.intel.com/content/dam/www/public/us/en/documents/specification-updates/xeon-e7-8800-4800-2800-families-specification-update.pdf
[19] CC5: Intel Atom Processor Z2760
    http://www.intel.com/content/dam/www/public/us/en/documents/specification-updates/atom-z2760-spec-update.pdf
[20] VLI55: Intel Atom Processor E3800
    http://www.intel.com/content/dam/www/public/us/en/documents/specification-updates/atom-e3800-family-spec-update.pdf
[21] VLP52: Intel Celeron and Pentium Processor N- and J-Series
    http://www.intel.com/content/dam/www/public/us/en/documents/specification-updates/pentium-n3520-j2850-celeron-n2920-n2820-n2815-n2806-j1850-j1750-spec-update.pdf
[22] VLT56: Intel Atom Processor Z3600 and Z3700 Series
    http://www.intel.com/content/dam/www/public/us/en/documents/specification-updates/atom-Z36xxx-Z37xxx-spec-update.pdf

The erratum is first mentioned in November 2008 [1] and a first patch for it (only for AAJ72-plagued processors reported in [1]) has been added by the Xen developers in September 2010: https://lists.xen.org/archives/html/xen-devel/2010-09/msg00894.html
The image on the screen freezes, the USB ports do not work, and the ATX power button does not respond. Right? I have had a bug like this many times on an i5-4460 (P87 Gigabyte motherboard with an nvidia video card) after updating from kernel 3.13 (Ubuntu 14.04) to kernel 4.4 (Ubuntu 16.04). It appears more often during hot weather, when a USB wifi adapter is attached, and when the computer is idle. It is rarer when some programs are active and when it is cold in the room. I thought the cause was micro-cracks in the motherboard, but now that I have seen this ticket I will delay shopping for a new motherboard :) I also saw a bug like this on an ASUS R556 with an i5-5200U and nvidia video on 4.4. The ASUS has now been updated to Ubuntu 16.10 and I am testing it with kernel 4.8. intel_idle.max_cstate=1 did not help in either case on kernel 4.4.
Also, on Broadwell, *any* C-state beyond C1E, if enabled, causes the lockup. For Bay Trail, as some users have pointed out, just turning C7 off and leaving the others enabled works.
I cannot confirm that Ubuntu 16.10 fixes the bug. The freezes still remain with Kernel 4.8.0-22.
Freezes after one hour of VLC...
OS: Arch Linux
Kernel: x86_64 Linux 4.8.2-1-ARCH
CPU: Intel Pentium CPU J2900 @ 2.4157GHz
GPU: Mesa DRI Intel(R) Bay Trail
Yes I read that 4.8 has a faulty p-state implementation that should be fixed with 4.9.
(In reply to Libor Chmelik from comment #516)
> Indeed. I didn't experience any freezes so far since nearly a week now. But
> i installed the normal kernel 4.8.0-040800-generic from
> http://kernel.ubuntu.com/~kernel-ppa/mainline/v4.8/
> I removed the intel_idle.max_cstate=1 from the grub config. And so far no
> problems.
> Laptop Acer Aspire E15 E5-511-P7AT with Pentium N3540 up to 2,66GHz.
> The only thing I noticed during the kernel upgrade was a "nagging" about
> missing intel-drm-i915 firmware.
> HD Playback on Youtube or in VLC works flawlessly. Also no troubles with
> Steam.

Spoke too soon. It froze after all.
Situation 1: YouTube in HD AND CPUs in forced performance mode (turbo boost?)
Situation 2: HD playback in VLC in automatic powersave mode.
But it took 12 days until the first freeze, and 3 more days until the second. Trying kernel 4.8.2 from Ubuntu mainline now. c-state still disabled in grub.conf
One week so far, no crashes. 4.8.0-rc8-amd64

Options:
GRUB_CMDLINE_LINUX_DEFAULT=intel_idle.max_cstate=5

In rc.local this script is run at boot...
-----
#!/bin/bash
echo 1 > /sys/devices/system/cpu/cpu0/cpuidle/state3/disable
echo 1 > /sys/devices/system/cpu/intel_pstate/no_turbo

thanks
Let's see whether the Goldmont successor architecture CPUs https://en.wikipedia.org/wiki/Goldmont_(microarchitecture)#List_of_Goldmont_processors are affected. The first motherboards/NUCs are already/soon available, like the ASRock J4205-ITX. Would be nice if someone could report on that. BTW: My experience has been that a Celeron-NUC is extremely slow for desktop usage and my next NUC is going to be a Pentium-NUC (if of course not affected, otherwise maybe something from AMD).
Why wait for another product to spend money on? I have two boards that are not doing what they should and that have actually cost money. Before doing that I'll change to ARM CPUs out of frustration. My J1900 reset twice today while doing nothing (headless, no X, 4.4 kernel). I deactivated C-states in the BIOS. So far it seems stable. I will give 4.9-rc1 a try in the morning and activate all C-states. I bought those boards to save me some power, not to sit around doing nothing ;-)
Has anyone tried this kernel yet? https://aur.archlinux.org/packages/linux-baytrail48/
I have a N2940 under Gentoo and keep running into the same bug. i already tried: 4.7.5-gentoo 4.8.2-gentoo and git-sources too: 4.9-rc1 still getting random freezes (depending on workload every 15 min to 2 hours).
thorsten: Try the commands below, and report back. These eliminate hang-ups on my N2930 with kernel 4.7 (Gentoo).

First start the kernel with:
intel_idle.max_cstate=0

Then give these commands as root:
echo 1 > /sys/devices/system/cpu/intel_pstate/no_turbo
echo 1 > /sys/devices/system/cpu/cpu0/cpuidle/state3/disable
echo 1 > /sys/devices/system/cpu/cpu1/cpuidle/state3/disable
echo 1 > /sys/devices/system/cpu/cpu2/cpuidle/state3/disable
echo 1 > /sys/devices/system/cpu/cpu3/cpuidle/state3/disable
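The per-CPU commands above can be written as a loop so they do not depend on the number of cores. This is only a sketch: the function names and the CPU_ROOT variable are made up for illustration, and which C-state sits behind index 3 differs between platforms and idle drivers, so check the "name" file of the state before disabling it.

```shell
#!/bin/sh
# Sketch only. CPU_ROOT is a hypothetical override so the functions can be
# tried against a scratch directory; by default it is the real sysfs path.
CPU_ROOT="${CPU_ROOT:-/sys/devices/system/cpu}"

# disable_state_index N: write 1 to cpuidle/stateN/disable on every CPU.
disable_state_index() {
    idx="$1"
    for f in "$CPU_ROOT"/cpu[0-9]*/cpuidle/state"$idx"/disable; do
        [ -e "$f" ] || continue      # skip if the glob matched nothing
        echo 1 > "$f"
    done
}

# disable_turbo: the no_turbo file only exists with the intel_pstate driver.
disable_turbo() {
    f="$CPU_ROOT/intel_pstate/no_turbo"
    [ -e "$f" ] && echo 1 > "$f"
    return 0
}
```

Run as root, "disable_turbo; disable_state_index 3" should have the same effect as the five echo commands. The settings are lost on reboot, so they would have to be reapplied from something like rc.local.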
Tried as indicated:

linux-4.8-3-baytrail-60cacd661dacfd0a7c4aa6f82d11f1c1664e70ad.tar.gz
cp config.x86_64 .config
make INSTALL_MOD_STRIP=1 rpm

then installed on an Atom Z3735G. It has been running for 1 hour now without any kernel parameters, neither cstate nor tsc. Everything else I ever tried without kernel parameters crashed in less than 1 hour, sometimes within 1 minute.
My system has been running stable for 2 weeks now with the archlinux baytrail kernel. (I had to restart it to swap an HDD, however.) Could it be that a headless system is much less likely to crash?
(In reply to Sebastian Heyn from comment #553)
> Could it be that a headless system is much less likely to crash?

Yes, the initial bug report linked in the first comment thought the problem was related to the GPU driver. Even if you are using the unaccelerated efifb + xf86-video-fbdev driver, you are less likely to get the freeze. After all, the best way to trigger the problem so far has been to play videos.
Hi Daniel, thanks. Playing videos means GPU decoding or high framerate framebuffer access via CPU?
I've been using a J1900 board as a router/firewall/fileserver for a couple of years now, it's a gigabyte ga-j1900d3v (chosen for the dual gigabit NICs and the low power consumption). It's pretty stable, runs for weeks and weeks without locking up, but of course there's no video activity - often there's not even a monitor plugged in! However, when it does lock up, it needs a forced reset, as it will have locked up solid.
p.s. I've never used the cstate hack, always used stock kernels without any special patching.
Update on my side: 4.8.4-gentoo has been working for several days so far without patches or disabling cstate options on my machine. If anyone is interested, I have provided my kernel as a download with modules and initrd:
http://s000.tinyupload.com/index.php?file_id=06491416522851495522
md5sum: 4c7fbd190b8656899cfe3b35dbd6f185 kernel.tar.bz2
sha1sum: 3218d1a4064b649d64c46fa493c3d364f1f02737 kernel.tar.bz2
I have an Aspire ES1-311. I would be interested to know whether this kernel works on other machines, too.
@Thorsten, can you check if your /proc/cpuinfo shows the correct frequency info? Mine seems to be stuck at less than 500 MHz, using the ondemand governor.
@Sebastian, I have 499 MHz shown in /proc/cpuinfo too, but I think it's a display error:

~# cpupower frequency-info
analyzing CPU 0:
  driver: intel_pstate
  CPUs which run at the same hardware frequency: 0
  CPUs which need to have their frequency coordinated by software: 0
  maximum transition latency: Cannot determine or is not supported.
  hardware limits: 500 MHz - 2.25 GHz
  available cpufreq governors: performance powersave
  current policy: frequency should be within 500 MHz and 2.25 GHz.
                  The governor "powersave" may decide which speed to use
                  within this range.
  current CPU frequency: 500 MHz (asserted by call to hardware)
  boost state support:
    Supported: yes
    Active: yes

and:

~# cpupower frequency-info
analyzing CPU 0:
  driver: intel_pstate
  CPUs which run at the same hardware frequency: 0
  CPUs which need to have their frequency coordinated by software: 0
  maximum transition latency: Cannot determine or is not supported.
  hardware limits: 500 MHz - 2.25 GHz
  available cpufreq governors: performance powersave
  current policy: frequency should be within 500 MHz and 2.25 GHz.
                  The governor "powersave" may decide which speed to use
                  within this range.
  current CPU frequency: 2.25 GHz (asserted by call to hardware)
  boost state support:
    Supported: yes
    Active: yes

Maybe the difference between my kernel and the 'regular' one is that CONFIG_CPU_FREQ_GOV_SCHEDUTIL is set as the default policy in my config. Strange, though, that cpupower only shows powersave and performance as available governors?
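For comparing the /proc/cpuinfo value with what the cpufreq layer reports, the same information cpupower prints can be read straight from sysfs. A minimal sketch, assuming the standard cpufreq attributes (scaling_driver, scaling_governor, scaling_cur_freq in kHz) are present; show_cpufreq and CPU_ROOT are names invented for this example.

```shell
#!/bin/sh
# Sketch only. CPU_ROOT is a hypothetical override for trying this out
# against a scratch directory instead of the real sysfs tree.
CPU_ROOT="${CPU_ROOT:-/sys/devices/system/cpu}"

# show_cpufreq: one line per CPU with driver, governor and current MHz.
show_cpufreq() {
    for d in "$CPU_ROOT"/cpu[0-9]*/cpufreq; do
        [ -r "$d/scaling_cur_freq" ] || continue
        read -r khz < "$d/scaling_cur_freq"   # value is in kHz
        read -r gov < "$d/scaling_governor"
        read -r drv < "$d/scaling_driver"
        cpu="${d%/cpufreq}"
        printf '%s driver=%s governor=%s cur=%d MHz\n' \
            "${cpu##*/}" "$drv" "$gov" $((khz / 1000))
    done
}
```

Calling show_cpufreq a few times under load and at idle should show whether the frequency really stays near 500 MHz or only one of the displays is wrong.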
@Thorsten Hello! I use an Asus E502 with an Intel Corporation Atom Processor Z36xxx/Z37xxx. I have been affected by this problem for 1 year. Kernel 4.8.4 has been working for me for 2 days (without intel_idle.max_cstate=1). Is it still working for you?
I had these freezes on my Asus z550ma notebook, which uses the N2940 processor. How I solved the problem on Ubuntu 16.04:
1- Installed the Intel video drivers (https://01.org/linuxgraphics/downloads)
2- Updated the Ubuntu kernel to version 4.8.6
3- After updating the kernel, went to Settings, Software & Updates, Additional Drivers and disabled "Processor microcode firmware for Intel CPUs" from intel-microcode (selected the option not to use this device)
4- Restarted the notebook
5- Created a configuration file named i915.conf containing the line: options i915 modeset=1 enable_execlists=0
6- Placed this configuration file in /etc/modprobe.d
7- Restarted the PC
8- Result: the notebook has not frozen for about 1 week (I use it all day)
Note: I did not need to add intel_idle.max_cstate=1 to Grub.

This z550ma notebook also has a problem with the wifi card. To solve it, install the drivers for the card (the Diolinux website has a tutorial) and add a file named rtl8723be.conf to /etc/modprobe.d containing the line: options rtl8723be fwlps=N ips=N
The internet now works normally here. I hope this helps! Note: Brazilians understand Linux too! Sorry for my bad English!
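Steps 5 and 6 and the wifi fix both come down to writing a one-line options file into /etc/modprobe.d. A small sketch of that, where write_modprobe_conf and MODPROBE_DIR are names made up for this example, and the option strings are simply the ones quoted above (whether they are accepted depends on the driver version in use):

```shell
#!/bin/sh
# Sketch only. MODPROBE_DIR is a hypothetical override so the function can
# be tried in a scratch directory; the real location is /etc/modprobe.d.
MODPROBE_DIR="${MODPROBE_DIR:-/etc/modprobe.d}"

# write_modprobe_conf FILE LINE: create FILE in MODPROBE_DIR holding LINE.
write_modprobe_conf() {
    printf '%s\n' "$2" > "$MODPROBE_DIR/$1"
}
```

As root this would be used as write_modprobe_conf i915.conf "options i915 modeset=1 enable_execlists=0" and write_modprobe_conf rtl8723be.conf "options rtl8723be fwlps=N ips=N". If the module is loaded from the initramfs, the initramfs probably has to be regenerated (on Ubuntu, update-initramfs -u) before the options take effect.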
@Thorsten Ubuntu froze this morning after 3 days of usage with 4.8.4. False hope ...
In my experience it may take a week or two before the first freeze happens. It would be very helpful if people could wait and use their machines for 7 - 14 days before declaring success. This would help us weed out false positives :) Thanks for filling in your success and failure details into the spreadsheet bms created, it seems to me patterns are emerging, please keep filling in details about your experiments :)
I would say that if a freeze now takes a few days to reproduce, while before it took a few hours or even minutes, that is already a success to some degree. By the way, these patches could be interesting for some subscribers: https://github.com/burzumishi/linux-baytrail-flexx10/tree/master/kernel/patches/v4.8 Patches 0001 and 0006 especially could probably reduce hangs even more.
@Poumon I had my first two freezes with my kernel yesterday; with my older kernels I had daily freezes when not using other options. So sorry for the false positive. @RussianNeuroMancer I also think it's an improvement if we can now use all the power saving features on an 'unpatched' kernel for multiple days.
I have an unused J1800 desktop machine, so I'll try to reproduce the problem and maybe try to get a kernel stack trace over a serial terminal in the next week. I hope we can pinpoint the actual origin of the problem this way. @RussianNeuroMancer if the problem were mmc-related, the regular desktop user without an mmc reader should not be affected, since the mmc modules would not be loaded, but maybe the other patches could change something.
@thorsten, if the problem is mmc-related, I wonder why it wasn't fixed a long time ago. Patches for the mmc hang have literally been available for years.
I am facing freezing issues on a SoC board running an Intel Celeron J1900. The devices are supposed to be deployed in areas without a single human being, so freezes are unacceptable. Also, I really don't care about power consumption. I was wondering why no one has tried the following kernel options: intel_idle.max_cstate=0 processor.max_cstate=0 idle=poll If I am not being idiotic, the above options would surely switch off all power-management possibilities?
@Ajay: Are the machines headless or do they have an active X running? Some boards allow switching off all power management in the BIOS.
@Sebastien, Thanks for the reply. Nopes, each machine has Ubuntu-14.04.3 installed, with kernel upgraded manually to 3.19(-generic). I don't have a board with me right now, so cannot confirm if there is an option in the BIOS. But irrespective of that, won't each of the kernel-options (as per my previous post) work? The important question is, might intel_idle.max_cstate=0 processor.max_cstate=0 idle=poll break anything over a period of time?
(In reply to Ajay Garg from comment #572)
> @Sebastien,
>
> Thanks for the reply.
>
> Nopes, each machine has Ubuntu-14.04.3 installed, with kernel upgraded
> manually to 3.19(-generic).

Pardon me, I meant a full-blown client-image of Ubuntu-14.04.3, with all the fancy GUI.

> I don't have a board with me right now, so cannot confirm if there is an
> option in the BIOS. But irrespective of that, won't each of the
> kernel-options (as per my previous post) work?
>
> The important question is, might
>
> intel_idle.max_cstate=0 processor.max_cstate=0 idle=poll
>
> break anything over a period of time?
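Before judging whether those three options help, it is worth confirming after a reboot that each one actually reached the kernel, since a typo in the boot loader configuration can go unnoticed. A minimal sketch that checks /proc/cmdline for an exact parameter; has_param and CMDLINE_FILE are hypothetical names used only here.

```shell
#!/bin/sh
# Sketch only. CMDLINE_FILE is a hypothetical override for testing;
# the running kernel's command line lives in /proc/cmdline.
CMDLINE_FILE="${CMDLINE_FILE:-/proc/cmdline}"

# has_param PARAM: succeed if PARAM appears as a whole word on the
# kernel command line (so "max_cstate=0" alone would not match).
has_param() {
    read -r cmdline < "$CMDLINE_FILE"
    case " $cmdline " in
        *" $1 "*) return 0 ;;
        *)        return 1 ;;
    esac
}
```

For example: for p in intel_idle.max_cstate=0 processor.max_cstate=0 idle=poll; do has_param "$p" || echo "missing: $p"; done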
@Ajay Garg I don't own a J1900 device, but I guess the main concern with those options is heat. The kernel documentation (https://www.kernel.org/doc/Documentation/kernel-parameters.txt) warns that idle=poll will make the machine run hot. It may be better to drop that option if possible. I guess it is best to test the machines in advance. You could install software for monitoring CPU / GPU temperature and run workloads that will be typical once the machines are installed on location. That way you will get some hard data about the machines' reliability. It's really frustrating that Intel hardware is as buggy as it is right now. I can't remember any worse period in Intel's history. I guess they are really afraid of the flood of ARM devices, and in trying to compete with those they are going too far with aggressive power saving features.
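If installing a monitoring package is not an option on the deployed image, temperatures can also be logged from the generic thermal sysfs interface. A rough sketch, assuming the board exposes thermal zones under /sys/class/thermal (not every board does, and zone names vary); print_temps and THERMAL_ROOT are names invented for this example.

```shell
#!/bin/sh
# Sketch only. THERMAL_ROOT is a hypothetical override for testing.
THERMAL_ROOT="${THERMAL_ROOT:-/sys/class/thermal}"

# print_temps: one line per thermal zone: "<zone> <type> <whole degrees C>".
print_temps() {
    for z in "$THERMAL_ROOT"/thermal_zone*; do
        [ -r "$z/temp" ] || continue
        read -r milli < "$z/temp"            # millidegrees Celsius
        ztype=unknown
        [ -r "$z/type" ] && read -r ztype < "$z/type"
        printf '%s %s %d\n' "${z##*/}" "$ztype" $((milli / 1000))
    done
}
```

A loop such as "while sleep 60; do date; print_temps; done >> temps.log" running alongside the test workload would give a temperature history to compare between idle=poll and the other options.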
(In reply to Wolfgang M. Reimer from comment #539) Hello, you have missed the lucky 23. [23] AAZ32: Intel Celeron P4000 and U3000 Mobile Processor Series http://www.intel.ie/content/dam/www/public/us/en/documents/specification-updates/celeron-mobile-p4000-u3000-specification-update.pdf
Just tried 4.8.7 on Ubuntu 16.04. This kernel won't even boot. I have a N3700 CPU. So back to 4.8.6.
I tried the serial console approach and could not get a kernel crash dump this way despite the machine freezing with 4.8.7. My guess is that, because this seems to be a hardware bug, the CPU is frozen before the kernel can throw a crash dump or contact my serial console. @Ajay Garg I would probably disable cpufreq (and cstates) altogether on a non-mobile machine. The downside would be a hotter and possibly louder machine. Also, if your chipset has watchdog functionality, that might be a way to reset the machine automatically after a freeze, if that helps with your application. Alternatively, in your use case I would probably switch to unaffected hardware, e.g. a Celeron 847, or otherwise do a lot of testing first. @Martin Brand I have a Pentium N3700 too and have not been affected by this bug so far. Have you had freezes before?
It IS a hardware bug and Intel should fix it.
Yes I did have freezes before, but never during boot. The c6off+c7on scripts from Wolfgang Reimer made my system usable. So thanks a lot for that! Without the script it usually freezes within an hour. With c6off about once every one to two weeks. Still very annoying when it happens.
What are the conditions for entering C7? I ran this script and am looking at the powertop output now: the cores spend 96-97% of the time in C1. So it looks like this doesn't differ much from the known intel_idle.max_cstate=1 workaround.
My powertop displays the following (labels translated from German):

PowerTOP 2.8    Overview   Idle stats   Frequency stats   Device stats   Tunables

          Package   |            Core    |            CPU 0
                    |                    | C0 active   7.2%
                    |                    | POLL        0.0%    0.0 ms
                    | C1 (cc1)   23.7%   | C1-CHT     22.3%    1.2 ms
C6 (pc6)   18.1%    | C6 (cc6)   62.9%   | C6S-CHT     0.0%    0.0 ms
                    |                    | C7S-CHT    16.3%   20.7 ms

                    |            Core    |            CPU 1
                    |                    | C0 active  22.4%
                    |                    | POLL        0.0%    0.0 ms
                    | C1 (cc1)   11.3%   | C1-CHT      9.5%    2.1 ms
                    | C6 (cc6)   59.3%   | C6S-CHT     0.0%    0.0 ms
                    |                    | C7S-CHT    22.6%   22.9 ms

So it uses the C7 state. Battery life is better and CPU temperature is also about 10°C lower than with the intel_idle.max_cstate=1 workaround.
OK, I need to clarify that in my case it's a Bay Trail CPU, the Z3735G. After running the script I no longer get PC7 and CC6 (which worked before running the script), and I don't get the expected C7S-BYT either (it stays at 0%, while C6S-BYT was sometimes over 90% before running the script). So for me the script's outcome is no different from the intel_idle.max_cstate=1 workaround. Is there anyone who has working PC7/CC6/C7S-BYT on a Bay Trail device after disabling C6S-BYT?
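Besides powertop, the kernel's own per-state counters show whether a state such as C7S-BYT is ever entered after C6S-BYT has been disabled. A minimal sketch reading the standard cpuidle sysfs attributes (name, disable, usage, time in microseconds); show_idle_states and CPU_ROOT are names invented for this example.

```shell
#!/bin/sh
# Sketch only. CPU_ROOT is a hypothetical override for testing.
CPU_ROOT="${CPU_ROOT:-/sys/devices/system/cpu}"

# show_idle_states [CPU]: list each idle state of CPU (default cpu0) with
# its disable flag, entry count and total residency in microseconds.
show_idle_states() {
    cpu="${1:-cpu0}"
    for s in "$CPU_ROOT/$cpu"/cpuidle/state[0-9]*; do
        [ -r "$s/name" ] || continue
        read -r name < "$s/name"
        read -r disabled < "$s/disable"
        read -r usage < "$s/usage"
        read -r time_us < "$s/time"
        printf '%-10s disabled=%s usage=%s time_us=%s\n' \
            "$name" "$disabled" "$usage" "$time_us"
    done
}
```

Running show_idle_states twice a few minutes apart and comparing the usage counters should show whether the C7 states are being entered at all, independent of how powertop groups them.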
On CPU Z3735G, I always saw a Call Trace related to hard LOCKUP on the screen when it froze while in console mode. As also reported in other comments (#157, #568), these Call Traces were not logged by the system. Today I received this message in a xterm:

Message from syslogd@leia at Nov 18 09:50:32 ...
 kernel:NMI watchdog: Watchdog detected hard LOCKUP on cpu 3#001dModules linked in: msr r8723bs(O) intel_...

And found the complete Call Trace in dmesg:

[ 261.956671] NMI watchdog: Watchdog detected hard LOCKUP on cpu 3dModules linked in: msr r8723bs(O) intel_rapl intel_soc_dts_thermal nls_iso8859_1 intel_powerclamp coretemp nls_cp437 vfat kvm_intel fat kvm iTCO_wdt snd_soc_sst_bytcr_rt5640 gpio_keys iTCO_vendor_support irqbypass crct10dif_pclmul crc32_pclmul ghash_clmulni_intel aesni_intel aes_x86_64 lrw gf128mul joydev glue_helper input_leds snd_usb_audio snd_usbmidi_lib snd_hwdep mousedev ablk_helper cryptd snd_soc_rt5645 snd_rawmidi mac_hid cfg80211 intel_cstate pcspkr thermal evdev kxcjk_1013 snd_intel_sst_acpi snd_intel_sst_core tpm_crb industrialio_triggered_buffer snd_soc_rt5640 soc_button_array snd_soc_sst_mfld_platform kfifo_buf snd_soc_rl6231 snd_soc_sst_match industrialio dptf_power int3406_thermal int3403_thermal snd_soc_core processor_thermal_device int3402_thermal goodix int3400_thermal battery snd_compress int340x_thermal_zone snd_pcm_dmaengine acpi_thermal_rel intel_soc_dts_iosf ac97_bus hci_uart snd_seq snd_seq_device b
[ 261.956681] CPU: 3 PID: 3055 Comm: inkscape Tainted: G O 4.8.6-BAYTRAIL48 #1
[ 261.956684] Hardware name: Positivo Informatica SA WCBT1013/WCBT1013, BIOS 1.7 06/09/2015
[ 261.956687] 0000000000000086 000000000bea3d8f ffff880038643bf0 ffffffff812f9d4b
[ 261.956690] 0000000000000000 0000000000000000 ffff880038643c08 ffffffff8111d918
[ 261.956693] ffff880038d30800 ffff880038643c40 ffffffff811613ac 0000000000000001
[ 261.956695] Call Trace:
[ 261.956698] [<ffffffff812f9d4b>] dump_stack+0x63/0x88
[ 261.956700] [<ffffffff8111d918>] watchdog_overflow_callback+0xc8/0xf0
[ 261.956703] [<ffffffff811613ac>] __perf_event_overflow+0x7c/0x1b0
[ 261.956706] [<ffffffff81169664>] perf_event_overflow+0x14/0x20
[ 261.956708] [<ffffffff8100c147>] intel_pmu_handle_irq+0x1e7/0x4a0
[ 261.956711] [<ffffffff81185606>] ? __pagevec_lru_add_fn+0x186/0x290
[ 261.956714] [<ffffffff811f7395>] ? mem_cgroup_commit_charge+0x85/0x100
[ 261.956716] [<ffffffff81187209>] ? lru_cache_add_active_or_unevictable+0x39/0xc0
[ 261.956719] [<ffffffff811af8da>] ? handle_mm_fault+0x41a/0x1550
[ 261.956722] [<ffffffff810055ed>] perf_event_nmi_handler+0x2d/0x50
[ 261.956724] [<ffffffff810312d1>] nmi_handle+0x61/0x140
[ 261.956727] [<ffffffff81031878>] default_do_nmi+0x48/0x130
[ 261.956730] [<ffffffff81031a4b>] do_nmi+0xeb/0x160
[ 261.956732] [<ffffffff815f1406>] nmi+0x56/0xa5

The system did not freeze but continued to operate normally.

Kernel was 4.8.6 with following patches:
  from https://aur.archlinux.org/packages/linux-baytrail48/, from linux-4.8-3-baytrail-60cacd661dacfd0a7c4aa6f82d11f1c1664e70a, baytrailfix[1-5].patch
  from https://github.com/burzumishi/linux-baytrail-flexx10/tree/master/kernel/patches/v4.8, patch 0001*, 0006*, and 0008*
and config:
  from https://aur.archlinux.org/packages/linux-baytrail48, from linux-4.8-3-baytrail-60cacd661dacfd0a7c4aa6f82d11f1c1664e70a, config.x86_64

Clocksource during this event was:

cat /sys/devices/system/clocksource/clocksource0/current_clocksource
refined-jiffies

but often, after reboot in this configuration, clocksource is tsc.
Scripts at comment #c437 solve the problem. However, I had to modify c6off+c7on.sh in order to make it work for CHT (Cherry Trail) processors. The latest stable kernel for Ubuntu (4.8.10) also seems to solve the problem. More details at: https://bugs.launchpad.net/ubuntu/+source/linux/+bug/1575467/comments/142
Same here. I have upgraded to 4.8.9. I was not able to boot 4.8.7. Kernel 4.8.9 with c6off+c7on.sh has definitely improved the situation. So far no crash. The system survived 2 hour HD film and an old windows game played with wine. This has not happened before, so I am very hopeful.
Len Brown (and Intel in general) should be ashamed of themselves (assuming they still have some self-respect left). Making mistakes is perfectly acceptable (we all make mistakes). But being shamelessly quiet is a sign of impotency.
Hello, for all of you having issues with this: I used the c6off+c7on script and it did not solve my problem. So I modified the script to turn off both C6 and C7 and have not had a freeze in months. My altered script is below. Hope it helps some.

#!/bin/sh
#title: c6off+c7off.sh
#description: Disables all C6 and C7 core states for Baytrail CPUs
#author: Wolfgang Reimer <linuxball (at) gmail.com>
#date: 2016014
#version: 1.0
#usage: sudo <path>/c6off+c7off.sh
#notes: Intended as test script to verify whether erratum VLP52 (see
#       [1]) is the root cause for kernel bug 109051 (see [2]). In order
#       for this to work you must _NOT_ use boot parameter
#       intel_idle.max_cstate=<number>.
#
# [1] http://www.intel.com/content/dam/www/public/us/en/documents/specification-updates/pentium-n3520-j2850-celeron-n2920-n2820-n2815-n2806-j1850-j1750-spec-update.pdf
# [2] https://bugzilla.kernel.org/show_bug.cgi?id=109051

# Disable ($1 == 1) or enable ($1 == 0) core state, if not yet done.
disable() {
    local action
    read disabled <disable
    test "$disabled" = $1 && return
    echo $1 >disable || return
    action=ENABLED; test "$1" = 0 || action=DISABLED
    printf "%-8s state %7s for %s.\n" $action "$name" $cpu
}

# Iterate through each core state and for Baytrail (BYT) disable all C6 & C7 states.
cd /sys/devices/system/cpu
for cpu in cpu[0-9]*; do
    for dir in $cpu/cpuidle/state*; do
        cd "$dir"
        read name <name
        case $name in
            C6*-BYT) disable 1;;
            C7*-BYT) disable 1;;
        esac
        cd ../../..
    done
done
@Dan0780 Please tell us what your processor is. Without this info we don't know in which cases your solution helps. Thanks :)
Dan0780: Could you please fill in details of your solution in the google spreadsheet BMS created so your solution will be easily found by people having the same hardware. The spreadsheet is here: https://docs.google.com/spreadsheets/d/1oajcMYL9oSt0O6VTpaIj0osGJxKGKSPSYtLnqr3UHNk/edit?usp=sharing
Sorry, my processor is J1900. I will try and fill out the spreadsheet
(In reply to Dan0780 from comment #591)
> Sorry, my processor is J1900. I will try and fill out the spreadsheet

Dan; as far as I can see (a diff would have been useful), the difference with the original script is that you actually disable C7; this really does the same as max_cstate=1 then.
The script is unnecessarily complicated ...

# baytrail workaround for https://bugs.freedesktop.org/show_bug.cgi?id=88012
for state in /sys/devices/system/cpu/cpu*/cpuidle/state* ; do
    case "$(< "${state}/name")" in
        C6*-BYT|C6*-CHT) echo "1" > "${state}/disable" ;;
        C7*-BYT|C7*-CHT) echo "0" > "${state}/disable" ;;
    esac
done

or to disable C6 and C7:

for state in /sys/devices/system/cpu/cpu*/cpuidle/state* ; do
    case "$(< "${state}/name")" in
        C6*-BYT|C6*-CHT|C7*-BYT|C7*-CHT) echo "1" > "${state}/disable" ;;
    esac
done

It just lacks all the feedback of the other script, while the shell will complain in case you don't have the right permissions. Otherwise, if everything is fine, you just get no feedback. Fine for me.
(In reply to Michaël from comment #592)
> (In reply to Dan0780 from comment #591)
> > Sorry, my processor is J1900. I will try and fill out the spreadsheet
>
> Dan; as far as I can see (a diff would have been useful), the difference
> with the original script is that you actually disable C7—this really does
> the same as max_cstate=1 then.

I did not have the option to set max_cstate=1 and therefore I modified the original script. Either way, I just wanted to share this: the original script, which disables C6 only, did not work for me, but disabling all of them worked and I have had no issues since.
(In reply to ladiko from comment #593) > The script is unnecessarily complicated ... ... for you. I NEED the feedback (Usually I test hundreds of boxes with different combinations of enabled/disabled cstates and log the output for documentation purposes. For the next test result I need to document what exactly has changed from the previous state). > > # baytrail workaround for https://bugs.freedesktop.org/show_bug.cgi?id=88012 > for state in /sys/devices/system/cpu/cpu*/cpuidle/state* ; do > case "$(< "${state}/name")" in > C6*-BYT|C6*-CHT) echo "1" > "${state}/disable" ;; > C7*-BYT|C7*-CHT) echo "0" > "${state}/disable" ;; > esac > done > > or to disable C6 and C7: > > for state in /sys/devices/system/cpu/cpu*/cpuidle/state* ; do > case "$(< "${state}/name")" in > C6*-BYT|C6*-CHT|C7*-BYT|C7*-CHT) echo "1" > "${state}/disable" > ;; > esac > done > > it just lacks all the feedback of the other script while the shell will > complain in case you dont have the right permissions. otherwise if > everything is fine - you just get no feedback - fine for me. ... and your changed script IS NOT POSIX shell (dash, busybox' ash) compatible any longer (which is REQUIRED in my case).
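For anyone who wants both the short form and the two requirements above (feedback and POSIX compatibility), here is a minimal sketch. It is not the original script from this thread; the function name disable_c6 and the CPU_ROOT override are made up for illustration, and it only avoids the bash-only "$(< file)" construct by using read.

```shell
#!/bin/sh
# Sketch only: disable the Bay Trail / Cherry Trail C6 states and report
# every change. Uses plain POSIX sh (no "$(< file)"), so dash and busybox
# ash should accept it.
# CPU_ROOT is a hypothetical override so the function can be tried on a
# copy of the sysfs tree; by default the real sysfs path is used.

disable_c6() {
    root=${CPU_ROOT:-/sys/devices/system/cpu}
    for state in "$root"/cpu*/cpuidle/state*; do
        [ -r "$state/name" ] || continue    # glob did not match anything
        read -r name < "$state/name"
        case "$name" in
            C6*-BYT|C6*-CHT)
                read -r old < "$state/disable"
                echo 1 > "$state/disable" || return 1
                echo "$state ($name): disable $old -> 1"
                ;;
        esac
    done
}

# Run as root on the real machine:
#   disable_c6
```

The printed lines can be appended to a log to document what changed between two test runs.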
(In reply to DE from comment #575) > (In reply to Wolfgang M. Reimer from comment #539) > > Hello, you have missed the lucky 23. > > [23] AAZ32: Intel Celeron P4000 and U3000 Mobile Processor Series > > http://www.intel.ie/content/dam/www/public/us/en/documents/specification- > updates/celeron-mobile-p4000-u3000-specification-update.pdf Thanks, added to my list.
I have an unnamed board based on a J1900 with a number of GigE ports. The device is used as a router and runs headless. I've experimented with a number of combinations and simply disabling C6S-BYT state (using a script) made the biggest improvement for me (forcing max_cstate=1 also works, but cpu runs hotter). Without the C6S-BYT being disabled the uptime would be never longer than 24h, sometimes the device would reload some other times - locked up hard, never leaving a trace on the serial console on what exactly went wrong. This is using standard 4.4.0-47-generic Ubuntu Xenial kernel. Now I have an uptime of over 7 days with no issues.
After running my Zotac (Intel® Celeron® Processor N2930 Bay Trail family) box without any crashes for over 2 months (thanks to intel_idle.max_cstate=1) a few days ago I installed the latest Linux Mint stock version kernel (4.4.0-51) with intel_idle.max_cstate=1. To my greatest surprise my machine froze a few hours later while it was in light use. I found it frozen again the following morning while it was idling overnight. Then again this morning completely frozen. This is a significant regression in this machine's case, as from the very beginning of this saga intel_idle.max_cstate=1 has been a life saver, and until now no kernel version had frozen while using intel_idle.max_cstate=1. So, right now I have it running with 4.4.35-040435-generic #201611260431 SMP (obtained from http://kernel.ubuntu.com/~kernel-ppa/mainline/). Hopefully it will work better. I know that I can always go back to 4.4.0 from ubuntu, but I am concerned that version 4.4.0 might have known security vulnerabilities. Hal
Hi Hal, I am using Kernel 4.8.9 with c6 states disabled. This has worked for me since November 21. Why don't you try this kernel before you go back to 4.4.0
@Hal I have a N2930 processor and had freezes again when migrating from kernel 4.4 to 4.7. The commands below stopped the freezes when used with intel_idle.max_cstate=0. Try these :)

echo 1 > /sys/devices/system/cpu/intel_pstate/no_turbo
echo 1 > /sys/devices/system/cpu/cpu0/cpuidle/state3/disable
echo 1 > /sys/devices/system/cpu/cpu1/cpuidle/state3/disable
echo 1 > /sys/devices/system/cpu/cpu2/cpuidle/state3/disable
echo 1 > /sys/devices/system/cpu/cpu3/cpuidle/state3/disable
max_cstate=1 does not work for my ASRock Q1900DC-ITX (J1900 processor), running Ubuntu 16.04.1 LTS (GNU/Linux 4.4.0-51-generic x86_64). Average uptime is around 48-72 hours; with the C6 and C7 fix it's even less, about 26 hours. I'm getting really frustrated with this. I'm thinking of buying an i3 barebone PC from Gigabyte, the GB-BKi3A-7100: http://www.gigabyte.com/products/product-page.aspx?pid=6079#ov Does the i3-7100U have similar issues with freezing?
You also need to

#!/bin/bash
echo 1 > /sys/devices/system/cpu/cpu0/cpuidle/state6/disable
echo 1 > /sys/devices/system/cpu/cpu1/cpuidle/state6/disable
echo 1 > /sys/devices/system/cpu/cpu2/cpuidle/state6/disable
echo 1 > /sys/devices/system/cpu/cpu3/cpuidle/state6/disable

in order to stop crashing... I did this with rc.local and a script.
Okay, right now I have the script running from crontab:

@reboot /home/john/scripts/c6off+c7off.sh

Can I just add the

echo 1 > /sys/devices/system/cpu/cpu0/cpuidle/state6/disable
echo 1 > /sys/devices/system/cpu/cpu1/cpuidle/state6/disable
echo 1 > /sys/devices/system/cpu/cpu2/cpuidle/state6/disable
echo 1 > /sys/devices/system/cpu/cpu3/cpuidle/state6/disable

lines at the very end?
(In reply to john from comment #603)
> Okay, right now i have the script running from crontab:
> @reboot /home/john/scripts/c6off+c7off.sh
>
> can i just add the
>
> echo 1 > /sys/devices/system/cpu/cpu0/cpuidle/state6/disable
> echo 1 > /sys/devices/system/cpu/cpu1/cpuidle/state6/disable
> echo 1 > /sys/devices/system/cpu/cpu2/cpuidle/state6/disable
> echo 1 > /sys/devices/system/cpu/cpu3/cpuidle/state6/disable
>
> lines very end?

Also, ls -la in /sys/devices/system/cpu/cpu0/cpuidle only gives 2 states:

state0
state1
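The state numbers are clearly not the same everywhere (state3 in one comment, state6 in another, only state0 and state1 here), so it is safer to look at the names before writing to a fixed stateN. A minimal sketch, assuming the usual cpuidle sysfs layout; the function name list_cstates is made up for illustration:

```shell
#!/bin/sh
# Sketch only: print index, name and disable flag of every cpuidle state
# of one CPU, so you can see which stateN (if any) is a C6 state.
# The optional argument is the cpuidle directory; cpu0 is the default.

list_cstates() {
    dir=${1:-/sys/devices/system/cpu/cpu0/cpuidle}
    for state in "$dir"/state*; do
        [ -r "$state/name" ] || continue    # no cpuidle states exposed
        read -r name < "$state/name"
        read -r disabled < "$state/disable"
        echo "${state##*/} $name disable=$disabled"
    done
}

list_cstates
```

If no line shows a name starting with C6, there is nothing to disable and the state6 lines would only produce "No such file or directory".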
(Follow up to own post #598)

Thanks for the suggestions. For now, I am sticking to 4.4.35-040435-generic as it seems to be working fine. No freezing or crash in 25 hours!

One reason I am sticking to the 4.4 line is that it is a long term support version. In the past I ran 4.5.7, but then it was no longer maintained because of EOL. I am still not sure if the 4.8 strain will be long term or not. Also, my (albeit superficial) reading of comments about it gave me the impression that 4.8 started off on the wrong foot (insufficient QA on its initial release).

Anyway, I only update the kernel when I need it to support new devices (like USB 3.1 or 802.11 ac), or when I hear about newly found vulnerabilities, and typically try to stay one step behind rather than at the cutting edge.

I'll keep the thread posted if anything new happens but so far 4.4.35 looks good!

Hal
(In reply to Hal from comment #605) > I am still not sure if the 4.8 strain will be long term or > not. 4.8 will not be LTS. But 4.9 will be, see: http://kroah.com/log/blog/2016/09/06/4-dot-9-equals-equals-next-lts-kernel/
Configuration from #583 stable during more than 2 weeks of daily use, no kernel parameters, no C-state script. Updated the spreadsheet. Observed a 30% chance that tsc is rejected as clocksource during boot and refined-jiffies is used instead. In this case the wall clock is almost 10x slower and the keyboard repetition rate is extremely slow, and occasionally a hard lockup occurs in one processor core, but the system continues working. For this reason I will use kernel parameter tsc=reliable from now on. Does it make sense to reject the tsc of a CPU that has the flags rdtscp, constant_tsc, and nonstop_tsc?
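To see after each boot which case you got, something like the following can be used. It is only a sketch: it reads the standard clocksource sysfs file and the flags line of /proc/cpuinfo, and the function name check_tsc and its two optional arguments are made up for illustration.

```shell
#!/bin/sh
# Sketch only: show the clocksource the kernel picked and whether the CPU
# advertises the three TSC-related flags mentioned above.
# $1: clocksource sysfs directory, $2: cpuinfo file (both optional).

check_tsc() {
    cs_dir=${1:-/sys/devices/system/clocksource/clocksource0}
    cpuinfo=${2:-/proc/cpuinfo}
    read -r current < "$cs_dir/current_clocksource"
    echo "clocksource: $current"
    # First "flags" line only; all cores normally report the same flags.
    flags=$(sed -n '/^flags/{p;q;}' "$cpuinfo")
    for flag in rdtscp constant_tsc nonstop_tsc; do
        case " $flags " in
            *" $flag "*) echo "$flag: yes" ;;
            *)           echo "$flag: no" ;;
        esac
    done
}

# Only call it where the sysfs file exists.
[ -r /sys/devices/system/clocksource/clocksource0/current_clocksource ] && check_tsc
```

A "clocksource: refined-jiffies" line would correspond to the slow-clock case described above.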
https://www.spinics.net/lists/linux-i2c/msg27520.html > About this patch vs bug bko109051, yesterday I've spend time reading > that entire bug. It seems it is a combination of at least 3 bugs > combined, 2 i915 related with commits which seem to trigger > the problem (2 different groups of users with a different problem > it seems) which causes a hang every few hours. And one other > bug where the system freezes in minutes, that one sounds like > what I was seeing without this patch (but may well be yet > another issue). > > As for the 2 i915 bugs, there have been git bisects for both of > them, it would be good if someone could take a look at these, just > search for bisect in that huge bug.
Hi, The patch here: https://bugzilla.kernel.org/show_bug.cgi?id=109051#c530 seems to have fixed the problem for me on my N3520 Bay trail (Lenovo Yoga 2 11). I changed the patch to be applied on 4.8.0-30, the default Ubuntu kernel on an updated Ubuntu 16.10. Please test if that fixes the issue. The patch (couldn't find attachment button):

diff --git a/drivers/idle/intel_idle.c b/drivers/idle/intel_idle.c
index 67ec58f..2a77317 100644
--- a/drivers/idle/intel_idle.c
+++ b/drivers/idle/intel_idle.c
@@ -1242,6 +1242,34 @@ static void bxt_idle_state_table_update(void)
 }
 /*
+ * byt_idle_state_table_update(void)
+ *
+ * On BYT, we have errata VLP52 and disable C6.
+ * https://bugzilla.kernel.org/show_bug.cgi?id=109051A
+ * http://www.intel.com/content/dam/www/public/us/en/documents/specification-updates/pentium-n3520-j2850-celeron-n2920-n2820-n2815-n2806-j1850-j1750-spec-update.pdf
+ * VLP52 EOI Transactions May Not be Sent if Software Enters Core C6 During an Interrupt Service Routine.
+
+Problem:
+If core C6 is entered after the start of an interrupt service routine but before a write
+to the APIC EOI (End of Interrupt) register, and the core is woken up by an event
+other than a fixed interrupt source the core may drop the EOI transaction the next
+time APIC EOI register is written and further interrupts from the same or lower
+priority level will be blocked.
+
+Implication:
+EOI transactions may be lost and interrupts may be blocked when core C6 is used
+during interrupt service routines.
+
+Workaround:
+It is possible for the firmware to contain a workaround for this erratum.
+ */
+static void byt_idle_state_table_update(void)
+{
+	printk(PREFIX "byt_idle_state_table_update reached\n");
+	byt_cstates[1].disabled = 1;	/* C6N-BYT */
+	byt_cstates[2].disabled = 1;	/* C6S-BYT */
+}
+/*
  * sklh_idle_state_table_update(void)
  *
  * On SKL-H (model 0x5e) disable C8 and C9 if:
@@ -1299,6 +1327,11 @@ static void intel_idle_state_table_update(void)
 	case INTEL_FAM6_ATOM_GOLDMONT:
 		bxt_idle_state_table_update();
 		break;
+	case INTEL_FAM6_ATOM_SILVERMONT1: /* BYT */
+		printk(PREFIX "intel_idle_state_table_update BYT 0x37 reached\n");
+		byt_idle_state_table_update();
+		break;
+
 	case INTEL_FAM6_SKYLAKE_DESKTOP:
 		sklh_idle_state_table_update();
 		break;
Created attachment 247621 [details] Patch for Bay trail for 4.8 based on https://bugzilla.kernel.org/show_bug.cgi?id=109051#c530
(In reply to Pshem K from comment #597) > I have an unnamed board based on a J1900 with a number of GigE ports. The > device is used as a router and runs headless. I've experimented with a > number of combinations and simply disabling C6S-BYT state (using a script) > made the biggest improvement for me (forcing max_cstate=1 also works, but > cpu runs hotter). Without the C6S-BYT being disabled the uptime would be > never longer than 24h, sometimes the device would reload some other times - > locked up hard, never leaving a trace on the serial console on what exactly > went wrong. This is using standard 4.4.0-47-generic Ubuntu Xenial kernel. > Now I have an uptime of over 7 days with no issues. Spoke too soon. It looks like I can only get stability with max_cstate=1. Disabling C6 only helped a lot, but occasionally the box would still lock up. The device acts as a router and usually the lockups occur after a long (a few hours) high speed (300-600Mb/s) transfers. Currently running with ubuntu 4.4.0-53 kernel.
Please try the patches that were posted and report. Thank you
for the ubuntu users, here are some precompiled kernels in deb package format, containing the Bay trail fixes: https://www.dropbox.com/sh/c39et4hr6tgp60q/AAC35c56aOEOwkmhjdvtG6dsa?dl=0 Please test and report back!
(In reply to Vincent Gerris from comment #610) > Created attachment 247621 [details] > Patch for Bay trail for 4.8 > > based on https://bugzilla.kernel.org/show_bug.cgi?id=109051#c530 Can you confirm that this is needed for Z3735F CPU? I'm currently running Arch 4.8.11-1-zen with 2 patches from https://github.com/ferbar/rtl8723bs/tree/master/patches_4.7 and "clocksource=tsc" in cmdline. System appears stable, aside from misrecognized battery and lack of external physical keys support, it's a commodity tablet "Axdia international GmbH wintab 9 plus 3G/Tablet, BIOS 5.6.5 03/10/2015".
#611 if it is only a router, just don't start X and it should be stable.
(In reply to ladiko from comment #615) > #611 if it is only a router, just don't start X and it should be stable. There is no X on that box. Not running it is not sufficient to make the machine stable. Without the max_cstate=1 the device eventually locks up.
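When comparing results like these it helps to confirm that the parameter really reached the running kernel (a typo such as max_state is silently ignored). A minimal sketch that only parses /proc/cmdline; the function name max_cstate_from_cmdline is made up for illustration:

```shell
#!/bin/sh
# Sketch only: print the value of intel_idle.max_cstate from the kernel
# command line, or nothing if the parameter was not given.
# $1: command line file (optional, /proc/cmdline by default).

max_cstate_from_cmdline() {
    tr ' ' '\n' < "${1:-/proc/cmdline}" |
        sed -n 's/^intel_idle\.max_cstate=//p'
}

value=$(max_cstate_from_cmdline)
if [ -n "$value" ]; then
    echo "booted with intel_idle.max_cstate=$value"
else
    echo "intel_idle.max_cstate not set on the command line"
fi
```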
(In reply to VoobScout from comment #614) > (In reply to Vincent Gerris from comment #610) > > Created attachment 247621 [details] > > Patch for Bay trail for 4.8 > > > > based on https://bugzilla.kernel.org/show_bug.cgi?id=109051#c530 > > Can you confirm that this is needed for Z3735F CPU? > > I'm currently running Arch 4.8.11-1-zen with 2 patches from > https://github.com/ferbar/rtl8723bs/tree/master/patches_4.7 and > "clocksource=tsc" in cmdline. > > System appears stable, aside from misrecognized battery and lack of external > physical keys support, it's a commodity tablet "Axdia international GmbH > wintab 9 plus 3G/Tablet, BIOS 5.6.5 03/10/2015". Hi, it seems like a Bay trail processor so I think you need it. I don't know about that Arch kernel, I saw they also patched a 4.8 kernel. The patch from Jochen Hein works for me as does my modification for 4.8 kernels. Just compile your own kernel with the patch applied according to your Linux version to be sure.
I am currently using 4.8.0-32 kernel installed from linuxmint 18's update manager. System is stable without intel_idle.max_cstate=1 till now . Do test this kernel out.
Prashant Poonia: Please tell us what your processor is, otherwise your success story is quite useless to others :) Different processors have different bugs and also different workarounds. One solution does not fit all :)
(In reply to mhartzel from comment #619)
> Prashant Poonia: Please tell us what your processor is, otherwise your
> success story is quite useless to others :) Different processors have
> different bugs and also different workarounds. One solution does not fit all
> :)

Sorry :D It's an N3540 Bay Trail; the laptop is an Asus X553MA running Linux Mint 18 with kernel 4.8.0-32. The updated Yakkety Yak wifi driver for this kernel causes freezes when using wifi; the rest works flawlessly. Hope this helps someone, and I recommend you check it out.
Hello to everybody, I'm new here.

I would like to share my story.

(In reply also to the last posts of VoobScout and Vincent Gerris, comments #614 and #617 respectively)

I recently built different Linux flavors on this Z3735F mini machine: https://www.aliexpress.com/store/product/2016-QOTOM-Micro-ITX-motherboard-Z3735F-with-2GB-RAM-32GB-SSD-WIFI-Bluetooth-support-Win-8/108231_32694240800.html (wiped of all the MS stuff when received).

The native Jessie multiarch (https://cdimage.debian.org/cdimage/unofficial/non-free/cd-including-firmware/8.6.0+nonfree/multi-arch/iso-cd/) works fine directly and is stable without any adjustments, but with no HDMI, WIFI, sound or Bluetooth....

Trying different Debian kernels (https://github.com/hadess/rtl8723bs/wiki/RTL8723BS-module-building-instruction-for-Debian-GNU-Linux) gave instability and a lot of freezes.

I also tried the "Linuxium - LUBUNTU 16.04 OS", which works fine, is stable and has working wifi out of the box (RTL8723bs), but still no sound, HDMI or Bluetooth.

($ inxi -F
System:    Host: gil-lbnt Kernel: 4.4.0-31-linuxium x86_64 (64 bit)
           Desktop: LXDE (Openbox 3.6.1) Distro: Ubuntu 16.04 xenial
Machine:   Mobo: AMI model: Aptio CRB
           Bios: American Megatrends v: 5.6.5 date: 08/01/2015
CPU:       Quad core Intel Atom Z3735F (-MCP-) cache: 1024 KB
           clock speeds: max: 1832 MHz 1: 705 MHz 2: 1426 MHz 3: 1140 MHz 4: 1374 MHz
Graphics:  Card: Intel Atom Processor Z36xxx/Z37xxx Series Graphics & Display
           Display Server: X.Org 1.18.4 drivers: intel (unloaded: fbdev,vesa)
           Resolution: 1366x768@59.79hz
           GLX Renderer: Mesa DRI Intel Bay Trail GLX Version: 3.0 Mesa 11.2.0
Audio:     Card IntelHDMI driver: IntelHDMI Sound: ALSA v: k4.4.0-31-linuxium
Network:   Card: Failed to Detect Network Card!
Drives:    HDD Total Size: 15.5GB (Used Error!)
           ID-1: /dev/mmcblk0 model: N/A size: 31.0GB
           ID-2: USB /dev/sda model: USB_DISK_3.0 size: 15.5GB
Partition: ID-1: / size: 10G used: 5.3G (57%) fs: ext4 dev: /dev/mmcblk0p2
           ID-2: /home size: 17G used: 3.4G (21%) fs: ext4 dev: /dev/mmcblk0p4
           ID-3: swap-1 size: 1.00GB used: 0.00GB (0%) fs: swap dev: /dev/mmcblk0p3
RAID:      No RAID devices: /proc/mdstat, md_mod kernel module present
Sensors:   System Temperatures: cpu: 45.0C mobo: N/A
           Fan Speeds (in rpm): cpu: N/A
Info:      Processes: 189 Uptime: 0 min Memory: 258.1/1939.2MB
           Client: Shell (bash) inxi: 2.2.35 )

Recently, I upgraded to the generic kernel 4.8 (http://sourcedigit.com/21520-upgrade-linux-kernel-4-8-10-install-linux-kernel-4-8-10-ubuntu/)

And after rebooting, I installed this r8723bs module version: https://github.com/ferbar/rtl8723bs

($ inxi -F
System:    Host: gil-lbnt Kernel: 4.8.10-040810-generic x86_64 (64 bit)
           Desktop: LXDE (Openbox 3.6.1) Distro: Ubuntu 16.04 xenial
Machine:   Mobo: AMI model: Aptio CRB
           Bios: American Megatrends v: 5.6.5 date: 08/01/2015
CPU:       Quad core Intel Atom Z3735F (-MCP-) cache: 1024 KB
           clock speeds: max: 1832 MHz 1: 499 MHz 2: 499 MHz 3: 499 MHz 4: 499 MHz
Graphics:  Card: Intel Atom Processor Z36xxx/Z37xxx Series Graphics & Display
           Display Server: X.Org 1.18.4 drivers: intel (unloaded: fbdev,vesa)
           Resolution: 1366x768@59.79hz
           GLX Renderer: Mesa DRI Intel Bay Trail GLX Version: 3.0 Mesa 11.2.0
Audio:     Card-1 bytcr-rt5640 driver: bytcr-rt5640
           Card-2 USB Audio DAC driver: USB-Audio
           Card-3 Texas Instruments Audio Codec driver: USB Audio
           Sound: Advanced Linux Sound Architecture v: k4.8.10-040810-generic
Network:   Card: Failed to Detect Network Card!
Drives:    HDD Total Size: 15.5GB (Used Error!)
           ID-1: /dev/mmcblk1 model: N/A size: 31.0GB
           ID-2: USB /dev/sda model: USB_DISK_3.0 size: 15.5GB
Partition: ID-1: / size: 10G used: 5.3G (57%) fs: ext4 dev: /dev/mmcblk1p2
           ID-2: /home size: 17G used: 3.4G (21%) fs: ext4 dev: /dev/mmcblk1p4
           ID-3: swap-1 size: 1.00GB used: 0.00GB (0%) fs: swap dev: /dev/mmcblk1p3
RAID:      No RAID devices: /proc/mdstat, md_mod kernel module present
Sensors:   System Temperatures: cpu: 43.0C mobo: N/A
           Fan Speeds (in rpm): cpu: N/A
Info:      Processes: 190 Uptime: 7 min Memory: 287.5/1938.7MB
           Client: Shell (bash) inxi: 2.2.35 )

Now, the 4.8 kernel seems well stable (one week, 24/24), with the wifi well working... (always no sound, no HDMI nor Bluetooth).
(In reply to Gi_44 from comment #621)
> ...
> Now, the 4.8 kernel seems well stable (one week, 24/24), with the wifi well
> working... (always no sound, no HDMI nor Bluetooth).

can you post the link from where you downloaded wifi driver?? i am also running 4.8 kernel with no issues except lockups when downloading significant data through wifi
Hi Prashant Poonia The address is in the post Here -> : " And after rebooting, I installed this r8723bs module version: https://github.com/ferbar/rtl8723bs"
(In reply to Gi_44 from comment #623) > Hi Prashant Poonia > The address is in the post > Here -> : > " And after rebooting, I installed this r8723bs module version: > https://github.com/ferbar/rtl8723bs" ohh! its a realtek driver, my bad luck. Anyone else test and confirm 4.8
Created attachment 248541 [details] attachment-26085-0.html

Hi, I already tested it and reported that it freezes on my N3520. It may take longer, but it will. Please test it very thoroughly and for a long time. The processor has an erratum, so it does not make sense that it would be fixed unless the kernel is patched or you got a firmware update somehow. I really hope you can do some more thorough testing, and please try not to get excited too soon. It could help if you try the patch Jochen Hein posted, or the 4.8x mod I posted, or any of the precompiled kernels, and test power management. Even if you do not get a freeze on your setup, it may be a good extra reason to have the patch applied in mainline with some priority, since this still affects many users. Thank you
I'm running with this in /etc/modprobe.d/rtl8723be.conf: options rtl8723be fwlps=0 swlps=0 Otherwise Wifi is unstable for me
It is strange. I have the same Qotom Z3735F board, but only Jessie with 3.16 is stable, and only without any Bay Trail and Cherry Trail drivers. With the 4.7 and 4.8 kernels, freezes happen very quickly, and with the r8723bs drivers (ferbar or hadess) or with the linuxium OSs, the systems overload and crash insanely...
Created attachment 248751 [details] Debug patch to enable BYT C6 auto-demotion

Please test this patch and report if it has any effect on the stability issue.

You can verify that it is applied and running via dmesg:

dmesg | grep idle
intel_idle: BYT C6 auto-demotion-disable: 0

Under some conditions, it will reduce the amount of C6 residency, which you can observe with turbostat:

# turbostat --debug -o ts.out sleep 10
Created attachment 248841 [details] nanosleep.c

As mentioned above, idle-related failures become more rare when heavy load is added to the system. So a "stress test" for idle entry/exit does not add computation. Instead, it does almost no work except waking and going back to sleep.

Attached is a little program, nanosleep.c, that can be used as an idle "stress test". It has a random element, so running it for a longer duration will provoke a wider variety of timing. Also, my intent was that one copy be run for every logical CPU in the system, but you may find it useful running it other ways that I have not thought of.

nanosleep takes a single parameter, its target for highest wakes per second. By default, it uses a max of 500 wakes per second, which would be wakes at a rate up to (1 sec/500) = 2 ms. Or if you run 4 copies, that becomes 2000/sec, or 500 us. For reference, cpuidle's target residency for C6N-BYT is 275 usec. So even at that rate of wakeup, the system may still be able to enter C6.

For those with Baytrail systems that fail without intel_idle.max_cstate=1, it would be interesting if you can experiment with nanosleep, alone or in combination with glxgears or video playback or whatever, to see if you can provoke the failure sooner.

For observing what C-states are actually in use, please use turbostat, which is available in the upstream kernel tree under tools/power/x86/turbostat/ (yes, you should be able to use the latest version of turbostat with an older kernel, as long as the kernel supports the cpu msr driver).

Note that turbostat exposes the underlying C-state residency hardware counters. While the software counters in sysfs reflect what the kernel requested, the hardware residency counters reflect what states were actually achieved. For this reason, it is preferable to use turbostat instead of Wolfgang's script in comment #435.

eg.

# turbostat --debug -o ts.out sleep 10

forks the "sleep 10" command -- you can use any command -- and outputs the stats to the file ts.out. If you omit the command for turbostat to fork, it will run in interval mode until you kill it.
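The wake-rate arithmetic above can be redone for other rates and numbers of copies with a few lines of shell. This is only a sketch of the calculation, not part of nanosleep.c; the function name wake_interval_us is made up, and since the program has a random element the result is the shortest average interval, not an exact one.

```shell
#!/bin/sh
# Sketch only: average time between wakeups for N copies of nanosleep,
# compared with the 275 us target residency quoted above for C6N-BYT.

# wake_interval_us RATE COPIES -> microseconds between wakeups
wake_interval_us() {
    echo $(( 1000000 / ($1 * $2) ))
}

C6N_TARGET_US=275

for copies in 1 4; do
    us=$(wake_interval_us 500 "$copies")
    if [ "$us" -gt "$C6N_TARGET_US" ]; then
        verdict="longer than"
    else
        verdict="not longer than"
    fi
    echo "$copies x 500 wakes/s: one wake every $us us, $verdict the $C6N_TARGET_US us target"
done
```

With the default rate this gives 2000 us for one copy and 500 us for four copies, both above 275 us, which matches the remark that C6 may still be entered.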
(In reply to pilot_6572 from comment #627) > It it strange > > I have the same Qotom z3735f board but only jessie 3.16 is well stable with > and only without any bay and cherry drivers. > > With the 4.7 and 4.8 kernels, freezes append very quickly and with r8723bs > drivers (ferbar or hadess) or with the linuxium OSs, the systems overload > and crash insanely... do you have the focaltech touchpad drivers for 3.16 kernel? or 3.16 with drivers cooked
(In reply to Len Brown from comment #628) > Created attachment 248751 [details] > Debug patch to enable BYT C6 auto-demotion > > Please test this patch and report if if it has any > effect on the stability issue. > > You can verify that it is applied and running via dmesg: > > dmesg | grep idle > intel_idle: BYT C6 auto-demotion-disable: 0 > I stripped down my 4.8.15 setup on Asus T100CHI (Z3775). No ..cstate arg no c6offc7on script only tsc=reliable and let it idle on Mint Cinnamon 18.1 desktop with wifi, and bt enabled). It took less than 1/2 hour to freeze. Then I added auto-demotion-disable patch to the kernel. The CHI has been running over 9 hours. I'll leave it running (same conditions) a few days to see if freezes.
Turbostat is not working in debug for me:

turbostat: msr 0 offset 0x1aa read failed: Input/output error

I haven't seen freezes since September-October. Nanosleep didn't show anything new, 4.8.15 stable with Z3770. Four tasks with taskset on different cores. Glxgears, youtube in firefox over wifi (ath6kl), mpd. This all on battery with powersave governor.

Cmdline: root=UUID=... rootfstype=f2fs ro tsc=reliable clocksource=tsc pcie_aspm=force nmi_watchdog=0 rd.skipfsck fsck.mode=skip quiet splash

cpupower monitor -m Idle_Stats -i 10 -c sleep 300
sleep took 300,00265 seconds and exited with status 0
    |Idle_Stats
CPU | POLL | C1-B | C6N- | C6S- | C7-B | C7S-
   0|  0,00|  0,58|  1,84| 42,82| 26,51|  2,23
   1|  0,00|  1,09|  2,38| 49,07| 19,96|  0,54
   2|  0,00|  0,90|  2,15| 39,27| 28,15|  5,10
   3|  0,06|  0,46|  1,59| 37,10| 29,12|  6,59

Tested for 2 hours then got bored.

P.S. Gentoo, vanilla stable kernel with bfq, ath6kl and different small patches (shut sst debug output up and touch button scancode change).
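For reference, starting one pinned copy per core as described can be scripted roughly like this. It is a sketch under assumptions: ./nanosleep is an assumed path to a binary built from the attached nanosleep.c, the function name start_per_cpu and the DRY_RUN switch are made up, and by default it only prints the commands.

```shell
#!/bin/sh
# Sketch only: one copy of the idle stress program per logical CPU, each
# pinned to its own core with taskset.
# DRY_RUN=1 (the default here) prints the commands instead of running them.

start_per_cpu() {
    prog=$1
    rate=$2
    ncpu=$3
    cpu=0
    while [ "$cpu" -lt "$ncpu" ]; do
        if [ "${DRY_RUN:-1}" = 1 ]; then
            echo "taskset -c $cpu $prog $rate"
        else
            taskset -c "$cpu" "$prog" "$rate" &
        fi
        cpu=$((cpu + 1))
    done
}

# 500 is the default maximum wakes per second mentioned earlier.
start_per_cpu ./nanosleep 500 "$(nproc)"
```

Set DRY_RUN=0 to really start the copies in the background; they then have to be stopped by hand (for example with kill).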
During the test with the nanosleep program with Jessie and the 4.8 kernel loaded on the Qotom Z3735F motherboard (intel_idle.max_cstate=1 added), I ran "stellarium" and "cairo-dock", two programs that consume a lot of CPU time and video resources, and a freeze occurred immediately. Gradually, no program, kernel or system could be loaded any more. The BIOS has been altered. I rebuilt Jessie with only the original 3.16 kernel, after first wiping any GPT partition (sgdisk --zap-all). The BIOS is altered, but grub.efi is loadable through the Shell (fs0:). Running the same stellarium and cairo-dock programs again, with no grub modification and without nanosleep.c running, gave the same crashes, which now progressively affect other video players and browsers as well. I am now looking at the AMI afulnx_64 tool to flash the BIOS before reloading an OS.
(In reply to jbMacAZ from comment #631) > > I stripped down my 4.8.15 setup on Asus T100CHI (Z3775). No ..cstate arg no > c6offc7on script only tsc=reliable and let it idle on Mint Cinnamon 18.1 > desktop with wifi, and bt enabled). It took less than 1/2 hour to freeze. > Then I added auto-demotion-disable patch to the kernel. The CHI has .. .. run fine for 48 hours - the last 24 hours with 4 copies of nanosleep running. Next tests ran nanosleep on the unpatched 4.8.15. I had one freeze before I could get a second copy of nanosleep running. A second test froze in 38 minutes with 4 copies of nanosleep. Not sure nanosleep matters, but thanks for the patch.
Created attachment 249491 [details] turbostat-src.tar.gz Attached is a copy of the latest development version of turbostat. It has two additions from what is released in the upstream kernel tree: 1. New --show --hide parameters (90% implemented) 2. disable --debug access to MSR_MISC_PWR_MGMT (MSR 0x1aa) on BYT This tar file includes a binary you can run directly, or you can first "make clean; make" to build it from scratch. $ tar xzvf turbostat-src.tar.gz $ cd turbostat-src $ sudo ./turbostat --debug -o ts.out sleep 10 Both Wolfgang's script and cpupower are limited because they format the software counters in /sys/devices/system/cpu/cpu*/cpuidle/state*/* The software counters show what the kernel requested. turbostat instead shows the underlying hardware residency counters. The difference is important when the hardware has the ability to "demote" a software request into a more "shallow" state, and is particularly applicable when we are experimenting with a patch that enables/disables the ability of the hardware to do so.
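To make the software side of that distinction concrete, here is a minimal Python sketch that reads the per-state cpuidle counters from sysfs and turns them into a percentage per state. The function names and the `sysfs_root` parameter are made up for illustration; the `name`, `usage` and `time` files are the standard cpuidle sysfs entries. These numbers only reflect what the kernel requested, so they cannot reveal hardware demotion the way turbostat's residency counters can.

```python
import glob
import os


def read_cpuidle_counters(sysfs_root="/sys/devices/system/cpu"):
    """Return {cpu: {state_name: (usage_count, time_us)}} from the cpuidle sysfs files."""
    counters = {}
    pattern = os.path.join(sysfs_root, "cpu[0-9]*", "cpuidle", "state[0-9]*")
    for state_dir in sorted(glob.glob(pattern)):
        cpu = state_dir.split(os.sep)[-3]  # path ends in cpuN/cpuidle/stateM

        def read(field):
            with open(os.path.join(state_dir, field)) as f:
                return f.read().strip()

        # "usage" counts entries into the state, "time" is microseconds spent in it.
        counters.setdefault(cpu, {})[read("name")] = (int(read("usage")), int(read("time")))
    return counters


def requested_share(counters, cpu):
    """Percentage of the accumulated idle time that the kernel requested per state."""
    total = sum(time_us for _, time_us in counters[cpu].values())
    return {
        name: (100.0 * time_us / total if total else 0.0)
        for name, (_, time_us) in counters[cpu].items()
    }
```

Calling `requested_share(read_cpuidle_counters(), "cpu0")` on a live system gives the software view for one CPU; comparing it with turbostat's CPU%c6 under the same workload is one way to see whether requests are being demoted.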
Created attachment 249561 [details] turbostat --debug -o ts.out sleep 10 (In reply to Len Brown from comment #635) > > Attached is a copy of the latest development version of turbostat. Thank you! Tested. Some new output but also there is an error: turbostat: msr 0 offset 0x3fe read failed: Input/output error
Created attachment 249571 [details] turbostat-src.tar.gz thanks for testing the latest turbostat, this update should fix the issue seen in the last one.
Created attachment 249601 [details] tubostat --debug -o ts.out sleep 10 So my cpu spends 50% of time in C6 state. This is with 4 instances of nanosleep, glxgears and video playback with mpv. Without any activities cpu spends 94% of time in C6. I forgot to mention that I use 32bit gentoo with UEFI stub capable kernel.
(In reply to Sudhanshu from comment #537) > I have been suffering from the same issue, but on a Broadwell system (Dell ... > Summarising, > Are there any broadwell co-sufferers here? > Am I safe to assume this is the same bug as mine? I am also on a Broadwell system and I suffer from the same occasional freezes. I haven't yet tried changing the max cstate setting though.
Hi, I patched a 4.8.11 kernel with the auto demotion patch: dmesg shows: [ 1.244957] intel_idle: BYT C6 auto-demotion-disable: 0 In my usual test setup, it freezes after about 15 minutes. Since I still see quite a variation in time before that happens, I can't really tell if it made much difference (N3520). The patch from Jochen Hein still works fine and does not freeze at all after the usual test. Do we have an issue with the C6 state that is not in the errata? For others on Ubuntu, you can find the auto demotion enabled deb kernels here: https://www.dropbox.com/sh/c39et4hr6tgp60q/AAC35c56aOEOwkmhjdvtG6dsa?dl=0 @Len Brown: is there anything we can do to pin down the cause as much as possible? I would really like to see a kernel patch fixing this and I am looking for the best way forward to help achieve that.
@ Sudhanshu @ oemer+kernel@o9z.de No, this bug report is specific to Baytrail. Here is the complete list of Baytrail processors: http://ark.intel.com/products/codename/55844/Bay-Trail?q=bay%20trail#@All If you have a problem with Broadwell, then file a new bug report -- because this one will be closed when "intel_idle.max_cstate=1" is no longer required on baytrail.
@ Vincent Gerris It seems likely there are multiple baytrail c6 issues, and only if we are very lucky will they turn out to have a common root cause. When this bug was opened, I assumed that this had nothing to do with cpuidle and that Adrian's i2c patches would handle this. That didn't happen. I think there are multiple failures here, and i2c and i915 changes clearly affect some failures, and so they are both high on the list of suspects. Also worth checking out is the cpuidle auto-demotion-disable=0 patch that you just tested. The problem with that patch is if it works, we don't know if it is because we are taking a better route through the pcode, or if it is just hiding an i2c or i915 bug because the system is in c6 less... So the interesting comparison with the auto-demotion-disable=0 patch is:
1. Does it change stability? For jbMacAZ it seems it may help, but for you it seems it may make no difference. There are a lot of submitters here, and I'd like to see more testing.
2. Does it make a measurable difference in C6 residency under the same workload (ie. turbostat output with vs without the patch should show this).
Vincent, Since the cpuidle patch doesn't make any difference on your system, I would say that testing the i2c patches, (or maybe even blacklisting dw i2c if doing so doesn't hose your system) and perturbing how i915 works to see if any changes affect your failure are the best areas to look. Also, any efforts to discover how to best cause the failure to happen as soon as possible would be extremely valuable. Eg. experimenting to see if you can provoke the failure sooner by running nanosleep in a certain way with certain parameters might turn out to be extremely valuable. If we can reliably reproduce a failure in under 60 seconds, then we know when it is gone. If it takes a week or so to reproduce a failure, then we'll never know when we are done.
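For the second comparison (C6 residency with vs without the patch), a small helper can pull the system-wide figure out of two saved turbostat logs. This is only a sketch: it assumes the table layout seen in the turbostat output posted in this thread (a header row containing CPU%c6, followed by a summary row whose first two fields are "-"), and the function names are hypothetical.

```python
def summary_value(turbostat_text, column="CPU%c6"):
    """Return the system-wide value of `column` from the first table in the text."""
    header = None
    for line in turbostat_text.splitlines():
        fields = line.split()
        if header is None:
            # Skip the MSR dump until the table header shows up.
            if column in fields:
                header = fields
            continue
        # The summary row is the one whose Core and CPU fields are both "-".
        if fields[:2] == ["-", "-"]:
            return float(fields[header.index(column)])
    raise ValueError("no %s summary row found" % column)


def residency_change(without_patch_text, with_patch_text, column="CPU%c6"):
    """Percentage-point change in residency when going from unpatched to patched."""
    return summary_value(with_patch_text, column) - summary_value(without_patch_text, column)
```

Both logs should come from the same machine and the same workload (for example two `ts.out` files from `turbostat --debug -o ts.out sleep 10`), otherwise the difference says little about the patch.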
I'm running right now:
Linux detrius 4.9.0-040900-generic #201612111631 SMP Sun Dec 11 21:33:00 UTC
which is the ubuntu mainline ppa. Until yesterday that kernel seemed stable, but yesterday I had a hang as well. turbostat output:
turbostat version 4.17 1 Jan 2017 - Len Brown <lenb@kernel.org>
CPUID(0): GenuineIntel 11 CPUID levels; family:model:stepping 0x6:37:8 (6:55:8)
CPUID(1): SSE3 MONITOR - EIST TM2 TSC MSR ACPI-TM TM
CPUID(6): APERF, DTS, No-PTM, No-HWP, No-HWPnotify, No-HWPwindow, No-HWPepp, No-HWPpkg, EPB
cpu1: MSR_IA32_MISC_ENABLE: 0x00850089 (TCC EIST MONITOR)
CPUID(7): No-SGX
SLM BCLK: 83.3 Mhz
RAPL: 4581 sec. Joule Counter Range, at 30 Watts
cpu1: MSR_PLATFORM_INFO: 0x60000001600
6 * 83 = 500 MHz max efficiency frequency
22 * 83 = 1833 MHz base frequency
cpu1: MSR_IA32_POWER_CTL: 0x00000000 (C1E auto-promotion: DISabled)
cpu1: MSR_TURBO_RATIO_LIMIT: 0x00000000
cpu1: MSR_NHM_SNB_PKG_CST_CFG_CTL: 0x0016000f (UNlocked: pkg-cstate-limit=15: unknown)
cpu0: MSR_IA32_ENERGY_PERF_BIAS: 0x00000006 (balanced)
cpu0: MSR_RAPL_POWER_UNIT: 0x00000505 (0.031250 Watts, 0.000032 Joules, 0.000977 sec.)
cpu0: MSR_PKG_POWER_LIMIT: 0x003880fa (UNlocked)
cpu0: PKG Limit #1: ENabled (7.812500 Watts, 262144.000000 sec, clamp DISabled)
cpu0: PKG Limit #2: DISabled (0.000000 Watts, 0.000977* sec, clamp DISabled)
cpu0: MSR_PP0_POWER_LIMIT: 0x00000000 (UNlocked)
cpu0: Cores Limit: DISabled (0.000000 Watts, 0.000977 sec, clamp DISabled)
cpu0: MSR_IA32_TEMPERATURE_TARGET: 0x00690000 (105 C)
cpu0: MSR_IA32_THERM_STATUS: 0x883c0100 (45 C +/- 1)
cpu1: MSR_IA32_THERM_STATUS: 0x883c0100 (45 C +/- 1)
cpu2: MSR_IA32_THERM_STATUS: 0x883a0100 (47 C +/- 1)
cpu3: MSR_IA32_THERM_STATUS: 0x883a0100 (47 C +/- 1)
10.002627 sec
Core CPU Avg_MHz Busy% Bzy_MHz TSC_MHz IRQ SMI CPU%c1 CPU%c6 CoreTmp GFX%rc6 GFXMHz PkgWatt CorWatt
-    -   546     28.89 1889    1834    0   0   2.41   68.71  45      0.00    0      0.77    0.58
0    0   74      9.23  804     1835    0   0   2.76   88.00  43      0.00    0      0.77    0.58
1    1   388     23.96 1618    1835    0   0   1.58   74.47  44
2    2   1653    77.27 2137    1835    0   0   0.26   22.48  45
3    3   69      5.07  1368    1833    0   0   5.03   89.90  45
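For anyone wondering how the figures in parentheses relate to the raw register values in a dump like this one, the following Python sketch redoes the arithmetic for three of the registers. The bit positions are the architectural field layouts (TjMax in bits 23:16 of the temperature target, the digital readout in bits 22:16 of the thermal status, the ratios in bits 47:40 and 15:8 of the platform info register); the function names are invented here, and the 83.3 MHz bus clock is taken from the "BCLK: 83.3 Mhz" line of the output rather than measured.

```python
BCLK_MHZ = 83.3  # as printed by turbostat for this Silvermont part


def tjmax_celsius(temperature_target):
    """TjMax lives in bits 23:16 of MSR_IA32_TEMPERATURE_TARGET."""
    return (temperature_target >> 16) & 0xFF


def core_temp_celsius(therm_status, tjmax):
    """Bits 22:16 of MSR_IA32_THERM_STATUS give the distance below TjMax in degrees C."""
    return tjmax - ((therm_status >> 16) & 0x7F)


def platform_ratios(platform_info):
    """Return (max efficiency ratio, base ratio) from MSR_PLATFORM_INFO."""
    return (platform_info >> 40) & 0xFF, (platform_info >> 8) & 0xFF


def ratio_to_mhz(ratio, bclk_mhz=BCLK_MHZ):
    """Frequency is simply ratio times bus clock, rounded to whole MHz."""
    return round(ratio * bclk_mhz)
```

With the values above, `tjmax_celsius(0x00690000)` is 105, `core_temp_celsius(0x883c0100, 105)` is 45, and `platform_ratios(0x60000001600)` is `(6, 22)`, which `ratio_to_mhz` turns into 500 and 1833 MHz, matching the output.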
Acer Switch 10 with Intel Atom Z3735F - with Len Brown's one-liner patch on vanilla 4.8.15 with ubuntu-ppa config: up 4 days, 11:24. No freeze. Workload was glxgears and vlc. Without this patch, the same kernel with the same workload froze in 12 minutes. Could someone with a different baytrail CPU please test this patch, so we can move forward? Another workaround that makes at least my Z3735F rock-stable was described in comment 378 (but only with kernel 4.7 and up, so it could be just luck)
Since I upgraded to Ubuntu 16.04 last October, my Acer Aspire V11 Touch with Intel Celeron quad core N2940 + Intel Bay Trail has had the same problem. The kernel parameter intel_idle.max_cstate=1 does prevent the crashes. How will I know when a fix has landed in the kernel?
Setting the max C-state to 1 does not fix the problem. It is just a temporary measure to make sure that freezes don't occur when the stakes are high. For example, if you are working on some important project and there is an unexpected freeze, then all of the unsaved data that you are working on will be lost. Additionally, if you are working on battery, then setting the max C-state to 1 will invariably force your processor to consume a lot more power. In other words, your laptop's battery life will drastically improve once this bug is fixed. You will know it is fixed when the "status" at the top of this page is marked as "VERIFIED" or "RESOLVED". It is currently marked as "NEEDINFO". If you would like, you can contribute, and your contribution can speed things up. You do not need to be an expert programmer. You just need to know how to apply the patches and update your kernel, and you may also need to run a few commands in the terminal and post the output here. These are relatively simple steps; if you do not know how to do them, a quick Google search will help. NOTE : MAINTAIN CAUTION WHILE TESTING :)
Patched kernel 4.5.4 using Len's auto demotion patch and had a freeze after a day. CPU: J1900.
root@pandora:/usr/src/turbostat-src# ./turbostat -d
turbostat version 4.17 1 Jan 2017 - Len Brown <lenb@kernel.org>
CPUID(0): GenuineIntel 11 CPUID levels; family:model:stepping 0x6:37:3 (6:55:3)
CPUID(1): SSE3 MONITOR - EIST TM2 TSC MSR ACPI-TM TM
CPUID(6): APERF, DTS, No-PTM, No-HWP, No-HWPnotify, No-HWPwindow, No-HWPepp, No-HWPpkg, EPB
cpu2: MSR_IA32_MISC_ENABLE: 0x00850089 (TCC EIST MONITOR)
CPUID(7): No-SGX
SLM BCLK: 83.3 Mhz
RAPL: 4581 sec. Joule Counter Range, at 30 Watts
cpu2: MSR_PLATFORM_INFO: 0x100000001800
16 * 83 = 1333 MHz max efficiency frequency
24 * 83 = 1999 MHz base frequency
cpu2: MSR_IA32_POWER_CTL: 0x00000000 (C1E auto-promotion: DISabled)
cpu2: MSR_TURBO_RATIO_LIMIT: 0x00000000
cpu2: MSR_NHM_SNB_PKG_CST_CFG_CTL: 0x0018000f (UNlocked: pkg-cstate-limit=15: unknown)
cpu0: MSR_IA32_ENERGY_PERF_BIAS: 0x00000006 (balanced)
cpu0: MSR_RAPL_POWER_UNIT: 0x00000505 (0.031250 Watts, 0.000032 Joules, 0.000977 sec.)
cpu0: MSR_PKG_POWER_LIMIT: 0x003880fa (UNlocked)
cpu0: PKG Limit #1: ENabled (7.812500 Watts, 262144.000000 sec, clamp DISabled)
cpu0: PKG Limit #2: DISabled (0.000000 Watts, 0.000977* sec, clamp DISabled)
cpu0: MSR_PP0_POWER_LIMIT: 0x00000000 (UNlocked)
cpu0: Cores Limit: DISabled (0.000000 Watts, 0.000977 sec, clamp DISabled)
cpu0: MSR_IA32_TEMPERATURE_TARGET: 0x00690000 (105 C)
cpu0: MSR_IA32_THERM_STATUS: 0x88360000 (51 C +/- 1)
cpu1: MSR_IA32_THERM_STATUS: 0x88360000 (51 C +/- 1)
cpu2: MSR_IA32_THERM_STATUS: 0x88320000 (55 C +/- 1)
cpu3: MSR_IA32_THERM_STATUS: 0x88320000 (55 C +/- 1)
Core CPU Avg_MHz Busy% Bzy_MHz TSC_MHz IRQ   SMI CPU%c1 CPU%c6 CoreTmp GFX%rc6 PkgWatt CorWatt
-    -   425     22.23 1911    2000    9310  0   19.02  58.75  58      **.**   1.97    0.65
0    0   316     19.56 1613    2000    5564  0   36.85  43.59  53      **.**   1.97    0.65
1    1   288     17.62 1633    2000    2113  0   20.16  62.21  53
2    2   381     18.92 2015    2000    1010  0   12.29  68.79  58
3    3   715     32.80 2179    2000    623   0   6.80   60.41  58
Core CPU Avg_MHz Busy% Bzy_MHz TSC_MHz IRQ   SMI CPU%c1 CPU%c6 CoreTmp GFX%rc6 PkgWatt CorWatt
-    -   451     24.48 1840    2000    11641 0   40.57  34.95  56      34.17   3.75    0.70
0    0   297     18.22 1628    2000    7549  0   81.78  0.00   53      34.17   3.75    0.70
1    1   807     38.64 2088    2000    2083  0   38.13  23.23  53
2    2   356     22.37 1589    2000    998   0   23.38  54.25  56
3    3   343     18.71 1834    2000    1011  0   18.97  62.32  56
Thank you Len, for picking this up and streamlining the hunt for the cause :)! Thanks everyone for the follow-ups :). I am going to try to be as scientific as I can on the matter. I made 6 attempts to freeze the system: 3 on unpatched 4.8.0 and 3 on 4.8.11 patched with auto demotion enabled. The reason for the 4.8.0/4.8.11 difference is that the automated scripts I use pull a tarball and I did not find out how to change that. Until anyone finds a way to a quick freeze, this is what I use to freeze my Lenovo Yoga 2 11 with N3520 processor (assuming that the minor version difference has no big influence, though it might):
- pick an mkv file from a samba share and copy it to the local video folder
- set up bluetooth audio, with the high fidelity playback profile, to an external speaker (jambox mini)
- play the same file that is being copied from the network with the Ubuntu default video player (totem)
This way I can usually get 4.8.0 to freeze within 1 to 30 minutes. One issue is that bluetooth is utter crap: stuttering, connection loss, a wrong profile or being unable to set it all influence the test. Bluetooth also seems to block video playback; maybe that is buffering related. Anyway, with the above, I was unable to freeze 4.8.11 with demotion this time. With unpatched 4.8.0, one time it played for about 30 minutes and one time it froze instantly. Further info:
- not having the laptop plugged in may freeze 4.8.0 faster, but I am not sure
- I tried nanosleep and ran up to 5 copies of the program, but it didn't seem to make a difference to how quickly it froze. It was running while playing video too.
A sample output of turbostat -d on 4.8.11 with auto demotion, 5 copies of nanosleep running, playing video with audio over bluetooth:
ubuntu@ubuntu-Lenovo-Yoga-2-11:~/Downloads/turbostat-src$ sudo ./turbostat -d
turbostat version 4.17 1 Jan 2017 - Len Brown <lenb@kernel.org>
CPUID(0): GenuineIntel 11 CPUID levels; family:model:stepping 0x6:37:3 (6:55:3)
CPUID(1): SSE3 MONITOR - EIST TM2 TSC MSR ACPI-TM TM
CPUID(6): APERF, DTS, No-PTM, No-HWP, No-HWPnotify, No-HWPwindow, No-HWPepp, No-HWPpkg, EPB
cpu1: MSR_IA32_MISC_ENABLE: 0x00850089 (TCC EIST MONITOR)
CPUID(7): No-SGX
SLM BCLK: 83.3 Mhz
RAPL: 4581 sec. Joule Counter Range, at 30 Watts
cpu1: MSR_PLATFORM_INFO: 0x60000001a00
6 * 83 = 500 MHz max efficiency frequency
26 * 83 = 2166 MHz base frequency
cpu1: MSR_IA32_POWER_CTL: 0x00000000 (C1E auto-promotion: DISabled)
cpu1: MSR_TURBO_RATIO_LIMIT: 0x00000000
cpu1: MSR_NHM_SNB_PKG_CST_CFG_CTL: 0x001a000e (UNlocked: pkg-cstate-limit=14: unknown)
cpu0: MSR_IA32_ENERGY_PERF_BIAS: 0x00000006 (balanced)
cpu0: MSR_RAPL_POWER_UNIT: 0x00000505 (0.031250 Watts, 0.000032 Joules, 0.000977 sec.)
cpu0: MSR_PKG_POWER_LIMIT: 0x000280fb (UNlocked)
cpu0: PKG Limit #1: ENabled (7.843750 Watts, 0.001953 sec, clamp DISabled)
cpu0: PKG Limit #2: DISabled (0.000000 Watts, 0.000977* sec, clamp DISabled)
cpu0: MSR_PP0_POWER_LIMIT: 0x00000000 (UNlocked)
cpu0: Cores Limit: DISabled (0.000000 Watts, 0.000977 sec, clamp DISabled)
cpu0: MSR_IA32_TEMPERATURE_TARGET: 0x00690000 (105 C)
cpu0: MSR_IA32_THERM_STATUS: 0x883b0100 (46 C +/- 1)
cpu1: MSR_IA32_THERM_STATUS: 0x883b0100 (46 C +/- 1)
cpu2: MSR_IA32_THERM_STATUS: 0x883a0100 (47 C +/- 1)
cpu3: MSR_IA32_THERM_STATUS: 0x883a0100 (47 C +/- 1)
turbostat: msr 0 offset 0x3fe read failed: Input/output error
Based on the above I dare to conclude that:
- enabling auto demotion makes the system more stable than without it
- it seems like a good idea to have the patch go mainline; a few other reports have said the same
- there still seems to be an issue that needs further investigation.
The challenges seem to be:
- feedback loop: let's all keep looking for a fast freeze on either kernel
- detailed reports (hard with the above challenge)
- let's try to be version specific and share test methods or scripts (I for example can dedicate this hardware to it, but I have limited time)
I am very much motivated to find the cause and I hope everyone has read Len's previous requests about using nanosleep and turbostat, and that this is ONLY about Bay Trail. Let us please try to keep it confined to that and nail this bug :). Thanks everyone for the help and collaboration!
@Vincent Gerris Thanks for the test report. It is extremely helpful when reports such as yours, include the processor and system model #. > turbostat: msr 0 offset 0x3fe read failed: Input/output error Hopefully this means you are running the turbostat in comment #635 and that error goes away when you run the update in comment #637 > I tried nanosleep and ran up to 5 times the program Note that adding more load may result in less idle c6, and thus make the failure more rare. That is to say, 3 copies may be more effective than 5... turbostat (the working version:-) will show the % of c6 residency, and if that goes down, the system may be too busy to be exercising c6 enough to provoke the failure. Aside from number of copies of nanosleep, its default parameter is 500 wakes per second. I don't know if making that higher or lower will cause the failure sooner, and if somebody has a system that fails quickly, that would be a great thing to discover and share.
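For readers who do not have the nanosleep program from comment #629 at hand, the idea of a fixed wake rate can be illustrated with a few lines of Python. This is explicitly not that program, only a hypothetical stand-in: it sleeps for 1/N seconds in a loop so that the CPU keeps leaving and re-entering idle roughly N times per second. Interpreter overhead means the real rate will be somewhat lower than requested, and whether it provokes the freeze the same way is untested.

```python
import time


def wake_loop(wakes_per_sec=500, duration_sec=10.0):
    """Sleep in short intervals for duration_sec and return the number of wakeups."""
    interval = 1.0 / wakes_per_sec
    deadline = time.monotonic() + duration_sec
    wakes = 0
    while time.monotonic() < deadline:
        time.sleep(interval)  # each return from sleep is one timer wakeup
        wakes += 1
    return wakes


if __name__ == "__main__":
    # 500 wakes per second mirrors the default mentioned above.
    print(wake_loop(500, 10.0))
```

Running several copies with different `wakes_per_sec` values while watching the C6 residency in turbostat is one way to explore the rate question raised above.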
I don't know if this can be related, but at work we have Acer TM (TravelMate) B117M N3150 processor (Braswell processor, but the kernel code sees it as a Cherry Trail processor) Lubuntu 14.04 LTS + LTSEnablementStack (https://wiki.ubuntu.com/Kernel/LTSEnablementStack) About 110 units of this model. Some of them freeze during use and after suspend. On 2016-11-24 we migrated some machines to the latest stable kernel (4.8.10). The computers are more stable (https://bugs.launchpad.net/ubuntu/+source/linux/+bug/1575467/comments/142) but... they continue to freeze after suspend. Modifying /etc/default/grub from GRUB_CMDLINE_LINUX_DEFAULT="quiet splash acpi_backlight=vendor" to GRUB_CMDLINE_LINUX_DEFAULT="quiet splash acpi_backlight=vendor acpi_osi='!Windows 2013' acpi_osi='!Windows 2012'" prevents freezing after suspend. I mention this because I suppose that suspending the computer could be an excellent situation for entering C6 & C7 states. Just closing & opening the lid with any application open (Firefox or VLC) caused the freeze for us.
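Since several workarounds in this thread come down to adding parameters to the GRUB_CMDLINE_LINUX_DEFAULT line, here is a small Python sketch of that edit as a pure string operation. The function name is made up, it only handles the double-quoted form shown above, and it uses a plain substring check to avoid adding a parameter twice. On Ubuntu and Debian the change takes effect only after `update-grub` has been run and the machine rebooted.

```python
def add_kernel_params(grub_line, *params):
    """Append params inside the double-quoted value of a GRUB_CMDLINE_* line."""
    key, sep, value = grub_line.rstrip("\n").partition("=")
    if not sep or len(value) < 2 or not (value.startswith('"') and value.endswith('"')):
        raise ValueError('expected KEY="..." but got: %r' % grub_line)
    current = value[1:-1]
    # Substring check only: good enough to skip exact repeats, not a real parser.
    new = [p for p in params if p not in current]
    return '%s="%s"' % (key, " ".join([current] + new).strip())


if __name__ == "__main__":
    line = 'GRUB_CMDLINE_LINUX_DEFAULT="quiet splash acpi_backlight=vendor"'
    print(add_kernel_params(line, "acpi_osi='!Windows 2013'", "acpi_osi='!Windows 2012'"))
```

The same helper works for the Baytrail workaround, for example `add_kernel_params(line, "intel_idle.max_cstate=1")`.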
(In reply to Len Brown from comment #649) > @Vincent Gerris > > > turbostat: msr 0 offset 0x3fe read failed: Input/output error > > Hopefully this means you are running the turbostat in comment #635 > and that error goes away when you run the update in comment #637 Hello Mr. Brown, The turbostat binary which is included in the tar file won't work right out of the box. I had to run "make clean; make" in order to avoid this error.... --- turbostat: msr 0 offset 0x3fe read failed: Input/output error --- Here's the output of turbostat, right after running "make clean; make".... --- http://pastebin.com/raw/GTXDbZRz --- Here's the output of 'dmesg | grep idle', right after installing the auto-demotion-enabled kernel.... --- http://pastebin.com/raw/JacrPBG9 --- Here's the current info about my system.... --- http://pastebin.com/raw/RtMeVBSG --- One more thing about the turbostat output: after my PC freezes, should I power it down, switch it back on AND THEN post the output of turbostat? Or should I post the output of turbostat right now?
@ Josep Pujadas-Jubany Please file a new bug for Cherry-Trail/Braswell suspend issues. This bug report is specific to freezes on the previous generation, Valleyview/Baytrail, that go away with cmdline option intel_idle.max_cstate=1. Bay Trail processor list: http://ark.intel.com/products/codename/55844/Bay-Trail?q=bay%20trail#@All
@ A Uday K Yes, your N3530 is a Baytrail, yes, the auto-demotion patch is installed. Several things we are trying to discover: 1. does the auto-demotion patch in comment #628 help? running the same workload, does using this patch change time to hang? It seems that it helps dramatically on some, and not at all on others. 2. do you see a different amount of %c6 when running the autodemotion patch vs. not running that patch? (this is what turbostat can tell us) 3. can you help discover how to make the problem occur sooner? nanosleep in comment #629 is a tool that is available to help. My guess is that 4 copies should run on a 4-processor system, and that they should use the default parameter of 500 wakes/sec. But if you can make the problem happen by changing from 500, or changing the number of copies, that is a valuable discovery. Here again, turbostat is available to help track if you are making the system too busy to get into c6.
Created attachment 250691 [details] Turbostat for Asus T100CHI auto-demotion is also helpful with 4.10-rc2 on Manjaro Cinnamon on ASUS T100CHI (Z3775) ~80 hours w/o freeze. Will resume testing with 4.8.16 (Mint)
(In reply to Len Brown from comment #652) > @ Josep Pujadas-Jubany > > Please file a new bug for Cherry-Trail/Braswell suspend issues. > My comment came from https://bugs.launchpad.net/ubuntu/+source/linux/+bug/1566302/comments/150 https://bugs.launchpad.net/ubuntu/+source/linux/+bug/1566302 And the status for this Linux(Ubuntu) bug is [Won't fix]. I left Windows when Vista appeared. I have been very happy (at home and at work) using Linux (open source, no viruses, speed, stability, ...) But in the last months (applying latest kernels for latest hardwares) it seems we lost stability. It's a pity. I would like to help more about this but I'm just an intermediate/advanced user. I'm not capable to do more. I'm sorry!
(In reply to Len Brown from comment #653) > a different amount of %c6 when running the autodemotion patch > vs. not running that patch? (this is what turbostat can tell us) Thanks for the clarification :) > can you help discover how to make the problem occur sooner? > nanosleep in comment #629 is a tool that is available to help. > My guess is that 4 copies should run on a 4-processor system, > and that they should use the default parameter of 500 wakes/sec. > But if you can make the problem happen by changing from 500, > or changing the number of copies, that is a valuable discovery. > Here again, turbostat is available to help track if you are > making the system too busy to get into c6. I'll continue to experiment and will keep you updated.
turbostat error: "msr 0 offset 0x3fe read failed: Input/output error" on Asus T100CHI (Z3775 - SILVERMONT1.) FWIW, turbostat runs w/o errors on my skylake desktop. FYI, 4.8.16 with auto-demotion-disable seems stable for me so far. Is there any value in testing an older kernel such as 4.2.x, which I found more unstable on my system?
@jbMacAZ For the turbostat error, try "make clean; make" of the latest attachment. Apparently I sent the latest source but failed to re-build the binary. FWIW, I expect to upload an updated turbostat this coming week with some baytrail specific updates. Re: value in testing older, more unstable, kernels. My personal bias is to always run the latest upstream kernel, or at least the latest kernel.org -stable. That kernel is what all other kernels follow, eventually. However, many users are on binary distro kernels, and so it is useful to know where those are too. The root cause of this particular failure has been elusive. It seems there are multiple ways of making the root cause occur more/less frequently. There may even be multiple independent root causes. If we can use an old kernel to isolate the difference between bad/good to help find the effect of a certain patch, that is useful. But with possible multiple causes, the benefit of a patch on an unstable kernel could be lost in the noise.
(In reply to Len Brown from comment #658) > @jbMacAZ > > For the turbostat error, try "make clean; make" of the latest attachment. > Apparently I sent the latest source but failed to re-build the binary. > > FWIW, I expect to upload an updated turbostat this coming week > with some baytrail specific updates. > I played with the source, but only succeeded in changing which offset provokes the error! Looking forward to the baytrail turbostat update. In the mean time, I'll stick to testing recent non EOL kernels.
Joining to report this bug on ASUS E502MA - Intel(R) Pentium(R) CPU N3540 @ 2.16GHz, using any linux distro - Ubuntu, Mint, Manjaro, etc... Fixed this by https://wiki.archlinux.org/index.php/Intel_graphics#X_freeze.2Fcrash_with_intel_driver - last option, which has a link to this site.
Created attachment 251091 [details] latest turbostat utility for baytrail
Here is my latest turbostat utility, updated for Baytrail. This is a development version, not yet released to the Linux kernel source tree.
$ tar xzvf turbostat-src.tar.gz
$ cd turbostat-src
$ sudo ./turbostat --debug -o ts.out sleep 10
If you are not comfortable running a utility you download from the internet as root, first build it from source:
$ make clean
$ make
and optionally
$ sudo make install
Updates:
1. uses the Baytrail C1 hardware residency counter instead of software
2. shows the Baytrail module c6 hardware residency counter. Yes, this is the same on pairs of cores (that is what a module is)
3. shows Package C6
4. does not access/show un-supported counters c3, pc3, c7, pc7
Here are the states/counters that are enabled for the interesting values of intel_idle.max_cstate:
intel_idle.max_cstate=1 C1
intel_idle.max_cstate=2 C1, Mod-C6
intel_idle.max_cstate=3 C1, Mod-C6, Pkg-C6
This release replaces the turbostat versions attached to comment #635 and comment #637.
I do run kernel 4.9.0 now for two weeks without any freeze. - shuttle xs35v4, j1900 - 4.9.0-sparky-amd64 #1 SMP Tue Dec 20 12:43:44 CET 2016 x86_64 GNU/Linux @len brown: does it still help you if I run turbostat?
Created attachment 251101 [details] Test script to freeze your baytrail quickly I have done some testing on two Baytrail systems: Dell Inspiron 3451 laptop (Atom N3540) Acer Aspire AXC desktop (Atom J1900) Currently using Ubuntu 16.10 vmlinuz-4.8.0-32-generic with no cmdline parameters. Using the attached script, each box freezes in under 30 minutes, and often much sooner. I've seen a freeze in under 60 seconds. The current script runs 8 copies of "nanosleep 1000" from comment #629 plus one copy of glxgears -fullscreen. It also displays information about your system that I'd like to see when you report a failure. I ssh into the test system, and invoke a 1-line shell script that does this: ./byt.test | tee out.`date +%Y%m%d_%H%M%S` so when the system hangs, there is a record both in the ssh window, and also in the out.* file. Yes, attaching your out.* file to this bug report is appropriate -- though the turbostat output gets redundant after a while -- so copy/paste of the top of the file also works... You can simply show the last timestamp, or say how long it took to freeze. Adding more copies of glxgears did not seem to make the failure occur sooner. When I ran without glxgears, the failure stretched out to 23 hours on the acer, and the dell was still running at 24 hours. So 1 copy of glxgears seems to be the ticket. intel_idle.max_cstate=2 still fails, my one attempt took 49 minutes.
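The attached byt.test script itself is not reproduced in this thread, so the following Python sketch only mirrors the workload as described in the comment: 8 copies of "nanosleep 1000", one fullscreen glxgears, and a timestamped log name in the style of out.`date +%Y%m%d_%H%M%S`. The `./nanosleep` path and all function names are assumptions made for illustration, and the system information that the real script prints is not covered.

```python
import datetime
import subprocess


def build_commands(nanosleep_copies=8, wakes_per_sec=1000):
    """Return the argv lists for the described workload (binary paths are assumed)."""
    cmds = [["./nanosleep", str(wakes_per_sec)] for _ in range(nanosleep_copies)]
    cmds.append(["glxgears", "-fullscreen"])
    return cmds


def log_name(now=None):
    """Build a log file name like the shell's out.`date +%Y%m%d_%H%M%S`."""
    now = now or datetime.datetime.now()
    return now.strftime("out.%Y%m%d_%H%M%S")


def launch(cmds):
    """Start every command and return the Popen handles; the caller decides when to stop."""
    return [subprocess.Popen(argv) for argv in cmds]
```

Keeping the command list separate from the launch step makes it easy to vary the number of copies or the wake rate between runs and to note the exact combination in the report.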
@ amjafuso please try the script in comment #663 to see if you can get 4.9 to fail. I've not tested 4.9 yet. You've also reported success with intel_idle.max_cstate=2. If you get 4.9 to fail with no cmdline, please re-test with intel_idle.max_cstate=2 to see if that survives. My experience is that they will both fail, and that the cmdline will simply take a bit longer than the default. I also acquired an Asus T100 TAS and Asus T100 CHI. My next step is to wrestle 64-bit Ubuntu onto their 32-bit BIOS in a dual-boot config, and broaden the testing to those boxes, before I start changing the kernel.
Ok, script started 2 hours ago, no freezes. No freeze with kernel 4.9. Boot parameter: # cat /proc/cmdline BOOT_IMAGE=/boot/vmlinuz-4.9.0-sparky-amd64 root=UUID=612f6bbb-b095-4da7-b823-0658edce9dfc ro quiet splash I didn't patch the kernel (don't know how to do that, sorry).
Hello, Has anybody tried 4.9.2 on Ubuntu 16.04 from the following site? http://kernel.ubuntu.com/~kernel-ppa/mainline/v4.9.2/
Created attachment 251131 [details] T100CHI turbostat kernel 4.9 patched Turbostat is working now on CHI. Thank you. Kernel 4.9.2 with auto-demotion-disable running for ~12 hours so far (just added 4 nanosleeps 1000). Without nanosleep(x4) CPU%c6 was ~90%. 4.9.2 without auto-demotion-disable patch ran about an hour before freezing at idle. (Z3775) re: T100 test beds The T100CHI is a little tedious to get going in linux. CHI's OEM Bluetooth keyboard is offline at boot time (and unpaired at install time). It's easier to use a powered hub, USB keyboard/mouse and wifi dongle for linux install. If you add bootia32.efi to the installer USB /EFI/boot/ and edit the installer /boot/grub/grub.cfg to add intel_idle.max_cstate=1, you can boot most debian derivative installers. Press <ESC> during power up to get to the boot menu. Some distros need grub-efi-ia32 & grub-efi-ia32-bin to be installed manually. Wifi needs brcmfmac43241b4-sdio.txt and bluetooth needs BCM4324B3.hcd and works better with blueman device manager...
Thanks for all the updates. I've been now trying to freeze my N3540 laptop with nanosleep and different combinations of other tools, with varying success. Managed once to freeze idle system running 4xnanosleep 250 processes in couple of hours, but then same test again yielded 36+hrs of uptime. I can confirm that adding glxgears surely helps, 4xnanosleep 250 + glxgears -fullscreen I've gotten now 4 freezes in a row with times to freeze being roughly: 90 mins, 50mins, 15mins and 8,5 hours. Also now when trying to update this report I got freeze in less than 10 mins from reboot, no nanosleep running... Attached is a turbostat output from few seconds before one of the freezes happened.
turbostat version 4.17 1 Jan 2017 - Len Brown <lenb@kernel.org>
CPUID(0): GenuineIntel 11 CPUID levels; family:model:stepping 0x6:37:8 (6:55:8)
CPUID(1): SSE3 MONITOR - EIST TM2 TSC MSR ACPI-TM TM
CPUID(6): APERF, DTS, No-PTM, No-HWP, No-HWPnotify, No-HWPwindow, No-HWPepp, No-HWPpkg, EPB
cpu1: MSR_IA32_MISC_ENABLE: 0x00850089 (TCC EIST MONITOR)
CPUID(7): No-SGX
SLM BCLK: 83.3 Mhz
RAPL: 4581 sec. Joule Counter Range, at 30 Watts
cpu1: MSR_PLATFORM_INFO: 0x60000001a00
6 * 83 = 500 MHz max efficiency frequency
26 * 83 = 2166 MHz base frequency
cpu1: MSR_IA32_POWER_CTL: 0x00000000 (C1E auto-promotion: DISabled)
cpu1: MSR_TURBO_RATIO_LIMIT: 0x00000000
cpu1: MSR_NHM_SNB_PKG_CST_CFG_CTL: 0x001a000f (UNlocked: pkg-cstate-limit=15: unknown)
cpu0: MSR_IA32_ENERGY_PERF_BIAS: 0x00000006 (balanced)
cpu0: MSR_RAPL_POWER_UNIT: 0x00000505 (0.031250 Watts, 0.000032 Joules, 0.000977 sec.)
cpu0: MSR_PKG_POWER_LIMIT: 0x003880c0 (UNlocked)
cpu0: PKG Limit #1: ENabled (6.000000 Watts, 262144.000000 sec, clamp DISabled)
cpu0: PKG Limit #2: DISabled (0.000000 Watts, 0.000977* sec, clamp DISabled)
cpu0: MSR_PP0_POWER_LIMIT: 0x00000000 (UNlocked)
cpu0: Cores Limit: DISabled (0.000000 Watts, 0.000977 sec, clamp DISabled)
cpu0: MSR_IA32_TEMPERATURE_TARGET: 0x00690000 (105 C)
cpu0: MSR_IA32_THERM_STATUS: 0x88330100 (54 C +/- 1)
cpu1: MSR_IA32_THERM_STATUS: 0x88330100 (54 C +/- 1)
cpu2: MSR_IA32_THERM_STATUS: 0x88340100 (53 C +/- 1)
cpu3: MSR_IA32_THERM_STATUS: 0x88340100 (53 C +/- 1)
10.006062 sec
Core CPU Avg_MHz Busy% Bzy_MHz TSC_MHz IRQ SMI CPU%c1 CPU%c6 CoreTmp GFX%rc6 GFXMHz PkgWatt CorWatt
-    -   80      11.80 675     2167    0   0   2.18   86.02  53      0.00    0      1.48    0.09
0    0   67      10.77 622     2167    0   0   3.08   86.15  53      0.00    0      1.48    0.09
1    1   127     18.55 682     2167    0   0   2.90   78.55  53
2    2   65      9.83  659     2167    0   0   1.47   88.70  53
3    3   61      8.06  752     2167    0   0   1.26   90.68  53
Running 4.9.0-1 from opensuse repos at the moment, will try autodemotion patch next.
I did as told on my system, here's a link to the output as soon as I ran those commands... --- http://pastebin.com/ByQVV5gc --- On my system, if I try this line, --- $ ./byt.test | tee out.`date +%Y%m%d_%H%M%S` --- it gives a "permission denied" error. I get this error even if I'm in super user mode ( sudo su ). However, on my system, the workaround is... --- $ . byt.test | tee out.`date +%Y%m%d_%H%M%S` --- Also, there's one more thing.... If you see the 2nd last line of that link, you'll notice this... --- The program 'glxgears' is currently not installed. --- How would you suggest I proceed ? Should I go ahead and type.... --- $ sudo apt install mesa-utils --- Or should I be doing something else ? What is glxgears ?
Created attachment 251261 [details] CHI_freeze_4.9.2_no_demotion_disable_patch Using Len Brown's freeze script with kernel 4.9.2, I had a freeze in about 20 minutes. Freeze times range from 10 minutes to around an hour.
@ A Uday K

". file" interprets that file in the current shell session. That isn't what you want if the script has side effects, like changing directory or calling exit. Try this instead:
$ chmod +x file
$ ./file
This particular script calls sudo, so you will be prompted for a password if your session doesn't remember it from a previous sudo.

Yes, glxgears is a simple graphics demo. It seems to come installed by default in Ubuntu 16.10. Go ahead and install it and try it. The good thing about glxgears is that it does video without doing any audio. I suspect that the folks freezing their systems by playing an audio+video stream may be running into a known audio issue that hopefully will soon be fixed.

@ Juha Sievi-Korte

Thanks for the testing. Please update to the latest turbostat from comment #661. When I wrote the test script in comment #663 I expected to be varying the number of copies of glxgears. I too notice a huge benefit (shorter time to freeze) from running 1 copy, but did not notice a benefit from running more copies.

@ jbMacAZ

Thanks for confirming that 4.9.2 is not magic, and that the test script from comment #663 fails well for an unpatched kernel.

So did you eventually get a hang with 4.9 with the demotion-re-enabled patch?

What kernel is "LapLet 4.9.2.2n" -- is that unpatched or patched? Your %pc6 is remarkably low in comment #670 (under 2%).

BTW, thanks for the T100 CHI install tips; hopefully I'll get that box going tonight. If it is like yours, it will be a great test bed.
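On the difference between sourcing a script with "." and executing it with "./", as discussed at the top of this comment, the following throwaway demo may help. It is only a sketch: demo.test is a made-up stand-in that calls exit, not the real byt.test, and the temporary directory is assumed to allow executing files.

```shell
#!/bin/sh
# Illustrative only: demo.test is a throwaway stand-in, not byt.test.
demo_dir=$(mktemp -d)
printf '#!/bin/sh\nexit 3\n' > "$demo_dir/demo.test"

# 1. No execute bit yet: ./demo.test cannot be run and the shell
#    reports "Permission denied" (a non-zero status, 126 in POSIX shells).
no_exec_status=0
( cd "$demo_dir" && ./demo.test ) 2>/dev/null || no_exec_status=$?

# 2. After chmod +x the script runs as a child process, so its
#    "exit 3" only ends the child and comes back as an exit status.
chmod +x "$demo_dir/demo.test"
child_status=0
"$demo_dir/demo.test" || child_status=$?

# 3. Sourcing runs the script inside the calling shell, so "exit 3"
#    ends that shell. A subshell is used here so the demo survives;
#    the echo after the source is never reached.
reached=$( ( . "$demo_dir/demo.test"; echo "after source" ) || true )

rm -rf "$demo_dir"
echo "without exec bit: status $no_exec_status"
echo "executed as child: status $child_status"
echo "output after sourcing: '$reached'"
```

Typed directly at an interactive prompt, step 3 without the subshell would close the terminal session, which is the kind of side effect mentioned above.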
(In reply to Len Brown from comment #671) > > @ jbMacAZ > > so did you eventually get a hang with 4.9 with demotion-re-enabled patch? > > what kernel is "LapLet 4.9.2.2n" -- is that unpatched or patched? > your %pc6 is remarkably low in comment #670 (under 2%) > > BTW. thanks for the T100 CHI install tips, hopefully I'll get that box going > tonight. If it like yours, it will be a great test bed. 4.9.2.2n is 4.9.2 + aufs4.9 + T100 specific patches not yet upstreamed (from T100/Ubuntu G+ group), but no ubuntu patches. glxgears has a slight stutter while running, so the system may be maxing out?? I'm running mint 18.1 (x86_64) w/cinnamon 3.2.7. I have a system monitor (CPU,mem,net,disk) graphical applet in the system tray, wifi and bluetooth active. A standard unpatched recent(>4.6) kernel will run acceptably on the CHI, but more of the minor hardware (buttons, backlight, etc.) works with the T100 patches and .config. I also built 4.9.2.2 which adds your demotion patch. So far I have not seen 4.9.2.2 freeze - longest test run so far has been about 16 hours. FWIW, 4.10-rc3 with your patch did freeze, but 4.10 is still too new for me to take seriously.
Hello, how is this bug going? I have had the freeze issue for a year, and the workaround doesn't work on my system. It makes things better, but come on, the system still freezes randomly. Why is this not a priority?
If you need a quick way to get freezes: I had a laptop with an N3540 that froze a few seconds after start (the same complete freeze, with no logs) whenever I launched Google Chrome right after the desktop started in Manjaro Gnome 16.08 with the 3.16 Manjaro kernel (only in the Gnome edition). If someone has time and is interested, please confirm this; I'm unable to try it right now.
The N3540 also freezes with 4.9.2: it froze once in 12 hours of uptime, while connected through the wifi hotspot of my Android device and while a missed call was being notified on the netbook through KDE Connect. Firefox and a terminal were running in the foreground. Even so, 4.8 and later kernels are much better than previous versions, at least on my Asus X553MA netbook.
@ jechtpurgateur@gmail.com if the workaround (booting with "intel_idle.max_cstate=1") does not help your system, then you have a different bug. Please file it.
@ alvararo It is interesting that you had a failure easily reproducible in 3.16 -- it was widely cited as the most stable baytrail kernel before things went south in 3.17. I'm afraid, however, that interest in 3.16 is about zero right now. There have been a lot of fixes since then, and it would be more interesting if you could get Linux-3.9 to fail quickly.
@ alvararo oops, typo, we want to go forward to the present, not back in history:-) < get Linux-3.9 to fail quickly. > get Linux-4.9 to fail quickly.
(In reply to Len Brown from comment #676) > @ jechtpurgateur@gmail.com > > if the workaround (booting with "intel_idle.max_cstate=1") does not > help your system, then you have a different bug. Please file it. @len There seem to be various bugs with the baytrail architecture. My system, an N2940, still freezes with "intel_idle.max_cstate=1", as do all other N2940 systems. But for us kernel 4.12 (I am using it) or even 4.16 (I think) works well without any kernel parameters. All later versions freeze, including 4.8.x and, I am pretty sure, 4.9.rcx as well. All that info is in this thread, which seems to be more like a public chat forum by now ;) The point I want to make: since there are several bugs that affect baytrail, most likely related somehow, why would you file a different bug report for the N2940? This is the best bug thread on baytrail problems on the net, as far as I know. If some day the baytrail problem really is solved, I am pretty sure it will be solved for us as well. I would include the valid N2940 info provided by many users in this bug report when trying to solve the problem(s) with baytrail. Kind regards, and thank you for your work on this.
(In reply to Len Brown from comment #676) > @ jechtpurgateur@gmail.com > > if the workaround (booting with "intel_idle.max_cstate=1") does not > help your system, then you have a different bug. Please file it. There is something more to this bug, as I have an interesting observation. My N3540 is of the baytrail architecture, I have tested all kernel versions, and intel_idle.max_cstate=1 was the ultimate workaround for all of them, so my hardware fits this specific bug perfectly. But recently, when I tested 4.8.0-32, my system froze once even with the cstate parameter in place. It didn't happen again, and the same kernel is also the most stable one when the cstate parameter is not set. I have also noticed a strong relation between freezes and heavy wifi usage. People who are facing freezes even after the max_cstate parameter is set should check whether there is a relation between wifi and freezes and report back. The trigger is downloading big files over a fast connection.
@ julio.borreguero@gmail.com I must insist. If you have a failure that is anything other than a baytrail hang that goes away with intel_idle.max_cstate=1, then you are best served by a new bug report. While we always hope there is a magic bullet that fixes multiple similar issues, experience shows that is very rarely the case. This bug report will be closed when Baytrail systems that used to hang without intel_idle.max_cstate=1 no longer need that parameter. So if that doesn't describe your system, you are best off with a bug report that does. Go ahead and reference it here, but please put all necessary information describing that failure in that bug report. Thanks.
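Since the scope of this report hinges on whether a system was actually booted with intel_idle.max_cstate=1, here is a small sketch for checking that from a running system. The function name is hypothetical and the approach is an assumption of mine, not something prescribed in this thread: it only inspects the kernel command line, so it says nothing about limits set in the BIOS.

```shell
#!/bin/sh
# Hypothetical helper: print the value of intel_idle.max_cstate found in a
# kernel command line string, or nothing when the parameter is absent.
# If the parameter appears more than once, the last occurrence is printed.
max_cstate_from_cmdline() {
    printf '%s\n' "$1" | tr ' ' '\n' |
        sed -n 's/^intel_idle\.max_cstate=//p' | tail -n 1
}

# On a live Linux system the command line is exposed in /proc/cmdline.
if [ -r /proc/cmdline ]; then
    limit=$(max_cstate_from_cmdline "$(cat /proc/cmdline)")
    echo "intel_idle.max_cstate on this boot: ${limit:-not set}"
fi
```

Including that one line of output in a report makes it clear whether the workaround was in effect when the hang occurred.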
@ Prashant Poonia Yes, the n3540 failing with intel_idle.max_cstate=1 is also interesting. If you can isolate what kind of workload triggers it, please put that in a new bug report describing the failure. If you suspect WIFI, then I suggest seeing if you can eliminate sound and graphics from the workload to be sure the known problems there are not the actual root cause.
Created attachment 251471 [details] drm/i915/byt: Avoid tweaking evaluation thresholds
I kind of disagree with the new-bug-report sentiment... This bug is almost 2 years old and multiple kernel updates have come out since then. Are we asking all those for whom intel_idle.max_cstate=1 worked 2 years ago to go back and wait another 2 years for those issues to be addressed, simply because they now need additional parameters for intel_idle.max_cstate=1 to work?
2 years ago? True. It's explained at https://www.phoronix.com/scan.php?page=news_item&px=Intel-Linux-Bay-Trail-Fail (the bug started at https://bugs.freedesktop.org/show_bug.cgi?id=88012 and moved to a kernel bug). The hardware erratum that is supposed to be the origin of the problem (VLP52) was reported by Intel in March 2014: https://bugzilla.kernel.org/show_bug.cgi?id=109051#c425 Other operating systems are also affected by this Intel hardware bug, and by others. Recommended Google searches: "freeze c-state windows", "freeze c-state osx". In fact, many Windows gamers recommend disabling c-states.
Interesting document from Dell, about how to manage c-states in Linux http://en.community.dell.com/cfs-file/__key/telligent-evolution-components-attachments/13-4491-00-00-20-22-77-64/Controlling_5F00_Processor_5F00_C_2D00_State_5F00_Usage_5F00_in_5F00_Linux_5F00_v1.1_5F00_Nov2013.pdf
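As a complement to documents like the one above, the idle states the kernel actually exposes can be listed from the cpuidle sysfs directory. This is a minimal sketch with a made-up function name; it assumes the usual Linux layout under /sys/devices/system/cpu/cpu0/cpuidle (one stateN directory per idle state, each with a name file and a disable file), which may be missing if no cpuidle driver is loaded.

```shell
#!/bin/sh
# Illustrative helper: list the idle states of one CPU from a cpuidle
# sysfs directory. The default path is where Linux normally exposes cpu0.
list_idle_states() {
    dir=${1:-/sys/devices/system/cpu/cpu0/cpuidle}
    for state in "$dir"/state*; do
        # Skip the unexpanded pattern when no state directories exist.
        [ -r "$state/name" ] || continue
        printf '%s name=%s disabled=%s\n' "${state##*/}" \
            "$(cat "$state/name")" "$(cat "$state/disable" 2>/dev/null)"
    done
}

list_idle_states
```

On kernels that provide the per-state disable file, writing 1 to it as root switches that state off at runtime, which is a finer-grained alternative to a boot-time limit such as intel_idle.max_cstate=1.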
I have a Lenovo Yoga 11e 20D9, Intel Celeron N2930 (2.17GHz), 4GB RAM, with the UEFI BIOS updated to the latest version (17 October 2016). In Windows 8.1 it runs perfectly, so it is not a hardware bug. But in Linux Mint 18.1 Serena 64bit MATE 1.16.1, kernel 4.4.0-57-generic #78-Ubuntu x86_64, the system freezes within 5 minutes when I watch a video on YouTube with Firefox. With intel_idle.max_cstate=1 it does not freeze.
Kernel 4.9.3 seems to be more stable. Unpatched and with no workarounds (no T100 patches either, just a custom .config), it took over eight hours to freeze. 4.9.2 would freeze in less than an hour on my system. YMMV. I applied Mika Kuoppala's new patch to 4.9.2 and it is still running 12 hours later with Len Brown's byt.test script. Without any cstate workaround, the hard freeze averages around 30 minutes on recent kernels. (T100CHI - Z3775)
I’m going to jump on the bandwagon here. I only experience freezes with both onboard WNIC and mode setting enabled. A semi-consistent reproducer is establishing multiple concurrent network connections (e.g. downloading a torrent) and/or downloading big files at high speeds. The laptop is an ASUS X553MA with a Celeron N2840. A probably unrelated thing is that the BIOS has a mysterious setting called “OS Selection” with “Windows 7” and “Windows 8.x” options. Older (probably 3.x and early 4.x) kernels used to not boot with “Windows 8.x” selected. I could manage to get it to boot by using a modified DSDT, so I assumed it was an ACPI problem, but it works fine with the current kernel. Another probably unrelated thing is that this thing freezes when modules dw_dmac and dw_dmac_core are loaded (I tried documenting these things here: https://wiki.archlinux.org/index.php/ASUS_X553MA#Laptop_freezes_on_boot).
@ Mika Kuoppala

Your patch in comment #683 made a dramatic improvement when applied to Linux-4.8.17. Without the patch, the Dell-n3540 hung in 13 minutes and the Acer-J1900 hung in 3 minutes. With the patch, both machines are still running after 12 hours.

(both fixed at HFM, running 1 copy of glxgears + 8 copies of nanosleep)
(both are using wired ethernet -- wifi is disabled on the Dell and it is using a USB/wired-ethernet dongle)
(no audio is being played)

Looking at the patch, it appears to be a revert of

commit 8fb55197e64d5988ec57b54e973daeea72c3f2ff
Author: Chris Wilson <chris@chris-wilson.co.uk>
Date: Tue Apr 7 16:20:28 2015 +0100

drm/i915: Agressive downclocking on Baytrail

That patch went upstream in Linux-4.2-rc1. That is interesting because 4.1 was often cited as a local maximum in baytrail stability, with 4.2 widely cited as less stable. And so my feedback on that patch is consistent with the favorable result reported above by jbMacAZ on the T100 TAM z3775.

I tried doing the same comparison using Linux-4.9.3, but the baseline test of Linux-4.9.3 with no patches ran for 30 hours on both machines without failure.
Can anyone reproduce freezes with Ethernet cable connection and wifi turned off? On my desktop I see them only when wifi usb dongle is connected (with both RTL chips available for me).
I tried using both the demotion patch and the threshold patch on 4.9.4, only to be stymied by a regression in wifi (also seen in 4.8.17). I call it a soft freeze, because the UI only updates about once a minute, but the mouse cursor moves freely. dmesg fills up with various brcmfmac error -110's. For purposes of the cstate bug, I'll stick to testing with 4.9.2. FWIW, with 4.9.2 I was able to run Mika's patch for 37 hours without a freeze. I stopped that test to try other things. I had done some testing many months ago regarding aggressive down-clocking (comment #93), which showed only slight improvement at that time.
(In reply to AB from comment #691) > Can anyone reproduce freezes with Ethernet cable connection and wifi turned > off? > > On my desktop I see them only when wifi usb dongle is connected (with both > RTL chips available for me). I can easily reproduce this without wifi. I ran a headless router setup and the lockups most frequently occur after (or sometimes during) heavy network activity.
Running ./byt.test | tee out.`date +%Y%m%d_%H%M%S` I cannot hang my J1900. After about 36 hours of running I stopped it, played a movie, and got a hang after about 10 minutes of playback. For me this script doesn't work (or I should rather say 'isn't effective'). Running on Ubuntu 16.10.

BOOT_IMAGE=/boot/vmlinuz-4.8.0-34-generic root=UUID=d097e0d3-b7a2-4943-95fa-591edd652328 ro quiet splash vt.handoff=7
board_vendor:ASRock
board_name:Q1900DC-ITX
board_version:
bios_date:03/31/2016
bios_vendor:American Megatrends Inc.
bios_version:P1.50
processor : 0
vendor_id : GenuineIntel
cpu family : 6
model : 55
model name : Intel(R) Celeron(R) CPU J1900 @ 1.99GHz
stepping : 8
microcode : 0x831
cpu MHz : 1509.649
cache size : 1024 KB
physical id : 0
siblings : 4
[ 1.860044] intel_idle: MWAIT substates: 0x33000020
[ 1.860046] intel_idle: v0.4.1 model 0x