Bug 109051 - cstates: intel_idle.max_cstate=1 required to prevent crashes - Baytrail
Summary: cstates: intel_idle.max_cstate=1 required to prevent crashes - Baytrail
Status: CLOSED CODE_FIX
Alias: None
Product: Power Management
Classification: Unclassified
Component: intel_idle
Hardware: All Linux
Importance: P1 blocking
Assignee: Len Brown
URL:
Keywords:
Depends on:
Blocks: 113151
Reported: 2015-12-08 09:50 UTC by Daniel Vetter
Modified: 2023-11-16 09:44 UTC
CC List: 277 users

See Also:
Kernel Version: 3.16-4.2
Subsystem:
Regression: No
Bisected commit-id:


Attachments
LG MP500 w/o fan (16.36 KB, text/plain)
  2015-12-17 16:11 UTC, Chris Eineke
Advantech DS-370 (23.51 KB, text/plain)
  2015-12-17 16:12 UTC, Chris Eineke
drm/i915/vlv: Take forcewake on media engine writes (2.57 KB, patch)
  2015-12-17 16:37 UTC, Mika Kuoppala
drm/i915/vlv: [V4.3 backport] Take forcewake on media engine writes (2.01 KB, patch)
  2015-12-18 13:05 UTC, Mika Kuoppala
lspci -v Hostbridge and vga adapter output (1001 bytes, text/plain)
  2016-01-08 10:32 UTC, julio.borreguero@gmail.com
drm/i915/vlv: Always enable internal pm interrupts (1.64 KB, patch)
  2016-01-18 11:09 UTC, Mika Kuoppala
Kernel bisection between v4.2 v4.1 for sudden freezes (2.53 KB, text/plain)
  2016-02-01 22:06 UTC, BzukTuk
attachment-24616-0.html (1.45 KB, text/html)
  2016-03-16 22:10 UTC, Vincent Frentzel
attachment-28440-0.html (1.06 KB, text/html)
  2016-03-17 00:27 UTC, Vincent Frentzel
Arch Linux 4.1.18 LTS panic #1 (photo 1 of 3) (2.69 MB, image/jpeg)
  2016-03-17 02:53 UTC, John A.
Arch Linux 4.1.18 LTS panic #2 (photo 2 of 3) (2.72 MB, image/jpeg)
  2016-03-17 02:55 UTC, John A.
Arch Linux 4.4.3 panic (photo 3 of 3) (2.65 MB, image/jpeg)
  2016-03-17 02:57 UTC, John A.
attachment-21257-0.html (1.77 KB, text/html)
  2016-03-22 06:47 UTC, jds
drm/i915: Prevent machine death on Ivybridge context switching for kernel 4.5.0 from kernel archive (1.54 KB, patch)
  2016-03-26 21:56 UTC, julio.borreguero@gmail.com
Reverted commit 8fb55197e64... for 4.5.0 (4.88 KB, patch)
  2016-04-04 12:25 UTC, Martin
attachment-24742-0.html (1.27 KB, text/html)
  2016-05-18 05:44 UTC, jds
attachment-7936-0.html (1.63 KB, text/html)
  2016-05-18 16:13 UTC, jds
attachment-22682-0.html (1.81 KB, text/html)
  2016-06-07 04:48 UTC, Koen Roggemans
Disable all C6 states enable all C7 core states for Baytrail CPUs (1.33 KB, text/x-sh)
  2016-07-14 18:09 UTC, Wolfgang M. Reimer
Shows all core states (C-states) + some related info as a formatted table (1.41 KB, text/x-sh)
  2016-07-14 18:12 UTC, Wolfgang M. Reimer
attachment-21109-0.html (4.15 KB, text/html)
  2016-09-19 21:22 UTC, Konstantin Koslowski
attachment-3924-0.html (1.42 KB, text/html)
  2016-10-05 16:20 UTC, Koen Roggemans
attachment-14281-0.html (2.85 KB, text/html)
  2016-10-13 17:37 UTC, Javier Antonio Nisa Avila
Patch to disable c-states at boot (1.77 KB, patch)
  2016-10-16 05:38 UTC, Jochen Hein
Patch for Bay trail for 4.8 (1.95 KB, patch)
  2016-12-14 17:56 UTC, Vincent Gerris
attachment-26085-0.html (2.01 KB, text/html)
  2016-12-25 22:44 UTC, Vincent Gerris
Debug patch to enable BYT C6 auto-demotion (1.73 KB, patch)
  2016-12-27 21:57 UTC, Len Brown
nanosleep.c (723 bytes, text/plain)
  2016-12-28 11:43 UTC, Len Brown
turbostat-src.tar.gz (68.42 KB, application/x-gzip)
  2017-01-01 18:24 UTC, Len Brown
turbostat --debug -o ts.out sleep 10 (1.65 KB, text/plain)
  2017-01-01 20:54 UTC, Dmitry
turbostat-src.tar.gz (68.43 KB, application/x-gzip)
  2017-01-02 03:50 UTC, Len Brown
tubostat --debug -o ts.out sleep 10 (2.01 KB, text/plain)
  2017-01-02 09:52 UTC, Dmitry
Turbostat for Asus T100CHI (1.50 KB, text/plain)
  2017-01-07 07:18 UTC, jbMacAZ
latest turbostat utility for baytrail (69.54 KB, application/x-gzip)
  2017-01-10 09:05 UTC, Len Brown
Test script to freeze your baytrail quickly (1.01 KB, application/octet-stream)
  2017-01-10 09:58 UTC, Len Brown
T100CHI turbostat kernel 4.9 patched (1.98 KB, text/plain)
  2017-01-10 19:16 UTC, jbMacAZ
CHI_freeze_4.9.2_no_demotion_disable_patch (3.61 KB, text/plain)
  2017-01-11 22:59 UTC, jbMacAZ
drm/i915/byt: Avoid tweaking evaluation thresholds (3.13 KB, patch)
  2017-01-13 14:43 UTC, Mika Kuoppala
pstate.set script (1.97 KB, text/plain)
  2017-01-26 04:46 UTC, Len Brown
Mika v3: drm/i915: Avoid tweaking evaluation thresholds on Baytrail v3 (3.98 KB, patch)
  2017-02-28 03:13 UTC, Len Brown
latest turbostat (17.03.04) utility for baytrail (80.20 KB, application/octet-stream)
  2017-03-05 23:46 UTC, Len Brown
attachment-16106-0.html (1.60 KB, text/html)
  2017-03-16 04:38 UTC, Alejandro Morales Lepe
attachment-3110-0.html (4.17 KB, text/html)
  2017-06-12 11:43 UTC, Fred
Win 10 Uptime (44.54 KB, image/png)
  2017-06-12 18:45 UTC, luke
attachment-10405-0.html (2.24 KB, text/html)
  2017-06-20 15:42 UTC, Vincent Gerris
drm/i915: Only use idle or max freq on Baytrail (958 bytes, patch)
  2017-07-20 14:42 UTC, Mika Kuoppala
attachment-18432-0.html (4.45 KB, text/html)
  2017-07-21 08:53 UTC, Fred
attachment-29700-0.html (4.28 KB, text/html)
  2017-08-12 11:50 UTC, Fred
patch to fix v3 cstate patch (1.76 KB, patch)
  2017-11-08 07:37 UTC, jbMacAZ
[PATCH] i915: pm: Be less agressive with clockfreq changes on Bay Trail (3.41 KB, patch)
  2017-11-09 18:53 UTC, Hans de Goede
[PATCH] intel_idle: Disable C6N and C6S on Bay Trail (1.93 KB, patch)
  2017-11-09 18:56 UTC, Hans de Goede
attachment-19545-0.html (4.07 KB, text/html)
  2017-11-11 19:40 UTC, Fred
attachment-30778-0.html (6.39 KB, text/html)
  2017-12-21 13:07 UTC, jechtpurgateur
Dmesg -n 8 output when network dies (58.50 KB, text/plain)
  2018-01-18 01:11 UTC, Srdjan Todorovic
attachment-11991-0.html (1.64 KB, text/html)
  2018-02-10 22:15 UTC, Vincent Gerris
attachment-27696-0.html (2.48 KB, text/html)
  2018-02-16 09:45 UTC, jechtpurgateur
attachment-4533-0.html (983 bytes, text/html)
  2018-12-04 01:30 UTC, merlino37
attachment-1662-0.html (3.02 KB, text/html)
  2019-01-12 14:33 UTC, Burg
attachment-9601-0.html (2.18 KB, text/html)
  2019-01-12 15:00 UTC, Burg
Turn off C6N and C6S states on baytrail N3540, python script (920 bytes, text/plain)
  2019-01-17 21:23 UTC, infosecislame
attachment-24716-0.html (2.71 KB, text/html)
  2019-05-05 17:38 UTC, Rick Lee
attachment-28600-0.html (2.85 KB, text/html)
  2019-05-21 23:46 UTC, stOneskull
attachment-20672-0.html (2.92 KB, text/html)
  2019-10-29 10:02 UTC, James Preston
attachment-14937-0.html (1.58 KB, text/html)
  2019-11-03 18:52 UTC, merlino37
attachment-14850-0.html (2.67 KB, text/html)
  2019-12-31 05:55 UTC, Gary
attachment-22980-0.html (1.50 KB, text/html)
  2020-02-12 19:11 UTC, Xermán
attachment-971-0.html (1.15 KB, text/html)
  2020-02-24 20:14 UTC, merlino37
attachment-1878-0.html (1.53 KB, text/html)
  2020-02-24 20:17 UTC, merlino37
attachment-25299-0.html (1.34 KB, text/html)
  2020-03-08 23:48 UTC, merlino37
Restore initialization code dropped in commit drm-next-2020-01-30 (627 bytes, patch)
  2020-04-18 17:57 UTC, jbMacAZ
Updated: Restore code dropped in commit drm-next-2020-01-30 (982 bytes, patch)
  2020-04-20 06:42 UTC, jbMacAZ
Update.2 Restore forcedwake - dropped in commit drm-next-2020-01-30 (1.15 KB, patch)
  2020-04-20 17:29 UTC, jbMacAZ

Description Daniel Vetter 2015-12-08 09:50:50 UTC
This originally started as a GPU regression report against a change to the turbo logic. After much random walking, reporter consensus seems to have settled on intel_idle.max_cstate=1 as the one true workaround. For all the glorious details, see

https://bugs.freedesktop.org/show_bug.cgi?id=88012
Comment 1 Vladimir Jicha 2015-12-08 10:13:29 UTC
For me, setting max_cstate=1 didn't fix the bug. It extended the time to freeze from a couple of minutes to a couple of hours, so it is not a universally reliable workaround.
Comment 2 Anael O. 2015-12-08 10:42:29 UTC
Experienced on an Intel Celeron J1900 (platform GB-BXBT-1900) on Arch Linux x64. I cannot upgrade to a kernel newer than 3.14; otherwise I get very frequent crashes when playing videos or browsing the web. By contrast, kernel 3.14 is extremely stable and the machine can stay up for weeks.
Comment 3 raidyne 2015-12-08 11:09:49 UTC
Same here on an ASRock Q1900-ITX (Intel Celeron J1900): random freezes in an X session.
Comment 4 Wolfgang M. Reimer 2015-12-09 14:56:37 UTC
Same here on 50+ ASRock IMB-150 mini-ITX (Intel Celeron J1900) boards: Random freezes, time to freeze ranging from some ten minutes to some hours, only when using X with conky + own QT based App (no freezes when not using X!), so it seems very likely that this problem is GPU related.

I will test with kernel parameter intel_idle.max_cstate=1 to see if it is a working workaround for my case and report here later.
Comment 5 raidyne 2015-12-09 15:03:34 UTC
but intel_idle.max_cstate=1 would result in seriously increased power consumption?!
Comment 6 Michal Feix 2015-12-09 15:19:13 UTC
(In reply to raidyne from comment #5)
> but intel_idle.max_cstate=1 would result in seriously increased power
> consumption?!

Correct. And that is the reason why this bug needs to be fixed soon :-) intel_idle.max_cstate=1 is just a quick workaround so that your Baytrail machine can stay up longer than just a few minutes.

I can confirm the same random freezes on an Acer notebook with a Celeron N2940. They are truly random, but usually more frequent under heavy CPU or GPU load. Freezes occur on all the 4.2.x kernels I've tested so far. I've been able to avoid them by using either intel_idle.max_cstate=1 or intel_pstate=disable; either of these kernel parameters makes my machine usable again.
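
Whether either workaround is actually active on a running system can be checked from /proc/cmdline. Below is a minimal Python sketch, assuming simple space-separated parameters without quoting. The function names parse_cmdline and workaround_in_use are made up for this illustration.

```python
def parse_cmdline(text):
    """Split a kernel command line into a {name: value} dict.

    Flags without '=' map to None. Quoted values containing spaces
    are not handled; this is only a sketch.
    """
    params = {}
    for token in text.split():
        name, sep, value = token.partition("=")
        params[name] = value if sep else None
    return params


def workaround_in_use(text):
    """Return the two workaround parameters from this thread that are set."""
    params = parse_cmdline(text)
    return {
        key: params[key]
        for key in ("intel_idle.max_cstate", "intel_pstate")
        if key in params
    }
```

Feeding it the contents of /proc/cmdline shows which of the two parameters the running kernel was booted with. An empty result means neither is set.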
Comment 7 Michal Feix 2015-12-09 16:04:31 UTC
I've successfully tested longterm kernel 4.1.13. It has run without a single freeze for the last 8 hours of uptime. I didn't need to use either the intel_idle.max_cstate or the intel_pstate kernel parameter with this kernel.
Comment 8 raidyne 2015-12-10 12:59:00 UTC
I'm happy to provide any logs. Unfortunately, my system is not particularly verbose about this bug: I could not find any hints in dmesg, kern.log, syslog, or xorg.log.
Comment 9 Steven Ellis 2015-12-10 19:49:06 UTC
I'm seeing this freeze on an Acer Notebook with a Celeron N2940 when running Fedora.

https://bugzilla.redhat.com/show_bug.cgi?id=1285895

No issues with older fedora 22 kernel
 - kernel-4.1.6-200.fc22.x86_64

Still have issues with latest fedora kernel build
 - kernel-4.2.6-301.fc23.x86_64

Is there an easy way to show the current cstate of the system?
Comment 10 Chris Rainey 2015-12-11 15:12:00 UTC
(In reply to Steven Ellis from comment #9)
> I'm seeing this freeze on an Acer Notebook with a Celeron N2940 when running
> Fedora.
> 
> https://bugzilla.redhat.com/show_bug.cgi?id=1285895
> 
> No issues with older fedora 22 kernel
>  - kernel-4.1.6-200.fc22.x86_64
> 
> Still have issues with latest fedora kernel build
>  - kernel-4.2.6-301.fc23.x86_64
> 
> Is there an easy way to show the current cstate of the system?

Yes: PowerTOP (http://01.org/powertop/) and i7z (https://code.google.com/p/i7z/) can tell you this.

Example output from PowerTOP:

PowerTOP 2.8      Overview   Idle stats   Frequency stats   Device stats   Tunables                                     


          Package   |             Core    |            CPU 0
                    |                     | C0 active   0.7%
                    |                     | POLL        0.1%    0.3 ms
                    | C1 (cc1)   99.3%    | C1-BYT     99.4%    4.2 ms
C2 (pc2)    0.0%    |                     |
C3 (pc3)    0.0%    |                     |
C6 (pc6)    0.0%    | C6 (cc6)    0.0%    |

                    |             Core    |            CPU 1
                    |                     | C0 active   0.5%
                    |                     | POLL        0.0%    0.0 ms
                    | C1 (cc1)   99.5%    | C1-BYT     99.4%   22.8 ms
                    |                     |
                    |                     |
                    | C6 (cc6)    0.0%    |

                    |             Core    |            CPU 2
                    |                     | C0 active   0.9%
                    |                     | POLL        0.0%    0.0 ms
                    | C1 (cc1)   98.9%    | C1-BYT     99.0%    8.5 ms
                    |                     |
                    |                     |
                    | C6 (cc6)    0.0%    |

                    |             Core    |            CPU 3
                    |                     | C0 active   1.3%
                    |                     | POLL        0.0%    0.0 ms
                    | C1 (cc1)   98.0%    | C1-BYT     98.0%   11.1 ms
                    |                     |
                    |                     |
                    | C6 (cc6)    0.0%    |

                    |             GPU     |
                    |                     |
                    | Powered On  0.0%    |
                    | RC6       100.0%    |
                    | RC6p        0.0%    |
                    | RC6pp       0.0%    |
                    |                     |
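
Similar per-CPU counters are also exposed by the kernel's cpuidle sysfs interface, under /sys/devices/system/cpu/cpuN/cpuidle/stateM/. Below is a rough Python sketch that reads them, assuming that layout is present. The names read_idle_states and residency_share are illustrative and not part of PowerTOP or i7z.

```python
import glob
import os


def read_idle_states(cpu=0, base="/sys/devices/system/cpu"):
    """Return [(name, usage_count, time_us), ...] for one CPU, in state order."""
    pattern = os.path.join(base, f"cpu{cpu}", "cpuidle", "state[0-9]*")

    def index(path):
        # "state10" must sort after "state9", so compare the numeric suffix.
        return int(os.path.basename(path)[len("state"):])

    def field(path, name):
        with open(os.path.join(path, name)) as f:
            return f.read().strip()

    states = []
    for path in sorted(glob.glob(pattern), key=index):
        states.append(
            (field(path, "name"), int(field(path, "usage")), int(field(path, "time")))
        )
    return states


def residency_share(states):
    """Fraction of the recorded idle time spent in each state."""
    total = sum(time_us for _, _, time_us in states)
    return {name: (time_us / total if total else 0.0) for name, _, time_us in states}
```

The counters are cumulative since boot and the shares are relative to idle time only (time in C0 is not counted), so the numbers will not match PowerTOP's interval-based percentages exactly.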
Comment 11 Juha Sievi-Korte 2015-12-12 10:34:36 UTC
I can also confirm that this workaround works for me, running 4.3.0-2 now for about two weeks with intel_idle.max_cstate=1 and no freezes. Cheers for this, I was getting desperate with the constant hangs.

Downgrading kernel to 3.16.7-29 makes this run fine without any boot parameters, but anything newer than that means frequent freezes.

Using intel_pstate=disable does not work for this hardware/kernel combination either; it still hangs. Only limiting the C-state seems to cure this.

Acer B-115M Laptop with Intel(R) Pentium(R) CPU  N3540  @ 2.16GHz

If there is anything that I can do to help to trace this, let me know.
Comment 12 Chris Rainey 2015-12-12 21:27:40 UTC
Good reading for a better understanding of this issue:

1. C-states and P-states are very different (https://software.intel.com/en-us/blogs/2008/03/12/c-states-and-p-states-are-very-different)
2. Power Management States: P-States, C-States, and Package C-States (https://software.intel.com/en-us/articles/power-management-states-p-states-c-states-and-package-c-states)
3. (update) C-states, C-states and even more C-states (https://software.intel.com/en-us/blogs/2008/03/27/update-c-states-c-states-and-even-more-c-states)

Hope this helps!
Comment 13 Wolfgang M. Reimer 2015-12-17 09:14:58 UTC
(In reply to Wolfgang M. Reimer from comment #4)
> Same here on 50+ ASRock IMB-150 mini-ITX (Intel Celeron J1900) boards:
> Random freezes, time to freeze ranging from some ten minutes to some hours,
> only when using X with conky + own QT based App (no freezes when not using
> X!), so it seems very likely that this problem is GPU related.
> 
> I will test with kernel parameter intel_idle.max_cstate=1 to see if it is a
> working workaround for my case and report here later.

I can confirm that kernel parameter intel_idle.max_cstate=1 is a working workaround for my case (50+ ASRock IMB-150 mini-ITX Intel Celeron J1900 boards running a 3.18.21-rt19 kernel)
Comment 14 Pascal VITOUX 2015-12-17 14:37:57 UTC
I can confirm too: the parameter "intel_idle.max_cstate=2" is required on two laptops (Medion Akoya E6239T and S6217T) with these CPUs:
 - Intel Celeron CPU N2930 1.83GHz
 - Intel Celeron CPU N2940 1.83GHz
The random freezes come back when setting max_cstate to 3.

Also, I don't need it on two other similar laptops (Medion Akoya E6239) with these CPUs:
 - Intel Celeron CPU N2830 2.16GHz
 - Intel Celeron CPU N2840 2.16GHz
Comment 15 Chris Eineke 2015-12-17 16:11:34 UTC
I, too, can confirm this issue on systems that use an Intel Celeron N2930@1.83GHz or an Intel Celeron J1900@1.99GHz. While adding "intel_idle.max_cstate=1" to the kernel command line fixed the issue, the regression in GPU performance wasn't acceptable. Bumping it to "intel_idle.max_cstate=2" seems to give adequate GPU performance with no more hard lock-ups. I attached the lshw output of both systems.
Comment 16 Chris Eineke 2015-12-17 16:11:57 UTC
Created attachment 197611 [details]
LG MP500 w/o fan
Comment 17 Chris Eineke 2015-12-17 16:12:23 UTC
Created attachment 197621 [details]
Advantech DS-370
Comment 18 Mika Kuoppala 2015-12-17 16:37:45 UTC
Created attachment 197631 [details]
drm/i915/vlv: Take forcewake on media engine writes

Long shot, but could someone give this a spin.
Comment 19 G. Bremer 2015-12-17 20:49:17 UTC
Can anyone confirm that this problem is limited to Bay Trail and does not affect Braswell such as N3150 or N3700?  Ran into this intermittent freeze-up problem after upgrading several J1900 and N2930 based boards to 3.19 kernel.  [Had used  3.13 previously.]  intel_idle.max_cstate=1 seems to solve the problem...all units up for 48hrs anyway.  I much appreciated finding that this is a known/reported problem.  We are moving to the Braswell based boards and wondering if there are any known stability problems.  Thank you.
Comment 20 Chris Rainey 2015-12-17 21:35:19 UTC
(In reply to G. Bremer from comment #19)
> Can anyone confirm that this problem is limited to Bay Trail and does not
> affect Braswell such as N3150 or N3700?  Ran into this intermittent
> freeze-up problem after upgrading several J1900 and N2930 based boards to
> 3.19 kernel.  [Had used  3.13 previously.]  intel_idle.max_cstate=1 seems to
> solve the problem...all units up for 48hrs anyway.  I much appreciated
> finding that this is a known/reported problem.  We are moving to the
> Braswell based boards and wondering if there are any known stability
> problems.  Thank you.

Confirming the same issue on N3050 (Braswell/Cherry Trail/Airmont).
Comment 21 fritsch 2015-12-17 21:37:13 UTC
I can fully confirm that this issue is _not_ happening on Braswell N3150 and N3700 - both chips are perfectly fine without any patching.

3.19 is not even working on braswell.
Comment 22 Wolfgang M. Reimer 2015-12-18 09:18:22 UTC
(In reply to fritsch from comment #21)
> I can fully confirm that this issue is _not_ happening on Braswell N3150 and
> N3700 - both chips are perfectly fine without any patching.

If you make such a statement like the one above then please specify for which kernel revision(s) this is true. Older kernel revisions (like e.g. 3.13.x) do not exhibit the issue for BayTrail processors. This thread is about (more or less) random freezes of BayTrail (and possibly newer) processors running NEWER kernel revisions (e.g. 3.18.x and newer) when used without kernel parameter intel_idle.max_cstate=1 (please do not confuse this with kernel patching).

> 
> 3.19 is not even working on braswell.

What does that mean? Does the 3.19 kernel freeze on Braswell immediately at start-up? What happens when the kernel boot parameter intel_idle.max_cstate=1 is specified for this 3.19 kernel? How does that correlate with the above statement that "this issue is _not_ happening on Braswell N3150 and N3700"? What is the exact kernel revision of the 3.19 kernel you tried (or did you test it on ALL 3.19.* kernels)?
Comment 23 Peter Fr 2015-12-18 09:29:27 UTC
I am the original submitter of the bug report. At the time of filing it, Braswell existed only on paper.

To get the GPU up and running on a Braswell system you need kernel 4.1 or later, or special parameters for older kernels to force GPU acceleration. Whatever kernel you run with 3.13 / 3.19 has no mainline GPU support. It won't work at all. Is this something Ubuntu patched?

My Braswell N3150 systems (Minix / ASRock) currently run kernels 4.3 and 4.4-rc5 without issues.

Here are the kernel images if you want to verify:
http://fritsch.fruehberger.net/kernel/linux-image-4.3.0-pt-bt1+_4.3.0-pt-bt1+-10.00.Custom_amd64.deb
http://fritsch.fruehberger.net/kernel/linux-headers-4.3.0-pt-bt1+_4.3.0-pt-bt1+-10.00.Custom_amd64.deb
Comment 24 fritsch 2015-12-18 09:31:27 UTC
To avoid confusion: the last post was by me, but from the wrong account. Now, happy testing.
Comment 25 Mika Kuoppala 2015-12-18 13:05:11 UTC
Created attachment 197671 [details]
drm/i915/vlv: [V4.3 backport] Take forcewake on media engine writes
Comment 26 Pascal VITOUX 2015-12-18 13:11:41 UTC
(In reply to Mika Kuoppala from comment #18)
> Created attachment 197631 [details]
> drm/i915/vlv: Take forcewake on media engine writes
> 
> Long shot, but could someone give this a spin.

Tested without success with kernel 4.3.2 on my two laptops (CPU N2930 and N2940).
They froze in less than two hours.
Comment 27 lewexeki 2015-12-20 04:31:10 UTC
Hi,

I had the same problem with "Intel(R) Pentium(R) CPU  N3520  @ 2.16GHz". With kernel 4.2.0-16.19 there were ~5-8 freezes/day. After upgrading to 4.3.3-040303-generic (Ubuntu version) it was much better: 1-2 freezes/day. With max_cstate=1 there has not been a single one yet.
Comment 28 Nicolas Porcel 2015-12-21 20:10:19 UTC
(In reply to Mika Kuoppala from comment #25)
> Created attachment 197671 [details]
> drm/i915/vlv: [V4.3 backport] Take forcewake on media engine writes

Also tested on kernel 4.3.3 on Arch Linux and it didn't work. I have an Asrock Q1900M (with intel J1900). It froze after less than 1 hour of video playback, so no improvement compared to the base Arch Linux default kernel without the patch (v4.2.5).
Comment 29 Wolfgang M. Reimer 2015-12-22 14:36:07 UTC
(In reply to Peter Fr from comment #23)

> To get the GPU up and running on a braswell system you need at least kernel
> 4.1 or later or special parameters for older kernels to force gpu
> acceleration. Whatever kernel you run with 3.13 / 3.19 has no mainline gpu
> support. It won't work at all. If this something Ubuntu patched?
> 


Thanks for the info.

Yes, Ubuntu 15.04 (Vivid) made some patches to the 3.19 kernel line for Braswell systems (Ubuntu kernels 3.19.0-20 through 3.19.0-42, see http://forum.kodi.tv/showthread.php?tid=227771&pid=2026016#pid2026016 and http://www.phoronix.com/scan.php?page=news_item&px=Intel-Braswell-Fedora-Ubuntu).
It has support for Braswell systems; however, I don't know how complete this support is. The Ubuntu 15.10 (Wily) kernel 4.2.0-22 should also run on Braswell systems. The Vivid and Wily kernels are both also available for Ubuntu 14.04 LTS (Trusty, the Ubuntu release I use).
Comment 30 fritsch 2015-12-22 14:38:07 UTC
Then, please: Reproduce with mainline kernels. We cannot let the kernel devs debug ubuntu's picked together kernel ...
Comment 31 Wolfgang M. Reimer 2015-12-22 16:52:14 UTC
(In reply to fritsch from comment #30)
> Then, please: Reproduce with mainline kernels. We cannot let the kernel devs
> debug ubuntu's picked together kernel ...

My report does _NOT_ relate to the Ubuntu kernels _NOR_ does it relate to a Braswell system. See my Comments https://bugzilla.kernel.org/show_bug.cgi?id=109051#c4 and https://bugzilla.kernel.org/show_bug.cgi?id=109051#c13 above.
Comment 32 Markus Rehbach 2015-12-22 21:03:02 UTC
No freeze on a Acer E11 (N2940) after "echo acpi_pm > /sys/bus/clocksource/devices/clocksource0/current_clocksource" but it hit me not so often. Most of the time after reboot and not after standby/resume.
Comment 33 lewexeki 2015-12-22 22:16:16 UTC
I will compile a mainline kernel and test it.

I feel there is some connection with browsing. I get freezes while I watch online videos with Firefox or open a new site with multimedia content. I have disabled hardware acceleration to see what happens, and will report back.
Comment 34 lewexeki 2015-12-23 02:12:14 UTC
There is no change. Freeze again and again. The only solution is "intel_idle.max_cstate=1".

Does anybody know when this will be fixed? With the kernel parameter my CPU is noticeably warmer, which is not very good, I think. I bought a notebook with an Intel Atom (N) CPU because it is energy efficient.
Comment 35 jbMacAZ 2015-12-23 07:47:58 UTC
Freeze occurs on ASUS T100-CHI running Cinnamon Desktop on Mint17.3 or Manjaro15.12 with 64bit kernels after 3.16.7 including 4.3.x and 4.4-rcx.  Until 4.2.6, capping GPU frequency greatly reduced the freeze rate for me.  After 4.2.5 GPU frequency did not affect freeze rate (GPU hang fixed?)

Freeze rate seems to depend on particulars of the distro, kernel and device it runs on.  My setup freezes within a few minutes without a max_cstate below 2.  I notice warmer system temperatures with cstate=0.  YMMV.
Comment 36 Nicolas Porcel 2015-12-23 19:30:07 UTC
(In reply to Markus Rehbach from comment #32)
> No freeze on a Acer E11 (N2940) after "echo acpi_pm >
> /sys/bus/clocksource/devices/clocksource0/current_clocksource" but it hit me
> not so often. Most of the time after reboot and not after standby/resume.

This seems to work on my Intel J1900. Can more people confirm that this works? To make the change permanent, you can add the option "clocksource=acpi_pm" to your kernel command line.

What is the drawback of using the acpi_pm clock? From what I have read (https://access.redhat.com/solutions/18627) it has a lower frequency, 3.58 MHz compared to the 2 GHz of my CPU clock. We could just force the kernel to switch to the acpi_pm clock when available if the CPU is a Bay Trail / Braswell.
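
To put the two frequencies side by side, here is a small Python sketch that converts a counter frequency into the length of one tick and reads the clocksource currently in use. It assumes the ACPI PM timer's nominal 3.579545 MHz rate (the 3.58 MHz quoted above) and the usual sysfs location /sys/devices/system/clocksource/clocksource0. The 2 GHz figure is simply the one from this comment, not a measured TSC rate, and the helper names are invented for this example.

```python
import os

ACPI_PM_HZ = 3_579_545  # nominal ACPI power-management timer frequency


def tick_ns(freq_hz):
    """Length of one counter tick in nanoseconds."""
    return 1e9 / freq_hz


def read_clocksources(base="/sys/devices/system/clocksource/clocksource0"):
    """Return (current, [available, ...]) as reported by the kernel."""
    with open(os.path.join(base, "current_clocksource")) as f:
        current = f.read().strip()
    with open(os.path.join(base, "available_clocksource")) as f:
        available = f.read().split()
    return current, available


print(f"acpi_pm tick: {tick_ns(ACPI_PM_HZ):.1f} ns")
print(f"2 GHz tick:   {tick_ns(2e9):.1f} ns")
```

So one acpi_pm tick is roughly 279 ns against 0.5 ns for a 2 GHz counter: a coarser clock, but still far finer than the freeze timescales discussed here. Any extra cost of reading the timer is a separate question that this sketch does not measure.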
Comment 37 Nicolas Porcel 2015-12-23 23:30:50 UTC
I was wrong: it turns out it takes more time to freeze, but it eventually does. The best option so far is the cstate option.
Comment 38 mazout360 2015-12-24 01:10:59 UTC
There's something strange with this bug... On my Q1900DC-ITX I tried every single version of the mainline kernel from 3.16 to 4.3. It still hangs on 4.0, it hangs on 4.2, but the whole 4.1 series from 4.1.0 to 4.1.15 is very stable. No need for the cstate configuration or any patch published here or on the other thread.
For some reason, it "seemed" to get fixed in 4.1-rc something and the bug came back in 4.2.0. Now, I don't know much about how the whole i915 driver works, but it seems like a lot of the changes in 4.2 concern the Cherryview chips, except these:

drm/i915: Use spinlocks for checking when to waitboost
drm/i915: Don't downclock whilst we have clients waiting for GPU results
drm/i915: Agressive downclocking on Baytrail
drm/i915: Fix computation of last_adjustment for RPS autotuning

Looks like they directly affect Baytrail chips and they alter code introduced right before the 4.1 series. I also remember trying to revert the "drm/i915: Agressive downclocking on Baytrail" commit on 4.2, without success.
Comment 39 jbMacAZ 2015-12-24 06:33:18 UTC
I tried the clocksource parameter without cstate.  It froze within a few minutes (4.3.3/T100-CHI).  So far my freeze is independent of GPU frequency and system clock source!  

4.1 was more stable for me than 4.2.x, 4.3.x.  But the rest of my hardware works better with newer kernels.  Otherwise I could avoid the bleeding edge kernels.
Comment 40 Dmitry 2015-12-24 09:08:39 UTC
Additional info:
I have a Dell Venue 11 Pro with an Atom Z3770. I have been seeing these freezes, like everybody else, since 3.17. After 4.1 the behaviour of the freezes changed slightly, but they still happen.
Setting intel_idle.max_cstate=0 or switching to the acpi_idle driver on the latest kernel, 4.4-rc6, doesn't solve this bug. So it's not the idle driver's fault.
intel_idle.max_cstate=2 (max_cstate=1 also) completely solves the freezes.

The only difference between acpi_idle (freezes) and intel_idle with max_cstate=2 (doesn't freeze) is this state: ACPI FFH INTEL MWAIT 0x64.
I'll try with max_cstate=3, but I think it'll freeze too.

I can reproduce the freezes with HTML5 video in Firefox. On 3.17-4.0 it happens within 10 minutes. After 4.1 it happens within 1 hour.
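
For readers unfamiliar with the "MWAIT 0x64" notation: as described for the MWAIT instruction in Intel's Software Developer's Manual, the hint packs a target C-state field in bits 7:4 and a sub-state in bits 3:0. The Python sketch below only splits the value into those two fields. Which named state (C6, C7, ...) a given field value corresponds to is processor specific and is not established in this comment, and the function name is made up for illustration.

```python
def decode_mwait_hint(hint):
    """Split an 8-bit MWAIT hint into (target_field, sub_state)."""
    target_field = (hint >> 4) & 0xF  # bits 7:4: target C-state field
    sub_state = hint & 0xF            # bits 3:0: sub-state within it
    return target_field, sub_state


print(decode_mwait_hint(0x64))
```

For 0x64 this gives a target field of 6 and a sub-state of 4, so the state in question is one of the deepest the platform advertises rather than a shallow C1-type state.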
Comment 41 Dmitry 2015-12-24 14:36:29 UTC
Tried every cstate up to 6 and cannot reproduce this bug anymore... Even without the parameter, huge films over wifi and HTML5 video in Firefox work without freezes. I'll continue testing.


cat /proc/cmdline 
root=/dev/mmcblk0p6 ro init=/usr/lib/systemd/systemd rootfstype=ext4 tsc=reliable force_tsc_stable=1 clocksource=tsc clocksource_failover=tsc swap_zram zram.num_devices=4 

uname -a
Linux venue11pro 4.4.0-rc6-dirty #200 SMP PREEMPT Thu Dec 24 15:23:06 MSK 2015 i686 Intel(R) Atom(TM) CPU Z3770 @ 1.46GHz GenuineIntel GNU/Linux

mesa 11.1
xf86-video-intel 2.99.917-r2 (gentoo version)
libdrm 2.4.65

P.S. Linux "dirty" because of ath6kl patch, soc_button_array patch and gcc native optimization patches.
Comment 42 jbMacAZ 2015-12-24 22:03:31 UTC
Replacing cstate=1 with "clocksource=acpi_pm" my setup froze within a few minutes.  Replacing cstate=1 with "tsc=reliable force_tsc_stable=1 clocksource_failover=tsc" gave me significantly more run time before freezing.  I was able to run almost 2 hours (~20x) streaming a bald eagle cam. T100-CHI (intel 3775) with hardware specific patches, 4.3.3 (Manjaro)
Comment 43 Nicolas Porcel 2015-12-24 23:40:33 UTC
 (In reply to mazout360 from comment #38)
> There's something strange with this bug...on my Q1900DC-ITX I tried every
> single version of the mainline kernel from 3.16 to 4.3. It still hangs on
> 4.0, it hangs on 4.2, but the whole 4.1 kernel version from 4.1.0 to 4.1.15
> is very stable. No need for the cstate configuration or any patch publied
> here or on the other thread.

I also run kernel 4.1.5 (from ArchLinux, which doesn't include any patch) without any freeze on Q1900-ITX. Current uptime is 10 hours, with Netflix video streaming, although it is stopped from time to time. I will need more time to be sure, but it seems to work so far.

For now it is the best option, without any major drawback like the video driver not working or power saving being disabled. I will try to bisect between 4.0 and 4.2 to see exactly which commit fixed the regression and which one reintroduced it.
Comment 44 Michaël 2015-12-25 05:38:10 UTC
I confirm the random freezes on Acer TravelMate 115 (same as Juha Sievi-Korte).  Intel(R) Pentium(R) CPU  N3540  @ 2.16GHz, with Arch's linux-4.1.15-1-lts.  The freezes occur mostly while watching videos, and are way sparser than the ones reported here (on specific days, I'd have 5 freezes, then it would be fine for a few weeks, and resume).
Comment 45 Ariel 2015-12-26 19:52:28 UTC
Random freezes happening on Fedora 23 - Kernel 4.2.8-300.fc23.x86_64. And before with Fedora 22.

On Asrock Q1900-ITX, BIOS P1.40 (latest available).

This has been happening for about a year now, at different freeze frequencies going from 2 minutes after boot up to a few weeks. It only happens intermittently when playing back video content with Kodi (this is an HTPC). It doesn't happen when compiling, playing music, or when the home server stays idle. 

I noticed that certain videos (but not a specific codec) are much more prone than others to trigger the bug. Disabling hardware acceleration does NOT solve the problem. It has been a very frustrating experience.
Comment 46 FL 2015-12-28 11:09:54 UTC
Same problem with ASUS ET2325IUK with J2900  @ 2.41GHz + Arch Linux 4.1.15-1 and 4.2.5-1 (videos, system upgrades,html5...)
Freezing also systematically appears when closing gnome or cinnamon session (gdm).
Comment 47 Dmitry 2015-12-29 15:20:28 UTC
No, not fixed. It froze while just scrolling in Firefox, without any video. max_cstate is still needed.
Comment 48 Elmar Melcher 2016-01-02 14:02:10 UTC
Same problem on Positivo ZX3040 http://lad.dsc.ufcg.edu.br/lad/pmwiki.php?n=Lad.Tablet, but occasional hard lock-ups even with intel_idle.max_cstate=1.
Are the patches in https://github.com/hadess/rtl8723bs related in any way to this problem?
Comment 49 gpdemedici 2016-01-03 14:26:19 UTC
Last not having issue: 4.1.3 
First to show issue: 4.2.0

I am on UBUNTU and have the issue. I tested the mainline kernels. From my testing UBUNTU 4.1.0-3.3 is the last kernel known to me not having the issue, successive kernel UBUNTU 4.2.0-7.7 has the issue. To my knowledge these map to 4.1.3 and 4.2.0 mainline kernels respectively. I am sharing this hoping somebody can find this information useful to make progress towards fixing the issue.

MAINLINE KERNELS

vivid linux 
3.19.0-32.37	Ubuntu-3.19.0-32.37	3.19.8-ckt7 kernel used before I upgraded to wily, does not have issue
3.19.0-33.38	Ubuntu-3.19.0-33.38	3.19.8-ckt7
3.19.0-37.42	Ubuntu-3.19.0-37.42	3.19.8-ckt9
3.19.0-39.44	Ubuntu-3.19.0-39.44	3.19.8-ckt9
3.19.0-41.46	Ubuntu-3.19.0-41.46	3.19.8-ckt10
3.19.0-42.48	Ubuntu-3.19.0-42.48	3.19.8-ckt10 (last Vivid kernel, not tested for issue)

Wily linux

3.19.0-20.20	Ubuntu-3.19.0-20.20	3.19.8
4.0.0-4.6	Ubuntu-4.0.0-4.6	4.0.7
4.0.0-4.7	Ubuntu-4.0.0-4.7	4.0.7 works fine, issue not found here
4.1.0-1.1	Ubuntu-4.1.0-1.1	4.1.0 works fine, issue not found here
4.1.0-2.2	Ubuntu-4.1.0-2.2	4.1.3
4.1.0-3.3	Ubuntu-4.1.0-3.3	4.1.3 last known to me not having issue
4.2.0-7.7	Ubuntu-4.2.0-7.7	4.2.0 has issue
4.2.0-10.11	Ubuntu-4.2.0-10.11	4.2.0
4.2.0-10.12	Ubuntu-4.2.0-10.12	4.2.0 has issue
4.2.0-11.13	Ubuntu-4.2.0-11.13	4.2.1 has issue, also at log-in reporting an error with /usr/bin/Xorg
4.2.0-12.14	Ubuntu-4.2.0-12.14	4.2.1 
4.2.0-14.16	Ubuntu-4.2.0-14.16	4.2.2 has issue
4.2.0-15.18	Ubuntu-4.2.0-15.18	4.2.3
4.2.0-16.19	Ubuntu-4.2.0-16.19	4.2.3
4.2.0-17.21	Ubuntu-4.2.0-17.21	4.2.3
4.2.0-18.22	Ubuntu-4.2.0-18.22	4.2.3 has issue
4.2.0-19.23	Ubuntu-4.2.0-19.23	4.2.6
4.2.0-21.25	Ubuntu-4.2.0-21.25	4.2.6
4.2.0-22.27	Ubuntu-4.2.0-22.27	4.2.6

upstream kernel v4.3.0 has issue
upstream kernel v4.4.3 has issue
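
For anyone collecting results the same way, here is a minimal sketch (Python, not part of the original report) of how the last-good/first-bad boundary falls out of the list above. The dictionary only holds the mainline versions that have a stated result in this comment. The names TESTED, version_key and find_boundary are made up for illustration, and the 3.19.8-cktN entries are left out because this simple version parser does not handle the suffix.

```python
# Results copied from the list in this comment: True means "has issue".
TESTED = {
    "4.0.7": False,
    "4.1.0": False,
    "4.1.3": False,
    "4.2.0": True,
    "4.2.1": True,
    "4.2.2": True,
    "4.2.3": True,
    "4.3.0": True,
    "4.4.3": True,
}


def version_key(version):
    """Turn '4.1.3' into (4, 1, 3) so versions compare numerically."""
    return tuple(int(part) for part in version.split("."))


def find_boundary(results):
    """Return (last_good, first_bad) from a {version: has_issue} mapping.

    Either side is None when there is no such entry. This assumes the
    issue does not come and go between versions; if it does, last_good
    can sort after first_bad and the pair is not a clean boundary.
    """
    ordered = sorted(results, key=version_key)
    good = [v for v in ordered if not results[v]]
    bad = [v for v in ordered if results[v]]
    last_good = good[-1] if good else None
    first_bad = bad[0] if bad else None
    return last_good, first_bad


if __name__ == "__main__":
    print(find_boundary(TESTED))
```

Because the freezes are intermittent, a "good" entry only means no freeze was seen during the test, so the boundary is an estimate rather than a proof.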
Comment 50 julio.borreguero@gmail.com 2016-01-08 10:26:59 UTC
i have an acer aspire es1-711


i am on gentoo linux self compiled kernels.
same problem on my linux mint partition, it certainly is a kernel bug.

latest kernel to work fine is 4.1.12 (4.1.13 is reported to work as well, haven't tested it), absolutely stable.

any 4.2 or 4.4 kernels freeze the system, no traces, no reproduction scenarios.

i can't confirm the " intel_idle.max_cstate=1" workaround to be a solution.

tested it with kernel 4.4.0-rc6 and it froze after 3 days.
Comment 51 julio.borreguero@gmail.com 2016-01-08 10:32:39 UTC
Created attachment 198961 [details]
lspci -v Hostbridge and vga adapter output
Comment 52 Christian Wansart 2016-01-08 11:12:11 UTC
I have the same problem with an Acer Aspire ES1-311 on Ubuntu. I am currently running 4.1.13 with the intel_idle.max_cstate=1 workaround. A fix would be much better!
Comment 53 Mika Kuoppala 2016-01-08 16:05:45 UTC
Another long shot to try is to see if:

'intel_reg write 0xa168 0x0'

has any effect on occurrence.
Comment 54 kernelorg 2016-01-09 01:09:03 UTC
(In reply to Mika Kuoppala from comment #53)
> Another long shot to try is to see if:
> 
> 'intel_reg write 0xa168 0x0'
> 
> has any effect on occurrence.

I've had an issue with a Lenovo Yoga 2 where restarting GDM or switching to another vty would hang the system. This command fixed it and I haven't had a crash yet.
Comment 55 Bob George 2016-01-09 02:25:43 UTC
FYI. Here is another hang issue on Baytrail that is also fixed by limiting C states.

https://lkml.org/lkml/2015/3/24/271

As far as I can tell these issues have not made it in to the kernel at all.
Comment 56 Bob George 2016-01-09 02:39:23 UTC
These patches have not made it in to the kernel, I meant.
Comment 57 hartrumpf 2016-01-17 14:12:27 UTC
(In reply to Mika Kuoppala from comment #53)
> Another long shot to try is to see if:
> 
> 'intel_reg write 0xa168 0x0'
> 
> has any effect on occurrence.

The command seems to be a correct work-around for GB-BXBT-1900. Thanks a lot!
Mika, can you explain what this command does? Any problematic consequences (for power management ...)?
Comment 58 julio.borreguero@gmail.com 2016-01-17 14:33:30 UTC
i will try the 
intel_reg write 0xa168 0x0
on an acer aspire ES1-711 now and will give feedback as soon as the system freezes or in a few days otherwise.
Comment 59 julio.borreguero@gmail.com 2016-01-17 18:07:44 UTC
btw i just tried kernel 4.4.0 (latest stable git) without any parameters and without intel_reg write 0xa168 0x0
The system froze after ~1h.
Now running the same kernel with intel_reg.
will report shortly....
Comment 60 Vincent Frentzel 2016-01-17 23:14:22 UTC
Affected by this bug as well on a Jetway JBC311U93 Celeron N2930 (Bay Trail). The system was running perfectly fine for 6 months as a router until repurposed as an HTPC. 

Hard freezes always occur when playing back video (h264 with vaapi) under Kodi. I am running kernel 4.3.3.

Will happily test any patch/solution.
Comment 61 Juha Sievi-Korte 2016-01-18 07:12:59 UTC
Tried intel_reg write 0xa168 0x0 on Acer B-115M (Pentium N3540) with kernel 4.3.3; a hang happened within 20 minutes after reboot, so I guess there is no change and the occurrence is random.

Question: Should I be able to read 0x0 out from that same register? I mean:
cardhu:~ # intel_reg read 0xa168 
                                    (0x0000a168): 0x0000007a
cardhu:~ # intel_reg write 0xa168 0x0
cardhu:~ # intel_reg read 0xa168 
                                    (0x0000a168): 0x0000007a
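
The read-back check above can be sketched as a small Python helper. It is only an illustration: the output format is assumed to be exactly what is pasted in this comment (other intel_reg versions may print something different), and the function names are invented here.

```python
import re

# Matches lines such as "(0x0000a168): 0x0000007a" as pasted above.
READ_LINE = re.compile(r"\(0x([0-9a-fA-F]+)\):\s*0x([0-9a-fA-F]+)")


def parse_intel_reg_read(line):
    """Return (register, value) parsed from one line of read output."""
    match = READ_LINE.search(line)
    if match is None:
        raise ValueError("unrecognised read output: %r" % line)
    return int(match.group(1), 16), int(match.group(2), 16)


def write_took_effect(read_back_line, register, written_value):
    """True if the value read back equals the value that was written."""
    reg, value = parse_intel_reg_read(read_back_line)
    if reg != register:
        raise ValueError("read output is for a different register")
    return value == written_value


if __name__ == "__main__":
    after_write = "                                    (0x0000a168): 0x0000007a"
    print(write_took_effect(after_write, 0xA168, 0x0))
```

With the output from this comment, the register still reads 0x7a after writing 0x0, so the helper reports that the write did not stick.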
Comment 62 Alberto Salvia Novella 2016-01-18 08:17:26 UTC
Also reported in (https://bugs.launchpad.net/ubuntu/+source/linux/+bug/1511002).
Comment 63 Mika Kuoppala 2016-01-18 11:04:22 UTC
(In reply to Juha Sievi-Korte from comment #61)
> Tried intel_reg write 0xa168 0x0 on Acer B-115M (Pentium N3540) with kernel
> 4.3.3, hang happened within 20 mins after reboot, so I guess no change,
> occurrence is random.
> 
> Question: Should I be able to read 0x0 out from that same register? I mean:
> cardhu:~ # intel_reg read 0xa168 
>                                     (0x0000a168): 0x0000007a
> cardhu:~ # intel_reg write 0xa168 0x0
> cardhu:~ # intel_reg read 0xa168 
>                                     (0x0000a168): 0x0000007a

Yes, we should forget this crude hack as the register gets overwritten on boot
and also on normal operation when frequencies are changed.

I will submit a patch to try.
Comment 64 Mika Kuoppala 2016-01-18 11:09:40 UTC
Created attachment 200381 [details]
drm/i915/vlv: Always enable internal pm interrupts
Comment 65 julio.borreguero@gmail.com 2016-01-18 11:55:50 UTC
(In reply to Mika Kuoppala from comment #64)
> Created attachment 200381 [details]
> drm/i915/vlv: Always enable internal pm interrupts

concerning this:

> cardhu:~ # intel_reg write 0xa168 0x0
> cardhu:~ # intel_reg read 0xa168 

I can read and write that register, but it is constantly overwritten, as Mika says.
By my logic that means the "workaround" can't work, although my system hasn't frozen yet.
So I am now compiling kernel 4.4.0 with Mika's patch applied (manually).
I will post the result soon.
Comment 66 julio.borreguero@gmail.com 2016-01-18 23:48:29 UTC
mika, i tried your patches on 4.4.0 kernel
The system hard-froze the same :(
back to kernel 4.1.12....
Comment 67 jbMacAZ 2016-01-19 07:50:21 UTC
I tried the "drm/i915/vlv: Always enable internal pm interrupts" patch and it froze within 3 minutes on my T100CHI...

BUT, this did fix a bug of the CHI not remembering the backlight setting from the last session.  With this patch, the T100CHI powered up and dimmed to the last session level before launching the desktop.  Without this patch, the brightness slider would show reduced backlight, but it would not go into effect until the brightness was adjusted manually.

This patch is worth keeping, at least for the CHI, even though it has no effect on the freeze problem.
Comment 68 Pascal VITOUX 2016-01-19 17:58:24 UTC
After a bisect between 4.1 and 4.2-rc1, and running the kernel on a laptop with a N2930 CPU : 

Last commit without freeze (after running for 6 hours, I will retry for 24h or more to be sure) : 

 commit af2d94fddcf41e879908b35a8a5308fb94e989c5
 Author: Ingo Molnar <mingo@kernel.org>
 Date:   Thu Apr 23 17:34:20 2015 +0200

    x86/fpu: Use 'struct fpu' in fpu_reset_state()
    
    Migrate this function to pure 'struct fpu' usage.
    

The freezes happen (in less than one hour for each test) with the next commit  :

 commit cb8818b6acb45a4e0acc2308df216f36cc5b950c
 Author: Ingo Molnar <mingo@kernel.org>
 Date:   Thu Apr 23 17:39:04 2015 +0200

    x86/fpu: Use 'struct fpu' in switch_fpu_prepare()
    
    Migrate this function to pure 'struct fpu' usage.
Comment 69 Pascal VITOUX 2016-01-20 09:43:33 UTC
Sorry for the misinformation in my previous comment, but after retesting af2d94f I got a freeze after 15 hours.
Comment 70 jbMacAZ 2016-01-21 02:16:36 UTC
To clarify patch "drm/i915/vlv: Always enable internal pm interrupts" results.

Tested with 4.3.3 and 4.4.0 with hardware specific patches.  Backlight control is not yet available in the standard kernel for the ASUS T100 family.

Without this patch, my ASUS T100CHI always boots to full screen brightness.  With the patch, the backlight usually starts at the indicated setting.  

This patch does fix something for baytrail systems.  Thanks for the patch.
Comment 71 Jayant Sharma 2016-01-23 12:34:30 UTC
(In reply to mazout360 from comment #38)
> There's something strange with this bug...on my Q1900DC-ITX I tried every
> single version of the mainline kernel from 3.16 to 4.3. It still hangs on
> 4.0, it hangs on 4.2, but the whole 4.1 kernel version from 4.1.0 to 4.1.15
> is very stable. No need for the cstate configuration or any patch published
> here or on the other thread.

Recently re-installed arch on my X205TA.

Haven't come across any freezes on kernel 4.3.3-3 with the cstate param. But linux-lts 4.1.15-1 doesn't let me stay beyond 2 minutes, with just a couple of tabs open in the browser and otherwise doing nothing.
Comment 72 Johannes 2016-01-26 16:33:32 UTC
Hi everybody. 

Since December 2015 I have been following this bug, because I had system freezes (mostly while streaming), too. I use an ACER ES1-311 (intel GPU inside;-) with an up to date 4.3.3-3-ARCH. Unfortunately the intel_idle.max_cstate=1 did not do the trick for me. 
In the Arch Wiki, I found an interesting hint that improved my situation tremendously. Before, I regularly had system freezes after five minutes of streaming. Sometimes the freezes occurred after one hour at most. With this hint, I have not had a freeze for a couple of days, streaming for hours! Possibly my system is even fixed completely with this?! I want to share this with you guys - perhaps it helps you find a solution or improvement too.

If interested, you find information here in the arch wiki: 

        https://wiki.archlinux.org/index.php/Intel_graphics

Scroll down to the chapter: "X freeze/crash with intel driver". (Funnily, this bug is linked there at the bottom of the chapter with the intel_idle.max_cstate=1 workaround.)

It describes how GPU acceleration can be disabled. I also disabled the DRI option, because I do not play games on my machine.

That did it - or improved things a lot.

Probably on systems other than ARCH, there is a similar way to access and disable GPU acceleration.
Comment 73 Ayush Agrawal 2016-01-29 15:02:06 UTC
(In reply to Johannes from comment #72)
> Hi everybody. 
> 
> Since December 2015 I have been following this bug, because I had system
> freezes (mostly while streaming), too. I use an ACER ES1-311 (intel GPU
> inside;-) with an up to date 4.3.3-3-ARCH. Unfortunately the
> intel_idle.max_cstate=1 did not do the trick for me. 
> In Arch-Wiki, I found an interesting hint, that improved my situation
> tremendously. Before, I regularly had system freezes after five minutes
> streaming. Sometimes the freezes occurred after maximum one hour. With this
> hint, I have not had a freeze for a couple of days streaming for hours!
> Possibly, my system is even fixed completely with this?! I want to share
> this with you guys - probably this helps finding a solution or improvement
> for you too. 
> 
> If interested, you find information here in the arch wiki: 
> 
>         https://wiki.archlinux.org/index.php/Intel_graphics
> 
> Scroll down to the chapter: "X freeze/crash with intel driver". (Funnily,
> this bug is linked there at the bottom of the chapter with the
> intel_idle.max_cstate=1 workaround.)
> 
> Here is described how the GPU acceleration can be disabled. I also disabled
> the DRI option, because I do not play games on my machine. 
> 
> That did it - or improved a lot.
> 
> Probably on systems other than ARCH, there is a similar way to access and
> disable GPU acceleration.

Thank you so much for this.

I have a Dell Inspiron 3551 which has an Intel N3540 processor. I have been facing laptop freezing issues. This just fixed it.

I have Ubuntu Gnome 15.10 on it.

The steps I followed are these:

1. To boot into recovery mode, https://wiki.ubuntu.com/RecoveryMode (make sure to run the two mount commands)
2. To generate the config file for X (while in recovery mode), http://askubuntu.com/questions/4662/where-is-the-x-org-config-file-how-do-i-configure-x-there
3. Change the following lines in /etc/X11/xorg.conf (you can use nano):
3a. #Option "NoAccel" -> Option "NoAccel" "true"
3b. #Option "DRI" -> Option "DRI" "false"
4. Reboot and it's done :).
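
Step 3 can also be done as a small text transformation instead of by hand. The Python sketch below is only an illustration: it assumes the generated xorg.conf contains the two options as commented-out lines starting with #Option, as described in steps 3a and 3b, and the function name set_option is invented here. Back up the file before trying anything like this.

```python
import re


def set_option(conf_text, name, value):
    """Uncomment '#Option "<name>" ...' lines and give them a value.

    Lines that do not match are returned unchanged. The indentation in
    front of the '#' is kept.
    """
    pattern = re.compile(
        r'^([ \t]*)#[ \t]*Option[ \t]+"%s".*$' % re.escape(name),
        re.MULTILINE,
    )

    def replacement(match):
        return '%sOption "%s" "%s"' % (match.group(1), name, value)

    return pattern.sub(replacement, conf_text)


if __name__ == "__main__":
    # Hypothetical excerpt of a generated xorg.conf Device section.
    sample = (
        '        #Option     "NoAccel"    # [<bool>]\n'
        '        #Option     "DRI"        # [<bool>]\n'
    )
    edited = set_option(sample, "NoAccel", "true")
    edited = set_option(edited, "DRI", "false")
    print(edited)
```

The function works on a string, so the result can be inspected first and only written back to /etc/X11/xorg.conf once it looks right.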
Comment 74 julio.borreguero@gmail.com 2016-01-29 16:29:28 UTC
First of all, the intel_idle.max_cstate=1 solution didn't work for me either, but I said that earlier already.

To you guys disabling hardware acceleration with the info from the Arch Wiki:
why do you disable hardware acceleration if you can just install any 3.1 kernel (I use 3.1.12) and at the same time use hardware acceleration and DRI, without any system freezes?
That seems to me the much better solution.
Comment 75 julio.borreguero@gmail.com 2016-01-29 20:18:28 UTC
sorry of course i meant 4.1 kernel and not 3.1. i use 4.1.12
Comment 76 Travis Hall 2016-01-30 03:35:01 UTC
I have also been having this issue on my Lenovo 11e laptop with an Intel N2940 baytrail-m.  I am running Manjaro and have been having full system hangs (mouse stops moving, everything freezes, it doesn't even seem to dump any errors out in time) and application freezes (mostly vlc).  It seems to happen on battery or plugged in when running a video.  The only kernel that seems stable without limiting max_cstate to 1 seems to be 4.1.16-1.

Kernels that have given me issues:
4.4.0-4, 4.3.4-1, 4.2.8.2-1, 3.18.25-1

As a side note, hibernate seems to not work on most kernels, works on 4.4.0-4.  Not sure if it's related.
Comment 77 BzukTuk 2016-01-30 11:21:15 UTC
Hi, I'm also trying to bisect this issue.
The best way I have found to trigger a freeze almost instantly (on my Acer Switch 10 with Intel Atom Z3735F, on Ubuntu Gnome 15.10) is to run glxgears (from package mesa-utils) on one half of the screen, and a 415MB, 42 minute long *x264*.mp4 video in VLC on the other half of the screen. The freeze usually occurred within 2-5 minutes. On a few occasions I had to wait 15-20 minutes.

When I was running only VLC, I had to wait many hours until the freeze occurred. Sometimes the freeze did not occur even after 8 hours, whereas with glxgears it was a matter of minutes.

Can someone confirm that this method works for you?

I can confirm that kernels 4.1 and 4.1.15 work without problem (I did not test 4.1.16 yet). Kernel 4.2-rc1 first introduced the issue. I'm currently bisecting between 4.1 and 4.2-rc1, but I'm not sure if I tested merges right. When I executed "git bisect [good|bad]", the same output was written on the terminal as for "git bisect [good|bad]" one step back.

Note that I'm not testing pure vanilla kernels - I'm applying patches from Adrian Hunter of Intel from here https://github.com/hadess/rtl8723bs/tree/master/patches and small patches for keyboard and sound.
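
For readers new to bisecting, the idea behind it can be sketched in Python as a binary search over an ordered list of commits. This is a simplification, not how git implements it: real kernel history contains merges, which git bisect handles by itself, and the names first_bad and is_bad are invented for this sketch.

```python
def first_bad(commits, is_bad):
    """Return the first bad commit in a list ordered oldest to newest.

    Assumes commits[0] is known good, commits[-1] is known bad, and
    every commit after the first bad one is bad as well.
    is_bad(commit) stands for "build it, boot it, run the test".
    """
    good, bad = 0, len(commits) - 1
    while bad - good > 1:
        middle = (good + bad) // 2
        if is_bad(commits[middle]):
            bad = middle    # the culprit is at or before the middle
        else:
            good = middle   # the culprit is after the middle
    return commits[bad]


if __name__ == "__main__":
    history = list(range(1000))          # stand-in for commit ids
    print(first_bad(history, lambda c: c >= 613))
```

Each step halves the remaining range, so about ten test runs are enough for a thousand commits. The catch with this bug is that a freeze is intermittent: a commit marked good after too short a test run sends the search in the wrong direction.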
Comment 78 BzukTuk 2016-01-31 11:22:01 UTC
git bisect bad
cf5d8a46a001c9421c7397699db55f962e0410fc is the first bad commit

However, reverting this small commit in v4.2-rc1 did not solve the issue. The bisected kernel from the previous step (which was git bisect good) has been running glxgears and VLC without problem for around 11 hours now.
Comment 79 Travis Hall 2016-01-31 11:40:38 UTC
(In reply to BzukTuk from comment #78)
> git bisect bad
> cf5d8a46a001c9421c7397699db55f962e0410fc is the first bad commit
> 
> However, reverting this small commit in v4.2-rc1 did not solve the issue.
> The bisected kernel from the previous step (which was git bisect good) has
> been running glxgears and VLC without problem for around 11 hours now.

That could very well be connected to the problem.  My suspect was 099bfbfc7fbbe22356c02f0caf709ac32e1126ea given the amount of i915 changes that were merged into 4.2-rc1.
Comment 80 julio.borreguero@gmail.com 2016-01-31 12:51:56 UTC
is it confirmed that i915 is the problem ?
although it is the most obvious, i am just asking.

it is a kernel problem:
it is not xf86-video-intel, i tried all possible bridges there (sna, xaa and uxa)
i also tried the gallium driver with ilo-dri.
i tried different accel methods, buffers, module parameters at boot time for the i915 module like framebuffer, enable_rc6 power saving options, semaphores and pretty much all the options there are for that module.
i also used different versions of xf86-video-intel, compiled them all by myself.
the freezes still occurred.

to be able to do a meaningful bisect between kernel versions it is necessary to know which one is the last working kernel without the bug.
is it confirmed that all 4.1 kernel work and that 4.2-rc1 is the first faulty version ?

i will get the latest 4.1 stable kernel and test it over the next days.
if someone wants me to do further testing i am also available.
i am using gentoo linux and therefore compile everything on the machine.
Comment 81 jbMacAZ 2016-01-31 19:56:28 UTC
My ASUS T100-CHI has freeze problems with all version 4 kernels.  The history suggests that 3.16.7 was the last version freeze free (see also Freedesktop bug 88012.)  That said, freezes do occur more often on the CHI (a few minutes to an hour(s) vs. day(s)) starting with 4.2.  There definitely is an issue there (a new freeze or making the first one(s) worse)!

BTW the new DMA fix for 4.5 did not solve the CHI freeze problem when I attempted to back-port it to 4.4.  It froze within 2 minutes w/o cstate limit.  But 4.4.0 has numerous other hardware regressions relative to the CHI (stock kernel - no wifi, no touchscreen, flaky BT) so...
Comment 82 dertobi 2016-02-01 07:00:50 UTC
I think I can provide some insight into this bug, although not really a solution.

I have an Acer V3-111P featuring a N3530 processor. I got this machine in July 2014 when it was just released on the German market, because it was the first fanless laptop available.

First thing I did on it was to install Fedora and those random freezes started to appear. It drove me nuts, as my system ran no more than a couple of minutes at a time and never longer than 20 minutes. I searched for "linux random freezes" on google and found this phoronix thread where a guy had random freezes similar to mine, but nobody else in the linux kernel mailing list could reproduce it at the time. In the thread Linus Torvalds himself provided a patch he made on a hunch for the guy to test. So I applied the same patch to my 3.18 kernel and to my own surprise the crashes/freezes became a lot more infrequent. Since then my laptop runs usually at least a couple of hours and occasionally can run even a couple of weeks depending on the usage pattern I suppose. I can't find the patch in the linux kernel mailing list thread anymore. Fortunately I saved a copy locally.

Here's the phoronix thread: https://www.phoronix.com/scan.php?page=news_item&px=MTg1MDc

Unfortunately Linus's patch can't be applied to newer kernels as the particular code was changed quite a bit or even rewritten. But I think it still might give a hint how the problem could be solved or mitigated. If I understand Linus's patch correctly (and I've only a superficial understanding of it) it's a hack (Linus's own words) that corrects goofy jumps that can happen between "timekeeping" cycles.

Here's Linus's patch that I applied to the 3.18 kernel.

diff --git a/include/linux/timekeeper_internal.h b/include/linux/timekeeper_internal.h
index 95640dc..7b14fd3 100644
--- a/include/linux/timekeeper_internal.h
+++ b/include/linux/timekeeper_internal.h
@@ -32,6 +32,7 @@ struct tk_read_base {
 	cycle_t			(*read)(struct clocksource *cs);
 	cycle_t			mask;
 	cycle_t			cycle_last;
+	cycle_t			cycle_error;
 	u32			mult;
 	u32			shift;
 	u64			xtime_nsec;
diff --git a/kernel/time/timekeeping.c b/kernel/time/timekeeping.c
index ec1791f..1e2722f 100644
--- a/kernel/time/timekeeping.c
+++ b/kernel/time/timekeeping.c
@@ -140,6 +140,7 @@ static void tk_setup_internals(struct timekeeper *tk, struct clocksource *clock)
 	tk->tkr.read = clock->read;
 	tk->tkr.mask = clock->mask;
 	tk->tkr.cycle_last = tk->tkr.read(clock);
+	tk->tkr.cycle_error = 0;
 
 	/* Do the ns -> cycle conversion first, using original mult */
 	tmp = NTP_INTERVAL_LENGTH;
@@ -197,11 +198,17 @@ static inline s64 timekeeping_get_ns(struct tk_read_base *tkr)
 	s64 nsec;
 
 	/* read clocksource: */
-	cycle_now = tkr->read(tkr->clock);
+	cycle_now = tkr->read(tkr->clock) + tkr->cycle_error;
 
 	/* calculate the delta since the last update_wall_time: */
 	delta = clocksource_delta(cycle_now, tkr->cycle_last, tkr->mask);
 
+	/* Hmm? This is really not good, we're too close to overflowing */
+	if (unlikely(delta > (tkr->mask >> 3))) {
+		tkr->cycle_error = delta;
+		delta = 0;
+	}
+
 	nsec = delta * tkr->mult + tkr->xtime_nsec;
 	nsec >>= tkr->shift;
 
@@ -455,6 +462,16 @@ static void timekeeping_update(struct timekeeper *tk, unsigned int action)
 	update_fast_timekeeper(tk);
 }
 
+static void check_cycle_error(struct tk_read_base *tkr)
+{
+	cycle_t error = tkr->cycle_error;
+
+	if (unlikely(error)) {
+		tkr->cycle_error = 0;
+		pr_err("Clocksource %s had cycles off by %llu\n", tkr->clock->name, error);
+	}
+}
+
 /**
  * timekeeping_forward_now - update clock to the current time
  *
@@ -471,6 +488,7 @@ static void timekeeping_forward_now(struct timekeeper *tk)
 	cycle_now = tk->tkr.read(clock);
 	delta = clocksource_delta(cycle_now, tk->tkr.cycle_last, tk->tkr.mask);
 	tk->tkr.cycle_last = cycle_now;
+	check_cycle_error(&tk->tkr);
 
 	tk->tkr.xtime_nsec += delta * tk->tkr.mult;
 
@@ -1181,6 +1199,7 @@ static void timekeeping_resume(void)
 
 	/* Re-base the last cycle value */
 	tk->tkr.cycle_last = cycle_now;
+	tk->tkr.cycle_error = 0;
 	tk->ntp_error = 0;
 	timekeeping_suspended = 0;
 	timekeeping_update(tk, TK_MIRROR | TK_CLOCK_WAS_SET);
@@ -1528,11 +1547,15 @@ void update_wall_time(void)
 	if (unlikely(timekeeping_suspended))
 		goto out;
 
+	check_cycle_error(&real_tk->tkr);
+
 #ifdef CONFIG_ARCH_USES_GETTIMEOFFSET
 	offset = real_tk->cycle_interval;
 #else
 	offset = clocksource_delta(tk->tkr.read(tk->tkr.clock),
 				   tk->tkr.cycle_last, tk->tkr.mask);
+	if (unlikely(offset > (tk->tkr.mask >> 3)))
+		pr_err("Cutting it too close for %s in in update_wall_time (offset = %llu)\n", tk->tkr.clock->name, offset);
 #endif
 
 	/* Check if there's really nothing to do */
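
As a reading aid, the core check of the patch above can be modelled in a few lines of Python. This is only a model, not kernel code: it assumes clocksource_delta() is the plain masked subtraction (now - last) & mask, and the name checked_delta is invented for this sketch. It shows the threshold test added to timekeeping_get_ns(): a delta larger than one eighth of the counter range (mask >> 3) is treated as bogus, recorded as an error, and replaced by 0.

```python
def clocksource_delta(now, last, mask):
    """Cycles elapsed between two counter reads, wrapped to the mask."""
    return (now - last) & mask


def checked_delta(now, last, mask):
    """Return (delta, cycle_error) following the check in the patch.

    A small forward step is passed through unchanged. A step bigger
    than mask >> 3, which is what a counter going slightly backwards
    looks like after masking, is reported as an error and the delta
    used for the time calculation becomes 0.
    """
    delta = clocksource_delta(now, last, mask)
    if delta > (mask >> 3):
        return 0, delta
    return delta, 0


if __name__ == "__main__":
    mask = 0xFFFFFFFF
    print(checked_delta(1000, 900, mask))   # normal forward step
    print(checked_delta(900, 1000, mask))   # counter went backwards
```

The second call is the interesting case: a counter that steps back by 100 cycles turns into a huge positive delta after masking, and without the check the clock would jump far into the future.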
Comment 83 Johannes 2016-02-01 11:35:20 UTC
(In reply to julio.borreguero@gmail.com from comment #75)
> sorry of course i meant 4.1 kernel and not 3.1. i use 4.1.12

You are right, Julio. Downgrading the kernel works without disabling hardware acceleration. I managed to downgrade to 4.1.6-1 and have not had a freeze yet. Before, I was not able to downgrade the kernel - I did it wrong, because I am new to this. Anyway, a lot of people have posted kernel versions that freeze and kernel versions that worked fine. I can add that my ACER Aspire ES1-311 seems to work with kernel 4.1.6-1.
Comment 84 Dmitry 2016-02-01 20:44:46 UTC
Tested with the latest 4.5-rc2 kernel. Got a hard lockup after one hour, this time neither in the browser nor in the video player: I was emerging linux-firmware while looking through the linux kernel nconfig.
But this time I added console=/dev/ttyUSB0,115200 and got some (maybe) useful information.

1) Right after boot I ended up with refined-jiffies:

clocksource: timekeeping watchdog on CPU1: Marking clocksource 'tsc' as unstable because the skew is too large:
clocksource:'refined-jiffies' wd_now: fffb77c9 wd_last: fffb75d5 mask: ffffffff
clocksource:'tsc' cs_now: 2d666de2c cs_last: 29a343f3d mask: ffffffffffffffff
clocksource: Switched to clocksource refined-jiffies

And got this lockup:

NMI watchdog: Watchdog detected hard LOCKUP on cpu 0
Modules linked in: aesni_intel xts aes_i586 lrw ablk_helper cryptd pcspkr mac_hid snd_intel_sst_acpi crc32c_intel ath6kl_sdio
CPU: 0 PID: 0 Comm: swapper/0 Tainted: G        W       4.5.0-rc2-dirty #322
Hardware name: Dell Inc. Venue 11 Pro 5130/05FF9P, BIOS A15 01/20/2016
 00000000 c12a9b9f 00000000 c1118b3d c1be49d4 00000000 edc0b400 c1118a00
 c112e367 00000003 96ac4d66 fffffffc 00000000 c1ce1dc0 00000000 f77afae0
 00000001 c1ce1efc c112ec02 c1ce1f5c c10622b6 00000000 00000000 22f3ec2c
Call Trace:
 [<c12a9b9f>] ? dump_stack+0x48/0x79
 [<c1118b3d>] ? watchdog_overflow_callback+0x13d/0x150
 [<c1118a00>] ? watchdog_enable_all_cpus+0xb0/0xb0
 [<c112e367>] ? __perf_event_overflow+0xb7/0x280
 [<c112ec02>] ? perf_event_overflow+0x12/0x20
 [<c10622b6>] ? intel_pmu_handle_irq+0x1e6/0x3e0
 [<c10b864f>] ? enqueue_entity+0x2ff/0xe80
 [<c10b9214>] ? enqueue_task_fair+0x44/0xd40
 [<c10b361e>] ? select_task_rq_fair+0x44e/0x850
 [<c1097399>] ? __send_signal+0x189/0x310
 [<c10a5c97>] ? raw_notifier_call_chain+0x17/0x20
 [<c10ec7bb>] ? timekeeping_update+0x11b/0x1b0
 [<c1915c5f>] ? _raw_write_unlock_irqrestore+0xf/0x30
 [<c10eead3>] ? update_wall_time+0x303/0xb70
 [<c1915c5f>] ? _raw_write_unlock_irqrestore+0xf/0x30
 [<c112a89e>] ? perf_event_task_tick+0x4e/0x2a0
 [<c1059696>] ? perf_event_nmi_handler+0x26/0x40
 [<c1049ec4>] ? nmi_handle+0x44/0xa0
 [<c15c9252>] ? poll_idle+0x32/0x70
 [<c104a443>] ? default_do_nmi+0x53/0x230
 [<c104a6bf>] ? do_nmi+0x9f/0xd0
 [<c1916ea7>] ? nmi_stack_correct+0x2f/0x34
 [<c10e00d8>] ? rcu_sync_func+0x38/0x90
 [<c15c9252>] ? poll_idle+0x32/0x70
 [<c15c8ce4>] ? cpuidle_enter_state+0x134/0x270
 [<c10c474c>] ? cpu_startup_entry+0x1ac/0x250
 [<c15626cd>] ? usb_find_interface+0x2d/0x50
 [<c1d57a92>] ? start_kernel+0x39d/0x3a4
perf interrupt took too long (3896 > 2500), lowering kernel.perf_event_max_sample_rate to 50000
clocksource: Switched to clocksource tsc


I had to switch to tsc manually in order to use the tablet at all.

2) Another bug:
------------[ cut here ]------------
WARNING: CPU: 2 PID: 3158 at drivers/base/power/common.c:150 dev_pm_domain_set+0x54/0x60()
PM domains can only be changed for unbound devices
Modules linked in: aesni_intel xts aes_i586 lrw ablk_helper cryptd pcspkr mac_hid snd_intel_sst_acpi crc32c_intel ath6kl_sdio(-)
CPU: 2 PID: 3158 Comm: rmmod Tainted: G        W       4.5.0-rc2-dirty #322
Hardware name: Dell Inc. Venue 11 Pro 5130/05FF9P, BIOS A15 01/20/2016
 00000009 c12a9b9f ecd8dec8 c108d662 c1c4adf4 ecd8dee0 00000c56 c1c17ffd
 00000096 c14b1b54 c14b1b54 d0008604 00000000 00000000 bfc7cd88 c108d6c3
 00000009 ecd8dec8 c1c4adf4 ecd8dee0 c14b1b54 c1c17ffd 00000096 c1c4adf4
Call Trace:
 [<c12a9b9f>] ? dump_stack+0x48/0x79
 [<c108d662>] ? warn_slowpath_common+0x82/0xb0
 [<c14b1b54>] ? dev_pm_domain_set+0x54/0x60
 [<c14b1b54>] ? dev_pm_domain_set+0x54/0x60
 [<c108d6c3>] ? warn_slowpath_fmt+0x33/0x40
 [<c14b1b54>] ? dev_pm_domain_set+0x54/0x60
 [<c132b612>] ? acpi_dev_pm_detach+0x2d/0x6b
 [<c14b1a86>] ? dev_pm_domain_detach+0x16/0x20
 [<c15d6523>] ? sdio_bus_remove+0x83/0xf0
 [<c14a9ef8>] ? __device_release_driver+0x78/0x120
 [<c14aa67f>] ? driver_detach+0x8f/0xa0
 [<c14a9a68>] ? bus_remove_driver+0x38/0x90
 [<c10fdd78>] ? SyS_delete_module+0x158/0x220
 [<c11b163d>] ? mntput_no_expire+0xd/0x180
 [<c10a3a74>] ? task_work_run+0x74/0x90
 [<c100100b>] ? exit_to_usermode_loop+0x8b/0xc0
 [<c1001500>] ? do_fast_syscall_32+0x80/0x130
 [<c19161f8>] ? sysenter_past_esp+0x3d/0x5d
---[ end trace 969e6d42685aab80 ]---

I believe it's connected with ath6kl. I added my sdio card id to ath6kl_sdio.c. I'll try without any custom patches.

3) The last line before the complete hard lockup was:


perf interrupt took too long (5007 > 5000), lowering kernel.perf_event_max_sample_rate to 25000


Before it there were only cfg80211's regulatory domain changes, IPv6 link not ready and ath6kl's stuff.
Also I wasn't able to reboot using sysrq at all.


Full log is here: http://pastebin.ca/3363040
Comment 85 BzukTuk 2016-02-01 22:06:32 UTC
Created attachment 202701 [details]
Kernel bisection between v4.2 v4.1 for sudden freezes

Hi, small update.

My first bisect was from 4.1 to 4.2-rc1, and [cf5d8a46a001c9421c7397699db55f962e0410fc] was flagged as the first bad commit. But I was not so sure that I did the bisection properly.

So today I made a second bisection - git bisect start v4.2 v4.1. The bisection process went without problem/confusion/doubt (unlike my first attempt). The last git bisect was good on commit cf5d8a46a001c9421c7397699db55f962e0410fc (after 90 minutes of glxgears and vlc). Git pointed out that the first bad commit was:

[8fb55197e64d5988ec57b54e973daeea72c3f2ff] drm/i915: Agressive downclocking on Baytrail

Then, on top of commit cf5d8a46.., I cherry-picked 8fb55197 and this kernel froze after 3 minutes.

More cherry-picking/testing tomorrow. Sorry if my previous post caused confusion or unnecessary work.
Today's 'git bisect log' is in the attachment.
Comment 86 ceric 2016-02-02 19:28:14 UTC
Hello everybody, I use the 15.10 (x64) version, and the only way I found to use my laptop, an Asus X751MJ-TY005H powered by an N3540, is to pass a kernel boot parameter.
https://wiki.ubuntu.com/Kernel/KernelBootParameters
I have now been using my laptop with the stock kernel 4.2.0.27-generic for two days without a freeze. I mainly listen to music, browse the web and read my mail.
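
To check that a boot parameter really reached the running kernel, its command line can be inspected. A minimal Python sketch follows. The comment above does not name the parameter, so the example uses intel_idle.max_cstate=1 from this bug's title; the function names are invented, and returning the last occurrence rests on the assumption that a later setting overrides an earlier one.

```python
def get_kernel_param(cmdline, name):
    """Return the value of a 'name=value' token on a kernel command line.

    Returns "" for a bare flag without '=', and None if the parameter
    is absent. If it appears several times, the last one is returned.
    """
    found = None
    for token in cmdline.split():
        key, separator, value = token.partition("=")
        if key == name:
            found = value if separator else ""
    return found


def read_cmdline(path="/proc/cmdline"):
    """Read the command line the running kernel was booted with."""
    with open(path) as handle:
        return handle.read().strip()


if __name__ == "__main__":
    example = "BOOT_IMAGE=/vmlinuz root=/dev/sda1 quiet intel_idle.max_cstate=1"
    print(get_kernel_param(example, "intel_idle.max_cstate"))
```

On a live system, get_kernel_param(read_cmdline(), "intel_idle.max_cstate") gives the value the kernel was actually started with.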
Comment 87 Dmitry 2016-02-07 18:58:49 UTC
Latest git kernel works for me without freezes. Tested for about a week with very heavy workloads (glxgears, youtube in firefox, mpv with 1080p and kernel compiling in 4 threads at the same time). There is only one flaw: I had to add my wifi (ath6kl_sdio with a custom patch adding a new ID) to the blocklist. Modprobing it leads to a freeze within minutes.
Comment 88 Travis Hall 2016-02-08 03:17:35 UTC
(In reply to Dmitry from comment #87)
> Latest git kernel works for me without freezes. Tested for about a week and
> very hard workflows (glxgears,youtube in firefox, mpv with 1080p and kernel
> compiling in 4 threads at the same time). There is only one flaw: I had to
> add my wifi(ath6kl_sdio with custom patch adding new ID) to blocklist.
> Modprobing it leads to freeze in minutes.

What commit is working fine for you?  I'm very curious because 4.5-rc2 exhibited the issue for me and it would help in bisecting.  

Also I just compiled 4.5-rc3 and I'm testing the stability on the Celeron N2940.
Comment 89 julio.borreguero@gmail.com 2016-02-08 16:12:31 UTC
$ uname -a
Linux shiva 4.5.0-rc1 #18 SMP Mon Feb 8 10:09:09 ART 2016 x86_64 Intel(R) Celeron(R) CPU N2940 @ 1.83GHz GenuineIntel GNU/Linux

i am running latest stable kernel 4.5.0-rc1 on N2940 for a few hours.
I did some stress-testing running parallel vlc and glxgears plus did loads of other stuff at the same time.
May it really be that the bug is finally fixed ?
i will give feedback as soon as my system freezes or in a few days otherwise.
Comment 90 julio.borreguero@gmail.com 2016-02-08 16:27:44 UTC
short fun, it froze :(
Comment 91 BzukTuk 2016-02-08 18:46:24 UTC
:-)
Today I tested linux-v4.5-rc[1-3] on the Acer Aspire Switch 10 - freezes occurred on all of them.

From my bisect, the last good commit seems to be [cf5d8a46a001c9421c7397699db55f962e0410fc] - glxgears and VLC ran for 18 hours 20 minutes without problem (then I got bored and powered it off).

Commit [8fb55197e64d5988ec57b54e973daeea72c3f2ff] introduced the problem - vlc&glxgears froze laptop in 3 minutes.

Unfortunately my git&c skills are not good enough to revert this [8fb5519..] commit in whole releases (like in 4.2-rc1 or 4.2) because of additional changes. Biggest problem with "git revert 8fb5519.." is in file "drivers/gpu/drm/i915/intel_pm.c" - there is over 30 commits (some of them merges) changing this file between 8fb5519 and 4.2-rc1 or 4.2 kernel. Could someone look into that? 
Thanks
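
For anyone unfamiliar with how a bisect narrows things down to a single commit like the one above: it is a binary search over the history between a known-good and a known-bad kernel. Below is a minimal sketch in Python, not how git implements it. The function name, the linear commit list and the `is_bad` callback (standing in for "build and boot this commit, run glxgears and VLC, see whether it freezes") are all assumptions made for illustration.

```python
from typing import Callable, Sequence, Tuple


def first_bad_commit(commits: Sequence[str],
                     is_bad: Callable[[str], bool]) -> Tuple[str, int]:
    """Return (first bad commit, number of test runs needed).

    Assumptions: commits are ordered oldest to newest, commits[0] is
    known good, commits[-1] is known bad, and every commit after the
    first bad one is bad as well.
    """
    good, bad = 0, len(commits) - 1
    tests = 0
    while bad - good > 1:
        mid = (good + bad) // 2
        tests += 1
        if is_bad(commits[mid]):
            bad = mid      # the culprit is at mid or earlier
        else:
            good = mid     # the culprit is after mid
    return commits[bad], tests


if __name__ == "__main__":
    # Hypothetical history of 1000 commits; commit 618 introduces the freeze.
    history = ["c%d" % i for i in range(1000)]
    culprit, runs = first_bad_commit(history, lambda c: int(c[1:]) >= 618)
    print(culprit, "found after", runs, "test runs")
```

The point of the sketch is the cost: each test run halves the remaining range, so even a long history needs only a handful of builds, but an intermittent freeze can make a bad commit look good and send the search the wrong way.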
Comment 92 Dmitry 2016-02-09 17:14:18 UTC
4.5.0-rc3: 4 hours of films, glxgears and browsing until the battery was dead, without any hint of a freeze. For me 4.5.0-rc2 and later are much more stable than any other kernel, even the 4.1.y branch. Recently I got several freezes on the 4.1.17 kernel and then switched to latest git.
On the cmdline I have only this: tsc=reliable clocksource=tsc. As for patches, I have a fix for ASoC channels, an ath6kl enable patch and a soc_button_array patch. Nothing special related to i915 or CPU (cstate). Also, as I mentioned before, I blacklisted my wifi (I use a USB wifi stick and USB ethernet).
I have an idea that these freezes might be connected with either clock or power instability. On the Baytrail platform we do not have a reliable HPET, and the TSC also seems unstable. As for power, I observe freezes when there are changes in GPU or CPU states. While the tablet is working on a task it works perfectly, but when the task ends there is a non-zero chance of a freeze. The same goes for when we start doing anything after the tablet has been idle for a while. To me it looks like there is not enough voltage during frequency changes, like what we see with undervolting. It is possible, because we see hard lockups, but it is just a guess. I do not know whether the Baytrail platform offers any software API for tuning voltage. Windows works stably regardless of workload.
P.S. These freezes might also be connected with mmc. Wifi is connected through it, and bluetooth does not work for me at all. Only internal storage and the external mmc card.
Comment 93 jbMacAZ 2016-02-10 02:33:44 UTC
(In reply to BzukTuk from comment #91)
> :-)
> Today I tested on Acer Aspire Switch 10 linux-v4.5-rc[1-3] - freezes occured
> on all of them.
> 
> From my bisect last good commit seems to be
> [cf5d8a46a001c9421c7397699db55f962e0410fc] - glxgears and VLC was running
> for 18hours 20minutes without problem (then I got bored and powered it off). 
> 
> Commit [8fb55197e64d5988ec57b54e973daeea72c3f2ff] introduced the problem -
> vlc&glxgears froze laptop in 3 minutes.
> 
> Unfortunately my git&c skills are not good enough to revert this [8fb5519..]
> commit in whole releases (like in 4.2-rc1 or 4.2) because of additional
> changes. Biggest problem with "git revert 8fb5519.." is in file
> "drivers/gpu/drm/i915/intel_pm.c" - there is over 30 commits (some of them
> merges) changing this file between 8fb5519 and 4.2-rc1 or 4.2 kernel. Could
> someone look into that? 
> Thanks

The legacy-turbo patch does a fine job of disabling this commit. (https://github.com/OpenBricks/openbricks/blob/master/packages/system/linux/patches/4.0/linux-999-i915-use-legacy-turbo.patch) - edit as needed

Since 4.2.6, my ASUS T100-CHI usually freezes within 5 minutes without max_cstate=1.  Because of your bisect, I tried the legacy-turbo patch on 4.5-rc3.  Before patching, my CHI ran 4.5-rc3 29 minutes before freezing (no cstate.), better than 5, but...  After patching, I haven't had a freeze in over 5 hours so far (no cstate argument).  So, there must be at least 2 freeze bugs!  LPSS, aggressive down-clocking and another one still lurking somewhere around wifi/mmc.  Not to mention the GPU hang previously fixed in 4.2.6.

Usual disclaimers, YMMV.  My kernels have a few T100 hardware specific patches.  Still a bit early to declare success, but this is promising.
Comment 94 jbMacAZ 2016-02-10 05:07:18 UTC
Eating crow already.  My second 4.5 test without cstate froze after 5 hours.  5 minutes to 29 minutes to 5 hours is a huge improvement, but it is not the whole solution.  There is still another one out there.

(ASUS T100-CHI, kernel-4.5-rc3 + Legacy-turbo patch + T100 specific patches, 
intel_idle.max_cstate=1 does not freeze)
Comment 95 julio.borreguero@gmail.com 2016-02-10 09:04:17 UTC
I tried 4.5.0-rc3 on an N2940 (Acer ES1-711).
With tsc=reliable clocksource=tsc on the cmdline => freeze after a few hours.
I could try the turbo patch.
intel_idle.max_cstate=1 never worked for me on any kernel that freezes.
Comment 96 Henry Groover 2016-02-11 01:39:01 UTC
I've run several kernel versions on a Jetway JBC311U93 Celeron N2930 (Bay Trail). In all cases I've had intermittent lockups after anything from 1 hour of runtime up to 2 weeks. I mostly run with both HDMI ports connected but little or no video acceleration in use.

Kernels I've used include:
3.19 (built by Yocto poky)
4.1.13
4.2.6
4.3.3
4.4.0

Currently I'm running 4.4.1.

I've observed intermittent lockups on the abovementioned hardware on ALL of these kernels. I see no activity on USB other than the clock pulse, which I interpret (perhaps incorrectly) as no signs of life from the SoC.

I've never gotten any useful core dumps or kernel panics when the lockup occurs - the system becomes completely unresponsive.

Building mplayer2 and playing h.264 high profile videos continuously, I seem to get lockups far more consistently, usually within no more than 24 hours. Previous tests I've run to try to induce failure have all been fruitless. One of the symptoms of the lockup has included very high junction temperatures (up to 98C; the rated maximum junction temp is 110C) when it is in the hung state, and in some cases reboot (via a Fintek chipset watchdog) does not clear the hung state.  My previous efforts focused on stressing CPU load and the SSD disk device. However, exercising the GPU seems to yield higher failure rates.

Running with intel_idle.max_cstate=1, I have gotten no lockups so far. It's too early to declare a victory but this is definitely promising.
Comment 97 jbMacAZ 2016-02-12 19:45:02 UTC
I've added Dnitry's tsc arguments to my custom kernel (4.5-rc3(w/LPSS) + legacyturbo & t100 patches). Best test run yet without cstate: > 21 hours and counting.  May it keep running after this post.

They may be interrelated, but there are still more freeze bugs.  Since max_cstate=n doesn't avoid all freezes, at least one is outside of power-saving.  Not all Atom platforms are affected by each possible freeze.  The T100-CHI is sensitive to several, but cstate has been a reliable workaround.

I can try other patches or kernel arguments, if they are posted here.  4.3.5 runs quite well, but freezes readily if I omit cstate.  Now that the LPSS updates have been included in 4.5, typical freezing takes several times longer, making it less suitable for rapid testing.  4.4.x does not work well for me, hardware regressed - no wifi (w/o patching) or touchscreen.
Comment 98 BzukTuk 2016-02-13 16:09:15 UTC
(In reply to jbMacAZ from comment #93)
> (In reply to BzukTuk from comment #91)
> The legacy-turbo patch does a fine job of disabling this commit.
> (https://github.com/OpenBricks/openbricks/blob/master/packages/system/linux/
> patches/4.0/linux-999-i915-use-legacy-turbo.patch) - edit as needed

Thank you jbMacAZ, with this patch I had no freeze during 24+ hours of running glxgears, VLC, and YouTube in Firefox. Kernel 4.4.1 with the mmc and pm-qos patches from (https://github.com/hadess/rtl8723bs/tree/master/patches), plus linux-999-i915-use-legacy-turbo.patch and a small change in the snd drivers. No kernel parameters. During those 24 hours the tablet went into hibernation a few times (low battery), and after resume glxgears and VLC still worked. The wifi module needs a reload after resume from hibernation - YouTube started playing after F5 :)
Comment 99 jbMacAZ 2016-02-17 04:41:22 UTC
No freeze observed (47hrs) with tsc arguments, but my bluetooth inactivity timeouts became erratic.  On the T100-CHI, the keyboard is linked via bluetooth, so unreliable timeouts affect usability.  I won't be using the tsc arguments as an alternate workaround to max_cstate.  YMMV.
Comment 100 John A. 2016-02-17 14:37:48 UTC
I may be running into this bug as well using a Celeron N3150 (Braswell).

I've tried:
* Ubuntu Server 15.10 with generic 4.2.0-* kernels
* Arch with 4.4.1-* kernels (console only, no X)

Both setups caused similar halts and spontaneous reboots, almost always without any logs generated except to the screen. I saw watchdog errors about stalled cores and some other errors that I can't recall offhand (but may have written down at home, will check tonight).

So far, Arch with lts kernel 4.1.(17?) seems to be running better, although not without an occasional issue. I'm trying intel_idle.max_cstate=2 right now and can report back. Will be curious to see if it helps, as C2 isn't explicitly listed as a C-state for the N3150 (only C0, C1, C6, and C7). I'll try max_cstate=1 after this trial as well.

My thanks to everyone tracking and reporting on this issue. It's been super informative and helpful as I've been trying to figure out what's happening with this box.
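
For checking which idle states the kernel actually exposes on a given box (for example, whether a "C2" exists at all), the cpuidle sysfs directory can be read. A minimal sketch, assuming the usual /sys/devices/system/cpu/cpu0/cpuidle layout with one stateN/name file per state; the function name is made up here, and the state names differ per CPU and idle driver.

```python
from pathlib import Path
from typing import List, Tuple


def list_cpuidle_states(
    cpuidle_dir: str = "/sys/devices/system/cpu/cpu0/cpuidle",
) -> List[Tuple[int, str]]:
    """Return [(index, name), ...] for every stateN directory, sorted by index.

    Returns an empty list if the directory does not exist (for example
    when no cpuidle driver is loaded).
    """
    base = Path(cpuidle_dir)
    if not base.is_dir():
        return []
    states = []
    for state_dir in base.glob("state[0-9]*"):
        name_file = state_dir / "name"
        if name_file.is_file():
            index = int(state_dir.name[len("state"):])
            states.append((index, name_file.read_text().strip()))
    return sorted(states)


if __name__ == "__main__":
    for index, name in list_cpuidle_states():
        print("state%d: %s" % (index, name))
```

Comparing the output with and without the max_cstate argument on the same machine shows what the limit really removes, which is more reliable than guessing from the CPU's spec sheet.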
Comment 101 Daniel Glöckner 2016-02-19 13:41:09 UTC
I'm seeing these freezes on a Z3745.

While reading the comments I get the feeling that we are mixing up two problems.
BayTrail-T in current kernels only has one real clocksource - the tsc. By default the kernel will compare this clocksource to the refined-jiffies clocksource. But as refined-jiffies is unreliable (at least on non-rt kernels), the kernel often gets the impression that it can't rely on the tsc. When this happens the kernel switches to the refined-jiffies clocksource and starts to become sluggish. After a short time "sleep 1" will take forever and you are lucky if you have an open root shell where you can set the clocksource back to tsc. The official fix in Intel's Android kernel is to set the tsc as reliable.

It is definitely a bug that refined-jiffies results in this behaviour, but it is not related to the freezes we see on BayTrail.
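
To tell the clocksource fallback apart from a real freeze, the active clocksource can be read from sysfs while the machine is still responsive. A minimal sketch, assuming the standard clocksource0 sysfs directory; the function names are made up here, and the "tsc is still listed but no longer current" check is only a heuristic for the situation described above.

```python
from pathlib import Path
from typing import List, Optional, Tuple

CLOCKSOURCE_DIR = "/sys/devices/system/clocksource/clocksource0"


def read_clocksource(base: str = CLOCKSOURCE_DIR) -> Tuple[Optional[str], List[str]]:
    """Return (current clocksource, available clocksources).

    Returns (None, []) if the sysfs files cannot be read.
    """
    try:
        current = (Path(base) / "current_clocksource").read_text().strip()
        available = (Path(base) / "available_clocksource").read_text().split()
    except OSError:
        return None, []
    return current, available


def switched_away_from_tsc(current: Optional[str], available: List[str]) -> bool:
    """Heuristic: tsc is still offered, but the kernel is using something else."""
    return current is not None and current != "tsc" and "tsc" in available


if __name__ == "__main__":
    cur, avail = read_clocksource()
    print("current:", cur, "available:", avail)
    if switched_away_from_tsc(cur, avail):
        print("kernel is not using tsc; sluggishness may be the clocksource issue")
```

Writing "tsc" to current_clocksource as root is the manual recovery mentioned above; the sketch deliberately only reads.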
Comment 102 jbMacAZ 2016-02-20 07:09:56 UTC
Thank you for the clarification on tsc.  I have seen that sluggishness twice where the screen refreshes once every 20-30 seconds. 4.5rcx or 4.4.x needs to run overnight to get that bad.  

So my kernel args should be tsc=reliable and intel_idle.max_cstate={1,0}.  Then nothing bad should happen (no excessive latency, no freezes)?
Comment 103 Daniel Glöckner 2016-02-24 13:42:32 UTC
(In reply to jbMacAZ from comment #102)
> So my kernel args should be tsc=reliable and intel_idle.max_cstate={1,0}. 
> Then nothing bad should happen (no excessive latency, no freezes)?

You should also apply the patches mentioned in comment 55.
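
Since several reports in this thread depend on whether the workaround arguments were really in effect, it can help to check the running kernel's command line before posting results. A minimal sketch, assuming /proc/cmdline is readable; the function names are made up here, and the parser is deliberately simple (it splits on whitespace and does not handle quoted values containing spaces).

```python
from typing import Dict, Optional


def parse_cmdline(cmdline: str) -> Dict[str, Optional[str]]:
    """Split a kernel command line into {parameter: value or None}."""
    params: Dict[str, Optional[str]] = {}
    for token in cmdline.split():
        key, sep, value = token.partition("=")
        params[key] = value if sep else None
    return params


def read_cmdline(path: str = "/proc/cmdline") -> str:
    """Return the running kernel's command line, or '' if unreadable."""
    try:
        with open(path) as handle:
            return handle.read()
    except OSError:
        return ""


if __name__ == "__main__":
    params = parse_cmdline(read_cmdline())
    for name in ("intel_idle.max_cstate", "tsc", "clocksource"):
        print(name, "=", params.get(name, "<not set>"))
```

Including this output (plus uname -a) in a report makes it easier to compare results between machines.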
Comment 104 Vladimir Jicha 2016-02-24 14:08:14 UTC
Is Intel really completely ignoring this issue? It was introduced in 3.16 and is still not fixed in the 4.5 kernel. Yes, there is a workaround. But no real solution.

I doubt it will ever get fixed. Only a few people are trying to identify the issue in their free time. It would be awesome if they could find a permanent fix. But shouldn't Intel have done this a long time ago?

My computer freezes from time to time (about twice per week) even with the 3.13 kernel. So staying with the old kernel isn't an ideal solution either.
Comment 105 Joe Burmeister 2016-02-24 14:28:26 UTC
To be clear, the issue isn't in 3.16. I've apt-pinned to 3.16.0-4 and never had the freeze issue again.
3.16.7 is reportedly the last freeze-free version.
Which 3.16 do you mean?

But yes, it's been very quiet from Intel on this thread. As I understand it, Adrian Hunter is from Intel and has done some patches on this: https://lkml.org/lkml/2015/3/24/271 (as mentioned in comment 55). Though these don't seem to have been merged for nearly a year now.
Comment 106 jbMacAZ 2016-02-24 17:48:16 UTC
(In reply to Daniel Glöckner from comment #103)
> (In reply to jbMacAZ from comment #102)
> > So my kernel args should be tsc=reliable and intel_idle.max_cstate={1,0}. 
> > Then nothing bad should happen (no excessive latency, no freezes)?
> 
> You should also apply the patches mentioned in comment 55.

I have them in 4.3.5 and that is my best running recent kernel[EOL: way too soon].  I thought that the LPSS enhancements in 4.5 meant they were no longer needed there.

Appreciate the guidance.
Comment 107 Michal Feix 2016-02-24 20:21:27 UTC
(In reply to Joe Burmeister from comment #105)
> But yes, it's been very quiet from Intel on this thread, but as I understand
> it,  Adrian Hunter is from Intel and has done some patches on this:
> https://lkml.org/lkml/2015/3/24/271  (as mention in comment 55). Though
> these don't see to have been merged for nearly a year now.

I agree. In general, this seems to be a stability issue affecting any Baytrail based machine. That is why I believe there must be thousands of users fighting this bug on different Linux distros, probably unaware of this bug report. Would it help if somebody competent raised the importance of this bug here in Bugzilla? I don't feel that importance "P1 Normal" is correct if this bug leads to certain freezes within tens of minutes on Baytrail machines. Status "NEW" is also misleading, as this bug is obviously CONFIRMED.
Comment 108 Casey 2016-02-24 21:34:41 UTC
Just made an account here to confirm this Baytrail issue. Older kernels work fine, but are not optimal. On a new install of the latest kernel, simply moving the mouse or watching a terminal download from apt-get can cause graphical corruption, reset or freeze.

Windows has zero issues with stability relating to power states or graphics, so it is not my hardware. I am using a Lenovo 11e with an Intel N2940 CPU.
Comment 109 Alejandro Morales Lepe 2016-02-24 22:51:11 UTC
(In reply to Michal Feix from comment #107)
> (In reply to Joe Burmeister from comment #105)
> > But yes, it's been very quiet from Intel on this thread, but as I
> understand
> > it,  Adrian Hunter is from Intel and has done some patches on this:
> > https://lkml.org/lkml/2015/3/24/271  (as mention in comment 55). Though
> > these don't see to have been merged for nearly a year now.
> 
> I agree. In general, this seems to be a stability issue relevant with any
> Baytrail based machine. That is why I believe there has to be thousands of
> users fighting with this bug on different linux distros, probably unaware of
> this bug report. Would it help if somebody competent raised the importance
> of this bug here in Bugzilla? I don't feel that importance "P1 Normal" is
> correct, if this bug leads to certain freezes in tens of minutes on Baytrail
> machines. Also, status "NEW" is also missleading, as this bug is obviously
> CONFIRMED.

I think I am one of those who are struggling with this bug. Any distro other than Debian 8 (kernel 3.16) locks up after some use, anywhere from a few minutes to several hours, but it always crashes. A fix would be very important: machines like the Dell Inspiron 3000 Series Ubuntu Edition are Bay Trail based and very affordable, so many users could be running them (just like myself).
Comment 110 Juliaonly 2016-02-24 23:18:40 UTC
I installed Ubuntu 14.04.4 in a separate partition for experimentation. I am running Kernel 4.2.0-30. The only modification I made was the Cstate setting mentioned in this post and it locked up in fifteen minutes. I'll try something else tonight and post the results.
Comment 111 Sebastian Damsgaard 2016-02-26 07:37:45 UTC
I can also confirm this bug. My HTPC is a shuttle XS35V4 with a J1900. It is unusable on anything higher than kernel 3.16. Exactly as Alejandro Morales Lepe explained it.
Comment 112 László Kara 2016-02-26 20:56:10 UTC
I can also confirm this bug (Acer ES-11, N2940), and like many others I am looking for a solution. Sorry I cannot add anything useful to the hunt.
Comment 113 radarixxx 2016-02-27 08:22:18 UTC
I can also confirm this bug. ASRock Q1900TM-ITX xubuntu 3.19.0-51-generic x86_64
Comment 114 Hal 2016-02-28 14:52:06 UTC
Greetings: 

I have just joined the forum to provide you feedback on my situation which seems to confirm your findings regarding 'intel_idle.max_cstate=2'. 

Indeed, I have two mini-PC type low power consumption very recent boxes. The first one is an Intel NUC Model 5CPYH with a Dual Core Celeron N3050. The second is a Zotac Zbox CI320 Nano with a Quad Core Celeron N2930.

Since the beginning I have been running Linux Mint 17.2 and now 17.3 on both boxes as host and guest OS as I run VirtualBox 5.0.14 to virtualize a tiny family web server only accessible from my LAN. I never installed or tried any other OS (notably no Windows) or any other flavors of Linux on these machines.

Several observations noteworthy:

1-Intel NUC couldn't display anything via VGA or HDMI when first installed with stock Linux Mint 17.2 (kernel 3.16 if I recall). I could SSH in remotely and replace its kernel with 4.3.0 (picked randomly, and it was the most recent at that time), and everything started to work. Very well! Actually without any crash or anything for days.

2-I installed VirtualBox 5.0 and virtualized a basic server built on Linux Mint 17.2 desktop with wordpress, which has been in use for months on an AMD processor based computer (but needed to be replaced as it was a 200 Watt consuming old hardware). Everything went smoothly, but the virtual machine froze overnight. This has kept happening over and over for several weeks; the virtual machine would freeze within less than a day. Rebooting it would become an ordinary daily thing. But, the host would never freeze or crash on me!

3-Zotac CI320 on the other hand started to freeze the minute I installed Linux Mint 17.2. After each reboot it would work for a few minutes and freeze before my eyes while trying to select a WiFi access point, or changing screen resolution, or browsing with Firefox, or simply moving a window around. I upgraded its kernel to 4.3.0 and many different versions, but at best the frequency of failures changed, the problem never went away for good. Things seemed to get a bit better after upgrading to Linux Mint 17.3 with kernel 3.19 stock version, to the point that I wanted to test VirtualBox on it. I installed VirtualBox 5.0.14 and started to play around.

4-My first guest OS was FreeBSD 10.2 on Zotac's VirtualBox. Amazingly, this combo brought a new found stability to my hardware. So, Linux Mint 17.3 with kernel 3.19, VirtualBox 5.0.14 and FreeBSD 10.2 stock would work trouble free without any failure, for days.

5-Then I decided to move my little server to the Zotac platform as it looked stable as described in #4. Troubles started to show up again! But far worse than on the Intel NUC. It would actually crash the entire machine, host, all guest OS etc. whereas on Intel NUC it would only crash the guest OS.

6-I kept digging for info and eventually came across this posting and thought this might be the root cause of my problems. I have been running my Zotac box with intel_idle.max_cstate=2 for the last couple of days (both on the host and guest OS) and have even been bold enough to do some compute-intensive things. Everything is holding up for now. Hopefully it will be ok for good.

I just wanted to share my experience in the hope that someone with similar or more experience will comment or make suggestions; that would be helpful for me but also for others. I am still on edge because of these 2 almost brand new computers.

Also, I wanted to ask for advice regarding using 'intel_idle.max_cstate=2' on both the host and guest OS as I am doing right now. Does it make sense? or should I only run it on the host OS?

Maybe one more question, although this might not be the right place to ask; is the FreeBSD 10.2 kernel known to work better with these processors with regards to this random freezing problem?

Thanks for your attention and sorry for the length of the post.
Hal
Comment 115 jbMacAZ 2016-02-29 23:51:46 UTC
FWIW, I had a freeze on 4.3.6 with tsc=reliable and intel_idle.max_cstate=1.  I hadn't had any freezes since 4.2.5 when cstate limit was set.  A new freeze bug perhaps?
Comment 116 Alejandro Morales Lepe 2016-03-02 19:25:02 UTC
I have been distro-hopping for a while now, and I can confirm that anything newer than 3.16 freezes. I installed Ubuntu 14.04.2 and it runs nicely, but if I install 14.04.3 the system freezes. intel_idle.max_cstate=1 sometimes seems to work and sometimes doesn't, but I haven't found any pattern. If there is something I can do to help solve this issue, tell me; otherwise I am going to be stuck on Ubuntu 14.04.2 or Debian 8 forever.
Comment 117 podschie 2016-03-03 07:23:37 UTC
Hey everyone, same here! With the new kernels I have several freezes per day. Just writing and doing office stuff triggers the bug only occasionally, but when watching a DVD (with an external drive) or surfing the internet (especially Flash, I think) the system freezes a lot. I use Lubuntu 15.10 with 4.2.0-30-generic.
My PC is an Acer ES-1 311 laptop with an Intel N3540 CPU. It would be nice to solve the problem. Can't we just go back to the old working kernel from Ubuntu 14.04 and delete the faulty code in the new one? I don't understand why a kernel with such a serious bug, one that affects a lot of users, was released.
Comment 118 Michal Feix 2016-03-04 11:11:55 UTC
> Also, I wanted to ask for advice regarding using 'intel_idle.max_cstate=2'
> on both the host and guest OS as I am doing right now. Does it make sense?
> or should I only run it on the host OS?

IMHO, it only makes sense on the host.
Comment 119 Dimitris Roussis 2016-03-05 16:22:26 UTC
I am sorry, but this situation is a real comedy!!

Almost all Baytrail devices have this problem. This means a huge number of modern PCs, tablets and laptops.

This has been going on for more than 4 months, and the developers don't care to fix it but rather to add new features to the kernel!!

I have been stuck on kernel 3.16 for more than 4 months because of this serious bug, and I know more than 20 people in the same situation, all of them with different devices.

I love Linux and I appreciate the kernel developers, but here we clearly need a project manager to assess whether a bug is high priority or not.
Comment 120 Molnár Roland 2016-03-07 21:02:05 UTC
Hello everyone.

I have been facing the same issue. I recently bought an ASRock N3150DC-ITX board, and the onboard N3150 was buggy with Linux kernels from 3.19 to 4.4: freezes, and sometimes X11 crashed randomly.

A few days ago I installed the drm-intel-next kernel from the Ubuntu mainline repository: http://kernel.ubuntu.com/~kernel-ppa/mainline/drm-intel-next/2016-03-01-wily/

After installing the new kernel, the system seems to be stable without the cstate hack.

Sidenote: after installing the Intel open source graphics driver from 01.org, hardware-based decoding is working perfectly on Ubuntu 15.10 (MATE Edition). (The installer also updates the vaapi packages to the latest version that supports Cherrytrail.) Tested with 4K content and 1920p downscaling. No issues, no lag after 3-4 days of uptime, running mostly with Kodi.
Comment 121 jds 2016-03-08 23:01:05 UTC
I just tried this on a n2940 system, a Thinkpad 11e.  The screen flashed a lot, so I went back to 4.2.8 again, with cstate hack.


(In reply to Molnár Roland from comment #120)
> Hello everyone.
> 
> I'm have been facing the same issue. Recently i bought an Asrock N3150DC-ITX
> Board, and the onboard N3150 with Linux kernels from 3.19 to 4.4 was buggy,
> freezing, and sometimes X11 random crashed on it.
> 
> Few days ago i installed the drm-intel-next kernel from the Ubuntu mainline
> repository:
> http://kernel.ubuntu.com/~kernel-ppa/mainline/drm-intel-next/2016-03-01-wily/
> 
> After installed the new kernel, the system seems to be stable without the
> cstate hack.
> 
> Sidenote: after installing the Intel open source graphics driver from
> 01.org, the Hardware based decoding is working perfectly on Ubuntu 15.10
> (Mate Edition) (The installer also updates the vaapi packages for the latest
> version that supports Cherrytrail). Tested with 4K contents and 1920p
> downscaling. No issues, no lag after 3-4 days uptime, running mostly with
> Kodi.
Comment 122 Travis Hall 2016-03-09 00:04:05 UTC
(In reply to jds from comment #121)
> I just tried this on a n2940 system, a Thinkpad 11e.  The screen flashed a
> lot, so I went back to 4.2.8 again, with cstate hack.

Interesting, I tried the drm-intel-next kernel linked by Molnár Roland on my Thinkpad 11e for a little while last night, and I didn't see any of those issues. I tried it on a quick fresh Ubuntu install (I usually use Manjaro), playing a Twitch stream in mpv. I'll have to find a way to unpack that deb and install the kernel myself, or get it building myself, to test it longer.

I don't believe any of the drm-intel-next stuff is going to get merged into the upcoming 4.5, as it's already on rc7, so maybe it will come with 4.6
Comment 123 jds 2016-03-09 02:09:55 UTC
(In reply to Travis Hall from comment #122)
> (In reply to jds from comment #121)
> > I just tried this on a n2940 system, a Thinkpad 11e.  The screen flashed a
> > lot, so I went back to 4.2.8 again, with cstate hack.
> > 
> > 
> > (In reply to Molnár Roland from comment #120)
> > > Hello everyone.
> > > 
> > > I'm have been facing the same issue. Recently i bought an Asrock
> N3150DC-ITX
> > > Board, and the onboard N3150 with Linux kernels from 3.19 to 4.4 was
> buggy,
> > > freezing, and sometimes X11 random crashed on it.
> > > 
> > > Few days ago i installed the drm-intel-next kernel from the Ubuntu
> mainline
> > > repository:
> > >
> http://kernel.ubuntu.com/~kernel-ppa/mainline/drm-intel-next/2016-03-01-wily/
> > > 
> > > After installed the new kernel, the system seems to be stable without the
> > > cstate hack.
> > > 
> > > Sidenote: after installing the Intel open source graphics driver from
> > > 01.org, the Hardware based decoding is working perfectly on Ubuntu 15.10
> > > (Mate Edition) (The installer also updates the vaapi packages for the
> latest
> > > version that supports Cherrytrail). Tested with 4K contents and 1920p
> > > downscaling. No issues, no lag after 3-4 days uptime, running mostly with
> > > Kodi.
> 
> Interesting, I tried the drm-intel-next kernel linked by Molnár Roland on my
> Thinkpad 11e for a little while last night, and I didn't see any of those
> issues.  I tried it on a quick fresh ubuntu install (I usually use Manjaro)
> playing a twitch stream on Mpv, I'll have to find a way to unzip that deb
> and install the kernel myself, or get it building myself to test it longer.
> 
> I don't believe any of the drm-intel-next stuff is going to get merged into
> the upcoming 4.5, as it's already on rc7, so maybe it will come with 4.6

Interesting.  Well, I installed this kernel over a Mint 17 setup (Ubuntu 14.04), so maybe there's some interaction between the new kernel and X?
Comment 124 Travis Hall 2016-03-09 22:09:15 UTC
(In reply to jds from comment #123)
> Interesting.  Well, I installed this kernel over a Mint 17 setup (Ubuntu
> 14.04), so maybe there's some interaction between the new kernel and X?

False alarm, my Ubuntu MATE install hung while running the drm-intel-next kernel  from the Ubuntu repo.  I also compiled a kernel from drm-next using an Arch User Repo package https://aur.archlinux.org/packages/linux-drm-intel-nightly/ on Manjaro, and it also hung within about 2 hours while running some youtube on loop, and a stream in mpv.
Comment 125 Dimitris Roussis 2016-03-10 18:02:19 UTC
I tried the drm-intel-next kernel and also the Intel Linux drivers. Nothing works, so I'm back to kernel 3.16.7.
Comment 126 dertobi 2016-03-10 21:07:05 UTC
I'm relatively happy that my system is stable now thanks to the intel_idle.max_cstate=1 flag, however I agree with everything Dimitris Roussis wrote about this situation.

I have had my machine since mid 2014, which means that this bug has plagued users for almost 2 years now. The number of users that have been burned by this issue must be staggering, and I assume most of them didn't file a bug report.

I can't comprehend how this bug is rated "P1 normal", when it's clearly a critical bug preventing a huge number of Intel processors from being stable on Linux.

Intel should really be embarrassed about this bug.

Can we please get a statement from an Intel employee about what is being done?
Comment 127 jds 2016-03-10 21:44:15 UTC
(In reply to dertobi from comment #126)
> I'm relatively happy that my system is stable now thanks to the
> intel_idle.max_cstate=1 flag, however I agree with everything Dimitris
> Roussis wrote about this situation.
> 
> I have my machine since mid 2014, which means that this bug has plagued
> users for almost 2 years now. The number of users that have been burned by
> this issue must be staggering, and I assume most of them didn't file a bug
> report.
> 
> I can't comprehend how this bug is rated "P1 normal", when it's clearly a
> critical bug preventing a huge number of Intel processors from being stable
> on Linux.
> 
> Intel should really be embarrassed about this bug.
> 
> Can we please get a statement from an Intel employee about what is being
> done?

Most non-ARM Chromebooks use Bay Trail chips.  Any sense of what the Chromium project may have done about this bug?
Comment 128 Tal Liron 2016-03-11 00:52:49 UTC
This bug is affecting me on an Acer Aspire E3-111.

So far so good with intel_idle.max_cstate=1.

I'll echo what others have said: it would be reassuring to hear from someone at Linux or Intel about progress towards solving this. Without a doubt, it has been "quietly" affecting a great many people for a long time, who had no way of knowing what the issue was.

I spent quite a bit of money replacing the SSD thinking that it was the culprit. :(
Comment 129 Elmar Melcher 2016-03-11 14:49:01 UTC
For about 2 months I have been using kernel 4.4.0 daily with the patch mentioned in comments 48, 55, 77, 98, 103 and 105 on an Atom Z3735G, without intel_idle.max_cstate. I experience freezes on average about every 10 hours of use. Occasionally a specific operation always causes a hard lockup; in those cases I reboot with intel_idle.max_cstate=0 and the freeze no longer occurs.

Now I have compiled kernel 4.5.0-rc7, but I was not able to apply the mentioned patch. It does not apply cleanly, and after introducing the failing parts by hand I got an error message during boot.
This kernel freezes less than a minute after boot, even with intel_idle.max_cstate=0 on the boot command line.
With tsc=reliable clocksource=tsc on the boot command line the freeze does not occur for at least 30 minutes, but comments seem to indicate that the tsc command line is not recommended.

Is there an update of the mentioned patch ?
Comment 130 Hal 2016-03-11 15:46:19 UTC
I just wanted to update my post #114 after 2 weeks of testing as my Zotac system is now much more stable.
First, for the host OS: intel_idle.max_cstate=2 definitely saved my Zotac computer. No more host crash, nor any VirtualBox system freeze.
As for the guest OS freezing situation; I accidentally noticed that VirtualBox might have had a problem with 3 cores assigned to the guest LinuxMint OS. Changing it to 4 cores (the maximum available on my Zotac) seems to have stopped the freezing of the guest OS. In any event in the new configuration it has been running for over a week now with no hint of problems under medium to heavy load.
Comment 131 jds 2016-03-11 17:48:17 UTC
Replying to my own question from earlier,  Chrome OS is on 3.10.18 (!).  This is for version 48.0.2564.116 in the stable channel.

I found this by checking on a Chromebook that uses the n2940.

Note that there is an issue with this system: the wireless module craps out occasionally (logged bug).  Seems to be related to the iwl* subsystem.

jds
Comment 132 jbMacAZ 2016-03-11 18:12:18 UTC
(In reply to Elmar Melcher from comment #129)
...
> With tsc=reliable clocksource=tsc in the boot command line the freeze does
> not occur for at least 30 minutes, but comments seem to indicate that tsc
> command line is not recommended.
> 
> Is there an update of the mentioned patch ?

Bugzilla could benefit from the ability to append comments instead of forcing new ones.

I had problems that I thought were associated with tsc arguments.  But my device really does have issues with timeouts and connectivity using bluetooth with 4.5-rcx.  I just hadn't noticed before trying the tsc arguments.  FWIW, I'm following the guidance in comment #103.

cstate and tsc only minimize one or more long-standing freeze problems.  But for quite a few, they are sufficient.

This link, https://github.com/hadess/rtl8723bs/tree/master/patches might help with your patch problem (last 3 are edits of same patch.)  Also try the --dry-run flag to first test a patch without changing your source set.
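
The dry-run check can be scripted when several candidate patches have to be tried against a tree. A rough sketch, assuming GNU patch is installed and that the patches use the usual -p1 path prefix; the function names and the paths in the usage note are placeholders:

```python
import subprocess


def dry_run_command(patch_file: str, strip: int = 1) -> list:
    """Build a GNU patch invocation that only checks whether the patch applies."""
    return ["patch", f"-p{strip}", "--dry-run", "-i", patch_file]


def patch_applies(patch_file: str, source_dir: str) -> bool:
    """Run the dry run inside source_dir without modifying any file.

    patch_file should be an absolute path, because the command runs with
    source_dir as its working directory. A zero exit status means every hunk
    would apply, possibly with an offset or fuzz.
    """
    result = subprocess.run(dry_run_command(patch_file), cwd=source_dir)
    return result.returncode == 0
```

For example, patch_applies("/tmp/fix.patch", "/usr/src/linux") would report whether that (hypothetical) patch fits the tree before anything is changed.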
Comment 133 Mădălin Ionuț Icleanu 2016-03-12 07:06:05 UTC
I've managed to run the 4.4.5 kernel on Archlinux for more than a day on my laptop that has the Bay Trail 2930 cpu without any freezes after adding intel_idle.max_cstate=1 AND commenting out tlp's CPU_SCALING_GOVERNOR_ON_AC and CPU_SCALING_GOVERNOR_ON_BAT options.

Maybe you guys could try setting the cpu governor to the default "powersave"? It worked for me.
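
To see which governor each core is actually using after changing the tlp settings, something like the following could be used. It is only a sketch, assuming the cpufreq sysfs files are present under /sys/devices/system/cpu; the function name is made up:

```python
from pathlib import Path

CPU_ROOT = Path("/sys/devices/system/cpu")


def read_governors(cpu_root: Path = CPU_ROOT) -> dict:
    """Map each cpuN directory name to the content of its scaling_governor file."""
    governors = {}
    for gov_file in sorted(cpu_root.glob("cpu[0-9]*/cpufreq/scaling_governor")):
        cpu_name = gov_file.parent.parent.name  # .../cpuN/cpufreq/scaling_governor
        governors[cpu_name] = gov_file.read_text().strip()
    return governors
```

Printing read_governors() should list one entry per core; an empty result would suggest that no cpufreq driver is loaded.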
Comment 134 Hal 2016-03-12 21:40:31 UTC
Hi! One more update to my posts #114 and #130.

Zotac's host OS LinuxMint 17.3 with Kernel 3.19.0 with intel_idle.max_cstate=2 is definitely holding up. It has gone through 2 weeks+ worth of stress testing by now and it works very well. The box is a bit warmer than originally (it has no fan, just passive cooling) but it's by no means within the critical range.

VirtualBox 5.0.16 is also holding up. I have a FreeBSD server which has worked on it for over 2 weeks under heavy load.

But another virtual machine, based on LinuxMint 17.3 running kernel 3.19.0 and xfce, has been a bit more iffy. I thought that the processor core number was an issue, and I still believe that there is a problem along those lines: when I assign 3 cores the failure rate definitely goes up. But with 4 cores I also had a freeze, although it was after several days of good working!

Anyway, we are not out of the woods yet! But as for the host, everything is now very stable.

My question is about the intel_idle.max_cstate value of 2 vs 1. Can anyone enlighten me about the difference? How much power-saving functionality is allowed with 2 vs 1?
Thanks for any info.
Hal
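
One rough way to compare the two settings is to boot with each value and list the idle states the kernel exposes. A minimal sketch, assuming the standard cpuidle sysfs layout under /sys/devices/system/cpu/cpu0/cpuidle; the function name is made up, and the state names will differ per CPU model:

```python
from pathlib import Path

CPUIDLE_DIR = Path("/sys/devices/system/cpu/cpu0/cpuidle")


def list_idle_states(cpuidle_dir: Path = CPUIDLE_DIR) -> list:
    """Return (index, name, exit latency in microseconds) for each exposed state."""
    states = []
    for state_dir in cpuidle_dir.glob("state[0-9]*"):
        index = int(state_dir.name[len("state"):])
        name = (state_dir / "name").read_text().strip()
        latency_us = int((state_dir / "latency").read_text())
        states.append((index, name, latency_us))
    return sorted(states)
```

Running list_idle_states() once under max_cstate=1 and once under max_cstate=2 should show which deeper state, if any, the higher value lets through.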
Comment 135 Juha Sievi-Korte 2016-03-12 22:27:44 UTC
For the last question by Hal, difference with max_cstate=2 and max_cstate=1 with Pentium N3540 at least is occasional freezes (encounter usually a full lock-up within a week or two of use) vs no freezes at all with cstate=1. So far this max_cstate=1 is the only workaround that works for me. But I'm glad there is this one. Running kernel 4.4.3 now and my laptop is still usable and stable.

Sorry this doesn't answer question about the power usage. It is only aimed at the stability aspect. There are quite many comments indicating initial success and then updating that it did crash after all. The freezes (for me) have been all the time very inconsistent. Sometimes the hangs come within minutes of boot and sometimes I could get more than a week of uptime without the kernel parameter. But with max_cstate=1 this system is "rock solid", no freezes at all.

Agree with the comments about the bug priority/severity: I can't really use the 3.x series kernels due to some driver issues, and with this cstate limiting I lose a lot of battery life on the laptop.

This must affect quite a huge number of users currently and at least in my case it took months to find out that it's actually a kernel bug and not some other software issue.
Comment 136 vad1m 2016-03-15 09:12:07 UTC
Guys, please try the latest kernel (4.4 or 4.5) with the intel-microcode package installed (latest version only!), for example from here: https://packages.debian.org/sid/intel-microcode . With C6/C7 enabled in the BIOS, kernel 4.5.0 and the intel-microcode package (latest version from sid), I tested my PC for 1.5 hours and everything was fine.
Comment 137 jbMacAZ 2016-03-15 18:15:58 UTC
Sorry, the intel_ucode does not fix freezing.  Manjaro(Arch derivative) already loads the micro-code (same version) each boot.  It took less than 10 minutes to freeze Manjaro15.10-x86_64 linux-4.5.0 without max_cstate limit (Asus T100-CHI).  However, I intend to add this debian package to my Ubuntu install, as this is still a good idea.  Thanks for the link.
Comment 138 vad1m 2016-03-15 18:48:30 UTC
I've tried today to test my PC with C7 state enabled in BIOS and with latest 4.5 kernel from http://kernel.ubuntu.com/~kernel-ppa/mainline/v4.5-wily/ and latest intel-microcode package 3.20151106.1 from https://packages.debian.org/sid/intel-microcode
Everything is fine with YouTube videos, as well as in the idle state (previously I had freezes within 10-30 minutes in 100% of cases), so I think that, without patching the kernel, we at least have a solution with additional firmware (btw, the microcode can also be downloaded from the Intel site as a binary, but in that case you have to copy it manually to the firmware directory on your system; for me the Debian package is much more convenient).
Comment 139 kossmann 2016-03-16 08:01:29 UTC
Same problem here...

Hardware: Intel NUC6i5SYH (Intel Skylake i5-6260U)
Software: Debian Stretch with Kernel 4.3.0-1-amd64

Linux freezes after a few hours, no KernelCrashDump (crashkernel=256M nmi_watchdog=1) available. Workaround (intel_idle.max_cstate=1) seems to help for the moment.
Comment 140 László Kara 2016-03-16 19:35:05 UTC
Has anyone contacted Intel about this issue yet? We may need more help finding this bug.
Comment 141 Xermán 2016-03-16 21:21:17 UTC
Same problems here. I have an Acer Travelmate b115m with celeron n2940 and I was becoming mad until I found this topic.

Crashes every 20 min - 2 hours. No way of getting crashdump info.
cstate=1 seems to mitigate the problem, but the computer gets hot and runs kind of slower.
Comment 142 Cris Daniel 2016-03-16 21:35:06 UTC
Adding myself to the Baytrail freeze party! Lenovo MIIX 3 1030 powered by an Atom Z3735F, running Arch with a vanilla 4.5.0 kernel. 

Tried vad1m's method (though my CPU doesn't seem to have any microcode updates) just to be sure, got a hang.

Hal, I ran a couple of PowerTop draw tests on my machine. Interesting results:

cstate=1 : 3.40W
cstate=2 : 3.11W
normal   : 3.13W

Taken while idle in an Openbox session, single terminal window open.
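
Put as differences against the unrestricted reading, these numbers work out to roughly 0.27 W (about 8.6%) more for cstate=1 and 0.02 W less for cstate=2, which is probably within measurement noise. The arithmetic, as a small sketch (the function name is made up):

```python
# PowerTOP readings quoted above, in watts.
READINGS = {"cstate=1": 3.40, "cstate=2": 3.11, "normal": 3.13}


def relative_to(baseline: float, value: float) -> tuple:
    """Return (difference in watts, difference as a percentage of baseline)."""
    delta = value - baseline
    return delta, 100.0 * delta / baseline


def summarize(readings: dict, baseline_key: str = "normal") -> dict:
    """Compare every reading except the baseline against the baseline."""
    baseline = readings[baseline_key]
    return {
        key: relative_to(baseline, watts)
        for key, watts in readings.items()
        if key != baseline_key
    }
```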
Comment 143 Michal Feix 2016-03-16 21:58:18 UTC
(In reply to László Kara from comment #140)
> Did anyone got contacted Intel about this issue yet? We may need more help
> finding this bug.

This bug is already assigned to Len Brown from Intel, who is also mentioned as a maintainer of Intel Idle kernel code. Initial reporter of this bug is also an Intel employee. Anyway, I will try to raise this bug on linux-pm mailing list tomorrow, as it seems there is very little awareness about the fatality of this bug among others.
Comment 144 Vincent Frentzel 2016-03-16 22:10:51 UTC
Created attachment 209541 [details]
attachment-24616-0.html

Meanwhile I'm still trying to get @intelsupport's attention on Twitter. Feel
free to RT:

https://twitter.com/zcecc22/status/710222385430077440
Comment 145 Vincent Frentzel 2016-03-16 22:11:40 UTC
Meanwhile I'm trying to reach out to @intelsupport on Twitter to see if we can get an official update.

Feel free to retweet https://twitter.com/zcecc22/status/710222385430077440
Comment 146 Hal 2016-03-16 22:37:52 UTC
Thank you Juha (#135) and Cris (#142).

I've been running both my Zotac and Intel boxes with intel_idle.max_cstate=1 for the last few days. With both values, 2 and 1, I got pretty good results; I have not seen any failures on either host.

I've also done some power monitoring on the AC line and it turns out that between 2 and 1 there is less than a watt of difference, although temperature-wise it seems to be noticeably different (hotter with 1).

So, for all intents and purposes my boxes are now working trouble free.

VirtualBox system has also been very stable (as witnessed by my FreeBSD virtual server), but my Linux Guest OS is periodically failing on both systems. 

I am now convinced that the failure on the Linux Guest OS is due to some video driver issue as opposed to the processor bug related to intel_idle.max_cstate thing.

But, all in all I am very disappointed both by Intel, and Linux (maybe I should say Ubuntu and LinuxMint) as in their race to release the latest and greatest they put out there half baked products. 
This is as if we are back to the beginning of times and we are troubleshooting windows 3.1 systems...

I don't know if the real issue is Intel hardware or Linux software but either way my disappointment is such that I couldn't recommend anyone to switch to Linux as I have advocated for over a decade.

Here, I only voiced my own troubles with my own two machines. My friends and relatives who have bought inexpensive Bay Trail notebooks and got Ubuntu or Mint based on my recommendation and who are pissed because their machines freeze in the middle of a Netflix movie are not sophisticated enough to come to places like this to figure out what they are up to. 
They will only say that windows xp worked better than any sh*t we have had in a long time (that certainly includes Linux), and I almost agree with them.
Comment 147 mario439 2016-03-16 22:42:34 UTC
I have the same bug on my HP Pavilion x360 with a Pentium N3520 CPU (Bay Trail architecture), running Ubuntu 15.10 with kernel version 4.2.0-30.
I'm using the proprietary drivers for the microprocessor.
I haven't tried "intel_idle.max_cstate=1", because that uses a lot of battery...
I want to use GNU/Linux again, but I can't work normally with this bug :(

PS: English is not my first language.
Comment 148 Xermán 2016-03-16 22:50:30 UTC
(In reply to Hal from comment #146)
> Thank you Juha (#135) and Chris (#142). 
> 
> I've been running both my Zotac and Intel boxes with intel_idle.max_cstate=1
> for the last few days. Both with the value of 2 and 1 I got pretty good
> results, I have not seen any failures on either hosts.
> 
> I've also done some power monitoring on the AC line and it turns out that
> between 2 and 1 there is less than a watt of difference, although
> temperature wise it seems to be noticeably different (hotter with 1).
> 
> So, for all intents of purposes my boxes are now working trouble free. 
> 
> VirtualBox system has also been very stable (as witnessed by my FreeBSD
> virtual server), but my Linux Guest OS is periodically failing on both
> systems. 
> 
> I am now convinced that the failure on the Linux Guest OS is due to some
> video driver issue as opposed to the processor bug related to
> intel_idle.max_cstate thing.
> 
> But, all in all I am very disappointed both by Intel, and Linux (maybe I
> should say Ubuntu and LinuxMint) as in their race to release the latest and
> greatest they put out there half baked products. 
> This is as if we are back to the beginning of times and we are
> troubleshooting windows 3.1 systems...
> 
> I don't know if the real issue is Intel hardware or Linux software but
> either way my disappointment is such that I couldn't recommend anyone to
> switch to Linux as I have advocated for over a decade.
> 
> Here, I only voiced my own troubles with my own two machines. My friends and
> relatives who have bought inexpensive Bay Trail notebooks and got Ubuntu or
> Mint based on my recommendation and who are pissed because their machines
> freeze in the middle of a Netflix movie are not sophisticated enough to come
> to places like this to figure out what they are up to. 
> They will only say that windows xp worked better than any sh*t we have had
> in a long time (that certainly includes Linux), and I almost agree with them.

I could not agree more. I worked with Linux (Red Hat) more than 10 years ago (compiling kernels, day-to-day work, etc.) and after some years without touching it I had the idea of just using it on my new laptop. I'm a photo professional and I wanted to give Darktable and the current Gimp a try.

I'm so, so disappointed.

Linux is still not usable after all these years; it's even less stable now.
Completely buggy for a normal average user. I can't recommend it to any friend with my hardware or similar, since no one will know what to do. I'm back on Windows 10 and the computer runs fast and with no problems at all.

I will keep an eye on this, but for me, a casual user very interested in Linux, this operating system is just a toy to spend time with. Just a toy.
Comment 149 kossmann 2016-03-16 23:06:42 UTC
Running as a desktop is one thing; running as a server (as I do) is another :-( The same kernel on an Asus eeeBox B202 (Intel Atom) has no problems.

I don't know if there is a connection, but since I started using the workaround (on Intel Skylake), I don't see messages like "systemd-sysv-generator overwriting existing symlink" in dmesg anymore.
Comment 150 dertobi 2016-03-16 23:33:42 UTC
I share the frustration as I have been using Linux for over 15 years and this is maybe the most serious bug in all that time (for me), which ironically seems to get extremely limited attention by Intel and Kernel developers. It seems like kernel developers don't usually use low end Intel Atom like hardware and therefore don't have to deal with the problems themselves a lot. (Admittedly that's speculation, but I can imagine that most kernel developers(or even pro users) prefer to use high end hardware (let alone for compile times))

I think it's unfair to say that because of that one (severe!) bug Linux is not recommendable anymore, since not everybody will want to use a Baytrail system, but at this time you could say Linux on baytrail isn't really advisable until this bug is fixed.

I lay the blame on Intel, as they should be stress testing their CPUs against the latest Linux kernel and pro-actively try to fix eventual bugs.
Comment 151 saracenim 2016-03-16 23:52:46 UTC
I've been using Linux since 1997, and this is the first time I've come across a bug so serious. Luckily I have another laptop with a different CPU and I can use that, but my Baytrail machine is collecting dust.
Really stupid idea to release newer and newer kernels just for the sake of adding new shiny numbers when so many people are affected by a massive bug like this, which makes windows 3.1 look like a dream. Linus has totally lost his marbles. 
I'd like to try bsd, but it's a pain in the neck and hardware support lags behind. Sigh.
Comment 152 Xermán 2016-03-16 23:59:38 UTC
Sorry if I gave too negative an impression, but I just can't help being disappointed and frustrated. I really wanted to jump to open-source software.

And I know this is made with the collaboration of a lot of volunteer people, thanks to them. But shame on Linux stability and support.
Comment 153 Tal Liron 2016-03-17 00:16:21 UTC
Some of you doomsday complainers need to calm down a bit!

Microsoft and Apple (and even Google in some cases) don't even have a way to open bugs on their operating systems and get transparent feedback with a way to track progress. And the computer gods know how many hours I spent trying to debug system freezes and crashes on Windows... Just right now my employer, which distributed hundreds of MacBook Pros to its users, is experiencing a bug with major battery drain for all of them. We have no idea what is causing it yet.

This particular bug was very tricky to pin down and the community did great in reporting it and patiently trying things out until we found a workaround. We provided a very good direction for Intel to look for the cause and find a fix. As with other major bugs, I'm certain the major distributions will backport the fix to their older kernels that are still under bugfix and security support.

Software generally is terrible, but Linux is better than most. We have a good system in place to fix bugs and keep making the OS better. I don't think there's anything particularly wrong with how Linux handles quality control.

That said, I would very much appreciate it if someone from Intel steps in and comments, even briefly, on this bug report. Hint, hint. :)
Comment 154 Vincent Frentzel 2016-03-17 00:27:03 UTC
Created attachment 209561 [details]
attachment-28440-0.html

I did raise the issue on twitter to @intelsupport, they told me to download
and use the intel graphic driver at http://intel.ly/24TDt9F .

Don't think @intelsupport is that useful after all...

Does anyone have a direct contact there?
Comment 155 dertobi 2016-03-17 01:12:27 UTC
@Vincent Frentzel

Some Intel guys are on IRC, for example on irc.freenode.org #intel-gfx . You might want to bug them about this bug. I know some are aware about the issue on this channel, but some also have been dismissive about it to be honest. I know one guy who has some patches on there and needs people like us to try them.
Comment 156 Hal 2016-03-17 02:28:39 UTC
(In reply to Tal Liron from comment #153)

> Some of you doomsday complainers need to calm down a bit!
???? 

> This particular bug was very tricky to pin down and the community did great
> in reporting it and patiently trying things out until we found a workaround.
> We provided a very good direction for Intel to look for the cause and find a
> fix. As with other major bugs, I'm certain the major distributions will
> backport the fix to their older kernels that are still under bugfix and
> security support.

As this link shows the problem was discovered 15 MONTHS AGO and it is yet to be fixed! https://bugs.freedesktop.org/show_bug.cgi?id=88012

> Software generally is terrible, but Linux is better than most. We have a
> good system in place to fix bugs and keep making the OS better. I don't
> think there's anything particularly wrong with how Linux handles quality
> control.

"Quality control" you say? Since this problem was discovered 15 MONTHS AGO the linux kernel went through tons of iterations. It could have been at least provided with an automatic detection and cstate switching mechanism!

> That said, I would very much appreciate it if someone from Intel steps in a
> comments, even briefly, on this bug report. Hint, hint. :)

Sensible suggestion! But check this out: https://communities.intel.com/thread/60984?start=0&tstart=0
It doesn't sound like Joe_Intel is much of a listener, is he?

I am afraid next year this time we will still be talking about this same bug, because if it didn't get fixed since January 2015 I don't see how and why it will be ever fixed.
Comment 157 John A. 2016-03-17 02:53:14 UTC
Created attachment 209571 [details]
Arch Linux 4.1.18 LTS panic #1 (photo 1 of 3)

Attaching 3 photos of kernel panics I've seen that may be related to this. Two photos are from Arch 4.1.18 LTS with intel_idle.max_cstate=1 (plus other kernel params, mostly borrowed from Clear Linux's boot line), and one is from Arch 4.4.3 using intel_idle.max_cstate=1.

System is a console-only mini-PC running a Celeron N3150 (Braswell) with 8GB RAM and 250GB mSATA SSD. Trying to use the mini-PC as a custom network router.

Attaching photos since these panics don't write to logs, and often don't show anything at all, halting the machine or causing a spontaneous reboot. I'm going to try setting up a netconsole to capture goings-on next.
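
For the receiving side of a netconsole setup, any UDP listener on a second machine will do. A minimal sketch in Python, assuming the crashing machine is configured to send to this host on UDP port 6666 (the port number, log file name and function names are assumptions and must match the netconsole configuration actually used):

```python
import socket


def open_listener(host: str = "0.0.0.0", port: int = 6666) -> socket.socket:
    """Bind a UDP socket; the port is an assumption and must match the sender."""
    sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    sock.bind((host, port))
    return sock


def receive_message(sock: socket.socket, bufsize: int = 65535) -> str:
    """Block until one datagram arrives and return its payload as text."""
    data, _sender = sock.recvfrom(bufsize)
    return data.decode("utf-8", errors="replace")


def main() -> None:
    listener = open_listener()
    with open("netconsole.log", "a", encoding="utf-8") as log:
        while True:
            log.write(receive_message(listener))
            log.flush()  # keep the file current even if the listener is killed
```

Calling main() on the second machine appends every received kernel message to netconsole.log until interrupted.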

All three instances seem to choke with some invocation of start_secondary(), if I'm reading the call trace correctly. 

Hoping these instances may help devs track the core issue down. Please let me know if additional info is required or if I can test anything.
Comment 158 John A. 2016-03-17 02:55:46 UTC
Created attachment 209581 [details]
Arch Linux 4.1.18 LTS panic #2 (photo 2 of 3)

Second kernel panic photo with Arch 4.1.18 LTS on Celeron N3150 (Braswell) system using max_cstate=1. Please see the first photo for more info.
Comment 159 John A. 2016-03-17 02:57:24 UTC
Created attachment 209591 [details]
Arch Linux 4.4.3 panic (photo 3 of 3)

Third/last photo of panics, this time with Arch 4.4.3 on Celeron N3150 (Braswell) using max_cstate=1. Please see first photo for more information.
Comment 160 jds 2016-03-17 05:22:15 UTC
I think you're not quite entertaining the level of failure that's involved here.

I totally appreciate your point.  "Software is generally terrible".  True.  But from the perspective of users here, is Linux really better than most?

My Mac at work has an uptime of 172 days.  Every night I sleep it when I go home.  I haven't had to reboot it in six months.  BTW it's a laptop.

This ThinkPad running Linux that I have here crashes after 1-2 hours of sitting idle.  I don't have to do anything.  Turn it on; let it sit; crash!  That's much worse than Windows 3.1, which was sensitive to rogue applications, but didn't simply splat into smithereens on its own.

So let's not sentimentalize and pretend this isn't a total fuck-up.  

Where is Intel?




(In reply to Tal Liron from comment #153)
> Some of you doomsday complainers need to calm down a bit!
> 
> Microsoft and Apple (and even Google in some cases) don't even have a way to
> open bugs on their operating systems and get transparent feedback with a way
> to track progress. And the computer gods know how many hours I spent trying
> to debug system freezes and crashes on Windows... Just right now my
> employer, which distributed hundreds of MacBook Pros to its users, is
> experiencing a bug with major battery drain for all of them. We have no idea
> what is causing it yet.
> 
> This particular bug was very tricky to pin down and the community did great
> in reporting it and patiently trying things out until we found a workaround.
> We provided a very good direction for Intel to look for the cause and find a
> fix. As with other major bugs, I'm certain the major distributions will
> backport the fix to their older kernels that are still under bugfix and
> security support.
> 
> Software generally is terrible, but Linux is better than most. We have a
> good system in place to fix bugs and keep making the OS better. I don't
> think there's anything particularly wrong with how Linux handles quality
> control.
> 
> That said, I would very much appreciate it if someone from Intel steps in a
> comments, even briefly, on this bug report. Hint, hint. :)
Comment 161 fao66134 2016-03-17 06:05:31 UTC
I use an N3150 (Braswell) too.
I set "max_cstate=1" but still got a freeze.

Looking at coretemp, I noticed that the temperature of cpu2 and cpu3 was a little high.
So I came up with the idea of trying "maxcpus=2". After that, it did not freeze.

cpu0 and cpu1 are no problem, but with cpu2 or cpu3 online I get a freeze.

If this is useful, I'm happy.
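
To confirm that a maxcpus setting took effect, the list of online CPUs can be read from sysfs. A small sketch, assuming /sys/devices/system/cpu/online exists and uses the kernel's usual range notation; the function names are made up:

```python
from pathlib import Path


def parse_cpu_list(text: str) -> list:
    """Expand a kernel CPU list such as "0-1,3" into [0, 1, 3]."""
    cpus = []
    for part in text.strip().split(","):
        if not part:
            continue
        first, sep, last = part.partition("-")
        cpus.extend(range(int(first), int(last if sep else first) + 1))
    return cpus


def online_cpus(path: str = "/sys/devices/system/cpu/online") -> list:
    """Return the CPU numbers the kernel currently reports as online."""
    return parse_cpu_list(Path(path).read_text())
```

With maxcpus=2 on a four-core part, online_cpus() would be expected to return only two entries.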
Comment 162 Vladimir Jicha 2016-03-17 08:47:45 UTC
What is really bad about this bug is the fact that it used to work until kernel 3.16. I bought my HTPC with Bay Trail because it supported Linux. And now I have unsupported hardware with a non-replaceable motherboard that freezes even with kernel 3.13 (there are more bugs than this one; I believe it is related to WiFi, which also often loses its connection and is very slow).

This chipset is in fact not supported by Linux now. I bought it because I heard everywhere that Intel has the best Linux support. Now I want to show them the same sign of respect Linus Torvalds showed NVIDIA some years ago.

I should have known that since Intel did the same to me with GMA500 graphics. I thought it was just a single mistake and they will not repeat it. But they did. :-(
Comment 163 jbMacAZ 2016-03-17 09:12:12 UTC
I've started getting occasional freezes again with 4.4 and 4.5.  That's even with cstate and tsc and a bunch of good but cast off freeze patches.  So, I'll try fewer CPU cores.  I can't risk having anything important on my system anyway, so who cares if nothing important takes longer.  

Next cycle, I'll just get AMD systems.
Comment 164 Bastien Nocera 2016-03-17 09:41:11 UTC
Well done on turning this into a forum thread. I wouldn't touch this bug with a 10-foot pole and I'm sure the Intel developers feel the same.
Comment 165 Molnár Roland 2016-03-17 11:15:18 UTC
I got the same issues after 4-5 days on Ubuntu 15.10 with the 4.5 kernel and intel driver. After this issue, it freezes again within 5-6 hours, sorry for the false hope :)

Now I'm trying the upcoming Ubuntu LTS release (16.04 Nightly) with the following kernel: 4.4.0-13-generic #29-Ubuntu SMP Fri Mar 11 19:31:18 UTC 2016 x86_64 x86_64 x86_64 GNU/Linux

It seems stable so far after 2 days and 13 hours of uptime, with no cstate fix needed right now.

Kernel is not the newest, but the mentioned microcode package and intel drivers are up to date, va packages also, so hw decoding works nicely on it.

powertop shows the following idle stats:

          Package   |             Core    |            CPU 0
                    |                     | C0 active   4,1%
                    |                     | POLL        0,0%    0,0 ms
                    | C1 (cc1)    0,0%    | C1-CHT      0,0%    0,1 ms
                    |                     |
                    |                     |
C6 (pc6)   81,9%    | C6 (cc6)   95,4%    | C6S-CHT     2,0%    2,9 ms
                    |                     | C7S-CHT    31,2%   86,9 ms

                    |             Core    |            CPU 1
                    |                     | C0 active   1,2%
                    |                     | POLL        0,0%    0,0 ms
                    | C1 (cc1)    0,2%    | C1-CHT      0,2%    0,8 ms
                    |                     |
                    |                     |
                    | C6 (cc6)   97,5%    | C6S-CHT     4,7%    3,2 ms
                    |                     | C7S-CHT    46,4%   37,6 ms

                    |             Core    |            CPU 2
                    |                     | C0 active   0,3%
                    |                     | POLL        0,0%    0,0 ms
                    | C1 (cc1)    0,0%    | C1-CHT      0,0%    0,4 ms
                    |                     |
                    |                     |
                    | C6 (cc6)   99,0%    | C6S-CHT     2,3%    4,2 ms
                    |                     | C7S-CHT    91,4%   89,3 ms

                    |             Core    |            CPU 3
                    |                     | C0 active   2,1%
                    |                     | POLL        0,0%    0,0 ms
                    | C1 (cc1)    0,0%    | C1-CHT      0,0%    0,0 ms
                    |                     |
                    |                     |
                    | C6 (cc6)   94,1%    | C6S-CHT     0,3%    2,1 ms
                    |                     | C7S-CHT    85,7%   39,4 ms

                    |             GPU     |
                    |                     |
                    | Powered On  0,2%    |
                    | RC6        99,8%    |
                    | RC6p        0,0%    |
                    | RC6pp       0,0%    |
                    |                     |
                    |                     |

If I understand it correctly, the CPU cores are mostly in the C6/C7 states.

One thing that I did with this config:
sudo powertop --auto-tune

With the above command I set all items under Tunables to "Good".

I will write about my experience after a few more days...
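
The per-state residency can also be cross-checked without powertop by sampling the cumulative counters in sysfs twice. A rough sketch, assuming each cpuidle state directory provides name and time files (time in microseconds); the function names are made up and the 10-second interval is arbitrary:

```python
import time
from pathlib import Path


def read_state_times(cpuidle_dir: Path) -> dict:
    """Map state name -> cumulative time spent in that state, in microseconds."""
    times = {}
    for state_dir in cpuidle_dir.glob("state[0-9]*"):
        name = (state_dir / "name").read_text().strip()
        times[name] = int((state_dir / "time").read_text())
    return times


def residency_percent(before: dict, after: dict, interval_us: float) -> dict:
    """Convert two cumulative samples into each state's share of the interval."""
    return {
        name: 100.0 * (after[name] - before[name]) / interval_us
        for name in after
        if name in before
    }


def main() -> None:
    cpu0 = Path("/sys/devices/system/cpu/cpu0/cpuidle")
    start = time.monotonic()
    first = read_state_times(cpu0)
    time.sleep(10)
    second = read_state_times(cpu0)
    elapsed_us = (time.monotonic() - start) * 1_000_000
    for name, pct in residency_percent(first, second, elapsed_us).items():
        print(f"{name}: {pct:.1f}%")
```

The figures will not match powertop exactly, since the sampling windows differ, but they should show the same states dominating.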
Comment 166 Hal 2016-03-17 12:29:35 UTC
I've been scavenging for more information about this intel_idle software module and I came across this interesting slide presentation from Len Brown (the Intel engineer in charge of the power saving scheme if I understood right). It's dated October 2015 and apparently used at his LinuxCon Dublin meeting.
Many pages refer to troubles with the idle thing, how they track it, measurements, etc. Several slides, like #31, 32 and 33 under "Things may go wrong", mention Linux kernel versions which are buggy yet unfixed.
http://events.linuxfoundation.org/sites/events/files/slides/Brown-Linux-Suspend-at-Speed-of-Light-LC-EU-2015.pdf
Comment 167 Chen Yu 2016-03-17 14:03:09 UTC
Hi, all, I think we have a T100 in the lab, I'll have a try. BTW, could someone please tell me is it reproduced easily by playing videos?
Comment 168 John A. 2016-03-17 14:18:25 UTC
(In reply to fao66134 from comment #161)
> I use N3150(braswell) too.
> I was set to "max_cstate=1". but, got freeze.
> 
> Looking at the coretemp, temperature of cpu2 and cpu3 was noticed that a
> little high.
> So, i come up with to try "maxcpus=2". And then, it did not freeze.
> 
> cpu0 and cpu1 is no problem. but, cpu2 or cpu3 online to got freeze.
> 
> If this thing is useful, I'm happy.

This seems interesting... I think my N3150's issues tend to be with CPU2 most of the time too. (I'll recheck the panics I posted last night.) I wondered about possible CPU heat issues as well as I'm using a fanless aluminum case, but haven't been watching it closely. I'll start doing that.

Thanks for the maxcpus=2 workaround. I may also try that. Though I'd prefer to use all 4 cores :)

Possible related note: I tried using the latest OPNSense FreeBSD 10.2-based router/firewall distro, and ran into a CPU panic there too after a few hours. I didn't get a photo of it, but if I try it again I'll be sure to capture it.
Comment 169 Daniel Glöckner 2016-03-17 14:34:41 UTC
(In reply to Chen Yu from comment #167)
> BTW, could someone please tell me is it reproduced easily by playing videos?

Yes, it is. I'm using Firefox with HTML5 videos on YouTube to test for this bug. I always had at least one freeze within 4 hours when not restricting max_cstate.
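
Because the freezes leave nothing in the logs, a simple heartbeat file can help to time them during such a test. A minimal sketch, assuming a writable working directory; the file name, the 10-second interval and the function names are arbitrary choices:

```python
import os
import time
from datetime import datetime, timezone


def write_heartbeat(path: str) -> str:
    """Append the current UTC time to path and force it out to disk."""
    stamp = datetime.now(timezone.utc).isoformat(timespec="seconds")
    with open(path, "a", encoding="utf-8") as log:
        log.write(stamp + "\n")
        log.flush()
        os.fsync(log.fileno())  # a hard freeze would otherwise lose buffered lines
    return stamp


def main() -> None:
    while True:
        write_heartbeat("heartbeat.log")
        time.sleep(10)
```

After a freeze and reboot, the last line of heartbeat.log gives the time of the freeze to within one interval.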
Comment 170 John A. 2016-03-17 14:37:30 UTC
(In reply to jbMacAZ from comment #163)
> I've started getting occasional freezes again with 4.4 and 4.5.  That's even
> with cstate and tsc and a bunch of good but cast off freeze patches.  So,
> I'll try fewer CPU cores.  I can't risk having anything important on my
> system anyway, so who cares if nothing important takes longer.  

Have you tried just using max_cstate without the tsc parameters and the patches? When I added tsc params to my boot line it seemed to cause more instability/chances for halts and panics. That makes me wonder if tsc is somehow counterproductive to max_cstate.
Comment 171 Hal 2016-03-17 14:55:30 UTC
(In reply to Chen Yu from comment #167)
> Hi, all, I think we have a T100 in the lab, I'll have a try. BTW, could
> someone please tell me is it reproduced easily by playing videos?

I have an old SSD that I use to move things around and I have LinuxMint 17.3 stock kernel 3.19.0 on it. I can use it through SATA or with USB2.0 or USB 3.0 adaptors.

I just plugged it into my CI320 via SATA for a quick test. Freezing was as quick as moving the firefox window.

There is no special software on this SSD, no Virtualbox, no wine emulator, nothing. Pure stock linuxmint.

After rebooting, I tried to plug in a USB flash disk; before the directory was even read into Thunar, the whole machine froze again.

So, it is indeed quick to repeat the failure.
Comment 172 Michal Feix 2016-03-17 16:58:33 UTC
So I just got a reply from the linux-pm kernel mailing list. People there are aware of this bug, but I've been told that it is quite hard to find the root cause.

I've been asked to check whether the kernel parameter idle=nomwait makes the problems go away. Obviously, CPUs might get warmer when trying this. It is just a step to pinpoint the source.

Can you test this parameter and post results? Especially if you are one of those not lucky with intel_idle.max_cstate=1 parameter as a workaround.
Comment 173 jbMacAZ 2016-03-17 18:02:19 UTC
(In reply to John A. from comment #170)
> (In reply to jbMacAZ from comment #163)
> > I've started getting occasional freezes again with 4.4 and 4.5.  ...
> 
> Have you tried just using max_cstate without the tsc parameters and the
> patches? When I added tsc params to my boot line it seemed to cause more
> instability/chances for halts and panics. That makes me wonder if tsc is
> somehow counterproductive to max_cstate.

tsc is recent, I ran 4.2.x for months with relatively little trouble with just cstate and necessary patches.  Frankly, 4.3.x seems to run the same with or without tsc as long as cstate is set.  My gut is that there is a new instability in 4.4 and 4.5.  I can't jettison all my old patches because my T100 will have sdhci and prmb issues and other bits of T100 hardware will stop working.  


The lack of a crash log could be partially addressed by allocating a second dmesg buffer and alternating between them at boot.  The prior dmesg log would be preserved at next startup.  This should probably be a new .config option.  Alternatively, just save the last few K of the old dmesg before initializing dmesg at boot time.
Comment 174 Hal 2016-03-17 19:38:17 UTC
(In reply to John A. from comment #168)

> Possible related note: I tried using the latest OPNSense FreeBSD 10.2-based
> router/firewall distro, and ran into a CPU panic there too after a few
> hours. I didn't get a photo of it, but if I try it again I'll be sure to
> capture it.

That might be an OPNSense issue, as they seem to have introduced lots of regressions when they tried to rewrite some of the code (or clean it up). When I tried to run it on an Intel mobo a few weeks back it just kept crashing. On the same hardware pfSense ran without problems. Any particular reason you would favor OPNSense over pfSense?

One little consolation I have about Bay Trail and Braswell is that FreeBSD (and PC-BSD) and pfSense both work flawlessly on the same hardware where I experience the Linux freezing circus.
Comment 175 Nils Asmussen 2016-03-17 20:51:13 UTC
(In reply to Michal Feix from comment #172)
> [...]
> I've been asked to check if kernel parameter idle=nomwait is making the
> problems go away. Obviously, CPU's might get warmer when trying  this. It is
> just a step to pinpoint the source.
> 
> Can you test this parameter and post results? Especially if you are one of
> those not lucky with intel_idle.max_cstate=1 parameter as a workaround.

Using vanilla kernel 4.5.0 I tried to boot with the options
tsc=reliable idle=nomwait
The system crashed after the "usual" amount of time (about an hour surfing the web).
I did not set cstate or anything else.
Comment 176 fao66134 2016-03-18 09:13:53 UTC
(In reply to John A. from comment #168)
> This seems interesting... I think my N3150's issues tend to be with CPU2
> most of the time too. (I'll recheck the panics I posted last night.) I
> wondered about possible CPU heat issues as well as I'm using a fanless
> aluminum case, but haven't been watching it closely. I'll start doing that.
> 
> Thanks for the maxcpus=2 workaround. I may also try that. Though I'd prefer
> to use all 4 cores :)

I found a new way.

"echo 0 > /sys/kernel/debug/x86/tlb_single_page_flush_ceiling"

I am using all cores and have not had a freeze with this setting so far.

In my case, I disabled intel_idle, intel_pstate and i915, but still got a freeze.
So I compared the other CPU configuration changes between kernel 3.16 and kernel 4.5 and noticed that the TLB flush setting had changed (the intel_tlb_flushall_shift_set function was removed from "arch/x86/kernel/cpu/intel.c", and tlb_single_page_flush_ceiling was added to "arch/x86/mm/tlb.c").
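
To make this experiment reversible, here is a minimal sketch that records the old value before writing the new one. It assumes debugfs is mounted at /sys/kernel/debug and that the script runs as root; the function name `swap_ceiling` and its `path` parameter are made up for illustration.

```python
from pathlib import Path

# Path taken from the echo command above.
TLB_CEILING = "/sys/kernel/debug/x86/tlb_single_page_flush_ceiling"


def swap_ceiling(new_value, path=TLB_CEILING):
    """Write new_value to the tunable and return the value it replaced."""
    knob = Path(path)
    previous = int(knob.read_text().strip())
    knob.write_text(f"{new_value}\n")
    return previous
```

Keeping the return value of `swap_ceiling(0)` makes it possible to restore the original setting later with a second call, without rebooting.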
Comment 177 BzukTuk 2016-03-18 09:39:59 UTC
(In reply to fao66134 from comment #161)
> I use N3150(braswell) too.
> I was set to "max_cstate=1". but, got freeze.
> 
> Looking at the coretemp, temperature of cpu2 and cpu3 was noticed that a
> little high.
> So, i come up with to try "maxcpus=2". And then, it did not freeze.
> 
> cpu0 and cpu1 is no problem. but, cpu2 or cpu3 online to got freeze.
> 
> If this thing is useful, I'm happy.

Thanks for another workaround :)

Running glxgears and x264 video on an Intel® Atom™ Z3735F processor (4 cores), vanilla kernel v4.5.0:
maxcpus=1, no freeze (running 90 minutes)
maxcpus=2, no freeze (running 90 minutes)
maxcpus=3, no freeze (running over 4 hours)
no command line parameters, freeze occurred after 5 minutes (as usual).
Comment 178 fao66134 2016-03-18 15:12:12 UTC
(In reply to fao66134 from comment #176)
> I found a new way.
> 
> "echo 0 > /sys/kernel/debug/x86/tlb_single_page_flush_ceiling"
> 
> I using full core, but have not yet acquired the frozen from this setting.

Sorry, I got a freeze just now.

The running time is longer, but the setting does not seem to be a complete fix.
Comment 179 fao66134 2016-03-18 15:39:05 UTC
(In reply to Michal Feix from comment #172)
> I've been asked to check if kernel parameter idle=nomwait is making the
> problems go away. Obviously, CPU's might get warmer when trying  this. It is
> just a step to pinpoint the source.
> 
> Can you test this parameter and post results? Especially if you are one of
> those not lucky with intel_idle.max_cstate=1 parameter as a workaround.

I also tried it, but the system froze within 5 minutes.
That is about the same as with no parameters at all.
Comment 180 Michal Feix 2016-03-18 15:49:27 UTC
(In reply to fao66134 from comment #179)
> (In reply to Michal Feix from comment #172)
> > I've been asked to check if kernel parameter idle=nomwait is making the
> > problems go away. Obviously, CPU's might get warmer when trying  this. It
> is
> > just a step to pinpoint the source.
> >
> > Can you test this parameter and post results? Especially if you are one of
> > those not lucky with intel_idle.max_cstate=1 parameter as a workaround.
> 
> I also tried, but was frozen in 5 minutes.
> This is about the same as when you do not specify anything.

So, setting idle=nomwait is not helping you. Fine. If intel_idle.max_cstate=1 is a working solution for you, could you please try with intel_idle.max_cstate=0 and post back result?
Comment 181 julio.borreguero@gmail.com 2016-03-18 16:42:32 UTC
i want to give my latest feedback on this issue to this forum thread :-D
N2940 Baytrail System running stable on all 4 cores for 2 days now.
Running latest stable kernel 4.5.0 from git repo on gentoo linux.
With latest microcode firmware from intel microcode-20151106.tgz

uname -a
Linux shiva 4.5.0 #20 SMP Tue Mar 15 19:07:39 ART 2016 x86_64 Intel(R) Celeron(R) CPU N2940 @ 1.83GHz GenuineIntel GNU/Linux

kernel parameters:
i915.enable_rc6=1 tsc=reliable clocksource=tsc

I don't know whether it is the kernel or the microcode that makes this system run stable, and of course I hope it stays stable.
Playing videos, listening to music, compiling packages no freezes yet.
hope it remains like this.

and please, this is a bug-report thread, not a discussion platform
Comment 182 jbMacAZ 2016-03-18 17:45:00 UTC
"nomwait" may be device dependent.  I ran it overnight (tsc=reliable and idle=nomwait w/o cstate) and it was still running after 10 hours.  After seeing other results here, I restarted without tsc, and my system has already run five times longer than with no arguments.  I'll keep testing.  I'll need to repeat with 4.4+ since the newest kernels are less stable than 4.3 on my system.

Passive cooled Atom baytrail Z3775: cstate=1 runs nearly normal temp, cstate=0 runs slightly warmer.  "nomwait" runs about the same temp as cstate=1.  
Asus T100-CHI - Ubuntu15.10-i386, kernel-4.3.6, microcode, T100 patches, hunter patches, legacy-turbo patch.  Normally freezes well under 10 minutes without kernel arguments.
Comment 183 Michal Feix 2016-03-18 19:21:17 UTC
(In reply to julio.borreguero@gmail.com from comment #181)
> i want to give my latest feedback on this issue to this forum thread :-D
> N2940 Baytrail System running stable on all 4 cores for 2 days now.
> Running latest stable kernel 4.5.0 from git repo on gentoo linux.
> With latest microcode firmware from intel microcode-20151106.tgz
> 
> kernel parameters:
> i915.enable_rc6=1 tsc=reliable clocksource=tsc
> 
> i dont know if it is the kernel or the microcode that makes this system run
> stable, and of course i hope it stays stable.
> Playing videos, listening to music, compiling packages no freezes yet.

Microcode update 20151106 only updates the 2MB cache version of N2940. If you have 1MB cache variant of N2940, the microcode update was not the cure.

If you can test the 4.5 kernel version without any kernel parameters, it would help to understand whether it has been fixed in the meantime.
Comment 184 julio.borreguero@gmail.com 2016-03-18 19:37:56 UTC
> 
> Microcode update 20151106 only updates the 2MB cache version of N2940. If
> you have 1MB cache variant of N2940, the microcode update was not the cure.
> 
> If you can test the 4.5 kernel version without any kernel parameters, it
> would help to understand whether it has been fixed in the meantime.

ok, thank you for that information.
And yes, the cache is only 1MB but i guess you know that anyway from the attachment i posted at some earlier stage with system-specific info.

i just rebooted my machine, this time without extra kernel parameters.
my guess is that the kernel has been fixed for my architecture at least, as i was running those tsc-parameters in my last test (4.5.0-rc3) and that definitely froze.
i will be posting a hardware freeze as soon as it happens, otherwise i will let everyone know in 2-3 days that the system is still running stable. hopefully
Comment 185 dertobi 2016-03-19 00:34:09 UTC
I certainly don't want to destroy anyone's hopes, but I've had instances where my notebook ran stable for up to two weeks and then froze. Doesn't mean it has to happen, I'm just saying the absence of crashes overnight, within 10 hours, or even in 3-4 days is not a sure sign that the issue has been fixed.
Comment 186 Hal 2016-03-19 03:56:42 UTC
Among the posts there are several mentioning that kernel 3.16 is freeze-free without any additional parameter like cstate or tsc. I am curious to know whether those are distro-provided versions or custom-compiled ones.

Today I ran some tests with Linux Mint 17.2 which comes with kernel 3.16.0 as its standard and recommended kernel. On Zotac Nano CI320 N2930 it worked for about 4 hours then froze. I actually used it only for 35 minutes, then the machine was on but simply idling for the remaining 3.5 hours. I know precisely when it froze as the frozen clock at the bottom of the screen was visible.

Is there any consensus on a kernel version that reliably works on Bay Trail?
Comment 187 Chen Yu 2016-03-19 04:00:32 UTC
Tested with 4.5.0 and glxgears on a T100, without any boot params. So far we have not reproduced this problem, although BzukTuk told me this method should freeze the system within 1-10 minutes.  Anyway, I'll keep up this stress testing.
Comment 188 jbMacAZ 2016-03-19 06:55:11 UTC
(In reply to dertobi from comment #185)
> I certainly don't want to destroy anyone's hopes, but I've had instances
> where my notebook ran stable for up to two weeks and then froze. <snip>

I'm just assessing the workarounds, while waiting for real fixes.

My nomwait solo test did freeze after about 4 hours - but then resumed by itself about 2 hours later without bluetooth and wifi working.  Rebooting restored communications.

(In reply to Chen Yu from comment #187)

> Tested with 4.5.0 and glxgears on T100, without any boot params, so far we
> have not reproduce this problem yet, as BzukTuk told me this method should
> freeze the system within 1-10 minutes.  Anyway I'll  keep up this stress
> testing.

There are several T100 models, which vary in how fast they freeze.  The T100T* models are more stable than the T100CHI.  Also, the very first freeze often takes longer than subsequent freezes.
Comment 189 julio.borreguero@gmail.com 2016-03-19 14:42:02 UTC
(In reply to Chen Yu from comment #187)
> Tested with 4.5.0 and glxgears on T100, without any boot params, so far we
> have not reproduce this problem yet, as BzukTuk told me this method should
> freeze the system within 1-10 minutes.  Anyway I'll  keep up this stress
> testing.

i think kernel 4.5.0 has a fix.
I am running it for several days now, but on a N2940. No freezing.
Since yesterday without any kernel boot parameters.
Anything prior to this kernel (any 4.4 kernel if you want to have a go) freezes for sure.

Also, there is a difference between N2940 and N2930.
For me, on a N2940 intel_idle.max_cstate never worked as a workaround, but it works on N2930 (deduced from posts in this thread).

I know it is still too early to say that 4.5.0 is fixed, but to me it certainly looks that way. Freezes on my system always occurred within 12h.
Comment 190 Hal 2016-03-19 15:10:20 UTC
(In reply to julio.borreguero@gmail.com from comment #189)
> 
> i think kernel 4.5.0 has a fix.
> I am running it for several days now, but on a N2940. No freezing.
> Since yesterday without any kernel boot parameters.
> Anything prior to this kernel (any 4.4 kernel if you want to have a go)
> freezes for sure.
> 

My lucky version for both N2930 and N3050 seems to be 4.4.6.

4.5.0 has brought up unrelated instabilities (mostly with VGA and Wireless) on my systems so I can't even thoroughly test it. 

4.4.6 on the other hand has been pretty good without cstates or tsc up to a point (a much, much longer time before freezing). Interestingly, on my Zotac box 4.4.6 spends much more time in the C1 state than in C6 or C7 according to powertop.

That said, the behavior of these different versions is quite wild. 
I tried to build a chart of hardware (2 separate computers one with N3050, the other with N2930) vs kernels (I have tested 3.16.0, 3.19.0, 4.0.0, 4.3.0, 4.4.0, 4.4.4, 4.4.5, 4.4.6, 4.5.0) and captured the freeze timing and conditions (like with or without video loss at freeze time) and the chart is full of inconsistencies. 

Repeat tests yield contradictory results most of the time. But, all in all 4.4.6 looks the best with the longest longevity. 4.3.0 seems to be the worst.

With cstates=2 freezing is almost nonexistent (it only happened once in more than 40 sessions). With cstates=1 I never got a freeze in any hardware/kernel combination, with some of the tests lasting more than 2 weeks. I never used a patch nor the tsc parameter.
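
The C-state residency that powertop reports can also be read directly, which makes it easier to compare kernels side by side. This is a minimal sketch, assuming the standard cpuidle sysfs layout (`cpuN/cpuidle/stateM/name` and `time`, the latter in microseconds since boot); the function name and parameters are only illustrative.

```python
from pathlib import Path


def cstate_residency(cpu=0, sysfs_root="/sys/devices/system/cpu"):
    """Return {state name: share of recorded idle time} for one CPU."""
    times = {}
    cpuidle = Path(sysfs_root, f"cpu{cpu}", "cpuidle")
    for state in sorted(cpuidle.glob("state*")):
        name = (state / "name").read_text().strip()
        # "time" is the total time spent in this state, in microseconds.
        times[name] = int((state / "time").read_text().strip())
    total = sum(times.values())
    return {name: (t / total if total else 0.0) for name, t in times.items()}
```

The shares are cumulative since boot, so for a per-session comparison one would take two snapshots of the raw `time` values and subtract them.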
Comment 191 jbMacAZ 2016-03-20 08:15:23 UTC
I have a N3540 system that freezes at most a couple times a month without any arguments, kernel version doesn't seem to matter.  .max_cstate {0,1} stabilized it.  Looking at the recent posts, the N-series appears to be the processor benefiting most from the new suggestions.  But the more smoke that gets cleared, the sooner the rest of the problems can be found.

On my Z3775 system (T100CHI), kernel 4.5.0 without arguments didn't last 2 minutes before freezing.  With idle=nomwait it ran 2 hours before the time display froze (frozen seconds), although the mouse cursor still moved.  Keyboard keys or mouse clicks were accepted about once every 90 seconds.

Next, maxcpus=2 and idle=nomwait produced a block of "serial8250: too much work for irq191" errors in dmesg.  Raising maxcpus to 3 got rid of them.  maxcpus= {2,3} yielded no obvious degradation when just browsing, etc, so I'll leave this running...  tsc may be destabilizing for some systems like mine.
Comment 192 cororok 2016-03-20 14:18:32 UTC
My Dell laptop has an N3540. It freezes on both Xubuntu 15.10 and 16.04 (still a beta version) within 30 minutes, especially when I use the Chrome browser,
but it works well with intel_idle.max_cstate=1 on both versions.

Kernel 4.4.6 (linux-headers-4.4.6-040406-generic_4.4.6-040406.201603161231_amd64.deb), which I downloaded from http://kernel.ubuntu.com/~kernel-ppa/mainline, does not work without the cstate flag.

So I downloaded the newer 4.5.0-rc7 (linux-headers-4.5.0-040500rc7_4.5.0-040500rc7.201603061830_all.deb) and it has been working well without the cstate flag for half a day. I will update the status in one or two days.
Comment 193 julio.borreguero@gmail.com 2016-03-20 16:59:20 UTC
update:
system freeze on 4.5.0 kernel on N2940 no kernel parameters.
it took many hours (~40) but finally it happened.
back to kernel 4.1.12....
Comment 194 Xermán 2016-03-20 19:26:26 UTC
I gave it a try with Ubuntu 15.10 and kernel 4.5
I also installed the Intel microdrivers.

I was able to play a full 50-minute video, but then the computer froze on the desktop without any CPU/GPU-intensive operation running (that I'm aware of).
Comment 195 cororok 2016-03-20 23:46:21 UTC
(In reply to cororok from comment #192)
> My dell laptop has N3540. It freezes on both xubuntu 15.10 and 16.04(still
> beta version) in 30m especially when I use chrome browser.
> but it works well with intel_idle.max_cstate=1 on both version.
> 
> kernel
> 4.4.6(linux-headers-4.4.6-040406-generic_4.4.6-040406.201603161231_amd64.deb
> ) that I download from http://kernel.ubuntu.com/~kernel-ppa/mainline does
> not work without cstate flag.
> 
> So I downloaded newer 4.5.0-rc7
> (linux-headers-4.5.0-040500rc7_4.5.0-040500rc7.201603061830_all.deb) and it
> is working well without cstate flag for half day. I will update the status
> after one or two days later.

4.5.0-rc7, even though it is better than the others, also froze.
Comment 197 jbMacAZ 2016-03-21 08:01:34 UTC
A combo bandaid for the Z3775 is idle=nomwait tsc=reliable maxcpus=3.  Test still running at 24 hours.  Better than 2 minutes without any kernel arguments... (Kernel 4.5.0.)
Comment 198 fao66134 2016-03-21 10:55:12 UTC
(In reply to Michal Feix from comment #180)
> So, setting idle=nomwait is not helping you. Fine. If
> intel_idle.max_cstate=1 is a working solution for you, could you please try
> with intel_idle.max_cstate=0 and post back result?

Only maxcpus prevents the freeze. My results are below.

running time(1st, 2nd) #parameters
30m, 1h30m #none
10m, 40m #idle=nomwait
1h, 2h #intel_idle.max_cstate=0
2h, 1h #intel_idle.max_cstate=0 idle=nomwait
30m, 1h #intel_idle.max_cstate=1

N3150 Gentoo drm-intel-nightly_kernel-4.5.0+
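
For comparing rows at a glance, here is a small sketch that converts the run times above to minutes and averages the two runs per parameter set. The helper names are made up for illustration, and with only two runs per row the averages are a rough guide, not a statistically meaningful result.

```python
import re

# The table above, copied verbatim.
RESULTS = """\
30m, 1h30m #none
10m, 40m #idle=nomwait
1h, 2h #intel_idle.max_cstate=0
2h, 1h #intel_idle.max_cstate=0 idle=nomwait
30m, 1h #intel_idle.max_cstate=1
"""


def to_minutes(text):
    """Convert a duration such as '1h30m' or '40m' to minutes."""
    text = text.strip()
    match = re.fullmatch(r"(?:(\d+)h)?(?:(\d+)m)?", text)
    if not text or not match:
        raise ValueError(f"unrecognised duration: {text!r}")
    hours, minutes = match.groups()
    return int(hours or 0) * 60 + int(minutes or 0)


def mean_uptime(table=RESULTS):
    """Return {parameters: mean minutes until freeze} for each row."""
    means = {}
    for line in table.splitlines():
        runs, params = line.split("#", 1)
        minutes = [to_minutes(run) for run in runs.split(",")]
        means[params.strip()] = sum(minutes) / len(minutes)
    return means
```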
Comment 199 Hal 2016-03-22 02:23:11 UTC
Interesting findings today:

1) Came across a 2 yr old system with a Bay Trail N2807 processor. Upgraded Ubuntu on it to kernel 4.4.6 with no parameter. It has been running for more than 12 hours without a glitch. What gives?! So, not all Bay Trail processors are afflicted by this problem?

2) I was given an Intel NUC box for testing which turned out to be identical to mine, with the same N3050. I duplicated my drive with dd and removed intel_idle.max_cstate=1. It kept working all day without missing a beat! When I remove cstate from my own machine, it freezes within the hour. So bizarre...

3) As I was digging into Virtualbox log files after my guest OS froze once again on my zotac, I noticed that there is nothing noteworthy until the moment of failure except for the message "28:28:41.009623 VMMDev: vmmDevHeartbeatFlatlinedTimer: Guest seems to be unresponsive. Last heartbeat received 4 seconds ago". 

Then when I shutdown the guest OS window, Virtualbox adds a very extensive report about the state of the machine at the time it froze or became unresponsive. 

So, this might be a good tool to help investigate how the failure is taking place. 

My thinking is that this OS freezing problem is occurring the same, whether it is on a host (physical) machine or a guest (virtual) machine. 

It has been found that with intel_idle.max_cstate=1 or alternative special kernel parameters we can get the kernel to behave differently and avoid failure.

But that doesn't work with a virtual machine and whatever is causing the failure is making the virtual machine fail unrestrained. But that also indicates that the kernel software is falling apart not the microprocessor (or the microprocessor's microcode) otherwise when the virtual machine fails the host should also fail.

The dump in the virtualbox log file on closing is very rich in info, unfortunately it's way above my knowledge base. So, if anyone would be interested in analyzing it I could furnish it, although I think it is very easy to make the failure occur in virtualbox same as on the host.
Comment 200 RussianNeuroMancer 2016-03-22 03:47:48 UTC
Does anybody know why these patches haven't been upstreamed?

https://github.com/hadess/rtl8723bs/tree/master/patches_4.5
Comment 201 jds 2016-03-22 06:46:57 UTC
Created attachment 210171 [details]
attachment-21257-0.html

I get that we shouldn't turn this bug report into a forum discussion, but
what I just don't understand is why this bug isn't considered absolutely
critical.  Personally it doesn't affect me that much -- work gives me a
very nice macbook pro -- but this bug gives the lie to decades of making
fun of Windows BSODs.

A system that can't stay up for 30 minutes?  For millions and millions of
users -- all on the lower-end of the performance spectrum?  For kernels
that go back 2 years?  It's a massive pie in the face.

I've added the cstate kernel parameter.  The machine is more stable but
battery life has gone to hell.  Such is Linux, today.
Comment 202 Mika Kuoppala 2016-03-23 08:50:01 UTC
https://cgit.freedesktop.org/~miku/drm-intel/log/?h=rc6_test

3 _tentative_ patches on that tree. Please try.
Comment 203 dertobi 2016-03-23 11:58:28 UTC
What desktop are you all running? For me it's gnome-shell. Maybe there's some connection between software, hardware and that freeze that we've been missing so far.
Comment 204 julio.borreguero@gmail.com 2016-03-23 18:09:49 UTC
(In reply to Mika Kuoppala from comment #202)
> https://cgit.freedesktop.org/~miku/drm-intel/log/?h=rc6_test
> 
> 3 _tentative_ patches on that tree. Please try.

i am running 4.5.0 with 3 tentative patches from mika ;-)
Already stresstesting for about 5h now. i will post any results here.
Comment 205 Travis Hall 2016-03-24 05:37:53 UTC
(In reply to Mika Kuoppala from comment #202)
> https://cgit.freedesktop.org/~miku/drm-intel/log/?h=rc6_test
> 
> 3 _tentative_ patches on that tree. Please try.
I got the hang after 7 and a half hours of letting my N2940 run youtube and a twitch stream.
Comment 206 jds 2016-03-24 06:19:24 UTC
(In reply to dertobi from comment #203)
> What desktop are you all running? For me it's gnome-shell. Maybe there's
> some connection between software, hardware and that freeze that we've been
> missing so far.

I don't think so.  I've tried with Cinnamon and Gnome 3.
Comment 207 dertobi 2016-03-24 06:45:46 UTC
(In reply to jds from comment #206)
> (In reply to dertobi from comment #203)
> > What desktop are you all running? For me it's gnome-shell. Maybe there's
> > some connection between software, hardware and that freeze that we've been
> > missing so far.
> 
> I don't think so.  I've tried with Cinnamon and Gnome 3.

Cinnamon is a Gnome 3 fork though.
Comment 208 jds 2016-03-24 18:40:59 UTC
(In reply to dertobi from comment #207)
> (In reply to jds from comment #206)
> > (In reply to dertobi from comment #203)
> > > What desktop are you all running? For me it's gnome-shell. Maybe there's
> > > some connection between software, hardware and that freeze that we've
> been
> > > missing so far.
> > 
> > I don't think so.  I've tried with Cinnamon and Gnome 3.
> 
> Cinnamon is a Gnome 3 fork though.

Ah, you're right.  I did try MATE too briefly, which I think is a Gnome 2 fork, and it crashed -- at the time I suspected Chrome/Chromium as the issue, so I didn't connect it with this bug.
Comment 209 Juha Sievi-Korte 2016-03-25 12:20:46 UTC
Update: Grabbed 4.5.0 for testing on affected system (Acer B-115M, N3540). This is downloaded from opensuse repos this time, exact version:

Linux cardhu 4.5.0-58.gb2c9ae5-default #1 SMP PREEMPT Wed Mar 16 17:30:21 UTC 2016 (b2c9ae5) x86_64 x86_64 x86_64 GNU/Linux

Running without a freeze for a week now in my normal use, and stress-testing since this morning with HD videos. I'll report back if it freezes.

Someone asked about the desktop, I use xfce (some gnome-services running though). Have verified the freezes with two distributions, Ubuntu and Opensuse.
Comment 210 julio.borreguero@gmail.com 2016-03-25 12:29:18 UTC
it definitely is a kernel bug. read old posts in this thread.
i have verified this bug on 2 distributions and am running gentoo now, where everything is compiled.

I am running 4.5.0 from the GitHub kernel stable repo
plus Mika's 3 patches, for the second day under full load, and no freeze yet.
Comment 211 cororok 2016-03-25 19:13:27 UTC
I think the problem happens when the C-state changes. If that is right, then a test needs a workload that moves the CPU load up and down, so that it reaches the situation where the CPU can get stuck.

In my case it happened when I used the Chrome browser on Xubuntu, so I guessed it is related to the GPU, but I don't have any knowledge about that.
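
A workload of that shape can be sketched as below: it alternates a short busy loop with a short sleep so that the core keeps going idle and waking up again. This is only an illustration of the idea, with made-up names and timings; there is no evidence in this thread that it triggers the freeze, and it exercises just one core per process.

```python
import time


def pulse_load(busy_s=0.05, idle_s=0.05, cycles=100):
    """Alternate busy-spinning and sleeping; return the number of completed cycles."""
    done = 0
    for _ in range(cycles):
        deadline = time.monotonic() + busy_s
        while time.monotonic() < deadline:
            pass  # keep the core busy until the deadline
        time.sleep(idle_s)  # give the core a chance to enter an idle state
        done += 1
    return done
```

Running one copy per core (for example with `multiprocessing`) and varying `idle_s` would cover more of the idle-state transitions.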
Comment 212 jds 2016-03-25 19:15:53 UTC
That's what I thought too at first -- and that sent me scrambling to look at Chrome flags etc.  But then I observed two different systems lock up even when no browser at all was running.

(In reply to cororok from comment #211)
> I think the problem happens when C-state is changed. If it is right in order
> to test it needs a condition which changes CPU load up and down so that it
> can reach a certain situation where the CPU can get stuck.
> 
> In my case it happened when I use Chromebrowser on Xubuntu so I guessed it
> is related to GPU but I don't have any knowledge about that.
Comment 213 podschie 2016-03-25 21:58:35 UTC
(In reply to jds from comment #212)
> That's what I thought too at first -- and that sent me scramblingly looking
> at chrome flags etc.  But then I observed two different systems lock up even
> when no browser at all was running.
> 
> (In reply to cororok from comment #211)
> > I think the problem happens when C-state is changed. If it is right in
> order
> > to test it needs a condition which changes CPU load up and down so that it
> > can reach a certain situation where the CPU can get stuck.
> > 
> > In my case it happened when I use Chromebrowser on Xubuntu so I guessed it
> > is related to GPU but I don't have any knowledge about that.

I can confirm that my Acer ES1-311 with its Intel 3540 CPU crashes not only while using the Chromium browser. But I noticed that it crashes more often with Chromium than with Firefox. Mostly it happens when I play a video on YouTube or scroll the Facebook timeline.

If I'm working with the PC without using any browser, the system seems stable. Writing with LibreOffice, graphic manipulation with GIMP or RawTherapee work pretty well with the 4.2.0-34 Kernel (Lubuntu) and I do not get as many freezes as before. But watching a DVD with an external drive is not possible, the system freezes within minutes. Strangely the freeze occurs pretty often if I'm just reading .pdf Documents with Evince.
Comment 214 Veronica 2016-03-25 23:33:30 UTC
Hi, i own an Asus Chromebox (Haswell Intel Celeron 2955U / 1.4 GHz) and I've always experienced full system freeze in any Linux distros I've tested including Kodibuntu and OpenElec BUT never had such issues with Windows 8/8.1/10 (currently booting off external HDD)
Currently I'm running GalliumOS based on Ubuntu 15.04 with Xfce off internal SSD, came with kernel 4.1.14 by default.

What I've tried:
GRUB_CMDLINE_LINUX_DEFAULT="quiet splash intel_idle.max_cstate=1 tpm_tis.interrupts=0 i915.enable_ips=0"

GRUB_CMDLINE_LINUX_DEFAULT="quiet splash intel_idle.max_cstate=2 tpm_tis.interrupts=0 i915.enable_ips=0"

NOTES:
*Just added intel_idle.max_cstate argument after "splash" the rest is default within /etc/default/grub.
* Neither worked: intel_idle.max_cstate=1 froze in less than 10m while working in Terminal, and intel_idle.max_cstate=2 froze in less than 15m while watching Netflix in Chrome Browser.

Currently testing:
Kernel 4.1.12 with no args from http://kernel.ubuntu.com/~kernel-ppa/mainline/v4.1.12-wily/ as some users suggested. Will report back if it freezes, otherwise after 2+ days.

Anything else I could try? I can't compile I just have the Chromebox for now plus I'm not that advanced. T.I.A
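
One thing worth ruling out is that the edited GRUB line never reached the running kernel: on Ubuntu-based systems, changes to /etc/default/grub normally take effect only after running update-grub and rebooting. Here is a minimal sketch for checking what the kernel actually booted with, by parsing /proc/cmdline; the function name and its `cmdline_path` parameter are only illustrative.

```python
from pathlib import Path


def boot_params(cmdline_path="/proc/cmdline"):
    """Parse the kernel command line into {name: value}; bare flags map to None."""
    params = {}
    for token in Path(cmdline_path).read_text().split():
        # Split on the first "=" only, so values may contain "=" themselves.
        name, sep, value = token.partition("=")
        params[name] = value if sep else None
    return params
```

If `boot_params().get("intel_idle.max_cstate")` returns `None`, the parameter was not passed at boot and the test did not exercise the workaround.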
Comment 215 Veronica 2016-03-25 23:38:11 UTC
Update: System just froze with kernel 4.1.12. This is very frustrating.
Comment 216 Veronica 2016-03-25 23:44:47 UTC
Forgot to mention that Ubuntu Server 14.04.x is the only distro that has worked reliably for me, ran it for several months without issues then needed full OS so uninstalled.
Comment 217 Brent Davis 2016-03-26 08:46:18 UTC
I've been watching the posts on this bug report for several days now and thought I would post my own personal experience. I just bought a laptop with an N3540 chip in it and have also been experiencing random system lockups with the 4.x series kernels. But I wanted to mention that for some reason the stock kernel that comes with Debian "Jessie" gives me no problems whatsoever crash-wise. In fact, the only reason I've been trying to use a 4.x kernel is that my graphics performance seems to improve drastically with them, especially in OpenGL applications (I've seen much higher FPS in apps). I did try the patches Mika Kuoppala posted on the stable 4.4.6 kernel from kernel.org but had a lockup after about an hour of use. It seems to me the crashes happen most when browsing websites in Chrome, but I have had lockups doing other things. I'm going to try the stable git tree of 4.5.0 and see what happens. If 4.5 still locks up with Mika's patches, then I don't know what to do other than go back to Debian's 3.16 kernel. I'll take a performance hit, but at least the computer will run without crashing.
Comment 218 julio.borreguero@gmail.com 2016-03-26 09:25:09 UTC
(In reply to Mika Kuoppala from comment #202)
> https://cgit.freedesktop.org/~miku/drm-intel/log/?h=rc6_test
> 
> 3 _tentative_ patches on that tree. Please try.

system freeze after ~2 days

(In reply to Veronica from comment #215)
> Update: System just froze with kernel 4.1.12, this is very frustrating.

i think you are the only one with a freeze on 4.1.[12-15] so far.
but then i haven't seen anyone posting with a 2955U unit in this thread.
please double-check you are running the correct kernel with uname -a
or try 3.16 as suggested by brent davis and others
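
A small Python sketch of the same check, for those who prefer to script it. It uses platform.release(), which reports the same release string as "uname -r"; the function names and the example version tuple are only illustrative.

```python
import platform
import re


def parse_release(release):
    """Turn a release string such as '4.1.12-040112-generic' into (4, 1, 12).

    A missing patch level (e.g. '4.5') is treated as 0.
    """
    match = re.match(r"(\d+)\.(\d+)(?:\.(\d+))?", release)
    if match is None:
        raise ValueError("unrecognised kernel release: %r" % release)
    major, minor, patch = match.groups()
    return int(major), int(minor), int(patch or 0)


def is_expected_kernel(expected, release=None):
    """Check that the running (or given) release starts with the expected tuple.

    expected may be (4, 1) to accept any 4.1.x, or (4, 1, 12) for an exact match.
    """
    if release is None:
        release = platform.release()  # same string as `uname -r`
    return parse_release(release)[:len(expected)] == tuple(expected)


if __name__ == "__main__":
    print(platform.release(), is_expected_kernel((4, 1, 12)))
```

This only compares version numbers; it cannot tell two differently patched builds of the same version apart.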
Comment 219 RussianNeuroMancer 2016-03-26 09:28:58 UTC
Veronica and Brent, please check whether the workaround mentioned in the bug report title at least makes the system hang much later (or not hang at all). If that is the case, then it's worth trying the patches from comment #203 instead of the workaround.
Comment 220 Dmitry 2016-03-26 12:48:12 UTC
(In reply to julio.borreguero@gmail.com from comment #218)
> i think you are the only one with a freeze on 4.1.[12-15] so far.
No, not the only one. I use 4.1.* kernels (now 4.1.20) every day on a Bay Trail Z3770 tablet and have rare freezes. Of course with the MMC PM QoS patches. max_cstate=1 helps, but with much higher power consumption. I also hit another mysterious bug, where my tablet just turns off. It looks like overheating, but I don't know for sure.
The latest kernel git also has a bug with display blinking and corruption, so I can't use it for long enough to see a hang.

P.S. I hit hang once when I was reading book with fbreader. Nothing more, just fbreader.
Comment 221 Veronica 2016-03-26 14:57:30 UTC
(In reply to julio.borreguero@gmail.com from comment #218)
> (In reply to Mika Kuoppala from comment #202)
> > https://cgit.freedesktop.org/~miku/drm-intel/log/?h=rc6_test
> > 
> > 3 _tentative_ patches on that tree. Please try.
> 
> system freeze after ~2 days
> 
> (In reply to Veronica from comment #215)
> > Update: System just froze with kernel 4.1.12, this is very frustrating.
> 
> i think you are the only one with a freeze on 4.1.[12-15] so far.
> but then i haven't seen anyone posting with a 2955U unit in this thread.
> please double-check you are running the correct kernel with uname -a
> or try 3.16 as suggested by brent davis and others

Yes I did verify. I'm very cautious when testing. What I did was press the Shift key while booting > Advanced options and select kernel 4.1.12 generic.
I know I'm the first with a Haswell to report but my Chromebox is having the exact same symptoms people in here are having.
Comment 222 Veronica 2016-03-26 14:59:50 UTC
(In reply to RussianNeuroMancer from comment #219)
> Veronica and Brent, please check whether the workaround mentioned in the bug
> report title at least makes the system hang much later (or not hang at all).
> If that is the case, then it's worth trying the patches from comment #203
> instead of the workaround.

Hi, as I mentioned in post #214 cstate=1 and cstate=2 didn't work for me. The first one froze in less than 10m and the second in less than 15m.
Comment 223 GConst 2016-03-26 16:58:51 UTC
Hello, I have the same freezing on Ubuntu 14.04.3 with kernel 3.19 on my ASRock Q1900DC-ITX. Following the information here I reinstalled the system and downgraded to 14.04.02 with the 3.16.0.30 kernel, but today it got stuck again :-( The only Linux version that has worked fine was Oracle Linux 7.2 with 3.10
Comment 224 Ernst Herzberg 2016-03-26 18:29:45 UTC
Maybe related patch?

http://www.spinics.net/lists/intel-gfx/msg90977.html
Comment 225 julio.borreguero@gmail.com 2016-03-26 19:14:44 UTC
(In reply to Ernst Herzberg from comment #224)
> Maybe related patch?
> 
> http://www.spinics.net/lists/intel-gfx/msg90977.html

looks interesting.
it seems to be for a different kernel version than 4.5.0 though, 2 out of 3 hunks fail, but i hopefully managed to adapt the patch and am compiling a new test-kernel just now and will post any positive results, if so.

definitely worth a try, looks promising from the description. thanks
Comment 226 dertobi 2016-03-26 21:19:56 UTC
that patch looks indeed promising, I'm compiling the latest drm-intel kernel from git with that patch now (no hunks failing). Will report much later, as I expect this compilation process to take a long time on this hardware. :-)
Comment 227 julio.borreguero@gmail.com 2016-03-26 21:56:53 UTC
Created attachment 210771 [details]
drm/i915: Prevent machine death on Ivybridge context switching for kernel 4.5.0 from kernel archive

this is Chris Wilson's patch for the latest drm-intel kernel, slightly modified for the latest v4.5.0 kernel from the stable kernel archive repo tree
https://git.kernel.org/pub/scm/linux/kernel/git/stable/linux-stable.git
Comment 228 julio.borreguero@gmail.com 2016-03-26 23:16:39 UTC
(In reply to julio.borreguero@gmail.com from comment #227)
> Created attachment 210771 [details]
> drm/i915: Prevent machine death on Ivybridge context switching for kernel
> 4.5.0 from kernel archive
> 
> this is Chris Wilsons patch for latest drm-intel kernel slightly modified
> for latest kernel v4.5.0 from stable kernel archive repo tree
> https://git.kernel.org/pub/scm/linux/kernel/git/stable/linux-stable.git

it froze within 2h.
probably not worth trying for anyone else. anyway let's see what dertobi's test with the original patch on the drm-intel kernel leaves us with
Comment 229 Dimitris Roussis 2016-03-27 09:07:57 UTC
The bug is still "P1 normal"!!!

This bug affects 30% of all laptops on the market at this moment. It is one of the most serious bugs, and it has never been investigated!! There are thousands of disappointed Linux users.

How can we contact the most senior kernel developers to tell them how serious this situation is?
Comment 230 dertobi 2016-03-27 09:17:53 UTC
I hate to break it but the Chris Wilson patch is not the fix. My laptop froze within an hour.
Comment 231 cororok 2016-03-27 11:34:41 UTC
(In reply to Dimitris Roussis from comment #229)
> The bug is still "P1 normal"!!!
> 
> This bug affects 30% of all laptops on the market at this moment. It is one
> of the most serious bugs, and it has never been investigated!! There are
> thousands of disappointed Linux users.
> 
> How can we contact the most senior kernel developers to tell them how
> serious this situation is?

You're absolutely right. It is a very serious bug because it freezes the computer.
Bay Trail machines are low-end computers, and many of their users are probably non-technical people who want to try a light OS because Windows 10 does not run happily with limited memory. But they will be very disappointed.
Comment 232 Brent Davis 2016-03-27 13:13:55 UTC
Just wanted to give a quick update since the last post I made stated I was gonna try the latest stable GIT with Mika's 3 tentative patches. So far it's been 24 hours and I have not experienced a crash. Not sure yet if this is just luck or if a real difference has been made. But I can definitely say my stability coming from 4.4.6 has vastly improved. Been doing everything I can to break this thing. YouTube, h264 video, opengl games, etc.
Comment 233 dertobi 2016-03-27 14:42:10 UTC
(In reply to Brent Davis from comment #232)
> Just wanted to give a quick update since the last post I made stated I was
> gonna try the latest stable GIT with Mika's 3 tentative patches. So far it's
> been 24 hours and I have not experienced a crash. Not sure yet if this is
> just luck or if a real difference has been made. But I can definitely say my
> stability coming from 4.4.6 has vastly improved. Been doing everything I can
> to break this thing. YouTube, h264 video, opengl games, etc.

I just also tried Mika's three tentative patches applied to latest drm-intel as well as Chris Wilson's patch, and within an hour my system crashed yet again.

Brent, are you making sure you don't have the usual workaround parameters on the kernel command line while testing the patches (happened to me before; you can check with # cat /proc/cmdline)?
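
A minimal Python sketch of that check, in case it helps. It reads /proc/cmdline and reports whether the workaround from the bug title is still set; the function names are illustrative, and the default list only contains intel_idle.max_cstate, so extend it if you use other workaround options.

```python
def read_cmdline(path="/proc/cmdline"):
    """Return the kernel command line the running system was booted with."""
    with open(path) as handle:
        return handle.read().strip()


def find_params(cmdline, names=("intel_idle.max_cstate",)):
    """Return a dict of the listed parameters that appear on the command line.

    Parameters without '=value' map to an empty string.
    """
    found = {}
    for token in cmdline.split():
        key, _, value = token.partition("=")
        if key in names:
            found[key] = value
    return found


if __name__ == "__main__":
    active = find_params(read_cmdline())
    if active:
        print("workaround still active:", active)
    else:
        print("no workaround parameters on the command line")
```

An empty result means the test really ran without the workaround.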
Comment 234 Allen 2016-03-27 16:23:35 UTC
I have an ASUS motherboard with Celeron J1900 cpu. 
For me, kernel 3.19.0-47 from Ubuntu 14.04.3 is stable with options
  intel_idle.max_cstate=1 nox2apic loglevel=7 debug . 
System is used for web browsing and openvpn client .
Crashes were usually happening while scrolling a large web page with mouse wheel (such as wsj.com or nytimes.com front page).
Comment 235 ladiko 2016-03-27 17:17:02 UTC
We have about 50 mainboards with J1900 and some samples with J1800, N3050, N3150, and we had to go back to the original Ubuntu 14.04 kernel 3.13, as even the lts-utopic kernel 3.16 froze (rarely, but sometimes) on a few mainboards.
Comment 236 Dimitris Roussis 2016-03-27 18:50:04 UTC
The last stable kernel without this horrible bug is 3.16.7.

Canonical provides extended support for this kernel until April 2016!!! I hope this bug will be fixed by then.
Comment 237 Brent Davis 2016-03-27 19:20:02 UTC
(In reply to dertobi from comment #233)
> (In reply to Brent Davis from comment #232)
> > Just wanted to give a quick update since the last post I made stated I was
> > gonna try the latest stable GIT with Mika's 3 tentative patches. So far
> it's
> > been 24 hours and I have not experienced a crash. Not sure yet if this is
> > just luck or if a real difference has been made. But I can definitely say
> my
> > stability coming from 4.4.6 has vastly improved. Been doing everything I
> can
> > to break this thing. YouTube, h264 video, opengl games, etc.
> 
> I just also tried Mika's three tentative patches applied to latest drm-intel
> as well as Chris Wilson's patch, and within an hour my system crashed yet
> again.
> 
> Brent, are you making sure you don't have the usual workaround parameters in
> the command prompt while testing the patches (happened to me before, you can
> check with #cat /proc/cmdline)?

Haven't touched my bootup command with the cstate flags or anything. Just been testing kernels and patches. Didn't see any reason to because I'm looking for a permanent solution as opposed to a workaround. But yeah, for all I know mine might crash too. Just wish there was some foolproof way to replicate it instead of just waiting for it to happen.
Comment 238 Hal 2016-03-28 18:18:58 UTC
(In reply to Dimitris Roussis from comment #236)
> The last stable kernel without this horrible bug is 3.16.7.
> 

Is this your own experience on your own computer or some information from an authoritative source?
Because, Linux Mint 17.2 comes with 3.16.0 and it is prone to freeze on many  N3050 and N2930 machines that I tested.

Also, based on your statement, I installed 3.16.7 from the ubuntu repo http://kernel.ubuntu.com/~kernel-ppa/mainline/v3.16.7-ckt19-utopic/linux-headers-3.16.7-031607-generic_3.16.7-031607.201510301030_amd64.deb

It only worked for 1 hr on one machine and 3.5 hrs on the other.

So, if I may suggest, please do not make such authoritative, blanket statements, unless you can cite an authoritative source. Otherwise, simply say that this applies to your own equipment.

Also, on my own two computers (if you look up this long thread you'll see my config info) several 3.16.n versions have been tested. They absolutely all eventually froze.
The only good thing is that cstate=2 reduces the failure rate significantly, and cstate=1 literally eliminates freezing on my computers.
Comment 239 Dimitris Roussis 2016-03-28 18:35:10 UTC
(In reply to Hal from comment #238)
> (In reply to Dimitris Roussis from comment #236)
> > The last stable kernel without this horrible bug is 3.16.7.
> > 
> 
> Is this your own experience on your own computer or some information from an
> authoritative source?
> Because, Linux Mint 17.2 comes with 3.16.0 and it is prone to freeze on many
> N3050 and N2930 machines that I tested.
> 
> Also, based on your statement, I installed 3.16.7 from the ubuntu repo
> http://kernel.ubuntu.com/~kernel-ppa/mainline/v3.16.7-ckt19-utopic/linux-
> headers-3.16.7-031607-generic_3.16.7-031607.201510301030_amd64.deb
> 
> It only worked for 1 hr on one machine and 3.5 hrs on the other.
> 
> So, if I may suggest, please do not make such authoritative, blanket
> statements, unless you can cite an authoritative source. Otherwise, simply
> say that this applies to your own equipment.
> 
> Also, on my own two computers (if you lookup this long thread you'll see my
> config info) several 3.16.n versions have been tested. They absolutely all
> eventually froze.
> The only good thing is that cstate=2 reduces the failure rate significantly,
> and cstate=1 literally eliminates freezing on my computers.

If you read the comments, almost all users said that kernel 3.16.7 works without problems (comments 11, 35, 81, 105 etc).

That is also my situation with an N3050 machine: everything above this kernel unfortunately does not work.

If there are machines that have problems even with this kernel or older ones, the bug is more serious than we think.
Comment 240 Markus Rehbach 2016-03-28 19:02:48 UTC
On my netbook 

Acer Aspire ES1-111/R2, BIOS V1.16 10/20/2015 Celeron N2940 4GB

I'm back to CentOS 7 (kernel 3.10 whatever) and the power drain is acceptable for me. No freeze problems so far, in contrast to Ubuntu 15.10.

Tried 4.5 mainline from elrepo and this one is working stable for me, too. 4.4.4 mainline was unstable.

Will try 4.4.6 now and will report MY (!) results.
Comment 241 Hal 2016-03-28 19:46:54 UTC
(In reply to Dimitris Roussis from comment #239)
> (In reply to Hal from comment #238)
> > (In reply to Dimitris Roussis from comment #236)
> > > The last stable kernel without this horrible bug is 3.16.7.
> > > 
> > 
> > Is this your own experience on your own computer or some information from
> an authoritative source?
...
> 
> If you read the comments, almost all users said that kernel 3.16.7 works
> without problems (comments 11, 35, 81, 105 etc).
> 

Not quite accurate though. Please note:

1) Vladmir Jicha in comments #104 and #162 mentioned that his computer froze about twice a week even with kernel 3.13

2) Ladiko in #235 indicates that some of their 50 boards freeze with 3.16 (and also 3.13)

I know that many people mentioned that they believed 3.16 worked well, without patching, on their specific hardware. And that's fine. But, that can't be generalized and turned into an authoritative statement that 3.16 was freeze free on the microprocessors that this thread is focusing on.

I can concur with the few unlucky people here that freezing problems are even occurring on version 3.13.

There are so many versions of the 3.x and 4.x kernels out there, and so many compilation sources, that performing a test matrix worthy of drawing conclusions from is almost impossible at this time.
Between standard issue distro kernels and what can be downloaded and compiled from kernel.org, or what's on ubuntu's prolific mainline kernel-ppa, I am quite convinced that when two people are referring to a particular version they are not necessarily talking about the same binary - far from it.
And I am not even talking about the privately patched derivatives...

So, no - unfortunately Linux has taken a very bad turn this time and troubleshooting this issue is going to be a miserable experience. (And that's probably why this bug is still not fixed after 15 months since its first discovery).
And I am not even talking about retrofitting the fix into all these versions.

Although, I would personally be very happy if I had only one version with a fix, like 4.5!
Oh, but wait! there is a release candidate version 4.6 already!
WHAT A JOKE!
Comment 242 ladiko 2016-03-28 20:00:10 UTC
regarding 2): we have issues with 3.16, but not with 3.13 which we use right now.
Comment 243 Dimitris Roussis 2016-03-28 20:04:22 UTC
For me, Mr. Len Brown bears a big responsibility for this situation.

How is it possible that such a serious bug, affecting 30% or more of the new laptops on the market, is assigned to you and its Importance is still P1 normal, without the developers above you being informed?

I think Mr. Len doesn't understand the effect of this bug on the Linux world!!

Just go to a computer shop. Half of the laptops use these CPUs!! What can we say to all these people? Don't use Linux, wait 2 more years until someone is interested in fixing the bug!!! The worst situation I have seen in Linux in the last 10 years.
Comment 244 Dimitris Roussis 2016-03-28 20:11:43 UTC
And also, what do you think? That all people are Linux experts who can try different kernels like us?

It works exactly this way: somebody goes to the shop, buys a new laptop and installs the latest Ubuntu. After 20 minutes the system freezes and he says "Linux sucks, I'll never use it again!!"

That's it!!
Comment 245 Hal 2016-03-28 20:34:44 UTC
(In reply to ladiko from comment #242)
> regarding 2): we have issues with 3.16, but not with 3.13 which we use right
> now.

My apologies. The parenthetical was left over from an edit to the sentence after I realized that you were referring to 3.16 as freezing, but not 3.13. Thank you for pointing it out.

In any event my point to this group is that, I personally do not believe that a solution to this problem will be found anytime soon, as we cannot even identify the turning point beyond which this problem started to show up. And kernel version proliferation is certainly one reason for that.

For those for whom, like me, this issue has serious ramifications (beyond the fun of using linux distros at home, with friends and family, etc.), like losing credibility in front of a prospective customer because you can't even give a presentation with your beautiful lightweight laptop computer without rebooting twice in one hour, I have a word of advice: Go buy yourself an entry level Mac laptop because it's time for evasive action.

It was fun riding the Linux wave - a little over 12 years for me. But now it's time to move on.
Comment 246 jds 2016-03-28 21:12:30 UTC
This is on my work macbook, which I sleep at the end of every workday:

$ uptime
17:11  up 154 days,  3:13, 6 users, load averages: 1.34 1.83 1.90

(In reply to Hal from comment #245)
> (In reply to ladiko from comment #242)
> > regarding 2): we have issues with 3.16, but not with 3.13 which we use
> right
> > now.
> 
> My apology. The parentheses was a left over from the edit to the sentence
> after I realized that you were referring to 3.16 as freezing, but not 3.13.
> Thank you for pointing it out.
> 
> In any event my point to this group is that, I personally do not believe
> that a solution to this problem will be found anytime soon, as we cannot
> even identify the turning point beyond which this problem started to show
> up. And kernel version proliferation is certainly one reason for that.
> 
> For those who like me, this issue has serious ramifications (beyond the fun
> of using linux distros at home, with friends and family, etc.), like losing
> credibility in front of a prospective customer because you can't even give a
> presentation with your beautiful lightweight laptop computer without
> rebooting twice in one hour, I have a word of advice: Go buy yourself an
> entry level Mac laptop because it's time for evasive action.
> 
> It was fun riding the Linux wave - a little over 12 years for me. But now
> it's time to move on.
Comment 247 micha 2016-03-28 21:55:06 UTC
Weird enough, but this thread is giving me back some hope!

I bought an Asus F551 (Intel N2930) laptop last year in February which came with a pre-installed Windows 8.1 64bit and which was running flawlessly - until I updated to Windows 10. Right after the update the laptop started to freeze randomly. Since I spend most of the time editing PHP code and watching the result in Firefox, I'm not really bringing the machine to its limits. And maybe that's the reason those freezes didn't happen so often. Sometimes it took 3 days, sometimes I got 3 crashes within half an hour. Absolutely unpredictable. Except for my unsaved program changes: no data loss - not a single hint in the system log.

I filed a detailed report to Asus then, but the only suggestion was restoring the machine to its shipping state. Poor, isn't it? And that's why I decided to give Linux a try instead. To keep it short: My Linux (Ubuntu studio) 4.2.0-34.lowlatency #39-Ubuntu SMP PREEMPT is freezing, too.

To me this looked very much like a hardware defect and my next idea was running memtest86. Strangely enough I got not a single error when running the test with just ONE CPU, but hundreds of errors (all at address 0) with multiple processors involved.

Yeah, and so I was almost giving up hope on this laptop until I came across this thread today.
My first test was running 2 glxgears plus watching a video in firefox: Freeze after about 10 minutes.
After a reboot
Comment 248 micha 2016-03-28 21:59:56 UTC
(In reply to micha from comment #247)
> Weird enough, but this thread is giving me back some hope!
> 
> I bought an Asus F551 (Intel N2930) laptop last year in February which came
> with a pre-installed Windows 8.1 64bit and which was running flawlessly -
> until I updated to Windows 10. Right after the update the laptop started to
> freeze randomly. Since I spend most of the time editing PHP code and
> watching the result in Firefox, I'm not really bringing the machine to its
> limits. And maybe that's the reason those freezes didn't happen so often.
> Sometimes it took 3 days, sometimes I got 3 crashed within half an hour.
> Absolutely unpredictable. Except of my unsaved program changes: no data loss
> - not a single hint in the system log.
> 
> I filed a detailed report to Asus then, but the only suggestion was
> restoring the machine to its shipping state. Poor, isn't it? And that's why
> I decided to give Linux a try instead. To keep it short: My Linux (Ubuntu
> studio) 4.2.0-34.lowlatency #39-Ubuntu SMP PREEMPT is freezing, too.
> 
> To me this looked very much like a hardware defect and my next idea was
> running memtest86. Strange enough I got not a single error when running the
> test with just ONE cpu, but hundreds of errors (all at address 0) with
> multiple processors involved.
> 
> Yeah, and so I was almost giving up hope on this laptop until I came across
> this thread today.
> My first test was running 2 glxgears plus watching a video in firefox:
> Freeze after about 10 minutes.
> After a reboot

sorry, accidentally hit the wrong key ...
after the reboot 4 hours ago I added cstate=1 to the boot params and the system is still alive, continuously running 2 glxgears and playing videos.
Comment 249 cororok 2016-03-28 22:02:01 UTC
I guess Intel already knew about the bug, but I wonder why they don't fix it.

User experience is supposed to differ between expensive Core and cheap Bay Trail so that Intel makes a huge profit on Core CPUs. Windows 10 fits this strategy because it is slow with low memory (the Intel Compute Stick with Linux has 1 GB RAM compared to 2 GB for Windows 10).

Is Intel happy with this situation, which confines Bay Trail to Windows 10, like netbooks were limited to a 10 inch size?
Comment 250 dertobi 2016-03-28 22:18:29 UTC
(In reply to cororok from comment #249)
> I guess Intel already knew the bug but wonder why they don't fix it.
> 
> User experience should be different between expensive Core and cheap Bay
> trail so that Intel make a huge profit in Core cpus. Windows 10 meets this
> strategy because it is slow on low memory (Intel computer stick with linux
> has 1gb ram compared to 2gb for windows 10).
> 
> Is Intel happy with this situation which restrains Bay trail in windows 10?
> like netbook was limited in 10 inch size?

Let's not devolve into conspiracy theories. For me this looks like incompetency paired with negligence. Still bad.
Comment 251 RussianNeuroMancer 2016-03-28 22:27:11 UTC
In case anyone needs them, here are amd64 deb packages with the patches from comment #202:

https://github.com/milikhin/z3735-linux-patches
https://drive.google.com/folderview?id=0BzIRxogf-cVkLWdiMTRoenU5amM 

The Linux 4.6rc1 package also includes a workaround for bug 112571.
Comment 252 cororok 2016-03-28 23:30:38 UTC
An internet article pointing to this bug:

http://www.phoronix.com/scan.php?page=news_item&px=Intel-Linux-Bay-Trail-Fail
Comment 253 Hal 2016-03-29 01:10:39 UTC
(In reply to micha from comment #248 & #247)

The Asus F551 is a decent machine (not a great machine for the intended use of Ubuntu Studio though). I reconfigured one with Linux Mint several months ago for a relative of mine. I remember having tried a low latency kernel (I can't recall the exact version) but the performance was terrible. Generally speaking processors in the N2930 class are not good candidates for low latency versions of the kernel. I eventually set that machine with Linux Mint 17.2 kernel 3.16, and a few months later upgraded the OS to 17.3 with kernel 3.19.
Of course the processor freezing problems came along and intel_idle.max_cstate=2 or 1, as discovered by many people by then, became the life saver for the machine.

Especially with cstate=1 you can expect your machine to be very stable and run flawlessly. On my laptop (although not a F551) I sometimes set it to 2 and take the risk of seeing my machine freeze as the battery lasts significantly longer with cstate set to 2 rather than to 1.

If you want to try a non-low-latency kernel, rather than installing a standard kernel on Ubuntu Studio, try a different flavor of Linux (maybe Linux Mint) with a standard kernel. Because Ubuntu Studio is a bit touchy (at least in my experience) and you may start seeing seemingly unrelated problems as soon as you replace its kernel.
Comment 254 Hal 2016-03-29 02:46:06 UTC
(In reply to Dimitris Roussis from comment #243)
> For me mr, Len Brown has a big responsibility of this situation. 
> 
You're making an excellent point. It's quite extraordinary that the gentleman in charge of fixing this bug has not posted a single line here on this thread, sharing his thoughts, or providing insight about his efforts on this matter.
Quite extraordinary...
Comment 255 Mika Kuoppala 2016-03-29 12:33:38 UTC
(In reply to julio.borreguero@gmail.com from comment #218)
> (In reply to Mika Kuoppala from comment #202)
> > https://cgit.freedesktop.org/~miku/drm-intel/log/?h=rc6_test
> > 
> > 3 _tentative_ patches on that tree. Please try.
> 
> system freeze after ~2 days
> 

Did that set affect the rate/time of hangs?

I am now at 6 days of uptime. Workload is glxgears + vlc with vaapi
Comment 256 julio.borreguero@gmail.com 2016-03-29 12:48:57 UTC
(In reply to Mika Kuoppala from comment #255)
> (In reply to julio.borreguero@gmail.com from comment #218)
> > (In reply to Mika Kuoppala from comment #202)
> > > https://cgit.freedesktop.org/~miku/drm-intel/log/?h=rc6_test
> > > 
> > > 3 _tentative_ patches on that tree. Please try.
> > 
> > system freeze after ~2 days
> > 
> 
> Did that set affect the rate/time of hangs?
> 
> I am now at 6days of uptime. Workload is glxgears + vlc with vaapi

i stressed the system more than usual. had a big glxgears on one workspace and i was playing nonstop movies from shell with mplayer (-nosound). no vaapi.
Plus listening to music with clementine and compiling a lot of packages (gentoo upgrading packages).
Hard to say if it improved with random freezes that can occur at any time.
what i can say is that with chris wilson's patch it took at most 2h to freeze, although i applied it to the 4.5.0 kernel.
I can try more patches or use vaapi or whatever, just let me know.
Comment 257 Hal 2016-03-29 13:42:26 UTC
(In reply to julio.borreguero@gmail.com from comment #256)
> (In reply to Mika Kuoppala from comment #255)
> > (In reply to julio.borreguero@gmail.com from comment #218)
> > > (In reply to Mika Kuoppala from comment #202)
> > > > https://cgit.freedesktop.org/~miku/drm-intel/log/?h=rc6_test
> > > > 
> > > > 3 _tentative_ patches on that tree. Please try.
> > > 
> > > system freeze after ~2 days
> > > 
> > 
> > Did that set affect the rate/time of hangs?
> > 
> > I am now at 6days of uptime. Workload is glxgears + vlc with vaapi
> 
> i stressed the system more than usual. had a big glxgears on one workspace
> and i was playing nonstop movies from shell with mplayer (-nosound). no
> vaapi.
> Plus listening to music with clementine and compiling a lot of packages
> (gentoo upgrading packages).
> Hard to say if it improved with random freezes that can occur at any time.
> what i can say is that chris wilsons patch only took max 2h in freezing,
> although i applied it to 4.5.0 kernel.
> I can try more patches or use vaapi or whatever, just let me know.

Pardon my intrusion. Although I am no longer testing anything related to this issue I thought sharing some of my findings might interest you.
The freezing is more prone to happen when the workload on the processor cores is light to medium, as the power controller takes a more active role in switching the states. When you heavily load your processor with tasks it goes into low-power or power-saving states much less often. If there is a failure under heavy load, it more likely has another cause than the bug at hand. Of course keep testing everything under heavy load too, but light load will probably make this problem show up more quickly and frequently.
(When I was doing serious structured testing I noticed that actually with no "user" load, just the internal system calls were causing enough/more frequent cstate flip flops than when running videos etc)
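
For those who want to watch this C-state switching themselves, here is a rough Python sketch. It assumes the kernel exposes the cpuidle counters under /sys/devices/system/cpu/cpu0/cpuidle/stateN/ (the "name" and "usage" files); the function names are illustrative and this is not the tooling used for the testing described above.

```python
import os


def read_cstate_usage(cpu_dir="/sys/devices/system/cpu/cpu0/cpuidle"):
    """Return {state name: entry count} for one CPU, or {} if cpuidle is absent."""
    usage = {}
    if not os.path.isdir(cpu_dir):
        return usage
    for entry in sorted(os.listdir(cpu_dir)):
        name_file = os.path.join(cpu_dir, entry, "name")
        usage_file = os.path.join(cpu_dir, entry, "usage")
        if not (entry.startswith("state")
                and os.path.isfile(name_file)
                and os.path.isfile(usage_file)):
            continue
        with open(name_file) as handle:
            name = handle.read().strip()
        with open(usage_file) as handle:
            usage[name] = int(handle.read().strip())
    return usage


def usage_delta(before, after):
    """Number of times each state was entered between two samples."""
    return {name: after[name] - before.get(name, 0) for name in after}


if __name__ == "__main__":
    import time
    first = read_cstate_usage()
    time.sleep(10)  # sample window; compare an idle desktop with a loaded one
    print(usage_delta(first, read_cstate_usage()))
```

Comparing the deltas on an idle system with those under glxgears or video playback should show whether the deeper states are entered more often at light load.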
Comment 258 julio.borreguero@gmail.com 2016-03-29 13:56:06 UTC
(In reply to Hal from comment #257)
> (In reply to julio.borreguero@gmail.com from comment #256)
> > (In reply to Mika Kuoppala from comment #255)
> > > (In reply to julio.borreguero@gmail.com from comment #218)
> > > > (In reply to Mika Kuoppala from comment #202)
> > > > > https://cgit.freedesktop.org/~miku/drm-intel/log/?h=rc6_test
> > > > > 
> > > > > 3 _tentative_ patches on that tree. Please try.
> > > > 
> > > > system freeze after ~2 days
> > > > 
> > > 
> > > Did that set affect the rate/time of hangs?
> > > 
> > > I am now at 6days of uptime. Workload is glxgears + vlc with vaapi
> > 
> > i stressed the system more than usual. had a big glxgears on one workspace
> > and i was playing nonstop movies from shell with mplayer (-nosound). no
> > vaapi.
> > Plus listening to music with clementine and compiling a lot of packages
> > (gentoo upgrading packages).
> > Hard to say if it improved with random freezes that can occur at any time.
> > what i can say is that chris wilsons patch only took max 2h in freezing,
> > although i applied it to 4.5.0 kernel.
> > I can try more patches or use vaapi or whatever, just let me know.
> 
> Pardon my intrusion. Although I am no longer testing anything related to
> this issue I thought sharing some of my findings might interest you.
> The freezing is more prone to happen when the workload on the processor
> cores is light to medium as the power controller takes a more active role in
> switching the states. When you heavily load your processor with tasks it
> goes into low power or power saving states much less. If there is failure,
> more likely it's another cause than this bug at hand. Of course keep testing
> everything under heavy load too, but light load will probably cause this
> problem show up more quickly and frequently.
> (When I was doing serious structured testing I noticed that actually with no
> "user" load, just the internal system calls were causing enough/more
> frequent cstate flip flops than when running videos etc)

Well thank you for your intrusion. That indeed sounds logical, good point.
Interestingly enough the system finally froze after i closed those glxgears and ever-looping movies, now that you mention it, which fits your theory. i was at that point just watching a movie (low-res) without stressing the machine in any other way.
nonetheless the cstate workaround doesn't work for me, although i haven't tried cstate=2, only cstate=1 (on my N2940), and that seems to be hardware dependent.
Comment 259 dertobi 2016-03-29 14:25:11 UTC
Quick observation:

Once the system is frozen I can still unplug/plug the HDMI cable and the frozen screen will reappear on my external monitor. Maybe that means nothing, but wouldn't that suggest that some level of kernel activity is still occurring? I can also use the FN keys of my laptop to disable/enable the laptop screen, but that could be happening purely on the firmware/BIOS level.
Comment 260 Juha Sievi-Korte 2016-03-29 14:46:50 UTC
(In reply to Hal from comment #257)
> 
> Pardon my intrusion. Although I am no longer testing anything related to
> this issue I thought sharing some of my findings might interest you.
> The freezing is more prone to happen when the workload on the processor
> cores is light to medium as the power controller takes a more active role in
> switching the states. When you heavily load your processor with tasks it
> goes into low power or power saving states much less. If there is failure,
> more likely it's another cause than this bug at hand. Of course keep testing
> everything under heavy load too, but light load will probably cause this
> problem show up more quickly and frequently.
> (When I was doing serious structured testing I noticed that actually with no
> "user" load, just the internal system calls were causing enough/more
> frequent cstate flip flops than when running videos etc)

My findings are similar to yours, Hal: freezes almost always happened when doing "nothing much", i.e. loading and scrolling a web page; sometimes a hang happened just after reboot, when everything was loaded and the system started idling. I think it is almost always when the load changes from 'high' to 'low' or 'idle'.

When these problems started for me with some kernel version (after distribution upgrade from Ubuntu 14.10 to 15.04 (kernel 3.19)), the hangs first happened always when I tried to put the laptop to sleep by closing the lid. A bit later (perhaps further distribution upgrade when I got sick of the "buggy 15.04") came the full system lock-ups during 'daily use'.

But I was also wondering whether there are now two (or even more) freeze issues in this same report that different users are experiencing, as cstate limiting doesn't help everyone and systems other than Baytrail are now included as well (even if they are likely related from a design perspective).

Btw, still running the same 4.5.0 session, hitting the two-week mark in a couple of days. Linux cardhu 4.5.0-58.gb2c9ae5-default #1 SMP PREEMPT Wed Mar 16 17:30:21 UTC 2016 (b2c9ae5) x86_64 x86_64 x86_64 GNU/Linux.

No other patches, no cstate limiting. Stress-tested with videos for one full day, otherwise it's been just my daily usage pattern with web-browsing, streaming, occasional gaming, etc.
Comment 261 julio.borreguero@gmail.com 2016-03-29 15:01:54 UTC
just a simple question/thought:

wouldn't it be quite easy to just write a program that constantly switches between those cstates, to get a solid test program, finally nail down the bug and make it reproducible?
or to do this, if that is causing the problem:
>Two concurrent writes into the same register cacheline has the chance of
>killing the machine on Ivybridge and other gen7.
(citation from chris wilsons patch description)

a reliable "freeze program" would help tremendously, i think.
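
Short of forcing a particular cstate, a first step toward such a test program could be to watch how often each idle state is entered, so a freeze can be correlated with cstate activity. This is only a minimal sketch, assuming the kernel exposes the usual cpuidle sysfs layout (/sys/devices/system/cpu/cpuN/cpuidle/stateM/ with "name" and "usage" files); the function name print_cstate_usage is made up for illustration:

```shell
# Print "<state name> <entry count>" for every idle state of one CPU.
# $1: cpuidle directory; defaults to cpu0 (assumed standard sysfs layout).
print_cstate_usage() {
  local dir="${1:-/sys/devices/system/cpu/cpu0/cpuidle}"
  local state
  for state in "$dir"/state*; do
    # Skip entries without readable files (also covers an unmatched glob).
    [ -r "$state/name" ] && [ -r "$state/usage" ] || continue
    printf '%s %s\n' "$(cat "$state/name")" "$(cat "$state/usage")"
  done
}
```

Calling print_cstate_usage in a loop with a short sleep, and appending the output to a file that is synced to disk or sent to another host, might show which states were being entered shortly before a hang. It does not trigger the freeze by itself.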
Comment 262 Hal 2016-03-29 15:33:21 UTC
(In reply to julio.borreguero@gmail.com from comment #261)
> just a simple question/thought:
> 
> wouldn't it be quite easy to just write a program to change bewtween those
> cstates constantly to make a solid test program and finally be able to nail
> down the bug and make it reproducable ?
> or to do this, if that is causing the problem:
> >Two concurrent writes into the same register cacheline has the chance of
> >killing the machine on Ivybridge and other gen7.
> (citation from chris wilsons patch description)
> 
> a reliable "freeze program" would help tremendously, i think.

Probably not easy, because even though you could force the cstate with your own procedure you can't prevent the microprocessor's microcode from interacting with it (unless of course you are an intel guy and have access to the nitty-gritty of the microcode and you know how to throw that control code in the dustbin and overwrite it with your own)

When I first ran into this freezing problem on my Zotac, I didn't know anything about this thread or the earlier one on the freedesktop site. As I tried to do quick and dirty troubleshooting, I wrote a little program with a bunch of loops (some in sequence, some in parallel) stressing different parts of the processor and computer hardware, as I thought it would help me isolate the area this problem was originating from. That kept the processor cores quite busy, but it also increased the longevity of the linux session. Without that dingy program the machine would freeze within 5-10 minutes after booting; with the software running, freezing would only occur an hour or two later. So, that gave me more time to look around in the system. That also gave me a hint that the power saving mechanism was probably the culprit, as it was kicking in during light loads on the cpu.

So, yes - it may be possible to come up with a Mickey Mouse solution to alleviate the negative impact of the problem and save the day, but a real solution by competent people who understand the root cause of the problem is more desirable - especially after 15 months of this saga ...
Comment 263 cororok 2016-03-29 17:49:29 UTC
Instead of running a full load of tasks continuously, how about doing something like the script below?
That way the CPU/GPU load keeps going up and down.


#! /bin/bash
function callCpuGpu() {
  killall -w firefox
  xdg-open https://www.youtube.com/xyz
}

for i in {1..1000}
do
  callCpuGpu
  sleep 180s
done
Comment 264 cororok 2016-03-29 20:32:22 UTC
Sorry, the script above was wrong. Here is the corrected one:

#! /bin/bash
function callCpuGpu() {
  killall -w firefox
  sleep 60s # idle time
  xdg-open https://www.youtube.com/xyz
}

for i in {1..1000}
do
  callCpuGpu
  sleep 60s # running time
done
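
For headless boxes, or to take the browser out of the picture, the same busy/idle alternation could be sketched without Firefox. This is only an illustration, under the assumption that a plain shell busy loop on one core is enough to pull the CPU out of its idle states; the function names and default durations are made up:

```shell
# Spin on one core until the wall clock has advanced by $1 seconds
# (whole-second resolution, so the real busy time can be up to 1s shorter).
busy_for() {
  local end=$(( $(date +%s) + $1 ))
  while [ "$(date +%s)" -lt "$end" ]; do
    :   # burn CPU
  done
}

# Alternate busy and idle phases: $1 cycles, $2 busy seconds, $3 idle seconds.
toggle_load() {
  local cycles="${1:-1000}" busy="${2:-60}" idle="${3:-60}" i=0
  while [ "$i" -lt "$cycles" ]; do
    busy_for "$busy"   # running time
    sleep "$idle"      # idle time
    i=$((i + 1))
  done
}
```

Something like "toggle_load 1000 60 60" would mirror the timings of the script above. It exercises only the CPU, not the GPU, so it may not hit the same code paths as video playback.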
Comment 265 jbMacAZ 2016-03-30 01:48:55 UTC
Well on a lighter note, 4.6-rc1-next.29 seems to have fixed two new failure modes since 4.4.x.  On my system both occur after about 10 hours (cstate=1.)  One was a semi-freeze, where the clock seconds field stops, but the mouse/touchscreen cursor still moves freely, and the user interface was checked/updated less than once a minute.  The other failure was the screen going black w/o warning, apparently frozen.  The newest patches didn't affect these failures. Without cstate, my system freezes within minutes per usual, the patches had no obvious effect.  (uP=Z3775)
Comment 266 micha 2016-03-30 08:59:03 UTC
(In reply to Hal from comment #253)
> (In reply to micha from comment #248 & #247)
> 
> The Asus F551 is a decent machine (not a great machine for the intended use
> of Ubuntu Studio though). I reconfigured one with Linux Mint several months
> ago for a relative of mine. I remember having tried a low latency kernel (I
> can't recall the exact version) but the performance was terrible. 

Thanks Hal for your hints.

Actually installing Linux on this Laptop wasn't meant to get a powerful multimedia device in the end - it was meant to be a cross check.

Windows 8.1 was running flawlessly for several months - and after updating to Windows10 the machine started to freeze randomly.

Right now, the most surprising and interesting aspect is this coincidence:
Older kernels seem to work correctly, while the newer ones don't.

Thus, to me it looks like both parties have "optimized" their kernels up to a point these cpus/architectures can't cope with any more.

All I can report so far is: since setting cstate=1, my laptop has been running for more than 48 hours without any freeze. The first day with 2 glxgears and endless videos, today back to normal with just one firefox and a little editing. And I wouldn't wonder if Windows10 would run correctly with a similar booting option. Unfortunately I haven't found a switch like that up to now.
Comment 267 Hal 2016-03-30 12:32:21 UTC
(In reply to micha from comment #266)

> Thanks Hal for your hints.
You are welcome.

> Older kernels seems to work correctly, while the newer ones don't.
You are correct. There was a time I kept my system at least a couple of steps behind the "cutting edge" as to me reliability is key. But as I replaced some of my older, high power eating, machines with tiny, low power consuming ones with Bay Trail or Braswell family CPUs, I also had to step up the kernel versions. Because, for instance on the entry level Intel NUC the integrated video circuitry (HDMI part) is not properly handled by kernel 3.16.0. Most Display Port interfaces are prone to random problems with older kernels even when they are supported. Frankly if I could, I would stick to Linux Mint 17.2 and not even upgrade to 17.3 as I had Mint 17.2 running for over a year (without powering it down) on a home built AMD machine without a hiccup.

> ... if Windows10 would run correctly with a similar booting
> option. Unfortunately I haven't found a switch like that up to now.
I doubt that in the Windows case it's a kernel issue. It's probably a device driver issue on the integrated video hardware that needs to be fixed.
Also check Intel's website for any newer microcode versions for your microprocessor.

But for professional use I am now switching (back) to Mac. In the 80's and 90's people used to say "nobody got fired for buying an IBM computer". I think that applies to Apple nowadays ...
Comment 269 Juha Sievi-Korte 2016-04-01 03:48:12 UTC
(In reply to Dimitris Roussis from comment #268)
> http://www.phoronix.com/scan.php?page=news_item&px=Intel-Linux-Bay-Trail-Fail

There was an interesting point in the article's comments section: upgrading to xorg 1.18 had solved some freezes that had happened with chromium (but no specifics on hardware, other than a mention of atom).

I checked back through my installation log, and I've definitely verified a freeze with 1.18.0, but not with 1.18.1 - which I am running now with the 4.5.0. Perhaps unrelated noise, but it caught my eye.
Comment 270 Tal Liron 2016-04-01 04:16:59 UTC
Interesting info: I had similar freezes running Android x86 (64bit version, UEFI) on the same machine. So it might really be Linux-specific and unrelated to the graphics stack.
Comment 271 kossmann 2016-04-01 06:40:58 UTC
I have no X-Server running, just a plain/headless Debian without monitor, keyboard, etc.
Comment 272 julio.borreguero@gmail.com 2016-04-01 14:40:36 UTC
(In reply to Juha Sievi-Korte from comment #269)
> (In reply to Dimitris Roussis from comment #268)
> >
> http://www.phoronix.com/scan.php?page=news_item&px=Intel-Linux-Bay-Trail-Fail
> 
> There was interesting point in the article comments section, that upgrading
> to xorg 1.18 had solved some freezes that had happened with chromium (but no
> specifics on hardware, other than mention of atom).
> 
> I checked my installation log back, and I've definitely verified a freeze
> with 1.18.0, but not with 1.18.1 - which I am running now with the 4.5.0.
> Perhaps unrelated noise, but caught my eye.

i just upgraded xorg-server to 1.18.2 from 1.17.4.
kernel 4.5.0 no patches no boot parameters.
it froze within minutes

[N2940]
Comment 273 dertobi 2016-04-01 16:22:38 UTC
(In reply to kossmann from comment #271)
> I have no X-Server running, just a plain/headless Debian without monitor,
> keyboard, etc.

It would be interesting to see a hypothesis as to how that bug can occur in a headless setup. Can it still be the fault of the i915 driver in that case? Maybe the actual x86-64 cpu architecture linux code has some unexpected side effects with baytrail cpus?
Comment 274 dertobi 2016-04-02 13:06:23 UTC
I had two freezes today with kernel 4.6 that could be a different bug, but there's no way to know this for sure (yet).

This occurred with intel_idle.max_cstate=1.

The good news is that this time it's at least partially reproducible. I say partially because I don't know if others will be able to reproduce it, too.


1) I have my smartphone connected to one of the USB ports to keep it charged.

2) I try to reboot the phone.

3) Instead of rebooting the phone shuts off.  (Probably not enough juice)

4) Then I try to force a boot by holding the power button of the phone.

(The USB cable stays connected to my laptop while I'm doing all of this)

5) Just in the moment when the phone starts to boot, my desktop freezes in apparently exactly the same way people on this bug report already know about all too well.

My conclusions from this:

1) The phone is connected for charging, so it's not unlikely it's messing with the power management of the laptop by draining power and sudden shifts in that power drain. (Although it shouldn't)

2) There could be a bug in the USB subsystem.

3) My particular laptop might have a serious hardware defect.

4) ???

Anyone else, please feel free to speculate what that means.
Comment 275 Andy Furniss 2016-04-02 14:21:43 UTC
(In reply to dertobi from comment #273)
> (In reply to kossmann from comment #271)
> > I have no X-Server running, just a plain/headless Debian without monitor,
> > keyboard, etc.
> 
> It would be interesting to see a hypothesis as to how that bug in can occur
> in a headless setup. Can it still be the fault of the i915 driver in that
> case? Maybe the actual x86-64 cpu architecture linux code has some
> unexpected sideeffects with baytrail cpus?

J1900 Asrock Q1900DC-itx.

I could easily lock with kodi and tested lots in the early days of the FDO bug.

On some kernels a patch in that bug seemed to prevent the lockups for me, and it is probably still in openelec (but I never ran kodi for more than 15 hours).

The patch - just option 2.

https://bugs.freedesktop.org/show_bug.cgi?id=88012#c33

Testing newer kernels did seem to bring a new issue.

As I bought the Q1900DC-itx to be a headless router/nas/pvr that's what I did with it.

Vanilla 4.1.1 no patch or workarounds (being headless there are no i915 IRQs so the patch would be pointless).

Ran 100 days OK, then updated the kernel to 4.1.10: locked after 7 days, then again the next day.

Booted 4.1.1 again, ran OK for 127 days - updated to 4.1.18, no lock so far (up 37 days).

Don't know what was wrong with 4.1.10 or if it's just luck (but seems unlikely).

Looking at stable commits there is a baytrail one just before 4.1.13 which fixes GPIO register access - maybe that is helping me now in 4.1.18.

One other change made initially - though not because of locks is I disabled USB3 in BIOS as I have 2 USB DVB-T2 tuners and I was getting low level packet loss on the links. Seemed to be power related as spinning a CPU would fix it - but so did (well 99%) avoiding xhci by turning off USB3.
Comment 276 Martin 2016-04-03 16:22:39 UTC
Two years of problems and three lengthy and painful bisects later I finally arrived at commit 8fb55197e64d5988ec57b54e973daeea72c3f2ff (drm/i915: Agressive downclocking on Baytrail). A simple search for these terms brought me to this bug and I now know I'm not alone! Will try read-up on all comments on this bug later. Meanwhile I manually reverted the changes in mentioned commit in 4.5 and have yet to see a freeze. Will try max_cstate=1 later.

HW: ASRock Q1900-ITX.
Comment 277 kossmann 2016-04-04 05:41:59 UTC
(In reply to dertobi from comment #273)
> It would be interesting to see a hypothesis as to how that bug in can occur
> in a headless setup. Can it still be the fault of the i915 driver in that
> case? Maybe the actual x86-64 cpu architecture linux code has some
> unexpected sideeffects with baytrail cpus?

I updated to kernel 4.4.0-1-amd64 and - this could be the trick for me - installed a newly released BIOS update that includes a microcode update (NUC6i5SYH). The uptime of my NUC is 2 days and 14 hours so far without max_cstate.
Comment 278 kossmann 2016-04-04 05:44:12 UTC
Sorry... forget my post

root@nuc:~# cat /proc/cmdline
BOOT_IMAGE=/vmlinuz-4.4.0-1-amd64 root=UUID=b4b6a796-e6c4-44b5-8d3b-0e34a2cae5c6 ro quiet crashkernel=256M nmi_watchdog=1 intel_idle.max_cstate=1

max_cstate is still set.
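
A mix-up like this is easy to avoid by checking the running kernel's command line before reporting uptimes. A small sketch; the helper name has_kernel_param is hypothetical, and the optional file argument exists only so the function can be tried on a saved copy of the command line:

```shell
# Succeed if kernel parameter $1 (e.g. intel_idle.max_cstate) appears,
# with or without a value, in the command-line file $2 (default /proc/cmdline).
has_kernel_param() {
  local param="$1" file="${2:-/proc/cmdline}" word
  # Unquoted on purpose: split the command line into whitespace-separated words.
  for word in $(cat "$file"); do
    case "$word" in
      "$param"|"$param"=*) return 0 ;;
    esac
  done
  return 1
}
```

For example, "has_kernel_param intel_idle.max_cstate && echo still limited" on the running system. When the intel_idle driver is in use, the effective limit should also be readable from /sys/module/intel_idle/parameters/max_cstate.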
Comment 279 Martin 2016-04-04 12:25:16 UTC
Created attachment 211641 [details]
Reverted commit 8fb55197e64... for 4.5.0

Ok, have done homework and read the whole thread.
My experiences with the BayTrail issue:

HW: ASRock Q1900-ITX with J1900 onboard, like I said before.
Load: HTPC with MythTV recording/showing both SD and HD DVB-C material.
Noteworthy: I use an out-of-kernel compiled ddbridge module that comes with own dvb-core code.

In my experience problems began when I started compiling 4.2.*, and I immediately blamed the non-standard ddbridge module. There may have been problems with 4.1.* that I don't remember, but I'm VERY confident the latest iterations in the 4.1.* series are rock-solid. The last stable kernel I used before venturing on my latest bisect was 4.1.20 without patches or work-arounds.
Since my problems started with 4.2.0 and the 4.1 series seemed stable, I bisected between 4.1 and 4.2. This led me without a shadow of a doubt to Chris Wilson's commit 8fb55197e... Freezes tended to occur much faster as I approached this commit. On 4.2.0 and above it can take hours if not days, on 8fb55197e... it's a matter of minutes. I was surprised to end up on a commit that was not related to dvb/device code, but relieved that it precisely matched the other hardware I use, whose stability I had never doubted.

My HTPC is now showing HD DVB-C content as we speak on 4.5.0 using the accompanying patch, which is a manual reversal of 8fb55197e... to the best of my knowledge. It's been up since yesterday and hasn't crashed since, but I'm sceptical since freezes took longer on later kernels anyway. So far, so good.
Comment 280 Martin 2016-04-04 12:48:06 UTC
I lied! I've found an old mail conversation about the problem and indeed I started seeing the freezes on 3.17 like many others. I tried bisecting between 3.16 and 3.17 back then and never convincingly arrived at a commit I could blame due to the unpredictable nature. So it does seem we are looking at different bugs that (partially) got fixed somewhere in the 4.1 branch. Patched 4.5 still going strong btw.
Comment 281 Juha Sievi-Korte 2016-04-07 18:36:40 UTC
(In reply to Juha Sievi-Korte from comment #209)
> Update: Grabbed 4.5.0 for testing on affected system (Acer B-115M, N3540).
> This is downloaded from opensuse repos this time, exact version:
> 
> Linux cardhu 4.5.0-58.gb2c9ae5-default #1 SMP PREEMPT Wed Mar 16 17:30:21
> UTC 2016 (b2c9ae5) x86_64 x86_64 x86_64 GNU/Linux
> 
> Running withtout a freeze for a week now in my normal use and stress-testing
> since this morning with HD videos. I'll report back if it freezes.
> 
> Someone asked about the desktop, I use xfce (some gnome-services running
> though). Have verified the freezes with two distributions, Ubuntu and
> Opensuse.

juhas@cardhu:~> uptime
 21:32pm  up 19 days 13:22,  5 users,  load average: 1.44, 1.59, 1.51
juhas@cardhu:~> uname -a
Linux cardhu 4.5.0-58.gb2c9ae5-default #1 SMP PREEMPT Wed Mar 16 17:30:21 UTC 2016 (b2c9ae5) x86_64 x86_64 x86_64 GNU/Linux
juhas@cardhu:~> cat /proc/cmdline 
BOOT_IMAGE=/boot/vmlinuz-4.5.0-58.gb2c9ae5-default root=UUID=4e634188-9fb6-40f9-87ae-487fd31414f3 resume=/dev/disk/by-uuid/5daad161-5400-48d2-a6e5-8cc5e0f08c20 splash=silent quiet showopts

Never before had this long uptime without boot parameters. Seems I'm unable to make this crash now. Anyone else this lucky with N3540 and 4.5.0? Am I forever stuck with this particular kernel version? :)
Comment 282 Martin 2016-04-11 12:01:01 UTC
Just rolled back to vanilla 4.5 to see if I could get a stable system out of 4.5 without my patch, as Juha reports. I can't. After 2.5 hours of watching, it froze just as it always has since 4.2.

So it seems that, for me at least, I need to roll back 8fb55197e... Btw, with that revert it is so stable that I haven't seen a freeze in a week of regularly watching television.
Comment 283 aicjofs 2016-04-13 18:04:23 UTC
I have known about this bug for well over a year, mostly ignored it and was content on 3.16 and 3.13.  Popped in a few days ago to see the state of things and read up.  I can't believe a bug that locks up the system within a few minutes or few hours has got no love.  

I have a J1900 as a HTPC running Ubuntu 14.04, around a year or more ago the transition from 3.16 to something higher introduced me to this issue.  I was able to work around it on higher kernel versions through the BIOS settings for C-state.  Never used the kernel flag.  Anyway I didn't like that option so I went back to 3.16.  In the interest of upgrading the system to Ubuntu 16.04 in the near future and the higher kernel version used I thought I should look in to this again.

I grabbed the 4.6-RC3 source and reverted the "Aggressive Downclocking of Baytrail" patch by hand.  I was kind of depressed to look at all the new .config changes; they sure have added a lot of stuff to a non-working kernel... Anyway, I have seen really positive results over the past 36 hours.  No lockups.  I will be the first to admit I haven't tried anything past 4.1 since it was in RC status a long time ago.  Back then I was usually able to lock the system up within 30 min by bouncing between browsing and scrolling busy web pages in firefox, and starting and stopping videos in Kodi (anything I could think of to make the GPU up/down threshold shift).  I can't seem to make it lock up at all now.  

I know from reading many people have thought they had it licked, only to post back a few days later that it didn't.  Probably the case here too, but something has changed over the last year because like I said previously I was able to lock it up within minutes from anything over 3.16 to 4.1-RC when I messed with this bug last time, and it's lasting days+? so far.
Comment 284 Koen L 2016-04-14 08:08:42 UTC
I've read all/most related threads and to me this appears to be the status quo.

'quick' fixes:

# intel_idle.max_cstate

Adding intel_idle.max_cstate=1 OR intel_idle.max_cstate=0 to the kernel parameters seems to work for most people, but keeps the processor out of its deeper idle states even when it should be idle (not energy efficient and causes more heat).

# Kernel 4.5+ with commit reversal

Using kernel 4.5+ without commit 8fb55197e64d5988ec57b54e973daeea72c3f2ff (drm/i915: Agressive downclocking on Baytrail). Some people mention positive results when reverting this commit on earlier kernel versions as well.

# intel_pstate=disable

Some have mentioned that setting the intel_pstate=disable kernel parameter helps, but others confirmed it did not help in their case.
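
For anyone unsure how to make one of these kernel parameters permanent, here is a hedged sketch for GRUB-based distributions. It assumes a Debian/Ubuntu-style /etc/default/grub with a double-quoted GRUB_CMDLINE_LINUX_DEFAULT line and GNU sed; the function name add_grub_param is made up, and other distributions or boot loaders will differ:

```shell
# Append kernel parameter $1 to the GRUB_CMDLINE_LINUX_DEFAULT="..." line
# of file $2, unless that line already contains it. Edits the file in place.
add_grub_param() {
  local param="$1" file="$2"
  # Already present on the relevant line? Then there is nothing to do.
  if grep '^GRUB_CMDLINE_LINUX_DEFAULT=' "$file" | grep -qF -- "$param"; then
    return 0
  fi
  # Insert the parameter just before the closing double quote (GNU sed -i).
  sed -i "s|^\(GRUB_CMDLINE_LINUX_DEFAULT=\".*\)\"|\1 ${param}\"|" "$file"
}
```

Trying it on a copy of the file first seems wise. On Debian/Ubuntu the change would then be applied with something like "sudo update-grub" followed by a reboot, and /proc/cmdline shows whether the parameter took effect.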

Problem background:

# Irregular

The issue does not appear on a regular basis, some have reported a working system for over a day (+1 for me) and then it crashes twice in an hour.

# Confirming

There are no/limited logs and as such it is difficult to tell whether everyone in these threads is actually experiencing the same issue.

# cstate & pstate information from Intel (posted by Chris Rainey)

1. C-states and P-states are very different (https://software.intel.com/en-us/blogs/2008/03/12/c-states-and-p-states-are-very-different)

2. Power Management States: P-States, C-States, and Package C-States (https://software.intel.com/en-us/articles/power-management-states-p-states-c-states-and-package-c-states)

3. (update) C-states, C-states and even more C-states (https://software.intel.com/en-us/blogs/2008/03/27/update-c-states-c-states-and-even-more-c-states)

Real fix?

A real fix has yet to be found... In the commit which some people have reverted (https://patchwork.freedesktop.org/patch/45755/) Wilson and Deepak (from Intel) are named and in a later message Wilson states "Why those vlv_punit_read() result in a machine hang was never understood." (https://lists.freedesktop.org/archives/intel-gfx/2016-January/084206.html).

I'll CC both of them to this thread.

To-do:

This issue affects kernels up to 4.5 (as far as I can tell from the discussion). 4.4 for sure (experiencing the issue on latest 4.4 myself now).
Comment 285 dertobi 2016-04-14 08:16:55 UTC
(In reply to Koen L from comment #284)
> I've read all/most related threads and to me this appears to be the status
> quo.
> 
> 'quick' fixes:
> 
> # intel_idle.max_cstate
> 
> Adding intel_idle.max_cstate=1 OR intel_idle.max_cstate=0 to the kernel
> parameters seems to work for most people, but leaves the processor running
> even when it should be idle (not energy efficient and causes more heat).
> 
> # Kernel 4.5+ with commit reversal
> 
> Using kernel 4.5+ without commit 8fb55197e64d5988ec57b54e973daeea72c3f2ff
> (drm/i915: Agressive downclocking on Baytrail). Some people mention positive
> results when reverting this commits on earlier kernel versions as well.
> 
> # intel_pstate=disabled
> 
> Some have mentioned that setting the intel_pstate=disabled kernel parameter
> helps, but others confirmed it did not help in their case.
> 
> Problem background:
> 
> # Irregular
> 
> The issue does not appear on a regular basis, some have reported a working
> system for over a day (+1 for me) and then it crashes twice in an hour.
> 
> # Confirming
> 
> There are no/limited logs and as such it is difficult to tell whether
> everyone in these threads is actually experiencing the same issue.
> 
> # cstate & pstate information from Intel (posted by Chris Rainey)
> 
> 1. C-states and P-states are very
> different(https://software.intel.com/en-us/blogs/2008/03/12/c-states-and-p-
> states-are-very-different)
> 
> 2. Power Management States: P-States, C-States, and Package
> C-States(https://software.intel.com/en-us/articles/power-management-states-p-
> states-c-states-and-package-c-states)
> 
> 3. (update) C-states, C-states and even more
> C-states(https://software.intel.com/en-us/blogs/2008/03/27/update-c-states-c-
> states-and-even-more-c-states)
> 
> Real fix?
> 
> A real fix has yet to be found... In the commit which some people have
> reverted (https://patchwork.freedesktop.org/patch/45755/) Wilson and Deepak
> (from Intel) are named and in a later message Wilson states "Why those
> vlv_punit_read() result in a machine hang was never understood."
> (https://lists.freedesktop.org/archives/intel-gfx/2016-January/084206.html).
> 
> I'll CC both of them to this thread.
> 
> To-do:
> 
> This issue affects kernels up to 4.5 (as far as I can tell from the
> discussion). 4.4 for sure (experiencing the issue on latest 4.4 myself now).

Thanks for that comprehensive summary!

The only thing that I want to add is:

# The bug is still occurring on the latest kernel 4.6rc3 and git.
Comment 286 Koen L 2016-04-14 08:24:26 UTC
No problem, we all want to get this fixed!

I actually ended up CC-ing every person mentioned in the 'signed-off-by' of this patch.

> Signed-off-by: Chris Wilson <chris@chris-wilson.co.uk>
> Cc: Deepak S <deepak.s@linux.intel.com>
> Cc: Ville Syrjälä <ville.syrjala@linux.intel.com>
> Cc: Rodrigo Vivi <rodrigo.vivi@intel.com>
> Cc: Daniel Vetter <daniel.vetter@ffwll.ch>

Fairly certain they should be able to give us some more pointers as to how to properly fix this issue.
Comment 287 Mort Yao 2016-04-14 11:47:25 UTC
I have to add that the regression is present on all kernel versions after 3.16, so commit 8fb55197e64d5988ec57b54e973daeea72c3f2ff

    drm/i915: Agressive downclocking on Baytrail

was not the true cause, at least not for me. It was only merged during the 4.2 cycle, but I experienced the freeze on 4.1 as well, easily within hours to days (someone above mentioned it already happens on 3.17 as well). Otherwise we could be talking about two different issues in this thread.

For kernel freezes starting from 3.16, "the commit" was

https://git.kernel.org/cgit/linux/kernel/git/torvalds/linux.git/commit/?id=31685c258e0b0ad6aa486c5ec001382cf8a64212

    drm/i915/vlv: WA for Turbo and RC6 to work together.

As far as I can tell, reverting it or simply applying this patch https://github.com/OpenBricks/openbricks/blob/master/packages/system/linux/patches/4.0/linux-999-i915-use-legacy-turbo.patch seems to do the trick.
Comment 288 Libor Chmelik 2016-04-14 13:39:28 UTC
I'm on an Acer Aspire E 15 E5-511-P7AT laptop with an Intel N3540. I had to wait for kernel 4.5 to solve the hanging issue on shutdown/reboot. But the random freezing issue still isn't fixed. I'm using Mint 17.3. I could downgrade the kernel, but that would bring back the hanging issue on shutdown/reboot.

The freezing issue only happens randomly. Sometimes after a few hours. Sometimes after a few days or even a week. But always when the graphics are used.

I tried the latest intel graphic drivers, but the issue still isn't solved.
The laptop is 16 months old right now, and it's the first time I've run into a bug that hasn't been solved in that time. It seems to happen only when the graphics are in extended use, like vlc HD viewing or streaming HD videos on Youtube higher than 720p.
Comment 289 jbMacAZ 2016-04-14 18:32:40 UTC
I'm still seeing rapid freeze if I remove cstate=1 (4.3.6 - 4.6.-rx).  I am also seeing a new unrelated freeze with 4.6-rc3->next20160413 sometimes when I plug or unplug a flash drive from a USB hub.  This USB problem has never occurred with older kernels I've used.  Same symptom, cstate ineffective... (Asus T100-CHI, Z3775)
Comment 290 Koen L 2016-04-14 18:34:10 UTC
@Mort Yao

Thanks for pointing this out!

I'm trying out the intel_idle.max_cstate=1 option first now (seems to work) because we've got some of these systems in production.

Afterwards I will use our test-system to check whether a kernel patch completely fixes the issue.
Comment 291 dertobi 2016-04-14 20:17:52 UTC
(In reply to jbMacAZ from comment #289)
> I'm still seeing rapid freeze if I remove cstate=1 (4.3.6 - 4.6.-rx).  I am
> also seeing a new unrelated freeze with 4.6-rc3->next20160413 sometimes when
> I plug or unplug a flash drive from a USB hub.  This USB problem has never
> occurred with older kernels I've used.  Same symptom, cstate ineffective...
> (Asus T100-CHI, Z3775)

I already wrote about how my phone rebooting while connected to the USB port causes my system to freeze. While that's not exactly the same thing you're reporting, I feel it could be related.

Question for you and experts in USB:
- Is there a sudden drop/surge in power when plugging/unplugging a flash drive?

Because that's what I think is happening with my phone: it normally gets a charge from the port, then no charge, and then, when the phone starts booting, there is an unusual increase in the power it draws from the USB port, which somehow influences the CPU or other components and causes that dreadful freeze.

Recently I had a freeze (caused by my phone) with the usual symptoms, but the audio that was playing at the time got stuck in a weird 1-second loop, seemingly forever.
Comment 292 jbMacAZ 2016-04-14 22:36:14 UTC
(In reply to dertobi from comment #291)
> (In reply to jbMacAZ from comment #289)
> > I'm still seeing rapid freeze if I remove cstate=1 (4.3.6 - 4.6.-rx).  I am
> > also seeing a new unrelated freeze with 4.6-rc3->next20160413 sometimes
> when
> > I plug or unplug a flash drive from a USB hub.  This USB problem has never
> > occurred with older kernels I've used.  Same symptom, cstate ineffective...
> > (Asus T100-CHI, Z3775)
> 
> I already wrote about how my phone rebooting while connected to the usb port
> causes my system to freeze, while that's not exactly the same thing you're
> reporting I feel it could be related.
> 
> Question for you and experts in USB:
> - Is there a sudden drop/surge in power when plugging/unplugging a flash
> drive?

A rebooting phone could certainly provoke lots of otherwise latent bugs in a USB handler.  It would be a worthy test case for both hardware and firmware Q/A.

My hub is (externally) powered.  So USB power draw shouldn't be affecting my device.  Since I build my kernels elsewhere, this problem is unmistakably recent.  And it is easily avoided by using an older kernel.  

Ultimately, these two freezes probably merit their own bug reports.
Comment 293 Martin 2016-04-17 14:06:20 UTC
I'm evaluating Mort Yao's idea of reverting 31685c258e0b0a..... and so far have not seen freezes on watching DVB-C HD content. I have however experienced two crashes while watching flash content (in Chrome). The problem is, I don't trust the flashplayer anymore, so I'm reluctant to say the patch isn't valid (for me).
Comment 294 Mort Yao 2016-04-17 15:02:05 UTC
Unfortunately I experienced another hang yesterday (after one week's stable use), so the patch I mentioned in the last comment isn't valid for me anymore.

On the other hand, reverting the complete commit 31685c2 isn't really easy -- the old revision of the module won't compile together with the current 4.x kernel codebase. I'd like to hear if anyone had any success doing that. However, a proper fix is yet to be found.
Comment 295 Martin 2016-04-17 15:31:18 UTC
My patch in comment 279 applies cleanly against 4.5.0 and 4.5.1 and resolves the problems for me (at least for a very, very long time). You could give that a try?

I agree it's not as clean as your second patch, but like others I suspect we're looking at different problems that originated in the 3.17 and 4.2 branches. For me, the 3.17 problem seems solved in 4.1.x (for recent x) and my patch reverts whatever causes problems in the 4.2 and up series.
Comment 296 Mort Yao 2016-04-17 16:12:47 UTC
@Martin thanks for bringing this to my attention.

Yesterday's freeze was on a 4.5 kernel (with only the legacy-turbo patch applied). It seems I should try to revert 8fb55197e64 also, since there seem to be two very different causes of freezes!

I'm currently on vanilla 4.1.6 (for me it's known to freeze once or twice a month; that's already much better than all kernels 4.2+), but I'm planning to try both:
1. Apply both the 8fb55197e revert patch and the legacy-turbo patch on 4.5.
2. Apply the legacy-turbo patch on 4.1.x.

will see how it goes.
Comment 297 Gabriel7340 2016-04-21 20:29:28 UTC
Same problem here.. I can't do nothing when the system freezes. With the  intel_idle.max_cstate=1 flag it's ok but consumes more power :/
Comment 298 Gabriel7340 2016-04-21 20:32:17 UTC
*I can't do anything
Comment 299 jjmeijer88 2016-04-21 22:41:16 UTC
Hi,

I'm using an ASRock Q1900 (Intel Celeron J1900 Baytrail) board with an Nvidia GT720 GPU and I don't get any hangs at all with Arch Linux (kernel 4.4.1-2-ARCH) + Kodi 16.1.
I'm wondering if you guys are using the onboard GPU? I guess I could switch GPU to try.

On my Intel Atom Z3770 Baytrail tablet (HP Omni10) the only way to get it even booted to Android-x86 is with intel_idle.max_cstate=1.
Comment 300 Hal 2016-04-23 14:59:33 UTC
(In reply to Hal from comment #199)
> Interesting findings today:
> 
> 2) I was given an Intel Nuc box for testing which turned out to be identical
> to mine, with same N3050. Duplicated my drive with DD and removed
> intel_idle.max_cstate=1. It kept working all day without missing a beat! I
> remove cstate from my own machine it freezes within the hour. So bizarre...
> 

This is something I posted several weeks ago. Since then I have been using both boxes in parallel, for same type of daily tasks (some web browsing, occasional video playback, some Netflix, lots of background music playing).

One of these boxes has a processor (N3050) with a stepping older than the other. That one doesn't show any freezing symptoms.

The newer box (with the same processor but more recent stepping) needs intel_idle.max_cstate=1 to run without freezing, otherwise it fails quite regularly, within a couple of hours after booting.

Hal
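For anyone who wants to compare steppings between two otherwise identical boxes, as described above, here is a minimal sketch rather than an official tool. It assumes a Linux system whose /proc/cpuinfo exposes "model name" and "stepping" fields; the function name cpu_identity is made up for illustration.

```python
def cpu_identity(cpuinfo_text):
    """Return (model name, stepping) from the first processor entry
    of /proc/cpuinfo-style text. Missing fields come back as None."""
    fields = {}
    for line in cpuinfo_text.splitlines():
        if not line.strip():
            if fields:
                break  # a blank line ends the first processor entry
            continue
        key, sep, value = line.partition(":")
        if sep:
            fields[key.strip()] = value.strip()
    return fields.get("model name"), fields.get("stepping")
```

On a real machine this would be called as cpu_identity(open("/proc/cpuinfo").read()) on each box, and the two results compared by hand.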
Comment 301 ladiko 2016-04-23 15:59:51 UTC
We have 50 ASRock Q1900-ITX boards: some work without issues, some work with cstate=1, and some freeze anyway and need kernel 3.16. Only kernel 3.16 made all of them work without this issue. For another issue on another mainboard type we went back to 3.13, but some of them don't support a resolution of 1280x1024 via VGA this way. So we had to differentiate: this CPU = this kernel, that CPU = that kernel. Now we run Baytrail with 3.16, but I plan to compile a custom kernel 4.4 or 4.5 as explained before.
Comment 302 dan.g.tob 2016-04-23 16:02:53 UTC
Did this patchset ever get merged? sounds suspiciously similar.

http://lkml.iu.edu/hypermail/linux/kernel/1503.3/00271.html
Comment 303 jjmeijer88 2016-04-23 17:43:59 UTC
@Dan, I tested these patches. There is a slight improvement, but the system still hangs at some point. At least the mmc bus does, that is.
Comment 305 dertobi 2016-04-23 19:35:01 UTC
I was told to try the patches from here:

(for 4.4)
https://github.com/fritsch/linux/commit/8b48465bd197e2f4891a3f9c5737bb13981d1c94 

and here:

(for 4.5)
https://bugs.freedesktop.org/show_bug.cgi?id=88012#c33

Which I will try later, but I want to encourage others to do the same.
Comment 307 dkrall 2016-04-25 17:19:46 UTC
I can confirm this issue to a certain degree.

Using a Dell Inspiron, with an Intel N3050 (stepping 3), I can boot and run kernel 4.2.0-35-generic (from Ubuntu 14.04), and also the 4.2 kernel shipped with Fedora 23, but no version higher than 4.2.

I get a black screen immediately after booting any kernel >4.2 using any distro available (fedora, opensuse, ubuntu, etc). Disabling pstates did not help in this case (running kernel 4.5).

If there's anything I can do to help debug this issue further, please let me know.
Comment 308 Ghry 2016-04-25 19:08:09 UTC
(In reply to lewexeki from comment #27)
> Hi,
> 
> I had the same problem with "Intel(R) Pentium(R) CPU  N3520  @ 2.16GHz".
> With kernel 4.2.0-16.19 there were ~5-8 freezes/day. After upgrading to
> 4.3.3-040303-generic (ubuntu version) it was much better: 1/2 freezes/day.
> With cstate=1 there has not been one yet.

I have N3520 BayTrail and I am using kernel-4.0.6 with cstate=1 as well right now; Since I set cstate=1 my asus notebook doesn't freeze (its about 10 days already);
Comment 309 Ghry 2016-04-25 19:14:46 UTC
I have Intel(R) Pentium(R) CPU  N3520  @ 2.16GHz BayTrail on my asus notebook; I tested cstate=1 and kernel 4.0.9 and it doesn't freeze about 10 days already; Can somebody tell me since what kernel version the bug will be solved totally?
Comment 310 ladiko 2016-04-26 04:26:47 UTC
My glass ball says kernel 6.6.6 will be useable.
Comment 311 Gabriel7340 2016-04-26 14:24:41 UTC
When all computers in the market have an intel processor bay trail.
For now ( only ) affects 40%!! of all PC's in the market.
Comment 312 GConst 2016-04-27 08:40:55 UTC
Hi all, for the asrock q1900itx-dc I have found a workaround: I turned off cstate in the BIOS (UEFI). Uptime is more than a week. I cannot say that I'm happy with this solution, but it allows me to wait until this bug is fixed.
Comment 313 dertobi 2016-04-27 09:23:40 UTC
(In reply to GConst from comment #312)
> Hi all, for asrock q1900itx-dc I have found workaround: i turned off cstate
> in BIOS (UEFI). uptime more than week, i cannot say that im happy with this
> solution, but it allowed me wait untill this bug will be fixed.

That's pretty much the same workaround we already have, except you're doing it in the BIOS instead of the kernel boot command line. And that makes total sense.
Comment 314 w2q 2016-04-27 20:43:52 UTC
Not sure if this has anything to do with our problem:

https://www.dragonflydigest.com/2016/04/04/17888.html

It says: 
If you remember this Baytrail problem, Daniel Bilik has gone and found a fix, as this appears to be a cross-platform bug, and he has patches for DragonFly.

http://lists.dragonflybsd.org/pipermail/users/2016-April/228682.html
http://gitweb.dragonflybsd.org/dragonfly.git/commitdiff/5d8e0f49ad2ab6201288c8b4f5ebb966f27e5779
http://lists.dragonflybsd.org/pipermail/users/2016-March/228645.html

Perhaps it helps. Good luck.
Comment 315 w2q 2016-04-27 20:55:53 UTC
Sorry, it seems to be just the fix from Mika (https://bugzilla.kernel.org/show_bug.cgi?id=109051#c203). So nothing new, I guess....
Comment 316 MarkB 2016-05-01 08:48:50 UTC
I believe that I have run into this problem running Ubuntu on an NUC clone with Intel's J1900 processor. Does anyone know if the freezing issue is confined to machines based on Bay Trail chips or is it more widespread than this?
Comment 317 Martin 2016-05-05 07:55:41 UTC
I recently experienced unprecedented hangs while idle (HTPC unreachable in the morning, so without GPU involvement) and while watching Flash video. Although watching DVB-C content was very stable with reversed 8fb55197e... I now am back to vanilla 4.5.2 (soon .3) using intel_idle.max_cstate=1. That does seem to be the magic bullet after all.

Apparently I don't quite understand the *states very well, because with this option set, I still see the GPU enter rc6 in powertop. Is that less efficient than c*? I do see that the package is not going into pc2-7.
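
On the question above: as far as I understand, RC6 is a GPU power state managed by the i915 driver, while intel_idle.max_cstate only limits CPU C-states, so seeing rc6 with the option set is not a contradiction. To see which CPU idle states the kernel actually exposes and uses, something like the following sketch may help. It assumes the standard cpuidle sysfs layout (state directories with name, usage and time files); the function name is hypothetical.

```python
from pathlib import Path

def read_cpuidle_states(cpu_dir):
    """Return [(name, usage count, residency in microseconds)] for every
    cpuidle state found under <cpu_dir>/cpuidle, ordered by state index."""
    base = Path(cpu_dir) / "cpuidle"
    state_dirs = sorted(base.glob("state[0-9]*"),
                        key=lambda p: int(p.name[len("state"):]))
    states = []
    for state in state_dirs:
        name = (state / "name").read_text().strip()
        usage = int((state / "usage").read_text())
        time_us = int((state / "time").read_text())
        states.append((name, usage, time_us))
    return states
```

On a live system the argument would be a path such as /sys/devices/system/cpu/cpu0. Package C-states (pc2-7 in powertop) are not part of this interface, so this only shows the per-core side.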
Comment 318 Len Brown 2016-05-06 17:11:50 UTC
re: comment316
MarkB, this sighting is specific to baytrail.
Comment 319 Phil 2016-05-06 18:46:44 UTC
I think I have run into this issue on my NUC with an Intel(R) Celeron(R) CPU  N3150  @ 1.60GHz (Braswell). 

Basically, under any significant load (a gcc compile of the Linux kernel, for example) the system reboots after a random amount of time. I've tried several 4.x series kernels and have been able to reproduce the bug on all of them so far. Adding intel_idle.max_cstate=1 as suggested in this thread seems to mitigate the bug, although it locks the CPU at 2167 MHz.

I am not using X. I'm running Arch Linux with only a couple of services enabled (dhcpd and hostapd), as I'm using it mainly as a firewall/AP.
Comment 320 yuriy 2016-05-07 23:07:48 UTC
I have a problem with random freezing on kernel 4.2.
intel_idle.max_cstate=1 didn't help me.
Comment 321 Gabriel7340 2016-05-07 23:22:44 UTC
Your processor is an Intel BayTrail? Did you update-grub after the changes? Can you try kernel 3.13.0-85-generic instead?
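A quick way to confirm that the parameter actually reached the running kernel (and with which value) is to look at /proc/cmdline. A minimal sketch, assuming the parameter is passed on the kernel command line as intel_idle.max_cstate=N; the function name is hypothetical, and treating the last occurrence as the effective one is my assumption.

```python
def max_cstate_from_cmdline(cmdline):
    """Return the intel_idle.max_cstate value found in a kernel command
    line string, or None if the parameter is absent or not numeric."""
    value = None
    for token in cmdline.split():
        key, sep, val = token.partition("=")
        if key == "intel_idle.max_cstate" and sep and val.isdigit():
            value = int(val)  # assumption: a later occurrence overrides
    return value
```

On the affected machine it would be used as max_cstate_from_cmdline(open("/proc/cmdline").read()). A result of None after editing the bootloader configuration suggests the change was not applied to the boot entry in use.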
Comment 322 Phil 2016-05-08 01:38:25 UTC
Gabriel7340: I think the N3150 is Braswell which is a refresh of Baytrail? [1,2] Should I file a separate bug? I am not using grub I'm using systemd-boot. 

[1] http://www.extremetech.com/extreme/202389-intel-quietly-launches-14nm-braswell-bay-trails-successor
[2] http://www.cnx-software.com/2015/04/01/intel-introduces-celeron-n3000-n-3050-n3150-and-pentium-n3700-low-power-braswell-processors/
Comment 323 yuriy 2016-05-08 10:38:23 UTC
(In reply to Gabriel7340 from comment #321)
> Your processor is an Intel BayTrail? Did you update-grub after the changes?
> Can you try kernel 3.13.0-85-generic instead?

Sorry, comment #320 is not valid. I had used intel_idle.max_cstate=2. Right now it's at 6 hours of uptime without freezes.

Are there any patches that fix this issue?
Comment 324 Mort Yao 2016-05-08 17:44:23 UTC
The freeze recurred for me today on 4.1 with the legacy-turbo patch, for no apparent reason (no GPU- or CPU-intensive processes were running). So I would say no, there is no valid patch that completely fixes the issue at this point. (4.1 indeed performs better than 4.2+, so that would be another lockup issue on 4.2+)
Comment 325 Gabriel7340 2016-05-08 18:36:25 UTC
(In reply to yuriy from comment #323)
> (In reply to Gabriel7340 from comment #321)
> > Your processor is an Intel BayTrail? Did you update-grub after the changes?
> > Can you try kernel 3.13.0-85-generic instead?
> 
> Sorry comment #320 is no valid. I have used intel_idle.max_cstate=2. Right
> now its 6 hours up-time without freezes.
> 
> Is there some patches that fix this issue?

http://www.hardwaresecrets.com/celeron-n3150-cpu-review/

"They come to replace the Bay Trail-D CPUs, actually using the same microarchitecture, ..." 

I think you are right.

For now the best solution is the intel_idle.max_cstate workaround. You can find some patches, but I'm not sure whether they work. Another solution could be to compile the kernel without some commits like "Agressive downclocking on Baytrail/drm/i915".
Comment 326 alvararo 2016-05-08 23:41:08 UTC
I had freezes in an Acer laptop with Pentium N-3540 (Bay Trail). Now I'm using 3.13.0.85, I can confirm no freezes at all. 
From 3.16 and above, 4.1.12 is the one that works better for me, freezing after many hours.
Using 4.2 and above, the problem gets worse, with the system freezing few minutes after switch on.
Comment 327 Austin 2016-05-10 15:50:20 UTC
I figured I'd post my experience with this bug and how I avoid it.

I have run into this problem ever since I bought my Inspiron 3000 with an N3540 CPU. I am running openSUSE Tumbleweed, always with the latest kernel.

System would freeze up, usually within 15 minutes of booting. I originally thought it was my SSD as it would often happen when accessing the disk, but it got worse and would eventually happen even when sitting idle under little load. Also, fan would also run at full speed from boot to crash.

If I suspend to ram as soon as the system has booted to the desktop, then bring the system out of suspend, the problem nearly goes away. The fan will work normally (it rarely kicks on unless I'm doing something crazy) and I can go long stretches without a crash. The problem is not completely gone...I'll crash maybe once a week. But it is much better than every 15 minutes. And if I forget to do the suspend trick after a reboot, I'm reminded quickly as it will crash within minutes EVERY time. 

I have never modified the idle_cstate as others have suggested. Perhaps my experience can help someone.
Comment 328 jbMacAZ 2016-05-11 06:34:53 UTC
(In reply to Austin from comment #327)
> I figured I'd post my experience with this bug and how I avoid it.
> 
> Run into this problem ever since I bought my Inspiron 3000 with N3540 cpu.
> Running opensuse tumbleweed, always with latest Kernel.
> 
> System would freeze up, usually within 15 minutes of booting. I originally
> thought it was my SSD as it would often happen when accessing the disk, but
> it got worse and would eventually happen even when sitting idle under little
> load. Also, fan would also run at full speed from boot to crash.
<snip>

I also have your model of Dell laptop. intel_idle.max_cstate=1 does work on my Dell.  I've run various kernels from 3.19 - 4.5 with Mint, Manjaro and Cubuntu.  Without ..cstate, I experience the screen freeze and runaway fan speed.
Comment 329 Daniel Bilik 2016-05-12 15:46:08 UTC
Like others, I've been also fighting this for several months. But it seems that _combination_ of "tentative" patches from Mika Kuoppala (see comment #c202) _and_ "legacy turbo" patch (comments #c93, #c98 and #c287) has finally stabilized i915 driver on my system (Asrock Q1900-ITX) to run it with deeper C-states enabled. See this post...

http://lists.dragonflybsd.org/pipermail/users/2016-May/249603.html

... so that I don't repeat myself. :)

HTH.
Comment 330 Xermán 2016-05-12 20:53:56 UTC
(In reply to Daniel Bilik from comment #329)
> Like others, I've been also fighting this for several months. But it seems
> that _combination_ of "tentative" patches from Mika Kuoppala (see comment
> #c202) _and_ "legacy turbo" patch (comments #c93, #c98 and #c287) has
> finally stabilized i915 driver on my system (Asrock Q1900-ITX) to run it
> with deeper C-states enabled. See this post...
> 
> http://lists.dragonflybsd.org/pipermail/users/2016-May/249603.html
> 
> ... so that I don't repeat myself. :)
> 
> HTH.

Thanks for your research Daniel, that looks promising.
Comment 331 jbMacAZ 2016-05-12 23:38:59 UTC
(In reply to Daniel Bilik from comment #329)
> Like others, I've been also fighting this for several months. But it seems
> that _combination_ of "tentative" patches from Mika Kuoppala (see comment
> #c202) _and_ "legacy turbo" patch (comments #c93, #c98 and #c287) has
> finally stabilized i915 driver on my system (Asrock Q1900-ITX) to run it
> with deeper C-states enabled. See this post...

Can I ask which kernel and processor family you are running?  I can't seem to replicate your success on my setup (various patched kernels 4.2 - 4.6rc, Atom Z3775).

While I can't definitively rule out a hardware platform issue, I am freeze free with ..cstate=1.  Newer kernels do take longer before freezing than older ones.
Comment 332 Daniel Bilik 2016-05-13 23:15:34 UTC
(In reply to jbMacAZ from comment #331)

> Can I ask which kernel and processor family you are running?

I run Dragonfly BSD on Asrock Q1900-ITX with Intel Celeron J1900. Dragonfly has drm infrastructure imported from the linux kernel, with both intel and amd drivers being regularly updated. I started to experience machine freezes when the i915 driver in Dragonfly was synced to what's in linux 4.0, and I had to limit the CPU to C1. When Dragonfly synced i915 to linux 4.1, it made my system stable again, even with deeper C-states. But with the update to a version from linux 4.2, freezes were back again. I was struggling with this for months, but with the two patches I've mentioned in my previous post my system has been running stable for several weeks now, with deeper C-states enabled. In the meantime, the i915 driver in Dragonfly was synced with linux 4.3, so I've updated my system this week, keeping the patches, and it still runs stable (it's been just a few days, but without the patches I was experiencing a freeze practically each day).

> While I can't definitively rule out a hardware platform issue, I am freeze
> free with ..cstate=1.

Well, because I use i915 driver in Dragonfly, I can't really confirm that the patches solve freezes on linux completely. But so far, it seems to be sufficient to make my system stable with deeper C-states, so I can definitely say the patches positively influence stability of i915 driver on Baytrail.

> Newer kernels do take longer before freezing than older ones.

In my experience, the system uptime and/or load doesn't seem to matter. Sometimes the system was running stable for two days, sometimes it froze after two hours. In fact, it always froze when the system was "doing nothing" and I just moved a mouse pointer or scrolled an already loaded page in firefox.
Comment 333 jbMacAZ 2016-05-14 00:48:21 UTC
(In reply to Daniel Bilik from comment #332)
> (In reply to jbMacAZ from comment #331)
> 
I appreciate the information and insights.  Perhaps there are additional factors affecting freezes from outside the drm code that aren't present in dragonfly.

---- 

On a different subject, is anyone getting a blank screen lockup starting with 4.6-rc7 and 4.5.4?  System runs for a while, seems fine and then suddenly the screen goes black, locked up.  I think maybe some of the bug fixes for this freeze bug may be almost right but now the symptom has changed from a static display to a black screen.  Just a feeling so far, but it needs the same hard reset to recover, so no dmesg to inspect.  Less recent kernels are still stable with cstate, so I don't think it's a hardware fault.  

Are the hunter patches now obsolete in 4.6-rc7/4.5.4?  My tests still use 2 of them that I had to use in earlier kernels.  If they aren't needed anymore, using them could explain this new issue.
Comment 334 Dimitris Roussis 2016-05-14 10:51:13 UTC
I installed the cloudready distro (http://www.neverware.com/) and there are no freezes anymore.

It is very strange, because this version uses Linux kernel 4.0.5!

In all the Linux distros I have used, I get freezes with any kernel above 3.16.7.

What is different in the chromiumos Linux kernel?
Comment 335 Hal 2016-05-14 14:46:21 UTC
Anyone tried to install a linux-libre kernel and see if it would work better? 

I'm planning on trying the one from here: http://linux-libre.fsfla.org/pub/linux-libre/freesh/pool/main/l/linux-4.5.4-gnu/

But prior to doing it I would like any feedback you might provide, as I have no experience with linux-libre kernels and what I will be missing (understand breaking in my system) once I install it.

Regarding my systems' freeze status since I applied intel_idle.max_cstate=1; well no more freezing but both machines run noticeably warmer. Those boxes are small and cramped. They only have passive cooling. 

One other thing I noticed and which has alarmed me is that on both machines one of the cores runs at 100% for long periods (tens of minutes), then falls to normal levels for a minute or so and then goes up again creating a cycle.
I don't remember having seen that when I first started to use cstate=1, so I am not sure if the two are connected. But, I am certain that something is wrong with this behaviour.

Hal
Comment 336 heimdall_cuba 2016-05-14 15:46:52 UTC
I spent more than 4 months trying to solve the freezing problem on my laptop with an Intel Pentium N3540 (Bay Trail). Reading this thread, I found the solution: setting intel_idle.max_cstate=1. Since then I have had no further problems. However, I do not quite understand what I have changed with that setting, or what long-term problems it may cause on my laptop.
Comment 337 Martin 2016-05-15 09:59:22 UTC
I applied Daniel's patches (comment 329) to 4.5.4 but alas, freezes at last. Back to max_cstate=1 again.
Comment 338 Maurizio 2016-05-15 11:30:07 UTC
Hope I'm not too optimistic, but I'm trying 4.6.0-rc7-g44549e8 (I'm using Arch Linux and this is what was available in the aur repositories 2 days ago) and so far I've experienced no crashes (almost 2 days of continuous uptime with normal use of the system).

Previously the RC3 version crashed as well.
Comment 339 Justin 2016-05-15 13:50:34 UTC
On my dell 3531 with the baytrail processor, I have linux-image-4.6.0-rc7-amd64 installed from debian experimental, here (https://packages.debian.org/experimental/kernel/linux-image-4.6.0-rc7-amd64). I still get the crashes without intel_idle.max_cstate=1.

With intel_idle.max_cstate=1, no crashes.
Comment 340 Hal 2016-05-15 15:21:23 UTC
(In reply to Hal from comment #335)
> Anyone tried to install a linux-libre kernel and see if it would work
> better? 
> 
> I'm planning on trying the one from here:
> http://linux-libre.fsfla.org/pub/linux-libre/freesh/pool/main/l/linux-4.5.4-
> gnu/
> 
linux-libre v4.5.4 from above repo installed well and worked on Linux Mint 17.3 but froze eventually. So the binary free version of the kernel is not any better than the regular kernel.
Hal
Comment 341 Maurizio 2016-05-16 10:42:09 UTC
Well I have now 3 solid days of uptime... so for me 4.6.0-rc7-g44549e8 seems to work pretty well (my CPU is a celeron N2930). 
 
I'm not really an expert so I would assume that g44etc is the commit. Worth checking what is the one used for the debian RC7 build... and what has been done in between (assuming it's not a later one and they broke it again :) )
Comment 342 Maurizio 2016-05-16 10:44:31 UTC
(In reply to Maurizio from comment #341)
> Well I have now 3 solid days of uptime... so for me 4.6.0-rc7-g44549e8 seems
> to work pretty well (my CPU is a celeron N2930). 
>  
> I'm not really an expert so I would assume that g44etc is the commit. Worth
> checking what is the one used for the debian RC7 build... and what has been
> done in between (assuming it's not a later one and they broke it again :) )

Also wanted to add that sensors now report a cpu core temperature 10 degree lower than with a 4.5 kernel with max_cstate=1 ...
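
For comparing temperatures between kernels or settings without relying on the sensors tool, the kernel's thermal zones can be read directly. A rough sketch, assuming the usual /sys/class/thermal layout where each thermal_zone* directory has a type file and a temp file in millidegrees Celsius; the function name is made up.

```python
from pathlib import Path

def read_thermal_zones(thermal_dir="/sys/class/thermal"):
    """Return {"<zone dir>:<zone type>": degrees Celsius} for each
    readable thermal zone under thermal_dir."""
    temps = {}
    for zone in sorted(Path(thermal_dir).glob("thermal_zone*")):
        try:
            zone_type = (zone / "type").read_text().strip()
            millideg = int((zone / "temp").read_text())
        except (OSError, ValueError):
            continue  # some zones cannot be read; skip them
        temps[f"{zone.name}:{zone_type}"] = millideg / 1000.0
    return temps
```

Logging the result every few minutes under each configuration would give a fairer comparison than a single reading.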
Comment 343 MarkB 2016-05-17 08:00:08 UTC
Digging around a little, I am seeing many people use the word 'latency' and suggest that one cause of the problem may be that the interrupts issued to wake up the CPU from a deeper idle state are somehow causing the freezing issue. Coupled with the talk above about alternative kernels, I wanted to ask whether anyone has tried any of the alternative Ubuntu kernels (low latency, real time, etc.)?
Comment 344 Daniel Bilik 2016-05-17 08:42:53 UTC
(In reply to Maurizio from comment #341)
> Well I have now 3 solid days of uptime... so for me 4.6.0-rc7-g44549e8 seems
> to work pretty well (my CPU is a celeron N2930). 

Indeed, looking through the changes merged into 4.6-rc7 in the past weeks, there are several commits to the i915 driver claiming to solve hangs. And specifically for Baytrail, these two are the most interesting:

https://git.kernel.org/cgit/linux/kernel/git/torvalds/linux.git/commit?id=4ea3959018d09edfa36a9e7b5ccdbd4ec4b99e49
https://git.kernel.org/cgit/linux/kernel/git/torvalds/linux.git/commit?id=1b3e885a05d4f0a35dde035724e7c6453d2cbe71

If I read it correctly, the first one fixes the same problem with rps thresholds that one of Mika's "tentative" patches was trying to. And second one fixes "timing cruical" ringbuffer issue that IMHO could be causing hangs at random places. BTW, timing may be that additional factor mentioned by jbMacAZ in comment #c333, and it would explain why "legacy turbo" + "tentative" patches work for me and not for others - timing on Linux vs. Dragonfly definitely is different.

Anyway, I've swapped my "combo" patch for those commits mentioned above, and I'm currently testing it. Because, to be honest, the patches I've been using so far, despite making my system stable with deep C-states, smell a little "hackish". And those commits, besides being "official", look more like "the proper solution".
Comment 345 dan.g.tob 2016-05-17 10:27:29 UTC
I'm up to 22 hours uptime now with 4.6 vanilla without intel_idle.max_cstate=1. I'm using the ubuntu built packages on debian 8 (afaik there are no external patches). I have a lenovo ideapad 100s with an atom Z3735F


http://kernel.ubuntu.com/~kernel-ppa/mainline/v4.6-yakkety/
Comment 346 Cris Daniel 2016-05-17 11:12:34 UTC
4.6.0-1 from the Arch testing repo just brought down my Z3735F.
Comment 347 Daniel Bilik 2016-05-17 13:01:12 UTC
(In reply to Daniel Bilik from comment #344)
> (In reply to Maurizio from comment #341)
> ... and I'm currently testing it.

No luck, got freeze in less than a day. Commits 4ea3959+ and 1b3e885+ alone do not seem to be enough to prevent hangs. Back to my (somewhat dirty but working) patchset.
Comment 348 Martin 2016-05-17 13:23:46 UTC
Vanilla 4.6.0+ without max_cstate=1 still freezes up for me too, sticking to max_cstate=1 for now.
Comment 349 Maurizio 2016-05-17 15:04:57 UTC
Well it was too good to be true :) after 4 full days uptime I've got a crash few minutes ago. Just rebooted the machine, will let it run again to see if it was luck or at least now I can get some days of uptime. 

I will also update the kernel to the latest next time it crashes.
Comment 350 jbMacAZ 2016-05-17 18:36:07 UTC
I am having issues with 4.6.  ..cstate=1 no longer prevents ordinary display freeze (GUI locked, CPU activity = 0%.)  intel_idle.max_cstate=1 had been a reliable workaround (for me) since 4.2.6.  Asus T100CHI (Z3775) Ubuntu 16.04.  Kernel minimally patched for Bluetooth device ID and other hardware bits not yet supported by main-stream.  Patches proven in earlier kernels, pruned as necessary with each new kernel releases.  4.6-rc5 was still freeze free with cstate, unsure of rc6, rc7 also froze with black screen or soft freeze (mouse cursor freely moved, display updating only once every 1-2 minutes).

I choose to be optimistic, that the freeze bugs are being worked on now and another edit or two will finish fixing them.  It sounds like the current changes work, as is, for other systems.
Comment 351 Tal Liron 2016-05-17 18:41:58 UTC
Do the people committing the fixes on Linux now know about the testing we are doing here? Could someone here with some authority notify them?
Comment 352 Daniel Glöckner 2016-05-17 22:19:50 UTC
Please everyone, keep this on topic.

If your cursor updates when you move your mouse, this is not your bug.
If your screen turns black, this is not your bug.
If you can still SSH into or ping your device, this is not your bug.
If it's just some application dying, this is not your bug.
If you still see freezes after max_cstate=1, this is not your bug.

There may be other problems with Bay Trail that might show some of these symptoms, but this is not the correct Bugzilla entry to discuss them.
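
The checklist above can be read as a simple filter. A minimal sketch, assuming the five conditions are exactly the ones listed in this comment; the function and parameter names are hypothetical, and it only restates the list rather than diagnosing anything.

```python
def matches_this_bug(cursor_still_moves=False,
                     screen_turned_black=False,
                     reachable_by_ssh_or_ping=False,
                     only_an_application_died=False,
                     freezes_with_max_cstate_1=False):
    """Return True only if none of the exclusion criteria from the
    checklist apply to the observed freeze."""
    exclusions = (
        cursor_still_moves,
        screen_turned_black,
        reachable_by_ssh_or_ping,
        only_an_application_died,
        freezes_with_max_cstate_1,
    )
    return not any(exclusions)
```

For example, a hard freeze that disappears with max_cstate=1 and leaves the machine unreachable passes the filter, while a black-screen lockup does not.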
Comment 353 Gabriel7340 2016-05-18 01:11:04 UTC
I'm using now version:
4.6.0-040600rc1-generic #201603261930 SMP Sat Mar 26 23:32:43 UTC 2016 x86_64 x86_64 x86_64 GNU/Linux 

Kernel without any crash! :-)
Comment 354 jds 2016-05-18 05:44:11 UTC
Created attachment 216481 [details]
attachment-24742-0.html

Tried 4.6 RC7 from ubuntu kernel release page.

4.6.0-040600rc7-generic #201605081830 SMP Sun May 8 22:32:57 UTC 2016
x86_64 x86_64 x86_64 GNU/Linux

Full lock-up after a few minutes.  Reverted to max_cstate=2.



Comment 355 Maurizio 2016-05-18 10:35:55 UTC
Updated to 4.6.0-g2dcd0af, no luck - froze after half a day. Green screen, completely hung.

Either they broke it again or the 4 days of uptime were just a lucky shot.

Anyway, is actually the maintainer of this component aware of the bug? This is still in "NEW" state with no official updates (also lists up to kernel 4.2 while 4.6 is also affected) ? 

The max_cstate is not really a proper workaround, the power consumption as well as temperature goes up dramatically.
Comment 356 Andrew Clayton 2016-05-18 12:19:01 UTC
> Anyway, is actually the maintainer of this component aware of the bug? This

This bug is assigned to Len Brown and he has commented here, so *he* at least is aware of this.

However, I fear (and has already been mentioned in earlier comments) this bug report has long since lost any usefulness it might have once had and has just turned into a dumping ground for random comments and updates and now reads like some web forum thread.
Comment 357 jds 2016-05-18 16:12:57 UTC
Created attachment 216551 [details]
attachment-7936-0.html

Not so.  The bug discourse may have become a bit ragged due to the age of
the bug and the near-total non-response by the owner or by kernel people.
But there's a perfectly clear thread: every kernel from at least 3.19
through 4.6 locks up hard on BayTrail and Broadwell systems after minutes
or hours.

jds

Comment 358 Maurizio 2016-05-18 17:26:42 UTC
(In reply to Andrew Clayton from comment #356)
> > Anyway, is actually the maintainer of this component aware of the bug? This
> 
> This bug is assigned to Len Brown and he has commented here, so *he* at
> least is aware of this.
> 
> However, I fear (and has already been mentioned in earlier comments) this
> bug report has long since lost any usefulness it might have once had and has
> just turned into a dumping ground for random comments and updates and now
> reads like some web forum thread,

Well, most of the people experiencing the problem refers to this thread. It should be the best place to source for information, and why not ask for some cooperation? 

The bug has been open since December with no status change whatsoever, but reports of baytrail hanging date back to Oct 2014 when 3.17 was released. This is a pretty serious problem, as it is preventing linux from running properly on a very large number of systems... it just doesn't look like it is getting the right attention.
Comment 359 julio.borreguero@gmail.com 2016-05-18 17:48:17 UTC
I tried the drm-intel kernel 4.6.0-rc7 on an N2940.
The system froze after 2 days.

I understand people's frustration too: a serious bug with a clearly defined thread, and only some frustrated users commenting.
Still, the vast majority here are helping by testing and reporting for their platform.
Feedback is non-existent and the bug is not getting any attention; we don't even know if the maintainer is still alive ;)
How is it not going to read like a forum thread after more than 2 years and many major kernel versions since its appearance?
Comment 360 jbMacAZ 2016-05-18 18:18:20 UTC
(In reply to Daniel Glöckner from comment #352)
> Please everyone, keep this on topic.
> 
> If your cursor updates when you move your mouse, this is not your bug.
> If your screen turns black, this is not your bug.
<snip>> 
> There may be other problems with Bay Trail that might show some of these
> symptoms, but this is not the correct Bugzilla entry to discuss them.

Since removing the Hunter patches (see comments #55, #103) from my 4.6 build, I have not had a recurrence of those alternate freezes.
Comment 361 nw9165-3201 2016-05-18 23:05:31 UTC
@ Len Brown:

Any chance you could give us an update on this issue?

It would be much appreciated.

Regards
Comment 362 Gabriel7340 2016-05-21 17:30:42 UTC
(In reply to Gabriel7340 from comment #353)
> I'm using now version:
> 4.6.0-040600rc1-generic #201603261930 SMP Sat Mar 26 23:32:43 UTC 2016
> x86_64 x86_64 x86_64 GNU/Linux 
> 
> Kernel without any crash! :-)

Sorry, the bug came back :/ I went back to kernel 3.13.0-85-generic again :/
Comment 363 Libor Chmelik 2016-05-21 20:05:26 UTC
I'm now using kernel 4.6.0-040600-generic from Ubuntu, since my laptop (Acer Aspire E5-511-P7AT with Pentium N3540) runs on Mint 17.3

On kernel 4.5 with cstate=1 it worked flawlessly. 

After approx. 1 hour on 4.6 without cstate=1 it froze again during playback of an HD movie on VLC.

Trying 4.6 now with cstate=1

Also I can't downgrade the kernel lower than 4.5 because then the shutdown/reboot hanging/freezing issue on this machine would be back again. :-(
Comment 364 jbMacAZ 2016-05-22 01:32:30 UTC
4.5.5 has the same broken cstate bandaid as 4.6.  In other words, both kernels freeze and ..cstate=1 no longer stops it.  4.6-rc7 did the same thing.  

Anyone have a new workaround?  I can always use 4.4.11 which is actually running pretty well now.  It just doesn't support all of my hardware (eg sound.)
Comment 365 Justin 2016-05-22 01:49:59 UTC
intel_idle.max_cstate=4 appears to work with rc7, so we can get more of the power savings.
Comment 366 tim 2016-05-24 00:23:27 UTC
Just wondering if this is not the droid we're looking for...

On an unrelated development - saw a lot of jitter across different BYT platforms, excessively so, and not just on J1900, but also on the Z3537G.

Digging into things, and pulling an older IVB based 1037U box, saw the same thing.

Putting intel_idle.max_cstate=1 on the kernel command line sort of solved the problem for the most part - this is with ubuntu-server 4.4.0-22 - and by "sort of" I mean it worked around it.

Reverting the valleyview change out that was in 3.16 kind of fixed it - e.g. no more freezes on the BYT devices - but the IVB never had the freezes in the first place with 4.4.0.

Hmmm... something tells me there's more to this problem than just the graphics driver. I just don't have handy the gear needed to get deeper into the HW - e.g. JTAG and Protocol Analyzer these days - but I'm suspecting that there is something going on with the timing on both BYT and IVB, and I suspect Haswell, Braswell, and later..
Comment 367 tim 2016-05-24 04:22:55 UTC
Hokay - did some more debug/testing - pthread crashes are inconsistent, when looking at stack dumps.

Munged the UEFI to keep the BYT running at a constant speed, and things are fine. Same with IVB - weird... with constant clocks, it's all good. Let the cores sleep a bit too much - boom - worrisome as this could lead to data corruption that folks wouldn't see immediately.

While I can't dig into the HW - long time back on ARM with an RTOS, we found that dynamic clocks could lead to issues with cpu clocks and mem states with mem reads specifically - cpu would read before memory was ready.

since BYT, IVB use the uncore/system agent - for both CPU and GPU, this is the area of interest - as the uncore controls timing for everything.

Probably need someone from intel systems to sort out this, as this is all their stuff inside.
Comment 368 Gabriel7340 2016-05-25 22:58:18 UTC
Thanks for the debugging! Good work. I'm still analysing the code, but, as you suggest, someone from Intel could see more accurately and quickly what is really going wrong.
Comment 369 joev.mi 2016-05-26 12:44:28 UTC
With kernel 4.4.9 I am now seeing lockups fairly frequently. I had started to see some with FC22 and thought they were related to the nouveau driver, so I updated to FC23 on May 1; it was due anyway. The lockups were apparently resolved; that was with kernel 4.4.6. With kernel 4.4.9 I am now seeing lockups on a daily basis. Just prior to one such lockup I noticed a log entry:
"May 24 19:52:50 xps8700.durand8450.info kernel: NMI watchdog: Watchdog detected hard LOCKUP on cpu 0
-- Reboot -- "
So I found this thread. Now we are talking about something with the kernel and cstate or pci/msi interaction? I have a Dell XPS8700 desktop; I looked in my BIOS setup and see nothing related to cstate. I tried pci=nomsi in the GRUB entry and got no relief. I thought perhaps my BIOS is too old (I'm at A07, Dell's latest is A11). I tried flashing it from FreeDOS but it fails to burn. Anyhow, I am now running kernel 4.4.6-201, which seems stable.
Comment 370 joev.mi 2016-05-26 13:35:01 UTC
... finally read carefully enough to figure out the other method to set cstate is to qualify the kernel invocation in GRUB.  I am now running kernel 4.4.9 with cstate set to 1.  It hasn't locked up in half an hour.
Comment 371 joev.mi 2016-05-26 17:40:25 UTC
... but not really much longer.  log reports that cstate = 1 was reached shortly after reboot, then log records:  'NMI watchdog: Watchdog detected hard LOCKUP on cpu 0' about two hours after boot.  hmm.  I think I've tried the two work arounds.  I am apparently able to run with kernel 4.4.6
Comment 372 joev.mi 2016-05-26 17:49:23 UTC
... and I could have mentioned that about 14 minutes after the watchdog notice above I see the following in the log:
kernel: INFO: rcu_sched detected stalls on CPUs/tasks:
kernel:         0-...: (1 GPs behind) idle=643/1/0 softirq=79517/79517 fqs=320010
kernel:         (detected by 3, t=960034 jiffies, g=113164, c=113163, q=0)
kernel: Task dump for CPU 0:
kernel: swapper/0       R  running task        0     0      0 0x00000008
kernel:  ffffffff8163d5af 000000001ec13d60 0000066d9d796809 ffffffff81d3b0c0
kernel:  ffffffff81c04000 ffff88021ec1fb00 ffffffff81cc1040 ffffffff81c00000
kernel:  ffffffff81c03ec0 ffffffff8163d797 ffffffff81c03ed8 ffffffff810e6752
kernel: Call Trace:
kernel:  [<ffffffff8163d5af>] ? cpuidle_enter_state+0xff/0x2b0
kernel:  [<ffffffff8163d797>] ? cpuidle_enter+0x17/0x20
kernel:  [<ffffffff810e6752>] ? call_cpuidle+0x32/0x60
kernel:  [<ffffffff8163d773>] ? cpuidle_select+0x13/0x20
kernel:  [<ffffffff810e6a10>] ? cpu_startup_entry+0x290/0x350
kernel:  [<ffffffff8179513c>] ? rest_init+0x7c/0x80
kernel:  [<ffffffff81d6201e>] ? start_kernel+0x498/0x4b9
kernel:  [<ffffffff81d61120>] ? early_idt_handler_array+0x120/0x120
kernel:  [<ffffffff81d61339>] ? x86_64_start_reservations+0x2a/0x2c
kernel:  [<ffffffff81d61485>] ? x86_64_start_kernel+0x14a/0x16d

at which point the log stops.

and I'm still working on why my BIOS update won't actually update.
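
For anyone who wants to check their own logs for the two messages quoted above, here is a minimal Python sketch. The function name `find_lockup_events` and the `PATTERNS` table are made up for illustration; the only things taken from this thread are the two message strings, and other kernel versions may word them differently.

```python
import re

# Message fragments copied from the log excerpts in this thread.
# Other kernel versions may phrase these messages differently.
PATTERNS = {
    "hard_lockup": re.compile(r"Watchdog detected hard LOCKUP on cpu (\d+)"),
    "rcu_stall": re.compile(r"rcu_sched detected stalls on CPUs/tasks"),
}


def find_lockup_events(lines):
    """Return a list of (line_number, kind, text) for matching log lines.

    `lines` is any iterable of strings, e.g. an open log file.
    Line numbers start at 1.
    """
    events = []
    for number, line in enumerate(lines, start=1):
        for kind, pattern in PATTERNS.items():
            if pattern.search(line):
                events.append((number, kind, line.rstrip("\n")))
    return events
```

Feeding it the lines of a saved kernel log (for example the output of `journalctl -k` written to a file) gives the line numbers worth looking at; it says nothing about the cause.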
Comment 373 Javier Antonio Nisa Avila 2016-05-27 06:20:24 UTC
Hi Guys!!!

I have the same bug on a thinkpad e11 with an N2930.

This bug is similar: https://bugs.launchpad.net/ubuntu/+source/linux/+bug/1575467

The people at Canonical built a test kernel with a possible solution.

Can you test it?

http://kernel.ubuntu.com/~jsalisbury/lp1575467

Thanks.
Comment 374 joev.mi 2016-05-27 12:44:45 UTC
just for the record my current state is running on kernel 4.4.6 out of the box, no tweaks in GRUB.  I was too hasty when I inferred I was running well with kernel 4.4.6 with limit on cstate.
Comment 375 Len Brown 2016-05-27 16:57:10 UTC
The only Linux-based commercial product that used BYT
was based on the Android snapshot/fork of the Linux kernel,
not the upstream Linux kernel.

Nobody knows why the Android version of Linux
is stable on this hardware, while upstream Linux is not.
There have been several de-bunked theories.

No, it isn't a bug in the intel_idle driver -- you'll
have the same results with "intel_idle.max_cstate=0",
which will run the acpi_idle driver.

The cause is likely due to an SOC device other than the CPU.
Comment 376 Maurizio 2016-05-27 17:27:24 UTC
(In reply to Javier Antonio Nisa Avila from comment #373)
> Hi Guys!!!
> 
> I Have the same bud in thinkpad e11 N2930
> 
> This bug is similar
> https://bugs.launchpad.net/ubuntu/+source/linux/+bug/1575467
> 
> And people of canonical build a test kernel with possible solution.
> 
> Can you probe??
> 
> http://kernel.ubuntu.com/~jsalisbury/lp1575467
> 
> Thanks.

If I understand the various comments correctly, they are bisecting to find out which commit caused the issue, but it's not (yet) a possible solution.
Of course the problem is so widespread that a lot of duplicated effort is being made.
Comment 377 joev.mi 2016-05-30 18:20:22 UTC
After more watching, perhaps I'm on the wrong thread. I can now see that I get the system freezes regardless of the cstate workaround or the pci=nomsi workaround, and regardless of which of the installed kernels I select from 4.4.6, 4.4.8 and 4.4.9. I've tried disabling the watchdog, not because I think there is a causal relationship, but because the watchdog often logs an alert just before a lockup event. I was momentarily optimistic that disabling the watchdog might change my system event from a freeze to a crash, which I would have preferred.
    I have been running with intel_idle.max_cstate=0 for the past several days. I am noticing a lot of "(tracker-miner-fs:1951): Tracker-CRITICAL" events in the log. I plan to go back to the beginning of my searching to see if I can make a better match to what I'm seeing.
Comment 378 BzukTuk 2016-05-30 20:14:43 UTC
Hi again,
Kernel 4.6 + Mika Kuoppala's 3 _tentative_ patches + linux-999-i915-use-legacy-turbo.patch = over 120h in one single session (without reboot/sleep) and roughly another 20 hours in a few 3-4 hour sessions without a single freeze. Still counting...

(Some of Adrian Hunter's patches for pm/mmc were also applied, but I don't think (hope) this matters.)

Kernel 4.5.x + Mika's 3 _tentative_ patches + linux-999-i915-use-legacy-turbo.patch = first freeze after about 8 hours; the second freeze came a few minutes after boot.

(All of the above was without any intel_idle.max_cstate parameter.)

Thanks to Daniel Bilik for this "combo".

Maybe something new in v4.6 fixed the last hole... Or maybe I'm just lucky.
Comment 379 Mina Demian 2016-05-31 10:46:54 UTC
I can confirm that this fix worked for me on Ubuntu 14.06 (kernel: 4.4.0-22-generic) on an Acer Aspire E15 laptop, dual-booting with Windows 8. There have been no irrecoverable freezes since applying the fix yesterday, but there have been a few times where it slowed down to almost freeze. Thankfully, it saved itself.
Comment 380 joev.mi 2016-05-31 13:45:57 UTC
I believe I have confirmed my issue is not related to the subject of this thread.  I re-initialized tracker and all seems to have cleared up.  I thought the freezes I was seeing were right in line with descriptions above but unlike other reports I was getting no relief from the work-arounds that reportedly helped others.  Changing cstate to zero and disabling watchdog did help me focus on the real problem.  Thank you for your patience.
Comment 381 Dmitry 2016-05-31 14:16:10 UTC
From Documentation/kernel-parameters.txt:
>>  intel_idle.max_cstate=  [KNL,HW,ACPI,X86]
>>       0       disables intel_idle and fall back on acpi_idle.
>>       1 to 6  specify maximum depth of C-state.

acpi_idle is a different idle driver and could disable all C-states as well; it depends on the ACPI tables.

So, you need to try this and check which states are enabled:
$ uname -a
$ cat /sys/devices/system/cpu/cpuidle/current_driver
$ cat /sys/devices/system/cpu/cpu0/cpuidle/state*/name

For me it's like this:
>>  Linux venue11pro 4.1.25-dirty #337 SMP PREEMPT Wed May 25 01:53:43 MSK 2016
>>  i686 Intel(R) Atom(TM) CPU Z3770 @ 1.46GHz GenuineIntel GNU/Linux
>>  intel_idle
>>  POLL
>>  C1-BYT
>>  C6N-BYT
>>  C6S-BYT
>>  C7-BYT
>>  C7S-BYT

P.S. I also have to mention that even with kernel 4.1.25, the mmc PM QOS patches and the legacy turbo patch I have freezes. Disabling the SDIO wifi with the ath6kl driver prevents any lockup at all.
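
The sysfs checks in this comment can also be wrapped in a small script, which is handy when comparing several machines. This is only a sketch: the function name `read_cpuidle` and the `sysfs_root` parameter are made up for illustration (the parameter exists only so the function can be pointed at a copy of the tree), while the paths are the ones used in the `cat` commands.

```python
import glob
import os


def read_cpuidle(sysfs_root="/sys/devices/system/cpu", cpu="cpu0"):
    """Return (driver, [state names]) for one CPU, read from sysfs.

    read_cpuidle and sysfs_root are illustrative names, not a kernel API.
    """
    def read(path):
        with open(path) as f:
            return f.read().strip()

    def state_index(path):
        # path ends in .../stateN/name; extract N so state10 sorts after state9
        state_dir = os.path.basename(os.path.dirname(path))
        return int(state_dir[len("state"):])

    driver = read(os.path.join(sysfs_root, "cpuidle", "current_driver"))
    pattern = os.path.join(sysfs_root, cpu, "cpuidle", "state*", "name")
    names = [read(p) for p in sorted(glob.glob(pattern), key=state_index)]
    return driver, names
```

Called with no arguments on a running system it reads the live values; if only POLL and a C1 state come back, the deeper states are not available to the idle driver.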
Comment 382 joev.mi 2016-06-06 22:14:04 UTC
I'm baaack.  I've had additional events.  The best solution I could get to with the work arounds was to set cstate=0 and switch from gnome to xfce desktop.  And make sure I close firefox and thunderbird.  That combination really stretched out the events but still at least one a day.  I also tried disabling power management for the monitor, I had it set to turn the monitor off after 45 minutes of inactivity.  My next approach is to flash the computer with the latest release posted by Dell of AMI BIOS, A11.  If I haven't mentioned I'm running workstation fedora on a Dell XPS 8700 which I bought new two years ago.  It was still running BIOS A07 which it came with.  I completed the re-flash this afternoon.

I've been having some difficulty sorting out what pieces are really in play.  The symptoms I see sound like what is described above but I have not seen the relief others have reported from the work arounds.  Also, kernel seems to be implicated in the discussion but my events did not seem to be associated with a kernel upgrade (I didn't realize that until more recently.  I was running 4.4.8 for ten days until my events started.)
Comment 383 D. Hugh Redelmeier 2016-06-07 03:08:37 UTC
@joev.mi  This bugzilla entry is about Baytrail processors.  Your computer does not have one of those -- it uses "4th generation Intel Core processors".  Please start a separate bugzilla.  This one is confusing enough already.

I think that reports of Braswell/Cherrytrail problems are likely relevant.

Examples of Baytrail reported (above) as having the bug: Atom Z3735G, Atom Z3770, Celeron J1900, Celeron N2930, Celeron N2940, Pentium J2900, Pentium N3520, Pentium N3540

Examples of Baytrail reported (above) without seeming to have the bug: Celeron CPU N2830, Celeron CPU N2840

Examples of Braswell/Cherrytrail reported (above) as having the bug: Celeron N3050

Examples of Braswell/Cherrytrail reported (above) without seeming to have the bug: N3150, N3700

(I stopped reading at about comment 150)
Comment 384 Koen Roggemans 2016-06-07 04:48:16 UTC
Created attachment 219241 [details]
attachment-22682-0.html

Correction to the list: I have an N3150 that has the bug: with the
workaround with cstate=1 I have seen it freezing only once.

Comment 385 Libor Chmelik 2016-06-07 05:30:25 UTC
Since my last post and cstate=1 on kernel 4.6.0-040600-generic from Ubuntu, my laptop (Acer Aspire E5-511-P7AT with Pentium N3540) running Mint 17.3 didn't freeze once.
I tried every usual cause possible (Fullscreen HD videos on youtube or in VLC. Browsing content loaded websites in chrome and firefox. Batch HD conversion in Handbrake, etc.).

No freezing or hanging so far.

The cstate=1 workaround seems to work for me so far.
Comment 386 Maurizio 2016-06-07 12:23:49 UTC
I'm really confused now. I switched from Arch to Debian stable (3.16.0 kernel), which didn't freeze once, as I was expecting (I've read the bug started with 3.17, not 3.16). Then I reinstalled Arch (standard) with kernel 4.5.4, and so far I haven't experienced a single freeze (I now have 48 hours of uptime).

I don't really have the expertise to tell whether the Arch team has applied any patch to the 4.5.4 kernel in the last couple of months. The only thing I did differently this time was to disable in the BIOS all components I don't really need (the serial port, for example)... I will let it run for a couple more days to check that it keeps running, then I'll start playing with the BIOS, turning devices on or off again. Does this make any sense?
If I understand correctly what Len said, it's a problem with a device driver rather than with intel_idle?
Comment 387 Gabriel7340 2016-06-07 13:02:02 UTC
Did you disable cstates in bios?
Comment 388 Maurizio 2016-06-07 14:28:19 UTC
(In reply to Gabriel7340 from comment #387)
> Did you disable cstates in bios?

No... I've just disabled some devices I don't use, but I didn't do it with the bug in mind. I will check it and take some notes to see if the BIOS settings (with the exception of cstates) can actually change something.
Comment 389 marco_silva85 2016-06-07 14:39:32 UTC
Does the freeze only happen when using X11 or a Desktop Environment? Am I safe if I only use my hardware without any intel driver or X11?

I want to use my Q1900 just as a server, in console mode.
Comment 390 ladiko 2016-06-07 18:54:14 UTC
I use an ASRock Q1900-ITX without X, with kernel 3.19, then 4.2 and now 4.4, and no special settings. It has been running for a year without issues.
Comment 391 ladiko 2016-06-07 19:00:17 UTC
Ahh, and I forgot to say that it was sorted out because it had all the other known issues when running as a kiosk system.

By the way, when running as a kiosk system we had the problem that the USB ports stopped working after a random time. The devices don't even disappear when unplugged. The exact same imaged installation has no issues on AMD Kabini, older Intel Core 2 Duo or Celeron 847. Is anything known about this issue?

Because of all this trouble we moved to AMD's Kabini, which works without any issues.
Comment 392 jbMacAZ 2016-06-08 18:03:18 UTC
The latest 2 maintenance releases of 4.5 & 4.6 seem to have restored the cstate work-around.  My T100CHI is running again without the classic freeze described here.  Many thanks to whoever restored the cstate work-around.
Comment 393 Maurizio 2016-06-09 07:53:56 UTC
(In reply to jbMacAZ from comment #392)
> The lastest 2 maintenance releases of 4.5 & 4.6 seem to have restored the
> cstate work-around.  My T100CHI is running again without the classic freeze
> described here.  Many thanks to whoever restored the cstate work-around.

This is my impression too... I upgraded to the 4.6 kernel yesterday and have had no crashes for 15 hours so far. Before that I had 4 days of uptime with 4.5.4.

It would be nice to have a confirmation, also to avoid one of the next patches bringing everything back.
Comment 394 Zhang Rui 2016-06-20 02:17:38 UTC
so what's the status now?
Comment 395 Michal Feix 2016-06-20 04:59:14 UTC
On an Acer TravelMate B115-M (N2940 @ 1.83 GHz) with the latest BIOS, it still hangs occasionally with kernel 4.6.2. But it's definitely way better than with previous kernels. I removed the max_cstate=1 workaround about a week ago and the machine has crashed only once or twice during the past 7 days. So to sum it up: still not 100% perfect, but definitely a huge improvement.

BTW - I've enabled the HW watchdog in the systemd configuration. When the machine hangs (display hangs, network hangs, mouse and keyboard not reacting, etc.), it is still automatically rebooted by the HW watchdog. If I understand correctly, this reboot watchdog is independent of the kernel and should always be able to automatically reboot a machine with a hung kernel. As these crashes became less frequent, I started to use this HW watchdog as a new temporary workaround to keep my machine up when being used remotely.
Comment 396 Libor Chmelik 2016-06-20 05:42:30 UTC
With the workaround cstate=1 on kernel 4.6.0-040600-generic from Ubuntu, my laptop (Acer Aspire E5-511-P7AT with Pentium N3540) is still running so far.

No freezing or hanging.

The only difference from kernels before 4.4 is that I disabled the onboard Broadcom Wifi/Bluetooth card/chip.
I'm using a Ralink USB dongle instead.
Comment 397 Maurizio 2016-06-20 08:15:36 UTC
With Arch linux kernel 4.6.2 - and clearly no max_cstate - I've experienced some occasional crashes but it was always during some heavy load of the machine (streaming hd videos from the network), while previously it crashed randomly after few hours even with the machine completely idle. So big improvement, I will try to stress the machine with max_cstate=1 to check if the crashes are due to the same problem or something else.
Comment 398 Sebastian Parschauer 2016-06-25 06:18:18 UTC
I have the same bug with an Intel Xeon E5-1620 v3 CPU, NVIDIA Quadro K620 and 256 GB NVMe SSD. I was wondering why both PCIe cards were affected. On the NVMe card I've seen an XFS file system corruption from time to time.
"intel_idle.max_cstate=1" fixed the problem with openSUSE Leap 42.1 (4.1 kernel). C-States are broken with Haswell CPUs affecting the PCIe cards!

See:
http://www.intel.de/content/dam/www/public/us/en/documents/specification-updates/xeon-e5-v3-spec-update.pdf
http://www.intel.com/content/www/us/en/processors/xeon/xeon-e7-v3-spec-update.html
http://www.intel.com/content/www/us/en/processors/core/4th-gen-core-family-desktop-specification-update.html

HSX54: "A P-State or C-State Transition May Lead to a System Hang"
HSD38: "TSC May be Incorrect After a Deep C-State Exit"
HSD44: "Display May Flicker When Package C-States Are Enabled"
HSD50: "Throttling and Refresh Rate Maybe be Incorrect After Exiting Package C-State"
HSD60: "Processor May Not Enter Package C6 or Deeper C-states When PCIe* Links Are Disabled"
HSD77: "Graphics Processor Ratio And C-State Transitions May Cause a System Hang"
HSD104: "PCIe* Device’s SVID is Not Preserved Across The Package C7 C-State"
Comment 399 D. Hugh Redelmeier 2016-06-25 14:39:56 UTC
@Sebastian Parschauer #398: This is NOT the same bug.  Your system's processor is neither Baytrail nor Cherrytrail.  Please start a different bugzilla entry.

Certainly it would interest many people if c-states are broken in Haswell.

If you think that there is something relevant to bug 109051, add a comment here pointing to your new bugzilla bug.
Comment 400 Hong Zhang 2016-06-25 23:28:42 UTC
Hello everyone.
I just want to send some feedback.
I'm using a thin ITX N3150 board by SOYO and my OS is Arch Linux.
I ran into the same bug several days ago.
I changed my kernel to 4.4.13-1-lts (it's 4.4.14 now, but that should also work) and did nothing with the kernel parameters or the X configuration file; I have not encountered a screen freeze any more (for more than 1 hour).
I then changed my kernel to 4.7-rc4. The computer also works properly.

I have tried to add "intel_idle.max_cstate=1" by using efibootmgr:
efibootmgr -d /dev/sdb -p 1 -c -L "Arch Linux FallBack" -l /vmlinuz-linux -u "root=/dev/sdb2 rw initrd=/initramfs-linux.img i915.semaphores=1 intel_idle.max_cstate=1"
but it does not work; I ran into a screen freeze after just about 5 minutes. Maybe I did not add the parameter the right way?

sorry for my poor english- -
Comment 401 Paul Mansfield 2016-06-25 23:31:24 UTC
I would install rEFInd and then make that the primary boot target; it's much easier to configure rEFInd to boot linux with the desired parameters.
Comment 402 ladiko 2016-06-26 07:01:45 UTC
Use cat /proc/cmdline to check the parameters the currently running kernel was booted with; uname -r shows its version.
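
To take the guesswork out of whether the parameter really reached the kernel, the contents of /proc/cmdline can be parsed. A minimal Python sketch follows; `parse_cmdline` and `max_cstate` are names invented here, and quoted values containing spaces are deliberately not handled.

```python
def parse_cmdline(text):
    """Split a kernel command line into a {name: value} dict.

    Parameters without '=' (such as 'rw') map to None. Quoted values
    containing spaces are not handled; this is a sketch, not a full parser.
    """
    params = {}
    for token in text.split():
        name, sep, value = token.partition("=")
        params[name] = value if sep else None
    return params


def max_cstate(text):
    """Return intel_idle.max_cstate as an int, or None if it is not set."""
    value = parse_cmdline(text).get("intel_idle.max_cstate")
    return int(value) if value else None
```

Passing the text read from /proc/cmdline to `max_cstate` returns the limit the kernel was booted with, or `None` when the boot entry never delivered the parameter.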
Comment 403 cirrusuk 2016-06-29 23:25:52 UTC
System:    Host: hawker64 Kernel: 4.6.3-1-ARCH x86_64 (64 bit) Desktop: dwm 6.1 Distro: Arch Linux
Machine:   Mobo: ASUSTeK model: P6T SE v: Rev 1.xx Bios: American Megatrends v: 0908 date: 09/21/2010
CPU:       Quad core Intel Core i7 920 (-HT-MCP-) cache: 8192 KB 
           clock speeds: max: 2672 MHz 1: 2672 MHz 2: 2672 MHz 3: 2672 MHz 4: 2672 MHz 5: 2672 MHz 6: 2672 MHz
           7: 2672 MHz 8: 2672 MHz
Graphics:  Card: Advanced Micro Devices [AMD/ATI] RV770 [Radeon HD 4870]
           Display Server: X.Org 1.18.3 driver: radeon Resolution: 1920x1080@60.00hz, 1920x1080@60.00hz
           GLX Renderer: Gallium 0.4 on AMD RV770 (DRM 2.43.0, LLVM 3.8.0) GLX Version: 3.0 Mesa 11.2.2
Audio:     Card-1 Advanced Micro Devices [AMD/ATI] RV770 HDMI Audio [Radeon HD 4850/4870] driver: snd_hda_intel
           Card-2 Intel 82801JI (ICH10 Family) HD Audio Controller driver: snd_hda_intel
           Card-3 Hewlett-Packard driver: USB Audio
           Card-4 Logitech QuickCam Pro 9000 driver: USB Audio
           Sound: Advanced Linux Sound Architecture v: k4.6.3-1-ARCH
Network:   Card: Realtek RTL8111/8168/8411 PCI Express Gigabit Ethernet Controller driver: r8169
           IF: enp5s0 state: up speed: 100 Mbps duplex: full mac: 00:26:18:97:7b:40
Drives:    HDD Total Size: 1388.6GB (0.1% used) ID-1: /dev/sdc model: SAMSUNG_HM250HI size: 250.1GB
           ID-2: /dev/sdb model: Hitachi_HTS54164 size: 40.0GB ID-3: /dev/sda model: HDS728080PLA380 size: 82.3GB
           ID-4: USB /dev/sdd model: Cruzer_Blade size: 16.0GB ID-5: /dev/sde model: WDC_WD2500AAKS size: 250.1GB
           ID-6: /dev/sdf model: Hitachi_HDS72107 size: 750.2GB
Partition: ID-1: swap-1 size: 2.05GB used: 0.00GB (0%) fs: swap dev: /dev/sdf2
Sensors:   System Temperatures: cpu: 54.5C mobo: 48.0C gpu: 78.0
           Fan Speeds (in rpm): cpu: 2500 psu: 0 sys-1: 0 sys-2: 0
Info:      Processes: 207 Uptime: 44 min Memory: 1104.5/5962.6MB Client: Shell (zsh) inxi: 2.3.0 

I too have been experiencing these hard lockups since 4.1 on Arch Linux x86_64. I can go for 12 hours without a lockup, though sometimes they happen quicker. I don't want to use the kernel parameter, as I'm a tight-assed Scot whose electric bill is high enough ;)
The log is hard to get, but the output looks very similar to https://bugzilla.kernel.org/attachment.cgi?id=209581
I plan to revert to an older stable kernel or maybe LTS.
I hope to report back with relevant logs.
Thanks to all who are working on this.
Comment 404 cirrusuk 2016-06-29 23:35:34 UTC
OK, like others I do not run Bay Trail; however, I'm confident this is a similar kernel regression regardless of CPU codename. I will look around for a bug specific to my hardware, though.
Comment 405 cirrusuk 2016-06-29 23:37:09 UTC
OK, like others I do not run Bay Trail; however, I'm confident this is a similar kernel regression regardless of CPU codename. I will look around for a bug specific to my hardware, though.
System:    Host: hawker64 Kernel: 4.6.3-1-ARCH x86_64 (64 bit) Desktop: dwm 6.1 Distro: Arch Linux
Machine:   Mobo: ASUSTeK model: P6T SE v: Rev 1.xx Bios: American Megatrends v: 0908 date: 09/21/2010
CPU:       Quad core Intel Core i7 920 (-HT-MCP-) cache: 8192 KB 
           clock speeds: max: 2672 MHz 1: 2672 MHz 2: 2672 MHz 3: 2672 MHz 4: 2672 MHz 5: 2672 MHz 6: 2672 MHz
           7: 2672 MHz 8: 2672 MHz
Graphics:  Card: Advanced Micro Devices [AMD/ATI] RV770 [Radeon HD 4870]
           Display Server: X.Org 1.18.3 driver: radeon Resolution: 1920x1080@60.00hz, 1920x1080@60.00hz
           GLX Renderer: Gallium 0.4 on AMD RV770 (DRM 2.43.0, LLVM 3.8.0) GLX Version: 3.0 Mesa 11.2.2
Audio:     Card-1 Advanced Micro Devices [AMD/ATI] RV770 HDMI Audio [Radeon HD 4850/4870] driver: snd_hda_intel
           Card-2 Intel 82801JI (ICH10 Family) HD Audio Controller driver: snd_hda_intel
           Card-3 Hewlett-Packard driver: USB Audio
           Card-4 Logitech QuickCam Pro 9000 driver: USB Audio
           Sound: Advanced Linux Sound Architecture v: k4.6.3-1-ARCH
Network:   Card: Realtek RTL8111/8168/8411 PCI Express Gigabit Ethernet Controller driver: r8169
           IF: enp5s0 state: up speed: 100 Mbps duplex: full mac: 00:26:18:97:7b:40
Drives:    HDD Total Size: 1388.6GB (0.1% used) ID-1: /dev/sdc model: SAMSUNG_HM250HI size: 250.1GB
           ID-2: /dev/sdb model: Hitachi_HTS54164 size: 40.0GB ID-3: /dev/sda model: HDS728080PLA380 size: 82.3GB
           ID-4: USB /dev/sdd model: Cruzer_Blade size: 16.0GB ID-5: /dev/sde model: WDC_WD2500AAKS size: 250.1GB
           ID-6: /dev/sdf model: Hitachi_HDS72107 size: 750.2GB
Partition: ID-1: swap-1 size: 2.05GB used: 0.00GB (0%) fs: swap dev: /dev/sdf2
Sensors:   System Temperatures: cpu: 54.5C mobo: 48.0C gpu: 78.0
           Fan Speeds (in rpm): cpu: 2500 psu: 0 sys-1: 0 sys-2: 0
Info:      Processes: 207 Uptime: 44 min Memory: 1104.5/5962.6MB Client: Shell (zsh) inxi: 2.3.0 

I too have been experiencing these hard lockups since 4.1 on Arch Linux x86_64. I can go for 12 hours without a lockup, though sometimes they happen sooner. I don't want to use the kernel parameter, as I'm a tight-assed Scot whose electric bill is high enough ;)
The log is hard to get, but the output looks very similar to https://bugzilla.kernel.org/attachment.cgi?id=209581
I plan to revert to an older stable kernel or maybe LTS.
I hope to report back with relevant logs.
Thanks to all who are working on this.
Comment 406 carlos.valin 2016-07-01 01:49:50 UTC
Same problem on an Acer Aspire Switch 10.
Comment 407 Vladimir Jicha 2016-07-02 09:55:52 UTC
Unfortunately it seems to be clear now that as I expected this bug will never get fixed. I can only see more and more people posting here that they are affected too. But nobody even cared to change the bug state to critical from normal, confirmed from new or update the affected kernel versions up to 4.6 (and most likely any future).
Comment 408 muhaar 2016-07-02 10:14:38 UTC
Very disappointed by Intel's non-existent product support :S
Comment 409 Paul Mansfield 2016-07-02 13:59:32 UTC
I am fairly sure Intel never sold the Baytrail processor for the Linux platform except in a very limited capacity (the Compute Stick is the only one as far as I know); the Z37xx series was only sold commercially for Windows. I don't think any Chromebooks used it at all; they used the N28xx variants.
So, really, we can't expect much from Intel.
Comment 410 Dmytro Kyrychuk 2016-07-02 14:40:39 UTC
> Intel never sold the Baytrail process for the Linux platform

Despite that, Acer did (and still does) sell their Aspire E5-511 with Linpus (a Linux distribution), which I took as fair evidence that those laptops would be fine with Ubuntu as well. Apparently, I was wrong.
Comment 411 Maurizio 2016-07-03 16:48:19 UTC
Anyway, officially or unofficially, the problem has been greatly reduced in the latest kernel version.

I'm running 4.6.3 on a Celeron N2930 and I can get several days of uptime without any problems. Every now and then I experience a crash when streaming video, but it is far better than before, when the machine crashed while idle after a few hours.
Comment 412 Paul Mansfield 2016-07-03 16:59:23 UTC
I have found that 4.5.7 with the patch set from John Brodie on the Asus T100 Ubuntu Google+ group is very stable with cstate=1 and sdio wifi - I can achieve days of uptime.
See the files section linked from here: https://plus.google.com/communities/117853703024346186936
Comment 413 Vladimir Jicha 2016-07-04 19:48:24 UTC
My Shuttle XS35V4 was offered as Linux compatible. And that is not the only bay-trail computer with declared Linux support.
Comment 414 amjafuso 2016-07-07 08:12:32 UTC
I did test kernel 4.6.3 (cpu j1900), took about 3h to crash (chromium browser, no video). Setting cstate to 1 or 2 still fixes the problem.



@vladimir jicha

I also do have a shuttle xs35v4. Please set kernel parameter intel_idle.max_cstate

For grub:
- vi /etc/default/grub: GRUB_CMDLINE_LINUX_DEFAULT="quiet splash intel_idle.max_cstate=2"
- update-grub
- reboot

Absolutely no crashes anymore, power consumption is only 8W.
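
The grub steps above can be sketched as a small shell helper. This is only a minimal sketch: `add_kernel_param` is a hypothetical name, and it assumes the file contains a double-quoted `GRUB_CMDLINE_LINUX_DEFAULT="..."` line as on a typical Debian/Ubuntu install.

```shell
#!/bin/sh
# add_kernel_param FILE PARAM
# Hypothetical helper (not a grub tool): append PARAM to the
# GRUB_CMDLINE_LINUX_DEFAULT="..." line of FILE unless a parameter
# with the same name is already on that line.
add_kernel_param() {
    file=$1     # e.g. /etc/default/grub
    param=$2    # e.g. intel_idle.max_cstate=2
    key=${param%%=*}

    # Do nothing if the parameter name already appears on the line.
    if grep '^GRUB_CMDLINE_LINUX_DEFAULT=' "$file" | grep -qF "$key"; then
        echo "$key is already set in $file" >&2
        return 0
    fi

    # Keep a backup, then insert the parameter before the closing quote.
    cp "$file" "$file.bak"
    sed "s|^\(GRUB_CMDLINE_LINUX_DEFAULT=\"[^\"]*\)\"|\1 ${param}\"|" \
        "$file.bak" > "$file"
}

# Example (as root): add_kernel_param /etc/default/grub intel_idle.max_cstate=2
```

After editing the file, run update-grub and reboot as described above; `cat /proc/cmdline` then shows whether the parameter was actually passed to the kernel.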
Comment 415 Rush Hour Rambo 2016-07-08 02:00:22 UTC
I'm running an Inspiron 3551 laptop with a Pentium N3540. I've had this same bug since Ubuntu 15.10. I recently installed Linux Mint 18 MATE and no longer had the freezing caused by this bug. Yesterday I installed some updates, and the freezing has been back ever since. I am not sure whether the updates included kernel updates, but I still have two kernels on my PC: the older one, 4.4.0-21-generic, did not cause any freezing for over a week after installing Mint 18, while the newer one, 4.4.0-28-generic, does.
I have rolled back to 4.4.0-21-generic and tested it: I can watch YouTube videos at 720p and 60fps with no freezing, whereas with 4.4.0-28-generic YouTube videos cause the freezing.
4.4.0-21-generic seems to fix this issue for me. Has anyone else tried it?
Comment 416 Martin 2016-07-08 17:03:23 UTC
On 4.6.0, max_cstate=2 is not an option here. Will try 4.6.3 later.
Comment 417 Alejandro Morales Lepe 2016-07-08 17:36:43 UTC
Is there any way to check in the code what changed for Baytrail CPUs between kernel 3.16 and 3.17? Was there a patch specific to Baytrail that is causing the issue, or some patch for C-states? There should be an active effort to fix this bug, because it affects multiple machines with Ubuntu preinstalled, and Ubuntu is retiring support for kernel 3.16, so people will either be stuck with a very old kernel or experience freezes with 16.04, especially on machines with Ubuntu preinstalled like the Dell Inspiron 3551 Ubuntu Edition. This will really hurt the public image of Linux distros on consumer computers.
Comment 418 Joe Burmeister 2016-07-08 18:21:45 UTC
I'm fairly sure I have had this in 3.16 on my media machine. Or at least some other complete system freeze. I think it's just very rare under 3.16. So I'm not convinced the answer will fall out of bisection. :-(

Seems graphics related from what has been said. Unless it also happens on headless machines, in which case that is clearly not true. I'm guessing at an interaction between the power states of the CPU and the GPU: one changed in just the wrong place for the other. But that's loads of speculation I don't have time to dig into. I'm getting tempted to just bin the board when being stuck on 3.16 becomes an issue. Or use it headless, with another Pi as the media machine.
Comment 419 Alejandro Morales Lepe 2016-07-08 18:35:05 UTC
I have yet to experience a complete lock-up in 3.16. The lock-ups I have had happen when I fill my Inspiron 3551's RAM by running a lot of stuff; however, I am able to reboot the computer with some SysRq magic, while on newer kernels the lock-up prevents me from doing this... Could it be a different thing?

At least you can run headless :( I am suffering this problem in my daily driver and I pretty much need it for everything. I got this machine for the price and the idea that I would be getting a Linux ready computer, oh boy...

I am not an expert, but if there is some way I can help to debug this, anybody, please let me know.

(In reply to Joe Burmeister from comment #418)
> I'm fairly sure I have had this in 3.16 on my media machine. Or at least
> some other complete system freeze. I think it's just very rare under 3.16.
> So I'm not convinced the answer will fall out of bisection. :-(
> 
> Seams graphics related from what has been said. Unless it does happen on
> headless machines, in which case, that is clearly not true. Guessing
> interaction of power state of CPU vs GPU. One changed in just the wrong
> place for the other. But loads a speculation there I don't have time to dig
> into. I'm getting temped to just bin the board when being stuck 3.16 becomes
> an issue. Or use it as headless with another Pi media machine.
Comment 421 Alejandro Morales Lepe 2016-07-08 20:13:37 UTC
From what I am seeing, the option to set intel_idle.max_cstate=1 was added in kernel 3.17. Does it have any relation to the problem, or am I getting lost in my ignorance? Is there any problem with the default value? Where is that value used? Excuse me if this is not of much help; I am trying to make some sense of this, but since I am not familiar with kernel development, maybe I am just running in circles.

https://git.kernel.org/cgit/linux/kernel/git/stable/linux-stable.git/commit/?id=2e92c7ad8f269c2b5b7f2a4763675f55f00b75f5
Comment 422 Joe Burmeister 2016-07-08 20:25:51 UTC
It could be a different complete freeze. Without pouring time on it, I can't know.
There is no "the fix" as far as I know. The only workaround is to limit the C-state in the BIOS or via a kernel argument. Both do the same thing, and that sucks for power usage.

I'd love some good news too.
Comment 423 Andrew Clayton 2016-07-08 20:33:49 UTC
(In reply to Alejandro Morales Lepe from comment #421)

> https://git.kernel.org/cgit/linux/kernel/git/stable/linux-stable.git/commit/
> ?id=2e92c7ad8f269c2b5b7f2a4763675f55f00b75f5

That's just adding documentation. The intel_idle driver was added back in 2010 and Bay Trail support was added in 2015 by 718987d695adc991eb94501209fe5353136c8c16 ("intel_idle: support Bay Trail")

And possibly last touched by

d7ef76717322c8e2df7d4360b33faa9466cb1a0d ("intel_idle: Update support for Silvermont Core in Baytrail SOC")


IIRC J1900 is a Silvermont.
Comment 424 Paul Mansfield 2016-07-08 20:37:27 UTC
Yes, a J1900 is an Atom and has Silvermont cores. But so does a Baytrail

https://en.wikipedia.org/wiki/Silvermont
Comment 426 André Hoogendoorn 2016-07-13 03:01:06 UTC
I have read the PDF and, according to Intel, there is a C6 state hardware bug in the CPU models listed on pages 9 and 10
----
VLP52 EOI Transactions May Not be Sent if Software Enters Core C6 During an Interrupt Service Routine.

Problem:
If core C6 is entered after the start of an interrupt service routine but before a write
to the APIC EOI (End of Interrupt) register, and the core is woken up by an event
other than a fixed interrupt source the core may drop the EOI transaction the next
time APIC EOI register is written and further interrupts from the same or lower
priority level will be blocked.

Implication:
EOI transactions may be lost and interrupts may be blocked when core C6 is used
during interrupt service routines.

Workaround:
It is possible for the firmware to contain a workaround for this erratum.
----
Comment 427 Andrew Clayton 2016-07-14 01:10:06 UTC
(In reply to André Hoogendoorn from comment #426)
> I have read the pdf and according to Intel, there is a C6 state hardware bug
> in the CPU numbers as listed on page 9 and 10

Interesting. I have been running with intel_idle.max_cstate=5 (changed from 2, which was fine) under the Fedora 24 4.6.3 kernel on a J1900 CPU for 14+ hours now.

IIRC I would have had a lockup by now...
Comment 428 Wolfgang M. Reimer 2016-07-14 09:31:02 UTC
(In reply to Andrew Clayton from comment #427)
> (In reply to André Hoogendoorn from comment #426)
> > I have read the pdf and according to Intel, there is a C6 state hardware
> bug
> > in the CPU numbers as listed on page 9 and 10
> 
> Interesting. I have been running with intel_idle.max_cstate=5 (changed from
> 2, which was fine) under the Fedora 24 4.6.3 kernel on a J1900 CPU for 14+
> hours now.
> 
> IIRC I would have had a lockup by now...

According to "cpupower idle-info" a J1900 CPU has

Available idle states: POLL C1-BYT C6N-BYT C6S-BYT C7-BYT C7S-BYT

So running a J1900 CPU with intel_idle.max_cstate=5 is basically THE SAME AS running it with intel_idle.max_cstate=1, intel_idle.max_cstate=2, intel_idle.max_cstate=3, or intel_idle.max_cstate=4. If it ran stably with either of the latter settings it will also run stably with intel_idle.max_cstate=5.
Comment 429 Martin 2016-07-14 09:36:31 UTC
My J1900 reliably freezes with any intel_idle.max_cstate > 1 (kernel 4.6.0) but I see why you would expect otherwise.
Comment 430 Max Stegmeyer 2016-07-14 09:37:02 UTC
(In reply to Wolfgang M. Reimer from comment #428)
> (In reply to Andrew Clayton from comment #427)
> > (In reply to André Hoogendoorn from comment #426)
> > > I have read the pdf and according to Intel, there is a C6 state hardware
> bug
> > > in the CPU numbers as listed on page 9 and 10
> > 
> > Interesting. I have been running with intel_idle.max_cstate=5 (changed from
> > 2, which was fine) under the Fedora 24 4.6.3 kernel on a J1900 CPU for 14+
> > hours now.
> > 
> > IIRC I would have had a lockup by now...
> 
> According to "cpupower idle-info" a J1900 CPU has
> 
> Available idle states: POLL C1-BYT C6N-BYT C6S-BYT C7-BYT C7S-BYT
> 
> So running a J1900 CPU with intel_idle.max_cstate=5 is basically THE SAME AS
> running it with intel_idle.max_cstate=1, intel_idle.max_cstate=2,
> intel_idle.max_cstate=3, or intel_idle.max_cstate=4. If it ran stably with
> either of the latter settings it will also run stably with
> intel_idle.max_cstate=5.

But something must be different. I also use a J1900 mainboard and there is a difference in power consumption between running with max_cstate=1 and max_cstate=2.
For me, that's
max_cstate=1: 17.2W
max_cstate=2: 16.5W
no max_cstate: 15.9W
Comment 431 Vaidas Jablonskis 2016-07-14 14:08:21 UTC
Just found this bug report. I used to get freezes on my Dell XPS 15 9550 with Skylake Intel(R) Core(TM) i7-6700HQ CPU @ 2.60GHz up until the 4.7.0-0.rc7.git1.2.fc25.x86_64 kernel.

I don't have max_cstate set at the moment. Something has changed since rc7.git0 kernel build.

I am running fedora 24 with https://fedoraproject.org/wiki/RawhideKernelNodebug repo enabled.
Comment 432 D. Hugh Redelmeier 2016-07-14 14:46:44 UTC
@Vaidas Jablonskis #431: this bug report is about Baytrail CPUs.  Skylake is quite different.
Comment 433 Vaidas Jablonskis 2016-07-14 15:25:34 UTC
(In reply to D. Hugh Redelmeier from comment #432)
> @Vaidas Jablonskis #431: this bug report is about Baytrail CPUs.  Skylake is
> quite different.

Oops. My apologies for not reading the title.
Comment 434 Wolfgang M. Reimer 2016-07-14 18:09:11 UTC
Created attachment 223851 [details]
Disable all C6 states enable all C7 core states for Baytrail CPUs

Disable all C6 states and enable all C7 core states for Baytrail CPUs to verify whether erratum VLP52 is the root cause of this bug. Must be run as root.
Comment 435 Wolfgang M. Reimer 2016-07-14 18:12:53 UTC
Created attachment 223861 [details]
Shows all core states (C-states) + some related info as a formatted table

The intel_idle.max_cstate boot parameter refers to enumeration done by the linux kernel (number in column State) and not to the Intel notation of core states C0, C1, C2, C3, C6, C7, etc. Latency, Residency, and Time units are microseconds.
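
For readers who want to see the kernel's own enumeration without the attachment, here is a minimal sketch. It is not the attached script; `list_cstates` and the `CPUIDLE_ROOT` override are hypothetical names, and it only assumes the standard cpuidle sysfs files `name` and `disable` under `/sys/devices/system/cpu/cpuN/cpuidle/stateM/`.

```shell
#!/bin/sh
# list_cstates [CPU]
# Print "index name disabled=0|1" for each idle state of one CPU.
# The index is the number that intel_idle.max_cstate refers to.
# CPUIDLE_ROOT is a hypothetical override so the function can be
# pointed at a test directory instead of the real sysfs tree.
list_cstates() {
    cpu=${1:-cpu0}
    root=${CPUIDLE_ROOT:-/sys/devices/system/cpu}
    for dir in "$root/$cpu"/cpuidle/state*; do
        [ -r "$dir/name" ] || continue
        idx=${dir##*state}
        printf '%s %s disabled=%s\n' \
            "$idx" "$(cat "$dir/name")" "$(cat "$dir/disable")"
    done
}

# Example: list_cstates cpu0
```

No root privileges are needed for reading these files.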
Comment 436 Wolfgang M. Reimer 2016-07-14 18:14:31 UTC
(In reply to Martin from comment #429)
> My J1900 reliably freezes with any intel_idle.max_cstate > 1 (kernel 4.6.0)
> but I see why you would expect otherwise.

(In reply to Max Stegmeyer from comment #430)
> (In reply to Wolfgang M. Reimer from comment #428)
> > (In reply to Andrew Clayton from comment #427)
> > > (In reply to André Hoogendoorn from comment #426)
> > > > I have read the pdf and according to Intel, there is a C6 state
> hardware bug
> > > > in the CPU numbers as listed on page 9 and 10
> > > 
> > > Interesting. I have been running with intel_idle.max_cstate=5 (changed
> from
> > > 2, which was fine) under the Fedora 24 4.6.3 kernel on a J1900 CPU for
> 14+
> > > hours now.
> > > 
> > > IIRC I would have had a lockup by now...
> > 
> > According to "cpupower idle-info" a J1900 CPU has
> > 
> > Available idle states: POLL C1-BYT C6N-BYT C6S-BYT C7-BYT C7S-BYT
> > 
> > So running a J1900 CPU with intel_idle.max_cstate=5 is basically THE SAME
> AS
> > running it with intel_idle.max_cstate=1, intel_idle.max_cstate=2,
> > intel_idle.max_cstate=3, or intel_idle.max_cstate=4. If it ran stably with
> > either of the latter settings it will also run stably with
> > intel_idle.max_cstate=5.
> 
> But something must be different. I also use a J1900 mainboard and there is a
> difference in power consumption between running with max_cstate=1 and
> max_cstate=2.
> For me, that's
> max_cstate=1: 17.2W
> max_cstate=2: 16.5W
> no max_cstate: 15.9W

Ok, you are right and I found out, what the problem is. The Linux kernel enumerates the states for the J1900 as follows:

0 POLL
1 C1-BYT
2 C6N-BYT
3 C6S-BYT
4 C7-BYT
5 C7S-BYT

The parameter intel_idle.max_cstate refers to that enumeration and does _NOT_ conform to the Intel notation of the C-states (which confused me):

So "intel_idle.max_cstate=2" means POLL, C1-BYT, and C6N-BYT (the first of the intel C6 states) are enabled and all other states (C6S-BYT, C7-BYT, C7S-BYT) are disabled and _CANNOT_ be enabled after boot time.
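
The mapping just described can be written down as a tiny sketch. `states_for_max_cstate` is a hypothetical helper, and it only models values of 1 or higher (a value of 0 disables the intel_idle driver altogether, which this sketch does not cover).

```shell
#!/bin/sh
# states_for_max_cstate MAX NAME...
# Given intel_idle.max_cstate=MAX (MAX >= 1) and the state names in the
# kernel's enumeration order, print which states remain available.
states_for_max_cstate() {
    max=$1
    shift
    idx=0
    for name in "$@"; do
        if [ "$idx" -le "$max" ]; then
            echo "$idx $name enabled"
        else
            echo "$idx $name disabled"
        fi
        idx=$((idx + 1))
    done
}

# The J1900 enumeration from this comment, with max_cstate=2:
states_for_max_cstate 2 POLL C1-BYT C6N-BYT C6S-BYT C7-BYT C7S-BYT
```

With the J1900 list this marks indexes 0 to 2 as enabled and 3 to 5 as disabled, which matches the explanation above.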

Fortunately the /sys interface of the kernel allows fine-grained tweaking at run-time and one can turn the states off and on individually (if not disabled at boot time via intel_idle.max_cstate=<number>).

In order to investigate whether erratum VLP52 is the root cause for this kernel bug (109051) I attached two shell scripts to this bug.

The first (c6off+c7on.sh) will disable all intel C6 core states for Baytrail processors (C6N-BYT and C6S-BYT) + enable all C7 core states (C7-BYT and C7S-BYT).

The second script can be used to verify that the C6 states are disabled (column "Disabled" should show a "1" for the disabled states and the count for the columns "Time" and "Usage" should not change any longer for the disabled C6*-BYT states).

The "c6off+c7on.sh" script should be started at system boot and if erratum VLP52 is the root cause of this bug then Baytrail systems with the processors mentioned in https://bugzilla.kernel.org/show_bug.cgi?id=109051#c425 (J2850, J1850, J1750, N3510, N2810, N2805, N2910, N3520, N2920, N2820, N2806, N2815, J2900, J1900, J1800, N3530, N2930, N2830, N2807, N3540, N2940, N2840, N2808) should run stably again. Especially Baytrail based systems with low average load (e.g. tablets and notebooks) should consume considerably less power with enabled C7*-BYT states.
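
For illustration only, the idea behind disabling C6 and enabling C7 can be sketched as follows. This is not the attached c6off+c7on.sh; `set_states_by_name` and `CPUIDLE_ROOT` are hypothetical names, and the sketch relies only on the per-state sysfs `name` and `disable` files (writing 1 disables a state, 0 enables it).

```shell
#!/bin/sh
# set_states_by_name PATTERN VALUE
# For every CPU, write VALUE (1 = disable, 0 = enable) to the "disable"
# file of each idle state whose name matches the shell PATTERN.
# CPUIDLE_ROOT is a hypothetical override for testing; on a real
# system the writes need root.
set_states_by_name() {
    pattern=$1
    value=$2
    root=${CPUIDLE_ROOT:-/sys/devices/system/cpu}
    for dir in "$root"/cpu[0-9]*/cpuidle/state*; do
        [ -r "$dir/name" ] || continue
        name=$(cat "$dir/name")
        case $name in
            $pattern) echo "$value" > "$dir/disable" ;;
        esac
    done
}

# Example (as root): turn the C6 states off and the C7 states on.
# set_states_by_name 'C6*-BYT' 1
# set_states_by_name 'C7*-BYT' 0
```

The setting does not survive a reboot, so something like this would have to run at every boot, as noted above for the attached script.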

Please give feedback (stability, power consumption, etc.)!
Comment 437 Wolfgang M. Reimer 2016-07-14 19:28:11 UTC
Running my submitted scripts

https://bugzilla.kernel.org/attachment.cgi?id=223851
https://bugzilla.kernel.org/attachment.cgi?id=223861

on a J1900 system should produce a similar output:

$ sudo $HOME/bin/c6off+c7on.sh
DISABLED state C6N-BYT for cpu0.
DISABLED state C6S-BYT for cpu0.
DISABLED state C6N-BYT for cpu1.
DISABLED state C6S-BYT for cpu1.
DISABLED state C6N-BYT for cpu2.
DISABLED state C6S-BYT for cpu2.
DISABLED state C6N-BYT for cpu3.
DISABLED state C6S-BYT for cpu3.

$ $HOME/bin/cstateInfo.sh 
cpu0 State  Name     Disabled  Latency  Residency       Time  Usage
         0  POLL            0        0          0      77432    267
         1  C1-BYT          0        1          1   13849382  21986
         2  C6N-BYT         1      300        275     891290   1491
         3  C6S-BYT         1      500        560    1340774   1078
         4  C7-BYT          0     1200       4000    3190476    380
         5  C7S-BYT         0    10000      20000  255687727   1025
cpu1 State  Name     Disabled  Latency  Residency       Time  Usage
         0  POLL            0        0          0      10256    160
         1  C1-BYT          0        1          1   12134067  10470
         2  C6N-BYT         1      300        275     897517    514
         3  C6S-BYT         1      500        560    2742364    688
         4  C7-BYT          0     1200       4000    3223395    312
         5  C7S-BYT         0    10000      20000  256625325    886
cpu2 State  Name     Disabled  Latency  Residency       Time  Usage
         0  POLL            0        0          0      58350    205
         1  C1-BYT          0        1          1   14738863  26297
         2  C6N-BYT         1      300        275     974127   1195
         3  C6S-BYT         1      500        560    2688385    879
         4  C7-BYT          0     1200       4000   25533926   1768
         5  C7S-BYT         0    10000      20000  231166600   1894
cpu3 State  Name     Disabled  Latency  Residency       Time  Usage
         0  POLL            0        0          0       9249    232
         1  C1-BYT          0        1          1   14294725  24977
         2  C6N-BYT         1      300        275    1678518   2863
         3  C6S-BYT         1      500        560    2531238   1394
         4  C7-BYT          0     1200       4000    7240420    693
         5  C7S-BYT         0    10000      20000  250630919   2281

Running cstateInfo.sh again should show no changes in the lines for the disabled C6 states (C6N-BYT and C6S-BYT):

$ $HOME/bin/cstateInfo.sh 
cpu0 State  Name     Disabled  Latency  Residency        Time  Usage
         0  POLL            0        0          0       77497    277
         1  C1-BYT          0        1          1    17466806  23676
         2  C6N-BYT         1      300        275      891290   1491
         3  C6S-BYT         1      500        560     1340774   1078
         4  C7-BYT          0     1200       4000     4231024    429
         5  C7S-BYT         0    10000      20000  1113610759   3134
cpu1 State  Name     Disabled  Latency  Residency        Time  Usage
         0  POLL            0        0          0       10292    168
         1  C1-BYT          0        1          1    20242967  12191
         2  C6N-BYT         1      300        275      897517    514
         3  C6S-BYT         1      500        560     2742364    688
         4  C7-BYT          0     1200       4000     4398584    346
         5  C7S-BYT         0    10000      20000  1109869872   2675
cpu2 State  Name     Disabled  Latency  Residency        Time  Usage
         0  POLL            0        0          0       58662    277
         1  C1-BYT          0        1          1    24698671  33431
         2  C6N-BYT         1      300        275      974127   1195
         3  C6S-BYT         1      500        560     2688385    879
         4  C7-BYT          0     1200       4000    94027530   3711
         5  C7S-BYT         0    10000      20000  1014763708   6407
cpu3 State  Name     Disabled  Latency  Residency        Time  Usage
         0  POLL            0        0          0        9448    277
         1  C1-BYT          0        1          1    29230274  30522
         2  C6N-BYT         1      300        275     1678518   2863
         3  C6S-BYT         1      500        560     2531238   1394
         4  C7-BYT          0     1200       4000    14492087   1315
         5  C7S-BYT         0    10000      20000  1090072439   7878

As one can see in my case most of the core's idle time is now spent in state C7S-BYT.
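
The manual comparison of the two tables above (the counters of the disabled states must not move) can also be automated. The following is a minimal sketch, not part of the attached scripts; `snapshot_disabled`, `disabled_states_idle` and `CPUIDLE_ROOT` are hypothetical names, and only the sysfs files `name`, `disable` and `usage` are assumed.

```shell
#!/bin/sh
# snapshot_disabled
# Print "cpu state name usage" for every idle state marked as disabled.
# CPUIDLE_ROOT is a hypothetical override for testing.
snapshot_disabled() {
    root=${CPUIDLE_ROOT:-/sys/devices/system/cpu}
    for dir in "$root"/cpu[0-9]*/cpuidle/state*; do
        [ -r "$dir/disable" ] || continue
        [ "$(cat "$dir/disable")" = 1 ] || continue
        cpu=${dir#"$root"/}
        cpu=${cpu%%/*}
        echo "$cpu ${dir##*/} $(cat "$dir/name") $(cat "$dir/usage")"
    done
}

# disabled_states_idle [SECONDS]
# Take two snapshots SECONDS apart (default 60) and succeed only if the
# usage counters of the disabled states did not change in between.
disabled_states_idle() {
    interval=${1:-60}
    before=$(snapshot_disabled)
    sleep "$interval"
    after=$(snapshot_disabled)
    [ "$before" = "$after" ]
}

# Example:
# disabled_states_idle 60 && echo "disabled states are no longer entered"
```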
Comment 438 Andy Furniss 2016-07-14 23:38:58 UTC
Nice script, FWIW, maybe by luck, but it seems being headless on J1900 helps a lot. I would surely lock with an unpatched kernel + graphics.

Note lack of i915 interrupts (those shown were there at boot).

So 99 days (would be longer but had to have power off) vanilla 4.1.18. 

asr[~]$ sh cstateInfo.sh 
cpu0 State  Name     Disabled  Latency  Residency           Time       Usage
         0  POLL            0        0          0      352565556      538707
         1  C1-BYT          0        1          1   130181110251   755499147
         2  C6N-BYT         0      300        275   168721715688   321645308
         3  C6S-BYT         0      500        560  2679566473195  1081712423
         4  C7-BYT          0     1200       4000  5201809523872   677055949
         5  C7S-BYT         0    10000      20000   232672548010     6063953
cpu1 State  Name     Disabled  Latency  Residency           Time       Usage
         0  POLL            0        0          0       66553174      100721
         1  C1-BYT          0        1          1    21321194555    95022167
         2  C6N-BYT         0      300        275    59708872499    80912844
         3  C6S-BYT         0      500        560  1699545542740   616884568
         4  C7-BYT          0     1200       4000  6157806454862   674802503
         5  C7S-BYT         0    10000      20000   528940757115    15010441
cpu2 State  Name     Disabled  Latency  Residency           Time       Usage
         0  POLL            0        0          0       52182600       54992
         1  C1-BYT          0        1          1    11577684031    44781333
         2  C6N-BYT         0      300        275    30691974207    38857448
         3  C6S-BYT         0      500        560   926619750261   332837818
         4  C7-BYT          0     1200       4000  5605371769375   533458885
         5  C7S-BYT         0    10000      20000  1938187552261    60665722
cpu3 State  Name     Disabled  Latency  Residency           Time       Usage
         0  POLL            0        0          0       87403241       51053
         1  C1-BYT          0        1          1    10016724016    38039691
         2  C6N-BYT         0      300        275    28416307863    35148851
         3  C6S-BYT         0      500        560   827491037749   293247527
         4  C7-BYT          0     1200       4000  5475692994922   503244246
         5  C7S-BYT         0    10000      20000  2176064852237    65760515
asr[~]$ uptime 
 00:27:42 up 99 days, 11:27,  1 user,  load average: 0.01, 0.02, 0.05
asr[~]$ uname -a
Linux asr 4.1.18 #1 SMP Mon Feb 22 23:38:21 GMT 2016 x86_64 GNU/Linux
asr[~]$ cat /proc/interrupts 
            CPU0       CPU1       CPU2       CPU3       
   0:         40          0          0          0   IO-APIC-edge      timer
   1:          3          0          0          0   IO-APIC-edge      i8042
   7:          1          0          0          0   IO-APIC-edge    
   8:          2          0          0          0   IO-APIC-fasteoi   rtc0
   9:          0          0          0          0   IO-APIC-fasteoi   acpi
  12:          4          0          0          0   IO-APIC-edge      i8042
  23:   97932462          0          0          0   IO-APIC   23-fasteoi   ehci_hcd:usb1
  87:         38          0          0          0   PCI-MSI-edge      i915
  88:   12913831          0          0          0   PCI-MSI-edge      0000:04:00.0
  89: 1013520070          0          0          0   PCI-MSI-edge      eth0
 NMI:          0          0          0          0   Non-maskable interrupts
 LOC: 1613399563 1431968881  931951991  935213869   Local timer interrupts
 SPU:          0          0          0          0   Spurious interrupts
 PMI:          0          0          0          0   Performance monitoring interrupts
 IWI:          1          0          0          0   IRQ work interrupts
 RTR:          0          0          0          0   APIC ICR read retries
 RES:   26307754   50297144   12568066   12293478   Rescheduling interrupts
 CAL:       2124       1472    1663888    1463719   Function call interrupts
 TLB:      25130      19257      11500      11783   TLB shootdowns
 TRM:          0          0          0          0   Thermal event interrupts
 THR:          0          0          0          0   Threshold APIC interrupts
 MCE:          0          0          0          0   Machine check exceptions
 MCP:      28651      28651      28651      28651   Machine check polls
 ERR:          1
 MIS:          0
Comment 439 ladiko 2016-07-15 06:02:43 UTC
I have a machine which was taken out of service because of the freezes and another issue where USB devices randomly disappear on this platform. Later on I used it as a headless Asterisk server and never had a single freeze with Ubuntu 14.04 and kernel 3.16, 3.19, 4.1 or 4.4. So without a running X server, it seems to work without freezes.
Comment 440 Maurizio 2016-07-18 08:12:37 UTC
Guys, posting this again; not really sure if this helps, but my system has been up for 7 days without a crash and, of course, with NO max_cstate parameter set. I'm running Arch Linux with kernel 4.6.3.

My CPU is a Celeron N2930... I have X constantly running as the PC runs kodi by default. 

processor       : 2
vendor_id       : GenuineIntel
cpu family      : 6
model           : 55
model name      : Intel(R) Celeron(R) CPU  N2930  @ 1.83GHz
stepping        : 8
microcode       : 0x829


Linux zotac 4.6.3-1-ARCH #1 SMP PREEMPT Fri Jun 24 21:19:13 CEST 2016 x86_64 GNU/Linux

10:06:41 up 7 days, 13:34,  2 users,  load average: 0,08, 0,06, 0,01

[    0.000000] Linux version 4.6.3-1-ARCH (builduser@tobias) (gcc version 6.1.1 20160602 (GCC) ) #1 SMP PREEMPT Fri Jun 24 21:19:13 CEST 2016
[    0.000000] Command line: initrd=\initramfs-linux.img root=/dev/sda2 rw
[    0.000000] x86/fpu: Legacy x87 FPU detected.
[    0.000000] x86/fpu: Using 'eager' FPU context switches.
Comment 441 Kemal Ilgar Eroğlu 2016-07-18 20:30:41 UTC
Hi all,

I've been following this bug for a long time as my Bay Trail tablet HP Pavilion X2 with an Atom Z3736F kept freezing within 1-2 hours after booting. I've tried all major kernel versions since 4.1. I must mention that they all included the following mmc patches:

https://github.com/hadess/rtl8723bs

They also included the intel patch suggested at Debian's wiki:

https://wiki.debian.org/InstallingDebianOn/HP/Pavilion%20x2%2010%20%282015%20model%29/Jessie?action=AttachFile&do=view&target=intel_display.patch

Other than those, I tried various patches I found around hoping to cure the freezes. Even max_cstate=0 did not help. Finally, with 4.6.3 + Mika Kuoppala's 3 patches the situation got somewhat better but I never exceeded 4 hours without a freeze. Then I came across Daniel Bilik's patches elsewhere (for some reason I had overlooked his posts on this page!).

With his patches applied, I made several reboots, also playing around with Wolfgang's scripts and so far I never had a regular freeze[1]. Now my tablet's uptime has reached 24 hours for the first time ever, being booted without any max_cstate arguments and the C6 states being active. It's perhaps too soon to declare this a success but apparently Daniel's addition to Mika's patches has made a huge difference here.

[1] What I mean is this: Without max_cstate=0 (or 1), the tablet can freeze during boot. Most of the freezes are when the disks are mounted/fsck'ed and others when the drm framebuffer is initialized. And occasionally the screen goes blank when drm takes over, but I can recover with magic SysRq. I don't know if any of these problems might be due to other factors than the Intel Bay Trail bug. But once it gets past the booting stage, (with these latest patches) it seems to survive the rest.
Comment 442 bzq7xy5gpj 2016-07-21 07:46:49 UTC
I'm using a Bay Trail NUC (DN2820FYKH), but I don't remember ever encountering this bug [109051]. I'm posting here just to let you know that, for whatever reason, this NUC seems not to be affected.

Using the latest Arch (through Antergos), here is some info:

I booted into the desktop and didn't do anything else (or how should I reproduce this bug?).

$ uname -a
Linux *** 4.6.4-1-ARCH #1 SMP PREEMPT Mon Jul 11 19:12:32 CEST 2016 x86_64 GNU/Linux

$ cat /proc/cpuinfo 
processor	: 0
vendor_id	: GenuineIntel
cpu family	: 6
model		: 55
model name	: Intel(R) Celeron(R) CPU  N2820  @ 2.13GHz
stepping	: 3
microcode	: 0x324
cpu MHz		: 533.116
cache size	: 1024 KB
...

$ ./cstateInfo.sh 
cpu0 State  Name     Disabled  Latency  Residency        Time   Usage
         0  POLL            0        0          0    12418075     717
         1  C1-BYT          0        1          1    93279269   79577
         2  C6N-BYT         0      300        275    20566295   34104
         3  C6S-BYT         0      500        560   949015145  355284
         4  C7-BYT          0     1200       4000  8810179482  634227
         5  C7S-BYT         0    10000      20000  8042123989   99746
cpu1 State  Name     Disabled  Latency  Residency        Time   Usage
         0  POLL            0        0          0     6130914     716
         1  C1-BYT          0        1          1   391648635   73352
         2  C6N-BYT         0      300        275    17716302   30140
         3  C6S-BYT         0      500        560   846101843  326140
         4  C7-BYT          0     1200       4000  8388706626  596330
         5  C7S-BYT         0    10000      20000  8278801097   97725
$ uptime
 14:29:40 up  5:04,  2 users,  load average: 0,17, 0,07, 0,01
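
For anyone who wants to collect the same per-state numbers without the attached script, here is a minimal sketch in Python. It is not cstateInfo.sh itself; it only assumes the standard cpuidle sysfs layout (/sys/devices/system/cpu/cpuN/cpuidle/stateM/ with the files name, disable, latency, residency, time and usage). The function names are made up for illustration.

```python
from pathlib import Path

# Per-state attribute files in the Linux cpuidle sysfs interface.
FIELDS = ("name", "disable", "latency", "residency", "time", "usage")


def state_index(state_dir):
    """Numeric index of a directory named like 'state3'."""
    return int(state_dir.name[len("state"):])


def read_states(cpu_dir):
    """Return one dict per idle state found under <cpu_dir>/cpuidle."""
    state_dirs = sorted(Path(cpu_dir, "cpuidle").glob("state[0-9]*"),
                        key=state_index)
    rows = []
    for d in state_dirs:
        row = {"state": state_index(d)}
        for field in FIELDS:
            f = d / field
            # Some kernels lack individual attributes; show "-" for those.
            row[field] = f.read_text().strip() if f.is_file() else "-"
        rows.append(row)
    return rows


def format_states(cpu_name, rows):
    """Render the rows as a simple plain-text table."""
    lines = [f"{cpu_name} State  Name  Disabled  Latency  Residency  Time  Usage"]
    for r in rows:
        lines.append("  ".join([str(r["state"])] + [r[f] for f in FIELDS]))
    return "\n".join(lines)
```

Something like print(format_states("cpu0", read_states("/sys/devices/system/cpu/cpu0"))) should then give a table comparable to the one above, although the column alignment will differ.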
Comment 443 ichudov 2016-08-04 16:02:41 UTC
I have a Shuttle XS35V4 with an Intel Celeron and Z36xxx/Z37xxx Series Graphics. It would hang rather quickly when graphical capabilities were used. Watching YouTube or a video would hang it in minutes.

Ever since I set intel_idle.max_cstate=1, I have not been able to reproduce any hangs. Despite continuously playing Idiocracy in a while loop and playing a stream of YouTube videos at the same time, the computer stays up for days.
Comment 444 Rick Lee 2016-08-07 17:18:21 UTC
Fascinating 2 hours reading 443 comments above ^^^

Dell Inspiron 17R SE 7720. Intel i7 3630QM. Nvidia GT650M. Observations based on Youtube streaming and 10 chrome tabs.

Had been running 3.13.092 / Ubuntu 14.04.1 for a month. Conky monitor heat 50C. CPU frequency bounces between 1200 and 2400 MHz. 8 CPUs run around 15%. No real problems other than Turbo Boost appearing inactive.

Yesterday: upgraded to 4.4.0 / Ubuntu 16.04; closing the lid doesn't suspend.
Updated to 4.6.3 and patched the systemd config to make suspend work.
Conky monitor heat now 70C and occasional 5-10 second keyboard lag. CPU scaling now goes over the 3200 MHz turbo boost limit, but is usually around 2000 MHz. 8 CPUs now running at 7% with more even balancing.

When wifi gets patchy CPU really races.
Also have smartphone plugged into USB powered hub.
Also have external TV via HDMI.
Will try Yakety Yak (Ubuntu 16.10) soon because sound switches to laptop from TV during suspend and PulseAudio 9 (in Yakety Yak) fixes this PulseAudio 8 "undocumented feature" (in Xenial X-thingy).

No system lock ups but running 20C hotter and occasional keyboard freezes from 5-10 seconds are concerning.

HTH. Please don't flame me for not saying BayTrail.
Comment 445 Rick Lee 2016-08-07 17:50:09 UTC
Like a few other posters I spoke too soon. As I was writing my last post, YouTube was auto-running at 144p. Under 4.6.3, with YouTube at 1080p, the numbers are: heat 80C, average 3000 MHz, 8 CPUs at 18% utilization (average estimated by eye). It took many months studying udev and now I fear it will be the same with systemd.
Comment 446 amjafuso 2016-08-08 13:57:29 UTC
Tried new Kernel 4.7.0 and removed intel_idle.max_cstate=1 (cpu J1900).

Crashed after a few hours. Still not solved.
Comment 447 Maciej Hrebien 2016-08-08 16:15:19 UTC
Caught today:

Aug  8 06:07:44 HP-Mini kernel: [   10.104246] ------------[ cut here ]------------
Aug  8 06:07:44 HP-Mini kernel: [   10.104266] WARNING: CPU: 0 PID: 21 at /build/linux-7z1rSb/linux-3.16.7-ckt25/include/linux/kref.h:47 kobject_get+0x3a/0x50()
Aug  8 06:07:44 HP-Mini kernel: [   10.104270] Modules linked in: acpi_cpufreq(+) processor fuse autofs4 ext4 crc16 mbcache jbd2 ums_realtek sg sd_mod crc_t10dif crct10dif_generic crct10dif_common usb_storage ahci libahci ehci_pci uhci_hcd ehci_hcd libata psmouse scsi_mod usbcore usb_common r8169 mii fan thermal thermal_sys
Aug  8 06:07:44 HP-Mini kernel: [   10.104319] CPU: 0 PID: 21 Comm: kworker/0:1 Not tainted 3.16.0-4-amd64 #1 Debian 3.16.7-ckt25-2+deb8u3
Aug  8 06:07:44 HP-Mini kernel: [   10.104323] Hardware name: Hewlett-Packard HP Mini 210-3000/3594, BIOS F.13 11/10/2011
Aug  8 06:07:44 HP-Mini kernel: [   10.104332] Workqueue: kacpi_notify acpi_os_execute_deferred
Aug  8 06:07:44 HP-Mini kernel: [   10.104336]  0000000000000000 ffffffff8150e08f 0000000000000000 0000000000000009
Aug  8 06:07:44 HP-Mini kernel: [   10.104343]  ffffffff81067777 ffff880036961c00 0000000000000202 0000000000000003
Aug  8 06:07:44 HP-Mini kernel: [   10.104349]  0000000000000003 ffff880036e422f0 ffffffff812acbfa ffff880036961d28
Aug  8 06:07:44 HP-Mini kernel: [   10.104355] Call Trace:
Aug  8 06:07:44 HP-Mini kernel: [   10.104367]  [<ffffffff8150e08f>] ? dump_stack+0x5d/0x78
Aug  8 06:07:44 HP-Mini kernel: [   10.104376]  [<ffffffff81067777>] ? warn_slowpath_common+0x77/0x90
Aug  8 06:07:44 HP-Mini kernel: [   10.104383]  [<ffffffff812acbfa>] ? kobject_get+0x3a/0x50
Aug  8 06:07:44 HP-Mini kernel: [   10.104391]  [<ffffffff813d94f0>] ? cpufreq_cpu_get+0x70/0xc0
Aug  8 06:07:44 HP-Mini kernel: [   10.104398]  [<ffffffff813d9f2a>] ? cpufreq_update_policy+0x1a/0x1d0
Aug  8 06:07:44 HP-Mini kernel: [   10.104406]  [<ffffffff813da0e0>] ? cpufreq_update_policy+0x1d0/0x1d0
Aug  8 06:07:44 HP-Mini kernel: [   10.104421]  [<ffffffffa018b566>] ? cpufreq_set_cur_state.part.3+0x83/0x8a [processor]
Aug  8 06:07:44 HP-Mini kernel: [   10.104430]  [<ffffffffa018b666>] ? processor_set_cur_state+0x97/0xd1 [processor]
Aug  8 06:07:44 HP-Mini kernel: [   10.104444]  [<ffffffffa0000e05>] ? thermal_cdev_update+0xa5/0x110 [thermal_sys]
Aug  8 06:07:44 HP-Mini kernel: [   10.104453]  [<ffffffffa0003729>] ? step_wise_throttle+0x49/0x80 [thermal_sys]
Aug  8 06:07:44 HP-Mini kernel: [   10.104462]  [<ffffffffa000161c>] ? handle_thermal_trip+0x4c/0x150 [thermal_sys]
Aug  8 06:07:44 HP-Mini kernel: [   10.104471]  [<ffffffffa000179d>] ? thermal_zone_device_update+0x7d/0xd0 [thermal_sys]
Aug  8 06:07:44 HP-Mini kernel: [   10.104479]  [<ffffffff813319a1>] ? acpi_ev_notify_dispatch+0x3c/0x51
Aug  8 06:07:44 HP-Mini kernel: [   10.104485]  [<ffffffff8131e457>] ? acpi_os_execute_deferred+0x10/0x1a
Aug  8 06:07:44 HP-Mini kernel: [   10.104492]  [<ffffffff81081742>] ? process_one_work+0x172/0x420
Aug  8 06:07:44 HP-Mini kernel: [   10.104499]  [<ffffffff81081dd3>] ? worker_thread+0x113/0x4f0
Aug  8 06:07:44 HP-Mini kernel: [   10.104505]  [<ffffffff815105c1>] ? __schedule+0x2b1/0x700
Aug  8 06:07:44 HP-Mini kernel: [   10.104511]  [<ffffffff81081cc0>] ? rescuer_thread+0x2d0/0x2d0
Aug  8 06:07:44 HP-Mini kernel: [   10.104519]  [<ffffffff8108800d>] ? kthread+0xbd/0xe0
Aug  8 06:07:44 HP-Mini kernel: [   10.104526]  [<ffffffff81087f50>] ? kthread_create_on_node+0x180/0x180
Aug  8 06:07:44 HP-Mini kernel: [   10.104533]  [<ffffffff81514158>] ? ret_from_fork+0x58/0x90
Aug  8 06:07:44 HP-Mini kernel: [   10.104540]  [<ffffffff81087f50>] ? kthread_create_on_node+0x180/0x180
Aug  8 06:07:44 HP-Mini kernel: [   10.104544] ---[ end trace 6a04776659b650d3 ]---

and:

Aug  8 06:10:14 HP-Mini kernel: [  169.942572] general protection fault: 0000 [#1] SMP 
Aug  8 06:10:14 HP-Mini kernel: [  169.942761] Modules linked in: bnep ctr ccm nfsd auth_rpcgss oid_registry nfs_acl nfs lockd fscache sunrpc arc4 snd_hda_codec_idt ath9k ath9k_common snd_hda_codec_generic ath9k_hw uvcvideo videobuf2_vmalloc videobuf2_memops videobuf2_core v4l2_common coretemp videodev snd_hda_intel media ath3k btusb snd_hda_controller kvm ath hp_wmi snd_hda_codec bluetooth mac80211 i915 iTCO_wdt drm_kms_helper cfg80211 drm 6lowpan_iphc iTCO_vendor_support sparse_keymap snd_hwdep snd_pcm rfkill ac shpchp wmi i2c_i801 i2c_algo_bit joydev evdev snd_timer serio_raw pcspkr lpc_ich mfd_core i2c_core snd video battery soundcore button acpi_cpufreq processor fuse autofs4 ext4 crc16 mbcache jbd2 ums_realtek sg sd_mod crc_t10dif crct10dif_generic crct10dif_common usb_storage ahci libahci ehci_pci uhci_hcd ehci_hcd libata psmouse scsi_mod usbcore usb_common r8169 mii fan thermal thermal_sys
Aug  8 06:10:14 HP-Mini kernel: [  169.944026] CPU: 2 PID: 836 Comm: Xorg Tainted: G        W     3.16.0-4-amd64 #1 Debian 3.16.7-ckt25-2+deb8u3
Aug  8 06:10:14 HP-Mini kernel: [  169.944026] Hardware name: Hewlett-Packard HP Mini 210-3000/3594, BIOS F.13 11/10/2011
Aug  8 06:10:14 HP-Mini kernel: [  169.944026] task: ffff88007b236d20 ti: ffff880079294000 task.ti: ffff880079294000
Aug  8 06:10:14 HP-Mini kernel: [  169.944026] RIP: 0010:[<ffffffff812bada0>]  [<ffffffff812bada0>] sg_next+0x0/0x30
Aug  8 06:10:14 HP-Mini kernel: [  169.944026] RSP: 0018:ffff880079297b80  EFLAGS: 00010202
Aug  8 06:10:14 HP-Mini kernel: [  169.944026] RAX: ea00011d51829182 RBX: 0000000000000001 RCX: ffff880036b38880
Aug  8 06:10:14 HP-Mini kernel: [  169.944026] RDX: ffff880036b38700 RSI: 0000000000000000 RDI: ea00011d51829182
Aug  8 06:10:14 HP-Mini kernel: [  169.944026] RBP: 000000000000ffff R08: 0000000007637000 R09: 0000000000000000
Aug  8 06:10:14 HP-Mini kernel: [  169.944026] R10: 0000000007800000 R11: 0000000000000000 R12: ea00011d51829182
Aug  8 06:10:14 HP-Mini kernel: [  169.944026] R13: ffff88007c7ee098 R14: ffffffff8181f660 R15: ffff8800628d1900
Aug  8 06:10:14 HP-Mini kernel: [  169.944026] FS:  00007f4281a3c980(0000) GS:ffff88007f300000(0000) knlGS:0000000000000000
Aug  8 06:10:14 HP-Mini kernel: [  169.944026] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
Aug  8 06:10:14 HP-Mini kernel: [  169.944026] CR2: 00007f428157e000 CR3: 000000007b2ff000 CR4: 00000000000007e0
Aug  8 06:10:14 HP-Mini kernel: [  169.944026] Stack:
Aug  8 06:10:14 HP-Mini kernel: [  169.944026]  ffffffffa03b7b7b ffff88007c1b7a08 0000000000001000 0000000000001000
Aug  8 06:10:14 HP-Mini kernel: [  169.944026]  ffff880079e0b800 0000000000000000 ffffffffa03bdf8f 0000000020000000
Aug  8 06:10:14 HP-Mini kernel: [  169.944026]  0000000000000000 ffff880000000000 0000000000000000 ffff88007c1b0000
Aug  8 06:10:14 HP-Mini kernel: [  169.944026] Call Trace:
Aug  8 06:10:14 HP-Mini kernel: [  169.944026]  [<ffffffffa03b7b7b>] ? i915_gem_gtt_prepare_object+0x6b/0xb0 [i915]
Aug  8 06:10:14 HP-Mini kernel: [  169.944026]  [<ffffffffa03bdf8f>] ? i915_gem_object_pin+0x57f/0x780 [i915]
Aug  8 06:10:14 HP-Mini kernel: [  169.944026]  [<ffffffffa0421559>] ? i915_gem_execbuffer_reserve_vma.isra.16+0x95/0x11a [i915]
Aug  8 06:10:14 HP-Mini kernel: [  169.944026]  [<ffffffffa042182a>] ? i915_gem_execbuffer_reserve+0x24c/0x2dc [i915]
Aug  8 06:10:14 HP-Mini kernel: [  169.944026]  [<ffffffffa03b358d>] ? i915_gem_do_execbuffer.isra.24+0x89d/0x13f0 [i915]
Aug  8 06:10:14 HP-Mini kernel: [  169.944026]  [<ffffffffa03bcc8b>] ? i915_gem_object_put_fence+0x1b/0xc0 [i915]
Aug  8 06:10:14 HP-Mini kernel: [  169.944026]  [<ffffffffa03b459f>] ? i915_gem_execbuffer2+0xaf/0x2b0 [i915]
Aug  8 06:10:14 HP-Mini kernel: [  169.944026]  [<ffffffffa02db8a7>] ? drm_ioctl+0x1c7/0x5b0 [drm]
Aug  8 06:10:14 HP-Mini kernel: [  169.944026]  [<ffffffff811be12e>] ? dput+0x9e/0x170
Aug  8 06:10:14 HP-Mini kernel: [  169.944026]  [<ffffffff811ba9af>] ? do_vfs_ioctl+0x2cf/0x4b0
Aug  8 06:10:14 HP-Mini kernel: [  169.944026]  [<ffffffff81085261>] ? task_work_run+0x91/0xb0
Aug  8 06:10:14 HP-Mini kernel: [  169.944026]  [<ffffffff811bac11>] ? SyS_ioctl+0x81/0xa0
Aug  8 06:10:14 HP-Mini kernel: [  169.944026]  [<ffffffff815144ca>] ? int_signal+0x12/0x17
Aug  8 06:10:14 HP-Mini kernel: [  169.944026]  [<ffffffff8151420d>] ? system_call_fast_compare_end+0x10/0x15
Aug  8 06:10:14 HP-Mini kernel: [  169.944026] Code: 27 fa ff ff 0f 1f 80 00 00 00 00 c7 47 10 00 00 00 00 89 57 0c 48 89 37 89 4f 08 c3 66 66 66 66 66 66 2e 0f 1f 84 00 00 00 00 00 <f6> 07 02 75 13 48 8b 57 20 48 8d 47 20 f6 c2 01 75 09 f3 c3 0f 
Aug  8 06:10:14 HP-Mini kernel: [  169.944026]  RSP <ffff880079297b80>
Aug  8 06:10:14 HP-Mini kernel: [  170.030137] ---[ end trace 6a04776659b650d4 ]---

and also:

Aug  8 06:07:44 HP-Mini kernel: [   11.789183] ACPI Error: Field [D128] at 1040 exceeds Buffer [NULL] size 160 (bits) (20140424/dsopcode-236)
Aug  8 06:07:44 HP-Mini kernel: [   11.789203] ACPI Error: Method parse/execution failed [\_SB_.WMID.HWMC] (Node ffff88007e852f40), AE_AML_BUFFER_LIMIT (20140424/psparse-536)
Aug  8 06:07:44 HP-Mini kernel: [   11.789228] ACPI Error: Method parse/execution failed [\_SB_.WMID.WMAD] (Node ffff88007e852d10), AE_AML_BUFFER_LIMIT (20140424/psparse-536)
Aug  8 06:07:44 HP-Mini kernel: [   11.789420] ACPI Error: Field [D128] at 1040 exceeds Buffer [NULL] size 160 (bits) (20140424/dsopcode-236)
Aug  8 06:07:44 HP-Mini kernel: [   11.789437] ACPI Error: Method parse/execution failed [\_SB_.WMID.HWMC] (Node ffff88007e852f40), AE_AML_BUFFER_LIMIT (20140424/psparse-536)
Aug  8 06:07:44 HP-Mini kernel: [   11.789461] ACPI Error: Method parse/execution failed [\_SB_.WMID.WMAD] (Node ffff88007e852d10), AE_AML_BUFFER_LIMIT (20140424/psparse-536)
Aug  8 06:07:44 HP-Mini kernel: [   11.789643] ACPI Error: Field [D128] at 1040 exceeds Buffer [NULL] size 160 (bits) (20140424/dsopcode-236)
Aug  8 06:07:44 HP-Mini kernel: [   11.789659] ACPI Error: Method parse/execution failed [\_SB_.WMID.HWMC] (Node ffff88007e852f40), AE_AML_BUFFER_LIMIT (20140424/psparse-536)

With the freeze effect (hard-boot required). Aren't the dumps related?
Comment 448 D. Hugh Redelmeier 2016-08-08 18:17:02 UTC
@Maciej Hrebien #447
You don't explain much about your system.  Buried in the log is "Hardware name: Hewlett-Packard HP Mini 210-3000/3594, BIOS F.13 11/10/2011".  This seems to be a notebook with a Pineview or earlier Atom processor, which is not the subject of this bugzilla entry.
Comment 449 Maciej Hrebien 2016-08-09 04:47:00 UTC
Yes, it's an N570 chip and I can share more details if needed. I thought the dumps were related because setting max_cstate to 1 makes the device usable (running ~12h now without a freeze). The 3.8.2 kernel seems to work fine for me, that is, without any freezes or workarounds.
Comment 450 Kevin 2016-08-09 19:41:54 UTC
Hello, my computer is an Acer E5-511P. It always crashed randomly under Linux (the problem above). I was able to resolve the issue with "intel_idle.max_cstate=1" and blacklisting dw_dmac and dw_dmac_core.

A fun fact: I'm now using Windows 10 with the "Windows Subsystem for Linux" (Ubuntu 14.04, Linux kernel version 3.4), and my computer has crashed twice since then (the UI froze and the CPU fan was spinning at top speed).
Comment 451 Andy Shevchenko 2016-08-10 14:54:40 UTC
(In reply to Kevin from comment #450)

> and blacklisting dw_dmac and dw_dmac_core.

That should be solved in v4.5. So, if you have kernel v4.5+, please try again without disabling dw_dmac. You may refer to bug #101271 for the details.
Comment 452 Juha Sievi-Korte 2016-08-22 16:43:36 UTC
(In reply to Wolfgang M. Reimer from comment #437)
> Running my submitted scripts
> 
> https://bugzilla.kernel.org/attachment.cgi?id=223851
> https://bugzilla.kernel.org/attachment.cgi?id=223861
> 
> on a J1900 system should produce a similar output:
> 
> As one can see in my case most of the core's idle time is now spent in state
> C7S-BYT.

Thanks a lot!

I've had my system running since then with the C6 state disabled and no freezes. I've done one reboot to update the kernel, and the current session now has 21 days of uptime. This is on an N3540 laptop on which I've had quite random but steady occurrences of freezes over the last year.

So might be too early to declare success, but it seems promising for now.
Comment 453 Srdjan Todorovic 2016-08-22 21:49:18 UTC
I tried disabling the C6 states with Wolfgang M. Reimer's scripts (the C6 usage counts didn't seem to increase afterwards), but got a lockup within 2 hours.

Linux htpc 4.4.0-34-generic #53-Ubuntu SMP Wed Jul 27 16:06:39 UTC 2016 x86_64 x86_64 x86_64 GNU/Linux

model name      : Intel(R) Celeron(R) CPU  J1900  @ 1.99GHz
stepping        : 8
microcode       : 0x831

Board is Asrock Mod Q1900ITX

My use case: launch Kodi, start playing a DVD for 20 minutes, pause the DVD for 30 minutes, resume playing for perhaps 10 minutes, then pause again for 20 minutes. When I tried to resume playback, the system was unresponsive. Even the reset button didn't respond.

Just booted with intel_idle.max_cstate=1, will report if this has same issue.
Comment 454 smaj 2016-08-25 12:44:51 UTC
I disabled the C6 states as described by Wolfgang M. Reimer. No crashes for two days now, whereas before this J1900-based system would reliably crash in less than an hour of uptime.

Board is ASRock Q1900-ITX 


processor	: 0
vendor_id	: GenuineIntel
cpu family	: 6
model		: 55
model name	: Intel(R) Celeron(R) CPU  J1900  @ 1.99GHz
stepping	: 3
microcode	: 0x320
cpu MHz		: 1521.891
cache size	: 1024 KB
physical id	: 0
siblings	: 4
core id		: 0
cpu cores	: 4
apicid		: 0
initial apicid	: 0
fpu		: yes
fpu_exception	: yes
cpuid level	: 11
wp		: yes
flags		: fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm pbe syscall nx rdtscp lm constant_tsc arch_perfmon pebs bts rep_good nopl xtopology nonstop_tsc aperfmperf pni pclmulqdq dtes64 monitor ds_cpl vmx est tm2 ssse3 cx16 xtpr pdcm sse4_1 sse4_2 movbe popcnt tsc_deadline_timer rdrand lahf_lm 3dnowprefetch epb tpr_shadow vnmi flexpriority ept vpid tsc_adjust smep erms dtherm ida arat
bugs		:
bogomips	: 3993.60
clflush size	: 64
cache_alignment	: 64
address sizes	: 36 bits physical, 48 bits virtual
power management:
Comment 455 cscs 2016-08-26 20:39:59 UTC
Thanks to Wolfgang M. Reimer, I am now running 4.7.2 with no problems.
As per https://bugzilla.kernel.org/show_bug.cgi?id=109051#c437

Here is my relevant information;

lscpu:

Architecture:          x86_64
CPU op-mode(s):        32-bit, 64-bit
Byte Order:            Little Endian
CPU(s):                4
On-line CPU(s) list:   0-3
Thread(s) per core:    1
Core(s) per socket:    4
Socket(s):             1
NUMA node(s):          1
Vendor ID:             GenuineIntel
CPU family:            6
Model:                 55
Model name:            Intel(R) Pentium(R) CPU  N3540  @ 2.16GHz
Stepping:              8
CPU MHz:               499.677
CPU max MHz:           2665.6001
CPU min MHz:           499.8000
BogoMIPS:              4328.66
Virtualization:        VT-x
L1d cache:             24K
L1i cache:             32K
L2 cache:              1024K
NUMA node0 CPU(s):     0-3
Flags:                 fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm pbe syscall nx rdtscp lm constant_tsc arch_perfmon pebs bts rep_good nopl xtopology nonstop_tsc aperfmperf eagerfpu pni pclmulqdq dtes64 monitor ds_cpl vmx est tm2 ssse3 cx16 xtpr pdcm sse4_1 sse4_2 movbe popcnt tsc_deadline_timer rdrand lahf_lm 3dnowprefetch epb tpr_shadow vnmi flexpriority ept vpid tsc_adjust smep erms dtherm ida arat

Base Board Information
	Manufacturer: Dell Inc.
	Product Name: 0H4MK6
	Version: A00
	Serial Number: .HS1LK52.CN7620657U0002.
Comment 456 RussianNeuroMancer 2016-08-27 05:23:27 UTC
How does disabling C6 while keeping C7 enabled affect battery life?
Comment 457 fdservices 2016-08-27 17:32:56 UTC
(In reply to cscs from comment #455)

As far as I can tell, I too am running 4.7.2 with no problem - Arch Linux, Acer Travelmate B115M
Comment 458 mohican 2016-08-29 13:26:03 UTC
Hello, on a Lenovo E50-00 with CPU Intel Pentium J2900
I had these random freezes.

In addition I also have a freeze when RESUMING AFTER SUSPEND.
Therefore when I tested more distributions (kernel versions) I just tested the resume after suspend.
(Hibernate does work fine.)

I had the bug with :
- Linux Mint 18.0 (based on Ubuntu 16.04) with kernel 4.4
- Ubuntu 14.04
- Ubuntu 12.04 with kernel 3.13

I also had the bug when I changed CPU's BIOS setting to C1 only.
Comment 459 jbMacAZ 2016-08-30 17:33:50 UTC
The c6-off/c7-on script is also effective on Z3775 baytrail processor in my ASUS T100CHI (and is reported effective for other ASUS T100T* models.)  Listed cstates for the Z3775 are (POLL,C1-BYT,C6N-BYT,C6S-BYT,C7-BYT,C7S-BYT)

Now my only kernel arguments are "tsc=reliable clocksource=tsc".  I no longer need intel_idle.max_cstate={0,1}.  Even with recent kernels, the T100CHI would rarely go more than 30 minutes without a freeze unless cstate was limited.  I also had freezes when trying max_cstate=3.

Many thanks for tracking this one down.
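
For readers wondering what the c6-off/c7-on workaround boils down to, here is a rough sketch in Python. It is not the attached script and may differ from it in detail; it assumes only the standard cpuidle sysfs layout, where writing 1 to a state's "disable" file turns that state off and writing 0 turns it back on. The function names and the name-prefix matching are illustrative assumptions. Writing to the real sysfs files requires root.

```python
from pathlib import Path


def set_states_disabled(cpu_root, prefix, disabled):
    """Write 1 or 0 to the 'disable' file of every idle state whose name
    starts with `prefix`; return the names of the states touched."""
    touched = []
    pattern = "cpu[0-9]*/cpuidle/state[0-9]*"
    for state_dir in sorted(Path(cpu_root).glob(pattern)):
        name = (state_dir / "name").read_text().strip()
        if name.startswith(prefix):
            (state_dir / "disable").write_text("1" if disabled else "0")
            touched.append(name)
    return touched


def c6_off_c7_on(cpu_root="/sys/devices/system/cpu"):
    """Disable the C6* states and make sure the C7* states are enabled."""
    off = set_states_disabled(cpu_root, "C6", True)
    on = set_states_disabled(cpu_root, "C7", False)
    return off, on
```

On the Z3775 state list above, this would touch C6N-BYT and C6S-BYT (disabled) and C7-BYT and C7S-BYT (enabled), leaving POLL and C1-BYT alone. The setting does not survive a reboot, so it has to be reapplied at every boot.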
Comment 460 ladiko 2016-08-30 18:12:13 UTC
Tried the C6-disabling script on an ASRock Q1900-ITX-M. It ran for a whole night, whereas otherwise it would freeze within 1 or 2 hours and I had to use the max_cstate=1 fix. Will roll it out to 50 more machines by December. Not a fix, but an OK workaround for this issue.
Comment 461 Paul Nijenhuis 2016-09-01 16:01:56 UTC
I've also tried the C6-disabling script; I made a startup service for it
on openSUSE Tumbleweed with the latest updates. It's now running smoothly,
whereas at first I had disabled the C6 and C7 states in the UEFI BIOS.
Now I've re-enabled the states and everything seems to run OK.
No freezes for 4 hours now.
Output from cstateInfo.sh:
cpu0 State  Name     Disabled  Latency  Residency        Time    Usage
         0  POLL            0        0          0    29315450    16010
         1  C1-BYT          0        1          1  2075922404  2720073
         2  C6N-BYT         1      300        275     1298175      459
         3  C6S-BYT         1      500        560     6214377     1612
         4  C7-BYT          0     1200       4000  1228124502   139642
         5  C7S-BYT         0    10000      20000   474724588    20423
cpu1 State  Name     Disabled  Latency  Residency        Time    Usage
         0  POLL            0        0          0    34446485    17262
         1  C1-BYT          0        1          1  2026994454  2604049
         2  C6N-BYT         1      300        275     1339377      414
         3  C6S-BYT         1      500        560     5097170      895
         4  C7-BYT          0     1200       4000  1215493749   130038
         5  C7S-BYT         0    10000      20000   554333717    22140
cpu2 State  Name     Disabled  Latency  Residency        Time    Usage
         0  POLL            0        0          0    32861042    15934
         1  C1-BYT          0        1          1  2994741581  5739338
         2  C6N-BYT         1      300        275      958074      269
         3  C6S-BYT         1      500        560     5137562     1061
         4  C7-BYT          0     1200       4000   533053353    77172
         5  C7S-BYT         0    10000      20000   111720079     5108
cpu3 State  Name     Disabled  Latency  Residency        Time    Usage
         0  POLL            0        0          0    26658663    12680
         1  C1-BYT          0        1          1  2232165052  3062867
         2  C6N-BYT         1      300        275      900500      238
         3  C6S-BYT         1      500        560     4047949      844
         4  C7-BYT          0     1200       4000  1198658992   148698
         5  C7S-BYT         0    10000      20000   307666394    14599

And lscpu
Architecture:          x86_64
CPU op-mode(s):        32-bit, 64-bit
Byte Order:            Little Endian
CPU(s):                4
On-line CPU(s) list:   0-3
Thread(s) per core:    1
Core(s) per socket:    4
Socket(s):             1
NUMA node(s):          1
Vendor ID:             GenuineIntel
CPU family:            6
Model:                 55
Model name:            Intel(R) Celeron(R) CPU  J1900  @ 1.99GHz
Stepping:              8
CPU MHz:               1332.718
CPU max MHz:           2415.7000
CPU min MHz:           1332.8000
BogoMIPS:              3993.60
Virtualization:        VT-x
L1d cache:             24K
L1i cache:             32K
L2 cache:              1024K
NUMA node0 CPU(s):     0-3
Flags:                 fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm pbe syscall nx rdtscp lm constant_tsc arch_perfmon pebs bts rep_good nopl xtopology nonstop_tsc aperfmperf eagerfpu pni pclmulqdq dtes64 monitor ds_cpl vmx est tm2 ssse3 cx16 xtpr pdcm sse4_1 sse4_2 movbe popcnt tsc_deadline_timer rdrand lahf_lm 3dnowprefetch epb tpr_shadow vnmi flexpriority ept vpid tsc_adjust smep erms dtherm ida arat
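
To judge where the idle time actually goes in a table like the one above, the Time column (cumulative microseconds per state in the cpuidle sysfs interface) can be turned into percentages. The following is a small sketch, assuming text in the seven-column cstateInfo.sh layout shown in this thread; the function name is invented for illustration.

```python
def time_share(table_text):
    """Parse cstateInfo.sh-style text; return {cpu: {state_name: percent}}."""
    times = {}
    cpu = None
    for line in table_text.splitlines():
        parts = line.split()
        if not parts:
            continue
        if parts[0].startswith("cpu"):
            # A header line such as "cpu0 State  Name ..." starts a new CPU.
            cpu = parts[0]
            times[cpu] = {}
        elif cpu is not None and len(parts) == 7 and parts[0].isdigit():
            # Columns: State Name Disabled Latency Residency Time Usage
            times[cpu][parts[1]] = int(parts[5])
    shares = {}
    for cpu_name, per_state in times.items():
        total = sum(per_state.values())
        shares[cpu_name] = {
            name: (100.0 * t / total if total else 0.0)
            for name, t in per_state.items()
        }
    return shares
```

Note that the counters are cumulative, so the small non-zero Time values for the disabled C6N-BYT and C6S-BYT states were presumably accumulated before the script disabled them.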
Comment 462 Paul Nijenhuis 2016-09-01 16:21:01 UTC
(In reply to Paul Nijenhuis from comment #461)

OK, it froze again after 4.5 hours :-(
Back to enabling only C1 in the BIOS.
Comment 463 amjafuso 2016-09-02 09:31:44 UTC
Newest Kernel 4.7.2 crashed after 3 hours. Going back to intel_idle.max_cstate=1.

Who cares? More than 400 comments and the status is still NEW? I might give up reporting to this thread...
Comment 464 Hal 2016-09-02 16:50:25 UTC
On my Celeron N2930 system I switched from intel_idle.max_cstate=1 to using c6off+c7on.sh a couple of days ago. Everything works quite well so far.
cstateInfo.sh confirms that there is no C6N-BYT or C6S-BYT being used. As this is a very active machine running 3 virtual machines (light load) the CPU is mostly in the C1E-BYT state. C7-BYT and C7S-BYT also get a good hit.
So, my experience so far is very positive. The only surprising thing is that my box is not running much cooler than before this change. I guess the active CPU load explains that.
Hal
Comment 465 Hal 2016-09-02 16:56:08 UTC
(In reply to Hal from comment #464)
> On my Celeron N2930 system I switched from intel_idle.max_cstate=1 to using
> c6off+c7on.sh a couple of days ago. Everything works quite well so far.
> cstateInfo.sh confirms that there is no C6N-BYT or C6S-BYT being used. As
> this is a very active machine running 3 virtual machines (light load) the
> CPU is mostly in the C1E-BYT state. C7-BYT and C7S-BYT also get a good hit.
> So, my experience so far is very positive. The only surprising thing is that
> my box is not running much cooler than before this change. I guess the
> active CPU load explains that.
> Hal

I also wanted to add that I've been using kernel version 4.5.7.
Finally, a question: what would be the best way to launch c6off+c7on.sh? I currently have it in a "session and startup" entry in xfce4, but it would probably make sense to start it before xfce or even xorg is launched.
Hal
Comment 466 ladiko 2016-09-03 05:18:32 UTC
This would add the function as a system service on Ubuntu 14.04. Later versions use systemd services, so it's different there, but I don't have the files here.

cat > /etc/init.d/c6off+c7on.sh <<'EOF'
#!/bin/bash
# Disable the C6 idle states and enable the C7 idle states on all CPUs.
for state in /sys/devices/system/cpu/cpu*/cpuidle/state* ; do
	case "$(< "${state}/name")" in
		C6*-BYT|C6*-CHT) echo "1" > "${state}/disable" ;;
		C7*-BYT|C7*-CHT) echo "0" > "${state}/disable" ;;
	esac
done
EOF
chown root:root /etc/init.d/c6off+c7on.sh
chmod 755 /etc/init.d/c6off+c7on.sh
update-rc.d c6off+c7on.sh start 90 2 .

I think the bug also exists on CherryTrail, so I added |C6*-CHT and |C7*-CHT. If the bug doesn't affect it, just remove them.
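For the systemd case mentioned above, a oneshot unit is one way to run the same script at boot. This is only a sketch: the unit name c6off-c7on.service and the default script location /usr/local/sbin/c6off+c7on.sh are assumptions, not files from this thread.

```shell
#!/bin/bash
# Sketch: write a oneshot systemd unit that runs the C-state script once at boot.
# The unit name and the default script path are hypothetical; adjust as needed.
write_cstate_unit() {
	local unit_dir="${1:-/etc/systemd/system}"
	local script="${2:-/usr/local/sbin/c6off+c7on.sh}"
	cat > "${unit_dir}/c6off-c7on.service" <<EOF
[Unit]
Description=Disable C6 and enable C7 idle states

[Service]
Type=oneshot
ExecStart=${script}

[Install]
WantedBy=multi-user.target
EOF
}
```

After calling write_cstate_unit as root, running "systemctl daemon-reload" followed by "systemctl enable c6off-c7on.service" should make the script run on every boot.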
Comment 467 Martin 2016-09-03 08:09:08 UTC
I've been running the previously unstable kernel 4.5.4 without max_cstate=1 using the c6off+c7on script for more than 3 days now and have yet to see the dreaded freeze.

$ uptime 
 10:02:05 up 3 days, 13:13,  2 users,  load average: 0,16, 0,13, 0,14

$ cat /proc/cpuinfo 
processor       : 0
vendor_id       : GenuineIntel
cpu family      : 6
model           : 55
model name      : Intel(R) Celeron(R) CPU  J1900  @ 1.99GHz
stepping        : 3
...

Seems to be the holy grail for me! Thx a lot!
Too bad intel never chipped in with the addendum before. Shame!

I put the c6off+c7on script in /etc/rc.local.
Comment 468 Juha Sievi-Korte 2016-09-04 19:11:15 UTC
(In reply to Juha Sievi-Korte from comment #452)
> (In reply to Wolfgang M. Reimer from comment #437)
> > Running my submitted scripts
> > 
> > https://bugzilla.kernel.org/attachment.cgi?id=223851
> > https://bugzilla.kernel.org/attachment.cgi?id=223861
> > 
> > on a J1900 system should produce a similar output:
> > 
> > As one can see in my case most of the core's idle time is now spent in
> state
> > C7S-BYT.
> 
> Thanks a lot!
> 
> I've had my system running since then with disabled C6 state and no freezes.
> I've done one reboot to update kernel, 21 days uptime on current session
> now. This is on N3540 laptop with which I've had quite random but steady
> occurences of freezes over last year.
> 
> So might be too early to declare success, but it seems promising for now.

And today two crashes with c6 disabled by this script, so this wasn't the root cause either. Phew. Back to two hour battery life it is...
Comment 469 ladiko 2016-09-04 19:14:36 UTC
Crashed or freezes?
Comment 470 Juha Sievi-Korte 2016-09-04 19:38:56 UTC
(In reply to ladiko from comment #469)
> Crashed or freezes?

Sorry about that: freezes. The first happened during the 'long' uptime session and the next within an hour of a reboot, so it seems just as random for me as before. The cstate script was run early in boot-up.
Comment 471 ladiko 2016-09-04 19:47:38 UTC
It's fine. I just wanted to be sure what we're talking about. I haven't pushed it to the other 50 machines yet. Only the one I tested it on ran stable. So you checked the C6 state after boot? I will run a long-term test, several days or so, before I roll it out to the other machines.
Comment 472 Juha Sievi-Korte 2016-09-04 20:26:17 UTC
(In reply to ladiko from comment #471)
> It's fine. Just wanted to be sure what we talk about. Didn't yet pushed it
> to the other 50 machines. Just the one I tested it on ran stable. So you
> checked the c6 state after boot? I will run a long term test. Like several
> days before I roll it out to the other machines.

Yep, checked that the script was run ok and c6 wasn't active after the last reboot.
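A check like the one described here can be sketched as follows. The function name is made up for illustration; it relies only on the per-state "name" and "disable" files under cpuidle in sysfs, and the optional argument exists so it can be pointed at a test directory.

```shell
#!/bin/bash
# Sketch: list every C6 idle state that is still enabled (disable flag is not 1).
# Prints nothing when all C6 states are disabled.
list_enabled_c6() {
	local root="${1:-/sys/devices/system/cpu}"
	local state name
	for state in "${root}"/cpu*/cpuidle/state*; do
		# Skip the unexpanded glob pattern when no cpuidle states exist.
		[ -r "${state}/name" ] || continue
		name="$(< "${state}/name")"
		case "${name}" in
			C6*) [ "$(< "${state}/disable")" = "1" ] || echo "${state} ${name}" ;;
		esac
	done
}
```

Running list_enabled_c6 after boot and getting no output would indicate that the script did its job for the C6 states.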

As some folks still seem to have promising results with this script, I think I'll let it run for a while longer to see the effects in the long run. Perhaps it makes some of the events that cause this issue happen less frequently.

I checked that the last log entry was ~40 minutes after the reboot on the last attempt when it froze. The system was just sitting idle by itself when it happened, with one ssh session open on the desktop plus a few browser tabs.
Comment 473 ichudov 2016-09-04 20:30:43 UTC
I have a computer where I had constant crashes. I set "intel_idle.max_cstate=1" and now it stays up for weeks and never crashes.
Comment 474 jbMacAZ 2016-09-06 08:06:05 UTC
Ran 44 hours with c6off+c7on.sh before freezing hard. It would usually freeze within 30 minutes without any cstate limits. The Z3775 might have an issue with the C7 states. The system was idle when it locked up.

Will there be newer microcode that would help my Baytrail?
Comment 475 rkrambovitis 2016-09-07 06:14:07 UTC
I have had 0 lockups (3 days) using 4.8-rc5 from ubuntu mainline archives.
No patches, max_cstate settings or anything.
Before that I was using 3.16 kernel which was the last stable one for me (baytrail).

Unfortunately my hdmi is not working with this kernel.

Anyone else wanna try and report ?
Comment 476 Martin 2016-09-07 06:50:46 UTC
(In reply to Martin from comment #467)
> I've been running the previously unstable kernel 4.5.4 without max_cstate=1
> using the c6off+c7on script for more than 3 days now and have yet see the
> dreaded freeze.
> 
> $ uptime 
>  10:02:05 up 3 days, 13:13,  2 users,  load average: 0,16, 0,13, 0,14
> 
> $ cat /proc/cpuinfo 
> processor       : 0
> vendor_id       : GenuineIntel
> cpu family      : 6
> model           : 55
> model name      : Intel(R) Celeron(R) CPU  J1900  @ 1.99GHz
> stepping        : 3
> ...
> 
> Seems to be the holy grail for me! Thx a lot!
> Too bad intel never chipped in with the addendum before. Shame!
> 
> I put the c6off+c7on script in /etc/rc.local.

I regret having to walk back my statement above. After many days of stable TV watching, our HTPC was unresponsive and I had to power-cycle it to get back in business.
Comment 477 Hal 2016-09-08 19:22:04 UTC
Follow-up on my week-old post: c6off+c7on still works well on my Zotac system. My stats are below:

Thu Sep  8 15:18:37 EDT 2016
 15:18:37 up 7 days,  6:46,  2 users,  load average: 6.95, 7.15, 7.28

  *-cpu
       description: CPU
       product: Intel(R) Celeron(R) CPU  N2930  @ 1.83GHz
       vendor: Intel Corp.
       physical id: 34
       bus info: cpu@0
       version: Intel(R) Celeron(R) CPU N2930 @ 1.83GHz
       slot: SOCKET 0
       size: 2165MHz
       capacity: 2400MHz
       width: 64 bits
       clock: 83MHz
       capabilities: x86-64 fpu fpu_exception wp vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm pbe syscall nx rdtscp constant_tsc arch_perfmon pebs bts rep_good nopl xtopology nonstop_tsc aperfmperf pni pclmulqdq dtes64 monitor ds_cpl vmx est tm2 ssse3 cx16 xtpr pdcm sse4_1 sse4_2 movbe popcnt tsc_deadline_timer rdrand lahf_lm 3dnowprefetch ida arat epb dtherm tpr_shadow vnmi flexpriority ept vpid tsc_adjust smep erms cpufreq
       configuration: cores=4 enabledcores=4 threads=4

cpu0 State  Name     Disabled  Latency  Residency          Time      Usage
         0  POLL            0        0          0      22302321    2554713
         1  C1-BYT          0        1          1   15757596319  262145340
         2  C1E-BYT         0       15         30  119156258736  521385788
         3  C6N-BYT         1       40        275        456080        865
         4  C6S-BYT         1      140        560        855986        870
         5  C7-BYT          0     1200       1500    3704252148    5761453
         6  C7S-BYT         0    10000      20000      61455590       3197
cpu1 State  Name     Disabled  Latency  Residency          Time      Usage
         0  POLL            0        0          0     208494519    3428799
         1  C1-BYT          0        1          1   17097828007  280365124
         2  C1E-BYT         0       15         30  117773052639  522674956
         3  C6N-BYT         1       40        275        376329        506
         4  C6S-BYT         1      140        560        784219        596
         5  C7-BYT          0     1200       1500    3331593582    4844889
         6  C7S-BYT         0    10000      20000      59530341       2921
cpu2 State  Name     Disabled  Latency  Residency          Time      Usage
         0  POLL            0        0          0      21634146    2606774
         1  C1-BYT          0        1          1   16503086447  284253332
         2  C1E-BYT         0       15         30  122405265787  542914915
         3  C6N-BYT         1       40        275        537565        835
         4  C6S-BYT         1      140        560        626723        544
         5  C7-BYT          0     1200       1500    3845541641    5968789
         6  C7S-BYT         0    10000      20000      43528414       2486
cpu3 State  Name     Disabled  Latency  Residency          Time      Usage
         0  POLL            0        0          0      22154717    2630336
         1  C1-BYT          0        1          1   16894582028  282524229
         2  C1E-BYT         0       15         30  123929531803  549088123
         3  C6N-BYT         1       40        275        313412        440
         4  C6S-BYT         1      140        560        722191        486
         5  C7-BYT          0     1200       1500    4109133756    6219843
         6  C7S-BYT         0    10000      20000      52241428       2709

Hal
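To see at a glance where the idle time goes in a table like the one above, the per-state times can be summed over all CPUs and shown as percentages. This is a rough sketch, not the cstateInfo.sh script from this thread: the function name is invented, and it reads the cpuidle "name" and "time" files from sysfs directly.

```shell
#!/bin/bash
# Sketch: print each idle state's share of the total recorded idle time,
# summed over all CPUs, using the cpuidle "name" and "time" sysfs files.
idle_time_share() {
	local root="${1:-/sys/devices/system/cpu}"
	local state
	for state in "${root}"/cpu*/cpuidle/state*; do
		# Skip the unexpanded glob pattern when no cpuidle states exist.
		[ -r "${state}/time" ] || continue
		echo "$(< "${state}/name") $(< "${state}/time")"
	done | awk '
		{ sum[$1] += $2; total += $2 }
		END {
			if (total == 0) exit
			for (n in sum) printf "%s %.1f\n", n, 100 * sum[n] / total
		}' | sort
}
```

Each output line is a state name followed by its percentage of the total, so a machine that spends most of its idle time in C1E-BYT would show that state with the largest value.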
Comment 478 Dmitry 2016-09-09 08:39:36 UTC
I haven't written here for some time. The latest kernels (~4.8-rc5) on a Z3770 aren't stable without the max_cstate parameter. I hit this bug even during the initramfs root switch stage. See bug 150881. The kernel freezes within tens of seconds of entering the idle state.
Comment 479 sfumato1977 2016-09-09 12:55:00 UTC
On my tablet with a Z3735F CPU, with kernels 4.4 to 4.8, "stability" is guaranteed for less than 15 minutes.
If I instead add intel_idle.max_cstate=1 and clocksource=tsc to grub.cfg, I can use it for days.

P.S.
I suspect the "refined-jiffies" clocksource.
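Whether such parameters actually took effect can be checked after boot. The helper names below are invented for illustration, and the optional file arguments exist only so the functions can be pointed at test files; the default paths are the usual /proc/cmdline and the clocksource0 sysfs entry.

```shell
#!/bin/bash
# Sketch: report whether a parameter is present on the kernel command line.
has_boot_param() {
	local param="$1" cmdline_file="${2:-/proc/cmdline}"
	local words word
	read -r -a words < "${cmdline_file}"
	for word in "${words[@]}"; do
		[ "${word}" = "${param}" ] && return 0
	done
	return 1
}

# Print the clocksource the kernel is currently using.
current_clocksource() {
	cat "${1:-/sys/devices/system/clocksource/clocksource0/current_clocksource}"
}
```

For example, "has_boot_param intel_idle.max_cstate=1 && echo limited" prints "limited" only when the workaround is active, and current_clocksource shows whether tsc was really selected.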
Comment 480 Dmitry 2016-09-09 13:04:45 UTC
If kernel is unstable when cpu is in idle, then jiffies are no good.

For me, a zero response from Intel is evidence that there is a huge mistake in hardware. But I don't believe that it can't be overcome by any software changes.
Comment 481 László Kara 2016-09-09 18:34:00 UTC
(In reply to Dmitry from comment #480)
> If kernel is unstable when cpu is in idle, then jiffies are no good.
> 
> For me, a zero response from Intel is evidence that there is a huge mistake
> in hardware. But I don't believe that it can't be overcome by any software
> changes.

It does not crash on Windows. Some software change must work.
Comment 482 Hal 2016-09-10 19:29:40 UTC
(In reply to Dmitry from comment #480)
 > For me, a zero response from Intel is evidence that there is a huge mistake
 > in hardware. But I don't believe that it can't be overcome by any software
 > changes.
> 
> It does not crash on Windows. Some software change must work.

Not only Windows XP, NT, Vista, 7, 8, 10 work without problem but also BSD distributions work well (including pfsense). Linux kernels 2.x and 3.0 kernels (probably all the way up to 3.16 - although on one of my systems 3.16 freezes) seem to work too.

There is no doubt there is a bug related to the cstates in some of those Intel processors but evidently software workarounds or fixes are possible. I think the kernel team should be equally accountable for a definitive fix as is Intel (with no disrespect intended to either group of engineers - actually eternal gratitude to all Linux/GNU people for an outstanding platform).

Now, is the CPU bug a design problem or due to 14nm manufacturing process issues? that's an intriguing question in my mind; because I happen to own 2 identical boxes (Intel NUC) manufactured (by Intel) just a few months apart - one with the freezing problem and the other without. The steppings on the CPU and BIOS versions are unfortunately not identical. 
So whether the 'fix' is part of a newer CPU stepping or part of a remediation microcode loaded through the BIOS at start up, it looks like the CPU is able to run all cstates. That tells me that a post-manufacturing CPU microcode fix is always possible. 

Hal
Comment 483 Paul Mansfield 2016-09-10 21:50:33 UTC
I am inclined to agree with Hal. There's a group of three of us with Z3735F-based Toshiba Click Minis. Despite all running the same UEFI firmware versions and dock firmware versions, mine seems to be less stable. It could also be down to the other choices we've made in terms of root partitions on memory cards, but we have no solid proof. If we could attach a simple serial console we might have some hope.

I'm not sure if everybody's seen the Shark's Cove reference design
http://www.cnx-software.com/2014/07/30/sharks-cove-intel-atom-bay-trail-t-development-board-for-windows-8-1-is-now-available-for-299/

the links from that are broken, but I managed to track down the
technical docs for it:

https://firmware.intel.com/sites/default/files/Sharks_Cove_Schematic.pdf

http://composter.com.ua/documents/Sharks-Cove-Technical-Specifications.pdf


maybe it will help someone who understands drivers to fix the situation.
Comment 484 Dmitry 2016-09-11 09:07:57 UTC
If this bug depends on firmware, then I have bad news: Atom Baytrail doesn't support microcode loading. I first tried three different ways of loading microcode and none of them succeeded. Then I found a list of CPUs that support microcode loading, and the Z3770 is not on it.

My microcode: sig=0x30673, pf=0x2, revision=0x324

P.S. There is no microcode for 06-37-03 (Family-6,Model-55,Stepping-3).
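The mapping from the signature 0x30673 to 06-37-03 follows the standard CPUID layout (stepping in bits 0-3, model in bits 4-7, family in bits 8-11, extended model in bits 16-19, extended family in bits 20-27). A small sketch of the decoding, with a made-up function name:

```shell
#!/bin/bash
# Sketch: decode a CPUID signature into the family-model-stepping form
# used in microcode file names (for example 06-37-03).
sig_to_fms() {
	local sig=$(( $1 ))
	local stepping=$(( sig & 0xf ))
	local model=$(( (sig >> 4) & 0xf ))
	local family=$(( (sig >> 8) & 0xf ))
	local ext_model=$(( (sig >> 16) & 0xf ))
	local ext_family=$(( (sig >> 20) & 0xff ))
	# The extended model is only used for families 6 and 15.
	if [ "${family}" -eq 6 ] || [ "${family}" -eq 15 ]; then
		model=$(( (ext_model << 4) | model ))
	fi
	# The extended family is only added for family 15.
	if [ "${family}" -eq 15 ]; then
		family=$(( family + ext_family ))
	fi
	printf '%02x-%02x-%02x\n' "${family}" "${model}" "${stepping}"
}
```

Calling "sig_to_fms 0x30673" prints 06-37-03, which matches family 6, model 55 (0x37), stepping 3.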
Comment 485 julio.borreguero@gmail.com 2016-09-11 10:21:03 UTC
the only one being able to fix this would be intel, at least in a proper way.
obviously they don't give a shit. baytrail is old, on top low-cost. they want to sell new stuff.
let's face it, if you have a workaround for this bug (like me running kernel 4.1.12) then you are lucky.
i assume we are left alone with this and nothing is going to happen from intel's side.
if you are clever never buy intel or nvidia again.
i probably won't.
that is all i can do.
Comment 486 Martin Brand 2016-09-12 17:53:57 UTC
Hi, I have just found the following link where an Intel employee seems to have found a problem and has already developed some kind of a solution.
However I have not been able to find out if this has been added to the kernel.
Here is the link:
https://lkml.org/lkml/2015/3/24/271
Perhaps somebody with more experience than myself could have a look at this?
Comment 487 Daniel Glöckner 2016-09-12 17:58:35 UTC
(In reply to Martin Brand from comment #486)
> Hi, I have just found the following link where an Intel employee seems to
> have found a problem and has already developed some kind of a solution.

See Comment 55. This patch is not enough to fix the problem.
Comment 488 BzukTuk 2016-09-12 19:01:41 UTC
(In reply to BzukTuk from comment #378)
> Hi again,
> Kernel 4.6 + Mika Kuoppalas 3 _tentative_ patches +
> linux-999-i915-use-legacy-turbo.patch = over 120h in one single session
> (without reboot/sleep..) and another 20+/- hours in few 3-4hour long
> sessions without single freeze. Still counting...
> 
> (some of Adrian Hunters patches for pm/mmc were also applied, but I dont
> think (hope) this matters)
> ...

Hi there again,
just so you know, I have not experienced a !single! freeze on fresh kernels (>=4.6) since I started using Mika's patches + the legacy turbo patch together (as mentioned above). Could anyone with freeze issues and a non-laptop machine give it a long stress test? I only have a tablet/laptop and I don't want to wreck the battery/LCD (I also can't turn the display off, but that's another story). I did not count exactly, but I think I have >500 hours of uptime without any freeze on this device.

Mika's tentative patches
https://cgit.freedesktop.org/~miku/drm-intel/commit/?h=rc6_test&id=e564271291fa70265b53fa34c01cbb0ae6282e81
https://cgit.freedesktop.org/~miku/drm-intel/commit/?h=rc6_test&id=7e6c3f36563d133cff5b700d9c36b12ac2a0c643
https://cgit.freedesktop.org/~miku/drm-intel/commit/?h=rc6_test&id=b2f08adb19fcb18fea7cda9908fa52e2b9db5e7f

Legacy turbo:
https://github.com/OpenELEC/OpenELEC.tv/blob/master/packages/linux/patches/4.7.3/linux-999-i915-use-legacy-turbo.patch
Comment 489 Rick Lee 2016-09-13 02:31:01 UTC
(In reply to Martin Brand from comment #486)
> Hi, I have just found the following link where an Intel employee seems to
> have found a problem and has already developed some kind of a solution.
> However I have not been able to find out if this has been added to the
> kernel.
> Here is the link:
> https://lkml.org/lkml/2015/3/24/271
> Perhaps somebody with more experience than myself could have a look at this?

That message is 18 months old... Hardly qualifies as "hot off the press" by the standards of Intel's Moore's law.
Comment 490 Paul Nijenhuis 2016-09-13 05:56:40 UTC
(In reply to BzukTuk from comment #488)
> (In reply to BzukTuk from comment #378)
> > Hi again,
> > Kernel 4.6 + Mika Kuoppalas 3 _tentative_ patches +
> > linux-999-i915-use-legacy-turbo.patch = over 120h in one single session
> > (without reboot/sleep..) and another 20+/- hours in few 3-4hour long
> > sessions without single freeze. Still counting...
> > 
> > (some of Adrian Hunters patches for pm/mmc were also applied, but I dont
> > think (hope) this matters)
> > ...
> 
> Hi there again,
> just so you know, I have not experienced a !single! freeze on fresh kernels
> (>=4.6) since I started using Mika's patches + the legacy turbo patch together
> (as mentioned above). Could anyone with freeze issues and a non-laptop machine
> give it a long stress test? I only have a tablet/laptop and I don't want to
> wreck the battery/LCD (I also can't turn the display off, but that's another
> story). I did not count exactly, but I think I have >500 hours of uptime
> without any freeze on this device.
> 
> Mika's tentative patches
> https://cgit.freedesktop.org/~miku/drm-intel/commit/?h=rc6_test&id=e564271291fa70265b53fa34c01cbb0ae6282e81
> https://cgit.freedesktop.org/~miku/drm-intel/commit/?h=rc6_test&id=7e6c3f36563d133cff5b700d9c36b12ac2a0c643
> https://cgit.freedesktop.org/~miku/drm-intel/commit/?h=rc6_test&id=b2f08adb19fcb18fea7cda9908fa52e2b9db5e7f
> 
> Legacy turbo:
> https://github.com/OpenELEC/OpenELEC.tv/blob/master/packages/linux/patches/4.7.3/linux-999-i915-use-legacy-turbo.patch

Can you please explain how to apply these patches?
Thanks in advance.
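In general, patch files like these are applied from the top of the kernel source tree with "patch -p1" before building. A minimal sketch, assuming the patches have already been saved as local files; the function name and any paths passed to it are placeholders:

```shell
#!/bin/bash
# Sketch: apply patch files to a kernel source tree, stopping at the first
# patch that does not apply cleanly. A dry run is done before each real run.
apply_patches() {
	local tree="$1"
	shift
	local p
	for p in "$@"; do
		patch -d "${tree}" -p1 --dry-run < "${p}" > /dev/null || {
			echo "does not apply: ${p}" >&2
			return 1
		}
		patch -d "${tree}" -p1 < "${p}" > /dev/null
	done
}
```

Something like "apply_patches ~/linux-4.7.3 first.patch second.patch" (hypothetical names) would then be followed by the usual kernel build for the distribution in use.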
Comment 491 Christian First 2016-09-13 14:45:10 UTC
Hello, I've got a Biostar J1900 with a Celeron quad-core.
The freeze occurs with all Ubuntu 16.04 based distributions. I've tested Ubuntu 16.04, Lubuntu 16.04 and Ubuntu MATE 16.04. Even Ubuntu 14.04.5 freezes.
I am now running Zorin 9 and Ubuntu 14.04.4. The kernel in use is 3.13.0-95-generic. I would volunteer as a tester. Is there perhaps someone from Germany here who can support this? Regards, Christian
Comment 492 sfumato1977 2016-09-13 14:56:37 UTC
Could it be because this is missing?

Atom PMC platform clocks:

drivers/clk/x86/clk-byt-plt.c:

https://patchwork.kernel.org/patch/9286345/
Comment 493 jbMacAZ 2016-09-14 17:47:13 UTC
Testing the c6off/c7on script with a Z3775 system at idle.  First freeze took 44 hours.  Subsequent freezes take about 3 hours.  Turning off C7-BYT extends to 4 hours of idling before freezing.  Turning off C7S-BYT on just one core gets me running again w/o freezing.  Effectively, one core is set to intel_idle.max_cstate=1 while the others could allow C7S-BYT.  

Setting intel_idle.max_cstate=2 without the c6off/c7on script yields less than 30 minutes of run time before freezing, often within a few minutes.  

I'm not sure that it's permissible to control power saving on a per-core basis. What I did may just be another way to set ..max_cstate=1.
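For reference, the sysfs "disable" files exist per CPU, so the per-core experiment above can be scripted. This is only a sketch with an invented function name; whether the hardware treats the setting independently per core is a separate question that it does not answer.

```shell
#!/bin/bash
# Sketch: disable (1) or enable (0) one named idle state on a single CPU.
set_state_on_cpu() {
	local cpu="$1" name="$2" value="$3"
	local root="${4:-/sys/devices/system/cpu}"
	local state
	for state in "${root}/cpu${cpu}"/cpuidle/state*; do
		# Skip the unexpanded glob pattern when the CPU has no cpuidle states.
		[ -r "${state}/name" ] || continue
		if [ "$(< "${state}/name")" = "${name}" ]; then
			echo "${value}" > "${state}/disable"
		fi
	done
}
```

Run as root, "set_state_on_cpu 0 C7S-BYT 1" would turn off C7S-BYT on cpu0 only and leave the other cores untouched.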

FWIW - I have had one or two identical freezes in Windows, but this is quite rare in comparison to linux.
Comment 494 Hal 2016-09-15 01:36:07 UTC
(Follow up to own message #477)

12 days into using c6off+c7on I decided to go back to the intel_idle.max_cstate=1 workaround.

The reason is that although I did not experience any freeze or crash on my host OS, I started to see some very awkward SSD access problems. Swapping the SSD drive with a new one did not alleviate the problem, but going back to max_cstate=1 definitely eliminated it.

The awkwardness of the SSD problem was that it would tie up data retrieval from the SSD for many tens of seconds but the host OS wouldn't fail. The mouse, internet access etc. would all work. Many drive access retry messages would pop up but without causing a system crash.

On the other hand, as I ran Virtualbox and several virtual machines, those would partially freeze. For instance their GUIs would not respond to keyboard or mouse actions but I could still SSH into them from the host computer or a remote computer. Eventually I would get serious data corruption in the guest machines.

The problem with the guest machines didn't happen very often but happened on different virtual machines running different GNU flavors and different Linux kernels.

Hal
Comment 495 Paul Nijenhuis 2016-09-16 06:56:59 UTC
(In reply to Paul Nijenhuis from comment #490)
> can you please point out how to apply these patches?
> Thanx in advance

I found out how to apply the patches and I'm building a 4.7.3 kernel on openSUSE Tumbleweed...
I'll post the results later.
Comment 496 Konstantin Koslowski 2016-09-16 10:27:26 UTC
(In reply to Paul Nijenhuis from comment #495)
> I found out how to apply the patches and i'm building a 4.7.3 kernel on
> OpenSuse Tumbleweed...
> I'll post the results later

I'm using an ASRock Q2900 with a J2900 CPU. Until now I was running an old 3.14 LTS kernel because all newer ones froze after some time.

I tried a custom 4.7.2 kernel on Arch Linux with the 4 patches mentioned above; the system still froze after idling for around 30 hours, so I'm back to 3.14.

See the PKGBUILD here in case anybody wants to try: https://dl.dropboxusercontent.com/u/9188780/linux-baytrail.zip
Comment 497 Paul Nijenhuis 2016-09-18 19:03:42 UTC
(In reply to Paul Nijenhuis from comment #495)
> I found out how to apply the patches and i'm building a 4.7.3 kernel on
> OpenSuse Tumbleweed...
> I'll post the results later

Unfortunately, it froze after 1.5 days :-( Back to C1 only in the BIOS.
Comment 498 w2q 2016-09-19 21:02:51 UTC
The script of Wolfgang Reimer seems to be a good workaround so far. A way to install this permanently is described here:

https://forum.manjaro.org/t/intel-baytrail-freezes-the-linux-kernel/1931/10

Works for Manjaro and Ubuntu.
Comment 499 w2q 2016-09-19 21:05:29 UTC
Additional thoughts:

Wolfgang had the idea to write a test routine to verify whether erratum VLP52 was the root cause for this bug.

I found an erratum of another CPU (Z670),

http://www.intel.de/content/dam/www/public/us/en/documents/specification-updates/atom-z6xx-specification-update.pdf

that has the same description (its number here is BN38, page 25):
"EOI Transaction May Not be Sent if Software Enters Core C6 During an Interrupt Service Routine."

Here a workaround is given by Intel!
"Software should check the ISR register and, if any interrupt is in service, enter only C1."

Perhaps this is helpful to find an even more effective method to avoid this error without blocking C6 generally. There even might be already a fix for the Z6xx-cpu in the kernel.
Comment 501 Michal Feix 2016-09-20 13:28:39 UTC
I also came across a patch that was created for SUSE and that seems to address the mentioned erratum in pre-4.x kernels:

https://build.opensuse.org/package/view_file?file=22160-Intel-C6-EOI.patch&package=xen&project=home%3Acharlesa%3AopenSUSE11.3&rev=7
Comment 502 Travis Hall 2016-09-20 20:01:56 UTC
(In reply to Michal Feix from comment #501)
> I also came across a patch that was created for SUSE and that seems to be
> adressing mentioned erratum in pre 4.X kernels:
> 
> https://build.opensuse.org/package/view_file?file=22160-Intel-C6-EOI.
> patch&package=xen&project=home%3Acharlesa%3AopenSUSE11.3&rev=7

Wow, if this works, that will be absolutely fantastic.  I'll be compiling with this patch as soon as I get home.  Now to get it merged into mainline...
Comment 503 Andrew Clayton 2016-09-20 20:39:00 UTC
(In reply to Travis Hall from comment #502)
> (In reply to Michal Feix from comment #501)
> > I also came across a patch that was created for SUSE and that seems to be
> > adressing mentioned erratum in pre 4.X kernels:
> > 
> > https://build.opensuse.org/package/view_file?file=22160-Intel-C6-EOI.
> > patch&package=xen&project=home%3Acharlesa%3AopenSUSE11.3&rev=7
> 
> Wow, if this works, that will be absolutely fantastic.  I'll be compiling
> with this patch as soon as I get home.  Now to get it merged into mainline...

You might want to check your CPU model number. If I'm reading that patch right, it won't for example have any effect on my J1900 CPU with a model number of 55 (0x37) (assuming boot_cpu_data.x86_model is what is displayed in /proc/cpuinfo as "Model").
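A quick way to see the model number in both decimal and hex is to read it from /proc/cpuinfo. The helper name below is invented, and the optional argument only exists so the function can be run against a saved copy of the file.

```shell
#!/bin/bash
# Sketch: print the CPU model of the first processor entry in decimal and hex.
cpu_model() {
	local cpuinfo="${1:-/proc/cpuinfo}"
	# Match the "model" line but not "model name".
	awk -F': *' '$1 ~ /^model[ \t]*$/ { printf "%d (0x%02x)\n", $2, $2; exit }' "${cpuinfo}"
}
```

On a J1900 this would print "55 (0x37)", which can then be compared against the model numbers the patch checks for.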
Comment 504 Dmitry 2016-09-21 19:13:07 UTC
Strange, the latest git kernel works without a cmdline parameter or any scripts. The system has been working for 3 days, with reboots, without freezes.
I recompiled the kernel with PREEMPT_VOLUNTARY, NO_HZ, RCU_FAST_NO_HZ and IRQ_TIME_ACCOUNTING. In the cmdline I have: tsc=reliable clocksource=tsc pcie_aspm=force nmi_watchdog=0.

cpu0 State  Name     Disabled  Latency  Residency        Time   Usage
         0  POLL            0        0          0      345670     157
         1  C1-BYT          0        1          1    68407851  147097
         2  C6N-BYT         0      300        275    48677058   52359
         3  C6S-BYT         0      500        560   680803055  270719
         4  C7-BYT          0     1200       4000  1337235518  180699
         5  C7S-BYT         0    10000      20000   771972999   31738
cpu1 State  Name     Disabled  Latency  Residency        Time   Usage
         0  POLL            0        0          0      344769     200
         1  C1-BYT          0        1          1   365963607  701346
         2  C6N-BYT         0      300        275    88538699   99895
         3  C6S-BYT         0      500        560  1131180391  481825
         4  C7-BYT          0     1200       4000  1097908939  191670
         5  C7S-BYT         0    10000      20000   189777207   20370
cpu2 State  Name     Disabled  Latency  Residency        Time   Usage
         0  POLL            0        0          0      223966     125
         1  C1-BYT          0        1          1    82220646  205674
         2  C6N-BYT         0      300        275    81845726   99118
         3  C6S-BYT         0      500        560  1150791746  496208
         4  C7-BYT          0     1200       4000  1226103249  198451
         5  C7S-BYT         0    10000      20000   313530989   24010
cpu3 State  Name     Disabled  Latency  Residency        Time   Usage
         0  POLL            0        0          0      146758     132
         1  C1-BYT          0        1          1    68183419  163110
         2  C6N-BYT         0      300        275    56932066   64232
         3  C6S-BYT         0      500        560   846271647  338258
         4  C7-BYT          0     1200       4000  1344663248  198891
         5  C7S-BYT         0    10000      20000   564638850   27960
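A table like the one above can be produced from the cpuidle sysfs attributes. This is only a sketch: it assumes the per-state files name, disable, latency, residency, time and usage exist under /sys/devices/system/cpu/cpuN/cpuidle/stateM (older kernels may lack some of them, which are then printed as "-"), and the function names are made up for this example.

```shell
#!/bin/bash
# Print one attribute of a cpuidle state directory, or "-" if the kernel
# does not expose it.
read_attr() {
    if [ -r "$1/$2" ]; then cat "$1/$2"; else echo "-"; fi
}

# Print a table of all cpuidle states found under a sysfs-style root.
# (cpuidle_report is a hypothetical helper name.)
cpuidle_report() {
    local root="${1:-/sys/devices/system/cpu}"
    local dir cpu
    printf '%-6s %-8s %-10s %8s %8s %10s %12s %8s\n' \
        CPU State Name Disabled Latency Residency Time Usage
    for dir in "$root"/cpu[0-9]*/cpuidle/state[0-9]*; do
        [ -d "$dir" ] || continue          # glob did not match anything
        cpu=${dir%/cpuidle/*}              # strip "/cpuidle/stateM"
        cpu=${cpu##*/}                     # keep only "cpuN"
        printf '%-6s %-8s %-10s %8s %8s %10s %12s %8s\n' \
            "$cpu" "${dir##*/}" \
            "$(read_attr "$dir" name)" "$(read_attr "$dir" disable)" \
            "$(read_attr "$dir" latency)" "$(read_attr "$dir" residency)" \
            "$(read_attr "$dir" time)" "$(read_attr "$dir" usage)"
    done
}
```

Running cpuidle_report before and after a test session makes it easy to see whether the deeper states (C6/C7) were actually entered.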
Comment 505 Dmitry 2016-09-26 20:34:02 UTC
Still no freezes. Could somebody please try kernel 4.8-rc8? Otherwise it could be another workaround which doesn't relate to max_cstate.
Comment 506 Travis Hall 2016-09-28 16:38:32 UTC
(In reply to Dmitry from comment #505)
> Still no freezes. Please, try somebody kernel 4.8-rc8 or it could be another
> workaround which doesn't relate to max_cstate.

I'm still seeing the freezes on 4.8-rc8. I ran a YouTube playlist overnight and woke up to a freeze.
Comment 507 Daniel Bilik 2016-10-05 10:33:45 UTC
(In reply to Dmitry from comment #505)
> Please, try somebody kernel 4.8-rc8 or it could be another
> workaround which doesn't relate to max_cstate.

For all of you who still hope that this could (and will) be fixed, let me direct your attention to commit a7b4667+ (drm/i915: Never fully mask the the EI up rps interrupt on SNB/IVB):

https://git.kernel.org/cgit/linux/kernel/git/torvalds/linux.git/commit?id=a7b4667a00025ac28300737c868bd4818b6d8c4d

I guess that specifically this commit has stabilized i915 driver behaviour on less powerful CPUs (ie. our Baytrail Atoms and Celerons), so that some people have found their systems to run stable with Linux 4.8 (the commit was merged to 4.8-rc1).

I've applied this one-liner to i915 driver from Linux 4.4 (vanilla, no other "stabilization" patches), and got similar experience as Dmitry, ie. desktop system running on J1900 with no C-states limiting, used almost daily several hours per session, with just regular shutdowns, and working stable for weeks now.

Though it may not solve stability issues for everyone completely, the commit does seem to hit the right nail.
Comment 508 jjmeijer88 2016-10-05 11:20:51 UTC
With this commit (included in kernel 4.4.20) my Bay Trail tablet can finally run stable without limiting c-states.

a3043e mmc: sdhci-acpi: Reduce Baytrail eMMC/SD/SDIO hangs
https://git.kernel.org/cgit/linux/kernel/git/stable/linux-stable.git/commit/?id=a3043ecef71f5b880fe1b1d2aa77b3a896b86a0c
Comment 509 ladiko 2016-10-05 15:54:03 UTC
Which (newer) kernel versions include the patch? Ubuntu 16.04 is using 4.4.0, so it's not included. Do we have to wait for 18.04 or an LTS/HWE kernel? I am not sure I want to go for the mainline kernel, so I will stay with the script which disables C6, at least for the moment.
Comment 510 ladiko 2016-10-05 15:55:19 UTC
By the way - the patch is named "Reduce Baytrail eMMC/SD/SDIO hangs" - is this MMC/SD patch really related to CPU/GPU hangs?
Comment 511 Koen Roggemans 2016-10-05 16:19:55 UTC
Created attachment 240861 [details]
attachment-3924-0.html

I searched the Ubuntu kernels for "drm/i915: Never fully mask the the EI
up rps interrupt on SNB/IVB" and found in the full text of the fix at
https://bugs.launchpad.net/ubuntu/+source/linux/+bug/1615620 that this
fix is applied in 4.4.0-38.57, which is the currently released kernel
version for a standard 16.04 installation.

Comment 512 jjmeijer88 2016-10-05 16:26:51 UTC
(In reply to ladiko from comment #509)
> In which (newer) Kernel versions the patch is included? Ubuntu 16.04 is
> using 4.4.0, so it's not included. Do we have to wait for 18.04 or an
> LTS/HWE kernel? I am not sure if i want to go for the mainstream kernel and
> stay with the script which disables C6 - at least for the moment.

It's included in the longterm vanilla kernel 4.4.20 and up. You can install it manually or via the package manager I guess.

 http://kernel.ubuntu.com/~kernel-ppa/mainline/v4.4.23/

The i915 patch mentioned by Daniel is also in there.
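To check whether a running vanilla kernel is at least 4.4.20, something like the following sketch could be used. It relies on GNU sort's -V option, and the helper names are made up for this example. Note that it is only meaningful for vanilla kernels: distribution kernels such as Ubuntu's 4.4.0-38 can carry backported fixes under an older base version.

```shell
#!/bin/bash
# True if version $1 is greater than or equal to version $2.
# Uses the version ordering of GNU sort (-V).
version_ge() {
    [ "$(printf '%s\n%s\n' "$1" "$2" | sort -V | head -n 1)" = "$2" ]
}

# Strip everything from the first "-" of a kernel release string,
# e.g. "4.4.0-42-generic" -> "4.4.0". Defaults to the running kernel.
kernel_base_version() {
    local release="${1:-$(uname -r)}"
    echo "${release%%-*}"
}
```

For example, `version_ge "$(kernel_base_version)" 4.4.20 && echo "4.4.20 or newer"`.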


(In reply to ladiko from comment #510)
> By the way - the patch is named "Reduce Baytrail eMMC/SD/SDIO hangs" - is
> this MMC/SD patch really related to CPU/GPU hangs?

It's related to the subject of this thread, not directly to a hanging CPU/GPU. In my case I had a hanging MMC bus related to c-states. No GPU issues for me though :).
Comment 513 Brave Hurts 2016-10-11 17:40:16 UTC
> It's included in the longterm vanilla kernel 4.4.20 and up. You can install
> it manually or via the package manager I guess.
> 
>  http://kernel.ubuntu.com/~kernel-ppa/mainline/v4.4.23/
> 

Installed 4.4.24. This didn't fix the freezing on an N3700 running Ubuntu 14.04.

http://kernel.ubuntu.com/~kernel-ppa/mainline/v4.4.24/
Comment 514 Sebastian Heyn 2016-10-11 18:17:13 UTC
So this is an SD/eMMC-only bug? Is there no issue on the N3510/J1900 motherboards with a SATA HDD?

Is this only included in the >4.4.20 kernel line or also in later ones? (4.7 etc)
Comment 515 vad1m 2016-10-11 19:19:53 UTC
Finally, I don't have any freezes after installing the 4.8.0-997-generic kernel from http://kernel.ubuntu.com/~kernel-ppa/mainline/drm-intel-next/current/
Desktop motherboard with J1900 CPU and SATA HDD.
All C-states, including C7, seem to be working well (I haven't checked power consumption, but there were no freezes at all during about one week of continuous tests).
Comment 516 Libor Chmelik 2016-10-11 21:07:08 UTC
Indeed. I haven't experienced any freezes for nearly a week now. But I installed the normal kernel 4.8.0-040800-generic from http://kernel.ubuntu.com/~kernel-ppa/mainline/v4.8/
I removed the intel_idle.max_cstate=1 from the grub config. And so far no problems.
Laptop Acer Aspire E15 E5-511-P7AT with Pentium N3540 up to 2,66GHz.
The only thing I noticed during the kernel upgrade was a "nagging" about missing intel-drm-i915 firmware.
HD Playback on Youtube or in VLC works flawlessly. Also no troubles with Steam.
Comment 517 RussianNeuroMancer 2016-10-11 21:43:16 UTC
Is there a separate bug report about the SD/eMMC issue?
Comment 518 Martin 2016-10-12 06:46:38 UTC
I patched my current 4.5.4 with the mentioned patch but didn't have any success. So I suspect (hope) there's more to it than only those two lines. Will try to upgrade to 4.8 soon to see if that helps.
Comment 519 julio.borreguero@gmail.com 2016-10-12 07:11:58 UTC
After some months I tried the latest kernel, 4.8.1, from kernel.org.
It is still freezing.

my system:
https://bugzilla.kernel.org/attachment.cgi?id=198961
Comment 520 Sebastian Heyn 2016-10-13 08:56:07 UTC
I just changed from a non-crashing N3510 to a J1900 and I had a freeze after 5 minutes. I disabled the C-states in the BIOS, but that is hardly an energy-saving solution.
Comment 521 Brave Hurts 2016-10-13 16:24:25 UTC
(In reply to vad1m from comment #515)
> finally, I don't have any freezes after installing 4.8.0-997-generic kernel
> from http://kernel.ubuntu.com/~kernel-ppa/mainline/drm-intel-next/current/
> Desktop motherboard with J1900 CPU and SATA HDD.

Same for me on my N3700. I installed this one two days ago. Before that, it usually froze a few minutes after starting Firefox.
Comment 522 Todd Fulton 2016-10-13 17:19:04 UTC
I've been plagued by this bug ever since I got my laptop, like 1 1/2 - 2 years ago. Just saying.

My PC is an Acer E5-511p
The CPU: Intel(R) Pentium(R) CPU  N3530  @ 2.16GHz

That being said, I went months without a hangup/freeze using Ubuntu 16.04 LTS with various 4.4.0-X-generic kernels, using SELinux, xscreensaver (without gnome-screensaver installed), and intel_idle.max_cstate=1. I only ever rebooted when a new kernel came out. (I never tried without max_cstate; I should have.)

I got tired of trying to get SELinux working on Ubuntu and decided to go back to apparmor and gnome-screensaver, as well as upgrading to 4.4.0-42-generic (previous was 4.4.0-38-generic). Since switching to apparmor as LSM, gnome-screensaver, and 4.4.0-42 (yesterday), I get freezes with the periodic spinning fans again even with intel_idle.max_cstate=1, but seemingly only when I have not used the PC for 20+ minutes.

I switched to intel_idle.max_cstate=0 and am using this now:

$ cat /proc/cmdline 
BOOT_IMAGE=/boot/vmlinuz-4.4.0-42-generic.efi.signed root=UUID=cf4dc10b-511a-4369-ad5c-637833244929 ro apparmor=1 security=apparmor intel_idle.max_cstate=0
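To pull a single parameter out of a command line like the one above, a small helper along these lines might do. It is a sketch only: the function name cmdline_param is made up, and it does not handle quoted values that contain spaces.

```shell
#!/bin/bash
# Print the value of a "name=value" kernel command-line parameter.
# If the parameter is given more than once, the last occurrence is printed.
# Returns non-zero if the parameter is absent.
cmdline_param() {
    local name="$1" cmdline_file="${2:-/proc/cmdline}"
    local -a words
    local word value found=1
    # Split the single command line on whitespace.
    read -r -a words < "$cmdline_file" || true
    for word in "${words[@]}"; do
        if [[ "$word" == "$name="* ]]; then
            value="${word#*=}"   # everything after the first "="
            found=0
        fi
    done
    [ "$found" -eq 0 ] && echo "$value"
    return "$found"
}
```

For example, `cmdline_param intel_idle.max_cstate` prints 0 for the command line above. As far as I know, the value actually in effect can also be read from /sys/module/intel_idle/parameters/max_cstate, and according to the kernel parameter documentation max_cstate=0 disables intel_idle altogether and falls back to acpi_idle, rather than limiting the depth to C0.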

I will switch back, one by one, the things I have changed and see if that stops the crashes. I have a hunch that it was xscreensaver preventing the CPU from going idle that was preventing the hangups/freezes. Maybe no new info there.

I've had UEFI enabled and use the signed kernels, not sure if that matters as I have this issue even in BIOS mode.

Hope that helps.
Comment 523 Javier Antonio Nisa Avila 2016-10-13 17:37:09 UTC
Created attachment 241641 [details]
attachment-14281-0.html

Do you know whether the new Ubuntu version solves the bug?

Comment 524 kernelbugtracker 2016-10-13 17:40:24 UTC
I haven't experienced any freezes since v4.8 from  http://kernel.ubuntu.com/~kernel-ppa/mainline/ (generic version) on a lenovo yoga 300 (Intel Celeron N2940).
Comment 525 Todd Fulton 2016-10-13 18:06:51 UTC
(In reply to Javier Antonio Nisa Avila from comment #523)
> Do you know whether the new Ubuntu version solves the bug?

I'll try 16.10 out as well, no problem. I see it's running 4.8, thanks for the heads up on the release ;).
Comment 526 bms 2016-10-15 06:09:25 UTC
All,

  I have experienced the hard lockups on a NDis b324 using kernel version 3.13 (all minor variants from Ubuntu).  The time scale is on the order of several hours to several weeks; the result is always a hard lockup.  With 4.8.1 the hard lockup occurs after around 5 minutes.  With the cstate restriction I have yet to see it crash (in 48hrs of testing with 4.8.1 and the cstate restriction).

  I hate to say it, but I don't think this bug is going to get fixed, and that the workaround is the fix!

  What it will take is engineering time from Intel with fully-instrumented dev boards to analyse this, and the wherewithal to do the root cause analysis.

  We can merely speculate from the sidelines, and so I will speculate that this bug affects all operating systems, and because of code timing variations some OSes get lucky and others do not.  There may be no other fix than to disable the cstate management and suck the power loss.

-bms
Comment 527 mhartzel 2016-10-15 13:51:22 UTC
I suspect there are different bugs in different Intel chipsets / processors, since some fixes work for some people but not for all. It is also possible that some of these hardware bugs are impossible to fix in software. It has happened before in both Intel and AMD hardware. However, I have a stable system now and I wanted to share my findings to maybe help some other users with the same hardware.

My processor is: Intel(R) Celeron(R) CPU  N2930  @ 1.83GHz (Baytrail). The machine is Acer_Extensa_EX2508-C66M Laptop.

I use Gentoo on this system so I always build my own kernel. I had freezes right from the beginning (kernel 4.1), but found a workaround that worked for this system. I had a rock solid system with kernels 4.1 - 4.4 when I:

- chose "Intel P State Control" to be built into the kernel
- chose "Default CPU Frequency Governor" to be "Performance"
- booted the system with kernel option: intel_idle.max_cstate=0

This resulted in a constant processor frequency and an idle processor temperature between 48 - 50 degrees Celsius. I use this laptop for recording multitrack audio, so the stable processor frequency was a bonus. Heavy audio processing is more reliable when the processor speed is constant, because a stable processor frequency leads to predictable latencies, and multitrack audio software likes that. I don't care much about power consumption since my laptop is always plugged in, so I don't know what effect this might have had on battery life.

When I upgraded to kernel 4.7 this changed and I began to get the freezes again. I also noticed that my processor speed had begun to fluctuate even though I used the same kernel options as with kernels 4.1 - 4.4. It seems that the constant processor speed with my settings in kernels 4.1 - 4.4 was a bug and Intel had "fixed" this for 4.7. I have now found new settings that work for me with kernel 4.7.

What I did:

- chose "Intel P State Control" to be built into the kernel
- chose "Default CPU Frequency Governor" to be "Performance"
- booted the system with kernel option: intel_idle.max_cstate=0
- disabled turbo boost with: echo 1 > /sys/devices/system/cpu/intel_pstate/no_turbo
- disabled idle state 3 on all processor cores (example for core 0): echo 1 > /sys/devices/system/cpu/cpu0/cpuidle/state3/disable

The last option leaves the other idle states (0, 1 and 2) on and disables only idle state 3. These settings result in behaviour similar to that of the previous kernels, meaning a stable processor frequency (1.8 GHz) and an idle core temperature of about 48 - 50 degrees Celsius.
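The last two steps above can be applied to every core with a loop. This is a rough sketch, not the script linked in this comment: the function names are made up, it has to run as root to have any effect, and the settings are lost on reboot.

```shell
#!/bin/bash
# Write 1 to cpuidle/state<N>/disable on every CPU under the given
# sysfs-style root and print how many CPUs were changed.
# (disable_idle_state is a hypothetical helper name.)
disable_idle_state() {
    local state_index="$1" root="${2:-/sys/devices/system/cpu}"
    local f count=0
    for f in "$root"/cpu[0-9]*/cpuidle/state"$state_index"/disable; do
        [ -w "$f" ] || continue            # skip missing or read-only files
        echo 1 > "$f" && count=$((count + 1))
    done
    echo "$count"
}

# Disable turbo boost through the intel_pstate driver, if it is in use.
disable_turbo() {
    local f="${1:-/sys/devices/system/cpu}/intel_pstate/no_turbo"
    [ -w "$f" ] || return 1
    echo 1 > "$f"
}
```

As root, `disable_idle_state 3` followed by `disable_turbo` would correspond to the settings described above; both could be called from rc.local or a similar boot hook.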

I've used kernel 4.7 now for 8 days with no freezes, if one occurs I will disable power saving state 2 for all processor cores and so on until I have a stable system.

I have sometimes had turbo boost enabled and have not had any freezes so when it becomes evident in a couple of weeks that these settings really do work, I will enable turbo boost again and see if that has any effect on stability.

I use a script to disable idle state 3 on all processor cores and to disable turbo boost. It is based on another script discussed in this thread. You can download the script here:


https://dl.dropboxusercontent.com/u/2071830/disable_intel_processor_pstates.sh

or here:

http://pastebin.com/egTKmkwX
Comment 528 Sebastian Heyn 2016-10-15 23:03:30 UTC
Thanks for the advice. I applied your settings to Gentoo (vanilla-sources-4.8.1).
However, I additionally disabled SpeedStep in the BIOS.

It seems that my n3150 is much more stable than my j1900. I had a freeze within minutes on the j1900, even though it runs headless, with no X. The case has a 12cm fan next to the CPU, and it never goes higher than 30°C.

However! 

cpu MHz         : 479.980


cat /sys/bus/cpu/devices/cpu0/cpufreq/scaling_available_governors 
performance powersave

cat /sys/bus/cpu/devices/cpu0/cpufreq/scaling_governor 
performance

cat /sys/bus/cpu/devices/cpu0/cpufreq/scaling_cur_freq 
479980

cat /sys/bus/cpu/devices/cpu0/cpufreq/scaling_max_freq 
1600000


Weird. With the performance governor, the CPU should never clock down.
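To see the same values for all cores at once, a loop like this sketch could be used. It assumes the cpufreq files shown above exist under /sys/bus/cpu/devices/cpuN/cpufreq; the function name is made up for this example.

```shell
#!/bin/bash
# Print governor, current and maximum frequency (kHz) for every CPU
# that has a cpufreq directory under the given sysfs-style root.
# (cpufreq_report is a hypothetical helper name.)
cpufreq_report() {
    local root="${1:-/sys/bus/cpu/devices}"
    local dir cpu
    for dir in "$root"/cpu[0-9]*/cpufreq; do
        [ -d "$dir" ] || continue
        cpu=${dir%/cpufreq}
        cpu=${cpu##*/}
        printf '%s governor=%s cur_khz=%s max_khz=%s\n' "$cpu" \
            "$(cat "$dir/scaling_governor")" \
            "$(cat "$dir/scaling_cur_freq")" \
            "$(cat "$dir/scaling_max_freq")"
    done
}
```

Running it a few times in a row (for example with `watch`) shows whether the frequency really stays low under the performance governor or only dips while idle.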


Is there a microcode update for these CPUs? Does it make a difference?
Comment 529 Sebastian Heyn 2016-10-15 23:24:43 UTC
Adding "intel_pstate=disable" seems to disable any frequency variation. The CPU now sits at 1.6 GHz.
Comment 530 Jochen Hein 2016-10-16 05:38:29 UTC
Created attachment 241811 [details]
Patch to disable c-states at boot
Comment 531 Jochen Hein 2016-10-16 05:44:40 UTC
There might be the following errata affected:
VLP52: EOI Transactions May Not be Sent if Software Enters Core C6 During an Interrupt Service Routine
AAU36: EOI Transaction May Not be Sent if Software Enters Core C6 During an Interrupt Service Routine
AAN42: EOI Transaction May Not be Sent if Software Enters Core C6 During an Interrupt Service Routine
BN38: EOI Transaction May Not be Sent if Software Enters Core C6 During an Interrupt Service Routine
AAK76: EOI Transaction May Not be Sent if Software Enters Core C6 During an Interrupt Service Routine
BA106: EOI Transaction May Not be Sent if Software Enters Core C6 During an Interrupt Service Routine
AAJ72: EOI Transaction May Not be Sent if Software Enters Core C6 During an Interrupt Service Routine
Comment 532 bms 2016-10-16 06:05:44 UTC
As an experiment I've set up a google spreadsheet in the hopes you will enter details about your system(s), configurations you have tried, and the length of time that your test ran prior to failure.

The goal is simply to be able to mine the data.

The spreadsheet is (or rather should be) fully editable, so please don't abuse it; I think we all want this resolved.

https://docs.google.com/spreadsheets/d/1oajcMYL9oSt0O6VTpaIj0osGJxKGKSPSYtLnqr3UHNk/edit?usp=sharing

Here are some suggestions on how you should fill in the entries:

Column A:  
  Did your system end up in a locked up state?

Column B: 
  How long did your test run?  For example, if your test ended in a lock up, was it several hours or just a few minutes?  If you answered yes for column A, then enter the amount of time your computer ran prior to you rebooting or powering down the system.

Column C:
  The name / make of your machine.  We want to know who made the motherboard.

Column D:
  The model name of your CPU.

Column E:
  The code name for your CPU.  Naturally non-baytrail cpus that show similar failures will be interesting information to know.

Column F, G, H:
  Enter the details from the result of "cat /proc/cpuinfo".

Column I, J:
  use dmidecode to obtain the bios information, enter the version and vendor.

Column K:
  The linux kernel version: use "uname -a"

Column L:
  Did you modify the kernel boot parameters; if so, record them.

Column M:
  Additional notes: What other configurations did you do?  Did you use the c6 off script, etc...

Do add columns if you think the data relevant.
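Most of these values can be collected with a short script. The following is only a sketch: the helper names are made up, the choice of "cpu family", "model" and "stepping" for columns F, G and H is my assumption (the list above does not say which fields are meant), and dmidecode usually has to be run as root.

```shell
#!/bin/bash
# Print the first value of a field from a cpuinfo-style file.
# (cpuinfo_field is a hypothetical helper name.)
cpuinfo_field() {
    local field="$1" cpuinfo="${2:-/proc/cpuinfo}"
    awk -F'[ \t]*:[ \t]*' -v f="$field" '$1 == f { print $2; exit }' "$cpuinfo"
}

# Print the details asked for in the spreadsheet columns.
system_report() {
    echo "CPU model name:  $(cpuinfo_field 'model name')"
    # Assumed fields for columns F, G, H:
    echo "CPU family:      $(cpuinfo_field 'cpu family')"
    echo "CPU model:       $(cpuinfo_field 'model')"
    echo "CPU stepping:    $(cpuinfo_field 'stepping')"
    if command -v dmidecode >/dev/null 2>&1; then
        echo "BIOS vendor:     $(dmidecode -s bios-vendor 2>/dev/null)"
        echo "BIOS version:    $(dmidecode -s bios-version 2>/dev/null)"
    fi
    echo "Kernel:          $(uname -a)"
    echo "Boot parameters: $(cat /proc/cmdline)"
}
```

Running `system_report` as root prints everything in one go, ready to paste into the sheet.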

... just trying to get to the bottom of this...
Comment 533 mhartzel 2016-10-16 08:12:42 UTC
Kernel 4.8 seems to have some bugs in cpufreq. Intel has recently added a fix for these for kernel 4.9, so I will skip kernel 4.8 completely and use 4.9 when it comes out.

Here is the message telling about the Intel regression fixes for 4.9:

https://lkml.org/lkml/2016/10/14/288

Here is the Phoronix article mentioning it:

https://www.phoronix.com/scan.php?page=news_item&px=Linux-4.9-Atom-P-State-Algo
Comment 534 mhartzel 2016-10-16 12:55:59 UTC
BMS: Great idea :) There are a couple of other things to consider. On newer kernels (4.7)  you have the option of controlling processor performance either by ACPI P-State driver or Intel P-State driver. I had lockups when using the ACPI driver, Intel version works for me with no lockups.

Also it might be important to know which governor (powersave, ondemand, performance) people use, since it deeply affects how the processor uses power saving states.

I also added a column called "Reporter"; I hope this is all right. It helps when some additional information needs to be requested from the reporter.
Comment 535 ekemate93 2016-10-17 18:05:32 UTC
I have updated my home server to Ubuntu 16.10 which contains 4.8.0-22-generic, now I have 53 hours uptime.

Before that I used Ubuntu 16.04.1 LTS with 4.4 and I had to disable C-states in the bios to get a stable system. 

I hope this kernel solved the bug, I will keep the spreadsheet updated with my longtime results.

My system is ASRock Q1900-ITX with a J1900 CPU.
Comment 536 youling257 2016-10-17 22:42:28 UTC
(In reply to jbMacAZ from comment #191)
> I have a N3540 system that freezes at most a couple times a month without
> any arguments, kernel version doesn't seem to matter.  .max_cstate {0,1}
> stabilized it.  Looking at the recent posts, the N-series appears to be the
> processor benefiting most from the new suggestions.  But the more smoke that
> gets cleared, the sooner the rest of the problems can be found.
> 
> On my Z3775 system (T100CHI), kernel 4.5.0 without arguments didn't last 2
> minutes before freezing.  With idle=nomwait and it ran 2 hours before the
> time display froze (frozen seconds), the mouse cursor still moved.  Keyboard
> keys or mouse clicks were accepted about once every 90 seconds.
> 
> Next, maxcpus=2 and idle=nomwait produced a block of "serial8250: too much
> work for irq191" errors in dmesg.  Raising maxcpus to 3 got rid of them. 
> maxcpus= {2,3} yielded no obvious degradation when just browsing, etc, so
> I'll leave this running...  tsc may be destabilizing for some systems like
> mine.

I compiled the 4.9-rc1 kernel, and dmesg shows "serial8250: too much work for irq191".

4.8 does not have this problem.
Comment 537 Sudhanshu 2016-10-18 09:05:17 UTC
I have been suffering from the same issue, but on a Broadwell system (Dell XPS 9343, i5-5200). Restricting the max_cstate to 1 helps. c6off+c7on (after modifying it to work on BDW instead of BYT) does not. It works only when I disable all cstates except C1 and C1E (which is equivalent to max_cstate=1).

Though I have been following this thread for a long time, I never posted. I have lately been wondering if there are any other Broadwell users facing the same problem, and if there is a separate bug for them. I mean, though the symptoms are exactly the same, I am not 100% sure that the bug is the same.

Also, as most other users here, I have no logs anywhere - syslog/kern.log - which would help raising a separate bug request.

Summarising,
Are there any broadwell co-sufferers here?
Am I safe to assume this is the same bug as mine?
Comment 538 Dmitry 2016-10-18 13:24:51 UTC
I've also added my machine into google spreadsheet. 

"serial8250: too much work for irq191" - also see this when I try to turn bluetooth on. I've never managed to get it working though.
Comment 539 Wolfgang M. Reimer 2016-10-18 15:36:34 UTC
(In reply to Jochen Hein from comment #531)
> There might be the following errata affected:
> VLP52 EOI Transactions May Not be Sent if Software Enters Core C6
...
> AAJ72. EOI Transaction May Not be Sent if Software Enters Core C6 Duringan
> Interrupt Service Routine

Thanks Jochen, I started to dig and found out, that a lot of Intel processors suffer from erratum:

EOI Transaction May Not be Sent if Software Enters Core C6 During an Interrupt Service Routine

Here is the list (with links to the docs) I found so far: 

 [1] AAJ72: Intel Core i7-900 Desktop Processor Extreme Edition Series and Intel Core i7-900 Desktop Processor Series
     http://www.intel.com/content/dam/www/public/us/en/documents/specification-updates/core-i7-900-ee-and-desktop-processor-series-spec-update.pdf
 [2] AAK76: Intel Xeon Processor 5500 Series
     http://www.intel.com/content/dam/www/public/us/en/documents/specification-updates/xeon-5500-specification-update.pdf
 [3] AAM73: Intel Xeon Processor 3500 Series
     http://www.intel.com/Assets/en_US/PDF/specupdate/321333.pdf
 [4] AAN42: Intel Core i7-800 and i5-700 Desktop Processor Series
     http://www.intel.com/content/dam/www/public/us/en/documents/specification-updates/core-i7-800-i5-700-spec-update.pdf
 [5] AAO42: Intel Xeon Processor 3400 Series 
     http://www.intel.com/content/dam/www/public/us/en/documents/specification-updates/xeon-3400-specification-update.pdf
 [6] AAP41: Intel Core i7-900 Mobile Processor Extreme Edition Series, Intel Core i7-800 and i7-700 Mobile Processor Series
     http://www.intel.com/content/dam/www/public/us/en/documents/specification-updates/core-i7-900-mobile-ee-and-mobile-processor-series-spec-update.pdf
 [7] AAT32: Intel Core i7-600, i5-500, i5-400 and i3-300 Mobile Processor Series
     http://www.intel.com/content/dam/www/public/us/en/documents/specification-updates/core-mobile-spec-update.pdf
 [8] AAU36: Intel Core i5-600, i3-500 Desktop Processor Series and Intel Pentium Desktop Processor 6000 Series
     http://www.intel.com/content/dam/www/public/us/en/documents/specification-updates/core-i5-600-i3-500-pentium-6000-spec-update.pdf
 [9] AAY38: Intel Xeon Processor 3600 Series
     http://www.intel.com/content/dam/www/public/us/en/documents/specification-updates/xeon-3600-specification-update.pdf
[10] BA106: Intel Xeon Processor 7500 Series
     http://www.intel.com/content/dam/www/public/us/en/documents/specification-updates/xeon-processor-7500-series-specification-update.pdf
[11] BB38: Intel Atom Processor Z6xx Series 
     http://www.intel.com/content/dam/doc/specification-update/atom-z6xx-specification-update.pdf
[12] BC38: Intel Core i7-900 Desktop Processor Extreme Edition Series and Intel Core i7-900 Desktop Processor Series on 32-nm Process
     http://www.intel.com/content/dam/www/public/us/en/documents/specification-updates/core-i7-900-ee-and-desktop-processor-series-32nm-spec-update.pdf
[13] BD40: Intel Xeon Processor 5600 Series Specification Update
     http://www.intel.de/content/dam/www/public/us/en/documents/specification-updates/xeon-5600-specification-update.pdf
[14] BF41: Intel Xeon Processor C5500/C3500 Series
     http://www.intel.com/content/dam/www/public/us/en/documents/specification-updates/xeon-c5500-c3500-spec-update.pdf
[15] BG31: Intel Pentium P6000 and U5000 Mobile Processor Series
     http://www.intel.com/content/dam/www/public/us/en/documents/specification-updates/pentium-p6000-u5000-mobile-specification-update.pdf
[16] BI46: Intel Atom Processor E6xx Series
     http://www.intel.com/content/dam/www/public/us/en/documents/specification-updates/atom-e6xx-spec-update.pdf
[17] BN38: Intel Atom Processor Z600 Series
     http://www.intel.com/content/dam/www/public/us/en/documents/specification-updates/atom-z6xx-specification-update.pdf
[18] BP37: Intel Xeon Processor E7-8800/4800/2800 Product Families
     http://www.intel.com/content/dam/www/public/us/en/documents/specification-updates/xeon-e7-8800-4800-2800-families-specification-update.pdf
[19] CC5: Intel Atom Processor Z2760
     http://www.intel.com/content/dam/www/public/us/en/documents/specification-updates/atom-z2760-spec-update.pdf
[20] VLI55: Intel Atom Processor E3800
     http://www.intel.com/content/dam/www/public/us/en/documents/specification-updates/atom-e3800-family-spec-update.pdf
[21] VLP52: Intel Celeron and Pentium Processor N- and J-Series
     http://www.intel.com/content/dam/www/public/us/en/documents/specification-updates/pentium-n3520-j2850-celeron-n2920-n2820-n2815-n2806-j1850-j1750-spec-update.pdf
[22] VLT56: Intel Atom Processor Z3600 and Z3700 Series
     http://www.intel.com/content/dam/www/public/us/en/documents/specification-updates/atom-Z36xxx-Z37xxx-spec-update.pdf

The erratum is first mentioned in November 2008 [1] and a first patch for it (only for AAJ72-plagued processors reported in [1]) has been added by the Xen developers in September 2010:

https://lists.xen.org/archives/html/xen-devel/2010-09/msg00894.html
Comment 540 AB 2016-10-18 18:17:36 UTC
The image on the screen freezes, the USB ports do not work, and the ATX power button does not respond. Right?

I have hit a bug like this many times on an i5-4460 (P87 Gigabyte motherboard with an nvidia video card) after updating from kernel 3.13 (Ubuntu 14.04) to kernel 4.4 (Ubuntu 16.04). It appears more often in hot weather, when a USB Wi-Fi adapter is attached, and when the computer is idle. It is rarer when some programs are active and when the room is cold. I thought the cause was micro-cracks in the motherboard, but now that I have seen this ticket I will put off buying a new motherboard :)

I also saw a bug like this on an ASUS R556 with an i5-5200U and nvidia video on 4.4. That machine has now been updated to Ubuntu 16.10, and I am testing it with kernel 4.8.

intel_idle.max_cstate=1 did not help in either case on kernel 4.4.
Comment 541 Sudhanshu 2016-10-20 03:28:10 UTC
Also, on Broadwell, enabling *any* C-state beyond C1E causes the lockup. For Baytrail, as some users have pointed out, turning off just C7 and leaving the others enabled works.
Comment 542 HansPeterIngo 2016-10-20 12:10:57 UTC
I cannot confirm that Ubuntu 16.10 fixes the bug. The freezes still remain with Kernel 4.8.0-22.
Comment 543 FL 2016-10-20 19:39:40 UTC
Freezes after one hour of VLC...
OS: Arch Linux
Kernel: x86_64 Linux 4.8.2-1-ARCH
CPU: Intel Pentium CPU J2900 @ 2.4157GHz
GPU: Mesa DRI Intel(R) Bay Trail
Comment 544 Sebastian Heyn 2016-10-21 14:33:17 UTC
Yes I read that 4.8 has a faulty p-state implementation that should be fixed with 4.9.
Comment 545 Libor Chmelik 2016-10-21 17:55:49 UTC
(In reply to Libor Chmelik from comment #516)
> Indeed. I didn't experience any freezes so far since nearly a week now. But
> i installed the normal kernel 4.8.0-040800-generic from
> http://kernel.ubuntu.com/~kernel-ppa/mainline/v4.8/
> I removed the intel_idle.max_cstate=1 from the grub config. And so far no
> problems.
> Laptop Acer Aspire E15 E5-511-P7AT with Pentium N3540 up to 2,66GHz.
> The only thing I noticed during the kernel upgrade was a "nagging" about
> missing intel-drm-i915 firmware.
> HD Playback on Youtube or in VLC works flawlessly. Also no troubles with
> Steam.

Spoke too soon. It froze after all.
Situation 1 : Youtube in HD AND cpu's in forced performance mode (turbo boost ??)
Situation 2 : HD Playback in VLC in automatic powersave mode.
But it took 12 days until the first freeze. And 3 days later for the second.

Trying kernel 4.8.2 from ubuntu mainline now. c-state still disabled in grub.conf
Comment 546 Justin 2016-10-22 21:58:34 UTC
One week so far no crashes.  4.8.0-rc8-amd64

Options

GRUB_CMDLINE_LINUX_DEFAULT=intel_idle.max_cstate=5

In rc.local this script is run at boot...

 ----- 

#!/bin/bash
echo 1 > /sys/devices/system/cpu/cpu0/cpuidle/state3/disable
echo 1 > /sys/devices/system/cpu/intel_pstate/no_turbo


thanks
Comment 547 bzq7xy5gpj 2016-10-24 12:53:03 UTC
Let's see whether the Goldmont successor architecture CPUs 
https://en.wikipedia.org/wiki/Goldmont_(microarchitecture)#List_of_Goldmont_processors
are affected. The first motherboards/NUCs are already/soon available, like the ASRock J4205-ITX. Would be nice if someone could report on that. BTW: My experience has been that a Celeron-NUC is extremely slow for desktop usage and my next NUC is going to be a Pentium-NUC (if of course not affected, otherwise maybe something from AMD).
Comment 548 Sebastian Heyn 2016-10-24 20:18:23 UTC
Why wait for another product to spend money on? I have two boards that actually cost money and are not doing what they should. Before doing that I'll change to ARM CPUs out of frustration.

My J1900 reset twice today while doing nothing (headless, no X, 4.4 kernel). I have deactivated C-states in the BIOS. So far it seems stable.

I will give 4.9_rc1 a try in the morning and activate all c-states. I bought those boards to save me some power, not sit around doing nothing ;-)
Comment 549 Sebastian Heyn 2016-10-25 07:38:55 UTC
Has anyone tried this kernel yet?  

https://aur.archlinux.org/packages/linux-baytrail48/
Comment 550 thorsten 2016-10-29 17:03:11 UTC
I have a N2940 under Gentoo and keep running into the same bug.
i already tried:
4.7.5-gentoo
4.8.2-gentoo
and git-sources too:
4.9-rc1

still getting random freezes (depending on workload every 15 min to 2 hours).
Comment 551 mhartzel 2016-10-30 17:30:21 UTC
thorsten: Try the commands below, and report back. These eliminate hang ups on my N2930 with kernel 4.7 (Gentoo).

First start kernel with: intel_idle.max_cstate=0

Then give these commands as root:

echo 1 > /sys/devices/system/cpu/intel_pstate/no_turbo
echo 1 > /sys/devices/system/cpu/cpu0/cpuidle/state3/disable
echo 1 > /sys/devices/system/cpu/cpu1/cpuidle/state3/disable
echo 1 > /sys/devices/system/cpu/cpu2/cpuidle/state3/disable
echo 1 > /sys/devices/system/cpu/cpu3/cpuidle/state3/disable
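The per-CPU echo lines above can be generalised so that the number of cores does not have to be hard-coded. A minimal sketch, assuming the state to disable sits at the same index on every CPU. The function name disable_idle_state and the CPU_SYSFS override are made up for illustration; only the sysfs paths come from the commands above, and the no_turbo line stays as it is.

```shell
#!/bin/sh
# Sketch only: disable one cpuidle state (selected by index) on every CPU.
# CPU_SYSFS is an illustrative override so the loop can be tried on a copy
# of the directory tree; by default it points at the real sysfs location.
CPU_SYSFS=${CPU_SYSFS:-/sys/devices/system/cpu}

disable_idle_state() {
	# $1 = state index, e.g. 3 for .../cpuidle/state3
	for f in "$CPU_SYSFS"/cpu[0-9]*/cpuidle/state"$1"/disable; do
		[ -e "$f" ] || continue   # glob did not match or state is missing
		echo 1 > "$f"
	done
}
```

As root this would be called as: disable_idle_state 3. It is probably worth checking the name file next to each disable file first, since the index-to-name mapping can differ between idle drivers and CPUs.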
Comment 552 Elmar Melcher 2016-11-04 16:56:54 UTC
(In reply to Sebastian Heyn from comment #549)
> Has anyone tried this kernel yet?
>
> https://aur.archlinux.org/packages/linux-baytrail48/

Tried as indicated:
linux-4.8-3-baytrail-60cacd661dacfd0a7c4aa6f82d11f1c1664e70ad.tar.gz
cp config.x86_64 .config
make INSTALL_MOD_STRIP=1 rpm
then installed on an Atom Z3735G.
It has been running for 1 hour now without any kernel parameters, neither cstate nor tsc.

Everything else I ever tried crashed in less than 1 hour, sometimes in
1 minute, without kernel parameters.
Comment 553 Sebastian Heyn 2016-11-07 08:17:32 UTC
My system has been running stable for 2 weeks now with the Arch Linux baytrail kernel (I had to restart it to swap an HDD, however). Could it be that a headless system is much less likely to crash?
Comment 554 Daniel Glöckner 2016-11-07 09:51:40 UTC
(In reply to Sebastian Heyn from comment #553)
> Could it be that a headless system is much less likely to crash?

Yes, the initial bug report linked in the first comment thought the problem was related to the GPU driver. Even if you are using the unaccelerated efifb + xf86-video-fbdev driver, you are less likely to get the freeze. After all, the best way to trigger the problem so far has been to play videos.
Comment 555 Sebastian Heyn 2016-11-07 10:14:09 UTC
Hi Daniel,

Thanks. Does playing videos mean GPU decoding, or high-framerate framebuffer access via the CPU?
Comment 556 Paul Mansfield 2016-11-07 11:08:35 UTC
I've been using a J1900 board as a router/firewall/fileserver for a couple of years now, it's a gigabyte ga-j1900d3v (chosen for the dual gigabit NICs and the low power consumption). It's pretty stable, runs for weeks and weeks without locking up, but of course there's no video activity - often there's not even a monitor plugged in! However, when it does lock up, it needs a forced reset, as it will have locked up solid.
Comment 557 Paul Mansfield 2016-11-07 11:15:28 UTC
p.s. I've never used the cstate hack, always used stock kernels without any special patching.
Comment 558 thorsten 2016-11-07 20:47:42 UTC
Update on my side: 
4.8.4-gentoo has been working for several days so far without patches or disabling C-state options on my machine.

If anyone is interested, I have provided my kernel as a download with modules and initrd:

http://s000.tinyupload.com/index.php?file_id=06491416522851495522

md5sum:
4c7fbd190b8656899cfe3b35dbd6f185  kernel.tar.bz2

sha1sum:
3218d1a4064b649d64c46fa493c3d364f1f02737  kernel.tar.bz2

I have an Aspire ES1-311.
Would be interested if this kernel works on other machines, too.
Comment 559 Sebastian Heyn 2016-11-08 10:41:38 UTC
@Thorsten,

Can you check whether your /proc/cpuinfo shows the correct frequency info? Mine seems to be stuck below 500 MHz, using the ondemand governor.
Comment 560 thorsten 2016-11-08 19:44:21 UTC
@Sebastian,

I have 499 MHz shown in /proc/cpuinfo too, but I think it's a display error:

~# cpupower frequency-info
analyzing CPU 0:
  driver: intel_pstate
  CPUs which run at the same hardware frequency: 0
  CPUs which need to have their frequency coordinated by software: 0
  maximum transition latency:  Cannot determine or is not supported.
  hardware limits: 500 MHz - 2.25 GHz
  available cpufreq governors: performance powersave
  current policy: frequency should be within 500 MHz and 2.25 GHz.
                  The governor "powersave" may decide which speed to use
                  within this range.
  current CPU frequency: 500 MHz (asserted by call to hardware)
  boost state support:
    Supported: yes
    Active: yes

and:
~# cpupower frequency-info
analyzing CPU 0:
  driver: intel_pstate
  CPUs which run at the same hardware frequency: 0
  CPUs which need to have their frequency coordinated by software: 0
  maximum transition latency:  Cannot determine or is not supported.
  hardware limits: 500 MHz - 2.25 GHz
  available cpufreq governors: performance powersave
  current policy: frequency should be within 500 MHz and 2.25 GHz.
                  The governor "powersave" may decide which speed to use
                  within this range.
  current CPU frequency: 2.25 GHz (asserted by call to hardware)
  boost state support:
    Supported: yes
    Active: yes

Maybe the difference between my kernel and the 'regular' one is that CONFIG_CPU_FREQ_GOV_SCHEDUTIL is set as the default policy in my config.
Strange, though, that cpupower only shows powersave and performance as available governors.
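To cross-check the value in /proc/cpuinfo against what cpufreq itself reports, the per-CPU scaling_cur_freq files in sysfs can be read directly. A rough sketch: the function name show_cur_freq and the CPU_SYSFS override are illustrative, and the values are assumed to be in kHz, as cpufreq exports them.

```shell
#!/bin/sh
# Sketch only: print the current cpufreq frequency of every CPU in MHz.
# CPU_SYSFS is an illustrative override, defaulting to the real sysfs tree.
CPU_SYSFS=${CPU_SYSFS:-/sys/devices/system/cpu}

show_cur_freq() {
	for d in "$CPU_SYSFS"/cpu[0-9]*; do
		# CPUs without a cpufreq policy are skipped
		[ -r "$d/cpufreq/scaling_cur_freq" ] || continue
		read khz < "$d/cpufreq/scaling_cur_freq"
		echo "${d##*/}: $((khz / 1000)) MHz"
	done
}
```

Calling show_cur_freq a few times under load and at idle should show whether the frequency really stays at the 500 MHz floor or whether only the /proc/cpuinfo display is stale.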
Comment 561 Poumon 2016-11-10 20:44:09 UTC
@Thorsten

Hello! I use an Asus E502 with an Intel Corporation Atom Processor Z36xxx/Z37xxx.
I have been affected by this problem for 1 year.

Kernel 4.8.4 has been working for me for 2 days (without intel_idle.max_cstate=1). Does it still work for you?
Comment 562 luanrafael19 2016-11-11 14:00:27 UTC
I had these freezing problems on my Asus z550ma notebook, which uses the N2940 processor. Here is how I managed to solve the problem on Ubuntu 16.04:

1- I installed Intel's video drivers (https://01.org/linuxgraphics/downloads)
2- I updated the Ubuntu kernel to version 4.8.6
3- After updating the kernel, I went to Settings, Software & Updates, Additional Drivers and disabled "Processor microcode firmware for Intel CPUs" from intel-microcode (I selected the option not to use this device)
4- I restarted the notebook
5- I created a configuration file named "i915.conf" (type it without the quotes) and put this line in it: "options i915 modeset=1 enable_execlists=0" (type it without the quotes)
6- I copied this configuration file into the /etc/modprobe.d folder
7- I restarted the PC
8- Result: the notebook has not frozen for approximately 1 week (I use it all day)

NOTE: I did not need to add this option (intel_idle.max_cstate=1) in GRUB

This z550ma notebook also has a problem with the Wi-Fi card. To solve it, just install the card's drivers (the Diolinux website has a tutorial) and add a file to the /etc/modprobe.d folder

The file is named "rtl8723be.conf" (type it without the quotes) and must contain the line: "options rtl8723be fwlps=N ips=N" (type it without the quotes)

The internet now works normally here.

I hope this helps, everyone!!!

NOTE: Brazilians understand Linux too!!!
Comment 563 luanrafael19 2016-11-11 14:04:56 UTC
I had these freezing issues on my Asus z550ma notebook, which uses the N2940 processor. How I solved the problem in Ubuntu 16.04:

1- I installed Intel's video drivers (https://01.org/linuxgraphics/downloads)
2- Updated the Ubuntu kernel to version 4.8.6
3- After updating the kernel, I went to Settings, Software & Updates, Additional Drivers and deactivated "Processor microcode firmware for Intel CPUs" from intel-microcode (I chose not to use this device)
4- Restarted the notebook
5- I created a configuration file with the name "i915.conf" (enter without quotes) and inside it inserted the line: "options i915 modeset=1 enable_execlists=0" (enter without quotes)
6- Copied this configuration file into the /etc/modprobe.d folder
7- Restarted the PC
8- Result: the notebook has not frozen for approximately 1 week (I'm using it all day)

This z550ma notebook also has a problem with the Wi-Fi card. To solve it, simply install the drivers for the card (the Diolinux website has a tutorial) and add a file to the /etc/modprobe.d folder

The file has the name "rtl8723be.conf" (enter without quotes) and inside it should be written the line: "options rtl8723be fwlps=N ips=N" (enter without quotes)

The internet now works normally here.

I hope I have helped people !!!

Note: BR also understands Linux

Sorry for my bad English!!!!
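
The modprobe.d steps described above come down to writing one-line files into the modprobe.d directory. A small sketch: the helper name write_modprobe_options and its argument order are invented here, while the file names and option strings are the ones quoted in this comment.

```shell
#!/bin/sh
# Sketch only.
# write_modprobe_options DIR MODULE OPTION...
# Writes the single line "options MODULE OPTION..." to DIR/MODULE.conf.
write_modprobe_options() {
	wmo_dir=$1
	wmo_mod=$2
	shift 2
	# "$*" joins the remaining arguments with single spaces
	printf 'options %s %s\n' "$wmo_mod" "$*" > "$wmo_dir/$wmo_mod.conf"
}
```

As root, the two files from this comment would be created with: write_modprobe_options /etc/modprobe.d i915 modeset=1 enable_execlists=0 and write_modprobe_options /etc/modprobe.d rtl8723be fwlps=N ips=N. The options only take effect the next time the module is loaded, which for i915 usually means a reboot.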
Comment 564 Poumon 2016-11-11 18:02:57 UTC
@Thorsten

Ubuntu froze this morning after 3 days of usage with 4.8.4. False hope ...
Comment 565 mhartzel 2016-11-11 20:16:38 UTC
In my experience it may take a week or two before the first freeze happens. It would be very helpful if people could wait and use their machines for 7 - 14 days before declaring success. This would help us weed out false positives :)

Thanks for filling in your success and failure details in the spreadsheet BMS created. It seems to me patterns are emerging, so please keep filling in details about your experiments :)
Comment 566 RussianNeuroMancer 2016-11-11 23:21:34 UTC
I would say that if a freeze now takes a few days to reproduce, while before it took a few hours or even minutes, that is already a success to some degree.

By the way, these patches could be interesting for some subscribers: https://github.com/burzumishi/linux-baytrail-flexx10/tree/master/kernel/patches/v4.8

Especially 0001 and 0006 could probably reduce hangs even more.
Comment 567 thorsten 2016-11-12 06:52:49 UTC
@Poumon I had my first two freezes with my kernel yesterday; with my older kernels I had daily freezes without using other options. So sorry for the false positive.

@RussianNeuroMancer I also think it's an improvement if we can now use all the power-saving features on an 'unpatched' kernel for multiple days.
Comment 568 thorsten 2016-11-12 07:26:08 UTC
I have an unused J1800 desktop machine, so I'll try to reproduce the problem and maybe try to get a kernel stacktrace over serial terminal in the next week.
I hope we can pinpoint the actual origin of the problem this way. 

@RussianNeuroMancer if the problem were mmc-related, a regular desktop user without an mmc reader should not be affected, since the mmc modules would not be loaded, but maybe the other patches could change something.
Comment 569 RussianNeuroMancer 2016-11-12 07:37:14 UTC
@thorsten, if the problem is mmc-related, I wonder why it wasn't fixed a long time ago. Patches for the mmc hang have literally been available for years.
Comment 570 Ajay Garg 2016-11-15 14:00:20 UTC
I am facing freezing issues on a SOC running on Intel-Celeron-J1900. The devices are supposed to be deployed in areas with not a single human-being, so freezes are unacceptable. Also, I really don't care about power-consumption.

I was wondering why no one has tried the following kernel options:

intel_idle.max_cstate=0 processor.max_cstate=0 idle=poll


Unless I am mistaken, the above options would surely switch off all power-management possibilities?
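
Whichever parameters are chosen, it is worth confirming after a reboot that they actually reached the kernel. A minimal sketch that looks for an exact parameter in a command-line string: has_param is a made-up helper name, and /proc/cmdline is the standard place to read the running kernel's command line.

```shell
#!/bin/sh
# Sketch only.
# has_param "CMDLINE" PARAM
# Succeeds if PARAM appears in CMDLINE as a whole space-separated word.
has_param() {
	case " $1 " in
		*" $2 "*) return 0 ;;
		*) return 1 ;;
	esac
}
```

Example: has_param "$(cat /proc/cmdline)" idle=poll && echo "idle=poll is on the command line". Because whole words are compared, intel_idle.max_cstate=1 does not match a command line that contains intel_idle.max_cstate=10.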
Comment 571 Sebastian Heyn 2016-11-15 15:42:44 UTC
@Ajay:

Are the machines headless or is X running? Some boards allow switching off all power management in the BIOS.
Comment 572 Ajay Garg 2016-11-15 15:52:17 UTC
@Sebastien,

Thanks for the reply.

Nopes, each machine has Ubuntu-14.04.3 installed, with kernel upgraded manually to 3.19(-generic).

I don't have a board with me right now, so cannot confirm if there is an option in the BIOS. But irrespective of that, won't each of the kernel-options (as per my previous post) work?


The important question is, might

intel_idle.max_cstate=0 processor.max_cstate=0 idle=poll

break anything over a period of time?
Comment 573 Ajay Garg 2016-11-15 15:54:02 UTC
(In reply to Ajay Garg from comment #572)
> @Sebastien,
> 
> Thanks for the reply.
> 
> Nopes, each machine has Ubuntu-14.04.3 installed, with kernel upgraded
> manually to 3.19(-generic).

Pardon me, I meant a full-blown client-image of Ubuntu-14.04.3, with all the fancy GUI.


> 
> I don't have a board with me right now, so cannot confirm if there is an
> option in the BIOS. But irrespective of that, won't each of the
> kernel-options (as per my previous post) work?
> 
> 
> The important question is, might
> 
> intel_idle.max_cstate=0 processor.max_cstate=0 idle=poll
> 
> break anything over a period of time?
Comment 574 mhartzel 2016-11-15 17:06:34 UTC
@Ajay Garg 

I don't own a J1900 device, but I guess the main concern with those options is heat. The kernel documentation (https://www.kernel.org/doc/Documentation/kernel-parameters.txt) warns that idle=poll will make the machine run hot. It may be better to drop that option if possible.

I guess it is best to test the machines in advance. You could install software for monitoring CPU/GPU temperature and run workloads that will be typical once the machines are installed on location. That way you will get some hard data about the machines' reliability.

It's really frustrating that Intel hardware is as buggy as it is right now. I can't remember any worse period in Intel history. I guess they are really afraid of the flood of ARM devices and, in trying to compete with those, they are going too far with aggressive power-saving features.
Comment 575 DE 2016-11-15 21:00:49 UTC
(In reply to Wolfgang M. Reimer from comment #539)

Hello, you have missed the lucky 23.

[23] AAZ32: Intel Celeron P4000 and U3000 Mobile Processor Series
     http://www.intel.ie/content/dam/www/public/us/en/documents/specification-updates/celeron-mobile-p4000-u3000-specification-update.pdf
Comment 576 Martin Brand 2016-11-15 21:20:46 UTC
Just tried 4.8.7 on Ubuntu 16.04. This kernel won't even boot. I have a N3700 CPU. So back to 4.8.6.
Comment 577 thorsten 2016-11-16 06:25:09 UTC
I tried the serial console approach and could not get a kernel crash dump this way despite the machine freezing with 4.8.7. My guess is that, because this seems to be a hardware bug, the CPU is frozen before the kernel can produce a crash dump or write to my serial console.

@Ajay Garg
I would probably disable cpufreq (and C-states) altogether on a non-mobile machine. The downside would be a hotter and possibly louder machine. Also, if your chipset has watchdog functionality, that might be a way to reset the machine automatically after it freezes, if that helps with your application.
Alternatively, in your use case I would probably switch to unaffected hardware, e.g. a Celeron 847, or otherwise do a lot of testing first.

@Martin Brand
I have a Pentium N3700 too and have not been affected by this bug so far. Have you had freezes before?
Comment 578 André Hoogendoorn 2016-11-16 11:12:25 UTC
It IS a hardware bug and Intel should fix it.
Comment 579 Martin Brand 2016-11-16 17:40:46 UTC
Yes I did have freezes before, but never during boot. The c6off+c7on scripts from Wolfgang Reimer made my system usable. So thanks a lot for that!
Without the script it usually freezes within an hour. With c6off about once every one to two weeks. Still very annoying when it happens.
Comment 580 RussianNeuroMancer 2016-11-16 18:14:37 UTC
What are the conditions for entering C7? I ran this script and am looking at powertop output now; the cores spend 96-97% of the time in C1. So it looks like it doesn't differ much from the known intel_idle.max_cstate=1 workaround.
Comment 581 Martin Brand 2016-11-16 20:16:21 UTC
My Powertop displays the following
PowerTOP 2.8      Overview   Idle stats   Frequency stats   Device stats   Tunables
          Package   |             Core    |            CPU 0
                    |                     | C0 active   7,2%
                    |                     | POLL        0,0%    0,0 ms
                    | C1 (cc1)   23,7%    | C1-CHT     22,3%    1,2 ms
                    |                     |
                    |                     |
C6 (pc6)   18,1%    | C6 (cc6)   62,9%    | C6S-CHT     0,0%    0,0 ms
                    |                     | C7S-CHT    16,3%   20,7 ms

                    |             Core    |            CPU 1
                    |                     | C0 active  22,4%
                    |                     | POLL        0,0%    0,0 ms
                    | C1 (cc1)   11,3%    | C1-CHT      9,5%    2,1 ms
                    |                     |
                    |                     |
                    | C6 (cc6)   59,3%    | C6S-CHT     0,0%    0,0 ms
                    |                     | C7S-CHT    22,6%   22,9 ms
So it uses the C7 state. Battery life is better, and CPU temperature is also about 10°C lower than with the intel_idle.max_cstate=1 workaround.
Comment 582 RussianNeuroMancer 2016-11-17 02:41:32 UTC
OK, I need to clarify that in my case it's a BayTrail Z3735G CPU. After running the script I stopped getting PC7 and CC6 (which worked before the script run) and I don't get the expected C7S-BYT (constantly 0%, while C6S-BYT was sometimes over 90% before the script run). So for me the script's outcome is no different from the intel_idle.max_cstate=1 workaround.

Is there anyone who has working PC7/CC6/C7S-BYT on a BayTrail device after disabling C6S-BYT?
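
One way to check whether a core ever enters a given state, independently of powertop, is to read the per-state counters that cpuidle exports in sysfs (name, disable, usage, time). A sketch under the assumption that these four files are present for every state; list_idle_states and CPU_SYSFS are illustrative names.

```shell
#!/bin/sh
# Sketch only: dump the cpuidle counters of every CPU and idle state.
# CPU_SYSFS is an illustrative override, defaulting to the real sysfs tree.
CPU_SYSFS=${CPU_SYSFS:-/sys/devices/system/cpu}

# Prints one line per CPU and idle state:
#   cpuN stateM NAME disabled=0|1 usage=<entry count> time_us=<total residency>
list_idle_states() {
	for s in "$CPU_SYSFS"/cpu[0-9]*/cpuidle/state[0-9]*; do
		[ -r "$s/name" ] || continue
		read name < "$s/name"
		read disabled < "$s/disable"
		read usage < "$s/usage"
		read time_us < "$s/time"
		cpu=${s%/cpuidle/*}          # strip "/cpuidle/stateM"
		echo "${cpu##*/} ${s##*/} $name disabled=$disabled usage=$usage time_us=$time_us"
	done
}
```

Running list_idle_states twice a few minutes apart and comparing the usage column for C7S-BYT should show whether that state is being entered at all after C6S-BYT has been disabled.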
Comment 583 Elmar Melcher 2016-11-18 12:52:45 UTC
On CPU Z3735G, I always saw a Call Trace related to hard LOCKUP on the screen when it froze while in console mode. As also reported in other comments (#157, #568), these Call Traces were not logged by the system.

Today I received this message in a xterm:

Message from syslogd@leia at Nov 18 09:50:32 ...
 kernel:NMI watchdog: Watchdog detected hard LOCKUP on cpu 3 Modules linked in: msr r8723bs(O) intel_...

And found the complete Call Trace in dmesg:

[  261.956671] NMI watchdog: Watchdog detected hard LOCKUP on cpu 3
Modules linked in: msr r8723bs(O) intel_rapl intel_soc_dts_thermal nls_iso8859_1 intel_powerclamp coretemp nls_cp437 vfat kvm_intel fat kvm iTCO_wdt snd_soc_sst_bytcr_rt5640 gpio_keys iTCO_vendor_support irqbypass crct10dif_pclmul crc32_pclmul ghash_clmulni_intel aesni_intel aes_x86_64 lrw gf128mul joydev glue_helper input_leds snd_usb_audio snd_usbmidi_lib snd_hwdep mousedev ablk_helper cryptd snd_soc_rt5645 snd_rawmidi mac_hid cfg80211 intel_cstate pcspkr thermal evdev kxcjk_1013 snd_intel_sst_acpi snd_intel_sst_core tpm_crb industrialio_triggered_buffer snd_soc_rt5640 soc_button_array snd_soc_sst_mfld_platform kfifo_buf snd_soc_rl6231 snd_soc_sst_match industrialio dptf_power int3406_thermal int3403_thermal snd_soc_core processor_thermal_device int3402_thermal goodix int3400_thermal battery snd_compress int340x_thermal_zone snd_pcm_dmaengine acpi_thermal_rel intel_soc_dts_iosf ac97_bus hci_uart snd_seq snd_seq_device b
[  261.956681] CPU: 3 PID: 3055 Comm: inkscape Tainted: G           O    4.8.6-BAYTRAIL48 #1
[  261.956684] Hardware name: Positivo Informatica SA WCBT1013/WCBT1013, BIOS 1.7 06/09/2015
[  261.956687]  0000000000000086 000000000bea3d8f ffff880038643bf0 ffffffff812f9d4b
[  261.956690]  0000000000000000 0000000000000000 ffff880038643c08 ffffffff8111d918
[  261.956693]  ffff880038d30800 ffff880038643c40 ffffffff811613ac 0000000000000001
[  261.956695] Call Trace:
[  261.956698]  [<ffffffff812f9d4b>] dump_stack+0x63/0x88
[  261.956700]  [<ffffffff8111d918>] watchdog_overflow_callback+0xc8/0xf0
[  261.956703]  [<ffffffff811613ac>] __perf_event_overflow+0x7c/0x1b0
[  261.956706]  [<ffffffff81169664>] perf_event_overflow+0x14/0x20
[  261.956708]  [<ffffffff8100c147>] intel_pmu_handle_irq+0x1e7/0x4a0
[  261.956711]  [<ffffffff81185606>] ? __pagevec_lru_add_fn+0x186/0x290
[  261.956714]  [<ffffffff811f7395>] ? mem_cgroup_commit_charge+0x85/0x100
[  261.956716]  [<ffffffff81187209>] ? lru_cache_add_active_or_unevictable+0x39/0xc0
[  261.956719]  [<ffffffff811af8da>] ? handle_mm_fault+0x41a/0x1550
[  261.956722]  [<ffffffff810055ed>] perf_event_nmi_handler+0x2d/0x50
[  261.956724]  [<ffffffff810312d1>] nmi_handle+0x61/0x140
[  261.956727]  [<ffffffff81031878>] default_do_nmi+0x48/0x130
[  261.956730]  [<ffffffff81031a4b>] do_nmi+0xeb/0x160
[  261.956732]  [<ffffffff815f1406>] nmi+0x56/0xa5


The system did not freeze but continued to operate normally.

Kernel was 4.8.6 with following patches:

from https://aur.archlinux.org/packages/linux-baytrail48/, from linux-4.8-3-baytrail-60cacd661dacfd0a7c4aa6f82d11f1c1664e70a, baytrailfix[1-5].patch 

from https://github.com/burzumishi/linux-baytrail-flexx10/tree/master/kernel/patches/v4.8, patch 0001*, 0006*, and 0008*

and config:

from https://aur.archlinux.org/packages/linux-baytrail48, from linux-4.8-3-baytrail-60cacd661dacfd0a7c4aa6f82d11f1c1664e70a, config.x86_64

Clocksource during this event was
cat /sys/devices/system/clocksource/clocksource0/current_clocksource 
refined-jiffies

but often, after reboot in this configuration, clocksource is tsc.
Comment 584 Josep Pujadas-Jubany 2016-11-22 16:59:04 UTC
Scripts at comment #437 solve the problem. However, I had to modify c6off+c7on.sh in order for it to work on CHT (Cherry Trail) processors.

The latest stable kernel for Ubuntu (4.8.10) also seems to solve the problem.

More details at:

https://bugs.launchpad.net/ubuntu/+source/linux/+bug/1575467/comments/142
Comment 585 Martin Brand 2016-11-22 19:47:50 UTC
Same here. I have upgraded to 4.8.9. I was not able to boot 4.8.7. Kernel 4.8.9 with c6off+c7on.sh has definitely improved the situation. So far no crash. The system survived 2 hour HD film and an old windows game played with wine. This has not happened before, so I am very hopeful.
Comment 586 Ajay Garg 2016-11-28 09:06:28 UTC
Len Brown (and Intel in general) should be ashamed of themselves (assuming they still have some self-respect left).

Making mistakes is perfectly acceptable (we all make mistakes). But being shamelessly quiet is a sign of impotency.
Comment 588 Dan0780 2016-11-28 21:13:46 UTC
Hello,

For all of you having issues with this: I used the c6off+c7on script and it did not solve my problem. So I modified the script to turn off both C6 and C7, and I have not had a freeze in months. My altered script is below. Hope it helps some.

#!/bin/sh

#title:       c6off+c7off.sh
#description: Disables all C6 and C7 core states for Baytrail CPUs
#author:      Wolfgang Reimer <linuxball (at) gmail.com>
#date:        2016014
#version:     1.0    
#usage:       sudo <path>/c6off+c7off.sh
#notes:       Intended as test script to verify whether erratum VLP52 (see
#             [1]) is the root cause for kernel bug 109051 (see [2]). In order
#             for this to work you must _NOT_ use boot parameter
#             intel_idle.max_cstate=<number>.
#
# [1] http://www.intel.com/content/dam/www/public/us/en/documents/specification-updates/pentium-n3520-j2850-celeron-n2920-n2820-n2815-n2806-j1850-j1750-spec-update.pdf
# [2] https://bugzilla.kernel.org/show_bug.cgi?id=109051

# Disable ($1 == 1) or enable ($1 == 0) core state, if not yet done.
disable() {
	local action
	read disabled <disable
	test "$disabled" = $1 && return
	echo $1 >disable || return
	action=ENABLED; test "$1" = 0 || action=DISABLED
	printf "%-8s state %7s for %s.\n" $action "$name" $cpu  
}

# Iterate through each core state and for Baytrail (BYT) disable all C6 & C7 states.
cd /sys/devices/system/cpu
for cpu in cpu[0-9]*; do
	for dir in $cpu/cpuidle/state*; do
		cd "$dir"
		read name <name
		case $name in
			C6*-BYT) disable 1;;
			C7*-BYT) disable 1;;
		esac
		cd ../../..
	done
done
Comment 589 mhartzel 2016-11-29 10:18:52 UTC
@Dan0780 Please tell us what your processor is. Without this info we don't know in which cases your solution helps.

Thanks :)
Comment 590 mhartzel 2016-11-29 10:23:52 UTC
Dan0780: Could you please fill in details of your solution in the google spreadsheet BMS created so your solution will be easily found by people having the same hardware.

The spreadsheet is here:

https://docs.google.com/spreadsheets/d/1oajcMYL9oSt0O6VTpaIj0osGJxKGKSPSYtLnqr3UHNk/edit?usp=sharing
Comment 591 Dan0780 2016-11-29 12:21:07 UTC
Sorry, my processor is J1900.  I will try and fill out the spreadsheet
Comment 592 Michaël 2016-11-29 13:45:22 UTC
(In reply to Dan0780 from comment #591)
> Sorry, my processor is J1900.  I will try and fill out the spreadsheet

Dan; as far as I can see (a diff would have been useful), the difference with the original script is that you actually disable C7—this really does the same as max_cstate=1 then.
Comment 593 ladiko 2016-11-29 14:15:16 UTC
The script is unnecessarily complicated ...

# baytrail workaround for https://bugs.freedesktop.org/show_bug.cgi?id=88012
for state in /sys/devices/system/cpu/cpu*/cpuidle/state* ; do
	case "$(< "${state}/name")" in
		C6*-BYT|C6*-CHT) echo "1" > "${state}/disable" ;;
		C7*-BYT|C7*-CHT) echo "0" > "${state}/disable" ;;
	esac
done

or to disable C6 and C7:

for state in /sys/devices/system/cpu/cpu*/cpuidle/state* ; do
	case "$(< "${state}/name")" in
		C6*-BYT|C6*-CHT|C7*-BYT|C7*-CHT) echo "1" > "${state}/disable" ;;
	esac
done

It just lacks all the feedback of the other script, while the shell will complain if you don't have the right permissions. Otherwise, if everything is fine, you just get no feedback, which is fine for me.
Comment 594 Dan0780 2016-11-29 16:26:47 UTC
(In reply to Michaël from comment #592)
> (In reply to Dan0780 from comment #591)
> > Sorry, my processor is J1900.  I will try and fill out the spreadsheet
> 
> Dan; as far as I can see (a diff would have been useful), the difference
> with the original script is that you actually disable C7—this really does
> the same as max_cstate=1 then.

I did not have the option to set max_cstate=1 and therefore I modified the original script. Either way, I just wanted to share this: the original script, which disables C6 only, did not work for me, but disabling all of them worked and I have no issues.
Comment 595 Wolfgang M. Reimer 2016-11-29 16:58:11 UTC
(In reply to ladiko from comment #593)
> The script is unnecessarily complicated ...

... for you. I NEED the feedback (Usually I test hundreds of boxes with different combinations of enabled/disabled cstates and log the output for documentation purposes. For the next test result I need to document what exactly has changed from the previous state).

> 
> # baytrail workaround for https://bugs.freedesktop.org/show_bug.cgi?id=88012
> for state in /sys/devices/system/cpu/cpu*/cpuidle/state* ; do
>       case "$(< "${state}/name")" in
>               C6*-BYT|C6*-CHT) echo "1" > "${state}/disable" ;;
>               C7*-BYT|C7*-CHT) echo "0" > "${state}/disable" ;;
>       esac
> done
> 
> or to disable C6 and C7:
> 
> for state in /sys/devices/system/cpu/cpu*/cpuidle/state* ; do
>       case "$(< "${state}/name")" in
>               C6*-BYT|C6*-CHT|C7*-BYT|C7*-CHT) echo "1" > "${state}/disable"
> ;;
>       esac
> done
> 
> it just lacks all  the feedback of the other script while the shell will
> complain in case you dont have the right permissions. otherwise if
> everything is fine - you just get no feedback - fine for me.

... and your changed script IS NOT POSIX shell (dash, busybox' ash) compatible  any longer (which is REQUIRED in my case).
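
For reference, the short loop from comment #593 can be kept POSIX-compatible by replacing the bash-only $(< file) construct with read. A sketch, not a replacement for the logging script: c6off_c7on and CPU_SYSFS are illustrative names, and the name patterns are the ones used in comment #593.

```shell
#!/bin/sh
# Sketch only: POSIX sh variant of the short loop (works in dash, busybox ash).
# CPU_SYSFS is an illustrative override, defaulting to the real sysfs tree.
CPU_SYSFS=${CPU_SYSFS:-/sys/devices/system/cpu}

# Disable every C6 state and enable every C7 state on Bay Trail (BYT)
# and Cherry Trail (CHT); all other states are left untouched.
c6off_c7on() {
	for state in "$CPU_SYSFS"/cpu[0-9]*/cpuidle/state[0-9]*; do
		[ -r "$state/name" ] || continue
		read name < "$state/name"
		case $name in
			C6*-BYT|C6*-CHT) echo 1 > "$state/disable" ;;
			C7*-BYT|C7*-CHT) echo 0 > "$state/disable" ;;
		esac
	done
}
```

Like the loop it is derived from, this prints nothing on success, so it does not cover the change log that the longer script produces.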
Comment 596 Wolfgang M. Reimer 2016-11-29 17:04:39 UTC
(In reply to DE from comment #575)
> (In reply to Wolfgang M. Reimer from comment #539)
> 
> Hello, you have missed the lucky 23.
> 
> [23] AAZ32: Intel Celeron P4000 and U3000 Mobile Processor Series
>     
> http://www.intel.ie/content/dam/www/public/us/en/documents/specification-
> updates/celeron-mobile-p4000-u3000-specification-update.pdf

Thanks, added to my list.
Comment 597 Pshem K 2016-11-29 20:14:43 UTC
I have an unnamed board based on a J1900 with a number of GigE ports. The device is used as a router and runs headless. I've experimented with a number of combinations, and simply disabling the C6S-BYT state (using a script) made the biggest improvement for me (forcing max_cstate=1 also works, but the CPU runs hotter). Without C6S-BYT disabled, the uptime was never longer than 24h; sometimes the device would reboot, other times it locked up hard, never leaving a trace on the serial console of what exactly went wrong. This is using the standard 4.4.0-47-generic Ubuntu Xenial kernel. Now I have an uptime of over 7 days with no issues.
Comment 598 Hal 2016-12-01 17:06:14 UTC
After running my Zotac (Intel® Celeron® Processor N2930 Bay Trail family) box without any crashes for over 2 months (thanks to intel_idle.max_cstate=1) a few days ago I installed the latest Linux Mint stock version kernel (4.4.0-51) with intel_idle.max_cstate=1.

To my greatest surprise my machine froze a few hours later while it was in light use. I found it frozen again the following morning while it was idling overnight. Then again this morning completely frozen.

This is a significant regression in this machine's case, as from the very beginning of this saga intel_idle.max_cstate=1 has been a life saver, and until now no kernel version had frozen while using intel_idle.max_cstate=1.

So, right now I have it running with 4.4.35-040435-generic #201611260431 SMP (obtained from http://kernel.ubuntu.com/~kernel-ppa/mainline/)

Hopefully it will work better. I know that I can always go back to 4.4.0 from Ubuntu, but I am concerned that version 4.4.0 might have known security vulnerabilities.

Hal
Comment 599 Martin Brand 2016-12-01 19:57:19 UTC
Hi Hal,
I am using kernel 4.8.9 with the C6 states disabled. This has worked for me since November 21. Why don't you try this kernel before you go back to 4.4.0?
Comment 600 mhartzel 2016-12-01 21:19:48 UTC
@Hal: I have an N2930 processor and had freezes again when migrating from kernel 4.4 to 4.7. The commands below stopped the freezes when used with intel_idle.max_cstate=0. Try these :)

echo 1 > /sys/devices/system/cpu/intel_pstate/no_turbo
echo 1 > /sys/devices/system/cpu/cpu0/cpuidle/state3/disable
echo 1 > /sys/devices/system/cpu/cpu1/cpuidle/state3/disable
echo 1 > /sys/devices/system/cpu/cpu2/cpuidle/state3/disable
echo 1 > /sys/devices/system/cpu/cpu3/cpuidle/state3/disable
Comment 601 john 2016-12-02 08:51:08 UTC
max_cstate=1 does not work for my ASRock Q1900DC-ITX with a J1900 processor.
Using Ubuntu 16.04.1 LTS (GNU/Linux 4.4.0-51-generic x86_64)

Average uptime is around 48-72 hours; with the C6 and C7 fix it's even less, about 26 hours. I'm getting really frustrated with this. I'm thinking of buying an i3 barebone PC from Gigabyte: GB-BKi3A-7100 http://www.gigabyte.com/products/product-page.aspx?pid=6079#ov

Does the i3-7100U have similar issues with freezing?
Comment 602 Justin 2016-12-02 10:56:20 UTC
You also need to

#!/bin/bash
echo 1 > /sys/devices/system/cpu/cpu0/cpuidle/state6/disable
echo 1 > /sys/devices/system/cpu/cpu1/cpuidle/state6/disable
echo 1 > /sys/devices/system/cpu/cpu2/cpuidle/state6/disable
echo 1 > /sys/devices/system/cpu/cpu3/cpuidle/state6/disable

 in order to stop crashing... I did this with rc.local and a script.
Comment 603 john 2016-12-02 11:07:19 UTC
Okay, right now i have the script running from crontab:
@reboot /home/john/scripts/c6off+c7off.sh

can i just add the 

echo 1 > /sys/devices/system/cpu/cpu0/cpuidle/state6/disable
echo 1 > /sys/devices/system/cpu/cpu1/cpuidle/state6/disable
echo 1 > /sys/devices/system/cpu/cpu2/cpuidle/state6/disable
echo 1 > /sys/devices/system/cpu/cpu3/cpuidle/state6/disable

lines at the very end?
Comment 604 john 2016-12-02 11:34:36 UTC
(In reply to john from comment #603)
> Okay, right now i have the script running from crontab:
> @reboot /home/john/scripts/c6off+c7off.sh
> 
> can i just add the 
> 
> echo 1 > /sys/devices/system/cpu/cpu0/cpuidle/state6/disable
> echo 1 > /sys/devices/system/cpu/cpu1/cpuidle/state6/disable
> echo 1 > /sys/devices/system/cpu/cpu2/cpuidle/state6/disable
> echo 1 > /sys/devices/system/cpu/cpu3/cpuidle/state6/disable
> 
> lines very end?

Also, running ls -la in:

/sys/devices/system/cpu/cpu0/cpuidle

only gives 2 states:

state0  state1
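One possible explanation, not confirmed in this thread: state6 simply does not exist on this machine. The stateN index is only a position in the list the idle driver registered, and booting with intel_idle.max_cstate=1 is expected to leave just the shallowest states. A sketch that prints each state's index and name so the right entry can be picked by name rather than by number. `list_cstates` and its directory argument are hypothetical names.

```sh
#!/bin/sh
# Sketch only: print "stateN NAME" for every cpuidle state of one CPU.
# $1 is a hypothetical override of the per-CPU sysfs directory.
list_cstates() {
    cpu_dir="${1:-/sys/devices/system/cpu/cpu0}"
    for state in "$cpu_dir"/cpuidle/state*; do
        # Skip when the glob matched nothing or the entry is unreadable.
        [ -r "$state/name" ] || continue
        read -r name < "$state/name" || [ -n "$name" ] || continue
        # ${state##*/} strips the directory part, leaving e.g. "state1".
        printf '%s %s\n' "${state##*/}" "$name"
    done
}
```

If the output shows only two entries, writing to state6/disable cannot work, and the echo lines would have to target states that are actually listed.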
Comment 605 Hal 2016-12-02 17:09:56 UTC
(Follow up to own post #598)
> After running my Zotac (Intel® Celeron® Processor N2930 Bay Trail family)
> box without any crashes for over 2 months (thanks to
> intel_idle.max_cstate=1) a few days ago I installed the latest Linux Mint
> stock version kernel (4.4.0-51) with intel_idle.max_cstate=1.
> 
> To my greatest surprise my machine froze a few hours later while it was in
> light use. I found it frozen again the following morning while it was idling
> overnight. Then again this morning completely frozen.
> 
> This is a significant regression in this machine's case, as from the very
> beginning of this saga intel_idle.max_cstate=1 has been a life saver, and
> until now no kernel version had frozen while using intel_idle.max_cstate=1.
> 
> So, right now I have it running with 4.4.35-040435-generic #201611260431 SMP
> (obtained from http://kernel.ubuntu.com/~kernel-ppa/mainline/)
> 
> Hopefully it will work better. I know that I can always go back to 4.4.0
> from ubuntu, but I am concerned that version 4.4.0 might have known security
> vulnerabiities).
> 
> Hal

Thanks for the suggestions. For now, I am sticking to 4.4.35-040435-generic as it seems to be working fine. No freezing or crash in 25 hours!

One reason I am sticking to the 4.4 line is that it is a long term support version. In the past I ran 4.5.7, but then it was no longer maintained because of EOL. I am still not sure if the 4.8 strain will be long term or not. Also, my (albeit superficial) reading of comments about it gave me the impression that 4.8 started off on the wrong foot (insufficient QA on its initial release).

Anyway, I only update the kernel when I need it to support new devices (like USB 3.1 or 802.11 ac), or when I hear about new found vulnerabilities, and typically try to stay one step behind rather than at the cutting edge.

I'll keep the thread posted if anything new happens but so far 4.4.35 looks good!

Hal
Comment 606 nw9165-3201 2016-12-07 15:35:27 UTC
(In reply to Hal from comment #605)
> I am still not sure if the 4.8 strain will be long term or
> not.

4.8 will not be LTS. But 4.9 will be, see:

http://kroah.com/log/blog/2016/09/06/4-dot-9-equals-equals-next-lts-kernel/
Comment 607 Elmar Melcher 2016-12-08 14:26:20 UTC
Configuration from #583 stable during more than 2 weeks of daily use,
no kernel parameters, no C-state script.
Updated the spreadsheet.

Observed a 30% chance that tsc is rejected as clocksource during boot and refined-jiffies is used instead. In this case wall clock is almost 10x slower and keyboard repetition rate is extremely slow, and occasionally a hard lockup occurs in one processor core, but system continues working.
For this reason I will use kernel parameter tsc=reliable from now on.

Does it make sense to reject the tsc of a CPU that has the flags rdtscp, constant_tsc, and nonstop_tsc?
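Whether tsc was accepted on a given boot can be checked from sysfs before deciding on tsc=reliable. A sketch, assuming the standard clocksource0 sysfs directory. `show_clocksource` and its directory argument are hypothetical names.

```sh
#!/bin/sh
# Sketch only: report the active clocksource and warn if it is not tsc.
# $1 is a hypothetical override of the clocksource sysfs directory.
show_clocksource() {
    cs_dir="${1:-/sys/devices/system/clocksource/clocksource0}"
    read -r current < "$cs_dir/current_clocksource" \
        || [ -n "$current" ] || return 1
    printf 'current: %s\n' "$current"
    if [ "$current" != "tsc" ]; then
        printf 'warning: tsc is not the active clocksource\n'
    fi
}
```

Running it right after boot on each start would show how often the fallback to refined-jiffies described above actually happens.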
Comment 608 RussianNeuroMancer 2016-12-13 20:05:26 UTC
https://www.spinics.net/lists/linux-i2c/msg27520.html

> About this patch vs bug bko109051, yesterday I've spend time reading
> that entire bug. It seems it is a combination of at least 3 bugs
> combined, 2 i915 related with commits which seem to trigger
> the problem (2 different groups of users with a different problem
> it seems) which causes a hang every few hours. And one other
> bug where the system freezes in minutes, that one sounds like
> what I was seeing without this patch (but may well be yet
> another issue).
> 
> As for the 2 i915 bugs, there have been git bisects for both of
> them, it would be good if someone could take a look at these, just
> search for bisect in that huge bug.
Comment 609 Vincent Gerris 2016-12-14 17:54:45 UTC
Hi,

The patch here:
https://bugzilla.kernel.org/show_bug.cgi?id=109051#c530
seems to have fixed the problem for me on my N3520 Bay trail (Lenovo Yoga 2 11).

I changed the patch to be applied on 4.8.0-30, the default Ubuntu kernel on an updated Ubuntu 16.10.

Please test if that fixes the issue.
The patch (couldn't find attachment button):

diff --git a/drivers/idle/intel_idle.c b/drivers/idle/intel_idle.c
index 67ec58f..2a77317 100644
--- a/drivers/idle/intel_idle.c
+++ b/drivers/idle/intel_idle.c
@@ -1242,6 +1242,34 @@ static void bxt_idle_state_table_update(void)
 
 }
 /*
+ * byt_idle_state_table_update(void)
+ *
+ * On BYT, we have errata VLP52 and disable C6.
+ * https://bugzilla.kernel.org/show_bug.cgi?id=109051
+ * http://www.intel.com/content/dam/www/public/us/en/documents/specification-updates/pentium-n3520-j2850-celeron-n2920-n2820-n2815-n2806-j1850-j1750-spec-update.pdf
+ * VLP52 EOI Transactions May Not be Sent if Software Enters Core C6 During an Interrupt Service Routine.
+
+Problem:
+If core C6 is entered after the start of an interrupt service routine but before a write
+to the APIC EOI (End of Interrupt) register, and the core is woken up by an event
+other than a fixed interrupt source the core may drop the EOI transaction the next
+time APIC EOI register is written and further interrupts from the same or lower
+priority level will be blocked.
+
+Implication:
+EOI transactions may be lost and interrupts may be blocked when core C6 is used
+during interrupt service routines.
+
+Workaround:
+It is possible for the firmware to contain a workaround for this erratum.
+ */
+static void byt_idle_state_table_update(void)
+{
+	printk(PREFIX "byt_idle_state_table_update reached\n");
+	byt_cstates[1].disabled = 1;	/* C6N-BYT */
+	byt_cstates[2].disabled = 1;	/* C6S-BYT */
+}
+/*
  * sklh_idle_state_table_update(void)
  *
  * On SKL-H (model 0x5e) disable C8 and C9 if:
@@ -1299,6 +1327,11 @@ static void intel_idle_state_table_update(void)
 	case INTEL_FAM6_ATOM_GOLDMONT:
 		bxt_idle_state_table_update();
 		break;
+	case INTEL_FAM6_ATOM_SILVERMONT1: /* BYT */
+                printk(PREFIX "intel_idle_state_table_update BYT 0x37 reached\n");
+                byt_idle_state_table_update();
+                break;
+
 	case INTEL_FAM6_SKYLAKE_DESKTOP:
 		sklh_idle_state_table_update();
 		break;
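To check whether a machine is one that the patch above would match, the CPU family and model can be read from /proc/cpuinfo. The patch's own debug message gives the BYT model as 0x37, which is 55 in the decimal form that /proc/cpuinfo prints. A sketch; `is_baytrail` and its file argument are hypothetical names.

```sh
#!/bin/sh
# Sketch only: succeed if cpuinfo reports family 6, model 55 (0x37).
# $1 is a hypothetical override of the cpuinfo file (for dry runs).
is_baytrail() {
    cpuinfo="${1:-/proc/cpuinfo}"
    family=""
    model=""
    # Each line looks like "key<tabs>: value"; split on the first colon.
    while IFS=':' read -r key value; do
        case "$key" in
            "cpu family"*) family="${value# }" ;;
            "model name"*) ;;                      # not the numeric model
            model*)        model="${value# }" ;;
        esac
    done < "$cpuinfo"
    [ "$family" = "6" ] && [ "$model" = "55" ]
}
```

For example, `if is_baytrail; then echo "family 6 model 55"; fi` prints the message only on a matching CPU.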
Comment 610 Vincent Gerris 2016-12-14 17:56:41 UTC
Created attachment 247621 [details]
Patch for Bay trail for 4.8

based on https://bugzilla.kernel.org/show_bug.cgi?id=109051#c530
Comment 611 Pshem K 2016-12-14 19:29:47 UTC
(In reply to Pshem K from comment #597)
> I have an unnamed board based on a J1900 with a number of GigE ports. The
> device is used as a router and runs headless. I've experimented with a
> number of combinations and simply disabling C6S-BYT state (using a script)
> made the biggest improvement for me (forcing max_cstate=1 also works, but
> cpu runs hotter). Without  the C6S-BYT being disabled the uptime would be
> never longer than 24h, sometimes the device would reload some other times -
> locked up hard, never leaving a trace on the serial console on what exactly
> went wrong. This is using standard 4.4.0-47-generic Ubuntu Xenial kernel.
> Now I have an uptime of over 7 days with no issues.

Spoke too soon. It looks like I can only get stability with max_cstate=1. Disabling only C6 helped a lot, but occasionally the box would still lock up. The device acts as a router, and the lockups usually occur after long (a few hours), high-speed (300-600 Mb/s) transfers.
Currently running with the Ubuntu 4.4.0-53 kernel.
Comment 612 Vincent Gerris 2016-12-14 19:37:56 UTC
Please try the patches that were posted and report. Thank you
Comment 613 Vincent Gerris 2016-12-16 23:16:55 UTC
For the Ubuntu users, here are some precompiled kernels in deb package format, containing the Bay Trail fixes:
https://www.dropbox.com/sh/c39et4hr6tgp60q/AAC35c56aOEOwkmhjdvtG6dsa?dl=0

Please test and report back!
Comment 614 VoobScout 2016-12-17 10:22:34 UTC
(In reply to Vincent Gerris from comment #610)
> Created attachment 247621 [details]
> Patch for Bay trail for 4.8
> 
> based on https://bugzilla.kernel.org/show_bug.cgi?id=109051#c530

Can you confirm that this is needed for Z3735F CPU?

I'm currently running Arch 4.8.11-1-zen with 2 patches from https://github.com/ferbar/rtl8723bs/tree/master/patches_4.7 and "clocksource=tsc" in cmdline.

System appears stable, aside from misrecognized battery and lack of external physical keys support, it's a commodity tablet "Axdia international GmbH wintab 9 plus 3G/Tablet, BIOS 5.6.5 03/10/2015".
Comment 615 ladiko 2016-12-18 13:05:29 UTC
#611 if it is only a router, just don't start X and it should be stable.
Comment 616 Pshem K 2016-12-18 21:39:52 UTC
(In reply to ladiko from comment #615)
> #611 if it is only a router, just don't start X and it should be stable.

There is no X on that box. Not running it is not sufficient to make the machine stable. Without the max_cstate=1 the device eventually locks up.
Comment 617 Vincent Gerris 2016-12-18 22:09:27 UTC
(In reply to VoobScout from comment #614)
> (In reply to Vincent Gerris from comment #610)
> > Created attachment 247621 [details]
> > Patch for Bay trail for 4.8
> > 
> > based on https://bugzilla.kernel.org/show_bug.cgi?id=109051#c530
> 
> Can you confirm that this is needed for Z3735F CPU?
> 
> I'm currently running Arch 4.8.11-1-zen with 2 patches from
> https://github.com/ferbar/rtl8723bs/tree/master/patches_4.7 and
> "clocksource=tsc" in cmdline.
> 
> System appears stable, aside from misrecognized battery and lack of external
> physical keys support, it's a commodity tablet "Axdia international GmbH
> wintab 9 plus 3G/Tablet, BIOS 5.6.5 03/10/2015".

Hi, it seems like a Bay trail processor so I think you need it.
I don't know about that Arch kernel, I saw they also patched a 4.8 kernel.
The patch from Jochen Hein works for me as does my modification for 4.8 kernels. 
Just compile your own kernel with the patch applied according to your Linux version to be sure.
Comment 618 Prashant Poonia 2016-12-23 15:36:58 UTC
I am currently using the 4.8.0-32 kernel installed from Linux Mint 18's update manager. The system has been stable without intel_idle.max_cstate=1 so far. Do test this kernel out.
Comment 619 mhartzel 2016-12-24 12:26:33 UTC
Prashant Poonia: Please tell us what your processor is, otherwise your success story is quite useless to others :) Different processors have different bugs and also different workarounds. One solution does not fit all :)
Comment 620 Prashant Poonia 2016-12-25 11:49:26 UTC
(In reply to mhartzel from comment #619)
> Prashant Poonia: Please tell us what your processor is, otherwise your
> success story is quite useless to others :) Different processors have
> different bugs and also different workarounds. One solution does not fit all
> :)

sorry :D
its N3540 Baytrail
laptop is asus x553MA
linuxmint 18 with kernel 4.8.0-32
The updated Yakkety Yak WiFi driver for this kernel causes freezes when using WiFi; otherwise it works flawlessly. Hope this helps someone, and I recommend you check it out.
Comment 621 Gi_44 2016-12-25 15:00:53 UTC
Hello everybody,
I'm new here.

I would like to share my story.

(Also in reply to the last posts from VoobScout and Vincent Gerris, comments #614 and #617 respectively.)

I recently installed different Linux flavors on this Z3735F mini machine:
https://www.aliexpress.com/store/product/2016-QOTOM-Micro-ITX-motherboard-Z3735F-with-2GB-RAM-32GB-SSD-WIFI-Bluetooth-support-Win-8/108231_32694240800.html - (wiped of all the MS stuff when received).

The native Jessie multiarch (https://cdimage.debian.org/cdimage/unofficial/non-free/cd-including-firmware/8.6.0+nonfree/multi-arch/iso-cd/) works fine out of the box and is stable without any adjustments, but has no HDMI, WiFi, sound or Bluetooth.

Trying different Debian kernels (https://github.com/hadess/rtl8723bs/wiki/RTL8723BS-module-building-instruction-for-Debian-GNU-Linux) gave instability and a lot of freezes.

I also tried the "Linuxium - LUBUNTU 16.04 OS", which works fine, is stable, and has working WiFi (RTL8723bs) out of the box, but still no sound, HDMI or Bluetooth.

($ inxi -F
System:    Host: gil-lbnt Kernel: 4.4.0-31-linuxium x86_64 (64 bit)
Desktop: LXDE (Openbox 3.6.1) Distro: Ubuntu 16.04 xenial
Machine:   Mobo: AMI model: Aptio CRB
Bios: American Megatrends v: 5.6.5 date: 08/01/2015
CPU:       Quad core Intel Atom Z3735F (-MCP-) cache: 1024 KB
clock speeds: max: 1832 MHz 1: 705 MHz 2: 1426 MHz 3: 1140 MHz
4: 1374 MHz
Graphics:  Card: Intel Atom Processor Z36xxx/Z37xxx Series Graphics & Display
Display Server: X.Org 1.18.4 drivers: intel (unloaded: fbdev,vesa)
Resolution: 1366x768@59.79hz
GLX Renderer: Mesa DRI Intel Bay Trail GLX Version: 3.0 Mesa 11.2.0
Audio:     Card IntelHDMI driver: IntelHDMI Sound: ALSA v: k4.4.0-31-linuxium
Network:   Card: Failed to Detect Network Card!
Drives:    HDD Total Size: 15.5GB (Used Error!)
ID-1: /dev/mmcblk0 model: N/A size: 31.0GB
ID-2: USB /dev/sda model: USB_DISK_3.0 size: 15.5GB
Partition: ID-1: / size: 10G used: 5.3G (57%) fs: ext4 dev: /dev/mmcblk0p2
ID-2: /home size: 17G used: 3.4G (21%) fs: ext4 dev: /dev/mmcblk0p4
ID-3: swap-1 size: 1.00GB used: 0.00GB (0%) fs: swap dev: /dev/mmcblk0p3
RAID:      No RAID devices: /proc/mdstat, md_mod kernel module present
Sensors:   System Temperatures: cpu: 45.0C mobo: N/A
Fan Speeds (in rpm): cpu: N/A
Info:      Processes: 189 Uptime: 0 min Memory: 258.1/1939.2MB
Client: Shell (bash) inxi: 2.2.35 )

Recently, I upgraded to the generic kernel 4.8 (http://sourcedigit.com/21520-upgrade-linux-kernel-4-8-10-install-linux-kernel-4-8-10-ubuntu/)

And after rebooting, I installed this r8723bs module version:
https://github.com/ferbar/rtl8723bs

($ inxi -F
System:    Host: gil-lbnt Kernel: 4.8.10-040810-generic x86_64 (64 bit)
Desktop: LXDE (Openbox 3.6.1) Distro: Ubuntu 16.04 xenial
Machine:   Mobo: AMI model: Aptio CRB
Bios: American Megatrends v: 5.6.5 date: 08/01/2015
CPU:       Quad core Intel Atom Z3735F (-MCP-) cache: 1024 KB
clock speeds: max: 1832 MHz 1: 499 MHz 2: 499 MHz 3: 499 MHz
4: 499 MHz
Graphics:  Card: Intel Atom Processor Z36xxx/Z37xxx Series Graphics & Display
Display Server: X.Org 1.18.4 drivers: intel (unloaded: fbdev,vesa)
Resolution: 1366x768@59.79hz
GLX Renderer: Mesa DRI Intel Bay Trail GLX Version: 3.0 Mesa 11.2.0
Audio:     Card-1 bytcr-rt5640 driver: bytcr-rt5640
Card-2 USB Audio DAC driver: USB-Audio
Card-3 Texas Instruments Audio Codec driver: USB Audio
Sound: Advanced Linux Sound Architecture v: k4.8.10-040810-generic
Network:   Card: Failed to Detect Network Card!
Drives:    HDD Total Size: 15.5GB (Used Error!)
ID-1: /dev/mmcblk1 model: N/A size: 31.0GB
ID-2: USB /dev/sda model: USB_DISK_3.0 size: 15.5GB
Partition: ID-1: / size: 10G used: 5.3G (57%) fs: ext4 dev: /dev/mmcblk1p2
ID-2: /home size: 17G used: 3.4G (21%) fs: ext4 dev: /dev/mmcblk1p4
ID-3: swap-1 size: 1.00GB used: 0.00GB (0%) fs: swap dev: /dev/mmcblk1p3
RAID:      No RAID devices: /proc/mdstat, md_mod kernel module present
Sensors:   System Temperatures: cpu: 43.0C mobo: N/A
Fan Speeds (in rpm): cpu: N/A
Info:      Processes: 190 Uptime: 7 min Memory: 287.5/1938.7MB
Client: Shell (bash) inxi: 2.2.35
)

The 4.8 kernel now seems stable (one week, running 24/7), with WiFi working well (but still no sound, HDMI or Bluetooth).
Comment 622 Prashant Poonia 2016-12-25 18:12:00 UTC
(In reply to Gi_44 from comment #621)
> Hello to everybody
> I 'm new here.
> 
> I would like to share my story.
> 
> (In reply too to VoobScout and Vincent Gerris lasts posts-  comment #614 and
> 617, respectively)
> 
> 
> I recently built different Linux Flavors  on this Z3735F mini machine :
> https://www.aliexpress.com/store/product/2016-QOTOM-Micro-ITX-motherboard-
> Z3735F-with-2GB-RAM-32GB-SSD-WIFI-Bluetooth-support-Win-8/108231_32694240800.
> html - (Swiped for all of the the MS stuff when received.)
> 
> 
> The native Jessie multiarch
> (https://cdimage.debian.org/cdimage/unofficial/non-free/cd-including-
> firmware/8.6.0+nonfree/multi-arch/iso-cd/) works fine directly, is stable,
> without no adjustments but with, however, no HDMI, WIFI, sound and
> Bluetooth....
> 
> 
> Trying with Debian different kernels
> (https://github.com/hadess/rtl8723bs/wiki/RTL8723BS-module-building-
> instruction-for-Debian-GNU-Linux) gave instability and a lot of freezes.
> 
> I also tried the 'Linuxium - LUBUNTU 16.04 OS" that works fine, is stable
> and the wifi is directly well active (RTL8723bs) but still without sound,
> HDMI or Bluetooth.
> 
> ($ inxi -F
> System:    Host: gil-lbnt Kernel: 4.4.0-31-linuxium x86_64 (64 bit)
> Desktop: LXDE (Openbox 3.6.1) Distro: Ubuntu 16.04 xenial
> Machine:   Mobo: AMI model: Aptio CRB
> Bios: American Megatrends v: 5.6.5 date: 08/01/2015
> CPU:       Quad core Intel Atom Z3735F (-MCP-) cache: 1024 KB
> clock speeds: max: 1832 MHz 1: 705 MHz 2: 1426 MHz 3: 1140 MHz
> 4: 1374 MHz
> Graphics:  Card: Intel Atom Processor Z36xxx/Z37xxx Series Graphics & Display
> Display Server: X.Org 1.18.4 drivers: intel (unloaded: fbdev,vesa)
> Resolution: 1366x768@59.79hz
> GLX Renderer: Mesa DRI Intel Bay Trail GLX Version: 3.0 Mesa 11.2.0
> Audio:     Card IntelHDMI driver: IntelHDMI Sound: ALSA v: k4.4.0-31-linuxium
> Network:   Card: Failed to Detect Network Card!
> Drives:    HDD Total Size: 15.5GB (Used Error!)
> ID-1: /dev/mmcblk0 model: N/A size: 31.0GB
> ID-2: USB /dev/sda model: USB_DISK_3.0 size: 15.5GB
> Partition: ID-1: / size: 10G used: 5.3G (57%) fs: ext4 dev: /dev/mmcblk0p2
> ID-2: /home size: 17G used: 3.4G (21%) fs: ext4 dev: /dev/mmcblk0p4
> ID-3: swap-1 size: 1.00GB used: 0.00GB (0%) fs: swap dev: /dev/mmcblk0p3
> RAID:      No RAID devices: /proc/mdstat, md_mod kernel module present
> Sensors:   System Temperatures: cpu: 45.0C mobo: N/A
> Fan Speeds (in rpm): cpu: N/A
> Info:      Processes: 189 Uptime: 0 min Memory: 258.1/1939.2MB
> Client: Shell (bash) inxi: 2.2.35 )
> 
> Recently, I upgraded to the generic kernel 4.8
> (http://sourcedigit.com/21520-upgrade-linux-kernel-4-8-10-install-linux-
> kernel-4-8-10-ubuntu/)
> 
> And after rebooting, I installed this r8723bs module version:
> https://github.com/ferbar/rtl8723bs
> 
> ($ inxi -F
> System:    Host: gil-lbnt Kernel: 4.8.10-040810-generic x86_64 (64 bit)
> Desktop: LXDE (Openbox 3.6.1) Distro: Ubuntu 16.04 xenial
> Machine:   Mobo: AMI model: Aptio CRB
> Bios: American Megatrends v: 5.6.5 date: 08/01/2015
> CPU:       Quad core Intel Atom Z3735F (-MCP-) cache: 1024 KB
> clock speeds: max: 1832 MHz 1: 499 MHz 2: 499 MHz 3: 499 MHz
> 4: 499 MHz
> Graphics:  Card: Intel Atom Processor Z36xxx/Z37xxx Series Graphics & Display
> Display Server: X.Org 1.18.4 drivers: intel (unloaded: fbdev,vesa)
> Resolution: 1366x768@59.79hz
> GLX Renderer: Mesa DRI Intel Bay Trail GLX Version: 3.0 Mesa 11.2.0
> Audio:     Card-1 bytcr-rt5640 driver: bytcr-rt5640
> Card-2 USB Audio DAC driver: USB-Audio
> Card-3 Texas Instruments Audio Codec driver: USB Audio
> Sound: Advanced Linux Sound Architecture v: k4.8.10-040810-generic
> Network:   Card: Failed to Detect Network Card!
> Drives:    HDD Total Size: 15.5GB (Used Error!)
> ID-1: /dev/mmcblk1 model: N/A size: 31.0GB
> ID-2: USB /dev/sda model: USB_DISK_3.0 size: 15.5GB
> Partition: ID-1: / size: 10G used: 5.3G (57%) fs: ext4 dev: /dev/mmcblk1p2
> ID-2: /home size: 17G used: 3.4G (21%) fs: ext4 dev: /dev/mmcblk1p4
> ID-3: swap-1 size: 1.00GB used: 0.00GB (0%) fs: swap dev: /dev/mmcblk1p3
> RAID:      No RAID devices: /proc/mdstat, md_mod kernel module present
> Sensors:   System Temperatures: cpu: 43.0C mobo: N/A
> Fan Speeds (in rpm): cpu: N/A
> Info:      Processes: 190 Uptime: 7 min Memory: 287.5/1938.7MB
> Client: Shell (bash) inxi: 2.2.35
> )
> 
> Now, the 4.8 kernel seems well stable (one week, 24/24), with the wifi well
> working... (always no sound, no HDMI nor Bluetooth).

Can you post the link where you downloaded the WiFi driver?
I am also running a 4.8 kernel with no issues except lockups when downloading significant amounts of data over WiFi.
Comment 623 Gi_44 2016-12-25 19:52:37 UTC
Hi Prashant Poonia
The address is in the post
Here -> : 
" And after rebooting, I installed this r8723bs module version:
  https://github.com/ferbar/rtl8723bs"
Comment 624 Prashant Poonia 2016-12-25 22:36:53 UTC
(In reply to Gi_44 from comment #623)
> Hi Prashant Poonia
> The address is in the post
> Here -> : 
> " And after rebooting, I installed this r8723bs module version:
>   https://github.com/ferbar/rtl8723bs"

Oh! It's a Realtek driver, my bad luck.

Can anyone else test and confirm 4.8?
Comment 625 Vincent Gerris 2016-12-25 22:44:45 UTC
Created attachment 248541 [details]
attachment-26085-0.html

Hi,

I already tested it and reported that it freezes on my N3520. It may take
longer, but it will.
Please test it very thoroughly and for a long time.
The processor has an erratum, so it does not make sense that it would be fixed
unless the kernel is patched or you somehow had a firmware update.

I really hope you can do some more thorough testing, and please try
not to get excited too soon. It could help if you try the patch Jochen Hein
posted, or the 4.8.x mod I posted, or any of the precompiled kernels, and
test power management.

Even if you do not get a freeze on your setup, it may be a good extra
reason to have the patch applied in mainline with some
priority, since this still affects many users.

Thank you

Comment 626 Jochen Hein 2016-12-25 23:42:21 UTC
I'm running with this in /etc/modprobe.d/rtl8723be.conf:

options rtl8723be fwlps=0 swlps=0

Otherwise Wifi is unstable for me
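Whether the options took effect after reloading the module can be checked through the module's parameters directory in sysfs. A sketch, assuming the rtl8723be module exposes fwlps and swlps there as readable parameters (boolean parameters are typically shown as Y or N). `show_wifi_ps_params` and its directory argument are hypothetical names.

```sh
#!/bin/sh
# Sketch only: print the current values of the two power-save options.
# $1 is a hypothetical override of the module parameters directory.
show_wifi_ps_params() {
    param_dir="${1:-/sys/module/rtl8723be/parameters}"
    for p in fwlps swlps; do
        if [ -r "$param_dir/$p" ]; then
            read -r val < "$param_dir/$p" || [ -n "$val" ] || val="?"
            printf '%s=%s\n' "$p" "$val"
        else
            # Module not loaded, or the parameter is not exported.
            printf '%s: not available\n' "$p"
        fi
    done
}
```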
Comment 627 pilot_6572 2016-12-27 21:55:17 UTC
It is strange.

I have the same Qotom Z3735F board, but only Jessie with 3.16 is stable, and only without any of the Bay Trail and Cherry Trail drivers.

With the 4.7 and 4.8 kernels, freezes happen very quickly, and with the r8723bs drivers (ferbar or hadess) or with the Linuxium OSs, the systems overload and crash insanely...
Comment 628 Len Brown 2016-12-27 21:57:22 UTC
Created attachment 248751 [details]
Debug patch to enable BYT C6 auto-demotion

Please test this patch and report whether it has any
effect on the stability issue.

You can verify that it is applied and running via dmesg:

dmesg | grep idle
intel_idle: BYT C6 auto-demotion-disable: 0

Under some conditions, it will reduce the amount
of C6 residency, which you can observe with turbostat:

 # turbostat --debug -o ts.out sleep 10
Comment 629 Len Brown 2016-12-28 11:43:07 UTC
Created attachment 248841 [details]
nanosleep.c

As mentioned above, idle-related failures become more rare
when heavy load is added to the system.  So a "stress test"
for idle entry/exit does not add computation.  Instead,
it does almost no work except waking and going back to sleep.

Attached is a little program, nanosleep.c, that can be used
as an idle "stress test".  It has a random element,
so running it for longer duration will provoke
a wider variety of timing.  Also, my intent was that one
copy be run for every  logical CPU in the system,
but you may find it useful running it other ways
that I have not thought of.

nanosleep takes a single parameter, its target for highest wakes per second.
By default, it uses a max of 500 wakes per second, which
would be wakes at a rate up to (1 sec/500) = 2 ms.
Or if you run 4 copies, that becomes 2000/sec, or 500us.

For reference, cpuidle's target residency for C6N-BYT is 275 usec.
So even at that rate of wakeup, the system may still be able
to enter C6.
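The arithmetic above can be reproduced with a small sketch. It computes the average interval between wakes at the maximum rate and can be compared against the 275 usec target residency. `wake_interval_us` is a hypothetical helper name, and the real program adds a random element, so this is only the shortest average interval.

```sh
#!/bin/sh
# Sketch only: average microseconds between wakes for a given maximum
# wake rate per copy ($1) and number of nanosleep copies ($2, default 1).
wake_interval_us() {
    rate="$1"
    copies="${2:-1}"
    echo $(( 1000000 / ($rate * $copies) ))
}

wake_interval_us 500      # one copy at the default rate: 2000 us (2 ms)
wake_interval_us 500 4    # four copies: 500 us, still above 275 us
```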

For those with Baytrail systems that fail without intel_idle.max_cstate=1,
it would be interesting if you can experiment with nanosleep,
alone or in combination with glxgears or video playback or whatever
to see if you can provoke the failure sooner.

For observing what C-states are actually in use, please use turbostat;
which is available in the upstream kernel tree under
tools/power/x86/turbostat/ (yes, you should be able to use the latest
version of turbostat with an older kernel, as long as the kernel
supports the cpu msr driver)

Note that turbostat exposes the underlying C-state residency
hardware counters.  While the software counters in sysfs reflect
what the kernel requested, the hardware residency counters reflect
what states were actually achieved.
For this reason, it is preferable to use turbostat instead of
Wolfgang's script in comment #435.  eg.

 # turbostat --debug -o ts.out sleep 10

forks the "sleep 10" command -- you can use any command --
and outputs the stats to the file ts.out.  If you omit
the command for turbostat to fork, it will run in interval mode
until you kill it.
Comment 630 Prashant Poonia 2016-12-28 16:21:45 UTC
(In reply to pilot_6572 from comment #627)
> It it strange
> 
> I have the same Qotom z3735f board but only jessie 3.16 is well stable with
> and only without any bay and cherry drivers.
> 
> With the 4.7 and 4.8 kernels, freezes append very quickly and with r8723bs
> drivers (ferbar or hadess) or with the linuxium OSs, the systems overload
> and crash insanely...

Do you have the FocalTech touchpad drivers for the 3.16 kernel, or a 3.16 kernel with the drivers built in?
Comment 631 jbMacAZ 2016-12-28 19:10:48 UTC
(In reply to Len Brown from comment #628)
> Created attachment 248751 [details]
> Debug patch to enable BYT C6 auto-demotion
> 
> Please test this patch and report if if it has any
> effect on the stability issue.
> 
> You can verify that it is applied and running via dmesg:
> 
> dmesg | grep idle
> intel_idle: BYT C6 auto-demotion-disable: 0
> 

I stripped down my 4.8.15 setup on an Asus T100CHI (Z3775): no ..cstate arg, no c6offc7on script, only tsc=reliable, and let it idle on the Mint Cinnamon 18.1 desktop (with WiFi and BT enabled). It took less than half an hour to freeze. Then I added the auto-demotion-disable patch to the kernel. The CHI has now been running for over 9 hours. I'll leave it running (same conditions) for a few days to see if it freezes.
Comment 632 Dmitry 2016-12-30 11:35:02 UTC
Turbostat is not working in debug for me:
turbostat: msr 0 offset 0x1aa read failed: Input/output error

I haven't seen freezes since September-October. Nanosleep didn't show anything new; 4.8.15 is stable with the Z3770. Test load: four tasks pinned to different cores with taskset, glxgears, YouTube in Firefox over WiFi (ath6kl), and mpd. All this on battery with the powersave governor.
Cmdline:root=UUID=... rootfstype=f2fs ro tsc=reliable clocksource=tsc pcie_aspm=force nmi_watchdog=0 rd.skipfsck fsck.mode=skip quiet splash

cpupower monitor -m Idle_Stats -i 10 -c sleep 300
sleep took 300,00265 seconds and exited with status 0
    |Idle_Stats                               
CPU | POLL | C1-B | C6N- | C6S- | C7-B | C7S- 
   0|  0,00|  0,58|  1,84| 42,82| 26,51|  2,23
   1|  0,00|  1,09|  2,38| 49,07| 19,96|  0,54
   2|  0,00|  0,90|  2,15| 39,27| 28,15|  5,10
   3|  0,06|  0,46|  1,59| 37,10| 29,12|  6,59

Tested for 2 hours then got bored. 
P.S. Gentoo, vanilla stable kernel with bfq, ath6kl and different small(shut sst debug output up and touch button scancode change) patches.
Comment 633 Pilot_6572n 2016-12-30 13:23:37 UTC
During the test with the nanosleep program on Jessie with the 4.8 kernel on the Qotom Z3735F motherboard (intel_idle.max_cstate=1 added), I ran "stellarium" and "cairo-dock", two programs that use a lot of CPU time and video resources, and a freeze occurred immediately. After that, progressively no program, kernel or system could be loaded any more. The BIOS has been altered.

I reinstalled Jessie with only the original 3.16 kernel, wiping all GPT partition data beforehand (sgdisk --zap-all). The BIOS is still altered, but grub.efi can be loaded through the EFI shell (fs0:). Running the same stellarium and cairo-dock programs again, with no GRUB modification and without nanosleep.c running, gave the same crashes, which now progressively affect other video players and browsers as well.

I am now looking at the AMI afulnx_64 tool to reflash the BIOS before reinstalling an OS.
Comment 634 jbMacAZ 2016-12-30 19:31:29 UTC
(In reply to jbMacAZ from comment #631)
> 
> I stripped down my 4.8.15 setup on Asus T100CHI (Z3775).  No ..cstate arg no
> c6offc7on script only tsc=reliable and let it idle on Mint Cinnamon 18.1
> desktop with wifi, and bt enabled).  It took less than 1/2 hour to freeze. 
> Then I added auto-demotion-disable patch to the kernel.  The CHI has .. 

.. run fine for 48 hours - the last 24 hours with 4 copies of nanosleep running.  Next tests ran nanosleep on the unpatched 4.8.15.  I had one freeze before I could get a second copy of nanosleep running.  A second test froze in 38 minutes with 4 copies of nanosleep.  Not sure nanosleep matters, but thanks for the patch.
Comment 635 Len Brown 2017-01-01 18:24:12 UTC
Created attachment 249491 [details]
turbostat-src.tar.gz

Attached is a copy of the latest development version of turbostat.
It has two additions from what is released in the upstream kernel tree:

1. New --show --hide parameters (90% implemented)
2. disable --debug access to MSR_MISC_PWR_MGMT (MSR 0x1aa) on BYT

This tar file includes a binary you can run directly, or you can
first "make clean; make" to build it from scratch.

$ tar xzvf turbostat-src.tar.gz
$ cd turbostat-src

$ sudo ./turbostat --debug -o ts.out sleep 10

Both Wolfgang's script and cpupower are limited because they format
the software counters in /sys/devices/system/cpu/cpu*/cpuidle/state*/*
The software counters show what the kernel requested.

turbostat shows instead the underlying hardware
residency counters.  The difference is important when the hardware
has the ability
to "demote" a software request into a more "shallow" state; and is
particularly applicable when we are experimenting with a patch that
enables/disables the ability of the hardware to do so.
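
For reference, here is a minimal sketch of how the software counters mentioned above can be read. It assumes the usual cpuidle sysfs layout, where each state directory has "name", "usage" and "time" files; the function name and the printed "share" column are my own choices, not something taken from cpupower or from Wolfgang's script. These numbers only show what the kernel requested, so they cannot replace turbostat's hardware residency counters.

```python
import glob
import os


def read_cpuidle_counters(sysfs_root="/sys/devices/system/cpu"):
    """Return {cpu: {state_name: (usage, time_us)}} from the cpuidle sysfs files."""
    counters = {}
    pattern = os.path.join(sysfs_root, "cpu[0-9]*", "cpuidle", "state[0-9]*")
    for state_dir in sorted(glob.glob(pattern)):
        # Path ends in .../cpuN/cpuidle/stateM, so the CPU name is 3rd from the end.
        cpu = state_dir.split(os.sep)[-3]

        def read(name):
            with open(os.path.join(state_dir, name)) as f:
                return f.read().strip()

        # "usage" counts how often the kernel requested the state;
        # "time" is the total time in microseconds attributed to it.
        counters.setdefault(cpu, {})[read("name")] = (int(read("usage")), int(read("time")))
    return counters


if __name__ == "__main__":
    for cpu, states in read_cpuidle_counters().items():
        # Share of the time spent in idle states, not of wall-clock time.
        total = sum(t for _, t in states.values()) or 1
        for name, (usage, time_us) in states.items():
            print(f"{cpu} {name:8s} requests={usage:10d} share={100.0 * time_us / total:6.2f}%")
```

Comparing this output with the CPU%c6 column from turbostat on the same workload is one way to see whether the hardware is demoting the kernel's requests.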
Comment 636 Dmitry 2017-01-01 20:54:25 UTC
Created attachment 249561 [details]
turbostat --debug -o ts.out sleep 10

(In reply to Len Brown from comment #635)
> 
> Attached is a copy of the latest development version of turbostat.
Thank you! Tested. Some new output but also there is an error:
turbostat: msr 0 offset 0x3fe read failed: Input/output error
Comment 637 Len Brown 2017-01-02 03:50:00 UTC
Created attachment 249571 [details]
turbostat-src.tar.gz

thanks for testing the latest turbostat, this update
should fix the issue seen in the last one.
Comment 638 Dmitry 2017-01-02 09:52:04 UTC
Created attachment 249601 [details]
turbostat --debug -o ts.out sleep 10

So my cpu spends 50% of time in C6 state. This is with 4 instances of nanosleep, glxgears and video playback with mpv. Without any activities cpu spends 94% of time in C6.
I forgot to mention that I use 32bit gentoo with UEFI stub capable kernel.
Comment 639 Oemer 2017-01-02 10:22:44 UTC
(In reply to Sudhanshu from comment #537)
> I have been suffering from the same issue, but on a Broadwell system (Dell
...
> Summarising,
> Are there any broadwell co-sufferers here?
> Am I safe to assume this is the same bug as mine?

I am also on a Broadwell system and I suffer from the same occasional freezes. I haven't yet tried changing the max cstate setting though.
Comment 640 Vincent Gerris 2017-01-02 19:06:14 UTC
Hi,

I patched a 4.8.11 kernel with the auto demotion patch:
dmesg shows:
[    1.244957] intel_idle: BYT C6 auto-demotion-disable: 0

In my usual test setup, it freezes after about 15 minutes.
Since I still see quite a variation in time before that happens, I can't really tell if it made much difference (N3520).

The patch from Jochen Hein still works fine and does not freeze at all after the usual test.

Do we have an issue with the C6 state that is not in the errata?

For others on Ubuntu, you can find the auto demotion enabled deb kernels here:
https://www.dropbox.com/sh/c39et4hr6tgp60q/AAC35c56aOEOwkmhjdvtG6dsa?dl=0

@Len Brown: is there anything we can do to pin down the cause as much as possible?
I would really like to see a kernel patch fixing this and I am looking for the best way forward to help achieve that.
Comment 641 Len Brown 2017-01-03 03:05:07 UTC
@ Sudhanshu
@ oemer+kernel@o9z.de

No, this bug report is specific to Baytrail.

Here is the complete list of Baytrail processors:
http://ark.intel.com/products/codename/55844/Bay-Trail?q=bay%20trail#@All

If you have a problem with Broadwell, then file a new bug report -- because this one will be closed when "intel_idle.max_cstate=1" is no longer required on baytrail.
Comment 642 Len Brown 2017-01-03 03:24:09 UTC
@ Vincent Gerris

It seems likely there are multiple baytrail c6 issues, and only if we are very lucky will they turn out to have a common root cause.

When this bug was opened, I assumed that this had nothing to do with cpuidle and that Adrian's i2c patches would handle this.  That didn't happen.  I think there are multiple failures here, and i2c and i915 changes clearly affect some failures, and so they are both high on the list of suspects.

Also worth checking out is the cpuidle auto-demotion-disable=0 patch that you just tested.  The problem with that patch is if it works, we don't know if it is because we are taking a better route through the pcode, or if it is just hiding an i2c or i915 bug because the system is in c6 less... 

So the interesting comparison with the auto-demotion-disable=0 patch is:
1. Does it change stability?
For jbMacAZ it seems it may help, but for you it seems it may make no difference.  There are a lot of submitters here, and I'd like to see more testing.

2. Does it make a measurable difference in C6 residency under the same workload (ie. turbostat output with vs without the patch should show this).

Vincent,

Since the cpuidle patch doesn't make any difference on your system, I would say that testing the i2c patches (or maybe even blacklisting dw i2c if doing so doesn't hose your system) and perturbing how i915 works to see if any changes affect your failure are the best areas to look.

Also, any efforts to discover how to best cause the failure to happen as soon as possible would be extremely valuable.  Eg. experimenting to see if you can provoke the failure sooner by running nanosleep in a certain way with certain parameters might turn out to be extremely valuable.  If we can reliably reproduce a failure in under 60 seconds, then we know when it is gone.  If it takes a week or so to reproduce a failure, then we'll never know when we are done.
Comment 643 Jochen Hein 2017-01-03 06:39:39 UTC
I'm running right now:
Linux detrius 4.9.0-040900-generic #201612111631 SMP Sun Dec 11 21:33:00 UTC 
which is the ubuntu mainline ppa. Until yesterday that kernel seemed stable,
but yesterday I had a hang as well.

turbostat output:
turbostat version 4.17 1 Jan 2017 - Len Brown <lenb@kernel.org>
CPUID(0): GenuineIntel 11 CPUID levels; family:model:stepping 0x6:37:8 (6:55:8)
CPUID(1): SSE3 MONITOR - EIST TM2 TSC MSR ACPI-TM TM
CPUID(6): APERF, DTS, No-PTM, No-HWP, No-HWPnotify, No-HWPwindow, No-HWPepp, No-HWPpkg, EPB
cpu1: MSR_IA32_MISC_ENABLE: 0x00850089 (TCC EIST MONITOR)
CPUID(7): No-SGX
SLM BCLK: 83.3 Mhz
RAPL: 4581 sec. Joule Counter Range, at 30 Watts
cpu1: MSR_PLATFORM_INFO: 0x60000001600
6 * 83 = 500 MHz max efficiency frequency
22 * 83 = 1833 MHz base frequency
cpu1: MSR_IA32_POWER_CTL: 0x00000000 (C1E auto-promotion: DISabled)
cpu1: MSR_TURBO_RATIO_LIMIT: 0x00000000
cpu1: MSR_NHM_SNB_PKG_CST_CFG_CTL: 0x0016000f (UNlocked: pkg-cstate-limit=15: unknown)
cpu0: MSR_IA32_ENERGY_PERF_BIAS: 0x00000006 (balanced)
cpu0: MSR_RAPL_POWER_UNIT: 0x00000505 (0.031250 Watts, 0.000032 Joules, 0.000977 sec.)
cpu0: MSR_PKG_POWER_LIMIT: 0x003880fa (UNlocked)
cpu0: PKG Limit #1: ENabled (7.812500 Watts, 262144.000000 sec, clamp DISabled)
cpu0: PKG Limit #2: DISabled (0.000000 Watts, 0.000977* sec, clamp DISabled)
cpu0: MSR_PP0_POWER_LIMIT: 0x00000000 (UNlocked)
cpu0: Cores Limit: DISabled (0.000000 Watts, 0.000977 sec, clamp DISabled)
cpu0: MSR_IA32_TEMPERATURE_TARGET: 0x00690000 (105 C)
cpu0: MSR_IA32_THERM_STATUS: 0x883c0100 (45 C +/- 1)
cpu1: MSR_IA32_THERM_STATUS: 0x883c0100 (45 C +/- 1)
cpu2: MSR_IA32_THERM_STATUS: 0x883a0100 (47 C +/- 1)
cpu3: MSR_IA32_THERM_STATUS: 0x883a0100 (47 C +/- 1)
10.002627 sec
	Core	CPU	Avg_MHz	Busy%	Bzy_MHz	TSC_MHz	IRQ	SMI	CPU%c1	CPU%c6	CoreTmp	GFX%rc6	GFXMHz	PkgWatt	CorWatt
	-	-	546	28.89	1889	1834	0	0	2.41	68.71	45	0.00	0	0.77	0.58
	0	0	74	9.23	804	1835	0	0	2.76	88.00	43	0.00	0	0.77	0.58
	1	1	388	23.96	1618	1835	0	0	1.58	74.47	44
	2	2	1653	77.27	2137	1835	0	0	0.26	22.48	45
	3	3	69	5.07	1368	1833	0	0	5.03	89.90	45
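
For anyone comparing C6 residency with and without the patch, here is a small sketch that pulls the summary CPU%c6 value out of saved turbostat output. The function name is mine, and it assumes the column layout shown above (a header row containing "CPU%c6", followed by a summary row whose Core column is "-"); other turbostat versions may print different columns.

```python
def summary_c6(turbostat_text):
    """Return CPU%c6 from the first summary row (Core == "-") after a header row."""
    columns = None
    for line in turbostat_text.splitlines():
        fields = line.split()
        if "CPU%c6" in fields:
            columns = fields          # header row: remember the column order
        elif columns and fields and fields[0] == "-":
            return float(fields[columns.index("CPU%c6")])
    return None                       # no summary row found


# Header and first rows copied from the output above.
sample = """\
Core CPU Avg_MHz Busy% Bzy_MHz TSC_MHz IRQ SMI CPU%c1 CPU%c6 CoreTmp GFX%rc6 GFXMHz PkgWatt CorWatt
- - 546 28.89 1889 1834 0 0 2.41 68.71 45 0.00 0 0.77 0.58
0 0 74 9.23 804 1835 0 0 2.76 88.00 43 0.00 0 0.77 0.58
"""
print(summary_c6(sample))
```

Running it on a ts.out file from a patched and an unpatched kernel under the same workload gives two numbers that can be compared directly.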
Comment 644 BzukTuk 2017-01-03 12:49:59 UTC
Acer Switch 10 with Intel Atom Z3735F - with Len Brown's one-liner patch on vanilla 4.8.15 with ubuntu-ppa config:

up 4 days, 11:24. No freeze. Workload was glxgears and vlc.

Without this patch, same kernel with same workload froze in 12 minutes. Please someone with different baytrail CPU test this patch, so we can move forward.

Another workaround that makes at least my Z3735F rock stable was described in comment 378 (but only with kernel 4.7 and up, so it could be only luck)
Comment 645 Gilbert Dion 2017-01-04 01:00:09 UTC
Since I upgraded to Ubuntu 16.04 last October, my Acer Aspire V11 Touch with Intel Celeron quad core N2940 + Intel Bay Trail has had the same problem.

The intel_idle.max_cstate=1 parameter does prevent the crashes. How will I know when a fix to the kernel is done?
Comment 646 A Uday K 2017-01-04 10:19:53 UTC
Setting the max C state to 1 does not fix the problem. It is just a temporary measure to make sure that freezes don't occur when the stakes are high. For example, if you are working on some important project and there is an unexpected freeze, then all of the unsaved data that you are working on will be lost. Additionally, if you are working on battery, then setting the max C-state to 1 will invariably force your processor to consume a lot more power. In other words, your laptop's battery life will drastically improve once this bug is fixed.
You will know it is fixed when the "status", at the top of this page, is marked as "VERIFIED" or "RESOLVED". It is currently marked as "NEEDINFO".

If you would like, then you can contribute. Your contribution can speed things up. 

You do not need to be an expert programmer to contribute. You just need to know how to apply the patches and update your kernel. You may also need to know how to run a few commands in the terminal and post the output here. These are actually very simple steps. If you do not know how to do them, you can look them up online. These are relatively simple topics and there are not that many complex steps involved.

NOTE : MAINTAIN CAUTION WHILE TESTING :)
Comment 647 Martin 2017-01-04 17:45:39 UTC
Patched kernel 4.5.4 using Len's auto demotion patch and had a freeze after a day. CPU: J1900.

root@pandora:/usr/src/turbostat-src# ./turbostat -d
turbostat version 4.17 1 Jan 2017 - Len Brown <lenb@kernel.org>
CPUID(0): GenuineIntel 11 CPUID levels; family:model:stepping 0x6:37:3 (6:55:3)
CPUID(1): SSE3 MONITOR - EIST TM2 TSC MSR ACPI-TM TM
CPUID(6): APERF, DTS, No-PTM, No-HWP, No-HWPnotify, No-HWPwindow, No-HWPepp, No-HWPpkg, EPB
cpu2: MSR_IA32_MISC_ENABLE: 0x00850089 (TCC EIST MONITOR)
CPUID(7): No-SGX
SLM BCLK: 83.3 Mhz
RAPL: 4581 sec. Joule Counter Range, at 30 Watts
cpu2: MSR_PLATFORM_INFO: 0x100000001800
16 * 83 = 1333 MHz max efficiency frequency
24 * 83 = 1999 MHz base frequency
cpu2: MSR_IA32_POWER_CTL: 0x00000000 (C1E auto-promotion: DISabled)
cpu2: MSR_TURBO_RATIO_LIMIT: 0x00000000
cpu2: MSR_NHM_SNB_PKG_CST_CFG_CTL: 0x0018000f (UNlocked: pkg-cstate-limit=15: unknown)
cpu0: MSR_IA32_ENERGY_PERF_BIAS: 0x00000006 (balanced)
cpu0: MSR_RAPL_POWER_UNIT: 0x00000505 (0.031250 Watts, 0.000032 Joules, 0.000977 sec.)
cpu0: MSR_PKG_POWER_LIMIT: 0x003880fa (UNlocked)
cpu0: PKG Limit #1: ENabled (7.812500 Watts, 262144.000000 sec, clamp DISabled)
cpu0: PKG Limit #2: DISabled (0.000000 Watts, 0.000977* sec, clamp DISabled)
cpu0: MSR_PP0_POWER_LIMIT: 0x00000000 (UNlocked)
cpu0: Cores Limit: DISabled (0.000000 Watts, 0.000977 sec, clamp DISabled)
cpu0: MSR_IA32_TEMPERATURE_TARGET: 0x00690000 (105 C)
cpu0: MSR_IA32_THERM_STATUS: 0x88360000 (51 C +/- 1)
cpu1: MSR_IA32_THERM_STATUS: 0x88360000 (51 C +/- 1)
cpu2: MSR_IA32_THERM_STATUS: 0x88320000 (55 C +/- 1)
cpu3: MSR_IA32_THERM_STATUS: 0x88320000 (55 C +/- 1)                                                                                                                         
        Core    CPU     Avg_MHz Busy%   Bzy_MHz TSC_MHz IRQ     SMI     CPU%c1  CPU%c6  CoreTmp GFX%rc6 PkgWatt CorWatt                                                      
        -       -       425     22.23   1911    2000    9310    0       19.02   58.75   58      **.**   1.97    0.65                                                         
        0       0       316     19.56   1613    2000    5564    0       36.85   43.59   53      **.**   1.97    0.65                                                         
        1       1       288     17.62   1633    2000    2113    0       20.16   62.21   53                                                                                   
        2       2       381     18.92   2015    2000    1010    0       12.29   68.79   58                                                                                   
        3       3       715     32.80   2179    2000    623     0       6.80    60.41   58                                                                                   
        Core    CPU     Avg_MHz Busy%   Bzy_MHz TSC_MHz IRQ     SMI     CPU%c1  CPU%c6  CoreTmp GFX%rc6 PkgWatt CorWatt                                                      
        -       -       451     24.48   1840    2000    11641   0       40.57   34.95   56      34.17   3.75    0.70                                                         
        0       0       297     18.22   1628    2000    7549    0       81.78   0.00    53      34.17   3.75    0.70                                                         
        1       1       807     38.64   2088    2000    2083    0       38.13   23.23   53
        2       2       356     22.37   1589    2000    998     0       23.38   54.25   56
        3       3       343     18.71   1834    2000    1011    0       18.97   62.32   56
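
In case it helps anyone reading these dumps, here is a small sketch of how the "max efficiency" and "base" frequency lines follow from the MSR_PLATFORM_INFO value. The bit positions (47:40 and 15:8) are my understanding of what turbostat decodes, and the 83.3 MHz bus clock is taken from the "SLM BCLK" line in the output; the function name is made up for illustration.

```python
SLM_BCLK_MHZ = 83.3  # value printed by turbostat ("SLM BCLK: 83.3 Mhz")


def decode_platform_info(msr, bclk_mhz=SLM_BCLK_MHZ):
    """Return (max-efficiency MHz, base MHz) from an MSR_PLATFORM_INFO value."""
    max_efficiency_ratio = (msr >> 40) & 0xFF   # bits 47:40
    base_ratio = (msr >> 8) & 0xFF              # bits 15:8
    return round(max_efficiency_ratio * bclk_mhz), round(base_ratio * bclk_mhz)


# MSR_PLATFORM_INFO values reported in this thread.
for value in (0x60000001600, 0x100000001800, 0x60000001a00):
    print(hex(value), decode_platform_info(value))
```

The three values decode to (500, 1833), (1333, 1999) and (500, 2166) MHz, matching the frequency lines in the turbostat dumps posted here.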
Comment 648 Vincent Gerris 2017-01-04 20:21:47 UTC
Thank you Len, for picking this up and streamlining the hunt for the cause :)!
thanks everyone for the follow ups :).

I am going to try and be as scientific as I can on the matter.
I made 6 attempts to freeze, 3 on unpatched 4.8.0 and 3 on 4.8.11 patched with auto demotion enabled.
The reason for the 4.8.0/4.8.11 difference is that the automated scripts I use pull a tar and I did not find out how to change that. Until anyone finds a way to a quick freeze, this is what I use to freeze up my Lenovo Yoga 2 11 with N3520 processor (assuming that the minor version difference has no big influence, though it might):
 - pick an mkv file from a samba share and copy it to local video folder
 - setup bluetooth audio, with high fidelity playback profile to an external speaker (jambox mini)
 - play the same file that is copying from the network with the ubuntu default video player (totem) 

This way I can usually get 4.8.0 to freeze within 1 to 30 minutes.
An issue is that bluetooth is utter crap: stuttering, connection loss, a wrong profile or not being able to set it are some of the things that influence it.
The bluetooth also seems to block video playback; maybe it is buffering related.

Any way, with the above, I was unable to freeze the 4.8.11 with demotion this time.
One time, the 4.8.0 without played about 30 minutes, one time it froze instantly.

Further info:
 - not using the laptop plugged in may freeze the 4.8.0 faster, but not sure
 - I tried nanosleep and ran up to 5 times the program, but it didn't seem to make a difference on the freezing speed. It was running while playing video too.

A sample output of turbostat -d on the 4.8.11 with auto demotion, 5 times nanosleep running, playing video with audio over bluetooth:

ubuntu@ubuntu-Lenovo-Yoga-2-11:~/Downloads/turbostat-src$ sudo ./turbostat -d
turbostat version 4.17 1 Jan 2017 - Len Brown <lenb@kernel.org>
CPUID(0): GenuineIntel 11 CPUID levels; family:model:stepping 0x6:37:3 (6:55:3)
CPUID(1): SSE3 MONITOR - EIST TM2 TSC MSR ACPI-TM TM
CPUID(6): APERF, DTS, No-PTM, No-HWP, No-HWPnotify, No-HWPwindow, No-HWPepp, No-HWPpkg, EPB
cpu1: MSR_IA32_MISC_ENABLE: 0x00850089 (TCC EIST MONITOR)
CPUID(7): No-SGX
SLM BCLK: 83.3 Mhz
RAPL: 4581 sec. Joule Counter Range, at 30 Watts
cpu1: MSR_PLATFORM_INFO: 0x60000001a00
6 * 83 = 500 MHz max efficiency frequency
26 * 83 = 2166 MHz base frequency
cpu1: MSR_IA32_POWER_CTL: 0x00000000 (C1E auto-promotion: DISabled)
cpu1: MSR_TURBO_RATIO_LIMIT: 0x00000000
cpu1: MSR_NHM_SNB_PKG_CST_CFG_CTL: 0x001a000e (UNlocked: pkg-cstate-limit=14: unknown)
cpu0: MSR_IA32_ENERGY_PERF_BIAS: 0x00000006 (balanced)
cpu0: MSR_RAPL_POWER_UNIT: 0x00000505 (0.031250 Watts, 0.000032 Joules, 0.000977 sec.)
cpu0: MSR_PKG_POWER_LIMIT: 0x000280fb (UNlocked)
cpu0: PKG Limit #1: ENabled (7.843750 Watts, 0.001953 sec, clamp DISabled)
cpu0: PKG Limit #2: DISabled (0.000000 Watts, 0.000977* sec, clamp DISabled)
cpu0: MSR_PP0_POWER_LIMIT: 0x00000000 (UNlocked)
cpu0: Cores Limit: DISabled (0.000000 Watts, 0.000977 sec, clamp DISabled)
cpu0: MSR_IA32_TEMPERATURE_TARGET: 0x00690000 (105 C)
cpu0: MSR_IA32_THERM_STATUS: 0x883b0100 (46 C +/- 1)
cpu1: MSR_IA32_THERM_STATUS: 0x883b0100 (46 C +/- 1)
cpu2: MSR_IA32_THERM_STATUS: 0x883a0100 (47 C +/- 1)
cpu3: MSR_IA32_THERM_STATUS: 0x883a0100 (47 C +/- 1)
turbostat: msr 0 offset 0x3fe read failed: Input/output error
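
As a side note on reading the dump above, here is a sketch of how the temperature and package power limit lines can be recomputed from the raw MSR values. The bit fields are my understanding of the layout turbostat uses (TjMax in bits 23:16 of the temperature target, digital readout in bits 22:16 and resolution in bits 30:27 of the thermal status, power unit in bits 3:0 of the RAPL unit register, limit #1 in bits 14:0 with its enable flag in bit 15); the function names are hypothetical.

```python
def decode_thermal(therm_status, temperature_target):
    """Return (temperature in C, resolution in C) for one core."""
    tj_max = (temperature_target >> 16) & 0xFF      # bits 23:16
    below_tj_max = (therm_status >> 16) & 0x7F      # bits 22:16, digital readout
    resolution = (therm_status >> 27) & 0xF         # bits 30:27
    return tj_max - below_tj_max, resolution


def decode_pkg_limit1(pkg_power_limit, rapl_power_unit):
    """Return (watts, enabled) for package power limit #1."""
    power_unit_watts = 1.0 / (1 << (rapl_power_unit & 0xF))   # bits 3:0
    watts = (pkg_power_limit & 0x7FFF) * power_unit_watts     # bits 14:0
    enabled = bool(pkg_power_limit & (1 << 15))               # bit 15
    return watts, enabled


# Raw values copied from the output above.
print(decode_thermal(0x883B0100, 0x00690000))
print(decode_pkg_limit1(0x000280FB, 0x00000505))
```

With the values above this gives 46 C with a resolution of 1, and an enabled limit of 7.84375 W, which agrees with what turbostat printed.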

Based on the above I dare to conclude that:
 - the auto demotion enablement makes it more stable than without - it seems like a good idea to have it go mainline, a few reports have also stated this
 - it seems there is still an issue that needs further investigation.

Challenges seem:
 - feedback loop : let's all keep looking for a fast freeze on either kernel
 - detailed reports (hard with above challenge) - let's try to be version specific and share test methods or scripts (I for example can dedicate this hardware to it, but I have limited time)

I am very much motivated to find the cause and I hope everyone has read Len's previous requests about using nanosleep, turbostat and that this is ONLY about Bay trail. 
Let us please try to keep it confined to that and nail this bug :). thanks everyone for the help and collaboration!
Comment 649 Len Brown 2017-01-04 21:43:25 UTC
@Vincent Gerris

Thanks for the test report. It is extremely helpful when reports
such as yours, include the processor and system model #.

> turbostat: msr 0 offset 0x3fe read failed: Input/output error

Hopefully this means you are running the turbostat in comment #635
and that error goes away when you run the update in comment #637

> I tried nanosleep and ran up to 5 times the program

Note that adding more load may result in less idle c6,
and thus make the failure more rare.  That is to say,
3 copies may be more effective than 5...   turbostat
(the working version:-) will show the % of c6 residency,
and if that goes down, the system may be too busy
to be exercising c6 enough to provoke the failure.
Aside from number of copies of nanosleep, its default
parameter is 500 wakes per second. 
I don't know if making that higher or lower will cause the failure
sooner, and if somebody has a system that fails quickly,
that would be a great thing to discover and share.
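
To illustrate what "wakes per second" means here, below is a rough sketch of a periodic wake loop. It is not the nanosleep program from comment #629, only an illustration of the idea; the function name and parameters are my own, and interpreter overhead means the real wake rate will be somewhat lower than requested.

```python
import time


def periodic_wakes(wakes_per_second=500, duration_s=10.0):
    """Sleep in short intervals so the CPU keeps entering and leaving idle."""
    interval = 1.0 / wakes_per_second
    deadline = time.monotonic() + duration_s
    wakes = 0
    while time.monotonic() < deadline:
        time.sleep(interval)   # the CPU may go idle here until the timer fires
        wakes += 1
    return wakes


if __name__ == "__main__":
    # One second at the default rate; prints the number of wakes achieved.
    print(periodic_wakes(500, 1.0))
```

Watching CPU%c6 in turbostat while varying the rate or the number of copies shows whether the system is still idle enough to exercise c6.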
Comment 650 Josep Pujadas-Jubany 2017-01-04 22:34:21 UTC
I don't know if it can be related but at work we have

Acer TM (TravelMate) B117M N3150 processor (Braswell Processor but kernel code sees as Cherry Trail Processor)

Lubuntu 14.04 LTS + LTSEnablementStack
(https://wiki.ubuntu.com/Kernel/LTSEnablementStack)

About 110 units of this model. Some of them are freezing using them and after suspend.

On 2016-11-24 we migrated some machines to latest stable kernel (4.8.10). Computers are more stable.
(https://bugs.launchpad.net/ubuntu/+source/linux/+bug/1575467/comments/142)

but... they continue to freeze after suspend.

Modifying /etc/default/grub from

GRUB_CMDLINE_LINUX_DEFAULT="quiet splash acpi_backlight=vendor"

to

GRUB_CMDLINE_LINUX_DEFAULT="quiet splash acpi_backlight=vendor acpi_osi='!Windows 2013' acpi_osi='!Windows 2012'"

prevents freezing after suspend.

I explain this because I suppose that suspending the computer could be an excellent situation for having C6 & C7 states.

Just closing & opening the lid with any application opened (FireFox or VLC) caused the freezing for us.
Comment 651 A Uday K 2017-01-06 07:31:25 UTC
(In reply to Len Brown from comment #649)
> @Vincent Gerris
> 
> > turbostat: msr 0 offset 0x3fe read failed: Input/output error
> 
> Hopefully this means you are running the turbostat in comment #635
> and that error goes away when you run the update in comment #637

Hello Mr. Brown,

The turbostat binary included in the tar file doesn't work right out of the box. I had to run
"make clean; make"
in order to avoid this error....
---
turbostat: msr 0 offset 0x3fe read failed: Input/output error
---

Here's the output of turbostat, right after running "make clean; make"....
---
http://pastebin.com/raw/GTXDbZRz
---

Here's the output of 'dmesg | grep idle', right after installing the auto-demotion-enabled kernel....
---
http://pastebin.com/raw/JacrPBG9
---

Here's the current info about my system....
---
http://pastebin.com/raw/RtMeVBSG
---

One more thing about the turbostat output:
after my PC freezes, should I power it down, switch it back on AND THEN post the output of turbostat?
Or should I post the output of turbostat right now?
Comment 652 Len Brown 2017-01-06 19:58:38 UTC
@ Josep Pujadas-Jubany

Please file a new bug for Cherry-Trail/Braswell suspend issues.

This bug report is specific to issues on the previous generation, Valleyview/Baytrail,
that go away with cmdline option intel_idle.max_cstate=1.

Bay Trail processor list:

http://ark.intel.com/products/codename/55844/Bay-Trail?q=bay%20trail#@All
Comment 653 Len Brown 2017-01-06 20:12:30 UTC
@ A Uday K 

Yes, your N3530 is a Baytrail, yes, the auto-demotion patch is installed.

Several things we are trying to discover:

1. does the auto-demotion patch in comment #628 help?
   running the same workload, does using this patch change time to hang?
   It seems that it helps dramatically on some, and not at all on others.

2. do you see a different amount of %c6 when running the autodemotion patch
   vs. not running that patch?  (this is what turbostat can tell us)

3. can you help discover how to make the problem occur sooner?
   nanosleep in comment #629 is a tool that is available to help.
   My guess is that 4 copies should run on a 4-processor system,
   and that they should use the default parameter of 500 wakes/sec.
   But if you can make the problem happen by changing from 500,
   or changing the number of copies, that is a valuable discovery.
   Here again, turbostat is available to help track if you are
   making the system too busy to get into c6.
Comment 654 jbMacAZ 2017-01-07 07:18:55 UTC
Created attachment 250691 [details]
Turbostat for Asus T100CHI

auto-demotion is also helpful with 4.10-rc2 on Manjaro Cinnamon on ASUS T100CHI (Z3775) ~80 hours w/o freeze.  Will resume testing with 4.8.16 (Mint)
Comment 655 Josep Pujadas-Jubany 2017-01-07 10:05:55 UTC
(In reply to Len Brown from comment #652)
> @ Josep Pujadas-Jubany
> 
> Please file a new bug for Cherry-Trail/Braswell suspend issues.
> 

My comment came from

https://bugs.launchpad.net/ubuntu/+source/linux/+bug/1566302/comments/150

https://bugs.launchpad.net/ubuntu/+source/linux/+bug/1566302

And the status for this Linux(Ubuntu) bug is [Won't fix].

I left Windows when Vista appeared. I have been very happy (at home and at work) using Linux (open source, no viruses, speed, stability, ...)

But in the last months (applying latest kernels for latest hardwares) it seems we lost stability.

It's a pity. I would like to help more about this but I'm just an intermediate/advanced user. I'm not capable to do more. I'm sorry!
Comment 656 A Uday K 2017-01-07 19:13:50 UTC
(In reply to Len Brown from comment #653)

>    a different amount of %c6 when running the autodemotion patch
>    vs. not running that patch?  (this is what turbostat can tell us)

Thanks for the clarification :)

>    can you help discover how to make the problem occur sooner?
>    nanosleep in comment #629 is a tool that is available to help.
>    My guess is that 4 copies should run on a 4-processor system,
>    and that they should use the default parameter of 500 wakes/sec.
>    But if you can make the problem happen by changing from 500,
>    or changing the number of copies, that is a valuable discovery.
>    Here again, turbostat is available to help track if you are
>    making the system too busy to get into c6.

I'll continue to experiment and will keep you updated.
Comment 657 jbMacAZ 2017-01-08 00:09:35 UTC
turbostat error: "msr 0 offset 0x3fe read failed: Input/output error"
on Asus T100CHI (Z3775 - SILVERMONT1.)  FWIW, turbostat runs w/o errors on my skylake desktop.  FYI, 4.8.16 with auto-demotion-disable seems stable for me so far.  

Is there any value to testing an older kernel such as 4.2.x, which I found more unstable on my system?
Comment 658 Len Brown 2017-01-08 23:03:14 UTC
@jbMacAZ

For the turbostat error, try "make clean; make" of the latest attachment.
Apparently I sent the latest source but failed to re-build the binary.

FWIW, I expect to upload an updated turbostat this coming week
with some baytrail specific updates.

Re: value in testing older, more unstable, kernels.

My personal bias is to always run the latest upstream kernel,
or at least the latest kernel.org -stable.  That kernel
is what all other kernels follow, eventually.

However, many users are on distro binary kernels, and so
it is useful to know where those are too.

The root cause of this particular failure has been elusive.
It seems there are multiple ways of making the root cause
occur more/less frequently.  There may even be multiple
independent root causes.  If we can use an old kernel to
isolate the difference between bad/good to help find
the effect of a certain patch, that is useful.  But
with possible multiple causes, the benefit of a patch
on an unstable kernel could be lost in the noise.
Comment 659 jbMacAZ 2017-01-09 01:57:14 UTC
(In reply to Len Brown from comment #658)
> @jbMacAZ
> 
> For the turbostat error, try "make clean; make" of the latest attachment.
> Apparently I sent the latest source but failed to re-build the binary.
> 
> FWIW, I expect to upload an updated turbostat this coming week
> with some baytrail specific updates.
> 
I played with the source, but only succeeded in changing which offset provokes the error!  Looking forward to the baytrail turbostat update.  In the mean time, I'll stick to testing recent non EOL kernels.
Comment 660 sikorskydenisua 2017-01-09 16:54:12 UTC
Joining to report this bug on ASUS E502MA - Intel(R) Pentium(R) CPU  N3540  @ 2.16GHz, using any linux distro - Ubuntu, Mint, Manjaro, etc...

Fixed this by https://wiki.archlinux.org/index.php/Intel_graphics#X_freeze.2Fcrash_with_intel_driver - last option, which has a link to this site.
Comment 661 Len Brown 2017-01-10 09:05:59 UTC
Created attachment 251091 [details]
latest turbostat utility for baytrail

Here is my latest turbostat utility, updated for Baytrail.
This is a development version, not yet released to the Linux kernel source tree.

$ tar xzvf turbostat-src.tar.gz
$ cd turbostat-src

$ sudo ./turbostat --debug -o ts.out sleep 10

If you are not comfortable running a utility you download from the internet as root, first build it from source:

$ make clean
$ make
and optionally
$ sudo make install

Updates:
1. uses the Baytrail C1 hardware residency counter instead of software
2. shows the Baytrail module c6 hardware residency counter.
   Yes, this is the same on pairs of cores (that is what a module is)
3. shows Package C6
4. does not access/show un-supported counters c3, pc3, c7, pc7

Here are what states/counters are enabled for the interesting parameters to intel_idle.max_cstate:

intel_idle.max_cstate=1 C1
intel_idle.max_cstate=2 C1, Mod-C6
intel_idle.max_cstate=3 C1, Mod-C6, Pkg-C6

This release replaces the turbostat versions attached to comment #635 and comment #637.
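
A small sketch for test reports: it reads the intel_idle.max_cstate value from a kernel command line and looks up which states the list above says are enabled on Baytrail. The table is copied from this comment; the function names are hypothetical, and values not in the list (or a missing parameter) return None instead of guessing.

```python
import re

# Mapping as listed in the comment above for Baytrail.
BYT_STATES = {1: ["C1"], 2: ["C1", "Mod-C6"], 3: ["C1", "Mod-C6", "Pkg-C6"]}


def max_cstate_from_cmdline(cmdline):
    """Return the intel_idle.max_cstate value from a kernel command line, or None."""
    match = re.search(r"(?:^|\s)intel_idle\.max_cstate=(\d+)(?=\s|$)", cmdline)
    return int(match.group(1)) if match else None


def enabled_states(cmdline):
    """Return the listed states for the parameter, or None if nothing is listed."""
    return BYT_STATES.get(max_cstate_from_cmdline(cmdline))


# Example command line; on a live system, read /proc/cmdline instead.
print(enabled_states("root=/dev/sda1 ro quiet splash intel_idle.max_cstate=2"))
```

Including the result (or just the contents of /proc/cmdline) in a report makes it clear which states were in play during a test.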
Comment 662 amjafuso 2017-01-10 09:21:20 UTC
I have been running kernel 4.9.0 for two weeks now without any freeze.

- shuttle xs35v4, j1900
- 4.9.0-sparky-amd64 #1 SMP Tue Dec 20 12:43:44 CET 2016 x86_64 GNU/Linux

@len brown: does it still help you if I run turbostat?
Comment 663 Len Brown 2017-01-10 09:58:15 UTC
Created attachment 251101 [details]
Test script to freeze your baytrail quickly

I have done some testing on two Baytrail systems:

Dell Inspiron 3451 laptop (Atom N3540)
Acer Aspire AXC desktop (Atom J1900)

Currently using Ubuntu 16.10 vmlinuz-4.8.0-32-generic
with no cmdline parameters.

Using the attached script, each box freezes in under 30 minutes, and often much sooner.  I've seen a freeze as quickly as under 60 seconds.

The current script runs 8 copies of "nanosleep 1000" from comment #629 plus one copy of glxgears -fullscreen.  It also displays information about your system that I'd like to see when you report a failure.

I ssh into the test system, and invoke a 1-line shell script that does this:
./byt.test | tee out.`date +%Y%m%d_%H%M%S`

so when the system hangs, there is a record both in the ssh window, and also in the out.* file.  Yes, attaching your out.* file to this bug report is appropriate -- though the turbostat output gets redundant after a while -- so copy/paste of the top of the file also works...  You can simply show the last timestamp, or say how long to freeze.

Adding more copies of glxgears did not seem to make the failure occur sooner.  When I ran without glxgears, the failure stretched out to 23 hours on the acer, and the dell was still running at 24 hours.  So 1 copy of glxgears seems to be the ticket.

intel_idle.max_cstate=2 still fails, my one attempt took 49 minutes.
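
If you cannot keep an ssh window open, one possible way to get the time of a freeze is a heartbeat file. This is only a sketch and not part of the attached test script: the function names and the one-second interval are my own choices, and the timestamp format simply mirrors the `date +%Y%m%d_%H%M%S` used above. The fsync call makes it likely, though not guaranteed, that the last line reaches the disk before a hard freeze.

```python
import os
import time
from datetime import datetime


def write_heartbeats(path, interval_s=1.0, count=None):
    """Append a timestamp every interval_s seconds and force it to disk."""
    written = 0
    with open(path, "a") as log:
        while count is None or written < count:
            log.write(datetime.now().strftime("%Y%m%d_%H%M%S") + "\n")
            log.flush()
            os.fsync(log.fileno())   # push the line to disk before sleeping
            written += 1
            time.sleep(interval_s)
    return written


def last_heartbeat(path):
    """Return the last timestamp written, or None if the file is empty."""
    with open(path) as log:
        lines = log.read().split()
    return lines[-1] if lines else None
```

Start write_heartbeats("heartbeat.log") next to the test, and after the reboot last_heartbeat("heartbeat.log") gives the approximate time of the freeze.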
Comment 664 Len Brown 2017-01-10 10:22:23 UTC
@ amjafuso

please try the script in comment #663 to see if you can get 4.9 to fail.
I've not tested 4.9 yet.  You've also reported success with intel_idle.max_cstate=2.  If you get 4.9 to fail with no cmdline, please re-test with intel_idle.max_cstate=2 to see if that survives.  My experience is that they will both fail, and that cmdline will simply take a bit longer than the default.

I also acquired an Asus T100 TAS and Asus T100 CHI.  My next step is to wrestle 64-bit Ubuntu onto their 32-bit BIOS in a dual-boot config, and broaden the testing to those boxes, before I start changing the kernel.
Comment 665 amjafuso 2017-01-10 13:41:32 UTC
Ok, script started 2 hours ago, no freezes. No freeze with kernel 4.9. Boot parameter:

# cat /proc/cmdline 
BOOT_IMAGE=/boot/vmlinuz-4.9.0-sparky-amd64 root=UUID=612f6bbb-b095-4da7-b823-0658edce9dfc ro quiet splash

I didn't patch the kernel (don't know how to do that, sorry).
Comment 666 GConst 2017-01-10 15:39:04 UTC
Hello,

Has anybody tried 4.9.2 in Ubuntu 16.04 from the following site? 
http://kernel.ubuntu.com/~kernel-ppa/mainline/v4.9.2/
Comment 667 jbMacAZ 2017-01-10 19:16:18 UTC
Created attachment 251131 [details]
T100CHI turbostat kernel 4.9 patched

Turbostat is working now on CHI. Thank you.

Kernel 4.9.2 with auto-demotion-disable running for ~12 hours so far (just added 4 copies of nanosleep 1000).  Without nanosleep (x4), CPU%c6 was ~90%. 4.9.2 without the auto-demotion-disable patch ran about an hour before freezing at idle. (Z3775)

re: T100 test beds
The T100CHI is a little tedious to get going in linux.  CHI's OEM Bluetooth keyboard is offline at boot time (and unpaired at install time).  It's easier to use a powered hub, USB keyboard/mouse and wifi dongle for linux install.  If you add bootia32.efi to the installer USB /EFI/boot/ and edit the installer /boot/grub/grub.cfg to add intel_idle.max_cstate=1, you can boot most Debian derivative installers.  Press <ESC> during power up to get to the boot menu.  Some distros need grub-efi-ia32 & grub-efi-ia32-bin to be installed manually. Wifi needs brcmfmac43241b4-sdio.txt and bluetooth needs BCM4324B3.hcd and works better with blueman device manager...
Comment 668 Juha Sievi-Korte 2017-01-11 18:23:57 UTC
Thanks for all the updates.

I've been trying to freeze my N3540 laptop with nanosleep and different combinations of other tools, with varying success. I once managed to freeze an idle system running 4x nanosleep 250 processes in a couple of hours, but the same test run again yielded 36+ hours of uptime.

I can confirm that adding glxgears surely helps: with 4x nanosleep 250 + glxgears -fullscreen I've now gotten 4 freezes in a row, with times to freeze of roughly 90 mins, 50 mins, 15 mins and 8.5 hours. Also, just now when trying to update this report, I got a freeze less than 10 mins after reboot, with no nanosleep running...

Attached is turbostat output from a few seconds before one of the freezes happened.

turbostat version 4.17 1 Jan 2017 - Len Brown <lenb@kernel.org>
CPUID(0): GenuineIntel 11 CPUID levels; family:model:stepping 0x6:37:8 (6:55:8)
CPUID(1): SSE3 MONITOR - EIST TM2 TSC MSR ACPI-TM TM
CPUID(6): APERF, DTS, No-PTM, No-HWP, No-HWPnotify, No-HWPwindow, No-HWPepp, No-HWPpkg, EPB
cpu1: MSR_IA32_MISC_ENABLE: 0x00850089 (TCC EIST MONITOR)
CPUID(7): No-SGX
SLM BCLK: 83.3 Mhz
RAPL: 4581 sec. Joule Counter Range, at 30 Watts
cpu1: MSR_PLATFORM_INFO: 0x60000001a00
6 * 83 = 500 MHz max efficiency frequency
26 * 83 = 2166 MHz base frequency
cpu1: MSR_IA32_POWER_CTL: 0x00000000 (C1E auto-promotion: DISabled)
cpu1: MSR_TURBO_RATIO_LIMIT: 0x00000000
cpu1: MSR_NHM_SNB_PKG_CST_CFG_CTL: 0x001a000f (UNlocked: pkg-cstate-limit=15: unknown)
cpu0: MSR_IA32_ENERGY_PERF_BIAS: 0x00000006 (balanced)
cpu0: MSR_RAPL_POWER_UNIT: 0x00000505 (0.031250 Watts, 0.000032 Joules, 0.000977 sec.)
cpu0: MSR_PKG_POWER_LIMIT: 0x003880c0 (UNlocked)
cpu0: PKG Limit #1: ENabled (6.000000 Watts, 262144.000000 sec, clamp DISabled)
cpu0: PKG Limit #2: DISabled (0.000000 Watts, 0.000977* sec, clamp DISabled)
cpu0: MSR_PP0_POWER_LIMIT: 0x00000000 (UNlocked)
cpu0: Cores Limit: DISabled (0.000000 Watts, 0.000977 sec, clamp DISabled)
cpu0: MSR_IA32_TEMPERATURE_TARGET: 0x00690000 (105 C)
cpu0: MSR_IA32_THERM_STATUS: 0x88330100 (54 C +/- 1)
cpu1: MSR_IA32_THERM_STATUS: 0x88330100 (54 C +/- 1)
cpu2: MSR_IA32_THERM_STATUS: 0x88340100 (53 C +/- 1)
cpu3: MSR_IA32_THERM_STATUS: 0x88340100 (53 C +/- 1)
10.006062 sec
	Core	CPU	Avg_MHz	Busy%	Bzy_MHz	TSC_MHz	IRQ	SMI	CPU%c1	CPU%c6	CoreTmp	GFX%rc6	GFXMHz	PkgWatt	CorWatt
	-	-	80	11.80	675	2167	0	0	2.18	86.02	53	0.00	0	1.48	0.09
	0	0	67	10.77	622	2167	0	0	3.08	86.15	53	0.00	0	1.48	0.09
	1	1	127	18.55	682	2167	0	0	2.90	78.55	53
	2	2	65	9.83	659	2167	0	0	1.47	88.70	53
	3	3	61	8.06	752	2167	0	0	1.26	90.68	53

Running 4.9.0-1 from the openSUSE repos at the moment; will try the auto-demotion patch next.
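
The temperatures in the dump above can be cross-checked by hand from the raw register values. The sketch below is an illustration only: the function name is hypothetical, and the bit positions reflect my reading of the IA32_THERM_STATUS and IA32_TEMPERATURE_TARGET layouts, which happen to reproduce the numbers turbostat printed.

```python
def decode_therm_status(therm_status, temperature_target):
    """Return (temperature in C, resolution in C) from two raw MSR values.

    Assumed layout:
      temperature_target bits 23:16  target temperature (the "105 C" above)
      therm_status       bits 22:16  digital readout, degrees below the target
      therm_status       bits 30:27  resolution in degrees C
    """
    tj_max = (temperature_target >> 16) & 0xFF
    readout = (therm_status >> 16) & 0x7F
    resolution = (therm_status >> 27) & 0xF
    return tj_max - readout, resolution
```

With the values from the dump, `decode_therm_status(0x88330100, 0x00690000)` gives 54 C with a resolution of 1, matching the "54 C +/- 1" line for cpu0.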
Comment 669 A Uday K 2017-01-11 19:18:00 UTC
I did as told on my system, 
here's a link to the output as soon as I ran those commands...
---
http://pastebin.com/ByQVV5gc
---

On my system,
if I try this line,
---
$ ./byt.test | tee out.`date +%Y%m%d_%H%M%S`
---
It gives the "permission denied" error. I get this error even if I'm in super user mode ( sudo su ).

However, on my system, the workaround is...
---
$ . byt.test | tee out.`date +%Y%m%d_%H%M%S`
---

Also, there's this one more thing....
If you see the 2nd last line of that link, you'll notice this...
---
The program 'glxgears' is currently not installed.
---
How would you suggest I proceed?
Should I go ahead and type....
---
$ sudo apt install mesa-utils
---
Or should I be doing something else ?
What is glxgears?
Comment 670 jbMacAZ 2017-01-11 22:59:18 UTC
Created attachment 251261 [details]
CHI_freeze_4.9.2_no_demotion_disable_patch

Using Len Brown's freeze script with kernel 4.9.2, I had a freeze in about 20 minutes.  Freeze times range from 10 minutes to around an hour.
Comment 671 Len Brown 2017-01-11 23:39:00 UTC
@ A Uday K

. file
will interpret that file in the current shell session.
that isn't what you want if the script has side effects, like changing directory, or calling exit.

try this

$ chmod +x file
$ ./file

this particular script has an sudo, and so you will be prompted for a password if your session doesn't remember it from previous sudo

yes, glxgears is a simple graphics demo.
it seems to come installed by default in ubuntu 16.10
go ahead and install it and try it.

the good thing about glxgears is that it does video w/o doing any audio.
I suspect that the folks freezing their system by playing an audio+video
stream may be running into a known audio issue that hopefully will soon be fixed.
 
@ Juha Sievi-Korte 

Thanks for the testing.
Please update to the latest turbostat from comment #661

when i wrote the test script in comment #663
i expected to be varying the number of copies of glxgears.
I too notice a huge benefit (shorter time to freeze)
from running 1 copy, but did not notice a benefit of running more copies.

@ jbMacAZ

thanks for confirming that 4.9.2 is not magic,
and that the test script from comment #663 fails well for
an unpatched kernel.

so did you eventually get a hang with 4.9 with demotion-re-enabled patch?

what kernel is "LapLet 4.9.2.2n" -- is that unpatched or patched?
your %pc6 is remarkably low in comment #670 (under 2%)

BTW, thanks for the T100 CHI install tips; hopefully I'll get that box going tonight.  If it is like yours, it will be a great test bed.
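
The "permission denied" from `./file` discussed at the top of this comment usually means the execute bit is not set on the script, which is what `chmod +x` fixes. As a rough Python illustration of the same check and fix (the helper names are hypothetical, and only the owner's execute bit is handled here, whereas `chmod +x` typically sets it for group and others too):

```python
import os
import stat


def is_user_executable(path):
    """Return True if the owner execute bit is set on path."""
    return bool(os.stat(path).st_mode & stat.S_IXUSR)


def make_user_executable(path):
    """Set the owner execute bit, similar in spirit to `chmod u+x path`."""
    mode = os.stat(path).st_mode
    os.chmod(path, mode | stat.S_IXUSR)
```

Sourcing the script with `. file` sidesteps the check because the current shell only needs to read the file, not execute it.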
Comment 672 jbMacAZ 2017-01-12 01:10:07 UTC
(In reply to Len Brown from comment #671)
> 
> @ jbMacAZ
> 
> so did you eventually get a hang with 4.9 with demotion-re-enabled patch?
> 
> what kernel is "LapLet 4.9.2.2n" -- is that unpatched or patched?
> your %pc6 is remarkably low in comment #670 (under 2%)
> 
> BTW. thanks for the T100 CHI install tips, hopefully I'll get that box going
> tonight.  If it like yours, it will be a great test bed.

4.9.2.2n is 4.9.2 + aufs4.9 + T100 specific patches not yet upstreamed (from T100/Ubuntu G+ group), but no ubuntu patches.  glxgears has a slight stutter while running, so the system may be maxing out??  I'm running mint 18.1 (x86_64) w/cinnamon 3.2.7. I have a system monitor (CPU,mem,net,disk) graphical applet in the system tray, wifi and bluetooth active.
A standard unpatched recent(>4.6) kernel will run acceptably on the CHI, but more of the minor hardware (buttons, backlight, etc.) works with the T100 patches and .config.

I also built 4.9.2.2 which adds your demotion patch.  So far I have not seen 4.9.2.2 freeze - longest test run so far has been about 16 hours.  FWIW, 4.10-rc3 with your patch did freeze, but 4.10 is still too new for me to take seriously.
Comment 673 jechtpurgateur 2017-01-12 11:05:22 UTC
Hello,

How is this bug going? I have had the freeze issue for a year, and the workaround doesn't work on my system. At least it's better, but come on, the system still freezes randomly. Why is it not a priority?
Comment 674 alvararo 2017-01-12 12:43:29 UTC
If you need a quick way to get freezes: I had a laptop with an N3540 that froze a few seconds after start (the same complete freeze, without logs) when I launched Google Chrome just after the desktop started in Manjaro Gnome 16.08 with the 3.16 Manjaro kernel (only in the Gnome edition). If someone has time and is interested, please confirm this; I'm unable to try it right now.
Comment 675 Prashant Poonia 2017-01-12 15:24:56 UTC
The n3540 also freezes with 4.9.2: it froze once in 12 hours of uptime, while connected through the wifi hotspot of my Android device, when a missed call was notified on the netbook through KDE Connect. Firefox and a terminal were running in the foreground.
Even so, 4.8 and later kernels are much better than previous versions, at least on my Asus x553ma netbook.
Comment 676 Len Brown 2017-01-12 18:23:58 UTC
@ jechtpurgateur@gmail.com 

if the workaround (booting with "intel_idle.max_cstate=1") does not
help your system, then you have a different bug.  Please file it.
Comment 677 Len Brown 2017-01-12 18:27:28 UTC
@ alvararo 

It is interesting that you had a failure easily reproducible in 3.16 -- that kernel was widely cited as the most stable baytrail kernel before things went south in 3.17.  I'm afraid, however, that interest in 3.16 is about zero right now.  There have been a lot of fixes, and it would be more interesting if you could get Linux-3.9 to fail quickly.
Comment 678 Len Brown 2017-01-12 18:29:40 UTC
@ alvararo

oops, typo, we want to go forward to the present, not back in history:-)

< get Linux-3.9 to fail quickly.
> get Linux-4.9 to fail quickly.
Comment 679 julio.borreguero@gmail.com 2017-01-12 18:51:56 UTC
(In reply to Len Brown from comment #676)
> @ jechtpurgateur@gmail.com 
> 
> if the workaround (booting with "intel_idle.max_cstate=1") does not
> help your system, then you have a different bug.  Please file it.

@len

there seem to be various bugs with the baytrail architecture: my system, an N2940,
still freezes with "intel_idle.max_cstate=1", and so do all other N2940 systems.
But for us kernel 4.12 (I am using it) or even 4.16 (I think) works well without any kernel parameters.
All later versions freeze, including 4.8.x and, I'm pretty sure, 4.9.rcx as well.
All that info is in this thread, which seems to be more like a public chat forum by now ;)
The point I want to make: as there are several bugs that affect baytrail, most likely related somehow, why file a different bug report for the N2940?
This is the best bug thread there is so far regarding baytrail problems on the net, as far as I know.
If the baytrail problem is really solved some day, I am pretty sure it will be solved for us as well.
I would include the valid info for the N2940 reported by many users in this bug report, to help solve the problem(s) with baytrail.
Kind regards, and thank you for your work on this.
Comment 680 Prashant Poonia 2017-01-12 21:45:23 UTC
(In reply to Len Brown from comment #676)
> @ jechtpurgateur@gmail.com 
> 
> if the workaround (booting with "intel_idle.max_cstate=1") does not
> help your system, then you have a different bug.  Please file it.

There is something more to this bug, as I have an interesting observation. My n3540 is of the baytrail architecture; I have tested all kernel versions, and intel_idle.max_cstate=1 was the ultimate workaround for all of them, which makes my hardware a perfect fit for this specific bug. But recently, when I tested 4.8.0-32, my system froze once even with the cstate parameter in place. It didn't happen again, and the same kernel is also the most stable one when the cstate parameter is not set.
I have also noticed a strong relation between freezes and heavy wifi usage. People who are facing freezes even after the max_cstate parameter is set should check whether there is a relation between wifi and freezes and report back. The trigger is downloading big files over a fast connection.
Comment 681 Len Brown 2017-01-12 22:20:25 UTC
@ julio.borreguero@gmail.com

I must insist.

If you have a failure that is anything other than a baytrail hang that goes away with intel_idle.max_cstate=1, then you are best served by a new bug report.

While we always hope there is a magic bullet that fixes multiple similar issues, experience shows that is very rarely the case.  This bug report will be closed when Baytrail systems that used to hang without intel_idle.max_cstate=1 no longer need that parameter.  So if that doesn't describe your system, you are best off with a bug report that does.  Go ahead and reference it here, but please put all necessary information describing that failure in that bug report.  Thanks.
Comment 682 Len Brown 2017-01-12 22:42:37 UTC
@ Prashant Poonia

Yes, the n3540 failing with intel_idle.max_cstate=1 is also interesting.  If you can isolate what kind of workload triggers it, please put that in a new bug report describing the failure.  If you suspect WIFI, then I suggest seeing if you can eliminate sound and graphics from the workload to be sure the known problems there are not the actual root cause.
Comment 683 Mika Kuoppala 2017-01-13 14:43:41 UTC
Created attachment 251471 [details]
drm/i915/byt: Avoid tweaking evaluation thresholds
Comment 684 Justin 2017-01-13 16:34:32 UTC
Kind of disagree with the new bug report sentiment...  This bug is almost 2 years old and multiple kernel updates have come out since then.  Are we asking all those for whom intel_idle.max_cstate=1 used to work 2 years ago to go back and wait another 2 years for those issues to be addressed, simply because they now need additional parameters for intel_idle.max_cstate=1 to work?
Comment 685 Josep Pujadas-Jubany 2017-01-13 18:31:10 UTC
2 years ago? True. It's explained at https://www.phoronix.com/scan.php?page=news_item&px=Intel-Linux-Bay-Trail-Fail

(Bug started at https://bugs.freedesktop.org/show_bug.cgi?id=88012 and moved to a kernel bug)

The hardware bug that is supposed to be the origin of the problem (VLP52) was reported by Intel in March 2014: https://bugzilla.kernel.org/show_bug.cgi?id=109051#c425

Other OSes are also affected by this Intel hardware bug and others.

Recommended Google searches:

freeze c-state windows

freeze c-state osx

In fact, many Windows gamers recommend disabling C-states.
Comment 687 Guillermo Molleda 2017-01-14 17:06:03 UTC
I have a Lenovo Yoga 11e 20D9 with an Intel Celeron N2930 (2.17GHz) and 4GB RAM, with the UEFI BIOS updated to the latest version (17-October-2016).
Windows 8.1 runs perfectly -> it is not a hardware bug.
But in Linux Mint 18.1 Serena 64bit MATE 1.16.1, kernel 4.4.0-57-generic #78-Ubuntu x86_64, the system freezes within 5 minutes when I watch a video on YouTube with Firefox.
With intel_idle.max_cstate=1 it does not freeze.
Comment 688 jbMacAZ 2017-01-14 17:14:41 UTC
kernel 4.9.3 seems to be more stable.  Unpatched and no workarounds (no T100 patches either, just custom .config) took over eight hours to freeze.  4.9.2 would freeze in less than an hour on my system. YMMV 

I applied Mika Kuoppala's new patch to 4.9.2 and it is still running 12 hours later with Len Brown's byt.test script.  Without any cstate workaround hard freeze averages around 30 minutes on recent kernels.  (T100CHI - Z3775)
Comment 689 Ernestas Kulik 2017-01-14 20:35:28 UTC
I’m going to jump on the bandwagon here.

I only experience freezes with both onboard WNIC and mode setting enabled.
A semi-consistent reproducer is establishing multiple concurrent network connections (e.g. downloading a torrent) and/or downloading big files at high speeds.

The laptop is an ASUS X553MA with a Celeron N2840.

A probably unrelated thing is that the BIOS has a mysterious setting called “OS Selection” with “Windows 7” and “Windows 8.x” options. Older (probably 3.x and early 4.x) kernels used to not boot with “Windows 8.x” selected. I could manage to get it to boot by using a modified DSDT, so I assumed it was an ACPI problem, but it works fine with the current kernel.

Another probably unrelated thing is that this thing freezes when modules dw_dmac and dw_dmac_core are loaded (I tried documenting these things here: https://wiki.archlinux.org/index.php/ASUS_X553MA#Laptop_freezes_on_boot).
Comment 690 Len Brown 2017-01-15 18:24:58 UTC
@ Mika Kuoppala

Your patch in comment #683 made a dramatic improvement,
when applied to Linux-4.8.17.

Without the patch, the Dell-n3540 hung in 13 minutes
and the Acer-J1900 hung in 3 minutes.

With the patch, both machines are still running after 12 hours.

(both fixed at HFM, running 1 copy of glxgears + 8 copies of nanosleep)
(both are using wired ethernet -- wifi is disabled on the Dell
 and it is using an USB/wired-ethernet dongle)
(no audio is being played)

Looking at the patch, it appears to be a revert of

            commit 8fb55197e64d5988ec57b54e973daeea72c3f2ff
            Author: Chris Wilson <chris@chris-wilson.co.uk>
            Date:   Tue Apr 7 16:20:28 2015 +0100
    
            drm/i915: Agressive downclocking on Baytrail

That patch went upstream in Linux-4.2-rc1.  That is interesting
because 4.1 was often cited as a local maximum in baytrail stability
with 4.2 widely cited as less stable.

And so my feedback on that patch is consistent with the favorable result
reported above by jbMacAZ on the T100 TAM z3775.

I tried doing the same comparison using Linux-4.9.3,
but the baseline test of Linux-4.9.3 with no patches
ran for 30 hours on both machines without failure.
Comment 691 AB 2017-01-16 07:50:04 UTC
Can anyone reproduce freezes with Ethernet cable connection and wifi turned off?

On my desktop I see them only when wifi usb dongle is connected (with both RTL chips available for me).
Comment 692 jbMacAZ 2017-01-16 18:31:17 UTC
I tried using both the demotion patch and threshold patch on 4.9.4 only to be stymied by a regression in wifi (also seen in 4.8.17.)  I call it a soft freeze, because the UI only updates about once a minute, but the mouse cursor moves freely.  dmesg fills up with various brcmfmac error -110's.  For purposes of the cstate bug, I'll stick to testing with 4.9.2

FWIW, with 4.9.2 I was able to run Mika's patch for 37 hours without a freeze.  I stopped that test to try other things.  I had done some testing many months ago regarding aggressive down-clocking (comment #93) which showed only slight improvement at that time.
Comment 693 Pshem K 2017-01-16 20:04:24 UTC
(In reply to AB from comment #691)
> Can anyone reproduce freezes with Ethernet cable connection and wifi turned
> off?
> 
> On my desktop I see them only when wifi usb dongle is connected (with both
> RTL chips available for me).

I can easily reproduce this without wifi. I ran a headless router setup and the lockups most frequently occur after (or sometimes during) heavy network activity.
Comment 694 Shev_84 2017-01-18 20:53:53 UTC
Running ./byt.test | tee out.`date +%Y%m%d_%H%M%S` I cannot hang my J1900. After about 36 hours of running, I stopped it, played a movie and got a hang after about 10 minutes of playback. For me this script doesn't work (or I should rather say 'isn't effective').
Running on Ubuntu 16.10.
BOOT_IMAGE=/boot/vmlinuz-4.8.0-34-generic root=UUID=d097e0d3-b7a2-4943-95fa-591edd652328 ro quiet splash vt.handoff=7
board_vendor:ASRock
board_name:Q1900DC-ITX
board_version:
bios_date:03/31/2016
bios_vendor:American Megatrends Inc.
bios_version:P1.50
processor       : 0
vendor_id       : GenuineIntel
cpu family      : 6
model           : 55
model name      : Intel(R) Celeron(R) CPU  J1900  @ 1.99GHz
stepping        : 8
microcode       : 0x831
cpu MHz         : 1509.649
cache size      : 1024 KB
physical id     : 0
siblings        : 4
[    1.860044] intel_idle: MWAIT substates: 0x33000020
[    1.860046] intel_idle: v0.4.1 model 0x37
[    1.860414] intel_idle: lapic_timer_reliable_states 0xffffffff
state0/desc:CPUIDLE CORE POLL IDLE
state1/desc:MWAIT 0x00
state2/desc:MWAIT 0x58
state3/desc:MWAIT 0x52
state4/desc:MWAIT 0x60
state5/desc:MWAIT 0x64
nanosleep 1000
nanosleep 1000
nanosleep 1000
nanosleep 1000
nanosleep 1000
nanosleep 1000
nanosleep 1000
nanosleep 1000
glxgears -display :0 -fullscreen
turbostat --debug -i 100
turbostat version 4.17 10 Jan 2017 - Len Brown <lenb@kernel.org>
CPUID(0): GenuineIntel 11 CPUID levels; family:model:stepping 0x6:37:8 (6:55:8)
CPUID(1): SSE3 MONITOR - EIST TM2 TSC MSR ACPI-TM TM
CPUID(6): APERF, DTS, No-PTM, No-HWP, No-HWPnotify, No-HWPwindow, No-HWPepp, No-HWPpkg, EPB
cpu2: MSR_IA32_MISC_ENABLE: 0x00850089 (TCC EIST MONITOR)
CPUID(7): No-SGX
SLM BCLK: 83.3 Mhz
cpu2: MSR_CC6_DEMOTION_POLICY_CONFIG: 0x00000000 (DISable-CC6-Demotion)
cpu2: MSR_MC6_DEMOTION_POLICY_CONFIG: 0x00000000 (DISable-MC6-Demotion)
RAPL: 4581 sec. Joule Counter Range, at 30 Watts
cpu2: MSR_PLATFORM_INFO: 0x100000001800
16 * 83 = 1333 MHz max efficiency frequency
24 * 83 = 1999 MHz base frequency
cpu2: MSR_IA32_POWER_CTL: 0x00000000 (C1E auto-promotion: DISabled)
cpu2: MSR_TURBO_RATIO_LIMIT: 0x00000000
cpu2: MSR_PKG_CST_CONFIG_CONTROL: 0x0018000f (UNlocked: pkg-cstate-limit=15: pc7)
cpu0: MSR_IA32_ENERGY_PERF_BIAS: 0x00000006 (balanced)
cpu0: MSR_RAPL_POWER_UNIT: 0x00000505 (0.031250 Watts, 0.000032 Joules, 0.000977 sec.)
cpu0: MSR_PKG_POWER_LIMIT: 0x003880fa (UNlocked)
cpu0: PKG Limit #1: ENabled (7.812500 Watts, 262144.000000 sec, clamp DISabled)
cpu0: PKG Limit #2: DISabled (0.000000 Watts, 0.000977* sec, clamp DISabled)
cpu0: MSR_PP0_POWER_LIMIT: 0x00000000 (UNlocked)
cpu0: Cores Limit: DISabled (0.000000 Watts, 0.000977 sec, clamp DISabled)
cpu0: MSR_IA32_TEMPERATURE_TARGET: 0x00690000 (105 C)
cpu0: MSR_IA32_THERM_STATUS: 0x88330000 (54 C +/- 1)
cpu1: MSR_IA32_THERM_STATUS: 0x88330000 (54 C +/- 1)
cpu2: MSR_IA32_THERM_STATUS: 0x88310000 (56 C +/- 1)
cpu3: MSR_IA32_THERM_STATUS: 0x88310000 (56 C +/- 1)


        Core    CPU     Avg_MHz Busy%   Bzy_MHz TSC_MHz IRQ     SMI     CPU%c1  CPU%c6  Mod%c6  CoreTmp GFX%rc6 GFXMHz  Pkg%pc6 PkgWatt CorWatt
        -       -       56      4.18    1348    2000    239554  0       8.01    87.81   58.56   57      **.**   187     30.99   0.62    0.21
        0       0       55      4.10    1349    2000    39173   0       5.37    90.53   67.47   55      **.**   187     30.99   0.62    0.21
        1       1       54      4.02    1350    2000    39053   0       5.25    90.73   67.47   55
        2       2       63      4.70    1347    2000    88111   0       11.57   83.73   49.65   57
        3       3       53      3.90    1349    2000    73217   0       9.87    86.23   49.65   57

I can play a movie and then grab this turbostat output, but I don't know if it would be helpful to the topic.
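
The two frequency lines that turbostat prints under MSR_PLATFORM_INFO in the dump above follow directly from the raw register value and the "SLM BCLK" figure. A small Python sketch of that arithmetic, assuming the base ratio sits in bits 15:8 and the max-efficiency ratio in bits 47:40 (the function name is made up for illustration):

```python
SLM_BCLK_MHZ = 83.3  # the "SLM BCLK" value printed in the dump above


def decode_platform_info(msr, bclk_mhz=SLM_BCLK_MHZ):
    """Return (max-efficiency MHz, base MHz) decoded from MSR_PLATFORM_INFO."""
    base_ratio = (msr >> 8) & 0xFF          # assumed field: bits 15:8
    efficiency_ratio = (msr >> 40) & 0xFF   # assumed field: bits 47:40
    return round(efficiency_ratio * bclk_mhz), round(base_ratio * bclk_mhz)
```

For this J1900, `decode_platform_info(0x100000001800)` yields ratios 16 and 24, i.e. 1333 MHz and 1999 MHz, the same figures as in the dump.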
Comment 695 João Paulo Rechi Vita 2017-01-21 15:36:57 UTC
Hello Len,

First, thanks for taking the lead on this. I've recently worked on enabling a device based on the Intel Atom Z3735F at Endless. From what I can tell it is pretty similar to one of the recent Intel Compute Sticks. That device also has a RTL8723BS WiFi adapter, which is known to be very problematic on Bay-Trail platforms. I dug the device out of my drawers and did a couple of tests with it:

I've based my tests on our current -next kernel, which is based on Ubuntu's Zesty master branch, in turn based on Linus' v4.9 tag. Additionally, to be able to use the machine for more than a couple of minutes I need https://patchwork.kernel.org/patch/9478087/, or to disable run-time PM for the SDHCI host controller.

Without your C6 auto-demotion patch, I was not able to reproduce the freeze using your stress test script running for ~8-10h, but playing videos from youtube in a loop froze the machine in ~1h. I've also tried heavy downloads without X running at all, to see if the problem could be isolated to networking / SDHCI, but that also didn't reproduce the freeze. This is the turbostat output when the machine froze playing youtube:

	Core	CPU	Avg_MHz	Busy%	Bzy_MHz	TSC_MHz	IRQ	SMI	CPU%c1	CPU%c6	Mod%c6	CoreTmp	GFX%rc6	GFXMHz	Pkg%pc6	PkgWatt	CorWatt
	-	-	190	15.18	1253	1333	311669	0	13.62	71.20	37.07	75	1.14	646	5.38	2.77	0.33
	0	0	332	24.79	1337	1333	151194	0	23.39	51.81	14.19	75	1.14	646	5.38	2.77	0.33
	1	1	278	20.44	1360	1333	91562	0	19.57	59.99	14.19	75
	2	2	101	10.28	986	1333	46459	0	7.67	82.05	59.95	75
	3	3	50	5.19	961	1333	22454	0	3.85	90.95	59.95	75

Using your C6 auto-demotion patch the machine survived an overnight youtube play loop, but the time on Mod-C6 or Pkg-C6 dropped down to zero most of the time, except once where I got (still super low):

	Core	CPU	Avg_MHz	Busy%	Bzy_MHz	TSC_MHz	IRQ	SMI	CPU%c1	CPU%c6	Mod%c6	CoreTmp	GFX%rc6	GFXMHz	Pkg%pc6	PkgWatt	CorWatt
	-	-	275	15.74	1749	1333	107463	0	32.25	52.01	0.01	68	6.58	229	0.00	1.24	0.49
	0	0	341	19.39	1760	1333	53679	0	66.65	13.97	0.00	66	6.58	229	0.00	1.24	0.49
	1	1	238	13.61	1744	1333	16186	0	17.04	69.35	0.00	66
	2	2	274	15.68	1744	1333	21012	0	26.33	57.99	0.02	68
	3	3	249	14.29	1744	1333	16586	0	18.97	66.74	0.02	68

I've also tried Mika Kuoppala's patch from comment #683 (without the auto-demotion patch), and the machine survived an overnight youtube play loop while still entering all the C6 states:

	Core	CPU	Avg_MHz	Busy%	Bzy_MHz	TSC_MHz	IRQ	SMI	CPU%c1	CPU%c6	Mod%c6	CoreTmp	GFX%rc6	GFXMHz	Pkg%pc6	PkgWatt	CorWatt
	-	-	278	24.62	1130	1333	185129	0	8.55	66.83	38.72	68	**.**	646	20.25	0.79	0.34
	0	0	301	26.78	1123	1333	101421	0	11.76	61.46	36.63	66	**.**	646	20.25	0.79	0.34
	1	1	269	23.18	1159	1333	20539	0	6.21	70.62	36.63	66
	2	2	285	25.80	1103	1333	42486	0	9.73	64.47	40.81	68
	3	3	259	22.74	1139	1333	20683	0	6.49	70.77	40.81	68

I can give another try on a clean v4.9.Y if you think this would be helpful.
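
Comparing residency columns such as Mod%c6 and Pkg%pc6 across several turbostat samples like the three above is easier with a small parser. This is a rough Python sketch with hypothetical helper names; it assumes that per-core rows only ever omit trailing (package-wide) columns, as in the samples in this thread.

```python
def parse_turbostat(text):
    """Parse a whitespace-separated turbostat table into a list of row dicts."""
    lines = [ln.split() for ln in text.splitlines() if ln.strip()]
    header, rows = lines[0], lines[1:]
    # zip() stops at the shorter sequence, so rows that omit the trailing
    # package-wide columns simply lack those keys.
    return [dict(zip(header, row)) for row in rows]


def summary_value(rows, column):
    """Return a column of the summary row (Core == '-') as a float, or None."""
    for row in rows:
        if row.get("Core") == "-":
            try:
                return float(row[column])
            except (KeyError, ValueError):  # column missing, or a '**.**' field
                return None
    return None
```

Applied to the three samples above, `summary_value(rows, "Pkg%pc6")` gives 5.38 for the unpatched run, 0.00 with the auto-demotion patch, and 20.25 with the patch from comment #683.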
Comment 696 Len Brown 2017-01-22 08:59:39 UTC
@ Shev_84 

It is interesting that with 4.8.0-distro your j1900 survives my test script for 36 hours, while my j1900 dies quite quickly.  I've not examined it closely, but I should mention that I run with cpufreq set to maximum frequency, since I suspect, but have not rigorously proven, that causes the failure sooner.

Even more interesting that on the same j1900, a movie hangs the system in 10 minutes.  Can you share exactly how you are playing the movie?  Does the movie still hang the system as fast if you have audio disabled?  Yes, turbostat from your movie test is interesting.

Finally, the cutting edge is 4.9.stable plus Mika's patch from comment #683, so if you can find a way to make that fail, please share it.

@ João Paulo Rechi Vita

My experience was that my script could make 4.8 fail in under 30 minutes, but that when I tried 4.9, the first failure was at 24 hours, and another machine was still running at 30 hours.  So I'm not surprised that my script didn't hang your 4.9 system after 10 hours.  It would be interesting to know if your observations are consistent with mine -- does my script hang your system in under 30 minutes when you run 4.8?

Re: youtube

On the configuration that fails quickly with youtube, I'd be interested to know if it survives longer if sound is disabled.  There is a known audio bug, with patch on the way, that may be independent of failures that occur when audio is not active.

Re: Len's demotion patch

I think it is obsolete, and not worth further testing.  Your youtube turbostat output shows it made a dramatic difference in mc6 residency -- more than I've seen on other workloads.  I spoke to the baytrail pcode author, who concurs that correctness and stability should not require enabling demotion.  So the difference in stability with that patch is more likely a side-effect because demotion is simply making c6 less common.

Re: Mika's patch from comment #683

Running my nanosleep + glxgears script, I've not seen *any* failure with it.
My last report said it ran for 12 hours -- I let the n3540 and the j1900 continue running for 7 days and nights and they did not fail.  That was based on 4.8.17 -- a baseline that without that patch would typically fail in under 20 minutes.

I'm now testing 4.9.3 + the same patch.  Based on the fact that 4.9.3 was robust before the patch, I'm expecting it to be stable.

I will start experimenting with sound, movies, and youtube.  I'm interested in hearing other's experiences with 4.9.stable + the patch from comment #683
Comment 697 Sebastian Heyn 2017-01-22 16:44:26 UTC
>It is interesting that with 4.8.0-distro your j1900 survives my test script
>for 36 hours, while my j1900 dies quite quickly.

are you both using the same cpu microcode?
Comment 698 Len Brown 2017-01-23 05:52:55 UTC
@ Sebastian Heyn 

Good observation -- different microcode.
So today I updated the microcode to match, and re-tested.
I found the microcode version made no difference.

details:

Same CPU:

cpu family      : 6
model           : 55
model name      : Intel(R) Celeron(R) CPU  J1900  @ 1.99GHz
stepping        : 8

But the microcode was different:

Shev_84's ASRock Q1900DC-ITX w/ BIOS 03/31/2016:

microcode       : 0x831

Len's Acer Aspire XC-703G w/ BIOS 8/28/2014:

microcode       : 0x809

So to see if the older microcode is what makes the Acer less stable
than the ASRock, I updated the Acer XC-703G to BIOS 09/15/2015,
which also brought it up to microcode 0x831.

Re-testing vmlinuz-4.8.0-34-generic three times failed after
14, 8 and 15 minutes -- which is typical of the previous microcode.

I can't explain why my Acer XC-703G fails more easily
than Shev_84's ASRock when running just my nanosleep+glxgears script,
but now we know it has nothing to do with the microcode version.
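
When comparing machines as above, the microcode revision can be pulled out of /proc/cpuinfo text mechanically. A minimal Python sketch (the function name is hypothetical; it only assumes the "microcode : 0x..." line format shown in this thread):

```python
def microcode_revisions(cpuinfo_text):
    """Return the set of microcode revisions found in /proc/cpuinfo-style text."""
    revisions = set()
    for line in cpuinfo_text.splitlines():
        key, sep, value = line.partition(":")
        if sep and key.strip() == "microcode":
            revisions.add(int(value.strip(), 16))
    return revisions
```

Two machines can then be compared with something like `microcode_revisions(a) == microcode_revisions(b)`; a set is used because the line is repeated once per logical CPU.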
Comment 699 dizzy 2017-01-23 10:32:19 UTC
Hi all,

I have been experiencing the same problems with my notebook since I switched my distro from Ubuntu 16.04 LTS to Fedora 25. The hardware is a Toshiba Tecra R840-110 containing:
- Intel(R) Core(TM) i5-2520M CPU @ 2.50GHz (Sandy Bridge, I think..)
- Network controller: Intel Corporation Centrino Advanced-N 6230 [Rainbow Peak] (rev 34) (will be important, see later)

Symptoms (1st iteration ;-)):
- the computer randomly freezes after approx. 5 minutes to a few (approx. 10) hours, regardless of whether I was using it or not. The computer stopped responding completely (even magic SysRq doesn't work), with the CPU fan running high a few seconds after the freeze
- kernel 4.8.6-300.fc25.x86_64

I applied the script c6off+c7on from Wolfgang Reimer (modifying it for SNB) and it definitely helped and the system was stable with no freezes during work (few hours).

The problem came after the update to kernel 4.9.X: the script stopped working and the computer froze regardless of whether the script had been applied or not. So I reverted my kernel back to version 4.8.6 (nice number, isn't it), examined things a bit, and the following facts came up:
- the computer freezes randomly after a few minutes up to a few hours if the script has not been applied
- after applying the script you can work with the computer for many hours (5-6 hrs without a freeze), but if I leave it turned on and inactive (without working on it) IT WILL FREEZE IN 20-30 MINUTES (same symptoms).

So in addition to the c6 state problem, something happens after 20-30 minutes of inactivity that causes the computer to deadlock. Examining the log files, the last entries before the freeze were (among others) from NetworkManager (refreshing DHCP leases, WIFI rekeying,...). So I shut down NetworkManager, disabled wifi (as it was the only active network interface during all freezes), and unloaded its modules from the kernel (including the whole wifi stack) using rmmod, and voila - the computer survived more than 16 hours of inactivity without a freeze.

So the final workaround for me looks as follows:
- kernel 4.8.6
- Wolfgang Reimer's script (thanks a lot)
- inactive WIFI - maybe wifi without power management would also work; that will be a subject of further investigation

Last note to kernel versions:
- 4.4.0 - actually in Ubuntu 16.04LTS - works fine, with small issues (sometimes screen distorted, wifi malfunction after wakeup from standby) - NO FREEZES (in Ubuntu, not tested with Fedora)
- 4.8.6 - works after applying above workaround
- 4.9.X (4.9.3, 4.9.4 for example) - random freezes, even with Wolfgang's script. After disabling c_states (=0), crashes within 45 minutes (can be caused by WIFI,...)

Hope this helps someone. If you need further information/tests,... let me know.

R.
Comment 700 Len Brown 2017-01-23 17:57:47 UTC
@ dizzy

> Intel(R) Core(TM) i5-2520M CPU @ 2.50GHz

https://ark.intel.com/products/52229/Intel-Core-i5-2520M-Processor-3M-Cache-up-to-3_20-GHz

Yes, your system is a "Sandy Bridge".
As that 2011 processor is of the Core Architecture, rather than the Atom Architecture "Bay Trail", please file a bug specifically describing that failure.  I recommend that before you do, you run memtest overnight.
Comment 701 dizzy 2017-01-24 15:57:27 UTC
(In reply to Len Brown from comment #700)
> @ dizzy
> 
> > Intel(R) Core(TM) i5-2520M CPU @ 2.50GHz
> 
> https://ark.intel.com/products/52229/Intel-Core-i5-2520M-Processor-3M-Cache-
> up-to-3_20-GHz
> 
> Yes, your system is a "Sandy Bridge".
> As that 2011 processor is of the Core Architecture, rather than the Atom
> Architecture "Bay Trail", please file a bug specifically describing that
> failure.  I recommend that before you do, you run memtest overnight.

Hi Len - I'm sorry - I just thought that if the HW is similar, the symptoms are similar, and even the workaround is similar, someone around here might find this information helpful (I didn't know the bug is strictly for baytrail). But OK, I created another bug (https://bugzilla.kernel.org/show_bug.cgi?id=193261) as suggested (after a successful pass of memtest, of course ;-)).

Sorry again for spamming Your bug...
Comment 702 Vincent Gerris 2017-01-24 22:13:38 UTC
Hi,

I put a patched 4.9.0 kernel (the patch from Mika, on the latest ubuntu-zesty) up for the Ubuntu users here who want to try it:
https://www.dropbox.com/sh/c39et4hr6tgp60q/AAC35c56aOEOwkmhjdvtG6dsa?dl=0

I am running:
ubuntu@ubuntu-Lenovo-Yoga-2-11:~/Downloads$ ./byt.test | tee out.`date +%Y%m%d_%H%M%S`
Linux ubuntu-Lenovo-Yoga-2-11 4.9.0-mika-no-tweak-eval-th #3 SMP Tue Jan 24 10:10:26 CET 2017 x86_64 x86_64 x86_64 GNU/Linux
tis 24 jan 2017 22:37:53 CET
BOOT_IMAGE=/boot/vmlinuz-4.9.0-mika-no-tweak-eval-th root=UUID=6a53171b-c5f2-44a4-a69f-a08f38312a8c ro quiet splash vt.handoff=7
board_vendor:LENOVO
board_name:AIUU1
board_version:31900042STD
bios_date:08/19/2015
bios_vendor:LENOVO
bios_version:92CN93WW(V1.93)
processor	: 0
vendor_id	: GenuineIntel
cpu family	: 6
model		: 55
model name	: Intel(R) Pentium(R) CPU  N3520  @ 2.16GHz
stepping	: 3
microcode	: 0x320
cpu MHz		: 1305.507
cache size	: 1024 KB
physical id	: 0
siblings	: 4
[    1.274218] intel_idle: MWAIT substates: 0x33000020
[    1.274220] intel_idle: v0.4.1 model 0x37
[    1.274527] intel_idle: lapic_timer_reliable_states 0xffffffff
state0/desc:CPUIDLE CORE POLL IDLE
state1/desc:MWAIT 0x00
state2/desc:MWAIT 0x58
state3/desc:MWAIT 0x52
state4/desc:MWAIT 0x60
state5/desc:MWAIT 0x64
nanosleep 1000
nanosleep 1000
nanosleep 1000
nanosleep 1000
nanosleep 1000
nanosleep 1000
nanosleep 1000
nanosleep 1000
glxgears -display :0 -fullscreen
turbostat --debug -i 100
[sudo] password for ubuntu: tis 24 jan 2017 22:38:53 CET
tis 24 jan 2017 22:39:53 CET
tis 24 jan 2017 22:40:53 CET
tis 24 jan 2017 22:41:53 CET
tis 24 jan 2017 22:42:53 CET
tis 24 jan 2017 22:43:53 CET
tis 24 jan 2017 22:44:53 CET
tis 24 jan 2017 22:45:53 CET
tis 24 jan 2017 22:46:53 CET

No problems yet.
I also tried my regular stress test at the same time (movie play over network with bluetooth and copy of same file) just to see if it hung and it did not yet.

Please grab the kernel to test as Len asked. 
Thanks Len for driving this, it's greatly appreciated!

I'll post my uptime later on, it looks good for now.
Comment 703 Len Brown 2017-01-25 07:50:58 UTC
Re: youtube or movie playback as a stress test

Can anybody share, exactly, how you play movies to provoke the failure faster than you can provoke it running my script?  I opened youtube surfed for movie previews and played them, and youtube seemed to move onto new videos, but when I came in the next day it had decided to stop streaming.  I thought it was hung, but wiggling the mouse prompted it to start playing again... anyway, it took me just under 24 hours to get a 4.8 kernel to fail using youtube on the Acer n3540, when it took < 15 minutes using my nanosleep+glxgears script.  So unless somebody has a recipe showing exactly how to play movies that fails quickly, I'm at a loss to reproduce and explain Shev_84's experience of movies failing quickly in comment #694.
Comment 704 Len Brown 2017-01-25 07:55:04 UTC
<typo in previous comment, that was an Acer-J1900, not an n3540, ie matched Shev_84's system as best as I could>
Comment 705 Mika Kuoppala 2017-01-25 09:10:45 UTC
Here is my test script, which has been rather effective. It usually hangs in less than an hour, and always in less than 24 hours.

glxgears >/dev/null 2>/dev/null &
mpv --vo=vaapi --hwdec=vaapi --loop=inf vid.mp4
Comment 706 Len Brown 2017-01-25 09:13:05 UTC
Re: Asus T100

I finally got the Asus T100-TAM and T100-CHI installed with Ubuntu 16.10.  It seems the secret for the T100-TAM was to not try to dual boot with Windows, but to erase the hard drive.  The T100-CHI, amazingly, installed properly as a Windows dual-boot.  The T100-CHI requires a USB hub and some adapters to give it a network/kbd/mouse and USB thumb drive access.  I'm running both with usb/wired-ethernet.

Running my idle torture test script, I find both of these boxes to be delightfully unstable. Both systems with 4.8.0-34-generic hung on average in under 5 minutes.
And interestingly, they did not seem any more stable with Ubuntu's 4.9.5 ppa.

@ Vincent Gerris

Thanks for packaging the test kernel, I've kicked off a test of it on all 4 boxes.
Comment 707 Juha Sievi-Korte 2017-01-25 18:37:51 UTC
I've now been running 4.9.0-1 with Mika's patch (and also have the autodemotion patch applied at the moment). I just had a freeze in less than 24 hours with the byt.test script. I'll continue running to see if there is as much variation in the times to freeze as previously.

Turbostat output from near the failure:

Wed Jan 25 19:36:59 EET 2017
	Core	CPU	Avg_MHz	Busy%	Bzy_MHz	TSC_MHz	IRQ	SMI	CPU%c1	CPU%c6	Mod%c6	CoreTmp	GFX%rc6	GFXMHz	Pkg%pc6	PkgWatt	CorWatt
	-	-	112	11.02	1012	2167	266841	0	14.60	74.38	30.30	49	**.**	396	11.35	0.41	0.16
	0	0	120	12.15	986	2167	76451	0	16.90	70.95	30.94	49	**.**	396	11.35	0.41	0.16
	1	1	93	8.71	1065	2167	51852	0	12.34	78.95	30.94	49
	2	2	134	12.46	1076	2167	66838	0	13.31	74.23	29.66	49
	3	3	100	10.76	927	2167	71700	0	15.86	73.38	29.66	49

So not a big improvement for me on N3540.
Comment 708 Pilot_6572n 2017-01-25 22:03:21 UTC
A very quick bit of feedback on what I experienced with the Qotom Bay Trail Z3735F motherboard I used.

With Ubuntu (16.04 - linuxium), the BIOS was altered: an 'ubuntu' entry was permanently added to the BIOS boot menu option list.

I observed crashes and regular freezes arriving so quickly that I was unable to load any repair program or the test scripts above.

With Jessie, I observed the same.

Upgrading the BIOS with the microcode obtained from the supplier didn't change much. The 'ubuntu' option in the BIOS disappeared, but the continuous crashes were still present once Ubuntu was reloaded.

As this is for production use, I swapped the Bay Trail Qotom motherboard for a Braswell one (Asrock - N3700 - slightly larger, but with a processor consuming much the same power (6 watts vs 2)).

All seems fine so far, running some real-time GPS and video programs (navit, gnuplot and so on) 24 hours a day with a 'regular' Ubuntu 16.04 or with Jessie (4.8).

Now I am testing the latest Fedora and all looks fine.

So I have dropped Bay Trail/Cherry Trail for Braswell, with, I hope, no return.
Comment 709 Len Brown 2017-01-26 04:46:07 UTC
Created attachment 253151 [details]
pstate.set script

@ Juha Sievi-Korte

How about if you first run the attached script to configure frequency to the max:

./pstate.set max

Does that cause the failure to occur sooner?

If it can shorten the time to failure, what do you see when you then boot with intel_idle.max_cstate=2 ?  (That will enable C6NS, but will not enable C6Shrink -- so you will see core-c6 residency, but no module-c6 or package-c6.)
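
Since several reports in this thread hinge on which intel_idle.max_cstate value was actually in effect, here is a minimal sketch for checking a kernel command line string. The function name is hypothetical, and keeping the last occurrence when the parameter appears more than once is an assumption of this sketch, not something stated in the thread.

```python
import re
from typing import Optional


def intel_idle_max_cstate(cmdline: str) -> Optional[int]:
    """Return the intel_idle.max_cstate value found in a kernel command line.

    Returns None when the parameter is absent. If it appears several times,
    the last occurrence is kept (an assumption made for this sketch).
    """
    value = None
    for token in cmdline.split():
        match = re.fullmatch(r"intel_idle\.max_cstate=(\d+)", token)
        if match:
            value = int(match.group(1))
    return value


# On a running system the string would be read from /proc/cmdline;
# a literal is used here so the sketch runs anywhere.
example = "BOOT_IMAGE=/boot/vmlinuz-4.9.0 ro quiet splash intel_idle.max_cstate=2"
print(intel_idle_max_cstate(example))
```

Recording this value next to each test result would make it easier to compare runs with max_cstate=1, =2 and no limit.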
Comment 710 Len Brown 2017-01-26 05:49:59 UTC
I've tested Vincent Gerris' 4.9+mika kernel on 4 systems, and 3 of them had no issue after 18 hours:

Dell n3540 laptop (wireless)
Acer j1900 desktop (wired net)
Asus T100-TAM convertible (USB/wired net)

But the 4th system, an Asus T100-CHI, fails reliably in under 10 minutes of testing with this kernel, just like it did for un-patched 4.8 and 4.9.

So I booted the T100-CHI with "intel_idle.max_cstate=2" (enables Core-C6, but disables module/package-C6) and it ran over 18 hours without a problem.
Rebooted with intel_idle.max_cstate=3, and it failed again after 3 minutes.

Right now it is running with intel_idle.max_cstate=0, which boots in ACPI mode.  Here the C-states are MWAIT 0 (C1), MWAIT 0x51 (CC6), and MWAIT 0x64 (C7s).  While Linux does make requests for C7s, those appear to be demoted all the way to CC6, as there is no mc6 or pc6 residency.  ie. this is behaving exactly like the "intel_idle.max_cstate=2" case, and so I expect it to still be running when I come in tomorrow...
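
The MWAIT values quoted here and in the state descriptions earlier (0x00, 0x58, 0x52, 0x60, 0x64) are hints passed in EAX. As a rough aid for reading them, the sketch below splits a hint into its two fields: bits 7:4 are the target C-state field and bits 3:0 the sub-state, per the MWAIT description in Intel's SDM. How a given field value maps to a named state such as C6 or C7s is platform-specific and is not derived here; the function name is hypothetical.

```python
from typing import Tuple


def decode_mwait_hint(hint: int) -> Tuple[int, int]:
    """Split an MWAIT hint into (target C-state field, sub-state field).

    Only separates the two nibbles; it does not translate them into
    platform state names such as C6N or C7S.
    """
    return (hint >> 4) & 0xF, hint & 0xF


# Hints taken from the state descriptions and comments in this thread.
for hint in (0x00, 0x58, 0x52, 0x60, 0x64, 0x51):
    major, sub = decode_mwait_hint(hint)
    print(f"MWAIT 0x{hint:02x}: field={major} sub-state={sub}")
```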

It is unclear why the other 3 systems do not see this, especially the T100-TAM, which is extremely similar.  Indeed, the list of differences between the TAM and the CHI is very short.  They have an identical SOC: Z3775  @ 1.46GHz, ucode 0x832, same INT33FD Crystal Cove PMIC.  The systems include different wireless, but I'm not using wireless -- instead I'm using the same USB/wired-ethernet adapter on both.

The T100-TAM has a 1366x768 display, the T100-CHI has a 1920x1200 display.  I don't know if the display difference might be related.

Then there is the possibility that my T100-CHI has some unit-specific defect.  But I'm going to assume that is not the case, unless other T100-CHI test results differ.
Comment 711 jbMacAZ 2017-01-26 07:01:14 UTC
I've been chasing some unrelated problems with my CHI.  I've found a couple bad commits which fixed my touchscreen and might fix the wifi soft freeze.  I'm currently testing 4.10-rc5 with T100 patches/.config and Mika's patch.  So far running 11 hours without freeze.  I can post that kernel if that would be useful, it should also work on the T100TAM.  

Another big difference between the T100T* and the T100CHI is the keyboard connection.  The CHI keyboard/touchpad is bluetooth and the TAM is hardwired. 

The CHI seems to be the most freeze prone in the Asus T100 baytrail family.  A few minutes until freeze is not unusual without some kind of c-state limit, though 30 minutes is what I've seen with the newest kernels.  Len's freeze rate reminds me of what I saw with the 4.2 kernel series.
Comment 712 amjafuso 2017-01-26 08:24:20 UTC
@len

not sure if this helps to narrow down the problem...

Before setting intel_idle.max_cstate=1 to solve freezes on my system, turning off hardware acceleration was the way to go:

http://sparkylinux.org/forum/index.php/topic,3296.msg7132.html#msg7132

Settings done: https://bugs.launchpad.net/ubuntu/+source/xserver-xorg-video-intel/+bug/1453298/comments/147

I also had to turn off hw acceleration in every browser.
Comment 713 Shev_84 2017-01-26 09:22:45 UTC
I've done some more tests on my machine.

It turns out that playing a movie isn't the main factor in getting a freeze; I must play it in a specific way. For example, when I start the movie with this command:
cvlc --quiet --x11-display :0 -f -L ~/Temp/cstate-test/test.mkv &

It can run all day long, and nothing will happen.

But when I play the same movie in Kodi (16.1, the current stable version), it freezes in about 10-30 minutes (twice I've reached almost 2 hours).

Here are sample outputs of turbostat just before hangs:
śro, 25 sty 2017, 00:25:02 CET
        Core    CPU     Avg_MHz Busy%   Bzy_MHz TSC_MHz IRQ     SMI     CPU%c1  CPU%c6  Mod%c6  CoreTmp GFX%rc6 GFXMHz  Pkg%pc6 PkgWatt CorWatt
        -       -       54      4.01    1334    2000    74894   0       13.07   82.92   65.44   58      **.**   354     41.45   0.78    0.22
        0       0       48      3.63    1333    2000    15576   0       14.82   81.55   67.97   56      **.**   354     41.45   0.78    0.22
        1       1       53      4.00    1333    2000    17137   0       9.28    86.72   67.97   56
        2       2       57      4.30    1334    2000    21421   0       11.67   84.02   62.92   58
        3       3       55      4.12    1334    2000    20760   0       16.50   79.38   62.92   58

###

śro, 25 sty 2017, 18:56:35 CET
        Core    CPU     Avg_MHz Busy%   Bzy_MHz TSC_MHz IRQ     SMI     CPU%c1  CPU%c6  Mod%c6  CoreTmp GFX%rc6 GFXMHz  Pkg%pc6 PkgWatt CorWatt
        -       -       51      3.86    1333    2000    73357   0       12.61   83.53   66.63   58      **.**   208     43.31   0.77    0.22
        0       0       56      4.18    1333    2000    22495   0       15.73   80.09   63.09   56      **.**   208     43.31   0.77    0.22
        1       1       55      4.13    1333    2000    21941   0       12.35   83.53   63.09   56
        2       2       50      3.73    1333    2000    17085   0       10.00   86.27   70.16   58
        3       3       45      3.40    1333    2000    11836   0       12.37   84.23   70.16   58

###

śro, 25 sty 2017, 19:23:37 CET
        Core    CPU     Avg_MHz Busy%   Bzy_MHz TSC_MHz IRQ     SMI     CPU%c1  CPU%c6  Mod%c6  CoreTmp GFX%rc6 GFXMHz  Pkg%pc6 PkgWatt CorWatt
        -       -       51      3.86    1333    2000    72826   0       13.02   83.12   65.43   57      **.**   396     41.86   0.77    0.22
        0       0       45      3.36    1333    2000    15717   0       15.25   81.38   62.70   56      **.**   396     41.86   0.77    0.22
        1       1       46      3.48    1334    2000    16001   0       14.97   81.55   62.70   56
        2       2       59      4.39    1333    2000    22710   0       8.46    87.15   68.17   57
        3       3       56      4.20    1333    2000    18398   0       13.41   82.40   68.17   57

###

śro, 25 sty 2017, 19:49:19 CET
        Core    CPU     Avg_MHz Busy%   Bzy_MHz TSC_MHz IRQ     SMI     CPU%c1  CPU%c6  Mod%c6  CoreTmp GFX%rc6 GFXMHz  Pkg%pc6 PkgWatt CorWatt
        -       -       49      3.69    1334    2000    72568   0       13.22   83.09   66.29   57      **.**   396     41.71   0.77    0.22
        0       0       50      3.76    1333    2000    23108   0       16.20   80.04   56.57   56      **.**   396     41.71   0.77    0.22
        1       1       48      3.61    1333    2000    21970   0       21.68   74.72   56.57   56
        2       2       49      3.70    1335    2000    16326   0       7.02    89.28   76.00   57
        3       3       49      3.69    1334    2000    11164   0       7.99    88.32   76.00   57

###

śro, 25 sty 2017, 23:26:31 CET
        Core    CPU     Avg_MHz Busy%   Bzy_MHz TSC_MHz IRQ     SMI     CPU%c1  CPU%c6  Mod%c6  CoreTmp GFX%rc6 GFXMHz  Pkg%pc6 PkgWatt CorWatt
        -       -       49      3.69    1333    2000    72763   0       14.25   82.06   64.90   58      **.**   854     39.66   0.78    0.22
        0       0       51      3.79    1333    2000    23751   0       17.92   78.28   52.58   56      **.**   854     39.66   0.78    0.22
        1       1       47      3.55    1333    2000    23495   0       25.04   71.41   52.58   56
        2       2       53      3.96    1333    2000    14315   0       6.18    89.86   77.21   58
        3       3       46      3.46    1333    2000    11202   0       7.85    88.69   77.21   58

###

And here is output of turbostat while currently running byt.test script with glxgears and nanosleep for almost 10 hours now:

czw, 26 sty 2017, 10:12:42 CET
        Core    CPU     Avg_MHz Busy%   Bzy_MHz TSC_MHz IRQ     SMI     CPU%c1  CPU%c6  Mod%c6  CoreTmp GFX%rc6 GFXMHz  Pkg%pc6 PkgWatt CorWatt
        -       -       61      4.47    1354    2000    243983  0       7.93    87.59   56.23   57      5.10    187     22.64   0.63    0.22
        0       0       53      3.93    1341    2000    25244   0       3.42    92.66   73.34   55      5.10    187     22.64   0.63    0.22
        1       1       52      3.74    1401    2000    25351   0       3.60    92.66   73.34   55
        2       2       74      5.55    1337    2000    104319  0       13.20   81.25   39.13   57
        3       3       63      4.68    1345    2000    89069   0       11.52   83.81   39.13   57


Of course, setting intel_idle.max_cstate=1 solves the freezes on my machine. Now I need to find out how to apply the patches attached in this thread to the Ubuntu kernel. Then I can do some more tests.
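
To compare turbostat samples like the ones above without reading each table by eye, a small parser can map every row onto the header's column names. This is a minimal sketch that assumes whitespace-separated columns as in the pasted output; the function name is hypothetical. Values are kept as strings because some fields (GFX%rc6 shown as **.**) are not numeric.

```python
from typing import Dict, List


def parse_turbostat(sample: str) -> List[Dict[str, str]]:
    """Map each data row of one turbostat sample onto the header's column names."""
    lines = [line.split() for line in sample.strip().splitlines() if line.strip()]
    header, rows = lines[0], lines[1:]
    # Rows for CPUs other than the first omit the trailing package-wide
    # columns; zip() stops at the shorter row, so those keys are simply absent.
    return [dict(zip(header, row)) for row in rows]


# First sample from this comment (summary row plus two CPU rows).
SAMPLE = """
Core CPU Avg_MHz Busy% Bzy_MHz TSC_MHz IRQ SMI CPU%c1 CPU%c6 Mod%c6 CoreTmp GFX%rc6 GFXMHz Pkg%pc6 PkgWatt CorWatt
- - 54 4.01 1334 2000 74894 0 13.07 82.92 65.44 58 **.** 354 41.45 0.78 0.22
0 0 48 3.63 1333 2000 15576 0 14.82 81.55 67.97 56 **.** 354 41.45 0.78 0.22
1 1 53 4.00 1333 2000 17137 0 9.28 86.72 67.97 56
"""

summary = parse_turbostat(SAMPLE)[0]  # the row whose Core/CPU fields are "-"
print(summary["CPU%c6"], summary["Mod%c6"], summary["Pkg%pc6"])
```

The summary row is the one to watch for the module-c6 and package-c6 residency discussed in comment #710.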
Comment 714 Vincent Gerris 2017-01-26 20:16:50 UTC
@Len Brown, thank you, happy to help out :).
@Shev_84 check my previous post for a prepatched 4.9 kernel (Mika's patch only, as Len would like to see tested).

This command has been running for about 47 hours:
ubuntu@ubuntu-Lenovo-Yoga-2-11:~/Downloads$ ./byt.test | tee out.`date +%Y%m%d_%H%M%S`
Linux ubuntu-Lenovo-Yoga-2-11 4.9.0-mika-no-tweak-eval-th #3 SMP Tue Jan 24 10:10:26 CET 2017 x86_64 x86_64 x86_64 GNU/Linux
tis 24 jan 2017 22:37:53 CET
BOOT_IMAGE=/boot/vmlinuz-4.9.0-mika-no-tweak-eval-th root=UUID=6a53171b-c5f2-44a4-a69f-a08f38312a8c ro quiet splash vt.handoff=7
board_vendor:LENOVO
board_name:AIUU1
board_version:31900042STD
bios_date:08/19/2015
bios_vendor:LENOVO
bios_version:92CN93WW(V1.93)
processor	: 0
vendor_id	: GenuineIntel
cpu family	: 6
model		: 55
model name	: Intel(R) Pentium(R) CPU  N3520  @ 2.16GHz
stepping	: 3
microcode	: 0x320
cpu MHz		: 1305.507
cache size	: 1024 KB
physical id	: 0
siblings	: 4
[    1.274218] intel_idle: MWAIT substates: 0x33000020
[    1.274220] intel_idle: v0.4.1 model 0x37
[    1.274527] intel_idle: lapic_timer_reliable_states 0xffffffff
state0/desc:CPUIDLE CORE POLL IDLE
state1/desc:MWAIT 0x00
state2/desc:MWAIT 0x58
state3/desc:MWAIT 0x52
state4/desc:MWAIT 0x60
state5/desc:MWAIT 0x64
nanosleep 1000
nanosleep 1000
nanosleep 1000
nanosleep 1000
nanosleep 1000
nanosleep 1000
nanosleep 1000
nanosleep 1000
glxgears -display :0 -fullscreen
turbostat --debug -i 100
[sudo] password for ubuntu: tis 24 jan 2017 22:38:53 CET
tis 24 jan 2017 22:39:53 CET -
...
...
until...
tor 26 jan 2017 21:06:27 CET

No freezes yet. So for me this seems stable so far.
I will do further testing and report if anything suspicious or interesting happens.
Thanks to everybody for the serious and committed bug hunting and for not giving up :). cheers
Comment 715 Vladislav 2017-01-27 13:00:40 UTC
I also had this issue; now, with the linux-rt-manjaro kernel (4.9), my N3540-based laptop has run for more than one week without freezes.
Comment 716 amjafuso 2017-01-27 17:47:59 UTC
I have used kernel 4.9 for almost a month and never had any freezes. The script in comment #663 didn't force a freeze either.

Half an hour ago my system froze the first time! No heavy load, only Firefox (no video), thunar and some sshfs connections. After a reboot I started Firefox, half of the window was black (never saw that before). I switched from "full size" to "custom size" and back, it instantly froze again!


After a second boot dmesg shows me red entries I have not seen before:


[    5.663123] intel_soc_dts_thermal: request_threaded_irq ret -22
...
...
...
[    6.766407] [drm:valleyview_pipestat_irq_handler [i915]] *ERROR* CPU pipe A FIFO underrun


Useful?
Comment 717 Jochen Hein 2017-01-29 13:28:24 UTC
I'm now running 4.9.5 with Mika Kuoppala's patch from #683.
System is stable and was playing local video files without problems.
4.9.5 without the patch hung within minutes in kodi.
No max_idle or other workarounds applied.

I'd like to see Mika's patch submitted for inclusion, even if
it fixes only part of the problems.
Comment 718 Vincent Gerris 2017-01-29 21:32:40 UTC
Hi,

I ran more tests with Mika's patch.
The previous test run again without problems.

However, when I played video (with audio over bluetooth) I got a freeze again.
At some point I even got a freeze while I didn't play video, but just copied a file over wifi from a smb share.
So no sound or video played, although bluetooth was connected.

Perhaps this is a wifi issue as well? Or something with bluetooth.

I will try to see if I can pin this down further, although that may take some time.

It seems like a good idea to get Mika's patch in if it fixes freezes for many people.
Comment 719 luke 2017-01-30 08:13:19 UTC
> Perhaps this is a wifi issue as well? Or something with bluetooth.

My system is an Asrock Q1900-ITX with no wifi or bluetooth. I was experiencing freezes under Linux anywhere from every 12 hours to every 2 weeks, depending on the kernel. Since switching to Windows 10, I've had an uptime of over 2 months now in the same role, a home media server.
Comment 720 Gabriel7340 2017-02-01 14:40:49 UTC
@Len Brown Can you please answer a little question? I am confused. Is it possible to have the same issue on Windows 10? My processor is an Intel® Bay Trail-M Quad Core Pentium N3540 Processor.

The notebook( http://www.asus.com/pt/Notebooks/X552MJ/specifications/ ) freezes constantly when idle or when I watch a movie.

I can't disable C-states in the BIOS (no option), and I can't find an option in Windows 10 to disable deep idle states or prevent the system from entering them.

I'm wondering whether a custom BIOS might be the best way to disable C-states.

Do you have any information about problems related to this processor on Windows 10? The symptoms are the same; if I change to Linux I have to use the max_cstate flag in order to have a stable system.

Thank you.
Comment 721 Juha Sievi-Korte 2017-02-01 16:26:56 UTC
(In reply to Len Brown from comment #709)
> Created attachment 253151 [details]
> pstate.set script
> 
> @ Juha Sievi-Korte
> 
> How about if you first run the attached script to configure frequency to the
> max:
> 
> ./pstate.set max
> 
> Does that cause the failure to occur sooner?
> 
> If it can shorten the time to failure, what do you see when you then boot
> with intel_idle.max_cstate=2 ?  (That will enable C6NS, but will not enable
> C6Shrink -- so you will core-c6 residency, but no module-c6 or package-c6.)

Thanks Len,

Here is an update: I continued with the same test set after the latest freeze, and it has now been running for a week without a freeze on the same configuration. So it seems Mika's patch did make a huge difference after all, and I was just very unlucky to get such a quick freeze on the first try (and my bad for not verifying that the result was repeatable before commenting here again :)

I'll continue to experiment to see if I can find a way to reproduce this with the patch applied.

So my N3540 seems now relatively stable with byt.test.
Comment 722 Prashant Poonia 2017-02-04 14:10:36 UTC
(In reply to Gabriel7340 from comment #720)
> @Len Brown Can you please answer a little question? I am confused. It's
> possible to have the same issue on Windows 10? My processor is an Intel® Bay
> Trail-M Quad Core Pentium N3540 Processor.
> 
> The notebook( http://www.asus.com/pt/Notebooks/X552MJ/specifications/ )
> freezes constantly when idle or when I watch a movie.
> 
> I can't disable C-States on Bios ( no option ) and I can't found some option
> on windows 10 to disable or prevent system going into deep idle state.
> 
> I'm thinking if a custom bios can be the best solution to disable c-states. 
> 
> Do you have some information about problems related to this processor on
> windows 10? The symptons are equal, if I change to linux I have to use the
> max_cstate flag in order to have a stable system. 
> 
> Thank you.

Strange, I have an Asus x553MA and I have been using Win10 for the past few months, and my laptop has frozen no more than 2 times. And I have 2gb ram, so I can associate those freezes with high ram usage. My netbook is very stable with Windows 10. Make sure you have the latest BIOS from the Asus website; my BIOS version is v214.
My laptop specs
n3540
2gb ddr3l ram 1333mhz 1.35volts
500gb 5200rpm hdd
windows 10 x64 1607
Comment 723 jbMacAZ 2017-02-04 18:57:16 UTC
Yes this c-state issue does affect windows, IMO.  I have 4 systems with windows that have frozen exactly the same way (display frozen, inputs unresponsive, needs hard reset to recover, no obvious reason).  One has frozen once several months ago(i5-6400 skylake) and the other 3 are N3540's baytrail which freeze weekly to monthly.  When 3 of these systems run linux with a c-state bandaid, they don't freeze (The other only has windows...)  Freeze rates on windows are infrequent for me, but the processor is only 1 part of the problem.  Hardware implementation matters as does OS version.

But this is a linux baytrail cstate bug.  With Mika's patch, I haven't had a hard freeze running 4.9.7 or 4.10-rc5 during the last week and a half.  4.10-rc6 has a wifi regression that halts testing after about 12 hours (unrelated soft freeze).  These results with Z3775 baytrail quad core.
Comment 724 Len Brown 2017-02-06 21:19:37 UTC
Yes, on 25-Jan, Mika submitted the patch in comment #683
to the i915 driver owners:

https://lists.freedesktop.org/archives/intel-gfx/2017-January/117932.html

If all goes well, I would expect it to go into the Linux-4.11 merge window.
Ideally, it will then get back-ported to the .stable kernels.
Comment 725 Elmar Melcher 2017-02-07 11:04:41 UTC
kernel 4.9.5 with patch from comment #683 and from https://github.com/burzumishi/linux-baytrail-flexx10/tree/master/kernel/patches/v4.8 patch 0001, 0006, 0008 on Z3735G, command line tsc=reliable only, daily use during 2 weeks, no freeze.
Updated spreadsheet.
Comment 726 Martin 2017-02-10 19:33:17 UTC
As one of the people who arrived at 8fb55197e64d5988ec57b54e973daeea72c3f2ff while bisecting (comment 276), I can confirm that 4.9.0 with Mika's patch would have been voted good! Uptime 3 days 8 hours and counting. Typical load: HTPC (recording and watching HD TV using MythTV).

Thanks to everybody who helped crank this patch out!
Comment 727 Shev_84 2017-02-11 11:11:03 UTC
I've applied Mika's patch from comment 683 to current Ubuntu kernel, and it seems to work fine without cstate set to 1.
$ uname -a
Linux panda 4.8.0-37-generic #39com683 SMP Thu Feb 9 12:54:37 CET 2017 x86_64 x86_64 x86_64 GNU/Linux

All day I've run movies in Kodi, then I ran some transcoding of a stream grabbed from a DVB-S2 tuner.

Output of Len's turbostat at the end of test:
        Core    CPU     Avg_MHz Busy%   Bzy_MHz TSC_MHz IRQ     SMI     CPU%c1  CPU%c6  Mod%c6  CoreTmp GFX%rc6 GFXMHz  Pkg%pc6 PkgWatt CorWatt
        -       -       220     13.95   1578    2000    38888   0       1.26    84.79   70.26   61      6.39    854     48.10   1.55    0.33
        0       0       110     6.84    1607    2000    6408    0       0.83    92.33   83.21   59      6.39    854     48.10   1.55    0.33
        1       1       103     6.63    1558    2000    9220    0       1.41    91.97   83.21   59
        2       2       330     19.08   1730    2000    13646   0       1.50    79.42   57.31   61
        3       3       337     23.25   1449    2000    9614    0       1.31    75.44   57.31   61

For now it looks stable, hope it is solved for good.
Comment 728 frr 2017-02-13 14:09:45 UTC
Dear gentlemen, thanks for this exquisite snippet of detective work.

I don't want to be specific about HW models, but over the past month or so, I've been struggling with one particular machine (model - a whole batch, systematic problem) exhibiting the exact symptoms: video playback in Ubuntu 16.04 or Debian Jessie 8.7, in Totem, would result in freezes (video frozen on the screen, whole system locked up). Under Unity+Compiz, the looped-around playback of a short x264 HD movie trailer would lock up within half an hour, under xfce4 with compositing it would take 2-3 hours, xfce4 without compositing would run for several hours (but would typically freeze within a day). Actually the video playback was just a reliable/accelerated problem reproduction method - in real life, the machine would freeze with some simple 2D GUI app within a day or two. I even tried improving Vcore blocking on the boards, which didn't help. 

Finally I found this bug report, while fumbling for a way to speculatively over-volt the cores a bit using a manual intervention into EIST... 
and I haven't tried patching the kernel yet, but the workaround with max_cstate=1 does seem to have the desired effect. Up for 30 hours and cranking away at the video loop.

Actually I had two sibling models on the table, both in several specimen, with an almost identical motherboard, the only difference being the CPU: Celeron N2807 (dual core, working fine no matter what) vs. Celeron N2930 (quad core, freezing no matter what). Over the last two years I also tried some machines with the Celeron J1900 where the problem does not occur... All of these are "industrial"/embedded PC machines from two vendors (plus the odd ITX board from Gigabyte if memory serves).
Comment 729 Hanno Zulla 2017-02-17 08:14:12 UTC
First of all, thanks a lot.

What is the status of the patch by now? Reading https://lists.freedesktop.org/archives/intel-gfx/2017-January/117932.html it appears that it wasn't accepted.
Comment 730 Olivier 2017-02-17 10:01:21 UTC
We are encountering similar total freeze issues on 16.04.1 (kernel 4.4.0-x, some 4.4.0-62) NUC6i5 devices (Skylake), so not exactly low power like most CPUs in this thread.

We have a hundred of these devices deployed in the field, and they are randomly freezing (at least 25 devices have frozen already), we don't have physical access to many of them (requires special technical intervention).

A few very strange things:

- The freezes do not produce any log (no kernel panic/crash).
- The freezes aren't reproducible easily (do not happen every day) but they always happen exactly 4 hours after boot (our devices reboot daily @05h05), freezes always happen around 09h05.
- We have a hundred of NUC6i3 devices out there with the exact same setup that are not having these freezes.

The devices are unattended POS devices that mostly play webapps (chromium/electron) or video (mpv) (auto-logged in without interaction).

We're going to try the cstate boot flag to see if it fixes things.

I would be extremely grateful if anyone would have an idea on how we could debug this a bit more.


Related askubuntu issue with logs/more info: http://askubuntu.com/questions/884099/troobleshoot-16-04-inexplicable-total-system-freeze-4-hours-after-reboot-on-seve
Comment 731 RussianNeuroMancer 2017-02-18 08:29:40 UTC
Hi, Olivier 
Please read comment #700. You need to file a separate bug report about your issue, because this one is about BayTrail, not Skylake.
Comment 732 Vincent Gerris 2017-02-19 17:24:08 UTC
After many hours of trying, I made great progress in getting the Mika patched kernel to hang on my Lenovo Yoga 2 11 with N3520 on latest 1.93 BIOS :)!
Never thought I'd be so happy crashing my computer.

I can now CONSISTENTLY hang my computer as follows:
 - (re)boot without AC power
 - start file copy from smb share to local folder with nautilus (310 mb)
 - wait, or trigger by unloading bluetooth driver with:
sudo modprobe -r btusb
(or when not loaded: sudo modprobe btusb && sudo modprobe -r btusb)

Now for some more detail: I tried to find influencing factors, and these are some.
 - With AC power connected, it barely happens (it did 1 out of about 10 times)
 - on battery power, using an external USB wifi adapter (rt2573) with the internal wifi not connected, I could not trigger this
 - when max cstate is one, this does NOT happen
 - one time after using the USB adapter, I could not trigger this, but after one more I did.
 - sometimes the driver fiddling is not needed and the hang occurs without it

To be very sure the cstate parameter is in play, I kept running the same kernel and rebooting and repeating the procedure, it never hung with the parameter enabled (once with 3 reboots, once with one) and consistently hangs without the parameter.

The kernel is the 4.9.0 patched that I put on dropbox.
Another side note is that sometimes the bluetooth driver does not load after boot, this does not seem to have much influence on the procedure.
What I see in dmesg when that is the case is:
[    8.604187] Bluetooth: hci0 command 0x1001 tx timeout
[   16.604447] Bluetooth: hci0: BCM: Reading local version info failed (-110)
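
The negative numbers in dmesg lines like these (and the "ret -22" in comment #716) are negated Linux errno values. A minimal sketch for translating them follows; the table only covers the two codes seen in this thread (22 is EINVAL and 110 is ETIMEDOUT on Linux), and the function name is hypothetical.

```python
import re
from typing import Optional

# Linux errno values for the codes that appear in this thread.
LINUX_ERRNO = {
    22: "EINVAL (invalid argument)",
    110: "ETIMEDOUT (connection timed out)",
}


def explain_kernel_error(line: str) -> Optional[str]:
    """Name the last negative number in a dmesg line, if it is a known errno."""
    codes = re.findall(r"-(\d+)\b", line)
    if not codes:
        return None
    return LINUX_ERRNO.get(int(codes[-1]))


line = "[   16.604447] Bluetooth: hci0: BCM: Reading local version info failed (-110)"
print(explain_kernel_error(line))
```

So the Bluetooth message reports a timeout while reading the controller's version info, which fits the "tx timeout" line just before it.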

There are a few great things about this:
1. I can consistently reproduce the error
2. cstate kernel parameter dependent

The non consistent behavior may be caused by the firmware saving different things? 
Also seems clearly related to power management/wifi (not sure if chip or power related).

@Len Brown let me know if I can supply any info that may be useful.
I will not update the laptop and it is dedicated to identify this issue, so happy to make some more time to nail it down.
Comment 733 River Zhou 2017-02-19 17:26:35 UTC
Has anybody tried CONFIG_PREEMPT=y and CONFIG_HZ=1000?
On my Lenovo Miix 2 8 (BayTrail Z3740), it sometimes makes the system very slow.
I use the 4.4.49 kernel + Mika's patch, with no cstate limit set.
Comment 734 Hans de Goede 2017-02-21 08:07:24 UTC
(In reply to Vincent Gerris from comment #732)
>  - on battery power, using an external USB (rt2573) and the internal wifi
> not connected, I could not trigger this

But you were still using the internal bluetooth, right ?

So this seems to point to a problem with the sdio wifi. I think this means we may still need the patches to force the CPU to not enter C4/C5 when mmc is active which have been used by various baytrail users in the past:

https://github.com/hadess/rtl8723bs/tree/master/patches_4.5

Some patches have been merged to fix this, but IIRC their commit msg mentioned those patches might just make it harder to trigger the problem.

While working on some cherrytrail issues I rebased those patches to a recent upstream kernel (not the latest, but a recent one). I've saved those rebased patches in case we would need them again, and I've uploaded them here:

https://fedorapeople.org/~jwrdegoede/trail-mmc/

It would be good to build a kernel with those and see if that fixes your reproducible bug.
Comment 735 Michaël 2017-02-21 10:06:10 UTC
I compiled Mika's i915 modification into a module, this helped stability, but my system froze after 5 days, with no log as usual.  During these 5 days, the machine was sitting mostly idle, with activity coming only from background tasks (ownCloud, mostly).

On Acer TravelMate 115, Intel(R) Pentium(R) CPU  N3540  @ 2.16GHz, no crash with max_cstate=1.
Comment 736 Nicolas Porcel 2017-02-22 23:17:50 UTC
Has anyone tested the new DRM_I915_CAPTURE_ERROR option in 4.10?

I don't know the effect, but I just saw it after upgrading to 4.10 and it might be interesting to try it.
Comment 737 Len Brown 2017-02-24 15:31:45 UTC
@ Michaël

Are you saying that your N3540 hangs, even with intel_idle.max_cstate=1 ?
If yes, does it pass an overnight memtest, or a thorough soaking
of stressapptest?
Comment 738 Michaël 2017-02-24 16:07:13 UTC
@Len Brown: Sorry if this wasn't clear: this was *without* max_cstate=1.  With max_cstate=1, it is, and always has been, perfectly stable.  I have a ~2.5day average uptime with Mika's patch, which is pretty much the same as without.  `modinfo` does report that i915 is using the patched module.
Comment 739 Len Brown 2017-02-24 16:19:53 UTC
@ Michaël 

Thanks for the clarification.

Please test with Mika's patch plus intel_idle.max_cstate=2.  Per comment #710, the difference from intel_idle.max_cstate=1 is that now core C6 will be ENABLED.  In common with intel_idle.max_cstate=1, Module and Package C6 will continue to be disabled.

On the only machine I have where Mika's patch is not sufficient for 100% stability (the T100-CHI) this works for me.
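Before starting a long soak test with a given setting, it may be worth confirming which value the running kernel was actually booted with. The following is only a minimal sketch: the helper name is made up for illustration, and keeping the last occurrence when the parameter is repeated is a choice of this sketch, not a statement about how the kernel resolves duplicates.

```python
from typing import Optional

# The boot parameter discussed throughout this thread.
PARAM = "intel_idle.max_cstate"


def max_cstate_from_cmdline(cmdline: str) -> Optional[int]:
    """Return the intel_idle.max_cstate value found in a kernel command line.

    Returns None when the parameter is absent or has no numeric value.
    """
    value = None
    for token in cmdline.split():
        key, sep, raw = token.partition("=")
        if key == PARAM and sep and raw.isdigit():
            # Assumption of this sketch: if repeated, keep the last one seen.
            value = int(raw)
    return value
```

On a running system the string can be taken from /proc/cmdline, for example with `open("/proc/cmdline").read()`, and passed to the helper. A result of None would mean the parameter was not given at boot.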
Comment 740 jbMacAZ 2017-02-25 01:23:50 UTC
(In reply to Len Brown from comment #739)
> 
> On the only machine I have where Mika's patch is not sufficient for 100%
> stability (the T100-CHI) this works for me.

I haven't had a freeze on my CHI since using only Mika's patch and "tsc=reliable clocksource=tsc" for kernel args.  That said, I do have some non-default .config settings and a handful of ASUS device specific patches.

The most serious outstanding bug (with a .config workaround) is bugzilla#150881 and it affects other T100's.  Wifi issues also hamper stable operation although that appears to be fixed as of 4.9.11+ (but not 4.10.0)  

Let me know if you want patches, .config or built kernels to evaluate on your CHI.
Comment 741 Vincent Gerris 2017-02-27 07:32:08 UTC
I have some interesting observations.
@Hans de Goede: if by using the internal bluetooth you mean that the driver was loaded, yes. It was not doing anything like streaming audio or anything else.

I tried your patches on the same 4.9.0 kernel with the Mika patches but I still get the freeze.

Some interesting observations:
 - sometimes it takes a while for a freeze: when this happens, the file copy speed showing in nautilus goes down gradually from around 3.5 MB/s to a few hundred kB/s and then it hangs.

 - to test driver influence I blacklisted btusb and then tried: once when not on AC power I got a hang after a few seconds of file copy(without loading or unloading the driver), once on AC the copy speed reduced a lot but I got no freeze

 - once with the btusb driver blacklisted again, I got no issue, until I loaded and unloaded the driver

Off AC power seems a big factor, isn't there an acpi event trigger when the power is unplugged? Maybe that has influence.

When I look at top when file copy gets slow, there is no significant use of resources. It looks like the kernel slowly hangs itself up.

I hope this helps to get an idea of where to find the issue.
Let me know if I can test anything more.
thank you!
Comment 742 Len Brown 2017-02-28 03:13:58 UTC
Created attachment 254971 [details]
Mika v3: drm/i915: Avoid tweaking evaluation thresholds on Baytrail v3

Please test this patch.

drm/i915: Avoid tweaking evaluation thresholds on Baytrail v3

This patch is expected to have the same function as the previous version, which was attached to comment #683.

It should apply, with offset, from Linux-4.2 through Linux-4.10, and now 4.11-merge/rc.

If we can supply some "Tested-by: " tags, that may help Mika get permission to ship this patch with the i915 tree.
Comment 743 Laszlo Fiat 2017-02-28 18:40:31 UTC
(In reply to Hans de Goede from comment #734)
> (In reply to Vincent Gerris from comment #732)

[snip]

> So this seems to point to a problem with the sdio wifi. I think this means
> we may still need the patches to force the CPU to not enter C4/C5 when mmc
> is active which have been used by various baytrail users in the past:
> 
> https://github.com/hadess/rtl8723bs/tree/master/patches_4.5
> 
> Some patches have been merged to fix this, but IIRC their commit msg
> mentioned those patches might just make it harder to trigger the problem.
> 
> While working on some cherrytrail issues I rebased those patches to a recent
> upstream kernel (not the latest, but a recent one) I've saved those rebased
> patches in case we would need them again, I've uploaded them here:
> 
> https://fedorapeople.org/~jwrdegoede/trail-mmc/
> 
> It would be good to build a kernel with those and see if that fixes your
> reproducable bug.

Hello,

I wrote at [1], that I think that most of the old MMC patches are not needed for kernels 4.7 and above as Adrian Hunter mainlined a patch [2].

We do need a new version of [3], because [4] removed the basis of that patch. If this is not applied, we get IRQ 187 Nobody cared [5]. A new version of [3] with the partly reversed [4] (plus a few other Baytrail related patches) is at [1]. But a proper mainlined solution would be great.

[1]: https://github.com/hadess/rtl8723bs/issues/76#issuecomment-234706390
[2]: https://git.kernel.org/cgit/linux/kernel/git/next/linux-next.git/commit/drivers/mmc?id=6e1c7d6103fe7031035cec321307c6356809adf4
[3]: https://github.com/hadess/rtl8723bs/blob/master/patches_4.5/0002-mmc-sdhci-get-runtime-pm-when-sdio-irq-is-enabled.patch
[4]: https://git.kernel.org/cgit/linux/kernel/git/next/linux-next.git/commit/?id=15e82076a0edbebedbe12652b4ad8f1d93bcb7fe
[5]: https://github.com/hadess/rtl8723bs/issues/76#issuecomment-227532284
Comment 744 jbMacAZ 2017-03-01 03:46:17 UTC
re: Avoid tweaking evaluation thresholds on Baytrail v3

Quick test of 4.10.1 w/v3 ran 12 hours without freezing (Asus T100CHI, Z3775).  The CHI would consistently freeze within several hours without a cstate patch, script or kernel arg.  I'll restart the 4.10 test and try to let it run several days.

Unfortunately, while the new v3 patch does apply to 4.2.8, it still froze in 45 minutes, which is comparable to unpatched.  I rebuilt 4.2.8 with the original patch to retest and it also froze, in 1:45.  I guess there were too many other problems not yet fixed in that old kernel series.
Comment 745 Len Brown 2017-03-01 04:57:14 UTC
An i915 GFXMHz observation...

Linux-4.8 out of box (no patches or cmdline workarounds)
was easy to hang in under 20 minutes on my dell n3540 laptop
and acer j1900 desktop.  Turbostat showed i915 GFXMHz of 875.

Linux-4.9 became significantly harder to hang -- surviving
the same stress test for over 24 hours.
turbostat showed i915 GFXMHz of 187.

Linux-4.10 seems even more difficult to hang.  The j1900
took almost 3 days, and the n3540 is still running the
stress test after 4-days.  i915 GFXMHz is 187.
Comment 746 Jochen Hein 2017-03-01 19:31:14 UTC
Re: Avoid tweaking evaluation thresholds on Baytrail v3

I'm running 4.10.1 with the patch from #742 and didn't have a hang since yesterday. You may use my
Tested-by: Jochen Hein <jochen@jochen.org>
Comment 747 Len Brown 2017-03-02 16:00:22 UTC
My comment #745 regarding GFXMHz is erroneous.

Thanks to Yaroslav Isakov for reporting that
turbostat is re-reading a constant value from a file in sysfs,
and presenting the unchanging value in the GFXMHz column.  I'll
post an updated turbostat to handle this shortly.  Re-opening
the file works around the problem, just as you'd see if you did this:

$ watch -d cat /sys/class/graphics/fb0/device/drm/card0/gt_cur_freq_mhz

More interesting is that while testing the workaround for that bug,
I logged into my dell-n3540 laptop -- which had been running my
standard byt.test stress test for 7-days on stock 4.10
without a hang.  I fired up a few additional copies of glxgears to
make sure I could see GFXMHz wiggle; and the machine hanged in 5 minutes.
Hopefully a tweak of that stress test to make the graphics run
faster will bring down time-to-failure on an un-patched kernel here.
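The re-open workaround mentioned above can be sketched in a few lines. This is a hedged illustration, not turbostat's actual code: the function names are invented here, and only the sysfs path is taken from the watch command in this comment. The point is simply that the file is opened, read and closed for every sample, so no sample comes from a previously opened handle.

```python
import time
from typing import List

# Path taken from the watch command quoted in the thread.
GT_CUR_FREQ = "/sys/class/graphics/fb0/device/drm/card0/gt_cur_freq_mhz"


def read_mhz(path: str = GT_CUR_FREQ) -> int:
    """Read the current GFX frequency, re-opening the file on every call."""
    with open(path) as f:
        return int(f.read().strip())


def sample_mhz(count: int, interval_s: float = 0.1,
               path: str = GT_CUR_FREQ) -> List[int]:
    """Collect `count` samples, one fresh open/read/close per sample."""
    samples = []
    for _ in range(count):
        samples.append(read_mhz(path))
        time.sleep(interval_s)
    return samples
```

For example, `sample_mhz(100)` would gather roughly ten seconds of readings at the default interval on a machine where that sysfs file exists.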
Comment 748 Len Brown 2017-03-05 23:46:24 UTC
Created attachment 255091 [details]
latest turbostat (17.03.04) utility for baytrail

Attached is the latest turbostat utility, please stop using
the older version, and let me know if you have any troubles
with the latest.

This is version 17.03.04 -- slightly newer than the 17.02.24
that was just checked into the Linux-4.11-rc1 source tree.
Beyond that one, this version fixes the GFXMHz column issue.

Note that turbostat prints more columns than it used to,
and so capturing the output in a file is prudent.
A 10 second snapshot can be gathered in a file "ts.out", this way:

$ sudo ./turbostat -o ts.out sleep 10
Comment 749 Len Brown 2017-03-06 00:10:03 UTC
Re: GFXMHz and glxgears load in comment #747

Adding 5 copies of glxgears to byt.test allows my dell-n3540 to fail on Linux-4.10
in 15 minutes.   Without this additional load, the same hardware and OS
would not fail in 3-days.  That was quite a mystery, as previously
a single copy of glxgears -fullscreen in byt.test would
reliably kill my test systems on Linux-4.8 within 15 minutes.

It appears this is related to Graphics P-states.

You can poll GFXMHz 10 times per second this way:

watch -n .1 -d cat /sys/class/graphics/fb0/device/drm/card0/gt_cur_freq_mhz

I found that if GFXMHz was pegged to minimum or maximum,
there were no failures.  Or if the load was high enough
or low enough that the frequency stayed near max or min
with few changes -- no failure.

However, if the number of glxgears is tuned to a "sweet spot"
such that GFXMHz is different virtually every time it is polled,
the time to failure is shortest.

Curiously, the number of copies of glxgears to hit the "sweet spot"
is quite different on different machines.  Also the -fullscreen parameter
makes a big difference.

On my T100-CHI, it takes only 1 copy of glxgears -fullscreen
to kill the machine.  On that machine, additional copies of glxgears make
the GFXMHz reach and stay at maximum, and the failure is not seen.
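One way to put a number on the "sweet spot" described above is to measure how often consecutive polls of GFXMHz differ. The sketch below is only an illustration of that idea: the function name is made up, and the metric itself is an assumption about what "different virtually every time it is polled" could mean in practice.

```python
from typing import Sequence


def change_ratio(samples: Sequence[int]) -> float:
    """Fraction of consecutive polls whose value differs from the previous poll.

    0.0 means the frequency never moved (pegged at min or max);
    1.0 means it was different on every single poll.
    """
    if len(samples) < 2:
        return 0.0
    changes = sum(1 for prev, cur in zip(samples, samples[1:]) if prev != cur)
    return changes / (len(samples) - 1)
```

Under this reading, a load tuned so the ratio stays close to 1.0 would correspond to the shortest time to failure, while a ratio near 0.0 would correspond to the pegged cases where no failure was seen. The input would be a list of readings taken at a fixed interval, such as the 10-per-second polling shown above.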
Comment 750 Len Brown 2017-03-06 00:17:16 UTC
Re: Avoid tweaking evaluation thresholds on Baytrail v3

Running this patch on 4.10, I've not yet seen a failure on
Dell-n3540, Acer-J1900, Asus-T100-CHI-Z3775 -- while I was able
to fail all of those machines in under 15-minutes without the patch.

However, based on the output of

watch -n .1 -d cat /sys/class/graphics/fb0/device/drm/card0/gt_cur_freq_mhz

this patch seems to make the i915 get pegged to maximum
GFXMHz as soon as there is any GFX workload.  When the workload
is terminated, GFXMHz returns to minimum.  This appears to
put the i915 in "race to halt" mode.  Unclear if that was
the intention of the patch.
Comment 751 Vincent Gerris 2017-03-07 22:24:12 UTC
Hi Laszlo,

Thank you for your elaborate reply.
I will try that and the recently posted patch on a recent kernel again and report (may take a while).

@Len Brown since my problem does not seem to be fixed by the patch, but goes away when the max_cstate parameter is used, will it be further pursued in this bug report, especially considering what Laszlo wrote?

I have a really good way to trigger the bug and it would be great if we can fix this properly for everyone.

Thank you and regards,
Vincent
Comment 752 frr 2017-03-08 12:03:53 UTC
Apologies for going slightly off topic here, at this software-side forum: 
I cannot help but wonder where the gremlins are possibly hiding :-) and I can't exclude that "someplace in the hardware" is the correct answer.

I haven't looked into details, but the proposed patches seem to modify the rate and aggressiveness of GPU clock frequency changes. Also take into account the reported "sweet spot" consisting in a particular number of GLXgears instances, a figure that is HW-specific. As if, with every change in clock frequency, there's a tiny "window of opportunity" for something to clash under the hood, in silicon. The "window of opportunity" (unhandled critical section in the HW?) may be different on different motherboard models, or in different flavours of the BayTrail SoC.

I'm wondering if this is a "window of opportunity" in time (albeit very short, around the "event of clock change") where something in the amazing "silicon clockwork" can clash, or if this is really some power rail instability = lack of proper blocking, at board level, SoC package level, chip level, or chip subsystem level. In my practice, in different situations, I've seen both - I've seen an unhandled "critical section" in an FPGA design doing some very basic counter with current value latching (for bus access from host CPU), and obviously numerous power blocking goofups. None of it was so subtle and so close to the CPU though.

It would potentially be nice to know if the freeze happens when the clock gets bumped up vs. relaxed down (or, irrespective).

I'm a troubleshooter / application geek helping customers integrate industrial/embedded PC's and motherboards with operating systems in their very diverse setups. I got to know about this thread when one particular HW model was curiously misbehaving. More precisely, as I already mentioned in this thread, I had two models of hardware on my lab desk, several pieces of each, with a pretty much identical motherboard, different only in the SoC soldered on the board: one was a dual-core Celeron N2807 (running like a cheetah no matter what), the other is a quad-core Celeron N2930 - freezing reliably under the well known test conditions. Tested on several pieces of either hardware. It makes you wonder: two motherboards, likely identical PCB layout, don't know about possible differences in the BOM (the set of power blocking components) - but the VRM and SMD MLCC's around the socket seemed identical to the naked eye. If they *are* in fact identical, I'm not sure if this means the design is correct or flawed :-)

Unfortunately Intel's reference board designs and detailed power design guidelines are NDA'ed for the recent generations of Intel CPU's and SoC's, so I don't have access to them. All I have is the basic datasheet:
http://www.intel.com/content/dam/www/public/us/en/documents/datasheets/atom-e3800-family-datasheet.pdf
containing some rather coarse notes about what power wells the BayTrail SoC has and how they behave. See chapter 9 "Electrical specifications". There's the CORE_VCC and UNCORE_VNN (that's the GPU?), these two are dynamic and apparently steered by a "serial VID" bus (that has replaced the traditional parallel Voltage ID pins?). Apart from them, there are maybe 2 or 3 static rails around 1 Volt, likely for some less demanding interface/glue logic. 

I can see them all on my motherboard, the two dynamic rails (VCC/VNN) have something around 0.85 V if memory serves... but my oscilloscope is probably much too slow to show anything interesting on the power rails (at 300-500 MHz analog bandwidth and 1 GSps true sampling rate). Actually I'm a bit puzzled about this. The Intel datasheet says that (quote) "The voltage rails should be measured with a bandwidth limited oscilloscope with a roll-off of 3 dB/decade above 20 MHz". Makes me wonder how I'm supposed to judge power blocking for a ~2GHz CPU with a 20MHz 'scope :-) The VRM's producing VCC/VNN on "my" BayTrail motherboards are "all solid" = all the caps are MLCC's. Speaking of MLCC's, their frequency response is illustrated e.g. here:
http://www.avx.com/docs/techinfo/CeramicCapacitors/parasitc.pdf
The modern bulk multi-dozen-uF MLCC's have a lowest impedance at their self-resonant minimum, which is a couple dozen MHz. Better than solid polymer, but probably not enough (not alone) for perfect decoupling of modern 22/14nm silicon, specified to have a maximum consumption of about 10 A per the VCC/VNN rail - and that's peak 10 Amps each, observed by a 20 MHz 'scope (= effectively an "average" over some timespan in dozens of nanoseconds). The nominal core clock rate is about 2 GHz, but individual gates comprising the CPU core must actually be much faster, with switching and propagation times at least a decimal order faster. Imagine that the CPU can step up its consumption by a couple Amps within a nanosecond. The dI/dt is enormous. Also note that even if you manage to build a decent RF blocking capacity out of smaller / higher-resonating ceramic caps, a lambda/4 transmission line can flip the impedance "inside out" (turn a short into high Z), and lambda/4 at 2 GHz is about 37 mm in open space, maybe under 30 mm in a PCB. Considering how fast the gates really are, I would say that an optimum distance to place blocking capacitors is within a few mm from the chip, i.e. on the interposer. Yes there are some caps on the interposer, and they seem so tiny... If this is really a power blocking problem, it may well be a problem on part of Intel, rather than on part of the board maker. The beefy MLCC's on the motherboard (alone) won't cut it, and the VRM's response time to a consumption peak (filtered by the MLCC's) should not be a problem.  
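The quarter-wavelength figures quoted above can be checked with a short calculation. This is a rough sketch only: the relative permittivity of 4.3 used for the PCB case is an assumed, typical FR-4 value that does not come from this thread, and the simple 1/sqrt(eps_r) velocity factor ignores trace geometry.

```python
import math

SPEED_OF_LIGHT_M_S = 299_792_458.0


def quarter_wave_mm(freq_hz: float, eps_r: float = 1.0) -> float:
    """Quarter wavelength in millimetres at freq_hz.

    eps_r is the effective relative permittivity of the medium
    (1.0 for open space; about 4.3 is ASSUMED here for FR-4 board material).
    """
    wavelength_m = SPEED_OF_LIGHT_M_S / (freq_hz * math.sqrt(eps_r))
    return wavelength_m / 4.0 * 1000.0


open_space = quarter_wave_mm(2e9)        # about 37.5 mm
in_pcb = quarter_wave_mm(2e9, eps_r=4.3)  # shorter, due to the slower propagation
```

The open-space result agrees with the roughly 37 mm mentioned above, and with the assumed permittivity the in-board length comes out well under the 30 mm estimate.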

In terms of thermals, the board/computer makers often regard the BayTrail as something that almost doesn't need a heatsink (I don't agree, but that's a different story, and I have the thermals right in this case) but it's making me wonder to what extent both the motherboard makers and Intel are possibly soothed by the relatively low consumption at the electrical level. As if "it doesn't draw too much power on average, so it doesn't need that much rail blocking, right?" Oops, that attitude would be a problem. The silicon is admittedly not power hungry, but it's got some screaming fast, latest generation gates, and can ramp its own consumption quite aggressively in very short time quanta...

The step in power consumption due to a change in clock rate alone is likely not *that* abrupt - it's making me wonder if there's possibly some "synchronized gate switch" on a massive scale, brought about by the clock change event. Something that would produce a tall power glitch, lasting for a few picoseconds, that's otherwise not likely to occur in such a perfect synch.

Does Linux fiddle with *voltage* during those GPU power management events? Does it ever tweak the VID? I don't recall anyone mentioning it here... if Linux did fiddle with the VID, the VRM would possibly need some time to ramp up, before it would be safe to increase the clock rate. Again this seems unlikely to me, such a scheme would not be very swift and efficient. 

Also, many users say (including my customer) that the hardware doesn't hang in Windows. Which makes me wonder in what way are windows "different", in their handling of the GPU power management.

It's also the outside behavior that's making me wonder. The problem can be suppressed by preventing clock rate changes in the GPU (IGP). Yet apparently it's not just the GPU that hangs, it's the whole computer that hangs. Unfortunately it doesn't mean that a CPU core has frozen - it means that something along the path between the RAM, MCH, cache and the CPU cores has frozen. And it always freezes (in my case) with the picture stuck on the screen, i.e. no random chance of the kernel reporting a panic. Interestingly, dual-core chips generally don't have the problem - it's typically a problem of the quad-core SoC's. Now where does the core count have a cross-section with the GPU clock modulation? Through power consumption? But if the GPU runs off the VNN "power well", it doesn't even share a power well with the CPU cores!

Makes me wonder if someone (Intel R&D ?) has on-chip ICE/ICD capabilities,
would be able to reproduce the problem and take a closer look at what happened.
But I suspect that with this many pieces sold, they would keep the results
to themselves anyway :-/

Hmm... in the kernel code, what does the actual "set GPU clock" or "set GPU power mode" look like? Just a single MSR write? I'm wondering if it takes the GPU hardware some time to actually carry out that order. And, what would happen if another "clock change" instruction came too early... Could that be our "unhandled critical section" ?

I should probably put my crystal ball to rest and get off the hallucinogens.

Thank you guys with an Intel e-mail (and on Intel payroll?) for keeping up the fight for our benefit, even if you don't have full access internally. You're doing a marvellous job. And I think Intel should receive some thanks too - for paying you, and for making the hardware in the first place :-)
Comment 753 Paul Mansfield 2017-03-08 12:36:04 UTC
I'm with Frantisek.Rysanek here that there are deep-rooted problems in the Baytrail SoC.
I'm pretty sure the only way to make it stable with Linux is to constrain the chip to run its clocks at constant frequencies and not try and switch into different sleep states.

Six others and I all have the same convertible tablet, the Toshiba Click Mini with the Z3735F, from different batches. Mine is terribly unstable and locks up at the slightest provocation, yet others have been able to run linux with some success with the same kernel.
As far as we can tell, there's only ever been one stepping level of the Z3735F as we all have the same SKU, and there's no microcode loader for this chip.

The biggest cause of instability comes when using SDIO, e.g. an SDIO wifi adaptor which is common in many baytrail devices. Quite often a device will run acceptably well with a USB wifi adaptor, and lock solid within minutes or seconds of loading the SDIO driver.

Ars Technica have a block diagram of the chip, and my gut feel is that the block labelled "Storage Hub" craps out under certain conditions.
https://cdn.arstechnica.net/wp-content/uploads/2013/09/Screen-Shot-2013-09-13-at-6.32.07-PM.jpg


Intel produced a reference board called Sharks Cove using the Z3735F, but a lot of the documentation has disappeared. However, if you get lucky it's possible to find third parties carrying the documentation, such as this:
http://www.mouser.com/ds/2/456/Sharks-Cove-Technical-Specifications-587828.pdf
(I recommend people grab a copy before it too disappears)
Comment 754 jbMacAZ 2017-03-08 18:11:01 UTC
I've had 2 freezes (less than an hour of casual use) of the linuxium(Budgie-Ubuntu17.04) w/4.10.0 kernel which includes the v3 patch.  When I rebuilt the linuxium kernel removing the v3 (comment #742) and adding back Mika's original patch (comment #683), I'm freeze free.  The revised patch does not seem to be as effective.  My other kernels run days with either patch version (Mint 18.1, Manjaro).  (T100CHI, Z3775)
Comment 755 Juha Sievi-Korte 2017-03-12 18:22:48 UTC
My results on N3540 running 4.10.1 (g8c10701) with and without v3 patch.

Only 1 run each, so I don't know about repeatability. I did as Len instructed (watching the gpu frequency). For me even one fullscreen glxgears was enough to cap the frequency to maximum, but running several small screens made the frequency change all the time.

The unpatched kernel froze in less than two hours. With the v3 patch applied, the freeze happened after about 8 hours of running the same test set.
Comment 756 Alejandro Morales Lepe 2017-03-16 04:38:10 UTC
Created attachment 255281 [details]
attachment-16106-0.html

I have been running Fedora 25 on my Dell Inspiron 15 3000 Series with an Intel
Pentium N3540, and kernel 4.9 has been stable for around 3 weeks now,
completely vanilla. Has somebody in the Fedora Project/Red Hat tweaked
something? More people should try it too. It makes no difference whether I use
Wayland or Xorg.

Comment 757 Vincent Gerris 2017-03-23 21:39:04 UTC
Hi,

I was able to test with wireless:
https://launchpad.net/ubuntu/zesty/amd64/bcmwl-kernel-source/6.30.223.271+bdcom-0ubuntu2

Some interesting results:
 - the laptop was on power when the wireless driver was installed. When doing the copy and unloading and loading the btusb driver twice, all kept working
 - after a reboot, without the power on, same action made an instant hang happen.

So it seems that:
 - the laptop started without power affects this (noticed that earlier)

Since @Len Brown mentioned that this worked on his Asus :
intel_idle.max_cstate=2

I tried that too. That actually works for my situation as well.

Does that mean the C6 is the cause in combination with wireless?
Is anyone at intel able to patch this up too :)?

I would be very happy to see this resolved still. 
The machine I have is still dedicated to testing.

Thank you
Comment 758 t.sarawinski 2017-03-25 05:42:26 UTC
Arch users can test linux-baytrail410 & linux-baytrail411 (rc3).

including patches from here and more.

My Tablet seems to run much smoother (tested on Gnome 3 - very laggy before)


On stock kernel it often freezes after the Login. This problem is gone for me.


feel free to test and give some feedback.


After some idle time the screen goes black. Sometimes it comes back after a longer time by randomly pushing all buttons.

Hibernate and screen suspending are set to disabled.

Anyone have a suggestion?
Comment 759 AndyMG 2017-03-26 21:22:10 UTC
Hi,

Just a quick observation for you guys ;]

I'm running Ubuntu 16.04 (GalliumOS 2.1 with the 4.8.17 kernel) on an Acer CB3-111 Chromebook with a Bay Trail CPU. I had been using it a lot for about a month without any problems or freezes, and suddenly today it started to freeze after about 15-20 minutes, and only the power button could help (hard reset). I was looking for what might be the reason and my search led me here. After browsing through this thread I got an idea:

I noticed that Bluetooth was on in Xfce. I have never used Bluetooth on this device, so I just turned it off in the GUI (Xfce), and the freezes seem to be gone. I have been working for a couple of hours now. Will see in a couple of days, but it seems like the issue is fixed now.

What suggested the solution to me was some post here about bluetooth (btmon, I think). The issue is so annoying that I decided to share in case this simple solution might also work for someone else. I would not like to change the c-state, as this would mean additional battery drain.

Stay safe ;]
Comment 760 jbMacAZ 2017-03-27 19:56:31 UTC
The good news is that 4.11-rc4 has the v3 patch built in.  The bad news for me is that my build hard froze within 5 minutes on Mint 18.1.  The original patch can't be used anymore because some of the code it modified was rewritten. "Delightfully unstable" T100CHI - Z3775.
Update: 4.11-rc4 with intel_idle.max_cstate=2 froze in about 20 minutes.  I'll retest when -rc5 comes out.
Comment 761 Mark_H 2017-03-28 08:02:51 UTC
As AndyMG reported that Bluetooth may have an influence I have switched it off yesterday and had no freeze during 2 hours.
Even intel_idle.max_cstate=0 did not always help with bluetooth on.

Have a nice day
Comment 762 Travis Hall 2017-04-02 23:36:43 UTC
I've been having really good stability with the ck patchset kernel on Arch Linux (linux-ck-silvermont-4.10 available at https://mirror.archlinux.no/repo-ck/os/x86_64/) on my N2940 Lenovo 11e

Been running youtube videos, vlc and other general use for about 16 hours so far

No idea why this would be the case, but it's interesting
Comment 763 jbMacAZ 2017-04-03 17:11:28 UTC
4.11-rc5 is stable so far without any cstate argument on my CHI w/v3 patch.  My rc4 was stable with cstate=1, so I'm beginning to suspect a build error (comment #760) 
Except for WiFi noise (brcm) in dmesg, 4.11-rc5 looks quite good overall for my system (T100CHI - Z3775).
Comment 764 Len Brown 2017-04-07 00:52:03 UTC
Re: comment #750 - Avoid tweaking evaluation thresholds on Baytrail v3

> Running this patch on 4.10, I've not yet seen a failure on
> Dell-n3540, Acer-J1900, ASus-T100-CHI-Z3775 -- while I was able
> to fail all of those machines in under 15-minutes without the patch.

At 6 weeks + 1 hour of running my torture test,
my Dell-N3540 hanged.  (Acer J1900 still running.)

@ Juha Sievi-Korte 

Thanks for testing!
Your n3540/Mika-v3 failure after 8-hours was much more prompt
than my 6-week result!
Comment 765 Len Brown 2017-04-07 01:07:48 UTC
@ Vincent Gerris

Re: intel_idle.max_cstate=2

Yes, that allows C1 (state1) and C6-no-shrink (state2),
but disables C6-Shrink, C7 and C7-Shrink.  Here "shrink"
refers to forcing the cache to be flushed on 1st entry
into that state -- an action that is good for power,
as the cache can be powered-off, but bad for performance,
assuming you plan to access again the cache state that was flushed.

You'll be able to observe this with turbostat.
The result is that you'll have Core C6 residency,
but since the shared module cache will not be flushed, you'll
not often have module (pair of cores) residency, or package C6 residency.

Package C6 is where the external voltage to the package will be changed.
To enter package C6, the graphics must also be in Render-C6 (RC6).
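The effect of the parameter described above can be written down as a small lookup. This is a sketch under stated assumptions: the state labels are the ones used in this comment, state1 and state2 are as given, and the positions of the three deeper states are assumed to follow the order in which they are listed here. The function name is hypothetical.

```python
from typing import List, Tuple

# intel_idle states as named in this comment, in assumed order
# (state1 = C1, state2 = C6-no-shrink, deeper ones as listed).
BYT_STATES = ["C1", "C6-no-shrink", "C6-Shrink", "C7", "C7-Shrink"]


def split_states(max_cstate: int) -> Tuple[List[str], List[str]]:
    """Return (allowed, disabled) state names for intel_idle.max_cstate=N.

    N=0 is rejected: per the thread it disables intel_idle altogether,
    so this table no longer applies.
    """
    if max_cstate < 1:
        raise ValueError("max_cstate must be >= 1 for this table")
    limit = min(max_cstate, len(BYT_STATES))
    return BYT_STATES[:limit], BYT_STATES[limit:]
```

With `split_states(2)` the allowed list is C1 and C6-no-shrink, and the disabled list is C6-Shrink, C7 and C7-Shrink, matching the description above. `split_states(1)` leaves only C1.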

@frr

Regarding voltage changes: Linux does control them, but indirectly.
Higher voltage is used for high frequency, lower voltage for lower
frequency.  On the CPU, we write the PERF_CTRL MSR with a cookie
that includes the speed, and the hardware translates that into
what it should do with the voltage.

Note that I generally run with a fixed frequency -- the highest --
in an attempt to cause worst-case voltage swings.  The voltage doesn't
stay high -- the most frequent cause of voltage changes is the fact
that when all cores go idle, the hardware automatically lowers the
voltage, and then ramps it back up on exit from idle.

GFX has its own P-states, and they are under direct control of the
i915 driver.  "Mika v3" and other patches are tweaking how and when
the GFX P-state is changed -- and this area appears to be very close
to the most common (but not universal) pain point on these systems.
Comment 766 Len Brown 2017-04-07 01:16:37 UTC
@ Mark_H 

Re: bluetooth

> Even intel_idle.max_cstate=0 did not help always with bluetooth on.

When that parameter is used, intel_idle is disabled and you run acpi_idle.
What acpi_idle does varies from machine to machine.
(See what states it offers with: grep . /sys/devices/system/cpu/cpu0/cpuidle/*/*)

So the question is if intel_idle.max_cstate=1 does not make a difference,
but disabling bluetooth does.  If that is the case, then that may be
a different bug.  If that is NOT the case, then BT may simply be very
good at helping us take interrupts and pop in and out of idle to make
the failure happen sooner.
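For anyone who wants to attach the offered idle states to a report in a structured form, the grep above can be approximated in a few lines. This is a hedged sketch: the function name is invented, and it assumes the per-state directories under the cpuidle path are named state0, state1 and so on, each holding small text attributes such as name and desc.

```python
from pathlib import Path
from typing import Dict

# Same directory the grep command in this comment looks at.
CPUIDLE_ROOT = "/sys/devices/system/cpu/cpu0/cpuidle"


def cpuidle_states(root: str = CPUIDLE_ROOT) -> Dict[str, Dict[str, str]]:
    """Map each stateN directory to a dict of its readable attribute files."""
    states: Dict[str, Dict[str, str]] = {}
    for state_dir in sorted(Path(root).glob("state*")):
        attrs: Dict[str, str] = {}
        for attr in sorted(state_dir.iterdir()):
            if not attr.is_file():
                continue  # skip any nested directories
            try:
                attrs[attr.name] = attr.read_text().strip()
            except OSError:
                continue  # skip attributes that cannot be read
        states[state_dir.name] = attrs
    return states
```

Printing `cpuidle_states()` once with intel_idle.max_cstate=0 and once without it would show side by side what acpi_idle and intel_idle each offer on a given machine.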
Comment 767 b.peguero 2017-04-08 17:28:38 UTC
I see the status as "NEEDINFO" but no obvious indication of what information is needed. This bug is pushing ~800 comments, most of them "me too" and related. What information needs to be provided from users and is there a matrix of affected/non-affected vs context (like the BT one mentioned a couple of comments above)?
Comment 768 zgrabe 2017-04-10 07:51:35 UTC
@b.peguero 

See here for user reports: https://docs.google.com/spreadsheets/d/1oajcMYL9oSt0O6VTpaIj0osGJxKGKSPSYtLnqr3UHNk/edit#gid=0
Comment 769 Mark_H 2017-04-10 08:53:53 UTC
@zgrabe

As having bluetooth enabled seems to have an impact, maybe there should be an additional column (e.g. bluetooth enabled? yes/no)
Comment 770 frr 2017-04-11 06:17:29 UTC
Dear gents,

I've previously wallpapered enough of this forum in posts 728 and 752.
Just to add a bit of recent experience, I've had two other industrial machines (models) pass through my lab, they both had a Celeron J1900, and both passed the torture chamber just fine (same test environment). In addition to the looped video playback (same file as before, same OS setup and kernel), I've also tried GLXgears (while the video playback test was turned off), gradually stepping up the number of instances of GLXgears from 1 to 5 - I added another one after each day of flawless operation. No matter what I did, those two machines were stable. They were a Nexcom IPPC-1840P (very similar hardware to the APPC-xx40 series, which also seems to work well) and an AAEON OMNI-5175 (engineering sample, apparently). I kept teasing them for a straight week, before I had to ship them to a customer.

=> makes me feel as if the BayTrail is merely susceptible to some kinds of motherboard design and testing deficiency, that either Intel did not properly warn the mobo makers about, or the mobo makers did not take Intel's guidelines seriously enough and there are no tests in their QC benches for this particular "corner case"...
Comment 771 Paul Mansfield 2017-04-11 10:57:47 UTC
I don't know if it's the done thing, but I would like to propose this bug be closed, and a new one created referencing this one, and only current information be put in the new bug.

This is because it's very hard to determine what the current situation is regarding the cstates that work in combination with which patches, and any specific hardware issues such as SDIO and video, which seem to trigger problems more quickly.
Comment 772 gutosoni 2017-04-11 12:56:42 UTC
I'm using the 4.4.0-72 kernel, it looks like they fixed the problem. The freezes are over, I urge you to test this kernel.

Hardware: Intel Celeron Bay Trail 4x CPU N2930 @ 1,83GHz
Comment 773 Vincent Gerris 2017-04-11 13:22:05 UTC
@gutosoni : your comment is less than helpful if you do not specify which exact kernel you mean, where you got it, etc. Please share some concrete links and tell which one did not work, and preferably what change you think fixed this.

@Len Brown
Thank you for your elaborate explanation.
Can you say if the issue I have (N3520, Yoga 2 11) will be fixed in this bug report, or do I need to file a new bug report?

I also noticed this comment that I missed before (from Mika):
Another long shot to try is to see if:

'intel_reg write 0xa168 0x0'

has any effect on occurrence.

I will see if that does anything.
As reported, I still have consistent freezes when doing a file transfer over wifi, on 4.11 with the latest patch (sometimes even without touching bluetooth).

Kind regards,
Vincent
Comment 774 Mika Kuoppala 2017-04-11 13:38:01 UTC
(In reply to Vincent Gerris from comment #773)
> 
> 'intel_reg write 0xa168 0x0'
> 
> has any effect on occurrence.
> 

Very likely a waste of time. That change won't last as we rewrite the pmintrmask
often. You would need to change the mask in the kernel and recompile (there was a patch a long way back).

One interesting triaging point is not limiting the cstate but rather
limiting the number of active cpus.

Please try if 'maxcpus=2' will make a difference.
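
When reporting results for a workaround, it helps to confirm that the parameter actually reached the kernel. A small sketch, assuming a plain space-separated /proc/cmdline without quoted values; the function names are hypothetical:

```python
def parse_cmdline(cmdline):
    """Split a kernel command line into a {parameter: value} dict.

    Parameters without '=' map to None. Quoted values containing spaces are
    not handled; this only covers the simple cases discussed in this thread.
    """
    params = {}
    for token in cmdline.split():
        key, sep, value = token.partition("=")
        params[key] = value if sep else None
    return params


def active_workarounds(cmdline, names=("intel_idle.max_cstate", "maxcpus")):
    """Return only the workaround parameters that are present on the command line."""
    params = parse_cmdline(cmdline)
    return {name: params[name] for name in names if name in params}


if __name__ == "__main__":
    try:
        with open("/proc/cmdline") as f:
            print(active_workarounds(f.read()))
    except OSError as err:
        print("could not read /proc/cmdline:", err)
```

An empty result means the machine booted without either workaround, which is worth stating explicitly in a report.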
Comment 775 gutosoni 2017-04-11 14:21:34 UTC
(In reply to Vincent Gerris from comment #773)
> @gutosoni : your comment is less than helpful if you do not specify which
> exact kernel you mean, where you got it, etc. Please share some concrete
> links and tell which one did not work, and preferably what change you think
> fixed this.
> 
> @Len Brown
> Thank you for you elaborate explanation.
> Can you say if the issue I have (N3520, Yoga 2 11) will be fixed in this bug
> report, or do I need to file a new bug report?
> 
> I also noticed this comment that I missed before (from Mika):
> Another long shot to try is to see if:
> 
> 'intel_reg write 0xa168 0x0'
> 
> has any effect on occurrence.
> 
> I will see if that does anything.
> As reported, I still have consistent freezes when doing a file transfer over
> wifi, on 4.11 with the latest patch (sometimes even without touching
> bluetooth).
> 
> Kind regards,
> Vincent

http://packages.ubuntu.com/xenial/linux-image-4.4.0-72-generic

I have been using it for over a week, so far everything is fine, no problems.
Comment 776 Hal 2017-04-11 16:48:56 UTC
(In reply to gutosoni from comment #775)
> http://packages.ubuntu.com/xenial/linux-image-4.4.0-72-generic
> I have been using it for over a week, so far everything is fine, no problems.

My office desktop (Zotac ZBOX-CI320NANO with Intel Celeron N2930) has that same exact version provided through Linux Mint 18.1 updates. 
Without intel_idle.max_cstate=1 it freezes within the hour. 

Same machine with 4.10.9-041009 (from http://kernel.ubuntu.com/~kernel-ppa/mainline/v4.10.9/linux-image-4.10.9-041009-generic_4.10.9-041009.201704080516_amd64.deb) runs a bit longer - about 3-4 hours before freezing solid!

When loading the kernel with intel_idle.max_cstate=1 the machine performs flawlessly for months with no crash.

Hal
Comment 777 Fred 2017-04-11 17:18:41 UTC
Looking at a 4.1.12 preempt-rt kernel with a bay trail 3805 ( headless ) meaning there is no graphics engine.  So the patch from comment# 683 https://bugzilla.kernel.org/show_bug.cgi?id=109051#c683 does not apply.

What they are seeing is an occasional 5-7 ms of additional latency around some of our pthread_cond_timedwait() calls.   For example, if they tell pthread_cond_timedwait() to wait up to 50 ms, it actually isn't returning for 56 ms.   If they use intel_idle.max_cstate=1 on the kernel command line, the problem goes away.

So I am wondering if we are looking for the issue in the wrong place.   OR should this be listed as a separate/ new bug?
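
To make the latency observation easy to reproduce by others, here is a rough sketch of the measurement. Note the assumption: it uses Python's threading.Condition.wait() as a stand-in, not pthread_cond_timedwait() itself, so interpreter overhead adds some noise; the 50 ms timeout matches the example above and the function name is made up.

```python
import threading
import time


def measure_wait_overshoot(timeout_s=0.050, iterations=20):
    """Return, for each timed wait, how many seconds it ran past its timeout."""
    cond = threading.Condition()
    overshoots = []
    with cond:
        for _ in range(iterations):
            start = time.monotonic()
            cond.wait(timeout=timeout_s)  # always times out: nobody calls notify()
            overshoots.append(time.monotonic() - start - timeout_s)
    return overshoots


if __name__ == "__main__":
    samples = measure_wait_overshoot()
    print("max overshoot:  %.3f ms" % (max(samples) * 1e3))
    print("mean overshoot: %.3f ms" % (sum(samples) / len(samples) * 1e3))
```

Comparing the maximum overshoot with and without intel_idle.max_cstate=1 on the same kernel would show whether the extra milliseconds follow the deeper idle states.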
Comment 778 Mika Kuoppala 2017-04-12 07:31:29 UTC
(In reply to Fred from comment #777)
> Looking at a 4.1.12 prempted-rt kernel with a bay trail 3805 ( headless )
> meaning there is no graphics engine.  So the patch from comment# 683
> https://bugzilla.kernel.org/show_bug.cgi?id=109051#c683 does not apply.
> 
> What they are seeing is an occasional 5 – 7 mS of additional latency around
> some of our pthread_cond_timedwait() calls.   For example, if they tell
> pthread_cond_timedwait() to wait up to 50ms, it actually isn’t returning for
> 56 ms.   If they use intel_idle.max_cstate=1 on the kernel command line, the
> problem goes away.
> 
> So I am wondering if we are looking for the issue in the wrong place.   OR
> should this be listed as a separate/ new bug?

Fred, please take a look at:
https://bugzilla.kernel.org/show_bug.cgi?id=195255
Comment 779 Fred 2017-04-12 11:57:12 UTC
(In reply to Mika Kuoppala from comment #778)
> (In reply to Fred from comment #777)
> > Looking at a 4.1.12 prempted-rt kernel with a bay trail 3805 ( headless )
> > meaning there is no graphics engine.  So the patch from comment# 683
> > https://bugzilla.kernel.org/show_bug.cgi?id=109051#c683 does not apply.
> > 
> > What they are seeing is an occasional 5 – 7 mS of additional latency around
> > some of our pthread_cond_timedwait() calls.   For example, if they tell
> > pthread_cond_timedwait() to wait up to 50ms, it actually isn’t returning
> for
> > 56 ms.   If they use intel_idle.max_cstate=1 on the kernel command line,
> the
> > problem goes away.
> > 
> > So I am wondering if we are looking for the issue in the wrong place.   OR
> > should this be listed as a separate/ new bug?
> 
> Fred, please take a look at:
> https://bugzilla.kernel.org/show_bug.cgi?id=195255

Thanks Mika!
Comment 780 Vincent Gerris 2017-04-17 16:51:11 UTC
(In reply to Mika Kuoppala from comment #774)
> (In reply to Vincent Gerris from comment #773)
> > 
> > 'intel_reg write 0xa168 0x0'
> > 
> > has any effect on occurrence.
> > 
> 
> Very likely a waste of time. That change wont last as we rewrite the
> pmintrmask
> often. You would need to change the mask in the kernel and recompile (there
> was a patch long way back).
> 
> One intresting triaging point is not limiting the cstate but rather
> limiting the number of active cpus.
> 
> Please try if 'maxcpus=2' will make a difference.

Hi Mika,

Thanks, maxcpus=2 makes it stable for me.
How would you like me to proceed?
Comment 781 jbMacAZ 2017-04-19 08:37:26 UTC
I've built the new 4.10.11 with the v3 patch backport.  It froze twice, once within 10 minutes and then again within an hour.  For third try, I added intel_idle.max_cstate=2 as previously recommended.  That resulted in a "soft" freeze within 3.5 hours (apparently a wifi issue - dmesg was spammed with brcmfmac error -110s.)  Technically, that wasn't a cstate hard freeze, but either way, all three tests ended with the system unusable.

4.11-rc5, rc6 & rc7 all run fine without any cstate arguments, scripts or patches (v3 patch was mainstreamed in rc4.)  4.10.10 is rock solid with Mika's original patch.  Did something else need to be backported to avoid freezing in 4.10.11?  (Also see comment 754)

For test #4 I'm using maxcpus=2.  9+ hours of run time so far.  I've had previous good results with maxcpus before (comments 191, 197) before settling on intel_idle.max_cstate=1.  Curiously, the wifi dmesg spam is conspicuously absent so far in this test.  I prefer cstate=1 workaround, since maxcpus cuts system performance - video streaming shows more stuttering.  I can test any new patches if/when available.  

Thanks to all at Intel (and elsewhere) for applying some real horsepower to this cluster of baytrail freeze problems.  For my system - T100CHI, Z3775, 4.11 looks great.  4.10 has more issues muddying the waters.  Baytrail sound was still evolving and (broadcom) wifi has had chronic issues since early December backported to several kernel series.  4.10[EOL] can't happen soon enough!  YMMV.
Comment 782 jbMacAZ 2017-04-21 08:05:23 UTC
(In reply to jbMacAZ from comment #781)
> I've built the new 4.10.11 with the v3 patch backport.

I repeated test 3 with wifi disabled and got a typical hard freeze in about 4 hours, better than the earlier test, but not good.  Setting intel_idle.max_cstate=1 appears to restore stability for my system with 4.10.11.  I also got my first freeze today on 4.11-rc7 after 2 weeks of successful 4.11 (rc5, rc6 & rc7) testing.  It's time to throw in the towel on this clunker.  Even other T100 models (e.g. T100TA...) are far less prone to freezing...
Comment 783 Andrey 2017-04-22 15:13:51 UTC
I have very similar problem, on LENOVO V510-15IKB with i7 7500U (Kaby Lake).
With default kernel (4.4) on Ubuntu 16.04 freezes were very often (from 5 minutes up to maximum 2 hours of work). 

Now I've update kernel to 4.10.10 and set intel_idle.max_cstate=1, freezes still happens but in 3-12 hours of work.

Somebody using Kaby Lake?
How to diagnose that is it same c-state bug?
Comment 784 jbMacAZ 2017-04-22 19:52:13 UTC
(In reply to Andrey from comment #783)
> I have very similar problem, on LENOVO V510-15IKB with i7 7500U (Kaby Lake).
> With default kernel (4.4) on Ubuntu 16.04 freezes were very often (from 5
> minutes up to maximum 2 hours of work). 
> 
> Now I've update kernel to 4.10.10 and set intel_idle.max_cstate=1, freezes
> still happens but in 3-12 hours of work.
> 
> Somebody using Kaby Lake?
> How to diagnose that is it same c-state bug?

You probably have a different issue because the hallmark of this bug is that setting intel_idle.max_cstate=1 virtually stops the freezes.  To diagnose, it is best to change things one at a time.  If your default kernel still freezes in under 2 hours with intel_idle.max_cstate=1 set (at least 2 attempts), then cstate is probably unrelated to your issue.

BTW my Dell kabylake (i7-7500U) hasn't frozen on me yet, so that processor can run without freezing (about 2 months).
Comment 785 slumbergod 2017-04-25 11:53:11 UTC
I have a laptop with a 2nd generation Intel i3 CPU (Ivy Bridge i3-3110) and ever since I installed Ubuntu 16.04 I have been having random freezes as well. But the thread suggests that with my CPU it is *not* the cstate bug. 

Can anyone suggest which bug thread it is for the people with the *same* random hangs as the cstate bug but for CPUs other than the Bay Fail?

I've tried Xubuntu 16.04, 16.04.1, and 16.04.2 and all the kernels available, including the latest mainline ones. Same result. If I leave my machine running at some point it will freeze and nothing but a hard power off will solve it. Unfortunately, doing that has corrupted the file system twice and required reinstallation because fsck wasn't able to resolve it.
Comment 786 Hal 2017-04-25 13:10:20 UTC
(In reply to slumbergod from comment #785)
> I have a laptop with a 2nd generation Intel i3 CPU (Ivy Bridge i3-3110) and
> ever since I installed Ubuntu 16.04 I have been have random freezes as well.
> But the thread suggests that with my CPU it is *not* the cstate bug. 
> 
> Can anyone suggest which bug thread it is for the people with the *same*
> random hangs as the cstate bug but for CPUs other than the Bay Fail?
> 
> I've tried Xubuntu 16.04, 16.04.1, and 16.04.2 and all the kernels
> available, including the latest mainline ones. Same result. If I leave my
> machine running at some point it will freeze and nothing but a hard power
> off will solve it. Unfortunately, doing that has corrupted the file system
> twice and required reinstallation because fsck wasn't able to resolve it.

Just for the heck of it why don't you try to load the kernel with intel_idle.max_cstate=1 and see if it helps to avoid freezing while you gather more info about i3-3110 specific issues. You might be onto something.
The typical issues that I personally experienced with Ivy Bridge CPUs (not exactly your model) were graphics and USB related.
Comment 787 slumbergod 2017-04-25 21:18:24 UTC
I have a laptop with a 2nd generation Intel i3 CPU (Ivy Bridge i3-3110) and ever since I installed Ubuntu 16.04 I have been having random freezes as well. But the thread suggests that with my CPU it is *not* the cstate bug. 

Can anyone suggest which bug thread it is for the people with the *same* random hangs as the cstate bug but for CPUs other than the Bay Fail?

I've tried Xubuntu 16.04, 16.04.1, and 16.04.2 and all the kernels available, including the latest mainline ones. Same result. If I leave my machine running at some point it will freeze and nothing but a hard power off will solve it. Unfortunately, doing that has corrupted the file system twice and required reinstallation because fsck wasn't able to resolve it.
Comment 788 slumbergod 2017-04-25 21:24:15 UTC
(there is no way to remove the repost that somehow happened?)

@Hal, thanks. Yes, I am testing the cstate=1 solution but like the Bay Fail bug, whatever it is that affects the i3-3110 CPU is also *very random*. I could get 24 hours before a freeze or a couple of days. I suspect it is an Intel graphics driver bug, as you suggested. If I have no luck with the cstate=1 solution I will try rolling back to a pre-Ubuntu 16.04 kernel. I look back fondly and remember when my machine could run for weeks or months without a restart! Then came 16.04
Comment 789 luke 2017-04-27 04:45:45 UTC
slumbergod, (In reply to slumbergod from comment #788)

As Andred posted above:

> However, I fear (and has already been mentioned in earlier comments) this bug 
> report has long since lost any usefulness it might have once had and has just 
> turned into a dumping ground for random comments and updates and now reads
> like some web forum thread


This is not a support forum. Please have some respect for others and stop spamming us with your unrelated issues.
Comment 790 Hanno Zulla 2017-04-27 08:40:53 UTC
Hi.

It is very difficult to keep up.

Could please someone summarize and clarify the current status of this bug?

Please correct me if the following observations are wrong:

- the symptom is known, but not the root cause.

- for some reason, the bug does not affect Windows 10, but it affects Linux.

- the bug affects 4-core Bay Trail CPUs, but not 2-core Bay Trail CPUs.

- there is a workaround setting (the original subject of this bug) which is detrimental to battery runtime.

- there is a workaround patch (by Mika), some users of the patch report that it makes things better, others still report crashes.

- all in all, the bug is still unresolved.

Thanks for clarifying.

Thanks to everyone for their hard work on this bug, it is very appreciated. (I can't wait to use the cute little Bay Trail machine I have lying around here for my kids.)
Comment 791 slumbergod 2017-04-27 11:54:43 UTC
Hi Luke, here's a big huge FUCK YOU for being A FUCKING ASSHOLE!!
Go spam yourself you social rejects.
Comment 792 Hanno Zulla 2017-05-03 08:08:32 UTC
Sorry for asking again, but a clarification on the current status of this bug would be very much appreciated. See comment 790 on https://bugzilla.kernel.org/show_bug.cgi?id=109051#c790

Thank you.
Comment 793 jbMacAZ 2017-05-03 19:22:23 UTC
(In reply to Hanno Zulla from comment #790)
> 
> - there is a workaround patch (by Mika), some users of the patch report that
> it makes things better, others still report crashes.
> 
The v3 patch was mainstreamed into 4.11-rc4 and has been back-ported into 4.9 and 4.10.  

For my system, the original patch is effective in 4.9.25 and 4.10.13, but v3 (mainstreamed) patch is not.  Neither patch works for me in 4.11-rc8+, only setting .._cstate=1 will stop freezing.  (Asus T100-CHI Z3775)

The lack of any other reports of continued freezing (that can still be fixed with .._cstate=1) suggest that the v3 patch might be sufficient for most users.
Comment 794 Vincent Gerris 2017-05-04 20:07:33 UTC
Hi,

Thanks for the update, it was unclear to me and I guess others that the patch landed there.

I wonder if it fixed most peoples issues, but any is a plus I would say.

I am also wondering about status here because I still have the issue.
Mika, can you let us know if you want to continue investigating the maxcpus path for the people affected?

I am happy to contribute, but I would like to know if this will continue.
My issue like some others, seems to be related to wireless/bluetooth and power management.

I was hoping that these issues can also be fixed with current feedback and as posted before, I am happy to test!
Comment 795 Hal 2017-05-07 00:38:44 UTC
I've been testing 4.11.0 (Ubuntu's compile) on my Zotac (ZBOX-CI320NANO with Intel Celeron N2930) as I gathered that some patches were applied to it.
Without cstate=1 it freezes within 4 to 4.5 hours, no matter the workload. 
So, not out of the woods yet...
Hal
Comment 796 jbMacAZ 2017-05-07 22:32:31 UTC
correction:  The original c-state patch DOES stop freezing in 4.11.0.  There were other changes made about the same time as the v3 patch that interfere with applying the original patch.  Those other changes are probably the problem rather than the v3 patch itself, but I leave that to the experts to ponder.
Comment 797 jerameel 2017-05-11 07:11:30 UTC
Using kernel 4.10.14 from ubuntu on xubuntu 16.04.2 asus x453m laptop with intel baytrail, I'm running fine for several days now without any patch or workaround from cstate. I have tried both powersave and performance governor and still working fine on full load conditions.

TL:DR; I assume this has been already fixed with 4.10.14 (ubuntu)
Comment 798 Mika Kuoppala 2017-05-11 08:53:35 UTC
"Fix" is an overstatement. As the commit message notes, we have only a workaround
that helps only in some cases.

One interesting datapoint is that on my J1900, using the kernel param 'nohz=off' hangs the system in a very short time.
Comment 799 mopplayer 2017-05-16 11:44:01 UTC
It seems that the system freezes under heavy load (Linux).

Without any patches, kernel 4.11.0-rc8 works fine on Debian 8 (Jessie).

On Ubuntu 16.04, it freezes very often (4.11.0-rc8 and 4.12.0-rc1).

On Windows 10, I found that the CPU clock very often drops to 4xx Hz under any load.
Comment 800 floating 2017-05-20 08:41:18 UTC
I have been following this thread for a long time, waiting for the cure. Decided to post now as I got to test 4.11. I have the "High Powerful XCY Mini PC Celeron Dual Core N2830 2.16GHz 4G Ram 16G SSD Hotal Using Hdmi+Vga Computer Stick Windows 8 Computer" from http://m.aliexpress.com/item-desc/32617460139.html It's my only pc at home, and I am not into testing much, I just want it to work.

Year ago when I got it, I was using a dell wireless mouse and keyboard and I was trying to install linux on it. My internet connection is also only wireless (Realtek Semiconductor Corp. RTL8191SU 802.11n WLAN Adapter).  I started with Arch linux of the time, followed up Lubuntu 16.04, Ubuntu 16.04 etc, eventually going back to Lubuntu 14.04.1 or .2 alternative iso. Everything that had a newer kernel than the 3.13(?) something made the command-line installer really laggy, plus the installation process froze at some point except for the Arch. With Arch I managed to install it, but launching the barebones X that comes with xorg was less than 1 fps for a minute or so before freezing. In tty running lspci froze the system also. Also the alternate iso of lubuntu 14.04.3 or 4 or such was similarly laggy and froze at installation. Interestingly the lsb_release -a now one year later shows 14.04.5 as my version, kernel is 3.13.0-32-generic, and it works without kernel parameters or anything, just like the 14.04.1 from where I at some point have upgraded. I don't remember the kernel versions of the images that made the difference.

The intel_idle.max_cstate=1 would solve the problem, but I decided I didn't want to overheat or increase power consumption, so I thought I will run the Lubuntu 14.04 for now. Currently I am still running the lubuntu 14.04 with wireless usb internet and a wireless mouse, my keyboard is wired.

Fast-forward to today. Finally the opensuse live cd got bundled with the kernel 4.11, and because it seemed to have at least some promise of improvement, I decided to run the live cd (gnome version) and see the lag or the freeze. During the boot, there was no lag at the live cd's selection screen for installation, booting live cd etc. So maybe this is an indication already that I could get a working system with 4.11. I probably wouldn't mind if the freezes occurred once in 8 hours or so, if I could run it without parameters and lagless, like lubuntu 14.04. The boot splash screen took a very long time to load, maybe some errors in the background, I didn't check the logs. My wireless mouse seemed to work without lag on the opensuse's desktop. I ran terminal, and could type keys to it without noticeable lag, except occasionally, when the wireless mouse pointer on the screen made a small movement that I didn't initiate myself. It repositioned itself a couple of times in a minute like that, and at the same time made my wired keyboard not respond. Then after this type of basic testing, just a couple of minutes later, the keyboard's 'u' letter started spamming the terminal and wouldn't stop. I closed the terminal with my mouse, but then the u-spam continued on the opensuse's search bar. I disconnected the keyboard and replugged it to stop it. Next what I wanted to test was to connect to the internet with the wireless adapter, so I opened the "network" app from the gui, but after doing that the app froze immediately. Also the terminal app froze behind it. I went to tty1 and rebooted, and this ends my test.

If someone who read that, and thinks my bug is not related, or knows 99% that if I applied the c6off+c7on.sh and/or mika's patch, or some other patches to certain kernel and could then run otherwise updated arch system (and performance would be at least as good as lubuntu 14.04 without extra power consumption), I guess I could go for it, otherwise I think I will stick to the lubuntu 14.04, as I got most things covered with it anyway.
Comment 801 Len Brown 2017-05-26 17:41:33 UTC
@ Vincent Gerris

Yes, the N3520 in your Yoga2 is a Baytrail, and so it is covered by this bug report.

Note that the list of Baytrail models is available here:

http://ark.intel.com/products/codename/55844/Bay-Trail?q=bay%20trail#@All
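
For readers unsure whether their machine belongs in this report, a quick check from /proc/cpuinfo can complement the model list. This is only a sketch: it assumes Baytrail parts report CPU family 6, model 55 (0x37), which you should verify against your own hardware, and the function names are hypothetical.

```python
BAY_TRAIL = (6, 0x37)  # assumption: family 6, model 55 for Baytrail


def first_cpu_family_model(cpuinfo_text):
    """Return (family, model) of the first processor in /proc/cpuinfo text."""
    family = model = None
    for line in cpuinfo_text.splitlines():
        key, sep, value = line.partition(":")
        if not sep:
            continue  # blank separator line between processors
        key = key.strip()
        if key == "cpu family" and family is None:
            family = int(value)
        elif key == "model" and model is None:  # exact match, so "model name" is skipped
            model = int(value)
        if family is not None and model is not None:
            break
    return family, model


def looks_like_bay_trail(cpuinfo_text):
    return first_cpu_family_model(cpuinfo_text) == BAY_TRAIL


if __name__ == "__main__":
    try:
        with open("/proc/cpuinfo") as f:
            text = f.read()
        print("family/model:", first_cpu_family_model(text))
        print("looks like Baytrail:", looks_like_bay_trail(text))
    except OSError as err:
        print("could not read /proc/cpuinfo:", err)
```

A False result does not rule out freezes, but it does suggest they belong in a different bug report.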
Comment 802 Len Brown 2017-05-26 18:06:53 UTC
@ floating

Yes, your N2830 is a baytrail, and thus potentially subject to the
symptoms described in this bug report -- system freezes without
intel_idle.max_cstate=1

I don't suspect that the other issues you experience are related to
the subject of this report.  I recommend that if you do experience
system freezes, that you simply use intel_idle.max_cstate=1 for now,
assuming that it doesn't cause your battery life to reduce
below what you can handle.
Comment 803 Hanno Zulla 2017-06-02 08:26:21 UTC
Could someone involved with development on this bug please give a brief status update? Is the cause identified or are we still looking for it? How can we users be helpful at this stage? Thanks.
Comment 804 FL 2017-06-02 19:05:57 UTC
System is now stable since the last kernel update (no microcode, no cstate=1): 

OS: Arch Linux
Kernel: x86_64 Linux 4.11.3-1-ARCH
CPU: Intel Pentium J2900 @ 4x 2.4157GHz [26.8°C]
Mesa DRI Intel(R) Bay Trail

BOOT_IMAGE=/boot/vmlinuz-linux root=/dev/sda2
Comment 805 john 2017-06-03 07:21:33 UTC
Can anyone else confirm that 4.11.3 kernel fixes the bug?
Comment 806 Emmanuel 2017-06-04 08:14:46 UTC
Gave it a try, still froze after a few hours.

Intel(R) Celeron(R) CPU  J1900  @ 1.99GHz on ASRock computer
Ubuntu 14.04 with kernel 4.11.3

Reverted to kernel 3.16
Comment 807 Vincent Gerris 2017-06-04 08:42:29 UTC
@Mika Kuoppala and @Len Brown, can you give us a way forward when it comes to contributing as Hanno asked?

In my case on my N3520, the maxcpus=2 also prevents freezing, but I have seen no follow up.
I would still like to see a bugfix that does not require any kernel parameters, so a default installation of Linux will work for every user.

As I mentioned before, this Lenovo Yoga 2 11 laptop has been dedicated to troubleshoot this bug for a long time and I would like to move on :).

Thank you for the support and I hope we can move forward.
Comment 808 Paul Mansfield 2017-06-04 09:10:47 UTC
I've found that 4.12-rc2 has been very good, it's the first kernel since 4.5.7 where I can reliably use the SDIO wifi on my Click Mini.

I'm using the John Brodie patch set from the Asus T100 Google+ group with the following config:

https://www.zaurus.org.uk/download/toshiba_click_mini_l9w/config-4.12-rc2-jbpm0


I'm still using the c6+c7-off script, but have not had to limit max-cstate to 1.
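
For those who have not seen the script, the general idea of disabling individual idle states at runtime can be sketched as below. This is not the script from the thread, just an illustration under assumptions: it relies on the per-state "disable" file in the cpuidle sysfs tree, it selects states by name prefix (the exact state names on your machine may differ, so list them first), and the function names are made up. Writing the files needs root.

```python
from pathlib import Path


def plan_state_disables(cpu_root, prefixes=("C6",)):
    """Return the 'disable' files of every idle state whose name starts with a prefix."""
    targets = []
    pattern = "cpu[0-9]*/cpuidle/state[0-9]*/name"
    for name_file in sorted(Path(cpu_root).glob(pattern)):
        if name_file.read_text().strip().startswith(tuple(prefixes)):
            targets.append(name_file.with_name("disable"))
    return targets


def apply_state_disables(targets, value="1"):
    """Write value to each file: '1' disables the state, '0' re-enables it."""
    for path in targets:
        path.write_text(value)


if __name__ == "__main__":
    # Dry run only: print what would be written, change nothing.
    for path in plan_state_disables("/sys/devices/system/cpu"):
        print("would write 1 to", path)
```

The setting does not survive a reboot, so it has to be reapplied from a boot script if it turns out to help.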
Comment 809 Hanno Zulla 2017-06-07 07:54:52 UTC
I'm still trying to figure out whether to wait this out and keep the bay trail based notebook for the kids or if it's time to replace it with something that's certain to work. This bug tracker doesn't give me hope so far.
Comment 810 Volodymyr Saveliev 2017-06-07 11:23:26 UTC
My notebook tends to freeze sometimes when watching videos in Firefox and even randomly while performing light tasks. It happens two or three times a day. 
My kernel version is 4.4.0-79-generic and cpu is Intel Pentium n3710.
Will try the max-state workaround and report.
Comment 811 David Mace 2017-06-07 20:32:08 UTC
(In reply to Paul Mansfield from comment #808)
> I've found that 4.12-rc2 has been very good, it's the first kernel since
> 4.5.7 where I can reliably use the SDIO wifi on my Click Mini.
> 
> I'm using the John Brodie patch set from the Asus T100 Google+ group with
> the following config:
> 
> https://www.zaurus.org.uk/download/toshiba_click_mini_l9w/config-4.12-rc2-
> jbpm0
> 
> 
> I've still using the c6+c7-off script, but not had to limit max-cstate to 1.

I've had an uptime of 9 days and counting so far with 4.12-rc2 on my Asrock Q1900 ITX motherboard which has an Intel Celeron J1900 cpu. The previous longest uptime I could get on any previous kernel was about 4 days before freezing.
Comment 812 Elmar Melcher 2017-06-08 14:10:06 UTC
I've been using kernel-ml-4.10.13-1.el7.elrepo.x86_64 (no cstate patches AFAIK), no kernel parameters, not even tsc=reliable, for almost 2 months average usage 2h per day on Z3735G without freeze. On any previous kernel I could achieve this only with one of the appropriate cstate patches.

kernel-ml-4.11.0 froze after a few days, looking forward to try 
kernel-ml-4.12.0
Comment 813 A Uday K 2017-06-12 11:33:24 UTC
Tried kernel 4.12.0 rc4, froze within the day. Have tried both generic and low latency, no luck with either of them.
Comment 814 Fred 2017-06-12 11:43:04 UTC
Created attachment 256953 [details]
attachment-3110-0.html

Thank you for your email.  I will be out of the office on business travel starting on 6/9 through 6/16, returning on 6/19.    Email response will be slow.

If urgent, you can try to contact me on my cell phone listed below, response time may be slow at times.

Thank you

Fred

-------------------------------------
Fred Moses
Intel America's Inc.
SMG IOT Technical Sales
Senior Technical Sales Specialist
Desk (978) 553-1463
Cell (978) 621-2508
Comment 815 Paul Mansfield 2017-06-12 14:16:24 UTC
so who's going to call Fred Moses and demand he get someone to fix this s***???
:-D
Comment 816 Volodymyr Saveliev 2017-06-12 14:20:54 UTC
(In reply to Volodymyr Saveliev from comment #810)
> My notebook uses to freeze sometimes when watching videos in Firefox and
> even randomly while performing not heavy tasks. It happens two - three times
> a day. 
> My kernel version is 4.4.0-79-generic and cpu is Intel Pentium n3710.
> Will try the max-state workaround and report.

Can confirm that tuning max_cstate solved freezing for my laptop.
Thanks.
Comment 817 luke 2017-06-12 18:45:51 UTC
Created attachment 256965 [details]
Win 10 Uptime

I have a home media server that uses the Celeron J1900 CPU. After the v3 patch, I experienced a system freeze within 24 hours. I have since switched to Windows 10 and now have an uptime of 83 days. To the people here also experiencing Win 10 crashes, you probably have additional hardware or driver problems outside the scope of this bug report.

My system is used as a VirtualBox host, media server, and local media player so it has a wide load range. Before 3.13 and in Win 10, I never experienced a single crash. The v1 patch also appeared to fix the issue, although I only used it for a short time. 

After 3 months of Windows, I'd like to go back to Linux. Has any progress been made beyond the v3 patch?
Comment 818 kemeng 2017-06-18 07:46:15 UTC
Has anyone tested the new microcode (2017.05.11)?

https://downloadcenter.intel.com/download/26798/Linux-Processor-Microcode-Data-File


My processor isn't on the list (N3060) but it seems to stay calmer and cooler than before :)

I took out the cstate=1 parameter, hoping the best

kernel: 4.11.4
Ubuntu: 16.04.2
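
When testing a microcode update, it is worth recording the revision that is actually loaded. A minimal sketch, assuming the x86 /proc/cpuinfo format with a hexadecimal "microcode" line per processor; the function name is hypothetical:

```python
import re


def microcode_revisions(cpuinfo_text):
    """Return the set of microcode revisions (as ints) reported across all CPUs."""
    matches = re.findall(r"^microcode\s*:\s*(0x[0-9a-fA-F]+)\s*$",
                         cpuinfo_text, re.MULTILINE)
    return {int(m, 16) for m in matches}


if __name__ == "__main__":
    try:
        with open("/proc/cpuinfo") as f:
            revisions = microcode_revisions(f.read())
        print("microcode revisions:", sorted(hex(r) for r in revisions))
    except OSError as err:
        print("could not read /proc/cpuinfo:", err)
```

If the value is the same before and after installing the new data file, the update was not applied to this processor and any change in behaviour has another cause.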
Comment 819 Hanno Zulla 2017-06-20 07:05:31 UTC
Sorry, but is there any hope for this bug to be resolved in the near future?

It's been one and a half years by now and it appears that the cause is still unknown, status of this bug is still NEEDINFO. It's curious that the bug doesn't affect Windows 10, but it seems there's something odd specific to this processor that can't be resolved without Intel's internal expertise.

While I'm thankful for the work that's been done so far, please Intel: Throw a few more engineers with some internal documents at this problem to stomp it out.

Thanks.
Comment 820 Roman Kurbatov 2017-06-20 09:43:12 UTC
Hanno Zulla, Windows 10 is affected also. 

I have an Acer E511-G laptop. When I first bought it, I installed 14.04 and everything was just fine. I hardly used Windows. Then I had to install 15.04 or 15.10 due to some dependencies, then 16.04. Bugs appeared in that version. Initially, the system froze with the fans spinning very loudly on Linux. This was fixed with some kernel param (dmac or something like that), and then I gave the laptop to my kid. He complained of the same problem while using Windows 10. Not often, but pretty annoying.
I took the laptop back recently (about a year later). The loud fan bug was finally fixed, but the system now freezes under heavy load, such as a linter running, a type-checking script, or an IDE updating its cache. The mouse cursor moves very slowly during the freeze, but the fans are not spinning. I tried a lot of different kernels from this topic, from 4.05 to 4.12rc, and even some kernels proposed by a dedicated Intel engineer (who has been desperately fighting this problem for a year, as far as I can judge from the topic on Intel's forum).
As a result, I have not been able to use my laptop for a year and a half :) I think I'll never buy an Acer laptop again (it also has the famous Synaptics touchpad problem). And the most curious thing: it's my second personal machine with an Intel CPU inside (I've been using only AMD CPUs at home since 1997). The first one was stolen, this one is not working. It's a curse :)
Comment 821 László Kara 2017-06-20 14:57:19 UTC
(In reply to Roman Kurbatov from comment #820)
> The mouse cursor is moving very slowly during that
> freeze, but fans are not rotating.

RK, this bug is about freezing, and by freezing I mean no cursor is working at all! Yours seems like a different bug, or likely a hardware defect of the laptop.
It may not be OS specific.

I use win10 now with my ACER ES11 (n2940) laptop without any issues. It just works. TBH I don't expect progress anymore on this bug, I have waited long enough.
Comment 822 Vincent Gerris 2017-06-20 15:41:43 UTC
Created attachment 257095 [details]
attachment-10405-0.html

Guys, the last two comments haven't contributed anything. Please read the
thread before you post and spam.
In an attempt to reiterate:
 - this is a Linux bugzilla, nobody is interested in Windows, unless maybe
when you are a Microsoft engineer posting what patch they applied (it
happens on some Windows too)
 - this is ONLY about Baytrail cpus that freeze completely AND have a
workaround with kernel parameters (read back)
  - to help whoever still looks into this, please read back on how you can
contribute and otherwise just don't post here
 - any other issue that is not the above, please file a new bug according to
procedures

I am personally considering selling my machine, after which I will
unsubscribe. I hope the intel engineers involved will reach out and update
us before that.
Thank you and keep up the good work!

Comment 823 A Uday K 2017-06-20 16:05:54 UTC
In my case, on my N3530, Linux kernel 4.12-rc5 has given me an uptime of 7 days and counting. This is the longest uptime I've ever had on any kernel without the boot param (excluding the kernel with the auto-demotion patch, which gave me an uptime of 48 days, thanks to Vincent Gerris).
FYI, Linux kernel 4.12 rc4 didn't even give me an uptime of 1 day.
Interestingly, I've noticed that these freezes occur only when my system tries to wake up from the suspend state.

I'm running Ubuntu 16.04 LTS.
Comment 824 Gregg S. 2017-07-01 00:29:59 UTC
I just did a vanilla install of Xubuntu on a new GB-BXBT-1900  (J1900) and am having freezes in less than 2 hours with a streaming audio application (liquidsoap). When the application is not running, it can take much longer to freeze up, on at least one occasion > 48 hours. I've been running the box less than a week.

Ubuntu 16.04.2 LTS (GNU/Linux 4.8.0-56-generic i686)
$ cat /proc/cmdline
BOOT_IMAGE=/@/boot/vmlinuz-4.8.0-56-generic root=UUID=ac779692-05e0-4bf5-af86-3e46e9cc3cf2 ro rootflags=subvol=@ persistent quiet splash vt.handoff=7

Meanwhile, I'll begin working through some of the suggestions and report back results.
Comment 825 Brian T. McKee 2017-07-01 17:11:11 UTC
I attempted to install ubuntu 17.04 on a braswell Celeron N3060. Even with c_states change in the kernel boot parameter it still hangs. Moving the mouse flashes the screen until a hang. The 4.11 kernel crashes faster than the 4.10.

My suggestion is to steer clear from Braswell for linux. Too bad too. So inexpensive.

I put Chrome Back on it and am returning it.
Comment 826 Elmar Melcher 2017-07-14 12:48:53 UTC
I've been using kernel-ml-4.12.0-1.el7.elrepo.x86_64 (no cstate patches AFAIK), no kernel parameters, not even tsc=reliable, for 1 week average usage 2h per day on Z3735G without freeze.
kernel-ml-4.11.0 without kernel parameters froze after a few days.
Comment 827 Hanno Zulla 2017-07-19 14:42:59 UTC
Elmar, this bugzilla report does not mention any patches made to ml-4.12.0-1 compared to ml-4.11.0. Is there any change in the kernel related to this issue that can explain this improvement?
Comment 828 Elmar Melcher 2017-07-19 15:15:34 UTC
My experience is:
4.10.13 good,
4.11.0 freeze,
4.12.0 good.
Unfortunately, I do not have any explanation to offer.
However, my experience relates to comment 795 and to your comment 793.
Comment 829 jbMacAZ 2017-07-19 19:46:26 UTC
(In reply to Elmar Melcher from comment #828)
All three of your kernels have the v3 patch.  It was main-streamed into 4.11rc4 in February and is now in affected/maintained kernels including 4.9 through 4.13rcx.

I find the v3 patch far less effective than the v1 patch on my particular system.  I get cstate freezes on all recent kernels unless I use either the v1 patch, intel_idle.max_cstate=1 kernel arg or a modified c6off+c7on script.  YMMV.  

Unrelated code changes in 4.11 make it challenging but not impossible to apply v1 after reverting v3.  V1 is far more effective at avoiding freezes.  Undoing the unrelated code changes seems to cause no obvious problems.

Neither version of the cstate patch is 100% effective, but only v1 makes my system usable.  Apparently, enough systems had enough improvement with the v3 cstate patch that Intel no longer seems focused on this problem.

The cstate freeze is a hardware bug in the processor related to power saving.  The cstate problem also seems to be affected by the motherboard design and how the OS code tries to conserve power.  Even the processor version seems to matter.  This would explain the range of suffering even among comparable systems using the same processor.

To be sure that your 4.11 freeze was the cstate problem, try adding the intel_idle.max_cstate=1 (or 0) to your kernel args and retest to see if 4.11 still freezes.  However, your freeze took long enough that you might not see it again either way in 4.11.
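
A quick way to confirm that the argument actually took effect after rebooting is sketched below. It is only a sketch: the helper name max_cstate_from_cmdline is made up for illustration, and it assumes a POSIX shell and that intel_idle is the active cpuidle driver (otherwise the /sys/module/intel_idle path will not exist).

```shell
#!/bin/sh
# Hypothetical helper: print the value of the last intel_idle.max_cstate=
# argument in a kernel command line string, or "unset" if there is none.
max_cstate_from_cmdline() (
    set -f                      # no globbing while word-splitting $1
    value=unset
    for arg in $1; do
        case "$arg" in
            intel_idle.max_cstate=*) value=${arg#intel_idle.max_cstate=} ;;
        esac
    done
    echo "$value"
)

# What the running kernel was booted with.
if [ -r /proc/cmdline ]; then
    echo "cmdline: $(max_cstate_from_cmdline "$(cat /proc/cmdline)")"
fi

# What the intel_idle driver itself reports (read-only module parameter).
param=/sys/module/intel_idle/parameters/max_cstate
if [ -r "$param" ]; then
    echo "intel_idle: $(cat "$param")"
fi
```

If the first line prints "unset", the bootloader change was not picked up and any freeze you see says nothing about the workaround.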
Comment 830 EdSut 2017-07-20 13:22:57 UTC
I am running with an Azulle Access Plus (Cherry Trail Z8300).  Using Ian Morrison's isorespin.sh script, I updated it to Ubuntu 16.04.2 and kernel 4.12.2, and I included intel_idle.max_cstate=1 in kernel cmdline.
Using Firefox, I can consistently open the webglsamples.org/aquarium/aquarium.html page in fullscreen mode (with 4000 fish) and lock up my box (usually in 10-15 minutes).
Comment 831 Mika Kuoppala 2017-07-20 14:42:58 UTC
Created attachment 257621 [details]
drm/i915: Only use idle or max freq on Baytrail
Comment 832 Mika Kuoppala 2017-07-20 14:49:11 UTC
About the V3 vs V1 patches: they are supposed to be doing the same thing, but perhaps due to the intricate timings, they behave differently.

I added a patch which uses either max or idle freq on the gpu with byt, thus limiting
the traffic to the punit, to try to work around the bug. It is for those people who don't want to keep their cpus in >=C2.
Comment 833 Hanno Zulla 2017-07-21 08:52:56 UTC
Thanks. I'd still like to know about the cause of this bug and especially if there's an explanation why it affects Linux while Windows 10 is unaffected by it.
Comment 835 jbMacAZ 2017-07-23 17:32:27 UTC
(In reply to Mika Kuoppala from comment #832)
> About the V3 vs V1 patches: they are identical of doing the same thing, but
> perhaps due to the intricate timings, they behave differently.
> 
> I added a patch which uses either max or idle freq on gpu with byt, thus
> limiting
> the traffic to the punit, to try work around the bug. For those people who
> don't want to keep their cpus in >=C2.

I gave it a try on 4.12.3 (v3 built-in), but I had a hard freeze just under 13 hours (Asus T100CHI, Z3775.)  YMMV.
Comment 836 Juha Sievi-Korte 2017-08-12 11:40:05 UTC
I thought I'd share my experience too. With my N3540 laptop, all recent kernels do still freeze, last test without any tweaks with 4.11.8 and the freezes happened within two hours with light workload (browser + pdf reader).

I haven't tested the 'v1' patch with latest kernels so cannot comment on that, but back when it was released, the system seemed rock solid.
Comment 838 jack.lan 2017-08-15 09:33:54 UTC
Hello, everyone,

This is my first post here. I am seeing a similar problem on the Bay Trail platforms I have on hand.

CPU: E3827, E3825
OS: Ubuntu 14.04.5 amd64

Unfortunately, after nearly two days of testing and compiling, neither intel_idle.max_cstate=1 nor any of the following patches solves my problem:

1. https://bugzilla.kernel.org/attachment.cgi?id=251471&action=diff (the earliest patch reported to fix the crash, but the results are still the same)

2. https://bugzilla.kernel.org/attachment.cgi?id=254971 (the revised v3 version of the first patch, adapted to the modified i915 structure)

3. https://bugzilla.kernel.org/attachment.cgi?id=257621 (a patch on top of v3)

I also tested the following kernel versions:
3.16.46
4.8.17
4.9.3
4.9.22
4.9.41
4.11.12
4.12.5

No matter which combination of kernel and patch I use, I just need to run "glxgears -fullscreen".

Within an hour the screen goes completely dark and the system freezes; I cannot even use SysRq to generate an oops.

The most surprising part is that both of these models are dual-core, which differs slightly from the earlier reports that mentioned the problem only on quad-core platforms.

At the moment my attention has shifted to how to capture oops information.

I would be grateful for any advice on this.
Comment 839 jerameel 2017-08-15 09:44:03 UTC
(In reply to jack.lan from comment #838)


If intel_idle.max_cstate=1 doesn't work, it is probably caused by a different bug.
Comment 840 ladiko 2017-08-15 10:22:55 UTC
I can confirm that. We have no issues on ~50x Asrock Q1900-ITX with disabled C6 state as described in #437.
Comment 841 jack.lan 2017-08-16 01:29:49 UTC
(In reply to jerameel from comment #839)
> If intel_idle.max_cstate = 1 doesn't work, it is probably caused by a
> different bug.

Hello jerameel,

I also wondered whether this is a different problem, but others in this thread have reported that max_cstate=1 did not solve it for them either, and glxgears is a reliable way to reproduce the freeze.

Should I open a new bug report, or is there any test I can try first?

Jack.lan
Comment 842 sfumato1977 2017-08-16 10:40:21 UTC
My configuration:

CPU: Intel Atom Z3735F, 1.33 GHz
Cameras: OV5648 and HM2056
WiFi: RTL8723BS

OS: Windows 10, Ubuntu, Android

Ubuntu and Android with kernel 4.13-rc5, without intel_idle.max_cstate=1: freeze after about 10 minutes.
Windows 10 also freezes after about 10 minutes if I uninstall these devices:

Intel(R) Imaging Signal Processor 2400
Camera Sensor HM2056
Camera Sensor OV5648

and forcibly delete these files:

camera.sys
hm2056.sys
MBI.sys  <-----?????
ov5648.sys

Can anyone repeat the experiment?

Stress the rtl8723bs WiFi with: ping -l 65000 -t 192.168.0.1
and play an mp4 video.

Under these conditions Windows 10 freezes after 10 to 15 minutes on average.

Maybe the problem is atomisp.
It is the only driver that does not work under Linux.

Umberto.Izzo
Comment 843 sfumato1977 2017-08-16 16:54:53 UTC
Update:

I tried enabling the suspicious drivers one at a time.

The Windows 10 freeze on my tablet is due to:

Intel(R) Sideband Fabric Device, MBI.sys

Without this driver (MBI.sys) Windows boots, but after a few minutes, if I play an mp4, it crashes.

Enabling the camera drivers no longer causes any problem.

I know this is not the place to discuss Windows 10 problems, but it seems to me that the two operating systems behave in a similar way.

Can anyone check this "INT33BD"? I'm not a programmer, but I cannot find this string in the kernel source.



 Device (MBID)
        {
            Name (_HID, "INT33BD")  // _HID: Hardware ID
            Name (_CID, "INT33BD")  // _CID: Compatible ID
            Name (_HRV, 0x02)  // _HRV: Hardware Revision
            Name (_UID, One)  // _UID: Unique ID
            Method (_CRS, 0, Serialized)  // _CRS: Current Resource Settings
            {
                Name (RBUF, ResourceTemplate ()
                {
                    Memory32Fixed (ReadWrite,
                        0xE00000D0,         // Address Base
                        0x0000000C,         // Address Length
                        )
                })
                Return (RBUF)
            }

            OperationRegion (REGS, 0x87, Zero, 0x30)
            Field (REGS, DWordAcc, NoLock, Preserve)
            {
                PORT,   32, 
                REG,    32, 
                DATA,   32, 
                MASK,   32, 
                BE,     32, 
                OP,     32
            }

            Name (AVBL, Zero)
            Method (_REG, 2, NotSerialized)  // _REG: Region Availability
            {
                If (LEqual (Arg0, 0x87))
                {
                    Store (Arg1, AVBL)
                }
            }

            Method (READ, 3, Serialized)
            {
                Store (0xFFFFFFFF, Local0)
                If (LEqual (AVBL, One))
                {
                    Store (Zero, OP)
                    Store (Arg0, PORT)
                    Store (Arg1, REG)
                    Store (Arg2, BE)
                    Store (DATA, Local0)
                }

                Return (Local0)
            }

            Method (WRIT, 4, Serialized)
            {
                If (LEqual (AVBL, One))
                {
                    Store (One, OP)
                    Store (Arg0, PORT)
                    Store (Arg1, REG)
                    Store (Arg2, BE)
                    Store (Arg3, DATA)
                }
            }

            Method (MODI, 5, Serialized)
            {
                If (LEqual (AVBL, One))
                {
                    Store (0x02, OP)
                    Store (Arg0, PORT)
                    Store (Arg1, REG)
                    Store (Arg2, BE)
                    Store (Arg3, DATA)
                    Store (Arg4, MASK)
                }
            }
        }
Comment 844 Kiril Todorov 2017-08-19 08:12:01 UTC
I'm a programmer and I think the problem is that the BIOS on the motherboard of Baytrail and Braswell processors has not added a microcode for Linux. In this situation, I think the kernel has to be recompiled with the external microcode of intel-microcode and it should all work normally. This can be done following these instructions (https://www.dotslashlinux.com/2017/04/30/building-intel-cpu-microcode-updates-directly-into-the-linux-kernel/). Please check it out and report it because I do not have such hardware, but just try to help!
Comment 845 jechtpurgateur 2017-08-22 14:08:38 UTC
I don't follow. Isn't a classic microcode install enough? Your link seems to be about a different topic: building the microcode directly into the kernel.
I have checked my microcode before posting here long time ago : https://wiki.archlinux.org/index.php/microcode
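
For anyone who wants to compare before and after a microcode update, the loaded revision shows up in the "microcode" field of /proc/cpuinfo. A minimal sketch, with microcode_revision being a made-up helper name and the optional file argument only there so it can be pointed at a saved copy:

```shell
#!/bin/sh
# Hypothetical helper: print the microcode revision of the first CPU listed
# in a cpuinfo-style file (defaults to /proc/cpuinfo).
microcode_revision() {
    awk -F': *' '/^microcode/ { print $2; exit }' "${1:-/proc/cpuinfo}"
}

# Show the revision currently loaded on this machine, if available.
if [ -r /proc/cpuinfo ]; then
    echo "microcode: $(microcode_revision)"
fi
```

If the value is the same before and after installing the package, the update was either not applied or your CPU already had that revision from the BIOS.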

Anyway, that's a very interesting topic ;) I'm glad to have read it and I will probably try it.
Comment 846 jerameel 2017-08-26 12:43:55 UTC
For those people who didn't take the time to backread,

This bug happens ONLY on Linux running kernels > 3.16. Setting intel_idle.max_cstate=1 totally prevents the freeze, but it increases power consumption. Freezing on Windows is another matter. The main feature of this bug is that it is a total freeze: no mouse movement, no TTY shell access, not even Magic SysRq keys. If this happens, your only choice is to force a shutdown of the system.
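
On GRUB-based distributions the parameter is usually made permanent through GRUB_CMDLINE_LINUX_DEFAULT in /etc/default/grub. A rough sketch follows; the add_kernel_arg helper is hypothetical, it assumes GNU sed and a double-quoted value on that line, and the file location and the regeneration command (update-grub on Debian/Ubuntu) vary by distribution.

```shell
#!/bin/sh
# Hypothetical helper: append a kernel argument to GRUB_CMDLINE_LINUX_DEFAULT
# in the given file unless it is already present. Assumes GNU sed (-i) and
# that the value is enclosed in double quotes.
add_kernel_arg() {
    file=$1
    arg=$2
    if grep -q "^GRUB_CMDLINE_LINUX_DEFAULT=.*$arg" "$file"; then
        return 0    # already there, nothing to do
    fi
    sed -i "s/^GRUB_CMDLINE_LINUX_DEFAULT=\"\(.*\)\"/GRUB_CMDLINE_LINUX_DEFAULT=\"\1 $arg\"/" "$file"
}

# Example (as root), followed by regenerating the GRUB config:
#   add_kernel_arg /etc/default/grub intel_idle.max_cstate=1
#   update-grub
```

Keep a backup of the file before editing it, and check /proc/cmdline after the reboot to make sure the argument is really there.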

I hope this brings enlightenment.
Comment 847 Vladimir Mokrozub 2017-08-30 12:19:49 UTC
I have the same problem with GB-BACE-3000 miniPC (Intel Braswell N3000 CPU). I experience freezes with Debian 9 (kernel 4.9) and Ubuntu 16.04 (kernels 4.4 and 4.10). Interestingly, it's not only freezes but sometimes spontaneous reboots. I'll test it with intel_idle.max_cstate=1 for several days and report the result.

P.S. I don't think it matters but this PC is diskless and works as an LTSP fat client.
Comment 848 alan 2017-10-06 10:59:32 UTC
Hi guys,

I found a patch that could fix the cstate problem. It comes from the Asus T100 Ubuntu community and is named fix_c-state_patchv4.12.patch.
I think it can be used with 4.14.0-rc3 kernel. You can find it here : https://drive.google.com/drive/folders/0B4DiU2o72Fbub0U2ZzJaUzl5OEE

You can have a look on this patch to see if it can solve the problem. ;-)
Comment 849 harryharryharry 2017-10-06 17:39:40 UTC
@alan
No, to my knowledge it's an unofficial patch to circumvent the problem by not allowing deeper sleep (c) states (I think it was made by jbMacAZ - the patch was based on a earlier version which was merged into mainline but in such a way that it was ineffective for a lot of baytrail devices)
(see: https://bugzilla.kernel.org/show_bug.cgi?id=109051#c829)

It will make baytrail devices (Asus X205TA in my case) less power efficient than they could be, but at least they don't freeze...
Comment 850 jbMacAZ 2017-10-06 20:05:10 UTC
(In reply to alan from comment #848)
> Hi guys,
> 
> I found a patch that could fix the cstate problem. This one come from the
> Asus T100 Ubuntu community and his name is  : fix_c-state_patchv4.12.patch. 
> I think it can be used with 4.14.0-rc3 kernel. You can find it here :
> https://drive.google.com/drive/folders/0B4DiU2o72Fbub0U2ZzJaUzl5OEE

This patch simplifies removing the main-streamed v3 cstate patch (comment #742) and reinstalling Mika's v1 patch (comment #683).  On my Asus T100CHI the freeze rate dropped from 30 minutes on average to about once a month.  Better, but not 100%, so I use intel_idle.max_cstate=1.  ..cstate=1 limits how much power can be saved while idling.  Mika's patches make deep cstates transitions less likely to cause freezes.

The patch can be used with 4.12 through 4.14.  There is a version for 4.11 (navigate up a folder and down into 4.11.7): fix_c-state_patch.patch.  This patch might virtually stop cstate freezes on systems that freeze infrequently.  YMMV.
Comment 851 alan 2017-10-10 12:02:17 UTC
Hi,

@jbMacAZ: So this freeze problem is not solved yet on new kernels (4.12 through 4.14)? We still need to use the intel_idle.max_cstate=1 parameter in GRUB to avoid the freeze for now, right?
Comment 853 john 2017-10-10 12:08:31 UTC
Yes, and on some systems max_cstate helps but does not fix the issue entirely, for example the J1900 processor, which I have on an ASRock Q1900-ITX board. I have to reboot the system every week or so, sometimes a few times a week.
Comment 854 Paul Mansfield 2017-10-10 15:35:43 UTC
If I limit cstate on my Gigabyte J1900 board (GA-J1900-D3V), it is very stable; I have achieved uptimes of many weeks. It's running the standard openSUSE Leap 42.3 kernel, 4.4.79-19-default.

However, I am not using graphics, it's acting as a firewall, I specifically bought it because it was a cheap twin-NIC motherboard with power efficient Baytrail SoC.

On my baytrail tablet, even with the cstate limit it was unstable, and a lot of that seemed to be SDIO and/or the GPU. With the latest 4.13.5 plus patches, it's become useable.
Comment 855 jbMacAZ 2017-10-11 01:16:42 UTC
(In reply to alan from comment #851)
> Hi,
> 
> @jbMacAZ: So, this problem with freeze is not solved yet on new kernels
> (4.12 through 4.14) ? We still need to use the intel_idle.max_cstate=1
> command for grub to avoid the freeze for now ? Isn't it ?

Cstate freezes still exist in all versions after kernel 3.16.7.  The freeze rate varies widely for each kernel and a few (e.g. early 4.6) froze even with intel_idle.max_cstate=1.

(from comment #829)
> The cstate freeze is a hardware bug in the processor related to power
> saving.  The cstate problem seems to be affected by the motherboard
> design and how the OS code tries to conserve power.  Even the processor
> revision seems to matter.

Intel_idle.max_cstate=1 avoids cstate transition freezes reliably on my baytrail system.  The v1 patch is about 99.44% effective, while v3 is similar to unpatched on my system.  YMMV.
Comment 856 Hanno Zulla 2017-10-11 08:18:11 UTC
But is this problem still being worked on or has it been abandoned by now?

Sorry to sound fatalistic, but this looks like a problem that can't be solved outside of Intel, yet I don't understand why it couldn't be fixed by the Intel kernel team almost two years after the initial bug report.

(But if anybody is indeed still trying to fix this, three cheers to you, I'd love to buy you a beer.)
Comment 857 jerameel 2017-10-11 08:26:55 UTC
(In reply to Hanno Zulla from comment #856)
> But is this problem still being worked on or has it been abandoned by now?
> 
> Sorry to sound fatalistic, but this looks like a problem that can't be
> solved outside of Intel, yet I don't understand why it couldn't be fixed by
> the Intel kernel team after almost years of the initial bug report.
> 
> (But if anybody is indeed still trying to fix this, three cheers to you, I'd
> love to buy you a beer.)

This is because the affected processors were entry level and they wouldn't gain much fixing it.

We can only hope the community fixes this, or that at some point the bug eventually gets fixed by itself.
Comment 858 Hanno Zulla 2017-10-11 09:09:31 UTC
Just guessing here, but this looks like an issue deep down in the hardware where the community outside of Intel can hardly fix this without internal documentation.

Lacking the source of Windows 10's power management, there is no way to look at what they do differently compared to Linux and why Win10 is more stable on this hardware.

So far, it appears that the root cause of this hasn't been identified.
Comment 859 Armand 2017-10-14 16:49:01 UTC
I am seeing a regression with N3540 and kernel 4.9.53.
Even with 'max_cstate=1' I experience freezes every 48 hours. That had never happened to me before with the cstate=1.
I am returning to 4.4.90.
Comment 860 mirh 2017-11-05 13:23:39 UTC
(In reply to Wolfgang M. Reimer from comment #539)

There have been updates in the meantime to errata from these lists. 
While I was at it, I also took the liberty to include bugs that may lead to system hang, but are not *necessarily* related to c-state. 
I mean, seriously, there are so many of them that the incredible variance of reports gets almost easy to explain. 

BI46: Intel Atom Processor E6xx Series
http://www.intel.com/content/dam/www/public/us/en/documents/specification-updates/atom-e6xx-spec-update.pdf

CC5: Intel Atom Processor Z2760
http://www.intel.com/content/dam/www/public/us/en/documents/specification-updates/atom-z2760-spec-update.pdf

VLI1, VLI2, VLI7, VLI39, VLI55, VLI56, VLI64, VLI66, VLI68, VLI79, VLI91, VLI92: Intel Atom Processor E3800
http://www.intel.com/content/dam/www/public/us/en/documents/specification-updates/atom-e3800-family-spec-update.pdf

VLP2, VLP41, VLP52, VLP53, VLP56, VLP58, VLP70, VLP73, VLP82: Intel Celeron and Pentium Processor N- and J-Series
http://www.intel.com/content/dam/www/public/us/en/documents/specification-updates/pentium-n3520-j2850-celeron-n2920-n2820-n2815-n2806-j1850-j1750-spec-update.pdf

VLT1, VLT2, VLT7, VLT14, VLT39, VLT56, VLT57, VLT61, VLT63, VLT65, VLT75, VLT76, VLT78: Intel Atom Processor Z3600 and Z3700 Series
http://www.intel.com/content/dam/www/public/us/en/documents/specification-updates/atom-Z36xxx-Z37xxx-spec-update.pdf

Personally, I believe some well-meaning developer should just read these docs as gospel, implement as many in-kernel workarounds as possible, and only then look back at the situation and ask questions again. 

In other news, it seems that for at least Bay Trail-D family, some of the suspicions of comment 752 have been confirmed
http://advci.eastasia.cloudapp.azure.com/wordpress/wp-content/uploads/2017/05/570005_Intel_Celeron_Processor_J1900_Sighting_Alert_4995585_Rev1_0.pdf#page=4
Comment 861 RussianNeuroMancer 2017-11-07 19:16:05 UTC
New workaround to test: https://github.com/jwrdegoede/linux-sunxi/commit/0b6feb84ad305f821772dcbffeae92f80f5f63cc
Comment 862 Hans de Goede 2017-11-07 21:56:45 UTC
Hi,

(In reply to RussianNeuroMancer from comment #861)
> New workaround to test:
> https://github.com/jwrdegoede/linux-sunxi/commit/
> 0b6feb84ad305f821772dcbffeae92f80f5f63cc

Erm yes, I was planning on announcing this here myself :)  Anyway, this patch is an attempt to fix the different results people have been seeing between v1 and v3 of "drm/i915: Avoid tweaking evaluation thresholds on Baytrail". v3 only implements 1 of the 2 different fixes v1 contained; this adds the missing fix in a separate commit on top of the mainline kernel.

If people could test this, esp. people who have had success with v1, but not with v3, then that would be great.

Regards,

Hans
Comment 863 jbMacAZ 2017-11-08 07:37:14 UTC
Created attachment 260559 [details]
patch to fix v3 cstate patch

I've taken the liberty of posting a ready to use version of Hans de Goede's patch.  A pointer name was different in the kernel.org sources compared to linux-sunxi.

If this patch is successful, it is so much cleaner than the hack I put together for the Asus T100 (comment #850).  The new patch has been running over 45 minutes, which is already longer than the average freeze time for the v3 fix by itself.  I'll let the test run for a few days.  (Asus T100CHI z3775, 4.14-rc8 Mint Cinnamon 18.2, 64 bit)
Comment 864 Hans de Goede 2017-11-09 18:53:29 UTC
Created attachment 260583 [details]
[PATCH] i915: pm: Be less agressive with clockfreq changes on Bay Trail

Here is a cleaned up version of my patch which makes 4.14 behave as with the v1 "drm/i915: Avoid tweaking evaluation thresholds on Baytrail" patch.

Note there are no functional differences compared to my previous version.
Comment 865 Hans de Goede 2017-11-09 18:56:15 UTC
Created attachment 260585 [details]
[PATCH] intel_idle: Disable C6N and C6S on Bay Trail

And here is a patch for 4.14 which implements the functionality of the c6-off-c7-on.sh script at the kernel level. Given the azure.com link in comment #860 I think that disabling C6 on Bay Trail in general is probably a good idea.
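
For anyone who has not seen the script, the same idea can be tried from user space through the cpuidle sysfs interface. The following is only a sketch of that approach and not the c6-off-c7-on.sh script itself: the function name is made up, and matching states by a name starting with "C6" is an assumption about how intel_idle names them on Bay Trail (check the name files on your own system first).

```shell
#!/bin/sh
# Hypothetical helper: disable every cpuidle state whose name starts with
# "C6" on all CPUs found under the given sysfs root.
# $1: cpu sysfs root, defaults to /sys/devices/system/cpu
disable_c6_states() {
    root=${1:-/sys/devices/system/cpu}
    for state in "$root"/cpu[0-9]*/cpuidle/state[0-9]*; do
        [ -r "$state/name" ] || continue
        case "$(cat "$state/name")" in
            C6*) echo 1 > "$state/disable" ;;   # 1 = state may not be entered
        esac
    done
}

# Example (as root; the setting is lost on reboot):
#   disable_c6_states
```

Unlike intel_idle.max_cstate=1 this leaves C7 available, so it should cost less power if it turns out to be enough on your hardware.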
Comment 866 Hans de Goede 2017-11-09 19:00:37 UTC
If people who are still seeing stability issues could try these 2 patches, then that would be great.

My intention is to get both patches upstream, but please test them separately and per patch write down if it helps with stability (and esp. if it makes the system completely stable).

If neither patch creates a completely stable system for you, then please (also) try a kernel with both combined.
Comment 867 harryharryharry 2017-11-11 19:30:27 UTC
I'm inclined to say for me these patches don't do as good of a job as the patch jbMacAZ posted a while ago.

I've tried the patches from #864 & #865 both separately and together. With all three kernel versions my laptop (Asus X205TA) freezes within a couple of hours (playing media files on a loop with Kodi).

With the old patch from jbMacAZ I am using right now the laptop freezes only once in a blue moon.
Comment 868 Fred 2017-11-11 19:40:10 UTC
Created attachment 260617 [details]
attachment-19545-0.html

Comment 869 jbMacAZ 2017-12-02 20:22:01 UTC
I'm not seeing cstate freezes with 4.14.0+.  I tried the new patches (#864, #865) individually, together and without either.  Without some kind of work-around, my system used to freeze reliably in about 30 minutes, certainly no more than a few hours.  I've gotten over a week freeze-free without any workarounds.  I became suspicious because the original c6offc7on.sh script had only been slightly effective for my T100CHI.  But patch #865 alone wasn't freezing.
YMMV - could be just a weird coincidence and not enough patience.

(In reply to harryharryharry from comment #867)
> With the old patch from jbMacAZ I am using right now the laptop freezes only
> once in a blue moon.
There is an update for 4.15rc, if needed.
Comment 870 harryharryharry 2017-12-03 23:53:17 UTC
(In reply to jbMacAZ from comment #869)
> There is an update for 4.15rc, if needed.

Yes please ;) 
Right now I'm on 4.15rc2 (?? which I think has incorporated Hans de Goede's recent patches since they can't be applied anymore ??) only a couple of hours, but it already froze twice...
Comment 871 jbMacAZ 2017-12-04 07:20:37 UTC
(In reply to harryharryharry from comment #870)
> (In reply to jbMacAZ from comment #869)
> > There is an update for 4.15rc, if needed.
> 
> Yes please ;) 

https://drive.google.com/drive/folders/14zXrXxNa6dhpbcn9t4V1Q0MsmLlw1S53

I'm running rc2 also with Hans de Goede's patches without mine, 11 hours so far, no freezes.  I ran rc1 with my patch for 4 days w/o problems.
Comment 872 Hans de Goede 2017-12-04 09:39:28 UTC
Hi,

(In reply to jbMacAZ from comment #869)
> I'm not seeing cstate freezes with 4.14.0+.  I tried the new patches (#864,
> #865) individually, together and without either.  Without some kind of
> work-around, my system used to freeze reliably in about 30 minutes,
> certainly no more that a few hours.  I've gotten over a week freeze-free
> without any workarounds.

Thank you for your testing. So if I understand you correctly, 4.14 works without freezes for you without either of my 2 patches or any other workarounds, IOW it seems that a vanilla 4.14 fixes things for you?

Regards,

Hans
Comment 873 jbMacAZ 2017-12-04 17:02:32 UTC
(In reply to Hans de Goede from comment #872)
> Hi,
> 
> (In reply to jbMacAZ from comment #869)
> > I'm not seeing cstate freezes with 4.14.0+.  I tried the new patches (#864,
> > #865) individually, together and without either.  Without some kind of
> > work-around, my system used to freeze reliably in about 30 minutes,
> > certainly no more that a few hours.  I've gotten over a week freeze-free
> > without any workarounds.
> 
> Thank you for your testing, so if I understand you correctly then 4.14 works
> without freezes for you without any of my 2 patches, or any other
> workarounds, IOW it seems that a vanilla 4.14 fixes things for you ?
> 
> Regards,
> 
> Hans

That seems to be what I'm seeing on my T100CHI.  I'll feel more confident in a few more weeks.  I think it is safe to say that my freeze rate is much reduced at the moment.
Comment 874 harryharryharry 2017-12-05 08:08:23 UTC
Thanks for the patch jbMacAZ! I'll try it out
Comment 875 Armand 2017-12-13 08:15:37 UTC
With an N3540 and kernel 4.14 my system has been surprisingly stable (without 'max_cstate=1' or any patch). It is safe to say my freeze rate has dropped dramatically. If someone has more technical information about what was introduced in 4.14 to improve Baytrail stability, I am interested.
Comment 876 Armand 2017-12-14 07:47:38 UTC
With the N3540 and kernel 4.14, the first freeze eventually happened after one week (running 24/7), vs. a few hours with previous kernels (without max_cstate=1).
Comment 877 Prashant Poonia 2017-12-15 03:34:36 UTC
I too have an N3540 and, based on my observations over 2 years, I am pretty sure that this is a PCIe driver issue. If your system is stable with ethernet/wifi off, then the same applies to you.
Mine has a Realtek PCIe controller, and on Windows the driver v10.2.703.2015 fixes the issue (yes, these exact freezes also happen on Windows if you use old PCIe drivers). So if anyone has the technical capability to reverse engineer the driver, this issue can be fixed.
Link to the driver - http://dlcdnet.asus.com/pub/ASUS/nb/DriversForWin10/LAN/LAN_Realtek_Win10_64_VER1027032015.zip

I am a human, I might be wrong, but I have done deep research into this.
Hope it helps
Comment 878 frr 2017-12-15 07:58:13 UTC
On 15 Dec 2017 at 3:34, Prashant Poonia wrote:

> https://bugzilla.kernel.org/show_bug.cgi?id=109051
> 
> --- Comment #877 from Prashant Poonia (pooniaprashant400@gmail.com) ---
> I too am having a n3540 and based on my observations of 2 years, i am pretty
> sure that this is a pci-e driver issue. If your system is stable while using
> ethernet/wifi off, then same is the case with you.
> Mine is having realtek PCIe controller, and for windows driver v10.2.703.2015
> fixes the issue (yes these exact freezes are also in windows if you use old
> pcie drivers). So if anyone is having enough technical capabilities to
> reverse
> engineer the driver, this issue will be fixed.
> Link to the driver -
>
> http://dlcdnet.asus.com/pub/ASUS/nb/DriversForWin10/LAN/LAN_Realtek_Win10_64_VER1027032015.zip
> 
> I am a human, i might be wrong, but i have done deep research into this.
> Hope it helps
> 

so am I, or so I hope :-)
...I'm wondering what others will say. A problem in a PCI-e NIC 
device driver that only "fires" combined with system-global C-states 
deeper than 1 and appears to have some links to graphics activity.

I suspect that the stumbling block in this theory is, that "freezes" 
are a pretty broad symptom, which can have very different technical 
causes. And yes I have seen Windows freeze / stumble / bluescreen due 
to misc driver bugs.

Thank you for sharing your suspicion nonetheless, as it will turn 
many eyeballs on the theory :-) And congratulations on hunting down 
that bug in Windows. (Your report may come in handy to me as well "at 
face value", because I do meet a lot of Realtek chips under Windows 
where I work. Google "FMAPP.EXE freezes" BTW/off topic.) 

Frank
Comment 879 Daniel Glöckner 2017-12-15 08:55:21 UTC
(In reply to frr from comment #878)
> ...I'm wondering what others will say.

AFAICS Z3745 does not have an external PCIe bus.
Comment 880 A Uday K 2017-12-17 05:18:07 UTC
(In reply to Daniel Glöckner from comment #879)
> AFAICS Z3745 does not have an external PCIe bus.

Are you sure about that?
I use a N3530. The following are the outputs of a few commands that might give a better insight.
> dmidecode | grep "PCI"
Output is:
>               PCI is supported
>       Type: x4 PCI Express x4
>       Type: x1 PCI Express x1
>       Type: x1 PCI Express x1
>       Type: x1 PCI Express x1

> sudo dmidecode --type 9
Output is here: https://pastebin.com/raw/pVq7xu0M

> lspci -vv
Output is here: https://pastebin.com/raw/feisZHzB
Comment 881 Prashant Poonia 2017-12-17 07:58:51 UTC
updated (2017) link to updated realtek pci-e driver for unix, win, mac, dos etc.
http://www.realtek.com.tw/downloads/downloadsView.aspx?Langid=1&PNid=7&PFid=7&Level=5&Conn=4&DownTypeID=3&GetDown=false#RTL8100E/RTL8101E/RTL8102E-GR/RTL8103E(L)<br>RTL8102E(L)/RTL8101E/RTL8103T<br>RTL8401/RTL8401P/RTL8105E<br>RTL8402/RTL8106E

hope it helps.
Comment 882 Leonardo Santos 2017-12-17 21:58:34 UTC
(In reply to Prashant Poonia from comment #881)
> updated (2017) link to updated realtek pci-e driver for unix, win, mac, dos
> etc.
> http://www.realtek.com.tw/downloads/downloadsView.
> aspx?Langid=1&PNid=7&PFid=7&Level=5&Conn=4&DownTypeID=3&GetDown=false#RTL8100
> E/RTL8101E/RTL8102E-GR/RTL8103E(L)<br>RTL8102E(L)/RTL8101E/
> RTL8103T<br>RTL8401/RTL8401P/RTL8105E<br>RTL8402/RTL8106E
> 
> hope it helps.

That worked for me, thank you so much!
Comment 883 Prashant Poonia 2017-12-18 03:41:21 UTC
(In reply to Leonardo Santos from comment #882)
> (In reply to Prashant Poonia from comment #881)
> > updated (2017) link to updated realtek pci-e driver for unix, win, mac, dos
> > etc.
> > http://www.realtek.com.tw/downloads/downloadsView.
> >
> aspx?Langid=1&PNid=7&PFid=7&Level=5&Conn=4&DownTypeID=3&GetDown=false#RTL8100
> > E/RTL8101E/RTL8102E-GR/RTL8103E(L)<br>RTL8102E(L)/RTL8101E/
> > RTL8103T<br>RTL8401/RTL8401P/RTL8105E<br>RTL8402/RTL8106E
> > 
> > hope it helps.
> 
> That worked for me, thank you so much!

Glad it helped. If you still face occasional freezes, try using a different (slightly older) driver: the latest drivers (the 2017 ones) freeze once a week for me, but most of the 2016 ones are rock solid.
Keep testing to find what works best for you, and please do report your feedback at least once a month, it will surely help.
Comment 884 Vincent Gerris 2017-12-20 16:21:43 UTC
Hi Hans,

Thanks a lot for chasing this :)!
I tested with the Ubuntu 4.14.7 mainline kernel and get the same freezes without patches (it seemed a bit better, but after some time without AC and after unloading btusb it freezes again).


I have patched the 4.14.7 with :
https://bugzilla.kernel.org/show_bug.cgi?id=109051#c865
and that seems to make my system stable!

For Ubuntu users, I have uploaded the deb files for x86_64 here:
https://www.dropbox.com/sh/c39et4hr6tgp60q/AAC35c56aOEOwkmhjdvtG6dsa?dl=0

It would be great if that patch can get merged in master!
Thanks a bunch :)!
Comment 885 Hans de Goede 2017-12-20 18:38:24 UTC
(In reply to Armand from comment #876)
> with N3540 and kernel 4.14, eventually the 1st freeze happened after one
> week (running 24/7) vs. a few hours with previous kernels (without
> max_cstate=1)

Ok, so 4.14 is much better for you but not yet perfect, can you please try the patches from:

https://bugzilla.kernel.org/show_bug.cgi?id=109051#c864
https://bugzilla.kernel.org/show_bug.cgi?id=109051#c865

And see if those fix the last instabilities?
Comment 886 Hans de Goede 2017-12-20 18:42:37 UTC
Hi Vincent,

(In reply to Vincent Gerris from comment #884)
> Hi Hans,
> 
> Thanks a lot for chasing this :)!
> I tested with the Ubuntu 4.14.7 mainline kernel and get the same freezes
> without patches (semed a bit better but after some times without AC and
> unloading btusb again freezes).
> 
> 
> I have patched the 4.14.7 with :
> https://bugzilla.kernel.org/show_bug.cgi?id=109051#c865
> and that seems to make my system stable!
> 
> For Ubuntu users, I have uploaded the deb files for x86_64 here:
> https://www.dropbox.com/sh/c39et4hr6tgp60q/AAC35c56aOEOwkmhjdvtG6dsa?dl=0
> 
> It would be great if that patch can get merged in master!
> Thanks a bunch :)!

Thank you for testing this, so to be clear you applied the "[PATCH] intel_idle: Disable C6N and C6S on Bay Trail" patch and ONLY that patch on top of 4.14 and then without any other workarounds the system seems stable? Have I understood that correctly?

If I've understood that correctly that is great news, can you please:

1) Run powertop and confirm that your CPU cores are still hitting / using C7 when (mostly) idle, iow that my patch does not just re-introduce the intel_idle.max_cstate=1 workaround in a different way.

2) Run some more tests and more in general try to use the system (or run some test workloads) for at least a week to confirm that it really is stable.

Once I get feedback from you on the above items, and if all that feedback is positive, I will take a shot at getting this merged upstream.

Regards,

Hans
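As an alternative to powertop for item 1, the per-state counters that cpuidle exposes in sysfs can be summed directly. The following is only a rough sketch: the function names and the `cpu_root` parameter are hypothetical, and it assumes the usual `name`, `usage` and `time` (microseconds) files are present under each `cpuidle/stateN` directory.

```python
import glob
import os


def _read(state_dir, field):
    """Read one cpuidle attribute file and strip the trailing newline."""
    with open(os.path.join(state_dir, field)) as f:
        return f.read().strip()


def idle_residency(cpu_root="/sys/devices/system/cpu"):
    """Return {state_name: (entry_count, time_in_us)} summed over all CPUs."""
    totals = {}
    pattern = os.path.join(cpu_root, "cpu[0-9]*", "cpuidle", "state[0-9]*")
    for state_dir in glob.glob(pattern):
        name = _read(state_dir, "name")
        usage, time_us = totals.get(name, (0, 0))
        totals[name] = (usage + int(_read(state_dir, "usage")),
                        time_us + int(_read(state_dir, "time")))
    return totals
```

Calling `idle_residency()` twice a minute apart on a mostly idle machine and comparing the C7 entries should show whether the deep states are still being entered.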
Comment 887 Vincent Gerris 2017-12-21 12:41:05 UTC
hi,

You understood correctly.
Unfortunately I have had 2 freezes since, and a graphical crash that dropped me back to a login screen, which I cannot reproduce as before. So overall, it does seem to add to stability, it just doesn't make it completely stable.

I still see C7 states (C7S?), here is a snapshot of powertop:
          Package   |             Core    |            CPU 0
                    |                     | C0 active  72,7%
                    |                     | POLL        0,0%    0,0 ms
                    | C1 (cc1)   17,2%    | C1         17,2%    1,9 ms
C2 (pc2)    0,0%    |                     |
C3 (pc3)    0,0%    |                     |
C6 (pc6)    0,0%    | C6 (cc6)   16,1%    |
                    |                     | C7S         2,1%    9,6 ms

                    |             Core    |            CPU 1
                    |                     | C0 active  75,9%
                    |                     | POLL        0,0%    0,0 ms
                    | C1 (cc1)   16,2%    | C1         16,2%    1,8 ms
                    |                     |
                    |                     |
                    | C6 (cc6)   14,2%    |
                    |                     | C7S         2,1%   10,5 ms

                    |             Core    |            CPU 2
                    |                     | C0 active  74,5%
                    |                     | POLL        0,0%    0,0 ms
                    | C1 (cc1)   24,0%    | C1         24,1%    1,4 ms
                    |                     |
                    |                     |
                    | C6 (cc6)    6,6%    |
                    |                     | C7S         0,0%    1,4 ms

                    |             Core    |            CPU 3
                    |                     | C0 active  68,2%
                    |                     | POLL        0,0%    0,0 ms
                    | C1 (cc1)   22,4%    | C1         22,5%    1,8 ms

Should I try to add the other patch too? I'll keep testing and using to see what I can find out. Thanks a lot for all your work and support!
Comment 888 jechtpurgateur 2017-12-21 13:07:14 UTC
Created attachment 261289 [details]
attachment-30778-0.html

Use fixed width in order to display readable output (shift + 6 two times)



Comment 889 Hans de Goede 2017-12-23 16:50:47 UTC
Hi,

> You understood correctly.
> Unfortunately I have had 2 freezes since and a graphical crash that went to
> getting a login screen. which I cannot reproduce as before. So in overall,
> it does seem to add to stability, just doesn't make it completely stable. 

Ok.

> Should I try to add the other patch too? I'll keep testing and using to see
> what I can find out. Thanks a lot for all your work and support!

Yes if you can try the other patch on top of the one you are already using, then that would be great. If that gives good stability then it would be interesting to just try the other patch, but one step at a time.

Regards,

Hans
Comment 890 Vincent Gerris 2018-01-02 23:56:49 UTC
Hi,

I added the patch, so I am now running a kernel with both.
I let it run for about a week with different loads and repeatedly tried the procedure that would lock it (do smb transfer and unload btusb).
I still see influence on the speed but no locks so far.

When running an Android emulator I can get it really slow, but eventually the system responds (emulator crashes).

So this looks very promising :).
The kernel with both patches for Ubuntu can be found here (as the one with only the second patch):
https://www.dropbox.com/sh/c39et4hr6tgp60q/AAC35c56aOEOwkmhjdvtG6dsa?dl=0

I didn't try the first patch by itself because I thought I had used a previous version that did not seem effective.
It seems like a very good idea to me to send both patches upstream, especially if they are sensible based on the errata.

For me this seems to give me a stable system, I'm very happy :).

Can people still having issues on the 4.14 kernel running Ubuntu please test too and report? Thank you very much Hans for the relentless support !
Comment 891 Hans de Goede 2018-01-03 08:04:28 UTC
Vincent,

Thank you for testing this.

All: It would be great if some other people could give Vincent's Ubuntu kernel with the patches a try and report back if it helps them too. I really need some more data points before I can push these upstream.

Regards,

Hans
Comment 892 mvdw 2018-01-06 17:34:26 UTC
Hans, Vincent,

after three days with both patches and no freezes, I am inclined to consider these working. It was impossible for my device to sustain more than one day of regular usage prior. Let's get this merged.

Thanks to everyone who contributed.
Comment 893 Hans de Goede 2018-01-06 19:28:06 UTC
Hi,

(In reply to mvdw from comment #892)
> Hans, Vincent,
> 
> after three days with both patches and no freezes, I am inclined to consider
> these working. It was impossible for my device to sustain more than one day
> of regular usage prior. Let's get this merged.

Thank you for the feedback. Are you using this system regularly ? If so please give it a few more days and then report back here again to make sure it really is stable now.

Regards,

Hans
Comment 894 mvdw 2018-01-09 15:44:36 UTC
(In reply to Hans de Goede from comment #893)
> Hi,
> 
> (In reply to mvdw from comment #892)
> > Hans, Vincent,
> > 
> > after three days with both patches and no freezes, I am inclined to
> consider
> > these working. It was impossible for my device to sustain more than one day
> > of regular usage prior. Let's get this merged.
> 
> Thank you for the feedback. Are you using this system regularly ? If so
> please give it a few more days and then report back here again to make sure
> it really is stable now.
> 
> Regards,
> 
> Hans

Six crash-free days and counting.

This cheap laptop is being used for about 3-8 hours each day. It would inevitably freeze randomly during web-browsing, while using PhpStorm or Atom, during video playback or even immediately after boot or after suspend on the lockscreen. I cannot comment on energy usage though, now that C6 is missing as the battery is not good anymore anyway.
Comment 895 Grzegorz Kalwig 2018-01-15 10:54:38 UTC
I have 4 machines with freeze problems, and I need to have them online 24/7. (Proc: Intel Celeron J1900, OS: CentOS 7).

Previously they ran for a week at most, although usually a day or two was enough for a freeze (on a clean 4.14.11 kernel).

I added both patches to the kernel (4.14.12), and since then I have had 4 days with no crash. I will continue testing.
Comment 896 Juha Sievi-Korte 2018-01-17 18:48:14 UTC
(In reply to Hans de Goede from comment #891)
> Vincent,
> 
> Thank you for testing this.
> 
> All: It would be great if some other people could give Vincent's Ubuntu
> kernel with the patches a try and report back if it helps them too. I really
> need some more data points before I can push these upstream.
> 
> Regards,
> 
> Hans

Hi,

Also reporting back on this. I have now 4.14.11 with both of these patches and running 10 days now freeze-free with no other workarounds. Before this attempt I've been using the c6+c7off shell script and have gone months pretty much freeze-free.

I tried using my old test script for 24 hours, and other than that it has been my 'normal use', which previously triggered the freeze quite consistently. So far this seems a great improvement. I'll continue and report back if the situation changes.

Tested with N3540.
Comment 897 Srdjan Todorovic 2018-01-17 23:41:36 UTC
I've just tested kernel 4.14.14 with both patches from comment #885 on an ASRock Q1900-ITX (Intel Quad Core J1900). Running headless x86_64 Kubuntu 16.04.03 with Kodi on startup, I SSHed into the box after installing the new kernel and confirmed that the new kernel was being booted.

SSH session became unresponsive after about 15 minutes, cannot connect when initiating a new SSH session, no ping reply.

I will try the Ubuntu kernel from Vincent's comment #890, but to me this doesn't look like it's fixed or I have some difference in my setup.
Comment 898 Srdjan Todorovic 2018-01-18 01:11:22 UTC
Created attachment 273659 [details]
Dmesg -n 8 output when network dies

I was unable to scp the aforementioned .deb files as the network kept dying.
I pulled the machine out of the cabinet and hooked to a monitor, and used 4.14.14 with the 2 patches.

Approximately at the time that the screen darkened (Kodi was displayed) due to power saving dimming, the network connection died (LEDs at the back of the machine turned off). I can see from dmesg a report for:

r8169 0000:03:00.0 enp3s0: rtl_counters_cont == 1 (loop:  1000, delay: 10).

Some 200 seconds later, a kernel oops was printed to the screen. Something to do with a dev watchdog. Perhaps my problem is not related to this Bugzilla ID?
I've pulled the dmesg output to my desktop, attached here if it is relevant.
Comment 899 Hans de Goede 2018-01-18 09:33:07 UTC
(In reply to Srdjan Todorovic from comment #898)
> Created attachment 273659 [details]
> Dmesg -n 8 output when network dies
> 
> I was unable to scp the aforementioned .deb files as the network kept dying.
> I pulled the machine out of the cabinet and hooked to a monitor, and used
> 4.14.14 with the 2 patches.
> 
> Approximately at the time that the screen darkened (Kodi was displayed) due
> to power saving dimming, the network connection died (LEDs at the back of
> the machine turned off). I can see from dmesg a report for:
> 
> r8169 0000:03:00.0 enp3s0: rtl_counters_cont == 1 (loop:  1000, delay: 10).
> 
> Some 200 seconds later, a kernel oops being printed to the screen. Something
> to do with a dev watchdog. Perhaps my problem is not related to this
> Bugzilla ID?

Yes, this sounds like a different problem, with the combination of a Bay Trail J1900 + r8169 NIC. Please file a new bug for this.
Comment 900 Jan Jasper de Kroon 2018-01-20 20:10:24 UTC
I can confirm the patches which Hans de Goede submitted are working on my Fedora 27 installation as well.
My system is an Acer Aspire Switch 10 which features an Intel Atom Z3745 processor.
Before applying these patches my system also had random freezes which would require me to hard-shutdown the laptop.
Comment 901 Rasmus 2018-01-25 03:31:47 UTC
Vincent, could you please advise me which one of the debs I should install from your Dropbox share?

My current kernel is 4.13.0-31-generic and I'm running Intel's J1900. I can extensively test your patches on my working machine.
Comment 902 Stanislav Graf 2018-01-25 18:20:02 UTC
Intel(R) Core(TM) i5-7300U CPU @ 2.60GHz
Lenovo X270

I had problems similar to those described in this bug and I was able to get rid of most of them by updating the BIOS and following the latest stable kernel in Fedora. The only remaining issue was that whenever I left the laptop without power supply (on battery) and didn't do anything, after a few minutes it got stuck and I had to switch it off by pressing the power button for 10 seconds.

This was fixed for me from
kernel-4.15.0-0.rc8.git0.1.fc28.x86_64

(I didn't test earlier rc, rc9 is ok too, previous 4.14.14 is still with above problem)
Comment 903 mirh 2018-01-25 22:35:54 UTC
Your CPU has *nothing* to do with this bug. 
Regards.
Comment 904 alan 2018-01-26 13:31:18 UTC
Hi,

I tested the patches from Hans with the 4.15-rc9 kernel. I ran my system for more than 3 hours and didn't see any freezes.

I will do more tests.
Comment 905 Vincent Gerris 2018-01-27 14:06:20 UTC
Hi Rasmus and other testers,

Thank you for testing :)!

The names should be self explanatory.
The one with "and" in the name contains both patches, the other only the second one Hans mentioned. The first patch was committed upstream, so it might be a good idea to test the one with both patches (but you can try the other too to just test that change).

Everyone posting results, please mention the exact situation that you are testing (for example the kernels I put on dropbox, or a self patched kernel 4.14.x with patches from Hans) so we know exactly what is being tested and what might work or not.
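To make such reports easier to compare, something like the following sketch could collect the basics. The function names and the report layout are only illustrative, not an agreed format for this bug; it simply reads the running kernel release and looks for an `intel_idle.max_cstate=` token on the kernel command line. Which patches were applied still has to be stated by hand.

```python
import platform


def max_cstate_from_cmdline(cmdline):
    """Return the intel_idle.max_cstate value from a kernel command line, or None."""
    for token in cmdline.split():
        key, _, value = token.partition("=")
        if key == "intel_idle.max_cstate":
            return value
    return None


def describe_setup(cmdline_path="/proc/cmdline"):
    """Build a short text summary of kernel release and cstate workaround."""
    with open(cmdline_path) as f:
        cmdline = f.read().strip()
    limit = max_cstate_from_cmdline(cmdline)
    return "kernel: %s\ncmdline: %s\nintel_idle.max_cstate: %s" % (
        platform.release(), cmdline, limit if limit is not None else "not set")
```

Running `print(describe_setup())` and pasting the output next to the list of applied patches would cover most of what is asked for above.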

I haven't tried a kernel with only the first patch (I reinstalled because of some strange issue I wanted to rule out), but so far with both patches this is stable for me. Sometimes the system can be unresponsive for like 10 seconds if I really load it, but it will come out of it and continue to work. No hard freezes anymore.

Thank you and happy Linuxing
Comment 906 Hans de Goede 2018-01-27 14:15:34 UTC
Hi,

(In reply to Vincent Gerris from comment #905)
> The first patch was committed upstream

Correction: it has been submitted upstream, it has not yet been merged. But it did get the attention of the upstream i915 folks and they are currently actively looking into this.

Regards,

Hans
Comment 907 Martin 2018-01-27 20:50:23 UTC
I'm sorry I have to report a freeze after 3,5 days uptime on a patched 4.14.14 kernel, using both patches from comment 864 and 865.

HW: ASRock Q1900-ITX with J1900
Load: HTPC with MythTV recording/showing HD DVB-C material

The freeze took place while recording, no video output other than black, screensaved desktop (I was not actively watching TV at the moment).

I rebooted to the previously compiled 4.14.15 to continue a live recording and that froze within 15 minutes. Had to reboot to 4.10+ kernel to finish the program.
Comment 908 john 2018-01-27 21:17:28 UTC
I have an ASRock Q1900-ITX and tried the patches and the cstate=1 fix. Still regularly getting freezes. I have been following this thread for a year now and, since there has not been any progress, will switch to an Intel i3.

Good luck.
Hope you guys will find a fix someday.
Comment 909 David Mace 2018-01-27 21:50:48 UTC
(In reply to john from comment #908)
> Have an ASRock Q1900-ITX tried patches and the cstate=1 fix. Still regularly
> getting freezes. Been following this thread for a year now and due to the
> fact there has not been any progress will switch to a i3 intel.
> 
> Good luck.
> Hope you guys will find a fix someday.

I have the same board, and I used to get freezes all the time. I stopped updating after I installed 4.12.10 http://kernel.ubuntu.com/~kernel-ppa/mainline/v4.12.10/ and applied the cstate=1 fix. I never get the freeze any more (but yes, the machine is vulnerable to Meltdown etc.)
Comment 910 Rasmus 2018-01-28 02:40:58 UTC
I have been testing for 3 days with 4.15-rc9 and no freezes. J1900 used to freeze after 15 minutes of use with 4.13.
Comment 911 Martin 2018-01-28 09:54:53 UTC
(In reply to Rasmus from comment #910)
> I have been testing for 3 days with 4.15-rc9 and no freezes. J1900 used to
> freeze after 15 minutes of use with 4.13.

What patches applied?
Comment 912 Rasmus 2018-01-29 00:51:49 UTC
(In reply to Martin from comment #911)
> (In reply to Rasmus from comment #910)
> > I have been testing for 3 days with 4.15-rc9 and no freezes. J1900 used to
> > freeze after 15 minutes of use with 4.13.
> 
> What patches applied?

Just 4.15-rc9 from mainline. No patches, I understood something was already merged to mainline.

I've been using it for long now, still no problems despite constant C state switching.
Comment 913 Martin 2018-01-30 15:23:38 UTC
4.15 without any patches freezes up within three hrs for me.
Comment 914 nobody 2018-02-10 22:04:39 UTC
Any news about this? I have been experiencing random freezes since a reinstallation of Ubuntu 16.04 in August 2017. I tried kernels 4.4, 4.13 and 4.15, with intel_idle.max_cstate=1 set in the grub file. Even with that I get random freezes (it can be one a week, it can be two in a day). My CPU is an Intel Core i5.
Comment 915 Vincent Gerris 2018-02-10 22:15:34 UTC
Created attachment 274099 [details]
attachment-11991-0.html

Please READ before you post. This is NOT for i5, only for Baytrail CPUs.
Please research your issue and report a new bug if applicable. As always,
please follow the rules there to prevent double posts and spamming (no
need to reply).


Comment 916 SFA 2018-02-14 07:37:14 UTC
Hi, 

We have approx. 30 fanless servers in the field, equipped with E3845 CPUs. The servers run 24 hours a day, 365 days a year, with high CPU usage (more than 60% all the time). In the past we hit the freeze bug discussed here, and we fixed it with the famous 'intel_idle.max_cstate=1' kernel parameter.

But in fact this parameter does not fix the bug 100% (as mentioned in some posts above): after a few months, some servers still hang, with no other solution than a manual reboot. Even the HW watchdog does not work in this case! It is as if the RESET pin of the CPU were frozen as well.

We are using openSUSE 42.2, kernel 4.4.104, and we cannot change that, as it is embedded HW installed at various client sites.

So we are searching for a fix (another kernel parameter? something else?) that will get rid of this infamous freeze.

Any help or hint appreciated :)
Comment 917 Paul Mansfield 2018-02-14 11:20:00 UTC
@SFA
Although it's very easy to attribute system lockups to the baytrail c-state transitions being buggy, it's still a good idea to rule out all the standard problems like bad memory, so if your system isn't stable using a specific kernel build that works for others, it's worth running memtest and checking system logs for signs of faults, before entirely blaming the CPU.

I've been using a Gigabyte GA-J1900-D3V board as a firewall and it wasn't too bad, only occasionally locking up (weeks of uptime) but then I very rarely used the video - just for a console when doing updates, and didn't use audio or SDIO. Once I fixed the max c-state, I had almost no stability problems. However, power consumption has increased, so I intend to switch to a Gemini Lake board once Gigabyte make one available with twin NICs.


OTOH, my baytrail tablet was terribly unstable, and a lot of my problems seemed primarily due to the SDIO wifi device, and secondly to the embedded video. If I used a USB ethernet adaptor and the machine was idle, the uptime could be quite decent.
Comment 918 Michael 2018-02-16 08:58:50 UTC
I tested the patches from Hans de Goede on my Lenovo 100s-11iby, Atom Z3735F.

4.15.3 unpatched:
worst case: freeze in under 10 minutes

4.15.3 with patch from comment 864:
worst case: freeze in under 10 minutes

4.15.3 with patch from comment 865:
no freezes so far, using it for 3 days now 
longest uptime without reboot was about 20 hours
Comment 919 Grzegorz Kalwig 2018-02-16 09:10:48 UTC
As I wrote earlier, I now have 12 machines running kernel 4.14.12 + both patches, with 1 month and no freezes.

On 4.15 mainline I had the same freezes as on kernel 4.14.11.

(In reply to Grzegorz Kalwig from comment #895)
> I have 4 machines with freeze problems, i need to have it online 24/7.
> (Proc: Intel Celeron J1900, OS: Centos 7).
> 
> Previously they worked a maximum of once a week, although usually enough a
> day or two to freeze. (on clean kernel 4.14.11)
> 
> I added both patch to kernel (4.14.12), after that have 4 days with no
> crash. I go ahead with testing.
Comment 920 jechtpurgateur 2018-02-16 09:45:22 UTC
Created attachment 274203 [details]
attachment-27696-0.html

It's good to see people participating, but honestly, don't you think
feedback has to be very reliable? Otherwise it may just flood the mailing
list (nearly 1000 comments).

3 days without a freeze is not reliable at all.

I have been running 4.9.8 for more than a year with the v3 patch from Mika
Kuoppala
<https://patchwork.kernel.org/project/intel-gfx/list/?submitter=49531>
Note that I'm also using *intel_idle.max_cstate=0*.

I use my computer every day, all day, and I can go a month or so
without any freeze.
BUT, sometimes my system hangs twice a day or more.

To me, a reliable test without any stress testing has to last at least a month.
Too many people come here because they ran a day without any freeze and
come back a week later to tell us that they got a freeze.
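
To put a rough number on that, here is a back-of-the-envelope sketch. It assumes freezes arrive independently at a constant average rate (a Poisson process); that model and the "one freeze a week" figure are assumptions for illustration, not something measured in this thread. Under that assumption, the chance of a freeze-free run of t days is exp(-t / mean days between freezes).

```python
import math


def prob_no_freeze(test_days: float, mean_days_between_freezes: float) -> float:
    """Chance of seeing zero freezes in a test run, assuming Poisson arrivals."""
    if test_days < 0 or mean_days_between_freezes <= 0:
        raise ValueError("test_days must be >= 0 and the mean interval must be > 0")
    return math.exp(-test_days / mean_days_between_freezes)


# Hypothetical machine that still freezes about once a week on average.
for days in (3, 30):
    p = prob_no_freeze(days, 7.0)
    print(f"{days:>2} days without a freeze: p = {p:.3f}")
```

With these assumed numbers, a still-broken machine would get through 3 clean days roughly two times out of three, but through 30 clean days only about 1% of the time, which is why a month-long test says much more than a few days.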

Comment 921 SFA 2018-02-20 15:18:45 UTC
@jechtpurgateur: I fully agree with you. A few days mean nothing in terms of stability; we are looking for months and months.

@Paul Mansfield: Yes, Paul, I can confirm that this is a *real* freeze bug, not something else. From my point of view, it means that intel_idle.max_cstate=1 is not reliable enough.

Now I will try intel_idle.max_cstate=0.
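
Before comparing the two settings, it may be worth confirming that the parameter really reached the kernel. A minimal sketch, assuming the booted command line is read from /proc/cmdline; the helper name `kernel_param` is made up for this example.

```python
def kernel_param(cmdline: str, name: str):
    """Return the value of `name=value` from a kernel command line, or None.

    If the parameter appears more than once, the last occurrence is returned.
    """
    value = None
    for token in cmdline.split():
        key, sep, val = token.partition("=")
        if sep and key == name:
            value = val
    return value


# Example command line (made up for illustration).
sample = "BOOT_IMAGE=/vmlinuz root=/dev/sda1 quiet intel_idle.max_cstate=1"
print(kernel_param(sample, "intel_idle.max_cstate"))
```

On a running system the same function can be fed the contents of /proc/cmdline. Note that, per the kernel parameter documentation, intel_idle.max_cstate=0 does not mean "C0 only": it disables the intel_idle driver and falls back on acpi_idle.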
Comment 922 kernel 2018-03-09 10:38:26 UTC
Which driver is loaded for your integrated graphics processor?

If you are using i915, try i915.enable_dc=0 and i915.enable_rc6=0, as I mentioned here:

https://forums.lenovo.com/t5/Lenovo-C-E-K-M-N-and-V-Series/V510-15IKB-Laptop-Freeze/m-p/3956623#M24848
Comment 923 Zoltan Boszormenyi 2018-03-24 11:02:53 UTC
I have tested the patches from comment 864 and comment 865 on three machines
over the last few weeks.

The machines are LG MP500, Flytech POS335 and POS455.
The two POS machines both use J1900.

The LG MP500 had a constant video playing load.
Without the patches or intel_idle.max_cstate=1 on the kernel command line,
a lockup is reproducible in quite a short time, from 15 minutes to 1 hour.
This machine seems to be stable with intel_idle.max_cstate=1 alone or
with the patches.

With a very light load (i.e. mostly idle) on the two POS machines I got
hard lockups about twice daily previously even with intel_idle.max_cstate=1
applied. With both patches applied over 4.15.9 the lockups are gone.
"powertop" confirms that the C6 states are not used.
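
For anyone who wants to cross-check what powertop shows, the per-state counters can also be read from sysfs. This is only a sketch: it assumes the usual cpuidle layout (/sys/devices/system/cpu/cpuN/cpuidle/stateM/ with `name`, `usage` and `disable` files), and the function name is made up for this example.

```python
from pathlib import Path


def read_cpuidle_states(cpu_dir):
    """Return (state dir, C-state name, usage count, disabled?) for one CPU."""
    state_dirs = sorted(
        Path(cpu_dir).glob("cpuidle/state[0-9]*"),
        key=lambda p: int(p.name[len("state"):]),
    )
    rows = []
    for state_dir in state_dirs:
        name = (state_dir / "name").read_text().strip()
        usage = int((state_dir / "usage").read_text())
        disabled = (state_dir / "disable").read_text().strip() != "0"
        rows.append((state_dir.name, name, usage, disabled))
    return rows


if __name__ == "__main__":
    # Prints nothing if the kernel exposes no cpuidle directory for cpu0.
    for row in read_cpuidle_states("/sys/devices/system/cpu/cpu0"):
        print(*row)
```

If the C6 states are really out of use, their usage counters should stop increasing between two runs while the C7 counters keep growing.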
Comment 924 Hans de Goede 2018-03-24 11:09:34 UTC
(In reply to Zoltan Boszormenyi from comment #923)
> With a very light load (i.e. mostly idle) on the two POS machines I got
> hard lockups about twice daily previously even with intel_idle.max_cstate=1
> applied. With both patches applied over 4.15.9 the lockups are gone.
> "powertop" confirms that the C6 states are not used.

Thank you for testing! A question about the powertop output, C7 does still get used, right?
Comment 925 vova7890 2018-03-28 23:59:51 UTC
On my Cherry Trail, disabling the C6N and C6S states completely fixes the random poweroffs.
Comment 926 Zoltan Boszormenyi 2018-03-29 05:15:01 UTC
(In reply to Hans de Goede from comment #924)
> (In reply to Zoltan Boszormenyi from comment #923)
> > With a very light load (i.e. mostly idle) on the two POS machines I got
> > hard lockups about twice daily previously even with intel_idle.max_cstate=1
> > applied. With both patches applied over 4.15.9 the lockups are gone.
> > "powertop" confirms that the C6 states are not used.
> 
> Thank you for testing! A question about the powertop output, C7 does still
> get used, right?

Yes, C7 is used.
Comment 927 nw9165-3201 2018-03-31 22:24:47 UTC
Is only Bay Trail affected? Or Cherry Trail as well?

If Cherry Trail has been confirmed to have the same issue, then why is the title of this bug only mentioning Bay Trail?

Can someone add Cherry Trail to the title as well?
Comment 928 mirh 2018-04-01 18:00:04 UTC
Cherry trail seems to have a quite matching erratum reported
https://www.intel.com/content/dam/www/public/us/en/documents/specification-updates/atom-z8000-spec-update.pdf#page=27

*But* however similar the symptoms are, I think this problem is already complicated enough with just one architecture/platform.
(Also, the fix is reported to be a firmware thing there.)

So.. I don't know, maybe make a separate bug _depending on_ this for that, and wait for intel-gfx guys + Hans to find the magic.
Comment 929 luke 2018-04-09 18:24:13 UTC
Intel recently released new CPU microcode that updates B2, B3 Stepping microcode to 326 and C0 microcode to 836. Any word if this addresses our issues with the Bay Trail-D?
Comment 930 Prashant Poonia 2018-04-10 01:40:19 UTC
(In reply to luke from comment #929)
> Intel recently released new CPU microcode that updates B2, B3 Stepping
> microcode to 326 and C0 microcode to 836. Any word if this addresses our
> issues with the Bay Trail-D?

I think it's aimed mainly at the Spectre vulnerability; still, I hope it addresses this issue too.
http://forum.notebookreview.com/threads/cpu-vulnerabilities-meltdown-and-spectre-kernel-page-table-isolation-patches-and-more.812424/
Comment 931 w2q 2018-04-10 20:20:27 UTC
(In reply to luke from comment #929)
> Intel recently released new CPU microcode that updates B2, B3 Stepping
> microcode to 326 and C0 microcode to 836. Any word if this addresses our
> issues with the Bay Trail-D?

My CPU is a N3540, CPUID 30678.


According to 
https://newsroom.intel.com/wp-content/uploads/sites/11/2018/04/microcode-update-guidance.pdf

this CPU should have gotten a microcode-update to Version 836.

On manjaro, the latest microcode-update didn't change anything, according to dmesg the microcode still has version 831.

$   dmesg|grep micro
[    0.846514] microcode: sig=0x30678, pf=0x8, revision=0x831
...


$   ls -la  /boot/intel-ucode.img 
-rw-r--r-- 1 root root 1668608 14. Mär 08:22 /boot/intel-ucode.img

$   uname -a
Linux orion 4.14.31-1-MANJARO #1 SMP PREEMPT Wed Mar 28 21:42:49 UTC 2018 x86_64 GNU/Linux


$   cat /proc/cpuinfo 
processor	: 0
vendor_id	: GenuineIntel
cpu family	: 6
model		: 55
model name	: Intel(R) Pentium(R) CPU  N3540  @ 2.16GHz
stepping	: 8
microcode	: 0x831
...


Can anybody confirm this? Did the cpu N3540 not get any microcode-update?
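
For repeating this check after each microcode package update, the revision can be pulled out of /proc/cpuinfo programmatically. A small sketch, assuming the `microcode : 0x...` line format shown above; the function names are made up for this example and 0x836 is the revision from the guidance document mentioned earlier.

```python
import re


def microcode_revisions(cpuinfo_text: str):
    """Return the set of microcode revisions (as ints) found in cpuinfo text."""
    pattern = r"^microcode\s*:\s*(0x[0-9a-fA-F]+)\s*$"
    return {int(m.group(1), 16) for m in re.finditer(pattern, cpuinfo_text, re.MULTILINE)}


def has_revision(cpuinfo_text: str, wanted: int) -> bool:
    """True if every CPU listed reports at least the wanted revision."""
    revisions = microcode_revisions(cpuinfo_text)
    return bool(revisions) and all(rev >= wanted for rev in revisions)


# The value quoted above: still 0x831, so the 0x836 update has not been applied.
sample = "model name\t: Intel(R) Pentium(R) CPU  N3540  @ 2.16GHz\nmicrocode\t: 0x831\n"
print(has_revision(sample, 0x836))
```

On a live system, pass the contents of /proc/cpuinfo instead of the sample string.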
Comment 932 mirh 2018-04-11 13:35:56 UTC
https://downloadcenter.intel.com/search?keyword=processor+microcode+data+file
It's not released yet. 

And, anyway, I wouldn't get my hopes up too high.
The fix will likely come from drm driver guys.
Comment 933 Martin 2018-04-13 19:48:42 UTC
Just for the sake of reporting: 4.16 with the patches from comment 864 and comment 865 applied freezes within a short time under a MythTV HTPC DVB-C/HD load.
HW: ASRock Q1900-ITX, J1900 @ 1.99GHz
Comment 934 vova7890 2018-06-02 20:36:27 UTC
processor       : 3
vendor_id       : GenuineIntel
cpu family      : 6
model           : 142
model name      : Intel(R) Core(TM) i7-7500U CPU @ 2.70GHz
stepping        : 9
microcode       : 0x84
cpu MHz         : 500.026
cache size      : 4096 KB


Random freezes, just like on my previous Intel Cherry Trail. The patch that disables the C6N and C6S states completely fixes this problem on both processors. The freezes mostly happen with X11 running while idle; if only tty1 is running there seems to be no problem, or it is much more stable.

distro: ArchLinux

$ uname -a
Linux tv-pc 4.16.12-1-ARCH #1 SMP Wed May 30 23:02:24 EEST 2018 x86_64 GNU/Linux



diff -Naur linux-4.15.7-1/drivers/idle/intel_idle.c linux-4.15.7/drivers/idle/intel_idle.c
--- linux-4.15.7-1/drivers/idle/intel_idle.c    2018-02-28 11:21:39.000000000 +0200
+++ linux-4.15.7/drivers/idle/intel_idle.c      2018-03-03 21:57:43.381646516 +0200
@@ -221,6 +221,7 @@
                .flags = MWAIT2flg(0x58) | CPUIDLE_FLAG_TLB_FLUSHED,
                .exit_latency = 300,
                .target_residency = 275,
+               .disabled = true,
                .enter = &intel_idle,
                .enter_s2idle = intel_idle_s2idle, },
        {
@@ -229,6 +230,7 @@
                .flags = MWAIT2flg(0x52) | CPUIDLE_FLAG_TLB_FLUSHED,
                .exit_latency = 500,
                .target_residency = 560,
+               .disabled = true,
                .enter = &intel_idle,
                .enter_s2idle = intel_idle_s2idle, },
        {
Comment 935 harryharryharry 2018-06-02 23:35:23 UTC
@vova7890 I don't think your issue is related to this bug as your CPU is super different from baytrail/cherrytrail.

It also doesn't help that you're posting random snippets of patches without an explanation. I suggest you search for a more appropriate bugtracker or create your own - and describe in detail what you're experiencing/what you've done to alleviate the issue.
Comment 936 vova7890 2018-06-03 00:10:57 UTC
@harryharryharry, not a problem. I'm just pointing out similar behavior on another CPU; it may also help with researching the problem on Cherry/Bay Trail. As I also noted, I hit this bug on Cherry Trail, so this could be a problem for all newer (mobile?) Intel processors. The pasted patch was already attached by another user (I don't really remember who, sorry, but thanks). Just my 50 cent
Comment 937 harryharryharry 2018-06-03 00:15:57 UTC
no biggie smalls
Comment 938 w2q 2018-06-14 06:10:36 UTC
(In reply to w2q from comment #931)
> (In reply to luke from comment #929)
> > Intel recently released new CPU microcode that updates B2, B3 Stepping
> > microcode to 326 and C0 microcode to 836. Any word if this addresses our
> > issues with the Bay Trail-D?
> 
> My CPU is a N3540, CPUID 30678.
> 
> 
> According to 
> https://newsroom.intel.com/wp-content/uploads/sites/11/2018/04/microcode-
> update-guidance.pdf
> 
> this CPU should have gotten a microcode-update to Version 836.
> 
> On manjaro, the latest microcode-update didn't change anything, according to
> dmesg the microcode still has version 831.
> 
> $   dmesg|grep micro
> [    0.846514] microcode: sig=0x30678, pf=0x8, revision=0x831
> ...
> 
> 
> $   ls -la  /boot/intel-ucode.img 
> -rw-r--r-- 1 root root 1668608 14. Mär 08:22 /boot/intel-ucode.img
> 
> $   uname -a
> Linux orion 4.14.31-1-MANJARO #1 SMP PREEMPT Wed Mar 28 21:42:49 UTC 2018
> x86_64 GNU/Linux
> 
> 
> $   cat /proc/cpuinfo 
> processor     : 0
> vendor_id     : GenuineIntel
> cpu family    : 6
> model         : 55
> model name    : Intel(R) Pentium(R) CPU  N3540  @ 2.16GHz
> stepping      : 8
> microcode     : 0x831
> ...
> 
> 
> Can anybody confirm this? Did the cpu N3540 not get any microcode-update?

It seems Intel is withholding several microcode updates, although they were announced long ago:

https://bsd.denkverbot.info/2018/05/exposed-missing-meltdownspectre.html
Comment 939 youling257 2018-06-14 08:12:44 UTC
https://github.com/me176c-dev/android_vendor_asus_me176c/commit/7eac6beac2935b5d188641ed1c5003de4978e6c3
https://github.com/me176c-dev/android_vendor_asus_me176c/commit/31a3b46c4a763d3d4e2b17fd8b820748544993c3

Download https://download.lenovo.com/pccbbs/mobiles/gwuj26ww.exe
Run innoextract gwuj26ww.exe, take app/GWET46WW/$0AGW000.FL1 and rename it to 0AGW000.FL1
dd if=0AGW000.FL1 of=GenuineIntel.bin bs=1 skip=5853320 count=52224 status=none
Copy GenuineIntel.bin to lib/firmware/intel-ucode/06-37-08

echo 1 > /sys/devices/system/cpu/microcode/reload

[78787.595904] microcode: updated to revision 0x836, date = 2018-01-10
[78787.602718] x86/CPU: CPU features have changed after loading microcode, but might not take effect.
[78787.602736] x86/CPU: Please consider either early loading through initrd/built-in or a potential BIOS update.

https://pcsupport.lenovo.com/us/en/product_security/ps500151
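
The dd step above just copies a byte range out of the firmware image. Below is a sketch of the same extraction with a sanity check on the result before it is copied into lib/firmware. The header decoding assumes the standard Intel microcode update header, which starts with four little-endian 32-bit fields (header version, update revision, date, processor signature); the function names are made up for this example.

```python
import struct


def extract_blob(src_path, dst_path, skip, count):
    """Copy `count` bytes starting at byte offset `skip` (like dd bs=1 skip= count=)."""
    with open(src_path, "rb") as src:
        src.seek(skip)
        data = src.read(count)
    if len(data) != count:
        raise ValueError(f"expected {count} bytes at offset {skip}, got {len(data)}")
    with open(dst_path, "wb") as dst:
        dst.write(data)
    return data


def microcode_header(blob):
    """Decode the leading fields of an Intel microcode update header (assumed layout)."""
    header_version, revision, _date, signature = struct.unpack_from("<4I", blob, 0)
    return {"header_version": header_version, "revision": revision, "signature": signature}
```

For the file above one would call extract_blob("0AGW000.FL1", "GenuineIntel.bin", 5853320, 52224) and expect microcode_header() to report revision 0x836 and signature 0x30678. If it reports something else, the offsets probably do not match that firmware image and the blob should not be loaded.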
Comment 940 Zoltan Boszormenyi 2018-07-05 05:32:09 UTC
(In reply to harryharryharry from comment #935)
> @vova7890 I don't think your issue is related to this bug as your CPU is
> super different from baytrail/cherrytrail.
> 
> It also doesn't help that you're posting random snippets of patches without
> an explanation. I suggest you search for a more appropriate bugtracker or
> create your own - and describe in detail what you're experiencing/what
> you've done to alleviate the issue.

It's not a random patch, it's exactly the same as in comment 865.
Comment 941 Zoltan Boszormenyi 2018-07-05 18:33:59 UTC
New datapoint. LG MP500-FKBAP and MP500-FJBAP machines, both using:

Intel(R) Celeron(R) CPU N2930 @ 1.83GHz

and running 4.15.13 patched with both changes from comment 864 and comment 865
still lock up hard sporadically.
Comment 942 Saulo Oliveira 2018-07-11 18:28:49 UTC
I am having the same problem too. Constant freezes on all Linux distributions, including those that use the latest versions of the Linux kernel. Ubuntu 18.04, Fedora 28, OpenSUSE Leap 15, Linux Mint, Debian... All of them freeze because of this bug. If the parameter intel_idle.max_cstate=1 is added, power consumption becomes extremely high.
Comment 945 Prashant Poonia 2018-07-29 19:50:22 UTC
(In reply to Hans de Goede from comment #924)
> (In reply to Zoltan Boszormenyi from comment #923)
> > With a very light load (i.e. mostly idle) on the two POS machines I got
> > hard lockups about twice daily previously even with intel_idle.max_cstate=1
> > applied. With both patches applied over 4.15.9 the lockups are gone.
> > "powertop" confirms that the C6 states are not used.
> 
> Thank you for testing! A question about the powertop output, C7 does still
> get used, right?

Why haven't these patches been upstreamed to mainline after a year?
Comment 946 Prashant Poonia 2018-07-29 19:55:22 UTC
(In reply to Leonardo Santos from comment #882)
> (In reply to Prashant Poonia from comment #881)
> > updated (2017) link to updated realtek pci-e driver for unix, win, mac, dos
> > etc.
> > http://www.realtek.com.tw/downloads/downloadsView.
> >
> aspx?Langid=1&PNid=7&PFid=7&Level=5&Conn=4&DownTypeID=3&GetDown=false#RTL8100
> > E/RTL8101E/RTL8102E-GR/RTL8103E(L)<br>RTL8102E(L)/RTL8101E/
> > RTL8103T<br>RTL8401/RTL8401P/RTL8105E<br>RTL8402/RTL8106E
> > 
> > hope it helps.
> 
> That worked for me, thank you so much!

Is your system still stable? If yes, please also mention the kernel version you are using.
Comment 947 merlino37 2018-08-14 00:12:52 UTC
cstate=1 has been working for me for two weeks...prior to that it was unstable using many other kernels ...without cstate limit..
-----
System:    Host:-ThinkPad-11e
Kernel: 4.15.0-30-generic x86_64 bits: 64 gcc: 7.3.0
Desktop: Cinnamon 3.8.8 (Gtk 3.22.30-1ubuntu1) dm: lightdm 
Distro: Linux Mint 19 Tara
Machine:   Device: laptop System: LENOVO product: 20DAS00R00 v: ThinkPad 11e 
           UEFI: LENOVO v: N15ET73W (1.33) date: 01/25/2018
...................
CPU:       Quad core Intel Celeron N2920 (-MCP-) arch: Silvermont rev.3 cache: 1024 KB
           flags: (lm nx sse sse2 sse3 sse4_1 sse4_2 ssse3 vmx) bmips: 14929
           clock speeds: min/max: 533/1999 MHz 1: 1199 MHz 2: 1201 MHz 3: 1536 MHz 4: 1116 MHz
Graphics:  Card: Intel Atom Processor Z36xxx/Z37xxx Series Graphics & Display
           bus-ID: 00:02.0 chip-ID: 8086:0f31
           Display Server: x11 (X.Org 1.19.6 ) drivers: modesetting (unloaded: fbdev,vesa)
           Resolution: 1366x768@59.97hz
           OpenGL: renderer: Mesa DRI Intel Bay Trail
           version: 4.2 Mesa 18.0.5 (compat-v: 3.0) Direct Render: Yes
Comment 948 massa.lumumba 2018-10-04 17:30:17 UTC
I can confirm this bug on nearly all Atom-CPU-based systems. Without the cstate=1
workaround those systems just do not run stably.
Many kernels tested.
Comment 949 Bob L. 2018-10-05 03:46:23 UTC
You would not believe how much time I've wasted trying to figure out what was wrong with this install (or maybe you would). Can't believe that this is a known issue since December 2015 that is still not fixed in recent kernels.

I can confirm the behavior is still present in Ubuntu Bionic 18.04.1 LTS with kernel 4.15.0 and the i915 driver. The box is a Bay Trail J1900 and freezes anywhere from a few minutes to about 30 or so max. The kernel parameter "intel_idle.max_cstate=1" works around the issue successfully. Why am I using i915 instead of the modesetting driver? Because it has the "TearFree" option and the modesetting driver tears badly in video playback.

System config:


System:    Host: Bobs-Ubuntu Kernel: 4.15.0-36-generic x86_64 bits: 64
           Desktop: MATE 1.20.1  Distro: Ubuntu 18.04.1 LTS
Machine:   Device: desktop System: ASUSTeK product: EB1036 v: 1403 serial: N/A
           Mobo: ASUSTeK model: EB1036 v: Rev 1.xx serial: N/A
           UEFI [Legacy]: ASUSTeK v: 1403 date: 11/25/2015
CPU:       Quad core Intel Celeron J1900 (-MCP-) cache: 1024 KB
           clock speeds: max: 2415 MHz 1: 2056 MHz 2: 1898 MHz 3: 2414 MHz
           4: 2411 MHz
Graphics:  Card: Intel Atom Processor Z36xxx/Z37xxx Series Graphics & Display
           Display Server: x11 (X.Org 1.19.6 ) driver: intel
           Resolution: 1600x900@59.98hz
           OpenGL: renderer: Mesa DRI Intel Bay Trail version: 4.2 Mesa 18.1.5
Audio:     Card Intel Atom Processor Z36xxx/Z37xxx Series High Def. Audio Controller
           driver: snd_hda_intel
           Sound: Advanced Linux Sound Architecture v: k4.15.0-36-generic
Network:   Card-1: Realtek RTL8111/8168/8411 PCIE Gigabit Ethernet Controller
           driver: r8169
           IF: enp2s0 state: down mac: 40:16:7e:bd:92:f8
           Card-2: Qualcomm Atheros AR9462 Wireless Network Adapter
           driver: ath9k
           IF: wlp4s0 state: up mac: d0:53:49:87:92:fc
Drives:    HDD Total Size: 240.1GB (39.1% used)
           ID-1: /dev/sda model: SanDisk_SSD_PLUS size: 240.1GB
Partition: ID-1: / size: 220G used: 88G (43%) fs: ext4 dev: /dev/sda1
RAID:      No RAID devices: /proc/mdstat, md_mod kernel module present
Sensors:   System Temperatures: cpu: 50.0C mobo: N/A
           Fan Speeds (in rpm): cpu: 0
Info:      Processes: 240 Uptime: 4:23 Memory: 1745.5/7859.8MB
           Client: Shell (bash) inxi: 2.3.56
Comment 950 Rafael Gandolfi 2018-10-06 02:29:06 UTC
Just to report: with a Z3735F (MeeGoPad T02), only the "Debug patch to enable BYT C6 auto-demotion" by Len Brown is OK. Disabling C6 works only if the system is nearly idle; under workload it freezes within minutes, so it is not really an improvement for the user.
Comment 951 Callum Lerwick 2018-10-15 03:14:05 UTC
Hi, Acer V3-111P netbook, Bay Trail "Pentium" N3530 or whatever intel calls it this week. Recently barely escaped massive data loss thanks to this bug and forgetting to add the max_idle workaround when I booted SystemRescueCD as chronicled here: https://twitter.com/NinjaSeg/status/1046254778919968768

For what it's worth sysresccd is gentoo based and reports kernel version 4.14.70-std531-amd64

I recently did a fresh re-install of Fedora 28 which wiped out the work-around, it ran the entire drive repair and ran for a day or two without hangs, so maybe this finally got fixed?? Fedora kernel version is 4.18.9-200.fc28.x86_64
Comment 952 merlino37 2018-10-15 22:18:00 UTC
(In reply to merlino37 from comment #947)
> cstate=1 has been working for me for two weeks...prior to that it was
> unstable using many other kernels ...without cstate limit..
> -----
> System:    Host:-ThinkPad-11e
> Kernel: 4.15.0-30-generic x86_64 bits: 64 gcc: 7.3.0
> Desktop: Cinnamon 3.8.8 (Gtk 3.22.30-1ubuntu1) dm: lightdm 
> Distro: Linux Mint 19 Tara
> Machine:   Device: laptop System: LENOVO product: 20DAS00R00 v: ThinkPad 11e 
>            UEFI: LENOVO v: N15ET73W (1.33) date: 01/25/2018
> ...................
> CPU:       Quad core Intel Celeron N2920 (-MCP-) arch: Silvermont rev.3
> cache: 1024 KB
>            flags: (lm nx sse sse2 sse3 sse4_1 sse4_2 ssse3 vmx) bmips: 14929
>            clock speeds: min/max: 533/1999 MHz 1: 1199 MHz 2: 1201 MHz 3:
> 1536 MHz 4: 1116 MHz
> Graphics:  Card: Intel Atom Processor Z36xxx/Z37xxx Series Graphics & Display
>            bus-ID: 00:02.0 chip-ID: 8086:0f31
>            Display Server: x11 (X.Org 1.19.6 ) drivers: modesetting
> (unloaded: fbdev,vesa)
>            Resolution: 1366x768@59.97hz
>            OpenGL: renderer: Mesa DRI Intel Bay Trail
>            version: 4.2 Mesa 18.0.5 (compat-v: 3.0) Direct Render: Yes

Still working with cstate=1 .. I have blocked kernel updates..
Comment 954 mailinglists35 2018-11-29 14:25:19 UTC
Is Intel still looking into this? If yes, does it still make sense to keep attaching relevant info (dmesg, byt.test output etc) on this issue? 

I am able to trigger hangs after 2-3 days with byt.test script provided by Len. 
Can someone suggest tweaking of byt.test script that can trigger faster hangs? 

Base Board Information
        Manufacturer: ASRock
        Product Name: IMB-150

Model name:            Intel(R) Celeron(R) CPU  J1900  @ 1.99GHz
Comment 955 merlino37 2018-12-01 20:08:26 UTC
still working with cstate=1..I have updated kernel to  4.15.0-39-generic x86_64 bits: 64 gcc: 7.3.0..
Comment 956 mailinglists35 2018-12-03 12:52:56 UTC
@merlino37:
For the vast majority cstate=1 works, so it is quite noisy to keep repeating that it works. We are all waiting to hear when it works *without* cstate=1.
Comment 957 massa.lumumba 2018-12-03 15:41:34 UTC
Does anyone know if the Pentium N5000 is affected, too? Someone told me the MMU does not work with the N5000 and Linux; has anyone tested this already?
Comment 959 merlino37 2018-12-04 01:30:46 UTC
Created attachment 279833 [details]
attachment-4533-0.html

It doesn't work with some kernels..which is why I reported it working after
a kernel update..

Comment 960 Tianli 2018-12-04 21:41:56 UTC
I was recently wrestling with random freezes on the ASRock Q1900 motherboard and found this page. I am running a Windows 10 system and it freezes multiple times per day for no apparent reason. I think I found the culprit and am sharing it here to give others a new idea of what might be going wrong.

Basically I tried everything and nothing seemed to work, until one day I noticed that the computer always freezes when the HDD light is on. This pointed to the storage system. I have a Transcend SSD, but as far as I can remember, the computer also froze when I was running Ubuntu on a Seagate SSHD. I searched online about SSDs and found a Microsoft article about changing the registry for the AHCI controller:

https://support.microsoft.com/en-us/help/3083595/task-manager-might-show-100-disk-utilization-on-windows-10-devices-wit

But unfortunately it didn't work either.

Then I tried turning BIOS switches for the storage system on and off, and I found one thing that does stop the freezes. It seems that on the ASRock motherboard there are two types of SATA connectors: the ones that support SATA3 (which connect to an ASMedia chip per the ASRock spec, https://www.asrock.com/mb/Intel/Q1900-ITX/) and the ones that support SATA2. Of course, for better speed we connect to the SATA3 ports by default, and it turns out that's the culprit (which explains why nothing on the software side solved the problem). Once I switched the SSD to the SATA2 ports, the system stopped freezing. I also disabled the two ASMedia ports in the BIOS.

Yes, this reduces the drive's transfer speed: sequential read in the Transcend tool dropped from ~500MB/s to ~250MB/s, but at least I have a stable system to get some work done.
Comment 961 Bob L. 2018-12-05 18:52:39 UTC
@mailinglists35 People here have been awaiting word that it "works *without* cstate=1" since 2015. If it were not for the occasional post that "yes, this is still a problem with kernel X.Y" then there would be almost no posts here at all.

Also, I was only able to find this workaround myself recently through discovery of this thread. Would it have even shown up in a search if it were several years stale? Probably not. I can't help but wonder how many people have tossed perfectly good hardware simply because they never found this thread (and the workaround contained within) and assumed it was a physical fault of some kind.
Comment 962 boostedd2 2018-12-11 03:27:06 UTC
This issue has been resolved for me entirely using the original c6off+c7on script.

Was running Arch and always did every update including to the latest kernel and never had an issue. 

Had the script set as a launch daemon with systemd on boot.

I recently formatted my drive and will be testing again soon with most likely the latest Ubuntu release.

Running on N3540 powered Asus laptop.
Comment 964 nedmostoles 2019-01-11 11:50:06 UTC
Is there a current workaround or fix for this bug? I am using the 4.20.1 kernel under Kali Linux on an HP 15-f271wm with an Intel Pentium N3540 CPU.

Thanks.
Comment 965 JanVlietland 2019-01-11 17:28:06 UTC
Have same problem with my Samsung NP900X5N laptop: https://bugs.freedesktop.org/show_bug.cgi?id=109209

dmesg & syslog available over there.

ps: guys at freenode - #intel-gfx pointed me here

intel_idle.max_cstate=1 or intel_idle.max_cstate=2 is a valid workaround for my machine. 

Disabling cstate>2 by writing 1 to /sys/devices/system/cpu/cpu[0-x]/cpuidle/stateN/disable still results in freezes.....

These freezes are in my opinion one of the biggest pain-in-the-ass(es) of a lot of Linux riders. I suggest a little organization to conquer this beast.

Please let me know in case I can do more!!!
Comment 966 Newk 2019-01-12 14:16:09 UTC
Hi, I have followed this thread since I bought my GF a Gigabyte BXBT J1900.
It was running Ubuntu 16.04, and with the scripts given here it ran stably for weeks (maybe even longer). I did some kernel upgrades from the standard one, but I can't remember which version anymore.

But after I had to upgrade it to 18.04 it fell back into freezing VERY often.
I put the script into the bootup... but even after starting the scripts manually and then testing with cstateInfo.sh, the states still remain enabled... no matter which script.
Did something change in 18.04 that renders the scripts not working anymore?
I find it difficult to understand what the scripts do in /sys/devices/system/cpu/..
Because in /sys/devices/system/cpu/cpu0/cpuidle/ (for instance) there is state1 to state5.

Please could someone elaborate on what is up and how to fix it?
My GF is starting to regret using Linux since she lost some progress because the system is not reliable.
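
On what the numbered directories mean: each stateN directory under cpuidle carries a name file holding the C-state name and a disable file holding 0 or 1. The index-to-name mapping is not guaranteed to be the same on every kernel, so a listing like the sketch below shows what each index actually is on a given machine. The function name and the optional root argument are illustrative choices.

```shell
#!/bin/sh
# Print every cpuidle state with its C-state name and current disable flag.
# $1: cpu sysfs root, defaults to /sys/devices/system/cpu
#     (parameterised only so the function can be tried on a copy)
list_cpuidle_states() {
    root=${1:-/sys/devices/system/cpu}
    for dir in "$root"/cpu[0-9]*/cpuidle/state[0-9]*; do
        [ -d "$dir" ] || continue
        cpu=${dir#"$root"/}     # e.g. cpu0/cpuidle/state2
        cpu=${cpu%%/*}          # e.g. cpu0
        printf '%s %s name=%s disable=%s\n' \
            "$cpu" "${dir##*/}" "$(cat "$dir/name")" "$(cat "$dir/disable")"
    done
}

# Example: list_cpuidle_states
```

Reading these files does not need root; only writing to disable does.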
Comment 967 Newk 2019-01-12 14:24:08 UTC
ah damn no edit... i meant state0 to state5
Comment 968 Burg 2019-01-12 14:33:32 UTC
It will not be fixed. I lost 4 machines. Hoping for 3 years there would be
a solution.

There is not. Only cstate=1 will keep it running, 24/7.

I dumped all these Intel machines. Z3735.

Fault? Intel. They hired some cheapos in China to do the software of this
processor. "It worked with Wintendo!".

I have been subscribed to this thread since 2015. Never opened up my mouth.
This is my first and last post:

Write those machines off. It will never be fixed. Time to move on.

Burg
Comment 969 Paul Mansfield 2019-01-12 14:44:41 UTC
I'm still using my Gigabyte GA-J1900D3V as a firewall (built-in twin NICs plus a PCI 2-port NIC). It's running openSUSE Leap 15. Current kernel is stock 4.12.14-lp150.12.28-default

It's fairly stable, uptime stretches to weeks, was OK without the cstates hack, is much better with it. I'm still convinced that this stability depends on not using video at all.


When Gigabyte produce a similar board with a Gemini Lake SoC on it, I'll upgrade to that, and I can then dump the sorry-a** Baytrail chipset for good... but there's a shortage of Gemini Lake chips, and not likely to be many around till March/April 2019!!
Comment 970 Newk 2019-01-12 14:46:35 UTC
I am in no financial position to write this machine off.. it was an investment for us... I do regret it, but I have to live with it.. so the script has been a moneysaver (lifesaver) up until now.

I managed to replace the functionality of the c6off+c7on.sh script by hand by editing:
/sys/devices/system/cpu/cpu0/cpuidle/state2/disable
/sys/devices/system/cpu/cpu0/cpuidle/state3/disable
/sys/devices/system/cpu/cpu1/cpuidle/state2/disable
/sys/devices/system/cpu/cpu1/cpuidle/state3/disable
/sys/devices/system/cpu/cpu2/cpuidle/state2/disable
/sys/devices/system/cpu/cpu2/cpuidle/state3/disable
/sys/devices/system/cpu/cpu3/cpuidle/state2/disable
/sys/devices/system/cpu/cpu3/cpuidle/state3/disable

changing the value in those files from 0 to 1

Quite tedious to do so by hand every time this machine has to be rebooted..
so please, someone, help me update the script
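
A minimal sketch of a loop that does the same edits for every CPU in one go is shown here. The function name is made up for illustration, and the state indices 2 and 3 are simply the ones listed above; which C-states they correspond to depends on the kernel, so check the name file in each state directory first.

```shell
#!/bin/sh
# Write VALUE (1 = disable, 0 = enable) to the disable file of the given
# state indices on every CPU.
# $1: cpu sysfs root (normally /sys/devices/system/cpu)
# $2: value to write; remaining arguments: state indices
set_cpuidle_disable() {
    root=$1
    value=$2
    shift 2
    for idx in "$@"; do
        for f in "$root"/cpu[0-9]*/cpuidle/state"$idx"/disable; do
            [ -e "$f" ] || continue
            echo "$value" > "$f"
        done
    done
}

# Example (as root), same effect as the manual edits above:
#   set_cpuidle_disable /sys/devices/system/cpu 1 2 3
```

The values are reset at every boot, so this still has to be run from a boot-time script or service.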
Comment 971 Burg 2019-01-12 15:00:43 UTC
In my case video had nothing to do with it.
I got my first z3735 in 2015 and configured it (removed android and wintendo
first) with an ubuntu server, 16.04.
No problems. Installed my sshd, configured my network and shipped the
wintel box 9000 km from home.
And then shit came. There was no X involved at all. Headless server.

So I can say: without X it also crashed every day. Until you do your
cstate=1. And then it runs until there is a power cut, there, 9000 km from
home.

Burg
Comment 972 merlino37 2019-01-12 16:05:59 UTC
...
"add the statement intel_idle.max_cstate=1 in the grub configuration file:
STEPS

 1   sudo nano /etc/default/grub
 2   There is a line in that: GRUB_CMDLINE_LINUX_DEFAULT="quiet splash" (like this), replace with: GRUB_CMDLINE_LINUX_DEFAULT="quiet splash intel_idle.max_cstate=1"
 3   Save it (CTRL+O)
 4   sudo update-grub
 5   sudo reboot

source:https://askubuntu.com/questions/761706/ubuntu-15-10-and-16-04-keep-freezing-randomly
...this has worked for me for months...
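
For those who prefer not to edit the file by hand, steps 1 to 3 can be sketched as a small shell function. This is only an illustration under the assumption that the GRUB_CMDLINE_LINUX_DEFAULT line uses double quotes, as in the example above; the function name is made up.

```shell
#!/bin/sh
# Append a kernel parameter to GRUB_CMDLINE_LINUX_DEFAULT unless it is
# already there. Keeps a backup copy next to the file.
# Assumes the value is wrapped in double quotes, as in the steps above.
# $1: grub defaults file (normally /etc/default/grub)
# $2: parameter, e.g. intel_idle.max_cstate=1
add_kernel_param() {
    file=$1
    param=$2
    if grep '^GRUB_CMDLINE_LINUX_DEFAULT=' "$file" | grep -qF -- "$param"; then
        return 0    # already present, nothing to do
    fi
    cp "$file" "$file.bak"
    sed "s|^GRUB_CMDLINE_LINUX_DEFAULT=\"\(.*\)\"|GRUB_CMDLINE_LINUX_DEFAULT=\"\1 $param\"|" \
        "$file.bak" > "$file"
}

# Example (as root), followed by steps 4 and 5:
#   add_kernel_param /etc/default/grub intel_idle.max_cstate=1
#   update-grub
#   reboot
```

Running it twice leaves the file unchanged the second time, so it is safe to call from a provisioning script.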
Comment 973 harryharryharry 2019-01-12 19:04:25 UTC
Please guys, stop stating the obvious by telling that intel_idle.max_cstate=1 is needed to prevent freezing, it is literally in the freaking title of this bugtracker.

That doesn't "solve" the issue, it's like saying to the doctor "My leg is broken in 3 places" and the doctor telling you "well don't go to those places then".

Furthermore I agree wholeheartedly with Burg: write these machines off. It's a sh!t state of affairs, but Intel has done so years ago, otherwise there would have long been a proper fix.
Comment 974 tnu63035 2019-01-12 19:22:57 UTC
Sue these fuckers (Intel), it's the only way they'll learn.
Comment 975 Rick Lee 2019-01-13 01:09:39 UTC
@Comment #970: Indeed one by one could be tedious:

$ cd /sys/devices/system/cpu/

/sys/devices/system/cpu$ grep -H . */cpuidle/*/disable*

cpu0/cpuidle/state0/disable:0
cpu0/cpuidle/state1/disable:0
cpu0/cpuidle/state2/disable:0
cpu0/cpuidle/state3/disable:0
cpu0/cpuidle/state4/disable:0
cpu0/cpuidle/state5/disable:0
cpu0/cpuidle/state6/disable:0
cpu0/cpuidle/state7/disable:0
cpu0/cpuidle/state8/disable:0
cpu1/cpuidle/state0/disable:0
cpu1/cpuidle/state1/disable:0
cpu1/cpuidle/state2/disable:0
cpu1/cpuidle/state3/disable:0
cpu1/cpuidle/state4/disable:0
cpu1/cpuidle/state5/disable:0
cpu1/cpuidle/state6/disable:0
cpu1/cpuidle/state7/disable:0
cpu1/cpuidle/state8/disable:0
cpu2/cpuidle/state0/disable:0
cpu2/cpuidle/state1/disable:0
cpu2/cpuidle/state2/disable:0
cpu2/cpuidle/state3/disable:0
cpu2/cpuidle/state4/disable:0
cpu2/cpuidle/state5/disable:0
cpu2/cpuidle/state6/disable:0
cpu2/cpuidle/state7/disable:0
cpu2/cpuidle/state8/disable:0
cpu3/cpuidle/state0/disable:0
cpu3/cpuidle/state1/disable:0
cpu3/cpuidle/state2/disable:0
cpu3/cpuidle/state3/disable:0
cpu3/cpuidle/state4/disable:0
cpu3/cpuidle/state5/disable:0
cpu3/cpuidle/state6/disable:0
cpu3/cpuidle/state7/disable:0
cpu3/cpuidle/state8/disable:0
cpu4/cpuidle/state0/disable:0
cpu4/cpuidle/state1/disable:0
cpu4/cpuidle/state2/disable:0
cpu4/cpuidle/state3/disable:0
cpu4/cpuidle/state4/disable:0
cpu4/cpuidle/state5/disable:0
cpu4/cpuidle/state6/disable:0
cpu4/cpuidle/state7/disable:0
cpu4/cpuidle/state8/disable:0
cpu5/cpuidle/state0/disable:0
cpu5/cpuidle/state1/disable:0
cpu5/cpuidle/state2/disable:0
cpu5/cpuidle/state3/disable:0
cpu5/cpuidle/state4/disable:0
cpu5/cpuidle/state5/disable:0
cpu5/cpuidle/state6/disable:0
cpu5/cpuidle/state7/disable:0
cpu5/cpuidle/state8/disable:0
cpu6/cpuidle/state0/disable:0
cpu6/cpuidle/state1/disable:0
cpu6/cpuidle/state2/disable:0
cpu6/cpuidle/state3/disable:0
cpu6/cpuidle/state4/disable:0
cpu6/cpuidle/state5/disable:0
cpu6/cpuidle/state6/disable:0
cpu6/cpuidle/state7/disable:0
cpu6/cpuidle/state8/disable:0
cpu7/cpuidle/state0/disable:0
cpu7/cpuidle/state1/disable:0
cpu7/cpuidle/state2/disable:0
cpu7/cpuidle/state3/disable:0
cpu7/cpuidle/state4/disable:0
cpu7/cpuidle/state5/disable:0
cpu7/cpuidle/state6/disable:0
cpu7/cpuidle/state7/disable:0
cpu7/cpuidle/state8/disable:0


Short hand would be to use:

$ sudo -i
# echo 1 | tee /sys/devices/system/cpu/cpu*/cpuidle/state2/disable > /dev/null
# exit
$

To disable all C-States (frankly don't know why yet but am checking this out in more detail next couple of days) use:

$ sudo -i
# echo 1 | tee /sys/devices/system/cpu/cpu*/cpuidle/state*/disable > /dev/null
# exit
$

@Comment # 966:

Not so sure it is Ubuntu 18.04 itself; more likely the Linux kernel changed around April 2018 (18th year, 4th month = 18.04). I've noticed weird behavior on a non-Baytrail chipset: when the system is idle at only 3% CPU load, frequencies jump into the turbo range (3000 MHz) on an i7-6700HQ processor (Skylake). When watching a video at about 18% CPU load, frequencies drop to a "normal" (non-turbo) level of about 1500 MHz.

@Slew of comments today:

Corporations with million dollar IT budgets are laughing at the rest of us in the dark because Intel and Canonical brief them on how to handle these situations. Those of us for whom $1,000 is a lot of money to spend on a rig are cannon fodder. Intel, by ignoring the 1,000 pleas on this thread over the years, is laughing too. Unfortunately it's the nature of the beast, and good corporate governance means nothing these days.
Comment 976 Rick Lee 2019-01-13 01:13:31 UTC
(In reply to Rick Lee from comment #975)
Comment 977 julio.borreguero@gmail.com 2019-01-13 14:31:29 UTC
Why don't you just install 4.1.15 kernel?
Then you won't even need the cstate fix.
Any kernel below 4.1.16 (I think 4.1.16 also works fine) does the trick.
I am running xubuntu 18.04 as well with that kernel and all is good.
# uname -a
# Linux shiva 4.1.15-040115-generic #201512150136 SMP Tue Dec 15 06:38:38 UTC 2015 x86_64 x86_64 x86_64 GNU/Linux

Get it from here (for example):
https://ubuntu.pkgs.org/18.04/ubuntu-updates-main-amd64/linux-image-unsigned-4.15.0-43-generic_4.15.0-43.46_amd64.deb.html

good luck

(In reply to Newk from comment #966)
> Hi, i have followed this thread since i bought my GF a Gigabyte BXBT J1900.
> It was running ubuntu 16.04 and with the given scripts here it ran stable
> enough for weeks (maybe even longer). I did some Kernel upgrades from the
> standard one but i can't remember wich version anymore
> 
> But After i had to upgrade it to 18.04 it did fall back into freezing VERY
> often.
> I did put the script into the bootup... but even after starting the scripts
> manually and then testing it with cstateInfo.sh they still remain enabled...
> no matter which script.
> Did something change in 18.04 that renders the scripts not working anymore?
> I find it difficult to understand what the scripts do in
> /sys/devices/system/cpu/..
> Because in /sys/devices/system/cpu/cpu0/cpuidle/ (for instance) there is
> state1 to state5.
> 
> Please could someone elaborate on what is up and how to fix it,
> My GF is starting to regret using Linux when she lost some progress because
> the system not being secure.
Comment 978 Martin 2019-01-13 15:15:23 UTC
On Sunday, January 13, 2019 3:31:29 PM CET bugzilla-daemon@bugzilla.kernel.org 
wrote:
> (julio.borreguero@gmail.com) --- Why don't you just install 4.1.15 kernel?
> Then you won't even need the cstate fix.

Technically that's not true. It's just a lot less likely to happen on those 
kernels. The underlying hw fault is still present.

M.
Comment 979 JanVlietland 2019-01-13 15:39:25 UTC
@Len Brown, is somebody at Intel still working on solving this bug using the data currently available in this thread, or do we need more data?

@all, I have gone through all the posts this weekend, including most of the links to other sources (reads like a thriller :-). Based on that I would like to share the following thought:

In this thread we are currently looking at the problem from the perspective of one microprocessor (Baytrail). However, I have exactly the same issue with my Kaby Lake processor. No, I am NOT talking about those soft freezes, I am talking about a complete death of the laptop, without any log whatsoever. My issue occurs in combination with the i915 module. No i915 module enabled, no freeze. My freezes occur within minutes.

So can we set up tests from a different perspective: predefined tests on all microprocessors that are affected? I'm thinking of cutting the kernel down to a bare minimum and using it to reproduce what happens during these predefined tests. A more directed approach.

For instance, on my NP900X4D (Ivy Bridge) I never had these freezes. So why do we have them on some processors and not on others? Yes, power saving, but why?

Hopefully we can deduce the root-cause in this way.

I am happy to open a new thread for this, but I first want to have some confirmation and support on this.
Comment 980 vova7890 2019-01-13 18:05:25 UTC
Guys, you are not alone: this bug also affects the Core i7 7500U, which I use in my home server with 24/7 uptime. Without max_cstates=1 it freezes after 10-15 days (I'm happy that I have a hardware watchdog in this case). Before that CPU I used a Cherry Trail processor, which also had this bug; the 7500U is my second processor with it :(
Comment 981 luke 2019-01-13 20:07:18 UTC
vova and Jan,
This thread is specific to Baytrail CPUs. Stop posting your unrelated i7 problems here. File a new bug report or search for one that is appropriate. 

Please show some respect for others and stop SPAMing this thread with your unrelated issues.
Comment 982 Nikolay Lavrinenko 2019-01-16 09:42:03 UTC
Hi guys.
I am from Russia and use Google Translate to write this message. I use an Asus X205TA laptop (z3735f) with Gentoo Linux installed. I used to run a 64-bit kernel and userland, and I also had freezes. Now I have switched to the i686 architecture and, what do you know, it works. I have already watched 3 films online and there haven't been any freezes. Previously the freezes happened while watching online video. I have been following this forum for a long time, waiting for someone to solve the problem. On one of the Russian forums I found a quote from some kind of Intel Baytrail manual saying that you need to use a 32-bit OS. I can confirm that in my case it works. My system is Gentoo Linux i686. There are no freezes. :)

I also wanted to say that the patches that were proposed here lead to unstable work of some programs, in particular, the gnuradio companion.
Comment 983 infosecislame 2019-01-17 21:23:49 UTC
Created attachment 280575 [details]
Turn off C6N and C6S states on baytrail N3540, python script

Here is a new script written in python to disable the C6N and C6S states on baytrail N3540.

Got the idea from the old bash script by Wolfgang M. Reimer, which worked great on my Asus laptop, but that one no longer worked for me.

Works fine on Linux Mint 19.1.

Hope it helps.
Comment 984 D. Hugh Redelmeier 2019-01-17 22:59:49 UTC
@Nicolay Lavrinenko (comment 982):

Very interesting.

This bz is about Baytrail systems, so what I'm going to say is only about them.

The z3735f is a Baytrail system.  One odd thing about it is that (as far as I know) all z3735f systems come with 32-bit UEFI firmware.  Apparently Intel never released firmware for running this SoC in 64-bit mode.

I had assumed that all reports were about 64-bit Linux on systems with 64-bit UEFI firmware.

Can anyone else speak up if they are running a 64-bit Linux on a 32-bit UEFI firmware?  If you are, can you too try running a 32-bit Linux to see if things work better?
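
To make such reports easier to compare, here is a small sketch for checking the firmware word size from a running system. It relies on the kernel exposing /sys/firmware/efi/fw_platform_size, which is only present when booted via UEFI on a reasonably recent kernel; the function name and the root argument are illustrative.

```shell
#!/bin/sh
# Report the UEFI firmware word size next to the kernel architecture.
# $1: sysfs root, defaults to /sys (parameterised only to ease testing)
report_firmware_bitness() {
    sysroot=${1:-/sys}
    f="$sysroot/firmware/efi/fw_platform_size"
    if [ -r "$f" ]; then
        fw="$(cat "$f")-bit UEFI"
    else
        fw="no UEFI size reported (legacy BIOS boot or older kernel)"
    fi
    printf 'firmware: %s, kernel: %s\n' "$fw" "$(uname -m)"
}

# Example: report_firmware_bitness
```

A line showing 32-bit UEFI together with an x86_64 kernel would be the mixed case asked about here.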
Comment 985 Paul Mansfield 2019-01-17 23:52:11 UTC
this thread is so long now that it's impossible to work out what the state of the problem is, and what history might be relevant. 

in reply to D. Hugh Redelmeier, and repeating previous comments...

As well as the Gigabyte J1900 board, I also have a Toshiba tablet with keyboard dock.

I can say that in my experience when I first had the Toshiba Click Mini that it was definitely more stable when running 32 bit Linux than 64, that was before there were patched kernels at all. I tried various storage, only with linux running from the eMMC was it usable. The SDIO wifi made it very unstable too. Everything pointed to the storage hub in the Baytrail SoC causing instability when used:
https://cdn.arstechnica.net/wp-content/uploads/2013/09/Screen-Shot-2013-09-13-at-6.32.07-PM.jpg


however, I have two friends who bought similar machines, much newer, who didn't have anything like the stability problems which I had! However, /proc/cpuinfo showed the same stepping level, and there's no microcode for them, so we couldn't explain the difference. Mine would lock up very quickly when using sdio wifi, theirs lasted a lot longer.
It was only when John Brodie published patch sets that I got a kernel that made my Tosh usable at all. Sadly, I mostly had to revert back to Windows to actually use it, and overall it was a waste of money.

I don't have any evidence but my gut feel is that some systems or Baytrail SoCs are more unstable than others. It could be anything, a race condition caused by poor timing in the chip which is worse on some devices than others, slightly poorer power conditioning on some boards which affects the cstate transitions, who knows, but for sure, if Intel know they haven't said! 

In theory there's a regular old fashioned UART on the chip, so maybe someone might be able to get some dying gasp debug out before the chip locks solid?
Comment 986 Paul Mansfield 2019-01-17 23:56:52 UTC
p.s. I also accidentally destroyed the speakers in it when trying to get audio to work. It seems that the amplifier is probably a class D, and when the audio subsystem locked up it blasted the full voltage continuously through the speakers and melted them. When I replace them I'll be fitting capacitors to decouple them and hopefully prevent a recurrence. I know this is off topic but I felt important to warn people.
Comment 987 Nikolay Lavrinenko 2019-01-19 09:59:49 UTC
(In reply to D. Hugh Redelmeier from comment #984)
> @Nicolay Lavrinenko (comment 982):
> 
> Very interesting.
> 
> This bz is about Baytrail systems, so what I'm going to say is only about
> them.
> 
> The z3735f is a Baytrail system.  One odd thing about it is that (as far as
> I know) all z3735f systems come with 32-bit UEFI firmware.  Apparently Intel
> never released firmware for running this SoC in 64-bit mode.
> 
> I had assumed that all reports were about 64-bit Linux on systems with
> 64-bit UEFI firmware.
> 
> Can anyone else speak up if they are running a 64-bit Linux on a 32-bit UEFI
> firmware?  If you are, can you too try running a 32-bit Linux to see if
> things work better?

The problem is that binary distributions support UEFI only on x86_64, so an additional bootia32 has to be added. On a system built from source, you can build grub with the grub_platforms="efi-32" flag and install it with the --target=i386-efi flag. The loader will then load anything.
Comment 989 Newk 2019-02-15 14:37:07 UTC
(In reply to infosecislame from comment #983)
> Created attachment 280575 [details]
> Turn off C6N and C6S states on baytrail N3540, python script
> 
> Here is a new script written in python to disable the C6N and C6S states on
> baytrail N3540.
> 
> Got the idea from the old bash script by Wolfgang M. Reimer, which worked
> great on my Asus laptop, but that one no longer worked for me.
> 
> Works fine on Linux Mint 19.1.
> 
> Hope it helps.

Thank you infosecislame!
That script worked well for this J1900.

@All
This machine (gigabyte-bxbt-1900) is also running i686 / 32bit OS (Lubuntu 18.04, kernel 4.15.0-45)
and still has this problem... so it's probably less likely down to that.
Not using wifi on this box.
Comment 990 Newk 2019-02-15 16:47:26 UTC
(In reply to Nikolay Lavrinenko from comment #987)
> (In reply to D. Hugh Redelmeier from comment #984)
> > @Nicolay Lavrinenko (comment 982):
> > 
> > Very interesting.
> > 
> > This bz is about Baytrail systems, so what I'm going to say is only about
> > them.
> > 
> > The z3735f is a Baytrail system.  One odd thing about it is that (as far as
> > I know) all z3735f systems come with 32-bit UEFI firmware.  Apparently
> Intel
> > never released firmware for running this SoC in 64-bit mode.
> > 
> > I had assumed that all reports were about 64-bit Linux on systems with
> > 64-bit UEFI firmware.
> > 
> > Can anyone else speak up if they are running a 64-bit Linux on a 32-bit
> UEFI
> > firmware?  If you are, can you too try running a 32-bit Linux to see if
> > things work better?
> 
> The problem is that the binary distributions support uefi only in x86_64 and
> need to insert additional bootia32. In the system from sources, you can
> build grub with grub_platforms="efi-32" flag and set it with
> --target=i386-efi flag. The loader will load anything.

Interesting!
Do I understand correctly that I might need to install a different UEFI to solve the issue?
Comment 991 D. Hugh Redelmeier 2019-02-15 18:06:47 UTC
(In reply to Newk from comment #990)

> Interesting!
> Do I understand correctly that I might need to install a different UEFI to
> solve the issue?

What do you mean by "install a UEFI"?  UEFI is an interface standard.  It's implemented by each machine's firmware (often incorrectly called "BIOS").  You generally don't get to choose between 32- and 64-bit UEFI firmware; only one choice is made, and that's by the manufacturer.

Your bootable medium must match the firmware.  It is possible for a bootable medium to support both (two versions of shims and grub and so on).  The Linux kernel itself optionally supports running in 64-bit mode on top of a 32-bit UEFI.

Originally, most distros did not support 32-bit UEFI.  Almost all machines that forced UEFI (i.e. did not allow old-style MBR booting) were 64-bit processors, so why bother?  Well, the answer was that Intel crippled some 64-bit processors by only supplying 32-bit UEFI to the OEMs.  (Why?  Surely to segment the market.)  The z3735f was such a system.

Exceptions: the original Intel Macs used CoreDuo processors which were 32-bit and used (somewhat defective) UEFI firmware.  The failed Intel Quark SoC is another 32-bit X86 processor (System on Chip).
Comment 992 stOneskull 2019-05-05 15:12:49 UTC
after a couple of years of dealing with this stupid bug, i thought i'd add some observations.

it seems that different distros behave differently. i think debian and ubuntu are the worst. basically killed my battery using the cstate fix for a year. i found using kernel 4.8.17 was nice in linux mint and didn't need the cstate fix. i have recently found the solus distro is pretty nice. manjaro was better than debian distros but not as good as solus. i tried kubuntu recently and that was horrible. i really love kubuntu though and want to keep using it, so have put in the cstate fix again in its grub.

the only thing is that i'm using refind loader now to help with booting multiple distros (mostly because of solus as it can't use grub) and now i need to find how to add the cstate fix into kubuntu entry in refind. i can boot the kubuntu grub efi file from refind and it works but not sure how to go straight to booting to kubuntu from refind with the cstate.

anyway, enough rambling from me, just find it interesting how different distros can behave differently with this stupid baytrail bug.
Comment 993 Rick Lee 2019-05-05 17:38:29 UTC
To Comment #992.

You can call Solus from Grub:
https://forum.manjaro.org/t/adding-solus-and-other-punk-linux-boot-sequence-in-grub-uefi-only/21405.
Then it will be on menu with Windows, Ubuntu, Kubuntu, etc.
Comment 994 stOneskull 2019-05-05 18:03:31 UTC
yeah i put that in parentheses because it's pretty much irrelevant why i use refind. i know the page you linked to and stuff. 

anyway, i found the easy way to make the cstate persistent in refind is to have a refind_linux.conf file in /boot and that it can be made by running sudo mkrlconf and it should copy the grub parameters. mkrlconf comes with the refind package.

kubuntu rocks (besides kde not allowing dolphin to be opened as root to get to the EFI directory.. krusader helped me there.. [maybe someone would reply about sudo bash or something but whatever] i'm happy)
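
Whichever boot loader carries the parameter, it can be worth confirming after a reboot that the kernel actually received it. A minimal sketch, with a made-up function name, that looks for the parameter as a whole word on the kernel command line:

```shell
#!/bin/sh
# Succeed if the given parameter appears as a whole word on a kernel
# command line.
# $1: parameter, e.g. intel_idle.max_cstate=1
# $2: file holding the command line, defaults to /proc/cmdline
cmdline_has_param() {
    param=$1
    file=${2:-/proc/cmdline}
    for word in $(cat "$file"); do
        [ "$word" = "$param" ] && return 0
    done
    return 1
}

# Example:
#   cmdline_has_param intel_idle.max_cstate=1 && echo "workaround active"
```

When the intel_idle driver is in use, the value it picked up should also be readable from /sys/module/intel_idle/parameters/max_cstate.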
Comment 995 hfern 2019-05-20 07:51:51 UTC
My Gentoo / AsRock Q1900-ITX based system with J1900 processor also always used to hang every few days. The only way to keep it under control was to keep a little load on it. But it is now solid as a rock. The change that I did was that I disabled the Intel P-State CPU frequency scaling driver (CONFIG_X86_INTEL_PSTATE). I could never upgrade from kernel version 4.1.15-r1 because it would cause a quick hang. Since disabling the Intel P-State control driver I have been able to upgrade to kernel version 5.1.3 and remove boot parameter intel_idle.max_cstate=1. For those interested, my kernel config can be found here: http://dpaste.com/0ZDDRB2 .
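To check whether a given kernel was built the same way, something like the sketch below may help. The helper name config_symbol_state is hypothetical, and the /boot/config-<release> location is an assumption that holds on many distros but not all (Gentoo users would point it at their own .config).

```shell
#!/bin/sh
# Hypothetical helper: report how a symbol is set in a kernel config file.
# Prints the value (y, m, ...) or "not set" when no "SYMBOL=value" line exists.
# $1 = config file, $2 = symbol name.
config_symbol_state() {
	value=$(sed -n "s/^$2=//p" "$1")
	echo "${value:-not set}"
}

# Assumed config location; many distros install it here, others do not.
cfg="/boot/config-$(uname -r)"
if [ -r "$cfg" ]; then
	config_symbol_state "$cfg" CONFIG_X86_INTEL_PSTATE
fi

# The cpufreq driver actually in use, when the kernel exposes it.
drv=/sys/devices/system/cpu/cpu0/cpufreq/scaling_driver
if [ -r "$drv" ]; then
	cat "$drv"
fi
```

With the P-State driver compiled out, the first check should print "not set" and the second should show some other driver name.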
Comment 996 un4tt3nd3d 2019-05-21 21:52:03 UTC
(In reply to hfern from comment #995)
> My Gentoo / AsRock Q1900-ITX based system with J1900 processor also always
> used to hang every few days. The only way to keep it under control was to
> keep a little load on it. But it is now solid as a rock. The change that I
> did was that I disabled the Intel P-State CPU frequency scaling driver
> (CONFIG_X86_INTEL_PSTATE). I could never upgrade from kernel version
> 4.1.15-r1 because it would cause a quick hang. Since disabling the Intel
> P-State control driver I have been able to upgrade to kernel version 5.1.3
> and remove boot parameter intel_idle.max_cstate=1. For those interested, my
> kernel config can be found here: http://dpaste.com/0ZDDRB2 .

That workaround was mentioned in one of the very first comments here, years ago. I think the reason we're just chatting and repeating ourselves on here is that devs have pretty much given up on this bug, at least until they can get something from Intel or Microsoft, which seems unlikely at this point. For what it's worth, I'm having this issue on a Cherry Trail Atom x5-Z8350. The system locks up after just a couple minutes of use. I haven't tried the cstate workaround because I don't want to ruin a brand new battery, and I'm not sure if the other 'fixes' could also affect the hardware or not. I'd rather play it safe and run Windows for now.
Comment 997 stOneskull 2019-05-21 23:46:28 UTC
Created attachment 282903 [details]
attachment-28600-0.html

running Windows was probably always the best answer..

Haiku seems to be ok but then that has a bunch of its own issues being in beta




Comment 998 w2q 2019-05-23 21:37:59 UTC
There are new microcodes available, perhaps this repairs something:


-- Updates upon 20190312 release --
Processor             Identifier     Version       Products
Model        Stepping F-MO-S/PI      Old->New
---- new platforms ----------------------------------------
VLV          C0       6-37-8/02           00000838 Atom Z series
VLV          C0       6-37-8/0C           00000838 Celeron N2xxx, Pentium N35xx
VLV          D0       6-37-9/0F           0000090c Atom E38xx
CHV          C0       6-4c-3/01           00000368 Atom X series
CHV          D0       6-4c-4/01           00000411 Atom X series
CLX-SP       B1       6-55-7/bf           05000021 Xeon Scalable Gen2
---- updated platforms ------------------------------------
SNB          D2/G1/Q0 6-2a-7/12 0000002e->0000002f Core Gen2
IVB          E1/L1    6-3a-9/12 00000020->00000021 Core Gen3
HSW          C0       6-3c-3/32 00000025->00000027 Core Gen4
BDW-U/Y      E0/F0    6-3d-4/c0 0000002b->0000002d Core Gen5
IVB-E/EP     C1/M1/S1 6-3e-4/ed 0000042d->0000042e Core Gen3 X Series; Xeon E5 v2
IVB-EX       D1       6-3e-7/ed 00000714->00000715 Xeon E7 v2
HSX-E/EP     Cx/M1    6-3f-2/6f 00000041->00000043 Core Gen4 X series; Xeon E5 v3
HSX-EX       E0       6-3f-4/80 00000013->00000014 Xeon E7 v3
HSW-U        C0/D0    6-45-1/72 00000024->00000025 Core Gen4
HSW-H        C0       6-46-1/32 0000001a->0000001b Core Gen4
BDW-H/E3     E0/G0    6-47-1/22 0000001e->00000020 Core Gen5
SKL-U/Y      D0/K1    6-4e-3/c0 000000c6->000000cc Core Gen6
BDX-ML       B0/M0/R0 6-4f-1/ef 0b00002e->0b000036 Xeon E5/E7 v4; Core i7-69xx/68xx
SKX-SP       H0/M0/U0 6-55-4/b7 0200005a->0000005e Xeon Scalable
SKX-D        M1       6-55-4/b7 0200005a->0000005e Xeon D-21xx
BDX-DE       V1       6-56-2/10 00000019->0000001a Xeon D-1520/40
BDX-DE       V2/3     6-56-3/10 07000016->07000017 Xeon D-1518/19/21/27/28/31/33/37/41/48, Pentium D1507/08/09/17/19
BDX-DE       Y0       6-56-4/10 0f000014->0f000015 Xeon D-1557/59/67/71/77/81/87
BDX-NS       A0       6-56-5/10 0e00000c->0e00000d Xeon D-1513N/23/33/43/53
APL          D0       6-5c-9/03 00000036->00000038 Pentium N/J4xxx, Celeron N/J3xxx, Atom x5/7-E39xx
APL          E0       6-5c-a/03 0000000c->00000016 Atom x5-E39xx
SKL-H/S      R0/N0    6-5e-3/36 000000c6->000000cc Core Gen6; Xeon E3 v5
DNV          B0       6-5f-1/01 00000024->0000002e Atom C Series
GLK          B0       6-7a-1/01 0000002c->0000002e Pentium Silver N/J5xxx, Celeron N/J4xxx
AML-Y22      H0       6-8e-9/10 0000009e->000000b4 Core Gen8 Mobile
KBL-U/Y      H0       6-8e-9/c0 0000009a->000000b4 Core Gen7 Mobile
CFL-U43e     D0       6-8e-a/c0 0000009e->000000b4 Core Gen8 Mobile
WHL-U        W0       6-8e-b/d0 000000a4->000000b8 Core Gen8 Mobile
WHL-U        V0       6-8e-d/94 000000b2->000000b8 Core Gen8 Mobile
KBL-G/H/S/E3 B0       6-9e-9/2a 0000009a->000000b4 Core Gen7; Xeon E3 v6
CFL-H/S/E3   U0       6-9e-a/22 000000aa->000000b4 Core Gen8 Desktop, Mobile, Xeon E
CFL-S        B0       6-9e-b/02 000000aa->000000b4 Core Gen8
CFL-H/S      P0       6-9e-c/22 000000a2->000000ae Core Gen9
CFL-H        R0       6-9e-d/22 000000b0->000000b8 Core Gen9 Mobile




If you type cat /proc/cpuinfo, you'll get something like

>$   cat /proc/cpuinfo 
> processor     : 0
> vendor_id     : GenuineIntel
> cpu family    : 6
> model         : 55
> model name    : Intel(R) Pentium(R) CPU  N3540  @ 2.16GHz
> stepping      : 8
> microcode     : 0x831

Now convert the model number to hexadecimal: 55 -> 37h, so F-MO-S is 6-37-8. For this CPU, a new microcode 838h is available, according to the above list.
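The conversion can be sketched in a few lines of shell. The helper names fms_signature and microcode_is_older are made up for illustration; the numbers are the ones from the cpuinfo output and the list above.

```shell
#!/bin/sh
# Hypothetical helper: format decimal family/model/stepping as F-MO-S.
# $1 = cpu family, $2 = model, $3 = stepping.
fms_signature() {
	printf '%x-%02x-%x\n' "$1" "$2" "$3"
}

# Hypothetical helper: succeed when microcode revision $1 is lower than $2.
# Both arguments are hex strings with a 0x prefix.
microcode_is_older() {
	[ "$(($1))" -lt "$(($2))" ]
}

# Values from the N3540 output above: family 6, model 55, stepping 8.
fms_signature 6 55 8

# Running revision 0x831 versus the listed 0x838.
if microcode_is_older 0x831 0x838; then
	echo "update available"
fi
```

This prints 6-37-8 followed by "update available".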

https://www.archlinux.org/packages/extra/any/intel-ucode/
I guess, manjaro will follow with the next update.

I don't know yet if it will affect this bug, but hope dies last.
Comment 999 hfern 2019-05-24 12:12:40 UTC
(In reply to un4tt3nd3d from comment #996)
> (In reply to hfern from comment #995)
> > My Gentoo / AsRock Q1900-ITX based system with J1900 processor also always
> > used to hang every few days. The only way to keep it under control was to
> > keep a little load on it. But it is now solid as a rock. The change that I
> > did was that I disabled the Intel P-State CPU frequency scaling driver
> > (CONFIG_X86_INTEL_PSTATE).......
> 
> That workaround was mentioned in one of the very first comments here, years
> ago. I think the reason we're just chatting and repeating ourselves on here
> is that devs have pretty much given up on this bug, at least until they can
> get something from Intel or Microsoft, which seems unlikely at this point....

I disabled the Intel P-State frequency driver in the kernel, not as a boot parameter. Maybe that helped? Or maybe a firmware update like the one mentioned in comment 998? My system happens to be on linux-firmware-20190502, 6 weeks newer.
Comment 1000 w2q 2019-05-27 21:41:03 UTC
I may cite Hans de Goede from this site: 
https://www.phoronix.com/forums/forum/software/mobile-linux/1096936-intel-baytrail-cherrytrail-systems-can-now-correctly-hibernate-again-under-linux?p=1096999#post1096999

"

Actually the Intel open-source devs have been working on fixing this and a patch-series which should improve things wrt this has been queued for merging into 5.3 (it just missed the 5.2 merge window), see: 

https://cgit.freedesktop.org/drm-intel/commit/?id=a75d035fedbdecf83f86767aa2e4d05c8c4ffd95

"



Yeah, comment 1000!
Comment 1001 D. Hugh Redelmeier 2019-05-28 15:54:12 UTC
The patch says it only applies to "Valleyview" systems.  As far as I can tell, that includes all Bay Trail chips, and nothing else.  Comments about other systems should be disregarded.

https://ark.intel.com/content/www/us/en/ark/products/codename/55844/bay-trail.html

The patch is self-described as a work-around.  This is a tacit admission of a chip bug.  With no fix for over 4 years.  At least they didn't give up.
Comment 1002 Simon Rettberg 2019-06-01 16:46:17 UTC
I was suffering from this issue since 2014 but only ever learned about the bug report today. It's a (very cheap) HP Pavilion 15-p005ng (Pentium N3530) that I use occasionally. I quickly found that the issue is very easily triggered by doing any 3D stuff, Chromium with hw acceleration enabled would usually do the trick within minutes, or just running anything OpenGL. Back then, googling the issue failed to bring up anything meaningful, so I chalked it up to the Laptop's cheap hardware.

Since my 6yo started playing Minecraft occasionally about a year ago, I tested every other new kernel release, which consistently resulted in a crash after a few minutes, so I booted up Windows for her to play. (usually resulting in long update procedures since it's pretty much the only reason I boot it up at all once a month or so)

Long story short: Today I applied the patch from comment 1000 to 5.2-rc2, and for the first time, my daughter could play Minecraft on Linux for over an hour. This was never possible before. byt.test has also been running for a couple of hours now. But before calling it a day (it might still just be that the stars aligned just right today) I would ask anyone still running this hardware to try this patch and report back. Judging from this endless thread, there have been a couple of moments where people thought a fix was found, so it's probably wise to get more people to test this. Also, it would be a really strange coincidence that I would only find this bug report days after a fix was posted.
Comment 1003 Stéphane Tréboux 2019-06-16 08:06:56 UTC
I took the script from Wolfgang Reimer and updated it to work with Kubuntu 19.04.

The names of the C6/C7 power states were successfully matched using C6*-BYT and C7*-BYT in the past. For some reason the "-BYT" at the end was dropped. Look inside /sys/devices/system/cpu/cpu0/cpuidle/state2/ to see what I mean (path may differ on your computer depending on your kernel and CPU).

The fix is to remove "-BYT" in the lines 35 and 36 in the original script. The result is here:

#!/bin/sh

#title:       c6off+c7on.sh
#description: Disables all C6 and enables all C7 core states for Baytrail CPUs
#author:      Wolfgang Reimer <linuxball (at) gmail.com>
#date:        20190616
#version:     2.0    
#usage:       sudo <path>/c6off+c7on.sh
#notes:       Intended as test script to verify whether erratum VLP52 (see
#             [1]) is the root cause for kernel bug 109051 (see [2]). In order
#             for this to work you must _NOT_ use boot parameter
#             intel_idle.max_cstate=<number>.
#
# [1] http://www.intel.com/content/dam/www/public/us/en/documents/specification-updates/pentium-n3520-j2850-celeron-n2920-n2820-n2815-n2806-j1850-j1750-spec-update.pdf
# [2] https://bugzilla.kernel.org/show_bug.cgi?id=109051

# Disable ($1 == 1) or enable ($1 == 0) core state, if not yet done.
disable() {
	local action
	read disabled <disable
	test "$disabled" = $1 && return
	echo $1 >disable || return
	action=ENABLED; test "$1" = 0 || action=DISABLED
	printf "%-8s state %7s for %s.\n" $action "$name" $cpu  
}

# Iterate through each core state and for Baytrail (BYT) disable all C6
# and enable all C7 states.
cd /sys/devices/system/cpu
for cpu in cpu[0-9]*; do
	for dir in $cpu/cpuidle/state*; do
		cd "$dir"
		read name <name
		case $name in
			C6*-BYT) disable 1;;
			C7*-BYT) disable 0;;
		esac
		cd ../../..
	done
done
Comment 1004 Stéphane Tréboux 2019-06-17 00:14:46 UTC
Again... the previous comment has a typo.

I took the script from Wolfgang Reimer and updated it to work with Kubuntu 19.04.

The names of the C6/C7 power states were successfully matched using C6*-BYT and C7*-BYT in the past. For some reason the "-BYT" at the end was dropped. Look inside /sys/devices/system/cpu/cpu0/cpuidle/state2/ to see what I mean (path may differ on your computer depending on your kernel and CPU).

The fix is to remove "-BYT" in the lines 35 and 36 in the original script. This works on my setup and effectively prevents the laptop from hanging. The result is here:

#!/bin/sh

#title:       c6off+c7on.sh
#description: Disables all C6 and enables all C7 core states for Baytrail CPUs
#author:      Wolfgang Reimer <linuxball (at) gmail.com>
#date:        20190616
#version:     2.0    
#usage:       sudo <path>/c6off+c7on.sh
#notes:       Intended as test script to verify whether erratum VLP52 (see
#             [1]) is the root cause for kernel bug 109051 (see [2]). In order
#             for this to work you must _NOT_ use boot parameter
#             intel_idle.max_cstate=<number>.
#
# [1] http://www.intel.com/content/dam/www/public/us/en/documents/specification-updates/pentium-n3520-j2850-celeron-n2920-n2820-n2815-n2806-j1850-j1750-spec-update.pdf
# [2] https://bugzilla.kernel.org/show_bug.cgi?id=109051

# Disable ($1 == 1) or enable ($1 == 0) core state, if not yet done.
disable() {
	local action
	read disabled <disable
	test "$disabled" = "$1" && return
	echo "$1" >disable || return
	action=ENABLED; test "$1" = 0 || action=DISABLED
	printf "%-8s state %7s for %s.\n" "$action" "$name" "$cpu"
}

# Iterate through each core state, disable all C6 and enable all C7
# states (names are matched without the former "-BYT" suffix).
cd /sys/devices/system/cpu || exit 1
for cpu in cpu[0-9]*; do
	for dir in "$cpu"/cpuidle/state*; do
		# Skip CPUs without cpuidle states so the cd below stays balanced.
		cd "$dir" 2>/dev/null || continue
		read name <name
		case $name in
			C6*) disable 1;;
			C7*) disable 0;;
		esac
		cd ../../..
	done
done
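Before running the script, it may be worth checking what the state names look like on your kernel, since the matching depends on them. A minimal sketch; the helper name list_idle_states is made up, and the default path is the cpu0 directory mentioned above.

```shell
#!/bin/sh
# Hypothetical helper: print name and disable flag of each cpuidle state.
# $1 = cpuidle directory (defaults to the one for cpu0).
list_idle_states() {
	for d in "${1:-/sys/devices/system/cpu/cpu0/cpuidle}"/state*; do
		# Skip anything that is not a readable state directory.
		[ -r "$d/name" ] && [ -r "$d/disable" ] || continue
		read -r name <"$d/name"
		read -r flag <"$d/disable"
		printf '%s %s disable=%s\n' "${d##*/}" "$name" "$flag"
	done
}

list_idle_states
```

If the names end in "-BYT", the older C6*-BYT / C7*-BYT patterns would still match; if not, the shorter patterns are the ones to use.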
Comment 1005 mirh 2019-07-16 18:27:45 UTC
Ladies and gentlemen, this morning Linus merged a hopeful fix. 
If you can build from git, please have a test.
Comment 1006 Travis Hall 2019-07-16 19:22:17 UTC
(In reply to mirh from comment #1005)
> Ladies and gentlemen, this morning Linus merged a hopeful fix. 
> If you can build from git, please have a test.

Which commit was it a part of?
Comment 1009 jbMacAZ 2019-07-26 17:07:06 UTC
I've been running Chris Wilson's 5.3 patch (comment 1000) in 5.2 since it was available and have not had a freeze yet.  I won't claim it's fixed, but my system has been highly susceptible to (quick) freezing without some kind of work-around. In my current tests (5.2-rc2 > 5.2.3-rc1) this patch is the only freeze workaround I'm using. YMMV
Comment 1010 Simon Rettberg 2019-07-26 18:15:47 UTC
Since applying the comment 1000 patch (see comment 1002) I've been using that laptop as frequently as possible, with Chromium's hw acceleration enabled, which as stated before usually hung the system within minutes. Those freezes are now gone.

Interestingly, I still did have exactly one freeze since then, which actually happened right after bootup, while the system was sitting at the lightdm login screen. (Apart from the kernel I'm currently running Debian 10/Buster).

Is it possible that there is another condition that might occur where the CPU is put to sleep but actually should not be? OTOH as said before, this particular laptop is generally pretty bad with Linux. For example, some USB devices frequently stop working and need to be replugged. It seems to depend on the USB device how often this happens. I have a particular mouse that would stop working every minute or two, forcing me to replug, so it's basically unusable. So I wouldn't be surprised if that single freeze is entirely unrelated to the i915 issue.
Comment 1011 Prashant Poonia 2019-09-07 05:31:32 UTC
any updates with the 5.3 kernel?
I am on n3540 running linuxmint 19.2 with 5.0 kernel, waiting for stable 5.3 release so that i can upgrade and test it out. I am not running any workarounds, because luckily my system doesn't freeze much (asus x553ma), especially when not running wifi, i mostly use bluetooth tethering now. It's been 2 months since the last post here, so if anyone is running the 5.3 kernel with the patch merged, please keep updating with your findings at least once a month. Thanks.
Comment 1012 jbMacAZ 2019-09-07 17:12:21 UTC
(In reply to Prashant Poonia from comment #1011)
> any updates with the 5.3 kernel?
> I am on n3540 running linuxmint 19.2 with 5.0 kernel, waiting for stable 5.3
> release so that i can upgrade and test it out. I am not running any
> workarounds, because luckily my system doesn't freeze much (asus x553ma),
> especially when not running wifi, i mostly use bluetooth tethering now. It's
> been 2 months since the last post here, so if anyone is running the 5.3
> kernel with the patch merged, please keep updating with your findings
> at least once a month. Thanks.

I haven't had freezes with either 5.2 (patched with the 5.3 "fix") or 5.3 - starting with rc4. (older rc's were unusable for me.) I've been running these "fixed" kernels since the end of May, sometimes with Mint 19, other times with Manjaro.

I have had a few freezes when running liveUSBs (older unpatched kernels).  My liveUSB freezes only took a few minutes to a couple hours.  The cstate arg stopped the liveUSB freezing. YMMV

I also don't enable wifi on my Asus T100CHI (Z3775D) because wifi hangs are a different problem.  But it takes at least 12 hours before wifi causes problems on my system.
Comment 1013 Hal 2019-09-25 21:25:58 UTC
I haven't checked this thread in a long time, but I thought I should post my findings just in case it is of any use to anyone.
A low end computer powered by an Intel N2807, which never worked more than 30 minutes without crashing when I didn't set ...max_cstates=1, now works perfectly well with a stock kernel v. 5.3.1 or 4.19.75. I ran it for a couple of days with each version without any issues. The average power consumption was also down by a little over 10%.
By stock I mean downloaded from https://kernel.ubuntu.com/~kernel-ppa/mainline/
FWIW...
Comment 1014 Julien 2019-10-19 14:55:02 UTC
Hello,

on a celeron J1900 (same platform as comment 2, GB-BXBT-1900), I don't get crash/freeze anymore with kernel 5.3 whereas if I didn't use intel_idle.max_cstate=1, it was very frequent with 5.2 and previous.

thanks to the dev
regards
julien
Comment 1018 gordon.foster.home 2019-11-01 09:20:11 UTC
I'm using 5.3.0-19-generic on a Celeron J1900.  So far rock solid - no issues at all.  Previously I'd get a freeze periodically.

Cheers to the devs for fixing this.
Comment 1019 Rick 2019-11-03 18:15:14 UTC
I have a Zotac Zbox CI320 nano, which has an Intel Celeron N2930.

I have been running the system with intel_idle.max_cstate=1 and it has been stable.

On Linux Mint 19.2, I upgraded to 5.3.0-19-generic and removed intel_idle.max_cstate=1

After about 3 hours uptime today, my system froze.  Same symptoms as before.

Had to revert back to intel_idle.max_cstate=1
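When comparing reports like this, it helps to confirm whether the parameter is really in effect on the running system. A small sketch; the helper name cmdline_value is made up, and the sysfs path is only present when the intel_idle driver is in use.

```shell
#!/bin/sh
# Hypothetical helper: print the value of a kernel command-line parameter.
# $1 = parameter name, $2 = command-line file (defaults to /proc/cmdline).
cmdline_value() {
	tr ' ' '\n' <"${2:-/proc/cmdline}" | sed -n "s/^$1=//p"
}

# What the boot loader passed in (prints nothing if the parameter is absent).
if [ -r /proc/cmdline ]; then
	cmdline_value intel_idle.max_cstate
fi

# The limit as seen by the driver, when intel_idle is in use.
p=/sys/module/intel_idle/parameters/max_cstate
if [ -r "$p" ]; then
	cat "$p"
fi
```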
Comment 1020 merlino37 2019-11-03 18:52:45 UTC
Created attachment 285763 [details]
attachment-14937-0.html

thanks for the update

Comment 1022 jbMacAZ 2020-02-12 16:16:00 UTC
Something is not right with 5.6-rc1.  I've had several freezes already in Mint and Manjaro.  I had been freeze free since 5.2.  The intel_idle.max_cstate=1 kernel argument seems to be needed again to stop the 5.6-rc1 freezes.
Comment 1023 André Hoogendoorn 2020-02-12 18:51:47 UTC
(In reply to jbMacAZ from comment #1022)
> Something is not right with 5.6-rc1.  I've had several freezes already in
> Mint and Manjaro.  I had been freeze free since 5.2.  The
> intel_idle.max_cstate=1 kernel argument seems to be needed again to stop the
> 5.6-rc1 freezes.

I do not believe this hardware bug (which it actually is) will ever be solved by creating a workaround in the kernel. Windows 8 had this problem too when I bought my laptop, and Windows 10 is freezing as well. The problem is with the processor manufacturer. I still hope someone will sue Intel.
Comment 1024 Xermán 2020-02-12 19:11:46 UTC
Created attachment 287331 [details]
attachment-22980-0.html

Never had a problem with windows 10.

Comment 1025 youling257 2020-02-14 05:21:59 UTC
Cherry Trail doesn't have this problem? Are you sure?
My z8350 ezpad always hangs on kernel 5.4/5.5/5.6.
When I bought it, the Linux kernel was already at 5.4.
Comment 1026 jbMacAZ 2020-02-17 20:34:47 UTC
rc2 also freezes.  Bisecting gave me a network "bad" commit, which was not the problem.  The problem with bisecting is that it can take many hours to fail, and my threshold was only 6-8 hours.  This apparently gave me one or more false "good" bisects.

I think I found the offending commits, but I can't revert them due to changes later in the source tree.  However, I found this in dmesg when testing commit efdaedfdf9fc71334883a164341881bc22 (the last of a set of three commits for intel_idle - states off)

dmesg:
[ 4754.803268] i915 0000:00:02.0: Resetting chip for stopped heartbeat on rcs0
[ 4754.905772] i915 0000:00:02.0: Xorg[881] context reset due to GPU hang

This failure did not freeze hard like later versions of 5.6-rc.  It did lock up the desktop, but not TTY's.  I'm testing the previous commit a4ac9d45c0cd14a2adc872186431c79804b77dbf, but it will take at least 3 days to be sure that a bisect good is legitimate.  Most of my pre-5.6-rc1 builds DID work without freezing.
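To give a feel for why this takes so long, here is a back-of-the-envelope sketch. A bisection roughly halves the candidate range each round, so the number of rounds is about log2 of the commit count, and each "good" verdict needs a long soak test. The helper name bisect_steps and the commit count of 1000 are purely illustrative.

```shell
#!/bin/sh
# Hypothetical helper: rough number of bisect rounds for N candidate commits,
# i.e. how many times N can be halved (rounding up) before one commit is left.
bisect_steps() {
	n=$1
	steps=0
	while [ "$n" -gt 1 ]; do
		n=$(( (n + 1) / 2 ))
		steps=$((steps + 1))
	done
	echo "$steps"
}

commits=1000   # illustrative count, not taken from the thread
soak_days=3    # "at least 3 days" per good verdict, as stated above
steps=$(bisect_steps "$commits")
echo "$steps rounds, up to $((steps * soak_days)) days"
```

With these numbers it prints "10 rounds, up to 30 days", which is an upper bound since a "bad" verdict usually arrives much sooner than a trustworthy "good" one.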

(In reply to André Hoogendoorn from comment #1023)
> I do not believe this hardware bug (which it actually is) will ever be
> solved by creating a workaround in the kernel. Windows 8 had this problem
> too when I bought my laptop, and Windows 10 is freezing as well. The problem
> is with the processor manufacturer. I still hope someone will sue Intel.

My concern is that the previous workaround was working for my T100CHI (Z3775D). Something broke it in the last few days of the merge window for 5.6.  I don't want to see that change/s backported to older kernels.
Comment 1028 mirh 2020-02-24 16:53:33 UTC
Uh, thanks to the mods having removed the previous uncivil comment. 

I wonder if this regression couldn't be due to PPGTT?
https://lists.freedesktop.org/archives/intel-gfx/2020-February/230829.html
Or perhaps it was the iGPU Leak patches...
Comment 1029 merlino37 2020-02-24 20:14:49 UTC
Created attachment 287581 [details]
attachment-971-0.html

Works fine with cstate=1 on kernel

Comment 1030 merlino37 2020-02-24 20:17:18 UTC
Created attachment 287583 [details]
attachment-1878-0.html

works fine on kernel 4.15.0-88

Comment 1031 jbMacAZ 2020-02-24 23:48:28 UTC
(In reply to mirh from comment #1028)
> Uh, thanks to the mods having removed the previous uncivil comment. 
> 
> I wonder if this regression couldn't be due to PPGTT?
> https://lists.freedesktop.org/archives/intel-gfx/2020-February/230829.html
> Or perhaps it was the iGPU Leak patches...

This is a good candidate.  I am still bisecting, but the last bisect bad has the .ppgtt_type change in it.  I need a couple more days to validate bisect good on the prior commit.  When that test concludes (hopefully without a freeze) I'll add the PPGTT patch to 5.6-rc3.  My system usually freezes within 5 hours although it can take up to a day.
Comment 1032 jbMacAZ 2020-02-26 22:15:53 UTC
(In reply to jbMacAZ from comment #1031)
> (In reply to mirh from comment #1028)
> > I wonder if this regression couldn't be due to PPGTT?
> > https://lists.freedesktop.org/archives/intel-gfx/2020-February/230829.html
> > Or perhaps it was the iGPU Leak patches...
> 
> This is a good candidate.  I am still bisecting, but the last bisect bad has
> the .ppgtt_type change in it.  I need a couple more days to validate bisect
> good on the prior commit.  When that test concludes (hopefully without a
> freeze) I'll add the PPGTT patch to 5.6-rc3.  My system usually freezes
> within 5 hours although it can take up to a <edit> 1/2 day.

The first bad commit was 9f68e3655aae6d49d6ba05dd263f99f33c2567af which modified .ppgtt_type.  Adding the patch from
https://lists.freedesktop.org/archives/intel-gfx/2020-February/230829.html does appear to fix this cstate regression.  This build has been running 24 hours w/o freezing.  I'll let it run a couple more days to be sure, but this patch does seem to fix my freezing issues.  My next test will be adding the patch to 5.6-rc4, but no more news from me is good news.
Comment 1033 Andrew Clayton 2020-02-27 00:21:43 UTC
(In reply to jbMacAZ from comment #1032)
>
> The first bad commit was 9f68e3655aae6d49d6ba05dd263f99f33c2567af which
> modified .ppgtt_type.  Adding the patch from
> https://lists.freedesktop.org/archives/intel-gfx/2020-February/230829.html
> does appear to fix this cstate regression.  This build has been running 24
> hours w/o freezing.  I'll let it run a couple more days to be sure, but this
> patch does seem to fix my freezing issues.  My next test will be adding the
> patch to 5.6-rc4, but no more news from me is good news.

You can track the progress of that patch in patchwork, https://patchwork.freedesktop.org/series/73842/
Comment 1034 youling257 2020-02-27 05:08:20 UTC
"drm/i915/gem: Prepare gen7 cmdparser for async execution" is the bad commit for my Bay Trail device running Androidx86.
I have to revert three commits: "drm/i915/gem: Take local vma references for the parser", "drm/i915/gem: Asynchronous cmdparser" and "drm/i915/gem: Prepare gen7 cmdparser for async execution".
https://gitlab.freedesktop.org/drm/intel/issues/1144
Comment 1035 jbMacAZ 2020-02-27 08:00:00 UTC
Bad news, my system froze after 29+ hours.  Maybe the PPGTT patch helped, maybe not. 

I'll try bisecting 'drm-next-2020-01-30' of git://anongit.freedesktop.org/drm/drm merge, commit 9f68e3655aae6d49d6ba05dd263f99f33c2567af, but that could take weeks.  Meanwhile, I've reverted the 2 commits of gen7 cmdparser of comment #1034 ... commits 32d94048b988469f8bd62cdc6d934f9f58c2b7c5 and 686c7c35abc2201535e6921f9f5610a0b3c9194a of 2019-12-12.
Comment 1036 Markus Kwaśnicki 2020-03-08 15:40:33 UTC
Just for the record: I am working on a laptop (TravelMate B115-M with Intel Pentium CPU N3530) which is affected by the bug discussed here. However, since Wolfgang M. Reimer published his workaround as a shell script for turning off C6 and turning on C7 states, I have been running it from my rc.local file. That way I have not experienced any freezes for years. Without running that shell script, though, I experience freezes two to three times per day. So, as long as this bug is not solved, I want to encourage everybody to give Wolfgang M. Reimer's workaround a try.
Comment 1037 merlino37 2020-03-08 23:48:35 UTC
Thanks, I'll check it out.

Comment 1039 jbMacAZ 2020-03-21 17:03:25 UTC
(In reply to jbMacAZ from comment #1035)
> Bad news, my system froze after 29+ hours.  Maybe the PPGTT patch helped,
> maybe not. 
> 
> I'll try bisecting 'drm-next-2020-01-30' of
> git://anongit.freedesktop.org/drm/drm merge, commit
> 9f68e3655aae6d49d6ba05dd263f99f33c2567af, but that could take weeks. 
> Meanwhile, I've reverted the 2 commits of gen7 cmdparser of comment #1034
> ... commits 32d94048b988469f8bd62cdc6d934f9f58c2b7c5 and
> 686c7c35abc2201535e6921f9f5610a0b3c9194a of 2019-12-12.

Update:  PPGTT patch does help.  Failure rate improves from within hours to within days.

Further testing with PPGTT patch included shows that c6offc7on script does not help my system.  Reverting the gen7 commits didn't help mine either.  YMMV  

CONFIG_DRM and CONFIG_DRM_KMS_HELPER must be set to module.  Built-in causes freezes in the commit before the drm merge commit (9f68e3655aae6d49d6ba05dd263f99f33c2567af first bad commit), whereas when set to module there weren't any freezes pre-merge.  Mint, Manjaro and Arcolinux (and probably most other distro kernels) are set to module, so this is probably not an issue for most users.  
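
To check how a given kernel was built, the two options can be looked up in its config file. The following Python sketch parses config text and reports the value of an option. The helper name is made up for this example, and where the config text comes from (for instance a `/boot/config-<release>` file or `/proc/config.gz`) depends on the distribution and is an assumption here.

```python
def config_value(config_text, option):
    """Return the value of a kernel config option ('y', 'm', ...) or None if unset.

    Lines such as '# CONFIG_DRM is not set' are comments and count as unset.
    """
    prefix = option + "="
    for line in config_text.splitlines():
        line = line.strip()
        if line.startswith(prefix):
            return line[len(prefix):]
    return None


def drm_built_as_module(config_text):
    """True only if both options named in the comment above are set to 'm'."""
    return all(
        config_value(config_text, opt) == "m"
        for opt in ("CONFIG_DRM", "CONFIG_DRM_KMS_HELPER")
    )
```

Matching on `option + "="` keeps `CONFIG_DRM` from accidentally matching `CONFIG_DRM_KMS_HELPER`.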

Another kernel argument that stops the freezes for me in linux-5.6-rc6 is idle=nomwait.  idle=nomwait was not effective when I last tried it (comment #191.)  The idea to retry this argument came from the comments within the big merge commit.  
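
When comparing test runs it helps to confirm which of these arguments the running kernel was actually booted with. A minimal Python sketch, assuming the command line is taken from `/proc/cmdline`; the parser name is illustrative, and it splits on whitespace only, so quoted values containing spaces are not handled.

```python
def parse_cmdline(cmdline):
    """Split a kernel command line into {parameter: value or None}."""
    params = {}
    for token in cmdline.split():
        key, sep, value = token.partition("=")
        # Bare flags such as 'quiet' have no value.
        params[key] = value if sep else None
    return params


def freeze_workarounds(cmdline):
    """Report the two workaround arguments discussed in this bug, if present."""
    params = parse_cmdline(cmdline)
    return {
        "idle": params.get("idle"),
        "intel_idle.max_cstate": params.get("intel_idle.max_cstate"),
    }
```

For example, `freeze_workarounds(open("/proc/cmdline").read())` returns `None` for an argument that was not passed at boot.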

Testing continues, bisect good requires days of running.
Comment 1040 jbMacAZ 2020-04-18 17:57:38 UTC
Created attachment 288603 [details]
Restore initialization code dropped in commit drm-next-2020-01-30

Adding this patch and the PPGTT patch to commit 9f68e3655aae6d49d6ba05dd263f99f33c2567af seems to fix my freezing problem first seen in 5.6.0-rc0 (pre rc1 release).  I've had no freezes so far running 88 hours.   I also tested the patch in 5.6.4.  Without it my ASUS T100CHI (Z3775D) froze in less than 5 hours, with it, my system ran over 80 hours without a freeze.

Test results summary
commit/tag                                               results
4cadc60d6bcfee9c626d4b55e9dc1475d21ad3bb                 no freezes observed
9f68e3655aae6d49d6ba05dd263f99f33c2567af                 frequent freezes < 8 hrs
9f68e3655aae6d49d6ba05dd263f99f33c2567af "idle=nomwait"  no freeze observed
9f68e3655aae6d49d6ba05dd263f99f33c2567af+PPGTT           froze after 29 hours
v5.6.4                                                   froze in 5 hours
v5.6.4 "idle=nomwait"                                    no freeze observed
v5.6.4+patch                                             no freeze in 80 hour test
9f68e3655aae6d49d6ba05dd263f99f33c2567af+PPGTT+patch     still running - 89 hours

Note: PPGTT patch was merged into v5.6.0-rc4
Comment 1041 Hans de Goede 2020-04-18 18:04:13 UTC
(In reply to jbMacAZ from comment #1040)
> Created attachment 288603 [details]
> Restore initialization code dropped in commit drm-next-2020-01-30
> 
> Adding this patch and the PPGTT patch to commit
> 9f68e3655aae6d49d6ba05dd263f99f33c2567af seems to fix my freezing problem
> first seen in 5.6.0-rc0 (pre rc1 release).  I've had no freezes so far
> running 88 hours.   I also tested the patch in 5.6.4.  Without it my ASUS
> T100CHI (Z3775D) froze in less than 5 hours, with it, my system ran over 80
> hours without a freeze.
> 
> Test results summary
> commit/tag                                             results
> 4cadc60d6bcfee9c626d4b55e9dc1475d21ad3bb                 no freezes observed
> 9f68e3655aae6d49d6ba05dd263f99f33c2567af                 frequent freezes <
> 8 hrs
> 9f68e3655aae6d49d6ba05dd263f99f33c2567af "idle=nomwait"  no freeze observed
> 9f68e3655aae6d49d6ba05dd263f99f33c2567af+PPGTT           froze after 29 hours
> v5.6.4                                                   froze in 5 hours
> v5.6.4  "idle=nomwait"                                   no freeze observed
> v5.6.4+patch                                             no freeze in 80
> hour test
> 9f68e3655aae6d49d6ba05dd263f99f33c2567af+PPGTT+patch     still running - 89
> hours
> 
> Note: PPGTT patch was merged into v5.6.0-rc4

Good detective work, but I'm afraid that your "revert" of the patch which you think may have been causing trouble is incomplete, the:

       intel_uncore_forcewake_get(&dev_priv->uncore, FORCEWAKE_ALL);

Line which your patch adds, should be accompanied by a matching:

       intel_uncore_forcewake_put()

Somewhere. What you've done now is simply keep all GPU-related power planes in their on condition at all times. This may help with stability, but it is very bad for power consumption, which is a problem especially on mobile devices.
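
The get/put pairing can be pictured as a reference count. The Python toy model below is not i915 code and does not call any kernel API; the class and method names are invented for this illustration. It only shows why a get() without a matching put() leaves the hardware permanently awake.

```python
from contextlib import contextmanager


class ForcewakeModel:
    """Toy reference counter: the domain stays awake while refcount > 0."""

    def __init__(self):
        self.refcount = 0

    def get(self):
        self.refcount += 1

    def put(self):
        if self.refcount == 0:
            raise RuntimeError("put() without a matching get()")
        self.refcount -= 1

    @property
    def held_awake(self):
        return self.refcount > 0

    @contextmanager
    def hold(self):
        """Pair get() and put() so that the put cannot be forgotten."""
        self.get()
        try:
            yield self
        finally:
            self.put()


unbalanced = ForcewakeModel()
unbalanced.get()            # no put(): the model stays awake from here on

balanced = ForcewakeModel()
with balanced.hold():
    pass                    # register access would happen here
```

After this runs, `unbalanced.held_awake` is true and `balanced.held_awake` is false, which mirrors the difference between the first version of the patch and one with the matching put() added.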
Comment 1042 jbMacAZ 2020-04-18 21:55:40 UTC
(In reply to Hans de Goede from comment #1041)
> Good detective work, but I'm afraid that your "revert" of the patch which
> you think may have been causing trouble is incomplete, the:
> 
>        intel_uncore_forcewake_get(&dev_priv->uncore, FORCEWAKE_ALL);
> 
> Line which your patch adds, should be accompanied by a matching:
> 
>        intel_uncore_forcewake_put()
> 
> Somewhere. What you've done now is simply keep all GPU-related power planes
> in their on condition at all times. This may help with stability, but it is
> very bad for power consumption, which is a problem especially on mobile
> devices.

Then this is not much different than max_cstate=1, probably worse.  I found the matching _put() and will add it back.  I'm guessing that this will undo the stability gains.  Thanks for the help.
Comment 1043 jbMacAZ 2020-04-20 06:41:23 UTC
Updated the revert patch to add intel_uncore_forcewake_put() (2 places) as suggested.  No adverse effect on stability so far through 32 hours.  The new freezes have so far taken no more than 29 hours if they are going to occur.  I will continue the test run until Wednesday.
Comment 1044 jbMacAZ 2020-04-20 06:42:32 UTC
Created attachment 288627 [details]
Updated: Restore code dropped in commit drm-next-2020-01-30
Comment 1045 Hans de Goede 2020-04-20 08:48:42 UTC
(In reply to jbMacAZ from comment #1043)
> Updated the revert patch to add intel_uncore_forcewake_put() (2 places) as
> suggested.  No adverse effect on stability so far through 32 hours.  The new
> freezes have so far taken no more than 29 hours if they are going to occur. 
> I will continue test run until Wednesday.

Thank you for your continued work on this. Let's hope that your patch still fixes the stability issues you are seeing.

BTW, there is something funny with your patch, it looks like you manually edited it instead of generating it with say "diff -u" ?  The lines for the hunks are not in ascending order:

@@ -1238,6 +1238,14 @@
@@ -1120,6 +1128,8 @@
@@ -1130,6 +1140,7 @@

Notice how the first hunk has a higher line number than the others. It still applies though: patch looks for where it can apply the first hunk and moves it up 134 lines from 1238, so it puts it at location 1104, after which the order is OK again, so it does not reject the patch:

[hans@x1 linux]$ patch -p1 < ~/Downloads/intel_uncore_forcewake.patch 
patching file drivers/gpu/drm/i915/i915_gem.c
Hunk #1 succeeded at 1104 (offset -134 lines).
Hunk #2 succeeded at 1133 (offset 5 lines).
Hunk #3 succeeded at 1145 (offset 5 lines).
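
The hunk-order observation above can be checked mechanically. The Python sketch below extracts the old-file start line from each unified-diff hunk header and tests whether they ascend. It assumes a patch that touches a single file (hunk numbering restarts for each file), and the function names are made up for this example.

```python
import re

# Matches headers such as "@@ -1238,6 +1238,14 @@"; the ",count" parts are optional.
HUNK_RE = re.compile(r"^@@ -(\d+)(?:,\d+)? \+(\d+)(?:,\d+)? @@")


def hunk_old_starts(patch_text):
    """Return the old-file start line of every hunk, in patch order."""
    starts = []
    for line in patch_text.splitlines():
        match = HUNK_RE.match(line)
        if match:
            starts.append(int(match.group(1)))
    return starts


def hunks_ascending(patch_text):
    """True if each hunk starts later in the old file than the one before it."""
    starts = hunk_old_starts(patch_text)
    return all(a < b for a, b in zip(starts, starts[1:]))


headers = "@@ -1238,6 +1238,14 @@\n@@ -1120,6 +1128,8 @@\n@@ -1130,6 +1140,7 @@\n"
print(hunk_old_starts(headers))   # [1238, 1120, 1130]
print(hunks_ascending(headers))   # False
```

With the first hunk relocated by the reported offset (1238 - 134 = 1104), the starts become 1104, 1120, 1130, which is ascending again and matches what patch printed.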
Comment 1046 jbMacAZ 2020-04-20 17:29:21 UTC
Created attachment 288633 [details]
Update.2 Restore forcedwake - dropped in commit drm-next-2020-01-30

I used meld to generate the updated patch.  Then I used meld again to clean up the directory paths.  That's likely where the "@@ -1238..." came from instead of "@@ -1099".  I checked my patched file i915_gem.c being tested and I see the changes I intended.  I've posted a proper patch generated with diff -up.

I get the same offset (5) when I tested the update.2 patch on 5.7-rc2.  That test will have to wait until after 80+ hours with my current test (up to 42 hours now.)
Comment 1047 jbMacAZ 2020-04-25 18:38:44 UTC
My test was successful running 81+ hours without a freeze.  For my next test I'm running 5.6.6 without my patch to get a baseline.  But after 3 days it hasn't frozen, yet.  I'll have to leave this baseline test running for a while before trying anything else.  Maybe something was fixed between 5.6.4 and 5.6.6.  There were some drm/i915 commits.  Time will tell.
Comment 1048 Nikolay Lavrinenko 2020-04-25 18:45:51 UTC
Asus X205TA (Atom Z3735F) running Kali Linux with kernel 5.5 plus the gcc_cpu_opt patch by graysky2: all OK.
Comment 1050 jbMacAZ 2020-05-07 20:41:52 UTC
I haven't had any more freezes in kernels 5.6.6+ or 5.7-rc4.  This is without my "patch" or kernel arguments "idle=nomwait" or "intel_idle.max_cstate=1".  Thanks to whoever found the real problem(s).  [Asus T100CHI with Z3775D]
Comment 1051 TBoinski 2020-05-19 13:50:05 UTC
I believe I also stumbled on this bug on a Dell Latitude 7400 with an i5-8365U, running Ubuntu 20.04 with kernel 5.4.0-31. With that kernel the system froze at least once per day and, as in previous posts, no errors were found in the logs. After upgrading the kernel to 5.6.13 and disabling secure boot the crashes are less frequent: the system freezes once every 2-3 days, so I cannot say that the 5.6.6+ kernels are free of the problem.

So far the freezes have always happened on battery power, under mixed conditions: HDD backup in progress, sending email, during a teleconference. Once it froze just after system start, so IMHO it is not temperature dependent. I tested other components; there were no errors when running memtest or stress tests.

Currently I'm testing system stability using max_cstate=1. I will update with the results after a few days.
Comment 1052 Leho Kraav 2020-05-19 13:52:02 UTC
@TBoinski ensure you're not actually getting hit by graphics stack bugs, like https://gitlab.freedesktop.org/drm/intel/issues/673
Comment 1053 TBoinski 2020-05-20 06:16:40 UTC
@Leho Kraav I believe I'm not, as I have never seen any of the errors mentioned there and I'm using the 5.6.13 kernel (the GPU bug was apparently fixed in 5.5). I'm currently working with max_cstate=1 to verify whether this helps or not. I will get back with the results after a few days of testing.
Comment 1054 Hans de Goede 2020-05-20 10:19:12 UTC
(In reply to TBoinski from comment #1051)
> I believe I also stumbled on this bug on Dell Latitude 7400 with i5-8365U,

TBoinski, this bug is about a problem which is specific to Bay Trail CPUs / SoCs,
your i5-8365U is a Whiskey Lake CPU / SoC. So whatever issue you are seeing, it is not this bug. Please file a new bug for your issue.
Comment 1055 TBoinski 2020-05-21 14:36:54 UTC
Oh, sorry for the mistake. I will test the problem further and open a new bug if needed.
Comment 1061 Rick 2020-07-07 06:06:22 UTC
Another update using my Zotac Zbox CI320 nano, which has an Intel Celeron N2930.

I re-installed with Linux Mint 20, which is currently using 5.4.0-40-generic.

I still get the system freeze.  Anecdotally, the system seemed to work for longer than before.  But it did still crash.

Another observation is that I would occasionally notice the screen quickly glitch.  It's hard to describe.  But the display gets corrupted for a second and goes back to normal.  Sort of similar to the screen effect that occurs when the system actually freezes, but the system manages to recover.  This quick glitch is not new to kernel 5.4.

The solution continues to be to use intel_idle.max_cstate=1.  This also prevents the quick screen glitch.
Comment 1062 merlino37 2020-07-07 15:41:44 UTC
Thanks for the post. I was wondering whether I would still need it when I
upgrade.

Comment 1072 merlino37 2020-07-27 14:05:34 UTC
This is a Bay Trail family CPU bug. Your CPU is not a Bay Trail model.

Comment 1081 Hans de Goede 2020-09-17 08:29:22 UTC
I'm not sure why this is still in need-info. AFAIK this has been fixed for a while now (I know it took a long while to get it fixed, sorry about that). So let's close this bug.
Comment 1082 mailinglists35 2020-09-17 08:43:03 UTC
In what version/patch is this fixed? I have 1k SoC machines affected by this.
Comment 1083 Hans de Goede 2020-09-17 08:59:43 UTC
The issue discussed in this bug was fixed by this commit:
https://cgit.freedesktop.org/drm-intel/commit/?id=a75d035fedbdecf83f86767aa2e4d05c8c4ffd95

Which has been included in kernel 5.3 and later.

If you are still seeing stability issues with recent kernels, chances are that you are actually hitting a different bug, specific to the model of the hardware you have deployed.
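
As a rough first check of whether a machine runs a kernel at or above the version named above, the release string can be compared numerically. This Python sketch is only a hint: distribution kernels may carry backported fixes or regressions, so the version number alone does not prove the commit is present. The function names are illustrative, and the 5.3 threshold is taken from this comment.

```python
import re


def kernel_version_tuple(release):
    """Extract (major, minor) from a release string such as '5.4.0-40-generic'."""
    match = re.match(r"(\d+)\.(\d+)", release)
    if not match:
        raise ValueError("unrecognised kernel release: %r" % release)
    return int(match.group(1)), int(match.group(2))


def at_least(release, minimum=(5, 3)):
    """True if the release is numerically at or above the minimum version."""
    return kernel_version_tuple(release) >= minimum
```

On a running system the release string can be obtained with `platform.release()` from the standard library, which returns the same value as `uname -r` on Linux.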
Comment 1084 Ivan Gubkin 2020-09-28 16:29:17 UTC
In my experience this bug has NOT been fixed. I am using the ASRock Q1900-ITX motherboard with an Intel Celeron J1900 processor (Silvermont). Using a fully updated Arch Linux build with the 5.8 kernel series, without intel_idle.max_cstate=1 the machine will still lock up within 2-3 days. The lockup occurs when the machine is not under load (screen off). There are no indications of any error in the systemd journal. In other words, the error is the same as it has been for years.
Comment 1100 yftoh 2021-04-19 13:53:21 UTC
I am using a mini PC with a Z8500 Cherry Trail CPU and have tried Kubuntu 20.04 (kernel version 5.8) and also Linux Mint 20 (also 5.x). I am experiencing a full system freeze after playing videos in the Chromium browser for about an hour. Adding intel_idle.max_cstate=1 prevents this problem, and the system is then able to run for more than a day. Are we also fixing this bug for Cherry Trail CPUs?
Comment 1101 Vincent Gerris 2021-04-19 14:05:36 UTC
Please DO NOT post on this bug report unless you read it and it applies to
you. This bug report is for Baytrail, not Cherry trail. Please look for a
bug report that exists and if not found, file a new one according to the
rules. Please do not reply, it spams people. Thank you

Comment 1106 jbMacAZ 2021-06-04 17:52:40 UTC
FYI - YMMV: There appears to be a new regression that started around 5.11-rc4.  On my baytrail Z3775d (Asus T100CHI), it takes about a week to freeze while the system is idle.  I have not seen 5.10 freezing so far.  5.11, 5.12 and 5.13-rc4 all freeze.

The reliable workaround remains intel_idle.max_cstate=1.  I suspect a bisect would reveal a 5.11-rc merge commit, but it could take a month to confirm.  

I suggest using the workaround or one of the current LTS kernels.
Comment 1108 Ivan Veloz 2021-06-15 01:29:52 UTC
I can confirm jbMacAZ’s comment. Running 5.12.6-300.fc34.x86_64 on Fedora, on an Asus T100HAN. Processor is an x5-Z8500. The system will hang and reboot in 2 to 6 days, no logs left on dmesg (I assume if they exist they never get a chance to get written to the filesystem). I already have the c6off+c7on.sh workaround applied as a system service and I can confirm the processor is not entering C6 (at least for several minutes). 

Will attempt to downgrade to 5.10 and report back.
Comment 1109 Andy 2021-08-03 18:46:32 UTC
Same here on Gentoo with 5.13.7, also on an Asus T100HAN. I have tried:

v5.8.0
v5.10.6
v5.10.52
and
v5.13.7

All of them freeze without intel_idle.max_cstate=1. On all kernels the system freezes right after X has started and reboots after 3-5 seconds.
Comment 1110 Zhihao Wang 2021-08-10 10:39:12 UTC
Everything works fine on 5.8.0 without CPU turbo boost, but the freeze happened on 5.11.
My CPU is a Z3795 (4 cores @ ~1.6 GHz).
Comment 1113 Andy 2021-10-13 17:28:03 UTC
Kernel 5.15.0-rc5 works with intel_idle.max_cstate=2, and hibernate now works too :) Thanks, devs.
