Bug 105371 - System randomly freezes since upgrading to 4.1.9
Summary: System randomly freezes since upgrading to 4.1.9
Status: RESOLVED OBSOLETE
Alias: None
Product: Other
Classification: Unclassified
Component: Other
Hardware: x86-64 Linux
Importance: P1 normal
Assignee: other_other
URL:
Keywords:
Depends on:
Blocks:
 
Reported: 2015-10-02 10:22 UTC by Adrien DAUGABEL
Modified: 2016-08-18 13:35 UTC
CC List: 2 users

See Also:
Kernel Version: 4.1.9
Subsystem:
Regression: Yes
Bisected commit-id:


Attachments
dmesg (51.90 KB, text/plain)
2015-10-02 10:22 UTC, Adrien DAUGABEL
dmesg 4.1.9 (51.71 KB, text/plain)
2015-10-02 16:27 UTC, Adrien DAUGABEL
/var/log/messages when I have the problem (5.79 KB, text/plain)
2015-10-07 19:47 UTC, Adrien DAUGABEL

Description Adrien DAUGABEL 2015-10-02 10:22:18 UTC
Hi all,

Everything is fine on my laptop with the 4.1.8 kernel, but with the 4.1.9 kernel the system sometimes freezes.

I have to use the magic SysRq keys to reboot properly.
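
(For reference, a minimal sketch of the magic SysRq sequence I mean, assuming the kernel was built with SysRq support; the exact key combination depends on the keyboard:)

  # enable all SysRq functions (may already be set by the distro)
  echo 1 > /proc/sys/kernel/sysrq
  # on a frozen box: hold Alt+SysRq and press, in order, R E I S U B
  # (unRaw keyboard, tErminate tasks, kIll tasks, Sync disks, Unmount/remount read-only, reBoot)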

I don't think I'm getting a kernel panic.

I don't have any information in the logs.

When the system freezes, if I'm listening to music, the last two seconds of sound loop.

I can't switch to a console (Ctrl+Alt+F2) to look at the logs; the screen is frozen.

My config:
System:    Host: superlinux Kernel: 4.1.8-calculate x86_64 (64 bit) Desktop: KDE 4.14.12
           Distro: Calculate Linux Desktop 15 KDE
Machine:   Mobo: ASUSTeK model: N76VZ v: 1.0 Bios: American Megatrends v: N76VZ.202 date: 03/16/2012
CPU:       Quad core Intel Core i7-3610QM (-HT-MCP-) cache: 6144 KB 
           clock speeds: max: 3300 MHz 1: 1884 MHz 2: 1402 MHz 3: 1557 MHz 4: 1432 MHz 5: 2335 MHz 6: 1322 MHz
           7: 1221 MHz 8: 1240 MHz
Graphics:  Card-1: Intel 3rd Gen Core processor Graphics Controller
           Card-2: NVIDIA GK107M [GeForce GT 650M]
           Display Server: X.Org 1.16.4 driver: intel Resolution: 1920x1080@60.01hz
           GLX Renderer: Mesa DRI Intel Ivybridge Mobile GLX Version: 3.0 Mesa 10.3.7
Audio:     Card Intel 7 Series/C210 Series Family High Definition Audio Controller driver: snd_hda_intel
           Sound: Advanced Linux Sound Architecture v: k4.1.8-calculate
Network:   Card-1: Intel Centrino Wireless-N 2230 driver: iwlwifi
           IF: wlp3s0 state: up mac: 68:5d:43:2a:f3:af
           Card-2: Qualcomm Atheros AR8161 Gigabit Ethernet driver: alx
           IF: enp4s0 state: up speed: 1000 Mbps duplex: full mac: 10:bf:48:13:f6:cc
Drives:    HDD Total Size: 1256.3GB (78.7% used) ID-1: /dev/sda model: OCZ size: 256.1GB
           ID-2: /dev/sdb model: HGST_HTS721010A9 size: 1000.2GB
Partition: ID-1: / size: 29G used: 14G (50%) fs: ext4 dev: /dev/sda6
           ID-2: /home size: 93G used: 80G (87%) fs: ext4 dev: /dev/sda8
           ID-3: swap-1 size: 8.59GB used: 0.00GB (0%) fs: swap dev: /dev/sda7
Sensors:   System Temperatures: cpu: 48.0C mobo: N/A
           Fan Speeds (in rpm): cpu: N/A
Info:      Processes: 191 Uptime: 8 min Memory: 976.5/7865.2MB Client: Shell (bash) inxi: 2.2.19
Comment 1 Adrien DAUGABEL 2015-10-02 10:22:44 UTC
Created attachment 189311 [details]
dmesg
Comment 2 Adrien DAUGABEL 2015-10-02 16:27:34 UTC
Created attachment 189331 [details]
dmesg 4.1.9
Comment 3 Greg Turner 2015-10-04 03:51:02 UTC
I seem to have a similar thing here.

I'm on an AMD-cpu desktop with a Radeon GPU.  I use btrfs and fuse-ntfs3g only.

So, if we have the same problem, it's either an x86_64 thing or something broader (or maybe something to do with snd-hda-intel).

Seems like it could be mm, sched, or pci related.

I've pretty much resigned myself to regressing; there are too many potential culprits in 4.1.8 -> 4.1.9 and none jumps out at me.  Plus, the freaking box is "dead as a doornail(*)" leaving no debuggable artifacts.

Interestingly, after the freeze hits, certain things that don't touch the disk at all keep going for a while (e.g. my Plasma clock can keep ticking for minutes before stopping, and compiles on tmpfs keep moving for a bit, maybe until the next sync?).

I have aufs patches in my kernel.  You?

Notably, we are both on gentoo-ish KDE4 desktops; some userland component seems likely.

---
(*) I wonder, WTF is a doornail, and what's so dead about 'em?  But, not enough to ask the hive-consciousness :)
Comment 4 Greg Turner 2015-10-04 23:38:03 UTC
(In reply to Greg Turner from comment #3)
> I seem to have a similar thing here.
> 
> [Maybe it's] ... mm, sched, or pci related

I have 

> 
> I've pretty much resigned myself to regressing; there are too many potential
> culprits in 4.1.8 -> 4.1.9 and none jumps out at me.  Plus, the freaking box
> is "dead as a doornail(*)" leaving no debuggable artifacts.

Actually, I've finally managed to see some netdev watchdog traces that correlate with this problem.

This is a kernel-space failure; the watchdog trace is probably secondary to some underlying mess like a deadlock, use-after-free, etc.

  Transcribing my trace by hand (I just have a crappy photo):

  NETDEV WATCHDOG: r8169:
 
  etc
  .
  .
  .
  warn_slowpath_*
  dev_watchdog
  ? qdisc_rcu_free
  call_timer_fn
  run_timer_softirq
  ? qdisc_rcu_free
  __do_softirq
  irq_exit
  smp_apic_timer_interrupt
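
(Aside: to avoid hand-transcribing photos next time, something like netconsole could stream the kernel log to another box -- a sketch with made-up addresses, where the target is a LAN machine at 192.168.1.2 listening with "nc -u -l 6666" or "nc -lup 6666", depending on the netcat flavor:)

  modprobe netconsole netconsole=6665@192.168.1.10/eth0,6666@192.168.1.2/aa:bb:cc:dd:ee:ff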

Also, I've managed to sort-of git bisect it.  The only problem is that the chance of a false "good" (the freeze simply failing to trigger during a test run) wrecking my bisect is higher than I'd like, pending an ongoing massive waste of electricity.  A false "bad" is a non-issue, at least.  So my bisect's "good" results are weakly supported, but its "bad" results should be reliable:

git bisect start
# bad: [cbc890891ddaf0240ad669dd9f0e48c599ff3d63] Linux 4.1.9
git bisect bad cbc890891ddaf0240ad669dd9f0e48c599ff3d63
# good: [36311a9ec4904c080bbdfcefc0f3d609ed508224] Linux 4.1.8
git bisect good 36311a9ec4904c080bbdfcefc0f3d609ed508224
# good: [2be9c8262419a2db45e7461b1eb26ead770a4438] fs: Don't dump core if the corefile would become world-readable.
git bisect good 2be9c8262419a2db45e7461b1eb26ead770a4438
# good: [27463fc0ab7a97bb0f311623f846f5f7c8457be8] bridge: mdb: zero out the local br_ip variable before use
git bisect good 27463fc0ab7a97bb0f311623f846f5f7c8457be8
# good: [51677b722338da3671ef19846616cf3811253760] virtio_net: don't require ANY_LAYOUT with VERSION_1
git bisect good 51677b722338da3671ef19846616cf3811253760
# good: [89b2791c0fd6938a1c2124589fe1b0f699d4b0d2] udp: fix dst races with multicast early demux
git bisect good 89b2791c0fd6938a1c2124589fe1b0f699d4b0d2
# good: [d36f8434da8c333aa3837cf421a52f3835642759] inet: fix possible request socket leak
git bisect good d36f8434da8c333aa3837cf421a52f3835642759
# bad: [b21ee342590aa41e21aa0196bff5af592cc349d0] net: dsa: Do not override PHY interface if already configured
git bisect bad b21ee342590aa41e21aa0196bff5af592cc349d0
# bad: [0c1122ae6107b01e50bb18fa40eb44e7fa492fbc] inet: fix races with reqsk timers
git bisect bad 0c1122ae6107b01e50bb18fa40eb44e7fa492fbc
# first bad commit: [0c1122ae6107b01e50bb18fa40eb44e7fa492fbc] inet: fix races with reqsk timers


Note how I got this unlikely number of consecutive "good" results -- sure, maybe the offending commit just happened to be at the top of the pile, but alternatively, maybe I just wasn't patient enough trying to trigger failures.

I'm grinding away full-bore on 0c1122ae^ right now (tons of simultaneous cpu/memory/network pressure seems to make crashes more likely); if I can't crash it within a few more hours, I'll be a lot more confident that it's 0c1122ae.

Also note that Eric Dumazet has posted another patch to the -net mailing list that's supposed to go on top of 0c1122ae.  But I tried that (or maybe I just think I did... I need to confirm my recollection here) and it didn't do the trick.
Comment 5 Greg Turner 2015-10-04 23:40:42 UTC
(In reply to Greg Turner from comment #4)
> (In reply to Greg Turner from comment #3)
> > I seem to have a similar thing here.
> > 
> > [Maybe it's] ... mm, sched, or pci related
> 
> I have 

... to do a better job of reviewing stuff before I post it :)
Comment 6 Greg Turner 2015-10-05 02:43:25 UTC
OK, progress.

First, I'm now comfortable confirming, with a pretty decent degree of confidence, that 0c1122ae ("inet: fix races with reqsk timers") is the first bad commit.  Unfortunately, without that patch, we purportedly have to take our chances with the use-after-free race that commit fixes.

In my case, however, the cure has been far worse than the disease -- there seem to be at least three of us who feel that way :)

Also, I think I /was/ wrong about having tested the follow-up patch from Eric Dumazet -- I had incorrectly believed it was in 4.1.10, and decided my unsuccessful attempt with a v4.1.10-based kernel meant "no dice".

But, to be clear, the patch in question:

  http://git.kernel.org/cgit/linux/kernel/git/davem/net.git/patch/?id=83fccfc3940c4a2db90fd7e7079f5b465cd8c6af

ain't in 4.1.10 -- I have yet to test it, but I'm hopeful, given the comments in bug #102861 (of which this may well prove to be a duplicate).
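
(A quick way to double-check whether a given commit made it into a stable tag, assuming a local clone of the stable tree with the davem/net remote fetched so the commit object is available:)

  # exits 0 if this exact commit is an ancestor of the v4.1.10 tag
  git merge-base --is-ancestor 83fccfc3940c4a2db90fd7e7079f5b465cd8c6af v4.1.10 \
      && echo "in 4.1.10" || echo "not in 4.1.10"
  # (a stable backport would carry a different hash, so it's also worth a
  #  "git log --oneline v4.1.9..v4.1.10 | grep -i reqsk")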
Comment 7 Greg Turner 2015-10-05 08:41:49 UTC
(In reply to Greg Turner from comment #6)
> OK, progress.

...

> http://git.kernel.org/cgit/linux/kernel/git/davem/net.git/patch/
> ?id=83fccfc3940c4a2db90fd7e7079f5b465cd8c6af

With this patch, for three or four hours, I was able to keep my load averages around 40, grinding through various compiles on a big tmpfs while streaming a bunch of videos and WebGL stress tests into a Chromium instance with way too many tabs open.

That's been a pretty reliable recipe for triggering the freeze on affected builds, with a mean time from boot to wedge of maybe 20 minutes.
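
(Roughly equivalent synthetic pressure, for anyone who wants to reproduce without my exact workload -- a sketch using stress-ng and iperf3 as stand-ins for the compiles and streaming, not what I actually ran; the iperf3 server address is made up:)

  # CPU + memory pressure on all cores for a few hours
  stress-ng --cpu 0 --vm 4 --vm-bytes 75% --timeout 4h &
  # sustained network traffic against a LAN box running "iperf3 -s"
  iperf3 -c 192.168.1.100 -t 14400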

In other words, if OP and I have the same problem, this is a duplicate of bug #102861.

My 0.02 USD: given this bug's lack of diagnostic clues or Googleable behavior and its super-hard-wedge symptoms, it can be an expensive bug to hit (it was for me).  Therefore, davem/net.git:83fccfc3 should clearly go into 4.1.11 stable.
Comment 8 Adrien DAUGABEL 2015-10-07 05:55:21 UTC
I updated my kernel to 4.1.10 and I don't have any problems.
Is the bug only in 4.1.9? :O
Comment 9 Greg Turner 2015-10-07 08:15:02 UTC
(In reply to Adrien D from comment #8)
> I updated my kernel to 4.1.10 and I don't have any problems.
> Is the bug only in 4.1.9? :O

I guess we had different bugs.  Apparently "it freezes" isn't really enough to assume two people have the same problem :)
Comment 10 Adrien DAUGABEL 2015-10-07 19:44:27 UTC
No, on 4.1.10 I have the same bug.

I can open a console (very slowly) and the load is very high (RAM is full).

When I kill VirtualBox, the process becomes defunct and won't die.
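
(Next time it happens, a couple of things worth checking on the stuck process, assuming a console is still usable -- a sketch, where <pid> is the VirtualBox PID:)

  # Z = zombie/defunct, D = uninterruptible sleep (stuck in the kernel)
  ps axo pid,ppid,stat,wchan:32,comm | awk '$3 ~ /^(Z|D)/'
  # kernel stack of the stuck task (needs root)
  cat /proc/<pid>/stack
  # or dump all blocked tasks to the kernel log
  echo w > /proc/sysrq-trigger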
Comment 11 Adrien DAUGABEL 2015-10-07 19:47:37 UTC
Created attachment 189661 [details]
/var/log/messages when I have the problem
Comment 12 Greg Turner 2015-10-07 21:04:26 UTC
Have you tried applying

  http://git.kernel.org/cgit/linux/kernel/git/davem/net.git/patch/?id=83fccfc3940c4a2db90fd7e7079f5b465cd8c6af

on top of your 4.1.10 kernel?  This is what I'm running now and I still have not seen a single crash, despite protracted stress testing.
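
(For reference, a sketch of one way to apply it, assuming a plain 4.1.10 source tree; the file and directory names are just examples:)

  cd linux-4.1.10
  wget -O reqsk-timer-fix.patch \
    "http://git.kernel.org/cgit/linux/kernel/git/davem/net.git/patch/?id=83fccfc3940c4a2db90fd7e7079f5b465cd8c6af"
  patch -p1 < reqsk-timer-fix.patch    # or "git am" the same file in a git checkout
  # then rebuild and reinstall the kernel as usual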
Comment 13 Adrien DAUGABEL 2015-10-08 11:23:09 UTC
Yes, I have now.

No problems since yesterday.
Comment 14 Adrien DAUGABEL 2015-10-13 05:41:06 UTC
uptime
 07:39:17 up 2 days, 21:28,  7 users,  load average: 0.80, 0.71, 0.66

I think the patch fixes my problem.

Since Saturday I have been using my laptop for Linux ISO downloads (big files), VirtualBox (with a bridged network), Skype, Firefox (browsing), SSH, and rsync, with no bugs.
Comment 15 Adrien DAUGABEL 2015-10-21 06:54:11 UTC
(In reply to Greg Turner from comment #12)
> Have you tried applying
> 
>  
> http://git.kernel.org/cgit/linux/kernel/git/davem/net.git/patch/
> ?id=83fccfc3940c4a2db90fd7e7079f5b465cd8c6af
> 
> on top of your 4.1.10 kernel?  This is what I'm running now and I still have
> not seen a single crash, despite protracted stress testing.

No problems:

$ uptime
08:53:23 up 7 days,  9:22,  9 users,  load average: 1.97, 1.77, 1.39
