Hi all,

All is OK on my laptop with the 4.1.8 kernel but, with the 4.1.9 kernel, the system sometimes freezes. I must use the magic SysRq keys to reboot properly. I don't think it is a kernel panic, and I don't have any information in the logs. When the system freezes while I am listening to music, the last two seconds of sound play in a loop. I can't switch to a console (Ctrl+Alt+F2) to see the logs; the screen is static.

My config:

System:    Host: superlinux Kernel: 4.1.8-calculate x86_64 (64 bit)
           Desktop: KDE 4.14.12 Distro: Calculate Linux Desktop 15 KDE
Machine:   Mobo: ASUSTeK model: N76VZ v: 1.0
           Bios: American Megatrends v: N76VZ.202 date: 03/16/2012
CPU:       Quad core Intel Core i7-3610QM (-HT-MCP-) cache: 6144 KB
           clock speeds: max: 3300 MHz 1: 1884 MHz 2: 1402 MHz 3: 1557 MHz
           4: 1432 MHz 5: 2335 MHz 6: 1322 MHz 7: 1221 MHz 8: 1240 MHz
Graphics:  Card-1: Intel 3rd Gen Core processor Graphics Controller
           Card-2: NVIDIA GK107M [GeForce GT 650M]
           Display Server: X.Org 1.16.4 driver: intel
           Resolution: 1920x1080@60.01hz
           GLX Renderer: Mesa DRI Intel Ivybridge Mobile
           GLX Version: 3.0 Mesa 10.3.7
Audio:     Card Intel 7 Series/C210 Series Family High Definition Audio Controller
           driver: snd_hda_intel
           Sound: Advanced Linux Sound Architecture v: k4.1.8-calculate
Network:   Card-1: Intel Centrino Wireless-N 2230 driver: iwlwifi
           IF: wlp3s0 state: up mac: 68:5d:43:2a:f3:af
           Card-2: Qualcomm Atheros AR8161 Gigabit Ethernet driver: alx
           IF: enp4s0 state: up speed: 1000 Mbps duplex: full mac: 10:bf:48:13:f6:cc
Drives:    HDD Total Size: 1256.3GB (78.7% used)
           ID-1: /dev/sda model: OCZ size: 256.1GB
           ID-2: /dev/sdb model: HGST_HTS721010A9 size: 1000.2GB
Partition: ID-1: / size: 29G used: 14G (50%) fs: ext4 dev: /dev/sda6
           ID-2: /home size: 93G used: 80G (87%) fs: ext4 dev: /dev/sda8
           ID-3: swap-1 size: 8.59GB used: 0.00GB (0%) fs: swap dev: /dev/sda7
Sensors:   System Temperatures: cpu: 48.0C mobo: N/A
           Fan Speeds (in rpm): cpu: N/A
Info:      Processes: 191 Uptime: 8 min Memory: 976.5/7865.2MB
           Client: Shell (bash) inxi: 2.2.19
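[Editorial note: the "magic SysRq keys" above only work if SysRq is enabled beforehand. A minimal sketch using the standard procfs/sysctl interface; the read is safe to run anywhere, and the enable/persist steps are shown as comments since they require root:]

```shell
# Read the current SysRq mask; 0 = disabled, 1 = all functions enabled,
# other values are a bitmask of allowed functions.
sysrq_state=$(cat /proc/sys/kernel/sysrq 2>/dev/null || echo unavailable)
echo "SysRq mask: ${sysrq_state}"

# To enable everything for the current boot (as root):
#   echo 1 > /proc/sys/kernel/sysrq
# To persist across reboots, add to /etc/sysctl.conf:
#   kernel.sysrq = 1
#
# During a hard freeze, Alt+SysRq+R,E,I,S,U,B (pressed slowly)
# terminates processes, syncs disks, remounts filesystems read-only,
# and reboots -- much safer than a power cycle.
```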
Created attachment 189311 [details] dmesg
Created attachment 189331 [details] dmesg 4.1.9
I seem to have a similar thing here. I'm on an AMD-CPU desktop with a Radeon GPU. I use btrfs and fuse-ntfs3g only. So, if we have the same problem, it's either an x86_64 thing, or broader (or, maybe, something to do with snd-hda-intel). Seems like it could be mm-, sched-, or pci-related.

I've pretty much resigned myself to regressing; there are too many potential culprits in 4.1.8 -> 4.1.9 and none jumps out at me. Plus, the freaking box is "dead as a doornail(*)", leaving no debuggable artifacts.

Interestingly, after TSHTF, certain things that don't touch disk at all keep going for a while (i.e.: my plasma clock can keep ticking for minutes before stopping, and compiles on tmpfs keep moving for a bit, maybe until the next sync?).

I have aufs patches in my kernel. You?

Notably, we are both on gentoo-ish KDE4 desktops; some userland component seems likely.

---
(*) I wonder, WTF is a doornail, and what's so dead about 'em? But, not enough to ask the hive-consciousness :)
(In reply to Greg Turner from comment #3)
> I seem to have a similar thing here.
>
> [Maybe it's] ... mm, sched, or pci related

I have

> I've pretty much resigned myself to regressing; there are too many
> potential culprits in 4.1.8 -> 4.1.9 and none jumps out at me. Plus, the
> freaking box is "dead as a doornail(*)" leaving no debuggable artifacts.

Actually, I've finally managed to see some netdev watchdog traces correlating with this problem. This is a kernel-space-TSHTF type of deal; the watchdog trace is probably secondary to some underlying shit-show like a deadlock, use-after-free, etc.

Transcribing my trace by hand (I just have a crappy photo):

NETDEV WATCHDOG: r8169: etc
. . .
warn_slowpath_*
dev_watchdog
? qdisc_rcu_free
call_timer_fn
run_timer_softirq
? qdisc_rcu_free
__do_softirq
irq_exit
smp_apic_timer_interrupt

Also, I've managed to sort-of git bisect it. The only problem is, my chance of a Type II error clusterfucking my bisect is higher than I'd like, pending an ongoing massive waste of electricity. I'm pretty sure a Type I error is a non-issue, at least. So, my bisect's "good" is weakly supported, but its "bad" should be reliable:

git bisect start
# bad: [cbc890891ddaf0240ad669dd9f0e48c599ff3d63] Linux 4.1.9
git bisect bad cbc890891ddaf0240ad669dd9f0e48c599ff3d63
# good: [36311a9ec4904c080bbdfcefc0f3d609ed508224] Linux 4.1.8
git bisect good 36311a9ec4904c080bbdfcefc0f3d609ed508224
# good: [2be9c8262419a2db45e7461b1eb26ead770a4438] fs: Don't dump core if the corefile would become world-readable.
git bisect good 2be9c8262419a2db45e7461b1eb26ead770a4438
# good: [27463fc0ab7a97bb0f311623f846f5f7c8457be8] bridge: mdb: zero out the local br_ip variable before use
git bisect good 27463fc0ab7a97bb0f311623f846f5f7c8457be8
# good: [51677b722338da3671ef19846616cf3811253760] virtio_net: don't require ANY_LAYOUT with VERSION_1
git bisect good 51677b722338da3671ef19846616cf3811253760
# good: [89b2791c0fd6938a1c2124589fe1b0f699d4b0d2] udp: fix dst races with multicast early demux
git bisect good 89b2791c0fd6938a1c2124589fe1b0f699d4b0d2
# good: [d36f8434da8c333aa3837cf421a52f3835642759] inet: fix possible request socket leak
git bisect good d36f8434da8c333aa3837cf421a52f3835642759
# bad: [b21ee342590aa41e21aa0196bff5af592cc349d0] net: dsa: Do not override PHY interface if already configured
git bisect bad b21ee342590aa41e21aa0196bff5af592cc349d0
# bad: [0c1122ae6107b01e50bb18fa40eb44e7fa492fbc] inet: fix races with reqsk timers
git bisect bad 0c1122ae6107b01e50bb18fa40eb44e7fa492fbc
# first bad commit: [0c1122ae6107b01e50bb18fa40eb44e7fa492fbc] inet: fix races with reqsk timers

Note how I got this unlikely number of consecutive "good" results -- sure, maybe the offending commit just happened to be at the top of the pile, but alternatively, maybe I just wasn't patient enough trying to trigger failures. I'm grinding away full-bore on 0c1122ae^ right now (tons of simultaneous cpu/memory/network pressure seems to make crashes more likely); if I can't crash it within a few more hours, I'll be a lot more confident that it's 0c1122ae.

Also note that Eric Dumazet has another patch that's supposed to go on top of 0c1122ae, posted to the -net ML. But I tried that (or maybe I just think so... need to confirm my recollection here) and it didn't do the trick.
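[Editorial note: for anyone wanting to reproduce this kind of hunt, here is a self-contained toy demonstration of the bisect workflow used above. It builds a throwaway repo where one of ten commits plants a "BUG" marker, then lets `git bisect run` find it automatically -- in a real kernel tree, the run command would be a build-boot-stress script instead of a grep. All names and paths here are made up for the demo:]

```shell
set -e
repo=$(mktemp -d)
cd "$repo"
git init -q
git config user.email bisect@example.com
git config user.name "bisect demo"

# Ten commits; commit 7 quietly introduces "the bug".
for i in $(seq 1 10); do
    echo "change $i" >> file.txt
    if [ "$i" -eq 7 ]; then echo "BUG" >> file.txt; fi
    git add file.txt
    git commit -qm "commit $i"
done

# HEAD is known bad; the root commit is known good.
git bisect start HEAD "$(git rev-list HEAD | tail -n 1)"

# Let git drive the search: the run command exits non-zero ("bad")
# whenever the BUG marker is present in the checked-out tree.
git bisect run sh -c '! grep -q BUG file.txt' > /dev/null

first_bad=$(git show -s --format=%s refs/bisect/bad)
echo "first bad commit: $first_bad"
git bisect reset > /dev/null
```

The same mechanics apply to a linux-stable checkout with `git bisect start v4.1.9 v4.1.8`; with a reliable reproducer script, `git bisect run` removes the human error the comment above worries about.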
(In reply to Greg Turner from comment #4)
> (In reply to Greg Turner from comment #3)
> > I seem to have a similar thing here.
> >
> > [Maybe it's] ... mm, sched, or pci related
>
> I have

... to do a better job of reviewing stuff before I post it :)
OK, progress.

First, I'm now comfortable confirming, with a pretty decent degree of confidence, that 0c1122ae ("inet: fix races with reqsk timers") is the first susceptible commit. Unfortunately, without that patch, we purportedly must take our chances with the use-after-free race that commit fixes. In my case, however, the cure has been far worse than the disease -- and there seem to be at least three of us who feel that way :)

Also, I think I /was/ wrong about having tested the patch-fixing-patch from Eric Dumazet -- I had believed, incorrectly, that it was in 4.1.10, and decided my unsuccessful attempt with a v4.1.10-based kernel meant "no dice". But, to be clear, the patch in question:

http://git.kernel.org/cgit/linux/kernel/git/davem/net.git/patch/?id=83fccfc3940c4a2db90fd7e7079f5b465cd8c6af

ain't in 4.1.10 -- I have yet to test it, but I'm hopeful, given the comments in bug #102861 (of which this may well prove to be a duplicate).
(In reply to Greg Turner from comment #6)
> OK, progress.
...
> http://git.kernel.org/cgit/linux/kernel/git/davem/net.git/patch/?id=83fccfc3940c4a2db90fd7e7079f5b465cd8c6af

With this patch, for three or four hours, I was able to keep my load averages around 40, grinding through various compiles on a big tmpfs while streaming a bunch of videos and WebGL stress-tests into a chromium instance with way too many tabs open. That's been a pretty reliable recipe for minimizing crash-free uptime on bugged builds -- maybe 20 min. mean boot-to-wedge, or so. In other words, if the OP and I have the same problem, this is a duplicate of bug #102861.

My 0.02 USD: given this bug's lack of diagnostic clues or Googleable behaviors, and its super-hard-wedge symptomology, this can be an expensive bug (it was for me). Therefore, davem/net.git:83fccfc3 should clearly go into 4.1.11 stable.
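[Editorial note: the actual reproduction recipe above was tmpfs compiles plus heavy browser/network activity. As a rough, hypothetical stand-in -- useful only for generating CPU pressure, not for reproducing this specific network-path bug -- here is a portable pure-shell load generator; crank `workers` and `duration` way up for real soak testing:]

```shell
# Spin up a few busy-loop workers for a fixed duration, then show the
# resulting load average.  Pure POSIX shell; no external stress tools.
workers=4
duration=3   # seconds; use hundreds for a real soak test
i=0
while [ "$i" -lt "$workers" ]; do
    ( end=$(( $(date +%s) + duration ))
      # Burn CPU until the deadline passes.
      while [ "$(date +%s)" -lt "$end" ]; do :; done ) &
    i=$((i + 1))
done
wait
load=$(uptime)
echo "$load"
```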
I updated my kernel to 4.1.10 and I don't have any problems. Is the bug only in 4.1.9? :O
(In reply to Adrien D from comment #8)
> I updated my kernel to 4.1.10 and i don't have any problem.
> Bug only in 4.1.9 ? :O

I guess we had different bugs. Apparently "it freezes" isn't really enough to assume two people have the same problem :)
No, in 4.1.10 I have the same bug. I can launch a console (very slowly) and I have a big load (RAM is full). When I kill VirtualBox, the process becomes defunct and can't be killed.
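[Editorial note: "defunct" means the process is a zombie, which explains why `kill` has no effect. A quick check, using standard ps/awk (nothing specific to this bug), shows whether any are hanging around:]

```shell
# List any defunct ("zombie") processes: process state Z in ps output.
# A zombie is already dead -- kill can't remove it; only its parent
# reaping it (or the parent dying) makes it go away.
zombies=$(ps -eo stat=,pid=,comm= | awk '$1 ~ /^Z/ {print $2, $3}')
echo "defunct processes: ${zombies:-none}"

# If the parent (here, VirtualBox's main process) is itself stuck in
# uninterruptible sleep (state D) inside the kernel, neither it nor
# its zombies can be cleaned up -- consistent with the freeze above.
```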
Created attachment 189661 [details] /var/log/messages when I have the problem
Have you tried applying http://git.kernel.org/cgit/linux/kernel/git/davem/net.git/patch/?id=83fccfc3940c4a2db90fd7e7079f5b465cd8c6af on top of your 4.1.10 kernel? This is what I'm running now and I still have not seen a single crash, despite protracted stress testing.
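[Editorial note: the apply step being asked about is a standard `patch -p1` against the kernel source tree. Here is a self-contained toy demonstration of those mechanics -- it fabricates a tiny a/-vs-b/ diff locally instead of downloading the real patch, and the file path is illustrative; in a real tree you would feed `patch -p1` the patch fetched from the git.kernel.org URL above and then rebuild the kernel:]

```shell
set -e
work=$(mktemp -d)
cd "$work"

# Fabricate a one-file "tree" and a unified diff against it, standing
# in for the real fix.  diff exits 1 when files differ, hence || true.
mkdir -p a/net/ipv4 b/net/ipv4
echo "old line"   > a/net/ipv4/inet_connection_sock.c
echo "fixed line" > b/net/ipv4/inet_connection_sock.c
diff -u a/net/ipv4/inet_connection_sock.c \
        b/net/ipv4/inet_connection_sock.c > reqsk-fix.patch || true

cd a
# Always dry-run first; -p1 strips the leading a/ and b/ components,
# exactly as with patches exported from git.
patch -p1 --dry-run < ../reqsk-fix.patch
patch -p1 < ../reqsk-fix.patch

applied=$(cat net/ipv4/inet_connection_sock.c)
echo "file now contains: $applied"
```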
Now, yes. No problems since yesterday.
$ uptime
 07:39:17 up 2 days, 21:28, 7 users, load average: 0.80, 0.71, 0.66

I think the patch fixes my problem. Since Saturday, I have used my laptop for:
- Linux ISO downloads (big files)
- VirtualBox (with a bridged network)
- Skype
- Firefox (surfing)
- SSH and rsync

and no bugs.
(In reply to Greg Turner from comment #12)
> Have you tried applying
>
> http://git.kernel.org/cgit/linux/kernel/git/davem/net.git/patch/?id=83fccfc3940c4a2db90fd7e7079f5b465cd8c6af
>
> on top of your 4.1.10 kernel? This is what I'm running now and I still
> have not seen a single crash, despite protracted stress testing.

No problems:

$ uptime
 08:53:23 up 7 days, 9:22, 9 users, load average: 1.97, 1.77, 1.39