Bug 105371
| Summary: | System randomly freeze since upgraded to 4.1.9 | | |
|---|---|---|---|
| Product: | Other | Reporter: | Adrien DAUGABEL (email) |
| Component: | Other | Assignee: | other_other |
| Status: | RESOLVED OBSOLETE | | |
| Severity: | normal | CC: | email, gmt |
| Priority: | P1 | | |
| Hardware: | x86-64 | | |
| OS: | Linux | | |
| Kernel Version: | 4.1.9 | Subsystem: | |
| Regression: | Yes | Bisected commit-id: | |
Attachments:

- dmesg
- dmesg 4.1.9
- /var/log/messages when i have the problem
Description

Adrien DAUGABEL 2015-10-02 10:22:18 UTC

Created attachment 189311 [details]
dmesg

Created attachment 189331 [details]
dmesg 4.1.9
---

Greg Turner (comment #3)

I seem to have a similar thing here. I'm on an AMD-CPU desktop with a Radeon GPU. I use btrfs and fuse-ntfs3g only. So, if we have the same problem, it's either an x86_64 thing, or broader (or maybe something to do with snd-hda-intel). Seems like it could be mm, sched, or pci related.

I've pretty much resigned myself to regressing; there are too many potential culprits in 4.1.8 -> 4.1.9 and none jumps out at me. Plus, the freaking box is "dead as a doornail(*)", leaving no debuggable artifacts.

Interestingly, after TSHTF, certain things that don't touch disk at all keep going for a while (e.g. my Plasma clock can keep ticking for minutes before stopping, and compiles on tmpfs keep moving for a bit, maybe until the next sync?).

I have aufs patches in my kernel. You? Notably, we are both on Gentoo-ish KDE4 desktops; some userland component seems likely.

---

(*) I wonder, WTF is a doornail, and what's so dead about 'em? But not enough to ask the hive-consciousness :)

---

Greg Turner (comment #4)

(In reply to Greg Turner from comment #3)
> I seem to have a similar thing here.
>
> [Maybe it's] ... mm, sched, or pci related I have
>
> I've pretty much resigned myself to regressing; there are too many potential
> culprits in 4.1.8 -> 4.1.9 and none jumps out at me. Plus, the freaking box
> is "dead as a doornail(*)" leaving no debuggable artifacts.

Actually, I've finally managed to see some netdev watchdog traces correlating with this problem. This is a kernel-space-TSHTF type of deal; the watchdog trace is probably secondary to some underlying shit-show like a deadlock, use-after-free, etc.

Transcribing my trace by hand (I just have a crappy photo):

```
NETDEV WATCHDOG: r8169: etc . . .
warn_slowpath_*
dev_watchdog
? qdisc_rcu_free
call_timer_fn
run_timer_softirq
? qdisc_rcu_free
__do_softirq
irq_exit
smp_apic_timer_interrupt
```

Also, I've managed to sort-of git bisect it. The only problem is, my chance of a Type II error clusterfucking my bisect is higher than I'd like, pending an ongoing massive waste of electricity. I'm pretty sure a Type I error is a non-issue, at least. So my bisect's "good" is weakly supported, but its "bad" should be reliable:

```
git bisect start
# bad: [cbc890891ddaf0240ad669dd9f0e48c599ff3d63] Linux 4.1.9
git bisect bad cbc890891ddaf0240ad669dd9f0e48c599ff3d63
# good: [36311a9ec4904c080bbdfcefc0f3d609ed508224] Linux 4.1.8
git bisect good 36311a9ec4904c080bbdfcefc0f3d609ed508224
# good: [2be9c8262419a2db45e7461b1eb26ead770a4438] fs: Don't dump core if the corefile would become world-readable.
git bisect good 2be9c8262419a2db45e7461b1eb26ead770a4438
# good: [27463fc0ab7a97bb0f311623f846f5f7c8457be8] bridge: mdb: zero out the local br_ip variable before use
git bisect good 27463fc0ab7a97bb0f311623f846f5f7c8457be8
# good: [51677b722338da3671ef19846616cf3811253760] virtio_net: don't require ANY_LAYOUT with VERSION_1
git bisect good 51677b722338da3671ef19846616cf3811253760
# good: [89b2791c0fd6938a1c2124589fe1b0f699d4b0d2] udp: fix dst races with multicast early demux
git bisect good 89b2791c0fd6938a1c2124589fe1b0f699d4b0d2
# good: [d36f8434da8c333aa3837cf421a52f3835642759] inet: fix possible request socket leak
git bisect good d36f8434da8c333aa3837cf421a52f3835642759
# bad: [b21ee342590aa41e21aa0196bff5af592cc349d0] net: dsa: Do not override PHY interface if already configured
git bisect bad b21ee342590aa41e21aa0196bff5af592cc349d0
# bad: [0c1122ae6107b01e50bb18fa40eb44e7fa492fbc] inet: fix races with reqsk timers
git bisect bad 0c1122ae6107b01e50bb18fa40eb44e7fa492fbc
# first bad commit: [0c1122ae6107b01e50bb18fa40eb44e7fa492fbc] inet: fix races with reqsk timers
```

Note how I got this unlikely number of consecutive "good" results: sure, maybe the offending commit just happened to be at the top of the pile, but alternatively, maybe I just wasn't patient enough trying to trigger failures. I'm grinding away full-bore on 0c1122ae^ right now (tons of simultaneous cpu/memory/network pressure seems to make crashes more likely); if I can't crash it within a few more hours, I'll be a lot more confident that it's 0c1122ae.

Also note that Eric Dumazet has another patch that's supposed to go on top of 0c1122ae, posted to the -net ml. But I tried that (or maybe I just think so; need to confirm my recollection here) and it didn't do the trick.
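A bisect like the one above can also be driven automatically once there is a reproducer you trust. A minimal sketch, assuming a hypothetical `stress-test.sh` that builds and boots the candidate kernel, applies load, and exits 0 only if the machine survives:

```sh
# Replay the known endpoints, then let git drive the verdicts.
# "git bisect run" treats exit 0 as good, exit 125 as untestable/skip,
# and any other exit code below 128 as bad.
git bisect start
git bisect bad  cbc890891ddaf0240ad669dd9f0e48c599ff3d63   # Linux 4.1.9
git bisect good 36311a9ec4904c080bbdfcefc0f3d609ed508224   # Linux 4.1.8
git bisect run ./stress-test.sh                            # hypothetical reproducer
```

Whether that is practical here is another matter: a hang that hard-wedges the box means the script would have to drive a VM or a second machine and call "good" on a timeout, which is exactly the Type II risk described above.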
---

Greg Turner (comment #5)

(In reply to Greg Turner from comment #4)
> (In reply to Greg Turner from comment #3)
> > I seem to have a similar thing here.
> >
> > [Maybe it's] ... mm, sched, or pci related
>
> I have

... to do a better job of reviewing stuff before I post it :)

---

Greg Turner (comment #6)

OK, progress.

First, I'm now comfortable confirming, with a pretty decent degree of confidence, that 0c1122ae ("inet: fix races with reqsk timers") is the first susceptible commit. Unfortunately, without that patch, we purportedly must take our chances with the use-after-free race that commit fixes. In my case, however, the cure has been far worse than the disease; there seem to be at least three of us who feel that way :)

Also, I think I /was/ wrong about having tested the patch-fixing-patch from Eric Dumazet. I had believed incorrectly that it was in 4.1.10, and decided my unsuccessful attempt with a v4.1.10-based kernel meant "no dice". But, to be clear, the patch in question:

http://git.kernel.org/cgit/linux/kernel/git/davem/net.git/patch/?id=83fccfc3940c4a2db90fd7e7079f5b465cd8c6af

ain't in 4.1.10. I have yet to test it, but I'm hopeful, given the comments in bug #102861 (of which this may well prove to be a duplicate).
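Whether a given upstream fix actually made it into a stable release can be checked against the git trees directly. A small sketch, assuming a clone that has both mainline history and the stable tags (the commit id is the one from the cgit URL above):

```sh
# Ask whether the mainline SHA is an ancestor of the v4.1.10 tag
# (this errors out if the clone does not contain the commit at all):
git merge-base --is-ancestor 83fccfc3940c4a2db90fd7e7079f5b465cd8c6af v4.1.10 \
    && echo "in v4.1.10" || echo "not in v4.1.10"

# Stable backports are cherry-picks with new SHAs, so also search the
# release range by subject, or by the "commit ... upstream" reference
# that stable commits carry in their message body:
git log --oneline v4.1.9..v4.1.10 --grep='reqsk'
git log --oneline v4.1.9..v4.1.10 --grep='83fccfc3940c4a2db90fd7e7079f5b465cd8c6af'
```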
---

Greg Turner (comment #7)

(In reply to Greg Turner from comment #6)
> OK, progress. ...
> http://git.kernel.org/cgit/linux/kernel/git/davem/net.git/patch/?id=83fccfc3940c4a2db90fd7e7079f5b465cd8c6af

With this patch, for three or four hours, I was able to keep my load averages around 40, grinding through various compiles on a big tmpfs while streaming a bunch of videos and WebGL stress-tests into a chromium instance with way too many tabs open. That's been a pretty reliable recipe for minimizing crash-free uptime on bugged builds: maybe 20 min. mean boot-to-wedge, or so.

In other words, if OP and I have the same problem, this is a duplicate of bug #102861.

My 0.02 USD: given this bug's lack of diagnostic clues or Googleable behaviors, and its super-hard-wedge symptomology, this can be an expensive bug (it was for me). Therefore, davem/net.git:83fccfc3 should clearly go into 4.1.11 stable.

---

Adrien DAUGABEL (comment #8)

I updated my kernel to 4.1.10 and I don't have any problem. Bug only in 4.1.9? :O

---

Greg Turner (comment #9)

(In reply to Adrien D from comment #8)
> I updated my kernel to 4.1.10 and I don't have any problem.
> Bug only in 4.1.9? :O

I guess we had different bugs. Apparently "it freezes" isn't really enough to assume two people have the same problem :)

---

Adrien DAUGABEL (comment #10)

No: with 4.1.10 I have the same bug. I can launch a console (very slowly), and I have a big load (full RAM). When I kill VirtualBox, the process goes defunct and won't die.

---

Adrien DAUGABEL (comment #11)

Created attachment 189661 [details]
/var/log/messages when i have the problem
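As an aside, the load recipe from comment #7 (parallel compiles on a big tmpfs plus sustained network traffic) can be roughly scripted. A loose sketch with placeholder sizes, paths, and URLs; the browser/WebGL component is left out entirely:

```sh
#!/bin/sh
# CPU and memory pressure: repeated over-parallelized kernel builds on tmpfs.
sudo mount -t tmpfs -o size=12G tmpfs /mnt/build
cp -a "$HOME/src/linux-4.1.9" /mnt/build/
( cd /mnt/build/linux-4.1.9 && while true; do
      make -s clean
      make -s -j"$(( $(nproc) * 4 ))"
  done ) &

# Network pressure: loop a large download; any big file works.
while true; do
    curl -s -o /dev/null https://cdn.kernel.org/pub/linux/kernel/v4.x/linux-4.1.9.tar.xz
done &

wait
```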
---

Greg Turner (comment #12)

Have you tried applying

http://git.kernel.org/cgit/linux/kernel/git/davem/net.git/patch/?id=83fccfc3940c4a2db90fd7e7079f5b465cd8c6af

on top of your 4.1.10 kernel? This is what I'm running now, and I still have not seen a single crash, despite protracted stress testing.

---

Adrien DAUGABEL (comment #13)

Now yes. No problem since yesterday:

```
$ uptime
07:39:17 up 2 days, 21:28, 7 users, load average: 0.80, 0.71, 0.66
```

I think the patch fixed my problem. Since Saturday I have been using my laptop for Linux ISO downloads (big files), VirtualBox (with a bridged network), Skype, Firefox, SSH, and rsync, with no bugs.

---

Adrien DAUGABEL (comment #14)

(In reply to Greg Turner from comment #12)
> Have you tried applying
>
> http://git.kernel.org/cgit/linux/kernel/git/davem/net.git/patch/?id=83fccfc3940c4a2db90fd7e7079f5b465cd8c6af
>
> on top of your 4.1.10 kernel? This is what I'm running now, and I still have
> not seen a single crash, despite protracted stress testing.

No problems:

```
$ uptime
08:53:23 up 7 days, 9:22, 9 users, load average: 1.97, 1.77, 1.39
```
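For anyone wanting to try the same fix: cgit's /patch/ endpoint serves the change in git-format-patch form, so it can be applied with git am inside a clone, or with patch against an unpacked source tree. A minimal sketch (the patch may need minor context fixups on a tree it was not generated against):

```sh
url='http://git.kernel.org/cgit/linux/kernel/git/davem/net.git/patch/?id=83fccfc3940c4a2db90fd7e7079f5b465cd8c6af'

# In a git clone, on a branch based at the stable tag:
git checkout -b reqsk-fix v4.1.10
curl -s "$url" | git am

# Or against a plain unpacked source tree:
curl -s "$url" | patch -p1
```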