Bug 203735

Summary: Can incoming resent TCP packets (due to congestion?) crash my kernel?
Product: Networking Reporter: GYt2bW (howaboutsynergy)
Component: IPV4 Assignee: Stephen Hemminger (stephen)
Status: RESOLVED MOVED    
Severity: normal CC: howaboutsynergy, kadlec, pablo
Priority: P1    
Hardware: All   
OS: Linux   
See Also: https://bugzilla.kernel.org/show_bug.cgi?id=203849
Kernel Version: 5.1.5-g835365932f0d Subsystem:
Regression: No Bisected commit-id:
Attachments: some traffic shaping script

Description GYt2bW 2019-05-27 21:59:08 UTC
I have `oops=panic` on the kernel command line (`/proc/cmdline`), but kdump wasn't set up correctly (`crashkernel=128M` was too low; 256M was needed), so I don't know why the panic happened. However, on the next boot I read the last log lines of the previous boot (via `journalctl -b -1`), which were:

```
May 27 18:22:00 i87k kernel: nf_ct_proto_6: invalid packet ignored in state ESTABLISHED  IN= OUT= SRC=151.101.112.133 DST=192.168.0.63 LEN=60 TOS=0x00 PREC=0x20 TTL=55 ID=0 DF PROTO=TCP SPT=443 DPT=55408 SEQ=1298638150 ACK=6832714 WINDOW=27320 RES=0x00 ACK SYN URGP=0 OPT (020405620402080A35735DFC2CCE4B1E01030309) 
May 27 18:22:00 i87k kernel: nf_ct_proto_6: invalid packet ignored in state ESTABLISHED  IN= OUT= SRC=151.101.112.133 DST=192.168.0.63 LEN=60 TOS=0x00 PREC=0x20 TTL=55 ID=0 DF PROTO=TCP SPT=443 DPT=55406 SEQ=2698278138 ACK=1746385854 WINDOW=27320 RES=0x00 ACK SYN URGP=0 OPT (020405620402080A6A2DE5892CCE4B1E01030309) 
May 27 18:22:00 i87k kernel: nf_ct_proto_6: invalid packet ignored in state ESTABLISHED  IN= OUT= SRC=151.101.112.133 DST=192.168.0.63 LEN=60 TOS=0x00 PREC=0x20 TTL=55 ID=0 DF PROTO=TCP SPT=443 DPT=55430 SEQ=3980092654 ACK=3018519461 WINDOW=27320 RES=0x00 ACK SYN URGP=0 OPT (020405620402080A0904C1332CCE4C1701030309) 
```
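For reading the `OPT (...)` field in the lines above, here is a minimal sketch that decodes the raw TCP option bytes. The function name `decode_tcp_options` is made up for illustration, and only the common option kinds (MSS, window scale, SACK-permitted, timestamps) are handled. Note that the logged flags are `ACK SYN`, so these are SYN-ACK segments arriving on connections that conntrack already considers `ESTABLISHED`, which would be consistent with retransmitted handshake packets under loss (an interpretation, not something the log proves).

```python
def decode_tcp_options(hex_blob):
    """Decode the option bytes printed as OPT (...) in the kernel log line."""
    data = bytes.fromhex(hex_blob)
    out = []
    i = 0
    while i < len(data):
        kind = data[i]
        if kind == 0:                 # End of option list
            out.append(("EOL", None))
            break
        if kind == 1:                 # NOP padding, single byte, no length field
            out.append(("NOP", None))
            i += 1
            continue
        if i + 1 >= len(data):
            break                     # truncated option, stop decoding
        length = data[i + 1]
        if length < 2 or i + length > len(data):
            break                     # malformed length, stop decoding
        body = data[i + 2 : i + length]
        if kind == 2 and length == 4:
            out.append(("MSS", int.from_bytes(body, "big")))
        elif kind == 3 and length == 3:
            out.append(("WSCALE", body[0]))
        elif kind == 4 and length == 2:
            out.append(("SACK_PERMITTED", None))
        elif kind == 8 and length == 10:
            tsval = int.from_bytes(body[:4], "big")
            tsecr = int.from_bytes(body[4:], "big")
            out.append(("TIMESTAMP", (tsval, tsecr)))
        else:
            out.append((f"KIND_{kind}", body.hex()))
        i += length
    return out


# Option bytes copied from the first log line above.
opts = decode_tcp_options("020405620402080A35735DFC2CCE4B1E01030309")
for name, value in opts:
    print(name, value)
```

For the first log line this yields an MSS of 1378, SACK-permitted, a timestamp pair, one NOP and a window scale of 9, i.e. an ordinary-looking SYN-ACK option set rather than obviously malformed data.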

so the kernel panic happened seconds (or maybe minutes, depending on how much of the log was lost due to no sync) after these messages.

This is either because the packets were malformed (intentionally?) or, more likely, because they were resent due to congestion, since that GitHub IP (151.101.112.133) was losing about 40% of the packets (via `ping`) at the time.

I've gathered some info since the crash happened here: https://gist.github.com/howaboutsynergy/c69f4a44ad10f7cce48c1544266e43f6

If I encounter this issue again, I will have a crash dump and thus a stack trace.

But until then, is there some way I can test whether resending the same(?) packets (due to congestion, possibly) could cause the kernel to oops (or panic)?

Maybe `CONFIG_NET_PKTGEN`? I've never used it before! I'll have a read of `Documentation/networking/pktgen.txt`.

I'm on ArchLinux and was using kernel stable 5.1.5-g835365932f0d, but not a pristine one though...
Comment 1 GYt2bW 2019-05-27 22:02:14 UTC
the commit for this message 
`nf_ct_proto_6: invalid packet ignored in state ESTABLISHED`
 is from 2012:

https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=1a4ac9870fb82eed56623d0f69ec59aa5bef85fe

attempting to add those people as CC, maybe they've some ideas?
Comment 2 GYt2bW 2019-05-29 15:14:11 UTC
I can't use `CONFIG_NET_PKTGEN` because it only deals with UDP packets.
I'd need something to emulate congestion... maybe some firewall rules that drop packets? hmm...
Comment 3 GYt2bW 2019-05-29 18:51:41 UTC
I tried using some `iptables` rules with `hashlimit` to drop or accept TCP packets in `ESTABLISHED` state, to make the source resend them. It didn't work as well as I expected: I wasn't able to reproduce the issue (or any of the dmesg messages), though I did get a 260K file to take about 3 minutes to download.

Oh well, I just hope that whatever caused the kernel oops(or panic) doesn't allow for anything worse(like RCE) than just that.

Github Support seems content with the fact that I'm no longer experiencing the slowdown or packet loss. No word on what might've caused those dmesg messages or anything else really.

I won't be trying anything else to reproduce this. But if it happens again, my kdump[1](https://github.com/howaboutsynergy/q1q/blob/9f656ef9f31f227cc0951f7dc06761b660a56fdd/OSes/archlinux/etc/systemd/system/kdump.service.hostspecific%3Di87k#L1)[2](https://github.com/howaboutsynergy/q1q/blob/9f656ef9f31f227cc0951f7dc06761b660a56fdd/OSes/archlinux/etc/systemd/system/kdump-save.service#L1) should be ready to catch it and then I'll update this.
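
Since the first crash was lost to a kdump misconfiguration (`crashkernel=128M` too low, 256M needed on this machine), a small sanity check of the command line may be useful. This is only a sketch: `check_cmdline` is a hypothetical helper, the 256 MiB threshold is just what worked here, the example strings are invented, and only the simple `crashkernel=size[KMG][@offset]` form is parsed (not the range syntax).

```python
import re


def check_cmdline(cmdline, min_crash_mib=256):
    """Return (has_oops_panic, crashkernel size in MiB or None, big_enough)."""
    params = cmdline.split()
    has_oops_panic = "oops=panic" in params
    crash_mib = None
    for p in params:
        # Only the simple form crashkernel=<size>[KMG][@offset] is recognised.
        m = re.fullmatch(r"crashkernel=(\d+)([KMG])(?:@.*)?", p)
        if m:
            size, unit = int(m.group(1)), m.group(2)
            crash_mib = size * {"K": 1 / 1024, "M": 1, "G": 1024}[unit]
    big_enough = crash_mib is not None and crash_mib >= min_crash_mib
    return has_oops_panic, crash_mib, big_enough


# Invented example command lines; on a real system read /proc/cmdline instead.
too_small = check_cmdline("root=/dev/sda1 rw oops=panic crashkernel=128M")
ok = check_cmdline("root=/dev/sda1 rw oops=panic crashkernel=256M")
print(too_small, ok)
```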
Comment 4 GYt2bW 2019-06-01 15:52:59 UTC
Created attachment 283021 [details]
some traffic shaping script

I tried using this traffic shaping script but still couldn't reproduce the issue with it! 

I guess whatever happened has to happen upstream of me, with something like 30% packet loss, to have any chance of reproducing this.
Comment 5 GYt2bW 2019-06-08 07:17:47 UTC
I got the crash dump! System froze again(with corrupted screen) just like the first time and this is the info:

looks like it was caused by `rustc` (just like the first time!) while compiling itself...

```
$ crash_kernel_read 
Not already root, re-executing myself as root by using sudo(required!)...

/usr/bin/makedumpfile is owned by makedumpfile 1.6.5-1

crash 7.2.6
Copyright (C) 2002-2019  Red Hat, Inc.
Copyright (C) 2004, 2005, 2006, 2010  IBM Corporation
Copyright (C) 1999-2006  Hewlett-Packard Co
Copyright (C) 2005, 2006, 2011, 2012  Fujitsu Limited
Copyright (C) 2006, 2007  VA Linux Systems Japan K.K.
Copyright (C) 2005, 2011  NEC Corporation
Copyright (C) 1999, 2002, 2007  Silicon Graphics, Inc.
Copyright (C) 1999, 2000, 2001, 2002  Mission Critical Linux, Inc.
This program is free software, covered by the GNU General Public License,
and you are welcome to change it and/or distribute copies of it under
certain conditions.  Enter "help copying" to see the conditions.
This program has absolutely no warranty.  Enter "help warranty" for details.
 
GNU gdb (GDB) 7.6
Copyright (C) 2013 Free Software Foundation, Inc.
License GPLv3+: GNU GPL version 3 or later <http://gnu.org/licenses/gpl.html>
This is free software: you are free to change and redistribute it.
There is NO WARRANTY, to the extent permitted by law.  Type "show copying"
and "show warranty" for details.
This GDB was configured as "x86_64-unknown-linux-gnu"...

WARNING: kernel relocated [720MB]: patching 75391 gdb minimal_symbol values

      KERNEL: /usr/lib/modules/5.1.7-g2f7d9d47575e/build/vmlinux       
    DUMPFILE: /var/crash/crashdump-2019-06-08-09:08:23  [PARTIAL DUMP]
        CPUS: 6
        DATE: Sat Jun  8 09:07:50 2019
      UPTIME: 00:40:33
LOAD AVERAGE: 4.74, 5.80, 5.45
       TASKS: 728
    NODENAME: i87k
     RELEASE: 5.1.7-g2f7d9d47575e
     VERSION: #31 SMP Fri Jun 7 00:10:52 CEST 2019
     MACHINE: x86_64  (3700 Mhz)
      MEMORY: 31.9 GB
       PANIC: "Oops: 0000 [#1] SMP PTI" (check log for details)
         PID: 25124
     COMMAND: "rustc"
        TASK: ffff968413869e40  [THREAD_INFO: ffff968413869e40]
         CPU: 2
       STATE: TASK_RUNNING (PANIC)

crash> bt
PID: 25124  TASK: ffff968413869e40  CPU: 2   COMMAND: "rustc"
 #0 [ffffb6eecf1a7660] machine_kexec at ffffffffae03400b
 #1 [ffffb6eecf1a76a8] __crash_kexec at ffffffffae124bd8
 #2 [ffffb6eecf1a7770] crash_kexec at ffffffffae125a08
 #3 [ffffb6eecf1a7788] oops_end at ffffffffae011866
 #4 [ffffb6eecf1a77a8] no_context at ffffffffae03bdb7
 #5 [ffffb6eecf1a7848] do_page_fault at ffffffffae03c7bb
 #6 [ffffb6eecf1a7870] page_fault at ffffffffae800dfe
    [exception RIP: compaction_alloc+1339]
    RIP: ffffffffae1b47eb  RSP: ffffb6eecf1a7928  RFLAGS: 00010286
    RAX: 0000000000000001  RBX: ffffb6eecf1a7b00  RCX: 0000000000000001
    RDX: 80000000000ffe00  RSI: 0000000000000000  RDI: 000000000000003c
    RBP: 80000000000ffe00   R8: 0000000000000000   R9: 0000000000000034
    R10: ffffe632c3ff8000  R11: ffffb6eecf1a7980  R12: 8000000000100000
    R13: 0000000000000001  R14: ffffe632c3ff8000  R15: ffff96850dfded00
    ORIG_RAX: ffffffffffffffff  CS: 0010  SS: 0000
 #7 [ffffb6eecf1a79c8] migrate_pages at ffffffffae1fa9a2
 #8 [ffffb6eecf1a7a48] compact_zone at ffffffffae1b62f0
 #9 [ffffb6eecf1a7ae8] compact_zone_order at ffffffffae1b67c3
#10 [ffffb6eecf1a7ba8] try_to_compact_pages at ffffffffae1b6fa4
#11 [ffffb6eecf1a7bf8] __alloc_pages_direct_compact at ffffffffae193f72
#12 [ffffb6eecf1a7c50] __alloc_pages_slowpath at ffffffffae1945f0
#13 [ffffb6eecf1a7d40] __alloc_pages_nodemask at ffffffffae194fa8
#14 [ffffb6eecf1a7da0] do_huge_pmd_anonymous_page at ffffffffae1fd37c
#15 [ffffb6eecf1a7df0] __handle_mm_fault at ffffffffae1c0e0d
#16 [ffffb6eecf1a7ea0] handle_mm_fault at ffffffffae1c152f
#17 [ffffb6eecf1a7ec8] __do_page_fault at ffffffffae03c522
#18 [ffffb6eecf1a7f28] do_page_fault at ffffffffae03c7bb
#19 [ffffb6eecf1a7f50] page_fault at ffffffffae800dfe
    RIP: 00007cd640afabbc  RSP: 00007cd633fedc50  RFLAGS: 00010206
    RAX: 00007cd5b02fe020  RBX: 00007cd602d01020  RCX: 00007cd5b00fe010
    RDX: 00000000001fffff  RSI: 0000000000d02000  RDI: 0000000000000010
    RBP: 000000000000ffff   R8: 0000000000000067   R9: 00007cd601fff010
    R10: 0000000000000000  R11: 0000000000000246  R12: 00007cd5fd1f81b0
    R13: 00007cd602d01020  R14: 00000000fffffdd2  R15: 00007cd601f67120
    ORIG_RAX: ffffffffffffffff  CS: 0033  SS: 002b
crash> 

```

```
...
[ 2389.137459] i2c i2c-1: NAK from device addr 0x50 msg #0
[ 2389.141642] i2c i2c-3: NAK from device addr 0x50 msg #0
[ 2432.660116] gpg-agent[1125]: handler 0x74e5c195d700 for fd 10 started
[ 2432.730809] gpg-agent[1125]: handler 0x74e5c195d700 for fd 10 terminated
[ 2434.126414] BUGGY: unable to handle kernel paging request at ffffe632c3ff8030
[ 2434.126415] #PF error: [normal kernel read fault]
[ 2434.126416] PGD 82dfd5067 P4D 82dfd5067 PUD 82dfd4067 PMD 0 
[ 2434.126418] Oops: 0000 [#1] SMP PTI
[ 2434.126419] CPU: 2 PID: 25124 Comm: rustc Kdump: loaded Tainted: G     U            5.1.7-g2f7d9d47575e #31
[ 2434.126420] Hardware name: System manufacturer System Product Name/PRIME Z370-A, BIOS 1002 07/02/2018
[ 2434.126422] RIP: 0010:compaction_alloc+0x53b/0x890
[ 2434.126423] Code: 1f 41 83 c5 01 4c 39 f5 0f 82 5e 01 00 00 4c 89 34 24 eb 76 49 89 ea 49 c1 e2 06 4c 03 15 75 f3 d0 00 4d 89 d6 4d 85 f6 74 44 <41> 8b 46 30 25 80 00 00 f0 3d 00 00 00 f0 0f 84 ff 00 00 00 80 7b
[ 2434.126423] RSP: 0000:ffffb6eecf1a7928 EFLAGS: 00010286
[ 2434.126424] RAX: 0000000000000001 RBX: ffffb6eecf1a7b00 RCX: 0000000000000001
[ 2434.126425] RDX: 80000000000ffe00 RSI: 0000000000000000 RDI: 000000000000003c
[ 2434.126425] RBP: 80000000000ffe00 R08: 0000000000000000 R09: 0000000000000034
[ 2434.126426] R10: ffffe632c3ff8000 R11: ffffb6eecf1a7980 R12: 8000000000100000
[ 2434.126426] R13: 0000000000000001 R14: ffffe632c3ff8000 R15: ffff96850dfded00
[ 2434.126427] FS:  00007cd633fff700(0000) GS:ffff9684eda80000(0000) knlGS:0000000000000000
[ 2434.126428] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[ 2434.126428] CR2: ffffe632c3ff8030 CR3: 00000007ae5a8005 CR4: 00000000003606e0
[ 2434.126429] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[ 2434.126429] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
[ 2434.126430] Call Trace:
[ 2434.126432]  migrate_pages+0x112/0xa00
[ 2434.126433]  ? isolate_freepages_block+0x330/0x330
[ 2434.126434]  ? move_freelist_tail+0xd0/0xd0
[ 2434.126435]  compact_zone+0x6b0/0xab0
[ 2434.126436]  compact_zone_order+0xd3/0x110
[ 2434.126437]  ? psi_task_change+0xe2/0x210
[ 2434.126438]  try_to_compact_pages+0x164/0x220
[ 2434.126439]  __alloc_pages_direct_compact+0x82/0x170
[ 2434.126440]  __alloc_pages_slowpath+0x430/0xb70
[ 2434.126441]  __alloc_pages_nodemask+0x278/0x2c0
[ 2434.126442]  do_huge_pmd_anonymous_page+0x12c/0x5e0
[ 2434.126444]  __handle_mm_fault+0xbed/0x1250
[ 2434.126445]  handle_mm_fault+0xbf/0x1e0
[ 2434.126446]  __do_page_fault+0x242/0x490
[ 2434.126448]  ? page_fault+0x8/0x30
[ 2434.126449]  do_page_fault+0x1b/0x5e
[ 2434.126449]  page_fault+0x1e/0x30
[ 2434.126450] RIP: 0033:0x7cd640afabbc
[ 2434.126451] Code: 74 dc 41 f7 d6 eb 31 49 c1 e8 39 48 8d 46 f0 48 21 d0 44 88 04 31 44 88 44 01 10 48 8b 44 24 40 48 c1 e6 05 c4 c1 7e 6f 45 00 <c5> fe 7f 04 30 4c 89 cd 66 45 85 f6 74 a6 48 8b 7c 24 60 c5 f8 77
[ 2434.126452] RSP: 002b:00007cd633fedc50 EFLAGS: 00010206
[ 2434.126452] RAX: 00007cd5b02fe020 RBX: 00007cd602d01020 RCX: 00007cd5b00fe010
[ 2434.126453] RDX: 00000000001fffff RSI: 0000000000d02000 RDI: 0000000000000010
[ 2434.126453] RBP: 000000000000ffff R08: 0000000000000067 R09: 00007cd601fff010
[ 2434.126454] R10: 0000000000000000 R11: 0000000000000246 R12: 00007cd5fd1f81b0
[ 2434.126454] R13: 00007cd602d01020 R14: 00000000fffffdd2 R15: 00007cd601f67120
[ 2434.126455] Modules linked in: xt_comment msr xt_TCPMSS iptable_mangle iptable_security iptable_nat nf_nat iptable_raw nf_log_ipv4 nf_log_common xt_owner xt_LOG xt_connlimit nf_conncount xt_conntrack nf_conntrack nf_defrag_ipv4 xt_hashlimit xt_multiport xt_addrtype snd_hda_codec_hdmi snd_hda_codec_realtek snd_hda_codec_generic intel_rapl x86_pkg_temp_thermal intel_powerclamp coretemp i915 crct10dif_pclmul crc32_pclmul crc32c_intel i2c_algo_bit drm_kms_helper snd_hda_intel snd_hda_codec syscopyarea sysfillrect sysimgblt ghash_clmulni_intel snd_hwdep fb_sys_fops iTCO_wdt intel_cstate snd_hda_core drm iTCO_vendor_support intel_uncore snd_pcm intel_rapl_perf snd_timer pcspkr mei_me mq_deadline snd drm_panel_orientation_quirks e1000e soundcore i2c_i801 mei xhci_pci xhci_hcd
[ 2434.126466] CR2: ffffe632c3ff8030
```
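
A quick cross-check of the numbers in the two dumps above, as a sketch rather than a diagnosis. All values are copied from the oops. The instruction decoding in the comments is my own reading of the `Code:` bytes, and the guess that this is a pfn-to-`struct page` computation (64-byte `struct page`) is an assumption.

```python
# Values copied from the oops above.
CR2 = 0xffffe632c3ff8030   # faulting address
R14 = 0xffffe632c3ff8000   # base register of the faulting instruction
RBP = 0x80000000000ffe00   # value that was shifted left by 6 just before the fault

# The bytes at the <41> marker are "41 8b 46 30",
# which decodes to: mov eax, DWORD PTR [r14+0x30]
DISPLACEMENT = 0x30
fault_matches = (R14 + DISPLACEMENT == CR2)

# crash prints the symbol offset in decimal (+1339), dmesg in hex (+0x53b).
offset_matches = (1339 == 0x53b)

# "49 89 ea" / "49 c1 e2 06" decode to: mov r10, rbp ; shl r10, 6
# The shift wraps at 64 bits, so the top bit of RBP is discarded.
MASK64 = 0xFFFFFFFFFFFFFFFF
shifted = (RBP << 6) & MASK64

# The next instruction adds a value loaded from memory; this is what it
# must have contributed for R14 to end up with the value seen in the dump.
implied_base = (R14 - shifted) & MASK64

print(fault_matches, offset_matches, hex(shifted), hex(implied_base))
```

So the fault is a 4-byte read at offset 0x30 from the pointer in R14, and that pointer appears to be derived from the value in RBP, which has its top bit set and so looks suspicious as an index.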

What other info should I provide?
Comment 6 GYt2bW 2019-06-08 07:36:08 UTC
the `BUGGY` text is from:
2600_whichbug.patch:+	pr_alert("BUGGY: unable to handle kernel %s at %px\n",

more info here: https://gist.github.com/howaboutsynergy/c69f4a44ad10f7cce48c1544266e43f6#gistcomment-2938304
Comment 7 GYt2bW 2019-06-08 07:44:55 UTC
I'll make a new bug, since this turned out to be unrelated to what I thought (network packets).
done as: https://bugzilla.kernel.org/show_bug.cgi?id=203849