Bug 201137 - using traffic control with sfq cause kernel crash
Summary: using traffic control with sfq cause kernel crash
Status: NEW
Alias: None
Product: Networking
Classification: Unclassified
Component: IPV4
Hardware: x86-64 Linux
Importance: P1 normal
Assignee: Stephen Hemminger
URL:
Keywords:
Depends on:
Blocks:
 
Reported: 2018-09-15 08:43 UTC by Mario Bachmann
Modified: 2020-03-10 11:08 UTC
CC List: 3 users

See Also:
Kernel Version: 4.18.5
Subsystem:
Regression: No
Bisected commit-id:


Attachments
kernel config (95.86 KB, text/x-mpsub)
2018-09-15 08:43 UTC, Mario Bachmann
Picture of the Screen after the Oops. (1.17 MB, image/jpeg)
2018-09-18 17:31 UTC, Mario Bachmann
netem crash kernel also (110.38 KB, image/png)
2018-10-24 23:03 UTC, baldzhang

Description Mario Bachmann 2018-09-15 08:43:09 UTC
Created attachment 278555 [details]
kernel config

Copying from this machine to another server (the protocol does not matter) causes a kernel crash when an SFQ qdisc is configured with tc.

The machine has a Qualcomm Killer NIC:
lspci | grep Killer
03:00.0 Ethernet controller: Qualcomm Atheros Killer E220x Gigabit Ethernet Controller (rev 13)

I use traffic control with SFQ: 
tc qdisc add dev enp3s0 root handle 1: sfq
tc qdisc show dev enp3s0

Now I try to copy a big file (124 GB, an image of a partition) to an NFS share on another Linux server (same kernel version). It does not matter whether the share is NFS, Samba, or anything else, nor whether I use cp or rsync.

The target-share is for example:
grep base /proc/mounts
jaguar.grafnetz:/base /mnt/base nfs4 rw,noatime,vers=4.2,rsize=1048576,wsize=1048576,namlen=255,hard,proto=tcp,timeo=600,retrans=2,sec=sys,clientaddr=192.168.0.9,local_lock=none,addr=192.168.0.7 0 0

df shows this nfs-share called base when mounted:
jaguar.grafnetz:/base   11718572032 6012592128 5705979904   52% /mnt/base

Now I use a simple cp command:
cp big-fime.dd.image /mnt/base/test_01
The machine crashes after 7833735168 bytes have reached the target server, about 7.8 GB (with G = 1000^3).
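
For reference, the byte count at which the crash occurs can be converted with a few lines of shell. This is only an arithmetic sketch; the helper name to_unit is made up for illustration and is not part of the original report.

```shell
#!/bin/sh
# Convert the byte count from the report into decimal and binary units.
bytes=7833735168

to_unit() {
    # $1 = byte count, $2 = divisor; prints the quotient with two decimals.
    # LC_ALL=C keeps "." as the decimal separator regardless of locale.
    LC_ALL=C awk -v b="$1" -v d="$2" 'BEGIN { printf "%.2f\n", b / d }'
}

echo "GB  (1000^3): $(to_unit "$bytes" 1000000000)"
echo "GiB (1024^3): $(to_unit "$bytes" 1073741824)"
```

The two lines printed differ only in the divisor, which makes it easy to compare the figure against tools that report in either unit.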

I can reproduce this crash. 

The good thing is: I figured out that no kernel crash happens when I do not use:
tc qdisc add dev enp3s0 root handle 1: sfq
tc qdisc show dev enp3s0
(So I commented these lines out of my local start script and rebooted the system.)
Result: no more crashes. Copying the big file (124 GB) completed without a kernel crash.
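
The same comparison can probably be made without a reboot, because deleting the root qdisc makes the kernel fall back to its default qdisc. The sketch below assumes the interface name enp3s0 from this report. The run and remove_root_qdisc helpers are hypothetical, and the script only prints the commands unless APPLY=1 is set, so that nothing is changed by accident on a machine that is known to crash.

```shell
#!/bin/sh
# Print the tc commands by default; set APPLY=1 (as root) to execute them.
run() {
    if [ "${APPLY:-0}" = "1" ]; then "$@"; else echo "$*"; fi
}

remove_root_qdisc() {
    # $1 = interface. Deleting the root qdisc restores the default qdisc.
    run tc qdisc del dev "$1" root
    # Show what is installed afterwards, to confirm that sfq is gone.
    run tc qdisc show dev "$1"
}

remove_root_qdisc enp3s0
```

Running it with APPLY=1 before the copy, and re-adding sfq afterwards, would give the same with/without comparison as the reboot described above.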

Additional Information...

NIC is configured with IPv4:
haswell ~ # ifconfig
enp3s0: flags=4163<UP,BROADCAST,RUNNING,MULTICAST>  mtu 1500
        inet 192.168.0.9  netmask 255.255.255.0  broadcast 192.168.0.255
        ether d4:3d:7e:bd:89:44  txqueuelen 1000  (Ethernet)
        RX packets 7399483  bytes 511559908 (487.8 MiB)
        RX errors 0  dropped 0  overruns 0  frame 0
        TX packets 91781850  bytes 47176316774 (43.9 GiB)
        TX errors 0  dropped 0 overruns 0  carrier 0  collisions 0
        device interrupt 19  

ethtool enp3s0
Settings for enp3s0:
	Supported ports: [ TP ]
	Supported link modes:   10baseT/Half 10baseT/Full 
	                        100baseT/Half 100baseT/Full 
	                        1000baseT/Full 
	Supported pause frame use: Symmetric Receive-only
	Supports auto-negotiation: Yes
	Supported FEC modes: Not reported
	Advertised link modes:  10baseT/Half 10baseT/Full 
	                        100baseT/Half 100baseT/Full 
	                        1000baseT/Full 
	Advertised pause frame use: Symmetric
	Advertised auto-negotiation: Yes
	Advertised FEC modes: Not reported
	Speed: 1000Mb/s
	Duplex: Full
	Port: Twisted Pair
	PHYAD: 0
	Transceiver: internal
	Auto-negotiation: on
	MDI-X: Unknown
	Current message level: 0x000060e4 (24804)
			       link ifup rx_err tx_err hw wol
	Link detected: yes

While copying over the Gigabit-Network, speed is near maximum:

ifstat
      enp2s0      
 KB/s in  KB/s out
    0.06      0.18
 8348.65     31.60
117536.2    435.11
118049.0    435.04
119100.9    434.84
118889.7    435.19
119004.1    444.53
119061.4    440.47
119102.8    444.04
119077.4    444.39
119084.1    432.32
119089.6    439.71
[...]

So, perhaps the sfq-Kernel-module has a bug. I use the vanilla kernel from kernel.org and sfq is compiled as a module. 

/usr/src/linux # grep SFQ .config
CONFIG_NET_SCH_SFQ=m

Perhaps important: the server with the target-share also uses sfq with the same settings without a problem. It runs stable.
Comment 1 Mario Bachmann 2018-09-15 08:46:03 UTC
The ifstat output is from the target machine, so the incoming file is shown as "KB/s in".
Comment 2 Mario Bachmann 2018-09-15 12:31:01 UTC
Same behaviour with fresh Linux Kernel 4.18.8.
Comment 3 Andrea Claudi 2018-09-17 15:41:08 UTC
Hi Mario, thanks for your report.
Can you please provide a stack trace for the crash?
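
One possible way to get the trace as text is netconsole, which sends kernel messages over UDP to a second machine. This is a sketch under assumptions: the addresses are taken from the NFS mount line in the description (192.168.0.9 is the crashing machine, 192.168.0.7 the server), the ports are the netconsole defaults, and the helper netconsole_param is hypothetical. The script only prints the commands to run.

```shell
#!/bin/sh
# Build the netconsole module parameter. Format from the kernel docs:
#   netconsole=[src-port]@[src-ip]/[dev],[tgt-port]@<tgt-ip>/[tgt-mac]
# An empty tgt-mac means the messages are sent to the broadcast MAC.
netconsole_param() {
    # $1 = sending interface, $2 = its IP, $3 = IP of the log receiver
    echo "netconsole=6665@$2/$1,6666@$3/"
}

param=$(netconsole_param enp3s0 192.168.0.9 192.168.0.7)

# On the receiver, start a UDP listener first, for example: nc -u -l 6666
# (the exact flags differ between netcat variants).
echo "dmesg -n 8"                      # raise the console log level
echo "modprobe netconsole $param"      # load netconsole on the crashing box

# If the journal is persistent, the previous boot's kernel log may also
# contain the trace after the reboot:
echo "journalctl -k -b -1"
```

Whether the oops actually reaches the receiver depends on how far the network stack still works at the time of the crash, so a serial console or the screenshot remain useful fallbacks.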
Comment 4 Mario Bachmann 2018-09-18 17:31:00 UTC
Created attachment 278639 [details]
Picture of the Screen after the Oops.

The messages after the copy command.
Comment 5 baldzhang 2018-10-24 23:03:04 UTC
Created attachment 279143 [details]
netem crash kernel also

I hit almost the same issue and reproduced it on 4.14.71.

after system startup, run:
  [~]# tc qdisc add dev eth0 root netem delay 5ms

Then just ping the IP address of eth0 from another machine, and the kernel crashes.

4.14.70 is ok under the same operation
4.19 is ok
4.18.10+ is ok, couldn't confirm versions before 4.18.10
Comment 6 baldzhang 2018-11-04 16:57:51 UTC
4.14.79: netem is ok.
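
The version data in comments 5 and 6 (4.14.70 ok, 4.14.71 crashes, 4.14.79 ok) narrows the netem problem to two short ranges of stable releases. As a sketch, assuming a clone of the linux-stable tree with these tags available, the scheduler changes in each range could be listed as follows. The helper sched_changes_cmd is hypothetical and only prints the git command.

```shell
#!/bin/sh
# Print a git command that lists net/sched changes between two tags.
sched_changes_cmd() {
    # $1 = older tag, $2 = newer tag
    echo "git log --oneline $1..$2 -- net/sched"
}

# Range in which the netem crash appeared:
sched_changes_cmd v4.14.70 v4.14.71
# Range in which it disappeared again:
sched_changes_cmd v4.14.71 v4.14.79
```

This does not show whether the sfq crash from the original description has the same cause. It only limits where to look for the netem case.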
Comment 7 Dennis 2019-11-11 15:01:06 UTC
I have a similar situation


1. From another Computer, send a file to my server.
scp some_file to _my_server


2. ADD a qdisc sfq into the default class
tc qdisc add dev ifb1 root handle 0:0 hfsc default 100
tc class add dev ifb1 parent 0:0 classid 1 hfsc sc rate 500mbit ul rate 500mbit
tc class add dev ifb1 parent 0:1 classid 100 hfsc sc rate 200mbit ul rate 200mbit
-> tc qdisc add dev ifb1 parent 0:100 sfq perturb 10


3. The kernel crashes without any log output. The whole system freezes.


The problem occurs with every 5.x kernel I tried (tested up to 5.3.10).
With 4.x, it works like a charm.
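
The steps above use ifb1 but do not show how traffic reaches it. A common setup, given here only as an assumption about what the reporter's configuration might look like, redirects ingress traffic from the physical interface to the ifb device. The interface name eth0 and the helpers run and setup_ifb_redirect are hypothetical. The script prints the commands unless APPLY=1 is set.

```shell
#!/bin/sh
# Print the commands by default; set APPLY=1 (as root) to execute them.
run() {
    if [ "${APPLY:-0}" = "1" ]; then "$@"; else echo "$*"; fi
}

setup_ifb_redirect() {
    # $1 = physical interface (assumed: eth0), $2 = ifb device
    run modprobe ifb numifbs=2          # creates ifb0 and ifb1
    run ip link set dev "$2" up
    run tc qdisc add dev "$1" handle ffff: ingress
    # Match every IPv4 packet and redirect it to the ifb device.
    run tc filter add dev "$1" parent ffff: protocol ip u32 match u32 0 0 \
        action mirred egress redirect dev "$2"
}

setup_ifb_redirect eth0 ifb1
```

With a redirect like this in place, the hfsc and sfq commands from step 2 shape the incoming scp traffic, which matches the order of steps in the report.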
Comment 8 Dennis 2019-11-14 16:47:29 UTC
Same situation with Kernel 5.3.11.
Comment 9 Dennis 2020-03-10 11:08:20 UTC
I upgraded my kernel to version 5.8.8, and the freezes continue.
(I used the same .config file that I compiled 4.x with, and on a 4.x kernel everything is fine.)



1. From another Computer, send a file to my server.
$ scp test.bin to_my_server


2. ADD a qdisc sfq into the default class (see the last line)
$ tc qdisc add dev ifb1 root handle 0:0 hfsc default 100
$ tc class add dev ifb1 parent 0:0 classid 1 hfsc sc rate 500mbit ul rate 500mbit
$ tc class add dev ifb1 parent 0:1 classid 100 hfsc sc rate 200mbit ul rate 200mbit
$ -> tc qdisc add dev ifb1 parent 0:100 sfq perturb 10


3. The kernel crashes without any log output. The whole system freezes.



Thanks in advance.
