Bug 88161 - High traffic causes a lot of softirqs
Summary: High traffic causes a lot of softirqs
Status: NEEDINFO
Alias: None
Product: Networking
Classification: Unclassified
Component: Other
Hardware: Intel Linux
Importance: P1 high
Assignee: Stephen Hemminger
URL:
Keywords:
Depends on:
Blocks:
 
Reported: 2014-11-13 14:18 UTC by Mike Zupan
Modified: 2017-04-03 12:59 UTC
CC List: 4 users

See Also:
Kernel Version: 3.17.2
Subsystem:
Regression: No
Bisected commit-id:



Description Mike Zupan 2014-11-13 14:18:19 UTC
I'm using packaged RPMs from CentOS and ELRepo, with the same results on both, and I can replicate this on any server in our cluster.

I installed

kernel-3.10.56-11.el6.centos.alt.x86_64

from the centosplus repo to solve a problem where 2.6 was locking up the process tree under high CPU load. That fixed it, but it introduced another issue: we see a lot of softirq requests under heavy traffic load.

I am currently running

[root@web125-east.domain.com /var/www/html]# uname -a
Linux web125-east.domain.com 3.17.2-1.el6.elrepo.x86_64 #1 SMP Fri Oct 31 10:37:44 EDT 2014 x86_64 x86_64 x86_64 GNU/Linux

Here is powertop output from a 2.6-series server:

Summary: 42492.1 wakeups/second, 0.0 GPU ops/seconds, 0.0 VFS ops/sec and 2422.0% CPU use

                Usage Events/s Category Description
            22613 ms/s 23637.4	Process php-fpm: pool www
            716.9 ms/s 15783.2	Process nginx: worker process
             21.3 ms/s 1096.1	Process /usr/bin/java -Xms200m -Xmx2000m -Xss256k -XX:MaxDirectMemorySize=516m -XX:+UseParNewGC -XX:+UseConcMarkSweepGC -Dage
              5.8 ms/s 674.4 Process /usr/sbin/gmond
            130.0 ms/s 494.5 Process /usr/bin/redis-server 127.0.0.1:6379
             73.2 ms/s 487.4 Process python /usr/bin/statsd-relay.py
              3.8 ms/s 82.7 Process java -Xmx6g -server -Dfile.encoding=utf-8 -XX:OnOutOfMemoryError=kill -9 %p -XX:+HeapDumpOnOutOfMemoryError -XX:HeapD
            212.4 ms/s 0.00 Interrupt [3] net_rx(softirq)


Here it is from 3.10


                Usage Events/s Category Description
             10.2 ms/s 1033.6	Timer hrtimer_wakeup
              3.3 ms/s 932.7 Process /usr/bin/java -Xms200m -Xmx2000m -Xss256k
            591.1 ms/s 624.3 Process php-fpm: pool www
             41.5 ms/s 724.0 Interrupt [3] net_rx(softirq)

Load just keeps climbing, up into the 500s.

There is also a lot of CPU usage from

  116 root 20 0 0 0 0 R 75.0 0.0 0:04.57 kworker/u66:0

which, from my understanding, handles a lot of the ACPI calls that softirq is doing.

I've tried many other 3.x kernels above 3.10 with the same results, so I'm wondering if this is a known issue.
Comment 1 Mike Zupan 2014-11-13 14:30:17 UTC
Sorry, here are the NICs we have on the system:

06:00.0 Ethernet controller: Intel Corporation I350 Gigabit Network Connection (rev 01)
06:00.1 Ethernet controller: Intel Corporation I350 Gigabit Network Connection (rev 01)
06:00.2 Ethernet controller: Intel Corporation I350 Gigabit Network Connection (rev 01)
06:00.3 Ethernet controller: Intel Corporation I350 Gigabit Network Connection (rev 01)
Comment 2 Alan 2014-12-08 22:03:25 UTC
kworker just handles offloaded work, so if the box is being hammered then it's not unreasonable for it to be high.

What makes you think it's doing lots of ACPI calls?
Comment 3 dmitry.samsonov 2015-09-28 10:15:22 UTC
Same issue here with all CentOS 7 and ELRepo kernels.
NET_RX is many times bigger than NET_TX:
NET_TX:      94345      94868      94714      94441      96972      97641
NET_RX:  466312374  484706991  484924300  494927859  500039928  499807940
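
To put a number on "many times bigger", here is a minimal Python sketch that parses text in the /proc/softirqs layout and computes the NET_RX to NET_TX ratio per CPU column. The function names are made up for illustration, and the sample is just the two rows quoted above. The counters are cumulative since boot, so a single snapshot shows totals, not current rates.

```python
def parse_softirqs(text):
    """Map each softirq name (e.g. 'NET_RX') to its list of per-CPU counts."""
    counts = {}
    for line in text.splitlines():
        name, sep, rest = line.partition(":")
        if not sep:
            continue  # the "CPU0 CPU1 ..." header row has no colon
        counts[name.strip()] = [int(value) for value in rest.split()]
    return counts


def rx_tx_ratio(counts):
    """Return NET_RX / NET_TX for each CPU column (inf if NET_TX is zero)."""
    ratios = []
    for rx, tx in zip(counts["NET_RX"], counts["NET_TX"]):
        ratios.append(rx / tx if tx else float("inf"))
    return ratios


# The two rows quoted in this comment; on a live system the input would
# instead be the contents of /proc/softirqs.
SAMPLE = """\
NET_TX:      94345      94868      94714      94441      96972      97641
NET_RX:  466312374  484706991  484924300  494927859  500039928  499807940
"""

if __name__ == "__main__":
    for column, ratio in enumerate(rx_tx_ratio(parse_softirqs(SAMPLE))):
        print(f"column {column}: NET_RX/NET_TX = {ratio:.0f}")
```

Taking two snapshots a few seconds apart and subtracting them would give per-second rates, which are easier to compare between kernels than totals since boot.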
Comment 4 Antonio 2017-04-03 02:48:32 UTC
I'm seeing similar behavior with the same network card.

I updated from kernel 3.10.58 to 4.4.38, and under higher traffic I start to get a lot of packet loss, with one or two of my CPU cores getting locked up (as shown in htop).

I recently updated to kernel 4.9.20, with the same results. I've tried some options (queue size, GRO, ...) to improve network performance, but the issue is still at the softirq level. In top, the traffic seems to get pinned to a softirq thread, which didn't happen with the previous kernel:


  PID USER      PR  NI    VIRT    RES    SHR S  %CPU %MEM     TIME+ COMMAND 
   29 root      20   0       0      0      0 R  99.9  0.0   0:21.30 ksoftirqd/3 
  114 root      20   0       0      0      0 R  99.6  0.0   0:19.02 ksoftirqd/17
 


In /proc/softirqs and /proc/interrupts I didn't find anything strange.
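
Since only two ksoftirqd threads are pegged, one thing that may be worth checking in /proc/interrupts is whether the NIC's interrupts land on just a few cores. The following is a rough Python sketch that sums the per-CPU counts of all rows whose description contains a given string. The sample text, the IRQ numbers, and the queue names such as eth2-TxRx-0 are invented for illustration and are not taken from this system.

```python
def per_cpu_irq_totals(text, match):
    """Sum per-CPU interrupt counts over rows whose description contains `match`."""
    lines = text.splitlines()
    ncpu = len(lines[0].split())  # header row: CPU0 CPU1 ...
    totals = [0] * ncpu
    for line in lines[1:]:
        label, sep, rest = line.partition(":")
        fields = rest.split()
        if not sep or len(fields) <= ncpu:
            continue  # rows such as ERR carry a single total and no description
        if match not in " ".join(fields[ncpu:]):
            continue
        for cpu, value in enumerate(fields[:ncpu]):
            totals[cpu] += int(value)
    return totals


# Hypothetical 4-CPU sample in the /proc/interrupts layout (made-up values).
SAMPLE_INTERRUPTS = """\
           CPU0       CPU1       CPU2       CPU3
 45:         10          0          0     900000   PCI-MSI-edge      eth2-TxRx-0
 46:          0     120000          0          5   PCI-MSI-edge      eth2-TxRx-1
 47:        300        200        100         50   PCI-MSI-edge      eth0
NMI:          1          2          3          4   Non-maskable interrupts
ERR:          0
"""

if __name__ == "__main__":
    totals = per_cpu_irq_totals(SAMPLE_INTERRUPTS, "eth2")
    for cpu, total in enumerate(totals):
        print(f"CPU{cpu}: {total}")
```

If nearly all of an interface's interrupts fall on one or two CPUs, that would be consistent with a couple of ksoftirqd threads sitting near 100% while the other cores stay idle.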

Please tell me if there is any more info I can get to help.
My server is just doing routing with iptables.


Regards,
Comment 5 Antonio 2017-04-03 12:59:41 UTC
Some extra info: after reading issue #109581 I set the qdisc to prio_fast, and there is no more CPU usage in softirq.

Rules that I have installed:
ip link set dev eth2.24 txqlen 1000
tc qdisc del dev eth2.24 root
tc qdisc add dev eth2.24 root handle 1: prio bands 3
tc qdisc add dev eth2.24 parent 1:1 handle 10: pfifo limit 50
tc qdisc add dev eth2.24 parent 1:2 handle 2: hfsc default 2
tc class add dev eth2.24 parent 2: classid 2:1 hfsc sc rate 300000kbit ul rate 300000kbit
tc class add dev eth2.24 parent 2: classid 2:2 hfsc sc rate 300000kbit ul rate 300000kbit
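
For anyone who wants to try the same layout on a different interface or rate, here is a small Python sketch that only generates the command strings above from parameters and prints them. Nothing is executed. The function name and its parameters are hypothetical, and the commands themselves are copied from the rules in this comment, not independently validated.

```python
def build_tc_commands(dev, rate_kbit, txqlen=1000, pfifo_limit=50):
    """Return the rule set from this comment as strings, parameterized by device and rate."""
    rate = f"{rate_kbit}kbit"
    return [
        f"ip link set dev {dev} txqlen {txqlen}",
        f"tc qdisc del dev {dev} root",
        # Root prio qdisc with three bands.
        f"tc qdisc add dev {dev} root handle 1: prio bands 3",
        # Band 1: short pfifo queue.
        f"tc qdisc add dev {dev} parent 1:1 handle 10: pfifo limit {pfifo_limit}",
        # Band 2: hfsc with two classes at the same rate, class 2:2 as default.
        f"tc qdisc add dev {dev} parent 1:2 handle 2: hfsc default 2",
        f"tc class add dev {dev} parent 2: classid 2:1 hfsc sc rate {rate} ul rate {rate}",
        f"tc class add dev {dev} parent 2: classid 2:2 hfsc sc rate {rate} ul rate {rate}",
    ]


if __name__ == "__main__":
    # Dry run: print the commands for review instead of running them.
    for command in build_tc_commands("eth2.24", 300000):
        print(command)
```

Printing first and applying by hand keeps the change easy to review, since the second command removes whatever root qdisc is currently on the device.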


Side note: 
I notice that this issue is more than one year old. Did you manage to solve it? Can you share how?

Regards,
