Bug 5039
Summary: | high cpu usage (softirq takes 50% all the time) | ||
---|---|---|---|
Product: | Process Management | Reporter: | Alvaro (askxfs) |
Component: | Other | Assignee: | process_other |
Status: | RESOLVED CODE_FIX | ||
Severity: | normal | CC: | 7r3ni7y, akpm, ismail, laforge, migo, protasnb, romieu |
Priority: | P2 | ||
Hardware: | i386 | ||
OS: | Linux | ||
Kernel Version: | 2.6.12 | Subsystem: | |
Regression: | --- | Bisected commit-id: | |
Attachments: |
Screenshot of top from latest procps.
top stats |
Description
Alvaro
2005-08-09 18:54:47 UTC
Running `top' would help. Also see Documentation/basic_profiling.txt in the kernel source tree.

Created attachment 5582 [details]
Screenshot of top from latest procps.
I have used top, it didn't report any process using 50% cpu.
attached is a screenshot of it running.
I will read the indicated document and see if I can narrow it down.
let me know if there's anything else I can do.
Thanks
Hello, I just figured out that passing noapic to the kernel in grub stopped this behaviour. I am not sure what that will break now.. :-)

I see the same issue of soft irqs "stuck" at 50% on a Compaq Presario V2311US laptop with a Turion64 CPU running gentoo-amd64, and it does get "fixed" by passing noapic to the kernel here as well. What further information do you need to properly fix this?

I have the same issue (si, soft-irq in top stuck at 50%) and the system runs really slow. It is an AMD Turion 64 Mobile ML-34 1.8GHz inside an HP nx6125 laptop.
I also tried to add the "noapic" option to the kernel boot parameters but that didn't help. Instead I only got a kernel fault during bootup when initializing the CPU.
The default kernel from the Gentoo install CD didn't show this behaviour, so I tried to compile the newest (at this time) kernel 2.6.13.3 with the config.gz from the Gentoo default kernel. This indeed worked, so it seemed to be a kernel configuration problem.
I successively tried to enable/disable options and finally found out that the soft-irq load gets stuck at 50% if I DON'T compile SMP (multiprocessor) into the kernel.
Hope this information helps.

Additionally I found out that my system clock runs at double speed IF I put SMP into the kernel. So neither option (WITH or WITHOUT SMP) is really ideal, but with SMP at least my machine runs at full speed.

Created attachment 6415 [details]
top stats
screenshot of my top stats from router with high % of softirq
Distribution: PLD Linux

Hardware:
00:00.0 Host bridge: Intel Corporation 82865G/PE/P DRAM Controller/Host-Hub Interface (rev 02)
00:01.0 PCI bridge: Intel Corporation 82865G/PE/P PCI to AGP Controller (rev 02)
00:1e.0 PCI bridge: Intel Corporation 82801 PCI Bridge (rev c2)
00:1f.0 ISA bridge: Intel Corporation 82801EB/ER (ICH5/ICH5R) LPC Interface Bridge (rev 02)
00:1f.1 IDE interface: Intel Corporation 82801EB/ER (ICH5/ICH5R) IDE Controller (rev 02)
00:1f.3 SMBus: Intel Corporation 82801EB/ER (ICH5/ICH5R) SMBus Controller (rev 02)
01:00.0 VGA compatible controller: ATI Technologies Inc RV280 [Radeon 9200 PRO] (rev 01)
01:00.1 Display controller: ATI Technologies Inc: Unknown device 5940 (rev 01)
02:0b.0 Ethernet controller: Realtek Semiconductor Co., Ltd. RTL-8169 Gigabit Ethernet (rev 10)
02:0c.0 Ethernet controller: Realtek Semiconductor Co., Ltd. RTL-8169 Gigabit Ethernet (rev 10)

[root@router root]# cat /proc/cpuinfo
processor       : 0
vendor_id       : GenuineIntel
cpu family      : 15
model           : 4
model name      : Intel(R) Celeron(R) CPU 2.53GHz
stepping        : 1
cpu MHz         : 2533.770
cache size      : 256 KB
fdiv_bug        : no
hlt_bug         : no
f00f_bug        : no
coma_bug        : no
fpu             : yes
fpu_exception   : yes
cpuid level     : 5
wp              : yes
flags           : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm pbe pni monitor ds_cpl cid xtpr
bogomips        : 4997.12

[root@router root]# free
             total       used       free     shared    buffers     cached
Mem:        515108     475108      40000          0     281632      37120
-/+ buffers/cache:     156356     358752
Swap:       497972          0     497972

Software: kernel from distribution
[root@router root]# uname -a
Linux router 2.6.11.10-6 #1 Fri May 27 22:44:33 CEST 2005 i686 Intel(R)_Celeron(R)_CPU_2.53GHz unknown PLD Linux

Problem: High usage of softirq, sometimes % of si growing to 99%

[root@router root]# vmstat
procs -----------memory---------- ---swap-- -----io---- --system-- ----cpu----
 r  b   swpd   free   buff  cache   si   so    bi    bo   in    cs us sy id wa
 0  0      0  39744 281632  36680    0    0    12    59   89    53  1 61 36  2
[root@router root]# vmstat
procs -----------memory---------- ---swap-- -----io---- --system-- ----cpu----
 r  b   swpd   free   buff  cache   si   so    bi    bo   in    cs us sy id wa
 0  0      0  39888 281632  36684    0    0    12    59   89    53  1 61 36  2

top - 09:49:59 up 16:32,  2 users,  load average: 0.11, 0.21, 0.18
Tasks:  64 total,   1 running,  61 sleeping,   0 stopped,   2 zombie
Cpu(s):  0.0% us,  0.0% sy,  0.3% ni, 28.6% id,  0.0% wa,  2.0% hi, 69.1% si
Mem:    515108k total,   476460k used,    38648k free,   281632k buffers
Swap:   497972k total,        0k used,   497972k free,    36684k cached

bugme-daemon@kernel-bugs.osdl.org wrote:
>
> http://bugzilla.kernel.org/show_bug.cgi?id=5039
>

Well this is a depressing saga. A bunch of people whose machines appear to be spending 50% CPU time in softirq processing. Some find that `noapic' fixes it and some dont. Some are on x86_64, some are on x86.

I suspect we have multiple bugs here. It's quite a mess.

Could the reporters please, via a reply-to-all to this email:

a) test 2.6.14.

b) summarise the current status of your bug (what CPU type, what are the symptoms, any known workarounds, etc).

c) Generate a kernel profile (see Documentation/basic_profiling.txt)

Generally, let's get a bit of action happening here and see if we can get these relatively long-standing regressions fixed up, thanks.

Reply-To: arjan@infradead.org

On Sat, 2005-10-29 at 01:09 -0700, Andrew Morton wrote:
> bugme-daemon@kernel-bugs.osdl.org wrote:
> >
> > http://bugzilla.kernel.org/show_bug.cgi?id=5039
> >
>
> Well this is a depressing saga. A bunch of people whose machines appear to
> be spending 50% CPU time in softirq processing. Some find that `noapic'
> fixes it and some dont. Some are on x86_64, some are on x86.
>
> I suspect we have multiple bugs here. It's quite a mess.
>
> Could the reporters please, via a reply-to-all to this email:
>
> a) test 2.6.14.
>
> b) summarise the current status of your bug (what CPU type, what are the
> symptoms, any known workarounds, etc).
>
> c) Generate a kernel profile (see Documentation/basic_profiling.txt)

d) get a /proc/interrupts to see if there's some screaming irq.. and if so which one

Summary: Today we changed the processor to a P4 2.4GHz; in a stable situation I now have 20-30% si (before, on the Celeron 2.5GHz, it was 50-70%). The problems started after our ISP bandwidth changed from 10 to 22Mbps. After upgrading to kernel 2.6.14 I don't see any change.
Here are the answers to points c) and d):
http://www.abp.pl/migo/softirq_problem.txt

CPU P4 2.4GHz, Broadcom Corporation NetXtreme BCM5701 Gigabit Ethernet adapters.
Iptables generates a lot of work for the processor, but why in softirq?

Ok, some comments:

1) I also see the "50% cpu usage" problem on my Turion64. I fix this (and many other ACPI related problems) by passing "acpi=off" to the kernel command line. The only remaining problem now is that the clock runs at roughly thrice the speed, i.e. a second only lasts one third of a second ;)
[just a side note] This 'turion64' problem shows 49-51% cpu usage at all times, constantly, no fluctuation. Oprofile indicated that it was somewhere in the ACPI idle handling code. Didn't have the time for any more debugging. This is totally unrelated to iptables, however.

2) iptables processing runs, like almost all of the network stack, in softirq context. So if you have many rules (probably even in a not very much optimized layout) and high bandwidth, it's not unlikely that you spend a significant amount of time in softirq context, especially on dedicated firewalls and the like. This usage should have an almost linear dependency on the number of packets per second the system processes. Also, the system from bug reporter 'migo' has a number of tc filter rules, and it runs connection tracking (and all of that runs in softirq). So besides 'iptables shall be optimized further', I don't really see why this should be a bug.

3) why does 'migo' profile show "ct_get_next" in the profile?
This looks like somebody is constantly reading /proc/net/ip_conntrack from userspace (which is an _extremely_ expensive operation that should only happen for debugging). If you need that information, use ip_conntrack_netlink and libnetfilter_conntrack.

Answer after Harald Welte's suggestions.
I tried running the system without iptables and tc rules: si is 6-10% on the P4 2.4GHz. After turning on the firewall (iptables, ipset and a few tc rules) I have 16-35% si.
I turned off the simple conntrack monitoring by mrtg ("cat /proc/net/ip_conntrack 2> /dev/null | wc -l") and I didn't observe any change.
Here is the log from the profile before and after turning on the firewall:
http://www.abp.pl/migo/softirq_problem2.txt
Maybe Harald is right. Maybe it's a bug in my firewall. If that's true, is there anybody who can help me optimize it (or look at it to check whether it is a firewall problem or not)?

Migo, can you send the .config and the dmesg of the bi r8169/celeron box ?

I do not get it:
a - the IRQ/s rate is low (~100, i.e. HZ) ;
b - the system is idle one third of the time (~30%) ;
c - each of the two interfaces receives ~3kpps ;
d - the napi quota of the r8169 equals 64.

c + d => in average, rtl8169_poll() processes all Rx pending packets (< 64) in one iteration. The r8169 driver reenables its irq immediately.

a => however the r8169 driver is not delivered any irq more than once every 100ms.

Imho Harald is completely right here: iptables is not the issue. Something delays the softirq processing.

Comments, anyone ?

--
Ueimor

Francois, here are the dmesg and .config (2.6.14) from my box. (There is now a P4 2.4GHz, not the Celeron 2.5, because %si usage on the Celeron was twice that of the P4 and I had problems accessing the box; but on the P4 I still have over 30% softirq usage.)
http://www.abp.pl/migo/softirq_problem3_dmesg.txt
http://www.abp.pl/migo/softirq_problem3_config.txt
(after Thursday I'll try 2.4)

> Francois there are dmesg and .config (2.6.14) from my box. (now there is p4
> 2.4ghz not celeron 2.5 because %si usage on celeron was twice than p4 and i had

Ok. Is it the same motherboard (CPU and network cards apart) ?

The r8169 is simple enough for me. The tg3 is a bit more clever. :o)

> problems with access to this box, but on p4 i have still over 30% usage of
> softirq).
> http://www.abp.pl/migo/softirq_problem3_dmesg.txt

The dmesg is a bit truncated.

> http://www.abp.pl/migo/softirq_problem3_config.txt

Have you ever compiled ACPI ? You will still be able to disable it as suggested by Harald. ACPI tells a lot of things at the start of the dmesg.

By the way, you can enable X86_UP_APIC/X86_UP_IOAPIC (it is not related to ACPI).

The usual dmesg, lspci and /proc/interrupts would be welcome.

Before going to 2.4, the behavior of 2.6.14 with X86_UP_APIC/X86_UP_IOAPIC + ACPI compiled in but enabled/disabled via the boot command line could help.

--
Ueimor

Yes Francois, it's the same motherboard. I changed only the CPU. (Two days ago I changed all the hardware for tests: CPU, memory, motherboard and NICs. The effect was the same as on this hardware. Now the old hardware is back, so the situation is: only the CPU changed.)

>Have you ever compiled ACPI ? You will still be able to disable it as
>suggested
>by Harald. ACPI tells a lot of thing at the start of the dmesg.

Yes, ok, I recompiled the kernel with ACPI and APIC (IOAPIC too).
So here are the logs of the current behaviour of the system:
http://www.abp.pl/migo/softirq_problem4_dmesg.txt
http://www.abp.pl/migo/softirq_problem4_interrupts_and_profile.txt
http://www.abp.pl/migo/softirq_problem4_lspci.txt
Any suggestions?

Migo:
[...]
> Any sugestions?
- enable the debug options and bump up LOG_BUF_SHIFT. dmesg -s will help you see
the whole content of the log buffer
- compile as module the usb subsystem then load the uhci-hcd modules
If it does not make a difference, enable SMP/HT.
The usual .config/dmesg/interrupts... will be welcome. I will not mind if the
crypto messages disappear from the log :o)
It is a bit voodooish but we share the same components on the motherboard
(integrated video apart).
--
Ueimor
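
A minimal sketch (not part of the thread) of the back-of-the-envelope check behind the earlier "c + d" remark: it divides the packet rate by the interrupt rate and compares the result with the NAPI quota. The input values are the approximate figures quoted above (~3 kpps per interface, ~100 IRQ/s, quota 64); the function names are made up for illustration, and treating the IRQ rate as a per-interface figure is an assumption.

```python
# Rough arithmetic only; the figures are the approximate ones quoted in the
# thread, not new measurements. Function names are hypothetical.

def packets_per_irq(pps: float, irqs_per_sec: float) -> float:
    """Average number of packets that accumulate between two interrupts."""
    if irqs_per_sec <= 0:
        raise ValueError("irqs_per_sec must be positive")
    return pps / irqs_per_sec


def fits_in_one_poll(pps: float, irqs_per_sec: float, napi_quota: int = 64) -> bool:
    """True if the average backlog per interrupt stays within one NAPI quota."""
    return packets_per_irq(pps, irqs_per_sec) <= napi_quota


PPS_PER_INTERFACE = 3000  # "~3kpps" per interface
IRQS_PER_SEC = 100        # "~100, i.e. HZ" (assumed to apply per interface)
NAPI_QUOTA = 64           # r8169 quota mentioned in the thread

backlog = packets_per_irq(PPS_PER_INTERFACE, IRQS_PER_SEC)
print(f"average packets per interrupt: {backlog:.0f} (quota {NAPI_QUOTA})")
```

Under these assumed numbers the average backlog stays below the quota, which matches the statement that one poll iteration should normally drain the Rx ring.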
Francois:

>- enable the debug options and bump up LOG_BUF_SHIFT. dmesg -s will help you see
> the whole content of the log buffer

done

>- compile as module the usb subsystem then load the uhci-hcd modules

done

>If it does not make a difference, enable SMP/HT.

There was no difference, so I compiled with SMP/HT.
I still have high softirq usage with this traffic: 30723.4 kbits/sec, 6077.4 packets/sec.
In top: 0.7% us, 0.3% sy, 0.0% ni, 66.4% id, 0.0% wa, 0.3% hi, 32.2% si
(after the upgrade from the Celeron 2.5 to the P4 2.4, si usage dropped from 60-70% to 20-35%, so I still have the problem)

// my firewall stats:
router ~ # tc qdisc show | wc -l
26
router ~ # tc qdisc show dev eth0 | wc -l
13
router ~ # tc qdisc show dev eth1 | wc -l
13
router ~ # tc filter show dev eth0 | wc -l
18
router ~ # tc filter show dev eth1 | wc -l
18
router ~ # tc class show dev eth0 | wc -l
13
router ~ # tc class show dev eth1 | wc -l
13
router ~ # iptables-save | wc -l
117
router ~ # ipset -S | wc -l
638

Here are the kernel config/interrupts/dmesg:
http://www.abp.pl/migo/softirq_problem5.txt
http://www.abp.pl/migo/softirq_problem5_config.txt

Today I tried the 2.4.31 kernel. The effect is surprising: on 2.4 I observed over 40% processor usage (system time). Tomorrow I'll check the network cards.
Here are the logs (profile, interrupts, dmesg):
http://www.abp.pl/migo/softirq_problem6_2.4.31.txt
What's wrong?

Migo:
[...]
> >- compile as module the usb subsystem then load the uhci-hcd modules
> done
Please increase LOG_BUF_SHIFT, enable CONFIG_ACPI_DEBUG and send dmesg
(the unabbreviated, 2.4-like version) + /proc/interrupts + vmstat 1 + lsmod with
usb loaded and HT/SMP on.
What is the exact model of this motherboard ?
--
Ueimor
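
For the "screaming irq" question, raw /proc/interrupts counters are easier to judge as rates. The following is a minimal sketch (not from the thread) that diffs two snapshots of /proc/interrupts and reports interrupts per second for each line; the function names and the default one-second interval are assumptions made for illustration.

```python
import time


def parse_interrupts(text):
    """Map each label in /proc/interrupts text to (total count, description)."""
    lines = text.strip().splitlines()
    ncpus = len(lines[0].split())  # header row: CPU0 CPU1 ...
    counts = {}
    for line in lines[1:]:
        if ":" not in line:
            continue
        label, _, rest = line.partition(":")
        fields = rest.split()
        per_cpu = []
        for field in fields[:ncpus]:
            if not field.isdigit():
                break  # lines such as ERR/MIS may carry fewer counters
            per_cpu.append(int(field))
        counts[label.strip()] = (sum(per_cpu), " ".join(fields[len(per_cpu):]))
    return counts


def irq_rates(before, after, seconds):
    """Interrupts per second for every label present in both snapshots."""
    old, new = parse_interrupts(before), parse_interrupts(after)
    return {k: (new[k][0] - old[k][0]) / seconds for k in new if k in old}


def sample_rates(path="/proc/interrupts", interval=1.0):
    """Take two snapshots `interval` seconds apart (Linux only)."""
    with open(path) as f:
        before = f.read()
    time.sleep(interval)
    with open(path) as f:
        after = f.read()
    return irq_rates(before, after, interval)
```

Calling `sample_rates()` on the affected box and sorting the result by value would show which line, if any, fires far more often than the traffic justifies.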
>Please increase LOG_BUF_SHIFT, enable CONFIG_ACPI_DEBUG

Increased and enabled.

>and send dmesg
>(the unabbreviated, 2.4-like version) + /proc/interrupts + vmstat 1 + lsmod with
>usb loaded and HT/SMP on.

I don't know how to send dmesg in 2.4-like style, so I simply ran dmesg -s 131072.

> What is the exact model of this motherboard ?

P4P800-VM with a P4 2.4GHz and 2 gigabit 3com NICs.

Maybe it's normal processor usage :| ?

I forgot to add the link to the new stats and logs:
http://www.abp.pl/migo/softirq_problem7.txt

Migo:

> >usb loaded and HT/SMP on.
>
> I don't know how to send dmesg in 2.4 like style. So i simple dmesg -s 131072

If the init script does not save the early dmesg somewhere (/var/log ?), it may help to issue the dmesg in the very first script.

> > What is the exact model of this motherboard ?
>
> P4P800-VM with P4 2,4GHz and 2 gigabit 3com nic's.

Any fancy bios option related to USB, ACPI or IO-APIC ?

> Mayby it's normal processor ussage :| ?

I doubt it: pps and idle are so high that NAPI can not explain the low irq/s.

--
Ueimor

> > I don't know how to send dmesg in 2.4 like style. So i simple dmesg -s
> > 131072
>
> If the init script do not save the early dmesg somewhere (/var/log ?), it
> may help to issue the dmesg in the very first script.

The scripts are the same for the 2.4 and 2.6 kernels in PLD. In /etc/rc.d/rc.sysinit:
-- cut --
dmesg -s 131072 > /var/log/dmesg
-- cut --

> > > What is the exact model of this motherboard ?
> >
> > P4P800-VM with P4 2,4GHz and 2 gigabit 3com nic's.
>
> Any fancy bios option related to USB, ACPI or IO-APIC ?

I can't check the BIOS settings until Monday because the router is far away from me ;). But I think that the USB settings are off, and there aren't any fancy settings for ACPI or IO-APIC. I'll check it next week. But I tested another motherboard 2 weeks ago and there was no difference.

> > Mayby it's normal processor ussage :| ?
>
> I doubt it: pps and idle are so high that NAPI can not explain the low
> irq/s.
Can I do any tests?
When I turn off the whole firewall (tc, ipset, iptables), softirq usage decreases about 15% (from 30-40% to 5-15%).

I see that nobody has any idea what's wrong with this high softirq usage.

Same problem with 2.6.14: 50% si in top or 50% sy in vmstat.
Not network/iptables related, because I have no eth.
noapic and/or acpi=off result in a kernel panic.
Using an HP-nx6128.

I see that the problem is still unresolved. I have observed it on three more machines. Some network traffic or something else causes cyclic very high load (with 95-99% softirq). Traffic is below 6MBps. Processors: Intel(R) Celeron(R) CPU 3.06GHz (bogomips: 6134.07, cache size: 256 KB).

pkp ~ # vmstat 5
procs -----------memory---------- ---swap-- -----io---- --system-- ----cpu----
 r  b   swpd   free   buff  cache   si   so    bi    bo   in    cs us  sy id wa
14  0      0 345104  39236  32204    0    0     1    12  442    44  1  56 42  0
 3  0      0 345568  39236  32204    0    0     0    10 3633    40  2  98  0  0
 8  0      0 343708  39236  32200    0    0     0    21 3929    28  2  98  0  0
 8  0      0 343180  39236  32204    0    0     0    13 4003    29  1  99  0  0
10  1      0 343676  39236  32200    0    0     0    10 4253    19  1  99  0  0
10  0      0 343700  39236  32200    0    0     0     8 3792     4  0 100  0  0
11  0      0 343716  39236  32204    0    0     0     0 4706     5  0 100  0  0
 4  0      0 343792  39236  32200    0    0     0    10 3558    78  5  95  0  0
 6  0      0 345264  39236  32200    0    0     0    18 4258    75  3  97  0  0
 6  0      0 347264  39236  32200    0    0     0    24 3496   101  3  93  3  0
 8  0      0 347404  39236  32208    0    0     0    18 3665    43  1  97  2  0
 1  0      0 347404  39236  32204    0    0     0    19 3519    78  4  96  0  0
 0  0      0 347420  39236  32200    0    0     0    18 3044   104  1  75 24  0

pkp ~ # cat /proc/interrupts
           CPU0
  0:    8954240          XT-PIC  timer
  2:          0          XT-PIC  cascade
  5:      63230          XT-PIC  libata
  8:          2          XT-PIC  rtc
 10:   46061753          XT-PIC  eth1
 11:   46463991          XT-PIC  eth0
NMI:          0
LOC:    8954657
ERR:          0
MIS:          0

pkp ~ # readprofile | sort -nr | head -10
4556642 *unknown*
4263867 total                  2,2744
3675594 mwait_idle         45944,9250
 120337 handle_IRQ_event    1074,4375
  30926 nf_iterate           214,7639
  28691 tc_classify          128,0848
  24189 __do_softirq         151,1813
  17191 ip_forward            23,3573
  16419 fn_hash_lookup        85,5156
  15999 local_bh_enable      124,9922

pkp ~ # readprofile | sort -nr +2 | head -50
3675594 mwait_idle         45944,9250
 119223 handle_IRQ_event    1064,4911
  30586 nf_iterate           212,4028
  24180 __do_softirq         151,1250
  28443 tc_classify          126,9777
  15710 local_bh_enable      122,7344
   9148 kmem_cache_alloc     114,3500
   1662 netpoll_trap         103,8750
  16286 fn_hash_lookup        84,8229
  11482 __mod_timer           71,7625

pkp ~ # uname -a
Linux pkp 2.6.14.7-2 #1 Mon Feb 6 09:46:47 CET 2006 i686 Intel(R)_Celeron(R)_CPU_3.06GHz unknown PLD Linux

Tested noapic and acpi=off, without results.

Just to sync up on the current situation - is the problem still there with 2.6.22+? If yes, then can you please revisit #9 and try setting up kernel profiling (using oprofile http://oprofile.sourceforge.net).
Thanks.

I think we can close this, presumed fixed. There were some CPU-consuming nasties in netfilter and routing which got fixed - it might have been that?

The problem isn't there on 2.6.21 now. (I can't reproduce it, but I rewrote all the firewall scripts, iproute2 with hashes, ipset etc., so I can't say "the problem is solved"; I can say "now it works well" ;) )

ok, thanks. |
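
The "si" column that top reports throughout this thread can also be computed directly from /proc/stat. This is a minimal sketch (not from the thread), assuming the 2.6-style aggregate "cpu" line whose fields are user, nice, system, idle, iowait, irq, softirq and (on later kernels) steal; the function names are hypothetical.

```python
def parse_cpu_line(stat_text):
    """Return the counters of the aggregate 'cpu' line of /proc/stat as ints."""
    for line in stat_text.splitlines():
        if line.startswith("cpu "):
            return [int(x) for x in line.split()[1:]]
    raise ValueError("no aggregate 'cpu' line found")


def softirq_percent(before, after):
    """Share of CPU time spent in softirq between two /proc/stat snapshots."""
    old, new = parse_cpu_line(before), parse_cpu_line(after)
    # Only user..steal (first 8 fields); guest time is already part of user.
    deltas = [b - a for a, b in zip(old, new)][:8]
    if len(deltas) < 7:
        raise ValueError("kernel does not report a softirq counter")
    total = sum(deltas)
    if total <= 0:
        raise ValueError("snapshots are identical or out of order")
    return 100.0 * deltas[6] / total  # index 6 is the softirq counter
```

Reading /proc/stat twice a few seconds apart and passing both texts to `softirq_percent` should give a figure comparable to top's si value, without needing a screenshot.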