Bug 12465
| Summary: | KVM guests stalling on 2.6.28 (bisected) | | |
|---|---|---|---|
| Product: | Process Management | Reporter: | Kevin Shanahan (kmshanah) |
| Component: | Scheduler | Assignee: | Ingo Molnar (mingo) |
| Status: | CLOSED UNREPRODUCIBLE | | |
| Severity: | normal | CC: | alan, florian, mtosatti |
| Priority: | P1 | | |
| Hardware: | All | | |
| OS: | Linux | | |
| Kernel Version: | 2.6.29 | Subsystem: | |
| Regression: | Yes | Bisected commit-id: | |
| Bug Depends on: | | | |
| Bug Blocks: | 11808 | | |
Attachments:
- kernel config
- git bisect log
- Notes from tests
- tracing setup script
- trace output, sample 1
- trace output, sample 2
- trace output, sample 3
- trace output, sample 4
- exit kvm_vcpu_block if KVM_REQ_PENDING_TIMER is set
Description
Kevin Shanahan
2009-01-17 03:37:17 UTC
This is the bug I initially filed against kvm:
https://sourceforge.net/tracker/index.php?func=detail&aid=2494730&group_id=180599&atid=893831

Created attachment 19851 [details]
kernel config
Created attachment 19852 [details]
git bisect log
Created attachment 19853 [details]
Notes from tests
The tests look good with just that commit reverted from 2.6.28.

commit 80412b323514b9009e65ab32624c39b6baa50138
Author: Kevin Shanahan <kmshanah@flexo.wumi.org.au>
Date:   Sat Jan 17 22:19:27 2009 +1030

    Revert "sched: fine-tune SD_MC_INIT"

    This reverts commit 14800984706bf6936bbec5187f736e928be5c218.

===== Linux 2.6.28-00001-g80412b3

--- hermes-old.wumi.org.au ping statistics ---
900 packets transmitted, 900 received, 0% packet loss, time 899071ms
rtt min/avg/max/mdev = 0.116/0.270/24.916/0.912 ms

--- hermes-old.wumi.org.au ping statistics ---
900 packets transmitted, 900 received, 0% packet loss, time 899190ms
rtt min/avg/max/mdev = 0.122/0.243/2.298/0.182 ms

Copy for bugzilla

-------- Forwarded Message --------
From: Kevin Shanahan <kmshanah@ucwb.org.au>
To: Avi Kivity <avi@redhat.com>
Cc: Ingo Molnar <mingo@elte.hu>, Mike Galbraith <efault@gmx.de>, Andrew Morton <akpm@linux-foundation.org>, a.p.zijlstra@chello.nl, torvalds@linux-foundation.org, linux-kernel@vger.kernel.org
Subject: Re: [git pull] scheduler fixes
Date: Sun, 18 Jan 2009 20:10:04 +1030

On Sun, 2009-01-18 at 10:59 +0200, Avi Kivity wrote:
> A simple test running 10 idle guests over 4 cores doesn't reproduce
> (2.6.29-rc). However we likely need a more complicated setup.
>
> Kevin, any details about your setup? What are the guests doing? Are
> you overcommitting memory (is the host swapping)?

Hi Avi,

No, I don't think there is any swapping. The host has 32GB RAM and only
~9GB of guest RAM has been requested in total. The host doesn't do
anything but run KVM.

I fear my test setup is quite complicated as it is a production system
and I haven't been able to reproduce a simpler case as yet. Anyway, here
are the guests I have running during the test:

Windows XP, 32-bit (UP) - There are twelve of these launched with
identical scripts, apart from differing amounts of RAM. Six with
"-m 512", one with "-m 768" and five with "-m 256":

KVM=/usr/local/kvm/bin/qemu-system-x86_64
$KVM \
    -no-acpi -localtime -m 512 \
    -hda kvm-00-xp.img \
    -net nic,vlan=0,macaddr=52:54:00:12:34:56,model=rtl8139 \
    -net tap,vlan=0,ifname=tap0,script=no \
    -vnc 127.0.0.1:0 -usbdevice tablet \
    -daemonize

Windows 2003, 32-bit (UP):

KVM=/usr/local/kvm/bin/qemu-system-x86_64
$KVM \
    -no-acpi \
    -localtime -m 1024 \
    -hda kvm-09-2003-C.img \
    -hdb kvm-09-2003-E.img \
    -net nic,vlan=0,macaddr=52:54:00:12:34:5F,model=rtl8139 \
    -net tap,vlan=0,ifname=tap9,script=no \
    -vnc 127.0.0.1:9 -usbdevice tablet \
    -daemonize

Debian Lenny, 64-bit (MP):

KVM=/usr/local/kvm/bin/qemu-system-x86_64
$KVM \
    -smp 2 \
    -m 512 \
    -hda kvm-10-lenny64.img \
    -net nic,vlan=0,macaddr=52:54:00:12:34:60,model=rtl8139 \
    -net tap,vlan=0,ifname=tap10,script=no \
    -vnc 127.0.0.1:10 -usbdevice tablet \
    -daemonize

Debian Lenny, 32-bit (MP):

KVM=/usr/local/kvm/bin/qemu-system-x86_64
$KVM \
    -smp 2 \
    -m 512 \
    -hda kvm-15-etch-i386.img \
    -net nic,vlan=0,macaddr=52:54:00:12:34:65,model=rtl8139 \
    -net tap,vlan=0,ifname=tap15,script=no \
    -vnc 127.0.0.1:15 -usbdevice tablet \
    -daemonize

Debian Etch (using a kernel.org 2.6.27.10) 32-bit (MP):

KVM=/usr/local/kvm/bin/qemu-system-x86_64
$KVM \
    -smp 2 \
    -m 2048 \
    -hda kvm-17-1.img \
    -hdb kvm-17-tmp.img \
    -net nic,vlan=0,macaddr=52:54:00:12:34:67,model=rtl8139 \
    -net tap,vlan=0,ifname=tap17,script=no \
    -vnc 127.0.0.1:17 -usbdevice tablet \
    -daemonize

This last VM is the one I was pinging from the host. I begin by launching
the guests at 90 second intervals and then let them settle for about 5
minutes. The final guest runs Apache with RT and a postgresql database.
I set my browser to refresh the main RT page every 2 minutes, to roughly
simulate the common usage. With all the guests basically idling and just
the last guest servicing the RT requests, the load average is about 0.25
or less. Then I just ping the last guest for about 15 minutes.

One thing that may be of interest - a couple of times while running the
test I also had a terminal logged into the guest being pinged and running
"top". There was some crazy stuff showing up in there, with some processes
having e.g. 1200 in the "%CPU" column. I remember postmaster (postgresql)
being one of them. A quick scan over the top few tasks might show the
total "%CPU" adding up to 2000 or more. This tended to happen at the same
time I saw the latency spikes in ping as well. That might just be an
artifact of timing getting screwed up in "top", I guess.

Regards,
Kevin.

On Sunday 15 February 2009, Kevin Shanahan wrote:
> On Sat, 2009-02-14 at 21:50 +0100, Rafael J. Wysocki wrote:
> > This message has been generated automatically as a part of a report
> > of regressions introduced between 2.6.27 and 2.6.28.
> >
> > The following bug entry is on the current list of known regressions
> > introduced between 2.6.27 and 2.6.28. Please verify if it still should
> > be listed and let me know (either way).
>
> Yes, this should still be listed.
>
> I just tested against 2.6.29-rc5 and the problem is as bad as ever
> (perhaps worse?)
>
> --- hermes-old.wumi.org.au ping statistics ---
> 900 packets transmitted, 448 received, +317 errors, 50% packet loss, time
> 899845ms
> rtt min/avg/max/mdev = 0.131/420.015/10890.699/1297.022 ms, pipe 11
>
> The guest being pinged crashed during the test - the QEMU monitor was
> accessible, but the guest didn't respond to "sendkey alt-sysrq-s", etc.
> This was the last thing in the guest syslog after reboot:
>
> Feb 15 19:48:58 hermes-old kernel: ------------[ cut here ]------------
> Feb 15 19:48:58 hermes-old kernel: WARNING: at net/sched/sch_generic.c:219
> dev_watchdog+0x111/0x195()
> Feb 15 19:48:58 hermes-old kernel: NETDEV WATCHDOG: eth0 (8139too): transmit
> timed out
> Feb 15 19:48:58 hermes-old kernel: Pid: 0, comm: swapper Not tainted
> 2.6.27.10 #1
> Feb 15 19:48:58 hermes-old kernel: [<c011d75c>] warn_slowpath+0x5c/0x81
> Feb 15 19:48:58 hermes-old kernel: [<c02f5f7c>] nf_hook_slow+0x44/0xb1
> Feb 15 19:48:58 hermes-old kernel: [<c02d93f1>] dev_queue_xmit+0x3da/0x411
> Feb 15 19:48:58 hermes-old kernel: [<c030043d>] ip_finish_output+0x1f9/0x231
> Feb 15 19:48:58 hermes-old kernel: [<c01daeee>] __next_cpu+0x12/0x21
> Feb 15 19:48:58 hermes-old kernel: [<c0116b42>]
> find_busiest_group+0x232/0x69f
> Feb 15 19:48:58 hermes-old kernel: [<c01160dc>] update_curr+0x41/0x65
> Feb 15 19:48:58 hermes-old kernel: [<c02e33b5>] dev_watchdog+0x111/0x195
> Feb 15 19:48:58 hermes-old kernel: [<c011822f>] enqueue_task_fair+0x16/0x24
> Feb 15 19:48:58 hermes-old kernel: [<c0115645>] enqueue_task+0xa/0x14
> Feb 15 19:48:58 hermes-old kernel: [<c01156d5>] activate_task+0x16/0x1b
> Feb 15 19:48:58 hermes-old kernel: [<c0119c8c>] try_to_wake_up+0x131/0x13a
> Feb 15 19:48:58 hermes-old kernel: [<c02e32a4>] dev_watchdog+0x0/0x195
> Feb 15 19:48:58 hermes-old kernel: [<c012424c>] run_timer_softirq+0xf5/0x14a
> Feb 15 19:48:58 hermes-old kernel: [<c0120f60>] __do_softirq+0x5d/0xc1
> Feb 15 19:48:58 hermes-old kernel: [<c0120ff6>] do_softirq+0x32/0x36
> Feb 15 19:48:58 hermes-old kernel: [<c012112c>] irq_exit+0x35/0x40
> Feb 15 19:48:58 hermes-old kernel: [<c010e8db>]
> smp_apic_timer_interrupt+0x6e/0x7b
> Feb 15 19:48:58 hermes-old kernel: [<c01035ac>]
> apic_timer_interrupt+0x28/0x30
> Feb 15 19:48:58 hermes-old kernel: [<c0107386>] default_idle+0x2a/0x3d
> Feb 15 19:48:58 hermes-old kernel: [<c0101900>] cpu_idle+0x5c/0x84
> Feb 15 19:48:58 hermes-old kernel: =======================
> Feb 15 19:48:58 hermes-old kernel: ---[ end trace eff10a8043ac4e7b ]---
> Feb 15 19:49:01 hermes-old kernel: eth0: Transmit timeout, status 0d 0000
> c07f media d0.
> Feb 15 19:49:01 hermes-old kernel: eth0: Tx queue start entry 839 dirty
> entry 839.
> Feb 15 19:49:01 hermes-old kernel: eth0: Tx descriptor 0 is 0008a03c.
> Feb 15 19:49:01 hermes-old kernel: eth0: Tx descriptor 1 is 0008a062.
> Feb 15 19:49:01 hermes-old kernel: eth0: Tx descriptor 2 is 0008a062.
> Feb 15 19:49:01 hermes-old kernel: eth0: Tx descriptor 3 is 0008a05b. (queue
> head)
> Feb 15 19:49:01 hermes-old kernel: eth0: link up, 100Mbps, full-duplex, lpa
> 0x05E1
>
> I think I saw some patches to fix the latency tracer for non-RT tasks on
> the mailing list a while ago. If that's still going to be a useful test,
> can someone give me some hints on which kernel tree and/or patches to
> download to get that working? The simpler you can make it, the better ;)
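For reference, the long-running ping test quoted above amounts to a single
command; a minimal sketch, assuming a 1-second interval (inferred from 900
packets in roughly 900 seconds) and the guest name used throughout this
report:

  # Hypothetical reconstruction of the ~15 minute ping test; the interval
  # and the use of tail to keep only the summary lines are assumptions.
  ping -i 1 -c 900 hermes-old.wumi.org.au | tail -n 3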
-------- Forwarded Message --------
From: Ingo Molnar <mingo@elte.hu>
To: Kevin Shanahan <kmshanah@ucwb.org.au>
Cc: Rafael J. Wysocki <rjw@sisk.pl>, Linux Kernel Mailing List <linux-kernel@vger.kernel.org>, Kernel Testers List <kernel-testers@vger.kernel.org>, Mike Galbraith <efault@gmx.de>, bugme-daemon@bugzilla.kernel.org, Steven Rostedt <rostedt@goodmis.org>, Peter Zijlstra <a.p.zijlstra@chello.nl>
Subject: Re: [Bug #12465] KVM guests stalling on 2.6.28 (bisected) [Bug 12465]
Date: Sun, 15 Feb 2009 11:04:54 +0100

* Kevin Shanahan <kmshanah@ucwb.org.au> wrote:

> On Sat, 2009-02-14 at 21:50 +0100, Rafael J. Wysocki wrote:
> > This message has been generated automatically as a part of a report
> > of regressions introduced between 2.6.27 and 2.6.28.
> >
> > The following bug entry is on the current list of known regressions
> > introduced between 2.6.27 and 2.6.28. Please verify if it still should
> > be listed and let me know (either way).
>
> Yes, this should still be listed.
>
> I just tested against 2.6.29-rc5 and the problem is as bad as ever
> (perhaps worse?)
>
> --- hermes-old.wumi.org.au ping statistics ---
> 900 packets transmitted, 448 received, +317 errors, 50% packet loss, time 899845ms
> rtt min/avg/max/mdev = 0.131/420.015/10890.699/1297.022 ms, pipe 11

i looked at the trace you did earlier and which you uploaded to:

  http://disenchant.net/tmp/bug-12465/trace-1/

Here is one 3 seconds (!) latency:

 0) qemu-sy-4237 |              | kvm_vcpu_block() {
 0) qemu-sy-4237 |              | kvm_cpu_has_interrupt() {
 0) qemu-sy-4237 |              | kvm_apic_has_interrupt() {
 0) qemu-sy-4237 | 0.291 us     | }
 0) qemu-sy-4237 |              | kvm_apic_accept_pic_intr() {
 0) qemu-sy-4237 | 0.291 us     | }
 0) qemu-sy-4237 | 1.476 us     | }
 0) qemu-sy-4237 |              | kvm_cpu_has_pending_timer() {
 0) qemu-sy-4237 | 0.325 us     | }
 0) qemu-sy-4237 |              | kvm_arch_vcpu_runnable() {
 0) qemu-sy-4237 | 0.288 us     | }
 0) qemu-sy-4237 |              | kvm_arch_vcpu_put() {
 0) qemu-sy-4237 | 0.415 us     | }
 0) qemu-sy-4237 |              | schedule() {
 0) qemu-sy-4237 |              | wakeup_preempt_entity() {
 0) qemu-sy-4237 | 0.300 us     | }
 ------------------------------------------
 0) qemu-sy-4237 => ksoftir-4
 ------------------------------------------
 0) ksoftir-4    | ! 3010470 us | }
 ------------------------------------------
 0) ksoftir-4    => qemu-sy-4355
 ------------------------------------------
 0) qemu-sy-4355 | 1.575 us     | }
 0) qemu-sy-4355 | 6.520 us     | }
 0) qemu-sy-4355 | 7.121 us     | }
 0) qemu-sy-4355 |              | __wake_up() {
 0) qemu-sy-4355 |              | __wake_up_common() {
 0) qemu-sy-4355 |              | autoremove_wake_function() {
 0) qemu-sy-4355 |              | default_wake_function() {

qemu-sy-4237 has been scheduled away, and the system appeared to have done
nothing in the meantime. That's not something that really looks like a
scheduler regression - there is nothing the scheduler can do if KVM
decides to block a task.

It would be nice to enhance this single-CPU trace some more - to more
surgically see what is going on. Firstly, absolute timestamps would be
nice:

  echo funcgraph-abstime > trace_options
  echo funcgraph-proc > trace_options

as it's a bit hard to see the global timescale of events.

Secondly, not all events are included - in particular i dont really see
the points when packets are passed. Would it be possible to add a tracing
hypercall so that the guest kernel can inject trace events that can be
seen on the native-side trace? Regarding ping latencies really just two
things matter: the loopback network device's rx and tx path. We should
trace the outgoing sequence number and the incoming sequence number of IP
packets, and inject that to the host side. This way we can correlate the
delays precisely.

	Ingo

-------- Forwarded Message --------
From: Kevin Shanahan <kmshanah@ucwb.org.au>
To: Ingo Molnar <mingo@elte.hu>
Cc: Rafael J. Wysocki <rjw@sisk.pl>, Linux Kernel Mailing List <linux-kernel@vger.kernel.org>, Kernel Testers List <kernel-testers@vger.kernel.org>, Mike Galbraith <efault@gmx.de>, bugme-daemon@bugzilla.kernel.org, Steven Rostedt <rostedt@goodmis.org>, Peter Zijlstra <a.p.zijlstra@chello.nl>
Subject: Re: [Bug #12465] KVM guests stalling on 2.6.28 (bisected) [bug 12465]
Date: Sun, 22 Feb 2009 21:09:19 +1030

On Sun, 2009-02-15 at 11:04 +0100, Ingo Molnar wrote:
> qemu-sy-4237 has been scheduled away, and the system appeared to have done
> nothing in the meantime. That's not something that really looks like a
> scheduler regression - there is nothing the scheduler can do if KVM
> decides to block a task.
>
> It would be nice to enhance this single-CPU trace some more - to more
> surgically see what is going on. Firstly, absolute timestamps would be
> nice:
>
>   echo funcgraph-abstime > trace_options
>   echo funcgraph-proc > trace_options
>
> as it's a bit hard to see the global timescale of events.

I was going to try and grab the trace with absolute timestamps tonight,
but that option doesn't seem to be available in Linus' current kernel.

flexo:/sys/kernel/debug/tracing# echo 0 > tracing_enabled
flexo:/sys/kernel/debug/tracing# echo function_graph > current_tracer
flexo:/sys/kernel/debug/tracing# echo funcgraph-proc > trace_options
flexo:/sys/kernel/debug/tracing# echo funcgraph-abstime > trace_options
-su: echo: write error: Invalid argument
flexo:/sys/kernel/debug/tracing# cat trace_options
print-parent nosym-offset nosym-addr noverbose noraw nohex nobin noblock
nostacktrace nosched-tree ftrace_printk noftrace_preempt nobranch annotate
nouserstacktrace nosym-userobj noprintk-msg-only nofuncgraph-overrun
funcgraph-cpu funcgraph-overhead funcgraph-proc
flexo:/sys/kernel/debug/tracing# uname -a
Linux flexo 2.6.29-rc5-00299-gadfafef #6 SMP Sun Feb 22 20:09:37 CST 2009 x86_64 GNU/Linux

What am I missing?

Cheers,
Kevin.

* bugme-daemon@bugzilla.kernel.org <bugme-daemon@bugzilla.kernel.org> wrote:

> I was going to try and grab the trace with absolute timestamps
> tonight, but that option doesn't seem to be available in
> Linus' current kernel.

you need latest -tip:

  http://people.redhat.com/mingo/tip.git/README

which includes the latest tracing tree as well. (-tip has no KVM changes
so the issue should still be present in it.)

	Ingo

The bugme-daemon email gateway still hates me. Emailed response pasted
below. The traces I uploaded were:

trace5.txt.bz2
trace6.txt.bz2
trace7.txt.bz2
trace8.txt.bz2

I'll attach them here if bugzilla will accept the file sizes.

-------- Forwarded Message --------
From: Kevin Shanahan <kmshanah@ucwb.org.au>
To: Ingo Molnar <mingo@elte.hu>
Cc: Rafael J. Wysocki <rjw@sisk.pl>, Linux Kernel Mailing List <linux-kernel@vger.kernel.org>, Kernel Testers List <kernel-testers@vger.kernel.org>, Mike Galbraith <efault@gmx.de>, bugme-daemon@bugzilla.kernel.org, Steven Rostedt <rostedt@goodmis.org>, Peter Zijlstra <a.p.zijlstra@chello.nl>
Subject: Re: [Bug #12465] KVM guests stalling on 2.6.28 (bisected) [Bug 12465]
Date: Mon, 23 Feb 2009 22:08:58 +1030

On Sun, 2009-02-15 at 11:04 +0100, Ingo Molnar wrote:
> It would be nice to enhance this single-CPU trace some more - to more
> surgically see what is going on. Firstly, absolute timestamps would be
> nice:
>
>   echo funcgraph-abstime > trace_options
>   echo funcgraph-proc > trace_options
>
> as it's a bit hard to see the global timescale of events.

Okay, here's some more trace data. I grabbed a few samples at different
times during the ping test. I think the data in files trace6.txt and
trace8.txt coincided with some of the biggest delays.

  http://disenchant.net/tmp/bug-12465/trace-2/

This is captured on 2.6.29-rc5-tip-02057-gaad11ad. The kvm guest being
pinged is process 11211:

flexo:~# pstree -p 11211
qemu-system-x86(11211)─┬─{qemu-system-x86}(11212)
                       ├─{qemu-system-x86}(11213)
                       └─{qemu-system-x86}(11609)

Cheers,
Kevin.

> Secondly, not all events are included - in particular i dont really see
> the points when packets are passed. Would it be possible to add a tracing
> hypercall so that the guest kernel can inject trace events that can be seen
> on the native-side trace? Regarding ping latencies really just two things
> matter: the loopback network device's rx and tx path. We should trace the
> outgoing sequence number and the incoming sequence number of IP packets,
> and inject that to the host side. This way we can correlate the delays
> precisely.
>
> Ingo

Created attachment 20336 [details]
tracing setup script
Created attachment 20337 [details]
trace output, sample 1
Created attachment 20338 [details]
trace output, sample 2
Created attachment 20339 [details]
trace output, sample 3
Created attachment 20340 [details]
trace output, sample 4
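The attached tracing setup script is not reproduced in this report; a
minimal sketch of what a function_graph setup along the lines Ingo
suggested might look like (paths and option names are the ones used in the
comments above, everything else is an assumption and may differ from the
real attachment):

  #!/bin/sh
  # Hypothetical sketch only - the actual attached script may differ.
  cd /sys/kernel/debug/tracing
  echo 0 > tracing_enabled
  echo function_graph > current_tracer
  echo funcgraph-proc > trace_options      # show task name/PID per entry
  echo funcgraph-abstime > trace_options   # absolute timestamps (needs -tip at the time)
  echo 1 > tracing_enabled
  # ... run the ping test against the guest, then save a sample:
  cat trace > /tmp/trace-sample.txt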
On Tuesday 24 February 2009, Kevin Shanahan wrote:
> On Mon, 2009-02-23 at 23:03 +0100, Rafael J. Wysocki wrote:
> > This message has been generated automatically as a part of a report
> > of regressions introduced between 2.6.27 and 2.6.28.
> >
> > The following bug entry is on the current list of known regressions
> > introduced between 2.6.27 and 2.6.28. Please verify if it still should
> > be listed and let me know (either way).
>
> Yes, the problem should still be listed.
> The bug is still present as recently as 2.6.29-rc5-00299-gadfafef.
Handled-By : Avi Kivity <avi@redhat.com>

On Thu, 2009-03-05 at 08:29 +0100, Mike Galbraith wrote:
> On Wed, 2009-03-04 at 13:38 +1030, Kevin Shanahan wrote:
> > The traces are there waiting to be looked at. If there's anything else I
> > can do to help things along, please let me know.
>
> I've been staring at them. They're loaded with kipmi0 sequences as
> below. Having zero success at parsing.. 77ms __schedule()? Not.
>
> 5812.647954 | 0) kipmi0-501 | 0.357 us | perf_counter_task_sched_in();
> 5812.647954 | 0) kipmi0-501 | ! 3902.489 us | }
> 5812.647955 | 0) kipmi0-501 | ! 3903.102 us | }
> 5812.647955 | 0) kipmi0-501 | ! 3903.928 us | }
> 5812.647955 | 0) kipmi0-501 | ! 3904.572 us | }
> 5812.647957 | 0) kipmi0-501 | | schedule_timeout_interruptible() {
> 5812.647958 | 0) kipmi0-501 | | schedule_timeout() {
> 5812.647958 | 0) kipmi0-501 | | schedule() {
> 5812.647958 | 0) kipmi0-501 | | __schedule() {
> 5812.647959 | 0) kipmi0-501 | | perf_counter_task_sched_out() {
> 5812.647960 | 0) kipmi0-501 | 0.325 us | __perf_counter_sched_out();
> 5812.647960 | 0) kipmi0-501 | 0.956 us | }
> 5812.647961 | 0) qemu-sy-11119 | 0.343 us | perf_counter_task_sched_in();
> 5812.647962 | 0) qemu-sy-11119 | | kvm_sched_in() {
> 5812.647962 | 0) qemu-sy-11119 | | kvm_arch_vcpu_load() {
> 5812.647963 | 0) qemu-sy-11119 | 0.313 us | kvm_write_guest_time();
> 5812.647963 | 0) qemu-sy-11119 | 1.203 us | }
> 5812.647964 | 0) qemu-sy-11119 | 1.809 us | }
> 5812.647964 | 0) qemu-sy-11119 | ! 77180.30 us | }
> 5812.647964 | 0) qemu-sy-11119 | ! 77180.90 us | }
> 5812.647965 | 0) qemu-sy-11119 | ! 77181.51 us | }
> 5812.647965 | 0) qemu-sy-11119 | ! 77182.12 us | }
> 5812.647965 | 0) qemu-sy-11119 | ! 77182.71 us | }
> 5812.647966 | 0) qemu-sy-11119 | | kvm_inject_pending_timer_irqs() {
> 5812.647966 | 0) qemu-sy-11119 | 0.427 us | kvm_inject_apic_timer_irqs();
> 5812.647967 | 0) qemu-sy-11119 | | kvm_inject_pit_timer_irqs() {
> 5812.647967 | 0) qemu-sy-11119 | | kvm_set_irq() {
> 5812.647968 | 0) qemu-sy-11119 | 0.460 us | kvm_ioapic_set_irq();
> 5812.647969 | 0) qemu-sy-11119 | | kvm_pic_set_irq() {
> 5812.647969 | 0) qemu-sy-11119 | 0.312 us | kvm_vcpu_kick();
> 5812.647970 | 0) qemu-sy-11119 | 1.145 us | }
> 5812.647970 | 0) qemu-sy-11119 | 2.655 us | }
> 5812.647970 | 0) qemu-sy-11119 | | kvm_set_irq() {
> 5812.647971 | 0) qemu-sy-11119 | 0.315 us | kvm_ioapic_set_irq();
> 5812.647971 | 0) qemu-sy-11119 | 0.478 us | kvm_pic_set_irq();
> 5812.647972 | 0) qemu-sy-11119 | 1.719 us | }
> 5812.647972 | 0) qemu-sy-11119 | 5.465 us | }
> 5812.647973 | 0) qemu-sy-11119 | 6.797 us | }

On Thu, 2009-03-05 at 10:45 +0100, Mike Galbraith wrote:
> On Thu, 2009-03-05 at 08:29 +0100, Mike Galbraith wrote:
> > I've been staring at them. They're loaded with kipmi0 sequences as
> > below. Having zero success at parsing.. 77ms __schedule()? Not.
>
> > 5812.647964 | 0) qemu-sy-11119 | ! 77180.30 us | }
>
> Hrmph, despite not being able to determine where such scary numbers come
> from, timestamp wise it still looks to me like the scheduler is doing
> its thing properly, and whatever is going wonky is in KVM.
>
> Not very useful I know, but thought I should let you know that someone
> is pondering those traces.
>
> -Mike

I'm afraid I can't really read the traces. Can someone who can decode the
format comment?

References : http://lkml.org/lkml/2009/3/15/51

Unless I'm misinterpreting the trace, the following seems to indicate KVM
is not at fault (correct me otherwise).
From trace6.txt from http://disenchant.net/tmp/bug-12465/trace-2/. I've
extracted the relevant part into
http://people.redhat.com/~mtosatti/chunk-of-trace6.txt.

You can see qemu being scheduled out, cat scheduled in, and ksoftirq
scheduled in. Only after 3 seconds is qemu-kvm scheduled back in again.
Note that KVM _is not_ emulating halt for the guest (which waits for
events in kvm_vcpu_block). The state of each vcpu is as follows:

qemu-sy-11194: kvm_vcpu_block
qemu-sy-11133: kvm_vcpu_block
qemu-sy-11134: kvm_vcpu_block
qemu-sy-11213: not blocked on vcpu block, scheduled out by cat. _Could_ be waiting on IO, but that should be visible in the trace?
qemu-sy-11118: poll_schedule_timeout, the IO thread.

Halt emulation can be seen in
http://people.redhat.com/~mtosatti/chunk-of-trace6-good.txt.

Kevin, is the guest using periodic ticks? /proc/timer_list output from the
guest would help. Have you attempted to use -no-kvm-irqchip?

Periodic ticks, yes I think so - you mean this info?

$ grep HZ hermesold.0
# CONFIG_NO_HZ is not set
# CONFIG_HZ_100 is not set
# CONFIG_HZ_250 is not set
CONFIG_HZ_300=y
# CONFIG_HZ_1000 is not set
CONFIG_HZ=300

kmshanah@hermes-old:~$ cat /proc/timer_list
Timer List Version: v0.3
HRTIMER_MAX_CLOCK_BASES: 2
now at 124344651194856 nsecs

cpu: 0
 clock 0:
  .index: 0  .resolution: 1 nsecs  .get_time: ktime_get_real  .offset: 1237209293010817006 nsecs
 active timers:
 clock 1:
  .index: 1  .resolution: 1 nsecs  .get_time: ktime_get  .offset: 0 nsecs
 active timers:
 #0: <f6631eec>, tick_sched_timer, S:01
 # expires at 124344650898867 nsecs [in -295989 nsecs]
 #1: <f6631eec>, it_real_fn, S:01
 # expires at 124369186053044 nsecs [in 24534858188 nsecs]
 #2: <f6631eec>, hrtimer_wakeup, S:01
 # expires at 124388537770723 nsecs [in 43886575867 nsecs]
 #3: <f6631eec>, it_real_fn, S:01
 # expires at 149486451698113 nsecs [in 25141800503257 nsecs]
  .expires_next   : 124344650898867 nsecs
  .hres_active    : 1
  .nr_events      : 24547649
  .nohz_mode      : 0
  .idle_tick      : 0 nsecs
  .tick_stopped   : 0
  .idle_jiffies   : 0
  .idle_calls     : 0
  .idle_sleeps    : 0
  .idle_entrytime : 0 nsecs
  .idle_waketime  : 0 nsecs
  .idle_exittime  : 0 nsecs
  .idle_sleeptime : 0 nsecs
  .last_jiffies   : 0
  .next_jiffies   : 0
  .idle_expires   : 0 nsecs
jiffies: 37213398

cpu: 1
 clock 0:
  .index: 0  .resolution: 1 nsecs  .get_time: ktime_get_real  .offset: 1237209293010817006 nsecs
 active timers:
 clock 1:
  .index: 1  .resolution: 1 nsecs  .get_time: ktime_get  .offset: 0 nsecs
 active timers:
 #0: <f6631eec>, tick_sched_timer, S:01
 # expires at 124344651107200 nsecs [in -87656 nsecs]
 #1: <f6631eec>, sched_rt_period_timer, S:01
 # expires at 124345000000000 nsecs [in 348805144 nsecs]
 #2: <f6631eec>, it_real_fn, S:01
 # expires at 124372954500821 nsecs [in 28303305965 nsecs]
 #3: <f6631eec>, it_real_fn, S:01
 # expires at 124489098233043 nsecs [in 144447038187 nsecs]
 #4: <f6631eec>, it_real_fn, S:01
 # expires at 124635106565989 nsecs [in 290455371133 nsecs]
 #5: <f6631eec>, it_real_fn, S:01
 # expires at 124635106671613 nsecs [in 290455476757 nsecs]
 #6: <f6631eec>, hrtimer_wakeup, S:01
 # expires at 126159569385803 nsecs [in 1814918190947 nsecs]
  .expires_next   : 124344651107200 nsecs
  .hres_active    : 1
  .nr_events      : 24416849
  .nohz_mode      : 0
  .idle_tick      : 0 nsecs
  .tick_stopped   : 0
  .idle_jiffies   : 0
  .idle_calls     : 0
  .idle_sleeps    : 0
  .idle_entrytime : 0 nsecs
  .idle_waketime  : 0 nsecs
  .idle_exittime  : 0 nsecs
  .idle_sleeptime : 0 nsecs
  .last_jiffies   : 0
  .next_jiffies   : 0
  .idle_expires   : 0 nsecs
jiffies: 37213398

Tick Device: mode: 1
Clock Event Device: pit
 max_delta_ns:   27461866
 min_delta_ns:   12571
 mult:           5124677
 shift:          32
 mode:           3
 next_event:     9223372036854775807 nsecs
 set_next_event: pit_next_event
 set_mode:       init_pit_timer
 event_handler:  tick_handle_oneshot_broadcast
tick_broadcast_mask: 00000000
tick_broadcast_oneshot_mask: 00000000

Tick Device: mode: 1
Clock Event Device: lapic
 max_delta_ns:   134208338
 min_delta_ns:   1000
 mult:           268454204
 shift:          32
 mode:           3
 next_event:     124344650898867 nsecs
 set_next_event: lapic_next_event
 set_mode:       lapic_timer_setup
 event_handler:  hrtimer_interrupt

Tick Device: mode: 1
Clock Event Device: lapic
 max_delta_ns:   134208338
 min_delta_ns:   1000
 mult:           268454204
 shift:          32
 mode:           3
 next_event:     124344651107200 nsecs
 set_next_event: lapic_next_event
 set_mode:       lapic_timer_setup
 event_handler:  hrtimer_interrupt

As for the -no-kvm-irqchip, yes, I tried it but it didn't obviously solve
the problem. I have an email around here somewhere:

-------- Forwarded Message --------
From: Kevin Shanahan <kmshanah@ucwb.org.au>
To: Avi Kivity <avi@redhat.com>
Cc: Ingo Molnar <mingo@elte.hu>, Rafael J. Wysocki <rjw@sisk.pl>, Linux Kernel Mailing List <linux-kernel@vger.kernel.org>, Kernel Testers List <kernel-testers@vger.kernel.org>, Mike Galbraith <efault@gmx.de>, bugme-daemon@bugzilla.kernel.org, Peter Zijlstra <a.p.zijlstra@chello.nl>
Subject: Re: [Bug #12465] KVM guests stalling on 2.6.28 (bisected)
Date: Wed, 21 Jan 2009 04:24:44 +1030

On Tue, 2009-01-20 at 15:04 +0200, Avi Kivity wrote:
> Kevin Shanahan wrote:
> > On Tue, 2009-01-20 at 12:35 +0100, Ingo Molnar wrote:
> >> This only seems to occur under KVM, right? I.e. you tested it with -no-kvm
> >> and the problem went away, correct?
> >>
> >
> > Well, I couldn't make the test conditions identical, but the problem
> > didn't occur with the test I was able to do:
> >
> > http://marc.info/?l=linux-kernel&m=123228728416498&w=2
> >
>
> Can you also try with -no-kvm-irqchip?
>
> You will need to comment out the lines
>
>     /* ISA IRQs map to GSI 1-1 except for IRQ0 which maps
>      * to GSI 2. GSI maps to ioapic 1-1. This is not
>      * the cleanest way of doing it but it should work. */
>
>     if (vector == 0)
>         vector = 2;
>
> in qemu/hw/apic.c (should also fix -no-kvm smp). This will change kvm
> wakeups to use signals rather than the in-kernel code, which may be buggy.

Okay, I commented out those lines and compiled a new kvm-82 userspace and
kernel modules. Using those on a vanilla 2.6.28 kernel, with all guests
run with -no-kvm-irqchip added.

As before, a number of the XP guests wanted to chug away at 100% CPU usage
for a long time. Three of the guests clocked up ~40 minutes CPU time
before I decided to just shut them down. Perhaps coincidentally, these
three guests are the only ones with Office 2003 installed on them. That
could be the difference between those guests and the other XP guests, but
that's probably not important for now.

The two Linux SMP guests booted okay this time, though they seem to only
use one CPU on the host (I guess kvm is not multi-threaded in this mode?).

"hermes-old", the guest I am pinging in all my tests, had a lot of trouble
running the apache2 setup - it was so slow it was difficult to load a
complete page from our RT system.
The kvm process for this guest was taking up 100% cpu on the host
constantly and all sorts of weird stuff could be seen by watching top in
the guest:

top - 03:44:17 up 43 min, 1 user, load average: 3.95, 1.55, 0.80
Tasks: 101 total, 4 running, 97 sleeping, 0 stopped, 0 zombie
Cpu(s): 79.7%us, 10.4%sy, 0.0%ni, 0.0%id, 0.0%wa, 0.0%hi, 9.9%si, 0.0%st
Mem: 2075428k total, 391128k used, 1684300k free, 13044k buffers
Swap: 3502160k total, 0k used, 3502160k free, 118488k cached

  PID USER     PR NI  VIRT  RES  SHR S %CPU %MEM   TIME+ COMMAND
 2956 postgres 20  0 19704  11m  10m S 1658  0.6 2:55.99 postmaster
 2934 www-data 20  0 60392  40m 5132 R   31  2.0 0:17.28 apache2
 2958 postgres 20  0 19700  11m 9.8m R   28  0.6 0:20.41 postmaster
 2940 www-data 20  0 58652  38m 5016 S   27  1.9 0:04.87 apache2
 2937 www-data 20  0 60124  40m 5112 S   18  2.0 0:11.00 apache2
 2959 postgres 20  0 19132 5424 4132 S   10  0.3 0:01.50 postmaster
 2072 postgres 20  0  8064 1416  548 S    7  0.1 0:23.71 postmaster
 2960 postgres 20  0 19132 5368 4060 R    6  0.3 0:01.55 postmaster
 2071 postgres 20  0  8560 1972  488 S    5  0.1 0:08.33 postmaster

Running the ping test without apache2 running in the guest:

--- hermes-old.wumi.org.au ping statistics ---
900 packets transmitted, 900 received, 0% packet loss, time 902740ms
rtt min/avg/max/mdev = 0.568/3.745/272.558/16.990 ms

And with apache2 running:

--- hermes-old.wumi.org.au ping statistics ---
900 packets transmitted, 900 received, 0% packet loss, time 902758ms
rtt min/avg/max/mdev = 0.625/25.634/852.739/76.586 ms

In both cases it's quite variable, but the max latency is still not as bad
as when running with the irq chip enabled. Anyway, the test is again not
ideal, but I hope we're proving something.

That's all I can do for tonight - should be ready for more again tomorrow
night.

Regards,
Kevin.

Created attachment 20591 [details]
exit kvm_vcpu_block if KVM_REQ_PENDING_TIMER is set
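The next comment asks for this patch to be tested on top of Avi's kvm.git
tree; a minimal, hypothetical sketch of that workflow (the tree URL and
patch filename are taken from that comment, the build and install steps
are generic assumptions and may differ from what the reporter actually
did):

  # Hypothetical workflow sketch - not the reporter's actual commands.
  git clone git://git.kernel.org/pub/scm/linux/kernel/git/avi/kvm.git
  cd kvm
  patch -p1 < ../test-pending-timer-bit.patch   # the attachment above
  cp /boot/config-"$(uname -r)" .config
  make oldconfig && make -j4                    # adjust -j for the build host
  make modules_install && make install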
Kevin,

Can you please try the attached patch (test-pending-timer-bit.patch) on
top of git://git.kernel.org/pub/scm/linux/kernel/git/avi/kvm.git.

Thanks

Also try -no-kvm-pit, and see if there is a difference on maximum latency
(that option will not affect performance as much as -no-kvm-irqchip).

(In reply to comment #25)
> Can you please try the attached patch (test-pending-timer-bit.patch) on top
> of git://git.kernel.org/pub/scm/linux/kernel/git/avi/kvm.git.

Okay, compiled and tested on 2.6.29-rc8-02003-g4dcf262 + your patch.
Unfortunately it doesn't look like this is the fix.

--- hermes-old.wumi.org.au ping statistics ---
900 packets transmitted, 900 received, 0% packet loss, time 899280ms
rtt min/avg/max/mdev = 0.122/497.769/15491.621/1624.426 ms, pipe 16

--- hermes-old.wumi.org.au ping statistics ---
900 packets transmitted, 900 received, 0% packet loss, time 899503ms
rtt min/avg/max/mdev = 0.132/865.812/16075.744/2240.968 ms, pipe 17

(In reply to comment #26)
> Also try -no-kvm-pit, and see if there is a difference on maximum latency
> (that option will not affect performance as much as -no-kvm-irqchip).

Same kernel as above (2.6.29-rc8-02003-g4dcf262 + patch), but with all
guests using -no-kvm-pit:

--- hermes-old.wumi.org.au ping statistics ---
900 packets transmitted, 900 received, 0% packet loss, time 899207ms
rtt min/avg/max/mdev = 0.118/51.113/4071.994/323.490 ms, pipe 5

--- hermes-old.wumi.org.au ping statistics ---
900 packets transmitted, 900 received, 0% packet loss, time 900201ms
rtt min/avg/max/mdev = 0.126/535.958/14692.164/1669.703 ms, pipe 15

-------- Forwarded Message --------
From: Kevin Shanahan <kmshanah@ucwb.org.au>
To: Frederic Weisbecker <fweisbec@gmail.com>
Cc: Avi Kivity <avi@redhat.com>, Rafael J. Wysocki <rjw@sisk.pl>, Linux Kernel Mailing List <linux-kernel@vger.kernel.org>, Kernel Testers List <kernel-testers@vger.kernel.org>, Ingo Molnar <mingo@elte.hu>, Mike Galbraith <efault@gmx.de>, Peter Zijlstra <a.p.zijlstra@chello.nl>
Subject: Re: [Bug #12465] KVM guests stalling on 2.6.28 (bisected)
Date: Sat, 21 Mar 2009 15:30:42 +1030

On Thu, 2009-03-19 at 07:54 +1030, Kevin Shanahan wrote:
> On Wed, 2009-03-18 at 11:46 +1030, Kevin Shanahan wrote:
> > On Wed, 2009-03-18 at 01:20 +0100, Frederic Weisbecker wrote:
> > > Ok, I've made a small script based on yours which could do this job.
> > > You will just have to set yourself a threshold of latency
> > > that you consider as buggy. I don't remember the latency you observed.
> > > About 5 secs right?
> > >
> > > It's the "thres" variable in the script.
> > >
> > > The resulting trace should be a mixup of the function graph traces
> > > and scheduler events which look like this:
> > >
> > > gnome-screensav-4691 [000] 6716.774277: 4691:120:S ==> [000] 0:140:R <idle>
> > > xfce4-terminal-4723  [001] 6716.774303: 4723:120:R + [001] 4289:120:S Xorg
> > > xfce4-terminal-4723  [001] 6716.774417: 4723:120:S ==> [001] 4289:120:R Xorg
> > > Xorg-4289            [001] 6716.774427: 4289:120:S ==> [001] 0:140:R <idle>
> > >
> > > + is a wakeup and ==> is a context switch.
> > >
> > > The script will loop trying some pings and will only keep the trace
> > > that matches the latency threshold you defined.
> > >
> > > Tell me if the following script works for you.
> > ...
> > Either way, I'll try to get some results in my maintenance window
> > tonight.
>
> Testing did not go so well. I compiled and booted
> 2.6.29-rc8-tip-02630-g93c4989, but had some problems with the system
> load when I tried to start tracing - it shot up to around 16-20 or so. I
> started shutting down VMs to try and get it under control, but before I
> got back to tracing again the machine disappeared off the network -
> unresponsive to ping.
>
> When I got in this morning, there was nothing on the console, nothing in
> the logs to show what went wrong. I will try again, but my next chance
> will probably be Saturday. Stay tuned.

Okay, new set of traces have been uploaded to:

  http://disenchant.net/tmp/bug-12465/trace-3/

These were done on the latest tip, which I pulled down this morning:
2.6.29-rc8-tip-02744-gd9937cb.

The system load was very high again when I first tried to trace with
several guests running, so I ended up only having the one guest running
and thankfully the bug was still reproducible that way.

Fingers crossed this set of traces is able to tell us something.

Regards,
Kevin.

-------- Forwarded Message --------
From: Kevin Shanahan <kmshanah@ucwb.org.au>
To: Frederic Weisbecker <fweisbec@gmail.com>
Cc: Avi Kivity <avi@redhat.com>, Rafael J. Wysocki <rjw@sisk.pl>, Linux Kernel Mailing List <linux-kernel@vger.kernel.org>, Kernel Testers List <kernel-testers@vger.kernel.org>, Ingo Molnar <mingo@elte.hu>, Mike Galbraith <efault@gmx.de>, Peter Zijlstra <a.p.zijlstra@chello.nl>
Subject: Re: [Bug #12465] KVM guests stalling on 2.6.28 (bisected)
Date: Fri, 27 Mar 2009 06:53:05 +1030

On Tue, 2009-03-24 at 12:44 +0100, Frederic Weisbecker wrote:
> As I explained in my previous mail, your trace is only
> a snapshot that happened in 10 msec.
>
> I experimented with different sizes for the ring buffer but even
> a 1 second trace requires 20 MB of memory. And such a huge trace
> would be impractical.
>
> I think we should keep the trace filters we had previously.
> If you don't mind, could you please retest against latest -tip with
> the following updated patch? I added the filters, fixed the python
> subshell and also flushed the buffer more nicely according to
> a recent feature in -tip:
>
> echo > trace
>
> instead of switching to nop.
> You will need to pull latest -tip again.

Ok, new set of traces uploaded again here:

  http://disenchant.net/tmp/bug-12465/trace-4/

These were taken using 2.6.29-tip-02749-g398bf09. Same as last time, it
was only necessary to have the one guest running to reproduce the problem.

Cheers,
Kevin.

Is this still a problem on newer kernels?

No longer a problem. Currently using Linux 3.2.9 and qemu-kvm 1.0.