Created attachment 25155 [details]
syslog of jbd2/dm-1-8:973 blocked for more than 120 seconds.

This is a Fedora kernel, kernel-2.6.32.7-41.fc12.x86_64, running as a Xen DomU under a RedHat 2.6.18-186.el5xen Dom0. The workload is an LTSP server with 5 to 40 users, plus a low-load web/smb/nfs server. Best uptime has been just over 2 days; worst was 2 minutes. Out of the blue, processes will start locking up in D state. Sometimes the lockups spread quickly, other times slowly. Load has gotten to over 300 before we give up and reboot. The system does not recover by itself.
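For anyone trying to catch this early: the spread of D-state tasks can be watched with stock procps; a minimal sketch (the ps field names are standard, nothing here is specific to this box):

```shell
#!/bin/sh
# List every task in uninterruptible sleep together with the kernel
# function it is sleeping in (wchan).  A jbd2 thread stuck in a
# journal wait stands out immediately, before the D's cascade.
ps -eo pid,stat,wchan,comm | awk 'NR == 1 || $2 ~ /^D/'
```

Run it from cron or a watch loop; once the count starts climbing it is already too late to recover, but the first wchan is the interesting one.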
I've seen similar crashes with 2.6.31 except that the hung task tends to be kjournald. This kernel mailing list thread seems similar: http://www.mail-archive.com/kvm@vger.kernel.org/msg23039.html
Other redhat bugzillas (one filed by me) that are related:

kernel 2.6.31 processes lock up in D state
https://bugzilla.redhat.com/show_bug.cgi?id=550724

FC12 2.6.31.9-174.fc12.x86_64 hangs under heavy disk I/O
https://bugzilla.redhat.com/show_bug.cgi?id=551552

F12 Xen DomU unstable (2.6.31)
https://bugzilla.redhat.com/show_bug.cgi?id=526627
top output of dying system:

top - 10:58:42 up 4:55, 21 users, load average: 62.42, 43.99, 22.27
Tasks: 777 total, 1 running, 776 sleeping, 0 stopped, 0 zombie
Cpu(s): 0.4%us, 0.6%sy, 0.0%ni, 31.9%id, 67.2%wa, 0.0%hi, 0.0%si, 0.0%st
Mem: 30100024k total, 4208836k used, 25891188k free, 313452k buffers
Swap: 10239992k total, 0k used, 10239992k free, 1557692k cached

  PID USER    PR NI  VIRT  RES  SHR S %CPU %MEM   TIME+ COMMAND
11652 root    20  0 20528 1704  872 R  1.6  0.0 1:46.83 top
  973 root    20  0     0    0    0 D  0.0  0.0 0:00.33 jbd2/dm-1-8
 1069 root    20  0     0    0    0 D  0.0  0.0 0:00.08 flush-253:1
 2645 apache  20  0  475m  21m 6020 D  0.0  0.1 0:05.25 httpd
 2647 apache  20  0  474m  19m 5996 D  0.0  0.1 0:04.74 httpd
 2648 apache  20  0  474m  19m 5992 D  0.0  0.1 0:04.53 httpd
 2650 apache  20  0  475m  20m 6036 D  0.0  0.1 0:06.52 httpd
 2651 apache  20  0  476m  22m 5980 D  0.0  0.1 0:04.70 httpd
 2652 apache  20  0  476m  21m 6028 D  0.0  0.1 0:05.44 httpd
 2896 apache  20  0  476m  21m 6016 D  0.0  0.1 0:05.54 httpd
 2897 apache  20  0  475m  20m 6016 D  0.0  0.1 0:05.27 httpd
 2898 apache  20  0  476m  21m 6004 D  0.0  0.1 0:05.18 httpd
 5028 apache  20  0  474m  20m 6020 D  0.0  0.1 0:05.01 httpd
 8075 name1   20  0  137m 6484 2040 D  0.0  0.0 0:00.40 gconfd-2
 8085 name1   20  0  332m  16m  11m D  0.0  0.1 0:01.47 gnome-panel
 8929 apache  20  0  475m  20m 5964 D  0.0  0.1 0:04.40 httpd
10458 name2   20  0  214m 5436 3784 D  0.0  0.0 0:00.40 smbd
10740 apache  20  0  475m  20m 5916 D  0.0  0.1 0:02.48 httpd
12005 name2   20  0  214m 5636 4028 D  0.0  0.0 0:00.48 smbd
14550 name3   20  0  110m 1608 1248 D  0.0  0.0 0:00.02 bash
14592 name3   20  0  820m 102m  28m D  0.0  0.3 0:42.92 thunderbird-bin
15914 name3   20  0  712m 116m  26m D  0.0  0.4 3:52.89 firefox
17831 name2   20  0  214m 6068 4104 D  0.0  0.0 0:01.61 smbd
19420 name3   20  0  311m  17m 7108 D  0.0  0.1 0:08.50 gvim
19429 name3   20  0  149m  13m 2688 D  0.0  0.0 0:00.76 xdvi-xaw3d
21706 name4   20  0  227m 7736 4000 D  0.0  0.0 0:01.26 pine
22351 name3   20  0  309m  15m 6880 D  0.0  0.1 0:00.46 gvim
23276 name3   20  0  171m  91m  26m D  0.0  0.3 5:38.29 acroread
23924 math101 20  0  160m 8096 3920 D  0.0  0.0 0:00.15 vim
23944 math101 20  0  7688  332  256 D  0.0  0.0 0:00.00 fortune
23991 name7   20  0 54968 2156 1804 D  0.0  0.0 0:00.00 imap
24001 name5   20  0 99.5m 3832 1132 D  0.0  0.0 0:00.00 sendmail
24032 name3   20  0 17760  532  420 D  0.0  0.0 0:00.00 xauth
24053 name3   20  0 17760  532  420 D  0.0  0.0 0:00.00 xauth
24062 name3   20  0 17760  532  420 D  0.0  0.0 0:00.00 xauth
24067 name6   20  0 66268 2068 1452 D  0.0  0.0 0:00.03 sftp-server
24109 name3   20  0 17760  532  420 D  0.0  0.0 0:00.00 xauth
24118 name3   20  0 17760  532  420 D  0.0  0.0 0:00.00 xauth
24127 name3   20  0 17760  532  420 D  0.0  0.0 0:00.00 xauth
24136 name3   20  0 17760  532  420 D  0.0  0.0 0:00.00 xauth
24138 name2   20  0  214m 5412 3844 D  0.0  0.0 0:00.04 smbd
24158 name8   20  0 55112 2144 1764 D  0.0  0.0 0:00.03 imap
24166 name2   20  0  214m 5036 3500 D  0.0  0.0 0:00.03 smbd
24175 name6   20  0 66268 2052 1448 D  0.0  0.0 0:00.03 sftp-server
24218 name2   20  0  214m 5404 3836 D  0.0  0.0 0:00.03 smbd
24244 name6   20  0 66268 2052 1448 D  0.0  0.0 0:00.03 sftp-server
24295 comp131 20  0  107m  20m 3976 D  0.0  0.1 0:00.28 index.cgi
24311 name1   20  0 20676  792  708 D  0.0  0.0 0:00.00 procmail
24318 name8   20  0 55144 2252 1836 D  0.0  0.0 0:00.00 imap
24336 apache  20  0  374m 9.8m 1908 D  0.0  0.0 0:00.00 httpd
24344 apache  20  0  374m 9.8m 1880 D  0.0  0.0 0:00.00 httpd
24345 apache  20  0  374m 9712 1628 D  0.0  0.0 0:00.00 httpd
24346 apache  20  0  374m 9.8m 1884 D  0.0  0.0 0:00.00 httpd
24347 apache  20  0  374m 9.8m 1900 D  0.0  0.0 0:00.00 httpd
24356 name2   20  0  214m 5264 3716 D  0.0  0.0 0:00.03 smbd
24360 apache  20  0  374m 9352 1292 D  0.0  0.0 0:00.00 httpd
24365 apache  20  0  374m 9668 1576 D  0.0  0.0 0:00.00 httpd
24367 name8   20  0  117m 1432  584 D  0.0  0.0 0:00.00 sshd
24370 apache  20  0  476m  19m 3996 D  0.0  0.1 0:00.11 httpd
24374 name9   20  0 55340 2472 1876 D  0.0  0.0 0:00.90 pop3
Created attachment 25158 [details]
hung tasks

Latest hung tasks from 2.6.32.7-41.fc12.x86_64
Created attachment 25159 [details] dmesg
atop output of a system that just went into the hung state. Note that one disk is at 90% busy. I don't always get these atop screens; sometimes the atop process itself seems to hang. All the CPU is being chewed by several acroreads consuming CPU due to some sort of bug with acroread and proxy settings.

ATOP - turing        2010/02/23 15:10:31        10 seconds elapsed
PRC | sys 61.11s    | user 45.27s   | #proc 1741    | #zombie 1     | #exit 7       |
CPU | sys 512%      | user 404%     | irq 35%       | idle 76%      | wait 169%     |
CPU | steal 4%      | stl/cpu 0%    |               |               |               |
cpu | sys 55%       | user 35%      | irq 1%        | idle 0%       | cpu009 w 9%   |
cpu | sys 51%       | user 35%      | irq 1%        | idle 13%      | cpu005 w 0%   |
cpu | sys 49%       | user 34%      | irq 1%        | idle 0%       | cpu004 w 16%  |
cpu | sys 48%       | user 34%      | irq 1%        | idle 16%      | cpu006 w 0%   |
cpu | sys 47%       | user 35%      | irq 1%        | idle 0%       | cpu011 w 17%  |
cpu | sys 45%       | user 36%      | irq 1%        | idle 0%       | cpu008 w 18%  |
cpu | sys 46%       | user 34%      | irq 1%        | idle 19%      | cpu003 w 0%   |
cpu | sys 45%       | user 33%      | irq 1%        | idle 0%       | cpu010 w 20%  |
cpu | sys 43%       | user 35%      | irq 1%        | idle 0%       | cpu002 w 20%  |
cpu | sys 24%       | user 27%      | irq 22%       | idle 0%       | cpu001 w 28%  |
cpu | sys 39%       | user 32%      | irq 1%        | idle 27%      | cpu007 w 0%   |
cpu | sys 23%       | user 36%      | irq 2%        | idle 0%       | cpu000 w 39%  |
CPL | avg1 42.15    | avg5 23.96    | avg15 18.00   | csw 99052     | intr 263662   |
MEM | tot 28.7G     | free 17.5G    | cache 3.3G    | buff 622.7M   | slab 651.4M   |
SWP | tot 9.8G      | free 9.8G     |               | vmcom 15.5G   | vmlim 24.1G   |
PAG | scan 0        | stall 0       |               | swin 0        | swout 0       |
DSK | xvdb          | busy 90%      | read 0        | write 0       | avio 0 ms     |
DSK | xvda          | busy 1%       | read 14       | write 54      | avio 1 ms     |
NET | transport     | tcpi 194947   | tcpo 115281   | udpi 159      | udpo 161      |
NET | network       | ipi 195130    | ipo 115450    | ipfrw 0       | deliv 195111  |
NET | eth0 ----     | pcki 194853   | pcko 115160   | si 157 Mbps   | so 82 Mbps    |
NET | lo ----       | pcki 291      | pcko 291      | si 369 Kbps   | so 369 Kbps   |
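When atop itself hangs, the same "one disk 90% busy" signal can be pulled straight out of /proc/diskstats; a rough sketch (field 13 is io_ticks, milliseconds spent doing I/O, so its one-second delta divided by 10 approximates %busy):

```shell
#!/bin/sh
# Approximate per-device %busy without atop/iostat: sample io_ticks
# (13th field of /proc/diskstats) one second apart.  A device pegged
# the way xvdb is above gains roughly 1000 ms per second.
t1=$(mktemp); t2=$(mktemp)
awk '{ print $3, $13 }' /proc/diskstats > "$t1"
sleep 1
awk '{ print $3, $13 }' /proc/diskstats > "$t2"
paste "$t1" "$t2" | awk '{ printf "%-10s %3d%% busy\n", $1, ($4 - $2) / 10 }'
rm -f "$t1" "$t2"
```

The nice thing is this keeps working when the fancier tools are themselves stuck in D state, since it only reads one procfs file.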
Problem persists with kernel-2.6.32.9-64.fc12.x86_64, as I posted in: https://bugzilla.redhat.com/show_bug.cgi?id=550724

This problem is quite frustrating. I have not been able to reproduce it in a different VM on either the same hardware or different hardware; other VMs seem to withstand heavy disk tests with no problem. It's something about the LTSP workload. I'm sure the network is in this mix somewhere. In the not very conclusive thread mentioned in comment #1, it is said: "I kept hitting this issue for 2.6.31 guest kernel even with a simple network test."

The system can hum along with labs full of students and work fine for several hours, then suddenly one of the jbd2/dm- tasks won't come out of D state, and the rest of the system follows as more and more processes catch the D state. Other jbd2/dm- tasks go D as well after the first one does. Depending on which jbd2/dm- goes D first, the growth of the D's can be very fast, or start slow and then spread at an increasing pace. The problem is load related, but it is not a big load; it's some combination of network and disk, I think. I'm at a loss as to what more information I can provide.
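For anyone else trying to get a usable trace out of this: besides booting with hung_task_panic=1, the hung-task detector can be tuned at runtime. A minimal sketch, assuming CONFIG_DETECT_HUNG_TASK is enabled in the guest kernel and you are root (the 120s timeout is just the default, not magic):

```shell
#!/bin/sh
# Runtime equivalent of booting with hung_task_panic=1: panic the
# guest as soon as the first task (e.g. jbd2/dm-1-8) sits in D state
# past the timeout, instead of letting the load climb past 300.
sysctl -w kernel.hung_task_timeout_secs=120
sysctl -w kernel.hung_task_panic=1

# Before it panics, magic SysRq 'w' dumps stack traces of every
# blocked (D-state) task into the kernel log, which is the most
# useful thing to attach here.
echo w > /proc/sysrq-trigger
dmesg | tail -n 60
```

Getting the panic (rather than a limping system) is what makes the xenctx dump from the Dom0 side possible.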
Created attachment 25291 [details]
/usr/lib64/xen/bin/xenctx -s System.map-2.6.32.9-64.fc12.x86_64

This is the result of running:

for p in $(seq 0 11)
do
    /usr/lib64/xen/bin/xenctx -s System.map-2.6.32.9-64.fc12.x86_64 27 $p
done

on the system after it panicked with hung_task_panic=1 passed to the F12 guest. Thanks to Andrew Jones on the RH bugzilla for suggesting this.
This bug is either in the RH enterprise Xen kernel or a problem with the hardware; it is not likely to be a bug in 2.6.32. Sorry for the noise.