Bug 15370 - task jbd2/dm-1-8:973 blocked for more than 120 seconds.
Summary: task jbd2/dm-1-8:973 blocked for more than 120 seconds.
Status: CLOSED INVALID
Alias: None
Product: IO/Storage
Classification: Unclassified
Component: LVM2/DM (show other bugs)
Hardware: All Linux
: P1 high
Assignee: Alasdair G Kergon
URL:
Keywords:
Depends on:
Blocks:
 
Reported: 2010-02-22 05:16 UTC by Norman Gaywood
Modified: 2011-02-24 15:23 UTC (History)
2 users (show)

See Also:
Kernel Version: 2.6.32.9
Tree: Fedora
Regression: No


Attachments
syslog of jbd2/dm-1-8:973 blocked for more than 120 seconds. (28.41 KB, text/plain)
2010-02-22 05:16 UTC, Norman Gaywood
Details
hung tasks (26.03 KB, text/plain)
2010-02-22 11:31 UTC, Norman Gaywood
Details
dmesg (22.79 KB, text/plain)
2010-02-22 11:35 UTC, Norman Gaywood
Details
/usr/lib64/xen/bin/xenctx -s System.map-2.6.32.9-64.fc12.x86_64 (14.35 KB, text/plain)
2010-03-01 14:54 UTC, Norman Gaywood
Details

Description Norman Gaywood 2010-02-22 05:16:31 UTC
Created attachment 25155 [details]
syslog of jbd2/dm-1-8:973 blocked for more than 120 seconds.

This is a fedora kernel kernel-2.6.32.7-41.fc12.x86_64 running as a Xen DomU under a RedHat 2.6.18-186.el5xen Dom0

Workload is an LTSP server with 5 to 40 users, low load web/smb/nfs server.

Best up time has been just over 2 days. Worst was 2 minutes.

Out of the blue processes will start locking up in D state. Sometime they grow quickly other times slowly. Load has got up to over 300 before we give up and reboot. System does not recover by itself.
Comment 1 Norman Gaywood 2010-02-22 05:17:07 UTC
I've seen similar crashes with 2.6.31 except that the hung task tends to
be kjournald. This kernel mailing list thread seems similar:

http://www.mail-archive.com/kvm@vger.kernel.org/msg23039.html
Comment 2 Norman Gaywood 2010-02-22 05:20:06 UTC
Other redhat bugzillas (one made by me) that are related:

kernel 2.6.31 processes lock up in D state
https://bugzilla.redhat.com/show_bug.cgi?id=550724

FC12 2.6.31.9-174.fc12.x86_64 hangs under heavy disk I/O
https://bugzilla.redhat.com/show_bug.cgi?id=551552

F12 Xen DomU unstable (2.6.31)
https://bugzilla.redhat.com/show_bug.cgi?id=526627
Comment 3 Norman Gaywood 2010-02-22 05:30:26 UTC
top output of dying system

top - 10:58:42 up  4:55, 21 users,  load average: 62.42, 43.99, 22.27
Tasks: 777 total,   1 running, 776 sleeping,   0 stopped,   0 zombie
Cpu(s):  0.4%us,  0.6%sy,  0.0%ni, 31.9%id, 67.2%wa,  0.0%hi,  0.0%si,  0.0%st
Mem:  30100024k total,  4208836k used, 25891188k free,   313452k buffers
Swap: 10239992k total,        0k used, 10239992k free,  1557692k cached

  PID USER      PR  NI  VIRT  RES  SHR S %CPU %MEM    TIME+  COMMAND
11652 root      20   0 20528 1704  872 R  1.6  0.0   1:46.83 top
  973 root      20   0     0    0    0 D  0.0  0.0   0:00.33 jbd2/dm-1-8
 1069 root      20   0     0    0    0 D  0.0  0.0   0:00.08 flush-253:1
 2645 apache    20   0  475m  21m 6020 D  0.0  0.1   0:05.25 httpd
 2647 apache    20   0  474m  19m 5996 D  0.0  0.1   0:04.74 httpd
 2648 apache    20   0  474m  19m 5992 D  0.0  0.1   0:04.53 httpd
 2650 apache    20   0  475m  20m 6036 D  0.0  0.1   0:06.52 httpd
 2651 apache    20   0  476m  22m 5980 D  0.0  0.1   0:04.70 httpd
 2652 apache    20   0  476m  21m 6028 D  0.0  0.1   0:05.44 httpd
 2896 apache    20   0  476m  21m 6016 D  0.0  0.1   0:05.54 httpd
 2897 apache    20   0  475m  20m 6016 D  0.0  0.1   0:05.27 httpd
 2898 apache    20   0  476m  21m 6004 D  0.0  0.1   0:05.18 httpd
 5028 apache    20   0  474m  20m 6020 D  0.0  0.1   0:05.01 httpd
 8075 name1     20   0  137m 6484 2040 D  0.0  0.0   0:00.40 gconfd-2
 8085 name1     20   0  332m  16m  11m D  0.0  0.1   0:01.47 gnome-panel
 8929 apache    20   0  475m  20m 5964 D  0.0  0.1   0:04.40 httpd
10458 name2     20   0  214m 5436 3784 D  0.0  0.0   0:00.40 smbd
10740 apache    20   0  475m  20m 5916 D  0.0  0.1   0:02.48 httpd
12005 name2     20   0  214m 5636 4028 D  0.0  0.0   0:00.48 smbd
14550 name3     20   0  110m 1608 1248 D  0.0  0.0   0:00.02 bash
14592 name3     20   0  820m 102m  28m D  0.0  0.3   0:42.92 thunderbird-bin
15914 name3     20   0  712m 116m  26m D  0.0  0.4   3:52.89 firefox
17831 name2     20   0  214m 6068 4104 D  0.0  0.0   0:01.61 smbd
19420 name3     20   0  311m  17m 7108 D  0.0  0.1   0:08.50 gvim
19429 name3     20   0  149m  13m 2688 D  0.0  0.0   0:00.76 xdvi-xaw3d
21706 name4     20   0  227m 7736 4000 D  0.0  0.0   0:01.26 pine
22351 name3     20   0  309m  15m 6880 D  0.0  0.1   0:00.46 gvim
23276 name3     20   0  171m  91m  26m D  0.0  0.3   5:38.29 acroread
23924 math101   20   0  160m 8096 3920 D  0.0  0.0   0:00.15 vim
23944 math101   20   0  7688  332  256 D  0.0  0.0   0:00.00 fortune
23991 name7     20   0 54968 2156 1804 D  0.0  0.0   0:00.00 imap
24001 name5     20   0 99.5m 3832 1132 D  0.0  0.0   0:00.00 sendmail
24032 name3     20   0 17760  532  420 D  0.0  0.0   0:00.00 xauth
24053 name3     20   0 17760  532  420 D  0.0  0.0   0:00.00 xauth
24062 name3     20   0 17760  532  420 D  0.0  0.0   0:00.00 xauth
24067 name6     20   0 66268 2068 1452 D  0.0  0.0   0:00.03 sftp-server
24109 name3     20   0 17760  532  420 D  0.0  0.0   0:00.00 xauth
24118 name3     20   0 17760  532  420 D  0.0  0.0   0:00.00 xauth
24127 name3     20   0 17760  532  420 D  0.0  0.0   0:00.00 xauth
24136 name3     20   0 17760  532  420 D  0.0  0.0   0:00.00 xauth
24138 name2     20   0  214m 5412 3844 D  0.0  0.0   0:00.04 smbd
24158 name8     20   0 55112 2144 1764 D  0.0  0.0   0:00.03 imap
24166 name2     20   0  214m 5036 3500 D  0.0  0.0   0:00.03 smbd
24175 name6     20   0 66268 2052 1448 D  0.0  0.0   0:00.03 sftp-server
24218 name2     20   0  214m 5404 3836 D  0.0  0.0   0:00.03 smbd
24244 name6     20   0 66268 2052 1448 D  0.0  0.0   0:00.03 sftp-server
24295 comp131   20   0  107m  20m 3976 D  0.0  0.1   0:00.28 index.cgi
24311 name1     20   0 20676  792  708 D  0.0  0.0   0:00.00 procmail
24318 name8     20   0 55144 2252 1836 D  0.0  0.0   0:00.00 imap
24336 apache    20   0  374m 9.8m 1908 D  0.0  0.0   0:00.00 httpd
24344 apache    20   0  374m 9.8m 1880 D  0.0  0.0   0:00.00 httpd
24345 apache    20   0  374m 9712 1628 D  0.0  0.0   0:00.00 httpd
24346 apache    20   0  374m 9.8m 1884 D  0.0  0.0   0:00.00 httpd
24347 apache    20   0  374m 9.8m 1900 D  0.0  0.0   0:00.00 httpd
24356 name2     20   0  214m 5264 3716 D  0.0  0.0   0:00.03 smbd
24360 apache    20   0  374m 9352 1292 D  0.0  0.0   0:00.00 httpd
24365 apache    20   0  374m 9668 1576 D  0.0  0.0   0:00.00 httpd
24367 name8     20   0  117m 1432  584 D  0.0  0.0   0:00.00 sshd
24370 apache    20   0  476m  19m 3996 D  0.0  0.1   0:00.11 httpd
24374 name9     20   0 55340 2472 1876 D  0.0  0.0   0:00.90 pop3
Comment 4 Norman Gaywood 2010-02-22 11:31:12 UTC
Created attachment 25158 [details]
hung tasks

Latest hung tasks from 2.6.32.7-41.fc12.x86_64
Comment 5 Norman Gaywood 2010-02-22 11:35:04 UTC
Created attachment 25159 [details]
dmesg
Comment 6 Norman Gaywood 2010-02-23 04:50:47 UTC
atop output of a system that just went into the hung state. Note that one disk is in in 90% busy. I don't always get these atop screens. Sometime the atop process seems to hang. All the CPU is being chewed by several acroreads comsuming CPU due to some sort of bug with acroread and proxy settings.

ATOP - turing             2010/02/23  15:10:31               10 seconds elapsed
PRC | sys  61.11s | user  45.27s | #proc   1741 | #zombie    1 | #exit      7 |
CPU | sys    512% | user    404% | irq      35% | idle     76% | wait    169% |
CPU | steal    4% | stl/cpu   0% |              |              |              |
cpu | sys     55% | user     35% | irq       1% | idle      0% | cpu009 w  9% |
cpu | sys     51% | user     35% | irq       1% | idle     13% | cpu005 w  0% |
cpu | sys     49% | user     34% | irq       1% | idle      0% | cpu004 w 16% |
cpu | sys     48% | user     34% | irq       1% | idle     16% | cpu006 w  0% |
cpu | sys     47% | user     35% | irq       1% | idle      0% | cpu011 w 17% |
cpu | sys     45% | user     36% | irq       1% | idle      0% | cpu008 w 18% |
cpu | sys     46% | user     34% | irq       1% | idle     19% | cpu003 w  0% |
cpu | sys     45% | user     33% | irq       1% | idle      0% | cpu010 w 20% |
cpu | sys     43% | user     35% | irq       1% | idle      0% | cpu002 w 20% |
cpu | sys     24% | user     27% | irq      22% | idle      0% | cpu001 w 28% |
cpu | sys     39% | user     32% | irq       1% | idle     27% | cpu007 w  0% |
cpu | sys     23% | user     36% | irq       2% | idle      0% | cpu000 w 39% |
CPL | avg1  42.15 | avg5   23.96 | avg15  18.00 | csw    99052 | intr  263662 |
MEM | tot   28.7G | free   17.5G | cache   3.3G | buff  622.7M | slab  651.4M |
SWP | tot    9.8G | free    9.8G |              | vmcom  15.5G | vmlim  24.1G |
PAG | scan	0 | stall      0 |              | swin       0 | swout      0 |
DSK |        xvdb | busy     90% | read       0 | write      0 | avio    0 ms |
DSK |        xvda | busy      1% | read      14 | write     54 | avio    1 ms |
NET | transport   | tcpi  194947 | tcpo  115281 | udpi     159 | udpo     161 |
NET | network     | ipi   195130 | ipo   115450 | ipfrw      0 | deliv 195111 |
NET | eth0   ---- | pcki  194853 | pcko  115160 | si  157 Mbps | so   82 Mbps |
NET | lo     ---- | pcki     291 | pcko     291 | si  369 Kbps | so  369 Kbps |
Comment 7 Norman Gaywood 2010-02-25 02:40:36 UTC
Problem persists with kernel-2.6.32.9-64.fc12.x86_64

As I posted in: https://bugzilla.redhat.com/show_bug.cgi?id=550724

This problem is quite frustrating. I have not been reproduce the problem in a
different VM on either the same hardware or different hardware. Other VMs seem
to withstand heavy disk tests with no problem.

It's something about the LTSP workload. The network I'm sure is in this mix
somewhere. In the not very conclusive thread mentioned in comment #1, it is
said: "I kept hitting this issue for 2.6.31 guest kernel even with a simple
network test."


The system can hum along with labs full of students and work fine for several
hours, then suddenly one of the jbd2/dm- tasks won't come out of D state and
the rest of the system follows on as more and more processes catch the D state.
Other jbd2/dm- tasks catch the D as well after the first one goes.

Depending on which jbd2/dm- goes D first, the growth of the D's can be very
fast or start slow and then spread at an increasing pace.

The problem is load related but it is not a big load. Its some combination of
network and disk I think.

I'm at a lose as to what more information I can provide.
Comment 8 Norman Gaywood 2010-03-01 14:54:28 UTC
Created attachment 25291 [details]
/usr/lib64/xen/bin/xenctx -s System.map-2.6.32.9-64.fc12.x86_64

This is the result of running:

for p in $(seq 0 11)
do
   /usr/lib64/xen/bin/xenctx -s System.map-2.6.32.9-64.fc12.x86_64 27 $p
done

on the system after it panicked with the hung_task_panic=1 passed to the F12
guest.

Thanks to Andrew Jones on the RH bugzilla for suggesting this.
Comment 9 Norman Gaywood 2010-05-17 03:43:19 UTC
This bug is either in the RH enterprise Xen kernel or a problem with the hardware.

Not likely to be a bug in 2.6.32

Sorry for the noise.

Note You need to log in before you can comment on or make changes to this bug.