Bug 10378 - Task blocked for more than 120 seconds
Summary: Task blocked for more than 120 seconds
Status: CLOSED CODE_FIX
Alias: None
Product: IO/Storage
Classification: Unclassified
Component: LVM2/DM
Hardware: All Linux
Importance: P1 high
Assignee: Alasdair G Kergon
URL:
Keywords:
Depends on:
Blocks:
 
Reported: 2008-04-01 22:30 UTC by Ritesh Raj Sarraf
Modified: 2008-05-01 19:24 UTC

See Also:
Kernel Version: 2.6.24, 2.6.25-rc6
Subsystem:
Regression: ---
Bisected commit-id:



Description Ritesh Raj Sarraf 2008-04-01 22:30:56 UTC
Latest working kernel version: N/A

Earliest failing kernel version: 2.6.25-rc6

Distribution: Debian/Suse

Hardware Environment: Dell XPS M1210 Laptop
Dual Core CPU 2.0 GHz
RAM - 2GB
HDD - SATA 60 GB + USB 120 GB Self-Powered

Software Environment: DM + DM-Crypt + LVM2

Problem Description:
> On my laptop, doing heavy C++ compilations in parallel with -j3 (this
> is a dual core) often generates the following trace:
> 
> INFO: task g++:25119 blocked for more than 120 seconds.
> "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
> g++           D c03916c0     0 25119  25118
>        ecfef200 00200086 c038ef18 c03916c0 c03916c0 f77b2230 f77b2378 c17fa6c0
>        00000000 f89817bc f7486280 000000ff f7a1eb20 00000000 00000000 00000000
>        c17fa6c0 00000000 d1401e9c c17f0408 c02a2aa1 d1401e94 c0147100 c02a2c93
> Call Trace:
>  [<f89817bc>] dm_table_unplug_all+0x1e/0x2e [dm_mod]
>  [<c02a2aa1>] io_schedule+0x1b/0x24
>  [<c0147100>] sync_page+0x33/0x36
>  [<c02a2c93>] __wait_on_bit+0x33/0x58
>  [<c01470cd>] sync_page+0x0/0x36
>  [<c01472fd>] wait_on_page_bit+0x59/0x60
>  [<c012cf7b>] wake_bit_function+0x0/0x3c
>  [<c014e20d>] truncate_inode_pages_range+0x238/0x29f
>  [<c014e27d>] truncate_inode_pages+0x9/0xc
>  [<f8c67187>] ext2_delete_inode+0x12/0x6e [ext2]
>  [<f8c67175>] ext2_delete_inode+0x0/0x6e [ext2]
>  [<c0170ea5>] generic_delete_inode+0x8f/0xf3
>  [<c0170819>] iput+0x60/0x62
>  [<c0169aa5>] do_unlinkat+0xb7/0xf9
>  [<c0113a3e>] do_page_fault+0x1fa/0x4dc
>  [<c0104822>] sysenter_past_esp+0x5f/0x85
>  =======================
> 
> This is with 2.6.25-rc6 (SMP) and has been present, as far as I can
> remember, since the beginning of the 2.6.25-rc series. It is not
> always reproducible, but the trace is always the same.
> 
> My filesystem is stored on an ext3 (rw,noatime) dm_crypt'd partition
> living in an LVM volume.
> 
> % lsmod | grep dm | grep -v ' 0 *$'
> dm_crypt               14340  1 
> crypto_blkcipher       18308  6 ecb,cbc,dm_crypt
> dm_mod                 53008  26 dm_crypt,dm_mirror,dm_snapshot
> 
> Here is the code I have in dm-table.o:
> 
> 00001042 <dm_table_unplug_all>:
>     1042:       56                      push   %esi
>     1043:       53                      push   %ebx
>     1044:       8b 98 a0 00 00 00       mov    0xa0(%eax),%ebx
>     104a:       8d b0 a0 00 00 00       lea    0xa0(%eax),%esi
>     1050:       eb 10                   jmp    1062 <dm_table_unplug_all+0x20>
>     1052:       8b 43 10                mov    0x10(%ebx),%eax
>     1055:       8b 40 5c                mov    0x5c(%eax),%eax
>     1058:       8b 40 34                mov    0x34(%eax),%eax
>     105b:       e8 fc ff ff ff          call   105c <dm_table_unplug_all+0x1a>
>     1060:       8b 1b                   mov    (%ebx),%ebx
>     1062:       8b 03                   mov    (%ebx),%eax
>     1064:       0f 1f 40 00             nopl   0x0(%eax)
>     1068:       39 f3                   cmp    %esi,%ebx
>     106a:       75 e6                   jne    1052 <dm_table_unplug_all+0x10>
>     106c:       5b                      pop    %ebx
>     106d:       5e                      pop    %esi
>     106e:       c3                      ret    
> 
> The symbol in 105b call is, after relocation, blk_unplug.
> 
> Is there anything else I can do to help debug this?
> 
>  Sam
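
When gathering traces like the one quoted above, the hung-task header lines can be picked out of a saved kernel log. The following is a minimal sketch, assuming the log has been saved as plain text and that the message has the same form as the line quoted above; the function name `find_hung_tasks` and the sample text are hypothetical and only for illustration.

```python
import re

# Matches the header line format quoted in this report, e.g.
# "INFO: task g++:25119 blocked for more than 120 seconds."
HUNG_TASK_RE = re.compile(
    r"INFO: task (?P<comm>.+):(?P<pid>\d+) "
    r"blocked for more than (?P<secs>\d+) seconds\."
)


def find_hung_tasks(log_text):
    """Return a list of (command, pid, seconds) tuples found in log_text."""
    results = []
    for line in log_text.splitlines():
        match = HUNG_TASK_RE.search(line)
        if match:
            results.append(
                (match.group("comm"), int(match.group("pid")), int(match.group("secs")))
            )
    return results


# Hypothetical sample: two lines copied from the trace in this report.
sample = (
    "> INFO: task g++:25119 blocked for more than 120 seconds.\n"
    ">  [<c02a2aa1>] io_schedule+0x1b/0x24\n"
)

for comm, pid, secs in find_hung_tasks(sample):
    print(f"{comm} (pid {pid}) blocked for more than {secs} seconds")
```

The pattern uses `search` rather than `match`, so it still works when lines carry a quote prefix or a timestamp in front of the message.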

Steps to reproduce:
The steps to reproduce are not consistent. The last time I reported this, I wasn't able to reproduce the bug on a different setup (an IBM xSeries server).
One point to note is that, for both of us, it has been reproducible on laptops.

I'm filing this Bugzilla report because there are now two affected users, and the reports are identical.

This was discussed earlier in this thread:
https://www.redhat.com/archives/dm-devel/2008-March/msg00014.html
Comment 1 Ritesh Raj Sarraf 2008-04-01 22:45:58 UTC
There's a commit between rc6 and rc8 which I believe addresses this bug.
Neil, since you run RC kernels, would it be possible for you to test rc8?

commit 3f1e9070f63b0eecadfa059959bf7c9dbe835962
Author: Milan Broz <mbroz@redhat.com>
Date:   Fri Mar 28 14:16:07 2008 -0700

    dm crypt: fix ctx pending

    Fix regression in dm-crypt introduced in commit
    3a7f6c990ad04e6f576a159876c602d14d6f7fef ("dm crypt: use async crypto").

    If write requests need to be split into pieces, the code must not process them
    in parallel because the crypto context cannot be shared.  So there can be
    parallel crypto operations on one part of the write, but only one write bio
    can be processed at a time.

    This is not optimal and the workqueue code needs to be optimized for parallel
    processing, but for now it solves the problem without affecting the
    performance of synchronous crypto operation (most of current dm-crypt users).

    http://bugzilla.kernel.org/show_bug.cgi?id=10242
    http://bugzilla.kernel.org/show_bug.cgi?id=10207

    Signed-off-by: Milan Broz <mbroz@redhat.com>
    Signed-off-by: Alasdair G Kergon <agk@redhat.com>
    Cc: "Rafael J. Wysocki" <rjw@sisk.pl>
    Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
    Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Comment 2 Ritesh Raj Sarraf 2008-04-15 11:31:08 UTC
Neil,

Did you get a chance to verify it with the latest rc?
Comment 3 Neil Brown 2008-04-30 23:40:08 UTC
Sorry, but I've got no idea what comment #2 is asking.
I'm not aware that I was going to verify anything....

Confused.
Comment 4 Ritesh Raj Sarraf 2008-05-01 12:23:06 UTC
You reported the same behavior on dm-devel some weeks back. I have seen this behavior too.

That is why I opened this Bugzilla report. Since you mentioned that you use rc kernels, I wanted to know if the behavior was still present after the above-mentioned patch was applied.

I asked for your help in comment #1.
Comment 5 Neil Brown 2008-05-01 19:17:29 UTC
I think I know what caused the confusion.

Samuel Tardieu has this problem and sent an email to the linux-raid
mailing list.
I forwarded it to dm-devel because that was a more appropriate mailing list.
You misunderstood that email and thought I was experiencing the problem.
But I wasn't.  I've never used dm-crypt.  You need to talk to Samuel.
Comment 6 Alasdair G Kergon 2008-05-01 19:24:25 UTC
Assuming this is fixed.  If not, please reopen with an updated trace from a recent kernel. 
