Bug 7483 - Unable to handle kernel paging request for data at address 0x5a5a5a5a5a5a5a5a
Summary: Unable to handle kernel paging request for data at address 0x5a5a5a5a5a5a5a5a
Status: REJECTED UNREPRODUCIBLE
Alias: None
Product: File System
Classification: Unclassified
Component: CIFS (show other bugs)
Hardware: i386 Linux
: P2 normal
Assignee: fs_cifs
URL:
Keywords:
: 7484 7485 (view as bug list)
Depends on:
Blocks:
 
Reported: 2006-11-10 00:47 UTC by Shu Qing Yang
Modified: 2008-04-04 09:18 UTC (History)
5 users (show)

See Also:
Kernel Version: 2.6.19-rc5
Subsystem:
Regression: ---
Bisected commit-id:


Attachments
Draft patch (5.77 KB, patch)
2007-03-21 12:06 UTC, Olaf Kirch
Details | Diff

Description Shu Qing Yang 2006-11-10 00:47:44 UTC
Most recent kernel where this bug did not occur: 2.6.19-rc5

Distribution:  SLES10 GA

Hardware Environment: IBM p-series server (POWER5+)

Software Environment: SLES10 GA + 2.6.19-rc5 

Problem Description: 
I mount a local samba directory and run fsstress in it. Several minutes later, 
the machine is no response and entered xmon(a debug mode in Power arch).

Dmesg as following:

CIFS VFS: close with pending writes
CIFS VFS: No task to wake, unknown frame rcvd! NumMids 8
Received Data is: : dump of 37 bytes of data at 0xc00000003a6d9300
 0000003a ff534d42 32000000 008041c0 . . . : 
Comment 1 Andrew Morton 2006-11-10 01:07:53 UTC
(switching to email - please follow up via reply-to-all and not via bugzilla)

On Fri, 10 Nov 2006 00:48:31 -0800
bugme-daemon@bugzilla.kernel.org wrote:

> http://bugzilla.kernel.org/show_bug.cgi?id=7483
> 
>            Summary: Unable to handle kernel paging request for data at
>                     address 0x5a5a5a5a5a5a5a5a
>     Kernel Version: 2.6.19-rc5
>             Status: NEW
>           Severity: normal
>              Owner: fs_cifs@kernel-bugs.osdl.org

This could be a networking bug.

>          Submitter: sqy@vnet.ibm.com
>                 CC: akpm@osdl.org,walkinair@cn.ibm.com
> 
> 
> Most recent kernel where this bug did not occur: 2.6.19-rc5

You misunderstand.  We're asking what is the most recent version of the
kernel which *did not* have this bug.

> Distribution:  SLES10 GA
> 
> Hardware Environment: IBM p-series server (POWER5+)
> 
> Software Environment: SLES10 GA + 2.6.19-rc5 
> 
> Problem Description: 
> I mount a local samba directory and run fsstress in it. Several minutes later, 
> the machine is no response and entered xmon(a debug mode in Power arch).
> 
> Dmesg as following:
> 
> CIFS VFS: close with pending writes
> CIFS VFS: No task to wake, unknown frame rcvd! NumMids 8
> Received Data is: : dump of 37 bytes of data at 0xc00000003a6d9300
>  0000003a ff534d42 32000000 008041c0 . . . : _ S M B 2 . . . . . A _
>  00000000 00000000 00000000 0100cf08 . . . . . . . . . . . . . . _ .
>  6400422f 0a020000 d . B / .
> CIFS VFS: No task to wake, unknown frame rcvd! NumMids 
> ......
> CIFS VFS: close with pending writes.
> ......
> Unable to handle kernel paging request for data at address 0x5a5a5a5a5a5a5a5a.
> Faulting instruction address: 0xc0000000000a2e04..
> 
> xmon output:
> 
> 5:mon> e
> cpu 0x5: Vector: 300 (Data Access) at [c0000000075b7470]
>     pc: c0000000000a2e04: .put_page+0x2c/0x16c
>     lr: c00000000042f238: .skb_release_data+0x84/0xe4
>     sp: c0000000075b76f0
>    msr: 8000000000009032
>    dar: 5a5a5a5a5a5a5a5a
>  dsisr: 40000000
>   current = 0xc000000003000ad0
>   paca    = 0xc00000000062ce00
>     pid   = 4419, comm = syslog-ng
> 5:mon> t
> [c0000000075b7790] c00000000042f238 .skb_release_data+0x84/0xe4
> [c0000000075b7820] c00000000042ef04 .kfree_skbmem+0x20/0xd4
> [c0000000075b78a0] c000000000431c48 .skb_free_datagram+0x14/0x28
> [c0000000075b7920] c0000000004a7018 .unix_dgram_recvmsg+0x238/0x294
> [c0000000075b7a10] c000000000427b4c .sock_recvmsg+0xd0/0x110
> [c0000000075b7c10] c000000000428e18 .sys_recvfrom+0xcc/0x14c
> [c0000000075b7d90] c00000000044784c .compat_sys_socketcall+0x194/0x214
> [c0000000075b7e30] c000000000008724 syscall_exit+0x0/0x40
> --- Exception: c01 (System Call) at 0000000007edffdc
> SP (fa4df720) is in userspace
> 5:mon> r
> R00 = c00000000042f238   R16 = 0000000010020000
> R01 = c0000000075b76f0   R17 = 0000000010020000
> R02 = c000000000848180   R18 = 0000000010010000
> R03 = 5a5a5a5a5a5a5a5a   R19 = 0000000000000003
> R04 = 0000000000000002   R20 = 00000000fbe5fe96
> R05 = 0000000000020002   R21 = 0000000010020000
> R06 = 0000000000000000   R22 = c000000003f184b0
> R07 = 0000000000000000   R23 = c000000003de96e8
> R08 = 3520336320633000   R24 = 0000000000000050
> R09 = c000000021f18e20   R25 = c000000003f181c8
> R10 = c0000000075bb8e8   R26 = 0000000000000040
> R11 = c000000021f18e20   R27 = c00000007b165e60
> R12 = fa4df78cfa4df788   R28 = c0000000075b7a80
> R13 = c00000000062ce00   R29 = c00000007b165e60
> R14 = 0000000000000000   R30 = c000000000662d10
> R15 = 0000000010020000   R31 = 5a5a5a5a5a5a5a5a
> pc  = c0000000000a2e04 .put_page+0x2c/0x16c
> lr  = c00000000042f238 .skb_release_data+0x84/0xe4
> msr = 8000000000009032   cr  = 24044884
> ctr = c0000000004a7cb4   xer = 0000000020000001   trap =  300
> dar = 5a5a5a5a5a5a5a5a   dsisr = 40000000
> 
> Steps to reproduce:
> 
> 1.mkfs.ext3 /dev/sdb1
> 2.mount /dev/sdb1 /home
> 3.mkdir /home/public
> 4.add entry "public" (export directory /home/public) to /etc/samba/smb.conf, 
> then restart samba service
> 5.mount //localhost/public /mnt/test -o username=xxx,password=xxx
> 6.fsstress -l 500 -p 1000 -n 1000 -d /mnt/test
> 

I guess skb_release_data() ran off the end of the ->frags array and passed
uninitialised, kmalloced data to put_page().

It's a pity that you're running both samba and the CIFS client on the same
machine.  If you were to run them on two separate machines across the
network then we might be able to eliminate CIFS.

But it does look like CIFS is involved.

Comment 2 Andrew Morton 2006-11-10 01:19:56 UTC
*** Bug 7484 has been marked as a duplicate of this bug. ***
Comment 3 Shu Qing Yang 2006-11-10 01:29:37 UTC
*** Bug 7485 has been marked as a duplicate of this bug. ***
Comment 4 Shu Qing Yang 2006-11-12 19:02:29 UTC
Quoting Andrew Morton <akpm@osdl.org>:

>
> (switching to email - please follow up via reply-to-all and not via bugzilla)
>
> On Fri, 10 Nov 2006 00:48:31 -0800
> bugme-daemon@bugzilla.kernel.org wrote:
>
>> http://bugzilla.kernel.org/show_bug.cgi?id=7483
>>
>>            Summary: Unable to handle kernel paging request for data at
>>                     address 0x5a5a5a5a5a5a5a5a
>>     Kernel Version: 2.6.19-rc5
>>             Status: NEW
>>           Severity: normal
>>              Owner: fs_cifs@kernel-bugs.osdl.org
>
> This could be a networking bug.
>
>>          Submitter: sqy@vnet.ibm.com
>>                 CC: akpm@osdl.org,walkinair@cn.ibm.com
>>
>>
>> Most recent kernel where this bug did not occur: 2.6.19-rc5
>
> You misunderstand.  We're asking what is the most recent version of the
> kernel which *did not* have this bug.
>
Sorry.
>> Distribution:  SLES10 GA
>>
>> Hardware Environment: IBM p-series server (POWER5+)
>>
>> Software Environment: SLES10 GA + 2.6.19-rc5
>>
>> Problem Description:
>> I mount a local samba directory and run fsstress in it. Several   
>> minutes later,
>> the machine is no response and entered xmon(a debug mode in Power arch).
>>
>> Dmesg as following:
>>
>> CIFS VFS: close with pending writes
>> CIFS VFS: No task to wake, unknown frame rcvd! NumMids 8
>> Received Data is: : dump of 37 bytes of data at 0xc00000003a6d9300
>>  0000003a ff534d42 32000000 008041c0 . . . : _ S M B 2 . . . . . A _
>>  00000000 00000000 00000000 0100cf08 . . . . . . . . . . . . . . _ .
>>  6400422f 0a020000 d . B / .
>> CIFS VFS: No task to wake, unknown frame rcvd! NumMids
>> ......
>> CIFS VFS: close with pending writes.
>> ......
>> Unable to handle kernel paging request for data at address   
>> 0x5a5a5a5a5a5a5a5a.
>> Faulting instruction address: 0xc0000000000a2e04..
>>
>> xmon output:
>>
>> 5:mon> e
>> cpu 0x5: Vector: 300 (Data Access) at [c0000000075b7470]
>>     pc: c0000000000a2e04: .put_page+0x2c/0x16c
>>     lr: c00000000042f238: .skb_release_data+0x84/0xe4
>>     sp: c0000000075b76f0
>>    msr: 8000000000009032
>>    dar: 5a5a5a5a5a5a5a5a
>>  dsisr: 40000000
>>   current = 0xc000000003000ad0
>>   paca    = 0xc00000000062ce00
>>     pid   = 4419, comm = syslog-ng
>> 5:mon> t
>> [c0000000075b7790] c00000000042f238 .skb_release_data+0x84/0xe4
>> [c0000000075b7820] c00000000042ef04 .kfree_skbmem+0x20/0xd4
>> [c0000000075b78a0] c000000000431c48 .skb_free_datagram+0x14/0x28
>> [c0000000075b7920] c0000000004a7018 .unix_dgram_recvmsg+0x238/0x294
>> [c0000000075b7a10] c000000000427b4c .sock_recvmsg+0xd0/0x110
>> [c0000000075b7c10] c000000000428e18 .sys_recvfrom+0xcc/0x14c
>> [c0000000075b7d90] c00000000044784c .compat_sys_socketcall+0x194/0x214
>> [c0000000075b7e30] c000000000008724 syscall_exit+0x0/0x40
>> --- Exception: c01 (System Call) at 0000000007edffdc
>> SP (fa4df720) is in userspace
>> 5:mon> r
>> R00 = c00000000042f238   R16 = 0000000010020000
>> R01 = c0000000075b76f0   R17 = 0000000010020000
>> R02 = c000000000848180   R18 = 0000000010010000
>> R03 = 5a5a5a5a5a5a5a5a   R19 = 0000000000000003
>> R04 = 0000000000000002   R20 = 00000000fbe5fe96
>> R05 = 0000000000020002   R21 = 0000000010020000
>> R06 = 0000000000000000   R22 = c000000003f184b0
>> R07 = 0000000000000000   R23 = c000000003de96e8
>> R08 = 3520336320633000   R24 = 0000000000000050
>> R09 = c000000021f18e20   R25 = c000000003f181c8
>> R10 = c0000000075bb8e8   R26 = 0000000000000040
>> R11 = c000000021f18e20   R27 = c00000007b165e60
>> R12 = fa4df78cfa4df788   R28 = c0000000075b7a80
>> R13 = c00000000062ce00   R29 = c00000007b165e60
>> R14 = 0000000000000000   R30 = c000000000662d10
>> R15 = 0000000010020000   R31 = 5a5a5a5a5a5a5a5a
>> pc  = c0000000000a2e04 .put_page+0x2c/0x16c
>> lr  = c00000000042f238 .skb_release_data+0x84/0xe4
>> msr = 8000000000009032   cr  = 24044884
>> ctr = c0000000004a7cb4   xer = 0000000020000001   trap =  300
>> dar = 5a5a5a5a5a5a5a5a   dsisr = 40000000
>>
>> Steps to reproduce:
>>
>> 1.mkfs.ext3 /dev/sdb1
>> 2.mount /dev/sdb1 /home
>> 3.mkdir /home/public
>> 4.add entry "public" (export directory /home/public) to /etc/samba/smb.conf,
>> then restart samba service
>> 5.mount //localhost/public /mnt/test -o username=xxx,password=xxx
>> 6.fsstress -l 500 -p 1000 -n 1000 -d /mnt/test
>>
>
> I guess skb_release_data() ran off the end of the ->frags array and passed
> uninitialised, kmalloced data to put_page().
>
> It's a pity that you're running both samba and the CIFS client on the same
> machine.  If you were to run them on two separate machines across the
> network then we might be able to eliminate CIFS.
>
I separate them and run the same test case. The problem still appears  
on the client end. And it seems no related with samba server.
> But it does look like CIFS is involved.
>




Comment 5 Steve French 2007-02-22 20:45:58 UTC
But it does appear related to the tcp stack.

Has the failing part of the skb handling been identified by one of the tcp guys?
Comment 6 Olaf Kirch 2007-03-09 01:41:10 UTC
Note it's not TCP. This is unix_dgram_recvmsg, ie AF_LOCAL sockets. And while
it may be CIFS that's ultimately causing the problem, it's syslog-ng that's
crashing. And indeed it's kfree_skb that is walking off the end of the
frags array.

I dimly recall seeing something like this reported against SLES10 before...
but I can't remember any details, or whether we were able to fix it. It may
help to search the Novell bugzilla. IBM should have access to SLES bugs
(I don't)
Comment 7 Olaf Kirch 2007-03-21 09:43:53 UTC
I think "CIFS VFS: close with pending writes" is an important clue here.

There is virtually no locking between cifs_close destroying a file,
and cifs_writepages trying to use it (via find_writable_file). Granted,
when we traverse cifs_inode->openFileList we grab the global session lock,
and we check closePend. However, the load/store reordering on power does not
guarantee that other processors will actually see the changed closePend flag
(read_lock only does a lwsync barrier).

So, one potential improvement may be to place a wmb() after setting closePend.
Which looks ugly. Alternatively, you could remove the file from the inode's
openFileList _before_ we actually check for for wrtPending.

Another thing I noticed is that find_writable_file will call cifs_reopen_file
while holding the GlobalSMBSeslock, which is an rwlock. cifs_reopen_file
can sleep however.

However, neither of these is probably what causes the oops. It's more likely
that cifs_partialpagewrite grabbed the file, and queued the write request
which takes quite a while to complete due to the high load. cifs_file_close
however will wait a maximum of 2 seconds before giving up and destroying
the file.

A possible (untested) patch idea might be to treat wrtPending as a kind of
reference count, and destroy the file when the last pending write is done.
Comment 8 Olaf Kirch 2007-03-21 12:04:04 UTC
Actually it will wait for a maximum of 682 milliseconds only.

Concerning my comment on cifs_reopen_file - it actually does release
the global session lock, so it's not sleeping under spinlock. But what's
equally bad is that it continues scanning the list at the point
where it left off, regardless of the fact that the list may have
changed.
Comment 9 Olaf Kirch 2007-03-21 12:06:02 UTC
Created attachment 10905 [details]
Draft patch

Here's a very hand-wavy patch to give you a rough idea of what I
was referring to in comment #7
Comment 10 Olaf Kirch 2007-03-21 12:09:30 UTC
Adding Steve French back to the Cc list. This may be CIFS related after all.
Comment 11 Natalie Protasevich 2007-07-04 20:10:19 UTC
Shu Qing, have you tried latest kernel, is the problem still there?
Any other updates on this bug?
Thanks.
Comment 12 Natalie Protasevich 2008-04-04 09:18:10 UTC
Closing this bug for no recent activity.
Please reopen if confirmed with recent kernel.

Note You need to log in before you can comment on or make changes to this bug.