Bug 42578 - Kernel crash "Out of memory error by X" when using NTFS file system on external USB Hard drive
Status: CLOSED CODE_FIX
Alias: None
Product: Memory Management
Classification: Unclassified
Component: Page Allocator
Hardware: All
OS: Linux
Importance: P1 normal
Assignee: Andrew Morton
URL:
Keywords:
Depends on:
Blocks:
 
Reported: 2012-01-14 16:23 UTC by Stuart Foster
Modified: 2012-06-13 15:13 UTC
CC List: 3 users

See Also:
Kernel Version: 3.1.9
Subsystem:
Regression: No
Bisected commit-id:


Attachments
dmesg output and other system details (24.38 KB, application/x-bzip2)
2012-01-14 16:23 UTC, Stuart Foster
Contents of /proc/slabinfo (9.21 KB, application/octet-stream)
2012-01-17 23:55 UTC, Stuart Foster

Description Stuart Foster 2012-01-14 16:23:22 UTC
Created attachment 72066 [details]
dmesg output and other system details

The device that triggers the fault is a 1 TB external USB hard drive. The device is a USB 3.0 device, but the system crashes with OOM errors in either 3.0 or 2.0 USB motherboard ports. The system board is an ASUS M5A97 PRO motherboard with an AMD Phenom II X6 1100T processor and 16 GB of DDR3 memory. The fault appears to be triggered simply by copying a large file (16 GB+) off the external drive over USB. The system is 100% stable in all other respects. The problem was first noticed on an earlier kernel (3.0.6, possibly, from memory), but at that point only on a USB 3.0 port; moving the device onto a USB 2.0 port "cured" the problem. However, the device now appears to be failing on USB 2.0 as well.
Comment 1 Andrew Morton 2012-01-17 22:18:58 UTC
660MB of ZONE_NORMAL is tied up in slab_reclaimable memory.

I wonder what that is.  Are you able to capture the contents of /proc/slabinfo when the system is in this state, or is close to it?
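[Editor's note: a snapshot can be captured and summarized roughly as follows. This is a sketch, not from the report: the helper names are invented, the column positions assume the slabinfo 2.x format, and reading /proc/slabinfo generally requires root.]

```shell
# Save a raw copy of slabinfo for attaching to the report.
snapshot_slabinfo() {
    cat "$1" > "slabinfo.$(date +%Y%m%d-%H%M%S).txt"
}

# List the caches holding the most memory. slabinfo v2.x data columns:
# name, active_objs, num_objs, objsize, ... so num_objs * objsize
# approximates each cache's footprint.
top_slabs() {
    awk 'NR > 2 { printf "%12d KB  %s\n", $3 * $4 / 1024, $1 }' "$1" |
        sort -rn | head
}
```

For example, `top_slabs /proc/slabinfo` (as root) while the copy is in progress would show which cache is eating lowmem.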
Comment 2 Stuart Foster 2012-01-17 23:55:49 UTC
Created attachment 72099 [details]
Contents of /proc/slabinfo

I am able to log in to a console shell after the crash (after a few attempts); the contents of /proc/slabinfo are available (attached).
Comment 3 Andrew Morton 2012-01-18 00:03:33 UTC
OK, lowmem is full of buffer_heads.

The machine has 16G highmem.

6G of highmem is in pagecache.  Presumably the buffer_heads are attached to the highmem pagecache and the VM failed to release them because it wasn't looking at highmem pages.

We had all this working 5-10 years ago.  I suspect it was subsequently broken because few people are still using huge highmem boxes.

I don't suppose you've tested this sort of workload on any earlier kernel versions?
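[Editor's note: on a 32-bit highmem kernel, the lowmem/highmem split Andrew describes is visible directly in /proc/meminfo. A small helper sketch, with the caveats that the function name is invented and the Low*/High* lines only exist when CONFIG_HIGHMEM is enabled:]

```shell
# Show the lowmem/highmem split from a meminfo file (a live
# /proc/meminfo, or a copy saved at crash time). Buffers is included
# because buffer_heads track buffer-cache data.
mem_split() {
    grep -E '^(MemTotal|LowTotal|LowFree|HighTotal|HighFree|Buffers):' "$1"
}
```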
Comment 4 Stuart Foster 2012-01-18 09:22:11 UTC
(In reply to comment #3)
> OK, lowmem is full of buffer_heads.
> 
> The machine has 16G highmem.
> 
> 6G of highmem is in pagecache.  Presumably the buffer_heads are attached to
> the highmem pagecache and the VM failed to release them because it wasn't
> looking at highmem pages.
> 
> We had all this working 5-10 years ago.  I suspect it was subsequently
> broken because few people are still using huge highmem boxes.
> 
> I don't suppose you've tested this sort of workload on any earlier kernel
> versions?

This machine has been operational since the end of October last year; the external drive was added at about the same time. The machine replaced an ASUS M3A79-T Deluxe with a quad-core Phenom 940 and 8 GB of DDR2 memory, which had been operational since the 940 came out. The upgrade reused the same internal hard drives etc., and the workloads have been pretty much the same over this time period for both configurations. The kernel would have been the latest mainline version at the time (end of October 2011). The kernel config has been essentially the same for the last few years and, as I noted in my initial submission, the fault only shows up with the introduction of the external USB drive. I have not noticed any similar problems with the internal drives. Having said that, the internal drives are all configured for AHCI with ext4, while the USB drive is configured for NTFS, which together with the USB component probably changes the dynamics significantly.
Comment 5 Stuart Foster 2012-01-18 10:29:03 UTC
(In reply to comment #4)
> (In reply to comment #3)
> > I don't suppose you've tested this sort of workload on any earlier kernel
> > versions?
> [...]

Sorry, to answer your question: I have not tried earlier kernels. How far back do you suggest I go? I am concerned that if I go back too far I might do damage to my system.
Comment 6 Andrew Morton 2012-01-18 17:29:58 UTC
One possibility is that NTFS is using 1kbyte buffer_heads, whereas ext4 is using 4k. That would cause NTFS to require 4x as many buffer_heads to support the same amount of highmem pagecache, leading to the OOM.  I don't immediately know how to work out what blocksize your NTFS mount is using.

Regarding earlier kernels: don't bother - you'd probably have to go back a loooong way.  If you can think of a way of working out what the NTFS blocksize is, that would help.
Comment 7 Stuart Foster 2012-01-18 18:50:00 UTC
(In reply to comment #6)
> One possibility is that NTFS is using 1kbyte buffer_heads, whereas ext4 is
> using 4k. That would cause NTFS to require 4x as many buffer_heads to
> support the same amount of highmem pagecache, leading to the OOM.  I don't
> immediately know how to work out what blocksize your NTFS mount is using.
> 
> Regarding earlier kernels: don't bother - you'd probably have to go back a
> loooong way.  If you can think of a way of working out what the NTFS
> blocksize is, that would help.

I recovered this information by plugging the drive into a friend's laptop; does this help?

NTFS Volume Serial Number :       0x350b63fc75775ac7
Version :                         3.1
Number Sectors :                  0x0000000074705981
Total Clusters :                  0x000000000e8e0b30
Free Clusters  :                  0x000000000974e983
Total Reserved :                  0x0000000000000a00
Bytes Per Sector  :               512
Bytes Per Cluster :               4096
Bytes Per FileRecord Segment    : 1024
Clusters Per FileRecord Segment : 0
Mft Valid Data Length :           0x000000000533a800
Mft Start Lcn  :                  0x0000000000000004
Mft2 Start Lcn :                  0x0000000007470598
Mft Zone Start :                  0x0000000000006340
Mft Zone End   :                  0x0000000001d1c180
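[Editor's note: the dump above looks like Windows `fsutil fsinfo ntfsinfo` output, and the two fields that matter for the buffer_head question are Bytes Per Sector and Bytes Per Cluster. A parsing sketch; the helper is hypothetical, not part of any kernel or NTFS tooling:]

```python
def ntfs_sizes(dump: str) -> dict:
    """Extract sector/cluster sizes from fsutil-style 'Key : value' lines."""
    wanted = ("Bytes Per Sector", "Bytes Per Cluster")
    sizes = {}
    for line in dump.splitlines():
        key, sep, value = line.partition(":")
        key = key.strip()
        if sep and key in wanted:
            sizes[key] = int(value.strip(), 0)  # base 0 also accepts 0x... values
    return sizes
```

With the numbers above this yields 512-byte sectors and 4096-byte clusters, i.e. 4096 / 512 = 8 sectors per page.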
Comment 8 Andrew Morton 2012-01-19 02:03:36 UTC
OK, Anton tells me that this fs will be using 512-byte buffer_heads - the worst case!  Eight buffer_heads in lowmem per highmem pagecache page.

I'll take this thread to email...
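[Editor's note: the scale of the problem falls out of simple arithmetic. A sketch; the exact sizeof(struct buffer_head) varies by kernel version and config, and ~56 bytes is an assumption here:]

```python
PAGE = 4096      # pagecache page size
SECTOR = 512     # NTFS sector size, per comment 7
BH_SIZE = 56     # assumed approximate sizeof(struct buffer_head)

bh_per_page = PAGE // SECTOR                       # 8 buffer_heads per page
pagecache = 6 * 1024**3                            # ~6G highmem pagecache (comment 3)
buffer_heads = (pagecache // PAGE) * bh_per_page   # heads pinned in lowmem
lowmem_used = buffer_heads * BH_SIZE

print(bh_per_page)                 # 8
print(buffer_heads)                # 12582912
print(lowmem_used // 2**20, "MB")  # 672 MB, in line with the ~660 MB of
                                   # slab_reclaimable reported in comment 1
```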
Comment 9 Stuart Foster 2012-01-19 09:31:26 UTC
(In reply to comment #8)
> OK, Anton tells me that this fs will be using 512-byte buffer_heads - the
> worst case!  Eight buffer_heads in lowmem per highmem pagecache page.
> 
> I'll take this thread to email...

Ok, thank you.
Comment 10 Florian Mickler 2012-04-04 15:00:53 UTC
A patch referencing this bug report has been merged in Linux v3.4-rc1:

commit cc715d99e529d470dde2f33a6614f255adea71f3
Author: Mel Gorman <mel@csn.ul.ie>
Date:   Wed Mar 21 16:34:00 2012 -0700

    mm: vmscan: forcibly scan highmem if there are too many buffer_heads pinning highmem
