Created attachment 72066 [details] dmesg output and other system details

The device that triggers the fault is a 1 TB external USB hard drive. It is a USB 3.0 device, but the system crashes with OOM errors whether it is plugged into a USB 3.0 or a USB 2.0 motherboard port. The system board is an ASUS M5A97 PRO motherboard with an AMD Phenom(tm) II X6 1100T processor and 16 GByte of DDR3 memory.

The fault appears to be triggered by simply copying a large file (16 GByte+) off the external drive over USB. The system is 100% stable in all other respects. The problem was first noticed on an earlier kernel (from memory, possibly 3.0.6) but appeared only when the drive was used on a USB 3.0 port; moving the device onto a USB 2.0 port "cured" the problem. However, the device now appears to be failing on USB 2.0 as well.
660MB of ZONE_NORMAL is tied up in slab_reclaimable memory. I wonder what that is. Are you able to capture the contents of /proc/slabinfo when the system is in this state, or is close to it?
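(Not part of the original exchange: one possible way to capture the requested snapshot, sorted so the largest caches appear first. The column positions follow the slabinfo 2.x layout: name, active_objs, num_objs, objsize, objperslab, pagesperslab, ... Reading /proc/slabinfo requires root.)

```shell
# Snapshot the slab caches consuming the most memory (num_objs * objsize),
# skipping the two header lines of /proc/slabinfo. Run as root.
awk 'NR > 2 { printf "%-28s %12.0f KB\n", $1, $3 * $4 / 1024 }' \
    /proc/slabinfo 2>/dev/null | sort -rn -k2 | head -n 20
```

Capturing this a few times while the copy is running, rather than only after the OOM, would show which cache is growing.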
Created attachment 72099 [details] Contents of /proc/slabinfo

I am able to log in to a console shell after the crash (after a few attempts); the contents of /proc/slabinfo are attached.
OK, lowmem is full of buffer_heads. The machine has 16G highmem. 6G of highmem is in pagecache. Presumably the buffer_heads are attached to the highmem pagecache and the VM failed to release them because it wasn't looking at highmem pages. We had all this working 5-10 years ago. I suspect it was subsequently broken because few people are still using huge highmem boxes. I don't suppose you've tested this sort of workload on any earlier kernel versions?
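(Aside, not from the thread: the lowmem/highmem pressure described above can be watched from /proc/meminfo while the copy runs. The LowTotal/LowFree/HighTotal/HighFree fields only exist on 32-bit kernels built with CONFIG_HIGHMEM; Buffers and Cached exist everywhere.)

```shell
# Watch lowmem vs highmem and pagecache during the large copy.
# On a 64-bit kernel only Buffers/Cached will match.
grep -E '^(LowTotal|LowFree|HighTotal|HighFree|Buffers|Cached):' /proc/meminfo
```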
(In reply to comment #3)
> OK, lowmem is full of buffer_heads.
>
> The machine has 16G highmem.
>
> 6G of highmem is in pagecache. Presumably the buffer_heads are attached to
> the highmem pagecache and the VM failed to release them because it wasn't
> looking at highmem pages.
>
> We had all this working 5-10 years ago. I suspect it was subsequently
> broken because few people are still using huge highmem boxes.
>
> I don't suppose you've tested this sort of workload on any earlier kernel
> versions?

This machine has been operational since the end of October last year; the external drive was added at about the same time. The machine replaced an ASUS m3a79-t deluxe with a quad-core Phenom 940 and 8 GByte of DDR2 memory, which had been operational since the 940 came out. The upgrade used the same internal hard drives etc., and the workloads have been pretty much the same over this time period for both configurations. The kernel would have been the latest mainline version at the time (end Oct 2011). The kernel config has been essentially the same for the last few years, and as I noted in my initial submission the fault only shows up with the introduction of the external USB drive. I have not noticed any similar problems with the internal drives. Having said that, the internal drives are all configured for AHCI with Ext4 and the USB drive is formatted NTFS, which together with the USB component probably changes the dynamics significantly.
(In reply to comment #4)
> [snip]
>
> > I don't suppose you've tested this sort of workload on any earlier kernel
> > versions?

Sorry, to answer your question: I have not tried earlier kernels. How far back do you suggest I go? I am concerned that if I go back too far I might do damage to my system.
One possibility is that NTFS is using 1kbyte buffer_heads, whereas ext4 is using 4k. That would cause NTFS to require 4x as many buffer_heads to support the same amount of highmem pagecache, leading to the OOM. I don't immediately know how to work out what blocksize your NTFS mount is using. Regarding earlier kernels: don't bother - you'd probably have to go back a loooong way. If you can think of a way of working out what the NTFS blocksize is, that would help.
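(Not from the report, but two possible ways to answer the blocksize question; `/mnt/usb` and `/dev/sdX1` below are placeholders, not paths from this system.)

```shell
# Filesystem block size as seen by the kernel for a mounted volume.
# MNT is a hypothetical mount point; substitute the real one.
MNT=/mnt/usb
stat -f -c 'preferred block size: %s, fundamental block size: %S' "$MNT" \
    2>/dev/null || echo "no filesystem mounted at $MNT"

# If the ntfs-3g userspace tools are installed, ntfsinfo can report the
# sector and cluster sizes straight from the volume (run as root):
#   ntfsinfo -m /dev/sdX1
```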
(In reply to comment #6)
> One possibility is that NTFS is using 1kbyte buffer_heads, whereas ext4 is
> using 4k. That would cause NTFS to require 4x as many buffer_heads to
> support the same amount of highmem pagecache, leading to the OOM. I don't
> immediately know how to work out what blocksize your NTFS mount is using.
>
> Regarding earlier kernels: don't bother - you'd probably have to go back a
> loooong way. If you can think of a way of working out what the NTFS
> blocksize is, that would help.

I recovered this information by plugging the drive into a friend's laptop; does this help?

NTFS Volume Serial Number : 0x350b63fc75775ac7
Version : 3.1
Number Sectors : 0x0000000074705981
Total Clusters : 0x000000000e8e0b30
Free Clusters : 0x000000000974e983
Total Reserved : 0x0000000000000a00
Bytes Per Sector : 512
Bytes Per Cluster : 4096
Bytes Per FileRecord Segment : 1024
Clusters Per FileRecord Segment : 0
Mft Valid Data Length : 0x000000000533a800
Mft Start Lcn : 0x0000000000000004
Mft2 Start Lcn : 0x0000000007470598
Mft Zone Start : 0x0000000000006340
Mft Zone End : 0x0000000001d1c180
OK, Anton tells me that this fs will be using 512-byte buffer_heads - the worst case! Eight buffer_heads in lowmem per highmem pagecache page. I'll take this thread to email...
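(My arithmetic, not from the thread: a back-of-the-envelope check that eight 512-byte buffer_heads per page of the ~6G highmem pagecache is enough to account for the ~660MB of lowmem slab seen in comment #1. The ~56-byte sizeof(struct buffer_head) on 32-bit x86 is an assumption here.)

```shell
PAGE_SIZE=4096
BLOCK_SIZE=512
BH_SIZE=56                              # assumed sizeof(struct buffer_head)

PAGECACHE=$((6 * 1024 * 1024 * 1024))   # ~6 GiB of highmem pagecache
PAGES=$((PAGECACHE / PAGE_SIZE))
BH_PER_PAGE=$((PAGE_SIZE / BLOCK_SIZE)) # 8 buffer_heads per page
TOTAL=$((PAGES * BH_PER_PAGE * BH_SIZE))
echo "$((TOTAL / 1024 / 1024)) MiB of lowmem pinned by buffer_heads"
```

That lands in the same ballpark as the observed slab_reclaimable figure, consistent with the worst-case diagnosis.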
(In reply to comment #8)
> OK, Anton tells me that this fs will be using 512-byte buffer_heads - the
> worst case! Eight buffer_heads in lowmem per highmem pagecache page.
>
> I'll take this thread to email...

Ok, thank you.
A patch referencing this bug report has been merged in Linux v3.4-rc1:

commit cc715d99e529d470dde2f33a6614f255adea71f3
Author: Mel Gorman <mel@csn.ul.ie>
Date:   Wed Mar 21 16:34:00 2012 -0700

    mm: vmscan: forcibly scan highmem if there are too many buffer_heads pinning highmem