Bug 16348

Summary: kswapd continuously active when doing IO
Product: File System Reporter: Marius Tolzmann (marius)
Component: XFSAssignee: XFS Guru (xfs-masters)
Status: CLOSED CODE_FIX    
Severity: high CC: akpm, cebbert, david, kosaki.motohiro, maciej.rutecki, rjw, wwwutz
Priority: P1    
Hardware: All   
OS: Linux   
Kernel Version: 2.6.34 Subsystem:
Regression: Yes Bisected commit-id:
Bug Depends on:    
Bug Blocks: 15310    

Description Marius Tolzmann 2010-07-07 10:58:48 UTC
Hi..

this bug may be related to #15193 where i attached the issue as a comment.
since bug #15193 has status RESOLVED PATCH_ALREADY_AVAILABLE and not CLOSED CODE_FIX i file the situation as a new bug since the problem still exists for us.

THE ISSUE:
we are currently having trouble with continuously running kswapds consuming up
to 100% CPU time when the system is busy doing (heavy) I/O.

to reproduce (on our machines):  dd if=/dev/zero of=somefile  will activate all kswapds when the file reaches the size of the installed memory (our systems have 8G up to 256G). (same effect on local reiserfs and local xfs)

In kernel 2.6.34 (and 2.6.35-rc3) this issue causes the system to become very
very slow and unresponsive.

we switched back to 2.6.34-rc6 which seems to have no issues with the same IO.

what information can i provide to help fixing this issue.

regards, 
marius tolzmann
Comment 1 Andrew Morton 2010-07-08 21:30:22 UTC
(switched to email.  Please respond via emailed reply-to-all, not via the
bugzilla web interface).

On Wed, 7 Jul 2010 10:58:50 GMT
bugzilla-daemon@bugzilla.kernel.org wrote:

> https://bugzilla.kernel.org/show_bug.cgi?id=16348
> 
>            Summary: kswapd continuously active when doing IO
>            Product: IO/Storage
>            Version: 2.5
>     Kernel Version: 2.6.34
>           Platform: All
>         OS/Version: Linux
>               Tree: Mainline
>             Status: NEW
>           Severity: high
>           Priority: P1
>          Component: Other
>         AssignedTo: io_other@kernel-bugs.osdl.org
>         ReportedBy: tolzmann@molgen.mpg.de
>         Regression: Yes
> 
> 
> Hi..
> 
> this bug may be related to #15193 where i attached the issue as a comment.
> since bug #15193 has status RESOLVED PATCH_ALREADY_AVAILABLE and not CLOSED
> CODE_FIX i file the situation as a new bug since the problem still exists for
> us.
> 
> THE ISSUE:
> we are currently having trouble with continuously running kswapds consuming
> up
> to 100% CPU time when the system is busy doing (heavy) I/O.
> 
> to reproduce (on our machines):  dd if=/dev/zero of=somefile  will activate
> all
> kswapds when the file reaches the size of the installed memory (our systems
> have 8G up to 256G). (same effect on local reiserfs and local xfs)
> 
> In kernel 2.6.34 (and 2.6.35-rc3) this issue causes the system to become very
> very slow and unresponsive.
> 
> we switched back to 2.6.34-rc6 which seems to have no issues with the same
> IO.
> 
> what information can i provide to help fixing this issue.
> 

That's odd - there are no changes in mm/vmscan.c between 2.6.34-rc6 and
2.6.34.  Are you sure about those kernel versions?
Comment 2 KOSAKI Motohiro 2010-07-09 01:12:21 UTC
I can't reproduce this. 

Marius, can you try ext3/4 filesystems? or can you get wchan by following command?

# ps -ewwo pid,cmd,state,wchan:30


And, Your dirty pages don't decrease completely? or very slowly decreasing?
Comment 3 Marius Tolzmann 2010-07-09 09:50:23 UTC
Hi..

i now have trouble reproducing it the easy way (via dd) on high memory machines too.. sorry.. but i reproduced it on a 2GB-machine using dd

# cat /proc/version
Linux version 2.6.34.x86_64.45 (root@venustraphobie.molgen.mpg.de) (gcc version 4.4.2 (GCC) ) #1 SMP PREEMPT Tue May 25 12:24:36 CEST 2010

# top
top - 11:37:46 up 44 days, 22:50, 11 users,  load average: 2.99, 2.38, 1.28
Tasks: 170 total,   4 running, 166 sleeping,   0 stopped,   0 zombie
Cpu(s):  1.2%us, 84.1%sy,  0.0%ni, 13.9%id,  0.3%wa,  0.5%hi,  0.0%si,  0.0%st
Mem:   2053436k total,  2019304k used,    34132k free,   438036k buffers
Swap:  2048280k total,     3412k used,  2044868k free,  1399924k cached

  PID USER      PR  NI  VIRT  RES  SHR S %CPU %MEM    TIME+  COMMAND
  406 root      20   0     0    0    0 R   64  0.0 231:01.46 kswapd1
31677 root      20   0  4940  328  252 R   63  0.0   1:09.35 dd
  405 root      20   0     0    0    0 R   36  0.0 203:36.02 kswapd0
31678 root      20   0  9256 1100  572 S    3  0.1   0:03.67 bash
 1352 root      20   0     0    0    0 S    2  0.0   0:29.40 flush-3:0
  226 root      20   0     0    0    0 S    2  0.0   4:33.12 sync_supers
  233 root      20   0     0    0    0 S    0  0.0   4:13.86 kblockd/1

# ps -ewwo pid,cmd,state,wchan:30 | grep kswap
  405 [kswapd0]                   R -
  406 [kswapd1]                   R -

i will test this machine with different kernel-versions and filesystems and attach all logs to this bug soon.

when stopping the dd-process the system behaves normal again and no kswapds are running anymore.

here is some dirty pages-data while dd is running.. from /proc/meminfo

TIMESTAMP    FILESIZE-OF-DD-OUTPUT  DIRTY-PAGES
1278668407      7009581056            185084
1278668412      7109813760            189076
1278668415      7183441920            186092
1278668416      7223492608            186100
1278668417      7272640512            70040 
1278668419      7316069888            43388 
1278668420      7359446528            86268 
1278668421      7385733120            94732 
1278668423      7466468864            174408
1278668425      7530849280            186956
1278668428      7560326144            185928
1278668429      7595791360            48976 
1278668431      7633754112            22468 
1278668432      7663499264            52364 
1278668433      7690510336            78148 
1278668439      8045338112            187816
1278668451      8086799872            187712
1278668452      8120495104            185548
1278668453      8149839872            187044
1278668457      8237108224            187480
1278668458      8266469376            187572
1278668460      8299431936            111528
1278668462      8352441856            84272 
1278668463      8375342080            44244 
1278668466      8440810496            107268
1278668470      8527720448            14156 
1278668471      8573242880            74400 
1278668473      8616877568            101080
1278668474      8661471744            144852
1278668477      8727186944            188708
1278668479      8834936832            188176
1278668483      8875110400            186088
1278668484      8925085696            186104

regards, m.
Comment 4 KOSAKI Motohiro 2010-07-09 10:36:10 UTC
Can you please run following command?

# perf top
Comment 5 Marius Tolzmann 2010-07-11 15:11:26 UTC
it turns out that this bug seems to be not as easy reproducible as we 
thought in first place.

when booting into another kernel we can't trigger the kswapd behavior 
just by doing heavy IO.

When the system is running some time and the bug was triggered once it 
can easily be triggered again by any large IO-Operation.

currently i am stressing only one system to trigger it again and to 
provide the perf top output. Our production servers are currently 
running an older kernel again to get the work done - i will upgrade them 
again and try to find an reproducible way to trigger this.

--- Comment #4 from KOSAKI Motohiro<kosaki.motohiro@jp.fujitsu.com> 
2010-07-09 10:36:10 ---
> Can you please run following command?
>
> # perf top
>
Comment 6 Marius Tolzmann 2010-07-12 13:02:38 UTC
this bug seems to be related to XFS.. can you please re-assign it to the 
right person - if you agree .. ? or should i file it again?

further Info:

ok.. we got the bug to trigger when doing heavy IO on our xfs-fs..

the strange part is: that i can trigger the same behavior when doing dd 
on an reiserfs-partiotion.

the perf top and top output are snapshots of the system while doing dd 
if=/dev/zero of=somefile_on_reiserfs_partition

just the dd is running... no other large IO is done on any other partition)

---------------------------------------------------------------------------
PerfTop:    4115 irqs/sec  kernel:95.9% [1000Hz cycles],  (all, 8 CPUs)
---------------------------------------------------------------------------

samples  pcnt function                        DSO
_______ _____ _______________________________ ___________________

5643.00 16.4% xfs_perag_get                   [kernel.kallsyms]
2325.00  6.7% xfs_perag_put                   [kernel.kallsyms]
1158.00  3.4% read_tsc                        [kernel.kallsyms]
  988.00  2.9% do_sys_poll                     [kernel.kallsyms]
  823.00  2.4% radix_tree_lookup               [kernel.kallsyms]
  785.00  2.3% perf_session__mmap_read_counter /usr/local/perf/bin/perf
  735.00  2.1% x86_pmu_enable_all              [kernel.kallsyms]
  713.00  2.1% fget_light                      [kernel.kallsyms]
  688.00  2.0% sched_clock                     [kernel.kallsyms]
  612.00  1.8% _raw_spin_lock_irqsave          [kernel.kallsyms]
  610.00  1.8% _raw_spin_unlock_irqrestore     [kernel.kallsyms]
  560.00  1.6% xfs_reclaim_inode_shrink        [kernel.kallsyms]
  544.00  1.6% dso__find_symbol                /usr/local/perf/bin/perf
  513.00  1.5% perf_pending_interrupt          [kernel.kallsyms]
  511.00  1.5% sched_clock_local               [kernel.kallsyms]
	
top - 14:50:03 up  4:06,  6 users,  load average: 4.33, 3.22, 3.03
Tasks: 275 total,   4 running, 271 sleeping,   0 stopped,   0 zombie
Cpu(s):  0.1%us, 32.5%sy,  0.0%ni, 43.8%id, 23.2%wa,  0.1%hi,  0.2%si, 
0.0%st
Mem:  33077232k total, 32979564k used,    97668k free,    62344k buffers
Swap: 65537160k total,        0k used, 65537160k free, 29747004k cached

   PID USER      PR  NI  VIRT  RES  SHR S %CPU %MEM    TIME+  COMMAND
11148 root      20   0  4948  324  252 R   94  0.0   0:55.76 dd
   556 root      20   0     0    0    0 R   81  0.0  32:19.99 kswapd0
   557 root      20   0     0    0    0 R   80  0.0  29:53.66 kswapd1
  2951 root      20   0     0    0    0 D    2  0.0   0:04.80 flush-8:0
   250 root      20   0     0    0    0 D    2  0.0   0:03.65 sync_supers


here is the perf top output before starting the dd:

-----------------------------------------------------------------------
PerfTop:    3477 irqs/sec  kernel:97.4% [1000Hz cycles],  (all, 8 CPUs)
-----------------------------------------------------------------------

  samples  pcnt function                        DSO
  _______ _____ _______________________________ ________________________

14867.00 35.4% read_tsc                        [kernel.kallsyms]
  2592.00  6.2% sched_clock                     [kernel.kallsyms]
  1657.00  3.9% tick_handle_periodic            [kernel.kallsyms]
  1459.00  3.5% hrtimer_run_queues              [kernel.kallsyms]
  1186.00  2.8% do_sys_poll                     [kernel.kallsyms]
   883.00  2.1% apic_timer_interrupt            [kernel.kallsyms]
   847.00  2.0% perf_session__mmap_read_counter /usr/local/perf/bin/perf
   841.00  2.0% find_busiest_group              [kernel.kallsyms]
   831.00  2.0% find_next_bit                   [kernel.kallsyms]
   699.00  1.7% sched_clock_local               [kernel.kallsyms]
   685.00  1.6% x86_pmu_enable_all              [kernel.kallsyms]
   679.00  1.6% current_kernel_time             [kernel.kallsyms]
   667.00  1.6% xfs_perag_get                   [kernel.kallsyms]
   622.00  1.5% remove_wait_queue               [kernel.kallsyms]


thanks for the help..

marius
Comment 7 Peter Marquardt 2010-07-12 15:30:08 UTC
Here's the same situation on a 128 GB / 32 core server. It is doing some heavy I/O and very very unresponsive. This one is running 2.6.34-rc6:


top - 17:17:59 up 19 days,  6:49,  8 users,  load average: 11.29, 10.81, 10.47
Tasks: 688 total,  13 running, 675 sleeping,   0 stopped,   0 zombie
Cpu(s):  0.0%us, 37.5%sy,  0.0%ni, 62.4%id,  0.0%wa,  0.0%hi,  0.0%si,  0.0%st
Mem:  132293220k total, 132119488k used,   173732k free,   544744k buffers
Swap: 156290900k total,        0k used, 156290900k free, 123948184k cached

  PID USER      PR  NI  VIRT  RES  SHR S %CPU %MEM    TIME+ COMMAND                                                                                                   
 1008 root      20   0     0    0    0 R  100  0.0   6975:11 kswapd1                                                                                                    
 1014 root      20   0     0    0    0 R  100  0.0   6903:52 kswapd7                                                                                                    
26312 root      20   0 34836  10m 2704 R  100  0.0   2:05.84 perf                                                                                                       
 1007 root      20   0     0    0    0 R  100  0.0   6986:52 kswapd0                                                                                                    
 1009 root      20   0     0    0    0 R  100  0.0   7002:45 kswapd2                                                                                                    
 1010 root      20   0     0    0    0 R  100  0.0   6900:53 kswapd3                                                                                                    
 1011 root      20   0     0    0    0 R  100  0.0   6960:24 kswapd4                                                                                                    
 1012 root      20   0     0    0    0 R  100  0.0   6956:13 kswapd5                                                                                                    
 1013 root      20   0     0    0    0 R  100  0.0   6929:03 kswapd6                                                                                                    
 4255 root      20   0 69272  62m 1736 S    0  0.0 176:53.39 amd                                                                                                        
 4364 root      20   0  2716  380  308 S    0  0.0 189:06.05 pattexstasid                                                                                               
 4780 solxdata  20   0 10532 2228 1324 S    0  0.0 322:13.79 iview                                                                                                      
26324 solxdata  20   0  9088 1728  916 R    0  0.0   0:00.82 top                                                                                                        
    1 root      20   0  1624  476  408 S    0  0.0   0:18.51 init                                                                                                       
    2 root      20   0     0    0    0 S    0  0.0  45:47.76 kthreadd                                                                                                   

----------------------------------------------------------------------------
   PerfTop:       0 irqs/sec  kernel: nan% [1000Hz cycles],  (all, 32 CPUs)
----------------------------------------------------------------------------

samples  pcnt function                        DSO
_______ _____ _______________________________ _______________________________

 232.00  9.5% hrtimer_run_queues              [kernel.kallsyms]              
 197.00  8.1% xfs_perag_put                   [kernel.kallsyms]              
 172.00  7.0% do_sys_poll                     [kernel.kallsyms]              
 154.00  6.3% perf_session__mmap_read_counter /usr/local/perf/bin/perf       
  98.00  4.0% fget_light                      [kernel.kallsyms]              
  77.00  3.2% perf_poll                       [kernel.kallsyms]              
  74.00  3.0% remove_wait_queue               [kernel.kallsyms]              
  69.00  2.8% _raw_spin_lock_irqsave          [kernel.kallsyms]              
  63.00  2.6% ktime_get                       [kernel.kallsyms]              
  58.00  2.4% sched_clock                     [kernel.kallsyms]              
  53.00  2.2% S_hv_fetch_common               /arch/x86_64/usr/local/bin/perl
  52.00  2.1% apic_timer_interrupt            [kernel.kallsyms]              
  50.00  2.0% schedule                        [kernel.kallsyms]              
  50.00  2.0% perf_event_do_pending           [kernel.kallsyms]              
  49.00  2.0% find_busiest_group              [kernel.kallsyms]              
  49.00  2.0% xfs_perag_get                   [kernel.kallsyms]              
  48.00  2.0% _raw_spin_unlock_irqrestore     [kernel.kallsyms]              


here we also find some hints in the direction of xfs. Maybe it helps, that both machines have quite some big disks with xfs attached:

the one Marius reported looks like this:

root@hexakosioihexekontahexaphobia:~# cat /proc/partitions 
major minor  #blocks  name

   8        0  156290904 sda
   8        1   10241406 sda1
   8        2  146046915 sda2
   8       16  156290904 sdb
   8       32 5859311616 sdc
   8       48 13671727104 sdd
   8       64 13671727104 sde
   8       80 13671727104 sdf
   8       96 13671727104 sdg

this one has 5x 13 TB and one drive with 27 TB running xfs.

root@arachibotyrophobie:~# cat /proc/partitions 
major minor  #blocks  name

   8        0  244198584 sda
   8        1   10241406 sda1
   8        2   65537167 sda2
   8        3  168417427 sda3
   8       16 13671727104 sdb
   8       32 13671727104 sdc
   8       48 13671727104 sdd
   8       80 27343597568 sdf
   8       64 13671727104 sde
   8       96 13671727104 sdg


we'll supply any additional needed information a.s.a.p. ... just tell us what you need 8)

thanks,
  Peter
Comment 8 Dave Chinner 2010-07-12 22:50:26 UTC
(In reply to comment #7)
> Here's the same situation on a 128 GB / 32 core server. It is doing some
> heavy
> I/O and very very unresponsive. This one is running 2.6.34-rc6:

These look like the new xfs inode cache shrinker is active. The dd is generating memory pressure, and kswapd potentially running the slab shrinkers reclaiming cached inodes. Can you show the output of slabtop before the dd and while it is running?

Cheers,

Dave.
Comment 9 Peter Marquardt 2010-07-13 06:43:53 UTC
Here's the output of top/slabtop running on a 32 GB RAM machine:
---------------------------------
start, 8 out of 32 GB RAM used
---------------------------------

top - 08:32:13 up 21:48,  4 users,  load average: 0.00, 0.00, 0.00
Tasks: 267 total,   1 running, 266 sleeping,   0 stopped,   0 zombie
Cpu(s):  0.0%us,  0.1%sy,  0.0%ni, 99.9%id,  0.0%wa,  0.0%hi,  0.0%si,  0.0%st
Mem:  33077232k total,  8384520k used, 24692712k free,   252740k buffers
Swap: 65537160k total,        0k used, 65537160k free,  2744916k cached

  PID USER      PR  NI  VIRT  RES  SHR S %CPU %MEM    TIME+  COMMAND                                                                                                                     
 6055 root      20   0  4512 1288  736 S    7  0.0   0:05.35 slabtop                                                                                                                      
    1 root      20   0  1624  480  408 S    0  0.0   0:01.25 init                                                                                                                         
    2 root      20   0     0    0    0 S    0  0.0   0:00.00 kthreadd                                                                                                                     
    3 root      RT   0     0    0    0 S    0  0.0   0:00.05 migration/0                                                                                                                  
    4 root      20   0     0    0    0 S    0  0.0   0:01.49 ksoftirqd/0                                                                                                                  
    5 root      RT   0     0    0    0 S    0  0.0   0:00.03 migration/1                                                                                                                  
    6 root      20   0     0    0    0 S    0  0.0   0:01.82 ksoftirqd/1                                                                                                                  
    7 root      RT   0     0    0    0 S    0  0.0   0:00.01 migration/2                                                                                                                  
[...]
 Active / Total Objects (% used)    : 12463647 / 13977711 (89.2%)
 Active / Total Slabs (% used)      : 1305429 / 1305429 (100.0%)
 Active / Total Caches (% used)     : 95 / 223 (42.6%)
 Active / Total Size (% used)       : 4892545.30K / 5125737.29K (95.5%)
 Minimum / Average / Maximum Object : 0.02K / 0.37K / 4096.00K

  OBJS ACTIVE  USE OBJ SIZE  SLABS OBJ/SLAB CACHE SIZE NAME                   
4716283 4595683  97%    0.06K  79937       59    319748K size-64
3913260 3809455  97%    0.19K 195663       20    782652K dentry
3753424 3723471  99%    1.00K 938356        4   3753424K xfs_inode
1219187  80374   6%    0.10K  32951       37    131804K buffer_head
204939  97428  47%    0.54K  29277        7    117108K radix_tree_node
 77495  77479  99%    0.70K  15499        5     61996K reiser_inode_cache
 21420  21164  98%    0.12K    714       30      2856K size-128
 13621  13444  98%    0.07K    257       53      1028K sysfs_dir_cache
 10528   9091  86%    0.03K     94      112       376K size-32
  8684   8684 100%    4.00K   8684        1     34736K size-4096
  5680   3189  56%    0.19K    284       20      1136K size-192
  2885   2873  99%    0.77K    577        5      2308K shmem_inode_cache
  2849   1880  65%    0.05K     37       77       148K anon_vma_chain
  2438   1703  69%    0.07K     46       53       184K Acpi-Operand
  2398   1913  79%    0.17K    109       22       436K vm_area_struct
  2064   1806  87%    1.00K    516        4      2064K size-1024
  1950   1621  83%    0.61K    325        6      1300K proc_inode_cache
  1840   1365  74%    0.19K     92       20       368K filp
  1776   1693  95%    0.50K    222        8       888K size-512
  1728   1072  62%    0.02K     12      144        48K anon_vma
  1510   1467  97%    2.00K    755        2      3020K size-2048
  1470   1070  72%    0.25K     98       15       392K skbuff_head_cache
  1140    404  35%    0.12K     38       30       152K cred_jar
   900    798  88%    0.25K     60       15       240K size-256

-------------------------------------
5 minutes later, 27 out of 32 GB used
-------------------------------------

top - 08:36:59 up 21:53,  5 users,  load average: 1.32, 0.54, 0.20
Tasks: 274 total,   1 running, 273 sleeping,   0 stopped,   0 zombie
Cpu(s):  4.8%us,  5.1%sy,  0.0%ni, 79.5%id,  9.6%wa,  0.1%hi,  0.8%si,  0.0%st
Mem:  33077232k total, 27160668k used,  5916564k free,   252740k buffers
Swap: 65537160k total,        0k used, 65537160k free, 21326208k cached

  PID USER      PR  NI  VIRT  RES  SHR S %CPU %MEM    TIME+  COMMAND                                                                                                                     
 6209 solidata  20   0 27356 6724 1696 S   36  0.0   0:46.69 solidmirror                                                                                                                  
 6157 root      20   0 26160 6664 1632 S   25  0.0   0:29.76 pmirror                                                                                                                      
 6154 root      20   0 27776 8288 1656 D   18  0.0   0:19.73 pmirror                                                                                                                      
 6264 root      20   0  4244 1136  736 S    6  0.0   0:01.06 slabtop                                                                                                                      
 6134 root      20   0     0    0    0 S    2  0.0   0:01.92 flush-8:80                                                                                                                   
  594 root      20   0     0    0    0 S    1  0.0   0:06.78 xfsdatad/4                                                                                                                   
 Active / Total Objects (% used)    : 15002352 / 15335899 (97.8%)
 Active / Total Slabs (% used)      : 1348855 / 1348971 (100.0%)
 Active / Total Caches (% used)     : 104 / 223 (46.6%)
 Active / Total Size (% used)       : 5206572.94K / 5289652.04K (98.4%)
 Minimum / Average / Maximum Object : 0.02K / 0.34K / 4096.00K

  OBJS ACTIVE  USE OBJ SIZE  SLABS OBJ/SLAB CACHE SIZE NAME                   
4717522 4606213  97%    0.06K  79958       59    319832K size-64
3913420 3823949  97%    0.19K 195671       20    782684K dentry
3753528 3737799  99%    1.00K 938382        4   3753528K xfs_inode
2506380 2474638  98%    0.10K  67740       37    270960K buffer_head
255066 187566  73%    0.54K  36438        7    145752K radix_tree_node
 77500  77481  99%    0.70K  15500        5     62000K reiser_inode_cache
 21420  21237  99%    0.12K    714       30      2856K size-128
 13621  13446  98%    0.07K    257       53      1028K sysfs_dir_cache
 10528   9105  86%    0.03K     94      112       376K size-32
  8812   8764  99%    4.00K   8812        1     35248K size-4096
  7140   7137  99%    0.18K    340       21      1360K xfs_ili
  5680   3349  58%    0.19K    284       20      1136K size-192


----------------------------------------------
2 minutes later. 32 GB used, kswapd popping up
----------------------------------------------

top - 08:38:08 up 21:54,  5 users,  load average: 1.98, 0.86, 0.33
Tasks: 274 total,   5 running, 269 sleeping,   0 stopped,   0 zombie
Cpu(s):  3.3%us, 40.2%sy,  0.0%ni, 50.1%id,  5.7%wa,  0.0%hi,  0.8%si,  0.0%st
Mem:  33077232k total, 32977904k used,    99328k free,   252740k buffers
Swap: 65537160k total,        0k used, 65537160k free, 27063272k cached

  PID USER      PR  NI  VIRT  RES  SHR S %CPU %MEM    TIME+  COMMAND                                                                                                                     
  556 root      20   0     0    0    0 R  100  0.0 141:58.60 kswapd0                                                                                                                      
  557 root      20   0     0    0    0 R  100  0.0 138:30.02 kswapd1                                                                                                                      
 6154 root      20   0 27776 8288 1656 R   97  0.0   0:49.16 pmirror                                                                                                                      
 6209 solidata  20   0 27356 6872 1704 S   30  0.0   1:09.87 solidmirror                                                                                                                  
 6157 root      20   0 26160 6664 1632 R   14  0.0   0:51.78 pmirror                                                                                                                      
 6264 root      20   0  4244 1136  736 S    7  0.0   0:03.97 slabtop                                                                                                                      
 6134 root      20   0     0    0    0 D    3  0.0   0:03.09 flush-8:80                                                                                                                   
  594 root      20   0     0    0    0 S    1  0.0   0:07.33 xfsdatad/4                                                                                                                   
   16 root      20   0     0    0    0 S    0  0.0   0:02.15 ksoftirqd/6                                                                                                                  
11048 solidata  20   0  8820 1468  976 R    0  0.0   3:21.10 top                                                                                                                          
    1 root      20   0  1624  480  408 S    0  0.0   0:01.25 init                                                                                                                         
    2 root      20   0     0    0    0 S    0  0.0   0:00.00 kthreadd                                                                                                                     
 Active / Total Objects (% used)    : 15491439 / 15986833 (96.9%)
 Active / Total Slabs (% used)      : 1366347 / 1366507 (100.0%)
 Active / Total Caches (% used)     : 104 / 223 (46.6%)
 Active / Total Size (% used)       : 5259346.41K / 5355544.54K (98.2%)
 Minimum / Average / Maximum Object : 0.02K / 0.33K / 4096.00K

  OBJS ACTIVE  USE OBJ SIZE  SLABS OBJ/SLAB CACHE SIZE NAME                   
4718171 4501369  95%    0.06K  79969       59    319876K size-64
3912540 3711758  94%    0.19K 195627       20    782508K dentry
3753212 3736699  99%    1.00K 938303        4   3753212K xfs_inode
3156729 3156729 100%    0.10K  85317       37    341268K buffer_head
255017 208730  81%    0.54K  36431        7    145724K radix_tree_node
 77480  77449  99%    0.70K  15496        5     61984K reiser_inode_cache
 21420  21275  99%    0.12K    714       30      2856K size-128
 13621  13444  98%    0.07K    257       53      1028K sysfs_dir_cache
 10528   9129  86%    0.03K     94      112       376K size-32
  8717   8689  99%    4.00K   8717        1     34868K size-4096
  7266   7251  99%    0.18K    346       21      1384K xfs_ili
  5760   3634  63%    0.19K    288       20      1152K size-192
Comment 10 Peter Marquardt 2010-07-13 06:55:37 UTC
Aborting the I/O heavy programs (plain perl open read write close) almost immediatly drops the kswapds. xfs_inode slabs stay.


top - 08:51:51 up 22:08,  4 users,  load average: 1.66, 2.89, 2.07
Tasks: 268 total,   1 running, 267 sleeping,   0 stopped,   0 zombie
Cpu(s):  0.0%us,  0.2%sy,  0.0%ni, 99.7%id,  0.0%wa,  0.0%hi,  0.0%si,  0.0%st
Mem:  33077232k total, 32953756k used,   123476k free,   227432k buffers
Swap: 65537160k total,        0k used, 65537160k free, 27218944k cached

  PID USER      PR  NI  VIRT  RES  SHR S %CPU %MEM    TIME+  COMMAND                                                                                                                     
 6264 root      20   0  4244 1136  736 S    6  0.0   0:50.91 slabtop                                                                                                                      
    1 root      20   0  1624  480  408 S    0  0.0   0:01.26 init                                                                                                                         
    2 root      20   0     0    0    0 S    0  0.0   0:00.00 kthreadd                                                                                                                     
    3 root      RT   0     0    0    0 S    0  0.0   0:00.05 migration/0                                                                                                                  
    4 root      20   0     0    0    0 S    0  0.0   0:01.67 ksoftirqd/0                                                                                                                  
    5 root      RT   0     0    0    0 S    0  0.0   0:00.03 migration/1                                                                                                                  
    6 root      20   0     0    0    0 S    0  0.0   0:01.98 ksoftirqd/1                                                                                                                  
 Active / Total Objects (% used)    : 13291713 / 15368731 (86.5%)
 Active / Total Slabs (% used)      : 1335299 / 1335324 (100.0%)
 Active / Total Caches (% used)     : 101 / 223 (45.3%)
 Active / Total Size (% used)       : 4884030.56K / 5242406.91K (93.2%)
 Minimum / Average / Maximum Object : 0.02K / 0.34K / 4096.00K

  OBJS ACTIVE  USE OBJ SIZE  SLABS OBJ/SLAB CACHE SIZE NAME                   
4678877 3458608  73%    0.06K  79303       59    317212K size-64
3786780 3657252  96%    1.00K 946695        4   3786780K xfs_inode
3256703 3254908  99%    0.10K  88019       37    352076K buffer_head
3253440 2593056  79%    0.19K 162672       20    650688K dentry
252147 206842  82%    0.54K  36021        7    144084K radix_tree_node
 43960  42829  97%    0.70K   8792        5     35168K reiser_inode_cache
 21630  18137  83%    0.12K    721       30      2884K size-128
 13621  13444  98%    0.07K    257       53      1028K sysfs_dir_cache
 10528   9094  86%    0.03K     94      112       376K size-32
  8686   8686 100%    4.00K   8686        1     34744K size-4096
  5760   3184  55%    0.19K    288       20      1152K size-192
  3108   2053  66%    0.18K    148       21       592K xfs_ili
Comment 11 Dave Chinner 2010-07-13 08:15:45 UTC
Ok, so there's about 4m clean cached inodes and dentries at the start, and 2.5m cached dentries and 3.6m cached inodes at the end. reclaim doesn't really seem to be making much progress here - inodes can't be reclaimed until dentries are freed, and that isn't happening very fast, either.

I can reproduce kswapd running for a few seconds while it reclaims a couple of million inodes and dentries, but I can't keep it running for minutes. 

Right now I'll focus on the xfs-perag_get/put frequency and try to work that out. Can you post the output of xfs_info for each XFS filesystem you have so i can see how many filesystems and AGs are being traversed during each reclaim call...
Comment 12 Peter Marquardt 2010-07-13 08:24:57 UTC
here we go:

/dev/sdc on /amd/arachibotyrophobie/X/X0031 type xfs (rw,noatime,nodiratime,logbufs=8)
/dev/sdb on /amd/arachibotyrophobie/X/X0060 type xfs (rw,noatime,nodiratime,logbufs=8)
/dev/sdd on /amd/arachibotyrophobie/X/X0063 type xfs (rw,noatime,nodiratime,logbufs=8)
/dev/sde on /amd/arachibotyrophobie/X/X0064 type xfs (rw,noatime,nodiratime,logbufs=8)
/dev/sdg on /amd/arachibotyrophobie/X/X0047 type xfs (rw,noatime,nodiratime,logbufs=8)
/dev/sdf on /amd/arachibotyrophobie/X/X2002 type xfs (rw,noatime,nodiratime,logbufs=8)

meta-data=/dev/sdc               isize=256    agcount=2048, agsize=1668912 blks
         =                       sectsz=512   attr=0
data     =                       bsize=4096   blocks=3417931776, imaxpct=25
         =                       sunit=16     swidth=224 blks
naming   =version 2              bsize=4096   ascii-ci=0
log      =internal               bsize=4096   blocks=32768, version=1
         =                       sectsz=512   sunit=0 blks, lazy-count=0
realtime =none                   extsz=65536  blocks=0, rtextents=0

meta-data=/dev/sdb               isize=256    agcount=2048, agsize=1668912 blks
         =                       sectsz=512   attr=0
data     =                       bsize=4096   blocks=3417931776, imaxpct=25
         =                       sunit=16     swidth=224 blks
naming   =version 2              bsize=4096   ascii-ci=0
log      =internal               bsize=4096   blocks=32768, version=1
         =                       sectsz=512   sunit=0 blks, lazy-count=0
realtime =none                   extsz=65536  blocks=0, rtextents=0

meta-data=/dev/sdd               isize=256    agcount=2048, agsize=1668912 blks
         =                       sectsz=512   attr=0
data     =                       bsize=4096   blocks=3417931776, imaxpct=25
         =                       sunit=16     swidth=224 blks
naming   =version 2              bsize=4096   ascii-ci=0
log      =internal               bsize=4096   blocks=32768, version=1
         =                       sectsz=512   sunit=0 blks, lazy-count=0
realtime =none                   extsz=65536  blocks=0, rtextents=0

meta-data=/dev/sde               isize=256    agcount=2048, agsize=1668912 blks
         =                       sectsz=512   attr=0
data     =                       bsize=4096   blocks=3417931776, imaxpct=25
         =                       sunit=16     swidth=224 blks
naming   =version 2              bsize=4096   ascii-ci=0
log      =internal               bsize=4096   blocks=32768, version=1
         =                       sectsz=512   sunit=0 blks, lazy-count=0
realtime =none                   extsz=65536  blocks=0, rtextents=0

meta-data=/dev/sdg               isize=256    agcount=13, agsize=268435392 blks
         =                       sectsz=512   attr=2
data     =                       bsize=4096   blocks=3417931776, imaxpct=5
         =                       sunit=64     swidth=896 blks
naming   =version 2              bsize=4096   ascii-ci=0
log      =internal               bsize=4096   blocks=521728, version=2
         =                       sectsz=512   sunit=64 blks, lazy-count=1
realtime =none                   extsz=4096   blocks=0, rtextents=0

meta-data=/dev/sdf               isize=256    agcount=26, agsize=268435392 blks
         =                       sectsz=512   attr=2
data     =                       bsize=4096   blocks=6835899392, imaxpct=5
         =                       sunit=64     swidth=896 blks
naming   =version 2              bsize=4096   ascii-ci=0
log      =internal               bsize=4096   blocks=521728, version=2
         =                       sectsz=512   sunit=64 blks, lazy-count=1
realtime =none                   extsz=4096   blocks=0, rtextents=0
Comment 13 Dave Chinner 2010-07-13 11:10:36 UTC
(In reply to comment #12)
> meta-data=/dev/sdc               isize=256    agcount=2048, agsize=1668912
> blks
> meta-data=/dev/sdb               isize=256    agcount=2048, agsize=1668912
> blks
> meta-data=/dev/sdd               isize=256    agcount=2048, agsize=1668912
> blks
> meta-data=/dev/sde               isize=256    agcount=2048, agsize=1668912
> blks
> meta-data=/dev/sdg               isize=256    agcount=13, agsize=268435392
> blks
> meta-data=/dev/sdf               isize=256    agcount=26, agsize=268435392
> blks

It's what I was expecting. The old filesystems were configured strangely with a lot (2048) of relatively small allocation groups. The newer 13TB and 26TB filesystems are using 1TB allocation groups, so are a lot more sensible in configuration.

Basically the problem is that every iteration of the xfs inode shrinker will be doing a xfs_perag_get/put on every AG in every filesystem. That means it may be scaning >8000 AGs before finding an inode to reclaim. This is where teh CPU is being used.

Per-filesystem shrinker callouts will help this, as will a variable reclaim start AG. However, the overhead of aggregating across all those AGs won't go away. I think I have a solution to that (propagate the reclaimable inode bit to the per-ag radix tree) that will reduce the overhead of aggregation and reclaim scanning quite a bit, however it is onl an idea right now. Let me think on this a bit....
Comment 14 Peter Marquardt 2010-07-13 11:43:05 UTC
This vast amount of allocation groups was the result of a lot of performance tuning and pecking with a stick in the fog when we started to use xfs. We have a _lot_ of small and huge files and ran into 'disk-full / out-of-inodes'. anyway, we thought the bottleneck came from not having enough allocationgroups spread on the surface. 2048*1668912 equals approx 13*268435392. 

in the past we supplied agcount=2048 to mkfs.xfs 

# mkfs.xfs -b size=4096 -d sunit=128,swidth=1792,agcount=2048 -L x0060 /dev/sdb 

whereas we now initialize the raid6 (14+2) unit more or less per default:

# mkfs.xfs -L x2002 -d su=256k,sw=14 /dev/sdf

Maybe this agcount thing wasn't a good idea, but it worked fine up until the xfs-code changed a lot in the past 6 month (2.6.32+ ?!)

cheers and thanks for your work so far,

   Peter
Comment 15 Dave Chinner 2010-07-13 22:41:09 UTC
(In reply to comment #14)
> This vast amount of allocation groups was the result of a lot of performance
> tuning and pecking with a stick in the fog when we started to use xfs. We
> have
> a _lot_ of small and huge files and ran into 'disk-full / out-of-inodes'.
> anyway, we thought the bottleneck came from not having enough
> allocationgroups
> spread on the surface. 2048*1668912 equals approx 13*268435392.

Sounds like you were running out of space below 1TB - the inode32 allocator problem. THis probably would have gone away with the inode64 mount option.

> in the past we supplied agcount=2048 to mkfs.xfs 
> 
> # mkfs.xfs -b size=4096 -d sunit=128,swidth=1792,agcount=2048 -L x0060
> /dev/sdb 
> 
> whereas we now initialize the raid6 (14+2) unit more or less per default:
> 
> # mkfs.xfs -L x2002 -d su=256k,sw=14 /dev/sdf
> 
> Maybe this agcount thing wasn't a good idea, but it worked fine up until the
> xfs-code changed a lot in the past 6 month (2.6.32+ ?!)

Yeah, we've move all the inode caching and reclaim to per-ag structures, and there are now some places where we iterate over all the AGs to sum counters. THe inode reclaim shrinker is one of these places and it is clear that it needs improvement. (well, we knew that already....)

I'll try to have some patches for you to test later today.

> 
> cheers and thanks for your work so far,
> 
>    Peter
Comment 16 Rafael J. Wysocki 2010-07-23 19:58:46 UTC
On Friday, July 23, 2010, Dave Chinner wrote:
> On Fri, Jul 23, 2010 at 02:15:54PM +0200, Rafael J. Wysocki wrote:
> > This message has been generated automatically as a part of a report
> > of regressions introduced between 2.6.33 and 2.6.34.
> > 
> > The following bug entry is on the current list of known regressions
> > introduced between 2.6.33 and 2.6.34.  Please verify if it still should
> > be listed and let the tracking team know (either way).
> > 
> > 
> > Bug-Entry   : http://bugzilla.kernel.org/show_bug.cgi?id=16348
> > Subject             : kswapd continuously active when doing IO
> > Submitter   : Marius Tolzmann <tolzmann@molgen.mpg.de>
> > Date                : 2010-07-07 10:58 (17 days old)
> 
> The fix for this is already in 2.6.36-rc6.
Comment 17 Chuck Ebbert 2010-07-23 21:26:05 UTC
What was the fix? Is it submitted for 2.6.34-stable?
Comment 18 Dave Chinner 2010-07-26 04:05:30 UTC
(In reply to comment #17)
> What was the fix? Is it submitted for 2.6.34-stable?

7f8275d0d660c146de6ee3017e1e2e594c49e820 mm: add context argument to shrinker callback
70e60ce71516c3a9e882edb70a09f696a05961db xfs: convert inode shrinker to per-filesystem contexts
16fd5367370099b59d96e30bb7d9de8d419659f2 xfs: track AGs with reclaimable inodes in per-ag radix tree
567c7b0edec0200c5c6613f07c3d3b4034fdc836 mm: add context argument to shrinker callback to remaining shrinkers

And no, they haven't been submitted to 2.6.34-stable. It should be pretty trivial to backport them, though.
Comment 19 Peter Marquardt 2010-07-28 07:53:57 UTC
2.6.36-rc6 works fine. 32 cores computing, top processes are user processes, no kernel/xfs routines mentioned anymore in 'perf top'.

Thanks for fixing this!

   Peter