Hi,

this bug may be related to #15193, where I attached the issue as a comment. Since bug #15193 has status RESOLVED PATCH_ALREADY_AVAILABLE and not CLOSED CODE_FIX, I am filing the situation as a new bug, because the problem still exists for us.

THE ISSUE:
We are currently having trouble with continuously running kswapds consuming up to 100% CPU time when the system is busy doing (heavy) I/O.

To reproduce (on our machines): dd if=/dev/zero of=somefile will activate all kswapds once the file reaches the size of the installed memory (our systems have 8G up to 256G). The effect is the same on local reiserfs and local xfs.

On kernel 2.6.34 (and 2.6.35-rc3) this issue causes the system to become very, very slow and unresponsive. We switched back to 2.6.34-rc6, which seems to have no issues with the same I/O.

What information can I provide to help fix this issue?

regards, marius tolzmann
(switched to email. Please respond via emailed reply-to-all, not via the bugzilla web interface).

On Wed, 7 Jul 2010 10:58:50 GMT bugzilla-daemon@bugzilla.kernel.org wrote:

> https://bugzilla.kernel.org/show_bug.cgi?id=16348
>
>            Summary: kswapd continuously active when doing IO
>            Product: IO/Storage
>            Version: 2.5
>     Kernel Version: 2.6.34
>           Platform: All
>         OS/Version: Linux
>               Tree: Mainline
>             Status: NEW
>           Severity: high
>           Priority: P1
>          Component: Other
>         AssignedTo: io_other@kernel-bugs.osdl.org
>         ReportedBy: tolzmann@molgen.mpg.de
>         Regression: Yes
>
> [full report quoted above]
>
> What information can I provide to help fix this issue?

That's odd - there are no changes in mm/vmscan.c between 2.6.34-rc6 and 2.6.34. Are you sure about those kernel versions?
I can't reproduce this. Marius, can you try ext3/4 filesystems? Also, can you get the wchan with the following command?

# ps -ewwo pid,cmd,state,wchan:30

And do your dirty pages not decrease at all, or are they decreasing very slowly?
Hi,

I now have trouble reproducing it the easy way (via dd) on high-memory machines too, sorry, but I reproduced it on a 2GB machine using dd.

# cat /proc/version
Linux version 2.6.34.x86_64.45 (root@venustraphobie.molgen.mpg.de) (gcc version 4.4.2 (GCC) ) #1 SMP PREEMPT Tue May 25 12:24:36 CEST 2010

# top
top - 11:37:46 up 44 days, 22:50, 11 users, load average: 2.99, 2.38, 1.28
Tasks: 170 total, 4 running, 166 sleeping, 0 stopped, 0 zombie
Cpu(s): 1.2%us, 84.1%sy, 0.0%ni, 13.9%id, 0.3%wa, 0.5%hi, 0.0%si, 0.0%st
Mem:  2053436k total, 2019304k used,   34132k free,  438036k buffers
Swap: 2048280k total,    3412k used, 2044868k free, 1399924k cached

  PID USER   PR NI VIRT  RES SHR S %CPU %MEM   TIME+  COMMAND
  406 root   20  0    0    0   0 R   64  0.0 231:01.46 kswapd1
31677 root   20  0 4940  328 252 R   63  0.0   1:09.35 dd
  405 root   20  0    0    0   0 R   36  0.0 203:36.02 kswapd0
31678 root   20  0 9256 1100 572 S    3  0.1   0:03.67 bash
 1352 root   20  0    0    0   0 S    2  0.0   0:29.40 flush-3:0
  226 root   20  0    0    0   0 S    2  0.0   4:33.12 sync_supers
  233 root   20  0    0    0   0 S    0  0.0   4:13.86 kblockd/1

# ps -ewwo pid,cmd,state,wchan:30 | grep kswap
  405 [kswapd0]  R  -
  406 [kswapd1]  R  -

I will test this machine with different kernel versions and filesystems and attach all logs to this bug soon.

When the dd process is stopped, the system behaves normally again and no kswapds are running anymore.

Here is some dirty-pages data (from /proc/meminfo) while dd is running:

TIMESTAMP   FILESIZE-OF-DD-OUTPUT  DIRTY-PAGES
1278668407  7009581056             185084
1278668412  7109813760             189076
1278668415  7183441920             186092
1278668416  7223492608             186100
1278668417  7272640512              70040
1278668419  7316069888              43388
1278668420  7359446528              86268
1278668421  7385733120              94732
1278668423  7466468864             174408
1278668425  7530849280             186956
1278668428  7560326144             185928
1278668429  7595791360              48976
1278668431  7633754112              22468
1278668432  7663499264              52364
1278668433  7690510336              78148
1278668439  8045338112             187816
1278668451  8086799872             187712
1278668452  8120495104             185548
1278668453  8149839872             187044
1278668457  8237108224             187480
1278668458  8266469376             187572
1278668460  8299431936             111528
1278668462  8352441856              84272
1278668463  8375342080              44244
1278668466  8440810496             107268
1278668470  8527720448              14156
1278668471  8573242880              74400
1278668473  8616877568             101080
1278668474  8661471744             144852
1278668477  8727186944             188708
1278668479  8834936832             188176
1278668483  8875110400             186088
1278668484  8925085696             186104

regards, m.
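(Editorial aside, not part of the original report: data like the table above could be gathered with a small loop along these lines. This is only a rough sketch; the output filename and the use of the Dirty field from /proc/meminfo, which is reported in kB, are assumptions.)

    # sample once per second: epoch time, size of the dd output file, Dirty (kB)
    while true; do
        printf '%s %s %s\n' \
            "$(date +%s)" \
            "$(stat -c %s somefile)" \
            "$(awk '/^Dirty:/ {print $2}' /proc/meminfo)"
        sleep 1
    done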
Can you please run the following command?

# perf top
It turns out that this bug is not as easily reproducible as we thought at first. When booting into another kernel we can't trigger the kswapd behaviour just by doing heavy I/O. But once the system has been running for some time and the bug has been triggered once, it can easily be triggered again by any large I/O operation.

Currently I am stressing only one system to trigger it again and to provide the perf top output. Our production servers are currently running an older kernel again to get the work done - I will upgrade them again and try to find a reproducible way to trigger this.
This bug seems to be related to XFS. Can you please re-assign it to the right person, if you agree? Or should I file it again?

Further info: we got the bug to trigger when doing heavy I/O on our xfs filesystems. The strange part is that I can trigger the same behaviour when doing dd on a reiserfs partition. The perf top and top output below are snapshots of the system while running

dd if=/dev/zero of=somefile_on_reiserfs_partition

(just the dd is running - no other large I/O is done on any other partition)

---------------------------------------------------------------------------
   PerfTop:  4115 irqs/sec  kernel:95.9% [1000Hz cycles],  (all, 8 CPUs)
---------------------------------------------------------------------------

  samples  pcnt  function                         DSO
  _______  _____ _______________________________  ___________________

  5643.00  16.4% xfs_perag_get                    [kernel.kallsyms]
  2325.00   6.7% xfs_perag_put                    [kernel.kallsyms]
  1158.00   3.4% read_tsc                         [kernel.kallsyms]
   988.00   2.9% do_sys_poll                      [kernel.kallsyms]
   823.00   2.4% radix_tree_lookup                [kernel.kallsyms]
   785.00   2.3% perf_session__mmap_read_counter  /usr/local/perf/bin/perf
   735.00   2.1% x86_pmu_enable_all               [kernel.kallsyms]
   713.00   2.1% fget_light                       [kernel.kallsyms]
   688.00   2.0% sched_clock                      [kernel.kallsyms]
   612.00   1.8% _raw_spin_lock_irqsave           [kernel.kallsyms]
   610.00   1.8% _raw_spin_unlock_irqrestore      [kernel.kallsyms]
   560.00   1.6% xfs_reclaim_inode_shrink         [kernel.kallsyms]
   544.00   1.6% dso__find_symbol                 /usr/local/perf/bin/perf
   513.00   1.5% perf_pending_interrupt           [kernel.kallsyms]
   511.00   1.5% sched_clock_local                [kernel.kallsyms]

top - 14:50:03 up 4:06, 6 users, load average: 4.33, 3.22, 3.03
Tasks: 275 total, 4 running, 271 sleeping, 0 stopped, 0 zombie
Cpu(s): 0.1%us, 32.5%sy, 0.0%ni, 43.8%id, 23.2%wa, 0.1%hi, 0.2%si, 0.0%st
Mem:  33077232k total, 32979564k used,    97668k free,    62344k buffers
Swap: 65537160k total,        0k used, 65537160k free, 29747004k cached

  PID USER   PR NI  VIRT  RES SHR S %CPU %MEM   TIME+  COMMAND
11148 root   20  0  4948  324 252 R   94  0.0   0:55.76 dd
  556 root   20  0     0    0   0 R   81  0.0  32:19.99 kswapd0
  557 root   20  0     0    0   0 R   80  0.0  29:53.66 kswapd1
 2951 root   20  0     0    0   0 D    2  0.0   0:04.80 flush-8:0
  250 root   20  0     0    0   0 D    2  0.0   0:03.65 sync_supers

Here is the perf top output before starting the dd:

-----------------------------------------------------------------------
   PerfTop:  3477 irqs/sec  kernel:97.4% [1000Hz cycles],  (all, 8 CPUs)
-----------------------------------------------------------------------

  samples  pcnt  function                         DSO
  _______  _____ _______________________________  ________________________

 14867.00  35.4% read_tsc                         [kernel.kallsyms]
  2592.00   6.2% sched_clock                      [kernel.kallsyms]
  1657.00   3.9% tick_handle_periodic             [kernel.kallsyms]
  1459.00   3.5% hrtimer_run_queues               [kernel.kallsyms]
  1186.00   2.8% do_sys_poll                      [kernel.kallsyms]
   883.00   2.1% apic_timer_interrupt             [kernel.kallsyms]
   847.00   2.0% perf_session__mmap_read_counter  /usr/local/perf/bin/perf
   841.00   2.0% find_busiest_group               [kernel.kallsyms]
   831.00   2.0% find_next_bit                    [kernel.kallsyms]
   699.00   1.7% sched_clock_local                [kernel.kallsyms]
   685.00   1.6% x86_pmu_enable_all               [kernel.kallsyms]
   679.00   1.6% current_kernel_time              [kernel.kallsyms]
   667.00   1.6% xfs_perag_get                    [kernel.kallsyms]
   622.00   1.5% remove_wait_queue                [kernel.kallsyms]

Thanks for the help,
marius
Here's the same situation on a 128 GB / 32-core server. It is doing some heavy I/O and is very, very unresponsive. This one is running 2.6.34-rc6:

top - 17:17:59 up 19 days, 6:49, 8 users, load average: 11.29, 10.81, 10.47
Tasks: 688 total, 13 running, 675 sleeping, 0 stopped, 0 zombie
Cpu(s): 0.0%us, 37.5%sy, 0.0%ni, 62.4%id, 0.0%wa, 0.0%hi, 0.0%si, 0.0%st
Mem:  132293220k total, 132119488k used,   173732k free,   544744k buffers
Swap: 156290900k total,         0k used, 156290900k free, 123948184k cached

  PID USER     PR NI  VIRT  RES  SHR S %CPU %MEM    TIME+  COMMAND
 1008 root     20  0     0    0    0 R  100  0.0  6975:11 kswapd1
 1014 root     20  0     0    0    0 R  100  0.0  6903:52 kswapd7
26312 root     20  0 34836  10m 2704 R  100  0.0   2:05.84 perf
 1007 root     20  0     0    0    0 R  100  0.0  6986:52 kswapd0
 1009 root     20  0     0    0    0 R  100  0.0  7002:45 kswapd2
 1010 root     20  0     0    0    0 R  100  0.0  6900:53 kswapd3
 1011 root     20  0     0    0    0 R  100  0.0  6960:24 kswapd4
 1012 root     20  0     0    0    0 R  100  0.0  6956:13 kswapd5
 1013 root     20  0     0    0    0 R  100  0.0  6929:03 kswapd6
 4255 root     20  0 69272  62m 1736 S    0  0.0 176:53.39 amd
 4364 root     20  0  2716  380  308 S    0  0.0 189:06.05 pattexstasid
 4780 solxdata 20  0 10532 2228 1324 S    0  0.0 322:13.79 iview
26324 solxdata 20  0  9088 1728  916 R    0  0.0   0:00.82 top
    1 root     20  0  1624  476  408 S    0  0.0   0:18.51 init
    2 root     20  0     0    0    0 S    0  0.0  45:47.76 kthreadd

----------------------------------------------------------------------------
   PerfTop:  0 irqs/sec  kernel: nan% [1000Hz cycles],  (all, 32 CPUs)
----------------------------------------------------------------------------

  samples  pcnt  function                         DSO
  _______  _____ _______________________________  _______________________________

   232.00   9.5% hrtimer_run_queues               [kernel.kallsyms]
   197.00   8.1% xfs_perag_put                    [kernel.kallsyms]
   172.00   7.0% do_sys_poll                      [kernel.kallsyms]
   154.00   6.3% perf_session__mmap_read_counter  /usr/local/perf/bin/perf
    98.00   4.0% fget_light                       [kernel.kallsyms]
    77.00   3.2% perf_poll                        [kernel.kallsyms]
    74.00   3.0% remove_wait_queue                [kernel.kallsyms]
    69.00   2.8% _raw_spin_lock_irqsave           [kernel.kallsyms]
    63.00   2.6% ktime_get                        [kernel.kallsyms]
    58.00   2.4% sched_clock                      [kernel.kallsyms]
    53.00   2.2% S_hv_fetch_common                /arch/x86_64/usr/local/bin/perl
    52.00   2.1% apic_timer_interrupt             [kernel.kallsyms]
    50.00   2.0% schedule                         [kernel.kallsyms]
    50.00   2.0% perf_event_do_pending            [kernel.kallsyms]
    49.00   2.0% find_busiest_group               [kernel.kallsyms]
    49.00   2.0% xfs_perag_get                    [kernel.kallsyms]
    48.00   2.0% _raw_spin_unlock_irqrestore      [kernel.kallsyms]

Here we also find some hints pointing towards xfs. Maybe it helps that both machines have some quite big disks with xfs attached.

The one Marius reported looks like this:

root@hexakosioihexekontahexaphobia:~# cat /proc/partitions
major minor  #blocks      name

   8     0    156290904   sda
   8     1     10241406   sda1
   8     2    146046915   sda2
   8    16    156290904   sdb
   8    32   5859311616   sdc
   8    48  13671727104   sdd
   8    64  13671727104   sde
   8    80  13671727104   sdf
   8    96  13671727104   sdg

This one has 5x 13 TB and one drive with 27 TB running xfs:

root@arachibotyrophobie:~# cat /proc/partitions
major minor  #blocks      name

   8     0    244198584   sda
   8     1     10241406   sda1
   8     2     65537167   sda2
   8     3    168417427   sda3
   8    16  13671727104   sdb
   8    32  13671727104   sdc
   8    48  13671727104   sdd
   8    80  27343597568   sdf
   8    64  13671727104   sde
   8    96  13671727104   sdg

We'll supply any additional needed information a.s.a.p. - just tell us what you need 8)

Thanks,
Peter
(In reply to comment #7)
> Here's the same situation on a 128 GB / 32-core server. It is doing some heavy
> I/O and is very, very unresponsive. This one is running 2.6.34-rc6:

This looks like the new XFS inode cache shrinker is active. The dd is generating memory pressure, and kswapd is potentially running the slab shrinkers, reclaiming cached inodes. Can you show the output of slabtop before the dd and while it is running?

Cheers,

Dave.
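(Editorial aside, not from the thread: one way to capture the requested before/during snapshots, assuming the standard procps slabtop with its -o/--once and -s/--sort options.)

    # top slab caches sorted by cache size, once before starting the dd ...
    slabtop -o -s c | head -n 30 > slabtop.before

    # ... and once again while dd is running and kswapd is busy
    slabtop -o -s c | head -n 30 > slabtop.during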
Here's the output of top/slabtop running on a 32 GB RAM machine:

---------------------------------
start, 8 out of 32 GB RAM used
---------------------------------

top - 08:32:13 up 21:48, 4 users, load average: 0.00, 0.00, 0.00
Tasks: 267 total, 1 running, 266 sleeping, 0 stopped, 0 zombie
Cpu(s): 0.0%us, 0.1%sy, 0.0%ni, 99.9%id, 0.0%wa, 0.0%hi, 0.0%si, 0.0%st
Mem:  33077232k total,  8384520k used, 24692712k free,   252740k buffers
Swap: 65537160k total,        0k used, 65537160k free,  2744916k cached

  PID USER   PR NI  VIRT  RES SHR S %CPU %MEM   TIME+  COMMAND
 6055 root   20  0  4512 1288 736 S    7  0.0   0:05.35 slabtop
    1 root   20  0  1624  480 408 S    0  0.0   0:01.25 init
    2 root   20  0     0    0   0 S    0  0.0   0:00.00 kthreadd
    3 root   RT  0     0    0   0 S    0  0.0   0:00.05 migration/0
    4 root   20  0     0    0   0 S    0  0.0   0:01.49 ksoftirqd/0
    5 root   RT  0     0    0   0 S    0  0.0   0:00.03 migration/1
    6 root   20  0     0    0   0 S    0  0.0   0:01.82 ksoftirqd/1
    7 root   RT  0     0    0   0 S    0  0.0   0:00.01 migration/2
[...]

 Active / Total Objects (% used)    : 12463647 / 13977711 (89.2%)
 Active / Total Slabs (% used)      : 1305429 / 1305429 (100.0%)
 Active / Total Caches (% used)     : 95 / 223 (42.6%)
 Active / Total Size (% used)       : 4892545.30K / 5125737.29K (95.5%)
 Minimum / Average / Maximum Object : 0.02K / 0.37K / 4096.00K

   OBJS  ACTIVE  USE OBJ SIZE  SLABS OBJ/SLAB CACHE SIZE NAME
4716283 4595683  97%    0.06K  79937       59    319748K size-64
3913260 3809455  97%    0.19K 195663       20    782652K dentry
3753424 3723471  99%    1.00K 938356        4   3753424K xfs_inode
1219187   80374   6%    0.10K  32951       37    131804K buffer_head
 204939   97428  47%    0.54K  29277        7    117108K radix_tree_node
  77495   77479  99%    0.70K  15499        5     61996K reiser_inode_cache
  21420   21164  98%    0.12K    714       30      2856K size-128
  13621   13444  98%    0.07K    257       53      1028K sysfs_dir_cache
  10528    9091  86%    0.03K     94      112       376K size-32
   8684    8684 100%    4.00K   8684        1     34736K size-4096
   5680    3189  56%    0.19K    284       20      1136K size-192
   2885    2873  99%    0.77K    577        5      2308K shmem_inode_cache
   2849    1880  65%    0.05K     37       77       148K anon_vma_chain
   2438    1703  69%    0.07K     46       53       184K Acpi-Operand
   2398    1913  79%    0.17K    109       22       436K vm_area_struct
   2064    1806  87%    1.00K    516        4      2064K size-1024
   1950    1621  83%    0.61K    325        6      1300K proc_inode_cache
   1840    1365  74%    0.19K     92       20       368K filp
   1776    1693  95%    0.50K    222        8       888K size-512
   1728    1072  62%    0.02K     12      144        48K anon_vma
   1510    1467  97%    2.00K    755        2      3020K size-2048
   1470    1070  72%    0.25K     98       15       392K skbuff_head_cache
   1140     404  35%    0.12K     38       30       152K cred_jar
    900     798  88%    0.25K     60       15       240K size-256

-------------------------------------
5 minutes later, 27 out of 32 GB used
-------------------------------------

top - 08:36:59 up 21:53, 5 users, load average: 1.32, 0.54, 0.20
Tasks: 274 total, 1 running, 273 sleeping, 0 stopped, 0 zombie
Cpu(s): 4.8%us, 5.1%sy, 0.0%ni, 79.5%id, 9.6%wa, 0.1%hi, 0.8%si, 0.0%st
Mem:  33077232k total, 27160668k used,  5916564k free,   252740k buffers
Swap: 65537160k total,        0k used, 65537160k free, 21326208k cached

  PID USER     PR NI  VIRT  RES SHR S %CPU %MEM   TIME+  COMMAND
 6209 solidata 20  0 27356 6724 1696 S   36  0.0   0:46.69 solidmirror
 6157 root     20  0 26160 6664 1632 S   25  0.0   0:29.76 pmirror
 6154 root     20  0 27776 8288 1656 D   18  0.0   0:19.73 pmirror
 6264 root     20  0  4244 1136  736 S    6  0.0   0:01.06 slabtop
 6134 root     20  0     0    0    0 S    2  0.0   0:01.92 flush-8:80
  594 root     20  0     0    0    0 S    1  0.0   0:06.78 xfsdatad/4

 Active / Total Objects (% used)    : 15002352 / 15335899 (97.8%)
 Active / Total Slabs (% used)      : 1348855 / 1348971 (100.0%)
 Active / Total Caches (% used)     : 104 / 223 (46.6%)
 Active / Total Size (% used)       : 5206572.94K / 5289652.04K (98.4%)
 Minimum / Average / Maximum Object : 0.02K / 0.34K / 4096.00K

   OBJS  ACTIVE  USE OBJ SIZE  SLABS OBJ/SLAB CACHE SIZE NAME
4717522 4606213  97%    0.06K  79958       59    319832K size-64
3913420 3823949  97%    0.19K 195671       20    782684K dentry
3753528 3737799  99%    1.00K 938382        4   3753528K xfs_inode
2506380 2474638  98%    0.10K  67740       37    270960K buffer_head
 255066  187566  73%    0.54K  36438        7    145752K radix_tree_node
  77500   77481  99%    0.70K  15500        5     62000K reiser_inode_cache
  21420   21237  99%    0.12K    714       30      2856K size-128
  13621   13446  98%    0.07K    257       53      1028K sysfs_dir_cache
  10528    9105  86%    0.03K     94      112       376K size-32
   8812    8764  99%    4.00K   8812        1     35248K size-4096
   7140    7137  99%    0.18K    340       21      1360K xfs_ili
   5680    3349  58%    0.19K    284       20      1136K size-192

----------------------------------------------
2 minutes later, 32 GB used, kswapd popping up
----------------------------------------------

top - 08:38:08 up 21:54, 5 users, load average: 1.98, 0.86, 0.33
Tasks: 274 total, 5 running, 269 sleeping, 0 stopped, 0 zombie
Cpu(s): 3.3%us, 40.2%sy, 0.0%ni, 50.1%id, 5.7%wa, 0.0%hi, 0.8%si, 0.0%st
Mem:  33077232k total, 32977904k used,    99328k free,   252740k buffers
Swap: 65537160k total,        0k used, 65537160k free, 27063272k cached

  PID USER     PR NI  VIRT  RES SHR S %CPU %MEM   TIME+  COMMAND
  556 root     20  0     0    0    0 R  100  0.0 141:58.60 kswapd0
  557 root     20  0     0    0    0 R  100  0.0 138:30.02 kswapd1
 6154 root     20  0 27776 8288 1656 R   97  0.0   0:49.16 pmirror
 6209 solidata 20  0 27356 6872 1704 S   30  0.0   1:09.87 solidmirror
 6157 root     20  0 26160 6664 1632 R   14  0.0   0:51.78 pmirror
 6264 root     20  0  4244 1136  736 S    7  0.0   0:03.97 slabtop
 6134 root     20  0     0    0    0 D    3  0.0   0:03.09 flush-8:80
  594 root     20  0     0    0    0 S    1  0.0   0:07.33 xfsdatad/4
   16 root     20  0     0    0    0 S    0  0.0   0:02.15 ksoftirqd/6
11048 solidata 20  0  8820 1468  976 R    0  0.0   3:21.10 top
    1 root     20  0  1624  480  408 S    0  0.0   0:01.25 init
    2 root     20  0     0    0    0 S    0  0.0   0:00.00 kthreadd

 Active / Total Objects (% used)    : 15491439 / 15986833 (96.9%)
 Active / Total Slabs (% used)      : 1366347 / 1366507 (100.0%)
 Active / Total Caches (% used)     : 104 / 223 (46.6%)
 Active / Total Size (% used)       : 5259346.41K / 5355544.54K (98.2%)
 Minimum / Average / Maximum Object : 0.02K / 0.33K / 4096.00K

   OBJS  ACTIVE  USE OBJ SIZE  SLABS OBJ/SLAB CACHE SIZE NAME
4718171 4501369  95%    0.06K  79969       59    319876K size-64
3912540 3711758  94%    0.19K 195627       20    782508K dentry
3753212 3736699  99%    1.00K 938303        4   3753212K xfs_inode
3156729 3156729 100%    0.10K  85317       37    341268K buffer_head
 255017  208730  81%    0.54K  36431        7    145724K radix_tree_node
  77480   77449  99%    0.70K  15496        5     61984K reiser_inode_cache
  21420   21275  99%    0.12K    714       30      2856K size-128
  13621   13444  98%    0.07K    257       53      1028K sysfs_dir_cache
  10528    9129  86%    0.03K     94      112       376K size-32
   8717    8689  99%    4.00K   8717        1     34868K size-4096
   7266    7251  99%    0.18K    346       21      1384K xfs_ili
   5760    3634  63%    0.19K    288       20      1152K size-192
Aborting the I/O-heavy programs (plain perl open/read/write/close) almost immediately drops the kswapds. The xfs_inode slabs stay.

top - 08:51:51 up 22:08, 4 users, load average: 1.66, 2.89, 2.07
Tasks: 268 total, 1 running, 267 sleeping, 0 stopped, 0 zombie
Cpu(s): 0.0%us, 0.2%sy, 0.0%ni, 99.7%id, 0.0%wa, 0.0%hi, 0.0%si, 0.0%st
Mem:  33077232k total, 32953756k used,   123476k free,   227432k buffers
Swap: 65537160k total,        0k used, 65537160k free, 27218944k cached

  PID USER   PR NI  VIRT  RES SHR S %CPU %MEM   TIME+  COMMAND
 6264 root   20  0  4244 1136 736 S    6  0.0   0:50.91 slabtop
    1 root   20  0  1624  480 408 S    0  0.0   0:01.26 init
    2 root   20  0     0    0   0 S    0  0.0   0:00.00 kthreadd
    3 root   RT  0     0    0   0 S    0  0.0   0:00.05 migration/0
    4 root   20  0     0    0   0 S    0  0.0   0:01.67 ksoftirqd/0
    5 root   RT  0     0    0   0 S    0  0.0   0:00.03 migration/1
    6 root   20  0     0    0   0 S    0  0.0   0:01.98 ksoftirqd/1

 Active / Total Objects (% used)    : 13291713 / 15368731 (86.5%)
 Active / Total Slabs (% used)      : 1335299 / 1335324 (100.0%)
 Active / Total Caches (% used)     : 101 / 223 (45.3%)
 Active / Total Size (% used)       : 4884030.56K / 5242406.91K (93.2%)
 Minimum / Average / Maximum Object : 0.02K / 0.34K / 4096.00K

   OBJS  ACTIVE  USE OBJ SIZE  SLABS OBJ/SLAB CACHE SIZE NAME
4678877 3458608  73%    0.06K  79303       59    317212K size-64
3786780 3657252  96%    1.00K 946695        4   3786780K xfs_inode
3256703 3254908  99%    0.10K  88019       37    352076K buffer_head
3253440 2593056  79%    0.19K 162672       20    650688K dentry
 252147  206842  82%    0.54K  36021        7    144084K radix_tree_node
  43960   42829  97%    0.70K   8792        5     35168K reiser_inode_cache
  21630   18137  83%    0.12K    721       30      2884K size-128
  13621   13444  98%    0.07K    257       53      1028K sysfs_dir_cache
  10528    9094  86%    0.03K     94      112       376K size-32
   8686    8686 100%    4.00K   8686        1     34744K size-4096
   5760    3184  55%    0.19K    288       20      1152K size-192
   3108    2053  66%    0.18K    148       21       592K xfs_ili
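(Editorial aside, not suggested in the thread: whether the leftover xfs_inode slabs are actually reclaimable can be checked by forcing dentry/inode reclaim through the standard drop_caches interface and then looking at slabtop again. Run as root; this only drops clean, unused cache objects.)

    sync
    # 2 = free reclaimable slab objects (dentries and inodes)
    echo 2 > /proc/sys/vm/drop_caches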
Ok, so there are about 4M clean cached inodes and dentries at the start, and 2.5M cached dentries and 3.6M cached inodes at the end. Reclaim doesn't really seem to be making much progress here - inodes can't be reclaimed until their dentries are freed, and that isn't happening very fast either.

I can reproduce kswapd running for a few seconds while it reclaims a couple of million inodes and dentries, but I can't keep it running for minutes. Right now I'll focus on the xfs_perag_get/put frequency and try to work that out.

Can you post the output of xfs_info for each XFS filesystem you have, so I can see how many filesystems and AGs are being traversed during each reclaim call?
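(One way to gather what is being asked for here - a minimal sketch that walks every mounted XFS filesystem; xfs_info takes a mount point.)

    # print xfs_info for each mounted XFS filesystem
    awk '$3 == "xfs" {print $2}' /proc/mounts | while read -r mnt; do
        echo "=== $mnt ==="
        xfs_info "$mnt"
    done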
Here we go:

/dev/sdc on /amd/arachibotyrophobie/X/X0031 type xfs (rw,noatime,nodiratime,logbufs=8)
/dev/sdb on /amd/arachibotyrophobie/X/X0060 type xfs (rw,noatime,nodiratime,logbufs=8)
/dev/sdd on /amd/arachibotyrophobie/X/X0063 type xfs (rw,noatime,nodiratime,logbufs=8)
/dev/sde on /amd/arachibotyrophobie/X/X0064 type xfs (rw,noatime,nodiratime,logbufs=8)
/dev/sdg on /amd/arachibotyrophobie/X/X0047 type xfs (rw,noatime,nodiratime,logbufs=8)
/dev/sdf on /amd/arachibotyrophobie/X/X2002 type xfs (rw,noatime,nodiratime,logbufs=8)

meta-data=/dev/sdc               isize=256    agcount=2048, agsize=1668912 blks
         =                       sectsz=512   attr=0
data     =                       bsize=4096   blocks=3417931776, imaxpct=25
         =                       sunit=16     swidth=224 blks
naming   =version 2              bsize=4096   ascii-ci=0
log      =internal               bsize=4096   blocks=32768, version=1
         =                       sectsz=512   sunit=0 blks, lazy-count=0
realtime =none                   extsz=65536  blocks=0, rtextents=0

meta-data=/dev/sdb               isize=256    agcount=2048, agsize=1668912 blks
         =                       sectsz=512   attr=0
data     =                       bsize=4096   blocks=3417931776, imaxpct=25
         =                       sunit=16     swidth=224 blks
naming   =version 2              bsize=4096   ascii-ci=0
log      =internal               bsize=4096   blocks=32768, version=1
         =                       sectsz=512   sunit=0 blks, lazy-count=0
realtime =none                   extsz=65536  blocks=0, rtextents=0

meta-data=/dev/sdd               isize=256    agcount=2048, agsize=1668912 blks
         =                       sectsz=512   attr=0
data     =                       bsize=4096   blocks=3417931776, imaxpct=25
         =                       sunit=16     swidth=224 blks
naming   =version 2              bsize=4096   ascii-ci=0
log      =internal               bsize=4096   blocks=32768, version=1
         =                       sectsz=512   sunit=0 blks, lazy-count=0
realtime =none                   extsz=65536  blocks=0, rtextents=0

meta-data=/dev/sde               isize=256    agcount=2048, agsize=1668912 blks
         =                       sectsz=512   attr=0
data     =                       bsize=4096   blocks=3417931776, imaxpct=25
         =                       sunit=16     swidth=224 blks
naming   =version 2              bsize=4096   ascii-ci=0
log      =internal               bsize=4096   blocks=32768, version=1
         =                       sectsz=512   sunit=0 blks, lazy-count=0
realtime =none                   extsz=65536  blocks=0, rtextents=0

meta-data=/dev/sdg               isize=256    agcount=13, agsize=268435392 blks
         =                       sectsz=512   attr=2
data     =                       bsize=4096   blocks=3417931776, imaxpct=5
         =                       sunit=64     swidth=896 blks
naming   =version 2              bsize=4096   ascii-ci=0
log      =internal               bsize=4096   blocks=521728, version=2
         =                       sectsz=512   sunit=64 blks, lazy-count=1
realtime =none                   extsz=4096   blocks=0, rtextents=0

meta-data=/dev/sdf               isize=256    agcount=26, agsize=268435392 blks
         =                       sectsz=512   attr=2
data     =                       bsize=4096   blocks=6835899392, imaxpct=5
         =                       sunit=64     swidth=896 blks
naming   =version 2              bsize=4096   ascii-ci=0
log      =internal               bsize=4096   blocks=521728, version=2
         =                       sectsz=512   sunit=64 blks, lazy-count=1
realtime =none                   extsz=4096   blocks=0, rtextents=0
(In reply to comment #12)
> meta-data=/dev/sdc               isize=256    agcount=2048, agsize=1668912 blks
> meta-data=/dev/sdb               isize=256    agcount=2048, agsize=1668912 blks
> meta-data=/dev/sdd               isize=256    agcount=2048, agsize=1668912 blks
> meta-data=/dev/sde               isize=256    agcount=2048, agsize=1668912 blks
> meta-data=/dev/sdg               isize=256    agcount=13, agsize=268435392 blks
> meta-data=/dev/sdf               isize=256    agcount=26, agsize=268435392 blks

It's what I was expecting. The old filesystems were configured strangely, with a lot (2048) of relatively small allocation groups. The newer 13TB and 26TB filesystems are using 1TB allocation groups, so they have a much more sensible configuration.

Basically, the problem is that every iteration of the XFS inode shrinker does an xfs_perag_get/put on every AG in every filesystem. That means it may be scanning >8000 AGs before finding an inode to reclaim. This is where the CPU is being used.

Per-filesystem shrinker callouts will help this, as will a variable reclaim start AG. However, the overhead of aggregating across all those AGs won't go away. I think I have a solution to that (propagate the reclaimable inode bit to the per-ag radix tree) that will reduce the overhead of aggregation and reclaim scanning quite a bit, but it is only an idea right now. Let me think on this a bit...
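(Back-of-the-envelope check of the ">8000 AGs" figure, using only the agcount values from the xfs_info output in comment #12.)

    # four old filesystems with 2048 AGs each, plus the 13-AG and 26-AG ones
    echo $(( 4 * 2048 + 13 + 26 ))    # -> 8231 AGs walked per shrinker callout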
This vast number of allocation groups was the result of a lot of performance tuning and poking around in the fog with a stick when we started to use xfs. We have a _lot_ of small and huge files and ran into 'disk full / out of inodes'. Anyway, we thought the bottleneck came from not having enough allocation groups spread over the surface. 2048*1668912 equals approximately 13*268435392.

In the past we supplied agcount=2048 to mkfs.xfs:

# mkfs.xfs -b size=4096 -d sunit=128,swidth=1792,agcount=2048 -L x0060 /dev/sdb

whereas we now initialize the raid6 (14+2) units more or less with the defaults:

# mkfs.xfs -L x2002 -d su=256k,sw=14 /dev/sdf

Maybe this agcount thing wasn't a good idea, but it worked fine up until the xfs code changed a lot in the past 6 months (2.6.32+ ?!).

Cheers, and thanks for your work so far,

Peter
(In reply to comment #14)
> This vast number of allocation groups was the result of a lot of performance
> tuning and poking around in the fog with a stick when we started to use xfs.
> We have a _lot_ of small and huge files and ran into 'disk full / out of
> inodes'. Anyway, we thought the bottleneck came from not having enough
> allocation groups spread over the surface. 2048*1668912 equals approximately
> 13*268435392.

Sounds like you were running out of space below 1TB - the inode32 allocator problem. This probably would have gone away with the inode64 mount option.

> In the past we supplied agcount=2048 to mkfs.xfs:
>
> # mkfs.xfs -b size=4096 -d sunit=128,swidth=1792,agcount=2048 -L x0060 /dev/sdb
>
> whereas we now initialize the raid6 (14+2) units more or less with the defaults:
>
> # mkfs.xfs -L x2002 -d su=256k,sw=14 /dev/sdf
>
> Maybe this agcount thing wasn't a good idea, but it worked fine up until the
> xfs code changed a lot in the past 6 months (2.6.32+ ?!).

Yeah, we've moved all the inode caching and reclaim to per-ag structures, and there are now some places where we iterate over all the AGs to sum counters. The inode reclaim shrinker is one of these places, and it is clear that it needs improvement. (Well, we knew that already....)

I'll try to have some patches for you to test later today.
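(For reference only - not something applied in this thread: the inode64 allocator mentioned above is enabled at mount time. A sketch using one of the existing device/mount-point pairs from comment #12; whether inode64 is appropriate depends on all applications coping with >32-bit inode numbers, and the option would also need adding to the corresponding fstab/automounter entry.)

    # hypothetical mount line enabling the 64-bit inode allocator,
    # so new inodes are no longer forced into the first 1TB of the device
    mount -t xfs -o inode64,noatime,nodiratime,logbufs=8 /dev/sdb /amd/arachibotyrophobie/X/X0060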
On Friday, July 23, 2010, Dave Chinner wrote:
> On Fri, Jul 23, 2010 at 02:15:54PM +0200, Rafael J. Wysocki wrote:
> > This message has been generated automatically as a part of a report
> > of regressions introduced between 2.6.33 and 2.6.34.
> >
> > The following bug entry is on the current list of known regressions
> > introduced between 2.6.33 and 2.6.34. Please verify if it still should
> > be listed and let the tracking team know (either way).
> >
> > Bug-Entry : http://bugzilla.kernel.org/show_bug.cgi?id=16348
> > Subject   : kswapd continuously active when doing IO
> > Submitter : Marius Tolzmann <tolzmann@molgen.mpg.de>
> > Date      : 2010-07-07 10:58 (17 days old)
>
> The fix for this is already in 2.6.35-rc6.
What was the fix? Is it submitted for 2.6.34-stable?
(In reply to comment #17)
> What was the fix? Is it submitted for 2.6.34-stable?

7f8275d0d660c146de6ee3017e1e2e594c49e820 mm: add context argument to shrinker callback
70e60ce71516c3a9e882edb70a09f696a05961db xfs: convert inode shrinker to per-filesystem contexts
16fd5367370099b59d96e30bb7d9de8d419659f2 xfs: track AGs with reclaimable inodes in per-ag radix tree
567c7b0edec0200c5c6613f07c3d3b4034fdc836 mm: add context argument to shrinker callback to remaining shrinkers

And no, they haven't been submitted to 2.6.34-stable. It should be pretty trivial to backport them, though.
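(Untested sketch of what such a backport could look like, using the commit IDs listed above; the stable branch name is a placeholder, the picks may need manual conflict resolution, and the two shrinker-API commits have to go in before the XFS-specific ones they enable.)

    git checkout -b xfs-shrinker-backport linux-2.6.34.y            # placeholder branch name
    git cherry-pick 7f8275d0d660c146de6ee3017e1e2e594c49e820        # mm: add context argument to shrinker callback
    git cherry-pick 70e60ce71516c3a9e882edb70a09f696a05961db        # xfs: convert inode shrinker to per-filesystem contexts
    git cherry-pick 16fd5367370099b59d96e30bb7d9de8d419659f2        # xfs: track AGs with reclaimable inodes in per-ag radix tree
    git cherry-pick 567c7b0edec0200c5c6613f07c3d3b4034fdc836        # mm: add context argument to shrinker callback to remaining shrinkers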
2.6.35-rc6 works fine. All 32 cores are computing, the top processes are user processes, and no kernel/xfs routines show up in 'perf top' anymore.

Thanks for fixing this!

Peter