Created attachment 21797 [details] dmesg output

Built a minimalist kernel leaving anonymous page swapping disabled. Under a network stress test the above message and much more appeared in the kernel message log. Turning anonymous page swapping back on fixed it. It's not important to me to have this work correctly, so if it's not important to anyone else it could be ignored and the issue closed. Just reporting it since I tripped over it.
Created attachment 21798 [details] config
Created attachment 21799 [details] hugetlb patch 1
Created attachment 21800 [details] hugetlb patch 2
Forgot to add that external Intel network drivers were in use when this happened. 'igb' was in use at the time.

ixgbe-1.3.56.17
igb-1.3.19.3

modprobe igb IntMode=1,1,1,1
modprobe ixgbe InterruptType=2,2 MQ=1,1 RSS=2,2 InterruptThrottleRate=1,1
Spoke too soon when saying that turning swap support on fixed it. The bug came back even with this and several other features re-enabled. NUMA was left off, however, as the target server is a Xeon. The problem results from high-stress network load: a multicast application and a huge 'rcp' running simultaneously at full 1G link speed.
Created attachment 21809 [details] dmesg output
Created attachment 21810 [details] config
Looking at the trace, I suppose this could be a bug in the 'igb' driver. We're giving up on 'igb' and 'ixgbe', as both perform poorly compared with 'e1000e', so we're staying with 'e1000e'.
I'm having a similar issue on the head node of a diskless cluster of ~80 nodes running gentoo-sources 2.6.27-gentoo-r8. The nodes are connected with InfiniBand, but this connection is not used yet. Another gigabit connection is used for management, for mounting /home over NFS, and for transferring a kernel/initrd when the compute nodes boot (since the compute nodes are diskless).

What kind of error is this? Is the connection to be considered dead, or should it come back without a reboot?

dmesg part:

[511466.766844] swapper: page allocation failure. order:0, mode:0x4020
[511466.766855] Pid: 0, comm: swapper Not tainted 2.6.27-gentoo-r8 #1
[511466.766857]
[511466.766858] Call Trace:
[511466.766860] <IRQ> [<ffffffff81079efe>] __alloc_pages_internal+0x412/0x434
[511466.766870] [<ffffffff8109b148>] new_slab+0x55/0x1b0
[511466.766873] [<ffffffff8109b55c>] __slab_alloc+0x252/0x3d2
[511466.766878] [<ffffffff813c7e7c>] __netdev_alloc_skb+0x29/0x45
[511466.766881] [<ffffffff813c7e7c>] __netdev_alloc_skb+0x29/0x45
[511466.766885] [<ffffffff8109c474>] __kmalloc_node_track_caller+0x75/0xaa
[511466.766889] [<ffffffff813c74ca>] __alloc_skb+0x6b/0x12f
[511466.766892] [<ffffffff813c7e7c>] __netdev_alloc_skb+0x29/0x45
[511466.766897] [<ffffffff812c44a8>] igb_alloc_rx_buffers_adv+0xdc/0x1b7
[511466.766901] [<ffffffff812c4916>] igb_clean_rx_irq_adv+0x393/0x3d5
[511466.766904] [<ffffffff812c4acf>] igb_clean_rx_ring_msix+0x51/0x14a
[511466.766908] [<ffffffff81048bd3>] hrtimer_reprogram+0x74/0x8f
[511466.766913] [<ffffffff813ca9eb>] net_rx_action+0xb7/0x1a7
[511466.766917] [<ffffffff8103994a>] __do_softirq+0x63/0xcc
[511466.766921] [<ffffffff8100d32c>] call_softirq+0x1c/0x28
[511466.766925] [<ffffffff8100e3c7>] do_softirq+0x2c/0x68
[511466.766927] [<ffffffff81039709>] irq_exit+0x3f/0x91
[511466.766930] [<ffffffff8100e5fc>] do_IRQ+0xb5/0xd2
[511466.766934] [<ffffffff8100c5f1>] ret_from_intr+0x0/0xa
[511466.766936] <EOI> [<ffffffff812796c1>] acpi_idle_enter_bm+0x251/0x294
[511466.766944] [<ffffffff812796b7>] acpi_idle_enter_bm+0x247/0x294
[511466.766949] [<ffffffff8137441d>] cpuidle_idle_call+0x8d/0xca
[511466.766952] [<ffffffff8100b193>] cpu_idle+0x88/0xdc
[511466.766955] Mem-Info:
[511466.766957] Node 0 DMA per-cpu:
[511466.766960] CPU 0: hi: 0, btch: 1 usd: 0
[511466.766962] CPU 1: hi: 0, btch: 1 usd: 0
[511466.766964] CPU 2: hi: 0, btch: 1 usd: 0
[511466.766966] CPU 3: hi: 0, btch: 1 usd: 0
[511466.766968] CPU 4: hi: 0, btch: 1 usd: 0
[511466.766970] CPU 5: hi: 0, btch: 1 usd: 0
[511466.766972] CPU 6: hi: 0, btch: 1 usd: 0
[511466.766974] CPU 7: hi: 0, btch: 1 usd: 0
[511466.766976] CPU 8: hi: 0, btch: 1 usd: 0
[511466.766978] CPU 9: hi: 0, btch: 1 usd: 0
[511466.766980] CPU 10: hi: 0, btch: 1 usd: 0
[511466.766982] CPU 11: hi: 0, btch: 1 usd: 0
[511466.766984] CPU 12: hi: 0, btch: 1 usd: 0
[511466.766986] CPU 13: hi: 0, btch: 1 usd: 0
[511466.766988] CPU 14: hi: 0, btch: 1 usd: 0
[511466.766990] CPU 15: hi: 0, btch: 1 usd: 0
[511466.766992] Node 0 DMA32 per-cpu:
[511466.766995] CPU 0: hi: 186, btch: 31 usd: 134
[511466.766997] CPU 1: hi: 186, btch: 31 usd: 126
[511466.766999] CPU 2: hi: 186, btch: 31 usd: 135
[511466.767001] CPU 3: hi: 186, btch: 31 usd: 130
[511466.767003] CPU 4: hi: 186, btch: 31 usd: 137
[511466.767005] CPU 5: hi: 186, btch: 31 usd: 137
[511466.767007] CPU 6: hi: 186, btch: 31 usd: 164
[511466.767009] CPU 7: hi: 186, btch: 31 usd: 139
[511466.767011] CPU 8: hi: 186, btch: 31 usd: 150
[511466.767013] CPU 9: hi: 186, btch: 31 usd: 136
[511466.767015] CPU 10: hi: 186, btch: 31 usd: 91
[511466.767017] CPU 11: hi: 186, btch: 31 usd: 140
[511466.767019] CPU 12: hi: 186, btch: 31 usd: 63
[511466.767021] CPU 13: hi: 186, btch: 31 usd: 57
[511466.767023] CPU 14: hi: 186, btch: 31 usd: 119
[511466.767025] CPU 15: hi: 186, btch: 31 usd: 88
[511466.767027] Node 0 Normal per-cpu:
[511466.767030] CPU 0: hi: 186, btch: 31 usd: 101
[511466.767032] CPU 1: hi: 186, btch: 31 usd: 135
[511466.767034] CPU 2: hi: 186, btch: 31 usd: 183
[511466.767036] CPU 3: hi: 186, btch: 31 usd: 177
[511466.767038] CPU 4: hi: 186, btch: 31 usd: 167
[511466.767040] CPU 5: hi: 186, btch: 31 usd: 148
[511466.767042] CPU 6: hi: 186, btch: 31 usd: 180
[511466.767044] CPU 7: hi: 186, btch: 31 usd: 130
[511466.767046] CPU 8: hi: 186, btch: 31 usd: 123
[511466.767048] CPU 9: hi: 186, btch: 31 usd: 165
[511466.767050] CPU 10: hi: 186, btch: 31 usd: 179
[511466.767052] CPU 11: hi: 186, btch: 31 usd: 179
[511466.767054] CPU 12: hi: 186, btch: 31 usd: 74
[511466.767056] CPU 13: hi: 186, btch: 31 usd: 167
[511466.767058] CPU 14: hi: 186, btch: 31 usd: 178
[511466.767060] CPU 15: hi: 186, btch: 31 usd: 169
[511466.767062] Node 1 Normal per-cpu:
[511466.767064] CPU 0: hi: 186, btch: 31 usd: 125
[511466.767067] CPU 1: hi: 186, btch: 31 usd: 181
[511466.767069] CPU 2: hi: 186, btch: 31 usd: 154
[511466.767071] CPU 3: hi: 186, btch: 31 usd: 157
[511466.767073] CPU 4: hi: 186, btch: 31 usd: 158
[511466.767075] CPU 5: hi: 186, btch: 31 usd: 153
[511466.767077] CPU 6: hi: 186, btch: 31 usd: 139
[511466.767079] CPU 7: hi: 186, btch: 31 usd: 153
[511466.767081] CPU 8: hi: 186, btch: 31 usd: 89
[511466.767083] CPU 9: hi: 186, btch: 31 usd: 168
[511466.767085] CPU 10: hi: 186, btch: 31 usd: 157
[511466.767087] CPU 11: hi: 186, btch: 31 usd: 156
[511466.767089] CPU 12: hi: 186, btch: 31 usd: 50
[511466.767091] CPU 13: hi: 186, btch: 31 usd: 158
[511466.767093] CPU 14: hi: 186, btch: 31 usd: 158
[511466.767095] CPU 15: hi: 186, btch: 31 usd: 121
[511466.767099] Active:4223822 inactive:1831524 dirty:2490 writeback:1276 unstable:0
[511466.767100] free:12767 slab:35363 mapped:9918 pagetables:15796 bounce:0
[511466.767102] Node 0 DMA free:7772kB min:4kB low:4kB high:4kB active:0kB inactive:0kB present:6708kB pages_scanned:0 all_unreclaimable? yes
[511466.767107] lowmem_reserve[]: 0 2991 12081 12081
[511466.767111] Node 0 DMA32 free:37172kB min:2460kB low:3072kB high:3688kB active:1793752kB inactive:1005228kB present:3063392kB pages_scanned:0 all_unreclaimable? no
[511466.767117] lowmem_reserve[]: 0 0 9090 9090
[511466.767121] Node 0 Normal free:2648kB min:7476kB low:9344kB high:11212kB active:6071344kB inactive:3083336kB present:9308160kB pages_scanned:128 all_unreclaimable? no
[511466.767126] lowmem_reserve[]: 0 0 0 0
[511466.767130] Node 1 Normal free:3476kB min:9968kB low:12460kB high:14952kB active:9030192kB inactive:3237532kB present:12410880kB pages_scanned:0 all_unreclaimable? no
[511466.767135] lowmem_reserve[]: 0 0 0 0
[511466.767139] Node 0 DMA: 3*4kB 2*8kB 2*16kB 3*32kB 1*64kB 1*128kB 3*256kB 1*512kB 2*1024kB 0*2048kB 1*4096kB = 7772kB
[511466.767150] Node 0 DMA32: 8333*4kB 212*8kB 33*16kB 1*32kB 7*64kB 3*128kB 1*256kB 0*512kB 0*1024kB 0*2048kB 0*4096kB = 36676kB
[511466.767161] Node 0 Normal: 513*4kB 5*8kB 30*16kB 1*32kB 0*64kB 0*128kB 1*256kB 1*512kB 0*1024kB 0*2048kB 0*4096kB = 3372kB
[511466.767172] Node 1 Normal: 468*4kB 1*8kB 1*16kB 0*32kB 1*64kB 0*128kB 1*256kB 1*512kB 1*1024kB 0*2048kB 0*4096kB = 3752kB
[511466.767183] 57479 total pagecache pages
[511466.767184] 31209 pages in swap cache
[511466.767186] Swap cache stats: add 215086, delete 183877, find 511/1459
[511466.767189] Free swap = 24343440kB
[511466.767190] Total swap = 25173812kB
[511466.767703] 6291440 pages RAM
[511466.767703] 107182 pages reserved
[511466.767703] 117792 pages shared
[511466.767703] 6100859 pages non-shared

$ lsmod
Module                  Size  Used by
nfsd                  244648  21
exportfs                8256  1 nfsd
rdma_ucm               13952  0
rdma_cm                26932  1 rdma_ucm
iw_cm                  11144  1 rdma_cm
ib_addr                 8648  1 rdma_cm
ib_ipoib               58364  0
ib_cm                  31256  2 rdma_cm,ib_ipoib
ib_sa                  33976  3 rdma_cm,ib_ipoib,ib_cm
ipv6                  246136  79 ib_ipoib
ib_uverbs              37104  1 rdma_ucm
ib_umad                15320  4
mlx4_ib                53408  0
ib_mad                 33256  4 ib_cm,ib_sa,ib_umad,mlx4_ib
ib_core                44548  10 rdma_ucm,rdma_cm,iw_cm,ib_ipoib,ib_cm,ib_sa,ib_uverbs,ib_umad,mlx4_ib,ib_mad
mlx4_core              76772  1 mlx4_ib
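
One way to read the Mem-Info part of the dmesg output is to compare each zone's "free" value against its "min" watermark. The sketch below does only that, using the four zone summary lines from the log (abridged to the free/min/low/high fields, timestamps dropped). The function name zones_below_min and the regular expression are illustrative choices, not part of any kernel tool. For this log it flags both Normal zones, which is consistent with an order:0 allocation made from interrupt context (the <IRQ> part of the trace) failing because it cannot wait for memory to be reclaimed. It does not by itself say whether the igb receive path recovers afterwards.

```python
import re

# Zone summary lines copied from the dmesg output in this comment,
# abridged to the watermark fields (timestamps and other fields dropped).
ZONE_LINES = """\
Node 0 DMA free:7772kB min:4kB low:4kB high:4kB
Node 0 DMA32 free:37172kB min:2460kB low:3072kB high:3688kB
Node 0 Normal free:2648kB min:7476kB low:9344kB high:11212kB
Node 1 Normal free:3476kB min:9968kB low:12460kB high:14952kB
"""

# Hypothetical pattern for this log format: node number, zone name,
# then the free and min values in kB.
ZONE_RE = re.compile(r"Node (\d+) (\w+) free:(\d+)kB min:(\d+)kB")


def zones_below_min(text):
    """Return (node, zone, free_kb, min_kb) for each zone with free < min."""
    hits = []
    for m in ZONE_RE.finditer(text):
        node = int(m.group(1))
        zone = m.group(2)
        free_kb = int(m.group(3))
        min_kb = int(m.group(4))
        if free_kb < min_kb:
            hits.append((node, zone, free_kb, min_kb))
    return hits


if __name__ == "__main__":
    for node, zone, free_kb, min_kb in zones_below_min(ZONE_LINES):
        print(f"Node {node} {zone}: free {free_kb} kB < min {min_kb} kB")
```

The same function should also work on the unabridged zone lines, since the pattern only looks at the leading "Node N ZONE free:...kB min:...kB" fields and ignores the rest of each line.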