Bug 10921

Summary: Repeated page allocation failures until complete machine lockup
Product: Memory Management Reporter: Tom Söderlund (t-om)
Component: Page AllocatorAssignee: Andrew Morton (akpm)
Status: CLOSED UNREPRODUCIBLE    
Severity: high CC: bunk, yi.zhu
Priority: P1    
Hardware: All   
OS: Linux   
Kernel Version: 2.6.26-rc6-git2 Subsystem:
Regression: Yes Bisected commit-id:
Bug Depends on:    
Bug Blocks: 10492    
Attachments: Hardware info
Software info
/var/log/messages
/var/log/kern.log

Description Tom Söderlund 2008-06-15 10:03:05 UTC
Latest working kernel version: 2.6.26-rc5-git6 (maybe)
Earliest failing kernel version: 2.6.26-rc6-git2 (problem encountered here)
Distribution: Ubuntu 8.04
Hardware Environment:
Lenovo 3000 Y200 (6469-22R)
(more info in attachments)
Software Environment:
Linux version 2.6.26-rc6-git2-puuh (t-om@puuh) (gcc version 4.2.3 (Ubuntu 4.2.3-2ubuntu7)) #1 SMP PREEMPT Sat Jun 14 23:56:01 MYT 2008
(more info in attachments)
Problem Description:
Jun 15 22:49:34 puuh kernel: ACPI: PCI Interrupt 0000:01:00.0[A] -> GSI 16 (level, low) -> IRQ 16
Jun 15 22:49:34 puuh kernel: NetworkManager: page allocation failure. order:4, mode:0x40d0
Jun 15 22:49:34 puuh kernel: Pid: 14742, comm: NetworkManager Not tainted 2.6.26-rc6-git2-puuh #1
Jun 15 22:49:34 puuh kernel: 
Jun 15 22:49:34 puuh kernel: Call Trace:
Jun 15 22:49:34 puuh kernel:  [__alloc_pages_internal+0x38f/0x480] __alloc_pages_internal+0x38f/0x480
Jun 15 22:49:34 puuh kernel:  [fuse:__get_free_pages+0x15/0x7e0] __get_free_pages+0x15/0x60
Jun 15 22:49:34 puuh kernel:  [iwl3945:iwl3945_tx_queue_init+0x94/0x1d0] :iwl3945:iwl3945_tx_queue_init+0x94/0x1d0
Jun 15 22:49:34 puuh kernel:  [iwl3945:iwl3945_txq_ctx_reset+0x372/0x580] :iwl3945:iwl3945_txq_ctx_reset+0x372/0x580
Jun 15 22:49:34 puuh kernel:  [iwl3945:iwl3945_hw_nic_init+0xe7f/0x1360] :iwl3945:iwl3945_hw_nic_init+0xe7f/0x1360
Jun 15 22:49:34 puuh kernel:  [iwl3945:__iwl3945_up+0x135/0xe30] :iwl3945:__iwl3945_up+0x135/0xe30
Jun 15 22:49:34 puuh kernel:  [iwl3945:iwl3945_isr+0x0/0x380] :iwl3945:iwl3945_isr+0x0/0x380
Jun 15 22:49:34 puuh kernel:  [firewire_ohci:request_irq+0xd5/0x9f40] request_irq+0xd5/0xf0
Jun 15 22:49:34 puuh kernel:  [iwl3945:iwl3945_mac_start+0x940/0xdb0] :iwl3945:iwl3945_mac_start+0x940/0xdb0
Jun 15 22:49:34 puuh kernel:  [source_load+0x34/0x70] source_load+0x34/0x70
Jun 15 22:49:34 puuh kernel:  [acpi_cpufreq:__next_cpu+0x1a/0x30] __next_cpu+0x1a/0x30
Jun 15 22:49:34 puuh kernel:  [find_busiest_group+0x1da/0x820] find_busiest_group+0x1da/0x820
Jun 15 22:49:34 puuh kernel:  [mac80211:ieee80211_open+0x1a0/0x5c0] :mac80211:ieee80211_open+0x1a0/0x5c0
Jun 15 22:49:34 puuh kernel:  [af_packet:local_bh_disable+0xe/0x20] local_bh_disable+0xe/0x20
Jun 15 22:49:34 puuh kernel:  [mac80211:dev_open+0x59/0x710] dev_open+0x59/0xb0
Jun 15 22:49:34 puuh kernel:  [dev_change_flags+0x9f/0x1d0] dev_change_flags+0x9f/0x1d0
Jun 15 22:49:34 puuh kernel:  [do_setlink+0x304/0x3f0] do_setlink+0x304/0x3f0
Jun 15 22:49:34 puuh kernel:  [cfg80211:nla_parse+0x2f/0xb10] nla_parse+0x2f/0xe0
Jun 15 22:49:34 puuh kernel:  [rtnl_setlink+0xe1/0x130] rtnl_setlink+0xe1/0x130
Jun 15 22:49:34 puuh kernel:  [rtnetlink_rcv_msg+0x0/0x240] rtnetlink_rcv_msg+0x0/0x240
Jun 15 22:49:34 puuh kernel:  [netlink_rcv_skb+0x6c/0x90] netlink_rcv_skb+0x6c/0x90
Jun 15 22:49:34 puuh kernel:  [rtnetlink_rcv+0x18/0x20] rtnetlink_rcv+0x18/0x20
Jun 15 22:49:34 puuh kernel:  [cfg80211:netlink_unicast+0x298/0x1ec0] netlink_unicast+0x298/0x2d0
Jun 15 22:49:34 puuh kernel:  [netlink_sendmsg+0x1fd/0x2b0] netlink_sendmsg+0x1fd/0x2b0
Jun 15 22:49:34 puuh kernel:  [sock_sendmsg+0xcb/0x100] sock_sendmsg+0xcb/0x100
Jun 15 22:49:34 puuh kernel:  [<ffffffff8024a3a0>] autoremove_wake_function+0x0/0x30
Jun 15 22:49:34 puuh kernel:  [__link_path_walk+0xd75/0xf40] __link_path_walk+0xd75/0xf40
Jun 15 22:49:34 puuh kernel:  [enqueue_hrtimer+0x6e/0x100] enqueue_hrtimer+0x6e/0x100
Jun 15 22:49:34 puuh kernel:  [fget_light+0xbd/0xf0] fget_light+0xbd/0xf0
Jun 15 22:49:34 puuh kernel:  [move_addr_to_kernel+0x2e/0x40] move_addr_to_kernel+0x2e/0x40
Jun 15 22:49:34 puuh kernel:  [verify_iovec+0x2f/0xa0] verify_iovec+0x2f/0xa0
Jun 15 22:49:34 puuh kernel:  [sys_sendmsg+0x20e/0x360] sys_sendmsg+0x20e/0x360
Jun 15 22:49:34 puuh kernel:  [snd_hda_intel:_spin_unlock_irqrestore+0x12/0x40] _spin_unlock_irqrestore+0x12/0x40
Jun 15 22:49:34 puuh kernel:  [netlink_insert+0x9a/0x160] netlink_insert+0x9a/0x160
Jun 15 22:49:34 puuh kernel:  [fget_light+0xbd/0xf0] fget_light+0xbd/0xf0
Jun 15 22:49:34 puuh kernel:  [move_addr_to_user+0x5d/0x70] move_addr_to_user+0x5d/0x70
Jun 15 22:49:34 puuh kernel:  [sys_getsockname+0xd9/0xe0] sys_getsockname+0xd9/0xe0
Jun 15 22:49:34 puuh kernel:  [system_call_after_swapgs+0x7b/0x80] system_call_after_swapgs+0x7b/0x80
Jun 15 22:49:34 puuh kernel: 
Jun 15 22:49:34 puuh kernel: Mem-info:
Jun 15 22:49:34 puuh kernel: DMA per-cpu:
Jun 15 22:49:34 puuh kernel: CPU    0: hi:    0, btch:   1 usd:   0
Jun 15 22:49:34 puuh kernel: CPU    1: hi:    0, btch:   1 usd:   0
Jun 15 22:49:34 puuh kernel: DMA32 per-cpu:
Jun 15 22:49:34 puuh kernel: CPU    0: hi:  186, btch:  31 usd:   0
Jun 15 22:49:34 puuh kernel: CPU    1: hi:  186, btch:  31 usd:   0
Jun 15 22:49:34 puuh kernel: Active:169972 inactive:26410 dirty:280 writeback:0 unstable:0
Jun 15 22:49:34 puuh kernel:  free:5323 slab:37217 mapped:20025 pagetables:4871 bounce:0
Jun 15 22:49:34 puuh kernel: DMA free:3984kB min:40kB low:48kB high:60kB active:4kB inactive:0kB present:11044kB pages_scanned:0 all_unreclaimable? no
Jun 15 22:49:34 puuh kernel: lowmem_reserve[]: 0 984 984 984
Jun 15 22:49:34 puuh kernel: DMA32 free:17308kB min:3992kB low:4988kB high:5988kB active:679884kB inactive:105640kB present:1008352kB pages_scanned:0 all_unreclaimable? no
Jun 15 22:49:34 puuh kernel: lowmem_reserve[]: 0 0 0 0
Jun 15 22:49:34 puuh kernel: DMA: 24*4kB 24*8kB 7*16kB 0*32kB 0*64kB 0*128kB 0*256kB 1*512kB 1*1024kB 1*2048kB 0*4096kB = 3984kB
Jun 15 22:49:34 puuh kernel: DMA32: 3863*4kB 90*8kB 22*16kB 4*32kB 3*64kB 2*128kB 1*256kB 0*512kB 0*1024kB 0*2048kB 0*4096kB = 17356kB
Jun 15 22:49:34 puuh kernel: 64999 total pagecache pages
Jun 15 22:49:34 puuh kernel: Swap cache: add 41222, delete 27361, find 19238/22059
Jun 15 22:49:34 puuh kernel: Free swap  = 1918280kB
Jun 15 22:49:34 puuh kernel: Total swap = 2000052kB
Jun 15 22:49:34 puuh kernel: 259698 pages of RAM
Jun 15 22:49:34 puuh kernel: 5172 reserved pages
Jun 15 22:49:34 puuh kernel: 107775 pages shared
Jun 15 22:49:34 puuh kernel: 13861 pages swap cached
Jun 15 22:49:43 puuh kernel: ACPI: PCI Interrupt 0000:01:00.0[A] -> GSI 16 (level, low) -> IRQ 16
(more info in attachments)

Steps to reproduce:
Still working on it. More detailed information later.
Comment 1 Tom Söderlund 2008-06-15 10:05:07 UTC
Created attachment 16488 [details]
Hardware info
Comment 2 Tom Söderlund 2008-06-15 10:05:52 UTC
Created attachment 16489 [details]
Software info
Comment 3 Tom Söderlund 2008-06-15 10:13:05 UTC
Created attachment 16490 [details]
/var/log/messages

From boot until what was saved before lockup.
Comment 4 Tom Söderlund 2008-06-15 10:26:24 UTC
Created attachment 16491 [details]
/var/log/kern.log

From boot until what was saved before lockup.
Comment 5 Rafael J. Wysocki 2008-06-15 10:32:08 UTC
This entry is being used for tracking a regression from 2.6.25.  Please don't
close it until the problem is fixed in the mainline.
Comment 6 Tom Söderlund 2008-06-15 22:05:43 UTC
Er.. just remembered that the kernel was not completely vanilla mainline; there was this patch from teknohog to modify acpi-cpufreq into a phc capable one for current kernels [0] applied and the voltages were set to "10 1" from the default "25 19" (due to apparently buggy ACPI BIOS there are only two very low frequency possibilities). These settings have not caused any trouble for a pretty long sequence of kernels.

[0] http://koti.tnnet.fi/teknohog/hacks/linux-phc-kernel-vanilla-2.6.25-teknohog.patch

It would have been very natural that the encountered problems were due to too low voltage settings that would have somehow been triggered in the most recent kernel. However, stress testing using two burnP6s and stress -vm 2 with voltage table settings stepped down over one hour time to even lower "0 0" did not bring up any similar page allocation or other problems.
Comment 7 Tom Söderlund 2008-06-16 03:57:30 UTC
Sigh, the previous stress test was done with the 2.6.26-rc5-git6 kernel and not on the problematic rc6-git2. Anyhow now it is more likely that rc5-git6 indeed is more stable one.

Managed to trigger the problem two times again in rc6-git2 through similar normal use as on the first time. Investigating..

(As a side track, there is something odd going on in ACPI thermal probes. Sometimes on both rc5-git6 and rc6-git2 acpi -t shows 144°C while the computer feels normal temperature wise. In Ubuntu's precompiled kernel (2.6.24-19) acpi -t shows similarly odd 1°C. Normal values are around 50°C. This is correctly shown by the coretemp module. Mentioning this since the page-allocation related problem could also be because of overheating but this does not seem to the case.)
Comment 8 Andrew Morton 2008-06-16 21:34:24 UTC
Please raise a separate report against acp for the thermal problems?
Comment 9 Tom Söderlund 2008-06-18 05:40:32 UTC
OK. The ACPI thermal reporting problem is in #10934 [0]. Have not seen those thermal reporting problems since though.

[0] http://bugzilla.kernel.org/show_bug.cgi?id=10934

Could not reproduce the page allocation related problem in vanilla rc6-git3 (without phc patch) but tested without power saving optimizations in /etc/rc.local.

Currently testing vanilla rc6-git5 (without phc patch) with power saving optimizations in /etc/rc.local which are as follows:

-- 

echo 1 > /sys/devices/system/cpu/sched_mc_power_savings
echo 1500 > /proc/sys/vm/dirty_writeback_centisecs
echo 5 > /proc/sys/vm/laptop_mode

for f in /sys/bus/usb/devices/*/power/autosuspend;
     do if [ -f "$f" ]; then echo 0 > $f; fi; done
for f in /sys/bus/usb/devices/*/power/level;
     do if [ -f "$f" ]; then echo "auto" > $f; fi; done

[ -d /sys/bus/pci/drivers/iwl3945 ] && echo 5 > /sys/bus/pci/drivers/iwl3945/0000:01:00.0/power_level
[ -d /sys/module/snd_hda_intel ] && echo 10 > /sys/module/snd_hda_intel/parameters/power_save

[ ! -z "`hciconfig`" ] &&  hciconfig hci0 down
modprobe -Qr hci_usb firewire-ohci firewire-sbp2 b44

--

One hour gone and no similar problems have occurred.
Comment 10 Tom Söderlund 2008-06-18 09:47:57 UTC
Five hours gone, running vanilla 2.6.26-rc6-git5 with above power consumption optimizations. No similar problems have occurred. 

Tomorrow will test with the phc patch to see if the low voltage settings were the culprit after all.
Comment 11 Tom Söderlund 2008-06-19 21:36:42 UTC
Two sessions with 2.6.26-rc6-git6 with above power consumption optimizations and phc patch with voltage table settings at "0 0" gone without similar page allocation related problems even with stress tests alongside with normal use.

I am inclined to feel that the cause for the problem has either been fixed or I am unable to reproduce the exact conditions for it to occur again.

(As a side track, got a new kind of firmware error from iwl3945 but it was able to reset back to normal. It is now documented at #1686 of vendor's bugzilla [0].)

[0] http://www.intellinuxwireless.org/bugzilla/show_bug.cgi?id=1686
Comment 12 Tom Söderlund 2008-06-22 09:52:48 UTC
Unable to reproduce, closing.