Bug 10921
Summary: | Repeated page allocation failures until complete machine lockup | | |
---|---|---|---|
Product: | Memory Management | Reporter: | Tom Söderlund (t-om) |
Component: | Page Allocator | Assignee: | Andrew Morton (akpm) |
Status: | CLOSED UNREPRODUCIBLE | | |
Severity: | high | CC: | bunk, yi.zhu |
Priority: | P1 | | |
Hardware: | All | | |
OS: | Linux | | |
Kernel Version: | 2.6.26-rc6-git2 | Subsystem: | |
Regression: | Yes | Bisected commit-id: | |
Bug Depends on: | | | |
Bug Blocks: | 10492 | | |
Attachments: | Hardware info, Software info, /var/log/messages, /var/log/kern.log | | |
Description
Tom Söderlund
2008-06-15 10:03:05 UTC
Created attachment 16488
Hardware info
Created attachment 16489
Software info
Created attachment 16490
/var/log/messages
From boot until what was saved before lockup.
Created attachment 16491
/var/log/kern.log
From boot until what was saved before lockup.
This entry is being used for tracking a regression from 2.6.25. Please don't close it until the problem is fixed in the mainline.

Er.. just remembered that the kernel was not completely vanilla mainline: a patch from teknohog that modifies acpi-cpufreq into a phc-capable one for current kernels [0] was applied, and the voltages were set to "10 1" from the default "25 19" (due to an apparently buggy ACPI BIOS there are only two very low frequency possibilities). These settings have not caused any trouble for a pretty long sequence of kernels.
[0] http://koti.tnnet.fi/teknohog/hacks/linux-phc-kernel-vanilla-2.6.25-teknohog.patch

It would have been very natural for the encountered problems to be due to too low voltage settings that were somehow triggered in the most recent kernel. However, stress testing using two burnP6s and stress -vm 2, with the voltage table settings stepped down over one hour to the even lower "0 0", did not bring up any similar page allocation or other problems.

Sigh, the previous stress test was done with the 2.6.26-rc5-git6 kernel and not on the problematic rc6-git2. Anyhow, it now seems more likely that rc5-git6 is indeed the more stable one.

Managed to trigger the problem two more times in rc6-git2 through similar normal use as the first time. Investigating..

(As a side track, there is something odd going on in the ACPI thermal probes. Sometimes on both rc5-git6 and rc6-git2 acpi -t shows 144°C while the computer feels normal temperature-wise. In Ubuntu's precompiled kernel (2.6.24-19) acpi -t shows a similarly odd 1°C. Normal values are around 50°C, which the coretemp module reports correctly. Mentioning this since the page allocation problem could also be caused by overheating, but this does not seem to be the case.)

Please raise a separate report against ACPI for the thermal problems?

OK. The ACPI thermal reporting problem is in #10934 [0]. Have not seen those thermal reporting problems since, though.
[0] http://bugzilla.kernel.org/show_bug.cgi?id=10934

Could not reproduce the page allocation related problem in vanilla rc6-git3 (without phc patch) but tested without power saving optimizations in /etc/rc.local. Currently testing vanilla rc6-git5 (without phc patch) with power saving optimizations in /etc/rc.local which are as follows:
--
echo 1 > /sys/devices/system/cpu/sched_mc_power_savings
echo 1500 > /proc/sys/vm/dirty_writeback_centisecs
echo 5 > /proc/sys/vm/laptop_mode
for f in /sys/bus/usb/devices/*/power/autosuspend; do if [ -f "$f" ]; then echo 0 > $f; fi; done
for f in /sys/bus/usb/devices/*/power/level; do if [ -f "$f" ]; then echo "auto" > $f; fi; done
[ -d /sys/bus/pci/drivers/iwl3945 ] && echo 5 > /sys/bus/pci/drivers/iwl3945/0000:01:00.0/power_level
[ -d /sys/module/snd_hda_intel ] && echo 10 > /sys/module/snd_hda_intel/parameters/power_save
[ ! -z "`hciconfig`" ] && hciconfig hci0 down
modprobe -Qr hci_usb firewire-ohci firewire-sbp2 b44
--

One hour gone and no similar problems have occurred.

Five hours gone, running vanilla 2.6.26-rc6-git5 with above power consumption optimizations. No similar problems have occurred. Tomorrow will test with the phc patch to see if the low voltage settings were the culprit after all.

Two sessions with 2.6.26-rc6-git6 with above power consumption optimizations and phc patch with voltage table settings at "0 0" gone without similar page allocation related problems even with stress tests alongside with normal use. I am inclined to feel that the cause for the problem has either been fixed or I am unable to reproduce the exact conditions for it to occur again.

(As a side track, got a new kind of firmware error from iwl3945 but it was able to reset back to normal. It is now documented at #1686 of vendor's bugzilla [0].)
[0] http://www.intellinuxwireless.org/bugzilla/show_bug.cgi?id=1686

Unable to reproduce, closing.
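For anyone retracing the test sessions described in this report, statements such as "no similar problems have occurred" can be checked against a saved log rather than judged by eye. The following is a minimal sketch, assuming the kernel warning contains the phrase "page allocation failure" (as 2.6-era kernels print it); the function name count_alloc_failures is hypothetical and not part of the report.

```shell
#!/bin/sh
# Hypothetical helper (not from the report): count log lines that
# mention "page allocation failure" in a saved kernel log file.
count_alloc_failures() {
    log="$1"
    if [ ! -r "$log" ]; then
        echo "cannot read $log" >&2
        return 1
    fi
    # grep -c prints the number of matching lines. It exits with status 1
    # when that number is 0, so mask the status to keep "no failures"
    # a successful result.
    grep -c 'page allocation failure' "$log" || true
}
```

Run after each session, for example as `count_alloc_failures /var/log/kern.log`; a result of 0 would match the "no similar problems" observations above, while a growing count would indicate the problem has reappeared.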