Bug 65671 - Asus A8V deluxe(AMD): hibernation - IO errors writing to swap, crashing apps, 3.11.8 worked, unless rmmod powernow_k8
Status: NEEDINFO
Alias: None
Product: Power Management
Classification: Unclassified
Component: cpufreq
Hardware: x86-64 Linux
Importance: P1 normal
Assignee: cpufreq
URL:
Keywords:
Depends on:
Blocks:
 
Reported: 2013-11-24 22:38 UTC by higuita
Modified: 2014-09-09 15:03 UTC
CC List: 4 users

See Also:
Kernel Version: 3.12
Subsystem:
Regression: No
Bisected commit-id:


Description higuita 2013-11-24 22:38:32 UTC
After upgrading the kernel to 3.12, I found that hibernation stopped working. Most of the time it gives IO errors writing to swap during hibernation, and when it does manage to write the image, all apps crash after wakeup and the filesystem is remounted read-only.

As hibernation worked fine on 3.11.8, I did a bisect and found that the problem shows up with this commit:

commit 4a511de96d692f2dfa126c10dda4e41636c0ef27
Author: Mark Brown <broonie@linaro.org>
Date:   Tue Aug 13 14:58:24 2013 +0200

    cpufreq: cpufreq-cpu0: NULL is a valid regulator
    
    Since NULL could in theory be a valid regulator we ought to check for
    IS_ERR() rather than for NULL. In practice this is unlikely to be an
    issue but it's better for neatness.
    
    Signed-off-by: Mark Brown <broonie@linaro.org>
    Acked-by: Viresh Kumar <viresh.kumar@linaro.org>
    Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>

:040000 040000 3e8d03d2dc170492ba9f0311ef9156344452f6de fb5d9ad67f332dec478a68cbd9387b15381584ce M      drivers

Kernel 3.12.1 still has the problem.

I'm using Slackware64 14.1 with gcc 4.8.2.

Thanks
Comment 1 Rafael J. Wysocki 2013-11-24 23:49:55 UTC
If you revert that commit from 3.12.1, does the problem go away?
Comment 2 higuita 2013-11-25 02:49:41 UTC
As asked, I reverted the commit on 3.12.1 and tested; the crashes after wakeup are still there. So it looks like that commit is not the only one that broke this.

Any tips on how to debug/bisect this now?
Comment 3 Rafael J. Wysocki 2013-11-25 11:16:56 UTC
On Monday, November 25, 2013 02:49:41 AM you wrote:
> https://bugzilla.kernel.org/show_bug.cgi?id=65671
> 
> --- Comment #2 from higuita <higuita@gmx.net> ---
> As asked, i tried to revert the commit on 3.12.1 and test and the crashes after
> the wakeup are still there. So looks like that commit is not the only one that
> broke this.
> 
> Any tip how to debug/bisect this now?

Can you check bug #64841, please?
Comment 4 higuita 2013-11-25 17:08:50 UTC
I don't have an Intel card; I have an old ATI HD2600 AGP on an old Asus A8V motherboard. Here is the lspci output:

00:00.0 Host bridge: VIA Technologies, Inc. K8T800Pro Host Bridge
00:00.1 Host bridge: VIA Technologies, Inc. K8T800Pro Host Bridge
00:00.2 Host bridge: VIA Technologies, Inc. K8T800Pro Host Bridge
00:00.3 Host bridge: VIA Technologies, Inc. K8T800Pro Host Bridge
00:00.4 Host bridge: VIA Technologies, Inc. K8T800Pro Host Bridge
00:00.7 Host bridge: VIA Technologies, Inc. K8T800Pro Host Bridge
00:01.0 PCI bridge: VIA Technologies, Inc. VT8237/8251 PCI bridge [K8M890/K8T800/K8T890 South]
00:07.0 FireWire (IEEE 1394): VIA Technologies, Inc. VT6306/7/8 [Fire II(M)] IEEE 1394 OHCI Controller (rev 80)
00:08.0 RAID bus controller: Promise Technology, Inc. PDC20378 (FastTrak 378/SATA 378) (rev 02)
00:0a.0 Ethernet controller: Marvell Technology Group Ltd. 88E8001 Gigabit Ethernet Controller (rev 13)
00:0e.0 Multimedia video controller: Brooktree Corporation Bt878 Video Capture (rev 11)
00:0e.1 Multimedia controller: Brooktree Corporation Bt878 Audio Capture (rev 11)
00:0f.0 RAID bus controller: VIA Technologies, Inc. VIA VT6420 SATA RAID Controller (rev 80)
00:0f.1 IDE interface: VIA Technologies, Inc. VT82C586A/B/VT82C686/A/B/VT823x/A/C PIPC Bus Master IDE (rev 06)
00:10.0 USB controller: VIA Technologies, Inc. VT82xxxxx UHCI USB 1.1 Controller (rev 81)
00:10.1 USB controller: VIA Technologies, Inc. VT82xxxxx UHCI USB 1.1 Controller (rev 81)
00:10.2 USB controller: VIA Technologies, Inc. VT82xxxxx UHCI USB 1.1 Controller (rev 81)
00:10.3 USB controller: VIA Technologies, Inc. VT82xxxxx UHCI USB 1.1 Controller (rev 81)
00:10.4 USB controller: VIA Technologies, Inc. USB 2.0 (rev 86)
00:11.0 ISA bridge: VIA Technologies, Inc. VT8237 ISA bridge [KT600/K8T800/K8T890 South]
00:11.5 Multimedia audio controller: VIA Technologies, Inc. VT8233/A/8235/8237 AC97 Audio Controller (rev 60)
00:11.6 Communication controller: VIA Technologies, Inc. AC'97 Modem Controller (rev 80)
00:18.0 Host bridge: AMD [Advanced Micro Devices, Inc.] K8 [Athlon64/Opteron] HyperTransport Technology Configuration
00:18.1 Host bridge: AMD [Advanced Micro Devices, Inc.] K8 [Athlon64/Opteron] Address Map
00:18.2 Host bridge: AMD [Advanced Micro Devices, Inc.] K8 [Athlon64/Opteron] DRAM Controller
00:18.3 Host bridge: AMD [Advanced Micro Devices, Inc.] K8 [Athlon64/Opteron] Miscellaneous Control
01:00.0 VGA compatible controller: AMD/ATI [Advanced Micro Devices, Inc.] RV630 XT [Radeon HD 2600 XT AGP]


But that bug asks this:
"I primarily meant outside of gfx domain. It would be likely to see fs corruption for example. Depending on your system, it might be unlikely to see this corruption without using various memory debugging tools. My fear is we've traded silent memory corruption for hangs."

In the past, I had some FS corruption after several hibernation cycles... could it be that I have this kind of silent memory corruption, and that the patch triggers a problem that was previously hidden?

When I wake up the machine, I get this:

[ 5137.948087] Uhhuh. NMI received for unknown reason 2c on CPU 0.
[ 5137.950056] Do you have a strange power saving mode enabled?
[ 5137.995447] Dazed and confused, but trying to continue

Could this corrupt some part of the memory?

I recently ran memtest86+ for several hours and didn't get any errors. Also, if I don't hibernate the machine, it works fine for several months without any problem.

Is there any tool to do some sort of memory checksum before and after the hibernation, to try to catch any (bad) change?
Comment 5 higuita 2013-12-01 18:08:40 UTC
ok, more info.

I spent half of the weekend trying to debug this with kernel 3.12.1, and this is what I found:

In the BIOS, if I disable both "V-Link 8X Support" and "AGP Fast write", things get more stable and I only have problems if I hibernate while the machine is under load (I'm using glxgears for that).

If I disable "ACPI APIC", I have no problem at all; the machine is stable over many hibernation cycles, even under heavy load... sadly, I lose the second CPU without APIC.

So I think this points to something related to the AGP bus.

With APIC, "V-Link" and "AGP fast write" all turned on, I get the problem even on the console, without X11 loaded, so maybe it is something used by modesetting.

I will try to disable it and do some more tests.

Again, I only have problems with hibernation; if I don't hibernate, everything looks stable for several weeks/months.
Comment 6 higuita 2013-12-02 02:11:53 UTC
With modeset disabled, it still crashes after hibernating (but only on the second attempt to hibernate).

I also tried switching the mode from "platform" to "shutdown", but it still crashes.
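
For reference, the "platform" and "shutdown" modes mentioned above are normally selected through /sys/power/disk, where the active mode is shown in square brackets. A minimal sketch of reading and changing it follows; the function names and the PM_DISK override are hypothetical, added only so the helpers can be tried on a scratch file, and writing to the real file needs root.

```shell
# PM_DISK is a hypothetical override; by default it points at the real sysfs file.
PM_DISK=${PM_DISK:-/sys/power/disk}

# Print the currently selected hibernation mode.
# The file lists all modes and marks the active one with brackets,
# e.g. "[platform] shutdown reboot".
current_hibernate_mode() {
    sed -n 's/.*\[\([a-z_]*\)\].*/\1/p' "$PM_DISK"
}

# Select a hibernation mode, e.g. "set_hibernate_mode shutdown".
set_hibernate_mode() {
    printf '%s\n' "$1" > "$PM_DISK"
}
```

Calling current_hibernate_mode before and after set_hibernate_mode is a quick way to confirm that the switch actually took effect before the next hibernation attempt.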
Comment 7 Len Brown 2013-12-03 00:57:48 UTC
can you disable CONFIG_CPU_FREQ in your build of the latest kernel
to see if that helps?
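
A rough sketch of how that option could be checked and switched off, assuming a kernel source tree with an existing .config. The helper name cpufreq_enabled is hypothetical; scripts/config and "make olddefconfig" are the in-tree tools, shown here only as comments because they must be run from the top of the source tree.

```shell
# Succeeds if the given kernel .config (default: ./.config) has
# CONFIG_CPU_FREQ enabled. CPU_FREQ is a boolean option, so only "=y" counts.
cpufreq_enabled() {
    grep -q '^CONFIG_CPU_FREQ=y$' "${1:-.config}"
}

# To disable it, from the top of the kernel source tree (not executed here):
#   scripts/config --disable CPU_FREQ
#   make olddefconfig     # resolve dependent options with their defaults
```

After rebuilding, "cpufreq_enabled .config || echo disabled" gives a quick confirmation that the option really is off in the tested build.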
Comment 8 higuita 2013-12-05 23:37:07 UTC
BINGO!

OK, I just recompiled 3.12.3 without CONFIG_CPU_FREQ and I can hibernate without any problem, even under heavy load, for several hibernate-resume cycles.

Even with AGP fast write and V-Link 8X re-enabled, hibernation works fine.

Using the same kernel WITH cpufreq enabled, it usually fails to resume on the first try under heavy load.
Comment 9 higuita 2013-12-06 01:59:27 UTC
After another recompile and more tests, I found that the problem is in the "conservative" governor. Using the "ondemand" governor I found no problem over several hibernate-resume cycles... when I switch to the conservative governor and try to hibernate for the second time, I get IO/swap problems.

I usually use the conservative governor, which is why I'm hitting this problem.
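
For anyone repeating this comparison, the active governor can be read and changed per CPU through the cpufreq sysfs files. The sketch below assumes the usual /sys/devices/system/cpu layout; the function names and the CPUFREQ_ROOT override are hypothetical, and writing to the real files needs root.

```shell
# CPUFREQ_ROOT is a hypothetical override so the helpers can be tried
# against a scratch directory instead of the real sysfs tree.
CPUFREQ_ROOT=${CPUFREQ_ROOT:-/sys/devices/system/cpu}

# Print "cpuN: governor" for every CPU that exposes cpufreq.
show_governors() {
    for f in "$CPUFREQ_ROOT"/cpu[0-9]*/cpufreq/scaling_governor; do
        [ -r "$f" ] || continue
        cpu=${f#"$CPUFREQ_ROOT"/}   # strip the root prefix
        cpu=${cpu%%/*}              # keep only the "cpuN" component
        printf '%s: %s\n' "$cpu" "$(cat "$f")"
    done
}

# Write the given governor name to every CPU, e.g. "set_governor ondemand".
set_governor() {
    for f in "$CPUFREQ_ROOT"/cpu[0-9]*/cpufreq/scaling_governor; do
        [ -w "$f" ] || continue
        printf '%s\n' "$1" > "$f"
    done
}
```

Running show_governors right before each hibernation attempt makes it easier to be sure which governor was really active on both cores when a failure happens.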
Comment 10 higuita 2013-12-19 11:56:07 UTC
I've been using the ondemand governor with no problems at all. So it's really something that is broken in the conservative governor... maybe not all of the changes made to ondemand were applied to the conservative one?
Comment 11 higuita 2013-12-27 11:42:24 UTC
This problem may be much older than kernel 3.12... I've been having sporadic problems with randomly corrupted filesystems since about June. All summer I assumed that a bad HD (which started to report bad sectors and is mostly unused; it is actually unmounted and suspended on boot) was causing problems on other SATA channels and corrupting other HD partitions.

Since I switched to the performance governor, I have had no FS corruption, so it looks like the (unused) HD was not the problem, but again some bug in the ondemand governor. If true, this could be a lot older, and recent changes made it much more frequent and reproducible.

Sadly, the old FS corruption was hard to trigger, so it's not easy to pinpoint when it started to show up.
Comment 12 higuita 2014-02-24 03:17:21 UTC
OK, I have a little more information:

I found that ondemand still causes problems, and after the upgrade to 3.13.3 it also locks up, just like the conservative governor. I then changed my hibernate script to include this before the hibernation:

cpufreq-set -g  performance
rmmod powernow_k8

and after the hibernation I revert that:

modprobe powernow_k8
cpufreq-set -g  ondemand

All seems to be working fine: I'm getting higher uptimes, many hibernation cycles without problems, and zero filesystem corruption.

I have seen this in dmesg after waking from hibernation:

[ 8379.324747] [Firmware Bug]: powernow-k8: No compatible ACPI _PSS objects found.
[ 8379.324747] [Firmware Bug]: powernow-k8: First, make sure Cool'N'Quiet is enabled in the BIOS.
[ 8379.324747] [Firmware Bug]: powernow-k8: If that doesn't help, try upgrading your BIOS.
[ 8379.324858] powernow-k8: Found 1 Dual Core AMD Opteron(tm) Processor 175 (2 cpu cores) (version 2.20.00)

This motherboard (Asus A8V deluxe) doesn't have ACPI cpufreq, but Cool'n'Quiet is enabled in the BIOS.
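
The workaround described in this comment can be sketched as a small set of shell functions. This is only a sketch under assumptions: the function names and the RUN dry-run variable are hypothetical, and triggering hibernation through /sys/power/state is an assumption, not a detail of the actual hibernate script used here.

```shell
# RUN is a hypothetical dry-run hook: leave it empty to execute the commands,
# or set RUN=echo to only print them.
RUN=${RUN:-}

# Steps run before hibernating, as listed in the comment.
pre_hibernate() {
    $RUN cpufreq-set -g performance   # stop frequency scaling first
    $RUN rmmod powernow_k8            # then unload the cpufreq driver
}

# Steps run after resume, reverting the above.
post_hibernate() {
    $RUN modprobe powernow_k8         # reload the driver
    $RUN cpufreq-set -g ondemand      # restore the usual governor
}

hibernate_with_workaround() {
    pre_hibernate || return 1
    # Assumption: hibernation is started by writing "disk" to /sys/power/state.
    $RUN sh -c 'echo disk > /sys/power/state'
    post_hibernate
}
```

With RUN=echo, calling hibernate_with_workaround prints the command sequence without touching the system, which is a safe way to check the ordering before wiring it into a real hibernate script.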
Comment 13 Len Brown 2014-03-25 00:25:57 UTC
> No compatible ACPI _PSS objects found.

Whelp, at least we know it isn't an ACPI bug:-)

This seems to be specific to powernow_k8 -- is it possible to bisect
only the changes within that driver?
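
A path-limited bisect of that kind could look roughly like the sketch below. The assumptions are labeled: v3.11 as the "good" and v3.12 as the "bad" tag are taken from the original report and may need adjusting, the driver is assumed to live in drivers/cpufreq/powernow-k8.c, and the function name and GIT override are hypothetical.

```shell
# GIT is a hypothetical override; set GIT=echo to print the command
# instead of running it.
GIT=${GIT:-git}

# Start a bisect that only considers commits touching the powernow-k8 driver.
# Usage: bisect_powernow_k8 [good-rev] [bad-rev]
bisect_powernow_k8() {
    good=${1:-v3.11}   # assumed good, from the report
    bad=${2:-v3.12}    # assumed bad, from the report
    # Paths after "--" restrict the bisect to commits that change them.
    $GIT bisect start "$bad" "$good" -- drivers/cpufreq/powernow-k8.c
}
```

After each test build, "git bisect good" or "git bisect bad" records the result as usual. If the culprit turns out to be in the cpufreq core and not the driver, the path list would have to be widened, for example to drivers/cpufreq.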
Comment 14 Rafael J. Wysocki 2014-04-30 21:07:40 UTC
Any chance to test that with 3.15-rc3?
Comment 15 Rafael J. Wysocki 2014-04-30 21:10:09 UTC
I'm clearing the regression flag on this anyway, since it is not clear whether or not it is a regression.
Comment 16 higuita 2014-05-09 01:59:46 UTC
I compiled kernel 3.15-rc4, disabled my workarounds, and tried to trigger the hibernation problems with more than 15 hibernation cycles (under various loads), but could not.

I had also tried something similar on kernel 3.14.1 and didn't have any problem either.

So either the problem was fixed, or the hidden corruption changed its target. I still get (maybe once a month) some random ext4 corruption, but even that has improved.

Maybe because of the hibernation tests: after those tests I rebooted and got this ext4 error a few minutes later, on the /home partition:

[  367.200228] EXT4-fs error (device sda8): ext4_mb_generate_buddy:756: group 513, 3097 clusters in bitmap, 3081 in gd; block bitmap corrupt.
[  367.218897] JBD2: Spotted dirty metadata buffer (dev = sda8, blocknr = 0). There's a risk of filesystem corruption in case of system crash.
[  367.285925] JBD2: Spotted dirty metadata buffer (dev = sda8, blocknr = 0). There's a risk of filesystem corruption in case of system crash.

I will try to grab the fsck output to check if it helps.
