Bug 10065 - 2.6.25-rc2 regression - hang on suspend
Status: CLOSED CODE_FIX
Product: Drivers
Classification: Unclassified
Component: PCI
Hardware: All Linux
Importance: P1 normal
Assigned To: Greg Kroah-Hartman
Depends on:
Blocks: 7216 9832
Reported: 2008-02-22 07:54 UTC by Rafael J. Wysocki
Modified: 2008-03-25 04:44 UTC

See Also:
Kernel Version: 2.6.25-rc2
Tree: Mainline
Regression: Yes


Attachments
debug patch (805 bytes, patch)
2008-03-11 20:08 UTC, Shaohua

Description Rafael J. Wysocki 2008-02-22 07:54:19 UTC
Subject         : 2.6.25-rc2 regression - hang on suspend
Submitter       : Soeren Sonnenburg <kernel@nn7.de>
Date            : 2008-02-19 12:59
References      : http://lkml.org/lkml/2008/2/19/165
Handled-By      :
Patch           :

This entry is being used for tracking a regression from 2.6.24.  Please don't
close it until the problem is fixed in the mainline.
Comment 1 Rafael J. Wysocki 2008-02-23 13:07:38 UTC
Can you please test with the patch from:

http://bugzilla.kernel.org/attachment.cgi?id=14961&action=view

applied?
Comment 2 Rafael J. Wysocki 2008-02-23 13:43:31 UTC
References : http://lkml.org/lkml/2008/2/17/381
Comment 3 Rafael J. Wysocki 2008-02-24 17:00:51 UTC
Handled-By : Rafael J. Wysocki <rjw@sisk.pl>
Comment 4 Tino Keitel 2008-02-25 14:19:34 UTC
I tested suspend/resume with 1a4c6be4aca5ad6b300932efed1e2729fdc25af9 without any patches. I suspended and resumed a few times without any failures.
Comment 5 Soeren Sonnenburg 2008-02-25 14:53:11 UTC
Well I definitely *can* suspend / resume a couple of times too (on rc3). However the regression I am seeing is that the display does not come back - but it used to with s2ram -f -p since about 2.6.21 ...

At some point, especially when switching back to the black console (from X), s2ram causes a reboot...
Comment 6 Rafael J. Wysocki 2008-03-02 17:29:44 UTC
I'm afraid it will be difficult to find out what's going on without bisecting. :-(

What kind of graphics adapter is there in your box?
Comment 7 Bruno Prémont 2008-03-10 17:25:44 UTC
I might be seeing the same bug; I am currently bisecting.

In the meantime, here is the BUG output from the kernel (console suspend disabled)
when doing
  echo mem > /sys/power/state:

[   63.936267] HDA Intel 0000:80:01.0: suspend
[  128.856308] BUG: soft lockup - CPU#0 stuck for 61s! [bash:2858]
[  128.856308]
[  128.856308] Pid: 2858, comm: bash Not tainted (2.6.24-06676-g215e871 #1)
[  128.856308] EIP: 0060:[<c0257ec9>] EFLAGS: 00000246 CPU: 0
[  128.856308] EIP is at pci_bus_read_config_word+0x49/0x60
[  128.856308] EAX: 00000000 EBX: 00000000 ECX: 0000007a EDX: 00000030
[  128.856308] ESI: 00000246 EDI: f7c5f000 EBP: f76a1e34 ESP: f76a1e1c
[  128.856308]  DS: 007b ES: 007b FS: 0000 GS: 0033 SS: 0068
[  128.856308] CR0: 8005003b CR2: 08657000 CR3: 3744c000 CR4: 00000690
[  128.856308] DR0: 00000000 DR1: 00000000 DR2: 00000000 DR3: 00000000
[  128.856308] DR6: ffff0ff0 DR7: 00000400
[  128.856308]  [<c025a372>] pcie_wait_pending_transaction+0x42/0x50
[  128.856308]  [<c025a3b7>] pci_disable_device+0x37/0xa0
[  128.856308]  [<f88a61cf>] azx_suspend+0x9f/0xd0 [snd_hda_intel]
[  128.856308]  [<c025c686>] pci_device_suspend+0x26/0x70
[  128.856308]  [<c02bca92>] suspend_device+0xe2/0x140
[  128.856308]  [<c02bcea1>] device_suspend+0x101/0x1d0
[  128.856308]  [<c01410a0>] suspend_devices_and_enter+0x40/0xf0
[  128.856308]  [<c0141260>] enter_state+0x110/0x1a0
[  128.856308]  [<c0141384>] state_store+0x94/0xd0
[  128.856308]  [<c01412f0>] ? state_store+0x0/0xd0
[  128.856308]  [<c024f344>] kobj_attr_store+0x24/0x30
[  128.856308]  [<c019db1d>] sysfs_write_file+0x9d/0x100
[  128.856308]  [<c0166405>] vfs_write+0x95/0x120
[  128.856308]  [<c019da80>] ? sysfs_write_file+0x0/0x100
[  128.856308]  [<c016698d>] sys_write+0x3d/0x70
[  128.856308]  [<c0103fd6>] sysenter_past_esp+0x5f/0x89
[  128.856308]  =======================

System: Commell LE-365D with Via Ester CPU and Via HD-Audio

I will add more information tomorrow when I reach the end of the bisect process.
Comment 8 Bruno Prémont 2008-03-11 09:35:23 UTC
OK, the bisect completed and gave me this:

4348a2dc49f9baecd34a9b0904245488c6189398 is first bad commit
commit 4348a2dc49f9baecd34a9b0904245488c6189398
Author: Shaohua Li <shaohua.li@intel.com>
Date:   Wed Oct 24 10:45:08 2007 +0800

    pcie: utilize pcie transaction pending bit

    PCIE has a mechanism to wait for Non-Posted request to complete. I think
    pci_disable_device is a good place to do this.

    Signed-off-by: Shaohua Li <shaohua.li@intel.com>
    Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>

:040000 040000 c0f236ecaaa3c7500a8ac21f5b9190b82044fab4 2e63763dc50214dcc17e18d88c66df37a57fdb6c M      drivers
:040000 040000 4cfa05abc15b62588ffc88f0ca723f5070617319 b76654f3a68ff13aca0e96368fd5852de44fea42 M      include


lspci for my host:
00:00.0 Host bridge [0600]: VIA Technologies, Inc. CX700 Host Bridge [1106:0324] (rev 03)
00:00.1 Host bridge [0600]: VIA Technologies, Inc. CX700 Host Bridge [1106:1324]
00:00.2 Host bridge [0600]: VIA Technologies, Inc. CX700 Host Bridge [1106:2324]
00:00.3 Host bridge [0600]: VIA Technologies, Inc. CX700 Host Bridge [1106:3324]
00:00.4 Host bridge [0600]: VIA Technologies, Inc. CX700 Host Bridge [1106:4324]
00:00.7 Host bridge [0600]: VIA Technologies, Inc. CX700 Host Bridge [1106:7324]
00:01.0 PCI bridge [0604]: VIA Technologies, Inc. VT8237 PCI Bridge [1106:b198]
00:0f.0 IDE interface [0101]: VIA Technologies, Inc. Unknown device [1106:0581]
00:10.0 USB Controller [0c03]: VIA Technologies, Inc. VT82xxxxx UHCI USB 1.1 Controller [1106:3038] (rev 90)
00:10.1 USB Controller [0c03]: VIA Technologies, Inc. VT82xxxxx UHCI USB 1.1 Controller [1106:3038] (rev 90)
00:10.2 USB Controller [0c03]: VIA Technologies, Inc. VT82xxxxx UHCI USB 1.1 Controller [1106:3038] (rev 90)
00:10.4 USB Controller [0c03]: VIA Technologies, Inc. USB 2.0 [1106:3104] (rev 90)
00:11.0 ISA bridge [0601]: VIA Technologies, Inc. CX700 PCI to ISA Bridge [1106:8324]
00:11.7 Host bridge [0600]: VIA Technologies, Inc. CX700 Internal Module Bus [1106:324e]
00:13.0 Host bridge [0600]: VIA Technologies, Inc. CX700 Host Bridge [1106:324b]
00:13.1 PCI bridge [0604]: VIA Technologies, Inc. CX700 PCI to PCI Bridge [1106:324a]
01:00.0 VGA compatible controller [0300]: VIA Technologies, Inc. CX700M2 UniChrome PRO II Graphics [1106:3157] (rev 03)
02:08.0 Ethernet controller [0200]: Realtek Semiconductor Co., Ltd. RTL-8169 Gigabit Ethernet [10ec:8169] (rev 10)
80:01.0 Audio device [0403]: VIA Technologies, Inc. VIA High Definition Audio Controller [1106:3288] (rev 10)

I will check the most recent Linus kernel without this commit and report back.
Comment 9 Bruno Prémont 2008-03-11 10:34:44 UTC
The most recent Linus kernel (commit 051a82fc0c450f6ca649acf684586477aa6d5c6a), with commit 4348a2dc49f9baecd34a9b0904245488c6189398 reverted, suspends fine.

It does not resume every time, though; that's another issue.
Comment 10 Rafael J. Wysocki 2008-03-11 14:47:06 UTC
Thanks a lot for bisecting it.  I've reassigned this entry to Greg, Shaohua CCed.
Comment 11 Shaohua 2008-03-11 20:08:23 UTC
Created attachment 15222 [details]
debug patch

It appears we should disable new requests first and then wait for outstanding requests to finish. Does the debug patch change anything?
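
The ordering suggested here (stop the device from issuing new requests, then wait for outstanding ones) can be sketched in user-space C. This is not the attached debug patch, whose contents are not reproduced in this thread. It assumes that "disable new request" means clearing the Bus Master Enable bit in the PCI Command register. `sim_dev`, `sim_read_word`, `sim_write_word` and `sim_quiesce` are hypothetical names that simulate config-space access; they are not kernel APIs. The register offsets and bit values follow the PCI and PCIe specifications.

```c
#include <stdbool.h>
#include <stdint.h>

#define CFG_COMMAND          0x04   /* PCI Command register */
#define CFG_COMMAND_MASTER   0x0004 /* Bus Master Enable bit */
#define CFG_EXP_DEVSTA       0x0a   /* Device Status, relative to the PCIe capability */
#define CFG_EXP_DEVSTA_TRPND 0x0020 /* Transactions Pending bit */

/* Hypothetical simulated device; stands in for real config space. */
struct sim_dev {
    int exp_cap;       /* offset of the PCIe capability (assumed value) */
    uint16_t command;  /* current contents of the Command register */
    int in_flight;     /* outstanding non-posted requests */
};

static uint16_t sim_read_word(struct sim_dev *d, int where)
{
    if (where == CFG_COMMAND)
        return d->command;
    if (where == d->exp_cap + CFG_EXP_DEVSTA) {
        /* Assumption of this model: requests only drain once bus
         * mastering is off; one request completes per status read. */
        if (!(d->command & CFG_COMMAND_MASTER) && d->in_flight > 0)
            d->in_flight--;
        return d->in_flight ? CFG_EXP_DEVSTA_TRPND : 0;
    }
    return 0;
}

static void sim_write_word(struct sim_dev *d, int where, uint16_t val)
{
    if (where == CFG_COMMAND)
        d->command = val;
}

/* Step 1: stop new requests. Step 2: poll until none are pending.
 * Returns false if the bit is still set after max_polls reads. */
static bool sim_quiesce(struct sim_dev *d, int max_polls)
{
    uint16_t cmd = sim_read_word(d, CFG_COMMAND);

    sim_write_word(d, CFG_COMMAND, (uint16_t)(cmd & ~CFG_COMMAND_MASTER));

    for (int i = 0; i < max_polls; i++) {
        uint16_t sta = sim_read_word(d, d->exp_cap + CFG_EXP_DEVSTA);
        if (!(sta & CFG_EXP_DEVSTA_TRPND))
            return true;
    }
    return false;
}
```

In this model, a device with bus mastering enabled and two requests in flight is reported quiesced by `sim_quiesce()` once both requests have drained. Real hardware may behave differently, and the follow-up in this thread reports that the debug patch did not cure the hang on the affected VIA machine.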
Comment 12 Bruno Prémont 2008-03-12 03:59:03 UTC
It does not change very much; the hang just happens one step later:

[   57.774831] HDA Intel 0000:80:01.0: suspend
[   57.787658] ACPI: PCI interrupt for device 0000:80:01.0 disabled
[  122.407658] BUG: soft lockup - CPU#0 stuck for 61s! [bash:2872]
[  122.407658]
[  122.407658] Pid: 2872, comm: bash Not tainted (2.6.25-rc5-00105-gbaadac8-dirty #2)
[  122.407658] EIP: 0060:[<c0231409>] EFLAGS: 00000246 CPU: 0
[  122.407658] EIP is at pci_bus_read_config_word+0x49/0x60
[  122.407658] EAX: 00000000 EBX: 00000000 ECX: 00000800 EDX: 00000030
[  122.407658] ESI: 00000246 EDI: f7cbd000 EBP: f763de54 ESP: f763de3c
[  122.407658]  DS: 007b ES: 007b FS: 0000 GS: 0033 SS: 0068
[  122.407658] CR0: 8005003b CR2: b74ef000 CR3: 376d0000 CR4: 00000690
[  122.407658] DR0: 00000000 DR1: 00000000 DR2: 00000000 DR3: 00000000
[  122.407658] DR6: ffff0ff0 DR7: 00000400
[  122.407658]  [<c0232d02>] pcie_wait_pending_transaction+0x42/0x50
[  122.407658]  [<c0232d73>] pci_disable_device+0x63/0xa0
[  122.407658]  [<f88a423f>] azx_suspend+0x9f/0xd0 [snd_hda_intel]
[  122.407658]  [<c0235056>] pci_device_suspend+0x26/0x70
[  122.407658]  [<c02969c2>] device_suspend+0x142/0x260
[  122.407658]  [<c02629ae>] ? acpi_sleep_prepare+0x44/0x51
[  122.407658]  [<c0140e8f>] suspend_devices_and_enter+0x4f/0x130
[  122.407658]  [<c01410fa>] enter_state+0x15a/0x1a0
[  122.407658]  [<c01411d4>] state_store+0x94/0xd0
[  122.407658]  [<c0141140>] ? state_store+0x0/0xd0
[  122.407658]  [<c0228764>] kobj_attr_store+0x24/0x30
[  122.407658]  [<c019f05d>] sysfs_write_file+0x9d/0x100
[  122.407658]  [<c0166e05>] vfs_write+0x95/0x120
[  122.407658]  [<c019efc0>] ? sysfs_write_file+0x0/0x100
[  122.407658]  [<c016738d>] sys_write+0x3d/0x70
[  122.407658]  [<c0103e26>] sysenter_past_esp+0x5f/0x89
[  122.407658]  =======================
Comment 13 Shaohua 2008-03-12 18:07:39 UTC
I have no idea why this happens. Per the PCIe spec, the Transactions Pending bit should be clear once no requests are outstanding, but the device hangs when reading the register; maybe it doesn't implement the bit well.
Greg, if you want to revert the patch, I have no objection.
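
The soft lockup in the traces above is consistent with a poll loop that never sees the Transactions Pending bit clear. A minimal user-space sketch of a bounded variant follows. It is not the kernel's `pcie_wait_pending_transaction()` and not a fix that was merged for this bug. `fake_dev`, `fake_read_config_word`, `wait_pending_bounded` and the poll limit are hypothetical; the Device Status offset (0x0a within the PCIe capability) and the Transactions Pending bit (0x0020) are taken from the PCIe specification.

```c
#include <stdbool.h>
#include <stdint.h>

#define EXP_DEVSTA       0x0a   /* Device Status, relative to the PCIe capability */
#define EXP_DEVSTA_TRPND 0x0020 /* Transactions Pending bit */

/* Hypothetical stand-in for a PCI device; not a kernel structure. */
struct fake_dev {
    int cap_pos;            /* offset of the PCIe capability, 0 if absent */
    int reads_until_clear;  /* negative: the bit never clears */
};

/* Simulated config read: reports "pending" until the counter runs out. */
static uint16_t fake_read_config_word(struct fake_dev *dev, int where)
{
    (void)where; /* only Device Status is modelled */
    if (dev->reads_until_clear < 0)
        return EXP_DEVSTA_TRPND;
    if (dev->reads_until_clear == 0)
        return 0;
    dev->reads_until_clear--;
    return EXP_DEVSTA_TRPND;
}

/* Poll the Transactions Pending bit at most max_polls times.
 * Returns true if the bit cleared (or the device has no PCIe
 * capability), false if the limit was reached first. */
static bool wait_pending_bounded(struct fake_dev *dev, int max_polls)
{
    if (!dev->cap_pos)
        return true;

    for (int i = 0; i < max_polls; i++) {
        uint16_t sta = fake_read_config_word(dev, dev->cap_pos + EXP_DEVSTA);
        if (!(sta & EXP_DEVSTA_TRPND))
            return true;
    }
    return false; /* give up instead of spinning forever */
}
```

A device whose bit never clears (`reads_until_clear = -1`) makes `wait_pending_bounded()` return false after `max_polls` reads, where an unbounded loop would spin indefinitely. Whether a timeout is an acceptable policy for real hardware is a separate question that this sketch does not answer.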
Comment 14 Rafael J. Wysocki 2008-03-13 13:40:39 UTC
On Thursday, 13 of March 2008, bugme-daemon@bugzilla.kernel.org wrote:
> http://bugzilla.kernel.org/show_bug.cgi?id=10065
> 
> ------- Comment #13 from shaohua.li@intel.com  2008-03-12 18:07 -------
> I have no idea why this happen. Per PCIE spec, the transaction pending bit
> should be clear after no request, but the device hang when reading the
> register, maybe it doesn't implement the bit well.
> Greg, if you want to revert the patch, I have no objection.

Well, I don't see any other solution at the moment.

Can we schedule the revert of commit 4348a2dc49f9baecd34a9b0904245488c6189398
"pcie: utilize pcie transaction pending bit" for merging, please?

Comment 15 Soeren Sonnenburg 2008-03-21 08:00:50 UTC
I just tried 457fb605834504af294916411be128a9b21fc3f6 and it still hangs for me on resume... 
Comment 16 Adrian Bunk 2008-03-25 03:05:28 UTC
reverted in commit 49741c4d01554c2630cea02cfdf236b17062a912
Comment 17 Soeren Sonnenburg 2008-03-25 04:30:56 UTC
I am still having problems on *resume* with git current a4083c9271e0a697278e089f2c0b9a95363ada0a, however it always suspends fine now.

- on resume (from within X) I don't get the display back; the IDE LED stays on, then it reboots (about 10 seconds later).

- on resume (from console) s2ram -f -p no longer gives me my display back; however, blindly typing reboot or s2ram again works (the system is alive).

I am open to suggestions.
Comment 18 Adrian Bunk 2008-03-25 04:35:51 UTC
It is not uncommon for one person to run into 2 or 3 distinct regressions in one -rc...  :-(

Please open new bugs for the problems that are now visible.
Comment 19 Soeren Sonnenburg 2008-03-25 04:44:42 UTC
OK, I've opened #10319
