Bug 213617 - Suspend to ram (STR) fails with nouveau driver ([TTM] Buffer eviction failed)
Summary: Suspend to ram (STR) fails with nouveau driver ([TTM] Buffer eviction failed)
Status: RESOLVED CODE_FIX
Alias: None
Product: Drivers
Classification: Unclassified
Component: Video(DRI - non Intel) (show other bugs)
Hardware: Intel Linux
: P1 normal
Assignee: drivers_video-other
URL:
Keywords:
Depends on:
Blocks:
 
Reported: 2021-06-29 09:54 UTC by computer.enthusiastic@gmail.com
Modified: 2023-02-05 17:28 UTC (History)
1 user (show)

See Also:
Kernel Version: From 5.8.0 up
Subsystem:
Regression: No
Bisected commit-id:


Attachments
suspend-resume utility log (125.62 KB, application/zip)
2021-06-29 09:54 UTC, computer.enthusiastic@gmail.com
Details
systemd journal with suspend error for Linux kernel version 5.12.13 (30.66 KB, application/zip)
2021-06-29 11:29 UTC, computer.enthusiastic@gmail.com
Details
suspend-resume utility log generated with kernel 5.13 (1.36 MB, application/zip)
2021-07-04 07:47 UTC, computer.enthusiastic@gmail.com
Details
systemd journal with suspend error for Linux kernel version 5.13 (35.54 KB, application/zip)
2021-07-04 07:49 UTC, computer.enthusiastic@gmail.com
Details
differences between working and not working kernel 5.10.0-8-amd64 config (66.69 KB, patch)
2021-09-03 17:07 UTC, computer.enthusiastic@gmail.com
Details | Diff
compressed ftrace log (with function_graph plugin) for suspend to ram with kernel started with init_on_alloc active (897.03 KB, application/x-xz)
2022-02-02 16:59 UTC, computer.enthusiastic@gmail.com
Details
compressed ftrace log (with function_graph plugin) for suspend to ram with kernel started with init_on_alloc disabled (1.23 MB, application/x-xz)
2022-02-02 17:00 UTC, computer.enthusiastic@gmail.com
Details
bash script to trace specific kernel function using ftrace (7.86 KB, application/x-shellscript)
2022-02-02 17:02 UTC, computer.enthusiastic@gmail.com
Details
bash script to trace graph of called functions using ftrace (4.18 KB, application/x-shellscript)
2022-02-02 17:03 UTC, computer.enthusiastic@gmail.com
Details
bash script to join ftrace profiles (697 bytes, application/x-shellscript)
2022-02-02 17:03 UTC, computer.enthusiastic@gmail.com
Details
bash coomands used to rebuild a vanilla kernel with debug options activated (3.17 KB, text/plain)
2022-02-02 17:05 UTC, computer.enthusiastic@gmail.com
Details
logs of suspension to ram tests with 7sys/power/pm_test kernel facility (64.56 KB, application/zip)
2022-04-29 08:26 UTC, computer.enthusiastic@gmail.com
Details

Description computer.enthusiastic@gmail.com 2021-06-29 09:54:47 UTC
Created attachment 297659 [details]
suspend-resume utility log

Suspend to RAM hangs computer with nouveau driver. The bug came up after upgrading from Debian Stable ("Buster", kernel 4.19) to Debian Testing ("Bullseye", kernel 5.10).

Suspend to RAM is always successfully completed booting from kernel version
4.19.0-11-amd64 (from current Debian Stable ("Buster").

On the contrary, suspend to RAM always fails both booting from  kernel 5.10.0-7-amd64 (current Debian Testing aka "Bullseye"), 5.12.10 and 5.12.13 (current
vanilla/upstream "stable").

When Suspend to RAM is started (both from the Desktop Environment KDE or the command line) with those two last kernels):
a) computer starts suspend to RAM as usual and the display goes dark;
b) the notebook stays alive with fan running for about 20 minutes with dark screen and keyboard/power leds staying on;
c) the suspend to RAM is apparently completed and computer shuts down with the usual blinking orange led;
d) when computer is resumed (eg, pressing a key), it hangs showing (briefly before going dark or forever) a garbled screen (just like the one showed in the bug report: https://gitlab.freedesktop.org/xorg/driver/xf86-video-nouveau/-/issues/547);
e) at this stage:
e.1) computer sometimes is completely frozen (is is not remotely accessible nor it responds to REISUB sequence)
e.2) computer sometimes is remotely accessible by SSH after few minutes  (for
example, to obtain logs or to reboot system) and in this case it responds
to REISUB sequence, too.


I already:
- opened a bug report in the Debian Bug Tracking System (https://bugs.debian.org/cgi-bin/bugreport.cgi?bug=989705) 
- sent a follow-up post to the Freedesktop Bug Tracking System with a similar issue (https://gitlab.freedesktop.org/xorg/driver/xf86-video-nouveau/-/issues/547) 

The kernel version 5.4.125 works correctly, I've compiled from upstream and tested it.

The issue seems to hit kernel version 5.8, 5.9. 5.10.

I've tried to bisect the issue generating the following log:
$ git bisect log
# bad: [2c85ebc57b3e1817b6ce1a6b703928e113a90442] Linux 5.10
# good: [3909e2374335335c9504467caabc906d3f7487e4] Linux 5.4.125
git bisect start 'v5.10' 'v5.4.125' '--' 'drivers/gpu/drm/nouveau'
# good: [219d54332a09e8d8741c1e1982f5eae56099de85] Linux 5.4
git bisect good 219d54332a09e8d8741c1e1982f5eae56099de85
# good: [ce74773305bd5e519745e7c406f7f006daef9902] drm/ttm: remove io_reserve_fastpath flag
git bisect good ce74773305bd5e519745e7c406f7f006daef9902
# good: [a41ef363f5e8a1608d5dd970aa706601145a142f] drm/nouveau/kms/nv50-: use NVIDIA's headers for core head_olut_set()
git bisect good a41ef363f5e8a1608d5dd970aa706601145a142f
# good: [6ba1193255ebf7eadf60f79c1555e507d0879ac3] drm/nouveau/kms/nv50-: Remove open-coded drm_dp_read_desc()
git bisect good 6ba1193255ebf7eadf60f79c1555e507d0879ac3
# bad: [441959ebc9056326d80af2d06adf4b80e0288a2a] drm/atomic-helper: Remove the timestamping constant update from drm_atomic_helper_update_legacy_modeset_state()
git bisect bad 441959ebc9056326d80af2d06adf4b80e0288a2a
# bad: [79416e97dda0118b137302575a70a14259a27d7d] drm/nouveau/kms: Start using drm_dp_read_dpcd_caps() -> non si avvia il kernel
git bisect bad 79416e97dda0118b137302575a70a14259a27d7d
# good: [a0922278f83eae085fdf73d06f71bbdfb9d6789e] drm/nouveau/kms/nv50-: Refactor and cleanup DP HPD handling
git bisect good a0922278f83eae085fdf73d06f71bbdfb9d6789e
# bad: [d297ce4b5d490eab60bede36bfd5da6fbcd416cf] drm/nouveau/kms: Only use hpd_work for reprobing in HPD paths -> sospende, ma al resume errori del kernel
git bisect bad d297ce4b5d490eab60bede36bfd5da6fbcd416cf
# bad: [8b75e83e9c94dbf619cf2499ec85656eed9d0320] drm/nouveau/kms: Use new drm_dp_read_mst_cap() helper for checking MST caps
git bisect bad 8b75e83e9c94dbf619cf2499ec85656eed9d0320
# first bad commit: [8b75e83e9c94dbf619cf2499ec85656eed9d0320] drm/nouveau/kms: Use new drm_dp_read_mst_cap() helper for checking MST caps

$ git bisect bad
8b75e83e9c94dbf619cf2499ec85656eed9d0320 is the first bad commit
commit 8b75e83e9c94dbf619cf2499ec85656eed9d0320
Author: Lyude Paul <lyude@redhat.com>
Date:   Wed Aug 26 14:24:46 2020 -0400

    drm/nouveau/kms: Use new drm_dp_read_mst_cap() helper for checking MST caps
    
    Signed-off-by: Lyude Paul <lyude@redhat.com>
    Reviewed-by: Ben Skeggs <bskeggs@redhat.com>
    Link: https://patchwork.freedesktop.org/patch/msgid/20200826182456.322681-11-lyude@redhat.com

 drivers/gpu/drm/nouveau/nouveau_dp.c | 16 +++-------------
 1 file changed, 3 insertions(+), 13 deletions(-)


Unfortunately, reverting the commit 8b75e83e9c94dbf619cf2499ec85656eed9d0320 did not solve the issue.

This is the lspci output:
00:00.0 Host bridge: Intel Corporation Mobile 4 Series Chipset Memory Controller Hub (rev 07)
00:01.0 PCI bridge: Intel Corporation Mobile 4 Series Chipset PCI Express Graphics Port (rev 07)
00:1a.0 USB controller: Intel Corporation 82801I (ICH9 Family) USB UHCI Controller #4 (rev 03)
00:1a.1 USB controller: Intel Corporation 82801I (ICH9 Family) USB UHCI Controller #5 (rev 03)
00:1a.2 USB controller: Intel Corporation 82801I (ICH9 Family) USB UHCI Controller #6 (rev 03)
00:1a.7 USB controller: Intel Corporation 82801I (ICH9 Family) USB2 EHCI Controller #2 (rev 03)
00:1b.0 Audio device: Intel Corporation 82801I (ICH9 Family) HD Audio Controller (rev 03)
00:1c.0 PCI bridge: Intel Corporation 82801I (ICH9 Family) PCI Express Port 1 (rev 03)
00:1c.1 PCI bridge: Intel Corporation 82801I (ICH9 Family) PCI Express Port 2 (rev 03)
00:1c.2 PCI bridge: Intel Corporation 82801I (ICH9 Family) PCI Express Port 3 (rev 03)
00:1c.3 PCI bridge: Intel Corporation 82801I (ICH9 Family) PCI Express Port 4 (rev 03)
00:1c.4 PCI bridge: Intel Corporation 82801I (ICH9 Family) PCI Express Port 5 (rev 03)
00:1d.0 USB controller: Intel Corporation 82801I (ICH9 Family) USB UHCI Controller #1 (rev 03)
00:1d.1 USB controller: Intel Corporation 82801I (ICH9 Family) USB UHCI Controller #2 (rev 03)
00:1d.2 USB controller: Intel Corporation 82801I (ICH9 Family) USB UHCI Controller #3 (rev 03)
00:1d.7 USB controller: Intel Corporation 82801I (ICH9 Family) USB2 EHCI Controller #1 (rev 03)
00:1e.0 PCI bridge: Intel Corporation 82801 Mobile PCI Bridge (rev 93)
00:1f.0 ISA bridge: Intel Corporation ICH9M LPC Interface Controller (rev 03)
00:1f.2 SATA controller: Intel Corporation 82801IBM/IEM (ICH9M/ICH9M-E) 4 port SATA Controller [AHCI mode] (rev 03)
00:1f.3 SMBus: Intel Corporation 82801I (ICH9 Family) SMBus Controller (rev 03)
01:00.0 VGA compatible controller: NVIDIA Corporation G96CM [GeForce 9600M GT] (rev a1)
02:00.0 Ethernet controller: Marvell Technology Group Ltd. 88E8071 PCI-E Gigabit Ethernet Controller (rev 16)
03:00.0 Network controller: Intel Corporation WiFi Link 5100

This is the lsusb output:
Bus 004 Device 001: ID 1d6b:0002 Linux Foundation 2.0 root hub
Bus 008 Device 001: ID 1d6b:0001 Linux Foundation 1.1 root hub
Bus 007 Device 002: ID 046d:c016 Logitech, Inc. Optical Wheel Mouse
Bus 007 Device 001: ID 1d6b:0001 Linux Foundation 1.1 root hub
Bus 006 Device 001: ID 1d6b:0001 Linux Foundation 1.1 root hub
Bus 002 Device 002: ID 064e:a117 Suyin Corp. Acer HD Crystal Eye webcam
Bus 002 Device 001: ID 1d6b:0002 Linux Foundation 2.0 root hub
Bus 005 Device 001: ID 1d6b:0001 Linux Foundation 1.1 root hub
Bus 003 Device 001: ID 1d6b:0001 Linux Foundation 1.1 root hub
Bus 001 Device 001: ID 1d6b:0001 Linux Foundation 1.1 root hub

This is the dmidecode output:
# dmidecode 3.3
Getting SMBIOS data from sysfs.
SMBIOS 2.5 present.
45 structures occupying 1821 bytes.
Table at 0xBFCC0000.

Handle 0x0000, DMI type 0, 24 bytes
BIOS Information
	Vendor: Phoenix Technologies LTD
	Version: V1.21          
	Release Date: 08/22/2008
	Address: 0xE5D60
	Runtime Size: 107168 bytes
	ROM Size: 2 MB
	Characteristics:
		ISA is supported
		PCI is supported
		PC Card (PCMCIA) is supported
		PNP is supported
		BIOS is upgradeable
		BIOS shadowing is allowed
		ESCD support is available
		Boot from CD is supported
		ACPI is supported
		USB legacy is supported
		AGP is supported
		BIOS boot specification is supported
		Targeted content distribution is supported

Handle 0x0001, DMI type 1, 27 bytes
System Information
	Manufacturer: Acer           
	Product Name: Aspire 5930                    
	Version: 0100           
	Serial Number: LXAQ40X1328381F7142000        
	UUID: dbedeee0-88db-11dd-af68-fd4e9473e5f5
	Wake-up Type: Power Switch
	SKU Number: Not Specified
	Family: Not Specified

Handle 0x0002, DMI type 2, 8 bytes
Base Board Information
	Manufacturer: Acer           
	Product Name: Eiger                          
	Version: Rev            
	Serial Number: LXAQ40X1328381F7142000        

Handle 0x0003, DMI type 3, 17 bytes
Chassis Information
	Manufacturer: Acer           
	Type: Notebook
	Lock: Not Present
	Version: N/A            
	Serial Number: None           
	Asset Tag:                                 
	Boot-up State: Safe
	Power Supply State: Safe
	Thermal State: Safe
	Security Status: None
	OEM Information: 0x00001234

Handle 0x0004, DMI type 4, 40 bytes
Processor Information
	Socket Designation: U2E1
	Type: Central Processor
	Family: Core 2
	Manufacturer: Intel
	ID: 76 06 01 00 FF FB EB BF
	Signature: Type 0, Family 6, Model 23, Stepping 6
	Flags:
		FPU (Floating-point unit on-chip)
		VME (Virtual mode extension)
		DE (Debugging extension)
		PSE (Page size extension)
		TSC (Time stamp counter)
		MSR (Model specific registers)
		PAE (Physical address extension)
		MCE (Machine check exception)
		CX8 (CMPXCHG8 instruction supported)
		APIC (On-chip APIC hardware supported)
		SEP (Fast system call)
		MTRR (Memory type range registers)
		PGE (Page global enable)
		MCA (Machine check architecture)
		CMOV (Conditional move instruction supported)
		PAT (Page attribute table)
		PSE-36 (36-bit page size extension)
		CLFSH (CLFLUSH instruction supported)
		DS (Debug store)
		ACPI (ACPI supported)
		MMX (MMX technology supported)
		FXSR (FXSAVE and FXSTOR instructions supported)
		SSE (Streaming SIMD extensions)
		SSE2 (Streaming SIMD extensions 2)
		SS (Self-snoop)
		HTT (Multi-threading)
		TM (Thermal monitor supported)
		PBE (Pending break enabled)
	Version: CPU Version
	Voltage: 3.3 V
	External Clock: 266 MHz
	Max Speed: 2260 MHz
	Current Speed: 2260 MHz
	Status: Populated, Enabled
	Upgrade: ZIF Socket
	L1 Cache Handle: 0x0005
	L2 Cache Handle: 0x0006
	L3 Cache Handle: Not Provided
	Serial Number: Not Specified
	Asset Tag: Not Specified
	Part Number: Not Specified
	Core Count: 2
	Core Enabled: 2
	Thread Count: 2
	Characteristics:
		64-bit capable

Handle 0x0005, DMI type 7, 19 bytes
Cache Information
	Socket Designation: L1 Cache
	Configuration: Enabled, Socketed, Level 1
	Operational Mode: Write Back
	Location: Internal
	Installed Size: 64 kB
	Maximum Size: 64 kB
	Supported SRAM Types:
		Burst
		Pipeline Burst
		Asynchronous
	Installed SRAM Type: Asynchronous
	Speed: Unknown
	Error Correction Type: Unknown
	System Type: Unknown
	Associativity: Unknown

Handle 0x0006, DMI type 7, 19 bytes
Cache Information
	Socket Designation: L2 Cache
	Configuration: Enabled, Socketed, Level 2
	Operational Mode: Write Back
	Location: Internal
	Installed Size: 3 MB
	Maximum Size: 4 MB
	Supported SRAM Types:
		Burst
		Pipeline Burst
		Asynchronous
	Installed SRAM Type: Burst
	Speed: Unknown
	Error Correction Type: Unknown
	System Type: Unknown
	Associativity: Unknown

Handle 0x0007, DMI type 8, 9 bytes
Port Connector Information
	Internal Reference Designator: J19
	Internal Connector Type: 9 Pin Dual Inline (pin 10 cut)
	External Reference Designator: COM 1
	External Connector Type: DB-9 male
	Port Type: Serial Port 16550A Compatible

Handle 0x0008, DMI type 8, 9 bytes
Port Connector Information
	Internal Reference Designator: J1A1
	Internal Connector Type: None
	External Reference Designator: Keyboard
	External Connector Type: Circular DIN-8 male
	Port Type: Keyboard Port

Handle 0x0009, DMI type 8, 9 bytes
Port Connector Information
	Internal Reference Designator: J1A1
	Internal Connector Type: None
	External Reference Designator: PS/2 Mouse
	External Connector Type: Circular DIN-8 male
	Port Type: Mouse Port

Handle 0x000A, DMI type 9, 13 bytes
System Slot Information
	Designation: PEG Slot J6B2
	Type: 32-bit PCI Express
	Current Usage: In Use
	Length: Long
	ID: 6
	Characteristics:
		5.0 V is provided
		3.3 V is provided

Handle 0x000B, DMI type 9, 13 bytes
System Slot Information
	Designation: PCI Express Slot J6B1
	Type: 32-bit PCI Express
	Current Usage: In Use
	Length: Long
	ID: 7
	Characteristics:
		5.0 V is provided
		3.3 V is provided

Handle 0x000C, DMI type 9, 13 bytes
System Slot Information
	Designation: PCI Express Slot J6D1
	Type: 32-bit PCI Express
	Current Usage: In Use
	Length: Long
	ID: 8
	Characteristics:
		5.0 V is provided
		3.3 V is provided

Handle 0x000D, DMI type 9, 13 bytes
System Slot Information
	Designation: PCI Express Slot J8B3
	Type: 32-bit PCI Express
	Current Usage: Available
	Length: Long
	ID: 9
	Characteristics:
		5.0 V is provided
		3.3 V is provided

Handle 0x000E, DMI type 9, 13 bytes
System Slot Information
	Designation: PCI Express Slot J8D1
	Type: 32-bit PCI Express
	Current Usage: In Use
	Length: Long
	ID: 10
	Characteristics:
		5.0 V is provided
		3.3 V is provided

Handle 0x000F, DMI type 9, 13 bytes
System Slot Information
	Designation: PCI Express Slot J7B1
	Type: 32-bit PCI Express
	Current Usage: Available
	Length: Long
	ID: 11
	Characteristics:
		5.0 V is provided
		3.3 V is provided

Handle 0x0010, DMI type 9, 13 bytes
System Slot Information
	Designation: PCI Express Slot 6
	Type: 32-bit PCI Express
	Current Usage: In Use
	Length: Long
	ID: 10
	Characteristics:
		5.0 V is provided
		3.3 V is provided

Handle 0x0011, DMI type 10, 6 bytes
On Board Device Information
	Type: Sound
	Status: Disabled
	Description: HD-Audio

Handle 0x0012, DMI type 11, 5 bytes
OEM Strings
	String 1: This is the Intel Cantiga
	String 2: Chipset CRB Platform

Handle 0x0013, DMI type 12, 5 bytes
System Configuration Options
	Option 1: Jumper settings can be described here.

Handle 0x0014, DMI type 15, 29 bytes
System Event Log
	Area Length: 16 bytes
	Header Start Offset: 0x0000
	Header Length: 16 bytes
	Data Start Offset: 0x0010
	Access Method: General-purpose non-volatile data functions
	Access Address: 0x0000
	Status: Invalid, Not Full
	Change Token: 0x00000000
	Header Format: Type 1
	Supported Log Type Descriptors: 3
	Descriptor 1: POST error
	Data Format 1: POST results bitmap
	Descriptor 2: Single-bit ECC memory error
	Data Format 2: Multiple-event
	Descriptor 3: Multi-bit ECC memory error
	Data Format 3: Multiple-event

Handle 0x0015, DMI type 16, 15 bytes
Physical Memory Array
	Location: System Board Or Motherboard
	Use: System Memory
	Error Correction Type: None
	Maximum Capacity: 4 GB
	Error Information Handle: Not Provided
	Number Of Devices: 2

Handle 0x0016, DMI type 17, 27 bytes
Memory Device
	Array Handle: 0x0015
	Error Information Handle: No Error
	Total Width: 64 bits
	Data Width: 64 bits
	Size: 2 GB
	Form Factor: SODIMM
	Set: 1
	Locator: M1
	Bank Locator: Bank 0
	Type: DDR2
	Type Detail: Synchronous
	Speed: 667 MT/s
	Manufacturer: Mfg 0
	Serial Number: 1234-B0
	Asset Tag: Not Specified
	Part Number: SODIMM000

Handle 0x0017, DMI type 17, 27 bytes
Memory Device
	Array Handle: 0x0015
	Error Information Handle: No Error
	Total Width: 64 bits
	Data Width: 64 bits
	Size: 2 GB
	Form Factor: SODIMM
	Set: 1
	Locator: M2
	Bank Locator: Bank 1
	Type: DDR2
	Type Detail: Synchronous
	Speed: 667 MT/s
	Manufacturer: Mfg 1
	Serial Number: 1234-B1
	Asset Tag: Not Specified
	Part Number: SODIMM001

Handle 0x0018, DMI type 18, 23 bytes
32-bit Memory Error Information
	Type: OK
	Granularity: Unknown
	Operation: Unknown
	Vendor Syndrome: Unknown
	Memory Array Address: Unknown
	Device Address: Unknown
	Resolution: Unknown

Handle 0x0019, DMI type 18, 23 bytes
32-bit Memory Error Information
	Type: OK
	Granularity: Unknown
	Operation: Unknown
	Vendor Syndrome: Unknown
	Memory Array Address: Unknown
	Device Address: Unknown
	Resolution: Unknown

Handle 0x001A, DMI type 19, 15 bytes
Memory Array Mapped Address
	Starting Address: 0x00000000000
	Ending Address: 0x000FFFFFFFF
	Range Size: 4 GB
	Physical Array Handle: 0x0015
	Partition Width: 2

Handle 0x001B, DMI type 20, 19 bytes
Memory Device Mapped Address
	Starting Address: 0x00000000000
	Ending Address: 0x0007FFFFFFF
	Range Size: 2 GB
	Physical Device Handle: 0x0016
	Memory Array Mapped Address Handle: 0x001A
	Partition Row Position: Unknown
	Interleave Position: Unknown
	Interleaved Data Depth: Unknown

Handle 0x001C, DMI type 20, 19 bytes
Memory Device Mapped Address
	Starting Address: 0x00080000000
	Ending Address: 0x000FFFFFFFF
	Range Size: 2 GB
	Physical Device Handle: 0x0017
	Memory Array Mapped Address Handle: 0x001A
	Partition Row Position: Unknown
	Interleave Position: Unknown
	Interleaved Data Depth: Unknown

Handle 0x001D, DMI type 22, 26 bytes
Portable Battery
	Location: Rear
	Manufacturer: Intel Corporation
	Name: Intel Corporation
	Chemistry: Lithium Ion
	Design Capacity: 1000 mWh
	Design Voltage: 19 mV
	SBDS Version: BATT_1234
	Maximum Error: 100%
	SBDS Serial Number: 0001
	SBDS Manufacture Date: 1980-00-01
	OEM-specific Information: 0x00000000

Handle 0x001E, DMI type 23, 13 bytes
System Reset
	Status: Enabled
	Watchdog Timer: Present
	Boot Option: Do Not Reboot
	Boot Option On Limit: Do Not Reboot
	Reset Count: Unknown
	Reset Limit: Unknown
	Timer Interval: Unknown
	Timeout: Unknown

Handle 0x001F, DMI type 24, 5 bytes
Hardware Security
	Power-On Password Status: Disabled
	Keyboard Password Status: Unknown
	Administrator Password Status: Disabled
	Front Panel Reset Status: Unknown

Handle 0x0020, DMI type 25, 9 bytes
System Power Controls
	Next Scheduled Power-on: 12-31 23:59:59

Handle 0x0021, DMI type 26, 20 bytes
Voltage Probe
	Description: Voltage Probe
	Location: Processor
	Status: OK
	Maximum Value: Unknown
	Minimum Value: Unknown
	Resolution: Unknown
	Tolerance: Unknown
	Accuracy: Unknown
	OEM-specific Information: 0x00000000

Handle 0x0022, DMI type 27, 12 bytes
Cooling Device
	Temperature Probe Handle: 0x0023
	Type: Fan
	Status: OK
	OEM-specific Information: 0x00000000

Handle 0x0023, DMI type 28, 20 bytes
Temperature Probe
	Description: Temperature Probe
	Location: Processor
	Status: OK
	Maximum Value: Unknown
	Minimum Value: Unknown
	Resolution: Unknown
	Tolerance: Unknown
	Accuracy: Unknown
	OEM-specific Information: 0x00000000

Handle 0x0024, DMI type 29, 20 bytes
Electrical Current Probe
	Description: Electrical Current Probe
	Location: Processor
	Status: OK
	Maximum Value: Unknown
	Minimum Value: Unknown
	Resolution: Unknown
	Tolerance: Unknown
	Accuracy: Unknown
	OEM-specific Information: 0x00000000

Handle 0x0025, DMI type 30, 6 bytes
Out-of-band Remote Access
	Manufacturer Name: Intel
	Inbound Connection: Disabled
	Outbound Connection: Enabled

Handle 0x0026, DMI type 32, 20 bytes
System Boot Information
	Status: No errors detected

Handle 0x0027, DMI type 39, 22 bytes
System Power Supply
	Power Unit Group: 1
	Location: To Be Defined By O.E.M
	Name: To Be Defined By O.E.M
	Manufacturer: To Be Defined By O.E.M
	Serial Number: To Be Defined By O.E.M
	Asset Tag: To Be Defined By O.E.M
	Model Part Number: To Be Defined By O.E.M
	Revision: 2.50
	Max Power Capacity: Unknown
	Status: Present, Unknown
	Type: Unknown
	Input Voltage Range Switching: Unknown
	Plugged: Yes
	Hot Replaceable: No
	Input Voltage Probe Handle: 0x0021
	Cooling Device Handle: 0x0022
	Input Current Probe Handle: 0x0024

Handle 0x0028, DMI type 129, 8 bytes
OEM-specific Type
	Header and Data:
		81 08 28 00 01 01 02 01
	Strings:
		Intel_ASF
		Intel_ASF_001

Handle 0x0029, DMI type 136, 6 bytes
OEM-specific Type
	Header and Data:
		88 06 29 00 FF FF

Handle 0x002A, DMI type 150, 14 bytes
OEM-specific Type
	Header and Data:
		96 0E 2A 00 01 01 00 00 00 00 00 00 00 00
	Strings:
		ABSOLUTE(PHOENIX) CLM

Handle 0x002B, DMI type 200, 7 bytes
OEM-specific Type
	Header and Data:
		C8 07 2B 00 01 02 03
	Strings:
		17C0
		             
		0001

Handle 0x002C, DMI type 127, 4 bytes
End Of Table


The log generated with kernel 5.12.13 using the suspend-resume utility (with the following command: sleepgraph.py -m freeze) is attached.
Comment 1 computer.enthusiastic@gmail.com 2021-06-29 11:29:54 UTC
Created attachment 297661 [details]
systemd journal with suspend error for Linux kernel version 5.12.13

The full systemd journal for the boot session with suspend issue is attached.
Comment 2 computer.enthusiastic@gmail.com 2021-07-04 07:47:26 UTC
Created attachment 297703 [details]
suspend-resume utility log generated with kernel 5.13

Log generated using suspend-resume with developer option (detailed trace of suspend/resume system calls and timing)
Comment 3 computer.enthusiastic@gmail.com 2021-07-04 07:49:46 UTC
Created attachment 297705 [details]
systemd journal with suspend error for Linux kernel version 5.13
Comment 4 computer.enthusiastic@gmail.com 2021-07-04 08:09:34 UTC
The issue is present in Linux kernel versions 5.13 (current mainline version) and 5.12.14 (latest stable version). Logs from suspendresume utility and systemd journal are attached for kernel version 5.13.

The suspend-resume utility log with kernel 5.13 is generated with the  -dev option (developer detail option) with a more detailed trace of kernel system calls during suspend. 

I hope those further informations can be useful to understand the issue.
Comment 5 computer.enthusiastic@gmail.com 2021-08-11 10:06:30 UTC
Hello,

This is an update of the reported issue for different kernel versions.

The column 'STATUS" shows if suspend to ram/suspend to disk fails (FAIL) or succeeds (OK).

The kernel versions linux-image-4.19.0-11-amd64, linux-image-5.10.0-8-amd64 and linux-image-5.10.0-8-amd64 are from official Debian repositories. The other kernel versions are compiled by me from upstream (git://git.kernel.org/pub/scm/linux/kernel/git/stable/linux.git).

STATUS	KERNEL VERSION                           UPSTREAM VER.   ARCHITECTURE COMMENT
======= ======================================== =============== ===========  ==================================
OK	ii  linux-image-4.19.0-11-amd64          4.19.146-1      amd64        Linux 4.19 for 64-bit PCs (signed)
FAIL	ii  linux-image-5.10.0-8-amd64           5.10.46-3       amd64        Linux 5.10 for 64-bit PCs (signed)
FAIL	ii  linux-image-5.10.0-8-amd64           5.10.46-4       amd64        Linux 5.10 for 64-bit PCs (signed)
OK	ii  linux-image-5.10.46                  5.10.46-11      amd64        Linux kernel, version 5.10.46
OK	ii  linux-image-5.10.47                  5.10.47-10      amd64        Linux kernel, version 5.10.47
OK	ii  linux-image-5.10.48                  5.10.48-9       amd64        Linux kernel, version 5.10.48
OK	ii  linux-image-5.10.49                  5.10.49-8       amd64        Linux kernel, version 5.10.49
OK	ii  linux-image-5.10.50                  5.10.50-7       amd64        Linux kernel, version 5.10.50
OK	ii  linux-image-5.10.51                  5.10.51-6       amd64        Linux kernel, version 5.10.51
OK	ii  linux-image-5.10.52                  5.10.52-5       amd64        Linux kernel, version 5.10.52
OK	ii  linux-image-5.10.53                  5.10.53-3       amd64        Linux kernel, version 5.10.53
FAIL	ii  linux-image-5.13.5                   5.13.5-4        amd64        Linux kernel, version 5.13.5

The actual kernel candidate for the next Debian stable version (
Comment 6 computer.enthusiastic@gmail.com 2021-09-03 17:06:06 UTC
Hello,

For kernel 5.10, I suppose I have identified that the issue was related to the
activation of the config parameter CONFIG_INIT_ON_ALLOC_DEFAULT_ON=y
in Debian Kernel 5.10.0-8-amd64 from Debian Bullseye 11.0 (it was
disabled in the Debian kernel 4.19.0 from Debian Buster 10.11). This
parameter was activated in Debian with linux (5.8.3-1~exp1)
experimental on Mon, 24 Aug 2020 01:23:22 +0100 (see
https://metadata.ftp-master.debian.org/changelogs//main/l/linux/linux_5.10.46-4_changelog)

I discovered it bisecting (by hand) the diff of a working kernel
config file for Debian Kernel 5.10.0-8-amd64 (generated by me from
Debian kernel source code with make makeoldconfig using as template
the Debian kernel config-4.19.0-11-amd64) and the default kernel
config file from stock Debian Kernel 5.10.0-8-amd64 (see attachment);
the "hunk" of the diff that I detected was the number 151:

--- linux-source-5.10/.config   2021-08-13 17:24:22.386243765 +0200
+++ /boot/config-5.10.46        2021-08-01 10:27:12.000000000 +0200
@@ -9063,7 +9063,7 @@
 # Memory initialization
 #
 CONFIG_INIT_STACK_NONE=y
-CONFIG_INIT_ON_ALLOC_DEFAULT_ON=y
+# CONFIG_INIT_ON_ALLOC_DEFAULT_ON is not set
 # CONFIG_INIT_ON_FREE_DEFAULT_ON is not set
 # end of Memory initialization
 # end of Kernel hardening options

To verify this finding, I configured grub to start the kernel with the
parameter init_on_alloc=0:

# If you change this file, run 'update-grub' afterwards to update
# /boot/grub/grub.cfg.
# For full documentation of the options in this file, see:
#   info -f grub -n 'Simple configuration'

GRUB_DEFAULT=0
GRUB_TIMEOUT=5
GRUB_DISTRIBUTOR=`lsb_release -i -s 2> /dev/null || echo Debian`
GRUB_CMDLINE_LINUX_DEFAULT="no_console_suspend nouveau.debug=warn
init_on_alloc=0"
[...missing...]

After that, of course, I update the grub with kernel boot
configuration with the command:

update-grub2

The test with the stock Debian Bullseye (11.0) Kernel 5.10.0-8-amd64
was successful: I'm repeatedly able to suspend to ram and suspend to
disk with parameter init_on_alloc set to 0 with the same kernel that
freeze with init_on_alloc set to 1. I haven't deepened yet in kernel
source code, but in theory the kernel feature activated by this
parameter [1] (erase area of newly allocated memory) could have side
effects with the buffer handling/eviction of memory from video memory
to system memory during suspend to ram or suspend to disk.

[1] https://patchwork.kernel.org/project/linux-security-module/patch/20190626121943.131390-2-glider@google.com/
Comment 7 computer.enthusiastic@gmail.com 2021-09-03 17:07:42 UTC
Created attachment 298667 [details]
differences between working and not working kernel 5.10.0-8-amd64 config
Comment 8 computer.enthusiastic@gmail.com 2022-02-02 16:56:51 UTC
Hello,

The issue immediately appears during suspension to RAM, if the graphical console is active (e.g., CTRL-ALT-F7): the screen image is always garbled with the same pattern, regardless of the kernel version (see here https://gitlab.freedesktop.org/xorg/driver/xf86-video-nouveau/uploads/44ffbbef135538b42a39a2d4dd288624/After_resume_linux-5.4.60_and_linux-5.7.17.jpg for an example).

The issue does not immediately appears if suspend to ram is started (e.g., systemctl suspend) from a text virtual terminal (even if after a much longer wait) and it is possible to complete resume: in this case, the system crashes (the same way reported in previous paragraph) as soon as the graphical console (e.g., with ALT-F7) is activated by the user, anyways.

Below I report the analysis I've done so far: I apologize for its length, but I would like to give as much information as possible. 

I followed the advice from Nouveau mailing list (see https://lists.freedesktop.org/archives/nouveau/2021-October/039363.html) about slub debug. After some tests, I realized that the activation of "slub poisoning" (e.g. kernel option 'slub_debug=FPZ') has the same effect of "init_on_alloc=0 init_on_free=0". In fact, it seems that "slub poisoning" takes precedence over activation of init_on_alloc and init_on_free. 

This is an excerpt of kernel log when kernel is booted with slubdebug option activated (suspend to RAM succeeds):

8<------8<------8<------8<------8<------8<------8<------8<
gen 24 17:37:22 debian kernel: Linux version 5.10.87-dbg-timeout (([.,removed..])) (gcc (Debian 10.2.1-6) 10.2.1 20210110, GNU ld (GNU Binutils for Debian) 2.35.2) #2 SMP Sat Dec 18 09:02:06 CET 2021
gen 24 17:37:22 debian kernel: Command line: BOOT_IMAGE=/boot/vmlinuz-5.10.87-dbg-timeout root=UUID=d4c01dd6-eca5-4311-9ea2-6511cd217dc7 ro no_console_suspend loglevel=6 nouveau.debug=info drm.debug=2 trace_clock=global slub_debug=FPZ nokaslr crashkernel=384M-:128M
  [..]
gen 24 17:37:22 debian kernel: mem auto-init: SLAB_POISON will take precedence over init_on_alloc/init_on_free
gen 24 17:37:22 debian kernel: Dentry cache hash table entries: 524288 (order: 10, 4194304 bytes, linear)
gen 24 17:37:22 debian kernel: Inode-cache hash table entries: 262144 (order: 9, 2097152 bytes, linear)
gen 24 17:37:22 debian kernel: mem auto-init: stack:off, heap alloc:on, heap free:off
gen 24 17:37:22 debian kernel: Memory: 205516K/3140468K available (14344K kernel code, 4648K rwdata, 4480K rodata, 2488K init, 21384K bss, 282616K reserved, 0K cma-reserved)
gen 24 17:37:22 debian kernel: random: get_random_u64 called from __kmem_cache_create+0x2c/0x580 with crng_init=0
gen 24 17:37:22 debian kernel: SLUB: HWalign=64, Order=0-3, MinObjects=0, CPUs=2, Nodes=1
gen 24 17:37:22 debian kernel: Kernel/User page tables isolation: enabled
gen 24 17:37:22 debian kernel: ftrace: allocating 37278 entries in 146 pages
gen 24 17:37:22 debian kernel: ftrace: allocated 146 pages with 3 groups
gen 24 17:37:22 debian kernel: Running RCU self tests
gen 24 17:37:22 debian kernel: rcu: Hierarchical RCU implementation.
gen 24 17:37:22 debian kernel: rcu:         RCU lockdep checking is enabled.
gen 24 17:37:22 debian kernel: rcu:         RCU restricting CPUs from NR_CPUS=8192 to nr_cpu_ids=2.
gen 24 17:37:22 debian kernel:         Rude variant of Tasks RCU enabled.
gen 24 17:37:22 debian kernel:         Tracing variant of Tasks RCU enabled.
gen 24 17:37:22 debian kernel: rcu: RCU calculated value of scheduler-enlistment delay is 25 jiffies.
gen 24 17:37:22 debian kernel: rcu: Adjusting geometry for rcu_fanout_leaf=16, nr_cpu_ids=2
gen 24 17:37:22 debian kernel: NR_IRQS: 524544, nr_irqs: 440, preallocated irqs: 16
gen 24 17:37:22 debian kernel: Console: colour VGA+ 80x25
gen 24 17:37:22 debian kernel: printk: console [tty0] enabled
gen 24 17:37:22 debian kernel: Lock dependency validator: Copyright (c) 2006 Red Hat, Inc., Ingo Molnar
gen 24 17:37:22 debian kernel: ... MAX_LOCKDEP_SUBCLASSES:  8
gen 24 17:37:22 debian kernel: ... MAX_LOCK_DEPTH:          48
gen 24 17:37:22 debian kernel: ... MAX_LOCKDEP_KEYS:        8192
gen 24 17:37:22 debian kernel: ... CLASSHASH_SIZE:          4096
gen 24 17:37:22 debian kernel: ... MAX_LOCKDEP_ENTRIES:     32768
gen 24 17:37:22 debian kernel: ... MAX_LOCKDEP_CHAINS:      65536
gen 24 17:37:22 debian kernel: ... CHAINHASH_SIZE:          32768
gen 24 17:37:22 debian kernel:  memory used by lock dependency info: 6365 kB
gen 24 17:37:22 debian kernel:  memory used for stack traces: 4224 kB
gen 24 17:37:22 debian kernel:  per task-struct memory footprint: 1920 bytes
[..]
8<------8<------8<------8<------8<------8<------8<------8<

Furthermore, the activation of the "sanity checks" and "red zoning" in the kernel slub ('F' and 'Z' options in 'slub_debug=FPG') does not make appear warning or error messages in system logs when resume from suspension to RAM fails. This make me guess that probably the issue is not caused by a kernel slub corruption or a memory leak. 

These are the relevant error messages in system logs during suspension to RAM (when resume from STR fails):

8<------8<------8<------8<------8<------8<------8<------8<
  [..]
  gen 24 18:42:27 debian systemd-sleep[2228]: Suspending system...
  gen 24 18:42:27 debian kernel: PM: suspend entry (deep)
  gen 24 18:42:27 debian kernel: Filesystems sync: 0.045 seconds
  gen 24 18:45:30 debian kernel: Freezing user space processes ... (elapsed 0.026 seconds) done.
  gen 24 18:45:30 debian kernel: OOM killer disabled.
  gen 24 18:45:30 debian kernel: Freezing remaining freezable tasks ... (elapsed 0.003 seconds) done.
  gen 24 18:45:30 debian kernel: sd 0:0:0:0: [sda] Synchronizing SCSI cache
  gen 24 18:45:30 debian kernel: sd 0:0:0:0: [sda] Stopping disk
  gen 24 18:45:30 debian kernel: nouveau 0000:01:00.0: DRM: suspending console...
  gen 24 18:45:30 debian kernel: nouveau 0000:01:00.0: DRM: suspending display...
  gen 24 18:45:30 debian kernel: nouveau 0000:01:00.0: DRM: evicting buffers...
>>gen 24 18:45:30 debian kernel: [TTM] Buffer eviction failed
  gen 24 18:45:30 debian kernel: nouveau 0000:01:00.0: DRM: waiting for kernel channels to go idle...
  gen 24 18:45:30 debian kernel: nouveau 0000:01:00.0: DRM: suspending fence...
  gen 24 18:45:30 debian kernel: nouveau 0000:01:00.0: DRM: suspending object tree...
  gen 24 18:45:30 debian kernel: ACPI: EC: interrupt blocked
  gen 24 18:45:30 debian kernel: ACPI: Preparing to enter system sleep state S3
  gen 24 18:45:30 debian kernel: ACPI: EC: event blocked
  gen 24 18:45:30 debian kernel: ACPI: EC: EC stopped
  gen 24 18:45:30 debian kernel: PM: Saving platform NVS memory
  gen 24 18:45:30 debian kernel: Disabling non-boot CPUs ...
  gen 24 18:45:30 debian kernel: smpboot: CPU 1 is now offline
  gen 24 18:45:30 debian kernel: ACPI: Low-level resume complete
  gen 24 18:45:30 debian kernel: ACPI: EC: EC started
  gen 24 18:45:30 debian kernel: PM: Restoring platform NVS memory
  gen 24 18:45:30 debian kernel: Enabling non-boot CPUs ...
  gen 24 18:45:30 debian kernel: x86: Booting SMP configuration:
  gen 24 18:45:30 debian kernel: smpboot: Booting Node 0 Processor 1 APIC 0x1
  gen 24 18:45:30 debian kernel: CPU1 is up
  gen 24 18:45:30 debian kernel: ACPI: Waking up from system sleep state S3
  gen 24 18:45:30 debian kernel: ACPI: EC: interrupt unblocked
  gen 24 18:45:30 debian kernel: nouveau 0000:01:00.0: DRM: resuming object tree...
  gen 24 18:45:30 debian kernel: ACPI: EC: event unblocked
  [..]
  8<------8<------8<------8<------8<------8<------8<------8<

As you can see, there is a ***"[TTM] Buffer eviction failed"*** error message from the kernel during suspension to RAM. 

The "Buffer eviction failed" error message always appears (with my hardware: Intel Core 2 Duo, nvidia GEForce 9600M GT) during suspension to ram after that the kernel is booted with init_on_alloc activated (or without SLUB_debug='FPZ' option); it is *constantly replicable* from different kernel versions, for example from 5.10 (e.g., in Debian Bullseye 11.0, 11.1, 11.2 and in vanilla versions up to 5.16, the latter tested few days ago).  It probably started with kernel version 5.8.

The following tests and logs are from a custom kernel built from vanilla version 5.10.87 with debug options enabled (see attached file vanilla-kernel-build-with-debian-config); I used (as base) the kernel's .config from the latest Debian kernel from release 11.0 codenamed "Bullseye" (linux-image-5.10.0-10-amd64, kernel 5.10.84) and then activated further kernel debug option. The user space installation is from a Debian 11.2.

8<------8<------8<------8<------8<------8<------8<------8<

$ uname -a
Linux debian 5.10.87-dbg-timeout #2 SMP Sat Dec 18 09:02:06 CET 2021 x86_64 GNU/Linux

$ lsb_release -a
No LSB modules are available.
Distributor ID: Debian
Description:    Debian GNU/Linux 11 (bullseye)
Release:        11
Codename:       bullseye

$ cat /etc/debian_version 
11.2

# inxi -mMG
Machine:   Type: Laptop System: Acer product: Aspire 5930 v: 0100 serial: <superuser required> 
           Mobo: Acer model: Eiger v: Rev serial: <superuser required> BIOS: Phoenix v: 1.21 date: 08/22/2008 
Memory:    RAM: total: 2.77 GiB used: 2.27 GiB (82.1%) 
           RAM Report: permissions: Unable to run dmidecode. Root privileges required. 
Graphics:  Device-1: NVIDIA G96CM [GeForce 9600M GT] driver: nouveau v: kernel 
           Device-2: Suyin Acer HD Crystal Eye webcam type: USB driver: uvcvideo 
           Display: x11 server: X.Org 1.20.11 driver: loaded: modesetting unloaded: fbdev,vesa resolution: 1280x800~60Hz 
           OpenGL: renderer: NV96 v: 3.3 Mesa 20.3.5 
Machine:   Type: Laptop System: Acer product: Aspire 5930 v: 0100 serial: [..removed..] 
           Mobo: Acer model: Eiger v: Rev serial: [..removed..]  BIOS: Phoenix v: 1.21 date: 08/22/2008 
Memory:    RAM: total: 2.79 GiB used: 1.37 GiB (49.3%) 
           Array-1: capacity: 4 GiB slots: 2 EC: None 
           Device-1: M1 size: 2 GiB speed: 667 MT/s 
           Device-2: M2 size: 2 GiB speed: 667 MT/s 
Graphics:  Device-1: NVIDIA G96CM [GeForce 9600M GT] driver: nouveau v: kernel 
           Device-2: Suyin Acer HD Crystal Eye webcam type: USB driver: uvcvideo 
           Display: server: X.org 1.20.11 driver: loaded: modesetting unloaded: fbdev,vesa tty: 178x46 
           OpenGL: renderer: NV96 v: 3.3 Mesa 20.3.5 

8<------8<------8<------8<------8<------8<------8<------8<

I traced the kernel function nouveau_do_suspend() [1] as a starting point in the analysis of nouveau kernel module for "Suspend To Ram" ("STR", from now on) . 

I recorded the graph of kernel functions called by nouveau_do_suspend() after starting STR and up to the beginning of resume (I used the ftrace kernel feature [2] [3]: see attached script named "tracecmd_functgraph_do_suspend.sh"). 

The ftrace recording were done in two conditions:

1. kernel started *with* parameter debug_slub=FPZ (in this case, STR *works*); note: the kernel parameter debug_slub=FPZ [4] take precedence over init_on_alloc kernel parameter (which is therefore disabled in this case) [5]. 
 
2. kernel started *without* debug_slub=FPZ (in this case, STR *fails* after resume); note: as previously mentioned, init_on_alloc and init_on_free are enabled in Debian kernel by default starting from release 11.0 codenamed "Bullseye" (CONFIG_INIT_ON_ALLOC_DEFAULT_ON and CONFIG_INIT_ON_FREE_DEFAULT_ON).
    
The analysis of time spent by involved kernel functions when STR fails, pointed me to function named ttm_resource_manager_force_list_clean() [6] (see attachment "ttm_resource_manager_force_list_clean_stack-trace.log" for a stack trace when it is called). 

I generated three files for each recording (using ftrace with graph plugin) for each of the two aforementioned conditions:

1. a graph of functions call from ftrace kernel buffer in /sys/kernel/debug/tracing/trace as text log file (see attachments SLUBDEBUG_NO.log and SLUBDEBUG_YES.log)

2. a binary raw ftrace recording (with .dat extension, for subsequent analyzes)

3. a profile of functions traced (with overall time spent by kernel in each function) as text log file (see see attachments SLUBDEBUG_NO_profile.log and SLUBDEBUG_YES_profile.log)

    
Then, I compared the profile of called function when STR fails to the one when STR works (see the attached script join_traces.sh). 

I generated a comparison file (see attachment SLUBDEBUG_NO_YES_profile_joined_parsed.ods) that I reworked using a spreadsheet (functions calls are sorted by total time in descending order, see attached file named SLUBDEBUG_NO_YES_profile_joined_parsed.ods). 

The most time consuming functions called by ttm_resource_manager_force_list_clean() when STR *fails* are (in descending order):

NUM.CALLS   TOTAL TIME     AVG TIME     TRACED FUNCTION (note: STR *fails*, time in microsec)
=========  ===========   ===========    =====================================================
(1)        11436084627   11436084627  >> ttm_resource_manager_force_list_clean() <<
(10)       11435990310    1143599031  ttm_mem_evict_first()
(10)       11435753724    1143575372  ttm_bo_evict()
(10)       11433592700    1143359270  ttm_bo_handle_move_mem()
(10)       11381408793    1138140879  nouveau_bo_move()
(6)        10379433308    1729905551  ttm_bo_move_memcpy()
(2256)     10367683828       4595604  memcpy_fromio()
(10)        1001119248     100111924  nouveau_bo_move_m2mf.constprop.0()
(10)         995441053      99544105  ttm_bo_move_accel_cleanup()
(38)         994614055      26174054  ttm_bo_wait()
(38)         994403882      26168523  >> dma_resv_wait_timeout_rcu() <<
(6)          993363983     165560663  dma_fence_wait_timeout()
(6)          993285402     165547567  dma_fence_default_wait()
(2596)       137206543         52853  __sysvec_apic_timer_interrupt()
(2596)       129927214         50049  hrtimer_interrupt()
(2595)        91271707         35172  __hrtimer_run_queues()
(2594)        74417961         28688  tick_sched_timer()
(15)          47315477       3154365  ttm_tt_populate()
(10)          47287068       4728706  nouveau_ttm_tt_populate()
(10)          47268974       4726897  ttm_populate_and_map_pages()
(10)          47068753       4706875  ttm_pool_populate()
(12)          30015421       2501285  ttm_page_pool_get_pages()
(4)           29847966       7461991  ttm_alloc_new_pages.isra.0()
(6)           25147308       4191218  ttm_set_pages_caching()
(30758)       17361172           564  rcu_read_lock_sched_held()
(2283)        16058655          7034  ttm_mem_global_alloc_page()
(2283)        14317071          6271  ttm_mem_global_alloc_zone()
[.. see the attached SLUBDEBUG_NO_YES_profile_joined_parsed.ods for a complete log ..]


The time consumed by the same previous funcions in the same order, but when STR *succeeds* is:

NUM.CALLS   TOTAL TIME     AVG TIME     TRACED FUNCTION (note: STR *fails*, time in microsec)
=========  ===========   ===========    =====================================================
(1)          501338026     501338026  >> ttm_resource_manager_force_list_clean() <<
(119)        500532532       4206155  ttm_mem_evict_first()
(119)        497645267       4181893  ttm_bo_evict()
(119)        479173966       4026671  ttm_bo_handle_move_mem()
(119)        186045609       1563408  nouveau_bo_move()
                                      >> ttm_bo_move_memcpy()
                                      >> memcpy_fromio()
(119)        183246904       1539889  nouveau_bo_move_m2mf.constprop.0()
(119)        122513397       1029524  ttm_bo_move_accel_cleanup()
(311)         82506104        265292  ttm_bo_wait()
(311)         80733662        259593  >> dma_resv_wait_timeout_rcu() <<
(80)          75522674        944033  dma_fence_wait_timeout()
(80)          74936515        936706  dma_fence_default_wait()
(34)           1565271         46037  __sysvec_apic_timer_interrupt()
(17)            810651         47685  hrtimer_interrupt()
(13)            474341         36487  __hrtimer_run_queues()
(10)            302783         30278  tick_sched_timer()
(119)        240333865       2019612  ttm_tt_populate()
(119)        240037719       2017123  nouveau_ttm_tt_populate()
(119)        239856293       2015599  ttm_populate_and_map_pages()
(119)        224154269       1883649  ttm_pool_populate()
(144)         25454062        176764  ttm_page_pool_get_pages()
(20)          22598464       1129923  ttm_alloc_new_pages.isra.0()
(4)           13932500       3483125  ttm_set_pages_caching()
(7141)         7724863          1081  rcu_read_lock_sched_held()
(26455)      187335946          7081  ttm_mem_global_alloc_page()
(26455)      168008932          6350  ttm_mem_global_alloc_zone()
[.. see the attached SLUBDEBUG_NO_YES_profile_joined_parsed.ods for a complete log ..]


As you can see, when STR succeeds (because init_on_alloc disabled) compared with when it fails (because init_on_allloc is enabled):

1. the count of function calls is much higher for ttm* and nouveau* functions and the total time spent by kernel for ttm_resource_manager_force_list_clean() and subsequent calls is lower (total time: 501338026 vs 11436084627): this let me suppose that code flows and it is not stuck somewhere (when init_on_alloc is disabled); this is confirmed in subsequent point c)

2. the ttm_bo_move_memcpy() and memcpy_fromio() are not executed: this let me suppose that the sequence of funcions call is somewhere different (after a common starting point): this is confirmed by subsequent analysis of kernel function's trace

3. the number of call for dma_resv_wait_timeout_rcu() is much lower (10 times) and the average duration of the call is much lower (259593 usec vs 26168523 usec): this makes me suppose that the kernel is stuck in accessing to shared DMA resources (GPU's memory ? deadlock ?) when suspension to RAM fails; 


This is the kernel graph trace starting from ttm_resource_manager_force_list_clean (when STR succeeds or fails):

8<------8<------8<------8<------8<------8<------8<------8<
# kprobe-perf -s 'p:ttm_resource_manager_force_list_clean'
Tracing kprobe ttm_resource_manager_force_list_clean. Ctrl-C to end.
    kworker/u4:9-4234    [001] ....  1317.593113: ttm_resource_manager_force_list_clean: (ttm_resource_manager_force_list_clean+0x0/0x1d0 [ttm])
    kworker/u4:9-4234    [001] ....  1317.593126: <stack trace>
 => ttm_resource_manager_force_list_clean
 => nouveau_do_suspend
 => nouveau_pmops_suspend
 => pci_pm_suspend
 => dpm_run_callback
 => __device_suspend
 => async_suspend
 => async_run_entry_fn
 => process_one_work
 => worker_thread
 => kthread
 => ret_from_fork
8<------8<------8<------8<------8<------8<------8<------8<

Then, I traced (using ftrace [2]) some specific kernel functions (see attachment tracecmd_function_do_suspend.sh) and probed the return values for some of them.

These are the functions called when suspend to RAM fails (with reference to source code at the URL https://elixir.bootlin.com/linux/v5.10.87/source):

8<------8<------8<------8<------8<------8<------8<------8<
nouveau_pmops_suspend [nouveau] -> nouveau_do_suspend
https://elixir.bootlin.com/linux/v5.10.87/source/drivers/gpu/drm/nouveau/nouveau_drm.c#L945
https://elixir.bootlin.com/linux/v5.10.87/source/drivers/gpu/drm/nouveau/nouveau_drm.c#L841
|
+-> nouveau_do_suspend -> ttm_bo_evict_mm
  https://elixir.bootlin.com/linux/v5.10.87/source/drivers/gpu/drm/nouveau/nouveau_drm.c#L860
  https://elixir.bootlin.com/linux/v5.10.87/source/drivers/gpu/drm/ttm/ttm_bo.c#L1308
  |
  +-> ttm_bo_evict_mm -> ttm_manager_type 
  https://elixir.bootlin.com/linux/v5.10.87/source/drivers/gpu/drm/ttm/ttm_bo.c#L1310
  https://elixir.bootlin.com/linux/v5.10.87/source/include/drm/ttm/ttm_bo_driver.h#L355
  |
  +-> ttm_bo_evict_mm -> ttm_resource_manager_force_list_clean 
  https://elixir.bootlin.com/linux/v5.10.87/source/drivers/gpu/drm/ttm/ttm_bo.c#L1322
  https://elixir.bootlin.com/linux/v5.10.87/source/drivers/gpu/drm/ttm/ttm_resource.c#L86
    |
    +-> ttm_resource_manager_force_list_clean -> ttm_mem_evict_first 
    https://elixir.bootlin.com/linux/v5.10.87/source/drivers/gpu/drm/ttm/ttm_resource.c#L107
    https://elixir.bootlin.com/linux/v5.10.87/source/drivers/gpu/drm/ttm/ttm_bo.c#L725
           ==> note: ttm_resource_manager_force_list_clean:104  <==
           ==>       here starts a loop for buffer eviction     <==
    https://elixir.bootlin.com/linux/v5.10.87/source/drivers/gpu/drm/ttm/ttm_resource.c#L104
      |
      +-> ttm_mem_evict_first -> ttm_bo_evict 
      https://elixir.bootlin.com/linux/v5.10.87/source/drivers/gpu/drm/ttm/ttm_bo.c#L789
      https://elixir.bootlin.com/linux/v5.10.87/source/drivers/gpu/drm/ttm/ttm_bo.c#L598
        |
        +-> ttm_bo_evict -> ttm_bo_handle_move_mem
        https://elixir.bootlin.com/linux/v5.10.87/source/drivers/gpu/drm/ttm/ttm_bo.c#L598
        https://elixir.bootlin.com/linux/v5.10.87/source/drivers/gpu/drm/ttm/ttm_bo.c#L235
            |
            +-> ttm_bo_handle_move_mem -> ttm_mem_io_free+0x0/0x40
            https://elixir.bootlin.com/linux/v5.10.87/source/drivers/gpu/drm/ttm/ttm_bo.c#L244
            https://elixir.bootlin.com/linux/v5.10.87/source/drivers/gpu/drm/ttm/ttm_bo.c#L1471
            |
            +-> ttm_bo_handle_move_mem -> tm_tt_set_placement_caching+0x0/0x110
            https://elixir.bootlin.com/linux/v5.10.87/source/drivers/gpu/drm/ttm/ttm_bo.c#L258
            https://elixir.bootlin.com/linux/v5.10.87/source/drivers/gpu/drm/ttm/ttm_tt.c#L195
            |
            +-> ttm_bo_handle_move_mem -> ttm_tt_populate+0x0/0x80
            https://elixir.bootlin.com/linux/v5.10.87/source/drivers/gpu/drm/ttm/ttm_bo.c#L265
            https://elixir.bootlin.com/linux/v5.10.87/source/drivers/gpu/drm/ttm/ttm_tt.c#L422
            |
            |-> ttm_bo_handle_move_mem -> nouveau_bo_move_ntfy 
            https://elixir.bootlin.com/linux/v5.10.87/source/drivers/gpu/drm/ttm/ttm_bo.c#L280
            https://elixir.bootlin.com/linux/v5.10.87/source/drivers/gpu/drm/nouveau/nouveau_bo.c#L1004
            |
            +-> ttm_bo_handle_move_mem -> nouveau_bo_move
            https://elixir.bootlin.com/linux/v5.10.87/source/drivers/gpu/drm/ttm/ttm_bo.c#L286
            https://elixir.bootlin.com/linux/v5.10.87/source/drivers/gpu/drm/nouveau/nouveau_bo.c#L1053
              |
              +-> nouveau_bo_move -> ttm_bo_wait
              https://elixir.bootlin.com/linux/v5.10.87/source/drivers/gpu/drm/nouveau/nouveau_bo.c#L1063
              https://elixir.bootlin.com/linux/v5.10.87/source/drivers/gpu/drm/ttm/ttm_bo.c#L1480
              |
              +-> nouveau_bo_move -> nouveau_bo_move_m2mf.constprop
              | https://elixir.bootlin.com/linux/v5.10.87/source/drivers/gpu/drm/nouveau/nouveau_bo.c#L1093
              | https://elixir.bootlin.com/linux/v5.10.87/source/drivers/gpu/drm/nouveau/nouveau_bo.c#L798
              | |
              | +-> nouveau_bo_move_m2mf.constprop -> ttm_bo_move_accel_cleanup
              | https://elixir.bootlin.com/linux/v5.10.87/source/drivers/gpu/drm/nouveau/nouveau_bo.c#L824
              | https://elixir.bootlin.com/linux/v5.10.87/source/drivers/gpu/drm/ttm/ttm_bo_util.c#L606
              | | |
              | | |   ** HARDWARE ASSISTED COPY **
              | | +-> ttm_bo_move_accel_cleanup -> dma_resv_add_excl_fence
              | | |   https://elixir.bootlin.com/linux/v5.10.87/source/drivers/gpu/drm/ttm/ttm_bo_util.c#L617
              | | |   https://elixir.bootlin.com/linux/v5.10.87/source/drivers/gpu/drm/ttm/ttm_bo_util.c#L617
              | | | 
HERE FLOW ==> | | +-> ttm_bo_move_accel_cleanup -> ttm_bo_wait_free_node
DIVERGES  ==> | | https://elixir.bootlin.com/linux/v5.10.87/source/drivers/gpu/drm/ttm/ttm_bo_util.c#L623
WITH INIT ==> | | https://elixir.bootlin.com/linux/v5.10.87/source/drivers/gpu/drm/ttm/ttm_bo_util.c#L529
ON ALLOC IS=> | | |  |
ACTIVATED ==> | | |  +-> ttm_bo_wait_free_node -> ttm_bo_wait
              | | |  https://elixir.bootlin.com/linux/v5.10.87/source/drivers/gpu/drm/ttm/ttm_bo_util.c#L533
              | | |    |   
              | | |    +-> ttm_bo_wait -> dma_resv_wait_timeout_rcu 
              | | |    https://elixir.bootlin.com/linux/v5.10.87/source/drivers/gpu/drm/ttm/ttm_bo.c#L1492
              | | |    https://elixir.bootlin.com/linux/v5.10.87/source/drivers/dma-buf/dma-resv.c#L532
              | | |    |
              | | |    +-> dma_resv_wait_timeout_rcu-> dma_fence_wait_timeout
              | | |    https://elixir.bootlin.com/linux/v5.10.87/source/drivers/dma-buf/dma-resv.c#L594
              | | |      |
              | | |      +-> dma_fence_wait_timeout -> dma_fence_default_wait
              | | |      https://elixir.bootlin.com/linux/v5.10.87/source/drivers/dma-buf/dma-fence.c#L425
              | | |      https://elixir.bootlin.com/linux/v5.10.87/source/drivers/dma-buf/dma-fence.c#L674
              | | | 
              | | |      ** FAILING ACCELERATED COPY (TIMEOUT IN WAITING RCU) **
              | | | 
              | | |
              | +-+<- dma_fence_default_wait -> ttm_bo_move_accel_cleanup
              | |     https://elixir.bootlin.com/linux/v5.10.87/source/drivers/gpu/drm/ttm/ttm_bo_util.c#L626
              | |     return arg1=0xfffffff0 (-EBUSY)
              | |    
              +-+<- ttm_bo_move_accel_cleanup -> nouveau_bo_move_m2mf.constprop
              |     https://elixir.bootlin.com/linux/v5.10.87/source/drivers/gpu/drm/nouveau/nouveau_bo.c#L824
              |     return arg1=0xfffffff0 (-EBUSY)
              |
              +<-- nouveau_bo_move_m2mf.constprop -> nouveau_bo_move 
              |     https://elixir.bootlin.com/linux/v5.10.87/source/drivers/gpu/drm/nouveau/nouveau_bo.c#L1093
              |     https://elixir.bootlin.com/linux/v5.10.87/source/drivers/gpu/drm/nouveau/nouveau_bo.c#L1097)
              |     return arg1=0xfffffff0 (-EBUSY)
              |
              |   ** FALL BACK TO SOFTWARE COPY **
              +-> nouveau_bo_move -> ttm_bo_wait
              https://elixir.bootlin.com/linux/v5.10.87/source/drivers/gpu/drm/nouveau/nouveau_bo.c#L1101
              |
              +-> nouveau_bo_move -> ttm_bo_move_memcpy
              https://elixir.bootlin.com/linux/v5.10.87/source/drivers/gpu/drm/nouveau/nouveau_bo.c#L1103
              https://elixir.bootlin.com/linux/v5.10.87/source/drivers/gpu/drm/ttm/ttm_bo_util.c#L217
                |
                +-> ttm_bo_move_memcpy -> ttm_bo_wait
                https://elixir.bootlin.com/linux/v5.10.87/source/drivers/gpu/drm/ttm/ttm_bo_util.c#L234
                |
                +-> ttm_bo_move_memcpy -> ttm_resource_ioremap 
                https://elixir.bootlin.com/linux/v5.10.87/source/drivers/gpu/drm/ttm/ttm_bo_util.c#L238
                |
                +-> ttm_bo_move_memcpy -> ttm_resource_ioremap 
                https://elixir.bootlin.com/linux/v5.10.87/source/drivers/gpu/drm/ttm/ttm_bo_util.c#L241
                |
                +-> ttm_bo_move_memcpy -> ttm_tt_populate
                https://elixir.bootlin.com/linux/v5.10.87/source/drivers/gpu/drm/ttm/ttm_bo_util.c#L265
                |
                +-> HERE START A LOOP
                +-> https://elixir.bootlin.com/linux/v5.10.87/source/drivers/gpu/drm/ttm/ttm_bo_util.c#L279
                | |
                | +-> ttm_bo_move_memcpy -> ttm_copy_io_ttm_page 
                | https://elixir.bootlin.com/linux/v5.10.87/source/drivers/gpu/drm/ttm/ttm_bo_util.c#L289
                | https://elixir.bootlin.com/linux/v5.10.87/source/drivers/gpu/drm/ttm/ttm_bo_util.c#L173
                |   |
                |   +-> memcpy_fromio [REPEATED IN LOOP]
                |   https://elixir.bootlin.com/linux/v5.10.87/source/drivers/gpu/drm/ttm/ttm_bo_util.c#L289
                |    
                |     ***********************************************
                |     ** SOFTWARE COPY FAILS AFTER MANY ITERATIONS **
                |     **        OF memcpy_fromio()                 **
                |     ***********************************************
                |        
                |<--- (nouveau_bo_move+0x346/0x740 [nouveau] <- ttm_bo_move_memcpy) arg1=0xffffffea
                |
            +--<+-nouveau_bo_move -> ttm_bo_handle_move_mem 
            |     https://elixir.bootlin.com/linux/v5.10.87/source/drivers/gpu/drm/nouveau/nouveau_bo.c#L1065
            |     https://elixir.bootlin.com/linux/v5.10.87/source/drivers/gpu/drm/ttm/ttm_bo.c#L287
            |     return arg1=0xfffffff0 (-EBUSY)
            |    
    +--<----+<--ttm_bo_handle_move_mem -> ttm_bo_evict 
    |           https://elixir.bootlin.com/linux/v5.10.87/source/drivers/gpu/drm/ttm/ttm_bo.c#L304
    |           https://elixir.bootlin.com/linux/v5.10.87/source/drivers/gpu/drm/ttm/ttm_bo.c#L637
    |           return arg1=0xfffffff0 (-EBUSY)
    |            
    |                       ************************************
    |                       ** Buffer eviction failed MESSAGE **
    |                       ************************************
    |
    +-> ttm_resource_manager_force_list_clean -> ttm_mem_evict_first
    |   https://elixir.bootlin.com/linux/v5.10.87/source/drivers/gpu/drm/ttm/ttm_resource.c#L86
    |  
    +-> ttm_mem_evict_first -> tm_bo_evict_mm
    |   https://elixir.bootlin.com/linux/v5.10.87/source/drivers/gpu/drm/ttm/ttm_bo.c#L790
    |
    +-> tm_bo_evict_mm -> nouveau_do_suspend
        https://elixir.bootlin.com/linux/v5.10.87/source/drivers/gpu/drm/ttm/ttm_bo.c#L1308
8<------8<------8<------8<------8<------8<------8<------8<

The function ttm_bo_wait_free_node() is where the flow diverges between suspend to ram success or failure. 

When the kernel feature CONFIG_INIT_ON_ALLOC_DEFAULT_ON is active:
1. the function dma_resv_wait_timeout_rcu() times out in ttm_bo_wait()
2. then, the function ttm_bo_wait() return 0xfffffff0 (corresponding to -EBUSY) to its callers
    
The ftrace log reports (excerpt):

8<------8<------8<------8<------8<------8<------8<------8<

  #                         _-----=> irqs-off
  #                        / _----=> need-resched
  #                        | / _---=> hardirq/softirq
  #                        || / _--=> preempt-depth
  #                        ||| /     delay
  #     TASK-PID     CPU#  ||||   TIMESTAMP  FUNCTION
  #        | |         |   ||||      |         |
  kworker/u4:11-2068 [000] d..2 210.090141: probe4: (ttm_bo_move_accel_cleanup+0x1ba/0x510 [ttm] <- ttm_bo_wait) arg1=0xfffffff0 
  kworker/u4:11-2068 [000] d..2 210.090144: probe3: (nouveau_bo_move_m2mf.constprop.0+0x1a5/0x200 [nouveau] <- ttm_bo_move_accel_cleanup) arg1=0xfffffff0 
  kworker/u4:11-2068 [000] d..2 210.090147: probe1: (nouveau_bo_move+0x39c/0x740 [nouveau] <- nouveau_bo_move_m2mf.constprop.0) arg1=0xfffffff0 

8<------8<------8<------8<------8<------8<------8<------8<

This is the code for function ttm_bo_wait() in drivers/gpu/drm/ttm/ttm_bo.c:

   [..]
   1480 int ttm_bo_wait(struct ttm_buffer_object *bo,
   1481                 bool interruptible, bool no_wait)
   1482 {
   1483         long timeout = 15 * HZ;
   1484 
   1485         if (no_wait) {
   1486                 if (dma_resv_test_signaled_rcu(bo->base.resv, true))
   1487                         return 0;
   1488                 else
   1489                         return -EBUSY;
   1490         }
   1491 
>> 1492         timeout = dma_resv_wait_timeout_rcu(bo->base.resv, true,
   1493                                                       interruptible, timeout);
   1494         if (timeout < 0)
   1495                 return timeout;
   1496 
>> 1497         if (timeout == 0)
>> 1498                 return -EBUSY;
   1499 
   1500         dma_resv_add_excl_fence(bo->base.resv, NULL);
   1501         return 0;
   1502 }
   1503 EXPORT_SYMBOL(ttm_bo_wait);
   [..]

   
The -EBUSY value is returned from ttm_bo_wait() to ttm_bo_wait_free_node(), then to ttm_bo_move_accel_cleanup(), then to nouveau_bo_move_m2mf.constprop(), then to nouveau_bo_move() at line 1093 (see code below) and this activates in nouveau_bo_move() the fall back procedure from "hardware assisted copy" to "software copy" at line 1103 in /drivers/gpu/drm/nouveau/nouveau_bo.c (see code below): 

   [..]                        
   1053 nouveau_bo_move(struct ttm_buffer_object *bo, bool evict,
   1054                 struct ttm_operation_ctx *ctx,
   1055                 struct ttm_resource *new_reg)
   1056 {
   1057         struct nouveau_drm *drm = nouveau_bdev(bo->bdev);
   1058         struct nouveau_bo *nvbo = nouveau_bo(bo);
   1059         struct ttm_resource *old_reg = &bo->mem;
   1060         struct nouveau_drm_tile *new_tile = NULL;
   1061         int ret = 0;
   1062 
   1063         ret = ttm_bo_wait(bo, ctx->interruptible, ctx->no_wait_gpu);
   1064         if (ret)
   1065                 return ret;
   1066 
   1067         if (nvbo->pin_refcnt)
   1068                 NV_WARN(drm, "Moving pinned object %p!\n", nvbo);
   1069 
   1070         if (drm->client.device.info.family < NV_DEVICE_INFO_V0_TESLA) {
   1071                 ret = nouveau_bo_vm_bind(bo, new_reg, &new_tile);
   1072                 if (ret)
   1073                         return ret;
   1074         }
   1075 
   1076         /* Fake bo copy. */
   1077         if (old_reg->mem_type == TTM_PL_SYSTEM && !bo->ttm) {
   1078                 ttm_bo_move_null(bo, new_reg);
   1079                 goto out;
   1080         }
   1081 
>> 1082         /* Hardware assisted copy. */
   1083         if (drm->ttm.move) {
   1084                 if (new_reg->mem_type == TTM_PL_SYSTEM)
   1085                         ret = nouveau_bo_move_flipd(bo, evict,
   1086                                                     ctx->interruptible,
   1087                                                     ctx->no_wait_gpu, new_reg);
   1088                 else if (old_reg->mem_type == TTM_PL_SYSTEM)
   1089                         ret = nouveau_bo_move_flips(bo, evict,
   1090                                                     ctx->interruptible,
   1091                                                     ctx->no_wait_gpu, new_reg);
   1092                 else
>> 1093                         ret = nouveau_bo_move_m2mf(bo, evict,
>> 1094                                                    ctx->interruptible,
>> 1095                                                    ctx->no_wait_gpu,
>> new_reg);
   1096                 if (!ret)
   1097                         goto out;
   1098         }
   1099 
>> 1100         /* Fallback to software copy. */
>> 1101         ret = ttm_bo_wait(bo, ctx->interruptible, ctx->no_wait_gpu);
>> 1102         if (ret == 0)
>> 1103                 ret = ttm_bo_move_memcpy(bo, ctx, new_reg);
   1104 
   1105 out:
   1106         if (drm->client.device.info.family < NV_DEVICE_INFO_V0_TESLA) {
   1107                 if (ret)
   1108                         nouveau_bo_vm_cleanup(bo, NULL, &new_tile);
   1109                 else
   1110                         nouveau_bo_vm_cleanup(bo, new_tile, &nvbo->tile);
   1111         }
   1112 
>> 1113         return ret;
   1114 }
   [..]

The loop inside the function ttm_resource_manager_force_list_clean() (at line 104 in drivers/gpu/drm/ttm/resource.c) makes some iterations, then it stops when the function ttm_bo_handle_move_mem() fails and ttm_mem_evict_first() return an error value:

    [..]
     78  * ttm_resource_manager_force_list_clean
     79  *
     80  * @bdev - device to use
     81  * @man - manager to use
     82  *
     83  * Force all the objects out of a memory manager until clean.
     84  * Part of memory manager cleanup sequence.
     85  */
     86 int ttm_resource_manager_force_list_clean(struct ttm_bo_device *bdev,
     87                                           struct ttm_resource_manager *man)
     88 {
     89         struct ttm_operation_ctx ctx = {
     90                 .interruptible = false,
     91                 .no_wait_gpu = false,
     92                 .flags = TTM_OPT_FLAG_FORCE_ALLOC
     93         };
     94         struct ttm_bo_global *glob = &ttm_bo_glob;
     95         struct dma_fence *fence;
     96         int ret;
     97         unsigned i;
     98 
     99         /*
    100          * Can't use standard list traversal since we're unlocking.
    101          */
    102 
    103         spin_lock(&glob->lru_lock);
 >> 104         for (i = 0; i < TTM_MAX_BO_PRIORITY; ++i) {
    105                 while (!list_empty(&man->lru[i])) {
    106                         spin_unlock(&glob->lru_lock);
 >> 107                         ret = ttm_mem_evict_first(bdev, man, NULL, &ctx,
 >> 108                                                   NULL);
 >> 109                         if (ret)
 >> 110                                 return ret;
    111                         spin_lock(&glob->lru_lock);
    112                 }
    113         }
    114         spin_unlock(&glob->lru_lock);
    115 
    116         spin_lock(&man->move_lock);
    117         fence = dma_fence_get(man->move);
    118         spin_unlock(&man->move_lock);
    119 
    120         if (fence) {
    121                 ret = dma_fence_wait(fence, false);
    122                 dma_fence_put(fence);
    123                 if (ret)
    124                         return ret;
    125         }
    126 
    127         return 0;
    128 }
    129 EXPORT_SYMBOL(ttm_resource_manager_force_list_clean);
    [..]

The "Buffer eviction failed" error message is returned by function ttm_bo_evict() at line 637 in /drivers/gpu/drm/ttm/ttm_bo.c when ttm_bo_handle_move_mem() fails:

   [..]
    598 static int ttm_bo_evict(struct ttm_buffer_object *bo,
    599                         struct ttm_operation_ctx *ctx)
    600 {
    601         struct ttm_bo_device *bdev = bo->bdev;
    602         struct ttm_resource evict_mem;
    603         struct ttm_placement placement;
    604         int ret = 0;
    605 
    606         dma_resv_assert_held(bo->base.resv);
    607 
    608         placement.num_placement = 0;
    609         placement.num_busy_placement = 0;
    610         bdev->driver->evict_flags(bo, &placement);
    611 
    612         if (!placement.num_placement && !placement.num_busy_placement) {
    613                 ttm_bo_wait(bo, false, false);
    614 
    615                 ttm_bo_cleanup_memtype_use(bo);
    616                 return ttm_tt_create(bo, false);
    617         }
    618 
    619         evict_mem = bo->mem;
    620         evict_mem.mm_node = NULL;
    621         evict_mem.bus.offset = 0;
    622         evict_mem.bus.addr = NULL;
    623 
    624         ret = ttm_bo_mem_space(bo, &placement, &evict_mem, ctx);
    625         if (ret) {
    626                 if (ret != -ERESTARTSYS) {
    627                         pr_err("Failed to find memory space for buffer 0x%p eviction\n",
    628                                bo);
    629                         ttm_bo_mem_space_debug(bo, &placement);
    630                 }
    631                 goto out;
    632         }
    633 
>>  634         ret = ttm_bo_handle_move_mem(bo, &evict_mem, true, ctx);
>>  635         if (unlikely(ret)) {
>>  636                 if (ret != -ERESTARTSYS)
>>  637                         pr_err("Buffer eviction failed\n");)
>>  638                 ttm_resource_free(bo, &evict_mem);
>>  639         }
    640 out:
    641         return ret;
    642 }
    [..]

So, this is what I've found so far tracing the kernel.

The activation of the debug features [14][15] (see attachment vanilla-kernel-build-with-debian-config.txt) didn't show useful error or warning messages in system logs about memory leaks or kernel oops/warnings related to failure of GPU buffers eviction.

The only interesting message (so far) appears in kernel 5.16, when, during suspension to RAM, a warning about a "possible circular locking dependency detected" is shown (see attachment linux-5.16-lock-circular-dump.log.txt). But this warning appears regardless the computer is activated with or without activation of init_on_alloc and init_on_free kernel features: it appears both in success or in failure of suspension to RAM.

Now, just a small recap. 

The issue is triggered by the activation of init_on_alloc and init_on_free kernel feature (that are activated by default in the lasted stable Debian release, for example).

I suspect that, during suspend to RAM, the issue is caused by:

1. failure of buffers eviction for the nvidia GPU (the copy of video memory and GPU's pipelines to RAM, I suppose) 

2. then, the kernel suspends the computer anyway, leaving the internal data structures of the nouveau GPU module (and/or of the DRM module) in an inconsistent state

3. then, if the graphical console is active/activated (e.g. ALT-F7), the nouveau driver instantly fails (probably due to the inconsistent state of its internal data structures)

4. then, the kernel hangs the computer with multiple repeated errors, making the system unusable.

The issue is constantly replicable, therefore I can carry out further tests, if you wish to suggest them.

I'm still looking for clues, comments and advices to understand **why buffer eviction fails**: I would like to find (or help someone else to find) a stable fix for this issue.

Thanks for any advice or comment.

8<------8<------8<------8<------8<------8<------8<------8<

References:
    
1.  function nouveau_do_suspend(), see https://elixir.bootlin.com/linux/v5.10.87/source/drivers/gpu/drm/nouveau/
2.  https://www.kernel.org/doc/html/v5.10/trace/ftrace.html ; https://manpages.debian.org/bullseye/trace-cmd/trace-cmd.1.en.html
3.  https://jvns.ca/blog/2017/03/19/getting-started-with-ftrace/
4.  https://www.kernel.org/doc/html/v5.10/vm/slub.html
5.  https://elixir.bootlin.com/linux/v5.10.50/source/mm/page_alloc.c#L191
6.  function ttm_resource_manager_force_list_clean(): https://elixir.bootlin.com/linux/v5.10.87/source/drivers/gpu/drm/ttm/ttm_resource.c#L86
7.  function ttm_bo_move_accel_cleanup(), see https://elixir.bootlin.com/linux/v5.10.50/source/drivers/gpu/drm/ttm/ttm_bo_util.c#L606
8.  function ttm_bo_wait, see https://elixir.bootlin.com/linux/v5.10.50/source/drivers/gpu/drm/ttm/ttm_bo_util.c#L529
9.  function dma_resv_wait_timeout_rcu, see https://elixir.bootlin.com/linux/v5.10.50/source/drivers/dma-buf/dma-resv.c#L532
10. function dma_fence_default_wait, see https://elixir.bootlin.com/linux/v5.10.50/source/drivers/dma-buf/dma-fence.c#L674
11. https://bugs.debian.org/cgi-bin/bugreport.cgi?bug=989705#5
12. https://gitlab.freedesktop.org/xorg/driver/xf86-video-nouveau/-/issues/547
13. https://bugzilla.kernel.org/show_bug.cgi?id=213617
14. https://www.kernel.org/doc/html/v4.10/dev-tools/kmemleak.html
15. https://www.kernel.org/doc/Documentation/locking/lockdep-design.txt
Comment 9 computer.enthusiastic@gmail.com 2022-02-02 16:59:39 UTC
Created attachment 300379 [details]
compressed ftrace log (with function_graph plugin) for suspend to ram with kernel started with init_on_alloc active
Comment 10 computer.enthusiastic@gmail.com 2022-02-02 17:00:57 UTC
Created attachment 300380 [details]
compressed ftrace log (with function_graph plugin) for suspend to ram with kernel started with init_on_alloc disabled
Comment 11 computer.enthusiastic@gmail.com 2022-02-02 17:02:28 UTC
Created attachment 300381 [details]
bash script to trace specific kernel function using ftrace
Comment 12 computer.enthusiastic@gmail.com 2022-02-02 17:03:05 UTC
Created attachment 300382 [details]
bash script to trace graph of called functions using ftrace
Comment 13 computer.enthusiastic@gmail.com 2022-02-02 17:03:50 UTC
Created attachment 300383 [details]
bash script to join ftrace profiles
Comment 14 computer.enthusiastic@gmail.com 2022-02-02 17:05:38 UTC
Created attachment 300384 [details]
bash coomands used to rebuild a vanilla kernel with debug options activated
Comment 15 computer.enthusiastic@gmail.com 2022-04-29 08:23:54 UTC
Hi,

Just a little update, in the hope it can be useful to give more hints to anyone interested in spotting a solution.

The issue affects kernel version 5.17.4, too. 

Suspension to ram always fails at resume with INIT_ON_ALLOC=1 and nouveau.noaccel=0 (both as per default in Debian Kernel since version 5.10; kernel 4.19 was not affected).

The INIT_ON_ALLOC=0 and INIT_ON_FREE=0 kernel boot parameters seems to be still effective as workaround to prevent the system's freeze (garbled screen) when resuming from "suspension to ram" (STR) .
 
As already reported, another interesting clue of the issue is the length of the time to complete the suspension to ram, when the issue occurs. 

In the following tables are reported times to complete suspension test using the /sys/power/pm_test facility:

1) I followed instruction of "Debugging hibernation and suspend" at https://www.kernel.org/doc/Documentation/power/basic-pm-debugging.txt 

2) detailed logs are available in the attachment named "debug_suspend.log.zip"

3) freezer, devices, platform, processor and core tests are done in sequential order without reboot with the same kernel

These are the elapsed times:

                Kernel version 5.10.0-13-amd64 (debian)
                with boot parameter nouveau.noaccel=0
                Time to complete the STR test
TEST > /sys/    ================================
power/pm_test   INIT_ON_ALLOC=0  INIT_ON_ALLOC=1 
=============   ===============  ===============
devices          0 min 28.6 sec   3 min  7.9 sec
platform         0 min 31.3 sec  13 min  4.6 sec
processors       0 min 29.1 sec   0 min 58.0 sec
core             0 min 29.3 sec   0 min 59.0 sec
=============   ===============  ===============

                Kernel version 5.16.14-amd64 (vanilla)
                with boot parameter nouveau.noaccel=0
                Time to complete the STR test
TEST > /sys/    ================================
power/pm_test   INIT_ON_ALLOC=0  INIT_ON_ALLOC=1 
=============   ===============  ===============
devices          0 min 30.9 sec   4 min 26.1 sec
platform         0 min 31.1 sec   5 min 34.8 sec
processors       0 min 28.1 sec   0 min 43.1 sec
core             0 min 27.5 sec   0 min 43.0 sec
=============   ===============  ===============
                
                Kernel version 5.17.4-amd64 (vanilla)
                with boot parameter nouveau.noaccel=0
                Time to complete the STR test
TEST > /sys/    ================================
power/pm_test   INIT_ON_ALLOC=0  INIT_ON_ALLOC=1 
=============   ===============  ===============
devices         0 min 27.4 sec   2 min 52.9 sec
platform        0 min 28.0 sec   2 min 44.0 sec
processors      0 min 28.0 sec   1 min 13.3 sec
core            0 min 27.8 sec   0 min 42.7 sec
=============   ===============  ===============

As you can see, the time to complete the "devices" test is always longer (often much longer) when INIT_ON_ALLOC=1 (1=activated) compared to when INIT_ON_ALLOC=0 (0=disabled). Note that the "devices" test is the first one calling the kernel modules' callbacks for power management. 

This is quite interesting: the elongation of the wait for suspension to ram links this issue with another similar issue ("[GT216][Linux 5.16.4] Long stalls in dma_fence_default_wait") [1].

The issue in the current bug report looks like arises in the nouveau kernel driver: the suspension to ram always succeeds (with a "normal" short wait, 8-15 secs) if the nouveau's acceleration feature is disabled at boot (using nouveau.noaccel=1) regardless of INIT_ON_ALLOC activation. On the contrary, suspension to RAM always fails at resume if the nouveau's acceleration feature is activated (nouveau.noaccel=0) and INIT_ON_ALLOC feature is activated, with longer times to complete tests form suspension to ram.

Besides that, the following warning is still dumped in kernel logs at suspension time or during "devices" test with kernel 5.17.x and 5.16.x (but not with 5.10.0-13) (see full log in attachment named "debug_suspend.log.zip") if they are compiled with CONFIG_DEBUG_LOCKDEP activated, for example:

Apr 26 22:22:52 debian kernel: [  631.659392] ======================================================
Apr 26 22:22:52 debian kernel: [  631.659402] WARNING: possible circular locking dependency detected
Apr 26 22:22:52 debian kernel: [  631.659412] 5.17.4-freeze #1 Tainted: G        W        
Apr 26 22:22:52 debian kernel: [  631.659421] ------------------------------------------------------
Apr 26 22:22:52 debian kernel: [  631.659431] kworker/u4:11/1888 is trying to acquire lock:
Apr 26 22:22:52 debian kernel: [  631.659441] ffff88800fe10548 (&cli->mutex){+.+.}-{3:3}, at: nouveau_bo_move+0x175/0x980 [nouveau]
Apr 26 22:22:52 debian kernel: [  631.659623] 
Apr 26 22:22:52 debian kernel: [  631.659623] but task is already holding lock:
Apr 26 22:22:52 debian kernel: [  631.659633] ffff88800bed81b0 (reservation_ww_class_mutex){+.+.}-{3:3}, at: ttm_bo_evict_swapout_allowable+0x31/0x150 [ttm]
Apr 26 22:22:52 debian kernel: [  631.659664] 
Apr 26 22:22:52 debian kernel: [  631.659664] which lock already depends on the new lock.
Apr 26 22:22:52 debian kernel: [  631.659664] 
Apr 26 22:22:52 debian kernel: [  631.659676] 
Apr 26 22:22:52 debian kernel: [  631.659676] the existing dependency chain (in reverse order) is:
Apr 26 22:22:52 debian kernel: [  631.659687] 
Apr 26 22:22:52 debian kernel: [  631.659687] -> #1 (reservation_ww_class_mutex){+.+.}-{3:3}:
Apr 26 22:22:52 debian kernel: [  631.659704]        __ww_mutex_lock.constprop.0+0xb7/0x10e0
Apr 26 22:22:52 debian kernel: [  631.659720]        ww_mutex_lock+0x38/0x90
Apr 26 22:22:52 debian kernel: [  631.659730]        nouveau_bo_pin+0x40/0x380 [nouveau]
Apr 26 22:22:52 debian kernel: [  631.659882]        nouveau_channel_prep+0x12b/0x300 [nouveau]
Apr 26 22:22:52 debian kernel: [  631.660030]        nouveau_channel_new+0x52/0x730 [nouveau]
Apr 26 22:22:52 debian kernel: [  631.660178]        nouveau_abi16_ioctl_channel_alloc+0x112/0x3b0 [nouveau]
Apr 26 22:22:52 debian kernel: [  631.660327]        drm_ioctl_kernel+0xb7/0x150 [drm]
Apr 26 22:22:52 debian kernel: [  631.660392]        drm_ioctl+0x23b/0x440 [drm]
Apr 26 22:22:52 debian kernel: [  631.660441]        nouveau_drm_ioctl+0x55/0xa0 [nouveau]
Apr 26 22:22:52 debian kernel: [  631.660593]        __x64_sys_ioctl+0x83/0xb0
Apr 26 22:22:52 debian kernel: [  631.660604]        do_syscall_64+0x3b/0xc0
Apr 26 22:22:52 debian kernel: [  631.660614]        entry_SYSCALL_64_after_hwframe+0x44/0xae
Apr 26 22:22:52 debian kernel: [  631.660625] 
Apr 26 22:22:52 debian kernel: [  631.660625] -> #0 (&cli->mutex){+.+.}-{3:3}:
Apr 26 22:22:52 debian kernel: [  631.660640]        __lock_acquire+0x12e8/0x2260
Apr 26 22:22:52 debian kernel: [  631.660652]        lock_acquire+0xd7/0x310
Apr 26 22:22:52 debian kernel: [  631.660662]        __mutex_lock+0x90/0x900
Apr 26 22:22:52 debian kernel: [  631.660672]        nouveau_bo_move+0x175/0x980 [nouveau]
Apr 26 22:22:52 debian kernel: [  631.660825]        ttm_bo_handle_move_mem+0xe3/0x240 [ttm]
Apr 26 22:22:52 debian kernel: [  631.660842]        ttm_mem_evict_first+0x3e3/0x600 [ttm]
Apr 26 22:22:52 debian kernel: [  631.660858]        ttm_resource_manager_evict_all+0x9c/0x200 [ttm]
Apr 26 22:22:52 debian kernel: [  631.660877]        nouveau_do_suspend+0x82/0x180 [nouveau]
Apr 26 22:22:52 debian kernel: [  631.661030]        nouveau_pmops_suspend+0x2e/0x70 [nouveau]
Apr 26 22:22:52 debian kernel: [  631.661182]        pci_pm_suspend+0x74/0x160
Apr 26 22:22:52 debian kernel: [  631.661193]        dpm_run_callback+0x62/0xd0
Apr 26 22:22:52 debian kernel: [  631.661204]        __device_suspend+0x110/0x420
Apr 26 22:22:52 debian kernel: [  631.661215]        async_suspend+0x1e/0x60
Apr 26 22:22:52 debian kernel: [  631.661224]        async_run_entry_fn+0x30/0x130
Apr 26 22:22:52 debian kernel: [  631.661236]        process_one_work+0x288/0x5d0
Apr 26 22:22:52 debian kernel: [  631.661246]        worker_thread+0x55/0x3b0
Apr 26 22:22:52 debian kernel: [  631.661254]        kthread+0xf2/0x120
Apr 26 22:22:52 debian kernel: [  631.661264]        ret_from_fork+0x22/0x30
Apr 26 22:22:52 debian kernel: [  631.661275] 
Apr 26 22:22:52 debian kernel: [  631.661275] other info that might help us debug this:
Apr 26 22:22:52 debian kernel: [  631.661275] 
Apr 26 22:22:52 debian kernel: [  631.661287]  Possible unsafe locking scenario:
Apr 26 22:22:52 debian kernel: [  631.661287] 
Apr 26 22:22:52 debian kernel: [  631.661296]        CPU0                    CPU1
Apr 26 22:22:52 debian kernel: [  631.661304]        ----                    ----
Apr 26 22:22:52 debian kernel: [  631.661311]   lock(reservation_ww_class_mutex);
Apr 26 22:22:52 debian kernel: [  631.661321]                                lock(&cli->mutex);
Apr 26 22:22:52 debian kernel: [  631.661332]                                lock(reservation_ww_class_mutex);
Apr 26 22:22:52 debian kernel: [  631.661345]   lock(&cli->mutex);
Apr 26 22:22:52 debian kernel: [  631.661354] 
Apr 26 22:22:52 debian kernel: [  631.661354]  *** DEADLOCK ***
Apr 26 22:22:52 debian kernel: [  631.661354] 
Apr 26 22:22:52 debian kernel: [  631.661363] 4 locks held by kworker/u4:11/1888:
Apr 26 22:22:52 debian kernel: [  631.661372]  #0: ffff888004851d48 ((wq_completion)events_unbound){+.+.}-{0:0}, at: process_one_work+0x208/0x5d0
Apr 26 22:22:52 debian kernel: [  631.661395]  #1: ffffc90004123e70 ((work_completion)(&entry->work)){+.+.}-{0:0}, at: process_one_work+0x208/0x5d0
Apr 26 22:22:52 debian kernel:since [  631.661418]  #2: ffff888005cf3260 (&dev->mutex){....}-{3:3}, at: __device_suspend+0xda/0x420
Apr 26 22:22:52 debian kernel: [  631.661440]  #3: ffff88800bed81b0 (reservation_ww_class_mutex){+.+.}-{3:3}, at: ttm_bo_evict_swapout_allowable+0x31/0x150 [ttm]
Apr 26 22:22:52 debian kernel: [  631.661471] 
Apr 26 22:22:52 debian kernel: [  631.661471] stack backtrace:
Apr 26 22:22:52 debian kernel: [  631.661479] CPU: 0 PID: 1888 Comm: kworker/u4:11 Tainted: G        W         5.17.4-freeze #1
Apr 26 22:22:52 debian kernel: [  631.661494] Hardware name: Acer            Aspire 5930                    /Eiger                          , BIOS V1.21           08/22/2008
Apr 26 22:22:52 debian kernel: [  631.661512] Workqueue: events_unbound async_run_entry_fn
Apr 26 22:22:52 debian kernel: [  631.661525] Call Trace:
Apr 26 22:22:52 debian kernel: [  631.661532]  <TASK>
Apr 26 22:22:52 debian kernel: [  631.661540]  dump_stack_lvl+0x59/0x73
Apr 26 22:22:52 debian kernel: [  631.661553]  check_noncircular+0xf3/0x110
Apr 26 22:22:52 debian kernel: [  631.661566]  ? sched_clock_local+0x12/0x80
Apr 26 22:22:52 debian kernel: [  631.661578]  __lock_acquire+0x12e8/0x2260
Apr 26 22:22:52 debian kernel: [  631.661594]  lock_acquire+0xd7/0x310
Apr 26 22:22:52 debian kernel: [  631.661604]  ? nouveau_bo_move+0x175/0x980 [nouveau]
Apr 26 22:22:52 debian kernel: [  631.661758]  ? lock_is_held_type+0xea/0x140
Apr 26 22:22:52 debian kernel: [  631.661770]  ? nouveau_bo_move+0x175/0x980 [nouveau]
Apr 26 22:22:52 debian kernel: [  631.661923]  ? nouveau_bo_move+0x175/0x980 [nouveau]
Apr 26 22:22:52 debian kernel: [  631.662076]  __mutex_lock+0x90/0x900
Apr 26 22:22:52 debian kernel: [  631.662086]  ? nouveau_bo_move+0x175/0x980 [nouveau]
Apr 26 22:22:52 debian kernel: [  631.662244]  nouveau_bo_move+0x175/0x980 [nouveau]
Apr 26 22:22:52 debian kernel: [  631.662399]  ? _raw_spin_unlock_irqrestore+0x3e/0x60
Apr 26 22:22:52 debian kernelsince: [  631.662410]  ? ttm_pool_alloc+0x3ab/0x500 [ttm]
Apr 26 22:22:52 debian kernel: [  631.662432]  ttm_bo_handle_move_mem+0xe3/0x240 [ttm]
Apr 26 22:22:52 debian kernel: [  631.662452]  ttm_mem_evict_first+0x3e3/0x600 [ttm]
Apr 26 22:22:52 debian kernel: [  631.662475]  ttm_resource_manager_evict_all+0x9c/0x200 [ttm]
Apr 26 22:22:52 debian kernel: [  631.662497]  nouveau_do_suspend+0x82/0x180 [nouveau]
Apr 26 22:22:52 debian kernel: [  631.662652]  nouveau_pmops_suspend+0x2e/0x70 [nouveau]
Apr 26 22:22:52 debian kernel: [  631.662805]  pci_pm_suspend+0x74/0x160
Apr 26 22:22:52 debian kernel: [  631.662815]  ? pci_pm_freeze+0xb0/0xb0
Apr 26 22:22:52 debian kernel: [  631.662824]  dpm_run_callback+0x62/0xd0
Apr 26 22:22:52 debian kernel: [  631.662836]  __device_suspend+0x110/0x420
Apr 26 22:22:52 debian kernel: [  631.662848]  async_suspend+0x1e/0x60
Apr 26 22:22:52 debian kernel: [  631.662858]  async_run_entry_fn+0x30/0x130
Apr 26 22:22:52 debian kernel: [  631.662870]  process_one_work+0x288/0x5d0
Apr 26 22:22:52 debian kernel: [  631.662883]  worker_thread+0x55/0x3b0
Apr 26 22:22:52 debian kernel: [  631.662893]  ? process_one_work+0x5d0/0x5d0
Apr 26 22:22:52 debian kernel: [  631.662903]  kthread+0xf2/0x120
Apr 26 22:22:52 debian kernel: [  631.662912]  ? kthread_complete_and_exit+0x20/0x20
Apr 26 22:22:52 debian kernel: [  631.662925]  ret_from_fork+0x22/0x30
Apr 26 22:22:52 debian kernel: [  631.662940]  </TASK>


As you can see, the warning is about a possible circular locking dependency between the nouveau's suspension callback function (Workqueue: events_unbound async_run_entry_fn, #2) and a call to nouveau's module through an IOCTL started from user space.

In addition to this, the previous locked kernel warning is always preceded only once by a different DMA-API warning during desktop manager startup (when kernel is compiled with CONFIG_DMA_API_DEBUG activated and with kernels 5.17.x and 5.16.x) ; it reports "DMA-API: nouveau 0000:01:00.0: device driver tries to sync DMA memory it has not allocated", for example:


Apr 26 22:41:34 debian kernel: [   63.386999] DMA-API: nouveau 0000:01:00.0: device driver tries to sync DMA memory it has not allocated [device address=0x0000000078c00000] [size=917504 bytes]
Apr 26 22:41:34 debian kernel: [   63.387019] WARNING: CPU: 0 PID: 1246 at kernel/dma/debug.c:1089 check_sync+0x1e9/0x820
Apr 26 22:41:34 debian kernel: [   63.387032] Modules linked in: vboxnetadp(OE) vboxnetflt(OE) vboxdrv(OE) uvcvideo videobuf2_vmalloc videobuf2_memops videobuf2_v4l2 videobuf2_common videodev iwldvm snd_hda_codec_hdmi mac80211 snd_hda_codec_realtek libarc4 snd_hda_codec_generic ledtrig_audio mc snd_hda_intel coretemp kvm_intel snd_intel_dspcfg iwlwifi snd_intel_sdw_acpi snd_hda_codec kvm joydev snd_hda_core acer_wmi irqbypass sparse_keymap wmi_bmof pcspkr serio_raw snd_hwdep at24 snd_pcm cfg80211 iTCO_wdt sg snd_timer intel_pmc_bxt iTCO_vendor_support watchdog snd rfkill soundcore ac acpi_cpufreq evdev binfmt_misc nouveau button mxm_wmi drm_ttm_helper ttm drm_kms_helper cec rc_core i2c_algo_bit parport_pc ppdev lp parport fuse drm configfs ip_tables x_tables autofs4 ext4 crc16 mbcache jbd2 btrfs zstd_compress raid10 raid456 async_raid6_recov async_memcpy async_pq async_xor async_tx xor raid6_pq crc32c_generic libcrc32c raid1 raid0 multipath linear md_mod hid_generic sd_mod t10_pi crc_t10dif usbhid hid crct10dif_generic
Apr 26 22:41:34 debian kernel: [   63.387273]  sr_mod cdrom crct10dif_common ahci libahci libata i2c_i801 scsi_mod psmouse i2c_smbus scsi_common lpc_ich ehci_pci uhci_hcd ehci_hcd usbcore sky2 usb_common wmi battery video
Apr 26 22:41:34 debian kernel: [   63.387334] CPU: 0 PID: 1246 Comm: plasmashell Tainted: G           OE     5.16.14-freeze #16
Apr 26 22:41:34 debian kernel: [   63.387342] Hardware name: Acer            Aspire 5930                    /Eiger                          , BIOS V1.21           08/22/2008
Apr 26 22:41:34 debian kernel: [   63.387349] RIP: 0010:check_sync+0x1e9/0x820
Apr 26 22:41:34 debian kernel: [   63.387358] Code: 89 54 24 08 e8 e8 73 5f 00 4c 8b 54 24 08 48 89 c6 49 89 e8 4c 89 f1 4c 89 ea 48 c7 c7 18 4f 34 82 4c 89 14 24 e8 76 0b 92 00 <0f> 0b 4c 8b 14 24 8b 05 1b 0b 91 01 85 c0 0f 84 d4 04 00 00 48 83
Apr 26 22:41:34 debian kernel: [   63.387366] RSP: 0018:ffffc90001ffbba0 EFLAGS: 00010086
Apr 26 22:41:34 debian kernel: [   63.387375] RAX: 0000000000000000 RBX: ffffc90001ffbc40 RCX: 0000000000000000
Apr 26 22:41:34 debian kernel: [   63.387381] RDX: 0000000000000002 RSI: ffffffff82381114 RDI: 00000000ffffffff
Apr 26 22:41:34 debian kernel: [   63.387388] RBP: 00000000000e0000 R08: 0000000000000003 R09: 0000000000000001
Apr 26 22:41:34 debian kernel: [   63.387394] R10: 0000000000000001 R11: 0000000000000000 R12: ffffffff83eeb508
Apr 26 22:41:34 debian kernel: [   63.387401] R13: ffff888005bf3190 R14: 0000000078c00000 R15: 0000000078c00000
Apr 26 22:41:34 debian kernel: [   63.387408] FS:  00007f74e54bd8c0(0000) GS:ffff8880bb000000(0000) knlGS:0000000000000000
Apr 26 22:41:34 debian kernel: [   63.387415] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
Apr 26 22:41:34 debian kernel: [   63.387421] CR2: 00007f74e9b2f9a0 CR3: 000000004bae8000 CR4: 00000000000006f0
Apr 26 22:41:34 debian kernel: [   63.387429] Call Trace:
Apr 26 22:41:34 debian kernel: [   63.387435]  <TASK>
Apr 26 22:41:34 debian kernel: [   63.387445]  ? _raw_spin_unlock_irqrestore+0x2d/0x60
Apr 26 22:41:34 debian kernel: [   63.387464]  debug_dma_sync_single_for_cpu+0xb5/0xc0
Apr 26 22:41:34 debian kernel: [   63.387499]  ? lock_release+0x155/0x4a0
Apr 26 22:41:34 debian kernel: [   63.387520]  ? dma_resv_iter_first_unlocked+0x7a/0x110
Apr 26 22:41:34 debian kernel: [   63.387531]  ? dma_resv_wait_timeout+0x45/0xd0
Apr 26 22:41:34 debian kernel: [   63.387544]  nouveau_bo_sync_for_cpu+0xb5/0xf0 [nouveau]
Apr 26 22:41:34 debian kernel: [   63.387718]  nouveau_gem_ioctl_cpu_prep+0x65/0xc0 [nouveau]
Apr 26 22:41:34 debian kernel: [   63.387833]  ? nouveau_gem_ioctl_pushbuf+0x15b0/0x15b0 [nouveau]
Apr 26 22:41:34 debian kernel: [   63.387948]  drm_ioctl_kernel+0xb7/0x150 [drm]
Apr 26 22:41:34 debian kernel: [   63.388032]  drm_ioctl+0x23b/0x440 [drm]
Apr 26 22:41:34 debian kernel: [   63.388080]  ? nouveau_gem_ioctl_pushbuf+0x15b0/0x15b0 [nouveau]
Apr 26 22:41:34 debian kernel: [   63.388082]  ? lock_release+0x155/0x4a0
Apr 26 22:41:34 debian kernel: [   63.388082]  ? _raw_spin_unlock_irqrestore+0x2d/0x60
Apr 26 22:41:34 debian kernel: [   63.388082]  ? lockdep_hardirqs_on+0x7e/0x100
Apr 26 22:41:34 debian kernel: [   63.388082]  ? _raw_spin_unlock_irqrestore+0x3e/0x60
Apr 26 22:41:34 debian kernel: [   63.3                Kernel version 5.17.4-amd64 (vanilla)
                with boot parameter nouveau.noaccel=0
                Time to complete the STR test
TEST > /sys/    ================================
power/pm_test   INIT_ON_ALLOC=0  INIT_ON_ALLOC=1 
=============   ===============  ===============
devices         0 min 27.4 sec   2 min 52.9 sec
platform        0 min 28.0 sec   2 min 44.0 sec
processors      0 min 28.0 sec   1 min 13.3 sec
core            0 min 27.8 sec   0 min 42.7 sec
=============   ===============  ===============88082]  nouveau_drm_ioctl+0x55/0xa0 [nouveau]
Apr 26 22:41:34 debian kernel: [   63.388082]  __x64_sys_ioctl+0x83/0xb0
Apr 26 22:41:34 debian kernel: [   63.388082]  do_syscall_64+0x3b/0xc0
Apr 26 22:41:34 debian kernel: [   63.388082]  entry_SYSCALL_64_after_hwframe+0x44/0xae
Apr 26 22:41:34 debian kernel: [   63.388082] RIP: 0033:0x7f74e988fcc7
Apr 26 22:41:34 debian kernel: [   63.388082] Code: 00 00 00 48 8b 05 c9 91 0c 00 64 c7 00 26 00 00 00 48 c7 c0 ff ff ff ff c3 66 2e 0f 1f 84 00 00 00 00 00 b8 10 00 00 00 0f 05 <48> 3d 01 f0 ff ff 73 01 c3 48 8b 0d 99 91 0c 00 f7 d8 64 89 01 48
Apr 26 22:41:34 debian kernel: [   63.388082] RSP: 002b:00007fff21885688 EFLAGS: 00000246 ORIG_RAX: 0000000000000010
Apr 26 22:41:34 debian kernel: [   63.388082] RAX: ffffffffffffffda RBX: 00007fff218856d0 RCX: 00007f74e988fcc7
Apr 26 22:41:34 debian kernel: [   63.388082] RDX: 00007fff218856d0 RSI: 0000000040086482 RDI: 000000000000000d
Apr 26 22:41:34 debian kernel: [   63.388082] RBP: 0000000040086482 R08: 000000000000000d R09: 0000000108307000
Apr 26 22:41:34 debian kernel: [   63.388082] R10: 0000000000000001 R11: 0000000000000246 R12: 0000000000000000
Apr 26 22:41:34 debian kernel: [   63.388082] R13: 000000000000000d R14: 0000555a894e8e10 R15: 0000555a894e8e68
Apr 26 22:41:34 debian kernel: [   63.388082]  </TASK>
Apr 26 22:41:34 debian kernel: [   63.388082] irq event stamp: 1023624
Apr 26 22:41:34 debian kernel: [   63.388082] hardirqs last  enabled at (1023623): [<ffffffff81b475ad>] _raw_spin_unlock_irqrestore+0x2d/0x60
Apr 26 22:41:34 debian kernel: [   63.388082] hardirqs last disabled at (1023624): [<ffffffff81b47308>] _raw_spin_lock_irqsave+0x78/0xa0
Apr 26 22:41:34 debian kernel: [   63.388082] softirqs last  enabled at (1020678): [<ffffffff810ac8fb>] __irq_exit_rcu+0xfb/0x160
Apr 26 22:41:34 debian kernel: [   63.388082] softirqs last disabled at (1020661): [<ffffffff810ac8fb>] __irq_exit_rcu+0xfb/0x160
Apr 26 22:41:34 debian kernel: [   63.388082] ---[ end trace b87f0a8a7186bf12 ]---


These clues make me suppose that:

1) there is an interaction somewhere between INIT_ON_ALLOC/INIT_ON_FREE kernel's features (slub memory) and the code of accelerated features in the nouveau kernel module; the effect (for example, on system memory) of this interaction is not detectable (so far!) by native kernel's memory slub leak of corruption detectors nor by the Kernel Address Sanitizer (KASAN) (if activated, of course). This is always reproducible with my computer. 

2) the failure to resume from the suspension to ram is always preceded by longer times to complete suspension to ram: this time is spend in function nouveau_do_suspend(...)

3) the function nouveau_do_suspend(...) [2] seems to be involved, through the call of ttm_resource_manager_evict_all(..) [3] and subsequent calls; the ttm_resource_manager_evict_all() and subsequent calls is where the kernel waits during suspension to RAM when then it fails at resume; when INIT_ON_ALLOC/INIT_ON_FREE kernel's features are activated, ttm_resource_manager_evict_all() fails TTM buffer eviction.

4) DMA seems to be involved, because the kernel waits on DMA functions; the DMA-API warning is a clue, but it appears regardless of suspension to ram succeeds or not (with kernels 5.16 and 5.17).

5) the possible circular locking dependency detected could be involved, but the warning appears regardless of suspension to ram succeeds or not (with kernels 5.16 and 5.17).

This bug has been affecting the Linux kernel for nearly a year, with many reports (for example, in forums) but the root cause is still unknown and, despited the workaround, the fix is not in sight, unfortunately.

I am available for any further tests and insights, of course.

==================================
[1] https://gitlab.freedesktop.org/drm/nouveau/-/issues/156
[2] https://elixir.bootlin.com/linux/latest/source/drivers/gpu/drm/nouveau/nouveau_drm.c#L844
[3] https://elixir.bootlin.com/linux/latest/source/drivers/gpu/drm/nouveau/nouveau_drm.c#L866
Comment 16 computer.enthusiastic@gmail.com 2022-04-29 08:26:24 UTC
Created attachment 300846 [details]
logs of suspension to ram tests with 7sys/power/pm_test kernel facility
Comment 17 computer.enthusiastic@gmail.com 2022-08-10 11:26:05 UTC
Hello,

An experimental "work in progress" patch (by Karol Herbst) is available here: https://gitlab.freedesktop.org/drm/nouveau/-/issues/156#note_1385770

It has been successfully tested by me with kernels version 5.10.117, 5.16.14 and 5.17.9 using an NVIDIA G96CM [GeForce 9600M GT], so far.
Comment 18 computer.enthusiastic@gmail.com 2023-02-05 17:28:28 UTC
Hello,

The patch has been merged upstream for 5.15+[1][2] and 5.10 [3].

Many thanks to Karol Herbst and Greg Kroah-Hartman.

[1] https://lore.kernel.org/stable/20220829105809.384152318@linuxfoundation.org/
[2] https://lore.kernel.org/stable/20220829105814.337312403@linuxfoundation.org/
[3] https://lore.kernel.org/stable/20230130134312.459081132@linuxfoundation.org/
Comment 19 computer.enthusiastic@gmail.com 2023-02-05 17:28:52 UTC
Marked as resolved upstream.

Note You need to log in before you can comment on or make changes to this bug.