Bug 197605

Summary: dmatest leaves dangling pointer when timeout expires which leads to deadlock if test is re-run
Product: Drivers Reporter: Adam Wallis (awallis)
Component: OtherAssignee: drivers_other
Status: NEW ---    
Severity: normal CC: okaya, rruigrok, sameer.goel, shunyong.yang, timur
Priority: P1    
Hardware: All   
OS: Linux   
Kernel Version: v4.14-rc7 Subsystem:
Regression: No Bisected commit-id:

Description Adam Wallis 2017-10-31 16:20:26 UTC
Using the dmatest module (I was testing with it built-in), if the timeout parameter is set such that an actual timeout occurs before a DMA operation completes in the dmatest (dmatest.c), the "done_wait" structure will be left with a dangling pointer that is not handled. 

an example of a message that might be printed out right before we lockup the system:

[  249.596529] dmatest: dma5chan0-copy0: result #10: 'test timed out' with src_off=0x9f300 dst_off=0xdf00

Often after this situation occurs, we arrive at what appears to be a deadlock (sometimes just by rerunning the same test)...but this is due to the fact that the spinlock structure for the wait queues was corrupted by the unfreed pointer that gets reinitialized on the stack.

Surprisingly, the fact that this bug exists is documented in the very code where the issue occurs: http://elixir.free-electrons.com/linux/v4.14-rc7/source/drivers/dma/dmatest.c#L698 

Further information on the race condition can be found in this original patch: https://lkml.org/lkml/2011/11/21/381

At the very least, the kernel should probably BUG out if (!done.done)...but, would ideally be fixed properly
Comment 1 Shunyong Yang 2017-11-02 01:24:10 UTC
In stress test, this occurs in some low possibility. Shall we move the wait queue to thread info of each thread?
Comment 2 Adam Wallis 2017-11-08 15:44:15 UTC
It happens with increased regularity if 1)You are running on a slower system (e.g., emulation, simulation) 2) You use an unrealistically low timeout value 3) Your DMA legitimately has an issue and times out

Moving the wait queue to each thread sounds like it could work.
Comment 3 Adam Wallis 2017-11-10 17:47:22 UTC
Proposed patch to fix the issue submitted: https://patchwork.kernel.org/patch/10053507/