Using the dmatest module (I was testing with it built in), if the timeout parameter is set such that an actual timeout occurs before a DMA operation completes in dmatest (dmatest.c), the "done_wait" structure is left with a dangling pointer that is not handled.

An example of a message that might be printed right before the system locks up:

[ 249.596529] dmatest: dma5chan0-copy0: result #10: 'test timed out' with src_off=0x9f300 dst_off=0xdf00

Often, after this situation occurs, we end up in what appears to be a deadlock (sometimes just by rerunning the same test). In fact, the hang happens because the spinlock inside the wait queue was corrupted through the stale pointer: the on-stack structure it refers to gets reinitialized while the pointer is still in use.

Surprisingly, the existence of this bug is documented in the very code where the issue occurs:
http://elixir.free-electrons.com/linux/v4.14-rc7/source/drivers/dma/dmatest.c#L698

Further information on the race condition can be found in the original patch:
https://lkml.org/lkml/2011/11/21/381

At the very least, the kernel should probably BUG out if (!done.done), but ideally this would be fixed properly.
In stress testing, this occurs with low probability. Should we move the wait queue into the thread info of each thread?
It happens with increased regularity if:
1) You are running on a slower system (e.g., emulation, simulation)
2) You use an unrealistically low timeout value
3) Your DMA legitimately has an issue and times out
Moving the wait queue to each thread sounds like it could work.
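
To make the lifetime argument concrete, here is a minimal userspace sketch of the per-thread idea. It is not the dmatest code: `struct test_thread`, `run_iteration()`, `fire_late_callback()` and the `wakeups` counter are hypothetical names invented for illustration, and the "hardware" is modeled by plain function calls rather than a real DMA engine or a kernel wait queue. The only point it tries to show is that a callback arriving after a timeout writes into storage that still exists, because that storage belongs to the thread context rather than to the stack frame of one iteration.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for the state a completion callback touches. */
struct done_wait {
    bool done;     /* set by the callback when the transfer completes */
    int  wakeups;  /* stand-in for the wait queue: counts wake-up calls */
};

/* Hypothetical per-thread context; it outlives every test iteration. */
struct test_thread {
    struct done_wait  done_wait; /* lives here, not on an iteration's stack */
    struct done_wait *pending;   /* pointer an in-flight transfer calls back into */
    int               timeouts;  /* number of iterations that timed out */
};

/* Completion callback: writes through the pointer handed over at submit time. */
static void test_callback(struct done_wait *dw)
{
    dw->done = true;
    dw->wakeups++;
}

/*
 * One test iteration. completes_in_time models whether the transfer
 * finishes before the timeout expires. Returns false on timeout.
 */
static bool run_iteration(struct test_thread *t, bool completes_in_time)
{
    t->done_wait.done = false;
    t->pending = &t->done_wait;  /* per-thread storage, not this stack frame */

    if (completes_in_time) {
        test_callback(t->pending);
        t->pending = NULL;
    }

    if (!t->done_wait.done) {
        t->timeouts++;
        return false;            /* timed out; the callback may still fire later */
    }
    return true;
}

/* Models the transfer completing after the waiter has already given up. */
static void fire_late_callback(struct test_thread *t)
{
    if (t->pending) {
        /* Still valid: the target is embedded in *t, which is alive. */
        assert(t->pending == &t->done_wait);
        test_callback(t->pending);
        t->pending = NULL;
    }
}
```

In this model, a caller would run `run_iteration(&t, false)` to simulate a timeout and later call `fire_late_callback(&t)`; the late write lands in `t.done_wait` instead of in a dead stack frame. A real fix would still have to deal with the next iteration reusing the same state while an old callback is outstanding, which this sketch does not attempt to solve.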
A proposed patch to fix the issue has been submitted: https://patchwork.kernel.org/patch/10053507/