Observed a kernel panic during host shutdown on an AMD (Milan CPU) based server. The issue ended up being a NULL pointer dereference in pt_cmd_callback() when called from pt_issue_pending(). If you follow the flow in pt_issue_pending(), you will note that if the first pt_next_dma_desc() call returns NULL, engine_is_idle remains TRUE. It stays TRUE even if pt_next_dma_desc() still returns NULL in the second call, made just prior to the call to pt_cmd_callback(), so pt_cmd_callback() ends up being called with a NULL desc.
The stack flow leading up to the panic was:
dma_sync_wait() -> dma_async_issue_pending() -> pt_issue_pending() ->
Temporarily I worked around the issue by simply changing the IF condition for the call to pt_cmd_callback() to also check for a non-NULL desc, i.e.
if (engine_is_idle && desc)
This resolved the issue for me. However, I don't know enough about the driver or the context here to know whether this is really the desirable fix, so I'm submitting this bug rather than attempting to patch it myself. I wasn't sure whether the second pt_next_dma_desc() call was mistakenly left over from the change that introduced the engine_is_idle variable.
Note that vchan_issue_pending() returns a boolean indicating whether there are any descriptors on the Issue list, i.e. active descriptors. Maybe that could be used to qualify the need to take some action? Also, if pt_cmd_callback() is really going to start processing the next descriptor, I wonder whether it should be called under the chan->vc.lock lock. I'm not sure of the safety of this, but if you are peeking at descriptors on the Issue list, you might want to ensure they're protected from being accessed or removed by some other thread.
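For anyone following along, here is a rough user-space model of the flow described above, with the extra NULL check in place. It is only a sketch, not the driver code: the model_* names, the one-slot "submitted" and "issued" fields, and the callbacks_run counter are all made up for illustration and stand in for pt_next_dma_desc(), vchan_issue_pending() and pt_cmd_callback().

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-ins for the driver's structures (not the real ones). */
struct model_desc {
    int status;
};

struct model_chan {
    struct model_desc *submitted; /* queued, not yet on the Issue list */
    struct model_desc *issued;    /* head of the Issue list, or NULL */
    int callbacks_run;            /* counts calls to the callback */
};

/* Stand-in for pt_next_dma_desc(): returns NULL when nothing is issued. */
static struct model_desc *model_next_desc(struct model_chan *chan)
{
    return chan->issued;
}

/* Stand-in for vchan_issue_pending(): moves submitted work to the Issue list. */
static void model_issue(struct model_chan *chan)
{
    if (!chan->issued) {
        chan->issued = chan->submitted;
        chan->submitted = NULL;
    }
}

/* Stand-in for pt_cmd_callback(): it uses desc unconditionally. */
static void model_cmd_callback(struct model_chan *chan, struct model_desc *desc)
{
    assert(desc != NULL); /* a NULL desc here corresponds to the reported panic */
    desc->status = 1;
    chan->callbacks_run++;
}

void model_issue_pending(struct model_chan *chan)
{
    struct model_desc *desc;
    bool engine_is_idle = true;

    /* First lookup: something already issued means the engine is busy. */
    desc = model_next_desc(chan);
    if (desc)
        engine_is_idle = false;

    model_issue(chan);

    /* Second lookup: still NULL if nothing was submitted at all. */
    desc = model_next_desc(chan);

    /* Without "&& desc", an empty channel reaches the callback with NULL. */
    if (engine_is_idle && desc)
        model_cmd_callback(chan, desc);
}
```

Calling model_issue_pending() on a channel with nothing submitted and nothing issued is the shutdown case from the report: both lookups return NULL, engine_is_idle stays true, and only the added desc check keeps the callback from being reached.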
Eric, does the problem still happen with the latest mainline, or was this issue addressed in the meantime?
Hi Thorsten, yes, it appears the problem is still there in the latest mainline. I had sent an email to the owner of the driver about the fix I made locally, in order to get some feedback, since I'm not an expert on the driver. However, I never received a response. I'm going to go ahead and submit a patch later today with the fix I made locally.
Patch submitted and accepted. Fixed as noted in description, i.e. add NULL check for desc pointer.