Bug 216856

Summary: [ptdma] NULL pointer dereference in pt_cmd_callback during server shutdown
Product: Drivers Reporter: Eric Pilmore (epilmore)
Component: OtherAssignee: drivers_other
Severity: normal CC: regressions, vkoul
Priority: P1    
Hardware: AMD   
OS: Linux   
Kernel Version: 6.2 Subsystem:
Regression: No Bisected commit-id:

Description Eric Pilmore 2022-12-27 22:23:50 UTC
Observed kernel panic during host shutdown on a AMD (Milan CPU) based server. The issue ended up being a NULL pointer dereference in pt_cmd_callback() when
called from pt_issue_pending(). If you follow the flow in pt_issue_pending() you will note that if pt_next_dma_desc() returns NULL, then engine_is_idle will remain as TRUE, including if pt_next_dma_desc() is still returning NULL in the 2nd call just prior to doing the call to pt_cmd_callback().

The stack flow leading up to the panic was:
dma_sync_wait() -> dma_async_issue_pending() -> pt_issue_pending() ->

Temporarily I worked around the issue by simply changing the IF condition for the call to pt_cmd_callback() to also check for a non-NULL desc, i.e.

   if (engine_is_idle && desc)
      pt_cmd_callback(desc, 0);

This resolved the issue for me, however I don't know enough about the driver or the context here to know if this is really the desirable fix, and so I'm submitting this bug rather than attempting to patch myself. I wasn't sure if the secondary pt_next_dma_desc() call was mistakenly leftover from the change that introduced the engine_is_idle variable or not. Note that vchan_issue_pending() will return a boolean as to whether there are any descriptors on the Issue list, i.e. active descriptors. So, maybe that could be used to qualify the need to take some action? Also, if pt_cmd_callback() is really going to start processing on the next descriptor, I wonder if it should be called under the chan->vc.lock lock. I'm not sure of the safety of this, but if you are peeking at descriptors on the Issue list that you might want to ensure they're protected from being accessed/removed by some other thread.
Comment 1 The Linux kernel's regression tracker (Thorsten Leemhuis) 2023-01-27 11:40:40 UTC
Eric, does the problem still happen with latest mainline was this issue addressed meanwhile?
Comment 2 Eric Pilmore 2023-01-27 17:39:24 UTC
Hi Thorsten, yes, it appears the problem is still there in the latest mainline. I had sent an email to the owner of the driver about the fix I had done locally, in order to get some feedback since I'm not an expert on the driver. However, never received any response. I'm going to go ahead and just file a patch later today with the fix that I did locally.
Comment 3 Eric Pilmore 2023-02-10 09:11:47 UTC
Patch submitted and accepted. Fixed as noted in description, i.e. add NULL check for desc pointer.