Bug 217242
Summary: | CPU hard lockup related to xhci/dma | ||
---|---|---|---|
Product: | Drivers | Reporter: | Hunter M (miller.hunterc) |
Component: | USB | Assignee: | Default virtual assignee for Drivers/USB (drivers_usb) |
Status: | CLOSED CODE_FIX | ||
Severity: | normal | CC: | austin.domino, mathias.nyman |
Priority: | P1 | ||
Hardware: | Intel | ||
OS: | Linux | ||
Kernel Version: | 6.1.14-1-lts | Subsystem: | |
Regression: | No | Bisected commit-id: | |
Attachments: |
cpu hard lockup
Log file displaying this problem on the ASRock NUC Box-1260P running kernel version 6.2.8 Log file displaying this problem on the Intel NUC7i5BNK running kernel version 6.2.8 1 serial comm lockup computer 1 dmesg CPU lockup computer 2 dmesg CPU lockup Exert from kernel log on system running 5.13 trb values debug patch comparing free trbs xhci debug xhci dynamic debug log Fix trb free calculation patch, goes on top debug patch comparing free trbs init test with trb calculation test log Final free trb fix for 6.4 and stable |
The same problem has been seen on Intel NUC7i5BNK, ASRock NUC BOX-1260P, LattePanda Delta 3 and other compact computers of the sort, and if enough individual processes are doing USB serial communications (ttyUSB/ttyACM) at once and it matches or exceeds the number of CPU cores, the entire CPU can get locked up until the watchdog frees things up. Most lock ups of this sort are resolved after ~20-25 seconds, but some have lasted over 20 minutes! There's several options to get around this problem. One of which is to set the CPU affinity so at least 1 core won't ever be used for USB serial communications. Another one is to use a semaphore to add a similar limitation. Created attachment 304072 [details]
Log file displaying this problem on the ASRock NUC Box-1260P running kernel version 6.2.8
Created attachment 304073 [details]
Log file displaying this problem on the Intel NUC7i5BNK running kernel version 6.2.8
The setup that created the output in the previous two attached log files is described in more detail in related posts on: Ubuntu Forums -> https://ubuntuforums.org/showthread.php … st14136903 and Ubuntu's Launchpad Bug Page -> https://bugs.launchpad.net/ubuntu/+sour … ug/2013390 The setup that created this output is described in related posts on: Ubuntu Forums -> [url]https://ubuntuforums.org/showthread.php?t=2485480&p=14136903#post14136903[/url] and Ubuntu's Launchpad Bug Page -> [url]https://bugs.launchpad.net/ubuntu/+source/linux/+bug/2013390[/url] (In reply to Austin Domino from comment #4) > The setup that created the output in the previous two attached log files is > described in more detail in related posts on: > > Ubuntu Forums -> https://ubuntuforums.org/showthread.php … st14136903 > and > Ubuntu's Launchpad Bug Page -> https://bugs.launchpad.net/ubuntu/+sour … > ug/2013390 Try 3 (I wish I could edit posts, but I should not have pressed "Save Changes" so quickly. Sorry about this comment and my incompetence): Here are the actual links: Ubuntu Forums -> https://ubuntuforums.org/showthread.php?t=2485480&p=14136903#post14136903 and Ubuntu's Launchpad Bug Page -> https://bugs.launchpad.net/ubuntu/+source/linux/+bug/2013390 On 4/2/23 00:17, bugzilla-daemon@kernel.org wrote: > https://bugzilla.kernel.org/show_bug.cgi?id=217242 > > --- Comment #6 from Austin Domino (austin.domino@hotmail.com) --- > (In reply to Austin Domino from comment #4) >> The setup that created the output in the previous two attached log files is >> described in more detail in related posts on: >> >> Ubuntu Forums -> https://ubuntuforums.org/showthread.php … st14136903 >> and >> Ubuntu's Launchpad Bug Page -> https://bugs.launchpad.net/ubuntu/+sour … >> ug/2013390 > Try 3 (I wish I could edit posts, but I should not have pressed "Save > Changes" > so quickly. Sorry about this comment and my incompetence): > > Here are the actual links: > > Ubuntu Forums -> > https://ubuntuforums.org/showthread.php?t=2485480&p=14136903#post14136903 > and > Ubuntu's Launchpad Bug Page -> > https://bugs.launchpad.net/ubuntu/+source/linux/+bug/2013390 > Hi, I don't have access to the bugzilla, but this looks like a out of memory situation, and does not really point towards the USB XHCI. URB's are typically submitted using GFP_KERNEL, which allow memory allocators to sleep while waiting for more memory. GFP_ATOMIC does not allow sleeping. usb_submit_urb(xxx, GFP_KERNEL); While that being said, I wish the Linux USB core would take the example of the FreeBSD USB core, and pre-allocate all memory needed for USB transfers, also called URB's, during device attach. Frequently going through allocate and free cycles during operation, is not just inefficient, but also greatly degrades the ability to debug the system. USB is still quite essential when doing remote server access. Yeah, the serial port is great too, but one day inb() and outb() will die :-) --HPS On Sun, Apr 02, 2023 at 05:54:18PM +0200, Hans Petter Selasky wrote: > While that being said, I wish the Linux USB core would take the example of > the FreeBSD USB core, and pre-allocate all memory needed for USB transfers, > also called URB's, during device attach. Many drivers do that today already, which specific ones do you think need to have this added that are not doing so? > Frequently going through allocate > and free cycles during operation, is not just inefficient, but also greatly > degrades the ability to debug the system. Based on the slow USB speeds, "inefficient" isn't anything that I've been able to measure specifically, have you? > USB is still quite essential when doing remote server access. Yeah, > the serial port is great too, but one day inb() and outb() will die That's what a USB debugging cable is for :) thanks, greg k-h Created attachment 304082 [details]
1 serial comm lockup
Update: Linux version 6.1.21-1-lts (arch linux)
Performed test over the weekend to see if issue reproducible by only running process which communicates over UART. Logs attached.
Will build Kernel with DMA API Debug on and see if I am able to get any more information.
Created attachment 304094 [details]
computer 1 dmesg CPU lockup
Created attachment 304095 [details]
computer 2 dmesg CPU lockup
Update: DMA API debugging did not result in any warnings/errors from DMA. 2/6 computers have had CPU lockup occur.
After some more testing, I found this bug was most likely introduced between kernel version 5.11 and 5.13, but more testing will need to be done to narrow verify and narrow things down further. Although, it should be noted that the lockup that occurred on kernel version 5.13 appeared in a slightly different manner than I've seen before. I've uploaded a short exert from that kernel log. Created attachment 304110 [details]
Exert from kernel log on system running 5.13
Looks like it gets stuck during xhci ring expansion. During ring expansion the xhci driver allocates memory with spinlock held using dma_pool_zalloc(.., GFP_ATOMIC, ...) This apparently never completes, so spinlock isn't released. Any URBs queued for xhci after this will spin forever trying to take the lock, lockin up that CPU. The xhci ring expansion code looks broken, the calculation of new ring segments needed is incorrect, may be huge. Also I don't think we should need to expand the ring buffer in this case. There might be some bug in how driver keeps track on free trbs. I'll write a debugging patch that that tracks free trbs and expansion values. Created attachment 304127 [details] trb values Found what (possibly) looks to be a similar issue at https://github.com/raspberrypi/linux/is … 1241972882. The comment I've linked has a similar scenario (in that the single serial port communication occurs frequently - the comment stated the serial communication frequency was every 0.2s; the communication I've implemented is every 0.15s). Yesterday I had started the serial communication process and left it overnight to see the trb values by performing for d in /sys/kernel/debug/usb/xhci/0000:01:00.0/devices/*/*; do if [ -d "$d" ]; then cd $d; echo "${d/?*\/devices\//}: $(wc -l trbs)"; fi done However, when coming back this morning and attempting the above command, the computer froze and either the software or hardware watchdog kicked in and restarted the computer (journalctl logs did not print out before the reboot occurred, so unfortunately I don't have any information on what happened). I went ahead and re-performed the test today reducing the sleep time in between serial write calls from 0.15s to 0.05s. With that, the ttyACM ring size did expand considerably. The values I was able to obtain until a power outage occurred which stopped my test are attached. Further information: utilizing Microchip MCP2200 USB to UART converter instead of FTDI (though this should not make a difference, pointing it out just in case). URL hyperlink got cut - here's the correct link for the issue: https://github.com/raspberrypi/linux/issues/5088#issuecomment-1241972882 Thanks, one reason why ring expansion calculation is incorrect is that we try to store a negative value in an unsigned int. static int prepare_ring(...) { unsigned int num_trbs_needed; .... num_trbs_needed = num_trbs - ep_ring->num_trbs_free; } ep_ring->num_trbs_free might be bigger than num_trbs So we end up with a huge and incorrect num_trbs_needed https://elixir.bootlin.com/linux/v6.2/source/drivers/usb/host/xhci-ring.c#L3186 In addition to this there is most likely a small bug in tracking ep_ring->num_trbs_free, gradually decreasing it incorrectly. Not sure where that happens Still working on that debugging patch I've narrowed down when this bug first appears to the 5.12 kernel release; I ran a couple computers for a week on kernel version 5.11 and ran into 0 problems while running a program like the one above, but before doing this, I had 1 of those computers on kernel version 5.12 while running that same program with the same devices and it ran into this bug within 24 hours. I looked at the number of TRBs for the computers that ran kernel version 5.11 for over a week with that program, and they were all at 512, so it's extremely unlikely that the ring expansion problems are present in version 5.11. This morning, out of curiosity, I took a computer running Ubuntu 18.04, went to Ubuntu's kernel build page, https://kernel.ubuntu.com/~kernel-ppa/mainline/, and tried a number of kernels to narrow down when the ring expansion problems first appeared. It seems that this problem is present in all the 5.12-rc releases, and I know that it's present on 5.15; that was what this computer was running before all of this, so I'm assuming that it's present from 5.12 onward. Right now I have this computer running kernel "v5.12-rc1" from Ubuntu's kernel page, the ring expansion problems are present; the maximum number of TRBs for a device the last time I checked was 8388608 after ~2.5 hours, and I'm curious if this computer will run into a hard lockup. I'm nearly certain that it will, but we'll just have to wait and see to be certain. Lastly, I went and looked at the changes that were done between v5.11 and v5.12-rc2 within the "drivers/usb/host" directory and it appears that a moderate amount of change took place (more than enough to make my head spin). I haven't dealt with kernel source like this before and it'd take a while to parse through everything to understand what's going on, so I don't know how much further I will get involved. Anyhow, I hope that this information might help. I suspect the offending commit is: 55f6153d8cc8 xhci: remove extra loop in interrupt context It changes how num_trbs_free is counted for a ring. I'll attach a debug patch that will track both free trb and ring expansion Created attachment 304134 [details]
debug patch comparing free trbs
Debugging patch for ring expansion and
Patch recalculates free trbs and compares it to the old tracked value of free trbs. Prints out a message if there is a new mismatch.
Patch also includes a new way of checking if ring needs expansion, and by how much. used only to show it when driver expands ring based on old code
I'll get the kernel compiled with the patch and install it on some computers next week. Will update with logs once they have ran for a few days. Created attachment 304159 [details]
xhci debug
Kernel compiled and initial test performed. See debug logs from journalctl attached.
Created attachment 304161 [details]
xhci dynamic debug log
xhci debug with dynamic debugging ON for xhci module
Thanks, I see whats going on now. Some transfers that are further ahead on the ring can simply be turned to no-op trbs by driver when cancelled. These are not added back to num_trbs_free. This is the case when several URBs are queued for an endpoint and then cancelled in reverse order. I have a untested fix for this that goes on top of previous debug patch. Can you try it out? I'm also reworking this whole thing, but we need a small fix like this for older stable kernels. Created attachment 304163 [details]
Fix trb free calculation patch, goes on top debug patch comparing free trbs
I can probably do so, but I'm sure Hunter will be on this eventually too. Also, I saw that you mentioned something about commit 55f6153d8cc8, so I reverted that commit, rebuilt the kernel, and tested, but I forgot to apply the previous debug patch. The TRBs problem doesn't appear to be present anymore and I haven't run into any hard lockups within the ~24 hour period that it has been running for so far. Although, I'd like to run things with that kernel build for a while longer before saying anything definitive, and I think that some of the changes themselves within that commit are justifiable (but what can I really say; I'm way too new at this), so changes likened unto those in this newest patch appear to be a better solution to me. I'll just build the most recent kernel release with this patch for another device. Created attachment 304164 [details]
init test with trb calculation test log
Here's the initial log with the trb free calculation patch compiled in the kernel.
Initial test shows that the number of trbs stays constant at 512. I'll run this patch on multiple computers until next week and update with results.
Let me know when the rework is done as well and I can go ahead and test that.
A quick update. I've been running a kernel with this patch on a system for about a day and a half now, I haven't run into any issues so far, and there's nothing notable in any of the logs. I'll likely respond back sometime next week with further updates, but things are looking good so far. 12 computers have been running since 04-19-2023 with no CPU lockups using the patches. TRB values constant at 512. The only thing I have seen on 2 of the computers is a warning log for the following: Apr 21 07:47:37 myuser kernel: xhci_hid 0000:00:14.0: WARN Set TR Deq Ptr cmd failed due to incorrect slot or ep state. However, no functionality has been lost and computers are running fine with serial communication occurring as normal. (In reply to Hunter M from comment #29) > 12 computers have been running since 04-19-2023 with no CPU lockups using > the patches. TRB values constant at 512. > The only thing I have seen on 2 of the computers is a warning log for the > following: > > Apr 21 07:47:37 myuser kernel: xhci_hid 0000:00:14.0: WARN Set TR Deq Ptr > cmd failed due to incorrect slot or ep state. Hunter, I've seen the same warning in logs for computers running kernel 6.1 and 6.2 before and after these patches were applied. Also, it appears that there's already a bug report put together for this warning (see bug w/ id 202541). Also, since I'm writing a quick comment, I'll give an update on testing the kernel with these patches applied. I haven't had any problems so far on any computer running kernel 5.15, 6.1 or 6.2 with these patches applied over the past 6 days, and the number of TRBs has remained at 512 on all 5 computers. I'm becoming more and more certain as they days go by that these patches fix this bug, and I'm okay with running a patched kernel for the time being. Thanks, I cleaned up the debug patch and turned into a real fix. It can be found in my tree in the fix_ring_expansion branch: git://git.kernel.org/pub/scm/linux/kernel/git/mnyman/xhci.git https://git.kernel.org/pub/scm/linux/kernel/git/mnyman/xhci.git/log/?h=fix_ring_expansion The larger rework of ring expansion is also done, but not very well tested. If you can try it out as well it would be appreciated. It can be found in the same tree in the rework_ring_expansion branch https://git.kernel.org/pub/scm/linux/kernel/git/mnyman/xhci.git/log/?h=rework_ring_expansion In fear of regression I think I'll submit the smaller fix to 6.4 and older stable kernels, and then rebase the larger rework on top of that, and submit it to usb-next. (6.5 kernels and later) Created attachment 304188 [details]
Final free trb fix for 6.4 and stable
Will be running most computers with the real patch in the fix_ring_expansion branch for a couple of weeks to verify. I'll run a few of them with the larger rework next week and post results when I get a chance. Update on larger rework - No issues found running on 2 computers for the week. Would logs containing dynamic debug statement for xhci_hcd module be needed? If so, will be able to get a snippet of those logs next week. Thanks for testing the larger rework, no logs needed. Can I add "Tested-by: Miller Hunter <MillerH@hearthnhome.com>" tag to that series? Yes go ahead. Marking this issue as resolved. |
Created attachment 304018 [details] cpu hard lockup Utilizing Intel NUC11ATKC4 computers, there is an issue of CPU hard lockups that occurs seemingly randomly. Typically the issue arises every 1-5 days, though sometimes the issue does not arise for a week plus. Of note: These computers are constantly communicating via serial communication (request-reply pattern) to an embedded device via USB to UART cable. Unsure if that may contribute - will start test next week to see if issue still arises when serial communication does not occur. This issue has been seen as far back as linux-lts 5.15.62 (not saying that's when the issue started, just that is the earliest release that these computers have ran). This is on Arch Linux.