Bug 195477
Summary: | firmware_class: loading the same file multiple times in parallel finishes only one | ||
---|---|---|---|
Product: | Other | Reporter: | Nicolas (nbroeking) |
Component: | Other | Assignee: | Luis Chamberlain (mcgrof) |
Status: | CLOSED OBSOLETE | ||
Severity: | blocking | CC: | greg, jewalt, johannes, linuxwifi, luca, mcgrof, nbroeking |
Priority: | P1 | ||
Hardware: | Intel | ||
OS: | Linux | ||
Kernel Version: | Linux 4.4.0-72-generic ( Ubuntu 16 ), 4.11 | Subsystem: | |
Regression: | No | Bisected commit-id: | |
Attachments: |
A tgz of 3 png that show log output
Dmesg ifconfig -a Patch to print a log in iwlwifi's probe LSPCI Logs after Patch DMESG After Patch After Patch custom Logs patch to delay request_fw DMSG with 500ms delay possible patch for firmware_class.c |
Can you please provide dmesg? Also, can you please paste the results of the following commands? ifconfig -a iw dev Created attachment 256073 [details]
Dmesg
Created attachment 256075 [details]
ifconfig -a
I added dmesg and ifconfig -a attachments. Note that there is also a intel ln930 on the system. That accounts for the wwp0s20u2 device. One more update. We tried the same configuration with the 7260. Everything works fine. This seems to only be an issue with the 8260. Correction on the original bug posting. The problem is with the 8260 not 6260 (In reply to Nicolas from comment #7) > Correction on the original bug posting. The problem is with the 8260 not 6260 Yes, I noticed. :) That's why I used "8260" in the title. This is a very strange issue... I can clearly see the devices being found at the PCI level: [ 0.160229] pci 0000:03:00.0: [8086:24f3] type 00 class 0x028000 [ 0.167352] pci 0000:05:00.0: [8086:24f3] type 00 class 0x028000 [ 0.175391] pci 0000:08:00.0: [8086:24f3] type 00 class 0x028000 [ 0.167352] pci 0000:05:00.0: [8086:24f3] type 00 class 0x028000 But later on I just see the iwlwifi driver reacting to one of them: [ 2.659223] iwlwifi 0000:03:00.0: loaded firmware version 16.242414.0 op_mode iwlmvm Unfortunately we don't have any traces at the probe level in iwlwifi. I'll provide you with a patch to add some traces there to check whether it's getting that far. Can you try to "remove" one of the devices that didn't show up and then rescan the PCI devices, like this: echo 1 > /sys/bus/pci/devices/0000\:05\:00.0/remove ...wait a while then: echo 1 > /sys/bus/pci/rescan ...and post dmesg again? Created attachment 256091 [details]
Patch to print a log in iwlwifi's probe
Can you apply this patch and run this scenario again so we can check whether the probe is getting to the iwlwifi driver?
Created attachment 256325 [details]
LSPCI Logs after Patch
Created attachment 256327 [details]
DMESG After Patch
I am sorry for the delay in getting this done. We applied the patch and uploaded the logs. The driver is definitely getting through the probe. We started adding other logs to see if we can help narrow down the issue. Thanks for the logs! Indeed we're getting probed but then not handling those other devices... I'll try to reproduce it locally to see if I can figure something out. Created attachment 256331 [details]
After Patch custom Logs
Thank you and one more log for you. We see it trying the multiple firmware versions for each of the cards. However, once it finds one that works for one of the cards, it stops trying for the remaining cards. Interesting. But this didn't happen in the previous logs, did anything change? Maybe this time it stopped working for the other NICs a bit later, so it still requests the first FW version and then stops. OTOH, it's quite possible that the firmware loading code is what is causing this issue. I'll look into that. Looked a bit further and tried to reproduce here, without success. This definitely seems to be some sort of race condition with the request_firmware_nowait() API... Our driver is calling it for all the 4 NICs (as you mentioned), but only the callback for the first one is getting called. BTW, in the first log (and with my tests), the iwlwifi driver gets probed very early withing seconds of booting, but in your latest logs it gets probed only ~16 seconds later... Did you change something in your system to cause this? Created attachment 256347 [details]
patch to delay request_fw
Can you try this patch? It's a very ugly hack to add 500 msecs delay multiplied by the request number. This should give plenty of distance in time for each device being probed, so we can see if it's a race.
Another way to check what is going on without having to recompile the modules is to check if there are any pending firmware requests by checking sysfs: ls -l /sys/class/firmware This should have a list of devices that have pending firmware requests. Also, you could check if the iwlwifi driver still has those devices instantiated: ls -l /sys/modules/iwlwifi/devices Please run these commands after booting and post the results. We applied the patch. We changed it to be 500ms between each and it worked. All 4 wifi cards appear every time. The logs are attached. /sys/class/firmware doesn’t exist. Neither does /sys/module/iwlwifi/devices /sys/module/iwlwifi/refcnt is 4 /sys/module/iwlwifi/driver/pci:iwlwifi has links to all 4 devices ls -l /sys/modules/iwlwifi/devices Created attachment 256363 [details]
DMSG with 500ms delay
Cool! Yeah, I guess any delay there would work... I probably exaggerated in multiplying it. Maybe even less, like 100 msecs works, dunno. Strange that the /sys/class/firmware/ directory doesn't exist! It should be there since that's where the userspace pushes the files for the kernel to get. Can you run those ls commands again *without* the delay? Just to be sure that the requests are indeed pending at that point. So this indeed seems to be a race somewhere... Maybe systemd-udev (which I assume you are using) loses some of the events the kernel sends? It may be a good idea to get some udev logs... Hi Luca, I'm working with Nic on this issue. I have tried looking for /sys/class/firmware on the original kernel and the modified one and it does not exist for either. The timing difference (16 seconds into dmesg instead of a few seconds) is due to a change in using TPM. I now have to manually enter the WDE passphrase so it takes a few more seconds to boot. Can you give me some pointers on enabling systemd-udevd debug? Odd, I'll try to see if the /sys/class/firmware/ thing can be in a different place. You can use this: udevadm control -l 7 To increase the loglevel in udev. It will then print a bunch of stuff to your /var/log/syslog. I'm definitely not an expert in udev, though, so I'm not sure I will be able to help read it. With udev debugging enabled, I don't see any udev messaging between the probe and the firmware loading. I see udevd messages about 'inserting' iwlwifi and passing the device to netlink monitor followed by the probing of all 4 devices. The firmware is loaded for 4 all devices and then I see more unrelated udev stuff. Ok, I have done a bit more debugging and udev does not come into play here. I think what is happening is the request_firmware is put in a work queue and multiple requests for the same firmware are serialized. Since we are looking for several files that don't exist (version 28-version 22) the first request loads but subsequent requests are put in a wait queue until the first request is either aborted or done. I have tried a few changes in this area of code (firmware_class.c) but I have not found the bug yet. One change that seems to make sense is to move the fw_state_done() call in fw_get_filesystem_firmware() to just before the return. Regardless of the reason, we are done with the file at this point. I am going to keep playing with this. If you need more information or want logs, please ask. I think I may have a solution. What I am seeing is in firmware_class.c, when multiple driver instances are waiting on the same firmware file they will all be put in a swait queue. When the file is loaded into the firmware cache, it is marked as done which should release the other waiting instances. It does this but only for the first instance since the firmware will not be reloaded from the filesystem, it is never marked as "DONE" again so the rest of the instances stay in the swait queue. A small change to __fw_state_set() in firmware_class.c to swake_up_all() instead of just swake_up() will fix the problem and modifying the fw_get_filesystem_firmware() function to mark the firmware as done or aborted will ensure that swakup_all() gets called when it should. A patch will be attached soon. Created attachment 256493 [details]
possible patch for firmware_class.c
I see that the title of this has been changed. Is there anything else I can provide to help move this along? Please let me know. I'm looking at this now. (In reply to John Ewalt from comment #30) > Created attachment 256493 [details] > possible patch for firmware_class.c The changes you propose make sense! However can you please split them apart, identify which commit caused each issue, draw up with easy language to explain the impact of the issue and what each patch fixes ? As an example of types of enhancements to the commit needed which I'd like to see refer to a recent fix proposed also for firmware_class [0], after some good review of its impact this changed to be very descriptive [1] including references to the commits which the patch provides a fix for. For instance just the swake_up_all() change seems to one atomic change and you can supply a Fixes: 5b029624948d68 ("firmware: do not use fw_lock for fw_state protection") tag. $ git describe --contains 5b029624948d6864053166da1263df74c0c443df v4.10-rc1~147^2~6 So this fix alone should be merged on >= 4.10. You can see how I referred to a fix applicable to a stable kernel on the 2nd iteration fix above [1]. The other changes should be split off as well and similar work should be done. Once done please feel free to send these patches to me and Cc the others I Cc'd in the recent patch I Cc'd you on. Bonus points if we are able to draw up test cases for these issues into the driver data API [2] on the lib/test_driver_data.c. As I noted recently to Yi Li, we *don't yet* have a way to flex the cache infrastructure but it should be fairly easy to add a debugfs interface to let test this [3]. As I also noted on that thread, this particular issue on this bug report however spans the cache infrastructure since as you very noted two pending requests for the same file can end up being serialized. Upon a glance it would seem we actually have *dozens* of other error cases which we do not wake up pending callers for on failure, and these issues have been lingering in the kernel for ages. You seem to have addressed a few, but upon a glance I think we have many more to address. For this reason interface for this would help a lot. I can help you along the way. If this is too much I understand as well, at the very least try to post patches split up and I can take it from there. Also in the commit logs please refer to this bug entry. [0] https://lkml.kernel.org/r/20170523131607.28138.12306.stgit@localhost [1] https://lkml.kernel.org/r/20170524214027.7775-1-mcgrof@kernel.org [2] https://git.kernel.org/pub/scm/linux/kernel/git/mcgrof/linux-next.git/log/?h=20170524-driver-data-stable [3[ https://lkml.kernel.org/r/20170524190357.GB8951@wotan.suse.de [4] https://lkml.kernel.org/r/20170524203246.GD8951@wotan.suse.de I can split the patches and describe what I think each is doing but I am not sure I an do all that you ask. I will give it a shot tomorrow. Thank you for looking into this. Poke? I haven't seen patches yet.. I am sorry. We have to drop the effort to fix this due to time constraints. We may be able to revist in the future. |
Created attachment 255941 [details] A tgz of 3 png that show log output With multiple wifi cards plugged in the system will only enumerate the correct capabilities 50% of the time. If you boot the system and run ifconfig it is not guaranteed that all wifi devices show up. 1.) Step One Plug 2+ 6260 Wifi cards into a system. ( Directly onto system motherboard) 2.) Boot the device 3.) Run Ifconfig ( You will either see 1 wireless interface or all of them) It is random 4.) If all wifi cards are there the reboot and check again The attachments show the output of commands with 4 wifi cards plugged in. ifconfig only shows 1 wifi card lspci shows all 4 lshw -C network shows 4 wifi cards. 1 has the correct capabilities and 3 have incorrect capabilities.