Bug 214037 - iwlwifi: ax210 firmware crash from PNVM file failing to load
Summary: iwlwifi: ax210 firmware crash from PNVM file failing to load
Status: CLOSED CODE_FIX
Alias: None
Product: Drivers
Classification: Unclassified
Component: network-wireless-intel (show other bugs)
Hardware: Intel Linux
: P1 normal
Assignee: Default virtual assignee for network-wireless-intel
URL:
Keywords:
Depends on:
Blocks:
 
Reported: 2021-08-10 23:27 UTC by Sam Edwards
Modified: 2021-09-14 09:02 UTC (History)
1 user (show)

See Also:
Kernel Version: 5.14.0
Subsystem:
Regression: No
Bisected commit-id:


Attachments
kernel log showing iwlwifi microcode crash at startup (5.08 KB, text/plain)
2021-08-10 23:27 UTC, Sam Edwards
Details
Path to add PNVM assertion failure detection (3.46 KB, patch)
2021-09-09 16:04 UTC, Luca Coelho
Details | Diff
New version of the patch (3.81 KB, patch)
2021-09-10 11:59 UTC, Luca Coelho
Details | Diff

Description Sam Edwards 2021-08-10 23:27:24 UTC
Created attachment 298277 [details]
kernel log showing iwlwifi microcode crash at startup

Attached is a crash of the AX210's microcode that happens during initial boot. I can work around this problem by unbinding->binding the driver after the system finishes booting. This crash happens on every single boot.

I am not sure what fundamental change happens during the boot process that allows the AX210 to initialize properly later, but it's worth noting that USB isn't up at the time of this crash. Perhaps the crash has something to do with the USB-side interface (for btusb) not being initialized yet?
Comment 1 Sam Edwards 2021-08-31 04:42:47 UTC
Issue is still present in 5.14.0
Comment 2 Luca Coelho 2021-09-06 14:40:30 UTC
The SYSASSERT I can see in the logs point to missing PNVM file.

First of all, can you check whether you have the iwlwifi-ty-a0-gf-a0.pnv file installed in /lib/firmware? If not, you can install it from here:

https://git.kernel.org/pub/scm/linux/kernel/git/firmware/linux-firmware.git/tree/iwlwifi-ty-a0-gf-a0.pnvm


It's strange that it works on the second time, i.e. when you rebind the driver.  This means that the FW is probably there but cannot be accessed the first time around... Do you have the iwlwifi module compiled into the kernel?

Can you try to load the module with debug=0x00010000? You can do that by passing iwlwifi.debug=0x00010000 in the kernel arguments or by removing and reinstalling the iwlwifi module with this parameter (if it's compiled as a module).
Comment 3 Sam Edwards 2021-09-06 22:26:05 UTC
The iwlwifi module is builtin; the selection of firmware available in the initramfs /lib/firmware is chosen by scanning the kernel log for firmware filenames. The full linux-firmware package is available after boot. I can try fixing the initramfs system -- how does the SYSASSERT indicate the filename that is missing?

I'll update the issue summary now that we understand the condition that causes this crash.
Comment 4 Luca Coelho 2021-09-07 07:20:37 UTC
Thanks for the follow-up.

We can see what caused the firmware crash according to the SYSASSERT code:

[    2.992468] iwlwifi 0000:5b:00.0: 0x2010070D | ADVANCED_SYSASSERT

In this case, the 070D tells us that the reason is missing PNVM.

So, the PNVM file is probably not listed for initramfs.  What exactly do you mean by "scanning the kernel log for firmware filenames"?

Running modinfo against the modules will show the list of firmwares used by that modules (based on MODULE_FIRMWARE() entries in the source code).  But in iwlwifi's case you would end up with loads of binaries, because it supports tens of different HW models... We could, in theory, add MODULE_FIRMWARE() entries for the PNVM files, but I'm not sure it would help in your case.
Comment 5 Sam Edwards 2021-09-07 20:38:30 UTC
> What exactly do you mean by "scanning the kernel log for firmware filenames"?

To put it formally, the set of paths copied to initramfs:/lib/firmware is the intersection of the set of (regular file) paths under root:/lib/firmware and the set of all substrings found in the output of `dmesg`.

e.g. the firmware file "iwlwifi-ty-a0-gf-a0-63.ucode" is included in my initramfs because of the message "Direct firmware load for iwlwifi-ty-a0-gf-a0-63.ucode failed with error -2" appearing in dmesg output during a previous boot.

However, the filename "iwlwifi-ty-a0-gf-a0.pnvm" is NOT a substring of "0x2010070D | ADVANCED_SYSASSERT", so there is no explicit indication that this additional file is needed.

> In this case, the 070D tells us that the reason is missing PNVM.

It's still unclear how 070D tells us (me particularly) that "iwlwifi-ty-a0-gf-a0.pnvm" is missing - what resource should I have consulted to discover this for myself? Perhaps iwlwifi should have printed something like "missing PNVM" at the end of its error report?

> We could, in theory, add MODULE_FIRMWARE() entries for the PNVM files, but
> I'm not sure it would help in your case.

Other initramfs creation scripts do make use of modinfo's firmware listings to decide what firmware files must be included when a given module is bundled into the initramfs. If the PNVM isn't listed by MODULE_FIRMWARE(), users who use such scripts to put iwlwifi.ko in initramfs will not have this file at device probe time either. (But yes: in my specific case, this would make no difference.)

---

I suspect the resolution to this issue is twofold -- that is, the driver should:

1. Not fail to load the PNVM file silently. Log a message indicating that this failure has occurred which informs the user of the consequences of that file being missing. (And it would be helpful to log the filename when the file is successfully loaded, too.)

2. Prevent the device from crashing. Either:
a) Fix the crash in the device firmware (if this is possible, i.e. PNVM is optional), or
b) Don't even attempt to initialize the device firmware when no PNVM is loaded but the PNVM is required, if such an attempt can only result in a crash

Thoughts?
Comment 6 Luca Coelho 2021-09-07 20:58:23 UTC
(In reply to Sam Edwards from comment #5)
> > What exactly do you mean by "scanning the kernel log for firmware
> filenames"?
> 
> To put it formally, the set of paths copied to initramfs:/lib/firmware is
> the intersection of the set of (regular file) paths under root:/lib/firmware
> and the set of all substrings found in the output of `dmesg`.
> 
> e.g. the firmware file "iwlwifi-ty-a0-gf-a0-63.ucode" is included in my
> initramfs because of the message "Direct firmware load for
> iwlwifi-ty-a0-gf-a0-63.ucode failed with error -2" appearing in dmesg output
> during a previous boot.
> 
> However, the filename "iwlwifi-ty-a0-gf-a0.pnvm" is NOT a substring of
> "0x2010070D | ADVANCED_SYSASSERT", so there is no explicit indication that
> this additional file is needed.

This sounds like a very flaky way to figure out the needed firmwares.  In the past, there was no way to try to load a firmware without causing a "firmware load failed" error.  But nowadays, there are ways to do so without generating direct warnings.  For instance, in the iwlwifi case, we try to load the newest version we support fist, but if that is not available we try the second newest one and so on.  So not all firmwares that fail to load are strictly necessary.


> > In this case, the 070D tells us that the reason is missing PNVM.
> 
> It's still unclear how 070D tells us (me particularly) that
> "iwlwifi-ty-a0-gf-a0.pnvm" is missing - what resource should I have
> consulted to discover this for myself? Perhaps iwlwifi should have printed
> something like "missing PNVM" at the end of its error report?

This is an internal code in our FW and it is not public.  We can tell you what the codes are when they happen (and you would certainly be able to infer it, if we didn't), but we cannot give you a list of all available codes.

 
> > We could, in theory, add MODULE_FIRMWARE() entries for the PNVM files, but
> > I'm not sure it would help in your case.
> 
> Other initramfs creation scripts do make use of modinfo's firmware listings
> to decide what firmware files must be included when a given module is
> bundled into the initramfs. If the PNVM isn't listed by MODULE_FIRMWARE(),
> users who use such scripts to put iwlwifi.ko in initramfs will not have this
> file at device probe time either. (But yes: in my specific case, this would
> make no difference.)

This is an easy thing to do and, IMHO, it's more failsafe.

 
> I suspect the resolution to this issue is twofold -- that is, the driver
> should:
> 
> 1. Not fail to load the PNVM file silently. Log a message indicating that
> this failure has occurred which informs the user of the consequences of that
> file being missing. (And it would be helpful to log the filename when the
> file is successfully loaded, too.)

In some newer systems, we may also be able to get this data from the BIOS, via ACPI tables.  So it's not directly clear if the FW is needed.  But at some point we will know that the FW version we're loading requires a PNVM and then we can check whether we have the data or not and fail with a proper message if we don't.


> 2. Prevent the device from crashing. Either:
> a) Fix the crash in the device firmware (if this is possible, i.e. PNVM is
> optional), or

This is not really a crash.  The firmware _intentionally_ asserts that the PNVM has been loaded, because it is mandatory for FW versions that need it.


> b) Don't even attempt to initialize the device firmware when no PNVM is
> loaded but the PNVM is required, if such an attempt can only result in a
> crash

This is basically the same as point 1. above.


I'll add the MODULE_FIRMWARE() tag to the code and try to figure out the best way to notify the user that the PNVM file is needed when it's not there.

In the short term, you can just add that file to your intifs so it will work for you. :)
Comment 7 Sam Edwards 2021-09-07 23:11:25 UTC
> This sounds like a very flaky way to figure out the needed firmwares.  In the
> past, there was no way to try to load a firmware without causing a "firmware
> load failed" error.  But nowadays, there are ways to do so without generating
> direct warnings.

Haha, it does indeed sound flaky, but this is actually the first time in nearly a decade that this method has missed a required firmware file, so in practice I'm really happy with it. (If you know of a different method that, at the time, would have properly identified this PNVM file as necessary, please do share.)

> For instance, in the iwlwifi case, we try to load the newest version we
> support fist, but if that is not available we try the second newest one and
> so on.  So not all firmwares that fail to load are strictly necessary.

But remember, iwlwifi stops attempting to find older versions once it finds a suitable firmware file. When a new kernel (with a newer version of iwlwifi, which prefers a newer firmware file) boots, the newer filename is logged, and the newer file is copied during the next initramfs build. The only drawbacks are that the older file remains around until manually deleted, and that it takes one additional kernel upgrade cycle for the newer firmware to make it to the initramfs.

The ideal for me would be a file under /sys/module/*/ listing the firmware filenames each module has (un)successfully tried to load, so I can automatically remove firmware files that are no longer attempted. But I'm just thinking aloud; this is not the proper venue to discuss such a change.

> This is an internal code in our FW and it is not public.  We can tell you
> what the codes are when they happen (and you would certainly be able to infer
> it, if we didn't), but we cannot give you a list of all available codes.

Hm; I should probably add a section on the iwlwifi debugging page of the wireless wiki with this explanation, and include a table of code meanings that have been shared with the public to-date.

> In some newer systems, we may also be able to get this data from the BIOS,
> via ACPI tables.  So it's not directly clear if the FW is needed.  But at
> some point we will know that the FW version we're loading requires a PNVM and
> then we can check whether we have the data or not and fail with a proper
> message if we don't.

From when I looked, it seemed like iwlwifi looks in a UEFI variable first, then falls back to loading from FS. (I didn't see anything about ACPI.) But as long as the FS is the option of last resort, the firmware loading code sounds like a great place to complain about the file being mandatory+missing.

What is the function of the PNVM? Is it fair to say that the FW file is just the default in case no OEM overrides it?

> This is not really a crash.  The firmware _intentionally_ asserts that the
> PNVM has been loaded, because it is mandatory for FW versions that need it.

Okay; PNVM is not optional. Though, that just means that this is a crash of the "failed assertion" variety. A user-friendly driver avoids failed assertions!

> I'll add the MODULE_FIRMWARE() tag to the code and try to figure out the best
> way to notify the user that the PNVM file is needed when it's not there.
> In the short term, you can just add that file to your intifs so it will work
> for you. :)

Sounds good, on both counts. Thank you very much for your time, effort, and a stimulating conversation! :)
Comment 8 Luca Coelho 2021-09-08 11:03:04 UTC
(In reply to Sam Edwards from comment #7)
> Haha, it does indeed sound flaky, but this is actually the first time in
> nearly a decade that this method has missed a required firmware file, so in
> practice I'm really happy with it. (If you know of a different method that,
> at the time, would have properly identified this PNVM file as necessary,
> please do share.)

Heh, this is exactly what flaky means to me.  It works until it doesn't. ;)

But, as we already discussed, there would be no way for you to infer that this file was missing just by looking at dmesg... So let's fix that.


> But remember, iwlwifi stops attempting to find older versions once it finds
> a suitable firmware file. When a new kernel (with a newer version of
> iwlwifi, which prefers a newer firmware file) boots, the newer filename is
> logged, and the newer file is copied during the next initramfs build. The
> only drawbacks are that the older file remains around until manually
> deleted, and that it takes one additional kernel upgrade cycle for the newer
> firmware to make it to the initramfs.

Yes, you're right.  So even though the driver could try to load older FWs, it will stop when it finds the first usable one.  But you don't have any FWs when you try to figure out the FWs needed, so you would get the whole list.

 
> The ideal for me would be a file under /sys/module/*/ listing the firmware
> filenames each module has (un)successfully tried to load, so I can
> automatically remove firmware files that are no longer attempted. But I'm
> just thinking aloud; this is not the proper venue to discuss such a change.

Yeah, but there may be cases where the driver can load either one or another firmware, and the one you choose could cause a different behavior.  For instance, the driver could support two different FWs, one for AP mode and one for station.  And decide what mode to use depending on the FW it finds.  But this may be stretching a bit and, yes, this is not the right place to discuss it. :)

 
> Hm; I should probably add a section on the iwlwifi debugging page of the
> wireless wiki with this explanation, and include a table of code meanings
> that have been shared with the public to-date.

There's no guarantee that the codes will remain the same and, especially, they may differ from HW to HW, so the list can end up growing exponentially.


> From when I looked, it seemed like iwlwifi looks in a UEFI variable first,
> then falls back to loading from FS. (I didn't see anything about ACPI.) But
> as long as the FS is the option of last resort, the firmware loading code
> sounds like a great place to complain about the file being mandatory+missing.

Right, sorry, I mixed things up.  It is indeed in the UEFI not in ACPI, but the idea is the same. :)

And yes, if the PNVM data is in the UEFI, we won't even try to load the binary from the filesystem.

 
> What is the function of the PNVM? Is it fair to say that the FW file is just
> the default in case no OEM overrides it?

It's platform-specific data.  It may include PHY and RF values that are fine-tuned for the platform.  The "defaults" is what we have in the PNVM file in the filesystem.  The one UEFI can be fine-tuned by the OEMs to the specific platform design.

 
> > This is not really a crash.  The firmware _intentionally_ asserts that the
> > PNVM has been loaded, because it is mandatory for FW versions that need it.
> 
> Okay; PNVM is not optional. Though, that just means that this is a crash of
> the "failed assertion" variety. A user-friendly driver avoids failed
> assertions!

We do our best.  And the assertion is not in the driver, it's in the FW.  The FW doesn't try to recover from bad data/flows by itself, it just gives up and tells the driver, which should then recover.  Our driver will try to restart the FW and continue from there, but in cases where the assertion fails during device boot, there will be no way to recover.

So, not disagreeing with you, I also believe that the driver shouldn't even try to load the FW if the PNVM data is required but is not there.


> 
> > I'll add the MODULE_FIRMWARE() tag to the code and try to figure out the
> best
> > way to notify the user that the PNVM file is needed when it's not there.
> > In the short term, you can just add that file to your intifs so it will
> work
> > for you. :)
> 
> Sounds good, on both counts. Thank you very much for your time, effort, and
> a stimulating conversation! :)

Same here! Thanks for reporting and helping with this.
Comment 9 Luca Coelho 2021-09-09 16:03:30 UTC
I wrote a patch that should make the driver detect the PVNM assertion failure and print out the file it tried to read.  Can you please test it? Notice that this is just compile-tested.
Comment 10 Luca Coelho 2021-09-09 16:04:10 UTC
Created attachment 298719 [details]
Path to add PNVM assertion failure detection
Comment 11 Sam Edwards 2021-09-10 07:57:36 UTC
Hi Luca,

It might be a little while before I get to testing this patch. I suspect some bug in Linux 5.14.1 killed that system's rootfs a couple days ago and my focus is on investigating/reporting the cause before I restore a backup and continue using it.

If I have a spare moment I might try applying the patch and booting the patched kernel from a recovery USB.

But until then: I haven't forgotten about this, I *will* test this patch sooner or later!

Best,
Sam
Comment 12 Luca Coelho 2021-09-10 08:10:07 UTC
Okay! Good luck with recovering your system!

And let me know when you get a chance to test it, so we can close this bug.
Comment 13 Luca Coelho 2021-09-10 11:59:47 UTC
Created attachment 298737 [details]
New version of the patch

I updated the patch a bit.  There were a few issues in the previous ones, so please use this version instead.
Comment 14 Luca Coelho 2021-09-13 12:52:37 UTC
I'm going to close this now.  Please reopen this if you think the changes are not enough when you get the chance to test it.  Thanks! :)
Comment 15 Sam Edwards 2021-09-14 08:43:33 UTC
The patch did require replacing pnvm.h in 5.14.x with the copy from 5.15-rc1, but it does result in the proper dmesg notice:

iwlwifi 0000:5c:00.0: PNVM data is missing, please install iwlwifi-ty-a0-gf-a0.pnvm

So, success! :)
Comment 16 Luca Coelho 2021-09-14 09:02:29 UTC
(In reply to Sam Edwards from comment #15)
> The patch did require replacing pnvm.h in 5.14.x with the copy from
> 5.15-rc1, but it does result in the proper dmesg notice:
> 
> iwlwifi 0000:5c:00.0: PNVM data is missing, please install
> iwlwifi-ty-a0-gf-a0.pnvm
> 
> So, success! :)

Great! Thanks for testing.  And yeah, this won't go into v5.14 and probably not even v5.15.  It's not really a bug or regression, so it should only go to -next.

Note You need to log in before you can comment on or make changes to this bug.