Bug 191601 - iwlwifi: 7260: hardware gets stuck 0x5a5a5a5a - WIFILNX-567
Summary: iwlwifi: 7260: hardware gets stuck 0x5a5a5a5a - WIFILNX-567
Status: CLOSED INSUFFICIENT_DATA
Alias: None
Product: Drivers
Classification: Unclassified
Component: network-wireless
Hardware: All Linux
Importance: P1 normal
Assignee: DO NOT USE - assign "network-wireless-intel" component instead
URL:
Keywords:
Duplicates: 193131
Depends on:
Blocks:
 
Reported: 2016-12-31 07:34 UTC by ryan.jentzsch
Modified: 2021-02-01 09:33 UTC
CC List: 5 users

See Also:
Kernel Version: 4.9.0-040900
Subsystem:
Regression: No
Bisected commit-id:


Attachments
dmesg > dsmg.log output (248.55 KB, text/x-log)
2016-12-31 07:34 UTC, ryan.jentzsch
firmware with monitor enabled (1.00 MB, application/octet-stream)
2017-01-07 20:57 UTC, Emmanuel Grumbach
firmware with monitor enabled (1.00 MB, application/octet-stream)
2017-01-12 21:52 UTC, Emmanuel Grumbach
lspci -vvvv -xxxx (114.67 KB, text/plain)
2017-01-24 11:56 UTC, Hallo32
lspci -vvvv -xxxx after setpci -s 04:00.0 0x50.B=0x40 (114.66 KB, text/plain)
2017-01-24 12:28 UTC, Hallo32
output of sudo lspci -vvvv -xxxx from the card (16.58 KB, text/plain)
2017-01-24 18:39 UTC, Elizabeth Myers
Output of sudo lspci -vvvv -xxxx after sudo setpci -s 03:00.0 0x50.B=0x40 (16.58 KB, text/plain)
2017-01-24 18:41 UTC, Elizabeth Myers
new dmesg with error messages, sudo setpci -s 04:00.0 0x50.B=0x40 has not been set (359.01 KB, text/x-log)
2017-01-26 01:37 UTC, Hallo32
Script to fix wifi getting stuck so you don't need to reboot (2.22 KB, application/x-shellscript)
2017-03-20 00:09 UTC, ryan.jentzsch
dmesg under 5.10.0-1 (debian testing) (9.98 KB, text/plain)
2021-02-01 09:33 UTC, Pietro Battiston

Description ryan.jentzsch 2016-12-31 07:34:34 UTC
Created attachment 249421 [details]
dmesg > dsmg.log output

linuxwifi@intel.com 
firmware-version: 17.352738.0
My system is using /lib/firmware/iwlwifi-7260-17.ucode

To get the wireless back up again without rebooting I have to do this:

sudo service network-manager stop
echo 1 | sudo tee /sys/bus/pci/devices/0000:08:00.0/remove
sudo killall wpa_supplicant
sleep 1
echo 1 | sudo tee /sys/bus/pci/rescan
sleep 2
sudo rmmod iwlmvm iwlwifi && sudo modprobe iwlmvm
sudo ifconfig wlan0 up
sudo service network-manager restart

See attached dmesg log.
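The recovery steps above can be collected into one function. This is only a sketch under assumptions: the names `run` and `recover_wifi`, the `DRY_RUN` switch and passing the PCI address as an argument are illustrative and not part of the report. The modules are reloaded with `modprobe iwlmvm`, which should pull in iwlwifi as a dependency.

```shell
#!/bin/sh
# Sketch of the recovery sequence from the report, wrapped in a function.
# Assumptions (not from the report): the names run and recover_wifi,
# the DRY_RUN switch, and taking the PCI address as an argument.

run() {
    # With DRY_RUN=1 only print the command instead of executing it.
    if [ "${DRY_RUN:-0}" = "1" ]; then
        echo "$*"
    else
        "$@"
    fi
}

recover_wifi() {
    dev="${1:-0000:08:00.0}"   # PCI address of the wifi card on the reporter's laptop
    run sudo service network-manager stop
    run sudo sh -c "echo 1 > /sys/bus/pci/devices/$dev/remove"
    run sudo killall wpa_supplicant
    run sleep 1
    run sudo sh -c "echo 1 > /sys/bus/pci/rescan"
    run sleep 2
    run sudo rmmod iwlmvm iwlwifi && run sudo modprobe iwlmvm
    run sudo ifconfig wlan0 up
    run sudo service network-manager restart
}
```

Running it once with `DRY_RUN=1 recover_wifi 0000:08:00.0` prints the commands without touching the system, which is a cheap way to check the PCI address before using it for real.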
Comment 1 ryan.jentzsch 2016-12-31 07:36:22 UTC
I've tried this in numerous kernel versions always with the same result.
I've also tried this with several wireless routers.
Comment 2 Emmanuel Grumbach 2017-01-07 20:57:13 UTC
Created attachment 250721 [details]
firmware with monitor enabled

This kind of bug involves the firmware team, and they are very busy these days.

Let's collect the right logs for them anyway.

Please install the firmware I attached and reproduce the bug with this version.
I'll need you to follow the steps explained here:

https://wireless.wiki.kernel.org/en/users/drivers/iwlwifi/debugging#firmware_debugging


Please take the time to read our privacy notice:
https://wireless.wiki.kernel.org/en/users/drivers/iwlwifi/debugging#privacy_aspects

Thank you.
Comment 3 ryan.jentzsch 2017-01-12 09:41:12 UTC
I followed the instructions and emailed them the log.
Will do so each time this thing crashes.
Comment 4 Emmanuel Grumbach 2017-01-12 09:44:59 UTC
Hi,

the capture didn't work.
Can you send the dmesg output to see why?
I'll send a new firmware a bit later.
Comment 5 Emmanuel Grumbach 2017-01-12 21:52:13 UTC
Created attachment 251401 [details]
firmware with monitor enabled

Hi,

Please try this firmware to record the firmware dump and send it again.
You can encrypt it and attach it to the bug and send it privately to me.

Thanks.
Comment 6 ryan.jentzsch 2017-01-13 12:18:44 UTC
Just Crashed (haven't tried above firmware yet). Here's the dmesg output:


[10151.023981] WARNING: CPU: 0 PID: 0 at /home/kernel/COD/linux/drivers/net/wireless/intel/iwlwifi/iwl-trans.h:1194 iwl_trans_pcie_log_scd_error+0x26c/0x300 [iwlwifi]
[10151.023982] Modules linked in: cpuid ccm ipt_MASQUERADE nf_nat_masquerade_ipv4 xfrm_user xfrm_algo iptable_nat nf_conntrack_ipv4 nf_defrag_ipv4 nf_nat_ipv4 xt_addrtype iptable_filter ip_tables xt_conntrack x_tables nf_nat nf_conntrack br_netfilter bridge stp dm_thin_pool dm_persistent_data dm_bio_prison dm_bufio libcrc32c appletalk ipx p8023 p8022 psnap llc rfcomm bbswitch(OE) bnep uvcvideo nvidia_uvm(POE) videobuf2_vmalloc videobuf2_memops videobuf2_v4l2 videobuf2_core videodev nvidia_drm(POE) intel_rapl nvidia_modeset(POE) x86_pkg_temp_thermal media intel_powerclamp nvidia(POE) hid_multitouch btusb btrtl coretemp btbcm btintel kvm_intel bluetooth kvm irqbypass snd_hda_codec_hdmi snd_hda_codec_realtek crct10dif_pclmul crc32_pclmul ghash_clmulni_intel aesni_intel snd_hda_codec_generic snd_hda_intel
[10151.024008]  aes_x86_64 snd_hda_codec snd_hda_core iwlmvm snd_hwdep snd_pcm lrw snd_seq_midi snd_seq_midi_event glue_helper ablk_helper snd_rawmidi snd_seq cryptd hp_accel lis3lv02d snd_seq_device input_leds joydev serio_raw intel_cstate intel_rapl_perf input_polldev snd_timer iwlwifi mei_me mei snd soundcore acpi_pad lpc_ich wmi hp_wireless shpchp mac_hid binfmt_misc arc4 rt2800usb rt2x00usb rt2800lib rt2x00lib mac80211 cfg80211 parport_pc ppdev lp parport autofs4 dm_mirror dm_region_hash dm_log hid_generic i915 i2c_algo_bit drm_kms_helper syscopyarea sysfillrect sysimgblt fb_sys_fops ahci psmouse r8169 usbhid mii libahci hid drm video fjes
[10151.024037] CPU: 0 PID: 0 Comm: swapper/0 Tainted: P        W  OE   4.9.0-040900-generic #201612111631
[10151.024037] Hardware name: Hewlett-Packard HP ENVY m7 Notebook PC /229D, BIOS F.34 12/19/2014
[10151.024039]  ffff90161ec03d88 ffffffff9e417982 0000000000000000 0000000000000000
[10151.024041]  ffff90161ec03dc8 ffffffff9e083d4b 000004aa1ec03dd8 000000000000001e
[10151.024043]  00000000a5a5a5a5 0000000000a02eac ffff90160cc50018 000000005a5a5a5a
[10151.024044] Call Trace:
[10151.024045]  <IRQ> 
[10151.024049]  [<ffffffff9e417982>] dump_stack+0x63/0x81
[10151.024051]  [<ffffffff9e083d4b>] __warn+0xcb/0xf0
[10151.024053]  [<ffffffff9e083e7d>] warn_slowpath_null+0x1d/0x20
[10151.024057]  [<ffffffffc060b97c>] iwl_trans_pcie_log_scd_error+0x26c/0x300 [iwlwifi]
[10151.024059]  [<ffffffff9e0b8cdd>] ? cpu_load_update+0xdd/0x150
[10151.024063]  [<ffffffffc0601c4a>] iwl_pcie_txq_stuck_timer+0x7a/0xa0 [iwlwifi]
[10151.024066]  [<ffffffffc0601bd0>] ? iwl_pcie_txq_inc_wr_ptr+0x100/0x100 [iwlwifi]
[10151.024068]  [<ffffffff9e0f65a5>] call_timer_fn+0x35/0x120
[10151.024070]  [<ffffffff9e0f6b35>] run_timer_softirq+0x215/0x4b0
[10151.024071]  [<ffffffff9e0ff0d1>] ? ktime_get+0x41/0xb0
[10151.024073]  [<ffffffff9e052f46>] ? lapic_next_deadline+0x26/0x30
[10151.024076]  [<ffffffff9e890394>] __do_softirq+0x104/0x28c
[10151.024077]  [<ffffffff9e08a1c6>] irq_exit+0xb6/0xc0
[10151.024078]  [<ffffffff9e8901a2>] smp_apic_timer_interrupt+0x42/0x50
[10151.024080]  [<ffffffff9e88f4b2>] apic_timer_interrupt+0x82/0x90
[10151.024080]  <EOI> 
[10151.024083]  [<ffffffff9e70fa22>] ? cpuidle_enter_state+0x122/0x2c0
[10151.024084]  [<ffffffff9e70fbf7>] cpuidle_enter+0x17/0x20
[10151.024085]  [<ffffffff9e0c93d3>] call_cpuidle+0x23/0x40
[10151.024087]  [<ffffffff9e0c964b>] cpu_startup_entry+0x15b/0x240
[10151.024088]  [<ffffffff9e87ffc7>] rest_init+0x77/0x80
[10151.024090]  [<ffffffff9ef85fd3>] start_kernel+0x448/0x469
[10151.024092]  [<ffffffff9ef85120>] ? early_idt_handler_array+0x120/0x120
[10151.024093]  [<ffffffff9ef852ca>] x86_64_start_reservations+0x24/0x26
[10151.024095]  [<ffffffff9ef85419>] x86_64_start_kernel+0x14d/0x170
[10151.024096] ---[ end trace 43d95ed52794e816 ]---
[10151.059152] iwlwifi 0000:08:00.0: Q 30 is active and mapped to fifo 2 ra_tid 0xa5a5 [90,1515870810]
Comment 7 Emmanuel Grumbach 2017-01-13 14:19:46 UTC
[10151.059152] iwlwifi 0000:08:00.0: Q 30 is active and mapped to fifo 2 ra_tid 0xa5a5 [90,1515870810]

the 0xa5a5 means it is the same bug.
Apparently, you can repro consistently.
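For readers comparing logs: the two numbers in brackets on that line are the fill pattern from the bug title written in decimal (90 is 0x5a, 1515870810 is 0x5a5a5a5a). A small sketch for converting such values and for checking a saved dmesg for the same symptom follows. The function names and the exact strings matched are illustrative assumptions, not something the driver defines.

```shell
#!/bin/sh
# Convert a decimal value from the log line to hexadecimal.
to_hex() {
    printf '0x%x\n' "$1"
}

# Read kernel log text on stdin and succeed if it contains either
# fill pattern discussed in this bug. The name and the patterns
# matched are illustrative.
has_stuck_pattern() {
    grep -Eq 'ra_tid 0xa5a5|5a5a5a5a'
}
```

Typical use would be `dmesg | has_stuck_pattern && echo "same symptom"`.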
Comment 8 ryan.jentzsch 2017-01-13 18:56:15 UTC
The problem occurs at random times. After a reboot the wifi can be fine for over an hour, or for only 20 minutes. It doesn't matter what I'm doing: I can be active on the web or doing something that doesn't use the wifi. I usually hear my laptop fan kick on as if the CPUs are working hard; afterwards I run the script I mentioned in the first comment to get wifi back. What's interesting is that after running the script I lose wifi again more quickly. I run the script again, and eventually I lose wifi so quickly that I have to reboot.

Just happened again. I'd post the dmesg but it is the same info as before. Searching the web for this error turns up reports where it happened when the system went into suspend and was fixed in an upstream kernel build. My situation is different in that my laptop is active when this occurs (in fact I have it configured to prevent suspend).
Comment 9 Elizabeth Myers 2017-01-21 01:35:08 UTC
The bug always happens at random times for me as well. It's not predictable when it occurs; I can sometimes go a day without it, it sometimes happens 5 minutes later. But it does happen on a constant basis.
Comment 10 Elizabeth Myers 2017-01-21 04:41:02 UTC
Hi,

So I've installed the debug firmware. Next time I get a hang like that I'll email the dump to you.
Comment 11 Emmanuel Grumbach 2017-01-21 17:37:15 UTC
(In reply to Elizabeth Myers from comment #10)
> Hi,
> 
> So I've installed the debug firmware. Next time I get a hang like that I'll
> email the dump to you.

We got the debug dump you sent to us privately.
Comment 12 Emmanuel Grumbach 2017-01-22 07:14:11 UTC
I opened an internal ticket with all the data. Thanks for your cooperation.
Comment 13 Emmanuel Grumbach 2017-01-22 11:49:35 UTC
*** Bug 193131 has been marked as a duplicate of this bug. ***
Comment 14 Emmanuel Grumbach 2017-01-24 07:01:54 UTC
I received more logs from Elizabeth.

Can you please attach the output of:

sudo lspci -vvvv -xxxx
Comment 15 Hallo32 2017-01-24 11:56:45 UTC
Created attachment 253041 [details]
lspci -vvvv -xxxx
Comment 16 Emmanuel Grumbach 2017-01-24 12:21:41 UTC
can you try to do:

sudo setpci -s 04:00.0 0x50.B=0x40

Then run sudo lspci -vvvv -xxxx again and attach its output.

Then, let me know if it helps.

Thanks.
Comment 17 Hallo32 2017-01-24 12:28:47 UTC
Created attachment 253051 [details]
lspci -vvvv -xxxx after setpci -s 04:00.0 0x50.B=0x40

This is the requested lspci output after the command
sudo setpci -s 04:00.0 0x50.B=0x40
Comment 18 Hallo32 2017-01-24 12:32:00 UTC
Hey Emmanuel Grumbach,

The problem is that I'm not able to trigger the bug on demand. It may happen or it may not happen. My initial report has been linked to this one.
Comment 19 Emmanuel Grumbach 2017-01-24 12:32:20 UTC
LnkCtl:	ASPM Disabled; RCB 64 bytes Disabled- CommClk+

Cool - it worked.
Let me know if things are more stable for you with this setting.
Note that this setting will not survive reboot or suspend / resume cycle.
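To check the setting again later (for example after a reboot or resume), the ASPM field can be pulled out of the lspci output. This is a minimal sketch; the function name `aspm_state` is made up for illustration, and it assumes the `LnkCtl:` line has the layout shown above.

```shell
#!/bin/sh
# Read `lspci -vv` output on stdin and print the ASPM field of each
# LnkCtl line. The function name aspm_state is illustrative.
aspm_state() {
    sed -n 's/^[[:space:]]*LnkCtl:[[:space:]]*ASPM \([^;]*\);.*/\1/p'
}
```

For example, `sudo lspci -vv -s 04:00.0 | aspm_state` should print `Disabled` while the workaround is in effect.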
Comment 20 Emmanuel Grumbach 2017-01-24 12:33:35 UTC
(In reply to Hallo32 from comment #18)
> Hey Emmanuel Grumbach,
> 
> the problem is, that I'm not able to trigger the problem directly. It may
> happen or it may not happen. My initial report has been linked to this one.

I know it is highly random.
Comment 21 Elizabeth Myers 2017-01-24 18:39:08 UTC
Created attachment 253061 [details]
output of sudo lspci -vvvv -xxxx from the card

Hi,

here is the output of sudo lspci -vvvv -xxxx without the setpci command.
Comment 22 Elizabeth Myers 2017-01-24 18:41:40 UTC
Created attachment 253071 [details]
Output of sudo lspci -vvvv -xxxx after sudo setpci -s 03:00.0 0x50.B=0x40

This is after running the setpci command after adjusting for the PCI ID of my card.
Comment 23 Elizabeth Myers 2017-01-24 18:42:33 UTC
I will let you know if things are more stable with this setting or if it happens again.
Comment 24 Emmanuel Grumbach 2017-01-24 20:27:35 UTC
(In reply to Elizabeth Myers from comment #23)
> I will let you know if things are more stable with this setting or if it
> happens again.

L1 is already disabled on your system... so L1 can't be causing this. I am afraid there isn't much we can do now...
Can you check if your BIOS is updated?
Comment 25 Elizabeth Myers 2017-01-24 21:40:22 UTC
It happened again. I do not have the latest BIOS, I'll update it and report back.
Comment 26 Hallo32 2017-01-26 01:37:30 UTC
Created attachment 253141 [details]
new dmesg with error messages, sudo setpci -s 04:00.0 0x50.B=0x40 has not been set

A new dmesg log with errors related to iwlwifi.
"sudo setpci -s 04:00.0 0x50.B=0x40" has not been set.
Sorry, I forgot to set it again after startup.

Could it be that the problem doesn't appear when there is traffic on the wlan card?
Comment 27 ryan.jentzsch 2017-01-26 07:22:25 UTC
(In reply to Hallo32 from comment #26)
> Created attachment 253141 [details]
> new dmesg with error messages, sudo setpci -s 04:00.0 0x50.B=0x40 has not
> been set
> 
> A new dmesg log with errors related to iwlwifi.
> "sudo setpci -s 04:00.0 0x50.B=0x40" has not been set.
> Sorry, I forgot to set it again after startup.
> 
> Could it be that the problem doesn't appear when there is traffic on the
> wlan card?

I thought that as well. I kept a ping of 8.8.8.8 going all the time and I still get intermittent failures. It's really annoying since my laptop acts like the wireless adapter is half there. See the script in my first comment I created so I don't need to reboot each time.
Comment 28 ryan.jentzsch 2017-01-26 07:29:55 UTC
I have an HP laptop, and HP laptops are notorious for borking your Linux system during a BIOS update due to bad M$ UEFI implementations. So I hesitate to "upgrade" the BIOS. Also, I need to find a Windows computer to do the upgrade, since the BIOS USB flash creator only runs under Windows.

One other thing of note is this problem started happening when I installed Xenial. I've tried downgrading to different kernels, but something is hanging around where the issue persists. I'm almost to the point of wiping my hard drive and doing a clean install.
Comment 29 ryan.jentzsch 2017-01-26 07:41:09 UTC
(In reply to Emmanuel Grumbach from comment #16)
> can you try to do:
> 
> sudo setpci -s 04:00.0 0x50.B=0x40
> 
> run sudo lspci -vvvv -xxxx again and paste the output of the command above.
> 
> Then, let me know if it helps.
> 
> Thanks.

What exactly does this command do? The manpage for it is written in an alien language that I don't understand.
Comment 30 Emmanuel Grumbach 2017-01-26 08:10:53 UTC
This command clears the ASPM enabled bit in the PCI config space.
Bottom line: it disables a power saving feature of PCI which can be causing problems in some cases.
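To make the value 0x40 less opaque, here is a sketch that decodes the low byte of the PCIe Link Control register. It assumes that offset 0x50 is where that register sits on these cards (the LnkCtl line in the attached lspci output is what changes after the command). In the register, bits 1:0 select ASPM and bit 6 is the common clock setting. The function name `decode_lnkctl` is illustrative and not part of pciutils.

```shell
#!/bin/sh
# Decode the low byte of the PCIe Link Control register.
# Bits 1:0 are the ASPM control field, bit 6 is Common Clock Configuration.
# decode_lnkctl is an illustrative name, not a pciutils command.
decode_lnkctl() {
    val=$(( $1 ))
    case $(( val & 0x3 )) in
        0) aspm="Disabled" ;;
        1) aspm="L0s" ;;
        2) aspm="L1" ;;
        3) aspm="L0s+L1" ;;
    esac
    if [ $(( val & 0x40 )) -ne 0 ]; then clk="+"; else clk="-"; fi
    echo "ASPM $aspm CommClk$clk"
}
```

`decode_lnkctl 0x40` gives `ASPM Disabled CommClk+`: both ASPM bits are clear and only the common clock bit remains set.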
Comment 31 Hallo32 2017-01-26 11:00:16 UTC
@ryan

On which kernel version are you? (uname -r)

I had installed Xenial first with the 4.4 kernel and moved to 4.8 through the "LTS Enablement Stack" option.

I have noticed the error messages since I moved to 4.8.
Comment 32 ryan.jentzsch 2017-01-27 13:20:47 UTC
(In reply to Hallo32 from comment #31)
> @ryan
> 
> On which kernel version are you? (uname -r)
> 
> I had installed Xenial first with the 4.4 kernel and moved to 4.8 through
> the "LTS Enablement Stack" option.
> 
> I have noticed the error messages since I moved to 4.8.

Currently 4.9.3-040903-generic
I kept on the "bleeding edge" in the hopes a new kernel would fix the wifi issue.

It appears that the `sudo setpci -s 08:00.0 0x50.B=0x40` command may have solved the issue for me. I've gone for more than three hours without a disconnect (longest time so far).
Comment 33 Elizabeth Myers 2017-01-27 23:54:38 UTC
I have updated my BIOS and it seems to have resolved my problem. I'll keep my eyes peeled and see if it happens again.
Comment 34 ryan.jentzsch 2017-01-28 02:21:39 UTC
(In reply to ryan.jentzsch from comment #32)
> (In reply to Hallo32 from comment #31)
> > @ryan
> > 
> > On which kernel version are you? (uname -r)
> > 
> > I had installed Xenial first with the 4.4 kernel and moved to 4.8 through
> > the "LTS Enablement Stack" option.
> > 
> > I have noticed the error messages since I moved to 4.8.
> 
> Currently 4.9.3-040903-generic
> I kept on the "bleeding edge" in the hopes a new kernel would fix the wifi
> issue.
> 
> It appears that the `sudo setpci -s 08:00.0 0x50.B=0x40` command may have
> solved the issue for me. I've gone for more than three hours without a
> disconnect (longest time so far).

Spoke too soon. Situation normal FUBAR. Although it takes a bit longer to go stupid now.
Comment 35 Emmanuel Grumbach 2017-01-29 06:46:59 UTC
@Elizabeth: can you please attach the output of sudo lspci -vvvv -xxxx after the BIOS update?

@Ryan: Are you sure the command was applied after reboot / suspend / whatever other flow that may impact the PCIe config space?

Please note that there is very little we can do from the driver side about this issue.

The 0x5a5a5a5a can point to a firmware bug, but since you all see a 0xffffffff value in the CSR registers before it happens, this really points to a (transient?) PCIe problem which can be analyzed only with the device operating in a lab environment and plugged to a PCIe analyzer. This is something that OEMs would do, but not end users...
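Since the setpci setting is lost on reboot and suspend / resume, one way to rule that out is to reapply it automatically. The sketch below is an assumption-laden example, not a tested fix: the function names, the `SETPCI` and `WIFI_DEV` variables and the default device address are illustrative, and the hook follows the convention of systemd system-sleep scripts, which are called with `pre` or `post` as first argument.

```shell
#!/bin/sh
# Reapply the setpci workaround; meant to be called at boot and after resume.
# Assumptions: the function names, the SETPCI override (handy for testing)
# and the WIFI_DEV variable are illustrative. Needs root when run for real.
disable_aspm() {
    dev="${1:-08:00.0}"
    "${SETPCI:-setpci}" -s "$dev" 0x50.B=0x40
}

# Entry point in the style of a systemd system-sleep hook.
on_sleep_event() {
    # $1 is "pre" before sleeping and "post" after waking up.
    if [ "$1" = "post" ]; then
        disable_aspm "${WIFI_DEV:-08:00.0}"
    fi
}
```

Adjust the device address to your own card (04:00.0 and 03:00.0 were used by other reporters here).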
Comment 36 Hallo32 2017-01-29 10:05:00 UTC
(In reply to Emmanuel Grumbach from comment #35)
> Please note that there is very little we can do from the driver side about
> this issue.
> 
> The 0x5a5a5a5a can point to a firmware bug, but since you all see a
> 0xffffffff value in the CSR registers before it happens, this really points
> to a (transient?) PCIe problem which can be analyzed only with the device
> operating in a lab environment and plugged to a PCIe analyzer. This is
> something that OEMs would do, but not end users...

You are right, a PCIe analyzer is not in place.

But what changed between kernel 4.4 and 4.8 in relation to ASPM in the driver and firmware code? The hardware didn't change in that time frame.
In this case I expect the problem to be at the software level, or in tighter constraints on the ASPM transition on the software side.

Is there an option to force an ASPM transition?

Is there an option to dump the register values at the transition between the different ASPM states?
Comment 37 Emmanuel Grumbach 2017-01-29 10:10:55 UTC
You can force the aspm state using the command I sent or using the pcie bus driver. Look at http://lxr.free-electrons.com/ident?i=pcie_aspm_disable
Comment 38 Hallo32 2017-01-29 10:23:47 UTC
(In reply to Emmanuel Grumbach from comment #37)
> You can force the aspm state using the command I sent or using the pcie bus
> driver. Look at http://lxr.free-electrons.com/ident?i=pcie_aspm_disable

Sure, I can force it to disable but is it also possible to force the transition to the state L0s and L1?
Comment 39 Emmanuel Grumbach 2017-01-29 10:30:38 UTC
You can force enable L1, but you can't force an actual transition.
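On the bus driver side, the kernel exposes its ASPM policy in /sys/module/pcie_aspm/parameters/policy, with the active entry in square brackets. A small sketch for reading it is shown below; the function name is illustrative. Whether writing a new policy is accepted depends on the platform, so treat this as a diagnostic aid only.

```shell
#!/bin/sh
# Read the contents of /sys/module/pcie_aspm/parameters/policy on stdin
# and print the active policy, which the kernel marks with square brackets.
# The function name is illustrative.
active_aspm_policy() {
    sed -n 's/.*\[\([^]]*\)\].*/\1/p'
}
```

Usage: `active_aspm_policy < /sys/module/pcie_aspm/parameters/policy`.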
Comment 40 ryan.jentzsch 2017-01-31 16:34:20 UTC
Physically covering the wifi card pin so that it NEVER gets the ASPM signal to go to sleep is what I am about to try: https://www.google.com/webhp?sourceid=chrome-instant&ion=1&espv=2&ie=UTF-8#q=pin+20+wifi+card
Comment 41 ryan.jentzsch 2017-01-31 16:51:28 UTC
Before I try to override the hardware I want to make clear that this is a SOFTWARE related bug (see @Hallo32 comment #36), also the wifi was previously working for me without issues AND I have a multi-boot machine with Win10 that the wifi has consistently worked without issues when running under Windows (doing many similar tasks to when I am in Linux -- I test my applications on both platforms before releasing to production).
Comment 42 Hallo32 2017-01-31 17:21:50 UTC
I'm sure that pin 20 has nothing to do with this problem.
Comment 43 ryan.jentzsch 2017-02-02 21:01:02 UTC
(In reply to Hallo32 from comment #42)
> I'm sure that pin 20 has nothing to do with this problem.

You were right but it was worth a try as I'm almost to the point of wiping my system and doing a fresh install.
Comment 44 Emmanuel Grumbach 2017-02-02 21:23:08 UTC
What can be useful here is to try older versions of the firmware.
Just remove -17.ucode and check what happens.

To see whether the problem comes from the driver, it can be useful to take a working kernel, install our backport tree [1] on it and see what happens. This will allow you to have a "good" base kernel but a supposedly "bad" driver.
An advantage of this is that the backport tree is easily bisectable: it includes our driver and the WiFi stack only.

[1] http://git.kernel.org/cgit/linux/kernel/git/iwlwifi/backport-iwlwifi.git/
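For the firmware test, moving the file aside is safer than deleting it, since it can be put back afterwards. A minimal sketch, assuming the file naming used earlier in this report (iwlwifi-7260-N.ucode); the function name and the `.disabled` suffix are illustrative. Whether the driver then accepts an older revision depends on the kernel in use.

```shell
#!/bin/sh
# Move one firmware revision out of the way instead of deleting it,
# so it can be restored later. The function name and the ".disabled"
# suffix are illustrative.
#   $1: firmware revision, e.g. 17
#   $2: firmware directory (defaults to /lib/firmware)
disable_firmware() {
    fw_dir="${2:-/lib/firmware}"
    fw="$fw_dir/iwlwifi-7260-$1.ucode"
    [ -f "$fw" ] || { echo "not found: $fw" >&2; return 1; }
    mv "$fw" "$fw.disabled"
}
```

Run `disable_firmware 17` as root, then reload the driver or reboot; renaming the file back restores the original state.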
Comment 45 Hallo32 2017-02-03 06:36:33 UTC
@Emmanuel Grumbach

Not directly related to this bug; I expect this is a different one.

Is there a command to check, if the wlan card is still alive and in a valid working state?

The intention is to get an idea of what is failing at that moment, the wlan card or the "network stack". I'm seeing a lost wlan connection but no entries in dmesg. LAN works as expected shortly after the problem appears (the time I need to attach the cable).
Comment 46 Elizabeth Myers 2017-02-04 08:31:43 UTC
I spoke too soon as well. It's happening again.

It is worth noting that it was formerly stable, often with months of uptime and no issues, but is no longer. I can't really pinpoint when the issue began, unfortunately. I think it began in November, which is wholly unhelpful, but I believe it was around kernel 4.7-ish.
Comment 47 ryan.jentzsch 2017-02-07 17:31:09 UTC
(In reply to Elizabeth Myers from comment #46)
> I spoke too soon as well. It's happening again.
> 
> It is worth noting that it was formerly stable, often with months of uptime
> and no issues, but is no longer. I can't really pinpoint when the issue
> began, unfortunately. I think it began in November, which is wholly
> unhelpful, but I believe it was around kernel 4.7-ish.

This was working for me until I upgraded to Linux Mint 18 (Linux kernel 4.4 and an Ubuntu 16.04 package base). The problem didn't immediately manifest itself. It was some time after the upgrade from LM 17 to 18 that this problem started.  I've tried to downgrade back to the 4.4 kernel but to no avail -- the problem persists. Something somewhere changed in the software -- Because this happens at random and unpredictable times it is nearly impossible to diagnose and fix (I get this and appreciate any dev who is trying to figure this out and fix it). 
Researching this, it appears that the 17.ucode driver was patched to "fix" a bug where the power state was assumed to always be on. I wish I could find the link to this patch, but as I was researching, the wifi of course went stupid.
Comment 48 ryan.jentzsch 2017-02-07 17:39:28 UTC
(In reply to Emmanuel Grumbach from comment #44)
> What can be useful here is to try older versions of the firmware.
> Just remove -17.ucode and check what happens.
> 
> To see whether the problem comes from the driver, it can be useful to take a
> working kernel, install our backport tree [1] on it and see what happens.
> This will allow you to have a "good" base kernel but a supposedly "bad" driver.
> An advantage of this is that the backport tree is easily bisectable: it
> includes our driver and the WiFi stack only.
> 
> [1] http://git.kernel.org/cgit/linux/kernel/git/iwlwifi/backport-iwlwifi.git/

Successfully compiled and installed the backport as su:
1. Reboot
2. dmesg says Intel backport iwlwifi installed! Yea!
3. Wicd (and after reinstalling network-manager) both indicate no available networks. wlan0 is there just can't connect or "see" any networks. Bummer :(
4. Removed the -17.ucode and reboot.
5. System not happy. No wlan0. dmesg indicates "driver *** -17.ucode is REQUIRED".
6. Restored -17.ucode and reboot. Wash...Rinse...Repeat.
7. Had to boot to an older kernel because the current kernel is borked from the backport install.
8. Note: I have a script that at boot time tries to shut off the power manager for wlan0. After the backport this script reported that this feature is not available on the wlan0 interface. Not sure this matters or not...
Comment 49 Emmanuel Grumbach 2017-02-07 17:45:05 UTC
Wicd uses wireless extensions. You can try to enable them with make config in the backport directory
Comment 50 Emmanuel Grumbach 2017-02-07 17:46:51 UTC
It is possible to remove backport without switching kernel. I'll post the instructions later.
Comment 51 Emmanuel Grumbach 2017-02-07 20:49:13 UTC
(In reply to Emmanuel Grumbach from comment #50)
> It is possible to remove backport without switching kernel. I'll post the
> instructions later.

sudo rm -rf /lib/modules/`uname -r`/updates/
sudo depmod -a
Comment 52 ryan.jentzsch 2017-02-08 05:37:49 UTC
(In reply to Emmanuel Grumbach from comment #51)
> (In reply to Emmanuel Grumbach from comment #50)
> > It is possible to remove backport without switching kernel. I'll post the
> > instructions later.
> 
> sudo rm -rf /lib/modules/`uname -r`/updates/
> sudo depmod -a

Thanks this removed the backport.
Comment 53 ryan.jentzsch 2017-02-08 05:41:24 UTC
(In reply to Emmanuel Grumbach from comment #49)
> Wicd uses wireless extensions. You can try to enable them with make config
> in the backport directory

Looking at the Makefile it appears that wext is already being built and wicd is using them. Unfortunately it can see the wlan0 interface but it thinks there are no wireless signals.
Comment 54 Emmanuel Grumbach 2017-02-08 06:04:34 UTC
What does iwconfig say?

I know that wext is disabled by default; look for
"cfg80211 wireless extensions compatibility" on the first page of menuconfig.
Comment 55 ryan.jentzsch 2017-02-08 07:21:28 UTC
(In reply to Emmanuel Grumbach from comment #54)
> What does iwconfig say?
> 
> I know that wext is disabled by default; look for
> "cfg80211 wireless extensions compatibility" on the first page of
> menuconfig.

I tried this to get wext built into the backport:

$ export CPTCFG_CFG80211_WEXT=y
$ make
...

/backport-iwlwifi/net/mac80211/util.o
/backport-iwlwifi/net/wireless/wext-compat.c: In function ‘__cfg80211_set_encryption’:
/backport-iwlwifi/net/wireless/wext-compat.c:413:11: error: ‘struct wireless_dev’ has no member named ‘wext’
  if (!wdev->wext.keys) {

...
Comment 56 Emmanuel Grumbach 2017-02-08 07:23:32 UTC
no...
You need to enable this through the make menuconfig options, just as if you were configuring your kernel. Lots of things happen behind the scenes when you configure through make menuconfig, and you just bypassed them.
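After going through make menuconfig, a quick way to confirm that the option was recorded is to look for it in the generated configuration. This sketch assumes the backport tree writes its choices to a `.config` file using the CPTCFG_ prefix, as the variable name tried in comment 55 suggests; the function name is illustrative.

```shell
#!/bin/sh
# Succeed if the given backport .config enables the cfg80211 wireless
# extensions compatibility option. The function name is illustrative.
#   $1: path to the config file (defaults to ./.config)
wext_enabled() {
    grep -q '^CPTCFG_CFG80211_WEXT=y$' "${1:-.config}"
}
```

Usage from the backport directory: `wext_enabled && echo "wext is on"`.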
Comment 57 ryan.jentzsch 2017-02-08 07:36:02 UTC
Sorry. This is my first time mucking about with kernel builds so I obviously need more guidance on what to do. 

I've built a number of apps from source but never a kernel build so this is new to me.
Comment 58 ryan.jentzsch 2017-02-08 08:34:05 UTC
You've got to be kidding me! There's a menu based UI for make configurations!?! (At least for the kernel). No more `export` environment settings?? Nice.

Anyway I used `make menuconfig` to get the wext enabled. When I did a `make install` this is what I got:

 Building modules, stage 2.
  MODPOST 6 modules
  INSTALL /backport-iwlwifi/compat/compat.ko
At main.c:158:
- SSL error:02001002:system library:fopen:No such file or directory: bss_file.c:175
- SSL error:2006D080:BIO routines:BIO_new_file:no such file: bss_file.c:178
sign-file: certs/signing_key.pem: No such file or directory
  INSTALL /backport-iwlwifi/drivers/net/wireless/intel/iwlwifi/iwlwifi.ko
At main.c:158:
- SSL error:02001002:system library:fopen:No such file or directory: bss_file.c:175
- SSL error:2006D080:BIO routines:BIO_new_file:no such file: bss_file.c:178
sign-file: certs/signing_key.pem: No such file or directory
  INSTALL /backport-iwlwifi/drivers/net/wireless/intel/iwlwifi/mvm/iwlmvm.ko
At main.c:158:
- SSL error:02001002:system library:fopen:No such file or directory: bss_file.c:175
- SSL error:2006D080:BIO routines:BIO_new_file:no such file: bss_file.c:178
sign-file: certs/signing_key.pem: No such file or directory
  INSTALL /backport-iwlwifi/drivers/net/wireless/intel/iwlwifi/xvt/iwlxvt.ko
At main.c:158:
- SSL error:02001002:system library:fopen:No such file or directory: bss_file.c:175
- SSL error:2006D080:BIO routines:BIO_new_file:no such file: bss_file.c:178
sign-file: certs/signing_key.pem: No such file or directory
  INSTALL /backport-iwlwifi/net/mac80211/mac80211.ko
At main.c:158:
- SSL error:02001002:system library:fopen:No such file or directory: bss_file.c:175
- SSL error:2006D080:BIO routines:BIO_new_file:no such file: bss_file.c:178
sign-file: certs/signing_key.pem: No such file or directory
  INSTALL /backport-iwlwifi/net/wireless/cfg80211.ko
At main.c:158:
- SSL error:02001002:system library:fopen:No such file or directory: bss_file.c:175
- SSL error:2006D080:BIO routines:BIO_new_file:no such file: bss_file.c:178
sign-file: certs/signing_key.pem: No such file or directory
  DEPMOD  4.9.3-040903-generic
depmod will prefer updates/ over kernel/ -- OK!
Note:
You may or may not need to update your initramfs, you should if
any of the modules installed are part of your initramfs. To add
support for your distribution to do this automatically send a
patch against "update-initramfs.sh". If your distribution does not
require this send a patch with the '/usr/bin/lsb_release -i -s'
("Linux Mint") tag for your distribution to avoid this warning.

Your backported driver modules should be installed now.
Reboot.

I tried the suggestions here: https://github.com/slavrn/gm12u320/issues/14
and here: https://github.com/patjak/bcwc_pcie/issues/70
neither make the error go away.
Comment 59 Emmanuel Grumbach 2017-02-08 08:36:27 UTC
should be benign

I am pretty sure the installation worked.
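The sign-file messages only say that no signing key was found, so the modules are installed unsigned. To confirm that the backported modules are the ones that will be loaded, check that the module path sits under updates/, which depmod prefers over kernel/ as the install output notes. A minimal sketch with an illustrative function name:

```shell
#!/bin/sh
# Succeed if a module file path lies in the updates/ directory that
# depmod prefers over kernel/. The function name is illustrative.
is_backport_module() {
    case "$1" in
        */updates/*) return 0 ;;
        *) return 1 ;;
    esac
}
```

Usage: `is_backport_module "$(modinfo -n iwlwifi)" && echo "backport driver will be used"`.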
Comment 60 ryan.jentzsch 2017-02-08 10:55:15 UTC
Same issue as before. The wlan0 interface is present but wicd reports no wireless networks available. I double checked that wicd is using wext.
Comment 61 Emmanuel Grumbach 2017-02-08 10:57:25 UTC
what does iwconfig say?
Comment 62 ryan.jentzsch 2017-02-09 06:23:46 UTC
(In reply to Emmanuel Grumbach from comment #61)
> what does iwconfig say?

I tried the backport install again so I could report the iwconfig results, but now the wlan0 interface isn't there. Here's the dmesg output:

[   25.966434] WARNING: CPU: 0 PID: 1154 at /home/ryan/Apps/backport-iwlwifi/net/wireless/core.c:797 wiphy_register+0x980/0x9d0 [cfg80211]
[   25.966435] Modules linked in: snd_hda_codec_realtek snd_hda_codec_generic iwlmvm(+) snd_hda_intel intel_cstate snd_hda_codec intel_rapl_perf snd_hda_core snd_hwdep snd_pcm snd_seq_midi snd_seq_midi_event hp_accel snd_rawmidi snd_seq joydev input_leds serio_raw snd_seq_device lis3lv02d snd_timer input_polldev iwlwifi snd mei_me soundcore hp_wireless acpi_pad mei lpc_ich mac_hid wmi shpchp binfmt_misc rt2800usb rt2x00usb rt2800lib rt2x00lib mac80211(OE) cfg80211(OE) compat(OE) parport_pc ppdev lp parport autofs4 dm_mirror dm_region_hash dm_log i915 i2c_algo_bit drm_kms_helper syscopyarea hid_generic sysfillrect ahci r8169 libahci sysimgblt psmouse usbhid fb_sys_fops mii hid drm video fjes
[   25.966464] CPU: 0 PID: 1154 Comm: modprobe Tainted: G           OE   4.9.3-040903-generic #201701120631
[   25.966465] Hardware name: Hewlett-Packard HP ENVY m7 Notebook PC /229D, BIOS F.34 12/19/2014
[   25.966466]  ffffb46dc238ba20 ffffffffa521bd12 0000000000000000 0000000000000000
[   25.966469]  ffffb46dc238ba60 ffffffffa4e83e7b 0000031d07fb9030 ffff89ee0ea002e0
[   25.966471]  0000000000000040 0000000000000000 0000000000000000 0000000000000001
[   25.966473] Call Trace:
[   25.966477]  [<ffffffffa521bd12>] dump_stack+0x63/0x81
[   25.966480]  [<ffffffffa4e83e7b>] __warn+0xcb/0xf0
[   25.966481]  [<ffffffffa4e83fad>] warn_slowpath_null+0x1d/0x20
[   25.966490]  [<ffffffffc0370b90>] wiphy_register+0x980/0x9d0 [cfg80211]
[   25.966492]  [<ffffffffa568f1f2>] ? down_write+0x12/0x40
[   25.966494]  [<ffffffffa552847d>] ? led_trigger_set_default+0x9d/0xb0
[   25.966495]  [<ffffffffa55279c0>] ? led_resume+0x30/0x30
[   25.966497]  [<ffffffffa5527e0f>] ? led_classdev_register+0x18f/0x1f0
[   25.966499]  [<ffffffffa5013fb2>] ? __kmalloc+0x162/0x1e0
[   25.966511]  [<ffffffffc0439ee1>] ? ieee80211_register_hw+0x291/0xb90 [mac80211]
[   25.966518]  [<ffffffffc043a084>] ieee80211_register_hw+0x434/0xb90 [mac80211]
[   25.966528]  [<ffffffffc06dd7f3>] iwl_mvm_mac_setup_register+0x843/0x920 [iwlmvm]
[   25.966535]  [<ffffffffc06e01a3>] iwl_op_mode_mvm_start+0x6f3/0x970 [iwlmvm]
[   25.966541]  [<ffffffffc05b8d8a>] _iwl_op_mode_start.isra.8+0x4a/0xa0 [iwlwifi]
[   25.966544]  [<ffffffffc05b8e53>] iwl_opmode_register+0x73/0xe0 [iwlwifi]
[   25.966545]  [<ffffffffc067c000>] ? 0xffffffffc067c000
[   25.966551]  [<ffffffffc067c033>] iwl_mvm_init+0x33/0x1000 [iwlmvm]
[   25.966553]  [<ffffffffa4e02190>] do_one_initcall+0x50/0x180
[   25.966555]  [<ffffffffa4ff30d1>] ? __vunmap+0x81/0xd0
[   25.966557]  [<ffffffffa5012a17>] ? kmem_cache_alloc_trace+0xd7/0x190
[   25.966559]  [<ffffffffa4fa6b35>] do_init_module+0x5f/0x1f7
[   25.966561]  [<ffffffffa4f14b3e>] load_module+0x18de/0x1c40
[   25.966562]  [<ffffffffa4f11380>] ? __symbol_put+0x60/0x60
[   25.966564]  [<ffffffffa51c1e9e>] ? ima_post_read_file+0x7e/0xa0
[   25.966566]  [<ffffffffa517abcb>] ? security_kernel_post_read_file+0x6b/0x80
[   25.966567]  [<ffffffffa4f1510f>] SYSC_finit_module+0xdf/0x110
[   25.966569]  [<ffffffffa4f1515e>] SyS_finit_module+0xe/0x10
[   25.966571]  [<ffffffffa5691abb>] entry_SYSCALL_64_fastpath+0x1e/0xad
[   25.966572] ---[ end trace 2f18d03a1c689019 ]---
Comment 63 Emmanuel Grumbach 2017-02-09 07:10:13 UTC
I can't find any WARNING around that line in core.c

What is the commit ID you are currently on?
Comment 64 ryan.jentzsch 2017-02-09 09:55:08 UTC
(In reply to Emmanuel Grumbach from comment #63)
> I can't find any WARNING around that line in core.c
> 
> What is the commit ID you are currently on?

master:
commit bad47e3d6bd84aa6db9cb0256d0a675a9fba2ec5
Author: Luca Coelho <luciano.coelho@intel.com>
Date:   Mon Feb 6 16:16:49 2017 +0200

    Merge remote-tracking branch 'auto/master'
    
    Change-Id: Idcf30b4d6046c46c15efecf780d423300c51ef3a
    x-iwlwifi-stack-dev: 9bca8aac704db385487b5eae745210913c0ca38a
Comment 65 ryan.jentzsch 2017-02-09 10:05:40 UTC
(In reply to ryan.jentzsch from comment #64)
> (In reply to Emmanuel Grumbach from comment #63)
> > I can't find any WARNING around that line in core.c
> > 
> > What is the commit ID you are currently on?
> 
> master:
> commit bad47e3d6bd84aa6db9cb0256d0a675a9fba2ec5
> Author: Luca Coelho <luciano.coelho@intel.com>
> Date:   Mon Feb 6 16:16:49 2017 +0200
> 
>     Merge remote-tracking branch 'auto/master'
>     
>     Change-Id: Idcf30b4d6046c46c15efecf780d423300c51ef3a
>     x-iwlwifi-stack-dev: 9bca8aac704db385487b5eae745210913c0ca38a


```
	/* sanity check supported bands/channels */
	for (band = 0; band < NUM_NL80211_BANDS; band++) {
		sband = wiphy->bands[band];
		if (!sband)
			continue;

		sband->band = band;
		if (WARN_ON(!sband->n_channels))
			return -EINVAL;
		/*
		 * on 60GHz band, there are no legacy rates, so
		 * n_bitrates is 0
		 */
		if (WARN_ON(band != NL80211_BAND_60GHZ &&
			    !sband->n_bitrates))
			return -EINVAL;

		/*
		 * Since cfg80211_disable_40mhz_24ghz is global, we can
		 * modify the sband's ht data even if the driver uses a
		 * global structure for that.
		 */
		if (cfg80211_disable_40mhz_24ghz &&
		    band == NL80211_BAND_2GHZ &&
		    sband->ht_cap.ht_supported) {
			sband->ht_cap.cap &= ~IEEE80211_HT_CAP_SUP_WIDTH_20_40;
			sband->ht_cap.cap &= ~IEEE80211_HT_CAP_SGI_40;
		}

		/*
		 * Since we use a u32 for rate bitmaps in
		 * ieee80211_get_response_rate, we cannot
		 * have more than 32 legacy rates.
		 */
		if (WARN_ON(sband->n_bitrates > 32))
			return -EINVAL;

		for (i = 0; i < sband->n_channels; i++) {
			sband->channels[i].orig_flags =
				sband->channels[i].flags;
			sband->channels[i].orig_mag = INT_MAX;
			sband->channels[i].orig_mpwr =
				sband->channels[i].max_power;
			sband->channels[i].band = band;
		}

		have_band = true;
	}

	if (!have_band) {
		WARN_ON(1); // <-- Line 797
		return -EINVAL;
	}
```
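
Read as plain logic, the excerpt above only reaches the `WARN_ON(1)` at line 797 when every entry of `wiphy->bands` was empty, i.e. the driver registered no band at all. A rough Python model of that control flow follows; `Band` and `check_bands` are hypothetical names used for illustration, not kernel API, and the HT and per-channel bookkeeping is left out.

```python
from dataclasses import dataclass
from typing import Optional, Sequence

EINVAL = 22  # the kernel code returns this as -EINVAL


@dataclass
class Band:
    """Hypothetical stand-in for struct ieee80211_supported_band."""
    n_channels: int
    n_bitrates: int
    is_60ghz: bool = False


def check_bands(bands: Sequence[Optional[Band]]) -> int:
    """Return 0 if the band table passes the sanity checks, else -EINVAL."""
    have_band = False
    for sband in bands:
        if sband is None:            # slot not populated by the driver
            continue
        if sband.n_channels == 0:    # WARN_ON(!sband->n_channels)
            return -EINVAL
        if not sband.is_60ghz and sband.n_bitrates == 0:
            return -EINVAL           # only 60 GHz may have no legacy rates
        if sband.n_bitrates > 32:    # rate bitmaps are a u32
            return -EINVAL
        have_band = True
    # The warning reported at line 797 corresponds to this final case:
    # the loop never saw a populated band.
    return 0 if have_band else -EINVAL
```

Under this model, an all-empty band table fails even though no individual band is malformed, which would fit a driver that could not obtain the supported bands from the device.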
Comment 66 Emmanuel Grumbach 2017-02-09 12:34:55 UTC
I just don't understand what changed since last time. This WARNING just makes no sense to me. Looks like you can't even read the card's capability.
Comment 67 ryan.jentzsch 2017-02-09 22:14:11 UTC
(In reply to Emmanuel Grumbach from comment #66)
> I just don't understand what changed since last time. This WARNING just
> makes no sense to me. Looks like you can't even read the card's capability.

I commented out the "sanity check" code that was issuing the WARNING, recompiled, and rebooted. I still cannot see wlan0, so I tried the following:

$ echo 1 | sudo tee /sys/bus/pci/devices/0000:08:00.0/remove
$ echo 1 | sudo tee /sys/bus/pci/rescan
$ sudo rmmod iwlmvm iwlwifi && sudo modprobe iwlmvm iwlwifi   # this responded with an error
$ sudo ifconfig wlan0 up   # obviously did not work, since the previous command threw an error


Relevant dmesg output dump:
[  265.099880] pci 0000:08:00.0: [8086:08b1] type 00 class 0x028000
[  265.099917] pci 0000:08:00.0: reg 0x10: [mem 0xc6100000-0xc6101fff 64bit]
[  265.100101] pci 0000:08:00.0: PME# supported from D0 D3hot D3cold
[  265.100166] pci 0000:08:00.0: System wakeup disabled by ACPI
[  265.109016] pci 0000:08:00.0: BAR 0: assigned [mem 0xc6100000-0xc6101fff 64bit]
[  265.110028] iwlwifi 0000:08:00.0: loaded firmware version 17.352738.0 op_mode iwlmvm
[  265.114796] iwlmvm: Unknown symbol __ieee80211_get_radio_led_name (err 0)
[  265.145145] iwlwifi 0000:08:00.0: failed to load module iwlmvm (error 256), is dynamic loading enabled?
[  267.120561] iwlmvm: Unknown symbol __ieee80211_get_radio_led_name (err 0)

Note: `make menuconfig` shows a [New] option for turning on LEDs (I did not enable this!)
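
An "Unknown symbol" message at load time generally means the module expects a symbol that the modules currently loaded do not export, which is what a mismatched or partially built backport would look like. A small sketch for pulling those lines out of a saved dmesg log; the function name `unknown_symbols` and the regular expression are my own and assume the message format shown above.

```python
import re

# Matches e.g. "[  265.114796] iwlmvm: Unknown symbol __ieee80211_get_radio_led_name (err 0)"
UNKNOWN_SYMBOL = re.compile(
    r"^\[\s*(?P<ts>\d+\.\d+)\]\s+(?P<module>\w+): Unknown symbol (?P<symbol>\w+)"
)


def unknown_symbols(dmesg_text):
    """Map each module name to the set of symbols it failed to resolve."""
    missing = {}
    for line in dmesg_text.splitlines():
        match = UNKNOWN_SYMBOL.match(line.strip())
        if match:
            missing.setdefault(match.group("module"), set()).add(match.group("symbol"))
    return missing


if __name__ == "__main__":
    sample = """\
[  265.110028] iwlwifi 0000:08:00.0: loaded firmware version 17.352738.0 op_mode iwlmvm
[  265.114796] iwlmvm: Unknown symbol __ieee80211_get_radio_led_name (err 0)
[  267.120561] iwlmvm: Unknown symbol __ieee80211_get_radio_led_name (err 0)
"""
    for module, symbols in unknown_symbols(sample).items():
        print(module, sorted(symbols))
```

Repeated messages for the same symbol collapse into one entry, so the result lists each missing export once per module.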
Comment 68 ryan.jentzsch 2017-02-10 01:28:27 UTC
(In reply to ryan.jentzsch from comment #67)
> (In reply to Emmanuel Grumbach from comment #66)
> > I just don't understand what changed since last time. This WARNING just
> > makes no sense to me. Looks like you can't even read the card's capability.
 ...

I decided to delete the backport directory. Did a `git clone` of the backport so I would have a clean build environment.

Ran `make menuconfig` and ONLY enabled the wext support.
`make`
`sudo make install`
Reboot.

`dmesg` reports: 
 [   33.272159] mac80211: Unknown symbol __ieee80211_get_channel (err 0)

No wlan0 interface.
Comment 69 Emmanuel Grumbach 2017-02-11 19:24:48 UTC
To make sure I didn't mislead you I did:

clone the backport
make
<this takes the default which is what you want, wait until it finishes if you want>
make menuconfig
<enable cfg80211 wireless extensions compatibility>
exit and save

make
sudo make install
reboot


then I ran iwconfig and WEXT is just working.
Comment 70 ryan.jentzsch 2017-02-12 11:04:11 UTC
(In reply to Emmanuel Grumbach from comment #69)
> To make sure I didn't mislead you I did:
> 
> clone the backport
> make
> <this takes the default which is what you want, wait until it finishes if
> you want>
> make menuconfig
> <enable cfg80211 wireless extensions compatibility>
> exit and save
> 
> make
> sudo make install
> reboot
> 
> 
> then I ran iwconfig and WEXT is just working.

I was using `make` only once. Running make twice fixed the problem. I am able to run the backport now with wlan0 up and connected to the internet! Thx!!
Comment 71 ryan.jentzsch 2017-02-12 11:17:47 UTC
Even with the backport installed, it didn't take long for the problem to occur. Here's the dmesg dump:

[  834.200941] WARNING: CPU: 2 PID: 0 at /home/ryan/Apps/backport-iwlwifi/drivers/net/wireless/intel/iwlwifi/pcie/trans.c:2038 iwl_trans_pcie_grab_nic_access+0xf3/0x100 [iwlwifi]
[  834.200942] Timeout waiting for hardware access (CSR_GP_CNTRL 0xffffffff)
[  834.200943] Modules linked in: ipt_MASQUERADE nf_nat_masquerade_ipv4 xfrm_user xfrm_algo iptable_nat nf_conntrack_ipv4 nf_defrag_ipv4 nf_nat_ipv4 xt_addrtype iptable_filter ip_tables xt_conntrack x_tables nf_nat nf_conntrack br_netfilter bridge stp ccm dm_thin_pool dm_persistent_data dm_bio_prison dm_bufio libcrc32c appletalk ipx p8023 psnap p8022 llc rfcomm bbswitch(OE) bnep pci_stub vboxpci(OE) vboxnetadp(OE) vboxnetflt(OE) vboxdrv(OE) nvidia_uvm(POE) uvcvideo nvidia_drm(POE) videobuf2_vmalloc videobuf2_memops nvidia_modeset(POE) videobuf2_v4l2 videobuf2_core nvidia(POE) videodev hid_multitouch media arc4 btusb btrtl intel_rapl btbcm iwlmvm(OE) btintel bluetooth x86_pkg_temp_thermal intel_powerclamp coretemp crct10dif_pclmul crc32_pclmul ghash_clmulni_intel aesni_intel aes_x86_64 lrw glue_helper
[  834.200967]  ablk_helper cryptd snd_hda_codec_hdmi iwlwifi(OE) snd_hda_codec_realtek snd_hda_codec_generic intel_cstate kvm_intel snd_hda_intel joydev input_leds snd_hda_codec kvm snd_hda_core snd_hwdep snd_pcm irqbypass snd_seq_midi snd_seq_midi_event intel_rapl_perf snd_rawmidi snd_seq serio_raw hp_accel snd_seq_device lis3lv02d snd_timer input_polldev snd mac_hid mei_me soundcore mei lpc_ich hp_wireless acpi_pad wmi shpchp binfmt_misc rt2800usb rt2x00usb rt2800lib rt2x00lib mac80211(OE) cfg80211(OE) compat(OE) parport_pc ppdev lp parport autofs4 dm_mirror dm_region_hash dm_log hid_generic usbhid hid i915 i2c_algo_bit drm_kms_helper syscopyarea sysfillrect r8169 sysimgblt ahci fb_sys_fops psmouse libahci mii drm video fjes
[  834.200993] CPU: 2 PID: 0 Comm: swapper/2 Tainted: P           OE   4.9.3-040903-generic #201701120631
[  834.200994] Hardware name: Hewlett-Packard HP ENVY m7 Notebook PC /229D, BIOS F.34 12/19/2014
[  834.200995]  ffffa078ded03cd0 ffffffffbea1bd12 ffffa078ded03d20 0000000000000000
[  834.200997]  ffffa078ded03d10 ffffffffbe683e7b 000007f600000000 ffffa078cf300018
[  834.200999]  0000000000000000 ffffa078cf3083d8 ffffa078ded03db8 ffffa078cf300018
[  834.201000] Call Trace:
[  834.201001]  <IRQ> 
[  834.201005]  [<ffffffffbea1bd12>] dump_stack+0x63/0x81
[  834.201006]  [<ffffffffbe683e7b>] __warn+0xcb/0xf0
[  834.201008]  [<ffffffffbe683eff>] warn_slowpath_fmt+0x5f/0x80
[  834.201012]  [<ffffffffc09141df>] ? iwl_read32+0x1f/0x90 [iwlwifi]
[  834.201016]  [<ffffffffc09275f3>] iwl_trans_pcie_grab_nic_access+0xf3/0x100 [iwlwifi]
[  834.201020]  [<ffffffffc0914762>] iwl_read_prph+0x32/0x80 [iwlwifi]
[  834.201024]  [<ffffffffc0929dc8>] iwl_trans_pcie_log_scd_error+0x138/0x270 [iwlwifi]
[  834.201026]  [<ffffffffbe6accc8>] ? update_rq_clock.part.78+0x18/0x40
[  834.201027]  [<ffffffffbe637089>] ? sched_clock+0x9/0x10
[  834.201029]  [<ffffffffbe6b945d>] ? cpu_load_update+0xdd/0x150
[  834.201033]  [<ffffffffc0920290>] ? iwl_pcie_txq_inc_wr_ptr+0x100/0x100 [iwlwifi]
[  834.201036]  [<ffffffffc09202db>] iwl_pcie_txq_stuck_timer+0x4b/0x70 [iwlwifi]
[  834.201037]  [<ffffffffbe6f71a5>] call_timer_fn+0x35/0x120
[  834.201038]  [<ffffffffbe6f7735>] run_timer_softirq+0x215/0x4b0
[  834.201040]  [<ffffffffbe6ffcd1>] ? ktime_get+0x41/0xb0
[  834.201041]  [<ffffffffbe6530a6>] ? lapic_next_deadline+0x26/0x30
[  834.201043]  [<ffffffffbee94614>] __do_softirq+0x104/0x28c
[  834.201045]  [<ffffffffbe68a336>] irq_exit+0xb6/0xc0
[  834.201046]  [<ffffffffbee94422>] smp_apic_timer_interrupt+0x42/0x50
[  834.201047]  [<ffffffffbee93732>] apic_timer_interrupt+0x82/0x90
[  834.201048]  <EOI> 
[  834.201050]  [<ffffffffbed13a92>] ? cpuidle_enter_state+0x122/0x2c0
[  834.201052]  [<ffffffffbed13c67>] cpuidle_enter+0x17/0x20
[  834.201053]  [<ffffffffbe6c9f53>] call_cpuidle+0x23/0x40
[  834.201054]  [<ffffffffbe6ca1cb>] cpu_startup_entry+0x15b/0x240
[  834.201055]  [<ffffffffbe651b94>] start_secondary+0x154/0x190
[  834.201056] ---[ end trace 31a5b2f6358dd159 ]---
[  834.248139] iwlwifi 0000:08:00.0: Queue 4 is active on fifo 2 and stuck for 10000 ms. SW [53, 88] HW [90, 90] FH TRB=0x05a5a5a5a
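
For anyone collecting several of these reports, the final "stuck queue" line can be split into fields. This is a sketch only: the regular expression assumes the exact wording of the line above, and the field names (`sw_read`, `sw_write`, `hw_read`, `hw_write`) are my labels for the bracketed pairs, since the message itself does not name them.

```python
import re

STUCK_QUEUE = re.compile(
    r"Queue (?P<queue>\d+) is (?P<state>active|inactive) on fifo (?P<fifo>\d+) "
    r"and stuck for (?P<ms>\d+) ms\. "
    r"SW \[(?P<sw_read>\d+), (?P<sw_write>\d+)\] "
    r"HW \[(?P<hw_read>\d+), (?P<hw_write>\d+)\] "
    r"FH TRB=(?P<trb>0x[0-9a-fA-F]+)"
)


def parse_stuck_queue(line):
    """Return the fields of a 'Queue N is ... stuck' line, or None if it does not match."""
    match = STUCK_QUEUE.search(line)
    if match is None:
        return None
    fields = match.groupdict()
    state = fields.pop("state")
    trb = int(fields.pop("trb"), 16)      # int() tolerates the extra leading zero
    parsed = {name: int(value) for name, value in fields.items()}
    parsed["state"] = state
    parsed["trb"] = trb
    return parsed


if __name__ == "__main__":
    line = ("[  834.248139] iwlwifi 0000:08:00.0: Queue 4 is active on fifo 2 and stuck "
            "for 10000 ms. SW [53, 88] HW [90, 90] FH TRB=0x05a5a5a5a")
    print(parse_stuck_queue(line))
```

Parsed this way, the `trb` field of the line above is 0x5a5a5a5a, the value named in the bug title.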
Comment 72 Emmanuel Grumbach 2017-02-12 11:19:11 UTC
and your base kernel is a "good" kernel, right?
Comment 73 ryan.jentzsch 2017-02-12 11:46:32 UTC
(In reply to Emmanuel Grumbach from comment #72)
> and your base kernel is a "good" kernel, right?

output of `uname -a`:

Linux leto 4.9.3-040903-generic #201701120631 SMP Thu Jan 12 11:33:59 UTC 2017 x86_64 x86_64 x86_64 GNU/Linux
Comment 74 Emmanuel Grumbach 2017-02-12 13:33:27 UTC
So, this is the bad one, isn't it?
Comment 75 ryan.jentzsch 2017-02-12 23:28:35 UTC
(In reply to Emmanuel Grumbach from comment #74)
> So, this is the bad one, isn't it?

I'm not sure what you are asking. I've tried multiple kernel versions to remedy this issue and all are "bad" in that the wifi card goes stupid regardless of kernel version.
Comment 76 Emmanuel Grumbach 2017-02-13 06:38:33 UTC
People seemed to say that older kernel versions used to work. See comment 44.
Comment 77 ryan.jentzsch 2017-02-13 10:24:14 UTC
(In reply to Emmanuel Grumbach from comment #76)
> People seemed to say that older kernel versions used to work. See comment 44.

See my comment: https://bugzilla.kernel.org/show_bug.cgi?id=191601#c28
I've tried many kernel versions and something went screwy shortly (not immediately) after upgrading to the 4.4 kernel. I've tried every stable version of the 4.x kernels without success. Problem persists.
Comment 78 Emmanuel Grumbach 2017-02-13 10:35:43 UTC
So you have a different issue from Hallo32, it seems.
Comment 79 ryan.jentzsch 2017-02-13 16:49:31 UTC
(In reply to Emmanuel Grumbach from comment #78)
> So you have a different issue from Halo32 it seems.

So it may appear. I'll try the backport on different kernel versions and see.
Comment 80 Hallo32 2017-02-13 21:54:34 UTC
[ 6923.815626] iwlwifi 0000:04:00.0: Queue 16 stuck for 10000 ms.
[ 6923.815637] iwlwifi 0000:04:00.0: Current SW read_ptr 57 write_ptr 63
[ 6923.815684] iwl data: 00000000: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00  ................
[ 6923.815707] iwlwifi 0000:04:00.0: FH TRBs(0) = 0x800030cd
[ 6923.815722] iwlwifi 0000:04:00.0: FH TRBs(1) = 0xc011003e
[ 6923.815739] iwlwifi 0000:04:00.0: FH TRBs(2) = 0x80201058
[ 6923.815760] iwlwifi 0000:04:00.0: FH TRBs(3) = 0x80300053
[ 6923.815774] iwlwifi 0000:04:00.0: FH TRBs(4) = 0x00000000
[ 6923.815790] iwlwifi 0000:04:00.0: FH TRBs(5) = 0x00000000
[ 6923.815804] iwlwifi 0000:04:00.0: FH TRBs(6) = 0x00000000
[ 6923.815818] iwlwifi 0000:04:00.0: FH TRBs(7) = 0x0070909c
[ 6923.816105] iwlwifi 0000:04:00.0: Q 0 is inactive and mapped to fifo 2 ra_tid 0xa5a2 [162,-1515870814]
[ 6923.816441] iwlwifi 0000:04:00.0: Q 1 is inactive and mapped to fifo 2 ra_tid 0xa5a5 [162,-1515870814]
[ 6923.816767] iwlwifi 0000:04:00.0: Q 2 is inactive and mapped to fifo 2 ra_tid 0xa5a2 [162,-1515870814]
[ 6923.817093] iwlwifi 0000:04:00.0: Q 3 is inactive and mapped to fifo 2 ra_tid 0xa5a5 [162,-1515870814]
[ 6923.817431] iwlwifi 0000:04:00.0: Q 4 is inactive and mapped to fifo 2 ra_tid 0xa5a2 [162,-1515870814]
[ 6923.817762] iwlwifi 0000:04:00.0: Q 5 is inactive and mapped to fifo 2 ra_tid 0xa5a5 [162,-1515870814]
[ 6923.818081] iwlwifi 0000:04:00.0: Q 6 is inactive and mapped to fifo 2 ra_tid 0xa5a2 [162,-1515870814]
[ 6923.818399] iwlwifi 0000:04:00.0: Q 7 is inactive and mapped to fifo 2 ra_tid 0xa5a5 [162,-1515870814]
[ 6923.818718] iwlwifi 0000:04:00.0: Q 8 is inactive and mapped to fifo 2 ra_tid 0xa5a2 [162,-1515870814]
[ 6923.819044] iwlwifi 0000:04:00.0: Q 9 is inactive and mapped to fifo 2 ra_tid 0xa5a5 [162,-1515870814]
[ 6923.819378] iwlwifi 0000:04:00.0: Q 10 is inactive and mapped to fifo 2 ra_tid 0xa5a2 [162,-1515870814]
[ 6923.819733] iwlwifi 0000:04:00.0: Q 11 is inactive and mapped to fifo 2 ra_tid 0xa5a5 [162,-1515870814]
[ 6923.820060] iwlwifi 0000:04:00.0: Q 12 is inactive and mapped to fifo 2 ra_tid 0xa5a2 [162,-1515870814]
[ 6923.820392] iwlwifi 0000:04:00.0: Q 13 is inactive and mapped to fifo 2 ra_tid 0xa5a5 [162,-1515870814]
[ 6923.820713] iwlwifi 0000:04:00.0: Q 14 is inactive and mapped to fifo 2 ra_tid 0xa5a2 [162,-1515870814]
[ 6923.821039] iwlwifi 0000:04:00.0: Q 15 is inactive and mapped to fifo 2 ra_tid 0xa5a5 [162,-1515870814]
[ 6923.821370] iwlwifi 0000:04:00.0: Q 16 is inactive and mapped to fifo 2 ra_tid 0xa5a2 [162,-1515870814]
[ 6923.821696] iwlwifi 0000:04:00.0: Q 17 is inactive and mapped to fifo 2 ra_tid 0xa5a5 [162,-1515870814]
[ 6923.822033] iwlwifi 0000:04:00.0: Q 18 is inactive and mapped to fifo 2 ra_tid 0xa5a2 [162,-1515870814]
[ 6923.822358] iwlwifi 0000:04:00.0: Q 19 is inactive and mapped to fifo 2 ra_tid 0xa5a5 [162,-1515870814]
[ 6923.822678] iwlwifi 0000:04:00.0: Q 20 is inactive and mapped to fifo 2 ra_tid 0xa5a2 [162,-1515870814]
[ 6923.822996] iwlwifi 0000:04:00.0: Q 21 is inactive and mapped to fifo 2 ra_tid 0xa5a5 [162,-1515870814]
[ 6923.823315] iwlwifi 0000:04:00.0: Q 22 is inactive and mapped to fifo 2 ra_tid 0xa5a2 [162,-1515870814]
[ 6923.823651] iwlwifi 0000:04:00.0: Q 23 is inactive and mapped to fifo 2 ra_tid 0xa5a5 [162,-1515870814]
[ 6923.823984] iwlwifi 0000:04:00.0: Q 24 is inactive and mapped to fifo 2 ra_tid 0xa5a2 [162,-1515870814]
[ 6923.824322] iwlwifi 0000:04:00.0: Q 25 is inactive and mapped to fifo 2 ra_tid 0xa5a5 [162,-1515870814]
[ 6923.824647] iwlwifi 0000:04:00.0: Q 26 is inactive and mapped to fifo 2 ra_tid 0xa5a2 [162,-1515870814]
[ 6923.824971] iwlwifi 0000:04:00.0: Q 27 is inactive and mapped to fifo 2 ra_tid 0xa5a5 [162,-1515870814]
[ 6923.825299] iwlwifi 0000:04:00.0: Q 28 is inactive and mapped to fifo 2 ra_tid 0xa5a2 [162,-1515870814]
[ 6923.825627] iwlwifi 0000:04:00.0: Q 29 is inactive and mapped to fifo 2 ra_tid 0xa5a5 [162,-1515870814]
[ 6923.825953] iwlwifi 0000:04:00.0: Q 30 is inactive and mapped to fifo 2 ra_tid 0xa5a2 [162,-1515870814]
[ 6923.827195] iwlwifi 0000:04:00.0: Hardware error detected.  Restarting.
[ 6923.827201] iwlwifi 0000:04:00.0: CSR values:
[ 6923.827204] iwlwifi 0000:04:00.0: (2nd byte of CSR_INT_COALESCING is CSR_INT_PERIODIC_REG)
[ 6923.827214] iwlwifi 0000:04:00.0:        CSR_HW_IF_CONFIG_REG: 0X40489204
[ 6923.827228] iwlwifi 0000:04:00.0:          CSR_INT_COALESCING: 0X80000040
[ 6923.827242] iwlwifi 0000:04:00.0:                     CSR_INT: 0X20000000
[ 6923.827256] iwlwifi 0000:04:00.0:                CSR_INT_MASK: 0X00000000
[ 6923.827269] iwlwifi 0000:04:00.0:           CSR_FH_INT_STATUS: 0X00000000
[ 6923.827283] iwlwifi 0000:04:00.0:                 CSR_GPIO_IN: 0X00000000
[ 6923.827300] iwlwifi 0000:04:00.0:                   CSR_RESET: 0X00000004
[ 6923.827313] iwlwifi 0000:04:00.0:                CSR_GP_CNTRL: 0X080403c5
[ 6923.827326] iwlwifi 0000:04:00.0:                  CSR_HW_REV: 0X00000144
[ 6923.827339] iwlwifi 0000:04:00.0:              CSR_EEPROM_REG: 0X00000000
[ 6923.827352] iwlwifi 0000:04:00.0:               CSR_EEPROM_GP: 0X80000000
[ 6923.827364] iwlwifi 0000:04:00.0:              CSR_OTP_GP_REG: 0X803a0000
[ 6923.827378] iwlwifi 0000:04:00.0:                 CSR_GIO_REG: 0X001f0042
[ 6923.827390] iwlwifi 0000:04:00.0:            CSR_GP_UCODE_REG: 0X00000000
[ 6923.827403] iwlwifi 0000:04:00.0:           CSR_GP_DRIVER_REG: 0X00000000
[ 6923.827417] iwlwifi 0000:04:00.0:           CSR_UCODE_DRV_GP1: 0X00000000
[ 6923.827429] iwlwifi 0000:04:00.0:           CSR_UCODE_DRV_GP2: 0X00000000
[ 6923.827443] iwlwifi 0000:04:00.0:                 CSR_LED_REG: 0X00000060
[ 6923.827456] iwlwifi 0000:04:00.0:        CSR_DRAM_INT_TBL_REG: 0X8822e254
[ 6923.827469] iwlwifi 0000:04:00.0:        CSR_GIO_CHICKEN_BITS: 0X27800200
[ 6923.827481] iwlwifi 0000:04:00.0:             CSR_ANA_PLL_CFG: 0Xd55555d5
[ 6923.827494] iwlwifi 0000:04:00.0:      CSR_MONITOR_STATUS_REG: 0X3d0801bd
[ 6923.827507] iwlwifi 0000:04:00.0:           CSR_HW_REV_WA_REG: 0X0001001a
[ 6923.827520] iwlwifi 0000:04:00.0:        CSR_DBG_HPET_MEM_REG: 0Xffff0072
[ 6923.827523] iwlwifi 0000:04:00.0: FH register values:
[ 6923.827611] iwlwifi 0000:04:00.0:         FH_RSCSR_CHNL0_STTS_WPTR_REG: 0Xa5a5a5a2
[ 6923.827666] iwlwifi 0000:04:00.0:        FH_RSCSR_CHNL0_RBDCB_BASE_REG: 0Xa5a5a5a2
[ 6923.827730] iwlwifi 0000:04:00.0:                  FH_RSCSR_CHNL0_WPTR: 0Xa5a5a5a2
[ 6923.827784] iwlwifi 0000:04:00.0:         FH_MEM_RCSR_CHNL0_CONFIG_REG: 0Xa5a5a5a2
[ 6923.827832] iwlwifi 0000:04:00.0:          FH_MEM_RSSR_SHARED_CTRL_REG: 0Xa5a5a5a2
[ 6923.827888] iwlwifi 0000:04:00.0:            FH_MEM_RSSR_RX_STATUS_REG: 0Xa5a5a5a2
[ 6923.827943] iwlwifi 0000:04:00.0:    FH_MEM_RSSR_RX_ENABLE_ERR_IRQ2DRV: 0Xa5a5a5a2
[ 6923.827998] iwlwifi 0000:04:00.0:                FH_TSSR_TX_STATUS_REG: 0Xa5a5a5a2
[ 6923.828054] iwlwifi 0000:04:00.0:                 FH_TSSR_TX_ERROR_REG: 0Xa5a5a5a2
[ 6923.829470] iwlwifi 0000:04:00.0: Start IWL Error Log Dump:
[ 6923.829474] iwlwifi 0000:04:00.0: Status: 0x00000000, count: -1515870814
[ 6923.829477] iwlwifi 0000:04:00.0: Loaded firmware version: 17.352738.0
[ 6923.829480] iwlwifi 0000:04:00.0: 0xA5A5A5A2 | ADVANCED_SYSASSERT          
[ 6923.829482] iwlwifi 0000:04:00.0: 0xA5A5A5A2 | trm_hw_status0
[ 6923.829485] iwlwifi 0000:04:00.0: 0xA5A5A5A2 | trm_hw_status1
[ 6923.829487] iwlwifi 0000:04:00.0: 0xA5A5A5A2 | branchlink2
[ 6923.829489] iwlwifi 0000:04:00.0: 0xA5A5A5A2 | interruptlink1
[ 6923.829492] iwlwifi 0000:04:00.0: 0xA5A5A5A2 | interruptlink2
[ 6923.829494] iwlwifi 0000:04:00.0: 0xA5A5A5A2 | data1
[ 6923.829496] iwlwifi 0000:04:00.0: 0xA5A5A5A2 | data2
[ 6923.829498] iwlwifi 0000:04:00.0: 0xA5A5A5A2 | data3
[ 6923.829500] iwlwifi 0000:04:00.0: 0xA5A5A5A2 | beacon time
[ 6923.829502] iwlwifi 0000:04:00.0: 0xA5A5A5A2 | tsf low
[ 6923.829504] iwlwifi 0000:04:00.0: 0xA5A5A5A2 | tsf hi
[ 6923.829506] iwlwifi 0000:04:00.0: 0xA5A5A5A2 | time gp1
[ 6923.829508] iwlwifi 0000:04:00.0: 0xA5A5A5A2 | time gp2
[ 6923.829511] iwlwifi 0000:04:00.0: 0xA5A5A5A2 | uCode revision type
[ 6923.829513] iwlwifi 0000:04:00.0: 0xA5A5A5A2 | uCode version major
[ 6923.829515] iwlwifi 0000:04:00.0: 0xA5A5A5A2 | uCode version minor
[ 6923.829517] iwlwifi 0000:04:00.0: 0xA5A5A5A2 | hw version
[ 6923.829519] iwlwifi 0000:04:00.0: 0xA5A5A5A2 | board version
[ 6923.829521] iwlwifi 0000:04:00.0: 0xA5A5A5A2 | hcmd
[ 6923.829524] iwlwifi 0000:04:00.0: 0xA5A5A5A2 | isr0
[ 6923.829526] iwlwifi 0000:04:00.0: 0xA5A5A5A2 | isr1
[ 6923.829528] iwlwifi 0000:04:00.0: 0xA5A5A5A2 | isr2
[ 6923.829530] iwlwifi 0000:04:00.0: 0xA5A5A5A2 | isr3
[ 6923.829532] iwlwifi 0000:04:00.0: 0xA5A5A5A2 | isr4
[ 6923.829534] iwlwifi 0000:04:00.0: 0xA5A5A5A2 | last cmd Id
[ 6923.829537] iwlwifi 0000:04:00.0: 0xA5A5A5A2 | wait_event
[ 6923.829539] iwlwifi 0000:04:00.0: 0xA5A5A5A2 | l2p_control
[ 6923.829543] iwlwifi 0000:04:00.0: 0xA5A5A5A2 | l2p_duration
[ 6923.829545] iwlwifi 0000:04:00.0: 0xA5A5A5A2 | l2p_mhvalid
[ 6923.829548] iwlwifi 0000:04:00.0: 0xA5A5A5A2 | l2p_addr_match
[ 6923.829550] iwlwifi 0000:04:00.0: 0xA5A5A5A2 | lmpm_pmg_sel
[ 6923.829552] iwlwifi 0000:04:00.0: 0xA5A5A5A2 | timestamp
[ 6923.829554] iwlwifi 0000:04:00.0: 0xA5A5A5A2 | flow_handler
[ 6923.829560] ieee80211 phy0: Hardware restart was requested
[ 6925.359620] perf: interrupt took too long (8858 > 6166), lowering kernel.perf_event_max_sample_rate to 22500
[ 6927.837109] perf: interrupt took too long (11858 > 11072), lowering kernel.perf_event_max_sample_rate to 16750
[ 6930.375871] perf: interrupt took too long (15593 > 14822), lowering kernel.perf_event_max_sample_rate to 12750
[ 6930.446658] iwlwifi 0000:04:00.0: Failing on timeout while stopping DMA channel 8 [0xa5a5a5a2]
[ 6930.455260] iwlwifi 0000:04:00.0: L1 Enabled - LTR Disabled
[ 6930.455822] iwlwifi 0000:04:00.0: L1 Enabled - LTR Disabled
[ 6930.687449] iwlwifi 0000:04:00.0: L1 Enabled - LTR Disabled
[ 6930.687714] iwlwifi 0000:04:00.0: L1 Enabled - LTR Disabled

uname -r 
4.8.0-36-generic
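
A note on reading the dump above: almost every FH register and error-log field holds the same value, 0xa5a5a5a2, and the decimal numbers in the log are that same word printed differently. `count: -1515870814` is 0xa5a5a5a2 interpreted as a signed 32-bit integer, and the 162 in `[162,-1515870814]` is its low byte, 0xa2. The sketch below does that conversion and measures how much of a dump consists of such repeated-byte values. The `looks_like_fill` heuristic (upper three bytes all 0xa5 or all 0x5a) is my assumption for illustration, not something the driver does.

```python
import re

# Exactly eight hex digits after 0x / 0X, so the nine-digit form 0x05a5a5a5a is skipped.
HEX32 = re.compile(r"0[xX]([0-9a-fA-F]{8})\b")


def to_signed32(value):
    """Reinterpret an unsigned 32-bit value as a two's-complement integer."""
    value &= 0xFFFFFFFF
    return value - (1 << 32) if value & 0x80000000 else value


def looks_like_fill(value):
    """Heuristic: the upper three bytes are all 0xa5 or all 0x5a."""
    return (value >> 8) in (0xA5A5A5, 0x5A5A5A)


def fill_ratio(dump_text):
    """Fraction of the 32-bit hex values in the text that look like a fill pattern."""
    values = [int(digits, 16) for digits in HEX32.findall(dump_text)]
    if not values:
        return 0.0
    return sum(looks_like_fill(v) for v in values) / len(values)


if __name__ == "__main__":
    print(to_signed32(0xA5A5A5A2))   # the "count" field of the error log
    print(0xA5A5A5A2 & 0xFF)         # the first number in the ra_tid brackets
```

A ratio close to 1.0 for the FH and error-log sections, next to ordinary-looking CSR values, is the signature several reporters in this thread describe.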
Comment 81 Emmanuel Grumbach 2017-02-14 02:49:00 UTC
@Hallo32,

What does this teach us? You seemed to say that 4.4 was good. Not 4.8.
Comment 82 Hallo32 2017-02-14 09:55:48 UTC
(In reply to Emmanuel Grumbach from comment #81)
> @Halo,
> 
> What does this teach us? You seemed to say that 4.4 was good. Not 4.8.

Hey Emmanuel,

At the moment it seems that I am seeing 3 WLAN-related bugs.

1) The bug in this thread
2) The WLAN card/software interface changes to a state in which I cannot reconnect to the WLAN or even see any WLAN. (I cannot trigger the problem deliberately.)
3) The access point changes the frequency/channel, and at first glance the WLAN card/software stack on the Linux client shows the same behavior.

Bugs 2 and 3 are the reason for the question at #45. At the moment I don't know whether your WLAN card and its related code are the problem or whether it is the "network stack" above your code. Options to verify the current working state of the Intel wireless card would be nice.

I didn't see this "0x5a5a5a5a Bug" on the 4.4 kernel line, but I'm not able to stick with the 4.4 kernel. 4.4 has at least some problems with Btrfs, and I trigger at least one of them on a regular basis.

Kernel 4.8 seems to be the supported kernel for the next ~6 months.

The different dmesg logs should be from different distribution versions of the 4.8 kernel. I hope that one of them includes some new information.

Is there an option to switch the iwlwifi module to a useful verbose mode?
Between this and the last "0x5a5a5a5a Bug" the system ran for >150h without triggering the bug. Given the long runtime between occurrences of the bug, I would like a way to get as much information as possible on each crash. "It happened again" doesn't seem very useful, does it?

The best backport to test would be 4.9.9, right?

At the moment I have the feeling that we are trying to avoid hitting the black cat (the bug) in a black room instead of removing it.
Comment 83 Emmanuel Grumbach 2017-02-14 10:03:11 UTC
I don't understand 2 and 3. But since they are different bugs, let's put that aside.
You can always ping your router in your internal network to check that connectivity is working.

Btrfs is unrelated, so please, let's not confuse unrelated things. This bug is really about 0x5a5a5a5a. So let's focus on this. If you can't repro this on 4.4, then you can help by installing 4.4 and the latest WiFi driver from backport.
Then it can be easy to bisect the problem.
The backport you should take is the backport I mentioned in comment 44. Do not create backport from kernel, use our backport tree.

"
Is there an option to switch the iwlwifi modul to a useful verbose mode?
Between this and the last "0x5a5a5a5a Bug" the system has been running >150h without triggering the bug. With a look at the runtime between the appearance of the bug I would like a way to get as much information as possible on each crash. "It happens again", seems not to be so useful, or?
"

I have no clue what you mean here by "this".

Knowing that 4.4 doesn't exhibit the 0x5a5a5a5a (which is the only bug I am ready to discuss in this bugzilla) and 4.8 does exhibit it, is useful.

That's why since comment 44, I am trying to get this...
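
For readers unfamiliar with the term, bisecting means binary-searching an ordered history for the first version that shows the bug, given one known-good and one known-bad endpoint. `git bisect` does this over commits; the sketch below shows only the idea on a plain list. The function name `first_bad`, the version labels, and the `is_bad` callback (in practice: build, boot, and wait for the 0x5a5a5a5a failure) are hypothetical.

```python
def first_bad(versions, is_bad):
    """Return the earliest bad version.

    Assumes versions are ordered oldest to newest, versions[0] is known good,
    versions[-1] is known bad, and every version after the first bad one is
    also bad.
    """
    good, bad = 0, len(versions) - 1
    while bad - good > 1:
        mid = (good + bad) // 2
        if is_bad(versions[mid]):
            bad = mid      # regression is at mid or earlier
        else:
            good = mid     # regression is after mid
    return versions[bad]


if __name__ == "__main__":
    # Placeholder labels; which version actually introduced the problem is not known here.
    history = ["k0", "k1", "k2", "k3", "k4", "k5", "k6", "k7"]
    broken = {"k5", "k6", "k7"}
    print(first_bad(history, lambda v: v in broken))
```

Each test halves the remaining range, so even a long history needs only a handful of builds. The catch for this bug is that a single "good" verdict can take many hours of uptime to establish.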
Comment 84 Hallo32 2017-02-14 10:27:53 UTC
(In reply to Emmanuel Grumbach from comment #83)
> I don't understand 2 and 3. But since they are different bugs, let's put
> that aside.
> You can always ping your router in your internal network to check that
> connectivity is working.

The WLAN connection doesn't exist any more. The router will not respond to pings.
Where is the right place to discuss this topic? 

> 
> Brtfs is unrelated, so please, let's not confuse unrelated things. This bug
> is really about 0x5a5a5a5a. So let's focus on this. If you can't repro this
> on 4.4, then you can help by installing 4.4 and the latest WiFi driver from
> backport.
> Then it can be easy to bisect the problem.
> The backport you should take is the backport I mentioned in comment 44. Do
> not create backport from kernel, use our backport tree.

Kernel 4.4 is not an option. The risk of damaging the file system is too high.
Only kernel 4.8 and up is an option.

> "
> Is there an option to switch the iwlwifi modul to a useful verbose mode?
> Between this and the last "0x5a5a5a5a Bug" the system has been running >150h
> without triggering the bug. With a look at the runtime between the
> appearance of the bug I would like a way to get as much information as
> possible on each crash. "It happens again", seems not to be so useful, or?
> "
> 
> I have no clue what you mean here by "this".

This should point to the last report #80.
The system has been powered on >150h before the "0x5a5a5a5a Bug" triggered again. 

> 
> Knowing that 4.4 doesn't exhibit the 0x5a5a5a5a (which is the only bug I am
> ready to discuss in this bugzilla) and 4.8 does exhibit it, is useful.
> 
> That's why since comment 44, I am trying to get this...
Comment 85 Emmanuel Grumbach 2017-02-14 10:33:45 UTC
> The WLAN connection doesn't exist any more. The router will not response on
> pings.
> Where is the right place to discuss this topic? 

Send an email to linuxwifi@intel.com and / or linux-wireless@vger.kernel.org

> 
> > 
> > Brtfs is unrelated, so please, let's not confuse unrelated things. This bug
> > is really about 0x5a5a5a5a. So let's focus on this. If you can't repro this
> > on 4.4, then you can help by installing 4.4 and the latest WiFi driver from
> > backport.
> > Then it can be easy to bisect the problem.
> > The backport you should take is the backport I mentioned in comment 44. Do
> > not create backport from kernel, use our backport tree.
> 
> The kernel 4.4 is not an option. The risk to damage the file system is to
> high.
> Only kernel 4.8 and up is an option.

So you can't help bisecting the problem.

> 
> > "
> > Is there an option to switch the iwlwifi modul to a useful verbose mode?
> > Between this and the last "0x5a5a5a5a Bug" the system has been running
> >150h
> > without triggering the bug. With a look at the runtime between the
> > appearance of the bug I would like a way to get as much information as
> > possible on each crash. "It happens again", seems not to be so useful, or?
> > "
> > 
> > I have no clue what you mean here by "this".
> 
> This should point to the last report #80.
> The system has been powered on >150h before the "0x5a5a5a5a Bug" triggered
> again. 

There is a verbose mode, but it won't bring any useful data for this specific issue.
For this specific issue, I need the debug output of the firmware as mentioned in Comment 2 or bisection.
Comment 86 Hallo32 2017-02-14 10:40:08 UTC
(In reply to Emmanuel Grumbach from comment #85)
> > The WLAN connection doesn't exist any more. The router will not response on
> > pings.
> > Where is the right place to discuss this topic? 
> 
> Send an email to linuxwifi@intel.com and / or linux-wireless@vger.kernel.org

Will be done.
 
> > 
> > > "
> > > Is there an option to switch the iwlwifi modul to a useful verbose mode?
> > > Between this and the last "0x5a5a5a5a Bug" the system has been running
> > >150h
> > > without triggering the bug. With a look at the runtime between the
> > > appearance of the bug I would like a way to get as much information as
> > > possible on each crash. "It happens again", seems not to be so useful,
> or?
> > > "
> > > 
> > > I have no clue what you mean here by "this".
> > 
> > This should point to the last report #80.
> > The system has been powered on >150h before the "0x5a5a5a5a Bug" triggered
> > again. 
> 
> There is a verbose mode it won't bring any useful data for this specific
> issue.
> For this specific issue, I need the debug output of the firmware as
> mentioned in Comment 2 or bisection.

I will install the firmware and wait for the bug.
Comment 87 Emmanuel Grumbach 2017-02-14 10:41:54 UTC
> I will install the firmware and wait for the bug.

Don't forget the instructions on the retrieval of the data (udev rule) and the privacy notice.
Comment 88 Hallo32 2017-02-14 12:16:43 UTC
Is it possible, that the bug only appears in combination with active usage of the 5GHz band?

I didn't see it on the 2.4GHz band.
Comment 89 ryan.jentzsch 2017-02-14 16:50:32 UTC
(In reply to Emmanuel Grumbach from comment #81)
> @Halo,
> 
> What does this teach us? You seemed to say that 4.4 was good. Not 4.8.

I installed the backport on 4.4.0-53 and the wireless hasn't crashed for over 9 hours. This is the longest up time I've ever had. I usually see a crash within 20 minutes. 

I started a large torrent about an hour ago to see if I could force a crash via increased network traffic and so far so good.  

Thank you for everyone's time spent on this issue. I'm hoping that downgrading to 4.4 has solved this -- I can live with using an older kernel.
Comment 90 ryan.jentzsch 2017-02-14 16:52:12 UTC
(In reply to Hallo32 from comment #88)
> Is it possible, that the bug only appears in combination with active usage
> of the 5GHz band?
> 
> I didn't see it on the 2.4GHz band.

I had crashes using both 5GHz and 2.4GHz. It didn't seem to matter which band I used. 5GHz "felt" more stable though.
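
To correlate crashes with the band in use, the current band can be read from the link status. A small sketch, assuming the interface is wlan0 and that `iw dev wlan0 link` prints a `freq:` line in MHz; the function name is hypothetical.

```shell
# band_of_link: read `iw dev <iface> link` output on stdin and print the band.
band_of_link() {
    awk '$1 == "freq:" {
             f = int($2)    # int() also copes with values such as "5180.0"
             if (f >= 2400 && f < 2500) print "2.4GHz"
             else if (f >= 5000 && f < 6000) print "5GHz"
             else print "other"
             found = 1; exit
         }
         END { if (!found) print "not connected" }'
}
```

Usage: `iw dev wlan0 link | band_of_link`, for example logged periodically so the band is known when the bug triggers.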
Comment 91 Emmanuel Grumbach 2017-02-14 17:46:35 UTC
(In reply to ryan.jentzsch from comment #89)
> (In reply to Emmanuel Grumbach from comment #81)
> 
> I installed the backport on 4.4.0-53 and the wireless hasn't crashed for
> over 9 hours. This is the longest up time I've ever had. I usually see a
> crash within 20 minutes. 
> 
> I started a large torrent about an hour ago to see if I could force a crash
> via increased network traffic and so far so good.  
> 
> Thank you for everyone's time spent on this issue. I'm hoping that
> downgrading to 4.4 has solved this -- I can live with using an older kernel.

Ok - this is real data that we can work on.
This basically means that the regression hasn't been caused by the iwlwifi driver but rather by another component in the kernel. This, of course, makes it much more complicated to nail down...
I will try to involve people from the PCI subsystem, but at this stage, I can't really do much.
Comment 92 ryan.jentzsch 2017-02-14 19:16:36 UTC
(In reply to Emmanuel Grumbach from comment #91)
> (In reply to ryan.jentzsch from comment #89)
> > (In reply to Emmanuel Grumbach from comment #81)
> > 
> > I installed the backport on 4.4.0-53 and the wireless hasn't crashed for
> > over 9 hours. This is the longest up time I've ever had. I usually see a
> > crash within 20 minutes. 
> > 
> > I started a large torrent about an hour ago to see if I could force a crash
> > via increased network traffic and so far so good.  
> > 
> > Thank you for everyone's time spent on this issue. I'm hoping that
> > downgrading to 4.4 has solved this -- I can live with using an older
> kernel.
> 
> Ok - this is real data that we can work on.
> This basically means that the regression hasn't been caused by the iwlwifi
> driver but rather by another component in the kernel. This of course, make
> it much more complicated to nail down...
> I will try to involve people from the PCI subsystem, but at this stage, I
> can't really do much.

Thanks for your time and guidance. I'll still check back here from time to time to see if I can assist (now that I know a little bit better what I'm doing). It will be nice if the latest kernel and iwlwifi play nice with each other.
Comment 93 ryan.jentzsch 2017-02-16 07:21:25 UTC
(In reply to Emmanuel Grumbach from comment #91)
> (In reply to ryan.jentzsch from comment #89)
> > (In reply to Emmanuel Grumbach from comment #81)
> > 
> > I installed the backport on 4.4.0-53 and the wireless hasn't crashed for
> > over 9 hours. This is the longest up time I've ever had. I usually see a
> > crash within 20 minutes. 
> > 
> > I started a large torrent about an hour ago to see if I could force a crash
> > via increased network traffic and so far so good.  
> > 
> > Thank you for everyone's time spent on this issue. I'm hoping that
> > downgrading to 4.4 has solved this -- I can live with using an older
> kernel.
> 
> Ok - this is real data that we can work on.
> This basically means that the regression hasn't been caused by the iwlwifi
> driver but rather by another component in the kernel. This of course, make
> it much more complicated to nail down...
> I will try to involve people from the PCI subsystem, but at this stage, I
> can't really do much.

Looks like I spoke too soon. Here's what happened; hopefully this is useful:
1. Went 9+ hours of uptime on kernel 4.4.0-53 after the backport install.
2. Shut down the laptop.
3. Booted into the 4.4.0-53 kernel at my work and connected to the wifi there (I've NEVER had the wifi crash at my work, and it didn't crash this time either).
4. Powered down the laptop.
5. At home, booted into the 4.4 kernel and about 5 minutes later the problem was back. Tried rebooting and a hard shutdown and restart, but to no avail.
6. Uninstalled the backport.
7. Reboot.
8. Reinstall the backport.
9. Reboot.
10. About 3 minutes after booting the problem happens again.
11. Uninstall the backport.
12. Reboot.
13. Uninstall the 4.4 kernel.
14. Reboot.
15. Reinstall the 4.4 kernel.
16. Reboot into the newly installed 4.4 kernel.
17. wifi goes stupid almost immediately.
18. Install backport.
19. Reboot.
20. wifi goes stupid again after about 3 minutes this time.
21. Moved closer to my router after a reboot. Problem happens but takes a little longer to occur.

The dmesg output is the same as last time so no point in posting it again. I just wish I knew the secret formula to make this problem go away completely.
Comment 94 Emmanuel Grumbach 2017-03-08 07:15:48 UTC
Sorry for the late response.
@Ryan, you seem to be seeing the CSR 0xfffffff issue which is not directly the 0x5a5a5a5a thing.
In the meantime, unfortunately, I didn't get the debug data asked in Comment 2, so I am closing the bug.
Comment 95 ryan.jentzsch 2017-03-18 02:57:42 UTC
I finally gave up. I bought a Panda 300N wireless USB adapter for $15. This problem only happens at my house, so after a bit of a struggle getting iwlwifi to play nice with the Panda via rfkill, I have no more issues with the 7260 (as it is turned off at home). The trade-off is in speed: the Panda does not support 5GHz and is a little slower. But I'll take a performance hit over an unstable wireless interface. Thanks for the time you spent on this -- obviously there is still an issue either with the kernel PCI code or the Intel driver.
Comment 96 ryan.jentzsch 2017-03-19 12:14:20 UTC
For anyone who has this inexplicable issue: I found a consistent workaround/fix.


I am using wicd and finally got the WPA supplicant driver to use nl80211 (I was going to upload a screenshot, but since the issue is marked as closed it will not let me). The recommendation is to use wext, but this is outdated behavior (see http://linuxwireless.org/en/developers/Documentation/Wireless-Extensions/index.html#Do_we_still_use_WE_.3F). I tried to get nl80211 to work before, and I'm not sure why it is working now. This seems to have solved the problem.



Open wicd. In the upper right is a dropdown arrow; click it --> Preferences --> Advanced Settings tab --> WPA Supplicant (dropdown) --> select nl80211. You may need to reboot.
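
One way to check which driver backend is actually in effect is to look at the `-D` option on the running wpa_supplicant command line. A minimal sketch, assuming wpa_supplicant was started with an explicit `-D` option; the function name is hypothetical.

```shell
# supplicant_driver "CMDLINE": print the value passed via -D (either
# "-Dnl80211" or "-D nl80211") in a wpa_supplicant command line.
supplicant_driver() {
    local prev=""
    local arg
    for arg in $1; do
        case "$arg" in
            -D?*) printf '%s\n' "${arg#-D}"; return 0 ;;
        esac
        if [ "$prev" = "-D" ]; then
            printf '%s\n' "$arg"
            return 0
        fi
        prev=$arg
    done
    return 1
}
```

Usage: `supplicant_driver "$(pgrep -a wpa_supplicant)"` should print `nl80211` once the setting has taken effect, or `wext` if the old backend is still in use.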
Comment 97 ryan.jentzsch 2017-03-20 00:09:07 UTC
Created attachment 255355 [details]
Script to fix wifi getting stuck so you don't need to reboot

This is a bash script. I have this saved to /bin/fixwifi. It MUST be run with sudo. Currently it is hard-coded to use the wlan0 interface. Here's what it does:

1. Figures out Wireless 7260 PCI slot.
2. Removes the device from Linux PCI device list.
3. Tells Linux to rescan the PCI devices (in hopes that the 7260 will be picked up again).
4. Tries to bring the wlan0 interface up.
5. If step 4 is successful, it tries to set the power management for wlan0 to "OFF" and exits.
6. If step 4 doesn't work, it resets the network driver modules and tries again starting at step 2.
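
The steps above can be sketched roughly as follows. This is not the attached script, only an illustration under stated assumptions: the device shows up in `lspci -D` with "Wireless 7260" in its description, the driver modules are iwlmvm and iwlwifi, power saving is switched off with `iw`, and the retry count and sleep times are arbitrary. All function names are hypothetical.

```shell
#!/bin/sh
IFACE=${IFACE:-wlan0}                  # hard-coded interface, as described
SYSFS_PCI=${SYSFS_PCI:-/sys/bus/pci}

# Step 1: print the PCI address of the first "Wireless 7260" entry found in
# `lspci -D` output read from stdin; fails if there is none.
find_7260_slot() {
    awk '/Wireless 7260/ { print $1; found = 1; exit } END { exit !found }'
}

# Steps 2 and 3: remove the device (if present), then rescan the PCI bus.
remove_and_rescan() {
    local slot=$1
    if [ -e "${SYSFS_PCI}/devices/${slot}/remove" ]; then
        echo 1 > "${SYSFS_PCI}/devices/${slot}/remove" || return 1
    fi
    echo 1 > "${SYSFS_PCI}/rescan"
}

# Step 6: reload the driver modules (module names are an assumption).
reload_driver() {
    modprobe -r iwlmvm iwlwifi
    modprobe iwlwifi
}

fix_wifi() {
    local slot attempt
    slot=$(lspci -D | find_7260_slot) || { echo "7260 not found" >&2; return 1; }
    for attempt in 1 2 3; do
        remove_and_rescan "$slot"
        sleep 2
        # Steps 4 and 5: bring the interface up, then disable power saving.
        if ip link set "$IFACE" up; then
            iw dev "$IFACE" set power_save off
            return 0
        fi
        reload_driver
        sleep 2
    done
    return 1
}
```

To use it, call `fix_wifi` at the end of the file and run the script as root; writing to the sysfs `remove` and `rescan` nodes fails without root privileges.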
Comment 98 Pietro Battiston 2021-02-01 09:33:19 UTC
Created attachment 295025 [details]
dmesg under 5.10.0-1 (debian testing)

Emmanuel, can you confirm that a trace like that attached (under 5.10.0) points to a different bug?

I ask because the symptoms look very similar to those described above (including the bug not happening with versions of the kernel 4.4 and older) and the script provided by @ryan (actually, the rewrite I found at https://askubuntu.com/a/1263453/152438 ) "solves" the problem, but I can't find the "00000000a5a5a5a5" string in dmesg.
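
A quick way to check a kernel log for the fill patterns mentioned in this report is a case-insensitive search for both spellings. A minimal sketch; the function name is hypothetical, and matching these two patterns is only a heuristic, not a confirmed signature of this bug.

```shell
# count_stuck_markers: read kernel log text on stdin and print the number of
# lines containing the 5a5a5a5a or a5a5a5a5 pattern (case-insensitive).
count_stuck_markers() {
    grep -ciE '5a5a5a5a|a5a5a5a5'
}
```

Usage: `dmesg | count_stuck_markers`. A result of 0 means neither pattern appears in the current kernel ring buffer.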
