Bug 98411

Summary: iwlwifi module unpredictably crashes
Product: Drivers Reporter: Daniel (mayazcherquoi)
Component: network-wireless Assignee: drivers_network-wireless (drivers_network-wireless)
Status: CLOSED DUPLICATE    
Severity: normal CC: ilw, linville
Priority: P1    
Hardware: Intel   
OS: Linux   
Kernel Version: 4.0 Subsystem:
Regression: No Bisected commit-id:
Attachments: See Description, # Additional Information
Output of `lspci -xxxx -vvvv' before the problem.
Output of `sudo lspci -xxxx -vvvv' before the problem.
Output of `sudo lspci -xxxx -vvvv' after the problem.
Tar'd file containing the stdout of `lspci' and `dmesg'.
Tar'd file containing the stdout of `lspci' and `dmesg'.

Description Daniel 2015-05-15 07:36:47 UTC
Created attachment 176971 [details]
See Description, # Additional Information

# Overview

The iwlwifi module will sometimes crash at random intervals after the system has been on for a little while. dmesg has the line `Error sending REPLY_ADD_STA: time out after 2000ms.' immediately at the point of the crash. I am unable to unload the module. Attempting to forcefully unload it results in a kernel panic/system freeze.
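
A minimal sketch for locating that message in a saved log, assuming the output of `dmesg` was written to a text file such as the dmesg1.txt mentioned under Additional Information. The function name `find_add_sta_timeouts` is made up for illustration.

```shell
#!/bin/sh
# Print every line (prefixed with its line number) of a saved dmesg log
# that contains the timeout message quoted in this report.
# $1: path to the saved log, e.g. dmesg1.txt
find_add_sta_timeouts() {
    grep -n 'Error sending REPLY_ADD_STA: time out after' "$1"
}
```

For example, `dmesg > dmesg_now.txt` followed by `find_add_sta_timeouts dmesg_now.txt` lists the matching lines; grep exits with a non-zero status when the message is absent.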

This problem occurred only a few times over the past half a year; however, it is now incredibly frequent, which makes this workstation/laptop annoying to use.

# Reproduction

I am unable to reproduce this issue on demand, as it occurs randomly: sometimes when there is little network activity, and sometimes when there is a lot. The iwlwifi module crashes not just when connected to my home wireless access point, but also to any other wireless access point (whether it be at the University [of Sydney], McDonald's, etc).

# Actual Results

My device disconnects from the wireless access point and is unable to reconnect. It is also unable to see any other networks in the area and is unable to attempt to connect to them. I am using the GNOME 3(.16) desktop environment, and where it shows the signal strength in the top-right pulldown menu, it shows a blank white square instead. Attempting to use the hardware switch to turn the wireless device on and off has no effect, nor does trying a software reset of the device.

In order to get the device in a usable state, I have to restart it.

# First Encountered

Many months ago. I cannot remember an exact date, sorry :-( It occurred on kernel versions prior to 3.18. However, for some reason, it is much more frequent now.

# Additional Information

As I could only attach one file, I decided to tar up all of the following files:
    * dmesg1.txt, dmesg2.txt - A couple of dmesg outputs from when the problem has occurred (look to the bottom of the file).
    * rfkill.txt - The results of rfkill when the problem has happened (I think this is irrelevant, as the output is exactly the same when the problem has occurred, and when it hasn't).
    * modinfogood.txt, modinfobad.txt - Module info (modinfo) when there's no problem, and when there is a problem (these were taken at different reboots).
    * lsmod.txt - `lsmod' output.
    * lspci.txt - `lspci' output.
    * lspci_long.txt - `lspci -vvknnqq' output
    * lsusb.txt - `lsusb' output.
    * lscpu.txt - `lscpu' output.
    * cpuinfo.txt - `cat /proc/cpuinfo' output.
    * meminfo.txt - `cat /proc/meminfo' output.
    * ver_linux.txt - Output produced from `scripts/ver_linux'.

# Hardware

I am running a Clevo P150EM (known as Sager NP9150 in the United States), with an Intel Centrino Advanced-N 6235 wireless card ( http://www.intel.com/content/www/us/en/wireless-products/centrino-advanced-n-6235.html ).

# Anything Else?

If there is anything else you need to know, or need me to try, please let me know. :-) I hope I have done everything right.
Comment 1 Emmanuel Grumbach 2015-05-17 06:34:42 UTC
Unfortunately you are hitting a bug that was closed as will not fix. See the info there.

*** This bug has been marked as a duplicate of bug 91171 ***
Comment 2 Daniel 2015-05-17 08:26:11 UTC
(In reply to Emmanuel Grumbach from comment #1)
> Unfortunately you are hitting a bug that was closed as will not fix. See the
> info there.
> 
> *** This bug has been marked as a duplicate of bug 91171 ***

I fail to see how it is? My laptop is stationary the entire time, I do not move it from place to place and the problem occurs. If I do move the laptop to another location, the connection does not drop as it did for those in bug #91171.

Furthermore, I experience no such issue on Windows 7, only Linux. Not to mention I have had this computer for 2 years, with Linux running on it off and on, and only recently (couple of months) have I experienced this issue. I experience no issue with other hardware, only wireless.

Are you sure it's the same bug?
Comment 3 Emmanuel Grumbach 2015-05-17 08:31:55 UTC
Can you please attach the output of sudo lspci -xxxx -vvvv before and after the failure?

You are not the first one complaining about this issue happening more recently. Someone even tried to bisect with no luck.
Did you update your BIOS?

I am pretty sure it is the same bug since the driver can't access the device in both issues.
Comment 4 Daniel 2015-05-17 08:36:23 UTC
Created attachment 177091 [details]
Output of `lspci -xxxx -vvvv' before the problem.

Here is the output of the command `lspci -xxxx -vvvv' before the problem. 

I'll have to wait until the problem occurs again before I am able to grab the output of that command. This could be anywhere between a few minutes, to a few hours, to a few days.
Comment 5 Daniel 2015-05-17 08:37:47 UTC
(In reply to Emmanuel Grumbach from comment #3)
> Did you update your BIOS?

I have not updated my BIOS since I first got the computer a couple of years ago.
Comment 6 Emmanuel Grumbach 2015-05-17 08:39:43 UTC
Please run lspci with root permissions. I mentioned you need sudo.
Comment 7 Daniel 2015-05-17 09:02:15 UTC
Created attachment 177101 [details]
Output of `sudo lspci -xxxx -vvvv' before the problem.

Hi, sorry for the late response, I ran into a couple of issues when piping the output of lspci to a file, namely: `pcilib: sysfs_read_vpd: read failed: Connection timed out'. dmesg had `rtsx_pci 0000:03:00.0: vpd r/w failed.  This is likely a firmware bug on this device.  Contact the card vendor for a firmware update' appended to the end of it each time this happened.

Anyways, I ran it under root permissions as you said (sorry about that the first time), but it appears to be mostly the same as before?
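
One way to check whether two dumps really are "mostly the same" is to diff them. The sketch below assumes each dump was saved to a file with the command from this thread (for example `sudo lspci -xxxx -vvvv > lspci_before.txt`); the file names and the function name `compare_lspci_dumps` are made up for illustration.

```shell
#!/bin/sh
# Print a unified diff of two saved lspci dumps.
# $1: dump taken earlier (e.g. before the problem, or without sudo)
# $2: dump taken later (e.g. after the problem, or with sudo)
# The exit status is diff's own: 0 if the files are identical, 1 if they differ.
compare_lspci_dumps() {
    diff -u "$1" "$2"
}
```

Called as `compare_lspci_dumps lspci_before.txt lspci_after.txt`, it shows only the lines that changed, which is easier to scan than two full hex dumps side by side.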
Comment 8 Emmanuel Grumbach 2015-05-17 09:37:02 UTC
Can you do the same with Windows?

You can use read write anything. This application will dump the config space of the device.

I'd like to compare them.

But I am very pessimistic.
Comment 9 Daniel 2015-05-19 09:59:51 UTC
Created attachment 177351 [details]
Output of `sudo lspci -xxxx -vvvv' after the problem.

Here's the output of `sudo lspci -xxxx -vvvv' after the problem had occurred. Sorry it's been a little while; I've been extremely busy.

If you'd still like, I should be able to provide you with the output on Windows on the weekend?
Comment 10 Emmanuel Grumbach 2015-05-19 10:03:15 UTC
Yeah, you can, but again, I can't say anything regarding the likelihood that we will be able to do something with it.
Comment 11 Daniel 2015-05-19 10:11:17 UTC
That's understandable. If the problem is still occurring in a couple of months, I will try and give bisection a go.

I just noticed this webpage: https://wireless.wiki.kernel.org/en/users/drivers/iwlwifi/debugging

If it would help, would you like me to provide you with a trace or a firmware dump?
Comment 12 Emmanuel Grumbach 2015-05-19 10:13:11 UTC
No, the problem is really a bus problem. BTW, can you attach a dmesg + lspci *after* the problem occurs? I'd like to correlate them. I am afraid you might be seeing several issues.
Comment 13 Daniel 2015-05-20 07:26:04 UTC
Created attachment 177381 [details]
Tar'd file containing the stdout of `lspci' and `dmesg'.

Hey Emmanuel,

For some reason, the problem was really really bad today. As a result, I've managed to snatch up many log files. Each dmesg text file corresponds to the lspci text file of the same number.

For both lspci and dmesg logs number 11, I decided to try the supposed workaround mentioned in #91171:

echo 1 > /sys/bus/devices/0000\:00\:03.0/remove
echo 1 > /sys/bus/pci/rescan
killall wpa_supplicant

That, however, did not work.
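
For reference, a hedged sketch of the two sysfs writes in that sequence, assuming the standard sysfs layout in which PCI devices appear under `/sys/bus/pci/devices/` (the first path quoted above has no `pci` component). The function name, the optional second argument, and the error message are illustrative and not taken from bug 91171. The device address is a placeholder for that of the wireless card, and on a real system the writes need root.

```shell
#!/bin/sh
# Remove one PCI device and ask the kernel to rescan the bus.
# $1: full PCI address, e.g. 0000:00:03.0 (placeholder; `lspci -D` prints
#     addresses including the domain)
# $2: sysfs root, defaults to /sys (assumption: overridable so the function
#     can be tried against a scratch directory first)
pci_remove_and_rescan() {
    pci_addr=$1
    pci_sysfs=${2:-/sys}
    pci_dev="$pci_sysfs/bus/pci/devices/$pci_addr"
    if [ ! -e "$pci_dev/remove" ]; then
        echo "no such PCI device: $pci_addr" >&2
        return 1
    fi
    # Detach the device, then trigger a rescan so it is probed again.
    echo 1 > "$pci_dev/remove" && echo 1 > "$pci_sysfs/bus/pci/rescan"
}
```

Checking that the device directory exists before writing guards against a mistyped address or path, in which case the shell redirection would simply fail and nothing would be removed or rescanned.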

I have also uploaded a video onto YouTube of what happens immediately after the problem has occurred (I had just logged on and left the machine for a couple of minutes, after which the problem occurred; the video starts at that point):

https://youtu.be/M7OZ3jCMvwY
Comment 14 Daniel 2015-05-20 13:34:06 UTC
Created attachment 177451 [details]
Tar'd file containing the stdout of `lspci' and `dmesg'.

It happened again. Decided to try a much older version.

This is using archlinux-2014.05.01.iso (kernel 3.14). It printed out a lot of text, different from the earlier output from the newer kernel.

This does not bode well :-(
Comment 15 Emmanuel Grumbach 2015-05-20 17:41:20 UTC
I am not surprised.
Comment 16 Emmanuel Grumbach 2015-05-31 06:15:14 UTC
re-closing as duplicate.