Bug 86861 - RTL8185L PCI ifup freeze system
Summary: RTL8185L PCI ifup freeze system
Status: REOPENED
Alias: None
Product: Drivers
Classification: Unclassified
Component: network-wireless (show other bugs)
Hardware: All Linux
: P1 normal
Assignee: drivers_network-wireless@kernel-bugs.osdl.org
URL:
Keywords:
Depends on:
Blocks:
 
Reported: 2014-10-25 07:44 UTC by Eric Shattow
Modified: 2014-12-15 19:13 UTC (History)
2 users (show)

See Also:
Kernel Version: 3.16
Subsystem:
Regression: No
Bisected commit-id:


Attachments
fix rtl8225.c to use aux functions for anaparam write (2.71 KB, patch)
2014-11-08 16:57 UTC, Andrea Merello
Details | Diff
add delays in anaparam write functions (1.86 KB, patch)
2014-11-08 17:01 UTC, Andrea Merello
Details | Diff
add dealyts when locking/unlocking CONFIGX regs (5.18 KB, patch)
2014-11-21 19:30 UTC, Andrea Merello
Details | Diff

Description Eric Shattow 2014-10-25 07:44:17 UTC
1. Installed Gateway RTL8185L 802.11ABG Mini PCI wifi card into a Fujitsu Stylistic ST4121 replacing its Mini PCI combo modem & prism2 wifi device.
2. Power on, Debian boots (3.16-2-686-pae), but system is frozen before graphical login.
3. No SysRq response (from vendor Infrared keyboard, don't have a wired keyboard handy).
4. Hold power until device shuts off.
5. Boot Debian rescue mode is not frozen
6. Add blacklist rtl818x_pci via /etc/modprobe.d/ to get into multi-user.

After reboot into normal multi-user mode :

# modprobe rtl818x_pci
[24260.769593] ieee80211 phy1: Selected rate control algorithm 'minstrel_ht'
[24260.771547] ieee80211 phy1: hwaddr 00c0a8d7232b, RTL8185vD + rtl8225

# lspci
01:0d.0 Ethernet controller: Realtek Semiconductor Co., Ltd. RTL-8185 IEEE 802.11a/b/g Wireless LAN Controller (rev 20)
	Subsystem: Realtek Semiconductor Co., Ltd. Device 8225
	Control: I/O+ Mem+ BusMaster+ SpecCycle- MemWINV+ VGASnoop- ParErr- Stepping- SERR- FastB2B- DisINTx-
	Status: Cap+ 66MHz- UDF- FastB2B+ ParErr- DEVSEL=medium >TAbort- <TAbort- <MAbort- >SERR- <PERR- INTx-
	Latency: 64 (8000ns min, 16000ns max), Cache Line Size: 32 bytes
	Interrupt: pin A routed to IRQ 5
	Region 0: I/O ports at 2400 [size=256]
	Region 1: Memory at e0201400 (32-bit, non-prefetchable) [size=512]
	Capabilities: [50] Power Management version 2
		Flags: PMEClk- DSI- D1+ D2+ AuxCurrent=375mA PME(D0-,D1+,D2+,D3hot+,D3cold+)
		Status: D0 NoSoftRst- PME-Enable- DSel=0 DScale=0 PME-
	Kernel driver in use: rtl818x_pci

# iwconfig
wlan0     IEEE 802.11bg  ESSID:off/any  
          Mode:Managed  Access Point: Not-Associated   Tx-Power=0 dBm   
          Retry short limit:7   RTS thr:off   Fragment thr:off
          Encryption key:off
          Power Management:on

# ifconfig wlan0 up

System immediately deadlocks, no messages from the kernel.

I have verified that the hardware works via ndiswrapper and a suitable RTL8185L driver for Windows XP platform.
Comment 1 Eric Shattow 2014-10-27 02:42:06 UTC
Temporary workaround is to #include <linux/delay.h> and then stuff 'udelay(2);' into all of the io functions in drivers/net/wireless/rtl818x/rtl8180/rtl8180.h

Driver loads, works correctly with some light load testing.

How to determine where the timing is falling over?
Comment 2 Eric Shattow 2014-10-27 09:02:41 UTC
For what it's worth, this is a bug in GCC 4.9.1 and not the linux kernel. Recompiled with -O0 and works great.
Comment 3 Eric Shattow 2014-11-01 07:59:23 UTC
Maybe it is not a bug in GCC. Also happens with gcc 4.8.3 as compiler. How to debug?
Comment 4 Andrea Merello 2014-11-01 12:48:22 UTC
Hello,
Thank you for reporting this.

It doesn't really semms a GCC bug. As you have find out, this problem is somewhat tining-sensitive. Adding explicit delays or disabling GCC optimizations has probably the same effect: delaying things..

When timings in the order of uS matters, then a lot of things may play a role.

Recently the rtl818x driver got a lot of modifications. Have you ever tried with a older kernel (say 3.13)?
If it worked then it could be a bug introduced with recent patches (or they just made arose an hidden timing problem). If it didn't then I can exclude those patches.

Indeed I have another report about a system crash like yours that happens with older kernel. I couldn't ever end up with any soulution.. Maybe your discover about IO delays have finally enlighed things a bit :)

Actually I think the vendor driver implements 20us delays on all MMAP write accessors. I thought this was just conservative for whose write accesses that might require delay (for example RF/BB access), but I didn't tought the HW could crash the whole system.. (I'm thinking about the HW hitting some weird internal condition that causes the HW itself to mess the PCI bus).

AFAIK the kernel IO accessors guareantee access ordering wrt themselfes and I took some care to put memory barriers where IO accesses should be ordered wrt to DMA memory. It is still possible I missed something here, or that some PCI posting issue is tricking things, but indeed the problem could be really a true HW requirement not to see IO accesses too fast after writing some register. BTW I'll give a colser look to the code double-checking barriers and related things..

Maybe it could also worth to try adding explicit delays only to write accesses where it is reasonable to think some clocking resynchronization is involved inside the ASIC. This likely means where BB and possibly Analog frontend and RF interface registers are accessed.

If this won't work then probably we have not enough information about the HW in order to infer what it is actually needed to delay, and we probably cannot do something rellay better than really place delays in all write accessors as in the reference driver. Maybe the register writes in the ISR handler might be the only ones that I could still try to save..

I will come out with some experimental patches as soon as I have time..

Andrea
Comment 5 Eric Shattow 2014-11-02 02:06:52 UTC
Compiled mainline linux kernel 3.13 and tested. Confirm loading rtl8180.ko is okay but it then does crash on "ifconfig wlan up". Now back to 3.16-3 (3.16.15 debian).

cd $LINUXSRC && make EXTRA_CFLAGS="-g -I$(pwd)/drivers/net/wireless/rtl818x" -C /lib/modules/$(uname -r)/build M=$(pwd) drivers/net/wireless/rtl818x/rtl8180/rtl818x_pci.ko

Good: rtl8180.h after every iowriteNN() is ndelay(20)

Good: rtl8180.h unmodified, all source calling rtl818x_iowriteNN() insert ndelay(20) after every call

Bad: rtl8180.h after every iowriteNN() is ndelay(20), rtl8225.c:rtl8225_rf_stop replace all iowriteNN calls to avoid ndelay. Deadlocks immediately on "ifconfig wlan0 down", consistently.

Good: rtl8180.h after every iowriteNN() is ndelay(20), rtl8225.c:rtl8225_rf_stop replace all rtl818x_iowriteNN calls to avoid ndelay, add ndelay(20) after each iowriteNN call.

I went through the source, one subroutine at a time and A/B compare "insmod; ifconfig up; ifconfig down; rmmod" with ndelay after every iowrite call and without those calls per function. This amounts to guesswork because I don't understand anything about memory boundries i.e.

I suspect problems with at least rtl8225.c:rtl8225_rf_init and rtl8225.c:rtl8225_rf_stop.
Comment 6 Andrea Merello 2014-11-05 22:42:48 UTC
Thank you a lot for your work!

Your test "Bad: rtl8180.h after every iowriteNN() is ndelay(20), rtl8225.c:rtl8225_rf_stop replace all iowriteNN calls ..." is probably the most interesting.

If you have let untouched the iowriteNN in rtl8225_write (they still had delays because of modified iowriteNN in rtl8180.h), that is called in rtl8225_rf_stop, then I'd focus on ANAPARAM registers write procedure.

As you can see in the code, the procedure for writing ANAPARAM registers involves the actual ANAPARAM register iowrite operation itself, as well as two nested "unlock/re-lock" operations:
Unlocking ANAPARAM registers (in order to be able to write them) is achieved by writing CONFIG3 register; unlocking CONFIG3 register is achieved by writing EEPROM_CMD register.

Perahps after the unlock operation, before writing the unlocked register, and/or before re-locking, the delay is needed. (Obviously this is just a speculation..) 

For writing ANAPARAM registers there are utility functions rtl8180_set_anaparam and rtl8180_set_anaparam2. Changing rtl8225_rf_init and rtl8225_rf_stop in order to use those utility functions is indeed shomething that I should had done earlier.. I'm preparing a patch to do this. I will attach here soon.

Once applied that patch, it is easy to put delays in rtl8180_set_anaparam and rtl8180_set_anaparam2 and see whether this fix the problem. I will prepare a patch also for this soon.

If this is not enough to fix the problem, then maybe the issue happens when unlocking/writing/locking CONFIGX registers (I'm still speculating..). There are probably other places where this is done other than whose related to ANAPARAM registers write.

I can eventually try also putting delays every time CONFIGX registers are written. I will eventually prepare a patch for this also.
Comment 7 Andrea Merello 2014-11-08 16:57:14 UTC
Created attachment 157041 [details]
fix rtl8225.c to use aux functions for anaparam write

This is against 3.16.3.
With this patch applied it is easier to test delays in ANAPARAM writing code.

(I'll probably also rebase and push to mainline as it also reduce duplicated code and make code cleaner..)
Comment 8 Andrea Merello 2014-11-08 17:01:27 UTC
Created attachment 157051 [details]
add delays in anaparam write functions

This can me applied on top of my previous patch I just posted here. It is an _experimental_ patch to see if the problem has to do with the card messing up the PCI bus whenever ANAPARAM registers are written without delays.

This adds delays in the following places:
- after unlocking anaparam regs, before writing them
- after writing anaparam regs, before locking them again.

If this turns out to help with the problem, then it's probably worth to try adding also dummy register reads to force PCI posting.
Comment 9 Eric Shattow 2014-11-18 08:22:43 UTC
Yes, I confirm "[patch] add delays in anaparam write functions" seems to help.

Process for this:

1. apply "[patch] fix rtl8225.c to use aux functions for anaparam write"
2. change rtl8180.h copy and modify rtl818x_iowriteNN routines i.e. 

+static inline void rtl818x_iowrite32_nd(struct rtl8180_priv *priv,
+                                    __le32 __iomem *addr, u32 val)
+{
+       iowrite32(val, addr);
+}
+
 static inline void rtl818x_iowrite8(struct rtl8180_priv *priv,
                                    u8 __iomem *addr, u8 val)
 {
        iowrite8(val, addr);
+       ndelay(20);
 }

3. verify driver will load and operate normally with added delay this way (yes, it does operate normally)
4. modify anaparam* routines that will be touched by "[patch] add delays in anaparam write functions" to use non-delay *_nd routines
5. verify driver now fails at "ifconfig wlan0 up" (yes, hard lock-up of kernel)
6. apply udelay(20) as directed by "[patch] add delays in anaparam write functions"
7. verify driver will now load and operate normally (Yes, with limited testing it does appear to work again).
Comment 10 Andrea Merello 2014-11-18 12:59:36 UTC
Hello,
Thank you for this tests.

Your report seems to confirm that delay after writing anaparam is required, but, because you still kept some other delays in IO accessors, we still don't know if waiting only after writing anaparams is enough to avoid the crash.

Can you please try to apply the patches keeping IO accessors in rtl8180.h without any delays to see what happens ? This way the only added delays will be those introduced by the patch "add delays in anaparam write functions".

If this is not enough (system still crashes) then I will prepare another patch to add delays also after unlocking CONFIGX registers.

Thank you
Andrea
Comment 11 Eric Shattow 2014-11-18 18:14:47 UTC
(In reply to Andrea Merello from comment #10)
> Can you please try to apply the patches keeping IO accessors in rtl8180.h
> without any delays to see what happens ? This way the only added delays will
> be those introduced by the patch "add delays in anaparam write functions".

I forgot to say, that the delays are needed or else it does crash.

To be sure, I tried again now adding a new patch to my steps in comment #9 reverting the changes to rtl8180.h and use of *_nd in dev.c; the module inserts as usual but there is a hardlock crash when "ifconfig wlan0 up". I guess the changes in anaparam routines are helpful but there are more troubles still.
Comment 12 Andrea Merello 2014-11-21 19:29:35 UTC
Ah, OK.

I'm going to attach another patch, to be applied in top of those two, that adds also delays surrounding the CONFIGX registers unlocks.

Can you please try this?
Comment 13 Andrea Merello 2014-11-21 19:30:20 UTC
Created attachment 158421 [details]
add dealyts when locking/unlocking CONFIGX regs
Comment 14 Eric Shattow 2014-11-22 00:24:26 UTC
Applied, "ifconfig wlan0 up" success now. Starting network-manager to bring up connection to home wifi network, and system freeze happens.  Last output at printk 10 logging level:

[ 1914.016566] IPv6: ADDRCONF(NETDEV_UP): wlan0: link is not ready
[ 1935.862172] wlan0: authenticate with xx:xx:xx:xx:xx:xx
[ 1935.924098] wlan0: send auth to xx:xx:xx:xx:xx:xx (try 1/3)
Comment 15 Eric Shattow 2014-12-06 20:27:24 UTC
What is next to try?
Comment 16 Andrea Merello 2014-12-15 19:13:32 UTC
Hello,
Sorry for the huge delay, these are quite busy months for me :(

I'm starting to suspect that delays are potentially needed everywere :(

Anyway before give up I first of all need to check code again and see if I'm missing something. I will do as soon as I have some spare time.

Your last log shown that the crash happens also when the card has been fully initialized.. I think that when the association procedure is ongoing, the driver shouldn't do particular things that may specifically need delays.

During this procedure the driver should mainly deal with interrupt status register and DMA kick register.

Maybe it could have some callback invoked to configure contention window, IFS* registers, and things like those. But these are all MAC-related stuff.. This seems to refute my initial supposition about interal ASIC busses and reclocking..

That's might be more likely a "simplier" ASIC PCI interface problem, loosely related to other ASIC internal things...

Maybe it's some weird incompatibilty with the chipset or something like that..

Well, I will check the code again and eventually (hopefully) I will come back with other patches, but I'm starting to think that we might have to introduce delays in all write accessors like the reference driver does..

Andrea

Note You need to log in before you can comment on or make changes to this bug.