Bug 43021

Summary: YARB: via-rhine: pre-existing interface rendered non-working after resume
Product: Drivers Reporter: Andreas Mohr (andi)
Component: NetworkAssignee: Francois Romieu (romieu)
Status: RESOLVED CODE_FIX    
Severity: normal CC: romieu
Priority: P1    
Hardware: All   
OS: Linux   
Kernel Version: 3.3.0-08839-gb5174fa Subsystem:
Regression: Yes Bisected commit-id:
Bug Depends on:    
Bug Blocks: 42644    
Attachments: Reboot only D3 transition avoidance
Fix to make a384a33bb1c9ec not break things

Description Andreas Mohr 2012-04-01 10:06:24 UTC
After my upgrade to 3.3.0something (yes, I know: failed to do -rc testing due to chaotic times currently) I ended up with broken network connectivity after resume.
3.2.0-00300-g0db49b7 was my last kernel version (working fine).

ifdown / pump ""fixes"" missing networking after resume.

Given that e92b9b3b091d5fcd ("via-rhine: rework suspend and resume.") removed the pci_save/restore_state() parts, I know now where to start investigating... (lspci -xxx differences pre/post-suspend).

["Yet Another Resume Bug"] - and I'm growing really, reaallly tired of them.


00:12.0 Ethernet controller: VIA Technologies, Inc. VT6102 [Rhine-II] (rev 74)
        Subsystem: EPoX Computer Co., Ltd. VT6103
        Flags: bus master, medium devsel, latency 32, IRQ 23
        I/O ports at e400 [size=256]
        Memory at db002000 (32-bit, non-prefetchable) [size=256]
        Capabilities: [40] Power Management version 2
        Kernel driver in use: via-rhine

TODO:
- lspci -xxx
- try re-connecting cable (link state)
- packet counters (I think they're still increasing!)



Anyway, even if it looks like certain people might have triggered the breakage, I'm very thankful for their heaps of work done recently.

Andreas Mohr
Comment 1 Francois Romieu 2012-04-01 11:22:34 UTC
Created attachment 72780 [details]
Reboot only D3 transition avoidance

The condition may even be completely removed.
Comment 2 Andreas Mohr 2012-04-01 16:14:54 UTC
Just verified that it's NOT a PCI config space save/restore issue - diffing log files of both pre and post resume lspci -xxx outputs comes up empty - IOW save/restore probably _is_ properly being done by PM layer now, and it does work correctly. Will try the attached patch next, and failing that, try to revert the via-rhine.c commit by commit.
Comment 3 Andreas Mohr 2012-04-01 16:47:44 UTC
OK, the attached patch did NOT work.


I also just did the link state test - resumed (resulting in broken connection), then pulled the cable, then reinserted - bingo! So that perhaps means it's an issue near MII/PHY parts, right?

Will now proceed with reverting individual commits unless told otherwise. ;)
Comment 4 Andreas Mohr 2012-04-01 19:34:54 UTC
a384a33bb1c9ec2 broke it (fc3e0f8aec05dd8 failed, 7ab87ff4c770eed worked, a384a33bb1c9ec2 failed, then 7ab87ff4c770eed worked again).

So far, so good (well, at least not too bad...).

And in fact I _am_ getting a
via-rhine 0000:00:12.0: eth-via: high bit wait (71/80) cycle count: 1024

(doh - should have checked dmesg _much_ earlier)

BTW, I also have:
"via_rhine: Broken BIOS detected, avoid_D3 enabled"



Well, the thing is... I just simply don't know what the hell could be breaking here.
I tried to disable the newly introduced udelay(10); in rhine_wait_bit()
(which is about the very only thing in the entire overly sizeable patch
which might be responsible AFAICS), but that didn't help.


BTW, the
                if (high ^ !!(ioread8(ioaddr + reg) & mask))

line is awful.
It should be changed to update a new
bool has_mask_bits = false;
helper variable and then simply check for
if (high ^ has_mask_bits)

Or, IOW,

        bool has_mask_bits = false;
        for (i = 0; i < 1024; i++) {
                has_mask_bits = !!(ioread8(ioaddr + reg) & mask);
                if (high ^ has_mask_bits)
                        break;
                udelay(10);
        }

Oh, and for giggles one could move the repeated ioaddr + reg calc out of the loop,
but this is quite likely such a simple optimization that any semi-modern compiler is likely doing that anyway, thus it might end up as a pessimization (would need to check generated asm).



One thing on my TODO list: perhaps we're being hit by a PCI posting issue?

The dmesg log strongly pinpoints rhine_disable_linkmon()
as the culprit. I'm running out of ideas - I will now reintroduce the macro specifically for that function to see whether that possibly fixes it.


BTW, the patch itself definitely goes in the right direction - it's just a bit unwieldy to analyze.

I'll do some more investigation now.
Comment 5 Andreas Mohr 2012-04-01 19:44:31 UTC
Perhaps it's also a matter of formerly

if (debug > X)-scoped

accesses of e.g. ioread16(ioaddr + ChipCmd) formerly being skipped and now unconditionally being executed, thereby causing very unwanted chip-side resets or some such? (just some paranoia)

I will have to comment out all those lines to see whether this was the problem.
Comment 6 Andreas Mohr 2012-04-01 19:57:04 UTC
Disregard my "unwieldy" and "if (debug > X)" comments - I just had a look at an overly broad diff :-P

Well, thus a384a33bb1c9ec2d9 "via-rhine: RHINE_WAIT_FOR macro removal." diff is even smaller, which makes this breakage all the more puzzling (perhaps compiler issue even??).

I'll try to tweak some crucial code and see what happens.
Comment 7 Andreas Mohr 2012-04-01 20:30:22 UTC
XOR check in rhine_wait_bit() _was_ buggy indeed - with that fixed a384a33bb1c9ec2 now works (haven't verified HEAD yet).
Patch to follow.
It's not a good idea to skip separation of concerns and thereby increase code complexity beyond understandability (has_mask_bits bool helper).
Us humans are stooopid bozos :)
Comment 8 Andreas Mohr 2012-04-01 20:32:51 UTC
Created attachment 72787 [details]
Fix to make a384a33bb1c9ec not break things
Comment 9 Francois Romieu 2012-04-01 22:26:26 UTC
(In reply to comment #8)
> Created an attachment (id=72787) [details]
> Fix to make a384a33bb1c9ec not break things

Ok, I add your Signed-off-by and I submit it now.

Thanks.

-- 
Ueimor