Bug 64081
Summary: | mdiobus_read fail and return -110 when called from get_phy_id r6040: 0000:00:08.0 failed to register MII bus | ||
---|---|---|---|
Product: | Networking | Reporter: | Nils Koehler (nils.koehler) |
Component: | Other | Assignee: | Stephen Hemminger (stephen) |
Status: | NEW --- | ||
Severity: | blocking | CC: | f.fainelli, nils.koehler, szg00000, zenbooster |
Priority: | P1 | ||
Hardware: | All | ||
OS: | Linux | ||
Kernel Version: | 3.10.17 | Subsystem: | |
Regression: | No | Bisected commit-id: |
Description
Nils Koehler
2013-10-30 18:10:35 UTC
Ok, so a few things to check: - if you increase the existing 2048 limit to a higher number does it successfully probe for the PHY? - is your PHY actually accessible on the MDIO bus at all? I have tested right now with: #define MAC_DEF_TIMEOUT 2048*2 /* Default MAC read/write operation timeout */ So i doubled the value, but still the same problem: The Phy ID 1 response fine on the bus, just one of the not existing phys between 1 and 31 fails randomly if it happens the id where it fail changes i have seen all of them. but ID 1 respond the right way during init loop The init fails on 50% of the trys to insmod the driver, so when init does not fail, eth0 works propper and stable without any receive errors at all. I have stuffed the source code files core.c r6040.c mdio.c and phy_device.c with detailed trace information stuff, to see what happens during the driver init and to provide you more information about the missbehavier. but something supprising happened, to be able to post larger traces I did a change in kernel config to use ttyS0 instead of tty0 which I have used before. Until that change and the many printk#s I added the error did not happen right now any more. To be sure, I wrote a testscript to load and unload the driver and have a automated test in between with the following script: # cat eth_test.sh counter=1 error=0; while : do insmod /lib/modules/3.10.17/kernel/drivers/net/ethernet/rdc/r6040.ko ifconfig eth0 192.168.0.10 if ifconfig | grep -qci eth0; then echo " trys $counter OK Errors: $error" else error=$((error+1)) echo " trys $counter Fail Errors: $error" fi rmmod /lib/modules/3.10.17/kernel/drivers/net/ethernet/rdc/r6040 counter=$((counter+1)) done The script tested 11341 times and no fail detecteble. trys 11337 OK Errors: 0 trys 11338 OK Errors: 0 trys 11339 OK Errors: 0 trys 11340 OK Errors: 0 trys 11341 OK Errors: 0 After the lunch I try the same testscript with tty0 instead of ttys0 as i used before... I hope to be able to bring back the malfunction, I keep you informed One question as I traced the fail yesterday, the mdiobus_read function failed with returning error -110. I have googeled for error code 110 and google spit out EMULTIHOP for 110, but I can not find information about it. I think this is byond my knowledge of linux... any idear? Ok, so I think you are mixing up few things here. -110 = -ETIMEDOUT from r6040_phy_read() or r6040_phy_write(). 110 is defined in include/uapi/asm-generic/errno.h "The Phy ID 1 response fine on the bus, just one of the not existing phys between 1 and 31 fails randomly if it happens the id where it fail changes i have seen all of them." Ok, so this is totally expected, there is just one PHY responding at address 1. The driver currently probes for all PHYs on the bus, and that is not necessary as we know that PHYs would be. RDC sent me a patch a while ago that seems like it would fix your issue: --- linux-3.7.2/drivers/net/ethernet/rdc/r6040.c.orig 2013-06-07 03:00:05.670250988 -0700 +++ linux-3.7.2/drivers/net/ethernet/rdc/r6040.c 2013-06-07 03:05:01.666407765 -0700 @@ -1037,8 +1037,18 @@ static int r6040_mii_probe(struct net_de { struct r6040_private *lp = netdev_priv(dev); struct phy_device *phydev = NULL; + static int first_find_addr = 0; + int addr; + for ( addr = first_find_addr; addr < PHY_MAX_ADDR; addr++ ) { + + if ( lp->mii_bus->phy_map[addr] ) + phydev = lp->mii_bus->phy_map[addr]; + + if(phydev) + break; + } + first_find_addr++ ; - phydev = phy_find_first(lp->mii_bus); if (!phydev) { dev_err(&lp->pdev->dev, "no PHY found\n"); return -ENODEV; @@ -1194,6 +1204,10 @@ static int __devinit r6040_init_one(stru lp->mii_bus->write = r6040_mdiobus_write; lp->mii_bus->reset = r6040_mdiobus_reset; lp->mii_bus->name = "r6040_eth_mii"; + if ( card_idx == 0 ) + lp->mii_bus->phy_mask = 0x1; + else + lp->mii_bus->phy_mask = 0x3; snprintf(lp->mii_bus->id, MII_BUS_ID_SIZE, "%s-%x", dev_name(&pdev->dev), card_idx); lp->mii_bus->irq = kmalloc(sizeof(int)*PHY_MAX_ADDR, GFP_KERNEL); Can you try this? Today I had the stupid situation, that all test I did end with the result, that I have a rock solid functionality. Normally I am happy about that, but in this case I am not able to point out what change fixed the issue. So I decided today evening to roll back all last changes, and I got back my unstable situation. I will do minimal incremental tests on it tomorrow. Ok thank you as -110 is -ETIMEDOUT I will try to incerase timeout again at first. Then I will apply your patch. Yesterday I have also gambeled with deprecated switch in .config I will also try again with enabeled and disabeled the same... Could you try a simpler version of the patch (the phy_mask is wrong in the above patch): diff --git a/drivers/net/ethernet/rdc/r6040.c b/drivers/net/ethernet/rdc/r6040.c index 1e49ec5..5c09085 100644 --- a/drivers/net/ethernet/rdc/r6040.c +++ b/drivers/net/ethernet/rdc/r6040.c @@ -1192,6 +1192,7 @@ static int r6040_init_one(struct pci_dev *pdev, const struct lp->mii_bus->write = r6040_mdiobus_write; lp->mii_bus->reset = r6040_mdiobus_reset; lp->mii_bus->name = "r6040_eth_mii"; + lp->mii_bus->phy_mask = ~(1 << (card_idx + 1)); snprintf(lp->mii_bus->id, MII_BUS_ID_SIZE, "%s-%x", dev_name(&pdev->dev), card_idx); lp->mii_bus->irq = kmalloc_array(PHY_MAX_ADDR, sizeof(int), GFP_KERNEL); The confusion of yesterdays behavier was that the r6040.ko driver was not updated after make, it was only updated when I touched .config before. I think it is an effect of my buildroot setup. Now I double check that .ko is freshly compiled before do a test. Now I have the unstable starting point for testing the script I wrote bring the information that I have 10 Errors at 26 trys to insmod the driver which is an error rate from 38% I did the following tests: 1. Test Change Timeout value from 2048 to 4096 #define MAC_DEF_TIMEOUT 4096 /* Default MAC read/write operation timeout */ Result No Error dectected within 3000 trys. So I think this was the change wich made it rock solid yesterday... 2. Test Set Timeout back to 2048 to be unstable again #define MAC_DEF_TIMEOUT 2048 /* Default MAC read/write operation timeout */ add your last Kernel patch + lp->mii_bus->phy_mask = ~(1 << (card_idx + 1)); With that patch it is better than the unpatched version. With the Patch I need 208 trys to get 10 errors which is an error rate from appr 5%, so it is better but, not good. trys 208 OK Errors: 10 trys 209 OK Errors: 10 ifconfig: SIOCSIFADDR: No such device 0 trys 210 Fail Errors: 11 trys 211 OK Errors: 11 trys 212 OK Errors: 11 So the question now is to apply both changes? Does the higher Timeout value have any other negative dependecys? I will run the Test over the weekend with both applyed... Snapshot of longterm test (still running)... trys 106362 OK Errors: 0 trys 106363 OK Errors: 0 trys 106364 OK Errors: 0 trys 106365 OK Errors: 0 trys 106366 OK Errors: 0 trys 106367 OK Errors: 0 trys 106368 OK Errors: 0 trys 106369 OK Errors: 0 trys 106370 OK Errors: 0 Ok, so this is with MAC_DEF_TIMEOUT = 4096 right? Yes... aktually passed trys 266950 OK Errors: 0 trys 266951 OK Errors: 0 trys 266952 OK Errors: 0 trys 266953 OK Errors: 0 May be I can tomorrow level out the border of Timeout value between unstable and stable functionality, to decide a value which is not to hight but low enought to ensure a save operation... I have goood news and bad news... This morning I found the device under test in this condition: trys 378722 OK Errors: 0 trys 378723 OK Errors: 0 trys 378724 OK Errors: 0 Killed The good news is, that it run 378724 times without any error, which simulate more than 100 Years of propper operation but... ...the bad news is that the test was killed by the kernel during the night. My first idear was memory loss # top Mem: 252292K used, 3128K free, 0K shrd, 0K buff, 4972K cached CPU: 0% usr 99% sys 0% nic 0% idle 0% io 0% irq 0% sirq It pointed out that it seem that memory is eaten up during the test, so I added that line to the end of the while loop in the test scropt: cat /proc/meminfo | grep -i MemFree It shows that the memory decrease not after each cycle but it get lower and lower after a couple of cycles Meminfo Direct after boot # cat /proc/meminfo MemTotal: 255420 kB MemFree: 247756 kB Buffers: 0 kB Cached: 4668 kB SwapCached: 0 kB Active: 880 kB Inactive: 4052 kB Active(anon): 288 kB Inactive(anon): 4 kB Active(file): 592 kB Inactive(file): 4048 kB Unevictable: 0 kB Mlocked: 0 kB SwapTotal: 0 kB SwapFree: 0 kB Dirty: 0 kB Writeback: 0 kB AnonPages: 280 kB Mapped: 356 kB Shmem: 28 kB Slab: 2104 kB SReclaimable: 324 kB SUnreclaim: 1780 kB KernelStack: 216 kB PageTables: 60 kB NFS_Unstable: 0 kB Bounce: 0 kB WritebackTmp: 0 kB CommitLimit: 127708 kB Committed_AS: 972 kB VmallocTotal: 778116 kB VmallocUsed: 32 kB VmallocChunk: 778044 kB DirectMap4k: 262144 kB DirectMap4M: 0 kB # Meminfo later after 1016 insmod rmmod cycles # cat /proc/meminfo MemTotal: 255420 kB MemFree: 246208 kB Buffers: 0 kB Cached: 4964 kB SwapCached: 0 kB Active: 1308 kB Inactive: 3924 kB Active(anon): 584 kB Inactive(anon): 8 kB Active(file): 724 kB Inactive(file): 3916 kB Unevictable: 0 kB Mlocked: 0 kB SwapTotal: 0 kB SwapFree: 0 kB Dirty: 0 kB Writeback: 0 kB AnonPages: 280 kB Mapped: 356 kB Shmem: 324 kB Slab: 2960 kB SReclaimable: 324 kB SUnreclaim: 2636 kB KernelStack: 216 kB PageTables: 60 kB NFS_Unstable: 0 kB Bounce: 0 kB WritebackTmp: 0 kB CommitLimit: 127708 kB Committed_AS: 1268 kB VmallocTotal: 778116 kB VmallocUsed: 44 kB VmallocChunk: 765856 kB DirectMap4k: 262144 kB DirectMap4M: 0 kB # The biggest difference of both traces is in: Memory before test after test difference VmallocChunk: 778044 765856 -12188.00 ...after 84850 trys: # cat /proc/meminfo MemTotal: 255420 kB MemFree: 191528 kB Buffers: 0 kB Cached: 5044 kB SwapCached: 0 kB Active: 1400 kB Inactive: 3908 kB Active(anon): 660 kB Inactive(anon): 8 kB Active(file): 740 kB Inactive(file): 3900 kB Unevictable: 0 kB Mlocked: 0 kB SwapTotal: 0 kB SwapFree: 0 kB Dirty: 0 kB Writeback: 0 kB AnonPages: 280 kB Mapped: 356 kB Shmem: 404 kB Slab: 57624 kB SReclaimable: 348 kB SUnreclaim: 57276 kB KernelStack: 208 kB PageTables: 60 kB NFS_Unstable: 0 kB Bounce: 0 kB WritebackTmp: 0 kB CommitLimit: 127708 kB Committed_AS: 1348 kB VmallocTotal: 778116 kB VmallocUsed: 44 kB VmallocChunk: 769376 kB DirectMap4k: 262144 kB DirectMap4M: 0 kB More interesting than VmallocChunk is may be: Slab: 57624 kB SReclaimable: 348 kB SUnreclaim: 57276 kB We might be leaking some memory, which would explain why your free memory shrinks after a while. This would be a second fix. I am a little busy for the weekend, but will start submitting fixes for r6040 next week. Thanks! OK... I have evaluated out the critical MAC_DEF_TIMEOUT value, which is between 2132 and 2140 in my system: 2500 OK 2250 OK 2187 OK 2156 OK 2140 OK 2132 FAIL 2125 FAIL 2048 FAIL The values are not longterm thested just, short tests to evaluate the critical border value. Let me check the datasheet with respect to the actual timings. I think increasing the timing blindly without making it depend on the CPU timebase is not going to work on some other platforms with faster CPU clock rates. It turned out that the datasheet does not give any clue about how fast/slow a MDIO operation should complete, a 1 micro second delay in each loop should be safe though. I have a same problem with 3.12.13-gentoo on ebox3350mx. r6040 0000:00:08.0: failed to register MII bus r6040: probe of 0000:00:08.0 failed with error -5 After reading this topic, i set: #define MAC_DEF_TIMEOUT 4096 and then recompile kernel. And now it's work! |