Bug 77121

Summary: Staging ethernet driver et131x started to give oops lately.
Product: Drivers
Reporter: Georgios Tsalikis (georgios)
Component: Staging
Assignee: drivers_staging (drivers_staging)
Status: RESOLVED CODE_FIX
Severity: normal
CC: georgios, levex, mark.einon
Priority: P1
Hardware: i386
OS: Linux
Kernel Version: 3.14.4
Subsystem:
Regression: No
Bisected commit-id:
Attachments: bisection log

Description Georgios Tsalikis 2014-05-29 17:53:06 UTC
I have been warned, but it has worked fine for years! I don't remember which version last worked, probably 3.12 or 3.13. Here is the message:

[    7.525812] et131x: module is from the staging directory, the quality is unknown, you have been warned.
[    7.571152] libphy: et131x_eth_mii: probed
[    7.571236] BUG: unable to handle kernel paging request at ffffffc8
[    7.572105] IP: [<c158aa93>] phy_attach_direct+0x63/0x130
[    7.572105] *pde = 01c71067 *pte = 00000000 
[    7.572105] Oops: 0000 [#1] PREEMPT SMP 
[    7.572105] Modules linked in: et131x(C+) psmouse thermal video nfs lockd sunrpc fscache
[    7.572105] CPU: 0 PID: 166 Comm: systemd-udevd Tainted: G         C   3.14.4-1-independent #1
[    7.572105] Hardware name: LG Electronics LG/ROCKY, BIOS RKYWSF33 09/19/2007
[    7.572105] task: c263c980 ti: f30ba000 task.ti: f30ba000
[    7.572105] EIP: 0060:[<c158aa93>] EFLAGS: 00010202 CPU: 0
[    7.572105] EIP is at phy_attach_direct+0x63/0x130
[    7.572105] EAX: 00000000 EBX: f5313000 ECX: f5313008 EDX: 00000006
[    7.572105] ESI: fffffffb EDI: f3123000 EBP: f30bbc70 ESP: f30bbc54
[    7.572105]  DS: 007b ES: 007b FS: 00d8 GS: 00e0 SS: 0068
[    7.572105] CR0: 80050033 CR2: ffffffc8 CR3: 33a58000 CR4: 000007d0
[    7.572105] Stack:
[    7.572105]  f5313008 f30bbc78 00000000 f5313008 f5313000 f8a5b4e0 f8a5b4e0 f30bbc84
[    7.572105]  c158ab88 00000001 f5313000 f3123000 f30bbca0 c158ac20 00000001 0000001f
[    7.572105]  00000000 f3123500 f3123000 f30bbcf0 f8a5ab93 00000001 00000011 f8a5c511
[    7.572105] Call Trace:
[    7.572105]  [<f8a5b4e0>] ? et131x_set_mac_addr+0x240/0x240 [et131x]
[    7.572105]  [<f8a5b4e0>] ? et131x_set_mac_addr+0x240/0x240 [et131x]
[    7.572105]  [<c158ab88>] phy_connect_direct+0x28/0x80
[    7.572105]  [<c158ac20>] phy_connect+0x40/0x70
[    7.572105]  [<f8a5ab93>] et131x_pci_setup+0x783/0xab0 [et131x]
[    7.572105]  [<c12270d8>] ? kernfs_create_link+0x68/0xb0
[    7.572105]  [<c1539a96>] ? __pm_runtime_resume+0x46/0x60
[    7.572105]  [<c1387977>] pci_device_probe+0x77/0xe0
[    7.572105]  [<c1223c35>] ? sysfs_create_link+0x25/0x50
[    7.572105]  [<c152dbfa>] driver_probe_device+0x8a/0x3c0
[    7.572105]  [<c122391d>] ? sysfs_create_dir_ns+0x3d/0xa0
[    7.572105]  [<c1387332>] ? pci_match_device+0xb2/0xc0
[    7.572105]  [<c152e011>] __driver_attach+0x91/0xa0
[    7.572105]  [<c152df80>] ? __device_attach+0x50/0x50
[    7.572105]  [<c152bd47>] bus_for_each_dev+0x57/0xa0
[    7.572105]  [<c152d66e>] driver_attach+0x1e/0x20
[    7.572105]  [<c152df80>] ? __device_attach+0x50/0x50
[    7.572105]  [<c152d247>] bus_add_driver+0x157/0x250
[    7.572105]  [<c152e65d>] driver_register+0x5d/0xf0
[    7.572105]  [<f8a5f000>] ? 0xf8a5efff
[    7.572105]  [<c1387442>] __pci_register_driver+0x32/0x40
[    7.572105]  [<f8a5f017>] et131x_driver_init+0x17/0x19 [et131x]
[    7.572105]  [<c10004aa>] do_one_initcall+0xda/0x1b0
[    7.572105]  [<c111ef7c>] ? tracepoint_module_notify+0xac/0x180
[    7.572105]  [<c111f02c>] ? tracepoint_module_notify+0x15c/0x180
[    7.572105]  [<c188b708>] ? notifier_call_chain+0x48/0x60
[    7.572105]  [<c10bc088>] ? __blocking_notifier_call_chain+0x48/0x80
[    7.572105]  [<c110abd1>] load_module+0x1c01/0x25d0
[    7.572105]  [<c11088b9>] ? copy_module_from_fd.isra.43+0x119/0x1b0
[    7.572105]  [<c110b760>] SyS_finit_module+0xa0/0xd0
[    7.572105]  [<c1177ae3>] ? vm_mmap_pgoff+0xa3/0xc0
[    7.572105]  [<c188f891>] sysenter_do_call+0x12/0x2c
[    7.572105] Code: 78 04 00 00 89 8b a8 01 00 00 89 83 ac 01 00 00 b8 02 00 00 00 89 83 a4 01 00 00 89 d8 e8 76 f9 ff ff 85 c0 89 c6 75 30 8b 43 58 <8b> 50 c8 85 d2 74 16 89 d8 ff d2 89 f0 8b 5d f4 8b 75 f8 8b 7d
[    7.572105] EIP: [<c158aa93>] phy_attach_direct+0x63/0x130 SS:ESP 0068:f30bbc54
[    7.572105] CR2: 00000000ffffffc8
[    7.572105] ---[ end trace 2f707285a3cb06a8 ]---
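
A note on reading the oops above: the faulting address looks like a small negative offset applied to a NULL pointer. The bytes starting at the marked <8b> in the Code line are 8b 50 c8, which on 32-bit x86 decodes as a load into EDX from EAX plus a signed 8-bit displacement of 0xc8, and the register dump shows EAX as 00000000. The sketch below only redoes that arithmetic; the helper names are made up for illustration and are not kernel functions.

```python
def sign_extend_8(byte):
    """Interpret an 8-bit value as a signed two's-complement displacement."""
    return byte - 0x100 if byte & 0x80 else byte


def effective_address_32(base, disp8):
    """Base register plus signed 8-bit displacement, wrapped to 32 bits."""
    return (base + sign_extend_8(disp8)) & 0xFFFFFFFF


eax = 0x00000000   # EAX from the register dump in the oops
disp = 0xC8        # third byte of the faulting instruction 8b 50 c8

# Compare the printed value with the CR2 line of the oops.
print(hex(effective_address_32(eax, disp)))
```

A displacement of 0xc8 is -0x38, so the CPU tried to read 0x38 bytes below address zero, which is what a field access through a structure pointer derived from a NULL pointer would produce.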
Comment 1 Levente Kurusa 2014-05-31 10:22:03 UTC
Hi,

it would be amazing if you could bisect it!

Here is a link that can help you:
http://landley.net/writing/git-bisect-howto.html

Let me know if you need any help doing it. (I can't do this myself
as I don't have this hardware sadly...)

Thanks
Levente Kurusa
Comment 2 Georgios Tsalikis 2014-06-03 16:20:06 UTC
Hi and thanks for opening my eyes! This bisection feature is quite amazing.
This was the final message. I will attach the log too. 


87aa9f9c61ad56d505641681812e92ad976f8608 is the first bad commit
commit 87aa9f9c61ad56d505641681812e92ad976f8608
Author: Florian Fainelli <f.fainelli@gmail.com>
Date:   Fri Dec 6 13:01:34 2013 -0800

    net: phy: consolidate PHY reset in phy_init_hw()
    
    There are quite a lot of drivers touching a PHY device MII_BMCR
    register to reset the PHY without taking care of:
    
    1) ensuring that BMCR_RESET is cleared after a given timeout
    2) the PHY state machine resuming to the proper state and re-applying
    potentially changed settings such as auto-negotiation
    
    Introduce phy_poll_reset() which will take care of polling the MII_BMCR
    for the BMCR_RESET bit to be cleared after a given timeout or return a
    timeout error code.
    
    In order to make sure the PHY is in a correct state, phy_init_hw() first
    issues a software reset through MII_BMCR and then applies any fixups.
    
    Signed-off-by: Florian Fainelli <f.fainelli@gmail.com>
    Signed-off-by: David S. Miller <davem@davemloft.net>

:040000 040000 8d43169b3a6a69f554472207759087dd0b6ca745 ee772cef693ce1768b41b89e4533644f2f7379d1 M      Documentation
:040000 040000 ec3ad443ef88819d0f58ceeb28df5dd4cb0b6299 eaeae7ce17856172e2473a4efe0382011663f13e M      drivers
Comment 3 Georgios Tsalikis 2014-06-03 16:27:15 UTC
Created attachment 137951 [details]
bisection log
Comment 4 Levente Kurusa 2014-06-14 14:29:56 UTC
Hi,

can you please try booting the latest mainline kernel with this driver built in?
Set 'Device Drivers->Staging Drivers->Agere ET1310 Gigabit...' to '*' (built-in).

Try to boot the kernel and supply a dmesg please.

It would also be awesome if you could set 'Kernel hacking -> Compile-time checks... -> Compile kernel with debugging info' to '*'. Then you could run
'gdb vmlinux' to open GDB and type 'list *<EIP from the oops, without the surrounding < > symbols but with 0x prefixed>' to see where the oops happens.
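
As a rough illustration of that last step, the GDB command can be built from the EIP line of the oops. The function name below is hypothetical, and the resulting address is only meaningful against a vmlinux from the same build as the kernel that produced the oops.

```python
import re


def gdb_list_command(oops_line):
    """Build a gdb 'list' command from an oops line containing [<address>]."""
    match = re.search(r"\[<([0-9a-fA-F]+)>\]", oops_line)
    if match is None:
        raise ValueError("no [<address>] found in line")
    # Drop the < > symbols and prefix the hex address with 0x.
    return "list *0x" + match.group(1)


line = ("[    7.572105] EIP: [<c158aa93>] "
        "phy_attach_direct+0x63/0x130 SS:ESP 0068:f30bbc54")
print(gdb_list_command(line))
```

The printed command is what you would type at the (gdb) prompt after running 'gdb vmlinux'.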

Thanks
Comment 5 Mark Einon 2014-07-20 20:05:13 UTC
*** Bug 80751 has been marked as a duplicate of this bug. ***
Comment 6 Mark Einon 2014-08-02 15:14:05 UTC
The crash has been fixed by commit b394745df2d9, but the cause (presumably phy_init_hw() failing) is still an issue.

------------
commit b394745df2d9d4c30bf1bcc55773bec6f3bc7c67
Author: Guenter Roeck <linux@roeck-us.net>
Date:   Wed May 14 13:12:49 2014 -0700

    net: phy: Don't call phy_resume if phy_init_hw failed
    
    After the call to phy_init_hw failed in phy_attach_direct, phy_detach is called
    to detach the phy device from its network device. If the attached driver is a
    generic phy driver, this also detaches the driver. Subsequently phy_resume
    is called, which assumes without checking that a driver is attached to the
    device. This will result in a crash such as
    
    Unable to handle kernel paging request for data at address 0xffffffffffffff90
    Faulting instruction address: 0xc0000000003a0e18
    Oops: Kernel access of bad area, sig: 11 [#1]
    ...
    NIP [c0000000003a0e18] .phy_attach_direct+0x68/0x17c
    LR [c0000000003a0e6c] .phy_attach_direct+0xbc/0x17c
    Call Trace:
    [c0000003fc0475d0] [c0000000003a0e6c] .phy_attach_direct+0xbc/0x17c (unreliable)
    [c0000003fc047670] [c0000000003a0ff8] .phy_connect_direct+0x28/0x98
    [c0000003fc047700] [c0000000003f0074] .of_phy_connect+0x4c/0xa4
    
    Only call phy_resume if phy_init_hw was successful.
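
The control flow described in that commit message can be modelled with a toy example. This is a simplified Python sketch, not the kernel code: the ToyPhy and ToyDriver classes, the attach_unfixed/attach_fixed names and the -5 error value are all assumptions made for illustration, and an AttributeError on None stands in for the kernel's bad pointer dereference.

```python
class ToyDriver:
    def resume(self, phydev):
        phydev.resumed = True


class ToyPhy:
    def __init__(self, init_ok):
        self.driver = ToyDriver()  # stands in for the generic PHY driver
        self.init_ok = init_ok     # whether the modelled phy_init_hw succeeds
        self.resumed = False


def phy_init_hw(phydev):
    return 0 if phydev.init_ok else -5  # -5 is an arbitrary error code


def phy_detach(phydev):
    phydev.driver = None  # detaching also drops the generic driver


def phy_resume(phydev):
    phydev.driver.resume(phydev)  # assumes a driver is attached, no check


def attach_unfixed(phydev):
    err = phy_init_hw(phydev)
    if err:
        phy_detach(phydev)
    phy_resume(phydev)  # runs even after a failed init and detach
    return err


def attach_fixed(phydev):
    err = phy_init_hw(phydev)
    if err:
        phy_detach(phydev)
    else:
        phy_resume(phydev)  # only resume when init succeeded
    return err


try:
    attach_unfixed(ToyPhy(init_ok=False))
except AttributeError as exc:
    print("unfixed path fails:", exc)

print("fixed path returns:", attach_fixed(ToyPhy(init_ok=False)))
```

In this model the fixed path still returns the error to the caller, which matches the remark above that the crash is gone while the underlying init failure remains.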
Comment 7 Mark Einon 2014-08-06 18:14:00 UTC
Fix patch posted at http://marc.info/?l=linux-netdev&m=140727945201126&w=2
Comment 8 Mark Einon 2014-09-11 11:42:15 UTC
Fix now in mainline, v3.17-rc3.