Bug 9308
Summary: | forcedeth causes 7 hour boot delay | ||
---|---|---|---|
Product: | Drivers | Reporter: | Alex Howells (astinus) |
Component: | Network | Assignee: | Ayaz Abdulla (aabdulla) |
Status: | RESOLVED CODE_FIX | ||
Severity: | blocking | CC: | aabdulla, akpm, kernel, manfred, mingo |
Priority: | P1 | ||
Hardware: | All | ||
OS: | Linux | ||
Kernel Version: | 2.6.24-rc1 | Subsystem: | |
Regression: | --- | Bisected commit-id: | |
Attachments: | This patch removes the outer loop in mgmt unit detection code. |
Description
Alex Howells
2007-11-05 09:43:10 UTC
Original downstream report: https://bugs.gentoo.org/show_bug.cgi?id=197561

There's a codepath in the forcedeth probe routines that can potentially delay boot up to 7 hours.

Inside nv_probe:

    if (id->driver_data & DEV_HAS_MGMT_UNIT) {
        [...]
        for (i = 0; i < 5000; i++) {
            [...]
            if (nv_mgmt_acquire_sema(dev)) {

So, nv_mgmt_acquire_sema() may be called 5000 times.

Inside nv_mgmt_acquire_sema():

    for (i = 0; i < 10; i++) {
        [...]
        msleep(500);
    }

So, nv_mgmt_acquire_sema() may sleep for up to 5 seconds.

5 seconds * 5000 = almost 7 hours

Alex's hardware hits this exact code path: nv_mgmt_acquire_sema() never manages to acquire the semaphore, so is called 5000 times, and boot is delayed.

We tried disabling the loop, i.e.

    for (i = 0; i < 5000; i++) {

becomes:

    for (i = 0; i < 0; i++) {

In this case, the system boots as normal (no delay) and networking works fine.

rofl. This looks pretty simple to fix.

I'm not sure I see the architectural purpose to the loop if after running 5000 times and delaying my boot by seven hours, it fails through anyway without even a warning *and* my network interface works fine .... :)

Unfortunately a fix more complex than simply removing the loop is beyond me! Thus far I've not noticed any anomalous behaviour when the code within the loop isn't being run, but I've only been using it that way for 2-3 hours on <=4 boxes!

I'm more than happy to tinker and test any patches you guys come up with.

Andrew, well, the question still remains, why did forcedeth on Alex's hardware fail to acquire this hardware-based semaphore even after 7 hours? I hope someone with more knowledge of the hardware can comment. But yes, I agree that even if Alex didn't have this system in this strange situation, it should be regarded as a bug anyway that we have a codepath that can take 7 hours. :)

Thanks for catching this. The IPMI firmware is holding the semaphore. This is most likely a bug in the firmware. You can try to upgrade the firmware by contacting Supermicro.
But, from a driver point of view, yes, we need to reduce the timeout :) The extra outer loop is not needed.

Created attachment 13490
This patch removes the outer loop in mgmt unit detection code.

Ayaz, will you push this to Linus? It's still not upstream AFAICS.

I just sent the patch to netdev and stable kernel list.
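
For reference, the worst-case arithmetic in this report can be sketched in C. This is only an illustrative model, not driver code: the constants are the loop counts and sleep interval quoted in the description, while the macro names and the worst_case_ms() helper are hypothetical names made up for this sketch. It assumes the semaphore is never acquired, as on Alex's hardware, and that removing the outer loop leaves a single call to nv_mgmt_acquire_sema().

```c
/* Values quoted in the bug description. The macro names are hypothetical. */
#define OUTER_RETRIES 5000L /* outer loop in nv_probe */
#define SEMA_ATTEMPTS 10L   /* loop inside nv_mgmt_acquire_sema() */
#define SLEEP_MS      500L  /* msleep(500) per failed attempt */

/*
 * Hypothetical helper: total time slept, in milliseconds, if the
 * semaphore is never acquired and nv_mgmt_acquire_sema() is called
 * outer_iterations times.
 */
static long worst_case_ms(long outer_iterations)
{
    return outer_iterations * SEMA_ATTEMPTS * SLEEP_MS;
}
```

With the outer loop, worst_case_ms(OUTER_RETRIES) is 25,000,000 ms (25000 seconds, just under 7 hours). With the outer loop removed, worst_case_ms(1) is 5000 ms, i.e. the 5 seconds of a single nv_mgmt_acquire_sema() call.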