Latest working kernel version:
Earliest failing kernel version:
Hardware Environment: any
Software Environment: any
Lots of device drivers register a 'struct device_driver' with the '.bus' member
set to '&platform_bus_type'.
This is wrong, since the platform_bus functions expect the 'struct device_driver' to be wrapped up in a 'struct platform_driver' which provides
some additional callbacks (like suspend_late, resume_early).
The effect may be that platform_suspend_late() uses bogus data outside the
device_driver struct as a pointer pointer to the device driver's suspend_late()
function or other hard to reproduce failures.
The following drivers need to be converted to use a platform_driver struct and
the platform_driver_register() function:
Steps to reproduce:
grep -r -B 5 '\.bus.*=.*platform_bus_type' . | grep device_driver
gives a (possibly incomplete) list of drivers that are potentially affected by this bug.
Created attachment 18129 [details]
The attached patch fixes the problem for the PXA2xx Flash driver
Yeah, looks like the platform core does:
to_platform_driver(drv) (container_of((drv), struct platform_driver, driver))
which means it expects the driver core driver to be embedded in a platform driver.
David, these drivers need a fix, right?
Yes, these drivers need fixing.
The problem came with whatever patch changed the platform bus code to rely on such casts for drivers. I noticed it when I added basic PM methods to some drivers -- suspend/resume would oops, long before I added the early/late versions to match that PM iteration -- and fixed drivers on all the platforms I worked with. That was already way too much work to fix any bug I didn't create, so I suppose I ignored the other drivers. :)
The issue is that *originally* this idiom was standard and used throughout the "legacy" bus (later renamed "platform"). Then it became incorrect, I think when suspend/resume handling changed, but that code was still in the tree ... and the idiom was in peoples' minds.
This is present on 2.6.26, 2.6.27, 2.6.28, and current mainline by a quick peek at the code. And it is causing suspend/resume issues, as expected (that's how I found this bug report).
Is anything being done about it?
It would seem nothing is being done about it. This might be a good project for kernel janitors...
Please correct me if I am wrong, but this bug DOES mean we are basically doing:
if (random memory contents)
(random memory contents)(foo);
at suspend and resume, isn't it?
Because if it is that bad, it is time to send a patch marking all those drivers BROKEN, until someone fixes them. And get that patch into -stable for all kernels.
A look at drivers/char/tpm shows me the "fix" might need a complete clean-up.
I wouldn't say "random"; but certainly it's not defined to be usable in that way.
Ok, some of the drivers (the easiest ones) were fixed but the patch is not in mailine yet:
But what about the rest of the drivers? Some of them, like tpm_nsc, are not a trivial fix.
Maybe they should be made to depend on !CONFIG_SUSPEND || BROKEN until they get fixed?
On Saturday 21 February 2009, email@example.com wrote:
> Maybe they should be made to depend on !CONFIG_SUSPEND || BROKEN until they
Sounds like a plan to me.
If this is still seen on modern kernels then please re-open/update