Bug 9905
Description
Leann Ogasawara
2008-02-06 14:27:09 UTC
I'm the original Ubuntu reporter. We sent this bug to the kernel mailing list several months back but didn't get any response. Let me know if I can help test any proposed fixes - I have git installed, etc. This is new. Have you tested running without X and checking if there's an oops when the system locks up? The next step would be to enable MMC_DEBUG and see what the final output lines are. Include "debug" as a parameter to the kernel and it should dump everything to the console. Also make sure you inform both your laptop vendor and O2 Micro that you are seeing problems because they do not bother to cooperate with the Linux community. When I escape out of X, and run "sudo modprobe sdhci", all I get is a single line saying: sdhci:slo0: Unknown controller version (16). You may experience problems. There's no oops. And then the system requires a hard reboot, SysRq doesn't work. I think I tried MMC_DEBUG on one of the earlier kernels, and didn't get any output, but I'll try again and report back one way or another in a little while. It turns out I still had CONFIG_MMC_DEBUG in my .config for that build. I added debug to the end of the kernel line in grub and booted up again, escaped out of X to a console, and did modprobe sdhci as root, but the output is identical (different time stamp at the beginning of the line is the only difference). If you have any other ideas for debugging, I'd be glad to try them. I did at some point try adding debug prints into the SDHCI and MMC modules to try and do some debugging myself, but I didn't get anywhere and probably didn't really understand all that was going on (I'm not a device driver programmer)... Sorry, I omitted a character in the output line (had to type it in from looking at the other computer screen): [ 90.0464545] sdhci:slot0: Unknown controller version (16). You may experience problems. Annoying. Just to make sure we actually got all the data, you could make one last go after doing "echo 9 > /proc/sysrq-trigger". 
The warning about unknown controller version is harmless right now. I don't have any really good debugging ideas right now. Adding printks are probably the only next step. We need to figure out where it hangs. Since you don't get any more output, it should hang somewhere in the sdhci_probe() function. So start by littering that with printks. Actually, I've done that. When I put all the debugging print statements in before, as I recall (it was 6 months or so ago), I was able to see all of them through the last line of the sdhci_probe function. They were just simple print statements, like "got to this line", but they all printed out up until that function was returning. That was when I stopped debugging, because I had very little idea which functions might be the issue then. Anyway, if you want to send me a C file with tons of debugging statements, I'd be glad to run it and send a photo of the screen when it locks up. Since the computer ends up totally locked up, it's hard to capture the text... Alternatively, I can try putting in some debug prints in the sdhci_probe function again and see if I can be more specific about "The print statement here did print out before it locked up" (line numbers, etc.). Just let me know -- I'm happy to spend some time on this, especially if it means the problem might get solved and I could use the card reader, boot live CDs, etc. on this laptop. :) I'm always short on time, so the more self-sufficient you can be, the faster things will get resolved. :) It should not be possible for sdhci_probe() to finish without some debug messages from the MMC core. Are you sure you had MMC_DEBUG enabled and debug messages were sent to the console when you tested this? I am certain that the MMC_DEBUG kernel option was turned on in my .config when I compiled the kernel, since it is there now, and I haven't changed that option through several kernel versions at this point. I compiled 2.6.24 on Jan 26. 
I am also certain that I booted up with "debug" (no quotes) added to the end of the kernel line in my grub record when I did the most recent test. I am also certain that when I did "modprobe sdhci" at the console (after booting up with debug and escaping out of X with ctrl-alt-f2 and logging in to a TTY), I got only the line saying "Unknown controller version". When I do "modprobe -n -v sdhci", it tells me it will load both mmc_core and sdhci, by the way (I've blacklisted "sdhci" for normal use so that I can boot up). Any ideas? I'm not a kernel expert, so I probably did something wrong there... is there any other kernel option I need to turn on to get output in general, or bootup option I need, or option to modprobe? Hopefully I'm not just being dense. Anyway, although I am not a kernel or device driver expert, I am a C programmer, so I can certainly add the print statements in again (haven't tried this since the 2.6.20 kernel I think) and see what happens. I will probably have time to do that later this week (have to do some work for my pesky clients first). OK, I had some time, so I did at least some rudimentary debugging -- I put a bunch of print statements in the sdhci.c file -- several in the sdhci_probe and sdhci_probe_slot functions, and then one at the beginning and end of each function in the file (I'll attach the file I used shortly). I also changed a bunch of the INFO and WARN level prints in the file to ERR (to make sure I would see them), and in the kernel.h file, changed the debug print macro to use ERR level too (I'm still not sure why I was unable to see WARN level printing in my previous testing, but this took care of that issue!) Then I compiled, and put the new mmc_core.ko and sdhci.ko modules into my kernel location (those appeared to be the only two modules loaded when I did modprobe -v -n sdhci). 
Here's what happened: Run 1: With all the debug statements enabled, I did "modprobe -v sdhci" at the console and captured the beginning of the output in kernel.log before the computer locked up, and took a photo of the screen after it locked up, but there was a gap in the middle and I couldn't tell whether or not it had definitely finished sdhci_probe before locking. I'll attach the lines from the log and the screen shot (literally a photo of the screen). (I realize it may be possible to do something with a serial connection and get the full text output, but this laptop doesn't have a serial port, so I'm not sure, and anyway I didn't do that.) The last things it did were MMC saying "starting CMD0 arg 00000000 flags 000000c0", which set off a bunch of stuff in the sdhci module, finishing with a call to sdhci_tasklet_finish and then sdhci_set_ios, which both completed, and then I got a linux prompt back and it locked up. Run 2: I commented out some of the debug statements that were clogging the screen at the end of run 1 and recompiled, this time just copying over the sdhci.ko module (since that's all I had changed). Again, I did "modprobe -v sdchi" at the console. This time there wasn't much in the kernel log, but I did verify in the screen shot (which I'll attach) that the sdhci_probe function finished completely before it locked up (the print statement at the end of sdhci_probe was the last output before the linux prompt came back, and then there was some more console output that looks like 3 calls to set_ios associated with some MMC "clock" output, then the above-mentioned MMC "starting CMD0 arg 00000000 flags 000000c0", with the same result (locked up after getting a linux prompt back). I'll leave the interpretation to you, since I don't really know anything about device drivers in general or MMC/SDHCI in particular... the module definitely is getting through modprobe and then starting on its regular business before it locks up, though. 
If you can suggest any additional debugging (e.g. attach a new sdhci.c file to try with more print statements, or some suggestions of where I might put some more that would give you more information), please let me know... I think I've done about all I can. I'll attach the files now so you can see what I saw.

Created attachment 14866 [details]
Version of sdhci.c used in "Run 1" in Comment #9
This is sdhci.c with a lot of print statements and some WARN/INFO statements changed to ERR, to get a lot of debugging output. See Comment #9 of Bug 9905 for details.

Created attachment 14867 [details]
Version of sdhci.c used in Run #2 in Comment #9
This is sdhci.c with fewer print statements and some WARN/INFO statements changed to ERR, to get a little bit less debugging output. See Comment #9 of Bug 9905 for details.

Created attachment 14868 [details]
Kernel log of modprobe run 1 in Comment #9 of Bug 9905
These are the lines created in the kernel log before the computer locked up, in run 1 described in Comment #9 of Bug 9905 -- see also the screen shot I'm about to attach.

Created attachment 14869 [details]
Screen shot (photo) of run 1 in Comment #9 of Bug 9905

Created attachment 14870 [details]
Screen shot (photo) of run 2 in Comment #9 of Bug 9905

In the first photo, it has indeed finished probing. So the fact that it locks up there is extremely odd as the driver will no longer be poking the hardware.

The second photo seems to be incorrect. In the text you say that you get the prompt back, but there is no prompt in the photo. So it's difficult to make any qualified guesses from that.

Since there is some delay to the hang, we need to start taking pieces out and see what it is that provokes the hang. As you do not have a card in the slot, the driver will not send any actual command requests to the controller. But there is a lot of other fiddling going on. Try disabling each of the following and see which one makes the problem go away:

1. LED control. Modify sdhci_activate_led() and sdhci_deactivate_led() to just return early.
2. Reset on each finished request. Comment out the big if-clause in sdhci_tasklet_finish().
3. Assorted hardware fiddling in sdhci_set_ios(). Individually disable each of the different sections by commenting them out. (It's sufficient to remove the write calls; just don't forget sdhci_set_clock().)

There is also another way you might pinpoint the problem. You can configure the kernel to add a delay after each printk(). You can find that setting under the debugging options in the kernel config.

In the second photo, the prompt is up in the middle of the screen; you are right there is not one at the end. The photo is not "incorrect" -- it's a photo of my actual screen when it locked up on the second run.

Anyway, I will try your debugging suggestions and report back, in the next few days. Thanks for being interested in solving this!

By the way, I do not think there is an LED associated with this drive on my laptop. At least, there is none in evidence near the drive.

I have some more information for you. I tried your suggestions from Comment #15, and nothing worked -- the laptop still locked up and I didn't get any more information about where.

So I went even more drastic: in *every* function inside sdhci.c (except the sdhci_probe and sdhci_probe_slot functions), I put a return right at the top. And it still locked up. So I am pretty sure the locking up is happening in the mmc_core module, rather than actually happening inside an sdhci.c function.

Just loading mmc_core with modprobe doesn't cause the lockup, though. You have to load mmc_core and run the modprobe sdhci functionality (which is all I currently have enabled in sdhci) for it to lock up.

So, do you have any debugging hints for mmc_core? I'll try blindly commenting out stuff, adding debug statements, etc. in the meantime, and I'll let you know if I find anything.
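Editorial sketch: the "disable one piece at a time" approach suggested above (LED control, the reset after each finished request, and the individual sections of set_ios) could be driven by a skip mask instead of repeated hand edits. The following is a user-space model only, not driver code. Every name in it (SKIP_LED, fake_host, fake_write, run_sequence and so on) is hypothetical, and the fake write only counts calls rather than touching hardware.

```c
#include <assert.h>
#include <stdio.h>

/* Hypothetical skip flags, one per experiment in the list above. */
enum {
    SKIP_LED       = 1u << 0, /* 1. LED control */
    SKIP_RESET     = 1u << 1, /* 2. reset after each finished request */
    SKIP_IOS_POWER = 1u << 2, /* 3. power section of set_ios */
    SKIP_IOS_CLOCK = 1u << 3  /* 3. clock section of set_ios */
};

/* Stand-in for the controller: counts writes instead of poking hardware. */
struct fake_host {
    unsigned skip_mask;
    unsigned writes;
};

static void fake_write(struct fake_host *h, const char *what)
{
    h->writes++;
    printf("write: %s\n", what);
}

/* Models "return early" from the LED helper when the flag is set. */
static void activate_led(struct fake_host *h)
{
    if (h->skip_mask & SKIP_LED)
        return;
    fake_write(h, "led on");
}

/* Models commenting out the reset in the finish path. */
static void tasklet_finish(struct fake_host *h)
{
    if (!(h->skip_mask & SKIP_RESET))
        fake_write(h, "reset");
}

/* Models disabling the sections of set_ios one by one. */
static void set_ios(struct fake_host *h)
{
    if (!(h->skip_mask & SKIP_IOS_POWER))
        fake_write(h, "power");
    if (!(h->skip_mask & SKIP_IOS_CLOCK))
        fake_write(h, "clock");
}

/* Run the whole sequence once and report how many writes happened. */
static unsigned run_sequence(unsigned skip_mask)
{
    struct fake_host h = { .skip_mask = skip_mask, .writes = 0 };

    activate_led(&h);
    tasklet_finish(&h);
    set_ios(&h);
    return h.writes;
}
```

Calling run_sequence(SKIP_LED) runs everything except the LED write, run_sequence(SKIP_LED | SKIP_RESET) drops two pieces, and so on. In the real driver a mask like this might be exposed as a module parameter so each experiment needs a reload rather than a rebuild, but that wiring is an assumption and is not shown here.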
(In reply to comment #16)
> By the way, I do not think there is an LED associated with this drive on my
> laptop. At least, there is none in evidence near the drive.

That's quite common, but the driver cannot determine this. That's why I wanted you to test removing that code in case they've done some stupid wiring that causes a hang when the LED pin is used.

(In reply to comment #17)
> So I went even more drastic: in *every* function inside sdhci.c (except the
> sdhci_probe and sdhci_probe_slot functions), I put a return right at the top.
> And it still locked up. So I am pretty sure the locking up is happening in
> the mmc_core module, rather than actually happening inside an sdhci.c
> function.

That's extremely odd. It's just sdhci that pokes hardware, so the mmc core shouldn't be able to lock up your system (unless it corrupts memory). Since you've still got the probing in there my first guess is that the init sequence is enough to kill the machine. Try the following:

1. Avoid registering the device with the mmc layer. Comment out mmc_add_host() and mmc_remove_host() in sdhci.c.
2. If the above still hangs, try also modifying sdhci_probe() to believe that sdhci_probe_slot() failed. The effect should be that it powers up the chip, then directly powers it down again.

Well, those two suggestions didn't quite work, but I was able to make the problem go away by putting return -ENODEV; right at the top of sdhci_probe_slot. I will try a "binary search" method (move it down to the middle of that function, then the middle of the non-functional section, etc.) to see if I can narrow down the exact offending line. But I guess it's somewhere in there: if sdhci probes the slot fully, even ignoring the successful return value, the computer hangs afterwards. Some progress... maybe we'll track this down after all?

So if it still locks up even using method 2, it means that just activating the hardware for a brief period kills the machine.
Pinpointing the problem in the probe routine sounds good, yes. Be aware that a simple return will leak memory and all kinds of resources. Try using gotos to the cleanup portion at the end of sdhci_probe_slot().

Thanks, I figured that out. :) I actually have it narrowed down now to the debug routine sdhci_dumpregs that prints all the registers, which is called at the end of sdhci_probe_slot if you have the MMC_DEBUG config option turned on (which I do). With my current (almost completely commented out) version of sdhci.c, which is only basically doing sdhci_probe_slot and then returning, if I remove that call to sdhci_dumpregs, the computer no longer hangs after modprobe. It was an exciting moment, not to have to do a hard reboot!

So I'm now trying to pinpoint which exact memory read(s) is/are causing the trouble. The host->ioaddr address doesn't look suspicious, and leaving in the earlier read in sdhci_probe_slot that read the version information does not cause the computer to hang. Presumably, in the non-MMC-debug version of the kernel, similar reads/writes in the non-debug parts of the sdhci routines are causing the problems... ?? Anyway, I'll see if I can narrow it down to a particular address or addresses and then give you a full report.

By the way, before hanging, the printout is giving me 0x00000000 for a lot of the results in that register dump. Is that normal? Could be partly a result of all the routines I have commented out (if the sdhci module normally would be setting those registers), but it seemed a bit suspicious to me.

OK. I haven't tested all the calls in sdhci_dumpregs, but I've tested a bunch of them.
Here are some results:

These reads are OK -- leave these un-commented and the computer doesn't hang after my current attenuated modprobe:

readw(host->ioaddr + SDHCI_HOST_VERSION)
readw(host->ioaddr + SDHCI_BLOCK_SIZE)
readl(host->ioaddr + SDHCI_ARGUMENT)
readl(host->ioaddr + SDHCI_CAPABILITIES)

These reads are not OK -- uncommenting any of these reads will cause the computer to hang after modprobe:

readl(host->ioaddr + SDHCI_DMA_ADDRESS)
readw(host->ioaddr + SDHCI_BLOCK_COUNT)
readw(host->ioaddr + SDHCI_TRANSFER_MODE)

Any thoughts now? This doesn't look all that good...

(In reply to comment #21)
> Presumably, in the non-MMC-debug version of the kernel, similar reads/writes
> in the non-debug parts of the sdhci routines are causing the problems... ??

Yes. All of those registers are heavily used during normal operation.

(In reply to comment #22)
> By the way, before hanging, the printout is giving me 0x00000000 for a lot of
> the results in that register dump. Is that normal? Could be partly a result
> of all the routines I have commented out (if the sdhci module normally would
> be setting those registers), but it seemed a bit suspicious to me.

It's quite normal for those registers to have value 0. Most things are designed to have 0 be the default value.

(In reply to comment #23)
> These reads are not OK -- uncommenting any of these reads will cause the
> computer to hang after modprobe:
>
> readl(host->ioaddr + SDHCI_DMA_ADDRESS)
> readw(host->ioaddr + SDHCI_BLOCK_COUNT)
> readw(host->ioaddr + SDHCI_TRANSFER_MODE)
>
> Any thoughts now? This doesn't look all that good...

This is of course extremely odd. But all of those can be avoided without affecting functionality. So the next step would be to remove the reads of those registers and see if you can get a working controller. There are two reads for SDHCI_DMA_ADDRESS, two for SDHCI_BLOCK_COUNT and one for SDHCI_TRANSFER_MODE. You can just comment them out in all but one of the cases.
For SDHCI_BLOCK_COUNT, you need to compute bytes_xfered as data->blksz * data->blocks for now.

In my last report, I had only tested some of the reads in that sdhci_dumpregs function. Now I have tested all of them, and I found some more reads that cause the machine to hang (actually, about half of them do). So, here's a list of all the reads in the dumpregs function, along with their hex offsets; whether they are doing a readl (L), readw (W), or readb (B); and whether doing that particular read causes the machine to hang or not:

SDHCI_DMA_ADDRESS = 0x00 L (hang)
SDHCI_BLOCK_SIZE = 0x04 W (ok)
SDHCI_BLOCK_COUNT = 0x06 W (hang)
SDHCI_ARGUMENT = 0x08 L (ok)
SDHCI_TRANSFER_MODE = 0x0C W (hang)
SDHCI_PRESENT_STATE = 0x24 L (ok)
SDHCI_HOST_CONTROL = 0x28 B (hang)
SDHCI_POWER_CONTROL = 0x29 B (ok)
SDHCI_BLOCK_GAP_CONTROL = 0x2A B (hang)
SDHCI_WAKE_UP_CONTROL = 0x2B B (ok)
SDHCI_CLOCK_CONTROL = 0x2C W (hang)
SDHCI_TIMEOUT_CONTROL = 0x2E B (ok)
SDHCI_INT_STATUS = 0x30 L (hang)
SDHCI_INT_ENABLE = 0x34 L (ok)
SDHCI_SIGNAL_ENABLE = 0x38 L (hang)
SDHCI_ACMD12_ERR = 0x3C W (ok)
SDHCI_CAPABILITIES = 0x40 L (ok)
SDHCI_MAX_CURRENT = 0x48 L (hang)
SDHCI_SLOT_INT_STATUS = 0xFC W (hang)
SDHCI_HOST_VERSION = 0xFE W (ok)

I am not seeing a pattern there...

Just for completeness, here is the kernel log output for the sdhci_probe_slot function -- I added a line that prints out the value of sdhci->ioaddr, and replaced all the items marked "hang" above with 0 in the register dump function:

sdhci [sdhci_probe_slot()]: slot 0 at 0xffbfe800, irq 19
sdhci [sdhci_probe_slot()]: IOADDR is 0xf8a7e800
sdhci:slot0: Unknown controller version (16). You may experience problems.
sdhci [sdhci_probe_slot()]: Controller doesn't have DMA capability
sdhci: ============== REGISTER DUMP ==============
sdhci: Sys addr: 0x00000000 | Version: 0x00001010
sdhci: Blk size: 0x00000000 | Blk cnt: 0x00000000
sdhci: Argument: 0x00000000 | Trn mode: 0x00000000
sdhci: Present: 0x01fa0000 | Host ctl: 0x00000000
sdhci: Power: 0x00000000 | Blk gap: 0x00000000
sdhci: Wake-up: 0x00000000 | Clock: 0x00000000
sdhci: Timeout: 0x00000000 | Int stat: 0x00000000
sdhci: Int enab: 0x00000000 | Sig enab: 0x00000000
sdhci: AC12 err: 0x00000000 | Slot int: 0x00000000
sdhci: Caps: 0x038021a1 | Max curr: 0x00000000
sdhci: =============================

Let me know if you still think it is worthwhile to comment out all the lines concerning all of these registers. I should have some time next week to try that, if so.

Just one observation/question: host->addr is coming out at 0xffbfe800, and host->ioaddr is coming out at 0xf8a7e800 -- is it normal for host->ioaddr to be *below* host->addr? I don't know much of anything about the SDHCI module... just thought I would point it out in the above output.

(In reply to comment #25)
> I am not seeing a pattern there...

I do. It is very close to every other read being a hang. How did you do this testing? Have you tried leaving a single read at a time?

> Let me know if you still think it is worthwhile to comment out all the lines
> concerning all of these registers. I should have some time next week to try
> that, if so.

It would seem that would be entirely insufficient. With so many registers causing a hang, I suspect there is a more general problem.

(In reply to comment #26)
> Just one observation/question: host->addr is coming out at 0xffbfe800, and
> host->ioaddr is coming out at 0xf8a7e800 -- is it normal for host->ioaddr to
> be *below* host->addr? I don't know much of anything about the SDHCI
> module... just thought I would point it out in the above output.

I'm not familiar with the mapper algorithms, but this result is not very surprising since the physical address is very close to the 4 GB limit.

(In reply to comment #27)
> (In reply to comment #25)
> > I am not seeing a pattern there...
>
> I do. It is very close to every other read being a hang. How did you do this
> testing? Have you tried leaving a single read at a time?

Sorry, I should have explained. I started at the top of the sdhci_dumpregs function, and commented out all of the readl, readw, and readb lines. Then I uncommented one. If the computer didn't hang, I would leave that one uncommented, and then uncomment another one. If it made the computer hang, I would comment it out and uncomment the next one.

So by the end, I had all the ones marked "OK" in the above list uncommented, and all the ones marked "not OK" commented out. So I don't think it's just an "every other read" problem, because I have about half of them running with no problems.

(In reply to comment #28)
> So by the end, I had all the ones marked "OK" in the above list uncommented,
> and all the ones marked "not OK" commented out. So I don't think it's just an
> "every other read" problem, because I have about half of them running with no
> problems.

Still, there is a very uncanny pattern there. Could you do some other variants so that we are 100% sure that it is the actual registers, and not just the access pattern that is the problem?

(In reply to comment #29)
> Still, there is a very uncanny pattern there. Could you do some other
> variants so that we are 100% sure that it is the actual registers, and not
> just the access pattern that is the problem?

Good thought. So, I took my working list of 10 reads (which doesn't hang the machine), and substituted SDHCI_DMA_ADDRESS (which I had marked as not working) in the place of SDHCI_ARGUMENT (which happens somewhere in the middle of the function, and both are readl calls). The machine hung when I did modprobe sdhci.
Just to get one more data point, I then put SDHCI_ARGUMENT back in, and instead put SDHCI_DMA_ADDRESS in the place of SDHCI_INT_ENABLE. This time it didn't hang. So I guess you are right, that it's something else, not those particular addresses. Still, it's not as simple as "every other read", because I have a list of 10 reads enabled, and substituting SDHCI_DMA_ADDRESS in at one position in the list is OK, and in another one it isn't; with the original list it's OK too. I'm more perplexed than before.... I share that sentiment. It might be something on the PCI level that is misbehaving. Unfortunately, that's a bit outside of my expertise. Could you try some different combinations and see if you can figure out exactly what makes things hang? I'll see if I can get someone else to also have a look at this. I don't really know what to try next... it just seems random to me, and each test of some configuration takes quite a while to perform (change code, re-compile module, try modprobe, reboot -- even if the machine doesn't hang, I have to reboot to retry, because rmmod/modprobe after a "successful" modprobe doesn't follow the same code path). I don't really understand why *reading* from a particular address could cause the whole machine to hang in the first place. Do you? (In reply to comment #32) > I don't really know what to try next... it just seems random to me, and each > test of some configuration takes quite a while to perform (change code, > re-compile module, try modprobe, reboot -- even if the machine doesn't hang, > I > have to reboot to retry, because rmmod/modprobe after a "successful" modprobe > doesn't follow the same code path). Sorry about that, but the problem is so completely weird that debugging is more or less wild guessing and then testing those guesses. We can hold off a bit until some PCI expert can have a look though. 
> > I don't really understand why *reading* from a particular address could cause > the whole machine to hang in the first place. Do you? > Not the slightest. Greg, could you have a look at this? Looks to me more like some low level PCI problem, than a driver bug. Jesse, could you please have a look at this bug now that you're the new PCI maintainer? This looks like an ugly one... So we have some sequence of MMIO reads (readl etc.) that can cause the machine to hang? Normally reads that don't complete should result in a PCI master abort, which should give you all 1s in the read result (0xffffffff or whatever you register size was); the fact that it hangs in this case is strange. Can you attach the output of 'lspci -vv' from your machine, along with the boot log? It may be that you have a hardware problem, or that one of the PCI bridges on the way out to this device is misconfigured somehow. Or there could be an overlap between what the sdhci device decodes and some other device... My Averatec has the same problem. I have seen that windows reconfigure the memory area of the sdhci and the 8139c so i also think there is a configuration problem. http://bugzilla.kernel.org/show_bug.cgi?id=10231 Created attachment 16294 [details]
Output of lspci -vv
Created attachment 16295 [details]
Boot log
Hopefully this is what you mean by the "boot log"? I am attaching the output of dmesg after rebooting. I'm currently running the Ubuntu kernel that came with 8.04, which is a 2.6.24 derivative. The sdhci module is, of course, blacklisted (since loading it hangs the system).
If there's anything else I can do to help debug this, or any other output you need, let me know... As you can probably tell from above, I am not afraid to try things and make the machine hang... :)

Reassigning to Jesse so he doesn't forget about this bug. ;)

Ah, and I *had* forgotten about it; ignorance is bliss. :) I didn't see Arne's update in #37 and Jennifer's subsequent update of #10231. It sounds like the 8139 probably decodes a range that overlaps with the SD controller. If they both respond to MMIO reads the bus could hang, freezing the machine. I wonder if we could add a quirk to increase the MMIO resource size for 8139? Question is, how big should it be? Hm, let's see if the realtek windows driver release notes have anything useful...

Nope. But working on the assumption that the size is the problem (but assuming 8139 isn't *totally* broken and doesn't decode everything on the bus) we can try increasing the size. Something like this might work (you may have to correct the device or vendor ID if I got it wrong).

diff --git a/drivers/pci/quirks.c b/drivers/pci/quirks.c
index 338a3f9..3377907 100644
--- a/drivers/pci/quirks.c
+++ b/drivers/pci/quirks.c
@@ -1381,6 +1381,21 @@ DECLARE_PCI_FIXUP_HEADER(PCI_VENDOR_ID_TOSHIBA_2,
 			 PCI_DEVICE_ID_TOSHIBA_TC86C001_IDE,
 			 quirk_tc86c001_ide);
 
+/*
+ * Apparently 8139 chips decode more than their advertised MMIO range.
+ * Increase the size to avoid conflicts.
+ */
+static void __init quirk_8139_mmio_size(struct pci_dev *dev)
+{
+	struct resource *r = &dev->resource[1];
+
+	r->start = 0;
+	r->end = 0xfffff;
+}
+DECLARE_PCI_FIXUP_HEADER(PCI_VENDOR_ID_REALTEK,
+			 PCI_DEVICE_ID_REALTEK_8139,
+			 quirk_8139_mmio_size);
+
 static void __devinit quirk_netmos(struct pci_dev *dev)
 {
 	unsigned int num_parallel = (dev->subsystem_device & 0xf0) >> 4;

Thanks for giving this some attention again! Those vendor/device IDs look correct to me -- at least, they are used elsewhere in 8139-related driver files.
And it sounds like an interesting hypothesis. Since both 8139too and sdhci had issues on this laptop, it would make some sense that one could be causing the other's problem. So I'll give this patch a shot in the next day or two. First I have to pull/checkout/build 2.6.26 and make sure that is working on this machine... I haven't built the kernel in a while (Ubuntu flipped the problematic MMIO/PIO 8139too config setting with the release of 8.04, so I can use their kernels' version of 8139too now). (I'm sure you kernel folks would be horrified to use a kernel two versions back and with Ubuntu's mods, but I am just happy to have something that works. :) ) (Of course, it would be even better if the card reader worked too.) Well, I got the 2.6.26 kernel built, booted up to make sure it ran on my laptop, and verified that doing modprobe sdhci still hung the system (no surprise there). Then I put in the patch from Comment #42 above, did a complete kernel rebuild from clean (just to make sure), installed that kernel, rebooted, and... unfortunately doing sudo modprobe sdhci still hangs the system. So... too bad! Any other ideas? Hm... can you verify that the quirk is being executed and affecting the resource size used by the driver? Assuming that part is working, you could try making the resource reservation even larger in the quirk... I can try putting in some kind of a kernel log statement to verify it's being executed. As far as making it even bigger, can you give me a bit of guidance -- how big can it be? On Monday, July 28, 2008 7:12 am bugme-daemon@bugzilla.kernel.org wrote: > ------- Comment #46 from yahgrp@poplarware.com 2008-07-28 07:12 ------- > I can try putting in some kind of a kernel log statement to verify it's > being executed. > > As far as making it even bigger, can you give me a bit of guidance -- how > big can it be? 
Oh you can probably make it pretty big (~256M) before you start having trouble, depending on how much RAM and how many PCI devices you have and how much address space they want. Hopefully the device isn't so broken that you'll need a full 256M window reserved for it though.

Jesse

OK, I put in a printk statement and verified the quirk function was being run. Then I tried some different values for r->end in the patch from Comment #42 above:

- 0xfffff -- still hangs when I modprobe sdhci
- 0xfffffff -- still hangs
- 0xffffffffffff -- compile warning "integer constant is too large for 'long' type" (I'm running/compiling a 32-bit kernel for obscure reasons, and apparently that struct element is an unsigned long)
- 0xffffffff (max allowed in struct) -- still hangs

So, I guess there is either something wrong with the codes, or the idea doesn't work. I'll reopen the bug...

By the way, it also looks like with the quirk resource -> end set to 0xffffffff, the 8139too module is not working. At least, I don't seem to have any network interface coming up. I haven't investigated further...

Yeah, setting it to take that much space will just fail (or eat *all* of your address space :)... Hm, too bad that idea didn't work. I guess we need another type of quirk that forces 8139 into PIO mode. Looks like the driver already has one check, maybe we can just add another like so?
diff --git a/drivers/net/8139too.c b/drivers/net/8139too.c
index 8a5b0d2..2dfccd0 100644
--- a/drivers/net/8139too.c
+++ b/drivers/net/8139too.c
@@ -953,8 +953,10 @@ static int __devinit rtl8139_init_one (struct pci_dev *pdev,
 
 	if (pdev->vendor == PCI_VENDOR_ID_REALTEK &&
 	    pdev->device == PCI_DEVICE_ID_REALTEK_8139 &&
-	    pdev->subsystem_vendor == PCI_VENDOR_ID_ATHEROS &&
-	    pdev->subsystem_device == PCI_DEVICE_ID_REALTEK_8139) {
+	    ((pdev->subsystem_vendor == PCI_VENDOR_ID_ATHEROS &&
+	      pdev->subsystem_device == PCI_DEVICE_ID_REALTEK_8139) ||
+	     (pdev->subsystem_vendor == 0x14ff &&
+	      pdev->subsystem_device == 0xa003))) {
 		printk(KERN_INFO "8139too: OQO Model 2 detected. Forcing PIO\n");
 		use_io = 1;
 	}

Are you aware that on this particular laptop, compiling the 8139too module without CONFIG_8139TOO_PIO=y in the .config makes it so that loading the 8139too module causes the laptop to hang? So I am already compiling with that flag set. It seems as though that should be forcing it into PIO mode? Not sure, I really don't know much about device drivers (actually, I have no idea what PIO mode means).

Let's see. Looking at the 8139too.c file, this config option sets a flag #define USE_IO_OPS 1 up towards the top of the file, which causes several things to happen farther down.... Is that the same thing as what your patch would do?

The patch you suggested doesn't seem to apply to the 8139too.c file I have anyway. Which version are you patching against? I am building 2.6.26 currently. The context of the patched lines isn't right... those lines don't exist in my file, and the use_io variable doesn't exist.

Someone just emailed me privately about this bug and explained what PIO and MMIO mean... thanks! Also he had the suggestion that maybe both the 8139too in MMIO mode and the sdhci module are interfering with some 3rd device, rather than with each other. Is that possible?
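Editorial sketch: the overlap hypothesis raised in this thread (two devices decoding the same MMIO addresses) comes down to an interval check on the address ranges that lspci reports. The following is a minimal user-space illustration, assuming inclusive start/end ranges in the spirit of the kernel's struct resource. The struct mmio_range type and both helper functions are hypothetical names, not kernel APIs.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Inclusive address range (hypothetical type, modelled on start/end pairs). */
struct mmio_range {
    const char *name;
    uint64_t start;
    uint64_t end; /* inclusive */
};

/* Two inclusive ranges overlap when each starts no later than the other ends. */
static bool ranges_overlap(const struct mmio_range *a,
                           const struct mmio_range *b)
{
    return a->start <= b->end && b->start <= a->end;
}

/* Return the index of the first entry in others[] that overlaps dev, or -1. */
static int find_conflict(const struct mmio_range *dev,
                         const struct mmio_range *others, int n)
{
    for (int i = 0; i < n; i++) {
        if (ranges_overlap(dev, &others[i]))
            return i;
    }
    return -1;
}
```

As a usage example, the SDHCI slot in the log earlier in this bug sits at physical address 0xffbfe800. Assuming a 256-byte register window (the listed register offsets end at 0xFE), it would be described as { "sdhci", 0xffbfe800, 0xffbfe8ff } and compared with find_conflict() against every other region taken from lspci -vv. The catch is that a device which decodes more than it advertises will not show up in that output, so a clean result from a check like this does not rule the hypothesis out.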
Oh, 10231 made me think that it was 8139's MMIO usage that was causing trouble, I missed that Ubuntu already disabled 8139 MMIO. So yeah it's probably a more general MMIO bug on this platform. Pierre, is there a way of running SDHCI in PIO only mode? I wonder what Windows does on this platform to avoid this problem, maybe there's some special bridge programming needed.

Yes, the Ubuntu team disabled 8139 MMIO because of this laptop. Someone had filed a bug, and either I or someone else (I don't recall) pointed out what the solution was, and they flipped the config bit, which enabled all of us poor souls who own one of these laptops to thereby update the kernel using the standard end-user upgrade path. As far as I know, Ubuntu is currently the only distro that can be installed from one of its install CDs on this laptop (with some shenanigans to blacklist SDHCI on initial bootup). (Maybe Gentoo would be an exception, as I think you configure as you go? I haven't tried that route.) All other install CDs we tried last year when I first got this laptop (and we tried a LOT of distros of many flavors) hang during initial bootup. It was only because some guy in Germany who owned a clone of this laptop had created an alternate Ubuntu install CD (without the SDHCI module and with 8139too set to PIO) that I was even able to get Linux running on it, and that's why I am using Ubuntu.

Anyway, all that aside, I like the idea of making the SDHCI use PIO, if it's possible, since that was the solution to the 8139 problem that got this laptop running in the first place.

By the way, I have read somewhere that the 2.6.27 kernel has some MMIO debugging built into it. (I am still compiling/running 2.6.26.) Do you think it would be helpful if I pulled 2.6.27 RC1 (or whatever the latest RC is) and tried it out? Would it be likely to give us some useful information on the SDHCI problems? If so, can either Jesse or Pierre give me any suggestions on config options I would need to set in order to get this useful information?

I'm adding myself since I'm interested in the outcome of this discussion...

I'm not sure about the MMIO debugging stuff; it might help but our best bet would be to get some info about the twinhead multifunction device from the vendor. Sounds like there are some workarounds we're missing. I've pinged the folks at O2 Micro, we'll see if they get back to us...

I think there is another device in the laptop that also uses the MMIO address area ffbfe800-ffbfecff, or maybe even up to ffbfefff. I have added a hack that moves the NIC and the card reader out of this area, and both are working: http://bugzilla.kernel.org/show_bug.cgi?id=10231 Can you make a correct patch to blacklist this area on the H12Y? Changing the MMIO size of the chips to force Linux to move them is not a good way.

Arne, thanks for this. I've tried to apply the patch, and it wouldn't apply with patch -p1 < quirks-h12y.patch; it fails because it does NOT find the file. I'm no guru on the matter, but I had to manually add the lines to /drivers/pci/quirks.c. Am I doing something wrong?

Arne's patch (which I applied on Ubuntu's kernel 2.6.24-21.42) seems to work on my laptop. Tomas, you need to chdir into drivers/pci and apply a 'patch -p0 < quirks-h12y.patch' there.

Yeah Arne's patch addresses the root problem much better than the hack I posted. I wonder if there's an ACPI table on this machine that would tell us about these hidden resources? Either way we can push Arne's quirk upstream if he spins a new one with some comments and sends it to jbarnes@virtuousgeek.org.

thanks,
Jesse

Created attachment 17716 [details]
disassembled DSDT
I have no idea about this, but I have attached what I think is the ACPI table for my machine (/proc/acpi/dsdt, disassembled with iasl and gzipped), which somebody else may be able to interpret.
Hm, looks like those devices overlap with one of the PNP resources:

Device (RMSC)
{
    Name (_HID, EisaId ("PNP0C02"))
    Name (_UID, 0x10)
    Name (CRS, ResourceTemplate ()
    {
        IO (Decode16,
            0x0010, // Range Minimum
            0x0010, // Range Maximum
            0x00, // Alignment
            0x10, // Length
            )
        IO (Decode16,
            0x0022, // Range Minimum
            0x0022, // Range Maximum
            0x00, // Alignment
            0x1E, // Length
            )
        IO (Decode16,
            0x0044, // Range Minimum
            0x0044, // Range Maximum
            0x00, // Alignment
            0x1C, // Length
            )
        IO (Decode16,
            0x0063, // Range Minimum
            0x0063, // Range Maximum
            0x00, // Alignment
            0x01, // Length
            )
        IO (Decode16,
            0x0065, // Range Minimum
            0x0065, // Range Maximum
            0x00, // Alignment
            0x01, // Length
            )
        IO (Decode16,
            0x0067, // Range Minimum
            0x0067, // Range Maximum
            0x00, // Alignment
            0x09, // Length
            )
        IO (Decode16,
            0x0072, // Range Minimum
            0x0072, // Range Maximum
            0x00, // Alignment
            0x0E, // Length
            )
        IO (Decode16,
            0x0080, // Range Minimum
            0x0080, // Range Maximum
            0x00, // Alignment
            0x01, // Length
            )
        IO (Decode16,
            0x0084, // Range Minimum
            0x0084, // Range Maximum
            0x00, // Alignment
            0x03, // Length
            )
        IO (Decode16,
            0x0088, // Range Minimum
            0x0088, // Range Maximum
            0x00, // Alignment
            0x01, // Length
            )
        IO (Decode16,
            0x008C, // Range Minimum
            0x008C, // Range Maximum
            0x00, // Alignment
            0x03, // Length
            )
        IO (Decode16,
            0x0090, // Range Minimum
            0x0090, // Range Maximum
            0x00, // Alignment
            0x10, // Length
            )
        IO (Decode16,
            0x00A2, // Range Minimum
            0x00A2, // Range Maximum
            0x00, // Alignment
            0x1E, // Length
            )
        IO (Decode16,
            0x00E0, // Range Minimum
            0x00E0, // Range Maximum
            0x00, // Alignment
            0x10, // Length
            )
        IO (Decode16,
            0x04D0, // Range Minimum
            0x04D0, // Range Maximum
            0x00, // Alignment
            0x02, // Length
            )
        IO (Decode16,
            0x0000, // Range Minimum
            0x0000, // Range Maximum
            0x00, // Alignment
            0x00, // Length
            _Y0C)
        IO (Decode16,
            0x0000, // Range Minimum
            0x0000, // Range Maximum
            0x00, // Alignment
            0x00, // Length
            _Y0D)
        IO (Decode16,
            0x0000, // Range Minimum
            0x0000, // Range Maximum
            0x00, // Alignment
            0x00, // Length
            _Y0E)
        Memory32Fixed (ReadWrite,
            0xFED1C000, // Address Base
            0x00004000, // Address Length
            )
        Memory32Fixed (ReadWrite,
            0xFED20000, // Address Base
            0x00070000, // Address Length
            )
        Memory32Fixed (ReadWrite,
            0xFFB00000, // Address Base
            0x00100000, // Address Length
            _Y0A)
        Memory32Fixed (ReadWrite,
            0xFFF00000, // Address Base
            0x00100000, // Address Length
            _Y0B)
    })
    Method (_CRS, 0, NotSerialized)
    {
        CreateDWordField (CRS, \_SB.PCI0.SBRG.RMSC._Y0A._LEN, SML1)
        CreateDWordField (CRS, \_SB.PCI0.SBRG.RMSC._Y0A._BAS, SMB1)
        CreateDWordField (CRS, \_SB.PCI0.SBRG.RMSC._Y0B._LEN, HCTL)
        CreateDWordField (CRS, \_SB.PCI0.SBRG.RMSC._Y0B._BAS, HCTB)
        Store (0xFFB00000, SMB1)
        Store (0x00100000, SML1)
        Store (0xFFF00000, HCTB)
        Store (0x00100000, HCTL)
        CreateWordField (CRS, \_SB.PCI0.SBRG.RMSC._Y0C._MIN, GP00)
        CreateWordField (CRS, \_SB.PCI0.SBRG.RMSC._Y0C._MAX, GP01)
        CreateByteField (CRS, \_SB.PCI0.SBRG.RMSC._Y0C._LEN, GP0L)
        Store (PMBS, GP00)
        Store (PMBS, GP01)
        Store (PMLN, GP0L)
        If (SMBS)
        {
            CreateWordField (CRS, \_SB.PCI0.SBRG.RMSC._Y0D._MIN, GP10)
            CreateWordField (CRS, \_SB.PCI0.SBRG.RMSC._Y0D._MAX, GP11)
            CreateByteField (CRS, \_SB.PCI0.SBRG.RMSC._Y0D._LEN, GP1L)
            Store (SMBS, GP10)
            Store (SMBS, GP11)
            Store (SMBL, GP1L)
        }
        If (GPBS)
        {
            CreateWordField (CRS, \_SB.PCI0.SBRG.RMSC._Y0E._MIN, GP20)
            CreateWordField (CRS, \_SB.PCI0.SBRG.RMSC._Y0E._MAX, GP21)
            CreateByteField (CRS, \_SB.PCI0.SBRG.RMSC._Y0E._LEN, GP2L)
            Store (GPBS, GP20)
            Store (GPBS, GP21)
            Store (GPLN, GP2L)
        }
        Return (CRS)
    }
}
}

If this really is a device with registers, we shouldn't be assigning PCI devices to that region; maybe this is really an ACPI problem?

If I understand correctly, you're saying that the 0xFFB00000-0xFFBFFFFF range reserved in the third Memory32Fixed clashes with the MMIO range used for the card reader and nic. Is this it? Is this setup (both the PNP and the MMIO reservation) suggested by the hardware, or is the kernel involved in any way? I assume it's the former, which is an interesting design flaw in this chipset..
Would the kernel conceivably be able to spot and solve clashes like this one automatically by looking at the ACPI table? This would fix any similar problems that may occur in the future. (This is a naive suggestion - it sounds feasible but it may be too complicated and/or of too little relevance to actually implement such a thing.) Anyway, in the light of this, is Arne's patch the optimal solution?

Yes, this is the area. I think it is first of all a BIOS bug, because the BIOS also assigns these addresses to the devices. I have checked this with DOS and some tools that read the PCI registers. In Windows this area is reserved as "Motherboard resources" and Windows moves all devices out of it. I think my patch works, but it is not the optimal solution, because it is possible that the next device you put into the ExpressCard slot gets this address again.

> I think it is first of all a BIOS bug, because the BIOS also assigns these
> addresses to the devices.

Perhaps there's a BIOS update that fixes this? My laptop comes with some R1.00 version. For the Twinhead H12Y model there's R1.04 and R1.08 at http://www.twinhead.com.tw/download.aspx . None of the other manufacturers have BIOS updates that I can find. I'm not trying the Twinhead updates in case I end up without a working BIOS, and anyway I don't have Windows anymore - but has anyone tried flashing the BIOS?

Back to the main point - can that address area be reserved by the Linux kernel, as a quirk? That would solve the issue more robustly, wouldn't it?

Yeah, I think the best solution would be for Linux to detect these PNP resources and reserve space for them. They're fixed, so Linux should move any overlapping PCI devices to another region. I haven't dug into the pnp driver enough to see if there's an easy way to do this; it might be that we should add some code to ACPI to reserve these regions early on...
So it seems there are three possible solutions, ordered by increasing correctness/difficulty:

1) Arne's current patch -- this solves the original issue but still leaves the possibility of something going wrong if an additional PCI[e] device is added.
2) Reserving the memory regions indicated in the DSDT above for this particular hardware (a quirk). This would solve all possible PCI issues on this machine, but not others.
3) Teaching the kernel to handle contradictory BIOS information by reserving memory for fixed resources so that relocatable ones don't overlap with them.

Now, option 1 we already have. Option 2 doesn't sound much more difficult than option 1. Option 3 is neat, but it probably takes quite a bit of work and expertise to implement, and anyway I don't know how often BIOSes suggest silly setups capable of hanging the computer..

I guess that it's either option 2 or 3 that would be the 'final' fix for this. It may not be wise to include a provisional fix (be it 1 or 2) in the mainline kernel. However, especially if the solution is going to take a while to develop, I would be quite happy to see a provisional fix (be it 1 or 2) downstream in the Ubuntu kernel where this bug originated so that it ships with 8.10 (due 30-Oct if I remember correctly) and makes a few people happy. Any thoughts on this?

BTW, I would help with the coding stuff, but I don't even code (in C, that is). Sorry about that.

Created attachment 17796 [details]
reassign PCI BARs that conflict with PNP resources
This patch attempts to detect PNP resource conflicts, forcing PCI to reallocate things. I'm not sure if the quirks will run in the right order though, anyone care to test?
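To make the idea of "detecting a PNP resource conflict" concrete, here is a rough standalone sketch of the range test involved. It is only an illustration and is not taken from the attached patch: struct range and the three helpers are hypothetical names, and the addresses one would feed it (the PNP0C02 Memory32Fixed entry at 0xFFB00000 with length 0x00100000 from the DSDT, and the ffbfe800-ffbfecff area mentioned earlier in this bug) are just the numbers quoted in this thread.

```c
#include <stdbool.h>
#include <stdint.h>

/* Inclusive address range, in the spirit of the kernel's struct resource. */
struct range {
    uint64_t start;
    uint64_t end;   /* last address in the range, inclusive */
};

/* Build an inclusive range from a base and a non-zero length,
 * the way a Memory32Fixed entry describes its region. */
static struct range range_from_base_len(uint64_t base, uint64_t len)
{
    struct range r = { base, base + len - 1 };
    return r;
}

/* True when the two ranges share at least one address. This also
 * catches partial overlaps, not only exact matches. */
static bool ranges_overlap(const struct range *a, const struct range *b)
{
    return a->start <= b->end && b->start <= a->end;
}

/* True when inner lies entirely inside outer. */
static bool range_contains(const struct range *outer,
                           const struct range *inner)
{
    return outer->start <= inner->start && inner->end <= outer->end;
}
```

With the numbers from this bug, the PNP region works out to 0xFFB00000-0xFFBFFFFF and the ffbfe800-ffbfecff area both overlaps it and sits entirely inside it, so a check that only looked for identical start/end values would miss the conflict. A real fix would also have to decide which PNP devices count (apertures versus fixed motherboard resources), which this sketch does not attempt.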
Created attachment 17810 [details]
reconstructed dmesg
The patch tries to do something, but it ends up disabling most of the hardware, so I end up on a busybox terminal without access to the hard drive or network.
To reconstruct dmesg I compiled the 2.6.27-3 Ubuntu kernel, booted into it, stored the dmesg, compiled the patched kernel, booted into it, ran 'dmesg | more' in busybox, compared visually with the previous log on my desktop machine and wrote the differences by hand. I've used '>>>>' and '<<<<' to flag additions and removals, emulating some kind of 'diff orig.log patched.log'. I hope the notation is understandable.
The point at which the patched kernel log ends is flagged with a line. I may have made typos or left something out, but I hope it's nothing critical.
Yeah that helps. Looks like my patch is a bit too naive; I think I need to check what kind of PNP resource we're comparing against and only worry about non-aperture ones (or something, I'll talk with Bjorn).

Created attachment 17819 [details]
oops
Last one was extra bad. This one might work a little better, but I still think we have to limit it to just pnp0c02 possibly.
Created attachment 17826 [details]
dmesg
Indeed it works better. This time it boots and the card reader is working. However the patch kills all USB ports and the CD drive, plus something called 'ata_piix' and I don't know if anything else. I've attached the dmesg, where I've put an asterisk at those lines which (I think) flag the new issues.
Minor thing: in the patch there are modifications to drivers/gpu/drm/i915/i915_irq.c which seem to be unrelated to this bug. Did you include these intentionally, or can I leave them out?
Oh no the i915 changes were a mistake, sorry about that. I'll take a look at the dmesg and see if I can figure out what's going on (we're likely forcing a BAR reallocation that shouldn't really happen).

Created attachment 17870 [details]
add debug output and catch more cases
I'm not sure why the I/O port stuff fails yet but this patch should at least give us some more debug output. I think it's also more correct, since before we wouldn't catch overlapping regions and reassign the BARs appropriately.
Created attachment 17871 [details]
dmesg after patch #3
Attached is the dmesg I get with this patch. Nothing seems to have changed in terms of what works and what doesn't.
Wow, I have been out on vacation for a number of weeks -- glad Pablomme was around to test things! It looks like there isn't anything I should do right now to test... but I'm back in action and ready to help if there is anything I can do.

--Jennifer

Jesse, any progress on this? If you deem it appropriate, I could suggest Arne's patch to be added to Ubuntu's kernel for the 8.10 release while you figure out a better solution (although we've just missed the beta, don't know if it'll be accepted).

Including Arne's patch in Ubuntu is a good idea; hopefully we can bang the PNP code into shape to avoid problems like this more generally, but in the meantime Arne's patch is a good workaround. I'll ping Bjorn again about how we might fix this properly.

Arne, do you want to attach your latest patch to the launchpad bug ( https://bugs.launchpad.net/ubuntu/+source/linux/+bug/187671 ) and comment a bit on the situation upstream? If you do, please do mention that your patch makes the one forcing the 8139too module into PIO mode (associated with the bug at https://bugs.launchpad.net/ubuntu/+source/linux-source-2.6.20/+bug/90271 ) obsolete. Or let us know if you want somebody else to submit it. Thanks!

Hi Jesse - have you made any progress on this one? We didn't get to convince the Ubuntu kernel maintainers to include Arne's patch into 8.10, which is a pity.. Anyway, if you come up with new trial patches let me know, I'm still up for testing.

I was thinking, would any machine other than these laptops benefit from the general solution? If not, the solution is maybe not worth the effort. Arne's patch could be included in the kernel directly and the bug could be closed.

Sorry. It should not be added. My patch triggers another problem. If you use the patched kernel with a PCMCIA/yenta socket and insert a Realtek 8139C PCMCIA card, you get a kernel panic at the "quirk". This happens even if the quirk does nothing because the vendor ID does not match. We are working on the problem, but I do not have much time at the moment.

(In reply to comment #53)
>
> Pierre, is there a way of running SDHCI in PIO only mode?
>

This is probably no longer relevant, but no you cannot operate SDHCI with just PIO. There is no ioport space for the device. :)

(In reply to comment #56)
> I've pinged the folks at O2 Micro, we'll see if the get back to us...
>

If you have some contacts there, some errata on their controllers would be nice. :)

kernel-resourcec-fix-sign-extension-in-reserve_setup.patch should address this. See http://bugzilla.kernel.org/show_bug.cgi?id=13253

I've proposed Arne's patch to the Arch Linux developers, and it was approved. Is the reserve option the preferred solution? I'm asking because I've been following the issue closely and haven't read any complaints from the Arch community concerning this patch. Should I suggest they remove the patch and start using the reserve method? What are the pros and cons?

As I have already written in Comment #82, my patch crashes the kernel if a very common Realtek 8139 CardBus LAN card is inserted. You should prefer the reserve option. But at the moment this option doesn't work on 64-bit kernels without the patch from Comment #86.

The patch from Comment #84 doesn't address the main problem. It only fixes the misinterpreted "reserve" parameter of the workaround on 64-bit kernels.

Can anyone provide information on how I can automatically add such a "reserve" if a Twinhead system is detected?

When using the reserve method, 8139too and sdhci_pci fail to reserve the memory region and don't load any device. Is this the expected behaviour? Using kernel 2.6.31-rc2-git5 vanilla.

No. Reserving the memory should force the kernel to reconfigure the devices to use another memory area. On my system the modules load and work (sdhci_pci and also 8139too). But I have not tested this with kernel versions higher than 2.6.29.

Hmmm, is there any config setting in the kernel that would trigger this behaviour?
----
sdhci: Secure Digital Host Controller Interface driver
sdhci: Copyright(c) Pierre Ossman
sdhci-pci 0000:03:06.2: SDHCI controller found [1217:7120] (rev 1)
sdhci-pci 0000:03:06.2: PCI INT A -> GSI 19 (level, low) -> IRQ 19
sdhci-pci 0000:03:06.2: BAR 0: can't reserve mem region [0xffbfe800-0xffbfe8ff]
sdhci-pci 0000:03:06.2: cannot request region
sdhci-pci 0000:03:06.2: PCI INT A disabled
sdhci-pci: probe of 0000:03:06.2 failed with error -16
----

I'd like to look at this problem again. If anybody can collect this information, it would be helpful:

1) A complete dmesg log from a current upstream kernel, e.g., 2.6.35-rc5 or newer, with the "pci=use_crs" kernel argument.

2) A Windows system information report, such as the hardware-related pages from Everest (http://www.lavalys.com/products/everest-pc-diagnostics). The trial version of Everest is free, and I think it produces enough information for what we need. It might be possible to run this using Windows PE (http://en.wikipedia.org/wiki/Windows_Preinstallation_Environment), which is apparently available free of charge and should be bootable from a DVD or a USB flash drive.

I will try to prepare this information tomorrow. Checking the Windows requirements right now since I don't have it installed.. Mind you, the kernel needs to be booted with the reserve= boot parameter.

Created attachment 27174 [details]
dmesg of 2.6.35-rc5 with kernel booted with pci=usecrs
Created attachment 27184 [details]
report using the everest software under win_PE
Created attachment 27189 [details]
BIOS vs Linux vs WinPE resource assignments
Thank you Tomas! This attachment shows the resource assignments made by
BIOS and the changes Linux and WinPE made.
With "pci=use_crs", Linux moved all the devices into host bridge apertures.
I think this kernel will probably boot without the "reserve=" parameter,
as long as you do use "pci=use_crs".
In some cases, WinPE also moved devices into apertures. I'm puzzled by
the cases where it did not, e.g., 03:06.2 and .3. Those BARs are right
next to the 03:06.0 BARs, which WinPE *did* move, and they appear to conflict
with some ACPI device resources.
(In reply to comment #94)
> Created an attachment (id=27189) [details]
> BIOS vs Linux vs WinPE resource assignments
>
> Thank you Tomas! This attachment shows the resource assignments made by
> BIOS and the changes Linux and WinPE made.
>

no, thank you for taking the time ;)

> With "pci=use_crs", Linux moved all the devices into host bridge apertures.
> I think this kernel will probably boot without the "reserve=" parameter,
> as long as you do use "pci=use_crs".
>

I will test this as soon as I get home.

> In some cases, WinPE also moved devices into apertures. I'm puzzled by
> the cases where it did not, e.g., 03:06.2 and .3. Those BARs are right
> next to the 03:06.0 BARs, which WinPE *did* move, and they appear to conflict
> with some ACPI device resources.

Other than stepping on someone else's resources, I don't know what this all means. Is this common in the BIOS world? How is it handled under Linux? And what should I do next now that we know what's happening?

thanks again for the effort ;)
> In some cases, WinPE also moved devices into apertures. I'm puzzled by
> the cases where it did not, e.g., 03:06.2 and .3. Those BARs are right
> next to the 03:06.0 BARs, which WinPE *did* move, and they appear to conflict
> with some ACPI device resources.
03:06.2 is the SD Host controller, which didn't work with win_PE (no driver installed?). I don't know who is in charge of moving things around in Windows, but it might be the driver.
03:06.3 is the ms/xD/SM controller, which I guess wasn't working either (same slot in the notebook as 03:06.2). This was not tested; I did test with an SD card though, and it wasn't recognized.
I'm curious about the devices Windows did not move because I wonder if there's something we should learn from that. There's enough room in the host bridge apertures for all devices, and Linux put them all inside the apertures, which should work.

There are four devices Linux moved but Windows did not:

00:02.0: reg 10: [mem 0xffe80000-0xffefffff]
00:02.0: reg 1c: [mem 0xffe40000-0xffe7ffff]
00:02.1: reg 10: [mem 0xffd80000-0xffdfffff]
03:06.2: reg 10: [mem 0xffbfe800-0xffbfe8ff]
03:06.3: reg 10: [mem 0xffbff000-0xffbfffff]

The 00:02 devices are VGA-related. I can imagine an exception along the lines of "we *know* VGA works because the BIOS used it, so don't touch it."

Without considering any device hierarchy, the 03:06 devices appear to conflict with these ACPI devices:

PNP0C01: [mem 0xfec00000-0xffffffff]
PNP0C02: [mem 0xffb00000-0xffbfffff]

Tomas, could you turn on CONFIG_ACPI_DEBUG, boot with "acpi.debug_layer=0x00010000 acpi.debug_level=0x00000004", and attach another log? Linux PNP currently throws away the hierarchy information, but maybe we should be paying attention to it.

If anybody has a Windows installation where the SD host controller and/or the MS/xD controller are working, I'd like to know what resources they are using (e.g., from Everest or the Device Manager), and how those compare to Linux (e.g., the output of "dmesg | grep 03:06"). Linux does device resource reassignment eagerly, as soon as we discover the device, but it's conceivable that Windows does it lazily, only when a driver claims the device.

Booting with pci=use_crs (with no reserve= / acpi_sleep=nonvs) works, but suspending doesn't (as reported in bug 16396).

Created attachment 27192 [details]
dmesg with acpi debug information
Created attachment 27193 [details]
dmesg with acpi debug information and kernel built with ACPI_DEBUG
Sorry, I misunderstood your post. Here's a new dmesg. Hope it's got what you need.
I think the best fix for this would be to turn on "pci=use_crs", either just on this machine with a DMI quirk, or across the board. There are still some known issues that make me hesitant to do it across the board yet. In the meantime, we can specify "pci=use_crs" manually as a workaround.

On a bit of a tangent, the suspend/resume bug 16396 affects this same machine. There's a patch for that bug, but it still requires a kernel boot option, which is never ideal. If anybody has Windows on this machine and can determine whether suspend/resume works properly with Windows, we might be able to find a clue that will help us fix Linux. The Linux problem is apparently related to the 03:04.0 (8139too) and 03:06.2 (sdhci) devices, so knowing how Windows configures those devices and the PCI bridge leading to bus 03 would be useful.

Since it seems I'm the only one left interested in this bug report, I guess I will have to install the preloaded software that came with the notebook. I will try to get hold of an extra SATA drive and borg this thing a bit (I'm fond of my Linux partitions ;) ). In the meantime, could you please propose test cases for the Windows install? What information is needed (other than the Everest pages)?

I haven't tested pci=use_crs + acpi_sleep=nonvs yet.. but I suppose it will work. Will report back if it does not.

One last question: what's the difference between reserve= and pci=use_crs? Which one should be the sane default?

> since it seems I'm the only one left interested in this bug report
Unfortunately I don't have this laptop at hand, I gave it to my sister when I bought my current netbook. But, although I can't help, I'm interested in the outcome.
For this problem (machine hangs when loading sdhci), I think "pci=use_crs" is better than "reserve=". With "reserve=" you have to figure out which addresses to reserve, which depends on the machine. "pci=use_crs" is generic and in theory, it should work anywhere. We currently turn it on automatically for all machines with BIOS dates of 2008 or later.

I think we could fix some old issues, like this one, if we made "pci=use_crs" the default everywhere, but as I mentioned, we still have a couple known issues that make that risky. For example, bug 16228 is a system that only works if we turn "pci=use_crs" OFF. I think that issue is caused by the fact that Linux will reassign a PCI device to an address marked "reserved" in the E820 memory map (see bug 16228 comment 8). In my opinion, this is a serious defect in the way Linux handles "reserved" areas, but it's not trivial to fix.

If anybody collects information about suspend/resume under Windows, please attach it to bug 16396, not here, so we can keep these issues separate. I only posted the request here because this bug has more people on the CC: list who might be able to help investigate it. If you're interested in the suspend/resume problem, please subscribe to bug 16396.

As far as Windows test cases, I'm interested in:
- Does suspend/resume work?
- Does hibernate/resume work?
- What resources are assigned to the devices on bus 03?
- Do the devices on bus 03 work? Do they require updated drivers from the OEM to make them work?
- Do the bus 03 resources differ depending on whether the drivers are loaded? I think you might be able to use Windows "safe" mode to prevent automatic driver loading.

Created attachment 27209 [details]
dmesg with pci=use_crs and reserve=ffb0....
Hi Björn,
nice to hear from you again. I cannot confirm that pci=use_crs works on my identical Averatec 2400. The boot still crashes at 8139too in MMIO mode.

In Windows XP all devices are working with the normal drivers. But if the drivers are not loaded, Windows shows a resource conflict. (I do not use standby or hibernation.)

E.g. for the 8139: memory area FFBFEC00-FFBFECFF is used by mainboard resources. (I hope I have translated this correctly, because I use a German Windows: "Speicherbereich FFBFEC00 - FFBFECFF wird verwendet von: Hauptplatinenressourcen".)
At normal boot Windows reconfigures this memory area to FFE3E600 - FFE3E6FF and shows no conflicts.

(In reply to comment #106)
> In Windows XP all devices are working with the normal drivers. But if the
> drivers are not loaded, Windows shows a resource conflict. (I do not use
> standby or hibernation.)
>
> E.g. for the 8139: memory area FFBFEC00-FFBFECFF is used by mainboard
> resources. (I hope I have translated this correctly, because I use a German
> Windows: "Speicherbereich FFBFEC00 - FFBFECFF wird verwendet von:
> Hauptplatinenressourcen".)
> At normal boot Windows reconfigures this memory area to FFE3E600 - FFE3E6FF
> and shows no conflicts.

You saved me a lot of work ;) I was going to mirror my HDD somewhere else and install Vista.

Could you please test bug https://bugzilla.kernel.org/show_bug.cgi?id=16396 and report the suspend / hibernate results from Windows there? They were wondering if the notebook could suspend under Windows.

thanks

Suspend to disk and to RAM both work with Windows XP without problems. With Ubuntu 10.04, suspend to RAM works but suspend to disk fails.

Created attachment 27214 [details]
add PNP resource debug output

Hi Arne, it's quite embarrassing that after two years, two bug reports, and 150 comments, we still don't have a good resolution for this issue.

Let me see if I understand your comment 105. The log (attachment 27209 [details]) shows a boot with "pci=use_crs reserve=0xffb00000,0x100000", and that works. But if you boot with only "pci=use_crs", it fails. Right?

Let's experiment with this.
Please apply the attached patch and also the one from attachment 26819 [details] (bug 16228 comment 4).

Now, prevent 8139too from loading automatically (rename the module or something). If you boot with only "pci=use_crs", without using "reserve=", you should be able to collect the dmesg log after Linux reassigns device resources but before the 8139too driver loads and hangs the system. (BTW, if you don't mind, mark your attachments "text/plain" so they're easy to open in a browser.)

If you then manually load 8139too, I guess you should see the hang.

Now try a boot with "pci=bar=0000:03:04.0[14]=0x80000000". This will move the 8139 MMIO BAR elsewhere, on the theory that there's still another device conflicting with it. You can experiment with different addresses, including the place where Windows puts it, and see whether any avoid the hang.

Re: comment 105, I think I see the problem. "pci=use_crs" doesn't help in this case because the host bridge window:

pci_root PNP0A08:00: host bridge window [mem 0x7f800000-0xffffffff]

does include all these problematic PCI BARs:

pci 0000:03:04.0: reg 14: [mem 0xffbfec00-0xffbfecff]
pci 0000:03:06.0: reg 14: [mem 0xffbfe000-0xffbfe7ff]
pci 0000:03:06.2: reg 10: [mem 0xffbfe800-0xffbfe8ff]
pci 0000:03:06.3: reg 10: [mem 0xffbff000-0xffbfffff]

They do conflict with this ACPI device:

system 00:09: [mem 0xffb00000-0xffbfffff]

but we currently don't look at ACPI resources until later. And even then, all the PCI BARs are contained *within* the ACPI region, so I'm not sure we'd find the conflict.

Don't waste your time experimenting with this. I think it's clear that we need some significant rework in the way Linux handles these ACPI resources.

From comment 106, I think we can learn two very important things:

1) Windows moves PCI devices to avoid conflicts with ACPI devices. Linux tends to trust PCI resources more than ACPI, but this may be a mistake.

2) Windows doesn't move a PCI device until loading a driver for it. Linux moves PCI devices immediately, before loading any drivers, which may be more aggressive than necessary.

Created attachment 27240 [details]
dmesg with patches and pci=use_crs
Loading of 8139too will crash. (mmio is still at 0xFFBFEC00)
Created attachment 27241 [details]
Now with pci=bar...
8139too is usable (Linux has chosen another address)
03:04.0 Ethernet controller: Realtek Semiconductor Co., Ltd. RTL-8139/8139C/8139C+ (rev 10)
Subsystem: TWINHEAD INTERNATIONAL Corp Unknown device a003
Flags: bus master, medium devsel, latency 64, IRQ 10
I/O ports at d800 [size=256]
Memory at ffbfec00 (32-bit, non-prefetchable) [size=256]
Capabilities: [50] Power Management version 2
Oops. Sorry, the lspci output was the wrong one, and I missed a zero in the bar parameter. But the new address that Linux chose also worked. If 0x80000000 is given, Linux uses it. Is it possible to add a quirk that reserves the memory if an H12Y is detected by PCI subvendor/subdevice ID, until a better generic solution is found?

I've been using pci=use_crs for a while, but now with the 2.6.39-rc series this breaks Intel's i915 modesetting, so I went back to the reserve=0xFFB00000,0x100000 method.

Done... just getting it upstream
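For readers wondering why reserve=0xFFB00000,0x100000 is the right window for this machine, the arithmetic can be sketched in a few lines of standalone C. This is an illustration only, not the kernel's own parsing code: struct region, parse_reserve_pair() and region_covers() are hypothetical helpers, the sketch handles a single "base,size" pair (the real parameter accepts several), and the BAR addresses one would check against it are the ones quoted in the dmesg excerpts earlier in this bug.

```c
#include <stdbool.h>
#include <stdint.h>
#include <stdlib.h>

/* Inclusive address range covered by one reserve= pair. */
struct region {
    uint64_t start;
    uint64_t end;
};

/* Parse one "base,size" pair such as "0xFFB00000,0x100000".
 * Returns false on malformed input or a zero size. */
static bool parse_reserve_pair(const char *arg, struct region *out)
{
    char *rest;
    const char *size_str;
    unsigned long long base, size;

    base = strtoull(arg, &rest, 0);      /* base 0 accepts the 0x prefix */
    if (rest == arg || *rest != ',')
        return false;

    size_str = rest + 1;
    size = strtoull(size_str, &rest, 0);
    if (rest == size_str || *rest != '\0' || size == 0)
        return false;

    out->start = base;
    out->end = base + size - 1;          /* last reserved address */
    return true;
}

/* True when [start, end] lies entirely inside the reserved region. */
static bool region_covers(const struct region *r,
                          uint64_t start, uint64_t end)
{
    return r->start <= start && end <= r->end;
}
```

Parsing "0xFFB00000,0x100000" gives 0xFFB00000-0xFFBFFFFF, which covers all four BIOS-assigned BARs reported above (0xffbfec00-0xffbfecff, 0xffbfe000-0xffbfe7ff, 0xffbfe800-0xffbfe8ff and 0xffbff000-0xffbfffff), while the address Windows moved the 8139 to (FFE3E600-FFE3E6FF) falls outside it. That matches the ACPI "system 00:09" region, so the reserve= value is simply that region expressed as base and size.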