Bug 41622
Summary: | [REGRESSION][BISECTED] Notebook crashes upon detecting the PCI subsystem with kernels >= 2.6.24-rc7 | ||
---|---|---|---|
Product: | Drivers | Reporter: | Rogério Brito (rbrito) |
Component: | PCI | Assignee: | drivers_pci (drivers_pci) |
Status: | RESOLVED WILL_NOT_FIX | ||
Severity: | normal | CC: | alan, bjorn, florian, rbrito, torvalds, xerofoify, yinghai |
Priority: | P1 | ||
Hardware: | All | ||
OS: | Linux | ||
URL: | http://thread.gmane.org/gmane.linux.kernel/1181761/focus=1181777 | ||
Kernel Version: | 3.4-rc2 | Subsystem: | |
Regression: | Yes | Bisected commit-id: | |
Attachments: |
output from /proc/iomem
output from /proc/ioports
output from lspci -vvxxx
output from Linus's kernel v3.1-rc3-91-ga53e77f with the noted patch
"Pseudo-revert" that makes the machine work
acpidump from this notebook
dmidecode from the notebook |
Description
Rogério Brito
2011-08-23 22:11:42 UTC
Hi. Reverting Linus' commit 12c22d6ef299ccf0955e5756eb57d90d7577ac68 makes the notebook bootable again, but that's (quite possibly) not something correct and I would like to collaborate to get a proper fix. Again, I am willing to test anything to make this notebook work with a recent kernel version, so that I can put here a supported distribution. Thanks. (switched to email. Please respond via emailed reply-to-all, not via the bugzilla web interface). On Tue, 23 Aug 2011 22:11:46 GMT bugzilla-daemon@bugzilla.kernel.org wrote: > https://bugzilla.kernel.org/show_bug.cgi?id=41622 > > URL: http://thread.gmane.org/gmane.linux.kernel/1181761/foc > us=1181777 > Summary: [REGRESSION][BISECTED] Notebook crashes upon detecting > the PCI subsystem with kernels >= 2.6.24-rc7 > Product: Drivers > Version: 2.5 > Kernel Version: 2.6.24-rc7+ > Platform: All > OS/Version: Linux > Tree: Mainline > Status: NEW > Severity: normal > Priority: P1 > Component: PCI > AssignedTo: drivers_pci@kernel-bugs.osdl.org > ReportedBy: rbrito@ime.usp.br > Regression: Yes > > > Hi. > > My mom has a notebook that is quite quirky and I have had a hard > time trying to install any Linux distribution on it. > > This is a notebook that works fine with Windows Vista, but she > wants to use Linux and wipe Windows completely. The notebook has > an nVidia C51 chipset (nForce), a "forcedeth" network card, with > a Geforce Go 6100 video card and an AMD Sempron 3400+. > > After *many* trial and error sessions, I found that kernel 2.6.24 worked > with this notebook, but *anything* more recent (in particular, even Debian > stable is "too recent" for it) just made it hang right after the io > schedulers were set up. > > In other words, a regression. > > >From my dmesg logs, the next thing after the io schedulers seems > to be the PCI devices. 
The first lines that appear right after > the io scheduler ones are: > > ,---- > | [ 0.207345] io scheduler cfq registered (default) > | [ 0.207407] pci 0000:00:05.0: Boot video device > | [ 0.223320] PCI: Setting latency timer of device 0000:00:02.0 to 64 > | [ 0.223320] assign_interrupt_mode Found MSI capability > | [ 0.223320] Allocate Port Service[0000:00:02.0:pcie00] > | [ 0.223320] PM: Adding info for pci_express:0000:00:02.0:pcie00 > | [ 0.223320] PCI: Setting latency timer of device 0000:00:03.0 to 64 > | [ 0.223320] assign_interrupt_mode Found MSI capability > | [ 0.223320] Allocate Port Service[0000:00:03.0:pcie00] > | [ 0.223320] PM: Adding info for pci_express:0000:00:03.0:pcie00 > | [ 0.223320] PM: Adding info for platform:vesafb.0 > `---- > > I took the laborious task of `git bisecting` the kernel (more than 30 > recompiles, and fsck'ing the disc), I found that the first bad commit > (12c22d6e), a revert by Linus: > > > ,----[ git bisect bad ] > | 12c22d6ef299ccf0955e5756eb57d90d7577ac68 is the first bad commit > | commit 12c22d6ef299ccf0955e5756eb57d90d7577ac68 > | Author: Linus Torvalds <torvalds@linux-foundation.org> > | Date: Wed Mar 26 11:22:40 2008 -0700 > | > | Revert "PCI: remove transparent bridge sizing" > | > | This reverts commit 8fa5913d54f3b1e09948e6a0db34da887e05ff1f, which > | caused various interesting problems for people, including wrong > resource > | allocations. See for example bugzilla entry "2.6.25-rc2: ohci1394 > | problem (MMIO broken)" at > | > | http://bugzilla.kernel.org/show_bug.cgi?id=10080 > | > | And Gary Hade says: > | > | "The same change had also exposed an issue reported by Paul Martin > that > | has been causing an Oops while hotplugging ThinkPads to a ThinkPad > | Dock II. 
See > | > | http://lkml.org/lkml/2008/2/19/405 > | http://bugzilla.kernel.org/show_bug.cgi?id=9961 > | > | I have a fix for the ThinkPad docking Oops but if the issue being > | discussed here is caused by the transparent bridge sizing removal > | change I totally agree that it should be reverted." > | > | The transparent bridge sizing removal change was motivated by > | insufficient PCI memory resource for a transparent bridge window that > | was being created as a result of expansion ROM(s) being included in > | the transparent bridge sizing calculations. > | > | A later "PCI: Remove default PCI expansion ROM memory allocation" > | change ( re: http://lkml.org/lkml/2007/12/11/361 ) removes the > | expansion ROM(s) from the transparent bridge sizing calculations > which > | actually resolves the original issue in a different manner. So, even > | if the "PCI: remove transparent bridge sizing" is not problematic it > | is no longer needed anyway." > | > | Identified-by: Ivan Kokshaysky <ink@jurassic.park.msu.ru> > | Tested-by: Thomas Meyer <thomas@m3y3r.de> > | Acked-by: Gary Hade <garyhade@us.ibm.com> > | Acked-by: Ingo Molnar <mingo@elte.hu> > | Cc: Stefan Richter <stefanr@s5r6.in-berlin.de> > | Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> > | > | :040000 040000 e3d0046e6ce6dc4e3bb6114db61336125fab9ea7 > c60b1fdbdc2bda47dbb822b0a741d8254a7d8afa M drivers > `---- > > I am willing to test anything to make this notebook work with a recent > kernel version, so that I can prepare this to her use with a modern > distribution. > > Please, advise if you need any dmesg, lspci outputs, acpi dumps, etc. You > tell me, and I try my best to get whatever you want me. Hi, Andrew and other people. Thanks for taking the time to look at that regression. 
Replying with a little bit more of information: On Tue, Aug 23, 2011 at 19:25, Andrew Morton <akpm@linux-foundation.org> wrote: > On Tue, 23 Aug 2011 22:11:46 GMT > bugzilla-daemon@bugzilla.kernel.org wrote: > >> https://bugzilla.kernel.org/show_bug.cgi?id=41622 (...) >> I took the laborious task of `git bisecting` the kernel (more than 30 >> recompiles, and fsck'ing the disc), I found that the first bad commit >> (12c22d6e), a revert by Linus: If I revert Linus's revert (12c22d6e), then I am able to get the Linux 3.1-rc3 running on that computer and to actually use things. I can provide any extra data that may be useful and I am willing to perform tests on it to get a proper fix. Thanks, Rogério, thanks for doing all the work of bisecting. I know that's slow and painful. Can you please attach the complete dmesg log from a current upstream kernel with 12c22d6ef2 reverted? While you're at it, you might also try a boot of plain vanilla upstream with the "pci=use_crs" kernel option. Dear Bjorn, Thank you very much for your reply. It will be quite good to have the hope of this being supported in future distributions. On Tue, Aug 23, 2011 at 20:55, Bjorn Helgaas <bhelgaas@google.com> wrote: > Rogério, thanks for doing all the work of bisecting. I know that's > slow and painful. Can you please attach the complete dmesg log from a > current upstream kernel with 12c22d6ef2 reverted? OK, here you are. > While you're at it, you might also try a boot of plain vanilla upstream with > the > "pci=use_crs" kernel option. I'm just compiling a new, vanilla kernel and I will post back the result of booting with that kernel option. 
Thanks again, -- Rogério Brito : rbrito@{ime.usp.br,gmail.com} : GPG key 4096R/BCFCAAAA http://rb.doesntexist.org : Packages for LaTeX : algorithms.berlios.de DebianQA: http://qa.debian.org/developer.php?login=rbrito%40ime.usp.br On Tue, Aug 23, 2011 at 4:55 PM, Bjorn Helgaas <bhelgaas@google.com> wrote: > Rogério, thanks for doing all the work of bisecting. I know that's > slow and painful. Can you please attach the complete dmesg log from a > current upstream kernel with 12c22d6ef2 reverted? While you're at it, > you might also try a boot of plain vanilla upstream with the > "pci=use_crs" kernel option. Sadly, there's no way we'll revert that revert right now, but we could certainly try it in the next merge window. We've made enough changes to the resource allocations since, that maybe we can try that "remove transparent bridge sizing" thing again. That said, it would be really interesting to also see the output from the *broken* kernel, if there is some way to get it out of it sanely. Like a serial console or so, with PCI debugging enabled (CONFIG_PCI_DEBUG), and perhaps "pci=earlydump". There's probably some hidden system IO port or something that we just happen to stomp on given some random resource allocation. And it would be nice to try to figure it out, and fix it for real, rather than just have a "ok, we can make things work by randomly changing PCI allocations".. Linus Why are you booting with "acpi=off pnpbios=off noapic nolapic"? None of those should be necessary. If you need those to work around some other problem, we should fix that, too. Can you try again without any options? 2011/8/19 Rogério Brito <rbrito@ime.usp.br>: > > Reverting the commit above with the patch below makes me able to > compile and run Linus's v3.1-rc2: Oh, I just noticed that the "revert" you did actually does way more than revert. 
> diff --cc drivers/pci/setup-bus.c
> index 8a1d3c7,125e7b7..0000000
> --- a/drivers/pci/setup-bus.c
> +++ b/drivers/pci/setup-bus.c
> @@@ -783,16 -486,14 +783,14 @@@ void __ref __pci_bus_size_bridges(struc
>                         break;
>
>                 case PCI_CLASS_BRIDGE_PCI:
> +                       /* don't size subtractive decoding (transparent)
> +                        * PCI-to-PCI bridges */
> +                       if (bus->self->transparent)
> +                               break;

The above is the real revert.

The below should be totally independent, and I'd like to make sure
that you test the revert without this change:

>                         pci_bridge_check_ranges(bus);
> -                       if (bus->self->is_hotplug_bridge) {
> -                               additional_io_size = pci_hotplug_io_size;
> -                               additional_mem_size = pci_hotplug_mem_size;
> -                       }
> -                       /*
> -                        * Follow thru
> -                        */
> +                       /* fall through */
>                 default:
> -                       pbus_size_io(bus);
> +                       pbus_size_io(bus, 0, additional_io_size, add_head);

And in fact I think that last line is just broken, you can't apply
that on my current -git. What's going on?

Also, I'd like to see the output of:

 - cat /proc/iomem
 - cat /proc/ioports
 - /sbin/lspci -vvxxx

from that machine.

And Bjorn asked for a full dmesg, and I see that email, but it didn't
get updated into the bugzilla entry (apparently bugzilla is not smart
enough to take email attachments and make them bugzilla attachments).
Rogério, can you do that so that it doesn't get lost?

Linus

Hi there, Linus, and people.

On Aug 24 2011, Linus Torvalds wrote:
> 2011/8/19 Rogério Brito <rbrito@ime.usp.br>:
> > Reverting the commit above with the patch below makes me able to
> > compile and run Linus's v3.1-rc2:
>
> Oh, I just noticed that the "revert" you did actually does way more than
> revert.

OK. I guess that when I git-bisected stuff and then pulled from your tree
some conflict arose.
Here is the "revert" that I could use and that works:

diff --git a/drivers/pci/setup-bus.c b/drivers/pci/setup-bus.c
index 784da9d..b478c7b 100644
--- a/drivers/pci/setup-bus.c
+++ b/drivers/pci/setup-bus.c
@@ -849,14 +849,12 @@ void __ref __pci_bus_size_bridges(struct pci_bus *bus,
                        break;

                case PCI_CLASS_BRIDGE_PCI:
+                       /* don't size subtractive decoding (transparent)
+                        * PCI-to-PCI bridges */
+                       if (bus->self->transparent)
+                               break;
                        pci_bridge_check_ranges(bus);
-                       if (bus->self->is_hotplug_bridge) {
-                               additional_io_size = pci_hotplug_io_size;
-                               additional_mem_size = pci_hotplug_mem_size;
-                       }
-                       /*
-                        * Follow thru
-                        */
+                       /* fall through */
                default:
                        pbus_size_io(bus, 0, additional_io_size, realloc_head);
                        /* If the bridge supports prefetchable range, size it

So there are actually two changes there: what you called the "real revert"
and the second part. I can test each separately.

> The below should be totally independent, and I'd like to make sure
> that you test the revert without this change:
>
> >                         pci_bridge_check_ranges(bus);
> > -                       if (bus->self->is_hotplug_bridge) {
> > -                               additional_io_size = pci_hotplug_io_size;
> > -                               additional_mem_size = pci_hotplug_mem_size;
> > -                       }
> > -                       /*
> > -                        * Follow thru
> > -                        */
> > +                       /* fall through */
> >                 default:
> > -                       pbus_size_io(bus);
> > +                       pbus_size_io(bus, 0, additional_io_size, add_head);
>
> And in fact I think that last line is just broken, you can't apply
> that on my current -git. What's going on?

That last line was a mistake of mine. The real changes stop at the changed
comment right before the "default:" clause of the switch.

> Also, I'd like to see the output of:
>
>  - cat /proc/iomem
>  - cat /proc/ioports
>  - /sbin/lspci -vvxxx
>
> from that machine.

OK, no problems. I'm attaching the small things here, putting them at the
following URLs, and will also attach them to bugzilla.
* /proc/iomem: http://paste.debian.net/127268/
* /proc/ioports: http://paste.debian.net/127269/
* lspci -vvxxx: http://paste.debian.net/127270/
* dmesg from kernel v3.1-rc3-91-ga53e77f: http://paste.debian.net/127271/
* the pseudo-revert: http://paste.debian.net/127272/
* acpidump: http://paste.debian.net/127275/
* dmidecode: http://paste.debian.net/127276/

> And Bjorn asked for a full dmesg, and I see that email, but it didn't get
> updated into the bugzilla entry (apparently bugzilla is not smart enough
> to take email attachments and make them bugzilla attachments). Rogério,
> can you do that so that it doesn't get lost?

Sure. Anything you ask me.

Thanks,

--
Rogério Brito : rbrito@{ime.usp.br,gmail.com} : GPG key 4096R/BCFCAAAA
http://rb.doesntexist.org : Packages for LaTeX : algorithms.berlios.de
DebianQA: http://qa.debian.org/developer.php?login=rbrito%40ime.usp.br

Created attachment 69972 [details]
output from /proc/iomem
Created attachment 69982 [details]
output from /proc/ioports
Created attachment 69992 [details]
output from lspci -vvxxx
Created attachment 70022 [details]
output from Linus's kernel v3.1-rc3-91-ga53e77f with the noted patch
Created attachment 70032 [details]
"Pseudo-revert" that makes the machine work
Created attachment 70052 [details]
acpidump from this notebook
Created attachment 70072 [details]
dmidecode from the notebook
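
Editorial note: the hex rows at the end of an `lspci -vvxxx` listing are the device's raw config space, and the CardBus bridge fields can be cross-checked by hand. The following is a minimal sketch, not a tool used by anyone in this thread. The helper names `parse_dump` and `decode_cardbus` are hypothetical; the register offsets and bit masks are the CardBus ones from the kernel's pci_regs.h, and the bytes are the first four rows of the 05:07.0 dump quoted further down in this report.

```python
import struct

# First 64 bytes of config space for 05:07.0, copied from the
# lspci -vvxxx output quoted in this bug report.
DUMP = """\
00: 4c 10 39 80 07 01 10 02 00 00 07 06 10 40 82 00
10: 00 00 20 b3 a0 00 00 02 05 06 09 b0 00 00 00 88
20: 00 f0 ff 8b 00 00 00 84 00 f0 ff 87 00 14 00 00
30: fc 14 00 00 00 10 00 00 fc 10 00 00 0a 01 44 03
"""

# CardBus bridge (header type 2) registers, as named in pci_regs.h.
PCI_CB_MEMORY_BASE_0 = 0x1C   # base at 0x1C, limit at 0x20
PCI_CB_MEMORY_BASE_1 = 0x24   # base at 0x24, limit at 0x28
PCI_CB_BRIDGE_CONTROL = 0x3E
PCI_CB_BRIDGE_CTL_PREFETCH_MEM0 = 0x100
PCI_CB_BRIDGE_CTL_PREFETCH_MEM1 = 0x200


def parse_dump(text):
    """Turn 'offset: xx xx ...' rows into a bytes object (hypothetical helper)."""
    data = bytearray()
    for line in text.strip().splitlines():
        _offset, _sep, payload = line.partition(":")
        data.extend(int(byte, 16) for byte in payload.split())
    return bytes(data)


def decode_cardbus(cfg):
    """Return (bridge control, [(base, limit, prefetchable), ...])."""
    (ctl,) = struct.unpack_from("<H", cfg, PCI_CB_BRIDGE_CONTROL)
    layout = (
        (PCI_CB_MEMORY_BASE_0, PCI_CB_BRIDGE_CTL_PREFETCH_MEM0),
        (PCI_CB_MEMORY_BASE_1, PCI_CB_BRIDGE_CTL_PREFETCH_MEM1),
    )
    windows = []
    for base_off, prefetch_bit in layout:
        # Base and limit are adjacent little-endian 32-bit registers.
        base, limit = struct.unpack_from("<II", cfg, base_off)
        windows.append((base, limit, bool(ctl & prefetch_bit)))
    return ctl, windows


if __name__ == "__main__":
    ctl, windows = decode_cardbus(parse_dump(DUMP))
    print(f"bridge control: {ctl:#06x}")
    for i, (base, limit, pref) in enumerate(windows):
        print(f"memory window {i}: {base:08x}-{limit:08x} prefetchable={pref}")
```

Run on these bytes, both memory windows come out with their prefetch bit set, which matches the "(prefetchable)" annotations lspci prints for this bridge.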
Hi Linus and other people.

2011/8/24 Rogério Brito <rbrito@ime.usp.br>:
> On Aug 24 2011, Linus Torvalds wrote:
>> Oh, I just noticed that the "revert" you did actually does way more than
>> revert.
(...)

Just for the record, I finished compiling the kernel with only the
first part (the real revert) and it works as well as it did with the
2.6.24-rc's:

diff --git a/drivers/pci/setup-bus.c b/drivers/pci/setup-bus.c
index 784da9d..1310989 100644
--- a/drivers/pci/setup-bus.c
+++ b/drivers/pci/setup-bus.c
@@ -849,6 +849,10 @@ void __ref __pci_bus_size_bridges(struct pci_bus *bus,
                        break;

                case PCI_CLASS_BRIDGE_PCI:
+                       /* don't size subtractive decoding (transparent)
+                        * PCI-to-PCI bridges */
+                       if (bus->self->transparent)
+                               break;
                        pci_bridge_check_ranges(bus);
                        if (bus->self->is_hotplug_bridge) {
                                additional_io_size = pci_hotplug_io_size;

If there is anything else that you would like me to change or to
provide any extra information, then please let me know and I will do
my best.

Thanks,

> If there is anything else that you would like me to change or to
> provide any extra information, then please let me know and I will do
> my best.
I'd still like to see a dmesg log with no arguments (remove the
"acpi=off pnpbios=off noapic nolapic" arguments). Your machine is new
enough that we'll use PCI _CRS by default, and I'd like to make sure
we're doing the right thing.
I assume that with no arguments, you still need the "skip transparent
bridge sizing" change to boot.
I don't really like that change because in __pci_bus_size_bridges(),
it's not obvious why sizing transparent bridges should be a problem.
If growing transparent bridge windows makes us run out of space, let's
put the smarts ("this bridge is transparent, we can take advantage of
subtractive decode so we may not need to grow the positive decode
windows") at the point where we grow, not at the point where we size.
If we do have enough space, growing the positive decode windows is
better because they're faster than subtractive decode.
Bjorn
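
Editorial note: the distinction Bjorn draws between positive and subtractive decode can be pictured with a toy model. This is only a sketch under simplifying assumptions: the `Bridge` class, the `route` function, the bridge names, and the first bridge's window are all hypothetical, and real hardware decides claims through bus timing rather than list order. A bridge positively decodes an address that falls inside one of its programmed windows; a transparent bridge additionally picks up whatever nobody else claimed.

```python
from dataclasses import dataclass, field


@dataclass
class Bridge:
    """Toy bridge: inclusive (start, end) windows plus a transparent flag."""
    name: str
    windows: list = field(default_factory=list)
    transparent: bool = False

    def positively_decodes(self, addr):
        return any(lo <= addr <= hi for lo, hi in self.windows)


def route(bridges, addr):
    """Return (bridge name, decode kind) for addr, or None if unclaimed."""
    # Positive decode: a bridge whose window covers the address claims it.
    for bridge in bridges:
        if bridge.positively_decodes(addr):
            return bridge.name, "positive"
    # Subtractive decode: a transparent bridge takes what is left over.
    for bridge in bridges:
        if bridge.transparent:
            return bridge.name, "subtractive"
    return None


if __name__ == "__main__":
    bridges = [
        Bridge("pcie-port", [(0xB0000000, 0xB00FFFFF)]),
        Bridge("legacy-pci", [(0xB3200000, 0xB32FFFFF)], transparent=True),
    ]
    for addr in (0xB3200000, 0xC0000000):
        print(hex(addr), route(bridges, addr))
```

In this model an address below the transparent bridge is still reachable when its window is too small, only through the fallback path, which is the case Bjorn describes as working but slower than growing the positive decode window.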
On Wed, Aug 24, 2011 at 1:51 PM, <bugzilla-daemon@bugzilla.kernel.org> wrote:
>
> Just for the record, I finished compiling the kernel with only the
> first part (the real revert) and it works as well as it did with the
> 2.6.24-rc's:

Ok. I think we should just try to do this very early in the 3.2 merge
window, and see if anybody screams.

It still would be very nice if you can get serial console output of
the broken kernel with CONFIG_PCI_DEBUG enabled, but lots of notebooks
don't even have serial ports.. So if that isn't doable, it isn't
doable.

If it just locks up, it *might* be interesting to see what the last
messages were (again, with CONFIG_PCI_DEBUG).

Linus

On Wed, Aug 24, 2011 at 2:23 PM, <bugzilla-daemon@bugzilla.kernel.org> wrote:
>
> I don't really like that change because in __pci_bus_size_bridges(),
> it's not obvious why sizing transparent bridges should be a problem.
> If growing transparent bridge windows makes us run out of space, let's
> put the smarts ("this bridge is transparent, we can take advantage of
> subtractive decode so we may not need to grow the positive decode
> windows") at the point where we grow, not at the point where we size.
> If we do have enough space, growing the positive decode windows is
> better because they're faster than subtractive decode.

Hmm. Good point.

Also, if the transparent bridge is deeper in the PCI hierarchy, we'd
still want to size any upstream bridges appropriately, rather than say
"the transparency will take everything into account". Because the
*upstream* bridges may not be transparent.

Of course, putting a transparent bridge behind a non-transparent one
sounds pretty crazy to begin with, but in PCI resource handling, crazy
is pretty much the norm...

Linus

Hi everybody.

2011/8/24 Bjorn Helgaas <bhelgaas@google.com>:
>> If there is anything else that you would like me to change or to
>> provide any extra information, then please let me know and I will do
>> my best.
>
> I'd still like to see a dmesg log with no arguments (remove the
> "acpi=off pnpbios=off noapic nolapic" arguments). Your machine is new
> enough that we'll use PCI _CRS by default, and I'd like to make sure
> we're doing the right thing.

OK. I put them there one by one, but let me report what I get with
those disabled, but *with* that small patch applied:

* "acpi=off pnpbios=off noapic": it works well, no problems with
  dropping the nolapic.
* "acpi=off noapic": it works well, with no problems dropping
  "pnpbios=off", aside from the "[Firmware Bug]: powernow-k8: No PSB or
  ACPI _PSS objects" message.
* "acpi=off": does not work well---it boots OK, but some accesses to
  disk get very long and I get the following in the dmesg log:

[ 243.389359] ata1: EH in SWNCQ mode,QC:qc_active 0x7 sactive 0x7
[ 243.389367] ata1: SWNCQ:qc_active 0x3 defer_bits 0x4 last_issue_tag 0x1
[ 243.389368] dhfis 0x3 dmafis 0x0 sdbfis 0x0
[ 243.389374] ata1: ATA_REG 0x40 ERR_REG 0x0
[ 243.389377] ata1: tag : dhfis dmafis sdbfis sactive
[ 243.389380] ata1: tag 0x0: 1 0 0 1
[ 243.389383] ata1: tag 0x1: 1 0 0 1
[ 243.389398] ata1.00: exception Emask 0x0 SAct 0x7 SErr 0x0 action 0x6 frozen
[ 243.389403] ata1.00: failed command: READ FPDMA QUEUED
[ 243.389411] ata1.00: cmd 60/08:00:c9:10:be/00:00:07:00:00/40 tag 0 ncq 4096 in
[ 243.389413] res 40/00:00:00:4f:c2/00:00:00:00:00/00 Emask 0x4 (timeout)
[ 243.389417] ata1.00: status: { DRDY }
[ 243.389420] ata1.00: failed command: READ FPDMA QUEUED
[ 243.389428] ata1.00: cmd 60/08:08:19:10:be/00:00:07:00:00/40 tag 1 ncq 4096 in
[ 243.389430] res 40/00:00:00:00:00/00:00:00:00:00/00 Emask 0x4 (timeout)
[ 243.389433] ata1.00: status: { DRDY }
[ 243.389437] ata1.00: failed command: WRITE FPDMA QUEUED
[ 243.389444] ata1.00: cmd 61/58:10:79:4f:62/00:00:06:00:00/40 tag 2 ncq 45056 out
[ 243.389446] res 40/00:00:00:00:00/00:00:00:00:00/00 Emask 0x4 (timeout)
[ 243.389450] ata1.00: status: { DRDY }
[ 243.389457] ata1: hard resetting link
[ 243.389460] ata1: nv:
skipping hardreset on occupied port [ 244.289547] ata1: SATA link up 1.5 Gbps (SStatus 113 SControl 300) [ 244.306186] Clocksource tsc unstable (delta = 429752755 ns) [ 244.306214] Switching to clocksource jiffies [ 244.309862] ata1.00: configured for UDMA/100 [ 244.309862] ata1.00: device reported invalid CHS sector 0 [ 244.309862] ata1.00: device reported invalid CHS sector 0 [ 244.309862] ata1.00: device reported invalid CHS sector 0 [ 244.309862] sd 0:0:0:0: [sda] Result: hostbyte=DID_OK driverbyte=DRIVER_SENSE [ 244.309862] sd 0:0:0:0: [sda] Sense Key : Aborted Command [current] [descriptor] [ 244.309862] Descriptor sense data with sense descriptors (in hex): [ 244.309862] 72 0b 00 00 00 00 00 0c 00 0a 80 00 00 00 00 00 [ 244.309862] 00 00 00 00 [ 244.309862] sd 0:0:0:0: [sda] Add. Sense: No additional sense information [ 244.309862] sd 0:0:0:0: [sda] CDB: Read(10): 28 00 07 be 10 c9 00 00 08 00 [ 244.309862] end_request: I/O error, dev sda, sector 129896649 [ 244.309862] ata1: EH complete [ 293.834434] ata1: EH in SWNCQ mode,QC:qc_active 0x7FF sactive 0x7FF [ 293.834442] ata1: SWNCQ:qc_active 0x1F defer_bits 0x7E0 last_issue_tag 0x4 [ 293.834444] dhfis 0x1F dmafis 0x0 sdbfis 0x0 [ 293.834449] ata1: ATA_REG 0x40 ERR_REG 0x0 [ 293.834452] ata1: tag : dhfis dmafis sdbfis sactive [ 293.834456] ata1: tag 0x0: 1 0 0 1 [ 293.834459] ata1: tag 0x1: 1 0 0 1 [ 293.834462] ata1: tag 0x2: 1 0 0 1 [ 293.834465] ata1: tag 0x3: 1 0 0 1 [ 293.834468] ata1: tag 0x4: 1 0 0 1 [ 293.834483] ata1.00: exception Emask 0x0 SAct 0x7ff SErr 0x0 action 0x6 frozen [ 293.834488] ata1.00: failed command: WRITE FPDMA QUEUED [ 293.834496] ata1.00: cmd 61/80:00:49:eb:6e/00:00:06:00:00/40 tag 0 ncq 65536 out [ 293.834498] res 40/00:00:00:4f:c2/00:00:00:00:00/00 Emask 0x4 (timeout) [ 293.834502] ata1.00: status: { DRDY } [ 293.834505] ata1.00: failed command: WRITE FPDMA QUEUED [ 293.834513] ata1.00: cmd 61/00:08:c9:f2:6e/02:00:06:00:00/40 tag 1 ncq 262144 out [ 293.834515] res 
40/00:00:00:00:00/00:00:00:00:00/00 Emask 0x4 (timeout) [ 293.834518] ata1.00: status: { DRDY } [ 293.834522] ata1.00: failed command: WRITE FPDMA QUEUED [ 293.834530] ata1.00: cmd 61/c8:10:c9:f5:6e/01:00:06:00:00/40 tag 2 ncq 233472 out [ 293.834531] res 40/00:00:00:00:00/00:00:00:00:00/00 Emask 0x4 (timeout) (...) [ 395.538669] ata1: tag 0x9: 1 0 0 1 [ 395.538672] ata1: tag 0xa: 1 0 0 1 [ 395.538675] ata1: tag 0xb: 1 0 0 1 [ 395.538678] ata1: tag 0xc: 1 0 0 1 [ 395.538681] ata1: tag 0xd: 1 0 0 1 [ 395.538684] ata1: tag 0xe: 0 0 0 1 [ 395.538697] ata1.00: NCQ disabled due to excessive errors [ 395.538703] ata1.00: exception Emask 0x0 SAct 0x7fe00 SErr 0x0 action 0x6 frozen [ 395.538708] ata1.00: failed command: WRITE FPDMA QUEUED [ 395.538717] ata1.00: cmd 61/80:48:49:34:8e/00:00:06:00:00/40 tag 9 ncq 65536 out [ 395.538718] res 40/00:00:00:00:00/00:00:00:00:00/00 Emask 0x4 (timeout) [ 395.538722] ata1.00: status: { DRDY } [ 395.538725] ata1.00: failed command: WRITE FPDMA QUEUED [ 395.538733] ata1.00: cmd 61/00:50:c9:37:8e/01:00:06:00:00/40 tag 10 ncq 131072 out [ 395.538735] res 40/00:00:00:00:00/00:00:00:00:00/00 Emask 0x4 (timeout) [ 395.538739] ata1.00: status: { DRDY } [ 395.538742] ata1.00: failed command: WRITE FPDMA QUEUED [ 395.538749] ata1.00: cmd 61/80:58:b9:37:7b/00:00:06:00:00/40 tag 11 ncq 65536 out [ 395.538751] res 40/00:00:00:00:00/00:00:00:00:00/00 Emask 0x4 (timeout) [ 395.538755] ata1.00: status: { DRDY } [ 395.538758] ata1.00: failed command: WRITE FPDMA QUEUED [ 395.538766] ata1.00: cmd 61/08:60:c9:18:8e/00:00:06:00:00/40 tag 12 ncq 4096 out [ 395.538767] res 40/00:00:00:00:00/00:00:00:00:00/00 Emask 0x4 (timeout) [ 395.538771] ata1.00: status: { DRDY } [ 395.538774] ata1.00: failed command: WRITE FPDMA QUEUED [ 395.538782] ata1.00: cmd 61/40:68:89:43:8e/00:00:06:00:00/40 tag 13 ncq 32768 out [ 395.538783] res 40/00:00:00:00:00/00:00:00:00:00/00 Emask 0x4 (timeout) [ 395.538787] ata1.00: status: { DRDY } [ 395.538790] ata1.00: failed 
command: WRITE FPDMA QUEUED [ 395.538798] ata1.00: cmd 61/08:70:d9:ff:6e/00:00:06:00:00/40 tag 14 ncq 4096 out [ 395.538800] res 40/00:00:00:00:00/00:00:00:00:00/00 Emask 0x4 (timeout) [ 395.538803] ata1.00: status: { DRDY } [ 395.538806] ata1.00: failed command: WRITE FPDMA QUEUED [ 395.538814] ata1.00: cmd 61/08:78:a9:34:6a/00:00:06:00:00/40 tag 15 ncq 4096 out [ 395.538816] res 40/00:00:00:00:00/00:00:00:00:00/00 Emask 0x4 (timeout) [ 395.538819] ata1.00: status: { DRDY } [ 395.538822] ata1.00: failed command: WRITE FPDMA QUEUED [ 395.538830] ata1.00: cmd 61/10:80:71:d7:6f/00:00:06:00:00/40 tag 16 ncq 8192 out [ 395.538832] res 40/00:00:00:00:00/00:00:00:00:00/00 Emask 0x4 (timeout) [ 395.538835] ata1.00: status: { DRDY } [ 395.538839] ata1.00: failed command: WRITE FPDMA QUEUED [ 395.538846] ata1.00: cmd 61/08:88:61:ff:6d/00:00:06:00:00/40 tag 17 ncq 4096 out [ 395.538848] res 40/00:00:00:00:00/00:00:00:00:00/00 Emask 0x4 (timeout) [ 395.538852] ata1.00: status: { DRDY } [ 395.538855] ata1.00: failed command: WRITE FPDMA QUEUED [ 395.538862] ata1.00: cmd 61/08:90:e9:62:62/01:00:06:00:00/40 tag 18 ncq 135168 out [ 395.538864] res 40/00:00:00:00:00/00:00:00:00:00/00 Emask 0x4 (timeout) [ 395.538868] ata1.00: status: { DRDY } [ 395.538874] ata1: hard resetting link [ 395.538877] ata1: nv: skipping hardreset on occupied port [ 396.005559] ata1: SATA link up 1.5 Gbps (SStatus 113 SControl 300) [ 396.025796] ata1.00: configured for UDMA/100 [ 396.025796] ata1.00: device reported invalid CHS sector 0 [ 396.025796] ata1.00: device reported invalid CHS sector 0 [ 396.025796] ata1.00: device reported invalid CHS sector 0 [ 396.025796] ata1.00: device reported invalid CHS sector 0 [ 396.025796] ata1.00: device reported invalid CHS sector 0 [ 396.025796] ata1.00: device reported invalid CHS sector 0 [ 396.025796] ata1.00: device reported invalid CHS sector 0 [ 396.025796] ata1.00: device reported invalid CHS sector 0 [ 396.025796] ata1.00: device reported invalid CHS 
sector 0
[ 396.025796] ata1.00: device reported invalid CHS sector 0
[ 396.025796] ata1: EH complete

During that, the computer is frozen (well, not actually, the trackpad
is able to move the mouse). For all the details, please see the full
dmesg attached to this mail.

After that, it seems that I can use the computer and it doesn't happen
anymore, but I get a very high amount (98+% of time) of hardware
interrupts happening while I am using X (not sure yet when I quit X).
See the attached /proc/interrupts.

* without any boot options: without "acpi=off", the machine just hangs
  at (hand copied, but photographed, if you want it):

(...)
CPU: Mobile AMD Sempron (tm) Processor 3400+ stepping 02
ACPI: Core revision 20110623

And nothing else happens.

> I assume that with no arguments, you still need the "skip transparent
> bridge sizing" change to boot.
>
> I don't really like that change because in __pci_bus_size_bridges(),
> it's not obvious why sizing transparent bridges should be a problem.
> If growing transparent bridge windows makes us run out of space, let's
> put the smarts ("this bridge is transparent, we can take advantage of
> subtractive decode so we may not need to grow the positive decode
> windows") at the point where we grow, not at the point where we size.
> If we do have enough space, growing the positive decode windows is
> better because they're faster than subtractive decode.

I'm afraid that I have understood almost nothing of what you just said
:-), but I will try to read some of the code.

Oh, Linus, all this is with "CONFIG_PCI_DEBUG=y" since the very
beginning.
Thanks,

--
Rogério Brito : rbrito@{ime.usp.br,gmail.com} : GPG key 4096R/BCFCAAAA
http://rb.doesntexist.org : Packages for LaTeX : algorithms.berlios.de
DebianQA: http://qa.debian.org/developer.php?login=rbrito%40ime.usp.br

On Thu, Aug 25, 2011 at 8:49 AM, Rogério Brito <rbrito@ime.usp.br> wrote:
> 2011/8/24 Bjorn Helgaas <bhelgaas@google.com>:
>> I'd still like to see a dmesg log with no arguments (remove the
>> "acpi=off pnpbios=off noapic nolapic" arguments). Your machine is new
>> enough that we'll use PCI _CRS by default, and I'd like to make sure
>> we're doing the right thing.
>
> OK. I put them there one by one, but let me report what I get with
> those disabled, but *with* that small patch applied:
>
> * "acpi=off pnpbios=off noapic": it works well, no problems with
> dropping the nolapic.
> * "acpi=off noapic": it works well, with no problems dropping
> "pnpbios=off", aside from the "[Firmware Bug]: powernow-k8: No PSB or
> ACPI _PSS objects" message.
> * "acpi=off": does not work well---it boots OK, but some accesses to
> disk get very long and I get the following in the dmesg log:
> ...
> * without any boot options: without "acpi=off", the machine just hangs
> at (hand copied, but photographed, if you want it):
>
> (...)
> CPU: Mobile AMD Sempron (tm) Processor 3400+ stepping 02
> ACPI: Core revision 20110623
>
> And nothing else happens.

This is a serious problem, and while we probably need to fix both this
and the PCI bridge issue, this is the first thing you hit, so I'd like
to look at it first. Users should never have to use "acpi=off" except
for debugging purposes.

I opened a separate bug report for it:
https://bugzilla.kernel.org/show_bug.cgi?id=41722

Would you build your kernel with "CONFIG_ACPI_DEBUG=y" and boot with
the kernel options "ignore_loglevel acpi.debug_layer=0xffffffff
acpi.debug_level=0xffffffff" and see if we can learn anything? If
that's too much output, try acpi.debug_layer=0x59f instead.
>> I don't really like that change because in __pci_bus_size_bridges(),
>> it's not obvious why sizing transparent bridges should be a problem.
>> If growing transparent bridge windows makes us run out of space, let's
>> put the smarts ("this bridge is transparent, we can take advantage of
>> subtractive decode so we may not need to grow the positive decode
>> windows") at the point where we grow, not at the point where we size.
>> If we do have enough space, growing the positive decode windows is
>> better because they're faster than subtractive decode.
>
> I'm afraid that I have understood almost nothing of what you just said
> :-), but I will try to read some of the code.

That's all right; that was intended for the PCI people who might care
about how this is implemented.

Bjorn

Both MMIO windows of the cardbus bridge are set to prefetchable by the
BIOS. That is not right: the kernel needs MMIO1 to be non-prefetchable.

05:07.0 CardBus bridge: Texas Instruments PCIxx12 Cardbus Controller
        Subsystem: CLEVO/KAPOK Computer Device 5407
        Control: I/O+ Mem+ BusMaster+ SpecCycle- MemWINV- VGASnoop- ParErr- Stepping- SERR+ FastB2B- DisINTx-
        Status: Cap+ 66MHz- UDF- FastB2B- ParErr- DEVSEL=medium >TAbort- <TAbort- <MAbort- >SERR- <PERR- INTx-
        Latency: 64, Cache Line Size: 64 bytes
        Interrupt: pin A routed to IRQ 10
        Region 0: Memory at b3200000 (32-bit, non-prefetchable) [size=4K]
        Bus: primary=05, secondary=06, subordinate=09, sec-latency=176
        Memory window 0: 88000000-8bfff000 (prefetchable)
        Memory window 1: 84000000-87fff000 (prefetchable)
        I/O window 0: 00001400-000014ff
        I/O window 1: 00001000-000010ff
        BridgeCtl: Parity- SERR- ISA+ VGA- MAbort- >Reset+ 16bInt- PostWrite-
        16-bit legacy interface ports at 0001
00: 4c 10 39 80 07 01 10 02 00 00 07 06 10 40 82 00
10: 00 00 20 b3 a0 00 00 02 05 06 09 b0 00 00 00 88
20: 00 f0 ff 8b 00 00 00 84 00 f0 ff 87 00 14 00 00
30: fc 14 00 00 00 10 00 00 fc 10 00 00 0a 01 44 03
40: 58 15 07 54 01 00 00 00 00 00 00 00 00 00 00 00
50: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
60: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
70: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
80: 60 90 44 18 19 00 00 02 00 00 1f 00 22 1b 08 00
90: c0 00 64 60 00 00 00 00 00 00 00 00 00 00 00 00
a0: 01 00 12 fe 00 00 c0 00 00 00 00 00 00 00 00 00
b0: 00 00 00 08 00 00 00 00 00 00 00 00 00 00 00 00
c0: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
d0: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
e0: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
f0: 2b 37 48 3c 50 9b 01 41 00 00 00 00 00 00 00 00

A patch referencing this bug report has been merged in Linux v3.4-rc1:

commit dcef0d06b34a80071da4496556e85f9bf3b3c0bf
Author: Yinghai Lu <yinghai@kernel.org>
Date: Fri Feb 10 15:33:46 2012 -0800

    PCI: Disable cardbus bridge MEM1 prefetchable bit

I'm reopening this bug because (very unfortunately) I still see this bug
with Debian's 3.4 kernel.

I can rerun a batch of tests with this. Please let me know what you'd like
me to do and I will resume where we left off.

Thanks,

Rogério Brito.

Hi.

Just for the record, I just compiled 2 versions of Linus's
3.5.0-rc3-00203-g8874e81.

The "vanilla" version of the kernel crashes exactly as before, with no
visible changes.

If, on the other hand, I apply the patch mentioned in comment #17 of this
bug report, then everything boots and works OK (for suitable values of
"OK").

I have taken logs of the situation when the boot succeeds and I can post
them here if wanted.

Thanks,

Rogério Brito.

On Fri, Jun 22, 2012 at 5:15 AM, <bugzilla-daemon@bugzilla.kernel.org> wrote:
>
> If, on the other hand, I apply the patch mentioned in comment #17 of this bug
> report, then everything boots and works OK (for suitable values of "OK").

Gaah. These emails always happen at the most inconvenient time.

Maybe we should just try that patch in

  https://bugzilla.kernel.org/show_bug.cgi?id=41622#c17

but it really needs to happen early in a merge window. We didn't do
that for 3.2, maybe we can do it for 3.6?

Bjorn, what do you think?
Sizing up transparent bridges really is a bit unnecessary. Although I really don't see what the difference there is any more. Yinghai noticed that the prefetchability of the cardbus bridge changed, but the commit that changed that apparently made no difference. What other changes does the transparent bridge sizing cause?

Linus

> I have taken logs of the situation when the boot succeeds and I can post them
> here if wanted.
Yes, please.
I have some resource allocation patches that are still pending; I hope
they make some difference.
Please check them at
git://git.kernel.org/pub/scm/linux/kernel/git/yinghai/linux-yinghai.git
for-pci-res-alloc
Also, your system has
[ 0.126698] pci_bus 0000:06: [bus 06-09] partially hidden behind transparent bridge 0000:05 [bus 05-06]
[ 0.126784] pci_bus 0000:05: bus scan returning with max=09
so I'd like to see whether busn_alloc could update the bus range for you.
git://git.kernel.org/pub/scm/linux/kernel/git/yinghai/linux-yinghai.git
for-pci-busn-alloc
Thanks
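
For readers following along, the "partially hidden" message quoted above can be checked by hand: the child range [bus 06-09] extends past the [bus 05-06] range of the transparent bridge above it. A minimal Python sketch, assuming only the log format quoted in this comment (the name hidden_buses and the regular expression are made up for illustration and are not kernel code), lists the bus numbers that fall outside the parent range:

```python
import re

# Log lines as quoted in this thread, with the wrapped first line rejoined.
LOG = (
    "[ 0.126698] pci_bus 0000:06: [bus 06-09] partially hidden behind "
    "transparent bridge 0000:05 [bus 05-06]\n"
    "[ 0.126784] pci_bus 0000:05: bus scan returning with max=09\n"
)

# Hypothetical pattern written for this example; it only matches the
# "partially hidden behind transparent bridge" format shown above.
HIDDEN_RE = re.compile(
    r"\[bus ([0-9a-f]{2})-([0-9a-f]{2})\]\s+partially hidden behind\s+"
    r"transparent bridge\s+\S+\s+\[bus ([0-9a-f]{2})-([0-9a-f]{2})\]"
)


def hidden_buses(line):
    """Return the child bus numbers that lie outside the parent's range."""
    match = HIDDEN_RE.search(line)
    if match is None:
        return []
    # Bus numbers are printed in hexadecimal.
    child_lo, child_hi, parent_lo, parent_hi = (
        int(group, 16) for group in match.groups()
    )
    return [
        bus
        for bus in range(child_lo, child_hi + 1)
        if not parent_lo <= bus <= parent_hi
    ]


for line in LOG.splitlines():
    buses = hidden_buses(line)
    if buses:
        print("not covered by parent range:", " ".join(f"{b:02x}" for b in buses))
```

With the ranges from this machine the sketch reports buses 07 to 09, which agrees with the subordinate=09 value shown for the CardBus bridge in the lspci output earlier in the thread.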
> Maybe we should just try that patch in
>
>    https://bugzilla.kernel.org/show_bug.cgi?id=41622#c17
>
> but it really needs to happen early in a merge window. We didn't do
> that for 3.2, maybe we can do it for 3.6?

The first problem is still that this machine doesn't get anywhere at all with ACPI enabled (https://bugzilla.kernel.org/show_bug.cgi?id=41722), so we only get to this transparent bridge issue with "acpi=off".

I really think that it would be better to solve the ACPI issue first because it's too difficult to support an "acpi=off" config -- nobody else tests that and very few users will be sophisticated enough to even try it. Unfortunately, I don't have any good ideas about the ACPI issue other than the MTRR possibility Len suggested.

If we just fiddle with transparent bridge sizing, we're making the problem go away, but we don't really know why, so I'm afraid of breaking some other machine or tripping over this somewhere else. My guess is that we hang because we moved a device on top of something else, and if we skip transparent bridge sizing, we avoid that device move.

I do think it might be reasonable to avoid bridge sizing on the grounds that we shouldn't move things unless we have to. But right now, we don't even know the real effect of skipping the bridge sizing.

Rogério, we might be able to at least identify a relevant difference if you collect the dmesg from the "working" case ("acpi=off" with the comment #17 patch), a video of the hanging case ("acpi=off" without the patch, possibly with "boot_delay=1000" to slow things down) that we can transcribe by hand, and the AIDA64 info from Windows that I mentioned in bug 41722.

I would recommend testing this with a newer kernel to see if this is fixed as of kernel 3.15.1.
Thanks Nick

Hi, Nick and others.

On Jun 25 2014, bugzilla-daemon@bugzilla.kernel.org wrote:
> https://bugzilla.kernel.org/show_bug.cgi?id=41622
> --- Comment #30 from xerofoify@gmail.com ---
> I would recommend testing this with a newer kernel
> to see if this is fixed as of kernel 3.15.1.
> Thanks Nick

I will try to test this soon, but if I don't respond, please ping me. I'm overloaded with stuff "in real life".

Thanks,

I think everybody has given up on this bug. If anybody is still interested, please reopen it.