Bug 7495
| Summary: | Kernel periodically hangs. | | |
|---|---|---|---|
| Product: | Other | Reporter: | Alex (alex) |
| Component: | Other | Assignee: | other_other |
| Status: | REJECTED INSUFFICIENT_DATA | | |
| Severity: | blocking | CC: | akpm, bunk |
| Priority: | P2 | | |
| Hardware: | i386 | | |
| OS: | Linux | | |
| Kernel Version: | Linux version 2.6.18.2 (root@pub) (gcc version 3.4.6) #13 SMP Fr | Subsystem: | |
| Regression: | --- | Bisected commit-id: | |
Description
Alex
2006-11-11 03:28:06 UTC
On Sat, 11 Nov 2006 03:29:32 -0800 bugme-daemon@bugzilla.kernel.org wrote:
> http://bugzilla.kernel.org/show_bug.cgi?id=7495
>
> Summary: Kernel periodically hangs.
> Kernel Version: Linux version 2.6.18.2 (root@pub) (gcc version 3.4.6)
> #13 SMP Fr
> Status: NEW
> Severity: blocking
> Owner: other_other@kernel-bugs.osdl.org
> Submitter: alex@hausnet.ru
>
>
> [42587.676000] BUG: unable to handle kernel NULL pointer dereference at
> virtual address 0000003c
> [42587.680000] printing eip:
> [42587.680000] 781610e7
> [42587.680000] *pde = 00000000
> [42587.680000] Oops: 0000 [#1]
> [42587.684000] SMP
> [42587.684000] Modules linked in: sata_promise sk98lin 8250_pnp 8250
> i2c_nforce2 ehci_hcd serial_core sata_nv ahci i2c_core ohci_hcd forcedeth
> libata
> [42587.688000] CPU: 1
> [42587.688000] EIP: 0060:[<781610e7>] Not tainted VLI
> [42587.688000] EFLAGS: 00010286 (2.6.18.2 #13)
> [42587.692000] EIP is at clear_inode+0x96/0xce
> [42587.692000] eax: 00000000 ebx: c0102240 ecx: f7f278d4 edx: f510d400
> [42587.692000] esi: c0102384 edi: f7e6dec0 ebp: 00000070 esp: f7e6de98
> [42587.696000] ds: 007b es: 007b ss: 0068
> [42587.696000] Process kswapd0 (pid: 230, ti=f7e6c000 task=f7c03560
> task.ti=f7e6c000)
> [42587.696000] Stack: c0102248 c0102240 7816116a da7b4af0 da7b4af8 00000000
> 00000080 781614a2
> [42587.700000] 00000080 00000080 c01023f8 ef78dca8 00000000 00009858
> 00000083 f7fee560
> [42587.700000] 781614c8 7813a643 00261600 00000000 00009858 00000005
> 00000000 00000000
> [42587.700000] Call Trace:
> [42587.704000] [<7816116a>] dispose_list+0x4b/0xc1
> [42587.708000] [<781614a2>] prune_icache+0x17c/0x18e
> [42587.708000] [<781614c8>] shrink_icache_memory+0x14/0x2b
> [42587.708000] [<7813a643>] shrink_slab+0x130/0x18c
> [42587.712000] [<7813b75a>] balance_pgdat+0x1ea/0x2dd
> [42587.712000] [<7813b933>] kswapd+0xe6/0xe8
> [42587.716000] [<781261dc>] kthread+0x7d/0xa1
> [42587.716000] [<78100e05>] kernel_thread_helper+0x5/0xb

I've seen three or four reports of oopses like this in 2.6.18. I have a
suspicion we broke something.

> Kernel started with noapic option, cause it hangs on load without this option.

Him and a million other people. I know we broke APIC. Around 2.6.9, I think.

> > Kernel started with noapic option, cause it hangs on load without this
> option.
>
> Him and a million other people. I know we broke APIC. Around 2.6.9, I
> think.
is that when the "enable apic even on UP so that distro kernels can
install on the ibm x44*" patches went in?
On Sat, 11 Nov 2006 19:10:03 +0100 Arjan van de Ven <arjan@infradead.org> wrote:
> > > Kernel started with noapic option, cause it hangs on load without this option.
> >
> > Him and a million other people. I know we broke APIC. Around 2.6.9, I
> > think.
>
> is that when the "enable apic even on UP so that distro kernels can
> install on the ibm x44*" patches went in?

I don't know. In fact I forget how I worked out that it worsened in
2.6.early.

google(noapic) gets 232,000 hits.

I don't think it really matters when or why it happened. If we take the
approach of fixing one machine at a time, we'll only need to fix a few
individual machines to improve the situation for a lot of people.

> I don't know. In fact I forget how I worked out that it worsened in
> 2.6.early.
>
> google(noapic) gets 232,000 hits.

is there a way to ask google "only stuff in the last year"?
Asking because "noapic" in 2.4 was the standard "try this" answer when
people had a bios that had busted MPS (but good ACPI)...

> I don't think it really matters when or why it happened.

well to some degree it does; if it's one patch causing it narrowing it
down at least somewhat in time would help ;)

> If we take the
> approach of fixing one machine at a time, we'll only need to fix a few
> individual machines to improve the situation for a lot of people.

alternative is that more new machines showed up that need it somehow, eg
not really a regression just something else. Different approach is
needed for hunting that down. But to be realistic we need to narrow
things down a bit, which means

1) Only care about SMP machines. APIC on true UP (no
Hyperthreading/Dualcore) is a thing no hardware vendor tests (Microsoft
doesn't use it) and is just too likely to trip up SMM and other bad BIOS
stuff.
* exception is probably people who don't WANT to use apic but where it
somehow gets used anyway; if that happens we probably have the magic
bullet that causes the regression :)

2) Only care about ACPI using kernels.
Non-ACPI uses MPS tables for
this, but most vendors hardly maintain those anymore at all and they are
generally just /dev/random nowadays

3) Ignore overclocking; if you overclock using the FSB the apic busses
run out of spec as well; can be a huge timewaster in debug time.

On Sun, Nov 12, 2006 at 12:50:37PM +0100, Arjan van de Ven wrote:
> > I don't know. In fact I forget how I worked out that it worsened in
> > 2.6.early.
> >
> > google(noapic) gets 232,000 hits.
>
> is there a way to ask google "only stuff in the last year"?
> Asking because "noapic" in 2.4 was the standard "try this" answer when
> people had a bios that had busted MPS (but good ACPI)...

Some APIC-related bugs in the kernel Bugzilla that have been reported or
confirmed during the last 12 months (I only looked at "apic" in the
subject, there might be more related bugs in the Bugzilla):

#5038 Fast running system clock with IO-APIC enabled
#5303 AMD64 Erratum: Should not enable C2 when using APIC
#5565 Guess of i386 APIC PTE area scribble
#6404 APIC error on CPU0: 40(40)
#6748 Clock drifts by 30% for SMP kernel w/APIC
#6859 Linux kernel won't work without "nolapic" passed
#6890 Kernel boot freezes when APIC is enabled & SATA is used

> > I don't think it really matters when or why it happened.
>
> well to some degree it does; if it's one patch causing it narrowing it
> down at least somewhat in time would help ;)
>
> > If we take the
> > approach of fixing one machine at a time, we'll only need to fix a few
> > individual machines to improve the situation for a lot of people.
>
> alternative is that more new machines showed up that need it somehow, eg
> not really a regression just something else. Different approach is
> needed for hunting that down. But to be realistic we need to narrow
> things down a bit, which means
>
> 1) Only care about SMP machines.
> APIC on true UP (no
> Hyperthreading/Dualcore) is a thing no hardware vendor tests (Microsoft
> doesn't use it) and is just too likely to trip up SMM and other bad BIOS
> stuff.
> * exception is probably people who don't WANT to use apic but where it
> somehow gets used anyway; if that happens we probably have the magic
> bullet that causes the regression :)

On i386, it's a kernel configuration option.

On x86_64, the APIC is currently always enabled even when configuring a
UP kernel.

> 2) Only care about ACPI using kernels. Non-ACPI uses MPS tables for
> this, but most vendors hardly maintain those anymore at all and they are
> generally just /dev/random nowadays

What about non-ACPI SMP?

> 3) Ignore overclocking; if you overclock using the FSB the apic busses
> run out of spec as well; can be a huge timewaster in debug time.

cu
Adrian

> Some APIC-related bugs in the kernel Bugzilla that have been reported or
> confirmed during the last 12 months (I only looked at "apic" in the
> subject, there might be more related bugs in the Bugzilla):
>
> #5038 Fast running system clock with IO-APIC enabled

This is a UP machine. NotInteresting(tm) wrt APIC.

> #5303 AMD64 Erratum: Should not enable C2 when using APIC

This is clearly not a linux issue but a hardware bug, as the title says

> #5565 Guess of i386 APIC PTE area scribble

this is only on one machine and a "special case"; not ruling out
anything fundamental but..

> #6404 APIC error on CPU0: 40(40)

This bug is a mess though; many different people seeing a symptom of an
apic error, and all jumping in assuming they see the same problem...
Also it's afaik only a message and not (yet) fatal in any way. Sometimes
apics do this a few times a day, esp when things are getting hot in the
box. Afaik there is then just a resend of the message and nothing is
lost.

> #6748 Clock drifts by 30% for SMP kernel w/APIC

this looks like a totally weird hardware case that probably just wants
to be blacklisted.
> #6859 Linux kernel won't work without "nolapic" passed

weird one, probably a bios issue but it's the opposite of "noapic", and
also this is about local apic not about ioapic. Although they share 4
letters they're entirely different animals.

> #6890 Kernel boot freezes when APIC is enabled & SATA is used

seems to be UP as well but asked for confirmation in the bug (lack of
lots of information here!).

If this isn't UP this could be the first real case of "noapic" in your
entire list...... which isn't too useful.
Maybe we need to get more/any people who see "need noapic on SMP" to
file a bug (and provide a reasonable amount of info)

> > 1) Only care about SMP machines. APIC on true UP (no
> > Hyperthreading/Dualcore) is a thing no hardware vendor tests (Microsoft
> > doesn't use it) and is just too likely to trip up SMM and other bad BIOS
> > stuff.
> > * exception is probably people who don't WANT to use apic but where it
> > somehow gets used anyway; if that happens we probably have the magic
> > bullet that causes the regression :)
>
> On i386, it's a kernel configuration option.

yes but it's generally a bad idea to set it; it only works on some
machines. (and it can't be fixed)

> On x86_64, the APIC is currently always enabled even when configuring a
> UP kernel.

I think that's a mistake. But oh well, I suspect in practice ACPI/BIOS
cause it to be turned off automatic most of the time.

> > 2) Only care about ACPI using kernels. Non-ACPI uses MPS tables for
> > this, but most vendors hardly maintain those anymore at all and they are
> > generally just /dev/random nowadays
>
> What about non-ACPI SMP?

if the machine is new enough to run ACPI I don't care about the non-ACPI
case; just enable it. Really. On newish machines (and that is 7 years
old or newer) MPS tables are NOT getting much if any attention by the
bios guys.
So Linux should use ACPI, and if you deliberately disable
ACPI and THEN hit a problem to a large degree you asked for the problem
in the first place.

Older machines, different story.

On Sun, Nov 12, 2006 at 02:16:16PM +0100, Arjan van de Ven wrote:
> > Some APIC-related bugs in the kernel Bugzilla that have been reported or
> > confirmed during the last 12 months (I only looked at "apic" in the
> > subject, there might be more related bugs in the Bugzilla):
> >
> > #5038 Fast running system clock with IO-APIC enabled
>
> This is a UP machine. NotInteresting(tm) wrt APIC.
>...

Currently it's a supported configuration.

We must either handle such cases or explicitly disable the APIC on all
UP machines (BTW: Is there any way to handle this when installing a
distribution kernel with CONFIG_HOTPLUG_CPU=y on a UP machine?).

> > > 1) Only care about SMP machines. APIC on true UP (no
> > > Hyperthreading/Dualcore) is a thing no hardware vendor tests (Microsoft
> > > doesn't use it) and is just too likely to trip up SMM and other bad BIOS
> > > stuff.
> > > * exception is probably people who don't WANT to use apic but where it
> > > somehow gets used anyway; if that happens we probably have the magic
> > > bullet that causes the regression :)
> >
> > On i386, it's a kernel configuration option.
>
> yes but it's generally a bad idea to set it; it only works on some
> machines. (and it can't be fixed)
>
> > On x86_64, the APIC is currently always enabled even when configuring a
> > UP kernel.
>
> I think that's a mistake. But oh well, I suspect in practice ACPI/BIOS
> cause it to be turned off automatic most of the time.

I'd doubt the latter. Even on my cheap Asus board running an i386
AMD Athlon XP with 1.8 GHz the APIC is both used and working without any
problems.

> > > 2) Only care about ACPI using kernels.
> > > Non-ACPI uses MPS tables for
> > > this, but most vendors hardly maintain those anymore at all and they are
> > > generally just /dev/random nowadays
> >
> > What about non-ACPI SMP?
>
> if the machine is new enough to run ACPI I don't care about the non-ACPI
> case; just enable it. Really. On newish machines (and that is 7 years
> old or newer) MPS tables are NOT getting much if any attention by the
> bios guys. So Linux should use ACPI, and if you deliberately disable
> ACPI and THEN hit a problem to a large degree you asked for the problem
> in the first place.
>
> Older machines, different story.

My point was regarding the latter ones...

cu
Adrian

On Sun, 2006-11-12 at 14:37 +0100, Adrian Bunk wrote:
> On Sun, Nov 12, 2006 at 02:16:16PM +0100, Arjan van de Ven wrote:
> > > Some APIC-related bugs in the kernel Bugzilla that have been reported or
> > > confirmed during the last 12 months (I only looked at "apic" in the
> > > subject, there might be more related bugs in the Bugzilla):
> > >
> > > #5038 Fast running system clock with IO-APIC enabled
> >
> > This is a UP machine. NotInteresting(tm) wrt APIC.
> >...
>
> Currently it's a supported configuration.

define "supported"; we have code to try it and it's great if it works.
But if it doesn't... you're out of luck.

We KNOW it can't work on a sizable amount of machines. This is why it
is a config option; you can enable it if YOUR machine is KNOWN to work,
and you get some gains. But it's also understood that often it won't
work. So any sensible distro (since they have to aim for a wide
audience) disables this option ...

> We must either handle such cases or explicitly disable the APIC on all
> UP machines

that'd be the same as setting the config option off...

> > I think that's a mistake. But oh well, I suspect in practice ACPI/BIOS
> > cause it to be turned off automatic most of the time.
>
> I'd doubt the latter.
> Even on my cheap Asus board running an i386
> AMD Athlon XP with 1.8 GHz the APIC is both used and working without any
> problems.

"it works on my one machine so it works for everyone". That's simply not
true. We KNOW it can't work everywhere on UP, especially on i386. SMM
assumptions; people gluing the apic pins to the reset line, we've seen
it all.
That it works for you is great. But that doesn't mean it automatically
works for everyone.

On Sun, Nov 12, 2006 at 02:57:48PM +0100, Arjan van de Ven wrote:
> On Sun, 2006-11-12 at 14:37 +0100, Adrian Bunk wrote:
> > On Sun, Nov 12, 2006 at 02:16:16PM +0100, Arjan van de Ven wrote:
> > > > Some APIC-related bugs in the kernel Bugzilla that have been reported or
> > > > confirmed during the last 12 months (I only looked at "apic" in the
> > > > subject, there might be more related bugs in the Bugzilla):
> > > >
> > > > #5038 Fast running system clock with IO-APIC enabled
> > >
> > > This is a UP machine. NotInteresting(tm) wrt APIC.
> > >...
> >
> > Currently it's a supported configuration.
>
> define "supported"; we have code to try it and it's great if it works.
> But if it doesn't... you're out of luck.
>
> We KNOW it can't work on a sizable amount of machines. This is why it
> is a config option; you can enable it if YOUR machine is KNOWN to work,
> and you get some gains. But it's also understood that often it won't
> work. So any sensible distro (since they have to aim for a wide
> audience) disables this option ...

Nowadays, many distributions only ship CONFIG_SMP=y kernels...

> > We must either handle such cases or explicitly disable the APIC on all
> > UP machines
>
> that'd be the same as setting the config option off...

Except for the common case of CONFIG_SMP=y kernels on UP machines...

> > > I think that's a mistake. But oh well, I suspect in practice ACPI/BIOS
> > > cause it to be turned off automatic most of the time.
> >
> > I'd doubt the latter.
> > Even on my cheap Asus board running an i386
> > AMD Athlon XP with 1.8 GHz the APIC is both used and working without any
> > problems.
>
> "it works on my one machine so it works for everyone". That's simply not
> true. We KNOW it can't work everywhere on UP, especially on i386. SMM
> assumptions; people gluing the apic pins to the reset line, we've seen
> it all.
> That it works for you is great. But that doesn't mean it automatically
> works for everyone.

You miss my point.

You said you'd suspect it to be turned off automatic most of the time,
and that's the point I think you might be wrong at.

cu
Adrian

> > We KNOW it can't work on a sizable amount of machines. This is why it
> > is a config option; you can enable it if YOUR machine is KNOWN to work,
> > and you get some gains. But it's also understood that often it won't
> > work. So any sensible distro (since they have to aim for a wide
> > audience) disables this option ...
>
> Nowadays, many distributions only ship CONFIG_SMP=y kernels...

that's a calculated risk on their side (and they know that); they're
balancing not functioning on a set of machines off against needing more
kernels.

> You miss my point.
>
> You said you'd suspect it to be turned off automatic most of the time,
> and that's the point I think you might be wrong at.

it won't be turned off on machines that support dual core processors
etc, since those DO get validated and designed for APIC use.. even if
you only stick a single core processor in. So yes you're right, that
nowadays is a pretty large group. But it's the safe group I guess:)

On Sun, Nov 12, 2006 at 03:16:38PM +0100, Arjan van de Ven wrote:
> > > We KNOW it can't work on a sizable amount of machines. This is why it
> > > is a config option; you can enable it if YOUR machine is KNOWN to work,
> > > and you get some gains. But it's also understood that often it won't
> > > work.
> > > So any sensible distro (since they have to aim for a wide
> > > audience) disables this option ...
> >
> > Nowadays, many distributions only ship CONFIG_SMP=y kernels...
>
> that's a calculated risk on their side (and they know that); they're
> balancing not functioning on a set of machines off against needing more
> kernels.

This might soon affect the majority of Linux users, so it's a case that
has to be handled...

> > You miss my point.
> >
> > You said you'd suspect it to be turned off automatic most of the time,
> > and that's the point I think you might be wrong at.
>
> it won't be turned off on machines that support dual core processors
> etc, since those DO get validated and designed for APIC use.. even if
> you only stick a single core processor in. So yes you're right, that
> nowadays is a pretty large group. But it's the safe group I guess:)

But if APIC is even used on my more than 1 year old 40 Euro Socket A
board (AFAIK there have never been dual core Socket A processors, there
were no Socket A hyperthreading CPUs, it's not an SMP board, and the
VIA KT600 is not an SMP chipset) it's not in what you call "safe group",
and I don't see any reason why my board should behave different in this
respect from all of the millions of other UP Socket A boards.

Googling shows that it could be that your claim "APIC on true UP (no
Hyperthreading/Dualcore) is a thing no hardware vendor tests (Microsoft
doesn't use it)" earlier in this thread was wrong. Looking at e.g. [1],
it seems Windows does use the APIC even on UP.

cu
Adrian

[1] http://www.microsoft.com/whdc/system/sysperf/IO-APIC.mspx

> But if APIC is even used on my more than 1 year old 40 Euro Socket A
one sparrow does not a summer make.
now can we get constructive again. If you find a real case where noapic
is needed on an SMP machine, preferably one where it wasn't needed
before earlier in 2.6, let us know; it's worthwhile to chase those down
since we know it's a decent use case and it's not flaky hardware.
Reply-To: diablod3@gmail.com

On Sunday 12 November 2006 10:21, Adrian Bunk wrote:
> On Sun, Nov 12, 2006 at 03:16:38PM +0100, Arjan van de Ven wrote:
> > > > We KNOW it can't work on a sizable amount of machines. This is why
> > > > it is a config option; you can enable it if YOUR machine is KNOWN to
> > > > work, and you get some gains. But it's also understood that often
> > > > it won't work. So any sensible distro (since they have to aim for a
> > > > wide audience) disables this option ...
> > >
> > > Nowadays, many distributions only ship CONFIG_SMP=y kernels...
> >
> > that's a calculated risk on their side (and they know that); they're
> > balancing not functioning on a set of machines off against needing more
> > kernels.
>
> This might soon affect the majority of Linux users, so it's a case that
> has to be handled...

I actually agree here. Linux needs to be easier for people to use, not harder.
Isn't there a way for bootloaders or the kernel early on to figure out if the
machine supports SMP, and if it doesn't, load a uniproc kernel instead?

> > > You miss my point.
> > >
> > > You said you'd suspect it to be turned off automatic most of the time,
> > > and that's the point I think you might be wrong at.
> >
> > it won't be turned off on machines that support dual core processors
> > etc, since those DO get validated and designed for APIC use.. even if
> > you only stick a single core processor in. So yes you're right, that
> > nowadays is a pretty large group. But it's the safe group I guess:)
>
> But if APIC is even used on my more than 1 year old 40 Euro Socket A
> board (AFAIK there have never been dual core Socket A processors, there
> were no Socket A hyperthreading CPUs, it's not an SMP board, and the
> VIA KT600 is not an SMP chipset) it's not in what you call "safe group",
> and I don't see any reason why my board should behave different in this
> respect from all of the millions of other UP Socket A boards.
>
> Googling shows that it could be that your claim "APIC on true UP (no
> Hyperthreading/Dualcore) is a thing no hardware vendor tests (Microsoft
> doesn't use it)" earlier in this thread was wrong. Looking at e.g. [1],
> it seems Windows does use the APIC even on UP.

Socket A CPUs are also ungodly common. They're as common as slot 1/socket 370
Pentium 3s, and, at least with my old P3 board, trying to use APIC on UP
caused lockups. My Duron 1ghz laptop also does the same thing. (Booting
either with noapic fixes it).

So yeah, if distros make stupid choices like these, then we're pretty screwed.

> cu
> Adrian
>
> [1] http://www.microsoft.com/whdc/system/sysperf/IO-APIC.mspx

On Sun, 2006-11-12 at 10:59 -0500, Patrick McFarland wrote:
> On Sunday 12 November 2006 10:21, Adrian Bunk wrote:
> > On Sun, Nov 12, 2006 at 03:16:38PM +0100, Arjan van de Ven wrote:
> > > > > We KNOW it can't work on a sizable amount of machines. This is why
> > > > > it is a config option; you can enable it if YOUR machine is KNOWN to
> > > > > work, and you get some gains. But it's also understood that often
> > > > > it won't work. So any sensible distro (since they have to aim for a
> > > > > wide audience) disables this option ...
> > > >
> > > > Nowadays, many distributions only ship CONFIG_SMP=y kernels...
> > >
> > > that's a calculated risk on their side (and they know that); they're
> > > balancing not functioning on a set of machines off against needing more
> > > kernels.
> >
> > This might soon affect the majority of Linux users, so it's a case that
> > has to be handled...
>
> I actually agree here. Linux needs to be easier for people to use, not harder.
> Isn't there a way for bootloaders or the kernel early on to figure out if the
> machine supports SMP, and if it doesn't, load a uniproc kernel instead?
this is what OS installers have been doing for a decade or so.
On Sun, Nov 12, 2006 at 10:59:55AM -0500, Patrick McFarland wrote:
>...
> Socket A CPUs are also ungodly common. They're as common as slot 1/socket 370
> Pentium 3s, and, at least with my old P3 board, trying to use APIC on UP
> caused lockups. My Duron 1ghz laptop also does the same thing. (Booting
> either with noapic fixes it).
>...
It might depend on the age of your computer.
Microsoft mandates the presence of an APIC implemented per MADT and all
hardware interrupts connected to an IOAPIC for all servers and desktops
with a "Designed for Windows XP" sticker.
This implies more or less that a working APIC is present in all
non-laptop x86 UP systems manufactured during the last 5 years.
cu
Adrian
Reply-To: ioe-lkml@rameria.de

Hi there,

On Sunday, 12. November 2006 14:16, Arjan van de Ven wrote:
> If this isn't UP this could be the first real case of "noapic" in your
> entire list...... which isn't too useful.
> Maybe we need to get more/any people who see "need noapic on SMP" to
> file a bug (and provide a reasonable amount of info)

I need noapic since ever (5 years!) to get my USB controller running.
Without noapic it doesn't get any interrupts for some reason.

If now is the time to fix those bugs, I would be happy to try a new kernel
and get you the dmesg + result of plugging in an usb mass storage device
and reading from it on a DAILY basis.

If you need anything else to resolve the issue, I would be happy to
help out here. Maybe a pattern can be detected, which could help others.

If you like to blacklist this machine by DMI, that would also help me.

Many Thanks!

Best Regards

Ingo Oeser

On Sun, 12 Nov 2006 20:18:51 +0100 Ingo Oeser <ioe-lkml@rameria.de> wrote:
> Hi there,
>
> On Sunday, 12. November 2006 14:16, Arjan van de Ven wrote:
> > If this isn't UP this could be the first real case of "noapic" in your
> > entire list...... which isn't too useful.
> > Maybe we need to get more/any people who see "need noapic on SMP" to
> > file a bug (and provide a reasonable amount of info)
>
> I need noapic since ever (5 years!) to get my USB controller running.
> Without noapic it doesn't get any interrupts for some reason.
>
> If now is the time to fix those bugs, I would be happy to try a new kernel
> and get you the dmesg + result of plugging in an usb mass storage device
> and reading from it on a DAILY basis.

Yes, please send those. It'd be best to get the info into bugzilla too -
this doesn't look like a quick-fix scenario.

On Sun, 2006-11-12 at 20:18 +0100, Ingo Oeser wrote:
> Hi there,
>
> On Sunday, 12. November 2006 14:16, Arjan van de Ven wrote:
> > If this isn't UP this could be the first real case of "noapic" in your
> > entire list...... which isn't too useful.
> > Maybe we need to get more/any people who see "need noapic on SMP" to
> > file a bug (and provide a reasonable amount of info)
>
> I need noapic since ever (5 years!) to get my USB controller running.
> Without noapic it doesn't get any interrupts for some reason.
so it never worked? (that's important to know versus regression)
Also does this machine use ACPI for interrupt routing?
That's also important, because if you're NOT using ACPI, "noapic" means
that you're using the PIRQ for irq routing and not MPS, so you're not
"just" changing apic behavior, you're actually using a different BIOS
table. (and to be honest, a buggy bios table is more likely the
cause ... ;)
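
Arjan's point about which BIOS table ends up driving interrupt routing can be written down as a small decision function. This is only a sketch in user-space C: `pick_route_source`, `irq_route_name` and the enum are hypothetical names, and the ACPI branch is an assumption read off his question (the mail itself only spells out the non-ACPI case, where "noapic" means PIRQ instead of MPS).

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical labels for the firmware table that supplies IRQ routing. */
enum irq_route_source {
    ROUTE_ACPI,  /* ACPI tables (assumed whenever ACPI is in use) */
    ROUTE_MPS,   /* MPS table: IO-APIC in use, ACPI not in use */
    ROUTE_PIRQ   /* PIRQ table: "noapic" given, ACPI not in use */
};

/* Which table the thread says is consulted for a given boot setup. */
enum irq_route_source pick_route_source(bool acpi_enabled, bool noapic)
{
    if (acpi_enabled)
        return ROUTE_ACPI;
    return noapic ? ROUTE_PIRQ : ROUTE_MPS;
}

/* Printable name, handy when logging the answer in a bug report. */
const char *irq_route_name(enum irq_route_source src)
{
    switch (src) {
    case ROUTE_ACPI: return "ACPI";
    case ROUTE_MPS:  return "MPS";
    case ROUTE_PIRQ: return "PIRQ";
    }
    assert(!"unknown routing source");
    return "?";
}
```

The useful consequence for triage: without ACPI, toggling "noapic" switches between two different BIOS tables, so a fix from "noapic" does not by itself point at the APIC code.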
Reply-To: davej@redhat.com

On Sun, Nov 12, 2006 at 03:16:38PM +0100, Arjan van de Ven wrote:
> > > We KNOW it can't work on a sizable amount of machines. This is why it
> > > is a config option; you can enable it if YOUR machine is KNOWN to work,
> > > and you get some gains. But it's also understood that often it won't
> > > work. So any sensible distro (since they have to aim for a wide
> > > audience) disables this option ...
> >
> > Nowadays, many distributions only ship CONFIG_SMP=y kernels...
>
> that's a calculated risk on their side (and they know that); they're
> balancing not functioning on a set of machines off against needing more
> kernels.

Andi has a nice patch in the suse kernel which adds heuristics to disable
apic on systems where it isn't likely to work. It DTRT in at least
one problem case that I know of. The actual fall-out from enabling
'run SMP kernels on UP i686' for FC6 has mostly been a non-event.
Literally a handful of cases, that will likely all get caught and worked
around by Andi's patch or similar.

Dave

> Andi has a nice patch in the suse kernel which adds heuristics to disable
> apic on systems where it isn't likely to work. It DTRT in at least
> one problem case that I know of. The actual fall-out from enabling
> 'run SMP kernels on UP i686' for FC6 has mostly been a non-event.
> Literally a handful of cases, that will likely all get caught and worked
> around by Andi's patch or similar.
I haven't pushed that recently because I was busy with other things, but
it needs to be revisited, yes.
One broken case that still happens is that the patch assumes working
SMBIOS. When there is no year in SMBIOS it will turn off APIC because
it assumes it is a very old system. But sometimes new systems that would
like APIC have an illegal or broken SMBIOS year. On very new systems it isn't
a problem again because those tend to have multiple cores.
That could probably be a bit more clever. It's always difficult to
navigate around all kinds of BIOS bugs.
-Andi
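
The behaviour Andi describes can be sketched as a small predicate. This is not the SUSE patch: `apic_looks_safe` and `CUTOFF_YEAR_ASSUMED` are hypothetical, the cutoff value is a placeholder because the thread never states one, and checking the core count before the year is an assumption based on the remark that multi-core systems are not affected.

```c
#include <assert.h>
#include <stdbool.h>

/* Placeholder cutoff; the thread does not say which year the patch uses. */
#define CUTOFF_YEAR_ASSUMED 2001

/*
 * Decide whether to leave the APIC enabled.
 * bios_year == 0 models "no usable year in SMBIOS".
 */
bool apic_looks_safe(int bios_year, int cpu_cores)
{
    assert(cpu_cores >= 1);

    if (cpu_cores > 1)
        return true;   /* multi-core boards are designed for APIC use */
    if (bios_year == 0)
        return false;  /* missing year is treated as a very old system */
    return bios_year >= CUTOFF_YEAR_ASSUMED;
}
```

The broken case from the mail is the second branch: a recent single-core machine with a missing or garbage SMBIOS year gets its APIC turned off even though it would work.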
On Saturday November 11, akpm@osdl.org wrote:
> On Sat, 11 Nov 2006 03:29:32 -0800
> bugme-daemon@bugzilla.kernel.org wrote:
>
> > http://bugzilla.kernel.org/show_bug.cgi?id=7495
> >
> > Summary: Kernel periodically hangs.
> > Kernel Version: Linux version 2.6.18.2 (root@pub) (gcc version 3.4.6)
> > #13 SMP Fr
> > Status: NEW
> > Severity: blocking
> > Owner: other_other@kernel-bugs.osdl.org
> > Submitter: alex@hausnet.ru

So getting back to the main issue in this bug report.....

> > [42587.676000] BUG: unable to handle kernel NULL pointer dereference at
> > virtual address 0000003c

it would appear that in:

    if (inode->i_sb && inode->i_sb->s_op->clear_inode)
        inode->i_sb->s_op->clear_inode(inode);

inode->i_sb->s_op is NULL. This is unfortunate :-)

alloc_super initialises s_op to '&default_op' and it isn't cleared on
unmount, so the implication seems to be that i_sb has been freed and
the memory has been reused.

This tends to suggest that generic_shutdown_super isn't releasing all
inodes before the superblock gets destroyed.

I cannot see how this could be happening yet, but it might be helpful
to compile with CONFIG_DEBUG_SLAB and maybe even CONFIG_DEBUG_PAGEALLOC.
That might make the problem trigger earlier and so be easier to track.

NeilBrown

Reply-To: dhowells@redhat.com

Neil Brown <neilb@suse.de> wrote:
> it would appear that in:
>     if (inode->i_sb && inode->i_sb->s_op->clear_inode)
>         inode->i_sb->s_op->clear_inode(inode);
>
> inode->i_sb->s_op is NULL.

Agreed.

> This tends to suggest that generic_shutdown_super isn't releasing all inodes
> before the superblock gets destroyed.
>
> I cannot see how this could be happening

Perhaps sb->s_root == NULL? That would permit most of generic_shutdown_super()
to be bypassed, including the check that all the inodes have been consumed.

David

> I cannot see how this could be happening yet, but it might be helpful
> to compile with CONFIG_DEBUG_SLAB and maybe even CONFIG_DEBUG_PAGEALLOC.
> That might make the problem trigger earlier and so be easier to track.
I recompiled my kernel with these options; as soon as the error message
repeats, I will let you know.
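
Neil Brown's reading of the oops lines up with the register dump: eax is 00000000 and the fault address is 0000003c, which is what loading a member at offset 0x3c through a NULL s_op would produce. The following is a user-space model of that arithmetic, not kernel code. The `fake_*` structs and both functions are hypothetical, and placing `clear_inode` at offset 0x3c (slot 15 of a table of 4-byte pointers on i386) is inferred from the fault address rather than taken from the 2.6.18 headers.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-in for an i386 operations table. 32-bit slots model
 * 4-byte function pointers so offsets are the same on any build host. */
struct fake_super_ops {
    uint32_t slot[15];     /* fifteen earlier callbacks, 4 bytes each */
    uint32_t clear_inode;  /* assumed to sit at offset 0x3c */
};

struct fake_super {
    const struct fake_super_ops *s_op;
};

struct fake_inode {
    struct fake_super *i_sb;
};

static_assert(offsetof(struct fake_super_ops, clear_inode) == 0x3c,
              "model expects clear_inode at offset 0x3c");

/* Address a 32-bit CPU would read when loading s_op->clear_inode. */
uint32_t fault_address(uint32_t s_op_base)
{
    return s_op_base + (uint32_t)offsetof(struct fake_super_ops, clear_inode);
}

/* The quoted kernel check tests i_sb but not s_op. This variant tests
 * both; it returns 1 only when a clear_inode callback can be reached. */
int has_clear_inode(const struct fake_inode *inode)
{
    return inode->i_sb != NULL
        && inode->i_sb->s_op != NULL
        && inode->i_sb->s_op->clear_inode != 0;
}
```

With a NULL base, `fault_address(0)` gives 0x3c, matching the reported virtual address. The extra s_op test is shown only to make the failing step visible; if the superblock really was freed and reused, as the thread suspects, such a check would hide the symptom and not fix the cause.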
Nov 11 09:48:53 pub kernel: [42587.676000] BUG: unable to handle kernel NULL pointer dereference at virtual address 0000003c
Nov 11 09:48:53 pub kernel: [42587.680000] printing eip:
Nov 11 09:48:53 pub kernel: [42587.680000] 781610e7
Nov 11 09:48:53 pub kernel: [42587.680000] *pde = 00000000
Nov 11 09:48:53 pub kernel: [42587.680000] Oops: 0000 [#1]
Nov 11 09:48:53 pub kernel: [42587.684000] SMP
Nov 11 09:48:53 pub kernel: [42587.684000] Modules linked in: sata_promise sk98lin 8250_pnp 8250 i2c_nforce2 ehci_hcd serial_core sata_nv ahci i2c_core ohci_hcd forcedeth libata
Nov 11 09:48:53 pub kernel: [42587.688000] CPU: 1
Nov 11 09:48:53 pub kernel: [42587.688000] EIP: 0060:[<781610e7>] Not tainted VLI
Nov 11 09:48:53 pub kernel: [42587.688000] EFLAGS: 00010286 (2.6.18.2 #13)
Nov 11 09:48:53 pub kernel: [42587.692000] EIP is at clear_inode+0x96/0xce
Nov 11 09:48:53 pub kernel: [42587.692000] eax: 00000000 ebx: c0102240 ecx: f7f278d4 edx: f510d400
Nov 11 09:48:53 pub kernel: [42587.692000] esi: c0102384 edi: f7e6dec0 ebp: 00000070 esp: f7e6de98
Nov 11 09:48:53 pub kernel: [42587.696000] ds: 007b es: 007b ss: 0068
Nov 11 09:48:53 pub kernel: [42587.696000] Process kswapd0 (pid: 230, ti=f7e6c000 task=f7c03560 task.ti=f7e6c000)
Nov 11 09:48:53 pub kernel: [42587.696000] Stack: c0102248 c0102240 7816116a da7b4af0 da7b4af8 00000000 00000080 781614a2
Nov 11 09:48:53 pub kernel: [42587.700000] 00000080 00000080 c01023f8 ef78dca8 00000000 00009858 00000083 f7fee560
Nov 11 09:48:53 pub kernel: [42587.700000] 781614c8 7813a643 00261600 00000000 00009858 00000005 00000000 00000000
Nov 11 09:48:53 pub kernel: [42587.700000] Call Trace:
Nov 11 09:48:53 pub kernel: [42587.704000] [<7816116a>] dispose_list+0x4b/0xc1
Nov 11 09:48:53 pub kernel: [42587.708000] [<781614a2>] prune_icache+0x17c/0x18e
Nov 11 09:48:53 pub kernel: [42587.708000] [<781614c8>] shrink_icache_memory+0x14/0x2b
Nov 11 09:48:53 pub kernel: [42587.708000] [<7813a643>] shrink_slab+0x130/0x18c
Nov 11 09:48:53 pub kernel: [42587.712000] [<7813b75a>] balance_pgdat+0x1ea/0x2dd
Nov 11 09:48:53 pub kernel: [42587.712000] [<7813b933>] kswapd+0xe6/0xe8
Nov 11 09:48:53 pub kernel: [42587.716000] [<781261dc>] kthread+0x7d/0xa1
Nov 11 09:48:53 pub kernel: [42587.716000] [<78100e05>] kernel_thread_helper+0x5/0xb
Nov 11 09:48:53 pub kernel: [42587.720000] DWARF2 unwinder stuck at kernel_thread_helper+0x5/0xb
Nov 11 09:48:53 pub kernel: [42587.720000] Leftover inexact backtrace:
Nov 11 09:48:53 pub kernel: [42587.720000] Code: c0 83 bc 83 00 01 00 00 00 75 08 40 83 f8 01 7e f0 eb 0b 48 7f 08 8b 52 24 89 d8 ff 52 04 8b 83 a0 00 00 00 85 c0 74 0e 8b 40 20 <8b> 50 3c 85 d2 74 04 89 d8 ff d2 83 bb 14 01 00 00 00 74 07 89
Nov 11 09:48:53 pub kernel: [42587.724000] EIP: [<781610e7>] clear_inode+0x96/0xce SS:ESP 0068:f7e6de98
Nov 13 17:16:53 pub kernel: [ 26.486021] Compat vDSO mapped to ffffe000.
Nov 13 17:16:53 pub kernel: [ 26.507066] ACPI: setting ELCR to 0200 (from 0c20)
Nov 13 17:16:53 pub kernel: [ 26.508117] CPU0: AMD Athlon(tm) 64 X2 Dual Core Processor 3600+ stepping 02
Nov 13 17:16:53 pub kernel: [ 26.508137] Booting processor 1/1 eip 2000
Nov 13 17:16:53 pub kernel: [ 26.596334] Calibrating delay using timer specific routine.. 4018.80 BogoMIPS (lpj=8037611)
Nov 13 17:16:53 pub kernel: [ 26.597912] CPU1: AMD Athlon(tm) 64 X2 Dual Core Processor 3600+ stepping 02
Nov 13 17:16:53 pub kernel: [ 0.026864] migration_cost=44

Oops, sorry... that is the old one. On 13 Nov it died without any message in syslog.

Alex, where do we now stand with the various problems you've reported here?
Is 2.6.20-rc7 still being bad?

Please reopen this bug if it's still present with kernel 2.6.20.