Bug 13933
Summary: | System lockup on dual Pentium-3 with kernel 2.6.30 | | |
---|---|---|---|
Product: | Other | Reporter: | Martin Rogge (marogge) |
Component: | Other | Assignee: | Linus Torvalds (torvalds) |
Status: | CLOSED CODE_FIX | | |
Severity: | blocking | CC: | beauwinters, bugs-a21, devzero, enouf4u, hilld, iordanov, mingo, nemesis, rjw, rogerx.oss, rusty, thomas.bjornell, torvalds, tpfaff, wylda |
Priority: | P1 | | |
Hardware: | i386 | | |
OS: | Linux | | |
Kernel Version: | 2.6.30.4 | Subsystem: | |
Regression: | Yes | Bisected commit-id: | |
Bug Depends on: | | | |
Bug Blocks: | 13070 | | |
Attachments:
uname -a
lsmod
.config
syslog
lspci -vv
dmesg
Opps after first bisection
git bisecting - Call for HELP
final git bisect log
One of the rare trace in this saga
Created attachment 22638 [details]
lsmod
gosh, what does a man have to do to attach more than one file around here? I'll give it one more try.

Sorry, bugzilla won't let me attach any more info like .config, nor can I change the attachments already posted. Anyway, it's on the LKML, and you have my email.

Created attachment 22641 [details]
.config
after change of IP I managed to attach the .config... thanks, bugzilla!
Created attachment 22642 [details]
syslog
A lockup occurred between the time stamps 20:20:43 and 20:24:26.
Created attachment 22647 [details]
lspci -vv
bugzilla only lets me attach one file every 12 hours or so. But I am persistent. ;-)
here is a similar report from Osipov Stanislav: http://marc.info/?l=linux-kernel&m=124928938311561&w=2

here the report from martin just for reference: http://marc.info/?l=linux-kernel&m=124931667320530&w=2

here is the report from Frank de Jong: http://marc.info/?l=linux-kernel&m=124967492815396&w=2

and here from John Stoffel: http://marc.info/?l=linux-kernel&m=124967565617179&w=2

most likely a regression, so please someone with appropriate permission mark this as a regression!

Created attachment 22652 [details]
dmesg
we probably have a duplicate bug report at http://bugzilla.kernel.org/show_bug.cgi?id=13219 from David Hill

I've just had a lockup after an uptime of 3 days plus.

*** Bug 13945 has been marked as a duplicate of this bug. ***

This is weird though. I never had more than 4 hours uptime since the bug is present... Is that server under heavy use? Can you simulate heavy disk read/write for a while? Or heavy CPU usage? Memory ? etc ?

(In reply to comment #14)
> This is weird though. I never had more than 4 hours uptime since the bug is
> present...

I've had anything from 2 minutes to 3 days.

> Is that server under heavy use? Can you simulate heavy disk read/write for a
> while? Or heavy CPU usage? Memory ? etc ?

the machine is mostly used as a workstation. Originally I had the feeling the lockups coincided with screen updates, but there is no conclusive evidence. I remember, on one occasion the machine was idle when it happened. I can try and simulate other workloads, but don't wait for a meaningful result.

I have the same problem with kernel 2.6.30.4. The motherboard is Tyan S1834/Tiger 133 with P3 1000 MHz processors.

maybe another dupe: http://bugzilla.kernel.org/show_bug.cgi?id=13982 (sorry, i don't have permission to mark as duplicate, so just posting the link)

*** Bug 13982 has been marked as a duplicate of this bug. ***

Rafael, you are right, my bug report #13982 is the same problem. From my tests, the following kernels work perfectly:

* 2.6.26.8
* 2.6.27.29
* 2.6.28.10
* 2.6.29.6

The following freeze:

* 2.6.30.4
* 2.6.30.5rc2

So I decided to bisect by following this nice howto: http://wiki.winehq.org/RegressionTesting

    git bisect start
    git bisect good v2.6.29
    git bisect bad v2.6.30

But when i compile such a kernel, then cat /proc/version shows 2.6.29, which is not correct i guess. After a few rounds it showed me v2.6.29-rc4 which is even before "good v2.6.29" (should be something like v2.6.30-rc4, shouldn't it??)
Oldschool bisecting (without git) - the following kernel seems OK:

* 2.6.30-rc4

It would help me if someone could advise:

1. What tag should i use now for bisect good and bad?
2. If i have a local clone of the Git repository, how can i set the source code to, for example, version 2.6.30-rc6 (so i will not have to download it separately again)?
3. Is it possible to export from a local clone of the Git repository, for example, linux-2.6.30-rc6.tar.bz2 (so i will get exactly the same file as on kernel.org)?

(In reply to comment #20)
> Oldschool bisecting (without git) - following kernel seems OK:
>
> * 2.6.30-rc4
>
> It would help me, if someone could advice:
>
> 1. What tag should i use now for bisect good and bad?

Your procedure in comment #19 is correct and the fact that you got v2.6.29-rc4 in the process only reflects the history of development. Apparently, you got a bisection point in a branch that was originally based on 2.6.29-rc4 and then merged into 2.6.30.

So, you should do

    git bisect start
    git bisect good v2.6.30-rc4
    git bisect bad v2.6.30

and do not care too much for the versions you get in the middle of bisection (that can be anything from 2.6.29 upwards).

> 2. If i have local clone of Git repository, how can i set the source code for
> example to version 2.6.30-rc6 (so i will not have to download it separately
> again)?

    git checkout v2.6.30-rc6

It will complain that you don't have a local branch for this kernel, but that's fine.

> 3. Is it possible to export from local clone of Git repository for example
> linux-2.6.30-rc6.tar.bz2 (so i will get exactly the same file like on
> kernel.org)?

After checking out a particular tag, you should get exactly the same tree as from the corresponding tarball.

One mistake, I should have said "that can be anything from 2.6.28 upwards".

Going to semifinal :)

The following freezes:

* 2.6.30-rc5

What makes me a little nervous is that this time it took nearly 50 minutes, i.e. it looks like it is much harder to trigger. So hopefully bisecting shows something. Right now bisecting v2.6.30-rc4/v2.6.30-rc5 (good/bad).

Rafael, two more questions (if you find some time):

a) if i do "git checkout v2.6.30-rc6", then how do i set it back to the latest clone version (something like git checkout latest...)?
b) how do i update the local git clone to Linus's latest version (to be up to date)?

I will never be a git specialist, but these are probably the last questions; that will be enough for me to say i can handle it ;)

Created attachment 22747 [details]
Opps after first bisection
I don't know if this is important, so for the record... After the first bisection step between v2.6.30-rc4 and v2.6.30-rc5 there was an Oops, but everything worked as usual (OK, one FTP transfer died ;). Oops attached.
Anyway, i did a reboot to be sure there is no influence from this Oops. The kernel froze within a few minutes - giving "git bisect bad" and going on.
Let me know if there is no need to waste energy on Oopses in this case (i.e. if they don't shed any light on this bug).
I also have a dual PPro system with NO problems so far with exactly the same kernel that is giving the dual PIII system fits.

The only high level difference I can tell is that the PPro system does not have ACPI while the PIII does. Also, a different RAID card (3ware on the PPro vs aacraid on the PIII)

I forgot to mention, due to this difference I booted with acpi=off on the PIII system with no effect on the bug; it crashed just the same.

What NIC? Is it a realtek ?

David Hill

e100 in both working and non-working systems

David, i have a Realtek in my machine, but i don't use it. The driver is neither built into the kernel nor available as a module (i.e. the kernel does not see it) and it still freezes :-/

My currently used config is as minimal as possible:

* No modules
* No power management, no ACPI, no Frequency scaling
* No AGP, no ISA, ...
* No wireless, IPv6, etc.
* Minimal SCSI support
* No I2C, GPIO, HW monitoring, multimedia, DVB, Sound, FB
* No USB, HID, LED
* (filesystem) Nothing except Ext3
* No security options, Cryptography, Virtualization, Library routines

...and even so, it freezes. Just a few bisection rounds left, but all bisections were marked as bad till now.
So i have a feeling that 2.6.30-rc4 is also bad. But stay tuned, still working on it :-/

I'm bisecting from 2.6.29 to 2.6.30... Bug is between those... I happen to also have a tyan dual p3 system!!!

Anyone compiled working and non-working kernels with the same compiler? Maybe it's a subtle toolchain bug.

David, i also did a bisection between 2.6.29 and 2.6.30, but when i had to use "git bisect skip" because of broken compilation (sata_sil - which is a must-have) and then saw versions like 2.6.29-rc4, rc1 and commit dates somewhere from January 2009, i thought i had f***ed git ("_the_greatest_tool_ever_made_") up and did a reset after 14! rounds :-/ Based on comment #21 i know i should not have done that, but you know ;-) It's my first time with git - not perfect ;)

The following freeze:

* 2.6.30-rc4 (unfortunately)
* 2.6.30-rc2
* 2.6.30-rc1

So to be sure that i am still chasing a bug and not ghosts, i did a careful test of Debian's lenny kernel 2.6.26-17lenny1 -> works perfectly. So bisecting again, but this time between 2.6.30-rc1 and 2.6.29.

Ryan, see comment #19. The following kernels work perfectly:

* 2.6.26.8
* 2.6.27.29
* 2.6.28.10
* 2.6.29.6

all were compiled with the same gcc and other tools.

Created attachment 22750 [details]
git bisecting - Call for HELP

*** Call for HELP - Gitmaster wanted ***

After some good/bad, i'm not able to overcome the sata_sil build failure with "git bisect skip". Even when i remove sata_sil from the kernel, i immediately get a kernel panic during the boot process (not related to the missing SATA Sil3114 driver - i have the system on PATA).

If i count correctly, 36x skip * 7min = more than 4 hours of wasted time. Now i am taking this bisecting and building process to an 8xCPU machine, but maybe it's possible that i got to a point with git where i'm not able to continue.

I have time to bisect until Tuesday evening (UTC). On Wednesday the server goes to production and i will lose the opportunity to test and bisect.
I read about "Reverse Regression Testing" (http://wiki.winehq.org/ReverseRegressionTesting), but that's too much for me.

*** Call for HELP - Gitmaster wanted ***

(from the log):

git-bisect start
git-bisect good 2.6.29
git-bisect bad 2.6.30-rc1
git-bisect bad 577c9c456f0e1371cbade38eaf91ae8e8a308555
git-bisect bad 5658ae9007490c18853fbf112f1b3516f5949e62
git-bisect good 08abe18af1f78ee80c3c3a5ac47c3e0ae0beadf6
git-bisect bad 6e15cf04860074ad032e88c306bea656bbdd0f22
git-bisect bad 7c178a26d3e753d2a4346d3e4b8aa549d387f698
git-bisect skip e2c75d9f54334646b3dcdf1fea0d1afe7bfbf644
git-bisect skip e0c7ae376a13fd79a4dad8becab51040d13dfa90
git-bisect skip 3769e7b4d8ef113e08221a210f849ba57475aff5
git-bisect skip 6a48565ed6ac76f351def25cd5e9f181331065f6
git-bisect skip 9c39801763ed6e08ea8bc694c5ab936643a2b763
git-bisect skip fbeb2ca0224182033f196cf8f63989c3e6b90aba
git-bisect skip 4272ebfbefd0db40073f3ee5990bceaf2894f08b
git-bisect skip f67ae5c9e52e385492b94c14376e322004701555
git-bisect skip 36ef4944ee8118491631e317e406f9bd15e20e97
git-bisect skip 9e111f3e167a14dd6252cff14fc7dd2ba4c650c6
git-bisect skip 06ac8346af04f6a972072f6c5780ba734832ad13
git-bisect skip 1ff2f20de354a621ef4b56b9cfe6f9139a7e493b
git-bisect skip 1ec2dafd937c0f6fed46cbd8f6878f2c1db4a623
git-bisect skip 43f39890db2959b10891cf7bbf3f53fffc8ce3bd
git-bisect skip 1c61d8c309a4080980474de8c6689527be180782
git-bisect skip 26f7ef14a76b0e590a3797fd7b2f3cee868d9664
git-bisect skip 4b19ed915576e8034c3653b4b10b79bde10f69fa
git-bisect skip 6b64ee02da20d6c0d97115e0b1ab47f9fa2f0d8f
git-bisect skip 193c81b979adbc4a540bf89e75b9039fae75bf82
git-bisect skip e006235e5b9cfb785ecbc05551788e33f96ea0ce
git-bisect skip 7cd92366a593246650cc7d6198e2c7d3af8c1d8a
git-bisect skip d1de36f5b5a30b8f9dae7142516fb122ce1e0661
git-bisect skip 8f47e16348e8e25eedf639092a8a2f10a66aba34
git-bisect skip c3e6a2042fef33b747d2ae3961f5312af801973d
git-bisect skip 54523edd237b9e792a3b76988fde23a91d739f43
git-bisect skip 5da690d29f0de17cc1835dd3eb8f8bd0945521f0
git-bisect skip 647ad94fc0479e33958cb4d0e20e241c0bcf599c
git-bisect skip e084e531000a488d2d27864266c13ac824575a8b
git-bisect skip ed74ca6d5a3e57eb0969d4e14e46cf9f88d25d3f
git-bisect skip f154f47d5180c2012bf97999e6c600d45db8af2d
git-bisect skip 36619a8a80793a803588a17f772313d5c948357d
git-bisect skip 3e92ab3d7e2edef5dccd8b0db21528699c81d2c0
git-bisect skip 550fe4f198558c147c6b8273a709568222a1668a
git-bisect skip 9fc2e79d4f239c1c1dfdab7b10854c7588b39d9a
git-bisect skip c379698fdac7cb65c96dec549850ce606dd6ceba
git-bisect skip f095df0a0cb35a52605541f619d038339b90d7cc

The problem with sata_sil:

drivers/ata/sata_sil.c: In function ‘sil_broken_system_poweroff’:
drivers/ata/sata_sil.c:713: error: implicit declaration of function ‘dmi_first_match’
drivers/ata/sata_sil.c:713: warning: initialization makes pointer from integer without a cast
make[2]: *** [drivers/ata/sata_sil.o] Error 1
make[1]: *** [drivers/ata] Error 2
make: *** [drivers] Error 2
make: *** Waiting for unfinished jobs....

Finally overcame the build failure (8x CPU Xeon - what a difference).

git bisect skip
Bisecting: 313 revisions left to test after this
[9d45cf9e36bf9bcf16df6e1cbf049807c8402823] Merge branch 'x86/urgent' into x86/apic

I'll go on with bisecting today after 18h (UTC).

(In reply to comment #35)
> Finaly overcome build failure (8x CPU Xeon - what a difference).

Excellent. I can't wait to find out what is causing the problem. Thanks for putting in all this work, Pavel. You are the chosen one because you can trigger the bug within minutes. ;-)

I hope that "release early and release often" also goes for this kind of spam ;c) So to keep you informed...

I don't know who said that for git bisecting it is best practice to use two close releases as good/bad. In this case it's not true. Bisecting between 2.6.29/2.6.30rc1 led me down a blind track :-/

When i finally overcame the sata_sil.c build failure, i thought i had won...
But that correctly built kernel did not boot (same kernel panic). After that i tried many git bisect skip, but got nothing other than a panic.

So after a deep breath i did a reset and began again with 2.6.29/2.6.30. I think i have 1, max 2, bisect turns ahead and i also have a good commit (after a lot of bad). So now i take my son for a walk and give that machine a hard try.

I think that this evening i should know the bad commit. So Martin (possibly others) - could you test with me to revert that commit to be sure we got the right one?

If i'm too optimistic, that would mean that the statement "release early and release often" is horribly incorrect and is the second wrong thing about linux (after the git statement) ;c)

Created attachment 22764 [details]
final git bisect log

Git bisect gave me the following commit. Makes sense?? Now i have to find a way to remove this patch from git's v2.6.30.5 and give it a try. Please test with me...

# git bisect good
4595f9620cda8a1e973588e743cf5f8436dd20c6 is first bad commit
commit 4595f9620cda8a1e973588e743cf5f8436dd20c6
Author: Rusty Russell <rusty@rustcorp.com.au>
Date: Sat Jan 10 21:58:09 2009 -0800

    x86: change flush_tlb_others to take a const struct cpumask

    Impact: reduce stack usage, use new cpumask API.

    This is made a little more tricky by uv_flush_tlb_others which
    actually alters its argument, for an IPI to be sent to the remaining
    cpus in the mask.

    I solve this by allocating a cpumask_var_t for this case and falling back
    to IPI should this fail.

    To eliminate temporaries in the caller, all flush_tlb_others implementations
    now do the this-cpu-elimination step themselves.

    Note also the curious "cpus_or(f->flush_cpumask, cpumask, f->flush_cpumask)"
    which has been there since pre-git and yet f->flush_cpumask is always zero
    at this point.

    Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
    Signed-off-by: Mike Travis <travis@sgi.com>

:040000 040000 f970a9bfa4ae30de22b4e9ef9e38836b1ff583cd 68b4e9c75b11bf81d5e4193a46328c34ca74415d M arch

I can't test it :-/ There were probably some changes

* a4a0acf8e17e3d08e28b721ceceb898fbc959ceb
* 694aa960608d2976666d850bd4ef78053bbd0c84

which lead to:

# git bisect reset
# git checkout v2.6.30-rc1
# git show 4595f9620cda8a1e973588e743cf5f8436dd20c6 | patch -p1 -R
patching file arch/x86/include/asm/paravirt.h
Hunk #1 succeeded at 273 (offset 29 lines).
Hunk #2 succeeded at 1076 (offset 92 lines).
patching file arch/x86/include/asm/tlbflush.h
Hunk #3 succeeded at 163 (offset -3 lines).
patching file arch/x86/include/asm/uv/uv_bau.h
Hunk #1 FAILED at 325.
1 out of 1 hunk FAILED -- saving rejects to file arch/x86/include/asm/uv/uv_bau.h.rej
can't find file to patch at input line 105
Perhaps you used the wrong -p or --strip option?
The text leading up to this was:
--------------------------
|diff --git a/arch/x86/kernel/tlb_32.c b/arch/x86/kernel/tlb_32.c
|index ce50546..ec53818 100644
|--- a/arch/x86/kernel/tlb_32.c
|+++ b/arch/x86/kernel/tlb_32.c
--------------------------
File to patch:

Call for HELP - What should i do now?

On Tuesday 18 August 2009, John Stoffel wrote:
>
> Just a quick followup, I've been doing a git bisect run over the past
> week or so trying to narrow this down. It's slow, since the system
> doesn't hang at any one point reliably. So far, here's my git log:
>
> > git bisect log
> git bisect start
> # bad: [f4b9a988685da6386d7f9a72df3098bcc3270526] Merge branch
> 'for-linus' of git://git.infradead.org/ubi-2.6
> git bisect bad f4b9a988685da6386d7f9a72df3098bcc3270526
> # good: [8e0ee43bc2c3e19db56a4adaa9a9b04ce885cd84] Linux 2.6.29
> git bisect good 8e0ee43bc2c3e19db56a4adaa9a9b04ce885cd84
> # bad: [095342389e2ed8deed07b3076f990260ce3c7c9f] perf_counter, x86:
> generic use of cpuc->active
> git bisect bad 095342389e2ed8deed07b3076f990260ce3c7c9f
> # bad: [095342389e2ed8deed07b3076f990260ce3c7c9f] perf_counter, x86:
> generic use of cpuc->active
> git bisect bad 095342389e2ed8deed07b3076f990260ce3c7c9f
> # bad: [095342389e2ed8deed07b3076f990260ce3c7c9f] perf_counter, x86:
> generic use of cpuc->active
> git bisect bad 095342389e2ed8deed07b3076f990260ce3c7c9f
> # bad: [095342389e2ed8deed07b3076f990260ce3c7c9f] perf_counter, x86:
> generic use of cpuc->active
> git bisect bad 095342389e2ed8deed07b3076f990260ce3c7c9f
> # bad: [ebc8eca169be0283d5a7ab54c4411dd59cfb0f27] Merge branch 'next'
> of git://git.kernel.org/pub/scm/linux/kernel/git/benh/powerpc
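
A minimal sketch for sanity-checking a bisect log like the one quoted above: it counts how many good/bad/skip verdicts the log contains and flags commits that were marked more than once. `summarize_bisect_log` is a hypothetical helper name, not a git command; it assumes only the `git bisect <verdict> <commit>` line format seen in this thread (plus the older `git-bisect` spelling) and tolerates `>` mail quoting.

```shell
# summarize_bisect_log: hypothetical helper, not part of git.
# Reads "git bisect log" output on stdin and prints verdict counts,
# followed by every commit that was given a verdict more than once.
summarize_bisect_log() {
    awk '
        {
            sub(/^> ?/, "")          # strip one level of mail quoting
            verdict = ""; commit = ""
            if ($1 == "git-bisect") { verdict = $2; commit = $3 }
            else if ($1 == "git" && $2 == "bisect") { verdict = $3; commit = $4 }
            # "# bad: [...]" comment lines and "start" are ignored
            if (verdict == "good" || verdict == "bad" || verdict == "skip") {
                count[verdict]++
                seen[commit]++
            }
        }
        END {
            printf "good=%d bad=%d skip=%d\n", count["good"], count["bad"], count["skip"]
            for (c in seen)
                if (seen[c] > 1)
                    printf "repeated %dx: %s\n", seen[c], c
        }
    '
}
```

Usage would be something like `git bisect log | summarize_bisect_log`. A healthy log normally has no repeated commits, so a "repeated" line is a hint that the log was assembled by hand or that the bisection was not advancing.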
Adding Rusty as cc. Rusty, does this make sense to you?

Rafael, isn't it a weird bisection you posted? Because:

* There is just one "good" - just the starting one
* Why do 4x "git bisect bad" give the same commit "perf_counter, x86:...."
* The last commit is from powerpc (not x86)
* It is suspiciously short

Anyway, i wanted to prove that Rusty's work in this particular case brought some badness into the kernel (at least for 2x CPU P3 and P2 machines).

So we all agreed that 2.6.29 is OK (does not freeze at least). So i reversed the logic and instead of removing 4595f9620cda8a1e973588e743cf5f8436dd20c6 (which is not possible in 2.6.30-rc1 and later) i applied this commit to 2.6.29. So i did:

    git checkout v2.6.29
    git show 4595f9620cda8a1e973588e743cf5f8436dd20c6 | patch -p1

and guess what... 2.6.29 began to freeze in exactly the same way as 2.6.30[.12345]. Is this enough or should i do something more?

Last thing, Rafael. If this bug is marked as "Blocking", does it mean that 2.6.31 cannot be released till this is fixed? Because i can confirm that this freezing also happens in 2.6.31-rc6.

(In reply to comment #42)

Googling 4595f9620cda8a1e973588e743cf5f8436dd20c6 or searching lkml.org for it reveals that the commit caused some crashes at the time (January/February 2009) and was subsequently fixed by Ingo Molnar. The fix was tested on hyperthreading machines because they were thought to be most vulnerable. Maybe it is possible that the fix fails on dual P2s and P3s?

Created attachment 22779 [details]
One of the rare trace in this saga
While preparing my server i could not help giving it a try with Debian's kernel and an attached serial console, just in case... And i was lucky and got a trace; after that i pressed Alt-SysRq l/m/s a few times. After a few seconds the machine died completely. At least i got something. Complete log attached.
This is probably one of my last contributions. I hope that those 10 days and more than 160 restarts were not wasted. Good luck!
[ 201.865003] BUG: soft lockup - CPU#1 stuck for 61s! [aptitude:2183]
[ 201.865003] Modules linked in: loop psmouse evdev snd_pcm snd_timer serio_raw snd soundcore snd_page_alloc pcspkr i2c_piix4 i2c_core parport_pc parport processor button sworks_agp agpgart ext3 jbd mbcache raid0 md_mod sg sr_mod cdrom sd_mod crc_t10dif ide_gd_mod ata_generic sata_sil ohci_hcd 8139cp serverworks ide_pci_generic libata e1000 usbcore 8139too mii ide_core scsi_mod thermal fan thermal_sys
[ 201.865003]
[ 201.865003] Pid: 2183, comm: aptitude Not tainted (2.6.30-1-686 #1) STL2
[ 201.865003] EIP: 0060:[<c031e818>] EFLAGS: 00000297 CPU: 1
[ 201.865003] EIP is at _spin_lock+0xe/0x15
[ 201.865003] EAX: f63ac990 EBX: 000000c3 ECX: 00000000 EDX: 0000a5a4
[ 201.865003] ESI: 00000000 EDI: f63ac968 EBP: c14ca060 ESP: d1863c90
[ 201.865003] DS: 007b ES: 007b FS: 00d8 GS: 0033 SS: 0068
[ 201.865003] CR0: 80050033 CR2: 0b0ad01c CR3: 1183a000 CR4: 000006d0
[ 201.865003] DR0: 00000000 DR1: 00000000 DR2: 00000000 DR3: 00000000
[ 201.865003] DR6: ffff0ff0 DR7: 00000400
[ 201.865003] Call Trace:
[ 201.865003] [<c018132b>] ? page_referenced_file+0x2e/0x82
[ 201.865003] [<c02c414e>] ? tcp_v4_rcv+0x3e3/0x5de
[ 201.865003] [<c012afbe>] ? irq_enter+0xf/0x45
[ 201.865003] [<c012b2c2>] ? irq_exit+0x31/0x53
[ 201.865003] [<c0103996>] ? error_interrupt+0x2a/0x30
[ 201.865003] [<c0181d8e>] ? page_referenced+0xbf/0xf1
[ 201.865003] [<c012007b>] ? find_lowest_rq+0x75/0x106
[ 201.865003] [<c01200d8>] ? find_lowest_rq+0xd2/0x106
[ 201.865003] [<c017313a>] ? shrink_page_list+0x121/0x568
[ 201.865003] [<c01724dd>] ? isolate_pages_global+0x91/0x1d0
[ 201.865003] [<c012afbe>] ? irq_enter+0xf/0x45
[ 201.865003] [<c012b2c2>] ? irq_exit+0x31/0x53
[ 201.865003] [<c0103996>] ? error_interrupt+0x2a/0x30
[ 201.865003] [<c01737af>] ? shrink_list+0x22e/0x4d6
[ 201.865003] [<c012afbe>] ? irq_enter+0xf/0x45
[ 201.865003] [<c012b2c2>] ? irq_exit+0x31/0x53
[ 201.865003] [<c0103996>] ? error_interrupt+0x2a/0x30
[ 201.865003] [<c012afbe>] ? irq_enter+0xf/0x45
[ 201.865003] [<c012afbe>] ? irq_enter+0xf/0x45
[ 201.865003] [<c012b2c2>] ? irq_exit+0x31/0x53
[ 201.865003] [<c0103996>] ? error_interrupt+0x2a/0x30
[ 201.865003] [<c0173c77>] ? shrink_zone+0x220/0x2af
[ 201.865003] [<c01748cb>] ? try_to_free_pages+0x225/0x34c
[ 201.865003] [<c017244c>] ? isolate_pages_global+0x0/0x1d0
[ 201.865003] [<c016f832>] ? __alloc_pages_internal+0x219/0x39d
[ 201.865003] [<c017b78a>] ? handle_mm_fault+0x162/0x652
[ 201.865003] [<c011774d>] ? do_page_fault+0x1d8/0x1e7
[ 201.865003] [<c0117575>] ? do_page_fault+0x0/0x1e7
[ 201.865003] [<c031ea45>] ? error_code+0x6d/0x74
[ 201.865003] [<c0117575>] ? do_page_fault+0x0/0x1e7
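
A rough sketch for getting an overview of a trace like the one above: it counts how often each symbol appears among the stack frames, which makes the recurring `irq_enter` / `irq_exit` / `error_interrupt` entries easy to spot. `count_trace_symbols` is a hypothetical helper name; it assumes the `[<address>] ? symbol+0xoff/0xsize` frame format of this 2.6.30 trace. Frames marked with `?` were found on the stack but are not verified as part of the call chain, so the counts are only a hint.

```shell
# count_trace_symbols: hypothetical helper.
# Reads a kernel call trace on stdin and prints "<count> <symbol>" lines,
# most frequent symbol first.
count_trace_symbols() {
    # Keep only frame lines such as "[<c012afbe>] ? irq_enter+0xf/0x45",
    # reduce each one to the bare symbol name, then count duplicates.
    sed -n 's/.*\[<[0-9a-f]*>\] \(? \)\{0,1\}\([A-Za-z0-9_.]*\)+0x.*/\2/p' |
        sort | uniq -c | awk '{ print $1, $2 }' | sort -k1,1nr -k2,2
}
```

It could be fed from a saved serial console log, for example `count_trace_symbols < console.log`, where `console.log` is a placeholder file name.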
(In reply to comment #43)
> ...fixed by Ingo Molnar. The fix was tested on hyperthreading
> machines because they were thought to be most vulnerable. Maybe it is
> possible that the fix fails on dual P2s and P3s?

Later fixes should not be the reason for this, because i took 4595f9620cda8a1e973588e743cf5f8436dd20c6 (without Ingo's subsequent fixes) and applied it to 2.6.29. After that, the machine began to freeze.

congrats for the analysis and thanks for the hard work!

let me try putting that together -

this change probably introduced the issue:
http://git.kernel.org/?p=linux/kernel/git/torvalds/linux-2.6.git;a=commit;h=4595f9620cda8a1e973588e743cf5f8436dd20c6

and this one fixed it to some degree, but obviously not entirely (maybe one more race being introduced?) :
http://git.kernel.org/?p=linux/kernel/git/torvalds/linux-2.6.git;a=commit;h=5766b842b23c6b40935a5f3bd435b2bcdaff2143

correct ?

>If this bug is marked as "Blocking", does it mean that 2.6.31 cannot
>be released till this is fixed?

i hope that linus won't release a kernel with such a treacherous bug.

While most people seem to experience these lockups with dual P2s or P3s, I might as well add that 2.6.30.x is also locking up for me on my dual Athlon MP box.

Looks like this particular commit caused some controversy before: http://lkml.indiana.edu/hypermail/linux/kernel/0901.2/01662.html

The dual Pentium Pro I earlier reported as working (for >10 days!) crashed today, soft lockup on CPU#1. So PPro is susceptible too, it just takes much longer to manifest. My P3 system would typically lock up in less than an hour.

I've had this problem since 2.6.30 as well. I'm also running a dual Athlon MP system. Perhaps this is an issue related to dual processor systems?

Ditto with freezing with dual P3's here. But I seem to have tracked it down to e100.c. Preventing e100.c from loading (i.e. compiling it as a module & blacklisting it), I then got 16+ hours uptime before rebooting to further document the bugs. The freeze is so bad, even the serial console locks up/freezes.

Here's my quick documentation on the e100.c freeze: http://bugzilla.kernel.org/show_bug.cgi?id=13991

The other reason I'm pointing my finger at e100: there are tons of patches within the past version. Not to mention previous patches which killed wake on lan. I might be jumping to conclusions about e100.c though.

(In reply to comment #51)
> Ditto with freezing with dual P3's here. But I seem to have tracked it down
> to e100.c.

OK, if you solve it, then it seems to be a different bug, because this happens to me with:

* Network card: PCI-X, Intel 1Gbps 82543GC
* Network card: PCI Realtek RT8139

None of these use e100.c.

Pavel, I haven't been able to set up kgdb over the serial console, but I have had a serial console steady on 2.6.30 for the past days and didn't get any trace, but did get erratic e100 (nic) up & down just prior to the freeze one time. Any other way of getting more debug output out of the kernel? btw, only PCI here (Tyan Tiger 100 i440bx)

Roger, you did not get my point. I'm not arguing with you - you found a bug, but you will have to decide:

* A *) #13991 _is_ the same as #13933, then let #13991 be CLOSED as DUPLICATE and _drop_ the idea of e100 being the source of all problems

* B *) #13991 _is_not_ the same as #13933, then do the bisecting in e100 and find the commit which is causing you trouble. _Then_ find a developer of e100 and try to persuade him that he screwed up your machine. After that be happy to close #13991.

But _PLEASE_ do not mix things together. On the contrary, we need to separate things and make them as clear as possible (such as by bisecting), otherwise developers will be confused and rather go away. And we don't want this.

If you are confused and do not know whether to choose A or B, take the following step:

* Turn off one of your CPUs (the HW or SW way) and then simulate your lockup with e100. Does it still occur with only one CPU?
YES: Then go for B and good luck with bisecting
NO : Then forget about a freezing bug in e100 and you are welcome in the #13933 club

So _before_ you answer me, PLEASE read this comment three times :c) and be sure to choose A or B first. If you don't have time to take this step, you don't have time to post here ;) That's OK, i also have problems with time...

PS: I'm not a developer so i don't use KGDB and really can't help you with debugging. Sorry :(

I have a DELL Precision dual P3 server that also freezes with 2.6.30. My first thought was that the nvidia driver is not yet ready for 2.6.30, and I switched back to 2.6.27. Next I tried again without the nvidia driver and it also freezes under high load. Then I built the kernel without SMP support and it survived a five hour stress test. Therefore I do think that this bug and the other bug reports like http://bugzilla.kernel.org/show_bug.cgi?id=13219 are SMP problems on P2 and P3 systems and not directly NIC or chipset related. I have not seen a freeze bug report for a single P3 system nor for P4 and above (single or multicore) so far.

Still pretty quiet on the developer front :-/ So i also reported this problem at Debian (http://bugs.debian.org/cgi-bin/bugreport.cgi?bug=542551).

You can take it as a template for your distro's bug tracking system. If your distro offers 2.6.30 or later, give it a try.

Reason:

* Let distro developers know about such an issue
* Save distro developers time with searching, replicating, bisecting
* Linking this bug report from other sites will make this the #1 search result in google

This is all done to attract more people, so we will know if there are 10, 10 hundred or 10 zillion affected users, and we can also get more valuable reports and platforms like those with dual Athlon MP.

If this doesn't help to attract some developer we will probably have to fill in something at www.petitiononline.com or place an advertisement in the Times like Mozilla did during the release of a new version :-D

OK, enough jokes.
Anyone who attracts a developer who begins to work on this will get a special tag "Nail-developer-down-by: Mr. X Y" (of course it will be in front of all those "Signed-off-by: " things)

The common point of all freezing computers is SMP and e100? Or did I miss something?

I was a few minutes off... Of course not because i was waiting for a sedative to take effect before replying, but to summarize facts.

Seriously, David. I don't think this is related to e100, e1000, the realtek driver or whatever network driver. I even think this has nothing to do with general networking, the IPv4 stack or whatever.

I can trigger this bug without a network connection. Hope it's clarified now.

If i may correct your conclusion, David:
*** the common point of all freezing computers _in_this_bugreport_ is 2x CPU _and_
*** commit 4595f9620cda8a1e973588e743cf5f8436dd20c6

Great news if we found THE commit... My bisection isn't going fast enough as I can reach more than 4 hours of uptime sometimes!!!!

Is there a way to test this without losing the point I've reached?

David Hill

(In reply to comment #59)
> Great news if we found THE commit... My bisection isn't going fast
> enough as I can reach more than 4 hours of uptime sometimes!!!!
>
> Is there a way to test this without loosing the point I've reached?

I'm a git noob... But i think, if you back up your bisection, you can return anytime. Do "git bisect log > my_bisection.log" (or back up the file .git/BISECT_LOG)

If the following freezes your machine, you share the same pain and you belong to #13933:

    git bisect reset
    git reset --hard HEAD
    git checkout v2.6.29
    git show 4595f9620cda8a1e973588e743cf5f8436dd20c6 | patch -p1
    make mrproper
    cp your_config .config
    make oldconfig
    make -j 2

Please don't come back after one day saying that i'm wrong and that the above works perfectly for you. There are reports that the freeze occurs after 10 days of work/uptime.

If you want to return to your bisection:

    git bisect reset
    git reset --hard HEAD
    git fetch ; git rebase origin
    git bisect start

and then do git bisect good/bad <hash_from_my_bisection.log> -- do it based on your backup log.

If this does not work for you: sorry, i warned you that i'm a git noob ;)

>Is there a way to test this without loosing the point I've reached?

yes - you need to test with a kernel w/ and w/o commit 4595f9620cda8a1e973588e743cf5f8436dd20c6

can't you just check out into a clean/new git repo ?
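
The idea of backing up a bisection and returning to it later can be sketched with git's own subcommands. This is only a sketch: `save_bisect_state` and `resume_bisect_state` are hypothetical wrapper names, while `git bisect log`, `git reset --hard` and `git bisect replay` are existing git commands. `git bisect replay` re-applies every recorded verdict from the log file, so the hashes do not have to be typed in again by hand.

```shell
# Hypothetical wrappers around existing git subcommands.

# save_bisect_state FILE: write the current bisection history to FILE.
save_bisect_state() {
    git bisect log > "$1"
}

# resume_bisect_state FILE: drop local modifications (for example a test
# patch applied with "patch -p1") and replay the verdicts stored in FILE.
# Note: "git reset --hard" discards uncommitted changes to tracked files.
resume_bisect_state() {
    git reset --hard &&
    git bisect replay "$1"
}
```

A possible sequence, with `my_bisection.log` as a placeholder kept outside the tracked files: `save_bisect_state my_bisection.log`, then `git bisect reset` and the one-off test on v2.6.29, and finally `resume_bisect_state my_bisection.log` to land on the same bisection point as before.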
let me try putting things together again -

this change probably introduced the issue:
http://git.kernel.org/?p=linux/kernel/git/torvalds/linux-2.6.git;a=commit;h=4595f9620cda8a1e973588e743cf5f8436dd20c6

and this one fixed it to some degree, but obviously not entirely (maybe one more race being introduced?):
http://git.kernel.org/?p=linux/kernel/git/torvalds/linux-2.6.git;a=commit;h=5766b842b23c6b40935a5f3bd435b2bcdaff2143

and these ones are more fixes for commit 4595f9620cda8a1e973588e743cf5f8436dd20c6:
http://git.kernel.org/?p=linux/kernel/git/torvalds/linux-2.6.git;a=commit;h=694aa960608d2976666d850bd4ef78053bbd0c84
http://git.kernel.org/?p=linux/kernel/git/torvalds/linux-2.6.git;a=commit;h=a4a0acf8e17e3d08e28b721ceceb898fbc959ceb

[ Added more people and bugzilla to the cc, since I have a random patch.
Quite frankly, this patch is not really deeply thought out, it's just a
"Hmm, that situation could have different behavior on different
hardware" kind of random musing ]
On Thu, 20 Aug 2009, Mike Travis wrote:
>
> I've been quite a ways away from this code for a while but I'll look closer
> at it today, especially your observations. Unfortunately, a 32-bit test
> machine is a hard thing to find around here. (Even my laptop runs a 64-bit
> kernel [w/NR_CPUS=4096 of course].)
Well, even then, some indications seem to be that it's mainly older
machines. I'm not seeing any Core 2's or even P4's. Of course, it might
be timing (and need a slower CPU), but there are no celerons or Atoms
there either (not that there are all that many reports, so it might be
just pure bad luck).
So it could literally be some interaction issue with "older APIC" or
similar.
Anyway, the whole "empty mask" thing does strike me as a special case, and
something that I could imagine different hardware does different things
for, so what happens with a patch like this?
Linus
---
arch/x86/kernel/apic/ipi.c | 3 +++
1 files changed, 3 insertions(+), 0 deletions(-)
diff --git a/arch/x86/kernel/apic/ipi.c b/arch/x86/kernel/apic/ipi.c
index dbf5445..6ef00ba 100644
--- a/arch/x86/kernel/apic/ipi.c
+++ b/arch/x86/kernel/apic/ipi.c
@@ -106,6 +106,9 @@ void default_send_IPI_mask_logical(const struct cpumask *cpumask, int vector)
 	unsigned long mask = cpumask_bits(cpumask)[0];
 	unsigned long flags;
 
+	if (WARN_ONCE(!mask, "empty IPI mask"))
+		return;
+
 	local_irq_save(flags);
 	WARN_ON(mask & ~cpumask_bits(cpu_online_mask)[0]);
 	__default_send_IPI_dest_field(mask, vector, apic->dest_logical);
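
To make the "empty mask" case concrete, here is a small userspace sketch in C. It is not kernel code: `cpu_bits`, `fake_send_ipi`, `send_ipi_mask_guarded` and `flush_targets` are hypothetical names invented for illustration, and the mask is assumed to fit in one machine word. It only shows the arithmetic (targets = CPUs using the mm, minus the sender) and the shape of the early-return guard; it does not model whatever race makes the mask empty on the affected machines.

```c
#include <assert.h>
#include <limits.h>
#include <stdio.h>

/* Hypothetical one-word stand-in for a cpumask: bit n set = CPU n is a target. */
typedef unsigned long cpu_bits;

/* Counts how often the simulated hardware send path was reached. */
static int ipis_sent;

/* Stand-in for the low-level send; the real code programs the APIC here. */
static void fake_send_ipi(cpu_bits mask)
{
	ipis_sent++;
	printf("IPI sent to mask 0x%lx\n", mask);
}

/* Same shape as the guard in the patch: warn and bail out on an empty mask. */
static int send_ipi_mask_guarded(cpu_bits mask)
{
	if (!mask) {
		fprintf(stderr, "warning: empty IPI mask\n");
		return 0;	/* nothing was sent */
	}
	fake_send_ipi(mask);
	return 1;
}

/* Targets of a flush: every CPU using the mm except the sender itself. */
static cpu_bits flush_targets(cpu_bits mm_cpus, unsigned int self)
{
	assert(self < CHAR_BIT * sizeof(cpu_bits));
	return mm_cpus & ~(1UL << self);
}
```

With this sketch, `send_ipi_mask_guarded(flush_targets(0x1, 0))` takes the warning path and sends nothing, because CPU 0 is the only user of the mm once the sender is removed, whereas `flush_targets(0x3, 0)` leaves CPU 1 as a target and the send goes through.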
(In reply to comment #62)
> + if (WARN_ONCE(!mask, "empty IPI mask"))
> + return;
> +

Testing it right now (on untainted 2.6.30.4). It might take a while to trigger.

On Thu, 20 Aug 2009, Linus Torvalds wrote:
>
> Anyway, the whole "empty mask" thing does strike me as a special case, and
> something that I could imagine different hardware does different things
> for, so what happens with a patch like this?
Just a quick note: commit 694aa960608d2976666d850bd4ef78053bbd0c84 seems
to imply that this "empty CPUmask" really does happen, and confused the
xen_flush_tlb_others() code.
I do suspect that if this really is it (ie that WARN_ON() actually
triggers, and returning early from default_send_IPI_mask_logical() fixes
the hang), then we should fix it at a higher level, rather than in the
actual IPI code.
It looks trivial to make 'bitmask_and[not]()' return whether the result
has any bits set or not, and then we could do something like this
instead..
However, this is only relevant if my previous hacky patch actually
triggers. But the fact that Xen had issues with empty CPUmasks does seem
to indicate that it really could trigger.
The patch below is totally untested. It may or may not compile, much less
actually work. Caveat emptor.
Linus
---
arch/x86/mm/tlb.c | 21 ++++++++++-----------
include/linux/bitmap.h | 18 ++++++++----------
include/linux/cpumask.h | 20 ++++++++++----------
lib/bitmap.c | 12 ++++++++----
4 files changed, 36 insertions(+), 35 deletions(-)
diff --git a/arch/x86/mm/tlb.c b/arch/x86/mm/tlb.c
index 821e970..c814e14 100644
--- a/arch/x86/mm/tlb.c
+++ b/arch/x86/mm/tlb.c
@@ -183,18 +183,17 @@ static void flush_tlb_others_ipi(const struct cpumask *cpumask,
f->flush_mm = mm;
f->flush_va = va;
- cpumask_andnot(to_cpumask(f->flush_cpumask),
- cpumask, cpumask_of(smp_processor_id()));
-
- /*
- * We have to send the IPI only to
- * CPUs affected.
- */
- apic->send_IPI_mask(to_cpumask(f->flush_cpumask),
- INVALIDATE_TLB_VECTOR_START + sender);
+ if (cpumask_andnot(to_cpumask(f->flush_cpumask), cpumask, cpumask_of(smp_processor_id()))) {
+ /*
+ * We have to send the IPI only to
+ * CPUs affected.
+ */
+ apic->send_IPI_mask(to_cpumask(f->flush_cpumask),
+ INVALIDATE_TLB_VECTOR_START + sender);
- while (!cpumask_empty(to_cpumask(f->flush_cpumask)))
- cpu_relax();
+ while (!cpumask_empty(to_cpumask(f->flush_cpumask)))
+ cpu_relax();
+ }
f->flush_mm = NULL;
f->flush_va = 0;
diff --git a/include/linux/bitmap.h b/include/linux/bitmap.h
index 2878811..756d78b 100644
--- a/include/linux/bitmap.h
+++ b/include/linux/bitmap.h
@@ -94,13 +94,13 @@ extern void __bitmap_shift_right(unsigned long *dst,
const unsigned long *src, int shift, int bits);
extern void __bitmap_shift_left(unsigned long *dst,
const unsigned long *src, int shift, int bits);
-extern void __bitmap_and(unsigned long *dst, const unsigned long *bitmap1,
+extern int __bitmap_and(unsigned long *dst, const unsigned long *bitmap1,
const unsigned long *bitmap2, int bits);
extern void __bitmap_or(unsigned long *dst, const unsigned long *bitmap1,
const unsigned long *bitmap2, int bits);
extern void __bitmap_xor(unsigned long *dst, const unsigned long *bitmap1,
const unsigned long *bitmap2, int bits);
-extern void __bitmap_andnot(unsigned long *dst, const unsigned long *bitmap1,
+extern int __bitmap_andnot(unsigned long *dst, const unsigned long *bitmap1,
const unsigned long *bitmap2, int bits);
extern int __bitmap_intersects(const unsigned long *bitmap1,
const unsigned long *bitmap2, int bits);
@@ -171,13 +171,12 @@ static inline void bitmap_copy(unsigned long *dst, const unsigned long *src,
}
}
-static inline void bitmap_and(unsigned long *dst, const unsigned long *src1,
+static inline int bitmap_and(unsigned long *dst, const unsigned long *src1,
const unsigned long *src2, int nbits)
{
if (small_const_nbits(nbits))
- *dst = *src1 & *src2;
- else
- __bitmap_and(dst, src1, src2, nbits);
+ return (*dst = *src1 & *src2) != 0;
+ return __bitmap_and(dst, src1, src2, nbits);
}
static inline void bitmap_or(unsigned long *dst, const unsigned long *src1,
@@ -198,13 +197,12 @@ static inline void bitmap_xor(unsigned long *dst, const unsigned long *src1,
__bitmap_xor(dst, src1, src2, nbits);
}
-static inline void bitmap_andnot(unsigned long *dst, const unsigned long *src1,
+static inline int bitmap_andnot(unsigned long *dst, const unsigned long *src1,
const unsigned long *src2, int nbits)
{
if (small_const_nbits(nbits))
- *dst = *src1 & ~(*src2);
- else
- __bitmap_andnot(dst, src1, src2, nbits);
+ return (*dst = *src1 & ~(*src2)) != 0;
+ return __bitmap_andnot(dst, src1, src2, nbits);
}
static inline void bitmap_complement(unsigned long *dst, const unsigned long *src,
diff --git a/include/linux/cpumask.h b/include/linux/cpumask.h
index c5ac87c..796df12 100644
--- a/include/linux/cpumask.h
+++ b/include/linux/cpumask.h
@@ -43,10 +43,10 @@
* int cpu_isset(cpu, mask) true iff bit 'cpu' set in mask
* int cpu_test_and_set(cpu, mask) test and set bit 'cpu' in mask
*
- * void cpus_and(dst, src1, src2) dst = src1 & src2 [intersection]
+ * int cpus_and(dst, src1, src2) dst = src1 & src2 [intersection]
* void cpus_or(dst, src1, src2) dst = src1 | src2 [union]
* void cpus_xor(dst, src1, src2) dst = src1 ^ src2
- * void cpus_andnot(dst, src1, src2) dst = src1 & ~src2
+ * int cpus_andnot(dst, src1, src2) dst = src1 & ~src2
* void cpus_complement(dst, src) dst = ~src
*
* int cpus_equal(mask1, mask2) Does mask1 == mask2?
@@ -179,10 +179,10 @@ static inline int __cpu_test_and_set(int cpu, cpumask_t *addr)
}
#define cpus_and(dst, src1, src2) __cpus_and(&(dst), &(src1), &(src2), NR_CPUS)
-static inline void __cpus_and(cpumask_t *dstp, const cpumask_t *src1p,
+static inline int __cpus_and(cpumask_t *dstp, const cpumask_t *src1p,
const cpumask_t *src2p, int nbits)
{
- bitmap_and(dstp->bits, src1p->bits, src2p->bits, nbits);
+ return bitmap_and(dstp->bits, src1p->bits, src2p->bits, nbits);
}
#define cpus_or(dst, src1, src2) __cpus_or(&(dst), &(src1), &(src2), NR_CPUS)
@@ -201,10 +201,10 @@ static inline void __cpus_xor(cpumask_t *dstp, const cpumask_t *src1p,
#define cpus_andnot(dst, src1, src2) \
__cpus_andnot(&(dst), &(src1), &(src2), NR_CPUS)
-static inline void __cpus_andnot(cpumask_t *dstp, const cpumask_t *src1p,
+static inline int __cpus_andnot(cpumask_t *dstp, const cpumask_t *src1p,
const cpumask_t *src2p, int nbits)
{
- bitmap_andnot(dstp->bits, src1p->bits, src2p->bits, nbits);
+ return bitmap_andnot(dstp->bits, src1p->bits, src2p->bits, nbits);
}
#define cpus_complement(dst, src) __cpus_complement(&(dst), &(src), NR_CPUS)
@@ -738,11 +738,11 @@ static inline void cpumask_clear(struct cpumask *dstp)
* @src1p: the first input
* @src2p: the second input
*/
-static inline void cpumask_and(struct cpumask *dstp,
+static inline int cpumask_and(struct cpumask *dstp,
const struct cpumask *src1p,
const struct cpumask *src2p)
{
- bitmap_and(cpumask_bits(dstp), cpumask_bits(src1p),
+ return bitmap_and(cpumask_bits(dstp), cpumask_bits(src1p),
cpumask_bits(src2p), nr_cpumask_bits);
}
@@ -779,11 +779,11 @@ static inline void cpumask_xor(struct cpumask *dstp,
* @src1p: the first input
* @src2p: the second input
*/
-static inline void cpumask_andnot(struct cpumask *dstp,
+static inline int cpumask_andnot(struct cpumask *dstp,
const struct cpumask *src1p,
const struct cpumask *src2p)
{
- bitmap_andnot(cpumask_bits(dstp), cpumask_bits(src1p),
+ return bitmap_andnot(cpumask_bits(dstp), cpumask_bits(src1p),
cpumask_bits(src2p), nr_cpumask_bits);
}
diff --git a/lib/bitmap.c b/lib/bitmap.c
index 35a1f7f..ec221a7 100644
--- a/lib/bitmap.c
+++ b/lib/bitmap.c
@@ -179,14 +179,16 @@ void __bitmap_shift_left(unsigned long *dst,
}
EXPORT_SYMBOL(__bitmap_shift_left);
-void __bitmap_and(unsigned long *dst, const unsigned long *bitmap1,
+int __bitmap_and(unsigned long *dst, const unsigned long *bitmap1,
const unsigned long *bitmap2, int bits)
{
int k;
int nr = BITS_TO_LONGS(bits);
+ unsigned long result = 0;
for (k = 0; k < nr; k++)
- dst[k] = bitmap1[k] & bitmap2[k];
+ result |= (dst[k] = bitmap1[k] & bitmap2[k]);
+ return result != 0;
}
EXPORT_SYMBOL(__bitmap_and);
@@ -212,14 +214,16 @@ void __bitmap_xor(unsigned long *dst, const unsigned long *bitmap1,
}
EXPORT_SYMBOL(__bitmap_xor);
-void __bitmap_andnot(unsigned long *dst, const unsigned long *bitmap1,
+int __bitmap_andnot(unsigned long *dst, const unsigned long *bitmap1,
const unsigned long *bitmap2, int bits)
{
int k;
int nr = BITS_TO_LONGS(bits);
+ unsigned long result = 0;
for (k = 0; k < nr; k++)
- dst[k] = bitmap1[k] & ~bitmap2[k];
+ result |= dst[k] = bitmap1[k] & ~bitmap2[k];
+ return result != 0;
}
EXPORT_SYMBOL(__bitmap_andnot);
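
The idea behind the patch above can be sketched outside the kernel as follows. This is a minimal illustration, not the kernel implementation: `bitmap_andnot_sketch`, `flush_others_sketch`, `BITS_PER_WORD` and `WORDS_FOR` are hypothetical names, and the sketch assumes callers keep any unused tail bits of the last word at zero. It shows an "and-not" that reports whether the result is non-empty, and a caller that skips the send and the wait when it is empty.

```c
#include <assert.h>
#include <limits.h>
#include <stddef.h>

#define BITS_PER_WORD (sizeof(unsigned long) * CHAR_BIT)
#define WORDS_FOR(bits) (((bits) + BITS_PER_WORD - 1) / BITS_PER_WORD)

/* dst = a & ~b over 'bits' bits; returns 1 if any bit of the result is set. */
static int bitmap_andnot_sketch(unsigned long *dst, const unsigned long *a,
				const unsigned long *b, size_t bits)
{
	unsigned long result = 0;
	size_t k, nr = WORDS_FOR(bits);

	assert(bits > 0);
	for (k = 0; k < nr; k++)
		result |= (dst[k] = a[k] & ~b[k]);
	return result != 0;
}

/* Caller pattern: only "send" when the computed target set is non-empty. */
static int flush_others_sketch(const unsigned long *users,
			       const unsigned long *self,
			       unsigned long *targets, size_t bits)
{
	if (!bitmap_andnot_sketch(targets, users, self, bits))
		return 0;	/* no other CPU to notify: skip the IPI and the wait */
	/* The real code would send the IPI here and spin until targets is empty. */
	return 1;
}
```

Returning the emptiness check from the same loop that computes the mask means the caller never has to make a second pass over the bitmap just to decide whether to send.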
(In reply to comment #62)
> + if (WARN_ONCE(!mask, "empty IPI mask"))
> + return;
> +

Also applied it to my 2.6.30.4 source tree, recompiled and tested a pretty surefire way to hang the box, which is starting Opera with a bunch of saved tabs, and got

[ 154.157381] ------------[ cut here ]------------
[ 154.157420] WARNING: at arch/x86/kernel/apic/ipi.c:109 default_send_IPI_mask_logical+0x2f/0xdd()
[ 154.157430] Hardware name: MS-6501
[ 154.157437] empty IPI maskModules linked in: fuse netconsole snd_au8820 snd_ac97_codec snd_pcm_oss snd_mixer_oss snd_pcm snd_page_alloc snd_mpu401 ac97_bus snd_mpu401_uart snd_seq_oss snd_seq_midi snd_seq_midi_event snd_seq snd_rawmidi snd_timer snd_seq_device snd i2c_amd756 i2c_core ns558 parport_pc parport gameport soundcore evdev usbhid ohci_hcd e100 usbcore
[ 154.157667] Pid: 5907, comm: opera Not tainted 2.6.30.4 #2
[ 154.157673] Call Trace:
[ 154.157686] [<b012454c>] warn_slowpath_common+0x60/0x90
[ 154.157704] [<b01245b0>] warn_slowpath_fmt+0x24/0x27
[ 154.157712] [<b011046f>] default_send_IPI_mask_logical+0x2f/0xdd
[ 154.157726] [<b0117c15>] flush_tlb_others_ipi+0x87/0xb4
[ 154.157746] [<b0117db8>] flush_tlb_mm+0x59/0x5d
[ 154.157756] [<b016b19e>] mprotect_fixup+0x212/0x296
[ 154.157764] [<b016b38c>] sys_mprotect+0x16a/0x1c6
[ 154.157776] [<b0102958>] sysenter_do_call+0x12/0x36
[ 154.157793] ---[ end trace ced042bf780fb0a5 ]---

And returning early from the function means my box is still alive.

(In reply to comment #63)
> (In reply to comment #62)
> > + if (WARN_ONCE(!mask, "empty IPI mask"))
> > + return;
> > +
>
> Testing it right now (on untainted 2.6.30.4). It might take a while to
> trigger.
caught a warning, machine survived:

Aug 21 01:14:01 arnold kernel: ------------[ cut here ]------------
Aug 21 01:14:01 arnold kernel: WARNING: at arch/x86/kernel/apic/ipi.c:109 default_send_IPI_mask_logical+0x2a/0xb0()
Aug 21 01:14:01 arnold kernel: Hardware name: VT8653-8233
Aug 21 01:14:01 arnold kernel: empty IPI maskModules linked in: via_agp agpgart
Aug 21 01:14:01 arnold kernel: Pid: 2561, comm: ktorrent Not tainted 2.6.30.4 #6
Aug 21 01:14:01 arnold kernel: Call Trace:
Aug 21 01:14:01 arnold kernel: [<c0122994>] ? warn_slowpath_common+0x5e/0x8a
Aug 21 01:14:01 arnold kernel: [<c01229f2>] ? warn_slowpath_fmt+0x26/0x2a
Aug 21 01:14:01 arnold kernel: [<c010f63e>] ? default_send_IPI_mask_logical+0x2a/0xb0
Aug 21 01:14:01 arnold kernel: [<c011640d>] ? flush_tlb_others_ipi+0x83/0xad
Aug 21 01:14:01 arnold kernel: [<c0116517>] ? flush_tlb_mm+0x60/0x7a
Aug 21 01:14:01 arnold kernel: [<c0159a06>] ? unmap_region+0xe4/0x118
Aug 21 01:14:01 arnold kernel: [<c015a667>] ? do_munmap+0x1de/0x228
Aug 21 01:14:01 arnold kernel: [<c015a6d8>] ? sys_munmap+0x27/0x35
Aug 21 01:14:01 arnold kernel: [<c0102941>] ? syscall_call+0x7/0xb
Aug 21 01:14:01 arnold kernel: ---[ end trace c311cb19727383c1 ]---

Ok, current -git now has all the commits, and marked with cc: stable@kernel.org. Commits:

b04e637 x86: don't call '->send_IPI_mask()' with an empty mask
f4b0373 Make bitmask 'and' operators return a result code
83d349f x86: don't send an IPI to the empty set of CPU's

so this bug entry should probably be closed once people have double-checked it,

upstream commits resolving this bugzilla are:

b04e637: x86: don't call '->send_IPI_mask()' with an empty mask
f4b0373: Make bitmask 'and' operators return a result code
83d349f: x86: don't send an IPI to the empty set of CPU's

I shall keep running 83d349f just to make sure no more lockups occur. Thanks to everybody who contributed. Good teamwork. ;-)

*** Bug 13991 has been marked as a duplicate of this bug. ***

I tried the patches from Linus and they work on my Tyan S1834/Tiger 133 mainboard with P3 1000MHz processors, but I still have freezes on an Asus P2B-D mainboard with 600MHz PIII processors. I tried the 2.6.25.9 kernel, but it freezes too.

Ognjan, this was a problem introduced around 2.6.30, so if you are having problems with 2.6.25.9, there is a different problem. Try booting with acpi=off and/or noapic, and also check all of the cylindrical capacitors on the motherboard for tops that are not flat, indicating they have gone bad. Also consider running memtest86+ overnight.

On Mon, 7 Sep 2009, bugzilla-daemon@bugzilla.kernel.org wrote:
>
> I tried the patches from Linus and it works on my Tyan S1834/Tiger 133
> mainboard with P3 1000Mhz processors, but i still have freezes on Asus P2B-D
> Mainboard with 600Mhz PIII processors.

I suspect your 600MHz P-III board has some other issues. The fact that the problems with that board go back to 2.6.25.9 also implies that - the TLB flush IPI problem was new to 2.6.30.

So your PIII lockup is different, and should probably get a bugzilla of its own rather than be mixed up with this one.

Feel free to open a new bugzilla entry, but before you do that, can you try enabling the NMI watchdog and try to see if you can get it to hang in text-mode so that the NMI watchdog has a chance to trigger and show anything? (See Documentation/nmi_watchdog.txt for details)

Also, one word of warning: how sure are you about the stability of that machine in general? Hangs under load can easily be due to borderline power supplies (which includes things like the capacitors on the motherboard, not just the PSU unit itself) causing CPU power brownouts etc.

Linus

The "Tyan S1834/Tiger 133" is a VIA chipset. The "Asus P2B-D" is an Intel 440BX chipset motherboard. IMO, the 440BX boards are pretty darn stable; I was working on the LinuxBIOS/Coreboot project with three of them here.
The common problem on these (or any) boards is dust in the RAM, PCI and/or CPU slots -- doesn't look like you have slot style CPU slots on the P2B-D. Getting the dust out and re-seating the boards always seems to solve the problem(s) here. I have a Tyan Tiger 100 440BX along with two other cheaper 440BX boards around and they tend to be rock solid (aside from the dust issues).

Sorry for the confusion. It was my fault. It seems to work now with noapic and acpi=off. Thanks a lot.

hi;

Just to offer some information that may help track this issue down; I'm running on a Dual XEON (old style) MP system here (a Dell Precision WorkStation 530), and i do _NOT_ get these lockups using a Debian 2.6.30-1-686 sid kernel.

Uptime == 15:21:19 up 31 days, 8:17, 1 user, load average: 0.09, 0.05, 0.01

* cat /proc/version
Linux version 2.6.30-1-686 (Debian 2.6.30-4) (waldi@debian.org) (gcc version 4.3.3 (Debian 4.3.3-15) ) #1 SMP Thu Jul 30 14:45:30 UTC 2009

(note the gcc version as someone above alluded to as perhaps part of the issue).
* Full System specs ; http://pompone.cs.ucsb.edu/admin/530_Workstation/1specs.htm

* lspci -nn output;
00:00.0 Host bridge [0600]: Intel Corporation 82860 860 (Wombat) Chipset Host Bridge (MCH) [8086:2531] (rev 04)
00:01.0 PCI bridge [0604]: Intel Corporation 82850 850 (Tehama) Chipset AGP Bridge [8086:2532] (rev 04)
00:02.0 PCI bridge [0604]: Intel Corporation 82860 860 (Wombat) Chipset AGP Bridge [8086:2533] (rev 04)
00:1e.0 PCI bridge [0604]: Intel Corporation 82801 PCI Bridge [8086:244e] (rev 04)
00:1f.0 ISA bridge [0601]: Intel Corporation 82801BA ISA Bridge (LPC) [8086:2440] (rev 04)
00:1f.1 IDE interface [0101]: Intel Corporation 82801BA IDE U100 Controller [8086:244b] (rev 04)
00:1f.2 USB Controller [0c03]: Intel Corporation 82801BA/BAM USB Controller #1 [8086:2442] (rev 04)
00:1f.3 SMBus [0c05]: Intel Corporation 82801BA/BAM SMBus Controller [8086:2443] (rev 04)
00:1f.4 USB Controller [0c03]: Intel Corporation 82801BA/BAM USB Controller #1 [8086:2444] (rev 04)
00:1f.5 Multimedia audio controller [0401]: Intel Corporation 82801BA/BAM AC'97 Audio Controller [8086:2445] (rev 04)
01:00.0 VGA compatible controller [0300]: nVidia Corporation NV34 [GeForce FX 5500] [10de:0326] (rev a1)
02:1f.0 PCI bridge [0604]: Intel Corporation 82806AA PCI64 Hub PCI Bridge [8086:1360] (rev 03)
03:00.0 PIC [0800]: Intel Corporation 82806AA PCI64 Hub Advanced Programmable Interrupt Controller [8086:1161] (rev 01)
04:0b.0 Ethernet controller [0200]: 3Com Corporation 3c905C-TX/TX-M [Tornado] [10b7:9200] (rev 78)
04:0c.0 FireWire (IEEE 1394) [0c00]: Texas Instruments TSB12LV26 IEEE-1394 Controller (Link) [104c:8020]

* cat /proc/cmdline;
root=/dev/hda2 ro acpi=force vga=79

* cat /proc/cpuinfo;
processor : 0
vendor_id : GenuineIntel
cpu family : 15
model : 0
model name : Intel(R) Xeon(TM) CPU 1700MHz
stepping : 10
cpu MHz : 1695.037
cache size : 256 KB
fdiv_bug : no
hlt_bug : no
f00f_bug : no
coma_bug : no
fpu : yes
fpu_exception : yes
cpuid level : 2
wp : yes
flags : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm pebs bts
bogomips : 3390.07
clflush size : 64
power management:

processor : 1
vendor_id : GenuineIntel
cpu family : 15
model : 0
model name : Intel(R) Xeon(TM) CPU 1700MHz
stepping : 10
cpu MHz : 1695.037
cache size : 256 KB
fdiv_bug : no
hlt_bug : no
f00f_bug : no
coma_bug : no
fpu : yes
fpu_exception : yes
cpuid level : 2
wp : yes
flags : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm pebs bts
bogomips : 3389.93
clflush size : 64
power management:

* free -m
                    total  used  free  shared  buffers  cached
Mem:                  502   486    15       0       48     124
-/+ buffers/cache:          313   189
Swap:                 964    82   882

You can see the CPU versions/family and chipset from the info above. If you can use any other relevant info, i'd be glad to post it.

p.s. I also have a pentiumII _single_ CPU system, it's using a debian .29 kernel, i have yet to upgrade that system, however, AIUI, this issue does not manifest itself in non-MP ppro/pII/pIII systems.

rich

hi again;

just some correction and additional info (and i _didn't_ explicitly/intentionally do that cc mailing stuff i see when posting).

* root=/dev/hda2 ro acpi=force vga=791 <--corrected
(i forget if acpi=force is even needed, it's a carryover from my pII system, prior to 2.6.18 - seems the pII needed it to enable ACPI - and my kernel line for the pII also contains 'lapic' ..it's now running a .29 though).

* even though you see the 'ht' cpuflag above, there is no BIOS setting for it to enable/disable, and i found that in anything less than 1.8GHz Xeons, even though the flag is present, the CPUs don't have HT ability;
http://lists.us.dell.com/pipermail/linux-poweredge/2002-June/003037.html

$ egrep 'X86_HT|HT_IRQ' /boot/config-2.6.30-1-686
CONFIG_X86_HT=y
CONFIG_HT_IRQ=y

Also; i'm using only the Xorg 'nv' driver, so _not_ the proprietary nvidia one.

Also; apologies about the horrible lspci output formatting .. uff.

rich

crap; one more fix; broken URL for System specs;
http://pompone.cs.ucsb.edu/admin/530_Workstation/1specs.htm <--corrected

sorry about so much noise and needed correction.

rich
Created attachment 22637 [details]
uname -a

After upgrade to 2.6.30 my machine locks up cold within a number of hours (TTL ranges from minutes to days). The mainboard is an MSI-9105 with dual P-3 1400MHz. On the LKML similar cases have been reported for dual P2s and dual P3s. Further system details have been attached.