Bug 10572
Description
Rob Lensen
2008-04-28 12:20:29 UTC
can you try if "nosmp" makes a difference ?
can you also try to catch full kernel panic ? if you cannot attach serial or net console, you may try "vga=ask" and select a video mode with higher resolution, so the message may fit on screen.
did this also happen with earlier kernels? if you never tried, can you try some ?

Ok some extra info
ACPI=on, under high load (oggenc and kernel compile)
http://rob.lensen.nu/files/zen1_acpi_on%20(Custom).JPG
With nosmp:
Vanilla kernel: http://rob.lensen.nu/files/vanilla_nosmp_acpi_on%20(Custom).JPG
Zen: http://rob.lensen.nu/files/zen1_nosmp_acpi_on%20(Custom).JPG
Problem was also occurring with 2.6.24, didn't try kernels before that, since those are all problematic with the jmicron controller
More info:
#lsmod
Module                  Size  Used by
l2cap                  24704  2
bluetooth              57508  3 l2cap
ipv6                  308072  21
coretemp                8320  0
ext3                  138896  1
jbd                    54696  1 ext3
mbcache                 9476  1 ext3
ide_cd_mod             37792  0
cdrom                  38440  1 ide_cd_mod
ata_generic             7684  0
pata_jmicron            6016  0
firewire_ohci          20096  0
firewire_core          43232  1 firewire_ohci
crc_itu_t               3200  1 firewire_core
ide_pci_generic         5892  0 [permanent]
jmicron                 3712  0 [permanent]
psmouse                43804  0
ohci1394               32180  0
pcspkr                  3968  0
ieee1394               92024  1 ohci1394
serio_raw               7300  0
i2c_i801               11164  0
ide_core              132528  3 ide_cd_mod,ide_pci_generic,jmicron
intel_agp              29680  0
i2c_core               23200  1 i2c_i801
ehci_hcd               38796  0
uhci_hcd               26016  0
sg                     33504  0
thermal                19488  0
processor              37100  1 thermal
evdev                  12544  3
fan                     6152  0
button                  8608  0
battery                13704  0
ac                      6280  0
snd_seq_oss            34688  0
snd_seq_midi_event      8576  1 snd_seq_oss
snd_seq                57504  4 snd_seq_oss,snd_seq_midi_event
snd_seq_device          8468  2 snd_seq_oss,snd_seq
snd_hda_intel         433368  0
snd_hwdep               9736  1 snd_hda_intel
snd_pcm_oss            44448  0
snd_pcm                81800  2 snd_hda_intel,snd_pcm_oss
snd_timer              23952  2 snd_seq,snd_pcm
snd_page_alloc          9872  2 snd_hda_intel,snd_pcm
snd_mixer_oss          17920  1 snd_pcm_oss
snd                    60872  9 snd_seq_oss,snd_seq,snd_seq_device,snd_hda_intel,snd_hwdep,snd_pcm_oss,snd_pcm,snd_timer,snd_mixer_oss
soundcore               8608  1 snd
sky2                   48772  0
rtc_cmos               11064  0
rtc_core               19652  1 rtc_cmos
rtc_lib                 4096  1 rtc_core
usbcore               152728  3 ehci_hcd,uhci_hcd
xfs                   566656  4
sd_mod                 26304  8
ahci                   30600  6
libata                160048  3 ata_generic,pata_jmicron,ahci
dock                   10272  1 libata

Two other photos
http://rob.lensen.nu/files/vanilla_acpi_on.jpg
http://rob.lensen.nu/files/vanilla_acpi_on_high_res.jpg
On console 1 start a kernel compile (runs ok for a couple of minutes), then on console 2 start oggenc, and within 10 secs the kernel panic occurs

Before these photos I upgraded to the latest bios version of Foxconn (was one version behind)

With netconsole got this output, while compiling a kernel:
Eeek! page_mapcount(page) went negative! (-1)
page pfn = 11fcd0
page->flags = 11c00000001002c
page->count = 1
page->mapping = ffff810123897b28
vma->vm_ops = xfs_file_vm_ops+0x0/0xfffffffffffe69ef [xfs]
vma->vm_ops->nopage = 0x0
vma->vm_ops->fault = filemap_fault+0x0/0x4d0

nvidia module taints the kernel.
maybe it's worth checking if it makes a difference if that module is being removed temporarily....

(In reply to comment #6)
> nvidia module taints the kernel.
> maybe it's worth checking if it makes a difference if that module is being
> removed temporarily....
>
Did already do that, see lsmod at post http://bugzilla.kernel.org/show_bug.cgi?id=10572#c2

pardon, i missed that

each crash seems to look different.
can you make sure that the system doesn't have hardware/stability issues ?
memtest86?
other operating system ? (& put that under load,too)

if you can remove any redundant hardware, please remove it so that you run bare minimum configuration

(In reply to comment #9)
> each crash seems to look different.
> can you make sure that the system doesn't have hardware/stability issues ?
>
> memtest86?
> other operating system ? (& put that under load,too)
>
I have compiled several kernels with acpi disabled and did the ogg encode ten times in a row without a crash. All with acpi disabled.
Ran memtest and cpuburn-in for 10 hours from the Ultimate Boot CD without any problems. Right now building 2.6.25-zen1 with debugging enabled and netconsole compiled into the kernel.

(In reply to comment #10)
> if you can remove any redundant hardware, please remove it so that you run
> bare minimum configuration
>
What do you mean by redundant hardware? I can disable the soundcard and network card (no debugging via netconsole then) in the bios. There is no other hardware installed.

why is this filed against ACPI?
If you boot with "acpi=off" does it go away?

(In reply to comment #13)
> why is this filed against ACPI?
> If you boot with "acpi=off" does it go away?
>
Yes. With acpi=off I can easily do 10 kernel compiles and several ogg encoding jobs.
Yesterday I did with acpi=off:
console 1: kernel compile
console 2: oggenc
console 3: mencoder
console 4: several large iso copies
Load reached about 22, without any problems.

Rob, would you please attach the acpidump output?

Created attachment 16056 [details]
ACPIDUMP off
Created attachment 16057 [details]
acpidump on
>I need ACPI since when I disable acpi my dvd-rw isn't working, which is
>connected to the jmicron jb361. It keeps probing for it
>(http://rob.lensen.nu/files/Afb019.jpg)
Should be http://rob.lensen.nu/files/Afb021.jpg

Hi, Rob
Will you please try the boot option of "hpet=disable" and see whether the problem still exists?

Created attachment 16282 [details]
Kernel crash photo
Photo 1 of the kernel crash, occurs under higher load
Created attachment 16283 [details]
Photo 2 (which scrolls by after some time)
Photo 2 of the kernel crash, occurs under higher load
Created attachment 16284 [details]
Dmesg output with hpet=disable
i noticed 2 things:
1. you have vmware 3rd party modules loaded. that's not part of mainline kernel, so kernel is sort of tainted. i would first make sure that these aren't loaded when further testing/analysing.
2.
>Clocksource tsc unstable (delta = 9373942676 ns)
what clocksource is being installed after that message?
(cat /sys/devices/system/clocksource/clocksource0/current_clocksource )
i would recommend booting with a stable clocksource to avoid clocksource switch at runtime - so you may boot with clocksource=acpi_pm or whatever is available (/sys/devices/system/clocksource/clocksource0/available_clocksource ).
i have seen machines having problems when clocksource being detected unstable and other clocksource being installed.
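For the first point (third-party modules such as vmware or nvidia), the kernel exposes its taint state as a bit mask in /proc/sys/kernel/tainted. The following is only a rough sketch of how that mask could be checked before a test run; decode_taint is a hypothetical helper name, it decodes just three bits, and the out-of-tree bit only exists on kernels newer than the ones tested in this report.

```shell
#!/bin/sh
# Sketch: decode a few bits of the kernel taint mask.
# decode_taint is a hypothetical helper, not an existing kernel tool.
decode_taint() {
    mask=$1
    if [ "$mask" -eq 0 ]; then
        echo "not tainted"
        return 0
    fi
    echo "tainted, mask=$mask"
    # Bit 0 (value 1): a proprietary module was loaded (flag 'P').
    [ $(( mask & 1 )) -ne 0 ] && echo "proprietary module loaded"
    # Bit 1 (value 2): a module was force-loaded (flag 'F').
    [ $(( mask & 2 )) -ne 0 ] && echo "module force-loaded"
    # Bit 12 (value 4096): an out-of-tree module was loaded (flag 'O'),
    # only reported by newer kernels.
    [ $(( mask & 4096 )) -ne 0 ] && echo "out-of-tree module loaded"
    return 0
}

# Only read the live value when the file is present and readable.
if [ -r /proc/sys/kernel/tainted ]; then
    decode_taint "$(cat /proc/sys/kernel/tainted)"
fi
```

A result of "not tainted" right before reproducing the crash would show that the run did not involve such modules.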
> 1. you have vmware 3rd party modules loaded. that's not part of mainline
> kernel, so kernel is sort of tainted. i would first make sure that these
> aren't loaded when further testing/analysing.
Will remove them with the next test
> 2.
> >Clocksource tsc unstable (delta = 9373942676 ns)
> what clocksource is being installed after that message?
> (cat /sys/devices/system/clocksource/clocksource0/current_clocksource )
Have to find out when booting with acpi=on
> i would recommend booting with a stable clocksource to avoid clocksource
> switch at runtime - so you may boot with clocksource=acpi_pm or whatever is
> available
> (/sys/devices/system/clocksource/clocksource0/available_clocksource ).
> i have seen machines having problems when clocksource being detected unstable
> and other clocksource being installed.
When acpi is off:
# cat /sys/devices/system/clocksource/clocksource0/available_clocksource
tsc jiffies
Will do some of these tests when I'm back home again.

> 2.
> >Clocksource tsc unstable (delta = 9373942676 ns)
> what clocksource is being installed after that message?
> (cat /sys/devices/system/clocksource/clocksource0/current_clocksource )
When booted with acpi=on
# dmesg |grep tsc
Clocksource tsc unstable (delta = 9373988407 ns)
# cat /sys/devices/system/clocksource/clocksource0/current_clocksource
acpi_pm
# cat /sys/devices/system/clocksource/clocksource0/available_clocksource
acpi_pm jiffies tsc
Should I just boot with clocksource=tsc?
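The clocksource checks discussed above can be collected into one small script. This is a minimal sketch: the sysfs paths are the ones quoted in this report, while the function names and the optional directory argument are assumptions added for illustration.

```shell
#!/bin/sh
# Sketch: report the current clocksource and check whether a given one
# is offered by the kernel. Function names are hypothetical.
CS_DIR=${CS_DIR:-/sys/devices/system/clocksource/clocksource0}

# clocksource_available NAME [DIR]
# Returns 0 if NAME is listed, 1 if not, 2 if the list cannot be read.
clocksource_available() {
    name=$1
    dir=${2:-$CS_DIR}
    [ -r "$dir/available_clocksource" ] || return 2
    for cs in $(cat "$dir/available_clocksource"); do
        [ "$cs" = "$name" ] && return 0
    done
    return 1
}

# report_clocksource [DIR]
report_clocksource() {
    dir=${1:-$CS_DIR}
    if [ -r "$dir/current_clocksource" ]; then
        echo "current: $(cat "$dir/current_clocksource")"
        echo "available: $(cat "$dir/available_clocksource")"
    else
        echo "clocksource sysfs files not readable under $dir"
    fi
}

report_clocksource
```

Running `clocksource_available acpi_pm` before adding clocksource=acpi_pm to the boot line shows whether that source is offered at all; with acpi=off on this machine only tsc and jiffies were listed.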
(In reply to comment #5)
> With netconsole got this output, while compiling a kernel:
>
> Eeek! page_mapcount(page) went negative! (-1)
> page pfn = 11fcd0
> page->flags = 11c00000001002c
> page->count = 1
> page->mapping = ffff810123897b28
> vma->vm_ops = xfs_file_vm_ops+0x0/0xfffffffffffe69ef [xfs]
> vma->vm_ops->nopage = 0x0
> vma->vm_ops->fault = filemap_fault+0x0/0x4d0
>
It looks like a memory leak. Would you please give kmemcheck a try?
1. pull kmemcheck from git://git.kernel.org/pub/scm/linux/kernel/git/vegard/kmemcheck.git current
2. apply the patch at: http://bugzilla.kernel.org/attachment.cgi?id=16199&action=view
3. make & reboot, check dmesg to find anything reported by kmemcheck

Created attachment 16319 [details]
Dmesg output with kmemcheck
This is the output with kmemcheck and the patch applied.
Looks promising, since I quickly tried an oggenc and kernel compile together and no crash so far. Whereas before I could easily crash the kernel within minutes.
> Looks promising, since I quickly tried an oggenc and kernel compile together and
> no crash so far. Whereas before I could easily crash the kernel within
> minutes.
>
Would you please help to verify if the bug still exists?
(In reply to comment #28)
> > Looks promising, since I quickly tried an oggenc and kernel compile together
> > and no crash so far. Whereas before I could easily crash the kernel within
> > minutes.
> >
> Would you please help to verify if the bug still exists?
>
Will try to test it asap

Created attachment 16368 [details]
Kernel crash 2.6.26rc4-zen0 with acpi patch applied
Created attachment 16369 [details]
Dmesg output with new kernel 2.6.26rc4-zen0 and acpi=on
Created attachment 16370 [details]
Dmesg output with new kernel 2.6.26rc4-zen0 and acpi=off
Created attachment 16371 [details]
Config used for the new 2.6.26rc4-zen0
I tried the patch and the result is not ok. I have the feeling that it takes more time before the kernel crash occurs (before it was within 10 encodings, now it happened after encoding 40 files), however it still crashes with acpi=on
What I did (as you can see on the crash photo):
console 1: #oggenc Track*
console 2: #cp -R zen-sources zen-sources2
When I did the same thing with acpi=off four times (that is, encoding 31 wav files to ogg four times and copying zen-sources four times), no crash occurred.
If it is needed to use netconsole again I will try to get the output.

please re-open if this failure is reproducible using a kernel and modules built entirely from kernel.org source.
Every failure shown here is different, suggesting a memory corruption issue. None of them directly implicate ACPI, though the failure going away with "acpi=off" does.

I still have this problem. I'm now on 2.6.27-rc1-zenmm1 and the problem is still there. Right now I'm hoping that my dvd burner can work with acpi=off; I made a bug report for that (11429), since it looks like I have to accept that my motherboard won't work stably with acpi enabled. I tested yesterday with the default Archlinux kernel (which is almost original) and the problem is really reproducible. So I'm hoping that someone can help me to get this problem out of the way. I made a series of photos of the crash dumps and if needed I can post them.

could you please try the latest kernel and see if the problem still exists?
if it's still reproducible, please attach the screen shot when system crashes.
BTW: you'd better do the test in a vanilla kernel, say 2.6.27.

Hi, Rob
Will you please do the test as suggested in comment #37?
If the problem still exists, will you please try the following boot options on the latest kernel and see whether the problem still exists?
a. idle=poll
b. processor.max_cstate=1 (the processor driver had better be compiled as built-in)
c. nohz=off highres=off (Disable the tickless feature).
Thanks.
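When testing several boot options in turn, it is easy to lose track of which ones the running kernel was actually started with. The sketch below checks /proc/cmdline for the options mentioned in this report; has_boot_param is a hypothetical helper name, and matching whole words only is an assumption about how one would want to compare them.

```shell
#!/bin/sh
# Sketch: check whether a boot option is on the running kernel's command line.
# has_boot_param is a hypothetical helper; it matches whole words only.
# Usage: has_boot_param PARAM [CMDLINE]
has_boot_param() {
    param=$1
    # Fall back to the live command line when none is passed in.
    cmdline=${2:-$(cat /proc/cmdline 2>/dev/null || true)}
    for word in $cmdline; do
        [ "$word" = "$param" ] && return 0
    done
    return 1
}

# Report the options that were suggested in this bug report.
for p in acpi=off nosmp hpet=disable idle=poll processor.max_cstate=1 \
         nohz=off highres=off; do
    if has_boot_param "$p"; then
        echo "$p: active"
    else
        echo "$p: not set"
    fi
done
```

Saving this output next to each dmesg attachment would make it clear which option combination a given crash belongs to.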
Will try to do the tests as described this week.

ping Rob. :)
Can you try latest kernel? There are a lot of cpu idle related fixes, which might be related to your issue.

Building vanilla kernel 2.6.27 right now. Testing will be done tonight. Sorry for the inconvenience

hah, any update?

Created attachment 19539 [details]
Kernel crash during boot with the option idle=poll enabled
Created attachment 19540 [details]
Kernel crash during boot with the option idle=poll enabled (2 of 2)
Created attachment 19541 [details]
Kernel crash during encoding mp3 files to ogg with options nohz=off and highres=off
Created attachment 19542 [details]
Kernel crash during encoding mp3 files to ogg with option processor.max_cstate=1 (1 of 2)
Created attachment 19543 [details]
Kernel crash during encoding mp3 files to ogg with option processor.max_cstate=1 (2 of 2)
Created attachment 19544 [details]
Dmesg output 2.6.28 with processor.max_cstate=1
Created attachment 19545 [details]
Dmesg output 2.6.28 with nohz=off highres=off
Ok did some testing with the kernel boot options as described in post #38
With the option idle=poll enabled, the kernel crashes during startup.
With processor.max_cstate=1 it crashes quite quickly, as shown on the photo.
The option nohz=off highres=off looked promising, since it was encoding mp3 files for 20 min, but when I opened another console and started a second encoding process the kernel crashed right away.
Sorry for the delay in responding. I hope I gave enough information

Hi, Rob
From the picture in comment #44 it seems that the function "smp_call_function_single" is called when irq is disabled. And this happens in the phase of initializing the ethernet device.
How about disabling the ethernet device module?
How about disabling SMP in the kernel configuration?
Thanks.

Created attachment 19571 [details]
As suggested idle=poll with NIC disabled in the BIOS
Hi, Rob
From the attached screenshot it seems that the kernel panic happens in signal return.
Can the kernel panic still happen if the load is very low? (not starting the job of encoding .wav to ogg files.)
Thanks.

(In reply to comment #54)
> Hi, Rob
> From the attached screenshot it seems that the kernel panic happens in
> signal return.
> Can the kernel panic still happen if the load is very low? (not starting the
> job of encoding .wav to ogg files.)
I didn't have any panics under normal operation (low load). If I just boot the computer and let it run, then nothing happens (no panic). Had it running for 3 weeks when I was on holiday.
But I first encountered this problem when I was extracting many rar files (4.5 GB). Then I could reproduce it by ogg encoding files.

does the problem exist in the latest kernel release?

ping Rob.

I will try to test the latest kernel, probably next week.

okay, 2.6.29 would be a good candidate.

> TuxOnIce 3.0-rc7 (http://tuxonice.net)
> Replacing swsusp.
> TuxOnIce: Resume= parameter is empty. Hibernating will be disabled.
TuxOnIce is not part of the kernel.org source tree.
Please see comment #35

(In reply to comment #60)
> > TuxOnIce 3.0-rc7 (http://tuxonice.net)
> > Replacing swsusp.
> > TuxOnIce: Resume= parameter is empty. Hibernating will be disabled.
>
> TuxOnIce is not part of the kernel.org source tree.
>
> Please see comment #35
Maybe it is wise to read the complete problem report, since the problem arises even faster with a vanilla kernel.
I think it is a bit strange to just close it, but I will leave it for now and run with acpi=off