Latest working kernel version: 2.6.26
Earliest failing kernel version: 2.6.27-rc1
Distribution: openSUSE 11.0
Hardware Environment: Sony Vaio FW11 (Centrino 2, Radeon Mobility HD3470)
Software Environment: See attached ver_linux.log
Problem Description: After playing with the HDMI port I cannot switch to init 5. I get the bug "unable to handle kernel paging request" followed by many more lines of output. After that I cannot type anything or log in using ssh. Sometimes I am still able to use Shift+PgUp, which I used to take photos of the screen.
Steps to reproduce:
1) Boot into init 3 (set 3 in /etc/inittab)
2) Plug in a TV/LCD using the HDMI port
3) [Optional] Unplug the HDMI cable
4) Try to switch to init 5 (if the system has not frozen already)
I compiled 2.6.26, 2.6.27-rc1 and 2.6.27-rc2 myself. The last working version is 2.6.26.
Created attachment 17193 [details] Log of output from "ver_linux"
Created attachment 17194 [details] #1 photo of screen with error
Created attachment 17195 [details] #2 photo of screen with error
Created attachment 17196 [details] #3 photo of screen with error
Created attachment 17197 [details] #4 photo of screen with error
Created attachment 17198 [details] #5 photo of screen with error
Created attachment 17199 [details] #6 photo of screen with error
Reply-To: akpm@linux-foundation.org

(switched to email. Please respond via emailed reply-to-all, not via the bugzilla web interface).

On Tue, 12 Aug 2008 14:30:01 -0700 (PDT) bugme-daemon@bugzilla.kernel.org wrote:
> http://bugzilla.kernel.org/show_bug.cgi?id=11313
>
> Summary: Plugging HDMI causes "unable to handle kernel paging request"
> Product: Platform Specific/Hardware
> Version: 2.5
> KernelVersion: 2.6.27-rcX
> Platform: All
> OS/Version: Linux
> Tree: Mainline
> Status: NEW
> Severity: high
> Priority: P1
> Component: x86-64
> AssignedTo: platform_x86_64@kernel-bugs.osdl.org
> ReportedBy: zajec5@gmail.com
>
> Latest working kernel version: 2.6.26

ugh. Random rampant memory corruption caused by unplugging an HDMI cable? I'm not even sure which subsystem could be involved here.

Are you able to run a git bisection search please? http://www.kernel.org/doc/local/git-quick.html has some help.

Thanks.
On Tuesday, 12 of August 2008, Andrew Morton wrote:
> ugh. Random rampant memory corruption caused by unplugging an HDMI
> cable? I'm not even sure which subsystem could be involved here.

Hm. Isn't that X trying to access memory which is not mapped?

Rafał, which graphics driver are you using?
12-08-08, Rafael J. Wysocki <rjw@sisk.pl> wrote:
> On Tuesday, 12 of August 2008, Andrew Morton wrote:
>> ugh. Random rampant memory corruption caused by unplugging an HDMI
>> cable? I'm not even sure which subsystem could be involved here.
>
> Hm. Isn't that X trying to access memory which is not mapped?
>
> Rafał, which graphics driver are you using?

I use radeonhd built from this morning's git. Does it matter, given that I only play (plug, unplug) with HDMI in init 3?

I'll try bisecting tomorrow.
2008/8/12, bugme-daemon@bugzilla.kernel.org <bugme-daemon@bugzilla.kernel.org>:
> http://bugzilla.kernel.org/show_bug.cgi?id=11313
>
> ------- Comment #9 from rjw@sisk.pl 2008-08-12 14:57 -------
> On Tuesday, 12 of August 2008, Andrew Morton wrote:
> Hm. Isn't that X trying to access memory which is not mapped?
>
> Rafał, which graphics driver are you using?

I was not precise enough. I should have described the problem better, sorry.

Starting X (init 5) is not necessary. Instead of "init 5" I can simply type "ls". This also causes the same kind of errors on the console ("unable to handle kernel paging request" and sometimes something about scheduling).
Can you try the same test with the i915 driver loaded (it's a DRM driver)? It should at least handle GPU interrupts, even though it doesn't do anything with hotplug ones at the moment.

This could also be some sort of ACPI issue where the ACPI video code tries to enable an output at hotplug time and stomps all over memory in the process...
Oh err, not i915, you should try the radeon DRM driver
2008/8/13, bugme-daemon@bugzilla.kernel.org <bugme-daemon@bugzilla.kernel.org>:
> http://bugzilla.kernel.org/show_bug.cgi?id=11313
>
> ------- Comment #13 from jbarnes@virtuousgeek.org 2008-08-12 15:26 -------
> Oh err, not i915, you should try the radeon DRM driver

I tried loading the "radeon" module before "init 5". It causes hangs as well, with similar errors.

> ------- Comment #8 from anonymous@kernel-bugs.osdl.org 2008-08-12 14:44 -------
> Reply-To: akpm@linux-foundation.org
>
> Are you able to run a git bisection search please?
> http://www.kernel.org/doc/local/git-quick.html has some help.

The problem was caused by this merge commit:

commit d59fdcf2ac501de99c3dfb452af5e254d4342886
Merge: 2387ce5... bce7f79...
Author: Ingo Molnar <mingo@elte.hu>
Date: Mon Jul 14 11:37:46 2008 +0200

    Merge commit 'v2.6.26' into x86/core

I hope to find the exact commit among the merged patches today.
2008/8/12, bugme-daemon@bugzilla.kernel.org <bugme-daemon@bugzilla.kernel.org>:
> ------- Comment #8 from anonymous@kernel-bugs.osdl.org 2008-08-12 14:44 -------
> Reply-To: akpm@linux-foundation.org
>
> On Tue, 12 Aug 2008 14:30:01 -0700 (PDT)
> bugme-daemon@bugzilla.kernel.org wrote:
>
> ugh. Random rampant memory corruption caused by unplugging an HDMI
> cable? I'm not even sure which subsystem could be involved here.
>
> Are you able to run a git bisection search please?
> http://www.kernel.org/doc/local/git-quick.html has some help.

This whole problem seems quite complex to me. Here is the result of my research.

First of all, I finally found two interesting revisions:
1) Good one: 80ca9a706b458d09b8cc8d5258bb61957f66ca5e
   Sun Jul 13 13:58:12 2008 +0200
   ALSA: correct kcalloc usage
2) Bad one: 32b23e9a7331fce57eb0af52e19e8409fdef831b
   Sun Jul 13 14:29:41 2008 -0700
   x86: max_low_pfn_mapped fix #4

Here is my testing procedure:
1) Boot OS into init 3
2) Plug in HDMI
3) Unplug HDMI
4) Use "ls"
5) Use "init 5"
6) Use "init 3"
7) Compile kernel ("make rpm")

With the "Bad one", the system hangs at point 4 or 5 (randomly). I was never able to see X using the "Bad one".

With the "Good one" it's better but still not perfect. I am always able to get through the first 6 points (I see X and KDE, I can use it), but the compilation always fails.

So it seems that commit 32b23e made an already broken kernel work even worse.

Does this help somehow? Is there really something in commit 32b23e that could affect my notebook? Can I test something more?
2008/8/15, Rafał Miłecki <zajec5@gmail.com>:
> When using "Good one" it's better but still not perfect. I am always
> able to go through 6 points (I see X and KDE, I can use it) but
> compilations always fails.

I attach a photo of the screen taken after all these points of the procedure, including "make rpm".
On Sunday, 17 of August 2008, Rafał Miłecki wrote:
> 2008/8/16, Rafael J. Wysocki <rjw@sisk.pl>:
> > This message has been generated automatically as a part of a report
> > of recent regressions.
> >
> > The following bug entry is on the current list of known regressions
> > from 2.6.26. Please verify if it still should be listed and let me know
> > (either way).
> >
> >
> > Bug-Entry : http://bugzilla.kernel.org/show_bug.cgi?id=11313
> > Subject : Plugging HDMI causes "unable to handle kernel paging request"
> > Submitter : Rafał Miłecki <zajec5@gmail.com>
> > Date : 2008-08-12 14:30 (5 days old)
>
> Bug still exists in current git (tested 15 minutes ago).
Created attachment 17302 [details] git bisect log
Created attachment 17303 [details] Full result of bisecting Full content of commit 4f9c11dd49fb73e1ec088b27ed6539681a445988 x86, 64-bit: adjust mapping of physical pagetables to work with Xen
OK, I was finally able to complete the full bisection:

4f9c11dd49fb73e1ec088b27ed6539681a445988 is first bad commit
commit 4f9c11dd49fb73e1ec088b27ed6539681a445988
Author: Jeremy Fitzhardinge <jeremy@goop.org>
Date: Wed Jun 25 00:19:19 2008 -0400

    x86, 64-bit: adjust mapping of physical pagetables to work with Xen

For more info (the full output about the first bad commit and the output of "git bisect log") see my two attachments to the bug report (http://bugzilla.kernel.org/show_bug.cgi?id=11313).

I took the liberty of adding Jeremy and Ingo to the CC of my mail.

P.S. When testing many of the revisions I had to apply Ingo's patch found at http://lists-archives.org/linux-kernel/16541922-next-0704-x86_64-panics-on-booting.html
I believe it doesn't change anything, but I decided to mention it.
The patch in question does two things: constructs 4k mappings if the CPU doesn't support PSE, and avoids re-constructing existing mappings. Since I assume you have PSE (no real 64-bit CPU lacks it), then the problem is caused by the existing mappings. This actually points the finger at a6523748bddd38bcec11431f57502090b6014a96 "paravirt/x86, 64-bit: move __PAGE_OFFSET to leave a space for hypervisor", which changes how the initial mappings are created. Does it help if you revert that while leaving 4f9c11dd49fb73e1ec088b27ed6539681a445988 applied?
2008/8/18, bugme-daemon@bugzilla.kernel.org <bugme-daemon@bugzilla.kernel.org>:
> http://bugzilla.kernel.org/show_bug.cgi?id=11313
>
> ------- Comment #21 from jeremy@goop.org 2008-08-18 08:43 -------
> The patch in question does two things: constructs 4k mappings if the CPU
> doesn't support PSE, and avoids re-constructing existing mappings. Since I
> assume you have PSE (no real 64-bit CPU lacks it), then the problem is
> caused by the existing mappings.
>
> This actually points the finger at a6523748bddd38bcec11431f57502090b6014a96
> "paravirt/x86, 64-bit: move __PAGE_OFFSET to leave a space for hypervisor",
> which changes how the initial mappings are created. Does it help if you
> revert that while leaving 4f9c11dd49fb73e1ec088b27ed6539681a445988 applied?

I did it this way:

git checkout -f 4f9c11dd49fb73e1ec088b27ed6539681a445988
git revert a6523748bddd38bcec11431f57502090b6014a96

I hope I did this right. It resulted in:

commit 90fd83f8f64fcd9a503621be8c2ffc5413efe500
Author: root <root@sony.site>
Date: Mon Aug 18 18:15:40 2008 +0200

    Revert "paravirt/x86, 64-bit: move __PAGE_OFFSET to leave a space for hypervisor"

    This reverts commit a6523748bddd38bcec11431f57502090b6014a96.

In case this is important for someone, I am attaching the output of

git-diff 4f9c11dd 90fd83f8 > zajec.patch

to the bug report on bugzilla.

Unfortunately this revert doesn't fix my problem. I still get errors after HDMI and "init 5". I took 2 photos of what appears on this kernel with a6523748 reverted (see the bugzilla report for these photos).
Created attachment 17310 [details] git-diff 4f9c11dd 90fd83f8 > zajec.patch 90fd83f8 is my own commit which reverts a6523748.
Created attachment 17311 [details] #1 photo of error on kernel with reverted a6523748
Created attachment 17312 [details] #2 photo of error on kernel with reverted a6523748
Created attachment 17329 [details] Config of my kernel This is openSUSE's kernel configuration with paravirt disabled.
Created attachment 17330 [details] dmesg | grep reusing Output of "dmesg | grep reusing" on kernel with Jeremy's patch applied.
Created attachment 17331 [details] full output of dmesg Well, I was not asked to grep dmesg at all... Here is the full output of dmesg.
Could we get a dump of the full kernel page tables?

To do so:

1. make sure the kernel is configured with CONFIG_X86_PTDUMP.
2. make sure debugfs is mounted
   (mount -t debugfs none /sys/kernel/debug)
3. cat /sys/kernel/debug/kernel_page_tables > kernel_page_tables.txt
Some notes as I pick through all the evidence so far:

- the crash is specifically because there are reserved bits set in the pmd
- the pmd is b02a00043a6001a3 in both cases
- the vaddr is ffff88013a600000 in the first crash, and ffff81013a6d1c00 in the second,
  corresponding to the same large-page pmd mapping of phys page 0x13a600000
- this maps to e820 entry
    BIOS-e820: 0000000100000000 - 0000000140000000 (usable)
- the corresponding boot-time mapping is
    init_memory_mapping
    0100000000 - 0140000000 page 2M
    kernel direct mapping tables up to 140000000 @ b000-11000
    addr 100000000 reusing pgd 201880 0000000000202063
    last_map_addr: 140000000 end: 140000000
- !!! and the memory allocated for this pagetable is:
    #5 [0000008000 - 000000b000] PGTABLE ==> [0000008000 - 000000b000]
    #6 [000000b000 - 000000c000] PGTABLE ==> [000000b000 - 000000c000]

IOW, it's mapping using b000-11000, but it has only reserved b000 - c000

Also, this is right in the middle of the ISA area, which seems risky.
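The address arithmetic in these notes can be double-checked with a short sketch. It assumes the two faulting addresses are direct-map addresses, with a direct-map base of 0xffff810000000000 before the __PAGE_OFFSET move and 0xffff880000000000 after it; those bases and the helper names are assumptions made for illustration, not values taken from the attached logs.

```python
# Assumed direct-map bases (illustrative, not read from the logs).
PAGE_OFFSET_OLD = 0xFFFF810000000000  # assumed base before the __PAGE_OFFSET move
PAGE_OFFSET_NEW = 0xFFFF880000000000  # assumed base after the __PAGE_OFFSET move
LARGE_PAGE = 2 * 1024 * 1024          # 2M large page ("page 2M" in the boot log)


def direct_map_phys(vaddr, page_offset):
    """Physical address behind a direct-map virtual address."""
    return vaddr - page_offset


def large_page_base(phys):
    """Round a physical address down to its 2M large-page boundary."""
    return phys & ~(LARGE_PAGE - 1)


def in_range(addr, start, end):
    """True if addr lies in the half-open range [start, end)."""
    return start <= addr < end


# Faulting addresses quoted in the notes above.
first = large_page_base(direct_map_phys(0xFFFF88013A600000, PAGE_OFFSET_NEW))
second = large_page_base(direct_map_phys(0xFFFF81013A6D1C00, PAGE_OFFSET_OLD))

print(hex(first), hex(second))
# e820 entry quoted above: 0000000100000000 - 0000000140000000 (usable)
print(in_range(first, 0x100000000, 0x140000000))
```

Under those assumptions both addresses fall in the same 2M page, and that page lies inside the quoted e820 range, which is consistent with the notes.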
It's not in the middle of the ISA area; that would be another zero (a0000-100000).
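The range comparison is easy to get wrong by one hex digit, so here is a minimal sketch of it. It uses only the ranges quoted in this thread (b000-11000 used, b000-c000 reserved, a0000-100000 for the ISA area) and assumes 4k pagetable pages; the helper names are hypothetical.

```python
PAGE_SIZE = 4096  # assumed 4k pagetable pages


def pages(start, end):
    """Number of 4k pages in the half-open range [start, end)."""
    return (end - start) // PAGE_SIZE


def overlaps(a_start, a_end, b_start, b_end):
    """True if the half-open ranges [a_start, a_end) and [b_start, b_end) intersect."""
    return a_start < b_end and b_start < a_end


PGTABLE_USED = (0xB000, 0x11000)     # "kernel direct mapping tables ... @ b000-11000"
PGTABLE_RESERVED = (0xB000, 0xC000)  # "#6 [000000b000 - 000000c000] PGTABLE"
ISA_AREA = (0xA0000, 0x100000)       # ISA area as given in the comment above

print(pages(*PGTABLE_USED), pages(*PGTABLE_RESERVED))
print(overlaps(*PGTABLE_USED, *ISA_AREA))
```

This only reproduces the arithmetic on the quoted numbers: the announced range spans more pages than the reserved one, and it does not touch a0000-100000. It says nothing by itself about whether the extra pages were ever written.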
*** This bug has been marked as a duplicate of bug 11237 ***
Right, yes. But still, under-allocated.
2008/8/20 <bugme-daemon@bugzilla.kernel.org>:
> http://bugzilla.kernel.org/show_bug.cgi?id=11313
>
> ------- Comment #29 from hpa@zytor.com 2008-08-20 10:25 -------
> Could we get a dump of the full kernel page tables?
>
> To do so:
>
> 1. make sure the kernel is configured with CONFIG_X86_PTDUMP.
> 2. make sure debugfs is mounted
>    (mount -t debugfs none /sys/kernel/debug)
> 3. cat /sys/kernel/debug/kernel_page_tables > kernel_page_tables.txt

My notebook is a Sony Vaio FW11S with 4GB of RAM. I checked for a BIOS update at
http://support.vaio.sony.co.uk/model.asp?m=VGNFW11S.CEZ&site=voe_en_GB_cons
but did not find one.

I also ran the memory test from the openSUSE 11.0 DVD and got 3 passes without errors.

As this bug is now recognised as a duplicate: Jeremy, do you still want me to compare dmesg from the two revisions? I do not understand what the E820 you asked for is. Could you explain that, please (assuming you still need that comparison)?

Is there anything more I can do to help fix this bug?

I'll provide the "dump of the full kernel page tables" a little later.
The e820 map was included in the dmesg outputs you already posted, so that's OK.

I mis-diagnosed the problem, so I'm not currently sure whether the bugs are actually dups or not. They certainly have some similar characteristics, and implicate the same change, so I think the chances are good that they're dups.

The other bug is on a completely different machine, so I'm not very confident that the bug is specifically platform-related (ie, whether a BIOS update is relevant).
Created attachment 17364 [details] cat kernel_page_tables on 4f9c11dd+CONFIG_X86_PTDUMP
2008/8/20 <bugme-daemon@bugzilla.kernel.org>:
> http://bugzilla.kernel.org/show_bug.cgi?id=11313
>
> ------- Comment #29 from hpa@zytor.com 2008-08-20 10:25 -------
> Could we get a dump of the full kernel page tables?

Done as bug report attachment:
http://bugzilla.kernel.org/attachment.cgi?id=17364&action=view
OK, that's in:

0xffff880139200000-0xffff88013bc56000 43352K RW GLB NX pte
0xffff88013bc56000-0xffff88013bc58000 8K ro GLB NX pte

So the large mapping has been shattered down to 4k mappings. Hm, what's with all the ro mappings?

Could you also post a full dmesg from 4f9c11dd49fb73e1ec088b27ed6539681a445988~1?
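Finding the entry that covers a given address in a long kernel_page_tables dump can be scripted. The sketch below assumes every mapping line has the shape of the two lines quoted above (start-end, size with a unit suffix, then flags and the level name); the regular expression and function names are hypothetical, and other lines in the dump are simply skipped.

```python
import re

# Line shape assumed from the two quoted entries; anything else is ignored.
LINE_RE = re.compile(r"^0x([0-9a-fA-F]+)-0x([0-9a-fA-F]+)\s+(\d+)([KMGT])\s*(.*)$")
UNIT = {"K": 1 << 10, "M": 1 << 20, "G": 1 << 30, "T": 1 << 40}


def parse_line(line):
    """Return (start, end, size_bytes, fields) or None for non-mapping lines."""
    m = LINE_RE.match(line.strip())
    if m is None:
        return None
    start, end = int(m.group(1), 16), int(m.group(2), 16)
    size = int(m.group(3)) * UNIT[m.group(4)]
    return start, end, size, m.group(5).split()


def find_mapping(lines, vaddr):
    """Return the first parsed entry whose [start, end) range contains vaddr."""
    for line in lines:
        entry = parse_line(line)
        if entry is not None and entry[0] <= vaddr < entry[1]:
            return entry
    return None


sample = [
    "0xffff880139200000-0xffff88013bc56000 43352K RW GLB NX pte",
    "0xffff88013bc56000-0xffff88013bc58000 8K ro GLB NX pte",
]

entry = find_mapping(sample, 0xFFFF88013A600000)
print(hex(entry[0]), hex(entry[1]), entry[3])
```

For the faulting address from the first crash this selects the first quoted range, whose last field is the level name (pte). To run it on a real dump, pass the lines of kernel_page_tables.txt instead of sample.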
Created attachment 17367 [details] dmesg from 4f9c11dd This is dmesg from 4f9c11dd with Ingo's patch applied: http://lists-archives.org/linux-kernel/16541922-next-0704-x86_64-panics-on-booting.html The configuration is almost the same as http://bugzilla.kernel.org/attachment.cgi?id=17329 - I only enabled CONFIG_X86_PTDUMP for this one build.
Created attachment 17414 [details] boot.msg from 4f9c11dd49fb... booting with mem=2G Booting 4f9c11dd49fb73e1ec088b27ed6539681a445988 with mem=2G option resulted in kernel unable to load ACPI modules.
Created attachment 17419 [details] kernel_page_tables from 2.6.27-rc4 booted normally
Created attachment 17420 [details] kernel_page_tables from 2.6.27-rc4 booted with mem=1G Maybe a comparison with "kernel_page_tables from 2.6.27-rc4 booted normally" can give some interesting info?