Bug 33012
Summary: | computer restart without warning | ||
---|---|---|---|
Product: | Platform Specific/Hardware | Reporter: | Alexandre Demers (alexandre.f.demers) |
Component: | x86-64 | Assignee: | platform_x86_64 (platform_x86_64) |
Status: | CLOSED CODE_FIX | ||
Severity: | blocking | CC: | edwin+bugs, florian, herrmann.der.user, hpa, joro, maciej.rutecki, rientjes, rjw, yinghai |
Priority: | P1 | ||
Hardware: | All | ||
OS: | Linux | ||
Kernel Version: | 2.6.39-rc2 | Subsystem: | |
Regression: | Yes | Bisected commit-id: | |
Bug Depends on: | |||
Bug Blocks: | 32012 | ||
Attachments: |
.config at bad commit
dmesg from 2.6.38.2 new dmesg Longer dmesg dmesg from bad commit WITH proposed fix |
Description
Alexandre Demers
2011-04-11 02:47:20 UTC
Oops, pointing to wrong commit (by one commit in the tree). The first bad commit identified is d2137d5af4259f50c19addb8246a186c9ffac325:
Merge branch 'linus' into x86/bootmem
Author: Ingo Molnar <mingo@elte.hu>
Author date: 11-02-14 5:55 AM
Parent: x86-64: Move out cleanup higmap [_brk_end, _end) out of i...
Parent: klist: Fix object alignment on 64-bit.
Child: Merge branch 'x86/bootmem' into x86/mm
Branch: origin/master (Merge branch 'fbdev-fixes-for-linus' of git://git.kernel....)
Branch: master (Linux 2.6.39-rc1)
Follows: v2.6.38-rc4 (Linux 2.6.38-rc4)
Precedes: v2.6.39-rc1 (Linux 2.6.39-rc1)
I confirm the culprit commit d2137d5af4259f50c19addb8246a186c9ffac325. I will be testing the latest git version to see if it has been fixed.
First-Bad-Commit : d2137d5af4259f50c19addb8246a186c9ffac325
Suggest earlyprintk= for further diagnosis.
did you mean earlyprintk=1?
No, you need to specify its route, which is probably over serial or tty. Try earlyprintk=ttyS0,115200n8 first. See the earlyprintk= entry in Documentation/kernel-parameters.txt for more information.
Still present in rc3. I'll try the earlyprintk parameter.
can you please send out .config and working dmesg ?
Created attachment 54202 [details]
.config at bad commit
This is the .config file used when compiling kernel with bad commit. Dmesg from 2.6.38.2 will follow.
(In reply to comment #9)
> Created an attachment (id=54202) [details]
> .config at bad commit
>
> This is the .config file used when compiling kernel with bad commit. Dmesg from
> 2.6.38.2 will follow.
I should have said "when hard resetting at bad commit"
Created attachment 54222 [details]
dmesg from 2.6.38.2
Here is dmesg from the latest good version I've been using (2.6.38.2)
I can't see anything special using earlyprintk (ttyS0,115200n8 nor vga). It resets near the "sd 9:0:0:0: [sdc] Assuming drive cache: write through". Maybe I'm not using it correctly. Should I specify something else. Documentation about the parameter is pretty short.
that dmesg seem get clipped ... can you check /var/log/messages or /var/log/bootlog or /var/log/dmesg? that should have complete boot log.
CONFIG_XEN=y
CONFIG_XEN_DOM0=y
CONFIG_XEN_PRIVILEGED_GUEST=y
CONFIG_XEN_PVHVM=y
are you using xen/dom0, or use it on real HW directly?
(In reply to comment #14)
> CONFIG_XEN=y
> CONFIG_XEN_DOM0=y
> CONFIG_XEN_PRIVILEGED_GUEST=y
> CONFIG_XEN_PVHVM=y
>
> are you using xen/dom0, or use it on real HW directly?
I'm not using Xen, it's there just in case I'd like to use it at some point.
Created attachment 54262 [details]
new dmesg
taken directly from /var/log/dmesg. It seems similar to the previous one.
You may need to increase CONFIG_LOG_BUF_SHIFT=17 to CONFIG_LOG_BUF_SHIFT=21 to increase the buffer, or boot with log_buf_len=2M or even more.
Created attachment 54302 [details]
Longer dmesg
Increased log buffer length. Now you have a complete dmesg log from 2.6.38.2 (latest good kernel)
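For reference, the kernel log buffer holds 2^CONFIG_LOG_BUF_SHIFT bytes, so going from 17 to 21 raises it from 128 KiB to 2 MiB, which matches the log_buf_len=2M suggestion. A minimal Python sketch of that arithmetic (the helper name log_buf_bytes is made up for illustration, it is not a kernel or tool API):

```python
def log_buf_bytes(shift: int) -> int:
    """Size in bytes of a log buffer configured with CONFIG_LOG_BUF_SHIFT=shift."""
    return 1 << shift


# Compare the two values discussed in this thread.
for shift in (17, 21):
    size = log_buf_bytes(shift)
    print(f"CONFIG_LOG_BUF_SHIFT={shift}: {size} bytes ({size // 1024} KiB)")
```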
http://lists.freedesktop.org/archives/dri-devel/2011-April/010133.html to be followed on the dri-devel list.
On 04/13/2011 10:21 AM, Joerg Roedel wrote:
> > On Wed, Apr 13, 2011 at 08:46:09AM +0200, Ingo Molnar wrote:
> > First of all, I bisected between v2.6.37-rc2..f005fe12b90c which were
> > only a couple of patches and merged v2.6.38-rc4 in at every step. There
> > was no failure found.
> > Then I tried this again, but this time I merged v2.6.38-rc5 at every
> > step and was successful. The bad commit in this branch turned out to be
> >
> > 1a4a678b12c84db9ae5dce424e0e97f0559bb57c
> >
> > which is related to memblock.
> >
> > Then I tried to find out which change between 2.6.38-rc4 and 2.6.38-rc5
> > is needed to trigger the failure, so I used f005fe12b90c as a base,
> > bisected between v2.6.38-rc4..v2.6.38-rc5 and merged every bisect step
> > into the base and tested. Here the bad commit turned out to be
> >
> > e6d2e2b2b1e1455df16d68a78f4a3874c7b3ad20
> >
> > which is related to gart. It turned out that the gart aperture on that
> > box is on another position with these patches. Before it was at
> > 0xa4000000 and now it is at 0xa0000000. It seems like this has something
> > to do with the root-cause.
> >
> > Reverting commit 1a4a678b12c84db9ae5dce424e0e97f0559bb57c fixes the
> > problem btw. and booting with iommu=soft also works, but I have no idea
> > yet why the aperture at that address is a problem (with the patch
> > reverted the aperture lands at 0x80000000).
> >
> > I have put some debug-data online. There is my .config and two
> > dmesg-files for good (==2.6.39-rc3 + revert) and bad (==2.6.39-rc3)
> > I also created these dmesg-files again with memblock=debug, maybe that
> > helps to find the problem. The files are at
> >
> > http://www.8bytes.org/~joro/debug/
thanks for the bisecting...
so those two patches uncover some problems.
[ 0.000000] Checking aperture...
[ 0.000000] No AGP bridge found
[ 0.000000] Node 0: aperture @ a0000000 size 32 MB
[ 0.000000] Aperture pointing to e820 RAM. Ignoring.
[ 0.000000] Your BIOS doesn't leave a aperture memory hole
[ 0.000000] Please enable the IOMMU option in the BIOS setup
[ 0.000000] This costs you 64 MB of RAM
[ 0.000000] memblock_x86_reserve_range: [0xa0000000-0xa3ffffff] aperture64
[ 0.000000] Mapping aperture over 65536 KB of RAM @ a0000000
so the kernel tries to reallocate the aperture, because the one the BIOS allocated points to RAM or its size is too small.
but your radeon does use [0xa0000000, 0xbfffffff)
[ 4.281993] radeon 0000:01:05.0: VRAM: 320M 0x00000000C0000000 - 0x00000000D3FFFFFF (320M used)
[ 4.290672] radeon 0000:01:05.0: GTT: 512M 0x00000000A0000000 - 0x00000000BFFFFFFF
[ 4.298550] [drm] Detected VRAM RAM=320M, BAR=256M
[ 4.309857] [drm] RAM width 32bits DDR
[ 4.313748] [TTM] Zone kernel: Available graphics memory: 1896524 kiB.
[ 4.320379] [TTM] Initializing pool allocator.
[ 4.324948] [drm] radeon: 320M of VRAM memory ready
[ 4.329832] [drm] radeon: 512M of GTT memory ready.
and the one seems working:
[ 0.000000] Checking aperture...
[ 0.000000] No AGP bridge found
[ 0.000000] Node 0: aperture @ a0000000 size 32 MB
[ 0.000000] Aperture pointing to e820 RAM. Ignoring.
[ 0.000000] Your BIOS doesn't leave a aperture memory hole
[ 0.000000] Please enable the IOMMU option in the BIOS setup
[ 0.000000] This costs you 64 MB of RAM
[ 0.000000] memblock_x86_reserve_range: [0x80000000-0x83ffffff] aperture64
[ 0.000000] Mapping aperture over 65536 KB of RAM @ 80000000
[ 0.000000] memblock_x86_reserve_range: [0xacb6bdc0-0xacb6bddf] BOOTMEM
will use different position...
[ 4.250159] radeon 0000:01:05.0: VRAM: 320M 0x00000000C0000000 - 0x00000000D3FFFFFF (320M used)
[ 4.258830] radeon 0000:01:05.0: GTT: 512M 0x00000000A0000000 - 0x00000000BFFFFFFF
[ 4.266742] [drm] Detected VRAM RAM=320M, BAR=256M
[ 4.271549] [drm] RAM width 32bits DDR
[ 4.275435] [TTM] Zone kernel: Available graphics memory: 1896526 kiB.
[ 4.282066] [TTM] Initializing pool allocator.
[ 4.282085] usb 7-2: new full speed USB device number 2 using ohci_hcd
[ 4.293076] [drm] radeon: 320M of VRAM memory ready
[ 4.298277] [drm] radeon: 512M of GTT memory ready.
[ 4.303218] [drm] Supports vblank timestamp caching Rev 1 (10.10.2010).
[ 4.309854] [drm] Driver supports precise vblank timestamp query.
[ 4.315970] [drm] radeon: irq initialized.
[ 4.320094] [drm] GART: num cpu pages 131072, num gpu pages 131072
So the question is why radeon is using the address range [0xa0000000 - 0xc0000000), when in E820 it is RAM ....
[ 0.000000] BIOS-e820: 0000000000100000 - 00000000acb8d000 (usable)
[ 0.000000] BIOS-e820: 00000000acb8d000 - 00000000acb8f000 (reserved)
[ 0.000000] BIOS-e820: 00000000acb8f000 - 00000000afce9000 (usable)
[ 0.000000] BIOS-e820: 00000000afce9000 - 00000000afd21000 (reserved)
[ 0.000000] BIOS-e820: 00000000afd21000 - 00000000afd4f000 (usable)
[ 0.000000] BIOS-e820: 00000000afd4f000 - 00000000afdcf000 (reserved)
[ 0.000000] BIOS-e820: 00000000afdcf000 - 00000000afecf000 (ACPI NVS)
[ 0.000000] BIOS-e820: 00000000afecf000 - 00000000afeff000 (ACPI data)
[ 0.000000] BIOS-e820: 00000000afeff000 - 00000000aff00000 (usable)
so it looks like the BIOS programs a wrong address into the radeon card?
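The overlap described above can be checked mechanically. The following is only a sketch: it parses a few of the BIOS-e820 lines quoted in this mail and reports which usable RAM ranges intersect the 64 MB aperture reserved at [0xa0000000-0xa3ffffff]. The function names are hypothetical, and treating the second e820 address as an exclusive end is an assumption about the log format.

```python
import re

# Sample copied from the dmesg excerpt in this thread (first three entries only).
E820_LINES = """\
BIOS-e820: 0000000000100000 - 00000000acb8d000 (usable)
BIOS-e820: 00000000acb8d000 - 00000000acb8f000 (reserved)
BIOS-e820: 00000000acb8f000 - 00000000afce9000 (usable)
"""

E820_RE = re.compile(r"BIOS-e820:\s+([0-9a-f]+)\s+-\s+([0-9a-f]+)\s+\((.+?)\)")


def parse_e820(text):
    """Return (start, end, kind) tuples; end is assumed to be exclusive."""
    return [(int(m.group(1), 16), int(m.group(2), 16), m.group(3))
            for m in E820_RE.finditer(text)]


def overlapping_usable(entries, start, end):
    """Usable e820 ranges that intersect the half-open range [start, end)."""
    return [(s, e) for s, e, kind in entries
            if kind == "usable" and s < end and start < e]


entries = parse_e820(E820_LINES)
# Aperture reported by memblock_x86_reserve_range: [0xa0000000-0xa3ffffff]
for s, e in overlapping_usable(entries, 0xa0000000, 0xa4000000):
    print(f"aperture overlaps usable RAM {s:#x} - {e:#x}")
```

With the sample above, the aperture falls entirely inside the first usable range, which is consistent with the "Aperture pointing to e820 RAM" message.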
On Wed, Apr 13, 2011 at 12:14:55PM -0700, Yinghai Lu wrote:
> thanks for the bisecting...
>
> so those two patches uncover some problems.
>
> [ 0.000000] Checking aperture...
> [ 0.000000] No AGP bridge found
> [ 0.000000] Node 0: aperture @ a0000000 size 32 MB
> [ 0.000000] Aperture pointing to e820 RAM. Ignoring.
> [ 0.000000] Your BIOS doesn't leave a aperture memory hole
> [ 0.000000] Please enable the IOMMU option in the BIOS setup
> [ 0.000000] This costs you 64 MB of RAM
> [ 0.000000] memblock_x86_reserve_range: [0xa0000000-0xa3ffffff] aperture64
> [ 0.000000] Mapping aperture over 65536 KB of RAM @ a0000000
>
> so kernel try to reallocate apperture. because BIOS allocated is pointed to
> RAM or size is too small.
It is actually beyond 4GB on that machine, this value read here is from the previous kernel-boot. The BIOS does not reset these values on a reboot.
> but your radeon does use [0xa0000000, 0xbfffffff)
Yes, I suspected that too (and spent a few hours reading radeon code), but then I talked to Alex Deucher and he explained that these addresses which the driver prints for GTT and VRAM are in the GPU address space and do not refer to system ram. So this shouldn't be the problem.
Alexandre's working version on 2.6.38.2 is using
Checking aperture...
No AGP bridge found
Node 0: aperture @ a4000000 size 32 MB
Aperture pointing to e820 RAM. Ignoring.
Your BIOS doesn't leave a aperture memory hole
Please enable the IOMMU option in the BIOS setup
This costs you 64 MB of RAM
Mapping aperture over 65536 KB of RAM @ a4000000
PM: Registered nosave memory: 00000000a4000000 - 00000000a8000000
so it is even with 512M alignment.
Created attachment 54322 [details]
dmesg from bad commit WITH proposed fix
I applied the proposed fix and I was able to boot properly. This is the output from dmesg.
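As a side note on the aperture addresses reported in this thread (0xa4000000 on 2.6.38.2, 0xa0000000 on the failing kernels, 0x80000000 with the revert), here is a small Python sketch that checks how each one lines up against 64 MiB, 512 MiB and 1 GiB boundaries. It only illustrates the arithmetic; the helper name is_aligned is made up and nothing here is kernel code.

```python
MIB = 1 << 20
GIB = 1 << 30


def is_aligned(addr: int, align: int) -> bool:
    """True if addr is a multiple of align."""
    return addr % align == 0


# Aperture base addresses quoted in the dmesg excerpts of this report.
for addr in (0xa4000000, 0xa0000000, 0x80000000):
    print(f"{addr:#x}: "
          f"64M={is_aligned(addr, 64 * MIB)} "
          f"512M={is_aligned(addr, 512 * MIB)} "
          f"1G={is_aligned(addr, GIB)}")
```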
I applied the proposed fix (in the email discussion) and it fixed the boot process.
Could you please specify which exact proposed fix that was?
Checking aperture...
No AGP bridge found
Node 0: aperture @ 80000000 size 32 MB
Aperture pointing to e820 RAM. Ignoring.
Your BIOS doesn't leave a aperture memory hole
Please enable the IOMMU option in the BIOS setup
This costs you 64 MB of RAM
Mapping aperture over 65536 KB of RAM @ 80000000
PM: Registered nosave memory: 0000000080000000 - 0000000084000000
looks like the big alignment to 1g.
looks like any address other than 0xa0000000 will work.
Okay, so the question is why, and how to prevent it by design, rather than accident.
Sorry, just for the info since there is a parallel discussion on the dri mailing list, here is the discussion and proposed fix: http://lists.freedesktop.org/archives/dri-devel/2011-April/010150.html
diff --git a/arch/x86/kernel/aperture_64.c b/arch/x86/kernel/aperture_64.c
index 86d1ad4..3b6a9d5 100644
--- a/arch/x86/kernel/aperture_64.c
+++ b/arch/x86/kernel/aperture_64.c
@@ -83,7 +83,7 @@ static u32 __init allocate_aperture(void)
          * so don't use 512M below as gart iommu, leave the space for kernel
          * code for safe
          */
-        addr = memblock_find_in_range(0, 1ULL<<32, aper_size, 512ULL<<20);
+        addr = memblock_find_in_range(0, 1ULL<<32, aper_size, 512ULL<<21);
         if (addr == MEMBLOCK_ERROR || addr + aper_size > 0xffffffff) {
                 printk(KERN_ERR
                         "Cannot allocate aperture memory hole (%lx,%uK)\n",
I'm now about to test the fix against 2.6.39-rc3.
Tested on rc3 and it works fine here with the "fix".
can you please try only patch2 and patch1+patch2?
want to know what happen if k8 gart and radeon gtt are all set to 0x80000000
diff --git a/drivers/gpu/drm/radeon/radeon_device.c b/drivers/gpu/drm/radeon/radeon_device.c
index 890217e..08e1961 100644
--- a/drivers/gpu/drm/radeon/radeon_device.c
+++ b/drivers/gpu/drm/radeon/radeon_device.c
@@ -325,6 +325,8 @@ void radeon_gtt_location(struct radeon_device *rdev, struct radeon_mc *mc)
                         mc->gtt_size = size_bf;
                 }
                 mc->gtt_start = (mc->vram_start & ~mc->gtt_base_align) - mc->gtt_size;
+                if (mc->gtt_start == 0xa0000000)
+                        mc->gtt_start = 0x80000000;
         } else {
                 if (mc->gtt_size > size_af) {
                         dev_warn(rdev->dev, "limiting GTT\n");
Tested: Applying both patches make the system resets. Applying only one of the patches allows it to boot properly.
(In reply to comment #32)
> Applying both patches make the system resets. Applying only one of the patches
> allows it to boot properly.
Makes sense. When you apply both patches the both apertures end up to be on the same address (0x80000000) again.
Btw. can you please post the exact model of your internal graphics card in that box? Is it a dedicated card (with on-card memory) or part of the chipset (and uses cpu ram with no own memory).
Alexandre, can you please run on your system the command (as root)
# lsmsr MC4_CTL
The command is part of the x86info package.
It would be good to know whether GART TLB Walk Errors are masked or not on your system.
If it's already masked Joerg's patch (http://marc.info/?l=linux-kernel&m=130287312009403&w=2) is of no use on your system.
Thanks.
(In reply to comment #33)
> (In reply to comment #32)
> > Applying both patches make the system resets. Applying only one of the patches
> > allows it to boot properly.
>
> Makes sense. When you apply both patches the both apertures end up to be on the
> same address (0x80000000) again.
>
> Btw. can you please post the exact model of your internal graphics card in that
> box? Is it a dedicated card (with on-card memory) or part of the chipset (and
> uses cpu ram with no own memory).
I'm using the following model: RS780 (HD3200) integrated
If you want more info on the mainboard: http://www.gigabyte.com/products/product-page.aspx?pid=2814#ov
(In reply to comment #34)
> Alexandre,
>
> can you please run on your system the command (as root)
>
> # lsmsr MC4_CTL
>
> The command is part of the x86info package.
> It would be good to know whether GART TLB Walk Errors are masked or not on your
> system.
>
> If it's already masked Joerg's patch
> (http://marc.info/?l=linux-kernel&m=130287312009403&w=2)
> is of no use on your system.
>
> Thanks.
Sorry, it took some time, I was in the middle of some updates and something went wrong. Then, I installed the x86info provided by Ubuntu, but lsmsr was missing (I have to open a bug about it). So I downloaded the source from the project's website, but latest version can't be compiled because of an error (I have to open a bug about it also). I'm using version 1.28, since it's the last working version I was able to compile.
So now about the command:
./lsmsr MC4_CTL
MC4_CTL = 0x000000003fffffff
MC4_CTL_MASK = 0x0000000000780000
./lsmsr MC4_CTL -V3 | grep "MC4\|Gart"
MC4_CTL = 0x000000003fffffff
GartTblWkEn=0x1
MC4_CTL_MASK = 0x0000000000780000
GartTblWkEn=0
So it looks really similar to the results of Joerg. I'll be now testing the suggested patch.
Applied Joerg's patch and it now works correctly. It seems to be the good fix for this processor/BIOS problem.
Apparently the fix for this bug introduced a regression when run under KVM: https://bugzilla.kernel.org/show_bug.cgi?id=35132
A patch referencing this bug report has been merged in v3.0-rc1:
commit d47cc0db8fd6011de2248df505fc34990b7451bf
Author: Roedel, Joerg <Joerg.Roedel@amd.com>
Date: Thu May 19 11:13:39 2011 +0200
x86, amd: Use _safe() msr access for GartTlbWlk disable code
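For anyone repeating the lsmsr check above without the -V3 decoding, the two raw values can be tested by hand. The sketch below is only an illustration: it assumes the GartTblWkEn field sits at bit 10 of both MC4_CTL and MC4_CTL_MASK (an assumption, though it is consistent with the decoded output quoted in this report), and the helper name bit_is_set is made up.

```python
# Assumed bit position of GartTblWkEn in MC4_CTL / MC4_CTL_MASK (not taken from this report).
GART_TBL_WK_EN_BIT = 10

# Raw values reported by lsmsr in the comment above.
MC4_CTL = 0x000000003fffffff
MC4_CTL_MASK = 0x0000000000780000


def bit_is_set(value: int, bit: int) -> bool:
    """True if the given bit is set in value."""
    return (value >> bit) & 1 == 1


reporting_enabled = bit_is_set(MC4_CTL, GART_TBL_WK_EN_BIT)
masked_by_bios = bit_is_set(MC4_CTL_MASK, GART_TBL_WK_EN_BIT)

print(f"GartTblWkEn in MC4_CTL:      {int(reporting_enabled)}")
print(f"GartTblWkEn in MC4_CTL_MASK: {int(masked_by_bios)}")
if reporting_enabled and not masked_by_bios:
    print("GART table walk errors are reported and not masked")
```

Under that assumption the values give 1 for MC4_CTL and 0 for MC4_CTL_MASK, matching the GartTblWkEn=0x1 and GartTblWkEn=0 lines in the lsmsr -V3 output.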