Bug 13042 - MTRR problems after 4GB RAM upgrade
Status: CLOSED OBSOLETE
Alias: None
Product: Platform Specific/Hardware
Classification: Unclassified
Component: x86-64
Hardware: All Linux
Importance: P1 normal
Assignee: platform_x86_64@kernel-bugs.osdl.org
Reported: 2009-04-07 23:46 UTC by higuita
Modified: 2012-05-30 16:04 UTC

See Also:
Kernel Version: 2.6.29
Subsystem:
Regression: No
Bisected commit-id:


Attachments
lspci (17.46 KB, text/plain)
2009-04-07 23:50 UTC, higuita
dmesg.4G (41.21 KB, text/plain)
2009-04-07 23:50 UTC, higuita
dmesg.3G (39.90 KB, text/plain)
2009-04-07 23:51 UTC, higuita
my .config (68.04 KB, text/plain)
2009-05-13 22:00 UTC, higuita
config for 2.6.30-rc4 (96.49 KB, text/plain)
2009-05-13 22:07 UTC, Sven Arvidsson
mtrr-uncover output (535 bytes, text/plain)
2009-05-14 01:08 UTC, higuita
mtrr-uncover with debug (4.64 KB, text/plain)
2009-05-14 01:10 UTC, higuita
dmesg with enable_mtrr_cleanup mtrr_spare_reg_nr=3 debug (41.78 KB, text/plain)
2009-05-14 01:11 UTC, higuita
4GB dmesg without fglrx (48.56 KB, text/plain)
2009-05-17 05:19 UTC, higuita
Photo of the kernel ooops when removing mtrr 1 (467.95 KB, image/jpeg)
2009-05-17 05:20 UTC, higuita
4GB dmesg - no extra modules (47.01 KB, text/plain)
2009-05-30 16:13 UTC, higuita
dmesg, notaint, single (42.57 KB, text/plain)
2009-06-11 23:34 UTC, higuita
mtrr for 4G, notaint, single (281 bytes, text/plain)
2009-06-11 23:37 UTC, higuita

Description higuita 2009-04-07 23:46:58 UTC
Just like Bug #10508, I'm having problems with MTRR after adding more RAM to the machine, bringing it to 4GB.

Bug #10508 is for 32-bit and should be fixed, but the patches added there don't seem to do anything on the x86_64 platform.

Here is the MTRR layout with memory remapping enabled:
reg00: base=0x000000000 (    0MB), size= 4096MB, count=1: write-back
reg01: base=0x100000000 ( 4096MB), size= 1024MB, count=1: write-back
reg02: base=0x0c0000000 ( 3072MB), size= 1024MB, count=1: uncachable
reg03: base=0x0c0000000 ( 3072MB), size=  256MB, count=1: write-combining

With memory remapping disabled:
reg00: base=0x000000000 (    0MB), size= 2048MB, count=1: write-back
reg01: base=0x080000000 ( 2048MB), size= 1024MB, count=1: write-back
reg02: base=0x0c0000000 ( 3072MB), size=  256MB, count=1: write-combining
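For comparing the two layouts programmatically, here is a minimal Python sketch that turns lines in the format shown above into records. The `MtrrEntry` name and the regular expression are illustrative assumptions, not part of mtrr-uncover or the kernel, and handling of size units other than MB is a guess, since only MB appears in this report.

```python
import re
from typing import NamedTuple


class MtrrEntry(NamedTuple):
    reg: int    # register number (the NN in regNN)
    base: int   # base address in bytes
    size: int   # size in bytes
    mtype: str  # memory type, e.g. "write-back"


# Matches lines such as:
#   reg02: base=0x0c0000000 ( 3072MB), size= 1024MB, count=1: uncachable
# Units other than MB are an assumption; only MB appears in this report.
LINE_RE = re.compile(
    r"reg(\d+): base=0x([0-9a-fA-F]+) \(\s*\d+[KMGT]?B\), "
    r"size=\s*(\d+)([KMGT])B, count=\d+: (\S+)"
)
UNIT_SHIFT = {"K": 10, "M": 20, "G": 30, "T": 40}


def parse_mtrr(text):
    """Return a list of MtrrEntry for every line that matches LINE_RE."""
    entries = []
    for line in text.splitlines():
        m = LINE_RE.match(line.strip())
        if m:
            reg, base, size, unit, mtype = m.groups()
            entries.append(
                MtrrEntry(int(reg), int(base, 16),
                          int(size) << UNIT_SHIFT[unit], mtype)
            )
    return entries
```

Lines that do not match the pattern are skipped silently, so the same function can be fed a whole saved snapshot.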

I tried to use the mtrr-uncover app, but the MTRR format has changed a bit since the tool was last updated. After manually converting the new format to the old one, it suggests this:
disable=0
disable=2
disable=3
base=0x000000000 size=0x080000000 type=write-back
base=0x080000000 size=0x040000000 type=write-back
base=0x0e0000000 size=0x020000000 type=uncachable

However, trying to disable rule 0 hard-locks the system and disabling rule 2 produces an oops, so I'm unable to get a usable MTRR setup and am "forced" to "downgrade" to 3GB by disabling the memory remap.

I'm using an Asus A8V with the latest BIOS, and any help is welcome.

Thanks
Comment 1 higuita 2009-04-07 23:50:06 UTC
Created attachment 20869 [details]
lspci
Comment 2 higuita 2009-04-07 23:50:57 UTC
Created attachment 20870 [details]
dmesg.4G
Comment 3 higuita 2009-04-07 23:51:32 UTC
Created attachment 20871 [details]
dmesg.3G
Comment 4 Sven Arvidsson 2009-05-13 20:57:10 UTC
I had the same problem; setting CONFIG_MTRR_SANITIZER_ENABLE_DEFAULT=1 fixed it for me.

Any chance of having this set as the default, as discussed here:
http://lkml.org/lkml/2009/2/19/108
Comment 5 higuita 2009-05-13 22:00:44 UTC
Created attachment 21339 [details]
my .config

I already have those flags enabled. Maybe I have some incompatible option?
I'm attaching my config for kernel 2.6.29.3.
Comment 6 D. Hugh Redelmeier 2009-05-13 22:06:10 UTC
I have updated mtrr-uncover to address a gratuitous change to the format of entries in /proc/mtrr.

My suspicion is that any kernel with this change also has the MTRR sanitizer code.  Even so, there have been systems for which mtrr-uncover works and the MTRR sanitizer does not (at least not without options, ones that are documented opaquely).

ftp://ftp.cs.utoronto.ca/pub/hugh/mtrr-uncover-2009may13.tgz
Comment 7 Sven Arvidsson 2009-05-13 22:07:20 UTC
Created attachment 21340 [details]
config for 2.6.30-rc4

I'm attaching the config I used for comparison.

I'm using an Asus P5Q-EM motherboard.
Comment 8 higuita 2009-05-14 01:08:22 UTC
Created attachment 21349 [details]
mtrr-uncover output

In a quick comparison of the two configs I don't see anything that might explain the different results. I will try a more thorough comparison later.

I also tried the new mtrr-uncover. Again, the system hard-locks if I try to change the MTRRs. I will attach the output with debug next, but it seems impossible for me to disable the original MTRR config after booting.

Finally, I also tried booting with "enable_mtrr_cleanup mtrr_spare_reg_nr=3 debug", but it didn't work; I always get the same MTRR config.
Comment 9 higuita 2009-05-14 01:10:05 UTC
Created attachment 21350 [details]
mtrr-uncover with debug
Comment 10 higuita 2009-05-14 01:11:36 UTC
Created attachment 21351 [details]
dmesg with enable_mtrr_cleanup mtrr_spare_reg_nr=3 debug
Comment 11 higuita 2009-05-14 01:21:32 UTC
Sven's CPU seems to be an Intel (as in most of the posts I have seen from people who were able to fix this) and mine is an AMD, so maybe it's a CPU-related problem?

Here is my cpuinfo (core 0 only; core 1 is the same):

processor       : 0
vendor_id       : AuthenticAMD
cpu family      : 15
model           : 35
model name      : Dual Core AMD Opteron(tm) Processor 175
stepping        : 2
cpu MHz         : 1000.000
cache size      : 1024 KB
physical id     : 0
siblings        : 2
core id         : 0
cpu cores       : 2
apicid          : 0
initial apicid  : 0
fpu             : yes
fpu_exception   : yes
cpuid level     : 1
wp              : yes
flags           : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ht syscall nx mmxext fxsr_opt lm 3dnowext 3dnow rep_good pni lahf_lm cmp_legacy
bogomips        : 2003.89
TLB size        : 1024 4K pages
clflush size    : 64
cache_alignment : 64
address sizes   : 40 bits physical, 48 bits virtual
power management: ts fid vid ttp
Comment 12 D. Hugh Redelmeier 2009-05-14 03:25:59 UTC
I've never seen overlapping with three MTRRs covering a range.  The first MTRR configuration in the original message has three MTRRs covering the range 0x0c0000000-0x0cfffffff (note: these are 36-bit numbers; ignore the leading 0 to see 32-bit numbers).

The first message refers to this setup as "the mtrr with memory remapping enable".  I take it that this refers to a BIOS setting.

It is legal but generally useless to have MTRRs nested three deep.

In this case, the three MTRR types are: write-back, uncachable, and write-combining.  I find this very odd.  In particular, uncachable trumps other types and so the write-combining range (MTRR 3) is useless since it is completely contained within the uncachable range (MTRR 2).  (The rules are in section 10.11.4.1 "MTRR Precedences" of "Intel® 64 and IA-32 Architectures Software Developer’s Manual", Volume 3A: "System Programming Guide", Part 1)
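To make the precedence argument concrete, here is a small Python sketch that resolves the effective type of an address against the "memory remapping enabled" layout from the original report. It is a simplification under stated assumptions: it only models the variable-range rules (UC wins over everything; write-through wins over write-back; other mixes are reported as undefined), it ignores the fixed-range MTRRs below 1MB and PAT, and the `effective_type` name is made up for illustration.

```python
MB = 1 << 20

# (base, size, type) for the layout reported with memory remapping enabled.
REMAP_ENABLED = [
    (0x000000000, 4096 * MB, "write-back"),
    (0x100000000, 1024 * MB, "write-back"),
    (0x0C0000000, 1024 * MB, "uncachable"),
    (0x0C0000000, 256 * MB, "write-combining"),
]


def effective_type(addr, mtrrs, default="uncachable"):
    """Resolve overlapping variable-range MTRRs for one physical address.

    Simplified model: fixed-range MTRRs and PAT are ignored, and `default`
    stands in for the default type from the MTRRdefType register.
    """
    types = {t for base, size, t in mtrrs if base <= addr < base + size}
    if not types:
        return default            # no variable range matches
    if "uncachable" in types:
        return "uncachable"       # UC takes precedence over every other type
    if len(types) == 1:
        return next(iter(types))  # all matching ranges agree
    if types == {"write-through", "write-back"}:
        return "write-through"    # WT takes precedence over WB
    return "undefined"            # other overlaps are not architecturally defined
```

With this model, any address inside the 256MB write-combining range resolves to uncachable, which is the point made above: MTRR 3 has no effect while MTRR 2 is in place.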

Another odd thing is that I've never seen a BIOS initialize a range to write-combining.  Generally, an X video driver will attempt to set up a write-combining region.  I've also seen a rare Infiniband driver do that.  But never a BIOS.  Are you printing this before X gets started?  Actually, the kernel generally doesn't allow creation of a write-combining region like this, so it is unlikely to be driver code that did this.

I admit that I've not seen a very wide variety of MTRR settings.  I have seen a number as author of mtrr-uncover.

It would be useful to know what causes the machine to lock up.  You can issue the commands that mtrr-uncover proposes, one at a time.  Run it without --execute.  The last thing it prints is a set of commands.  Send each of them, one line at a time, to /proc/mtrr using echo.  For example
 echo "disable=0" >! /proc/mtrr
After which command does the system hang?
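Since a hang leaves no chance to take notes, one way to run the commands one at a time is to record each command on disk before sending it. The following Python sketch does that; the `log_then_send` name and the `mtrr-steps.log` file name are hypothetical, it must be run as root to write to /proc/mtrr, and it sends exactly the same strings that the echo approach would.

```python
import os


def log_then_send(cmd, mtrr_path="/proc/mtrr", log_path="mtrr-steps.log"):
    """Append cmd to a log file, force it to disk, then write it to mtrr_path.

    If the machine hangs, the last line of the log is the command that was
    being issued at the time.
    """
    with open(log_path, "a") as log:
        log.write(cmd + "\n")
        log.flush()
        os.fsync(log.fileno())  # make sure the note survives a hard lock
    with open(mtrr_path, "w") as mtrr:
        mtrr.write(cmd + "\n")  # same text that echo would send
```

For example, `log_then_send("disable=3")` followed by `log_then_send("disable=2")` would issue two of the proposed commands in separate steps, so the machine can be checked between them.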
Comment 13 D. Hugh Redelmeier 2009-05-14 04:18:04 UTC
[I'm not a kernel hacker and I don't know the mtrr cleanup code.]

I have a machine running 64-bit Ubuntu 9.04 that needs mtrr-fixup.  The kernel is Ubuntu's 2.6.28, not the 2.6.29 that you are running, so the cleanup code might have changed.

On my system, with enable_mtrr_cleanup, the cleanup starts chattering in dmesg very early.  And it chatters a lot.  I see nothing in your dmesg output.

From my recollection of the cleanup code (very early on), it made some assumptions about the environment, checked them, and if they were not satisfied, it didn't do any cleanup.  It might not have even logged anything to this effect.  An example that I remember was that the default memory type must be UC (as set in a model-specific register).  It might be worth reading the code to see if any such assumptions are not satisfied in your environment.  Suspicion: the code might not be willing to deal with triple-nested ranges.  You might wish to drop in a few printk calls.

An early enough printk could let you know if the cleanup code is even compiled in.
Comment 14 higuita 2009-05-15 02:28:31 UTC
Thanks for the help, D. Hugh.

AFAIK, those MTRRs are not changed by anything. I have done all the tests in a console, without X running (after all, fglrx locks up if I try to start it with this MTRR config). I may have had the fglrx module already loaded, but if I'm not wrong, the MTRRs are exactly the same when it isn't loaded (I will confirm that tomorrow with a few tests).

I already tried running the commands to change the MTRRs and, if memory serves, it hard-locks when removing reg 0 and gives a kernel oops when I remove reg 1 or reg 2, I'm not sure which one. Again, I will confirm this tomorrow.

I'm now running 2.6.29, but I have had this problem at least since 2.6.27, which is when I added more RAM. There is no difference between kernel versions, and I also don't see any special output during boot.

Diffing the 3GB dmesg against the 4GB one, I get this:

$ diff dmesg.3G.1 dmesg.4G.1      
2c2
<  Command line: BOOT_IMAGE=2.6.29.1-k8_64 ro root=302 resume=/dev/sda6 memory_corruption_check=1
---
>  Command line: auto BOOT_IMAGE=2.6.29.1-k8_64 ro root=302 resume=/dev/sda6 memory_corruption_check=1 mtrr_chunk=256M mtrr_gran_size=256M
15a16
>   BIOS-e820: 0000000100000000 - 0000000140000000 (usable)
18c19
<  last_pfn = 0xbffb0 max_arch_pfn = 0x100000000
---
>  last_pfn = 0x140000 max_arch_pfn = 0x100000000
19a21
>  last_pfn = 0xbffb0 max_arch_pfn = 0x100000000
30a33
>   modified: 0000000100000000 - 0000000140000000 (usable)
35a39,42
>  init_memory_mapping: 0000000100000000-0000000140000000
>   0100000000 - 0140000000 page 2M
>  kernel direct mapping tables up to 140000000 @ 13000-19000
>  last_map_addr: 140000000 end: 140000000
45c52
<  (5 early reservations) ==> bootmem [0000000000 - 00bffb0000]
---
>  (6 early reservations) ==> bootmem [0000000000 - 0140000000]
51c58,59
<   [ffffe20000000000-ffffe200029fffff] PMD -> [ffff880001200000-ffff880003bfffff] on node 0
---
>    #5 [0000013000 - 0000014000]          PGTABLE ==> [0000013000 - 0000014000]
>   [ffffe20000000000-ffffe200045fffff] PMD -> [ffff880028200000-ffff88002c7fffff] on node 0
55c63
<    Normal   0x00100000 -> 0x00100000
---
>    Normal   0x00100000 -> 0x00140000
57c65
<  early_node_map[2] active PFN ranges
---
>  early_node_map[3] active PFN ranges
60c68,69
<  On node 0 totalpages: 786239
---
>      0: 0x00100000 -> 0x00140000
>  On node 0 totalpages: 1048383
62,65c71,77
<    DMA zone: 1782 pages reserved
<    DMA zone: 2145 pages, LIFO batch:0
<    DMA32 zone: 10695 pages used for memmap
<    DMA32 zone: 771561 pages, LIFO batch:31
---
>    DMA zone: 1783 pages reserved
>    DMA zone: 2144 pages, LIFO batch:0
>    DMA32 zone: 14280 pages used for memmap
>    DMA32 zone: 767976 pages, LIFO batch:31
>    Normal zone: 3584 pages used for memmap
>    Normal zone: 258560 pages, LIFO batch:31
>  Looks like a VIA chipset. Disabling IOMMU. Override with iommu=allowed
82a95,99
>  PM: Registered nosave memory: 00000000bffb0000 - 00000000bffc0000
>  PM: Registered nosave memory: 00000000bffc0000 - 00000000bfff0000
>  PM: Registered nosave memory: 00000000bfff0000 - 00000000c0000000
>  PM: Registered nosave memory: 00000000c0000000 - 00000000ff780000
>  PM: Registered nosave memory: 00000000ff780000 - 0000000100000000
86,87c103,104
<  Built 1 zonelists in Zone order, mobility grouping on.  Total pages: 773706
<  Kernel command line: BOOT_IMAGE=2.6.29.1-k8_64 ro root=302 resume=/dev/sda6 memory_corruption_check=1
---
>  Built 1 zonelists in Zone order, mobility grouping on.  Total pages: 1028680
>  Kernel command line: auto BOOT_IMAGE=2.6.29.1-k8_64 ro root=302 resume=/dev/sda6 memory_corruption_check=1 mtrr_chunk=256M mtrr_gran_size=256M
93c110
<  Detected 2202.736 MHz processor.
---
>  Detected 2202.636 MHz processor.
98,103c115,118
<  Checking aperture...
<  AGP bridge at 00:00:00
<  Aperture from AGP @ c0000000 old size 32 MB
<  Aperture from AGP @ c0000000 size 256 MB (APSIZE f00)
<  Node 0: aperture @ c0000000 size 256 MB
<  Memory: 3088348k/3145408k available (3751k kernel code, 452k absent, 56164k reserved, 1851k data, 372k init)
---
>  PCI-DMA: Using software bounce buffering for IO (SWIOTLB)
>  Placing 64MB software IO TLB between ffff880020000000 - ffff880024000000
>  software IO TLB at phys 0x20000000 - 0x24000000
>  Memory: 4041812k/5242880k available (3751k kernel code, 1049348k absent, 150816k reserved, 1851k data, 372k init)
105c120
<  Calibrating delay loop (skipped), value calculated using timer frequency.. 4407.03 BogoMIPS (lpj=7342453)
---
>  Calibrating delay loop (skipped), value calculated using timer frequency.. 4407.82 BogoMIPS (lpj=7342120)
127c142
<  Total of 2 processors activated (8814.96 BogoMIPS).
---
>  Total of 2 processors activated (8814.75 BogoMIPS).
140a156
>  TOM2: 0000000140000000 aka 5120M
144c160,161
<  bus: 00 index 2 mmio: [c0000000, fcffffffff]
---
>  bus: 00 index 2 mmio: [c0000000, ffffffff]
>  bus: 00 index 3 mmio: [140000000, fcffffffff]
233d249
<  pci 0000:00:00.0: BAR 0: can't allocate resource
251,254d266
<  pnp 00:0d: mem resource (0x0-0x9ffff) overlaps 0000:00:00.0 BAR 0 (0x0-0xfffffff), disabling
<  pnp 00:0d: mem resource (0xc0000-0xdffff) overlaps 0000:00:00.0 BAR 0 (0x0-0xfffffff), disabling
<  pnp 00:0d: mem resource (0xe0000-0xfffff) overlaps 0000:00:00.0 BAR 0 (0x0-0xfffffff), disabling
<  pnp 00:0d: mem resource (0x100000-0xbffeffff) overlaps 0000:00:00.0 BAR 0 (0x0-0xfffffff), disabling
265a278,281
>  system 00:0d: iomem range 0x0-0x9ffff could not be reserved
>  system 00:0d: iomem range 0xc0000-0xdffff has been reserved
>  system 00:0d: iomem range 0xe0000-0xfffff could not be reserved
>  system 00:0d: iomem range 0x100000-0xbffeffff could not be reserved
289c305
<  msgmni has been set to 6032
---
>  msgmni has been set to 7895
332d347
<  isa bounce pool size: 16 pages
442,444d456
<  ReiserFS: hda2: Using r5 hash to sort names
<  VFS: Mounted root (reiserfs filesystem) readonly on device 3:2.
<  Freeing unused kernel memory: 372k freed
451c463,473
<  Linux video capture interface: v2.00
---
>  generic-usb 0003:0463:FFFF.0002: hidraw1: USB HID v1.00 Device [MGE UPS SYSTEMS ellipse] on usb-0000:00:10.1-2/input0
>  ReiserFS: hda2: replayed 215 transactions in 12 seconds
>  ReiserFS: hda2: Using r5 hash to sort names
>  VFS: Mounted root (reiserfs filesystem) readonly on device 3:2.
>  Freeing unused kernel memory: 372k freed
>  ohci1394 0000:00:07.0: PCI INT A -> GSI 16 (level, low) -> IRQ 16
>  ohci1394: fw-host0: OHCI-1394 1.0 (PCI): IRQ=[16]  MMIO=[fb700000-fb7007ff]  Max Packet=[2048]  IR/IT contexts=[4/8]
>  skge 0000:00:0a.0: PCI INT A -> GSI 17 (level, low) -> IRQ 17
>  skge 0000:00:0a.0: PCI: Disallowing DAC for device
>  skge 1.13 addr 0xfbb00000 irq 17 chip Yukon-Lite rev 9
>  skge eth0: addr 00:11:d8:82:b1:0e
460,465c482
<  ohci1394 0000:00:07.0: PCI INT A -> GSI 16 (level, low) -> IRQ 16
<  ohci1394: fw-host0: OHCI-1394 1.0 (PCI): IRQ=[16]  MMIO=[fb700000-fb7007ff]  Max Packet=[2048]  IR/IT contexts=[4/8]
<  skge 0000:00:0a.0: PCI INT A -> GSI 17 (level, low) -> IRQ 17
<  skge 0000:00:0a.0: PCI: Disallowing DAC for device
<  skge 1.13 addr 0xfbb00000 irq 17 chip Yukon-Lite rev 9
<  skge eth0: addr 00:11:d8:82:b1:0e
---
>  Linux video capture interface: v2.00
481a499,500
>  [fglrx] Maximum main memory to use for locked dma buffers: 3786 MBytes.
>  [fglrx]   vendor: 1002 device: 9586 count: 1
483a503,506
>  [fglrx] ioport: bar 1, base 0xe000, size: 0x100
>  pci 0000:01:00.0: PCI INT A -> GSI 16 (level, low) -> IRQ 16
>  [fglrx] Kernel PAT support detected, disabling driver built-in PAT support
>  [fglrx] module loaded - fglrx 8.59.2 [Mar 13 2009] with 1 minors
491,496d513
<  [fglrx] Maximum main memory to use for locked dma buffers: 2870 MBytes.
<  [fglrx]   vendor: 1002 device: 9586 count: 1
<  [fglrx] ioport: bar 1, base 0xe000, size: 0x100
<  pci 0000:01:00.0: PCI INT A -> GSI 16 (level, low) -> IRQ 16
<  [fglrx] Kernel PAT support detected, disabling driver built-in PAT support
<  [fglrx] module loaded - fglrx 8.59.2 [Mar 13 2009] with 1 minors
499a517,518
>  ReiserFS: hda2: Removing [43 750871 0x0 SD]..done
>  ReiserFS: hda2: There were 1 uncompleted unlinks/truncates. Completed
505d523
<  generic-usb 0003:0463:FFFF.0002: hidraw1: USB HID v1.00 Device [MGE UPS SYSTEMS ellipse] on usb-0000:00:10.1-2/input0
530c548
<  Clocksource tsc unstable (delta = -274461584 ns)
---
>  Clocksource tsc unstable (delta = -273951755 ns)
602d619
<  ip_tables: (C) 2000-2006 Netfilter Core Team
604a622
>  ip_tables: (C) 2000-2006 Netfilter Core Team
617d634
<  hdc: UDMA/33 mode selected
618a636
>  hdc: UDMA/33 mode selected

As you can see, there are no MTRR references, which is why I suspect that the MTRR cleanup isn't being run on my machine. I opened the bug as an x86_64 problem, but it now looks like it is related to my hardware.

I'm not a programmer (just shell scripts and a little Perl), but if you give me an example of a printk call, I can try to insert them at random in the /usr/src/linux-2.6.29/arch/x86/kernel/cpu/mtrr/ files using my almost non-existent C knowledge :)

Tomorrow I will have more free time to boot the machine several times and do more tests. In the meantime, should I nag Asus for a BIOS fix, or post to the LKML to raise awareness of this problem?

Thanks
Comment 15 D. Hugh Redelmeier 2009-05-15 07:12:00 UTC
This dmesg diff was interesting to me.

Do notice that fglrx is running.  Please try without it.  First of all, it is closed source.  Second of all, it could well muck with MTRRs (although it mentions using PAT).

In the 4G case, the AGP aperture was not mentioned.  I don't know why or what that means.  The AGP aperture is the address window through which the video board reaches system memory (remapped by the GART).

As far as advising you on how to add printk calls, I don't feel that I can devote the time needed.  I'd need to download the kernel, find the MTRR cleanup code, try to understand it, and then write up what I had figured out.  But here's a start.

The relevant code seems to be here: http://lxr.linux.no/linux+v2.6.29/arch/x86/kernel/cpu/mtrr/main.c

Line 1364 is the start of the C function to clean up MTRRs.

You will see line 1377 is an "if" that will cause the routine to quit, without diagnostic, if the default MTRR type is not uncachable.

You can see an example of printk on line 1394.  The KERN_DEBUG part specifies what logging level is needed for this to show up in the log.  If you change that to KERN_WARNING, you are more likely to see the message.  I don't know what your kernel loglevel is but it is likely set to suppress KERN_DEBUG messages.
Comment 16 Yinghai Lu 2009-05-15 23:50:10 UTC
Can you try tip/master?

http://people.redhat.com/mingo/tip.git/readme.txt

Also, please boot with
debug show_msr=1
Comment 17 higuita 2009-05-17 05:12:52 UTC
OK, let's start with the tests.

Without the fglrx module in the system: exactly the same MTRRs after booting, without starting X.

Doing echo "disable=0" >| /proc/mtrr locks up the machine; not even the magic SysRq keys work.
disable=2 removes the entry and the system still works.
disable=3 removes the entry and the system still works.
disable=1 (not needed by mtrr-uncover) generates a kernel oops. I can't do anything locally, but I can ping the machine and I see new iptables messages on the console. I will attach a screenshot of the oops (should I open a new bug?).

I'm compiling the tip tree right now.
Comment 18 higuita 2009-05-17 05:19:15 UTC
Created attachment 21381 [details]
4GB dmesg without fglrx
Comment 19 higuita 2009-05-17 05:20:31 UTC
Created attachment 21382 [details]
Photo of the kernel ooops when removing mtrr 1
Comment 20 D. Hugh Redelmeier 2009-05-17 05:49:06 UTC
I don't think that this is possible, but one explanation would be if the default type is NOT uncachable.

Disabling 3 first should have no effect.  As I mentioned in #12, it doesn't do anything.

Once you do that, the setup appears normal.

Disabling only 0 should only slow the machine down.  A lot!  Guess: the performance might go down almost to that of a PC/XT (the last PC that had no caching).  Still, I imagine this to be fast enough to get to the next step in "mtrr-uncover --execute".

Disabling only 2 could do bad things.  Or maybe not.  It depends what's in that memory range: some device registers are likely mapped into that range and should not be cached.

Disabling only 1 would slow down any access to memory above address 4G.  I don't know what Linux would choose to place there.

Still, because enable_mtrr_cleanup isn't running, I think that there is something odd about your setup.  It might just be the triple nesting, but it might be something else.
Comment 21 D. Hugh Redelmeier 2009-05-17 06:02:19 UTC
Reading the dmesg #18.

I'd suggest not running vboxdr and gShield.  Try ditching as much as possible.
Comment 22 higuita 2009-05-30 16:13:43 UTC
Created attachment 21633 [details]
4GB dmesg - no extra modules

Here is the latest dmesg (now with kernel 2.6.29.4), without fglrx and vbox.

gshield is just a firewall/iptables script, but I disabled it anyway.

I tried the tip git tree, but I use reiserfs for the root filesystem, and over the past weeks it has always crashed while mounting the root partition; the kernel oops shows a crash in xattr (or similar) in reiserfs.
Comment 23 higuita 2009-05-30 17:21:45 UTC
I forgot: of course, the MTRRs are still the same, and it still locks up when I try to disable entry 0.
Comment 24 D. Hugh Redelmeier 2009-05-31 03:07:26 UTC
For what it is worth, MTRRdefType Register says that the default type is UC.  So that seems right. MSR000002ff: 0000000000000c00
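For anyone who wants to check that reading of MSR 0x2ff, here is a minimal Python sketch of the decoding. It assumes the usual register layout (default type in the low byte, fixed-range enable in bit 10, MTRR enable in bit 11); the function and dictionary names are made up for illustration.

```python
# Architectural MTRR memory type encodings.
MTRR_TYPES = {
    0: "uncachable",
    1: "write-combining",
    4: "write-through",
    5: "write-protect",
    6: "write-back",
}


def decode_mtrr_def_type(value):
    """Split a raw MTRRdefType (MSR 0x2ff) value into its fields."""
    return {
        "default_type": MTRR_TYPES.get(value & 0xFF, "reserved"),
        "fixed_enabled": bool(value & (1 << 10)),  # fixed-range MTRRs enabled
        "mtrr_enabled": bool(value & (1 << 11)),   # MTRRs enabled at all
    }
```

Applied to the value 0xc00 quoted above, this gives a default type of uncachable with both enable bits set, which agrees with the conclusion that the default type is UC.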

This latest run does not include enable_mtrr_cleanup.
Comment 25 higuita 2009-06-11 23:34:41 UTC
Created attachment 21861 [details]
dmesg, notaint, single 

New dmesg, this time with the missing option and booting in single-user mode, to exclude even more external interference.

Again, disable=0 locks up the system.
Comment 26 higuita 2009-06-11 23:37:31 UTC
Created attachment 21862 [details]
mtrr for 4G, notaint, single

The MTRRs didn't change, but here they are for completeness.

I will try to test 2.6.30 when I have more time.
Comment 27 D. Hugh Redelmeier 2009-06-12 04:17:33 UTC
It isn't clear from your message, but the attachment in #26 is a snapshot of /proc/mtrr.  If you can set the MIME type to text, it would be easier to look at.  No big deal.

I note that a write-combining MTRR is present still.  Since I assume that X isn't running, I find this quite surprising (as I mentioned earlier).

Looking at the dmesg output in #25, the string "mtrr" only appears in a kernel parameter.  It sure looks to me as if the mtrr-cleanup code isn't running.  Why?
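The check described above can be repeated on any saved dmesg with a few lines of Python. This is only a sketch: the `mtrr_mentions` name is hypothetical, and it simply treats any line containing "command line:" as a kernel parameter echo rather than real MTRR output.

```python
def mtrr_mentions(dmesg_text):
    """Return dmesg lines that mention MTRR outside the kernel command line."""
    hits = []
    for line in dmesg_text.splitlines():
        lower = line.lower()
        # Skip the lines that merely echo boot parameters such as mtrr_chunk=
        if "mtrr" in lower and "command line:" not in lower:
            hits.append(line)
    return hits
```

An empty result for a boot with enable_mtrr_cleanup on the command line would be consistent with the cleanup code not running, or not logging at the current log level.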

Dropping a few printk calls in might help to narrow down what is going on.
