Bug 206963 - non-zeroed pages when using CONFIG_INIT_ON_FREE_DEFAULT_ON and AMD SME
Summary: non-zeroed pages when using CONFIG_INIT_ON_FREE_DEFAULT_ON and AMD SME
Status: NEW
Alias: None
Product: Platform Specific/Hardware
Classification: Unclassified
Component: x86-64
Hardware: x86-64 Linux
Importance: P1 normal
Assignee: platform_x86_64@kernel-bugs.osdl.org
URL:
Keywords:
Depends on:
Blocks:
 
Reported: 2020-03-25 19:25 UTC by Filip Brygidyn
Modified: 2020-05-31 18:54 UTC
3 users

See Also:
Kernel Version: 5.4.26
Subsystem:
Regression: No
Bisected commit-id:


Attachments
5.4.26.config, logs, patch and arch build script (110.45 KB, application/gzip)
2020-03-25 19:25 UTC, Filip Brygidyn
Details
symbolized logs with CONFIG_DEBUG_PAGEALLOC and CONFIG_PAGE_OWNER (98.81 KB, text/plain)
2020-03-30 18:41 UTC, Filip Brygidyn
Details
vanilla kernel logs with page poisoning (PAGE_POISONING_ZERO enabled and disabled) (209.89 KB, application/gzip)
2020-04-02 08:33 UTC, Filip Brygidyn
Details
5.6 logs with print_hex_dump patch and CONFIG_RANDOMIZE_MEMORY disabled (423.82 KB, application/gzip)
2020-04-02 17:31 UTC, Filip Brygidyn
Details
summary of poisoning errors (21.70 KB, text/plain)
2020-04-03 10:13 UTC, Vlastimil Babka
Details
logs and patch with hw_breakpoint added (62.43 KB, application/gzip)
2020-04-03 22:28 UTC, Filip Brygidyn
Details
logs of encrypting/decrypting the corrupted page (330.52 KB, text/plain)
2020-04-06 19:26 UTC, Filip Brygidyn
Details
do decrypt+encrypt before poisoning (81.34 KB, application/gzip)
2020-04-07 17:35 UTC, Filip Brygidyn
Details
clflush directly (1.60 KB, patch)
2020-04-08 17:01 UTC, Borislav Petkov
Details | Diff
logs after applying 'clflush directly' (82.66 KB, application/gzip)
2020-04-08 20:25 UTC, Filip Brygidyn
Details
5.6.2 config (58.43 KB, application/gzip)
2020-04-09 17:44 UTC, Filip Brygidyn
Details
5.6.2_debug_patches+config (60.74 KB, application/gzip)
2020-04-10 14:53 UTC, Filip Brygidyn
Details
KASAN logs (301.62 KB, application/gzip)
2020-04-11 16:57 UTC, Filip Brygidyn
Details
logs with __set_memory_enc_dec() hook (203.21 KB, text/plain)
2020-04-12 17:14 UTC, Filip Brygidyn
Details

Description Filip Brygidyn 2020-03-25 19:25:45 UTC
Created attachment 288057 [details]
5.4.26.config, logs, patch and arch build script

I first reported this as a linux-hardened issue on Arch bug-tracker: https://bugs.archlinux.org/task/65869
But in the end I could reproduce it on a clean kernel tree with just a minimal check lifted from the hardened tree.

I tested and reproduced it as far back as 4.14.

Problem description:
When running with AMD SME and INIT_ON_FREE enabled, it appears as though the supposedly zeroed free pages become non-zeroed when they are taken from the freelist.

The patch that checks for the non-zeroed pages is attached. (apply on top of 5.4.26)
It simply verifies if pages are zeroed when in prep_new_page (mm/page_alloc.c) function.
(lifted from https://github.com/anthraxx/linux-hardened/blob/5.4/mm/page_alloc.c#L2190 and https://github.com/anthraxx/linux-hardened/blob/5.4/include/linux/highmem.h#L218)
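For reference, the check is roughly the following (a sketch modeled on the linux-hardened lines linked above, not the exact attached patch):

static inline bool verify_zero_highpage(struct page *page)
{
        void *kaddr = kmap_atomic(page);
        /* memchr_inv() returns NULL when the whole page is zero */
        bool zeroed = memchr_inv(kaddr, 0, PAGE_SIZE) == NULL;

        kunmap_atomic(kaddr);
        return zeroed;
}

and in prep_new_page(), before the page is handed out:

        if (want_init_on_free() && !verify_zero_highpage(page)) {
                WARN(1, "page not zeroed on free");
                dump_page(page, "verify_zero_highpage failed!");
        }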


With the patch applied and the attached .config, boot the kernel and you should randomly see logs about non-zeroed pages when memory is allocated.
The important options are:
- CONFIG_INIT_ON_FREE_DEFAULT_ON=y
- CONFIG_AMD_MEM_ENCRYPT=y
- CONFIG_AMD_MEM_ENCRYPT_ACTIVE_BY_DEFAULT=y

The procedure that works for me is to run memtester (http://pyropus.ca/software/memtester):
while true
do
    memtester 12G & # with 16G of RAM
    sleep 8
    killall memtester
done

The fault should happen after at most a few iterations; if it doesn't, just reboot and try again a few times.

Example symbolized call trace:
[  255.154391] ------------[ cut here ]------------
[  255.159010] WARNING: CPU: 11 PID: 550 at include/linux/highmem.h:227 prep_new_page (/home/aur/vanilla+checks/packages/trunk/src/linux-5.4.26/./include/linux/highmem.h:227 /home/aur/vanilla+checks/packages/trunk/src/linux-5.4.26/mm/page_alloc.c:2162) 
[  255.167612] Modules linked in: amd64_edac_mod edac_mce_amd snd_hda_intel ipmi_ssif snd_intel_nhlt kvm snd_hda_codec irqbypass snd_hda_core snd_hwdep crct10dif_pclmul nls_iso8859_1 snd_pcm nls_cp437 crc32_pclmul ghash_clmulni_intel snd_timer vfat snd igb ccp wmi_bmof ipmi_si fat aesni_intel crypto_simd sp5100_tco soundcore cryptd glue_helper i2c_algo_bit rng_core pcspkr ipmi_devintf i2c_piix4 k10temp dca ipmi_msghandler pinctrl_amd gpio_amdpt evdev wmi mac_hid drm agpgart ip_tables x_tables btrfs libcrc32c crc32c_generic xor raid6_pq sd_mod ahci libahci libata scsi_mod xhci_pci xhci_hcd crc32c_intel
[  255.220530] CPU: 11 PID: 550 Comm: memtester Not tainted 5.4.26-1-lts #1
[  255.227223] Hardware name: To Be Filled By O.E.M. To Be Filled By O.E.M./X470D4U, BIOS L3.31 12/24/2019
[  255.236609] RIP: 0010:prep_new_page (/home/aur/vanilla+checks/packages/trunk/src/linux-5.4.26/./include/linux/highmem.h:227 /home/aur/vanilla+checks/packages/trunk/src/linux-5.4.26/mm/page_alloc.c:2162) 
[ 255.241141] Code: 41 b8 08 00 00 00 b9 40 00 00 00 6a 00 ba 02 00 00 00 48 c7 c6 ab 42 72 af 48 c7 c7 44 d1 6e af 68 00 10 00 00 e8 38 39 20 00 <0f> 0b 58 5a 41 83 ad 88 19 00 00 01 48 83 c3 40 4c 39 fb 0f 85 72
<SNIP>
[  255.321690] Call Trace:
[  255.324147] get_page_from_freelist (/home/aur/vanilla+checks/packages/trunk/src/linux-5.4.26/mm/page_alloc.c:3696) 
[  255.328764] __alloc_pages_nodemask (/home/aur/vanilla+checks/packages/trunk/src/linux-5.4.26/mm/page_alloc.c:4751) 
[  255.333297] alloc_pages_vma (/home/aur/vanilla+checks/packages/trunk/src/linux-5.4.26/./include/linux/mempolicy.h:78 /home/aur/vanilla+checks/packages/trunk/src/linux-5.4.26/./include/linux/mempolicy.h:83 /home/aur/vanilla+checks/packages/trunk/src/linux-5.4.26/mm/mempolicy.c:2156) 
[  255.337134] __handle_mm_fault (/home/aur/vanilla+checks/packages/trunk/src/linux-5.4.26/./include/linux/highmem.h:208 /home/aur/vanilla+checks/packages/trunk/src/linux-5.4.26/mm/memory.c:3018 /home/aur/vanilla+checks/packages/trunk/src/linux-5.4.26/mm/memory.c:3863 /home/aur/vanilla+checks/packages/trunk/src/linux-5.4.26/mm/memory.c:3989) 
[  255.341312] handle_mm_fault (/home/aur/vanilla+checks/packages/trunk/src/linux-5.4.26/mm/memory.c:4026) 
[  255.345152] __get_user_pages (/home/aur/vanilla+checks/packages/trunk/src/linux-5.4.26/mm/gup.c:641 /home/aur/vanilla+checks/packages/trunk/src/linux-5.4.26/mm/gup.c:845) 
[  255.349165] populate_vma_page_range (/home/aur/vanilla+checks/packages/trunk/src/linux-5.4.26/mm/gup.c:1223) 
[  255.353611] __mm_populate (/home/aur/vanilla+checks/packages/trunk/src/linux-5.4.26/mm/gup.c:1279) 
[  255.357275] do_mlock (/home/aur/vanilla+checks/packages/trunk/src/linux-5.4.26/mm/mlock.c:712) 
[  255.360500] __x64_sys_mlock (/home/aur/vanilla+checks/packages/trunk/src/linux-5.4.26/mm/mlock.c:720 /home/aur/vanilla+checks/packages/trunk/src/linux-5.4.26/mm/mlock.c:718 /home/aur/vanilla+checks/packages/trunk/src/linux-5.4.26/mm/mlock.c:718) 
[  255.364244] do_syscall_64 (/home/aur/vanilla+checks/packages/trunk/src/linux-5.4.26/arch/x86/entry/common.c:290) 
[  255.367911] entry_SYSCALL_64_after_hwframe (/home/aur/vanilla+checks/packages/trunk/src/linux-5.4.26/arch/x86/entry/entry_64.S:184) 
[  255.372965] RIP: 0033:0x7f5745b6e35b
[ 255.376544] Code: f0 ff ff 73 01 c3 48 8b 0d 32 4b 0c 00 f7 d8 64 89 01 48 83 c8 ff c3 0f 1f 84 00 00 00 00 00 f3 0f 1e fa b8 95 00 00 00 0f 05 <48> 3d 01 f0 ff ff 73 01 c3 48 8b 0d 05 4b 0c 00 f7 d8 64 89 01 48
<SNIP>
[  255.438492] ---[ end trace e40ed40276155114 ]---

Corresponding kernel tree (besides the patch): https://git.kernel.org/pub/scm/linux/kernel/git/stable/linux.git/tree/?h=v5.4.26
Full logs are attached (raw and symbolized).

The logs also contain dumps of the affected non-zeroed pages. For example:
[  255.443364] page:ffffd812c3a75c00 refcount:1 mapcount:0 mapping:0000000000000000 index:0x1
[  255.451626] flags: 0x1ffff0000000000()
[  255.455376] raw: 01ffff0000000000 dead000000000100 dead000000000122 0000000000000000
[  255.463119] raw: 0000000000000001 0000000000000000 00000001ffffffff 0000000000000000
[  255.470858] page dumped because: verify_zero_highpage failed!
<SNIP ZEROS>
[  255.815899] NON-ZERO PAGE: 00000340: 593b748c68881e2a 057455310362e36e
[  255.822424] NON-ZERO PAGE: 00000350: 287ea35d5d557742 fc02e8916a27042f
[  255.828950] NON-ZERO PAGE: 00000360: a1cb0d6d5d34c6cd 714ab1a43cf924b3
[  255.835479] NON-ZERO PAGE: 00000370: a3004368e712b8be 30e8fd221c5de507
[  255.842004] NON-ZERO PAGE: 00000380: a5e5c7ac67995ded b78248897ebbc61d
[  255.848526] NON-ZERO PAGE: 00000390: ee9fe81503ab2cc8 ccf9fd5523fe68c5
[  255.855046] NON-ZERO PAGE: 000003a0: 3a3da027e811829e db6c21bbc436d22b
[  255.861571] NON-ZERO PAGE: 000003b0: 8d31e6c9d818c734 3392b17c4dc43ec1
[  255.868097] NON-ZERO PAGE: 000003c0: 9ea988607b7077d8 f43dead3876ce0e1
[  255.874625] NON-ZERO PAGE: 000003d0: a7c1b44e192af5cc db260a62c138e5d2
[  255.881152] NON-ZERO PAGE: 000003e0: 5eaf30dcc5c19fe3 b126b6843b2f86e4
[  255.887675] NON-ZERO PAGE: 000003f0: 80b7670e4b49cdcd 4d18678050ea6e3a
[  255.894203] NON-ZERO PAGE: 00000400: 1f3c63300b40050c 9d4b1a3ba4e18337
[  255.900731] NON-ZERO PAGE: 00000410: 4b77edc62003313f b4d267f9dea58995
[  255.907253] NON-ZERO PAGE: 00000420: c89f759708ec0051 7f957c9bc7c183db
[  255.913781] NON-ZERO PAGE: 00000430: 385812a8f267427d 327378397f19a164
<SNIP ZEROS>
[  255.946402] NON-ZERO PAGE: 00000480: 0ebbd516b67cc823 dfcde4d53e4e0007
[  255.952928] NON-ZERO PAGE: 00000490: 453285e132c03f29 92c6c36258a5bfb9
[  255.959454] NON-ZERO PAGE: 000004a0: d4046066d19f8d32 fcdd220d5f34f440
[  255.965981] NON-ZERO PAGE: 000004b0: 19b1eebf7b60dbed 8e4fe4e47f61e90d
<SNIP ZEROS>
[  256.677185] NON-ZERO PAGE: 00000b80: 2c5ea3021e2e7253 d2c35a99331b4059
[  256.683712] NON-ZERO PAGE: 00000b90: b22c3e3af5a34e46 24515bb4a8989047
[  256.690235] NON-ZERO PAGE: 00000ba0: aa5ff36196b7a812 b3f91556094ecd97
[  256.696755] NON-ZERO PAGE: 00000bb0: 202e2f2a9314fa6b 9b7a1e96ddc75fd7
<SNIP ZEROS>

Those values do seem random - perhaps modified by broken SME handling?


When tested with KASAN enabled it did not report anything related to this issue.
It did report a use-after-free in amdgpu on the 2200G, but that is not relevant here - you can find more details along with a symbolized call stack in the Arch bug report.


Hardware used:
I tested on several configurations - the parts I had to swap around were as follows:
  CPUs: ryzen 2600, ryzen 2200g
  Mobo: x470D4U, B450M steel legend
  Ram: 4 sticks of ECC ram: 2x8GB + 2x16GB
  SSD: intel 760p (nvme), samsung 860 evo (sata)
  No gpu - can reproduce in blind mode
  PSU: supermicro 1200W rack unit, seasonic 360W unit
  Cooling: wraith prism, wraith stealth
  Traces gathered over a serial port
And I verified that no single component is causing this.


I also attached an Arch Linux-specific build script I used - maybe it will be more convenient for someone.
Comment 1 Borislav Petkov 2020-03-30 10:23:44 UTC
Can you enable CONFIG_DEBUG_PAGEALLOC and CONFIG_PAGE_OWNER and boot with debug_pagealloc=on page_owner=on? That should tell you who last used the page.

Thx.
Comment 2 Filip Brygidyn 2020-03-30 18:41:24 UTC
Created attachment 288129 [details]
symbolized logs with CONFIG_DEBUG_PAGEALLOC and CONFIG_PAGE_OWNER

Here you go, full log attached.

The excerpt I assume you were interested in:
    page dumped because: verify_zero_highpage failed!
    page_owner tracks the page as allocated
    page last allocated via order 0, migratetype Movable, gfp_mask 0x100dca(GFP_HIGHUSER_MOVABLE|__GFP_ZERO)
    prep_new_page (linux-5.4.26/./include/linux/page_owner.h:31 linux-5.4.26/mm/page_alloc.c:2151 linux-5.4.26/mm/page_alloc.c:2157) 
    get_page_from_freelist (linux-5.4.26/mm/page_alloc.c:3696) 
    __alloc_pages_nodemask (linux-5.4.26/mm/page_alloc.c:4751) 
    alloc_pages_vma (linux-5.4.26/./include/linux/mempolicy.h:78 linux-5.4.26/./include/linux/mempolicy.h:83 linux-5.4.26/mm/mempolicy.c:2156) 
    __handle_mm_fault (linux-5.4.26/./include/linux/highmem.h:208 linux-5.4.26/mm/memory.c:3018 linux-5.4.26/mm/memory.c:3863 linux-5.4.26/mm/memory.c:3989) 
    handle_mm_fault (linux-5.4.26/mm/memory.c:4026) 
    __get_user_pages (linux-5.4.26/mm/gup.c:641 linux-5.4.26/mm/gup.c:845) 
    populate_vma_page_range (linux-5.4.26/mm/gup.c:1223) 
    __mm_populate (linux-5.4.26/mm/gup.c:1279) 
    do_mlock (linux-5.4.26/mm/mlock.c:712) 
    __x64_sys_mlock (linux-5.4.26/mm/mlock.c:720 linux-5.4.26/mm/mlock.c:718 linux-5.4.26/mm/mlock.c:718) 
    do_syscall_64 (linux-5.4.26/arch/x86/entry/common.c:290) 
    entry_SYSCALL_64_after_hwframe (linux-5.4.26/arch/x86/entry/entry_64.S:184) 
    page_owner free stack trace missing
Comment 3 Vlastimil Babka 2020-03-31 08:39:16 UTC
(In reply to Filip Brygidyn from comment #2)
> Created attachment 288129 [details]
> symbolized logs with CONFIG_DEBUG_PAGEALLOC and CONFIG_PAGE_OWNER
> 
> Here you go, full log attached.
> 
> The excerpt I assume you were interested in:

Yeah, thanks.

>     page_owner free stack trace missing

So this indicates the page never went through freeing. All pages managed by mm/page_alloc.c should be added there in one way or another (during boot, hotplug...) through __free_pages_core() which basically simulates page freeing, thus including the init_on_free and setting of page_owner free stack trace. This page somehow circumvented this path, hmm...
Comment 4 Filip Brygidyn 2020-03-31 09:01:47 UTC
(In reply to Vlastimil Babka from comment #3)
> So this indicates the page never went through freeing.

This kind of matches my testing method:
With 16 or 32 GB of ram I launch memtester right after booting.
It presumably allocates most of the available memory which was until this point unused.

Most of the time, after allocating 12 out of the available 16 GB, I only get a few page warnings. The rest are correctly zeroed, and I was unable to reproduce this with mem_encrypt=off.
Comment 5 Vlastimil Babka 2020-03-31 11:00:29 UTC
(In reply to Vlastimil Babka from comment #3)
> >     page_owner free stack trace missing
> 
> So this indicates the page never went through freeing. All pages managed by
> mm/page_alloc.c should be added there in one way or another (during boot,
> hotplug...) through __free_pages_core() which basically simulates page
> freeing, thus including the init_on_free and setting of page_owner free
> stack trace. This page somehow circumvented this path, hmm...

Correction: the "page_owner free stack trace missing" does indeed mean that this is the first time the page has been allocated. But the rest of above is wrong. At the initial pass through __free_pages_core() we are not yet recording the freeing stack traces. So I assume the page went properly through the initial freeing, including the zeroing. But then something corrupted parts of it. If it was e.g. zeroed with a wrong memory encryption state, then I would expect it to be fully garbage, not just parts of it.
Comment 6 Filip Brygidyn 2020-03-31 12:14:31 UTC
I checked with CONFIG_PAGE_POISONING=y and page_poison=on and I found no additional logs about corruption.

From what I understand, this would eliminate the possibility that the page was corrupted after zeroing (e.g. somewhere in arch_free_page()).

Not sure if it was obvious for you.
Comment 7 Filip Brygidyn 2020-03-31 15:02:10 UTC
(In reply to Filip Brygidyn from comment #6)
> I checked with CONFIG_PAGE_POISONING=y and page_poison=on and I found no
> additional logs about corruption.
> 
> From what I understand this would eliminate a possibility that the page was
> corrupted after zero-ing (eg. somwhere in arch_free_page()).
> 
> Not sure if it was obvious for you.

I want to retract this. I did not update the .config file and tested the previous kernel version.

With properly updated .config I cannot reproduce the issue with page_poison=on.

Why could this help?
Comment 8 Filip Brygidyn 2020-03-31 16:18:23 UTC
Also checked with CONFIG_PAGE_POISONING_ZERO=y, nothing changed:
- page_poison=on:  can't reproduce
- page_poison=off: can reproduce

(And this time I verified the /proc/config.gz when testing)
Comment 9 Vlastimil Babka 2020-04-01 13:11:47 UTC
(In reply to Filip Brygidyn from comment #7)
> With properly updated .config I cannot reproduce the issue with
> page_poison=on.
> 
> Why could this help?

Simply, because:

static inline bool want_init_on_free(void)
{
        return static_branch_unlikely(&init_on_free) &&
               !page_poisoning_enabled();
}

Enabling page poisoning disabled init on free, including the check you added.
You should see this in dmesg:

mem auto-init: CONFIG_PAGE_POISONING is on, will take precedence over init_on_free
Comment 10 Filip Brygidyn 2020-04-01 15:12:40 UTC
Thanks, that makes sense.
Still: When enabled, page poisoning would report corruption if it happened in between freeing and allocation. So if I did not see them then that part can be eliminated as a suspect?

If I am reading correctly, the poisoning/un-poisoning happens at the same points as init_on_free and my check (free_pages_prepare() and prep_new_page()).

So the thing you said previously:
(In reply to Vlastimil Babka from comment #3)
> This page somehow circumvented this path, hmm...
Would become more probable


Also: Please excuse me if I am talking nonsense or stating the obvious. I am not familiar with the codebase. At the same time I would like to help, not just throw logs at bugzilla.
Comment 11 Borislav Petkov 2020-04-01 16:30:33 UTC
You could audit how the C-bit - _PAGE_ENC - is being set/cleared for those pages. Because, for example, if you memset the page to 0s, then clear the C-bit in the pte and access that page with C-bit cleared, you'll get garbage instead of zeroes. Or the other way around. IOW, the C-bit setting should be the same when the page is cleared to 0s *and* when your check happens - otherwise you'll be seeing those funky bytes.

But this is only a conjecture we have - someone with time on his hands has to sit down and stare at the code. :)
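A minimal sketch of where such an audit could start - only an illustration (assuming lookup_address() and the _PAGE_ENC flag), not part of any attached patch:

static void dump_c_bit(struct page *page)
{
        unsigned long vaddr = (unsigned long)page_address(page);
        unsigned int level;
        pte_t *pte = lookup_address(vaddr, &level);

        if (!pte)
                return;
        pr_info("page %px: mapping level %u, C-bit %s\n", (void *)vaddr, level,
                (pte_flags(*pte) & _PAGE_ENC) ? "set" : "clear");
}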

HTH.
Comment 12 Vlastimil Babka 2020-04-01 18:11:01 UTC
(In reply to Filip Brygidyn from comment #10)
> Thanks, that makes sense.
> Still: When enabled, page poisoning would report corruption if it happened
> in between freeing and allocation. So if I did not see them then that part
> can be eliminated as a suspect?

Hmm good point, it should be like that yeah.

> If I am reading correctly then the poisoning/un-poisoning happens at the
> same points where init_on_free and my check is. (free_pages_prepare() and
> prep_new_page())

Seems so.

> So the thing you said previously:
> (In reply to Vlastimil Babka from comment #3)
> > This page somehow circumvented this path, hmm...
> Would become more probable

Hmm, but then the unpoisoning pass should also catch such a page, the same way the detection patch you have added does - it's almost identical. But double-check that CONFIG_PAGE_POISONING_NO_SANITY is *not enabled*. If it is enabled, then no check happens on unpoisoning. It might also be interesting to see if PAGE_POISONING_ZERO enabled/disabled makes any difference.
Comment 13 Filip Brygidyn 2020-04-01 20:21:22 UTC
Yup, I had CONFIG_PAGE_POISONING_NO_SANITY=y. I will rebuild with it disabled and let you know tomorrow.
Comment 14 Filip Brygidyn 2020-04-02 08:33:48 UTC
Created attachment 288149 [details]
vanilla kernel logs with page poisoning (PAGE_POISONING_ZERO enabled and disabled)

With CONFIG_PAGE_POISONING_NO_SANITY disabled now I see corruption without my patch. (vanilla 5.4.26 kernel)
And I checked SME again just in case: problems disappear when passing mem_encrypt=off

Logs attached along with their corresponding config.gz taken from /proc.
I tested with PAGE_POISONING_ZERO and without - didn't make a difference


So AFAIK the thing I posted previously should be treated as void:
(In reply to Filip Brygidyn from comment #10)
> Still: When enabled, page poisoning would report corruption if it happened
> in between freeing and allocation. So if I did not see them then that part
> can be eliminated as a suspect?
Comment 15 Vlastimil Babka 2020-04-02 09:17:54 UTC
(In reply to Filip Brygidyn from comment #14)
> With CONFIG_PAGE_POISONING_NO_SANITY disabled now I see corruption without
> my patch. (vanilla 5.4.26 kernel)
> And I checked SME again just in case: problems disappear when passing
> mem_encrypt=off
> 
> Logs attached along with their corresponding config.gz taken from /proc.
> I tested with PAGE_POISONING_ZERO and without - didn't make a difference

Good. I notice there's a pattern: it's always two pages, where one has a small area corrupted and the other a larger area plus another small area. Could it be that it's always the same addresses? However, we have security features working against our debugging here, so to find out the real and stable addresses you'd need to disable CONFIG_RANDOMIZE_MEMORY and also apply the patch below. Would it also be possible to test the latest 5.6.x kernel, just to rule out that this was something already fixed? Thanks.

diff --git a/lib/hexdump.c b/lib/hexdump.c
index 147133f8eb2f..52dd2fc1c8b4 100644
--- a/lib/hexdump.c
+++ b/lib/hexdump.c
@@ -256,7 +256,7 @@ void print_hex_dump(const char *level, const char *prefix_str, int prefix_type,
 
                switch (prefix_type) {
                case DUMP_PREFIX_ADDRESS:
-                       printk("%s%s%p: %s\n",
+                       printk("%s%s%px: %s\n",
                               level, prefix_str, ptr + i, linebuf);
                        break;
                case DUMP_PREFIX_OFFSET:
Comment 16 Filip Brygidyn 2020-04-02 17:31:10 UTC
Created attachment 288169 [details]
5.6 logs with print_hex_dump patch and CONFIG_RANDOMIZE_MEMORY disabled

I took the current 5.6.2, disabled CONFIG_RANDOMIZE_MEMORY and applied your patch.
Corruption is still present and you can find logs and a config.gz attached.


(In reply to Vlastimil Babka from comment #15)
> Good. I notice there's some pattern, it's always two pages where one has a
> small area corrupted, and another a larger area plus another small area.
> Could it be it's always the same addresses?
I attached a total of 8 dmesg logs - one per reboot to ease the spotting of patterns.
And the addresses are indeed the same. But it appears that in a few instances 3 pages are reported instead of 2 - though the 3rd is offset by only 4k from one of the other 2.
Comment 17 Vlastimil Babka 2020-04-03 10:13:55 UTC
Created attachment 288177 [details]
summary of poisoning errors

Thanks, I've extracted the hexdumps into one file, and can confirm that there are two or three pages; some have corruption at exactly the same addresses (ffff8880e9d26fc0 to the end of the page), some at slightly different ranges.

That the pages are otherwise filled with the 0xaa poison should mean they were correctly poisoned, and it's not just null-init by the BIOS plus a lack of poisoning.

Perhaps the exact scrambling we see is a result of non-encrypted write to encrypted pages, and then seeing the decrypted result. But, if memory encryption was off and somebody was still doing these corruption writes, we would see them too, just with different contents, right? So I still don't see how the mem encryption is exactly the culprit here. Or does it have a bug that manifests only at certain physical addresses?
Comment 18 Filip Brygidyn 2020-04-03 22:28:21 UTC
Created attachment 288195 [details]
logs and patch with hw_breakpoint added

With the known repeatable address (0xffff8880e9d26fc0) that can be monitored, I added a hardware breakpoint using include/linux/hw_breakpoint.h while referencing samples/hw_breakpoint/data_breakpoint.c.
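For reference, the registration part looks roughly like this (a sketch based on the sample code; the exact code I used is in the attached patch):

static struct perf_event * __percpu *wp;

static void wp_handler(struct perf_event *bp, struct perf_sample_data *data,
                       struct pt_regs *regs)
{
        pr_info("hw_breakpoint: value is changed\n");
        dump_stack();
}

static int watch_magic_address(void)
{
        struct perf_event_attr attr;

        hw_breakpoint_init(&attr);
        attr.bp_addr = 0xffff8880e9d26fc0UL;    /* the repeatable address */
        attr.bp_len  = HW_BREAKPOINT_LEN_4;
        attr.bp_type = HW_BREAKPOINT_W | HW_BREAKPOINT_R;

        wp = register_wide_hw_breakpoint(&attr, wp_handler, NULL);
        return IS_ERR((void __force *)wp) ? PTR_ERR((void __force *)wp) : 0;
}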


I attached:
- raw dmesg
- symbolized dmesg
- trimmed symbolized dmesg with just the interesting bits and comments about my thought process
- patch adding the breakpoint (it also contains your hex_dump patch and a small change that makes the page poisoning print the whole page instead of just the corrupted part)


And here's even more trimmed version of the dmesg with comments (marked with <> brackets):


<dmesg starts>
[    0.000000] Linux version 5.6.2-1-custom (linux-custom@archlinux) (gcc version 9.3.0 (Arch Linux 9.3.0-1)) #6 SMP PREEMPT Fri, 03 Apr 2020 20:53:42 +0000
[    0.000000] Command line: BOOT_IMAGE=/boot/vmlinuz-linux-custom root=UUID=d8dded4f-a123-4108-b7a9-3c2781ee024c rw console=tty0 console=ttyS0,115200 init_on_free=1 mem_encrypt=on debug_pagealloc=on page_owner=on page_poison=on loglevel=8
<SNIP IMO irrelevant logs>

<initial free on the page containing our magical broken address is called, dump the current stack>
<@Vlastimil Babka Presumably this is the first time this page is added to page_alloc.c? I do not see __free_pages_core() here though>

[    0.033937] dump_stack (/home/aur/linux-git/src/linux-5.6.2/lib/dump_stack.c:120) 
[    0.033937] kernel_poison_pages.cold (/home/aur/linux-git/src/linux-5.6.2/mm/page_poison.c:98 /home/aur/linux-git/src/linux-5.6.2/mm/page_poison.c:112 /home/aur/linux-git/src/linux-5.6.2/mm/page_poison.c:190) 
[    0.033937] free_pcp_prepare (/home/aur/linux-git/src/linux-5.6.2/./arch/x86/include/asm/jump_label.h:25 /home/aur/linux-git/src/linux-5.6.2/./include/linux/mm.h:2712 /home/aur/linux-git/src/linux-5.6.2/mm/page_alloc.c:1182 /home/aur/linux-git/src/linux-5.6.2/mm/page_alloc.c:1220) 
[    0.033937] free_unref_page (/home/aur/linux-git/src/linux-5.6.2/mm/page_alloc.c:3011 /home/aur/linux-git/src/linux-5.6.2/mm/page_alloc.c:3060) 
[    0.033937] __memblock_free_late (/home/aur/linux-git/src/linux-5.6.2/./arch/x86/include/asm/atomic64_64.h:88 (discriminator 2) /home/aur/linux-git/src/linux-5.6.2/./include/asm-generic/atomic-instrumented.h:1050 (discriminator 2) /home/aur/linux-git/src/linux-5.6.2/./include/asm-generic/atomic-long.h:160 (discriminator 2) /home/aur/linux-git/src/linux-5.6.2/./include/linux/mm.h:60 (discriminator 2) /home/aur/linux-git/src/linux-5.6.2/mm/memblock.c:1630 (discriminator 2)) 
[    0.033937] efi_free_boot_services (/home/aur/linux-git/src/linux-5.6.2/arch/x86/platform/efi/quirks.c:413 (discriminator 2)) 
[    0.033937] efi_enter_virtual_mode (/home/aur/linux-git/src/linux-5.6.2/arch/x86/platform/efi/efi.c:954 /home/aur/linux-git/src/linux-5.6.2/arch/x86/platform/efi/efi.c:984) 
[    0.033937] start_kernel (/home/aur/linux-git/src/linux-5.6.2/init/main.c:970) 
[    0.033937] secondary_startup_64 (/home/aur/linux-git/src/linux-5.6.2/arch/x86/kernel/head_64.S:242) 


<while freeing we poisoned the page, print it just to make sure it is filled with 0xaa>
[    0.033937] hw_breakpoint: dump_stack finished, current page dump:
<SNIP 0xaa filled page dump>
[    0.880604] ffff8880e9d26fc0: aa aa aa aa aa aa aa aa aa aa aa aa aa aa aa aa  ................
<SNIP 0xaa filled page dump>
<ok, it seems like the page is poisoned correctly, now we register a hw_breakpoint (read/write) at 0xffff8880e9d26fc0 address>


[    0.893937] hw_breakpoint: hexdump finished, registering the breakpoint:
[    0.897285] hw_breakpoint registered
<SNIP>

<Somewhere here I launch the memtester...>

<we are in the middle of trying to allocate a lot of memory in memtester>
<our page is going to be used, we are in prep_new_page() and about to unpoison the page. At this point we need to unregister the breakpoint, but to make sure it works we memset 4 bytes with 0xaf and expect the still enabled breakpoint to trigger...>
[   43.978733] hw_breakpoint: about to unregister the breakpoint, check if it works just in case...


[   43.988271] hw_breakpoint: value is changed
<ok, the breakpoint is triggered, print the call stack...>

<SNIP most of the call stack>
[   44.065581] RIP: 0010:kernel_poison_pages.cold (/home/aur/linux-git/src/linux-5.6.2/mm/page_poison.c:162 /home/aur/linux-git/src/linux-5.6.2/mm/page_poison.c:179 /home/aur
<SNIP>
<so the call-stack of the breakpoint is indeed our memset from page_poison.c, now to unregister the breakpoint>

[   44.278288] hw_breakpoint: testing breakpoint finished, unregistering...
[   44.285378] hw_breakpoint unregistered


<good, now the unpoisoning, which detects memory corruption. It should trigger regardless of the unknown corruption we are dealing with, because we did a 4 byte memset on the page>
[   44.289527] pagealloc: memory corruption
<SNIP 0xaa filled page dump>
[   46.515384] ffff8880e9d26fc0: af af af af bd 74 aa c8 7a 23 9e 9a 31 80 3f 6f  .....t..z#..1.?o
<SNIP page dump with other corruption>

<HMMM, our 4 bytes of 0xaf are here but other values are corrupted as well, ___so something that doesn't trigger hardware breakpoints did it???___>
<SNIP callstack printed along with the corrupted page>





So, from what I can tell the breakpoint works and should have caught the corruption. If it didn't, then maybe it's something related to the C-bit @Borislav Petkov mentioned.
The whitepaper at https://developer.amd.com/wordpress/media/2013/12/AMD_Memory_Encryption_Whitepaper_v7-Public.pdf states that SME is sensitive to caching, so I may try to somehow flush the memory after poisoning. Maybe that would help?
Although right now I do not know how to do that, or how to get from the page struct I have in page_alloc.c to the underlying physical memory somewhere in arch/x86/mm/mem_encrypt.c and check the C-bit. So it will probably take me a while.
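From what I can find so far, the conversion from a struct page would look roughly like this (untested, just my current understanding):

static void show_page_addresses(struct page *page)
{
        void *vaddr = page_address(page);        /* direct-map virtual address */
        phys_addr_t paddr = page_to_phys(page);  /* underlying physical address */

        pr_info("page %px: virt %px phys %pa\n", page, vaddr, &paddr);
}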


Bonus conspiracy theory: Maybe it's the PSP failing miserably while trying to steal my keys :>
Comment 19 Filip Brygidyn 2020-04-03 22:50:05 UTC
(In reply to Vlastimil Babka from comment #17)
> Perhaps the exact scrambling we see is a result of non-encrypted write to
> encrypted pages, and then seeing the decrypted result. But, if memory
> encryption was off and somebody was still doing these corruption writes, we
> would see them too, just with different contents, right?

Someone who knows how SME is handled would have to answer this, but my intuition tells me that if we did a write to a page that was set as encrypted, then the decryption later would produce garbage.

+ The thing you said previously is a very good point:
(In reply to Vlastimil Babka from comment #5)
> If it was e.g. zeroed with a wrong memory encryption
> state, then I would expect it to be fully garbage, not just parts of it.

What we see in page_alloc.c is virtual memory, right? (%px in printk) And the C-bits are set for physical memory pages (so I would have to use %pa, I think). I don't know how they correspond to each other, but maybe the parts that we see as garbage in the middle of a page are just mapped to a separate physical page.
If that's the case then the theory of writing into a page with the wrong C-bit state is plausible.

But again: the breakpoint, which I hope works the way I think it does, did not show any writes.

> So I still don't
> see how the mem encryption is exactly the culprit here. Or does it have a
> bug that manifests only at certain physical addresses?
Maybe, but I am much more willing to believe that we do not flush the caches somewhere. The corruption ranges seem much closer to cache line size than to whole page size (but maybe it's just the physical<>virtual memory mapping being misleading).
Comment 20 Vlastimil Babka 2020-04-04 13:26:22 UTC
(In reply to Filip Brygidyn from comment #18)
> Created attachment 288195 [details]
> logs and patch with hw_breakpoint added
> 
> With the known repeatable address (0xffff8880e9d26fc0) that can be monitored
> I added a hardware breakpoint using include/linux/hw_breakpoint.h while
> referencing the linux/samples/hw_breakpoint/data_breakpoint.c

That was a great idea, thanks. Looks like it did prove the kernel doesn't unexpectedly touch this memory through the direct mapping. I was pretty sure it doesn't, but having a proof is much better.

(In reply to Filip Brygidyn from comment #19)
> (In reply to Vlastimil Babka from comment #17)
> > Perhaps the exact scrambling we see is a result of non-encrypted write to
> > encrypted pages, and then seeing the decrypted result. But, if memory
> > encryption was off and somebody was still doing these corruption writes, we
> > would see them too, just with different contents, right?
> 
> Someone who knows how SME is hadled would have to answer this but my
> intuition tells me that if we did a write to a page that was set as
> encrypted, then the decryption later would produce garbage.

Yes, it looks like somebody did such write, using a mapping that didn't have the C bit set. Your HW breakpoint ruled out it was the kernel using the 0xffff8880e9d26fc0 address. It could have still been the kernel using a different virtual address without C bit set. Or more likely firmware (SMM?) or hardware. But the point is that such stray overwrite would corrupt the poison values even without encryption, and we would detect it.
 
> + The thing you said previously is a very good point:
> (In reply to Vlastimil Babka from comment #5)
> > If it was e.g. zeroed with a wrong memory encryption
> > state, then I would expect it to be fully garbage, not just parts of it.
> 
> What we see in page_alloc.c is virtual memory, right? (%px in printk) and

Yes.

> the C-bits are set for physical memory pages (so I would have to use %pa I
> think). I don't know how do they correspond to each other but maybe the

Since we disabled randomization, it's easy. Virtual address ffff8880e9d26fc0 should be physical address e9d26fc0. The 'ffff88' part is where the direct mapping starts. The extra '8' is bit 47, so it's probably the C-bit - it looks like it's part of the virtual address of the kernel mapping too, although it only needs to be part of the physical address. Need to check how we deal with that in the kernel.

> parts that we see as garbage in the middle of a page are just mapped to a
> separate physical page.

No that's not possible, 4KB page is the lowest possible granularity for mapping. And the ffff88* mapping even uses 1GB granularity when possible.

> If that's the case then the theory of writing into a page with wrong C-Bit
> state is probable.
> 
> But again: the breakpoint I hope works as I think it does did not show any
> writes.

Yes.

> > So I still don't
> > see how the mem encryption is exactly the culprit here. Or does it have a
> > bug that manifests only at certain physical addresses?
> Maybe but I am much more willing to believe that we do not flush the caches
> somewhere. the corruption ranges seem much closer to cache line size than to
> whole page size. (but maybe it's just the physical<>virtual mem mapping
> being misleading)

In fact it seems the corrupted vs. correct parts have exactly the granularity and alignment of a cache line (or multiple cache lines) of 64 bytes, as I mentioned to Boris. Thus it takes only a single byte written through a mapping without the C-bit to corrupt the whole cache line.

I wonder if cache lines can be dirtied speculatively. Then whoever fetched them without the C-bit perhaps did so only because of mis-speculation, and then the writeback on flush did the corruption?

Also I realized we might try to print the affected areas as unencrypted and see if the non-scrambled contents hints at anything. It should be possible just by masking bit 47 in the address, thus prefix ffff888 becomes ffff880. But then we shouldn't do the testing "af" byte writes.
Comment 21 Vlastimil Babka 2020-04-04 13:46:10 UTC
(In reply to Vlastimil Babka from comment #20)
> Since we disabled randomization, it's easy. Virtual address ffff8880e9d26fc0
> should be physical address e9d26fc0. The 'ffff88' part is where direct
> mapping starts. The extra '8' is bit 47 so it's probably the C-bit - looks
> like it's a part of the virtual address of kernel mapping too, although it
> only needs t obe part of the physical address. Need to check how we deal
> with that in the kernel.

Um, scratch this, there's no C-bit in the direct mapping. The base is 0xffff888000000000 these days (if not randomized); it used to be 0xffff880000000000. The bit change (which is actually bit 39, not 47) has nothing to do with encryption.

> Also I realized we might try to print the affected areas as unencrypted and
> see if the non-scrambled contents hints at anything. It should be possible
> just by masking bit 47 in the address, thus prefix ffff888 becomes ffff880.
> But then we shouldn't do the testing "af" byte writes.

This theory should still work, but it's not about modifying the virtual address. We probably need to call set_memory_decrypted() on the whole page (e.g. set_memory_decrypted(0xffff8880e9d26000, 1)) and then do the hexdump.
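In code that would be roughly (just a sketch, using the page's direct-map address):

        set_memory_decrypted(0xffff8880e9d26000UL, 1);
        print_hex_dump(KERN_ERR, "decrypted ", DUMP_PREFIX_ADDRESS, 16, 1,
                       (void *)0xffff8880e9d26000UL, PAGE_SIZE, 1);
        set_memory_encrypted(0xffff8880e9d26000UL, 1);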

(In reply to Filip Brygidyn from comment #18)
> <@Vlastimil Babka Presumably this is the first time this page is added to
> page_alloc.c? I do not see __free_pages_core() here though>

__memblock_free_late() -> memblock_free_pages() -> __free_pages_core()
Probably inlined. IIRC kasan_symbolize.py was able to add such inlines to the symbolized logs.
Comment 22 Filip Brygidyn 2020-04-06 19:26:00 UTC
Created attachment 288237 [details]
logs of encrypting/decrypting the corrupted page

(In reply to Vlastimil Babka from comment #21)
> We probably need to call set_memory_decrypted() on the whole page
> (e.g. set_memory_decrypted(0xffff8880e9d26000, 1) and then do the hexdump.

Good call.
Just calling set_memory_encrypted(), set_memory_decrypted() and set_memory_encrypted() again showed that the memory is already corrupted at the initial free.
For me this is fairly good proof that something is not flushing the caches (or invalidates them too early).

The story is below, I trimmed the hexdumps. (full are attached)


<initial simulated free call trace>
[    0.030520] CPU: 0 PID: 0 Comm: swapper/0 Not tainted 5.6.2-1-custom #11
[    0.033849] Hardware name: To Be Filled By O.E.M. To Be Filled By O.E.M./X470D4U, BIOS L3.33 12/26/2019
[    0.033849] Call Trace:
[    0.033849] dump_stack (/home/aur/linux-git/src/linux-5.6.2/lib/dump_stack.c:120) 
[    0.033849] kernel_poison_pages.cold (/home/aur/linux-git/src/linux-5.6.2/mm/page_poison.c:99 /home/aur/linux-git/src/linux-5.6.2/mm/page_poison.c:123 /home/aur/linux-git/src/linux-5.6.2/mm/page_poison.c:215) 
[    0.033849] free_pcp_prepare (/home/aur/linux-git/src/linux-5.6.2/./arch/x86/include/asm/jump_label.h:25 /home/aur/linux-git/src/linux-5.6.2/./include/linux/mm.h:2712 /home/aur/linux-git/src/linux-5.6.2/mm/page_alloc.c:1182 /home/aur/linux-git/src/linux-5.6.2/mm/page_alloc.c:1220) 
[    0.033849] free_unref_page (/home/aur/linux-git/src/linux-5.6.2/mm/page_alloc.c:3011 /home/aur/linux-git/src/linux-5.6.2/mm/page_alloc.c:3060) 
[    0.033849] __memblock_free_late (/home/aur/linux-git/src/linux-5.6.2/./arch/x86/include/asm/atomic64_64.h:88 (discriminator 2) /home/aur/linux-git/src/linux-5.6.2/./include/asm-generic/atomic-instrumented.h:1050 (discriminator 2) /home/aur/linux-git/src/linux-5.6.2/./include/asm-generic/atomic-long.h:160 (discriminator 2) /home/aur/linux-git/src/linux-5.6.2/./include/linux/mm.h:60 (discriminator 2) /home/aur/linux-git/src/linux-5.6.2/mm/memblock.c:1630 (discriminator 2)) 
[    0.033849] efi_free_boot_services (/home/aur/linux-git/src/linux-5.6.2/arch/x86/platform/efi/quirks.c:413 (discriminator 2)) 
[    0.033849] efi_enter_virtual_mode (/home/aur/linux-git/src/linux-5.6.2/arch/x86/platform/efi/efi.c:954 /home/aur/linux-git/src/linux-5.6.2/arch/x86/platform/efi/efi.c:984) 
[    0.033849] start_kernel (/home/aur/linux-git/src/linux-5.6.2/init/main.c:970) 
[    0.033849] secondary_startup_64 (/home/aur/linux-git/src/linux-5.6.2/arch/x86/kernel/head_64.S:242) 

<hexdump just after poisoning>
[    0.877184] current   ffff8880e9d26fb0: aa aa aa aa aa aa aa aa aa aa aa aa aa aa aa aa  ................
[    0.880517] current   ffff8880e9d26fc0: aa aa aa aa aa aa aa aa aa aa aa aa aa aa aa aa  ................
[    0.883850] current   ffff8880e9d26fd0: aa aa aa aa aa aa aa aa aa aa aa aa aa aa aa aa  ................
[    0.887184] current   ffff8880e9d26fe0: aa aa aa aa aa aa aa aa aa aa aa aa aa aa aa aa  ................
[    0.890517] current   ffff8880e9d26ff0: aa aa aa aa aa aa aa aa aa aa aa aa aa aa aa aa  ................
<well, it looks poisoned alright>

<right after that, call set_memory_encrypted and then hexdump>
[    1.730517] encrypted ffff8880e9d26fb0: aa aa aa aa aa aa aa aa aa aa aa aa aa aa aa aa  ................
[    1.733850] encrypted ffff8880e9d26fc0: aa aa aa aa aa aa aa aa aa aa aa aa aa aa aa aa  ................
[    1.737183] encrypted ffff8880e9d26fd0: aa aa aa aa aa aa aa aa aa aa aa aa aa aa aa aa  ................
[    1.740517] encrypted ffff8880e9d26fe0: aa aa aa aa aa aa aa aa aa aa aa aa aa aa aa aa  ................
[    1.743850] encrypted ffff8880e9d26ff0: aa aa aa aa aa aa aa aa aa aa aa aa aa aa aa aa  ................
<ok, nothing changed, so the page was already set as encrypted>

<right after that, call set_memory_decrypted and then hexdump>
[    2.557184] decrypted ffff8880e9d26f30: 0e c9 57 3e e2 90 d6 b0 f7 df 57 ce 60 d5 41 49  ..W>......W.`.AI
[    2.560517] decrypted ffff8880e9d26f40: 66 95 f2 e5 bd f6 cd 05 db 32 6c 40 0e 8b 57 19  f........2l@..W.
[    2.563850] decrypted ffff8880e9d26f50: 59 2c 15 72 be 3d 63 ad 24 5a 1c 19 ba be 62 39  Y,.r.=c.$Z....b9
[    2.567184] decrypted ffff8880e9d26f60: 03 1f 65 9d 6b 3a 87 32 4b 35 ff c0 ca 9f 08 26  ..e.k:.2K5.....&
[    2.570517] decrypted ffff8880e9d26f70: b7 d4 e7 73 41 25 1d 3a 63 c2 88 31 8f 4d 37 80  ...sA%.:c..1.M7.
[    2.573850] decrypted ffff8880e9d26f80: 00 00 00 00 00 00 00 00 66 1c cb e9 00 00 00 00  ........f.......
[    2.577184] decrypted ffff8880e9d26f90: 10 05 77 ed 00 00 00 00 00 00 00 00 00 00 00 00  ..w.............
[    2.580517] decrypted ffff8880e9d26fa0: d8 05 d7 e9 00 00 00 00 08 00 00 00 00 00 00 00  ................
[    2.583850] decrypted ffff8880e9d26fb0: 00 00 00 00 00 00 00 00 34 2e cb e9 00 00 00 00  ........4.......
[    2.587184] decrypted ffff8880e9d26fc0: 18 0c d7 e9 00 00 00 00 00 00 00 00 00 00 00 00  ................
[    2.590517] decrypted ffff8880e9d26fd0: d8 05 d7 e9 00 00 00 00 24 2e cb e9 00 00 00 00  ........$.......
[    2.593850] decrypted ffff8880e9d26fe0: 38 00 00 00 00 00 00 00 46 00 00 00 00 00 00 00  8.......F.......
[    2.597184] decrypted ffff8880e9d26ff0: 00 70 d2 e9 00 00 00 00 08 00 00 00 00 00 00 00  .p..............
<hmm, now this is interesting, instead of whole page being scrambled I see that at the end of the page there is some decrypted data. 128 bytes exactly>
<so the corruption was actually encrypted data>

<right after that, call set_memory_encrypted and then hexdump>
[    3.440517] encrypted ffff8880e9d26fb0: aa aa aa aa aa aa aa aa aa aa aa aa aa aa aa aa  ................
[    3.443850] encrypted ffff8880e9d26fc0: 37 8c 62 1f 79 b7 00 69 0d c1 c5 72 a8 36 cd 14  7.b.y..i...r.6..
[    3.447184] encrypted ffff8880e9d26fd0: 3c f2 95 ae 08 a4 77 84 61 d2 65 e3 6e 87 fb 54  <.....w.a.e.n..T
[    3.450517] encrypted ffff8880e9d26fe0: 10 9d de a8 2a c5 2a 42 81 05 ef 22 78 1e d9 78  ....*.*B..."x..x
[    3.453850] encrypted ffff8880e9d26ff0: a7 2c a6 2e 5a 51 58 f8 75 9e 9d dc 79 2e fe 0e  .,..ZQX.u...y...
<now we see the corrupted data that was apparently flushed; it stays with us until overwritten, but note that the corrupted range is now only 64 bytes wide>
<normally we would see this when trying to unpoison later>

<do the rest of booting, ssh into the machine and then call memtester>

<we are about to unpoison the page, hexdump current page:>
[   55.089760] current   ffff8880e9d26fb0: aa aa aa aa aa aa aa aa aa aa aa aa aa aa aa aa  ................
[   55.099408] current   ffff8880e9d26fc0: 37 8c 62 1f 79 b7 00 69 0d c1 c5 72 a8 36 cd 14  7.b.y..i...r.6..
[   55.109061] current   ffff8880e9d26fd0: 3c f2 95 ae 08 a4 77 84 61 d2 65 e3 6e 87 fb 54  <.....w.a.e.n..T
[   55.118727] current   ffff8880e9d26fe0: 10 9d de a8 2a c5 2a 42 81 05 ef 22 78 1e d9 78  ....*.*B..."x..x
[   55.128395] current   ffff8880e9d26ff0: a7 2c a6 2e 5a 51 58 f8 75 9e 9d dc 79 2e fe 0e  .,..ZQX.u...y...
<yup, the memory contains the corruption we were seeing until now>

<right after that, call set_memory_encrypted and then hexdump>
[   57.561664] encrypted ffff8880e9d26fb0: aa aa aa aa aa aa aa aa aa aa aa aa aa aa aa aa  ................
[   57.571310] encrypted ffff8880e9d26fc0: 37 8c 62 1f 79 b7 00 69 0d c1 c5 72 a8 36 cd 14  7.b.y..i...r.6..
[   57.580963] encrypted ffff8880e9d26fd0: 3c f2 95 ae 08 a4 77 84 61 d2 65 e3 6e 87 fb 54  <.....w.a.e.n..T
[   57.590627] encrypted ffff8880e9d26fe0: 10 9d de a8 2a c5 2a 42 81 05 ef 22 78 1e d9 78  ....*.*B..."x..x
[   57.600299] encrypted ffff8880e9d26ff0: a7 2c a6 2e 5a 51 58 f8 75 9e 9d dc 79 2e fe 0e  .,..ZQX.u...y...
<as expected the page was set as encrypted already so nothing changed>

<right after that, call set_memory_decrypted and then hexdump>
[   60.311616] decrypted ffff8880e9d26fb0: 31 93 e9 97 0c 06 42 5a e7 94 28 80 e4 66 d0 07  1.....BZ..(..f..
[   60.330727] decrypted ffff8880e9d26fc0: 18 0c d7 e9 00 00 00 00 00 00 00 00 00 00 00 00  ................
[   60.349011] decrypted ffff8880e9d26fd0: d8 05 d7 e9 00 00 00 00 24 2e cb e9 00 00 00 00  ........$.......
[   60.368214] decrypted ffff8880e9d26fe0: 38 00 00 00 00 00 00 00 46 00 00 00 00 00 00 00  8.......F.......
[   62.666298] decrypted ffff8880e9d26ff0: 00 70 d2 e9 00 00 00 00 08 00 00 00 00 00 00 00  .p..............
<hmm, we get back the 64 decrypted bytes>

<right after that, call set_memory_encrypted and then hexdump>
[   65.108977] encrypted ffff8880e9d26fb0: aa aa aa aa aa aa aa aa aa aa aa aa aa aa aa aa  ................
[   65.118625] encrypted ffff8880e9d26fc0: 37 8c 62 1f 79 b7 00 69 0d c1 c5 72 a8 36 cd 14  7.b.y..i...r.6..
[   65.128279] encrypted ffff8880e9d26fd0: 3c f2 95 ae 08 a4 77 84 61 d2 65 e3 6e 87 fb 54  <.....w.a.e.n..T
[   65.137943] encrypted ffff8880e9d26fe0: 10 9d de a8 2a c5 2a 42 81 05 ef 22 78 1e d9 78  ....*.*B..."x..x
[   65.147613] encrypted ffff8880e9d26ff0: a7 2c a6 2e 5a 51 58 f8 75 9e 9d dc 79 2e fe 0e  .,..ZQX.u...y...
<and we get back our 'corrupted' data>


<the memory is back on the free list, call memtester again and the whole cycle repeats but without corruption this time>
Comment 23 Filip Brygidyn 2020-04-06 19:42:41 UTC
(In reply to Filip Brygidyn from comment #22)
> For me this is a fairly good proof that something is not flushing the
> caches. (or invalidates them too early).
I just realized that I've convinced myself that I already know the answer. So please be critical of my opinion.
Comment 24 Vlastimil Babka 2020-04-06 23:00:52 UTC
Huh, that's rather insane. Especially the part where we initially see all poison OK, then after decrypt it shows something's wrong with the last 128 bytes of the page, after another encrypt it's suddenly just the last 64 bytes, and then it stays that way. And since you're doing that immediately during the boot, it should rule out any erroneous activity between the rest of the boot and memtester allocating the page.
Note that each of the set_memory_decrypted()/set_memory_encrypted() call implies a cache flush of the whole page, and I don't really think the code there is wrong.

But maybe there's a theory explaining the above, that assumes the cache state was already inconsistent at the moment of the initial poisoning, or even at the moment of "initial encryption" (when the direct mapping was constructed with the c-bit set, I don't immediately see where exactly it's done in the kernel code).

1) before the "initial encryption", something accessed the last 128 bytes of the page, and wrote to the last cacheline (64 bytes), using a physical address without C-bit. Perhaps just wrote to the last 64 bytes, and the previous line was cached as an adjacent line prefetch. The result is that some cache in the cpu now caches the very last line as dirty and the previous line as e.g. shared, using physical memory tag that doesn't have the c-bit.

2) the "initial encryption" constructed direct mapping with c-bit sets, but didn't flush the caches first while using a mappings to physical addresses without c-bits.

3) the initial poisoning uses the direct mapping with the c-bit, but since that effectively maps to different physical addresses, cache coherency doesn't recognize those last two cache lines (the physical tags don't match in the c-bit) and thus poisoning fetches a different copy of the 128 bytes from RAM and writes the poison there. The hexdump just prints from the cache it has written to.

4) you call set_memory_encrypted() which flushes the page from cache to RAM using mapping with c-bits, so the poison is propagated encrypted to memory. But it still doesn't affect the 128 bytes cached without c-bits. The following hexdump fetches that encrypted poison back to cache (but no longer dirty) and prints it.

5) Now it starts to be interesting. You call set_memory_decrypted(). That again flushes the cache lines containing encrypted poison with c-bit set, and then changes the mapping to remove c-bit. Then it flushes only TLB, not cache (see the second cpa_flush() in __set_memory_enc_dec()). Now the hexdump uses mapping without c-bits so it populates the cache with contents from memory that has been encrypted, but is now not decrypted, hence it's scrambled. Except those last two cache lines which are finally read from the cache and not memory, since we now used a mapping without c-bits! So the last 128 bytes look different.

6) You call set_memory_encrypted() again, which flushes the cache without c-bits. The last cache line was dirty, so it's written back to memory and overwrites the encrypted poison bytes. The next to last cache line was not dirty, so it's just invalidated, as the rest of the page. Then mapping is changed again to use the c-bit so the hexdump reads memory as encrypted and we now see only the last cache line as scrambled. But now the incoherent copy of cache is finally gone, and the page will now operate normally, as your further output shows.

If this theory is correct, then all of the observed corruption should be prevented if you hook a "set_memory_decrypted(); set_memory_encrypted();" sequence of operations on the page before the initial poisoning. That should flush the incoherent copy immediately. Then the question is if the initial direct mapping encryption lacks a flush, or e.g. some hardware or firmware misbehaves and populates those non-cbit cache lines.
Comment 25 Borislav Petkov 2020-04-07 14:49:01 UTC
Ok, two things to try:

* can you make the same observation if you boot with TSME on? TSME is a BIOS option which enables transparent memory encryption on the whole system and the SME code in the kernel is not used. But you'd have to check your BIOS to see whether there's even such an option exposed. Some BIOSes have it and you might be lucky.

Independent from the above,

* can you make the same observation if you boot the box with nr_cpus=1

Thx.
Comment 26 Filip Brygidyn 2020-04-07 17:35:40 UTC
Created attachment 288265 [details]
do decrypt+encrypt before poisoning

(In reply to Vlastimil Babka from comment #24)
> If this theory is correct, then all of the observed corruption should be
> prevented if you hook a "set_memory_decrypted(); set_memory_encrypted();"
> sequence of operations on the page before the initial poisoning
I verified it and indeed if I call set_memory_decrypted(), set_memory_encrypted() and then do the poisoning then the corruption does not appear in that page (see decrypt+encrypt_before_poison.txt attached)
Other pages that did not get this treatment still exhibit corruption (see decrypt+encrypt_before_poison.other_pages_corruption.txt - shows just unpoisoning corruption hexdumps)

(In reply to Borislav Petkov from comment #25)
> * can you make the same observation if you boot with TSME on?
Yes, and I already tried it. I tested it again (both disabled and enabled) just to be sure - no difference
In the BIOS I also have "Data Scramble" option next to TSME, also tried both enabled and disabled - no difference

(In reply to Borislav Petkov from comment #25)
> * can you make the same observation if you boot the box with nr_cpus=1
Same as above, I already tried it, also tested again to be sure - no difference, same corruption happens


I also found an interesting quote in this commit: https://git.kernel.org/pub/scm/linux/kernel/git/stable/linux.git/commit/?id=4e237903f95db585b976e7311de2bfdaaf0f6e31
> When SME is enabled, the
> change from using pages that were marked encrypted to now being marked as
> not encrypted (through new identify mapped page tables) results in memory
> corruption if there are any cache entries for the previously encrypted
> pages. This is because separate cache entries can exist for the same
> physical location but tagged both with and without the encryption bit.
This commit addressed an unrelated successive kexecs situation but the remark about cache entries seems valuable to me.



* Side-note: when I tested the TSME option a few weeks ago I thought TSME was synonymous with SME, and when disabling TSME didn't disable SME I concluded that this BIOS option was broken. Now, after reading up on it, I know that assumption was wrong.
Comment 27 Filip Brygidyn 2020-04-07 17:57:21 UTC
There is a thing about my testing hardware that I would like to hear your opinion about:
As I already mentioned I excluded any single component but all the systems have 2 common points:
1. Both tested boards are from Asrock. Granted - I tested various BIOS-es, including unreleased ones I got from Asrock support.
2. All 4 sticks of Ram I have for testing are ECC UDIMMs from Kingston. They are different models, enabling/disabling ECC doesn't make a difference, same with frequencies.

Do you think one of those things could be related?
Comment 28 Borislav Petkov 2020-04-07 18:06:15 UTC
I don't think so.

> I verified it and indeed if I call set_memory_decrypted(), 
> set_memory_encrypted() and then do the poisoning then the 
> corruption does not appear in that page.

Can you upload the exact diff you're using for this here and explain how exactly you're doing the poisoning?

Thx.
Comment 29 Filip Brygidyn 2020-04-07 18:39:25 UTC
(In reply to Borislav Petkov from comment #28)
> Can you upload the exact diff you're using for this here and explain how
> exactly you're doing the poisoning?

Sure, the poisoning is done by the vanilla page poisoning code (page_poison=on). I just added hexdumps and set_memory_decrypted()/set_memory_encrypted() calls before the actual memset in page_poison.c:

diff --git a/mm/page_poison.c b/mm/page_poison.c
index 34b9181ee5d1..feb3e0d9f7d0 100644
--- a/mm/page_poison.c
+++ b/mm/page_poison.c
@@ -7,6 +7,7 @@
 #include <linux/poison.h>
 #include <linux/ratelimit.h>
 #include <linux/kasan.h>
+#include <linux/set_memory.h>
 
 static bool want_page_poisoning __read_mostly;
 
@@ -37,14 +38,50 @@ bool page_poisoning_enabled(void)
 }
 EXPORT_SYMBOL_GPL(page_poisoning_enabled);
 
+static void* magic_corruption_ptr = (void*)0xffff8880e9d26fc0;
+
 static void poison_page(struct page *page)
 {
        void *addr = kmap_atomic(page);
 
+       if(magic_corruption_ptr >= addr && magic_corruption_ptr < (addr + PAGE_SIZE))
+       {
+               printk(KERN_INFO "poison_page called. We are just before page posioning with magic address\n");
+               print_hex_dump(KERN_ERR, "current   ", DUMP_PREFIX_ADDRESS, 16, 1, addr, PAGE_SIZE, 1);
+
+               set_memory_decrypted((unsigned long)addr, 1);
+               print_hex_dump(KERN_ERR, "decrypted ", DUMP_PREFIX_ADDRESS, 16, 1, addr, PAGE_SIZE, 1);
+
+               set_memory_encrypted((unsigned long)addr, 1);
+               print_hex_dump(KERN_ERR, "encrypted ", DUMP_PREFIX_ADDRESS, 16, 1, addr, PAGE_SIZE, 1);
+
+               printk(KERN_INFO "hexdump finished\n");
+       }
+
        /* KASAN still think the page is in-use, so skip it. */
        kasan_disable_current();
        memset(addr, PAGE_POISON, PAGE_SIZE);
        kasan_enable_current();
+
+       if(magic_corruption_ptr >= addr && magic_corruption_ptr < (addr + PAGE_SIZE))
+       {
+               printk(KERN_INFO "poison_page called. Just finished poisoning the page with magic address\n");
+               dump_stack();
+               printk(KERN_INFO "dump_stack finished, current page dump:\n");
+               print_hex_dump(KERN_ERR, "current   ", DUMP_PREFIX_ADDRESS, 16, 1, addr, PAGE_SIZE, 1);
+
+               set_memory_encrypted((unsigned long)addr, 1);
+               print_hex_dump(KERN_ERR, "encrypted ", DUMP_PREFIX_ADDRESS, 16, 1, addr, PAGE_SIZE, 1);
+
+               set_memory_decrypted((unsigned long)addr, 1);
+               print_hex_dump(KERN_ERR, "decrypted ", DUMP_PREFIX_ADDRESS, 16, 1, addr, PAGE_SIZE, 1);
+
+               set_memory_encrypted((unsigned long)addr, 1);
+               print_hex_dump(KERN_ERR, "encrypted ", DUMP_PREFIX_ADDRESS, 16, 1, addr, PAGE_SIZE, 1);
+
+               printk(KERN_INFO "hexdump finished\n");
+       }
+
        kunmap_atomic(addr);
 }




So the story is as follows:

1. The page is freed initially (__free_pages_core())
2. The builtin page poisoning code-path kicks in (page_poison=on)
3. We go into poison_page() in page_poison.c, which normally just calls kmap_atomic(), memset() and kunmap_atomic().
The changes are here: before memset I compare the address of the current page with a known problematic address 0xffff8880e9d26fc0.
4. If we are in the problematic page then I call set_memory_decrypted() and set_memory_encrypted() on it.
5. memset is called - which does the poisoning.
6. After memset I do the set_memory_encrypted/set_memory_decrypted/set_memory_encrypted sequence and check whether the corruption shows up in the hexdumps. Without set_memory_decrypted+set_memory_encrypted before memset the corruption appeared; when they are added it disappears.
Comment 30 Vlastimil Babka 2020-04-07 20:05:40 UTC
(In reply to Filip Brygidyn from comment #26)
> (In reply to Vlastimil Babka from comment #24)
> > If this theory is correct, then all of the observed corruption should be
> > prevented if you hook a "set_memory_decrypted(); set_memory_encrypted();"
> > sequence of operations on the page before the initial poisoning
> I verified it and indeed if I call set_memory_decrypted(),
> set_memory_encrypted() and then do the poisoning then the corruption does
> not appear in that page (see decrypt+encrypt_before_poison.txt attached)

Great!

> (In reply to Borislav Petkov from comment #25)
> > * can you make the same observation if you boot the box with nr_cpus=1
> Same as above, I already tried it, also tested again to be sure - no
> difference, same corruption happens
 
Interesting. If you boot with nr_cpus=1, does the "set_memory_decrypted(); set_memory_encrypted();" dance before poisoning still prevent the corruption?
 
> I also found an interesting quote in this commit:
> https://git.kernel.org/pub/scm/linux/kernel/git/stable/linux.git/commit/
> ?id=4e237903f95db585b976e7311de2bfdaaf0f6e31
> > When SME is enabled, the
> > change from using pages that were marked encrypted to now being marked as
> > not encrypted (through new identify mapped page tables) results in memory
> > corruption if there are any cache entries for the previously encrypted
> > pages. This is because separate cache entries can exist for the same
> > physical location but tagged both with and without the encryption bit.
> This commit addressed an unrelated successive kexecs situation but the
> remark about cache entries seems valuable to me.

Nice find, thanks, it seems it validates some important parts of my theory of what could be happening.
Comment 31 Filip Brygidyn 2020-04-08 08:25:10 UTC
Now that I have tested it some more, it seems like hex_dumping itself is affecting the results.

I need to compile and test a few more versions. Will do that after work, so expect new response in about ~9 hours.
Comment 32 Borislav Petkov 2020-04-08 17:00:56 UTC
Ok, here's something else I'd like you to try. It is silly and it very likely won't have any effect but let's cover that base before we continue staring further. Attached patch does the flushing only for the SME case using CLFLUSH directly and not paying attention to the P bit in the PTE. If it explodes due to a PF, then it'll be at least loud and tell us where we need to look.

Thx.
Comment 33 Borislav Petkov 2020-04-08 17:01:33 UTC
Created attachment 288287 [details]
clflush directly
Comment 34 Filip Brygidyn 2020-04-08 20:25:13 UTC
Created attachment 288289 [details]
logs after applying 'clflush directly'

(In reply to Borislav Petkov from comment #32)
> Ok, here's something else I'd like you to try. It is silly and it very
> likely won't have any effect but let's cover that base before we continue
> staring further. Attached patch does the flushing only for the SME case
> using CLFLUSH directly and not paying attention to the P bit in the PTE. If
> it explodes due to a PF, then it'll be at least loud and tell us where we
> need to look.
> 
> Thx.

I applied it on top of an almost clean tree - just the patch for printing the complete memory address from comment #15. None of the set_memory_decrypted/set_memory_encrypted calls, additional hex_dumps or breakpoints.
It didn't change much; the corruption happens the same way as before - when taking a page from the free-list. Nothing blew up. The dmesg and config used are attached.



(In reply to Filip Brygidyn from comment #31)
> I need to compile and test a few more versions. Will do that after work, so
> expect new response in about ~9 hours.
It will take me more time after all. Suddenly the magic address check does not work anymore - the corruption moved elsewhere. Relying on this arbitrary number was flaky at best, but it did work every time for a while.
I need to pick a different address or find another approach and retest.



Besides those things: After playing with set_memory_decrypted and set_memory_encrypted I am now confused as to how memory writes should be handled around them.

When I was playing with page_poison.c, around the memset that poisons the page with 0xaa the flow was as follows:

1. memset the page to 0xaa
2. hex_dump - showed 0xaa as expected
3. call set_memory_encrypted on the page
4. hexdump - still showed 0xaa

At this point this seems wrong to me. If calling set_memory_encrypted() didn't do anything then I guess the PAGE_ENC bit was already set. So that means we just did a memset() on a page that was marked as encrypted. Should writing to a page marked as encrypted even happen?

Maybe what should happen when doing page_poison is something like this (see the sketch after the list):

* When freeing
1. call set_memory_decrypted()
2. call memset()
3. call set_memory_encrypted()
* and when taking from free-list:
1. call set_memory_decrypted()
2. verify that poison is intact
3. call set_memory_encrypted()
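
To make the free-path half of that idea concrete, it would be roughly this (illustrative sketch only, assuming the page is in the kernel direct map so page_address() is valid - not proposing this as the fix):

/* sketch only - needs <linux/set_memory.h>, <linux/poison.h>, <linux/mm.h> */
static void poison_page_while_decrypted(struct page *page)
{
	unsigned long addr = (unsigned long)page_address(page);

	set_memory_decrypted(addr, 1);			/* clear the C-bit, flushes caches/TLB */
	memset((void *)addr, PAGE_POISON, PAGE_SIZE);	/* poison through the decrypted mapping */
	set_memory_encrypted(addr, 1);			/* restore the encrypted mapping */
}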



Anyway, this is all starting to become a jumbled mess of things to check and random code changes. To clarify, my plan for now is:
a) check the set_memory_decrypted/set_memory_encrypted behavior before memset with nr_cpus=1 that @Vlastimil Babka requested
b) Read more about SME implementation and try the memset while page is set as decrypted
c) Try to hook a hw_breakpoint earlier (see the sketch below) and maybe find out if anyone used the page before the initial free.
d) hook __set_memory_enc_dec() and check how it is used on a page that shows corruption
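
For reference, the hw_breakpoint I have in mind is set with the in-kernel API, roughly along the lines of samples/hw_breakpoint/data_breakpoint.c (illustrative sketch, not the exact patch from the earlier 'hw_breakpoint added' attachment):

#include <linux/kernel.h>
#include <linux/perf_event.h>
#include <linux/hw_breakpoint.h>
#include <linux/err.h>

static struct perf_event * __percpu *wp;

/* fires on any write to the watched bytes and shows who did it */
static void wp_handler(struct perf_event *bp, struct perf_sample_data *data,
		       struct pt_regs *regs)
{
	pr_info("write hit on watched address\n");
	dump_stack();
}

static int __init watch_corrupted_page(void)
{
	struct perf_event_attr attr;

	hw_breakpoint_init(&attr);
	attr.bp_addr = 0xffff8880e9d26fc0UL;	/* the problematic address from above */
	attr.bp_len  = HW_BREAKPOINT_LEN_8;
	attr.bp_type = HW_BREAKPOINT_W;

	wp = register_wide_hw_breakpoint(&attr, wp_handler, NULL);
	return IS_ERR((void __force *)wp) ? PTR_ERR((void __force *)wp) : 0;
}
/* to catch writes before the initial free this would have to run even
 * earlier, but as a starting point: */
device_initcall(watch_corrupted_page);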
Comment 35 Vlastimil Babka 2020-04-08 23:16:26 UTC
(In reply to Filip Brygidyn from comment #34)
> 
> (In reply to Filip Brygidyn from comment #31)
> > I need to compile and test a few more versions. Will do that after work, so
> > expect new response in about ~9 hours.
> It will take me more time after all, Suddenly the magical address check does
> not work anymore - the corruption moved elsewhere. Relying on this arbitrary
> number was flaky at best but well, it worked every time for some period.
> I need to pick a different address or find another approach and retest.
 
Hmm that's interesting, a relatively deterministic address for a while, and then it changes? After the change, is it again stable on the new address? Weird. 
 
> Besides those things: After playing with set_memory_decrypted and
> set_memory_encrypted I am now confused as to how memory writes should be
> handled around them.
> 
> When I was playing with page_poison.c, around the memset that poisons the
> page with 0xaa the flow was as follows:
> 
> 1. memset the page to 0xaa
> 2. hex_dump - showed 0xaa as expected
> 3. call set_memory_encrypted on the page
> 4. hexdump - still showed 0xaa
> 
> So this point this seems wrong to me. If calling set_memory_encrypted()
> didn't do anything then I guess that the PAGE_ENC bit was already set.

Yes.

> So
> that means we just did a memset() on a page that was set as encrypted.
> Should writing to a page set as encrypted even happen?

Yeah, the encryption is to protect against attackers extracting secrets from the RAM. The CPU is supposed to work with the encrypted contents normally, decrypting it transparently when fetching from RAM to its caches, encrypting again when flushing caches to memory. In some advanced scenarios the encryption is also to protect virt guests from each other or even guest from host, IIRC.
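
Concretely, the "encrypted or not" property is a per-page-table-entry attribute - the SME C-bit in the physical-address bits of the PTE. A simplified illustration using the helpers from the mem_encrypt headers (details vary by kernel version):

#include <linux/mem_encrypt.h>	/* sme_me_mask, __sme_set(), __sme_clr() */

/* Setting the C-bit makes the memory controller encrypt/decrypt the page
 * transparently; the CPU and its caches always work on plaintext. */
static inline u64 pte_bits_encrypted(u64 pte_bits)
{
	return __sme_set(pte_bits);	/* pte_bits | sme_me_mask */
}

static inline u64 pte_bits_decrypted(u64 pte_bits)
{
	return __sme_clr(pte_bits);	/* pte_bits & ~sme_me_mask */
}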

> Maybe what should happen when doing page_poison is something like this:
> 
> * When freeing
> 1. call set_memory_decrypted()
> 2. call memset()
> 3. call set_memory_encrypted()
> * and when taking from free-list:
> 1. call set_memory_decrypted()
> 2. verify that poison is intact
> 3. call set_memory_encrypted()
 
No, the memory can simply stay encrypted all the time. The poisoning is orthogonal to encryption; its purpose is to detect writes to pages that are supposed to be unused. The writes will be detected regardless of encryption state.
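
For reference, the check that reports this (the "pagealloc: memory corruption" message) simply scans the page for non-poison bytes through whatever mapping the kernel normally uses; roughly, paraphrased from mm/page_poison.c (simplified - ratelimiting and single-bit-flip detection omitted, details vary by version):

static void check_poison_mem(unsigned char *mem, size_t bytes)
{
	unsigned char *start, *end;

	/* find the first byte that is not PAGE_POISON (0xaa) */
	start = memchr_inv(mem, PAGE_POISON, bytes);
	if (!start)
		return;		/* poison intact: nobody wrote to the free page */

	/* find the last corrupted byte and report the damaged range */
	for (end = mem + bytes - 1; end > start; end--)
		if (*end != PAGE_POISON)
			break;

	pr_err("pagealloc: memory corruption\n");
	print_hex_dump(KERN_ERR, "", DUMP_PREFIX_ADDRESS, 16, 1, start,
		       end - start + 1, 1);
	dump_page(virt_to_page(mem), "pagealloc: corrupted page details");
}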
 
> Anyway this all is starting to be a jumbled mess of things to check and
> random code changes. To clarify: My plan for now is:
> a) check the set_memory_decrypted/set_memory_encrypted behavior before
> memset with nr_cpus=1 that @Vlastimil Babka requested

Thanks.

> b) Read more about SME implementation and try the memset while page is set
> as decrypted

I expect that would indeed prevent the issue, as a memset in the decrypted state will flush the stray cachelines, similarly to how the decrypt/encrypt prior to the memset did.

> c) Try to hook a hw_breakpoint earlier and maybe find out if anyone used the
> page before initial free.

Could work, but I suspect whoever did it used a different virtual address than the kernel's direct mapping range, and the breakpoint uses a virtual address?

> d) hook __set_memory_enc_dec() and check how it is used on a page that shows
> corruption

Yeah might be good to know whether the function is used during the "initial encryption".
Comment 36 Borislav Petkov 2020-04-09 16:29:02 UTC
So I tried reproducing your observation from comment #1 on my Zen box here. SME is on, CONFIG_INIT_ON_FREE_DEFAULT_ON=y and your patch applied. I've added trace_printk() after the test runs:

diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 9f68ae50dfed..ba9e8facc618 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -2161,6 +2161,8 @@ static void prep_new_page(struct page *page, unsigned int order, gfp_t gfp_flags
 
                for (i = 0; i < (1 << order); i++)
                        verify_zero_highpage(page + i);
+
+               trace_printk("Verified %d pages\n", 1 << order);
        }

to make sure the verification happens. Then I ran your memtester reproducer a bunch of times. It didn't trigger. Then I rebooted and ran it again and it ran for ~900 iterations and still nothing. So either your box is funky or I'm still not doing something right.
Comment 37 Filip Brygidyn 2020-04-09 17:44:36 UTC
Created attachment 288309 [details]
5.6.2 config

> a) check the set_memory_decrypted/set_memory_encrypted behavior before
> memset with nr_cpus=1 that @Vlastimil Babka requested
I checked again and can confirm what I wrote earlier: calling set_memory_decrypted()+set_memory_encrypted() before memset prevents the corruption. The hexdumps did not affect it - the change in the observed address threw me off.

So with nr_cpus=1 or not:
- nothing added before memset: corruption
- set_memory_decrypted()+set_memory_encrypted() before memset: no corruption
- set_memory_decrypted()+set_memory_encrypted() + hexdumps: no corruption
- hexdumps before memset: corruption



(In reply to Borislav Petkov from comment #36)
> So I tried reproducing your observation from comment #1 on my Zen box here.
Thanks. Could you share what kernel version, config and boot parameters you used? I would like to try them.

(In reply to Borislav Petkov from comment #36)
> SME is on, CONFIG_INIT_ON_FREE_DEFAULT_ON=y and your patch applied. I've
> added trace_printk() after the test runs:
The instructions in comment #1 are not really the best ones at this point. Now I know that running memtester just once is enough - just make sure to allow it to take as much memory as you can, to increase the chance of reusing the corrupted page.
So the reproduce procedure for you would be to boot with mem_encrypt=on and run memtester with as much memory as you can give it. It should happen on the first, maybe the second reboot.
At least that's for your already compiled version.

If however you could compile another kernel then I would suggest the following:
- vanilla kernel, I am now on 5.6.2. (no patches of any kind needed)
- attached config
- boot parameters added: "mem_encrypt=on debug_pagealloc=on page_owner=on page_poison=on loglevel=8"
- run memtester with as much memory as you can and you should see the page poisoning checks throwing "pagealloc: memory corruption" messages in the logs

Unfortunately I cannot update comment #1 :/


> Then I ran your memtester reproducer
> a bunch of times. It didn't trigger. Then I rebooted and ran it again and it
> ran for ~900 iterations and still nothing. So either your box is funky or
> I'm still not doing something right.
This is a letdown - I was hoping it would be easier to reproduce. Please tell me what your hardware config is.
I am fairly sure that my hardware isn't the issue because I reproduce this on 2 separate machines with fresh systems:
1. Asrock X470D4U + Ryzen 2600 + 16GB stick of ECC Kingston RAM + Samsung 860 Evo SSD (SATA) + Supermicro 1200W PSU
2. Asrock B450M Steel Legend + Ryzen 2200G + 8GB stick of ECC Kingston RAM + Intel 760p SSD (NVMe) + Seasonic 360W PSU
And now I am running a mix of those two.

The only common things between all tested configurations I can think of are:
- an Asrock brand motherboard
- ECC Kingston RAM
- a 2nd gen Ryzen CPU
Comment 38 Filip Brygidyn 2020-04-09 17:58:37 UTC
(In reply to Borislav Petkov from comment #36)
> So I tried reproducing your observation from comment #1 on my Zen box here.

Also: are you booting in UEFI mode? I did not check this yet but it could make a difference.
Comment 39 Vlastimil Babka 2020-04-09 21:38:31 UTC
(In reply to Filip Brygidyn from comment #37)
> So with nr_cpus=1 or not:
> - nothing added before memset: corruption
> - set_memory_decrypted()+set_memory_encrypted() before memset: no corruption
> - set_memory_decrypted()+set_memory_encrypted() + hexdumps: no corruption
> - hexdumps before memset: corruption

Hmm, so the reason for testing nr_cpus=1 was that I suspect the incoherent cache lines are in a cache on a CPU other than cpu0, regardless of whether the kernel initializes that CPU. The idea is that with nr_cpus=1, cpa_flush() (done in set_memory_[de/en]crypted()) wouldn't broadcast the flush to the CPUs that the kernel doesn't use, and thus it wouldn't prevent the corruption anymore, which would prove that it's a problem of a different CPU.

But I didn't realize that cpa_flush() uses the clflush instruction, which AFAICS does the broadcast in hardware regardless of the kernel being booted with nr_cpus=1.
So it might be interesting to repeat the experiment adding the 'noclflush' kernel parameter; then the kernel should resort to an IPI broadcast and, with nr_cpus=1, really exclude the other CPUs.
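
For illustration, the flush decision in cpa_flush() (arch/x86/mm/pat/set_memory.c, formerly pageattr.c) boils down to roughly this - a heavily simplified sketch from memory, not the exact code:

/* needs <asm/cpufeature.h>, <asm/smp.h>, <asm/tlbflush.h>, <asm/cacheflush.h>.
 * With CLFLUSH available the affected lines are flushed with clflush, which
 * is coherent across all CPUs/caches in hardware. With 'noclflush' the
 * fallback is WBINVD, which only flushes the caches of the CPUs the kernel
 * actually runs it on. */
static void cpa_flush_sketch(unsigned long addr, int numpages, int cache)
{
	int i;

	if (cache && !static_cpu_has(X86_FEATURE_CLFLUSH)) {
		wbinvd_on_all_cpus();	/* 'noclflush' path */
		flush_tlb_all();
		return;
	}

	flush_tlb_all();
	if (!cache)
		return;

	for (i = 0; i < numpages; i++)
		clflush_cache_range((void *)(addr + i * PAGE_SIZE), PAGE_SIZE);
}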
Comment 40 Filip Brygidyn 2020-04-10 14:53:43 UTC
Created attachment 288329 [details]
5.6.2_debug_patches+config

(In reply to Vlastimil Babka from comment #39)
> So it might be interesting to repeat the experiment adding 'noclflush'
> kernel parameter

I tested the combinations below.

nr_cpus=1 | noclflush | set dec/enc before memset > corruption?
----------+-----------+---------------------------+-----------
no        | no        | no                        > yes
no        | no        | yes                       > no
no        | yes       | no                        > yes
no        | yes       | yes                       > yes
yes       | no        | no                        > yes
yes       | no        | yes                       > no
yes       | yes       | no                        > yes
yes       | yes       | yes                       > yes


All tests were done on the same kernel binary; I just changed the boot parameters.
config and all patches applied on my tree are attached.
* When adding set_memory_[de/en]crypted() I added only those calls - no extra hexdumps and no extra set_memory_[de/en]crypted() after memset. Just finish booting and force the memory out of the free-list with memtester so the unpoisoning check reports the corruption.
* other boot parameter used for all the combinations: 'mem_encrypt=on debug_pagealloc=on page_owner=on page_poison=on loglevel=8'

The result is as you can see: the trick of adding set_memory_[de/en]crypted() before memset stops working when noclflush is added to the boot parameters. So it does have an effect, but I think this is not what you were expecting.



In other news:
(In reply to Filip Brygidyn from comment #38)
> Also: are you booting in UEFI mode? I did not check this yet but it could
> make a difference.

I took my current UEFI testing setup and converted it to BIOS mode: just changed the boot partition type from 'EFI System' to 'BIOS boot', reran grub-install and changed the boot settings in the motherboard firmware. So I used the same system installation and the same kernel binary.
The corruption disappeared. And when I converted back to UEFI it started happening again.

So I guess UEFI boot may be a factor.
Comment 41 Filip Brygidyn 2020-04-11 16:57:27 UTC
Created attachment 288355 [details]
KASAN logs

I tried KASAN again and this time it showed "BUG: KASAN: slab-out-of-bounds".

Symbolized and raw dmesg logs are attached (from 4 separate reboots).
You will also find the config used and the patches applied on top of the 5.6.2 tree (just printing changes)

There are however a few unexpected things with those logs:
- KASAN report appears much further into the boot process than the initial free of the page I see as corrupted.
- KASAN report happens with mem_encrypt=off
- KASAN report happens just before "EDAC amd64: Node 0: DRAM ECC enabled." appears in the logs (maybe a coincidence)

Just to be sure I will order a non-ECC stick to test.


*As I mentioned in the problem description: I already tried KASAN and it didn't show anything then. Back then I tested on 5.5.10 with linux-hardened patches and a different config. Just to be sure I compiled that version again and it again showed no KASAN logs.
(https://bugs.archlinux.org/task/65869#comment187641)
Comment 42 Filip Brygidyn 2020-04-11 21:39:03 UTC
FYI: the KASAN reports could be a red herring - disabling an integrated ASPEED AST2500 GPU and booting in blind mode makes them disappear again.
Comment 43 Vlastimil Babka 2020-04-11 21:52:28 UTC
(In reply to Filip Brygidyn from comment #40)
> nr_cpus=1 | noclflush | set dec/enc before memset > corruption?
> ----------+-----------+---------------------------+-----------
> no        | no        | no                        > yes
> no        | no        | yes                       > no
> no        | yes       | no                        > yes
> no        | yes       | yes                       > yes
> yes       | no        | no                        > yes
> yes       | no        | yes                       > no
> yes       | yes       | no                        > yes
> yes       | yes       | yes                       > yes

Hmm, that's interesting. So we are not able to flush out the "bad" cache lines using wbinvd (in the absence of clflush), even if wbinvd is done on all CPUs (nr_cpus=1 is not given). Or perhaps at the moment of page allocator init, the other CPUs are not yet running?

Hmm, seems so, in one of the earlier dumps I see:

[    0.029004] poison_page called. Just finished poisoning the page with magic address
...
[    0.317188] smp: Bringing up secondary CPUs ...

So it seems to support my theory that the unflushed cache lines are in a CPU other than CPU 0, and thus totally out of the kernel's control at the moment of page allocator init. clflush works in that case, as the other CPUs respond to the coherency protocol; wbinvd would have to be executed on each CPU, which is not possible here.

> I took my current UEFI testing setup and converted it to BIOS mode. Just
> changed the boot partition type from 'EFI System' to 'BIOS boot', rerun
> grub-install and changed boot settings in the mobo-BIOS. So I used the same
> system installation and same kernel binary.
> The corruption disappeared. And when I converted back to UEFI it started
> happening again.
> 
> So I guess UEFI boot may be a factor.

So my theory is that UEFI was executing on some other CPU before handing over to the kernel, and didn't flush itself properly. I would expect this to be against some spec and the firmware should be fixed, but I don't know the details. Boris can probably ask the right people. The workaround would be for the kernel to clflush everything before first use (which is what the decrypt/encrypt combo does) and not rely only on wbinvd during the early code, as that doesn't flush the other CPUs :/
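
A minimal form of that workaround could look like this (purely illustrative sketch, assuming the affected pages are reachable through the direct mapping):

/* flush any stale cache lines possibly left behind by firmware before the
 * page is poisoned/used for the first time; clflush is coherent across
 * CPUs, unlike a WBINVD executed only on the boot CPU */
static void flush_page_before_first_use(struct page *page)
{
	clflush_cache_range(page_address(page), PAGE_SIZE);
}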

The KASAN stuff is most likely indeed something unrelated.
Comment 44 Filip Brygidyn 2020-04-12 17:14:05 UTC
Created attachment 288393 [details]
logs with __set_memory_enc_dec() hook

(In reply to Filip Brygidyn from comment #34)
> d) hook __set_memory_enc_dec() and check how it is used on a page that shows
> corruption

During the whole boot process __set_memory_enc_dec() is called a total of 3 times on my system.
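
(The hook itself is just a print + dump_stack() at the top of __set_memory_enc_dec() in arch/x86/mm/pat/set_memory.c; a minimal sketch below, the exact patch is in the attachment.)

static int __set_memory_enc_dec(unsigned long addr, int numpages, bool enc)
{
	/* debug hook: log every encryption-attribute change with a backtrace */
	pr_info("__set_memory_enc_dec(addr=0x%lx, numpages=%d, enc=%d)\n",
		addr, numpages, enc);
	dump_stack();

	/* ... original body unchanged ... */
}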

*I am observing a page at address 0xffff8880e9d27000 which shows corruption.
*According to https://www.kernel.org/doc/Documentation/x86/x86_64/mm.txt this address is in the direct memory mapping region

dmesg logs show this order of events:
1. with efi=debug parameter I see the following in the memory map logs:
[    0.000000] efi: mem124: [Boot Data          |   |  |  |  |  |  |  |  |   |WB|WT|WC|UC] range=[0x00000000e9d22000-0x00000000e9d3bfff] (0MB)

2. __set_memory_enc_dec(addr=0xffff8880e2f82000, numpages=16384, enc=false) is called:
[    0.000000] Call Trace:
[    0.000000]  dump_stack+0x66/0x90
[    0.000000]  __set_memory_enc_dec+0x74/0x158
[    0.000000]  swiotlb_update_mem_attributes+0x48/0x5d
[    0.000000]  mem_encrypt_init+0xf/0x5e
[    0.000000]  start_kernel+0x6e4/0x81d
[    0.000000]  secondary_startup_64+0xb6/0xc0
The range does not include the address under observation.
And this is again in the direct memory mapping region, so I think I can safely assume this is not touching the page I am interested in.

3. with efi=debug parameter I see the following again (in the EFI runtime memory map):
[    0.480629] efi: mem138: [Boot Data          |   |  |  |  |  |  |  |  |   |WB|WT|WC|UC] range=[0x00000000e9d22000-0x00000000e9d3bfff] (0MB)

4. page_poison() on the observed 0xffff8880e9d27000 address is called.

5. __set_memory_enc_dec(addr=0xffff88800008d000, numpages=6, enc=false) is called. Again the direct memory mapping region - this does not seem to touch the page.

6. I see "smp: Bringing up secondary CPUs ..." in the logs

7. __set_memory_enc_dec(addr=0xffffffff92402000, numpages=510, enc=true) is called. According to the documentation linked above this should be
"kernel text mapping, mapped to physical address 0". So again it doesn't seem like it could touch the observed page.


So it seems that __set_memory_enc_dec() does not touch a corrupted page at all.
* Full logs attached
