Bug 206963 - non-zeroed pages when using CONFIG_INIT_ON_FREE_DEFAULT_ON and AMD SME
Summary: non-zeroed pages when using CONFIG_INIT_ON_FREE_DEFAULT_ON and AMD SME
Status: NEW
Alias: None
Product: Platform Specific/Hardware
Classification: Unclassified
Component: x86-64
Hardware: x86-64 Linux
Importance: P1 normal
Assignee: platform_x86_64@kernel-bugs.osdl.org
URL:
Keywords:
Depends on:
Blocks:
 
Reported: 2020-03-25 19:25 UTC by Filip Brygidyn
Modified: 2020-04-07 14:49 UTC
CC: 2 users

See Also:
Kernel Version: 5.4.26
Tree: Mainline
Regression: No


Attachments
5.4.26.config, logs, patch and arch build script (110.45 KB, application/gzip)
2020-03-25 19:25 UTC, Filip Brygidyn
Details
symbolized logs with CONFIG_DEBUG_PAGEALLOC and CONFIG_PAGE_OWNER (98.81 KB, text/plain)
2020-03-30 18:41 UTC, Filip Brygidyn
Details
vanilla kernel logs with page poisoning (PAGE_POISONING_ZERO enabled and disabled) (209.89 KB, application/gzip)
2020-04-02 08:33 UTC, Filip Brygidyn
Details
5.6 logs with print_hex_dump patch and CONFIG_RANDOMIZE_MEMORY disabled (423.82 KB, application/gzip)
2020-04-02 17:31 UTC, Filip Brygidyn
Details
summary of poisoning errors (21.70 KB, text/plain)
2020-04-03 10:13 UTC, Vlastimil Babka
Details
logs and patch with hw_breakpoint added (62.43 KB, application/gzip)
2020-04-03 22:28 UTC, Filip Brygidyn
Details
logs of encrypting/decrypting the corrupted page (330.52 KB, text/plain)
2020-04-06 19:26 UTC, Filip Brygidyn
Details

Description Filip Brygidyn 2020-03-25 19:25:45 UTC
Created attachment 288057 [details]
5.4.26.config, logs, patch and arch build script

I first reported this as a linux-hardened issue on the Arch bug tracker: https://bugs.archlinux.org/task/65869
But in the end I could reproduce it on a clean kernel tree with just a minimal check lifted from the hardened tree.

I tested and reproduced it as far back as 4.14.

Problem description:
When running with AMD SME and INIT_ON_FREE enabled, the supposedly zeroed free pages turn out to be non-zeroed when they are taken from the freelist.

The patch that checks for the non-zeroed pages is attached (apply on top of 5.4.26).
It simply verifies that pages are zeroed in the prep_new_page() function (mm/page_alloc.c).
(lifted from https://github.com/anthraxx/linux-hardened/blob/5.4/mm/page_alloc.c#L2190 and https://github.com/anthraxx/linux-hardened/blob/5.4/include/linux/highmem.h#L218)


With the patch applied, build with the attached .config, boot the resulting kernel, and you should see logs about non-zeroed pages at random times while allocating memory.
The important options are:
- CONFIG_INIT_ON_FREE_DEFAULT_ON=y
- CONFIG_AMD_MEM_ENCRYPT=y
- CONFIG_AMD_MEM_ENCRYPT_ACTIVE_BY_DEFAULT=y

The procedure that works for me is to run memtester (http://pyropus.ca/software/memtester):
while true
do
    memtester 12G & # with 16G of RAM
    sleep 8
    killall memtester
done

The fault should happen after at most a few iterations; if it doesn't, just reboot and try again a few times.

Example symbolized call trace:
[  255.154391] ------------[ cut here ]------------
[  255.159010] WARNING: CPU: 11 PID: 550 at include/linux/highmem.h:227 prep_new_page (/home/aur/vanilla+checks/packages/trunk/src/linux-5.4.26/./include/linux/highmem.h:227 /home/aur/vanilla+checks/packages/trunk/src/linux-5.4.26/mm/page_alloc.c:2162) 
[  255.167612] Modules linked in: amd64_edac_mod edac_mce_amd snd_hda_intel ipmi_ssif snd_intel_nhlt kvm snd_hda_codec irqbypass snd_hda_core snd_hwdep crct10dif_pclmul nls_iso8859_1 snd_pcm nls_cp437 crc32_pclmul ghash_clmulni_intel snd_timer vfat snd igb ccp wmi_bmof ipmi_si fat aesni_intel crypto_simd sp5100_tco soundcore cryptd glue_helper i2c_algo_bit rng_core pcspkr ipmi_devintf i2c_piix4 k10temp dca ipmi_msghandler pinctrl_amd gpio_amdpt evdev wmi mac_hid drm agpgart ip_tables x_tables btrfs libcrc32c crc32c_generic xor raid6_pq sd_mod ahci libahci libata scsi_mod xhci_pci xhci_hcd crc32c_intel
[  255.220530] CPU: 11 PID: 550 Comm: memtester Not tainted 5.4.26-1-lts #1
[  255.227223] Hardware name: To Be Filled By O.E.M. To Be Filled By O.E.M./X470D4U, BIOS L3.31 12/24/2019
[  255.236609] RIP: 0010:prep_new_page (/home/aur/vanilla+checks/packages/trunk/src/linux-5.4.26/./include/linux/highmem.h:227 /home/aur/vanilla+checks/packages/trunk/src/linux-5.4.26/mm/page_alloc.c:2162) 
[ 255.241141] Code: 41 b8 08 00 00 00 b9 40 00 00 00 6a 00 ba 02 00 00 00 48 c7 c6 ab 42 72 af 48 c7 c7 44 d1 6e af 68 00 10 00 00 e8 38 39 20 00 <0f> 0b 58 5a 41 83 ad 88 19 00 00 01 48 83 c3 40 4c 39 fb 0f 85 72
<SNIP>
[  255.321690] Call Trace:
[  255.324147] get_page_from_freelist (/home/aur/vanilla+checks/packages/trunk/src/linux-5.4.26/mm/page_alloc.c:3696) 
[  255.328764] __alloc_pages_nodemask (/home/aur/vanilla+checks/packages/trunk/src/linux-5.4.26/mm/page_alloc.c:4751) 
[  255.333297] alloc_pages_vma (/home/aur/vanilla+checks/packages/trunk/src/linux-5.4.26/./include/linux/mempolicy.h:78 /home/aur/vanilla+checks/packages/trunk/src/linux-5.4.26/./include/linux/mempolicy.h:83 /home/aur/vanilla+checks/packages/trunk/src/linux-5.4.26/mm/mempolicy.c:2156) 
[  255.337134] __handle_mm_fault (/home/aur/vanilla+checks/packages/trunk/src/linux-5.4.26/./include/linux/highmem.h:208 /home/aur/vanilla+checks/packages/trunk/src/linux-5.4.26/mm/memory.c:3018 /home/aur/vanilla+checks/packages/trunk/src/linux-5.4.26/mm/memory.c:3863 /home/aur/vanilla+checks/packages/trunk/src/linux-5.4.26/mm/memory.c:3989) 
[  255.341312] handle_mm_fault (/home/aur/vanilla+checks/packages/trunk/src/linux-5.4.26/mm/memory.c:4026) 
[  255.345152] __get_user_pages (/home/aur/vanilla+checks/packages/trunk/src/linux-5.4.26/mm/gup.c:641 /home/aur/vanilla+checks/packages/trunk/src/linux-5.4.26/mm/gup.c:845) 
[  255.349165] populate_vma_page_range (/home/aur/vanilla+checks/packages/trunk/src/linux-5.4.26/mm/gup.c:1223) 
[  255.353611] __mm_populate (/home/aur/vanilla+checks/packages/trunk/src/linux-5.4.26/mm/gup.c:1279) 
[  255.357275] do_mlock (/home/aur/vanilla+checks/packages/trunk/src/linux-5.4.26/mm/mlock.c:712) 
[  255.360500] __x64_sys_mlock (/home/aur/vanilla+checks/packages/trunk/src/linux-5.4.26/mm/mlock.c:720 /home/aur/vanilla+checks/packages/trunk/src/linux-5.4.26/mm/mlock.c:718 /home/aur/vanilla+checks/packages/trunk/src/linux-5.4.26/mm/mlock.c:718) 
[  255.364244] do_syscall_64 (/home/aur/vanilla+checks/packages/trunk/src/linux-5.4.26/arch/x86/entry/common.c:290) 
[  255.367911] entry_SYSCALL_64_after_hwframe (/home/aur/vanilla+checks/packages/trunk/src/linux-5.4.26/arch/x86/entry/entry_64.S:184) 
[  255.372965] RIP: 0033:0x7f5745b6e35b
[ 255.376544] Code: f0 ff ff 73 01 c3 48 8b 0d 32 4b 0c 00 f7 d8 64 89 01 48 83 c8 ff c3 0f 1f 84 00 00 00 00 00 f3 0f 1e fa b8 95 00 00 00 0f 05 <48> 3d 01 f0 ff ff 73 01 c3 48 8b 0d 05 4b 0c 00 f7 d8 64 89 01 48
<SNIP>
[  255.438492] ---[ end trace e40ed40276155114 ]---

Corresponding kernel tree (besides the patch): https://git.kernel.org/pub/scm/linux/kernel/git/stable/linux.git/tree/?h=v5.4.26
Full logs are attached (raw and symbolized).

The logs also contain dumps of the affected non-zeroed pages. For example:
[  255.443364] page:ffffd812c3a75c00 refcount:1 mapcount:0 mapping:0000000000000000 index:0x1
[  255.451626] flags: 0x1ffff0000000000()
[  255.455376] raw: 01ffff0000000000 dead000000000100 dead000000000122 0000000000000000
[  255.463119] raw: 0000000000000001 0000000000000000 00000001ffffffff 0000000000000000
[  255.470858] page dumped because: verify_zero_highpage failed!
<SNIP ZEROS>
[  255.815899] NON-ZERO PAGE: 00000340: 593b748c68881e2a 057455310362e36e
[  255.822424] NON-ZERO PAGE: 00000350: 287ea35d5d557742 fc02e8916a27042f
[  255.828950] NON-ZERO PAGE: 00000360: a1cb0d6d5d34c6cd 714ab1a43cf924b3
[  255.835479] NON-ZERO PAGE: 00000370: a3004368e712b8be 30e8fd221c5de507
[  255.842004] NON-ZERO PAGE: 00000380: a5e5c7ac67995ded b78248897ebbc61d
[  255.848526] NON-ZERO PAGE: 00000390: ee9fe81503ab2cc8 ccf9fd5523fe68c5
[  255.855046] NON-ZERO PAGE: 000003a0: 3a3da027e811829e db6c21bbc436d22b
[  255.861571] NON-ZERO PAGE: 000003b0: 8d31e6c9d818c734 3392b17c4dc43ec1
[  255.868097] NON-ZERO PAGE: 000003c0: 9ea988607b7077d8 f43dead3876ce0e1
[  255.874625] NON-ZERO PAGE: 000003d0: a7c1b44e192af5cc db260a62c138e5d2
[  255.881152] NON-ZERO PAGE: 000003e0: 5eaf30dcc5c19fe3 b126b6843b2f86e4
[  255.887675] NON-ZERO PAGE: 000003f0: 80b7670e4b49cdcd 4d18678050ea6e3a
[  255.894203] NON-ZERO PAGE: 00000400: 1f3c63300b40050c 9d4b1a3ba4e18337
[  255.900731] NON-ZERO PAGE: 00000410: 4b77edc62003313f b4d267f9dea58995
[  255.907253] NON-ZERO PAGE: 00000420: c89f759708ec0051 7f957c9bc7c183db
[  255.913781] NON-ZERO PAGE: 00000430: 385812a8f267427d 327378397f19a164
<SNIP ZEROS>
[  255.946402] NON-ZERO PAGE: 00000480: 0ebbd516b67cc823 dfcde4d53e4e0007
[  255.952928] NON-ZERO PAGE: 00000490: 453285e132c03f29 92c6c36258a5bfb9
[  255.959454] NON-ZERO PAGE: 000004a0: d4046066d19f8d32 fcdd220d5f34f440
[  255.965981] NON-ZERO PAGE: 000004b0: 19b1eebf7b60dbed 8e4fe4e47f61e90d
<SNIP ZEROS>
[  256.677185] NON-ZERO PAGE: 00000b80: 2c5ea3021e2e7253 d2c35a99331b4059
[  256.683712] NON-ZERO PAGE: 00000b90: b22c3e3af5a34e46 24515bb4a8989047
[  256.690235] NON-ZERO PAGE: 00000ba0: aa5ff36196b7a812 b3f91556094ecd97
[  256.696755] NON-ZERO PAGE: 00000bb0: 202e2f2a9314fa6b 9b7a1e96ddc75fd7
<SNIP ZEROS>

Those do seem random - modified by a broken SME maybe?


When tested with KASAN enabled, it did not report anything.
It did report a use-after-free in amdgpu on the 2200G, but that is not relevant to this issue - you can find more details, along with a symbolized call stack, in the Arch bug report.


Hardware used:
I tested on several configurations - the parts I had to swap around were as follows:
  CPUs: ryzen 2600, ryzen 2200g
  Mobo: x470D4U, B450M steel legend
  Ram: 4 sticks of ECC ram: 2x8GB + 2x16GB
  SSD: intel 760p (nvme), samsung 860 evo (sata)
  No gpu - can reproduce in blind mode
  PSU: supermicro 1200W rack unit, seasonic 360W unit
  Cooling: wraith prism, wraith stealth
  Traces gathered over a serial port
And I verified that no single component is causing this.


I also attached the Arch Linux-specific build script I used - maybe it will be more convenient for someone.
Comment 1 Borislav Petkov 2020-03-30 10:23:44 UTC
Can you enable CONFIG_DEBUG_PAGEALLOC and CONFIG_PAGE_OWNER and boot with debug_pagealloc=on page_owner=on? That should tell you who last used the page.

Thx.
Comment 2 Filip Brygidyn 2020-03-30 18:41:24 UTC
Created attachment 288129 [details]
symbolized logs with CONFIG_DEBUG_PAGEALLOC and CONFIG_PAGE_OWNER

Here you go, full log attached.

The excerpt I assume you were interested in:
    page dumped because: verify_zero_highpage failed!
    page_owner tracks the page as allocated
    page last allocated via order 0, migratetype Movable, gfp_mask 0x100dca(GFP_HIGHUSER_MOVABLE|__GFP_ZERO)
    prep_new_page (linux-5.4.26/./include/linux/page_owner.h:31 linux-5.4.26/mm/page_alloc.c:2151 linux-5.4.26/mm/page_alloc.c:2157) 
    get_page_from_freelist (linux-5.4.26/mm/page_alloc.c:3696) 
    __alloc_pages_nodemask (linux-5.4.26/mm/page_alloc.c:4751) 
    alloc_pages_vma (linux-5.4.26/./include/linux/mempolicy.h:78 linux-5.4.26/./include/linux/mempolicy.h:83 linux-5.4.26/mm/mempolicy.c:2156) 
    __handle_mm_fault (linux-5.4.26/./include/linux/highmem.h:208 linux-5.4.26/mm/memory.c:3018 linux-5.4.26/mm/memory.c:3863 linux-5.4.26/mm/memory.c:3989) 
    handle_mm_fault (linux-5.4.26/mm/memory.c:4026) 
    __get_user_pages (linux-5.4.26/mm/gup.c:641 linux-5.4.26/mm/gup.c:845) 
    populate_vma_page_range (linux-5.4.26/mm/gup.c:1223) 
    __mm_populate (linux-5.4.26/mm/gup.c:1279) 
    do_mlock (linux-5.4.26/mm/mlock.c:712) 
    __x64_sys_mlock (linux-5.4.26/mm/mlock.c:720 linux-5.4.26/mm/mlock.c:718 linux-5.4.26/mm/mlock.c:718) 
    do_syscall_64 (linux-5.4.26/arch/x86/entry/common.c:290) 
    entry_SYSCALL_64_after_hwframe (linux-5.4.26/arch/x86/entry/entry_64.S:184) 
    page_owner free stack trace missing
Comment 3 Vlastimil Babka 2020-03-31 08:39:16 UTC
(In reply to Filip Brygidyn from comment #2)
> Created attachment 288129 [details]
> symbolized logs with CONFIG_DEBUG_PAGEALLOC and CONFIG_PAGE_OWNER
> 
> Here you go, full log attached.
> 
> The excerpt I assume you were interested in:

Yeah, thanks.

>     page_owner free stack trace missing

So this indicates the page never went through freeing. All pages managed by mm/page_alloc.c should be added there in one way or another (during boot, hotplug...) through __free_pages_core() which basically simulates page freeing, thus including the init_on_free and setting of page_owner free stack trace. This page somehow circumvented this path, hmm...
Comment 4 Filip Brygidyn 2020-03-31 09:01:47 UTC
(In reply to Vlastimil Babka from comment #3)
> So this indicates the page never went through freeing.

This kind of matches my testing method:
With 16 or 32 GB of ram I launch memtester right after booting.
It presumably allocates most of the available memory which was until this point unused.

Most of the time, after allocating 12 out of the available 16 GB, I only get a few page warnings. The rest are correctly zeroed, and I was unable to reproduce this with mem_encrypt=off.
Comment 5 Vlastimil Babka 2020-03-31 11:00:29 UTC
(In reply to Vlastimil Babka from comment #3)
> >     page_owner free stack trace missing
> 
> So this indicates the page never went through freeing. All pages managed by
> mm/page_alloc.c should be added there in one way or another (during boot,
> hotplug...) through __free_pages_core() which basically simulates page
> freeing, thus including the init_on_free and setting of page_owner free
> stack trace. This page somehow circumvented this path, hmm...

Correction: the "page_owner free stack trace missing" does indeed mean that this is the first time the page has been allocated. But the rest of the above is wrong. At the initial pass through __free_pages_core() we are not yet recording the freeing stack traces. So I assume the page went properly through the initial freeing, including the zeroing, but then something corrupted parts of it. If it was e.g. zeroed with a wrong memory encryption state, then I would expect it to be fully garbage, not just parts of it.
Comment 6 Filip Brygidyn 2020-03-31 12:14:31 UTC
I checked with CONFIG_PAGE_POISONING=y and page_poison=on and I found no additional logs about corruption.

From what I understand, this would eliminate the possibility that the page was corrupted after zeroing (e.g. somewhere in arch_free_page()).

Not sure if that was already obvious to you.
Comment 7 Filip Brygidyn 2020-03-31 15:02:10 UTC
(In reply to Filip Brygidyn from comment #6)
> I checked with CONFIG_PAGE_POISONING=y and page_poison=on and I found no
> additional logs about corruption.
> 
> From what I understand this would eliminate a possibility that the page was
> corrupted after zero-ing (eg. somwhere in arch_free_page()).
> 
> Not sure if it was obvious for you.

I want to retract this. I did not update the .config file and tested the previous kernel version.

With properly updated .config I cannot reproduce the issue with page_poison=on.

Why could this help?
Comment 8 Filip Brygidyn 2020-03-31 16:18:23 UTC
Also checked with CONFIG_PAGE_POISONING_ZERO=y, nothing changed:
- page_poison=on:  can't reproduce
- page_poison=off: can reproduce

(And this time I verified the /proc/config.gz when testing)
Comment 9 Vlastimil Babka 2020-04-01 13:11:47 UTC
(In reply to Filip Brygidyn from comment #7)
> With properly updated .config I cannot reproduce the issue with
> page_poison=on.
> 
> Why could this help?

Simply, because:

static inline bool want_init_on_free(void)
{
        return static_branch_unlikely(&init_on_free) &&
               !page_poisoning_enabled();
}

Enabling page poisoning disables init_on_free, including the check you added.
You should see this in dmesg:

mem auto-init: CONFIG_PAGE_POISONING is on, will take precedence over init_on_free
Comment 10 Filip Brygidyn 2020-04-01 15:12:40 UTC
Thanks, that makes sense.
Still: when enabled, page poisoning would report corruption if it happened between freeing and allocation. So if I did not see such reports, then that part can be eliminated as a suspect?

If I am reading the code correctly, the poisoning/un-poisoning happens at the same points where init_on_free and my check are (free_pages_prepare() and prep_new_page()).

So the thing you said previously:
(In reply to Vlastimil Babka from comment #3)
> This page somehow circumvented this path, hmm...
would become more probable.


Also: please excuse me if I am talking nonsense or stating the obvious. I am not familiar with the codebase, but at the same time I would like to help, not just throw logs at bugzilla.
Comment 11 Borislav Petkov 2020-04-01 16:30:33 UTC
You could audit how the C-bit - _PAGE_ENC - is being set/cleared for those pages. Because, for example, if you memset the page to 0s, then clear the C-bit in the pte and access that page with C-bit cleared, you'll get garbage instead of zeroes. Or the other way around. IOW, the C-bit setting should be the same when the page is cleared to 0s *and* when your check happens - otherwise you'll be seeing those funky bytes.

But this is only a conjecture we have - someone with time on his hands has to sit down and stare at the code. :)

HTH.
Comment 12 Vlastimil Babka 2020-04-01 18:11:01 UTC
(In reply to Filip Brygidyn from comment #10)
> Thanks, that makes sense.
> Still: When enabled, page poisoning would report corruption if it happened
> in between freeing and allocation. So if I did not see them then that part
> can be eliminated as a suspect?

Hmm good point, it should be like that yeah.

> If I am reading correctly then the poisoning/un-poisoning happens at the
> same points where init_on_free and my check is. (free_pages_prepare() and
> prep_new_page())

Seems so.

> So the thing you said previously:
> (In reply to Vlastimil Babka from comment #3)
> > This page somehow circumvented this path, hmm...
> Would become more probable

Hmm, but then the unpoisoning pass should also catch such a page, same as the detection patch you have added - it's almost identical. But double-check that CONFIG_PAGE_POISONING_NO_SANITY is *not enabled*. If it is enabled, then no check happens on unpoisoning. It might also be interesting to see if PAGE_POISONING_ZERO enabled/disabled makes any difference.
Comment 13 Filip Brygidyn 2020-04-01 20:21:22 UTC
Yup, I had CONFIG_PAGE_POISONING_NO_SANITY=y. I will rebuild with it disabled and let you know tomorrow.
Comment 14 Filip Brygidyn 2020-04-02 08:33:48 UTC
Created attachment 288149 [details]
vanilla kernel logs with page poisoning (PAGE_POISONING_ZERO enabled and disabled)

With CONFIG_PAGE_POISONING_NO_SANITY disabled now I see corruption without my patch. (vanilla 5.4.26 kernel)
And I checked SME again just in case: problems disappear when passing mem_encrypt=off

Logs attached along with their corresponding config.gz taken from /proc.
I tested with PAGE_POISONING_ZERO and without - it didn't make a difference.


So AFAIK the thing I posted previously should be treated as void:
(In reply to Filip Brygidyn from comment #10)
> Still: When enabled, page poisoning would report corruption if it happened
> in between freeing and allocation. So if I did not see them then that part
> can be eliminated as a suspect?
Comment 15 Vlastimil Babka 2020-04-02 09:17:54 UTC
(In reply to Filip Brygidyn from comment #14)
> With CONFIG_PAGE_POISONING_NO_SANITY disabled now I see corruption without
> my patch. (vanilla 5.4.26 kernel)
> And I checked SME again just in case: problems disappear when passing
> mem_encrypt=off
> 
> Logs attached along with their corresponding config.gz taken from /proc.
> I tested with PAGE_POISONING_ZERO and without - didn't make a difference

Good. I notice there's a pattern: it's always two pages, where one has a small area corrupted and the other a larger area plus another small area. Could it be that it's always the same addresses? However, we have security features working against our debugging here, so to find out the real and stable addresses you'd need to disable CONFIG_RANDOMIZE_MEMORY and also apply the patch below. Would it also be possible to test the latest 5.6.x kernel, just to rule out that this was something already fixed? Thanks.

diff --git a/lib/hexdump.c b/lib/hexdump.c
index 147133f8eb2f..52dd2fc1c8b4 100644
--- a/lib/hexdump.c
+++ b/lib/hexdump.c
@@ -256,7 +256,7 @@ void print_hex_dump(const char *level, const char *prefix_str, int prefix_type,
 
                switch (prefix_type) {
                case DUMP_PREFIX_ADDRESS:
-                       printk("%s%s%p: %s\n",
+                       printk("%s%s%px: %s\n",
                               level, prefix_str, ptr + i, linebuf);
                        break;
                case DUMP_PREFIX_OFFSET:
Comment 16 Filip Brygidyn 2020-04-02 17:31:10 UTC
Created attachment 288169 [details]
5.6 logs with print_hex_dump patch and CONFIG_RANDOMIZE_MEMORY disabled

I took the current 5.6.2, disabled CONFIG_RANDOMIZE_MEMORY and applied your patch.
Corruption is still present and you can find logs and a config.gz attached.


(In reply to Vlastimil Babka from comment #15)
> Good. I notice there's some pattern, it's always two pages where one has a
> small area corrupted, and another a larger area plus another small area.
> Could it be it's always the same addresses?
I attached a total of 8 dmesg logs - one per reboot - to ease the spotting of patterns.
And the addresses are indeed the same. But it appears that in a few instances there are 3 pages reported instead of 2 - though the 3rd is offset by only 4k from one of the other two.
Comment 17 Vlastimil Babka 2020-04-03 10:13:55 UTC
Created attachment 288177 [details]
summary of poisoning errors

Thanks. I've extracted the hexdumps into one file and can confirm that there are two or three pages; some have corruption at exactly the same addresses (ffff8880e9d26fc0 to the end of the page), some at slightly different ranges.

That the pages are otherwise filled with the 0xaa poison should mean they were correctly poisoned, and that it's not just null-init by the BIOS plus a lack of poisoning.

Perhaps the exact scrambling we see is a result of non-encrypted write to encrypted pages, and then seeing the decrypted result. But, if memory encryption was off and somebody was still doing these corruption writes, we would see them too, just with different contents, right? So I still don't see how the mem encryption is exactly the culprit here. Or does it have a bug that manifests only at certain physical addresses?
Comment 18 Filip Brygidyn 2020-04-03 22:28:21 UTC
Created attachment 288195 [details]
logs and patch with hw_breakpoint added

With the known repeatable address (0xffff8880e9d26fc0) that can be monitored, I added a hardware breakpoint using include/linux/hw_breakpoint.h, referencing samples/hw_breakpoint/data_breakpoint.c


I attached:
- raw dmesg
- symbolized dmesg
- trimmed symbolized dmesg with just the interesting bits and comments about my thought process
- patch adding the breakpoint (it also contains your hex_dump patch and a small change that makes the page poisoning print the whole page instead of just the corrupted part)


And here's even more trimmed version of the dmesg with comments (marked with <> brackets):


<dmesg starts>
[    0.000000] Linux version 5.6.2-1-custom (linux-custom@archlinux) (gcc version 9.3.0 (Arch Linux 9.3.0-1)) #6 SMP PREEMPT Fri, 03 Apr 2020 20:53:42 +0000
[    0.000000] Command line: BOOT_IMAGE=/boot/vmlinuz-linux-custom root=UUID=d8dded4f-a123-4108-b7a9-3c2781ee024c rw console=tty0 console=ttyS0,115200 init_on_free=1 mem_encrypt=on debug_pagealloc=on page_owner=on page_poison=on loglevel=8
<SNIP IMO irrelevant logs>

<initial free on the page containing our magical broken address is called, dump the current stack>
<@Vlastimil Babka Presumably this is the first time this page is added to page_alloc.c? I do not see __free_pages_core() here though>

[    0.033937] dump_stack (/home/aur/linux-git/src/linux-5.6.2/lib/dump_stack.c:120) 
[    0.033937] kernel_poison_pages.cold (/home/aur/linux-git/src/linux-5.6.2/mm/page_poison.c:98 /home/aur/linux-git/src/linux-5.6.2/mm/page_poison.c:112 /home/aur/linux-git/src/linux-5.6.2/mm/page_poison.c:190) 
[    0.033937] free_pcp_prepare (/home/aur/linux-git/src/linux-5.6.2/./arch/x86/include/asm/jump_label.h:25 /home/aur/linux-git/src/linux-5.6.2/./include/linux/mm.h:2712 /home/aur/linux-git/src/linux-5.6.2/mm/page_alloc.c:1182 /home/aur/linux-git/src/linux-5.6.2/mm/page_alloc.c:1220) 
[    0.033937] free_unref_page (/home/aur/linux-git/src/linux-5.6.2/mm/page_alloc.c:3011 /home/aur/linux-git/src/linux-5.6.2/mm/page_alloc.c:3060) 
[    0.033937] __memblock_free_late (/home/aur/linux-git/src/linux-5.6.2/./arch/x86/include/asm/atomic64_64.h:88 (discriminator 2) /home/aur/linux-git/src/linux-5.6.2/./include/asm-generic/atomic-instrumented.h:1050 (discriminator 2) /home/aur/linux-git/src/linux-5.6.2/./include/asm-generic/atomic-long.h:160 (discriminator 2) /home/aur/linux-git/src/linux-5.6.2/./include/linux/mm.h:60 (discriminator 2) /home/aur/linux-git/src/linux-5.6.2/mm/memblock.c:1630 (discriminator 2)) 
[    0.033937] efi_free_boot_services (/home/aur/linux-git/src/linux-5.6.2/arch/x86/platform/efi/quirks.c:413 (discriminator 2)) 
[    0.033937] efi_enter_virtual_mode (/home/aur/linux-git/src/linux-5.6.2/arch/x86/platform/efi/efi.c:954 /home/aur/linux-git/src/linux-5.6.2/arch/x86/platform/efi/efi.c:984) 
[    0.033937] start_kernel (/home/aur/linux-git/src/linux-5.6.2/init/main.c:970) 
[    0.033937] secondary_startup_64 (/home/aur/linux-git/src/linux-5.6.2/arch/x86/kernel/head_64.S:242) 


<while freeing we poisoned the page, print it just to make sure it is filled with 0xaa>
[    0.033937] hw_breakpoint: dump_stack finished, current page dump:
<SNIP 0xaa filled page dump>
[    0.880604] ffff8880e9d26fc0: aa aa aa aa aa aa aa aa aa aa aa aa aa aa aa aa  ................
<SNIP 0xaa filled page dump>
<ok, it seems like the page is poisoned correctly, now we register a hw_breakpoint (read/write) at 0xffff8880e9d26fc0 address>


[    0.893937] hw_breakpoint: hexdump finished, registering the breakpoint:
[    0.897285] hw_breakpoint registered
<SNIP>

<Somewhere here I launch the memtester...>

<we are in the middle of trying to allocate a lot of memory in memtester>
<our page is going to be used; we are in prep_new_page() and about to unpoison the page. At this point we need to unregister the breakpoint, but to make sure it works we memset 4 bytes with 0xaf and expect the still-enabled breakpoint to trigger...>
[   43.978733] hw_breakpoint: about to unregister the breakpoint, check if it works just in case...


[   43.988271] hw_breakpoint: value is changed
<ok, the breakpoint is triggered, print the call stack...>

<SNIP most of the call stack>
[   44.065581] RIP: 0010:kernel_poison_pages.cold (/home/aur/linux-git/src/linux-5.6.2/mm/page_poison.c:162 /home/aur/linux-git/src/linux-5.6.2/mm/page_poison.c:179 /home/aur
<SNIP>
<so the call-stack of the breakpoint is indeed our memset from page_poison.c, now to unregister the breakpoint>

[   44.278288] hw_breakpoint: testing breakpoint finished, unregistering...
[   44.285378] hw_breakpoint unregistered


<good, now the unpoisoning, which detects memory corruption. It should trigger regardless of the unknown corruption we are dealing with, because we did a 4-byte memset on the page>
[   44.289527] pagealloc: memory corruption
<SNIP 0xaa filled page dump>
[   46.515384] ffff8880e9d26fc0: af af af af bd 74 aa c8 7a 23 9e 9a 31 80 3f 6f  .....t..z#..1.?o
<SNIP page dump with other corruption>

<HMMM, our 4 bytes of 0xaf are here but other values are corrupted as well, ___so something that doesn't trigger hardware breakpoints did it???___>
<SNIP callstack printed along with the corrupted page>





So, from what I can tell the breakpoint works and should have caught the corruption. If it didn't, then maybe it's something related to the C-bit @Borislav Petkov mentioned.
The AMD memory encryption whitepaper (https://developer.amd.com/wordpress/media/2013/12/AMD_Memory_Encryption_Whitepaper_v7-Public.pdf) states that it is sensitive to caching, so I may try to somehow flush the memory after poisoning. Maybe that would help?
Although right now I do not know how to do that, or how to get from the page struct I have in page_alloc.c to the underlying physical memory somewhere in arch/x86/mm/mem_encrypt.c and check the C-bit. So it will probably take me a while.


Bonus conspiracy theory: Maybe it's the PSP failing miserably while trying to steal my keys :>
Comment 19 Filip Brygidyn 2020-04-03 22:50:05 UTC
(In reply to Vlastimil Babka from comment #17)
> Perhaps the exact scrambling we see is a result of non-encrypted write to
> encrypted pages, and then seeing the decrypted result. But, if memory
> encryption was off and somebody was still doing these corruption writes, we
> would see them too, just with different contents, right?

Someone who knows how SME is handled would have to answer this, but my intuition tells me that if we did a write to a page that was set as encrypted, then the decryption later would produce garbage.

+ The thing you said previously is a very good point:
(In reply to Vlastimil Babka from comment #5)
> If it was e.g. zeroed with a wrong memory encryption
> state, then I would expect it to be fully garbage, not just parts of it.

What we see in page_alloc.c is virtual memory, right? (%px in printk) And the C-bits are set for physical memory pages (so I would have to use %pa, I think). I don't know how they correspond to each other, but maybe the parts that we see as garbage in the middle of a page are just mapped to a separate physical page.
If that's the case, then the theory of writing into a page with the wrong C-bit state is probable.

But again: the breakpoint (which I hope works the way I think it does) did not show any writes.

> So I still don't
> see how the mem encryption is exactly the culprit here. Or does it have a
> bug that manifests only at certain physical addresses?
Maybe, but I am much more willing to believe that we fail to flush the caches somewhere. The corruption ranges seem much closer to cache line size than to whole page size (but maybe it's just the physical<->virtual memory mapping being misleading).
Comment 20 Vlastimil Babka 2020-04-04 13:26:22 UTC
(In reply to Filip Brygidyn from comment #18)
> Created attachment 288195 [details]
> logs and patch with hw_breakpoint added
> 
> With the known repeatable address (0xffff8880e9d26fc0) that can be monitored
> I added a hardware breakpoint using include/linux/hw_breakpoint.h while
> referencing the linux/samples/hw_breakpoint/data_breakpoint.c

That was a great idea, thanks. Looks like it did prove the kernel doesn't unexpectedly touch this memory through the direct mapping. I was pretty sure it doesn't, but having a proof is much better.

(In reply to Filip Brygidyn from comment #19)
> (In reply to Vlastimil Babka from comment #17)
> > Perhaps the exact scrambling we see is a result of non-encrypted write to
> > encrypted pages, and then seeing the decrypted result. But, if memory
> > encryption was off and somebody was still doing these corruption writes, we
> > would see them too, just with different contents, right?
> 
> Someone who knows how SME is hadled would have to answer this but my
> intuition tells me that if we did a write to a page that was set as
> encrypted, then the decryption later would produce garbage.

Yes, it looks like somebody did such a write, using a mapping that didn't have the C-bit set. Your HW breakpoint ruled out that it was the kernel using the 0xffff8880e9d26fc0 address. It could still have been the kernel using a different virtual address without the C-bit set, or more likely firmware (SMM?) or hardware. But the point is that such a stray overwrite would corrupt the poison values even without encryption, and we would detect it.
 
> + The thing you said previously is a very good point:
> (In reply to Vlastimil Babka from comment #5)
> > If it was e.g. zeroed with a wrong memory encryption
> > state, then I would expect it to be fully garbage, not just parts of it.
> 
> What we see in page_alloc.c is virtual memory, right? (%px in printk) and

Yes.

> the C-bits are set for physical memory pages (so I would have to use %pa I
> think). I don't know how do they correspond to each other but maybe the

Since we disabled randomization, it's easy. Virtual address ffff8880e9d26fc0 should be physical address e9d26fc0. The 'ffff88' part is where direct mapping starts. The extra '8' is bit 47 so it's probably the C-bit - looks like it's a part of the virtual address of kernel mapping too, although it only needs to be part of the physical address. Need to check how we deal with that in the kernel.

> parts that we see as garbage in the middle of a page are just mapped to a
> separate physical page.

No, that's not possible: a 4KB page is the lowest possible granularity for mapping. And the ffff88* mapping even uses 1GB granularity when possible.

> If that's the case then the theory of writing into a page with wrong C-Bit
> state is probable.
> 
> But again: the breakpoint I hope works as I think it does did not show any
> writes.

Yes.

> > So I still don't
> > see how the mem encryption is exactly the culprit here. Or does it have a
> > bug that manifests only at certain physical addresses?
> Maybe but I am much more willing to believe that we do not flush the caches
> somewhere. The corruption ranges seem much closer to cache line size than to
> whole page size. (but maybe it's just the physical<>virtual mem mapping
> being misleading)

In fact it seems the corrupted vs correct parts have exactly the granularity and alignment of a cache line (or multiple cache lines) of 64 bytes, as I mentioned to Boris. Thus it takes only a single byte write through a mapping without the C-bit to corrupt the whole cache line.

I wonder if cache lines can be dirtied speculatively. Then whoever fetched them without the C-bit perhaps did it only because of mis-speculation, and then the flush writeback did the corruption?

Also I realized we might try to print the affected areas as unencrypted and see if the non-scrambled contents hints at anything. It should be possible just by masking bit 47 in the address, thus prefix ffff888 becomes ffff880. But then we shouldn't do the testing "af" byte writes.
Comment 21 Vlastimil Babka 2020-04-04 13:46:10 UTC
(In reply to Vlastimil Babka from comment #20)
> Since we disabled randomization, it's easy. Virtual address ffff8880e9d26fc0
> should be physical address e9d26fc0. The 'ffff88' part is where direct
> mapping starts. The extra '8' is bit 47 so it's probably the C-bit - looks
> like it's a part of the virtual address of kernel mapping too, although it
> only needs to be part of the physical address. Need to check how we deal
> with that in the kernel.

Um, scratch this, there's no C-bit in the direct mapping. The base is 0xffff888000000000 these days (if not randomized); it used to be 0xffff880000000000. The bit change (which is actually bit 39, not 47) has nothing to do with encryption.

> Also I realized we might try to print the affected areas as unencrypted and
> see if the non-scrambled contents hints at anything. It should be possible
> just by masking bit 47 in the address, thus prefix ffff888 becomes ffff880.
> But then we shouldn't do the testing "af" byte writes.

This theory should still work, but it's not about modifying the virtual address. We probably need to call set_memory_decrypted() on the whole page (e.g. set_memory_decrypted(0xffff8880e9d26000, 1)) and then do the hexdump.

(In reply to Filip Brygidyn from comment #18)
> <@Vlastimil Babka Presumably this is the first time this page is added to
> page_alloc.c? I do not see __free_pages_core() here though>

__memblock_free_late() -> memblock_free_pages() -> __free_pages_core()
Probably inlined. IIRC kasan_symbolize.py was able to add such inlines to the symbolized logs.
Comment 22 Filip Brygidyn 2020-04-06 19:26:00 UTC
Created attachment 288237 [details]
logs of encrypting/decrypting the corrupted page

(In reply to Vlastimil Babka from comment #21)
> We probably need to call set_memory_decrypted() on the whole page
> (e.g. set_memory_decrypted(0xffff8880e9d26000, 1) and then do the hexdump.

Good call.
Just calling set_memory_encrypted(), set_memory_decrypted() and set_memory_encrypted() again showed that the memory is already corrupted at the initial free.
For me this is fairly good proof that something is not flushing the caches (or invalidates them too early).

The story is below, I trimmed the hexdumps. (full are attached)


<initial simulated free call trace>
[    0.030520] CPU: 0 PID: 0 Comm: swapper/0 Not tainted 5.6.2-1-custom #11
[    0.033849] Hardware name: To Be Filled By O.E.M. To Be Filled By O.E.M./X470D4U, BIOS L3.33 12/26/2019
[    0.033849] Call Trace:
[    0.033849] dump_stack (/home/aur/linux-git/src/linux-5.6.2/lib/dump_stack.c:120) 
[    0.033849] kernel_poison_pages.cold (/home/aur/linux-git/src/linux-5.6.2/mm/page_poison.c:99 /home/aur/linux-git/src/linux-5.6.2/mm/page_poison.c:123 /home/aur/linux-git/src/linux-5.6.2/mm/page_poison.c:215) 
[    0.033849] free_pcp_prepare (/home/aur/linux-git/src/linux-5.6.2/./arch/x86/include/asm/jump_label.h:25 /home/aur/linux-git/src/linux-5.6.2/./include/linux/mm.h:2712 /home/aur/linux-git/src/linux-5.6.2/mm/page_alloc.c:1182 /home/aur/linux-git/src/linux-5.6.2/mm/page_alloc.c:1220) 
[    0.033849] free_unref_page (/home/aur/linux-git/src/linux-5.6.2/mm/page_alloc.c:3011 /home/aur/linux-git/src/linux-5.6.2/mm/page_alloc.c:3060) 
[    0.033849] __memblock_free_late (/home/aur/linux-git/src/linux-5.6.2/./arch/x86/include/asm/atomic64_64.h:88 (discriminator 2) /home/aur/linux-git/src/linux-5.6.2/./include/asm-generic/atomic-instrumented.h:1050 (discriminator 2) /home/aur/linux-git/src/linux-5.6.2/./include/asm-generic/atomic-long.h:160 (discriminator 2) /home/aur/linux-git/src/linux-5.6.2/./include/linux/mm.h:60 (discriminator 2) /home/aur/linux-git/src/linux-5.6.2/mm/memblock.c:1630 (discriminator 2)) 
[    0.033849] efi_free_boot_services (/home/aur/linux-git/src/linux-5.6.2/arch/x86/platform/efi/quirks.c:413 (discriminator 2)) 
[    0.033849] efi_enter_virtual_mode (/home/aur/linux-git/src/linux-5.6.2/arch/x86/platform/efi/efi.c:954 /home/aur/linux-git/src/linux-5.6.2/arch/x86/platform/efi/efi.c:984) 
[    0.033849] start_kernel (/home/aur/linux-git/src/linux-5.6.2/init/main.c:970) 
[    0.033849] secondary_startup_64 (/home/aur/linux-git/src/linux-5.6.2/arch/x86/kernel/head_64.S:242) 

<hexdump just after poisoning>
[    0.877184] current   ffff8880e9d26fb0: aa aa aa aa aa aa aa aa aa aa aa aa aa aa aa aa  ................
[    0.880517] current   ffff8880e9d26fc0: aa aa aa aa aa aa aa aa aa aa aa aa aa aa aa aa  ................
[    0.883850] current   ffff8880e9d26fd0: aa aa aa aa aa aa aa aa aa aa aa aa aa aa aa aa  ................
[    0.887184] current   ffff8880e9d26fe0: aa aa aa aa aa aa aa aa aa aa aa aa aa aa aa aa  ................
[    0.890517] current   ffff8880e9d26ff0: aa aa aa aa aa aa aa aa aa aa aa aa aa aa aa aa  ................
<well, it looks poisoned alright>

<right after that, call set_memory_encrypted and then hexdump>
[    1.730517] encrypted ffff8880e9d26fb0: aa aa aa aa aa aa aa aa aa aa aa aa aa aa aa aa  ................
[    1.733850] encrypted ffff8880e9d26fc0: aa aa aa aa aa aa aa aa aa aa aa aa aa aa aa aa  ................
[    1.737183] encrypted ffff8880e9d26fd0: aa aa aa aa aa aa aa aa aa aa aa aa aa aa aa aa  ................
[    1.740517] encrypted ffff8880e9d26fe0: aa aa aa aa aa aa aa aa aa aa aa aa aa aa aa aa  ................
[    1.743850] encrypted ffff8880e9d26ff0: aa aa aa aa aa aa aa aa aa aa aa aa aa aa aa aa  ................
<ok, nothing changed, so the page was already set as encrypted>

<right after that, call set_memory_decrypted and then hexdump>
[    2.557184] decrypted ffff8880e9d26f30: 0e c9 57 3e e2 90 d6 b0 f7 df 57 ce 60 d5 41 49  ..W>......W.`.AI
[    2.560517] decrypted ffff8880e9d26f40: 66 95 f2 e5 bd f6 cd 05 db 32 6c 40 0e 8b 57 19  f........2l@..W.
[    2.563850] decrypted ffff8880e9d26f50: 59 2c 15 72 be 3d 63 ad 24 5a 1c 19 ba be 62 39  Y,.r.=c.$Z....b9
[    2.567184] decrypted ffff8880e9d26f60: 03 1f 65 9d 6b 3a 87 32 4b 35 ff c0 ca 9f 08 26  ..e.k:.2K5.....&
[    2.570517] decrypted ffff8880e9d26f70: b7 d4 e7 73 41 25 1d 3a 63 c2 88 31 8f 4d 37 80  ...sA%.:c..1.M7.
[    2.573850] decrypted ffff8880e9d26f80: 00 00 00 00 00 00 00 00 66 1c cb e9 00 00 00 00  ........f.......
[    2.577184] decrypted ffff8880e9d26f90: 10 05 77 ed 00 00 00 00 00 00 00 00 00 00 00 00  ..w.............
[    2.580517] decrypted ffff8880e9d26fa0: d8 05 d7 e9 00 00 00 00 08 00 00 00 00 00 00 00  ................
[    2.583850] decrypted ffff8880e9d26fb0: 00 00 00 00 00 00 00 00 34 2e cb e9 00 00 00 00  ........4.......
[    2.587184] decrypted ffff8880e9d26fc0: 18 0c d7 e9 00 00 00 00 00 00 00 00 00 00 00 00  ................
[    2.590517] decrypted ffff8880e9d26fd0: d8 05 d7 e9 00 00 00 00 24 2e cb e9 00 00 00 00  ........$.......
[    2.593850] decrypted ffff8880e9d26fe0: 38 00 00 00 00 00 00 00 46 00 00 00 00 00 00 00  8.......F.......
[    2.597184] decrypted ffff8880e9d26ff0: 00 70 d2 e9 00 00 00 00 08 00 00 00 00 00 00 00  .p..............
<hmm, now this is interesting: instead of the whole page being scrambled I see that at the end of the page there is some decrypted data. 128 bytes exactly>
<so the corruption was actually encrypted data>

<right after that, call set_memory_encrypted and then hexdump>
[    3.440517] encrypted ffff8880e9d26fb0: aa aa aa aa aa aa aa aa aa aa aa aa aa aa aa aa  ................
[    3.443850] encrypted ffff8880e9d26fc0: 37 8c 62 1f 79 b7 00 69 0d c1 c5 72 a8 36 cd 14  7.b.y..i...r.6..
[    3.447184] encrypted ffff8880e9d26fd0: 3c f2 95 ae 08 a4 77 84 61 d2 65 e3 6e 87 fb 54  <.....w.a.e.n..T
[    3.450517] encrypted ffff8880e9d26fe0: 10 9d de a8 2a c5 2a 42 81 05 ef 22 78 1e d9 78  ....*.*B..."x..x
[    3.453850] encrypted ffff8880e9d26ff0: a7 2c a6 2e 5a 51 58 f8 75 9e 9d dc 79 2e fe 0e  .,..ZQX.u...y...
<now we see the corrupted data that was apparently flushed; it stays with us until overwritten, but now the corrupted range is only 64 bytes wide>
<normally we would see this when trying to unpoison later>

<do the rest of booting, ssh into the machine and then call memtester>

<we are about to unpoison the page, hexdump current page:>
[   55.089760] current   ffff8880e9d26fb0: aa aa aa aa aa aa aa aa aa aa aa aa aa aa aa aa  ................
[   55.099408] current   ffff8880e9d26fc0: 37 8c 62 1f 79 b7 00 69 0d c1 c5 72 a8 36 cd 14  7.b.y..i...r.6..
[   55.109061] current   ffff8880e9d26fd0: 3c f2 95 ae 08 a4 77 84 61 d2 65 e3 6e 87 fb 54  <.....w.a.e.n..T
[   55.118727] current   ffff8880e9d26fe0: 10 9d de a8 2a c5 2a 42 81 05 ef 22 78 1e d9 78  ....*.*B..."x..x
[   55.128395] current   ffff8880e9d26ff0: a7 2c a6 2e 5a 51 58 f8 75 9e 9d dc 79 2e fe 0e  .,..ZQX.u...y...
<yup, the memory contains the corruption we were seeing until now>

<right after that, call set_memory_encrypted and then hexdump>
[   57.561664] encrypted ffff8880e9d26fb0: aa aa aa aa aa aa aa aa aa aa aa aa aa aa aa aa  ................
[   57.571310] encrypted ffff8880e9d26fc0: 37 8c 62 1f 79 b7 00 69 0d c1 c5 72 a8 36 cd 14  7.b.y..i...r.6..
[   57.580963] encrypted ffff8880e9d26fd0: 3c f2 95 ae 08 a4 77 84 61 d2 65 e3 6e 87 fb 54  <.....w.a.e.n..T
[   57.590627] encrypted ffff8880e9d26fe0: 10 9d de a8 2a c5 2a 42 81 05 ef 22 78 1e d9 78  ....*.*B..."x..x
[   57.600299] encrypted ffff8880e9d26ff0: a7 2c a6 2e 5a 51 58 f8 75 9e 9d dc 79 2e fe 0e  .,..ZQX.u...y...
<as expected the page was set as encrypted already so nothing changed>

<right after that, call set_memory_decrypted and then hexdump>
[   60.311616] decrypted ffff8880e9d26fb0: 31 93 e9 97 0c 06 42 5a e7 94 28 80 e4 66 d0 07  1.....BZ..(..f..
[   60.330727] decrypted ffff8880e9d26fc0: 18 0c d7 e9 00 00 00 00 00 00 00 00 00 00 00 00  ................
[   60.349011] decrypted ffff8880e9d26fd0: d8 05 d7 e9 00 00 00 00 24 2e cb e9 00 00 00 00  ........$.......
[   60.368214] decrypted ffff8880e9d26fe0: 38 00 00 00 00 00 00 00 46 00 00 00 00 00 00 00  8.......F.......
[   62.666298] decrypted ffff8880e9d26ff0: 00 70 d2 e9 00 00 00 00 08 00 00 00 00 00 00 00  .p..............
<hmm, we get back the 64 decrypted bytes>

<right after that, call set_memory_encrypted and then hexdump>
[   65.108977] encrypted ffff8880e9d26fb0: aa aa aa aa aa aa aa aa aa aa aa aa aa aa aa aa  ................
[   65.118625] encrypted ffff8880e9d26fc0: 37 8c 62 1f 79 b7 00 69 0d c1 c5 72 a8 36 cd 14  7.b.y..i...r.6..
[   65.128279] encrypted ffff8880e9d26fd0: 3c f2 95 ae 08 a4 77 84 61 d2 65 e3 6e 87 fb 54  <.....w.a.e.n..T
[   65.137943] encrypted ffff8880e9d26fe0: 10 9d de a8 2a c5 2a 42 81 05 ef 22 78 1e d9 78  ....*.*B..."x..x
[   65.147613] encrypted ffff8880e9d26ff0: a7 2c a6 2e 5a 51 58 f8 75 9e 9d dc 79 2e fe 0e  .,..ZQX.u...y...
<and we get back our 'corrupted' data>


<the memory is back on the free list, call memtester again and the whole cycle repeats but without corruption this time>
Comment 23 Filip Brygidyn 2020-04-06 19:42:41 UTC
(In reply to Filip Brygidyn from comment #22)
> For me this is a fairly good proof that something is not flushing the
> caches. (or invalidates them too early).
I just realized that I've convinced myself that I already know the answer. So please be critical of my opinion.
Comment 24 Vlastimil Babka 2020-04-06 23:00:52 UTC
Huh, that's rather insane. Especially the part where we initially see all poison OK, then after decrypt it shows something's wrong with the last 128 bytes of the page, after another encrypt it's suddenly just the last 64 bytes, and then it stays that way. And since you're doing that immediately during the boot, it should rule out any erroneous activity between the rest of the boot and memtester allocating the page.
Note that each of the set_memory_decrypted()/set_memory_encrypted() calls implies a cache flush of the whole page, and I don't really think the code there is wrong.

But maybe there's a theory explaining the above, that assumes the cache state was already inconsistent at the moment of the initial poisoning, or even at the moment of "initial encryption" (when the direct mapping was constructed with the c-bit set, I don't immediately see where exactly it's done in the kernel code).

1) before the "initial encryption", something accessed the last 128 bytes of the page, and wrote to the last cacheline (64 bytes), using a physical address without the C-bit. Perhaps it just wrote to the last 64 bytes, and the previous line was cached as an adjacent-line prefetch. The result is that some cache in the CPU now caches the very last line as dirty and the previous line as e.g. shared, using a physical memory tag that doesn't have the c-bit.

2) the "initial encryption" constructed the direct mapping with c-bits set, but didn't flush the caches first that were populated through mappings to physical addresses without c-bits.

3) the initial poisoning uses the direct mapping with c-bits, but since that effectively maps to different physical addresses, cache coherency doesn't recognize those last two cache lines (physical tags don't match in the c-bit) and thus poisoning fetches a different copy of the 128 bytes from RAM, and writes the poison there. The hexdump just prints from the cache it has written to.

4) You call set_memory_encrypted(), which flushes the page from cache to RAM using the mapping with c-bits, so the poison is propagated encrypted to memory. But it still doesn't affect the 128 bytes cached without c-bits. The following hexdump fetches that encrypted poison back to cache (but no longer dirty) and prints it.

5) Now it starts to be interesting. You call set_memory_decrypted(). That again flushes the cache lines containing encrypted poison with the c-bit set, and then changes the mapping to remove the c-bit. Then it flushes only the TLB, not the cache (see the second cpa_flush() in __set_memory_enc_dec()). Now the hexdump uses a mapping without c-bits, so it populates the cache with contents from memory that have been encrypted but are now not decrypted, hence scrambled. Except those last two cache lines, which are finally read from the cache and not memory, since we now used a mapping without c-bits! So the last 128 bytes look different.

6) You call set_memory_encrypted() again, which flushes the cache without c-bits. The last cache line was dirty, so it's written back to memory and overwrites the encrypted poison bytes. The next-to-last cache line was not dirty, so it's just invalidated, as is the rest of the page. Then the mapping is changed again to use the c-bit, so the hexdump reads memory as encrypted and we now see only the last cache line as scrambled. But now the incoherent copy in cache is finally gone, and the page will operate normally from here on, as your further output shows.

If this theory is correct, then all of the observed corruption should be prevented if you hook a "set_memory_decrypted(); set_memory_encrypted();" sequence of operations on the page before the initial poisoning. That should flush the incoherent copy immediately. Then the question is if the initial direct mapping encryption lacks a flush, or e.g. some hardware or firmware misbehaves and populates those non-cbit cache lines.
Comment 25 Borislav Petkov 2020-04-07 14:49:01 UTC
Ok, two things to try:

* can you make the same observation if you boot with TSME on? TSME is a BIOS option which enables transparent memory encryption on the whole system and the SME code in the kernel is not used. But you'd have to check your BIOS to see whether there's even such an option exposed. Some BIOSes have it and you might be lucky.

Independent from the above,

* can you make the same observation if you boot the box with nr_cpus=1?

Thx.
