Bug 201631 - WARNING: CPU: 11 PID: 29593 at fs/ext4/inode.c:3927 .ext4_set_page_dirty+0x70/0xb0
Summary: WARNING: CPU: 11 PID: 29593 at fs/ext4/inode.c:3927 .ext4_set_page_dirty+0x70...
Status: CLOSED CODE_FIX
Alias: None
Product: Platform Specific/Hardware
Classification: Unclassified
Component: PPC-64 (show other bugs)
Hardware: PPC-64 Linux
: P1 normal
Assignee: platform_ppc-64
URL:
Keywords:
Depends on:
Blocks:
 
Reported: 2018-11-07 18:14 UTC by Erhard F.
Modified: 2019-02-20 13:07 UTC (History)
5 users (show)

See Also:
Kernel Version: 4.20-rc1
Tree: Mainline
Regression: No


Attachments
kernel config (86.89 KB, text/plain)
2018-11-07 18:14 UTC, Erhard F.
Details
dmesg (62.64 KB, text/plain)
2018-11-07 18:14 UTC, Erhard F.
Details
dmesg (72.27 KB, text/plain)
2018-11-28 11:00 UTC, Erhard F.
Details
kernel .config (4.19.5) (88.00 KB, text/plain)
2018-11-28 17:14 UTC, Erhard F.
Details
kernel config (4.20-rc4) (88.17 KB, text/plain)
2018-11-28 17:19 UTC, Erhard F.
Details
dmesg (4.20-rc6 on a PowerMac G5 7,3) (53.22 KB, text/plain)
2018-12-16 13:26 UTC, Erhard F.
Details
kernel .config (4.20-rc6, PowerMac G5 7,3) (86.84 KB, text/plain)
2018-12-16 13:27 UTC, Erhard F.
Details
mm: Dump page and PTE state (953 bytes, patch)
2018-12-17 11:30 UTC, Jan Kara
Details | Diff
dmesg (4.20-rc7 on a PowerMac G5 7,3) (46.40 KB, text/plain)
2018-12-17 16:51 UTC, Erhard F.
Details
dmesg (4.20-rc7 on a Talos II) (58.51 KB, text/plain)
2018-12-17 16:52 UTC, Erhard F.
Details
journalctl -k (4.20-rc7 on a Talos II) (626.18 KB, text/plain)
2018-12-17 17:50 UTC, Erhard F.
Details
mm: Dump page and PTE state and some history (3.33 KB, patch)
2018-12-19 17:03 UTC, Jan Kara
Details | Diff
journalctl -k (4.20-rc7 on a Talos II) (1.16 MB, text/plain)
2018-12-19 22:58 UTC, Erhard F.
Details
mm: Dump page and PTE state and some history (v2) (3.34 KB, patch)
2018-12-20 07:41 UTC, Jan Kara
Details | Diff
journalctl -k (4.20-rc7 on a Talos II) (1.16 MB, text/plain)
2018-12-21 11:17 UTC, Erhard F.
Details
journalctl -k (5.0-rc1 on a Talos II) (1.15 MB, text/plain)
2019-01-08 23:00 UTC, Erhard F.
Details
kernel .config (5.0-rc1, Talos II) (87.33 KB, text/plain)
2019-01-08 23:02 UTC, Erhard F.
Details
mm: Dump page and PTE state and some history (v3) (5.56 KB, patch)
2019-01-17 11:48 UTC, Jan Kara
Details | Diff
journalctl -k (5.0-rc3 on a Talos II) (1.44 MB, text/plain)
2019-01-22 23:10 UTC, Erhard F.
Details
mm: Dump page and PTE state and some history (v4) (6.89 KB, patch)
2019-01-23 16:41 UTC, Jan Kara
Details | Diff
journalctl -k (5.0-rc3 on a Talos II) (1.60 MB, text/plain)
2019-01-27 14:48 UTC, Erhard F.
Details
mm: Dump page and PTE state and some history (v5) (12.03 KB, patch)
2019-01-29 09:41 UTC, Jan Kara
Details | Diff
mm: Dump page and PTE state and some history (v6) (12.01 KB, patch)
2019-01-30 12:53 UTC, Jan Kara
Details | Diff
journalctl -k (5.0-rc4 on a Talos II) (1.66 MB, text/plain)
2019-01-31 00:32 UTC, Erhard F.
Details
Revert-powerpc-mm-book3s-Update-pmd_present (4.19 KB, patch)
2019-02-04 06:38 UTC, Aneesh Kumar KV
Details | Diff
Revert-powerpc-mm-book3s-Enable-THP-migration (2.24 KB, patch)
2019-02-04 06:39 UTC, Aneesh Kumar KV
Details | Diff
journalctl -k (5.0-rc4 on a Talos II, "Kernel NULL pointer dereference") (92.61 KB, text/plain)
2019-02-04 22:33 UTC, Erhard F.
Details
journalctl -k (5.0-rc5 on a Talos II) (1.66 MB, text/plain)
2019-02-05 12:21 UTC, Erhard F.
Details
possible revert for upstream (1014 bytes, application/mbox)
2019-02-08 15:04 UTC, Aneesh Kumar KV
Details

Description Erhard F. 2018-11-07 18:14:07 UTC
Created attachment 279373 [details]
kernel config

[ 3894.535640] WARNING: CPU: 11 PID: 29593 at fs/ext4/inode.c:3927 .ext4_set_page_dirty+0x70/0xb0
[ 3894.535794] Modules linked in: rpcsec_gss_krb5 auth_rpcgss nfsv4 dns_resolver nfs lockd grace sunrpc cfg80211 rfkill radeon evdev ghash_generic i2c_algo_bit gf128mul snd_hda_codec_hdmi ttm ecb drm_kms_helper snd_hda_intel snd_hda_codec cfbfillrect cfbimgblt mptsas syscopyarea sysimgblt xts snd_hwdep sysfillrect snd_hda_core scsi_transport_sas fb_sys_fops cfbcopyarea xhci_pci drm snd_pcm xhci_hcd mptscsih mptbase snd_timer ctr usbcore tg3 scsi_mod cbc at24 snd powernv_flash drm_panel_orientation_quirks backlight libphy usb_common mtd soundcore ibmpowernv vmx_crypto opal_prd hwmon crc32c_vpmsum lz4 lz4_compress lz4_decompress zram zsmalloc
[ 3894.535880] CPU: 11 PID: 29593 Comm: debugedit Not tainted 4.20.0-rc1 #1
[ 3894.535886] NIP:  c000000000401d40 LR: c000000000250440 CTR: c000000000401cd0
[ 3894.535893] REGS: c0000003c6d9f560 TRAP: 0700   Not tainted  (4.20.0-rc1)
[ 3894.535898] MSR:  9000000000029032 <SF,HV,EE,ME,IR,DR,RI>  CR: 24082488  XER: 20040000
[ 3894.535913] CFAR: c000000000401d14 IRQMASK: 0 
               GPR00: c000000000250440 c0000003c6d9f7e8 c0000000011d5d00 c00a00800786e640 
               GPR04: 00003fffb263e000 c0002001e1b99186 0000000000000001 0000000000000000 
               GPR08: 0000000000000000 0000000000000000 c00a0080095480c8 0000000000000000 
               GPR12: 0000000024082488 c0000003ffffb400 1000000000000000 c0000000011e3918 
               GPR16: 00003fffb263e000 e61fffffffffffff c0002001e1b99186 c0002003e3a30960 
               GPR20: 00003fffb263f000 c00a00800786e640 c00000035c777ab8 c0000003c7692850 
               GPR24: c0000003645aa1f0 fffffffffffffe7f c0000003c55b3c98 0000000000000000 
               GPR28: ffffffffffffffff 00003fffb2649000 0000000000000000 c00a00800786e640 
[ 3894.535957] NIP [c000000000401d40] .ext4_set_page_dirty+0x70/0xb0
[ 3894.535965] LR [c000000000250440] .set_page_dirty+0xa0/0x150
[ 3894.535970] Call Trace:
[ 3894.535976] [c0000003c6d9f7e8] [c000000000247908] .__get_free_pages+0x18/0x70 (unreliable)
[ 3894.535984] [c0000003c6d9f858] [c000000000250440] .set_page_dirty+0xa0/0x150
[ 3894.535994] [c0000003c6d9f8d8] [c000000000298de0] .unmap_page_range+0xbf0/0xe10
[ 3894.536001] [c0000003c6d9fa68] [c000000000299364] .unmap_vmas+0x84/0x130
[ 3894.536009] [c0000003c6d9fb08] [c0000000002a3d98] .unmap_region+0xe8/0x190
[ 3894.536016] [c0000003c6d9fc38] [c0000000002a6470] .__do_munmap+0x2f0/0x510
[ 3894.536024] [c0000003c6d9fcf8] [c0000000002a6710] .__vm_munmap+0x80/0x110
[ 3894.536031] [c0000003c6d9fdb8] [c0000000002a67d4] .__se_sys_munmap+0x14/0x30
[ 3894.536040] [c0000003c6d9fe28] [c00000000000b3a4] system_call+0x5c/0x70
[ 3894.536046] Instruction dump:
[ 3894.536053] 71290008 4182002c e9230000 71292000 41820030 4bf6e8fd 60000000 38210070 
[ 3894.536066] e8010010 7c0803a6 4e800020 60000000 <0fe00000> e9230000 71292000 4082ffd8 
[ 3894.536080] irq event stamp: 18532
[ 3894.536087] hardirqs last  enabled at (18531): [<c00000000024a120>] .get_page_from_freelist+0x2b0/0x1940
[ 3894.536094] hardirqs last disabled at (18532): [<c000000000008f2c>] program_check_common+0x14c/0x160
[ 3894.536103] softirqs last  enabled at (18306): [<c0000000009b2384>] .__do_softirq+0x4b4/0x5e4
[ 3894.536111] softirqs last disabled at (18299): [<c0000000000b2128>] .irq_exit+0xf8/0x1b0
[ 3894.536117] ---[ end trace ca79a42e82947be5 ]---
Comment 1 Erhard F. 2018-11-07 18:14:44 UTC
Created attachment 279375 [details]
dmesg
Comment 2 Theodore Tso 2018-11-07 22:09:13 UTC
Can you reproduce this?   

And you may also want to send a report to linux-mm@kvack.org, since although it was tripped in ext4_set_page_dirty, the WARN_ON check which is it doing are fundamentally page specific, so this may very well be an mm problem:


static int ext4_set_page_dirty(struct page *page)
{
	WARN_ON_ONCE(!PageLocked(page) && !PageDirty(page));
	WARN_ON_ONCE(!page_has_buffers(page));
	return __set_page_dirty_buffers(page);
}
Comment 3 Erhard F. 2018-11-08 11:44:00 UTC
(In reply to Theodore Tso from comment #2)
> Can you reproduce this?

I will continue to run 4.20-rc kernels and report back if this occurs again. It has not popped up on 4.19.x on my ppc machines so far.

> And you may also want to send a report to linux-mm@kvack.org, since although
> it was tripped in ext4_set_page_dirty, the WARN_ON check which is it doing
> are fundamentally page specific, so this may very well be an mm problem:
> [...]

Done.
Comment 4 Erhard F. 2018-11-28 10:41:17 UTC
Next occurrence on 4.20-rc4, during building clang:

[ 1211.616339] WARNING: CPU: 18 PID: 15049 at fs/ext4/inode.c:3927 .ext4_set_page_dirty+0x70/0xb0
[ 1211.616469] Modules linked in: rpcsec_gss_krb5 auth_rpcgss nfsv4 dns_resolver nfs lockd grace sunrpc cfg80211 rfkill evdev radeon ghash_generic gf128mul ecb snd_hda_codec_hdmi xts snd_hda_intel i2c_algo_bit ttm mptsas snd_hda_codec scsi_transport_sas drm_kms_helper xhci_pci mptscsih cbc mptbase snd_hwdep xhci_hcd snd_hda_core cfbfillrect cfbimgblt syscopyarea tg3 vmx_crypto powernv_flash sysimgblt scsi_mod snd_pcm sysfillrect mtd fb_sys_fops opal_prd cfbcopyarea ibmpowernv crc32c_vpmsum drm usbcore snd_timer at24 snd libphy drm_panel_orientation_quirks backlight usb_common soundcore hwmon nx_compress 842 842_decompress 842_compress lz4 lz4_compress lz4_decompress zram zsmalloc
[ 1211.616544] CPU: 18 PID: 15049 Comm: debugedit Not tainted 4.20.0-rc4 #1
[ 1211.616549] NIP:  c0000000004279a0 LR: c000000000275740 CTR: c000000000427930
[ 1211.616554] REGS: c0002001b5e73550 TRAP: 0700   Not tainted  (4.20.0-rc4)
[ 1211.616557] MSR:  9000000000029032 <SF,HV,EE,ME,IR,DR,RI>  CR: 24082488  XER: 20040000
[ 1211.616568] CFAR: c000000000427974 IRQMASK: 0 
               GPR00: c000000000275740 c0002001b5e737e0 c0000000011ee200 c00a00000bb46ac0 
               GPR04: 00003fffa52a0000 c0000002ed1ab186 0000000000000001 0000000000000000 
               GPR08: 0000000000000000 0000000000000000 c00a00000a60e948 0000000000000000 
               GPR12: 0000000024082482 c0002003ff7ff280 1000000000000000 c0000000011fd918 
               GPR16: 00003fffa52a0000 e61fffffffffffff c0000002ed1ab186 c0000001b3442d48 
               GPR20: 00003fffa52a1000 c00a00000bb46ac0 c0000003ee23c7a8 c0002003f061a850 
               GPR24: c0002001b53a9500 fffffffffffffe7f c0002000d3450948 0000000000000000 
               GPR28: ffffffffffffffff 00003fffa530f000 0000000000000000 c00a00000bb46ac0 
[ 1211.616602] NIP [c0000000004279a0] .ext4_set_page_dirty+0x70/0xb0
[ 1211.616607] LR [c000000000275740] .set_page_dirty+0xa0/0x150
[ 1211.616611] Call Trace:
[ 1211.616616] [c0002001b5e737e0] [c00000000027fa40] .activate_page+0xf0/0x160 (unreliable)
[ 1211.616622] [c0002001b5e73850] [c000000000275740] .set_page_dirty+0xa0/0x150
[ 1211.616629] [c0002001b5e738d0] [c0000000002be260] .unmap_page_range+0xbf0/0xe10
[ 1211.616635] [c0002001b5e73a60] [c0000000002be7e4] .unmap_vmas+0x84/0x130
[ 1211.616640] [c0002001b5e73b00] [c0000000002c9218] .unmap_region+0xe8/0x190
[ 1211.616646] [c0002001b5e73c30] [c0000000002cb8f0] .__do_munmap+0x2f0/0x510
[ 1211.616651] [c0002001b5e73cf0] [c0000000002cbb90] .__vm_munmap+0x80/0x110
[ 1211.616656] [c0002001b5e73db0] [c0000000002cbc54] .__se_sys_munmap+0x14/0x30
[ 1211.616664] [c0002001b5e73e20] [c00000000000b3a4] system_call+0x5c/0x70
[ 1211.616668] Instruction dump:
[ 1211.616674] 71290008 4182002c e9230000 71292000 41820030 4bf6e5dd 60000000 38210070 
[ 1211.616684] e8010010 7c0803a6 4e800020 60000000 <0fe00000> e9230000 71292000 4082ffd8 
[ 1211.616694] irq event stamp: 22098
[ 1211.616699] hardirqs last  enabled at (22097): [<c00000000026ec5c>] .free_unref_page_list+0x21c/0x3e0
[ 1211.616705] hardirqs last disabled at (22098): [<c000000000008f2c>] program_check_common+0x14c/0x160
[ 1211.616712] softirqs last  enabled at (22008): [<c0000000009f9144>] .__do_softirq+0x4b4/0x5e4
[ 1211.616718] softirqs last disabled at (22001): [<c0000000000d1c28>] .irq_exit+0xf8/0x1b0
[ 1211.616722] ---[ end trace 97217f5745b818a0 ]---
Comment 5 Erhard F. 2018-11-28 11:00:59 UTC
Created attachment 279691 [details]
dmesg

Some more lockupgs after re-trying:

[ 2088.802884] INFO: task watch:20375 blocked for more than 120 seconds.
[ 2088.803203]       Tainted: G        W         4.20.0-rc4 #1
[ 2088.803209] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[ 2088.803215] watch           D    0 20375    762 0x00000000
[ 2088.803224] Call Trace:
[ 2088.803232] [c00000033c50f210] [c00000033c50f2b0] 0xc00000033c50f2b0 (unreliable)
[ 2088.803244] [c00000033c50f3f0] [c0000000000217f0] .__switch_to+0x310/0x510
[ 2088.803256] [c00000033c50f4a0] [c0000000009f0000] .__schedule+0x320/0xcd0
[ 2088.803262] [c00000033c50f590] [c0000000009f09e0] .schedule+0x30/0xb0
[ 2088.803270] [c00000033c50f600] [c000000000365090] .d_alloc_parallel+0x5d0/0xb00
[ 2088.803277] [c00000033c50f750] [c00000000034bdc8] .__lookup_slow+0xc8/0x250
[ 2088.803284] [c00000033c50f890] [c00000000034bf94] .lookup_slow+0x44/0x70
[ 2088.803291] [c00000033c50f920] [c000000000350928] .walk_component+0x298/0x4d0
[ 2088.803297] [c00000033c50fa00] [c0000000003516f0] .path_lookupat+0xa0/0x280
[ 2088.803307] [c00000033c50fac0] [c000000000353744] .filename_lookup+0xb4/0x190
[ 2088.803313] [c00000033c50fc40] [c000000000342df8] .vfs_statx+0xa8/0x190
[ 2088.803320] [c00000033c50fd10] [c0000000003437b4] .__se_sys_newstat+0x54/0xa0
[ 2088.803332] [c00000033c50fe20] [c00000000000b3a4] system_call+0x5c/0x70
[...]

More info see full dmesg.
Comment 6 Erhard F. 2018-11-28 17:14:07 UTC
Created attachment 279713 [details]
kernel .config (4.19.5)
Comment 7 Erhard F. 2018-11-28 17:19:39 UTC
Created attachment 279715 [details]
kernel config (4.20-rc4)

The 2nd lockup from this bug seems reproducible. Happens every time at a certain stage when I try to build clang-7.0.1-rc2 when running the Talos II on kernel 4.20-rc4. I need to sigkill the build process or it stalls forever. When I run the machine on kernel 4.19.5 the build runs flawlessly, no lockups.

Building takes place on a zram-partition (842 compressed, ext4), root filesystem is btrfs (zstd compressed).
Comment 8 Jan Kara 2018-12-11 11:21:17 UTC
The lockups and ext4 warnings are likely a different thing. So please report the lockup to linux-fsdevel@vger.kernel.org and also CC Al Viro <viro@zeniv.linux.org.uk> on it.

The WARN_ON we hit in ext4 is likely:

WARN_ON_ONCE(!PageLocked(page) && !PageDirty(page));

because we don't expect page cache page to be marked dirty when unmapping it from page tables. It could be a powerpc specific race in handling of page table entry (PTE) flags as I don't see how PTE could be marked dirty without a page being already dirty.

Seeing you hit this already twice on 4.20-rc's and never on 4.19.x, it could be some recent powerpc change that is triggering this. I've skimmed through powerpc changelogs and didn't find anything obviously matching but there were quite some mm changes pulled in this merge window. Powerpc guys, any idea how could _PAGE_DIRTY PTE bit get spuriously set? Normally, we should clear the _PAGE_DIRTY bit in page_mkclean() together with write-protecting the PTE and we are marking the page dirty whenever writeable PTE is created... So clean pagecache page with _PAGE_DIRTY in PTE is an invalid combination.
Comment 9 Erhard F. 2018-12-16 13:26:38 UTC
Created attachment 280031 [details]
dmesg (4.20-rc6 on a PowerMac G5 7,3)

Also seeing this on my G5 with kernel 4.20-rc6.
Comment 10 Erhard F. 2018-12-16 13:27:34 UTC
Created attachment 280033 [details]
kernel .config (4.20-rc6, PowerMac G5 7,3)
Comment 11 Benjamin Herrenschmidt 2018-12-16 20:47:42 UTC
Jan, I don't know for sure what problems triggers it here, but there definitely is a HW bug on POWER9 with the "NestMMU" (the MMU that services coprocessors and the nvlink), which can cause a dirty bit to be set on a read-only PTE if it races with the PTE update & invalidation.

It was "assumed" this was ok. If not I don't know how to work around it.

However, at least in the above cases, I don't see how we would be setting dirty if we don't have RW access permission. A PowerMac G5 uses the hash MMU, we set dirty atomically with checking the PTE permission when hashing a PTE in.
Comment 12 Jan Kara 2018-12-17 10:58:04 UTC
Thanks for info Ben! First one slightly off-topic comment: Spuriously setting dirty bits in page tables is certainly not OK for file systems. This results in set_page_dirty() being called for a page when it gets unmapped and that means that file system is told there's new data in the page. But the file system is not prepared for that - it needs to prepare its structures for data modification (usually happens in ->page_mkwrite callback during page fault) like making sure there's enough free space, do COW on the page in case it was reflinked etc. As you say, this is likely not a problem of this particular reporter but I wanted to make sure the message gets across :).

Now to this report: It's good to know that setting of PTE dirty bit should be atomic at least on G5 wrt concurrent invalidation and page write protection. But something seems to have happened with 4.20-rcs that dirty bits do get set as Erhard started to see these warnings. I haven't heard any report of this from x86, this seems to be ppc specific. Since it happens once in couple of weeks, bisection is not really an option. I guess, I'll write a debug patch to dump PTE and page state when such problematic situation occurs and maybe that'll tell us something more to debug this.
Comment 13 Jan Kara 2018-12-17 11:30:00 UTC
Created attachment 280051 [details]
mm: Dump page and PTE state

Erhard, could you please run your machines with this patch applied and report what gets recorded when the warning hits again? Thanks!
Comment 14 Erhard F. 2018-12-17 11:51:18 UTC
(In reply to Jan Kara from comment #13)
> Erhard, could you please run your machines with this patch applied and
> report what gets recorded when the warning hits again? Thanks!

Sure! How do I apply the patch? Should I stay at rc6 or use the new rc7?
Comment 15 Jan Kara 2018-12-17 12:15:41 UTC
rc6 or rc7 doesn't really matter. You can apply the patch as any other kernel patch:

cd "kernel-source-directory"
patch -p1 <"downloaded-patch"

now compile & install the kernel - I'm assuming that since you test rcs, you know how to do that. If not, I can dig out some instructions for that.
Comment 16 Erhard F. 2018-12-17 16:51:39 UTC
Created attachment 280053 [details]
dmesg (4.20-rc7 on a PowerMac G5 7,3)

Interesting! Both machines did not trigger the bug yet, but your patch reports "Clean page (index 0, ino 4223082, flags 3ff00000002014) with dirty PTE (pte_val c82000025beab386) on unmap!"-like messages during boot. 2 times on the G5, one time on the Talos II.
Comment 17 Erhard F. 2018-12-17 16:52:42 UTC
Created attachment 280055 [details]
dmesg (4.20-rc7 on a Talos II)
Comment 18 Erhard F. 2018-12-17 17:50:28 UTC
Created attachment 280057 [details]
journalctl -k (4.20-rc7 on a Talos II)

Ok, here we go now. The bug has been triggered again by building a large project (llvm), seemingly during the install phase. Building takes place on a zram-partition (lz4 compressed, ext4), root filesystem is btrfs (zstd compressed).
Comment 19 Jan Kara 2018-12-18 08:45:50 UTC
Hum, interesting. The page we barfed on is:

Clean page (index 92, ino 1163533, flags 3ffe0000002016) with dirty PTE (pte_val c0000001ef7b3186) on unmap!

Page flags thus decode to: PG_referenced | PG_uptodate | PG_lru | PG_private
PTE bits decode to: _PAGE_WRITE | _PAGE_READ | _PAGE_DIRTY | _PAGE_ACCESSED | _PAGE_PTE | _PAGE_PRESENT

So indeed the page was writeably mapped. The question now is, how the page could have been writeably mapped without getting marked dirty or how could the dirty bit get cleared while the page stayed writeably mapped.

I'll poke more into the code and probably come up with further debugging patch.
Comment 20 Benjamin Herrenschmidt 2018-12-18 22:27:49 UTC
Pardon my ignorance here, but I was under the impression that this was normal :) The whole point being that the PTE dirty bit gets set when accesses happen, and latter on gets harvested into the struct page dirty bit.

Otherwise, what would be the point of having a PTE dirty bit in the PTE at all or transferring a dirty bit from PTE to struct page in try_to_unmap_one() (for example) if a writable page is always mapped dirty ?
Comment 21 Jan Kara 2018-12-19 13:39:16 UTC
(In reply to Benjamin Herrenschmidt from comment #20)
> Pardon my ignorance here, but I was under the impression that this was
> normal :) The whole point being that the PTE dirty bit gets set when
> accesses happen, and latter on gets harvested into the struct page dirty bit.
> 
> Otherwise, what would be the point of having a PTE dirty bit in the PTE at
> all or transferring a dirty bit from PTE to struct page in
> try_to_unmap_one() (for example) if a writable page is always mapped dirty ?

The PTE dirty bit is used for anonymous pages as you describe. File pages are marked dirty immediately during page fault (look for example at mm/memory.c:wp_page_shared() and its call to fault_dirty_shared_page()). When page gets written back, clear_page_dirty_for_io() is used which writeprotects all PTEs, clears PTE dirty bits, and then page dirty bit is cleared as well. So for shared file mappings PTE dirty bits could be ignored.

The reason why PTE dirty bits are not ignored are partly heritage from before 2006 when this mechanism was introduced and party because people do it "just to be sure"...
Comment 22 Jan Kara 2018-12-19 17:03:32 UTC
Created attachment 280097 [details]
mm: Dump page and PTE state and some history

New debug patch to try which will also dump more about what was happening with the page. Can you please try running with this patch and report when the warning hits? Thanks!
Comment 23 Erhard F. 2018-12-19 22:58:15 UTC
Created attachment 280101 [details]
journalctl -k (4.20-rc7 on a Talos II)

Results with newest patch.
Comment 24 Benjamin Herrenschmidt 2018-12-19 23:03:15 UTC
Out of curiosity, how does this work in the context of pages that are GUPed either by a driver or by KVM when using file backed guest memory ?

GUP will not "see" the mkclean or attempt at making read only...
Comment 25 Aneesh Kumar KV 2018-12-20 04:15:15 UTC
(In reply to Benjamin Herrenschmidt from comment #24)
> Out of curiosity, how does this work in the context of pages that are GUPed
> either by a driver or by KVM when using file backed guest memory ?
> 
> GUP will not "see" the mkclean or attempt at making read only...

page_mkclean_one do call
 	mmu_notifier_invalidate_range_start(vma->vm_mm, start, end);
Comment 26 Aneesh Kumar KV 2018-12-20 04:21:41 UTC
Dez 19 23:53:09 T1000 kernel: BUG: Bad rss-counter state mm:0000000065804292 idx:1 val:2
Dez 19 23:53:09 T1000 kernel: BUG: non-zero pgtables_bytes on freeing mm: 16384

I am wondering whether we have a wrong usage of a page here? One of the feature we enabled in the the recent kernel is THP migration (a0820ff33451aa1a12ec66812e229ac58beb3f24) . But I guess we are not using that feature. Maybe to isolate this further, can we disable transparent huge page for all the tests?
Comment 27 Jan Kara 2018-12-20 07:41:28 UTC
Created attachment 280103 [details]
mm: Dump page and PTE state and some history (v2)

I'm sorry but the patch had a bug and didn't set time when page_mkwrite() has been called for the most common ext4 path which is taken on your machine. Here is a fixed up patch. Can you please test again!

Also it seems you are now able to trigger the warning in under an hour? In that case it would be plausible to bisect the kernel if the directed debugging fails. We'll see...
Comment 28 Jan Kara 2018-12-20 07:49:03 UTC
(In reply to Benjamin Herrenschmidt from comment #24)
> Out of curiosity, how does this work in the context of pages that are GUPed
> either by a driver or by KVM when using file backed guest memory ?
> 
> GUP will not "see" the mkclean or attempt at making read only...

That's a good question :) Well behaved GUP users use mmu notifiers as Aneesh wrote to "see" mkclean. Badly behaved GUP users (and there's quite a few of them) don't and that can result in kernel crashes, data loss, or checksum failures in storage stack. The latest discussion on this topic is here if you're interested: https://lwn.net/Articles/774411/
Comment 29 Benjamin Herrenschmidt 2018-12-20 08:21:10 UTC
The problem is of course that not everybody *can* use the MMU notifiers.

Say I am KVM on POWER9, with the currently work-in-progress feature (unmerged patches) of exploiting the new HW support for interrupt virtualization.

In that context, the guest allocates a page for receiving interrupt events (it's a ring buffer) and "registers" it with the hypervisor (a hypercall). The HV gups it and passes the physical address to the HW, which will write to it from then on.

There's *nothing* KVM can do when getting the MMU notifier. The MM simply MUST NOT try to get rid of that page, it's going to be actively under HW use until the VM terminates.

How do we do that safely ?

There is a very similar problem when doing PCI pass-through. The guest pages are GUP'ed and put into the IOMMU so the devices can DMA to them. Here too, there's nothing useful KVM can do on an MMU notifier, those pages must remain pinned until either the guest is terminated or the IOMMU mapping is removed (in the case where it's done via hypercalls).

So how do we do that in such a way that doesn't involve all the crashes, data loss etc... that you mention ?

Talking of which, I noticed the AMD GPU driver in the call traces above, could it be a similar case of fun with GUP ?
Comment 30 Aneesh Kumar KV 2018-12-20 08:35:06 UTC
Dez 19 23:53:25 T1000 kernel: Clean page (index 72, ino 681324, i_size 715000, flags 3ffe0000002016) with dirty PTE (pte_val c0000006d3ea9186) on unmap!

Those pte bits value doesn't look correct. We expect the pfn value to be  64k aligned. ie, I should have found the last 4 values to be 0186. What is the page size configured? is that 64K?

-aneesh
Comment 31 Jan Kara 2018-12-20 09:10:34 UTC
(In reply to Aneesh Kumar KV from comment #30)
> Dez 19 23:53:25 T1000 kernel: Clean page (index 72, ino 681324, i_size
> 715000, flags 3ffe0000002016) with dirty PTE (pte_val c0000006d3ea9186) on
> unmap!
> 
> Those pte bits value doesn't look correct. We expect the pfn value to be 
> 64k aligned. ie, I should have found the last 4 values to be 0186. What is
> the page size configured? is that 64K?

There's .config attached to this bugzilla. It has:

CONFIG_PPC_4K_PAGES=y
# CONFIG_PPC_64K_PAGES is not set
Comment 32 Jan Kara 2018-12-20 09:29:42 UTC
(In reply to Benjamin Herrenschmidt from comment #29)
> The problem is of course that not everybody *can* use the MMU notifiers.

Yes, I'm aware of that. So my comment about "well-behaved users" was probably a bit overstated ;)
 
> Say I am KVM on POWER9, with the currently work-in-progress feature
> (unmerged patches) of exploiting the new HW support for interrupt
> virtualization.
> 
> In that context, the guest allocates a page for receiving interrupt events
> (it's a ring buffer) and "registers" it with the hypervisor (a hypercall).
> The HV gups it and passes the physical address to the HW, which will write
> to it from then on.
> 
> There's *nothing* KVM can do when getting the MMU notifier. The MM simply
> MUST NOT try to get rid of that page, it's going to be actively under HW use
> until the VM terminates.
> 
> How do we do that safely ?

Well, unless that page allocated by KVM comes from shared file mapping (and from what you write there's no reason for it to be from such mapping), there's nothing to worry about. GUP will get you page reference so that stops page reclaim from reclaiming the page and for anonymous pages, there's no filesystem trying to do anything clever about the page (like writing it back to disk).

> There is a very similar problem when doing PCI pass-through. The guest pages
> are GUP'ed and put into the IOMMU so the devices can DMA to them. Here too,
> there's nothing useful KVM can do on an MMU notifier, those pages must
> remain pinned until either the guest is terminated or the IOMMU mapping is
> removed (in the case where it's done via hypercalls).
> 
> So how do we do that in such a way that doesn't involve all the crashes,
> data loss etc... that you mention ?

This case is more difficult as often the pages you want to DMA into can eventually come from userspace and thus can be from a shared file mapping (if not, then again there's no problem). Essentially what you describe seems to be a similar problem like RDMA (Infiniband and similar drivers) currently has. And currently there's no good solution for that. We're trying to figure out how to fix this but it's difficult - tons of GUP users, tons of filesystems, some GUP users are very performance sensitive, and it's not like you have much space in struct page for any tracking...

> Talking of which, I noticed the AMD GPU driver in the call traces above,
> could it be a similar case of fun with GUP ?

Where did you see AMD GPU? I didn't find it in the last dmesg and .config doesn't seem to have anything GPU-related enabled. Also this particular problem seems to be triggered by heavy compilation so it seems to be something else than these GUP issues.
Comment 33 Benjamin Herrenschmidt 2018-12-21 00:26:15 UTC
Jan,

For KVM, it can... you can back the guest using a file, and there's some interest in doing just that to back guests using large "new memory" pmem stuff for example.

This actually leads to more generic variant of that issue which is the need to be able to provide proper "pinned" memory for some adapters in guest even if they use ATSD, as not all adapters can deal with faults on everything (for example they might be able to deal with fault on user buffers but not the descriptor ring).

A discussion for another day I suppose..

As for AMD, I was looking at attachment https://bugzilla.kernel.org/attachment.cgi?id=280057 posted in comment 18
Comment 34 Erhard F. 2018-12-21 11:16:44 UTC
(In reply to Jan Kara from comment #27)
> [...]
> Also it seems you are now able to trigger the warning in under an hour? In
> that case it would be plausible to bisect the kernel if the directed
> debugging fails. We'll see...

For now building llvm-7.0.1-rc2 with tests in my ext4 zram disk (lz4 compressed) triggers the bug reliably. I could try to bisect, but not the next few days or so. ;)
Comment 35 Erhard F. 2018-12-21 11:17:45 UTC
Created attachment 280117 [details]
journalctl -k (4.20-rc7 on a Talos II)

Results with latest patch.
Comment 36 Erhard F. 2018-12-21 11:52:36 UTC
(In reply to Benjamin Herrenschmidt from comment #29)
> [...]
> Talking of which, I noticed the AMD GPU driver in the call traces above,
> could it be a similar case of fun with GUP ?

If you are talking about the "[    4.883250] radeon: No suitable DMA available" message, this has also been the case before kernel 4.20-rc*. There also have been some DMA-related patches from https://git.raptorengineering.com/git/op-kernel-dev/log/?h=raptor-dma-dev-4.18-pre. Don't know if they reslove this error message of if they have been taken upstream already.

The card is BTW a Radeon HD 6450, which is running fine on the Talos II.
Comment 37 Benjamin Herrenschmidt 2018-12-21 23:47:02 UTC
I would advise against those DMA patches. If the GPU tries to map an sglist entry that crosses a 1G boundary, bad things will happen.
Comment 38 Erhard F. 2018-12-22 14:13:29 UTC
(In reply to Benjamin Herrenschmidt from comment #37)
> I would advise against those DMA patches. If the GPU tries to map an sglist
> entry that crosses a 1G boundary, bad things will happen.
Did not install them, I am running a vanilla kernel. So if they have not been upstreamed they are not in my kernel. ;)
Comment 39 Jan Kara 2019-01-02 15:10:10 UTC
(In reply to erhard_f from comment #35)
> Created attachment 280117 [details]
> journalctl -k (4.20-rc7 on a Talos II)
> 
> Results with latest patch.

Thanks for the results! Erhard, are you sure this kernel was with the latest version of the patch? Because the line:

page 137: b_state 21, b_blocknr 2726281, b_mapped 0, b_cleaned 1498120697108, now 1498125595648

shows that b_mapped is zero effectively meaning that ext4_block_mkwrite() was never called for the page. Which would be very strange for a page that has writeable PTE.
Comment 40 Erhard F. 2019-01-02 15:46:28 UTC
(In reply to Jan Kara from comment #39)
> Thanks for the results! Erhard, are you sure this kernel was with the latest
> version of the patch?

I am pretty sure it was the latest patch at that time. But I can re-test with 4.20 or 4.21-rc1 when it's out.
Comment 41 Erhard F. 2019-01-08 23:00:30 UTC
Created attachment 280345 [details]
journalctl -k (5.0-rc1 on a Talos II)

Results with latest patch (20.12.2018).

So far building llvm-7 + tests in my ext4-zramdisk reliably triggers it.
Comment 42 Erhard F. 2019-01-08 23:02:12 UTC
Created attachment 280347 [details]
kernel .config (5.0-rc1, Talos II)
Comment 43 Jan Kara 2019-01-17 10:32:16 UTC
Thanks for gathering the results and sorry for not replying earlier. I've been busy with other work... So again I can see in the logs:

Clean page (index 137, ino 1820427, i_size 1144448, flags 3ffe0000002016) with dirty PTE (pte_val c0000005f899f186) on unmap!
WARNING: CPU: 3 PID: 1532 at fs/ext4/inode.c:3935 .ext4_set_page_dirty+0x80/0x170
...
page 137: b_state 21, b_blocknr 2726281, b_mapped 0, b_cleaned 1498120697108, now 1498125595648

So the page has been submitted for writeout some 5 ms ago but still we didn't get b_mapped set at all which would mean that ext4_page_mkwrite() got never called for the page - which is very weird. I'll ponder more about it possibly come up with more debugging.
Comment 44 Jan Kara 2019-01-17 11:48:54 UTC
Created attachment 280551 [details]
mm: Dump page and PTE state and some history (v3)

OK, new version of the debug patch that also tracks when last PTE for a page got set and dumps some more information about VMA and the page. Please report what the kernel prints now. Thanks a lot!
Comment 45 Erhard F. 2019-01-22 23:09:29 UTC
(In reply to Jan Kara from comment #44)
> OK, new version of the debug patch that also tracks when last PTE for a page
> got set and dumps some more information about VMA and the page. Please
> report what the kernel prints now. Thanks a lot!

Thanks for being persistent though this bug does not seem to affect many people!
Comment 46 Erhard F. 2019-01-22 23:10:11 UTC
Created attachment 280679 [details]
journalctl -k (5.0-rc3 on a Talos II)
Comment 47 Jan Kara 2019-01-23 16:40:31 UTC
OK, so it seems to be more and more clear that PPC indeed has some race in page table updates. What I can see in the latest report is:

Clean page (index 92, ino 681741, i_size 828368, flags 7fff0000002016, mapcount 1) with dirty PTE (pte_val c0000005f7fae186) on unmap! Vma flags fb, pgoff 0, file ino 681741
...
page 92: b_state 21, b_blocknr 2801084, b_mapped 1452389112002, b_mapped2 0, b_cleaned 1452396217779, now 1452400395514

So "Vma flags fb" shows its a normal shared, writeable file mapping. Page is somewhere in the middle of the file (file size is 828368, page is at offset 376832). The page has been writeably mapped 11ms ago (you are using ext2 filesystem which was confusing my previous debug attempts so only this one has shown proper times) and written back 4ms ago (which should have writeprotected the pte) but we still have writeable pte now on which the assertion hits. So either page_mkclean() failed to clear the PTE or someone created new writeable PTE without telling ext4.

I'll attach a new version of debug patch to distinguish these two cases.
Comment 48 Jan Kara 2019-01-23 16:41:39 UTC
Created attachment 280697 [details]
mm: Dump page and PTE state and some history (v4)

Please run with this new version of the debug patch and report results. Thanks!
Comment 49 Aneesh Kumar KV 2019-01-24 03:17:48 UTC
Can you try this change?

diff --git a/arch/powerpc/mm/tlb-radix.c b/arch/powerpc/mm/tlb-radix.c
index 6a23b9ebd2a1..c42f91ba7067 100644
--- a/arch/powerpc/mm/tlb-radix.c
+++ b/arch/powerpc/mm/tlb-radix.c
@@ -644,10 +644,12 @@ void radix__flush_tlb_page_psize(struct mm_struct *mm, unsigned long vmaddr,
        preempt_disable();
        smp_mb(); /* see radix__flush_tlb_mm */
        if (!mm_is_thread_local(mm)) {
+/*
                if (unlikely(mm_is_singlethreaded(mm))) {
                        exit_flush_lazy_tlbs(mm);
                        goto local;
                }
+*/
                _tlbie_va(vmaddr, pid, psize, RIC_FLUSH_TLB);
        } else {
 local:
Comment 50 Aneesh Kumar KV 2019-01-24 03:40:18 UTC
(In reply to Jan Kara from comment #47)
> OK, so it seems to be more and more clear that PPC indeed has some race in
> page table updates. What I can see in the latest report is:
> 
> Clean page (index 92, ino 681741, i_size 828368, flags 7fff0000002016,
> mapcount 1) with dirty PTE (pte_val c0000005f7fae186) on unmap! Vma flags
> fb, pgoff 0, file ino 681741
> ...
> page 92: b_state 21, b_blocknr 2801084, b_mapped 1452389112002, b_mapped2 0,
> b_cleaned 1452396217779, now 1452400395514
> 
> So "Vma flags fb" shows its a normal shared, writeable file mapping. Page is
> somewhere in the middle of the file (file size is 828368, page is at offset
> 376832). The page has been writeably mapped 11ms ago (you are using ext2
> filesystem which was confusing my previous debug attempts so only this one
> has shown proper times) and written back 4ms ago (which should have
> writeprotected the pte) but we still have writeable pte now on which the
> assertion hits. So either page_mkclean() failed to clear the PTE or someone
> created new writeable PTE without telling ext4.
> 
> I'll attach a new version of debug patch to distinguish these two cases.

The fact that we did try to write out the page at (bh_cleaned 1452396217779)implies we should have cleared the _PAGE_WRITE bit right (clear_page_dirty_for_io())? So we should either find that bit cleared in pte (if we missed a related tlb flush and tlb still has that pte with _PAGE_WRITE) or we find that set. In this case, we find _PAGE_WRITE set in the pte during zap. Does that imply we did call finish_fault()? which should have ideally resulted in we calling page_mkwrite().

I am still not clear what could be a possible race that can result in this?

-aneesh
Comment 51 Jan Kara 2019-01-24 08:15:52 UTC
(In reply to Aneesh Kumar KV from comment #50)
> (In reply to Jan Kara from comment #47)
> > OK, so it seems to be more and more clear that PPC indeed has some race in
> > page table updates. What I can see in the latest report is:
> > 
> > Clean page (index 92, ino 681741, i_size 828368, flags 7fff0000002016,
> > mapcount 1) with dirty PTE (pte_val c0000005f7fae186) on unmap! Vma flags
> > fb, pgoff 0, file ino 681741
> > ...
> > page 92: b_state 21, b_blocknr 2801084, b_mapped 1452389112002, b_mapped2
> 0,
> > b_cleaned 1452396217779, now 1452400395514
> > 
> > So "Vma flags fb" shows its a normal shared, writeable file mapping. Page
> is
> > somewhere in the middle of the file (file size is 828368, page is at offset
> > 376832). The page has been writeably mapped 11ms ago (you are using ext2
> > filesystem which was confusing my previous debug attempts so only this one
> > has shown proper times) and written back 4ms ago (which should have
> > writeprotected the pte) but we still have writeable pte now on which the
> > assertion hits. So either page_mkclean() failed to clear the PTE or someone
> > created new writeable PTE without telling ext4.
> > 
> > I'll attach a new version of debug patch to distinguish these two cases.
> 
> The fact that we did try to write out the page at (bh_cleaned
> 1452396217779)implies we should have cleared the _PAGE_WRITE bit right
> (clear_page_dirty_for_io())? 

Yes, clear_page_dirty_for_io() calls page_mkclean() which clears _PAGE_WRITE bit. So at b_cleaned time there should be no writeable PTE.

> So we should either find that bit cleared in
> pte (if we missed a related tlb flush and tlb still has that pte with
> _PAGE_WRITE) or we find that set. In this case, we find _PAGE_WRITE set in
> the pte during zap. Does that imply we did call finish_fault()? which should
> have ideally resulted in we calling page_mkwrite().

The race is not clear to me either but the rule is that if you are creating writeable PTE for a page, you must call ->page_mkwrite(). And from the debug output page_mkclean() was called and no ->page_mkwrite() after that so there should be no writeable PTE. But somehow there is one as zapping reports so we need to find out who and when creates it without calling ->page_mkwrite(). New version of my debug patch should tell us a bit more.

Note that there are other places that play with PTEs other than fault - like page migration, mremap, mprotect, etc. All these seem to properly use PTE locks to serialize with page_mkclean() but well... reality is what it is and there must be bug somewhere :) After all there are close to 200 calls of set_pte_at() in the kernel...
Comment 52 Erhard F. 2019-01-26 16:03:30 UTC
(In reply to Aneesh Kumar KV from comment #49)
> Can you try this change?
> 
> diff --git a/arch/powerpc/mm/tlb-radix.c b/arch/powerpc/mm/tlb-radix.c
> index 6a23b9ebd2a1..c42f91ba7067 100644
> --- a/arch/powerpc/mm/tlb-radix.c
> +++ b/arch/powerpc/mm/tlb-radix.c
> @@ -644,10 +644,12 @@ void radix__flush_tlb_page_psize(struct mm_struct *mm,
> unsigned long vmaddr,
>         preempt_disable();
>         smp_mb(); /* see radix__flush_tlb_mm */
>         if (!mm_is_thread_local(mm)) {
> +/*
>                 if (unlikely(mm_is_singlethreaded(mm))) {
>                         exit_flush_lazy_tlbs(mm);
>                         goto local;
>                 }
> +*/
>                 _tlbie_va(vmaddr, pid, psize, RIC_FLUSH_TLB);
>         } else {
>  local:

Should I apply this in addition to Jan's patch or standalone?
Comment 53 Erhard F. 2019-01-27 14:48:42 UTC
Created attachment 280809 [details]
journalctl -k (5.0-rc3 on a Talos II)

Results with Jans v4 patch.
Comment 54 Jan Kara 2019-01-29 09:40:03 UTC
Thanks for the results Erhard. Finally something to go on in the debug results. The pages have been normally faulted in for writing and PTE got set at time 1036580120591 from wp_page_reuse(). But then, although we've called clear_page_dirty_for_io() which calls page_mkclean() (at time 1036587452118), we've never reached the debug point in page_mkclean_one() where PTE write bit gets cleared. So it looks that for some reason page_mkclean() doesn't do what it should on your machines.

I was considering a race with munmap() but page_mkclean() uses mapping->i_mmap tree to find the VMA and we remove vmas from that tree only after unmap_page_range() is complete so that does not look possible. Also it is very strange because this is common code for all architectures but you seem to be the only one hitting the problem so far.

Aneesh, any idea?

Anyway, I've added more debugging to page_mkclean() to see which paths it is taking and maybe that'll tell us something.
Comment 55 Jan Kara 2019-01-29 09:41:02 UTC
Created attachment 280841 [details]
mm: Dump page and PTE state and some history (v5)

Erhard, please test with this debug patch. Thanks!
Comment 56 Erhard F. 2019-01-29 12:13:13 UTC
(In reply to Jan Kara from comment #55)
> Created attachment 280841 [details]
> mm: Dump page and PTE state and some history (v5)
> 
> Erhard, please test with this debug patch. Thanks!

I would, but the patch does not apply against -rc4. ;)

/usr/src/linux-5.0-rc4 # patch -p1 < mm.patch 
patching file fs/ext4/file.c
patching file fs/ext4/inode.c
Hunk #3 succeeded at 3933 (offset 1 line).
Hunk #4 succeeded at 6327 (offset 26 lines).
patching file include/linux/buffer_head.h
patching file mm/memory.c
Hunk #2 succeeded at 1081 (offset -1 lines).
Hunk #6 succeeded at 3253 (offset 23 lines).
patching file mm/page_vma_mapped.c
patching file mm/rmap.c
Hunk #3 succeeded at 894 with fuzz 2.
Hunk #4 FAILED at 902.
Hunk #5 succeeded at 921 (offset 2 lines).
Hunk #6 succeeded at 979 (offset 2 lines).
1 out of 6 hunks FAILED -- saving rejects to file mm/rmap.c.rej
Comment 57 Aneesh Kumar KV 2019-01-29 14:48:14 UTC
Wondering whether that rmap_walk_file and copy_page_range running in parallel can result in something like that?
Comment 58 Jan Kara 2019-01-30 12:53:39 UTC
Created attachment 280871 [details]
mm: Dump page and PTE state and some history (v6)

Yeah, that was just a context difference. Rebased the patch on top of 5.0-rc4.
Comment 59 Jan Kara 2019-01-30 12:59:32 UTC
(In reply to Aneesh Kumar KV from comment #57)
> Wondering whether that rmap_walk_file and copy_page_range running in
> parallel can result in something like that?

Well, but we have shared file VMA here so I believe copy_page_range() will just bail early and won't do anything?
Comment 60 Erhard F. 2019-01-31 00:31:41 UTC
(In reply to Jan Kara from comment #54)
> Also it is very strange because this is common code for all architectures but
> seem to be the only one hitting the problem so far.

Maybe there are not that may people running a debug kernel on ppc64, building loads of stuff on an ext2 (sorry for originally stating it as ext4, I just forgot to pay attention to this detail) zram disk. ;)
Comment 61 Erhard F. 2019-01-31 00:32:46 UTC
Created attachment 280881 [details]
journalctl -k (5.0-rc4 on a Talos II)

Results with Jans v6 patch.
Comment 62 Jan Kara 2019-01-31 09:10:32 UTC
Erhard, thanks for the results! They are very weird but pretty clear: cleaned state is 0x10 so page_vma_mapped_walk() called from page_mkclean() has bailed out somewhere between:

        if (bh)
                bh->b_cleaned_state |= 16;
restart:
        pgd = pgd_offset(mm, pvmw->address);
        if (!pgd_present(*pgd))
                return false;
        p4d = p4d_offset(pgd, pvmw->address);
        if (!p4d_present(*p4d))
                return false;
        pud = pud_offset(p4d, pvmw->address);
        if (!pud_present(*pud))
                return false;
        if (bh)
                bh->b_cleaned_state |= 32;

I really don't understand how any of these tests could fail when later unmap_page_range() has found the pte. But I somehow suspect commit 
da7ad366b497f5 "powerpc/mm/book3s: Update pmd_present to look at _PAGE_PRESENT bit" could be related since it was introduced in the window where the problems started happening and it touches relevant code.

Aneesh, what needs to be reverted so that we can test that this commit is indeed the problematic one? I suppose:

da7ad366b497 "powerpc/mm/book3s: Update pmd_present to look at _PAGE_PRESENT bit"
f1981b5b302f "powerpc/mm/hugetlb/book3s: add _PAGE_PRESENT to hugepd pointer."
ae28f17b5eeb "powerpc/mm/book3s: Check for pmd_large instead of pmd_trans_huge"
75646c480fde "arch/powerpc/mm/hash: validate the pte entries before handling the hash fault"
8890e03380d3 "powerpc/mm/thp: update pmd_trans_huge to check for pmd_present"
a0820ff33451 "powerpc/mm:book3s: Enable THP migration support"
Comment 63 Aneesh Kumar KV 2019-02-04 06:37:07 UTC
(In reply to Jan Kara from comment #62)
> Erhard, thanks for the results! They are very weird but pretty clear:
> cleaned state is 0x10 so page_vma_mapped_walk() called from page_mkclean()
> has bailed out somewhere between:
> 
>         if (bh)
>                 bh->b_cleaned_state |= 16;
> restart:
>         pgd = pgd_offset(mm, pvmw->address);
>         if (!pgd_present(*pgd))
>                 return false;
>         p4d = p4d_offset(pgd, pvmw->address);
>         if (!p4d_present(*p4d))
>                 return false;
>         pud = pud_offset(p4d, pvmw->address);
>         if (!pud_present(*pud))
>                 return false;
>         if (bh)
>                 bh->b_cleaned_state |= 32;
> 
> I really don't understand how any of these tests could fail when later
> unmap_page_range() has found the pte. But I somehow suspect commit 
> da7ad366b497f5 "powerpc/mm/book3s: Update pmd_present to look at
> _PAGE_PRESENT bit" could be related since it was introduced in the window
> where the problems started happening and it touches relevant code.
> 
> Aneesh, what needs to be reverted so that we can test that this commit is
> indeed the problematic one? I suppose:
> 
> da7ad366b497 "powerpc/mm/book3s: Update pmd_present to look at _PAGE_PRESENT
> bit"
> f1981b5b302f "powerpc/mm/hugetlb/book3s: add _PAGE_PRESENT to hugepd
> pointer."
> ae28f17b5eeb "powerpc/mm/book3s: Check for pmd_large instead of
> pmd_trans_huge"
> 75646c480fde "arch/powerpc/mm/hash: validate the pte entries before handling
> the hash fault"
> 8890e03380d3 "powerpc/mm/thp: update pmd_trans_huge to check for pmd_present"
> a0820ff33451 "powerpc/mm:book3s: Enable THP migration support"



We can revert the first and the last patch. I am attaching the patches this got the revert failures fixed.

-aneesh
Comment 64 Aneesh Kumar KV 2019-02-04 06:38:53 UTC
Created attachment 280941 [details]
Revert-powerpc-mm-book3s-Update-pmd_present
Comment 65 Aneesh Kumar KV 2019-02-04 06:39:28 UTC
Created attachment 280943 [details]
Revert-powerpc-mm-book3s-Enable-THP-migration
Comment 66 Erhard F. 2019-02-04 22:30:30 UTC
Currently I am testing this, the 2 patches from Aneesh applied and the one from Jan. So far I was not able to provoke the original bug. I faced however another one, don't know wether it's related:

[...]
Feb 04 23:11:43 T1000 kernel: BUG: Kernel NULL pointer dereference at 0x00000028
Feb 04 23:11:43 T1000 kernel: Faulting instruction address: 0xc0000000002a20ac
Feb 04 23:11:43 T1000 kernel: Oops: Kernel access of bad area, sig: 11 [#1]
Feb 04 23:11:43 T1000 kernel: BE SMP NR_CPUS=192 NUMA PowerNV
Feb 04 23:11:43 T1000 kernel: Modules linked in: rpcsec_gss_krb5 auth_rpcgss nfsv4 dns_resolver nfs lockd grace sunrpc sr_mod cdrom cfg80211 rfkill evdev radeon ghash_generic gf128mul ecb xts snd_hda_codec_hdmi>
Feb 04 23:11:43 T1000 kernel: CPU: 0 PID: 218 Comm: kswapd0 Not tainted 5.0.0-rc4 #2
Feb 04 23:11:43 T1000 kernel: NIP:  c0000000002a20ac LR: c0000000002a209c CTR: c0000000002bd280
Feb 04 23:11:43 T1000 kernel: REGS: c0000007f8c76fa0 TRAP: 0300   Not tainted  (5.0.0-rc4)
Feb 04 23:11:43 T1000 kernel: MSR:  900000000280b032 <SF,HV,VEC,VSX,EE,FP,ME,IR,DR,RI>  CR: 44222288  XER: 00000000
Feb 04 23:11:43 T1000 kernel: CFAR: c000000000281724 DAR: 0000000000000028 DSISR: 40000000 IRQMASK: 0 
                              GPR00: c0000000002a5a48 c0000007f8c77230 c0000000011e2a00 c0000007e2cb8000 
                              GPR04: c000000000bebb70 00003fff92170000 0000000000000040 c0000007f4f9e600 
                              GPR08: c0000000011f54d8 0000000000000000 0000000000000000 0000000000000000 
                              GPR12: 0000000024222282 c000000001e2b000 c0000007e2cb8000 0000000000000000 
                              GPR16: 0000000000000001 c00a00001ed74ac0 c0000007f8c77798 0000000000000000 
                              GPR20: c00a00001ed74ac8 5deadbeef0000100 0000000020000000 c0000007f4f9e718 
                              GPR24: c0000007f4f9ea40 7c00000000000000 7800000000000000 c0000007f4f9ea38 
                              GPR28: c0000007f4f9e600 c0000007d18cfa90 c00a00001ed74ac0 c00a00001ed74ac0 
Feb 04 23:11:43 T1000 kernel: NIP [c0000000002a20ac] .is_ext4_page+0x3c/0x4b0
Feb 04 23:11:43 T1000 kernel: LR [c0000000002a209c] .is_ext4_page+0x2c/0x4b0
Feb 04 23:11:43 T1000 kernel: Call Trace:
Feb 04 23:11:43 T1000 kernel: [c0000007f8c77230] [c0000007f9d0d880] 0xc0000007f9d0d880 (unreliable)
Feb 04 23:11:43 T1000 kernel: [c0000007f8c772b0] [c0000000002a5a48] .get_ext4_bh_nowarn+0x18/0x80
Feb 04 23:11:43 T1000 kernel: [c0000007f8c77330] [c0000000002b963c] .page_vma_mapped_walk+0x3c/0x9d0
Feb 04 23:11:43 T1000 kernel: [c0000007f8c77410] [c0000000002bd3d8] .try_to_unmap_one+0x158/0x950
Feb 04 23:11:43 T1000 kernel: [c0000007f8c77560] [c0000000002bc018] .rmap_walk_anon+0x148/0x3a0
Feb 04 23:11:43 T1000 kernel: [c0000007f8c77610] [c0000000002bf364] .try_to_unmap+0x104/0x160
Feb 04 23:11:43 T1000 kernel: [c0000007f8c776c0] [c000000000270958] .shrink_page_list+0x738/0x11e0
Feb 04 23:11:43 T1000 kernel: [c0000007f8c77840] [c000000000271f7c] .shrink_inactive_list+0x34c/0x7f0
Feb 04 23:11:43 T1000 kernel: [c0000007f8c77950] [c000000000272e94] .shrink_node_memcg+0x3c4/0x7e0
Feb 04 23:11:43 T1000 kernel: [c0000007f8c77ab0] [c0000000002733c4] .shrink_node+0x114/0x670
Feb 04 23:11:43 T1000 kernel: [c0000007f8c77bd0] [c000000000275158] .kswapd+0x3f8/0xc30
Feb 04 23:11:43 T1000 kernel: [c0000007f8c77d60] [c0000000000dfb6c] .kthread+0x19c/0x1b0
Feb 04 23:11:43 T1000 kernel: [c0000007f8c77e20] [c00000000000b790] .ret_from_kernel_thread+0x58/0x68
Feb 04 23:11:43 T1000 kernel: Instruction dump:
Feb 04 23:11:43 T1000 kernel: f8010010 f821ff81 4bfdf62d 60000000 2fa30000 419e0150 7fe3fb78 4bfdf619 
Feb 04 23:11:43 T1000 kernel: 60000000 3c82ffa1 e9230000 38849170 <e9290028> e9290028 e8690000 78690520 
Feb 04 23:11:43 T1000 kernel: ---[ end trace 9d98850d9aa63397 ]---
[...]
Comment 67 Erhard F. 2019-02-04 22:33:38 UTC
Created attachment 280957 [details]
journalctl -k (5.0-rc4 on a Talos II, "Kernel NULL pointer dereference")
Comment 68 Jan Kara 2019-02-05 08:42:15 UTC
That looks like some problem with my debug patch. So just ignore it for now. If it happens frequently, just run a kernel without my debug patch.
Comment 69 Aneesh Kumar KV 2019-02-05 09:21:13 UTC
Ok, can we try with only Revert-powerpc-mm-book3s-Enable-THP-migration patch? I am looking at whether we are getting the migration check on page table entries wrong.
Comment 70 Erhard F. 2019-02-05 11:43:33 UTC
(In reply to Aneesh Kumar KV from comment #69)
> Ok, can we try with only Revert-powerpc-mm-book3s-Enable-THP-migration
> patch? I am looking at whether we are getting the migration check on page
> table entries wrong.
Applied only Revert-powerpc-mm-book3s-Enable-THP-migration to -rc5. This provokes the bug after 1 x llvm debug build (my usual test case):

[...]
[ 1596.523635] WARNING: CPU: 15 PID: 1709 at fs/ext4/inode.c:3928 .ext4_set_page_dirty+0x70/0xb0
[ 1596.523768] Modules linked in: rpcsec_gss_krb5 auth_rpcgss nfsv4 dns_resolver nfs lockd grace sunrpc
sr_mod cdrom cfg80211 rfkill radeon evdev ghash_generic gf128mul snd_hda_codec_hdmi i2c_algo_bit ttm ecb
 drm_kms_helper snd_hda_intel snd_hda_codec xts snd_hwdep cfbfillrect snd_hda_core cfbimgblt xhci_pci sy
scopyarea sysimgblt snd_pcm sysfillrect fb_sys_fops cbc xhci_hcd mptsas cfbcopyarea scsi_transport_sas u
sbcore mptscsih drm tg3 vmx_crypto powernv_flash snd_timer snd libphy mtd mptbase at24 crc32c_vpmsum ibm
powernv opal_prd scsi_mod hwmon soundcore drm_panel_orientation_quirks backlight usb_common lz4 lz4_comp
ress lz4_decompress zram zsmalloc
[ 1596.523840] CPU: 15 PID: 1709 Comm: debugedit Not tainted 5.0.0-rc5 #1
[ 1596.523846] NIP:  c00000000040f330 LR: c00000000025bd20 CTR: c00000000040f2c0
[ 1596.523851] REGS: c0000007b56b7530 TRAP: 0700   Not tainted  (5.0.0-rc5)
[ 1596.523854] MSR:  9000000000029032 <SF,HV,EE,ME,IR,DR,RI>  CR: 24082498  XER: 20040000
[ 1596.523866] CFAR: c00000000040f304 IRQMASK: 0
               GPR00: c00000000025bd20 c0000007b56b77c0 c0000000011e2b00 c00a00001a63ff40
               GPR04: 00003fffa7a33000 c000000698ffd186 0000000000000001 0000000000000000
               GPR08: 0000000000000000 0000000000000000 c00a00001a63ff08 0000000000000000
               GPR12: 0000000024082492 c0000007ffffa400 c000000698ffd186 1000000000000000
               GPR16: 00003fffa7a33000 e61fffffffffffff c0000007ab96a190 c0000000011f3918
               GPR20: 00003fffa7a34000 c00a00001a63ff40 c0000004bdc64ce8 c0000007ce95fc50
               GPR24: c0000007ab5069e8 c0000007de586198 fffffffffffffe7f 0000000000000000
               GPR28: 00003fffa7a9b000 ffffffffffffffff 0000000000000000 c00a00001a63ff40
[ 1596.523898] NIP [c00000000040f330] .ext4_set_page_dirty+0x70/0xb0
[ 1596.523903] LR [c00000000025bd20] .set_page_dirty+0xa0/0x150
[ 1596.523907] Call Trace:
[ 1596.523913] [c0000007b56b77c0] [c000000000266070] .activate_page+0xf0/0x160 (unreliable)
[ 1596.523920] [c0000007b56b7830] [c00000000025bd20] .set_page_dirty+0xa0/0x150
[ 1596.523927] [c0000007b56b78b0] [c0000000002a4c90] .unmap_page_range+0xc10/0xe10
[ 1596.523932] [c0000007b56b7a40] [c0000000002a5224] .unmap_vmas+0xa4/0x150
[ 1596.523938] [c0000007b56b7b00] [c0000000002afc18] .unmap_region+0xe8/0x190
[ 1596.523944] [c0000007b56b7c30] [c0000000002b22f0] .__do_munmap+0x2f0/0x510
[ 1596.523949] [c0000007b56b7cf0] [c0000000002b2590] .__vm_munmap+0x80/0x110
[ 1596.523955] [c0000007b56b7db0] [c0000000002b2654] .__se_sys_munmap+0x14/0x30
[ 1596.523962] [c0000007b56b7e20] [c00000000000b3a4] system_call+0x5c/0x70
[ 1596.523966] Instruction dump:
[ 1596.523971] 71290008 4182002c e9230000 71292000 41820030 4bf6d80d 60000000 38210070
[ 1596.523981] e8010010 7c0803a6 4e800020 60000000 <0fe00000> e9230000 71292000 4082ffd8               C
Comment 71 Erhard F. 2019-02-05 12:21:48 UTC
Created attachment 280965 [details]
journalctl -k (5.0-rc5 on a Talos II)

Results with Revert-powerpc-mm-book3s-Enable-THP-migration + mm: Dump page and PTE state and some history (v6).
Comment 72 Jan Kara 2019-02-07 09:06:38 UTC
Just a suggestion to PPC maintainers / Aneesh: This bug can lead to data corruption (lost data written to a file) or potential security issues (writes to file user is no longer supposed to be able to write to) because it leads to PPC leaving writeably mapped pages behind. So since we are at -rc5 I think considering just reverting the patches and figuring out what's going on separately without exposing general users to these issues is a fair question... 

Note that the problem affects already 4.20 so -stable needs to be included regardless how you decide to solve these issues.
Comment 73 Aneesh Kumar KV 2019-02-07 11:29:17 UTC
(In reply to Jan Kara from comment #72)
> Just a suggestion to PPC maintainers / Aneesh: This bug can lead to data
> corruption (lost data written to a file) or potential security issues
> (writes to file user is no longer supposed to be able to write to) because
> it leads to PPC leaving writeably mapped pages behind. So since we are at
> -rc5 I think considering just reverting the patches and figuring out what's
> going on separately without exposing general users to these issues is a fair
> question... 
> 
> Note that the problem affects already 4.20 so -stable needs to be included
> regardless how you decide to solve these issues.

I understand the urgency. But we are still not able to work out why the revert the change helps. Few things which are different from other commonly used kernel configs are

1) 4K page size
2) Big endian kernel.

One thing we could do is just revert the definition of pud_present and pgd_present. ( I already shared a patch to try that). If this helps, we could take that to upstream. But that is still not the fix because we still don't know who is clearing the top bit of the page table tree. This Level1 and Level2 entries are not modified in between. They are set during fault and cleared on page table free.


Another debugging effort we could try is with little-endian config. Erhard, how easy/difficult it will be you try a little-endian config?

All my effort to recreate this on my POWER9 system with 4k config has not been successful. The difference here is I have a little-endian config.

-aneesh
Comment 74 Michael Ellerman 2019-02-07 13:03:57 UTC
Erhard, sorry if it's mentioned elsewhere but I couldn't see it, what distro are you running?
Comment 75 Erhard F. 2019-02-07 13:13:37 UTC
(In reply to Aneesh Kumar KV from comment #73)
> Another debugging effort we could try is with little-endian config. Erhard,
> how easy/difficult it will be you try a little-endian config?

I got that on my Todo-List for some time. But it's not straightforward - I would have to do that without a boot .iso since the current firmware of the Talos doesn't reckognize my SATA card, hence no DVD drive at boot.

Apart from that I will test your latest patches and report back.

@Michael:
I am running Gentoo ppc64 on Talos II and G5. Default is a BE profile, an experimental LE one exists.
Comment 76 Jan Kara 2019-02-07 16:10:41 UTC
(In reply to Aneesh Kumar KV from comment #73)
> (In reply to Jan Kara from comment #72)
> > Just a suggestion to PPC maintainers / Aneesh: This bug can lead to data
> > corruption (lost data written to a file) or potential security issues
> > (writes to file user is no longer supposed to be able to write to) because
> > it leads to PPC leaving writeably mapped pages behind. So since we are at
> > -rc5 I think considering just reverting the patches and figuring out what's
> > going on separately without exposing general users to these issues is a
> fair
> > question... 
> > 
> > Note that the problem affects already 4.20 so -stable needs to be included
> > regardless how you decide to solve these issues.
> 
> I understand the urgency. But we are still not able to work out why the
> revert the change helps. 

Oh, I fully understand that revert / ignoring _PAGE_PRESENT bits is just papering over the real problem. But when we are speaking about data corruption, papering over the problem is IMO acceptable short-term solution. But I also understand that this makes debugging the problem further somewhat more difficult so it's a judgement call for the maintainers.
Comment 77 Michael Ellerman 2019-02-08 12:46:30 UTC
I spent a few hours trying to get gentoo BE installed on my Talos but haven't had any luck yet. I think my current problem is that the kernel doesn't have any driver for my nvme disk.
Comment 78 Michael Ellerman 2019-02-08 12:49:08 UTC
Jan, thanks for all your help on this. I am worried about data corruption obviously, but I'm also reluctant to merge a fix we don't understand. At this stage I'm inclined to keep chasing the bug and if we can't work it by late next week then I'll send the workaround before rc7.
Comment 79 Erhard F. 2019-02-08 13:34:18 UTC
(In reply to Michael Ellerman from comment #77)
> I spent a few hours trying to get gentoo BE installed on my Talos but
> haven't had any luck yet. I think my current problem is that the kernel
> doesn't have any driver for my nvme disk.
You could have a look at my linked .config files here in this thread. I have the NVMe parts statically built into the kernel, loading additional modules later from /. No initramfs used. The Talos II firmware is able to read my ext4 /boot/ partition from the NVMe SSD and I can boot different kernels just via adding a kboot.conf file.
Comment 80 Aneesh Kumar KV 2019-02-08 15:04:02 UTC
Created attachment 281065 [details]
possible revert for upstream

Can you try this patch?
Comment 81 Erhard F. 2019-02-08 20:56:32 UTC
(In reply to Aneesh Kumar KV from comment #80)
> Created attachment 281065 [details]
> possible revert for upstream
> 
> Can you try this patch?
Success! Built llvm-7.0.1 + tests 10 times now, no error message. Usually the bug shows up at the 1st build.

Out of curiosity - what's the impact of the patch on ppc64 BE? It seems like a minmal change only.
Comment 82 Aneesh Kumar KV 2019-02-11 07:45:42 UTC
(In reply to Erhard F. from comment #81)
> (In reply to Aneesh Kumar KV from comment #80)
> > Created attachment 281065 [details]
> > possible revert for upstream
> > 
> > Can you try this patch?
> Success! Built llvm-7.0.1 + tests 10 times now, no error message. Usually
> the bug shows up at the 1st build.
> 
> Out of curiosity - what's the impact of the patch on ppc64 BE? It seems like
> a minmal change only.

At this point, I don't know. Can we try with the below kernel config changes?

1) Disable HUGETLBFS and HUGETLB_PAGE
2) Disable TRANSPARENT_HUGEPAGE

If we can individually disable them and run the tests, that may possibly give us more insight.

-aneesh
Comment 83 Erhard F. 2019-02-11 21:05:49 UTC
(In reply to Aneesh Kumar KV from comment #82)
> 1) Disable HUGETLBFS and HUGETLB_PAGE
The kernel boots, but shortly afterwards the Talos plays dead. USB keyboard stops working, no screen output, no login via SSH possible. I don't have serial debug equipment to have a look what's going on.

# grep -i huge .config
CONFIG_SYS_SUPPORTS_HUGETLBFS=y
CONFIG_HAVE_ARCH_TRANSPARENT_HUGEPAGE=y
CONFIG_TRANSPARENT_HUGEPAGE=y
# CONFIG_TRANSPARENT_HUGEPAGE_ALWAYS is not set
CONFIG_TRANSPARENT_HUGEPAGE_MADVISE=y
CONFIG_TRANSPARENT_HUGE_PAGECACHE=y
# CONFIG_HUGETLBFS is not set

> 2) Disable TRANSPARENT_HUGEPAGE
Runs fine without provoking the bug.

# CONFIG_CGROUP_HUGETLB is not set
CONFIG_ARCH_ENABLE_HUGEPAGE_MIGRATION=y
CONFIG_HUGETLB_PAGE_SIZE_VARIABLE=y
CONFIG_SYS_SUPPORTS_HUGETLBFS=y
CONFIG_HAVE_ARCH_TRANSPARENT_HUGEPAGE=y
# CONFIG_TRANSPARENT_HUGEPAGE is not set
CONFIG_HUGETLBFS=y
CONFIG_HUGETLB_PAGE=y
Comment 84 Michael Ellerman 2019-02-13 11:45:56 UTC
I was finally able to reproduce this on a G5 here.

I was also able to bisect, to:

# first bad commit: [da7ad366b497f5fc1d4a416f168057ba40bddb98] powerpc/mm/book3s: Update pmd_present to look at _PAGE_PRESENT bit


So well spotted Jan.

Unfortunately we still don't know what the actual bug is, just how to work around it. Have been trying to get some more data, but on my system the full debug patch makes the bug vanish. In fact basically any debug in on the unmap side seems to hide the bug.

One thing I have worked out is that we need to change both pgd_present() and pud_present() to use pxd_none(), otherwise the warning still fires. ie. changing one or the other is not sufficient.
Comment 85 Michael Ellerman 2019-02-17 07:33:28 UTC
Patch posted: https://patchwork.ozlabs.org/patch/1041838/
Comment 86 Michael Ellerman 2019-02-19 00:15:05 UTC
Erhard, that patch is now in v5.0-rc7, it would be great if you could confirm it fixes the warning for you.
Comment 87 Erhard F. 2019-02-19 11:27:18 UTC
(In reply to Michael Ellerman from comment #86)
> Erhard, that patch is now in v5.0-rc7, it would be great if you could
> confirm it fixes the warning for you.
Yes, v5.0-rc7 works like a charm! Did the usual llvm builds and lots of other stuff, never seen the bug appear on rc7.

Many thanks to all people involved!
Comment 88 Michael Ellerman 2019-02-19 11:33:49 UTC
Awesome, thanks for testing.

If you have any other trouble feel free to file more bugs or report things to linuxppc-dev@lists.ozlabs.org.
Comment 89 Erhard F. 2019-02-20 13:07:42 UTC
(In reply to Michael Ellerman from comment #88)
> If you have any other trouble feel free to file more bugs or report things
> to linuxppc-dev@lists.ozlabs.org.
At the moment only #201723, #200449, #202017 and #199561 bug me. ;) If I find anything new I will report it on the list.

Note You need to log in before you can comment on or make changes to this bug.