Bug 203935

Summary: [Bisected] [PATCH] Kernel boots to unusable state on 10YO AMD desktop: fine on newer AMD laptop
Product: Platform Specific/Hardware Reporter: Duncan Roe (duncan_roe)
Component: x86-64Assignee: platform_x86_64 (platform_x86_64)
Status: RESOLVED CODE_FIX    
Severity: blocking CC: bp, greg, luto, torvalds
Priority: P1    
Hardware: Intel   
OS: Linux   
Kernel Version: 5.1 Subsystem:
Regression: No Bisected commit-id:
Attachments: Revert commit 459e3a21
Cleanup patch
.config with patch reverted (CONFIG_LOCALVERSION changed)
dmesg output on booting with patch reverted
log of boot with patch reverted (debug.log, logs everything)
vclock_gettime.o with new patch
patch to add potentially missing barrier
All requested .o & dbg files
Proposed fix

Description Duncan Roe 2019-06-20 00:22:22 UTC
Created attachment 283321 [details]
Revert commit 459e3a21

System comes up to the login prompt without changing from the initial terminal font to a smaller one as it usually does.
When I supply user name and password, I just get the login prompt again.
Ctl-Alt-Del to reboot gets message from INIT "No more processes at this run level". Reset button required.
On rebooting good kernel, there is nothing in the log from the failing session.
git bisect identifies commit 459e3a21535ae3c7a9a123650e54f5c882b8fcbf as the culprit. Reverting this commit in 5.1.12 gave a good working kernel again.
The supplied patch is simply a revert. This will re-introduce the gcc-9 warnings that 459e3a21 was designed to eliminate.
Sorry I have not attempted to address these warnings.
Comment 1 Borislav Petkov 2019-06-20 08:04:00 UTC
Hmm, this patch shouldn't change anything AFAICT. Adding some people to CC.
Comment 2 Linus Torvalds 2019-06-20 17:30:14 UTC
Whee. That's truly funky. 

That commit should make absoluely zero difference to code generation. It literally just changes the declaration of a variable from a single char to an array.

And yes, we then take the address of that variable, but the address of an array should be the exact same thing as the address of the variable. 

So the _only_ effect of that change should be that the compiler now sees the proper size of the area, and thus avoids the warning that was brought in by gcc-9.

I'd expect the bisection to be simply wrong, but the fact that the problem also goes away with a revert implies that it's a real issue. 

Duncan, could you send your config, and a boot log from a successful boot with the revert applied? I'll try to think about how this could *possibly* change any code generation at all..
Comment 3 Linus Torvalds 2019-06-20 18:44:29 UTC
Oh, and could you do

    make arch/x86/entry/vdso/vclock_gettime.o
    make arch/x86/entry/vdso/vdso32/vclock_gettime.o

both with and without that revert (but otherwise identical kernel configs), and send me the four different object files? 

Also, is your user space 32-bit or 64-bit? And is it some old user space?

That might explain why you see problems when nobody else does, although I still don't see how that change should affect anything. The address of an array should be the same as the address of the non-address variable, but admittedly we do some very odd things with the vdso's, and maybe there is some use that I'm not seeing.

More likely I'm blind, and that commit is fundamentally broken for some obvious reason, and I'm just missing something really really obvious. And the only reason I don't see it is that I have modern user space and don't use the vdso's at all.
Comment 4 Linus Torvalds 2019-06-20 19:03:10 UTC
Oh, and *because* the vdso files are so magic from a build perspective, you shouldn't actually do that

    make arch/x86/entry/vdso/vdso32/vclock_gettime.o

that I suggested. You should just do a full kernel build with and without the revert, and then those files should just be there.

Maybe add all those

    arch/x86/entry/vdso/vdso*.so.dbg

files to this bugzilla entry too..

Luto, what it is you did to debug that nasty nasty vdso build end result? I think you're the last person to have touched all that.
Comment 5 Linus Torvalds 2019-06-20 21:16:01 UTC
I'm actually testing this with my own compiler, and I see build differences myself.

I still can't tell the *reason* for them, some of them seem to be just random changes just because gcc picked different registers, and I see how gcc had a different model for internal temporary numbers. So the change from 'char' to 'char []' definitely changes what gcc does.

As far as I can tell, the changes are all benign, and in fact seem to look like improvements, ie:

        lea    hvclock_page,%r12                   
        mov    hvclock_page,%edx                   
        lea    0x1d(%r12),%r8
        mov    %edx,%ebx
        and    $0xfffffffe,%ebx
        testb  $0x1,(%r8)
        je     444 <do_hres+0x44>

becomes

        mov    hvclock_page,%edx                   
        lea    hvclock_page,%rbx                   
        and    $0xfffffffe,%edx
        mov    %edx,%r8d
        testb  $0x1,hvclock_page+0x1d      
        je     444 <do_hres+0x44>

which seems simpler and doesn't have that unnecessary %r8 address generation. 

But the fact that register allocations change makes it really hard to see that all the rest really is the same code. There might be something I am missing.

But this may well be compiler-specific, and clearly gcc does end up having different logic for the address of a single byte vs for the address of an array of bytes, and I'm either missing the meaningful difference, or the different compiler version you have just means that for you the changes aren't benign at all.
Comment 6 Andy Lutomirski 2019-06-20 21:54:23 UTC
(In reply to Linus Torvalds from comment #4)
> Luto, what it is you did to debug that nasty nasty vdso build end result? I
> think you're the last person to have touched all that.

A combination of objtool, QEMU+gdb, and swearing.  And some tools I wrote to dump the vDSO image out of a running kernel.  And make SUBDIRS=.

BTW:

$ make -j4 SUBDIRS=arch/x86/entry/vdso V=1
Makefile:213: ================= WARNING ================
Makefile:214: 'SUBDIRS' will be removed after Linux 5.3
Makefile:215: Please use 'M=' or 'KBUILD_EXTMOD' instead
Makefile:216: ==========================================

Can we please not remove SUBDIRS, or at least make sure we have a credible replacement?  M= and KBUILD_EXTMOD don't do this at all.
Comment 7 Linus Torvalds 2019-06-20 23:59:12 UTC
(In reply to Andy Lutomirski from comment #6)
> 
> $ make -j4 SUBDIRS=arch/x86/entry/vdso V=1
> Makefile:213: ================= WARNING ================
> Makefile:214: 'SUBDIRS' will be removed after Linux 5.3
> Makefile:215: Please use 'M=' or 'KBUILD_EXTMOD' instead
> Makefile:216: ==========================================
> 
> Can we please not remove SUBDIRS, or at least make sure we have a credible
> replacement?  M= and KBUILD_EXTMOD don't do this at all.

What does SUBDIRS do for you that you don't get from just doing

    make -j4 V=1 arch/x86/entry/vdso/

Hmm?
Comment 8 Linus Torvalds 2019-06-21 00:03:56 UTC
Created attachment 283329 [details]
Cleanup patch

Does this cleanup patch make any difference to the behavior you see?

It's what I should have done originally instead of 459e3a21535a ("gcc-9: properly declare the {pv,hv}clock_page storage") but I tried to keep the warning fixup patch _minimal_ instead of aiming for doing cleanups.

It makes the casts pointless, because it just declares the magic tsc pages as having the right type to begin with.

Honestly, I still don't see why any of this should matter, since the compiler should do the same thing regardless, but at least this cleans code up. I wonder if it makes gcc generate different object code?
Comment 9 Duncan Roe 2019-06-21 02:57:50 UTC
Created attachment 283343 [details]
.config with patch reverted (CONFIG_LOCALVERSION changed)
Comment 10 Duncan Roe 2019-06-21 03:18:30 UTC
Created attachment 283345 [details]
dmesg output on booting with patch reverted
Comment 11 Duncan Roe 2019-06-21 03:21:40 UTC
Created attachment 283347 [details]
log of boot with patch reverted (debug.log, logs everything)
Comment 12 Andy Lutomirski 2019-06-21 04:51:54 UTC
(In reply to Linus Torvalds from comment #7)
> (In reply to Andy Lutomirski from comment #6)
> > 
> > $ make -j4 SUBDIRS=arch/x86/entry/vdso V=1
> > Makefile:213: ================= WARNING ================
> > Makefile:214: 'SUBDIRS' will be removed after Linux 5.3
> > Makefile:215: Please use 'M=' or 'KBUILD_EXTMOD' instead
> > Makefile:216: ==========================================
> > 
> > Can we please not remove SUBDIRS, or at least make sure we have a credible
> > replacement?  M= and KBUILD_EXTMOD don't do this at all.
> 
> What does SUBDIRS do for you that you don't get from just doing
> 
>     make -j4 V=1 arch/x86/entry/vdso/
> 
> Hmm?

I didn't know that worked.  I tried it at some point without the trailing slash and nothing happened.  Thanks!
Comment 13 Andy Lutomirski 2019-06-21 05:07:37 UTC
(In reply to Linus Torvalds from comment #8)
> Created attachment 283329 [details]
> Cleanup patch
> 
> Does this cleanup patch make any difference to the behavior you see?
> 
> It's what I should have done originally instead of 459e3a21535a ("gcc-9:
> properly declare the {pv,hv}clock_page storage") but I tried to keep the
> warning fixup patch _minimal_ instead of aiming for doing cleanups.
> 
> It makes the casts pointless, because it just declares the magic tsc pages
> as having the right type to begin with.
> 
> Honestly, I still don't see why any of this should matter, since the
> compiler should do the same thing regardless, but at least this cleans code
> up. I wonder if it makes gcc generate different object code?

Linus, I like your patch except for the const.  I agree that those variables are conceptually const, but I think that the declaration:

extern const foo;

means that gcc will assume that it matches some other definition:

extern const foo = { ... };

and that it therefore will not change out from under the function being compiled.  This could easily result in completely incorrect code generation.

Also, from Duncan's logs, it looks like he's running on bare metal.  Duncan, is that right?  If so, what the heck is going on here?  The code in question shouldn't be running at all.  The logs do have this goody:

[    0.433315] tsc: Marking TSC unstable due to TSCs unsynchronized

So we should be running with vclock_mode == VCLOCK_NONE, which is 0.  Looking at the asm from my build with and without the patch reverted, it appears identical and correct.

So I'm mystified.  Duncan, can you send us the arch/x86/entry/vdso/vclock_gettime.o from a non-working kernel.

Also, is this 32-bit userspace or 64-bit userspace?
Comment 14 Duncan Roe 2019-06-21 05:10:48 UTC
I regret to have to advise that attachment 283343 [details] fails the same way as before.
I'll gather together object files as requested - need to rebuild to get the working ones (d'oh)
In reply to previous posts:
> Also, is your user space 32-bit or 64-bit? And is it some old user space?
As you'll see from dmesg / log, 64-bit.
As to age: Slackware 14.2 patched up to 07-Feb-2019. So, gcc 5.5.0 - perhaps about the oldest gcc that will build modern kernels.
I do have a VM running Slackware Current (with gcc-9) - would it be any help to try a build on that?
> That might explain why you see problems when nobody else does
Perhaps not quite nobody else - see https://www.linuxquestions.org/questions/slackware-14/new-kernel-5-1-5-1-2-boot-failure-strangeness-4175653980/
Looking again at that thread, @petejc in post #5 built under Slackware Current (might have been gcc-8) and that kernel worked on his old system. But I can re-check, if you would like.
Comment 15 Duncan Roe 2019-06-21 05:16:42 UTC
> Also, from Duncan's logs, it looks like he's running on bare metal.
Yes indeed.
> Also, is this 32-bit userspace or 64-bit userspace?
64-bit (was composing my previous post when you asked)
Comment 16 Duncan Roe 2019-06-21 05:33:53 UTC
Created attachment 283349 [details]
vclock_gettime.o with new patch
Comment 17 Andy Lutomirski 2019-06-21 05:54:23 UTC
(In reply to Duncan Roe from comment #14)
> I regret to have to advise that attachment 283343 [details] fails the same
> way as before.
> I'll gather together object files as requested - need to rebuild to get the
> working ones (d'oh)
> In reply to previous posts:
> > Also, is your user space 32-bit or 64-bit? And is it some old user space?
> As you'll see from dmesg / log, 64-bit.
> As to age: Slackware 14.2 patched up to 07-Feb-2019. So, gcc 5.5.0 - perhaps
> about the oldest gcc that will build modern kernels.

So there could easily be some weird GCC bug here.  I'd be interested in seeing both a working (with the revert) and broken vclock_gettime.o from this GCC.

> I do have a VM running Slackware Current (with gcc-9) - would it be any help
> to try a build on that?

It could be educational.
Comment 18 Andy Lutomirski 2019-06-21 06:11:49 UTC
I'm not sure what your vclock_gettime.o corresponds to (what does "new patch" mean?), but it disassembles like this:

0000000000000000 <do_hres>:
   0:   55                      push   %rbp
   1:   44 0f be 1d 00 00 00    movsbl 0x0(%rip),%r11d        # 9 <do_hres+0x9>
   8:   00 
                        5: R_X86_64_PC32        pvclock_page+0x18

and there's the problem!  Accessing pvclock_page if we're not using the pvclock is not okay.

It looks like we either have some header screwup or gcc 5.5.0 is buggy, since gtod_read_begin() contains an smp_rmb(), and that should be enough to prevent the hoisting that gcc is doing.

That being said, it would be perfectly legal, if rather silly, for GCC to hoist the loads before the vclock_mode checks but after gtod_read_begin().  I'll attach a patch to try to fix this.

Linus, any insight as to why gcc appears to be miscompiling the whole function?  I'm wondering if there's some gcc bug that causes it to think that const pointers point to unchanging data even if there's a barrier.  Hmm, barrier is just a "memory" clobber -- maybe gcc thinks that clobbering memory doesn't clobber things that are only accessible by const pointer.

Duncan, can you try Linus' patch minus the const?
Comment 19 Andy Lutomirski 2019-06-21 06:12:52 UTC
Created attachment 283351 [details]
patch to add potentially missing barrier
Comment 20 Duncan Roe 2019-06-21 06:27:36 UTC
> (what does "new patch" mean?)
attachment 283329 [details]
Comment 21 Andy Lutomirski 2019-06-21 06:28:08 UTC
I tried to provoke this miscompilation using godbolt with old GCCs and I failed.  Here's how far I got:

struct foo {
    int val;
};

extern unsigned char x[4096] __attribute__((visibility("hidden")));

static const struct foo *get_ptr(void)
{
    return (const struct foo *)&x;
}

int func(void)
{
	int i;
    const int *ptr = &get_ptr()->val;
	int ret = 0;
	for (i = 0; i < 10; i++) {
		asm volatile ("" ::: "memory");
		ret += *ptr;
	}
	return ret;
}

If I make that "extern const unsigned char x[4096]", then the barrier no longer takes effect, even on modern gcc, so Linus' patch is definitely wrong.  But I can't get the miscompilation without that const.

I feel like I'm still missing some subtlely.
Comment 22 Andy Lutomirski 2019-06-21 06:29:26 UTC
(In reply to Duncan Roe from comment #20)
> > (what does "new patch" mean?)
> attachment 283329 [details]

Ah, got it.  So it's exactly the issue in comment 21 if Linus' patch is applied.

Can you post vclock_gettime.o from an unpatched kernel?
Comment 23 Duncan Roe 2019-06-21 06:30:55 UTC
Will try attachment 283329 [details] first, OK?
Comment 24 Duncan Roe 2019-06-21 06:37:24 UTC
Created attachment 283353 [details]
All requested .o & dbg files

3 directories with subdirs for vdso32. "original" is from laptop build area - same GCC but is 5.1.11 (others are 5.1.12)
Comment 25 Duncan Roe 2019-06-21 06:40:19 UTC
(In reply to Duncan Roe from comment #24)
"original"  means "unpatched"
"new patch" means attachment 283329 [details] as before
Comment 26 Duncan Roe 2019-06-21 06:51:49 UTC
(In reply to Duncan Roe from comment #23)
> Will try attachment 283329 [details] first, OK?

I meant attachment 283351 [details] (barrier.patch). I just apply that to otherwise unpatched kernel, right?
Comment 27 Borislav Petkov 2019-06-21 09:15:29 UTC
(In reply to Andy Lutomirski from comment #6)
> And some tools I wrote to dump the vDSO image out of a running kernel.

This doesn't come up for the first time. Maybe we should have an vdso dumper in tools/ ...
Comment 28 Borislav Petkov 2019-06-21 09:53:21 UTC
(In reply to Andy Lutomirski from comment #13)
> [    0.433315] tsc: Marking TSC unstable due to TSCs unsynchronized

He has an AMD K8 which has unstable TSC.
Comment 29 Duncan Roe 2019-06-21 12:48:09 UTC
Results of 2 builds:

With barrier.patch, built with GCC-5.5.0 - fails as before

Unpatched, built with GCC-9.1.0 - works fine
Comment 30 Andy Lutomirski 2019-06-21 15:51:51 UTC
  2f:   8b 05 00 00 00 00       mov    0x0(%rip),%eax        # 35 <do_hres+0x35>
                        31: R_X86_64_PC32       vvar_vsyscall_gtod_data
  35:   83 f8 01                cmp    $0x1,%eax
  38:   74 5b                   je     95 <do_hres+0x95>

We get here if TSC is unusable, i.e. only on your old laptop.

  3a:   83 f8 02                cmp    $0x2,%eax

Are we using pvclock?

  3d:   44 8b 15 00 00 00 00    mov    0x0(%rip),%r10d        # 44 <do_hres+0x44
>
                        40: R_X86_64_PC32       pvclock_page-0x4

Muahaha, segfault!  Thanks, GCC.

  44:   0f 84 c8 00 00 00       je     112 <do_hres+0x112>

My barrier patch wasn't quite good enough.  I'll attach a better one.
Comment 31 Andy Lutomirski 2019-06-21 16:00:51 UTC
Created attachment 283373 [details]
Proposed fix
Comment 32 Linus Torvalds 2019-06-21 18:15:02 UTC
(In reply to Andy Lutomirski from comment #31)
> Created attachment 283373 [details]
> Proposed fix

That looks like the obviously right thing to do.

Duncan, if you just verify that that patch (on top of original sources - no revert or anything else) makes things work for you - and I obviously think it will - then I'll apply it.

Andy, I'd want the proper sign-off and commit message, of course.

Thanks everybody.
Comment 33 Andy Lutomirski 2019-06-21 19:11:08 UTC
(In reply to Linus Torvalds from comment #32)
> Andy, I'd want the proper sign-off and commit message, of course.

You just have to click harder :)  Apparently clicking "Proposed fix" or "details" will give it to you.  If you click "View" or "Details" from the diff viewer, you'll also get it, but if you click "Raw Unified", you get a very stripped down diff.
Comment 34 Andy Lutomirski 2019-06-21 19:21:48 UTC
P.S. Linus, I'd like to pick up and fix your cleanup, too.  Can I have a Signed-off-by?
Comment 35 Linus Torvalds 2019-06-21 19:29:07 UTC
(In reply to Andy Lutomirski from comment #34)
> P.S. Linus, I'd like to pick up and fix your cleanup, too.  Can I have a
> Signed-off-by?

Sure. You can add my sign-off.

I'm wondering if that patch should be changed to bit just not have the "const", but indeed mark the whole struct "volatile" instead?

I think that would require the argument types to be fixed up too, but I think it would make the barrier unnecessary. 

What do you think?
Comment 36 Andy Lutomirski 2019-06-21 19:41:10 UTC
(In reply to Linus Torvalds from comment #35)
> (In reply to Andy Lutomirski from comment #34)
> > P.S. Linus, I'd like to pick up and fix your cleanup, too.  Can I have a
> > Signed-off-by?
> 
> Sure. You can add my sign-off.
> 
> I'm wondering if that patch should be changed to bit just not have the
> "const", but indeed mark the whole struct "volatile" instead?
> 
> I think that would require the argument types to be fixed up too, but I
> think it would make the barrier unnecessary. 
> 

I think it would indeed make the barrier unnecessary, but I'd be concerned that the resulting code will be horrible.  For example, __pvclock_read_cycles does:

static __always_inline
u64 __pvclock_read_cycles(const struct pvclock_vcpu_time_info *src, u64 tsc)
{
        u64 delta = tsc - src->tsc_timestamp;
        u64 offset = pvclock_scale_delta(delta, src->tsc_to_system_mul,
                                             src->tsc_shift);
        return src->system_time + offset;
}

and pvclock_scale_delta has this:

        __asm__ (
                "mulq %[mul_frac] ; shrd $32, %[hi], %[lo]"
                : [lo]"=a"(product),
                  [hi]"=d"(tmp)
                : "0"(delta),
                  [mul_frac]"rm"((u64)mul_frac));

With volatile, GCC has to emit the loads in exactly the order in which they're sequenced.  I give it a quick try, and the assembly looks worse in ways that don't even look like they're required by the standard.  This:

 11c:	f6 05 00 00 00 00 01 	testb  $0x1,0x0(%rip)        # 123 <do_hres+0x123>
			11e: R_X86_64_PC32	pvclock_page+0x18

becomes:

 121:	0f b6 05 00 00 00 00 	movzbl 0x0(%rip),%eax        # 128 <do_hres+0x128>
			124: R_X86_64_PC32	pvclock_page+0x19
 128:	a8 01                	test   $0x1,%al

and there are quite a few examples of transformations like that in the diff.

I suppose we could try to play awful games where we make the pages unconditionally readable, but that gets a bit gross, since it's not at all clear what's supposed to be mapped there if there is no pvclock.

I looked briefly at using static_cpu_has(X86_FEATURE_HYPERVISOR), but I'm not convinced it'll help anything performance or correctness-wise.
Comment 37 Linus Torvalds 2019-06-21 19:54:06 UTC
Fair enough. Gcc does indeed make a mess of volatile accesses and won't even do the obvious instruction simplification.

So yeah, take my patch, remove the const, and it is at least a small cleanup.

Thanks
Comment 38 Linus Torvalds 2019-06-21 20:38:32 UTC
Ok, I have applied Luto's patch, but apparently I can't close the bug since I'm not the owner of it.

Duncan, please close the bug once you've tested the patch. And if it turns out that Andy's barrier patch doesn't fix it (but I really am pretty sure it does), please holler and we'll continue to look at it.

Luto - I'll leave the cleanup patch to you. I don't think it's necessarily worth it at this stage, and I think it would be better to do as part of the 5.3 merge window or something. But it's obvious enough that if you *want* to make it part of -tip earlier, I won't object.

Of course, the patch that ended up exposing this gcc instruction scheduling issue was "obvious" too.
Comment 39 Linus Torvalds 2019-06-21 20:43:02 UTC
[ Irrelevant side comment ]

(In reply to Andy Lutomirski from comment #33)
> (In reply to Linus Torvalds from comment #32)
> > Andy, I'd want the proper sign-off and commit message, of course.
> 
> You just have to click harder :)

Yeah, the bugzilla patch view seems odd. If I ask to view the diff, I only see the diff. And if I ask to view the attachment, I just get "attachment is not viewable in your browser". So I never even noticed that you had a good commit message and sign-off, because the only thing I see in the browser is the diff.

Downloading it worked fine. 

I think there's a good reason we usually do things in email.
Comment 40 Duncan Roe 2019-06-22 00:52:46 UTC
Fixed by attachment 283373 [details]