Bug 194579

Summary: AMDGPU: Possible size overflow detected by PaX in ttm_bo_handle_move_mem (drivers/gpu/drm/ttm/ttm_bo.c:388)
Product: Drivers Reporter: Thore Bödecker (me)
Component: Video(DRI - non Intel)Assignee: drivers_video-dri
Status: NEW ---    
Severity: normal CC: deathsimple, fin4478, me, nhaehnle, pageexec
Priority: P1    
Hardware: All   
OS: Linux   
Kernel Version: 4.9.9 Subsystem:
Regression: No Bisected commit-id:
Attachments: dmesg including backtrace

Description Thore Bödecker 2017-02-13 17:06:38 UTC
While trying out the experimental amdgpu support for Southern Islands radeon cards, I stumpled upon this issue due to using the grsec patchset.

It seems there is a size overflow:

kernel: PAX: size overflow detected in function ttm_bo_handle_move_mem drivers/gpu/drm/ttm/ttm_bo.c:388 cicus.459_185 max, count: 5, decl: offset; num: 0; context: ttm_buffer_object;


After addings a debug printk():

kernel: PAX start:7fffffffffffffff type:1 offset:80000000


Would be great if someone could provide some insight and confirm that it is undesired behaviour.


Best,
Thore
Comment 1 Thore Bödecker 2017-02-13 17:10:01 UTC
Created attachment 254731 [details]
dmesg including backtrace
Comment 2 fin4478 2017-02-14 07:42:19 UTC
There is very little amdgpu driver code in the stock kernels, see kernel.org and diff column. See other amdgpu bug reports how to create a custom kernel with this code:
git clone -b drm-next-4.11-wip git://people.freedesktop.org/~agd5f/linux
Comment 3 Christian König 2017-02-14 09:52:50 UTC
Please ignore the incorrect comment by fin4478@hotmail.com.

The upstream kernel the official base amdgpu driver code. Only a few not parts which are not upstream yet are in private repositories waiting for cleanup.

Regards,
Christian (co-maintainer of the amdgpu module).
Comment 4 Christian König 2017-02-14 09:55:51 UTC
Now back to topic, what code base is this? Could you post the branch you are using somewhere?

In the upstream kernel the line at ttm_bo.c:388 is just an assignment and can't cause any overflow.
Comment 5 fin4478 2017-02-14 15:07:35 UTC
(In reply to Christian König from comment #3)
> Please ignore the incorrect comment by fin4478@hotmail.com.
> 
> The upstream kernel the official base amdgpu driver code. Only a few not
> parts which are not upstream yet are in private repositories waiting for
> cleanup.
> 
> Regards,
> Christian (co-maintainer of the amdgpu module).

Amd developers want bad bad reputation for amd gpus, so I stop giving info. Here is an example how the wip kernel solved the problem:
https://bugzilla.kernel.org/show_bug.cgi?id=194559
Comment 6 PaX Team 2017-02-14 22:19:51 UTC
```
(In reply to Christian König from comment #4)
> Now back to topic, what code base is this? Could you post the branch
> you are using somewhere?

it's vanilla 4.9.9 + grsecurity (we don't change the amdgpu code except for
some function pointer correctness issues).

> In the upstream kernel the line at ttm_bo.c:388 is just an assignment
> and can't cause any overflow.

yet it does ;). here's the code in context, just to make sure we're talking
about the same thing:

   387         if (bo->mem.mm_node) {
   388                 bo->offset = (bo->mem.start << PAGE_SHIFT) +
   389             bdev->man[bo->mem.mem_type].gpu_offset;
   390         bo->cur_placement = bo->mem.placement;

as you can see from the printk log, bo->mem.start was LONG_MAX (probably via
AMDGPU_BO_INVALID_OFFSET) and shifting it left triggered our size overflow
instrumentation. my guess is that AMDGPU_BO_INVALID_OFFSET isn't supposed to
be used in offset calculations and thus there's some higher level logic bug
here where the mem.start field should have been initialized by the time we
got here. some casual code browsing points at amdgpu_ttm_placement_init that
probably initializes the ttm_placement structures for amdgpu but for some of
them it doesn't set any of fpfn/lpfn/TTM_PL_FLAG_TOPDOWN which is what would
trigger the initialization of mem.start in amdgpu_gtt_mgr_new.
```
Comment 7 Nicolai Hähnle 2017-02-16 23:18:28 UTC
VRAM buffers can get split into multiple underlying memory chunks; in this case, bo->mem.start == AMDGPU_BO_INVALID_OFFSET; it's not a logic bug.

bo->offset is only used via amdgpu_bo_gpu_offset, which has a WARN_ON_ONCE to guard against accidental use of the field in this particular case.

Therefore, I'm inclined to say that this is probably not an actual bug.
Comment 8 Christian König 2017-02-17 10:05:51 UTC
(In reply to Nicolai Hähnle from comment #7)
> Therefore, I'm inclined to say that this is probably not an actual bug.

Yeah, correct. The calculated offset is never used, so the overflow is irrelevant.
Comment 9 PaX Team 2017-02-18 00:21:22 UTC
would the following workaround do the job of not triggering the overflow and not causing any other logic bugs for our purposes:

--- a/drivers/gpu/drm/ttm/ttm_bo.c      2016-12-13 12:11:19.867579755 +0100
+++ b/drivers/gpu/drm/ttm/ttm_bo.c        2017-02-18 01:19:44.122817874 +0100
@@ -384,7 +384,7 @@
                bo->evicted = false;
        }

-       if (bo->mem.mm_node) {
+       if (bo->mem.mm_node && bo->mem.start != AMDGPU_BO_INVALID_OFFSET) {
                bo->offset = (bo->mem.start << PAGE_SHIFT) +
                    bdev->man[bo->mem.mem_type].gpu_offset;
                bo->cur_placement = bo->mem.placement;
Comment 10 Nicolai Hähnle 2017-02-21 09:24:41 UTC
1. That shouldn't compile (use LONG_MAX).
2. The cur_placement still needs to be updated even in that case.

Guarding only the assignment to bo->offset in this way is a reasonable workaround, I think.
Comment 11 Christian König 2017-02-21 09:52:58 UTC
(In reply to PaX Team from comment #9)
> would the following workaround do the job of not triggering the overflow and
> not causing any other logic bugs for our purposes:

Not really.

The issue is that the offset handling should actually be transparent to TTM. So mem.start can have any value here which might as well overflow during the assignment.

So even with Nicolais suggestion of using LONG_MAX I would NAK the patch.

The only clean solution I can see is to remove bo->offset altogether and move that into a helper the drivers can call on demand.
Comment 12 PaX Team 2017-02-21 13:19:29 UTC
(In reply to Christian König from comment #11)
> The issue is that the offset handling should actually be transparent to TTM.
> So mem.start can have any value here which might as well overflow during the
> assignment.
> 
> So even with Nicolais suggestion of using LONG_MAX I would NAK the patch.
> 
> The only clean solution I can see is to remove bo->offset altogether and
> move that into a helper the drivers can call on demand.

obviously i'm not qualified to do that kind of surgery ;), i'd just like
to keep our existing overflow checking instrumentation for tm_buffer_object.offset
instead of getting rid of it because of just one intentional overflow. if
setting ->offset regardless of any overflows is important then couldn't we
go the other way and change the value of AMDGPU_BO_INVALID_OFFSET to something
that would not trigger the overflow here? say LONG_MAX >> PAGE_SHIFT. would
that work/not clash with otherwise valid values for this offset? (makes me
also wonder why ULONG_MAX isn't used since that would produce an even bigger
safety zone)