While trying out the experimental amdgpu support for Southern Islands radeon cards, I stumpled upon this issue due to using the grsec patchset. It seems there is a size overflow: kernel: PAX: size overflow detected in function ttm_bo_handle_move_mem drivers/gpu/drm/ttm/ttm_bo.c:388 cicus.459_185 max, count: 5, decl: offset; num: 0; context: ttm_buffer_object; After addings a debug printk(): kernel: PAX start:7fffffffffffffff type:1 offset:80000000 Would be great if someone could provide some insight and confirm that it is undesired behaviour. Best, Thore
Created attachment 254731 [details] dmesg including backtrace
There is very little amdgpu driver code in the stock kernels, see kernel.org and diff column. See other amdgpu bug reports how to create a custom kernel with this code: git clone -b drm-next-4.11-wip git://people.freedesktop.org/~agd5f/linux
Please ignore the incorrect comment by fin4478@hotmail.com. The upstream kernel the official base amdgpu driver code. Only a few not parts which are not upstream yet are in private repositories waiting for cleanup. Regards, Christian (co-maintainer of the amdgpu module).
Now back to topic, what code base is this? Could you post the branch you are using somewhere? In the upstream kernel the line at ttm_bo.c:388 is just an assignment and can't cause any overflow.
(In reply to Christian König from comment #3) > Please ignore the incorrect comment by fin4478@hotmail.com. > > The upstream kernel the official base amdgpu driver code. Only a few not > parts which are not upstream yet are in private repositories waiting for > cleanup. > > Regards, > Christian (co-maintainer of the amdgpu module). Amd developers want bad bad reputation for amd gpus, so I stop giving info. Here is an example how the wip kernel solved the problem: https://bugzilla.kernel.org/show_bug.cgi?id=194559
``` (In reply to Christian König from comment #4) > Now back to topic, what code base is this? Could you post the branch > you are using somewhere? it's vanilla 4.9.9 + grsecurity (we don't change the amdgpu code except for some function pointer correctness issues). > In the upstream kernel the line at ttm_bo.c:388 is just an assignment > and can't cause any overflow. yet it does ;). here's the code in context, just to make sure we're talking about the same thing: 387 if (bo->mem.mm_node) { 388 bo->offset = (bo->mem.start << PAGE_SHIFT) + 389 bdev->man[bo->mem.mem_type].gpu_offset; 390 bo->cur_placement = bo->mem.placement; as you can see from the printk log, bo->mem.start was LONG_MAX (probably via AMDGPU_BO_INVALID_OFFSET) and shifting it left triggered our size overflow instrumentation. my guess is that AMDGPU_BO_INVALID_OFFSET isn't supposed to be used in offset calculations and thus there's some higher level logic bug here where the mem.start field should have been initialized by the time we got here. some casual code browsing points at amdgpu_ttm_placement_init that probably initializes the ttm_placement structures for amdgpu but for some of them it doesn't set any of fpfn/lpfn/TTM_PL_FLAG_TOPDOWN which is what would trigger the initialization of mem.start in amdgpu_gtt_mgr_new. ```
VRAM buffers can get split into multiple underlying memory chunks; in this case, bo->mem.start == AMDGPU_BO_INVALID_OFFSET; it's not a logic bug. bo->offset is only used via amdgpu_bo_gpu_offset, which has a WARN_ON_ONCE to guard against accidental use of the field in this particular case. Therefore, I'm inclined to say that this is probably not an actual bug.
(In reply to Nicolai Hähnle from comment #7) > Therefore, I'm inclined to say that this is probably not an actual bug. Yeah, correct. The calculated offset is never used, so the overflow is irrelevant.
would the following workaround do the job of not triggering the overflow and not causing any other logic bugs for our purposes: --- a/drivers/gpu/drm/ttm/ttm_bo.c 2016-12-13 12:11:19.867579755 +0100 +++ b/drivers/gpu/drm/ttm/ttm_bo.c 2017-02-18 01:19:44.122817874 +0100 @@ -384,7 +384,7 @@ bo->evicted = false; } - if (bo->mem.mm_node) { + if (bo->mem.mm_node && bo->mem.start != AMDGPU_BO_INVALID_OFFSET) { bo->offset = (bo->mem.start << PAGE_SHIFT) + bdev->man[bo->mem.mem_type].gpu_offset; bo->cur_placement = bo->mem.placement;
1. That shouldn't compile (use LONG_MAX). 2. The cur_placement still needs to be updated even in that case. Guarding only the assignment to bo->offset in this way is a reasonable workaround, I think.
(In reply to PaX Team from comment #9) > would the following workaround do the job of not triggering the overflow and > not causing any other logic bugs for our purposes: Not really. The issue is that the offset handling should actually be transparent to TTM. So mem.start can have any value here which might as well overflow during the assignment. So even with Nicolais suggestion of using LONG_MAX I would NAK the patch. The only clean solution I can see is to remove bo->offset altogether and move that into a helper the drivers can call on demand.
(In reply to Christian König from comment #11) > The issue is that the offset handling should actually be transparent to TTM. > So mem.start can have any value here which might as well overflow during the > assignment. > > So even with Nicolais suggestion of using LONG_MAX I would NAK the patch. > > The only clean solution I can see is to remove bo->offset altogether and > move that into a helper the drivers can call on demand. obviously i'm not qualified to do that kind of surgery ;), i'd just like to keep our existing overflow checking instrumentation for tm_buffer_object.offset instead of getting rid of it because of just one intentional overflow. if setting ->offset regardless of any overflows is important then couldn't we go the other way and change the value of AMDGPU_BO_INVALID_OFFSET to something that would not trigger the overflow here? say LONG_MAX >> PAGE_SHIFT. would that work/not clash with otherwise valid values for this offset? (makes me also wonder why ULONG_MAX isn't used since that would produce an even bigger safety zone)