Bug 206895 - [amdgpu] crash while using opencl from amdgpu-pro on kernel 5.5.10 & 5.6.0-rc6
Summary: [amdgpu] crash while using opencl from amdgpu-pro on kernel 5.5.10 & 5.6.0-rc6
Status: NEW
Alias: None
Product: Drivers
Classification: Unclassified
Component: Video(DRI - non Intel) (show other bugs)
Hardware: x86-64 Linux
: P1 normal
Assignee: drivers_video-dri
URL:
Keywords:
Depends on:
Blocks:
 
Reported: 2020-03-20 19:55 UTC by bigbeeshane
Modified: 2020-03-23 17:01 UTC (History)
2 users (show)

See Also:
Kernel Version: 5.5.10 & 5.6.0-rc6
Tree: Mainline
Regression: No


Attachments
crash log (5.43 KB, text/plain)
2020-03-20 19:55 UTC, bigbeeshane
Details
amdgpu_possible_patch (3.47 KB, patch)
2020-03-23 13:57 UTC, bigbeeshane
Details | Diff

Description bigbeeshane 2020-03-20 19:55:37 UTC
Created attachment 287987 [details]
crash log

I have found that using the amdgpu-pro OpenCL stack with kernel 5.5.10 causes a crash (see attached log) I have seen this while using folding@home.

I have tested reverting back to 5.4.26 with no other changes, this fixes the issue.
Comment 1 Alex Deucher 2020-03-20 20:32:39 UTC
Can you bisect?
Comment 2 bigbeeshane 2020-03-20 20:41:41 UTC
Yes, should be able to over the weekend. Will report my findings.
Comment 4 bigbeeshane 2020-03-21 23:09:29 UTC
If that is the case isn't the issue the kernel rather than the user space applications ?

As in that case amdgpu is incompatible with any of the OpenCL 2.x implementation's
Comment 5 stefanspr94 2020-03-22 10:35:49 UTC
That depends on whether the changes to the DMA mechanics were meant to be compatible with the old implementation, but I can't answer that as I am no AMD developer.
Comment 6 bigbeeshane 2020-03-22 23:45:07 UTC
Looks like its more to do with switching from amd-iommmu to dma-iommu (see my bisect below)

git bisect start
# good: [e87eb585d31fadb5e9e549a1de4b2da60a79bfc9] Merge branch 'pci/misc'
git bisect good e87eb585d31fadb5e9e549a1de4b2da60a79bfc9
# bad: [c3bed3b20e40ab44b98ac5f0471a5bd92a802f5a] Merge tag 'pci-v5.5-changes' of git://git.kernel.org/pub/scm/linux/kernel/git/helgaas/pci
git bisect bad c3bed3b20e40ab44b98ac5f0471a5bd92a802f5a
# good: [3f1b210a7f97f7e75c56174ada476fba2d36f340] Merge tag 'sound-5.5-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/tiwai/sound
git bisect good 3f1b210a7f97f7e75c56174ada476fba2d36f340
# skip: [a6ed68d6468bd5a3da78a103344ded1435fed57a] Merge tag 'drm-next-2019-11-27' of git://anongit.freedesktop.org/drm/drm
git bisect skip a6ed68d6468bd5a3da78a103344ded1435fed57a
# skip: [3f86a7e090d1dfb974a9dc9d44049f9bff01e6a5] gpiolib: acpi: Print pin number on acpi_gpiochip_alloc_event errors
git bisect skip 3f86a7e090d1dfb974a9dc9d44049f9bff01e6a5
# good: [32d1fe8fcb32130733b59fc447e35753dc87fd40] mm/hotplug: reorder memblock_[free|remove]() calls in try_remove_memory()
git bisect good 32d1fe8fcb32130733b59fc447e35753dc87fd40
# good: [a5255bc31673c72e264d837cd13cd3085d72cb58] Merge tag 'dmaengine-5.5-rc1' of git://git.infradead.org/users/vkoul/slave-dma
git bisect good a5255bc31673c72e264d837cd13cd3085d72cb58
# bad: [9b326948c23908692d7dfe56ed149840d3829eaa] Merge tag 'firewire-update' of git://git.kernel.org/pub/scm/linux/kernel/git/ieee1394/linux1394
git bisect bad 9b326948c23908692d7dfe56ed149840d3829eaa
# bad: [937d6eefc716a9071f0e3bada19200de1bb9d048] Merge tag 'docs-5.5a' of git://git.lwn.net/linux
git bisect bad 937d6eefc716a9071f0e3bada19200de1bb9d048
# good: [a8de1304b7df30e3a14f2a8b9709bb4ff31a0385] libfdt: define INT32_MAX and UINT32_MAX in libfdt_env.h
git bisect good a8de1304b7df30e3a14f2a8b9709bb4ff31a0385
# good: [bf23a48edbe331f834eb49d1bd6484ae98cf4dc7] Documentation/translation: Use Korean for Korean translation title
git bisect good bf23a48edbe331f834eb49d1bd6484ae98cf4dc7
# good: [34d1b0895dbd10713c73615d8f532e78509e12d9] iommu/arm-smmu: Remove duplicate error message
git bisect good 34d1b0895dbd10713c73615d8f532e78509e12d9
# bad: [9b3a713feef8db41d4bcccb3b97e86ee906690c8] Merge branches 'iommu/fixes', 'arm/qcom', 'arm/renesas', 'arm/rockchip', 'arm/mediatek', 'arm/tegra', 'arm/smmu', 'x86/amd', 'x86/vt-d', 'virtio' and 'core' into next
git bisect bad 9b3a713feef8db41d4bcccb3b97e86ee906690c8
# bad: [3c124435e8dd516df4b2fc983f4415386fd6edae] iommu/amd: Support multiple PCI DMA aliases in IRQ Remapping
git bisect bad 3c124435e8dd516df4b2fc983f4415386fd6edae
# bad: [be62dbf554c5b50718a54a359372c148cd9975c7] iommu/amd: Convert AMD iommu driver to the dma-iommu api
git bisect bad be62dbf554c5b50718a54a359372c148cd9975c7
# good: [781ca2de89bae1b1d2c96df9ef33e9a324415995] iommu: Add gfp parameter to iommu_ops::map
git bisect good 781ca2de89bae1b1d2c96df9ef33e9a324415995
# good: [6e2350207f40e24884da262976f7fd4fba387e8a] iommu/dma-iommu: Use the dev->coherent_dma_mask
git bisect good 6e2350207f40e24884da262976f7fd4fba387e8a
# first bad commit: [be62dbf554c5b50718a54a359372c148cd9975c7] iommu/amd: Convert AMD iommu driver to the dma-iommu api


I am going to *try* and revert that change and see if if it fixes the issue. Will also check if the latest 5.6rc has shows the errors
Comment 7 bigbeeshane 2020-03-22 23:47:31 UTC
Seems some other issues are showing against this commit

https://bugzilla.kernel.org/show_bug.cgi?id=206461
Comment 8 bigbeeshane 2020-03-23 00:38:58 UTC
After some further validation

5.6-rc6 also has this bug

Reverting be62dbf554c5b50718a54a359372c148cd9975c7 fixes the issue but overall it seems that amdgpu is not using the new implementation of dma_map_sg correctly. 

Looking at the documentation (here : https://git.kernel.org/pub/scm/linux/kernel/git/stable/linux.git/tree/Documentation/DMA-API.txt?h=v5.5.11#n354) it seems like return value of dma_map_sg and the supplied value for nents can differ in length.

Currently the amdgpu driver code validates that the return value of dma_map_sg and nents are equal, otherwise bailing out of amdgpu_ttm_tt_pin_userptr see line :

https://git.kernel.org/pub/scm/linux/kernel/git/stable/linux.git/tree/drivers/gpu/drm/amd/amdgpu/amdgpu_ttm.c?h=v5.5.11#n976

This would explain the "*ERROR* failed to pin userptr" message followed by the trace.
Comment 9 bigbeeshane 2020-03-23 12:44:49 UTC
Also validated last night that the following patch to disable the merging of sg sections within dma-iommu fixes the issue I am seeing 


--- a/drivers/iommu/dma-iommu.c
+++ b/drivers/iommu/dma-iommu.c
@@ -779,7 +779,7 @@ static int __finalise_sg(struct device *dev, struct scatterlist *sg, int nents,
                 * - but doesn't fall at a segment boundary
                 * - and wouldn't make the resulting output segment too long
                 */
-               if (cur_len && !s_iova_off && (dma_addr & seg_mask) &&
+               if (0 && cur_len && !s_iova_off && (dma_addr & seg_mask) &&
                    (max_len - cur_len >= s_length)) {


I guess amdgpu needs to be updated to handle the case where the iommu driver is merging some of the requested segments ?
Comment 10 bigbeeshane 2020-03-23 13:57:22 UTC
Created attachment 288017 [details]
amdgpu_possible_patch

drm_prime_sg_to_page_addr_arrays does not support cases when the number of segments returned from dma_map_sg differs from that reported (this can be the case)

Add and make use of a version that can use the count data returned from dma_map_sg and the correct sg_dma_len macro
Comment 11 Alex Deucher 2020-03-23 15:59:03 UTC
Thanks for the patch.  Please fix drm_prime_sg_to_page_addr_arrays() directly and send the patch to dri-devel@lists.freedesktop.org .  Also please add your Signed-off_by.
Comment 12 Alex Deucher 2020-03-23 16:01:31 UTC
It's likely other drivers that rely on these helpers would be similarly broken.
Comment 13 bigbeeshane 2020-03-23 16:11:59 UTC
Indeed, however they may not have pushed the SG lists via dma map in the same way as amdgpu.

In that case getting lengths from dma_map_sg would probably cause other issues
Comment 14 Alex Deucher 2020-03-23 16:16:06 UTC
True.  For now just send out the patch and we can discuss further on the list.  Thanks!
Comment 15 Alex Deucher 2020-03-23 16:18:50 UTC
General comment about the patch, you can make amdgpu_ttm_dma_sg_to_arrays static since it's only used within amdgpu_ttm.c,
Comment 16 bigbeeshane 2020-03-23 17:01:25 UTC
I'll update drm_prime_sg_to_page_addr_arrays to support both the current logic and dma mapped logic and get a patch up this evening.

That way at least nothing else get broke

Note You need to log in before you can comment on or make changes to this bug.