Created attachment 287987: crash log

I have found that using the amdgpu-pro OpenCL stack with kernel 5.5.10 causes a crash (see attached log). I have seen this while running Folding@home. Reverting to 5.4.26 with no other changes fixes the issue.
Can you bisect?
Yes, should be able to over the weekend. Will report my findings.
These two commits break AMDGPU-PRO OpenCL and ROCm. I guess userspace needs updating. https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=6e6db2722c287122bfc4d51e685872fb5031cf18 https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=a39414716ca08c08ce09d9e8409ef525e7a77eaf
If that is the case, isn't the issue in the kernel rather than in the user space applications? In that case amdgpu would be incompatible with all of the OpenCL 2.x implementations.
That depends on whether the changes to the DMA mechanics were meant to be compatible with the old implementation, but I can't answer that as I am no AMD developer.
Looks like it's more to do with the switch from amd-iommu to dma-iommu (see my bisect below):

git bisect start
# good: [e87eb585d31fadb5e9e549a1de4b2da60a79bfc9] Merge branch 'pci/misc'
git bisect good e87eb585d31fadb5e9e549a1de4b2da60a79bfc9
# bad: [c3bed3b20e40ab44b98ac5f0471a5bd92a802f5a] Merge tag 'pci-v5.5-changes' of git://git.kernel.org/pub/scm/linux/kernel/git/helgaas/pci
git bisect bad c3bed3b20e40ab44b98ac5f0471a5bd92a802f5a
# good: [3f1b210a7f97f7e75c56174ada476fba2d36f340] Merge tag 'sound-5.5-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/tiwai/sound
git bisect good 3f1b210a7f97f7e75c56174ada476fba2d36f340
# skip: [a6ed68d6468bd5a3da78a103344ded1435fed57a] Merge tag 'drm-next-2019-11-27' of git://anongit.freedesktop.org/drm/drm
git bisect skip a6ed68d6468bd5a3da78a103344ded1435fed57a
# skip: [3f86a7e090d1dfb974a9dc9d44049f9bff01e6a5] gpiolib: acpi: Print pin number on acpi_gpiochip_alloc_event errors
git bisect skip 3f86a7e090d1dfb974a9dc9d44049f9bff01e6a5
# good: [32d1fe8fcb32130733b59fc447e35753dc87fd40] mm/hotplug: reorder memblock_[free|remove]() calls in try_remove_memory()
git bisect good 32d1fe8fcb32130733b59fc447e35753dc87fd40
# good: [a5255bc31673c72e264d837cd13cd3085d72cb58] Merge tag 'dmaengine-5.5-rc1' of git://git.infradead.org/users/vkoul/slave-dma
git bisect good a5255bc31673c72e264d837cd13cd3085d72cb58
# bad: [9b326948c23908692d7dfe56ed149840d3829eaa] Merge tag 'firewire-update' of git://git.kernel.org/pub/scm/linux/kernel/git/ieee1394/linux1394
git bisect bad 9b326948c23908692d7dfe56ed149840d3829eaa
# bad: [937d6eefc716a9071f0e3bada19200de1bb9d048] Merge tag 'docs-5.5a' of git://git.lwn.net/linux
git bisect bad 937d6eefc716a9071f0e3bada19200de1bb9d048
# good: [a8de1304b7df30e3a14f2a8b9709bb4ff31a0385] libfdt: define INT32_MAX and UINT32_MAX in libfdt_env.h
git bisect good a8de1304b7df30e3a14f2a8b9709bb4ff31a0385
# good: [bf23a48edbe331f834eb49d1bd6484ae98cf4dc7] Documentation/translation: Use Korean for Korean translation title
git bisect good bf23a48edbe331f834eb49d1bd6484ae98cf4dc7
# good: [34d1b0895dbd10713c73615d8f532e78509e12d9] iommu/arm-smmu: Remove duplicate error message
git bisect good 34d1b0895dbd10713c73615d8f532e78509e12d9
# bad: [9b3a713feef8db41d4bcccb3b97e86ee906690c8] Merge branches 'iommu/fixes', 'arm/qcom', 'arm/renesas', 'arm/rockchip', 'arm/mediatek', 'arm/tegra', 'arm/smmu', 'x86/amd', 'x86/vt-d', 'virtio' and 'core' into next
git bisect bad 9b3a713feef8db41d4bcccb3b97e86ee906690c8
# bad: [3c124435e8dd516df4b2fc983f4415386fd6edae] iommu/amd: Support multiple PCI DMA aliases in IRQ Remapping
git bisect bad 3c124435e8dd516df4b2fc983f4415386fd6edae
# bad: [be62dbf554c5b50718a54a359372c148cd9975c7] iommu/amd: Convert AMD iommu driver to the dma-iommu api
git bisect bad be62dbf554c5b50718a54a359372c148cd9975c7
# good: [781ca2de89bae1b1d2c96df9ef33e9a324415995] iommu: Add gfp parameter to iommu_ops::map
git bisect good 781ca2de89bae1b1d2c96df9ef33e9a324415995
# good: [6e2350207f40e24884da262976f7fd4fba387e8a] iommu/dma-iommu: Use the dev->coherent_dma_mask
git bisect good 6e2350207f40e24884da262976f7fd4fba387e8a
# first bad commit: [be62dbf554c5b50718a54a359372c148cd9975c7] iommu/amd: Convert AMD iommu driver to the dma-iommu api

I am going to try reverting that change to see if it fixes the issue. I will also check whether the latest 5.6-rc shows the errors.
Other issues also seem to have been reported against this commit: https://bugzilla.kernel.org/show_bug.cgi?id=206461
After some further validation, 5.6-rc6 also has this bug. Reverting be62dbf554c5b50718a54a359372c148cd9975c7 fixes the issue, but overall it seems that amdgpu is not handling the new implementation of dma_map_sg correctly.

Looking at the documentation (https://git.kernel.org/pub/scm/linux/kernel/git/stable/linux.git/tree/Documentation/DMA-API.txt?h=v5.5.11#n354), the return value of dma_map_sg can differ from the supplied nents: the mapped list may be shorter than the list passed in.

Currently the amdgpu driver code checks that the return value of dma_map_sg equals nents, and otherwise bails out of amdgpu_ttm_tt_pin_userptr. See: https://git.kernel.org/pub/scm/linux/kernel/git/stable/linux.git/tree/drivers/gpu/drm/amd/amdgpu/amdgpu_ttm.c?h=v5.5.11#n976

This would explain the "*ERROR* failed to pin userptr" message followed by the trace.
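To illustrate the mismatch described above, here is a minimal userspace sketch, not kernel code. struct model_sg, model_map_sg, strict_pin_ok and tolerant_pin_ok are hypothetical names made up for this example; the merging rule (consecutive IOVA space, merge while the result fits in max_seg_len) is a simplifying assumption and not the actual dma-iommu logic.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-in for struct scatterlist; NOT the kernel type. */
struct model_sg {
    uint64_t phys;      /* CPU-side address of the segment */
    uint32_t length;    /* CPU-side length in bytes */
    uint64_t dma_addr;  /* filled by model_map_sg(), cf. sg_dma_address() */
    uint32_t dma_len;   /* filled by model_map_sg(), cf. sg_dma_len() */
};

/*
 * Toy model of an IOMMU-backed mapping (assumption, not dma-iommu itself):
 * every input segment gets consecutive IOVA space starting at iova_base,
 * and a segment is merged into the previous DMA segment as long as the
 * merged length does not exceed max_seg_len. The DMA fields are stored in
 * the leading entries of the list. Returns the number of DMA segments,
 * which can be smaller than nents.
 */
static int model_map_sg(struct model_sg *sg, int nents,
                        uint64_t iova_base, uint32_t max_seg_len)
{
    uint64_t iova = iova_base;
    int count = 0;

    assert(sg != NULL && nents > 0);
    for (int i = 0; i < nents; i++) {
        uint32_t len = sg[i].length;

        if (count > 0 &&
            (uint64_t)sg[count - 1].dma_len + len <= max_seg_len) {
            sg[count - 1].dma_len += len;   /* merge into previous segment */
        } else {
            sg[count].dma_addr = iova;      /* start a new DMA segment */
            sg[count].dma_len = len;
            count++;
        }
        iova += len;
    }
    return count;
}

/* The kind of check described above: fails as soon as anything is merged. */
static int strict_pin_ok(int mapped, int nents)
{
    return mapped == nents;
}

/* A check that only treats "nothing mapped" as failure. */
static int tolerant_pin_ok(int mapped)
{
    return mapped > 0;
}
```

In this model, four page-sized segments mapped with a 64 KiB segment limit come back as a single 16 KiB DMA segment, so a strict equality check rejects a mapping that actually succeeded.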
Also validated last night that the following patch, which disables the merging of sg sections within dma-iommu, fixes the issue I am seeing:

--- a/drivers/iommu/dma-iommu.c
+++ b/drivers/iommu/dma-iommu.c
@@ -779,7 +779,7 @@ static int __finalise_sg(struct device *dev, struct scatterlist *sg, int nents,
 		 * - but doesn't fall at a segment boundary
 		 * - and wouldn't make the resulting output segment too long
 		 */
-		if (cur_len && !s_iova_off && (dma_addr & seg_mask) &&
+		if (0 && cur_len && !s_iova_off && (dma_addr & seg_mask) &&
 		    (max_len - cur_len >= s_length)) {

I guess amdgpu needs to be updated to handle the case where the iommu driver merges some of the requested segments?
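For readers who want to poke at the merge condition from the hunk above outside the kernel, it can be lifted into a standalone predicate. This is only a sketch: model_can_merge is a hypothetical name, the parameter types are assumptions, and the meanings given for cur_len and s_iova_off are inferred from their names, since the hunk only shows the last two comment lines.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/*
 * Standalone copy of the condition in the hunk above. The variable names
 * match the hunk; their types and the first two comments are assumptions.
 */
static bool model_can_merge(uint32_t cur_len, uint32_t s_iova_off,
                            uint64_t dma_addr, uint64_t seg_mask,
                            uint32_t max_len, uint32_t s_length)
{
    /* Assumed precondition so the unsigned subtraction cannot wrap. */
    assert(cur_len <= max_len);

    return cur_len != 0 &&                  /* assumed: an output segment exists */
           s_iova_off == 0 &&               /* assumed: no offset into the IOVA page */
           (dma_addr & seg_mask) != 0 &&    /* doesn't fall at a segment boundary */
           max_len - cur_len >= s_length;   /* result would not be too long */
}
```

Prefixing the condition with `0 &&`, as the test patch does, makes this predicate always false, so every input segment becomes its own output segment and the returned count matches nents again.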
Created attachment 288017: amdgpu_possible_patch

drm_prime_sg_to_page_addr_arrays does not support the case where the number of segments returned by dma_map_sg differs from the number of entries passed in (which can happen). Add and use a version that takes the count returned by dma_map_sg and uses the correct sg_dma_len macro.
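The idea of walking the mapped segments by their DMA length can be sketched in plain userspace C. This is not the attached patch and not the DRM helper: struct model_dma_seg, model_fill_dma_addrs and MODEL_PAGE_SIZE are hypothetical names, and a 4 KiB page size is assumed.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define MODEL_PAGE_SIZE 4096u   /* assumed page size for this sketch */

/* Hypothetical view of one mapped segment: what sg_dma_address() and
 * sg_dma_len() would report for it in the kernel. */
struct model_dma_seg {
    uint64_t dma_addr;
    uint32_t dma_len;
};

/*
 * Walk the `mapped` DMA segments (the count returned by the mapping call,
 * not the original number of entries) and record one bus address per page.
 * Returns the number of page addresses written, or -1 if the segments
 * cover more than max_pages pages.
 */
static int model_fill_dma_addrs(const struct model_dma_seg *segs, int mapped,
                                uint64_t *addrs, int max_pages)
{
    int page = 0;

    assert(segs != NULL && addrs != NULL);
    for (int i = 0; i < mapped; i++) {
        uint64_t addr = segs[i].dma_addr;
        uint32_t left = segs[i].dma_len;

        while (left > 0) {
            if (page >= max_pages)
                return -1;
            addrs[page++] = addr;
            addr += MODEL_PAGE_SIZE;
            left = left > MODEL_PAGE_SIZE ? left - MODEL_PAGE_SIZE : 0;
        }
    }
    return page;
}
```

With two mapped segments of 12 KiB and 4 KiB this yields four page addresses, even though the segment count (2) no longer matches the page count. That is the situation a merged mapping produces.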
Thanks for the patch. Please fix drm_prime_sg_to_page_addr_arrays() directly and send the patch to dri-devel@lists.freedesktop.org. Also please add your Signed-off-by.
It's likely other drivers that rely on these helpers would be similarly broken.
Indeed, however they may not have pushed the SG lists via dma map in the same way as amdgpu. In that case getting lengths from dma_map_sg would probably cause other issues
True. For now just send out the patch and we can discuss further on the list. Thanks!
General comment about the patch: you can make amdgpu_ttm_dma_sg_to_arrays static, since it's only used within amdgpu_ttm.c.
I'll update drm_prime_sg_to_page_addr_arrays to support both the current logic and the DMA-mapped logic and get a patch up this evening. That way at least nothing else gets broken.