Bug 188271 - IOMMU DMAR fault with NVIDIA CUDA peer to peer
Summary: IOMMU DMAR fault with NVIDIA CUDA peer to peer
Status: NEW
Alias: None
Product: Drivers
Classification: Unclassified
Component: Video(DRI - non Intel) (show other bugs)
Hardware: x86-64 Linux
: P1 normal
Assignee: drivers_video-dri
URL:
Keywords:
Depends on:
Blocks:
 
Reported: 2016-11-21 17:16 UTC by Vadim Markovtsev
Modified: 2016-11-21 17:18 UTC (History)
0 users

See Also:
Kernel Version: 4.8.6
Subsystem:
Regression: No
Bisected commit-id:


Attachments
dmidecode -t 2 (419 bytes, text/plain)
2016-11-21 17:17 UTC, Vadim Markovtsev
Details
lscpu (1.41 KB, text/plain)
2016-11-21 17:17 UTC, Vadim Markovtsev
Details
lspci -knnv (53.25 KB, text/plain)
2016-11-21 17:17 UTC, Vadim Markovtsev
Details
nvidia-smi proto -m (1.89 KB, text/plain)
2016-11-21 17:18 UTC, Vadim Markovtsev
Details
cat /proc/cmdline (310 bytes, text/plain)
2016-11-21 17:18 UTC, Vadim Markovtsev
Details
uname -a (99 bytes, text/plain)
2016-11-21 17:18 UTC, Vadim Markovtsev
Details

Description Vadim Markovtsev 2016-11-21 17:16:57 UTC
My motherboard is Supermicro X10DRG-Q (details in attached output of dmidecode). It has 2 Xeon E5-2620 v4 (details in attached lscpu output). Two Titan X 2016 GPUs are inserted into PCIe slots (see nvidia-smi output). After enabling of the peer to peer access between those two cards, execution of cudaMemcpyPeer() hangs and dmesg shows:

[16193.612535] DMAR: DRHD: handling fault status reg 602
[16193.617662] DMAR: [DMA Write] Request device [82:00.0] fault addr 387fc000c000 [fault reason 05] PTE Write access is not set
[16193.661857] DMAR: DRHD: handling fault status reg 702
[16193.666976] DMAR: [DMA Write] Request device [82:00.0] fault addr f8139000 [fault reason 05] PTE Write access is not set (edited)

I am using CoreOS, and the whole stuff happens inside a docker container running with -device /dev/nvidiactl --device /dev/nvidia0 --device /dev/nvidia1 --device /dev/nvidia-uvm --privileged --security-opt seccomp=unconfined

The addition of intel_iommu=igfx_off to kernel command line cures the problem and peer to peer works perfectly.
Comment 1 Vadim Markovtsev 2016-11-21 17:17:09 UTC
Created attachment 245361 [details]
dmidecode -t 2
Comment 2 Vadim Markovtsev 2016-11-21 17:17:24 UTC
Created attachment 245371 [details]
lscpu
Comment 3 Vadim Markovtsev 2016-11-21 17:17:41 UTC
Created attachment 245381 [details]
lspci -knnv
Comment 4 Vadim Markovtsev 2016-11-21 17:18:04 UTC
Created attachment 245391 [details]
nvidia-smi proto -m
Comment 5 Vadim Markovtsev 2016-11-21 17:18:33 UTC
Created attachment 245401 [details]
cat /proc/cmdline

Added intel_iommu=off
Comment 6 Vadim Markovtsev 2016-11-21 17:18:49 UTC
Created attachment 245411 [details]
uname -a

Note You need to log in before you can comment on or make changes to this bug.