Bug 188271

Summary: IOMMU DMAR fault with NVIDIA CUDA peer to peer
Product: Drivers Reporter: Vadim Markovtsev (vadim)
Component: Video(DRI - non Intel)Assignee: drivers_video-dri
Status: NEW ---    
Severity: normal    
Priority: P1    
Hardware: x86-64   
OS: Linux   
Kernel Version: 4.8.6 Subsystem:
Regression: No Bisected commit-id:
Attachments: dmidecode -t 2
lscpu
lspci -knnv
nvidia-smi proto -m
cat /proc/cmdline
uname -a

Description Vadim Markovtsev 2016-11-21 17:16:57 UTC
My motherboard is Supermicro X10DRG-Q (details in attached output of dmidecode). It has 2 Xeon E5-2620 v4 (details in attached lscpu output). Two Titan X 2016 GPUs are inserted into PCIe slots (see nvidia-smi output). After enabling of the peer to peer access between those two cards, execution of cudaMemcpyPeer() hangs and dmesg shows:

[16193.612535] DMAR: DRHD: handling fault status reg 602
[16193.617662] DMAR: [DMA Write] Request device [82:00.0] fault addr 387fc000c000 [fault reason 05] PTE Write access is not set
[16193.661857] DMAR: DRHD: handling fault status reg 702
[16193.666976] DMAR: [DMA Write] Request device [82:00.0] fault addr f8139000 [fault reason 05] PTE Write access is not set (edited)

I am using CoreOS, and the whole stuff happens inside a docker container running with -device /dev/nvidiactl --device /dev/nvidia0 --device /dev/nvidia1 --device /dev/nvidia-uvm --privileged --security-opt seccomp=unconfined

The addition of intel_iommu=igfx_off to kernel command line cures the problem and peer to peer works perfectly.
Comment 1 Vadim Markovtsev 2016-11-21 17:17:09 UTC
Created attachment 245361 [details]
dmidecode -t 2
Comment 2 Vadim Markovtsev 2016-11-21 17:17:24 UTC
Created attachment 245371 [details]
lscpu
Comment 3 Vadim Markovtsev 2016-11-21 17:17:41 UTC
Created attachment 245381 [details]
lspci -knnv
Comment 4 Vadim Markovtsev 2016-11-21 17:18:04 UTC
Created attachment 245391 [details]
nvidia-smi proto -m
Comment 5 Vadim Markovtsev 2016-11-21 17:18:33 UTC
Created attachment 245401 [details]
cat /proc/cmdline

Added intel_iommu=off
Comment 6 Vadim Markovtsev 2016-11-21 17:18:49 UTC
Created attachment 245411 [details]
uname -a