Bug 76331

Summary:	kernel BUG at drivers/iommu/intel-iommu.c:844!
Product:	Virtualization	Reporter:	Matt (mspeder)
Component:	kvm	Assignee:	virtualization_kvm
Status:	NEW ---
Severity:	normal	CC:	alex.williamson, dwmw2, mspeder, szg00000
Priority:	P1
Hardware:	x86-64
OS:	Linux
Kernel Version:	3.14.4	Subsystem:
Regression:	No	Bisected commit-id:

Description Matt 2014-05-16 08:09:08 UTC

I hit this bug while trying to pass through an NVIDIA GPU and an ICH audio device simultaneously using vfio.

Extract from the libvirt config for this VM:
  <qemu:commandline>
    <qemu:arg value='-device'/>
    <qemu:arg value='vfio-pci,host=00:1b.0,bus=pcie.0'/>
    <qemu:arg value='-device'/>
    <qemu:arg value='ioh3420,bus=pcie.0,addr=1c.0,multifunction=on,port=1,chassis=1,id=pcieroot.1'/>
    <qemu:arg value='-device'/>
    <qemu:arg value='vfio-pci,host=06:00.0,bus=pcieroot.1,addr=00.0,multifunction=on,x-vga=on,romfile=/local/kvm2/GF114.rom'/>
  </qemu:commandline>

Kernel trace :

[  506.224316] ------------[ cut here ]------------
[  506.224323] kernel BUG at drivers/iommu/intel-iommu.c:844!
[  506.224325] invalid opcode: 0000 [#1] PREEMPT SMP 
[  506.224328] Modules linked in: vhost_net vhost macvtap macvlan tun vfio_iommu_type1 vfio_pci vfio fuse nfsv3 rpcsec_gss_krb5 nfsv4 dns_resolver nfsd auth_rpcgss oid_registry nfs_acl bridge stp llc snd_hda_codec_hdmi coretemp intel_powerclamp kvm_intel hid_generic nouveau btusb bluetooth mxm_wmi kvm mousedev crct10dif_pclmul crc32_pclmul crc32c_intel ghash_clmulni_intel aesni_intel aes_x86_64 lrw gf128mul snd_hda_codec_realtek mac_hid snd_hda_codec_generic rc_ati_x10 ati_remote ppdev gpio_ich 6lowpan_iphc wmi snd_hda_intel rc_core sky2 snd_hda_codec snd_hwdep video snd_pcm ttm rfkill drm_kms_helper drm iTCO_wdt glue_helper evdev i7core_edac hwmon i2c_algo_bit iTCO_vendor_support parport_pc ablk_helper snd_timer edac_core snd parport cryptd soundcore i2c_i801 i2c_core pcspkr psmouse serio_raw microcode
[  506.224375]  lpc_ich shpchp button thermal acpi_cpufreq processor nfs lockd sunrpc fscache ext4 crc16 mbcache jbd2 usbhid hid sd_mod sr_mod crc_t10dif cdrom crct10dif_common usb_storage atkbd libps2 ahci libahci firewire_ohci libata firewire_core crc_itu_t megaraid_sas ehci_pci uhci_hcd xhci_hcd ehci_hcd scsi_mod usbcore usb_common i8042 serio
[  506.224399] CPU: 0 PID: 839 Comm: qemu:Win8j Tainted: G          I  3.14.4-1-ARCH #1
[  506.224402] Hardware name:    /PURE BLACK X58, BIOS 080016  11/24/2010
[  506.224405] task: ffff8808eb5c1d70 ti: ffff8808efe12000 task.ti: ffff8808efe12000
[  506.224408] RIP: 0010:[<ffffffff813eea04>]  [<ffffffff813eea04>] dma_pte_clear_range+0x1e4/0x1f0
[  506.224415] RSP: 0018:ffff8808efe13b50  EFLAGS: 00010206
[  506.224417] RAX: 00000000000001ff RBX: ffff8808f5626100 RCX: 000000000000001b
[  506.224420] RDX: 0000000000000040 RSI: 0000000000000000 RDI: ffff8808f5626100
[  506.224422] RBP: ffff8808efe13b78 R08: 0000000000000000 R09: 0000000000000001
[  506.224425] R10: ffff88092bc174a0 R11: ffffea0023bfeb00 R12: 0000000000000001
[  506.224428] R13: ffff8808efd018a0 R14: 0000000000000000 R15: 0000000fffffffff
[  506.224430] FS:  00007fd9ff5fd700(0000) GS:ffff88092bc00000(0000) knlGS:0000000000000000
[  506.224434] CS:  0010 DS: 0000 ES: 0000 CR0: 000000008005003b
[  506.224436] CR2: 00007fb420e1fd80 CR3: 000000000180c000 CR4: 00000000000007e0
[  506.224438] Stack:
[  506.224440]  ffff8808f5626100 0000000000000001 ffff8808efd018a0 ffff8808f5626118
[  506.224445]  ffff8808f7f408e0 ffff8808efe13bd0 ffffffff813eedf8 0000000000000000
[  506.224449]  ffff8808effac8c0 ffff8808f5626118 ffff8808efe13bf8 ffff8808effabb00
[  506.224453] Call Trace:
[  506.224456]  [<ffffffff813eedf8>] vm_domain_exit+0x1f8/0x2e0
[  506.224460]  [<ffffffff813eeefd>] intel_iommu_domain_destroy+0x1d/0x20
[  506.224464]  [<ffffffff813e2bdb>] iommu_domain_free+0x1b/0x30
[  506.224468]  [<ffffffffa1ab16c9>] vfio_iommu_type1_release+0xe9/0x11a [vfio_iommu_type1]
[  506.224473]  [<ffffffffa1a9c67b>] __vfio_group_unset_container+0xfb/0x120 [vfio]
[  506.224477]  [<ffffffffa1a9c6c9>] vfio_group_try_dissolve_container+0x29/0x40 [vfio]
[  506.224481]  [<ffffffffa1a9c745>] vfio_device_fops_release+0x25/0x40 [vfio]
[  506.224485]  [<ffffffff811bc50c>] __fput+0x9c/0x240
[  506.224488]  [<ffffffff811bc6fe>] ____fput+0xe/0x10
[  506.224492]  [<ffffffff8108c26c>] task_work_run+0xcc/0xe0
[  506.224496]  [<ffffffff8106d328>] do_exit+0x398/0xb10
[  506.224501]  [<ffffffff810205f6>] ? ___preempt_schedule+0x56/0xb0
[  506.224504]  [<ffffffff8106db23>] do_group_exit+0x43/0xc0
[  506.224508]  [<ffffffff8107e020>] get_signal_to_deliver+0x270/0x6e0
[  506.224513]  [<ffffffff81016557>] do_signal+0x57/0x6c0
[  506.224516]  [<ffffffff81016c28>] do_notify_resume+0x68/0xa0
[  506.224521]  [<ffffffff815179a0>] int_signal+0x12/0x17
[  506.224523] Code: 41 89 f1 0f 1f 40 00 45 89 cc e9 46 ff ff ff 0f 0b 48 89 f0 48 d3 e8 48 85 c0 75 11 4c 89 f8 48 d3 e8 48 85 c0 0f 84 57 fe ff ff <0f> 0b 0f 0b 0f 1f 84 00 00 00 00 00 66 66 66 66 90 55 48 89 e5 
[  506.224546] RIP  [<ffffffff813eea04>] dma_pte_clear_range+0x1e4/0x1f0
[  506.224549]  RSP <ffff8808efe13b50>
[  506.224552] ---[ end trace 074d753a846ea21f ]---

Comment 1 Alex Williamson 2014-05-16 19:07:25 UTC

Can you please report the IOMMU cap registers, for VT-d, simply:

dmesg | grep ecap

Also, it shouldn't matter, but what's the size of the specified guest memory and the version of QEMU being used?

Thanks

Comment 2 Matt 2014-05-16 20:28:31 UTC

Hi Alex,

# dmesg | grep ecap
[    0.057396] dmar: IOMMU 0: reg_base_addr fbfff000 ver 1:0 cap c9008010e60262 ecap f020fa
[    0.057403] dmar: IOMMU 1: reg_base_addr fbffe000 ver 1:0 cap c9078010ef0462 ecap f020fe
[  158.832381] vfio_ecap_init: 0000:00:1b.0 hiding ecap 0x5@0x130

Host RAM is 36G
Guest RAM was 16G

I just tried with Guest RAM 4G, it didn't make any difference.

Here is the relevant extract of libvirt conf :
  <memory unit='KiB'>4194304</memory>
  <currentMemory unit='KiB'>4194304</currentMemory>
  <memtune>
    <hard_limit unit='KiB'>25165824</hard_limit>
  </memtune>

# qemu-system-x86_64 --version
QEMU emulator version 2.0.0, Copyright (c) 2003-2008 Fabrice Bellard

If you need anything else don't hesitate !

Thanks,

Matt

Comment 3 Matt 2014-05-23 05:54:41 UTC

Hi Alex,

Any news ?
Do you need additional info ?
I hit the bug each time I power off the VM (the VM works perfectly fine before the shutdown, with both sound and GPU passthrough).
After that I need to reboot the host before I can launch this VM again, so this is quite an annoying issue.

Thanks !

Matthieu

Comment 4 Alex Williamson 2014-05-23 18:35:38 UTC

The DRHD capability registers are reported as:

IOMMU 0: c9008010e60262
IOMMU 1: c9078010ef0462

Bits 21:16 of each register, in binary:

101111 (IOMMU 1, 0x2f)
100110 (IOMMU 0, 0x26)

From the VT-d spec (v2.2), bits 12:8 of the capability register are the Supported Adjusted Guest Address Widths (SAGAW), defined as:

  This 5-bit field indicates the supported adjusted
  guest address widths (which in turn represents
  the levels of page-table walks for the 4KB base
  page size) supported by the hardware
  implementation.

  A value of 1 in any of these bits indicates the
  corresponding adjusted guest address width is
  supported. The adjusted guest address widths
  corresponding to various bit positions within this
  field are:
    • 0: Reserved
    • 1: 39-bit AGAW (3-level page-table)
    • 2: 48-bit AGAW (4-level page-table)
    • 3: Reserved
    • 4: Reserved

  Software must ensure that the adjusted guest
  address width used to set up the page tables is
  one of the supported guest address widths
  reported in this field.

This system therefore has one DRHD unit supporting 3-level page tables (IOMMU 0) and the other supporting 4-level page tables (IOMMU 1).
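
As a cross-check of the SAGAW reading above, here is a minimal sketch that pulls bits 12:8 out of the two capability values reported in comment 2. The helper names sagaw_field and widest_agaw_bits are made up for illustration and are not taken from the driver; only the bit positions come from the spec text quoted above.

```c
#include <stdint.h>

/* Extract the SAGAW field (bits 12:8) from a VT-d capability register value. */
static inline unsigned sagaw_field(uint64_t cap)
{
    return (unsigned)((cap >> 8) & 0x1f);
}

/*
 * Return the widest adjusted guest address width, in bits, advertised by the
 * SAGAW field: 48 if bit 2 is set, else 39 if bit 1 is set, else 0.
 * The other bit positions are reserved in the quoted spec text and are ignored.
 */
static inline int widest_agaw_bits(uint64_t cap)
{
    unsigned sagaw = sagaw_field(cap);

    if (sagaw & (1u << 2))
        return 48;  /* 4-level page table */
    if (sagaw & (1u << 1))
        return 39;  /* 3-level page table */
    return 0;
}
```

For example, widest_agaw_bits(0xc9008010e60262ULL) gives 39 for IOMMU 0 and widest_agaw_bits(0xc9078010ef0462ULL) gives 48 for IOMMU 1.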

Bits 21:16 are the Maximum Guest Address Width:

  This field indicates the maximum DMA virtual
  addressability supported by remapping
  hardware. The Maximum Guest Address Width
  (MGAW) is computed as (N+1), where N is the
  valued reported in this field. For example, a
  hardware implementation supporting 48-bit
  MGAW reports a value of 47 (101111b) in this
  field.

  If the value in this field is X, untranslated and
  translated DMA requests to addresses above
  2^(X+1)-1 are always blocked by hardware.
  Device-TLB translation requests to address
  above 2^(X+1)-1 from allowed devices return a
  null Translation-Completion Data with R=W=0.
  Guest addressability for a given DMA request is
  limited to the minimum of the value reported
  through this field and the adjusted guest
  address width of the corresponding page-table
  structure. (Adjusted guest address widths
  supported by hardware are reported through
  the SAGAW field).

  Implementations must support MGAW at least
  equal to the physical addressability (host
  address width) of the platform.

On this system, IOMMU 0 therefore has a MGAW of 0x26 + 1 = 39 bits, IOMMU 1 = 0x2f + 1 = 48 bits.
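
The MGAW arithmetic can be checked the same way. This is only a sketch; mgaw_bits is a hypothetical helper name, and the bit range 21:16 and the N+1 rule are taken from the spec text quoted above.

```c
#include <stdint.h>

/*
 * Maximum Guest Address Width in bits: the 6-bit field at bits 21:16 of the
 * capability register holds N, and the width is N + 1.
 */
static inline int mgaw_bits(uint64_t cap)
{
    return (int)((cap >> 16) & 0x3f) + 1;
}
```

With the values from comment 2, mgaw_bits(0xc9008010e60262ULL) is 39 and mgaw_bits(0xc9078010ef0462ULL) is 48.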

The BUG we're hitting is:

BUG_ON(addr_width < BITS_PER_LONG && last_pfn >> addr_width);

So the last PFN of the domain is beyond the address width of the domain.
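
To make the condition concrete, the check can be written as a standalone predicate. This is a sketch, not the driver code: pfn_exceeds_width is a made-up name, and BITS_PER_LONG is assumed to be 64 because the report is from an x86-64 host. Note that addr_width here counts PFN bits, i.e. the address width minus the 12-bit page shift, which matches the shift count of 0x1b (27) visible in RCX in the register dump.

```c
#include <stdbool.h>
#include <stdint.h>

#define BITS_PER_LONG 64  /* assumption: 64-bit host, as in the report */

/*
 * Same condition as the BUG_ON: true when last_pfn has bits set at or above
 * addr_width, i.e. the PFN does not fit in the page table's address width.
 */
static inline bool pfn_exceeds_width(uint64_t last_pfn, int addr_width)
{
    return addr_width < BITS_PER_LONG && (last_pfn >> addr_width) != 0;
}
```

pfn_exceeds_width(0xfffffffffULL, 27) is true (a 36-bit PFN against a 27-bit table), while pfn_exceeds_width(0x7ffffffULL, 27) and pfn_exceeds_width(0xfffffffffULL, 36) are both false.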

last_pfn here is created from DOMAIN_MAX_PFN(domain->gaw)

All VM domains are created with a 48 bit width (domain->gaw):

#define DEFAULT_DOMAIN_ADDRESS_WIDTH 48

So the default last_pfn is 0xf_ffff_ffff

Given the default 48 bit width, the default domain AGAW (Adjusted Guest Address Width) is 2 (domain->agaw)
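
The numbers above follow from simple arithmetic, sketched here under two assumptions that are not spelled out in this comment: a 12-bit page shift (4KB pages) and a 9-bit stride per page-table level, with AGAW 0 corresponding to a 30-bit width. The function names mirror the helpers mentioned in this thread but are reimplemented here for illustration only.

```c
#include <stdint.h>

#define VTD_PAGE_SHIFT 12  /* assumption: 4KB base page */
#define LEVEL_STRIDE    9  /* assumption: 9 address bits per page-table level */

/* Highest page frame number addressable with a guest address width of gaw bits. */
static inline uint64_t domain_max_pfn(int gaw)
{
    return (UINT64_C(1) << (gaw - VTD_PAGE_SHIFT)) - 1;
}

/* Address width in bits to AGAW index: 39 -> 1, 48 -> 2. */
static inline int width_to_agaw(int width)
{
    return (width - 30) / LEVEL_STRIDE;
}

/* AGAW index back to address width in bits: 1 -> 39, 2 -> 48. */
static inline int agaw_to_width(int agaw)
{
    return 30 + agaw * LEVEL_STRIDE;
}
```

domain_max_pfn(48) is 0xf_ffff_ffff (2^36 - 1) and width_to_agaw(48) is 2, matching the defaults described above. For a 39-bit domain the same formulas give 0x7ff_ffff (2^27 - 1) and AGAW 1.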

When we add devices to the domain, the gaw is updated to match:

        /* check if this iommu agaw is sufficient for max mapped address */
        addr_width = agaw_to_width(iommu->agaw);
        if (addr_width > cap_mgaw(iommu->cap))
                addr_width = cap_mgaw(iommu->cap);

        if (dmar_domain->max_addr > (1LL << addr_width)) {
                printk(KERN_ERR "%s: iommu width (%d) is not "
                       "sufficient for the mapped address (%llx)\n",
                       __func__, addr_width, dmar_domain->max_addr);
                return -EFAULT;
        }
        dmar_domain->gaw = addr_width;

iommu->agaw is calculated from the SAGAW and will be either 1 or 2 here, depending on which IOMMU manages the device.  One bug stands out here: domain->gaw is set to the width of the IOMMU for the last device added.  An initial suspicion would therefore be that you could avoid the problem by re-ordering the qemu command line to create the devices in the reverse order.

So, depending on the order devices were added, domain->gaw is either 48 bits or 39 bits and therefore last_pfn going into the BUG_ON is either 0xf_ffff_ffff (2^36 - 1) or 0x7ff_ffff (2^27 - 1).

addr_width is set from 'agaw_to_width(domain->agaw) - VTD_PAGE_SHIFT', where domain->agaw is initially 2.  However, just beyond the above code snippet we have:

        /*
         * Knock out extra levels of page tables if necessary
         */
        while (iommu->agaw < dmar_domain->agaw) {
                struct dma_pte *pte;

                pte = dmar_domain->pgd;
                if (dma_pte_present(pte)) {
                        dmar_domain->pgd = (struct dma_pte *)
                                phys_to_virt(dma_pte_addr(pte));
                        free_pgtable_page(pte);
                }
                dmar_domain->agaw--;
        }

Therefore, when we add the device behind the 39 bit IOMMU first, we get:

last_pfn = 0x7ff_ffff
addr_width = 39 (27 after subtracting VTD_PAGE_SHIFT)

but then we add the device behind the 48 bit IOMMU and get:

last_pfn = 0xf_ffff_ffff
addr_width = 39 (27 after subtracting VTD_PAGE_SHIFT)

Resulting in the BUG_ON, since last_pfn >> 27 is then non-zero.

The fix might simply be to change setting the GAW here to:

dmar_domain->gaw = min(dmar_domain->gaw, addr_width);
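
The ordering problem and the effect of the min() change can be reproduced with a small userspace model. This is a toy sketch, not the driver: the toy_* structs and functions are invented for illustration, the max_addr check is omitted, page-table freeing is reduced to decrementing agaw, and a 12-bit page shift and 9-bit level stride are assumed.

```c
#include <stdbool.h>
#include <stdint.h>

#define VTD_PAGE_SHIFT 12  /* assumption: 4KB base page */
#define LEVEL_STRIDE    9  /* assumption: 9 address bits per level */

struct toy_iommu  { int agaw; int mgaw; };  /* hypothetical model of one DRHD unit */
struct toy_domain { int gaw;  int agaw; };  /* hypothetical model of a VM domain */

static int agaw_to_width(int agaw) { return 30 + agaw * LEVEL_STRIDE; }

/* Model of the attach path quoted above; use_min selects the proposed fix. */
static void toy_attach(struct toy_domain *d, const struct toy_iommu *u, bool use_min)
{
    int addr_width = agaw_to_width(u->agaw);

    if (addr_width > u->mgaw)
        addr_width = u->mgaw;
    if (use_min && d->gaw < addr_width)
        addr_width = d->gaw;           /* gaw = min(gaw, addr_width) */
    d->gaw = addr_width;

    while (u->agaw < d->agaw)          /* knock out extra page-table levels */
        d->agaw--;
}

/* True if the teardown check (last_pfn >> addr_width) would fire. */
static bool toy_teardown_bugs(const struct toy_domain *d)
{
    uint64_t last_pfn = (UINT64_C(1) << (d->gaw - VTD_PAGE_SHIFT)) - 1;
    int addr_width = agaw_to_width(d->agaw) - VTD_PAGE_SHIFT;

    return (last_pfn >> addr_width) != 0;
}

/* Attach one device behind each IOMMU in the given order, then tear down. */
static bool toy_run(bool narrow_first, bool use_min)
{
    const struct toy_iommu narrow = { .agaw = 1, .mgaw = 39 };
    const struct toy_iommu wide   = { .agaw = 2, .mgaw = 48 };
    struct toy_domain d = { .gaw = 48, .agaw = 2 };  /* default 48-bit domain */

    toy_attach(&d, narrow_first ? &narrow : &wide, use_min);
    toy_attach(&d, narrow_first ? &wide : &narrow, use_min);
    return toy_teardown_bugs(&d);
}
```

In this model toy_run(true, false) trips the check (39 bit IOMMU first, original assignment), while toy_run(false, false) does not, which is why re-ordering looked like a possible workaround. With use_min set, neither order trips it, because gaw can only shrink along with agaw.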

Comment 5 Matt 2014-05-28 10:41:50 UTC

Hi Alex,

Great news !
Yesterday I had the opportunity to recompile my kernel with your suggested fix in intel-iommu driver : dmar_domain->gaw = min(dmar_domain->gaw, addr_width);

After multiple tests I can confirm that this successfully fixed the issue.
How can we have this integrated in the official kernel sources ?

I also tried to re-order the qemu command-line...
With or without the fix I don't see any difference and I always end up with various problems related to the gpu pass-thru :
- one VM blue screen at boot (VIDEO_TDR_ERROR)
- one Host crash !
- driver error (code 43) and dmesg full of errors like :
[ 2283.900194] dmar: DMAR:[DMA Read] Request device [06:00.0] fault addr 12de0a000 
DMAR:[fault reason 12] non-zero reserved fields in PTE
[ 2283.900201] dmar: DMAR:[DMA Write] Request device [06:00.0] fault addr aff93000 
DMAR:[fault reason 12] non-zero reserved fields in PTE
[ 2286.149141] dmar: DRHD: handling fault status reg 602

But I'm not sure if this problem is related...

Comment 6 Matt 2014-07-09 10:25:19 UTC

Hi Alex and David,

I've been successfully using Alex's fix for more than a month now.
https://lkml.org/lkml/2014/5/29/932

Would it be possible to close this bug by adding the patch to the official kernel tree ?

Thanks !