Bug 202673

Summary: TP-Link Archer T1U fails to probe
Product: Drivers
Reporter: Jan Viktorin (jan.viktorin)
Component: network-wireless
Assignee: drivers_network-wireless (drivers_network-wireless)
Status: CLOSED CODE_FIX
Severity: normal
CC: duodave, stf_xl, ZeroBeat
Priority: P1
Hardware: All
OS: Linux
Kernel Version: 4.20
Subsystem:
Regression: No
Bisected commit-id:
Attachments: dmesg from 4.20 with failing mt76x0u
0001-mt76usb-do-not-unaligned-sizes-for-buf-alloc.patch
dmesg from 4.20 with applied 0001-mt76usb-do-not-unaligned-sizes-for-buf-alloc.patch
dmesg from 4.20 with applied [net,v2] mm: page_alloc: fix ref bias in page_frag_alloc() for 1-byte allocs
dmesg from 4.20 with applied 0001-mt76x02u-use-usb_bulk_msg-to-upload-firmware.patch
panic after login into kernel 4.20 system with 2 patches
0001-mt76usb-make-rx-buffers-page-size.patch
0001-mt76usb-avoid-using-usb-hcd-bounce-buffer.patch
dmesg from 4.20 with 0001-mt76usb-avoid-using-usb-hcd-bounce-buffer.patch
3_fixes_and_printk.patch
dmesg with 3 patches at once and printk
3_fixes_and_printk_v2.patch
dmesg from 4.20 after applying pack of 3 patches v2
3_fixes_and_printk_v3.patch
dmesg from 4.20 after applying pack of 3 patches v3
amd_iommu_correct_startaddr.patch
amd_iommu_correct_dma_address.patch
dmesg from 4.20, applied pack of 3 patches v3 + iommu correct dma address patch
mt76usb-use-2048-buffers.patch
dmesg from 4.20 to check 2k buffers

Description Jan Viktorin 2019-02-24 22:25:44 UTC
After upgrade from Linux 4.19.x to 4.20.x, the device

 2357:0105 TP-Link Archer T1U 802.11a/n/ac Wireless Adapter [MediaTek MT7610U]

stops working with the following messages:

Feb 24 22:53:49 kernel: mt76x0u 3-4:1.0: error: mt76x02u_mcu_wait_resp timed out
Feb 24 22:53:49 kernel: mt76x0u 3-4:1.0: vendor request req:07 off:1004 failed:-22
Feb 24 22:53:50 kernel: mt76x0u 3-4:1.0: vendor request req:06 off:1004 failed:-22
Feb 24 22:53:50 kernel: mt76x0u 3-4:1.0: vendor request req:07 off:141c failed:-22
Feb 24 22:53:50 kernel: mt76x0u 3-4:1.0: vendor request req:06 off:141c failed:-22
Feb 24 22:53:50 kernel: mt76x0u 3-4:1.0: vendor request req:07 off:080c failed:-22
Feb 24 22:53:50 kernel: mt76x0u 3-4:1.0: vendor request req:06 off:080c failed:-22
Feb 24 22:53:50 kernel: mt76x0u 3-4:1.0: vendor request req:07 off:0230 failed:-22
Feb 24 22:53:50 kernel: mt76x0u 3-4:1.0: vendor request req:06 off:0230 failed:-22
Feb 24 22:53:50 kernel: mt76x0u 3-4:1.0: vendor request req:07 off:1200 failed:-22
Feb 24 22:53:50 kernel: mt76x0u 3-4:1.0: vendor request req:07 off:1200 failed:-22
Feb 24 22:53:50 kernel: mt76x0u 3-4:1.0: vendor request req:07 off:1200 failed:-22
Feb 24 22:53:51 kernel: mt76x0u 3-4:1.0: vendor request req:07 off:1200 failed:-22
Feb 24 22:53:51 kernel: mt76x0u 3-4:1.0: vendor request req:07 off:1200 failed:-22
Feb 24 22:53:51 kernel: mt76x0u 3-4:1.0: vendor request req:07 off:1200 failed:-22
Feb 24 22:53:51 kernel: mt76x0u 3-4:1.0: vendor request req:07 off:1200 failed:-22
Feb 24 22:53:51 kernel: mt76x0u 3-4:1.0: vendor request req:07 off:1200 failed:-22
Feb 24 22:53:51 kernel: mt76x0u 3-4:1.0: vendor request req:07 off:1200 failed:-22
Feb 24 22:53:51 kernel: mt76x0u 3-4:1.0: vendor request req:07 off:1200 failed:-22
Feb 24 22:53:51 kernel: mt76x0u 3-4:1.0: vendor request req:07 off:1200 failed:-22
Feb 24 22:53:52 kernel: mt76x0u 3-4:1.0: vendor request req:07 off:0080 failed:-22
Feb 24 22:53:52 kernel: mt76x0u 3-4:1.0: vendor request req:06 off:0080 failed:-22
Feb 24 22:53:52 kernel: mt76x0u 3-4:1.0: vendor request req:06 off:0080 failed:-22
Feb 24 22:53:52 kernel: mt76x0u: probe of 3-4:1.0 failed with error -5
Feb 24 22:53:52 kernel: usbcore: registered new interface driver mt76x0u
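
For anyone triaging a similar log, here is a minimal sketch that tallies the repeated vendor request failures and translates the return codes into errno names. The regular expressions, the helper names (errno_name, summarize) and the small errno table are illustrative assumptions, not part of the driver; the table only lists the Linux values that appear in this report.

```python
import re
from collections import Counter

# Linux errno values seen in this report, hard-coded so the sketch
# does not depend on the platform it runs on.
ERRNO_NAMES = {5: "EIO", 22: "EINVAL", 110: "ETIMEDOUT"}

# Hypothetical patterns matching the message shapes quoted above.
VENDOR_RE = re.compile(
    r"vendor request req:([0-9a-f]{2}) off:([0-9a-f]{4}) failed:(-?\d+)"
)
PROBE_RE = re.compile(r"probe of (\S+) failed with error (-?\d+)")


def errno_name(code):
    """Map a negative kernel return code to its symbolic name, if known."""
    return ERRNO_NAMES.get(abs(code), "unknown")


def summarize(log_text):
    """Count vendor-request failures per (req, offset, errno name)."""
    failures = Counter()
    probe_error = None
    for line in log_text.splitlines():
        m = VENDOR_RE.search(line)
        if m:
            req, off, code = m.group(1), m.group(2), int(m.group(3))
            failures[(req, off, errno_name(code))] += 1
            continue
        m = PROBE_RE.search(line)
        if m:
            probe_error = errno_name(int(m.group(2)))
    return failures, probe_error


# A few lines copied from the log above.
SAMPLE = """\
kernel: mt76x0u 3-4:1.0: vendor request req:07 off:1004 failed:-22
kernel: mt76x0u 3-4:1.0: vendor request req:06 off:1004 failed:-22
kernel: mt76x0u 3-4:1.0: vendor request req:07 off:1200 failed:-22
kernel: mt76x0u 3-4:1.0: vendor request req:07 off:1200 failed:-22
kernel: mt76x0u: probe of 3-4:1.0 failed with error -5
"""

if __name__ == "__main__":
    failures, probe_error = summarize(SAMPLE)
    for (req, off, name), count in sorted(failures.items()):
        print(f"req:{req} off:{off} {name} x{count}")
    print("probe failed with", probe_error)
```

Collapsing the repeats this way makes it easier to see that every register access fails with the same code once the MCU response has timed out.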
Comment 1 Stanislaw Gruszka 2019-02-27 07:49:21 UTC
If this is on AMD IOMMU you can check if disabling iommu helps.
Comment 2 Stanislaw Gruszka 2019-02-27 07:52:52 UTC
If this is not on AMD IOMMU, what hardware is it on?
Comment 3 Jan Viktorin 2019-02-27 08:52:09 UTC
You are right. It seems that I was blind when previously looking into dmesg...

[   52.524299] mt76x0u 3-4:1.0: error: MCU resp evt:f seq:1-f
[   52.534871] amdgpu 0000:38:00.0: fb0: amdgpudrmfb frame buffer device
[   52.549072] amdgpu 0000:38:00.0: ring 0(gfx) uses VM inv eng 4 on hub 0
[   52.549075] amdgpu 0000:38:00.0: ring 1(comp_1.0.0) uses VM inv eng 5 on hub 0
[   52.549076] amdgpu 0000:38:00.0: ring 2(comp_1.1.0) uses VM inv eng 6 on hub 0
[   52.549078] amdgpu 0000:38:00.0: ring 3(comp_1.2.0) uses VM inv eng 7 on hub 0
[   52.549079] amdgpu 0000:38:00.0: ring 4(comp_1.3.0) uses VM inv eng 8 on hub 0
[   52.549081] amdgpu 0000:38:00.0: ring 5(comp_1.0.1) uses VM inv eng 9 on hub 0
[   52.549082] amdgpu 0000:38:00.0: ring 6(comp_1.1.1) uses VM inv eng 10 on hub 0
[   52.549083] amdgpu 0000:38:00.0: ring 7(comp_1.2.1) uses VM inv eng 11 on hub 0
[   52.549085] amdgpu 0000:38:00.0: ring 8(comp_1.3.1) uses VM inv eng 12 on hub 0
[   52.549086] amdgpu 0000:38:00.0: ring 9(kiq_2.1.0) uses VM inv eng 13 on hub 0
[   52.549087] amdgpu 0000:38:00.0: ring 10(sdma0) uses VM inv eng 4 on hub 1
[   52.549089] amdgpu 0000:38:00.0: ring 11(vcn_dec) uses VM inv eng 5 on hub 1
[   52.549090] amdgpu 0000:38:00.0: ring 12(vcn_enc0) uses VM inv eng 6 on hub 1
[   52.549091] amdgpu 0000:38:00.0: ring 13(vcn_enc1) uses VM inv eng 7 on hub 1
[   52.549093] amdgpu 0000:38:00.0: ring 14(vcn_jpeg) uses VM inv eng 8 on hub 1
[   52.552929] [drm] Initialized amdgpu 3.27.0 20150101 for 0000:38:00.0 on minor 0
[   52.815341] IPv6: ADDRCONF(NETDEV_UP): enp31s0: link is not ready
[   52.816449] Generic PHY r8169-1f00:00: attached PHY driver [Generic PHY] (mii_bus:phy_addr=r8169-1f00:00, irq=IGNORE)
[   52.919408] IPv6: ADDRCONF(NETDEV_UP): enp31s0: link is not ready
[   53.786691] mt76x0u 3-4:1.0: error: mt76x02u_mcu_wait_resp timed out
[   53.788164] xhci_hcd 0000:38:00.3: WARNING: Host System Error
[   53.788205] xhci_hcd 0000:38:00.3: AMD-Vi: Event logged [IO_PAGE_FAULT domain=0x0000 address=0x00000000fd652c00 flags=0x0020]

And yes, iommu=no amd_iommu=no makes the Archer T1U work again.
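
As a side note for readers, the AMD-Vi event above can be picked apart mechanically. The sketch below extracts the faulting DMA address and splits it into page base and in-page offset. It assumes 4 KiB pages, and the regular expression and function name are illustrative, not taken from any kernel tool.

```python
import re

PAGE_SIZE = 4096  # assumption: 4 KiB base pages

# Hypothetical pattern for the AMD-Vi event format quoted above.
FAULT_RE = re.compile(
    r"AMD-Vi: Event logged \[IO_PAGE_FAULT domain=(0x[0-9a-f]+) "
    r"address=(0x[0-9a-f]+) flags=(0x[0-9a-f]+)\]"
)


def parse_faults(log_text):
    """Return (address, page_base, offset_in_page, flags) per fault line."""
    faults = []
    for m in FAULT_RE.finditer(log_text):
        address = int(m.group(2), 16)
        flags = int(m.group(3), 16)
        offset = address % PAGE_SIZE
        faults.append((address, address - offset, offset, flags))
    return faults


# The event line from the dmesg excerpt above.
SAMPLE = (
    "[   53.788205] xhci_hcd 0000:38:00.3: AMD-Vi: Event logged "
    "[IO_PAGE_FAULT domain=0x0000 address=0x00000000fd652c00 flags=0x0020]\n"
)

if __name__ == "__main__":
    for address, base, offset, flags in parse_faults(SAMPLE):
        print(f"addr={address:#x} page={base:#x} offset={offset:#x} flags={flags:#x}")
```

For this particular event the address is not page aligned (offset 0xc00 within the page), which is a detail worth keeping in mind when buffer alignment is suspected.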

There is also a warning for the iommu; I have no idea whether it is related, or whether I should report it separately...

[   52.423014] WARNING: CPU: 1 PID: 382 at drivers/gpu/drm/amd/amdgpu/../display/dc/calcs/dcn_calcs.c:1380 dcn_bw_update_from_pplib+0x16b/0x280 [amdgpu]
[   52.423014] Modules linked in: nct6775 hwmon_vid edac_mce_amd kvm_amd ccp amdgpu(+) rng_core kvm nls_iso8859_1 nls_cp437 vfat fat mt76x0u(+) mt76x0_common irqbypass mt76x02_usb mt76_usb mt76x02_lib mt76 mac80211 crct10dif_pclmul crc32_pclmul chash ghash_clmulni_intel amd_iommu_v2 gpu_sched wmi_bmof i2c_algo_bit snd_hda_codec_realtek ttm snd_hda_codec_generic drm_kms_helper snd_hda_codec_hdmi snd_hda_intel cfg80211 snd_hda_codec drm aesni_intel snd_hda_core aes_x86_64 crypto_simd cryptd glue_helper snd_hwdep snd_pcm input_leds pcspkr agpgart rfkill syscopyarea k10temp sysfillrect sysimgblt sp5100_tco fb_sys_fops mousedev snd_timer i2c_piix4 snd r8169 soundcore realtek libphy wmi evdev gpio_amdpt pinctrl_amd mac_hid pcc_cpufreq acpi_cpufreq sg ip_tables x_tables ext4 crc32c_generic crc16 mbcache jbd2 fscrypto sd_mod hid_generic usbhid hid serio_raw atkbd libps2 dm_mod crc32c_intel ahci libahci libata xhci_pci scsi_mod xhci_hcd i8042 serio bcache crc64
[   52.423048] CPU: 1 PID: 382 Comm: systemd-udevd Not tainted 4.20.12-arch1-1-ARCH #1
[   52.423049] Hardware name: To Be Filled By O.E.M. To Be Filled By O.E.M./B450M Pro4, BIOS P1.20 06/26/2018
[   52.423095] RIP: 0010:dcn_bw_update_from_pplib+0x16b/0x280 [amdgpu]
[   52.423096] Code: d8 ca d8 f1 d9 5a 50 8b 44 fc 14 49 8b 94 24 78 01 00 00 48 89 04 24 df 2c 24 d8 f1 db 42 78 de c9 de ca de f9 d9 5a 4c eb 02 <0f> 0b 48 89 da be 04 00 00 00 48 89 ef e8 93 67 fe ff 84 c0 0f 84
[   52.423097] RSP: 0018:ffff98d542b47788 EFLAGS: 00010246
[   52.423098] RAX: 0000000000000001 RBX: ffff98d542b477e8 RCX: 0000000000000000
[   52.423098] RDX: 0000000000000000 RSI: 0000000000000004 RDI: 00000000ffffffff
[   52.423099] RBP: ffff924d3da61500 R08: 0000000000000001 R09: 00000000000003bc
[   52.423099] R10: 0000000000000001 R11: 0000000000000000 R12: ffff924d3ef7d000
[   52.423100] R13: ffff924d3d8b6480 R14: ffff924d3ef7d000 R15: 0000000000000000
[   52.423101] FS:  00007f37363af840(0000) GS:ffff924d4f840000(0000) knlGS:0000000000000000
[   52.423102] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[   52.423102] CR2: 00007ff4d7440198 CR3: 00000003c9d62000 CR4: 00000000003406e0
[   52.423103] Call Trace:
[   52.423153]  dcn10_create_resource_pool+0x7d5/0xac0 [amdgpu]
[   52.423198]  ? dal_aux_engine_dce110_create+0x39/0x70 [amdgpu]
[   52.423241]  dc_create_resource_pool+0x42/0x180 [amdgpu]
[   52.423245]  ? __kmalloc+0x18f/0x220
[   52.423287]  dc_create+0x20f/0x620 [amdgpu]
[   52.423330]  ? amdgpu_cgs_create_device+0x23/0x50 [amdgpu]
[   52.423374]  dm_hw_init+0xcb/0x140 [amdgpu]
[   52.423417]  amdgpu_device_init.cold.17+0x1207/0x136c [amdgpu]
[   52.423452]  amdgpu_driver_load_kms+0x86/0x330 [amdgpu]
[   52.423463]  drm_dev_register+0x113/0x150 [drm]
[   52.423496]  amdgpu_pci_probe+0xbd/0x120 [amdgpu]
[   52.423499]  ? _raw_spin_unlock_irqrestore+0x20/0x40
[   52.423502]  local_pci_probe+0x41/0x90
[   52.423503]  pci_device_probe+0x189/0x1a0
[   52.423506]  really_probe+0xf8/0x3b0
[   52.423507]  driver_probe_device+0xb3/0xf0
[   52.423508]  __driver_attach+0xdd/0x110
[   52.423509]  ? driver_probe_device+0xf0/0xf0
[   52.423510]  ? driver_probe_device+0xf0/0xf0
[   52.423511]  bus_for_each_dev+0x76/0xc0
[   52.423513]  bus_add_driver+0x152/0x230
[   52.423514]  ? 0xffffffffc10c9000
[   52.423515]  driver_register+0x6b/0xb0
[   52.423516]  ? 0xffffffffc10c9000
[   52.423518]  do_one_initcall+0x46/0x1f5
[   52.423520]  ? kmem_cache_alloc_trace+0x176/0x1d0
[   52.423522]  ? do_init_module+0x22/0x210
[   52.423523]  do_init_module+0x5a/0x210
[   52.423525]  load_module+0x1ff7/0x2290
[   52.423526]  ? __schedule+0x2a3/0x8b0
[   52.423527]  ? preempt_schedule_irq+0x53/0x90
[   52.423529]  ? __se_sys_init_module+0x10a/0x170
[   52.423530]  __se_sys_init_module+0x10a/0x170
[   52.423532]  do_syscall_64+0x5b/0x170
[   52.423534]  entry_SYSCALL_64_after_hwframe+0x44/0xa9
[   52.423535] RIP: 0033:0x7f3737f9d56e
[   52.423536] Code: 48 8b 0d f5 18 0c 00 f7 d8 64 89 01 48 83 c8 ff c3 66 2e 0f 1f 84 00 00 00 00 00 90 f3 0f 1e fa 49 89 ca b8 af 00 00 00 0f 05 <48> 3d 01 f0 ff ff 73 01 c3 48 8b 0d c2 18 0c 00 f7 d8 64 89 01 48
[   52.423537] RSP: 002b:00007ffe20d6d1f8 EFLAGS: 00000246 ORIG_RAX: 00000000000000af
[   52.423538] RAX: ffffffffffffffda RBX: 0000559a77ff9270 RCX: 00007f3737f9d56e
[   52.423539] RDX: 00007f3737c1284d RSI: 00000000006ab761 RDI: 0000559a789d3700
[   52.423539] RBP: 00007f3737c1284d R08: 0000000000000007 R09: 0000000000000006
[   52.423540] R10: 0000559a77fd2010 R11: 0000000000000246 R12: 0000559a789d3700
[   52.423540] R13: 0000559a77fe74d0 R14: 0000000000020000 R15: 0000559a77ff9270
[   52.423542] ---[ end trace 3d8af4a917f99ac2 ]---
Comment 4 Jan Viktorin 2019-02-27 08:55:07 UTC
Created attachment 281379 [details]
dmesg from 4.20 with failing mt76x0u
Comment 5 Stanislaw Gruszka 2019-02-27 10:41:48 UTC
Created attachment 281391 [details]
0001-mt76usb-do-not-unaligned-sizes-for-buf-alloc.patch

Could you check if this patch makes things work with IOMMU enabled?
Comment 6 Jan Viktorin 2019-02-27 19:38:04 UTC
Unfortunately, no luck with the patch. Applied on top of:

426bad2027c7 Arch Linux kernel v4.20.12-arch1
e4817043e07f exec: Fix mem leak in kernel_read_file
884528c4629b add sysctl to disallow unprivileged CLONE_NEWUSER by default
c91951f15978 Linux 4.20.12
578636114de4 ax25: fix possible use-after-free
f3876e6070bf mISDN: fix a race in dev_expire_timer()
c1339bd49e72 net/x25: do not hold the cpu too long in x25_new_lci()
42038180a1d6 netfilter: nf_nat_snmp_basic: add missing length checks in ASN.1 cbs
994fc3c7be81 hwmon: (lm80) Fix missing unlock on error in set_fan_div()
795793799d07 mmc: meson-gx: fix interrupt name

Looks like there is no significant change in dmesg:

[   52.643596] mt76x0u 3-4:1.0: ASIC revision: 76100002 MAC revision: 76502000
...
[   52.857461] mt76x0u 3-4:1.0: error: MCU resp evt:8 seq:1-0
...
[   54.133391] mt76x0u 3-4:1.0: error: mt76x02u_mcu_wait_resp timed out
[   54.134955] xhci_hcd 0000:38:00.3: WARNING: Host System Error
[   54.134978] xhci_hcd 0000:38:00.3: AMD-Vi: Event logged [IO_PAGE_FAULT domain=0x0000 address=0x00000000fd656000 flags=0x0020]
[   54.620433] fuse init (API version 7.28)
[   55.733684] mt76x0u 3-4:1.0: error: mt76x02u_mcu_wait_resp timed out
[   55.836344] mt76x0u 3-4:1.0: vendor request req:07 off:1004 failed:-22
[   55.938949] mt76x0u 3-4:1.0: vendor request req:06 off:1004 failed:-22
[   56.041878] mt76x0u 3-4:1.0: vendor request req:07 off:141c failed:-22
[   56.140903] mt76x0u 3-4:1.0: vendor request req:06 off:141c failed:-22
[   56.243928] mt76x0u 3-4:1.0: vendor request req:07 off:080c failed:-22
[   56.346344] mt76x0u 3-4:1.0: vendor request req:06 off:080c failed:-22
[   56.450942] mt76x0u 3-4:1.0: vendor request req:07 off:0230 failed:-22
[   56.554556] mt76x0u 3-4:1.0: vendor request req:06 off:0230 failed:-22
[   56.657564] mt76x0u 3-4:1.0: vendor request req:07 off:1200 failed:-22
[   56.781764] mt76x0u 3-4:1.0: vendor request req:07 off:1200 failed:-22
[   56.903688] mt76x0u 3-4:1.0: vendor request req:07 off:1200 failed:-22
[   57.024116] mt76x0u 3-4:1.0: vendor request req:07 off:1200 failed:-22
[   57.147326] mt76x0u 3-4:1.0: vendor request req:07 off:1200 failed:-22
[   57.270800] mt76x0u 3-4:1.0: vendor request req:07 off:1200 failed:-22
[   57.392801] mt76x0u 3-4:1.0: vendor request req:07 off:1200 failed:-22
[   57.515885] mt76x0u 3-4:1.0: vendor request req:07 off:1200 failed:-22
[   57.638826] mt76x0u 3-4:1.0: vendor request req:07 off:1200 failed:-22
[   57.762182] mt76x0u 3-4:1.0: vendor request req:07 off:1200 failed:-22
[   57.884491] mt76x0u 3-4:1.0: vendor request req:07 off:1200 failed:-22
[   58.007588] mt76x0u 3-4:1.0: vendor request req:07 off:0080 failed:-22
[   58.110046] mt76x0u 3-4:1.0: vendor request req:06 off:0080 failed:-22
[   58.212239] mt76x0u 3-4:1.0: vendor request req:06 off:0080 failed:-22
[   58.213011] mt76x0u: probe of 3-4:1.0 failed with error -5
[   58.213049] usbcore: registered new interface driver mt76x0u


I've noticed one thing while switching between different kernels back and forth (but this looks unrelated). Suddenly, after booting an LTS 4.19 kernel (which should work), I experienced a different kind of error:

[   52.920700] xhci_hcd 0000:38:00.3: AMD-Vi: Event logged [IO_PAGE_FAULT domain=0x0000 address=0x00000000fd650000 flags=0x0000]
[   52.920746] xhci_hcd 0000:38:00.3: AMD-Vi: Event logged [IO_PAGE_FAULT domain=0x0000 address=0x00000000fd650100 flags=0x0000]
...
[   53.946958] mt76x0u 3-4:1.0: firmware upload timed out
[   54.698504] fuse init (API version 7.28)
[   59.120270] xhci_hcd 0000:38:00.3: xHCI host not responding to stop endpoint command.
[   59.120296] xhci_hcd 0000:38:00.3: xHCI host controller not responding, assume dead
[   59.120460] xhci_hcd 0000:38:00.3: HC died; cleaning up
[   59.120505] usb 3-4: USB disconnect, device number 2
[   59.121060] mt76x0u: probe of 3-4:1.0 failed with error -110
[   59.121120] usbcore: registered new interface driver mt76x0u

The solution was to physically re-plug the USB dongle.
Comment 7 Jan Viktorin 2019-02-27 19:39:12 UTC
Created attachment 281393 [details]
dmesg from 4.20 with applied 0001-mt76usb-do-not-unaligned-sizes-for-buf-alloc.patch
Comment 8 Stanislaw Gruszka 2019-02-28 08:39:56 UTC
Could you check two patches? First this one:
https://patchwork.ozlabs.org/patch/1042264/

If the first does not help, try this one from the other bz:
0001-mt76x02u-use-usb_bulk_msg-to-upload-firmware.patch
https://bugzilla.kernel.org/show_bug.cgi?id=202241#c6

It was reported that this one, together with another patch, fixed the issue, but I'm trying to find out whether the one patch alone is sufficient and what the actual problem is (an mt76 buffer alignment issue, an AMD IOMMU issue, or a page_frag_alloc issue).
Comment 9 Jan Viktorin 2019-02-28 11:27:40 UTC
Created attachment 281425 [details]
dmesg from 4.20 with applied [net,v2] mm: page_alloc: fix ref bias in page_frag_alloc() for 1-byte allocs
Comment 10 Jan Viktorin 2019-02-28 11:29:08 UTC
Comment on attachment 281425 [details]
dmesg from 4.20 with applied [net,v2] mm: page_alloc: fix ref bias in page_frag_alloc() for 1-byte allocs

Still does not work after application of [net,v2] mm: page_alloc: fix ref bias in page_frag_alloc() for 1-byte allocs from https://patchwork.ozlabs.org/patch/1042264/.
Comment 11 Stanislaw Gruszka 2019-02-28 11:33:54 UTC
Thanks for testing! Could you also check this one?
https://bugzilla.kernel.org/show_bug.cgi?id=202241#c14

I think it can address the actual problem with AMD IOMMU.
Comment 12 Jan Viktorin 2019-02-28 11:39:47 UTC
I am confused a little bit now. I am currently building kernel with added 0001-mt76x02u-use-usb_bulk_msg-to-upload-firmware.patch from https://bugzilla.kernel.org/show_bug.cgi?id=202241#c6.

How is the amd_iommu.patch from https://bugzilla.kernel.org/show_bug.cgi?id=202241#c14 related? Should I apply it alone or in addition to the two previous patches?
Comment 13 Stanislaw Gruszka 2019-02-28 12:06:50 UTC
(In reply to jan.viktorin from comment #12)
> How is the amd_iommu.patch from
> https://bugzilla.kernel.org/show_bug.cgi?id=202241#c14 related? Should I
> apply it alone or in addition to the two previous patches?

Alone. Also 0001-mt76x02u-use-usb_bulk_msg-to-upload-firmware.patch should be tested alone.
Comment 14 Stanislaw Gruszka 2019-02-28 12:16:44 UTC
amd_iommu.patch is wrong, don't test it. Please only check whether 0001-mt76x02u-use-usb_bulk_msg-to-upload-firmware.patch alone makes things work.
Comment 15 Jan Viktorin 2019-02-28 14:23:38 UTC
Created attachment 281427 [details]
dmesg from 4.20 with applied 0001-mt76x02u-use-usb_bulk_msg-to-upload-firmware.patch

Still no success, just tried the 0001-mt76x02u-use-usb_bulk_msg-to-upload-firmware.patch. I hope that I apply and build kernels properly. I do clean builds and always check by git diff whether the patch is really applied.
Comment 16 Stanislaw Gruszka 2019-02-28 16:07:54 UTC
Ok, could you confirm that both patches from bug 202241

0001-mt76x02u-use-usb_bulk_msg-to-upload-firmware.patch
0002-mt76usb-do-not-use-compound-head-page-for-SG-I-O.patch

are needed to make the problem go away?
Comment 17 Jan Viktorin 2019-02-28 18:45:38 UTC
Created attachment 281431 [details]
panic after login into kernel 4.20 system with 2 patches

Well, this seems to be a step forward. When I log into the system I can see that NetworkManager is connected to a WiFi network, but very soon after that the caps lock and scroll lock LEDs start blinking and the system freezes (I assume that it's a kernel panic).

It happened in both of 2 boots I tried. From one of them I have no verbose logs, just:

Feb 28 18:30:06 kernel: BUG: Bad page state in process claws-mail  pfn:3af042
Feb 28 18:30:06 kernel: page:ffffe4b7cebc1080 count:1 mapcount:0 mapping:0000000000000000 index:0x0
Feb 28 18:30:06 kernel: flags: 0x2ffff0000000000()

Logs extracted by journalctl -k from the second boot attached.
Comment 18 Stanislaw Gruszka 2019-03-01 09:18:08 UTC
That's interesting, we have multiple issues here.
Comment 19 Stanislaw Gruszka 2019-03-01 09:21:17 UTC
Created attachment 281443 [details]
0001-mt76usb-make-rx-buffers-page-size.patch

Please test this patch together with:
0001-mt76x02u-use-usb_bulk_msg-to-upload-firmware.patch

If it still does not work, test it also with both patches:
0001-mt76x02u-use-usb_bulk_msg-to-upload-firmware.patch
0002-mt76usb-do-not-use-compound-head-page-for-SG-I-O.patch
Comment 20 Jan Viktorin 2019-03-01 16:49:50 UTC
Writing this from kernel 4.20 proves that the last variant (all the three patches applied) works and it seems stable. Thank you! Hope to find those patches soon in the mainline.
Comment 21 Stanislaw Gruszka 2019-03-01 17:48:58 UTC
Sorry to say, but the second patch is only a debug one to find out where the problem is. I think I now know what the issue is and will report it to the AMD IOMMU maintainer. Is the third patch 0002-mt76usb-do-not-use-compound-head-page-for-SG-I-O.patch necessary? Or does the system work with just these two:

0001-mt76x02u-use-usb_bulk_msg-to-upload-firmware.patch
0001-mt76usb-make-rx-buffers-page-size.patch
Comment 22 Jan Viktorin 2019-03-01 17:56:35 UTC
Yes, all three patches are necessary. Without the last one, the situation was still the same.
Comment 23 Jan Viktorin 2019-03-04 10:27:34 UTC
Can you please tell me how I can track the issue to find out when it is fixed?
Comment 25 Stanislaw Gruszka 2019-03-04 13:21:36 UTC
Not sure if we have sufficient data yet to identify the problems. It is also not clear whether those are AMD IOMMU or mt76usb bug(s). Most likely some more debugging/testing will be needed. But let's first wait for Joerg's answer.
Comment 26 Stanislaw Gruszka 2019-03-05 13:45:33 UTC
Created attachment 281513 [details]
0001-mt76usb-avoid-using-usb-hcd-bounce-buffer.patch

We do not have an answer from Joerg so far. Let's try to narrow this issue down further. Please test this patch as a replacement for:

0002-mt76usb-do-not-use-compound-head-page-for-SG-I-O.patch

together with the 2 other patches and let me know if things work or not.
Comment 27 Jan Viktorin 2019-03-05 16:15:11 UTC
Created attachment 281515 [details]
dmesg from 4.20 with 0001-mt76usb-avoid-using-usb-hcd-bounce-buffer.patch

Does not work, see dmesg, probably nothing new. Applied:

0001-mt76usb-make-rx-buffers-page-size.patch
0001-mt76x02u-use-usb_bulk_msg-to-upload-firmware.patch
0001-mt76usb-avoid-using-usb-hcd-bounce-buffer.patch
Comment 28 Stanislaw Gruszka 2019-03-05 17:44:40 UTC
The dmesg looks like the previous one, taken without these 2 patches:

0001-mt76usb-make-rx-buffers-page-size.patch
0001-mt76x02u-use-usb_bulk_msg-to-upload-firmware.patch

Could you double check that the patches were applied and that the correct kernel was loaded?
Comment 29 Jan Viktorin 2019-03-05 18:04:36 UTC
Easy to say, difficult to do. The kernel sources I used contain the 0001-mt76usb-avoid-using-usb-hcd-bounce-buffer.patch, I can see it by git diff.

The previous kernel I used for this was the one that was working with the 3 patches and the "SG I/O" one. And with the new one built today (as seen in dmesg), it didn't.

SHA checksum of installed vmlinuz binary (in /boot) is the same as of the one inside my generated package. I always build by using makepkg -sCf (Arch Linux-specific, it basically means clean-build, force). Patches are applied automatically as stated inside the PKGBUILD file.
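
The checksum comparison described above can be sketched as follows. This is only an illustration: the comment does not say which SHA variant was used, so SHA-256 is an assumption, and the function names and any paths passed on the command line are hypothetical.

```python
import hashlib
import sys


def sha256_of(path, chunk_size=1 << 20):
    """Hash a file in chunks so large kernel images need not fit in memory."""
    digest = hashlib.sha256()
    with open(path, "rb") as fh:
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def same_image(installed_path, packaged_path):
    """True when the installed image is byte-identical to the packaged one."""
    return sha256_of(installed_path) == sha256_of(packaged_path)


if __name__ == "__main__":
    # Usage: script.py <installed vmlinuz> <vmlinuz extracted from the package>
    if len(sys.argv) == 3:
        installed, packaged = sys.argv[1], sys.argv[2]
        print("match" if same_image(installed, packaged) else "MISMATCH")
```

A matching digest only shows that the installed image is the one that was built; checking that the running kernel is that image (for example via its version string in dmesg) is a separate step.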

I'd say, put there some ugly printk, otherwise, it's just guessing and praying.
Comment 30 Stanislaw Gruszka 2019-03-06 07:24:18 UTC
Created attachment 281525 [details]
3_fixes_and_printk.patch

Ok, please retest with this single patch; it contains the 3 previous patches and an additional printk.
Comment 31 Jan Viktorin 2019-03-06 13:40:15 UTC
Created attachment 281527 [details]
dmesg with 3 patches at once and printk

Nope... your printk is there and my Wi-Fi is still broken :/.
Comment 32 Stanislaw Gruszka 2019-03-07 12:31:47 UTC
Created attachment 281571 [details]
3_fixes_and_printk_v2.patch

Next try to narrow down the issues. Please test and provide dmesg.
Comment 33 Jan Viktorin 2019-03-07 17:47:35 UTC
The patch has failed to apply. I am building on top of c91951f15978 Linux 4.20.12 (see comment 6 above). I am not sure what I can do.

==> Starting prepare()...
  -> Setting version...
  -> Applying patch 3_fixes_and_printk_v2.patch...
patching file drivers/net/wireless/mediatek/mt76/mt76.h
patching file drivers/net/wireless/mediatek/mt76/mt76x0/usb.c
Hunk #1 FAILED at 306.
Hunk #2 succeeded at 311 (offset -12 lines).
1 out of 2 hunks FAILED -- saving rejects to file drivers/net/wireless/mediatek/mt76/mt76x0/usb.c.rej
patching file drivers/net/wireless/mediatek/mt76/mt76x02_usb_mcu.c
patching file drivers/net/wireless/mediatek/mt76/mt76x2/usb.c
patching file drivers/net/wireless/mediatek/mt76/usb.c
patching file drivers/net/wireless/mediatek/mt76/usb_mcu.c
==> ERROR: A failure occurred in prepare().
    Aborting...

Contents of usb.c.rej:

--- drivers/net/wireless/mediatek/mt76/mt76x0/usb.c
+++ drivers/net/wireless/mediatek/mt76/mt76x0/usb.c
@@ -306,13 +306,11 @@ static int __maybe_unused mt76x0_suspend(struct usb_interface *usb_intf,
                                         pm_message_t state)
 {
        struct mt76x02_dev *dev = usb_get_intfdata(usb_intf);
-       struct mt76_usb *usb = &dev->mt76.usb;

        mt76u_stop_queues(&dev->mt76);
        mt76x0u_mac_stop(dev);
        clear_bit(MT76_STATE_MCU_RUNNING, &dev->mt76.state);
        mt76x0_chip_onoff(dev, false, false);
-       usb_kill_urb(usb->mcu.res.urb);

        return 0;
 }
Comment 34 Stanislaw Gruszka 2019-03-08 07:19:35 UTC
My patch is on top of 4.20.14, which includes the 'mt76x0u: fix suspend/resume' commit:

https://git.kernel.org/pub/scm/linux/kernel/git/stable/linux.git/patch/?id=379c6d591775cead56a2966bf4578fafef0ae107

Please apply this commit and 3_fixes_and_printk_v2.patch on top.
Comment 35 Jan Viktorin 2019-03-08 12:10:21 UTC
Created attachment 281635 [details]
dmesg from 4.20 after applying pack of 3 patches v2

Here you are, full of error messages. Hope that it helps in any way. I had to hard reboot because my PC became unusable.
Comment 36 Stanislaw Gruszka 2019-03-08 14:59:25 UTC
It was super useful; now we know that AMD IOMMU returns an sg mapping error, which mt76usb ignored.
Comment 37 Stanislaw Gruszka 2019-03-08 15:03:59 UTC
Created attachment 281637 [details]
3_fixes_and_printk_v3.patch

Please test the next patch and provide dmesg. There are multiple issues, so let's identify them all.
Comment 38 Jan Viktorin 2019-03-09 10:17:18 UTC
Created attachment 281659 [details]
dmesg from 4.20 after applying pack of 3 patches v3
Comment 39 Stanislaw Gruszka 2019-03-09 12:31:36 UTC
Created attachment 281667 [details]
amd_iommu_correct_startaddr.patch

Please test this one on top of 3_fixes_and_printk_v3.patch. Perhaps this is the last piece of the solution to this puzzle.
Comment 40 Stanislaw Gruszka 2019-03-10 09:46:00 UTC
Created attachment 281675 [details]
amd_iommu_correct_dma_address.patch

Please test this one instead of amd_iommu_correct_startaddr.patch.
Comment 41 Jan Viktorin 2019-03-10 18:41:56 UTC
Created attachment 281685 [details]
dmesg from 4.20, applied pack of 3 patches v3 + iommu correct dma address patch

It works now.
Comment 42 Stanislaw Gruszka 2019-03-11 08:08:25 UTC
Thank you. To address those bugs, we need this amd iommu patch. We also need to backport these commits from 5.1 to 5.0 and 4.20:

commit 5de4db8fcb6d6fc7d9064c22841211790c0ab81b
Author: Stanislaw Gruszka <sgruszka@redhat.com>
Date:   Mon Feb 11 09:16:14 2019 +0100

    mt76x02u: use usb_bulk_msg to upload firmware

commit a18a494f908f88a8be95ce95399800204e338b55
Author: Stanislaw Gruszka <sgruszka@redhat.com>
Date:   Wed Feb 20 17:15:19 2019 +0100

    mt76usb: use synchronous msg for mcu command responses

But I'm still not 100% sure whether we need to change q->buf_size to PAGE_SIZE and, if that is needed, whether this is an amd iommu bug or not.

Could you do yet another test and confirm the q->buf_size change? You already tested that, but without the amd iommu patch, so perhaps it was only needed because of this amd iommu s->offset bug.
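
To illustrate why the buffer size could interact with an offset-related bug, here is a toy arithmetic sketch. It is not the kernel allocator and does not reproduce how page_frag_alloc() hands out memory; it only assumes 4 KiB pages and buffers laid out back to back, and the function name is made up for the example.

```python
PAGE_SIZE = 4096  # assumption: 4 KiB base pages


def in_page_offsets(buf_size, count, page_size=PAGE_SIZE):
    """In-page offsets of `count` buffers of `buf_size` bytes placed back to back."""
    return [(i * buf_size) % page_size for i in range(count)]


if __name__ == "__main__":
    # Page-sized buffers always start on a page boundary in this model.
    print("PAGE_SIZE buffers:", in_page_offsets(PAGE_SIZE, 4))
    # Smaller buffers share pages, so some start at a non-zero offset.
    print("2048-byte buffers:", in_page_offsets(2048, 4))
```

In this model page-sized buffers never produce a non-zero offset, while 2048-byte buffers do for every second buffer. That is consistent with the guess above that the PAGE_SIZE change may merely have hidden a bug in offset handling, which is what the test with 2048-byte buffers is meant to check.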
Comment 43 Stanislaw Gruszka 2019-03-11 08:10:46 UTC
Created attachment 281695 [details]
mt76usb-use-2048-buffers.patch

Please test this on top of the two previous patches and check whether things break again or not.
Comment 44 Jan Viktorin 2019-03-11 11:35:26 UTC
Please, can you be more explicit? Which "two previous patches" do you mean? I assume this sequence:

* 379c6d59177  'mt76x0u: fix suspend/resume'
* 3_fixes_and_printk_v3.patch
* mt76usb-use-2048-buffers.patch
Comment 45 Stanislaw Gruszka 2019-03-11 13:22:39 UTC
Oh, sorry 4 patches then (I forgot about suspend/resume):

* 379c6d59177  'mt76x0u: fix suspend/resume'
* 3_fixes_and_printk_v3.patch
* amd_iommu_correct_dma_address.patch
* mt76usb-use-2048-buffers.patch
Comment 46 Jan Viktorin 2019-03-11 16:34:44 UTC
Created attachment 281721 [details]
dmesg from 4.20 to check 2k buffers

This works for me.
Comment 47 Stanislaw Gruszka 2019-03-12 07:39:25 UTC
Thanks for testing! So after applying the correct AMD IOMMU fix, even fewer changes in mt76usb are needed. Maybe that single fix alone is sufficient to avoid the bug you have encountered. However, I would backport "mt76x02u: use usb_bulk_msg to upload firmware" to 4.20 and 5.0, since it solves some different bugs.
Comment 48 Stanislaw Gruszka 2019-03-28 13:30:20 UTC
*** Bug 202241 has been marked as a duplicate of this bug. ***
Comment 49 Stanislaw Gruszka 2019-03-28 13:43:38 UTC
AMD IOMMU fix is already in 5.0 tree:

https://git.kernel.org/pub/scm/linux/kernel/git/stable/linux.git/commit/?h=linux-5.0.y&id=98e2c51c1ac3d1599c221944c5aecb5868c1605d

Looks like 4.20.y-stable is no longer updated.

I just posted "mt76x02u: use usb_bulk_msg to upload firmware" to stable. It should be included in 5.0.y.

Hopefully those two patches make mt76x0u work on AMD IOMMU. I'm closing the bug. If not, and some other patch needs to be backported to 5.0, please reopen this bz.
Comment 50 Jan Viktorin 2019-04-02 09:01:02 UTC
Linux 5.0.5. It works! Thank you!
Comment 51 Michael 2019-04-02 09:25:01 UTC
I can confirm, too, that it works as expected, running kernel 5.0.5.

Great work, thanks!