Bug 202651

Summary: Regression in atlantic.ko between 4.20 and 5.00-rc4 for Aquantia NIC 1d6a:87b1
Product: Networking Reporter: Nicholas Johnson (nicholas.johnson-opensource)
Component: IPV4Assignee: Stephen Hemminger (stephen)
Status: NEW ---    
Severity: high CC: bero, dbogdan, irusskikh
Priority: P1    
Hardware: All   
OS: Linux   
Kernel Version: 5.00-rc4 Subsystem:
Regression: No Bisected commit-id:
Attachments: Apply this patch with "git apply" at root of source.

Description Nicholas Johnson 2019-02-23 09:01:29 UTC
Hardware Details:
==============================================================================

VEN:DEV = 1d6a:87b1 "Aquantia Corp. AQC107 NBase-T/IEEE 802.3bz Ethernet
Controller [AQtion]" Specifically, this is the newer AQC107S with "S"
meaning "Secure" and the company could never elaborate on how it is more
secure than the plain AQC107 (different VEN:DEV). This chip is inside
the Promise SANLink3 N1 Thunderbolt 3 10GbE adapter.


Severity:
==============================================================================

Marked high severity because it directly impacts user experience (cannot
get the network to come up on the adapter).


Back Story and Problem:
==============================================================================

In 5.00-rc4, on my Ubuntu desktop I have had extreme problems getting
the connection up (borderline impossible). If I did manage (after
messing with systemd *a lot*) then I got a dodgy connection with a lot
of DNS resolution problems. At first I blamed it on my desktop which had
a habit of corrupting the SSD partition tables on a regular basis, and
hence I assumed that systemd was corrupted or something.

Recently, I tried on my laptop with Arch Linux and 5.00-rc4 and it had
the exact same problem. It does not matter if you manually assign the 
IPv4 address, or use DHCPv4.

Just now, on my laptop, I booted into 4.20.3-arch1-1-ARCH kernel from
Arch Linux and it worked straight away.


Expected behaviour:
==============================================================================

NetworkManager (or any tool) should bring the link up successfully.


Reproduction:
==============================================================================

Should be reproducible with a 5.00-rc4 kernel and the 1d6a:87b1 NIC. I
have no way of testing if other Aquantia devices are affected - I do not
own any others. I am using Ubuntu and Arch (both the very latest) and I
just cannot get NetworkManager to bring up the connection with 5.00-rc4
(happens effortlessly with 4.20 kernel). If you want NetworkManager to
manage the adapter, you may need to use the
10-globally-managed-devices.conf fix, depending on the version.


Technical Notes:
==============================================================================

It is hard to tell, but my gut tells me that it is IPv4-related. That
would explain why some sites could work. I have a native IPv6 service
from my ISP. The "ifconfig" command shows IPv6 addresses assigned, but
it is very difficult to get a DHCPv4-assigned address on the adapter.

A couple of times I did manage to get the adapter up in 5.00-rc4 with a
lot of messing around, and the resulting internet connection was
unreliable and unstable. DNS resolution seemed to be a big part of the
problem.

All BARs are assigned correctly. Other than that, dmesg does not give
anything interesting that I can tell, so if you want some debug logs
from somewhere then please tell me what you want and how I should do it.

There are a fair few lines of diff between 4.20 and 5.00-rc4 so there is
likely an oopsie in there.


Always Remembering My Manners:
==============================================================================

Thank you for reading my bug report!


Why Are Web Developers So Annoying?
==============================================================================

The preview removes all my newlines and nice formatting which I did so have fun trying to read this big blob of unformatted text.
Comment 1 Dmitry Bogdanov 2019-02-25 09:38:52 UTC
Hi Nicholas,

Thank you for a bug-report.
The diff between 4.20.3 and 5.0-rc4 are almost not related to the traffic processing. There are introducing rx-flow filters, vlan offload and a number of queues for RSS.

Could you please try the following tests to narrow it down:
1. Turn off all offloads on the NIC (via ethtool -K enp1s0)

2. Ensure that no rx-flows are configured (ethtool -n enp1s0)

3. Route all UDP traffic on the one queue(disable RSS for UDP)
(sudo ethtool -N enp1s0 flow-type udp4 action 0 loc 32)

BR,
 Dmitry
Comment 2 Nicholas Johnson 2019-02-25 12:58:29 UTC
Hello,

Thanks for looking at this. For reference, my card is named ens1 or ens4 depending on which Thunderbolt port it is in.

Ethtool is unreliable and it can sometimes not change options when you request them, sometimes complaining, and sometimes failing to make the change without saying anything.

I managed to get all of the things in "ethtool -k ens4" off, even in 5.00-rc7, it still does not work.

$ ethtool -n ens4
8 RX rings available
Total 0 rules

# ethtool -N ens4 flow-type udp4 action 0 loc 32

$ ethtool -n ens4
8 RX rings available
Total 1 rules

Filter: 32
	Rule Type: UDP over IPv4
	Src IP addr: 0.0.0.0 mask: 255.255.255.255
	Dest IP addr: 0.0.0.0 mask: 255.255.255.255
	TOS: 0x0 mask: 0xff
	Src port: 0 mask: 0xffff
	Dest port: 0 mask: 0xffff
	Action: Direct to queue 0

Still does not work.

4.20 kernel still works without me having to do anything whatsoever.

The issue does seem to be DNS. It is assigned IP address now and I can browse my router (192.168.0.1) but ICMP is broken and I cannot ping, or use nslookup.

The 5.00-rcX kernel works fine if I swap a RTL8153 in a USB port and take the same Ethernet cable from the Aquantia. So it does seem to be just the Aquantia one.

I will let you know if I find out more.

Is there a flag to compile a kernel with to enable verbose dmesg for net subsystem?

Cheers!
Comment 3 Nicholas Johnson 2019-02-25 13:34:09 UTC
Okay, this is really, really nasty.

In 5.00-rc7 (possibly rc4 also, but I upgraded to check if the issue is fixed in rc7) I am able to consistently crash the computer by removing the Thunderbolt AQC107S NIC and hot-adding it again. Even if in tty when it is added. Even if iommu is enabled. So it is likely that the driver is locking up.

In 4.20, not only does it work without any messing about, but I can hot-add it all day, over and over again, and not so much as a BAR assignment warning happens.

If it is not atlantic.ko, then a maintainer did something really bad to the PCIe net subsystems. As I said, USB Realtek RTL8153 works in v5.00-rcX.


But I really feel like this is worth reverting the changes for and waiting a kernel release cycle to get things ironed out before they are added again.

Have you tried to reproduce the bug? And if so, what were the results?

I assume it was tested before pushing the change, but was it tested with every single VEN:DEV hardware supported under the driver? Perhaps it is just this one affected.

Cheers!
Comment 4 Igor Russkikh 2019-02-25 14:01:07 UTC
Nicholas,

Any chance of observing panic logs in dmesg?
If no, could you try running 'dmesg -w' in tty term and then doing that detach/attach to see if anything appears then?

> But I really feel like this is worth reverting the changes for and waiting a
> > kernel release cycle to get things ironed out before they are added again.

For this we have to pin where the regression happened. Do you have an option/time to do a bisect? Or, at least, checkout 5.0rc1 for the same?

> Have you tried to reproduce the bug?

We are setting up the 5.0rc system now, will let you know.
Comment 5 Igor Russkikh 2019-02-25 15:07:41 UTC
> Have you tried to reproduce the bug?

Nicholas, tried on Lenovo X1 Carbon, kernel 

5.0.0-0.rc7.git3.1.vanilla.knurd.1.fc28.x86_64 #1 SMP Fri Feb 22 18:58:22 UTC 2019 x86_64 x86_64 x86_64 GNU/Linux

With two TB dongles I have.

NO bad behavior observed. IP allocated from DHCP, link is stable.

Please explain in more details the host systems you are trying on.
Also please give a full "sudo lspci -vvv" output, as well as full dmesg after issue starts happening.
Comment 6 Bernhard Rosenkränzer 2019-02-25 16:57:41 UTC
Probably related: I'm also seeing an Aquantia NIC regression in 5.0-rc kernels.
4.20.11 works perfectly, in 5.0-rc7, the NIC is seen, shows up in lspci and friends correctly, it shows up in /proc/net/dev and friends, but it acts pretty much as if the network cable wasn't plugged in (not getting an IP from DHCP, can't ping the router if I set an IP manually, etc).
There's nothing in dmesg that indicates what's going wrong.

Unfortunately I'm not near that box right now to run any checks, but this looks related.

My NIC is a PCIE card:

07:00.0 0200: 1d6a:d107 (rev 02)
        Subsystem: 1043:872e
        Flags: bus master, fast devsel, latency 0, IRQ 30, NUMA node 0
        Memory at d9040000 (64-bit, non-prefetchable) [size=64K]
        Memory at d9050000 (64-bit, non-prefetchable) [size=4K]
        Memory at d8c00000 (64-bit, non-prefetchable) [size=4M]
        Expansion ROM at d9000000 [disabled] [size=256K]
        Capabilities: [40] Express Endpoint, MSI 00
        Capabilities: [80] Power Management version 3
        Capabilities: [90] MSI-X: Enable+ Count=32 Masked-
        Capabilities: [a0] MSI: Enable- Count=1/32 Maskable- 64bit+
        Capabilities: [c0] Vital Product Data
        Capabilities: [100] Advanced Error Reporting
        Capabilities: [150] Vendor Specific Information: ID=0001 Rev=1 Len=024 <?>
        Capabilities: [180] Secondary PCI Express <?>
        Kernel driver in use: atlantic
        Kernel modules: atlantic
Comment 7 Nicholas Johnson 2019-02-26 00:30:03 UTC
The laptop is Dell 9370 but in native Thunderbolt enumeratiom by flashing NVM40 from Dell 9380. The BIOS is not updated to support it but that does not matter for Linux. With my kernel patches I am trying to get accepted, I can control the resources like puppets and I prefer my implementation over Intel's because I can throw 64G of MMIO_PREF and should even be able to park multiple Xeon Phi coprocessors (16G BARs) in Thunderbolt enclosures.

The desktop is also running native enum with Titan Ridge add-in cards. Both are happy to add daisy chains of docks and then throw eGPUs at the end without so much as a hiccup. Just to prove it is not my kernel patches, I tried 5.00-rc8 from Ubuntu mainline and used hpmemsize=128M. Same issues, minus crash when hot-plugging. I could not make it crash, but I could not make it work either. So the crashing could be unreleated, but it is worth noting in case it actually has something to do with it.

The desktop actually has BIOS support for Alpine Ridge add-in cards (fully disabled because it messes with Titan Ridge). Tonight after work I can take the Titan Ridge cards out, enable the BIOS support and chuck an Alpine Ridge card in and test with a fully official Thunderbolt setup. If it works though, I will eat my shorts because my own Thunderbolt implementation has been running perfectly for some time now and I simply cannot break it (except for with this NIC in 5.00). But if that convinces you that it isn't caused by my whacky Thunderbolt setup then I will do it. In fact, I will ask my sister to install 5.00 kernel on her Dell 9370 which uses BIOS-assisted enumeration and check there. She uses Linux.

When it crashes, there are no incriminating dmesg entries. Unless it goes down before they can make the framebuffer. Tonight after work, I can connect my laptop to a UART to the RS232 of the desktop for kernel debugging and see if I can find any messages there.

What are the VEN:DEV IDs of the PCI endpoints in your Thunderbolt adapters? Do they support SR-IOV? Oh, I can also dig out the Intel X550-T2 out of my box of random cool stuff and check that it works, just in case this is a subsystem issue affecting other PCI endpoints. I can also chuck it in an eGPU enclosure and check it on Thunderbolt.

Anyways it seems like you believe everything is hunky dory at your end and maybe your test platform does not expose the problem for some reason. Hence I will do what I did with Thunderbolt and mess with the source code until it works. I do not have the time or interest to present a full solution, but I will comment stuff out until it works again, proving where the problem is. I would prefer fix that horrendous amdgpu.ko that cannot handle device removal, whether it be by surprise or by unloading the driver (the explosion in dmesg is nasty, but funnily enough, it does not render the computer unusable if in tty at the time, unlike hot-adding the Aquantia NIC). Now serious question, what is a bisect? Is that copying parts of the old driver to 5.00 kernel one by one or vice-versa until it works or breaks? I tried to insmod the 4.20 atlantic.ko but quickly realised that was never going to work. I can try to do it but it would be appreciated if you could send me resources and knowledge articles about doing a bisect.

It occurred to me that perhaps the distros we are using are doing things differently. I am using Ubuntu and Arch. Could they be enabling things by default that screw it up that others do not? I would say it is NetworkManager but if I disable that service and try to manually run "sudo dhclient ens4" then it just blocks and sits there. What Linux distro are you guys using to test, Igor and Bernhard?

Bernhard Rosenkränzer, do you know how to patch and compile a kernel? If you are a competent Linux user then you can be my test subject if you wish, and confirm if any patch I produce solves the problem for you.

Cheers!
Comment 8 Bernhard Rosenkränzer 2019-02-26 01:24:45 UTC
I'm on OpenMandriva, and I know how to patch and compile a kernel -- I'm mostly a userspace developer, but I've landed a few kernel patches as well.
Comment 9 Nicholas Johnson 2019-02-26 10:11:34 UTC
Okay, a few things:

1. The kernel crashing was my fault. Sorry about that. I had some pci_dbg calls with bad pointers that were left around from the rc7 that I was messing with.

2. It's definitely the atlantic.ko driver causing the network issues - because I copied the drivers/net/ethernet/aquantia/atlantic directory from 4.20 to 5.00 source directory, fixed the dev_open call with NULL argument, and compiled. The resulting atlantic.ko worked straight away. So it is not the whole subsystem that is screwed. Which is probably preferable, anyway.

3. I tried to disable vlans and ntuple features before I did this but with no success - the atlantic.ko with the partial changes still did not work, even after multiple attempts. Is there an easy way to turn these off in the driver by only making a tiny change? I am not familiar enough with the driver.

Cheers!
Comment 10 Igor Russkikh 2019-02-26 10:17:54 UTC
Hi Nicholas,

Thanks alot for confirming this. It's sad but seems your point (2) basically proves some change in driver caused that.

We have reviewed the changeset, the suspect now is increased number of HW rings created.

How many logical Cores do you have on your system? If more than four, try the following patch:

diff --git a/drivers/net/ethernet/aquantia/atlantic/aq_cfg.h b/drivers/net/ethernet/aquantia/atlantic/aq_cfg.h
index 3944ce7f0870..91eb8910b1c9 100644
--- a/drivers/net/ethernet/aquantia/atlantic/aq_cfg.h
+++ b/drivers/net/ethernet/aquantia/atlantic/aq_cfg.h
@@ -12,7 +12,7 @@
 #ifndef AQ_CFG_H
 #define AQ_CFG_H
 
-#define AQ_CFG_VECS_DEF   8U
+#define AQ_CFG_VECS_DEF   4U
 #define AQ_CFG_TCS_DEF    1U
 
 #define AQ_CFG_TXDS_DEF    4096U


Essentially, change AQ_CFG_VECS_DEF from 8 to 4.
Comment 11 Nicholas Johnson 2019-02-26 10:31:36 UTC
Oh, here's a crazy idea: are you testing with 1GBASE-T connection at all?

If you are testing it in the lab at 10GBASE-T speeds, and I am testing it here with 1GBASE-T network, could that make the difference between it working for you and not working for me?

I am only using 1GBASE-T networks most of the time.

I have quad-core with HT, so 8 logical. I will try. Cheers!
Comment 12 Nicholas Johnson 2019-02-26 10:36:41 UTC
You're a legend! Can confirm changing AQ_CFG_VECS_DEF to 4U instantly fixed the problem.

Now all that's left is to figure out why setting it to 8U breaks it, and also why it only sometimes breaks it.

Cheers!

i7-8650U 4c/8t
Comment 13 Igor Russkikh 2019-02-26 11:29:10 UTC
Thanks for confirmation.

We'll try to reproduce this on our side, but anyway for now we'll rollback this change to make AQC stable in kernel 5.0 release.

Bernhard, could you try doing the same and confirm the patch actually helps?
Comment 14 Nicholas Johnson 2019-02-26 12:56:01 UTC
Okay, if I move drivers/net/ethernet/aquantia/atlantic directory from 4.20 to 5.00 source, fix the dev_open and change AQ_CFG_VECS_DEF to 8U it works also. So some combination of the 5.00 driver and AQ_CFG_VECS_DEF being 8U breaks it. I might be saying something blindingly obvious to you, but I hope it is helpful.

Any ideas why it affects Bernhard and I, and not you? Do you have less logical cores?
Comment 15 Igor Russkikh 2019-02-26 13:13:50 UTC
That was absolutely not obvious, thanks for doing such a test! It's valuable.

Previously we've tested on 4 Core machines, thats why we did not catch it.

Right now we are trying to use 8 Cores, we see some anomalies, trying to debug these.
Comment 16 Bernhard Rosenkränzer 2019-02-26 13:16:48 UTC
Unfortunately can't try it out right now - while I can ssh into the box that has the problem, rebooting it remotely isn't the smartest idea if there's a good chance its only network interface won't come back up.
Will try the next time I'm anywhere near the box.

FWIW this is a threadripper 1950X (16 physical cores, 32 logical).
Comment 17 Nicholas Johnson 2019-02-26 14:12:55 UTC
Yikes, you have a serious system there. My dream would be to have the 32-core to destroy kernel compilation times but holding off until I can fully abandon Windows (Windows has no hope of managing Thunderbolt without BIOS support).

Everything I have been doing has been working toward being able to abandon Intel CPUs altogether finally. Which means making sure Thunderbolt can work on AMD and ARM. I like Intel for NICs (SR-IOV Intel X550 T2 is my preferred device) and Thunderbolt, maybe Optane when they start to make proper M.2 2280 ones. That's it.

Back on track. Any time. Good to know the test was of use.

Ah, so you did 4-core non-HT before? Interesting. So I take it that the queues are either limited to the number of logical processors, or are affected in some way by too many cores accessing them out of order i.e. if cores <= queues then that serialises the accesses sufficiently?

I have an i7-4930K at my disposal - I could try it on that by installing Titan Ridge in it.

Understandable with SSH. Been there, done that.

Anyways, this is great! We made significant progress. Cheers!
Comment 18 Dmitry Bogdanov 2019-02-26 14:31:49 UTC
 We found the root cause of the issue. That was a missed line in the upstream driver in the configuration of a number of Traffic Classes and a number of Queues per TC.

Here is the patch that will be sent to the upstream:

diff --git a/drivers/net/ethernet/aquantia/atlantic/hw_atl/hw_atl_b0.c b/drivers/net/ethernet/aquantia/atlantic/hw_atl/hw_atl_b0.c
index b58ca7cb8e9d..fbba300c1d01 100644
--- a/drivers/net/ethernet/aquantia/atlantic/hw_atl/hw_atl_b0.c
+++ b/drivers/net/ethernet/aquantia/atlantic/hw_atl/hw_atl_b0.c
@@ -275,6 +275,9 @@ static int hw_atl_b0_hw_offload_set(struct aq_hw_s *self,
 
 static int hw_atl_b0_hw_init_tx_path(struct aq_hw_s *self)
 {
+       /* Tx TC/Queue number config */
+       hw_atl_rpb_tps_tx_tc_mode_set(self, 1U);
+
        hw_atl_thm_lso_tcp_flag_of_first_pkt_set(self, 0x0FF6U);
        hw_atl_thm_lso_tcp_flag_of_middle_pkt_set(self, 0x0FF6U);
        hw_atl_thm_lso_tcp_flag_of_last_pkt_set(self, 0x0F7FU);
Comment 19 Dmitry Bogdanov 2019-02-26 14:53:54 UTC
Sorry, here is full patch:

diff --git a/drivers/net/ethernet/aquantia/atlantic/hw_atl/hw_atl_b0.c b/drivers/net/ethernet/aquantia/atlantic/hw_atl/hw_atl_b0.c
index b58ca7cb8e9d..fbba300c1d01 100644
--- a/drivers/net/ethernet/aquantia/atlantic/hw_atl/hw_atl_b0.c
+++ b/drivers/net/ethernet/aquantia/atlantic/hw_atl/hw_atl_b0.c
@@ -275,6 +275,9 @@ static int hw_atl_b0_hw_offload_set(struct aq_hw_s *self,
 
 static int hw_atl_b0_hw_init_tx_path(struct aq_hw_s *self)
 {
+	/* Tx TC/Queue number config */
+	hw_atl_rpb_tps_tx_tc_mode_set(self, 1U);
+
 	hw_atl_thm_lso_tcp_flag_of_first_pkt_set(self, 0x0FF6U);
 	hw_atl_thm_lso_tcp_flag_of_middle_pkt_set(self, 0x0FF6U);
 	hw_atl_thm_lso_tcp_flag_of_last_pkt_set(self, 0x0F7FU);
diff --git a/drivers/net/ethernet/aquantia/atlantic/hw_atl/hw_atl_llh.c b/drivers/net/ethernet/aquantia/atlantic/hw_atl/hw_atl_llh.c
index 939f77e2e117..8ac7a67b15c1 100644
--- a/drivers/net/ethernet/aquantia/atlantic/hw_atl/hw_atl_llh.c
+++ b/drivers/net/ethernet/aquantia/atlantic/hw_atl/hw_atl_llh.c
@@ -1274,6 +1274,15 @@ void hw_atl_tpb_tx_buff_en_set(struct aq_hw_s *aq_hw, u32 tx_buff_en)
 			    HW_ATL_TPB_TX_BUF_EN_SHIFT, tx_buff_en);
 }
 
+void hw_atl_rpb_tps_tx_tc_mode_set(struct aq_hw_s *aq_hw,
+				   u32 tx_traf_class_mode)
+{
+	aq_hw_write_reg_bit(aq_hw, HW_ATL_TPB_TX_TC_MODE_ADDR,
+			HW_ATL_TPB_TX_TC_MODE_MSK,
+			HW_ATL_TPB_TX_TC_MODE_SHIFT,
+			tx_traf_class_mode);
+}
+
 void hw_atl_tpb_tx_buff_hi_threshold_per_tc_set(struct aq_hw_s *aq_hw,
 						u32 tx_buff_hi_threshold_per_tc,
 					 u32 buffer)
diff --git a/drivers/net/ethernet/aquantia/atlantic/hw_atl/hw_atl_llh.h b/drivers/net/ethernet/aquantia/atlantic/hw_atl/hw_atl_llh.h
index 03c570d115fe..f529540bfd7e 100644
--- a/drivers/net/ethernet/aquantia/atlantic/hw_atl/hw_atl_llh.h
+++ b/drivers/net/ethernet/aquantia/atlantic/hw_atl/hw_atl_llh.h
@@ -605,6 +605,10 @@ void hw_atl_thm_lso_tcp_flag_of_middle_pkt_set(struct aq_hw_s *aq_hw,
 
 /* tpb */
 
+/* set TX Traffic Class Mode */
+void hw_atl_rpb_tps_tx_tc_mode_set(struct aq_hw_s *aq_hw,
+				   u32 tx_traf_class_mode);
+
 /* set tx buffer enable */
 void hw_atl_tpb_tx_buff_en_set(struct aq_hw_s *aq_hw, u32 tx_buff_en);
 
diff --git a/drivers/net/ethernet/aquantia/atlantic/hw_atl/hw_atl_llh_internal.h b/drivers/net/ethernet/aquantia/atlantic/hw_atl/hw_atl_llh_internal.h
index 8470d92db812..e91ffce005f1 100644
--- a/drivers/net/ethernet/aquantia/atlantic/hw_atl/hw_atl_llh_internal.h
+++ b/drivers/net/ethernet/aquantia/atlantic/hw_atl/hw_atl_llh_internal.h
@@ -1948,6 +1948,19 @@
 /* default value of bitfield tx_buf_en */
 #define HW_ATL_TPB_TX_BUF_EN_DEFAULT 0x0
 
+/* register address for bitfield tx_tc_mode */
+#define HW_ATL_TPB_TX_TC_MODE_ADDR 0x00007900
+/* bitmask for bitfield tx_tc_mode */
+#define HW_ATL_TPB_TX_TC_MODE_MSK 0x00000100
+/* inverted bitmask for bitfield tx_tc_mode */
+#define HW_ATL_TPB_TX_TC_MODE_MSKN 0xFFFFFEFF
+/* lower bit position of bitfield tx_tc_mode */
+#define HW_ATL_TPB_TX_TC_MODE_SHIFT 8
+/* width of bitfield tx_tc_mode */
+#define HW_ATL_TPB_TX_TC_MODE_WIDTH 1
+/* default value of bitfield tx_tc_mode */
+#define HW_ATL_TPB_TX_TC_MODE_DEFAULT 0x0
+
 /* tx tx{b}_hi_thresh[c:0] bitfield definitions
  * preprocessor definitions for the bitfield "tx{b}_hi_thresh[c:0]".
  * parameter: buffer {b} | stride size 0x10 | range [0, 7]
--
Comment 20 Nicholas Johnson 2019-02-26 23:09:20 UTC
Patch is corrupt, please add as attachment so that bugzilla does not interfere with the text. Cheers
Comment 21 Nicholas Johnson 2019-02-26 23:23:54 UTC
Managed to manually ungarble patch, and it works! Good job!
Comment 22 Nicholas Johnson 2019-02-26 23:34:04 UTC
Created attachment 281369 [details]
Apply this patch with "git apply" at root of source.

This way the line breaks cannot be interfered with by Bugzilla.

Apply this patch with "git apply" at root of source.

Great job to those who figured the bug out!

Now the only complaint I have is that the Windows driver does not support DMA remapping for IOMMU and Thunderbolt 3. But that may or may not be a different team entirely.

Strange how Windows needs opt-in and Linux drivers just support it.
Comment 23 Nicholas Johnson 2019-02-26 23:36:37 UTC
Oh, so will this make the final release, or is it going to be delayed by one kernel version?
Comment 24 Igor Russkikh 2019-02-27 11:13:39 UTC
The patch is in net queue, let's hope it'll fit into 5.0 release.
Comment 25 Nicholas Johnson 2019-03-01 14:07:38 UTC
Also, note that in the driver there are three exclamation marks in hw_atl/hw_atl_b0.c "!!!enable". Is there a reason for this? Three should be the same as one. Actually, looking at the code, should it be two? I noticed this before we solved the major issue but forgot to mention.
Comment 26 Igor Russkikh 2019-03-04 18:36:47 UTC
Nicholas, thats not a bug but just a leaked typo.

FYI, the subject patch is in kernel 5.0 release.
Comment 27 Nicholas Johnson 2019-03-05 00:35:10 UTC
Excellent! We saved a lot of people whining about their NICs in the 5.0 release. Thanks for taking my bug report on board.

About the typo, is it meant to be !enable (functionally identical)? Or !!enable (boolean inverse)?

Might be worth fixing it in a maintenance release. Typos don't look great. I was more looking at my screen going ???? is this some new syntax I have never encountered?
Comment 28 Igor Russkikh 2019-03-05 13:30:41 UTC
Thank you for catching this early and helping with debug.

!!!enable meant to be !enable, so thats not a bug, will submit a fix.
Comment 29 Nicholas Johnson 2019-03-06 04:41:29 UTC
Any time, I am just happy it did not make it into a release kernel and cause hundreds of mainline kernel users who use the latest kernels to be affected. If I hadn't blamed it on systemd at first then I would have been quicker off the mark.

It's still so strange to me how the number of cores can affect whether something works like that. I never would have thought.

Anyways I like this chip - convenient in Thunderbolt form, Intel ICs cannot fit into the TDP to work. It is great how Aquantia has brought the pricing down for 10G networking into the prosumer segment. I just hope that in time, firmware updates can enable SR-IOV but they are more consumer devices so maybe not. Turns out most of these PCIe devices are just another processor - in the case of Intel ones, the firmwares are full of x86-family instructions. So sky's the limit for the stuff it can be made to do.

Can you please clarify the difference between AQC107 and AQC107S? They have different PCI IDs. Most of the time the "S" does not even appear, like in the PCI IDs database name. S=Secure but nobody seems to know why this one is more secure. You engineers will definitely know a lot more than customer support so worth asking while I am here.

Cheers!
Comment 30 Igor Russkikh 2019-03-06 05:48:44 UTC
'S' Secure chips use verified and signed firmware loading process.
As you correctly noted we do have internal microprocessor on a chip.
It drives internal firmware tasks.

Secure chips are protected against firmware tampering. So with them the intruder will not be able to alter internal microcode (reflash) and will not be able to do any 'hacking' on hardware networking level.
Comment 31 Nicholas Johnson 2019-03-06 06:58:31 UTC
So they went to the trouble of burning a signing key into hardware and signing the firmware updates or something....

Wouldn't it be easier to remove the NVM chip from the NIC and have the OS driver upload the firmware upon init into DRAM or volatile memory in the NIC? It would save needing an NVM chip and completely eliminate the possibility of something going wrong if the signing key were to be stolen or list.

Just saying, there can be no back doors if there is nowhere to store them persistently.

Although the NSA probably wouldn't approve of rock solid security practices....

Am I correct in saying that the linux-firmware git is for this purpose? Firmwares that are uploaded on init?
Comment 32 Igor Russkikh 2019-03-06 07:34:25 UTC
Not all the environments have ability to upload FW. AQC however do support such a mode.
Comment 33 Nicholas Johnson 2019-03-06 08:10:07 UTC
Please elaborate on which environments lack the ability, and why.

In those cases, could the firmware not be placed as a binary blob in the driver and updated with a new driver version?

Would Aquantia ever consider releasing the firmware source code for auditing and people who would like to mess with it on their own systems in upload mode?

I assume signature is not checked when uploaded upon init? I believe it should not be, because I like to mess with my own systems. If I can reverse engineer my firmware then I will be attempting crazy stuff like turning NICs into PCIe bridges.

Is the firmware blob for upload on init mode available for AQC?

Cheers
Comment 34 Igor Russkikh 2019-03-07 13:50:13 UTC
One example is BIOS pxe/uefi boot.

> Would Aquantia ever consider releasing the firmware source code for auditing
> and people 

Sorry I'm not able to officially comment here anything. 

> I assume signature is not checked when uploaded upon init?

Only on unsecure chips.

> Is the firmware blob for upload on init mode available for AQC?

This is in planned development stage..
Watch for our edge driver on https://github.com/Aquantia/AQtion