Created attachment 242231 [details] System Log

First, I want to point out that I am not very technical in this area, and this is my first Linux kernel bug report. If you need additional information, please give me details on how I can provide it.

I am reporting this under "Drivers" because I found the message here: https://github.com/torvalds/linux/blob/master/drivers/base/cacheinfo.c#L61

I installed the kernel from the Ubuntu Mainline Kernels website: http://kernel.ubuntu.com/~kernel-ppa/mainline/

This issue is not present in Linux kernel 4.8.3 or 4.8.4 (both also installed from Ubuntu Mainline Kernels). Linux kernel 4.4.0 from the official Ubuntu 16.04 repository does not have this issue either.

I attached the system log (where the message can be seen) and the hardware info that the Phoronix Test Suite could give me.
Created attachment 242241 [details] System Info
JFYI: Bug 177681 describes a similar issue.
Created attachment 242421 [details] lspci executed on Linux Kernel 4.8.4
Created attachment 242431 [details] lspci executed on Linux Kernel 4.9-rc1
I don't understand much about these logs (I basically copied the command from the other issue but ran it with `sudo`, which gives more info), but I hope they are helpful. I see some interesting differences.

"Non-Volatile memory controller [0108]: Samsung Electronics Co Ltd NVMe SSD Controller [144d:a802] (rev 01) (prog-if 02 [NVM Express])":
- 4.8.4: DevSta: CorrErr- UncorrErr- FatalErr- UnsuppReq- AuxPwr+ TransPend-
- 4.9-rc1: DevSta: CorrErr- UncorrErr- FatalErr+ UnsuppReq- AuxPwr+ TransPend-

Observe the "FatalErr+". Btw, the OS boots from this SSD, so it seems to work from this point of view (I am using it right now to write this comment).

Also, "Communication controller [0780]: Intel Corporation Sunrise Point-H CSME HECI #1 [8086:a13a] (rev 31)":
- 4.8.4: CESta: RxErr+ BadTLP- BadDLLP+ Rollover- Timeout- NonFatalErr-
- 4.9-rc1: CESta: RxErr+ BadTLP- BadDLLP- Rollover- Timeout- NonFatalErr-

And "SATA controller [0106]: Intel Corporation Sunrise Point-H SATA controller [AHCI mode] [8086:a102] (rev 31) (prog-if 01 [AHCI 1.0])":
- 4.8.4: Control: I/O+ Mem+ BusMaster+ SpecCycle- MemWINV- VGASnoop- ParErr- Stepping- SERR- FastB2B- DisINTx+
- 4.9-rc1: Control: I/O+ Mem+ BusMaster+ SpecCycle- MemWINV- VGASnoop- ParErr- Stepping- SERR- FastB2B- DisINTx-
- 4.8.4: Capabilities: [80] MSI: Enable+ Count=1/1 Maskable- 64bit- Address: fee2000c Data: 4122
- 4.9-rc1: Capabilities: [80] MSI: Enable- Count=1/1 Maskable- 64bit- Address: 00000000 Data: 0000

This last one is interesting because it seems that on 4.9-rc1 the "Address" is actually NULL.
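For anyone who wants to repeat this comparison without reading the two lspci dumps by eye, here is a minimal Python sketch that reports which `+`/`-` flags changed between two status lines. The helper names `parse_flags` and `diff_flags` are made up for illustration and are not part of lspci or any library; the sketch assumes each flag is a name immediately followed by `+` or `-`, as in the lines quoted above.

```python
import re

# Matches lspci-style flags such as "FatalErr+", "DisINTx-" or "I/O+".
# Assumption: a flag is a name directly followed by '+' or '-' and then
# whitespace or end of line.
FLAG_RE = re.compile(r"([\w/]+)([+-])(?=\s|$)")


def parse_flags(line):
    """Return {flag_name: '+' or '-'} for one lspci status line."""
    return dict(FLAG_RE.findall(line))


def diff_flags(old_line, new_line):
    """Return {flag_name: (old_state, new_state)} for flags that changed."""
    old, new = parse_flags(old_line), parse_flags(new_line)
    return {
        name: (old[name], new[name])
        for name in old
        if name in new and old[name] != new[name]
    }


# DevSta lines of the NVMe controller, copied from the comment above.
devsta_484 = "DevSta: CorrErr- UncorrErr- FatalErr- UnsuppReq- AuxPwr+ TransPend-"
devsta_49rc1 = "DevSta: CorrErr- UncorrErr- FatalErr+ UnsuppReq- AuxPwr+ TransPend-"

print(diff_flags(devsta_484, devsta_49rc1))  # {'FatalErr': ('-', '+')}
```

The same function applied to the SATA controller "Control:" lines reports only `DisINTx` changing from `+` to `-`.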
I just realized that the Address is probably NULL because of "Enable-" (I don't know how to interpret these logs)...
I wonder if these patches help: https://www.mail-archive.com/linux-kernel@vger.kernel.org/msg1260133.html
I compiled Linux kernel 4.9-rc3 with the Ubuntu ".config" (i.e. almost everything enabled) and without the mentioned patches, and the message does not appear any more. I then installed 4.9-rc3 from Ubuntu Mainline Kernels and the issue is still not reproducible: the message no longer appears, at least in these non-comprehensive tests (previously it happened every time).

As for "lspci", I see the same results as on 4.9-rc1.

"Non-Volatile memory controller: Samsung Electronics Co Ltd NVMe SSD Controller (rev 01) (prog-if 02 [NVM Express])":
- 4.9-rc3: DevSta: CorrErr- UncorrErr- FatalErr+ UnsuppReq- AuxPwr+ TransPend-

Still "FatalErr+".

"PCI bridge [0604]: Intel Corporation Sunrise Point-H PCI Express Root Port #5 [8086:a114] (rev f1) (prog-if 00 [Normal decode])" (it seems I confused things last time; it wasn't "Point-H CSME HECI"):
- 4.9-rc3: CESta: RxErr+ BadTLP- BadDLLP- Rollover- Timeout- NonFatalErr-

"SATA controller [0106]: Intel Corporation Sunrise Point-H SATA controller [AHCI mode] [8086:a102] (rev 31) (prog-if 01 [AHCI 1.0])":
- 4.9-rc3: Control: I/O+ Mem+ BusMaster+ SpecCycle- MemWINV- VGASnoop- ParErr- Stepping- SERR- FastB2B- DisINTx-
- 4.9-rc3: Capabilities: [80] MSI: Enable- Count=1/1 Maskable- 64bit- Address: 00000000 Data: 0000

I have another regression (works on 4.8.4 but not on 4.9-rc1) on a build server at work, which freezes completely when it connects to a network (if I disconnect the cable, it works fine until I connect it back). Sadly, I probably won't be able to report that issue properly, because I need to wait until late at night to run these tests and be able to restart the server...
My previous comment may sound as if disconnecting the cable fixes the issue. It does not. What I meant is:
1. Disconnect the cable.
2. Start the server: it works.
3. Connect the cable (the server detects a network and connects to it): the server freezes and does not recover in any way, unless it is forcibly shut down by holding the power button for a while and then started again.