Bug 180031

Summary: Possible Linux 4.9 regression: Failed to find cpu0 device node at boot
Product: Drivers
Reporter: Lilian Moraru (lilian.moraru90)
Component: Other
Assignee: drivers_other
Status: RESOLVED PATCH_ALREADY_AVAILABLE
Severity: normal
CC: lilian.moraru90, regressions
Priority: P1
Hardware: x86-64
OS: Linux
Kernel Version: v4.9-rc1 (1001354ca34179f3db924eb66672442a173147dc)
Subsystem:
Regression: Yes
Bisected commit-id:
Attachments: System Log
System Info
lspci executed on Linux Kernel 4.8.4
lspci executed on Linux Kernel 4.9-rc1

Description Lilian Moraru 2016-10-23 10:44:52 UTC
Created attachment 242231: System Log

First, I want to point out that I am not very technical in this area; this is the first time I have filed a Linux kernel bug report.
If you need additional information, please tell me in more detail how I can help.

I am reporting this under "Drivers" because I found the message here: https://github.com/torvalds/linux/blob/master/drivers/base/cacheinfo.c#L61

I installed the kernel from the Ubuntu Mainline Kernels website: http://kernel.ubuntu.com/~kernel-ppa/mainline/
This issue is not present in Linux kernel 4.8.3 or 4.8.4 (both also installed from Ubuntu Mainline Kernels).
Linux kernel 4.4.0 from the official Ubuntu 16.04 repository does not have this issue either.

I attached the system log (where the message can be seen) and the hardware info that the Phoronix Test Suite could give me.
Comment 1 Lilian Moraru 2016-10-23 10:45:34 UTC
Created attachment 242241: System Info
Comment 2 The Linux kernel's regression tracker (Thorsten Leemhuis) 2016-10-23 12:44:33 UTC
JFYI: Bug 177681 describes a similar issue.
Comment 3 Lilian Moraru 2016-10-23 23:01:48 UTC
Created attachment 242421: lspci executed on Linux Kernel 4.8.4
Comment 4 Lilian Moraru 2016-10-23 23:02:14 UTC
Created attachment 242431: lspci executed on Linux Kernel 4.9-rc1
Comment 5 Lilian Moraru 2016-10-23 23:16:19 UTC
I don't understand much about these logs (I basically copied the command from the other issue but ran it with `sudo`, which gives more info), but I hope they are helpful.
I see some interesting differences:
"Non-Volatile memory controller [0108]: Samsung Electronics Co Ltd NVMe SSD Controller [144d:a802] (rev 01) (prog-if 02 [NVM Express])":
- On 4.8.4:   DevSta:	CorrErr- UncorrErr- FatalErr- UnsuppReq- AuxPwr+ TransPend-
- On 4.9-rc1: DevSta:	CorrErr- UncorrErr- FatalErr+ UnsuppReq- AuxPwr+ TransPend-

Note the "FatalErr+".
Btw, the OS boots from this SSD, so it seems to work from this point of view (I am using it right now to write this comment).

Also:
"Communication controller [0780]: Intel Corporation Sunrise Point-H CSME HECI #1 [8086:a13a] (rev 31)":
- 4.8.4:   CESta:	RxErr+ BadTLP- BadDLLP+ Rollover- Timeout- NonFatalErr-
- 4.9-rc1: CESta:	RxErr+ BadTLP- BadDLLP- Rollover- Timeout- NonFatalErr-

And:
"SATA controller [0106]: Intel Corporation Sunrise Point-H SATA controller [AHCI mode] [8086:a102] (rev 31) (prog-if 01 [AHCI 1.0])":
- 4.8.4:   Control: I/O+ Mem+ BusMaster+ SpecCycle- MemWINV- VGASnoop- ParErr- Stepping- SERR- FastB2B- DisINTx+
- 4.9-rc1: Control: I/O+ Mem+ BusMaster+ SpecCycle- MemWINV- VGASnoop- ParErr- Stepping- SERR- FastB2B- DisINTx-

- 4.8.4:   Capabilities: [80] MSI: Enable+ Count=1/1 Maskable- 64bit-
		Address: fee2000c  Data: 4122
- 4.9-rc1: Capabilities: [80] MSI: Enable- Count=1/1 Maskable- 64bit-
		Address: 00000000  Data: 0000

This last one is interesting because it seems like on "4.9-rc1" the "Address" is actually NULL.
Comment 6 Lilian Moraru 2016-10-23 23:26:10 UTC
I just realized that the Address is probably NULL because of "Enable-" (I don't know how to interpret these logs)...
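
Flag comparisons like the ones in comment 5 can also be done mechanically instead of by eye. The following is a minimal Python sketch, assuming each line has the `Name: Flag+ Flag- ...` shape quoted above from the lspci output. The helper names `parse_flags` and `diff_flags` are hypothetical and not part of lspci or any library. It only reports which flags changed; it does not say whether a change is harmful.

```python
import re

# A flag is a name made of letters, digits or "/" followed by "+" (set) or "-" (clear).
FLAG_RE = re.compile(r"([A-Za-z0-9/]+)([+-])(?=\s|$)")


def parse_flags(line):
    """Map each flag on one lspci flag line to True ("+") or False ("-")."""
    # Drop the register name before the first colon, e.g. "DevSta:".
    _, _, rest = line.partition(":")
    return {name: sign == "+" for name, sign in FLAG_RE.findall(rest)}


def diff_flags(old_line, new_line):
    """Return {flag: (old_value, new_value)} for flags that differ."""
    old, new = parse_flags(old_line), parse_flags(new_line)
    return {
        name: (old[name], new[name])
        for name in old
        if name in new and old[name] != new[name]
    }


if __name__ == "__main__":
    # The two DevSta lines quoted in comment 5 for the NVMe controller.
    old = "DevSta:\tCorrErr- UncorrErr- FatalErr- UnsuppReq- AuxPwr+ TransPend-"
    new = "DevSta:\tCorrErr- UncorrErr- FatalErr+ UnsuppReq- AuxPwr+ TransPend-"
    print(diff_flags(old, new))  # {'FatalErr': (False, True)}
```

Lines that mix flags with other fields, such as the MSI capability line, also parse, because tokens like `Count=1/1` do not end in `+` or `-` and are skipped.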
Comment 7 The Linux kernel's regression tracker (Thorsten Leemhuis) 2016-10-30 11:38:36 UTC
I wonder if these patches help: https://www.mail-archive.com/linux-kernel@vger.kernel.org/msg1260133.html
Comment 8 Lilian Moraru 2016-10-30 15:51:21 UTC
I compiled Linux kernel 4.9-rc3 with the Ubuntu ".config" (i.e. almost everything enabled) and without the mentioned patches, and the message no longer appears.
I then installed 4.9-rc3 from Ubuntu Mainline Kernels and the issue is still not reproducible: the message no longer appears, at least in these non-comprehensive tests (previously it happened every time).

As for "lspci", I see the same results as on "4.9-rc1":
"Non-Volatile memory controller: Samsung Electronics Co Ltd NVMe SSD Controller (rev 01) (prog-if 02 [NVM Express])":
- 4.9-rc3: DevSta: CorrErr- UncorrErr- FatalErr+ UnsuppReq- AuxPwr+ TransPend-
Still "FatalErr+".

"PCI bridge [0604]: Intel Corporation Sunrise Point-H PCI Express Root Port #5 [8086:a114] (rev f1) (prog-if 00 [Normal decode])" (it seems I confused things last time; it wasn't "Point-H CSME HECI"):
- 4.9-rc3: CESta:  RxErr+ BadTLP- BadDLLP- Rollover- Timeout- NonFatalErr-

"SATA controller [0106]: Intel Corporation Sunrise Point-H SATA controller [AHCI mode] [8086:a102] (rev 31) (prog-if 01 [AHCI 1.0])":
- 4.9-rc3: Control: I/O+ Mem+ BusMaster+ SpecCycle- MemWINV- VGASnoop- ParErr- Stepping- SERR- FastB2B- DisINTx-
- 4.9-rc3: Capabilities: [80] MSI: Enable- Count=1/1 Maskable- 64bit-
                Address: 00000000  Data: 0000



I have another regression (works on 4.8.4 but not on 4.9-rc1) on a build server at work: it freezes completely when it connects to a network (if I disconnect the cable, it works fine until I connect it back).
Sadly, I probably won't be able to report that issue properly, because I have to wait until late at night to run these tests and be able to restart the server...
Comment 9 Lilian Moraru 2016-10-30 15:57:08 UTC
My previous comment makes it sound like disconnecting the cable fixes the issue; it does not.
What I meant is:
1. Disconnect the cable.
2. Start the server.
Result: it works.

3. Connect the cable (the server detects a network and connects to it).
Result: it freezes and doesn't recover in any way, unless it is forcibly shut down by holding the shutdown button for a while and then started again...