Bug 180031 - Possible Linux 4.9 regression: Failed to find cpu0 device node at boot
Summary: Possible Linux 4.9 regression: Failed to find cpu0 device node at boot
Status: RESOLVED PATCH_ALREADY_AVAILABLE
Alias: None
Product: Drivers
Classification: Unclassified
Component: Other
Hardware: x86-64 Linux
Importance: P1 normal
Assignee: drivers_other
URL:
Keywords:
Depends on:
Blocks:
 
Reported: 2016-10-23 10:44 UTC by Lilian Moraru
Modified: 2016-10-30 15:57 UTC
CC List: 2 users

See Also:
Kernel Version: v4.9-rc1 (1001354ca34179f3db924eb66672442a173147dc)
Subsystem:
Regression: Yes
Bisected commit-id:


Attachments
System Log (101.38 KB, text/plain)
2016-10-23 10:44 UTC, Lilian Moraru
System Info (1.63 KB, text/plain)
2016-10-23 10:45 UTC, Lilian Moraru
lspci executed on Linux Kernel 4.8.4 (50.29 KB, text/plain)
2016-10-23 23:01 UTC, Lilian Moraru
lspci executed on Linux Kernel 4.9-rc1 (50.29 KB, text/plain)
2016-10-23 23:02 UTC, Lilian Moraru

Description Lilian Moraru 2016-10-23 10:44:52 UTC
Created attachment 242231
System Log

First, I want to point out that I am not very technical in this area; this is the first time I have filed a Linux kernel bug report.
If you need additional information, please give me details on how I can collect it.

I filed this under "Drivers" because I found the message here: https://github.com/torvalds/linux/blob/master/drivers/base/cacheinfo.c#L61

I installed the kernel from the Ubuntu Mainline Kernels website: http://kernel.ubuntu.com/~kernel-ppa/mainline/
This issue is not present in Linux kernel 4.8.3 or 4.8.4 (both also installed from Ubuntu Mainline Kernels).
Linux kernel 4.4.0 from the official Ubuntu 16.04 repository does not have this issue either.

I attached the system log (where the message can be seen) and the hardware info that the Phoronix Test Suite could give me.
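
A minimal sketch of how the message could be located in a saved system log. The function name `find_cpu_node_failures` is hypothetical, and the sample lines are illustrative only (their timestamps are made up and are not taken from the attached log); only the message text itself comes from this report.

```python
import re

# Pattern for the message named in the bug title; the cpu number is captured.
MSG_RE = re.compile(r"Failed to find cpu(\d+) device node")

def find_cpu_node_failures(lines):
    """Return the sorted, distinct cpu numbers named in matching log lines."""
    cpus = set()
    for line in lines:
        match = MSG_RE.search(line)
        if match:
            cpus.add(int(match.group(1)))
    return sorted(cpus)

# Illustrative input only: timestamps and the second line are invented.
sample = [
    "[    0.000000] Failed to find cpu0 device node",
    "[    0.000001] some unrelated message",
]
print(find_cpu_node_failures(sample))  # [0]
```

An open file object can be passed in place of `sample`, since the function only iterates over lines.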
Comment 1 Lilian Moraru 2016-10-23 10:45:34 UTC
Created attachment 242241
System Info
Comment 2 The Linux kernel's regression tracker (Thorsten Leemhuis) 2016-10-23 12:44:33 UTC
JFYI: Bug 177681 describes a similar issue.
Comment 3 Lilian Moraru 2016-10-23 23:01:48 UTC
Created attachment 242421
lspci executed on Linux Kernel 4.8.4
Comment 4 Lilian Moraru 2016-10-23 23:02:14 UTC
Created attachment 242431
lspci executed on Linux Kernel 4.9-rc1
Comment 5 Lilian Moraru 2016-10-23 23:16:19 UTC
I don't understand much about these logs (I basically copied the command from the other issue but ran it with `sudo`, which gives more info), but I hope they are helpful.
I see interesting differences:
"Non-Volatile memory controller [0108]: Samsung Electronics Co Ltd NVMe SSD Controller [144d:a802] (rev 01) (prog-if 02 [NVM Express])":
- On 4.8.4:   DevSta:	CorrErr- UncorrErr- FatalErr- UnsuppReq- AuxPwr+ TransPend-
- On 4.9-rc1: DevSta:	CorrErr- UncorrErr- FatalErr+ UnsuppReq- AuxPwr+ TransPend-

Observe the "FatalErr+".
Btw, the OS boots from this SSD, so it seems to work in that respect (I am using it right now to write this comment).

Also:
"Communication controller [0780]: Intel Corporation Sunrise Point-H CSME HECI #1 [8086:a13a] (rev 31)":
- 4.8.4:   CESta:	RxErr+ BadTLP- BadDLLP+ Rollover- Timeout- NonFatalErr-
- 4.9-rc1: CESta:	RxErr+ BadTLP- BadDLLP- Rollover- Timeout- NonFatalErr-

And:
"SATA controller [0106]: Intel Corporation Sunrise Point-H SATA controller [AHCI mode] [8086:a102] (rev 31) (prog-if 01 [AHCI 1.0])":
- 4.8.4:   Control: I/O+ Mem+ BusMaster+ SpecCycle- MemWINV- VGASnoop- ParErr- Stepping- SERR- FastB2B- DisINTx+
- 4.9-rc1: Control: I/O+ Mem+ BusMaster+ SpecCycle- MemWINV- VGASnoop- ParErr- Stepping- SERR- FastB2B- DisINTx-

- 4.8.4:   Capabilities: [80] MSI: Enable+ Count=1/1 Maskable- 64bit-
		Address: fee2000c  Data: 4122
- 4.9-rc1: Capabilities: [80] MSI: Enable- Count=1/1 Maskable- 64bit-
		Address: 00000000  Data: 0000

This last one is interesting because on "4.9-rc1" the "Address" is all zeros (NULL).
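
A minimal sketch of how the flag comparisons above could be automated, assuming the relevant `lspci` lines have been copied out by hand. The helper names `parse_flags` and `diff_flags` are hypothetical and not part of `lspci` or any library; the sample input is the pair of NVMe "DevSta" lines quoted above.

```python
import re

# A flag token is a name followed by '+' (set) or '-' (clear), e.g. "FatalErr+".
# '/' is allowed in the name so that "I/O+" is read as one flag.
FLAG_RE = re.compile(r"([\w/]+)([+-])(?=\s|$)")

def parse_flags(line):
    """Return {flag_name: is_set} for the flag tokens after the first ':'."""
    _, _, rest = line.partition(":")
    return {name: sign == "+" for name, sign in FLAG_RE.findall(rest)}

def diff_flags(old_line, new_line):
    """Return {flag_name: (old_value, new_value)} for flags that changed."""
    old, new = parse_flags(old_line), parse_flags(new_line)
    return {k: (old[k], new[k]) for k in old if k in new and old[k] != new[k]}

old = "DevSta:\tCorrErr- UncorrErr- FatalErr- UnsuppReq- AuxPwr+ TransPend-"
new = "DevSta:\tCorrErr- UncorrErr- FatalErr+ UnsuppReq- AuxPwr+ TransPend-"
print(diff_flags(old, new))  # {'FatalErr': (False, True)}
```

Applied to the two "Control:" lines of the SATA controller, the same helper reports only `DisINTx` as changed. Tokens such as `Count=1/1` are ignored because they do not end in `+` or `-`.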
Comment 6 Lilian Moraru 2016-10-23 23:26:10 UTC
I just realized that the Address is probably NULL because of "Enable-" (I don't know how to interpret these logs)...
Comment 7 The Linux kernel's regression tracker (Thorsten Leemhuis) 2016-10-30 11:38:36 UTC
I wonder if these patches help: https://www.mail-archive.com/linux-kernel@vger.kernel.org/msg1260133.html
Comment 8 Lilian Moraru 2016-10-30 15:51:21 UTC
I compiled Linux kernel 4.9-rc3 with the Ubuntu ".config" (i.e. with almost everything enabled) and without the mentioned patches, and the message no longer appears.
I then installed 4.9-rc3 from Ubuntu Mainline Kernels and the issue is still not reproducible: the message no longer appears, at least in these non-comprehensive tests (previously it happened on every boot).

As for "lspci", I see the same results as on "4.9-rc1":
"Non-Volatile memory controller: Samsung Electronics Co Ltd NVMe SSD Controller (rev 01) (prog-if 02 [NVM Express])":
- 4.9-rc3: DevSta: CorrErr- UncorrErr- FatalErr+ UnsuppReq- AuxPwr+ TransPend-
Still "FatalErr+".

"PCI bridge [0604]: Intel Corporation Sunrise Point-H PCI Express Root Port #5 [8086:a114] (rev f1) (prog-if 00 [Normal decode])" (it seems I confused things last time; it wasn't "Point-H CSME HECI"):
- 4.9-rc3: CESta:  RxErr+ BadTLP- BadDLLP- Rollover- Timeout- NonFatalErr-

"SATA controller [0106]: Intel Corporation Sunrise Point-H SATA controller [AHCI mode] [8086:a102] (rev 31) (prog-if 01 [AHCI 1.0])":
- 4.9-rc3: Control: I/O+ Mem+ BusMaster+ SpecCycle- MemWINV- VGASnoop- ParErr- Stepping- SERR- FastB2B- DisINTx-
- 4.9-rc3: Capabilities: [80] MSI: Enable- Count=1/1 Maskable- 64bit-
                Address: 00000000  Data: 0000



I have another regression (works on 4.8.4 but not on 4.9-rc1) on a build server at work: it freezes completely when it connects to a network (if I disconnect the cable, it works fine until I connect it again).
Sadly, I probably won't be able to report that issue properly, because I have to wait until late at night to run these tests and be able to restart the server...
Comment 9 Lilian Moraru 2016-10-30 15:57:08 UTC
My previous comment makes it sound like disconnecting the cable fixes the issue; it does not.
What I meant is:
1. Disconnecting the cable
+
2. Starting the server
==
works

+
3. Connecting the cable (the server detects a network and connects to it)
==
freezes and does not recover in any way, unless it is forcibly shut down by holding the power button for a while and then started again...
