Bug 180031
Summary: | Possible Linux 4.9 regression: Failed to find cpu0 device node at boot | ||
---|---|---|---|
Product: | Drivers | Reporter: | Lilian Moraru (lilian.moraru90) |
Component: | Other | Assignee: | drivers_other |
Status: | RESOLVED PATCH_ALREADY_AVAILABLE | ||
Severity: | normal | CC: | lilian.moraru90, regressions |
Priority: | P1 | ||
Hardware: | x86-64 | ||
OS: | Linux | ||
Kernel Version: | v4.9-rc1 (1001354ca34179f3db924eb66672442a173147dc) | Subsystem: | |
Regression: | Yes | Bisected commit-id: | |
Attachments: | System Log; System Info; lspci executed on Linux Kernel 4.8.4; lspci executed on Linux Kernel 4.9-rc1 |
Description
Lilian Moraru
2016-10-23 10:44:52 UTC
Created attachment 242241
System Info

FYI: Bug 177681 describes a similar issue.

Created attachment 242421
lspci executed on Linux Kernel 4.8.4

Created attachment 242431
lspci executed on Linux Kernel 4.9-rc1

I don't understand much about these logs (I basically copied the command from the other issue but ran it with `sudo`, which gives more info), but I hope they are helpful. I see interesting differences:

"Non-Volatile memory controller [0108]: Samsung Electronics Co Ltd NVMe SSD Controller [144d:a802] (rev 01) (prog-if 02 [NVM Express])":
- On 4.8.4: DevSta: CorrErr- UncorrErr- FatalErr- UnsuppReq- AuxPwr+ TransPend-
- On 4.9-rc1: DevSta: CorrErr- UncorrErr- FatalErr+ UnsuppReq- AuxPwr+ TransPend-

Observe the "FatalErr+". Btw, the OS boots from this SSD, so it seems to work from this point of view (I am using it right now to write this comment).

Also: "Communication controller [0780]: Intel Corporation Sunrise Point-H CSME HECI #1 [8086:a13a] (rev 31)":
- 4.8.4: CESta: RxErr+ BadTLP- BadDLLP+ Rollover- Timeout- NonFatalErr-
- 4.9-rc1: CESta: RxErr+ BadTLP- BadDLLP- Rollover- Timeout- NonFatalErr-

And: "SATA controller [0106]: Intel Corporation Sunrise Point-H SATA controller [AHCI mode] [8086:a102] (rev 31) (prog-if 01 [AHCI 1.0])":
- 4.8.4: Control: I/O+ Mem+ BusMaster+ SpecCycle- MemWINV- VGASnoop- ParErr- Stepping- SERR- FastB2B- DisINTx+
- 4.9-rc1: Control: I/O+ Mem+ BusMaster+ SpecCycle- MemWINV- VGASnoop- ParErr- Stepping- SERR- FastB2B- DisINTx-
- 4.8.4: Capabilities: [80] MSI: Enable+ Count=1/1 Maskable- 64bit- Address: fee2000c Data: 4122
- 4.9-rc1: Capabilities: [80] MSI: Enable- Count=1/1 Maskable- 64bit- Address: 00000000 Data: 0000

This last one is interesting because it seems like on "4.9-rc1" the "Address" is actually NULL.

Just realized that the Address is NULL probably because of "Enable-" (I don't know how to interpret these logs)...

I wonder if these patches help: https://www.mail-archive.com/linux-kernel@vger.kernel.org/msg1260133.html

I compiled Linux Kernel 4.9-rc3 with the Ubuntu ".config", aka almost everything activated (without the mentioned patches), and the message does not appear any more.
I then installed 4.9-rc3 from Ubuntu Mainline Kernels and the issue is still not reproducible: the message does not appear any more, at least in these non-comprehensive tests (previously it happened every time).

As for "lspci", I see the same results as on "4.9-rc1":

"Non-Volatile memory controller: Samsung Electronics Co Ltd NVMe SSD Controller (rev 01) (prog-if 02 [NVM Express])":
- 4.9-rc3: DevSta: CorrErr- UncorrErr- FatalErr+ UnsuppReq- AuxPwr+ TransPend-

Still "FatalErr+".

"PCI bridge [0604]: Intel Corporation Sunrise Point-H PCI Express Root Port #5 [8086:a114] (rev f1) (prog-if 00 [Normal decode])" (it seems I confused things last time, it wasn't "Point-H CSME HECI"):
- 4.9-rc3: CESta: RxErr+ BadTLP- BadDLLP- Rollover- Timeout- NonFatalErr-

"SATA controller [0106]: Intel Corporation Sunrise Point-H SATA controller [AHCI mode] [8086:a102] (rev 31) (prog-if 01 [AHCI 1.0])":
- 4.9-rc3: Control: I/O+ Mem+ BusMaster+ SpecCycle- MemWINV- VGASnoop- ParErr- Stepping- SERR- FastB2B- DisINTx-
- 4.9-rc3: Capabilities: [80] MSI: Enable- Count=1/1 Maskable- 64bit- Address: 00000000 Data: 0000

I have another regression (works on 4.8.4 but not on 4.9-rc1) on a build server at work, where it freezes completely when it connects to a network (if I disconnect the cable, it works fine until I connect it back). Sadly, I probably won't be able to report the issue properly because I need to wait until late at night to run these tests and be able to restart the server...

It may sound like disconnecting the cable fixes the issue - it does not. What I meant is:
1. Disconnecting the cable
2. Starting the server == works
3. Connecting the cable (the server detects a network and connects to it) == freezes and doesn't recover in any way, unless forcibly shut down by holding the shutdown button for a while and then starting it back...
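
For anyone who wants to repeat the flag-by-flag lspci comparison from the comments above without eyeballing it, here is a minimal Python sketch. It is not part of the original report: the helper names `parse_flags` and `diff_flags` are made up for illustration, and it assumes every flag is a whitespace-separated token ending in `+` or `-`, as in the lines quoted above. It only works on text already copied from lspci output; it does not run lspci itself.

```python
import re

# Matches lspci-style boolean flags such as "FatalErr+" or "DisINTx-".
# The lookahead keeps strings like "4.9-rc1" from being read as a flag.
FLAG_RE = re.compile(r"([A-Za-z0-9/]+)([+-])(?=\s|$)")


def parse_flags(line):
    """Return {flag_name: True/False} for each 'Name+' / 'Name-' token."""
    return {name: sign == "+" for name, sign in FLAG_RE.findall(line)}


def diff_flags(old_line, new_line):
    """Return {flag_name: (old_value, new_value)} for flags that changed."""
    old, new = parse_flags(old_line), parse_flags(new_line)
    return {
        name: (old[name], new[name])
        for name in old
        if name in new and old[name] != new[name]
    }


if __name__ == "__main__":
    # DevSta lines for the NVMe controller, copied from the comments above.
    devsta_4_8_4 = "DevSta: CorrErr- UncorrErr- FatalErr- UnsuppReq- AuxPwr+ TransPend-"
    devsta_4_9_rc1 = "DevSta: CorrErr- UncorrErr- FatalErr+ UnsuppReq- AuxPwr+ TransPend-"
    print(diff_flags(devsta_4_8_4, devsta_4_9_rc1))
    # prints {'FatalErr': (False, True)}
```

The same helper applied to the two SATA controller "Control:" lines would report only `DisINTx` as changed, which matches the manual comparison.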