Bug 209345 - [nouveau] unknown chipset (0f22d0a1) (nVidia Tesla K80)
Summary: [nouveau] unknown chipset (0f22d0a1) (nVidia Tesla K80)
Status: NEW
Alias: None
Product: Drivers
Classification: Unclassified
Component: Video(DRI - non Intel) (show other bugs)
Hardware: x86-64 Linux
: P1 normal
Assignee: drivers_video-dri
URL:
Keywords:
Depends on:
Blocks:
 
Reported: 2020-09-21 23:55 UTC by Alexander von Gluck
Modified: 2021-04-14 15:35 UTC (History)
1 user (show)

See Also:
Kernel Version: 5.8.9
Subsystem:
Regression: No
Bisected commit-id:


Attachments
nvidia thunderbolt 3 attachment (15.31 KB, text/plain)
2020-09-22 00:01 UTC, Alexander von Gluck
Details
tesla k80 patch (1.74 KB, patch)
2020-09-22 04:07 UTC, Alexander von Gluck
Details | Diff

Description Alexander von Gluck 2020-09-21 23:55:14 UTC
Attaching an nVidia Tesla K80 compute card to a Fedora 32 machine.

Getting:
nouveau 0000:08:00.0: unknown chipset (0f22d0a1)
Comment 1 Alexander von Gluck 2020-09-22 00:00:43 UTC
lspci -vvn

08:00.0 0302: 10de:102d (rev ff) (prog-if ff)
	!!! Unknown header type 7f
	Kernel driver in use: nouveau
	Kernel modules: nouveau

09:00.0 0302: 10de:102d (rev ff) (prog-if ff)
	!!! Unknown header type 7f
	Kernel modules: nouveau

08:00.0 3D controller: NVIDIA Corporation GK210GL [Tesla K80] (rev ff)
09:00.0 3D controller: NVIDIA Corporation GK210GL [Tesla K80] (rev ff)
Comment 2 Alexander von Gluck 2020-09-22 00:01:15 UTC
Created attachment 292555 [details]
nvidia thunderbolt 3 attachment

logs from attachment
Comment 3 Ilia Mirkin 2020-09-22 01:14:50 UTC
Is this a physical K80 board, or a virtual one via some sort of cloud provider?

We have nvf0 = GK110, nvf1 = what we call GK110B, but I'm not sure that's an official name - basically the GTX 780 Ti and related Titans.

We don't have explicit support for nvf2 -- as I understand the K80 (GK210) actually has some slight differences, e.g. more shared memory, etc... not sure if that translates into some ctxsw fw differences or if it should just work -- you can check drivers/gpu/drm/nouveau/nvkm/engine/device/base.c, should be easy to add in 0xf2 support based on the nvf1 if you want to play with it.
Comment 4 Alexander von Gluck 2020-09-22 01:54:52 UTC
Physical K80 board in my possession.  They go for cheap now-a-days on ebay :-)

Memory size (GDDR5): 24GB
CUDA cores: 4992
Number Of GPUs: 2x GK120 GPUs


I'll try adding the nvf2 and see what happens.   I have it in a TB3 enclosure plugged into my Dell XPS 13, so it makes testing things pretty easy.
Comment 5 Alexander von Gluck 2020-09-22 04:05:55 UTC
[ 2208.130049] nouveau: version magic '5.8.10 SMP mod_unload ' should be '5.8.10-200.fc32.x86_64 SMP mod_unload '
[ 2460.923164] ACPI Warning: \_SB.PCI0.GFX0._DSM: Argument #4 type mismatch - Found [Buffer], ACPI requires [Package] (20200528/nsarguments-59)
[ 2460.923220] nouveau 0000:08:00.0: can't change power state from D3hot to D0 (config space inaccessible)
[ 2460.923393] nouveau 0000:08:00.0: GPU not supported on big-endian
[ 2460.923411] nouveau: probe of 0000:08:00.0 failed with error -38
[ 2460.923424] nouveau 0000:09:00.0: can't change power state from D3hot to D0 (config space inaccessible)
[ 2460.923504] nouveau 0000:09:00.0: GPU not supported on big-endian
[ 2460.923507] nouveau: probe of 0000:09:00.0 failed with error -38
Comment 6 Alexander von Gluck 2020-09-22 04:07:32 UTC
Created attachment 292557 [details]
tesla k80 patch
Comment 7 Ilia Mirkin 2020-09-22 13:59:16 UTC
[ 2460.923220] nouveau 0000:08:00.0: can't change power state from D3hot to D0 (config space inaccessible)

That's just really bad. My guess is that the "big-endian" notice is just due to a register returning all 0xffffffff (we try to flip the GPU into little-endian mode if we can). Seems like there are issues with the TB enclosure, or something along those lines.

It does seem like you got further earlier to have gotten the "unknown chipset" error, but by the time you were running lspci above, they were gone already (returning all 1's, and PCI is active-low, so that just means it's all off). Don't know what the difference is, I know nothing about those enclosures. I'd try to disable any sort of power management that might be turning the enclosure off.
Comment 8 Alexander von Gluck 2020-09-22 14:14:14 UTC
rebooted without TB3 enclosure attached. Msnuslly loaded nouveau vis insmod after the TB3 attachment calmed down, and got something a bit cleaner:


[  176.083524] nouveau: loading out-of-tree module taints kernel.
[  176.084343] nouveau: module verification failed: signature and/or required key missing - tainting kernel
[  176.124991] ACPI Warning: \_SB.PCI0.GFX0._DSM: Argument #4 type mismatch - Found [Buffer], ACPI requires [Package] (20200528/nsarguments-59)
[  176.125405] nouveau 0000:08:00.0: NVIDIA GK120 (0f22d0a1)
[  176.406057] nouveau 0000:08:00.0: bios: version 80.21.1f.00.01
[  176.537701] nouveau 0000:08:00.0: fb: 11520 MiB GDDR5
[  176.562278] nouveau 0000:08:00.0: bar: one-time init failed, -12
[  176.562522] nouveau 0000:08:00.0: init failed with -12
[  176.562523] nouveau: DRM-master:00000000:00000080: init failed with -12
[  176.562525] nouveau 0000:08:00.0: DRM-master: Device allocation failed: -12
[  176.563099] nouveau: probe of 0000:08:00.0 failed with error -12
[  176.563387] nouveau 0000:09:00.0: NVIDIA GK120 (0f22d0a1)
[  176.842900] nouveau 0000:09:00.0: bios: version 80.21.1f.00.02
[  176.977507] nouveau 0000:09:00.0: fb: 11520 MiB GDDR5
[  177.002138] nouveau 0000:09:00.0: bar: one-time init failed, -12
[  177.002380] nouveau 0000:09:00.0: init failed with -12
[  177.002382] nouveau: DRM-master:00000000:00000080: init failed with -12
[  177.002384] nouveau 0000:09:00.0: DRM-master: Device allocation failed: -12
[  177.003019] nouveau: probe of 0000:09:00.0 failed with error -12


So, each GK120 gets 11.5 GiB to make up that 24GiB of ram.
Comment 9 Alexander von Gluck 2020-09-22 14:15:40 UTC
better lspci:

08:00.0 3D controller: NVIDIA Corporation GK210GL [Tesla K80] (rev a1)
	Subsystem: NVIDIA Corporation Device 106c
	Control: I/O- Mem+ BusMaster- SpecCycle- MemWINV- VGASnoop- ParErr+ Stepping- SERR- FastB2B- DisINTx-
	Status: Cap+ 66MHz- UDF- FastB2B- ParErr- DEVSEL=fast >TAbort- <TAbort- <MAbort- >SERR- <PERR- INTx-
	Interrupt: pin A routed to IRQ 18
	Region 0: Memory at c4000000 (32-bit, non-prefetchable) [size=16M]
	Region 1: Memory at <unassigned> (64-bit, prefetchable)
	Region 3: Memory at a0000000 (64-bit, prefetchable) [size=32M]
	Capabilities: [60] Power Management version 3
		Flags: PMEClk- DSI- D1- D2- AuxCurrent=0mA PME(D0-,D1-,D2-,D3hot-,D3cold-)
		Status: D0 NoSoftRst+ PME-Enable- DSel=0 DScale=0 PME-
	Capabilities: [68] MSI: Enable- Count=1/1 Maskable- 64bit+
		Address: 0000000000000000  Data: 0000
	Capabilities: [78] Express (v2) Endpoint, MSI 00
		DevCap:	MaxPayload 256 bytes, PhantFunc 0, Latency L0s unlimited, L1 <64us
			ExtTag+ AttnBtn- AttnInd- PwrInd- RBE+ FLReset- SlotPowerLimit 25.000W
		DevCtl:	CorrErr- NonFatalErr- FatalErr- UnsupReq-
			RlxdOrd+ ExtTag+ PhantFunc- AuxPwr- NoSnoop+
			MaxPayload 128 bytes, MaxReadReq 512 bytes
		DevSta:	CorrErr- NonFatalErr- FatalErr- UnsupReq- AuxPwr- TransPend-
		LnkCap:	Port #8, Speed 8GT/s, Width x16, ASPM not supported
			ClockPM+ Surprise- LLActRep- BwNot- ASPMOptComp+
		LnkCtl:	ASPM Disabled; RCB 64 bytes Disabled- CommClk-
			ExtSynch- ClockPM- AutWidDis- BWInt- AutBWInt-
		LnkSta:	Speed 2.5GT/s (downgraded), Width x16 (ok)
			TrErr- Train- SlotClk+ DLActive- BWMgmt- ABWMgmt-
		DevCap2: Completion Timeout: Range AB, TimeoutDis+, NROPrPrP-, LTR-
			 10BitTagComp-, 10BitTagReq-, OBFF Not Supported, ExtFmt-, EETLPPrefix-
			 EmergencyPowerReduction Not Supported, EmergencyPowerReductionInit-
			 FRS-, TPHComp-, ExtTPHComp-
			 AtomicOpsCap: 32bit- 64bit- 128bitCAS-
		DevCtl2: Completion Timeout: 50us to 50ms, TimeoutDis-, LTR-, OBFF Disabled
			 AtomicOpsCtl: ReqEn-
		LnkCtl2: Target Link Speed: 8GT/s, EnterCompliance- SpeedDis-
			 Transmit Margin: Normal Operating Range, EnterModifiedCompliance- ComplianceSOS-
			 Compliance De-emphasis: -6dB
		LnkSta2: Current De-emphasis Level: -6dB, EqualizationComplete+, EqualizationPhase1+
			 EqualizationPhase2-, EqualizationPhase3-, LinkEqualizationRequest-
	Capabilities: [100 v1] Virtual Channel
		Caps:	LPEVC=0 RefClk=100ns PATEntryBits=1
		Arb:	Fixed- WRR32- WRR64- WRR128-
		Ctrl:	ArbSelect=Fixed
		Status:	InProgress-
		VC0:	Caps:	PATOffset=00 MaxTimeSlots=1 RejSnoopTrans-
			Arb:	Fixed- WRR32- WRR64- WRR128- TWRR128- WRR256-
			Ctrl:	Enable+ ID=0 ArbSelect=Fixed TC/VC=ff
			Status:	NegoPending- InProgress-
	Capabilities: [128 v1] Power Budgeting <?>
	Capabilities: [420 v2] Advanced Error Reporting
		UESta:	DLP- SDES- TLP- FCP- CmpltTO- CmpltAbrt- UnxCmplt- RxOF- MalfTLP- ECRC- UnsupReq- ACSViol-
		UEMsk:	DLP- SDES- TLP- FCP- CmpltTO- CmpltAbrt- UnxCmplt- RxOF- MalfTLP- ECRC- UnsupReq- ACSViol-
		UESvrt:	DLP+ SDES+ TLP- FCP+ CmpltTO- CmpltAbrt- UnxCmplt- RxOF+ MalfTLP+ ECRC- UnsupReq- ACSViol-
		CESta:	RxErr- BadTLP- BadDLLP- Rollover- Timeout- AdvNonFatalErr+
		CEMsk:	RxErr- BadTLP- BadDLLP- Rollover- Timeout- AdvNonFatalErr+
		AERCap:	First Error Pointer: 00, ECRCGenCap- ECRCGenEn- ECRCChkCap- ECRCChkEn-
			MultHdrRecCap- MultHdrRecEn- TLPPfxPres- HdrLogCap-
		HeaderLog: 00000000 00000000 00000000 00000000
	Capabilities: [600 v1] Vendor Specific Information: ID=0001 Rev=1 Len=024 <?>
	Capabilities: [900 v1] Secondary PCI Express
		LnkCtl3: LnkEquIntrruptEn-, PerformEqu-
		LaneErrStat: 0
	Kernel modules: nouveau

09:00.0 3D controller: NVIDIA Corporation GK210GL [Tesla K80] (rev a1)
	Subsystem: NVIDIA Corporation Device 106c
	Control: I/O- Mem+ BusMaster- SpecCycle- MemWINV- VGASnoop- ParErr+ Stepping- SERR- FastB2B- DisINTx-
	Status: Cap+ 66MHz- UDF- FastB2B- ParErr- DEVSEL=fast >TAbort- <TAbort- <MAbort- >SERR- <PERR- INTx-
	Interrupt: pin A routed to IRQ 18
	Region 0: Memory at c5000000 (32-bit, non-prefetchable) [size=16M]
	Region 1: Memory at <unassigned> (64-bit, prefetchable)
	Region 3: Memory at a4000000 (64-bit, prefetchable) [size=32M]
	Capabilities: [60] Power Management version 3
		Flags: PMEClk- DSI- D1- D2- AuxCurrent=0mA PME(D0-,D1-,D2-,D3hot-,D3cold-)
		Status: D0 NoSoftRst+ PME-Enable- DSel=0 DScale=0 PME-
	Capabilities: [68] MSI: Enable- Count=1/1 Maskable- 64bit+
		Address: 0000000000000000  Data: 0000
	Capabilities: [78] Express (v2) Endpoint, MSI 00
		DevCap:	MaxPayload 256 bytes, PhantFunc 0, Latency L0s unlimited, L1 <64us
			ExtTag+ AttnBtn- AttnInd- PwrInd- RBE+ FLReset- SlotPowerLimit 25.000W
		DevCtl:	CorrErr- NonFatalErr- FatalErr- UnsupReq-
			RlxdOrd+ ExtTag+ PhantFunc- AuxPwr- NoSnoop+
			MaxPayload 128 bytes, MaxReadReq 512 bytes
		DevSta:	CorrErr- NonFatalErr- FatalErr- UnsupReq- AuxPwr- TransPend-
		LnkCap:	Port #16, Speed 8GT/s, Width x16, ASPM not supported
			ClockPM+ Surprise- LLActRep- BwNot- ASPMOptComp+
		LnkCtl:	ASPM Disabled; RCB 64 bytes Disabled- CommClk-
			ExtSynch- ClockPM- AutWidDis- BWInt- AutBWInt-
		LnkSta:	Speed 2.5GT/s (downgraded), Width x16 (ok)
			TrErr- Train- SlotClk+ DLActive- BWMgmt- ABWMgmt-
		DevCap2: Completion Timeout: Range AB, TimeoutDis+, NROPrPrP-, LTR-
			 10BitTagComp-, 10BitTagReq-, OBFF Not Supported, ExtFmt-, EETLPPrefix-
			 EmergencyPowerReduction Not Supported, EmergencyPowerReductionInit-
			 FRS-, TPHComp-, ExtTPHComp-
			 AtomicOpsCap: 32bit- 64bit- 128bitCAS-
		DevCtl2: Completion Timeout: 50us to 50ms, TimeoutDis-, LTR-, OBFF Disabled
			 AtomicOpsCtl: ReqEn-
		LnkCtl2: Target Link Speed: 8GT/s, EnterCompliance- SpeedDis-
			 Transmit Margin: Normal Operating Range, EnterModifiedCompliance- ComplianceSOS-
			 Compliance De-emphasis: -6dB
		LnkSta2: Current De-emphasis Level: -6dB, EqualizationComplete+, EqualizationPhase1+
			 EqualizationPhase2-, EqualizationPhase3-, LinkEqualizationRequest-
	Capabilities: [100 v1] Virtual Channel
		Caps:	LPEVC=0 RefClk=100ns PATEntryBits=1
		Arb:	Fixed- WRR32- WRR64- WRR128-
		Ctrl:	ArbSelect=Fixed
		Status:	InProgress-
		VC0:	Caps:	PATOffset=00 MaxTimeSlots=1 RejSnoopTrans-
			Arb:	Fixed- WRR32- WRR64- WRR128- TWRR128- WRR256-
			Ctrl:	Enable+ ID=0 ArbSelect=Fixed TC/VC=ff
			Status:	NegoPending- InProgress-
	Capabilities: [128 v1] Power Budgeting <?>
	Capabilities: [420 v2] Advanced Error Reporting
		UESta:	DLP- SDES- TLP- FCP- CmpltTO- CmpltAbrt- UnxCmplt- RxOF- MalfTLP- ECRC- UnsupReq- ACSViol-
		UEMsk:	DLP- SDES- TLP- FCP- CmpltTO- CmpltAbrt- UnxCmplt- RxOF- MalfTLP- ECRC- UnsupReq- ACSViol-
		UESvrt:	DLP+ SDES+ TLP- FCP+ CmpltTO- CmpltAbrt- UnxCmplt- RxOF+ MalfTLP+ ECRC- UnsupReq- ACSViol-
		CESta:	RxErr- BadTLP- BadDLLP- Rollover- Timeout- AdvNonFatalErr+
		CEMsk:	RxErr- BadTLP- BadDLLP- Rollover- Timeout- AdvNonFatalErr+
		AERCap:	First Error Pointer: 00, ECRCGenCap- ECRCGenEn- ECRCChkCap- ECRCChkEn-
			MultHdrRecCap- MultHdrRecEn- TLPPfxPres- HdrLogCap-
		HeaderLog: 00000000 00000000 00000000 00000000
	Capabilities: [600 v1] Vendor Specific Information: ID=0001 Rev=1 Len=024 <?>
	Capabilities: [900 v1] Secondary PCI Express
		LnkCtl3: LnkEquIntrruptEn-, PerformEqu-
		LaneErrStat: 0
	Kernel modules: nouveau
Comment 10 Ilia Mirkin 2020-09-22 14:18:53 UTC
[  176.562278] nouveau 0000:08:00.0: bar: one-time init failed, -12

08:00.0 3D controller: NVIDIA Corporation GK210GL [Tesla K80] (rev a1)
	Subsystem: NVIDIA Corporation Device 106c
	Control: I/O- Mem+ BusMaster- SpecCycle- MemWINV- VGASnoop- ParErr+ Stepping- SERR- FastB2B- DisINTx-
	Status: Cap+ 66MHz- UDF- FastB2B- ParErr- DEVSEL=fast >TAbort- <TAbort- <MAbort- >SERR- <PERR- INTx-
	Interrupt: pin A routed to IRQ 18
	Region 0: Memory at c4000000 (32-bit, non-prefetchable) [size=16M]
	Region 1: Memory at <unassigned> (64-bit, prefetchable)
	Region 3: Memory at a0000000 (64-bit, prefetchable) [size=32M]

That's not good. BAR1 is unassigned. We want BAR1. This is fallout from the TB enclosure. I know nothing about this stuff... there are various memory windows, etc. And apparently we don't fit in the window. I'm guessing there are errors further up about how there's not enough space to assign those BAR's.
Comment 11 Alexander von Gluck 2020-09-22 14:23:53 UTC
weird... let me move things over to my Ryzen desktop and see what changes.
Comment 12 Alexander von Gluck 2021-04-14 15:00:43 UTC
A new motherboard later.. and after enabling 64-bit PCIe stuff the card posts.


ArchLinux 5.11.13

[    4.689213] nouveau 0000:0d:00.0: enabling device (0000 -> 0002)
[    4.689343] nouveau 0000:0d:00.0: unknown chipset (0f22d0a1)
[    4.690686] nouveau 0000:0e:00.0: enabling device (0000 -> 0002)
[    4.690758] nouveau 0000:0e:00.0: unknown chipset (0f22d0a1)




0d:00.0 3D controller: NVIDIA Corporation GK210GL [Tesla K80] (rev a1)
	Subsystem: NVIDIA Corporation Device 106c
	Control: I/O- Mem+ BusMaster- SpecCycle- MemWINV- VGASnoop- ParErr- Stepping- SERR- FastB2B- DisINTx-
	Status: Cap+ 66MHz- UDF- FastB2B- ParErr- DEVSEL=fast >TAbort- <TAbort- <MAbort- >SERR- <PERR- INTx-
	Interrupt: pin A routed to IRQ 44
	IOMMU group: 21
	Region 0: Memory at fb000000 (32-bit, non-prefetchable) [size=16M]
	Region 1: Memory at 7800000000 (64-bit, prefetchable) [size=16G]
	Region 3: Memory at 7c00000000 (64-bit, prefetchable) [size=32M]
	Capabilities: [60] Power Management version 3
		Flags: PMEClk- DSI- D1- D2- AuxCurrent=0mA PME(D0-,D1-,D2-,D3hot-,D3cold-)
		Status: D0 NoSoftRst+ PME-Enable- DSel=0 DScale=0 PME-
	Capabilities: [68] MSI: Enable- Count=1/1 Maskable- 64bit+
		Address: 0000000000000000  Data: 0000
	Capabilities: [78] Express (v2) Endpoint, MSI 00
		DevCap:	MaxPayload 256 bytes, PhantFunc 0, Latency L0s unlimited, L1 <64us
			ExtTag+ AttnBtn- AttnInd- PwrInd- RBE+ FLReset- SlotPowerLimit 25.000W
		DevCtl:	CorrErr- NonFatalErr- FatalErr- UnsupReq-
			RlxdOrd+ ExtTag+ PhantFunc- AuxPwr- NoSnoop+
			MaxPayload 256 bytes, MaxReadReq 512 bytes
		DevSta:	CorrErr- NonFatalErr- FatalErr- UnsupReq- AuxPwr- TransPend-
		LnkCap:	Port #8, Speed 8GT/s, Width x16, ASPM not supported
			ClockPM+ Surprise- LLActRep- BwNot- ASPMOptComp+
		LnkCtl:	ASPM Disabled; RCB 64 bytes, Disabled- CommClk-
			ExtSynch- ClockPM- AutWidDis- BWInt- AutBWInt-
		LnkSta:	Speed 8GT/s (ok), Width x16 (ok)
			TrErr- Train- SlotClk+ DLActive- BWMgmt- ABWMgmt-
		DevCap2: Completion Timeout: Range AB, TimeoutDis+ NROPrPrP- LTR-
			 10BitTagComp- 10BitTagReq- OBFF Not Supported, ExtFmt- EETLPPrefix-
			 EmergencyPowerReduction Not Supported, EmergencyPowerReductionInit-
			 FRS- TPHComp- ExtTPHComp-
			 AtomicOpsCap: 32bit- 64bit- 128bitCAS-
		DevCtl2: Completion Timeout: 50us to 50ms, TimeoutDis- LTR- OBFF Disabled,
			 AtomicOpsCtl: ReqEn-
		LnkCap2: Supported Link Speeds: 2.5-8GT/s, Crosslink- Retimer- 2Retimers- DRS-
		LnkCtl2: Target Link Speed: 8GT/s, EnterCompliance- SpeedDis-
			 Transmit Margin: Normal Operating Range, EnterModifiedCompliance- ComplianceSOS-
			 Compliance De-emphasis: -6dB
		LnkSta2: Current De-emphasis Level: -6dB, EqualizationComplete+ EqualizationPhase1+
			 EqualizationPhase2- EqualizationPhase3- LinkEqualizationRequest-
			 Retimer- 2Retimers- CrosslinkRes: unsupported
	Capabilities: [100 v1] Virtual Channel
		Caps:	LPEVC=0 RefClk=100ns PATEntryBits=1
		Arb:	Fixed- WRR32- WRR64- WRR128-
		Ctrl:	ArbSelect=Fixed
		Status:	InProgress-
		VC0:	Caps:	PATOffset=00 MaxTimeSlots=1 RejSnoopTrans-
			Arb:	Fixed- WRR32- WRR64- WRR128- TWRR128- WRR256-
			Ctrl:	Enable+ ID=0 ArbSelect=Fixed TC/VC=01
			Status:	NegoPending- InProgress-
	Capabilities: [128 v1] Power Budgeting <?>
	Capabilities: [420 v2] Advanced Error Reporting
		UESta:	DLP- SDES- TLP- FCP- CmpltTO- CmpltAbrt- UnxCmplt- RxOF- MalfTLP- ECRC- UnsupReq- ACSViol-
		UEMsk:	DLP- SDES- TLP- FCP- CmpltTO- CmpltAbrt- UnxCmplt- RxOF- MalfTLP- ECRC- UnsupReq- ACSViol-
		UESvrt:	DLP+ SDES+ TLP- FCP+ CmpltTO- CmpltAbrt- UnxCmplt- RxOF+ MalfTLP+ ECRC- UnsupReq- ACSViol-
		CESta:	RxErr- BadTLP- BadDLLP- Rollover- Timeout- AdvNonFatalErr-
		CEMsk:	RxErr- BadTLP- BadDLLP- Rollover- Timeout- AdvNonFatalErr+
		AERCap:	First Error Pointer: 00, ECRCGenCap- ECRCGenEn- ECRCChkCap- ECRCChkEn-
			MultHdrRecCap- MultHdrRecEn- TLPPfxPres- HdrLogCap-
		HeaderLog: 00000000 00000000 00000000 00000000
	Capabilities: [600 v1] Vendor Specific Information: ID=0001 Rev=1 Len=024 <?>
	Capabilities: [900 v1] Secondary PCI Express
		LnkCtl3: LnkEquIntrruptEn- PerformEqu-
		LaneErrStat: 0
	Kernel modules: nouveau

0e:00.0 3D controller: NVIDIA Corporation GK210GL [Tesla K80] (rev a1)
	Subsystem: NVIDIA Corporation Device 106c
	Control: I/O- Mem+ BusMaster- SpecCycle- MemWINV- VGASnoop- ParErr- Stepping- SERR- FastB2B- DisINTx-
	Status: Cap+ 66MHz- UDF- FastB2B- ParErr- DEVSEL=fast >TAbort- <TAbort- <MAbort- >SERR- <PERR- INTx-
	Interrupt: pin A routed to IRQ 44
	IOMMU group: 22
	Region 0: Memory at fa000000 (32-bit, non-prefetchable) [size=16M]
	Region 1: Memory at 7000000000 (64-bit, prefetchable) [size=16G]
	Region 3: Memory at 7400000000 (64-bit, prefetchable) [size=32M]
	Capabilities: [60] Power Management version 3
		Flags: PMEClk- DSI- D1- D2- AuxCurrent=0mA PME(D0-,D1-,D2-,D3hot-,D3cold-)
		Status: D0 NoSoftRst+ PME-Enable- DSel=0 DScale=0 PME-
	Capabilities: [68] MSI: Enable- Count=1/1 Maskable- 64bit+
		Address: 0000000000000000  Data: 0000
	Capabilities: [78] Express (v2) Endpoint, MSI 00
		DevCap:	MaxPayload 256 bytes, PhantFunc 0, Latency L0s unlimited, L1 <64us
			ExtTag+ AttnBtn- AttnInd- PwrInd- RBE+ FLReset- SlotPowerLimit 25.000W
		DevCtl:	CorrErr- NonFatalErr- FatalErr- UnsupReq-
			RlxdOrd+ ExtTag+ PhantFunc- AuxPwr- NoSnoop+
			MaxPayload 256 bytes, MaxReadReq 512 bytes
		DevSta:	CorrErr- NonFatalErr- FatalErr- UnsupReq- AuxPwr- TransPend-
		LnkCap:	Port #16, Speed 8GT/s, Width x16, ASPM not supported
			ClockPM+ Surprise- LLActRep- BwNot- ASPMOptComp+
		LnkCtl:	ASPM Disabled; RCB 64 bytes, Disabled- CommClk-
			ExtSynch- ClockPM- AutWidDis- BWInt- AutBWInt-
		LnkSta:	Speed 8GT/s (ok), Width x16 (ok)
			TrErr- Train- SlotClk+ DLActive- BWMgmt- ABWMgmt-
		DevCap2: Completion Timeout: Range AB, TimeoutDis+ NROPrPrP- LTR-
			 10BitTagComp- 10BitTagReq- OBFF Not Supported, ExtFmt- EETLPPrefix-
			 EmergencyPowerReduction Not Supported, EmergencyPowerReductionInit-
			 FRS- TPHComp- ExtTPHComp-
			 AtomicOpsCap: 32bit- 64bit- 128bitCAS-
		DevCtl2: Completion Timeout: 50us to 50ms, TimeoutDis- LTR- OBFF Disabled,
			 AtomicOpsCtl: ReqEn-
		LnkCap2: Supported Link Speeds: 2.5-8GT/s, Crosslink- Retimer- 2Retimers- DRS-
		LnkCtl2: Target Link Speed: 8GT/s, EnterCompliance- SpeedDis-
			 Transmit Margin: Normal Operating Range, EnterModifiedCompliance- ComplianceSOS-
			 Compliance De-emphasis: -6dB
		LnkSta2: Current De-emphasis Level: -6dB, EqualizationComplete+ EqualizationPhase1+
			 EqualizationPhase2- EqualizationPhase3- LinkEqualizationRequest-
			 Retimer- 2Retimers- CrosslinkRes: unsupported
	Capabilities: [100 v1] Virtual Channel
		Caps:	LPEVC=0 RefClk=100ns PATEntryBits=1
		Arb:	Fixed- WRR32- WRR64- WRR128-
		Ctrl:	ArbSelect=Fixed
		Status:	InProgress-
		VC0:	Caps:	PATOffset=00 MaxTimeSlots=1 RejSnoopTrans-
			Arb:	Fixed- WRR32- WRR64- WRR128- TWRR128- WRR256-
			Ctrl:	Enable+ ID=0 ArbSelect=Fixed TC/VC=01
			Status:	NegoPending- InProgress-
	Capabilities: [128 v1] Power Budgeting <?>
	Capabilities: [420 v2] Advanced Error Reporting
		UESta:	DLP- SDES- TLP- FCP- CmpltTO- CmpltAbrt- UnxCmplt- RxOF- MalfTLP- ECRC- UnsupReq- ACSViol-
		UEMsk:	DLP- SDES- TLP- FCP- CmpltTO- CmpltAbrt- UnxCmplt- RxOF- MalfTLP- ECRC- UnsupReq- ACSViol-
		UESvrt:	DLP+ SDES+ TLP- FCP+ CmpltTO- CmpltAbrt- UnxCmplt- RxOF+ MalfTLP+ ECRC- UnsupReq- ACSViol-
		CESta:	RxErr- BadTLP- BadDLLP- Rollover- Timeout- AdvNonFatalErr-
		CEMsk:	RxErr- BadTLP- BadDLLP- Rollover- Timeout- AdvNonFatalErr+
		AERCap:	First Error Pointer: 00, ECRCGenCap- ECRCGenEn- ECRCChkCap- ECRCChkEn-
			MultHdrRecCap- MultHdrRecEn- TLPPfxPres- HdrLogCap-
		HeaderLog: 00000000 00000000 00000000 00000000
	Capabilities: [600 v1] Vendor Specific Information: ID=0001 Rev=1 Len=024 <?>
	Capabilities: [900 v1] Secondary PCI Express
		LnkCtl3: LnkEquIntrruptEn- PerformEqu-
		LaneErrStat: 0
	Kernel modules: nouveau
Comment 13 Ilia Mirkin 2021-04-14 15:08:47 UTC
See comment #3 - it explains what you need to copy in nouveau to try to load it.
Comment 14 Ilia Mirkin 2021-04-14 15:10:35 UTC
Also, wow, BAR1 = 16GB?? Normally it's like 256MB. No wonder your TB setup had issues.
Comment 15 Alexander von Gluck 2021-04-14 15:35:47 UTC
Applied my patch above to ArchLinux (5.11.13-arch1-1) and gave it a whirl.  Got a little information from nouveou before the system hard locks up.

nouveau 0000:0d:00.0: enabling device (0000 -> 0002)
nouveau 0000:0d:00.0: NVIDIA GK120 (0f22d0a1)
nouveau 0000:0d:00.0: bios: version 80.21.1f.00.01
nouveau 0000:0d:00.0: fb: 11520 MiB GDDR5

(hard crash)

I might get more information from serial... however, ran into an unrelated issue. Cooling!

The Tesla K80 got up to 175F+ at idle and I had to shut things down. Need to rig some better cooling solution.

Note You need to log in before you can comment on or make changes to this bug.