Bug 212443 - Random hang at shutdown/poweroff - Sun Microsystems Ultra 24 Workstation (Ursa)
Summary: Random hang at shutdown/poweroff - Sun Microsystems Ultra 24 Workstation (Ursa)
Status: CLOSED INSUFFICIENT_DATA
Alias: None
Product: ACPI
Classification: Unclassified
Component: BIOS
Hardware: Intel Linux
Importance: P1 high
Assignee: Zhang Rui
URL:
Keywords:
Depends on:
Blocks:
 
Reported: 2021-03-25 17:21 UTC by Julius Henry Marx
Modified: 2022-06-21 07:40 UTC

See Also:
Kernel Version: Linux devuan 4.19.0-14-amd64 #1 SMP Debian 4.19.171-2 (2021-01-30) x86_64 GNU/Linux
Subsystem:
Regression: No
Bisected commit-id:


Attachments
screenshot (59.05 KB, image/jpeg)
2021-03-25 17:22 UTC, Julius Henry Marx
ultra24_dmesg (61.94 KB, text/plain)
2021-03-25 17:23 UTC, Julius Henry Marx
ultra24_acpidump (231.42 KB, text/plain)
2021-03-25 17:23 UTC, Julius Henry Marx
ultra24_syslog.txt (818 bytes, text/plain)
2021-03-25 17:24 UTC, Julius Henry Marx

Description Julius Henry Marx 2021-03-25 17:21:12 UTC
I have a 'bad shutdown' problem with a Sun Microsystems Ultra 24 WS.

The workstation has an Intel Core2 Q9550 CPU, 8 GB of RAM and BIOS version 1.56, the last one released by the OEM; no further updates, upgrades or support are available.

Sun Microsystems has been dead for years now and Oracle lists it as EOL.

It runs on Devuan Beowulf 3.1.0.
ie: Linux devuan 4.19.0-14-amd64 #1 SMP Debian 4.19.171-2 (2021-01-30) x86_64 GNU/Linux

I have modified the BIOS's original DSDT using information gathered from MS-based utilities that are used to patch BIOSes so that OS X can run on non-Apple hardware.

As a result, recompilation now reports 0 errors (down from the original 30) and only 9 warnings.

---
Intel ACPI Component Architecture
ASL+ Optimizing Compiler/Disassembler version 20181213
Copyright (c) 2000 - 2018 Intel Corporation

ASL Input:     dsdt.dsl - 6179 lines, 198990 bytes, 2846 keywords
AML Output:    dsdt.aml - 22065 bytes, 751 named objects, 2095 executable opcodes

Compilation complete. 0 Errors, 9 Warnings, 39 Remarks, 65 Optimizations
groucho@devuan:~/dsdt/test$
---

That is the DSDT I am using now, injected at boot time via `/boot/acpi_override` in `grub`, but the problem persists.
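
For anyone wanting to reproduce this kind of setup, building the override archive can be sketched roughly as follows. This is a minimal sketch that follows the kernel's documented initrd table override method, not necessarily the exact steps used on this machine. The function name `build_acpi_override` and its two arguments are illustrative, and it assumes that `cpio` is installed and that the kernel was built with ACPI table upgrade support (CONFIG_ACPI_TABLE_UPGRADE on recent kernels).

```shell
#!/bin/sh
# Sketch only: pack a compiled DSDT into an uncompressed cpio archive
# laid out as kernel/firmware/acpi/dsdt.aml, which is where the kernel
# looks for override tables in an early initrd.
# build_acpi_override is a hypothetical helper name.
#   $1 = path to the compiled table (e.g. dsdt.aml produced by iasl)
#   $2 = path of the archive to create (e.g. /boot/acpi_override)
build_acpi_override() {
    aml="$1"
    out="$2"
    work=$(mktemp -d) || return 1
    mkdir -p "$work/kernel/firmware/acpi" &&
        cp "$aml" "$work/kernel/firmware/acpi/dsdt.aml" &&
        ( cd "$work" && find kernel | cpio -H newc --create ) > "$out"
    status=$?
    rm -rf "$work"
    return "$status"
}
```

The resulting archive has to be loaded ahead of the regular initrd, for example with a GRUB line of the form `initrd /boot/acpi_override /boot/initrd.img-<version>`, where `<version>` stands for the installed kernel version.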

DESCRIPTION

On shutdown the box will do one of two things:

1. shut down properly -> 99.9% of the time
2. freeze during the shutdown with the screen showing this output:

[code]
e1000e: EEE Tx LPI Timer
Preparing to enter sleep state S5
Reboot: Power Down
[/code]

... with the fans blowing at full speed.

The keyboard becomes unresponsive and the only way out is a hard shut down via the PWR button.

Whenever it occurs, the screen output before the freeze is the same.

See attached file `scrshot_U24.jpg`.

Unfortunately, I have not been able to reproduce it on demand or link it to anything in particular.

It just happens once every so many shutdowns; the interval can be anywhere from 3 to 20, with no discernible pattern.

None of the /var/log files (syslog, kern.log, faillog, messages) show anything relevant, which is no surprise, as all filesystems are already read-only (RO) at that point.

See attached file `ultra24_syslog.txt`. 

I have tried a wide variety of kernel boot parameters, as well as shutdown scripts that unload the e1000e driver module before powering off or that use /proc/sysrq-trigger instead of the usual commands.

Upon reading about similar issues, I also tried blacklisting the acpi_cpufreq module, to no avail.

This problem has happened across many kernel versions and also seems to be distribution agnostic, from a skeleton TCLinux installation on a USB stick to Ubuntu, Mint, PCLinuxOS and now Devuan: all distributions have suffered from the same problem.

Attaching files:

scrshot_U24.jpg
ultra24_acpidump.txt
ultra24_dmesg.txt
ultra24_syslog.txt

Please advise or ask for additional information if required.

Thanks in advance.

JHM
Comment 1 Julius Henry Marx 2021-03-25 17:22:02 UTC
Created attachment 296063 [details]
screenshot
Comment 2 Julius Henry Marx 2021-03-25 17:23:11 UTC
Created attachment 296065 [details]
ultra24_dmesg
Comment 3 Julius Henry Marx 2021-03-25 17:23:47 UTC
Created attachment 296067 [details]
ultra24_acpidump
Comment 4 Julius Henry Marx 2021-03-25 17:24:12 UTC
Created attachment 296069 [details]
ultra24_syslog.txt
Comment 5 Julius Henry Marx 2021-03-27 11:59:23 UTC
This bug was originally reported 2019-03-14 14:45:06 UTC here:

https://bugzilla.kernel.org/show_bug.cgi?id=201965
Comment 6 Len Brown 2021-06-24 14:19:35 UTC
Julius,
Please try this experiment:
offline all cpus except CPU0 before shutdown.

The theory is that the SMM code responsible for the actual poweroff
may have a bug if invoked from a CPU other than the BSP.
Comment 7 Julius Henry Marx 2021-06-24 17:01:13 UTC
Hello Len:

Thanks for taking an interest in this but I'm afraid you give me too much credit.

> ... offline all cpus except CPU0 before shutdown.
I don't have a clue as to how to do that.

Since my last post, I did away with the modified DSDT file: setting it up again every time the kernel is updated was too much of a hassle, and it did not solve the problem.

I have also tried shutting down the box with a modified script in an icon on the desktop panel:

[code]
~$ cat /usr/bin/shutdown.sh
#!/bin/sh
# added to shutdown directly - no shutdown helper 
# options added to troubleshoot nic related bad shutdown 
PATH=/sbin:/bin:/usr/sbin:/usr/bin:

# 1
# shutdown system directly 
# sudo shutdown -h now

# 2
# sync
# disable onboard eth wol
# shutdown system directly 
# sync && sudo ethtool -s eth0 wol d && sudo shutdown -h now

# 3
# sync
# remove e1000e module
# shutdown system directly 
# sync && sudo rmmod -s -v e1000e && sudo shutdown -h now

# 4
# sync
# disable onboard eth wol
# remove e1000e module
# shutdown system directly 
sync && sudo ethtool -s eth0 wol d && sudo rmmod -s -v e1000e && sudo shutdown -h now
~$ 
[/code]

I have tried all the options you see there and settled on #4.

As you can see, this script syncs, disables wol and removes the e1000e driver before shutting down.

To no avail. 

'Every so often' I get what I have come to call a bad shutdown.

In the years this has been going on, the only thing I can link this 'every so often' to is changes in room temperature, e.g. from the end of summer to the beginning of autumn.

One thing I neglected to add is this:

In this BIOS version (1.56, the last available) it is *impossible* for me to:

1. disable the onboard Intel GbE controller in the BIOS: it is greyed out.
2. disable the ME "Firmware Power Control"; the only change I can make is setting "Host Sleep States" ON to S0 or S3 (only two options).

Attempting to disable ME "Firmware Power Control" renders the box unusable.
i.e. totally unresponsive to any keyboard input, with both case and CPU fans blowing at 100%, as if the sensors (wherever they are) were seeing a high temperature inside the box or at the CPU.

The *only* way out of that nightmare is a hard shut-down and the *only* way to get it to boot properly again involves clearing the CMOS and a reflash of the ME BIOS.

I have the feeling that this is a severe hardware issue which pops up in boxes running non-MS OSs.

Since my last post I found this web page from 2007 (when the U24 was released), which I could only access as cached content (i.e. it is no longer available online).

[url]https://webcache.googleusercontent.com/search?q=cache:1n3s2V4blzYJ:https://community.oracle.com/tech/apps-infra/discussion/1907211/ultra-24-and-intel-boot-agent+&cd=3&hl=en&ct=clnk&gl=us[/url]

From the text I gather that these issues were present from the start.

In any case, if you let me know how to run the test you suggest, I'll be more than willing to do it and report back.

Thanks in advance,

JHM
Comment 8 Zhang Rui 2021-07-01 02:21:13 UTC
(In reply to Julius Henry Marx from comment #7)
> Hello Len:
> 
> Thanks for taking an interest in this but I'm afraid you give me too much
> credit.
> 
> > ... offline all cpus except CPU0 before shutdown.
> I don't have a clue as to how to do that.
> 

You can do it this way:
for every cpuX under /sys/devices/system/cpu except cpu0, run `echo 0 > cpuX/online` as root.
This offlines all the CPUs except CPU0.
Then you can check whether the problem still exists.
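
The loop described above can be sketched as a small shell function. This is a minimal sketch under stated assumptions, not a tested fix. The function name `offline_secondary_cpus` and its optional directory argument are illustrative; the argument exists only so that the loop can be tried against a scratch directory first. Writing to the real sysfs files requires root.

```shell
#!/bin/sh
# Sketch only: take every CPU except CPU0 offline through sysfs.
# offline_secondary_cpus is a hypothetical helper name.
#   $1 = CPU directory (optional, defaults to /sys/devices/system/cpu)
offline_secondary_cpus() {
    cpu_root="${1:-/sys/devices/system/cpu}"
    rc=0
    for f in "$cpu_root"/cpu[0-9]*/online; do
        # Skip if the glob matched nothing or the file is missing
        # (on many systems cpu0 has no "online" file at all).
        [ -e "$f" ] || continue
        # Leave the boot CPU alone.
        if [ "$f" = "$cpu_root/cpu0/online" ]; then
            continue
        fi
        # Writing 0 asks the kernel to take this CPU offline.
        echo 0 > "$f" || rc=1
    done
    return "$rc"
}
```

Run as root just before shutting down, `offline_secondary_cpus && cat /sys/devices/system/cpu/online` should then report only CPU 0 as online, after which `shutdown -h now` can be issued as usual.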
Comment 9 Zhang Rui 2022-06-21 07:40:35 UTC
Bug closed as there is no response.
Please feel free to reopen it if you can provide the info required.
