Bug 2839

Summary: java segfaults with 2.6.[567] x86_64 but not with 2.4.27-pre5
Product: Platform Specific/Hardware
Reporter: Marc Heckmann (mh)
Component: x86-64
Assignee: Andi Kleen (andi-bz)
Status: CLOSED PATCH_ALREADY_AVAILABLE
Severity: high
CC: jk, mh
Priority: P2
Hardware: i386
OS: Linux
Kernel Version: 2.6.7-rc2-bk6 + x86_64 bugfixes patch from bkbits.net
Tree: Mainline
Regression: ---
Attachments: Fix for segfault logging in 2.6.6

Description Marc Heckmann 2004-06-06 01:21:45 UTC
Distribution: Fedora Core 2
Hardware Environment: Shuttle SN85G4, NFORCE3, Athlon 64 2800
Software Environment: java sdk from java.sun.com or blackdown.org, 32 bit or 64
bit versions.

Problem Description:
                                                                                
The java binaries (sdk) from blackdown.org or java.sun.com, 64 or 32 bit,
segfault on me using the 2.6.[567] kernels (haven't tried any older 2.6.x)
but not with 2.4.27-pre5. If I understand the code in arch/x86_64/mm/fault.c
correctly, 2.4 would produce the same messages if java was segfaulting too.

As a consequence, my java webapps do not function under 2.6.x.

Below is an excerpt of the messages I get:

java[3028]: segfault at 0000000000000000 rip 0000000057a1d2d5 rsp 00000000ffff9f30 error 4
java[3028]: segfault at 0000000000000000 rip 0000000057a1d2d5 rsp 00000000ffff9f5c error 4
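
For readers unfamiliar with these messages: the trailing "error 4" is the x86 page-fault error code. A minimal sketch of how its low three bits can be decoded follows. The bit meanings are the hardware-defined ones; the struct and function names are hypothetical and are not taken from the kernel source.

```c
#include <stdio.h>

/* Low three bits of the x86 page-fault error code (hardware-defined). */
#define FAULT_BIT_PROTECTION 0x1UL /* 0: page not present, 1: protection violation */
#define FAULT_BIT_WRITE      0x2UL /* 0: read access,      1: write access */
#define FAULT_BIT_USER       0x4UL /* 0: kernel mode,      1: user mode */

/* Hypothetical helper type for illustration only. */
struct fault_info {
    int protection; /* 1 if the page was present but access was denied */
    int write;      /* 1 if the faulting access was a write */
    int user_mode;  /* 1 if the fault happened in user mode */
};

/* Split an error code into its three basic flags. */
struct fault_info decode_fault(unsigned long error_code)
{
    struct fault_info info;
    info.protection = (error_code & FAULT_BIT_PROTECTION) != 0;
    info.write      = (error_code & FAULT_BIT_WRITE) != 0;
    info.user_mode  = (error_code & FAULT_BIT_USER) != 0;
    return info;
}

/* Print a one-line description of an error code. */
void print_fault(unsigned long error_code)
{
    struct fault_info info = decode_fault(error_code);
    printf("error %lu: %s-mode %s, %s\n", error_code,
           info.user_mode ? "user" : "kernel",
           info.write ? "write" : "read",
           info.protection ? "protection violation" : "page not present");
}
```

Read this way, "error 4" with a fault address of 0 is a user-mode read of an unmapped page at address zero, which is what a null pointer dereference looks like.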

Steps to reproduce:

1. download java sdk (http://java.sun.com) and jakarta-tomcat
(http://jakarta.apache.org/tomcat)
2. export JAVA_HOME=/path/to/java/sdk/install
3. cd /path/to/tomcat/jakarta-tomcat-4.1.30
4. "./bin/startup.sh" and watch dmesg/syslog for the messages.

This is 100% reproducible for me.

-m
Comment 1 Andi Kleen 2004-06-06 04:35:16 UTC
At least simple java programs work fine for me on 2.6 with the 32bit 1.4.2 JDK
from Sun. Does a java hello world also crash for you?

Sorry, I have no time to debug tomcat; someone who knows more about it
has to do that. Have you contacted Sun or the tomcat maintainers about it?
Comment 2 Johnny Elliott 2004-06-09 13:33:55 UTC
This is information that we received from Sun in regards to this problem:

Here is a comment from our HotSpot engineer:

  That's a kernel issue. They should not log SIGSEGV if the signal
  is handled by user application. JVM uses SIGSEGV and several
  other signals (e.g. for implicit null check, safepoint polling, etc.)
  What is the OS version? Redhat used to have this problem in one of
  their beta releases, we talked to them and it's fixed in FCS. Someone
  should log a bug with the Linux vendor.

  As to performance, yes, excessive logging could have a performance
  impact, because it is disk I/O. But usually the segfault happens
  infrequently so the impact is negligible and it's mainly a cosmetic
  thing. Depending on the number of threads, if they get hundreds of
  segfaults in a second, that could imply a problem in the Java app
  (e.g. deref null pointer in inner loop) or in JVM.

I have passed your OS information to them. I will keep you posted if
I get any more information.

Regards,
-- 
Ingrid Yao
CAP program Developer Support Engineer
Java Web Service
Sun Microsystems
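
The pattern the HotSpot engineer describes, a process that triggers SIGSEGV on purpose and recovers in its own handler, can be illustrated with a small sketch. This is not JVM code. It assumes Linux, and all names (recover_point, on_segv, recover_from_fault) are made up for illustration. It maps an inaccessible page, reads from it, and resumes through siglongjmp.

```c
#define _DEFAULT_SOURCE /* for MAP_ANONYMOUS under strict compiler modes */
#include <setjmp.h>
#include <signal.h>
#include <string.h>
#include <sys/mman.h>

/* Hypothetical names; this is an illustration, not JVM code. */
static sigjmp_buf recover_point;

/* SIGSEGV handler: jump back to the point saved by sigsetjmp(). */
static void on_segv(int sig)
{
    (void)sig;
    siglongjmp(recover_point, 1);
}

/*
 * Returns 1 if a read from an inaccessible page raised SIGSEGV and the
 * process recovered, 0 if the read unexpectedly succeeded, -1 if the
 * setup failed.
 */
int recover_from_fault(void)
{
    struct sigaction sa, old;
    volatile char *page;
    int result;

    memset(&sa, 0, sizeof sa);
    sa.sa_handler = on_segv;
    sigemptyset(&sa.sa_mask);
    if (sigaction(SIGSEGV, &sa, &old) != 0)
        return -1;

    /* A page with no access rights: any read from it faults. */
    page = mmap(NULL, 4096, PROT_NONE, MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    if (page == MAP_FAILED) {
        sigaction(SIGSEGV, &old, NULL);
        return -1;
    }

    if (sigsetjmp(recover_point, 1) == 0) {
        (void)page[0]; /* faults; control resumes in the else branch */
        result = 0;
    } else {
        result = 1;    /* the handler ran and the process survived */
    }

    munmap((void *)page, 4096);
    sigaction(SIGSEGV, &old, NULL); /* restore the previous disposition */
    return result;
}
```

A process doing this is not crashing, which is why a kernel message for such a handled fault would be misleading.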
Comment 3 Andi Kleen 2004-06-09 13:37:08 UTC
The Sun analysis is outdated. Current kernels log the signal
only when the signal is not handled by sigaction and no debugger is running.
This basically only happens when the program is really crashing;
it is extremely unlikely not to be a crash.
Comment 4 Marc Heckmann 2004-06-09 21:21:12 UTC
I do believe that the app (the java JVM) _is_ really crashing because my webapps
(running in the tomcat container) do not run correctly at all and I did manage
to get 1.5.0-beta1 to dump a core file.

However, the point is that the JVMs do not crash under 2.4.27-rc5 X86_64. I am
not alleging that the kernel is to blame instead of java; it may be either
one. Perhaps there is a bug in the JVM that it can get away with in 2.4.x. I was
just hoping that someone might have some clues. Either way this is a problem for
myself and other users who want to develop Java apps on the x86_64 platform.

Once again, the application is dying with SIGABRT. I don't know if that is
significant or not. I am willing to help get to the bottom of this, just looking
for pointers.

OS is Fedora Core 2 X86_64.
Comment 5 Juergen Kreileder 2004-06-10 12:04:02 UTC
Created attachment 3138 [details]
Fix for segfault logging in 2.6.6

Andi, the logging in vanilla 2.6.6 seems to be broken: only caught segfaults
get logged.  With this patch it works more like intended for me.
Comment 6 Andi Kleen 2004-06-10 14:11:56 UTC
Hi,

The ! for SIGSEGV was indeed wrong. Thanks for catching this. I will fix this.

But the unhandled_signal() change itself is imho not correct. What the 
check does is to match when PT_PTRACED is set, but fail when PT_TRACESYSGOOD
(= strace running) is set. As far as I can see the original code for this 
is correct.

Overall there must still be some other problem, because these printks
have no relation to how the program works (except for making it a bit slower).
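
The logging condition discussed in Comments 3 and 6 can be modelled in user space roughly as follows. This is a sketch based only on the prose in this thread, not the actual code from arch/x86_64/mm/fault.c. The flag values and function names are assumptions chosen for illustration and may differ from the real kernel definitions.

```c
#include <stdbool.h>

/* Illustrative flag values only; the real kernel definitions may differ. */
#define PT_PTRACED      0x1u
#define PT_TRACESYSGOOD 0x2u

/*
 * Matches when the task is ptraced (for example by a debugger) but does
 * not match when PT_TRACESYSGOOD is also set (for example by a newer
 * strace), as described in Comment 6.
 */
static bool debugger_attached(unsigned int ptrace_flags)
{
    return (ptrace_flags & (PT_PTRACED | PT_TRACESYSGOOD)) == PT_PTRACED;
}

/*
 * Hypothetical model of the intended policy from Comment 3: log only
 * when no debugger is attached and the process has not installed its
 * own handler for the signal.
 */
bool should_log_segfault(unsigned int ptrace_flags, bool has_user_handler)
{
    if (debugger_attached(ptrace_flags))
        return false;
    return !has_user_handler;
}
```

Under this model a fault that the process handles itself is never logged. Negating the final result would log only handled faults, which is consistent with what Comment 5 observed in vanilla 2.6.6.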
Comment 7 Juergen Kreileder 2004-06-10 15:03:08 UTC
Ah.  Then my strace is probably too old; it doesn't seem to use
PTRACE_O_TRACESYSGOOD.

We'll release 1.4.2-fcs in a few days.  It has a lot of x86_64-specific fixes (and
also works on EM64T machines).  If the bug reporter still has problems (except
for bogus segfault logs) with that version, he should report them at Blackdown.

32-bit JVMs on the other hand already _should_ work fine.
Comment 8 Marc Heckmann 2004-06-14 10:33:34 UTC
OK, I retested my own Java code and it does indeed work fine despite the kernel
messages. I noticed the fix for the false messages also made it into the vanilla
kernel, so I'm going to close this one.

Sorry for the confusion.

-m