Bug 10720 - SIGSEGV or SIGILL crash of SUN JVM 1.6 32 bit in Linux x86_64 in VMware ESX
Summary: SIGSEGV or SIGILL crash of SUN JVM 1.6 32 bit in Linux x86_64 in VMware ESX
Status: CLOSED WILL_NOT_FIX
Alias: None
Product: Memory Management
Classification: Unclassified
Component: Other
Hardware: All Linux
Importance: P1 high
Assignee: Andrew Morton
URL:
Keywords:
Depends on:
Blocks:
 
Reported: 2008-05-16 02:47 UTC by Michael Burger
Modified: 2008-09-24 01:39 UTC
CC List: 1 user

See Also:
Kernel Version: 2.6.18-6-amd64 and openSUSE 2.6.22.5-31-default and sles10-sp1-s
Subsystem:
Regression: ---
Bisected commit-id:


Attachments
strace output including crash (311.37 KB, text/plain)
2008-05-16 02:53 UTC, Michael Burger
Java VM crash report file (26.15 KB, text/plain)
2008-05-16 02:57 UTC, Michael Burger

Description Michael Burger 2008-05-16 02:47:09 UTC
Latest working kernel version:
all 2.6 i686 kernels

Earliest failing kernel version:
since change to x86_64 kernels

Distribution:
- Debian 4.0 (Etch)
- SuSE Enterprise Server 10
- openSUSE 10.3 (X86-64)

Hardware Environment:
- VMware ESX host, 1 CPU (Intel(R) Xeon(TM) CPU 3.00GHz)
- MemTotal:      2063116 kB

Software Environment:
Sun's Java JDK / JRE 1.6.0_06 as well as 1.6.0_10, both in the 32-bit edition.

Problem Description:
Sporadic but reproducible JVM (HotSpot) crashes with a SIGSEGV or SIGILL while running a simple garbage collector stress application several times in parallel.

hs_err files are generated each time.

The problem only appears when using the 32-bit edition of JDK or JRE 1.6 on a Linux 2.6 x86_64 kernel in VMware.

Tests using real hardware, using an i686 kernel and using JDK / JRE 1.6 64 bit edition or JDK / JRE 1.5 (32 bit and 64 bit) could not reproduce the crashes.
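
Because the crash depends on the JVM edition, it can help to confirm which edition a test run actually uses. The following is only a minimal sketch: the class name JvmInfo is hypothetical, and sun.arch.data.model is a Sun-specific system property that other JVMs may not set. Note that os.arch typically describes the JVM build rather than the kernel, so the kernel architecture still has to be checked separately (for example with uname -m).

```java
// Hypothetical helper, not part of the original report: prints which JVM
// edition is running so that a test run can be matched to a configuration.
class JvmInfo {

    // Collects the relevant system properties into one line.
    // "sun.arch.data.model" is Sun-specific; "unknown" is used if it is not set.
    static String describe() {
        return "java.version=" + System.getProperty("java.version")
                + " os.arch=" + System.getProperty("os.arch")
                + " sun.arch.data.model="
                + System.getProperty("sun.arch.data.model", "unknown");
    }

    public static void main(String[] args) {
        System.out.println(describe());
    }
}
```

Running it with the same java binary that is used for the stress test shows whether that binary is the 32 bit or the 64 bit edition.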

Steps to reproduce:

Run the simple Java application below; it makes no difference whether it is compiled with JDK 1.5 or JDK 1.6.
Run several instances in parallel with these parameters to trigger the crash more often:

java -server -Xms32m -Xms256m GCStress 1000 10485760

public class GCStress {
    private static Thread[] threads;
    private static java.util.Random random = new java.util.Random();
    private static int maxBytes;
    private static int numThreads;

    public static void main(String[] args) {
        numThreads = Integer.parseInt(args[0]);
        maxBytes = Integer.parseInt(args[1]);
        threads = new Thread[numThreads];
        for (int x = 0; x < numThreads; x++) {
            threads[x] = new Thread(new Allocator(), "Thread_" + x);
            threads[x].start();
        }

        while (true) {
            for (int x = 0; x < numThreads; x++) {
                if (!threads[x].isAlive()) {
                    threads[x] = new Thread(new Allocator(), "Thread_r" + x);
                    threads[x].start();
                }
            }
            try {
                Thread.sleep(500);
            } catch (InterruptedException e) {
                // ignore
            }

        }
    }

    private static class Allocator implements Runnable {
        public void run() {
            while (true) {
                int tSize = random.nextInt(maxBytes);
//                System.out.println(Thread.currentThread().getName()
//                        + " allocating " + tSize / 1024 + "kb");
                byte[] tMemFiller = new byte[tSize];
                try {
                    Thread.sleep(random.nextInt(200));
                } catch (InterruptedException e) {
                    ; // ignore
                }
            }
        }
    };
}
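
Starting several instances in parallel, as the steps above ask for, could be scripted with a small launcher like the one below. This is only a sketch under stated assumptions: the class name ParallelGCStress, the default of 4 instances, and the log file names are all hypothetical. The JVM flags are copied verbatim from the command above, where the second -Xms may have been meant as -Xmx256m. The launcher itself needs Java 7 or later (for ProcessBuilder.redirectOutput), while the java binary it starts should be the 32 bit 1.6 JVM under test.

```java
import java.io.File;
import java.io.IOException;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical launcher, not part of the original report: starts several
// GCStress JVMs side by side and waits for them to exit (or crash).
class ParallelGCStress {

    // Builds the command line for one GCStress JVM.
    // Flags are copied verbatim from the report; the second -Xms may have
    // been intended as -Xmx256m (assumption, left unchanged here).
    static List<String> buildCommand(String javaBinary, int threads, int maxBytes) {
        return new ArrayList<String>(Arrays.asList(
                javaBinary, "-server", "-Xms32m", "-Xms256m",
                "GCStress", String.valueOf(threads), String.valueOf(maxBytes)));
    }

    public static void main(String[] args) throws IOException, InterruptedException {
        // Assumed defaults: 4 instances, "java" taken from the PATH.
        int instances = args.length > 0 ? Integer.parseInt(args[0]) : 4;
        String javaBinary = args.length > 1 ? args[1] : "java";

        List<Process> processes = new ArrayList<Process>();
        for (int i = 0; i < instances; i++) {
            ProcessBuilder builder =
                    new ProcessBuilder(buildCommand(javaBinary, 1000, 10485760));
            builder.redirectErrorStream(true);
            // One log file per instance so the output does not interleave.
            builder.redirectOutput(new File("gcstress_" + i + ".log"));
            processes.add(builder.start());
        }

        // GCStress never terminates on its own, so an exit means a crash or a kill.
        for (int i = 0; i < processes.size(); i++) {
            int exitCode = processes.get(i).waitFor();
            System.out.println("instance " + i + " exited with code " + exitCode);
        }
    }
}
```

It has to be started from the directory that contains GCStress.class, for example as: java ParallelGCStress 4 /path/to/32bit/jdk1.6/bin/java (the path is a placeholder).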



Expected and Actual Results:

After about 5 minutes of running (in VMware), some or all of the applications crash. Often, two or more of the parallel applications crash at almost the same time.

hs_err files are generated and show:
- SIGILL
- SIGSEGV

often but not always while the JVM is in a safepoint, e.g.
VM_Operation (0xe371ee1c): GenCollectForAllocation, mode: safepoint, requested by thread 0x08217000
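
When many hs_err files accumulate, the signal recorded in each one can be collected with a short program. The sketch below is hypothetical (class and method names are not from the report) and assumes the usual HotSpot crash log header, where one of the first lines looks like "#  SIGSEGV (0xb) at pc=..., pid=..., tid=...".

```java
import java.io.BufferedReader;
import java.io.FileReader;
import java.io.IOException;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper, not part of the original report: reports which
// signal (for example SIGSEGV or SIGILL) each hs_err file records.
class HsErrSignal {

    // Assumed header format: "#", whitespace, then the signal name.
    private static final Pattern SIGNAL = Pattern.compile("^#\\s+(SIG[A-Z]+)\\b");

    // Returns the signal name found at the start of a header line, or null.
    static String extractSignal(String line) {
        Matcher m = SIGNAL.matcher(line);
        return m.find() ? m.group(1) : null;
    }

    // Returns the first signal name found in the file, or null if none.
    static String firstSignal(String path) throws IOException {
        BufferedReader reader = new BufferedReader(new FileReader(path));
        try {
            String line;
            while ((line = reader.readLine()) != null) {
                String signal = extractSignal(line);
                if (signal != null) {
                    return signal;
                }
            }
            return null;
        } finally {
            reader.close();
        }
    }

    public static void main(String[] args) throws IOException {
        // Each argument is the path of one hs_err file.
        for (String path : args) {
            System.out.println(path + ": " + firstSignal(path));
        }
    }
}
```

Passing all hs_err file paths as arguments prints one line per file, which makes it easier to see how the crashes split between the two signals.
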
Comment 1 Michael Burger 2008-05-16 02:53:37 UTC
Created attachment 16161
strace output including crash
Comment 2 Andrew Morton 2008-05-16 02:56:21 UTC
Not for the kernel.org guys, sorry.  2.6.18 is way old, and suse have changed it and
it's running in vmware which might affect things.

Probably it'd be best to take it up with suse.
Comment 3 Michael Burger 2008-05-16 02:57:11 UTC
Created attachment 16162
Java VM crash report file
Comment 4 Michael Burger 2008-05-16 04:20:50 UTC
Ah!
The Kernel Version field was cut off. The crash is reproducible on

Linux debian 2.6.18-6-amd64 #1 SMP Sun Feb 10 17:50:19 UTC 2008 x86_64 GNU/Linux

as well as

Linux openSUSE 2.6.22.5-31-default #1 SMP 2007/09/21 22:29:00 UTC x86_64 x86_64 x86_64 GNU/Linux

as well as

Linux sles10-sp1-sdm 2.6.16.46-0.12-default #1 Thu May 17 14:00:09 UTC 2007 x86_64 x86_64 x86_64 GNU/Linux


Which x86_64 kernel do you recommend to install?
Could there be a general problem with an x86_64 kernel in VMware ESX?
Comment 5 Roland Kletzing 2008-06-02 14:25:51 UTC
Can this be reproduced on different ESX boxes (to make sure it isn't hardware related)?

Which ESX version do you use?
Did you apply all patches for it?
Comment 6 Michael Burger 2008-06-02 22:11:01 UTC
It CAN be reproduced on different ESX servers: I tried two, and my colleagues tried two others.
Additionally, it can be reproduced on VMware Server (not ESX).
(Version follows)
Comment 7 Michael Burger 2008-06-03 00:31:33 UTC
We've tried on several Intel Xeon 64-bit processors and an AMD Turion64. The version of our ESX software is
VMware ESX Server 3.0.2, 61618

The version of the VMware Server with the same error behaviour is
VMware Server 6.0.3, 80004

On the VMware Server it took an hour until the error appeared, on the ESX servers between 1 and 30 minutes.
Comment 8 Alan 2008-09-23 13:26:37 UTC
"Tests using real hardware, using a i686 kernel and using JDK / JRE 1.6 64 bit
edition or JDK / JRE 1.5 (32 bit and 64 bit) could not reproduce the crashes."

Please refer this to VMware in that case. There isn't any obvious Linux reason for this.
Comment 9 Michael Burger 2008-09-24 01:39:18 UTC
Hello dear Kernel engineers!

- Tests on real hardware could not reproduce the problem in any case; it never occurred there
- After updating to a newer ESX release, the problem disappeared

Without knowing the exact cause, I suppose that ESX memory mapping was buggy for some particular JDK memory allocation pattern.
