Most recent kernel where this bug did not occur: Not tested on other kernels.
Distribution: RedHat RHEL 3.0 U3
Hardware Environment: i686 athlon i386, AMD Athlon
Software Environment: WebSphere Application driven by IBM JDK 1.4.2

Problem Description:

Symptom: WebSphere application crash

Description of Issue:
Analysis of the crash footprints indicates that:
1) Well-guarded (locked) data structures end up holding invalid memory.
2) A linked list whose nodes are all correctly formed ends up pointing to an "invalid" node at the point of execution.
3) This list is well guarded (locked by native locks), which rules out the possibility of it being abruptly updated by another thread.
4) We have also verified that this area of memory has not been overlaid by another memory allocation.

In short, this is a native memory inconsistency that occurs even though the memory is guarded, not overlaid and correctly built. This suggests a low-level memory issue, possibly to do with memory management in the kernel.

Analysis and Exact details retrieved from the System Core
==========================================================
Crash Symptom: Abort owing to a panic in the Java Virtual Machine (JVM).
Reason for Panic: One of the JVM data structures that internally represent a Java thread is pointed to by an "invalid address".

Stack Trace of Crash
====================
#0 0xb749acdf in raise () from /lib/tls/libc.so.6
#1 0xb749c4e5 in abort () from /lib/tls/libc.so.6
#2 0xb71ebd41 in _hpiPanic (fmt=0xb71faf80 "JVMLH019: invalid thread sr_state %d\n") at /userlvl/cxia32142ifx/src/hpi/pfm/hpi_util_md.c:183
#3 0xb71f548e in tellThreadToSuspend (self=0xa0fb1bc0, tid=0x31284347, type=GLOB_SUSPEND) at /userlvl/cxia32142ifx/src/hpi/pfm/threads_md.c:1502
#4 0xb71f6dbb in sysThreadSingle () at /userlvl/cxia32142ifx/src/hpi/pfm/threads_md.c:2542
#5 0xb74296af in __clz_tab () from /opt/IBM/WebSphere/AppServer/java/jre/bin/classic/libjvm.so
#6 0x00000001 in ?? ()
#7 0x00000000 in ?? ()

Frames 2 to 5 represent JVM function calls. Frame 4 executes the following pseudocode:

tid = head;   // start from the tid pointed to by head
i = 0;
while ((i < no_of_elements in list) && (tid != NULL)) {
    if ((CHECK 1) && CHECK 2) { /*ibm@52001*/
        if (tid == self) {
            // do something
        } else {
            if (tellThreadToSuspend(self, tid, suspendattribute) == ERROR) { /*ibm@57783*/
                ret = ERROR;
            }
        }
    }
    prev = tid;
    tid = tid->next;   // iterate the list using the "next" field
    i++;
}

The problem is with the current value of tid=0x31284347, which is the "invalid memory location" that subsequently leads to the panic. As we can see, the code iterates from the head through the elements of the list using tid->next, so the invalid tid should be part of the linked list. I have pasted the entire list below, in order. It is effectively the list the above code is iterating through, and we can see that the "invalid tid value" is not part of it.

Note: tid=systhread
========================================
List Memory
HEAD
systhread=0BD464B8 systhread=0C395F70 systhread=9FE5D378 systhread=0C3666E8 systhread=A0F7B598 systhread=0C265000 systhread=0C1F62E0 systhread=0C3415A0 systhread=0C4BCA78 systhread=A0F79348 systhread=A0F75888 systhread=A0F7AFE0 systhread=0BEDC178 systhread=0BF61570 systhread=0B05A6E8 systhread=0A949598 systhread=A0F7C6C0 systhread=A0F78D90 systhread=A0F77C68 systhread=A0F7BB50 systhread=A0F7EEC8 systhread=A0F7E910 systhread=A0F70088 systhread=A0FB2178 systhread=A0FB1608 systhread=A0FB1BC0 systhread=A0F7E358 systhread=A0F78220 systhread=0B963C40 systhread=A48C2338 systhread=0A950D10 systhread=0A8EF5D0 systhread=0C05EF40 systhread=0C130060 systhread=0AA86850 systhread=0C22D6E8 systhread=0C24F2D8 systhread=0B962AB8 systhread=0B9EC808 systhread=0BAF6A38 systhread=A2823A90 systhread=A2810E38 systhread=0BF2B218 systhread=0B9A2D88 systhread=0BA644D8 systhread=0C2823D0 systhread=9F942A00 systhread=9FBC1400 systhread=0B9A3518
systhread=0B24E6F8 systhread=0C22DCA0 systhread=0B8F9310 systhread=0BD00C18 systhread=0BC349F8 systhread=0C185D50 systhread=0B965FE0 systhread=0AC3D360 systhread=0AB0A6C0 systhread=0AB7C418 systhread=0B962310 systhread=0BAD7A30 systhread=09CBDDC8 systhread=A16FA328 systhread=A16F9AA8 systhread=A89866B8 systhread=9FE92F08 systhread=9FB86B70 systhread=9FEFE8C8 systhread=9FD21E68 systhread=0BE97830 systhread=0AA02318 systhread=0BAAE598 systhread=0BFA4750 systhread=A16F9248 systhread=A16F7908 systhread=9FB7A390 systhread=9F904230 systhread=0C383378 systhread=0AA29308 systhread=0AB6D470 systhread=0A36E1C0 systhread=0BEFB240 systhread=0AA20840 systhread=A231B858 systhread=A2310420 systhread=9FDDD180 systhread=9FE57368 systhread=9FB53708 systhread=9FE5DBF8 systhread=9FE08B28 systhread=0AEF4708 systhread=0BA66258 systhread=0AC003C0 systhread=0A6CDD00 systhread=0AE01D18 systhread=0ADBBF70 systhread=0AD28F78 systhread=0A6829E8 systhread=0A50D590 systhread=AA6031D8 systhread=0A222050 systhread=0A223240 systhread=09CD37D8 systhread=A9A148B8 systhread=AB8093C8 systhread=AB808E10 systhread=A9DE1B98 systhread=AABE65D8 systhread=A9D29060 systhread=A9D28AA8 systhread=AABBBD28 systhread=AABBB770 systhread=AABB95E0 systhread=09C7A8A8 systhread=AADF6310 systhread=AAD7EF50 systhread=AAD7E998 systhread=AAD7E3E0 systhread=AAD7C080 systhread=AAD7BAC8 systhread=09C6A260 systhread=0868DE28 systhread=0868C818 systhread=AB841F60 systhread=09A886B8 systhread=09A87148 systhread=090BDA48 systhread=B129D488 systhread=AE737AD8 systhread=0A0C03F0 systhread=0A0BD3B0 systhread=0A077878 systhread=AECDF240 systhread=B12D4518 systhread=0A0A35E0 systhread=0A097140 systhread=0A075548 systhread=0A01E950 systhread=09FCAB10 systhread=09FCA558 systhread=09E7F568 systhread=09DCD460 systhread=09DF6398 systhread=09DEBDE8 systhread=09DAF558 systhread=09D9FC90 systhread=09D9ABA0 systhread=B12DA7B0 systhread=08FA09C0 systhread=0925F140 systhread=08F9E080 systhread=08332550 systhread=081EF880 systhread=08188AB0 
systhread=081873F0 systhread=08185D30 systhread=08184670 systhread=08182FB0 systhread=08181970 systhread=08180400 systhread=0817DA00 systhread=0817B270 systhread=08178C60
NULL
========================================

1) The invalid tid value of 0x31284347 is not in the above list.
2) This value was retrieved when iterating through the above list at runtime.
3) The list itself is guarded by locks in the JVM code.
4) The list nodes are not overlaid.
5) Also, from frame 4:

#4 0xb71f6dbb in sysThreadSingle () at /userlvl/cxia32142ifx/src/hpi/pfm/threads_md.c:2542

i locals
i = 49
ret = 0
tid = (sys_thread_t *) 0x31284347
self = (sys_thread_t *) 0xa0fb1bc0

"i" is the loop induction variable of the while loop in the pseudocode above. It suggests that the last tid that was processed correctly is systhread=9FBC1400, so ideally 0B24E6F8->next should point to the incorrect value "0x31284347". But when we check the next field:

(gdb) p *(sys_thread_t *) 0xa0fb1bc0
$2 = {ref_count = 0, pid = 7764, sys_tid = 2727795632, next = 0xa0f7e358, state = {value = RUNNABLE, data = 0}, interrupted = 0, single_threaded = FALSE, is_system_thread = TRUE, seen_to_die = FALSE, ps_count = 0

it correctly points to the next item in the list, namely "0xa0f7e358".

Considering the above 5 points, there seems to be no reason for the invalid memory to "turn up" at runtime, which points very clearly to a "low level memory management issue".

Steps to reproduce:
There is no standalone testcase that reproduces this problem. It occurs only on a single server, which is a production server, and cannot be reproduced in the test environment. The footprints available for analysis are the core file and the WebSphere and application logs.

Other Information
=================
For access to footprints such as the core file, logs and the thread dump, please get in touch with me at lvenkata@in.ibm.com.
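
For reference, the traversal described for frame 4 can be sketched in plain C as below. This is only an illustration under assumptions, not the JVM source: the cut-down sys_thread_t layout, the suspend_requests counter, the stub tell_thread_to_suspend() and the function name suspend_all_but_self() are all hypothetical, and the CHECK 1 / CHECK 2 conditions from the pseudocode are omitted.

```c
#include <stddef.h>

/* Hypothetical, cut-down stand-in for the JVM's sys_thread_t. */
typedef struct sys_thread {
    struct sys_thread *next;
    int suspend_requests;   /* how often the stub below was called on this node */
} sys_thread_t;

/* Stub standing in for tellThreadToSuspend(); it always succeeds. */
static int tell_thread_to_suspend(sys_thread_t *self, sys_thread_t *tid)
{
    (void)self;
    tid->suspend_requests++;
    return 0;
}

/*
 * Walk at most `count` nodes starting at `head`, asking every node
 * except `self` to suspend. Returns the number of nodes visited.
 */
static int suspend_all_but_self(sys_thread_t *head, sys_thread_t *self, int count)
{
    sys_thread_t *tid = head;
    int i = 0;

    while (i < count && tid != NULL) {
        if (tid != self) {          /* a comparison, not an assignment */
            tell_thread_to_suspend(self, tid);
        }
        tid = tid->next;            /* the only place tid is updated */
        i++;
    }
    return i;
}
```

In this sketch tid only ever takes the value of head or of some node's next field, which is why a value that appears in neither place is unexpected.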
Created attachment 12233
Static log that has the thread stacks of all executing threads at the point of crash, along with information about other JVM components. This log can be used to view the stacks of the executing threads at the point of crash.
This doesn't belong here at all, I'll close.