|
Sun Studio 提供的 dbx 工具可用于在 Solaris x86 系统上调试特别令人棘手的 JVM 故障。
目录
问题
在 Solaris 系统上,我通常使用 Sun Studio dbx 工具解决 JVM 中的崩溃和挂起等问题。对于大多数此类问题,我们从客户那里接收到一个磁心存储文件。该文件使用由 JVM 构建的产品所生成。由于 JVM 构建的产品并不包含任何调试信息,所以 dbx 就不能显示 JVM 的对象细节,这使得调试和解密地址非常困难。
解决方案
dbx 调试器有一个非常有用的功能。使用这项功能我们可以“加载”和“使用”一个交替共享对象,而不是在进程运行中所使用的对象。
(dbx) loadobject -load alternate-lib
(dbx) loadobject -use alternate-lib
这项功能对于调试没有调试信息的进程或磁心存储文件非常有用。使用 -g 标志对我们所希望调试的库进行重新编译,然后将库加载到 dbx 调试会话中,这是可能的。
调试 JVM 挂起
在发生挂起的情况下,在装有 JDK 版本 1.5.0_08 的 Solaris x86 系统上客户面临一个硬挂起,甚至进程不响应 kill -3。客户向我们发送一个挂起进程的磁心存储文件。
对于具有调试信息的 dbx,我们使用 -g 标志重新构建 JDK 版本 1.5.0_08 的产品版本 libjvm.so,并将之加载,然后使用带有 dbx 的 .so。做了这样的更改,dbx 就能够获得使用 libjvm.so 生成的原本不带任何调试信息的磁心存储文件的所有调试信息。
在磁心存储文件上的 pstack 显示,VMThread 正试图获取 Threads_lock,主线程在 Threads_lock 上正处于等待状态,另一个线程 3992 也试图攫取 Threads_lock。那么这个锁看起来会引起死锁。下面给出了这个磁心存储文件。
----------------- lwp# 1 / thread# 1 --------------------
fef4009b lwp_park (0, 0, 0)
fef3a909 cond_wait_queue (807c540, 807c528, 0, 0) + 3b
fef3ae02 _cond_wait (807c540, 807c528) + 66
fef3ae44 cond_wait (807c540, 807c528) + 21
fe92fc64 __1cHMonitorEwait6Mil_i_ (807c4d0, 0, 0) + 2a4
fecf526b __1cHThreadsKdestroy_vm6F_i_ (8073fb9, 0, 806dc14, fee4d33c, fee4a2a0, 10004) + 7b
feb66ea3 jni_DestroyJavaVM (fee43560) + 123
08051f8e main (2, 807017c, 80432b4) + bfb
080512fa ???????? (86, 8043468, 804346d, 8043494, 80434a8, 80434c6)
----------------- lwp# 11 / thread# 11 --------------------
fef4009b lwp_park (0, 0, 0)
fef399bb slow_lock (fbc92000, 807c528, 0) + 3d
fef39ab1 mutex_lock_impl (807c528, 0) + ec
fef39bbd mutex_lock (807c528) + 1a
fe8c9a00 __1cFMutexElock6MpnGThread__v_ (807c4d0, 8261480) + c0
fe8cc1fe __1cFMutexElock6M_v_ (807c4d0) + 4e
fe996295 __1cUSafepointSynchronizeFbegin6F_v_ (fee493fc, 8261480, fedee000, f8199b74, 3, f8199bbc) + 85
fea659c4 __1cIVMThreadEloop6M_v_ (8261480) + 134
fea66478 __1cIVMThreadDrun6M_v_ (8261480) + 88
fecbe0f2 __1cG_start6Fpv_0_ (8261480) + d2
fef3fd52 _thr_setup (fbc92000) + 4e
fef40040 _lwp_start (fbc92000, 0, 0, f8199ff8, fef40040, fbc92000)
----------------- lwp# 3992 / thread# 3992 --------------------
fef4009b lwp_park (0, 0, 0)
fef399bb slow_lock (6c173c00, 807c528, 0) + 3d
fef39ab1 mutex_lock_impl (807c528, 0) + ec
fef39bbd mutex_lock (807c528) + 1a
fe8ea95e __1cFMutexbClock_without_safepoint_check6M_v_ (807c4d0) + 7e
fe95c56e __1cUSafepointSynchronizeFblock6FpnKJavaThread__v_ (a27f780) + ee
fecb6825 __1cGParkerEpark6Mix_v_ (a27f908, 0, d964b198, 45) + 1e5
fed0963d Unsafe_Park (a27f840, 6807aba4, 0, d964b198, 45, 9480bcb8) + 1dd
f876f2f9 ???????? (0, 0, a27c330, 0, 9ae69a8, 0)
00000000 ???????? (fecf6330, fea09d40, fea03c80, fe964f40, fe91cc40, fe939c50)
00000000 ???????? (fffffd2c, 3badd9be, f468fe16, e78b7254, fe1dbbcc, 0)
00000000 ???????? ()
|
现在让我们使用 dbx 看一看 Threads_lock。
(dbx) print -fx Threads_lock
Threads_lock = 0x807c4d0
(dbx) print -r *((Monitor*)0x807c4d0)
*((class Monitor *) 0x807c4d0) = {
Monitor::Mutex::_lock_count = 0
Monitor::Mutex::_lock_event = 0x807c500
Monitor::Mutex::_suppress_signal = 0
Monitor::Mutex::_owner = 0x97b0978 <----------------
Monitor::Mutex::_name = 0xfed97a0f "Threads_lock"
Monitor::Mutex::INVALID_THREAD = 0
Monitor::_event = (nil)
Monitor::_counter = 0
Monitor::_waiters = 1
Monitor::_tickets = 0
}
(dbx) print -r *((Thread*)0x97b0978)
*((class Thread *) 0x97b0978) = {
Thread::ThreadShadow::_pending_exception = (nil)
Thread::_exception_file = (nil)
Thread::_exception_line = 0
Thread::_real_malloc_address = 0x97b0d28
Thread::_rawmonitor_list = (nil)
Thread::_SR_lock = 0x9b7b890
Thread::_vm_suspend_depth = 0
Thread::_suspend_flags = 0
Thread::_active_handles = 0x97b0b48
Thread::_free_handle_block = (nil)
Thread::_last_handle_mark = 0x65c92618
Thread::_oops_do_parity = 0
Thread::_highest_lock = 0x65c92cb8 "x^I{^I\xe8^K{^I\xb0\xd6\x82\xf4\xa0H\xbeu"
Thread::_unboosted_priority = NoPriority
Thread::_tlab = {
Thread::ThreadLocalAllocBuffer::_start = 0x762476d8
Thread::ThreadLocalAllocBuffer::_top = 0x7624c750
Thread::ThreadLocalAllocBuffer::_end = 0x76257308
Thread::ThreadLocalAllocBuffer::_size = 14078U
Thread::ThreadLocalAllocBuffer::_refill_waste_limit = 219U
Thread::ThreadLocalAllocBuffer::_number_of_refills = 6U
Thread::ThreadLocalAllocBuffer::_fast_refill_waste = 0
Thread::ThreadLocalAllocBuffer::_slow_refill_waste = 202U
Thread::ThreadLocalAllocBuffer::_gc_waste = 0
Thread::ThreadLocalAllocBuffer::_slow_allocations = 0
Thread::ThreadLocalAllocBuffer::_allocation_fraction = {
Thread::ThreadLocalAllocBuffer::AdaptiveWeightedAverage::_average = 0.006992618
Thread::ThreadLocalAllocBuffer::AdaptiveWeightedAverage::_sample_count = 1U
Thread::ThreadLocalAllocBuffer::AdaptiveWeightedAverage::_weight = 35U
Thread::ThreadLocalAllocBuffer::AdaptiveWeightedAverage::_last_sample = 0.006992618
}
Thread::ThreadLocalAllocBuffer::_target_refills = 50
Thread::ThreadLocalAllocBuffer::_global_stats = 136222936
}
Thread::_vm_operation_started_count = 1
Thread::_vm_operation_completed_count = 1
Thread::_current_pending_monitor = (nil)
Thread::_current_pending_monitor_is_from_java = 1
Thread::_current_waiting_monitor = (nil)
Thread::_osthread = 0x97b0b00 <-------------
Thread::_resource_area = 0x8c55f50
Thread::_handle_area = 0x98dafc0
Thread::_stack_base = 0x65c93000 ""
Thread::_stack_size = 262144U
Thread::_self_raw_id = 1731617792U
}
(dbx) print -r *((OSThread*)0x97b0b00)
*((class OSThread *) 0x97b0b00) = {
OSThread::_start_proc = (nil)
OSThread::_start_parm = (nil)
OSThread::_state = RUNNABLE
OSThread::_interrupted = 0
OSThread::_thread_id = 11982U
OSThread::_lwp_id = 11982U
OSThread::_caller_sigmask = {
OSThread::__sigbits = (4U, 0, 0, 0)
}
OSThread::_vm_created_thread = 1
OSThread::_current_callback = (nil)
OSThread::_current_callback_lock = (nil)
OSThread::_interrupt_event = 0x97b07d8
OSThread::_saved_interrupt_thread_state = _thread_new
}
|
在前面的的文件显示,Threads_lock 的所有者是具有 ID 11982 的线程。现在我们看一下这个线程的堆栈跟踪。
(dbx) thread t@11982
t@11982 (l@11982) stopped in __lwp_park at 0xfef4009b
0xfef4009b: __lwp_park+0x000b: jae __lwp_park+0x19 [ 0xfef400a9, .+0xe ]
(dbx) where
current thread: t@11982
=>[1] __lwp_park(0x0, 0x0), at 0xfef4009b
[2] mutex_lock_queue(0x67366400, 0x0, 0x88a0598, 0x0), at 0xfef3911f
[3] slow_lock(0x67366400, 0x88a0598, 0x0), at 0xfef399bb
[4] mutex_lock_impl(0x88a0598, 0x0), at 0xfef39ab1
[5] __mutex_lock(0x88a0598), at 0xfef39bbd
[6] Mutex::wait_for_lock_blocking_implementation(0xa27f908, 0x97b0978), at 0xfe9d93ba
[7] Mutex::lock(0xa27f908, 0x97b0978), at 0xfe8c99ea
[8] Mutex::lock(0xa27f908), at 0xfe8cc1fe
[9] Parker::unpark(0xa27f908), at 0xfecb69bb
[10] os::interrupt(0xa27f780), at 0xfecbf6bb
[11] Thread::interrupt(0xa27f780), at 0xfecf2ed9
[12] JVM_Interrupt(0x97b0a38, 0x65c923fc, 0x65c923d0, 0x0, 0x65c923fc, 0xf4873020, 0x0, 0xf482f600, 0x65c923fc, 0x65c92424,
0xf8202b6b, 0x0, 0xf8206399, 0x9593fdc0, 0x1, 0x0, 0x65c92400, 0xf482db6b, 0x65c92438, 0xf4873020), at 0xfeb92255
[13] 0xf820843a(0x0, 0xf8206399, 0x9593fdc0, 0x1, 0x0, 0x65c92400), at 0xf820843a
[14] 0xf8202b6b(0x0, 0x0, 0x97f59c18, 0x9593fdc0, 0x65c9243c, 0xf5746e95), at 0xf8202b6b
[15] 0xf8202b6b(0x967a0c40, 0x9, 0x968f6eb8, 0x65c92464, 0xf57472be, 0x65c92498), at 0xf8202b6b
[16] 0xf8202b6b(0x0, 0x968f6eb8, 0x967a0c40, 0x65c9249c, 0xf5845a0c, 0x65c924c4), at 0xf8202b6b
[17] 0xf8202b6b(0x1b58, 0x967a0c40, 0x65c924c8, 0xf5845674, 0x65c92524, 0xf58a1f78), at 0xf8202b6b
[18] 0xf8202b6b(0x7624c168, 0x7624c138, 0x7624c108, 0x7624c0d8, 0x7624bee8, 0x7624bd00), at 0xf8202b6b
|
线程正试图获取 Monitor 0xa27f908。
现在,让我们看一下 Monitor。
(dbx) print -r *((Monitor*)0xa27f908)
*((class Monitor *) 0xa27f908) = {
Monitor::Mutex::_lock_count = 0
Monitor::Mutex::_lock_event = 0x88a0570 <-----------
Monitor::Mutex::_suppress_signal = 0
Monitor::Mutex::_owner = (nil)
Monitor::Mutex::_name = 0xfed97c99 "Park Semaphore"
Monitor::Mutex::INVALID_THREAD = 0
Monitor::_event = (nil)
Monitor::_counter = 0
Monitor::_waiters = 0
Monitor::_tickets = 0
}
(dbx) print -r *((Event*)0x88a0570)
More than one identifier 'Event'.
Select one of the following:
0) Cancel
1) class os::Solaris::Event
2) class Solaris::Event
> 1
*((class os::Solaris::Event *) 0x88a0570) = {
Event::_count = 0
Event::_nParked = 0
Event::CachePad = (-6.7529205832009e-205, -6.6220952168053e-205, -1.2786650622506e-204, -1.9668306807887e-204)
Event::_mutex = (
{
Event::_lwp_mutex::flags = {
Event::_lwp_mutex::flag1 = 4U
Event::_lwp_mutex::flag2 = '\0'
Event::_lwp_mutex::ceiling = '\0'
Event::_lwp_mutex::mbcp_type_un = {
Event::_lwp_mutex::bcptype = 0
Event::_lwp_mutex::mtype_rcount = {
Event::_lwp_mutex::count_type1 = '\0'
Event::_lwp_mutex::count_type2 = '\0'
}
}
Event::_lwp_mutex::magic = 19800U
}
Event::_lwp_mutex::lock = {
Event::_lwp_mutex::lock64 = {
Event::_lwp_mutex::pad = ""
}
Event::_lwp_mutex::lock32 = {
Event::_lwp_mutex::ownerpid = 0
Event::_lwp_mutex::lockword = 16842752U
}
Event::_lwp_mutex::owner64 = 72339069014638592ULL
}
Event::_lwp_mutex::data = 1813462016ULL
}
)
Event::_cond = (
{
Event::_lwp_cond::flags = {
Event::_lwp_cond::flag = ""
Event::_lwp_cond::type = 0
Event::_lwp_cond::magic = 17238U
}
Event::_lwp_cond::data = 0
}
)
Event::FreeNext = (nil)
Event::Immortal = 0
}
|
这里,_lwp_mutex->data 是 0x6C173C00。它是 ulwp_t) 线程的所有者。
接下来,让我们看一下该线程。
(dbx) x 0x6C173C00/30X
0x6c173c00: 0x6c173c00 0x00000000 0x00000000 0x00000000
0x6c173c10: 0x00000000 0x00000000 0x00000000 0x00000000
0x6c173c20: 0x00000000 0x00000000 0x00000000 0xfef6b740
0x6c173c30: 0x00000000 0x00000000 0x6c174000 0x6c173800
0x6c173c40: 0x00000000 0x00000000 0x00000000 0x6803b000
0x6c173c50: 0x00040000 0x00000000 0x6807b000 0x00040000
0x6c173c60: 0x6803b000 0x00040000 0x00000000 0x00000398
0x6c173c70: 0x00000f98 0x0000007f
|
线程 0x00000f98 ( 是 3992) 是持有 Event::mutex 的线程 ID。
现在,让我们看一下线程 3992。
(dbx) thread t@3992
t@3992 (l@3992) stopped in __lwp_park at 0xfef4009b
0xfef4009b: __lwp_park+0x000b: jae __lwp_park+0x19 [ 0xfef400a9, .+0xe ]
(dbx) where
current thread: t@3992
=>[1] __lwp_park(0x0, 0x0), at 0xfef4009b
[2] mutex_lock_queue(0x6c173c00, 0x0, 0x807c528, 0x0), at 0xfef3911f
[3] slow_lock(0x6c173c00, 0x807c528, 0x0), at 0xfef399bb
[4] mutex_lock_impl(0x807c528, 0x0), at 0xfef39ab1
[5] __mutex_lock(0x807c528), at 0xfef39bbd
[6] Mutex::lock_without_safepoint_check(0x807c4d0), at 0xfe8ea95e
[7] SafepointSynchronize::block(0xa27f780), at 0xfe95c56e
[8] Parker::park(0xa27f908, 0x0, 0xd964b198, 0x45), at 0xfecb6825
[9] Unsafe_Park(0xa27f840, 0x6807aba4, 0x0, 0xd964b198, 0x45, 0x9480bcb8), at 0xfed0963d
|
它正等待锁定 Threads_lock。
因此线程 11982 和 3992 发生了死锁:
- 线程 11982 持有
Thread_lock 并等待获取 Monitor 0xa27f908。
- 线程 3992 持有在 0xa27f908 上的锁并等待获取
Threads_lock。
这个问题在 Java SE 5.0 更新 15 和 Java SE 6 中使用 CR 6601725 已经得到了修复。
关于作者
Poonam Bajaj 于 2001 年加入 SUN 公司的 Java Sustaining 团队。该团队负责使用 Java 相关产品解决客户遇到的各类问题,并及时提供更新发行。Poonam 热衷于发现问题和解决问题,他尤为擅长 Hotspot VM 领域。
|