Tech

20131213

Watching and taking Thread Dumps with WLST

Troubleshooting hanging or looping behaviour can be troublesome if you do not know how to read a thread dump. It is even harder if you did not take the snapshots at the right time and at the intervals needed to identify and incriminate a specific thread and its class stack.

One of the first steps in the investigation is to identify how frequently the hang occurs and how it behaves. Usually we should start with a very broad approach by asking:


  • What do we use this program for?
  • Does it have a processing peak?


Then we should ask ourselves:


  • Does the frequency of the hanging behaviour relate to any of the answers to the previous questions?


Hopefully, answering these questions will help us determine when to collect data for analysis. The behaviour itself can also help us identify what we are looking for. If the hang consumes a lot of CPU, we are probably dealing with a loop: a while(true), a recursive method, or a recursive architecture whose poor exception handling keeps it looping. If the application just hangs without consuming much CPU, we are possibly looking at a deadlock or other lock contention; a livelock, by contrast, usually keeps the CPU busy. But sometimes the frequency and behaviour are very random, thanks to parallelism, which makes it even more difficult to collect useful data.

Analysing thread dumps is very difficult and we usually need more than just one. I would say that 10 to 20 thread dumps are a good amount for an investigation, and the interval between thread dumps depends on how quickly the contention develops. There are many visual tools, such as Samurai and TDA, that might help in investigating the hang. But going back to my original mental thread (talking about parallelism ;), collecting the useful data at the right time is what holds the key.
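As a rough illustration of why several dumps are needed, the sketch below compares a series of dumps and reports the threads whose stack did not change between them, which are the usual suspects for a hang. It assumes that each thread entry starts with a line holding the thread name in double quotes and ends at a blank line; real dump formats vary between JVMs and tools, and the function names are made up for this example.

```python
import re

# Header line of one thread entry, e.g.: "worker-1" daemon prio=10 ...
# (assumed format: the thread name is the quoted text at the line start)
THREAD_HEADER = re.compile(r'^"(?P<name>.+)"(\s|$)')


def parse_dump(text):
    """Return {thread name: tuple of stack lines} for one thread dump."""
    threads = {}
    current = None
    for line in text.splitlines():
        stripped = line.strip()
        header = THREAD_HEADER.match(line)
        if header:
            current = header.group("name")
            threads[current] = []
        elif not stripped:
            current = None  # a blank line closes the current thread entry
        elif current is not None:
            threads[current].append(stripped)
    return dict((name, tuple(lines)) for name, lines in threads.items())


def unchanged_threads(dumps):
    """Names of threads whose non-empty stack is identical in every dump."""
    parsed = [parse_dump(d) for d in dumps]
    first = parsed[0]
    return sorted(
        name for name, lines in first.items()
        if lines and all(p.get(name) == lines for p in parsed[1:])
    )
```

A thread that shows up here across 10 or 20 dumps is only a candidate: idle pool threads also keep the same stack, so the result still has to be read with the application in mind.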

On WLS, I have written a simple WLST script which does the thread dumping for me. That way I can spend my time on really useful things like updating my Facebook or reading Dilbert strips.

How it works:

1. Select one of the Managed Servers and set the following parameters:

Go to Managed Server:Configuration:Tuning and set


    • Stuck Thread Max Time: 15 sec (needs restart)
    • Stuck Thread Timer Interval: 10 sec (needs restart)


*These parameters have different behaviours and can be set as you like; please see the Oracle WLS documentation for more details.
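The same two values can probably also be set from WLST instead of the console. This is a minimal sketch, assuming a WLST session already connected to the Admin Server and assuming the server MBean exposes the settings as StuckThreadMaxTime and StuckThreadTimerInterval; the function name and the server name used in the comment are hypothetical.

```python
# Sketch only: meant to be pasted into a WLST session that is already
# connected to the Admin Server. edit(), startEdit(), cd(), cmo, save()
# and activate() are provided by WLST, not by plain Python.
def set_stuck_thread_tuning(server_name, max_time=15, timer_interval=10):
    edit()
    startEdit()
    cd('/Servers/' + server_name)
    # Assumed attribute names on the server MBean (values in seconds)
    cmo.setStuckThreadMaxTime(max_time)
    cmo.setStuckThreadTimerInterval(timer_interval)
    save()
    activate()

# Hypothetical usage inside WLST:
# set_stuck_thread_tuning('ManagedServer1')
```

As noted above, the Managed Server still needs a restart before the new values take effect.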

2. Set your domain Environment by running setDomainEnv.sh from your <DOMAIN>/bin directory:

$ . setDomainEnv.sh

3. Create a Python script which will watch the health of the server:

<code:>

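import java.lang
import os
import string

# Navigate to the ThreadPoolRuntime MBean of the connected server
def serverRuntimeNavigate():
    serverRuntime()
    cd("/")
    cd("ThreadPoolRuntime/ThreadPoolRuntime")

# Navigate to the JVMRuntime MBean (helper, not used by the loop below)
def runtimeNavigate():
    runtime()
    cd("/")
    cd("JVMRuntime/" + serverName)

def checkHealthOfServer(serverName):
    print 'Checking : ' + serverName
    os.system("echo Starting")
    serverRuntimeNavigate()
    x = 0  # counter used to number the dump files
    while True:
        state = str(cmo.getHealthState())
        check = string.find(state, "HEALTH_WARN")
        if check != -1:
            print "Warning State"
            threadDump(writeToFile="true", fileName="ServerDump" + str(x))
            # go back to the thread pool MBean before the next check
            serverRuntimeNavigate()
            java.lang.Thread.sleep(20000)  # 20 seconds between dumps
            x += 1
        else:
            print 'Its all good...'
            java.lang.Thread.sleep(5000)  # 5 seconds between checks

connect("weblogic","weblogic1","localhost:7001")
# serverName is a variable that WLST sets after connect()
checkHealthOfServer(serverName)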



4. All you need to change in this script is the username, password and server URL:PORT in the connect() command, on the next-to-last line of the script. Then you can call WLST to run the script:

$ java weblogic.WLST <scriptName>.py

Basically, this Jython/WLST script connects to a server and checks its health. If the health is OK, all it does is print a message. As soon as the WLS engine decides that a long-running (stuck) thread exists, the script starts taking thread dumps and writes the output to files in the directory from which you called the script.
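The polling logic can also be written so that it stops by itself after a fixed number of dumps and can be tried outside WLST. This is only a sketch of that idea and not the script above: the function name and its parameters are hypothetical, and the state check, dump and sleep actions are passed in as callables.

```python
def watch_health(get_state, take_dump, sleep, max_dumps=10,
                 dump_interval=20, idle_interval=5):
    """Poll get_state() and call take_dump(name) while it reports a warning.

    Stops after max_dumps dumps and returns the dump names requested.
    Intervals are in seconds and are simply handed to sleep().
    """
    taken = []
    while len(taken) < max_dumps:
        if "HEALTH_WARN" in str(get_state()):
            name = "ServerDump" + str(len(taken))
            take_dump(name)
            taken.append(name)
            sleep(dump_interval)
        else:
            sleep(idle_interval)
    return taken
```

Inside WLST, the three callables would be thin wrappers around cmo.getHealthState(), threadDump() and a sleep function, with the navigation back to the thread pool MBean done inside the dump wrapper.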

1 comment:

  1. I did not understand what "at the next to last script" means. Can you please help me understand this?
