Tech

20120809

What about Service Guardian on Coherence

Service Guardian is basically a stuck-thread watchdog for a Coherence cluster. It works by having Coherence-owned threads issue periodic heartbeats; when a thread on a specific node fails to issue its heartbeat in time, time-out flags are triggered and corrective action is taken.

The time-out recovery works like this:

Soft time-out _ Coherence attempts to interrupt the thread before the hard time-out is reached. If the interrupt succeeds, normal processing resumes.


<Error> (thread=DistributedCache, member=1): Attempting recovery (due to soft
timeout) of Daemon{Thread="Thread[WriteBehindThread:CacheStoreWrapper(com.
tangosol.examples.rwbm.TimeoutTest),5,WriteBehindThread:CacheStoreWrapper(com.
tangosol.examples.rwbm.TimeoutTest)]", State=Running}


This is possibly just some network delay or latency. 305000 milliseconds is the default guardian time-out, and no action is required unless you see this log output frequently. If you do, you might need to watch your network traffic and do some tuning, or change the default value to better fit your needs.

Hard time-out _ once the configured time-out (in this case the default of 305000 milliseconds) is reached, Coherence now tries to stop the thread.


<Error> (thread=DistributedCache, member=1): Terminating guarded execution (due 
to hard timeout) of Daemon{Thread="Thread[WriteBehindThread:CacheStoreWrapper
(com.tangosol.examples.rwbm.TimeoutTest),5,WriteBehindThread:CacheStoreWrapper
(com.tangosol.examples.rwbm.TimeoutTest)]", State=Running}

Here the Coherence thread is not behaving as expected, and some investigation with thread dumps might help identify the issue. But first you need to identify which node to take the thread dumps from. The log above gives you the hints: "thread=DistributedCache, member=1" tells you the thread is DistributedCache and the member is 1.

305000 milliseconds, if I'm not mistaken, is about 5 minutes. Therefore taking around 15 thread dumps, 30 seconds apart, should be enough to analyse why, in this case, the DistributedCache thread is taking so long. Do not disregard network traffic either; some issues can be resolved by switching to Coherence unicast and well known addresses (WKA), whose settings are shown after the sketch below.
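
If you would rather script the collection than run jstack against the suspect member by hand, below is a minimal Java sketch that connects over remote JMX and dumps all of that member's threads every 30 seconds. The JMX host, port, and file names are placeholders, and it assumes the cache server was started with the standard com.sun.management.jmxremote flags.

import java.io.PrintWriter;
import java.lang.management.ManagementFactory;
import java.lang.management.ThreadInfo;
import java.lang.management.ThreadMXBean;
import javax.management.remote.JMXConnector;
import javax.management.remote.JMXConnectorFactory;
import javax.management.remote.JMXServiceURL;

public class ThreadDumpCollector {

    public static void main(String[] args) throws Exception {
        // placeholder JMX address of the suspect member (member=1 in the log above)
        String url = "service:jmx:rmi:///jndi/rmi://192.168.0.1:9999/jmxrmi";

        JMXConnector connector = JMXConnectorFactory.connect(new JMXServiceURL(url));
        try {
            ThreadMXBean threads = ManagementFactory.newPlatformMXBeanProxy(
                    connector.getMBeanServerConnection(),
                    ManagementFactory.THREAD_MXBEAN_NAME,
                    ThreadMXBean.class);

            // 15 dumps, 30 seconds apart, roughly covers the 305000 ms guardian window
            for (int i = 1; i <= 15; i++) {
                PrintWriter out = new PrintWriter("threaddump-" + i + ".txt");
                try {
                    for (ThreadInfo info : threads.dumpAllThreads(true, true)) {
                        out.println("\"" + info.getThreadName() + "\" state=" + info.getThreadState());
                        for (StackTraceElement frame : info.getStackTrace()) {
                            out.println("    at " + frame);
                        }
                        out.println();
                    }
                } finally {
                    out.close();
                }
                Thread.sleep(30000L);
            }
        } finally {
            connector.close();
        }
    }
}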

Settings for Unicast:


-Dtangosol.coherence.localhost=192.168.0.1
-Dtangosol.coherence.localport=8090
-Dtangosol.coherence.localport.adjust=true


Settings for Well Known Addresses:

-Dtangosol.coherence.wka=192.168.0.100
-Dtangosol.coherence.wka.port=8088
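
For context, these flags are simply passed on the JVM command line when starting each cache server; the classpath, addresses, and ports below are placeholders for your own environment.

java -cp coherence.jar \
     -Dtangosol.coherence.localhost=192.168.0.1 \
     -Dtangosol.coherence.localport=8090 \
     -Dtangosol.coherence.localport.adjust=true \
     -Dtangosol.coherence.wka=192.168.0.100 \
     -Dtangosol.coherence.wka.port=8088 \
     com.tangosol.net.DefaultCacheServer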



Lastly _ the dead end: after everything fails, you are done for... Naahhhh!!! At this point Coherence actually tries to follow one of these policies:

  • Shutting down the cluster service: 
The faulty node stops all of its cluster communication in an attempt to reset its distributed services. Depending on your logging level and the size of the cluster, this can be a pain to trace.
  • Shutting down the JVM:
I am not really experienced with this behaviour, but one thing is for sure: we would know exactly which node is the culprit. I understand that WLS's Node Manager can start Coherence cache servers, and also that Node Manager can restart WLS servers... hummm... I am not sure whether Node Manager would restart a killed cache server the same way, but anyway, below are some interesting links.

Start Coherence servers from the WLS Admin Console:
http://docs.oracle.com/cd/E28271_01/apirefs.1111/e13952/taskhelp/coherence/StartCoherenceServers.html
How Node Manager restarts Managed Servers:
http://docs.oracle.com/cd/E23943_01/web.1111/e13740/overview.htm#i1074986
 
  • Performing a custom action:


This option is for when you have known situations in which Coherence threads might take longer than expected, or when you would like more control over this feature by customising its behaviour. It is preferable to follow Coherence's documentation for these settings.
ref: http://docs.oracle.com/cd/E24290_01/coh.371/e22837/api_guardian.htm
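
Purely as an illustration, and assuming the com.tangosol.net.ServiceFailurePolicy interface described in the documentation linked above, a custom policy might look roughly like the sketch below; double-check the exact method signatures against the Javadoc for your Coherence version before using anything like it.

import com.tangosol.net.Cluster;
import com.tangosol.net.Guardable;
import com.tangosol.net.Service;
import com.tangosol.net.ServiceFailurePolicy;

// Hypothetical policy: log what the guardian is about to do, then let the
// normal recovery and termination steps proceed for the guarded thread.
public class LoggingFailurePolicy implements ServiceFailurePolicy {

    public void onGuardableRecovery(Guardable guardable, Service service) {
        System.err.println("Guardian soft time-out; attempting recovery of " + guardable);
        guardable.recover();
    }

    public void onGuardableTerminate(Guardable guardable, Service service) {
        System.err.println("Guardian hard time-out; terminating " + guardable);
        guardable.terminate();
    }

    public void onServiceFailed(Cluster cluster) {
        System.err.println("Service failed; leaving the corrective action to the operator");
    }
}

The custom class is then referenced from the service-guardian section of the operational override file, as described in the documentation above.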

Just know that Service Guardian is a relatively new feature of Coherence, introduced in version 3.5. The feature has reached good maturity in Coherence 3.7.1.x, so upgrading to the latest version of Coherence is a must if you want to avoid known defects. One last thing: in case you just do not want to go so deep into this feature, you can always disable it or raise the time-out value:

Shut down the Guardian:
-Dtangosol.coherence.guard.timeout=0
Raise the time-out, in milliseconds:
-Dtangosol.coherence.guard.timeout=700000
Issue heartbeats (or hard-code a time-out) from your own code:

import com.tangosol.net.GuardSupport;

// issue a heartbeat from the current Coherence worker thread
GuardSupport.heartbeat();

// or, ahead of a known long-running operation, push the time-out out by cMillis
// milliseconds, for example:
GuardSupport.heartbeat(600000L);
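
To tie this back to the write-behind CacheStore from the log output above, here is a rough sketch of where those heartbeats could go. The class name and the slow external call are made up for illustration; only the GuardSupport calls are the point.

import com.tangosol.net.GuardSupport;
import com.tangosol.net.cache.AbstractCacheStore;

// Hypothetical write-behind store whose backing system is known to be slow.
public class SlowSystemCacheStore extends AbstractCacheStore {

    public Object load(Object key) {
        return null; // illustration only; load from the backing system here
    }

    public void store(Object key, Object value) {
        // tell the guardian up front that this operation may need up to 10 minutes
        GuardSupport.heartbeat(600000L);

        writeToSlowExternalSystem(key, value);

        // once the slow part is done, go back to plain heartbeats
        GuardSupport.heartbeat();
    }

    private void writeToSlowExternalSystem(Object key, Object value) {
        // placeholder for the slow remote call
    }
}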










