Hi,



I need some quick insight on a problem that I am currently facing.



The environment is like this:



3 application servers (WAS ND v6.1 / AIX), managed by a dmgr. One of the application servers resides on the same host as the dmgr. All of the application servers are members of one cluster.



Dmgr01 - 10.2.1.7
AppSrv01 - 10.2.1.7
AppSrv02 - 10.2.1.8
AppSrv03 - 10.2.1.10

2 web servers (IHS / AIX) - 10.2.1.3 / 10.2.1.4



Applications have been deployed on the cluster and normally everything is fine: I can access the applications through the web servers with no problem. Tested with snoop, the requests are directed to the application servers in turn, on a round-robin basis.



Now here's the problem: at times, I can't access the application through the web servers. When I tried to access it directly on WAS, only AppSrv01 and AppSrv02 responded. AppSrv03 just kept 'loading' forever. AppSrv03's status is up and started and port 9082 is reachable, but the application won't load. What also bothers me is that the web servers are supposed to redirect requests to the other available application servers when one of them fails to serve the application, yet they seem to keep hitting AppSrv03.
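
For reference, here is a minimal sketch of how the direct check above could be scripted so that a hung server shows up as a timeout instead of a browser that loads forever. It is only an illustration: the ports for AppSrv01 and AppSrv02 and the /snoop path are assumptions (only AppSrv03's port 9082 is known), and the 10-second timeout is an arbitrary choice.

```python
import socket
import urllib.error
import urllib.request

# Hypothetical targets: only AppSrv03's port (9082) comes from the post.
# The other ports and the /snoop path are assumptions; adjust as needed.
TARGETS = {
    "AppSrv01": "http://10.2.1.7:9080/snoop",
    "AppSrv02": "http://10.2.1.8:9081/snoop",
    "AppSrv03": "http://10.2.1.10:9082/snoop",
}

# Ignore any proxy settings so the request really goes straight to WAS.
_opener = urllib.request.build_opener(urllib.request.ProxyHandler({}))


def probe(url, timeout=10.0):
    """Send one HTTP GET and return a short status string."""
    try:
        with _opener.open(url, timeout=timeout) as resp:
            resp.read()
            return "OK (HTTP %d)" % resp.status
    except urllib.error.HTTPError as exc:
        # The server answered, but with an error status.
        return "HTTP error %d" % exc.code
    except socket.timeout:
        # Connected, but no response arrived in time (the 'loading forever' case).
        return "TIMEOUT after %.1fs" % timeout
    except urllib.error.URLError as exc:
        if isinstance(exc.reason, socket.timeout):
            return "TIMEOUT after %.1fs" % timeout
        return "FAILED (%s)" % exc.reason
    except OSError as exc:
        return "FAILED (%s)" % exc


if __name__ == "__main__":
    for name, url in TARGETS.items():
        print("%-9s %s" % (name, probe(url)))
```

Running this in a loop while the problem is occurring would show whether AppSrv03 accepts the connection and then never answers (TIMEOUT) or refuses it outright (FAILED).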



I tried to ping AppSrv03, and the results were consistent (though I vaguely remember not getting any response from it once or twice).



One additional piece of information on AppSrv03: the server was previously used as a load balancer (Edge). I checked today and noticed that there is a second IP address aliased to the network interface en2. Since there are 2 IP addresses attached to the NIC, the routing table is also affected. The last few times I configured Edge servers, I remember having to edit the routing table, because adding an IP alias to the network interface (not sure whether it was lo0 or enX) led to some network problems. Could this contribute to the problem that I'm facing?
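
One way the alias question could be checked is to ask the OS which local address it would use when talking to each web server. The sketch below does that with a connected UDP socket, which sends no packets and only makes the kernel pick a route and source address. It assumes Python is available on AppSrv03, which may not be the case, and the function name is just illustrative.

```python
import socket


def source_ip_for(dest_ip, dest_port=80):
    """Return the local IP the OS would use as source when reaching dest_ip.

    Connecting a UDP socket transmits nothing; it only binds the socket
    to the source address chosen by the routing table.
    """
    s = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    try:
        s.connect((dest_ip, dest_port))
        return s.getsockname()[0]
    finally:
        s.close()


if __name__ == "__main__":
    # Web server addresses from the environment description above.
    for web in ("10.2.1.3", "10.2.1.4"):
        print("%s is reached from %s" % (web, source_ip_for(web)))
```

If this is run on AppSrv03 and prints something other than 10.2.1.10, the aliased address is being preferred for traffic to the web servers, which would make the leftover Edge configuration a likely suspect.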



I have checked the web servers' logs. The error_log has recorded numerous occurrences of the following warnings / errors:



child process 3759 did not exit, sending SIGTERM

child process 3759 did not exit, sending SIGKILL



and also "SIGHUP received. Attempting to restart", but I think this is due to the rotatelogs that I just configured.



QUESTION: Why is the application intermittently not accessible from the web servers (and even directly on one of the application servers), when all of the application servers are up and running? The plugin-cfg.xml and httpd.conf files have been reviewed and both looked fine to me.
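
To make the plugin-cfg.xml review less of an eyeball exercise, a small script along these lines could list the failover-related settings per cluster member. It is a sketch only: it relies on the ServerCluster attributes LoadBalance and RetryInterval and the Server attributes ConnectTimeout and ServerIOTimeout, and the warning rule (missing or 0 means the plug-in may wait without a limit) is my assumption about the plug-in's behaviour, not something I have verified for this version.

```python
import sys
import xml.etree.ElementTree as ET


def _unbounded(value):
    # Assumption: a missing attribute or "0" means no explicit time limit.
    return value is None or value.strip() == "0"


def audit(root):
    """Return one dict per cluster member with its timeout settings."""
    rows = []
    for cluster in root.iter("ServerCluster"):
        # findall() looks at direct children only, so the name-only
        # <Server> references under <PrimaryServers> are skipped.
        for server in cluster.findall("Server"):
            warnings = []
            if _unbounded(server.get("ConnectTimeout")):
                warnings.append("no ConnectTimeout")
            if _unbounded(server.get("ServerIOTimeout")):
                warnings.append("no ServerIOTimeout")
            rows.append({
                "cluster": cluster.get("Name"),
                "load_balance": cluster.get("LoadBalance"),
                "retry_interval": cluster.get("RetryInterval"),
                "server": server.get("Name"),
                "connect_timeout": server.get("ConnectTimeout"),
                "io_timeout": server.get("ServerIOTimeout"),
                "warnings": warnings,
            })
    return rows


if __name__ == "__main__":
    # Usage: python audit_plugin.py /path/to/plugin-cfg.xml
    for row in audit(ET.parse(sys.argv[1]).getroot()):
        print(row)
```

If AppSrv03 shows up without an I/O timeout, that would be consistent with the web servers waiting on it instead of marking it down and moving on, since the port still accepts connections.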



Ideas? Any advice is highly appreciated.



Thanks in advance.



P.S. I don't have any logs with me at the moment, otherwise I would attach them for reference.

P.P.S. I'm posting this on both the WAS and IHS forums, since I think it fits both.