For scalable, highly available middleware deployments, persisting session state to every instance in the cluster does not scale. Replicating each session to all instances generates substantial network traffic purely for state replication, leaving less bandwidth for a growing volume of application user requests. Sharing sessions across all instances is, at best, suited to small clusters handling a limited number of concurrent requests.

A better approach for gaining scaling advantages is buddy replication. In this approach, each instance selects one or more other instances in the cluster to which it replicates its sessions. This works well for fairly large deployments, but there are factors to consider in terms of the overhead the replication subsystem must absorb, at the cost of performance, particularly when large numbers of concurrent sessions are being created and later expired. One such overhead is that instances must form ring-like replica partnerships based on a defined order in which buddies are selected. When a buddy fails, there is the cost of re-adjusting and forming a new buddy relationship with a surviving instance, and when the failed instance recovers, of re-adjusting again so that some instance in the cluster adopts it as a replica partner. There is also a cost, each time the cluster shape changes, of dynamically updating any cookie information about replica locations that is sent back in the response headers to the load balancer; ideally that cost is avoided through more efficient means. A sketch of this ring-style buddy selection follows below.
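To make the ring-style partnership concrete, here is a minimal sketch, assuming a hypothetical BuddySelector class and a simple ordered membership list (these names are illustrative and not Sailfin's actual replication API): each instance picks the next live member in a fixed order as its replica partner, and must re-select whenever the cluster shape changes.

```java
import java.util.List;
import java.util.function.Predicate;

public class BuddySelector {

    // Ordered view of cluster members, e.g. ["instance1", "instance2", "instance3"].
    private final List<String> orderedMembers;

    public BuddySelector(List<String> orderedMembers) {
        this.orderedMembers = orderedMembers;
    }

    // Returns the next live instance after 'self' in ring order, or null if no
    // other member is alive. 'isAlive' would be backed by the cluster's group
    // membership service in a real deployment.
    public String selectBuddy(String self, Predicate<String> isAlive) {
        int start = orderedMembers.indexOf(self);
        for (int i = 1; i < orderedMembers.size(); i++) {
            String candidate = orderedMembers.get((start + i) % orderedMembers.size());
            if (isAlive.test(candidate)) {
                return candidate;
            }
        }
        return null; // no surviving buddy; sessions on 'self' are unprotected
    }
}
```

Every membership change forces a fresh selectBuddy() pass (and a re-shuffle of already replicated state), which is exactly the re-adjustment cost described above.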


In Sailfin 2.0, the scaling and reliability requirements of telco applications are typically high, and we needed a scalable approach that kept SIP Session Replication overhead low while delivering the added reliability and availability. We therefore used a consistent hashing algorithm to automatically assign a replica instance for each new session, leveraging the same consistent hashing mechanism that the Sailfin Converged Load Balancer (CLB) uses to proxy requests to a target instance based on a BEKey. For replication, this hashed-key selection is taken one step further: for each session, we pre-calculate the instance the CLB would most likely fail over to if the primary instance currently serving the session were to fail. That instance is the one the current primary replicates to. This yielded significant benefits: no cookie updates were required to carry replica-partner information, and no readjustment of replica partnerships was needed when an instance failed, since the hashing algorithm supplies another replica target with a single API call.
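The following is a minimal sketch of the general idea, assuming a simple single-point hash ring and a placeholder hash function (the class name, method names, and hash are illustrative assumptions, not the actual CLB or BEKey implementation): the primary owner of a session key is the first instance clockwise from the key's position on the ring, and the predicted failover target, which doubles as the replica, is simply the next instance on the ring.

```java
import java.util.SortedMap;
import java.util.TreeMap;

public class ConsistentHashReplicaSelector {

    // Hash ring: position -> instance name. Assumes at least two instances are registered.
    private final TreeMap<Integer, String> ring = new TreeMap<>();

    public void addInstance(String instanceName) {
        ring.put(hash(instanceName), instanceName);
    }

    public void removeInstance(String instanceName) {
        ring.remove(hash(instanceName));
    }

    // Primary owner: first instance at or after the key's position, wrapping around.
    public String primaryFor(String beKey) {
        return instanceAt(hash(beKey));
    }

    // Replica: the instance the key would fail over to if the primary died,
    // i.e. the next instance clockwise after the primary.
    public String replicaFor(String beKey) {
        Integer primaryPoint = ring.ceilingKey(hash(beKey));
        if (primaryPoint == null) {
            primaryPoint = ring.firstKey();
        }
        Integer nextPoint = ring.higherKey(primaryPoint);
        return ring.get(nextPoint != null ? nextPoint : ring.firstKey());
    }

    private String instanceAt(int point) {
        SortedMap<Integer, String> tail = ring.tailMap(point);
        return tail.isEmpty() ? ring.get(ring.firstKey()) : tail.get(tail.firstKey());
    }

    private int hash(String s) {
        return s.hashCode() & 0x7fffffff; // placeholder; a real ring uses a stronger hash
    }
}
```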

Because replica selection is based on a hashing algorithm with a predetermined failover target, it is dynamic and does not depend on instances being ready in a particular order before one can be chosen as the replication partner for another instance's sessions. More importantly, since failover lands on the specific instance where the replica data is already located, there is far less network overhead to locate a session in the cluster when a request within that session's scope reaches the CLB. This approach is therefore superior to buddy replication and helped us scale to higher throughput while sustaining a larger number of long-running sessions.
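Building on the sketch above, a hypothetical usage might look like the following; the instance names and key are made up, and the point is only that re-hashing the same key after the primary fails lands on the instance that was already chosen as the replica, so no cluster-wide session lookup is needed.

```java
// Hypothetical usage of the sketch above (instance names and key are made up).
ConsistentHashReplicaSelector selector = new ConsistentHashReplicaSelector();
selector.addInstance("instance1");
selector.addInstance("instance2");
selector.addInstance("instance3");

String beKey = "call-id-12345";               // stand-in for a CLB BEKey
String primary = selector.primaryFor(beKey);  // instance currently serving the session
String replica = selector.replicaFor(beKey);  // where the primary replicates to

selector.removeInstance(primary);             // the primary fails
String failoverTarget = selector.primaryFor(beKey);
// failoverTarget equals replica: the failed-over request arrives at the
// instance that already holds the session copy, with no cluster-wide search.
```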

It must be emphasized that system-level and application-server-level tuning and sizing remain essential to ensure sustained performance, scalability and reliability.

As always, we welcome your feedback and encourage you to try Sailfin and send us any inputs and questions you may have in this respect.

Sailfin Promoted Builds are available here: Sailfin Downloads
