Sailfin 2.0/Overload Protection Layer
| Version | Date | Comments | Author |
|---|---|---|---|
| 0.1 | 2009-01-26 | First version | Robert Handl |
| 0.2 | 2009-02-27 | Added http-olp and admin changes | Ramesh Parthasarathy |
| 0.3 | 2009-03-02 | Modified dynamic configuration support section to reflect the details | Ramesh Parthasarathy |
2009-01-26
The CPU-based Overload Protection mechanism available in SGCS protects the system from taking on too much load. Unfortunately, the protection mechanism is too sensitive and reacts to every small CPU spike. An algorithm is needed to smooth out the spikes and reduce the number of alarms generated.
Overload notification within SGCS is currently poor: only warnings are logged when overload is detected. A notification mechanism should be introduced so that clients can observe changes from normal load to overload and vice versa.
HTTP overload protection: current behavior
1. HTTP overload protection is implemented on top of the CLB, so it is available only in a system where the converged load balancer is present and enabled. This constraint exists because the OLP has to be invoked before the CLB in the HTTP request processing chain. Since the http-proxy in the CLB is implemented on top of the Grizzly connector, an HttpLayer interface was introduced to support request interception in the HTTP path. The OLP is present in both the FE and the BE and, because it runs before the CLB, it cannot tell whether it is on an FE or a BE; this information therefore cannot be used to choose an action during overload conditions. The OLP is also a standalone module: it is not aware of other instances in the cluster and cannot work collaboratively.
2. The OLP triggering algorithm is shared between the HTTP and SIP parts. When an overload situation occurs, a 503 (with a Retry-After header) is sent to the client. This was done because HTTP clients expect a response to every request they send and would otherwise be forced to time out. Unlike SIP, the response is sent on the same thread that is used for request processing.
The Ericsson Presence and Group Management application (PGM) has discovered that the current Overload Protection mechanism available in SGCS is not good enough for overload detection and reporting.
The main criticism is that the system is too sensitive to CPU measurements: once an overload is detected and an alarm is raised, the alarm can cease and be raised again repeatedly within a short period while the CPU load is oscillating heavily. It is fine for traffic acceptance and rejection to be toggled quickly, but the reporting should be less sensitive.
The SGCS and MMAS products also need a better separation of functionality.
Currently SGCS detects CPU and memory overload and rejects HTTP and SIP traffic when overloaded. The only reporting that exists today is a WARNING statement that SGCS adds to the log file. The MMAS Alarm handler duplicates the overload detection in its life cycle module in order to report alarms.
A future system should align the behaviour. SGCS should reject traffic and notify overload detection via a notification API stating what has caused the overload and the severity of the overload. A client (MMAS Alarm handler) could then register a notification listener to get overload notifications. The MMAS Alarm handler could e.g. filter incoming notifications and report MMAS alarms.
Http olp:
P1 (must have): The behavior described in (2) under Description is not aligned with SIP and is not acceptable when the system is running under maximum load and has no resources to spare for sending responses back to the client. The http-olp behavior during maximum overload therefore has to be aligned with the SIP behavior of not sending back any response and releasing the resources immediately, so as to help reduce the load on the system. A mechanism that ensures the threads are not released back to the pool immediately would also throttle further incoming requests automatically. The action taken during the overloaded phase should help protect the system from total failure.
Lower priority (nice to have): The dependency described in (1) under Description has to be eliminated so that the http-olp can be enabled and configured independently of the CLB in a system. This would support use cases where the OLP has to be available in a pure backend system.
To create a less sensitive overload protection mechanism when using the CPU-based algorithm.
Enhanced reporting of overload.
External systems such as MMAS can customize the reporting to their alarm handler.
The overload detection algorithm should be changed to provide two different modes: CONSECUTIVE and MEDIAN. CONSECUTIVE is the same as the current option, with the addition that samples below the threshold are also counted before the alarm is ceased. Currently the alarm is ceased as soon as one sample is below the threshold, which makes it extremely sensitive.
Two different algorithms for detecting an overload situation (and the end of it) should be implemented:
CONSECUTIVE – the configured number of samples all have to be above (or below) the threshold.
MEDIAN – the median value of the configured number of samples has to be above (or below) the threshold. If the number of samples is even, the median value is computed as the mean of the two middle values.
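The two modes above can be sketched in Java as follows. This is a minimal illustration, not the SGCS implementation: the class name `OverloadDetector`, its constructor parameters, and the sliding-window approach are assumptions made for the example. It also assumes that a sample equal to the threshold counts as "not above", and that in CONSECUTIVE mode a window with mixed samples leaves the current state unchanged.

```java
import java.util.ArrayDeque;
import java.util.Deque;

/** Hypothetical sketch of the CONSECUTIVE and MEDIAN modes; all names are illustrative. */
class OverloadDetector {
    enum Mode { CONSECUTIVE, MEDIAN }

    private final Mode mode;
    private final int sampleCount;   // configured number of samples
    private final double threshold;  // e.g. CPU load in percent
    private final Deque<Double> window = new ArrayDeque<>();
    private boolean overloaded = false;

    OverloadDetector(Mode mode, int sampleCount, double threshold) {
        if (sampleCount < 1) {
            throw new IllegalArgumentException("sampleCount must be >= 1");
        }
        this.mode = mode;
        this.sampleCount = sampleCount;
        this.threshold = threshold;
    }

    /** Adds one sample and returns the (possibly updated) overload state. */
    boolean addSample(double value) {
        window.addLast(value);
        if (window.size() > sampleCount) {
            window.removeFirst();
        }
        if (window.size() < sampleCount) {
            return overloaded; // not enough samples collected yet
        }
        if (mode == Mode.CONSECUTIVE) {
            boolean allAbove = true;
            boolean allBelow = true;
            for (double v : window) {
                if (v > threshold) {
                    allBelow = false;
                } else {
                    allAbove = false;
                }
            }
            if (allAbove) {
                overloaded = true;   // raise: every sample is above the threshold
            } else if (allBelow) {
                overloaded = false;  // cease: every sample is at or below it
            }
            // Mixed window: keep the previous state.
        } else {
            overloaded = median() > threshold;
        }
        return overloaded;
    }

    /** Median of the window; for an even count, the mean of the two middle values. */
    private double median() {
        double[] sorted = window.stream().mapToDouble(Double::doubleValue).sorted().toArray();
        int n = sorted.length;
        return (n % 2 == 1) ? sorted[n / 2] : (sorted[n / 2 - 1] + sorted[n / 2]) / 2.0;
    }
}
```

In this sketch a single low sample no longer ceases the alarm in CONSECUTIVE mode, because the state only changes once the whole window agrees.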
The Overload mechanism should be separated into a detection unit with a reporter which notifies all listeners of an overload event when overload is raised or ceased. The event will include the type of algorithm causing the overload and the traffic type (SIP, HTTP, etc). The action taken by the listener is up to each implementation of the listener. The rejection listener will reject or drop traffic. The logging listener will log warning statements. Example of other possible listeners: the JMX notification listener could send JMX notifications; the MMAS listener could report MMAS alarms, etc.
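A minimal Java sketch of the reporter and listener split described above. Only the names `OverloadEvent` and `OverloadListener` come from this document. The fields, the method `overloadChanged`, and the classes `OverloadReporter` and `LoggingListener` are assumptions made for illustration and do not describe the actual SGCS API.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.CopyOnWriteArrayList;

/** Hypothetical event shape; the field names are assumptions. */
final class OverloadEvent {
    enum State { RAISED, CEASED }

    final State state;
    final String algorithm;   // what detected the overload, e.g. "CPU" or "MEMORY"
    final String trafficType; // e.g. "SIP" or "HTTP"

    OverloadEvent(State state, String algorithm, String trafficType) {
        this.state = state;
        this.algorithm = algorithm;
        this.trafficType = trafficType;
    }
}

/** Hypothetical listener contract: one callback per raise or cease. */
interface OverloadListener {
    void overloadChanged(OverloadEvent event);
}

/** Hypothetical reporter that fans an event out to every registered listener. */
class OverloadReporter {
    // Copy-on-write so listeners can be added or removed while an event is delivered.
    private final List<OverloadListener> listeners = new CopyOnWriteArrayList<>();

    void addListener(OverloadListener listener) {
        listeners.add(listener);
    }

    void removeListener(OverloadListener listener) {
        listeners.remove(listener);
    }

    void report(OverloadEvent event) {
        for (OverloadListener listener : listeners) {
            listener.overloadChanged(event);
        }
    }
}

/** Example listener that records warning lines instead of writing to a real log. */
class LoggingListener implements OverloadListener {
    final List<String> lines = new ArrayList<>();

    @Override
    public void overloadChanged(OverloadEvent event) {
        lines.add("WARNING: overload " + event.state
                + " (" + event.algorithm + ", " + event.trafficType + ")");
    }
}
```

A rejection listener, a JMX listener, or an MMAS alarm listener would implement the same interface and decide for themselves how to act on each event, which keeps detection and reaction separate.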
OverloadEvent, OverloadListener
Update of existing Overload documentation
New configuration for the modes: CONSECUTIVE or MEDIAN
When HTTP maximum overload is reached, the response should be dropped. This is not possible in GlassFish today.
For MEDIAN mode, the SIP and HTTP Retry-After header should only have a fixed value.
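The HTTP behavior described in this document (a 503 with Retry-After under overload, no response at all under maximum overload) could be summarized as a small decision function. The sketch below is illustrative only: the class `HttpOverloadPolicy`, the two-threshold model, and the parameter names are assumptions, and it does not show how a response would actually be dropped inside GlassFish.

```java
/** Hypothetical sketch; the names and the two-threshold model are assumptions. */
final class HttpOverloadPolicy {
    enum Action { ACCEPT, REJECT_503, DROP }

    private final double overloadThreshold;    // above this: answer with 503
    private final double maxOverloadThreshold; // above this: send no response
    private final int retryAfterSeconds;       // fixed Retry-After value

    HttpOverloadPolicy(double overloadThreshold, double maxOverloadThreshold,
                       int retryAfterSeconds) {
        if (maxOverloadThreshold < overloadThreshold) {
            throw new IllegalArgumentException("max threshold must not be lower than overload threshold");
        }
        this.overloadThreshold = overloadThreshold;
        this.maxOverloadThreshold = maxOverloadThreshold;
        this.retryAfterSeconds = retryAfterSeconds;
    }

    /** Chooses the action for one incoming request given the current load. */
    Action decide(double load) {
        if (load > maxOverloadThreshold) {
            return Action.DROP;       // no resources to spare for a response
        }
        if (load > overloadThreshold) {
            return Action.REJECT_503; // client gets 503 plus Retry-After
        }
        return Action.ACCEPT;
    }

    /** Header line that would accompany a 503 response. */
    String retryAfterHeader() {
        return "Retry-After: " + retryAfterSeconds;
    }
}
```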
| Will this component work with a 64-bit JDK? | Yes |
| Will this component require configuration using a sun-specific deployment descriptor? If yes, please specify below the configuration elements needed | No |
| Issue No | Description | Comments | Resolution |
|---|---|---|---|
| 1 | How to configure the parameters of the SGCS Overload Protection layer and the MMAS Alarm life cycle module in a consistent manner? | | |