This is a wide-ranging topic that is often discussed during Forum Sentry Deployment reviews. This article provides general recommendations and suggestions for identifying and resolving latency/performance issues and unexpectedly high system resource usage (CPU/RAM).
To schedule a Deployment Review, please review:
Determining if there are Capacity Issues
Monitoring CPU utilization is the best indicator of throughput capacity. If the CPU is consistently near 75%, some policy adjustments or additional Sentry instances may be required. SNMP polling can be used to automate monitoring of the CPU and memory.
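The 75% guideline above can be automated once utilization samples are being collected. The sketch below assumes the samples have already been gathered (e.g. via SNMP polling of the appliance) and only shows the alerting logic; the threshold and sampling window are illustrative.

```python
# Sketch: flag sustained high CPU from a series of polled utilization samples.
# The 75% threshold comes from the guidance above; the sampling mechanism
# (e.g. SNMP polling of the Sentry instance) is assumed to happen elsewhere.

def sustained_high_cpu(samples, threshold=75.0, min_fraction=0.9):
    """Return True if at least min_fraction of samples are at/above threshold."""
    if not samples:
        return False
    high = sum(1 for s in samples if s >= threshold)
    return high / len(samples) >= min_fraction

# A steady run near 80% trips the check; an occasional spike does not.
print(sustained_high_cpu([78, 81, 80, 79, 83, 77]))   # True
print(sustained_high_cpu([20, 25, 90, 22, 18, 24]))   # False
```

Requiring a fraction of the window, rather than a single reading, avoids alerting on momentary spikes that garbage collection or a single large transaction can cause.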
Benchmarking the expected service response times for an end-to-end transaction, and monitoring those response times through automated testing (e.g. using SOAPSonar to test policies), can identify potential bottlenecks before they cause any issues.
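A simple way to act on those benchmarks is to compare each test run's latency percentiles against the benchmarked baseline. The latencies below are simulated; in practice they would come from the automated test client exercising a policy, and the baseline value is an assumption for the example.

```python
# Sketch: compare measured end-to-end response times against a benchmark.
# The sample latencies are simulated; a real run would collect them from an
# automated test client calling through the Sentry policy.
import statistics

def latency_report(latencies_ms, baseline_p95_ms):
    """Summarize a latency sample and flag drift from the benchmarked p95."""
    latencies = sorted(latencies_ms)
    p95 = latencies[int(0.95 * (len(latencies) - 1))]
    return {
        "p50_ms": statistics.median(latencies),
        "p95_ms": p95,
        "regressed": p95 > baseline_p95_ms,
    }

sample = [110, 120, 115, 130, 125, 118, 122, 400, 119, 121]
report = latency_report(sample, baseline_p95_ms=200)
print(report)
```

Tracking the 95th percentile rather than the average is deliberate: a single slow outlier (like the 400 ms transaction above) is often the first sign of an intermittent backend or identity-call bottleneck.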
For more information see:
High Memory Usage
In some cases, Sentry administrators may want to understand why a system is showing higher-than-expected memory usage and determine whether there is a way to lower it. There are valid reasons for high memory usage, as well as a few tuning measures to try to lower the usage stats. Several potential reasons for high memory usage, and their solutions, are listed in this article.
Historically speaking, memory usage issues (leaks, etc.) are very rare with Forum Sentry. In fact, many Sentry instances (hardware or virtual) typically run for several months (or years) without being rebooted.
Ultimately, it is important to watch the memory usage stats over an extended period of time, during which you should see regular spikes and drops due to garbage collection. You should expect consistent memory usage stats (whatever they are) if other factors don't change (i.e. no change in load, no new services, etc.).
If memory usage climbs to an alerting level and never drops back down, this may be a signal that it is time to scale - or that there are hung threads or other issues and a reboot is required (see below).
For software instances of Sentry, it may be necessary to update the config.properties file to ensure Sentry has access to the available memory of the host system.
In production environments, DEBUG logging should only be enabled on individual policies when troubleshooting specific issues. This can be done at the local policy level, but you first need to set the Global Logging Level to DEBUG while keeping the global SYSTEM setting at INFO.
The Audit log setting should always be set to DEBUG to ensure all administrative changes that occur on the system are captured and logged.
If high memory and/or CPU usage or performance/latency is a consistent issue, review the logging thresholds in use.
Forum Sentry utilizes the ClamAV scanning engine, which is built into all ForumOS variants (hardware appliance, virtual appliance, AMI, and Azure VM). If virus scanning causes performance degradation, you can turn scanning off at the global level and enable it selectively only on policies where there is the potential for BASE64 or binary data to be processed.
As an example, for inbound flows (external clients calling into your own services), consider turning off antivirus scanning of the response payloads, as this is data leaving your internal systems and may not require virus scanning.
Max Threads (Concurrent Transactions)
With both hardware and virtual instances there is a "max threads" limit in Sentry. This limit is the number of concurrent transactions that can be processed. A single transaction (thread) handles the full request/response processing for a typical synchronous message flow. In the Sentry system log at debug level this is essentially everything between the "document entered communication layer" to "document left communication layer" messages.
The default "max threads" setting is 4096. The count can be set from 8 to 16384 with the CLI command "system config max-threads". You can view the setting with "show max-threads". The Performance Monitor page of the WebAdmin shows statistics for thread counts.
Modifying the "max threads" setting without first consulting Forum Support is not recommended, as extensive testing by Forum Systems has determined that 4096 is the optimal setting for the majority of Sentry use cases.
Increasing the thread count does not necessarily increase the throughput of Sentry or lower system resource usage. If there are long running or hung threads, it is best to identify what is causing them and determine whether there are better resolutions than simply increasing the thread count in Sentry.
Increasing the thread count to resolve a resource or latency problem may in fact have the opposite effect. For instance, if the root cause of the problem is the remote server having issues processing heavy volume, increasing the threads would result in Sentry sending more data to an already inundated remote system.
If Sentry is maxing out threads because of a problem with the remote server, increasing the threads may hide the issue temporarily but this does not resolve the root cause of the problem. In these cases, it is usually best to send a proper error back to the client in a timely manner, rather than let them time out while Sentry waits for the remote server.
If the volume is such that the thread count is consistently high every day under normal conditions, that is a sign that the system is near capacity and additional Sentry instances may be necessary.
For more information see: Best Practices: Future Capacity Planning for Forum Sentry
In general, longer running transactions (large files and/or slow remote servers) will cause higher thread usage. As each thread takes some system resources, the higher the number of active threads the higher the resource usage.
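The relationship between transaction duration and thread usage can be estimated with Little's Law (concurrency ≈ arrival rate × average time in system). This is a general queueing-theory estimate, not a Sentry-specific formula, and the numbers below are illustrative.

```python
# Sketch: estimate concurrent thread demand with Little's Law.
# concurrency ~= arrival rate (req/s) x average end-to-end latency (s).

def expected_active_threads(requests_per_sec, avg_latency_sec):
    return requests_per_sec * avg_latency_sec

# 200 req/s at 0.5 s average end-to-end latency needs ~100 active threads.
print(expected_active_threads(200, 0.5))   # 100.0

# The same 200 req/s against a slow backend averaging 20 s per call would
# demand ~4000 threads -- right at the default 4096 limit.
print(expected_active_threads(200, 20))    # 4000.0
```

This makes the earlier point concrete: a 40x slowdown in the remote server produces a 40x increase in active threads at the same request rate, which is why fixing slow backends (or timing them out) beats raising the thread limit.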
To prevent faulty remote servers from causing hung threads (Sentry waiting forever for a response) ensure there are timeouts enabled on the remote policies.
If you see a number of active threads but there is no traffic being routed to Sentry, there may be hung threads. To clear any "hung" threads, you'll need to reboot the system.
Identity and Other External Calls
A common cause of abnormally long running threads (latency on a transaction and a cause of high system resource usage) is a delayed identity call or other dependency API call.
For example, a use case may require Sentry to call out to an LDAP server to validate user credentials provided by a client. If the LDAP server is slow to return results, this will delay the transaction.
Other use cases may require Sentry to call out to external systems for things like certificate revocation checking, database queries, and/or custom payload scanning/processing via an Access Control policy or an Enrich Message task.
General troubleshooting and performance tuning should include both inspection and optimization of any external calls made by Sentry (e.g. LDAP, database, CRL, API, etc.).
Not only do delays with these calls cause higher latency, but while Sentry is waiting the thread remains active, and each active thread consumes some system resources.
It is important to note that Sentry supports caching in a variety of ways to help avoid these types of problems.
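The value of caching here can be illustrated in principle with a small TTL cache (this sketch is not Sentry's implementation; the lookup function, key, and 300-second TTL are assumptions for the example).

```python
# Sketch of the caching idea in principle (not Sentry's implementation):
# cache the result of an expensive external lookup (e.g. an LDAP credential
# check) for a short TTL so repeated transactions skip the remote call.
import time

class TTLCache:
    def __init__(self, ttl_seconds):
        self.ttl = ttl_seconds
        self.store = {}  # key -> (value, expiry timestamp)

    def get_or_fetch(self, key, fetch):
        now = time.monotonic()
        hit = self.store.get(key)
        if hit and hit[1] > now:
            return hit[0]            # fresh cached result: no external call
        value = fetch(key)           # the slow external call happens here
        self.store[key] = (value, now + self.ttl)
        return value

calls = []
def slow_lookup(user):
    calls.append(user)               # stand-in for an LDAP round trip
    return f"ok:{user}"

cache = TTLCache(ttl_seconds=300)
print(cache.get_or_fetch("alice", slow_lookup))  # external call made
print(cache.get_or_fetch("alice", slow_lookup))  # served from cache
print(len(calls))                                # only one external call
```

The trade-off is freshness: a revoked credential stays valid until its cache entry expires, so TTLs on identity lookups should be kept short.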
Similar to outbound identity and API calls, there may be times when Sentry needs to query a DNS server. A slow (or unresponsive) DNS server may cause the transaction in Sentry to be delayed.
DNS queries can happen in unexpected circumstances. For instance, an SSL connection may require a DNS query depending on how the policy is configured.
Sentry includes DNS lookup utilities in both the CLI and WebAdmin interface. DNS caching is configurable both globally and per policy. Static host entries can be added via the CLI.
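When diagnosing suspected DNS delays, it can help to time a lookup from a host on the same network segment as Sentry. The sketch below uses "localhost" so the example runs anywhere; substitute the remote hostname a policy actually resolves.

```python
# Sketch: time a DNS lookup to check resolver responsiveness from a host
# near the Sentry instance. "localhost" is a stand-in for the hostname a
# policy actually resolves.
import socket
import time

def timed_lookup(hostname):
    start = time.monotonic()
    results = socket.getaddrinfo(hostname, None)
    elapsed_ms = (time.monotonic() - start) * 1000
    return results, elapsed_ms

results, elapsed_ms = timed_lookup("localhost")
print(f"{len(results)} records in {elapsed_ms:.1f} ms")
```

If lookups from a neighboring host are slow, the resolver (not Sentry) is the bottleneck, and the DNS caching or static host entries mentioned above are the appropriate mitigations.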
The base Forum Sentry configuration includes several IDP rules that are enabled for both inbound (request) and outbound (response) payload processing. The IDP rules are extensible and customizable.
Some IDP rules may be more resource intensive than others. For instance, virus scanning or regex pattern matching payloads may contribute to latency or higher than normal resource usage.
General troubleshooting and performance tuning should include inspection of the ClamAV virus scanning settings.
For outbound flows, where you control the request data, you might consider disabling some of the IDP rules on the requests.
Consider enabling rate throttling to prevent DoS attacks.
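The kind of limit a rate-throttling policy enforces can be sketched with a token bucket (this illustrates the principle only, not Sentry's internal implementation; the rate and burst values are arbitrary).

```python
# Sketch of rate throttling in principle: a token bucket that admits a
# burst of requests and then refills at a steady rate. Not Sentry's
# internal implementation.
import time

class TokenBucket:
    def __init__(self, rate_per_sec, burst):
        self.rate = rate_per_sec
        self.capacity = burst
        self.tokens = float(burst)
        self.last = time.monotonic()

    def allow(self):
        now = time.monotonic()
        # Refill tokens for the time elapsed since the last decision.
        self.tokens = min(self.capacity, self.tokens + (now - self.last) * self.rate)
        self.last = now
        if self.tokens >= 1:
            self.tokens -= 1
            return True
        return False

bucket = TokenBucket(rate_per_sec=10, burst=5)
decisions = [bucket.allow() for _ in range(8)]
print(decisions)   # the initial burst passes, later requests are throttled
```

Throttling in this way sheds excess load early, which keeps thread counts down during a flood instead of letting every attacker connection occupy a thread.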
Health Check Monitoring
Ensure that any health check monitoring of the Sentry instances performed by a load balancer or other device isn't inadvertently using too many resources or leaving open connections. The health checks should ideally not be simple TCP socket connections (Layer 4), but rather Layer 7 application health checks that send valid data requests to the back-end and receive valid responses.
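The difference between the two styles of check can be sketched as follows. A throwaway local HTTP server stands in for the Sentry-fronted service, and the /health path and "OK" body are assumptions for the example; a real load balancer would be configured with whatever request and expected response the back-end actually supports.

```python
# Sketch of a Layer-7 health check: send a real request and validate the
# response status and body, rather than just opening a TCP socket.
import http.server
import threading
import urllib.request

class Health(http.server.BaseHTTPRequestHandler):
    def do_GET(self):
        body = b"OK"
        self.send_response(200)
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)
    def log_message(self, *args):   # keep the example quiet
        pass

server = http.server.HTTPServer(("127.0.0.1", 0), Health)
threading.Thread(target=server.serve_forever, daemon=True).start()

def healthy(url, expected=b"OK", timeout=2):
    # A bare TCP connect would succeed even if the app returned garbage;
    # checking status and body exercises the full application layer.
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return resp.status == 200 and resp.read() == expected
    except OSError:
        return False

url = f"http://127.0.0.1:{server.server_port}/health"
result = healthy(url)
server.shutdown()
print(result)
```

Note the timeout on the request: a health check without one can itself become a source of hung connections against a stalled back-end.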