Of late I have been examining an intermittent error situation at a large OpenAM implementation. Several aspects of the implementation are not in accordance with the vendor's recommendations, or even good practice. But that has afforded some very good learning opportunities, as is often the case when straying from the preferred path.
The deployment there consists of four OpenAM instances and four peer-to-peer replicated OpenDJ instances for data storage.
OpenAM keeps policy, configuration and session information in a directory, OpenDJ in this scenario. For obvious reasons it is recommended to keep the session store (it takes more load and does not need a backup) separate from the policy and configuration store.
An early clue was that the error conditions were resolvable with restarts. Clearly this was not fundamentally a policy matter: policies persist across restarts. Sessions, however, do not. Following up on this, it was discovered that the session store and policy store were collocated.
OpenAM caches session token validation operations, so there is no need to ask the directory every time a session token is to be validated. But this is an object cache, not a search cache. A session object can be cached on the OpenAM instance, but an attempt to validate a non-existent token results in a search against the underlying session object store. Every time. Examination of the search logs on the object store revealed that between 12% and 40% of token search operations were for tokens that did not exist.
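Putting a number on that only takes the OpenDJ access logs. The sketch below is just that, a sketch: it assumes the OpenDJ 2.x file-based access log format and that session tokens are stored with a coreTokenId RDN, and the log path is a placeholder, so adjust everything to your own deployment.

    LOG=/path/to/opendj/logs/access

    # Base-level searches for session tokens; the REQ line carries the base DN.
    total=$(egrep -c 'SEARCH REQ.*base="coreTokenId=.*scope=baseObject' "$LOG")

    # Searches that hit a non-existent base entry (LDAP result 32, noSuchObject).
    # This assumes the RES line quotes the missing DN in its message; if yours
    # does not, join REQ and RES lines on the conn= and op= fields instead.
    missing=$(egrep 'SEARCH RES.*result=32' "$LOG" | egrep -c 'coreTokenId=')

    echo "$missing of $total token searches were for tokens that did not exist"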
It gets worse. Since the customer had declined to implement stickiness on the load balancer in front of the multiple OpenAM instances, users were sprayed across all of them. This leads to an excessive number of session token modification operations, quite a large number of them in fact; many tokens were modified hundreds of times. And with all modifications replicated across all of the directory store instances, the load on the OpenDJ instances is considerable.
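A quick, equally hedged way to see the modification storm is to count MODIFY operations per token in the same access log, again assuming the coreTokenId RDN and that the MODIFY REQ line carries the token DN:

    # Top session tokens by number of modifications seen in this access log.
    egrep 'MODIFY REQ.*dn="coreTokenId=' /path/to/opendj/logs/access \
      | sed 's/.*coreTokenId=\([^,]*\),.*/\1/' \
      | sort | uniq -c | sort -rn | head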
But wait, it gets worse still. ForgeRock is quite adamant that there can be no load balancing going on between the OpenAM instances and the directories used as session store. In this design there are four OpenDJ instances used as session store, one designated for each OpenAM instance. Should one OpenDJ instance become unavailable, “its” OpenAM instance will use one of the others, but otherwise each OpenAM instance will use its designated OpenDJ instance for all session token operations.
One can of course argue about what constitutes load balancing, but I think of it as being chiefly about load distribution, with or without stickiness. In this design I would describe the OpenAM to OpenDJ connection as load balanced with permanent stickiness.
Does this matter? Ideally, no: in a replicated environment all the OpenDJ instances will be equal. But not always. There will always be some replication delay. The problem here was that users were distributed across all OpenAM instances, meaning that operations on the same token could arrive at two, or more, different OpenAM instances practically simultaneously. Replication is fast in a good network and with good indexing, but it does not take zero time. If different OpenAM instances use different OpenDJ instances as session token store, there will be errors. Probably not many, but likely to increase in frequency with increasing system load.
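Whether replication is keeping up can be checked on the OpenDJ side; dsreplication status reports missing changes and their age per replica. Something along these lines, where the host name, admin port and credentials are placeholders:

    /opt/opendj/bin/dsreplication status \
      --hostname opendj1.example.com --port 4444 \
      --adminUID admin --adminPassword password \
      --trustAll --no-prompt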
Ok. How to determine this? First: have plenty of logging on the OpenDJs. The search requests for session tokens are easy to identify in the OpenDJ access log. They are base-level searches and therefore simple to extract reliably with an egrep statement. The same goes for failures to find a token; they are in fact directly tied to those base-level searches. A base-level search when the base object does not exist produces an error to that effect, and in this case the access log clearly states that the session token object is not present. Very convenient. And egrep-able.
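In practice that boils down to something like the lines below, with the same caveats as before about the log format, the coreTokenId RDN and the placeholder path; the second egrep relies on the result=32 response quoting the missing DN, which is worth verifying against your own logs first.

    LOG=/path/to/opendj/logs/access

    # Every token ID that was the target of a base-level search.
    egrep 'SEARCH REQ.*base="coreTokenId=.*scope=baseObject' "$LOG" \
      | sed 's/.*coreTokenId=\([^,]*\),.*/\1/' | sort -u > searched.txt

    # Token IDs the server reported as not present (LDAP result 32).
    egrep 'SEARCH RES.*result=32.*coreTokenId=' "$LOG" \
      | sed 's/.*coreTokenId=\([^,]*\),.*/\1/' | sort -u > notfound.txt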
So it is easy to get a list of the tokens being searched for and which ones were not found. Extract these lists from all OpenDJ instances within the same time period – say one hour – and determine if there were any failed token searches on one instance that succeeded on any of the others. There were.
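With the searched/not-found lists from the previous step collected per instance for the same hour (the dj1 to dj4 file names are mine), the overlap is a couple of comm commands away:

    # Tokens each of the other instances did find: searched minus not-found.
    for dj in dj2 dj3 dj4; do
      comm -23 "searched.$dj.txt" "notfound.$dj.txt"
    done | sort -u > found.elsewhere.txt

    # Tokens dj1 failed to find that at least one other instance found.
    comm -12 notfound.dj1.txt found.elsewhere.txt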
We all gripe about vendor documentation, and ForgeRock has had some doozies. But there is usually good advice in there. Don’t ignore it without very good reason backed up by observational data.
The short and sweet on OpenAM: play nice with your session handling.