
Token based API access control has a critical flaw

I’ve written on this subject before. I come back to it because I keep coming across this problem. Granted, whether or not it is a problem depends on how you look at things. I look at it from the point of view of robustness and fail-safety.

I work with programmers quite a bit, and their job is to deliver functionality out the door. Which is understandable: that is what they are measured on.

Then I come along later and try to track down a problem. Over-privileged service accounts, which is what these API access tokens amount to, are a perennial problem. Leaving aside the security aspect for the moment: a service account or an API access token can be poorly implemented in two main ways. It can be given too many access privileges; let’s call that a Type 1 fault. Or it can be shared by too many different entities; call that Type 2. I’ve seen plenty of both. Okta is a good (bad) example of the former. Their SaaS REST API only grants one kind of access token. To everything. A Type 1 fault as bad as can be.

Programmers using the Okta API can then keep reusing the same access token – it is just a text string after all – for anything, rather than going through the rigmarole of issuing another. Yes, it happens. A lot. Type 2 fault. API access tokens are usually valid for a long time since they are meant to be used by other applications, hence the service account equivalence. They can be selectively revoked, and at least in Okta there is a good overview of currently valid tokens. But not of what has been going on with them and what they have been used for.
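
To make it concrete, here is a minimal sketch in Python. The org URL and token value are made up; the SSWS authorization scheme is the one Okta documents for API tokens.

import requests

# Hypothetical org URL and token value; the "SSWS" scheme is how Okta
# API tokens are passed.
OKTA_ORG = "https://example.okta.com"
API_TOKEN = "00abc..."   # one opaque string, full API access (Type 1)

HEADERS = {"Authorization": f"SSWS {API_TOKEN}"}

# Nothing in the API scopes this token down: the same string reads
# users, groups and the system log alike.
users = requests.get(f"{OKTA_ORG}/api/v1/users", headers=HEADERS)
logs = requests.get(f"{OKTA_ORG}/api/v1/logs", headers=HEADERS)

# Type 2 is the same string copy-pasted into a second, unrelated
# application; the server cannot tell the two callers apart.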

In Unix/Linux terms these API tokens are like having multiple root accounts that have been shared with people you do not know and that are being used for unknown things. No one would call that a satisfactory state of affairs on a *nix server system. But it is rarely expressed in those terms.

The root account comparison is apt. On Unix-like systems the root user is not subject to OS-level access control. Access is unlimited. Switching to any other user, however souped-up, is better. But it involves actually implementing some access control: groups and permissions on a Unix-like system. Which can be laborious – and worse, error prone. So let’s just give it root – at least it will work.

Yes, laziness. The malicious user’s/intruder’s best friend. API access tokens are employed on a great many platforms, but since we call them API access tokens rather than user/service accounts, little or no scrutiny is ever applied to them.

Now we have these API tokens with god-like system powers. Revocable, sure, but does that ever happen?

Funny symptoms on OpenAM when session handling is under strain

Of late I have been examining an intermittent error situation at a large OpenAM implementation. Several aspects of the implementation are not in accordance with the vendor’s recommendations, or even good practice. But that has afforded some very good learning opportunities. As is often the case when straying from the preferred path.

The deployment has four OpenAM instances and four peer-to-peer replicated OpenDJ instances for data storage.

OpenAM keeps policy, configuration and session information in a directory; OpenDJ in this scenario. For obvious reasons it is recommended to keep the session store (it takes more load and does not need a backup) separate from the policy and configuration store.

An early clue was that the error conditions were resolvable with restarts. Clearly this was not fundamentally a policy matter: policies persist across restarts. Sessions, however, do not. Following up on this, it was discovered that the session store and policy store were collocated.

OpenAM caches session token validation operations, so there is no need to ask the directory every time a session token is to be validated. But this is an object cache, not a search cache. The session object can be cached on the OpenAM instance, but an attempt to validate a non-existent token will result in a search against the underlying session object store. Every time. Examination of the search logs on the object store revealed that between 12% and 40% of token search operations were for tokens that did not exist.
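
The distinction is easy to show in code. A minimal sketch, not OpenAM’s actual implementation, of an object cache that by design has nothing to say about tokens it has never seen:

class SessionObjectCache:
    """Minimal sketch (not OpenAM's implementation) of an object cache
    in front of a directory store."""

    def __init__(self, store):
        self.store = store       # backing directory, e.g. OpenDJ
        self.cache = {}          # token -> session object

    def validate(self, token):
        if token in self.cache:             # known token: answered locally
            return self.cache[token]
        obj = self.store.search(token)      # unknown token: LDAP search,
        if obj is not None:                 # every single time
            self.cache[token] = obj         # only existing objects cached;
        return obj                          # misses are never remembered

Every validation attempt with a non-existent token is thus a guaranteed directory search, which is what the 12% to 40% figure reflects.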

It gets worse. Since the customer had declined to implement stickiness on the load balancer in front of the multiple OpenAM instances, users were sprayed across all of them. This leads to an excessive number of session token modification operations; many tokens were modified hundreds of times. And with all modifications replicated across all of the directory store instances, the load on the OpenDJ instances is considerable.
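
That is checkable from the access logs too. A sketch counting modifications per token entry, with the log line shape assumed from a typical OpenDJ access log and a hypothetical file name:

import re
from collections import Counter

# Assumed log line shape: MODIFY REQ conn=.. op=.. dn="coreTokenId=...,ou=..."
MOD = re.compile(r'MODIFY REQ .*dn="([^"]+)"')

mods = Counter()
with open("dj-a-access.log") as log:      # hypothetical file name
    for line in log:
        m = MOD.search(line)
        if m:
            mods[m.group(1)] += 1

# Tokens modified hundreds of times stand out at the top.
for dn, count in mods.most_common(10):
    print(count, dn)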

But wait, it gets worse still. ForgeRock is quite adamant that there must be no load balancing between OpenAM and the directories used as session store. In this design there are four OpenDJ instances used as session store, one designated for each OpenAM instance. Should one OpenDJ instance become unavailable, “its” OpenAM instance will use one of the others, but otherwise each OpenAM instance will use its designated OpenDJ instance for all session token operations.
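
A sketch of what that selection amounts to; the class is hypothetical, but the shape is the one ForgeRock’s guidance implies, designated instance first and the others only as failover:

class SessionStoreSelector:
    """Hypothetical sketch: one designated OpenDJ instance per OpenAM
    instance, the others used only as failover."""

    def __init__(self, designated, alternates):
        self.designated = designated
        self.alternates = alternates

    def pick(self):
        if self.designated.is_available():
            return self.designated       # permanent stickiness in practice
        for alt in self.alternates:      # failover, not load distribution
            if alt.is_available():
                return alt
        raise RuntimeError("no session store reachable")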

One can of course argue about what constitutes load balancing, but I think of it as being chiefly about load distribution, with or without stickiness. In this design I would describe the OpenAM-to-OpenDJ connection as load balanced with permanent stickiness.

Does this matter? Ideally, no: in a replicated environment all the OpenDJ instances will be equal. But not always. There will always be some replication delay. The problem here was the users being distributed across all OpenAM instances, meaning that the same token operation could arrive at two, or more, different OpenAM instances practically simultaneously. Replication is fast in a good network and with good indexing. But it does not take zero time. If different OpenAM instances use different OpenDJ instances as session token store, there will be errors. Probably not many, but likely to increase in frequency with increasing system load.
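
The race is easy to simulate. A toy sketch, with made-up classes and an exaggerated replication delay, of why near-simultaneous operations against different replicas fail:

class Replica:
    """Toy directory replica with asynchronous replication delay."""

    def __init__(self, name):
        self.name = name
        self.tokens = {}        # token -> (value, time it becomes visible)

    def write(self, token, value, now, peers, delay):
        self.tokens[token] = (value, now)        # local write: immediate
        for peer in peers:                       # peers see it later
            peer.tokens[token] = (value, now + delay)

    def read(self, token, now):
        entry = self.tokens.get(token)
        if entry is None or entry[1] > now:      # not replicated yet
            return None                          # -> "no such object"
        return entry[0]

a, b = Replica("dj-a"), Replica("dj-b")
a.write("tok-123", "session", now=0.000, peers=[b], delay=0.050)

print(b.read("tok-123", now=0.010))   # None: the token "does not exist" yet
print(b.read("tok-123", now=0.100))   # 'session': visible after the delay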

OK. How to determine this? First: have plenty of logging on the OpenDJ instances. The search requests for session tokens are easy to identify in the OpenDJ access log. They are base-level searches and therefore simple to extract reliably with an egrep statement. The same goes for failures to find a token; these are directly related to the base-level searches. A base-level search for a base object that does not exist produces an error to that effect. In this case the access log clearly states that the session token object is not present. Very convenient. And egrep-able.
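
In shell it is a couple of egrep passes; here is the same extraction sketched in Python, with the log line shapes assumed from a typical OpenDJ access log (LDAP result code 32 is noSuchObject):

import re

# Assumed (typical) OpenDJ access-log shapes; adjust patterns to your version:
#   SEARCH REQ conn=42 op=7 ... base="coreTokenId=...,ou=tokens,..." scope=baseObject ...
#   SEARCH RES conn=42 op=7 ... result=32 ...     (LDAP 32 = noSuchObject)
REQ = re.compile(r'SEARCH REQ conn=(\d+) op=(\d+).*base="([^"]+)".*scope=baseObject')
RES = re.compile(r'SEARCH RES conn=(\d+) op=(\d+).*result=(\d+)')

def token_searches(path):
    """Return (searched, missing): all base-level search targets in the
    log, and the subset that came back as result 32 (no such object)."""
    pending = {}                     # (conn, op) -> base DN searched
    searched, missing = set(), set()
    with open(path) as log:
        for line in log:
            m = REQ.search(line)
            if m:
                pending[(m.group(1), m.group(2))] = m.group(3)
                searched.add(m.group(3))
                continue
            m = RES.search(line)
            if m:
                base = pending.pop((m.group(1), m.group(2)), None)
                if base is not None and m.group(3) == "32":
                    missing.add(base)
    return searched, missing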

So it is easy to get a list of the tokens being searched for, and of which ones were not found. Extract these lists from all OpenDJ instances within the same time period – say one hour – and determine whether any failed token searches on one instance succeeded on any of the others. There were.
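
With per-instance lists in hand, the cross-check is a couple of set operations, building on the token_searches sketch above (file names hypothetical):

# Per-instance (searched, missing) sets for the same one-hour window,
# produced by the token_searches() sketch above.
instances = {name: token_searches(f"{name}-access.log")
             for name in ("dj-a", "dj-b", "dj-c", "dj-d")}

for name, (searched, missing) in instances.items():
    for other, (o_searched, o_missing) in instances.items():
        if other == name:
            continue
        # Tokens this instance failed to find that another one found.
        found_elsewhere = missing & (o_searched - o_missing)
        if found_elsewhere:
            print(f"{len(found_elsewhere)} tokens missing on {name} "
                  f"but found on {other}")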


We all gripe about vendor documentation. And ForgeRock has had some doozies. But there is usually good advice in there. Don’t ignore it without very good reason backed up by observational data.

The short and sweet on OpenAM: play nice with your session handling.