Thundering herds, Spring Cache and Ehcache

15/6/2014

The Spring Cache abstraction makes it possible to cache expensive method calls with just an annotation:

@Cacheable("expensiveMethodCache")
public MyReturnType getSomethingExpensive(SomeType someParam) { ... }

All that's required on the configuration side is to slap <cache:annotation-driven /> into the Spring config and to configure a backing cache manager. In this example, we use the ubiquitous Ehcache as the backing cache behind the AOP magic, which normally looks like this:

<cache:annotation-driven />

<bean id="cacheManager" class="org.springframework.cache.ehcache.EhCacheCacheManager">
    <property name="cacheManager" ref="ehCache"/>
</bean>

<bean id="ehCache" class="org.springframework.cache.ehcache.EhCacheManagerFactoryBean">
    <property name="configLocation" value="classpath:ehcache.xml"/>
</bean>

This already works just fine for a lot of use cases.

The application I was writing reads a configuration from a Redis store. This seemed like an obvious thing to cache for at least a couple of seconds, so I added a @Cacheable annotation to the high-level service reading the configuration data:

@Override
@Cacheable("configuration")
@Nonnull
public ConfigurationTree getConfiguration() {
    final Collection<UriPatternConfiguration> configurations = this.configurationReader.loadConfiguration();
    return new UriPatternConfigurationTree(configurations);
}

The Ehcache configuration file looks like this:

<ehcache xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
         xsi:noNamespaceSchemaLocation="http://ehcache.org/ehcache.xsd"
         maxBytesLocalHeap="1G">

    <!-- Configuration is reloaded every 20 seconds -->
    <cache name="configuration" timeToLiveSeconds="20" eternal="false">
        <persistence strategy="none"/>
    </cache>

</ehcache>

A simple, memory-only cache manager capped at 1 GB of heap, containing a configuration cache with a 20-second TTL. Alright.

Thundering herds

This proved to be susceptible to a thundering herd problem under heavy load: whenever the time to live of the configuration data expires, hundreds of threads observe a cache miss and reach into the Redis store simultaneously to read and parse the configuration data. This continues until the first of them finishes and its result is put into the cache. As a result, the Redis store would be flooded with hundreds or thousands of new connections every few seconds, and would sometimes refuse further connections from the same host, resulting in an exception in the application. Not good.

An obvious possibility would be to solve the problem in the service itself, keeping state that blocks other clients while the configuration is already being read and serves them the result once the read is complete. However, this defeats the nice caching abstraction entirely, forcing you to either work with a Cache directly, delegate to an Executor and hold on to the Future, or write other custom threading logic. That was not part of the original plan, was it?
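
Just to illustrate how much of the abstraction this gives up, a hand-rolled version could look roughly like the following sketch (the fields and the hard-coded 20-second window are made up for illustration; this is not the solution used below):

private final Object configurationLock = new Object();
private ConfigurationTree cachedConfiguration;
private long cachedAtMillis;

@Override
@Nonnull
public ConfigurationTree getConfiguration() {
    synchronized (this.configurationLock) {
        final long now = System.currentTimeMillis();
        if (this.cachedConfiguration == null || now - this.cachedAtMillis > 20000L) {
            // Only the thread holding the lock reads from Redis; every other caller
            // waits here and is handed the freshly cached result afterwards.
            final Collection<UriPatternConfiguration> configurations = this.configurationReader.loadConfiguration();
            this.cachedConfiguration = new UriPatternConfigurationTree(configurations);
            this.cachedAtMillis = now;
        }
        return this.cachedConfiguration;
    }
}

Note that this naive version also serializes every read, cached or not, which is yet another reason not to go down this road.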

Blocking Cache to the rescue

Thankfully, Ehcache comes with a cache decorator called BlockingCache. It "allows concurrent read access to elements already in the cache. If the element is null, other reads will block until an element with the same key is put into the cache." Now that sounds like what we need, doesn't it? When there's a cache miss, the blocking cache will block other access to the key until the original caller who received the cache miss performs a put operation, and then returns the result to all the waiting threads. Alright, let's implement this!
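
Used directly against the Ehcache API, the contract looks roughly like this (only a sketch; the key and the surrounding wiring are made up):

Element element = blockingCache.get("config-key");   // blocks while another thread holds the lock
if (element == null) {
    // Cache miss: this thread now holds the lock for "config-key" until it calls put().
    final ConfigurationTree tree = new UriPatternConfigurationTree(configurationReader.loadConfiguration());
    element = new Element("config-key", tree);
    blockingCache.put(element);                       // stores the value and wakes up the waiting threads
}
final ConfigurationTree configuration = (ConfigurationTree) element.getObjectValue();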

While BlockingCache is part of the Ehcache distribution, we need to implement our own factory to be able to use it in the configuration:

import net.sf.ehcache.Ehcache;
import net.sf.ehcache.constructs.CacheDecoratorFactory;
import net.sf.ehcache.constructs.blocking.BlockingCache;

import java.util.Properties;

/**
 * Used in the ehcache configuration to create {@link net.sf.ehcache.constructs.blocking.BlockingCache}s. This
 * avoids a thundering herd situation when a heavily trafficked cached object expires (i.e., the configuration).
 */
public class BlockingCacheDecoratorFactory extends CacheDecoratorFactory {

    private static final int TIMEOUT_MILLIS = 1000;

    @Override
    public Ehcache createDecoratedEhcache(final Ehcache cache, final Properties properties) {
        final BlockingCache blockingCache = new BlockingCache(cache);
        blockingCache.setTimeoutMillis(TIMEOUT_MILLIS);
        return blockingCache;
    }

    @Override
    public Ehcache createDefaultDecoratedEhcache(final Ehcache cache, final Properties properties) {
        return this.createDecoratedEhcache(cache, properties);
    }
}

Now, we need to wire up our new factory in the Ehcache configuration:

<ehcache xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
         xsi:noNamespaceSchemaLocation="http://ehcache.org/ehcache.xsd"
         maxBytesLocalHeap="1G">

    <!-- Configuration is reloaded every 20 seconds -->
    <cache name="configuration" timeToLiveSeconds="20" eternal="false">
        <!-- Blocking cache to avoid thundering herds when the configuration TTL expires -->
        <cacheDecoratorFactory class="my.project.cache.BlockingCacheDecoratorFactory" />
        <persistence strategy="none"/>
    </cache>

</ehcache>

Nice! This works as expected, and solves the thundering herd problem. Right? Not quite.

Quoting from BlockingCache's get() method documentation:

Looks up an entry. Blocks if the entry is null until a call to put(net.sf.ehcache.Element) is done to put an Element in. If a put is not done, the lock is never released.

If this method throws an exception, it is the responsibility of the caller to catch that exception and call put(new Element(key, null)); to release the lock acquired.

Erm. Okay. By using a blocking cache, we have just made sure that when one call fails because the configuration could not be read, all subsequent calls will also fail, because the lock is never released. This means that the application dies a horrible death whenever someone restarts the Redis server. We need to release the lock, and the documentation says exactly how to do it. But how do we make Spring Cache do it?
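
Done by hand against the Ehcache API, the fix the documentation asks for would look roughly like this (again only a sketch with a made-up key):

Element element = blockingCache.get("config-key");
if (element == null) {
    try {
        final ConfigurationTree tree = new UriPatternConfigurationTree(configurationReader.loadConfiguration());
        blockingCache.put(new Element("config-key", tree));
    } catch (RuntimeException e) {
        // The read failed: release the lock as the documentation demands, otherwise every
        // later get("config-key") will block until the timeout and then fail as well.
        blockingCache.put(new Element("config-key", null));
        throw e;
    }
}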

Releasing the lock

To solve this problem, whenever a call to a cached method fails, Spring needs to call the put() method on its Cache abstraction with the proper key and a null value. For Ehcache, putting a null value corresponds to a no-op, so the underlying cache does absolutely nothing, but it causes the BlockingCache to release the lock.

Digging around with the debugger for a while shows that the logical place to do this would be in the CacheInterceptor class. We can extend the class like this:

import org.springframework.aop.framework.AopProxyUtils;
import org.springframework.cache.Cache;
import org.springframework.cache.interceptor.CacheInterceptor;
import org.springframework.cache.interceptor.CacheOperation;

import java.lang.reflect.Method;
import java.util.Collection;

/**
 * Useful in combination with a {@link net.sf.ehcache.constructs.blocking.BlockingCache}. When the intercepted method
 * call fails, removes the lock created during the get() operation so other threads can try again.
 */
public class UnblockingOnExceptionCacheInterceptor extends CacheInterceptor {

    @Override
    protected Object execute(final Invoker invoker, final Object target, final Method method, final Object[] args) {
        try {
            return super.execute(invoker, target, method, args);
        } catch (Exception e) {
            // get backing class
            Class<?> targetClass = AopProxyUtils.ultimateTargetClass(target);
            if (targetClass == null && target != null) {
                targetClass = target.getClass();
            }

            final Collection<CacheOperation> cacheOps =
                    getCacheOperationSource().getCacheOperations(method, targetClass);
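            // Put a null value into every cache of this operation; for Ehcache this is a
            // no-op for the cache content, but it makes a BlockingCache release its lock.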
            for (final CacheOperation cacheOp : cacheOps) {
                for (final Cache c : this.getCaches(cacheOp)) {
                    c.put(cacheOp.getKey(), null);
                }
            }
            throw e;
        }
    }
}

Alright! But now we need to get Spring to use our UnblockingOnExceptionCacheInterceptor instead of its own CacheInterceptor. Unfortunately, this forces us to replace <cache:annotation-driven /> with something much more verbose, effectively instantiating by hand everything that annotation-driven sets up automatically:

<!-- Declarative caching via @Cachable, backed by Ehcache. This is mostly the explicit equivalent
    of <cache:annotation-driven/>, but we need to override the cache interceptor. -->

<aop:config>
    <aop:aspect ref="beanFactoryCacheOperationSourceAdvisor"/>
</aop:config>

<bean id="annotationCacheOperationSource" class="org.springframework.cache.annotation.AnnotationCacheOperationSource" />

<bean id="cacheInterceptor" class="my.project.cache.UnblockingOnExceptionCacheInterceptor"
      p:cacheOperationSources-ref="annotationCacheOperationSource"
      p:cacheManager-ref="cacheManager" p:keyGenerator-ref="keyGenerator" />

<bean id="beanFactoryCacheOperationSourceAdvisor"
      class="org.springframework.cache.interceptor.BeanFactoryCacheOperationSourceAdvisor"
      p:adviceBeanName="cacheInterceptor" p:cacheOperationSource-ref="annotationCacheOperationSource" />

<bean id="keyGenerator" class="org.springframework.cache.interceptor.DefaultKeyGenerator" />

<bean id="cacheManager" class="org.springframework.cache.ehcache.EhCacheCacheManager">
    <property name="cacheManager" ref="ehCache"/>
</bean>
<bean id="ehCache" class="org.springframework.cache.ehcache.EhCacheManagerFactoryBean">
    <property name="configLocation" value="classpath:ehcache.xml"/>
</bean>

Now we have Spring's cache abstraction use a blocking cache, and also release the lock when a method call fails. On the downside, we have added some messy XML configuration complex enough to need explanation, and a custom subclass of CacheInterceptor, which is quite a bulky class to extend from.

On the upside, we have left Spring's cache abstraction intact, keeping the blocking cache completely transparent to the application.

I decided the functionality was worth the complexity, but put an integration test for the whole thing in place to make sure it keeps working should, say, the Spring version later be updated to a newer release. (That surely shouldn't break anything, right?) Figuring out how to integration-test this is left as an exercise for the reader.
