How to cache some active data only based on the result from external storage in Guava?


Here is the background: I have 1 billion users in my external storage. Most of them are accessed at least once a day, but only a subset of active users is accessed much more often.

So for Guava, I may write:

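MyObject o = cache.get(key, new Callable<MyObject>() {
    @Override
    public MyObject call() throws Exception {
        // Load the value from external storage on a cache miss.
        return getExternal(key);
    }
});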

However, Guava caches every object I load from external storage. Because the data set is very large, rarely accessed entries are loaded into memory as well, push the cache past its maximum size, and can cause genuinely active entries to be evicted.

So I would like to tell Guava that a given value should not be cached, like this:

cache.get(key, new Callable<MyObject>() {
    @Override
    public MyObject call() throws Exception {
        MyObject o = getExternal(key);
        if (!o.isActive()) {
            ... // do NOT cache
        }
        return o;
    }
});

Is it possible to achieve this goal in Guava?

There are 2 best solutions below

7
On

According to Guava's CachesExplained wiki page, there is no way to prevent an object from being cached if you obtain it via Cache.get.

So there are two ways to handle this:

1) Look the value up with Cache.getIfPresent, load it outside of the cache on a miss, and insert it yourself with Cache.put only if it is active (see "Inserted Directly" in the wiki):

MyObject o = cache.getIfPresent(key);
if (o == null) {
    o = getExternal(key);
    if (o.isActive()) {
        cache.put(key, o);
    }
}

2) Obtain the value via Cache.get and, if it turns out to be inactive, remove it again right away with Cache.invalidate (see "Explicit Removals" in the wiki):

MyObject o = cache.get(key, () -> getExternal(key));
if (!o.isActive()) {
    cache.invalidate(key);
}
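
To try the control flow of variant 1 without Guava on the classpath, here is a minimal sketch that swaps Guava's Cache for a small LinkedHashMap-based LRU map. The names SelectiveCacheSketch, User, isCached and the loader function are hypothetical and exist only for this illustration; the loader stands in for getExternal(key).

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.function.Function;

/** Sketch of variant 1 with a plain LRU map standing in for Guava's Cache. */
class SelectiveCacheSketch {

    /** Hypothetical value type standing in for MyObject. */
    static final class User {
        final String id;
        final boolean active;

        User(String id, boolean active) {
            this.id = id;
            this.active = active;
        }

        boolean isActive() {
            return active;
        }
    }

    private final Function<String, User> loader; // stands in for getExternal(key)
    private final Map<String, User> lru;

    SelectiveCacheSketch(final int maxSize, Function<String, User> loader) {
        this.loader = loader;
        // accessOrder = true keeps the least recently used entry first.
        this.lru = new LinkedHashMap<String, User>(16, 0.75f, true) {
            @Override
            protected boolean removeEldestEntry(Map.Entry<String, User> eldest) {
                return size() > maxSize;
            }
        };
    }

    /** Returns the cached value, or loads it and stores it only if it is active. */
    synchronized User get(String key) {
        User cached = lru.get(key);          // plays the role of cache.getIfPresent(key)
        if (cached != null) {
            return cached;
        }
        User loaded = loader.apply(key);     // plays the role of getExternal(key)
        if (loaded.isActive()) {
            lru.put(key, loaded);            // plays the role of cache.put(key, o)
        }
        return loaded;
    }

    /** Reports whether a key is currently stored, without touching the LRU order. */
    synchronized boolean isCached(String key) {
        return lru.containsKey(key);
    }
}
```

The synchronized methods are a simplification for the sketch. With a real Guava cache, getIfPresent followed by put is not atomic, so two threads that miss on the same key at the same time may both call getExternal; Cache.get with a loader avoids that, which is the trade-off between variants 1 and 2.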

EDIT: There's actually a third way to go about it, but it's an even greater hack than Ben's suggestion:

MyObjectHolder holder = new MyObjectHolder();
cache.asMap().compute(key, holder::computeActive); // discards the result of compute()
MyObject o = holder.result;

where MyObjectHolder:

private static class MyObjectHolder {
    MyObject result = null;

    MyObject computeActive(String key, MyObject oldValue) {
        if (oldValue != null) {
            result = oldValue;
            return oldValue;
        }
        result = getExternal(key);
        return result.isActive() ? result : null; // cache only active values
    }
}
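
The holder trick relies on the contract of Map.compute: when the remapping function returns null, no mapping is stored for the key. A minimal sketch of just that contract, using java.util.concurrent.ConcurrentHashMap instead of a Guava cache view; ComputeNullSketch, loadIfActive, and the rule that keys starting with "active" count as active are all made up for the demo:

```java
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ConcurrentMap;

class ComputeNullSketch {

    /**
     * Returns the value for the key, loading it if necessary, but keeps it
     * in the map only when the key counts as "active" (demo rule: the key
     * starts with "active").
     */
    static String loadIfActive(ConcurrentMap<String, String> map, String key) {
        String[] holder = new String[1]; // carries the result out of the lambda
        map.compute(key, (k, oldValue) -> {
            if (oldValue != null) {
                holder[0] = oldValue;
                return oldValue;          // already cached: keep the mapping
            }
            String loaded = k.toUpperCase(); // stands in for getExternal(key)
            holder[0] = loaded;
            // Returning null tells compute() to store nothing for this key.
            return k.startsWith("active") ? loaded : null;
        });
        return holder[0];
    }
}
```

The caller always gets the loaded value back through the holder, while only active values end up in the map.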
0
On

That is a good general caching-related question, so please forgive me if I broaden the scope a little and give advice that is not limited to Guava Cache.

   if (!o.isActive())   {
       ...//do NOT cache
   }

Firstly, are you sure you need this kind of optimization and that it will pay off? The cache's eviction algorithm already does what you want to achieve: it keeps data that is requested more often in the cache and evicts data that is no longer requested. If you do not want so much inactive data in your cache, lowering the cache size might be the simplest solution. Caches using the LRU eviction algorithm, like Guava, are quite slow at evicting unused data, since an entry needs to "march down" the whole LRU list before it is dropped. Caches using a more modern algorithm, like Caffeine or cache2k, evict unused data faster.
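
As an illustration of plain LRU behavior, here is a minimal sketch built on an access-ordered java.util.LinkedHashMap. It is not Guava's implementation, which differs in detail; LruSketch and its maxSize parameter are hypothetical names for this example.

```java
import java.util.LinkedHashMap;
import java.util.Map;

/** Tiny LRU map: reads move an entry to the "young" end, inserts evict the oldest. */
class LruSketch<K, V> extends LinkedHashMap<K, V> {
    private final int maxSize;

    LruSketch(int maxSize) {
        super(16, 0.75f, true); // true = order entries by access, not by insertion
        this.maxSize = maxSize;
    }

    @Override
    protected boolean removeEldestEntry(Map.Entry<K, V> eldest) {
        // Called after each insert: drop the least recently used entry when over capacity.
        return size() > maxSize;
    }
}
```

With a capacity of 2, inserting "hot" and "cold-1", reading "hot", and then inserting "cold-2" evicts "cold-1" and keeps "hot". If a second cold key is inserted before "hot" is read again, "hot" is evicted instead, which is the pollution effect the question worries about when the cache is small relative to the stream of one-off loads.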

Another approach is to set an expiry after access. So if an entry is not requested periodically within the given time duration it is expired and then removed from the cache after some time.
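
In Guava this is configured with CacheBuilder.expireAfterAccess(duration, unit). To show the idea without the library, here is a minimal dependency-free sketch; ExpireAfterAccessSketch is a hypothetical class, and the clock is passed in as nowMillis so the behavior is easy to follow. Expired entries are dropped lazily on lookup, which is a simplification.

```java
import java.util.HashMap;
import java.util.Map;

/** Entries disappear once they have not been read for ttlMillis. */
class ExpireAfterAccessSketch<K, V> {

    private static final class Entry<V> {
        final V value;
        long lastAccessMillis;

        Entry(V value, long nowMillis) {
            this.value = value;
            this.lastAccessMillis = nowMillis;
        }
    }

    private final long ttlMillis;
    private final Map<K, Entry<V>> entries = new HashMap<>();

    ExpireAfterAccessSketch(long ttlMillis) {
        this.ttlMillis = ttlMillis;
    }

    void put(K key, V value, long nowMillis) {
        entries.put(key, new Entry<>(value, nowMillis));
    }

    /** Returns the value, or null if it is absent or was not accessed within ttlMillis. */
    V get(K key, long nowMillis) {
        Entry<V> e = entries.get(key);
        if (e == null) {
            return null;
        }
        if (nowMillis - e.lastAccessMillis >= ttlMillis) {
            entries.remove(key);        // expired: drop it lazily on lookup
            return null;
        }
        e.lastAccessMillis = nowMillis; // every successful read restarts the timer
        return e.value;
    }
}
```

An entry that keeps being read stays alive indefinitely, while one that is loaded once and never touched again drops out after the configured duration regardless of how full the cache is.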

If you want to control the caching behavior depending on the read data, Guava is missing a feature that other caches provide, which is a variable expiry based on the cached value. For cache2k you could add the following rule when constructing the cache, which would keep active entries for 5 minutes and expire others immediately:

builder.expiryPolicy((key, value, loadTime, oldEntry) ->
    // cache2k expects an absolute expiry time (millis since epoch), not a duration
    value.isActive() ? loadTime + TimeUnit.MINUTES.toMillis(5) : Expiry.NOW)

Similar approaches are feasible with Caffeine and Ehcache.