I recently wrote a tiny demo program to demonstrate to co-workers how one can build a cache with age-based expiration of its entries, using purely immutable Scala collections. The core of the cache was something like 25-30 lines of Scala code like this:
import java.util.concurrent.atomic.AtomicReference
import scala.collection.immutable.HashMap

class FooCache(maxAgeMillis: Long) {
  def now: Long = System.currentTimeMillis

  case class CacheEntry(number: Long, value: Long, birthTime: Long) {
    def age: Long = now - birthTime
  }

  lazy val cache: AtomicReference[HashMap[Long, CacheEntry]] =
    new AtomicReference(HashMap[Long, CacheEntry]())

  // Builds a map holding only the entries which have not expired yet.
  def values: HashMap[Long, CacheEntry] =
    cache.get.filter { case (_, entry) => entry.age <= maxAgeMillis }

  def get(number: Long): Long =
    values.find { case (key, entry) => key == number && entry.age <= maxAgeMillis } match {
      case Some((_, entry)) => entry.value // cache hit
      case _ =>                            // cache miss
        val entry = CacheEntry(number, compute(number), now)
        cache.set(values + (number -> entry))
        entry.value
    }

  def compute(number: Long): Long = ??? // Some long-running computation based on 'number'
}
The main idea here is that we keep an atomically updated reference to an immutable HashMap. Every time we look for entries in the HashMap we check whether entry.age <= maxAgeMillis, to skip over entries which are already too old to be of any use. Then, at cache insertion time, we go through the ‘values’ function, which excludes all cache entries that have already expired.
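In isolation, the atomic-reference-over-immutable-map pattern looks like this. A minimal sketch, independent of the cache above; note it uses updateAndGet, which retries on contention, where the cache code above uses a plain set():

```scala
import java.util.concurrent.atomic.AtomicReference
import scala.collection.immutable.HashMap

// An atomically swapped immutable map: readers always see a consistent
// snapshot; writers publish a whole new map.
val ref = new AtomicReference(HashMap[Long, String]())

// Publish new versions of the map. updateAndGet retries the function on
// contention, so concurrent writers cannot lose each other's updates.
ref.updateAndGet(m => m + (1L -> "one"))
ref.updateAndGet(m => m + (2L -> "two"))

// Readers grab a snapshot for free -- no locking, the map never mutates.
val snapshot = ref.get
assert(snapshot.get(1L).contains("one"))
```

Because the map itself never changes, a reader that grabbed a snapshot can iterate it for as long as it likes while writers keep publishing newer versions.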
Note how the cache itself is not ‘immutable’. We are just using an immutable HashMap collection to store its entries. This means that multiple threads can safely iterate through all the entries of the cache at the same time, without any locking, because the map a thread holds a reference to can never change underneath it. But there’s an interesting performance bug in this code too…
It’s relatively easy to spot once you know what you are looking for, but did you already catch it? I didn’t. At least not the first time I wrote this code. But I did notice something was ‘odd’ when I started doing lookups from multiple threads and looked at the performance stats of the program in a profiler. YourKit showed the following for this version of the caching code:
See how CPU usage hovers around 60% and we are doing a hefty bunch of garbage collections every second? The profiler quickly led me to the lookup path in the code pasted above, where I am going through ‘values’ every time I look up a cache entry.
Almost 94% of the CPU time of the program was spent inside the .values() function. The profiling report included this part:
+------------------------------------------------------------+--------+------+
| Name                                                       |   Time | Time |
|                                                            |   (ms) |  (%) |
+------------------------------------------------------------+--------+------+
| demo.caching                                               | 62.084 | 99 % |
| +-- d.caching.Numbers.check(long)                          | 62.084 | 99 % |
|   +-- d.caching.FooCacheModule$FooCache.check(long)        | 62.084 | 99 % |
|     +-- d.caching.FooCacheModule$FooCache.values()         | 58.740 | 94 % |
|     +-- scala.collection.AbstractIterable.find(Function1)  |  3.215 |  5 % |
+------------------------------------------------------------+--------+------+
We are spending far too much time expiring cache entries. A second look at the code of the get() function makes it easy to see why: every cache lookup first does old-entry expiration and only then searches for a matching cache entry.
The way cache-entry expiration works with an immutable HashMap as the underlying cache entry store is that values() iterates over the entire cache HashMap and builds a new HashMap containing only the cache entries which have not expired. This is bound to take a lot of processing power, and it’s also what’s causing the creation of all those ‘new’ objects we are garbage collecting every second!
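To make the allocation concrete, here is a small standalone sketch (not the cache code itself) showing that filtering an immutable HashMap always leaves the original untouched and hands back a freshly built map:

```scala
import scala.collection.immutable.HashMap

// A map with 1000 entries, standing in for the cache's entry store.
val m: HashMap[Int, Int] = (1 to 1000).map(i => i -> i).to(HashMap)

// filter walks every entry and builds a brand-new map with the survivors;
// doing this on every lookup means a fresh 1000-entry map per lookup.
val filtered = m.filter { case (k, _) => k != 1 }

assert(filtered.size == 999) // the new map dropped one entry
assert(m.size == 1000)       // the original map is completely untouched
```

Every one of those throw-away maps (and its internal nodes) becomes garbage as soon as the lookup returns, which is exactly the GC churn the profiler was showing.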
Do we really need to construct a new cache HashMap every time we do a cache lookup? Of course not… We can just filter the entries while we are traversing the cache.
Changing the lookup in get() from values.find{} to cache.get.find{} stops doing cache-entry expiration on every single lookup, and now our cache lookup speed is no longer limited by how fast we can construct new CacheEntry objects, link them into a HashMap and garbage-collect the old ones. Running the new code through YourKit once more showed an immensely better utilization profile for the 8 cores of my laptop’s CPU:
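Put together, the fixed version looks roughly like this. My reconstruction as a self-contained sketch; the compute() body is a stand-in for the real long-running computation:

```scala
import java.util.concurrent.atomic.AtomicReference
import scala.collection.immutable.HashMap

class FooCache(maxAgeMillis: Long) {
  def now: Long = System.currentTimeMillis

  case class CacheEntry(number: Long, value: Long, birthTime: Long) {
    def age: Long = now - birthTime
  }

  val cache: AtomicReference[HashMap[Long, CacheEntry]] =
    new AtomicReference(HashMap[Long, CacheEntry]())

  // Expiration still happens here, but only on the insertion path now.
  def values: HashMap[Long, CacheEntry] =
    cache.get.filter { case (_, entry) => entry.age <= maxAgeMillis }

  def get(number: Long): Long =
    // The fix: traverse the current map directly instead of going through
    // values, so no throw-away HashMap is built on the hot lookup path.
    // The age check still guards against returning a stale entry.
    cache.get.find { case (key, entry) => key == number && entry.age <= maxAgeMillis } match {
      case Some((_, entry)) => entry.value // cache hit
      case _ =>                            // cache miss
        val entry = CacheEntry(number, compute(number), now)
        cache.set(values + (number -> entry))
        entry.value
    }

  def compute(number: Long): Long = number * 2 // stand-in computation
}
```

(A further step, beyond the fix discussed here, would be to replace the linear find with the map’s keyed lookup, cache.get.get(number), but that is a separate optimization.)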
Now we are not spending a bunch of time constructing throw-away objects, and garbage collector activity has dropped by a huge fraction. We can also make much more effective use of the available CPU cores for doing actual cache lookups, instead of busy work!
This was instantly reflected in the metrics I was collecting for the actual demo code. Before the change, the code was doing almost 6000 cache lookups per second:
-- Timers -------------------------------
caching.primes.hashmap.cache-check
             count = 4528121
         mean rate = 5872.91 calls/second
     1-minute rate = 5839.87 calls/second
     5-minute rate = 6053.27 calls/second
    15-minute rate = 6648.47 calls/second
               min = 0.29 milliseconds
               max = 10.25 milliseconds
              mean = 1.34 milliseconds
            stddev = 1.45 milliseconds
            median = 0.62 milliseconds
              75% <= 0.99 milliseconds
              95% <= 4.00 milliseconds
              98% <= 4.59 milliseconds
              99% <= 6.02 milliseconds
            99.9% <= 10.25 milliseconds
After the change to skip cache expiration at cache lookup, and only do cache entry expiration when we are inserting new cache entries, the same timer reported a hugely improved speed for cache lookups:
-- Timers -------------------------------
caching.primes.hashmap.cache-check
             count = 27500000
         mean rate = 261865.50 calls/second
     1-minute rate = 237073.52 calls/second
     5-minute rate = 186223.68 calls/second
    15-minute rate = 166706.39 calls/second
               min = 0.00 milliseconds
               max = 0.32 milliseconds
              mean = 0.02 milliseconds
            stddev = 0.02 milliseconds
            median = 0.02 milliseconds
              75% <= 0.03 milliseconds
              95% <= 0.05 milliseconds
              98% <= 0.05 milliseconds
              99% <= 0.05 milliseconds
            99.9% <= 0.32 milliseconds
That’s more like it. A cache lookup which completes within 0.32 milliseconds even for the 99.9th percentile of all cache lookups is something I definitely prefer working with. The insight from profiling tools, like YourKit, was instrumental both in understanding what the actual problem was and in verifying that the solution actually had the effect I expected it to have.
That’s why profiling is our friend!
So.. If I get it right, you now only do cache expiration on insertion instead of every lookup, right?
What if you don’t do any insertions for a long time, don’t you need an additional check to make sure that a cache hit is not stale?
Nevermind, it’s already there I just missed it in the code ;)
Yes, that’s it. There is one more optimization that one can do, to make the cache ‘warm up’ as the initial cache misses trickle in and then reach amazingly better speeds. I’ll leave that for another post though. One that combines profiling and codahale metrics.
As a teaser, I’ll just leave this here: