Profiling is Our Friend

I recently wrote a tiny demo program to demonstrate to co-workers how one can build a cache with age-based expiration of its entries, using purely immutable Scala collections. The core of the cache was something like 25-30 lines of Scala code like this:

class FooCache(maxAgeMillis: Long) {
  def now: Long = System.currentTimeMillis

  case class CacheEntry(number: Long, value: Long,
                        birthTime: Long) {
    def age: Long = now - birthTime

  lazy val cache: AtomicReference[Hashmap[Long, CacheEntry]] =
    new AtomicReference(HashMap[Long, CacheEntry]]())

  def values: Hashmap[Long, CacheEntry] =
    cache.get.filter{ (key, entry) =>
      entry.age <= maxAgeMillis }

  def get(number: Long): Long = {
    values.find{ case (key, entry) =>
      key == number && entry.age <= maxAgeMillis
    } match {
      case Some((key, entry)) =>
        entry.value                // cache hit
      case _ =>
        val entry = CacheEntry(number, compute(number), now)
        cache.set(values + (number -> entry))

  def compute(number: Long): Long =
    { /* Some long-running computation based on 'number' */ }

The main idea here is that we keep an atomically updated reference to an immutable HashMap. Every time we look for entries in the HashMap we check if (entry.age <= maxAgeMillis), to skip over entries which are already too old to be of any use. Then on cache insertion time we go through the ‘values’ function which excludes all cache entries which have already expired.

Note how the cache itself is not ‘immutable’. We are just using an immutable HashMap collection to store it. This means that Scala can do all sorts of optimizations when multiple threads want to iterate through all the entries of the cache looking for something they want. But there’s an interesting performance bug in this code too…

It’s relatively easy to spot once you know what you are looking for, but did you already catch it? I didn’t. At least not the first time I wrote this code. But I did notice something was ‘odd’ when I started doing lookups from multiple threads and looked at the performance stats of the program in a profiler. YourKit showed the following for this version of the caching code:

JVM Profile #1

See how CPU usage hovers around 60% and we are doing a hefty bunch of garbage collections every second? The profiler quickly led me to line 17 of the code pasted above, where I am going through ‘values’ when looking up cache entries.

Almost 94% of the CPU time of the program was spent inside the .values() function. The profiling report included this part:

|                           Name                            | Time   | Time |
|                                                           | (ms)   | (%)  |
| demo.caching                                              | 62.084 | 99 % |
| +-- d.caching.Numbers.check(long)                         | 62.084 | 99 % |
|   +-- d.caching.FooCacheModule$FooCache.check(long)       | 62.084 | 99 % |
|     +---d.caching.FooCacheModule$FooCache.values()        | 58.740 | 94 % |
|     +---scala.collection.AbstractIterable.find(Function1) |  3.215 |  5 % |

We are spending far too much time expiring cache entries. This is easy to understand why with a second look at the code of the get() function: every cache lookup does old entry expiration and then searches for a matching cache entry.

The way cache-entry expiration works with an immutable HashMap as the underlying cache entry store is that values() iterates over the entire cache HashMap, and builds a new HashMap containing only the cache entries which have not expired. This is bound to take a lot of procesisng power, and it’s also what’s causing the creation of all those ‘new’ objects we are garbage collecting every second!

Do we really need to construct a new cache HashMap every time we do a cache lookup? Of course not… We can just filter the entries while we are traversing the cache.

Changing line 17 from values.find{} to cache.get.find{} does not do cache-entry expiration at the time of every single lookup, and now our cache lookup speed is not limited by how fast we can construct new CacheEntry objects, link them to a HashMap and garbage-collect the old ones. Running the new code through YourKit once more showed an immensely better utilization profile for the 8 cores of my laptop’s CPU:

JVM Profile #2

Now we are not spending a bunch of time constructing throw-away objects, and garbage collector activity has dropped by a huge fraction. We can also make much more effective use of the available CPU cores for doing actual cache lookups, instead of busy work!

This was instantly reflected at the metrics I was collecting for the actual demo code. Before the change, the code was doing almost 6000 cache lookups per second:

-- Timers -------------------------------
             count = 4528121
         mean rate = 5872.91 calls/second
     1-minute rate = 5839.87 calls/second
     5-minute rate = 6053.27 calls/second
    15-minute rate = 6648.47 calls/second
               min = 0.29 milliseconds
               max = 10.25 milliseconds
              mean = 1.34 milliseconds
            stddev = 1.45 milliseconds
            median = 0.62 milliseconds
              75% <= 0.99 milliseconds
              95% <= 4.00 milliseconds
              98% <= 4.59 milliseconds
              99% <= 6.02 milliseconds
            99.9% <= 10.25 milliseconds

After the change to skip cache expiration at cache lookup, and only do cache entry expiration when we are inserting new cache entries, the same timer reported a hugely improved speed for cache lookups:

-- Timers -------------------------------
             count = 27500000
         mean rate = 261865.50 calls/second
     1-minute rate = 237073.52 calls/second
     5-minute rate = 186223.68 calls/second
    15-minute rate = 166706.39 calls/second
               min = 0.00 milliseconds
               max = 0.32 milliseconds
              mean = 0.02 milliseconds
            stddev = 0.02 milliseconds
            median = 0.02 milliseconds
              75% <= 0.03 milliseconds
              95% <= 0.05 milliseconds
              98% <= 0.05 milliseconds
              99% <= 0.05 milliseconds
            99.9% <= 0.32 milliseconds

That’s more like it. A cache lookup which completes in 0.32 milliseconds for the 99-th percentile of all cache lookups is something I definitely prefer working with. The insight from profiling tools, like YourKit, was instrumental in both understanding what the actual problem was, and verifying that the solution actually had the effect I expected it to have.

That’s why profiling is our friend!

About these ads

3 thoughts on “Profiling is Our Friend

  1. Nick Raptis

    So.. If I get it right, you now only do cache expiration on insertion instead of every lookup, right?
    What if you don’t do any insertions for a long time, don’t you need an additional check to make sure that a cache hit is not stale?

    1. keramida Post author

      Yes, that’s it. There is one more optimization that one can do, to make the cache ‘warm up’ as the initial cache misses trickle in and then reach amazingly better speeds. I’ll leave that for another post though. One that combines profiling and codahale metrics.

      As a teaser, I’ll just leave this here:

      -- Gauges ---------------------------------------
                   value = 9041
      -- Counters -------------------------------------
                   count = 853820827
                   count = 285007
      -- Timers ---------------------------------------
                   count = 854105834
               mean rate = 2737430.36 calls/second
           1-minute rate = 2293654.18 calls/second
           5-minute rate = 1799390.24 calls/second
          15-minute rate = 813696.52 calls/second
                     min = 0.00 milliseconds
                     max = 3.84 milliseconds
                    mean = 0.00 milliseconds
                  stddev = 0.12 milliseconds
                  median = 0.00 milliseconds
                    75% <= 0.00 milliseconds
                    95% <= 0.00 milliseconds
                    98% <= 0.00 milliseconds
                    99% <= 0.00 milliseconds
                  99.9% <= 3.73 milliseconds

Comments are closed.