Profiling is Our Friend

I recently wrote a tiny demo program to show co-workers how one can build a cache with age-based expiration of its entries, using purely immutable Scala collections. The core of the cache was 25-30 lines of Scala code like this:

import java.util.concurrent.atomic.AtomicReference
import scala.collection.immutable.HashMap

class FooCache(maxAgeMillis: Long) {
  def now: Long = System.currentTimeMillis

  case class CacheEntry(number: Long, value: Long,
                        birthTime: Long) {
    def age: Long = now - birthTime
  }

  lazy val cache: AtomicReference[HashMap[Long, CacheEntry]] =
    new AtomicReference(HashMap[Long, CacheEntry]())

  def values: HashMap[Long, CacheEntry] =
    cache.get.filter{ case (key, entry) =>
      entry.age <= maxAgeMillis }

  def get(number: Long): Long = {
    values.find{ case (key, entry) =>
      key == number && entry.age <= maxAgeMillis
    } match {
      case Some((key, entry)) =>
        entry.value                // cache hit
      case _ =>
        val entry = CacheEntry(number, compute(number), now)
        cache.set(values + (number -> entry))
        entry.value
    }
  }

  def compute(number: Long): Long =
    ??? // Some long-running computation based on 'number'
}

The main idea here is that we keep an atomically updated reference to an immutable HashMap. Every time we look for entries in the HashMap we check that (entry.age <= maxAgeMillis), to skip over entries which are already too old to be of any use. Then, at cache-insertion time, we go through the ‘values’ function, which excludes all cache entries that have already expired.

Note how the cache itself is not ‘immutable’. We are just using an immutable HashMap collection to store its entries. This means that multiple threads can iterate over a consistent snapshot of the cache without any locking: a reader grabs the current map with cache.get and traverses it safely, because that map itself never changes. But there’s an interesting performance bug in this code too…
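For example, a reader can take a snapshot of the cache once and then work with it, with no synchronization at all. A minimal sketch (assuming it runs inside FooCache, where ‘cache’ is in scope; this snippet is not part of the demo itself):

// A reader takes an immutable snapshot with a single volatile read.
// Writers only ever swap the reference; they never mutate the map,
// so traversing the snapshot is safe without locks.
val snapshot = cache.get
snapshot.foreach { case (number, entry) =>
  println(s"$number -> ${entry.value} (age: ${entry.age} ms)")
}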

It’s relatively easy to spot once you know what you are looking for, but did you already catch it? I didn’t. At least not the first time I wrote this code. But I did notice something was ‘odd’ when I started doing lookups from multiple threads and looked at the performance stats of the program in a profiler. YourKit showed the following for this version of the caching code:

[JVM Profile #1: CPU usage and GC activity before the fix]

See how CPU usage hovers around 60% and we are doing a hefty bunch of garbage collections every second? The profiler quickly led me to the lookup path in the code pasted above, where get() goes through ‘values’ when looking up cache entries.

Almost 94% of the CPU time of the program was spent inside the .values() function. The profiling report included this part:

+-----------------------------------------------------------+--------+------+
|                           Name                            | Time   | Time |
|                                                           | (ms)   | (%)  |
+-----------------------------------------------------------+--------+------+
| demo.caching                                              | 62.084 | 99 % |
| +-- d.caching.Numbers.check(long)                         | 62.084 | 99 % |
|   +-- d.caching.FooCacheModule$FooCache.check(long)       | 62.084 | 99 % |
|     +-- d.caching.FooCacheModule$FooCache.values()        | 58.740 | 94 % |
|     +-- scala.collection.AbstractIterable.find(Function1) |  3.215 |  5 % |
+-----------------------------------------------------------+--------+------+

We are spending far too much time expiring cache entries. A second look at the code of the get() function makes it easy to see why: every cache lookup first expires old entries and then searches for a matching cache entry.

The way cache-entry expiration works with an immutable HashMap as the underlying cache entry store is that values() iterates over the entire cache HashMap and builds a new HashMap containing only the cache entries which have not expired. This is bound to take a lot of processing power, and it’s also what’s causing the creation of all those ‘new’ objects we are garbage collecting every second!
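To get a feel for the scale (a rough, illustrative estimate, not a measurement from the demo): if the cache holds about 10,000 live entries and we serve about 6,000 lookups per second, then values() walks and copies on the order of 10,000 × 6,000 = 60 million map entries every second, almost all of which become garbage right away.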

Do we really need to construct a new cache HashMap every time we do a cache lookup? Of course not… We can just filter the entries while we are traversing the cache.
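Concretely, the lookup can search the current map directly and leave expiration to the insertion path. A sketch of the revised get(), with only the find call changed:

  def get(number: Long): Long = {
    // Search the current map directly; the age check in the predicate
    // still skips stale entries, but we no longer build a whole new
    // HashMap on every lookup.
    cache.get.find{ case (key, entry) =>
      key == number && entry.age <= maxAgeMillis
    } match {
      case Some((key, entry)) =>
        entry.value                // cache hit
      case _ =>
        val entry = CacheEntry(number, compute(number), now)
        // Expiration now happens only here, on the insertion path,
        // because 'values' filters out the expired entries.
        cache.set(values + (number -> entry))
        entry.value
    }
  }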

Changing the lookup from values.find{} to cache.get.find{}, as shown above, means we no longer do cache-entry expiration on every single lookup, and our cache lookup speed is no longer limited by how fast we can build new HashMaps of the surviving entries and garbage-collect the old ones. Running the new code through YourKit once more showed an immensely better utilization profile for the 8 cores of my laptop’s CPU:

[JVM Profile #2: CPU usage and GC activity after the fix]

Now we are not spending a bunch of time constructing throw-away objects, and garbage collector activity has dropped by a huge fraction. We can also make much more effective use of the available CPU cores for doing actual cache lookups, instead of busy work!

This was instantly reflected in the metrics I was collecting for the actual demo code. Before the change, the code was doing almost 6000 cache lookups per second:

-- Timers -------------------------------
caching.primes.hashmap.cache-check
             count = 4528121
         mean rate = 5872.91 calls/second
     1-minute rate = 5839.87 calls/second
     5-minute rate = 6053.27 calls/second
    15-minute rate = 6648.47 calls/second
               min = 0.29 milliseconds
               max = 10.25 milliseconds
              mean = 1.34 milliseconds
            stddev = 1.45 milliseconds
            median = 0.62 milliseconds
              75% <= 0.99 milliseconds
              95% <= 4.00 milliseconds
              98% <= 4.59 milliseconds
              99% <= 6.02 milliseconds
            99.9% <= 10.25 milliseconds

After the change, which skips cache expiration at lookup time and expires entries only when inserting new ones, the same timer reported a hugely improved speed for cache lookups:

-- Timers -------------------------------
caching.primes.hashmap.cache-check
             count = 27500000
         mean rate = 261865.50 calls/second
     1-minute rate = 237073.52 calls/second
     5-minute rate = 186223.68 calls/second
    15-minute rate = 166706.39 calls/second
               min = 0.00 milliseconds
               max = 0.32 milliseconds
              mean = 0.02 milliseconds
            stddev = 0.02 milliseconds
            median = 0.02 milliseconds
              75% <= 0.03 milliseconds
              95% <= 0.05 milliseconds
              98% <= 0.05 milliseconds
              99% <= 0.05 milliseconds
            99.9% <= 0.32 milliseconds

That’s more like it. A cache lookup which completes in under 0.32 milliseconds even at the 99.9th percentile is something I definitely prefer working with. The insight from profiling tools like YourKit was instrumental both in understanding what the actual problem was, and in verifying that the solution had the effect I expected it to have.

That’s why profiling is our friend!

3 thoughts on “Profiling is Our Friend”

  1. Nick Raptis

    So… if I get it right, you now only do cache expiration on insertion instead of on every lookup, right?
    What if you don’t do any insertions for a long time? Don’t you need an additional check to make sure that a cache hit is not stale?

    1. keramida (post author)

      Yes, that’s it. There is one more optimization that one can do, to make the cache ‘warm up’ as the initial cache misses trickle in and then reach amazingly better speeds. I’ll leave that for another post though. One that combines profiling and codahale metrics.

      As a teaser, I’ll just leave this here:

      -- Gauges ---------------------------------------
      caching.primes.hashmap.cache-size
                   value = 9041
      
      -- Counters -------------------------------------
      caching.primes.hashmap.cache-hits
                   count = 853820827
      caching.primes.hashmap.cache-misses
                   count = 285007
      
      -- Timers ---------------------------------------
      caching.primes.hashmap.cache-check
                   count = 854105834
               mean rate = 2737430.36 calls/second
           1-minute rate = 2293654.18 calls/second
           5-minute rate = 1799390.24 calls/second
          15-minute rate = 813696.52 calls/second
                     min = 0.00 milliseconds
                     max = 3.84 milliseconds
                    mean = 0.00 milliseconds
                  stddev = 0.12 milliseconds
                  median = 0.00 milliseconds
                    75% <= 0.00 milliseconds
                    95% <= 0.00 milliseconds
                    98% <= 0.00 milliseconds
                    99% <= 0.00 milliseconds
                  99.9% <= 3.73 milliseconds
