Profiling is Our Friend

I recently wrote a tiny demo program to demonstrate to co-workers how one can build a cache with age-based expiration of its entries, using purely immutable Scala collections. The core of the cache was something like 25-30 lines of Scala code like this:

import java.util.concurrent.atomic.AtomicReference
import scala.collection.immutable.HashMap

class FooCache(maxAgeMillis: Long) {
  def now: Long = System.currentTimeMillis

  case class CacheEntry(number: Long, value: Long,
                        birthTime: Long) {
    def age: Long = now - birthTime
  }

  lazy val cache: AtomicReference[HashMap[Long, CacheEntry]] =
    new AtomicReference(HashMap[Long, CacheEntry]())

  def values: HashMap[Long, CacheEntry] =
    cache.get.filter { case (key, entry) =>
      entry.age <= maxAgeMillis }

  def get(number: Long): Long = {
    values.find{ case (key, entry) =>
      key == number && entry.age <= maxAgeMillis
    } match {
      case Some((key, entry)) =>
        entry.value                // cache hit
      case _ =>
        val entry = CacheEntry(number, compute(number), now)
        cache.set(values + (number -> entry))
        entry.value
    }
  }

  def compute(number: Long): Long =
    { /* Some long-running computation based on 'number' */ }
}

The main idea here is that we keep an atomically updated reference to an immutable HashMap. Every time we look for entries in the HashMap we check that (entry.age <= maxAgeMillis), to skip over entries which are already too old to be of any use. Then, at cache-insertion time, we go through the ‘values’ function, which excludes all cache entries that have already expired.

Note how the cache itself is not ‘immutable’. We are just using an immutable HashMap collection to store it. This means that multiple threads can iterate through all the entries of the cache at the same time, without any locking, because the HashMap snapshot each of them sees can never change underneath it. But there’s an interesting performance bug in this code too…
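The atomically-updated-reference pattern can be seen in isolation in a minimal sketch like this (the names here are illustrative, not from the demo):

```scala
import java.util.concurrent.atomic.AtomicReference
import scala.collection.immutable.HashMap

object AtomicMapSketch {
  // One mutable cell; the HashMap inside it is never mutated, only replaced.
  private val ref = new AtomicReference(HashMap[String, Int]())

  // Readers take a consistent, immutable snapshot: no locks needed.
  def snapshot: HashMap[String, Int] = ref.get

  // Writers publish a brand-new map derived from the old one.
  // (A writer racing with other writers would loop with compareAndSet
  // instead of a plain set, to avoid losing updates.)
  def put(key: String, value: Int): Unit =
    ref.set(ref.get + (key -> value))

  def main(args: Array[String]): Unit = {
    put("answer", 42)
    println(snapshot.get("answer"))  // Some(42)
  }
}
```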

It’s relatively easy to spot once you know what you are looking for, but did you already catch it? I didn’t. At least not the first time I wrote this code. But I did notice something was ‘odd’ when I started doing lookups from multiple threads and looked at the performance stats of the program in a profiler. YourKit showed the following for this version of the caching code:

JVM Profile #1

See how CPU usage hovers around 60% and we are doing a hefty bunch of garbage collections every second? The profiler quickly led me to the lookup path of the code pasted above, where get() goes through ‘values’ when looking up cache entries.

Almost 94% of the CPU time of the program was spent inside the .values() function. The profiling report included this part:

+-----------------------------------------------------------+--------+------+
|                           Name                            | Time   | Time |
|                                                           | (ms)   | (%)  |
+-----------------------------------------------------------+--------+------+
| demo.caching                                              | 62.084 | 99 % |
| +-- d.caching.Numbers.check(long)                         | 62.084 | 99 % |
|   +-- d.caching.FooCacheModule$FooCache.check(long)       | 62.084 | 99 % |
|     +-- d.caching.FooCacheModule$FooCache.values()        | 58.740 | 94 % |
|     +-- scala.collection.AbstractIterable.find(Function1) |  3.215 |  5 % |
+-----------------------------------------------------------+--------+------+

We are spending far too much time expiring cache entries. A second look at the code of the get() function makes it easy to see why: every cache lookup first does old-entry expiration and only then searches for a matching cache entry.

The way cache-entry expiration works with an immutable HashMap as the underlying cache-entry store is that values() iterates over the entire cache HashMap, and builds a new HashMap containing only the cache entries which have not expired. This is bound to take a lot of processing power, and it’s also what’s causing the creation of all those ‘new’ objects we are garbage-collecting every second!

Do we really need to construct a new cache HashMap every time we do a cache lookup? Of course not… We can just filter the entries while we are traversing the cache.

Changing the lookup in get() from values.find{} to cache.get.find{} stops doing cache-entry expiration at the time of every single lookup, so our cache lookup speed is no longer limited by how fast we can construct new CacheEntry objects, link them into a HashMap and garbage-collect the old ones. Running the new code through YourKit once more showed an immensely better utilization profile for the 8 cores of my laptop’s CPU:
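Put together, the fixed lookup path looks roughly like this (a self-contained sketch; compute() is a trivial stand-in for the long-running computation):

```scala
import java.util.concurrent.atomic.AtomicReference
import scala.collection.immutable.HashMap

class FixedFooCache(maxAgeMillis: Long) {
  def now: Long = System.currentTimeMillis

  case class CacheEntry(number: Long, value: Long, birthTime: Long) {
    def age: Long = now - birthTime
  }

  val cache = new AtomicReference(HashMap[Long, CacheEntry]())

  // Still builds a filtered map, but now only the insertion path pays for it.
  def values: HashMap[Long, CacheEntry] =
    cache.get.filter { case (_, entry) => entry.age <= maxAgeMillis }

  def get(number: Long): Long =
    cache.get.find { case (key, entry) =>
      key == number && entry.age <= maxAgeMillis
    } match {
      case Some((_, entry)) =>
        entry.value                              // cache hit: no new HashMap
      case _ =>
        val entry = CacheEntry(number, compute(number), now)
        cache.set(values + (number -> entry))    // expire only on insert
        entry.value
    }

  def compute(number: Long): Long = number * 2   // stand-in computation
}
```

The age check inside find{} still guarantees that a stale entry is never returned; expired entries simply linger in the map until the next insertion sweeps them out.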

JVM Profile #2

Now we are not spending a bunch of time constructing throw-away objects, and garbage collector activity has dropped by a huge fraction. We can also make much more effective use of the available CPU cores for doing actual cache lookups, instead of busy work!

This was instantly reflected in the metrics I was collecting for the actual demo code. Before the change, the code was doing almost 6000 cache lookups per second:

-- Timers -------------------------------
caching.primes.hashmap.cache-check
             count = 4528121
         mean rate = 5872.91 calls/second
     1-minute rate = 5839.87 calls/second
     5-minute rate = 6053.27 calls/second
    15-minute rate = 6648.47 calls/second
               min = 0.29 milliseconds
               max = 10.25 milliseconds
              mean = 1.34 milliseconds
            stddev = 1.45 milliseconds
            median = 0.62 milliseconds
              75% <= 0.99 milliseconds
              95% <= 4.00 milliseconds
              98% <= 4.59 milliseconds
              99% <= 6.02 milliseconds
            99.9% <= 10.25 milliseconds

After the change to skip cache expiration at cache lookup, and only do cache entry expiration when we are inserting new cache entries, the same timer reported a hugely improved speed for cache lookups:

-- Timers -------------------------------
caching.primes.hashmap.cache-check
             count = 27500000
         mean rate = 261865.50 calls/second
     1-minute rate = 237073.52 calls/second
     5-minute rate = 186223.68 calls/second
    15-minute rate = 166706.39 calls/second
               min = 0.00 milliseconds
               max = 0.32 milliseconds
              mean = 0.02 milliseconds
            stddev = 0.02 milliseconds
            median = 0.02 milliseconds
              75% <= 0.03 milliseconds
              95% <= 0.05 milliseconds
              98% <= 0.05 milliseconds
              99% <= 0.05 milliseconds
            99.9% <= 0.32 milliseconds

That’s more like it. A cache lookup which completes in 0.32 milliseconds even at the 99.9-th percentile of all cache lookups is something I definitely prefer working with. The insight from profiling tools, like YourKit, was instrumental both in understanding what the actual problem was, and in verifying that the solution actually had the effect I expected it to have.

That’s why profiling is our friend!


Using SBT To Experiment With New Scala Libraries

Library dependency tracking is a complicated thing. Using a new library to experiment with it, to write a few bits of exploratory code, or even to build small self-contained programs with it, is often an exercise in frustration. Especially so if the library one is interested in requires a multitude of other libraries and bits of code to even start working. This is a problem that surfaces very often, in pretty much all programming environments. Running code written for the JVM (in Java or Scala) is one of the cases where this sort of issue is particularly frequent and can get extremely hairy. Fortunately, there’s a very convenient tool which handles this problem in an amazingly flexible manner for Scala code: the SBT utility.

In this article I’m going to use the Scalaz library of functional programming idioms as a demo of what SBT can do for you. I’ll start by showing how Scalaz libraries can be loaded manually, by tinkering with the classpath of scala. Then we’ll see how the same can be done with a small SBT project. Finally, I’ll describe why I prefer the SBT based method of experimenting with new Scala libraries.

Loading A Scala Library Manually

Loading a new Scala library, in preparation for writing some exploratory code which uses the library, is pretty similar to what one does with Java libraries. All it takes is to start the Scala REPL with the appropriate JAR file somewhere in the classpath. For example, if you wanted to load the Scalaz library of functional programming idioms, you can just download the JAR file of the appropriate version, and pass this JAR file to the scala(1) utility with the -cp option:

% wget -q -nd -np -c -r \
  http://repo1.maven.org/maven2/org/scalaz/scalaz-core_2.10/6.0.4/scalaz-core_2.10-6.0.4.jar
% scala -cp scalaz-core_2.10-6.0.4.jar
Welcome to Scala version 2.10.2 (Java HotSpot(TM) 64-Bit Server VM, Java 1.6.0_51).
Type in expressions to have them evaluated.
Type :help for more information.

scala> import scalaz._
import scalaz._

scala> import Scalaz._
import Scalaz._

There’s nothing special to adding library JAR files to the Scala REPL’s classpath. It’s exactly what you would do e.g. for any plain old Java library too. The important thing to notice here, though, is that because of the possibility of incompatibilities between major Scala versions, you have to make sure that you load the correct version of the JAR file for the library. Hence the “_2.10” part of the “scalaz-core_2.10-6.0.4.jar” filename.

Real Use Case: Loading Scalaz From Your REPL

Scalaz is a popular Scala library for functional programming. There are many exciting features in scalaz, e.g. the Validation[A, B] support for handling exceptional conditions in a functional manner, or the NonEmptyList[A] classes for lists which must have at least one item.

The library can, of course, be used like any other library which loads on the JVM: just fetch the appropriate set of JAR files, point the Scala interpreter’s class-path at them, and then import the library’s exported symbols from your Scala REPL.

For this to work you have to make sure to match the Scala version of the JAR (2.10 here) with the one your Scala interpreter starts by default. Then, to use the JAR file, you have to specify it manually in the class-path of your scala session:

% scala -cp scalaz-core_2.10-6.0.4.jar
Welcome to Scala version 2.10.2 (Java HotSpot(TM) 64-Bit Server VM, Java 1.6.0_51).
Type in expressions to have them evaluated.
Type :help for more information.

scala> import scalaz._
import scalaz._

scala> import Scalaz._
import Scalaz._

scala> def validate(text: String): Validation[String, Boolean] = {
     |   text.find{ _.isUpper } match {
     |     case Some(character) => "'%s' is not a valid string".format(text).fail
     |     case _ => true.success
     |   }
     | }
validate: (text: String)scalaz.Validation[String,Boolean]

scala> validate("hello world")
res0: scalaz.Validation[String,Boolean] = Success(true)

scala> validate("Hey, this shouldn't work")
res1: scalaz.Validation[String,Boolean] = Failure('Hey, this shouldn't work' is not a valid string)

What if you want to experiment with multiple scala language versions though? Or with multiple scalaz library versions?

Using Scala Build Tool (SBT) To Do Even More Fun Stuff

Manually tracking where to download the JAR files from, and adding the right JAR files to the class-path, is bound to get very tedious. Fortunately, you can use the SBT utility and a small “bootstrap project” to load the required libraries. The SBT utility is the standard “build tool” used by many Scala libraries, and it provides an easy way of defining:

  • Which libraries your project will need to load
  • Which Scala compiler version to use (2.9.X, 2.10.X, etc.)
  • Which subset of the downloaded libraries to preload for REPL sessions

The pre-defined support of SBT for using maven repositories, searching for the appropriate Scala compiler version, fetching and caching the JAR files of the libraries the project loads, and all the other small conveniences it provides, quickly adds up. So the second method of loading Scalaz is the one I am using nowadays to quickly load any new Scala library I want to experiment with. I just let SBT handle the details, by creating a new directory, and putting in it a “build.sbt” file with the minimal amount of dependency information needed to fetch, load and start using the library.

A sample project definition which you can use to load Scalaz version 6.0.4 using the 2.10.2 version of the Scala compiler is:

name := "scalaz-demo"

version := "0.2"

scalaVersion := "2.10.2"

libraryDependencies += "org.scalaz" %% "scalaz-core" % "6.0.4"

scalacOptions += "-deprecation"

initialCommands in console := """
    |import scalaz._
    |import Scalaz._
    |""".stripMargin

Note how the version of the Scala compiler and the version of the Scalaz library are specified in a simple ‘setting’ of SBT. The rest of the details, like where to download the JAR file of scalaz, how to find the appropriate JAR version for the Scala compiler version we are using, and how to cache the JAR files locally, are handled internally by SBT. The initialCommands in console setting saves some typing by pre-importing scalaz when we fire up the interactive Scala REPL for this project. With this in place, we can just run “sbt console” and start writing code which uses scalaz-core right away:

% sbt console
[info] Loading global plugins from /Users/gkeramidas/.sbt/0.13/plugins
[info] Set current project to scalaz-demo (in build file:/Users/gkeramidas/hg/demo/scalaz-demo/)
[info] Starting scala interpreter...
[info]
import scalaz._
import Scalaz._
import util.control.Exception.allCatch
Welcome to Scala version 2.10.2 (Java HotSpot(TM) 64-Bit Server VM, Java 1.6.0_51).
Type in expressions to have them evaluated.
Type :help for more information.

scala> def foo(x: Int) =
     |   "%d is a funny integer".format(x).fail[Int]
foo: (x: Int)scalaz.Validation[String,Int]

scala> foo(10)
res0: scalaz.Validation[String,Int] = Failure(10 is a funny integer)

With SBT, it’s even possible to reload the entire project, using a different version of the Scala compiler, and keep hacking at the REPL:

% sbt
[info] Loading global plugins from /Users/gkeramidas/.sbt/0.13/plugins
[info] Set current project to scalaz-demo (in build file:/Users/gkeramidas/hg/demo/scalaz-demo/)
> show scalaVersion
[info] 2.10.2
> set scalaVersion := "2.9.0-1"
[info] Defining *:scalaVersion
[info] The new value will be used by *:allDependencies, *:ivyScala and 10 others.
[info] 	Run `last` for details.
[info] Reapplying settings...
[info] Set current project to scalaz-demo (in build file:/Users/gkeramidas/hg/demo/scalaz-demo/)
> console
[info] Updating {file:/Users/gkeramidas/hg/demo/scalaz-demo/}scalaz-demo...
[info] Resolving org.scala-lang#scala-library;2.9.0-1 ...
[info] Resolving org.scalaz#scalaz-core_2.9.0-1;6.0.4 ...
[info] Resolving org.scala-lang#scala-compiler;2.9.0-1 ...
[info] Resolving org.scala-lang#jline;2.9.0-1 ...
[info] Resolving org.fusesource.jansi#jansi;1.4 ...
[info] Done updating.
[info] Compiling 1 Scala source to /Users/gkeramidas/hg/demo/scalaz-demo/target/scala-2.9.0-1/classes...
[info] 'compiler-interface' not yet compiled for Scala 2.9.0.1. Compiling...
[info]   Compilation completed in 11.889 s
[info] Starting scala interpreter...
[info]
import scalaz._
import Scalaz._
import util.control.Exception.allCatch
Welcome to Scala version 2.9.0.1 (Java HotSpot(TM) 64-Bit Server VM, Java 1.6.0_51).
Type in expressions to have them evaluated.
Type :help for more information.

scala> def foo(x: Int) =
     |   "%d is a funny integer".format(x).fail[Int]
foo: (x: Int)scalaz.Validation[String,Int]

scala> foo(12345)
res0: scalaz.Validation[String,Int] = Failure(12345 is a funny integer)

Note how SBT took care of loading the correct version of scalaz’s JAR file. The artifact loaded for scalaz-core is shown in the informational message:

[info] Resolving org.scalaz#scalaz-core_2.9.0-1;6.0.4 ...

When SBT reports that it ‘resolved’ the artifact, it has already fetched it (from one of my previous experiments with scalaz on this machine), and it’s ready to be loaded. It’s even possible to see where SBT has fetched the artifact, by peeking at the classpath it uses to load the Scala REPL:

> show full-classpath
[info] List(Attributed(/Users/gkeramidas/hg/demo/scalaz-demo/target/scala-2.9.0-1/classes), Attributed(/Users/gkeramidas/.ivy2/cache/org.scala-lang/scala-library/jars/scala-library-2.9.0-1.jar), Attributed(/Users/gkeramidas/.ivy2/cache/org.scalaz/scalaz-core_2.9.0-1/jars/scalaz-core_2.9.0-1-6.0.4.jar))
[success] Total time: 0 s, completed Sep 4, 2013 4:27:21 PM

The scalaz JAR file which matches the Scala compiler version has been downloaded by SBT and cached in the local artifact cache at ~/.ivy2. Setting scalaVersion to a different value and reloading the project fetches the new scalaz JAR file and caches it under ~/.ivy2 again:

> set scalaVersion := "2.9.1"
[info] Defining *:scalaVersion
[info] The new value will be used by *:allDependencies, *:ivyScala and 10 others.
[info] 	Run `last` for details.
[info] Reapplying settings...
[info] Set current project to scalaz-demo (in build file:/Users/gkeramidas/hg/demo/scalaz-demo/)
> compile:console
[info] Updating {file:/Users/gkeramidas/hg/demo/scalaz-demo/}scalaz-demo...
[info] Resolving org.scala-lang#scala-library;2.9.1 ...
[info] Resolving org.scalaz#scalaz-core_2.9.1;6.0.4 ...
[info] Resolving org.scala-lang#scala-compiler;2.9.1 ...
[info] Resolving org.scala-lang#jline;2.9.1 ...
[info] Resolving org.fusesource.jansi#jansi;1.4 ...
[info] downloading http://repo1.maven.org/maven2/org/scala-lang/scala-library/2.9.1/scala-library-2.9.1.jar ...
[info] 	[SUCCESSFUL ] org.scala-lang#scala-library;2.9.1!scala-library.jar (7366ms)
[info] downloading http://repo1.maven.org/maven2/org/scala-lang/scala-compiler/2.9.1/scala-compiler-2.9.1.jar ...
[info] 	[SUCCESSFUL ] org.scala-lang#scala-compiler;2.9.1!scala-compiler.jar (5512ms)
[info] downloading http://repo1.maven.org/maven2/org/scala-lang/jline/2.9.1/jline-2.9.1.jar ...
[info] 	[SUCCESSFUL ] org.scala-lang#jline;2.9.1!jline.jar (803ms)
[info] Done updating.
[info] Compiling 1 Scala source to /Users/gkeramidas/hg/demo/scalaz-demo/target/scala-2.9.1/classes...
[info] 'compiler-interface' not yet compiled for Scala 2.9.1.final. Compiling...
[info]   Compilation completed in 9.926 s
[info] Starting scala interpreter...
[info]
import scalaz._
import Scalaz._
import util.control.Exception.allCatch
Welcome to Scala version 2.9.1.final (Java HotSpot(TM) 64-Bit Server VM, Java 1.6.0_51).
Type in expressions to have them evaluated.
Type :help for more information.

scala>

SBT is a nice tool for experimenting with a new Scala library. Having it fetch all the right JAR files, resolve their dependencies correctly, and take care of all the bothersome details of setting up the ‘environment’ for running your code with version X.Y.Z of the library under version A.B.C of the Scala language is too convenient and useful to ignore. So the next time you have to play around with a Scala library, consider writing a simple SBT build definition which loads it as a dependency, instead of fighting with JAR classpaths and other such annoying stuff.

Update: The feature of SBT which allows reloading a project’s dependencies with a different base-compiler for the Scala runtime is so useful that the friendly folks who develop SBT have given it a special short alias. You can switch to another Scala compiler version by typing “++” and the version you are switching to, e.g.:

> ++2.9.1
[info] Setting version to 2.9.1
[info] Set current project to scalaz-demo (in build file:/Users/gkeramidas/git/demo/scalaz-demo/)
> console
[info] Updating {file:/Users/gkeramidas/git/demo/scalaz-demo/}scalaz-demo...
[info] Resolving org.scala-lang#scala-library;2.9.1 ...
[info] Resolving org.scalaz#scalaz-core_2.9.1;6.0.4 ...
[info] Resolving org.scala-lang#scala-compiler;2.9.1 ...
[info] Resolving org.scala-lang#jline;2.9.1 ...
[info] Resolving org.fusesource.jansi#jansi;1.4 ...
[info] Done updating.
[info] Starting scala interpreter...
[info]
import scalaz._
import Scalaz._
import util.control.Exception.allCatch
Welcome to Scala version 2.9.1.final (Java HotSpot(TM) 64-Bit Server VM, Java 1.6.0_65).
Type in expressions to have them evaluated.
Type :help for more information.

Reading book: “Akka Essentials”

I recently got myself a copy of “Akka Essentials” and started going through the code, writing the same examples in Scala (instead of Java, which the original book uses). This is turning out to be a fairly good exercise, as it forces me to think about the idiomatic way of writing something in Scala, instead of just dumping the Java code and minimally editing it until everything compiles.

For example here’s a Java snippet from one of the first chapters, which sets things up for a “MapActor”:

package akka.first.app.mapreduce.actors;

import java.util.*;
import java.util.StringTokenizer;
import akka.actor.UntypedActor;
import akka.first.app.mapreduce.messages.MapData;
import akka.first.app.mapreduce.messages.WordCount;

public class MapActor extends UntypedActor {
  String[] STOP_WORDS = {
      "a", "am", "an", "and", "are", "as", "at",
      "be", "do", "go", "if", "in", "is", "it",
      "of", "on", "the", "to" };

  private List<String> STOP_WORDS_LIST =
      Arrays.asList(STOP_WORDS);

  @Override
  public void onReceive(Object message) throws Exception {
    if (message instanceof String) {
      String work = (String) message;
      // map the words in the sentence and send
      // the result to MasterActor
      getSender().tell(evaluateExpression(work));
    } else
      unhandled(message);
  }

  private MapData evaluateExpression(String line) {
    List<WordCount> dataList = new ArrayList<WordCount>();
    StringTokenizer parser = new StringTokenizer(line);
    while (parser.hasMoreTokens()) {
      String word = parser.nextToken().toLowerCase();
      if (!STOP_WORDS_LIST.contains(word)) {
        dataList.add
            (new WordCount(word,Integer.valueOf(1)));
      }
    }
    return new MapData(dataList);
  }
}

This MapActor is part of a Map-Reduce-Aggregate set of Akka actors, and its work is relatively simple:

  • It should handle a “String” message, by tokenizing the string and returning a MapData result.
  • The MapData should contain a list of WordCount instances, with every word’s initial count set to 1.
  • The words which are part of a STOP_WORDS list should be skipped when generating the result.

The code is readable in Java too, but as I was writing the equivalent in Scala, I noticed that I could do a couple of things to simplify it even more. One of them was to use pattern matching instead of instanceof checks in the message-handling code. The other one was that a while loop is a bit noisy in this case, and can be replaced by several more concise alternatives.

Another happy side-effect of writing in Scala was that the code was stripped of all the ‘boilerplate’ Java requires. For example, I didn’t have to do all the crafty stuff about ArrayList and manual conversion to List<String>, since Scala provides a neat and very readable way of initializing a list of strings.

I started with something like this:

package akka.first.app.mapreduce.actors

import akka.first.app.mapreduce.messages.MapData
import akka.actor.UntypedActor

class MapActor extends UntypedActor {
  override def onReceive(message: Any) =
    message match {
      case (work: String) => getSender ! evaluateExpression(work)
      case _              => unhandled(message)
    }

  /** Defines a list of words which are never counted. */
  private val STOP_WORDS: List[String] = List(
    "a", "am", "an", "and", "are", "as", "at", "be","do",
    "go", "if", "in", "is", "it", "of", "on", "the", "to")

  /** Evaluates a sentence, removes non-stop words, and returns
    * the result as a new `MapData` instance.
    */
  private def evaluateExpression(text: String): MapData =
    new MapData(Nil)
}

This doesn’t do anything useful in evaluateExpression yet. It’s already almost there though, and even if we include the Scaladoc comment lines, the total is about half the size of the Java code.

Now, I experimented with two alternative approaches of tokenizing the input text. One of them uses the simplest code possible:

  private def evaluateExpression(text: String): MapData =
    new MapData(text.split("\\s").map{_.toLowerCase}
      .filter(w => !STOP_WORDS.contains(w))
      .map(w => new WordCount(w, 1)))

This version should work fine, and the Scala version of the MapActor is still very readable code:

package akka.first.app.mapreduce.actors

import akka.first.app.mapreduce.messages.MapData
import akka.actor.UntypedActor

class MapActor extends UntypedActor {
  override def onReceive(message: Any) =
    message match {
      case (work: String) => getSender ! evaluateExpression(work)
      case _              => unhandled(message)
    }

  /** Defines a list of words which are never counted. */
  private val STOP_WORDS: List[String] = List(
    "a", "am", "an", "and", "are", "as", "at", "be","do",
    "go", "if", "in", "is", "it", "of", "on", "the", "to")

  /** Evaluates a sentence, removes non-stop words, and returns
    * the result as a new `MapData` instance.
    */
  private def evaluateExpression(text: String): MapData =
    new MapData(text.split("\\s").map{_.toLowerCase}
      .filter(w => !STOP_WORDS.contains(w))
      .map(w => new WordCount(w, 1)))
}

Another version of evaluateExpression, which I wrote to play around with Scala and Stream generating functions was the following:

import java.util.StringTokenizer

def tokens(parser: StringTokenizer): Stream[String] =
  parser.hasMoreTokens match {
    case false => Stream.empty
    case true  => Stream.cons(
      parser.nextToken.toLowerCase, tokens(parser))
  }

def evaluateExpression(text: String): MapData = {
  val parser = new StringTokenizer(text)
  val words = tokens(parser)
  MapData(words.filter(w => !STOP_WORDS.contains(w))
          .map(w => WordCount(w, 1)).toList)
}

One important difference of this version is that it doesn’t pre-generate the entire list of words in memory, so it might be a better option for extremely large input strings. The difference already shows up for e.g. this code:

scala> import java.util.StringTokenizer
import java.util.StringTokenizer

scala> def tokens(parser: StringTokenizer): Stream[String] =
     |     parser.hasMoreTokens match {
     |       case false => Stream.empty
     |       case true  => Stream.cons(parser.nextToken.toLowerCase, tokens(parser))
     |     }
tokens: (parser: java.util.StringTokenizer)Stream[String]

scala> def time[R](block: => R): R = {
     |     val t0 = System.nanoTime()
     |     val result = block    // call-by-name
     |     val t1 = System.nanoTime()
     |     println("Elapsed time: " + (t1 - t0) + "ns")
     |     result
     | }
time: [R](block: => R)R

scala> val large = "foo " * 500000
large: String = "foo foo foo foo foo foo foo foo foo ...

scala> time { tokens(new StringTokenizer(large)).size }
Elapsed time: 92060000ns
res1: Int = 500000

scala> time { large.split("\\s").size }
Elapsed time: 126982000ns
res2: Int = 500000

Note how the second version, which uses plain split(), is already 37.93% slower, even when we are just counting the number of words in the token list [126982000.0 / 92060000 ≈ 1.37934].
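The memory difference comes from laziness: a Stream only forces the elements that are actually demanded, so taking a prefix leaves the rest of the tokenizer untouched. A small sketch demonstrating this with the same tokens() function (the input string here is made up):

```scala
import java.util.StringTokenizer

object LazyTokensDemo {
  def tokens(parser: StringTokenizer): Stream[String] =
    parser.hasMoreTokens match {
      case false => Stream.empty
      case true  => Stream.cons(parser.nextToken.toLowerCase, tokens(parser))
    }

  def main(args: Array[String]): Unit = {
    val parser = new StringTokenizer("One Two Three Four Five")
    // Force only the first two elements of the Stream.
    val firstTwo = tokens(parser).take(2).toList
    println(firstTwo)            // List(one, two)
    // Three of the five tokens were never pulled from the tokenizer.
    println(parser.countTokens)  // 3
  }
}
```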

What I got from doing this rewrite in Scala though was something entirely different. An appreciation for the features of Scala which let me write compact, concise, yet also very readable, and very clean code like this final version of the MapActor code:

package akka.first.app.mapreduce.actors

import akka.first.app.mapreduce.messages.MapData
import akka.first.app.mapreduce.messages.WordCount
import akka.actor.UntypedActor

/** Parses a `String` into whitespace-separated words, skips
  * over "stop words" and constructs a MapData message with
  * the resulting words and their counts initialized to 1.
  */
class MapActor extends UntypedActor {
  override def onReceive(message: Any) =
    message match {
      case (work: String) => getSender ! evaluateExpression(work)
      case _              => unhandled(message)
    }

  /** Defines a list of words which are never counted. */
  private val STOP_WORDS: List[String] = List(
    "a", "am", "an", "and", "are", "as", "at", "be","do",
    "go", "if", "in", "is", "it", "of", "on", "the", "to")

  /** Evaluates a sentence, removes non-stop words, and returns
    * the result as a new `MapData` instance.
    */
  private def evaluateExpression(text: String): MapData =
    MapData(text.split("\\s")
      .filter(w => !STOP_WORDS.contains(w))
      .map(w => WordCount(w, 1)).toList)
}

Gnus: Saving Outgoing Messages To Multiple Gmail Folders

Everything is possible, if you have an extensible email-reading application, written in one of the most powerful languages in the world:

;; Where to save a copy of all outgoing messages.
;; Save a copy in Gmail's special "Sent Mail" folder
;; and another one in "keramida", so that they appear
;; correctly in searches for "label:keramida" too.

(setq gnus-message-archive-group
      (list "keramida" "[Gmail]/Sent Mail"))

Mostly Spam After A Few Months

Just a quick note, since I haven’t posted a lot of stuff lately. I noticed that the “comments” added to old posts in this blog are almost exclusively spam, so I enabled the WordPress option that blocks all comments for posts older than a few months. If you are a real person and you still want to post something as a comment to a very old post, feel free to email me your text, and we’ll find a way of publishing it here.


Factory Objects in Scala Code

When programming with an object-oriented language like Scala it’s often necessary to create a different type of object, depending on the value of a configuration flag, the current state of the program, or other conditions. At this point your first idea may be to write something like:

val someFlag: Boolean = config("flag-name")
val realObject = if (someFlag) new Foo() else new Bar()

or even longer pieces of switching code, like:

val someFlag: String = config("flag-name")
val realObject: SomeClass = someFlag match {
  case "foo" => new Foo()
  case "bar" => new Bar()
  case _ => new Frob()
}

This is almost “ok” for a single call site, but it conflates two different things: the fact that the type (and value) of realObject depends on the value of someFlag, and how the switching between different someFlag values works. What’s even worse, this sort of initialization binds the switching logic to this specific instance of realObject. If more instances of SomeClass are created elsewhere, the switching logic for someFlag will have to be repeated in all those places. Then, a few weeks or months later, you may decide to change something in the switching logic and miss updating one of those places, ending up with a nice bug on your plate.

Avoiding this sort of code duplication and writing the switching logic in a well-factored manner is exactly what factory objects can do.

I’ll use a trivial but easy-to-follow example to show how you can use factory objects in Scala. Let’s assume that you have two different classes in your code. One of them represents the home directory of a plain user, pointing at “/home/users/USERNAME”. The other one represents the home directory of a user with administrative rights, and it points at “/home/admins/USERNAME” instead. Your program also has some way to determine whether it wants to create a ‘plain user’ or an ‘admin user’ home-directory object, which it can use at object-creation time, like someFlag in the first example.

Since both types of objects will be “home directories”, it’s convenient if they are both sub-classes of a common, abstract superclass, e.g.:

import java.io.File

/*
 * Points at the home directory of {{userName}},
 * under a configurable {{baseDirectory}}.
 */
abstract class HomeDirectory(baseDirectory: String,
                             userName: String) {
  val path = new File(baseDirectory + "/" + userName)
}

class UserHomeDirectory(baseDirectory: String,
                        userName: String)
      extends HomeDirectory(baseDirectory, userName)

class AdministratorHomeDirectory(baseDirectory: String,
                                 userName: String)
      extends HomeDirectory(baseDirectory, userName)

Now it’s still tempting to write code like this:

def homeDirectory(userName: String,
  isAdministrator: Boolean): HomeDirectory = {
  if (isAdministrator)
    new AdministratorHomeDirectory("/home/admins", userName)
  else
    new UserHomeDirectory("/home/users", userName)
}

This is “ok” in the sense that it already works, and it does what we expect: a UserHomeDirectory object is created when isAdministrator is false, and an AdministratorHomeDirectory is created when we pass true:

scala> val h = homeDirectory("keramida", false)
h: HomeDirectory = UserHomeDirectory@400ae511

scala> h.path
res1: java.io.File = /home/users/keramida

scala> val h = homeDirectory("keramida", true)
h: HomeDirectory = AdministratorHomeDirectory@5b0a1d2d

scala> h.path
res2: java.io.File = /home/admins/keramida

Even so, we can still improve things a lot. Instead of having the switching logic inside a function that is completely isolated and decoupled from the actual object classes, it would be nice if we could pass just the username and the flag to HomeDirectory itself, and have the decision between UserHomeDirectory and AdministratorHomeDirectory encoded in one well-known, predictable place: the class itself. Then we would be able to write stuff like:

val home = HomeDirectory("keramida", isAdministrator = false)

or even just:

val home = HomeDirectory("keramida", false)

This is where Scala’s “companion objects” and factory methods really shine. A companion object in Scala is described like this in the Scala Reference Manual:

“Generally, a companion module of a class is an object which has the same name as the class and is defined in the same scope and compilation unit. Conversely, the class is called the companion class of the module.”

This doesn’t say a lot, but one of the interesting things that companion objects can do is to implement various apply() methods. These methods can be used to construct objects of the “companion class” (the class with the same name as the companion object), and they are very commonly used to implement factories like the one we want to implement.
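Before applying this to home directories, here is a minimal, self-contained sketch of the apply() mechanism on its own. The Greeting class and its message are made up, purely to isolate how a companion object’s apply() lets callers drop the ‘new’ keyword:

```scala
class Greeting(val text: String)

// Companion object: same name as the class, same compilation unit.
// Its apply() is what the expression Greeting(...) calls at use sites.
object Greeting {
  def apply(name: String): Greeting =
    new Greeting("Hello, " + name)
}

val g = Greeting("world")   // really Greeting.apply("world"), no 'new'
// g.text == "Hello, world"
```

The caller never writes ‘new’, and never needs to know what the constructor actually does; that is exactly the property we want for a factory.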

Since we want to be able to create HomeDirectory subclass instances with the signature (userName: String, isAdministrator: Boolean), this is what our apply() method should look like:

object HomeDirectory {
  def apply(userName: String, isAdministrator: Boolean): HomeDirectory = {
    if (isAdministrator)
      new AdministratorHomeDirectory("/home/admins", userName)
    else
      new UserHomeDirectory("/home/users", userName)
  }
}

We can even move the base directory strings out of apply():

object HomeDirectory {
  val adminBase: String = "/home/admins"
  val userBase: String = "/home/users"

  def apply(userName: String, isAdministrator: Boolean): HomeDirectory = {
    if (isAdministrator)
      new AdministratorHomeDirectory(adminBase, userName)
    else
      new UserHomeDirectory(userBase, userName)
  }
}

With that final piece in place, our source code is now a lot cleaner than the original: the switching between administrator and plain users no longer has to be copied to every single place where we use a ‘home directory’ object:

import java.io.File

/* Abstract base class for all home directory types. */
abstract class HomeDirectory(baseDirectory: String,
                             userName: String) {
  val path = new File(baseDirectory + "/" + userName)
}

/* Home directory for a simple, non-administrator user. */
class UserHomeDirectory(baseDirectory: String,
                        userName: String)
    extends HomeDirectory(baseDirectory, userName)

/* Home directory for a user with administrator rights. */
class AdministratorHomeDirectory(baseDirectory: String,
                                 userName: String)
    extends HomeDirectory(baseDirectory, userName)

object HomeDirectory {
  val adminBase: String = "/home/admins"
  val userBase: String = "/home/users"

  /* Factory method for home directories. */
  def apply(userName: String, isAdministrator: Boolean): HomeDirectory = {
    if (isAdministrator)
      new AdministratorHomeDirectory(adminBase, userName)
    else
      new UserHomeDirectory(userBase, userName)
  }
}

With all the bits in place, we can now do what we originally set out to do, in a much cleaner manner: create instances of different HomeDirectory subclasses without the messy ‘switching’ logic polluting every single call site. We can create a HomeDirectory object for a plain user, and have it automatically point to the right place under “/home/users”:

scala> val h = HomeDirectory("keramida", false)
h: HomeDirectory = UserHomeDirectory@615e10ab

scala> h.path
res0: java.io.File = /home/users/keramida

We can also create a HomeDirectory object for a user with administrative rights, pointing under “/home/admins”:

scala> val h = HomeDirectory("superuser", true)
h: HomeDirectory = AdministratorHomeDirectory@7490649e

scala> h.path
res1: java.io.File = /home/admins/superuser

The call is always the same: make me a HomeDirectory. What changes is the actual class of the instance, and what the instance itself contains.
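As an end-to-end sketch of that uniformity, here is the factory driving a small batch of users (the user list is made up; the class definitions from above are repeated so the snippet stands alone):

```scala
import java.io.File

abstract class HomeDirectory(baseDirectory: String, userName: String) {
  val path = new File(baseDirectory + "/" + userName)
}
class UserHomeDirectory(baseDirectory: String, userName: String)
    extends HomeDirectory(baseDirectory, userName)
class AdministratorHomeDirectory(baseDirectory: String, userName: String)
    extends HomeDirectory(baseDirectory, userName)

object HomeDirectory {
  def apply(userName: String, isAdministrator: Boolean): HomeDirectory =
    if (isAdministrator) new AdministratorHomeDirectory("/home/admins", userName)
    else new UserHomeDirectory("/home/users", userName)
}

// One identical call per user; the factory picks the concrete class.
val users = List(("keramida", false), ("superuser", true))
val homes = users.map { case (name, isAdmin) => HomeDirectory(name, isAdmin) }
homes.foreach(h => println(h.path))
// prints /home/users/keramida, then /home/admins/superuser (on a Unix-like system)
```

Note how the mapping function never mentions either subclass by name; if a third kind of home directory appears later, only the factory changes.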


Powerful Regular Expressions Combined with Lisp in Emacs

Regular expressions are a powerful text transformation tool. Any UNIX geek will tell you that. It’s so deeply ingrained in our culture that we even make jokes about it. Another thing we love is having a powerful extension language at hand, and Lisp is one of the most powerful extension languages around (and of course, we make jokes about that too).

Emacs, one of the most famous Lisp applications today, has for a while now had the ability to combine both of these, reaching entirely new levels of usefulness. Combining regular expressions and Lisp can do truly magical things.

An example that I recently used a few times is parsing & de-humanizing numbers in dstat output. The output of dstat includes numbers that are printed with a suffix, like ‘B’ for bytes, ‘k’ for kilobytes and ‘M’ for megabytes, e.g.:

----system---- ----total-cpu-usage---- --net/eth0- -dsk/total- sda-
     time     |usr sys idl wai hiq siq| recv  send| read  writ|util
16-05 08:36:15|  2   3  96   0   0   0|  66B  178B|   0     0 |   0
16-05 08:36:16| 42  14  37   0   0   7|  92M 1268k|   0     0 |   0
16-05 08:36:17| 45  11  36   0   0   7|  76M 1135k|   0     0 |   0
16-05 08:36:18| 27  55   8   0   0  11|  67M  754k|   0    99M|79.6
16-05 08:36:19| 29  41  16   5   0  10| 113M 2079k|4096B   63M|59.6
16-05 08:36:20| 28  48  12   4   0   8|  58M  397k|   0    95M|76.0
16-05 08:36:21| 38  37  14   1   0  10| 114M 2620k|4096B   52M|23.2
16-05 08:36:22| 37  54   0   1   0   8|  76M 1506k|8192B   76M|33.6

So if you want to graph one of the columns, it’s useful to convert all the numbers to the same unit. Bytes would be nice in this case.

Separating all columns with ‘|’ characters is a good start, so you can use e.g. a CSV-capable graphing tool, or even a simple awk script, to extract a specific column. ‘C-x r t’ (string-rectangle) can do that in Emacs, and you end up with something like this:

|     time     |cpu|cpu|cpu|cpu|cpu|cpu|eth0 |eth0 | disk| disk|sda-|
|     time     |usr|sys|idl|wai|hiq|siq| recv| send| read| writ|util|
|16-05 08:36:15|  2|  3| 96|  0|  0|  0|  66B| 178B|   0 |   0 |   0|
|16-05 08:36:16| 42| 14| 37|  0|  0|  7|  92M|1268k|   0 |   0 |   0|
|16-05 08:36:17| 45| 11| 36|  0|  0|  7|  76M|1135k|   0 |   0 |   0|
|16-05 08:36:18| 27| 55|  8|  0|  0| 11|  67M| 754k|   0 |  99M|79.6|
|16-05 08:36:19| 29| 41| 16|  5|  0| 10| 113M|2079k|4096B|  63M|59.6|
|16-05 08:36:20| 28| 48| 12|  4|  0|  8|  58M| 397k|   0 |  95M|76.0|
|16-05 08:36:21| 38| 37| 14|  1|  0| 10| 114M|2620k|4096B|  52M|23.2|
|16-05 08:36:22| 37| 54|  0|  1|  0|  8|  76M|1506k|8192B|  76M|33.6|

The leading and trailing ‘|’ characters are there so we can later use orgtbl-mode, an awesome table editing and realignment tool of Emacs. Now to the really magical step: regular expressions and lisp working together.

What we would like to do is convert text like “66B” to just “66”, text like “1268k” to the value of (1268 * 1024), and finally text like “67M” to the value of (67 * 1024 * 1024). The first part is easy:

M-x replace-regexp RET \([0-9]+\)B RET \1 RET

This should just strip the “B” suffix from byte values.

For the kilobyte and megabyte values, what we would like is to evaluate an arithmetic expression that involves \1. Something like “replace \1 with the value of (expression \1)”. This is possible in Emacs by prefixing the substitution pattern with \,. That prefix instructs Emacs to evaluate the rest of the substitution pattern as a Lisp expression, and to use its string representation as the “real” substitution text.

So if we match all numeric values that are suffixed by ‘k’, we can use (string-to-number \1) to convert the matching digits to an integer, multiply by 1024 and insert the resulting value by using the following substitution pattern:

\,(* 1024 (string-to-number \1))

The full Emacs command would then become:

M-x replace-regexp RET \([0-9]+\)k RET \,(* 1024 (string-to-number \1)) RET

This, together with the byte-suffix removal, now yields the following text in our Emacs buffer:

|     time     |cpu|cpu|cpu|cpu|cpu|cpu|eth0 |eth0 | disk| disk|sda-|
|     time     |usr|sys|idl|wai|hiq|siq| recv| send| read| writ|util|
|16-05 08:36:15|  2|  3| 96|  0|  0|  0|  66| 178|   0 |   0 |   0|
|16-05 08:36:16| 42| 14| 37|  0|  0|  7|  92M|1298432|   0 |   0 |   0|
|16-05 08:36:17| 45| 11| 36|  0|  0|  7|  76M|1162240|   0 |   0 |   0|
|16-05 08:36:18| 27| 55|  8|  0|  0| 11|  67M| 772096|   0 |  99M|79.6|
|16-05 08:36:19| 29| 41| 16|  5|  0| 10| 113M|2128896|4096|  63M|59.6|
|16-05 08:36:20| 28| 48| 12|  4|  0|  8|  58M| 406528|   0 |  95M|76.0|
|16-05 08:36:21| 38| 37| 14|  1|  0| 10| 114M|2682880|4096|  52M|23.2|
|16-05 08:36:22| 37| 54|  0|  1|  0|  8|  76M|1542144|8192|  76M|33.6|

Note: some of the columns are no longer aligned very well. We’ll fix that later. On to the megabyte conversion:

M-x replace-regexp RET \([0-9]+\)M RET \,(* 1024 1024 (string-to-number \1)) RET

Which produces a version that has no suffixes at all:

|     time     |cpu|cpu|cpu|cpu|cpu|cpu|eth0 |eth0 | disk| disk|sda-|
|     time     |usr|sys|idl|wai|hiq|siq| recv| send| read| writ|util|
|16-05 08:36:15|  2|  3| 96|  0|  0|  0|  66| 178|   0 |   0 |   0|
|16-05 08:36:16| 42| 14| 37|  0|  0|  7|  96468992|1298432|   0 |   0 |   0|
|16-05 08:36:17| 45| 11| 36|  0|  0|  7|  79691776|1162240|   0 |   0 |   0|
|16-05 08:36:18| 27| 55|  8|  0|  0| 11|  70254592| 772096|   0 |  103809024|79.6|
|16-05 08:36:19| 29| 41| 16|  5|  0| 10| 118489088|2128896|4096|  66060288|59.6|
|16-05 08:36:20| 28| 48| 12|  4|  0|  8|  60817408| 406528|   0 |  99614720|76.0|
|16-05 08:36:21| 38| 37| 14|  1|  0| 10| 119537664|2682880|4096|  54525952|23.2|
|16-05 08:36:22| 37| 54|  0|  1|  0|  8|  79691776|1542144|8192|  79691776|33.6|

Finally, to align everything in neat, pipe-separated columns, we enable M-x orgtbl-mode and type “C-c C-c” with point somewhere inside the transformed dstat output. The buffer now becomes something usable by pretty much any graphing tool out there:

| time           | cpu | cpu | cpu | cpu | cpu | cpu |      eth0 |    eth0 |  disk |      disk | sda- |
| time           | usr | sys | idl | wai | hiq | siq |      recv |    send |  read |      writ | util |
| 16-05 08:36:15 |   2 |   3 |  96 |   0 |   0 |   0 |        66 |     178 |     0 |         0 |    0 |
| 16-05 08:36:16 |  42 |  14 |  37 |   0 |   0 |   7 |  96468992 | 1298432 |     0 |         0 |    0 |
| 16-05 08:36:17 |  45 |  11 |  36 |   0 |   0 |   7 |  79691776 | 1162240 |     0 |         0 |    0 |
| 16-05 08:36:18 |  27 |  55 |   8 |   0 |   0 |  11 |  70254592 |  772096 |     0 | 103809024 | 79.6 |
| 16-05 08:36:19 |  29 |  41 |  16 |   5 |   0 |  10 | 118489088 | 2128896 |  4096 |  66060288 | 59.6 |
| 16-05 08:36:20 |  28 |  48 |  12 |   4 |   0 |   8 |  60817408 |  406528 |     0 |  99614720 | 76.0 |
| 16-05 08:36:21 |  38 |  37 |  14 |   1 |   0 |  10 | 119537664 | 2682880 |  4096 |  54525952 | 23.2 |
| 16-05 08:36:22 |  37 |  54 |   0 |   1 |   0 |   8 |  79691776 | 1542144 |  8192 |  79691776 | 33.6 |

The trick of combining arbitrary Lisp expressions with regexp substitution patterns like \1, \2, …, \9 is something I have found immensely useful in Emacs. Now that you know how it works, I hope you can find even more amusing use-cases for it.
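The same “function as replacement” idea exists outside Emacs too. As a rough analogue sketched in Scala (the dehumanize name and the sample line are made up), scala.util.matching.Regex.replaceAllIn can take a function that maps each match to its replacement text, computed on the fly:

```scala
import scala.util.matching.Regex

val kilo: Regex = """(\d+)k""".r

// Replace every "<digits>k" with <digits> * 1024, computed per match,
// just like \,(* 1024 (string-to-number \1)) does in Emacs.
def dehumanize(line: String): String =
  kilo.replaceAllIn(line, m => (m.group(1).toLong * 1024).toString)

println(dehumanize("recv 1268k writ 754k"))
// prints: recv 1298432 writ 772096
```

Once you have seen the pattern in one tool, you start recognizing it everywhere a regex engine accepts a callback instead of a fixed replacement string.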

Update: The Emacs manual has a few more useful examples of \, in action, as pointed out by tunixman on Twitter.
