Daily Archives: 2013-12-02

Using SBT To Experiment With New Scala Libraries

Library dependency tracking is a complicated thing. Using a new library to experiment with it, to write a few bits of exploratory code, or even to build small self-contained bits of software which use it is often an exercise in frustration, especially if the library requires a multitude of other libraries before it even starts working. This problem surfaces very often, in pretty much all programming environments. Code that runs on the JVM (written in Java or Scala) is one of the cases where this sort of issue is particularly frequent and can get extremely hairy. Fortunately, there’s a very convenient tool which handles this problem in an amazingly flexible manner for Scala code: the SBT utility.

In this article I’m going to use the Scalaz library of functional programming idioms as a demo of what SBT can do for you. I’ll start by showing how Scalaz libraries can be loaded manually, by tinkering with the classpath of scala. Then we’ll see how the same can be done with a small SBT project. Finally, I’ll describe why I prefer the SBT based method of experimenting with new Scala libraries.

Loading A Scala Library Manually

Loading a new Scala library, in preparation for writing some exploratory code which uses it, is pretty similar to what one does with Java libraries. All it takes is to start the Scala REPL with the appropriate JAR file somewhere in the classpath. For example, if you want to load the Scalaz library of functional programming idioms, you can just download the JAR file of the appropriate version, and pass it to the scala(1) utility with the -cp option:

% wget -q -nd -np -c -r \
  http://repo1.maven.org/maven2/org/scalaz/scalaz-core_2.10/6.0.4/scalaz-core_2.10-6.0.4.jar
% scala -cp scalaz-core_2.10-6.0.4.jar
Welcome to Scala version 2.10.2 (Java HotSpot(TM) 64-Bit Server VM, Java 1.6.0_51).
Type in expressions to have them evaluated.
Type :help for more information.

scala> import scalaz._
import scalaz._

scala> import Scalaz._
import Scalaz._

There’s nothing special about adding library JAR files to the Scala REPL’s classpath. It’s exactly what you would do for any plain old Java library too. The important thing to notice here, though, is that major Scala versions may be incompatible with each other, so you have to make sure that you load the JAR file built for your Scala version. Hence the “_2.10” part of the “scalaz-core_2.10-6.0.4.jar” filename.
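The naming convention can be spelled out in a few lines of Scala. This is only a sketch: `JarNames` and `crossBuiltJar` are hypothetical names made up for illustration, not part of Scala or SBT.

```scala
object JarNames {
  // Hypothetical helper for illustration; not part of Scala or SBT.
  // Builds the file name of a library JAR published for one Scala version:
  // <artifact>_<scala version tag>-<library version>.jar
  def crossBuiltJar(artifact: String,
                    scalaVersionTag: String,
                    libraryVersion: String): String =
    artifact + "_" + scalaVersionTag + "-" + libraryVersion + ".jar"
}

// JarNames.crossBuiltJar("scalaz-core", "2.10", "6.0.4")
// gives "scalaz-core_2.10-6.0.4.jar", the file downloaded above.
```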

Real Use Case: Loading Scalaz From Your REPL

Scalaz is a popular Scala library for functional programming. There are many exciting features in scalaz, e.g. the Validation[A, B] support for handling exceptional conditions in a functional manner, or the NonEmptyList[A] classes for lists which must have at least one item.

Using the library works, of course, like using any other library which can load on the JVM. Just fetch the appropriate set of JAR files, point the Scala interpreter’s class-path to them, and then import the library’s exported symbols from your Scala REPL.

For this to work you have to make sure that the Scala version the JAR was built for (2.10 here) matches the one started by default by your Scala interpreter. Then, to use the JAR file, you have to add it manually to the class-path of your scala session:

% scala -cp scalaz-core_2.10-6.0.4.jar
Welcome to Scala version 2.10.2 (Java HotSpot(TM) 64-Bit Server VM, Java 1.6.0_51).
Type in expressions to have them evaluated.
Type :help for more information.

scala> import scalaz._
import scalaz._

scala> import Scalaz._
import Scalaz._

scala> def validate(text: String): Validation[String, Boolean] = {
     |   text.find{ _.isUpper } match {
     |     case Some(character) => "'%s' is not a valid string".format(text).fail
     |     case _ => true.success
     |   }
     | }
validate: (text: String)scalaz.Validation[String,Boolean]

scala> validate("hello world")
res0: scalaz.Validation[String,Boolean] = Success(true)

scala> validate("Hey, this shouldn't work")
res1: scalaz.Validation[String,Boolean] = Failure('Hey, this shouldn't work' is not a valid string)
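For readers who want to try the same success/failure shape before putting Scalaz on the classpath, here is a rough stand-in which uses the standard library's `Either` instead. It is not the Scalaz API: `EitherSketch` is a hypothetical name, and `Left`/`Right` merely play the roles of `Failure`/`Success` from the session above.

```scala
object EitherSketch {
  // Left carries the error message, Right carries the successful result.
  // Like the Scalaz version, any upper-case character makes the text invalid.
  def validate(text: String): Either[String, Boolean] =
    text.find(_.isUpper) match {
      case Some(_) => Left("'%s' is not a valid string".format(text))
      case None    => Right(true)
    }
}

// EitherSketch.validate("hello world")  is Right(true)
// EitherSketch.validate("Hey")          is Left("'Hey' is not a valid string")
```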

What if you want to experiment with multiple scala language versions though? Or with multiple scalaz library versions?

Using Scala Build Tool (SBT) To Do Even More Fun Stuff

Manually tracking where to download the JAR files from, and adding the right JAR files to the class-path, is bound to get very tedious. Fortunately, you can use the SBT utility and a small “bootstrap project” to load the required libraries. The SBT utility is the standard “build tool” used by many Scala libraries, and it provides an easy way of defining:

  • Which libraries your project will need to load
  • Which Scala compiler version to use (2.9.X, 2.10.X, etc.)
  • Which subset of the downloaded libraries to preload for REPL sessions

The pre-defined support of SBT for using Maven repositories, searching for the appropriate Scala compiler version, fetching and caching the JAR files of the libraries the project loads, and all the other small conveniences it provides quickly add up. So this second method of loading Scalaz is the one I am using nowadays to quickly load any new Scala library I want to experiment with. I just let SBT handle the details, by creating a new directory and putting in it a “build.sbt” file with the minimal amount of dependency information needed to fetch, load and start using the library.

A sample project definition which you can use to load Scalaz version 6.0.4 using the 2.10.2 version of the Scala compiler is:

name := "scalaz-demo"

version := "0.2"

scalaVersion := "2.10.2"

libraryDependencies += "org.scalaz" %% "scalaz-core" % "6.0.4"

scalacOptions += "-deprecation"

initialCommands in console := """
    |import scalaz._
    |import Scalaz._
    |import util.control.Exception.allCatch
    |""".stripMargin

Note how the version of the Scala compiler and the version of the Scalaz library are each specified in a simple sbt ‘setting’. The rest of the details, like where to download the JAR file of scalaz, how to find the appropriate JAR version for the Scala compiler version we are using, how to cache the JAR files locally, and many other things, are handled internally by SBT. The initialCommands in console setting saves some typing by pre-importing scalaz when we fire up the interactive Scala REPL for this project, so with this in place we can just run “sbt console” and start writing code which uses scalaz-core right away:

% sbt console
[info] Loading global plugins from /Users/gkeramidas/.sbt/0.13/plugins
[info] Set current project to scalaz-demo (in build file:/Users/gkeramidas/hg/demo/scalaz-demo/)
[info] Starting scala interpreter...
[info]
import scalaz._
import Scalaz._
import util.control.Exception.allCatch
Welcome to Scala version 2.10.2 (Java HotSpot(TM) 64-Bit Server VM, Java 1.6.0_51).
Type in expressions to have them evaluated.
Type :help for more information.

scala> def foo(x: Int) =
     |   "%d is a funny integer".format(x).fail[Int]
foo: (x: Int)scalaz.Validation[String,Int]

scala> foo(10)
res0: scalaz.Validation[String,Int] = Failure(10 is a funny integer)
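The part of the build definition which ties the library to the Scala version is the `%%` operator in the libraryDependencies line: it makes SBT append the Scala version tag to the artifact name. As a sketch of what that means for this project, the fragment below shows the setting next to a hand-written alternative with a single `%`, where the tag is spelled out. The alternative is commented out, since a build needs only one of the two.

```scala
// With scalaVersion := "2.10.2", this setting from the build above ...
libraryDependencies += "org.scalaz" %% "scalaz-core" % "6.0.4"

// ... asks for the same artifact as this hand-written form would
// (use one form or the other, not both):
// libraryDependencies += "org.scalaz" % "scalaz-core_2.10" % "6.0.4"
```

The hand-written form stops matching as soon as scalaVersion changes, which is why the `%%` form is the convenient one for experiments.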

With SBT, it’s even possible to reload the entire project, using a different version of the Scala compiler, and keep hacking at the REPL:

% sbt
[info] Loading global plugins from /Users/gkeramidas/.sbt/0.13/plugins
[info] Set current project to scalaz-demo (in build file:/Users/gkeramidas/hg/demo/scalaz-demo/)
> show scalaVersion
[info] 2.10.2
> set scalaVersion := "2.9.0-1"
[info] Defining *:scalaVersion
[info] The new value will be used by *:allDependencies, *:ivyScala and 10 others.
[info] 	Run `last` for details.
[info] Reapplying settings...
[info] Set current project to scalaz-demo (in build file:/Users/gkeramidas/hg/demo/scalaz-demo/)
> console
[info] Updating {file:/Users/gkeramidas/hg/demo/scalaz-demo/}scalaz-demo...
[info] Resolving org.scala-lang#scala-library;2.9.0-1 ...
[info] Resolving org.scalaz#scalaz-core_2.9.0-1;6.0.4 ...
[info] Resolving org.scala-lang#scala-compiler;2.9.0-1 ...
[info] Resolving org.scala-lang#jline;2.9.0-1 ...
[info] Resolving org.fusesource.jansi#jansi;1.4 ...
[info] Done updating.
[info] Compiling 1 Scala source to /Users/gkeramidas/hg/demo/scalaz-demo/target/scala-2.9.0-1/classes...
[info] 'compiler-interface' not yet compiled for Scala 2.9.0.1. Compiling...
[info]   Compilation completed in 11.889 s
[info] Starting scala interpreter...
[info]
import scalaz._
import Scalaz._
import util.control.Exception.allCatch
Welcome to Scala version 2.9.0.1 (Java HotSpot(TM) 64-Bit Server VM, Java 1.6.0_51).
Type in expressions to have them evaluated.
Type :help for more information.

scala> def foo(x: Int) =
     |   "%d is a funny integer".format(x).fail[Int]
foo: (x: Int)scalaz.Validation[String,Int]

scala> foo(12345)
res0: scalaz.Validation[String,Int] = Failure(12345 is a funny integer)

Note how SBT took care of loading the correct version of scalaz’s JAR file. The artifact loaded for scalaz-core is shown in the informational message:

[info] Resolving org.scalaz#scalaz-core_2.9.0-1;6.0.4 ...

When SBT reports that it ‘resolved’ the artifact, it has already fetched it (from one of my previous experiments with scalaz on this machine), and it’s ready to be loaded. It’s even possible to see where SBT has stored the artifact, by peeking at the classpath it uses to load the Scala REPL:

> show full-classpath
[info] List(Attributed(/Users/gkeramidas/hg/demo/scalaz-demo/target/scala-2.9.0-1/classes), Attributed(/Users/gkeramidas/.ivy2/cache/org.scala-lang/scala-library/jars/scala-library-2.9.0-1.jar), Attributed(/Users/gkeramidas/.ivy2/cache/org.scalaz/scalaz-core_2.9.0-1/jars/scalaz-core_2.9.0-1-6.0.4.jar))
[success] Total time: 0 s, completed Sep 4, 2013 4:27:21 PM

The scalaz JAR file which matches the Scala compiler version has been downloaded by SBT and cached in the local artifact cache at ~/.ivy2. Setting scalaVersion to a different value and starting the console again resolves the matching scalaz JAR file, and fetches whatever JAR files are still missing for the new version, caching them under ~/.ivy2 again:

> set scalaVersion := "2.9.1"
[info] Defining *:scalaVersion
[info] The new value will be used by *:allDependencies, *:ivyScala and 10 others.
[info] 	Run `last` for details.
[info] Reapplying settings...
[info] Set current project to scalaz-demo (in build file:/Users/gkeramidas/hg/demo/scalaz-demo/)
> compile:console
[info] Updating {file:/Users/gkeramidas/hg/demo/scalaz-demo/}scalaz-demo...
[info] Resolving org.scala-lang#scala-library;2.9.1 ...
[info] Resolving org.scalaz#scalaz-core_2.9.1;6.0.4 ...
[info] Resolving org.scala-lang#scala-compiler;2.9.1 ...
[info] Resolving org.scala-lang#jline;2.9.1 ...
[info] Resolving org.fusesource.jansi#jansi;1.4 ...
[info] downloading http://repo1.maven.org/maven2/org/scala-lang/scala-library/2.9.1/scala-library-2.9.1.jar ...
[info] 	[SUCCESSFUL ] org.scala-lang#scala-library;2.9.1!scala-library.jar (7366ms)
[info] downloading http://repo1.maven.org/maven2/org/scala-lang/scala-compiler/2.9.1/scala-compiler-2.9.1.jar ...
[info] 	[SUCCESSFUL ] org.scala-lang#scala-compiler;2.9.1!scala-compiler.jar (5512ms)
[info] downloading http://repo1.maven.org/maven2/org/scala-lang/jline/2.9.1/jline-2.9.1.jar ...
[info] 	[SUCCESSFUL ] org.scala-lang#jline;2.9.1!jline.jar (803ms)
[info] Done updating.
[info] Compiling 1 Scala source to /Users/gkeramidas/hg/demo/scalaz-demo/target/scala-2.9.1/classes...
[info] 'compiler-interface' not yet compiled for Scala 2.9.1.final. Compiling...
[info]   Compilation completed in 9.926 s
[info] Starting scala interpreter...
[info]
import scalaz._
import Scalaz._
import util.control.Exception.allCatch
Welcome to Scala version 2.9.1.final (Java HotSpot(TM) 64-Bit Server VM, Java 1.6.0_51).
Type in expressions to have them evaluated.
Type :help for more information.

scala>

SBT is a nice tool for experimenting with a new Scala library. It fetches all the right JAR files, resolves their dependencies correctly, and takes care of all the bothersome details of setting up the ‘environment’ for running your code with version X.Y.Z of the library under version A.B.C of the Scala language. That is too convenient and useful to ignore. So the next time you have to play around with a Scala library, consider writing a simple SBT build definition which loads it as a dependency, instead of fighting with JAR classpaths and other such annoying stuff.

Update: The feature of SBT which allows reloading a project’s dependencies with a different base-compiler for the Scala runtime is so useful that the friendly folks who develop SBT have given it a special short alias. You can switch to another Scala compiler version by typing “++” and the version you are switching to, e.g.:

> ++2.9.1
[info] Setting version to 2.9.1
[info] Set current project to scalaz-demo (in build file:/Users/gkeramidas/git/demo/scalaz-demo/)
> console
[info] Updating {file:/Users/gkeramidas/git/demo/scalaz-demo/}scalaz-demo...
[info] Resolving org.scala-lang#scala-library;2.9.1 ...
[info] Resolving org.scalaz#scalaz-core_2.9.1;6.0.4 ...
[info] Resolving org.scala-lang#scala-compiler;2.9.1 ...
[info] Resolving org.scala-lang#jline;2.9.1 ...
[info] Resolving org.fusesource.jansi#jansi;1.4 ...
[info] Done updating.
[info] Starting scala interpreter...
[info]
import scalaz._
import Scalaz._
import util.control.Exception.allCatch
Welcome to Scala version 2.9.1.final (Java HotSpot(TM) 64-Bit Server VM, Java 1.6.0_65).
Type in expressions to have them evaluated.
Type :help for more information.

Reading book: “Akka Essentials”

I have recently gotten myself a copy of “Akka Essentials” and started going through the code, but writing the same examples in Scala (instead of Java, which the original book uses). This is turning out to be a fairly good exercise, as it forces me to think about what the idiomatic way of writing something in Scala is, versus just plainly dumping Java code and minimally editing it to make everything compile.

For example here’s a Java snippet from one of the first chapters, which sets things up for a “MapActor”:

package akka.first.app.mapreduce.actors;

import java.util.*;
import java.util.StringTokenizer;
import akka.actor.UntypedActor;
import akka.first.app.mapreduce.messages.MapData;
import akka.first.app.mapreduce.messages.WordCount;

public class MapActor extends UntypedActor {
  String[] STOP_WORDS = {
      "a", "am", "an", "and", "are", "as", "at",
      "be", "do", "go", "if", "in", "is", "it",
      "of", "on", "the", "to" };

  private List<String> STOP_WORDS_LIST =
      Arrays.asList(STOP_WORDS);

  @Override
  public void onReceive(Object message) throws Exception {
    if (message instanceof String) {
      String work = (String) message;
      // map the words in the sentence and send the result
      // to MasterActor
      getSender().tell(evaluateExpression(work));
    } else
      unhandled(message);
  }

  private MapData evaluateExpression(String line) {
    List<WordCount> dataList = new ArrayList<WordCount>();
    StringTokenizer parser = new StringTokenizer(line);
    while (parser.hasMoreTokens()) {
      String word = parser.nextToken().toLowerCase();
      if (!STOP_WORDS_LIST.contains(word)) {
        dataList.add
            (new WordCount(word,Integer.valueOf(1)));
      }
    }
    return new MapData(dataList);
  }
}

This MapActor is part of a Map-Reduce-Aggregate set of Akka actors, and its work is relatively simple:

  • It should handle a “String” message, by tokenizing the string and returning a MapData result.
  • The MapData should contain a list of WordCount instances, with every word’s initial count set to 1.
  • The words which are part of a STOP_WORDS list should be skipped when generating the result.

The code is readable in Java too, but as I was writing the equivalent in Scala, I noticed that I could do a couple of things to simplify it even more. One of them was to use pattern matching instead of instanceof checks in the message-handling code. The other one was that a while loop is a bit noisy in this case, and it can be replaced by several more concise snippets.
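The pattern matching idea can be tried in isolation, without Akka on the classpath. In the sketch below, `MatchSketch` and `handle` are hypothetical names: the function returns a description of what it saw instead of replying to a sender, but the shape of the match is the same as in a message handler.

```scala
object MatchSketch {
  // Hypothetical stand-in for onReceive: the typed pattern `work: String`
  // replaces both the instanceof check and the cast of the Java version.
  def handle(message: Any): String = message match {
    case work: String => "got a String of length " + work.length
    case other        => "unhandled: " + other
  }
}

// MatchSketch.handle("hello")  is "got a String of length 5"
// MatchSketch.handle(42)       is "unhandled: 42"
```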

Another happy side-effect of writing in Scala was that the code was stripped of all the ‘boilerplate’ Java requires. For example, I didn’t have to do all the crafty stuff about ArrayList and manual conversion to List<String>, since Scala provides a neat and very readable way of initializing a list of strings.

I started with something like this:

package akka.first.app.mapreduce.actors

import akka.first.app.mapreduce.messages.MapData
import akka.actor.UntypedActor

class MapActor extends UntypedActor {
  override def onReceive(message: Any) =
    message match {
      case (work: String) => getSender ! evaluateExpression(work)
      case _              => unhandled(message)
    }

  /** Defines a list of words which are never counted. */
  private val STOP_WORDS: List[String] = List(
    "a", "am", "an", "and", "are", "as", "at", "be","do",
    "go", "if", "in", "is", "it", "of", "on", "the", "to")

  /** Evaluates a sentence, removes stop words, and returns
    * the result as a new `MapData` instance.
    */
  private def evaluateExpression(text: String): MapData =
    new MapData(Nil)
}

This doesn’t do anything useful in evaluateExpression yet. It’s already almost there though, and even if we include the Scaladoc comment lines, the total is about half the size of the Java code.

Now, I experimented with two alternative approaches to tokenizing the input text. One of them uses the simplest code possible:

  private def evaluateExpression(text: String): MapData =
    new MapData(text.split("\\s").map{_.toLowerCase}
      .filter(w => !STOP_WORDS.contains(w))
      .map(w => new WordCount(w, 1)).toList)

This version should work fine, and the Scala version of the MapActor is still very readable code:

package akka.first.app.mapreduce.actors

import akka.first.app.mapreduce.messages.MapData
import akka.first.app.mapreduce.messages.WordCount
import akka.actor.UntypedActor

class MapActor extends UntypedActor {
  override def onReceive(message: Any) =
    message match {
      case (work: String) => getSender ! evaluateExpression(work)
      case _              => unhandled(message)
    }

  /** Defines a list of words which are never counted. */
  private val STOP_WORDS: List[String] = List(
    "a", "am", "an", "and", "are", "as", "at", "be","do",
    "go", "if", "in", "is", "it", "of", "on", "the", "to")

  /** Evaluates a sentence, removes stop words, and returns
    * the result as a new `MapData` instance.
    */
  private def evaluateExpression(text: String): MapData =
    new MapData(text.split("\\s").map{_.toLowerCase}
      .filter(w => !STOP_WORDS.contains(w))
      .map(w => new WordCount(w, 1)).toList)
}
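To run the word-filtering logic on its own, outside an actor, something like the following sketch can be pasted into a REPL. The two case classes are assumptions: the real `MapData` and `WordCount` live in akka.first.app.mapreduce.messages and their definitions are not shown in this post, so the field names here are made up. `MapSketch` is a hypothetical wrapper as well.

```scala
object MapSketch {
  // Assumed shapes of the two message types (not taken from the book).
  case class WordCount(word: String, count: Int)
  case class MapData(dataList: List[WordCount])

  // Same stop words as in the MapActor listing.
  val STOP_WORDS: List[String] = List(
    "a", "am", "an", "and", "are", "as", "at", "be", "do",
    "go", "if", "in", "is", "it", "of", "on", "the", "to")

  // Split on whitespace, lower-case, drop stop words, count each word once.
  def evaluateExpression(text: String): MapData =
    MapData(text.split("\\s").map(_.toLowerCase)
      .filter(w => !STOP_WORDS.contains(w))
      .map(w => WordCount(w, 1)).toList)
}

// MapSketch.evaluateExpression("The fox is in the box").dataList.map(_.word)
// is List("fox", "box")
```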

Another version of evaluateExpression, which I wrote to play around with Stream-generating functions in Scala, was the following:

import java.util.StringTokenizer

def tokens(parser: StringTokenizer): Stream[String] =
  parser.hasMoreTokens match {
    case false => Stream.empty
    case true  => Stream.cons(
      parser.nextToken.toLowerCase, tokens(parser))
  }

def evaluateExpression(text: String): MapData = {
  val parser = new StringTokenizer(text)
  val words = tokens(parser)
  MapData(words.filter(w => !STOP_WORDS.contains(w))
          .map(w => WordCount(w, 1)).toList)
}

One important difference of this version is that it produces the tokens lazily, on demand, instead of pre-generating the entire list of words in memory, so it might be a better option for extremely large input strings. A quick timing in the REPL already shows a difference:

scala> import java.util.StringTokenizer
import java.util.StringTokenizer

scala> def tokens(parser: StringTokenizer): Stream[String] =
     |     parser.hasMoreTokens match {
     |       case false => Stream.empty
     |       case true  => Stream.cons(parser.nextToken.toLowerCase, tokens(parser))
     |     }
tokens: (parser: java.util.StringTokenizer)Stream[String]

scala> def time[R](block: => R): R = {
     |     val t0 = System.nanoTime()
     |     val result = block    // call-by-name
     |     val t1 = System.nanoTime()
     |     println("Elapsed time: " + (t1 - t0) + "ns")
     |     result
     | }
time: [R](block: => R)R

scala> val large = "foo " * 500000
large: String = "foo foo foo foo foo foo foo foo foo ...

scala> time { tokens(new StringTokenizer(large)).size }
Elapsed time: 92060000ns
res1: Int = 500000

scala> time { large.split("\\s").size }
Elapsed time: 126982000ns
res2: Int = 500000

Note how the second version, which uses plain split(), is about 38% slower in this single run, even when we are just counting the number of words in each token list [126982000.0 / 92060000 =~ 1.37934].
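The on-demand behaviour of the Stream version can be observed directly, independently of any timing. The sketch below reuses the tokens function from above and adds a hypothetical helper, `firstTwo`, which asks for only two words and then checks how many tokens the parser is still holding. `LazySketch` is a made-up wrapper name.

```scala
import java.util.StringTokenizer

object LazySketch {
  // Same definition as in the post: the head is read immediately,
  // the rest of the stream only when somebody asks for it.
  def tokens(parser: StringTokenizer): Stream[String] =
    parser.hasMoreTokens match {
      case false => Stream.empty
      case true  => Stream.cons(parser.nextToken.toLowerCase, tokens(parser))
    }

  // Hypothetical helper: takes the first two words and reports how many
  // tokens the parser has not handed out yet.
  def firstTwo(text: String): (List[String], Int) = {
    val parser = new StringTokenizer(text)
    val firstWords = tokens(parser).take(2).toList
    (firstWords, parser.countTokens)
  }
}

// LazySketch.firstTwo("Foo BAR baz qux")._1 is List("foo", "bar"),
// and the second component is still positive: the remaining words
// were never pulled out of the tokenizer.
```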

What I got from doing this rewrite in Scala, though, was something entirely different: an appreciation for the features of Scala which let me write compact, yet also very readable and very clean code, like this final version of the MapActor code:

package akka.first.app.mapreduce.actors

import akka.first.app.mapreduce.messages.MapData
import akka.first.app.mapreduce.messages.WordCount
import akka.actor.UntypedActor

/** Parses a `String` into whitespace-separated words, skips
  * over "stop words" and constructs a MapData message with
  * the resulting words and their counts initialized to 1.
  */
class MapActor extends UntypedActor {
  override def onReceive(message: Any) =
    message match {
      case (work: String) => getSender ! evaluateExpression(work)
      case _              => unhandled(message)
    }

  /** Defines a list of words which are never counted. */
  private val STOP_WORDS: List[String] = List(
    "a", "am", "an", "and", "are", "as", "at", "be","do",
    "go", "if", "in", "is", "it", "of", "on", "the", "to")

  /** Evaluates a sentence, removes stop words, and returns
    * the result as a new `MapData` instance.
    */
  private def evaluateExpression(text: String): MapData =
    MapData(text.split("\\s").map(_.toLowerCase)
      .filter(w => !STOP_WORDS.contains(w))
      .map(w => WordCount(w, 1)).toList)
}