Author Archive

The operators of blackmail viruses have been actively looking for new niches to target over the past years. The sad truth in this regard is that entities like educational institutions, healthcare organizations and even law enforcement agencies are low-hanging fruit to these attackers. One of the latest onslaughts demonstrates this unsettling susceptibility. In mid-April 2018, an unidentified ransomware strain hit the computer network of the Leominster Public School District in Massachusetts. While the details of the specific attack vector remain undisclosed at the time of writing, the most likely entry point was a phishing email opened by one of unsuspecting staffers. Ultimately, the district officials have admitted to paying $10,000 worth of Bitcoin to regain access to the proprietary records.

Ransomware continues to be a critical problem to users and organizations

According to the local police that’s investigating the incident, the school didn’t maintain an offsite data backup. That’s very poor security hygiene that makes users and companies incur serious losses in various information security incursions and data breaches. As a result, part of the target’s network was locked down as the malicious code applied a strong cipher to encrypt the most common types of files spotted on the host servers.

According to some unconfirmed reports, the troublemaking program might be the infamous WannaCry ransomware, which broke out worldwide in May 2017 and crippled numerous computer networks, including government-related ones and those belonging to industry giants. Some organizations had to rebuild entire segments of their infrastructure from scratch to recover from this massive attack. The UK’s National Health Service exemplifies the harsh impact as about 70,000 of its devices were affected.

The involvement of WannaCry in the Leominster case is a mere speculation, though. If it holds true, the attack probably tool place via unpatched software exploited in a furtive way. One way or another, although the FBI and security professionals advise against submitting ransoms in scenarios like that, the school district elected the lesser of two evils. The officials followed the crooks’ demands and coughed up the negotiated amount of cryptocurrency.

In summary, crypto ransomware continues to be a serious concern, and organizations are much better off keeping file backups to avoid the damage.

Con artists are homing in on users’ social network accounts via phishing messages disguised as verification requests or copyright infringement alerts.

Social networks such as Facebook, Instagram, Twitter, and TikTok boast huge user audiences and therefore increasingly lure online scammers. By obtaining credentials for numerous accounts, threat actors can mishandle the unauthorized access to perpetrate fraudulent initial coin offerings (ICOs) or controversial propaganda. In some cases, crooks simply sell these details on hacker forums.

Unsurprisingly, social network accounts have become a pricey asset in the cybercriminal circles and need to be safeguarded accordingly. Since early August 2020, security researchers have been observing a spike in phishing stratagems that zero in on social network users.

The two dominant forms of these scams are described below. Go over the information to identify these hoaxes if they happen to hit you.

(more…)

A recent report presented recently by IT-security company Carbon Black stresses a 2,500 % increase in the ransomware Dark Net industry, matched against the previous year.

The study supports numerous forecasts expressed by the majority of info-security specialists a year ago who said ransomware would likely have an essential role in all types of cyber-crime and get the biggest market share.

To collect information for this report, experts scanned the Dark Net for communities and sites offering and advertising all ransomware related products and services.

Researchers found approximately 6,200 spots where criminals had offered their services with the help of more than 44,000 ads.

Rates are varying greatly, from $1 to $4,000. The price variance is determined by different economic models crooks select to sell their goods. Some charge on a per-sample basis when others prefer monthly subscription plans.

Comparing 2016 and 2017, the ransomware economy has exploded from $250,000 to $6,230, 000, a rate of 2,500%, researchers note in their report. These extortion schemes get enormous ransom payouts that totaled in $1B in 2016. Earlier in 2015, it was $24M.

Ransomware-as-a-Service (RaaS) is the main driving force of the ransomware economy. Big and small RaaS portals started to appear in early 2017. These RaaS sites are all different and each works in its price niche. For instance, you can find RaaS portals offering all-in-one solutions. Some portals offer only minimal number of services. Finally, there are individual sellers who offer just the ransomware code.

Multi-function RaaS services provide the ransomware executable file itself, they also offer delivery mediums like botnets and exploit kits. In addition, you can get a payment portal to manage ransoms. On top of that, you can rent customer support team. All of this is available from a convenient web-based admin panel.

Reduced service RaaS sites supply the ransomware file, and just a couple of the services above, typically at more affordable prices.

Finally, there are private sellers who are virus writers. They sell just the ransomware file and allow clients to manage the rest. Some ransomware writers earn more than $150,000 a year. That is much more than the standard salary of a legal software developer.

After an embarrassingly long hiatus, I’ve started fooling with Scala some more and would like to report how I whiled away a few pleasant hours this weekend. You may have heard of the Stupid Pet Tricks and Stupid Human Tricks segments that David Letterman featured on his television show some years ago. The idea was that the pets and people would perform tricks which, while amusing, were not useful for anything. They were just stupid tricks. In this post I will be describing a stupid Scala trick.

If I may continue to reminisce, I remember years ago seeing a Martin Gardner-esqe mathematical puzzle called the four bug problem (although I don’t remember whether what I read was actually written by Gardner). I see that Wolfram Alpha refers to it as the “Mice Problem”. In short, imagine four bugs (or mice) positioned at the four corners of a square. Each one begins walking toward his clockwise neighbor. Since they’re all walking, they’re all following moving targets. Each bug’s path is affected by his neighbor, so they form a sort of feedback loop. The problem is to describe the path that each bug will take. It’s an interesting problem.

You can also describe the path of 5 bugs on a pentagon or 6 on a hexagon, and so forth. But I wondered what might happen if you had 5,000 bugs positioned not on a 5,000-agon but just everywhere, each following one of his randomly chosen fellows. Chaos, no doubt. This is the story of a little Scala app that answers the question. I’ll present the parts of my solution individually as well as the complete source code which you can compile and experiment with.

Domain classes

First, I need to describe the location of a bug. I could use something built-in like java.awt.Point but I didn’t like the look of it because its interface mixes doubles and ints whereas I wanted floating point coordinates. So I started by creating a class called Position that stores x and y as Doubles. And since the whole idea is to move from position to position, I included a move function. Like so:

class Position(val x: Double, val y: Double) {
  def move(xOffset: Double, yOffset: Double) = new Position(x + xOffset, y + yOffset)
}

That’s a good start, but ultimately we want to take one bug’s position and move it toward another bug’s position. So I added a moveToward and moveAwayFrom function as well as a function to calculate the distance between two positions. I won’t describe this code in excruciating detail. It’s pretty apparent what’s going one. Here it is:

class Position(val x: Double, val y: Double) {
  def move(xOffset: Double, yOffset: Double) = new Position(x+xOffset, y+yOffset)
  private def distNumbers(that: Position) = {
    val xDist = that.x - this.x
    val yDist = that.y - this.y
    val dist = scala.math.sqrt(xDist*xDist + yDist*yDist)
    (xDist, yDist, dist)
  }
  def dist(that: Position) = distNumbers(that)._3
  def moveToward(that: Position, dist: Double) = {
    val (xDist, yDist, totalDist) = distNumbers(that)
    def offset(num: Double) = {
      val result = dist * num / totalDist
      if (result.isNaN) 0 else result
    }
    move(offset(xDist), offset(yDist))
  }
}

The private function is there to prevent duplication of logic. This class isn’t super fancy, but it has the following features: It is immutable, it consistently uses Doubles for coordinates and offsets, it allows us to move a position, to find the distance between two positions and to move one position some distance toward or away from a second position.

One improvement I added after a few tests was to check for NaN (not-a-number) in the xOffset and yOffset values. You can see that on line 14 above. Since these bugs are going to walk toward each other they will often actually meet, resulting in a division by zero. This would cause the bugs to go haywire and shoot off into the corner of the screen after a little while. Line 14 prevents this.

That’s it for the Position class. Now to create the Bug class.

class Bug(xCoord: Double, yCoord: Double) {
  private var pos = new Position(xCoord,yCoord)
  private var nextPos = new Position(xCoord,yCoord)

  def getPosition = pos
  def x: Double = pos.x
  def y: Double = pos.y

  def prepareMove(func : (Position) => Position) = {
    nextPos = func(pos)
  }

  def paint(g: Graphics2D) {
    g.drawLine(pos.x.toInt, pos.y.toInt, nextPos.x.toInt, nextPos.y.toInt)
    pos = nextPos
  }
}

This is mostly self-explanatory. A Bug is little more than a position, a mechanism for transitioning from one position to the next, and a way for it to draw itself on the screen. The only thing that might need explanation is the reason for the nextPos member and the prepareMove function. I included pos and nextPos because I wanted to ensure that all the bugs first decided where to move based on the other bugs’ current positions, and then moved. Say bug B is following bug A. First, bug A decides his move, based on whoever he happens to be following. When bug B makes his move, I want his move to be based on A’s original position, not the new position. That is to say, I want the bugs to all decide where to move at once based on their fellows’ current position, and then to move all at once to the positions they had decided on. The next position isn’t copied into current position until just after the bug draws itself.

The prepareMove function take a function parameter. You’ll notice that the Bug class contains no logic about how to follow another bug. I wanted this logic to reside outside Bug itself and get passed in.

Setup

Now, to instantiate a bunch of bugs and assign each of them someone to follow:

  val initWinSize = new Dimension(800, 800)
  val bugs = {
    def random = Random.nextInt(2000) - 1000
    List.fill(5000)(
      new Bug(random + initWinSize.width/2,
              random + initWinSize.height/2))
  }
  var targets = Random.shuffle(bugs)

I chose to distribute the locations across a range from -1000 to +1000 in both the x and y directions, and to create 5000 bugs. These numbers are arbitrarily chosen. I also added an offset so that the bugs would be distributed about the center of the program window, whose dimensions I have decided to set at 800 by 800. The scala.util.Random class makes it trivially easy to create a randomly shuffled list of target bugs. The targets list contains the same Bug instances as the bugs list, only rearranged. So the first item in the bugs list will follow the first item in the targets list, the second bug follows the second target, and so forth.

Bug logic

Now we have our Bug class defined, a population of bugs created, and we’ve assigned them someone to follow. Below is a function called performMoves which will be called over and over in a loop running in a Thread.

      def performMoves = {
        for ((bug,target) <- bugs.zip(targets)) {
          bug.prepareMove{(pos) =>
            pos.moveToward(target.getPosition, 1.0)
          }
        }
      }

This code loops through all the bug/target pairs, and moves the bug one unit toward the target. This is behavior we were looking for and it works! When I ran the complete program using the above logic the bugs start our in a tangled mess of intersecting paths, but they gradually coalesce into a few smooth curved lines. It’s neat to see.

The complex curves of single-file marching bugs begin to smooth out and the bugs gradually get closer and closer to one another. When one bug is very close to another bug and moves one unit towards it, the direction he moves starts to get erratic. This makes the smooth line jittery and jagged. To combat this tendency, I made an enhancement to the movement logic. In my improved version of performMoves, bugs behave in the normal way when they 1.0 units or further from their target bug. When they get closer than 1.0 unit the bug moves away from the center of gravity of the bug system. I experimented with different distances for the bugs to flee the center of gravity, but I got the most pleasing results when they flee at a rate of 1.0 over the square root of the bug’s distance from the center of gravity.

This causes the bugs to spread out a bit when they start getting too close to each other and keeps the paths relatively smooth. It also has the effect of making the whole system more dynamic and interesting to watch. When the bugs start to get close to each other there is an outward impulse which causes little waves in the lines. Also, whenever you get a small circuit of bugs (a loop of a dozen bugs, say) they very quickly close into a tight formation and are propelled away from the center, sometimes in a flattened figure eight. Anyway, here’s the improved bug movement logic:

      def performMoves = {
        centerOfGravity = {
          val big = bugs.foldLeft(new Position(0,0))((a,c) => a.move(c.x, c.y))
          new Position(big.x / bugs.length, big.y / bugs.length)
        }
        for ((cur,target) <- bugs.zip(targets)) {
          val centerDist = centerOfGravity.dist(cur.getPosition)
          cur.prepareMove{(bug) =>
            if (bug.dist(target.getPosition) < 1.0)
              bug.moveToward(centerOfGravity, -1.0 / math.sqrt(centerDist))
            else
              bug.moveToward(target.getPosition, 1.0)
          }
        }
      }

That’s all of the interesting parts of the code. In addition to this stuff, we need to add the user interface bits, mouse operations, drawing, etc. before we have a working program.

Finally, the complete source

I’ve kept you in suspense long enough. Without further ado, below is the complete source code followed by section-by-section description in case you really are interested in the UI. To run it just call BugsGame.main(null).

import swing._
import event._
import util.Random

class Position(val x: Double, val y: Double) {
  def move(xOffset: Double, yOffset: Double) = new Position(x+xOffset, y+yOffset)
  private def distNumbers(that: Position) = {
    val xDist = that.x - this.x
    val yDist = that.y - this.y
    val dist = scala.math.sqrt(xDist*xDist + yDist*yDist)
    (xDist, yDist, dist)
  }
  def dist(that: Position) = distNumbers(that)._3
  def moveToward(that: Position, dist: Double) = {
    val (xDist, yDist, totalDist) = distNumbers(that)
    def offset(num: Double) = {
      val result = dist * num / totalDist
      if (result.isNaN) 0 else result
    }
    move(offset(xDist), offset(yDist))
  }
}

class Bug(xCoord: Double, yCoord: Double) {
  private var pos = new Position(xCoord,yCoord)
  private var nextPos = new Position(xCoord,yCoord)
 
  def getPosition = pos
  def x: Double = pos.x
  def y: Double = pos.y
 
  def prepareMove(func : (Position) => Position) = {
    nextPos = func(pos)
  }
 
  def paint(g: Graphics2D) {
    g.drawLine(pos.x.toInt, pos.y.toInt, nextPos.x.toInt, nextPos.y.toInt)
    pos = nextPos
  }
}

object BugsGame extends SimpleSwingApplication {
  import java.awt.{Dimension, Graphics2D, Color => AWTColor}

  override def top = frame

  var center = new Point(0,0)
  var origCenter = new Point(0,0)
  var clickPt: Point = new Point(0,0)

  var centerOfGravity = new Position(0,0)

  val initWinSize = new Dimension(800, 800)

  val bugs = {
    def random = Random.nextInt(2000) - 1000
    List.fill(5000)(
      new Bug(random + initWinSize.width/2,
              random + initWinSize.height/2))
  }
  var targets = Random.shuffle(bugs)

  val frame = new MainFrame {
    title = "Scala Bugs"
    contents = mainPanel
    lazy val mainPanel = new Panel() {
      focusable = true
      background = AWTColor.white
      preferredSize = initWinSize

      override def paint(g: Graphics2D) {
        g.setColor(AWTColor.white)
        g.fillRect(0, 0, size.width, size.height)
        onPaint(g)
      }
    }

    listenTo(mainPanel.mouse.clicks, mainPanel.mouse.moves)
    reactions += {
      case MousePressed(src, point, i1, i2, b) => {
        clickPt = point
        origCenter = center
      }
      case MouseDragged(src, point, i1) => {
        center = new Point(origCenter.x + point.x - clickPt.x,
                           origCenter.y + point.y - clickPt.y)
      }
      case MouseReleased(src, point, i1, i2, b) => {
        center = new Point(origCenter.x + point.x - clickPt.x,
                           origCenter.y + point.y - clickPt.y)
      }
      case MouseClicked(src, point, i1, i2, b) => {
        targets = scala.util.Random.shuffle(bugs)
      }
    }

    val runner = new Thread(new Runnable {
      def run() = {
        while (true) {
          performMoves
          repaint()
          //Thread.sleep(10);
        }
      }

      def performMoves = {
        centerOfGravity = {
          val big = bugs.foldLeft(new Position(0,0))((a,c) => a.move(c.x, c.y))
          new Position(big.x / bugs.length, big.y / bugs.length)
        }
        for ((cur,target) <- bugs.zip(targets)) {
          val centerDist = centerOfGravity.dist(cur.getPosition)
          cur.prepareMove{(bug) =>
            if (bug.dist(target.getPosition) < 1.0)
              bug.moveToward(centerOfGravity, -1.0 / math.sqrt(centerDist))
            else
              bug.moveToward(target.getPosition, 1.0)
          }
        }
      }
    })
    runner.start
  }

  def onPaint(g: Graphics2D) {
    g.translate(center.x, center.y)

    for ((cur,target) <- bugs.zip(targets)) {
      g.setColor(AWTColor.lightGray)
      g.drawLine(cur.x.toInt,    cur.y.toInt,
        target.x.toInt, target.y.toInt)

      g.setColor(AWTColor.black)
      cur.paint(g)
    }

    g.setColor(AWTColor.red)
    g.drawOval(centerOfGravity.x.toInt - 3, centerOfGravity.y.toInt - 3, 6, 6)
  }

}

I’ll skip over the lines that I think are trivial:

Lines 5-22: The Position class as described above

Lines 24-40: The Bug class as described above

Lines 42: I used a SimpleSwingApplication as my base class. I’m no swing expert, so this may be a rather naive swing app.

Lines 47-49: Points (in screen coordinates) that I’ll use in my mouse operations.

Lines 53-61: The bug list initialization I described earlier.

Lines 63-76: Setting up the frame and main panel.

Lines 78-95: The mouse operations. You can drag the view around the window to see different areas in a large system, or to follow the bugs if they drift out of frame. You can also click in the window to re-randomize the targets list. This is a fun feature.

Lines 97-122: The main thread. It contains an infinite loop that moves the bugs and repaints, over and over. This section contains the performMoves function that I described earlier. This is the section that you’ll want to experiment with to see what kind of interesting behaviors you can get. Also, line 102 is a good place to add a Thread.sleep if you want to slow things down.

Lines 125-139: The onPaint function. First, it shifts the whole picture to the position selected using the mouse (line 89). Then for each bug it draws a light grey line from that bug to the bug it’s following, and calls on the bug to draw itself. Finally, it draws a small red circle to indicate the center of gravity of the bug population.

I recently began writing, as an exercise, some unit of measure code in Scala. I saw a headline in my newsreader some months ago about a Scala library for handling units of measure and I made a point NOT to read it because it sounded to me like an interesting problem and I wanted to first take a stab at it myself and then compare my solution to the one from the article and maybe even write my own article on my solution.

In the course of trying out a couple of designs I encountered a situation I wasn’t sure how to handle. I wanted an abstract class called Dimension which would encapsulate a measurement in some unit and I wanted to create several subtypes extending Dimension such as Length, Time, Mass, Temperature, etc. All Dimensions should be able to sum together, but only with their own kind. For example, 1 meter + 2 feet should give 1.6096 meters and 1 kilogram + 500 grams should give 1.5 kilograms. However, it makes no sense to add 30 seconds and 45 degrees Celsius. I wanted to arrange the types in such a way that a user of the library would not have the option of adding dimensions of two different types.

Here’s my initial code:

  abstract class Dimension(val value: Double, val name: String, val coef: Double) {
    def +(x: Dimension): Dimension
    override def toString(): String = value + " " + name
  }

  class Time(value: Double, name: String, coef: Double) extends
  Dimension(value, name, coef) {
    def +(x: Dimension): Time= new Time(value + coef * x.value / x.coef, name, coef)
  }

  class Length(value: Double, name: String, coef: Double) extends 
  Dimension(value, name, coef) {
    def +(x: Dimension): Length= new Length(value + coef * x.value / x.coef, name, coef)
  }

You can see the Dimension class declares an addition operator to satisfy our requirement that all Dimensions must be additive. No surprises so far.

The Time and Length classes extend Dimension and are concrete, so they must implement the addition operator. The operators have the same signature as the one from Dimension except for the return type. When creating a subtype, we are allowed to narrow the return types, so I made them more specific. Time.+ returns not merely a Dimension as in the supertype but a Time. Parameters, on the other hand, can only have their types widened in subtypes, so they remain Dimension. This is because the return type is a covariant position and the parameter type is a contravariant position. If you don’t know what variance is, I have an article on it.

This code has two main weaknesses. First, a developer subclassing Dimension is trusted to return the same type as the class. That is to say, a developer could write:

  class Length(value: Double, name: String, coef: Double) extends 
  Dimension(value, name, coef) {
    def +(x: Dimension): Time = new Time(value + coef * x.value / x.coef, name, coef)
  }

The developer could mix the types! This would return an unexpected and nonsensical result. The second and much more serious weakness is that a user of the library doesn’t have to pass a Length to Length.+. The user could write:

  val sum = new Length(10.0, "meters", 1.0) + new Time(10.0, "seconds", 1.0)

Nothing is stopping him from doing this.

What I wanted was a Dimension class that enforced some additional rules on its descendants. Dimension should dictate not only that all its subtypes must implement the addition operator but also that the operator should only accept a parameter of the same type as the class in which it is defined and that the operator should also return that same type. In short, I wanted to force all subtypes of Dimension to look like this:

  class X(...) extends Dimension(...) {
    def +(x: X): X = ...
  }

Learning to Accept Yourself (As a Parameter)

As I considered the problem I pretty quickly noticed a similarity to a basic Scala trait: scala.Ordered. Ordered[A] is used primarily in sorting. By mixing in Ordered, you can define an ordering for any class. The reason I thought of ordered is that includes the abstract method “compare (that : A) : Int”. This method compares “this” to “that”. In other words, it takes a parameter that it is able to compare with itself, which is *usually* a parameter of the same type. Here’s a typical use of Ordered[A]:

class Student(val lastName: String, val firstName: String) 
  extends Ordered[Student]
{
  def compare(that: Student) = 
    (lastName + "," + firstName).compare(that.lastName + "," + that.firstName)
}

That’s close to what I want, but not quite. Ordered[A] allows you to implement a class that can be compared to itself, but it doesn’t require it. You could implement “class Apples extends Ordered[Oranges]”, literally comparing apples and oranges. So this arrangement (a parameterized class or trait in which the subtypes specify themselves as the type parameter) allows, but does not enforce, the structure that I want. So Ordered[A] provides a clue, but not the complete solution.

Becoming Self Aware

The missing piece is a little-known Scala construct called the explicit self type. It is a way of specifying what type the “this” reference must have. The Scala-lang website has an article explaining another situation in which explicit self types are useful: specifying that within a class “this” should refer to an abstract variable type.

Here’s a very simple example of how explicit self types work. Here’s a base trait called TraitA and two traits that make use of TraitA. TraitB1 uses an explicit self type to denote that “this” must be of type TraitA, and TraitB2 uses extends to inherit from TraitA.

trait TraitA {
  def t1(): String
}

trait TraitB1 {
  self: TraitA =>
  def t2(): String = "TraitB1.t2 !" + t1() + "!"
}

trait TraitB2 extends TraitA {
  def t2(): String = "TraitB2.t2!" + t1() + "!"
}

TraitB1 and TraitB2 are exactly alike except for the way they gain access to the t1() method. Here’s an interpreter session in which we create some classes that extend these traits.

scala> class Class1 extends TraitB1 {
     |   def t1() = "Class1.t1"
     |   override def toString() = "Class1: " + t1() + " " + t2()
     | }
<console>:6: error: illegal inheritance;
 self-type Class1 does not conform to TraitB1's selftype TraitB1 with TraitA
       class Class1 extends TraitB1 {
                            ^

scala> class Class2 extends TraitB1 with TraitA {
     |   def t1() = "Class2.t1"
     |   override def toString() = "Class2: " + t1() + " " + t2()
     | }
defined class Class2

scala> new Class2
res0: Class2 = Class2: Class2.t1 TraitB1.t2 !Class2.t1!

scala> class Class3 extends TraitB2 {
     |   def t1() = "Class3.t1"
     |   override def toString() = "Class3: " + t1() + " " + t2()
     | }
defined class Class3

scala> new Class3
res1: Class3 = Class3: Class3.t1 TraitB2.t2!Class3.t1!

Class1 extends TraitB1, the trait that used the explicit self type. The class defines t1(), so all the necessary implementation is there, but the compile fails anyway. The explicit self type says that “this” must be of type TraitA but neither Class1 nor TraitB1 extends TraitA, so even though the t1() method is supplied, “this” cannot have type TraitA for Class1 because Class1 does not inherit from TraitA.

Class2 and Class3 compile and run just fine. Class2 is identical to Class1 except that it mixes in TraitA. Since it is declared “with TraitA” the explicit self type is satisfied because “this” can have type TraitA.

Class3 is identical to the others except that it extends TraitB2. TraitB2 is declared as extending TraitA, so Class3 compiles because it indirectly inherits from TraitA.

Use explicit self types with care! They can be a little dangerous if you use them to subvert Scala’s compile-time type checking. For example:

trait StringMaker {
  def makeString(): String
}

class DoesntCompile extends StringMaker {
  override def toString() = "DoesntCompile " + makeString
}

class Compiles {
  self: StringMaker =>
  override def toString() = "Compiles " + makeString
}

The first class is declared as extending StringMaker, but it doesn’t implement the makeString method. This class fails to compile, and rightly so. The compiler does its job and warns you that the class won’t work.

The second class includes an explicit self type. The class named Compiles says that it must be a StringMaker. Now, it says this internally, not externally. The class declaration says nothing about StringMaker. Any code that used the Compiles class wouldn’t know that it’s supposed to be a StringMaker. The class compiles but when you try to instantiate it you get an exception. Not only is it an exception, it’s a NullPointerException which crashes my interpreter (!!!) which makes me think this may be a bug.

Time to Self Actualize

My solution to the problem was a combination of the “class X extends Ordered[X]” idiom and the explicit self type. Here it is:

  abstract class Dimension[T](val value: Double, val name: String, val coef: Double) {
    self: T =>
    protected def create(value: Double, name: String, coef: Double): T
    def +(x: Dimension[T]): T = create(value + coef * x.value / x.coef, name, coef)
    override def toString(): String = value + " " + name
  }

  class Time(value: Double, name: String, coef: Double) extends
        Dimension[Time](value, name, coef) {
    protected def create(a: Double, b: String, c: Double) = new Time(a, b, c)
  }

  class Length(value: Double, name: String, coef: Double) extends
        Dimension[Length](value, name, coef) {
    protected def create(a: Double, b: String, c: Double) = new Length(a, b, c)
  }

  class Mass(value: Double, name: String, coef: Double) extends
        Dimension[Length](value, name, coef) {
    protected def create(a: Double, b: String, c: Double) = new Length(a, b, c)
  }

This compiles just fine except for the last class, which I included to demonstrate that the compiler enforces conformance to the explicit self type. Every class that extends Dimension[X] must itself be an X. That enforces the rule I wanted. Here’s the compiler error for the last class.

$ scalac Units.scala
Units.scala:19: error: illegal inheritance;
 self-type Mass does not conform to Dimension[Length]'s selftype Dimension[Length] with Length
        Dimension[Length](value, name, coef) {
        ^
one error found

That basically says if you want to extend Dimension[Length] then you better be a Length yourself.

Considering that when I had investigated this problem for about 10 minutes I was nearly ready to call it impossible, this is a surprisingly simple and not too cryptic solution. Plus, it’s a usage of explicit self types that I hadn’t seen before. I wonder, in fact, why the Ordered[A] trait itself doesn’t use this trick.

As a bonus, here’s a sneak peek at part of my Units library so far.

  abstract class Dimension[T](val value: Double, val name: String, val coef: Double) {
    self: T =>
    protected def create(value: Double, name: String, coef: Double): T
    def +(x: Dimension[T]): T = create(value + coef * x.value / x.coef, name, coef)
    def -(x: Dimension[T]): T = create(value - coef * x.value / x.coef, name, coef)
    override def toString(): String = value + " " + name
  }

  class Time(value: Double, name: String, coef: Double) extends
        Dimension[Time](value, name, coef) {
    protected def create(a: Double, b: String, c: Double) = new Time(a, b, c)
  }

  class Length(value: Double, name: String, coef: Double) extends
        Dimension[Length](value, name, coef) {
    protected def create(a: Double, b: String, c: Double) = new Length(a, b, c)
  }

  abstract class TimeUnit(name: String, coef: Double) {
    def apply(value: Double) = new Time(value, name, coef)
    def apply(orig: Time) = new Time(0, name, coef) + orig
  }

  object Second   extends TimeUnit("seconds",    1.0)
  object Minute   extends TimeUnit("minutes",    1.0 / 60)
  object Hour     extends TimeUnit("hours",      1.0 / 3600)

  abstract class LengthUnit(name: String, coef: Double) {
    def apply(value: Double) = new Length(value, name, coef)
    def apply(orig: Length) = new Length(0, name, coef) + orig
  }

  object Meter      extends LengthUnit("meters",      1.0)
  object Inch       extends LengthUnit("inches",      1.0 / .0254)
  object Foot       extends LengthUnit("feet",        1.0 / .0254 / 12)

And here’s what it looks like in the interpreter:

scala> val length1 = Meter(3)
length1: Length = 3.0 meters

scala> val length2 = Foot(4.5)
length2: Length = 4.5 feet

scala> length1 + length2
res0: Length = 4.3716 meters

scala> length2 + length1
res1: Length = 14.34251968503937 feet

scala> Inch(length1 + length2)
res2: Length = 172.11023622047244 inch

scala> val time1 = Second(90)
time1: Time = 90.0 seconds

scala> val time2 = Hour(.75)
time2: Time = 0.75 hours

scala> Minute(time2)
res3: Time = 45.0 minutes

scala> time1 + length1
<console>:14: error: type mismatch;
 found   : Length
 required: Dimension[Time]
       time1 + length1
               ^

Here’s a three-for-one special for you: A post about implementing the Levenshtein string distance algorithm in Scala AND refactoring it from an imperative style to a functional style AND I even throw in a short lesson on memoization. To make sure that our refactoring is correct and preserves the expected behavior, I’ll unit test the code along the way using ScalaTest. ScalaCheck, JUnit, or TestNG would work just as well, but I used ScalaTest.

First Things First

“What, exactly,” some of you may be asking, “is Levenshtein string distance? Some kind of Teutonic tailoring terminology?” Not at all. It’s a way of measuring how alike or different two strings of symbols are. For example, the string “sturgeon” is a lot more similar to “surgeon” than to “urgently”. “Sturgeon” and “surgeon” are only a single letter (t) apart. They have a string distance of 1. “Sturgeon” and “urgently” share some letters, but each has some letters not present in the other. So what’s their string distance? It’s not so obvious now.

String distance is useful. The use that most quickly springs to mind is spelling correction. If I type “computwr” a string distance algoritm could tell us that “computwr” is very close to the dictionary word “computer”. But there’s more to it than that. There are a lot of fuzzy problems in which you want to find which two sets of complex data are most similar. One way to solve a problem like that is to encode it into a string compare using string distance. For example, you could write a program to find pen strokes in an image and encode their shapes (up, curve left, down, angle right, etc) as a string which could be compared to known encodings for handwritten letters. By finding the closest matches you can create a handwriting recognition program. String distance is useful in DNA analysis, of course, recognizing patterns in signals, and a host of other situations.

The Levenshtein string distance algorithm was developed by Vladimir Levenshtein in 1965. It can easily tell us the distance from “sturgeon” to “urgently”. This algorithm breaks down string transformation into three basic operations: adding a character, deleting a character, and replacing a character. Each of these operations is assigned a cost, usually a cost of 1 for each operation. Leaving a character unchanged has a cost of 0. So to go from “surgeon” to “sturgeon” you leave the “s” unchanged for a cost of 0. Then you add a “t” for a cost of 1. All the other letters also remained unchanged, so the total cost is 1, just as we expected.

To change “sturgeon” to “urgently” is harder. They have the same number of letters, so we could just do a replacement on each one for a distance of 8. But is that the shortest distance? What if we try to re-use that “urge” from “sturgeon”? Can we re-use the “n”? Does that help? What about the “t”? We need an algorithm that we can follow.

The Grid

What we need is a way to find the cheapest combination of operations which changes the first string to the second. That’s the Levenshtein algorithm. It works like this. Write the first string vertically from top to bottom. To the right of each letter write 1, 2, 3, etc. Write a 0 above the number 1. Then write the second string horizontally and again add the numbers 1, 2, 3 etc. to the right of the 0. It will look something like this:

    u r g e n t l y
  0 1 2 3 4 5 6 7 8
s 1
t 2
u 3
r 4
g 5
e 6
o 7
n 8

That’s the first step. The grid remains un-filled-in at this point. Can you guess what the grid positions are going to hold? They are going to hold the cost to convert the various prefixes of “sturgeon” to the various prefixes of “urgently”. The position at the intersection of row r4, and the column a2, for example, will contain the cost to convert “stur” to “ur”. We fill in these positions (left-to-right then top to bottom) with the smallest of the following three numbers:

  • The number above the current position plus one.
  • The number to the left of the current position plus one.
  • The above-left number if row letter and column letter are the same, or the above-left number plus one otherwise.

When we finish, the bottom right corner contains the cost to convert the first string to the second.

But Why? (Understanding the Algorithm)

Those are some shockingly simple rules! Let’s examine how they translate into string distance.

First, what are those numbers 0 to n that we write along the top and left? They’re not just indices. The numbers along the top represent the cost to convert the empty string to the various prefixes of “urgently”. The cost is 0 to convert “” to “”, 1 to convert “” to “u”, 2 to convert “” to “ur”, and so on. The numbers on the left are the cost to convert the prefixes of “sturgeon” to the the empty string . “” to “” is 0, “s” to “” is 1, and so forth. These costs are obvious. The only way to convert “” to “urgently” is to add eight letters. There’s nothing to delete, nothing to replace. The only way to convert “sturgeon” to “” is to delete all eight letters.

Each position, as we have established, represents the cost to convert the string of characters down the left side ending in the current row’s character into the the string of characters along the top ending at the current column’s character. Put another way, for any given position let’s call the current row’s letter A, and the current column’s letter B. If we use a colon (:) to indicate string concatenation then the beginning string, the one along the left of the grid, can be written Prefix1:A. So for our example word “sturgeon” if we look at a position in row e6 then Prefix1 is “sturg” and the final letter, which we’re calling A, is “e”.

So, speaking in terms of Prefix1, Prefix2, A, and B, we use the following inputs:

  • The cost to change Prefix1 to Prefix2:B (the number above the current position).
  • The cost to change Prefix1:A to Prefix2 (the number to the left).
  • The cost to change Prefix1 to Prefix2 (the above-left number).

If we know the costs of these three conversions we can find the cost to change Prefix1:A to Prefix2:B using this logic:

  • We know that if converting Prefix1 to Prefix2:B has a cost of X, then Prefix1:A can be converted to Prefix2:B for the cost of X plus the cost of deleting A, or X + 1.
  • We know that if converting Prefix1:A to Prefix2 has a cost of Y, then Prefix1:A can be converted to Prefix2:B for the cost of Y plus the cost of adding B, or Y + 1.
  • We know that if converting Prefix1 to Prefix2 has a cost of Z, then Prefix1:A can be converted to Prefix2:B for the cost of Z plus the cost of replacing A with B, or Z + (0 or 1 depending on whether A = B).

If you understand the above logic, then you understand this really neat algorithm. It’s a quintessential example of dynamic programming. The solution is built up by solving simpler problems. You start with the trivial case of converting to and from the empty string, and then you build up the solution for the prefixes until you have the complete solution.

In the grid’s initial configuration there is only one location for which we know all three costs and that is row s1, column u1. That’s the only space for which all three neighboring values (above, left, and above-left) are populated. After we fill in this one, there are two more spaces available to us. Those are (t2, u1) and (s1, r2). Ordinarily, though, the spaces are populated line by line.

A Simple Example

Let’s do a quick example by hand. Then we’ll take a stab at implementing it in code. What’s the string distance from “hat” to “tape”? First, our empty grid:

    t a p e
  0 1 2 3 4
h 1
a 2
t 3

The first space is row h, column t. The letters are different so our choices are 1 + 1 (above), 1 + 1 (left), or 0 + 1 (above-left). 0 + 1 is the smallest value, so the first space gets populated with 1. The next space has choices 2 + 1 (above), 1 + 1 (left), or 1 + 1 (above-left). 1 + 1 is the lowest, so we fill in the second space with 2. Once we finish the row, we have this:

    t a p e
  0 1 2 3 4
h 1 1 2 3 4
a 2
t 3

The next space is row a, column t. Our choices are 1 + 1 (above), 2 + 1 (left), or 1 + 1 (above-left). 1 + 1 is the smallest value, so the space gets populated with 2. The next space has choices 2 + 1 (above), 2 + 1 (left), or 1 + 0 (above-left). Why 1 + 0? Because the above-left value is 1 and both letters for this space are “a” so we can replace “a” with “a” for free. Go ahead and finish the whole grid. This is the result:

    t a p e
  0 1 2 3 4
h 1 1 2 3 4
a 2 2 1 2 3
t 3 2 2 2 3

The strings “hat” and “tape” have a distance of 3.

Some Code, Finally

As fascinating as Levenshtein distance is, and as much more as there is to say on the topic, the time has come to write some code. Here’s a Scala implementation that is very close to the pencil-and-paper approach that we just performed.

import scala.Math.min
import scala.Math.max

object StringDistance {
  def stringDistance(s1: String, s2: String): Int = {
    def minimum(i1: Int, i2: Int, i3: Int) = min(min(i1, i2), i3)

    val dist = new Array[Array[Int]](s1.length + 1, s2.length + 1)

    for (idx <- 0 to s1.length) dist(idx)(0) = idx
    for (jdx <- 0 to s2.length) dist(0)(jdx) = jdx

    for (idx <- 1 to s1.length; jdx <- 1 to s2.length)
      dist(idx)(jdx) = minimum (
        dist(idx-1)(jdx  ) + 1,
        dist(idx  )(jdx-1) + 1,
        dist(idx-1)(jdx-1) + (if (s1(idx-1) == s2(jdx-1)) 0 else 1)
      )

    dist(s1.length)(s2.length)
  }

Do you see what I mean when I say it’s close to the pencil-and-paper approach? We actually construct a two-dimensional Array to represent the grid we drew earlier. It’s a very literal implementation.

To explain the code briefly, we declare a singleton StringDistance having a single method called stringDistance. Within this method we declare a 3-argument minimum method. (I wonder why there’s no “def min(params: Int*): Int” defined in scala.Math.) Then we create an array called “dist” and populate the top row and leftmost column in lines 8-11. The for loop on line 13 cycles through each array position from left to right then top to bottom, and populates them according to the Levenshtein logic. Finally, once the grid is filled in we return the number in the bottom right position.

The job’s not done yet, of course. We haven’t written our unit tests. Here’s the test I wrote:

import org.scalatest._

class StringDistanceSuite extends FunSuite with PrivateMethodTester {

  test("stringDistance should work on empty strings") {
    assert( StringDistance.stringDistance(   "",    "") == 0 )
    assert( StringDistance.stringDistance(  "a",    "") == 1 )
    assert( StringDistance.stringDistance(   "",   "a") == 1 )
    assert( StringDistance.stringDistance("abc",    "") == 3 )
    assert( StringDistance.stringDistance(   "", "abc") == 3 )
  }

  test("stringDistance should work on equal strings") {
    assert( StringDistance.stringDistance(   "",    "") == 0 )
    assert( StringDistance.stringDistance(  "a",   "a") == 0 )
    assert( StringDistance.stringDistance("abc", "abc") == 0 )
  }

  test("stringDistance should work where only inserts are needed") {
    assert( StringDistance.stringDistance(   "",   "a") == 1 )
    assert( StringDistance.stringDistance(  "a",  "ab") == 1 )
    assert( StringDistance.stringDistance(  "b",  "ab") == 1 )
    assert( StringDistance.stringDistance( "ac", "abc") == 1 )
    assert( StringDistance.stringDistance("abcdefg", "xabxcdxxefxgx") == 6 )
  }

  test("stringDistance should work where only deletes are needed") {
    assert( StringDistance.stringDistance(  "a",    "") == 1 )
    assert( StringDistance.stringDistance( "ab",   "a") == 1 )
    assert( StringDistance.stringDistance( "ab",   "b") == 1 )
    assert( StringDistance.stringDistance("abc",  "ac") == 1 )
    assert( StringDistance.stringDistance("xabxcdxxefxgx", "abcdefg") == 6 )
  }

  test("stringDistance should work where only substitutions are needed") {
    assert( StringDistance.stringDistance(  "a",   "b") == 1 )
    assert( StringDistance.stringDistance( "ab",  "ac") == 1 )
    assert( StringDistance.stringDistance( "ac",  "bc") == 1 )
    assert( StringDistance.stringDistance("abc", "axc") == 1 )
    assert( StringDistance.stringDistance("xabxcdxxefxgx", "1ab2cd34ef5g6") == 6 )
  }

  test("stringDistance should work where many operations are needed") {
    assert( StringDistance.stringDistance("example", "samples") == 3 )
    assert( StringDistance.stringDistance("sturgeon", "urgently") == 6 )
    assert( StringDistance.stringDistance("levenshtein", "frankenstein") == 6 )
    assert( StringDistance.stringDistance("distance", "difference") == 5 )
    assert( StringDistance.stringDistance("java was neat", "scala is great") == 7 )
  }

}

It tests several special cases as well as the general case. All we have to do is compile our StringDistance object and the StringDistanceSuite unit test, fire up the scala interpreter and run the test:

scala> (new StringDistanceSuite).execute()
Test Starting - StringDistanceSuite: recursiveStringDistance should work on empty strings
Test Succeeded - StringDistanceSuite: recursiveStringDistance should work on empty strings
Test Starting - StringDistanceSuite: recursiveStringDistance should work on equal strings
Test Succeeded - StringDistanceSuite: recursiveStringDistance should work on equal strings
Test Starting - StringDistanceSuite: recursiveStringDistance should work where only inserts are needed
Test Succeeded - StringDistanceSuite: recursiveStringDistance should work where only inserts are needed
Test Starting - StringDistanceSuite: recursiveStringDistance should work where only deletes are needed
Test Succeeded - StringDistanceSuite: recursiveStringDistance should work where only deletes are needed
Test Starting - StringDistanceSuite: recursiveStringDistance should work where only substitutions are needed
Test Succeeded - StringDistanceSuite: recursiveStringDistance should work where only substitutions are needed
Test Starting - StringDistanceSuite: stringDistance should work where many operations are needed
Test Succeeded - StringDistanceSuite: stringDistance should work where many operations are needed

scala>

Refactoring: Reduced Memory Usage

The code passes all the tests, so let’s take things a step further. One shortcoming of our implementation is that it can require a lot of memory. That array has to be of size (n+1)*(m+1) where n and m are the lengths of the two strings we’re comparing. If you want to apply the method to strings that are a few hundred characters long (or longer) then you’re starting to talk about some serious memory requirements. How can we reduce the amount of memory required? Can you think of a way?

Once we complete one row of the grid, we use it again as an input when we compute the next row. But after that we’re done with it. Why leave it to clutter the heap? Let’s tweak our implementation slightly. Try rewriting the method using only two rows. Fill the first row with the initial 0, 1, 2, etc. Then use one to compute the other over and over. Think about how you would implement this, then have a look at my solution below.

  def stringDistance(s1: String, s2: String): Int = {
    def minimum(i1: Int, i2: Int, i3: Int) = min(min(i1, i2), i3)

    var dist = ( new Array[Int](s1.length + 1),
                 new Array[Int](s1.length + 1) )

    for (idx <- 0 to s1.length) dist._2(idx) = idx

    for (jdx <- 1 to s2.length) {
      val (newDist, oldDist) = dist
      newDist(0) = jdx
      for (idx <- 1 to s1.length) {
        newDist(idx) = minimum (
          oldDist(idx) + 1,
          newDist(idx-1) + 1,
          oldDist(idx-1) + (if (s1(idx-1) == s2(jdx-1)) 0 else 1)
        )
      }
      dist = dist.swap
    }

    dist._2(s1.length)
  }

This one uses a Pair (also called a Tuple2) containing two one-dimensional arrays, instead of a 2*(n+1) array. Pair happens to have the very handy “swap” method which we can use to trade out the rows when we’ve finished one and are ready to compute the next.

This is where our unit tests really show their worth. No need to wonder whether this new code really does work. We just recompile, run the tests again, and we can see that the code still gives us the expected results.

Refactoring: Recursion, Kind Of

What are some other ways we could write this code? Can it be improved? I thought I would try to replace the iteration in the previous implementations with recursion. Here’s what that code looks like:

  def stringDistance(s1: String, s2: String): Int = {
    def newCost(ins: Int, del: Int, subst: Int, c1: Char, c2:Char) =
      Math.min( Math.min( ins+1, del+1 ), subst + (if (c1 == c2) 0 else 1) )

    def getNewCosts(s1: List[Char], c2: Char, delVal: Int, prev: List[Int] ): List[Int] = (s1, prev) match {
      case (c1 :: _ , substVal :: insVal :: _) =>
        delVal :: getNewCosts(s1.tail, c2, newCost(insVal, delVal, substVal, c1, c2), prev.tail)
      case _ => List(delVal)
    }

    def sd(s1: List[Char], s2: List[Char], prev: List[Int]): Int = s2 match {
      case Nil => prev.last
      case _ => sd( s1, s2.tail, getNewCosts(s1, s2.head, prev.head+1, prev) )
    }

    (s1, s2) match {
      case (`s2`, `s1`) => 0
      case (_, "") | ("", _) => max(s1.length, s2.length)
      case _ => sd(s1.toList, s2.toList, (0 to s1.length).toList)
    }
  }

It’s a pretty naïve implementation, actually. It just replaces the the repetition of the two for loops with the repetition of the recursion of the two methods sd and getNewCosts. The sd method is even tail-recursive, allowing scala to optimize it. It does the same basic thing as the for loop version, though. It recurses through the characters of a row in the getNewCosts method, and it recurses through the rows of the grid in the sd method.

It looks more complicated than the previous implementations. It’s harder to read. But it passes our unit tests, so we can be pretty sure it’s correct.

Refactoring: List Methods

After the last implementation, I thought it looked a little sloppy. I wondered whether I could improve the situation by using some of the many useful methods built into scala’s List class. Here is the comparatively brief code that resulted:

  def stringDistance(s1: String, s2: String): Int = {
    def sd(s1: List[Char], s2: List[Char], costs: List[Int]): Int = s2 match {
      case Nil => costs.last
      case c2 :: tail => sd( s1, tail,
          (List(costs.head+1) /: costs.zip(costs.tail).zip(s1))((a,b) => b match {
            case ((rep,ins), chr) => Math.min( Math.min( ins+1, a.head+1 ), rep + (if (chr==c2) 0 else 1) ) :: a
          }).reverse
        )
    }
    sd(s1.toList, s2.toList, (0 to s1.length).toList)
  }

Like I say, it’s brief. Those List methods give you a lot of mileage.

Refactoring: Real Recursion

The more I looked at my previous attempt at a recursive solution, the more I realized how hare-brained it was. It wasn’t real recursion. It was just iteration using the stack. So I went back to the drawing board. If I want to know the final answer, the value in the bottom right position, how do I get it? I apply my three rules to the above, left, and above-left positions. How do I get those positions? Apply the rules again. That’s real recursion. Here’s my first stab at it:

  def stringDistance(s1: String, s2: String): Int = {
    def min(a:Int, b:Int, c:Int) = Math.min( Math.min( a, b ), c)
    def sd(s1: List[Char], s2: List[Char]): Int = (s1, s2) match {
      case (_, Nil) => s1.length
      case (Nil, _) => s2.length
      case (c1::t1, c2::t2)  => min( sd(t1,s2) + 1, sd(s1,t2) + 1,
                                     sd(t1,t2) + (if (c1==c2) 0 else 1) )
    }
    sd( s1.toList, s2.toList )
  }

Now, THAT is a nice looking recursive function. That’s more like it. See the pattern match block? If we try to convert from any string to the empty string or from the empty to a non-empty string then we just use the string length. You see? That takes care of the positions along the top and left of the grid. All the others are determined in the last case. That last case just applies our three rules. That is so short and simple. It’s a thing of beauty.

The only problem? It doesn’t work.

It’s technically correct. It will return correct answers … eventually. Or, rather, I think it will. I can’t be sure because it’s too slow to complete my unit tests! For anything but very short inputs (4 or 5 characters), the function takes a long time to return. Why? Let’s look at how many recursive calls are made for some inputs.

If we use strings “a” and “b” we pass over the “(_, Nil)” and “(Nil, _)” cases in the first call to function sd, because both our strings (Lists, actually) are non-empty. This results in three more calls to sd. Each of these three calls includes an empty List of characters, so there is no more recursion. That’s a total of four calls to sd for strings “a” and “b”.

What about “ab” and “xy”? Think about it for a moment? Step through the function in your head. How many calls to sd will there be for “ab” and “xy”?

Have you done it? I count 19. What about “abc” and “xyz”? I’ll save you the trouble. It’s 94. For length 4 strings it’s 481. For length 5 it’s 2524. Length 6 is 13,483. Then 73 thousand, then 400 thousand, then 2 million and so forth. Why so many calls? Each position in the grid is computed using all the positions to the left and all the positions above the current position. So a position in the top left will be computed and recomputed many times.

There is a way to get around this, of course. You probably already have some ideas. We’re going to do something called memoization. When you memoize a function, you make it remember results that it computed previously without having to actually recompute them. I’ll do that by caching results in a map. The map’s key is a Pair of List[Char]s, the inputs to my inner function sd, and its data is an Int, the return type of sd. I will modify sd to first check the map to see if the result for the current parameters has already been cached. If so, we simply return it. If not, we compute the value, cache it, and return it.

  def stringDistance(s1: String, s2: String): Int = {
    val memo = scala.collection.mutable.Map[(List[Char],List[Char]),Int]()
    def min(a:Int, b:Int, c:Int) = Math.min( Math.min( a, b ), c)
    def sd(s1: List[Char], s2: List[Char]): Int = {
      if (memo.contains((s1,s2)) == false)
        memo((s1,s2)) = (s1, s2) match {
          case (_, Nil) => s1.length
          case (Nil, _) => s2.length
          case (c1::t1, c2::t2)  => min( sd(t1,s2) + 1, sd(s1,t2) + 1,
                                         sd(t1,t2) + (if (c1==c2) 0 else 1) )
        }
      memo((s1,s2))
    }

    sd( s1.toList, s2.toList )
  }

There. That uglies up my function somewhat, but at least it’s usable now. And it passes my unit tests, so I’m reasonably assured that it’s right.

Memoization only works if you expect the same result for each identical function call. If your function takes input from stdin, for example, you can’t memoize that. Or if it has a random component. Or if, for any other reason, its return value is not always the same for the same inputs. You can memoize functions in different ways. There’s a post on Michid’s Weblog about a more general solution, a memoizing class which wraps existing functions to give you a memoized version.

PHEW!

Ok, I’ve tried to keep my rambling to a minimum in this post but it’s still a doozy. The things I wanted to get across are:

  • The usefulness of Levenshtein distance in solving a variety of problems.
  • How to understand the Levenshtein distance algorithm and why it works.
  • How to use unit tests to improve your code while maintaining some assurance that the new code still has the correct behavior.
  • Some of the different ways of implementing Levenshtein.

In the end, I think I like the second implementation (the one that switches out the two rows) and the last implementation the best. The second one seems to have good performance. I did some informal performance tests and it has a good mix of performance and simplicity. The last one, the memoized recursive one, appeals to me because it is in a more functional style and still has respectable performance.

In Scala, as in Java, C and many other languages, identifiers may contain a mix of lower and upper case characters. These identifiers are treated in a case sensitive manner. For example “index”, “Index” and “INDEX” would be treated as three separate identifiers. You can define all three in the same scope. That goes for Scala, Java, and most if not all descendants of the C language. In most of these languages, although case is significant in distinguishing identifiers, and although various capitalization schemes are used by convention, case does not alter functionality. Whether you name a variable “index”, “Index” or “INDEX”, as long as you don’t hide an identifier from an enclosing scope, the code will function in exactly the same way.

Scala, though, diverges slightly from this tradition. Here’s an example. Say we have a Pair of two Ints. Say we also have two plain Int values and we want to know whether those two Ints are equal to the values inside the Pair. In the case that they do match, we also want to know in which order they appear in the Pair.

Operations on tuples (such as a Pair) can often be implemented neatly by pattern matching. Here’s one solution to this problem:

def matchPair(x: (Int,Int), A: Int, b: Int): String = 
x match {
  case (A, b) => "Matches (A, b)"
  case (b, A) => "Matches (b, A)"
  case _      => "Matches neither"
}

This is completely unsurprising code except for one little detail. One of the function parameters is upper case while the other two are lower case. Other than that, there’s nothing unusual so far. So let’s try out this code in the Scala interpreter.

scala> def matchPair(x: (Int,Int), A: Int, b: Int): String = 
     | x match {
     |   case (A, b) => "Matches (A, b)"
     |   case (b, A) => "Matches (b, A)"
     |   case _      => "Matches neither"
     | }
matchPair: ((Int, Int),Int,Int)String

scala> val pair = (5, 10)
pair: (Int, Int) = (5,10)

scala> matchPair( pair,  5, 10 )
res1: String = Matches (A, b)

scala> matchPair( pair, 10,  5 )
res2: String = Matches (b, A)

scala> matchPair( pair, 99, 99 )
res3: String = Matches neither

So far so good! It returns the expected value when the values match in order, in reverse order, and when both values don’t match. Is this sufficient unit testing? What other tests would you run?

As you may well guess, no, this isn’t sufficient unit testing. Let’s try the case where one Int matches but not the other:

scala> matchPair( pair,  5, 99 )
res4: String = Matches (A, b)

scala> matchPair( pair, 99, 10 )
res5: String = Matches (b, A)

That didn’t work right. Is the matchPair function telling us that ‘pair’ (which is (5, 10) ) matches (5, 99) or (99 10)? That’s what it look like, but no. Scala does something a little bit surprising here. Do you know why?

As I said before, you can have variable and constants in Scala with upper or lower case names. Both are legal, just as they are in Java. But Scala makes some distinctions that Java doesn’t. Within a pattern (the part between ‘case’ and ‘=>’) Scala treats simple lower case identifiers differently. It uses them as new variables into which matched data is stored, but this is not the case for identifiers that begin with an upper case letter!

If you want to capture results of a pattern match in Scala you must use a lower case identifier and that identifier will hide any identifiers with the same name from an enclosing scope. So in our example function, “case (A, b)” matches a Pair. The first element of the pair is “A” which start with an upper case letter, so pattern matching results can’t be stored in it. It is used the way we intended, i.e. the pattern is matched if x._1 equals A.

The “b” in “case (A, b)”, though, begins with a lower case letter so it is assigned the value of x._2 (assuming x._1 equals A). It is as if you had typed “val b = x._2” in the function body. Within the case line, the “b” from the pattern hides the parameter named “b”.

So how can we make this function work the way we want? Here’s one way:

def matchPair(x: (Int,Int), A: Int, B: Int): String = 
x match {
  case (A, B) => "Matches (A, B)"
  case (B, A) => "Matches (B, A)"
  case _      => "Matches neither"
}

Now both the the Int parameters start with an upper case letter and are therefore tested against x._1 and x._2. This code passes our tests. Note that the code behaves differently simply based on the parameter names we choose. There’s another way to prevent Scala from using the identifiers for storing pattern results.

def matchPair(x: (Int,Int), a: Int, b: Int): String = 
x match {
  case (`a`, `b`) => "Matches (a, b)"
  case (`b`, `a`) => "Matches (b, a)"
  case _          => "Matches neither"
}

You can use the more traditional lower case parameter names if you quote them using the backquote character. That’s the key to the left of the “1” on my keyboard. This matchPair is equivalent to the one that used capital “A” and “B”.

Another quick example:

scala> val pair = (5, 10)
pair: (Int, Int) = (5,10)

scala> val (a,b) = pair
a: Int = 10
b: Int = 5

scala> val (X,Y) = pair
:5: error: not found: value X
       val (X,Y) = pair
            ^
:5: error: not found: value Y
       val (X,Y) = pair
              ^

You know that you can use the construct from line 4 above to declare and initialize multiple vals or vars using a tuple, right? And you know how that magic is done? Patterns, so the same principle applies here. The capitalized identifiers “X” and “Y” are taken to refer to existing identifiers because they can’t be used to store pattern match results. Since no such identifiers had been defined, you get an error.

If we define these values beforehand then Scala tries to match their values:

scala> val pair = (5,10)
pair: (Int, Int) = (5,10)

scala> val I = 5
I: Int = 5

scala> val J = 10
J: Int = 10

scala> val (I,q) = pair
q: Int = 10

scala> val (J,r) = pair
scala.MatchError: (5,10)
        at .<init>(<console>:6)
        at .<clinit>(<console>)
        at RequestResult$.<init>(<console>:3)
        at RequestResult$.<clinit>(<console>)
        at RequestResult$result(<console>)
        at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
        at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
        at sun.reflect.DelegatingMethodAccessorImpl.invoke(...

scala> val (I,J) = pair

In line 10, the value I has been declared and initialized to 5 so it matches the first part of the Pair. The identifier q becomes a new val initialized to the value in the second part of the Pair.

In line 13, we do the same thing but we try to match J to the first part of the Pair. This won’t work since pair._1 is 5 and J is 10. A MatchError is thrown.

In line 24, we use both of the capitalized identifiers. They match, but there are no lower case identifiers to make new values out of, so the line does nothing except to confirm (by not throwing an Error) that I equals pair._1 and J equals pair._2.

Now you know how to match when you want to match and assign results when you want to assign results. I hope that being able to match against already-defined identifiers will make your matching code more powerful.