Easy Parsing in Scala

I’ve been experimenting with Scala lately. As a practice project, I started writing some parts of a Subversion client. Honestly, I don’t intend to create a real finished product, but it’s been a good source of interesting problems to solve. Subversion clients and servers pass messages back and forth using a fairly simple protocol, which makes it an ideal example for a Scala parsing tutorial. Let’s write a parser for this protocol using Scala’s parsing package. Here are the first few lines from the Subversion protocol spec:

The Subversion protocol is specified in terms of the following
syntactic elements, specified using ABNF [RFC 2234]:

  item   = word / number / string / list
  word   = ALPHA *(ALPHA / DIGIT / "-") space
  number = 1*DIGIT space
  string = 1*DIGIT ":" *OCTET space
         ; digits give the byte count of the *OCTET portion
  list   = "(" space *item ")" space
  space  = 1*(SP / LF)

Here is an example item showing each of the syntactic elements:

  ( word 22 6:string ( sublist ) )

Very simple! Every message is made up of a number of items. Each item is either an integer number, a word (made up of letters, digits, and hyphens), a string (which can contain any character), or a list of these items, and a list can contain other sub-lists. Plus, notice that the protocol is further simplified by the fact that each item is followed by a “space”, which is defined as one or more space characters or linefeeds. First, let’s look at a Scala parser that implements a small subset of this spec:

import scala.util.parsing.combinator._
import scala.util.matching.Regex

object SvnParser extends RegexParsers {
  private def number = regex(new Regex("[0-9]+[ \\n]+"))
  def parseItem(str: String): ParseResult[Any] = parse(number, str)
}

SvnParser.parseItem("123  \n\n  ") match {
  case SvnParser.Success(result, _) => println(result.toString)
  case _ => println("Could not parse the input string.")
}

Assuming you have a passing familiarity with Scala, this looks pretty straightforward. You have an object (something like a Java singleton) that inherits from trait RegexParsers. It makes calls to two RegexParsers methods: regex and parse. It adds only one public method: parseItem. The second section is a call to the parseItem method followed by a match block for handling the possible outcomes. If the parse is successful, the parse result is printed.
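The catch-all case in that match block throws away the reason a parse failed. Here’s a minimal sketch of handling each outcome separately, assuming the scala-parser-combinators library is on the classpath (in recent Scala versions it ships as a separate module). The NumberDemo object and its describe method are names I made up for illustration; Success, Failure, and Error are the result classes that RegexParsers inherits from Parsers.

```scala
import scala.util.parsing.combinator._

// Hypothetical demo object; it uses the same number pattern as SvnParser above.
object NumberDemo extends RegexParsers {
  private def number: Parser[String] = regex("[0-9]+[ \\n]+".r)

  // Describe the outcome of a parse instead of printing it.
  def describe(str: String): String = parse(number, str) match {
    case Success(result, next) =>
      s"matched '${result.trim}', stopped at offset ${next.offset}"
    case Failure(msg, next) =>
      s"failure at offset ${next.offset}: $msg"
    case Error(msg, next) =>
      s"error at offset ${next.offset}: $msg"
  }
}
```

Calling NumberDemo.describe("abc") reports a failure along with the parser’s own message, which is usually more helpful than “Could not parse the input string.”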

The interesting part is that call to parse. The first parameter is number. In this program, number is just a method that calls method regex which returns an instance of Parser[+T]. And it’s pretty obvious what regex does. It attempts to match a regular expression. The regex pattern in this example is “[0-9]+[ \\n]+” which matches 1 or more digits followed by 1 or more spaces or newlines. Try running this code through the Scala interpreter. It works! However, it includes all that whitespace which we don’t really need. Let’s see if we can match it, but keep it out of the results. I’ll leave out the imports and test code for brevity.

object SvnParser extends RegexParsers {
  private def number = regex(new Regex("[0-9]+")) ~ regex(new Regex("[ \\n]+"))
  def parseItem(str: String) = parse(number, str)
}

This time, we’re making two regex calls with a tilde (~) in between. What is that thing? It’s a function from the Parser[+T] class. It has a signature like this:

def ~ [U](p : => Parser[U]) : Parser[~[T, U]]

This “type soup” is the kind of thing that scares people away from Scala. But it’s not that bad. This signature says that a Parser of type T has a method called “~” that takes a single parameter: a by-name expression that produces a Parser of type U. This “~” method returns a Parser whose result is an instance of a class called “~” of type T and U. Yes, there’s a method called “~” and a class called “~”. What does that mean for our example? The first call to regex returns a Parser[T] (T is String in this case) that matches the digits. That Parser calls its “~” method with the result of the second regex call as its parameter, a Parser[U] (U is also String) that matches the whitespace. That “~” method returns another Parser, this time of type ~[T,U]. It turns out that ~[T,U] (more specifically, ~[String,String]) is basically an ordered pair.
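As a separate illustration of that “ordered pair” idea, here’s a small sketch that has nothing to do with the Subversion grammar: it matches letters followed by digits, with no whitespace involved. TildeDemo, split, and show are hypothetical names of mine. The point is only that the class “~” can be taken apart again in a pattern match.

```scala
import scala.util.parsing.combinator._

// Hypothetical demo: letters immediately followed by digits, e.g. "abc123".
object TildeDemo extends RegexParsers {
  private def letters: Parser[String] = regex("[a-z]+".r)
  private def digits: Parser[String]  = regex("[0-9]+".r)

  // Sequencing two Parser[String] values yields a Parser[~[String, String]].
  private def pair: Parser[String ~ String] = letters ~ digits

  // The class ~ is a case class, so it can be written infix in a pattern.
  def split(str: String): Option[(String, String)] = parse(pair, str) match {
    case Success(l ~ d, _) => Some((l, d))
    case _                 => None
  }

  // The printed form of the pair, or a placeholder when the parse fails.
  def show(str: String): String =
    parse(pair, str).map(_.toString).getOrElse("no match")
}
```

TildeDemo.split("abc123") gives back both halves as an ordinary tuple, and TildeDemo.show("abc123") shows how the pair prints.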

Try running this version of SvnParser through the interpreter. What happens? The parse fails. Why? This confused me for 10 or 15 minutes and I resorted to looking into the RegexParsers source code. Here’s what I found:

  protected val whiteSpace = """\s+""".r

  def skipWhitespace = whiteSpace.toString.length > 0
  protected def handleWhiteSpace(source: java.lang.CharSequence, offset: Int): Int =
        if (skipWhitespace)
          (whiteSpace findPrefixMatchOf (source.subSequence(offset, source.length))) match {
            case Some(matched) => offset + matched.end
            case None => offset
          }
        else
          offset

I don’t know all the details, but it looks like RegexParsers is messing around with our whitespace. By default it skips any leading whitespace before each regex or literal match, so by the time our second regex runs, the whitespace it was supposed to match is already gone. That’s OUR whitespace! And we’ll handle it as we see fit. So let’s override the default behavior. Add a method to our SvnParser like this: “override def skipWhitespace = false”. This should set things straight. Whew. That was a fun diversion. Try running the code now. The output string is this:

(123~

  )

That’s the contents of that ~[T,U]. It’s printed as (T~U). That’s great if we need both parts of the parse, but we don’t. We only want the digits. There are two more Parser methods we should look at. They are “<~” and “~>”. Why all these funny names? They’re brief and they’ll make more sense as we go. Let’s look at the signatures.

def <~ [U](p : => Parser[U]) : Parser[T]
def ~> [U](p : => Parser[U]) : Parser[U]

Again, it looks daunting if you’re not familiar with Scala, but let’s take a closer look. It’s all the same as the “~” method except the return type. That means we use these the same way we use “~”, but we get a different result. The “<~” method returns a Parser[T], the same type as the object calling “<~”. The “~>” method returns Parser[U], the same type as the parameter. You see? “<~” returns the left side and throws away the right. “~>” returns the right side and throws away the left. The angle bracket points to the one we want to keep! Now let’s try using one of these new methods to keep the digits and throw away the whitespace. Which will we use? “<~” or “~>”? Right! We’ll use “<~” because we want to keep the left side, the digits. Here’s what the object looks like now:

object SvnParser extends RegexParsers {
  override def skipWhitespace = false
  private def number = regex(new Regex("[0-9]+")) <~ regex(new Regex("[ \\n]+"))
  def parseItem(str: String) = parse(number, str)
}

Look at that “<~”. It says that in order to parse we must match both sides, but we only want to keep the left. When we run this through the interpreter we see just the number output. Great. Now we’re getting the hang of it. Let’s take a big leap forward and add parsing of words as well as numbers. Here’s the new code:

object SvnParser extends RegexParsers {
  override def skipWhitespace = false

  private def space = regex(new Regex("[ \\n]+"))
  private def number = regex(new Regex("[0-9]+")) <~ space
  private def word   = regex(new Regex("[a-zA-Z][a-zA-Z0-9-]*")) <~ space
  private def item = ( number | word )

  def parseItem(str: String) = parse(item, str)
}

New things in this code: we’re parsing for item, item is defined as a number or a word, word is defined using a new regex, and we’ve factored out the whitespace Parser into a new method.

This is mostly self-explanatory. The space method just encapsulates the Parser that matches whitespace. The item method calls method number which returns a Parser. Then we call yet another Parser method, called “|” (pipe). The “|” method takes a single parameter, another Parser. As you’ve probably guessed, it tries the left side first and, if that fails, tries the right side, returning the result of whichever one matches. Try changing the test string to some valid and invalid numbers and words.
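If you’d rather script those experiments than retype them, here’s a rough sketch. ItemDemo and tryItem are hypothetical names; the parsers are copied from the object above. I’m also using parseAll instead of parse, which is my own choice: it is another RegexParsers method that additionally requires the whole input to be consumed, so leftover characters count as a failed parse.

```scala
import scala.util.parsing.combinator._

// Hypothetical copy of the number/word parser with a small test helper.
object ItemDemo extends RegexParsers {
  override def skipWhitespace = false

  private def space  = regex("[ \\n]+".r)
  private def number = regex("[0-9]+".r) <~ space
  private def word   = regex("[a-zA-Z][a-zA-Z0-9-]*".r) <~ space
  private def item   = number | word

  // Some(result) when the whole input is exactly one item, None otherwise.
  def tryItem(str: String): Option[String] = parseAll(item, str) match {
    case Success(result, _) => Some(result)
    case _                  => None
  }
}
```

Inputs like "42 " and "get-file\n" parse, while "42" (no trailing space), "-oops " (starts with a hyphen), and "9lives " (neither a number nor a word) do not.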

Now let’s add strings. This one’s tricky. We have to match an integer number we’ll call N, a “:” literal, then a string of N characters, and finally the trailing whitespace. The only part we want to keep is the actual string. Here’s the code:

object SvnParser extends RegexParsers {
  override def skipWhitespace = false

  private def space = regex(new Regex("[ \\n]+"))
  private def number = regex(new Regex("[0-9]+")) <~ space
  private def word   = regex(new Regex("[a-zA-Z][a-zA-Z0-9-]*")) <~ space
  // (?s) lets "." match newlines too, since string data can contain any character
  private def string = regex(new Regex("[0-9]+")) >> { len => ":" ~> regex(new Regex("(?s).{" + len + "}")) <~ space }

  private def item = ( number | word | string )

  def parseItem(str: String) = parse(item, str)
}

First, note that we chained another “|” call in the definition of the item method. The new string method first calls regex which returns a Parser that matches an integer. Then guess what. Another Parser method with a funny name. This one is called “>>”. Let’s look at the method signature again:

def >> [U](fq : (T) => Parser[U]) : Parser[U]

T is the type of the result of the left Parser. U is the type of the result of the right Parser. So this method takes a single parameter: a function with one parameter of type T returning a Parser[U]. The “>>” method runs the left Parser, passes its result to fq, and then runs the Parser that fq returns; that Parser’s result becomes the result of the whole thing. What does that mean? We get to do something with the results of the Parser on the left and we return a new Parser so we can keep chaining Parsers if we want.

So the right side of the “>>” call in our example is a closure, a sort of anonymous function. It takes a parameter called len. This will be of type String because the left side of “>>” is a regex Parser. Inside the closure we match a literal “:” and call the “~>” method because we don’t really care about the “:”. We’re interested in the string on the right of the “:” so the right side is a regex call that returns a Parser that matches the number of characters specified by the len parameter. So now we’ve passed the string length to the closure (and otherwise thrown it away), we’ve thrown away the “:”, and we’ve matched (and kept) the string data. Finally, we call the “<~” method with a call to the space method as its argument. So we match the trailing whitespace, but we don’t keep it. Once again, try this code with some different input strings and see how it works.
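Here’s a sketch for poking at the length-prefixed string on its own. StringDemo and tryString are hypothetical names. Two things in it are my own assumptions rather than anything from the spec text above: the “(?s)” flag, which lets “.” match linefeeds as well, and the use of parseAll so that leftover input counts as a failure. Note that the regex counts characters, not bytes.

```scala
import scala.util.parsing.combinator._

// Hypothetical demo of the length-prefixed string parser in isolation.
object StringDemo extends RegexParsers {
  override def skipWhitespace = false

  private def space = regex("[ \\n]+".r)

  // Parse the length first, then build the payload parser from it.
  private def string: Parser[String] =
    regex("[0-9]+".r) >> { len => ":" ~> regex(("(?s).{" + len + "}").r) <~ space }

  // Some(payload) when the input is exactly one string item, None otherwise.
  def tryString(str: String): Option[String] = parseAll(string, str) match {
    case Success(result, _) => Some(result)
    case _                  => None
  }
}
```

The declared length has to agree with the data: "6:string " parses, but "3:string " fails because the character after the third one is not whitespace, and "6:str " fails because there are not six characters to take.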

Now we’re only missing the list item. Remember, we want to match a literal “(”, whitespace, then zero or more items, a literal “)”, and trailing whitespace. We’re going to make list one of the items, but we’re also going to have items in our list. You know what that means. Recursion.

object SvnParser extends RegexParsers {
  override def skipWhitespace = false

  private def space  = regex(new Regex("[ \\n]+"))
  private def number = regex(new Regex("[0-9]+")) <~ space
  private def word   = regex(new Regex("[a-zA-Z][a-zA-Z0-9-]*")) <~ space
  // (?s) lets "." match newlines too, since string data can contain any character
  private def string = regex(new Regex("[0-9]+")) >> { len => ":" ~> regex(new Regex("(?s).{" + len + "}")) <~ space }
  private def list: Parser[Any] = "(" ~> space ~> item.* <~ ")" <~ space

  private def item = ( number | word | string | list )

  def parseItem(str: String) = parse(item, str)
}

Easy as pie! We just add list to the chain of methods in method “item” and define the new list method. The right-hand side is easy to figure out. Match “(” and throw it away, match whitespace and throw it away, match zero or more items (the “*” is yet another handy Parser method) and keep them, match “)” and throw it away, and then match trailing whitespace and throw it away. See how easy it is to read those “<~” and “~>” methods?
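To see the “zero or more” part in action, here’s a cut-down sketch that only knows about words and lists. ListDemo and tryList are hypothetical names, and I’ve written rep(item), which does the same thing as calling the “*” method on item. Again, parseAll is my own choice so that trailing junk is rejected.

```scala
import scala.util.parsing.combinator._

// Hypothetical cut-down grammar: words and (possibly nested) lists only.
object ListDemo extends RegexParsers {
  override def skipWhitespace = false

  private def space = regex("[ \\n]+".r)
  private def word  = regex("[a-zA-Z][a-zA-Z0-9-]*".r) <~ space

  // rep(item) matches zero or more items and collects them in a List.
  private def list: Parser[Any] = "(" ~> space ~> rep(item) <~ ")" <~ space
  private def item: Parser[Any] = word | list

  // Some(result) when the input is exactly one list, None otherwise.
  def tryList(str: String): Option[Any] = parseAll(list, str) match {
    case Success(result, _) => Some(result)
    case _                  => None
  }
}
```

An empty list such as "( ) " comes back as an empty List, nested lists come back as nested Lists, and "() " fails because the grammar demands whitespace after the opening parenthesis.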

If you’re not well versed in Scala’s type inference (I’m not) then the explicit “: Parser[Any]” on the list method is a little confusing at first. I first tried to run this without it. The interpreter told me that I had to specify the return type for recursive methods. Oh yeah! This is recursive, isn’t it? Method item calls method list and method list calls method item. Scala is great at looking at the code and inferring what the types must be, which I love, although some people don’t like it. It declutters the code and generally behaves pretty intuitively. Notice that we don’t have any return types on any of the other methods. Scala looks at the method body and figures out what the return type is.

So why do we have to specify the return type for recursive methods if we didn’t have to do it for any of the other methods? Think about it. In the space method we’re returning whatever is returned from the call to regex (that’s Parser[String]) so Scala infers that the return type of space must also be Parser[String]. But the type of the list method will be whatever is returned by the “*” method, which will be based on the return type of method item. But the type of method item depends on the type of list. So we go around in circles. That’s why we must specify the return type in this case.

I was lazy. I just specified “Parser[Any]”, meaning that the parse result could be a value of any type. It works, but we could be more specific if we desired stronger type safety. For example, the “*” method always returns a Parser[List[T]], so we could have specified “Parser[List[Any]]”, or something even more specific if we can narrow down what T can be.
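To make “more specific” concrete, here’s one possible sketch, not the only way to do it. The Item trait and the Num, Word, Str, and Lst case classes are hypothetical names I invented for the four syntactic elements. The “^^” method is another Parser method that passes a parser’s result through a function, which is how each match gets wrapped in its case class. I’ve also assumed “(?s)” in the string regex so that “.” matches linefeeds, and used rep(item) and parseAll.

```scala
import scala.util.parsing.combinator._

// Hypothetical result types, one per syntactic element in the spec.
sealed trait Item
case class Num(value: BigInt)     extends Item
case class Word(name: String)     extends Item
case class Str(data: String)      extends Item
case class Lst(items: List[Item]) extends Item

object TypedSvnParser extends RegexParsers {
  override def skipWhitespace = false

  private def space = regex("[ \\n]+".r)

  // Each parser wraps its raw match in the corresponding case class via ^^.
  private def number: Parser[Item] =
    (regex("[0-9]+".r) <~ space) ^^ (s => Num(BigInt(s)))
  private def word: Parser[Item] =
    (regex("[a-zA-Z][a-zA-Z0-9-]*".r) <~ space) ^^ (s => Word(s))
  private def string: Parser[Item] =
    (regex("[0-9]+".r) >> { len => ":" ~> regex(("(?s).{" + len + "}").r) <~ space }) ^^ (s => Str(s))
  private def list: Parser[Item] =
    ("(" ~> space ~> rep(item) <~ ")" <~ space) ^^ (xs => Lst(xs))

  private def item: Parser[Item] = number | word | string | list

  // parseAll requires the whole input to be a single item.
  def parseItem(str: String): ParseResult[Item] = parseAll(item, str)
}
```

With this version, the example line from the spec (plus the trailing space the grammar requires) parses into a Lst holding a Word, a Num, a Str, and a nested Lst, and the compiler can check that every case is handled when you match on the result.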

That’s it. I don’t claim this is the most efficient parser. If speed is an issue you could surely get better performance from a custom parser, but you get a lot of functionality from a tiny bit of code with Scala’s parsing package. Try out more test input strings and see how they behave. Study that Parser[+T] documentation, and even the source code. It’s those Parser methods that give you the power to do so much with so little code.

Matt Malone's Old-Fashioned Software Development Blog
