Glenn Engstrand

ANTLR is ANother Tool for Language Recognition. I have used ANTLR many times before to embed the ability to parse a configuration or set of instructions written in a domain specific language into a command line tool. This past weekend, I coded up a Java servlet that uses ANTLR technology to parse web requests written in a specific query language conducive to web publishing.

Compiler theory and language parsing is a pretty deep subject in computer science. You start with Chomsky's Theory of Formal Languages and end up with parsing strategies such as LALR(1) and LL(k). It's not for everyone. ANTLR makes it easy to develop and maintain your own parser for a language whose specification keeps changing but only if you have a solid grasp of the fundamental concepts. Otherwise, it is going to look a lot like voodoo.

I have used ANTLR before but this is the first time that I have used version 3. There has been a lot of improvements since the last version. One of the biggest improvements has been the change of parsing strategy from LL(k) to LL(*). Perhaps a small digression to explain that is in order.

A very important part to parsing is the ability to recognize strings that conform to the language specification from those that do not. You write the language specification in a kind of EBNF format (file type is .g in ANTLR) and run an ANTLR tool that creates classes in a variety of languages (e.g. Java, C#, Python) whose job it is to parse these strings.

What the parser does is break the string up into tokens (this is called the lexical analysis phase) then process those tokens. At the same time, it is traversing the language specification which is a set of production rules. If it finds a rule for the next token, then the string is still compliant to the language specification. Otherwise, it is not.

What happens if it finds two or more rules that match the next token? This is called ambiguity which is a bad thing. The parser needs to decide what the next rule is or it can't proceed. The way to make that decision is to look ahead to see if any tokens further down stream can resolve the ambiguity. Grammar specifications with limited look ahead are harder to read.

That is why the move to LL(*) from LL(k) is a big deal. With LL(k) you have to specify the maximum look ahead and the bigger the number, the slower the parsing. With LL(*) you can have unlimited look ahead and the parsing remains fast thanks to two other features new in version 3 called memoization and backtracking.

Some other cool features in version 3 is the ability to use ANTLR to easily create filters and rewriters. With these new features, it is easy to use ANTLR to create log file analysis tools, for example. Grammar rules can now have return values which allows you to easily propagate relevant information up the call stack of the rule traversal in a parsing session.

What do you do with a string once it has been parsed? Earlier versions of ANTLR allowed you to either embed your own commands into the parser class that get executed when the corresponding rule has been chosen and to create an AST or Abstract Symbol Tree that can be traversed after the parsing has concluded.

Another new feature in version 3 is a third option which is to use string template technology to generate output; however, that has not proven to be so useful in my own exploration of version 3.

Another thing that I did this past weekend was stress test the servlet using Tsung. I wanted to see how well ANTLR could perform when expected to serve about two requests per second for an hour.

What I found was that ANTLR was easily able to handle the load taking around one millisecond to perform the parsing. A surprise discovery for me what that the AST version performed just as well as the embedded commands version. I had expected that the AST version would take longer since you had to traverse the AST data structure afterwards. There was no noticeable difference in latency between the two types of parsings.

Scalable Server Side Language Parsing