Post-Programming for Dummies

February 15, 2008

case study: string tokenization

Filed under: Uncategorized — ppfd @ 6:24 am

Ok, so I’m writing this meta-ruby compiler but I’ve returned to ground zero because my tools were not sharp enough. In particular, I want transparent representations. Here’s an example:

Let’s say you have the following string of Ruby code:

a = “why, I ‘never’ \\\” would \\\” do ‘that” +   “said’ the man”

My goal is to create a “data structure” containing “tokens” so that I have a more efficient way of accessing information about this string. For instance, here are some questions I might like to ask:

Where are the string literals?
What variable is being assigned to?
What operators (if any) are being used?

Etc.

The number of questions I could ask about the string are infinite, but it seems like there is a “sweet spot” which it would be nice if I could access in O(1).

This brings me to the idea of a new data structure – the “flagged” list. It’s just a list, but each element has a “flag”. So the tokenized version of the above might look like:

["a"->symbol, " "->space, "="->operator, " "->space, "why, I 'never' \\\" would \\\" do 'that"->doubleQuoteContent, " "->space, "+"->operator, "   "->space, "said' the man"->doubleQuoteContent]

The goals for such a representation are:

1.) Fast query

E.g. I can say “flaggedList.getAll(“space”)” or “flaggedList.removeAll(“doubleQuoteContent”)” or “flaggedList.getAllAfter(“=”)” or whatever. The entire point is to make the structure extremely flexible.

2.) Fast extension.

Ok, this is a subtle point, but it’s important because I might want to build yet another compiler on top of my compiler. For instance, I might want to do something KERAZZY like make operators delimiters (so that everything between operators becomes absorbed into a token). Needing to create a new class for each type of token will just hurt my little old hands. Seriously, though, it’s better to just add a line to a list somewhere (operatorDelimiter) instead of creating a whole new class.

I don’t know why programmers are so afraid of flagging with strings. Maybe because it’s error prone? Bah humbug.

3.) Information Persistence

Another goal is to persist all information. None of the detail of the first line is lost in its flagged representation (e.g. even spaces are maintained). I can drop detail out if I want, but it is not dropped by default.

Anyway, there are other reasons, but they should be obvious. Or you can think of them. Whatever! The structure just appeals to me intuitively! Is it isomorphic to some other structure?

I guess the high-level claim here is that the difference between expressing something via one line of code vs. via two lines of code is enormous if you are planning on expressing that thing many many times. The difference between dropping a little information and dropping no information is enormous if you intend to extend the system ad infinitum. The difference between flagging with strings and flagging with classes is enormous if you want to maintain cognitive state (i.e. if you don’t want to forget why you wanted to add a flag while actually adding it).

You could summarize the design philosophy behind this approach to tokenization as: “Always sugar. Always.”

No Comments Yet »

No comments yet.

RSS feed for comments on this post. TrackBack URI

Leave a comment

Blog at WordPress.com.