Inspired by a xkcd comic, Peter Norvig, Director of Research at Google and all around interesting and nice guy, has created an above par code kata involving a regex program that demonstrates the core inner loop of many successful systems profiled on HighScalability.
The original code is at xkcd 1313: Regex Golf, which comes up with an algorithm to find a short regex that matches the winners and not the losers from two arbitrary lists. The Python code is readable, the process is TDDish, and the problem, which sounds simple, but soon explodes into regex weirdness, as does most regex code. If you find regular expressions confusing you'll definitely benefit from Peter's deliberate strategy for finding a regex.
The post demonstrating the iterated improvement of the program is at xkcd 1313: Regex Golf (Part 2: Infinite Problems). As with most first solutions it wasn't optimal. To improve the program Peter recommends the following steps:
- Profiling: Figure out where the program spends its time.
- Speedup: Make the program faster.
- Benchmarking: Run the program over pairs of arbitrary lists to see how fast it is and how short the solutions are.
- Studying: Learn something by looking at the benchmark results.
- Searching: Introduce a better search algorithm.
- Eliminating Components: Get rid of components that can't possibly be part of an optimal solution.
- Adding Components: Add new types of components to allow new, shorter solutions.
- Randomizing Search: Randomness allows us to explore different parts of the search space.
- Speculating: Think about what we could do next.
He then takes you through an elaborately detailed walk through of each step as it applies to the program. Showing exactly how to perform each step along with the results. It's full of graphs and explanations of key ideas like set cover theory, greedy search, exhaustive search, and much more.
It's a great explanation of the process of how a master programmer attacks a problem. It's unlike anything you've probably ever seen. Glorious.