Tuesday, August 11, 2009


For every software engineer there is an area of Computer Science that is exciting and fun. While I love virtually every aspect of the craft, compilers and optimization are my two favorites. For those who may not know, a compiler is a computer program that converts human-readable text into something the computer can understand (this is often called "machine language"). There are compilers for all sorts of human-readable languages, like C, C++, C#, Visual Basic, Python, and so on. Writing the program that has to understand text and turn it into something the computer can actually use is a very challenging endeavor, but it is also one of the more intellectually satisfying things I've done. There are all sorts of interesting aspects to writing a compiler that come up only after you're embedded in the design and implementation of the thing. You have to draw from all areas of Computer Science to make it work, but one of the most fascinating areas for me is optimization.

Optimization requires you to look at a problem from all sorts of different angles so you can be sure you understand exactly what you're dealing with. The goal is to take a program (or a piece of a program) and somehow make it better. Perhaps you want to make it smaller by cutting out all of the useless cruft. Maybe you need it to run faster. Sometimes you need to change it so that it is easier to understand and maintain later on. All of these fall under the general heading "Optimization".

The project I'm working on now has me dealing with many large files (hundreds of files that are each at least 10 megabytes in size). These files contain information that has to be updated at an astounding pace. At its peak, my program must perform close to 20,000 updates per second. Just finding the correct information in each file to update is a challenge. Maintaining this level of speed adds a whole lot more to the challenge. Doing all of this in C# (on top of .NET) adds another level of excitement. Eight months ago, I was not sure it could be done without lots of computers working together as one big virtual computer. Now, I'm watching it run on my home PC with close to 40,000 updates per second. Amazing.

Eash file uses a database engine, SQLite, as its maintainer. I chose SQLite for three main reasons: Transactions, speed, and encryption. Since there are several changes that need to be made to each file as part of one sequence, I wanted to make sure that either everything happened correctly or no changes were made at all. This is called "atomic" behavior, and requires transactions to work correctly. Obviously, I needed it to be very fast and one of SQLite's primary features is its speed. I had to take that claim on trust until I built a prototype system. Finally, I needed each one of these files to be encrypted - nobody else should be able to see what is inside the file unless they know the complicated password.

Everything in this project is written on top of Microsoft's .NET framework. Thus, to use SQLite, I had to use SQLite.NET from Robert Simpson. I was concerned that the .NET version would not be able to keep up with my speed requirements, but it turns out that this was not a problem at all. Each file individually can be accessed, updated, and checked close to 50,000 times per second on my home PC. I find this amazing, especially given that the files are encrypted with RC2 data encryption.

I am left both happy and sad with the results I've seen. I'm happy because SQLite.NET has done everything I've asked of it better than I expected. I'm sad because there's no reason to optimize anything any further than the prototype. For something that took me a few weeks to build, I'm impressed. Consider this post a glowing review of SQLite.NET specifically and the SQLite database generally.

No comments:

Post a Comment