Sep 5 2009

I Has A Hash Table

I know, three weeks ago I said I was building me a B-Tree. I did build it, even the parts I listed under “What’s next?” in that post, and it works. But it became apparent there was a more urgent need for a hash table, for work-related reasons, so I switched gears to build one of those on the same principles.

The biggest principle is Append-Only Storage, as described in the prior post. So I thought back to the simplest on-disk hash table I know of—Dan Bernstein’s CDB—which is very clever, but read-only. I implemented something similar, and then mashed in the CouchDB-like approach of incrementally appending only the modified sub-components.

Initially I made the file a series of key-value pairs, followed by the hash-table index as an array of {hash code, position} structures, each of which pointed to the position of the corresponding key and value. Very simple. To save changes, I’d write out the changed pairs, followed by a new copy of the index. The problem with that was that the index gets large as the number of records increases, so with a 100,000-record file, changing even one record would append almost a megabyte.

I had a brainstorm about how to fix that—add a bit of tree structure. So now there’s a root index of 256 pointers to sub-indexes. The 8 most significant bits of the key’s hash code are looked up in the root index to find the sub-index (hash table) in which to look up the value. The advantage is that every value change only alters one sub-index; when saving, only sub-indexes that were modified, plus the root, have to be written out. So if only one key changes, that’s just one sub-index, meaning only 1/256 as much data to append as before.

[I thought I was really clever for coming up with this, but then when I went back and looked at the original CDB, which I hadn’t seen in a year and a half, it turns out it does it too! I guess I’d just forgotten about that…]

What I have now

What I have now is a persistent dictionary whose keys and values can be arbitrary data blobs. You can make changes in memory, and then save them back to disk, where they’re appended to the file. (Or you can rewrite the file from scratch, which takes a bit longer but reclaims wasted space.) The file format should be virtually un-corruptible by crashes, unlike a traditional database such as sqlite. An interesting extra benefit is better concurrency—the file doesn’t require any form of read-locking, because it’s perfectly safe for one client to write to the file while another is reading from it.

(The grand total’s about 3400 lines of C++, for both the tree and hash table.)

I plan to upload this to Bitbucket soon, but I keep brutally refactoring the classes, or maliciously changing the file format, neither of which would endear me to other people trying to use the code. Hopefully things will settle down soon, now that I’ve got nearly all the functionality I want. I’ll let you know.

[Update: Struck out a comparison that’s probably unfair to current versions of sqlite, judging by descriptions of how it saves transactions.]


4 Responses to “I Has A Hash Table”

  • Gus Mueller Says:

    “The file format should be virtually un-corruptible by crashes, unlike a traditional database such as sqlite.”

    So uh, how did you manage to do that with sqlite? :)

  • Jens Alfke Says:

    I never did — you pretty much can’t make sqlite (or any update-in-place format) robust enough. The PubSub store in OS X has to keep a backup copy of critical stuff like the list of subscriptions, and it has to check the db for corruption when it opens it, and if it’s gone bad, nuke it and restore what it can from the backup. (Mail, Address Book, etc. do the same thing.)

    This was a total pain to implement, and is why I really want to write something that won’t have these problems, even if it isn’t otherwise as powerful as sqlite.

  • Joachim Bengtsson Says:

    Sqlite does a lot of crash testing — it has the most comprehensive test suite I’ve ever seen (45 million lines of test code to 64 thousand lines of library code). But you’re saying it’s relatively easy to corrupt with a crash anyway? Not that I don’t believe you, I’m just surprised :)

  • Jens Alfke Says:

    Joachim — it may not be easy, but it happens. I saw evidence of several instances during internal testing of Tiger and Leopard.

    Problems seem to occur if the kernel panics or, worse, the disk abruptly loses power. Disk controllers have an internal cache, and they re-order sector writes to minimize the amount of head travel. If the controller resets/dies before it can flush its cache, some random subset of the most recent sector writes will be lost.

    I’m reading through the “Atomic Commit In SQLite” document now — it’s very interesting, and I don’t believe it was around when I was working with sqlite in ‘05-‘07. It sounds like current versions of it are more robust to disk power failures … maybe as a result of the problems we had at Apple ;)

    A related problem that came up much more often was denial-of-service to processes accessing the database, due to another process being hung while it holds a lock. On my Leopard system this seemed to happen every few months to PubSub. This is my fault, not sqlite’s, of course, but it does mean that a failure in one process propagates into others, which is bad. Worse, the way to recover from it is to move the database aside and re-create it, just as though it had been corrupted. This makes me appreciate the way that in an append-only file format, readers can never be locked out.