SIDEBAR
»
S
I
D
E
B
A
R
«
Ottoman Update: Direct-Write Mode
Oct 19th, 2009 by jens

I’ve pushed a new update of my Ottoman storage library.

What’s new is the option to have individual additions to the dictionary written directly to disk, instead of being buffered in memory until you save a new version. This helps performance if you’re writing a ton of data all at once. And there’s a simple demo called PackDir that writes a ton of data all at once: it packs the contents of an entire directory tree into an Ottoman file, with keys being filenames and values being file contents.

To implement this I had to violate the lockless semantics a bit, because having multiple clients appending key/value pairs to the file at the same time would make it a lot harder to keep their revisions straight. So before being able to do any direct puts, you have to enter an [ugh] ExclusiveTransaction.

(Of course, the issues with locking don’t matter if the file is only being used by one process. Or even if there are multiple readers but only one writer, since the locks don’t interrupt readers.)

Ottoman For Cocoa (and bug-fixes)
Sep 24th, 2009 by jens

I’ve pushed out a few updates to the Ottoman library today. These fix a couple of embarrassing bugs, and also add an API in regular ol’ Objective-C for those of you who aren’t into that funky “C++” stuff.

(The Cocoa stuff uses MYUtilities, so if you’re not already using that, please read the setup instructions.)

Alternate images for Ottoman logo
Sep 20th, 2009 by jens

It’s got a moose! But then I found the ottoman with hanging files inside and it was even more perfect…

moose-storage-ottoman.jpg

I thought this one was pretty damn stylish too:

abstract-design-storage-ottoman.jpg

Announcing Ottoman
Sep 20th, 2009 by jens

As promised

Ottoman is a lightweight, reliable key-value store with multi-version concurrency control.

A key-value store is a persistent on-disk dictionary that maps arbitrary keys (usually short text strings) to arbitrary values (binary data of any length). There’s already a large and venerable family of libraries that do this, often referred to by the name “db”. Notable members include dbm, Berkeley DB, cdb and Tokyo Cabinet. What makes Ottoman different?

  1. It tries to provide extremely reliable storage, immune to file corruption caused by operating system crashes or power failure while writing.
  2. It stores past versions of the database, as a useful side-effect of the way it works. Applications can use this to provide features like snapshots, timelines or persistent undo. (You can prune old versions, for space or privacy reasons, by writing a fresh copy of the file.)
  3. It has very low memory overhead. The file and its contents are memory-mapped, so even large values can be accessed without having to copy them into the heap. (Opening a database only mallocs a few hundred bytes!)
  4. It allows concurrent read/write access by multiple processes. Optimistic concurrency — as used by version-control systems like Subversion — keeps file locking to a minimum: readers are //never// blocked at all, and writers are only blocked briefly (clients can’t lock the database for arbitrary periods, only for as long as it takes the accumulated changes to be written to disk.)

Why is it named after a footstool? First, as an homage to CouchDB. Secondly, it turns out ottomans can be used for storage. Third, because I have a mild fascination with Asian and Islamic history.

_…Read more at the project home page!_

I Has A Hash Table
Sep 5th, 2009 by jens

I know, three weeks ago I said I was building me a B-Tree. I did build it, even the parts I listed under “What’s next?” in that post, and it works. But it became apparent there was a more urgent need for a hash table, for work-related reasons, so I switched gears to build one of those on the same principles.

The biggest principle is Append-Only Storage, as described in the prior post. So I thought back to the simplest on-disk hash table I know of—Dan Bernstein’s CDB—which is very clever, but read-only. I implemented something similar, and then mashed in the CouchDB-like approach of incrementally appending only the modified sub-components.

Initially I made the file a series of key-value pairs, followed by the hash-table index as an array of {hash code, position} structures, each of which pointed to the position of the corresponding key and value. Very simple. To save changes, I’d write out the changed pairs, followed by a new copy of the index. The problem with that was that the index gets large as the number of records increases, so with a 100,000-record file, changing even one record would append almost a megabyte.

I had a brainstorm about how to fix that—add a bit of tree structure. So now there’s a root index of 256 pointers to sub-indexes. The 8 most significant bits of the key’s hash code are looked up in the root index to find the sub-index (hash table) in which to look up the value. The advantage is that every value change only alters one sub-index; when saving, only sub-indexes that were modified, plus the root, have to be written out. So if only one key changes, that’s just one sub-index, meaning only 1/256 as much data to append as before.

[I thought I was really clever for coming up with this, but then when I went back and looked at the original CDB, which I hadn’t seen in a year and a half, it turns out it does it too! I guess I’d just forgotten about that…]

What I have now

What I have now is a persistent dictionary whose keys and values can be arbitrary data blobs. You can make changes in memory, and then save them back to disk, where they’re appended to the file. (Or you can rewrite the file from scratch, which takes a bit longer but reclaims wasted space.) The file format should be virtually un-corruptible by crashes, unlike a traditional database such as sqlite. An interesting extra benefit is better concurrency—the file doesn’t require any form of read-locking, because it’s perfectly safe for one client to write to the file while another is reading from it.

(The grand total’s about 3400 lines of C++, for both the tree and hash table.)

I plan to upload this to Bitbucket soon, but I keep brutally refactoring the classes, or maliciously changing the file format, neither of which would endear me to other people trying to use the code. Hopefully things will settle down soon, now that I’ve got nearly all the functionality I want. I’ll let you know.

[Update: Struck out a comparison that’s probably unfair to current versions of sqlite, judging by descriptions of how it saves transactions.]

I’m Building Me A B-Tree
Aug 14th, 2009 by jens

The other day I took it into my head to implement a B+tree. Why? Because they sound neat, and I’ve done hardly any serious programming with trees in my career. (Someone, I think Buzz Andersen, once noted that there are two kinds of programmers: those who do think in terms of trees, and those who do everything with hash tables. I’m in the latter camp.)

And also because I’m a big fan of CouchDB, and really admire its elegant storage model. It’s an on-disk B-tree—no surprises there—but the file is append-only, which both makes it impervious to crash-related corruption, provides nearly lockless concurrency, and makes it easy to access earlier revisions.

[In a nutshell: Updated data values or tree nodes are appended to the file instead of overwriting the earlier versions. Since updating a node changes its location, its parent node needs to be updated too to point to the new location. This recurses up the tree, meaning any change ends up with a new root node written at the very end of the file. In fact, when you open the file you find the root by looking at the very end. Since no data is ever changed, once you open the file you’re impervious to changes made by other writers since they don’t affect anything you’re looking at.]

I’d love to use something like that for various projects, but as CouchDB is implemented in the exotic functional language Erlang, I can’t really use its storage layer as-is. So: could I implement something like it in C++?

Thus far I have a working in-memory B+tree implementation. Inserts and deletes work, and I’m working on the iterator. Even this much was harder to get working than it should have been, or so it feels. But that always seems to be true—algorithms sound straightforward when you read about them, but putting them into practice exposes you to all the details and subtleties inherent in hand-waving like “now merge the node with a neighbor”.

Actually I haven’t implemented a straight B+tree, rather a ‘top-down’ variant described by Ohad Rodeh that’s better suited to this type of application because it changes fewer numbers of interior nodes during an update.

What’s next?

  • Support for string keys (so far it just handles ints)
  • Serializing nodes to/from disk
  • Keeping track of which nodes are touched during an operation, and appending those to the file
  • Writing a trailer to the file to mark a successful update (and to link back to the previous trailer, for historical purposes.)

Sounds straightforward, but of course the devil’s in the details.

»  Substance:WordPress   »  Style:Ahren Ahimsa