Brent “NetNewsWire” Simmons raises the idea of an open protocol for syncing RSS/Atom subscriptions, that is, a way of keeping multiple local newsreader apps (like on a Mac and an iPhone) in sync with each other, so that they share the same set of subscribed feeds, and remember which articles have already been read. You can think of it as “IMAP for RSS”.
NetNewsWire already does this using Google Reader as an intermediary, and Apple’s PubSub framework (which is what Safari and Mail use) shares the read/unread state using MobileMe. But it would be nice to have an open protocol.
I have some experience with this, having implemented the sync system used by PubSub. It’s an interesting problem—you might think I would have just used Apple’s SyncServices, and it’s true that it would have worked great for the subscription list, but it doesn’t scale well to huge numbers of rapidly-changing “read/unread” flags.
I have two suggestions (which I would have made on Brent’s blog, except he doesn’t allow comments anymore.)
CouchDB is an awesome web-centric database engine. It doesn’t use SQL; instead, it’s a glorified key-value store whose values are arbitrary JSON objects, and which uses map-reduce for efficient querying. The basic API is pure REST, though glue libraries for many languages exist.
CouchDB natively supports syncing data through distributed groups of servers. It’s sort of like the way distributed version-control systems like Git or Mercurial work: multiple CouchDB instances each store a replica of the same data set, but can “pull” changes from each other over HTTP to stay in sync.
CouchDB is pretty lightweight and is already being used on the desktop by client apps: GNOME has been integrating it into the Linux desktop to use as a shared store for user data like contacts and bookmarks. It plays a similar role to SyncServices on Mac OS, but it’s all open source and any two instances can sync with each other instead of requiring a proprietary server. I hear this is already shipping in the latest Ubuntu releases.
It doesn’t look as though anyone’s designed a schema for storing RSS subscriptions this way, but it would be pretty easy to define one. You then need a local agent running CouchDB (it can be stripped down to be pretty small), a client library for Cocoa apps, and an upstream CouchDB server to sync to.
This protocol is similar to what I came up with for PubSub. It’s a simple extension of REST, but I haven’t heard of it being used elsewhere. The idea is that you model an append-only log file as an HTTP resource. The items that are logged are ‘events’ describing changes in the data model, in this case the subscriptions and articles.
The sync algorithm looks like this:
You can think of the log file as a queue or message stream that’s being collaboratively read and written by all of the clients. This sounds like something you’d need a fancy web-app to manage, but it turns out that all it takes is a typical HTTP 1.1 server and a trivial server-side script.
The download is a conditional GET, as used for fetching feeds themselves. The difference is that you use a “Range:” header to request only the bytes past the last known EOF. For example, if the last time you read the log it was 123456 bytes long, you add the header “Range: 123456-” to the request. This ensures that you only get back the new bytes that were added to the end. (And since this is a conditional GET, if the file hasn’t changed at all you just get back an empty 304 response.)
That’s all you need to do to track changes. Since the file is append-only, the only bytes you need to read are the ones added to the end. This request efficiently sends you just those bytes.
What’s cool is that this require no server-side software. If the log is a static file, any regular HTTP server like Apache will automatically handle GET requests for it, even byte-range ones. (Ranges are already used by browsers to resume interrupted downloads.) And it sends the response at high speed, since the server’s just streaming from a file, without multiple back-and-forth requests and without expensive database queries.
How about writing? Ideally you’d use the same approach, with a byte-range PUT that specifies that the request body should go at the end of the file. Unfortunately most servers don’t support this for static files, even though it’s basically just HTTP 1.1. But it’s really easy to implement. Any PHP crufter should be able to whip up a one-page script that simply responds to a POST by reading the request body and appending it to a local file (while doing the necessary ETag and range verification.) The great thing is that this script doesn’t have to know anything at all about RSS or subscriptions or unread counts; it’s completely generic. You can upgrade the data model without having to touch the script, and you could use the same script to sync anything, not just RSS.
(Yes, there is a semi-obvious drawback to this protocol: the file grows without limit. Surprisingly, this is not a problem most of the time, since clients only upload or download new data; the only real limit is the maximum file size or disk quota allowed by the server. But it does present a problem for a new client, whose first-time sync would download the entire file. This can be worked around by having new clients ignore very old data (only download the latest 10MB, say) or by periodically writing a compact subscription list to a separate URL.)
Yes, I’m still working on Ottoman (my append-only multiversion-concurrent storage library). As the code grows in size and complexity, so it grows in its resistance to being changed, but as Piet Hein said and I never tire of quoting:
Problems worthy of attack Prove their worth by fighting back.
I just pushed my latest changes up to bitbucket.org. What’s new?
Variable-sized top-level index. Previously, the hash data structure used a top level array of 256 hashtables. I’ve now made that ‘256’ a variable that scales with the number of records. This saves a lot of room for smaller datasets. (Sounds easy, right? I thought so too. But it ended up triggering significant changes to the file format and algorithms and took a lot of debugging.)
The return of the trees! part un. I started this as a project to learn about B+trees, but when I got to the point of implementing the append-only multiversion stuff I did it with hashtables first because it’s more straightforward that way, and the tree code went into the freezer. Well, it’s back now. There is a “new” Tree class that can be used to read and write persistent B+trees. It’s not yet integrated into the top-level Ottoman class, though, so you can only use it on its own, not intermixed with hash-tables, and without the API for tracking versions.
The next task is of course to integrate the tree API into the Ottoman class. But more than that, it will be entwined with the hashtable such that trees will serve as searchable indexes into the data store. So the hashtable will provide extremely fast but unordered access via a unique “record ID”, but you can build as many indexes as you like that will use ordered secondary keys (of your choice) to look up hash values.
At that point, I believe Ottoman will be ready to serve as a substrate for HTML local data storage, for implementing a CouchDB-compatible database (a la JSONDB), or for other fun databasey purposes.
ZSync is a new Mac/iPhone library that uses my BLIP P2P networking protocol:
“ZSync is an open source syncing library designed to allow easy syncing of data between an iPhone/iPod Touch and the OS X Desktop. ZSync utilizes the BLIP library and Apple’s Sync Services to allow easy and seamless syncing of data.”
It’s still in early development though, with a first public release expected in January:
Right now the code is in a private GitHub repository while the initial framework and protocols are fleshed out. This is expected to go public in January of 2010. Until then we are keeping the development team very small so that we can flesh out the design without a lot of overhead.
This looks like it’ll be super useful for iPhone apps that want to integrate with their Mac siblings, especially since their design won’t require you to have the Mac app running while you sync.
As everyone knows who works in the pet-food industry (or computer software for that matter), it can be hard to start eating your own dogfood. Case in point: I just this week set Chrome to be my default browser, though I’ve been working on it for four months now.
Partly that’s because when I started in July the Mac version of Chrome was too immature; and partly it’s because a web browser is something you need to have running and working all the time—especially since the Chrome project’s bug tracker and code-review tool are web-based.
But Mac Chrome is quite stable enough to use now, and as I haven’t been doing much Chrome development on this MacBook Pro lately (it takes too long to compile compared to my souped-up Mac Pro) I’ve installed the latest dev-channel build and replaced Safari with it in my Dock and as my default browser.
It’s hard to get used to a new browser, after all these years. I remember that I dropped IE 5 like a hot rock as soon as Safari became useable, but that’s because IE sucked so badly on OS X (as you youngsters may not remember.) But Safari is a great browser. Chrome’s great too, but in different ways, and the Mac version’s not finished yet so there are some missing bits.
I should note that I generally don’t work on the user-visible parts of Chrome, rather the underlying WebKit engine; so I haven’t been focusing on the UI much, or noticing the features being added, until experiencing them as an end-user.
In Chrome’s favor:
Rough edges (remember, this is a pre-beta build):
I’ve been impressed by Chrome’s stability too, for a pre-beta development build. The app hasn’t crashed once, and I haven’t even gotten the “Oh, snap!” page that shows that a renderer process crashed. (I’ve seen plugins crash a few times, but that’s probably Flash’s fault, and as in Safari on 10.6, this doesn’t affect the browser or even the rest of the page.)
One thing I’m really looking forward to is extensions. Safari’s a closed system, and I’ve long been envious of the plethora of cool plug-ins available for Firefox. I’m looking forward to using, and maybe developing, extensions for Chrome. (In the current dev channel Mac release, extensions can be installed, but the ones I’ve tried don’t do anything yet.)
I’ve pushed a new update of my Ottoman storage library.
What’s new is the option to have individual additions to the dictionary written directly to disk, instead of being buffered in memory until you save a new version. This helps performance if you’re writing a ton of data all at once. And there’s a simple demo called PackDir that writes a ton of data all at once: it packs the contents of an entire directory tree into an Ottoman file, with keys being filenames and values being file contents.
To implement this I had to violate the lockless semantics a bit, because having multiple clients appending key/value pairs to the file at the same time would make it a lot harder to keep their revisions straight. So before being able to do any direct puts, you have to enter an [ugh] ExclusiveTransaction.
(Of course, the issues with locking don’t matter if the file is only being used by one process. Or even if there are multiple readers but only one writer, since the locks don’t interrupt readers.)
Farhad Manjoo writing in Slate about Google Wave:
The trouble is, everything you type into Wave is transmitted live, in real time—every keystroke was getting sent to Zach just as I hit it. This made me too self-conscious to get my thoughts across.
… Maybe I should just delete what I’d written and say, “Twitter works because it’s simple.” But I couldn’t do that, because Zach was watching me. He could see me struggling right now—he could see that I’d gotten myself stuck in a textual cul-de-sac and that I was desperately searching for a way out without looking foolish. Now I saw Zach beginning to type: “Don’t let the live-typing get you down!” The game was up; what was the point of making a point now? I ended my thought clumsily and then resolved never to attempt to say anything very deep on Wave.
The same thing happened seven years ago with the live-typing feature that I implemented in iChat 1.0 (which was only supported for Bonjour chats.) I thought it was an awesome idea, and I’d wanted to have it in a chat program since about 1997. But it turned out that, in actual use, people hated it, for exactly the reasons Manjoo describes: it makes you self-conscious. We took it out in the next release.
(Interestingly, I hate video chat for a very similar reason. Somehow, the fact that my picture is being shown in real time to the remote person makes me horrifically self-conscious, even though it wouldn’t bother me at all to talk face-to-face with that person. I don’t know whether it’s the little preview on my screen, or the fact that the person is spookily both present and not-present, but the few times I’ve tried video-chat have been really unpleasant.)
I’m usually on the side of more technology. I believe that our online communications tools are still horribly primitive and have only scratched the surface of what’s possible. But this was a case where more technology was bad.
The low-tech alternative that lots of people use in IM, is to write in short fragments, each a separate message, so the other person can see them one by one without waiting for you to finish the whole sentence.
The difference is that you’re in control over when to send a partial message, and the other person knows you’re in control, and so on. I still think it might be possible to do this in a higher-tech way, like using a hot-key to send a partial message on demand without having the funky line-breaks, but the current approach isn’t so bad as long as you’re not embarrassed about unintentional free verse.
I could have told the Wave people about what I’d learned, except I didn’t know Wave existed until April (shortly before the public announcement), and even then I was just some guy lost in the crowd at the demos….
Part of the problem, in both cases, is that live typing is one of those Cool Demo Features that looks really awesome when showing off the app. Features like that can be dangerous because they are legitimately very useful during the app’s gestation, when exciting demos are a key survival trait; but then they can’t be removed later on because they’re so well-known, even if they turn out to be useless. Sometimes these features aren’t actually harmful to the user experience, they just make the code more complex and harder to maintain. Instant typing is both, unfortunately. (The clever sync algorithms and rapid-fire network messages Wave uses would be needed even without live typing, but the fact that they have to run on every few keystrokes, not just every minute or so, pushes those things so much harder.)
I’ve pushed out a few updates to the Ottoman library today. These fix a couple of embarrassing bugs, and also add an API in regular ol’ Objective-C for those of you who aren’t into that funky “C++” stuff.
(The Cocoa stuff uses MYUtilities, so if you’re not already using that, please read the setup instructions.)
It’s got a moose! But then I found the ottoman with hanging files inside and it was even more perfect…
I thought this one was pretty damn stylish too:
As promised …
A key-value store is a persistent on-disk dictionary that maps arbitrary keys (usually short text strings) to arbitrary values (binary data of any length). There’s already a large and venerable family of libraries that do this, often referred to by the name “db”. Notable members include dbm, Berkeley DB, cdb and Tokyo Cabinet. What makes Ottoman different?
Why is it named after a footstool? First, as an homage to CouchDB. Secondly, it turns out ottomans can be used for storage. Third, because I have a mild fascination with Asian and Islamic history.
_…Read more at the project home page!_
I know, three weeks ago I said I was building me a B-Tree. I did build it, even the parts I listed under “What’s next?” in that post, and it works. But it became apparent there was a more urgent need for a hash table, for work-related reasons, so I switched gears to build one of those on the same principles.
The biggest principle is Append-Only Storage, as described in the prior post. So I thought back to the simplest on-disk hash table I know of—Dan Bernstein’s CDB—which is very clever, but read-only. I implemented something similar, and then mashed in the CouchDB-like approach of incrementally appending only the modified sub-components.
Initially I made the file a series of key-value pairs, followed by the hash-table index as an array of {hash code, position} structures, each of which pointed to the position of the corresponding key and value. Very simple. To save changes, I’d write out the changed pairs, followed by a new copy of the index. The problem with that was that the index gets large as the number of records increases, so with a 100,000-record file, changing even one record would append almost a megabyte.
I had a brainstorm about how to fix that—add a bit of tree structure. So now there’s a root index of 256 pointers to sub-indexes. The 8 most significant bits of the key’s hash code are looked up in the root index to find the sub-index (hash table) in which to look up the value. The advantage is that every value change only alters one sub-index; when saving, only sub-indexes that were modified, plus the root, have to be written out. So if only one key changes, that’s just one sub-index, meaning only 1/256 as much data to append as before.
[I thought I was really clever for coming up with this, but then when I went back and looked at the original CDB, which I hadn’t seen in a year and a half, it turns out it does it too! I guess I’d just forgotten about that…]
What I have now is a persistent dictionary whose keys and values can be arbitrary data blobs. You can make changes in memory, and then save them back to disk, where they’re appended to the file. (Or you can rewrite the file from scratch, which takes a bit longer but reclaims wasted space.) The file format should be virtually un-corruptible by crashes, unlike a traditional database such as sqlite. An interesting extra benefit is better concurrency—the file doesn’t require any form of read-locking, because it’s perfectly safe for one client to write to the file while another is reading from it.
(The grand total’s about 3400 lines of C++, for both the tree and hash table.)
I plan to upload this to Bitbucket soon, but I keep brutally refactoring the classes, or maliciously changing the file format, neither of which would endear me to other people trying to use the code. Hopefully things will settle down soon, now that I’ve got nearly all the functionality I want. I’ll let you know.
[Update: Struck out a comparison that’s probably unfair to current versions of sqlite, judging by descriptions of how it saves transactions.]