Oct 13 2006

Dream apps and the perils of screen-scraping

There’s an interesting online competition going on called My Dream App. The idea is that a bunch of people pitch their ideas for a Mac application, and the set of ideas gets winnowed down in several rounds of public voting, until one is left. A group of experienced developers have promised to implement the winning idea as a shareware app, whatever it turns out to be.

It’s a fun concept, but it highlights some of the problems of having end users design software. A number of the proposals give me a particular sinking feeling I associate with user-interface design meetings: lots of ideas that sound super-cool as one-sentence pitches, accompanied by irresistibly glitzy faked-up screen shots (all replete with translucency, rounded corners, and this year’s de rigueur reflections). But too often there’s no “there” there. It’s all so vague that I can sense that these people haven’t thought through the difficult bits or worked out a coherent idea of what the app will do.

It’s kind of like being a writer and having someone come up to you at a party and say “I have this brilliant idea for a novel…” followed by a rambling series of plot twists, closing with “…and I’ve done the hard part, now you just have to write it down!”

Case In Point

The idea that frustrates me the most is called “Hijack”. (I’d link to it, but the site is so thoroughly Ajax-y that I can’t find a URL for this page. You’ll just have to start at the home page and click on “Hijack”. Oh well; permalinks are so 2004 anyway.) This is to be a super-duper native GUI interface to web forums. Like an email app or RSS reader, it will show all the posts of every forum you subscribe to in a consistent interface, tracking new unread posts and threads. Great stuff! What server does this work with? All of them!

Yup. In an amazing feat of hand-waving, this app is going to Just Work with every BBS program under the sun, through the miracle of screen-scraping. An army of volunteers are going to code and maintain scripts that will strain through the murky tag soup and extract every bit of information you need. Even though every single installation of every single brand of web forum is different, and they get redesigned all the time, and in many cases some sort of AI would be required to figure out how to reassemble the semantics from the stuff on the screen.

(I’m sure that by now my colleague Jessica is shivering uncontrollably, if she’s still reading. Jessica did a lot of work on Sherlock, including continued maintenance of the channels. Many of the channels work this way, by scraping the HTML of various web sites. They are frustrating to maintain because the sites keep changing their design, which breaks the scripts that run the associated channel. Few people know that the channels are dynamically updated, with new versions downloaded when necessary whenever you launch Sherlock. It has to be that way, because otherwise Sherlock itself would have to be Software-Updated many times a year. Now, consider that Apple only ships/supports a handful of channels, and that these are vastly simpler than a web content-management system. The task of doing this kind of development and maintenance for every damn BBS is just mind-boggling.)
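
To make the fragility concrete, here is a minimal sketch in Python of the kind of script such a volunteer would have to maintain. Everything in it is an assumption made up for illustration: the `postbody` class name, the two snippets of markup, and the helper names are hypothetical and not taken from any real forum package. The scraper finds posts by looking for a particular class on a `div`, which works only until the theme changes.

```python
from html.parser import HTMLParser


class PostScraper(HTMLParser):
    """Collects the text inside every <div> carrying the class `post_class`."""

    def __init__(self, post_class):
        super().__init__()
        self.post_class = post_class  # hypothetical class name used by one theme
        self.posts = []
        self._depth = 0  # greater than 0 while inside a matching <div>

    def handle_starttag(self, tag, attrs):
        if tag != "div":
            return
        if self._depth:
            # Nested <div> inside a post: track it so we know when the post ends.
            self._depth += 1
        elif self.post_class in (dict(attrs).get("class") or "").split():
            self._depth = 1
            self.posts.append("")

    def handle_endtag(self, tag):
        if tag == "div" and self._depth:
            self._depth -= 1

    def handle_data(self, data):
        if self._depth:
            self.posts[-1] += data


def scrape_posts(html, post_class="postbody"):
    scraper = PostScraper(post_class)
    scraper.feed(html)
    return [post.strip() for post in scraper.posts]


# Invented markup: the same two posts before and after a redesign.
OLD_THEME = '<div class="postbody">First!</div><div class="postbody">Me too.</div>'
NEW_THEME = '<td class="msg"><span>First!</span></td><td class="msg">Me too.</td>'

print(scrape_posts(OLD_THEME))
print(scrape_posts(NEW_THEME))
```

The script does not crash on the redesigned page; it quietly finds no posts at all, which is arguably worse, since nobody is told that a new definition needs to be written.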

Thinking Inside The Box

But what bothers me the most is not just that it’s infeasible. It’s that it’s a horrendous kludge. It’s thinking completely inside the box. “I read so many of these forums, and they’re ugly and I can’t track them. I know! I’ll make an app that will read the forums for me!” It reminds me of Gopher and Archie, which were now-quaint attempts to spackle over the old ARPAnet by putting a textual menu system and a searchable index on top of the maze of FTP servers. They were kind of neat, but as soon as NCSA Mosaic appeared they were history.

Don’t get me wrong, I absolutely love the idea of an elegant, unified user interface for distributed many-to-many discussions. I’ve been obsessed by it for years, in fact. But the real way to do this is to build protocols and servers that make it work, and then robust and reliable clients can be plugged into them. The need for this has been obvious to me since about 1997. The world keeps getting closer and closer to it, in incremental steps starting with RSS.

I think that we actually have most of the necessary pieces now — the Atom format and publication protocol, the Atom threading extensions, and decentralized identity systems like OpenID. All it takes now (he said with slight sarcasm) is getting the BBSs to adopt some of these standards in a coherent and compatible way, so that news-readers can follow and track all of the threads, posts and identities involved.
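
As a rough sketch of what the client side could look like, the Python below reads an Atom feed that uses the threading extension's `thr:in-reply-to` element and rebuilds the discussion tree from it. The feed contents, the entry ids, and the function names `build_threads` and `outline` are invented for illustration; only the two XML namespaces and the `ref` attribute come from the Atom and threading specifications.

```python
import xml.etree.ElementTree as ET
from collections import defaultdict

ATOM = "{http://www.w3.org/2005/Atom}"
THR = "{http://purl.org/syndication/thread/1.0}"

# A made-up three-post thread, as a forum might publish it.
FEED = """<feed xmlns="http://www.w3.org/2005/Atom"
      xmlns:thr="http://purl.org/syndication/thread/1.0">
  <entry><id>tag:example.org,2006:1</id><title>Hijack?</title></entry>
  <entry><id>tag:example.org,2006:2</id><title>Re: Hijack?</title>
    <thr:in-reply-to ref="tag:example.org,2006:1"/></entry>
  <entry><id>tag:example.org,2006:3</id><title>Re: Re: Hijack?</title>
    <thr:in-reply-to ref="tag:example.org,2006:2"/></entry>
</feed>"""


def build_threads(feed_xml):
    """Return (titles by entry id, child ids by parent id).

    Top-level posts are listed under the parent key None.
    """
    root = ET.fromstring(feed_xml)
    titles = {}
    children = defaultdict(list)
    for entry in root.iter(ATOM + "entry"):
        entry_id = entry.findtext(ATOM + "id")
        titles[entry_id] = entry.findtext(ATOM + "title")
        reply = entry.find(THR + "in-reply-to")
        parent = reply.get("ref") if reply is not None else None
        children[parent].append(entry_id)
    return titles, children


def outline(titles, children, parent=None, depth=0):
    """Flatten the tree into indented title lines, depth-first."""
    lines = []
    for entry_id in children.get(parent, []):
        lines.append("  " * depth + titles[entry_id])
        lines.extend(outline(titles, children, entry_id, depth + 1))
    return lines


print("\n".join(outline(*build_threads(FEED))))
```

Nothing here depends on how the forum's pages look. The reply structure is stated outright in the feed, so a redesign of the HTML would leave this code untouched.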


29 Responses to “Dream apps and the perils of screen-scraping”

  • SuitCase Says:

    Part of the Daring Fireball flood, sorry if you have ten million messages to moderate.

    I am immensely in love with the idea of Hijack, I’ve been wishing for a rich client that accesses content from forums for many years and had fantasies about learning how to code myself just so I could try and make it.

    Yes, the collection of data will be hard, however I think you are exaggerating the difficulty it would involve. That Dapper thing looks promising enough, but it already seemed mostly feasible to me because some forums use semantic markup (or at least, informative CSS stylings), offer information-rich RSS feeds, or (probably the easiest thing) a default skin\the option to choose the default skin in the user prefs. Imagine - if you had support for the subSilver theme of phpBB alone, you’d probably then support 10% of the forums on the internet. Do the same for the default looks of vBulletin, punbb, Invision and.. heck, I dunno, UBB? Then throw in the ones that can’t be changed, like ezboard and proboards.. And you’ve probably covered 50% or more of all forums, and I bet those “forum definitions” for stuff like subSilver would work with a lot of custom themes as they are almost always heavily based on the default. So even if we take it as “50% can feasibly work”, it’s a useful app already, and all that needs to be accomplished after that is a training mode and\or definition file sharing community, both concepts which are being discussed in the forum for the app.

    Regardless of “rarely listen to the user” philosophy, I think your idea that people should start writing protocols and microformats or whatever for forums clearly shows a programmer-oriented perspective that is apathetic to what users like me really want from this application. I don’t care to use an app that works with the five hyper-cool Web 2.0 sites that bothered to implement a weird new protocol. I want something that works with the web as a whole, and I think that’s a lot more feasible than you’re indicating.

  • Jens Alfke Says:

    Maciej: My guess is that adding semantic markup to the HTML page (like hAtom) is about the same difficulty as adding an RSS/Atom feed. But it’s less flexible, since it is constrained to show only the same set of entries that are on the page, and the content only in its post-themed state.

    SuitCase: You raise some good points, although I am still not convinced it will work well enough. But perhaps some others have a looser definition of “well enough”.

    By “apathetic” I think you mean “antithetical”? I don’t think we’re really at odds in what we want. I’m just looking at it from what I think is a broader perspective: I would rather not have a ton of work go into something that is at heart a kludge.

    Remember, today’s “hyper-cool Web _.0 … weird new protocol” often becomes tomorrow’s everyday reality. RSS used to be that weird new protocol. So did CSS, Flash, JavaScript, tables, inline JPEGs. I’m sure when Mosaic came out the Gopher/Archie folks said “forget this foo-foo ivory-tower ‘hypertext’ stuff, we just want something that works with the FTP sites we already have.”

  • Step Says:

    Hmm, my links got eaten in my last comment, and I don’t see a way to edit.

    Either way, I see from your last response that you’re approaching this from a different perspective. I can appreciate that, and I agree that I’d like to see forum software change to include an appropriate standard. I’m sure it could happen, too, just not sure what would push it forward. You say “today” and “tomorrow”, but in reality it took a couple years to catch on. Forum software could take just as long, maybe it will transition faster, maybe slower. I’ve never created and moderated my own forum, so I don’t really know.

    I hope my comments have been helpful. I’m mainly concerned that it seems your article unfairly attacks two ideas (the whole MDA competition, and Hijack) based on their weaknesses, without seeing if those weaknesses were acknowledged or addressed in any way. In other words, I’d hope for more balance. But then again, you’re coming from quite a different perspective, and I’m not a regular reader of your blog, so perhaps I’m missing something vital to the discussion here.

  • Eduo Says:

    I keep scanning mydreamapp.com periodically and keep voting for the seemingly least-popular projects. There were some nifty ideas among all the crap sent and, to me, the filtering hasn’t been that good but, then again, that speaks about the voters more than anything.

    I was about to submit an idea to it back when it started, but after seeing what people were raving about and which pitches were the most successful, I realised mine wasn’t as flashy and was, indeed, too techy for most of the reviewers.

    Several other people did try to do this, and seemingly failed because their mockups weren’t pretty (there isn’t any project right now in the semifinals that isn’t shiny and/or animated). In the end I didn’t even see if anyone had proposed anything like I had thought (some did propose other ideas I had thought of, and were summarily buried at the bottom of the lists).

    *sigh*, maybe next time.

  • shadownight Says:

    Dear blogger and commenters,

    Jason Harris, Developer at My Dream App, has repeatedly stated that Hijack is feasible. He suggests doing it not through HTML, but through DOM scraping. For more info, you can check out the Official Feasibility thread in the MDA forums and other threads in the Hijack section: http://mydreamapp.com/forums/viewtopic.php?id=1263

  • fluffy Says:

    Yeah, except most forum software doesn’t put out compliant XHTML, so you need to use a smile-and-nod HTML parser which will convert to DOM, or a prefilter which de-moronizes the HTML.

  • Eduo Says:

    Shadownight: That may have been stated, but DOM scraping is really not different from HTML scraping unless you’re working on a standard (and then, it isn’t either, just more efficient).

    It’s not possible to build a “universal scraper” of any sort when there is no standard on what’ll be scraped.

    Unless every forum in the world decides to go the route of tagging specific parts of the forum with IDs or with classes this won’t be feasible without hordes of people creating “plug-ins” for each different forum (bear in mind that even generic forum software like phpbb, yabb and bbpress allow for theme customisation, where the theme can be pretty much whatever you want).

    I’d dare to say that probably bbpress would be the best candidate for an app of this sort, as it can provide RSS feeds of its forums and posts, and posting can be done through the methods defined in the RSS. But that means the best candidate is the newcomer in the arena. Existing major forum apps would need to change their releases to add a functionality like this and even then older forums that won’t be updated won’t benefit from it.

    The route to do this is not to make an app that does scraping, it’s to propose a standard for it and lobby this standard to the major forum programmers (and to bbpress, I insist. Wordpress’ weight is too much to ignore and integration with WP itself makes bbpress a probable quick riser into the top five). Along with the protocol an app can be designed that takes advantage of it, but the protocol needs to exist so other platforms and other forum bases can adopt it as well.

    That’s how RSS started, by the way.

  • SuitCase Says:

    Jens - Hey, thanks for the response. Reading myself again I hope I didn’t come off too hostile, but I’m passionate about this idea and don’t want it killed :P

    Grammatically “apathetic” looks a bit awkward there but I still meant it, though by using the word I was exaggerating your position in an attempt to show that for this app your “programmer’s logic” didn’t fit. While the nicest way to do this would be through standards, this is not in the interest of a user who wants to monitor their favourite forums _now_, and not just a scant few that are modeled on some sort of new protocol (which, as a commenter above mentioned, is kind of already done with Usenet.)

    As for the idea that Hijack is a kludge, I dunno. A lot of apps I use feel like they’re messed beyond recognition and could be nicer in an ideal world, but they cope with the problems and I make use of them very effectively - my web browser and my mail client, for one. I really don’t see a future where web forums become abandoned for rich clients, I see them as a minority thing for the quirky power users on quirky platforms like the Mac, and thus I think it’s acceptable that they require some maintenance and not be as perfect and simple as they could.

    It’s funny that I’m arguing that for a “My Dream App” competition, but I honestly think it’s far less feasible for a client based on standard forum-provided data to be a good product than one that spends a lot of time trudging through messy forum templates. There’s not enough demand for a forum aggregator for a standard to be adopted, basically, and so I don’t think the examples you cited are directly comparable as that kind of ground-up approach would result in a client for a specification nobody cares about and never will.

  • Jens Alfke Says:

    SuitCase — it doesn’t take rewriting the forums, and it won’t require writing all-new protocols. Most of the job can be done using Atom syndication (which many forums already support, although buggily) with the special secret-sauce of the threading extensions.

    These happen to be implemented for WordPress already, as a plugin.

    That covers a lot of the subscription side of what Hijack would need. There’s more to do; I’m going to write a post describing this in more detail.