Dream apps and the perils of screen-scraping
October 13th, 2006 by jens

There’s an interesting online competition going on called My Dream App. The idea is that a bunch of people pitch their ideas for a Mac application, and the set of ideas gets winnowed down in several rounds of public voting, until one is left. A group of experienced developers have promised to implement the winning idea as a shareware app, whatever it turns out to be.

It’s a fun concept, but it highlights some of the problems of having end users design software. A number of the proposals give me a particular sinking feeling I associate with user-interface design meetings: lots of ideas that sound super-cool as one-sentence pitches, accompanied by irresistibly glitzy faked-up screen shots (all replete with translucency, rounded corners, and this year’s de rigueur reflections). But too often there’s no “there” there. It’s all so vague that I can sense that these people haven’t thought through the difficult bits or worked out a coherent idea of what the app will do.

It’s kind of like being a writer and having someone come up to you at a party and say “I have this brilliant idea for a novel…” followed by a rambling series of plot twists, closing with “…and I’ve done the hard part, now you just have to write it down!”

Case In Point

The idea that frustrates me the most is called “Hijack”. (I’d link to it, but the site is so thoroughly Ajax-y that I can’t find a URL for this page. You’ll just have to start at the home page and click on “Hijack”. Oh well; permalinks are so 2004 anyway.) This is to be a super-duper native GUI interface to web forums. Like an email app or RSS reader, it will show all the posts of every forum you subscribe to in a consistent interface, tracking new unread posts and threads. Great stuff! What server does this work with? All of them!

Yup. In an amazing feat of hand-waving, this app is going to Just Work with every BBS program under the sun, through the miracle of screen-scraping. An army of volunteers are going to code and maintain scripts that will strain through the murky tag soup and extract every bit of information you need. Even though every single installation of every single brand of web forum is different, and they get redesigned all the time, and in many cases some sort of AI would be required to figure out how to reassemble the semantics from the stuff on the screen.
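To make the fragility concrete, here is a minimal sketch of the kind of script those volunteers would be maintaining. The markup, the `postbody` class name and the `PostScraper` helper are all hypothetical, invented for illustration; no real forum package is being described. The scraper works on one layout and silently returns nothing after a redesign that keeps exactly the same posts.

```python
from html.parser import HTMLParser


class PostScraper(HTMLParser):
    """Collects the text of every <td class="postbody"> cell.

    The tag and class name are assumptions about one imaginary forum layout.
    """

    def __init__(self):
        super().__init__()
        self.in_post = False
        self.posts = []

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) pairs
        if tag == "td" and ("class", "postbody") in attrs:
            self.in_post = True
            self.posts.append("")

    def handle_endtag(self, tag):
        if tag == "td":
            self.in_post = False

    def handle_data(self, data):
        if self.in_post:
            self.posts[-1] += data


def scrape_posts(html):
    scraper = PostScraper()
    scraper.feed(html)
    scraper.close()
    return [post.strip() for post in scraper.posts]


# The layout the script was written against.
OLD_LAYOUT = (
    '<table><tr><td class="postbody">First post!</td></tr>'
    '<tr><td class="postbody">Me too.</td></tr></table>'
)
# The same two posts after a hypothetical redesign.
NEW_LAYOUT = (
    '<div class="message"><p>First post!</p></div>'
    '<div class="message"><p>Me too.</p></div>'
)

print(scrape_posts(OLD_LAYOUT))  # ['First post!', 'Me too.']
print(scrape_posts(NEW_LAYOUT))  # []
```

Note that the second call does not raise an error. It just finds no posts, so from the user's side the forum simply appears to have gone quiet.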

(I’m sure that by now my colleague Jessica is shivering uncontrollably, if she’s still reading. Jessica did a lot of work on Sherlock, including continued maintenance of the channels. Many of the channels work this way, by scraping the HTML of various web sites. They are frustrating to maintain because the sites keep changing their design, which breaks the scripts that run the associated channel. Few people know that the channels are dynamically updated, with new versions downloaded when necessary whenever you launch Sherlock. It has to be that way, because otherwise Sherlock itself would have to be Software-Updated many times a year. Now, consider that Apple only ships/supports a handful of channels, and that these are vastly simpler than a web content-management system. The task of doing this kind of development and maintenance for every damn BBS is just mind-boggling.)

Thinking Inside The Box

But what bothers me the most is not just that it’s infeasible. It’s that it’s a horrendous kludge. It’s thinking completely inside the box. “I read so many of these forums, and they’re ugly and I can’t track them. I know! I’ll make an app that will read the forums for me!” It reminds me of Gopher and Archie, which were now-quaint attempts to spackle over the old ARPAnet by putting a textual menu system and a searchable index on top of the maze of FTP servers. They were kind of neat, but as soon as NCSA Mosaic appeared they were history.

Don’t get me wrong, I absolutely love the idea of an elegant, unified user interface for distributed many-to-many discussions. I’ve been obsessed by it for years, in fact. But the real way to do this is to build protocols and servers that make it work, and then robust and reliable clients can be plugged into them. The need for this has been obvious to me since about 1997. The world keeps getting closer and closer to it, in incremental steps starting with RSS.

I think that we actually have most of the necessary pieces now — the Atom format and publication protocol, the Atom threading extensions, and decentralized identity systems like OpenID. All it takes now (he said with slight sarcasm) is getting the BBSs to adopt some of these standards in a coherent and compatible way, so that news-readers can follow and track all of the threads, posts and identities involved.
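As a rough sketch of what a client gets from these pieces, the snippet below rebuilds a discussion tree from an Atom feed whose entries carry the threading extension's `in-reply-to` element. The feed content, the `tag:forum.example` identifiers and the helper names are made up for illustration, and a real reader would also have to deal with authentication, paging and entries arriving out of order.

```python
import xml.etree.ElementTree as ET
from collections import defaultdict

ATOM = "{http://www.w3.org/2005/Atom}"
THR = "{http://purl.org/syndication/thread/1.0}"

# A made-up feed for one forum thread.
FEED = """<feed xmlns="http://www.w3.org/2005/Atom"
      xmlns:thr="http://purl.org/syndication/thread/1.0">
  <title>Example forum thread</title>
  <entry>
    <id>tag:forum.example,2006:post-1</id>
    <title>Original question</title>
  </entry>
  <entry>
    <id>tag:forum.example,2006:post-2</id>
    <title>A reply</title>
    <thr:in-reply-to ref="tag:forum.example,2006:post-1"/>
  </entry>
  <entry>
    <id>tag:forum.example,2006:post-3</id>
    <title>A reply to the reply</title>
    <thr:in-reply-to ref="tag:forum.example,2006:post-2"/>
  </entry>
</feed>"""


def build_thread(feed_xml):
    """Return (titles by entry id, child ids grouped by parent id)."""
    root = ET.fromstring(feed_xml)
    titles = {}
    children = defaultdict(list)
    for entry in root.findall(ATOM + "entry"):
        entry_id = entry.findtext(ATOM + "id")
        titles[entry_id] = entry.findtext(ATOM + "title")
        parent = entry.find(THR + "in-reply-to")
        # Entries with no in-reply-to are top-level posts (parent None).
        parent_id = parent.get("ref") if parent is not None else None
        children[parent_id].append(entry_id)
    return titles, children


def outline(titles, children, parent_id=None, depth=0):
    """Flatten the tree into indented lines, depth first."""
    lines = []
    for entry_id in children.get(parent_id, []):
        lines.append("  " * depth + titles[entry_id])
        lines.extend(outline(titles, children, entry_id, depth + 1))
    return lines


titles, children = build_thread(FEED)
print("\n".join(outline(titles, children)))
```

Nothing in this code knows what the forum's pages look like, which is the point: the server can be restyled at will without breaking the reader.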


29 Responses  
  • fluffy writes:
    October 14th, 2006 at 1:18 AM

    I totally agree. I’ve actually written about this not too long ago on my weblog, and have been thinking about this a lot recently. People could have an intelligent RSS agent which checks per-page RSS feeds for every page they go to (with some sort of mechanism to make it only check for updates at the rate at which updates actually occur), and then to extend that with people being able to publish and subscribe to each others’ clickstreams as a sort of emergent recommendation network (FOAF and OpenID would be good underlying transports for clickstream discovery).

    Several weblogs have started furnishing per-post comment feeds, which is a great way to actually syndicate comment updates without having to actually register for every single site or keep on overloading the poor, abused email infrastructure, but working with that is pretty unwieldy if you have to manually manage the feeds in a reader, and if there’s a forum which you watch which has a lot of short-lived topics, having a hundred users’ feed readers subscribed to a thousand threads and checking for updates every hour would be something of a problem.

    All of the current forum software out there is so ad-hoc and hacky, though, and getting people to upgrade their fora will be impossible. Personally I think the whole forum concept is a bit hacky to begin with, though. Distributed identity/authentication systems with smart filters based on a trust network, with comments dual-published via HTML and RSS/Atom/whatever, would be ideal. I’ve always thought of the Internet itself as one big community; why set up oodles of walled gardens which duplicate effort? Keep things lightweight and trim. Weblog comment systems can be made so much better, and conversations should be able to happen anywhere, not in The Forum For That Site That People Visit.

    The Internet is the medium, not phpBB or what have you.

  • Jonathan Deutsch writes:
    October 14th, 2006 at 10:51 AM

    Have you seen dappit? While I doubt it could solve all scraping woes, it looks like a pretty neat approach, letting users create scrapers by pointing and clicking.

  • fluffy writes:
    October 14th, 2006 at 11:26 AM

    If you just want to support a scraping model for update notification, there’s always stuff like RSSpect which just does a diff on the current and previous version of a page and adds a feed item if it’s different enough (and that mode of operation is also something which would make sense in the context of my clickstream ramble).

  • Jens Alfke writes:
    October 14th, 2006 at 12:07 PM

    Re: RSSpect — the diff-the-page idea goes back to IE 4 (at least the Mac version) which had this as a built-in feature. It never worked well, because most popular sites include content that changes on every page view, whether it’s rotating ads or a hit-counter.

    Their more advanced model where you put SPAN tags in your content to indicate posts is precisely the same idea as the hAtom microformat (although the word “microformat” isn’t mentioned on that page.) It’s a reasonable idea, although it seems like nearly as much work as creating a real feed, only less efficient (since clients have to pull down all the non-semantic markup every time, too.)

    Also, the notion of commenting out the tags to make them not visible on the page is horrible — it means you can’t process the page using a real *ML parser, and implies they are simply grepping the HTML with regexps, which is even more fragile than DOM-based scraping.

  • fluffy writes:
    October 14th, 2006 at 12:21 PM

    Well, yeah, I was referring to RSSpect’s “anysite” function, though it has various thresholds for the amount of a diff to consider an update. Simply changing links on an image won’t trigger it unless you set it to the most sensitive.

    I haven’t looked into the other stuff, I was only looking into anysite because I wanted to monitor content by a person who is, like, religiously opposed to RSS for some reason. The other methods are pretty silly because you’d have to have a pretty braindead CMS to not allow you to just provide an RSS view in addition to the normal HTML view.

  • leeg writes:
    October 15th, 2006 at 10:24 AM

    Don’t get me wrong, I absolutely love the idea of an elegant, unified user interface for distributed many-to-many discussions. I’ve been obsessed by it for years, in fact. But the real way to do this is to build protocols and servers that make it work, and then robust and reliable clients can be plugged into them

    Actually, I’m pretty sure Usenet already exists. ;-)

  • Matt Tavares writes:
    October 15th, 2006 at 11:01 AM

    What I hate most about My Dream App is the fact that all the ideas downright suck or just cannot be implemented. I think there is one that actually puts a fake window on your computer to sync with the weather outside.

    If you can’t fucking get up to check a window, then you have problems.

  • fluffy writes:
    October 15th, 2006 at 11:12 AM

    Yeah, and then there’s all those apps which have slick professional sites which make it seem as if the app already exists, and they’re all like “hurr blurf duh we’ll do everything else too and maybe more and totally revolutionize the world.”

    The ONLY app I was even remotely interested in was Whistler and it still seemed pretty darn infeasible. Pitch detection and correction are actually difficult, especially when you’re dealing with people blowing on a $10 mic.

  • e:leaf writes:
    October 15th, 2006 at 11:25 AM

    The nice part about My Dream App is that it removes the ability to create programs from the top floor of “The Ivory Tower.” Perhaps that’s one reason why some programmers are against the idea. Non-programmers now have an outlet to get their ideas into practice, and some of those ideas may be damn good. If the programmers want to deal with feasibility issues, then let them do it.

    I know that there have been a few good ideas that I have come up with over the years that I simply couldn’t implement because I don’t program. Why should application design and concept be left to coders alone?

  • julian writes:
    October 15th, 2006 at 11:45 AM

    e:leaf, as much as “The Ivory Tower” is something you don’t want users to feel they are up against, the concept Jens references isn’t just some notion programmers hold.

    In fact, in the field of Human-Computer Interaction (usability), we are taught over and over again that information about what to build comes from studying users, their environment, and their responses to questions… but NOT from directly asking them what they want.

    The central problem is that users think they know what they want, but they do not actually know what they want in terms of good solutions to the problem.

    As much as people may believe that the programmers and designers are all ignoring the users, historically the programmers are the people who just go out there and implement anything the user asks for.

    This problem is reaching the point where it is so well-understood that it is now dealt with even in the Software Engineer’s world. A Google search these days turns up results from Agile Software Development sites as well as usability sites.

    Agile Software Development and Human-Computer Interaction programs alike teach that the solution to the problem you describe is more user involvement in the process of developing software—but not directly implementing what the user tells you they want, rather doing the harder work and figuring out what kind of solution would really solve their need.

  • Colin writes:
    October 15th, 2006 at 2:10 PM
  • Jens Alfke writes:
    October 15th, 2006 at 2:34 PM

    lee, Colin: Yup, I was a big Usenet user back in the mid-80s (I remember the transition to the hierarchical newsgroup system, i.e. when net.music became rec.music.misc) and to a lesser extent in the early ’90s (I remember when the flood of people announcing Kurt Cobain’s death provided graphic evidence of how the lag of message propagation could make communication impossible.)

    Usenet had some nice properties, but sucked at least as badly as today’s BBS sites. Some of my early designs for The Perfect Social Software were based on extensions of NNTP, but I soon gave up on that as Usenet sank from view beneath the crushing weight of spam.

  • Danny Cohen writes:
    October 15th, 2006 at 3:27 PM

    There is a demo that the Hijack-conceiver came up with that shows that DOM-scraping could be as easy as clicking on elements once on a forum, and then it remembers what to look for on every page.

    Stop being so negative, they are “dreams” after all. You have problems with dreams?

  • Joshua writes:
    October 15th, 2006 at 3:33 PM

    Thanks for saying what many of us have been thinking all along. You have to remember too that half the people at MyDreamApp are 13-year-olds from MacThemesForums who only care about two things, Forums and Pretty GUIs.

  • Jens Alfke writes:
    October 15th, 2006 at 3:41 PM

    Danny: I fail to see how a mocked-up demo can show the feasibility of implementing something.

    These aren’t “dreams”, they are entries in a competition to actually build a real app. Rather, to have someone else who can program build a real app. If the concept is going to drive that programmer insane, and if people who know nothing about programming keep insisting that they (somehow) know it’s “as easy as clicking”, I think that’s a valid thing for me to be negative about.

  • Jon Aizen writes:
    October 15th, 2006 at 3:53 PM

    Dapper could definitely be used to build a meta-forum application.

    Yup. In an amazing feat of hand-waving, this app is going to Just Work with every BBS program under the sun, through the miracle of screen-scraping. An army of volunteers are going to code and maintain scripts that will strain through the murky tag soup and extract every bit of information you need.

    Dapper does a great job of solving this difficulty. Using a GUI, you can construct an API for any website without writing a line of code (nor a single regular expression to parse that nasty tag soup).

  • pwb writes:
    October 15th, 2006 at 4:49 PM

    What a joke! You rag on screen scraping and then propose that the “right” way to do it is for everyone to get together and agree on some common format and then throw in some OpenID single sign on, blah, blah, blah. Are you freaking kidding me?

  • Jens Alfke writes:
    October 15th, 2006 at 5:54 PM

    Jon: Dapper is interesting, and might go some way toward making this workable (though the devil is in the details and I’m not sure enough of the semantics can be extracted reliably.) I do maintain that this entire approach is undeniably a kludge, though, the web equivalent of OCR.

    pwb: No, I am not fucking kidding. What’s your point? Plenty of advances have come about from people getting together to develop and standardize new protocols, and conversely, plenty of kludged-together systems have collapsed under the weight of their own messes.

  • Scott Stevenson writes:
    October 15th, 2006 at 7:46 PM

    Where’s the fun in working on something that’s feasible?

  • Maciej Stachowiak writes:
    October 16th, 2006 at 2:35 AM

    How about getting forum software to add hAtom markup? Then you can unambiguously pull content from the actual web pages without screen scraping hacks, and the software does not have to be updated to provide the data via alternate formats or protocols.

    I’ll add though that I think “screen scraping” on the web (more accurately it would be called markup scraping) is fundamentally different than the original meaning — trying to infer information from pixels. You already have the text, you’re just trying to scrape the semantics - and markup is supposed to be about semantics.

    This is why I think microformats are a good approach because they make the markup richer to let you extract the semantics you want in a reliable way, without the need for out-of-band metadata.
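The hAtom suggestion in the last comment can be sketched roughly as follows. The class names `hentry`, `entry-title` and `entry-content` come from the hAtom microformat; everything else (the two page layouts, the helper functions) is invented for illustration. The sketch also assumes the page is well-formed XHTML so that the standard library's XML parser can read it, which real forum pages often are not.

```python
import xml.etree.ElementTree as ET


def has_class(element, name):
    """True if the element's class attribute contains the given token."""
    return name in element.get("class", "").split()


def find_by_class(element, name):
    """First element (the node itself or a descendant) with the class token."""
    for node in element.iter():
        if has_class(node, name):
            return node
    return None


def text_of(node):
    """All text inside the node with markup stripped, or '' if node is None."""
    return "".join(node.itertext()).strip() if node is not None else ""


def extract_entries(xhtml):
    """Pull title and content out of every hentry, ignoring which tags are used."""
    root = ET.fromstring(xhtml)
    entries = []
    for node in root.iter():
        if has_class(node, "hentry"):
            entries.append({
                "title": text_of(find_by_class(node, "entry-title")),
                "content": text_of(find_by_class(node, "entry-content")),
            })
    return entries


# Two hypothetical designs for the same post: an old table layout...
TABLE_LAYOUT = """<table><tr class="hentry">
  <td class="entry-title">Re: scraping</td>
  <td class="entry-content">Markup is <b>supposed</b> to be about semantics.</td>
</tr></table>"""

# ...and a div-based redesign carrying the same hAtom class names.
DIV_LAYOUT = """<div class="forum"><div class="post hentry">
  <h3 class="entry-title">Re: scraping</h3>
  <div class="entry-content"><p>Markup is <b>supposed</b> to be about semantics.</p></div>
</div></div>"""

print(extract_entries(TABLE_LAYOUT) == extract_entries(DIV_LAYOUT))  # True
```

Because the extractor keys on the class names rather than on tags or nesting, a redesign only breaks it if the forum software stops emitting the microformat.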

