96 Characters Ought To Be Enough For Anyone
Famous Hacker Paul Graham on his new LISP dialect, Arc:
“Arc only supports Ascii. MzScheme, which the current version of Arc compiles to, has some more advanced plan for dealing with characters. But it would probably have taken me a couple days to figure out how to interact with it, and I don’t want to spend even one day dealing with character sets. Character sets are a black hole. I realize that supporting only Ascii is uninternational to a point that’s almost offensive [...] But the kind of people who would be offended by that wouldn’t like Arc anyway.”
That last bit [emphasis mine] sort of flummoxed me. Is he saying that LISP only appeals to native English speakers?[1] Or that no one in their right mind would use LISP to write software for end-users?[2] Or maybe that internationalization is just some sort of abstract feel-good political-correctness issue, since none of those third-worlders even have computers anyway?[3]
He makes similarly eye-opening assertions about HTML, too. Arc has HTML-generating libraries, but they “just do everything with tables” instead of CSS. Why? Because apparently CSS-based Web designs are less agile than ones made out of tables. Somehow I don’t think most people who’ve done web design both ways would agree—those old-school layouts made with infinitely-nested tables were about as agile as a house of cards.
Anyway. Normally I’m a big slut for new programming languages, but it would probably take me a couple days to figure out Arc, and I don’t want to spend even one day on Yet Another LISP Variant that won’t even let me write code I could use in the real world.
Wait—let me not end on such a snarky note. Since I’m such a big smarty, what would I have done differently? I would simply have made the language’s “character” data type 16 bits wide instead of 8, and provided four trivial library routines to convert such strings to and from UTF-8 and CP-1252 encodings for I/O purposes. That’s about an hour’s work, and all you need for really basic Unicode support; once you have that, you can add further Unicode niceties (and there are admittedly a zillion of ‘em) a few at a time without completely breaking old code.
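To make the "four trivial library routines" concrete, here is a minimal sketch in Python rather than Arc. It assumes a string is simply a list of 16-bit code units; the function names are hypothetical and are not part of Arc or MzScheme.

```python
# Hypothetical sketch: a "string" is a list of 16-bit code units (ints 0..0xFFFF).
# None of these names come from Arc or MzScheme; they are for illustration only.

def _units_from_text(text):
    """Split Python text into 16-bit code units via big-endian UTF-16."""
    raw = text.encode("utf-16-be")
    return [int.from_bytes(raw[i:i + 2], "big") for i in range(0, len(raw), 2)]

def _text_from_units(units):
    """Join 16-bit code units back into Python text."""
    raw = b"".join(u.to_bytes(2, "big") for u in units)
    return raw.decode("utf-16-be")

def from_utf8(data):
    return _units_from_text(data.decode("utf-8"))

def to_utf8(units):
    return _text_from_units(units).encode("utf-8")

def from_cp1252(data):
    return _units_from_text(data.decode("cp1252"))

def to_cp1252(units):
    # Raises UnicodeEncodeError for characters CP-1252 cannot represent.
    return _text_from_units(units).encode("cp1252")
```

For example, `from_cp1252(b"\x97")` gives `[0x2014]` (an em dash), and `to_utf8([0x2014])` gives the three bytes `b"\xe2\x80\x94"`.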
[1] And only those native English speakers who don’t care about foo-foo details like “curly quotes”—or emdashes—or other Arcane Symbols™. Remember, to the true old-school hacker, even lowercase letters are an inessential frill.
[2] Like, say, Reddit.com. From SecretGeek’s awesome overview of LISP:
“Reddit is proof that lisp is really powerful. Paul Graham originally wrote reddit, in lisp, on the back of a napkin while he was waiting for a coffee. it was so powerful that it had to be rewritten in python just so that ordinary computers could understand it. Because it was written in lisp it was almost no effort to rewrite the entire thing, and the rewrite was completed in-between two processor cycles.”
[3] I’m reminded of a meeting between Apple and Sun (JavaSoft) back in 1996. I was there to discuss OpenDoc and JavaBeans, but each company also had text and I18N engineers there, who were talking about Unicode and text-layout technology for the upcoming Java2D graphics engine. There was an exchange that went something like this:
Apple engineer: …and the layout needs to take into account ligatures and contextual forms, where adjacent letters change glyphs depending on neighboring characters, or even merge into a single glyph.
Sun engineer: C’mon, is this important? How many people need advanced typographic features like that, anyway?
Apple engineer: [after a pause] Well, there are over 900 million of them in India alone, and another 200 million or so in the Arabic world.
January 30th, 2008 at 11:31 AM
Wait… 16-bit? Why would you limit your programming language to the BMP? Treating Unicode as an array of 16-bit values is a bug, not a feature.
January 30th, 2008 at 11:49 AM
[…] Update: Jens Alfke is “flummoxed” by Graham’s comments about character sets. […]
January 30th, 2008 at 11:49 AM
Why would you only use 16-bit characters, rather than just making the default string encoding UTF8? That would be a more compact representation for the common case of Latin characters, and still let you represent all of Unicode.
January 30th, 2008 at 12:11 PM
Please don’t confuse Lisp with Arc. Common Lisp and Scheme users have very ideas about what a good programming language is. In fact, the implementation that Arc uses — MzScheme, a Scheme implementation — already supports utf-8 (why he didn’t just use that as a basis for utf-8 support eludes me).
Arc’s idea of good design is _not_ representative of any more established Lisp variant. (Arc is a very disappointing take on what was claimed to become a ‘100-year language’. But hey, it only took 6 years of design (see also http://xach.livejournal.com/156104.html).)
January 30th, 2008 at 12:12 PM
That should read ‘have very different ideas’, not ‘have very ideas’.
That’s what I get for not proofreading.
January 30th, 2008 at 1:02 PM
RE: your comment about i18n at JavaSoft in 1996.
By 1998 and 1999, JavaSoft had to spend a tremendous amount of time and effort making Java2D play nice in the i18n world and handle all those “advanced typographic features” of Unicode.
And don’t get me started on how they initially forgot to create a printing API for Java2D. 1998 was a very busy year.
January 30th, 2008 at 1:07 PM
It should be obvious that unicode isn’t that required: your 16-bit broken unicode is just fine- but does it really offer any advantages over 8bit? Does 32-bit broken unicode offer enough advantages over your 16bit to warrant using that?
Making strings of bytes instead of strings of characters is a good thing- and good enough for any web app anyone would ever want to write (think: &#NNNN; is a perfectly good encoding for unicode on the web)- and a lot more portable.
On the other hand, building characters into the language causes no end of problems for languages like Python which generate coding errors unexpectedly, and CL which has no standard for external formats. Punting the issue makes it easier to get your program working faster, and I can’t possibly consider why that would be a bad thing… Especially considering that nobody complaining about this can offer a single reason why having bignum characters for all strings is a good thing.
January 30th, 2008 at 1:22 PM
I think what Paul Graham meant to say is, “There is so much to do in designing a language, and things like character sets that require a lot of detail and planning are going to have to wait. The kind of people who don’t understand that I am going to focus and develop how the language feels before I worry about that kind of stuff probably won’t like Arc.”
January 30th, 2008 at 1:36 PM
A lot of what I’ve read about Arc since Graham published his tutorial has been worrying. Why would I even bother looking at Arc instead of Scheme? It’s not even as interesting a language as either classical prefix Dylan or even infix Dylan — and the classical prefix Dylan was essentially just Scheme re-imagined with a CLOS-style object system underneath (and exposed to the developer).
January 30th, 2008 at 1:45 PM
I don’t know what’s worse, PG not understanding unicode, or all the people whining about it not understanding it either. Using 16 bit characters doesn’t solve anything, as it still won’t hold tons of code points that are above 16 bit, so you still need to use a variable length encoding. At which point you might as well just use utf-8.
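A small Python sketch of the point about code points above 16 bits. The helper name `utf16_units` is made up for illustration. It shows that a character outside the Basic Multilingual Plane takes two 16-bit units (a surrogate pair), so a 16-bit representation is still variable-length.

```python
def utf16_units(text):
    """Return the 16-bit code units of text (hypothetical helper)."""
    raw = text.encode("utf-16-be")
    return [int.from_bytes(raw[i:i + 2], "big") for i in range(0, len(raw), 2)]

bmp_char = "\u00e9"         # é, U+00E9, inside the BMP
astral_char = "\U0001D11E"  # musical G clef, U+1D11E, outside the BMP

print(utf16_units(bmp_char))            # one unit
print(utf16_units(astral_char))         # two units: a surrogate pair
print(len(astral_char.encode("utf-8")))  # UTF-8 needs four bytes for it
```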
January 30th, 2008 at 2:48 PM
Joe: I think people mean lots of different things by “using unicode”.
Some are talking about string-representation, and string-access time. Others are talking about automatic io-coding. Both of these things have further questions: what to do for invalid codings, or how to handle un-normalized code points.
But even before you get into what you think the answers to these questions are, start with what exactly you need the language to help with. What part of unicode is least pleasant to deal with directly- so much so that language-support could help? Then how would you keep your new feature from pissing off the people who that doesn’t bother?
I don’t mean to suggest there aren’t answers: just that they aren’t obvious; this isn’t a “simple” thing by a long shot.
January 30th, 2008 at 2:58 PM
This was my sentiment exactly, after reading Paul’s announcement.
Also, his argument seems a little backwards. It goes like this:
1. Python was not designed with the right character representation in the beginning, and it took a year to change that.
2. I would not want to do that, so I will use a character representation that is unlikely to satisfy most of the world’s population and is pretty much obsolete, so that I will not have the problem.
Vadim
January 30th, 2008 at 7:40 PM
my snarky note is that he is right.. ascii was good enough for a long time and in the length of time you have to work on a language you are much better off spending your time implementing something that matters. Later you can go back and fix character issues. Getting the language right is more important in the scheme of things.
as for HTML… CSS sucks. period. If you think otherwise you have not had to write it and account for all the variations of browsers. CSS was meant to be written by tools not humans… that said CSS sadly is becoming the black magic of choice for some reason…
January 31st, 2008 at 3:03 PM
#!/usr/bin/env newlisp
(constant (global '☼) MAIN)
(context '☺)
(define (☻ ✄ ☁ ⍾)
  (print ✄ ☁ ⍾))
(define (‽)
  (println {‽}))
(context ☼)
(set '℥ "what " 'ᴥ "the " 'ᴒ "dickens")
(☺:☻ ℥ ᴥ ᴒ)
(☺:‽)
(exit)
January 31st, 2008 at 10:04 PM
[…] 96 Characters Ought To Be Enough For Anyone: Jens’ Unicode retort […]
February 1st, 2008 at 1:53 AM
[…] Jens Alfke’s latest blog post rambles about a couple of things but finishes on something that I really empathised with: Apple engineer: …and the layout needs to take into account ligatures and contextual forms, where adjacent letters change glyphs depending on neighboring characters, or even merge into a single glyph. […]
February 1st, 2008 at 8:01 AM
Real Unicode support is pretty hard if your language has a “character” object. One nice bit of fun is that Unicode can represent the same character in more than one way, which matters once it actually needs to be exchanged with another implementation (which could be an API, a network protocol, or an application data format). First, you need to choose an encoding scheme, like UTF-7, UTF-8, or UTF-16, and then, within some of those schemes, there are multiple choices, like: is “á” one code point (as in ISOLatin1, which is also the first block of Unicode), or is it a + accent, or accent + a? Look at this if you dare. So any language implementation is going to have to deal with these issues once it promises “Unicode support.” Also, it looks like there is help if you’re willing to use IBM’s ICU.
I’d probably choose a canonical representation of 32 bits for a Unicode character, but perhaps that’s hopeless from a performance point of view.
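The “á” question can be illustrated with Python's standard `unicodedata` module. This is only a sketch of the precomposed versus base-plus-combining-accent case, using the standard NFC and NFD normalization forms; it does not cover everything a real implementation has to handle.

```python
import unicodedata

precomposed = "\u00e1"   # á as a single code point (U+00E1)
decomposed = "a\u0301"   # a followed by COMBINING ACUTE ACCENT (U+0301)

# The two strings render the same but compare unequal code point by code point.
print(precomposed == decomposed)

# Normalizing both to the same form makes them comparable.
print(unicodedata.normalize("NFC", decomposed) == precomposed)
print(unicodedata.normalize("NFD", precomposed) == decomposed)

# The byte lengths on the wire differ too.
print(len(precomposed.encode("utf-8")), len(decomposed.encode("utf-8")))
```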
February 1st, 2008 at 3:29 PM
Is there some reason not to just make characters 16 bits wide, specify that they are in fact UTF-16 (which is indistinguishable from UCS-2 in all the common cases), and provide library support for the ugliness that occurs if people actually use something outside the basic multilingual plane (which means China and Japan, probably)?
Unicode normalization is a real pain, but that usually only matters when doing I/O and is a library issue.
February 1st, 2008 at 3:42 PM
@Andrew Thompson — You took the words out of my mouth. The mainstream Unicode APIs I’m aware of (on Mac OS X, Windows, Java) basically do this. 16-bit Unicode isn’t perfect, but people seem to have settled on it as the sweet spot where you get almost all the necessary functionality without so much pain. (There are APIs in CoreFoundation to deal with the edge cases like normalization, but I’ve never had to use them, fortunately.)
On the other hand, 8-bit encodings are troublesome. There actually is a reason why Python and Ruby are putting in all the effort to get off of 8-bit strings. UTF-8 isn’t generally too horrible, but there are too many not-uncommon text manipulations where you operate on characters and character positions, and then it becomes a pain in the ass. Worst of all, if you’re an English-speaking developer, you don’t hit most of those edge cases very soon, unless you’re very careful about I18N testing. Then your app barfs in weird ways when you get users or installations in the rest of the world.
(And using HTML-style escaping as an encoding for working with 8-bit strings in memory, as “Anonymous” suggested, is just nuts. HTML-escaped text is nasty to deal with — even simple stuff like string comparisons becomes difficult and expensive, because there are so many ways to represent any one string.)
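As a rough illustration of why comparing HTML-escaped strings is awkward, this Python sketch uses the standard library's `html.unescape`. The list of variants is just an example set of spellings for one character.

```python
import html

# Five different in-memory spellings of the same one-character string.
variants = ["á", "&aacute;", "&#225;", "&#xE1;", "&#xe1;"]

# Compared as raw escaped text, they are all different strings.
print(len(set(variants)))

# Only after unescaping every one do they turn out to be equal.
print({html.unescape(v) for v in variants})
```

Every comparison therefore needs an unescaping pass first, which is the extra cost the comment above refers to.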