E-text formats are a waste of time

And I’m not talking about Amazon’s Apple-stylee first-mover vertically-integrated land grab for which I have equal parts contumely (the software) and covetousness (the hardware).

What I want to stop is the immense amount of effort being wasted, especially in the free software and commons communities, on dreaming up and implementing e-text formats, and on providing free e-texts in those formats.

What’s the problem? First, the texts. The majority of free texts come from the various Projects Gutenberg. (“Projects” plural, indeed: running a separate project under each copyright régime is the simplest way to take advantage of each and avoid legal problems, so the sites have no formal connection with each other, and hence require separate staff and servers in every country.) They are surprisingly thinly staffed, many being one-person operations, and even the mighty US site apparently involving only a handful. The Gutenberg ethos is intentionally decentralised, and so far the US project, at least, has used plain text and lightly marked-up HTML as its canonical formats. This is great for accessibility and long-term archiving, but it’s lousy for providing e-readers with rich and accurate metadata, or even simple things like tables of contents and footnotes.

So we need a proper e-text format, then? A lot of people seem to think so, and there are half-a-dozen supported by various e-reader programs. FBReader, my favourite, supports FB2, ePub, plucker, Mobipocket, Open E-Book, OpenReader and Palmdoc, and its web site lists several more which it doesn’t support. Project Gutenberg has experimental support for generating some of these formats automatically, but the results are rather poor. Meanwhile, sites like Feedbooks lovingly hand-craft metadata, and add nice touches, such as scans of original cover images, but must redo the work each time the original text is updated (mostly when corrections are submitted). But none of these formats is really any better than the structured markup format we all use all the time: HTML. Rather than pour effort into defining and promulgating new formats (though at least most of the recent efforts are XML languages), and then implementing them in readers, why not just agree on some conventions for semantic HTML markup, or even a microformat? No new software would be needed (though for offline reading, we could do with a good HTML 5 reader application until browser authors implement the decent offline reading support that is so obviously missing).
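
To show how little machinery such conventions would need, here is a minimal sketch in Python, using only the standard library's html.parser. The class names (book-title, book-author, chapter-title) are hypothetical, invented purely for illustration; no such convention has been agreed, which is rather the point. Given HTML marked up this way, a reader could recover the metadata and a table of contents in a few dozen lines:

```python
from html.parser import HTMLParser

# Hypothetical class names: any convention would do, so long as it is agreed.
FIELD_CLASSES = {"book-title": "title", "book-author": "author"}
CHAPTER_CLASS = "chapter-title"


class BookParser(HTMLParser):
    """Collect metadata and a table of contents from conventionally marked-up HTML."""

    def __init__(self):
        super().__init__()
        self.metadata = {}   # e.g. {"title": ..., "author": ...}
        self.contents = []   # (anchor id, heading text) pairs, in document order
        self._open = None    # (tag, field name or None, anchor id) while capturing
        self._text = []

    def handle_starttag(self, tag, attrs):
        if self._open is not None:
            return  # already inside a marked element; keep collecting its text
        attr = dict(attrs)
        for cls in (attr.get("class") or "").split():
            if cls in FIELD_CLASSES:
                self._open = (tag, FIELD_CLASSES[cls], None)
            elif cls == CHAPTER_CLASS:
                self._open = (tag, None, attr.get("id"))
            else:
                continue
            self._text = []
            break

    def handle_data(self, data):
        if self._open is not None:
            self._text.append(data)

    def handle_endtag(self, tag):
        # Simplification: assumes a marked element never contains a nested
        # element with the same tag name.
        if self._open is None or tag != self._open[0]:
            return
        _, field, anchor = self._open
        text = " ".join("".join(self._text).split())  # normalise whitespace
        if field is not None:
            self.metadata[field] = text
        else:
            self.contents.append((anchor, text))
        self._open = None


def read_book(html):
    """Return (metadata, contents) for one HTML e-text."""
    parser = BookParser()
    parser.feed(html)
    parser.close()
    return parser.metadata, parser.contents


sample = """<html><body>
<h1 class="book-title">An Example Book</h1>
<p class="book-author">A. N. Author</p>
<h2 class="chapter-title" id="ch1">Chapter 1</h2>
<p>Text of the first chapter.</p>
</body></html>"""

metadata, contents = read_book(sample)
```

A real convention would also need to cover footnotes, languages, editions and so on, but each of those is one more class name, not one more file format.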

Corrections are another major source of waste: surprisingly few readers seem to submit them. I say this purely from personal experience. Gutenberg’s corrections email address, which until recently was quite hard to find, has been linked to a ticketing system since early 2010, and when I last submitted a correction a couple of months ago, my ticket number was in the 400s, which suggests a tiny trickle of corrections. Meanwhile, I’ve found hundreds of errors in texts released years ago by other sites, and my feedback is treated with a degree of gratitude and dispatch that suggests it is rare. Given this state of affairs, you would hope that there would be measures in place to collate corrections from downstream suppliers such as Feedbooks, but no such luck. The sorts of links that are commonly forged in the free software community to share bug fixes between software distributors and authors, and enable users to easily report bugs, seem to be virtually absent in the world of free e-texts.

So here are my gradus ad Parnassum of free e-texts:

  • Forge links between the repositories and their free and commercial redistributors. Push improvements upstream. The Gutenbergs should be demanding this, as it provides an incentive to the commercial redistributors to continue to innovate. (Of course, proper licensing would help; while it may be too late for existing books, the rate at which new books are being added would soon create a useful lever to crack open any reluctant partners.)
  • Help the repositories and redistributors automate their efforts. There are many things that could be automated or automated better: extraction of metadata from non-marked-up texts, automatic application of corrections to downstream marked-up texts, and automatic generation of corrections directly from reading programs and devices, by readers.
  • Get readers to help. Many willing helpers will be unaware that help is sought or even required, and others are put off by not knowing how to help. Making correction an obvious feature of all e-readers might well cause the number of corrections to leap; similar functionality would in any case be a boon for scholarly and recreational noters and doodlers, and encourage both old and new forms of interaction with, and, perhaps most interestingly, through texts.
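
As an illustration of the second step, here is a rough sketch in Python of how one correction, submitted once, might be applied to several editions of the same text. The Correction record and the apply_corrections function are my own inventions for this example, not anything Gutenberg or Feedbooks actually uses. A correction is just the erroneous passage, quoted with enough context to be unique, plus its replacement; anything that cannot be applied unambiguously is handed back for a human to look at:

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Correction:
    """A hypothetical correction record, independent of any file format."""
    old: str  # the erroneous passage, with enough context to be unique
    new: str  # the corrected passage


def apply_corrections(text, corrections):
    """Apply each correction whose `old` passage occurs exactly once.

    Returns the corrected text and the list of corrections that were
    skipped (passage missing or ambiguous) and need human review.
    """
    skipped = []
    for c in corrections:
        if text.count(c.old) == 1:
            text = text.replace(c.old, c.new)
        else:
            skipped.append(c)
    return text, skipped


# The same invented passage in an upstream plain-text edition and a
# downstream marked-up edition.
plain = "It was a dark and stromy night. The rain fell in torents."
html = "<p>It was a dark and <em>stromy</em> night. The rain fell in torents.</p>"

fixes = [
    Correction("stromy night", "stormy night"),
    Correction("torents", "torrents"),
]

fixed_plain, skipped_plain = apply_corrections(plain, fixes)
fixed_html, skipped_html = apply_corrections(html, fixes)
```

Both corrections apply cleanly to the plain text. In the marked-up edition the first is skipped, because the emphasis tags interrupt the quoted passage, and is returned for review rather than silently dropped. A serious tool would match against the text with the markup stripped out, but even this crude version would save redistributors from redoing most of the work by hand.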

Michael S. Hart’s original vision of getting the world’s great texts into everyone’s hands is well on the way to being fulfilled, and we know how to bring it to fruition. Yes, the vast majority of texts are Western literature, and the files all require electronic devices to read, but consider: Gutenberg includes images, recordings, and even “Night of the Living Dead”, so the model works not only for any written language, but even for languages without a written form; and there’s nothing to stop written texts being printed, or audio texts being broadcast or transferred to tape. Project Gutenberg’s methods of digitisation can be applied to virtually any human language, and dissemination is not limited to digital technologies.

It’s time to enlarge the vision.


Last updated 2010/12/08