Secondary Screening


May 05, 2005 | Statistically Improbable Story

Amazon's got a corpus, engineers with degrees and computers with powerful processors, and as today's story in Wired News shows, they aren't afraid to use them.

Name that famous book from just these phrases: "pagan harpooneers," "stricken whale," "ivory leg." Or how about this one: "old sport."

Yes, it's Herman Melville's Moby Dick and F. Scott Fitzgerald's The Great Gatsby, respectively, but the words aren't just a game. They are Statistically Improbable Phrases, the result of a new Amazon.com feature that compares the text of hundreds of thousands of books to reveal an author's signature constructions.

The haiku-like SIPs are not the only word toys on the site. Customers can also see the 100 most common words in a book. Penny pinchers -- or those with back problems -- can check stats on how many words a volume delivers per dollar or per ounce. (Bargain hunters will love the Penguin Classics edition of War and Peace that delivers 51,707 words per dollar.)

Customers can also see how complicated the writing is (yes, post-structuralist Michel Foucault's prose is foggier than Immanuel Kant's), and how much education you need to understand a book. (To understand French philosopher Pierre Bourdieu, you'll need a second Ph.D.)

While such services seem to have little value and have generated scant publicity, except from bibliophilic thrill seekers, web watchers say the madcap stats aren't just for kicks.

"(Amazon CEO) Jeff Bezos was born on numbers," said Nathan Torkington, an editor and conference coordinator for O'Reilly Media. "Before starting Amazon.com, he was a Wall Street analyst. They will be looking at this thinking, 'What can we do to drive the bottom line?' There's no way they will be regarding this as, 'We are math geeks and you will enjoy the numbers, too.'"

Really, it's pretty fascinating what Amazon (and Google Scholar) might be able to pull off once their corpus includes millions of books. Hell, it's pretty impressive -- all jokes aside -- what Amazon is doing with classification and phrases with the corpus they have now.
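For a rough sense of how a phrase could get flagged as "statistically improbable," here is a minimal Python sketch. Amazon hasn't published its method, so everything in it is an assumption: the function names (bigrams, improbable_phrases), the restriction to two-word phrases, and the add-one smoothed log ratio used for scoring are illustrative guesses, not Amazon's algorithm.

```python
from collections import Counter
import math
import re


def bigrams(text):
    """Lowercase the text, split it into words, and return adjacent word pairs."""
    words = re.findall(r"[a-z']+", text.lower())
    return list(zip(words, words[1:]))


def improbable_phrases(book, corpus, top_n=3, min_count=2):
    """Rank a book's two-word phrases by how much more often they appear in
    the book than in the rest of the corpus.

    Hypothetical scoring (not Amazon's): log of the phrase's relative
    frequency in the book over its add-one smoothed relative frequency
    in the corpus.
    """
    book_counts = Counter(bigrams(book))
    corpus_counts = Counter()
    for text in corpus:
        corpus_counts.update(bigrams(text))

    book_total = sum(book_counts.values())
    corpus_total = sum(corpus_counts.values())
    # Number of distinct phrases seen anywhere, used for add-one smoothing.
    vocab = len(set(book_counts) | set(corpus_counts))

    scores = {}
    for phrase, count in book_counts.items():
        if count < min_count:
            continue  # ignore phrases the book barely uses
        p_book = count / book_total
        p_corpus = (corpus_counts[phrase] + 1) / (corpus_total + vocab)
        scores[phrase] = math.log(p_book / p_corpus)

    # Highest score first; ties broken alphabetically for stable output.
    ranked = sorted(scores, key=lambda p: (-scores[p], p))
    return [" ".join(p) for p in ranked[:top_n]]
```

Feed it the text of one book as a string plus a list of other books' texts, and it returns that book's most distinctive two-word phrases. A phrase the author repeats but nobody else uses (think "old sport") floats to the top, while a phrase that is common everywhere sinks.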

At risk of sounding too 1999, score one for the web here. This is pretty astounding. Even better will be the day Amazon opens an API for researchers, so they can start testing natural language processing theories that are starving for books to test against.

Posted by Ryan Singel at May 5, 2005 09:22 AM
