This Year's Language Model
==========================

Here's a poor man's introduction to the concept of a "language model," by way of something a little more familiar: Scrabble. Scrabble uses a language model, of sorts. Let's take a look.

First, some preliminaries. Here's how a little utility function called freq() works (I keep it in text.py):

    >>> import scrabble
    >>> from text import freq
    >>> fq = freq(list('abbccc'))

freq(sequence) returns a dictionary with frequency counts of all the elements of sequence. Let's see if it works as expected:

    >>> fq == { 'a': 1, 'b': 2, 'c': 3 }
    True

We simply counted the number of times the letters 'a', 'b', and 'c' occurred in the string 'abbccc'.

Scrabble
--------

In Scrabble, every letter has a value. Here's one chart for Scrabble in Italian:

    ITALIAN = {
        'A': 1, 'B': 5, 'C': 4, 'D': 5, 'E': 1, 'F': 8, 'G': 5,
        'H': 8, 'I': 1, 'L': 3, 'M': 3, 'N': 2, 'O': 1, 'P': 5,
        'Q': 10, 'R': 2, 'S': 2, 'T': 2, 'U': 3, 'V': 4, 'Z': 8,
    }

Writing a function to score a word given this information is easy enough:

    >>> from models import *
    >>> # sorry Jonas:
    >>> for word in "Italia campione del Mondo".split():
    ...     print word, scrabble.score(word, ITALIAN)
    Italia 9
    campione 17
    del 9
    Mondo 13

Now, here's the point.
----------------------

What happens if we use the _wrong_ model on a word?

    SPANISH = {
        'A': 1, 'B': 3, 'C': 3, 'CH': 5, 'D': 2, 'E': 1, 'F': 4,
        'G': 2, 'H': 4, 'I': 1, 'J': 8, 'L': 1, 'LL': 8, 'M': 3,
        'N': 1, 'Ñ': 8, 'O': 1, 'P': 3, 'Q': 5, 'R': 1, 'RR': 8,
        'S': 1, 'T': 1, 'U': 1, 'V': 4, 'X': 8, 'Y': 4, 'Z': 10,
    }

Let's see:

    >>> for word in "Italia campione del Mondo".split():
    ...     print word, scrabble.score(word, ITALIAN), scrabble.score(word, SPANISH)
    Italia 9 6
    campione 17 20
    del 9 4
    Mondo 13 8

Well, we get different scores, obviously, because there are different values and distributions of letters in each language.
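Neither text.py nor scrabble.py is reproduced here, so what follows is only a guess at what the two helpers might look like, not the actual code. The function bodies are assumptions: this freq() counts elements with a plain dictionary, and this score() sums per-letter values one character at a time, ignoring case, skipping letters missing from the chart, and making no attempt at digraph tiles such as 'CH' or 'LL'. Because of those simplifications it may not reproduce every figure in the transcripts above.

```python
def freq(sequence):
    """Return a dict mapping each element of sequence to its count.

    Hypothetical stand-in for the freq() kept in text.py.
    """
    counts = {}
    for item in sequence:
        counts[item] = counts.get(item, 0) + 1
    return counts


def score(word, values):
    """Sum the tile values of the letters in word.

    Hypothetical stand-in for scrabble.score(). Letters are upper-cased
    before lookup; letters absent from the chart count as 0; digraph
    tiles ('CH', 'LL', 'RR') are not handled.
    """
    return sum(values.get(letter, 0) for letter in word.upper())


# The three entries needed for "del", copied from the Italian chart.
italian_del = {'D': 5, 'E': 1, 'L': 3}

print(freq(list('abbccc')) == {'a': 1, 'b': 2, 'c': 3})  # True
print(score('del', italian_del))                         # 9
```

Passing the value chart in as an argument, rather than hard-coding it, is what lets the same function be pointed at the "wrong" language.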
The question is, is this crude metric (and it's _very_ crude, so far) enough to help us classify text according to language?

Finnish or German or English?
-----------------------------

I'm a firm believer in _not_ starting with test data that's neat and tidy. Start with something real-world. So, I picked a more-or-less random web page, this one:

    BBC - h2g2 - Finnish/German/English Phrase Guide

I used Aaron Swartz's html2text.py to extract the text, like this:

    >>> from html2text import html2text
    >>> bbchtml = open('bbcpage.html').read()
    >>> bbchtml = unicode(bbchtml) # cross your fingers
    >>> bbc = html2text(bbchtml)

So now we have a big string containing the text of that page. Let's make a (crude) word list:

    >>> words = set(bbc.split())

That's the worst tokenizer in human history, but there you go.

Now, we're going to classify each of the words in this list into one of three languages: Finnish, German, or English. To do this we'll score each word, as above, using the Scrabble language model for each language (which I built for your convenience from the Wikipedia page linked above).
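As a rough sketch of what "score each word under each model" could look like, here is one way to build a per-word table of scores. Several things are assumptions: score() and score_all() are hypothetical names, the code is written for Python 3 (unlike the Python 2 transcripts), and the Italian and Spanish charts from earlier stand in for the Finnish, German, and English ones, which aren't shown. How the scores get turned into a language label is deliberately left out.

```python
# Value charts copied from the Italian and Spanish tables above; they
# stand in for the Finnish/German/English charts, which are not shown.
ITALIAN = {
    'A': 1, 'B': 5, 'C': 4, 'D': 5, 'E': 1, 'F': 8, 'G': 5,
    'H': 8, 'I': 1, 'L': 3, 'M': 3, 'N': 2, 'O': 1, 'P': 5,
    'Q': 10, 'R': 2, 'S': 2, 'T': 2, 'U': 3, 'V': 4, 'Z': 8,
}
SPANISH = {
    'A': 1, 'B': 3, 'C': 3, 'CH': 5, 'D': 2, 'E': 1, 'F': 4,
    'G': 2, 'H': 4, 'I': 1, 'J': 8, 'L': 1, 'LL': 8, 'M': 3,
    'N': 1, 'Ñ': 8, 'O': 1, 'P': 3, 'Q': 5, 'R': 1, 'RR': 8,
    'S': 1, 'T': 1, 'U': 1, 'V': 4, 'X': 8, 'Y': 4, 'Z': 10,
}


def score(word, values):
    """Hypothetical scorer: sum single-letter values, ignoring case.

    Unknown letters count as 0 and digraph tiles are not handled.
    """
    return sum(values.get(letter, 0) for letter in word.upper())


def score_all(words, models):
    """Map each word to a {model_name: score} dict.

    models is a dict of {name: value chart}; the name is an arbitrary label.
    """
    return {
        word: {name: score(word, values) for name, values in models.items()}
        for word in words
    }


models = {'italian': ITALIAN, 'spanish': SPANISH}
table = score_all(set("Italia del".split()), models)

for word in sorted(table):
    print("%s %d %d" % (word, table[word]['italian'], table[word]['spanish']))
# Italia 9 6
# del 9 4
```

Keeping the models in a dictionary keyed by name means a third or fourth language is just another entry, with no change to the scoring loop.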