This Year's Language Model
==========================
Here's a poor man's introduction to the concept of a "language model," by way of something a little more familiar, Scrabble.
Scrabble uses a language model, of sorts. Let's take a look.
First, some preliminaries: here's how a little utility function called freq() works (I keep it in text.py):
>>> import scrabble
>>> from text import freq
>>> fq = freq(list('abbccc'))
freq(sequence) returns a dictionary with frequency counts of all the elements of sequence. Let's see if it works as expected:
>>> fq == { 'a': 1, 'b': 2, 'c': 3 }
True
We simply counted the number of times the letters 'a', 'b', and 'c' occurred in the string 'abbccc'.
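A function like that fits in a few lines. Here's a minimal sketch of how freq() might be written; only the name and the behaviour shown in the session come from text.py, and the body is an assumption on my part:

```python
def freq(sequence):
    """Return a dict mapping each element of sequence to its count."""
    counts = {}
    for item in sequence:
        # dict.get supplies 0 the first time an item is seen
        counts[item] = counts.get(item, 0) + 1
    return counts

print(freq(list('abbccc')) == {'a': 1, 'b': 2, 'c': 3})
```

Any iterable works as input, so a list of words would be counted the same way as a list of letters.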
Scrabble
--------
In Scrabble, every letter has a value. Here's one chart for Scrabble in Italian:
ITALIAN = {
'A': 1,
'B': 5,
'C': 4,
'D': 5,
'E': 1,
'F': 8,
'G': 5,
'H': 8,
'I': 1,
'L': 3,
'M': 3,
'N': 2,
'O': 1,
'P': 5,
'Q': 10,
'R': 2,
'S': 2,
'T': 2,
'U': 3,
'V': 4,
'Z': 8,
}
Writing a function to score a word given this information is easy enough:
>>> from models import *
>>> # sorry Jonas:
>>> for word in "Italia campione del Mondo".split():
...     print word, scrabble.score(word, ITALIAN)
Italia 9
campione 17
del 9
Mondo 13
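scrabble.score() isn't shown here, so what follows is only a sketch of what such a function could look like. It assumes the model is a plain dict of upper-case letters to points, like ITALIAN above, and that a letter missing from the chart scores zero. Both are assumptions, and ITALIAN_SAMPLE is a hypothetical cut-down chart holding just the entries this example needs:

```python
# A few entries copied from the Italian chart above; the rest are
# omitted to keep the sketch short.
ITALIAN_SAMPLE = {'A': 1, 'D': 5, 'E': 1, 'I': 1, 'L': 3, 'T': 2}

def score(word, model):
    """Sum the tile values of the letters in word under model."""
    # Assumption: letters absent from the chart contribute 0 points.
    return sum(model.get(letter, 0) for letter in word.upper())

for word in "Italia del".split():
    print("%s %d" % (word, score(word, ITALIAN_SAMPLE)))
```

Upper-casing the word first means "Italia" and "ITALIA" score the same, which is what you want when the words come from running text rather than from a rack of tiles.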
Now, here's the point.
----------------------
What happens if we use the _wrong_ model on a word?
SPANISH = {
'A': 1,
'B': 3,
'C': 3,
'CH': 5,
'D': 2,
'E': 1,
'F': 4,
'G': 2,
'H': 4,
'I': 1,
'J': 8,
'L': 1,
'LL': 8,
'M': 3,
'N': 1,
'Ñ': 8,
'O': 1,
'P': 3,
'Q': 5,
'R': 1,
'RR': 8,
'S': 1,
'T': 1,
'U': 1,
'V': 4,
'X': 8,
'Y': 4,
'Z': 10,
}
Let's see:
>>> for word in "Italia campione del Mondo".split():
...     print word, scrabble.score(word, ITALIAN), scrabble.score(word, SPANISH)
Italia 9 6
campione 17 20
del 9 4
Mondo 13 8
Well, we get different scores, obviously, because each language's chart assigns different values to the letters, and those values reflect how often each letter turns up in that language.
The question is, is this crude metric (and it's _very_ crude, so far) enough to help us classify text according to language?
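One wrinkle before moving on: the Spanish chart has two-letter tiles (CH, LL, RR), and a scorer that walks a word one character at a time will never look them up. I don't know how scrabble.score() deals with that, so the following is a sketch under my own assumption: at each position, greedily take a two-letter tile if the chart has one. The names score_with_digraphs and SPANISH_SAMPLE are hypothetical.

```python
# Entries copied from the Spanish chart above (a subset, for brevity).
SPANISH_SAMPLE = {'A': 1, 'C': 3, 'CH': 5, 'E': 1, 'L': 1, 'LL': 8,
                  'M': 3, 'O': 1, 'P': 3, 'R': 1, 'RR': 8}

def score_with_digraphs(word, model):
    """Score word, preferring two-letter tiles when the chart has them."""
    word = word.upper()
    total = 0
    i = 0
    while i < len(word):
        pair = word[i:i + 2]
        if len(pair) == 2 and pair in model:
            # Greedy choice: take the two-letter tile (e.g. 'LL').
            total += model[pair]
            i += 2
        else:
            # Assumption: letters missing from the chart score 0.
            total += model.get(word[i], 0)
            i += 1
    return total

print(score_with_digraphs("llama", SPANISH_SAMPLE))  # LL + A + M + A
print(score_with_digraphs("perro", SPANISH_SAMPLE))  # P + E + RR + O
```

For a chart with no two-letter keys, such as the Italian one, this reduces to plain letter-by-letter scoring.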
Finnish or German or English?
-----------------------------
I'm a firm believer in _not_ starting with test data that's neat and tidy. Start with something real world. So, I picked a more-or-less random web page, this one:
BBC - h2g2 - Finnish/German/English Phrase Guide
I used Aaron Swartz's html2text.py to strip the text, like this:
>>> from html2text import html2text
>>> bbchtml = open('bbcpage.html').read()
>>> bbchtml = unicode(bbchtml) # cross your fingers
>>> bbc = html2text(bbchtml)
So now we have a big string containing the text of that page. Let's make a (crude) word list:
>>> words = set(bbc.split())
That's the worst tokenizer in human history, but there you go.
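If you'd like something a notch less crude, here's a sketch of a tokenizer that keeps only runs of letters and lower-cases them, so punctuation and digits don't end up glued to the words. The function name tokenize is mine, and this is one possible approach rather than what the rest of this walkthrough uses:

```python
import re

def tokenize(text):
    """Return the set of lower-cased runs of letters found in text."""
    # [^\W\d_]+ matches one or more letters: word characters minus
    # digits and the underscore. re.UNICODE keeps accented letters.
    return set(re.findall(r"[^\W\d_]+", text.lower(), re.UNICODE))

print(sorted(tokenize("Guten Tag! Good day, good DAY.")))
```

Lower-casing collapses "day" and "DAY" into a single entry, which matters once you start counting distinct words.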
Now, we're going to classify each of the words in this list into one of three languages: Finnish, German, or English. To do this we'll score each word, as above, using the Scrabble language model for each language (which I built for your convenience from the Wikipedia page linked above).
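To make the idea concrete, here's a sketch of the classification step using the two charts already shown, since the Finnish, German, and English ones aren't reproduced here. The decision rule is my assumption: score the word under every chart and pick the language with the lowest total, on the theory that cheap tiles mean common letters. Also assumed: a letter the chart lacks costs a flat 12 points, more than any tile. The names classify and MODELS are hypothetical.

```python
# Single-letter entries copied from the charts above
# (CH, LL, RR and the N-with-tilde tile are left out for simplicity).
ITALIAN = {'A': 1, 'B': 5, 'C': 4, 'D': 5, 'E': 1, 'F': 8, 'G': 5,
           'H': 8, 'I': 1, 'L': 3, 'M': 3, 'N': 2, 'O': 1, 'P': 5,
           'Q': 10, 'R': 2, 'S': 2, 'T': 2, 'U': 3, 'V': 4, 'Z': 8}
SPANISH = {'A': 1, 'B': 3, 'C': 3, 'D': 2, 'E': 1, 'F': 4, 'G': 2,
           'H': 4, 'I': 1, 'J': 8, 'L': 1, 'M': 3, 'N': 1, 'O': 1,
           'P': 3, 'Q': 5, 'R': 1, 'S': 1, 'T': 1, 'U': 1, 'V': 4,
           'X': 8, 'Y': 4, 'Z': 10}
MODELS = {'italian': ITALIAN, 'spanish': SPANISH}

def score(word, model, missing=12):
    """Sum tile values; letters the chart lacks cost `missing` points."""
    # Assumption: a missing letter is penalized rather than ignored,
    # otherwise a chart with fewer letters would always look cheaper.
    return sum(model.get(letter, missing) for letter in word.upper())

def classify(word, models):
    """Return the name of the model that gives word the lowest score."""
    return min(models, key=lambda name: score(word, models[name]))

for word in ["pizza", "yo", "Italia"]:
    print("%s -> %s" % (word, classify(word, MODELS)))
```

Under this rule "Italia" lands in Spanish, 6 against 9, the same two numbers as in the session above.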