Here's a poor man's introduction to the concept of a "language model," by way of something a little more familiar: Scrabble.
Scrabble uses a language model, of sorts. Let's take a look.
First, some preliminaries. Here's how a little utility function called freq() works (I keep it in text.py):
>>> import scrabble
>>> from text import freq
>>> fq = freq(list('abbccc'))
freq(sequence) returns a dictionary with frequency counts of all the elements of sequence. Let's see if it works as expected:
>>> fq == { 'a': 1, 'b': 2, 'c': 3 }
True
We simply counted the number of times the letters 'a', 'b', and 'c' occurred in the string 'abbccc'.
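The real text.py isn't shown here, but a function with this behavior can be sketched in a few lines. The body below is an assumption about how freq() might be written, not necessarily the actual implementation:

```python
def freq(sequence):
    """Return a dict mapping each element of sequence to its count.

    Hypothetical stand-in for the freq() kept in text.py.
    """
    counts = {}
    for item in sequence:
        # Start unseen items at 0, then add one for this occurrence.
        counts[item] = counts.get(item, 0) + 1
    return counts


fq = freq(list('abbccc'))
assert fq == {'a': 1, 'b': 2, 'c': 3}
```

Any iterable works, so freq('abbccc') gives the same result as freq(list('abbccc')).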
In Scrabble, every letter has a value. Here's one chart for Scrabble in Italian:
ITALIAN = {
'A': 1,
'B': 5,
'C': 4,
'D': 5,
'E': 1,
'F': 8,
'G': 5,
'H': 8,
'I': 1,
'L': 3,
'M': 3,
'N': 2,
'O': 1,
'P': 5,
'Q': 10,
'R': 2,
'S': 2,
'T': 2,
'U': 3,
'V': 4,
'Z': 8,
}
Writing a function to score a word given this information is easy enough:
>>> from models import *
>>> # sorry Jonas:
>>> for word in "Italia campione del Mondo".split():
...     print(word, scrabble.score(word, ITALIAN))
Italia 9
campione 18
del 9
Mondo 12
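scrabble.score() itself isn't shown, so here is one way it could be written. This sketch assumes the function upper-cases the word, looks up one value per character, and counts characters missing from the chart as 0. Those choices are guesses, though they do reproduce the Italian scores above:

```python
ITALIAN = {
    'A': 1, 'B': 5, 'C': 4, 'D': 5, 'E': 1, 'F': 8, 'G': 5,
    'H': 8, 'I': 1, 'L': 3, 'M': 3, 'N': 2, 'O': 1, 'P': 5,
    'Q': 10, 'R': 2, 'S': 2, 'T': 2, 'U': 3, 'V': 4, 'Z': 8,
}


def score(word, values):
    """Sum the chart value of each character in word.

    Hypothetical stand-in for scrabble.score(). Characters that have
    no entry in the chart contribute 0 (an assumption).
    """
    return sum(values.get(letter, 0) for letter in word.upper())


for word in "Italia campione del Mondo".split():
    print("%s %d" % (word, score(word, ITALIAN)))
# Italia 9
# campione 18
# del 9
# Mondo 12
```

Because it walks the word one character at a time, this version would never match a multi-letter tile; a chart with such entries would need a lookup that tries the longer keys first.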
What happens if we use the wrong model on a word?
SPANISH = {
'A': 1,
'B': 3,
'C': 3,
'CH': 5,
'D': 2,
'E': 1,
'F': 4,
'G': 2,
'H': 4,
'I': 1,
'J': 8,
'L': 1,
'LL': 8,
'M': 3,
'N': 1,
'Ñ': 8,
'O': 1,
'P': 3,
'Q': 5,
'R': 1,
'RR': 8,
'S': 1,
'T': 1,
'U': 1,
'V': 4,
'X': 8,
'Y': 4,
'Z': 10,
}
Let's see:
>>> for word in "Italia campione del Mondo".split():
...     print(word, scrabble.score(word, ITALIAN), scrabble.score(word, SPANISH))
Italia 9 6
campione 18 14
del 9 4
Mondo 12 8
Well, we get different scores, obviously: each language's chart assigns its own values to the letters, and the letters themselves are distributed differently in each language.
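To see where the two charts actually disagree, we can compare them entry by entry. The snippet below is only an illustration built from the two dictionaries above; the variable names (differing, only_spanish, and so on) are made up for this example:

```python
ITALIAN = {
    'A': 1, 'B': 5, 'C': 4, 'D': 5, 'E': 1, 'F': 8, 'G': 5,
    'H': 8, 'I': 1, 'L': 3, 'M': 3, 'N': 2, 'O': 1, 'P': 5,
    'Q': 10, 'R': 2, 'S': 2, 'T': 2, 'U': 3, 'V': 4, 'Z': 8,
}
SPANISH = {
    'A': 1, 'B': 3, 'C': 3, 'CH': 5, 'D': 2, 'E': 1, 'F': 4,
    'G': 2, 'H': 4, 'I': 1, 'J': 8, 'L': 1, 'LL': 8, 'M': 3,
    'N': 1, 'Ñ': 8, 'O': 1, 'P': 3, 'Q': 5, 'R': 1, 'RR': 8,
    'S': 1, 'T': 1, 'U': 1, 'V': 4, 'X': 8, 'Y': 4, 'Z': 10,
}

# Tiles that appear in both charts.
shared = sorted(set(ITALIAN) & set(SPANISH))

# Shared tiles whose value differs: tile -> (Italian value, Spanish value).
differing = {k: (ITALIAN[k], SPANISH[k])
             for k in shared if ITALIAN[k] != SPANISH[k]}

# Tiles that exist in only one of the two charts.
only_italian = sorted(set(ITALIAN) - set(SPANISH))
only_spanish = sorted(set(SPANISH) - set(ITALIAN))

# Shared tiles that are worth more in Spanish than in Italian.
higher_in_spanish = sorted(k for k, (it, es) in differing.items() if es > it)
```

With these two charts, 15 of the 21 shared tiles have different values, and Z is the only one worth more in Spanish. That is why every word in the example scored lower under the Spanish chart.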
The question is, is this crude metric (and it's very crude, so far) enough to help us classify text according to language?