This Year's Language Model

Here's a poor man's introduction to the concept of a "language model," by way of something a little more familiar: Scrabble.

Scrabble uses a language model, of sorts. Let's take a look.

First, some preliminaries. Here's how a little utility function called freq() works (I keep it in text.py):

>>> import scrabble
>>> from text import freq
>>> fq = freq(list('abbccc'))

freq(sequence) returns a dictionary with frequency counts of all the elements of sequence. Let's see if it works as expected:

>>> fq == { 'a': 1, 'b': 2, 'c': 3 }
True

We simply counted the number of times the letters 'a', 'b', and 'c' occurred in the string 'abbccc'.
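
freq() itself isn't shown here. A minimal stand-in, assuming it does nothing more than count elements (the real one in text.py may well differ), could look like this:

```python
from collections import Counter

def freq(sequence):
    """Return a plain dict mapping each element of sequence to its count.

    A guess at what text.freq does, not the original implementation.
    """
    return dict(Counter(sequence))

# Same call as in the session above.
fq = freq(list('abbccc'))
```

Counter does the counting; wrapping it in dict() just keeps the return type a plain dictionary, which is what the comparison above expects.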

Scrabble

In Scrabble, every letter has a value. Here's one chart for Scrabble in Italian:

ITALIAN = {
        'A': 1, 
        'B': 5, 
        'C': 4, 
        'D': 5, 
        'E': 1, 
        'F': 8, 
        'G': 5, 
        'H': 8, 
        'I': 1, 
        'L': 3, 
        'M': 3, 
        'N': 2, 
        'O': 1, 
        'P': 5, 
        'Q': 10, 
        'R': 2, 
        'S': 2, 
        'T': 2, 
        'U': 3, 
        'V': 4, 
        'Z': 8, 
}

Writing a function to score a word given this information is easy enough:

>>> from models import *
>>> # sorry Jonas:
>>> for word in "Italia campione del Mondo".split(): 
...     print(word, scrabble.score(word, ITALIAN))
Italia 9
campione 18
del 9
Mondo 12
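
The body of scrabble.score() isn't shown. A minimal sketch that reproduces the numbers above, assuming it uppercases the word and sums one value per letter, and assuming (my guess) that letters missing from the chart count as 0:

```python
def score(word, values):
    """Sum the chart value of every letter in word.

    Letters that are not in the chart score 0; that is an assumption,
    the real scrabble.score may raise an error instead.
    """
    return sum(values.get(letter, 0) for letter in word.upper())

# A few entries copied from the ITALIAN chart, enough to score one word.
SUBSET = {'I': 1, 'T': 2, 'A': 1, 'L': 3}
total = score("Italia", SUBSET)  # 1 + 2 + 1 + 3 + 1 + 1
```

Uppercasing first is what lets "Italia" and "Mondo" match the all-caps keys in the chart.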

Now, here's the point.

What happens if we use the wrong model on a word?

SPANISH = {
        'A': 1, 
        'B': 3, 
        'C': 3, 
        'CH': 5, 
        'D': 2, 
        'E': 1, 
        'F': 4, 
        'G': 2, 
        'H': 4, 
        'I': 1, 
        'J': 8, 
        'L': 1, 
        'LL': 8, 
        'M': 3, 
        'N': 1, 
        'Ñ': 8, 
        'O': 1, 
        'P': 3, 
        'Q': 5, 
        'R': 1, 
        'RR': 8, 
        'S': 1, 
        'T': 1, 
        'U': 1, 
        'V': 4, 
        'X': 8, 
        'Y': 4, 
        'Z': 10, 
}

Let's see:

>>> for word in "Italia campione del Mondo".split(): 
...     print(word, scrabble.score(word, ITALIAN), scrabble.score(word, SPANISH))
Italia 9 6
campione 18 14
del 9 4
Mondo 12 8

Well, we get different scores, obviously, because each language's edition has its own letter values, reflecting a different distribution of letters in that language.
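
To see exactly where the two charts disagree, one can line them up letter by letter. The compare() helper below is hypothetical (it is not part of scrabble.py or models.py); the two tables are the ones given above, written compactly:

```python
ITALIAN = {
    'A': 1, 'B': 5, 'C': 4, 'D': 5, 'E': 1, 'F': 8, 'G': 5,
    'H': 8, 'I': 1, 'L': 3, 'M': 3, 'N': 2, 'O': 1, 'P': 5,
    'Q': 10, 'R': 2, 'S': 2, 'T': 2, 'U': 3, 'V': 4, 'Z': 8,
}
SPANISH = {
    'A': 1, 'B': 3, 'C': 3, 'CH': 5, 'D': 2, 'E': 1, 'F': 4,
    'G': 2, 'H': 4, 'I': 1, 'J': 8, 'L': 1, 'LL': 8, 'M': 3,
    'N': 1, 'Ñ': 8, 'O': 1, 'P': 3, 'Q': 5, 'R': 1, 'RR': 8,
    'S': 1, 'T': 1, 'U': 1, 'V': 4, 'X': 8, 'Y': 4, 'Z': 10,
}

def compare(a, b):
    """Return {tile: (value_in_a, value_in_b)} for every tile whose value
    differs between the charts. A tile missing from a chart shows as None."""
    tiles = sorted(set(a) | set(b))
    return {t: (a.get(t), b.get(t)) for t in tiles if a.get(t) != b.get(t)}

diff = compare(ITALIAN, SPANISH)
```

Only A, E, I, O, M and V carry the same value in both charts; everything else either differs or exists in just one of them (the Spanish chart has CH, LL, RR, Ñ, J, X and Y, which the Italian one lacks).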

The question is, is this crude metric (and it's very crude, so far) enough to help us classify text according to language?

Finnish or German or English?