We're going to do some crude alignment of parallel texts and see if we can extract any useful lexical information.
>>> import align
We're using [my favorite data structure], which I call a "collection." It works like this:
>>> acollection = align.Collection()
>>> acollection['a'] = 1
>>> acollection['a'] = 2
>>> acollection['a']
[1, 2]
Syntactically it works like a dictionary, except that you never overwrite anything; you just keep it all, even duplicates:
>>> acollection['a'] = 2 # add **another** 2
>>> acollection['a']
[1, 2, 2]
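The align module itself isn't shown here, so what follows is only a guess at how a collection like this could be built: a dict subclass whose item assignment appends to a list instead of overwriting. The name SketchCollection is made up for illustration and is not the real align.Collection.

```python
class SketchCollection(dict):
    """Hypothetical stand-in for align.Collection: assignment appends."""

    def __setitem__(self, key, value):
        # Start an empty list the first time a key is seen,
        # then append to it rather than overwrite.
        if key not in self:
            dict.__setitem__(self, key, [])
        dict.__getitem__(self, key).append(value)


acollection = SketchCollection()
acollection['a'] = 1
acollection['a'] = 2
acollection['a'] = 2  # duplicates are kept
print(acollection['a'])
```

Because it subclasses dict, things like items() and printing keep working the way they do for an ordinary dictionary.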
This is useful, because we can use it to go through all the words in a text, and record all the nths where each word occurs.
Er, I'll show you in a second. First, here's how a useful Python built-in called "enumerate" works:
>>> letters = ['a', 'b', 'c']
>>> letterindex = enumerate(letters)
>>> type(letterindex)
<type 'enumerate'>
Now we can see what's inside that enumerate thing:
>>> for nth, word in letterindex:
...     print nth, word
0 a
1 b
2 c
It just numbers stuff... I would even go so far as to say that it enumerates stuff.
Now, here's the same thing with some text:
>>> book = "Mary had a little lamb, little lamb, little lamb."
>>> marywords = book.split()
>>> maryindex = align.Collection()
First, we'll go through all the words in "marywords" and record their positions in "maryindex", numbering them just like we did with "letters".
>>> for nth, word in enumerate(marywords):
...     maryindex[word] = nth
>>> print maryindex
{'a': [2], 'little': [3, 5, 7], 'lamb,': [4, 6], 'lamb.': [8], 'had': [1], 'Mary': [0]}
As you can see, "maryindex" walks like a dictionary and talks like a dictionary, so we can treat it like a dictionary. Let's make it a bit prettier to read:
>>> for word, nths in maryindex.items():
...     print word, nths
a [2]
little [3, 5, 7]
lamb, [4, 6]
lamb. [8]
had [1]
Mary [0]
(Notice that our crude approach doesn't handle punctuation or capitalization, so "maryindex" thinks that "lamb," and "lamb." and "Lamb" (if that word were there) are all distinct words.)
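If we did want to fold those variants together, one simple option is to clean each word up before recording it. This is only a sketch: the normalize helper is made up for illustration, and a plain dict stands in for align.Collection so that the example runs on its own.

```python
import string


def normalize(word):
    """Lowercase a word and strip punctuation from both ends."""
    return word.strip(string.punctuation).lower()


book = "Mary had a little lamb, little lamb, little lamb."
positions = {}
for nth, word in enumerate(book.split()):
    # setdefault gives us the existing list for this word, or a new empty one.
    positions.setdefault(normalize(word), []).append(nth)

print(positions['lamb'])
```

With this in place, "lamb," and "lamb." both land under "lamb". The rest of this walkthrough sticks with the crude version.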
You can think of this process as an automatic means of creating the sort of index you find in the back of a book, except that this book has only one word per page, and the index contains all the words in the book. (So the index is longer than the book itself. Efficient it ain't.)
We can think of a list such as [3, 5, 7] (for the word little) as a vector, and if we set things up correctly, we can do a bit of math to compare them to other such vectors.
In order to compare the elements of these vectors, we need to convert them to comparable forms. One simple way to do this is simply to consider these "nths" as proportional offsets. First, how many words are in the text?
>>> marylen = len(marywords)
>>> print marylen
9
Instead of saying that little occurs at positions 3, 5, and 7 of the text (counting from zero), we can say that it occurs at 3/9, 5/9, and 7/9. Let's rebuild maryindex with this approach:
>>> maryindex = align.Collection()
>>> for nth, word in enumerate(marywords):
...     marylen = len(marywords)
...     maryindex[word] = float(nth) / marylen
(We have to convert either "nth" or "marylen" to a float, because in Python 2 dividing one integer by another does integer division, which would round every one of these fractions down to zero.)
Now let's take a look again:
>>> for word, nths in maryindex.items():
...     print word, nths
a [0.22222222222222221]
little [0.33333333333333331, 0.55555555555555558, 0.77777777777777779]
lamb, [0.44444444444444442, 0.66666666666666663]
lamb. [0.88888888888888884]
had [0.1111111111111111]
Mary [0.0]
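At least, that's what I got on my machine. The exact digits you see depend on your Python version: older versions print floats like these to 17 significant digits, while Python 2.7 and later print the shortest string that converts back to the same float.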
But approximate is fine anyway; let's just convert these to friendly percentages:
>>> maryindex = align.Collection()
>>> for nth, word in enumerate(marywords):
...     marylen = len(marywords)
...     interim = float(nth) / marylen * 100
...     maryindex[word] = round(interim)
>>> for word, nths in maryindex.items():
...     print word, nths
a [22.0]
little [33.0, 56.0, 78.0]
lamb, [44.0, 67.0]
lamb. [89.0]
had [11.0]
Mary [0.0]
Oops, Mary has a zero in there, can't have that... er, let's cheat:
>>> maryindex = align.Collection()
>>> for nth, word in enumerate(marywords):
...     marylen = len(marywords)
...     interim = float(nth) / marylen * 100 + 1
...     maryindex[word] = int(round(interim))
>>> for word, nths in maryindex.items():
...     print word, nths
a [23]
little [34, 57, 79]
lamb, [45, 68]
lamb. [90]
had [12]
Mary [1]
Okay, that's easy on the brain.
Hehehe. Cheating is fun.
It's interesting to stop and think about what a percentage really is, after all. When we take a percentage, it's like we're thinking of a series of numbers as being placed on the number line, and we're subdividing that number line into one hundred equal parts. Each value from the series is then associated with the index of the part it falls in.
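To make the "one hundred equal parts" picture concrete, here is a small sketch that assigns each word position to the part it falls in. The function name bucket and its parts parameter are made up for illustration. It floors instead of rounding, so some of its numbers come out one lower than the round-and-add-one cheat used earlier.

```python
def bucket(nth, total, parts=100):
    """Return the 1-based index of the equal part that position nth falls in."""
    # nth runs from 0 to total - 1, so nth * parts // total
    # runs from 0 to parts - 1; adding 1 makes the parts 1-based.
    return nth * parts // total + 1


marywords = "Mary had a little lamb, little lamb, little lamb.".split()
total = len(marywords)
print([bucket(nth, total) for nth in range(total)])
```

Setting parts to something other than 100 slices the same line into coarser or finer pieces, which is all a change of page size would be.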
The pages of a book (unlike the chapters) really work in the same way. If we ignore things like text formatting, images, and so on, we can think of all the pages in a book as being of the exact same length; equal fractions of the entire text. The index at the back of a book tells you which "fractions" (pages) the word in question appears in.
Interestingly, it doesn't tell you how many times the indexed word occurs on that page.
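If we did want those counts, we could keep a tally per "page" instead of just noting that the word appears there. This is a rough sketch, not something the align module is known to do; bucket_counts and its parts parameter are hypothetical names, and the "pages" are just equal slices of the word list.

```python
from collections import Counter


def bucket_counts(words, target, parts=10):
    """Count how often target occurs in each of `parts` equal slices of words."""
    total = len(words)
    counts = Counter()
    for nth, word in enumerate(words):
        if word == target:
            # Same slicing as a page number: 1-based index of the slice.
            counts[nth * parts // total + 1] += 1
    return counts


marywords = "Mary had a little lamb, little lamb, little lamb.".split()
print(sorted(bucket_counts(marywords, "little", parts=3).items()))
```

With three "pages" of three words each, little turns up twice on page 2 and once on page 3.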
These calculations are pretty meaningless on a short text. After all, we're calculating (approximate!) percentages of a "text" which doesn't even have 100 words.
So, I ran across these two random PDFs, which I saved as text files from Adobe Acrobat. They're pretty long:
>>> entext = "/home/pat/l/pt/doingbusinessinbrazil/brazilbiz-en.txt"
>>> pttext = "/home/pat/l/pt/doingbusinessinbrazil/brazilbiz-pt.txt"