We're going to do some crude alignment of parallel texts and see if we can extract any useful lexical information.
>>> import align
So, I ran across these two random PDFs, which I saved as text files from Adobe Acrobat. They're pretty long.
>>> entext = "/home/pat/l/pt/doingbusinessinbrazil/brazilbiz-en.txt"
>>> pttext = "/home/pat/l/pt/doingbusinessinbrazil/brazilbiz-pt.txt"
>>> endex = align.Collection()
>>> ptdex = align.Collection()
>>> en = align.UnicodeDammit(open(entext, 'U').read()).unicode
>>> pt = align.UnicodeDammit(open(pttext, 'U').read()).unicode
Almost 800K characters per file (the exact counts are printed a little further down); that's pretty respectable. By way of comparison, Moby Dick has:
>>> moby = "/home/pat/l/en/mobydick.txt"
>>> print len(align.UnicodeDammit(open(moby, 'U').read()).unicode)
1232923
So they're on the same order as that novel.
>>> for i in [en, pt]: print len(i)
788807
788267
>>> entokens = en.split()
>>> pttokens = pt.split()
>>> enwords = set(entokens)
>>> ptwords = set(pttokens)
>>> for i, w in enumerate(entokens):
...     endex[w] = i
>>> for i, w in enumerate(pttokens):
...     ptdex[w] = i
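The align.Collection class isn't shown here, so this is only a guess at what the loops above rely on: a mapping where assigning to a key appends to that key's list of token positions instead of overwriting it. The name PositionIndex is made up for this sketch and is not part of align.

```python
from collections import defaultdict


class PositionIndex(object):
    """Hypothetical stand-in for align.Collection (an assumption, not the real class)."""

    def __init__(self):
        self._hits = defaultdict(list)

    def __setitem__(self, word, position):
        # Assignment appends rather than overwrites, so every occurrence is kept.
        self._hits[word].append(position)

    def __getitem__(self, word):
        # Return a copy of the positions; unknown words give an empty list.
        return list(self._hits.get(word, []))


tokens = u"a lei e a ordem".split()
index = PositionIndex()
for i, w in enumerate(tokens):
    index[w] = i
```

With an index like this, len(index[word]) is the number of times the word occurs, which is how the hit counts below would come out.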
Alright, let's see how many hits a random Portuguese word got, along with its English counterpart:
>>> len(ptdex[u'quando'])
136
>>> len(endex[u'when'])
91
Here's how normalizing works. We want to convert everything to a percentage.
>>> maximum = 1232923
>>> from random import choice
>>> nums = [choice(range(maximum)) for i in range(10)]
Mmkay, pretend these are what we got:
>>> nums = [463882, 793980, 1221109, 1215836, 871746, 326138, 445866, 997517, 174756, 465463]
A bunch of random numbers. Now, we want to convert them all to percentages, which one does thusly, I think:
>>> percentages = [ float(n)/maximum * 100 for n in nums]
>>> percentages
[37.624571850796848, 64.398182206025851, 99.041789308821393, 98.614106477046832, 70.705632062991768, 26.452422414051814, 36.163328934572561, 80.906674626071535, 14.174121173828375, 37.752803703069858]
It doesn't matter too much how exact these things are. Or rather, we don't know how important it is yet. So:
>>> def normalize(seq):
...     maximum = max(seq)
...     return [int(float(n) / maximum * 100) for n in seq]
>>> normalize(percentages)
[37, 65, 100, 99, 71, 26, 36, 81, 14, 38]
Okay, now what is this thing going to give us?
>>> normalize(ptdex[u'quando'])
[5, 7, 7, 7, 7, 7, 8, 8, 11, 11, 11, 12, 12, 13, 14, 15, 16, 16, 18, 20, 20, 20, 21, 24, 24, 25, 26, 27, 28, 31, 38, 38, 39, 39, 39, 39, 40, 40, 40, 41, 41, 42, 43, 44, 45, 46, 46, 46, 49, 49, 51, 51, 52, 53, 53, 54, 55, 58, 60, 60, 60, 60, 61, 61, 61, 61, 62, 62, 62, 64, 65, 65, 67, 67, 67, 69, 69, 70, 70, 70, 71, 71, 71, 71, 71, 71, 73, 73, 73, 73, 73, 73, 73, 73, 73, 74, 74, 76, 76, 77, 77, 77, 79, 81, 81, 82, 83, 83, 85, 85, 86, 86, 86, 86, 86, 87, 89, 89, 91, 96, 96, 96, 96, 96, 96, 97, 97, 98, 98, 98, 98, 98, 98, 98, 99, 100]
Well, a big list.
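One thing worth noticing about normalize: it scales by the largest value in the list it is given, so a word's last hit always lands on 100, wherever that hit actually falls in the text. Here's a sketch of an alternative that scales by the total token count instead. The function name normalize_by_length and the toy numbers are made up for illustration, and it uses integer arithmetic rather than the float division above.

```python
def normalize_by_length(positions, total):
    """Scale token positions to whole-number percentages of the full text length."""
    return [p * 100 // total for p in positions]


hits = [10, 250, 500]   # toy token positions of one word
total_tokens = 1000     # toy length of the whole text, in tokens

# Scaling by the largest hit, which is what normalize does with a hit list.
by_max = [p * 100 // max(hits) for p in hits]

# Scaling by the text length, so the percentages mean "how far into the text".
by_length = normalize_by_length(hits, total_tokens)
```

Here by_max stretches the hits across the whole 0 to 100 range, while by_length keeps them in the first half of the text, where they actually are. Whether that difference matters for the alignment is an open question at this point.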
Here's my idea. We take the frequencies of the percentages in that big list, and we treat them as a vector representing the distribution of that word through the text. Then, we compute the cosine similarity between that vector and the same sort of vector for each word on the English side. Hopefully words that translate each other will have similar distributions?
>>> def hitfreq(word, langdex):
...     return align.freq(normalize(langdex[word]))
>>> def compare(ptwd, enwd):
...     ptvec = hitfreq(ptwd, ptdex)
...     envec = hitfreq(enwd, endex)
...     answer = align.sim(ptvec, envec)
...     # Falls through (returning None) when the similarity is not positive.
...     if answer > 0: return answer
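Neither align.freq nor align.sim is shown here, so the following is a minimal sketch under two assumptions: that freq turns a list of percentages into a histogram with one slot per value from 0 to 100, and that sim is plain cosine similarity. The names freq_vector and cosine are made up for this sketch.

```python
import math
from collections import Counter


def freq_vector(percentages, size=101):
    """Histogram of integer percentages 0..100 (assumed behaviour of align.freq)."""
    counts = Counter(percentages)
    return [counts.get(p, 0) for p in range(size)]


def cosine(u, v):
    """Cosine similarity of two equal-length vectors; 0.0 if either is all zeros."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)


# Identical distributions score (very close to) 1.
same = cosine(freq_vector([5, 7, 7, 100]), freq_vector([5, 7, 7, 100]))

# Distributions with no percentage in common score exactly 0.
disjoint = cosine(freq_vector([1, 2, 3]), freq_vector([50, 60, 70]))
```

In this sketch a hit at 49 and a hit at 50 count as having nothing in common, because each percentage gets its own slot. Wider buckets would be more forgiving, at the cost of resolution.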
Here are a few test results:
>>> print compare(u'Brasil', u'Brazil')
0.85776561382
>>> print compare(u'quando', u'when')
0.748669557285
>>> print compare(u'Brasil', u'when')
0.392501568125
>>> print compare(u'artigo', u'article')
0.434400518707
>>> print compare(u'artigos', u'articles')
0.0946094540761
>>> print compare(u'lei', u'law')
0.605214445726
Okay, now let's take a single English word and try to compare it to ALL of the Portuguese words, then sort those results by similarity...
>>> def findsims(enword):
...     sims = []
...     # Caps the search at 8000 words; a set has no fixed order, so which 8000 is arbitrary.
...     for ptword in list(ptwords)[:8000]:
...         res = (compare(ptword, enword), enword, ptword)
...         sims.append(res)
...     return sorted(sims)
Make some coffee:
>>> query = u'investment'
>>> biz = findsims(query)
>>> for x, e, p in biz[-100:]:
...     print u"» %f\t%s\t%s" % (x, e, p)
>>> print u"out of " + unicode(len(endex[query]))
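Sorting every candidate just to look at the best 100 does more work than necessary. Here's a sketch of the same ranking step using heapq.nlargest, which also skips candidates whose score comes back as None. The helper top_matches and the scoring function toy_score (character overlap, standing in for compare so the example runs on its own) are both made up for illustration.

```python
import heapq


def top_matches(query, candidates, score, k=100):
    """Return the k best (score, query, candidate) tuples, best first."""
    scored = []
    for cand in candidates:
        s = score(cand, query)
        if s is not None:  # skip pairs the scorer declined to rate
            scored.append((s, query, cand))
    return heapq.nlargest(k, scored)


def toy_score(ptword, enword):
    """Hypothetical scorer: share of letters the two words have in common."""
    a, b = set(ptword), set(enword)
    overlap = len(a & b) / float(len(a | b))
    return overlap if overlap > 0 else None


words = [u'investimento', u'lei', u'quando', u'qual']
best = top_matches(u'investment', words, toy_score, k=2)
everything = top_matches(u'investment', words, toy_score, k=10)
```

Note that the results come back best first, the opposite of the sorted list above, where the best matches are at the end.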