Introduction: align
===================

We're going to do some crude alignment of parallel texts and see if we can
extract any useful lexical information.

    >>> import align

So, I ran across these two random PDFs, which I saved as text files from
Adobe Acrobat. They're pretty long.

    >>> entext = "/home/pat/l/pt/doingbusinessinbrazil/brazilbiz-en.txt"
    >>> pttext = "/home/pat/l/pt/doingbusinessinbrazil/brazilbiz-pt.txt"

We're using [my favorite data structure], which I call a "collection." It
works like this:

    >>> acollection = align.Collection()
    >>> acollection['a'] = 1
    >>> acollection['a'] = 2
    >>> acollection['a']
    [1, 2]

It's like a dictionary, but values are lists, and you don't overwrite
previous entries. As a matter of fact, you can have duplicates:

    >>> acollection['a'] = 2
    >>> acollection['a']
    [1, 2, 2]

Now let's set up one index per language and read in both files:

    >>> endex = align.Collection()
    >>> ptdex = align.Collection()
    >>> en = align.UnicodeDammit(open(entext, 'U').read()).unicode
    >>> pt = align.UnicodeDammit(open(pttext, 'U').read()).unicode
    >>> for i in [en, pt]: print len(i)
    788807
    788267

Almost eight hundred thousand characters per file; that's pretty
respectable. By way of comparison, Moby Dick has:

    >>> moby = "/home/pat/l/en/mobydick.txt"
    >>> print len(align.UnicodeDammit(open(moby, 'U').read()).unicode)
    1232923

1.2 million characters. In other words, my documents should be long enough
to get some useful information out.

How many words is that, anyway? Let's split each text on whitespace and
index the position of every token:

    >>> entokens = en.split() # this is a crummy definition of "word"
    >>> pttokens = pt.split()
    >>> enwords = set(entokens) # remove duplicates
    >>> ptwords = set(pttokens)
    >>> for i, w in enumerate(entokens):
    ...     endex[w] = i
    >>> for i, w in enumerate(pttokens):
    ...     ptdex[w] = i

Alright, let's see how many hits a random Portuguese word got, along with
its English counterpart:

    >>> len(ptdex[u'quando'])
    136
    >>> len(endex[u'when'])
    91

Normalizing an Index
--------------------

Here's how normalizing works. We want to convert every hit position to a
percentage.

    >>> maximum = 1232923
    >>> from random import choice
    >>> nums = [choice(range(maximum)) for i in range(10)]

Mmkay, pretend these are what we got:

    >>> nums = [463882, 793980, 1221109, 1215836, 871746, 326138, 445866, 997517, 174756, 465463]

A bunch of random numbers. Now, we want to convert them all to percentages,
which one does thusly, I think:

    >>> percentages = [float(n) / maximum * 100 for n in nums]
    >>> percentages
    [37.624571850796848, 64.398182206025851, 99.041789308821393, 98.614106477046832, 70.705632062991768, 26.452422414051814, 36.163328934572561, 80.906674626071535, 14.174121173828375, 37.752803703069858]

It doesn't matter too much how exact these things are. Or rather, we don't
know yet how much it matters. So we truncate to whole percentages:

    >>> def normalize(seq):
    ...     maximum = max(seq)
    ...     return [int(float(n) / maximum * 100) for n in seq]
    >>> normalize(percentages)
    [37, 65, 100, 99, 71, 26, 36, 81, 14, 38]

Note that `normalize` scales by the largest value in the sequence rather
than by the length of the text, so the largest value always comes out as 100.

Okay, now what is this thing going to give us?

    >>> normalize(ptdex[u'quando'])
    [5, 7, 7, 7, 7, 7, 8, 8, 11, 11, 11, 12, 12, 13, 14, 15, 16, 16, 18, 20, 20, 20, 21, 24, 24, 25, 26, 27, 28, 31, 38, 38, 39, 39, 39, 39, 40, 40, 40, 41, 41, 42, 43, 44, 45, 46, 46, 46, 49, 49, 51, 51, 52, 53, 53, 54, 55, 58, 60, 60, 60, 60, 61, 61, 61, 61, 62, 62, 62, 64, 65, 65, 67, 67, 67, 69, 69, 70, 70, 70, 71, 71, 71, 71, 71, 71, 73, 73, 73, 73, 73, 73, 73, 73, 73, 74, 74, 76, 76, 77, 77, 77, 79, 81, 81, 82, 83, 83, 85, 85, 86, 86, 86, 86, 86, 87, 89, 89, 91, 96, 96, 96, 96, 96, 96, 97, 97, 98, 98, 98, 98, 98, 98, 98, 99, 100]

Well, a big list.

The Part Where We Attempt to Get Clever
---------------------------------------

Here's my idea. We take the frequencies of the percentages in that big
list, and we treat them as a vector representing the distribution of that
word. Then we compute the cosine measure between that vector and the same
sort of vector for each word on the English side. Hopefully similar words
will have similar distributions?

    >>> def hitfreq(word, langdex):
    ...     return align.freq(normalize(langdex[word]))
    >>> def compare(ptwd, enwd):
    ...     ptvec = hitfreq(ptwd, ptdex)
    ...     envec = hitfreq(enwd, endex)
    ...     answer = align.sim(ptvec, envec)
    ...     if answer > 0: return answer

Here are a few test results:

    >>> print compare(u'Brasil', u'Brazil')
    0.85776561382
    >>> print compare(u'quando', u'when')
    0.748669557285
    >>> print compare(u'Brasil', u'when')
    0.392501568125
    >>> print compare(u'artigo', u'article')
    0.434400518707
    >>> print compare(u'artigos', u'articles')
    0.0946094540761
    >>> print compare(u'lei', u'law')
    0.605214445726

Okay, now let's take a single English word and try to compare it to ALL of
the Portuguese words (well, the first 8000 of them), then sort those
results by similarity...

    >>> def findsims(enword):
    ...     sims = []
    ...     for ptword in list(ptwords)[:8000]:
    ...         res = (compare(ptword, enword), enword, ptword)
    ...         sims.append(res)
    ...     return sorted(sims)

Make some coffee:

    >>> query = u'investment'
    >>> biz = findsims(query)
    >>> for x, e, p in biz[-100:]:
    ...     print "» %f\t%s\t%s" % (x, e, p)
    >>> print u"out of " + unicode(len(endex[query]))
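
The `align` module itself never appears in this walkthrough, so here is a
rough sketch of what the three helpers used above might look like. All of
it is guesswork: it assumes `Collection` is a dict whose item assignment
appends, `freq` counts how often each value occurs, and `sim` is the cosine
of two sparse count vectors. The names below are stand-ins, not the actual
source.

```python
# Hypothetical stand-ins for align.Collection, align.freq and align.sim.
# The real module is not shown in this document; these are guesses at
# its behavior based only on how the helpers are used above.
from collections import Counter
from math import sqrt


class Collection(dict):
    """A dict whose values are lists; assignment appends, never overwrites."""

    def __setitem__(self, key, value):
        # Start an empty list the first time a key is seen, then append.
        if key not in self:
            dict.__setitem__(self, key, [])
        dict.__getitem__(self, key).append(value)


def freq(seq):
    """Map each distinct item in seq to the number of times it occurs."""
    return Counter(seq)


def sim(a, b):
    """Cosine similarity of two sparse count vectors (key -> count)."""
    dot = sum(count * b.get(key, 0) for key, count in a.items())
    norm_a = sqrt(sum(count * count for count in a.values()))
    norm_b = sqrt(sum(count * count for count in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # an empty vector is treated as similar to nothing
    return dot / (norm_a * norm_b)


c = Collection()
c['a'] = 1
c['a'] = 2
c['a'] = 2
print(c['a'])

# Identical distributions versus distributions with no shared buckets.
same = sim(freq([5, 7, 7]), freq([5, 7, 7]))
disjoint = sim(freq([1, 2]), freq([3, 4]))
print(same, disjoint)
```

Under these assumptions, two words whose hits fall in exactly the same
percentage buckets score about 1.0, and two words with no buckets in common
score 0.0, which is consistent with the 0-to-1 values that `compare` printed
above.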