Introduction: align
===================

We're going to do some crude alignment of parallel texts and see if we can
extract any useful lexical information.

>>> import align

So, I ran across these two random PDFs, which I saved as text files from
Adobe Acrobat. They're pretty long.

>>> entext = "/home/pat/l/pt/doingbusinessinbrazil/brazilbiz-en.txt"
>>> pttext = "/home/pat/l/pt/doingbusinessinbrazil/brazilbiz-pt.txt"

>>> endex = align.Comparable()
>>> ptdex = align.Comparable()

>>> en = align.UnicodeDammit(open(entext, 'U').read()).unicode
>>> pt = align.UnicodeDammit(open(pttext, 'U').read()).unicode

Almost 800K characters per file, that's pretty respectable. By way of
comparison, Moby Dick has:

>>> moby = "/home/pat/l/en/mobydick.txt"
>>> print len(align.UnicodeDammit(open(moby, 'U').read()).unicode)
1232923

So they're on the same order as that novel.

>>> for i in [en, pt]: print len(i)
788807
788267

>>> for i,w in enumerate(en.split()):
...     endex[w] = i

>>> for i,w in enumerate(pt.split()):
...     ptdex[w] = i

Alright, let's see how many hits a random Portuguese word and its English
counterpart got:

>>> len(ptdex[u'quando'])
136

>>> len(endex[u'when'])
91

Normalizing an Index
--------------------

>>> endex.normalize()
>>> ptdex.normalize()

Here's how normalizing works. We want to convert everything to a percentage.

>>> maximum = 1232923
>>> from random import choice
>>> nums = [choice(range(maximum)) for i in range(10)]

Mmkay, pretend these are what we got:

>>> nums = [463882, 793980, 1221109, 1215836, 871746, 326138, 445866, 997517, 174756, 465463]

A bunch of random numbers. Now, we want to convert them all to percentages,
which one does thusly, I think:

>>> percentages = [ float(n)/maximum * 100 for n in nums]
>>> percentages
[37.624571850796848, 64.398182206025851, 99.041789308821393, 98.614106477046832, 70.705632062991768, 26.452422414051814, 36.163328934572561, 80.906674626071535, 14.174121173828375, 37.752803703069858]

It doesn't matter too much how exact these things are.
Or rather, we don't know how important it is yet. So:

>>> def normalize(seq):
...     maximum = max(seq)
...     return [ int(float(n)/maximum * 100) for n in seq]

>>> normalize(percentages)
[37, 65, 100, 99, 71, 26, 36, 81, 14, 38]

Okay, now what is this thing going to give us?

>>> normalize(ptdex[u'quando'])
[5, 7, 7, 7, 7, 7, 8, 8, 11, 11, 11, 12, 12, 13, 14, 15, 16, 16, 18, 20, 20, 20, 21, 24, 24, 25, 26, 27, 28, 31, 38, 38, 39, 39, 39, 39, 40, 40, 40, 41, 41, 42, 43, 44, 45, 46, 46, 46, 49, 49, 51, 51, 52, 53, 53, 54, 55, 58, 60, 60, 60, 60, 61, 61, 61, 61, 62, 62, 62, 64, 65, 65, 67, 67, 67, 69, 69, 70, 70, 70, 71, 71, 71, 71, 71, 71, 73, 73, 73, 73, 73, 73, 73, 73, 73, 74, 74, 76, 76, 77, 77, 77, 79, 81, 81, 82, 83, 83, 85, 85, 86, 86, 86, 86, 86, 87, 89, 89, 91, 96, 96, 96, 96, 96, 96, 97, 97, 98, 98, 98, 98, 98, 98, 98, 99, 100]

Well, a big list.

The Part Where We Attempt to Get Clever
---------------------------------------

Here's my idea. We take the frequencies of the percentages in that big list,
and we compare them to similar lists for all the words on the English side.

>>> def hitfreq(word, langdex):
...     return align.freq(normalize(langdex[word]))

>>> quando = hitfreq(u'quando', ptdex)

>>> for k,v in sorted(quando.items()):
...     print "%d\t%s" % (k,v * '*')