Average Character Distance
==========================

    >>> import chardist

Most programmers know that the frequency distribution of letters in a given language is more or less consistent: in English, ETAOINSHRDLU describes (pretty accurately) the most common letters in order of frequency. But we can imagine other metrics about the distribution of letters besides simple frequency. Let's investigate **average character distance**. (I just made that term up.)

Here's a string:

    >>> from string import lowercase
    >>> import random
    >>> punctuation = '?!!...' # 6 characters
    >>> somespaces = ' ' * 10
    >>> randomtext = punctuation + lowercase + lowercase + lowercase + somespaces
    >>> randomtext = list(randomtext)
    >>> random.shuffle(randomtext)

So, now we have a random string, right? Well, kind of random. We know, for instance, that there are exactly three times as many 'e' characters as '?' characters.

    >>> ecount = randomtext.count('e')
    >>> hmmcount = randomtext.count('?')
    >>> ratio = float(ecount)/hmmcount
    >>> thrice = 3 / 1
    >>> ratio == thrice
    True

Well, duh, there's the frequency for you (and the shuffling makes no difference). But what about this **average character distance** idea? On average, how far are the 3 'e' characters from each other? Here, the shuffling certainly _does_ make a difference.

In other words, in this artificial case we know there are 3 * 26 + 16 = 94 characters, and we know what those characters are, but what is the distribution of those letters? Obviously, this is a statistical matter.

    >>> chardist.locations('x', 'axbxc')
    [1, 3]

So that will give us the locations of a substring within a string. But we can't really _test_ what the distribution of, say, the letter 'e' is in the variable randomtext, because randomtext is randomized. We can make trivial observations.
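
(An aside before those observations: the `chardist.locations` call above shows what the function returns, not how it works. The following is only a minimal sketch of one way such a function could behave. The name `locations_sketch` is hypothetical, it is not the actual `chardist` source, and it assumes the thing being searched for is a single character, which is all this walkthrough needs.)

```python
def locations_sketch(needle, haystack):
    """Return every index at which `needle` occurs in `haystack`.

    Hypothetical stand-in for chardist.locations; the real module may
    differ. Assumes `needle` is a single character. `haystack` may be a
    string or a list of characters, such as a shuffled list.
    """
    return [index for index, char in enumerate(haystack) if char == needle]


print(locations_sketch('x', 'axbxc'))  # [1, 3]
```

Comparing element by element means the sketch also accepts the list form of randomtext, not just a plain string.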
We know there are three 'e' characters:

    >>> elocs = chardist.locations('e', randomtext)
    >>> len(elocs) == 3
    True

And that they are all in different positions:

    >>> len(set(elocs)) == 3 # duplicates would be removed, but there aren't any
    True

In order to test that our code is working, we have to find a _finite_ way to describe a _variable phenomenon_, namely, where 'e' happens to end up in this randomized string. A bit of thinking will tell you that any position in this string is equally likely to hold an 'e'. So, suppose we get this string:

    >>> instance = 'qzckkglhgdmvnmmz?pqvz dc io!bxwrrwtubojj.tiyeyqssl x.vpeyhe gaf.n ifjuxanaplw c !t bkohrud sf'
    >>> print instance
    qzckkglhgdmvnmmz?pqvz dc io!bxwrrwtubojj.tiyeyqssl x.vpeyhe gaf.n ifjuxanaplw c !t bkohrud sf

Now, what are the three locations of 'e'?

    >>> locs = chardist.locations('e', instance)
    >>> locs
    [45, 56, 59]

Check out this function:

    >>> chardist.offsets('e')