Average Character Distance

>>> import chardist

Most programmers know that the frequency distribution of letters in a given language is more or less consistent: in English, ETAOINSHRDLU describes (pretty accurately) the most common letters in order of frequency.

But we can imagine other metrics for the distribution of letters besides simple frequency. Let's investigate average character distance. (I just made that term up.)

Here's a string:

>>> from string import lowercase
>>> import random
>>> punctuation = '?!!...' # 6 characters
>>> somespaces = ' ' * 10 
>>> randomtext = punctuation + lowercase + lowercase + lowercase + somespaces
>>> randomtext = list(randomtext)
>>> random.shuffle(randomtext)

So, now we have a random string, right? Well, kind of random. We know, for instance, that there are exactly three times as many 'e' characters as '?' characters.

>>> ecount = randomtext.count('e')
>>> hmmcount = randomtext.count('?')
>>> ratio = float(ecount)/hmmcount 
>>> thrice = 3 / 1
>>> ratio == thrice
True

Well, duh, there's the frequency for you (and the shuffling makes no difference).
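
If you want to convince yourself that shuffling leaves the frequencies alone, here is a minimal sketch. It rebuilds the same pool of characters, but uses `ascii_lowercase` in place of `lowercase` (an assumption on my part, so that it also runs on newer Pythons), and the names `pool`, `before` and `after` are just for illustration:

```python
import random
from collections import Counter
from string import ascii_lowercase  # assumption: stands in for `lowercase`

# Rebuild the same pool of characters as in the session above.
pool = list('?!!...' + ascii_lowercase * 3 + ' ' * 10)

before = Counter(pool)   # maps each character to its number of occurrences
random.shuffle(pool)     # reorders the list in place
after = Counter(pool)

# Shuffling moves characters around but never adds or removes any.
same_counts = (before == after)
e_to_question_ratio = float(after['e']) / after['?']
```

Whatever order the shuffle produces, `same_counts` stays True and the 'e' to '?' ratio stays at three.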

But what about this average character distance idea? On average, how far are the 3 'e' characters from each other? Here, the shuffling certainly does make a difference.

In other words, in this artificial case we know there are 3 * 26 + 16 = 94 characters (78 letters, 6 punctuation marks and 10 spaces), and we know what those characters are, but how are they distributed along the string? Obviously, this is a statistical matter.

>>> chardist.locations('x', 'axbxc')
[1, 3]
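
The source of chardist isn't shown here, so the following is only a guess at what a function like this could look like, not the real thing. This hypothetical stand-in handles single characters only, and accepts either a string or a list of characters:

```python
def locations(char, text):
    """Return the indexes at which `char` occurs in `text`.

    Hypothetical stand-in for chardist.locations, for illustration only.
    `text` may be a string or a list of single characters.
    """
    return [i for i, c in enumerate(text) if c == char]
```

With this sketch, `locations('x', 'axbxc')` gives the same `[1, 3]` as above.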

So that will give us the locations of a substring within a string. But we can't really test what the distribution of, say, the letter 'e' is in the variable randomtext, because randomtext comes out differently on every run. We can only make trivial observations. We know there are three 'e' characters:

>>> elocs = chardist.locations('e', randomtext)
>>> len(elocs) == 3
True

And that they are all in different positions:

>>> len(set(elocs)) == 3 # duplicates would be removed, but there aren't any
True

In order to test that our code is working, we have to find a finite way to describe a variable phenomenon, namely, where 'e' happens to end up in this randomized string. A bit of thinking will tell you that every position in this string is equally likely to hold an 'e'.
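
One rough way to check the "equally likely" idea is to shuffle the pool many times and look at the average index where an 'e' lands. If no position is favored, that average should sit near the middle of the string. This is a sketch under my own assumptions: `mean_e_position` is a made-up helper, `ascii_lowercase` stands in for `lowercase`, and the seed is fixed only to make the run repeatable.

```python
import random
from string import ascii_lowercase  # assumption: stands in for `lowercase`

def mean_e_position(trials, seed=0):
    """Shuffle the pool `trials` times; return (mean index of 'e', midpoint)."""
    rng = random.Random(seed)  # private generator, so the run is repeatable
    pool = list('?!!...' + ascii_lowercase * 3 + ' ' * 10)
    total = 0   # sum of every index at which an 'e' was seen
    count = 0   # number of 'e' characters seen (3 per trial)
    for _ in range(trials):
        rng.shuffle(pool)
        for i, c in enumerate(pool):
            if c == 'e':
                total += i
                count += 1
    midpoint = (len(pool) - 1) / 2.0  # middle of the valid index range
    return float(total) / count, midpoint

mean, midpoint = mean_e_position(2000)
```

A mean close to the midpoint is consistent with a uniform spread, though it doesn't prove it: a lopsided but symmetric distribution would give the same mean.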

So, suppose we get this string:

>>> instance = 'qzckkglhgdmvnmmz?pqvz dc  io!bxwrrwtubojj.tiyeyqssl x.vpeyhe gaf.n ifjuxanaplw c !t bkohrud sf'
>>> print instance
qzckkglhgdmvnmmz?pqvz dc  io!bxwrrwtubojj.tiyeyqssl x.vpeyhe gaf.n ifjuxanaplw c !t bkohrud sf

Now, what are the three locations of 'e'?

>>> locs = chardist.locations('e', instance)
>>> locs
[45, 56, 59]
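
Since "average character distance" hasn't been pinned down yet, here is one plausible reading, and it is only my assumption: the mean gap between each occurrence and the next. It is not necessarily what chardist computes. For the three locations above:

```python
locs = [45, 56, 59]  # the 'e' locations found above

# Distance from each occurrence to the next one.
gaps = [b - a for a, b in zip(locs, locs[1:])]   # [11, 3]

# Mean of those gaps; float() avoids integer division on older Pythons.
average_gap = float(sum(gaps)) / len(gaps)        # 7.0
```

Note that under this reading the average only depends on the first and last location: (59 - 45) / 2 is also 7. Other definitions, such as the mean distance over all pairs, would give a different number.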

Check out this function:

>>> chardist.offsets('e')