Measuring Spaciness

We're going to try to determine which languages use spaces to delimit words. We'll use the Universal Declaration of Human Rights (UDHR) as our test corpus, and for each language's text we'll simply calculate the number of characters that function as spaces as a percentage of the total number of characters. The module is called whitespace:

>>> import whitespace

Sorting the languages by the percentage of their characters that are spaces should give us some idea of which languages don't use spaces.

You can download these files and follow along: the UDHR in Unicode site provides a zip archive, http://udhrinunicode.org/assemblies/udhr_txt.zip, containing the UTF-8 encoded texts of all the languages. I put it into a directory called udhr/.

We'll start by testing English, then test Thai, then Chinese, and see if we're getting sensible results. Then, we'll dig in to running the comparison on all the texts.

All of these texts have an identical preamble in English. But it's short compared to the length of the UDHR itself, so we'll blithely ignore the effect of the preamble on the statistics for now.

The journey of a thousand miles... er, that is, here's step one:

Step One: English

Read the file into a variable (and make sure it's Unicode of course):

>>> udhr_eng = open('udhr/udhr_eng.txt', 'U').read().decode('utf-8')

And here's our mind-numbingly simplistic function to count spaces:

>>> def spaces(text): 
...  return text.count(u' ')

Testing testing:

>>> spaces('a b c') == 2
True

>>> spaces('a b  c') == 3
True

>>> spaces(' a b  c') == 4 
True
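As an aside, spaces() only looks at U+0020. Here is a minimal sketch, in Python 3 rather than the Python 2 used in the rest of this post, of a broader counter that treats anything Python itself considers whitespace as a space. The name count_whitespace is my own invention for illustration; it is not part of the whitespace module.

```python
def count_whitespace(text):
    """Count every character that str.isspace() regards as whitespace."""
    return sum(1 for ch in text if ch.isspace())

print(count_whitespace('a b  c'))       # three plain spaces: prints 3
print(count_whitespace('a\u3000b\tc'))  # ideographic space plus a tab: prints 2
```

Whether tabs and newlines ought to count as "spaces" is a judgment call, which is part of why the simpler function is a reasonable place to start.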

Okay, let's see how many spaces are in the English file:

>>> spaces(udhr_eng)
3243

Out of how many characters?

>>> len(udhr_eng)
12528

Whoa, that seems wrong... what percentage is that, anyway? Taking a cue from the Colbert Report, we'll call the function spaciness():

>>> def spaciness(text):
...  return float(spaces(text)) / len(text) * 100

The following function is to make things look nicer, but one has to be careful converting floats to strings. It's all approximate anyway, so one might be tempted to say "Who cares." But let's be careful to use this function only when printing and comparing displayed values, never in the calculations themselves.

>>> def pretty(floatnum): # for sanity
...  return "%.2f" % floatnum

>>> pretty(spaciness(udhr_eng))
'25.89'
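To spell out the arithmetic: 3243 spaces out of 12528 characters is 3243 / 12528 * 100, or roughly 25.89 percent. The sketch below redoes that sum in Python 3, where / is already true division and the float() call is unnecessary. The name spaciness_pct is hypothetical, chosen so it won't be confused with spaciness() above.

```python
def spaciness_pct(n_spaces, n_chars):
    """Spaces as a percentage of all characters (true division in Python 3)."""
    return n_spaces / n_chars * 100

def pretty(value):
    """Format for display only; keep the unrounded float for calculations."""
    return "%.2f" % value

# The counts measured for the unsqueezed English file.
print(pretty(spaciness_pct(3243, 12528)))  # prints 25.89
```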

Over one fourth of the characters are spaces? Er...

Well, if you look at the file you'll see that there are spaces all over the place used for indentation and formatting. We're not interested in that stuff; we want to know about spaces that are separating linguistic elements, not pushing text around on the screen.

Come to think of it, it would make more sense to use the XML versions of these files than the plain text ones we're using; there's probably less formatting whitespace in well-formed XML. But then, most text in the world isn't in well-formed XML...

So let's squeeze out the meaningless whitespace: we'll use the built-in string methods "rstrip" and "lstrip" and a regular expression to reduce sequences of whitespace to a single space (not the most cross-linguistically sound approach, but we'll see what happens):

>>> import re
>>> whitespaceRE = re.compile(r"\s+", re.UNICODE)

>>> def squeeze(text):
...    text = text.rstrip()
...    text = text.lstrip()
...    text = ' '.join(re.split(whitespaceRE, text)) # squeeze!
...    return text

Okay, let's see if squeeze() is working:

>>> spaces(squeeze('a b c')) == 2
True

>>> spaces(squeeze('a b  c')) == 2
True

>>> spaces(squeeze(' a b  c')) == 2
True

Looks good. Now, we can rerun our numbers for English:

>>> spaces(squeeze(udhr_eng)) # was 3243 before squeezing
1776

>>> len(squeeze(udhr_eng)) # was 12528 before squeezing
10843

>>> pretty(spaciness(squeeze(udhr_eng)))
'16.38'

Still seems a little high to me, but it's a little more reasonable.
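A side note on squeeze(): in Python 3, calling str.split() with no arguments already splits on runs of Unicode whitespace and discards leading and trailing whitespace, so the same squeezing can be sketched without a regular expression. This is an alternative under that assumption, not the function used for the numbers in this post, and squeeze_ws is a made-up name.

```python
def squeeze_ws(text):
    """Trim both ends and collapse each whitespace run to a single space."""
    return ' '.join(text.split())

print(repr(squeeze_ws(' a b  c')))         # prints 'a b c'
print(repr(squeeze_ws('a\n\n  b\u3000c'))) # prints 'a b c'
```

Note that this turns an ideographic space (U+3000) into an ordinary space, just as the regular expression with re.UNICODE does.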

Let's wrap up all that junk in a convenient function, and we may as well go ahead and open our Thai and Chinese files:

>>> def squeezeread(fname): 
...  return squeeze(open(fname, 'U').read().decode('utf-8'))

>>> udhr_eng = squeezeread('udhr/udhr_eng.txt')
>>> udhr_tha = squeezeread('udhr/udhr_tha.txt')
>>> udhr_cmn = squeezeread('udhr/udhr_cmn.txt')

Just to keep things simple, let's define a function that will hand us a percentage from a filename:

>>> def spacesinfile(fname):
...  return spaciness(squeezeread(fname))

>>> pretty(spacesinfile('udhr/udhr_eng.txt'))
'16.38'
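For anyone following along in Python 3, here is a rough equivalent of squeezeread() and spacesinfile(). It assumes the files are UTF-8, as the archive description says, and passes the encoding explicitly rather than relying on a platform default. The names squeezeread3 and spacesinfile3 are hypothetical, and the demo writes a tiny throwaway file instead of touching the UDHR texts.

```python
import os
import tempfile

def squeezeread3(fname):
    """Read a UTF-8 text file and collapse its whitespace runs."""
    with open(fname, encoding='utf-8') as f:
        return ' '.join(f.read().split())

def spacesinfile3(fname):
    """Percentage of squeezed characters that are U+0020 (0.0 if empty)."""
    text = squeezeread3(fname)
    return text.count(' ') / len(text) * 100 if text else 0.0

# Demo on a throwaway file: ' a b  c\n' squeezes to 'a b c'.
with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, 'demo.txt')
    with open(path, 'w', encoding='utf-8') as f:
        f.write(' a b  c\n')
    print(repr(squeezeread3(path)))
    print("%.2f" % spacesinfile3(path))
```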

Step Two: Thai

Thai uses spaces very differently: they separate phrases and sentences, but not words. In terms of the frequency of spaces, Thai should thus represent a halfway point between English and Chinese (which, as a rule, only uses spaces between sentences). So we should expect spaces to make up a smaller percentage of the text in Thai than they do in English.

Is this true?

>>> pretty(spacesinfile('udhr/udhr_tha.txt'))
'3.91'

Wow.

Step Three: Mandarin Chinese

Chinese only uses spaces between sentences, so it should have fewer still than Thai.

Survey says:

>>> pretty(spacesinfile('udhr/udhr_cmn.txt'))
'3.99'

Pretty darn close to the value we got for Thai... more than Thai, as a matter of fact, if only slightly. At this point we just start looking at the text and seeing if we can find a pattern.

Okay, good, we've learned something. Check this out (I hope you have a Chinese font!):

1948 年 12 月 10 日，联合国大会通过并颁布《世界人权宣言》。这一具有历史意义的《宣言》颁布后，大会要求所有会员国广为宣传，并且“不分国家或领土的政治地位,主要在各级学校和其他教育机构加以传播、展示、阅读和阐述。”《宣言》全文如下：

The sneaky bit is this guy:

U+3002 IDEOGRAPHIC FULL STOP 。

Ever seen a period in Mandarin? Now you have! It's a bit surprising, but in this particular Mandarin text, at least, the period seems to be doubling as a space: apart from the spaces around the numerals in the date, there are no whitespace characters in that whole paragraph, and none at all between its sentences.

This is important information to know, for instance, if you're trying to split Chinese text into sentences...
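To make that concrete, here is a minimal Python 3 sketch of splitting on sentence-final punctuation instead of whitespace. The set of terminators (ideographic full stop, fullwidth exclamation mark, fullwidth question mark) is my assumption for illustration, and split_sentences is a made-up name; a real segmenter would also have to deal with closing quotation marks and other cases.

```python
import re

# Assumed sentence terminators for this sketch: 。 ！ ？
SENTENCE_RE = re.compile('[^。！？]+[。！？]?')

def split_sentences(text):
    """Split text after each terminator, keeping the terminator attached."""
    return [s.strip() for s in SENTENCE_RE.findall(text) if s.strip()]

sample = '联合国大会通过并颁布《世界人权宣言》。《宣言》全文如下：'
for sentence in split_sentences(sample):
    print(sentence)
```

On the sample, which is stitched together from the paragraph quoted above, this yields two pieces: one ending in 。 and a trailing fragment with no terminator.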

I still can't figure out why, exactly, Chinese has a slightly higher spaciness than Thai... But anyway, we're getting numbers, and unexpected results are progress, right? So, onwards...
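One way to chase a puzzle like this, sketched here in Python 3, is to tally which whitespace characters a raw (unsqueezed) text actually contains. The function name whitespace_profile is hypothetical, and I haven't run it over the UDHR files, so it comes with no claims about what it would show.

```python
import unicodedata
from collections import Counter

def whitespace_profile(text):
    """Tally each whitespace character in text, keyed by Unicode name."""
    counts = Counter(ch for ch in text if ch.isspace())
    # Control characters such as newline have no name; fall back to U+XXXX.
    return {unicodedata.name(ch, 'U+%04X' % ord(ch)): n
            for ch, n in counts.items()}

print(whitespace_profile('a b\u3000c\n d'))
```

The tally has to be run before squeezing, since squeezing replaces every whitespace run with a plain U+0020.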

Step Four: All Languages in the UDHR

Now for a bit of input/output-fu. We want to go through all of the files in our corpus and measure their spaciness. Then, we want to output a list of those languages sorted on spaciness.

We'll keep the data in a dictionary called "catalog". Building it will take a bit of time, since we're futzing about with nearly 300 files in memory. Slow, I know. But not that slow.

>>> from glob import glob # for old school shell globs
>>> catalog = {}
>>> udhr = glob('udhr/udhr_*.txt') # udhr is now a list of file names
>>> for fname in udhr:
...  catalog[fname] = spacesinfile(fname)

Now we've got the data we need. Just to double check, let's look up the three values we already know:

>>> for langcode in ['eng', 'tha', 'cmn']:
...  print pretty(catalog['udhr/udhr_' + langcode + '.txt']),
...  print langcode
16.38 eng
3.91 tha
3.99 cmn

Okay good, let's light this candle.

>>> sortable = [(v,k) for k,v in catalog.items()]
>>> byspaciness = sorted(sortable)
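Swapping keys and values into tuples is one way to sort a dictionary by value. Another, sketched here in Python 3, is to hand sorted() a key function. The three entries below are just the rounded figures already shown for English, Thai, and Chinese, and sample_catalog is a stand-in for the real catalog.

```python
from operator import itemgetter

# Stand-in data: the rounded percentages reported earlier.
sample_catalog = {
    'udhr/udhr_eng.txt': 16.38,
    'udhr/udhr_tha.txt': 3.91,
    'udhr/udhr_cmn.txt': 3.99,
}

# Sort (filename, percentage) pairs by the percentage, smallest first.
by_value = sorted(sample_catalog.items(), key=itemgetter(1))
for fname, pct in by_value:
    print("%.2f" % pct, fname)
```

One small difference: the tuple version breaks ties by filename, while the key version leaves tied entries in the dictionary's own order.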

Here comes the whole honkin' file:

>>> for language_spaciness, language in byspaciness:
...  print pretty(language_spaciness), language
2.39 udhr/udhr_bod.txt
2.78 udhr/udhr_jpn.txt
3.91 udhr/udhr_tha.txt
3.99 udhr/udhr_cmn.txt
4.10 udhr/udhr_lao.txt
6.45 udhr/udhr_kal.txt
7.82 udhr/udhr_mal.txt
8.68 udhr/udhr_cot.txt
8.95 udhr/udhr_cpu.txt
9.23 udhr/udhr_cic.txt
9.31 udhr/udhr_cni.txt
9.36 udhr/udhr_tam.txt
9.70 udhr/udhr_san.txt
10.18 udhr/udhr_quy.txt
10.30 udhr/udhr_qvm.txt
10.48 udhr/udhr_qxu.txt
10.55 udhr/udhr_boa.txt
10.55 udhr/udhr_zul.txt
10.60 udhr/udhr_xho.txt
10.60 udhr/udhr_amc.txt
10.62 udhr/udhr_ssw.txt
10.65 udhr/udhr_not.txt
10.71 udhr/udhr_ayr.txt
10.80 udhr/udhr_qvc.txt
10.86 udhr/udhr_cbu.txt
10.90 udhr/udhr_qxa.txt
10.93 udhr/udhr_cbt.txt
10.98 udhr/udhr_yad.txt
10.99 udhr/udhr_qvn.txt
11.01 udhr/udhr_quz.txt
11.02 udhr/udhr_koo.txt
11.10 udhr/udhr_nbl.txt
11.26 udhr/udhr_qwh.txt
11.28 udhr/udhr_qvh.txt
11.30 udhr/udhr_hye.txt
11.39 udhr/udhr_mcd.txt
11.46 udhr/udhr_fin.txt
11.47 udhr/udhr_top.txt
11.51 udhr/udhr_sme.txt
11.52 udhr/udhr_kan.txt
11.56 udhr/udhr_qxn.txt
11.62 udhr/udhr_sna.txt
11.71 udhr/udhr_qud.txt
11.72 udhr/udhr_tsz.txt
11.74 udhr/udhr_abk.txt
11.76 udhr/udhr_mic.txt
11.83 udhr/udhr_nyn.txt
11.90 udhr/udhr_kat.txt
11.91 udhr/udhr_acu.txt
11.99 udhr/udhr_qva.txt
12.02 udhr/udhr_arl.txt
12.06 udhr/udhr_guc.txt
12.08 udhr/udhr_lue.txt
12.22 udhr/udhr_uig_latn.txt
12.22 udhr/udhr_mcf.txt
12.25 udhr/udhr_amr.txt
12.29 udhr/udhr_shp.txt
12.32 udhr/udhr_toi.txt
12.36 udhr/udhr_agr.txt
12.42 udhr/udhr_lun.txt
12.43 udhr/udhr_ame.txt
12.43 udhr/udhr_pbb.txt
12.52 udhr/udhr_lug.txt
12.56 udhr/udhr_eus.txt
12.72 udhr/udhr_mly.txt
12.73 udhr/udhr_uzn_latn.txt
12.83 udhr/udhr_hun.txt
12.88 udhr/udhr_ura.txt
12.88 udhr/udhr_ban.txt
12.91 udhr/udhr_est.txt
12.91 udhr/udhr_yao.txt
12.99 udhr/udhr_umb.txt
13.00 udhr/udhr_nhn.txt
13.02 udhr/udhr_run.txt
13.09 udhr/udhr_gug.txt
13.13 udhr/udhr_ind.txt
13.19 udhr/udhr_kaz.txt
13.29 udhr/udhr_tur.txt
13.33 udhr/udhr_lav.txt
13.38 udhr/udhr_ztu.txt
13.40 udhr/udhr_gag.txt
13.42 udhr/udhr_mlt.txt
13.44 udhr/udhr_knc.txt
13.44 udhr/udhr_uzn_cyrl.txt
13.48 udhr/udhr_bug.txt
13.52 udhr/udhr_lat.txt
13.53 udhr/udhr_gax.txt
13.53 udhr/udhr_lit.txt
13.58 udhr/udhr_rus.txt
13.61 udhr/udhr_kin.txt
13.61 udhr/udhr_mad.txt
13.66 udhr/udhr_plt.txt
13.66 udhr/udhr_lat_1.txt
13.67 udhr/udhr_nya_chechewa.txt
13.69 udhr/udhr_bem.txt
13.70 udhr/udhr_pol.txt
13.70 udhr/udhr_ndo.txt
13.78 udhr/udhr_deu.txt
13.84 udhr/udhr_azj_cyrl.txt
13.85 udhr/udhr_min.txt
13.88 udhr/udhr_huu.txt
13.91 udhr/udhr_jav.txt
14.07 udhr/udhr_rmy.txt
14.08 udhr/udhr_bel.txt
14.08 udhr/udhr_mar.txt
14.11 udhr/udhr_azj_latn.txt
14.27 udhr/udhr_sun.txt
14.29 udhr/udhr_kqn.txt
14.29 udhr/udhr_khk.txt
14.30 udhr/udhr_cab.txt
14.36 udhr/udhr_hsb.txt
14.38 udhr/udhr_ccx.txt
14.45 udhr/udhr_swe.txt
14.47 udhr/udhr_tob.txt
14.56 udhr/udhr_ben.txt
14.61 udhr/udhr_oss.txt
14.65 udhr/udhr_ukr.txt
14.76 udhr/udhr_ilo.txt
14.88 udhr/udhr_slv.txt
14.88 udhr/udhr_kde.txt
14.90 udhr/udhr_slk.txt
14.93 udhr/udhr_dan.txt
15.01 udhr/udhr_mzi.txt
15.02 udhr/udhr_nya_chinyanja.txt
15.08 udhr/udhr_dyo.txt
15.09 udhr/udhr_ita.txt
15.14 udhr/udhr_tgl.txt
15.18 udhr/udhr_mam.txt
15.24 udhr/udhr_hrv.txt
15.25 udhr/udhr_ces.txt
15.27 udhr/udhr_ron.txt
15.28 udhr/udhr_pam.txt
15.28 udhr/udhr_suk.txt
15.34 udhr/udhr_srp_cyrl.txt
15.35 udhr/udhr_ell.txt
15.36 udhr/udhr_nld.txt
15.40 udhr/udhr_nep.txt
15.41 udhr/udhr_guj.txt
15.43 udhr/udhr_ltz.txt
15.43 udhr/udhr_kek.txt
15.44 udhr/udhr_nob.txt
15.48 udhr/udhr_csw.txt
15.49 udhr/udhr_hil.txt
15.51 udhr/udhr_bul.txt
15.54 udhr/udhr_ceb.txt
15.58 udhr/udhr_bos_latn.txt
15.60 udhr/udhr_rmn_1.txt
15.62 udhr/udhr_srp_latn.txt
15.71 udhr/udhr_cjk.txt
15.75 udhr/udhr_bcl.txt
15.79 udhr/udhr_isl.txt
15.80 udhr/udhr_rmn.txt
15.84 udhr/udhr_bos_cyrl.txt
15.86 udhr/udhr_lua.txt
15.89 udhr/udhr_glg.txt
15.91 udhr/udhr_ace.txt
16.02 udhr/udhr_yua.txt
16.05 udhr/udhr_epo.txt
16.06 udhr/udhr_war.txt
16.08 udhr/udhr_spa.txt
16.08 udhr/udhr_ast.txt
16.09 udhr/udhr_ina.txt
16.14 udhr/udhr_roh.txt
16.15 udhr/udhr_cbr.txt
16.16 udhr/udhr_nym.txt
16.20 udhr/udhr_mkd.txt
16.21 udhr/udhr_wln.txt
16.22 udhr/udhr_som.txt
16.25 udhr/udhr_quc.txt
16.25 udhr/udhr_fao.txt
16.26 udhr/udhr_afr.txt
16.26 udhr/udhr_por.txt
16.27 udhr/udhr_src.txt
16.29 udhr/udhr_fra.txt
16.35 udhr/udhr_cha.txt
16.38 udhr/udhr_eng.txt
16.39 udhr/udhr_fuc.txt
16.40 udhr/udhr_fri.txt
16.41 udhr/udhr_tzc.txt
16.46 udhr/udhr_cak.txt
16.48 udhr/udhr_hus.txt
16.50 udhr/udhr_lnc.txt
16.51 udhr/udhr_cat.txt
16.51 udhr/udhr_ido.txt
16.63 udhr/udhr_gla.txt
16.73 udhr/udhr_cym.txt
16.73 udhr/udhr_gle.txt
16.75 udhr/udhr_als.txt
16.79 udhr/udhr_auv.txt
16.82 udhr/udhr_miq.txt
16.94 udhr/udhr_nno.txt
16.98 udhr/udhr_tzh.txt
17.00 udhr/udhr_swh.txt
17.00 udhr/udhr_kng.txt
17.01 udhr/udhr_ktu.txt
17.07 udhr/udhr_crs.txt
17.09 udhr/udhr_tet.txt
17.15 udhr/udhr_eml.txt
17.18 udhr/udhr_pon.txt
17.23 udhr/udhr_hni.txt
17.24 udhr/udhr_tzm.txt
17.27 udhr/udhr_hva.txt
17.33 udhr/udhr_arn.txt
17.48 udhr/udhr_heb.txt
17.51 udhr/udhr_bre.txt
17.54 udhr/udhr_arb.txt
17.54 udhr/udhr_pcd.txt
17.56 udhr/udhr_toj.txt
17.56 udhr/udhr_ppl.txt
17.67 udhr/udhr_tsn.txt
17.71 udhr/udhr_kmb.txt
17.77 udhr/udhr_kbp.txt
17.79 udhr/udhr_sus.txt
17.85 udhr/udhr_lin.txt
17.91 udhr/udhr_cos.txt
18.07 udhr/udhr_ckb.txt
18.09 udhr/udhr_tpi.txt
18.10 udhr/udhr_mxv.txt
18.11 udhr/udhr_ewe.txt
18.13 udhr/udhr_chk.txt
18.14 udhr/udhr_fur.txt
18.16 udhr/udhr_yap.txt
18.18 udhr/udhr_tah.txt
18.40 udhr/udhr_pau.txt
18.52 udhr/udhr_ibb.txt
18.57 udhr/udhr_hau.txt
18.58 udhr/udhr_pis.txt
18.62 udhr/udhr_hin.txt
18.66 udhr/udhr_fij.txt
18.73 udhr/udhr_lia.txt
18.76 udhr/udhr_nzi.txt
18.82 udhr/udhr_hat_kreyol.txt
18.88 udhr/udhr_sot.txt
18.90 udhr/udhr_bis.txt
18.94 udhr/udhr_maz.txt
19.02 udhr/udhr_nso.txt
19.02 udhr/udhr_ton.txt
19.14 udhr/udhr_bho.txt
19.15 udhr/udhr_vie.txt
19.18 udhr/udhr_yor.txt
19.47 udhr/udhr_bam.txt
19.52 udhr/udhr_ibo.txt
19.74 udhr/udhr_aka_fante.txt
19.76 udhr/udhr_wol.txt
19.83 udhr/udhr_snk.txt
19.85 udhr/udhr_hat_popular.txt
19.85 udhr/udhr_rar.txt
19.95 udhr/udhr_mag.txt
19.95 udhr/udhr_tbz.txt
19.96 udhr/udhr_ddn.txt
20.02 udhr/udhr_smo.txt
20.03 udhr/udhr_gaa.txt
20.05 udhr/udhr_men.txt
20.10 udhr/udhr_dag.txt
20.11 udhr/udhr_hms.txt
20.14 udhr/udhr_aka_akuapem.txt
20.17 udhr/udhr_tir.txt
20.18 udhr/udhr_bin.txt
20.18 udhr/udhr_mah.txt
20.22 udhr/udhr_hna.txt
20.30 udhr/udhr_tem.txt
20.38 udhr/udhr_blu.txt
20.42 udhr/udhr_aka_asante.txt
20.45 udhr/udhr_loz.txt
20.48 udhr/udhr_btb.txt
20.58 udhr/udhr_gkp.txt
20.65 udhr/udhr_emk.txt
20.65 udhr/udhr_hea.txt
20.80 udhr/udhr_mri.txt
20.80 udhr/udhr_ajg.txt
20.88 udhr/udhr_csa.txt
21.10 udhr/udhr_zam.txt
21.15 udhr/udhr_srr.txt
21.17 udhr/udhr_gjn.txt
21.36 udhr/udhr_pcm.txt
21.45 udhr/udhr_sag.txt
21.70 udhr/udhr_mos.txt
22.09 udhr/udhr_kri.txt
22.22 udhr/udhr_haw.txt
22.43 udhr/udhr_dga.txt
22.45 udhr/udhr_bba.txt
22.50 udhr/udhr_pnb.txt
22.52 udhr/udhr_xsm.txt
22.61 udhr/udhr_ote.txt
22.77 udhr/udhr_skr.txt
23.02 udhr/udhr_tiv.txt
23.05 udhr/udhr_lns.txt
23.22 udhr/udhr_chj.txt
23.52 udhr/udhr_bci.txt
24.68 udhr/udhr_kor.txt
24.82 udhr/udhr_fon.txt
26.56 udhr/udhr_ada.txt

Phew!

I'm interested in observations about these results. Just poking around in the head and the tail of the list turns up some interesting results, and an intriguing spread of languages.
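If you only want the two ends of the list, a sketch like the following (Python 3) pulls out the least and most spacy entries without sorting everything. The sample values are copied from the listing above, keyed by language code for brevity; extremes is a hypothetical helper name.

```python
import heapq

# A few entries copied from the full listing, keyed by language code.
sample = {'bod': 2.39, 'jpn': 2.78, 'tha': 3.91, 'eng': 16.38,
          'kor': 24.68, 'fon': 24.82, 'ada': 26.56}

def extremes(catalog, n=3):
    """Return the n lowest and n highest (key, value) pairs by value."""
    by_value = lambda kv: kv[1]
    least = heapq.nsmallest(n, catalog.items(), key=by_value)
    most = heapq.nlargest(n, catalog.items(), key=by_value)
    return least, most

least, most = extremes(sample)
print(least)
print(most)
```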