Conversation with #blogamundo at 2007-03-23 14:50:24 on me@irc.freenode.net (irc)
(14:50:24) The topic for #blogamundo is: in which me attempts to woo people into a wikipedia project; http://en.wikipedia.org/wiki/Wikipedia_talk:WikiProject_Endangered_languages#Perhaps_we_should_all_pick_a_single_language_to_work_on__for_a_while.3F
(14:50:26) me: meep
(14:50:38) me: i pingeth amigo friend
(14:54:25) friend: plon
(14:55:26) me: hey chrys
(14:55:32) me: i did sometin neato :P
(14:55:45) me: it is not that complex but it's interesting
(14:55:48) me: some python code
(14:56:25) friend: ah?
(15:02:59) me: http://ruphus.com/stash/out.txt
(15:03:01) me: yeah that's the output
(15:03:10) me: the input looks like this:
(15:03:36) me: http://ruphus.com/stash/samplein.txt
(15:03:55) me: it's a suuuuper dumb attempt at finding a transliteration correspondence
(15:04:45) me: not bad for 25 lines hehe
(15:14:38) friend: sorry had to chat about a client problem
(15:17:21) me: no worries
(15:19:05) friend: was is das letzte in jeder zeile? [German: "what is the last thing in every line?"]
(15:20:00) friend: for your amusement -- how my clients talk to me (this channel isn't publicly logged is it?)
(15:20:16) friend: "Thanks for this explanation but it is not acceptable for me. I can understand that you “locked” the number when the message is archived and they might be a difference when you get the data but we are talking of 30 000 emails less than the initial targeting with “your” technical approach !!!   Please Find a solution or a quick fix to make things happened !"
(15:21:28) me: what is the last in every headline?
(15:21:36) friend: oops
(15:21:44) friend: that was german wasn't it :)
(15:21:53) friend: DEVANAGARI LETTER RA र d
(15:21:57) friend: the last in every line
(15:21:58) friend: the d
(15:22:08) me: yeah that one is wrong :)
(15:22:18) me: RA and VIRAMA like to screw things up
(15:22:19) friend: DEVANAGARI LETTER RA र n
(15:22:26) friend: DEVANAGARI VOWEL SIGN AA ा l
(15:22:32) friend: DEVANAGARI DIGIT ZERO ० 0
(15:22:39) me: heh, that's right!
(15:22:40) me: yay :)
(15:22:42) friend: the last is a transliteration ... that's what you're heading at?
(15:22:49) me: so like
(15:23:06) me: the assumptions go like this:
(15:23:32) me: IF two words are in fact transliterations of each other, then they are sort of the same length right
(15:23:33) me: i mean
(15:23:44) friend: hm
(15:23:46) friend: depends
(15:23:47) friend: like
(15:23:48) me: they won't be like, a factor of forty bazillion different in length
(15:24:06) friend: there's loads of stuff that gets transliterated in different numbers of chars
(15:24:09) friend: like greek
(15:24:11) friend: X = ch
(15:24:15) me: well, let's put it this way: they're more likely to be of similar length than randomly chosen english and hindi terms
(15:24:19) me: yep, that's right
(15:24:22) friend: ok yeah
(15:24:30) me: and in the case of devanagari, it's kind of a syllabary
(15:24:33) friend: but that's a bit too .. statistical innit :)
(15:24:33) me: so that comes into play too
(15:24:43) me: oh i'm all about things statistical :)
(15:24:49) me: so then
(15:25:01) me: let's pretend instead of hindi, we're going from "lowercase" to "UPPERCASE"
(15:25:10) friend: ok
(15:25:15) me: so the "transliteration" might be, say, house -> HOUSE
(15:25:23) me: in that case, you can just do:
(15:25:32) me: zip(list('house'), list('HOUSE'))
(15:25:45) me: >>> zip(list('house'), list('HOUSE'))
(15:25:45) me: [('h', 'H'), ('o', 'O'), ('u', 'U'), ('s', 'S'), ('e', 'E')]
(15:25:47) me: obviously
(15:25:52) me: but
(15:25:54) me: what about this:
(15:26:04) me: >>> zip(list('house'), list('hause'))
(15:26:04) me: [('h', 'h'), ('o', 'a'), ('u', 'u'), ('s', 's'), ('e', 'e')]
(15:26:05) me: er
(15:26:09) me: is that a german form for house? heh
(15:26:11) me: ooops
(15:26:15) friend: not QUITE
(15:26:18) friend: it's Haus
(15:26:19) me: >>> zip(list('house'), list('Haus'))
(15:26:19) me: [('h', 'H'), ('o', 'a'), ('u', 'u'), ('s', 's')]
(15:26:32) me: pretty good match there, as it happens
(15:26:33) friend: that'll be kinda english to kinda german then
(15:26:39) me: yah, kinda pointless
(15:26:48) friend: giggle ... no!
(15:26:49) me: i mean, it's much more interesting to do across alphabets
(15:26:56) friend: strange thing .. like ...
(15:27:05) friend: english != what's in the books
(15:27:25) friend: but english ~= what a literate native speaker can decipher
(15:28:15) me: true dat
(15:28:28) me: wait wait let me annoy you more with my thingie
(15:28:28) me: haha
(15:28:30) me: preeeeze
(15:28:33) me: i yearn for approval
(15:28:34) me: haha
(15:28:38) me: APPROVE ME
(15:28:39) me: heh
(15:29:09) me: so here's the interesting bit
(15:29:30) me: because the words are of different lengths (as you mention, c might correspond to 'ch' or something)
(15:29:43) me: they might align at the beginning, but then get thrown off by the end
(15:29:47) me: behold:
(15:30:15) me: >>> zip(reversed(list('house')), reversed(list('Haus')))
(15:30:15) me: [('e', 's'), ('s', 'u'), ('u', 'a'), ('o', 'H')]
(15:30:34) me: crummy results in that case, just because of the way house and Haus happen to be
(15:30:44) me: but sometimes you catch more real correspondences that way
(15:30:46) ***friend approves me
(15:31:04) friend: huh
(15:31:15) me: so, what i do is, take the first style of correspondences, then the reversed style, push them in a huge list, count, sort, output
(15:31:16) me: that's it
(15:32:29) friend: uh ...
(15:32:33) ***friend has a fried brain
(15:35:58) me: http://ruphus.com/stash/hinditranslit_py.txt
(15:43:31) friend: http://itre.cis.upenn.edu/~myl/languagelog/archives/004331.html this is cool
(15:44:14) friend: sys.stdin = codecs.getreader('utf-8')(sys.stdin) that's kinda cool too
(15:44:14) me: wow, that rules
(15:44:20) me: yeah amigo taught me that
(15:45:42) friend: that's kinda ... you redefine an object there, innit?
(15:45:59) me: yeah seems to
(15:46:10) me: but if amigo says it's ok i don't doubt that it is, heh
(15:47:21) friend: my poor team member who's a native speaker of seychellois creole had part of his newfound assurance in french destroyed
(15:47:50) friend: when our french account management team trashed his grammatical spelling in a client communication. poor boy.
(15:48:11) friend: yeah he needs to improve his written french, but ... those ppl are just so insensitive.
(15:49:43) me: he should write everything in seychellois heh
(15:50:05) me: http://ruphus.com/stash/out3.txt english to french, heh
(15:51:45) me: wow, even japanese catches some
(15:51:49) me: i don't really know what the point of this is
(15:51:52) me: but it's addictive
(15:51:52) me: heh
(15:52:36) me: pat@gwin:~/Desktop$ cat out-enja.txt |grep 'LETTER RA'
(15:52:36) me: KATAKANA LETTER RA ラ
(15:52:36) me: KATAKANA LETTER RA ラ l
(15:52:36) me: KATAKANA LETTER RA ラ r
(15:52:36) me: KATAKANA LETTER RA ラ a
(15:52:49) me: sorry, i'll stop spitting random data at you hehe
(16:04:19) me has changed the topic to: http://christianflury.com/blog/2007/03/quite_some_characters_a_unicod.html
(18:58:35) friend left the room (quit: "Leaving.").
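
The approach described at 15:31:15 (pair characters from the front, pair them again from the back, pool everything, count, sort, output) could be sketched roughly as below. This is not the script linked in the log, which is not reproduced here. It is a minimal Python 3 sketch under assumptions: the function names, the word-pair input format, and the mouse/Maus sample pair are all hypothetical, and the Unicode-name column only imitates the output lines quoted in the chat.

```python
from collections import Counter
import unicodedata


def aligned_pairs(source, target):
    """Pair characters from the start of both words, then again from the end."""
    forward = list(zip(source, target))
    backward = list(zip(reversed(source), reversed(target)))
    return forward + backward


def count_correspondences(word_pairs):
    """Count every character pairing seen across all word pairs."""
    counts = Counter()
    for source, target in word_pairs:
        counts.update(aligned_pairs(source, target))
    return counts


# Hypothetical sample input: (source word, candidate transliteration).
# house/Haus comes from the chat; mouse/Maus is added for illustration.
word_pairs = [("house", "Haus"), ("mouse", "Maus")]

counts = count_correspondences(word_pairs)

# Most frequent pairings first, with the Unicode name of the source character.
for (src, tgt), n in counts.most_common():
    print(n, unicodedata.name(src, "UNKNOWN"), src, tgt)
```

With only two word pairs the counts mean little; the idea in the chat is that over a large list, real correspondences should accumulate higher counts than chance pairings, while words of very different lengths add noise.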