Here is the development of an idea that I suggested at the XKCD forum: treat our Beanish corpus/words as Markov chains and, using a simple maximum-likelihood algorithm, suggest possible pronunciations using the dictionary of an actual language as a reference. This suggestion makes a lot of assumptions: that the Beanish script is an alphabet (though the idea could later be expanded to the syllabary/abugida/… hypothesis), that Beanish words can be treated as Markov chains of symbols (not necessarily valid, as at least the syllable structure and the morphology may not meet the Markov assumption that the next state depends only on the current state and not on earlier ones), that the alphabet at least somewhat mirrors the phonology, that the language we use for reference is similar to Beanish, that distinctive phonetic features (i.e., the phonology) are graphically represented, that phonotactic restrictions can, indeed, be found this way, and many more. It also raises a good number of technical questions, such as what will constitute the corpus to be evaluated (token samples or token outcomes? words taken individually or entire sentences?) and what kind of smoothing, if any, should be performed (Good-Turing would be the first obvious choice, but we are not dealing with large groups and not necessarily with exclusive ones, not to mention that it seems that we won’t have any new Beanish text coming from Randall in the foreseeable future and, as a consequence, we must exclude any Bayesian “nature”).

Still, I thought it would be fun to try. Here are the results.

I took the CMU Pronouncing Dictionary and counted the transitions between phonemes (including those from the start of a word and to its end), ignoring the stress distinction for vowels. The CMUPD is far from the best choice for our situation, but I wasn’t aware of any better free dictionary to play with. Here, for instance, are the counts of transitions from the initial position (i.e., the counts for the first phoneme in English words):

{'IY': 590, 'W': 3773, 'DH': 67, 'Y': 1350, 'HH': 6656, 'CH': 1260, 'JH': 2148, 'ZH': 91, 'D': 7722, 'TH': 636, 'AA': 1898, 'B': 9632, 'AE': 2906, 'EH': 2928, 'G': 4963, 'F': 5561, 'AH': 3423, 'K': 12969, 'M': 9450, 'L': 5470, 'AO': 883, 'N': 3214, 'P': 7833, 'S': 12371, 'R': 7445, 'EY': 490, 'T': 4854, 'AW': 351, 'V': 2427, 'AY': 615, 'Z': 946, 'ER': 389, 'IH': 4133, 'UW': 84, 'SH': 2467, 'UH': 14, 'OY': 28, 'OW': 1295}
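A minimal sketch of this counting step follows. It is not the script actually used for this post: the function names and the three-word sample are made up for illustration, and it only assumes the CMUdict convention of marking stress with a trailing digit (0, 1 or 2) on vowels. The characters "{" and "}" stand for the start and the end of a word.

```python
from collections import Counter, defaultdict

START, END = "{", "}"  # chain-start and chain-end markers


def strip_stress(phoneme):
    """Drop the trailing stress digit CMUdict puts on vowels (AO1 -> AO)."""
    return phoneme.rstrip("012")


def count_transitions(pronunciations):
    """Count phoneme-to-phoneme transitions, including word start and end."""
    counts = defaultdict(Counter)
    for phonemes in pronunciations:
        chain = [START] + [strip_stress(p) for p in phonemes] + [END]
        for prev, nxt in zip(chain, chain[1:]):
            counts[prev][nxt] += 1
    return counts


# Hypothetical three-word sample in CMUdict notation
sample = [
    ["W", "AO1", "T", "ER0"],  # water
    ["W", "IH1", "N", "D"],    # wind
    ["D", "EH1", "S", "K"],    # desk
]
counts = count_transitions(sample)
print(dict(counts[START]))  # {'W': 2, 'D': 1}
```

The row `counts[START]` plays the role of the dictionary of initial counts shown above; every other phoneme gets a row of its own.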

Since smoothing is necessary (we don’t want to end up with probabilities equal to zero), I used NLTK’s implementation of the “Simple Good-Turing” algorithm from Gale & Sampson, proud to be using something first developed by Turing himself. Here are the transition probabilities from the initial position calculated with SGT ("{" and "}" are, respectively, my chain-start and chain-end markers):

{  0.00 %     EH 2.20 %    K  9.73 %    S  9.28 %
L  4.10 %     AH 2.57 %    M  7.09 %    EY 0.37 %
SH 1.85 %     N  2.41 %    P  5.87 %    OY 0.02 %
T  3.64 %     }  0.00 %    OW 0.97 %    Z  0.71 %
W  2.83 %     D  5.79 %    B  7.22 %    V  1.82 %
IH 3.10 %     AA 1.42 %    R  5.58 %    AY 0.46 %
ER 0.29 %     AE 2.18 %    F  4.17 %    IY 0.44 %
AW 0.26 %     AO 0.66 %    Y  1.01 %    UW 0.06 %
G  3.72 %     NG 0.00 %    TH 0.48 %    DH 0.05 %
HH 4.99 %     UH 0.01 %    CH 0.95 %    ZH 0.07 %
JH 1.61 %
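As a sanity check, the table can be compared with the plain relative frequencies of the initial counts listed earlier. With counts this large, smoothing should barely move the frequent phonemes, so the two should agree to two decimals. The sketch below uses no smoothing at all; the variable and function names are mine.

```python
# Counts of word-initial phonemes, copied from the dictionary above
initial_counts = {
    'IY': 590, 'W': 3773, 'DH': 67, 'Y': 1350, 'HH': 6656, 'CH': 1260,
    'JH': 2148, 'ZH': 91, 'D': 7722, 'TH': 636, 'AA': 1898, 'B': 9632,
    'AE': 2906, 'EH': 2928, 'G': 4963, 'F': 5561, 'AH': 3423, 'K': 12969,
    'M': 9450, 'L': 5470, 'AO': 883, 'N': 3214, 'P': 7833, 'S': 12371,
    'R': 7445, 'EY': 490, 'T': 4854, 'AW': 351, 'V': 2427, 'AY': 615,
    'Z': 946, 'ER': 389, 'IH': 4133, 'UW': 84, 'SH': 2467, 'UH': 14,
    'OY': 28, 'OW': 1295,
}
total = sum(initial_counts.values())


def percent(phoneme):
    """Unsmoothed relative frequency of a word-initial phoneme, in percent."""
    return 100.0 * initial_counts[phoneme] / total


for phoneme in ("K", "S", "B"):
    print(phoneme, f"{percent(phoneme):.2f} %")
# K 9.73 %
# S 9.28 %
# B 7.22 %
```

The smoothing matters mostly for the rare and unseen cases, such as NG, which never starts a word in the dictionary.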

We can now test some English words, computing the combined log probabilities for their phonemes:

water ['W', 'AO', 'T', 'ER'] -12.8270136724
desktop ['D', 'EH', 'S', 'K', 'T', 'AA', 'P'] -23.8901166969
hagiography ['HH', 'AE', 'G', 'IY', 'AA', 'G', 'R', 'AH', 'F', 'IY'] -32.0648872671

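This confirms that “water” has a sequence of phonemes with higher probability than “hagiography”, although part of the difference is simply due to length: every extra transition adds another negative term to the sum.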
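The scoring itself can be sketched as below. Note the swap: this uses additive (add-one) smoothing as a simple stand-in for Simple Good-Turing, and a made-up three-word training set, so its numbers are not comparable with the ones above. All names are hypothetical.

```python
import math
from collections import Counter, defaultdict

START, END = "{", "}"


def train(pronunciations):
    """Count transitions between consecutive symbols, start and end included."""
    counts = defaultdict(Counter)
    for phonemes in pronunciations:
        chain = [START] + list(phonemes) + [END]
        for prev, nxt in zip(chain, chain[1:]):
            counts[prev][nxt] += 1
    return counts


def log_prob(phonemes, counts, vocab_size, alpha=1.0):
    """Sum of log transition probabilities, with additive smoothing."""
    chain = [START] + list(phonemes) + [END]
    total = 0.0
    for prev, nxt in zip(chain, chain[1:]):
        row = counts.get(prev, Counter())
        p = (row[nxt] + alpha) / (sum(row.values()) + alpha * vocab_size)
        total += math.log(p)
    return total


# Toy training data (stress already removed)
toy = [["W", "AO", "T", "ER"], ["W", "IH", "N", "D"], ["D", "EH", "S", "K"]]
counts = train(toy)
symbols = {p for word in toy for p in word} | {END}

seen = log_prob(["W", "AO", "T", "ER"], counts, len(symbols))
unseen = log_prob(["K", "S", "EH", "D"], counts, len(symbols))
print(seen > unseen)  # True: the attested sequence scores higher
```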

I could now start playing around with genetic algorithms, hill-climbing or a true maximum-likelihood estimator, but decided to go for what any lazy hacker always does: generate lots of random mappings between Beanish glyphs and English phonemes and keep the one with the best score. I know, I know.
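The lazy approach looks roughly like this. It is a sketch under loud assumptions: the scorer here is a placeholder that merely counts vowel/consonant alternations, whereas the real experiment scored the transliterated corpus with the phoneme transition probabilities; the phoneme subset and all function names are made up. The three Beanish words are ones quoted in this post.

```python
import random

WORDS = ["ᓭᘖᔭᓄ", "ᓭᘊᘊ", "ᑫᘊᘊ"]          # a few words from the corpus
GLYPHS = sorted(set("".join(WORDS)))
PHONEMES = ["AA", "IY", "UW", "EH", "K", "S", "T", "M", "N", "R"]
VOWELS = {"AA", "IY", "UW", "EH"}


def toy_score(mapping):
    """Placeholder scorer: count vowel/consonant alternations in WORDS."""
    total = 0
    for word in WORDS:
        sounds = [mapping[g] for g in word]
        total += sum((a in VOWELS) != (b in VOWELS)
                     for a, b in zip(sounds, sounds[1:]))
    return total


def random_mapping(glyphs, phonemes, rng):
    """Assign each glyph a distinct phoneme chosen at random."""
    return dict(zip(glyphs, rng.sample(phonemes, len(glyphs))))


def random_search(glyphs, phonemes, score, tries=1000, seed=0):
    """Generate `tries` random mappings and keep the best-scoring one."""
    rng = random.Random(seed)
    best, best_score = None, float("-inf")
    for _ in range(tries):
        mapping = random_mapping(glyphs, phonemes, rng)
        s = score(mapping)
        if s > best_score:
            best, best_score = mapping, s
    return best, best_score


best, best_score = random_search(GLYPHS, PHONEMES, toy_score)
print(best_score, best)
```

With six glyphs and ten phonemes a thousand tries cover the space well enough; with the full glyph set and phoneme inventory they cover essentially nothing.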

The results, however, were terrible. While some mappings did perform a little better than others, there clearly was no pattern to develop from random mappings (don’t forget we are dealing with something like 10^34 different mappings). The only thing I could notice was that mappings with more vowels and high sonority in general (glides, liquids…) were performing a little better, which makes sense from a phonological point of view, but is not good for our purposes (Rosetta does not seem to have problems with the pronunciation of English, but with its syntax and vocabulary).
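One way to arrive at a figure of that order, and this is an assumption about the count rather than something stated here, is to give each of 24 Beanish symbols a distinct phoneme out of the 39 stress-less phonemes of the CMU dictionary. The number of such ordered choices is 39!/15!:

```python
import math

PHONEMES = 39  # stress-less phonemes in the CMU dictionary
SYMBOLS = 24   # assumed number of Beanish symbols to map

# Ordered choices of 24 distinct phonemes out of 39, i.e. 39! / 15!
mappings = math.perm(PHONEMES, SYMBOLS)
print(f"{mappings:.2e}")  # 1.56e+34
```

At a million mappings per second, trying them all would take on the order of 10^20 years, which is why blind sampling goes nowhere.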

I then decided to do things the right way, selecting the best mappings and swapping glyphs to try to find something better. I could have written a true genetic algorithm, but it seemed pointless, as this hill-climbing was also useless, only confirming that we should have a lot of vowels to make the result somewhat pronounceable.
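The swapping step can be sketched as a plain hill-climb: exchange the phonemes of two glyphs and keep the exchange only if the score improves. As before, the scorer is a placeholder that counts vowel/consonant alternations (the real one would be the summed log probability of the transliterated corpus), and the starting mapping and names are invented for illustration.

```python
import random

WORDS = ["ᓭᘖᔭᓄ", "ᓭᘊᘊ", "ᑫᘊᘊ"]  # a few words from the corpus
VOWELS = {"AA", "IY", "UW"}


def toy_score(mapping):
    """Placeholder scorer: count vowel/consonant alternations in WORDS."""
    total = 0
    for word in WORDS:
        sounds = [mapping[g] for g in word]
        total += sum((a in VOWELS) != (b in VOWELS)
                     for a, b in zip(sounds, sounds[1:]))
    return total


def hill_climb(mapping, score, steps=1000, seed=0):
    """Swap the phonemes of two glyphs; keep the swap only if it scores better."""
    rng = random.Random(seed)
    current = dict(mapping)
    best = score(current)
    glyphs = list(current)
    for _ in range(steps):
        a, b = rng.sample(glyphs, 2)
        current[a], current[b] = current[b], current[a]
        s = score(current)
        if s > best:
            best = s
        else:
            # no improvement: undo the swap
            current[a], current[b] = current[b], current[a]
    return current, best


# Hypothetical starting point, e.g. the best of a batch of random mappings
start = {"ᓭ": "K", "ᘖ": "S", "ᔭ": "T", "ᓄ": "AA", "ᘊ": "IY", "ᑫ": "UW"}
final, best = hill_climb(start, toy_score)
print(toy_score(start), "->", best)
```

Because only strictly better swaps are kept, the search can stall on a local optimum, and it never introduces a phoneme that was not already in the starting mapping.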

By this time I was growing more convinced that the Beanish script is an abugida, but it was still necessary to test and evaluate the alphabet hypothesis. Facing a problem the computer was not able to help me with (or with which I was unable to make the computer help me, which is essentially the same), I once more did what any scientist does and decided to map it by hand, using what I know of phonetics and linguistics (and believe me, I don’t know much). My map could be used as a basis for future developments, including the algorithms I described above, and was necessary: dealing with the arcane Beanish script is not easy, because our brains (or, at least, mine) are used to letters and phonemes. It is much harder to infer anything from bare glyphs, so guessing possible mappings between Beanish glyphs and phonetic representations might be a good idea. As usual, it will probably also develop our Beanish skills.

Here are my guesses, always assuming that the script is alphabetical and that it is unlikely that letter shapes are dependent on their position (lunate sigma, anyone?). I’ll later try to do a syllabary one.

Both ᕬ and ᓄ are rare glyphs, likely rare sounds. The first is only found in the word ᖆᕬᖉᔭ; ᖆ and ᖉ are probably vowels (but we can’t completely rule out that they are glides, liquids or even nasal stops) and ᔭ a consonant, giving a probable VCVC word-structure. Being found between vowels, it could be just anything; let’s assume it is a /ʒ/.

Regarding ᓄ, it is only found in final position, excluding the complex ᖆᓄᘈᖉᐣ word that I have discussed (which is likely a toponym and/or a compound word); we know, however, that it can be followed by the dot diacritic. It usually follows ᖆ (as in ᔪᖆᓄᐧ and ᖆᓄᘈᖉᐣᐨ, which might be related considering that ᔪ could be a prefix) or other glyphs supposed to represent vowels, but our best word, ᓭᘖᔭᓄ “water”, has what I guess to be a VCC? structure (in fact, it is one of the reasons I still haven’t dropped the syllabary hypothesis). It is still impossible to make any educated guess.

Keeping up with the analysis of word structures, we move on to the short words (at most two glyphs, not considering the diacritics): ᖆᐣᖽ / ᖊ,ᘖ / ᓭᘈ / ᓭᘖᑦ / ᖊᘊ / ᒣᖉ / ᓭᐧᘖ / ᓭᑦᐧ / ᓭᐧᖚ / ᘛᔭ / ᘛ / ᘛᐣ / ᖉᑦ, / ᖉ, / ᔪ, / ᒣᖉ and ᖉᑦ, — more short words than we’d like and expect for an alphabetic system. Still, at least ᓭ, ᘛ and ᖉ look like vowels, confirming the tendency for a VC syllable-structure; for now, let’s assign them to /a/, /e/ and /o/.

Having established that, we can digress and look at some other words, like ᓭᘊᘊ (now /aᘊᘊ/) and ᑫᘊᘊ (“Gibraltar”). ᘊ is the only glyph we find repeated (Beanish is not Italian), and as we can almost rule out it being a vowel or a glide, the repetition is probably meant to represent either a long consonant, a stronger consonant or a repeated consonant. Now, ᘊ is usually found next to what I guessed to be vowels (ᓭ, ᘛ and ᖉ), but not always: we have ᘊᒣᓭᐧᖊᔕ (that forces us to guess what ᒣ is), ᘊᘖᑫᘖᒣᐣᖚ (where it seems that ᘖ is a consonant and ᒣ a vowel), ᖆᘊᓭᒣᖊᐣᖗ (where the guess needed is about ᖆ), ᖊᘊ (where we confirm a tendency for ᖊ to be a vowel), ᑕᘊᐣᒣ (where we have the mysterious ᑕ and, even more important, a diacritic; maybe this diacritic makes a consonant the coda of the syllable!?) and the greeting ᘈᘊᘖ (where, if ᘖ is a consonant, we would probably have VCC). Let’s first accept that ᒣ and ᖊ are vowels, giving them /i/ and /u/. Now, considering that ᘊ looks like a consonant that precedes other consonants, we can guess that it has low sonority, thus being either a stop or a fricative, preferably unvoiced. Going back to ᓭᘊᘊ and ᑫᘊᘊ, my guess is that it is an /s/. People will probably like it because many have guessed that the prefix ᘊ- was a plural mark. We now also have a word, ᓭᘊᘊ /ass/ (in IPA, don’t read it in English! 😉 ) and are forced to accept that ᑫ is a vowel. Having used the five cardinal ones, I will try to keep it Romance, now going with /ɛ/ (thus having ᑫᘊᘊ /ɛss/). I can also try to solve the ᘈ problem (found, for example, both in ᓭᘈ and ᘈᘊᘖ) by going to the other side of the mouth, and assigning it to /ɔ/.

Now let’s try to solve what I left pending, the ᘖ consonant. Using my guesses, we now have it in contexts such as sɛiᐣᖚ / ɛ / aᘖᔭᓄ (don’t get so excited, it is no coincidence that I made “water” start with an /a/) / aᘖᑦ / aᐧᘖ / ᘖᐣᖗᔭ / saᘖᔭᓄ / ᖽᔕᐣᘖ / ᖚᐣᘖᖗɛ / ɔsᘖ (the greeting) / ɔ / aᔭᑦᘖ / oisᐣᘖɛ / ᘖᖆie and, which does not sound that plausible, ᘖᓄɔo. Once more, the most obvious choice would be a stop or a fricative, and a common one, likely /t/ or /f/, but we cannot keep ignoring the diacritics any longer. I still think that, if the script is alphabetic, the diacritics are either some coloring or some indication of phonetic features, either to indicate the correct pronunciation or to graphically represent some phonetic change due to word properties, things like the voicing of an unvoiced consonant between vowels. I like to think that it would help explain the mirrored ᑦ and ᐣ. Of course, to make it difficult, ᘖ is one of those glyphs that can take both ᑦ and ᐣ, leaving us with less likely features such as syllabic, aspirated, nasal release, etc. (and now I am thinking: why the hell are we assuming that the anatomy of the Beanies is like ours?). Anyway, stops can be a little bit more flexible given these probably wrong assumptions, and thus I will go with /t/.

The diacritics, then. The middle dot looks like the simplest one, as it seems to be applied to vowels and to the strange ᓄ glyph, which would make ᖚ yet another vowel (or glide, perhaps). To keep it simple, I will assign this last glyph to a vowel not strongly linked to any phonetic feature, be it height, backness or roundness: the good old mid central vowel ə. This raises some problems, particularly for those used to English phonology, because in some cases it would result in a strongly syllabic schwa (such as /ᕒəᐧ/ or /əɛt/), but we can try fixing it later (and, of course, it is just weird and unlikely, but not impossible). Back to the diacritics, we can now assign some phonetic trait to this middle dot: the best candidates would be nasal and aspirated. As aspiration is likely to be more evenly distributed among vowels and consonants (the proof is that… well, I guess so), let’s reserve it for diacritics more evenly distributed and use “nasal” or “nasal release”. Tildes on the way, guys! (yes, even with tilded schwas, which likely won’t render correctly in your system /◌̃ə/; now spend two minutes trying to pronounce it, by relaxing pretty much every single muscle in your mouth).

Now, the lunar diacritics, ᑦ and ᐣ. The first is the weirder one, as it is even used, in a single occurrence, in the initial position of a word (one of the things that makes me think that a syllabary/abugida is likely): ᑦᘈᖽᐣ. But ᘈ is an unusual glyph graphically, and maybe the diacritic is placed before it for trivial reasons, aesthetic or not. In order to create a glyph mapping, we had better not consider it. ᑦ is thus used with ᕋ, ᓭ, ᖉ, ᘈ, ᘖ, ᒣ, ᔭ, ᑕ and ᖊ (mostly vowels, but also the common consonant ᘖ), and ᐣ with ᖊ, ᒣ, ᖽ, ᘖ, ᔕ, ᖚ, ᖆ, ᘊ, ᖉ, ᘛ and ᓄ, a true “bag of stuff”. Things now get very complicated: we can try to consider ᑦ a phonetic feature such as aspiration, but as for ᐣ there is no possible educated guess; while I will translate it as “long” (either vowel or consonant), it should be taken as an indicator that some of the assumptions we made so far are very, very off.

Before going on, it is time to make sure that every syllable has a vowel. At this moment, we are still left with ᔪᖆᓄⁿ / ᔪ, / ᔭ / ᕋᖗ / ᖽᔕ:t / ᖆ:ᖽ and ᔪᑕ. Now, ᔪ and ᔭ must be vowels; let’s assign them /y/ and /ɯ/, just ’cause I like ’em. This makes ᖆ an unlikely candidate for vowelness, and we can assume it is a consonant and that ᖽ is a vowel (starting to run out of vowels, I choose /ɐ/, trying to keep them as spaced as possible).

Given these new vowels, ᑕ is now, probably, a consonant, found in ᑕs:i, saʰᑕo, əiᑕɛa, ᖆ:ᑕʰəɛ and yᑕ. We need a consonant that can be aspirated and can be found in the complex onset ‘Cs’ (assuming phonotactic restrictions are similar to those of the languages I am used to): the best choice is /p/. We also have ᖆ as a consonant, but now a more sonorant sound is less implausible (after all, we are only playing to finish this game), and I’ll go with an /m/.

Let’s now finish by working out the stubborn glyphs we still have. ᕋ is not very frequent, but we find it in ᕒəɛᕋ, / ɐeᕋʰ / ᕋᖗ. It looks like a consonant similar to ᑕ; let’s assign it to /b/. We also have ᕒ, and the only thing we know about it is that it is frequent in questions, in yᕒəⁿ / ᕒəⁿ / ᕒəɛb, ; let’s assign it a “strong” sound that clearly distinguishes it in a sentence: /ʃ/.

We get back to ᓄ, now assuming that it can have a nasal release. This is almost impossible under the constraints we have by now, but we should only start changing and swapping glyphs at the end. As it is common in final position, we can make it quasi-English-like and assign it to /ŋ/.

Our now almost impossible to pronounce language still has ᖗ, found in msaiu:ᖗ / bᖗ / t:ᖗɯ / ə:tᖗɛ. The number of vowels in the language is probably too high by now, but the only good alternative would be to make ᖗ, too, a vowel (did I say I keep thinking it is an abugida?). But I will try to restrict it, and thus assign it to /r/: /br/, for example, is not such an impossible word. Finally, ᑲ is found in saⁿᑲ; I will make it a voiceless fricative, /f/, to try to get some rhythm in this language full of vowels (I still study poetry, I can’t help it…).

We still have left the “comma” diacritic, by this reckoning yet-another-phonetic-feature, but otherwise the mapping is complete:

ᕬ –> /ʒ/ # voiced palato-alveolar sibilant
ᓭ –> /a/ # open front unrounded vowel
ᘛ –> /e/ # close-mid front unrounded vowel
ᖉ –> /o/ # close-mid back rounded vowel
ᒣ –> /i/ # close front unrounded vowel
ᖊ –> /u/ # close back rounded vowel
ᘊ –> /s/ # voiceless alveolar sibilant
ᑫ –> /ɛ/ # open-mid front unrounded vowel
ᘈ –> /ɔ/ # open-mid back rounded vowel
ᘖ –> /t/ # voiceless alveolar stop
ᖚ –> /ə/ # mid central vowel
ᔪ –> /y/ # close front rounded vowel
ᔭ –> /ɯ/ # close back unrounded vowel
ᑕ –> /p/ # voiceless bilabial stop
ᖽ –> /ɐ/ # near-open central vowel
ᖆ –> /m/ # bilabial nasal
ᕋ –> /b/ # voiced bilabial stop
ᕒ –> /ʃ/ # voiceless palato-alveolar sibilant
ᓄ –> /ŋ/ # velar nasal
ᖗ –> /r/ # alveolar trill
ᑲ –> /f/ # voiceless labiodental fricative
ᐧ –> /_ⁿ/ # nasal release
ᑦ –> /_ʰ/ # aspirated
ᐣ –> /_:/ # long vowel or geminated consonant
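For anyone who wants to play with the mapping, here is a small sketch that applies it to a Beanish word. The dictionary is copied from the table above; the function name is mine, and the comma diacritic and any unlisted glyph are simply passed through unchanged.

```python
# Glyph-to-sound guesses, copied from the mapping table above
MAPPING = {
    "ᕬ": "ʒ", "ᓭ": "a", "ᘛ": "e", "ᖉ": "o", "ᒣ": "i", "ᖊ": "u",
    "ᘊ": "s", "ᑫ": "ɛ", "ᘈ": "ɔ", "ᘖ": "t", "ᖚ": "ə", "ᔪ": "y",
    "ᔭ": "ɯ", "ᑕ": "p", "ᖽ": "ɐ", "ᖆ": "m", "ᕋ": "b", "ᕒ": "ʃ",
    "ᓄ": "ŋ", "ᖗ": "r", "ᑲ": "f",
    # diacritics
    "ᐧ": "ⁿ",  # nasal release
    "ᑦ": "ʰ",  # aspirated
    "ᐣ": ":",  # long vowel or geminated consonant
}


def transliterate(word):
    """Replace each glyph or diacritic by its guessed sound; keep unknowns."""
    return "".join(MAPPING.get(ch, ch) for ch in word)


for word in ["ᓭᘖᔭᓄ", "ᑕᘊᐣᒣ", "ᔪᖆᓄᐧ"]:
    print(word, "/" + transliterate(word) + "/")
# ᓭᘖᔭᓄ /atɯŋ/
# ᑕᘊᐣᒣ /ps:i/
# ᔪᖆᓄᐧ /ymŋⁿ/
```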

Do I really need to say that I am not satisfied with this mapping? There are far too many vowels, there is far less symmetry than what we’d expect from a plausible language, and while it is generally pronounceable (for example /atɯŋ/ for ᓭᘖᔭᓄ, “water”) we have bizarre things like /ps:i/ for ᑕᘊᐣᒣ and unacceptable ones like /ymŋⁿ/ for ᔪᖆᓄᐧ. We could try to fix some of these later, but it just doesn’t seem right.

It is now time to investigate the hypothesis of the Beanish script as an abugida; I’ll do it in the next post.