hou tu pranownse binish

Here is the development of an idea that I suggested at the XKCD forum: treat our Beanish corpus/words as Markov chains and, using a simple Maximum-Likelihood algorithm, suggest possible pronunciations using the dictionary of an actual language as a reference. This suggestion makes a lot of assumptions: that the Beanish script is an alphabet (but the idea could later be expanded to the syllabary/abugida/… hypothesis), that Beanish words can be treated as Markov chains of symbols (not necessarily valid, as at least the syllable structure and the morphology may violate the Markov-chain assumption that the next state depends only on the current state and not on earlier ones), that the alphabet at least somewhat mirrors the phonology, that the language we use for reference is similar to Beanish, that distinctive phonetic features (i.e., the phonology) are graphically represented, that phonotactic restrictions can, indeed, be found this way, and many more. It also raises a good number of technical questions, such as what will constitute the corpus to be evaluated (token samples or token outcomes? words taken individually or entire sentences?) and what kind of smoothing, if any, should be performed (Good-Turing would be the first obvious choice, but we are not dealing with large groups and not necessarily with exclusive ones, not to mention that it seems that we won’t have any new Beanish text coming from Randall in the foreseeable future and, as a consequence, we must exclude any Bayesian “nature”).

Still, I imagined it would be fun to do. Here are the results.

I took the CMU Pronouncing Dictionary and calculated the transitions (including as first and last symbol) for each phoneme, ignoring the stress distinction for vowels. The CMUPD is far from the best choice for our situation, but I wasn’t aware of any better free dictionary to play with. Here, for instance, are the counts of transitions from the initial position (i.e., the count for the first phoneme in English words):

{'IY': 590, 'W': 3773, 'DH': 67, 'Y': 1350, 'HH': 6656, 'CH': 1260, 'JH': 2148, 'ZH': 91, 'D': 7722, 'TH': 636, 'AA': 1898, 'B': 9632, 'AE': 2906, 'EH': 2928, 'G': 4963, 'F': 5561, 'AH': 3423, 'K': 12969, 'M': 9450, 'L': 5470, 'AO': 883, 'N': 3214, 'P': 7833, 'S': 12371, 'R': 7445, 'EY': 490, 'T': 4854, 'AW': 351, 'V': 2427, 'AY': 615, 'Z': 946, 'ER': 389, 'IH': 4133, 'UW': 84, 'SH': 2467, 'UH': 14, 'OY': 28, 'OW': 1295}
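The counting step can be sketched as follows. This is a minimal sketch, not the script actually used for this post: the three entries are a hand-typed sample in the dictionary's "word followed by phonemes" layout (the stress digits are illustrative), and the function names are made up.

```python
from collections import Counter, defaultdict

START, END = "{", "}"  # chain-start and chain-end markers, as in the post

def strip_stress(phoneme):
    """Drop the stress digit from a vowel: 'AO1' -> 'AO'."""
    return phoneme.rstrip("012")

def count_transitions(entries):
    """Count phoneme-to-phoneme transitions, including start and end."""
    counts = defaultdict(Counter)
    for entry in entries:
        _word, *phonemes = entry.split()
        chain = [START] + [strip_stress(p) for p in phonemes] + [END]
        for prev, nxt in zip(chain, chain[1:]):
            counts[prev][nxt] += 1
    return counts

# Hypothetical three-line sample; the real dictionary has many more entries.
SAMPLE = [
    "WATER W AO1 T ER0",
    "DESKTOP D EH1 S K T AA2 P",
    "WALL W AO1 L",
]

transitions = count_transitions(SAMPLE)
print(dict(transitions[START]))  # counts of word-initial phonemes
```

Reading `transitions["{"]` gives the kind of word-initial counts listed above; every other key gives the counts of what follows that phoneme.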

Since smoothing is necessary (we don’t want to end up with probabilities equal to zero), I used NLTK’s implementation of the “Simple Good Turing” algorithm from Gale & Sampson, proud to be using something first developed by Turing himself. Here are the transition probabilities calculated with SGT ("{" and "}" are, respectively, my chain-start and chain-end markers):

{  0.00 %     EH 2.20 %    K  9.73 %    S  9.28 %
L  4.10 %     AH 2.57 %    M  7.09 %    EY 0.37 %
SH 1.85 %     N  2.41 %    P  5.87 %    OY 0.02 %
T  3.64 %     }  0.00 %    OW 0.97 %    Z  0.71 %
W  2.83 %     D  5.79 %    B  7.22 %    V  1.82 %
IH 3.10 %     AA 1.42 %    R  5.58 %    AY 0.46 %
ER 0.29 %     AE 2.18 %    F  4.17 %    IY 0.44 %
AW 0.26 %     AO 0.66 %    Y  1.01 %    UW 0.06 %
G  3.72 %     NG 0.00 %    TH 0.48 %    DH 0.05 %
HH 4.99 %     UH 0.01 %    CH 0.95 %    ZH 0.07 %
JH 1.61 %
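To show why smoothing matters, here is a minimal sketch that turns raw counts into probabilities with plain add-one (Laplace) smoothing. It is a deliberately simpler stand-in for Simple Good-Turing and is not what produced the table above; the function name is invented, and only a handful of the initial counts are used.

```python
def smoothed_probabilities(counts, vocabulary, k=1.0):
    """Add-k smoothing: every symbol in `vocabulary` gets a nonzero share."""
    total = sum(counts.get(s, 0) for s in vocabulary) + k * len(vocabulary)
    return {s: (counts.get(s, 0) + k) / total for s in vocabulary}

# A few of the word-initial counts listed earlier, plus NG, which is absent
# from those counts (it never starts a word in the dictionary).
initial_counts = {"K": 12969, "S": 12371, "UH": 14}
vocabulary = ["K", "S", "UH", "NG"]

probs = smoothed_probabilities(initial_counts, vocabulary)
for symbol in vocabulary:
    print(f"{symbol:2s} {100 * probs[symbol]:.2f} %")
```

The unseen NG ends up with a tiny but nonzero probability, so its logarithm stays finite when word scores are summed.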

We can now test some English words, computing the combined log probabilities for their phonemes:

water ['W', 'AO', 'T', 'ER'] -12.8270136724
desktop ['D', 'EH', 'S', 'K', 'T', 'AA', 'P'] -23.8901166969
hagiography ['HH', 'AE', 'G', 'IY', 'AA', 'G', 'R', 'AH', 'F', 'IY'] -32.0648872671

Which confirms that “water” has a sequence of phonemes with higher probability than “hagiography”.
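The scoring can be sketched like this. The probabilities below are made up just to exercise the function, the natural logarithm is an assumption (the base behind the scores above is not stated), and `floor` is a hypothetical fallback for transitions missing from the table.

```python
import math

START, END = "{", "}"

def log_probability(phonemes, probs, floor=1e-6):
    """Sum of log transition probabilities along the chain { p1 p2 ... }."""
    chain = [START] + list(phonemes) + [END]
    total = 0.0
    for prev, nxt in zip(chain, chain[1:]):
        total += math.log(probs.get((prev, nxt), floor))
    return total

# Made-up probabilities, only to exercise the function.
toy_probs = {
    ("{", "W"): 0.03, ("W", "AO"): 0.10, ("AO", "T"): 0.05,
    ("T", "ER"): 0.08, ("ER", "}"): 0.30,
}

print(log_probability(["W", "AO", "T", "ER"], toy_probs))
```

Because each transition adds a negative term, longer words tend to score lower, which is worth keeping in mind when comparing “water” with “hagiography”.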

I could now start playing around with genetic algorithms, hill-climbing or a true maximum-likelihood estimator, but decided to go for what any lazy hacker always does: generate lots of random mappings between Beanish glyphs and English phonemes and keep the one with the best score. I know, I know.
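For the record, the lazy approach looks roughly like this. It is a sketch under toy assumptions: Latin letters stand in for glyphs, and the scorer simply rewards vowels instead of using the dictionary-based log probabilities; all names are hypothetical.

```python
import random

def random_mapping(glyphs, phonemes, rng):
    """Assign each glyph a distinct phoneme, chosen at random."""
    return dict(zip(glyphs, rng.sample(phonemes, len(glyphs))))

def best_random_mapping(words, glyphs, phonemes, score, tries=1000, seed=0):
    """Try `tries` random mappings and keep the best-scoring one."""
    rng = random.Random(seed)
    best, best_score = None, float("-inf")
    for _ in range(tries):
        mapping = random_mapping(glyphs, phonemes, rng)
        total = sum(score([mapping[g] for g in word]) for word in words)
        if total > best_score:
            best, best_score = mapping, total
    return best, best_score

# Toy scorer (an assumption): one point per vowel in the mapped word.
def toy_score(phonemes):
    return sum(1 for p in phonemes if p in {"AA", "IY", "UW"})

mapping, total = best_random_mapping(
    ["abc", "cab", "ba"], "abc", ["AA", "IY", "UW", "K", "S", "T"], toy_score
)
print(mapping, total)
```

With three glyphs this finds the optimum easily; with the full glyph set the search space is far too large for blind sampling, as the next paragraph shows.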

The results, however, were terrible. While some mappings did perform a little better than others, there clearly was no pattern to develop from random mappings (don’t forget we are dealing with something like 10^34 different mappings). The only thing I could notice was that mappings with more vowels and high sonority in general (glides, liquids…) were performing a little better, which makes sense from a phonological point of view, but it is not good for our purposes (Rosetta does not seem to have problems with the pronunciation of English, but with its syntax and vocabulary).
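One way to arrive at a number of that size, offered as an assumption about how the count can be reconstructed rather than as the original calculation: give each of 24 symbols (21 glyphs plus 3 diacritics) a different phoneme out of the 39 in the dictionary.

```python
import math

PHONEMES = 39   # assumed: the phoneme inventory once stress is ignored
SYMBOLS = 24    # assumed: 21 base glyphs plus 3 diacritics

# Ordered choices of a distinct phoneme for each symbol: 39! / (39 - 24)!
mappings = math.perm(PHONEMES, SYMBOLS)
print(f"{mappings:.2e}")
```

Under these assumptions the result is of the order of 10^34, far beyond what random sampling can explore.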

I then decided to do things the right way, selecting the best mappings and swapping glyphs trying to find something better. I could have written a true genetic algorithm, but it seemed useless as this hill-climbing was also useless, only confirming that we should have a lot of vowels to make it somewhat pronounceable.
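The swapping procedure amounts to a basic hill climb, which can be sketched as below. The objective is a toy one (an assumption for illustration): it counts how many glyphs agree with a hidden target mapping, whereas the real score came from the dictionary model.

```python
import random

def hill_climb(mapping, score, steps=2000, seed=0):
    """Swap the phonemes of two glyphs; keep the swap only if it improves."""
    rng = random.Random(seed)
    current = dict(mapping)
    current_score = score(current)
    glyphs = list(current)
    for _ in range(steps):
        a, b = rng.sample(glyphs, 2)
        current[a], current[b] = current[b], current[a]
        new_score = score(current)
        if new_score > current_score:
            current_score = new_score
        else:
            current[a], current[b] = current[b], current[a]  # undo the swap
    return current, current_score

# Toy objective: number of glyphs mapped as in a hidden target mapping.
target = {"a": "AA", "b": "B", "c": "K", "d": "D"}
start = {"a": "D", "b": "K", "c": "AA", "d": "B"}

def toy_score(mapping):
    return sum(mapping[g] == target[g] for g in mapping)

best, best_score = hill_climb(start, toy_score)
print(best, best_score)
```

This toy objective has no local optima, so the climb always succeeds; the real scoring function offers no such guarantee, which may be part of why the swaps led nowhere.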

By this time I was reinforcing my guess that the Beanish script is an abugida, but it was necessary to test and evaluate the alphabet hypothesis. Facing a problem the computer was not able to help me with (or with which I was unable to make the computer help me, which is essentially the same), I once more did what any scientist does and decided to map it by hand, using what I know of phonetics and linguistics (and believe me, I don’t know much). My map could be used as a basis for future developments, including the algorithms I described above, and was necessary: dealing with the arcane Beanish script is not easy, because our brains (or, at least, mine) are used to letters and phonemes. It is much harder to infer anything, and guessing possible mappings between Beanish glyphs and phonetic representations might be a good idea. As usual, it will probably also develop our Beanish skills.

Here are my guesses, always assuming that the script is alphabetical and that it is unlikely that letter shapes are dependent on their position (lunate sigma, anyone?). I’ll later try to do a syllabary one.

Both ᕬ and ᓄ are rare glyphs, likely rare sounds. The first is only found in the word ᖆᕬᖉᔭ; ᖆ and ᖉ are probably vowels (but we can’t completely rule out that they are glides, liquids or even nasal stops) and ᔭ a consonant, giving a probable VCVC word-structure. Being found between vowels, it could be just anything; let’s assume it is a /ʒ/.

Regarding ᓄ, it is only found in final position, excluding the complex ᖆᓄᘈᖉᐣ word that I have discussed (and is likely a toponym and/or a compound word); we know, however, that it can be followed by the dot diacritic. It usually follows ᖆ (as in ᔪᖆᓄᐧ and ᖆᓄᘈᖉᐣᐨ, which might be related considering that ᔪ could be a prefix) or other glyphs supposed to represent vowels, but our best word, ᓭᘖᔭᓄ “water”, has what I guess to be a VCC? structure (in fact, it is one of the reasons I still haven’t dropped the syllabary hypothesis). Still impossible to make any educated guess.

Keeping up with the analysis of word structures, we move on to the short words (at most two glyphs, not considering the diacritics): ᖆᐣᖽ / ᖊ,ᘖ / ᓭᘈ / ᓭᘖᑦ / ᖊᘊ / ᒣᖉ / ᓭᐧᘖ / ᓭᑦᐧ / ᓭᐧᖚ / ᘛᔭ / ᘛ / ᘛᐣ / ᖉᑦ, / ᖉ, / ᔪ, / ᒣᖉ and ᖉᑦ, — more short words than we’d like and expect for an alphabetic system. Still, at least ᓭ, ᘛ and ᖉ look like vowels, confirming the tendency for a VC syllable-structure; for now, let’s assign them to /a/, /e/ and /o/.

Having established that, we can digress and look at some other words, like ᓭᘊᘊ (now /aᘊᘊ/) and ᑫᘊᘊ (“Gibraltar”). ᘊ is the only glyph we find repeated (Beanish is not Italian), and as we can almost rule out it being a vowel or a glide, the repetition probably represents either a long consonant, a stronger consonant or a repeated consonant. Now, ᘊ is usually found next to what I guessed to be vowels (ᓭ, ᘛ and ᖉ), but not always: we have ᘊᒣᓭᐧᖊᔕ (that forces us to guess what ᒣ is), ᘊᘖᑫᘖᒣᐣᖚ (where it seems that ᘖ is a consonant and ᒣ a vowel), ᖆᘊᓭᒣᖊᐣᖗ (where the guess needed is about ᖆ), ᖊᘊ (where we confirm a tendency for ᖊ to be a vowel), ᑕᘊᐣᒣ (where we have the mysterious ᑕ and, even more important, a diacritic; maybe this diacritic makes a consonant the coda of the syllable!?) and the greeting ᘈᘊᘖ (where, if ᘖ is a consonant, we would probably have VCC). Let’s first accept that ᒣ and ᖊ are vowels, giving them /i/ and /u/. Now, considering that ᘊ looks like a consonant that precedes other consonants, we can guess that it has low sonority, thus being either a stop or a fricative, preferably unvoiced. Going back to ᓭᘊᘊ and ᑫᘊᘊ, my guess is that it is an /s/. People will probably like it because many have guessed that the prefix ᘊ- was a plural mark. We now also have a word, ᓭᘊᘊ /ass/ (in IPA, don’t read it in English! 😉 ) and are forced to accept that ᑫ is a vowel. Having used the five cardinal ones, I will try to keep it Romance, now going with /ɛ/ (thus having ᑫᘊᘊ /ɛss/). I can also try to solve the ᘈ problem (found, for example, both in ᓭᘈ and ᘈᘊᘖ) by going to the other side of the mouth, and assigning it to /ɔ/.

Now for what I left pending, the ᘖ consonant. Using my guesses, we now have it in contexts such as sɛiᐣᖚ / ɛ / aᘖᔭᓄ (don’t get so excited, it is no coincidence that I made “water” start with an /a/) / aᘖᑦ / aᐧᘖ / ᘖᐣᖗᔭ / saᘖᔭᓄ / ᖽᔕᐣᘖ / ᖚᐣᘖᖗɛ / ɔsᘖ (the greeting) / ɔ / aᔭᑦᘖ / oisᐣᘖɛ / ᘖᖆie and, which does not sound that plausible, ᘖᓄɔo. Once more, the most obvious choice would be a stop or fricative, a common one such as /t/ or /f/, but I cannot put off considering the diacritics any longer. I still think that, if the script is alphabetic, the diacritics are either some coloring or some indication of phonetic features, either to indicate the correct pronunciation or to graphically represent some phonetic change due to word properties, things like the voicing of an unvoiced consonant between vowels. I like to think that it would help explain the mirrored ᑦ and ᐣ. Of course, to make it difficult, ᘖ is one of those glyphs that can take both ᑦ and ᐣ, leaving us with less likely features such as syllabic, aspirated, nasal release, etc. (and now I am thinking: why the hell are we assuming that the anatomy of the Beanies is like ours?). Anyway, stops can be a little bit more flexible given these probably wrong assumptions, and thus I will go with /t/.

The diacritics, then. The middle dot looks like the simplest one, as it seems to be applied to vowels and to the strange ᓄ glyph, which would make ᖚ yet another vowel (or glide, perhaps). To make it simple, I will assign this last glyph to a vowel not strongly linked to any phonetic feature, be it height, backness or roundness: the good old mid central vowel ə. This raises some problems, particularly for those used to English phonology, because in some cases it would result in a strongly syllabic schwa (such as /ᕒəᐧ/ or /əɛt/), but we can try fixing it later (and, of course, it is just weird and unlikely, but not impossible). Back to the diacritics, we can now assign some phonetic trait to this middle dot: the best ones would be nasal and aspirated. As aspiration is likely to be more evenly distributed among vowels and consonants (the proof is that… well, I guess so), let’s reserve it for diacritics that are more evenly distributed and use “nasal” or “nasal release”. Tildes on the way, guys! (yes, even with tilded schwas, which likely won’t render correctly in your system /◌̃ə/; and now spend two minutes trying to pronounce it, by relaxing pretty much every single muscle in your mouth).

Now, the lunar diacritics, ᑦ and ᐣ. The first is the weirder one, as it is even used, in a single occurrence, in the initial position of the word (one of the things that makes me think that a syllabary/abugida is likely): ᑦᘈᖽᐣ. But ᘈ is an unusual glyph graphically, and maybe the diacritic is placed before it for trivial reasons, aesthetic or not. In order to try to create a glyph mapping, we had better not consider it. ᑦ is thus used with ᕋ, ᓭ, ᖉ, ᘈ, ᘖ, ᒣ, ᔭ, ᑕ and ᖊ (mostly vowels, but also the common consonant ᘖ), and ᐣ with ᖊ, ᒣ, ᖽ, ᘖ, ᔕ, ᖚ, ᖆ, ᘊ, ᖉ, ᘛ and ᓄ, a true “bag of stuff”. Things now get very complicated: we can try to consider ᑦ a phonetic feature such as aspiration, but as for ᐣ there is no possible educated guess; while I will translate it as “long” (either vowel or consonant), it should be taken as an indicator that some of the assumptions we made so far are very, very off.

Before going on, it is time to make sure that every syllable has a vowel. At this moment, we are still left with ᔪᖆᓄⁿ / ᔪ, / ᔭ / ᕋᖗ / ᖽᔕ:t / ᖆ:ᖽ and ᔪᑕ. Now, ᔪ and ᔭ must be vowels, let’s assign them /y/ and /ɯ/, just ’cause I like ’em; this makes ᖆ an unlikely candidate for vowelness, and we can assume it is a consonant and that ᖽ is a vowel (starting to run out of vowels, I choose /ɐ/, trying to keep them as spaced as possible).

Given these new vowels, ᑕ is now, probably, a consonant, found in ᑕs:i, saʰᑕo, əiᑕɛa, ᖆ:ᑕʰəɛ and yᑕ. We need a consonant that can be aspirated and can be found in the complex onset ‘Cs’ (assuming phonotactic restrictions are similar to those of the languages I am used to): the best choice is /p/. We also have ᖆ as a consonant, but now something more sonorant is less implausible (after all, we are only playing to finish this game), and I’ll go with an /m/.

Let’s now finish by working out the stubborn glyphs we still have. ᕋ is not very frequent, but we find it in ᕒəɛᕋ, / ɐeᕋʰ / ᕋᖗ . It looks like a consonant similar to ᑕ; let’s assign it to /b/. We also have ᕒ, and the only thing we know about it is that it is frequent in questions, in yᕒəⁿ / ᕒəⁿ / ᕒəɛb, ; let’s assign it a “strong” sound that clearly distinguishes it in a sentence: /ʃ/.

We get back to ᓄ, now assuming that it can have a nasal release. This is almost impossible given the constraints we have by now, but we should only start changing and swapping glyphs at the end. As it is common in final position, we can make it quasi-English-like and assign it to /ŋ/.

Our now almost impossible to pronounce language still has ᖗ, found in msaiu:ᖗ / bᖗ / t:ᖗɯ / ə:tᖗɛ. The number of vowels in the language is probably too high by now, but the only good alternative would be to make ᖗ, too, a vowel (did I say I keep thinking it is an abugida?). But I will try to restrict it, and thus assign it to /r/: /br/, for example, is not such an impossible word. Finally, ᑲ is found in saⁿᑲ; I will make it a voiceless fricative, /f/, to try to get some rhythm in this language full of vowels (I still study poetry, I can’t help it…).

We still have left the “comma” diacritic, by this reckoning yet-another-phonetic-feature, but otherwise the mapping is complete:

ᕬ –> /ʒ/ # voiced palato-alveolar sibilant
ᓭ –> /a/ # open front unrounded vowel
ᘛ –> /e/ # close-mid front unrounded vowel
ᖉ –> /o/ # close-mid back rounded vowel
ᒣ –> /i/ # close front unrounded vowel
ᖊ –> /u/ # close back rounded vowel
ᘊ –> /s/ # voiceless alveolar sibilant
ᑫ –> /ɛ/ # open-mid front unrounded vowel
ᘈ –> /ɔ/ # open-mid back rounded vowel
ᘖ –> /t/ # voiceless alveolar stop
ᖚ –> /ə/ # mid central vowel
ᔪ –> /y/ # close front rounded vowel
ᔭ –> /ɯ/ # close back unrounded vowel
ᑕ –> /p/ # voiceless bilabial stop
ᖽ –> /ɐ/ # near-open central vowel
ᖆ –> /m/ # bilabial nasal
ᕋ –> /b/ # voiced bilabial stop
ᕒ –> /ʃ/ # voiceless palato-alveolar sibilant
ᓄ –> /ŋ/ # velar nasal
ᖗ –> /r/ # alveolar trill
ᑲ –> /f/ # voiceless labiodental fricative
ᐧ –> /_ⁿ/ # nasal release
ᑦ –> /_ʰ/ # aspirated
ᐣ –> /_:/ # long vowel or geminated consonant
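For anyone who wants to play with the mapping, here is a minimal sketch that applies it symbol by symbol. The function name is made up, and treating diacritics as plain suffix characters (ⁿ, ʰ, :) is an assumption that follows the notation used in this post.

```python
ALPHABET = {
    "ᕬ": "ʒ", "ᓭ": "a", "ᘛ": "e", "ᖉ": "o", "ᒣ": "i", "ᖊ": "u",
    "ᘊ": "s", "ᑫ": "ɛ", "ᘈ": "ɔ", "ᘖ": "t", "ᖚ": "ə", "ᔪ": "y",
    "ᔭ": "ɯ", "ᑕ": "p", "ᖽ": "ɐ", "ᖆ": "m", "ᕋ": "b", "ᕒ": "ʃ",
    "ᓄ": "ŋ", "ᖗ": "r", "ᑲ": "f",
    # diacritics, written after the symbol they modify
    "ᐧ": "ⁿ", "ᑦ": "ʰ", "ᐣ": ":",
}

def transliterate(word):
    """Replace each glyph or diacritic; leave unknown symbols untouched."""
    return "".join(ALPHABET.get(symbol, symbol) for symbol in word)

for word in ["ᓭᘖᔭᓄ", "ᑕᘊᐣᒣ", "ᔪᖆᓄᐧ"]:
    print(word, "/" + transliterate(word) + "/")
```

Symbols outside the mapping, such as the “comma” or ᔕ, pass through unchanged, which makes the remaining gaps easy to spot.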

Do I really need to say that I am not satisfied with this mapping? There are far too many vowels, there is far less symmetry than what we’d expect from a plausible language, and while it is generally pronounceable (for example /atɯŋ/ for ᓭᘖᔭᓄ, “water”) we have bizarre things like /ps:i/ for ᑕᘊᐣᒣ and unacceptable ones like /ymŋⁿ/ for ᔪᖆᓄᐧ; we could try to fix some of these later, but it just doesn’t seem right.

It is now time to investigate the hypothesis of the Beanish script as an abugida; I’ll do it in the next post.

hou tu pranownse binish – part 2

(note: I am posting part 2 before part 1… Part 1 and 3, alphabetic and syllabic guesses, are far harder and I don’t know if I’ll be able to finish them soon — real life knocking at the door)

In the previous post, I tried, without much success or confidence, to map Beanish glyphs to phonemes, assuming it is an alphabet. I used frequency tables, some linguistic knowledge, my ear (“it sounds good enough”) and, mostly, wild guesses. As I stated, the biggest problem is the diacritics: we can be more or less flexible regarding potential Beanish phonotactic restrictions, but the diacritics (with the possible exception of the “comma” one) do not work like the other glyphs (i.e., they are not letters) but don’t seem to work well as phonetic traits either. I tried to map them to some phonetic features nonetheless, but nobody should be pleased with my suggestions (I certainly am not).

One idea that has been debated in the XKCD fora since the time Time was playing was to treat it as an abugida. The diacritics are probably, once more, to blame, but in a lot of ways it does make sense: they could very well be vowel-marks (we can even try to think of them as a graphical representation based on the point of articulation in the mouth, very loosely like Korean) and the biggest objection is that the mean word length is a bit too long. Not that the abugida solution solves every single difficulty regarding Beanish: the transition probabilities among glyphs do suggest an alphabet more than an abugida (assuming the grammar isn’t terribly strict) and the number of glyphs is a bit too large for a “plausible” language. A third possibility is that the script is indeed a syllabary (remember that Randall used Linear A as an example), which does not exclude the possibility of the diacritics being vowel marks; we shall investigate this later.

Anyway, we have four diacritics in the Beanish script: the “middle dot” ᐧ , the “c” ᑦ , the “inverted c” ᐣ and the “comma” ,. Our major difficulty is that they can be combined, particularly the comma, in words such as ᖉᑦ, (but we also have the complex word ᓭᑦᐧ). If the diacritics are vowels, this could mean that vowels can sometimes be combined: in particular, the “comma” could be a glide (the most obvious being the palatal approximant /j/). We are left with ᓭᑦᐧ which, among other hypotheses, could be a diphthong (the only one we have so far) or the mark for a rare vowel. This is what I will assume.

Considering the three diacritics we have left, the fact that one of them looks graphically “neutral” (probably the most common vowel, such as /a/ or /ə/) and the fact that the other two seem to mirror/negate each other, it is a good guess to consider the middle dot as an /a/, the “inverted c” as /e/ (possibly with allophones such as /ɛ/), the “c” as /o/ (possibly with allophones such as /ɔ/), the “comma” as the /j/ glide and the combined diacritic ᑦᐧ as just /oa/ or, even better, /oə/.

And now, let’s tabulate everything to find both the default vowel for each consonant and a guess of what consonant it is (based on the consonant frequency of both Beanish and English, plus two dorsals not found in English but common in other languages). Everything assumes that the syllable structure is V+C, and we are solving the isolated diacritic in ᑦᘈᖽᐣ (it would just be a word starting with /a/, the only one in our corpus: /asaʤe/).

Glyph  Count  /a/             /e/  /o/             Probable base-vowel  Guess consonant
       29     0               3    0               /a/ ?                /p/
       27     0               1    2               /a/                  /b/
       24     8 + 0.5 (ᓭᑦᐧ)   0    2 + 0.5 (ᓭᑦᐧ)   /e/                  /t/
       21     0               7    0               /a/                  /d/
       17     5               2    0               /o/                  /k/
       17     0               0    1               /a/ ?                /g/
       16     0               1    3               /a/                  /ʧ/
       15     0               1    0               /a/ ?                /ʤ/
       13     0               2    2               /a/                  /f/
       11     0               1    0               /a/ ?                /v/
       10     0               0    0               /a/ ?                /θ/
       10     0               3    2               /a/                  /ð/
       10     0               0    0               /a/ ?                /s/
       7      0               1    0               /a/ ?                /z/
       7      0               0    0               /a/ ?                /ʃ/
       7      0               0    1               /a/ ?                /ʒ/
       6      0               3    0               /a/ ?                /m/
       5      0               0    0               /a/ ?                /n/
       5      0               0    0               /a/ ?                /l/
       4      0               0    2               /a/ ?                /r/
       3      0               0    0               /a/ ?                /ŋ/
       1      0               0    0               /a/ ?                /ʎ/
       3      0               0    0               /a/ ?                /ɲ/

Which is great, because 1. There is no glyph with at least one occurrence for every diacritic and 2. While a bit extensive, the size of the phonetic catalog is very reasonable (no need to use ejectives or the like, as in the guessed alphabet of part 1 of this post).

If you are still puzzled, this means that (completely made up words) ᘊᓭ should be read with the default vowel for each glyph, here /a/ and /e/ and thus /pate/; if the vowel is not the standard, you add the corresponding diacritic, and thus /pote/ would be written as ᘊᑦᓭ and /pato/ as ᘊᓭᑦ. The “comma” is a semivowel /j/ added after the vowel, and thus ᘊᓭ, would be /patej/ and ᘊᑦ,ᓭ would give us /pojte/.
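The reading rule can be written down as a small sketch. Only ᘊ and ᓭ come from the made-up examples above; the entries for ᘖ, ᔭ and ᓄ are inferred from the /tebagava/ reading of ᓭᘖᔭᓄ, and the combined diacritic ᑦᐧ is not handled. All names are hypothetical.

```python
# Consonant and default vowel per glyph (partly inferred, see the lead-in).
BASE = {"ᘊ": ("p", "a"), "ᓭ": ("t", "e"),
        "ᘖ": ("b", "a"), "ᔭ": ("g", "a"), "ᓄ": ("v", "a")}
VOWEL_MARKS = {"ᐧ": "a", "ᐣ": "e", "ᑦ": "o"}
GLIDE = ","  # the "comma" diacritic, read as /j/ after the vowel

def read_abugida(word):
    """Read a word as consonant + default vowel, overridden by diacritics."""
    syllables = []
    for symbol in word:
        if symbol in BASE:
            consonant, vowel = BASE[symbol]
            syllables.append([consonant, vowel, ""])
        elif symbol in VOWEL_MARKS and syllables:
            syllables[-1][1] = VOWEL_MARKS[symbol]  # replace the default vowel
        elif symbol == GLIDE and syllables:
            syllables[-1][2] = "j"                  # glide after the vowel
        else:
            raise ValueError(f"cannot read {symbol!r} in {word!r}")
    return "".join("".join(s) for s in syllables)

for word in ["ᘊᓭ", "ᘊᑦᓭ", "ᘊᓭᑦ", "ᘊᓭ,", "ᘊᑦ,ᓭ"]:
    print(word, "/" + read_abugida(word) + "/")
```

A word-initial diacritic, as in ᑦᘈᖽᐣ, raises an error here on purpose, since the post treats that case separately.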

The abugida hypothesis is at least plausible, even though, as I said, the words are a bit longer than I’d like and my score at guessing the consonants probably isn’t much better than a random choice. We can later try better guesses using the vocabulary we have decoded so far, such as “water” and “sea”, hoping they are related to some known language (phonosymbolism, anyone?)

But at least ᓭᘖᔭᓄ as /tebagava/ for “water”, while very unlikely, sounds better than the pronunciation I derived in the previous post, the “alphabetic guess”.

Regarding ᘝᓄᘈᖉᐣ

Yet another hypothesis: while our corpus is small and most of the words I am using for this hypothesis seem to be related (“water”, “sea”…), there is a strong tendency for the glyph ᓄ to be found only at the end of words (mostly nouns).

The exception is ᘝᓄᘈᖉᐣ, a somewhat unusual word that many suppose is the name of the Beanie city. Maybe its name is actually a compound word, ᘝᓄ and ᘈᖉᐣ? An even wilder guess: ᘝᓄ or, more likely given the syntax, ᘈᖉᐣ could mean “new” (as in “New York”).

Do you carry these people?

I have decided to study Big Hair’s speech in English, as people have pointed out that it might be a “key”. Just had my first insight:

In frame 2897, she (supposedly) says “Do you carry these people with you?”. She probably intends “Did you bring any of those people with you?”, referring to the Forty.

We could make hypotheses about the reason for the past-mark-dropping, but I want to focus on the verb “to carry”. While it may sound very weird to native English speakers (for some people in the forum, it was undecipherable at first), it could be an expected error from the speaker of a language that makes a different distinction between to carry/to bring, such as Italian and French. We know we are in current-day France and Randall said that Beanish was “plausible”, not to mention the fact that all Big Hair’s numbers “are too small”… maybe Beanish has French features?

Thirteen regular expressions to rule (almost) them all

Maybe it is time to get back to work. In this post I present 13 regular expressions (Python syntax) that cover most of the words in the Beanish corpus.

The goal was to have a way to test and group the words, not to actually perform regular expression pattern matching or substitutions. If you are familiar with regular expressions, you can probably tell this by the fact that the syntax and the grouping do not make much sense. I wanted to make it easier to spot groups, raise hypotheses and find the most unusual words. In a way, this is a form of data compression, of entropy reduction. Words could be grouped in different ways and, if one wanted to have full coverage, longer patterns could match all words.

All patterns exclude what we safely assume to be final punctuation (which can be added with a ur'[ᐨᐦᐤ]?$').

Pattern 1

ur'(ᖆᐣ?|ᘛᖆ|ᘖᐣ)(ᑕᑦ)?[ᖽᖚᖗ](ᔭ,?|[ᒣᘈᑫ])?'

Words covered: ᖆᖚᔭ,ᐨ / ᖆᐣᖚᔭᐦ / ᖆᐣᖽ / ᘛᖆᖚᘈᐤ / ᖆᐣᑕᑦᖚᑫ / ᘖᐣᖗᔭ, / ᖆᖽᒣ

  • ᖆ is usually followed by the diacritic ᐣ when it is an initial, a feature it seems to share with ᘖ
  • While, for this group, ᑕ and ᑦ are always grouped, there is no indication that they are dependent
  • [ᖚᖗ] and [ᘈᑫ] seem to be two different groups; it is also possible that ᖽ belongs to the first group and ᒣ to the second one (as suggested by the following patterns)
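A pattern can be checked against its word list with a few lines of Python. This sketch targets Python 3, where the `ur'...'` prefix used in this post (Python 2 syntax) is no longer accepted and a plain raw string does the job; the variable names are made up.

```python
import re

# Python 3 strings are Unicode already, so a plain raw string is enough.
PATTERN_1 = r"(ᖆᐣ?|ᘛᖆ|ᘖᐣ)(ᑕᑦ)?[ᖽᖚᖗ](ᔭ,?|[ᒣᘈᑫ])?"
FINAL_PUNCTUATION = r"[ᐨᐦᐤ]?"

pattern_1 = re.compile(PATTERN_1 + FINAL_PUNCTUATION)

words = ["ᖆᖚᔭ,ᐨ", "ᖆᐣᖚᔭᐦ", "ᖆᐣᖽ", "ᘛᖆᖚᘈᐤ", "ᖆᐣᑕᑦᖚᑫ", "ᘖᐣᖗᔭ,", "ᖆᖽᒣ"]
for word in words:
    # fullmatch requires the whole word to fit the pattern
    print(word, bool(pattern_1.fullmatch(word)))
```

Using `fullmatch` makes the trailing `$` unnecessary and avoids accidental partial matches.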

Pattern 2

ur'ᖆ(ᓄ?ᘈ|ᕬ)(ᖉᐣ?|ᘖ)ᔭ?'

Words covered: ᖆᘈᘖᐦ / ᖆᕬᖉᔭ / ᖆᘈᘖ / ᖆᓄᘈᖉᐣ

  • The second of the three patterns for words starting with ᖆ, which looks extremely frequent and prolific (if it were English and this is an alphabet, likely a vowel)
  • Not much in common among these words, syntactically
  • ᓄ and ᘈ probably belong to different categories
  • ᖉ and ᘖ probably belong to the same category

Pattern 3

ur'ᖆ(ᘊᓭᒣ|ᔭ)ᖊᐣ?[ᖗᖽ]'

Words covered: ᖆᔭᖊᖽ / ᖆᘊᓭᒣᖊᐣᖗᐨ

  • Last pattern for words starting with ᖆ
  • Not much can be said, but the words could be related if Beanish uses infixes
  • ᖗ and ᖽ are probably in the same category
  • ᘊ, ᓭ and ᒣ are once more seen together; if it is an alphabet, one of them is likely a vowel and the other two are consonants, possibly a fricative/plosive and a liquid

Pattern 4

ur'ᘊ?[ᒣᓭᑫᖊ]+[ᐧ,ᑦ]?[ᖚᘊᘖᘈᑲᓄᑕᖊᔭᖽ]*[ᑦ,ᐧ]?[ᔭᖉᔕᘖᖆ]?ᓄ?'

Words covered: ᘊᒣᓭᐧᖊᔕ / ᘊᓭᘖᑦᓄᐨ / ᖊ,ᘖ / ᓭᘈ / ᓭᘖᑦ / ᘊᓭᐧᑲ / ᖊᘊᐤ / ᒣᖉ / ᘊᓭᘖᔭᓄᐤ / ᓭᘖᔭᓄᐨ / ᓭᘖᔭᓄᐦ / ᘊᓭᑦᑕᖉ / ᓭᐧᘖ / ᓭᑦᐧ / ᘊᓭᘖᔭᓄᐨ / ᘊᖊᑦᓄ / ᓭᔭᑦᘖ / ᘊᓭᐧᑲᐤ / ᑫᘊᘊ / ᘊᒣᑦᖽᖆᐨ / ᘊᓭᐧᑲᐨ / ᓭᘊᘊ / ᒣᓭᐧᖊᔕᐨ / ᓭᐧᖚ / ᘊᓭᑦᑕᖉᐨ

  • The most complex and best-performing pattern; it covers most of what are supposed to be nouns
  • ᘊ- does indeed look like a prefix
  • ᒣ, ᓭ, ᑫ and ᖊ would likely be vowels, allowing diphthongs, and the diacritics would thus be applied to vowels
  • [ᖚᘊᘖᘈᑲᓄᑕᖊᔭᖽ] looks like a big bag of consonants, confirming some of my previous assumptions; however, the diacritics can be applied to some of them too
  • The final -ᓄ could be a suffix, or the indication of a strict word phonology

Pattern 5

ur'ᘛᐣ?[ᔭ]?'

Words covered: ᘛᔭᐤ / ᘛ / ᘛᐣ

  • Covers most one/two-symbol words
  • ᘛ is probably a vowel, or at the very least a sonorant, and ᔭ a consonant (guess not supported by evidence: a fricative)

Pattern 6

ur'ᒣ?[ᖉᔪ],?(ᑦ,)?'

Words covered: ᖉᑦ,ᐨ / ᖉᑦ,ᐦ / ᖉ, / ᔪ, / ᒣᖉ

  • Covers most of the words that seem associated with the ideas of “yes, positive, affirmative, good”
  • Just like ᘛ, ᖉ and ᔪ are probably vowels/sonorants

Pattern 7

ur'ᔪ?ᕒᖚᐧ?(ᑫᕋ,)?'

Words covered: ᕒᖚᑫᕋ,ᐨ / ᔪᕒᖚᐧ / ᕒᖚᐧ

  • Covers the (ᕒ)ᖚ group, where ᕒ- is likely a question mark (is it just a CU/QU /k/ of Romance languages? or perhaps a WH- from English?)
  • If it is an alphabet, ᕒᖚ looks like a Consonant+Vowel; given that ᑫ is likely a vowel, the rare ᕋ would likely be a rare consonant, and a word like ᕒᖚᑫᕋ would sound something like /kwəX/, where /X/ is the rare consonant

Pattern 8

ur'ᔪ[ᖆᑕ](ᓄᐧ)?'

Words covered: ᔪᖆᓄᐧ / ᔪᑕᐨ

  • There is no clear indication that the two words covered by this regex are related
  • ᓄ is confirmed in its common final position

Pattern 9

ur'ᖽ(ᘛ|ᔕᐣ)(ᕋᑦ|ᘖ)'

Words covered: ᖽᘛᕋᑦ / ᖽᔕᐣᘖ

  • Once more, there is no indication that these words are related
  • Given that ᘛ is likely a vowel, ᔕ would be a vowel too and ᖽ a consonant

Pattern 10

ur'ᘊ(ᘖ[ᑫᒣ])*ᐣᖚ'

Words covered: ᘊᘖᑫᘖᒣᐣᖚ

  • An interesting word, both because there is apparently no consensus on probable translations and because it contradicts or makes less plausible some of my hypotheses
  • However, it confirms that ᑫ and ᒣ could be vowels and ᘊ, ᘖ and ᖚ consonants, giving something like TRIROS, FLALEP, PNENUV, etc. (just to make it clear: this only illustrates the pattern, I am not suggesting that the symbols correspond to these letters)

Pattern 11

ur'ᖚ(ᒣᑕ)?ᑫ[ᓭᘖ]'

Words covered: ᖚᒣᑕᑫᓭ / ᖚᑫᘖ

  • Another pattern with no clear indication of relation between the words it covers (unless, as stated before, Beanish uses infix morphology and zero-morphemes…)
  • ᒣᑕ could be a Vowel+Consonant

Pattern 12

ur'[ᘈᑕ]ᘊᐣ?[ᘖᒣ]'

Words covered: ᘈᘊᘖᐨ / ᑕᘊᐣᒣ

  • ᘊ can take a diacritic and is probably a common consonant (a liquid?)
  • ᘈ and ᑕ are probably consonants too

Pattern 13

ur'(ᖚᐣ|ᖉᔭᒣᘊᐣ|ᔪᖉᔭ)[ᘖᖗᑫ]+'

Words covered: ᖚᐣᘖᖗᑫ / ᖉᔭᒣᘊᐣᘖᑫᖗ / ᔪᖉᔭᑫ

  • ᘖ, ᖗ and ᑫ seem to constitute a group like ᘊ, ᓭ and ᒣ: probably one vowel and two consonants

Words not covered

Words: ᔭ / ᘖᖆᒣᘛᐨ / ᘖᓄᘈᖉᐣ / ᕋᖗ / ᖊᐣᖽ / ᑦᘈᖽᐣ

  • Equally important are these six uncovered words
  • While I suspected that ᔭ would be a consonant, it can form a word of its own; while possible, this could indicate that we are not dealing with an alphabet
  • ᘖᖆᒣᘛ is one of the words with no agreement on the translation; ᘖᓄᘈᖉᐣ has an uncommon ᓄ in middle position but it is likely a toponym; ᕋᖗ is followed (frame 2728) by the other very strange ᖆᕬᖉᔭ word; ᖊᐣᖽ could be a transcription error or might be related to ᖊ,ᘖ; and the strange ᑦᘈᖽᐣ is from the same long speech in frame 2728.
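Finding the uncovered words is a matter of testing each word against every pattern. The sketch below uses only patterns 5 and 6 and a handful of words to stay short; the dictionary layout and function name are assumptions.

```python
import re

FINAL_PUNCTUATION = r"[ᐨᐦᐤ]?"
PATTERNS = {
    5: r"ᘛᐣ?[ᔭ]?",
    6: r"ᒣ?[ᖉᔪ],?(ᑦ,)?",
}

def covering_patterns(word):
    """Return the numbers of all patterns that match the whole word."""
    return [number for number, body in PATTERNS.items()
            if re.fullmatch(body + FINAL_PUNCTUATION, word)]

for word in ["ᘛᔭᐤ", "ᖉᑦ,ᐨ", "ᒣᖉ", "ᔭ", "ᕋᖗ"]:
    print(word, covering_patterns(word) or "not covered")
```

With all thirteen patterns in the dictionary, the words that come back empty should be the six listed above.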

Maybe it is now time to go back to the comic and to the botched English of Big Hair, paying attention to the strange words that could be the key, being toponyms.

The glyph transitions (at last) — Part II

Here is the table that was missing from the previous post; it shows the transitions between glyphs, left-to-right. It is far more important than the previous table, if the script is actually written left-to-right as most agree.

I plan to generate a graph showing the main transitions, stay tuned.

Symbol (from) Transition (to) Occurrences/Total (percentage)
1/14 (7.14%)
1/14 (7.14%)
1/14 (7.14%)
1/14 (7.14%)
1/14 (7.14%)
1/14 (7.14%)
2/14 (14.29%)
4/14 (28.57%)
1/14 (7.14%)
} 1/14 (7.14%)

3/9 (33.33%)
, 1/9 (11.11%)
2/9 (22.22%)
1/9 (11.11%)
} 2/9 (22.22%)

2/7 (28.57%)
1/7 (14.29%)
1/7 (14.29%)
1/7 (14.29%)
} 2/7 (28.57%)

2/8 (25.00%)
, 1/8 (12.50%)
1/8 (12.50%)
1/8 (12.50%)
2/8 (25.00%)
1/8 (12.50%)

1/3 (33.33%)
} 2/3 (66.67%)

1/5 (20.00%)
1/5 (20.00%)
} 3/5 (60.00%)

1/18 (5.56%)
1/18 (5.56%)
1/18 (5.56%)
2/18 (11.11%)
1/18 (5.56%)
1/18 (5.56%)
2/18 (11.11%)
2/18 (11.11%)
} 7/18 (38.89%)

1/6 (16.67%)
1/6 (16.67%)
1/6 (16.67%)
1/6 (16.67%)
} 2/6 (33.33%)

1/12 (8.33%)
1/12 (8.33%)
2/12 (16.67%)
2/12 (16.67%)
1/12 (8.33%)
3/12 (25.00%)
} 2/12 (16.67%)

1/11 (9.09%)
1/11 (9.09%)
1/11 (9.09%)
1/11 (9.09%)
1/11 (9.09%)
1/11 (9.09%)
1/11 (9.09%)
2/11 (18.18%)
} 2/11 (18.18%)

2/9 (22.22%)
1/9 (11.11%)
1/9 (11.11%)
1/9 (11.11%)
} 4/9 (44.44%)

1/5 (20.00%)
1/5 (20.00%)
, 1/5 (20.00%)
1/5 (20.00%)
1/5 (20.00%)

1/13 (7.69%)
1/13 (7.69%)
, 2/13 (15.38%)
2/13 (15.38%)
1/13 (7.69%)
1/13 (7.69%)
} 5/13 (38.46%)

, 1/7 (14.29%)
} 6/7 (85.71%)

1/8 (12.50%)
1/8 (12.50%)
1/8 (12.50%)
1/8 (12.50%)
1/8 (12.50%)
} 3/8 (37.50%)

? 1/1 (100.00%)

1/17 (5.88%)
2/17 (11.76%)
2/17 (11.76%)
2/17 (11.76%)
2/17 (11.76%)
5/17 (29.41%)
} 3/17 (17.65%)

2/8 (25.00%)
1/8 (12.50%)
1/8 (12.50%)
} 4/8 (50.00%)

1/3 (33.33%)
, 1/3 (33.33%)
1/3 (33.33%)

2/17 (11.76%)
3/17 (17.65%)
3/17 (17.65%)
1/17 (5.88%)
1/17 (5.88%)
2/17 (11.76%)
1/17 (5.88%)
} 4/17 (23.53%)

3/3 (100.00%)

1/5 (20.00%)
1/5 (20.00%)
1/5 (20.00%)
1/5 (20.00%)
} 1/5 (20.00%)

1/11 (9.09%)
1/11 (9.09%)
1/11 (9.09%)
1/11 (9.09%)
, 1/11 (9.09%)
1/11 (9.09%)
2/11 (18.18%)
1/11 (9.09%)
} 2/11 (18.18%)

1/9 (11.11%)
2/9 (22.22%)
1/9 (11.11%)
1/9 (11.11%)
1/9 (11.11%)
} 3/9 (33.33%)

1/16 (6.25%)
4/16 (25.00%)
1/16 (6.25%)
5/16 (31.25%)
1/16 (6.25%)
1/16 (6.25%)
2/16 (12.50%)
} 1/16 (6.25%)

1/1 (100.00%)

} 1/1 (100.00%)

{ 10/60 (16.67%)
3/60 (5.00%)
1/60 (1.67%)
3/60 (5.00%)
3/60 (5.00%)
4/60 (6.67%)
3/60 (5.00%)
2/60 (3.33%)
5/60 (8.33%)
1/60 (1.67%)
2/60 (3.33%)
? 1/60 (1.67%)
8/60 (13.33%)
1/60 (1.67%)
2/60 (3.33%)
1/60 (1.67%)
1/60 (1.67%)
1/60 (1.67%)
8/60 (13.33%)
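Rows in this format can be regenerated from any word list with a sketch like the following. It assumes each diacritic counts as a symbol of its own and that final punctuation has already been removed; the function name is made up.

```python
from collections import Counter, defaultdict

def transition_rows(words):
    """Yield (from, to, 'n/total (pct%)') rows, left-to-right, with { and }."""
    counts = defaultdict(Counter)
    for word in words:
        chain = ["{"] + list(word) + ["}"]
        for prev, nxt in zip(chain, chain[1:]):
            counts[prev][nxt] += 1
    for source, targets in counts.items():
        total = sum(targets.values())
        for target, n in targets.items():
            yield source, target, f"{n}/{total} ({100 * n / total:.2f}%)"

# Three short words from the corpus, without final punctuation.
rows = list(transition_rows(["ᘛᔭ", "ᘛ", "ᘛᐣ"]))
for row in rows:
    print(*row)
```

The percentages of all rows sharing the same source symbol add up to 100%, which is a quick sanity check for the table above.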

Aside

While I still think that the new template for the blog looks good, it might have hidden from general view some good work that has been showing up in the comments.

I still want to check some things about Lojban before writing an answer to greb, but first I want to share some thoughts by J.:

Anyway, I have been working on Beanish for a bit before reading this blog, and you have helped me fill in a lot of blanks in my notes. It’s actually starting to make sense! Thank you. Coming from a different direction I’ve notice a couple differences between your corpus and mine, but the majority is almost exactly the same.

One important feature I think you are missing is word-structure. I’m going from the assumption that ‘Beanish’ is a synthetic language (as opposed to an isolating one), and one with a rigid templatic structure and semi-fluid morpheme boundaries. With that in mind, I want to propose a few morphemes:

ᖆᐣᖽ – To (Preposition)

ᖆᐣᖽ (To, x3)
ᔪᖆᓄᐧ (What to?)
ᖆᖽᒣ (Up?)
ᖆᐣᖚᔭ (Today? [Fudging a bit here])
ᖆᐣᑕᑦᖚᑫ (UKN)
ᖆᔭᖊᖽ (UKN)
*ᖆᕬᖉᔭ (Not an occurrence of the morpheme?)
*ᖆᓄᘈᖉᐣ (Not an occurrence of the morpheme?)
*ᘊᒣᑦᖽᖆ (Not an occurrence of the morpheme)
*ᖆᘈᘖ (Not an occurrence of the morpheme)

(ᖽ)ᘛ – You (2nd Person)

ᘛᔭ (You are)
ᖽᘛᕋᑦ (You Possessive x2) *This changes corpus line 26*
ᘛ (You)
ᘛᖆᖚᘈ (You Journeyed?) *This is a minor change to corpus line 24*
ᘛᐣ (You, [3 plural?])
*ᘖᖆᒣᘛ (Not an occurrence of morpheme)

ᕒᖚᐧ – Where

ᔪᕒᖚᐧ (Where from)
ᕒᖚᐧ (Where x4) *Line 25?*
ᕒᖚᑫᕋ, (UKN)

Also, if ᖉᑦ, means yes or good, then ᖉ, ᖆᐣᖚᔭ, would mean good day, with the ᖆᐣ prefix meaning something .. but that bit is eluding me.

Notice, for all these morphemes, that when the morpheme is placed before (or sometimes next to) a word, the phonemes ᐣ, ᖽ, ᘖ, & ᐧ will occasionally drop. This gives us a ‘core’ phoneme or two for each morpheme (ᖆ, ᘛ, ᕒᖚ, ᖉ, and ᔕ for we). I also propose that ᑫ is a core for a possessive morphemic suffix, and ᘊ is a core for an importance-marker type morphemic prefix.

Finally, this is the word structure template that I can work out with these morphemes:

(Q-Word)/PERSON (opt?) – (PREPOSITION) – (Importance marker) – word – (POSSESSION)

If you have any insights on this approach, or know of a better place to put this, let me know! Hopefully we can get this cracked.

I don’t think that ᖆ- is a morpheme in the way J. suggests, as the hypothesis of having it as a determiner or a semantic morpheme for “bigger, larger” still sounds more plausible to me. I also find it unlikely that morphemes and phonemes are joined the way described (but I am not sure I completely understood it), especially considering the frequencies of glyph-to-glyph transitions, which seem to disprove it (not to mention that, if what we have been calling “diacritics” are indeed diacritics, we might need to remove them from our analysis).

While I am not really sure about the differences in his/her reading of the script, the suggestions for ᘛ and ᕒᖚᐧ make a lot of sense, and ᖉ, ᖆᐣᖚᔭ, has probably been nailed down (I had translated it as “good morning” by observing the conversational clues, but I am now surprised that I could have missed the ᖉ as “good”, which now seems so obvious! Great work, J.!). What is really important, however, is the freshness of the rigid synthetic paradigm he/she suggests. While some comments on the OTT during the comic run had similar hypotheses, and I think I posted something along these lines in one of my first posts, it is the first time I am seeing a true hypothesis for the word structure template of Beanish.

We are finally getting to the point where we can draw some hypotheses and test them.

The glyph transitions (at last)

I wrote a simple Python script to finally present the transitions from glyph to glyph (including word boundaries, denoted with { and }). You can find it in the same GitHub repository; please fork, modify, correct, or extend it if you want.
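For readers who would rather not open the repository, the counting idea can be sketched in a few lines. This is only a minimal illustration and not the script from the repository: the function names and the toy words are made up for the example, and it assumes that each glyph (or punctuation mark) is a single Unicode character.

```python
from collections import Counter, defaultdict

START, END = "{", "}"  # word-boundary markers, as used in the tables


def count_transitions(words):
    """Count symbol-to-symbol transitions, including word boundaries."""
    counts = defaultdict(Counter)
    for word in words:
        # Wrap the word so that the first and last glyphs also get a transition.
        symbols = [START, *word, END]
        for current, following in zip(symbols, symbols[1:]):
            counts[current][following] += 1
    return counts


def format_table(counts):
    """Yield one 'from to n/total (pct%)' row per observed transition."""
    for current in sorted(counts):
        total = sum(counts[current].values())
        for following, n in sorted(counts[current].items()):
            yield f"{current} {following} {n}/{total} ({n / total:.2%})"


# Toy stand-in words in Latin letters (NOT the Beanish corpus).
toy_words = ["aba", "ab", "ba"]
table = count_transitions(toy_words)
rows = list(format_table(table))
for row in rows:
    print(row)
```

Swapping the toy list for the words of the real corpus should give rows in the same "occurrences/total (percentage)" shape as the tables in this post, although the ordering of the rows may differ.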

Without further ado, here are the tables of transitions from glyph to glyph. I will discuss them in a future post, possibly with some graphical representation.

The first table indicates the occurrences of glyphs in the first column when followed by those in the second one:

Symbol Transition from Occurrences/Total (percentage)
1/14 (7.14%)
1/14 (7.14%)
1/14 (7.14%)
1/14 (7.14%)
{ 10/14 (71.43%)
2/9 (22.22%)
1/9 (11.11%)
1/9 (11.11%)
1/9 (11.11%)
1/9 (11.11%)
{ 3/9 (33.33%)
1/7 (14.29%)
1/7 (14.29%)
2/7 (28.57%)
1/7 (14.29%)
1/7 (14.29%)
{ 1/7 (14.29%)
1/8 (12.50%)
2/8 (25.00%)
1/8 (12.50%)
1/8 (12.50%)
{ 3/8 (37.50%)
2/3 (66.67%)
1/3 (33.33%)
1/5 (20.00%)
1/5 (20.00%)
2/5 (40.00%)
1/5 (20.00%)
1/18 (5.56%)
1/18 (5.56%)
, 1/18 (5.56%)
2/18 (11.11%)
3/18 (16.67%)
1/18 (5.56%)
2/18 (11.11%)
4/18 (22.22%)
{ 3/18 (16.67%)
1/6 (16.67%)
1/6 (16.67%)
{ 4/6 (66.67%)
1/12 (8.33%)
1/12 (8.33%)
3/12 (25.00%)
3/12 (25.00%)
1/12 (8.33%)
{ 3/12 (25.00%)
1/11 (9.09%)
1/11 (9.09%)
1/11 (9.09%)
1/11 (9.09%)
1/11 (9.09%)
2/11 (18.18%)
1/11 (9.09%)
1/11 (9.09%)
{ 2/11 (18.18%)
2/9 (22.22%)
1/9 (11.11%)
1/9 (11.11%)
5/9 (55.56%)
{ 5/5 (100.00%)
1/13 (7.69%)
3/13 (23.08%)
1/13 (7.69%)
2/13 (15.38%)
1/13 (7.69%)
2/13 (15.38%)
1/13 (7.69%)
1/13 (7.69%)
{ 1/13 (7.69%)
, 1/7 (14.29%)
1/7 (14.29%)
1/7 (14.29%)
2/7 (28.57%)
1/7 (14.29%)
1/7 (14.29%)
1/8 (12.50%)
1/8 (12.50%)
1/8 (12.50%)
2/8 (25.00%)
1/8 (12.50%)
{ 2/8 (25.00%)
? { 1/1 (100.00%)
1/17 (5.88%)
1/17 (5.88%)
1/17 (5.88%)
1/17 (5.88%)
2/17 (11.76%)
1/17 (5.88%)
1/17 (5.88%)
1/17 (5.88%)
{ 8/17 (47.06%)
2/8 (25.00%)
1/8 (12.50%)
2/8 (25.00%)
? 1/8 (12.50%)
2/8 (25.00%)
1/3 (33.33%)
1/3 (33.33%)
{ 1/3 (33.33%)
4/17 (23.53%)
2/17 (11.76%)
2/17 (11.76%)
1/17 (5.88%)
1/17 (5.88%)
1/17 (5.88%)
1/17 (5.88%)
1/17 (5.88%)
1/17 (5.88%)
2/17 (11.76%)
1/17 (5.88%)
1/3 (33.33%)
{ 2/3 (66.67%)
1/5 (20.00%)
1/5 (20.00%)
1/5 (20.00%)
1/5 (20.00%)
{ 1/5 (20.00%)
1/11 (9.09%)
1/11 (9.09%)
2/11 (18.18%)
1/11 (9.09%)
1/11 (9.09%)
1/11 (9.09%)
1/11 (9.09%)
2/11 (18.18%)
{ 1/11 (9.09%)
1/9 (11.11%)
2/9 (22.22%)
3/9 (33.33%)
1/9 (11.11%)
1/9 (11.11%)
{ 1/9 (11.11%)
2/16 (12.50%)
5/16 (31.25%)
1/16 (6.25%)
{ 8/16 (50.00%)
1/1 (100.00%)
1/1 (100.00%)
} 1/60 (1.67%)
2/60 (3.33%)
2/60 (3.33%)
2/60 (3.33%)
3/60 (5.00%)
7/60 (11.67%)
2/60 (3.33%)
2/60 (3.33%)
2/60 (3.33%)
4/60 (6.67%)
5/60 (8.33%)
, 6/60 (10.00%)
3/60 (5.00%)
3/60 (5.00%)
4/60 (6.67%)
4/60 (6.67%)
1/60 (1.67%)
2/60 (3.33%)
3/60 (5.00%)
1/60 (1.67%)
1/60 (1.67%)

It is clear that some glyphs are far more promiscuous than others, but it’s even clearer that our corpus is too limited to support any general assumption.

The following table is as important as the previous one: while the first gives us the transitions of glyphs towards others, this one gives us the transitions of glyphs from others (in other words, it indicates the occurrences of glyphs in the second column when following those in the first one):

(Sorry, the table was wrong — I’ll fix and post it later)
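In the meantime, the reverse view can be derived mechanically from a table of outgoing counts. The sketch below is an illustration under assumptions, not the code used for this post: `invert_transitions` is a hypothetical helper, and the counts are toy values with placeholder symbols rather than Beanish data.

```python
from collections import Counter, defaultdict


def invert_transitions(outgoing):
    """Turn {source: {target: n}} into {target: {source: n}}."""
    incoming = defaultdict(Counter)
    for source, targets in outgoing.items():
        for target, n in targets.items():
            incoming[target][source] += n
    return incoming


# Toy outgoing counts with placeholder symbols (NOT Beanish data).
outgoing = {
    "{": {"a": 2, "b": 1},
    "a": {"b": 2, "}": 2},
    "b": {"a": 2, "}": 1},
}
incoming = invert_transitions(outgoing)

for target in sorted(incoming):
    total = sum(incoming[target].values())
    for source, n in sorted(incoming[target].items()):
        print(f"{target} {source} {n}/{total} ({n / total:.2%})")
```

Both views hold the same pairs; only the denominator changes, since each percentage is taken over the glyph that is held fixed.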

On word lengths

It is too soon to discuss word length in Beanish, as we have just started studying the glyphs (btw, the corpus on GitHub has already been corrected and improved: the joy of the crowd!). But I can’t help discussing the particular word length distribution in Beanish.

Words in Beanish have a strong tendency to have between 3 and 5 glyphs; words of 2 to 5 glyphs represent 70% of Beanish words. Compare with the word-length distribution in English from Peter Norvig’s webpage (http://norvig.com/mayzner.html):

It is not only a matter of a different mean length (3.784 glyphs), but of a very small standard deviation (1.427 glyphs). We will end up discussing it in the future; for the time being, I take this as a further suggestion that the script is a consonantal abugida, even though the glyph distribution doesn’t strongly suggest it. (I know some people disagree; please discuss it in the comments, it’s the purpose of this post 😉 )
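For anyone who wants to recompute these figures from the corpus, here is a minimal sketch of the arithmetic. It rests on several assumptions that are not settled in this post: that a word's length is its number of Unicode characters (so "diacritics" and punctuation would have to be stripped or kept deliberately), and that the standard deviation is the population one (`statistics.pstdev`) rather than the sample one. The function name and the toy words are invented for the example.

```python
from statistics import mean, pstdev


def length_stats(words, low=2, high=5):
    """Return (mean length, population std dev, share of words with low..high glyphs)."""
    lengths = [len(word) for word in words]
    in_range = sum(low <= n <= high for n in lengths)
    return mean(lengths), pstdev(lengths), in_range / len(lengths)


# Toy stand-in words in Latin letters (NOT the Beanish corpus).
toy_words = ["ab", "abc", "abcd", "abcde", "abcdefg", "a"]
avg, sd, share = length_stats(toy_words)
print(f"mean={avg:.3f} sd={sd:.3f} share of 2-5 glyph words={share:.0%}")
```

Using `statistics.stdev` instead would give the sample standard deviation, which is slightly larger on a corpus this small, so it is worth stating which one is being reported.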