Back to the corpus – IV

I said I would not be linear, and here it goes: let’s jump to the interesting speech in frame 2728.

This frame is interesting because it’s one of the Beanish “data points” closest to being a “Rosetta stone”. After some initial and difficult communication, when Cueball apparently learns the Beanish word for “water” (frames 2708/2709), he and Megan try to communicate with the Beanie by drawing. When Cueball makes it clear that they came from the lowlands, where we know the sea is rising, the Beanie seems both surprised and excited and fires three questions: ᔪᖉᔭᑫ ᕒᖚᐧ ᘊᓭᐧᑲᐤ ᑦᘈᖽᐣ ᔭ ᖆᖽᒣ ᓭᘖᑦ ᖊᘊᐤ ᕋᖗ ᖆᕬᖉᔭ ᘖᐣᖗᔭ, ᘊᓭᘖᔭᓄᐤ As Cuegan don’t (doesn’t?) understand, the Beanie takes the stick and draws some parallel lines to indicate that the sea is rising.

We can make many educated guesses. The first sentence is probably a result of the surprise/excitement of the Beanie (remember that, as Rosetta will later explain, they did not imagine there were still people living there). One of the last two sentences (or possibly both, but I guess only the last one) is likely related to the rising of the sea, and it seems a good guess that one of those questions is something like “How did you get here?” or, at best, “Are you (two) alone?”.

Syntactically, the first question is the one that helps us most. We have ᔪᖉᔭᑫ ᕒᖚᐧ ᘊᓭᐧᑲᐤ (“nafagaθa laka pataɲa?”), which is strikingly similar to the very first Beanish sentence, ᔪᕒᖚᐧ ᘛᔭᐤ (“nalaka zaga?”), in frame 2663. We have seen that this first question probably means “Where are you from?” or “Who are you?”, with the “question prefix” ᔪ (“na”), which seems to work as a clitic, like most Beanish morphemes; the “locative adverb/preposition” ᕒᖚᐧ (“laka”), basically something which denotes a physical location, probably translated as “here/there/where/in/at/…” (sorry to keep the linguistic jargon, but at least I am not saying things like “place deixis”!); and ᘛᔭ (“zaga”), likely a verb. In our new sentence, we keep the ᔪ (“na”) clitic, along with the ᕒᖚᐧ (“laka”). But we also have the common ᘊᓭᐧᑲ (“pataɲa”) word, which, unless Randall gave us homographs, is the name of the place where Cuegan come from (cf. frame 2906, a.k.a. “the map”); I like to think of it as “Balearic”, to omit the fact that it might be a composite. However, the word seems to be the result of, at least, the ᘊ- (“pa-”) prefix and the word ᓭᐧᑲ (“taɲa”), used in the “good night” sentence of frame 2697. A case of homography here is not probable, and it raises an interesting hypothesis: that, maybe, perhaps, who knows, possibly, just a guess, ᘊ- (“pa-”) does not mean “big, superior”, but is a kind of determinative (an “article”, or maybe a “demonstrative”), used in precise circumstances. Think about the usage of articles in Ancient Greek (I keep going back to it as Randall spoke of Linear A, even though I know that Linear A is not Greek; as far as we know, it is not even Indo-European), including the fact that its usage in what we suppose to be Rosetta’s name might be an honorific.

We then have two sentences, one likely “QUESTION-(from) (to come)” and one “QUESTION-(?) (from) (Balears)”. The ᖉᔭᑫ (“fagaθa”) word is unfortunately very obscure, but, using the logic of English, it should be a verb. The alternative of having the first as “QUESTION-(person/who) (are, inflected 2nd plural)” and the second as “QUESTION-(?) (person/who) (Balears)” has some difficulties due to the later usage of ᕒᖚᐧ (“laka”), but it can’t be ruled out entirely (a more fluent English translation would be “Are you Balearic?”). For the time being, we should study the corpus with both possibilities in mind: ᕒᖚᐧ (“laka”) as “from/where/there/here” (the one I favor) and as something that is or works as a copular verb (“to be”).

The second question, ᑦᘈᖽᐣ ᔭ ᖆᖽᒣ ᓭᘖᑦ ᖊᘊᐤ (“osaʤe ga daʤaʧa tebo ðapa?”), is more difficult. No word in it is clearly related to anything else in our corpus, not to mention the uncommon diacritic in initial position in ᑦᘈᖽᐣ (“osaʤe”). ᓭᘖᑦ (“tebo”) could be related to ᓭᘖᔭᓄ (“water”), but there is no clear indication of that. This is one of the few words where calculating the probability of a random similarity (I’ll do it, eventually) may actually help us, but once more what we need are good hypotheses to test. What do you all think this second sentence means?

We finally have ᕋᖗ ᖆᕬᖉᔭ ᘖᐣᖗᔭ, ᘊᓭᘖᔭᓄᐤ (“raʃa daʎafaga beʃagai patebava?”). We are naturally drawn to ᘊᓭᘖᔭᓄ (“patebava”), almost certainly a compound of the prefix ᘊ- (“pa-”: “big, superior” or a determinative) and ᓭᘖᔭᓄ (“tebava”: “water”). Based mostly on the following sentence by Megan (“Yes! The sea is rising!”), most people (including myself) speculated that this is, loosely, semantically similar to the third question, which goes hand in hand with the translation of ᘊᓭᘖᔭᓄ as “sea” (“big water”). This has, however, some problems, in particular the fact that we know almost for sure that “sea” is written ᘊᖊᑦᓄ (“paðeva”). azule has suggested in the XKCD fora that ᓄ (“va”) is actually “water, liquid”, and while I am not completely confident in his/her hypothesis of freely joinable morphemes with independent semantic load (I use this complex description because I am not sure if he/she thinks about something more like Klingon or more like Chinese, but we are talking of somewhat-analytic languages), this makes a lot of sense here. Plus, it supports the hypothesis of ᘊ- (“pa-”) as a determinative: in our sentence, “water” is used with a determinative (maybe it’s the subject, maybe some other rule is at play), and in the maps, for example, the prefix is explained by proper nouns (ᘊᖊᑦᓄ ᘊᓭᐧᑲ would be something like “the-Sea the-Balearic”).

The other words in the sentence are, unfortunately, obscure. I agree that we should expect something like “up” or “rise” or “increase” in it, but we cannot go much farther than guessing. ᕋᖗ (“raʃa”) has both uncommon glyphs and uncommon features; the initial ᖆ- (“da-”) in ᖆᕬᖉᔭ (“daʎafaga”) could mean “good” (“up”?), but it doesn’t fit very well with the supposed meaning of the sentence; the only word very loosely similar to ᘖᐣᖗᔭ, (“beʃagai”) is ᖚᐣᘖᖗᑫ (“kebaʃaθa”), which doesn’t add much either. You’ve guessed it: we need more and better hypotheses for the translations.

Back to the corpus – III

In frame 2668 (btw, I am using Geekwagon’s numbering system), one of the Beanies has examined Megan’s leg and asks or orders something of a second Beanie, who promptly leaves: ᓭᘈ ᘊᒣᓭᐧᖊᔕ ᖆᘊᓭᒣᖊᐣᖗᐨ (“tesa paʧataðama dapateʧaðeʃa”).

It seems everyone agrees on the meaning of this sentence: either “Get/Bring (me/us) cream for-healing” or “Cream for-healing is-needed”. The “cream for-healing” is one of the clearest keys Randall has given us: Megan will soon ask what they are putting on her leg, and the answer will be ᘊᒣᓭᐧᖊᔕ ᖆᘊᓭᒣᖊᐣᖗᐨ. We have established that ᘊ- (“pa”) is almost certainly a prefix (or, more appropriately in terms of Beanish morphology, a semantic particle usually found in initial position when applied to names), probably meaning “big, large, superior”; and in fact the base form ᒣᓭᐧᖊᔕ (“ʧataðama”) will return in frame 2797 referring to what they are eating. Without delving into the possible morphology of ᒣᓭᐧᖊᔕ (“ʧataðama”), though the word is probably at least inflected, if not a compound, “cream” is indeed the best translation for something that can be both eaten and applied to injuries (“unguent”), assuming it is not a proper noun (who knows, maybe aloe vera or nettle are sacred plants among the Beanies, used as food) and that Beanish food is not so different from ours.

ᖆᘊᓭᒣᖊᐣᖗ (“dapataʧaðeʃa”) is a bit more interesting. First of all, we can safely assume that it is qualifying the word it follows, which is probably the least disputable feature of Beanish syntax (see, for example, the map, where the word probably meaning sea, ᘊᖊᑦᓄ ["paðova"], is followed by the proper name): modifiers are postposed. The “for-healing” part is a good guess, and allows us to think once more of -ᘊ- as “big”, implying a prefix ᖆ- ["da"] (on OTT, azule suggested it is “home”, but I think it is more likely a general physical descriptor/connector, if indeed all glyphs have a semantic load) and a much more manageable base word ᓭᒣᖊᐣᖗ (“teʧaðeʃa”), which looks a bit like a “verb” (maybe the ᖆ- turns a verb, possibly a nominal one, into an agent, thus “healer (cream)”).

None of this is new, and it has been extensively discussed in the OTT. I want to focus on the word ᓭᘈ (“tesa”). Now, without considering its meaning (be it an imperative “get” or an “is-needed”), the word is likely a verb; our tendency as SVO-language speakers is to take it as an imperative. We unfortunately don’t have words clearly related to it: considering the ᘈ element it is completely opaque, while considering ᓭ we have some candidates. In the difficult sentence from frame 2671 we have the bizarre word ᓭᑦᐧ (“toj”); in frame 2697 we have the debated ᓭᐧᖚ (“tako”), the word-without-punctuation (but as many people suggested, it’s probably a lapsus calami, just a mistake), which seems more related to ᓭᐧᘖ (“taba”) in frame 2866; in frame 2728, one that desperately needs more translation effort, we have ᓭᘖᑦ (“tebo”), the likeliest candidate in my opinion.

The conclusion is that I need to study the dubious translations in much finer detail, particularly for frames 2671 and 2728. Please keep posting your suggestions (if you study the corpus I put on Github, all I have there are ellipses…)

Now, for something different. Maybe it’s the literary critic in me, but I have been wondering if the game we are playing is the same as for, say, identifying time and place. We could do more for “deciphering” Beanish, but unless someone finds an algorithm to derive it from whatever language there is, the data we have is simply not enough (I had many crazy, crazy ideas I know wouldn’t work well, going as far as a Naïve Bayes to classify words into their parts of speech using people’s guesses of translations). Maybe we are not supposed to find the rules but actually find the signal in the noise, i.e., develop a grammar that fits the language we have so far? Really, has it been discussed? I look back and seem to have always worked under the assumption that the grammar was there to decrypt.

But I will try to go on with my mumblings, even though life knocks at the door asking me to stop playing that much and focus on work…

On the abugida hypothesis

A note on the abugida hypothesis, as it has been causing some confusion. I should have been more clear saying that, while I support the hypothesis that the script is an abugida, the pronunciations I am using (such as /p/ for ᘊ and /ʤ/ for ᖽ ) are only suggestions to make the discussion easier. I am not saying that the actual pronunciation is or likely is what I am using. My only intention for proposing these pronunciations was to solve the difficulty I (and apparently others) have with the Unicode glyphs, as my mind always reads something like “three-b-dot-seven-en”.

Even if there were evidence supporting that the pronunciation is right, which we do not have, we should not try to find patterns and similarities between Beanish and any other languages. Randall said that Beanish is supposed to be “plausible”, and a plausible future language, even when actually evolved from a natural one currently known, would not have any clear phonetic similarities, due to the nature of sound changes. Besides, we must consider that “Time” is set very far in the future; even Proto-Indo-European as usually reconstructed (i.e., as far as we can possibly go in terms of human language without a time machine; I am always hoping for the Doctor’s next companion to be a linguist) was likely spoken around 3500 BCE (no Paleolithic Continuity Theory, please) and you just can’t easily go from *dhǵhemon to groom, or from *kʷetwóres to four.

Finally, we are a pattern-matching species, used to finding signals in any noise, even random noise (which is my personal explanation for why people still try to write universal data compressors). We cannot help but find similarities among words in different languages, not only because we have this tendency, but also because we know that it is how languages work and because the population of phonemes is so small and the semantic boundaries so flexible that words can, by chance, seem related. This is why so many people try to link languages like Hebrew and Quechua, or Basque and Chinese, and that is why the accepted methodology in comparative linguistics is to look for regular changes (exceptions, if any, must be very well explained, like Tolkien did with Elvish numbers), the words must usually be taken in their “purest” form (and we don’t really know them for Beanish, perhaps with the exception of those starting with ᘊ-), and vowels are important too. There is a very good old article by Mark Rosenfelder you can read: How likely are chance resemblances between languages?

In short, sorry for the confusion, but let’s focus on Beanish syntax and morphology, maybe vocabulary, but not on its similarities with other languages. My fault!

Back to the corpus – II

In frame 2664 we have the second Beanish sentence, ᔪᖆᓄᐧ ᔪ, ᒣᖉ ᖊᐣᖽ ᖽᘛᕋᑦᐤ The context is that, after the first sentence not understood by Megan and Cueball, the second Beanie points to Megan’s leg, previously injured during a keyboard attack, probably asking what happened or if they can help (and, in fact, Megan will show them her leg in the following frames).

This is one of the most cryptic sentences in our corpus. We know it is a question, as it ends with the question mark ᐤ and there are two ᔪ (“na”), which we have established to be common in questions. The only word which is not a hapax is the last one, ᖽᘛᕋᑦ (“ʤazaro”), used, probably by the same Beanie, in frame 2866 when our heroes are presented to Rosetta with a two-word sentence. The word has the ᘛ (“za”) syllable which may or may not be related to verbs of movement; it could be a noun derived from it (“thing-that-makes-you-move”, i.e., “leg”). Given its length, its double occurrence and the fact that it is used in two different situations, one of them along with the extremely complex word ᖉᔭᒣᘊᐣᘖᑫᖗ (“fagaʧapebaθaʃa”), the best guess is to consider it, using the terminology of English grammar, a member of an open class: in order of probability, a noun, a verb or an adverb.

The other words are even more difficult. If what we assume to be the role of the prefix ᔪ- (“na-”) is correct, we have a base word *ᖆᓄᐧ (“dava”), whose closest match is ᖆᓄᘈᖉᐣ (“davesafe”) in frame 2821, usually taken as the name of the Beanie city and which I proposed that might be a compound word *ᖆᓄ + *ᘈᖉᐣ (considering it is similar to ᘖᓄᘈᖉᐣ in frame 2906, another toponym), but there is no clear indication of that. If it works like ᕒᖚᐧ in frame 2663, the most obvious translation is a word like “what” (or “who”, as Beanish can very well distinguish between animate and inanimate beings, instead of human and non human).

Not much can be said about ᔪ, (“daj”), ᒣᖉ (“ʧafa”) and ᖊᐣᖽ (“ðeʤa”). The first is too short and similar to the ᔪ- prefix and, as per Zipf’s Law, would likely be a common semantic trait; the second has the ᖉ (“fa”) syllable that could mean “good, well, normal, happy” (similar to Ancient Greek “eu-”), but unfortunately we cannot state much (I’d love to say that ᒣ- is a negation prefix, like “mal-” in Esperanto, making ᒣᖉ a “no good, not well” ["malbona" in Esperanto], but there is absolutely nothing to support that); the third word is completely opaque, the closest match being the common ᖆᐣᖽ (“deʤa”) which probably means “to/at/in/into”, but to think that the first is a “motion to place” and this a “motion in place” is… just not good.

The most accepted translations are “What happened to your leg?”, “Are you injured?”, “Were you attacked?/What attacked you?” and “Could you show me/us your leg?”. They all seem plausible, especially because there is no other sentence whose translation we can safely assume has “you(r)” (in the singular; why did you English speakers have to drop ‘thou’?), be it a word or a glyph.

I really don’t know. What are your best guesses about the meaning of the sentence and of each of its words? Any breakthrough?

Back to the corpus – I

It is time to go back to our corpus. I will use the abugida hypothesis I developed, as it makes it easier to explain, and I will try to analyze it without the linearity of the last time, jumping to where it seems necessary.

In frame 2663, we have our first Beanish sentence: ᔪᕒᖚᐧ ᘛᔭᐤ (“nalaka zaga?”). We know it is a question, but there has been much debate about its meaning, the three most accepted suggestions being “Where are you from?”, “Who are you?” and “Do you speak/understand Beanish?”.

The first option is similar to Rosetta’s sentence “Whence have you traveled here?”; the second is the one with less semantic load. We know that ᔪ (“na”) is correlated with questions and ᕒᖚᐧ (“laka”) is common in sentences that probably indicate locations, and it was even proposed to translate it as “where” (but we know that Beanish is supposed to be as different from English as possible, and word-by-word translations are difficult). ᔪ (“na”) could be a mark for incomplete information, to be supplied by the listener, similar to Lojban’s “ma” (i.e., a closer translation would be “(you) came from {na}”, expecting the listener to answer “(we) came from X”). This would make ᘛᔭ (“zaga”) the action (“the verb”), possibly inflected for past and second-person plural; in fact, ᘛ (“za”) is found in other sentences whose meaning seems to include movement, as in frame 2865 (possibly “where are they from?”, or, better, “(do) they came from (the-)wildlands?”; the mark of past, if any, would be in the verb) and in frame 2880 (possibly “you can leave now”). These are the three hypotheses from this interpretation: ᔪ- (“na”) is a prefix for incomplete information, used in questioning; ᕒᖚᐧ (“laka”) refers to physical locations and ᘛ (“za”) is part (root?) of actions (verbs?) related to movement.

Regarding the second hypothesis, “Who are you?”, ᔪ- (“na”) could still be a prefix for incomplete information (even though it makes more sense with the previous option), making ᕒᖚᐧ (“laka”) a word related to people (“who”). It does not seem very plausible to me, as we also probably need to omit either the copula verb (“are”) or the pronoun (“you”); while the second alternative is more common, given the analysis above (for ᘛᔭ “zaga” as “to move/leave”) I’d say it is more likely that the verb is omitted and that ᘛ (“za”) is related to the second-person plural.

Regarding the third option, “Do you speak Beanish?”, it is the one that makes more sense to me in the narration (the Beanies have already heard Cueball and Megan speaking in a foreign language, so it wouldn’t make much sense to ask any question other than one that tries to establish a channel of communication, accepting it is a question), but it has some linguistic difficulties. We’d have two main semantic elements, the action (“to speak”, or “to understand”) and the name of the language/people (“beanish”). Still accepting that ᔪ (“na”) is common in questions, it could be taken as the mark for a yes/no question, and in fact we assume that, in frame 2880, ᔪᑕ (“naʒa”) is indeed “yes” (some people have translated it as “ok”, just like ᖉᑦ, “faj”, in frame 2806, but I believe it actually is “good”). It is still a bit difficult to translate this sentence (it could also be “(do) you understand us?”, which is less likely given the later usage of ᕒᖚᐧ “laka”), but I would not rule it out.

(personal note to Randall: if you are reading this, please give us more corpus! even Linear A has more than a thousand specimens! please, please, oh, pretty please!)

hou tu pranownse binish

Here is the development of an idea that I suggested at the XKCD forum: treat our Beanish corpus/words as Markov chains and, using a simple Maximum-Likelihood algorithm, suggest possible pronunciations using the dictionary of an actual language as a reference. This suggestion makes a lot of assumptions: that the Beanish script is an alphabet (but the idea could be later expanded to the syllabary/abugida/… hypothesis), that Beanish words can be treated as Markov chains of symbols (not necessarily valid, as at least the syllable structure and the morphology may not meet the Markov-chain assumption that the next state depends only on the current state and not on earlier ones), that the alphabet at least somewhat mirrors the phonology, that the language we use for reference is similar to Beanish, that distinctive phonetic features (i.e., the phonology) are graphically represented, that phonotactic restrictions can, indeed, be found this way, and many more. It also raises a good number of technical questions, such as what will constitute the corpus to be evaluated (token samples or token outcomes? words taken individually or the entire sentences?) and what kind of smoothing, if any, should be performed (Good-Turing would be the first obvious choice, but we are not dealing with large groups and not necessarily with exclusive ones, not to mention that it seems that we won’t have any new Beanish text coming from Randall in the foreseeable future and, as a consequence, we must exclude any Bayesian “nature”).

Still, I imagined it would have been fun to do. Here are the results.

I took the CMU Pronouncing Dictionary and calculated the transitions (including as first and last symbol) for each phoneme, excluding the stress distinction for vowels. The CMUPD is far from the best choice for our situation, but I wasn’t aware of any better free dictionary to play with. Here, for instance, are the counts of transition from the initial position (i.e., the count for the first phoneme in English words):

{'IY': 590, 'W': 3773, 'DH': 67, 'Y': 1350, 'HH': 6656, 'CH': 1260, 'JH': 2148, 'ZH': 91, 'D': 7722, 'TH': 636, 'AA': 1898, 'B': 9632, 'AE': 2906, 'EH': 2928, 'G': 4963, 'F': 5561, 'AH': 3423, 'K': 12969, 'M': 9450, 'L': 5470, 'AO': 883, 'N': 3214, 'P': 7833, 'S': 12371, 'R': 7445, 'EY': 490, 'T': 4854, 'AW': 351, 'V': 2427, 'AY': 615, 'Z': 946, 'ER': 389, 'IH': 4133, 'UW': 84, 'SH': 2467, 'UH': 14, 'OY': 28, 'OW': 1295}
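The counting step can be sketched roughly as below. This is only an illustration under stated assumptions: the three-word sample stands in for the full dictionary, and the helper names (`strip_stress`, `count_transitions`) are hypothetical, not taken from the original script.

```python
from collections import Counter, defaultdict

START, END = "{", "}"  # chain-start and chain-end markers, as used in this post


def strip_stress(phoneme):
    """Drop the CMUdict stress digit, e.g. 'AO1' -> 'AO'."""
    return phoneme.rstrip("012")


def count_transitions(pronunciations):
    """Count phoneme-to-phoneme transitions, including start and end markers."""
    counts = defaultdict(Counter)
    for phonemes in pronunciations:
        chain = [START] + [strip_stress(p) for p in phonemes] + [END]
        for prev, nxt in zip(chain, chain[1:]):
            counts[prev][nxt] += 1
    return counts


# Toy sample in CMUdict notation (a stand-in for the whole dictionary).
sample = [
    ["W", "AO1", "T", "ER0"],  # water
    ["W", "IH1", "N", "D"],    # wind
    ["T", "AO1", "K"],         # talk
]
counts = count_transitions(sample)
print(dict(counts[START]))  # counts of word-initial phonemes in the sample
```

Reading `counts[START]` over the whole dictionary would give a table like the one above.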

Since smoothing is necessary (we don’t want to end up with probabilities equal to zero), I used NLTK’s implementation of the “Simple Good Turing” algorithm from Gale & Sampson, proud to be using something first developed by Turing himself. Here are the transition probabilities from the initial position calculated with SGT ("{" and "}" are, respectively, my chain-start and chain-end markers):

{  0.00 %     EH 2.20 %    K  9.73 %    S  9.28 %
L  4.10 %     AH 2.57 %    M  7.09 %    EY 0.37 %
SH 1.85 %     N  2.41 %    P  5.87 %    OY 0.02 %
T  3.64 %     }  0.00 %    OW 0.97 %    Z  0.71 %
W  2.83 %     D  5.79 %    B  7.22 %    V  1.82 %
IH 3.10 %     AA 1.42 %    R  5.58 %    AY 0.46 %
ER 0.29 %     AE 2.18 %    F  4.17 %    IY 0.44 %
AW 0.26 %     AO 0.66 %    Y  1.01 %    UW 0.06 %
G  3.72 %     NG 0.00 %    TH 0.48 %    DH 0.05 %
HH 4.99 %     UH 0.01 %    CH 0.95 %    ZH 0.07 %
JH 1.61 %
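To show why smoothing matters, here is a small sketch that swaps in add-one (Laplace) smoothing for Simple Good-Turing, only because it fits in a few lines of standard-library code. It is not the method used for the table above and will give different numbers. The counts are a small subset of the initial-position counts listed earlier, and `laplace_probs` is a hypothetical helper.

```python
from collections import Counter


def laplace_probs(counts, vocabulary, k=1.0):
    """Add-k smoothing: every symbol in the vocabulary gets a nonzero share."""
    total = sum(counts.values()) + k * len(vocabulary)
    return {s: (counts.get(s, 0) + k) / total for s in vocabulary}


# Subset of the word-initial counts; 'NG' never starts a word in the dictionary.
initial_counts = Counter({"K": 12969, "S": 12371, "B": 9632, "UH": 14})
vocabulary = ["K", "S", "B", "UH", "NG"]

probs = laplace_probs(initial_counts, vocabulary)
for symbol in vocabulary:
    print(f"{symbol:3s} {100 * probs[symbol]:.4f} %")
```

The unseen 'NG' ends up with a tiny but nonzero probability, which is what keeps the log-probabilities below finite.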

We can now test some English words, computing the combined log probabilities for their phonemes:

water ['W', 'AO', 'T', 'ER'] -12.8270136724
desktop ['D', 'EH', 'S', 'K', 'T', 'AA', 'P'] -23.8901166969
hagiography ['HH', 'AE', 'G', 'IY', 'AA', 'G', 'R', 'AH', 'F', 'IY'] -32.0648872671

Which confirms that “water” has a sequence of phonemes with higher probability than “hagiography”.
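The scoring itself can be sketched as a sum of log transition probabilities along the chain. The assumptions here are loud ones: natural logarithms (the base is not stated above), a made-up `floor` for unseen transitions, and a toy table with invented probabilities, so the numbers will not match the scores listed.

```python
import math

START, END = "{", "}"


def log_prob(phonemes, transition_probs, floor=1e-6):
    """Sum the log-probabilities of each transition, markers included."""
    chain = [START] + list(phonemes) + [END]
    total = 0.0
    for prev, nxt in zip(chain, chain[1:]):
        # Fall back to a small floor for transitions missing from the table.
        p = transition_probs.get(prev, {}).get(nxt, floor)
        total += math.log(p)
    return total


# Toy transition table; the probabilities are invented for illustration.
toy = {
    "{": {"W": 0.03, "T": 0.04},
    "W": {"AO": 0.10},
    "AO": {"T": 0.05, "}": 0.02},
    "T": {"ER": 0.08, "AO": 0.03},
    "ER": {"}": 0.30},
}
score_water = log_prob(["W", "AO", "T", "ER"], toy)
score_short = log_prob(["T", "AO"], toy)
print(score_water, score_short)
```

Note that every extra phoneme adds a negative term, so longer words tend to score lower regardless of how natural they sound; part of the gap between “water” and “hagiography” is simply length.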

I could now start playing around with genetic algorithms, hill-climbing or a true maximum-likelihood estimator, but decided to go for what any lazy hacker always does: generate lots of random mappings between Beanish glyphs and English phonemes and keep the one with the best score. I know, I know.
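The lazy approach looks roughly like this. It is a sketch, not the original script: `toy_score` is a hypothetical stand-in that merely rewards vowel/consonant alternation (the real scorer would be the Markov-chain log-probability), and the phoneme list and three words are a tiny sample.

```python
import random

VOWELS = {"AA", "IY", "UW"}  # toy vowel set, an assumption for this sketch


def toy_score(phonemes):
    """Stand-in scorer: count vowel/consonant alternations in a word."""
    return sum((a in VOWELS) != (b in VOWELS)
               for a, b in zip(phonemes, phonemes[1:]))


def random_mapping(glyphs, phonemes, rng):
    """Assign each glyph a distinct phoneme chosen at random."""
    return dict(zip(glyphs, rng.sample(phonemes, len(glyphs))))


def best_random_mapping(words, phonemes, score, tries=1000, seed=0):
    """Generate many random mappings and keep the best-scoring one."""
    rng = random.Random(seed)
    glyphs = sorted(set("".join(words)))
    best, best_score = None, float("-inf")
    for _ in range(tries):
        mapping = random_mapping(glyphs, phonemes, rng)
        total = sum(score([mapping[g] for g in word]) for word in words)
        if total > best_score:
            best, best_score = mapping, total
    return best, best_score


words = ["ᘛᔭ", "ᓭᘈ", "ᖊᘊ"]  # three short words without diacritics
phonemes = ["AA", "IY", "UW", "K", "S", "T", "M", "N"]
best, best_score = best_random_mapping(words, phonemes, toy_score)
print(best_score, best)
```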

The results, however, were terrible. While some mappings did perform a little better, there clearly was no pattern to develop with random mappings (don’t forget we are dealing with something like 10^34 different mappings). The only thing I could notice was that mappings with more vowels and high sonority in general (glides, liquids…) were performing a little better, which makes sense from a phonological point of view, but it is not good for our purposes (Rosetta does not seem to have problems with the pronunciation of English, but with its syntax and vocabulary).

I then decided to do things the right way, selecting the best mappings and swapping glyphs trying to find something better. I could have written a true genetic algorithm, but it seemed useless as this hill-climbing was also useless, only confirming that we should have a lot of vowels to make it somewhat pronounceable.
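A swap-based hill climb of this kind can be sketched as follows. Again this is an illustration under assumptions: `evaluate` is a toy alternation score standing in for the real log-probability score, the starting mapping is arbitrary, and only phonemes already in the mapping are exchanged (a fuller version would also try phonemes not yet assigned).

```python
import random

VOWELS = {"AA", "IY", "UW"}       # toy vowel set, an assumption
WORDS = ["ᘛᔭ", "ᓭᘈ", "ᖊᘊ"]      # small sample of short words


def evaluate(mapping):
    """Toy score: vowel/consonant alternations summed over all sample words."""
    total = 0
    for word in WORDS:
        sounds = [mapping[g] for g in word]
        total += sum((a in VOWELS) != (b in VOWELS)
                     for a, b in zip(sounds, sounds[1:]))
    return total


def hill_climb(mapping, evaluate, steps=2000, seed=0):
    """Swap the phonemes of two random glyphs; keep the swap only if it helps."""
    rng = random.Random(seed)
    current = dict(mapping)
    current_score = evaluate(current)
    glyphs = list(current)
    for _ in range(steps):
        a, b = rng.sample(glyphs, 2)
        current[a], current[b] = current[b], current[a]
        new_score = evaluate(current)
        if new_score > current_score:
            current_score = new_score
        else:
            current[a], current[b] = current[b], current[a]  # undo the swap
    return current, current_score


start = {"ᘛ": "AA", "ᔭ": "IY", "ᓭ": "K", "ᘈ": "S", "ᖊ": "UW", "ᘊ": "T"}
start_score = evaluate(start)
final, final_score = hill_climb(start, evaluate)
print(start_score, final_score)
```

Because only strict improvements are accepted, the search can stall on a local optimum, which is one reason a genetic algorithm or random restarts are the usual next step.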

By this time I was reinforcing my guess that the Beanish script is an abugida, but it was necessary to test and evaluate the alphabet hypothesis. Facing a problem the computer was not able to help me with (or for which I was unable to use the computer to help me, which is essentially the same), I once more did what any scientist does and decided to map it by hand, using what I know of phonetics and linguistics (and believe me, I don’t know much). My map could be used as a basis for future developments, including the algorithms I described above, and was necessary: dealing with the arcane Beanish script is not easy, because our brains (or, at least, mine) are used to letters and phonemes. It is much harder to infer anything, and guessing possible mappings between Beanish glyphs and phonetic representations might be a good idea. As usual, it will probably also develop our Beanish skills.

Here are my guesses, always assuming that the script is alphabetical and that it is unlikely that letter shapes are dependent on their position (lunate sigma, anyone?). I’ll later try to do a syllabary one.

Both ᕬ and ᓄ are rare glyphs, likely rare sounds. The first is only found in the word ᖆᕬᖉᔭ; ᖆ and ᖉ are probably vowels (but we can’t completely rule out that they are glides, liquids or even nasal stops) and ᔭ a consonant, giving a probable VCVC word-structure. Being found in mid-vowel position, it could be just anything; let’s assume it is a /ʒ/.

Regarding ᓄ, it is only found in final position, excluding the complex ᖆᓄᘈᖉᐣ word that I have discussed (and is likely a toponym and/or a compound word); we know, however, that it can be followed by the dot diacritic. It usually follows ᖆ (as in ᔪᖆᓄᐧ and ᖆᓄᘈᖉᐣᐨ, which might be related considering that ᔪ could be a prefix) or other glyphs supposed to represent vowels, but our best word, ᓭᘖᔭᓄ “water”, has what I guess to be a VCC? structure (in fact, it is one of the reasons I still haven’t dropped the syllabary hypothesis). Still impossible to make any educated guess.

Keeping up with the analysis of word structures, we move on to the short words (at most two glyphs, not considering the diacritics): ᖆᐣᖽ / ᖊ,ᘖ / ᓭᘈ / ᓭᘖᑦ / ᖊᘊ / ᒣᖉ / ᓭᐧᘖ / ᓭᑦᐧ / ᓭᐧᖚ / ᘛᔭ / ᘛ / ᘛᐣ / ᖉᑦ, / ᖉ, / ᔪ, / ᒣᖉ and ᖉᑦ, — more short words than we’d like and expect for an alphabetic system. Still, at least ᓭ, ᘛ and ᖉ look like vowels, confirming the tendency for a VC syllable-structure; for now, let’s assign them to /a/, /e/ and /o/.
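Picking out the short words is easy to automate. The sketch below assumes that ᐧ, ᐣ, ᑦ and the comma-like mark are diacritics and everything else is a base glyph, which is a guess about the script rather than an established fact; the word list is a small sample and the helper names are hypothetical.

```python
# Assumed diacritic inventory: middle dot, the two "lunar" marks, and the comma.
DIACRITICS = set("ᐧᐣᑦ,")


def base_glyphs(word):
    """Return the glyphs of a word with the assumed diacritics removed."""
    return [ch for ch in word if ch not in DIACRITICS]


def short_words(words, max_glyphs=2):
    """Keep the words with at most `max_glyphs` base glyphs."""
    return [w for w in words if len(base_glyphs(w)) <= max_glyphs]


sample = ["ᖆᐣᖽ", "ᓭᘖᑦ", "ᓭᘖᔭᓄ", "ᘛᔭ", "ᖉᑦ,", "ᘊᒣᓭᐧᖊᔕ"]
short = short_words(sample)
print(short)
```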

Having established that, we can digress and look at some other words, like ᓭᘊᘊ (now /aᘊᘊ/) and ᑫᘊᘊ (“Gibraltar”). ᘊ is the only glyph we find repeated (Beanish is not Italian), and as we can almost rule out it being a vowel or a glide, the repetition is probably meant to represent either a long consonant, a stronger consonant or a repeated consonant. Now, ᘊ is usually found next to what I guessed to be vowels (ᓭ, ᘛ and ᖉ), but not always: we have ᘊᒣᓭᐧᖊᔕ (that forces us to guess what ᒣ is), ᘊᘖᑫᘖᒣᐣᖚ (where it seems that ᘖ is a consonant and ᒣ a vowel), ᖆᘊᓭᒣᖊᐣᖗ (where the guess needed is about ᖆ), ᖊᘊ (where we confirm a tendency for ᖊ to be a vowel), ᑕᘊᐣᒣ (where we have the mysterious ᑕ and, even more important, a diacritic; maybe this diacritic makes a consonant the coda of the syllable!?) and the greeting ᘈᘊᘖ (where, if ᘖ is a consonant, we would probably have VCC). Let’s first accept that ᒣ and ᖊ are vowels, giving them /i/ and /u/. Now, considering that ᘊ looks like a consonant that precedes other consonants, we can guess that it has low sonority, thus being either a stop or a fricative, preferably unvoiced. Going back to ᓭᘊᘊ and ᑫᘊᘊ, my guess is that it is an /s/. People will probably like it because many have guessed that the prefix ᘊ- was a plural mark. We now also have a word, ᓭᘊᘊ /ass/ (in IPA, don’t read it in English! ;) ) and are forced to accept that ᑫ is a vowel. Having used the five cardinal ones, I will try to keep it Romance, now going with /ɛ/ (thus having ᑫᘊᘊ /ɛss/). I can also try to solve the ᘈ problem (found, for example, both in ᓭᘈ and ᘈᘊᘖ) by going to the other side of the mouth, and assigning it to /ɔ/.

Now to solve what I kept pending: the ᘖ consonant. Using my guesses, we now have it in contexts such as sɛiᐣᖚ / ɛ / aᘖᔭᓄ (don’t get so excited, it is no coincidence that I made “water” start with an /a/) / aᘖᑦ / aᐧᘖ / ᘖᐣᖗᔭ / saᘖᔭᓄ / ᖽᔕᐣᘖ / ᖚᐣᘖᖗɛ / ɔsᘖ (the greeting) / ɔ / aᔭᑦᘖ / oisᐣᘖɛ / ᘖᖆie and, which does not sound that plausible, ᘖᓄɔo. Once more, the most obvious choice would be for a stop/fricative, being a common one likely /t/ or /f/, but we cannot avoid considering the diacritics anymore. I still think that, if the script is alphabetic, the diacritics are either some coloring or some indication of phonetic features, either to indicate the correct pronunciation or to graphically represent some phonetic change due to word properties, things like the voicing of an unvoiced consonant between vowels. I like to think that it would help explain the mirrored ᑦ and ᐣ. Of course, to make it difficult, ᘖ is one of those glyphs that can take both ᑦ and ᐣ, leaving us with less likely features such as syllabic, aspirated, nasal release, etc. (and now I am thinking: why the hell are we assuming that the anatomy of the Beanies is like ours?). Anyway, stops can be a little bit more flexible given these probably wrong assumptions, and thus I will go with /t/.

The diacritics, then. The middle dot looks like the simplest one, as it seems to be applied to vowels and to the strange ᓄ glyph, which would make ᖚ yet another vowel (or glide, perhaps). To make it simple, I will assign this last glyph to a vowel not strongly linked to any phonetic feature, be it height, backness or roundness: the good and old mid central vowel ə. This raises some problems, particularly for those used to English phonology, because in some cases it would result in a strongly syllabic schwa (such as /ᕒəᐧ/ or /əɛt/), but we can try fixing it later (and, of course, it is just weird and unlikely, but not impossible). Back to the diacritics, we can now assign some phonetic trait to this middle dot: the best ones would be nasal and aspirated. As aspiration is likely to be more evenly distributed among vowels and consonants (the proof is that… well, I guess so), let’s reserve it for diacritics more evenly distributed and use “nasal” or “nasal release”. Tildes on the way, guys! (yes, even with tilded schwas, which likely won’t render correctly in your system /◌̃ə/; and now spend two minutes trying to pronounce it, by relaxing pretty much every single muscle in your mouth).

Now, the lunar diacritics, ᑦ and ᐣ. The first is the weirder one, as it is even used, in a single occurrence, in the initial position of the word (one of the things that makes me think that a syllabary/abugida is likely): ᑦᘈᖽᐣ. But ᘈ is an unusual glyph graphically, and maybe it is used before for trivial reasons, aesthetic or not. In order to try to create a glyph mapping, we had better not consider it. ᑦ is thus used with ᕋ, ᓭ, ᖉ, ᘈ, ᘖ, ᒣ, ᔭ, ᑕ and ᖊ (mostly vowels, but also the common consonant ᘖ), and ᐣ with ᖊ, ᒣ, ᖽ, ᘖ, ᔕ, ᖚ, ᖆ, ᘊ, ᖉ, ᘛ and ᓄ, a true “bag of stuff”. Things now get very complicated: we can try to consider ᑦ a phonetic feature such as aspiration, but as for ᐣ there is no possible educated guess; while I will translate it as “long” (either vowel or consonant), it should be taken as an indicator that some of the assumptions we made so far are very, very off.
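A co-occurrence list like the one above can be gathered mechanically. The sketch below assumes each diacritic attaches to the glyph immediately before it and skips a word-initial diacritic (as in ᑦᘈᖽᐣ), in line with setting that case aside; the five words are only a sample of the corpus, and `diacritic_hosts` is a hypothetical helper.

```python
from collections import defaultdict

DIACRITICS = set("ᐧᐣᑦ")  # middle dot and the two lunar diacritics


def diacritic_hosts(words):
    """Map each diacritic to the set of base glyphs it is found after."""
    hosts = defaultdict(set)
    for word in words:
        prev = None  # last base glyph seen in this word
        for ch in word:
            if ch in DIACRITICS:
                if prev is not None:  # ignore a word-initial diacritic
                    hosts[ch].add(prev)
            else:
                prev = ch
    return hosts


sample = ["ᓭᘖᑦ", "ᖆᐣᖽ", "ᕒᖚᐧ", "ᑦᘈᖽᐣ", "ᘖᐣᖗᔭ"]
hosts = diacritic_hosts(sample)
for mark in sorted(hosts):
    print(mark, sorted(hosts[mark]))
```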

Before going on, it is time to make sure that every syllable has a vowel. At this moment, we are still left with ᔪᖆᓄⁿ / ᔪ, / ᔭ / ᕋᖗ / ᖽᔕ:t / ᖆ:ᖽ and ᔪᑕ. Now, ᔪ and ᔭ must be vowels; let’s assign them /y/ and /ɯ/, just ’cause I like ’em. This makes ᖆ an unlikely candidate for vowelness, and we can assume it is a consonant and that ᖽ is a vowel (starting to run out of vowels, I choose /ɐ/, trying to keep them as spaced as possible).

Given these new vowels, ᑕ is now, probably, a consonant, found in ᑕs:i, saʰᑕo, əiᑕɛa, ᖆ:ᑕʰəɛ and yᑕ. We need a consonant that can be aspirated and can be found in the complex onset ‘ᑕs’ (assuming phonotactic restrictions are similar to those of the languages I am used to): the best choice is /p/. We also have ᖆ as a consonant, but now a more sonorant sound is less implausible (after all, we are only playing to finish this game), and I’ll go with an /m/.

Let’s now finish by working out the stubborn glyphs we still have. ᕋ is not very frequent, but we find it in ᕒəɛᕋ, / ɐeᕋʰ / ᕋᖗ. It looks like a consonant similar to ᑕ; let’s assign it to /b/. We also have ᕒ, and the only thing we know about it is that it is frequent in questions (yᕒəⁿ / ᕒəⁿ / ᕒəɛb,); let’s assign it a “strong” sound that clearly distinguishes it in a sentence: /ʃ/.

We get back to ᓄ, now assuming that it can have a nasal release. This is almost impossible under the constraints we have by now, but we should only start changing and swapping glyphs at the end. As it is common in final position, we can make it quasi-English-like and assign it to /ŋ/.

Our now almost impossible to pronounce language still has ᖗ, found in msaiu:ᖗ / bᖗ / t:ᖗɯ / ə:tᖗɛ. The number of vowels in the language is probably too high by now, but the only good alternative would be to make ᖗ, too, a vowel (did I say I keep thinking it is an abugida?). But I will try to restrict it, and thus assign it to /r/: /br/, for example, is not such an impossible word. Finally, ᑲ is found in saⁿᑲ; I will make it a voiceless fricative, /f/, to try to get some rhythm in this language full of vowels (I still study poetry, I can’t help it…).

We still have left the “comma” diacritic, by this reckoning yet-another-phonetic-feature, but otherwise the mapping is complete:

ᕬ –> /ʒ/ # voiced palato-alveolar sibilant
ᓭ –> /a/ # open front unrounded vowel
ᘛ –> /e/ # close-mid front unrounded vowel
ᖉ –> /o/ # close-mid back rounded vowel
ᒣ –> /i/ # close front unrounded vowel
ᖊ –> /u/ # close back rounded vowel
ᘊ –> /s/ # voiceless alveolar sibilant
ᑫ –> /ɛ/ # open-mid front unrounded vowel
ᘈ –> /ɔ/ # open-mid back rounded vowel
ᘖ –> /t/ # voiceless alveolar stop
ᖚ –> /ə/ # mid central vowel
ᔪ –> /y/ # close front rounded vowel
ᔭ –> /ɯ/ # close back unrounded vowel
ᑕ –> /p/ # voiceless bilabial stop
ᖽ –> /ɐ/ # near-open central vowel
ᖆ –> /m/ # bilabial nasal
ᕋ –> /b/ # voiced bilabial stop
ᕒ –> /ʃ/ # voiceless palato-alveolar sibilant
ᓄ –> /ŋ/ # velar nasal
ᖗ –> /r/ # alveolar trill
ᑲ –> /f/ # voiceless labiodental fricative
ᐧ –> /_ⁿ/ # nasal release
ᑦ –> /_ʰ/ # aspirated
ᐣ –> /_:/ # long vowel or geminated consonant

Do I really need to say that I am not satisfied with this mapping? There are far too many vowels, there is far less symmetry than what we’d expect from a plausible language, and while it is generally pronounceable (for example /atɯŋ/ for ᓭᘖᔭᓄ, “water”) we have bizarre things like /ps:i/ for ᑕᘊᐣᒣ and unacceptable ones like /ymŋⁿ/ for ᔪᖆᓄᐧ. We could try to fix some of these later, but it just doesn’t seem right.
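For anyone who wants to play with the alphabetic guess, the mapping is just a lookup table. The Python sketch below copies the table from this post as-is; the names `ALPHABET` and `transliterate` are my own, and diacritics are simply written as a suffix on the preceding sound, which is an assumption about notation, not a claim about Beanish.

```python
# Glyph-to-phoneme table of the "alphabetic guess" (copied from the list in this post).
ALPHABET = {
    "ᕬ": "ʒ", "ᓭ": "a", "ᘛ": "e", "ᖉ": "o", "ᒣ": "i", "ᖊ": "u",
    "ᘊ": "s", "ᑫ": "ɛ", "ᘈ": "ɔ", "ᘖ": "t", "ᖚ": "ə", "ᔪ": "y",
    "ᔭ": "ɯ", "ᑕ": "p", "ᖽ": "ɐ", "ᖆ": "m", "ᕋ": "b", "ᕒ": "ʃ",
    "ᓄ": "ŋ", "ᖗ": "r", "ᑲ": "f",
    # Diacritics, written as a suffix on the preceding sound.
    "ᐧ": "ⁿ",  # nasal release
    "ᑦ": "ʰ",  # aspirated
    "ᐣ": ":",  # long vowel or geminated consonant
}

def transliterate(word):
    """Map each glyph to its guessed phoneme; unknown glyphs pass through unchanged."""
    return "".join(ALPHABET.get(ch, ch) for ch in word)

for w in ["ᓭᘖᔭᓄ", "ᑕᘊᐣᒣ", "ᔪᖆᓄᐧ"]:
    print(w, "/" + transliterate(w) + "/")
```

The three sample words are the ones quoted in the paragraph before this sketch, and the loop prints the same transcriptions given there (/atɯŋ/, /ps:i/ and /ymŋⁿ/). Swapping a glyph’s value is a one-line change in the table, which makes it easy to test alternative guesses.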

It is now time to investigate the hypothesis of the Beanish script as an abugida; I’ll do it in the next post.

hou tu pranownse binish – part 2

(note: I am posting part 2 before part 1… Part 1 and 3, alphabetic and syllabic guesses, are far harder and I don’t know if I’ll be able to finish them soon — real life knocking at the door)

In the previous post, I tried, without much success or confidence, to map Beanish glyphs to phonemes, assuming it is an alphabet. I used frequency tables, some linguistic knowledge, my ear (“it sounds good enough”) and, mostly, wild guesses. As I stated, the biggest problem is the diacritics: we can be more or less flexible regarding potential Beanish phonotactic restrictions, but the diacritics (with the possible exception of the “comma” one) do not work like the other glyphs (i.e., they are not letters) and don’t seem to work well as phonetic traits either. I tried to map them to some phonetic features nonetheless, but nobody should be pleased with my suggestions (I certainly am not).

One idea that has been debated in the XKCD fora since the time Time was playing is to treat it as an abugida. The diacritics are probably, once more, to blame, but in a lot of ways it does make sense: they could very well be vowel marks (we can even try to think of them as a graphical representation based on the point of articulation in the mouth, very loosely like Korean), and the biggest objection is that the mean word length is a bit too long. Not that the abugida solution solves every single difficulty regarding Beanish: the transition probabilities among glyphs do suggest an alphabet more than an abugida (assuming the grammar isn’t terribly strict) and the number of glyphs is a bit too large for a “plausible” language. A third possibility is that the script is indeed a syllabary (remember that Randall used Linear A as an example), which does not exclude the possibility of the diacritics being vowel marks; we shall investigate this later.

Anyway, we have four diacritics in the Beanish script: the “middle dot” ᐧ , the “c” ᑦ , the “inverted c” ᐣ and the “comma” ,. Our major difficulty is that they can be combined, particularly the comma, in words such as ᖉᑦ, (but we also have the complex word ᓭᑦᐧ). If the diacritics are vowels, this could mean that vowels can sometimes be combined: in particular, the “comma” could be a glide (the most obvious being the palatal approximant /j/). We are left with ᓭᑦᐧ which, among other hypotheses, could be a diphthong (the only one we have so far) or the mark for a rare vowel. This is what I will assume.

Considering the three diacritics we have left, the fact that one of them looks graphically “neutral” (probably the most common vowel, such as /a/ or /ə/) and the fact that the other two seem to mirror/negate each other, it is a good guess to consider the middle dot as an /a/, the “inverted c” as /e/ (possibly with allophones such as /ɛ/), the “c” as /o/ (possibly with allophones such as /ɔ/), the “comma” as the /j/ glide and the combined diacritic ᑦᐧ as just /oa/ or, even better, /oə/.

And now, let’s tabulate everything to find both the default vowel for each consonant and a guess of what consonant it is (based on the consonant frequency of both Beanish and English, plus two dorsals not found in English but common in other languages). Everything assumes that the syllable structure is V+C, and we are resolving the isolated diacritic in ᑦᘈᖽᐣ (it would just be a word starting with /a/, the only one in our corpus: /asaʤe/).

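| Glyph | Count | /a/ | /e/ | /o/ | Probable base-vowel | Guessed consonant |
| | 29 | 0 | 3 | 0 | /a/ ? | /p/ |
| | 27 | 0 | 1 | 2 | /a/ | /b/ |
| | 24 | 8 + 0.5 (ᓭᑦᐧ) | 0 | 2 + 0.5 (ᓭᑦᐧ) | /e/ | /t/ |
| | 21 | 0 | 7 | 0 | /a/ | /d/ |
| | 17 | 5 | 2 | 0 | /o/ | /k/ |
| | 17 | 0 | 0 | 1 | /a/ ? | /g/ |
| | 16 | 0 | 1 | 3 | /a/ | /ʧ/ |
| | 15 | 0 | 1 | 0 | /a/ ? | /ʤ/ |
| | 13 | 0 | 2 | 2 | /a/ | /f/ |
| | 11 | 0 | 1 | 0 | /a/ ? | /v/ |
| | 10 | 0 | 0 | 0 | /a/ ? | /θ/ |
| | 10 | 0 | 3 | 2 | /a/ | /ð/ |
| | 10 | 0 | 0 | 0 | /a/ ? | /s/ |
| | 7 | 0 | 1 | 0 | /a/ ? | /z/ |
| | 7 | 0 | 0 | 0 | /a/ ? | /ʃ/ |
| | 7 | 0 | 0 | 1 | /a/ ? | /ʒ/ |
| | 6 | 0 | 3 | 0 | /a/ ? | /m/ |
| | 5 | 0 | 0 | 0 | /a/ ? | /n/ |
| | 5 | 0 | 0 | 0 | /a/ ? | /l/ |
| | 4 | 0 | 0 | 2 | /a/ ? | /r/ |
| | 3 | 0 | 0 | 0 | /a/ ? | /ŋ/ |
| | 1 | 0 | 0 | 0 | /a/ ? | /ʎ/ |
| | 3 | 0 | 0 | 0 | /a/ ? | /ɲ/ |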

Which is great, because 1. there is no glyph with at least one occurrence for every diacritic and 2. while a bit extensive, the size of the phonetic catalog is very reasonable (no need to use ejectives or the like, as in the guessed alphabet of part 1 of this post).

If you are still puzzled, this means that (completely made up words) ᘊᓭ should be read with the default vowel for each glyph, here /a/ and /e/, and thus /pate/; if the vowel is not the default one, you add the corresponding diacritic, and thus /pote/ would be written as ᘊᑦᓭ and /pato/ as ᘊᓭᑦ. The “comma” is a semivowel /j/ added after the vowel, and thus ᘊᓭ, would be /patej/ and ᘊᑦ,ᓭ would give us /pojte/.
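The reading rule just described is mechanical enough to sketch in Python. Two loud assumptions: the `BASE` table lists only the five glyphs whose values can be read off the examples in this post (ᘊ and ᓭ from the made-up words, and ᘖ, ᔭ, ᓄ by lining up ᓭᘖᔭᓄ with /tebagava/), and the “comma” diacritic is typed as an ASCII comma. The function name `read_abugida` is mine.

```python
# (consonant, default vowel) per base glyph. Only glyphs whose values can be
# read off the examples in the post are listed; everything else is left out.
BASE = {
    "ᘊ": ("p", "a"),
    "ᓭ": ("t", "e"),
    "ᘖ": ("b", "a"),
    "ᔭ": ("g", "a"),
    "ᓄ": ("v", "a"),
}
# Vowel diacritics override the default vowel of the glyph they follow.
VOWEL_MARKS = {"ᐧ": "a", "ᐣ": "e", "ᑦ": "o", "ᑦᐧ": "oə"}
GLIDE = ","  # assumption: the "comma" diacritic is typed as an ASCII comma

def read_abugida(word):
    """Read a word as consonant+vowel syllables under the abugida guess."""
    out = []
    i = 0
    while i < len(word):
        cons, vowel = BASE[word[i]]  # KeyError for glyphs not in BASE
        i += 1
        # Collect the vowel mark(s) attached to this glyph, if any.
        start = i
        while i < len(word) and word[i] in "ᐧᐣᑦ":
            i += 1
        marks = word[start:i]
        if marks:
            vowel = VOWEL_MARKS[marks]  # KeyError for unlisted combinations
        out.append(cons + vowel)
        # The comma adds the glide /j/ after the vowel.
        if i < len(word) and word[i] == GLIDE:
            out.append("j")
            i += 1
    return "".join(out)

for w in ["ᘊᓭ", "ᘊᑦᓭ", "ᘊᓭᑦ", "ᘊᓭ,", "ᘊᑦ,ᓭ"]:
    print(w, "/" + read_abugida(w) + "/")
```

This sketch deliberately does not handle a word-initial bare diacritic such as the one in ᑦᘈᖽᐣ; it simply raises a `KeyError` there, because that case needs a separate decision.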

The abugida hypothesis is at least plausible, even though, as I said, the words are a bit longer than I’d like and my score at guessing the consonants probably isn’t much better than a random choice. We can later try better guesses using the vocabulary we have decoded so far, such as “water” and “sea”, hoping they are related to some known language (phonosymbolism, anyone?)

But at least ᓭᘖᔭᓄ as /tebagava/ for “water”, while very unlikely, sounds better than the pronunciation I derived in the previous post, the “alphabetic guess”.

Regarding ᘝᓄᘈᖉᐣ

Yet another hypothesis: while our corpus is small and most of the words I am using for this hypothesis seem to be related (“water”, “sea”…), there is a strong tendency for the glyph ᓄ to be found only at the end of words (mostly nouns).

The exception is ᘝᓄᘈᖉᐣ, a somewhat unusual word that many suppose is the name of the Beanie city. Maybe its name is actually a compound word, ᘝᓄ and ᘈᖉᐣ? An even wilder guess: ᘝᓄ or, more likely given the syntax, ᘈᖉᐣ could mean “new” (as in “New York”).
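The “only at the end of words” claim is easy to check by machine. A minimal Python sketch follows, assuming that trailing diacritics do not count as a position of their own; the three sample words are just ones quoted in these posts, not the corpus, and `position_counts` is a name I made up.

```python
DIACRITICS = "ᐧᑦᐣ"

def position_counts(glyph, words):
    """Count word-final vs. other occurrences of a glyph, ignoring diacritics."""
    final = other = 0
    for word in words:
        # Drop diacritics so that a marked final glyph still counts as final.
        bare = [ch for ch in word if ch not in DIACRITICS]
        for i, ch in enumerate(bare):
            if ch == glyph:
                if i == len(bare) - 1:
                    final += 1
                else:
                    other += 1
    return final, other

# Sample only: "water", a word from the alphabetic post, and the city name.
sample = ["ᓭᘖᔭᓄ", "ᔪᖆᓄᐧ", "ᘝᓄᘈᖉᐣ"]
print(position_counts("ᓄ", sample))
```

On this tiny sample the only non-final ᓄ is the one in ᘝᓄᘈᖉᐣ, which is exactly the word that splitting into ᘝᓄ and ᘈᖉᐣ would turn into a regular case.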

Do you carry these people?

I have decided to study Big Hair’s speech in English, as people have pointed out that it might be a “key”. Just had my first insight:

In frame 2897, she (supposedly) says “Do you carry these people with you?”. She probably means “Did you bring any of those people with you?”, referring to the Forty.

We could make hypotheses about the reason for the past-mark-dropping, but I want to focus on the verb “to carry”. While it may sound very weird to native English speakers (for some people in the forum, it was undecipherable at first), it could be an expected error from the speaker of a language that makes a different distinction between to carry/to bring, such as Italian and French. We know we are in current-day France and Randall said that Beanish was “plausible”, not to mention the fact that all Big Hair’s numbers “are too small”… maybe Beanish has French features?

