So, after months of not looking at the corpus and only lurking on rare occasions on the OTT (sorry, guys), I decided it is time to come back and try some new ideas. But, first, I will probably wrap everything up into some academic paper — my CV needs it, and it will hopefully bring more people on board.

If anyone wants to collaborate, please get in touch! 🙂

Rosetta’s errors

Since the beginning of this blog, I have been saying that some good clues on Beanish vocabulary and syntax (and, maybe, even on its origin) might be found in Rosetta’s errors. It is common in language learning to use the errors of an adult learner to pinpoint (and, thus, work with) the “parts” of his/her mother language that are most unlike the language he/she is trying to speak.

GLR and the linguist(s) that helped him were certainly aware of that, and might have bestowed upon us some indications in the unusual graphical representation of Rosetta’s Unglish (which, for example, has settled the matter of what the circle diacritic meant: it is a question mark). So, without further ado:

In frame 2865, “Somewhat” seems to show interference from an expression “some what”, where the words/morphemes/semantic units are separated. It might just be an expression of Rosetta’s difficulty, but it might indicate a language where they are usually separated. Think about Italian, where you can have both “qualcosa” and “qualche cosa”, in different contexts (unfortunately, not the one we have here). The same happens with French (Proto-Beanish?), where “quelque peu” (or, farther from the context, “un peu”), an acceptable translation, is separated.

In frame 2868, we have “Where” and “(From) Whence”. “Whence” sounds, of course, archaic, but it could indicate Beanish as a language in which “where” is not used in questions (or, in detail, in questions with this kind of movement; think of the difference between in+ablative and accusative in Latin when describing movement, both of which would generally be translated as “where” in English), and where the best Unglish translation is “Whence”. Regarding the “From” in Whence, while most rigid grammarians of English will complain, it is an attested form since the 14th century; still, it could indicate a language where the equivalent of “from” is needed.

In frame 2870, I cannot read the faint “to-(something)” at the beginning. The only other notable correction is “sand-(something)” for “desert”, which would be an expected substitution from a speaker with Rosetta’s proficiency.

In frame 2873, Rosetta says “Your language is like those spoken by the (…) difficult”. By investigating the space after the article, either she immediately stopped the sentence (which seems unlikely) or the noun is extremely short, two or three letters. My guess would be “old”, and it is not impossible that, somewhere in the Beanish sentences (like in 2723 or in 2861) we have the Beanish equivalent for “old”.

Frame 2874 suggests that Beanish neutral form is “have patience” and not “be patient” (like in most Romance languages).

In frame 2878, Rosetta says “They understand nothing”. While this is normal for Unglish, it shows nothing of the normal interference of Romance languages that makes us expect the double negative “They don’t understand nothing”. It could be that she just got it right, it could be that it works the same in Beanish (maybe it is an evolution of dialects of Italian and French where you have only postponed negations, like in “(non) Capisco mica”), maybe it means nothing.

In frame 2879, Rosetta says “packs” for “bags”. As “packs” would be understandable and she corrects it, it probably indicates that in Beanish “pack” and “bag” are referred to with the same word.

In frame 2880, Rosetta corrects an initial “For (they are heavy)”. It could indicate that in Beanish the equivalent is mandatory.

In frame 2886, Rosetta corrects “house” with “home”; once more, it might indicate that the word for “house” and “home” is the same, or that in this type of sentence you usually use the word for “house”.

I remember that many people in OTT noticed the strange syntax in frame 2890, “How many people strong are you?” (the “strong” is, however, somewhat dubious). The superimposed word seems to be “numerous”.

I have already discussed frame 2891 — it suggests that in Beanish there are no names for large numbers, that they are composed like in modern French.

Frame 2894 is probably a good clue in terms of the final verb used by Rosetta. We should probably ask a good Scrabble player what he/she thinks of it (I get .EL…NA)

In frame 2895, we have yet another particular syntax in “Your sea does not stand alone”.

In frame 2897, a Beanish synonym for “hill” seems to be “rock”. Our Scrabble expert has a new challenge, a synonym for “closed” in terms of “…RBIDE”(?).

Frame 2899 shows an interesting and somewhat unexpected use of “build” in “build a map”, with “find” as a tentative synonym. It is also worth noting the syntax “to understanding”, with a preposition and gerund.

In frame 2901, I have already noted the construction of possessive used by Rosetta “X is (pronoun)”. Once more, it somewhat reminds of French, or maybe Latin, or Greek, or…

In frame 2904, Rosetta uses “forefathers” as a synonym for “parents”. It is analogous to her usage of “whence”, and could point to a similar construction in Beanish (once more, it brings to mind Romance words, like Italian “ante-nato” and, even better, Portuguese “ante-passados”, not to mention the forms derived from Latin “pro-genitor”, which, by the way, is the source of the English “forefather” calque).

In 2908, it has been noted that Rosetta calls the “castle” a “fortress”.

In frame 2917, we might have another interesting syntax interference, “much too long”.

What now? Well, back to building a Beanish grammar!

We should all go to ᘊᒣᑦᖽᖆ

An interesting group of sentences is the one that suggests movement (I will start with the translations still found in my transcribed corpus):

  • Frame 2734: ᖽᔕᐣᘖ ᖚᐣᘖᖗᑫ ᖆᐣᖽ ᘊᒣᑦᖽᖆᐨ (“We are going to the castle.”)
  • Frame 2806: ᖆᘈᘖ ᖽᔕᐣᘖ ᖚᒣᑕᑫᓭ ᖆᐣᖽ ᘊᒣᑦᖽᖆᐨ (“Comrade, we shall now go to the castle.”)
  • Frame 2836: ᖽᔕᐣᘖ ᖚᐣᘖᖗᑫ ᖆᐣᖽ ᘊᓭᘖᑦᓄᐨ (“We are going to the leader.”)

A table might illustrate it better:

F2734 ᖽᔕᐣᘖ ᖚᐣᘖᖗᑫ ᖆᐣᖽ ᘊᒣᑦᖽᖆ
F2806 ᖆᘈᘖ ᖽᔕᐣᘖ ᖚᒣᑕᑫᓭ ᖆᐣᖽ ᘊᒣᑦᖽᖆ
F2836 ᖽᔕᐣᘖ ᖚᐣᘖᖗᑫ ᖆᐣᖽ ᘊᓭᘖᑦᓄ

We are very confident about the meaning of ᘊᒣᑦᖽᖆ (“paʧoʤada”), given its usage in frame 2827 and its importance in OTT lore: “castle”. The word, just like the alternative ᘊᓭᘖᑦᓄ (“patebova”, i.e., “leader”), has the ᘊ- (“pa-“) prefix that is probably either a determinative (“the castle”), the hypothesis I now favor, or an augmentative (“(the) big castle”); its basic form ᒣᑦᖽᖆ (“ʧoʤada”) doesn’t say much, but the ᒣ (“ʧ(a)”) glyph is once more found in what is probably a name (a “noun”) that refers to something Beanie-made. As just stated, ᘊᓭᘖᑦᓄ (“patebova”) is probably “leader” (“teacher, master, king, commander, professor…”), whose basic form ᓭᘖᑦᓄ (“tebova”) is not extremely far from our old friend ᓭᘖᔭᓄ (“tebagava”), “water”. We don’t know Beanish cosmology, maybe the leader of the group is a “master of waters” or the like (perhaps the leading engineer?); what is more likely, however, is that the ᓭᘖ group develops a word from a basic constituent. This would force us to accept ᓄ as a basic semantic unit (“liquid”, as per azule’s theory), but the hypothesis is not extremely likely and, alas, untestable for the time being.

What is hardly deniable, however, is that these words refer to the castle and to Rosetta, almost certainly acting like objects (a conclusion that has its difficulties in context, however, as Beanish is supposed to be as different from English as possible, and these sentences start to look like a common European language). We are lucky that one of them is a person (and not only a person, but a leader) and the other a thing: it is not impossible to take them as agents (for example, “we will be seen by Rosetta” and “we will be seen by the castle”), but they don’t seem to work like that. The hypothesis of Beanish as an ergative language, which I raised months ago, is also very improbable, in particular when we analyze the isolated occurrence of ᘊᒣᑦᖽᖆ in frame 2827 (but, truth be told, we don’t really know what kind of ellipsis Beanish has in place, i.e., what it omits).

We unfortunately don’t have other occurrences of ᖆᐣᖽ (“deʤa”); the closest match looks to be ᖊᐣᖽ (“ðeʤa”) in frame 2664, but in any case it looks a lot like a preposition and seems to work as expected from Germanic and Romance ones (the most obvious translation, of course, is “to(wards)”). To take the repeated ᖽᔕᐣᘖ (“ʤameba”) as the subject, likely a “we”, is the next logical step, leaving us with what starts to look like a conjugated verb: ᖚᐣᘖᖗᑫ / ᖚᒣᑕᑫᓭ / ᖚᐣᘖᖗᑫ (the word ᖆᘈᘖ [“dasaba”] in frame 2806, used once more in frame 2842, is almost certainly a vocative, likely even the name of the Beanie, but in any case it is superfluous for this analysis).

Now, if ᖚᐣᘖᖗᑫ / ᖚᒣᑕᑫᓭ / ᖚᐣᘖᖗᑫ are different forms of a single verb “to go”, what do we learn? First of all, a structure ᖚ__ᑫ_, which suggests either an infix morphology (where the words are altered not with something before or after a word, like biannual or toys, but with something inserted inside it) or a complex phonotactics (i.e., restrictions on the combinations of sounds). As usual, there is not enough corpus to delve into this, but it is something to restart with. Next time, I will probably try to better investigate the ᘊ- morpheme (is it a morpheme?) as a determinative (“the/this”).
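The ᖚ__ᑫ_ skeleton can also be recovered mechanically, which becomes handy if more forms ever show up. What follows is only a minimal sketch, not a morphological analysis: it assumes every Unicode code point (diacritics such as ᐣ included) counts as one unit, and `shared_skeleton` is a name made up for this illustration. It uses Python's standard `difflib` to list the glyph runs two forms have in common, in order.

```python
from difflib import SequenceMatcher


def shared_skeleton(a: str, b: str) -> list[str]:
    """Return the glyph runs shared by two forms, in order of appearance."""
    # autojunk is irrelevant for words this short; disabled to be explicit.
    matcher = SequenceMatcher(None, a, b, autojunk=False)
    # get_matching_blocks() ends with a zero-length sentinel, hence the filter.
    return [a[m.a:m.a + m.size] for m in matcher.get_matching_blocks() if m.size]


# The two distinct forms, from frames 2734/2836 and from frame 2806.
form_going = "ᖚᐣᘖᖗᑫ"
form_shall_go = "ᖚᒣᑕᑫᓭ"

print(shared_skeleton(form_going, form_shall_go))
```

With only two distinct forms this cannot tell an infix from a coincidence; it just makes the comparison repeatable.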

ᖉ, ᖆᐣᖚᔭ,ᐨ

Back to the corpus – IV

I said I would not be linear, and here it goes: let’s jump to the interesting speech in frame 2728.

This frame is interesting, because it’s one of the Beanish “data points” closest to being a “Rosetta stone”. After some initial and difficult communication, when Cueball apparently learns the Beanish word for “water” (frames 2708/2709), he and Megan try to communicate with the Beanie by drawing. When Cueball makes it clear that they came from the lowlands, where we know the sea is rising, the Beanie seems both surprised and excited and fires three questions: ᔪᖉᔭᑫ ᕒᖚᐧ ᘊᓭᐧᑲᐤ ᑦᘈᖽᐣ ᔭ ᖆᖽᒣ ᓭᘖᑦ ᖊᘊᐤ ᕋᖗ ᖆᕬᖉᔭ ᘖᐣᖗᔭ, ᘊᓭᘖᔭᓄᐤ As Cuegan don’t (doesn’t?) understand, the Beanie takes the stick and draws some parallel lines to indicate that the sea is rising.

We can make many educated guesses. The first sentence is probably a result of the surprise/excitement of the Beanie (remember that, as Rosetta will later explain, they did not imagine there were still people living there). One of the two last sentences (or possibly both, but I guess only the last one) is likely related to the rising of the sea, and it seems a good guess that among those questions one of them is something like “How did you get here?” or, at best, “Are you (two) alone?”.

Syntactically, the first question is the one that helps us most. We have ᔪᖉᔭᑫ ᕒᖚᐧ ᘊᓭᐧᑲᐤ (“nafagaθa laka pataɲa?”), which is strikingly similar to the very first Beanish sentence, ᔪᕒᖚᐧ ᘛᔭᐤ (“nalaka zaga?”), in frame 2663. We have seen that this first question probably means “Where are you from?” or “Who are you?”, with the “question prefix” ᔪ (“na”), which seems to work as a clitic, like most Beanish morphemes; the “locative adverb/preposition” ᕒᖚᐧ (“laka”), basically something which denotes a physical location, probably translated as “here/there/where/in/at/…” (sorry to keep the linguistic jargon, but at least I am not saying things like “place deixis”!); and ᘛᔭ (“zaga”), likely a verb. In our new sentence, we keep the ᔪ (“na”) clitic, along with the ᕒᖚᐧ (“laka”). But we also have the common ᘊᓭᐧᑲ (“pataɲa”) word, which (unless Randall gave us homographs) is the name of the place where Cuegan come from (cf. frame 2906, a.k.a. “the map”); I like to think of it as “Balearic”, to omit the fact that it might be a composite. However, the word seems to be the result of, at least, the ᘊ- (“pa-”) prefix and the word ᓭᐧᑲ (“taɲa”), used in the “good night” sentence of frame 2697. A case of homography here is not probable, and it raises an interesting hypothesis: that, maybe, perhaps, who knows, possibly, just a guess, ᘊ- (“pa-”) does not mean “big, superior”, but is a kind of determinative (an “article”, or maybe a “demonstrative”), used in precise circumstances. Think about the usage of articles in Ancient Greek (I keep going back to it as Randall spoke of Linear A, even though I know that Linear A is not Greek; as far as we know, it is not even Indo-European), including the fact that its usage in what we suppose to be Rosetta’s name might be an honorific.

We then have two sentences, one likely “QUESTION-(from) (to come)” and one “QUESTION-(?) (from) (Balears)”. The ᖉᔭᑫ (“fagaθa”) word is unfortunately very obscure, but, using the logic of English, it should be a verb. The alternative of having the first as “QUESTION-(person/who) (are, inflected 2nd plural)” and the second as “QUESTION-(?) (person/who) (Balears)” has some difficulties due to the later usage of ᕒᖚᐧ (“laka”), but it can’t be ruled out absolutely (a more fluent English translation would be “Are you Balearic?”). For the time being, we should study the corpus with both possibilities in mind: ᕒᖚᐧ (“laka”) as “from/where/there/here” (the one I favor) and as something that is or works as a copula verb (“to be”).

The second question, ᑦᘈᖽᐣ ᔭ ᖆᖽᒣ ᓭᘖᑦ ᖊᘊᐤ (“osaʤe ga daʤaʧa tebo ðapa?”), is more difficult. No word in it is clearly related to anything else in our corpus, not to mention the uncommon diacritic in initial position in ᑦᘈᖽᐣ (“osaʤe”). ᓭᘖᑦ (“tebo”) could be related to ᓭᘖᔭᓄ (“water”), but there is no clear indication of that. This is one of the few words where calculating the probability of a random similarity (I’ll do it, eventually) may actually help us, but once more what we need are good hypotheses to test. What do you all think this second sentence means?

We finally have ᕋᖗ ᖆᕬᖉᔭ ᘖᐣᖗᔭ, ᘊᓭᘖᔭᓄᐤ (“raʃa daʎafaga beʃagai patebava?”). We are naturally drawn to ᘊᓭᘖᔭᓄ (“patebava”), almost certainly a compound of the prefix ᘊ- (“pa-”: “big, superior” or a determinative) and ᓭᘖᔭᓄ (“tebava”: “water”). Based mostly on the following sentence by Megan (“Yes! The sea is rising!”), most people (including myself) speculated that this is, loosely, semantically similar to the third question, which goes hand in hand with the translation of ᘊᓭᘖᔭᓄ as “sea” (“big water”). This has, however, some problems, in particular the fact that we know almost for sure that “sea” is written ᘊᖊᑦᓄ (“paðeva”). azule has suggested in the XKCD fora that ᓄ (“va”) is actually “water, liquid”, and while I am not completely confident with his/her hypothesis of freely joinable morphemes with independent semantic load (I use this complex description because I am not sure if he/she thinks about something more like Klingon or more like Chinese, but we are talking of somewhat-analytic languages), this makes a lot of sense here. Plus, it supports the hypothesis of ᘊ- (“pa-”) as a determinative: in our sentence, “water” is used with a determinative (maybe it’s the subject, maybe some other rule is at play), and in the maps, for example, the prefix is explained by proper nouns (ᘊᖊᑦᓄ ᘊᓭᐧᑲ would be something like “the-Sea the-Balearic”).

The other words in the sentence are, unfortunately, obscure. I agree that we should expect something like “up” or “rise” or “increase” in it, but we cannot go much farther than guessing. ᕋᖗ (“raʃa”) has both uncommon glyphs and uncommon features; the initial ᖆ- (“da-”) in ᖆᕬᖉᔭ (“daʎafaga”) could mean “good” (“up”?), but it doesn’t fit very well with the supposed meaning of the sentence; the only word very loosely similar to ᘖᐣᖗᔭ, (“beʃagai”) is ᖚᐣᘖᖗᑫ (“kebaʃaθa”), which doesn’t add much either. You’ve guessed it: we need more and better hypotheses for the translations.

Back to the corpus – III

In frame 2668 (btw, I am using Geekwagons numbering system), one of the Beanies has examined Megan’s leg and asks or orders something to a second Beanie, which promptly leaves: ᓭᘈ ᘊᒣᓭᐧᖊᔕ ᖆᘊᓭᒣᖊᐣᖗᐨ (“tesa paʧataðama dapateʧaðeʃa”).

It seems everyone agrees on the meaning of this sentence: either “Get/Bring (me/us) cream for-healing” or “Cream for-healing is-needed”. The “cream for-healing” is one of the clearest keys Randall has given us: Megan will soon ask what they are putting on her leg, and the answer will be ᘊᒣᓭᐧᖊᔕ ᖆᘊᓭᒣᖊᐣᖗᐨ. We have established that ᘊ- (“pa”) is almost certainly a prefix (or, more appropriately in terms of Beanish morphology, a semantic particle usually found in initial position when applied to names), probably meaning “big, large, superior”; and in fact the base form ᒣᓭᐧᖊᔕ (“ʧataðama”) will return in frame 2797 referring to what they are eating. Not delving into the possible morphology of ᒣᓭᐧᖊᔕ (“ʧataðama”), though the word is probably at least inflected, if not a compound, “cream” is indeed the best translation for something that can be both eaten and applied to injuries (“unguent”), assuming it is not a proper noun (who knows, maybe aloe vera or nettle are sacred plants among the Beanies, used as aliments) and that Beanish food is not so different from ours.

ᖆᘊᓭᒣᖊᐣᖗ (“dapataʧaðeʃa”) is a bit more interesting. First of all, we can safely assume that it is qualifying the word it follows, which is probably the least disputable feature of Beanish syntax (see, for example, the map, where the word probably meaning sea, ᘊᖊᑦᓄ [“paðova”], is followed by the proper name): modifiers are postponed. The “for-healing” part is a good guess, and allows us to think once more of -ᘊ- as “big”, implying a prefix ᖆ- [“da”] (on OTT, azule suggested it is “home”, but I think it is more likely a general physical descriptor/connector, if indeed all glyphs have a semantic load) and a much more manageable base word ᓭᒣᖊᐣᖗ (“teʧaðeʃa”), which looks a bit like a “verb” (maybe the ᖆ- turns a verb, possibly a nominal one, into an agent, thus “healer (cream)”).

None of this is new, and it has been extensively discussed in the OTT. I want to focus on the word ᓭᘈ (“tesa”). Now, without considering its meaning (be it an imperative “get” or an “is-needed”), the word is likely a verb; our tendency as SVO-language speakers is to take it as an imperative. We unfortunately don’t have words clearly related to it. Considering the ᘈ element it is completely opaque, while considering ᓭ we have some candidates: in the difficult sentence from frame 2671 we have the bizarre word ᓭᑦᐧ (“toj”); in frame 2697 we have the debated ᓭᐧᖚ (“tako”), the word-without-punctuation (but as many people suggested, it’s probably a lapsus calami, just a mistake) which seems more related to ᓭᐧᘖ (“taba”) in frame 2866; in frame 2728, one that desperately needs more translation effort, we have ᓭᘖᑦ (“tebo”), the likeliest candidate in my opinion.

The conclusion is that I need to study in much finer detail the dubious translations, particularly for frames 2671 and 2728. Please keep posting your suggestions (if you study the corpus I put on Github, all I have there are ellipses…)

Now, for something different. Maybe it’s the literary critic in me, but I have been wondering if the game we are playing is the same as for, say, identifying time and place. We could do more for “deciphering” Beanish, but unless someone finds an algorithm to derive it from whatever language there is, the data we have is simply not enough (I had many crazy, crazy ideas I know wouldn’t work well, going as far as a Naïve Bayes to classify words into their parts of speech using people’s guesses of translations). Maybe we are not supposed to find the rules but actually find the signal in the noise, i.e., develop a grammar that fits the language we have so far? Really, has it been discussed? I look back and seem to have always worked under the assumption that the grammar was there to decrypt.

But I will try to go on with my mumblings, even though life knocks at the door asking me to stop playing that much and focus on work…

On the abugida hypothesis

A note on the abugida hypothesis, as it has been causing some confusion. I should have been more clear saying that, while I support the hypothesis that the script is an abugida, the pronunciations I am using (such as /p/ for ᘊ and /ʤ/ for ᖽ ) are only suggestions to make the discussion easier. I am not saying that the actual pronunciation is or likely is what I am using. My only intention for proposing these pronunciations was to solve the difficulty I (and apparently others) have with the Unicode glyphs, as my mind always reads something like “three-b-dot-seven-en”.

Even if there were evidence supporting that the pronunciation is right, which we do not have, we should not try to find patterns and similarities between Beanish and any other languages. Randall said that Beanish is supposed to be “plausible” and a plausible future language, even when actually evolved from a natural one currently known, would not have any clear phonetic similarities, due to the nature of sound changes. Besides, we must consider that “Time” is set very far in the future; even Proto-Indo-European as usually reconstructed (i.e., as far as we can possibly go in terms of human language without a time machine; I am always hoping for the Doctor’s next companion to be a linguist) was likely spoken around 3,500 BCE (no Paleolithic Continuity Theory, please) and you just can’t easily go from *dhǵhemon to groom, or from *kʷetwóres to four.

Finally, we are a pattern-matching species, used to finding signals in any noise, even random noise (which is my personal explanation for why people still try to write universal data compressors). We cannot help but find similarities among words in different languages, not only because we have this tendency, but also because we know that it is how languages work and because the population of phonemes is so small and the semantic boundaries so flexible that words can, by chance, seem related. This is why so many people try to link languages like Hebrew and Quechua, or Basque and Chinese, and that is why the accepted methodology in comparative linguistics is to look for regular changes (exceptions, if any, must be very well explained, like Tolkien did with Elvish numbers), the words must usually be taken in their “purest” form (and we don’t really know them for Beanish, perhaps with the exception of those starting with ᘊ-), and vowels are important too. There is a very good old article by Mark Rosenfelder you can read: How likely are chance resemblances between languages?
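To put a rough number on “by chance”, here is a back-of-the-envelope sketch. It is my own simplification, not a reproduction of Rosenfelder’s figures: it treats every word comparison as an independent trial with the same probability `p_match` of looking related, and asks how likely it is to get at least `k` such accidents in `n_words` comparisons. Both numbers below are invented placeholders, and `prob_at_least` is a name made up for this illustration.

```python
from math import comb


def prob_at_least(k: int, n: int, p: float) -> float:
    """P(X >= k) for X ~ Binomial(n, p): at least k chance matches in n tries."""
    return sum(comb(n, i) * p**i * (1 - p) ** (n - i) for i in range(k, n + 1))


# Hypothetical values, for illustration only (not measured on any corpus).
n_words = 100   # assumed number of word pairs compared
p_match = 0.02  # assumed chance that a single pair "looks related"

for k in (1, 2, 5):
    print(k, round(prob_at_least(k, n_words, p_match), 4))
```

The point of playing with the two parameters is that a handful of look-alikes is what chance alone would be expected to produce once the matching criteria are loose enough.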

In short, sorry for the confusion, but let’s focus on Beanish syntax and morphology, maybe vocabulary, but not in its similarities with other languages. Sorry for the confusion, my fault!

Back to the corpus – II

In frame 2664 we have the second Beanish sentence, ᔪᖆᓄᐧ ᔪ, ᒣᖉ ᖊᐣᖽ ᖽᘛᕋᑦᐤ The context is that, after the first sentence not understood by Megan and Cueball, the second Beanie points to Megan’s leg, previously injured during a keyboard attack, probably asking what happened or if they can help (and, in fact, Megan will show them her leg in the following frames).

This is one of the most cryptic sentences in our corpus. We know it is a question, as it ends with the question mark ᐤ and there are two ᔪ (“na”), which we have established to be common in questions. The only word which is not a hapax is the last one, ᖽᘛᕋᑦ (“ʤazaro”), used, probably by the same Beanie, in frame 2866 when our heroes are presented to Rosetta with a two-word sentence. The word has the ᘛ (“za”) syllable which may or may not be related to verbs of movement; it could be a noun derived from it (“thing-that-makes-you-move”, i.e., “leg”). Given its length, its double occurrence and the fact that it is used in two different situations, one of them along with the extremely complex word ᖉᔭᒣᘊᐣᘖᑫᖗ (“fagaʧapebaθaʃa”), the best guess is to consider it, using the terminology of English grammar, an open-class word: in order of probability, a noun, a verb or an adverb.

The other words are even more difficult. If what we assume to be the role of the prefix ᔪ- (“na-“) is correct, we have a base word *ᖆᓄᐧ (“dava”), whose closest match is ᖆᓄᘈᖉᐣ (“davesafe”) in frame 2821, usually taken as the name of the Beanie city and which I proposed that might be a compound word *ᖆᓄ + *ᘈᖉᐣ (considering it is similar to ᘖᓄᘈᖉᐣ in frame 2906, another toponym), but there is no clear indication of that. If it works like ᕒᖚᐧ in frame 2663, the most obvious translation is a word like “what” (or “who”, as Beanish can very well distinguish between animate and inanimate beings, instead of human and non human).

Not much can be said about ᔪ, (“daj”), ᒣᖉ (“ʧafa”) and ᖊᐣᖽ (“ðeʤa”). The first is too short and similar to the ᔪ- prefix and, as per Zipf’s Law, would likely be a common semantic trait; the second has the ᖉ (“fa”) syllable that could mean “good, well, normal, happy” (similar to Ancient Greek “eu-”), but unfortunately we cannot state much (I’d love to say that ᒣ- is a negation prefix, like “mal-” in Esperanto, making ᒣᖉ a “no good, not well” [“malbona” in Esperanto], but there is absolutely nothing to support that); the third word is completely opaque, the closest match being the common ᖆᐣᖽ (“deʤa”) which probably means “to/at/in/into”, but to think that the first is a “motion to place” and this a “motion in place” is… just not good.

The most accepted translations are “What happened to your leg?”, “Are you injured?”, “Were you attacked?/What attacked you?” and “Could you show me/us your leg?”. They all seem plausible, especially because there is no other sentence whose translation we can safely assume has “you(r)” (in the singular; why did you English speakers have to drop ‘thou’?), be it a word or a glyph.

I really don’t know. What are your best guesses about the meaning of the sentence and of each of its words? Any breakthrough?

Back to the corpus – I

It is time to go back to our corpus. I will use the abugida hypothesis I developed, as it makes it easier to explain, and I will try to analyze it without the linearity of the last time, jumping to where it seems necessary.

In frame 2663, we have our first Beanish sentence: ᔪᕒᖚᐧ ᘛᔭᐤ (“nalaka zaga?”). We know it is a question, but there has been much debate about its meaning, the three most accepted suggestions being “Where are you from?”, “Who are you?” and “Do you speak/understand Beanish?”.

The first option is similar to Rosetta’s sentence “Whence have you traveled here?”, the second is the one with less semantic load. We know that ᔪ (“na”) is correlated with questions and ᕒᖚᐧ (“laka”) is common in sentences that probably indicate locations, so much so that it was even proposed to translate it as “where” (but we know that Beanish is supposed to be as different from English as possible, and word-by-word translations are difficult). ᔪ (“na”) could be a mark for incomplete information, to be supplied by the listener, similar to Lojban’s “ma” (i.e., a closer translation would be “(you) came from {na}”, expecting the listener to answer “(we) came from X”). This would make ᘛᔭ (“zaga”) the action (“the verb”), possibly inflected for past and second-person plural; in fact, ᘛ (“za”) is found in other sentences whose meaning seems to include movement, as in frame 2865 (possibly “where are they from?”, or, better, “(do) they came from (the-)wildlands?”; the mark of past, if any, would be in the verb) and in frame 2880 (possibly “you can leave now”). These are the three hypotheses from this interpretation: ᔪ- (“na”) is a prefix for incomplete information, used in questioning; ᕒᖚᐧ (“laka”) refers to physical locations and ᘛ (“za”) is part (root?) of actions (verbs?) related to movement.

Regarding the second hypothesis, “Who are you?”, ᔪ- (“na”) could still be a prefix for incomplete information (even though it makes more sense with the previous option), making ᕒᖚᐧ (“laka”) a word related to people (“who”). It does not seem very plausible to me, as we also probably need to omit either the copula verb (“are”) or the pronoun (“you”); while the second alternative is more common, given the analysis above (for ᘛᔭ “zaga” as “to move/leave”) I’d say it is more likely that the verb is omitted and that ᘛ (“za”) is related to the second-person plural.

Regarding the third option, “Do you speak Beanish?”, it is the one that makes the most sense to me in the narration (the Beanies have already heard Cueball and Megan speaking in a foreign language, so it wouldn’t make much sense to ask any question other than one that tries to establish a channel of communication, accepting it is a question), but it has some linguistic difficulties. We’d have two main semantic elements, the action (“to speak”, or “to understand”) and the name of the language/people (“beanish”). Still accepting that ᔪ (“na”) is common in questions, it could be taken as the mark for a yes/no question, and in fact we assume that, in frame 2880, ᔪᑕ (“naʒa”) is indeed “yes” (some people have translated it as “ok”, just like ᖉᑦ, “faj”, in frame 2806, but I believe it actually is “good”). It is still a bit difficult to translate this sentence (it could also be “(do) you understand us?”, which is less likely given the later usage of ᕒᖚᐧ “laka”), but I would not rule it out.

(personal note to Randall: if you are reading this, please give us more corpus! even Linear A has more than a thousand specimens! please, please, oh, pretty please!)

hou tu pranownse binish

Here is the development of an idea that I suggested at the XKCD forum: treat our Beanish corpus/words as Markov chains and, using a simple Maximum-Likelihood algorithm, suggest possible pronunciations using the dictionary of an actual language as a reference. This suggestion makes a lot of assumptions: that the Beanish script is an alphabet (but the idea could be later expanded to the syllabary/abugida/… hypothesis), that Beanish words can be treated as Markov chains of symbols (not necessarily valid, as at least the syllable structure and the morphology may not meet the Markov-chain assumption that the next state depends only on the current state and not on earlier ones), that the alphabet at least somewhat mirrors the phonology, that the language we use for reference is similar to Beanish, that distinctive phonetic features (i.e., the phonology) are graphically represented, that phonotactic restrictions can, indeed, be found this way, and many more. It also raises a good number of technical questions, such as what will constitute the corpus to be evaluated (token samples or token outcomes? words taken individually or the entire sentences?) and what kind of smoothing, if any, should be performed (Good-Turing would be the first obvious choice, but we are not dealing with large groups and not necessarily with exclusive ones, not to mention that it seems that we won’t have any new Beanish text coming from Randall in the foreseeable future and, as a consequence, we must exclude any Bayesian “nature”).

Still, I imagined it would have been fun to do. Here are the results.

I took the CMU Pronouncing Dictionary and calculated the transitions (including as first and last symbol) for each phoneme, excluding the stress distinction for vowels. The CMUPD is far from the best choice for our situation, but I wasn’t aware of any better free dictionary to play with. Here, for instance, are the counts of transition from the initial position (i.e., the count for the first phoneme in English words):

{'IY': 590, 'W': 3773, 'DH': 67, 'Y': 1350, 'HH': 6656, 'CH': 1260, 'JH': 2148, 'ZH': 91, 'D': 7722, 'TH': 636, 'AA': 1898, 'B': 9632, 'AE': 2906, 'EH': 2928, 'G': 4963, 'F': 5561, 'AH': 3423, 'K': 12969, 'M': 9450, 'L': 5470, 'AO': 883, 'N': 3214, 'P': 7833, 'S': 12371, 'R': 7445, 'EY': 490, 'T': 4854, 'AW': 351, 'V': 2427, 'AY': 615, 'Z': 946, 'ER': 389, 'IH': 4133, 'UW': 84, 'SH': 2467, 'UH': 14, 'OY': 28, 'OW': 1295}
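For anyone who wants to reproduce this kind of count, here is a minimal sketch of the idea, not the script actually used for the numbers above. It assumes the dictionary has already been loaded into a plain `word -> list of phonemes` mapping in CMU style (vowels carrying a stress digit); the three-word `sample` is hand-typed, and `strip_stress` and `count_transitions` are names invented for this illustration.

```python
from collections import Counter, defaultdict

START, END = "{", "}"  # chain-start and chain-end markers


def strip_stress(phoneme: str) -> str:
    """Drop the trailing stress digit of CMU-style vowels (AO1 -> AO)."""
    return phoneme.rstrip("012")


def count_transitions(lexicon: dict[str, list[str]]) -> dict[str, Counter]:
    """Count how often each symbol is immediately followed by each other one."""
    counts: dict[str, Counter] = defaultdict(Counter)
    for phonemes in lexicon.values():
        chain = [START] + [strip_stress(p) for p in phonemes] + [END]
        for prev, nxt in zip(chain, chain[1:]):
            counts[prev][nxt] += 1
    return counts


# Tiny hand-typed sample; a real run would load the whole dictionary.
sample = {
    "water": ["W", "AO1", "T", "ER0"],
    "wall": ["W", "AO1", "L"],
    "tall": ["T", "AO1", "L"],
}

counts = count_transitions(sample)
print(dict(counts[START]))  # first-phoneme counts, like the listing above
print(dict(counts["AO"]))   # what follows AO in the sample
```

Counting one entry per dictionary word (types rather than running-text tokens) is itself a choice; it is one of the open questions about what should constitute the corpus.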

Since some smoothing is necessary (we don’t want to end up with probabilities equal to zero), I used NLTK’s implementation of the “Simple Good Turing” algorithm from Gale & Sampson, proud to be using something first developed by Turing himself. Here are the transition probabilities out of the initial position calculated with SGT ("{" and "}" are, respectively, my chain-start and chain-end markers):

{  0.00 %     EH 2.20 %    K  9.73 %    S  9.28 %
L  4.10 %     AH 2.57 %    M  7.09 %    EY 0.37 %
SH 1.85 %     N  2.41 %    P  5.87 %    OY 0.02 %
T  3.64 %     }  0.00 %    OW 0.97 %    Z  0.71 %
W  2.83 %     D  5.79 %    B  7.22 %    V  1.82 %
IH 3.10 %     AA 1.42 %    R  5.58 %    AY 0.46 %
ER 0.29 %     AE 2.18 %    F  4.17 %    IY 0.44 %
AW 0.26 %     AO 0.66 %    Y  1.01 %    UW 0.06 %
G  3.72 %     NG 0.00 %    TH 0.48 %    DH 0.05 %
HH 4.99 %     UH 0.01 %    CH 0.95 %    ZH 0.07 %
JH 1.61 %
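As a sanity check, the unsmoothed maximum-likelihood estimates can be computed directly from the word-initial counts listed earlier. This sketch (variable and function names are mine) simply divides each count by the total. For frequent phonemes such as K and S the result agrees with the table to two decimals, while NG, which is absent from the counts, gets exactly zero unless some smoothing is applied.

```python
# Word-initial phoneme counts, copied from the listing above.
initial_counts = {
    'IY': 590, 'W': 3773, 'DH': 67, 'Y': 1350, 'HH': 6656, 'CH': 1260,
    'JH': 2148, 'ZH': 91, 'D': 7722, 'TH': 636, 'AA': 1898, 'B': 9632,
    'AE': 2906, 'EH': 2928, 'G': 4963, 'F': 5561, 'AH': 3423, 'K': 12969,
    'M': 9450, 'L': 5470, 'AO': 883, 'N': 3214, 'P': 7833, 'S': 12371,
    'R': 7445, 'EY': 490, 'T': 4854, 'AW': 351, 'V': 2427, 'AY': 615,
    'Z': 946, 'ER': 389, 'IH': 4133, 'UW': 84, 'SH': 2467, 'UH': 14,
    'OY': 28, 'OW': 1295,
}

total = sum(initial_counts.values())

def mle_percent(phoneme):
    """Unsmoothed maximum-likelihood estimate, as a percentage."""
    return 100.0 * initial_counts.get(phoneme, 0) / total

for p in ("K", "S", "NG"):
    print(f"{p:2s} {mle_percent(p):.2f} %")
```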

We can now test some English words, computing the combined log probabilities for their phonemes:

water ['W', 'AO', 'T', 'ER'] -12.8270136724
desktop ['D', 'EH', 'S', 'K', 'T', 'AA', 'P'] -23.8901166969
hagiography ['HH', 'AE', 'G', 'IY', 'AA', 'G', 'R', 'AH', 'F', 'IY'] -32.0648872671

This confirms that “water” has a sequence of phonemes with higher probability than “hagiography”.
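The scoring itself is just a sum of log transition probabilities along the chain. Here is a minimal sketch that trains on the three words above only and uses add-one smoothing as a simple stand-in for Simple Good Turing, so the numbers it prints will not match the ones listed; the function names are hypothetical. Since every factor is below one, longer words tend to score lower simply because they have more transitions.

```python
import math
from collections import Counter

START, END = "{", "}"

def train(pronunciations):
    """Count phoneme-to-phoneme transitions, with start and end markers."""
    pair_counts, context_counts = Counter(), Counter()
    for phones in pronunciations:
        chain = [START] + list(phones) + [END]
        for prev, nxt in zip(chain, chain[1:]):
            pair_counts[(prev, nxt)] += 1
            context_counts[prev] += 1
    return pair_counts, context_counts

def log_score(phones, pair_counts, context_counts, n_symbols):
    """Sum of natural-log transition probabilities, add-one smoothed."""
    chain = [START] + list(phones) + [END]
    total = 0.0
    for prev, nxt in zip(chain, chain[1:]):
        p = (pair_counts[(prev, nxt)] + 1) / (context_counts[prev] + n_symbols)
        total += math.log(p)
    return total

# Toy training set: the three pronunciations quoted in the post.
lexicon = [
    ["W", "AO", "T", "ER"],
    ["D", "EH", "S", "K", "T", "AA", "P"],
    ["HH", "AE", "G", "IY", "AA", "G", "R", "AH", "F", "IY"],
]
pairs, contexts = train(lexicon)
symbols = {s for pair in pairs for s in pair}  # includes both markers
for phones in lexicon:
    print(phones, log_score(phones, pairs, contexts, len(symbols)))
```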

I could now start playing around with genetic algorithms, hill-climbing or a true maximum-likelihood estimator, but decided to go for what any lazy hacker always does: generate lots of random mappings between Beanish glyphs and English phonemes and keep the one with the best score. I know, I know.

The results, however, were terrible. While some mappings did perform a little better than others, there clearly was no pattern to develop with random mappings (don’t forget we are dealing with something like 10^34 different mappings). The only thing I could notice was that mappings with more vowels and high sonority in general (glides, liquids…) were performing a little better, which makes sense from a phonological point of view, but is not good for our purposes (Rosetta does not seem to have problems with the pronunciation of English, but with its syntax and vocabulary).

I then decided to do things the right way, selecting the best mappings and swapping glyphs trying to find something better. I could have written a true genetic algorithm, but it seemed useless as this hill-climbing was also useless, only confirming that we should have a lot of vowels to make it somewhat pronounceable.
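The two strategies described above (random mappings first, then swaps starting from the best one) can be sketched as follows. This is only an illustration under loud assumptions: the phoneme inventory is a small made-up one, the four words are taken from this post, and `toy_score` is a stand-in that rewards vowel/consonant alternation instead of the phoneme Markov model actually used.

```python
import random

VOWELS = {"a", "e", "i", "o", "u"}
PHONEMES = sorted(VOWELS | {"p", "t", "k", "s", "m", "n", "r", "l"})

def toy_score(words, mapping):
    """Stand-in scorer: fraction of adjacent glyph pairs that alternate
    between vowel and consonant (not the Markov model from the post)."""
    hits = total = 0
    for word in words:
        sounds = [mapping[g] for g in word]
        for a, b in zip(sounds, sounds[1:]):
            hits += (a in VOWELS) != (b in VOWELS)
            total += 1
    return hits / total if total else 0.0

def random_search(words, glyphs, tries, rng):
    """Draw random one-to-one mappings and keep the best-scoring one."""
    best, best_score = None, float("-inf")
    for _ in range(tries):
        mapping = dict(zip(glyphs, rng.sample(PHONEMES, len(glyphs))))
        s = toy_score(words, mapping)
        if s > best_score:
            best, best_score = mapping, s
    return best, best_score

def hill_climb(words, mapping, steps, rng):
    """Swap the phonemes of two random glyphs; undo the swap if it hurts."""
    mapping = dict(mapping)
    current = toy_score(words, mapping)
    glyphs = list(mapping)
    for _ in range(steps):
        g1, g2 = rng.sample(glyphs, 2)
        mapping[g1], mapping[g2] = mapping[g2], mapping[g1]
        s = toy_score(words, mapping)
        if s >= current:
            current = s
        else:  # undo a swap that made things worse
            mapping[g1], mapping[g2] = mapping[g2], mapping[g1]
    return mapping, current

words = ["ᓭᘖᔭᓄ", "ᓭᘊᘊ", "ᑫᘊᘊ", "ᘈᘊᘖ"]
glyphs = sorted({g for w in words for g in w})
rng = random.Random(0)
start, start_score = random_search(words, glyphs, tries=200, rng=rng)
final, final_score = hill_climb(words, start, steps=500, rng=rng)
print(start_score, final_score)
```

Swapping only rearranges phonemes that are already assigned, so a climb of this kind can never bring in a phoneme that the starting mapping left out.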

By this time I was reinforcing my guess that the Beanish script is an abugida, but it was necessary to test and evaluate the alphabet hypothesis. Facing a problem the computer was not able to help me with (or with which I was unable to make the computer help me, which is essentially the same), I once more did what any scientist does and decided to map it by hand, using what I know of phonetics and linguistics (and believe me, I don’t know much). My map could be used as a basis for future developments, including the algorithms I described above, and was necessary: dealing with the arcane Beanish script is not easy, because our brains (or, at least, mine) are used to letters and phonemes. It is much harder to infer anything from bare glyphs, so guessing possible mappings between Beanish glyphs and phonetic representations might be a good idea. As usual, it will probably also develop our Beanish skills.

Here are my guesses, always assuming that the script is alphabetical and that it is unlikely that letter shapes are dependent on their position (lunate sigma, anyone?). I’ll later try to do a syllabary one.

Both ᕬ and ᓄ are rare glyphs, likely rare sounds. The first is only found in the word ᖆᕬᖉᔭ; ᖆ and ᖉ are probably vowels (but we can’t completely rule out that they are glides, liquids or even nasal stops) and ᔭ a consonant, giving a probable VCVC word-structure. Being found between vowels, it could be just anything; let’s assume it is a /ʒ/.

Regarding ᓄ, it is only found in final position, excluding the complex ᖆᓄᘈᖉᐣ word that I have discussed (and is likely a toponym and/or a compound word); we know, however, that it can be followed by the dot diacritic. It usually follows ᖆ (as in ᔪᖆᓄᐧ and ᖆᓄᘈᖉᐣᐨ, which might be related considering that ᔪ could be a prefix) or other glyphs supposed to represent vowels, but our best word, ᓭᘖᔭᓄ “water”, has what I guess to be a VCC? structure (in fact, it is one of the reasons I still haven’t dropped the syllabary hypothesis). Still impossible to make any educated guess.

Keeping up with the analysis of word structures, we move on to the short words (at most two glyphs, not considering the diacritics): ᖆᐣᖽ / ᖊ,ᘖ / ᓭᘈ / ᓭᘖᑦ / ᖊᘊ / ᒣᖉ / ᓭᐧᘖ / ᓭᑦᐧ / ᓭᐧᖚ / ᘛᔭ / ᘛ / ᘛᐣ / ᖉᑦ, / ᖉ, / ᔪ, / ᒣᖉ and ᖉᑦ, — more short words than we’d like and expect for an alphabetic system. Still, at least ᓭ, ᘛ and ᖉ look like vowels, confirming the tendency for a VC syllable-structure; for now, let’s assign them to /a/, /e/ and /o/.

Having established that, we can digress and look at some other words, like ᓭᘊᘊ (now /aᘊᘊ/) and ᑫᘊᘊ (“Gibraltar”). ᘊ is the only glyph we find repeated (Beanish is not Italian), and as we can almost rule out it being a vowel or a glide, the repetition probably represents either a long consonant, a stronger consonant or a repeated consonant. Now, ᘊ is usually found next to what I guessed to be vowels (ᓭ, ᘛ and ᖉ), but not always: we have ᘊᒣᓭᐧᖊᔕ (that forces us to guess what ᒣ is), ᘊᘖᑫᘖᒣᐣᖚ (where it seems that ᘖ is a consonant and ᒣ a vowel), ᖆᘊᓭᒣᖊᐣᖗ (where the guess needed is about ᖆ), ᖊᘊ (where we confirm a tendency for ᖊ to be a vowel), ᑕᘊᐣᒣ (where we have the mysterious ᑕ and, even more important, a diacritic; maybe this diacritic makes a consonant the coda of the syllable!?) and the greeting ᘈᘊᘖ (where, if ᘖ is a consonant, we would probably have VCC). Let’s first accept that ᒣ and ᖊ are vowels, giving them /i/ and /u/. Now, considering that ᘊ looks like a consonant that precedes other consonants, we can guess that it has low sonority, thus being either a stop or a fricative, preferably unvoiced. Going back to ᓭᘊᘊ and ᑫᘊᘊ, my guess is that it is an /s/. People will probably like it because many have guessed that the prefix ᘊ- was a plural mark. We now also have a word, ᓭᘊᘊ /ass/ (in IPA, don’t read it in English! 😉 ) and are forced to accept that ᑫ is a vowel. Having used the five cardinal ones, I will try to keep it Romance, now going with /ɛ/ (thus having ᑫᘊᘊ /ɛss/). I can also try to solve the ᘈ problem (found, for example, both in ᓭᘈ and ᘈᘊᘖ) by going to the other side of the mouth, and assigning it to /ɔ/.

Now for what I kept pending, the ᘖ consonant. Using my guesses, we now have it in contexts such as sɛiᐣᖚ / ɛ / aᘖᔭᓄ (don’t get so excited, it is no coincidence that I made “water” start with an /a/) / aᘖᑦ / aᐧᘖ / ᘖᐣᖗᔭ / saᘖᔭᓄ / ᖽᔕᐣᘖ / ᖚᐣᘖᖗɛ / ɔsᘖ (the greeting) / ɔ / aᔭᑦᘖ / oisᐣᘖɛ / ᘖᖆie and, which does not sound that plausible, ᘖᓄɔo. Once more, the most obvious choice would be a stop or a fricative, a common one such as /t/ or /f/, but I cannot avoid considering the diacritics anymore. I still think that, if the script is alphabetic, the diacritics are either some coloring or some indication of phonetic features, either to indicate the correct pronunciation or to graphically represent some phonetic change due to word properties, things like the voicing of an unvoiced consonant between vowels. I like to think that it would help explain the mirrored ᑦ and ᐣ. Of course, to make it difficult, ᘖ is one of those glyphs that can take both ᑦ and ᐣ, leaving us with less likely features such as syllabic, aspirated, nasal release, etc. (and now I am thinking: why the hell are we assuming that the anatomy of the Beanies is like ours?). Anyway, stops can be a little bit more flexible given these probably wrong assumptions, and thus I will go with /t/.

The diacritics, then. The middle dot looks like the simpler one, as it seems to be applied to vowels and to the strange ᓄ glyph, which would make ᖚ yet another vowel (or glide, perhaps). To make it simple, I will assign this last glyph to a vowel not strongly linked to any phonetic feature, be it height, backness or roundness: the good old mid central vowel ə. This raises some problems, particularly for those used to English phonology, because in some cases it would result in a strongly syllabic schwa (such as /ᕒəᐧ/ or /əɛt/), but we can try fixing it later (and, of course, it is just weird and unlikely, but not impossible). Back to the diacritics, we can now assign some phonetic trait to this middle dot: the best ones would be nasal and aspirated. As aspiration is likely to be more evenly distributed among vowels and consonants (the proof is that… well, I guess so), let’s reserve it for diacritics more evenly distributed and use “nasal” or “nasal release”. Tildes on the way, guys! (yes, even with tilded schwas, which likely won’t render correctly in your system /◌̃ə/; and now spend two minutes trying to pronounce it, by relaxing pretty much every single muscle in your mouth).

Now, the lunar diacritics, ᑦ and ᐣ. The first is the weirder one, as it is even used, in a single occurrence, in the initial position of the word (one of the things that makes me think that a syllabary/abugida is likely): ᑦᘈᖽᐣ. But ᘈ is an unusual glyph graphically, and maybe the diacritic is placed before it for trivial reasons, aesthetic or not. In order to try to create a glyph mapping, we had better not consider it. ᑦ is thus used with ᕋ, ᓭ, ᖉ, ᘈ, ᘖ, ᒣ, ᔭ, ᑕ and ᖊ (mostly vowels, but also the common consonant ᘖ), and ᐣ with ᖊ, ᒣ, ᖽ, ᘖ, ᔕ, ᖚ, ᖆ, ᘊ, ᖉ, ᘛ and ᓄ, a true “bag of stuff”. Things now get very complicated: we can try to consider ᑦ a phonetic feature such as aspiration, but as for ᐣ there is no possible educated guess; while I will translate it as “long” (either vowel or consonant), it should be taken as an indicator that some of the assumptions we made so far are very, very off.
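A tabulation of this kind (which glyphs each diacritic attaches to) is easy to automate. The sketch below assumes that a diacritic is written right after the glyph it modifies, and it runs on only five words quoted in this post, so its output is a small subset of the lists above; the function name is hypothetical.

```python
from collections import defaultdict

DIACRITICS = {"ᐧ", "ᑦ", "ᐣ"}  # middle dot and the two mirrored marks

def diacritic_hosts(words):
    """Map each diacritic to the set of glyphs it directly follows."""
    hosts = defaultdict(set)
    for word in words:
        prev = None
        for ch in word:
            if ch in DIACRITICS:
                # None records a diacritic with no preceding glyph (word-initial).
                hosts[ch].add(prev)
            else:
                prev = ch
    return dict(hosts)

# Assumption: a handful of words taken from this post, not the full corpus.
sample = ["ᓭᘖᑦ", "ᘛᐣ", "ᖆᐣᖽ", "ᓭᐧᘖ", "ᑦᘈᖽᐣ"]
for mark, glyphs in diacritic_hosts(sample).items():
    print(mark, glyphs)
```

The word-initial ᑦ shows up as a `None` host, which makes that single odd occurrence easy to spot.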

Before going on, it is time to make sure that every syllable has a vowel. At this moment, we are still left with ᔪᖆᓄⁿ / ᔪ, / ᔭ / ᕋᖗ / ᖽᔕ:t / ᖆ:ᖽ and ᔪᑕ. Now, ᔪ and ᔭ must be vowels; let’s assign them /y/ and /ɯ/, just ’cause I like ’em. This makes ᖆ an unlikely candidate for vowelness, and we can assume it is a consonant and that ᖽ is a vowel (starting to run out of vowels, I choose /ɐ/, trying to keep them as spaced as possible).

Given these new vowels, ᑕ is now, probably, a consonant, found in ᑕs:i, saʰᑕo, əiᑕɛa, ᖆ:ᑕʰəɛ and yᑕ. We need a consonant that can be aspirated and can be found in the complex onset ‘ᑕs’ (assuming phonotactic restrictions are similar to those of the languages I am used to): the best choice is /p/. We also have ᖆ as a consonant, but by now a more sonorant sound is less implausible (after all, we are only playing to finish this game), and I’ll go with an /m/.

Let’s now finish by working out the stubborn glyphs we still have. ᕋ is not very frequent, but we find it in ᕒəɛᕋ, / ɐeᕋʰ / ᕋᖗ . It looks like a consonant similar to ᑕ; let’s assign it to /b/. We also have ᕒ, and the only thing we know about it is that it is frequent in questions, in yᕒəⁿ / ᕒəⁿ / ᕒəɛb, ; let’s assign it a “strong” sound that clearly distinguishes it in a sentence: /ʃ/.

We get back to ᓄ, now assuming that it can have a nasal release. This is almost impossible given the constraints we have by now, but we should only start changing and swapping glyphs at the end. As it is common in final position, we can make it quasi-English-like and assign it to /ŋ/.

Our now almost impossible to pronounce language still has ᖗ, found in msaiu:ᖗ / bᖗ / t:ᖗɯ / ə:tᖗɛ. The number of vowels in the language is probably too high by now, but the only good alternative would be to make ᖗ, too, a vowel (did I say I keep thinking it is an abugida?). But I will try to restrict it, and thus assign it to /r/: /br/, for example, is not such an impossible word. Finally, ᑲ is found in saⁿᑲ; I will make it a voiceless fricative, /f/, to try to get some rhythm in this language full of vowels (I still study poetry, I can’t help it…).

We still have left the “comma” diacritic, by this reckoning yet-another-phonetic-feature, but otherwise the mapping is complete:

ᕬ –> /ʒ/ # voiced palato-alveolar sibilant
ᓭ –> /a/ # open front unrounded vowel
ᘛ –> /e/ # close-mid front unrounded vowel
ᖉ –> /o/ # close-mid back rounded vowel
ᒣ –> /i/ # close front unrounded vowel
ᖊ –> /u/ # close back rounded vowel
ᘊ –> /s/ # voiceless alveolar sibilant
ᑫ –> /ɛ/ # open-mid front unrounded vowel
ᘈ –> /ɔ/ # open-mid back rounded vowel
ᘖ –> /t/ # voiceless alveolar stop
ᖚ –> /ə/ # mid central vowel
ᔪ –> /y/ # close front rounded vowel
ᔭ –> /ɯ/ # close back unrounded vowel
ᑕ –> /p/ # voiceless bilabial stop
ᖽ –> /ɐ/ # near-open central vowel
ᖆ –> /m/ # bilabial nasal
ᕋ –> /b/ # voiced bilabial stop
ᕒ –> /ʃ/ # voiceless palato-alveolar sibilant
ᓄ –> /ŋ/ # velar nasal
ᖗ –> /r/ # alveolar trill
ᑲ –> /f/ # voiceless labiodental fricative
ᐧ –> /_ⁿ/ # nasal release
ᑦ –> /_ʰ/ # aspirated
ᐣ –> /_:/ # long vowel or geminated consonant

Do I really need to say that I am not satisfied with this mapping? There are far too many vowels, there is far less symmetry than what we’d expect from a plausible language, and while it is generally pronounceable (for example /atɯŋ/ for ᓭᘖᔭᓄ, “water”) we have bizarre things like /ps:i/ for ᑕᘊᐣᒣ and unacceptable ones like /ymŋⁿ/ for ᔪᖆᓄᐧ; we could try to fix some of these later, but it just doesn’t seem right.
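The mapping can be applied mechanically, which makes it easy to check the transcriptions quoted above and to count how lopsided the inventory is. This is a minimal sketch with the table copied into a dictionary; the names `BASE`, `DIACRITICS` and `transliterate` are mine, and any character without an entry (such as ᔕ or the comma) is simply left untouched.

```python
# Glyph values copied from the mapping table above.
BASE = {
    "ᕬ": "ʒ", "ᓭ": "a", "ᘛ": "e", "ᖉ": "o", "ᒣ": "i", "ᖊ": "u", "ᘊ": "s",
    "ᑫ": "ɛ", "ᘈ": "ɔ", "ᘖ": "t", "ᖚ": "ə", "ᔪ": "y", "ᔭ": "ɯ", "ᑕ": "p",
    "ᖽ": "ɐ", "ᖆ": "m", "ᕋ": "b", "ᕒ": "ʃ", "ᓄ": "ŋ", "ᖗ": "r", "ᑲ": "f",
}
DIACRITICS = {"ᐧ": "ⁿ", "ᑦ": "ʰ", "ᐣ": ":"}
VOWELS = set("aeoiuɛɔəyɯɐ")

def transliterate(word):
    """Replace each glyph or diacritic by its guessed value; keep anything else."""
    table = {**BASE, **DIACRITICS}
    return "".join(table.get(ch, ch) for ch in word)

for w in ("ᓭᘖᔭᓄ", "ᑕᘊᐣᒣ", "ᔪᖆᓄᐧ"):
    print(w, "/" + transliterate(w) + "/")

# Inventory balance: how many of the base glyphs were assigned to vowels.
n_vowels = sum(v in VOWELS for v in BASE.values())
print(n_vowels, "vowels,", len(BASE) - n_vowels, "consonants")
```

With 11 of the 21 base glyphs assigned to vowels, the count puts a number on the “far too many vowels” complaint.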

It is now time to investigate the hypothesis of the Beanish script as an abugida; I’ll do it in the next post.