Linear A?

From Wired ( http://www.wired.com/underwire/2013/08/xkcd-time-comic/ ):

Even now, some mysteries remain. With the help of a linguist, Munroe invented a language and orthography (dubbed “Beanish” by readers) for one of the foreign cultures his characters encounter, which he wanted to be “as different from [English] as our language is from Linear A or Linear B,” the still undeciphered writing systems of ancient Crete. His abstruse approach worked; despite the efforts of “Time” superfans, no one has been able to decode the language, which Munroe finds fitting since “we haven’t cracked Linear A, either!”

Now, Randall could just be using Linear A in the general sense of “uncracked strange language”, but it could also be a further hint. Does this mean that the script is indeed syllabic and, even more, that Renil is on the right track with his transliteration in CAS?

At least it confirms that the language is very different from English, as my initial investigation suggested. Let’s just hope that the verbal system doesn’t use infix morphology, as the verb “to go” in Beanish seems to indicate…


On comments

An amazing day, with three comments. Not to mention that someone pointed a translatebeanish.com domain to this blog…

I am going to reply to the comments in this post.

First, the comment by Hel-G:

1. I think Randall has meant it to be decodable. In the blog he writes “[I] created a plausible future language for readers to try to decode”. This means it is probably NOT completely made up, but it is probably based on an existing language. There simply isn’t enough context to decode a language with arbitrary words. It also means he has provided hints. I think [42bJ]=[water] is crucial.

You are right; when we consider the tradition of xkcd and in particular XKCD Time, it’s unlikely that Randall would throw pure, random garbage at us, and I don’t think he would ever lie about it being decodable. Beanish must be based on one or more existing languages (natural languages, or at least plausible languages, as we know what he thinks about Lojban…) and [42bJ] is our main key. But as was posted in another comment, Randall is not a humanities guy and, what is more important, he says “decode” the language, not “understand” it. It might actually be a lot simpler than what we are assuming (just like his explanation for the Voynich manuscript).

2. He mentions experts in several fields who have helped him with various aspects of the world (astronomers, biologists and botanists), but no linguists. This probably means he has based it on a language he already knows. I also think we could limit the candidate base languages to English as it is Randall’s native language and a language all readers are familiar with, or one of the languages that are geographically close to the “comic”’s location, that is Spanish, Catalan, French, Italian or Arabic.

English is the most probable candidate, but his statement of a “plausible future language” might be pointing in a different direction. And we should also add at least Provençal to the list of geographically close languages (or even Basque!).

3. I believe historical linguists work on the hypothesis that languages evolve following sound changes that are consistently applied to all words that meet certain criteria and the same goes for syntax in general. (Now we are well outside of my field of study and please correct me if I am wrong.) I find it very likely that Randall has invented a few laws and applied them to one of the above-mentioned modern languages. The story is set very deep into the future, so deep that historical linguists usually refuse to go back a similar timespan, but I think we can (must?) postulate that the pace of change has been slowed down a lot due to the invention of writing.

You are right, this is the basis of historical linguistics. There are of course many footnotes to it, like words that resist change, cultural/geographic areas that are not subject to the changes and later “export” an unchanged word back into the language, and so on. Regarding the invention of writing, you are stepping into a very dangerous field, where scientific research sometimes meets ideology; however, even though we don’t have a very large time frame of almost full literacy to test this theory, it is accepted that writing does not slow language change. Some theories, almost fringe linguistic theories at this time, say that modern language communication (omnipresent television and so on) will slow down language change, but I don’t think it is likely (my own fringe theory would be that it will give us a stronger, world-wide diglossia, but let’s go back to Beanish; if people from Language Log read what I wrote I will be in trouble 😉 )

4. The strange script is probably just to throw us off. Megan and Cueball obviously don’t speak Modern English and what we read is a translation (like in Lord of the Rings) and it is therefore necessary to render Beanish in an invented script so that it looks as strange to us as it is to Cuegan. Even if it is Arabic, using Arabic script would be too much of a giveaway.

I also had this “Lord of the Rings” feeling, but I am still puzzled by the script. At first it looked Semitic, but after investigation it does not seem to work like, say, Arabic. Its inner gears are likely those of an alphabetic script, perhaps even one based on the Latin alphabet, but I still can’t account for the diacritics.

5. The spelling is probably very phonetic or it would be too difficult, e.g. “ghoti”… It could even be that Beanish script is simply obfuscated IPA and each character represents a phoneme or phone of Beanish.

This is actually the hypothesis I am working on, and that’s why I can’t wait for some free time to compile and study the tables of letter transitions! It is also possible that Randall didn’t use IPA, but one of the many “corrected” orthographies of English (like this one, which derives the new words from the old ones with a C program: http://www.zompist.com/spell.html ).

Based on this I will propose the following working hypothesis:
– Beanish is English, Italian, Catalan, French or Arabic changed according to a few consistent rules and written in an obfuscated version of IPA.
We could try to apply this to the Beanish word for [water] and the syntactic elements you have found to see if it leads to any insights. For geographical reasons, French would be the most natural choice for base language (we’re certain they are at the Château d’If) and Randall may very well have learnt it in school, but I can’t see how it is possible to go from [eau]/|o| to [42bJ] without letting the phonetic spelling hypothesis go. I’ll try English (which I know quite well) and a little Italian (which I learnt more than a decade ago, but have used very little). From your blog I would guess you are an Italian with an extremely good command of English (looks native to me, but why is the blog software in Italian?), so I guess you are much better than me at trying these languages. Spanish is a good candidate as I believe it is a much taught second language in the US and the Beanies seem to have been at Gibraltar, but unfortunately I don’t know Spanish. Arabic is Greek to me, and most likely to Randall too, so I think (hope!) it is the least likely candidate. (I can help a bit with German and above all the Scandinavian languages, but I find these highly unlikely candidates.)

I am both Italian and Brazilian, but you are too generous saying that I have an “extremely good command of English”.

Back to what matters, your hypothesis is, give or take, just like mine. I have tried to mentally apply some changes and go back to the languages I know, unsuccessfully. I still think that English is more likely (perhaps the Beanies are actually Britons, Big Hair is Queen Elizabeth CCXLII and their “water” ends with a schwa) but I believe that, unless someone cracks the script, a study of the morphology and the syntax will be more helpful than word matching.

Second, a comment by Ronald:

As I pointed out earlier on the OTT, if you mirror the Beanish characters, they become like English characters. For example, the feline that attacked Megan, the characters then look like ‘Panther’. Mirroring also seems to be consistent with the prefixes discovered here – mirroring it means left-to-right becomes right-to-left, which is consistent with English. It is also quite possible that Beanish is a cypher, not a true language. Also, Randall is an exact science guy – it’s possible that words for water are based on chemical symbols, so mirrored H2O.

It is unlikely that Randall spent days on creating Beanish, so I expect the level of effort to crack it to be medium at best. Also note that Randall said ‘decoding’, also indicating some kind of cypher.

Great insights! I had not thought of translating “water” into “H2O”. It makes a lot of sense; the more I think about it, the more it seems obvious, especially given that it is the only real clue he has given us (besides the punctuation in Big Hair’s speech).

I must have missed your comment in the OTT. I will go back to it because it is intriguing; unfortunately, I cannot see it by myself right now, and it would be difficult to explain part of the syntax. Perhaps the language isn’t pure, current English? And you are right, “decoding” seems to indicate that it is a cypher in the mathematical sense (which makes sense if we think about all that Randall has done so far).

Third, a comment by Joel Dinda:

I’d say “unfortunately, Beanish does not seem to have drawn much attention, even among XKCD enthusiasts” is overstating things. Just as the thread (OTT) was trying to come to grips with Beanish, the comic’s tempo changed, and we were all caught up in the rush. And I’m pretty sure everyone was assuming we’d get more examples after Cuegan escaped from the flood.

Didn’t work out that way, and everyone went into mourning for a few days. About now we’re beginning to regroup. People are working on the Wiki again, we’re beginning to inventory what we haven’t figured out, and here you are working out the Beenies. That timing’s about what I’d expect.

Glad you’re here, and that folks are passing the word about your blog. I’m really not qualified to help, but I’m really interested in seeing what you (and anyone else) can figure out.

I am sorry, I didn’t express myself very well: what I wanted to say is that it was not drawing much attention anymore. But I have fortunately been proved wrong. 🙂

Last but not least, a comment by waveney:

There is also beanish writing on the gate (best seen in frame 2821) I think it is a 2

You are amazing! It seems that there is a [2], I doubt I would have ever noticed it.

M32=

More on words

We have established that [ZL.] (as seen in frames 2865 and 2866) probably means “from”, as in 2866 it is used, referring to Megan and Cueball, before [34.6], which is where they are from. This would imply that the first question in frame 2728 ends in “from [34.6]?” too.

We have also established that the first sentence, [dZL. Ubo] in frame 2663, probably means “Who are (you pl.)?”, and, given that both [d-] and [ZL] seem correlated with questions, [dZL.] is likely an interrogative “who” and [Ub] a conjugation of the verb “to be”. This last hypothesis might be confirmed by other words: [NUq(] in frame 2664 could be a verb (and not a noun as I find more probable) and the [U] in frame 2671 could also be a verb, beside other cases (as we have seen with the verb “to go”, conjugations in Beanish seem to maintain a root, and we have already discussed that [U], the supposed root of “to be”, is likely a consonant — our guess was /f/).

If these assumptions are true, the sentence [ZL. 4b(2 UALMo] in frame 2865 does not mean “Who are these?”, as most people (including me) initially guessed, but “Where are they from?” or, using the order of Beanish syntax, “From where are (they)?”. This would give us two words, a new conjugation for “to be” ([UALM], “they are”) and “where” as [4b(2].

The recurring [U] might, however, be a coincidence because [UALM] as “they are” makes the translation of the first sentence in frame 2866 necessarily different from our supposed “they are from [34.6]” — and we are pretty sure about the “from [34.6]”.

Just some random thoughts to keep the blog warm…

Where have all my vowels gone?

Before moving on to the letter-to-letter transitions, it is worth studying the words by themselves, so that the collected statistics will make more sense.

We don’t know if the Beanish script is alphabetic, syllabic or even logographic, but it clearly looks like an alphabetic one. Assuming it is alphabetic, it is important to figure out (or at least try to guess) the vowels and the consonants in the alphabet, preferably explaining what the diacritics do. We can make some educated guesses based on the words we have and on the frequencies we have calculated.

In the first sentence, we have a word Ub which likely is “are (you pl.)”: the most probable translation is “Who are you?”, but expressed in a language that does not have, does not use or, more likely, does not force the expression of a subject pronoun because it can be derived from other linguistic information (usually the verb conjugation, as in most Romance languages). While the phonology of Beanish might allow complex syllable structures such as CC, we should start from the assumption that it favors one of the common CV or VC structures. In Beanish, [b] is far more common than [U] but not that common; a safe assumption is that [b] is one of the less frequent vowels (perhaps /u/) and [U] a normal consonant such as /f/. However, we also find [U] as a single word, with no diacritics; we might need either to change the syllable structure to CV or to assign to [U] a “more syllabic” consonant such as “m” (the hypothesis that every letter is a consonant with an implicit vowel seems unlikely by now).

Another short word is 7X. Using the same reasoning, [7] is likely a vowel such as /i/, and [X] a somewhat more common consonant such as /d/ (I am using English letter frequencies). This would strengthen the hypothesis that the most common syllable structure is VC.

In fact, the next short word we have is 4M, and the frequencies would let us assume that [4] is a common vowel, such as /a/, and [M] a consonant as frequent as [U], perhaps /p/.

Going on, we have the strange [dG] word. Both [d] and [G] are very uncommon, and if the supposed VC structure is used, [d] would be a very uncommon vowel.

It would be easier if we knew about the origin of Beanish. If Beanish is a conlang ex novo, like Esperanto or, much worse, Klingon, we cannot infer much. But I think that Randall has probably derived it from “true”, natural languages. While we could exclude no language, the obvious candidates for proto-Beanish would be languages spoken in France, or perhaps some geographically close language such as Catalan, and he might have given us a clue. When reading the comic, I found it strange that the leader of a society as advanced as the Beanish wouldn’t know large numbers when she, in fact, is pretty proficient in English (just don’t blame her for such a strong accent). Just as he did with the question mark, the little ball at the end of sentences, Randall might have given us the clue that Beanish doesn’t have names for large numbers such as forty. You probably know where this is going, right? Megan had to explain arithmetically what forty is, and French works this way (well, mostly). A language derived from or influenced by French might have inherited words such as quatre-vingts (four-twenties, also known as 80) or, in our case, five-eights.

Statistics 3 – Letters

It is always good practice to consider the transition from letter to letter, using fake letters for word boundaries.
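The idea of padding each word with fake boundary letters can be sketched in a few lines of Python. This is only an illustration: the `<` and `>` markers and the function name are assumptions of mine, the sample holds just four words from the transcription table, and diacritics are counted as ordinary symbols here.

```python
from collections import Counter

# Hypothetical boundary markers: any symbols that never occur in the
# transcription would do.
START, END = "<", ">"


def transitions(words):
    """Count symbol-to-symbol transitions, padding every word with boundary markers."""
    counts = Counter()
    for word in words:
        padded = [START, *word, END]
        counts.update(zip(padded, padded[1:]))
    return counts


# A few words from the transcription table, sentence marks already removed.
sample = ["42bJ", "M32", "ZL.", "37cNA"]
table = transitions(sample)

for (left, right), n in sorted(table.items()):
    print(left, right, n)
```

Pairs whose left side is `<` give the word-initial counts, and pairs whose right side is `>` give the word-final ones, so the same table covers both special cases.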

Starting with the initial letter (transition from word-start to letter one, “tendency” denotes the percentage of occurrences of a specific letter in the first position):

Index  Letter  Count  Frequency  Tendency
1      3       14     17.94 %    14/21 = 66.66 %
2      A       13     16.66 %    13/19 = 68.42 %
3      4       9      11.53 %    9/22 = 40.90 %
4      N       5      6.41 %     5/15 = 33.33 %
4      X       5      6.41 %     5/10 = 50.00 %
4      Z       5      6.41 %     5/6 = 83.33 %
4      d       5      6.41 %     5/5 = 100 %
8      U       4      5.12 %     4/7 = 57.14 %
8      L       4      5.12 %     4/16 = 25.00 %
10     W       3      3.84 %     3/10 = 30.00 %
10     2       3      3.84 %     3/25 = 12.00 %
12     7       2      2.56 %     2/16 = 12.50 %
12     G       2      2.56 %     2/5 = 40.00 %
14     q       1      1.28 %     1/4 = 25.00 %
14     M       1      1.28 %     1/7 = 14.28 %
14     c       1      1.28 %     1/7 = 14.28 %
14     b       1      1.28 %     1/15 = 6.66 %
Total          78     100 %

Things to note:

  • There is a word starting with c — cMN) on frame 2728 — which suggests that c is not a diacritic, or at least a diacritic that works in a different way than ( or );
  • There is a very strong tendency for 3 and A, two very common letters, to be in the onset of syllables;
  • The same as above is true for Z, which, as we have seen and will study better, is always found before L. The only case when Z is not the first letter is in dZL, which suggests a syllable structure d?Z? (I’m approximating regex notation, as I believe there will be more programmers than linguists reading this);
  • Any other assumption has to deal with the small population, but we should at least note that 4 does not present a tendency to be in the onset and that 2 and 7, the most common letter and a medium frequency one, have a clear tendency of not being in the onset;
  • Given that we have 78 words in the corpus, an equal distribution would have 3.25 occurrences for each of the 24 letters (I’m considering c a letter); once more, while the population is small, we are allowed the hypothesis that the letters in the group { g J 9 S 6 Q j } are not found in the onset of Beanish syllables (the same might be true for b, which is found only in a single word “b” in frame 2728). The group is similar to the { g 6 Q j M Z } group from the previous post of letters that do not seem to take diacritics. This suggests that the first letter in a syllable must potentially take a diacritic, which makes more likely the hypothesis that diacritics are phonological marks. These two groups, and in particular their intersection { g 6 Q j }, will be useful in discovering the syllable structure and are probably consonants (assuming that Beanish phonology is similar to the phonology of most European languages). If b represents a single phoneme (we cannot rule out that the script is alphabetic) it might be a syllabic consonant, such as the final ‘m’ in English “bottom”.
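For readers who want to reproduce the three columns, here is a minimal sketch of how count, frequency and tendency could be computed. The function name is hypothetical, the sample is only six words taken from the transcriptions, and c is treated as a letter, as in the table above.

```python
from collections import Counter

DIACRITICS = set("()^.")  # assumption: c is treated as a letter here


def initial_stats(words):
    """Map each word-initial letter to (count, frequency, tendency)."""
    first = Counter(word[0] for word in words if word)
    total = Counter(ch for word in words for ch in word if ch not in DIACRITICS)
    n_words = sum(first.values())
    return {
        letter: (n, n / n_words, n / total[letter])
        for letter, n in first.most_common()
    }


# Six words from the transcriptions, sentence marks already removed.
sample = ["ZL.", "dZL.", "42bJ", "M32", "34.6", "37cNA"]
stats = initial_stats(sample)
for letter, (n, freq, tend) in stats.items():
    print(f"{letter}  {n}  {freq:6.2%}  {tend:6.2%}")

# Baseline quoted in the text: 78 words spread evenly over 24 letters.
print(78 / 24)
```

Frequency divides by the number of words, while tendency divides by the total occurrences of that letter anywhere in the sample, which is why a rare letter like d can still reach 100 %.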

We can perform the same analysis with the transition to the end symbol (“Count” excludes diacritics, “Pure Count” does not — see the case of d) as discussed below):

Index  Letter  Count  Pure Count  Frequency  Tendency         Pure Tendency
1      2       11     10          14.10 %    11/25 = 44.00 %  10/25 = 40.00 %
2      L       7      2           8.97 %     7/16 = 43.75 %   2/16 = 12.50 %
2      J       7      6           8.97 %     7/8 = 87.50 %    6/8 = 75.00 %
4      N       6      5           7.69 %     6/15 = 40.00 %   5/15 = 33.33 %
4      b       6      4           7.69 %     6/15 = 40.00 %   4/15 = 26.66 %
6      X       5      2           6.41 %     5/10 = 50.00 %   2/10 = 20.00 %
7      g       4      4           5.12 %     4/9 = 44.44 %    4/9 = 44.44 %
7      9       4      4           5.12 %     4/8 = 50.00 %    4/8 = 50.00 %
9      S       3      3           3.84 %     3/6 = 50.00 %    3/6 = 50.00 %
9      q       3      0           3.84 %     3/4 = 75.00 %    0/4 = 0.00 %
9      U       3      2           3.84 %     3/7 = 42.85 %    2/7 = 28.57 %
9      6       3      3           3.84 %     3/3 = 100 %      3/3 = 100 %
9      7       3      3           3.84 %     3/16 = 18.75 %   3/16 = 18.75 %
9      A       3      3           3.84 %     3/19 = 15.78 %   3/19 = 15.78 %
15     4       2      1           2.56 %     2/22 = 9.09 %    1/22 = 4.54 %
15     M       2      2           2.56 %     2/7 = 28.57 %    2/7 = 28.57 %
15     c       2      0           2.56 %     2/7 = 28.57 %    0/7 = 0.00 %
18     d       1      0           1.28 %     1/5 = 20.00 %    0/5 = 0.00 %
18     G       1      1           1.28 %     1/5 = 20.00 %    1/5 = 20.00 %
18     3       1      1           1.28 %     1/21 = 4.76 %    1/21 = 4.76 %
18     j       1      1           1.28 %     1/1 = 100 %      1/1 = 100 %
Total          78                 100 %

Comments:

  • There is a single occurrence of d in a final position (frame 2664), but in that case it has the diacritic ). It would seem to confirm that d is a consonant and that the ) diacritic is a vowel.
  • The high frequency of J in the final position is due to the word 42bJ (“water”), which is repeated many times.
  • We can make some new groups: first, the letters that can take a diacritic when in the coda but that usually do not: { 2 J N G 3 j}; second, the letters that can either take or not a diacritic in the coda: { L b X U 4 }; third, the letters that don’t seem to take diacritics when in the coda: { g 9 S 6 7 A M }; fourth, the letters that apparently must have a diacritic to figure in the coda (or that, perhaps, are the nucleus of the syllables and the diacritic serves as the coda): { q c d }.
  • The letter q, with a diacritic, seems strongly fixed in the final position: the only word where it is not at the very end is q9 , in frame 2728.
  • We are by now pretty certain that 6 is only found at the final position.
  • Among the most common letters, 2 is very common in the final position, A is somewhat common and 4 and 3 are not very common. This might confirm that 2 is a vowel, the most common vowel in the language, and that 4 and 3 are consonants, in a language that might favor a standard CV syllable structure. It is impossible not to be tempted to apply the letter frequency from English (etaoin shrdlu, anyone?) and guess that 2 is /e/, 4 is /t/ and 3 is /s/, but it is just a wild guess (not to mention the fact that I am working under the assumption that the Beanish script is phonological, or at least more like Spanish and Italian than English or French. Does anyone have a frequency list of phonemes in these languages, i.e., not letters? Might be time to scrape Wiktionary…)
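The difference between “Count” and “Pure Count” in the word-final table can be illustrated with a small sketch. It assumes that the four marks ( ) ^ . are the only diacritics and that c is a letter, which matches how the table treats the word Xc^; the function name and the six-word sample are mine.

```python
from collections import Counter

DIACRITIC_CHARS = "()^."  # assumption: c counts as a letter, as in Xc^


def final_counts(words):
    """Return (count, pure_count) Counters for word-final letters.

    count ignores trailing diacritics and looks at the last base letter;
    pure_count only counts words whose last symbol is a bare letter.
    """
    count, pure = Counter(), Counter()
    for word in words:
        base = word.rstrip(DIACRITIC_CHARS)
        if not base:
            continue  # nothing but diacritics: skip
        count[base[-1]] += 1
        if base == word:
            pure[base[-1]] += 1
    return count, pure


sample = ["ZL.", "d)", "42bJ", "Xc^", "M32", "NUq("]
count, pure = final_counts(sample)
print(count)
print(pure)
```

In this sample d, L, c and q all end a word only with a diacritic attached, so they show up in `count` but not in `pure`, which is the pattern behind the { q c d } group above.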

Words so far

I didn’t think there would be so many visitors in two days. Here you have a small dictionary of guesses:

  • little final circle – question mark
  • single horizontal line – dot
  • double horizontal line – exclamation mark
  • 374.WS – “cream, paste”
  • 4M – “get (me)”
  • 42bJ – “water”
  • Xc^ – “yes, ok”
  • 34.6 – “land, desert” or “home to Cueball and Megan”
  • L)29g – “(we) are going to go”, form of the verb “L…g.” (to go)
  • L7Gg4 – “(we) are going”, form of the verb “L…g.” (to go)
  • A)N – “to, towards”
  • 37cNA – “castle”
  • X^ A)Lb^ – standard long greeting (“good morning”?)
  • M32 – standard short greeting (“hi”)
  • ZL. – “from”
  • 3W(J – “sea”

It is also likely that the verb “to be” has a structure such as “4…”.

Statistics 2 – Diacritics

Here we have a set of tables with statistics on the diacritics. Diacritics are loosely defined, here, as symbols that don’t show up alone and that are graphically shorter. The c could be either, as it looks like a diacritic but does not seem to depend on any particular other symbol. If we explore the hypothesis that the diacritics represent vowels, this suggests a new hypothesis: that c is a semivowel.

First, a table that lists, diacritic by diacritic, the letters that bear them, their counts and frequency.

Diacritic  Letter  Count     Frequency
(          q       2 / 9     22.22 %
(          2       2 / 9     22.22 %
(          W       2 / 9     22.22 %
(          9       1 / 9     11.11 %
(          4       1 / 9     11.11 %
(          b       1 / 9     11.11 %
)          A       6 / 22    27.27 %
)          3       3 / 22    13.63 %
)          S       3 / 22    13.63 %
)          W       3 / 22    13.63 %
)          d       1 / 22    4.54 %
)          L       1 / 22    4.54 %
)          N       1 / 22    4.54 %
)          2       1 / 22    4.54 %
)          U       1 / 22    4.54 %
)          7       1 / 22    4.54 %
)          X       1 / 22    4.54 %
^          X       2 / 7     28.57 %
^          b       2 / 7     28.57 %
^          c       2 / 7     28.57 %
^          q       1 / 7     14.28 %
.          4       9 / 15 *  60.00 %
.          L       5 / 15    33.33 %
.          J       1 / 15    6.66 %
c          7       3 / 6     50.00 %
c          X       2 / 6     33.33 %
c          4       1 / 6     16.66 %

* One case, frame 2671, is a 4(. , unless I misread a Beanish question mark for a diacritic; if they are indeed two diacritics and diacritics do represent vowels, it might be the only attested occurrence of a diphthong in Beanish.

We can already note some properties of the language, or at least of its script. Some diacritics, as is better shown by the following table, are strongly related to some letters: the only possible diacritic for A, for example, is ). The data do not settle whether c is a diacritic or a letter, but one could read in them a tendency towards confirming it as a diacritic (possibly representing a rare vowel).

Regarding the possibility of ^ and c being, respectively, allographs of ) and (, the distributions seem to rule this hypothesis out. The population is very small, but it seems rather unlikely, not to mention that the letter X can bear both ^ and ), while the letter 4 can bear both c and ( (albeit with just one occurrence). An analysis of the corpus also reveals that ^ is only found at the end of a word.

The second table, presenting all characters along with any diacritic they may bear (this will be later extended to the contexts where each letter is found):

Letter Diacritic Count Total count Frequency
3 ) 3 / 3 21 14.28 %
2 ) 1 / 3 25 4.00 %
( 2 / 3 25 8.00 %
4 c 1 / 10 22 4.54 %
( 1 / 10 22 4.54 %
. 8 / 10 22 36.36 %
7 c 3 / 4 16 18.75 %
) 1 / 4 16 6.25 %
6 0 / 0 3 0.00 %
G 0 / 0 5 0.00 %
9 ( 1 / 1 8 12.5 %
A ) 6 / 6 19 31.57 %
J . 1 / 1 8 12.5 %
M 0 / 0 7 0.00 %
L ) 1 / 6 16 6.25 %
. 5 / 6 16 31.25 %
N ) 1 / 1 15 6.66 %
Q 0 / 0 1 0.00 %
S ) 3 / 3 6 50.00 %
U ) 1 / 1 7 14.28 %
W ) 3 / 5 10 30.00 %
( 2 / 5 10 20.00 %
X ) 1 / 5 10 10.00 %
c 2 / 5 10 20.00 %
^ 2 / 5 10 20.00 %
Z 0 / 0 6 0.00 %
b ( 1 / 3 15 6.66 %
^ 2 / 3 15 13.33 %
d ) 1 / 1 5 20.00 %
g 0 / 0 9 0.00 %
j 0 / 0 1 0.00 %
q ( 2 / 3 4 50.00 %
^ 1 / 3 4 25.00%

This table is very important. First of all, it pretty much disproves the theory that the diacritics are vowels; the only possibility would be for every single consonant in the script to have a “default” vowel and a diacritic would only be used if the actual vowel is different from the default one. One obvious new theory is that diacritics represent phonetic features, such as aspiration: in fact, from the scene when Cueball learns the word for “water” we learn that the script is likely phonetic, or at least alphabetic.

What is now difficult is to identify which symbols represent vowels; if the diacritics don’t represent them, we would expect the letters that have no diacritics, such as q and j, to have a high frequency, but we have the opposite situation. A new hypothesis would be for the diacritics to represent semi-vowels, but given that most letters have at least one diacritic it seems unlikely.

We can at least list some combinations of letters and diacritics which we can assume as “common” even with the small population : 3) — 4. — 7c — 7) — A) — L. — S) — W) —- W( — Xc — X^

We can also start dividing the letters into groups:

  • letters that take more than one diacritic: 2 4 7 L W X b q
  • letters that seem to take a single diacritic: 3 9 A J N S U d
  • letters that seem to take no diacritic: 6 G M Q Z g j

If the script is indeed alphabetical, the frequencies seem to confirm that the first group is likely to contain mostly vowels and the third one mostly (rare) consonants.
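The grouping above can be sketched as a small script that attaches each diacritic to the letter before it and then sorts letters by how many different diacritics they bear. The function names are hypothetical and the sample is only six words. Whether c belongs in the diacritic set is left as a parameter, since the question is still open.

```python
from collections import Counter


def diacritics_by_letter(words, diacritics="()^."):
    """Map each base letter to a Counter of the diacritics it bears.

    Pass diacritics="()^.c" to explore the reading where c is a diacritic.
    A diacritic with no letter before it is treated as a letter.
    """
    pairs = {}
    for word in words:
        prev = None
        for ch in word:
            if ch in diacritics and prev is not None:
                pairs[prev][ch] += 1
            else:
                prev = ch
                pairs.setdefault(prev, Counter())
    return pairs


def group_letters(pairs):
    """Split letters into those with no, one, or several distinct diacritics."""
    groups = {"none": set(), "single": set(), "multiple": set()}
    for letter, marks in pairs.items():
        if not marks:
            groups["none"].add(letter)
        elif len(marks) == 1:
            groups["single"].add(letter)
        else:
            groups["multiple"].add(letter)
    return groups


sample = ["W)N", "3W(J", "A)N", "ZL.", "42bJ", "M32"]
groups = group_letters(diacritics_by_letter(sample))
print(groups)
```

With so few words the groups differ from the ones listed above (3, for instance, lands in “none” here); the point is only the mechanics, including the fact that stacked marks such as 4(. both attach to the same letter.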

Statistics 1

The very first statistic to compute is the frequency of diacritics (which could be vowels) and letters. I am not including the final ball (which is a question mark), the final double horizontal line (which is an exclamation mark) and the final single horizontal line (which is a normal mark, i.e., a dot).

There are still some doubts about the diacritics; in particular, the ^ and the c could be just different versions of ) and (, respectively. If the hypothesis of the diacritics as vowels is right, this would indicate a language either with a restricted set of vowels or a script with a standard vowel for each consonant.

The distribution looks pretty normal for a human language. Considering how restricted our corpus is, it could be much more skewed. The first thing to note is that it does not seem to confirm that the diacritics are vowels, or at least that every vowel is represented by a diacritic.

Our next step is to study the distribution of letters at the beginning and at the end of the words, and then the relationship between them (for example, you probably have noticed that Z seems to depend on L, like Q and U in most languages).

Index  Symbol  Count  Frequency
1      2       25     8.36 %
2      )       22     7.36 %
2      4       22     7.36 %
4      3       21     7.02 %
5      A       19     6.35 %
6      7       16     5.35 %
6      L       16     5.35 %
8      b       15     5.02 %
8      N       15     5.02 %
8      .       15     5.02 %
11     W       10     3.34 %
11     X       10     3.34 %
13     (       9      3.01 %
13     g       9      3.01 %
15     J       8      2.68 %
15     9       8      2.68 %
17     ^       7      2.34 %
17     U       7      2.34 %
17     c       7      2.34 %
17     M       7      2.34 %
21     S       6      2.01 %
21     Z       6      2.01 %
23     d       5      1.67 %
23     G       5      1.67 %
25     q       4      1.34 %
26     6       3      1.00 %
27     Q       1      0.33 %
27     j       1      0.33 %
Total          299    100 %
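A frequency table like this one can be produced with a short script. The sketch below is an illustration rather than the exact procedure used for the table: it counts only the 24 letters and 4 diacritics listed above, so the sentence marks (o, - and = in the transcription) and spaces fall out automatically. The function name and the three sample lines are my own choice.

```python
from collections import Counter

# The 24 letters and 4 diacritics that appear in the table above.
SYMBOLS = set("2437ALbNWXgJ9UcMSZdGq6Qj" + "().^")


def symbol_frequencies(lines):
    """Return (symbol, count, percentage) rows, most frequent first."""
    counts = Counter(ch for line in lines for ch in line if ch in SYMBOLS)
    total = sum(counts.values())
    return [(sym, n, 100 * n / total) for sym, n in counts.most_common()]


# Three transcribed lines (frames 2663, 2706 and 2864), sentence marks included.
sample = ["dZL. Ubo", "42bJ-", "M32-M32-"]
rows = symbol_frequencies(sample)
for sym, n, pct in rows:
    print(f"{sym}  {n}  {pct:.2f} %")
```

Because the filter is a whitelist, anything else that sneaks into a transcription (slashes between speakers, question marks for unreadable glyphs) is ignored as well.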

Organizing data

Here is a first attempt at organizing the data so far. I am making some adaptations to the transcription rules and correcting them, as the transcriptions in Wikia mix diacritics with sentence marks, sometimes joining different diacritics and with different orders.

It is important to notice that the final “little ball” (not the dot, represented by ‘o’ in my transcription) and the double horizontal lines (represented by the equality symbol, =) are given by the author in the quasi-English speak of the leader. She clearly uses the first as a question mark and the second as an exclamation mark, which probably indicates that the single line is also a sentence mark. This seems to confirm that the language is written left-to-right. Her speech also gives away some vocabulary that might help in deciphering Beanish, in particular “desert”.

Frame Beanish Transcription Translation
2663 dZL. Ubo (question, probably “who are you?”)
2664 dAJ. d) 7X W)N NUq(o (question, probably “what happened to your leg?”)
2668 4M 374.WS A347W)9- get (me) cream for-healing
2671 ZL. 32g27)L U 4(. ZLgq^- !
2676 374.WS A347W)9- cream for-healing
2697 Lg2 4.L “good night” or “let’s sleep”
2706 42bJ- water
2708 (Cueball) / (Cueball) 42b?-Xc^= / 42bJ- water
2712 (Cueball) (Cueball) 42b?-  water
2713 42bJ-  water
2727 (Cueball) (Cueball) 42b?-  water
2728 dXbg ZL. 34.6o cMN) b AN7 42( W3o q9 AQXb 2)9b^ 342bJo (three questions, probably “can you see the desert? you know what is happening? you know the waters are rising?”)
2734 NS)2 L)29g A)N 37cNA- We will leave for the castle.
2797 74.WS- cream
2798 (Megan) (Megan) 74.WS- cream
2802 G3)7 34cGX- / G3)7 342bJ- / 42bJ= this desert, this seas, water
2806 / / AM2 / NS)2 L7Gg4 / A)N 37cNA-Xc^- We should leave for the castle now. / Ok.
2821 2JMX)- (our) city
2823 X^ A)Lb^-X^ A)Lb^- Good morning / Good morning
2827 37cNA- the castle
2836 NS)2 L29g A)N 342(j- We should look for the leaders.
2841 M32-X^ A)Lb^- Hello / Good morning
2842 / X^ A)Lb=AM2=/ X^ A)Lb= Good morning / My friend! / Good morning
2863 M32- Hello
2864 M32-M32- Hello / Hello
2865 ZL. 4b(2 UALMo Who are these?
2866 A)9(Lg 4.2 ZL. 34.6-Xb73)2g9 NUq(- They are from the desert. / One of them is injured.
2880 U) AbWN W(2 2A7U-dG- You can leave us now. / Ok
2882 M32-  Hello
2906 3W(J 34.6  Sea of-yours
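To feed these rows into any statistics, the transcription column first has to be cut into sentences and words. Here is one possible sketch. It assumes the sentence marks are o (question), = (exclamation) and - (the single line), as they appear in the table, and that a slash only separates speakers; the function name is hypothetical.

```python
import re

# Sentence marks as used in the transcription column (the "-" reading is an
# assumption based on how the table ends ordinary sentences).
SENTENCE_MARKS = {"o": "?", "-": ".", "=": "!"}


def split_sentences(line):
    """Split one transcribed line into (words, mark) pairs.

    mark is "?", "." or "!", or None when the line has no closing mark.
    Slashes are treated as plain separators.
    """
    sentences = []
    for body, mark in re.findall(r"([^o\-=]+)([o\-=]?)", line):
        words = body.replace("/", " ").split()
        if words:
            sentences.append((words, SENTENCE_MARKS.get(mark)))
    return sentences


print(split_sentences("dZL. Ubo"))
print(split_sentences("M32-X^ A)Lb^-"))
print(split_sentences("Lg2 4.L"))
```

This works only because none of the three marks doubles as a Beanish letter in the transcription; if a later revision of the alphabet reuses one of them, the split would need a real tokenizer.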

Left-to-right or right-to-left?

Given the first analysis done in the two previous posts, we could move on to organize what was found and start comparing the hypotheses, preferably with some statistical analysis.

But as I said, the analysis just assumed that the text was written left-to-right, which is by no means indisputable. Not only would using a right-to-left script be perfectly acceptable in a creation by Randall, but it would actually simplify some of the strange characteristics of the language. This is going to be my next step.
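One way to prepare for that test is to reverse the transcriptions and rerun the same statistics on them. The sketch below is only a starting point and rests on an untested assumption: that a diacritic stays attached to the letter it follows in the current transcription. The function names are hypothetical.

```python
DIACRITICS = set("()^.")  # assumption: c is treated as a letter


def clusters(word):
    """Split a word into letters, each carrying the diacritics that follow it."""
    out = []
    for ch in word:
        if ch in DIACRITICS and out:
            out[-1] += ch
        else:
            out.append(ch)
    return out


def reverse_word(word):
    """Reverse the letter order while keeping diacritics on their letter."""
    return "".join(reversed(clusters(word)))


def reverse_sentence(sentence):
    """Reverse both the word order and the letters inside each word."""
    return " ".join(reverse_word(w) for w in reversed(sentence.split()))


print(reverse_word("ZL."))
print(reverse_sentence("ZL. 34.6"))
```

If the right-to-left reading is correct, the diacritics might instead belong to the letter on their other side, in which case the clustering rule would have to be flipped too.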