It is always a good practice to consider the transition from letter to letter, using fake letters for word boundaries.
Starting with the initial letter (the transition from word start to the first letter; “tendency” is the percentage of a letter's total occurrences that fall in the first position):
Index | Letter | Count | Frequency | Tendency |
1 | 3 | 14 | 17.94 % | 14/21 = 66.66 % |
2 | A | 13 | 16.66 % | 13/19 = 68.42 % |
3 | 4 | 9 | 11.53 % | 9/22 = 40.90 % |
4 | N | 5 | 6.41 % | 5/15 = 33.33 % |
 | X | 5 | 6.41 % | 5/10 = 50.00 % |
 | Z | 5 | 6.41 % | 5/6 = 83.33 % |
 | d | 5 | 6.41 % | 5/5 = 100 % |
8 | U | 4 | 5.12 % | 4/7 = 57.14 % |
 | L | 4 | 5.12 % | 4/16 = 25.00 % |
10 | W | 3 | 3.84 % | 3/10 = 30.00 % |
 | 2 | 3 | 3.84 % | 3/25 = 12.00 % |
12 | 7 | 2 | 2.56 % | 2/16 = 12.50 % |
 | G | 2 | 2.56 % | 2/5 = 40.00 % |
14 | q | 1 | 1.28 % | 1/4 = 25.00 % |
 | M | 1 | 1.28 % | 1/7 = 14.28 % |
 | c | 1 | 1.28 % | 1/7 = 14.28 % |
 | b | 1 | 1.28 % | 1/15 = 6.66 % |
Total | | 78 | 100 % | |
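For the programmers in the audience, here is a minimal sketch of how a table like the one above could be computed. It is not the script I used: the function name `initial_stats` is made up for illustration, it assumes that ( and ) are the only diacritics and that c counts as a letter, and the input is a toy list containing only the words quoted in this post, not the full 78-word corpus.

```python
from collections import Counter

# Assumption: "(" and ")" are the only diacritics; "c" is treated as a letter.
DIACRITICS = set("()")

def initial_stats(words):
    """Map each word-initial letter to (count, frequency, tendency)."""
    # Strip diacritics so that only letters are counted.
    stripped = [[ch for ch in w if ch not in DIACRITICS] for w in words]
    stripped = [w for w in stripped if w]  # skip anything left empty
    totals = Counter(ch for w in stripped for ch in w)  # all occurrences
    firsts = Counter(w[0] for w in stripped)            # word start -> letter
    n_words = len(stripped)
    return {
        ch: (count, count / n_words, count / totals[ch])
        for ch, count in firsts.most_common()
    }

# Toy input: only the words quoted in this post, NOT the real corpus.
toy_words = ["cMN)", "dZL", "b", "42bJ", "q9"]
for letter, (count, freq, tend) in initial_stats(toy_words).items():
    print(f"{letter} | {count} | {freq:.2%} | {tend:.2%}")
```

Frequency divides by the number of words, tendency by the letter's total occurrences, which is why the two columns can disagree so sharply for a letter like 2.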
Things to note:
- There is a word starting with c, namely cMN) on frame 2728, which suggests that c is not a diacritic, or at least that it is a diacritic that works differently from ( or );
- There is a very strong tendency for 3 and A, two very common letters, to be in the onset of syllables;
- The same is true of Z, which, as we have seen and will study in more detail, is always found before L. The only case where Z is not the first letter is dZL, which suggests a syllable structure d?Z? (I’m approximating regex notation, as I believe there will be more programmers than linguists reading this);
- Any further conclusion has to contend with the small sample, but we should at least note that 4 does not show a tendency to be in the onset and that 2 and 7, the most common letter and a medium-frequency one, have a clear tendency not to be in the onset;
- Given that we have 78 words in the corpus, an equal distribution would have 3.25 occurrences for each letter (I’m counting c as a letter). Once more, while the sample is small, we are allowed the hypothesis that the letters in the group { g J 9 S 6 Q j } are not found in the onset of Beanish syllables (the same might be true for b, which is found in the first position only in the single-letter word “b” in frame 2728). The group is similar to the { g 6 Q j M Z } group from the previous post, the letters that do not seem to take diacritics. This suggests that the first letter in a syllable must be able to take a diacritic, which makes the hypothesis that diacritics are phonological marks more likely. These two groups, and in particular their intersection { g 6 Q j }, will be useful in discovering the syllable structure and are probably consonants (assuming that Beanish phonology is similar to the phonology of most European languages). If b represents a single phoneme (we cannot rule out that the script is alphabetic), it might be a syllabic consonant, such as the final ‘m’ in English “bottom”.
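The arithmetic in the last note is easy to check. The sketch below rebuilds the alphabet from the letters in the table plus the never-initial group (the variable names are mine, chosen for illustration), confirms the 3.25 figure, and computes the intersection of the two groups.

```python
# Letters that appear in the initial-position table above.
initial_letters = set("3A4NXZdULW27GqMcb")
# Letters never seen word-initially, and letters that do not seem to
# take diacritics (the group from the previous post).
never_initial = set("gJ9S6Qj")
no_diacritic = set("g6QjMZ")

alphabet = initial_letters | never_initial
print(len(alphabet))       # 24 letters, counting c as a letter
print(78 / len(alphabet))  # 3.25 expected word-initial occurrences per letter

# The intersection singled out in the text.
print(sorted(never_initial & no_diacritic))  # ['6', 'Q', 'g', 'j']
```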
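We can perform the same analysis with the transition to the end symbol (“Count” ignores diacritics, so a letter followed by a word-final diacritic still counts as final; “Pure Count” only counts letters that end the word with no diacritic after them; see the case of d) discussed below):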
Index | Letter | Count | Pure Count | Frequency | Tendency | Pure Tendency |
1 | 2 | 11 | 10 | 14.10 % | 11/25 = 44.00% | 10/25 = 40.00% |
2 | L | 7 | 2 | 8.97 % | 7/16 = 43.75% | 2/16 = 12.50% |
 | J | 7 | 6 | 8.97 % | 7/8 = 87.50% | 6/8 = 75.00% |
4 | N | 6 | 5 | 7.69 % | 6/15 = 40.00% | 5/15 = 33.33% |
 | b | 6 | 4 | 7.69 % | 6/15 = 40.00% | 4/15 = 26.66% |
6 | X | 5 | 2 | 6.41 % | 5/10 = 50.00% | 2/10 = 20.00% |
7 | g | 4 | 4 | 5.12 % | 4/9 = 44.44% | 4/9 = 44.44% |
 | 9 | 4 | 4 | 5.12 % | 4/8 = 50.00% | 4/8 = 50.00% |
9 | S | 3 | 3 | 3.84 % | 3/6 = 50.00% | 3/6 = 50.00% |
 | q | 3 | 0 | 3.84 % | 3/4 = 75.00% | 0/4 = 0.00% |
 | U | 3 | 2 | 3.84 % | 3/7 = 42.85% | 2/7 = 28.57% |
 | 6 | 3 | 3 | 3.84 % | 3/3 = 100% | 3/3 = 100% |
 | 7 | 3 | 3 | 3.84 % | 3/16 = 18.75% | 3/16 = 18.75% |
 | A | 3 | 3 | 3.84 % | 3/19 = 15.78% | 3/19 = 15.78% |
15 | 4 | 2 | 1 | 2.56 % | 2/22 = 9.09% | 1/22 = 4.54% |
 | M | 2 | 2 | 2.56 % | 2/7 = 28.57% | 2/7 = 28.57% |
 | c | 2 | 0 | 2.56 % | 2/7 = 28.57% | 0/7 = 0.00% |
18 | d | 1 | 0 | 1.28 % | 1/5 = 20.00% | 0/5 = 0.00% |
 | G | 1 | 1 | 1.28 % | 1/5 = 20.00% | 1/5 = 20.00% |
 | 3 | 1 | 1 | 1.28 % | 1/21 = 4.76% | 1/21 = 4.76% |
 | j | 1 | 1 | 1.28 % | 1/1 = 100% | 1/1 = 100% |
Total | | 78 | | 100 % | | |
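The difference between the two count columns can be sketched in a few lines. Again this is only an illustration under assumptions: `final_stats` is a hypothetical helper, diacritics are taken to be ( and ) written after the letter they modify, and the input is the toy list of words quoted in this post rather than the real corpus.

```python
from collections import Counter

def final_stats(words):
    """Return (count, pure_count) Counters for word-final letters.

    count:      last letter of each word, ignoring trailing diacritics
    pure_count: last letter only when no diacritic follows it
    """
    count, pure = Counter(), Counter()
    for word in words:
        bare = word.rstrip("()")  # assumption: diacritics are "(" and ")"
        if not bare:
            continue  # nothing but diacritics, skip
        last = bare[-1]
        count[last] += 1
        if bare == word:  # no diacritic was stripped
            pure[last] += 1
    return count, pure

# Toy input: only the words quoted in this post, NOT the real corpus.
count, pure = final_stats(["cMN)", "dZL", "b", "42bJ", "q9"])
for letter in count:
    print(f"{letter} | {count[letter]} | {pure[letter]}")
```

In the toy list, N ends cMN) only with a diacritic attached, so it gets a Count of 1 and a Pure Count of 0, which is exactly the pattern d shows in the real table.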
Comments:
- There is a single occurrence of d in a final position (frame 2664), but in that case it has the diacritic ). It would seem to confirm that d is a consonant and that the ) diacritic is a vowel.
- The high frequency of J in the final position is due to the word 42bJ (“water”), which is repeated many times.
- We can make some new groups: first, the letters that can take a diacritic when in the coda but that usually do not: { 2 J N G 3 j }; second, the letters that may or may not take a diacritic in the coda: { L b X U 4 }; third, the letters that don’t seem to take diacritics when in the coda: { g 9 S 6 7 A M }; fourth, the letters that apparently must have a diacritic to appear in the coda (or that, perhaps, are the nucleus of the syllable, with the diacritic serving as the coda): { q c d }.
- The letter q, with a diacritic, seems strongly fixed in the final position: the only word where it is not at the very end is q9, in frame 2728.
- We are by now pretty certain that 6 is only found at the final position.
- Among the most common letters, 2 is very common in the final position, A is somewhat common and 4 and 3 are not very common. This might confirm that 2 is a vowel, the most common vowel in the language, and that 4 and 3 are consonants, in a language that might favor a standard CV syllable structure. It is impossible not to be tempted to apply the letter frequency from English (etaoin shrdlu, anyone?) and guess that 2 is /e/, 4 is /t/ and 3 is /s/, but it is just a wild guess (not to mention the fact that I am working under the assumption that the Beanish script is phonological, or at least more like Spanish and Italian than English or French. Does anyone have a frequency list of phonemes, i.e., not letters, in these languages? Might be time to scrape Wiktionary…)
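Part of the grouping in the comments above can be read straight off the two count columns. The sketch below copies the (Count, Pure Count) pairs from the table and splits the letters by how often the final letter carries a diacritic. It only looks at word-final data, so it cannot tell the first group from the third (that split depends on whether a letter takes diacritics elsewhere); the variable names are mine.

```python
# (Count, Pure Count) for each word-final letter, copied from the table above.
final = {
    "2": (11, 10), "L": (7, 2), "J": (7, 6), "N": (6, 5), "b": (6, 4),
    "X": (5, 2), "g": (4, 4), "9": (4, 4), "S": (3, 3), "q": (3, 0),
    "U": (3, 2), "6": (3, 3), "7": (3, 3), "A": (3, 3), "4": (2, 1),
    "M": (2, 2), "c": (2, 0), "d": (1, 0), "G": (1, 1), "3": (1, 1),
    "j": (1, 1),
}

# Final only with a diacritic attached: the { q c d } group.
always_marked = {l for l, (c, p) in final.items() if p == 0}
# Final both with and without a diacritic.
mixed = {l for l, (c, p) in final.items() if 0 < p < c}
# Final only without a diacritic.
never_marked = {l for l, (c, p) in final.items() if p == c}

print(sorted(always_marked))  # ['c', 'd', 'q']
print(sorted(mixed))
print(sorted(never_marked))
```

With so few words, a letter with a Count of 1 or 2 could easily move between these sets once more frames are transcribed.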